CN110677461A

CN110677461A - Graph calculation method based on key value pair storage

Info

Publication number: CN110677461A
Application number: CN201910842562.3A
Authority: CN
Inventors: 陈榕; 柯学翰; 陈海波; 臧斌宇
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2019-09-06
Filing date: 2019-09-06
Publication date: 2020-01-10

Abstract

The invention provides a graph computation method based on key-value pair storage, comprising the following steps. The server loads an original graph data set and stores it in memory as key-value pairs. For graph computation tasks, a traversal index is added to the key-value store. The server receives a graph computation request sent by a client, parses it, and sends it to the graph computation engine for execution. The graph computation engine accesses the graph data by traversing the index, updates the key vertices (the vertices that belong to keys in the local key-value store), and sends the updated key vertices to the remote servers; it then receives the update data sent by the other servers and updates its local data. These steps are repeated until all computation is complete, and the result is returned to the client. The invention uses the traversal index to speed up traversal of the graph data and exploits the distribution characteristics of the key-value pairs for data transmission and updating, which reduces communication overhead and enables efficient graph computation on key-value pair storage.

Description

Graph calculation method based on key value pair storage

Technical Field

The invention relates to a communication method for distributed computing, in particular to a communication method for processing a query task by a distributed system.

Background

With the development of the internet and the arrival of the big data era, research on graph data is receiving increasing attention. Many kinds of relational data can be abstracted as graph-structured data, including social networks, knowledge graphs, and the link relations between web pages. Current research on graph data focuses mainly on two kinds of systems: graph computing systems and graph query systems.

Graph computing systems usually run complex algorithms over the whole graph, such as PageRank and SSSP (single-source shortest path); they must traverse all vertices of the graph and return results after multiple iterations. For better performance, graph computing systems therefore use traversal-friendly compressed sparse matrix structures (compressed sparse row, CSR, and compressed sparse column, CSC) in their underlying storage, and many optimizations have been proposed for graph partitioning and data transmission. A graph query system, by contrast, generally needs to access only a small number of specific vertices, but has strict latency requirements for query tasks; its underlying layer therefore generally stores graph data as key-value pairs so that a specific key can be accessed quickly through a hash table.

Current graph query systems and graph computing systems are studied independently, yet in practical applications the two needs usually coexist. For a given data set, for example graph data describing friend relationships in a social network, users typically need both to query friends on it (graph query) and to run friend recommendation through graph data analysis (graph computation). Existing solutions must load the same data into two different systems, which wastes memory and raises data consistency problems.

Key-value storage has a simple structure, scales well, and supports convenient lookup, and it is widely used in graph query systems. If graph computation can be supported efficiently on key-value pair storage, both graph computation and graph query operations can conveniently be supported on the same system. Implementing efficient graph computation directly on key-value pair storage faces several challenges: first, when the graph computation engine traverses all vertices, every access must go through a hash table lookup, which incurs considerable performance overhead; second, it is unclear how the graph computation engine can propagate and update data efficiently in the key-value pair storage mode.

Therefore, designing a graph computation method based on key-value pair storage that overcomes the inefficiency of plain key-value pair storage for graph traversal and reduces the data transmission overhead of graph computation has become a difficult problem for graph computation on key-value pair storage.

Disclosure of Invention

In view of the defects in the prior art, the invention aims to provide a graph computation method based on key-value pair storage that is suitable for distributed graph systems using key-value pair storage, speeds up traversal of the graph data, reduces communication overhead, and improves the graph computation performance of the server.

The graph calculation method based on key-value pair storage provided by the invention comprises the following steps:

A storage step: the server loads an original graph data set and stores it in memory as key-value pairs.

An index construction step: for graph computation tasks, a traversal index is added to the key-value store.

A task parsing step: the server receives the graph computation request sent by the client, parses it, and sends it to the graph computation engine for execution.

A key value updating step: the graph computation engine accesses the graph data by traversing the index, updates the vertices that belong to keys in the local key-value store (key vertices), and sends the updated key vertices to the remote servers.

A local data updating step: the server receives the update data sent by the other servers and then updates its local data. At this point, one iteration of the computation is complete.

A calculation returning step: if the iterations of the graph computation task are not finished, the key value updating step and the local data updating step are repeated; once all computation is complete, the result is returned to the client.

Preferably, the storage step comprises: loading the graph data into memory and storing it as key-value pairs. The key includes information such as the ID of a vertex assigned to the machine (the assignment is usually made by hashing) and the edge type, and is stored in a hash table; the value is the set of all neighbor vertices of the vertex in the key.
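As a non-limiting illustration of this layout, the following Python sketch stores a tiny graph in a dictionary (Python's built-in hash table). The `Key` tuple, the edge type names, and the `load_graph` helper are hypothetical names chosen for the example; the description does not specify a concrete key encoding.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Key(NamedTuple):
    """Hypothetical key layout: a vertex ID plus an edge type."""
    vertex_id: int
    edge_type: str


def load_graph(edges: List[Tuple[int, str, int]]) -> Dict[Key, List[int]]:
    """Store (source, edge_type, target) triples as key -> neighbor list."""
    store: Dict[Key, List[int]] = defaultdict(list)
    for src, edge_type, dst in edges:
        store[Key(src, edge_type)].append(dst)
    return dict(store)


# Illustrative data only.
edges = [(1, "friend", 2), (1, "friend", 3), (2, "friend", 3), (3, "follows", 1)]
store = load_graph(edges)

# A graph query reaches one key directly through the hash table.
neighbors_of_1 = store[Key(1, "friend")]
```

A lookup of this kind suits graph queries that touch a few keys; the traversal index described next targets the case where every key must be visited.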

Preferably, the index construction step comprises:

An arranging and storing step: the values of the key-value pairs are arranged in ascending order of the vertex ID in the key and stored contiguously in a block of pre-allocated memory.

A traversal index establishing step: the index is stored as an array. Subscript x of the array corresponds to the x-th vertex among the keys, with the vertices arranged in ascending order. The value stored at position x of the array is the starting position of the corresponding key in the value store. Graph computation can obtain all the edge information of the graph by traversing the index.
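The two sub-steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: keys are reduced to bare vertex IDs (a single edge type), Python lists stand in for the pre-allocated memory block, and the function names `build_traversal_index` and `iter_edges` are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def build_traversal_index(
    store: Dict[int, List[int]],
) -> Tuple[List[int], List[int], List[int]]:
    """Return (sorted key vertex IDs, index array, contiguous value array)."""
    key_vertices = sorted(store)           # ascending order of key vertex ID
    index: List[int] = []
    values: List[int] = []
    for vertex in key_vertices:
        index.append(len(values))          # start of this key's neighbors
        values.extend(store[vertex])       # values are laid out back to back
    return key_vertices, index, values


def iter_edges(
    key_vertices: List[int], index: List[int], values: List[int]
) -> Iterator[Tuple[int, int]]:
    """Visit every edge by walking the index, with no hash table lookup."""
    for x, vertex in enumerate(key_vertices):
        start = index[x]
        end = index[x + 1] if x + 1 < len(index) else len(values)
        for neighbor in values[start:end]:
            yield vertex, neighbor


# Illustrative data only.
store = {7: [2, 9], 2: [7], 9: [2, 7]}
key_vertices, index, values = build_traversal_index(store)
all_edges = list(iter_edges(key_vertices, index, values))
```

The hash table is consulted only while the index is built; afterwards a full traversal reads the two arrays sequentially.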

Preferably, the task parsing step comprises: receiving a graph computation request sent by a client, wherein the request comprises the graph computation algorithm to be executed and the corresponding parameters. After parsing, the parameters are passed to the graph computation engine, which executes the corresponding algorithm.

Preferably, the key value updating step comprises:

An information access step: when the graph computation engine executes a graph computation task, it accesses, one by one, the adjacent-edge information of the key vertices (the vertices that belong to keys in the local key-value store). Key vertices are accessed by traversing the traversal index rather than by looking each one up in the hash table: sequential access to the index directly yields each key vertex and its neighbor vertices.

A state acquisition and storage step: once a key vertex and its adjacent-edge information have been obtained, the current state of each corresponding vertex can be obtained from its vertex ID. According to the (user-defined) requirements of the algorithm, the state of the key vertex is updated from the current states of its neighbor vertices, and the new state of the key vertex is saved.

An update preparation step: the states of all updated key vertices are sent to all other machines, and the remote machines update the states of these vertices in preparation for the next round of computation.

Preferably, the local data updating step comprises:

An update data receiving step: receiving the updated key vertex data sent by the other machines in the key value updating step, wherein the received data comprises the vertex ID and the new state of the vertex.

A data replacement step: when the data is updated, for local key vertices the locally stored new state directly replaces the old state; for other vertices, the state of the corresponding vertex is updated according to the vertex ID in the received data. After the update is complete, all vertices are in their new state.

Preferably, the calculation returning step comprises: after all update operations are finished, determining which iteration of the graph computation task the current execution is; if the iterations are not finished, continuing to execute the key value updating step and the local data updating step; if all iterations have been completed, aggregating the results and returning them to the user.

Compared with the prior art, the invention has the following beneficial effects:

1. The invention adds a graph traversal index to key-value pair storage, so that when traversing the graph data the graph computation engine does not need to look up the key of each vertex in the hash table to find where its neighbor vertices are stored; it obtains the storage position of the neighbor vertices directly from the traversal index. This reduces hash table lookup overhead and improves graph computation performance.

2. The invention stores the values contiguously in ascending order of key vertex ID, so that the graph computation engine has good spatial locality when accessing the vertices at the other end of each edge, which speeds up graph computation.

3. By updating key vertices locally and synchronizing the updated key vertices among machines, the invention performs the aggregation of vertex data locally. Data transmission among servers is only O(V), where V is the number of vertices in the graph, instead of the O(E) transmission volume of the prior art, where E is the number of edges in the graph. This reduces communication overhead and speeds up graph computation.

4. The invention addresses the low efficiency of graph computation on key-value pair storage in the prior art, and serves as a reference for more complex hybrid computation on key-value pair storage in the future.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a flow chart of a graph computation method based on key-value pair storage provided by the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention. All such changes and modifications fall within the scope of the present invention.

The invention provides a graph calculation method based on key-value pair storage, which comprises the following steps:

A storage step: multiple machines start up, load the graph data set, and add the data to memory as key-value pairs.

The storage step comprises: multiple machines start up and load the graph data set, and the graph data is partitioned across the machines. Each vertex is assigned to a unique machine by a hash function; information such as the vertex ID and the edge type together forms a key, and the vertex is called a key vertex on that machine. The value is the set of target vertices of all edges of that type from the vertex. The keys are stored in a hash table for later lookup and access.
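The partitioning described above can be sketched as follows. The modulo-based `owner` function is only a stand-in for the hash function, which the description does not specify, and `partition` and the edge tuples are hypothetical names and data.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, str, int]            # (source vertex, edge type, target vertex)
Store = Dict[Tuple[int, str], List[int]]


def owner(vertex_id: int, num_machines: int) -> int:
    """Assumed hash function: vertex ID modulo the number of machines."""
    return vertex_id % num_machines


def partition(edges: List[Edge], num_machines: int) -> List[Store]:
    """Give each machine the keys, and all edges, of the vertices it owns."""
    stores: List[Store] = [{} for _ in range(num_machines)]
    for src, edge_type, dst in edges:
        local = stores[owner(src, num_machines)]
        local.setdefault((src, edge_type), []).append(dst)
    return stores


# Illustrative data only.
edges = [
    (0, "friend", 1),
    (1, "friend", 2),
    (2, "friend", 0),
    (2, "friend", 3),
    (3, "friend", 0),
]
stores = partition(edges, num_machines=2)
```

Because a key is placed by its source vertex alone, every edge of a key vertex ends up on the same machine as the vertex itself.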

The index construction step comprises:

An arranging and storing step: the storage order of the original values is adjusted. The values are arranged in order of key vertex ID, and the neighbor vertices of the same key vertex are arranged in order of vertex ID; they are stored contiguously in pre-allocated memory, and the starting position of the storage area is recorded.

A traversal index establishing step: a new traversal index array is created, whose size is the number of vertices assigned to the current machine plus one. Subscript x of the array corresponds to the x-th key vertex in ascending order; the value stored at position x is the starting position of that key vertex's values in the value store, and the last element of the index array stores the end position of the value area. Specifically, when position x of the index array is accessed, the key vertex ID corresponding to that position is computed first, and the start address in the value storage area is then read from the index. Because the value area is stored contiguously, the end address of the values for position x is the start address stored at position x + 1.
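The start and end addressing described above can be illustrated with literal arrays. The data is hypothetical, and the parallel `key_vertex_ids` array is an assumed way of recovering the key vertex ID for a position, since the description does not say how that ID is computed.

```python
from typing import List, Tuple

# Hypothetical data for a machine that owns key vertices 2, 7 and 9.
key_vertex_ids: List[int] = [2, 7, 9]
values: List[int] = [7, 2, 9, 2, 7]    # neighbor lists stored back to back
index: List[int] = [0, 1, 3, 5]        # one entry per key vertex, plus one


def neighbors_at(x: int) -> Tuple[int, List[int]]:
    """Return the key vertex at index position x and its neighbor slice."""
    start = index[x]        # start of this key vertex's values
    end = index[x + 1]      # start of the next key vertex's values
    return key_vertex_ids[x], values[start:end]


vertex, neighbors = neighbors_at(1)
last_vertex, last_neighbors = neighbors_at(len(key_vertex_ids) - 1)
```

The extra final entry means the last key vertex needs no special case: its end position is read from the index like any other.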

More specifically, adding the index avoids hash table lookups during sequential traversal of the graph data: the neighbor vertices are obtained directly through the index. In the original approach, accessing a vertex requires first constructing the corresponding key, then matching it in the hash table, and only then reading the neighbor vertex information; with the index, a single access suffices, which improves computing performance. In addition, because the edge data is stored contiguously, traversal has good spatial locality.

The task parsing step comprises: receiving a graph computation request sent by a client, wherein the request comprises the graph computation algorithm to be executed and the corresponding parameters. After parsing, the parameters are passed to the graph computation engine, which executes the corresponding algorithm. The graph computation algorithm must be implemented in advance against the interface provided by the graph computation engine. This interface mainly comprises two types: computation defined on vertices and computation defined on edges.
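One possible shape for this step is sketched below. Everything concrete in it is an assumption made for illustration: the JSON request format, the `algorithm` and `params` field names, the `register` decorator, and the placeholder `run_pagerank` body are not taken from the description, which only states that a request names an algorithm and carries its parameters.

```python
import json
from typing import Callable, Dict

# Hypothetical registry of algorithms implemented in advance for the engine.
ALGORITHMS: Dict[str, Callable[..., str]] = {}


def register(name: str) -> Callable[[Callable[..., str]], Callable[..., str]]:
    """Register an algorithm under the name a client will request."""
    def decorator(func: Callable[..., str]) -> Callable[..., str]:
        ALGORITHMS[name] = func
        return func
    return decorator


@register("pagerank")
def run_pagerank(iterations: int = 10, damping: float = 0.85) -> str:
    # Placeholder body: a real engine would start the iterative computation.
    return f"pagerank iterations={iterations} damping={damping}"


def handle_request(raw: str) -> str:
    """Parse a client request and hand its parameters to the algorithm."""
    request = json.loads(raw)
    algorithm = request["algorithm"]
    params = request.get("params", {})
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm}")
    return ALGORITHMS[algorithm](**params)


result = handle_request('{"algorithm": "pagerank", "params": {"iterations": 5}}')
```

Keeping the registry separate from the parser reflects the requirement that algorithms be implemented before any request arrives.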

The key value updating step comprises:

An information access step: when the graph computation engine executes a graph computation task, it accesses, one by one, the adjacent-edge information of the key vertices (the vertices that belong to keys in the local key-value store). Key vertices are accessed by traversing the traversal index rather than by hash table lookup; sequential access to the index directly yields each key vertex and its neighbor vertices.

A state acquisition and storage step: once a key vertex and its adjacent-edge information have been obtained, the current state of each corresponding vertex is obtained from its vertex ID, and the state of the key vertex is then updated by aggregating the current states of its neighbor vertices according to the (user-defined) requirements of the algorithm. The system uses an array sized to the number of vertices in the full graph (current_state) to record the current state of all vertices, and an array sized to the number of local key vertices (update_key_state) to record the vertex states updated in this step. Because of the way key-value pairs are stored, the edge information of each key vertex is available locally, so the aggregation and update can be completed locally.

An update preparation step: the updated states of the key vertices are recorded in the update_key_state array. Once all key vertices have been updated, the update data in the array is sent in order to the other machines in a single batch, and the remote machines update the states of the corresponding vertices in preparation for the next round of computation.
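The three sub-steps above can be sketched for a single machine as follows. The array names `current_state` and `update_key_state` come from the description; the minimum-label aggregation, the sample data, and the function names are assumptions standing in for a user-defined algorithm.

```python
from typing import Callable, List, Tuple


def update_key_vertices(
    key_vertex_ids: List[int],
    index: List[int],
    values: List[int],
    current_state: List[float],
    aggregate: Callable[[float, List[float]], float],
) -> List[float]:
    """Compute the new state of every local key vertex."""
    update_key_state = [0.0] * len(key_vertex_ids)
    for x, vertex in enumerate(key_vertex_ids):
        neighbors = values[index[x]:index[x + 1]]        # read via the index
        neighbor_states = [current_state[n] for n in neighbors]
        update_key_state[x] = aggregate(current_state[vertex], neighbor_states)
    return update_key_state


def min_label(own: float, neighbor_states: List[float]) -> float:
    """Assumed user-defined rule: keep the smallest label seen."""
    return min([own] + neighbor_states)


# Hypothetical machine owning key vertices 0 and 2 of a 4-vertex graph.
key_vertex_ids = [0, 2]
index = [0, 1, 3]
values = [1, 0, 3]
current_state = [0.0, 1.0, 2.0, 3.0]    # one slot per vertex of the full graph

update_key_state = update_key_vertices(
    key_vertex_ids, index, values, current_state, min_label
)
# Update preparation: pair each key vertex ID with its new state for sending.
outgoing: List[Tuple[int, float]] = list(zip(key_vertex_ids, update_key_state))
```

Note that `current_state` is only read here; the new values are held in `update_key_state` until the local data updating step applies them.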

The local data updating step comprises:

An update data receiving step: the updated key vertex data sent by the other machines in the key value updating step is received over an ordinary TCP network or a high-speed RDMA network; the received update information comprises the vertex ID and the new state of the vertex. The vertex updates sent by the different machines are saved in preparation for the subsequent update.

A data replacement step: the data update is performed, that is, the values in the current_state array are updated. Local key vertices are updated directly from the locally stored update_key_state data; for other vertices, the value of the corresponding vertex in the current_state array is modified according to the vertex ID and new state in the received data. Since the key vertices of all machines together make up all vertices of the full graph, every entry in the current_state array is updated to its new value.
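The data replacement step can be sketched as follows. Network transport is left out: the `received` list stands in for updates that would arrive over TCP or RDMA, and the function name and sample values are hypothetical.

```python
from typing import List, Tuple


def apply_updates(
    current_state: List[float],
    key_vertex_ids: List[int],
    update_key_state: List[float],
    received: List[Tuple[int, float]],
) -> None:
    """Overwrite current_state in place with local and remote new states."""
    # Local key vertices: copy straight from the locally stored new states.
    for vertex, new_state in zip(key_vertex_ids, update_key_state):
        current_state[vertex] = new_state
    # Other vertices: apply each received (vertex ID, new state) pair.
    for vertex, new_state in received:
        current_state[vertex] = new_state


# Hypothetical machine owning key vertices 0 and 2 of a 4-vertex graph.
current_state = [0.0, 1.0, 2.0, 3.0]
apply_updates(
    current_state,
    key_vertex_ids=[0, 2],
    update_key_state=[0.5, 0.25],
    received=[(1, 0.75), (3, 1.5)],
)
```

In this example the local and received updates together cover all four vertices, so every slot of `current_state` holds a new value afterwards.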

More specifically, in the key-value pair storage mode each vertex is assigned to a unique machine as a key vertex, and that machine holds the information of all its edges. The aggregation and update of key vertices can therefore be carried out locally: each machine updates its own key vertices, all machines then exchange the new states of their key vertices, and combining them yields the updated states of all vertices. The communication overhead among machines is thus O(V), where V is the number of vertices, and the O(E) overhead that cross-machine aggregation may incur, where E is the number of edges, is avoided. This reduces communication overhead and improves graph computation performance.
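The difference between the two costs can be made concrete with a small count. This is only a sketch under stated assumptions: the number of machines is treated as a fixed constant, vertices are placed by vertex ID modulo the number of machines, and "one message per cross-machine edge" is used as a simplified stand-in for cross-machine aggregation. The function names and the sample graph are hypothetical.

```python
from typing import List, Tuple


def broadcast_volume(num_vertices: int, num_machines: int) -> int:
    """States sent per iteration when each key vertex goes to all other machines."""
    return num_vertices * (num_machines - 1)


def cross_machine_edges(edges: List[Tuple[int, int]], num_machines: int) -> int:
    """Edges whose two endpoints are owned by different machines."""
    return sum(
        1 for src, dst in edges if src % num_machines != dst % num_machines
    )


# Hypothetical example: a complete directed graph on 6 vertices, 2 machines.
num_vertices, num_machines = 6, 2
edges = [
    (s, d) for s in range(num_vertices) for d in range(num_vertices) if s != d
]

per_vertex = broadcast_volume(num_vertices, num_machines)
per_edge = cross_machine_edges(edges, num_machines)
```

In this dense example the per-vertex exchange sends 6 states where a per-edge scheme would send 18 messages; on sparse graphs or with many machines the gap narrows, so the counts should be read as an illustration rather than a general bound.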

The calculation returning step comprises: after all update operations are finished, determining which iteration of the graph computation task the current execution is; if the iterations are not finished, continuing to execute the key value updating step and the local data updating step; if all iterations have been completed, aggregating the results and returning them to the user.
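The overall iteration can be sketched as a single-process simulation of several machines. This is an illustration under stated assumptions, not the claimed implementation: machines are simulated by list entries, vertices are placed by vertex ID modulo the number of machines, the algorithm is a minimum-label propagation chosen for brevity, and a fixed iteration count stands in for the termination test.

```python
from typing import List, Tuple


def run_min_label(
    edges: List[Tuple[int, int]],
    num_vertices: int,
    num_machines: int,
    max_iterations: int,
) -> List[int]:
    # Storage and index construction, once per simulated machine.
    machines = []
    for m in range(num_machines):
        keys = [v for v in range(num_vertices) if v % num_machines == m]
        index: List[int] = []
        values: List[int] = []
        for v in keys:
            index.append(len(values))
            values.extend(sorted(dst for src, dst in edges if src == v))
        index.append(len(values))            # end position of the value area
        machines.append((keys, index, values))

    # Every machine keeps its own full-size state array.
    states = [list(range(num_vertices)) for _ in range(num_machines)]

    for _ in range(max_iterations):
        # Key value updating step: each machine updates only its key vertices.
        outgoing = []
        for m, (keys, index, values) in enumerate(machines):
            updates = []
            for x, v in enumerate(keys):
                neighbors = values[index[x]:index[x + 1]]
                labels = [states[m][v]] + [states[m][n] for n in neighbors]
                updates.append((v, min(labels)))
            outgoing.append(updates)
        # Local data updating step: apply own and received updates everywhere.
        for m in range(num_machines):
            for updates in outgoing:
                for v, new_state in updates:
                    states[m][v] = new_state

    # Calculation returning step: all copies agree, so return one of them.
    return states[0]


# Hypothetical input: an undirected path 0-1-2-3 plus an isolated vertex 4.
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
result = run_min_label(edges, num_vertices=5, num_machines=2, max_iterations=3)
```

After three iterations the smallest label has travelled along the path, and the isolated vertex keeps its own label.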

In conclusion, by using the traversal index the invention avoids the overhead of accessing graph data through the hash table and speeds up traversal of the graph data; it also exploits the distribution characteristics of the key-value pairs for data transmission and updating, which reduces communication overhead and enables efficient graph computation on key-value pair storage.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

1. A graph computation method based on key-value pair storage is characterized by comprising the following steps:

a storage step: the server loads an original graph data set and stores the original graph data set into a memory according to a key value pair mode;

an index construction step: adding a traversal index to the key-value pair storage for graph computation tasks;

a task parsing step: the server receives the graph computation request sent by the client, parses it, and sends it to the graph computation engine for execution;

a key value updating step: the graph computation engine accesses graph data by traversing the index, updates the key vertices belonging to keys in the local key-value store, and sends the updated key vertices to the remote server;

local data updating step: receiving updating data sent by other servers, and then updating local data;

2. The key-value pair storage based graph computation method of claim 1, wherein the storage step comprises: loading the graph data into memory and storing the graph data as key-value pairs; wherein the key includes the ID of a vertex assigned to the machine and the type information of the edge, and the key is stored in a hash table; the value is the set of all neighbor vertices of the vertex in the key.

3. The key-value pair storage based graph computation method of claim 1, wherein the index construction step comprises:

an arranging and storing step: arranging the values of the key-value pairs in ascending order of the vertex ID in the key, and storing the values contiguously in pre-allocated memory;

a traversal index establishing step, wherein the index is stored as an array; subscript x of the array corresponds to the x-th vertex among the keys, arranged in ascending order; the value stored at position x of the array is the starting position of the corresponding key in the value store; graph computation can obtain all the edge information of the graph by traversing the index.

4. The key-value pair storage based graph computation method of claim 1, wherein the task parsing step comprises: receiving a graph computation request sent by a client, wherein the request comprises the graph computation algorithm to be executed and the corresponding parameters; after parsing, the parameters are passed to the graph computation engine, and the graph computation engine executes the corresponding algorithm.

5. The key-value-pair-storage-based graph computation method of claim 1, wherein the key-value updating step comprises:

an information access step: when the graph computation engine executes the graph computation task, the adjacent-edge information of the key vertices in the local key-value store is accessed one by one; key vertices are accessed by traversing the traversal index; sequential access to the index directly yields each key vertex and its corresponding neighbor vertices;

a state acquisition and storage step: once a key vertex and its adjacent-edge information have been obtained, the current state of each corresponding vertex can be obtained from its vertex ID; the state of the key vertex is updated from the current states of its neighbor vertices according to the requirements of the algorithm, and the new state of the key vertex is stored;

6. The key-value pair storage based graph computation method of claim 1, wherein the local data update step comprises:

receiving the updating data: receiving updated data of the key vertex sent by other machines in the key value updating step, wherein the received data comprises a vertex ID and a new state of the vertex;

a data replacement step: when data updating is carried out, for the local key vertex, the new state stored locally is directly used for replacing the old state; for other vertexes, updating the state of the corresponding vertex according to the vertex ID of the received data; after the update is complete, all vertices are in the new state.

7. The key-value pair storage based graph computation method of claim 1, wherein the calculation returning step comprises: after all update operations are finished, determining which iteration of the graph computation task the current execution is; if the iterations are not finished, continuing to execute the key value updating step and the local data updating step; if all iterations have been completed, aggregating the results and returning them to the user.