CN112988064B - Concurrent multitask-oriented disk graph processing method

Concurrent multitask-oriented disk graph processing method

Info

Publication number
CN112988064B
CN112988064B (application CN202110175548.XA)
Authority
CN
China
Prior art keywords
vertex
graph
edge data
edge
data block
Prior art date
Legal status
Active
Application number
CN202110175548.XA
Other languages
Chinese (zh)
Other versions
CN112988064A (en)
Inventor
王芳
冯丹
徐湘灏
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN202110175548.XA
Publication of CN112988064A
Application granted
Publication of CN112988064B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/901: Indexing; Data structures therefor; Storage structures
    • G06F16/9024: Graphs; Linked lists
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0602: Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0625: Power saving in storage systems
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806: Task transfer initiation or dispatching
    • G06F9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system


Abstract

The invention provides a concurrent multitask-oriented disk graph processing method, which belongs to the technical field of computer big data processing and comprises the following steps: storing the edge data blocks and the vertex value set converted from the input graph data on a disk; when a plurality of graph tasks are executed, loading the vertex value set into memory and loading the edge data blocks into memory one at a time in a swap-in, swap-out manner; updating destination vertex values with the task update functions based on the edge data blocks and the vertex value set accessed concurrently by the plurality of graph tasks; when the destination vertex values of all accessed edge data blocks have been updated and the convergence condition is met, outputting the final vertex values; otherwise, continuing to cyclically load edge data blocks into memory and update destination vertex values. The invention can reduce disk I/O access overhead.

Description

Concurrent multitask-oriented disk graph processing method
Technical Field
The invention belongs to the technical field of computer big data processing, and particularly relates to a concurrent multitask-oriented disk graph processing method.
Background
With the increasing demand for graph computation in the real world, a graph computing system is required in many scenarios to execute a plurality of graph computation tasks concurrently. However, existing graph computing systems oriented toward concurrent multitasking typically rely on large-scale distributed systems or single-machine shared-memory systems. These systems face high hardware cost and communication overhead, or scale poorly when processing concurrent graph tasks on large-scale graph data. These problems are further exacerbated by the large volume of intermediate results that concurrent graph tasks produce during execution. In this context, cost-effective and scalable out-of-core (disk-based) graph computation is a potentially feasible option.
However, existing out-of-core graph computing systems face the following challenges when handling concurrent graph tasks. First, because their I/O access characteristics differ, concurrent graph tasks access the graph data on disk along different traversal paths during execution. These accesses tend to produce many random and redundant data reads, which greatly degrade system performance. Second, concurrent graph tasks issue I/O requests to the operating system at the same time, which intensifies competition for the already limited disk bandwidth and causes severe I/O conflicts that hurt system throughput.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a concurrent multitask-oriented disk graph processing method, so as to solve the problem of high I/O (input/output) overhead when existing concurrent graph tasks are executed.
In order to achieve the above object, the present invention provides a concurrent multitask-oriented disk graph processing method, which includes the following steps:
storing the edge data blocks and the vertex value set converted from the input graph data on a disk;
when a plurality of graph tasks are executed, loading the vertex value set into memory and loading the edge data blocks into memory one at a time in a swap-in, swap-out manner;
updating destination vertex values with the task update functions based on the edge data blocks and the vertex value set accessed concurrently by the plurality of graph tasks;
when the destination vertex values of all accessed edge data blocks have been updated and the convergence condition is met, outputting the final vertex values; otherwise, continuing to cyclically load edge data blocks into memory and update destination vertex values;
wherein the graph data sub-blocks comprise the edge data blocks; an edge data block is used for storing the outgoing edge data of its vertices.
Preferably, the manner in which the plurality of graph tasks concurrently access the edge data blocks is as follows:
during access, the graph tasks skip edge data blocks in the inactive state by means of selective data access and only access edge data blocks containing active edges; an edge data block in the inactive state is one that contains no active edge data.
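For illustration, a minimal C++ sketch of this selective data access is given below; the activeness bitmaps, the block range type and the function name are hypothetical helpers assumed for the example and are not taken from the patent. An edge data block has to be loaded only if at least one concurrent graph task still has an active source vertex inside the block's vertex subinterval; otherwise every task skips it and the disk read is saved.

    #include <cstdint>
    #include <vector>

    // Hypothetical per-task activeness bitmaps: active_vertex[t][v] is true when
    // task t updated vertex v in the previous iteration.
    using ActiveBitmaps = std::vector<std::vector<bool>>;

    // An edge data block covers the vertex subinterval [first_vertex, last_vertex].
    struct BlockRange {
        uint32_t first_vertex;
        uint32_t last_vertex;  // inclusive
    };

    // A block is in the active state (and must be loaded) if any concurrent task
    // still has an active source vertex inside the block's subinterval.
    bool block_is_active(const BlockRange& blk, const ActiveBitmaps& active_vertex) {
        for (const auto& task_bitmap : active_vertex)
            for (uint32_t v = blk.first_vertex; v <= blk.last_vertex; ++v)
                if (task_bitmap[v]) return true;
        return false;
    }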
Preferably, the specific steps include:
(1) Converting the input graph data into P edge data blocks and a vertex value set; wherein each vertex in the input graph data is assigned a vertex value;
(2) Loading the vertex value set into memory;
(3) Loading the kth edge data block into memory; the initial value of k is 1;
(4) When the kth edge data block is in the active state, updating destination vertex values with the task update functions based on the kth edge data block and the vertex value set accessed concurrently by the plurality of graph tasks; when the kth edge data block is in the inactive state, going to step (5);
(5) Returning the kth edge data block to the disk;
(6) Judging whether k = P; if so, going to step (7), otherwise letting k = k + 1 and returning to step (3);
(7) Judging whether the convergence condition is met; if so, outputting the final vertex values; otherwise, letting k = 1 and returning to step (3).
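The iteration described in steps (1) to (7) can be summarised by the following C++ sketch; the two callable parameters stand in for operations the steps describe but do not name, so they are assumptions made for illustration rather than part of the patented implementation.

    #include <cstdint>
    #include <functional>

    // Minimal sketch of the outer loop of steps (3)-(7).
    void run_iterations(uint32_t P,
                        // steps (3)-(5): load block k, let every graph task whose
                        // edges in the block are active process it, return it to disk
                        const std::function<void(uint32_t)>& load_and_process_block,
                        // step (7): true once the vertex values of every subinterval
                        // have stopped changing
                        const std::function<bool()>& converged)
    {
        bool done = false;
        while (!done) {                       // one pass over all P blocks = one iteration
            for (uint32_t k = 0; k < P; ++k)  // step (6): k runs over all edge data blocks
                load_and_process_block(k);
            done = converged();               // step (7)
        }
    }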
Preferably, the graph data sub-blocks further comprise index structures; each index structure corresponds to one edge data block and records the offset, within that edge data block, of the first outgoing edge of each vertex belonging to the block; when a plurality of graph tasks are executed, the graph data sub-blocks, i.e. the index structures and their corresponding edge data blocks, are loaded into memory.
Preferably, the method for loading the currently processed edge data block into memory includes:
calculating, respectively, the disk read-write overhead of sequentially loading all edge data and the disk read-write overhead of randomly loading only the active edges;
wherein a vertex is defined as an active vertex if and only if its vertex value was updated in the previous iteration; an edge is defined as an active edge if and only if its source vertex is an active vertex; the read-write overhead is calculated by dividing the total amount of graph data to be read and written by the access bandwidth of the disk;
judging whether the disk read-write overhead of sequentially loading all edge data is less than the disk read-write overhead of randomly loading the active edges; if so, choosing to load all edge data sequentially, otherwise choosing to load the active edge data randomly.
Preferably, the specific steps of converting the input graph data into edge data blocks and vertex values are as follows:
allocating a vertex value to each vertex in the input graph data, and storing the vertex value set on a disk;
dividing the vertices into P disjoint subintervals, and setting each subinterval to correspond to one edge data block; the value of P is chosen so that the size of each edge data block is smaller than the memory capacity;
and storing the edge data blocks and the vertex value set on the disk.
Preferably, the method for updating the destination vertices is as follows: reading source vertices with a push model, and updating destination vertex values with atomic operations according to the update functions of the plurality of graph tasks.
Preferably, the convergence condition is that the vertex values in each subinterval no longer change.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
The method converts the input graph data into graph data sub-blocks and a vertex value set, so that all tasks access one unified copy of the graph data and the vertex value set; the index structure inside each graph data sub-block supports fast access by multiple tasks, and the graph data is stored on disk and loaded into memory only when needed, which reduces storage overhead.
The method for loading edge data blocks into memory provided by the invention computes the disk read-write overhead of sequentially loading all current edge data and the disk read-write overhead of randomly loading only the active edges, and then decides which loading mode to use, thereby reducing the disk read-write overhead.
In the invention, the graph tasks skip edge data blocks in the inactive state by means of selective data access and only access edge data blocks containing active edges, which avoids loading useless disk data and wasting disk reads and writes.
The invention updates destination vertices based on shared graph data, which relieves the redundant access and storage overhead of the processing and avoids competition for disk bandwidth.
Drawings
Fig. 1 is a schematic diagram of a concurrent multitask-oriented disk graph processing method according to an embodiment of the present invention;
fig. 2 (a) is a schematic diagram of a directed graph G provided by the embodiment of the present invention;
fig. 2 (b) is a schematic process diagram of organizing a directed graph G into a CSR structure according to an embodiment of the present invention;
fig. 3 is a schematic diagram of concurrent graph tasks processing the vertices and edges of subinterval 1 in the directed graph G according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
As shown in fig. 1, the present invention provides a concurrent multitask-oriented disk graph processing method, including the following steps:
(1) Converting input graph data into P graph data sub-blocks and vertex value sets;
wherein the graph data sub-blocks comprise CSR (Compressed Sparse Row) based edge data blocks and an index structure; each vertex in the input graph data is assigned a vertex value; the edge data block is used for storing the outgoing edge data of its corresponding vertices; the index structure is used for recording the offset of the first outgoing edge of each vertex within the edge data block;
(2) Loading the vertex value set into a memory;
(3) By comparing the disk read-write overhead of sequentially loading all edge data of the kth edge data block with that of randomly loading only its active edge data, selecting whether to load all edge data sequentially or to load the active edge data randomly into memory according to the index structure; the initial value of k is 1;
(4) When the kth edge data block is in the active state, updating destination vertex values with the task update functions based on the kth edge data block and the vertex value set accessed concurrently by the plurality of graph tasks; when the kth edge data block is in the inactive state, going to step (5);
(5) Returning the kth graph data sub-block to the disk;
(6) Judging whether k = P, if so, turning to the step (7), otherwise, enabling k = k +1, and returning to the step (3);
(7) Judging whether the convergence condition is met, and if so, outputting a final vertex value; otherwise, let k =1, return to step (3).
Preferably, step (1) specifically comprises the following steps:
allocating a vertex value to each vertex in the input graph data, and storing the vertex value set on a disk;
dividing the vertices into P disjoint subintervals, and setting each subinterval to correspond to one edge data block on the disk; the value of P is chosen so that the size of each edge data block is smaller than the memory capacity; each edge data block is used for storing the outgoing edge data of its corresponding vertices;
each edge data block is provided with a corresponding index structure; the index structure records the offset, within the edge data block, of the first outgoing edge of each vertex belonging to that block;
an edge data block and its corresponding index structure together form a graph data sub-block, and the graph data sub-block is stored on the disk;
the graph data sub-blocks are stored on disk, and each graph data sub-block is loaded into memory in turn during the computation.
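For illustration, a minimal C++ sketch of the in-memory layout of one graph data sub-block and of the per-task vertex values is given below; the struct and field names, and the choice of value type, are assumptions made for the example rather than definitions taken from the patent.

    #include <cstdint>
    #include <vector>

    // One graph data sub-block: a CSR edge data block plus its index structure.
    struct GraphSubBlock {
        uint32_t first_vertex;   // first vertex ID of the subinterval
        uint32_t last_vertex;    // last vertex ID of the subinterval (inclusive)

        // Index structure: index[i] is the offset of the first outgoing edge of
        // vertex (first_vertex + i) inside `edges`; one extra entry marks the end,
        // so the outgoing edges of that vertex occupy edges[index[i] .. index[i+1]).
        std::vector<uint64_t> index;

        // Edge data block (CSR): destination vertex IDs of all outgoing edges of
        // the subinterval's vertices, stored contiguously.
        std::vector<uint32_t> edges;
    };

    // Vertex values are kept separately and stay in memory for the whole run:
    // one value array per concurrent graph task, covering every vertex.
    struct VertexValues {
        std::vector<std::vector<double>> per_task;  // per_task[t][v] = value of v for task t
    };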
Fig. 2 (a) is a schematic diagram of a directed graph G provided by the embodiment of the present invention, and fig. 2 (b) is a schematic diagram of the structure obtained by organizing the directed graph G into CSR-based edge data blocks; as shown in fig. 2 (b), the specific process is as follows:
(1.1) partitioning the vertices in the directed graph G into two disjoint subintervals: subinterval 1 (comprising vertices 1, 2, 3) and subinterval 2 (comprising vertices 4, 5, 6);
(1.2) creating an edge block structure (edge block) on the disk for each subinterval to store the outgoing edge data of the vertices of that subinterval; the graph of fig. 2 (a) is thus divided into edge data block 1 and edge data block 2;
(1.3) creating an index structure for each edge data block, the index structure storing the offset of the first outgoing edge of each vertex within the edge block; an edge data block and its index structure together form a graph data sub-block;
(1.4) storing the 2 graph data sub-blocks on a magnetic disk.
Preferably, in step (3), whether all edge data of the kth edge data block are loaded into memory sequentially or only the active edge data are loaded randomly according to the index structure is decided by comparing the disk read-write overheads; the decision specifically comprises the following steps:
(3.1) calculating, respectively, the current disk read-write overhead of sequentially loading all edge data and the current disk read-write overhead of randomly loading only the active edges;
a vertex is defined as an active vertex if and only if its vertex value was updated in the previous iteration; an edge is defined as an active edge if and only if its source vertex is an active vertex;
the read-write overhead is calculated by dividing the total amount of graph data to be read and written by the access bandwidth of the disk;
(3.2) judging whether the disk read-write overhead of sequentially loading all edge data is smaller than the disk read-write overhead of randomly loading the active edges; if so, choosing to load all edge data sequentially, otherwise choosing to load the active edge data randomly;
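The cost comparison of steps (3.1) and (3.2) can be sketched in C++ as follows; the two bandwidth parameters and the byte counts are assumptions made for the example, since the patent only states that the overhead is the total amount of data divided by the disk access bandwidth.

    #include <cstdint>

    // Estimated disk read-write overheads (in seconds) of the two loading modes.
    struct LoadCost {
        double sequential_cost;  // stream the whole edge data block
        double random_cost;      // fetch only the active edges
    };

    LoadCost estimate_costs(uint64_t all_edge_bytes,     // size of the full edge data block
                            uint64_t active_edge_bytes,  // size of the active edge data only
                            double seq_bandwidth,        // sequential disk bandwidth (bytes/s)
                            double rand_bandwidth)       // effective random-read bandwidth (bytes/s)
    {
        return { static_cast<double>(all_edge_bytes)    / seq_bandwidth,
                 static_cast<double>(active_edge_bytes) / rand_bandwidth };
    }

    // Step (3.2): load sequentially when streaming everything is the cheaper option.
    bool use_sequential_load(const LoadCost& c) {
        return c.sequential_cost < c.random_cost;
    }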
In step (4), the edge data block is loaded according to the selected disk I/O access mode so that the multiple concurrent graph tasks can access the graph data sub-block and the source vertices and update the destination vertex values; this specifically comprises the following steps:
(4.1) Edge data block access: in each iteration, the edge data blocks are loaded into memory one after another to enable shared access by the concurrent graph processing (CGP) tasks; in addition, the vertex values of each CGP task have already been loaded in step (2); during access, a given edge data block may no longer contain any active edge data, i.e., no CGP task still needs to access the edge data of that block; in this case, selective data access skips the inactive edge data blocks, and only edge data blocks containing active edges are accessed;
(4.2) Parallel processing of data blocks: after an edge data block is loaded into memory, the related CGP tasks (those for which an active edge exists in the edge block) start to access the edge data in the block concurrently and perform the update of the destination vertex values; after the edge block has been processed by all related CGP tasks, the next edge block is loaded into memory;
(4.3) Update propagation: when processing the edge data in each edge data block, a push model is adopted to read the source vertex data and update the destination vertices; the vertex update is carried out according to the specific update function of each CGP task; meanwhile, atomic operations are used when updating destination vertices to guarantee the consistency of the computation results;
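A minimal C++ sketch of such an atomic push-model update is given below; the minimum-taking rule is only an illustrative update function (it matches SSSP- or CC-style tasks), and the function name is a hypothetical helper, not the patent's API.

    #include <atomic>
    #include <cstdint>

    // Push-model update of one destination vertex value: the candidate value computed
    // from the source vertex is installed with a compare-and-swap loop, so concurrent
    // threads and CGP tasks obtain a consistent result without locks.
    void push_update_min(std::atomic<uint32_t>& dst_value, uint32_t candidate) {
        uint32_t current = dst_value.load(std::memory_order_relaxed);
        // Retry until the stored value is already no larger than the candidate,
        // or the compare-and-swap succeeds; on failure `current` is reloaded.
        while (candidate < current &&
               !dst_value.compare_exchange_weak(current, candidate,
                                                std::memory_order_relaxed)) {
        }
    }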
Fig. 3 is a schematic diagram of concurrent graph tasks processing the vertices and edges of subinterval 1 in the directed graph G according to an embodiment of the present invention; the system needs to process three CGP tasks: a PageRank task, a Connected Components (CC) task and a Single Source Shortest Path (SSSP) task; the system decouples the graph structure from the application-specific vertex attribute values so that the multiple CGP tasks can share one copy of the graph data; meanwhile, each CGP task maintains its own application-specific vertex values, which are continuously updated during the computation until the corresponding CGP task reaches the convergence state;
as shown in fig. 3, after edge data block 1 is loaded into memory, the CGP tasks concurrently access and process it as a shared subgraph; subsequently, each CGP task updates its application-specific vertex values according to the push update model, i.e., data are read from the source vertex of each edge and the corresponding update function is called to update the destination vertex value; after all edge data blocks have been processed by the CGP tasks, the system starts the next iteration, until all CGP tasks reach the convergence state;
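To make the sharing concrete, the C++ sketch below drives the three tasks of this embodiment over one loaded edge data block with task-specific value arrays; the edge layout and the update rules are textbook forms assumed for illustration (unit edge weights for SSSP, label propagation for CC), not the patent's exact formulas, and the atomic updates shown earlier are omitted for brevity.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Edge { uint32_t src, dst; };

    // Task-specific vertex values maintained by each CGP task.
    struct TaskValues {
        std::vector<double>   rank_sum;  // PageRank: contributions accumulated this iteration
        std::vector<uint32_t> label;     // Connected Components: current component label
        std::vector<uint32_t> dist;      // SSSP: current distance (unit edge weights assumed)
    };

    // One shared scan over the loaded edge data block drives all three tasks; in the
    // real system each task would only process edges whose source vertex is active for it.
    void process_block(const std::vector<Edge>& block, TaskValues& v,
                       const std::vector<double>& prev_rank,
                       const std::vector<uint32_t>& out_degree) {
        for (const Edge& e : block) {
            // PageRank: push the source vertex's contribution to the destination.
            if (out_degree[e.src] > 0)
                v.rank_sum[e.dst] += prev_rank[e.src] / out_degree[e.src];
            // Connected Components: propagate the smaller label along the edge.
            v.label[e.dst] = std::min(v.label[e.dst], v.label[e.src]);
            // SSSP: relax the edge if the source vertex has been reached.
            if (v.dist[e.src] != UINT32_MAX)
                v.dist[e.dst] = std::min(v.dist[e.dst], v.dist[e.src] + 1);
        }
    }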
In step (7), it is judged whether the convergence condition is reached; if so, the final vertex values are output; otherwise, k is set to 1 and the process returns to step (3); the convergence condition is preset by the user; in this embodiment, the convergence condition is reached when the vertex values of the vertices {1,2,3} and {4,5,6} of the two subintervals no longer change;
when the system reaches the convergence condition, the iterative processing ends and the vertex values of the graph data are output.
Compared with the prior art, the invention has the following advantages:
The method converts the input graph data into graph data sub-blocks and a vertex value set, so that all tasks access one unified copy of the graph data and the vertex value set; the index structure inside each graph data sub-block supports fast access by multiple tasks, and the graph data are stored on disk and loaded into memory only when needed, which reduces the disk I/O access overhead.
The method for loading edge data blocks into memory computes the disk read-write overhead of sequentially loading all current edge data and the disk read-write overhead of randomly loading only the active edges, and then decides which loading mode to use, thereby reducing the disk read-write overhead.
In the invention, the graph tasks skip edge data blocks in the inactive state by means of selective data access and only access edge data blocks containing active edges, which avoids loading useless disk data and wasting disk reads and writes.
The invention updates destination vertices based on shared graph data, which relieves the redundant access and storage overhead of the processing and avoids competition for disk bandwidth.
It will be understood by those skilled in the art that the foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions and improvements made within the spirit and scope of the invention shall fall within the protection scope of the appended claims.

Claims (2)

1. A concurrent multitask-oriented disk graph processing method, characterized by comprising the following steps:
storing the edge data blocks and the vertex value set converted from the input graph data on a disk;
when a plurality of graph tasks are executed, loading the vertex value set into memory and loading the edge data blocks into memory one at a time in a swap-in, swap-out manner;
updating destination vertex values with the task update functions based on the edge data blocks and the vertex value set accessed concurrently by the plurality of graph tasks;
when the destination vertex values of all accessed edge data blocks have been updated and the convergence condition is met, outputting the final vertex values; otherwise, continuing to cyclically load edge data blocks into memory and update destination vertex values;
wherein graph data sub-blocks comprise the edge data blocks; an edge data block is used for storing the outgoing edge data of its vertices;
the specific steps of converting the input graph data into edge data blocks and vertex values are as follows:
allocating a plurality of vertex values to each vertex in the input graph data, and storing the vertex value set on a disk; the number of vertex values allocated to one vertex equals the number of graph tasks;
dividing the vertices into P disjoint subintervals, and setting each subinterval to correspond to one edge data block; the value of P is chosen so that the size of each edge data block is smaller than the memory capacity;
storing the edge data blocks and the vertex value set on the disk;
wherein the graph data sub-blocks further comprise index structures; each index structure records the offset, within the corresponding edge data block, of the first outgoing edge of each vertex belonging to that edge data block; when the plurality of graph tasks are executed, the graph data sub-blocks, i.e. the index structures and their corresponding edge data blocks, are loaded into memory;
the manner in which the plurality of graph tasks concurrently access the edge data blocks is as follows:
during access, the plurality of graph tasks skip edge data blocks in the inactive state by means of selective data access and only access edge data blocks containing active edges; wherein an edge data block in the inactive state is one that contains no active edge data;
the specific execution steps of the disk graph processing method comprise:
(1) Converting the input graph data into P edge data blocks and a vertex value set; each vertex in the input graph data is assigned a plurality of vertex values, and the number of vertex values assigned to one vertex equals the number of graph tasks;
(2) Loading the vertex value set into memory;
(3) Loading the kth edge data block into memory; the initial value of k is 1;
(4) When the kth edge data block is in the active state, updating the destination vertex values of each graph task with the task update functions based on the kth edge data block and the vertex value set accessed concurrently by the plurality of graph tasks; when the kth edge data block is in the inactive state, going to step (5);
(5) Returning the kth edge data block to the disk;
(6) Judging whether k = P; if so, going to step (7), otherwise letting k = k + 1 and returning to step (3);
(7) Judging whether the convergence condition is met; if so, outputting the final vertex values; otherwise, letting k = 1 and returning to step (3);
the method for loading an edge data block into memory comprises the following steps:
calculating, respectively, the disk read-write overhead of sequentially loading all edge data and the disk read-write overhead of randomly loading only the active edges;
wherein a vertex is defined as an active vertex if and only if its vertex value was updated in the previous iteration; an edge is defined as an active edge if and only if its source vertex is an active vertex; the read-write overhead is calculated by dividing the total amount of graph data to be read and written by the access bandwidth of the disk;
judging whether the disk read-write overhead of sequentially loading all edge data is less than the disk read-write overhead of randomly loading the active edges; if so, choosing to load all edge data sequentially, otherwise choosing to load the active edge data randomly;
the method for updating the destination vertices comprises the following steps:
reading source vertices with a push model based on the edge data block and the vertex value set;
and inputting the source vertices into the update functions of the plurality of graph tasks, and updating the destination vertex values with atomic operations.
2. The disk graph processing method according to claim 1, wherein the convergence condition is that the vertex values in each subinterval no longer change.
CN202110175548.XA 2021-02-09 2021-02-09 Concurrent multitask-oriented disk graph processing method Active CN112988064B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110175548.XA CN112988064B (en) 2021-02-09 2021-02-09 Concurrent multitask-oriented disk graph processing method

Publications (2)

Publication Number Publication Date
CN112988064A CN112988064A (en) 2021-06-18
CN112988064B true CN112988064B (en) 2022-11-08

Family

ID=76392475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110175548.XA Active CN112988064B (en) 2021-02-09 2021-02-09 Concurrent multitask-oriented disk graph processing method

Country Status (1)

Country Link
CN (1) CN112988064B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116414733B (en) * 2023-03-03 2024-02-20 港珠澳大桥管理局 Data processing method, device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104952032A (en) * 2015-06-19 2015-09-30 清华大学 Graph processing method and device as well as rasterization representation and storage method
CN106095552A (en) * 2016-06-07 2016-11-09 华中科技大学 A kind of Multi-Task Graph processing method based on I/O duplicate removal and system
CN106777351A (en) * 2017-01-17 2017-05-31 中国人民解放军国防科学技术大学 Computing system and its method are stored based on ART tree distributed systems figure
CN109240600A (en) * 2018-07-24 2019-01-18 华中科技大学 A kind of disk figure processing method based on mixing more new strategy
CN109254725A (en) * 2018-07-26 2019-01-22 华中科技大学 A kind of disk figure processing method and system based on subgraph building
CN109522428A (en) * 2018-09-17 2019-03-26 华中科技大学 A kind of external memory access method of the figure computing system based on index positioning
CN110737804A (en) * 2019-09-20 2020-01-31 华中科技大学 graph processing memory access optimization method and system based on activity level layout

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10204174B2 (en) * 2015-12-15 2019-02-12 Oracle International Corporation Efficient method for subgraph pattern matching
CN107122244B (en) * 2017-04-25 2020-02-14 华中科技大学 Multi-GPU-based graph data processing system and method

Also Published As

Publication number Publication date
CN112988064A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
JP5575997B1 (en) Semiconductor device and entry address writing / reading method for semiconductor device
CN109522428B (en) External memory access method of graph computing system based on index positioning
US20170300592A1 (en) Bucketized Hash Tables with Remap Entries
CN105117417A (en) Read-optimized memory database Trie tree index method
CN110516810B (en) Quantum program processing method and device, storage medium and electronic device
CN102662869B (en) Memory pool access method in virtual machine and device and finger
CN111832065A (en) Software implemented using circuitry and method for key-value storage
CN111126625A (en) Extensible learning index method and system
CN112988064B (en) Concurrent multitask-oriented disk graph processing method
CN110688055B (en) Data access method and system in large graph calculation
CN109189994B (en) CAM structure storage system for graph computation application
CN114444274A (en) Method, medium and device for reconstructing original structure grid from non-structure grid
CN116431080B (en) Data disc-dropping method, system, equipment and computer readable storage medium
CN104794102A (en) Embedded system on chip for accelerating Cholesky decomposition
CN109254725B (en) Disk graph processing method and system based on subgraph construction
CN108021678B (en) Key value pair storage structure with compact structure and quick key value pair searching method
CN109240600B (en) Disk map processing method based on mixed updating strategy
CN112035380B (en) Data processing method, device and equipment and readable storage medium
CN110377601B (en) B-tree data structure-based MapReduce calculation process optimization method
KR102354343B1 (en) Spatial indexing method and apparatus for blockchain-based geospatial data
CN112068948B (en) Data hashing method, readable storage medium and electronic device
CN114237903A (en) Memory allocation optimization method, memory allocation optimization device, electronic equipment, memory allocation optimization medium and program product
CN113065035A (en) Single-machine out-of-core attribute graph calculation method
JP2023503034A (en) Pattern-based cache block compression
US20240176984A1 (en) Data processing device and method, and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant