CN109003222B - Asynchronous energy-efficient graph calculation accelerator - Google Patents

Asynchronous energy-efficient graph calculation accelerator

Info

Publication number
CN109003222B
CN109003222B (Application CN201810697919.9A)
Authority
CN
China
Prior art keywords
graph
vertex
accelerator
module
data
Prior art date
Legal status
Active
Application number
CN201810697919.9A
Other languages
Chinese (zh)
Other versions
CN109003222A (en)
Inventor
李曦
周学海
徐冲冲
王超
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201810697919.9A priority Critical patent/CN109003222B/en
Publication of CN109003222A publication Critical patent/CN109003222A/en
Application granted granted Critical
Publication of CN109003222B publication Critical patent/CN109003222B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; processor configuration, e.g. pipelining
    • G06T 1/60 Memory management

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses an asynchronous energy-efficient graph computation accelerator comprising a data preprocessing module, a data transmission module and a data processing module connected in sequence. The data preprocessing module preprocesses a given example graph G in an initialization stage, partitioning the large-scale graph data and organizing graph G into sub-graph data (Batch Row Vectors) of size batch_size. The data transmission module transmits each Batch Row Vector to the on-board DDR through PCIe DMA and then to the on-chip accelerator through AXI DMA. The data processing module is logic on the accelerator chip comprising a storage module (on-chip caches) and an on-chip computation module. By adopting an asynchronous computation mode the invention accelerates the convergence of graph algorithms, and by customizing the accelerator on a hardware platform it reduces the power consumption and energy consumption of the system.

Description

Asynchronous energy-efficient graph calculation accelerator
Technical Field
The present invention relates to graph data processing, and more particularly to an asynchronous energy-efficient graph computation accelerator.
Background
With the maturation of big data, cloud computing and the internet industry, human life has entered an era of data explosion. Big data is characterized by the "4Vs": large volume of data (Volume), wide variety of data types (Variety), fast growth (Velocity) and low value density (Value). The graph is one of the most classical and commonly used data structures, and much real-life data is naturally abstracted into graph structures of various kinds: vertices represent entities, and edges represent relationships between entities. Common graph-structured data include social relationship graphs (social networks), web-page link graphs (web graphs), traffic network graphs (transport networks), gene-analysis graphs (genome analysis graphs) and the like. Moreover, the size of graph data has grown very rapidly: in 2011, Twitter released more than 2×10^8 tweets per day, and by 2013 this had increased to 5×10^9 per day. With the increasing popularity of machine learning and data mining applications, graph data keep growing larger. On the other hand, large-scale graph data exhibit extreme irregularity, so computation on conventional MapReduce and Hadoop systems generates a large amount of data communication and thus suffers from low efficiency. How to effectively process and analyze large-scale graph data is a major research hotspot in both academia and industry, and to address these challenges, researchers at home and abroad have turned their attention to the design of graph computation accelerators.
To improve the convergence efficiency of graph algorithms and to reduce the power consumption and energy consumption of graph computation, the invention designs an asynchronous energy-efficient graph computation accelerator that accelerates algorithm convergence through an asynchronous computation mode and reduces system power and energy consumption by customizing a dedicated accelerator.
Disclosure of Invention
The invention aims to provide an asynchronous energy-efficient graph computation accelerator that improves the computational efficiency of graph data processing and reduces the power consumption and energy consumption of graph computation.
The technical scheme of the invention is as follows:
an asynchronous energy-efficient graph computation accelerator comprises a data preprocessing module, a data transmission module and a data processing module which are connected in sequence, wherein:
the data preprocessing module is used for preprocessing a given example graph G in an initialization stage, partitioning the large-scale graph data and organizing the graph G into sub-graph data Batch Row Vector of size batch_size;
and the data transmission module is used for transmitting the sub-graph data Batch Row Vector to the onboard DDR through PCIe DMA and then transmitting the sub-graph data Batch Row Vector to the on-chip accelerator through AXI DMA.
The data processing module is logic on the accelerator chip and comprises a storage module and a calculation module; the storage module comprises an on-chip cache, and the calculation module comprises an on-chip calculation module.
Preferably, the storage module includes five buffer areas: a source vertex buffer, a target vertex buffer, a source vertex state value buffer, a target vertex state value buffer, and an output result buffer; these buffers store on chip the graph data to be computed and the output results.
Preferably, the storage module stores the graph data in a representation based on row vectors, called the Batch Row Vector. The Batch Row Vector is a 2-dimensional storage structure consisting of 4 row vectors: the first row vector stores the source vertices of the edges in the sub-graph data, the second row vector stores the target vertices of those edges, and the third and fourth row vectors store the state values of the source and target vertices respectively.
Preferably, the on-chip computation module is divided into 4 submodules: StreamEdges, Asynchronous Controller, Reduce and Update; the on-chip computation module is connected with the storage module through AXI-Stream, wherein:
StreamEdges is responsible for traversing the sub-graph data stored on the chip, reading the state value of each source vertex and updating the temporary state value of the corresponding target vertex;
the Asynchronous Controller performs asynchronous control during computation and is responsible for propagating the latest updated vertex state values to subsequent computation; during propagation, the controller checks whether the source vertex of a subsequently computed edge matches the target vertex of the currently computed edge, and if so performs an asynchronous update;
reduce is responsible for collecting state value components from different source vertexes on the same target vertex, and is a plus operation for the PageRank algorithm and a min operation for the BFS algorithm;
Update executes an Update function to write the computed state value back to the corresponding vertex.
Preferably, the execution process of the accelerator comprises four stages: a preprocessing stage, a StreamEdges stage, a Reduce stage and an Update stage;
in the preprocessing stage, the accelerator initializes the distances of all vertices: the distance of vertex 1, which is taken as the root node, is set to 1, and the distances of all vertices in the graph data are stored in dist[], where each vertex has a corresponding state value; before each Batch Row Vector is processed, the host reads vertex state values from dist[] to fill the Batch Row Vector, which is then transmitted batch by batch to the on-board DDR of the FPGA and from there, again batch by batch, to the accelerator; in the StreamEdges stage, the accelerator traverses the edges in the Batch Row Vector in sequence and updates the state value of each target vertex according to the state value of its source vertex; Domino performs asynchronous control through a linear update strategy and a binary update strategy, the asynchronous control mechanism propagating the latest vertex state values to subsequent computation; the different state value components of the same target vertex are collected in the Reduce stage, the Update function is executed in the Update stage to write the computed state values back to the corresponding vertices, and the host completes state value propagation between batches.
Preferably, the linear update strategy performs a linear query on all graph data on each batch.
Preferably, the binary update strategy is implemented by using a binary search algorithm.
The invention has the advantages that:
the invention designs an asynchronous energy-efficient graph calculation accelerator, accelerates the convergence speed of a graph algorithm by adopting an asynchronous calculation mode, and reduces the power consumption and the energy consumption of a system by customizing the accelerator based on a hardware platform.
Drawings
The invention is further described with reference to the following figures and examples:
FIG. 1 is a system architecture diagram of the asynchronous energy-efficient graph computation accelerator of the present invention;
FIG. 2 is a schematic diagram of the implementation of the asynchronous energy-efficient graph computation accelerator of the present invention;
FIG. 3 is a flow chart of asynchronous update of an asynchronous energy efficient graph computation accelerator according to the present invention.
Detailed Description
As shown in FIG. 1, the asynchronous energy-efficient graph computation accelerator disclosed by the present invention is mainly used for processing large-scale graph data; it adopts hardware customization to improve system efficiency and reduce system energy consumption, and comprises a data preprocessing module, a data transmission module and a data processing module connected in sequence, wherein:
the data preprocessing module is used for preprocessing a given example graph G in an initialization stage; because of the limited resources on the hardware platform chip, it partitions the large-scale graph data and organizes the graph G into sub-graph data Batch Row Vector of size batch_size;
and the data transmission module is used for transmitting the sub-graph data Batch Row Vector to the on-board DDR through PCIe DMA and then to the on-chip accelerator through AXI DMA; the calculation results are transmitted back along the reverse path.
The data processing module is logic on the accelerator chip and comprises a storage module and a calculation module; the storage module comprises an on-chip cache, and the calculation module comprises an on-chip calculation module.
In a specific implementation, the storage module includes five buffer areas: a source vertex buffer, a target vertex buffer, a source vertex state value buffer, a target vertex state value buffer, and an output result buffer; these buffers store on chip the graph data to be computed and the output results. The invention stores graph data in a representation based on row vectors, called the Batch Row Vector, as shown in FIG. 1. The Batch Row Vector is a 2-dimensional storage structure consisting of 4 row vectors: the first row vector stores the source vertices of the edges in the sub-graph data, and the second row vector stores the target vertices of those edges; the third and fourth row vectors store the state values of the source and target vertices respectively, where a vertex's state value depends on the particular graph algorithm, e.g. the Rank value of the vertex for the PageRank algorithm, or the depth of the vertex from the root node for the BFS algorithm.
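The Batch Row Vector layout described above can be modeled in software. The following is a minimal Python sketch, not the patent's actual implementation; the function name and the use of infinity as an "unvisited" state value are illustrative assumptions.

```python
def build_batch_row_vector(edges, state, batch_size):
    """Split an edge list into Batch Row Vectors: for each batch of
    batch_size edges, build 4 row vectors holding source vertex ids,
    target vertex ids, source state values, and target state values."""
    batches = []
    for i in range(0, len(edges), batch_size):
        chunk = edges[i:i + batch_size]
        src = [u for u, v in chunk]
        dst = [v for u, v in chunk]
        batches.append([
            src,                        # row 1: source vertices
            dst,                        # row 2: target vertices
            [state[u] for u in src],    # row 3: source state values
            [state[v] for v in dst],    # row 4: target state values
        ])
    return batches

# BFS-style example: vertex 1 is the root (state 0), others unvisited.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
state = {1: 0, 2: float("inf"), 3: float("inf"), 4: float("inf")}
batches = build_batch_row_vector(edges, state, batch_size=2)
# batches[0][0] == [1, 1]  (sources of the first batch)
```

Each batch is exactly the 2-dimensional 4-row structure the text describes, ready to be streamed to the accelerator edge by edge.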
The on-chip computation module is divided into 4 submodules: StreamEdges, Asynchronous Controller, Reduce and Update; it is connected with the storage module through AXI-Stream, wherein:
StreamEdges is responsible for traversing the sub-graph data stored on the chip, reading the state value of each source vertex and updating the temporary state value of the corresponding target vertex;
the Asynchronous Controller performs asynchronous control during computation and is responsible for propagating the latest updated vertex state values to subsequent computation; during propagation, the controller checks whether the source vertex of a subsequently computed edge matches the target vertex of the currently computed edge, and if so performs an asynchronous update;
reduce is responsible for collecting state value components from different source vertexes on the same target vertex, and is a plus operation for the PageRank algorithm and a min operation for the BFS algorithm;
Update executes an Update function to write the computed state value back to the corresponding vertex.
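The Reduce submodule's behavior can be sketched in software: it combines the state-value components arriving at the same target vertex with the algorithm's reduce operator, min for BFS and add for PageRank, as stated above. This is an illustrative model only; the function name and the (vertex, value) pair format are assumptions.

```python
def reduce_components(components, op):
    """Combine partial state values per target vertex.
    components: iterable of (target_vertex, partial_state_value) pairs.
    op: binary reduce operator (e.g. min for BFS, addition for PageRank)."""
    out = {}
    for v, val in components:
        out[v] = val if v not in out else op(out[v], val)
    return out

# BFS: keep the minimum candidate depth reaching each vertex.
bfs = reduce_components([(3, 2), (3, 1), (4, 3)], min)
# bfs == {3: 1, 4: 3}

# PageRank: sum the rank contributions reaching each vertex.
pr = reduce_components([(3, 0.2), (3, 0.3)], lambda a, b: a + b)
# pr[3] == 0.5
```

Swapping only the operator is what lets one Reduce datapath serve both algorithms.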
As shown in FIG. 2, the execution process of the accelerator includes four stages: a preprocessing stage, a StreamEdges stage, a Reduce stage and an Update stage. In the preprocessing stage, the accelerator initializes the distances of all vertices: the distance of vertex 1, which is taken as the root node, is set to 1, and the distances of all vertices in the graph data are stored in dist[], where each vertex has a corresponding state value. Before each Batch Row Vector is processed, the host reads vertex state values from dist[] to fill the Batch Row Vector, which is then transmitted batch by batch to the on-board DDR of the FPGA and from there, again batch by batch, to the accelerator. In the StreamEdges stage, the accelerator traverses the edges in the Batch Row Vector in sequence and updates the state value of each target vertex according to the state value of its source vertex. Domino performs asynchronous control through a linear update strategy and a binary update strategy; the asynchronous control mechanism propagates the latest vertex state values to subsequent computation. The different state value components of the same target vertex are collected in the Reduce stage, the Update function is executed in the Update stage to write the computed state values back to the corresponding vertices, and the host completes state value propagation between batches.
The accelerator's asynchronous update flow is shown in FIG. 3, where bold numbers indicate vertex state values that have already been updated and the arc arrows indicate asynchronous updates. In the initial stage, the first edge (1,2) is accessed and processed, and the state value of target vertex 2 is updated to 1 according to state value 0 of source vertex 1. The asynchronous controller then traverses the subsequent edge set, selects the edges whose source vertex is 2, and refreshes the updated state value of vertex 2 into the source vertex state value field of those edges, e.g. the third edge (2,3); thus, when (2,3) is processed, the state value of target vertex 3 can be updated using the latest state value of vertex 2.
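The asynchronous forwarding just described can be mirrored in a small software model. The sketch below is a hedged illustration, not the hardware datapath: it assumes a BFS-style update (depth + 1 with min), uses 9 as a stand-in "unvisited" value, and all names are hypothetical.

```python
def process_batch_async(src, dst, src_state, dst_state):
    """Process one batch of edges in order. After an edge updates its
    target vertex, forward the fresh value into the source-state field
    of every later edge in the batch whose source is that vertex,
    mirroring the (1,2) -> (2,3) example in the text."""
    for i in range(len(src)):
        new_val = src_state[i] + 1          # BFS-style update: depth + 1
        if new_val < dst_state[i]:
            dst_state[i] = new_val
            # asynchronous forwarding to later edges in the same batch
            for j in range(i + 1, len(src)):
                if src[j] == dst[i] and src_state[j] > new_val:
                    src_state[j] = new_val
    return dst_state

# Edges (1,2), (1,3), (2,3); vertex 1 is the root with state 0.
src_state = [0, 0, 9]
dst_state = [9, 9, 9]
out = process_batch_async([1, 1, 2], [2, 3, 3], src_state, dst_state)
# out == [1, 1, 2]: edge (2,3) sees vertex 2's fresh value 1,
# so vertex 3 converges to depth 2 within a single pass.
```

Without the forwarding loop, edge (2,3) would have used vertex 2's stale value and vertex 3 would need another iteration to converge, which is the convergence benefit asynchronous computation provides.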
The linear update strategy performs a linear query over all the graph data in each batch.

[Figure GDA0002475366440000041: pseudocode of the linear update strategy]
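The patent gives the linear update strategy only as an image, so the following Python sketch is a hedged reconstruction from the text's one-sentence description (a linear scan over the batch's edges); the function signature is an assumption.

```python
def linear_update(src, src_state, start, vertex, new_val):
    """Linear update strategy (reconstructed): after `vertex` receives
    the fresh state value `new_val`, linearly scan the batch's source
    row from position `start` and refresh every matching source-state
    field."""
    for j in range(start, len(src)):
        if src[j] == vertex:
            src_state[j] = new_val
    return src_state

# Vertex 2 just got state value 1; refresh edges 1..3 of the batch.
state = linear_update([1, 2, 2, 3], [0, 7, 7, 7], 1, 2, 1)
# state == [0, 1, 1, 7]
```

The scan is O(batch_size) per update, which motivates the binary alternative below only in the sense that a cheaper lookup is desirable; the trade-off itself is this sketch's observation, not a claim of the patent.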
The binary update strategy is implemented with a binary search algorithm.

[Figure GDA0002475366440000042: pseudocode of the binary update strategy]
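The binary update strategy is likewise given only as an image; the sketch below reconstructs it under the assumption (not spelled out in this text) that the batch's source-vertex row is kept sorted, so the run of edges sharing a given source can be found by binary search instead of a linear scan.

```python
import bisect

def binary_update(sorted_src, src_state, vertex, new_val):
    """Binary update strategy (reconstructed): locate, via binary
    search, the contiguous run of edges whose source is `vertex` in a
    sorted source row, and refresh their source-state fields."""
    lo = bisect.bisect_left(sorted_src, vertex)
    hi = bisect.bisect_right(sorted_src, vertex)
    for j in range(lo, hi):        # refresh only the matching run
        src_state[j] = new_val
    return src_state

# Same batch as before, with the source row sorted.
state = binary_update([1, 2, 2, 3], [0, 7, 7, 7], 2, 1)
# state == [0, 1, 1, 7]
```

Locating the run costs O(log batch_size) rather than O(batch_size), at the price of keeping the source row sorted during preprocessing.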
The above embodiments merely illustrate the technical ideas and features of the present invention; their purpose is to enable those skilled in the art to understand and implement the invention, not to limit its scope of protection. All modifications made according to the spirit of the main technical scheme of the invention fall within its scope of protection.

Claims (5)

1. An asynchronous energy-efficient graph computation accelerator, characterized in that it comprises a data preprocessing module, a data transmission module and a data processing module connected in sequence, wherein:
the data preprocessing module is used for preprocessing a given example graph G in an initialization stage, dividing large-scale graph data and organizing the graph G into sub-graph data Batch Row Vector with the size of Batch _ size;
the data transmission module is used for transmitting the sub-graph data Batch Row Vector to the onboard DDR through PCIe DMA and then transmitting the sub-graph data Batch Row Vector to the on-chip accelerator through AXI DMA;
the data processing module is logic on the accelerator chip and comprises a storage module and a calculation module; the storage module comprises an on-chip cache, and the calculation module comprises an on-chip calculation module;
the on-chip computing module is divided into 4 submodules, namely Stream Edges, Asynchronous Controller, Reduce and Update; the on-chip computing module is connected with the storage module through AXI-Stream, wherein:
the Stream Edges are responsible for traversing the sub-graph data stored on the chip, reading the state value of the source vertex and updating the temporary state value of the target vertex;
the Asynchronous Controller performs asynchronous control during computation and is responsible for propagating the latest updated vertex state values to subsequent computation; during propagation, the controller checks whether the source vertex of a subsequently computed edge matches the target vertex of the currently computed edge, and if so performs an asynchronous update;
Reduce is responsible for collecting the state value components contributed by different source vertices to the same target vertex; this is an add operation for the PageRank algorithm and a min operation for the BFS algorithm;
the Update executes an Update function to write the computed state value back to the corresponding vertex;
the execution process of the accelerator comprises four stages: a preprocessing stage, a StreamEdges stage, a Reduce stage and an Update stage;
in the preprocessing stage, the accelerator initializes the distances of all vertices: the distance of vertex 1, which is taken as the root node, is set to 1, and the distances of all vertices in the graph data are stored in dist[], where each vertex has a corresponding state value; before each Batch Row Vector is processed, the host reads vertex state values from dist[] to fill the Batch Row Vector, which is then transmitted batch by batch to the on-board DDR of the FPGA and from there, again batch by batch, to the accelerator; in the Stream Edges stage, the accelerator traverses the edges in the Batch Row Vector in sequence and updates the state value of each target vertex according to the state value of its source vertex; the asynchronous energy-efficient graph computation accelerator performs asynchronous control through a linear update strategy and a binary update strategy, the asynchronous control mechanism propagating the latest vertex state values to subsequent computation; the different state value components of the same target vertex are collected in the Reduce stage, the Update function is executed in the Update stage to write the computed state values back to the corresponding vertices, and the host completes state value propagation between batches.
2. The asynchronous energy-efficient graph computation accelerator of claim 1, wherein the storage module comprises five buffers, namely a source vertex buffer, a target vertex buffer, a source vertex state value buffer, a target vertex state value buffer, and an output result buffer, for storing on the chip the graph data to be computed and the output results.
3. The asynchronous energy-efficient graph computation accelerator of claim 2, wherein the storage module stores graph data in a representation based on row vectors, referred to as a Batch Row Vector, which is a 2-dimensional storage structure consisting of 4 row vectors: the first row vector stores the source vertices of the edges in the sub-graph data, the second row vector stores the target vertices of those edges, and the third and fourth row vectors store the state values of the source and target vertices respectively.
4. The asynchronous energy efficient graph computation accelerator of claim 1, wherein the linear update policy performs a linear lookup on all graph data on each batch.
5. The asynchronous energy efficient graph computation accelerator of claim 1, wherein the binary update strategy is implemented using a binary search algorithm.
CN201810697919.9A 2018-06-29 2018-06-29 Asynchronous energy-efficient graph calculation accelerator Active CN109003222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810697919.9A CN109003222B (en) 2018-06-29 2018-06-29 Asynchronous energy-efficient graph calculation accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810697919.9A CN109003222B (en) 2018-06-29 2018-06-29 Asynchronous energy-efficient graph calculation accelerator

Publications (2)

Publication Number Publication Date
CN109003222A CN109003222A (en) 2018-12-14
CN109003222B true CN109003222B (en) 2020-07-24

Family

ID=64602130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810697919.9A Active CN109003222B (en) 2018-06-29 2018-06-29 Asynchronous energy-efficient graph calculation accelerator

Country Status (1)

Country Link
CN (1) CN109003222B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109949202B (en) * 2019-02-02 2022-11-11 西安邮电大学 Parallel graph computation accelerator structure

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Domino: an asynchronous and energy-efficient accelerator for graph processing; Chongchong Xu et al.; ACM Digital Library; 2018-02-25; p. 189 *
Domino: understanding wide-area, asynchronous event causality in web applications; Ding Li et al.; ACM; 2015-12-31; pp. 1-7 *
OmniGraph: a scalable hardware accelerator for graph processing; Chongchong Xu et al.; IEEE; 2017-12-31; pp. 623-624 *
Research on key technologies of heterogeneous reconfigurable platforms for big data applications; Chen Peng; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2015-09-15; pp. I138-17 *

Also Published As

Publication number Publication date
CN109003222A (en) 2018-12-14

Similar Documents

Publication Publication Date Title
CN108563808B (en) Design method of heterogeneous reconfigurable graph computing accelerator system based on FPGA
WO2022017038A1 (en) Three-dimensional lidar point cloud highly efficient k-nearest neighbor search algorithm for autonomous driving
Arslan et al. Use of relaxation methods in sampling-based algorithms for optimal motion planning
CN105740424A (en) Spark platform based high efficiency text classification method
Singh et al. A survey of traditional and MapReduce-based spatial query processing approaches
Sun et al. Application research based on improved genetic algorithm in cloud task scheduling
CN109003222B (en) Asynchronous energy-efficient graph calculation accelerator
AU2017288044B2 (en) Method and system for flexible, high performance structured data processing
CN109189994B (en) CAM structure storage system for graph computation application
CN107992358A (en) A kind of asynchronous IO suitable for the outer figure processing system of core performs method and system
CN103116593B (en) A kind of parallel method of the calculating convex hull based on multicore architecture
CN109685208A (en) A kind of method and device accelerated for the dilute combization of neural network processor data
CN107480096B (en) High-speed parallel computing method in large-scale group simulation
CN108319604B (en) Optimization method for association of large and small tables in hive
CN110264467B (en) Dynamic power law graph real-time repartitioning method based on vertex cutting
CN112149814A (en) Convolutional neural network acceleration system based on FPGA
Zhang et al. An efficient and balanced graph partition algorithm for the subgraph-centric programming model on large-scale power-law graphs
CN115496133A (en) Density data stream clustering method based on self-adaptive online learning
Gugan et al. Towards the development of a robust path planner for autonomous drones
Liu et al. AAPP: An Accelerative and Adaptive Path Planner for Robots on GPU
CN107529638B (en) Accelerated method, storage database and the GPU system of linear solution device
CN113010748A (en) Distributed big graph partitioning method based on affinity clustering
CN105354243A (en) Merge clustering-based parallel frequent probability subgraph searching method
CN105183875A (en) FP-Growth data mining method based on shared path
CN108959460A (en) A kind of diagram data layout method based on the vertex degree of association

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant