CN109003222B - Asynchronous energy-efficient graph calculation accelerator - Google Patents

Asynchronous energy-efficient graph calculation accelerator

Info

Publication number
CN109003222B
CN109003222B (Application CN201810697919.9A)
Authority
CN
China
Prior art keywords
graph
vertex
accelerator
module
data
Prior art date
Legal status
Active
Application number
CN201810697919.9A
Other languages
Chinese (zh)
Other versions
CN109003222A (en)
Inventor
李曦
周学海
徐冲冲
王超
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201810697919.9A priority Critical patent/CN109003222B/en
Publication of CN109003222A publication Critical patent/CN109003222A/en
Application granted granted Critical
Publication of CN109003222B publication Critical patent/CN109003222B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; processor configuration, e.g. pipelining
    • G06T 1/60 Memory management

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses an asynchronous energy-efficient graph computation accelerator comprising a data preprocessing module, a data transmission module and a data processing module connected in sequence. The data preprocessing module preprocesses a given example graph G in an initialization stage, partitioning the large-scale graph data and organizing graph G into sub-graph data (Batch Row Vectors) of size batch_size. The data transmission module transmits each Batch Row Vector to the on-board DDR through PCIe DMA and then to the on-chip accelerator through AXI DMA. The data processing module is logic on the accelerator chip comprising a storage module (on-chip caches) and an on-chip computation module. By adopting an asynchronous computation mode the invention accelerates the convergence of graph algorithms, and by customizing the accelerator on a hardware platform it reduces the power consumption and energy consumption of the system.

Description

Asynchronous energy-efficient graph calculation accelerator
Technical Field
The present invention relates to graph data processing, and more particularly to an asynchronous energy-efficient graph computation accelerator.
Background
With the maturation of big data, cloud computing and the internet industry, human life has entered an era of data explosion. Big data is characterized by the "4Vs": large volume of data (Volume), wide variety of data types (Variety), fast growth (Velocity) and low value density (Value). The graph is one of the most classical and commonly used data structures, and much real-life data is naturally abstracted into graph structures of various kinds: vertices represent entities, and edges represent relationships between entities. Common graph-structured data include social relationship graphs (social networks), web-page link graphs (web graphs), traffic network graphs (transport networks), gene-analysis graphs (genome analysis graphs) and the like. Moreover, the size of graph data has grown very rapidly: in 2011, Twitter released more than 2×10^8 tweets per day, and by 2013 this had increased to 5×10^9 per day. With the increasing popularity of machine learning and data mining applications, graph data keep growing larger. On the other hand, large-scale graph data exhibit extreme irregularity, so computation on conventional MapReduce and Hadoop systems generates a large amount of data communication and thus suffers from low efficiency. How to effectively process and analyze large-scale graph data is a major research hotspot in both academia and industry, and to address these challenges, researchers at home and abroad have turned their attention to the design of graph computation accelerators.
To improve the convergence efficiency of graph algorithms and to reduce the power consumption and energy consumption of graph computation, the invention designs an asynchronous energy-efficient graph computation accelerator that accelerates algorithm convergence through an asynchronous computation mode and reduces system power and energy consumption by customizing a dedicated accelerator.
Disclosure of Invention
The invention aims to provide an asynchronous energy-efficient graph computation accelerator that improves the computational efficiency of graph data processing and reduces the power consumption and energy consumption of graph computation.
The technical scheme of the invention is as follows:
an asynchronous energy-efficient graph computation accelerator comprises a data preprocessing module, a data transmission module and a data processing module which are connected in sequence, wherein:
the data preprocessing module is used for preprocessing a given example graph G in an initialization stage, partitioning the large-scale graph data and organizing the graph G into sub-graph data Batch Row Vector of size batch_size;
and the data transmission module is used for transmitting the sub-graph data Batch Row Vector to the onboard DDR through PCIe DMA and then transmitting the sub-graph data Batch Row Vector to the on-chip accelerator through AXI DMA.
The data processing module is logic on the accelerator chip and comprises a storage module and a calculation module; the storage module comprises an on-chip cache, and the calculation module comprises an on-chip calculation module.
Preferably, the storage module includes five buffer areas: a source vertex buffer, a target vertex buffer, a source vertex state value buffer, a target vertex state value buffer, and an output result buffer; these buffers store on chip the graph data to be computed and the output results.
Preferably, the storage module stores the graph data in a representation based on row vectors, called the Batch Row Vector. The Batch Row Vector is a 2-dimensional storage structure consisting of 4 row vectors: the first row vector stores the source vertices of the edges in the sub-graph data, the second row vector stores the target vertices of those edges, and the third and fourth row vectors store the state values of the source and target vertices respectively.
Preferably, the on-chip computation module is divided into 4 submodules: StreamEdges, Asynchronous Controller, Reduce and Update; the on-chip computation module is connected with the storage module through AXI-Stream, wherein:
StreamEdges is responsible for traversing the sub-graph data stored on the chip, reading the state value of each source vertex and updating the temporary state value of the corresponding target vertex;
the Asynchronous Controller performs asynchronous control during computation and is responsible for propagating the latest updated vertex state values to subsequent computation; during propagation, the controller checks whether the source vertex of a subsequently computed edge matches the target vertex of the currently computed edge, and if so performs an asynchronous update;
reduce is responsible for collecting state value components from different source vertexes on the same target vertex, and is a plus operation for the PageRank algorithm and a min operation for the BFS algorithm;
Update executes an Update function to write the computed state value back to the corresponding vertex.
Preferably, the execution process of the accelerator comprises four stages: a preprocessing stage, a StreamEdges stage, a Reduce stage and an Update stage;
in the preprocessing stage, the accelerator initializes the distances of all vertices: the distance of vertex 1, which is taken as the root node, is set to 1, and the distances of all vertices in the graph data are stored in dist[], where each vertex has a corresponding state value; before each Batch Row Vector is processed, the host reads vertex state values from dist[] to fill the Batch Row Vector, which is then transmitted batch by batch to the on-board DDR of the FPGA and from there, again batch by batch, to the accelerator; in the StreamEdges stage, the accelerator traverses the edges in the Batch Row Vector in sequence and updates the state value of each target vertex according to the state value of its source vertex; Domino performs asynchronous control through a linear update strategy and a binary update strategy, the asynchronous control mechanism propagating the latest vertex state values to subsequent computation; the different state value components of the same target vertex are collected in the Reduce stage, the Update function is executed in the Update stage to write the computed state values back to the corresponding vertices, and the host completes state value propagation between batches.
Preferably, the linear update strategy performs a linear query on all graph data on each batch.
Preferably, the binary update strategy is implemented by using a binary search algorithm.
The invention has the advantages that:
the invention designs an asynchronous energy-efficient graph calculation accelerator, accelerates the convergence speed of a graph algorithm by adopting an asynchronous calculation mode, and reduces the power consumption and the energy consumption of a system by customizing the accelerator based on a hardware platform.
Drawings
The invention is further described with reference to the following figures and examples:
FIG. 1 is a system architecture diagram of the asynchronous energy-efficient graph computation accelerator of the present invention;
FIG. 2 is a schematic diagram of the implementation of the asynchronous energy-efficient graph computation accelerator of the present invention;
FIG. 3 is a flow chart of asynchronous update of an asynchronous energy efficient graph computation accelerator according to the present invention.
Detailed Description
As shown in FIG. 1, the asynchronous energy-efficient graph computation accelerator disclosed by the present invention is mainly used for processing large-scale graph data; it adopts hardware customization to improve system efficiency and reduce system energy consumption, and comprises a data preprocessing module, a data transmission module and a data processing module connected in sequence, wherein:
the data preprocessing module is used for preprocessing a given example graph G in an initialization stage; because of the limited resources on the hardware platform chip, it partitions the large-scale graph data and organizes the graph G into sub-graph data Batch Row Vector of size batch_size;
and the data transmission module is used for transmitting the sub-graph data Batch Row Vector to the on-board DDR through PCIe DMA and then to the on-chip accelerator through AXI DMA; the calculation results are transmitted back along the reverse path.
The data processing module is logic on the accelerator chip and comprises a storage module and a calculation module; the storage module comprises an on-chip cache, and the calculation module comprises an on-chip calculation module.
In a specific implementation, the storage module includes five buffer areas: a source vertex buffer, a target vertex buffer, a source vertex state value buffer, a target vertex state value buffer, and an output result buffer; these buffers store on chip the graph data to be computed and the output results. The invention stores graph data in a representation based on row vectors, called the Batch Row Vector, as shown in FIG. 1. The Batch Row Vector is a 2-dimensional storage structure consisting of 4 row vectors: the first row vector stores the source vertices of the edges in the sub-graph data, and the second row vector stores the target vertices of those edges; the third and fourth row vectors store the state values of the source and target vertices respectively, where a vertex's state value depends on the particular graph algorithm, e.g. the Rank value of the vertex for the PageRank algorithm, or the depth of the vertex from the root node for the BFS algorithm.
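The Batch Row Vector layout described above can be modeled in software. The following is a minimal Python sketch, not the patent's actual implementation; the function name and the use of infinity as an "unvisited" state value are illustrative assumptions.

```python
def build_batch_row_vector(edges, state, batch_size):
    """Split an edge list into Batch Row Vectors: for each batch of
    batch_size edges, build 4 row vectors holding source vertex ids,
    target vertex ids, source state values, and target state values."""
    batches = []
    for i in range(0, len(edges), batch_size):
        chunk = edges[i:i + batch_size]
        src = [u for u, v in chunk]
        dst = [v for u, v in chunk]
        batches.append([
            src,                        # row 1: source vertices
            dst,                        # row 2: target vertices
            [state[u] for u in src],    # row 3: source state values
            [state[v] for v in dst],    # row 4: target state values
        ])
    return batches

# BFS-style example: vertex 1 is the root (state 0), others unvisited.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
state = {1: 0, 2: float("inf"), 3: float("inf"), 4: float("inf")}
batches = build_batch_row_vector(edges, state, batch_size=2)
# batches[0][0] == [1, 1]  (sources of the first batch)
```

Each batch is exactly the 2-dimensional 4-row structure the text describes, ready to be streamed to the accelerator edge by edge.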
The on-chip computation module is divided into 4 submodules: StreamEdges, Asynchronous Controller, Reduce and Update; it is connected with the storage module through AXI-Stream, wherein:
StreamEdges is responsible for traversing the sub-graph data stored on the chip, reading the state value of each source vertex and updating the temporary state value of the corresponding target vertex;
the Asynchronous Controller performs asynchronous control during computation and is responsible for propagating the latest updated vertex state values to subsequent computation; during propagation, the controller checks whether the source vertex of a subsequently computed edge matches the target vertex of the currently computed edge, and if so performs an asynchronous update;
reduce is responsible for collecting state value components from different source vertexes on the same target vertex, and is a plus operation for the PageRank algorithm and a min operation for the BFS algorithm;
Update executes an Update function to write the computed state value back to the corresponding vertex.
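The Reduce submodule's behavior can be sketched in software: it combines the state-value components arriving at the same target vertex with the algorithm's reduce operator, min for BFS and add for PageRank, as stated above. This is an illustrative model only; the function name and the (vertex, value) pair format are assumptions.

```python
def reduce_components(components, op):
    """Combine partial state values per target vertex.
    components: iterable of (target_vertex, partial_state_value) pairs.
    op: binary reduce operator (e.g. min for BFS, addition for PageRank)."""
    out = {}
    for v, val in components:
        out[v] = val if v not in out else op(out[v], val)
    return out

# BFS: keep the minimum candidate depth reaching each vertex.
bfs = reduce_components([(3, 2), (3, 1), (4, 3)], min)
# bfs == {3: 1, 4: 3}

# PageRank: sum the rank contributions reaching each vertex.
pr = reduce_components([(3, 0.2), (3, 0.3)], lambda a, b: a + b)
# pr[3] == 0.5
```

Swapping only the operator is what lets one Reduce datapath serve both algorithms.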
As shown in FIG. 2, the execution process of the accelerator includes four stages: a preprocessing stage, a StreamEdges stage, a Reduce stage and an Update stage. In the preprocessing stage, the accelerator initializes the distances of all vertices: the distance of vertex 1, which is taken as the root node, is set to 1, and the distances of all vertices in the graph data are stored in dist[], where each vertex has a corresponding state value. Before each Batch Row Vector is processed, the host reads vertex state values from dist[] to fill the Batch Row Vector, which is then transmitted batch by batch to the on-board DDR of the FPGA and from there, again batch by batch, to the accelerator. In the StreamEdges stage, the accelerator traverses the edges in the Batch Row Vector in sequence and updates the state value of each target vertex according to the state value of its source vertex. Domino performs asynchronous control through a linear update strategy and a binary update strategy; the asynchronous control mechanism propagates the latest vertex state values to subsequent computation. The different state value components of the same target vertex are collected in the Reduce stage, the Update function is executed in the Update stage to write the computed state values back to the corresponding vertices, and the host completes state value propagation between batches.
The accelerator's asynchronous update flow is shown in FIG. 3, where bold numbers indicate vertex state values that have already been updated and the arc arrows indicate asynchronous updates. In the initial stage, the first edge (1,2) is accessed and processed, and the state value of target vertex 2 is updated to 1 according to state value 0 of source vertex 1. The asynchronous controller then traverses the subsequent edge set, selects the edges whose source vertex is 2, and refreshes the updated state value of vertex 2 into the source vertex state value field of those edges, e.g. the third edge (2,3); thus, when (2,3) is processed, the state value of target vertex 3 can be updated using the latest state value of vertex 2.
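The asynchronous forwarding just described can be mirrored in a small software model. The sketch below is a hedged illustration, not the hardware datapath: it assumes a BFS-style update (depth + 1 with min), uses 9 as a stand-in "unvisited" value, and all names are hypothetical.

```python
def process_batch_async(src, dst, src_state, dst_state):
    """Process one batch of edges in order. After an edge updates its
    target vertex, forward the fresh value into the source-state field
    of every later edge in the batch whose source is that vertex,
    mirroring the (1,2) -> (2,3) example in the text."""
    for i in range(len(src)):
        new_val = src_state[i] + 1          # BFS-style update: depth + 1
        if new_val < dst_state[i]:
            dst_state[i] = new_val
            # asynchronous forwarding to later edges in the same batch
            for j in range(i + 1, len(src)):
                if src[j] == dst[i] and src_state[j] > new_val:
                    src_state[j] = new_val
    return dst_state

# Edges (1,2), (1,3), (2,3); vertex 1 is the root with state 0.
src_state = [0, 0, 9]
dst_state = [9, 9, 9]
out = process_batch_async([1, 1, 2], [2, 3, 3], src_state, dst_state)
# out == [1, 1, 2]: edge (2,3) sees vertex 2's fresh value 1,
# so vertex 3 converges to depth 2 within a single pass.
```

Without the forwarding loop, edge (2,3) would have used vertex 2's stale value and vertex 3 would need another iteration to converge, which is the convergence benefit asynchronous computation provides.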
The linear update strategy performs a linear query over all the graph data in each batch.

[Figure GDA0002475366440000041: pseudocode of the linear update strategy]
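The patent gives the linear update strategy only as an image, so the following Python sketch is a hedged reconstruction from the text's one-sentence description (a linear scan over the batch's edges); the function signature is an assumption.

```python
def linear_update(src, src_state, start, vertex, new_val):
    """Linear update strategy (reconstructed): after `vertex` receives
    the fresh state value `new_val`, linearly scan the batch's source
    row from position `start` and refresh every matching source-state
    field."""
    for j in range(start, len(src)):
        if src[j] == vertex:
            src_state[j] = new_val
    return src_state

# Vertex 2 just got state value 1; refresh edges 1..3 of the batch.
state = linear_update([1, 2, 2, 3], [0, 7, 7, 7], 1, 2, 1)
# state == [0, 1, 1, 7]
```

The scan is O(batch_size) per update, which motivates the binary alternative below only in the sense that a cheaper lookup is desirable; the trade-off itself is this sketch's observation, not a claim of the patent.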
The binary update strategy is implemented with a binary search algorithm.

[Figure GDA0002475366440000042: pseudocode of the binary update strategy]
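The binary update strategy is likewise given only as an image; the sketch below reconstructs it under the assumption (not spelled out in this text) that the batch's source-vertex row is kept sorted, so the run of edges sharing a given source can be found by binary search instead of a linear scan.

```python
import bisect

def binary_update(sorted_src, src_state, vertex, new_val):
    """Binary update strategy (reconstructed): locate, via binary
    search, the contiguous run of edges whose source is `vertex` in a
    sorted source row, and refresh their source-state fields."""
    lo = bisect.bisect_left(sorted_src, vertex)
    hi = bisect.bisect_right(sorted_src, vertex)
    for j in range(lo, hi):        # refresh only the matching run
        src_state[j] = new_val
    return src_state

# Same batch as before, with the source row sorted.
state = binary_update([1, 2, 2, 3], [0, 7, 7, 7], 2, 1)
# state == [0, 1, 1, 7]
```

Locating the run costs O(log batch_size) rather than O(batch_size), at the price of keeping the source row sorted during preprocessing.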
The above embodiments merely illustrate the technical ideas and features of the present invention; their purpose is to enable those skilled in the art to understand and implement the invention, not to limit its scope of protection. All modifications made according to the spirit of the main technical scheme of the invention fall within its scope of protection.

Claims (5)

1. An asynchronous energy-efficient graph computation accelerator, characterized in that it comprises a data preprocessing module, a data transmission module and a data processing module connected in sequence, wherein:
the data preprocessing module is used for preprocessing a given example graph G in an initialization stage, dividing large-scale graph data and organizing the graph G into sub-graph data Batch Row Vector with the size of Batch _ size;
the data transmission module is used for transmitting the sub-graph data Batch Row Vector to the onboard DDR through PCIe DMA and then transmitting the sub-graph data Batch Row Vector to the on-chip accelerator through AXI DMA;
the data processing module is logic on the accelerator chip and comprises a storage module and a calculation module; the storage module comprises an on-chip cache, and the calculation module comprises an on-chip calculation module;
the on-chip computing module is divided into 4 submodules, namely Stream Edges, Asynchronous Controller, Reduce and Update; the on-chip computing module is connected with the storage module through AXI-Stream, wherein:
the Stream Edges are responsible for traversing the sub-graph data stored on the chip, reading the state value of the source vertex and updating the temporary state value of the target vertex;
the Asynchronous Controller performs asynchronous control during computation and is responsible for propagating the latest updated vertex state values to subsequent computation; during propagation, the controller checks whether the source vertex of a subsequently computed edge matches the target vertex of the currently computed edge, and if so performs an asynchronous update;
Reduce is responsible for collecting the state value components contributed by different source vertices to the same target vertex; this is an add operation for the PageRank algorithm and a min operation for the BFS algorithm;
the Update executes an Update function to write the computed state value back to the corresponding vertex;
the execution process of the accelerator comprises four stages: a preprocessing stage, a StreamEdges stage, a Reduce stage and an Update stage;
in the preprocessing stage, the accelerator initializes the distances of all vertices: the distance of vertex 1, which is taken as the root node, is set to 1, and the distances of all vertices in the graph data are stored in dist[], where each vertex has a corresponding state value; before each Batch Row Vector is processed, the host reads vertex state values from dist[] to fill the Batch Row Vector, which is then transmitted batch by batch to the on-board DDR of the FPGA and from there, again batch by batch, to the accelerator; in the Stream Edges stage, the accelerator traverses the edges in the Batch Row Vector in sequence and updates the state value of each target vertex according to the state value of its source vertex; the asynchronous energy-efficient graph computation accelerator performs asynchronous control through a linear update strategy and a binary update strategy, the asynchronous control mechanism propagating the latest vertex state values to subsequent computation; the different state value components of the same target vertex are collected in the Reduce stage, the Update function is executed in the Update stage to write the computed state values back to the corresponding vertices, and the host completes state value propagation between batches.
2. The asynchronous energy-efficient graph computation accelerator of claim 1, wherein the storage module comprises five buffers, namely a source vertex buffer, a target vertex buffer, a source vertex state value buffer, a target vertex state value buffer, and an output result buffer, for storing on the chip the graph data to be computed and the output results.
3. The asynchronous energy-efficient graph computation accelerator of claim 2, wherein the storage module stores graph data in a representation based on row vectors, referred to as a Batch Row Vector, which is a 2-dimensional storage structure consisting of 4 row vectors: the first row vector stores the source vertices of the edges in the sub-graph data, the second row vector stores the target vertices of those edges, and the third and fourth row vectors store the state values of the source and target vertices respectively.
4. The asynchronous energy efficient graph computation accelerator of claim 1, wherein the linear update policy performs a linear lookup on all graph data on each batch.
5. The asynchronous energy efficient graph computation accelerator of claim 1, wherein the binary update strategy is implemented using a binary search algorithm.
CN201810697919.9A 2018-06-29 2018-06-29 Asynchronous energy-efficient graph calculation accelerator Active CN109003222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810697919.9A CN109003222B (en) 2018-06-29 2018-06-29 Asynchronous energy-efficient graph calculation accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810697919.9A CN109003222B (en) 2018-06-29 2018-06-29 Asynchronous energy-efficient graph calculation accelerator

Publications (2)

Publication Number Publication Date
CN109003222A CN109003222A (en) 2018-12-14
CN109003222B true CN109003222B (en) 2020-07-24

Family

ID=64602130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810697919.9A Active CN109003222B (en) 2018-06-29 2018-06-29 Asynchronous energy-efficient graph calculation accelerator

Country Status (1)

Country Link
CN (1) CN109003222B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109949202B (en) * 2019-02-02 2022-11-11 西安邮电大学 Parallel graph computation accelerator structure

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Domino: an asynchronous and energy-efficient accelerator for graph processing; Chongchong Xu et al.; ACM Digital Library; 2018-02-25; p. 189 *
Domino: understanding wide-area, asynchronous event causality in web applications; Ding Li et al.; ACM; 2015-12-31; pp. 1-7 *
OmniGraph: a scalable hardware accelerator for graph processing; Chongchong Xu et al.; IEEE; 2017-12-31; pp. 623-624 *
Research on key technologies of heterogeneous reconfigurable platforms for big data applications; Chen Peng; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2015-09-15; pp. I138-17 *

Also Published As

Publication number Publication date
CN109003222A (en) 2018-12-14

Similar Documents

Publication Publication Date Title
CN108563808B (en) Design method of heterogeneous reconfigurable graph computing accelerator system based on FPGA
WO2022017038A1 (en) Three-dimensional lidar point cloud highly efficient k-nearest neighbor search algorithm for autonomous driving
Arslan et al. Use of relaxation methods in sampling-based algorithms for optimal motion planning
CN105740424A (en) Spark platform based high efficiency text classification method
Singh et al. A survey of traditional and MapReduce-based spatial query processing approaches
Sun et al. Application research based on improved genetic algorithm in cloud task scheduling
CN109003222B (en) Asynchronous energy-efficient graph calculation accelerator
AU2017288044B2 (en) Method and system for flexible, high performance structured data processing
CN109189994B (en) CAM structure storage system for graph computation application
CN107992358A (en) A kind of asynchronous IO suitable for the outer figure processing system of core performs method and system
CN103116593B (en) A kind of parallel method of the calculating convex hull based on multicore architecture
CN109685208A (en) A kind of method and device accelerated for the dilute combization of neural network processor data
CN107480096B (en) High-speed parallel computing method in large-scale group simulation
CN108319604B (en) Optimization method for association of large and small tables in hive
CN110264467B (en) Dynamic power law graph real-time repartitioning method based on vertex cutting
CN112149814A (en) Convolutional neural network acceleration system based on FPGA
Zhang et al. An efficient and balanced graph partition algorithm for the subgraph-centric programming model on large-scale power-law graphs
CN115496133A (en) Density data stream clustering method based on self-adaptive online learning
Gugan et al. Towards the development of a robust path planner for autonomous drones
Liu et al. AAPP: An Accelerative and Adaptive Path Planner for Robots on GPU
CN107529638B (en) Accelerated method, storage database and the GPU system of linear solution device
CN113010748A (en) Distributed big graph partitioning method based on affinity clustering
CN105354243A (en) Merge clustering-based parallel frequent probability subgraph searching method
CN105183875A (en) FP-Growth data mining method based on shared path
CN108959460A (en) A kind of diagram data layout method based on the vertex degree of association

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant