CN114819127B - Back pressure index type combined calculation unit based on FPGA - Google Patents

Back pressure index type combined calculation unit based on FPGA

Info

Publication number
CN114819127B
CN114819127B (application number CN202210482666.XA)
Authority
CN
China
Prior art keywords
weight
matrix
buffer
data
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210482666.XA
Other languages
Chinese (zh)
Other versions
CN114819127A (en)
Inventor
黄以华 (Huang Yihua)
许圣钧 (Xu Shengjun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210482666.XA
Publication of CN114819127A
Application granted
Publication of CN114819127B
Legal status: Active (current)
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to an FPGA-based back pressure index type combined calculation unit. Aimed at the GNN combination stage, it merges and encodes several sparse vertex feature vectors, uses the encoded coordinates to index into the weight data slices fed into the calculation unit, multiplies the indexed data, and accumulates the products into intermediate result registers to complete the GNN combination-stage calculation. The method fully exploits the sparsity of the node feature vectors, greatly shortens the time required for the first-layer combination calculation under the constraint of limited on-chip FPGA computing resources, reduces pipeline stalls, and requires no complex control logic.

Description

Back pressure index type combined calculation unit based on FPGA
Technical Field
The invention relates to the technical field of machine learning, in particular to a back pressure index type combined calculation unit based on an FPGA.
Background
Compared with deep convolutional neural networks (DCNNs), graph neural networks (GNNs) have advantages that other neural networks cannot match when processing non-Euclidean data, and are widely used in node classification, natural language processing, recommendation systems, graph clustering and link prediction. Unlike DCNN algorithms, which already have relatively mature deployment schemes, the deployment of GNN algorithms still faces many problems. GNNs incur higher computation, storage and bandwidth overheads than DCNN algorithms and introduce irregular computation and memory-access patterns. Acceleration techniques developed for DCNNs, such as nested-loop optimization, quantization strategies and the Winograd algorithm, are therefore difficult to migrate directly into GNN accelerator designs.
Graphs are among the most widespread data formats, and the combination of graph data and neural networks inevitably gives rise to a large number of dedicated hardware architecture designs, just as DCNN applications, represented by image processing, catalyzed the development of GPUs. A dedicated system architecture for the field of graph data processing, the GNN accelerator, therefore has significant practical engineering value and academic research value. Computing GNNs efficiently is far from trivial compared with DCNNs: although the matrices involved in GNN calculations are large, the fraction of non-zero elements in the adjacency matrix required for the calculation is often only on the order of 10⁻³ to 10⁻², and their distribution is irregular, so problems such as low utilization of computing resources and load imbalance limit the speed of GNN computation. When designing a GNN accelerator, the throughput of the accelerator cannot be considered in isolation; what matters more is how to schedule data efficiently and improve the utilization of computing resources.
Most existing GNN accelerators are specially optimized for the aggregation stage of GNNs, whose memory accesses are irregular, while the computation-intensive combination stage directly reuses a series of existing DCNN techniques such as systolic arrays. However, the vectors in the combination stage are high-dimensional and highly sparse, so directly applying DCNN computation methods wastes a great deal of computing resources and prolongs the computation time. Exploiting the data characteristics of the combination-stage computation therefore effectively saves computing resources and time, reduces pipeline stalls, raises the utilization of computing resources and improves the performance of the accelerator.
The prior art discloses a neural network unit (NNU) configured to convolve an input of H rows × W columns × C channels with F filters, each of R rows × S columns × C channels, to generate F outputs, each of Q rows × P columns. The unit comprises: a first memory holding rows of N words logically divided into G input blocks of B words each; a second memory holding rows of N words logically divided into G filter blocks of B words each, where B is the smallest factor of N greater than W and N is at least 512; and an array of N processing units (PUs), each PU having an accumulator, a register that receives a respective word of the N words of a row of the second memory, a multiplexing register that selectively receives a respective word of the N words of a row of the first memory or a word rotated from the multiplexing register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, the register and the multiplexing register, the N PUs being logically divided into G PU blocks of B PUs each. The input blocks are stored in H rows of the first memory; each such row stores a respective two-dimensional slice of the corresponding one of the H input rows in at least C of the G input blocks, each of those blocks holding the row of words of the slice specified by a respective one of the C channels. The filter blocks are stored in R×S×C rows of the second memory, where each of F filter blocks among the G filter blocks of each such row holds P copies of the weight of the corresponding filter at the corresponding row, column and channel. To convolve the input with the filters, the G PU blocks perform multiply-accumulate operations on the input blocks and filter blocks in column-channel-row order; during part of these operations they read one of the H rows of the at least C input blocks from the first memory and rotate it around the N PUs, so that each of the F relevant PU blocks receives each of the at least C input blocks of that row before another of the H rows is read from the first memory. Applying this scheme to a GNN accelerator still wastes a great deal of computing resources.
Disclosure of Invention
The invention provides an FPGA-based back pressure index type combined calculation unit, which improves the calculation efficiency of the GNN combination stage.
In order to solve the technical problems, the technical scheme of the invention is as follows:
the back pressure index type combined calculation unit based on the FPGA is used for calculating a combined stage of a graph neural network, a weight matrix slice is input into the combined calculation unit in each clock period, and the combined calculation unit comprises an index buffer, a range buffer, a control unit, a weight tiling buffer, a multiplier, an accumulator and m intermediate result registers, wherein:
the index buffer is used for storing data after non-zero element coding in the node characteristic matrix;
the range buffer is used for determining the number range of data in the weight matrix slice input in the current clock cycle;
the control unit judges whether the top index number of the index buffer is in the number range of the range buffer, if so, the control unit indexes out the corresponding weight data in the weight matrix slice, sends the corresponding weight data into the multiplier together with the specific value of the corresponding non-zero element in the index buffer, and accumulates the multiplier result into the corresponding intermediate result register by the accumulator according to the node number of the non-zero element;
the weight flat buffer is used for storing weight matrix slices which cannot be processed in the current clock period, judging whether the weight matrix slices in the weight flat buffer have data to be indexed after the control unit completes the current task, if not, discarding the data, and if so, indexing the data and sending the data to the multiplier and the accumulator for searching and calculating until the weight flat buffer is empty;
after all the weight matrix slices are sent and processed, the data in the m intermediate result registers are the final result of the combined calculation.
Preferably, the node feature matrix is composed of an adjacency matrix and a node feature vector, wherein the adjacency matrix is used for representing connection relations among nodes, and the node feature vector is used for representing features of each node.
Preferably, the node feature matrix needs to be sliced, where the slicing of the node feature vectors only needs to correspond to the slicing of the adjacency matrix, specifically:
the node feature vectors of the nodes involved in a given slice of the adjacency matrix are taken out as one node feature vector slice.
Preferably, when the index buffer stores data, node feature vectors with the same number in different node-feature-vector slices are merged, and after merging the encoded non-zero elements are arranged in increasing order of their in-row element coordinates, where "the same number in different node-feature-vector slices" means the following:
once the slice size of the adjacency matrix is determined, every slice of node feature vectors contains the same number of nodes; assuming each slice contains N node feature vectors, the nodes within each slice are numbered 0 to N-1, and vectors carrying the same local number in different slices are called node feature vectors with the same number in different slices.
Preferably, the non-zero elements in the node feature matrix are encoded as follows:
each non-zero element in the node feature matrix is represented by a triplet comprising a row, a column and a value, where the row represents the node number and the column represents the coordinate of the non-zero element within its row.
Preferably, the number range of the range buffer is incremented every clock cycle by a value equal to the number of weight data in a weight matrix slice.
Preferably, the control unit judges whether the index number at the top of the index buffer is within the number range of the range buffer as follows:
if the index value sent by the index buffer is smaller than the current value in the range buffer, it is determined that the current weight matrix slice contains data matching the index value.
An FPGA-based back pressure index type combined computing system comprises a preprocessing module, an array of combined calculation units and a plurality of weight memories, wherein:
the preprocessing module encodes and stores the non-zero elements of the node feature matrix;
the combined calculation unit array comprises M×N FPGA-based back pressure index type combined calculation units, where the data formed by the non-zero elements of every m node feature vectors is input into one combined calculation unit, and each combined calculation unit is responsible for the combination calculation of the m node feature vectors assigned to it within its column;
the weight memories store the slices of the weight matrix; the slices are broadcast to every combined calculation unit in the same column of the combined calculation unit array, one weight matrix slice per clock cycle; the number of weight memories equals the number of columns of the weight matrix, and each weight memory stores the data of one column of the weight matrix.
Preferably, the preprocessing module also obtains the sparsity of the node feature matrix by performing static statistics on it, and determines the slice fusion number m of the node feature vectors and the slice size of the weight matrix according to that sparsity.
Preferably, when the weight tile buffer is full of data, its Full signal is raised and the combined calculation unit sends a back pressure signal to the weight memory; transmission of weight matrix slice data and the operation of the system's subsequent modules are suspended until the special situation in the current combined calculation unit is resolved, i.e., until the calculation task of the last weight matrix slice is fully completed or the Full signal of the weight tile buffer is lowered.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
1) The invention fully exploits the sparsity of the node feature vectors in the first-layer combination calculation of the GNN and completes the combination calculation in a fixed number of cycles using the back pressure index method, saving a great deal of time in the computation of high-dimensional vectors while reducing the hardware cost of each calculation unit, so that the calculation array can achieve greater parallelism.
2) The invention merges the non-zero elements of m node feature vectors, so the time-division-multiplexed calculation unit achieves m times the parallelism of the unmerged case while providing more available vertices for the subsequent aggregation stage; under suitable parameters this effectively improves the utilization of the aggregation calculation units and reduces pipeline stalls.
3) The invention makes the combination calculation time depend on the slice size of the weight data, so that the calculation times of all calculation units tend to be the same, providing a regular data flow for the subsequent aggregation stage with simple control logic; adjusting the slice size also allows the hardware resource cost and the combination calculation time to be traded off freely, as the cycle-count sketch below illustrates.
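For illustration, a closed-form reading of this property (the patent states it only qualitatively): with node feature dimension L, i.e. L weight data per output column, weight slice size s, and one slice broadcast per clock cycle, the nominal combination latency is

```latex
T = \left\lceil \frac{L}{s} \right\rceil \quad \text{clock cycles}
```

independent of each vector's sparsity, which is why the calculation times of all units tend to coincide; enlarging s shortens T at the cost of more per-unit slice hardware, and vice versa.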
Drawings
Fig. 1 is a schematic diagram of a back pressure index type combined calculating unit according to the present invention.
Fig. 2 is a data flow diagram of the back pressure index type combination calculating unit.
FIG. 3 is a schematic diagram of a back pressure index type combined computing system framework.
FIG. 4 is a schematic diagram of the utilization of the computing units and the performance improvement of the accelerator under different parameters in a combined computing system using back pressure index.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment provides an FPGA-based back pressure Index type combined calculation Unit (CPE), as shown in fig. 1. The combined calculation unit is used to compute the combination stage of a graph neural network, and one weight matrix slice is input into it per clock cycle. The combined calculation unit comprises an Index Buffer, a Range Buffer, a Control Unit, a Weight Tile Buffer, a multiplier, an accumulator and m intermediate result registers (Reg in the figure denotes the group of m intermediate result registers), wherein:
the index buffer stores the encoded non-zero elements of the node feature matrix;
the range buffer determines the range of numbers covered by the weight matrix slice input in the current clock cycle;
the control unit judges whether the index number at the top of the index buffer falls within the number range of the range buffer; if so, it indexes out the corresponding weight datum in the weight matrix slice and sends it, together with the value of the corresponding non-zero element from the index buffer, into the multiplier, and the accumulator accumulates the multiplier result into the intermediate result register corresponding to the node number of the non-zero element;
the weight tile buffer stores weight matrix slices that cannot be fully processed in the current clock cycle; after the control unit completes its current task, it judges whether the slices in the weight tile buffer still contain data to be indexed, discards a slice if not, and otherwise indexes the data and sends it to the multiplier and accumulator for lookup and calculation, until the weight tile buffer is empty;
after all the weight matrix slices have been sent and processed, the data in the m intermediate result registers are the final result of the combination calculation.
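For illustration only, the following is a minimal cycle-level software sketch of the unit just described (all identifiers, such as CombUnitModel, tile_buf and regs, are our own and not part of the patented design, which is FPGA logic rather than software). It assumes the index buffer holds (node, column, value) triplets sorted by column and that one weight slice of fixed size arrives per clock cycle:

```python
from collections import deque

class CombUnitModel:
    """Cycle-level sketch of one back pressure index combined calculation
    unit (CPE). Illustrative software model only."""

    def __init__(self, m: int, slice_size: int, tile_depth: int = 4):
        self.m = m                    # merged node feature vectors per unit
        self.s = slice_size           # number of weight data per slice
        self.index_buf = deque()      # (node, col, value) triplets, col-sorted
        self.range_top = 0            # range buffer: numbers covered so far
        self.tile_buf = deque()       # (base, slice) pairs awaiting indexing
        self.tile_depth = tile_depth  # capacity; reaching it raises Full
        self.regs = [0.0] * m         # m intermediate result registers

    @property
    def full(self) -> bool:
        # Full signal of the weight tile buffer (drives back pressure).
        return len(self.tile_buf) >= self.tile_depth

    def cycle(self, new_slice=None) -> None:
        """One clock: optionally accept a weight slice, do one index step."""
        if new_slice is not None:
            self.tile_buf.append((self.range_top, new_slice))
            self.range_top += self.s  # range grows by the slice size
        # Discard head slices that have nothing left to be indexed.
        while self.tile_buf and not (
            self.index_buf
            and self.index_buf[0][1] < self.tile_buf[0][0] + self.s
        ):
            self.tile_buf.popleft()
        # Top index within range: index the weight out, multiply, accumulate.
        if self.tile_buf and self.index_buf:
            base, tile = self.tile_buf[0]
            node, col, val = self.index_buf.popleft()
            self.regs[node] += val * tile[col - base]
```

One index-multiply-accumulate per cycle reproduces the back pressure described above: a slice that needs several lookups keeps later slices waiting in the tile buffer, and slices with no matching index are discarded.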
Example 2
On the basis of embodiment 1, this embodiment further discloses the following:
the node characteristic matrix consists of an adjacent matrix and node characteristic vectors, wherein the adjacent matrix is used for representing the connection relation among the nodes, and the node characteristic vectors are used for representing the characteristics of each node.
The node feature matrix needs to be sliced, where the slicing of the node feature vectors only needs to correspond to the slicing of the adjacency matrix, specifically:
the node feature vectors of the nodes involved in a given slice of the adjacency matrix are taken out as one node feature vector slice.
In a GNN, the adjacency matrix represents the connection relations between nodes and the node feature vectors represent the features of each node; the core of the GCN inference process is the updating of the node feature vectors. Because storage and computing resources on the FPGA chip are limited, the node feature matrix composed of the adjacency matrix and the node feature vectors must be sliced before deployment. As shown in fig. 2, when the index buffer stores data, node feature vectors with the same number in different node-feature-vector slices are merged; if the merge count is m, the m node feature vectors are merged and then arranged in increasing order of the in-row element coordinates of the encoded non-zero elements, and the data formed by the non-zero elements of every m node feature vectors is input into one back pressure index type combined calculation unit. "The same number in different node-feature-vector slices" means the following:
each slice of the adjacency matrix corresponds to part of the connection relations of the whole graph, i.e. one subgraph involving only some of the nodes. The slices of the node feature vectors therefore only need to correspond to the slices of the adjacency matrix. Once the slice size of the adjacency matrix is determined, every slice of node feature vectors contains the same number of nodes; assuming each slice contains N node feature vectors, the nodes within each slice are numbered 0 to N-1, and vectors carrying the same local number in different slices are called node feature vectors with the same number in different slices.
The non-zero elements in the node feature matrix are encoded as follows:
each non-zero element in the node feature matrix is represented by a triplet comprising a row, a column and a value, where the row represents the node number and the column represents the coordinate of the non-zero element within its row.
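A small sketch of this encoding and merging step (illustrative Python; the function name encode_merged, the numpy representation and the use of local node indices 0..m-1 in place of the global node numbers of the real scheme are our own simplifications):

```python
import numpy as np

def encode_merged(slices):
    """COO-style encoding of merged non-zero elements: each non-zero
    becomes a (node, column, value) triplet. `slices` is a list of m
    1-D arrays, the same-numbered feature vector from each slice."""
    triplets = [
        (node, int(col), float(vec[col]))
        for node, vec in enumerate(slices)
        for col in np.nonzero(vec)[0]
    ]
    # Arrange in increasing order of in-row element coordinate; the sort
    # is stable, so equal coordinates keep their node order.
    return sorted(triplets, key=lambda t: t[1])
```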
In the whole GCN inference process, the vector dimensions being multiplied in the first-layer combination stage are extremely large, so COO encoding is mainly aimed at the raw input data. The input of the second-layer combination calculation is the result of the first layer; its vector dimension is greatly reduced and the data are much denser, so they can be computed directly without encoding.
Since the number of weight data transmitted by the weight memory is fixed in every clock cycle, the number range of the range buffer grows every clock cycle, and the increment equals the number of weight data in a weight matrix slice.
The control unit judges whether the index number at the top of the index buffer is within the number range of the range buffer as follows:
if the index value sent by the index buffer is smaller than the current value in the range buffer, it is determined that the current weight matrix slice contains data matching the index value.
After the combination calculation starts, the encoded non-zero elements are stored in the Index Buffer. If the index number at the top of the Index Buffer falls within the number range of the currently received weight slice, the corresponding weight datum is indexed out of the slice and handed to the subsequent calculation units, and the top entry of the Index Buffer is popped. When several data in one slice need to be indexed, multiple clock cycles are required to finish that slice, yet a new slice arrives every clock cycle, which creates back pressure. While the Control Unit has not finished indexing the previous slice, the new slice is stored in the Weight Tile Buffer; once the Control Unit finishes its current task, it checks whether the slices in the Weight Tile Buffer still contain data to be indexed, discards a slice if not, and otherwise indexes it, until the Weight Tile Buffer is empty.
Since one combined calculation unit is responsible for the combination calculation of m nodes simultaneously, the Control Unit decodes from the data sent by the Index Buffer not only the coordinate of the non-zero element but also the node it belongs to and its value. After the corresponding datum is fetched from the weight slice according to the index value, the weight datum and the value of the non-zero element are sent into the multiplier, and the multiplier result is accumulated into the corresponding intermediate result register according to the node number of the non-zero element. After all weight slices have been sent, the data in the m intermediate result registers are the final result of the combination calculation.
Fig. 2 shows the case where the slice size is 128 and the merge count m = 3. Let the dimension of the node feature vectors be L. The non-zero elements of node 0 have in-row coordinates 0, 3 and L-4; those of node 128 have coordinates 3 and L-1; node 256 has one at coordinate 5. Merging the non-zero elements of the three node feature vectors and arranging them in increasing order of in-row coordinate yields the right-hand figure, ordered as (0,0), (0,3), (128,3), (256,5), (0,L-4), (128,L-1), which is then placed into the Index Buffer. Assume the weight slice size is 4 and there are T slices in total, one slice being fed into the back pressure index calculation unit per clock cycle. After the combination calculation starts, the first weight slice is sent to the calculation unit in the first clock cycle; this slice covers data numbers 0-3. The index value at the top of the Index Buffer is 0, which is in range, so datum 0 of the slice is taken out, multiplied by the non-zero element value in the top entry of the Index Buffer, and accumulated into the intermediate result register corresponding to node 0 according to the node number. The index task for the top entry is then complete and the entry is popped. Slice 0 cannot be discarded yet, because the new top index value is 3 and the next clock cycle must still index into slice 0; but since slice 1 has already arrived, slice 1 is buffered in the Weight Tile Buffer, and once slice 0 has no more data to be indexed, the Control Unit fetches the accumulated slices from the Weight Tile Buffer and processes them. After T clock cycles all weight slices have been sent, and because the node feature vectors of the first network layer are highly sparse, most calculation units have completed their calculation task by then.
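As a sanity check, the toy run below pushes this walk-through through the earlier sketches, with L = 16 and local indices 0, 1, 2 standing in for nodes 0, 128, 256 (the element values and the weight column are arbitrary; this is a minimal sketch, not the patent's implementation):

```python
import numpy as np
from collections import deque

# Reuses CombUnitModel and encode_merged from the sketches above.
L, m, s = 16, 3, 4                          # toy dimension, merge count, slice size
rng = np.random.default_rng(0)

vecs = [np.zeros(L) for _ in range(m)]
vecs[0][[0, 3, L - 4]] = [1.0, 2.0, 3.0]    # node 0:   coordinates 0, 3, L-4
vecs[1][[3, L - 1]] = [4.0, 5.0]            # node 128: coordinates 3, L-1
vecs[2][[5]] = [6.0]                        # node 256: coordinate 5

w = rng.standard_normal(L)                  # one column of the weight matrix

unit = CombUnitModel(m=m, slice_size=s)
unit.index_buf = deque(encode_merged(vecs)) # ordering matches fig. 2:
# (0,0), (0,3), (1,3), (2,5), (0,L-4), (1,L-1)

for t in range(L // s):                     # T = L/s slices, one per cycle
    unit.cycle(w[t * s:(t + 1) * s])
while unit.tile_buf:                        # extra cycles drain the weight
    unit.cycle()                            # tile buffer (back pressure)

assert np.allclose(unit.regs, [v @ w for v in vecs])
```

In the first cycle the model indexes datum 0 of slice 0 and accumulates into register 0; slice 1 waits in the tile buffer while slice 0 still holds the coordinate-3 entries, exactly as in the walk-through.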
Example 3
On the basis of embodiments 1 and 2, this embodiment further discloses an FPGA-based back pressure index type combined computing system, as shown in fig. 3. The system comprises a preprocessing module, an array of combined calculation units and a plurality of weight memories, wherein:
the preprocessing module encodes and stores the non-zero elements of the node feature matrix;
the combined calculation unit array comprises M×N FPGA-based back pressure index type combined calculation units as described in embodiment 1, where the data formed by the non-zero elements of every m node feature vectors is input into one combined calculation unit, and each combined calculation unit is responsible for the combination calculation of the m node feature vectors assigned to it within its column;
the weight memories store the slices of the weight matrix; the slices are broadcast to every combined calculation unit in the same column of the combined calculation unit array, one weight matrix slice per clock cycle; the number of weight memories equals the number of columns of the weight matrix, and each weight memory stores the data of one column of the weight matrix.
Considering that the sparsity of the node feature vectors differs between data sets, the preprocessing module also obtains the sparsity of the node feature matrix by performing static statistics on it, and determines the slice fusion number m of the node feature vectors and the slice size of the weight matrix according to that sparsity.
Because the non-zero elements of the node feature vectors are irregularly distributed, two special situations can arise in this calculation scheme: 1) several data still need to be indexed in the last slice of an individual calculation unit, so that unit cannot finish within T clock cycles; 2) the weight tile buffer becomes full of data, its Full signal is raised, and the combined calculation unit sends a back pressure signal to the weight memory; transmission of weight matrix slice data and the operation of the system's subsequent modules are then suspended until the special situation in the current combined calculation unit is resolved, i.e., until the calculation task of the last weight matrix slice is fully completed or the Full signal of the weight tile buffer is lowered. Experiments show, however, that these special situations rarely occur, and in most cases the system pipeline is not stalled.
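A sketch of this system-level behaviour (illustrative Python reusing CombUnitModel from the sketch in embodiment 1; the stall policy shown, pausing the broadcast while any unit in the column reports Full, is our reading of the description):

```python
from collections import deque

def run_column(units, weight_column, slice_size):
    """Drive one column of combined calculation units: broadcast one
    weight slice per cycle, stalling the weight memory whenever any
    unit's weight tile buffer raises Full (the back pressure signal)."""
    pending = deque(
        weight_column[i:i + slice_size]
        for i in range(0, len(weight_column), slice_size)
    )
    while pending or any(u.tile_buf for u in units):
        if pending and not any(u.full for u in units):
            tile = pending.popleft()   # broadcast to the whole column
            for u in units:
                u.cycle(tile)
        else:                          # stall: units drain their buffers
            for u in units:
                u.cycle()
```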
FIG. 4 shows the utilization of the combination and aggregation calculation units for different slice sizes and merge counts m after the method is applied, and the performance improvement of accelerators using the method under different parameters.
The same or similar reference numerals correspond to the same or similar components;
the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (10)

1. An FPGA-based back pressure index type combined calculation unit, characterized in that the combined calculation unit is used to compute the combination stage of a graph neural network, one weight matrix slice is input into the combined calculation unit per clock cycle, and the combined calculation unit comprises an index buffer, a range buffer, a control unit, a weight tile buffer, a multiplier, an accumulator and m intermediate result registers, wherein:
the index buffer stores the encoded non-zero elements of the node feature matrix;
the range buffer determines the range of numbers covered by the weight matrix slice input in the current clock cycle;
the control unit judges whether the index number at the top of the index buffer falls within the number range of the range buffer; if so, it indexes out the corresponding weight datum in the weight matrix slice and sends it, together with the value of the corresponding non-zero element from the index buffer, into the multiplier, and the accumulator accumulates the multiplier result into the intermediate result register corresponding to the node number of the non-zero element;
the weight tile buffer stores weight matrix slices that cannot be fully processed in the current clock cycle; after the control unit completes its current task, it judges whether the slices in the weight tile buffer still contain data to be indexed, discards a slice if not, and otherwise indexes the data and sends it to the multiplier and accumulator for lookup and calculation, until the weight tile buffer is empty;
after all the weight matrix slices have been sent and processed, the data in the m intermediate result registers are the final result of the combination calculation.
2. The FPGA-based back pressure index type combined calculation unit according to claim 1, characterized in that the node feature matrix consists of an adjacency matrix and node feature vectors, where the adjacency matrix represents the connection relations among nodes and the node feature vectors represent the features of each node.
3. The FPGA-based back pressure index type combined calculation unit according to claim 2, characterized in that the node feature matrix needs to be sliced, and the slicing of the node feature vectors only needs to correspond to the slicing of the adjacency matrix, specifically:
the node feature vectors of the nodes involved in a given slice of the adjacency matrix are taken out as one node feature vector slice.
4. The FPGA-based back pressure index type combined calculation unit according to claim 3, characterized in that when the index buffer stores data, node feature vectors with the same number in different node-feature-vector slices are merged, and after merging the encoded non-zero elements are arranged in increasing order of their in-row element coordinates, where "the same number in different node-feature-vector slices" means the following:
once the slice size of the adjacency matrix is determined, every slice of node feature vectors contains the same number of nodes; assuming each slice contains N node feature vectors, the nodes within each slice are numbered 0 to N-1, and vectors carrying the same local number in different slices are called node feature vectors with the same number in different slices.
5. The FPGA-based back pressure index type combined calculation unit according to claim 4, characterized in that the non-zero elements in the node feature matrix are encoded as follows:
each non-zero element in the node feature matrix is represented by a triplet comprising a row, a column and a value, where the row represents the node number and the column represents the coordinate of the non-zero element within its row.
6. The FPGA-based back pressure index type combined calculation unit according to claim 5, characterized in that the number range of the range buffer is incremented every clock cycle by a value equal to the number of weight data in a weight matrix slice.
7. The FPGA-based back pressure index type combined calculation unit according to claim 6, characterized in that the control unit judges whether the index number at the top of the index buffer is within the number range of the range buffer as follows:
if the index value sent by the index buffer is smaller than the current value in the range buffer, it is determined that the current weight matrix slice contains data matching the index value.
8. An FPGA-based back pressure index type combined computing system, characterized in that the system comprises a preprocessing module, an array of combined calculation units and a plurality of weight memories, wherein:
the preprocessing module encodes and stores the non-zero elements of the node feature matrix;
the combined calculation unit array comprises M×N FPGA-based back pressure index type combined calculation units according to any one of claims 1 to 7, where the data formed by the non-zero elements of every m node feature vectors is input into one combined calculation unit, and each combined calculation unit is responsible for the combination calculation of the m node feature vectors assigned to it within its column;
the weight memories store the slices of the weight matrix; the slices are broadcast to every combined calculation unit in the same column of the combined calculation unit array, one weight matrix slice per clock cycle; the number of weight memories equals the number of columns of the weight matrix, and each weight memory stores the data of one column of the weight matrix.
9. The FPGA-based back pressure index combined computing system according to claim 8, characterized in that the preprocessing module also obtains the sparsity of the node feature matrix by performing static statistics on the node feature matrix, and determines the slice fusion number m of the node feature vectors and the slice size of the weight matrix according to that sparsity.
10. The FPGA-based back pressure index type combined computing system according to claim 9, characterized in that when the weight tile buffer is full of data, its Full signal is raised and the combined calculation unit sends a back pressure signal to the weight memory; transmission of weight matrix slice data and the operation of the system's subsequent modules are suspended until the special situation in the current combined calculation unit is resolved, i.e., until the calculation task of the last weight matrix slice is fully completed or the Full signal of the weight tile buffer is lowered.
CN202210482666.XA 2022-05-05 2022-05-05 Back pressure index type combined calculation unit based on FPGA Active CN114819127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210482666.XA CN114819127B (en) 2022-05-05 2022-05-05 Back pressure index type combined calculation unit based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210482666.XA CN114819127B (en) 2022-05-05 2022-05-05 Back pressure index type combined calculation unit based on FPGA

Publications (2)

Publication Number Publication Date
CN114819127A (en) 2022-07-29
CN114819127B (en) 2024-03-29

Family

ID=82510968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210482666.XA Active CN114819127B (en) 2022-05-05 2022-05-05 Back pressure index type combined calculation unit based on FPGA

Country Status (1)

Country Link
CN (1) CN114819127B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115982398B (en) * 2023-03-13 2023-05-16 苏州浪潮智能科技有限公司 Graph structure data processing method, system, computer device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704916A (en) * 2016-08-12 2018-02-16 北京深鉴科技有限公司 An FPGA-based hardware accelerator and method for implementing RNN neural networks
GB201808629D0 (en) * 2018-05-25 2018-07-11 Myrtle Software Ltd Processing matrix vector multiplication
CN113222133A (en) * 2021-05-24 2021-08-06 南京航空航天大学 FPGA-based compressed LSTM accelerator and acceleration method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704916A (en) * 2016-08-12 2018-02-16 北京深鉴科技有限公司 An FPGA-based hardware accelerator and method for implementing RNN neural networks
GB201808629D0 (en) * 2018-05-25 2018-07-11 Myrtle Software Ltd Processing matrix vector multiplication
CN113222133A (en) * 2021-05-24 2021-08-06 南京航空航天大学 FPGA-based compressed LSTM accelerator and acceleration method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An efficient FPGA-based sparse matrix multiplier; Liu Shipei; Jiang Xianyang; Xiao Peng; Wang Bo; Deng Yedong; Microelectronics; 2013-04-30 (No. 02); pp. 153-157 *

Also Published As

Publication number Publication date
CN114819127A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN108564168A (en) A kind of design method to supporting more precision convolutional neural networks processors
CN114819127B (en) Back pressure index type combined calculation unit based on FPGA
CN112286864B (en) Sparse data processing method and system for accelerating operation of reconfigurable processor
CN112734020B (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN110705703A (en) Sparse neural network processor based on systolic array
CN114491402A (en) Calculation method for sparse matrix vector multiplication access optimization
US20200356809A1 (en) Flexible pipelined backpropagation
CN114995782A (en) Data processing method, device, equipment and readable storage medium
CN111008691B (en) Convolutional neural network accelerator architecture with weight and activation value both binarized
CN110555512B (en) Data reuse method and device for binary convolution neural network
EP3842954A1 (en) System and method for configurable systolic array with partial read/write
CN111858721B (en) Distributed computing method based on priority coding
CN113918120A (en) Computing device, neural network processing apparatus, chip, and method of processing data
CN110766136B (en) Compression method of sparse matrix and vector
CN113158132A (en) Convolution neural network acceleration system based on unstructured sparsity
CN116167424B (en) CIM-based neural network accelerator, CIM-based neural network accelerator method, CIM-based neural network storage processing system and CIM-based neural network storage processing equipment
CN111626410B (en) Sparse convolutional neural network accelerator and calculation method
CN111522776B (en) Computing architecture
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
Wang et al. TB-DNN: A thin binarized deep neural network with high accuracy
Zeng et al. A Hybrid-Pipelined Architecture for FPGA-based Binary Weight DenseNet with High Performance-Efficiency
CN110807479A (en) Neural network convolution calculation acceleration method based on Kmeans algorithm
Yu et al. Data stream oriented fine-grained sparse CNN accelerator with efficient unstructured pruning strategy
CN111753974A (en) Neural network accelerator
CN111539460A (en) Image classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant