CN114781632A - Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine - Google Patents

Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine

Info

Publication number
CN114781632A
Authority
CN
China
Prior art keywords
tensor
unit
pulse
array
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210548997.9A
Other languages
Chinese (zh)
Inventor
利节
颜定江
董志诚
吴瑞
张渝楠
覃锐
黄晓薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Science and Technology
Original Assignee
Chongqing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Science and Technology filed Critical Chongqing University of Science and Technology
Priority to CN202210548997.9A
Publication of CN114781632A
Legal status: Pending

Images
Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to the technical field of application-specific chip architectures for accelerating neural network computation, and in particular discloses a deep neural network accelerator based on a dynamically reconfigurable pulse tensor operation engine. The accelerator adopts a dynamically reconfigurable pulse tensor array operation engine unit to realize high-throughput tensor multiplication; a weight parameter storage unit to store the weight tensors; a tensor sorting module to unpack sparse network weights and to sort and optimize network layer parameters for parallel pipelined operation; an activation value vector storage unit as an on-chip high-speed data staging area of the operation engine, which improves data reuse and reduces interaction with external memory; an accumulator and matrix transposition vector unit to accumulate and sum the calculation results and to rapidly process matrix transposition and tensor dimension transformation; and a scalar operation unit to realize the nonlinear function calculations of the network model. The accelerator has the advantages of high resource utilization, low energy consumption, low precision loss and high model operation speed.

Description

Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine
Technical Field
The invention relates to the technical field of application-specific chip architectures for accelerating neural network computation, and in particular to a deep neural network accelerator based on a dynamically reconfigurable pulse tensor operation engine.
Background
With the rapid development of Artificial Intelligence (AI) technology, the Deep Neural Network (DNN) has become one of the most popular AI techniques, owing to the excellent performance it shows on many problems in the AI field. This performance, however, comes at a cost: model parameters and computation grow exponentially, and networks become ever deeper and more complex. The tensor multiplication of network-layer weight information with feature map information cannot be executed efficiently by traditional processors, which exhibit low throughput and high power consumption on such operations.
Existing computing systems are mainly based on two-dimensional systolic (pulse) array operation engines. Although such engines can accelerate most matrix operations of DNN models, tensor multiplication must still be split into matrices, with matrix multiplication as the most basic operation unit; a single tensor multiplication is usually replaced by a large number of matrix multiplications, so such systems cannot efficiently execute large-scale DNN models. When a large-scale DNN model performs tensor operations, the main defects of the two-dimensional systolic array operation model are low operation throughput, high operation latency, high resource occupation and high energy consumption.
Disclosure of Invention
The invention provides a deep neural network accelerator based on a dynamically reconfigurable pulse tensor operation engine, which addresses the technical problem of how to improve the throughput of large-scale DNN model operation, reduce calculation latency, and reduce resource occupation and energy consumption, so as to effectively handle the larger and more complex DNN algorithm models of the future.
In order to solve the above technical problems, the present invention provides a deep neural network accelerator based on a dynamically reconfigurable pulse tensor operation engine, comprising: an HBM high-speed memory interface, a weight parameter storage unit, a tensor sorting module, an activation value vector storage unit, a dynamically reconfigurable pulse tensor array operation engine unit, an accumulator and matrix transposition vector unit, and a scalar operation unit;
the HBM high-speed memory interface is used for loading the weight tensors and feature map tensors of a deep neural network from the external DRAM memory into the weight parameter storage unit and the activation value vector storage unit respectively;
the weight parameter storage unit is used for sending its stored weight tensors, beat by beat and in parallel, to a plurality of tensor sorting units of the tensor sorting module, and the activation value vector storage unit is used for sending its stored feature map tensors, beat by beat and in parallel, to the dynamically reconfigurable pulse tensor array operation engine unit;
the tensor sorting units are used for sorting and optimizing the incoming weight tensors, decomposing each tensor into matrices, and then sending the weight data, beat by beat, into the dynamically reconfigurable pulse tensor array operation engine unit;
the dynamically reconfigurable pulse tensor array operation engine unit is used for dynamically reconfiguring a plurality of pulse tensor array operation engines according to the input weight tensors and feature map tensors, performing parallel multiply-accumulate operations on each engine, and outputting the corresponding multiply-accumulate results to the accumulator and matrix transposition vector unit;
the accumulator and matrix transposition vector unit is used for performing accumulation caching and asynchronous matrix transposition operation on the input multiplication and accumulation operation result and outputting the final operation result of the weight tensor and the feature map tensor to the scalar operation unit;
the scalar operation unit is used for performing the data operations required by the specific application scenario on the input final operation result, and outputting the corresponding result to the activation value vector storage unit, where it either serves as a new feature map tensor of the deep neural network in the next round of calculation or is taken directly as the final inference result of the deep neural network;
the HBM high-speed memory interface is also used for fetching a final inference result of the deep neural network from the activation value vector storage unit and transmitting the final inference result to the DRAM external storage.
Specifically, the dynamically reconfigurable pulse tensor array operation engine unit includes a plurality of pulse tensor operation engines, and the number and operation mode of each pulse tensor operation engine are dynamically determined by an external compiler; the input of each pulse tensor operation engine is a variable-dimension weight tensor and feature map tensor, the engines operate independently in a parallel pipeline, and the resulting plurality of operation results are finally input into the accumulator and matrix transposition vector unit.
Specifically, the number of pulse tensor operation engines is determined by the compiler according to the available FPGA hardware resources and the maximum number of operations in the model computation graph that can be executed in parallel; the operation modes of the pulse tensor operation engines comprise floating-point operation, fixed-point operation and shift operation, and the compiler selects the corresponding operation mode according to the quantization mode of the neural network model.
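By way of illustration only (this sketch and its names, such as QuantMode and select_engine_modes, are assumptions and not taken from the patent), the compiler-side mapping from a model's quantization mode to an engine operation mode could look like the following:

```python
# Hypothetical sketch: map each layer's quantization mode onto one of the
# three engine operation modes named in the text above.
from enum import Enum

class QuantMode(Enum):
    FLOAT32 = "float"        # unquantized model -> floating-point operation
    FIXED_POINT = "fixed"    # fixed-point quantization -> fixed-point operation
    POWER_OF_TWO = "po2"     # power-of-two quantization -> shift operation

def select_engine_modes(layer_quant_modes):
    """Return one engine operation mode per network layer."""
    mapping = {
        QuantMode.FLOAT32: "floating_point_multiply",
        QuantMode.FIXED_POINT: "fixed_point_multiply",
        QuantMode.POWER_OF_TWO: "shift_operation",
    }
    return [mapping[m] for m in layer_quant_modes]

# Example: a model mixing fixed-point and power-of-two quantized layers
print(select_engine_modes([QuantMode.FIXED_POINT, QuantMode.POWER_OF_TWO]))
```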
Specifically, each pulse tensor operation engine comprises an instruction controller, a weight parameter register set, a feature map register set, an operation array and an output buffer; the process by which each pulse tensor operation engine performs a tensor multiplication comprises the following steps:
1) the instruction controller sends a weight parameter register set index control signal to the weight parameter register set so as to load a weight tensor from the connected tensor sorting unit into the weight parameter register set;
2) the instruction controller sends a feature map register set index control signal to the feature map register set so as to load a feature map tensor from the activation value vector storage unit into the feature map register set;
3) the instruction controller sends an arithmetic unit control signal to the arithmetic array and starts the arithmetic array to complete tensor operation;
4) saving the calculation results of the operation array into the output buffer;
5) the output buffer outputs the calculation results of the operation array to the accumulator and matrix transpose vector unit.
Specifically, the operation array consists of A² operation array families arranged in an A×A array, each operation array family consists of B² arithmetic units arranged in a B×B array, and each arithmetic unit has C-row × C-column inputs and C-row × C-column outputs and completes the dot product of two C-dimensional vectors and outputs the result;
when a high-order matrix multiplication is performed, the pulse tensor operation engine first splits the input matrices into blocks according to the dimensions of the operation array, and then feeds each block matrix into the operation array families in a fixed clock-cycle order. Each operation array family computes part of the element dot products of the product matrix in a concurrent pipeline; intermediate results are temporarily stored in inter-family registers and passed, beat by beat, to the next operation array family on successive clock cycles. When the staged data reach the bottom of the operation array, the accumulator and tensor transposition unit at the bottom processes the results, performs the dimension transformation, and outputs the calculation results.
Specifically, the operation process of each operation array family is as follows:
all row data of matrix 1 are sent to the arithmetic units in the vertical direction, and all column data of matrix 2 are sent to the arithmetic units in the horizontal direction; the arithmetic units that receive the data pass the inputs onward, beat by beat, in both the vertical and horizontal directions. Each arithmetic unit completes the dot product of two C-dimensional vectors, i.e. produces one element of the result matrix D, so the B² arithmetic units compute every element of the result matrix D simultaneously within one clock period, and the results of the matrix operation are staged and output in the horizontal and vertical directions respectively.
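As an informal functional model of one operation array family (a sketch assuming B = 4 and C = 4; it ignores the beat-by-beat timing and only reproduces the arithmetic, with hypothetical names):

```python
# Functional sketch: a B x B family of units, each producing one element of
# D = M1 @ M2 as the dot product of a C-dimensional row of M1 with a
# C-dimensional column of M2.
def array_family_matmul(m1, m2, B=4, C=4):
    assert len(m1) == B and len(m1[0]) == C      # M1 is B x C
    assert len(m2) == C and len(m2[0]) == B      # M2 is C x B
    d = [[0] * B for _ in range(B)]
    for i in range(B):                           # unit row index
        for j in range(B):                       # unit column index
            # each unit computes a single C-dimensional dot product
            d[i][j] = sum(m1[i][k] * m2[k][j] for k in range(C))
    return d

m1 = [[1, 2, 3, 4]] * 4
m2 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(array_family_matmul(m1, m2))               # reproduces M1 @ M2
```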
Specifically, the scalar operation unit comprises an operation controller, a multi-port input/output register set, a general arithmetic logic operation unit, a data input buffer unit and a data output buffer unit; the data input buffer unit is connected with the accumulator and the matrix transposition vector unit, and the data output buffer unit is connected with the activation value vector storage unit;
the specific operation flow of the scalar operation unit comprises the following steps:
1) the operation controller controls the data input buffer unit to load input tensor data and nonlinear parameter data to the general arithmetic logic operation unit;
2) the operation controller controls and starts the general arithmetic logic operation unit to carry out nonlinear operation on input tensor data by adopting nonlinear parameter data;
3) the general arithmetic logic unit sends the result of the nonlinear operation to the data output buffer unit, and the multiport input and output register set is used as a memory of the general arithmetic logic unit to provide corresponding data storage and data reading operation.
Specifically, the accumulator and matrix transposition vector unit comprises a multi-port vector input interface, a multi-way parallel adder, a fast matrix transposition unit, a single data output interface and an operation type configuration parameter register. The multi-port vector input interface is connected to the output interface of the dynamically reconfigurable pulse tensor array operation engine unit and transfers the multiply-accumulate results to the multi-way parallel adder; the multi-way parallel adder accumulates the products and passes the accumulated results to the fast matrix transposition unit; the fast matrix transposition unit performs the dimension transformation to obtain a tensor matrix consistent with the dimensions of the network layer; the data output interface outputs the dimension-transformed tensor matrix to the scalar operation unit; and the operation type configuration parameter register configures the specific parameters and transformation type of the tensor dimension transformation.
The invention provides a deep neural network accelerator based on a dynamically reconfigurable pulse tensor operation engine. The dynamically reconfigurable pulse tensor array operation engine unit uses a cooperative compiler/hardware dynamically reconfigurable configuration: the compiler configures the number of engines and their calculation mode according to the specific scenario and parameters, and the pulse tensor operation array realizes high-throughput tensor multiplication. The output of a preceding-stage calculation engine serves as the input of the following-stage engine, so the data stream flows in sequence from the top layer to the end of the engine array; this avoids frequent memory accesses during operation and realizes uninterrupted, highly concurrent pipelined matrix operation. Three dynamically configurable arithmetic units are adopted, namely floating-point multiplication, dynamic fixed-point multiplication and shift operation, to reduce the dynamic power consumption of the hardware; replacing arithmetic multiplication with shift operations can greatly reduce operation latency and system power consumption. The weight parameter storage unit stores the weight tensors; the tensor sorting module unpacks sparse network weights and sorts and optimizes network layer parameters to enable parallel pipelined operation; the activation value vector storage unit serves as an on-chip high-speed data staging area of the neural network calculation engine, improving data reuse and reducing interaction with external memory; the accumulator and matrix transposition vector unit accumulates and sums the calculation results and rapidly processes matrix transposition and tensor dimension transformation; and the scalar operation unit realizes the nonlinear function calculations of the network model. Overall, the latency and energy consumption of tensor multiplication in a deep neural network are effectively reduced, the accelerator can be dynamically reconfigured online according to the specific application and the FPGA hardware parameter table, and the design has the advantages of high resource utilization, low energy consumption, low precision loss and high model operation speed.
Drawings
Fig. 1 is a block diagram of a deep neural network accelerator based on a dynamically reconfigurable pulse tensor operation engine according to an embodiment of the present invention;
FIG. 2 is a flowchart of the operation of a deep neural network accelerator based on a dynamic reconfigurable pulse tensor operation engine according to an embodiment of the present invention;
FIG. 3 is an architecture diagram of a pulse tensor array operation engine provided by an embodiment of the present invention;
FIG. 4 is an architecture diagram of a family of operation arrays provided by an embodiment of the present invention;
FIG. 5 is an architecture diagram of an operational array provided by an embodiment of the present invention;
fig. 6 is an architecture diagram of a scalar operation unit according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described in detail below with reference to the accompanying drawings. The embodiments and drawings are given solely for the purpose of illustration and are not to be construed as limitations of the invention; many variations are possible without departing from the spirit and scope of the invention.
The deep neural network accelerator based on the dynamically reconfigurable pulse tensor operation engine, as shown in fig. 1, includes an HBM high-speed memory interface, a weight parameter storage unit, a tensor sorting module, an activation value vector storage unit, a dynamically reconfigurable pulse tensor array operation engine unit, an accumulator and matrix transposition vector unit, and a scalar operation unit. The operation flow of the neural network model follows the flowchart shown in fig. 2:
the HBM high-speed memory interface is used for loading the weight tensors and feature map tensors of the deep neural network from the external DRAM memory into the weight parameter storage unit and the activation value vector storage unit respectively;
the weight parameter storage unit is used for sending its stored weight tensors, beat by beat and in parallel, to the plurality of tensor sorting units of the tensor sorting module, and the activation value vector storage unit is used for sending its stored feature map tensors, beat by beat and in parallel, to the dynamically reconfigurable pulse tensor array operation engine unit;
the plurality of tensor sorting units are used for sorting and optimizing the incoming weight tensors, decomposing each tensor into matrices, and then sending the weight data, beat by beat, into the dynamically reconfigurable pulse tensor array operation engine unit;
the dynamic reconfigurable pulse tensor array operation engine unit is used for dynamically reconfiguring a plurality of pulse tensor array operation engines according to the input weight tensor and the characteristic map tensor, respectively carrying out parallel multiply-accumulate operation on the pulse tensor array operation engines and outputting corresponding multiply-accumulate operation results to the accumulator and the matrix transposition vector unit;
the accumulator and matrix transposition vector unit is used for performing accumulation caching and asynchronous matrix transposition operation on the input multiplication and accumulation operation result and outputting the final operation result of the weight tensor and the feature map tensor to the scalar operation unit;
the scalar operation unit is used for performing data operation on the input final operation result in a specific application scene, and outputting the corresponding data operation result to the activated value vector storage unit to be used as a new feature map tensor of the deep neural network to participate in the next round of calculation or directly used as a final reasoning result of the deep neural network;
the HBM high-speed memory interface is also used for fetching the final inference result of the deep neural network from the activation value vector storage unit and transmitting the final inference result to the DRAM external memory.
As shown in fig. 1, the dynamically reconfigurable pulse tensor array operation engine unit includes a plurality of pulse tensor operation engines, whose number and operation mode are dynamically determined by an external compiler. The number of pulse tensor operation engines is determined by the compiler according to the available FPGA hardware resources and the maximum number of operations in the model computation graph that can be executed in parallel. The operation modes of the pulse tensor operation engines comprise floating-point operation, fixed-point operation and shift operation, and the compiler selects the corresponding operation mode according to the quantization mode of the neural network model: the model may use any one of fixed-point quantization (corresponding to fixed-point operation), power-of-two quantization (corresponding to shift operation) or floating-point computation, or a combination of two of these, such as fixed-point quantization together with power-of-two quantization. The input of each pulse tensor operation engine is a variable-dimension weight tensor and feature map tensor; the engines operate independently in a parallel pipeline, and their operation results are finally input into the accumulator and matrix transposition vector unit.
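A hypothetical sketch of how such a compiler decision could be bounded (the function name, the DSP figures and the per-engine cost are illustrative assumptions, not values from the patent):

```python
# Bound the number of pulse tensor operation engines by both the FPGA resource
# budget and the parallelism available in the model computation graph.
def plan_engine_count(available_dsp, dsp_per_engine, graph_max_parallel_ops):
    resource_limit = available_dsp // dsp_per_engine
    return max(1, min(resource_limit, graph_max_parallel_ops))

# e.g. 6840 DSP slices, an assumed 512 DSPs per engine, 8 parallel tensor ops
print(plan_engine_count(6840, 512, 8))   # resource limit 13, graph limit 8 -> 8
```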
Specifically, the compiler dynamically reconfigures the plurality of pulse tensor engines and the dimensions of the operation array according to the input weight tensors and feature map tensors, and the operation engines complete the tensor operations of the neural network model computation graph in parallel. When the operation array performs a three-dimensional convolution in the neural network, the high-dimensional feature map tensor is first decomposed into a plurality of two-dimensional matrices, the high-dimensional weight parameters (convolution kernels) are likewise decomposed into a plurality of two-dimensional matrices, and the operation array then completes the convolution of the input feature map using matrix multiplication.
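For illustration, a common way to realize this kind of decomposition is the im2col transformation shown in the following sketch; this is an assumption about one possible realization, not the patent's mandated method, and the function name is hypothetical:

```python
# Decompose a 2-D convolution into a matrix multiplication by unfolding the
# feature map into a matrix of sliding windows ("im2col").
import numpy as np

def conv2d_as_matmul(feature_map, kernel):
    H, W = feature_map.shape
    kh, kw = kernel.shape
    oh, ow = H - kh + 1, W - kw + 1
    # unfold the feature map into an (oh*ow) x (kh*kw) matrix
    cols = np.stack([feature_map[i:i + kh, j:j + kw].ravel()
                     for i in range(oh) for j in range(ow)])
    # the convolution is now a single matrix product with the flattened kernel
    return (cols @ kernel.ravel()).reshape(oh, ow)

fm = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3))
print(conv2d_as_matmul(fm, k))   # 2 x 2 map of 3x3 window sums
```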
As shown in fig. 3, the pulse tensor operation engine includes an instruction controller, a weight parameter register set, a feature map register set, an operation array and an output buffer. The instruction control signals comprise an arithmetic unit control signal, a weight parameter register set index control signal and a feature map register set index control signal. The tensor multiplication process of the pulse tensor operation engine is divided into the following steps: 1) the instruction controller sends a weight parameter register set index control signal to the weight parameter register set so as to load the weight tensor from the tensor sorting unit into the weight parameter register set; 2) the instruction controller sends a feature map register set index control signal to the feature map register set so as to load feature map data from the activation value vector storage unit into the feature map register set; 3) the instruction controller sends an arithmetic unit control signal to the operation array and starts it to complete the tensor operation (for example, a convolution operation on a tensor in the neural network); 4) the calculation results of the operation array are saved into the output buffer; 5) the output buffer outputs the calculation results of the operation array to the accumulator and matrix transposition vector unit.
As shown in FIG. 4, the operation array of the pulse tensor operation engine consists of A² operation array families arranged in an A×A array, and each operation array family consists of B² arithmetic units arranged in a B×B array. Each arithmetic unit is a DP4 module in FIG. 4; it has C-row × C-column inputs and C-row × C-column outputs and completes the dot product of two C-dimensional vectors and outputs the result. FIG. 4 uses A = 2, B = 4 and C = 4 as an example.
When a high-order matrix multiplication is performed, the pulse tensor operation engine first splits the input matrices into blocks according to the dimensions of the operation array, and then feeds each block matrix into the operation array families in a fixed clock-cycle order. Each operation array family computes part of the element dot products of the product matrix in a concurrent pipeline; intermediate results are temporarily stored in inter-family registers and passed, beat by beat, to the next operation array family on successive clock cycles. When the staged data reach the bottom of the operation array, the accumulator and tensor transposition unit at the bottom processes the results, performs the dimension transformation and outputs the calculation results, thereby completing the accelerated multiplication of two high-order matrices. The data flow of the operation array is shown on the left of fig. 4.
As shown in fig. 4, the DP4 module includes a floating-point multiply module (float module), a fixed-point multiply module (fix module), a shift module (shift module) and a result register (ACC module). The weight tensor inputs are W0, W1, W2 and W3, and the feature map inputs are A0, A1, A2 and A3. The instruction controller issues an operation mode selection signal that controls which type of multiply module the DP4 module executes, a data path selection signal that selects between outputting the bypass input data and the result of the multiplier operation, and a bypass input selection signal that controls the source of the DP4 module's bypass input data.
As shown in fig. 5, taking B = 4 and C = 4 as an example, one operation array family of the pulse tensor operation engine includes 4 × 4 DP4 modules (arithmetic units) that perform the multiplication of two 4 × 4 matrices in one clock cycle. The specific process is as follows: all row data of matrix 1 are sent simultaneously to the DP4 modules in the vertical direction, and all column data of matrix 2 are sent simultaneously to the DP4 modules in the horizontal direction; each DP4 module that receives data passes the inputs onward, beat by beat, in both the vertical and horizontal directions. Each DP4 module completes the dot product of two 4-dimensional vectors, i.e. one element of the result matrix D, so the 16 DP4 modules obtain every element of the result matrix D simultaneously within one clock cycle, and the results of the matrix operation are staged and output in the horizontal and vertical directions.
The accelerator adopts a dynamically reconfigurable pulse tensor array hybrid calculation engine (the dynamically reconfigurable pulse tensor operation engine) with a cooperative compiler (software) and hardware reconfigurable design: the compiler configures the number of pulse operation arrays and their calculation mode according to the specific application. The pulse tensor operation arrays realize a dataflow-based form of computation in which the output of an upper array serves as the input of the following array and the data stream flows in sequence from the top layer to the lower end of the operation arrays; this avoids frequent memory accesses during operation, realizes an uninterrupted, highly concurrent computation pipeline and reduces operation power consumption. The accelerator adopts three dynamically configurable matrix multiplication units, as shown on the right side of fig. 4: floating-point multiplication (the float module in the figure), fixed-point multiplication (the fix module) and shift operation (the shift module), to reduce the dynamic power consumption of the hardware; in particular, replacing floating-point and fixed-point operations with shift operations can greatly reduce operation latency and system power consumption.
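The following sketch is a purely functional stand-in for a DP4-style unit with selectable floating-point, fixed-point and shift multiplication; the mode names, the fractional-bit parameter and the power-of-two weight encoding are illustrative assumptions, not details taken from the patent:

```python
# Functional model of a 4-element dot-product unit whose multiply can be a
# floating-point multiply, a fixed-point multiply, or a shift (when weights
# are quantized to powers of two and stored as exponents).
def dp4(weights, activations, mode="float", frac_bits=8):
    acc = 0
    for w, a in zip(weights, activations):
        if mode == "float":
            acc += w * a
        elif mode == "fixed":
            # inputs assumed already in fixed point with `frac_bits` fractional bits
            acc += (w * a) >> frac_bits
        elif mode == "shift":
            # w assumed to hold the exponent of a power-of-two weight
            acc += a << w if w >= 0 else a >> -w
    return acc

print(dp4([0.5, 1.0, 2.0, 4.0], [1.0, 1.0, 1.0, 1.0], mode="float"))  # 7.5
print(dp4([1, 2, 0, 3], [1, 1, 1, 1], mode="shift"))                   # 2+4+1+8 = 15
```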
As shown in fig. 6, the scalar operation unit of this embodiment includes an operation controller, a multi-port input/output register set, a general arithmetic logic operation unit, a data input buffer unit and a data output buffer unit. The data input buffer unit is connected with the accumulator and matrix transposition vector unit, and the data output buffer unit is connected with the activation value vector storage unit. The specific operation process comprises the following steps: 1) the operation controller controls the data input buffer unit to load the input tensor data and the nonlinear parameter data into the general arithmetic logic operation unit; 2) the operation controller starts the general arithmetic logic operation unit, which applies the nonlinear parameter data to perform nonlinear operations on the input tensor data, such as the ReLU activation function and batch normalization of the activation values; 3) the general arithmetic logic operation unit sends the result of the nonlinear operation to the data output buffer unit. The multi-port input/output register set serves as the memory of the general arithmetic logic operation unit and provides the corresponding data saving and data reading operations.
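A minimal sketch of the kind of element-wise post-processing described above; the parameter names and the order of ReLU and batch normalization are assumptions for illustration:

```python
# Element-wise nonlinear work of the scalar operation unit: ReLU followed by
# batch normalization with per-channel parameters gamma, beta, mean, var.
import numpy as np

def scalar_unit_postprocess(x, gamma, beta, mean, var, eps=1e-5):
    x = np.maximum(x, 0.0)                                    # ReLU activation
    return gamma * (x - mean) / np.sqrt(var + eps) + beta     # batch normalization

x = np.array([-1.0, 0.5, 2.0])
print(scalar_unit_postprocess(x, gamma=1.0, beta=0.0, mean=0.5, var=1.0))
```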
The HBM high-speed memory interface of the accelerator design of the embodiment adopts a third generation high-speed memory interface (HBM 3). The HBM3 includes 8 independent memory data access channel buses, wherein the 8 memory data access channel buses are connected to 8 independent storage areas of the DRAM external memory, respectively. The HBM3 provides 4 independent channels to the weight parameter storage unit and the activation value vector storage unit, respectively, each channel including an independent 32-bit address bus and a 128-bit bidirectional data bus for connecting signal lines corresponding to the respective channels of the weight parameter storage unit and the activation value vector storage unit.
Further, the weight parameter storage unit of this accelerator design uses an on-chip SRAM cache, operated with a ping-pong mechanism, to store the weight tensors. The input interface of the weight parameter SRAM cache is connected with a data read/write channel of the HBM3 and is used for loading the weight data of the network model from the external DRAM memory. The output interface of the weight parameter SRAM cache is connected to the tensor sorting module of the next stage, which optimizes and sorts the tensor data so as to accelerate the systolic array operation. The handshake and control signals of the weight parameter SRAM cache are connected to the status request interface of the HBM3 and are used for sending data-transfer interrupt signals to the external master CPU chip. The input-valid control signal of the weight parameter SRAM cache is connected to the valid status request interface of the HBM3 and is used for notifying the external master CPU to start a data transfer. The weight information output status flag of the weight parameter SRAM cache is connected to the data read indication signal line of the tensor sorting module, and the tensor sorting module accesses the weight information in the weight parameter SRAM cache through the on-chip bus.
The activation value vector storage unit of this accelerator design adopts an on-chip activation value SRAM cache as the on-chip high-speed data staging area of the dynamically reconfigurable pulse tensor array operation engine unit, so as to improve data reuse and reduce interaction with external memory. The on-chip activation value SRAM cache specifically comprises a multi-port data output interface, a single data input interface, a single bidirectional data transmission bus interface, a data register and an address register. The multi-port data output interface is connected with the dynamically reconfigurable pulse tensor array operation engine unit and is used for transmitting the feature map tensors to the feature map register set of each pulse tensor operation engine. The data input interface is connected with the output port of the accumulator and matrix transposition vector unit and is used for transmitting the operation result data of the scalar operation unit to the activation value vector storage unit. The bidirectional data transmission bus interface is connected with the data transmission bus of the HBM high-speed memory interface and is used for data transmission between the external DRAM memory and the activation value vector storage unit, for example transferring the model's prediction input data.
The accelerator of the embodiment adopts an accumulator and a matrix transposition vector unit to complete accumulated summation of calculation results, fast processing matrix transposition and tensor dimension transformation. The accumulator and matrix transposition vector unit comprises a multi-port vector input interface, a multi-path parallel adder, a fast matrix transposition unit, a path of data output interface and an operation type configuration parameter register; the multi-port vector input interface is connected with an output interface of the dynamic reconfigurable pulse tensor array operation engine unit and used for transmitting multiplication and accumulation operation results to the multi-path parallel adder, the multi-path parallel adder accumulates the multiplication results and transmits the accumulation results to the fast matrix transposition unit, the fast matrix transposition unit conducts dimension transformation to obtain a tensor matrix consistent with the network layer dimension, the data output interface outputs the tensor matrix subjected to dimension transformation to the scalar operation unit, and the operation type configuration parameter register is used for configuring specific parameter information and transformation types of tensor dimension transformation during dimension transformation, such as transposition and channel turnover.
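As an illustration (assumed and not cycle-accurate; the function and transform names are hypothetical), the accumulate-then-transform behaviour of this unit can be modelled as:

```python
# Accumulate the partial results delivered by several engines, then apply the
# configured dimension transform (e.g. transpose or channel flip) before the
# tensor is handed to the scalar operation unit.
import numpy as np

def accumulate_and_transform(partial_results, transform="transpose"):
    acc = np.sum(np.stack(partial_results), axis=0)    # multi-way parallel add
    if transform == "transpose":
        return acc.T
    if transform == "channel_flip":
        return acc[::-1, ...]                           # flip along the first axis
    return acc

parts = [np.ones((2, 3)), 2 * np.ones((2, 3))]
print(accumulate_and_transform(parts, transform="transpose"))   # 3 x 2 matrix of 3.0
```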
The deep neural network accelerator based on the dynamically reconfigurable pulse tensor operation engine provided by this embodiment of the invention adopts a dynamically reconfigurable pulse tensor array operation engine unit configured cooperatively by compiler and hardware: the compiler configures the number of engines and their calculation mode according to the specific scenario and parameters, and the pulse tensor operation array realizes high-throughput tensor multiplication. The output of a preceding-stage calculation engine serves as the input of the following-stage engine, so the data stream flows in sequence from the top layer to the end of the engine array, avoiding frequent memory accesses during operation and realizing uninterrupted, highly concurrent pipelined matrix operation. Three dynamically configurable arithmetic units are adopted, namely floating-point multiplication, dynamic fixed-point multiplication and shift operation, to reduce the dynamic power consumption of the hardware; replacing arithmetic multiplication with shift operations greatly reduces operation latency and system power consumption. The weight parameter storage unit stores the weight tensors; the tensor sorting module unpacks sparse network weights and sorts and optimizes network layer parameters for parallel pipelined operation; the activation value vector storage unit serves as the on-chip high-speed data staging area of the neural network calculation engine, improving data reuse and reducing interaction with external memory; and the accumulator and matrix transposition vector unit accumulates and sums the calculation results. The accelerator therefore has the advantages of high resource utilization, low energy consumption, low precision loss and high model operation speed.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (8)

1. A deep neural network accelerator based on a dynamically reconfigurable pulse tensor operation engine, characterized by comprising: an HBM high-speed memory interface, a weight parameter storage unit, a tensor sorting module, an activation value vector storage unit, a dynamically reconfigurable pulse tensor array operation engine unit, an accumulator and matrix transposition vector unit, and a scalar operation unit;
the HBM high-speed memory interface is used for loading the weight tensors and feature map tensors of a deep neural network from the external DRAM memory into the weight parameter storage unit and the activation value vector storage unit respectively;
the weight parameter storage unit is used for sending its stored weight tensors, beat by beat and in parallel, to a plurality of tensor sorting units of the tensor sorting module, and the activation value vector storage unit is used for sending its stored feature map tensors, beat by beat and in parallel, to the dynamically reconfigurable pulse tensor array operation engine unit;
the tensor sorting units are used for sorting and optimizing the incoming weight tensors, decomposing each tensor into matrices, and then sending the weight data, beat by beat, into the dynamically reconfigurable pulse tensor array operation engine unit;
the dynamic reconfigurable pulse tensor array operation engine unit is used for dynamically reconfiguring a plurality of pulse tensor array operation engines according to the input weight tensor and the feature map tensor, respectively carrying out parallel multiply-accumulate operation on the pulse tensor array operation engines and outputting corresponding multiply-accumulate operation results to the accumulator and the matrix transposition vector unit;
the accumulator and matrix transposition vector unit is used for performing accumulation caching and asynchronous matrix transposition operation on the input multiplication and accumulation operation result and outputting the final operation result of the weight tensor and the feature map tensor to the scalar operation unit;
the scalar operation unit is used for performing data operation on the input final operation result in a specific application scene, outputting the corresponding data operation result to the activation value vector storage unit to be used as a new feature map tensor of the deep neural network to participate in the next round of calculation or directly used as a final reasoning result of the deep neural network;
the HBM high-speed memory interface is also used for fetching a final inference result of the deep neural network from the activation value vector storage unit and transmitting the final inference result to the DRAM external storage.
2. The deep neural network accelerator based on the dynamically reconfigurable pulse tensor operation engine as claimed in claim 1, wherein the dynamically reconfigurable pulse tensor array operation engine unit comprises a plurality of pulse tensor operation engines, and the number and operation mode of each pulse tensor operation engine are dynamically determined by an external compiler; the input of each pulse tensor operation engine is a variable-dimension weight tensor and feature map tensor, the engines operate independently in a parallel pipeline, and the resulting plurality of operation results are finally input into the accumulator and matrix transposition vector unit.
3. The deep neural network accelerator based on the dynamically reconfigurable pulse tensor operation engine according to claim 2, wherein the number of the pulse tensor operation engines is determined by the compiler according to available resources of FPGA hardware and the maximum parallel executable number of a model calculation graph; the operation modes of the pulse tensor operation engine comprise floating point operation, fixed point operation and shift operation, and the compiler determines and selects the corresponding operation mode according to the quantization mode of the neural network model.
4. The deep neural network accelerator based on the dynamic reconfigurable pulse tensor operation engine of claim 2, wherein: each pulse tensor operation engine comprises an instruction controller, a weight parameter register group, an eigen map register group, an operation array and an output buffer; the process of carrying out tensor multiplication operation by each pulse tensor operation engine comprises the following steps:
1) the instruction controller sends a weight parameter register group index control signal to the weight parameter register group so as to call a weight tensor into the weight parameter register group from the connected tensor sorting unit;
2) the instruction controller sends an eigen map register group index control signal to the eigen map register group so as to call an eigen map tensor into the eigen map register group from the activated value vector storage unit;
3) the instruction controller sends an arithmetic unit control signal to the arithmetic array and starts the arithmetic array to complete tensor operation;
4) saving the calculation results of the operation array to the output buffer;
5) the output buffer outputs the calculation results of the operation array to the accumulator and matrix transpose vector unit.
5. The deep neural network accelerator based on the dynamically reconfigurable pulse tensor operation engine of claim 4, wherein: the operation array consists of A² operation array families arranged in an A×A array, each operation array family consists of B² arithmetic units arranged in a B×B array, and each arithmetic unit has C-row × C-column inputs and C-row × C-column outputs and completes the dot product of two C-dimensional vectors and outputs the result;
when high-order matrix multiplication is carried out, the pulse tensor operation engine firstly carries out block splitting on an input matrix according to the dimension of an operation array, then each block matrix is sent into each operation array group according to a certain clock cycle sequence, each operation array group calculates partial element dot products of the product matrix in a pipeline concurrent mode, calculation results are temporarily stored by a register among groups, meanwhile, temporary storage results are sent into the next operation array group in a beating mode in different clock cycles, when temporary storage data are sent into the bottom end of the operation array in a beating mode, the accumulator and the transposition tensor unit at the bottom end process the operation results and carry out dimension conversion, and calculation results are output.
6. The deep neural network accelerator based on the dynamically reconfigurable pulse tensor operation engine as claimed in claim 5, wherein the operation process of each operation array family is as follows:
all row data of matrix 1 are sent to the arithmetic units in the vertical direction, and all column data of matrix 2 are sent to the arithmetic units in the horizontal direction; the arithmetic units that receive the data pass the inputs onward, beat by beat, in both the vertical and horizontal directions; each arithmetic unit completes the dot product of two C-dimensional vectors, namely one element of the result matrix D, so that the B² arithmetic units calculate each element of the result matrix D simultaneously in one clock period, and the results of the matrix operation are staged and output in the horizontal and vertical directions respectively.
7. The deep neural network accelerator based on the dynamic reconfigurable pulse tensor operation engine of claim 1, wherein the scalar operation unit comprises an operation controller, a multi-port input-output register set, a general arithmetic logic operation unit, a data input buffer unit and a data output buffer unit; the data input buffer unit is connected with the accumulator and the matrix transposition vector unit, and the data output buffer unit is connected with the activation value vector storage unit;
the specific operation flow of the scalar operation unit comprises the following steps:
1) the operation controller controls the data input buffer unit to load input tensor data and nonlinear parameter data to the general arithmetic logic operation unit;
2) the operation controller controls and starts the general arithmetic logic operation unit to carry out nonlinear operation on input tensor data by adopting nonlinear parameter data;
3) the general arithmetic logic unit sends the result of the nonlinear operation to the data output buffer unit, and the multi-port input/output register set is used as a memory of the general arithmetic logic unit to provide corresponding data saving and data reading operations.
8. The deep neural network accelerator based on the dynamically reconfigurable pulse tensor operation engine as claimed in claim 1, wherein: the accumulator and matrix transposition vector unit comprises a multi-port vector input interface, a multi-path parallel adder, a fast matrix transposition unit, a path of data output interface and an operation type configuration parameter register; the multiport vector input interface is connected with an output interface of the dynamic reconfigurable pulse tensor array operation engine unit and used for transmitting multiplication and accumulation operation results to the multipath parallel adder, the multipath parallel adder accumulates the multiplication and transmits the accumulation results to the fast matrix transposition unit, the fast matrix transposition unit performs dimension transformation to obtain a tensor matrix consistent with the network layer dimension, the data output interface outputs the tensor matrix subjected to dimension transformation to the scalar operation unit, and the operation type configuration parameter register is used for configuring specific parameter information and transformation types of tensor dimension transformation during dimension transformation.
CN202210548997.9A 2022-05-20 2022-05-20 Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine Pending CN114781632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210548997.9A CN114781632A (en) 2022-05-20 2022-05-20 Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210548997.9A CN114781632A (en) 2022-05-20 2022-05-20 Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine

Publications (1)

Publication Number Publication Date
CN114781632A true CN114781632A (en) 2022-07-22

Family

ID=82409195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210548997.9A Pending CN114781632A (en) 2022-05-20 2022-05-20 Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine

Country Status (1)

Country Link
CN (1) CN114781632A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599442A (en) * 2022-12-14 2023-01-13 成都登临科技有限公司(Cn) AI chip, electronic equipment and tensor processing method
CN115794411A (en) * 2022-12-27 2023-03-14 阿里巴巴(中国)有限公司 Data processing system, method and storage medium for model
CN116862019A (en) * 2023-07-06 2023-10-10 清华大学 Model training method and device based on data parallel paradigm
CN117574976A (en) * 2024-01-16 2024-02-20 北京大学 Large language model software and hardware collaborative quantization acceleration calculation method and system
TWI835244B (en) * 2022-08-16 2024-03-11 聯陽半導體股份有限公司 Computing device, operation method of computing device and system on chip

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8825730B1 (en) * 2011-10-04 2014-09-02 Altera Corporation Matrix decomposition using dataflow techniques
US20170316312A1 (en) * 2016-05-02 2017-11-02 Cavium, Inc. Systems and methods for deep learning processor
US20180204118A1 (en) * 2017-01-18 2018-07-19 Hitachi, Ltd. Calculation System and Calculation Method of Neural Network
CN109472356A (en) * 2018-12-29 2019-03-15 南京宁麒智能计算芯片研究院有限公司 A kind of accelerator and method of restructural neural network algorithm
CN110390383A (en) * 2019-06-25 2019-10-29 东南大学 A kind of deep neural network hardware accelerator based on power exponent quantization
CN110716751A (en) * 2018-07-12 2020-01-21 赛灵思公司 High-parallelism computing platform, system and computing implementation method
US20200097818A1 (en) * 2018-09-26 2020-03-26 Xinlin LI Method and system for training binary quantized weight and activation function for deep neural networks
US20200193274A1 (en) * 2018-12-18 2020-06-18 Microsoft Technology Licensing, Llc Training neural network accelerators using mixed precision data formats
WO2020258528A1 (en) * 2019-06-25 2020-12-30 东南大学 Configurable universal convolutional neural network accelerator
CN112639839A (en) * 2020-05-22 2021-04-09 深圳市大疆创新科技有限公司 Arithmetic device of neural network and control method thereof
CN113159285A (en) * 2021-04-14 2021-07-23 广州放芯科技有限公司 Neural network accelerator

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8825730B1 (en) * 2011-10-04 2014-09-02 Altera Corporation Matrix decomposition using dataflow techniques
US20170316312A1 (en) * 2016-05-02 2017-11-02 Cavium, Inc. Systems and methods for deep learning processor
US20180204118A1 (en) * 2017-01-18 2018-07-19 Hitachi, Ltd. Calculation System and Calculation Method of Neural Network
CN110716751A (en) * 2018-07-12 2020-01-21 赛灵思公司 High-parallelism computing platform, system and computing implementation method
US20200097818A1 (en) * 2018-09-26 2020-03-26 Xinlin LI Method and system for training binary quantized weight and activation function for deep neural networks
US20200193274A1 (en) * 2018-12-18 2020-06-18 Microsoft Technology Licensing, Llc Training neural network accelerators using mixed precision data formats
CN109472356A (en) * 2018-12-29 2019-03-15 南京宁麒智能计算芯片研究院有限公司 A kind of accelerator and method of restructural neural network algorithm
CN110390383A (en) * 2019-06-25 2019-10-29 东南大学 A kind of deep neural network hardware accelerator based on power exponent quantization
WO2020258528A1 (en) * 2019-06-25 2020-12-30 东南大学 Configurable universal convolutional neural network accelerator
CN112639839A (en) * 2020-05-22 2021-04-09 深圳市大疆创新科技有限公司 Arithmetic device of neural network and control method thereof
WO2021232422A1 (en) * 2020-05-22 2021-11-25 深圳市大疆创新科技有限公司 Neural network arithmetic device and control method thereof
CN113159285A (en) * 2021-04-14 2021-07-23 广州放芯科技有限公司 Neural network accelerator

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tu Kaijie et al.: "A Fully Convolutional Neural Network Accelerator with Bidirectional Systolic Dataflow", Microelectronics & Computer (《微电子学与计算机》), vol. 37, no. 01, 31 January 2020 (2020-01-31), pages 33-37 *
Chen Yiran et al.: "A Survey of Accelerator Architectures for Deep Neural Networks", Engineering (《ENGINEERING》), vol. 06, no. 03, 15 March 2020 (2020-03-15), pages 131-154 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI835244B (en) * 2022-08-16 2024-03-11 聯陽半導體股份有限公司 Computing device, operation method of computing device and system on chip
CN115599442A (en) * 2022-12-14 2023-01-13 成都登临科技有限公司(Cn) AI chip, electronic equipment and tensor processing method
CN115794411A (en) * 2022-12-27 2023-03-14 阿里巴巴(中国)有限公司 Data processing system, method and storage medium for model
CN115794411B (en) * 2022-12-27 2023-05-30 阿里巴巴(中国)有限公司 Model data processing system, method and storage medium
CN116862019A (en) * 2023-07-06 2023-10-10 清华大学 Model training method and device based on data parallel paradigm
CN116862019B (en) * 2023-07-06 2024-03-19 清华大学 Model training method and device based on data parallel paradigm
CN117574976A (en) * 2024-01-16 2024-02-20 北京大学 Large language model software and hardware collaborative quantization acceleration calculation method and system
CN117574976B (en) * 2024-01-16 2024-04-30 北京大学 Large language model software and hardware collaborative quantization acceleration calculation method and system

Similar Documents

Publication Publication Date Title
CN114781632A (en) Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN110083390B (en) GEMV operation method and device
CN111325321B (en) Brain-like computing system based on multi-neural network fusion and execution method of instruction set
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN109409510B (en) Neuron circuit, chip, system and method thereof, and storage medium
CN116562349A (en) Multifunctional unit for programmable hardware nodes for neural network processing
CN108170640B (en) Neural network operation device and operation method using same
Sun et al. A high-performance accelerator for large-scale convolutional neural networks
CN111858465A (en) Large-scale matrix QR decomposition parallel computing structure
CN114356836A (en) RISC-V based three-dimensional interconnected many-core processor architecture and working method thereof
Asgari et al. Meissa: Multiplying matrices efficiently in a scalable systolic architecture
CN109615061B (en) Convolution operation method and device
US11934482B2 (en) Computational memory
US11256503B2 (en) Computational memory
CN114595813A (en) Heterogeneous acceleration processor and data calculation method
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN114881217A (en) General convolutional neural network accelerator based on FPGA and system thereof
CN109343826B (en) Reconfigurable processor operation unit for deep learning
Qiu et al. An FPGA‐Based Convolutional Neural Network Coprocessor
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN113988280B (en) Array computing accelerator architecture based on binarization neural network
Hazarika et al. Hardware efficient convolution processing unit for deep neural networks
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Country or region after: China

Address after: No. 20, East Road, University City, Chongqing, Shapingba District, Chongqing

Applicant after: Chongqing University of science and technology

Address before: No. 20, East Road, University City, Chongqing, Shapingba District, Chongqing

Applicant before: Chongqing University of Science & Technology

Country or region before: China