CN115222028A - One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method - Google Patents

One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method

Info

Publication number: CN115222028A
Application number: CN202210804166.3A
Authority: CN (China)
Prior art keywords: instruction, neural network, memory, result, LSTM neural
Legal status: Pending
Other languages: Chinese (zh)
Inventors: Wu Bin (武斌), Chen Xuwei (陈旭伟), Li Peng (李鹏), Zhang Kui (张葵), Wang Zhao (王钊), Yuan Shibo (袁士博)
Original Assignee: Xidian University (current assignee: Xidian University)
Priority/filing date: 2022-07-07
Publication date: 2022-10-21
Application filed by Xidian University

Classifications

    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/061: Physical realisation using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G06N3/08: Learning methods

Abstract

The invention discloses an FPGA-based one-dimensional CNN-LSTM acceleration platform and an implementation method. The platform comprises a general-purpose CPU and an FPGA. The general-purpose CPU parses the one-dimensional CNN-LSTM neural network model, generates an instruction sequence and loads it into the instruction memory at the FPGA end, and loads the quantized model parameters into the data memory at the FPGA end. The controller at the FPGA end reads the operation instructions from the instruction memory and controls the arithmetic unit to complete the corresponding operations; after all operation instructions are completed, the final operation result is written into the result memory at the FPGA end for the CPU to read. By multiplexing the same multiply-add array, the invention accelerates the two operations of one-dimensional convolution and matrix multiplication in parallel, solves the problem that a single acceleration scheme in the prior art cannot support the computation of a CNN-LSTM neural network model, and greatly improves the utilization of FPGA computing resources.

Description

One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
Technical Field
The invention belongs to the field of heterogeneous computing, and further relates to a one-dimensional CNN-LSTM neural network acceleration platform based on a Field Programmable Gate Array (FPGA) and an implementation method in the technical field of deep learning computation acceleration.
Background
As the problems addressed by deep learning grow more complex and abstract, deep learning algorithms place ever higher demands on device computing power, which a general-purpose CPU (Central Processing Unit) can no longer satisfy. To meet this demand, hardware such as a GPU (Graphics Processing Unit), an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array) is commonly used to provide computing support. The GPU, with its powerful parallel processing capability, is widely applied to the training and inference of deep learning models, but its power consumption has always limited its use in mobile terminals and portable devices. An ASIC generally adopts a customized computing architecture to accelerate a specific task and usually achieves a high energy-efficiency ratio, but its long design and development cycle and high design difficulty restrict its application. The FPGA, owing to its rich parallel computing resources, low cost, low power consumption and programmability, is increasingly applied to deep learning computation acceleration, especially at the device end; however, deploying a deep learning algorithm on an FPGA generally requires developers with strong software and hardware skills, which greatly hinders the use of FPGAs for this purpose. Although there is already much research on FPGA-based acceleration of convolutional or recurrent neural networks, new deep learning models keep emerging as artificial intelligence develops. Among them, the CNN-LSTM neural network model, which combines a Convolutional Neural Network (CNN) with a Long Short-Term Memory network (LSTM), extracts local features by convolution and then performs feature synthesis with the LSTM, and shows good performance on problems related to time series; a single FPGA acceleration scheme designed only for CNN or only for LSTM is therefore no longer applicable.
The Suzhou Institute of the University of Science and Technology of China discloses an FPGA-based deep neural network acceleration platform in the patent document "FPGA-based deep neural network acceleration platform" (application number CN201810010938.X, granted publication number CN108229670B). The acceleration platform comprises a general-purpose processor, an FPGA and a DRAM (Dynamic Random Access Memory). The general-purpose processor parses the neural network configuration information and weight data and writes the configuration information, weight data and image data to be processed into the DRAM; the FPGA then completes CNN computation acceleration according to the configuration information in the DRAM through designed convolutional-layer, pooling-layer, fully connected-layer and activation-layer IP (Intellectual Property) cores and writes the computation result into the DRAM, from which the result is finally read. Although this acceleration platform simplifies the algorithm deployment process, its convolutional-layer IP core cannot output matrix multiplication results and has no matrix multiplication function, so it can only accelerate CNN computation and cannot accelerate the matrix computation in an LSTM.
In his master's thesis "Design and implementation of an FPGA-based deep learning computing platform" (Beijing University of Posts and Telecommunications, 2020), He Junhua discloses a method for accelerating CNN and LSTM neural network computation. The method uses the Winograd algorithm to realize fast convolution, uses the systolic-array idea to realize fast matrix multiplication, and replaces floating-point operations with fixed-point operations and nonlinear-function look-up tables to optimize FPGA hardware resource occupation and system latency. However, the convolution module realized with the Winograd algorithm and the matrix multiplication module realized with the systolic array have very different computing structures and cannot share computing resources, so deploying both acceleration schemes on the FPGA at the same time leads to insufficient computing resources and cannot satisfy the computation acceleration of a CNN-LSTM neural network model.
Disclosure of Invention
The purpose of the invention is to provide, in view of the defects of the prior art, an FPGA-based one-dimensional CNN-LSTM resource-multiplexing computation acceleration platform and an implementation method, so as to solve the problems that an existing single FPGA-based acceleration scheme cannot support the acceleration of both convolution and matrix multiplication, and that deploying two separate neural network acceleration schemes on the FPGA at the same time leads to insufficient computing resources.
In order to achieve the above purpose, the idea of the invention is to construct an arithmetic unit module on the FPGA consisting of a linear operation unit and a nonlinear operation unit. When a vector-matrix multiplication in the LSTM neural network needs to be completed, the vector to be computed is used as the input of a one-dimensional convolution, the matrix to be computed is partitioned by rows and loaded into the multiply-add array as multiple convolution kernels of that one-dimensional convolution, the operation is then carried out exactly as a one-dimensional convolution, and the matrix multiplication results are finally extracted from the ping-pong result cache. The nonlinear operation unit completes the gating-coefficient activation and neuron state update of the LSTM neural network by looking up nonlinear function tables. The linear operation unit and the nonlinear operation unit cooperate to complete the computation acceleration of the one-dimensional CNN-LSTM neural network model. Because the one-dimensional convolution and the matrix multiplication multiplex the same multiply-add array in the linear operation unit, the invention greatly improves the utilization of FPGA computing resources and solves the problems that a single acceleration scheme cannot accelerate both convolution and matrix multiplication and that deploying two acceleration schemes on the FPGA at the same time leads to insufficient computing resources.
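As a minimal NumPy sketch of this mapping (an illustration of the principle only, not the hardware datapath; the array sizes below are arbitrary), a vector-matrix product can be obtained by loading each matrix row as a one-dimensional convolution kernel whose length equals that of the input vector, so that the single sliding-window position of each kernel yields the corresponding dot product:

```python
import numpy as np

def matvec_as_conv1d(x, W):
    """Compute W @ x by treating each row of W as a 1-D convolution kernel.

    x : input vector of length K (the data fed to the multiply-add array)
    W : weight matrix of shape (M, K); row m is loaded as convolution kernel m
    Each kernel slides over exactly one window (the whole vector), so the
    single output sample per kernel equals the corresponding dot product.
    """
    M, K = W.shape
    out = np.empty(M)
    for m in range(M):
        # 'valid' correlation of a length-K kernel with a length-K input
        # yields one value: sum_k W[m, k] * x[k] (correlate avoids kernel flipping)
        out[m] = np.correlate(x, W[m], mode="valid")[0]
    return out

x = np.random.randn(8)        # vector operand from the LSTM layer
W = np.random.randn(4, 8)     # matrix operand, one row per "kernel"
assert np.allclose(matvec_as_conv1d(x, W), W @ x)
```

Because every matrix row behaves exactly like an ordinary convolution kernel, the same multiply-add array can serve both operations without any structural change.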
The platform of the invention comprises a general CPU and an FPGA, wherein the FPGA end also comprises an instruction memory, a data memory, a result memory, a controller and an arithmetic unit, wherein:
the general-purpose CPU is used for compiling the operation instruction sequence of the user-provided one-dimensional CNN-LSTM neural network model to be accelerated and loading it into the instruction memory at the FPGA end, quantizing the parameters and input data of the one-dimensional CNN-LSTM neural network model into fixed-point numbers and loading them into the data memory at the FPGA end, writing an operation starting instruction into the highest address of the instruction memory, and reading the final operation result from the result memory after all operation instructions have been executed.
The instruction memory is used for storing the operation instruction sequence of the one-dimensional CNN-LSTM neural network model to be accelerated, which is provided by the user and is written by the general CPU.
The data memory is used for storing parameters and input data of the one-dimensional CNN-LSTM neural network model quantized at the general CPU end.
And the result memory is used for storing the final operation result for being read by the general CPU.
The controller is used for monitoring the operation starting instruction, i.e. monitoring the address line of the instruction memory; when the CPU writes data into the highest address of the instruction memory, the calculation starting instruction of the general-purpose CPU has been detected, and the data written into that address is the total number of instructions to be executed in this calculation. After the starting instruction is detected, the controller reads one operation instruction from the instruction memory and sends it to the instruction bus, then monitors the execution feedback bus; after an execution completion signal is detected, it sends the next operation instruction to the instruction bus. When all instructions have been executed, the highest address of the instruction memory is cleared to 0 and the controller re-enters the starting instruction monitoring state.
The arithmetic unit comprises a control unit, a linear operation unit and a nonlinear operation unit, and is used for executing the operation instructions sent to the instruction bus by the controller. The control unit generates the corresponding control information according to the instruction content, controls the linear operation unit and the nonlinear operation unit to complete the corresponding operations, and sends an instruction execution completion signal to the execution feedback bus after the operation instruction has been executed. The linear operation unit consists of a multiply-add array and a result cache array with a ping-pong structure; according to the control information provided by the control unit, it loads the weight parameter w and the bias parameter bias of the multiply-add array from the data memory, loads the input data of the multiply-add array from the data memory, from a row of the ping-pong result cache array, from a column of the ping-pong result cache array, or from the result cache of the nonlinear operation unit, and stores the operation result of the multiply-add array into the ping-pong result cache array after maximum pooling P, linear rectification R and channel addition, thereby completing and accelerating the one-dimensional convolution and matrix multiplication operations. The nonlinear operation unit truncates the arguments of the two nonlinear functions sigmoid and tanh to the interval [-4, 4), quantizes the function values into fixed-point numbers and stores them in a ROM (Read-Only Memory); when the matrix operation result in the LSTM neural network operation is to be nonlinearly activated according to the control information provided by the control unit, the matrix operation result is converted into the address of the ROM storing the nonlinear function values and the corresponding function value is read out, completing the update of the gating coefficients and neuron states of the LSTM neural network.
The method for realizing the acceleration platform comprises the following specific steps:
step 1, compiling an operation instruction sequence of a one-dimensional CNN-LSTM neural network model to be accelerated provided by a user at a general CPU end, and loading the operation instruction sequence into an instruction memory.
And 2, quantizing the parameters of the one-dimensional CNN-LSTM neural network model and the input data into fixed points at the general CPU end, and loading the fixed points into a data memory.
And 3, setting the controller to be in a starting instruction monitoring state, monitoring an address line of the instruction memory, and when the CPU writes data into the highest address of the instruction memory, indicating that the calculation starting instruction of the general CPU is monitored, wherein the data written into the highest address of the instruction memory is the total number of instructions to be executed in the calculation.
And 4, the controller reads an instruction from the instruction memory and sends the instruction to the instruction bus, sets the controller to be in an execution feedback monitoring state, and monitors an instruction execution completion signal sent to the execution feedback bus by the arithmetic unit module.
Step 5, executing the operation instruction according to the content of the operation instruction:
and 5.1, generating corresponding control information by the control unit of the arithmetic unit module according to the instruction content.
And 5.2, loading the weight parameter w and the bias parameter bias of the multiplication and addition array from the data memory by a linear operation unit of the operator module according to the control information provided by the control unit.
And 5.3, loading the input data of the multiplier-adder array from the data memory or the row of the ping-pong result buffer array or the column of the ping-pong result buffer array or the result buffer of the nonlinear operation unit by the linear operation unit of the operator module according to the control information provided by the control unit.
And 5.4, the linear operation unit performs maximum pooling P operation, linear rectification R operation and channel addition operation on the operation result of the multiply-add array according to the control information provided by the control unit and stores the operation result in a ping-pong result cache array.
The maximum pooling P operation is to calculate the maximum value of n multiplication and addition array operation results, wherein the value of n is equal to the length of a pooling kernel.
The linear rectification R operation compares each multiply-add array result with 0: when the result is greater than 0, the result itself is taken as the output of the linear rectification R operation; when the result is less than or equal to 0, the output is 0.
And 5.5, the nonlinear operation unit truncates the arguments of the two nonlinear functions sigmoid and tanh to the interval [-4, 4), quantizes the function values into fixed-point numbers and stores them in a ROM; when the matrix operation result in the LSTM neural network operation is to be nonlinearly activated according to the control information provided by the control unit, the matrix operation result is converted into the address of the ROM storing the nonlinear function values and the corresponding function value is read out, completing the gating-coefficient and neuron state update of the LSTM neural network.
And 6, after the arithmetic unit finishes an arithmetic instruction, sending an instruction execution finishing signal to the execution feedback bus through the control unit.
And 7, after monitoring an instruction execution completion signal sent to the execution feedback bus by the arithmetic unit, the controller judges whether all instructions of the operation are executed completely, if so, the controller executes the step 8, and otherwise, the controller executes the step 4.
And 8, writing the final operation result into the result memory by the arithmetic unit after all the instructions are executed.
And 9, the controller clears the data in the highest address of the instruction memory to 0 and re-enters the starting instruction monitoring state.
And step 10, when the CPU detects that the data in the highest address of the instruction memory has been cleared to 0, the operation is finished, and the CPU then reads the final operation result from the result memory.
Compared with the prior art, the invention has the following advantages:
firstly, the linear operation unit in the arithmetic unit module implemented by the platform on the FPGA converts vector-matrix multiplication into one-dimensional convolution and realizes the parallel acceleration of the two operations of one-dimensional convolution and matrix multiplication, overcoming the defect that a traditional single computation acceleration platform cannot accelerate both convolution and matrix multiplication, so the platform can support the computation acceleration of the one-dimensional CNN-LSTM neural network model and has wider applicability.
Secondly, the implementation method of the platform of the invention multiplexes the same multiply-add array through one-dimensional convolution operation and matrix multiplication operation, thereby ensuring the acceleration performance of the platform, saving FPGA operation resources, solving the problem of insufficient computation resources when two acceleration schemes are deployed on the FPGA at the same time in the prior art, and greatly improving the utilization rate of the computation resources of the FPGA.
Drawings
FIG. 1 is a block diagram of the present invention;
FIG. 2 is a diagram of an operator architecture of the present invention;
FIG. 3 is a block diagram of a basic processing unit of the multiply-add array of the present invention;
fig. 4 is a flow chart of a method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings and examples.
The platform structure of the present invention is further described with reference to fig. 1.
The invention consists of a general-purpose CPU (Central Processing Unit) and an FPGA (Field Programmable Gate Array), which exchange data through the high-speed serial bus PCIE (Peripheral Component Interconnect Express).
The FPGA part consists of an Instruction memory IBUF (Instruction Buffer), a Data memory DBUF (Data Buffer), a Result memory RBUF (Result Buffer), a controller and an arithmetic unit. The arithmetic unit is respectively in data interaction with the data Memory and the result Memory through RAM (Random Access Memory) interfaces, the controller is in data interaction with the instruction Memory through the RAM interfaces, and the controller is in data interaction with the arithmetic unit through an instruction bus and an execution feedback bus.
The general-purpose CPU is used for parsing the one-dimensional CNN-LSTM neural network model provided by the user and generating the operation instruction sequence, loading the instruction sequence into the instruction memory at the FPGA end through the PCIE bus, quantizing the model parameters and data to be processed provided by the user into fixed-point numbers and loading them into the data memory at the FPGA end through the PCIE bus, writing the operation starting instruction into the highest address of the instruction memory through the PCIE bus, and, after the arithmetic unit has executed all operation instructions, reading the operation result from the result memory through the PCIE bus.
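Purely as a schematic of this host-side handshake (the pcie_write/pcie_read helpers and the address map passed in below are illustrative stand-ins; the patent does not prescribe a particular PCIE access interface):

```python
import time

def run_inference(instructions, quantized_blob, result_words,
                  pcie_write, pcie_read,
                  IBUF_BASE, DBUF_BASE, RBUF_BASE, IBUF_HIGHEST):
    """Hypothetical host driver; only the handshake order follows the description."""
    # 1. Load the compiled operation instruction sequence into the instruction memory.
    for i, word in enumerate(instructions):
        pcie_write(IBUF_BASE + i, word)
    # 2. Load the fixed-point model parameters and input data into the data memory.
    for i, word in enumerate(quantized_blob):
        pcie_write(DBUF_BASE + i, word)
    # 3. Start: write the total instruction count to the highest instruction-memory address.
    pcie_write(IBUF_HIGHEST, len(instructions))
    # 4. Wait until the controller clears the start word back to 0.
    while pcie_read(IBUF_HIGHEST) != 0:
        time.sleep(0.001)
    # 5. Read the final operation result from the result memory.
    return [pcie_read(RBUF_BASE + i) for i in range(result_words)]
```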
And the instruction memory is used for storing the operation instruction sequence generated by the CPU, and the memory is composed of a double-port RAM realized by Block RAM resources of the FPGA.
And the data memory is used for storing the user model parameters and the data to be processed after the CPU quantization, and the memory is composed of a dual-port RAM realized by Block RAM resources of an FPGA.
And the result memory is used for storing the final operation result of the arithmetic unit for being read by the CPU, and the memory consists of a double-port RAM realized by Block RAM resources of the FPGA.
The controller completes the functions of starting instruction monitoring, instruction sequence reading and execution feedback processing. After system reset, the controller module is in the starting instruction monitoring state and monitors the address line of the instruction memory; when the CPU writes data into the highest address of the instruction memory, the operation starting instruction has been detected, and the data written by the CPU into that address is taken as the total number of instructions that the arithmetic unit has to execute in this calculation. After detecting the starting instruction, the controller begins to read instructions from address 0 of the instruction memory and sends them to the instruction bus, then enters the execution feedback monitoring state; after the arithmetic unit module finishes the instruction content, it sends a completion signal to the execution feedback bus, and after the controller detects this feedback signal it sends the next instruction in the instruction memory to the instruction bus. When all instructions of this operation have been executed, the controller module clears the data in the highest address of the instruction memory to 0 and re-enters the starting instruction monitoring state; when the CPU detects that this data has been cleared to 0 by the controller, the calculation is finished.
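The corresponding controller behaviour can be summarised by the following behavioural sketch (a software model for illustration only, not the RTL; ibuf, issue and done stand in for the instruction-memory port, the instruction bus and the execution feedback bus):

```python
def controller(ibuf, issue, done, IBUF_HIGHEST):
    """Behavioural model of the start/feedback handshake described above."""
    while True:
        # Starting instruction monitoring: the CPU writes the total number of
        # instructions for this run into the highest instruction-memory address.
        total = ibuf[IBUF_HIGHEST]
        if total == 0:
            continue                      # keep monitoring the address line
        for addr in range(total):         # read instructions starting from address 0
            issue(ibuf[addr])             # send one instruction to the instruction bus
            while not done():             # execution feedback monitoring state
                pass
        ibuf[IBUF_HIGHEST] = 0            # clear the start word: this run is finished
```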
The structure of the operator will be further described with reference to fig. 2.
The arithmetic unit is used for executing the operation instruction sent to the instruction bus by the controller and feeding back the instruction execution condition to the controller through an execution feedback bus, and comprises a control unit, a linear operation unit and a nonlinear operation unit.
The control unit is used for monitoring and analyzing the operation instruction on the instruction bus, then generating control information, controlling the linear operation unit and the nonlinear operation unit to complete corresponding calculation tasks, and finally sending an instruction execution completion signal to the execution feedback bus.
The linear operation unit consists of a ping-pong type result Buffer array CBUF (Channel Buffer) and a multiply-add array, and the parallel acceleration of two operations of one-dimensional convolution and matrix multiplication is realized by multiplexing the multiply-add array.
In the embodiment of the invention, the CBUF is a ping-pong cache composed of two 32 row × 10240 column memory arrays CBUF0 and CBUF1 implemented with the Block RAM resources of the FPGA: when CBUF0 serves as the data load source, CBUF1 serves as the result memory, and conversely, when CBUF1 serves as the data load source, CBUF0 serves as the result memory.
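The ping-pong mechanism can be pictured with the following sketch (a behavioural illustration using the bank sizes of this embodiment; the class and method names are chosen here for clarity only):

```python
import numpy as np

class PingPongCBUF:
    """Behavioural sketch of the ping-pong result cache banks CBUF0/CBUF1."""
    def __init__(self, rows=32, cols=10240):
        # Two banks of the sizes given in this embodiment, holding 16-bit words.
        self.banks = [np.zeros((rows, cols), dtype=np.int16),
                      np.zeros((rows, cols), dtype=np.int16)]
        self.src = 0                          # bank currently used as the load source

    def load_row(self, row):
        return self.banks[self.src][row]      # operands fed to the multiply-add array

    def store_row(self, row, data):
        # Results of the current operation always go to the other bank.
        self.banks[1 - self.src][row, :len(data)] = data

    def swap(self):
        self.src = 1 - self.src               # previous results become the next source
```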
The basic processing cell structure of the multiply-add array is further described with reference to fig. 3.
The basic processing element PE (Processing Element) of the multiply-add array consists of a weight cache unit WBUF (Weight Buffer), a data input port, a bias input port and a result output port. The WBUF consists of four independent registers used to store four different weight parameters; during operation, one of them is selected as the effective weight parameter win according to the control information provided by the control unit. The input data xin of the data input port is multiplied by the weight parameter win and added to the input data bin of the bias input port, and the computation result yout is output from the result output port.
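A behavioural model of one such processing element might look as follows (a sketch for illustration; fixed-point width handling is omitted here for clarity):

```python
class PE:
    """One basic processing element: yout = xin * win + bin."""
    def __init__(self):
        self.wbuf = [0, 0, 0, 0]          # four independent WBUF weight registers

    def load_weight(self, slot, w):
        self.wbuf[slot] = w               # fill one of the four weight registers

    def compute(self, xin, bin_, select):
        win = self.wbuf[select]           # control information picks the active weight
        return xin * win + bin_           # multiply-add result on the output port
```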
The size of the multiply-add array in the embodiment of the invention is 32 rows × 64 columns. According to the control information provided by the control unit, the input data x of the multiply-add array is loaded from the DBUF, from a row of the CBUF, from a column of the CBUF, or from the result buffer NBUF (Nonlinear result Buffer) of the nonlinear operation unit, and the weight parameter w and the bias parameter bias of the multiply-add array are loaded from the DBUF. Each bias cache unit b_i (i = 0, 1, …, 31) of the multiply-add array has 4 independent registers, one of which is selected as the effective parameter during operation according to the control information provided by the control unit. The operation result of the multiply-add array is stored into the CBUF after maximum pooling P, linear rectification R and channel addition according to the control information provided by the control unit.
The nonlinear operation unit truncates the arguments of the sigmoid and tanh nonlinear functions to the interval [-4, 4) and quantizes the function values into fixed-point numbers stored in a ROM; when the matrix operation result in the LSTM neural network operation needs nonlinear activation, the matrix operation result is converted into a ROM address and the corresponding function value is read out.
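The look-up-table activation can be sketched as follows (the Q4.11 value format follows this embodiment, while the ROM depth of 1024 entries and the linear address mapping are assumptions made for illustration):

```python
import numpy as np

FRAC_BITS = 11                            # matches the 1/4/11 fixed-point format used here
DEPTH = 1 << 10                           # ROM depth: an assumption (1024 entries)

def build_rom(func):
    """Tabulate func over [-4, 4) and quantize the values to Q4.11 fixed point."""
    x = np.linspace(-4.0, 4.0, DEPTH, endpoint=False)
    return np.round(func(x) * (1 << FRAC_BITS)).astype(np.int16)

SIGMOID_ROM = build_rom(lambda x: 1.0 / (1.0 + np.exp(-x)))
TANH_ROM = build_rom(np.tanh)

def activate(fixed_val, rom):
    """Map a fixed-point matrix result to a ROM address and read out the value."""
    real = fixed_val / (1 << FRAC_BITS)                # back to the real-valued argument
    real = min(max(real, -4.0), 4.0 - 8.0 / DEPTH)     # clamp into [-4, 4)
    addr = int((real + 4.0) * DEPTH / 8.0)             # linear address mapping over the interval
    return rom[addr]
```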
The implementation of the platform of the present invention is further described with reference to fig. 4.
Step 1, compiling an operation instruction sequence of a one-dimensional CNN-LSTM neural network model to be accelerated provided by a user at a CPU end, and loading the operation instruction sequence into an instruction memory through a PCIE bus.
And 2, quantizing the parameters of the one-dimensional CNN-LSTM neural network model and the input data into fixed points, and loading the fixed points into a data memory through a PCIE bus.
In the embodiment of the invention, all data participating in the operation are quantized into 16-bit fixed-point numbers, of which 1 bit is the sign bit, 4 bits are integer bits and 11 bits are fractional bits; the multiply-add results are reduced to 16-bit fixed-point numbers by saturating truncation.
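A software sketch of this quantization scheme (illustrative only; the hardware performs the same scaling and saturation in fixed-point logic) is:

```python
import numpy as np

FRAC_BITS = 11                      # 1 sign bit + 4 integer bits + 11 fractional bits
Q_MAX, Q_MIN = 2**15 - 1, -2**15    # representable range of a 16-bit word

def quantize(x):
    """Quantize a real value (or array) to the 16-bit fixed-point format."""
    q = np.round(np.asarray(x, dtype=np.float64) * (1 << FRAC_BITS))
    return np.clip(q, Q_MIN, Q_MAX).astype(np.int16)   # saturate, then store

def fixed_mul(a, b):
    """Multiply two fixed-point words and saturate back to 16 bits."""
    prod = (np.int64(a) * np.int64(b)) >> FRAC_BITS     # rescale the double-width product
    return np.int16(np.clip(prod, Q_MIN, Q_MAX))        # saturating truncation

# e.g. quantize(1.5) == 3072 and fixed_mul(quantize(1.5), quantize(2.0)) == quantize(3.0)
```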
And 3, setting the controller to be in a starting instruction monitoring state, monitoring an address line of the instruction memory, and when the CPU writes data into the highest address of the instruction memory, indicating that a calculation starting instruction of the general CPU is received, wherein the data written into the highest address of the instruction memory is the total number of instructions to be executed in the calculation.
And 4, the controller reads an instruction from the instruction memory and sends the instruction to the instruction bus, sets the controller module to be in an execution feedback monitoring state, and monitors an instruction execution completion signal sent to the execution feedback bus by the arithmetic unit module.
And 5, executing the operation instruction according to the content of the operation instruction.
And 5.1, generating corresponding control information by the control unit of the arithmetic unit module according to the instruction content.
And 5.2, loading the weight parameter w and the bias parameter bias of the multiplication and addition array from the DBUF by a linear operation unit of the operator module according to the control information provided by the control unit.
And 5.3, loading the input data x of the multiply-add array from the DBUF, the row of the CBUF, the column of the CBUF or a result buffer NBUF of the nonlinear operation unit by the linear operation unit of the operator module according to the control information provided by the control unit.
And 5.4, the linear operation unit performs maximum pooling P operation, linear rectification R operation and channel addition operation on the operation result of the multiply-add array according to the control information provided by the control unit and then stores the operation result in the CBUF.
The maximum pooling P operation is to calculate the maximum value of n multiplication and addition array operation results, wherein the value of n is equal to the length of a pooling kernel.
The linear rectification R operation compares each multiply-add array result with 0: when the result is greater than 0, the result itself is taken as the output of the linear rectification R operation; when the result is less than or equal to 0, the output is 0.
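The two post-processing steps can be sketched as follows (illustrative NumPy code; the order in which rectification and pooling are applied here is chosen only for the example):

```python
import numpy as np

def relu(y):
    """Linear rectification R: keep results greater than 0, otherwise output 0."""
    return np.maximum(y, 0)

def max_pool(y, n):
    """Maximum pooling P: take the maximum of every n consecutive
    multiply-add array results (n equals the pooling-kernel length)."""
    trimmed = y[: (len(y) // n) * n]          # drop any incomplete final group
    return trimmed.reshape(-1, n).max(axis=1)

y = np.array([-3, 1, 4, -1, 5, 9, -2, 6])     # example multiply-add results
print(max_pool(relu(y), n=2))                 # -> [1 4 9 6]
```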
And 5.5, the nonlinear operation unit truncates the arguments of the sigmoid and tanh nonlinear functions to the interval [-4, 4), quantizes the function values into fixed-point numbers and stores them in the ROM; when the matrix operation result in the LSTM neural network operation is to be nonlinearly activated according to the control information provided by the control unit, the matrix operation result is converted into the address of the ROM storing the nonlinear function values and the corresponding function value is read out, completing the gating-coefficient and neuron state update of the LSTM neural network.
The LSTM neural network operation is defined by the following equation:
f_t = sigmoid(W_f * [h_{t-1}, x_t] + b_f)
i_t = sigmoid(W_i * [h_{t-1}, x_t] + b_i)
o_t = sigmoid(W_o * [h_{t-1}, x_t] + b_o)
C̃_t = tanh(W_c * [h_{t-1}, x_t] + b_c)
C_t = f_t * C_{t-1} + i_t * C̃_t
h_t = o_t * C_t
wherein f_t represents the forget gating coefficient vector of the LSTM neural network, i_t represents the input gating coefficient vector of the LSTM neural network, o_t represents the output gating coefficient vector of the LSTM neural network, C̃_t represents the unfused (candidate) neuron state vector of the LSTM neural network, C_t represents the neuron state vector of the LSTM neural network at the current time, C_{t-1} represents the neuron state vector at the previous time, h_t represents the hidden state vector of the LSTM neural network at the current time, h_{t-1} represents the hidden state vector at the previous time, x_t represents the input vector of the LSTM neural network, W_f, W_i, W_o and W_c are respectively the weight matrices of f_t, i_t, o_t and C̃_t of the LSTM neural network model provided by the user, and b_f, b_i, b_o and b_c are respectively the corresponding bias parameters.
The sigmoid function is defined by the following equation:
sigmoid(x) = 1 / (1 + e^{-x})
wherein e^{(·)} denotes the exponential operation with the natural constant e as the base, and x denotes an element of the gating coefficient vectors f_t, i_t, o_t to be activated in the LSTM operation.
The tanh function is defined by the formula:
tanh(c) = (e^{c} - e^{-c}) / (e^{c} + e^{-c})
wherein c denotes an element of the unfused neuron state vector C̃_t to be activated in the LSTM operation.
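For reference, one step of the LSTM operation defined above can be written in floating point as follows (the platform itself evaluates the same equations with the fixed-point multiply-add array and the ROM look-up activations; the dictionary keys used here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the equations above. W and b hold the
    user-supplied parameters W_f, W_i, W_o, W_c and b_f, b_i, b_o, b_c."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])           # forget gating coefficients
    i_t = sigmoid(W["i"] @ z + b["i"])           # input gating coefficients
    o_t = sigmoid(W["o"] @ z + b["o"])           # output gating coefficients
    C_tilde = np.tanh(W["c"] @ z + b["c"])       # unfused neuron state
    C_t = f_t * C_prev + i_t * C_tilde           # fused neuron state
    h_t = o_t * C_t                              # hidden state, as defined above
    return h_t, C_t
```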
And 6, after the arithmetic unit module finishes an arithmetic instruction, sending an execution completion signal to the execution feedback bus through the control unit.
And 7, after detecting the instruction execution completion signal sent to the execution feedback bus by the arithmetic unit module, the controller judges whether all instructions of this operation have been executed; if so, step 8 is executed, otherwise step 4 is executed.
Step 8, after all instructions have been executed, the arithmetic unit writes the final operation result into the result memory.
Step 9, the controller clears the data in the highest address of the instruction memory to 0 and re-enters the starting instruction monitoring state.
Step 10, when the CPU detects that the data in the highest address of the instruction memory has been cleared to 0, the operation is finished, and the CPU then reads the final operation result from the result memory RBUF.

Claims (7)

1. The one-dimensional CNN-LSTM acceleration platform based on the FPGA comprises a general CPU and the FPGA, wherein the FPGA end comprises an instruction memory, a data memory, a result memory, a controller and an arithmetic unit, wherein:
the general CPU is used for compiling an operation instruction sequence of a one-dimensional CNN-LSTM neural network model to be accelerated provided by a user, loading the operation instruction sequence into an instruction memory of the FPGA end, quantizing parameters and input data of the one-dimensional CNN-LSTM neural network model into fixed point numbers, loading the fixed point numbers into a data memory of the FPGA end, writing an operation starting instruction into the highest address of the instruction memory, and reading a final operation result from a result memory after all operation instructions are executed;
the instruction memory is used for storing an operation instruction sequence of the one-dimensional CNN-LSTM neural network model to be accelerated, which is provided by a user and is written by the general CPU;
the data memory is used for storing parameters and input data of the one-dimensional CNN-LSTM neural network model quantized at the general CPU end;
the result memory is used for storing the final operation result for the general CPU to read;
the controller is used for monitoring an operation starting instruction, namely monitoring an address line of the instruction memory, when the CPU writes data into the highest address of the instruction memory, it indicates that the calculation starting instruction of the general CPU is monitored, wherein the data written into the highest address of the instruction memory is the total number of instructions to be executed in the calculation, after the operation starting instruction is monitored, one operation instruction is read from the instruction memory and sent to the instruction bus, then the execution feedback bus is monitored, after an execution completion signal is monitored, the next operation instruction is sent to the instruction bus, until the highest address of the instruction memory is cleared to 0 after all the instructions are executed, and the controller enters the starting instruction monitoring state again;
the arithmetic unit comprises a control unit, a linear operation unit and a nonlinear operation unit and is used for executing an operation instruction sent to the instruction bus by the controller; the control unit is used for generating corresponding control information according to the instruction content, controlling the linear operation unit and the nonlinear operation unit to complete the corresponding operation, and sending an instruction execution completion signal to the execution feedback bus after the operation instruction is executed; the linear operation unit consists of a multiply-add array and a result cache array with a ping-pong structure, and is used for loading a weight parameter w and a bias parameter bias of the multiply-add array from the data memory according to the control information provided by the control unit, loading input data of the multiply-add array from the data memory or a row of the ping-pong result cache array or a column of the ping-pong result cache array or the result cache of the nonlinear operation unit, and storing the operation result of the multiply-add array into the ping-pong result cache array after performing maximum pooling P operation, linear rectification R operation and channel addition operation according to the control information provided by the control unit, so as to complete and accelerate the one-dimensional convolution operation and the matrix multiplication operation; the nonlinear operation unit intercepts the part of the sigmoid and tanh nonlinear function arguments between [-4, 4), quantizes the function values into fixed-point numbers and stores them in a ROM, and when nonlinear activation is carried out on the matrix operation result in the LSTM neural network operation according to the control information provided by the control unit, the matrix operation result is converted into the address of the ROM storing the nonlinear function values, and the corresponding function value is read out, so that the updating of the gating coefficients and the neuron states of the LSTM neural network is completed.
2. The implementation method of the one-dimensional CNN-LSTM acceleration platform based on the FPGA according to the platform of claim 1, wherein a linear operation unit of an operator performs parallel acceleration of two operations of one-dimensional convolution and matrix multiplication by multiplexing the same multiply-add array, and the method comprises the following specific steps:
step 1, compiling an operation instruction sequence of a one-dimensional CNN-LSTM neural network model to be accelerated provided by a user at a general CPU end, and loading the operation instruction sequence into an instruction memory;
step 2, quantizing the parameters of the one-dimensional CNN-LSTM neural network model and input data into fixed point numbers at a general CPU end, and loading the fixed point numbers into a data memory;
step 3, setting the controller to be in a starting instruction monitoring state, monitoring an address line of the instruction memory, and when the CPU writes data into the highest address of the instruction memory, indicating that a calculation starting instruction of the general CPU is monitored, wherein the data written into the highest address of the instruction memory is the total number of instructions to be executed in the calculation;
step 4, the controller reads an instruction from the instruction memory and sends the instruction to the instruction bus, sets the controller to be in an execution feedback monitoring state, and monitors an instruction execution completion signal sent to the execution feedback bus by the arithmetic unit module;
step 5, executing the operation instruction according to the content of the operation instruction:
step 5.1, the control unit of the arithmetic unit module generates corresponding control information according to the instruction content;
step 5.2, a linear arithmetic unit of the arithmetic unit module loads a weight parameter w and a bias parameter bias of the multiplication and addition array from a data memory according to the control information provided by the control unit;
step 5.3, the linear operation unit of the operator module loads the input data of the multiplication and addition array from the data memory or the row of the ping-pong result cache array or the column of the ping-pong result cache array or the result cache of the nonlinear operation unit according to the control information provided by the control unit;
step 5.4, the linear operation unit performs maximum pooling P operation, linear rectification R operation and channel addition operation on the operation result of each multiply-add array according to the control information provided by the control unit and then stores the operation result into a ping-pong type result cache array;
step 5.5, the nonlinear operation unit intercepts the part of the two nonlinear function arguments of sigmoid and tanh between [-4, 4), quantizes the function values into fixed-point numbers and stores them in a ROM, and when nonlinear activation is carried out on the matrix operation result in the LSTM neural network operation according to the control information provided by the control unit, the matrix operation result is converted into the address of the ROM storing the nonlinear function values, the corresponding function value is read out, and the neuron state update of the LSTM neural network is completed;
step 6, after the arithmetic unit finishes an arithmetic instruction, sending an instruction execution finishing signal to an execution feedback bus through the control unit;
step 7, after monitoring an instruction execution completion signal sent to the execution feedback bus by the arithmetic unit, the controller judges whether all instructions of the operation are executed completely, if so, the controller executes step 8, otherwise, the controller executes step 4;
step 8, writing the final operation result into a result memory by the arithmetic unit after all the instructions are executed;
step 9, the controller clears 0 the data in the highest address of the instruction memory and enters a starting instruction monitoring state again;
and step 10, the CPU detects that the data in the highest address of the instruction memory is cleared by 0 to indicate that the operation is finished, and then reads the final operation result from the result memory.
3. The method of claim 2, wherein the maximum pooling P operation in step 5.4 is performed by taking a maximum value of n multiply-add array operation results, where n is equal to the length of the pooling kernel.
4. The method as claimed in claim 2, wherein the linear rectification R operation in step 5.4 is to compare each operation result in the multiply-add array with 0, taking the result itself when the operation result is greater than 0, and taking 0 when the operation result is less than or equal to 0.
5. The method for implementing one-dimensional CNN-LSTM acceleration platform based on FPGA of claim 2, wherein the LSTM neural network operation in step 5.5 is performed by the following formula:
f_t = sigmoid(W_f * [h_{t-1}, x_t] + b_f)
i_t = sigmoid(W_i * [h_{t-1}, x_t] + b_i)
o_t = sigmoid(W_o * [h_{t-1}, x_t] + b_o)
C̃_t = tanh(W_c * [h_{t-1}, x_t] + b_c)
C_t = f_t * C_{t-1} + i_t * C̃_t
h_t = o_t * C_t
wherein f_t represents the forget gating coefficient vector of the LSTM neural network, i_t represents the input gating coefficient vector of the LSTM neural network, o_t represents the output gating coefficient vector of the LSTM neural network, C̃_t represents the unfused neuron state vector of the LSTM neural network, C_t represents the neuron state vector of the LSTM neural network at the current time, C_{t-1} represents the neuron state vector at the time preceding the current time, h_t represents the hidden-layer state vector of the LSTM neural network at the current time, h_{t-1} represents the hidden state vector at the time preceding the current time, x_t represents the input vector of the LSTM neural network, W_f, W_i, W_o and W_c are respectively the weight matrices of f_t, i_t, o_t and C̃_t of the LSTM neural network model provided by the user, and b_f, b_i, b_o and b_c are respectively the bias parameters of f_t, i_t, o_t and C̃_t of the LSTM neural network model provided by the user.
6. The method for one-dimensional CNN-LSTM acceleration platform implementation based on FPGA of claim 2, wherein sigmoid function in step 5.5 is defined by the following formula:
sigmoid(x) = 1 / (1 + e^{-x})
wherein e^{(·)} denotes the exponential operation with the natural constant e as the base, and x denotes an element of the gating coefficient vectors f_t, i_t, o_t to be activated in the LSTM operation.
7. The method for implementing the one-dimensional CNN-LSTM acceleration platform based on FPGA of claim 2, wherein the tanh function in step 5.5 is defined by the following formula:
tanh(c) = (e^{c} - e^{-c}) / (e^{c} + e^{-c})
wherein c denotes an element of the unfused neuron state vector C̃_t to be activated in the LSTM operation.
CN202210804166.3A 2022-07-07 2022-07-07 One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method Pending CN115222028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210804166.3A CN115222028A (en) 2022-07-07 2022-07-07 One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method

Publications (1)

Publication Number Publication Date
CN115222028A true CN115222028A (en) 2022-10-21

Family

ID=83609288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210804166.3A Pending CN115222028A (en) 2022-07-07 2022-07-07 One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method

Country Status (1)

Country Link
CN (1) CN115222028A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116029332A (en) * 2023-02-22 2023-04-28 南京大学 On-chip fine tuning method and device based on LSTM network
CN116029332B (en) * 2023-02-22 2023-08-22 南京大学 On-chip fine tuning method and device based on LSTM network

Similar Documents

Publication Publication Date Title
CN111967468B (en) Implementation method of lightweight target detection neural network based on FPGA
CN108280514B (en) FPGA-based sparse neural network acceleration system and design method
CN107729989B (en) Device and method for executing artificial neural network forward operation
US10691996B2 (en) Hardware accelerator for compressed LSTM
Guo et al. Software-hardware codesign for efficient neural network acceleration
CN108427990B (en) Neural network computing system and method
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
US11194549B2 (en) Matrix multiplication system, apparatus and method
JP2022070955A (en) Scheduling neural network processing
CN108090560A (en) The design method of LSTM recurrent neural network hardware accelerators based on FPGA
EP3869412A1 (en) Vector quantization decoding hardware unit for real-time dynamic decompression for parameters of neural networks
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
US20220164663A1 (en) Activation Compression Method for Deep Learning Acceleration
CN111797982A (en) Image processing system based on convolution neural network
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
Duan et al. Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights
Li et al. A hardware-efficient computing engine for FPGA-based deep convolutional neural network accelerator
CN113807998A (en) Image processing method, target detection device, machine vision equipment and storage medium
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
CN113240101B (en) Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
CN114519425A (en) Convolution neural network acceleration system with expandable scale
Shivapakash et al. A power efficient multi-bit accelerator for memory prohibitive deep neural networks
CN110716751A (en) High-parallelism computing platform, system and computing implementation method
CN112183744A (en) Neural network pruning method and device
CN110659014B (en) Multiplier and neural network computing platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination