CN114881217A - General convolutional neural network accelerator based on FPGA and system thereof


Info

Publication number
CN114881217A
CN114881217A
Authority
CN
China
Prior art keywords
data
convolution
memory
neural network
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210135910.5A
Other languages
Chinese (zh)
Inventor
刘辉
李政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210135910.5A priority Critical patent/CN114881217A/en
Publication of CN114881217A publication Critical patent/CN114881217A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an FPGA-based general convolutional neural network accelerator and system, and relates to the field of neural network and edge-side acceleration system design. A convolutional neural network accelerator based on an Intel Arria 10 SoC FPGA is provided; the accelerator balances accelerator performance with support for both standard convolution and DW (DepthWise) convolution, can be ported to Intel-series SoC FPGAs, and has broad application prospects. The system of the invention comprises: an ARM processor, a DDR3 memory, an AXI bus interconnect, a convolutional neural network accelerator system control module, a data distribution module, a convolution operation engine module, a bias activation module, a pooling module, input/output data memories and their control modules, and a data collection module. To improve system parallelism, 1024-bit data alignment is adopted in the data distribution module, which enables parallel data transfer and computation and improves the computational efficiency and transmission bandwidth of the system.

Description

General convolutional neural network accelerator based on FPGA and system thereof
Technical Field
The invention relates to a general FPGA-based (Field Programmable Gate Array) convolutional neural network accelerator system and belongs to the field of neural network and edge-side acceleration system design.
Background
With the rapid development and wide application of artificial intelligence, researchers have proposed various neural network models, such as convolutional neural networks and recurrent neural networks, which are mainly applied in fields such as image recognition, object detection, speech analysis, and semantic segmentation. In computer vision, Convolutional Neural Networks (CNNs) play an increasingly important role and are the core algorithms for image classification and recognition.
However, as detection tasks become more complex, CNNs grow ever deeper and their computational complexity increases dramatically, so that general-purpose CPUs can no longer meet the computational demands of CNNs. GPUs offer strong inherent parallel computing capability, so CNN training is usually performed on GPUs; however, when a neural network inference model is deployed in a mobile scenario, the high cost and high power consumption of GPUs cannot meet the requirements of that scenario.
Edge computing is a computing paradigm that has emerged in recent years; it provides intelligent storage, computation, analysis, and other services at the network edge, close to the data source. Completing the inference process of a neural network on an edge computing platform in a mobile scenario offers a series of advantages: the inference is performed directly at the data acquisition end without being sent back to a data-center server, which greatly reduces data transmission latency and control signaling overhead, and offloading the inference task to the edge platform enables real-time data processing, which is of great significance in practical applications.
The FPGA is a programmable logic device characterized by highly parallel computation, flexible configuration, low power consumption, small size, and portability, which matches the highly parallel computing characteristics of CNNs. Compared with a CPU, its parallel computing capability allows higher computation speed; compared with a GPU, it operates at lower power consumption. Therefore, deploying a neural network model on an FPGA platform and fully exploiting the FPGA's pipelined computation and low power consumption is an effective means of hardware acceleration and model deployment for convolutional neural networks.
The invention has the following beneficial effects: the system balances accelerator performance with adaptability to different network models and has wide application scenarios. According to the parallelism and computational density of the convolutional neural network, the system greatly improves computation speed and resource utilization through resource multiplexing, parallel processing, and pipelined design, thereby improving inference speed. It supports 8-bit quantized data input and 32-bit data output, supports both standard convolution and depthwise separable convolution, and therefore has a degree of generality. The design has been verified on an Intel Arria 10 series SoC FPGA platform, accelerating the SSD-MobileNet-v1 inference model to a frame rate of 6.7 FPS.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a general FPGA-based convolutional neural network accelerator system which, according to the parallelism and computational density of the convolutional neural network, greatly improves computation speed and resource utilization through resource multiplexing, parallel processing, and pipelined design, thereby improving inference speed. It supports 8-bit quantized data input and 32-bit data output, supports both standard convolution and depthwise separable convolution, and therefore has a degree of generality.
To achieve this purpose, the technical solution of the invention is as follows:
a general convolutional neural network accelerator based on FPGA and a system thereof are characterized by comprising a system control unit, a data forwarding unit, an input data cache unit, a convolutional processing unit, a bias activation unit, a pooling unit and an output data cache unit.
Preferably, the system control unit is used for controlling the whole accelerator system, and comprises data loading, calculation execution and data output;
the Data forwarding unit is used for Data bit width conversion and comprises a Data _ Scather and a Data _ Gather;
(1) the Data _ Scather module receives Data sent by the PS end through a Data bus, performs bit width conversion and distributes the converted Data to a shared memory, a private memory or a weight memory;
(2) and the Data _ Gather module reads the Data in the 4 output characteristic diagram memories, performs bit width conversion, and sends the Data to an off-chip DDR3 memory buffer of the PS end through a Data bus.
Preferably, the input data cache unit is mainly used to cache feature map data and weight data in a shared feature map memory, private feature map memories, and a weight memory, respectively; when the 4 convolution kernels share the same feature map data, the feature map data are loaded into Share_IFM, otherwise they are loaded into the 4 Private_IFM memories respectively.
Preferably, the convolution processing unit is configured to compute the convolutional layers or fully connected layers of a convolutional neural network; each convolution kernel reads data from memory through its own bus interface to perform the computation, and the system control unit controls the convolution type and the amount of data to be processed.
Preferably, the convolution calculation unit comprises 4 convolution kernels which can work independently and in parallel, giving high parallelism; each convolution kernel is composed of 64 DSP multipliers and 63 adders forming a multiply-add tree, which has 5 pipeline stages and 6 output levels; each level of the multiply-add tree can be output independently, producing 1, 2, 4, 16, or 64 results respectively.
Preferably, the bias activation unit is used to compute the bias layer or activation layer in the convolutional neural network and supports ReLU activation; when a bias operation is executed, the bias value is read from the bias control register, the bias calculation is performed, and the result is output to the next-stage processing unit.
Preferably, the pooling unit is configured to compute the pooling layers in the convolutional neural network, supports both max pooling and average pooling, with a pooling size of 2 × 2 and a pooling stride of 2; each pooling calculation module comprises two comparison units for max pooling and a 4-input accumulation unit for average pooling;
the output data cache unit is mainly used to cache the output results of the convolution processing unit and has an accumulation function, selecting whether the outputs of multiple channels are accumulated before caching: for standard convolution, the results of multiple convolution kernels must be accumulated to obtain a valid output feature map, whereas for depthwise separable convolution no accumulation is needed;
the accumulation function is placed on the output feature map memory side, the system control unit controls whether the memory side accumulates and caches the convolution results of the current layer, and data are read from the 4 Private_OFM memories and cached in the off-chip DDR3 memory once the partial computation of the current layer is complete.
An FPGA-based general convolutional neural network acceleration system, characterized in that:
1) a pre-trained convolutional neural network inference model and its parameters are converted into command parameters and weight data that the acceleration system can recognize;
2) in the system initialization stage, the float32 weight data are quantized in advance into 8-bit integer data (an illustrative quantization sketch is given after this list), and the resulting quantized weights are cached in the off-chip DDR3 memory; the inference model is parsed to obtain the model parameters of each layer, which are stored in the internal memory of the ARM processor;
3) when an image is to be inferred, the ARM processor performs 8-bit quantization on the image data, and the image data are tiled and ordered according to the model parameters;
4) when the inference process starts, the PS side loads data to the accelerator over the data bus; if standard convolution is executed, the same image data are loaded into the Share_IFM on-chip memory and 4 groups of convolution kernel parameters are loaded into the Weight on-chip memory, so that the 4 convolution kernels reuse the image data; if DW convolution is executed, 4 groups of image data are loaded into the 4 Private_IFM private memories and 4 groups of convolution kernel parameters are loaded into the Weight on-chip memory, so that the 4 convolution kernels work independently;
5) after data loading is finished, the PS side sends a command over the control bus to start the 4 convolution kernels; the 4 convolution kernels each load image data from the Share_IFM memory or their Private_IFM private memory, load weight data from the Weight memory, and cache the calculation results in the OFM output feature map memory;
6) when data are cached into the OFM output feature map memory, whether the convolution kernel outputs are accumulated before caching is selected according to the type of the current convolutional layer;
7) after the convolution operation is finished, the PS side sends a command over the control bus to tell the accelerator whether bias, activation, and pooling need to be executed; after the corresponding modules have been invoked in turn, the partial inference results on the PL side are read over the data bus and cached in the off-chip DDR3 memory;
8) steps 4) to 7) are repeated until inference of the whole image is finished; finally, the ARM processor outputs the inference result.
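The patent does not disclose the specific quantization algorithm used in step 2). As a purely illustrative reference, the following C sketch shows one common way a float32 weight tensor could be mapped to 8-bit integers (symmetric per-tensor quantization); the function names and the scale rule are assumptions, not part of the disclosed design.

```c
/* Illustrative sketch only: a symmetric per-tensor float32 -> int8
 * quantizer of the kind that could produce the 8-bit weights of step 2).
 * The scheme (symmetric, per-tensor) and the names are assumptions;
 * the patent does not disclose the actual algorithm. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Choose a scale so that the largest |w| maps to 127. */
static float choose_scale(const float *w, size_t n)
{
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float a = fabsf(w[i]);
        if (a > max_abs)
            max_abs = a;
    }
    return (max_abs > 0.0f) ? (max_abs / 127.0f) : 1.0f;
}

/* Quantize: q = clamp(round(w / scale), -128, 127). */
static void quantize_int8(const float *w, int8_t *q, size_t n, float scale)
{
    for (size_t i = 0; i < n; ++i) {
        long v = lroundf(w[i] / scale);
        if (v > 127)  v = 127;
        if (v < -128) v = -128;
        q[i] = (int8_t)v;
    }
}

int main(void)
{
    float weights[6] = { 0.12f, -0.55f, 0.98f, -1.20f, 0.07f, 0.30f };
    int8_t q[6];
    float scale = choose_scale(weights, 6);

    quantize_int8(weights, q, 6, scale);
    for (int i = 0; i < 6; ++i)
        printf("w=%+.2f -> q=%+4d (dequant %+.3f)\n",
               weights[i], q[i], q[i] * scale);
    return 0;
}
```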
The technical principle and the beneficial effects of the invention are as follows:
the general convolutional neural network acceleration system designed by the invention realizes high-speed calculation of the convolutional neural network on InelArria10 series FPGA chips, theoretically can accelerate various convolutional neural networks, can be adapted to various common convolutional neural network models, and can fully exert the acceleration performance of the system only by matching corresponding bottom-layer drive and application software. Compared with a CPU (Central processing Unit), a GPU (graphics processing Unit) and the like, the accelerator has higher energy efficiency ratio and portability and higher actual reference value, and in actual use, application software and a quantification strategy can be reasonably designed according to the framework of the invention and a network model structure, so that the acceleration performance of an acceleration system can be effectively exerted.
The system considers the performance of the accelerator and the adaptability to different network models, and has wide application scenes. The system greatly improves the calculation speed and the resource utilization rate through resource multiplexing, parallel processing and pipeline design according to the parallelism and the calculation density of the convolutional neural network, thereby improving the reasoning speed. The method supports 8-bit quantized data input and 32-bit data output, supports standard convolution and depth separable convolution, and has certain universality. And the verification is carried out on an Intel Arria10 series SoCFPGA platform, the SSD-MobileNet-v1 reasoning model is accelerated, and the frame rate can reach 6.7 FPS.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is an overall architecture of an accelerator according to the present invention;
FIG. 2 is a block diagram of a convolution calculation unit according to the present invention;
FIG. 3 is a diagram of a multiply-add tree structure according to the present invention;
FIG. 4 is a memory structure of the output feature map according to the present invention.
Detailed Description
The technical solutions in the present invention will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely preferred embodiments of the present invention, rather than all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
As shown in fig. 1, the convolutional neural network accelerator provided by this embodiment implements an acceleration system with 4 convolution kernels computing in parallel; each convolution kernel can operate independently, the system supports various types of convolutional neural network models, and it performs well in terms of generality. Inside a single convolution kernel, convolution is computed with a pipelined structure in GEMM mode, and convolutions with different kernel sizes are supported; the four convolution kernels can perform at most 512 multiply-accumulate operations per clock cycle. An on-chip memory structure is designed to reduce off-chip data loading and caching and to enable data reuse, thereby improving the operating efficiency of the system.
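The embodiment states that convolution is computed "in GEMM mode" inside each kernel. The following C sketch is a software reference for that idea only: it lowers a small convolution to a matrix multiplication via im2col, with 8-bit inputs and 32-bit accumulation matching the accelerator's 8-bit-in / 32-bit-out data path. The sizes, names, and the im2col lowering itself are illustrative assumptions and do not describe the hardware datapath.

```c
/* Software reference for "convolution as GEMM" (im2col + matrix multiply).
 * This only illustrates the arithmetic mapping; the hardware uses a
 * pipelined multiply-add tree. All names/sizes here are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define CIN 2   /* input channels */
#define H   4   /* input height   */
#define W   4   /* input width    */
#define K   3   /* kernel size    */
#define OH  (H - K + 1)
#define OW  (W - K + 1)

/* im2col: each output pixel becomes one column of CIN*K*K input values. */
static void im2col(const int8_t in[CIN][H][W],
                   int8_t col[CIN * K * K][OH * OW])
{
    for (int c = 0; c < CIN; ++c)
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx)
                for (int oy = 0; oy < OH; ++oy)
                    for (int ox = 0; ox < OW; ++ox)
                        col[(c * K + ky) * K + kx][oy * OW + ox] =
                            in[c][oy + ky][ox + kx];
}

/* One GEMM row: out[OH*OW] = w[CIN*K*K] x col, int8 inputs, int32 sums. */
static void gemm_1row(const int8_t w[CIN * K * K],
                      const int8_t col[CIN * K * K][OH * OW],
                      int32_t out[OH * OW])
{
    for (int p = 0; p < OH * OW; ++p) {
        int32_t acc = 0;
        for (int i = 0; i < CIN * K * K; ++i)
            acc += (int32_t)w[i] * (int32_t)col[i][p];
        out[p] = acc;
    }
}

int main(void)
{
    int8_t in[CIN][H][W], w[CIN * K * K], col[CIN * K * K][OH * OW];
    int32_t out[OH * OW];

    for (int c = 0; c < CIN; ++c)          /* toy input: a ramp     */
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                in[c][y][x] = (int8_t)(c + y + x);
    for (int i = 0; i < CIN * K * K; ++i)  /* toy weights: all ones */
        w[i] = 1;

    im2col(in, col);
    gemm_1row(w, col, out);
    for (int p = 0; p < OH * OW; ++p)
        printf("out[%d] = %d\n", p, out[p]);
    return 0;
}
```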
The system balances accelerator performance with adaptability to different network models and has wide application scenarios. According to the parallelism and computational density of the convolutional neural network, the system greatly improves computation speed and resource utilization through resource multiplexing, parallel processing, and pipelined design, thereby improving inference speed. It supports 8-bit quantized data input and 32-bit data output, supports both standard convolution and depthwise separable convolution, and therefore has a degree of generality. The design has been verified on an Intel Arria 10 series SoC FPGA platform, accelerating the SSD-MobileNet-v1 inference model to a frame rate of 6.7 FPS.
The invention designs an FPGA-based convolutional neural network accelerator that mainly comprises the following parts: a system control unit, a data forwarding unit, an input data cache unit, a convolution processing unit, a bias activation unit, a pooling unit, and an output data cache unit.
The system control unit is used to control the whole accelerator system, including data loading, execution of computation, and data output;
the data forwarding unit is used for data bit-width conversion and has two functions: 1) data sent by the PS side over the data bus undergo bit-width conversion and are then distributed to the shared memory (Share_IFM), the private memories (Private_IFM), or the weight memory (Weight); 2) data in the 4 output feature map (Private_OFM) memories are read, undergo bit-width conversion, and are sent to the off-chip DDR3 memory buffer on the PS side over the data bus;
the input data cache unit is mainly used to cache feature map data and weight data so that the subsequent calculation modules can read the data for processing;
the convolution processing unit is used to compute the convolutional layers or fully connected layers in the convolutional neural network; each convolution kernel reads data from memory through its own bus interface to perform the computation, and the system control unit controls the convolution type and the amount of data to be processed;
the bias activation unit is used to compute the bias layer or activation layer in the convolutional neural network and supports ReLU activation;
the pooling unit is used to compute the pooling layers in the convolutional neural network, supports both max pooling and average pooling, with a pooling size of 2 × 2 and a pooling stride of 2, and the pooling mode is controlled by the system control unit;
the output data cache unit is mainly used to cache the output results of the convolution processing unit, has an accumulation function, and can select whether the outputs of multiple channels are accumulated before caching.
The system comprises an SoC FPGA chip and an off-chip DDR3 memory chip; accelerated computation of the convolutional neural network is achieved through software-hardware cooperation;
the SoC FPGA chip is used to implement the acceleration system and to control the algorithm acceleration process; the off-chip DDR3 memory chip is used to store the intermediate feature map data, weight parameters, and bias parameters of the convolutional neural network;
the SoC FPGA chip comprises an ARM Cortex-A9 dual-core processor, an FPGA, a DDR3 controller, and an AXI bus;
the ARM Cortex-A9 dual-core processor is used to run user software, perform scheduling of the convolutional neural network algorithm, and control the process by which the FPGA runs inference on the convolutional neural network model;
the FPGA contains a large amount of programmable logic resources, DSP multiplier units, and on-chip SRAM memory, and is used to implement the circuit configurations of the convolution, bias, and activation acceleration units;
the DDR3 controller is used to write intermediate feature map data into the off-chip DDR3 memory over the AXI bus and to read weight data and bias data from the off-chip DDR3 memory;
the AXI bus interconnect is used to provide a unified bus interface so that the ARM Cortex-A9 dual-core processor can access the FPGA and the off-chip DDR3 memory through it.
The system control unit mainly comprises 5 status/control registers that receive the various control commands of the PS-side inference model and control the acceleration system to perform the corresponding computations and the loading and output of data. They are:
1) data-load control register: controls the accelerator's reception of cached data, indicating the type of data received and the number of data batches to be received;
2) accelerator data-output control register: controls the output of the accelerator's output feature map data and the amount of data to be output;
3) convolution processing unit control register: controls when the convolution processing unit starts to execute convolution, the convolution kernel size, whether the current convolution needs accumulation, and the convolution batch to be processed;
4) bias activation unit control register: controls the bias enable of the bias unit, stores the bias value, and controls the activation enable;
5) pooling unit control register: controls pooling enable and pooling type; only 2 × 2 pooling operations are supported.
By controlling these registers, control of the entire accelerator system can be achieved.
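As an illustration of how a PS-side driver might view these five registers, the sketch below defines a hypothetical memory-mapped register block. The base address, field encodings, and bit positions are invented for illustration; the patent specifies only the registers' roles.

```c
/* Hypothetical PS-side view of the five control registers. The base
 * address, field widths, and layout are illustrative assumptions only;
 * the patent specifies the registers' purposes, not their encoding. */
#include <stdint.h>
#include <stdio.h>

#define ACCEL_BASE_ADDR 0xA0000000u  /* hypothetical AXI-mapped base */

typedef volatile struct {
    uint32_t load_ctrl;    /* 1) data-load control: data type, batch count  */
    uint32_t output_ctrl;  /* 2) data-output control: amount of OFM to read */
    uint32_t conv_ctrl;    /* 3) conv unit: start, kernel size, accumulate,
                                 batch to process                           */
    uint32_t bias_ctrl;    /* 4) bias/activation: bias enable, bias value,
                                 ReLU enable                                */
    uint32_t pool_ctrl;    /* 5) pooling: enable, max/average select        */
} accel_regs_t;

/* Example: request a 3x3 standard convolution with accumulation enabled.
 * The bit positions are made up purely for illustration. */
static void start_conv_example(accel_regs_t *regs)
{
    uint32_t cmd = 0;
    cmd |= 1u << 0;        /* start           */
    cmd |= 3u << 1;        /* kernel size = 3 */
    cmd |= 1u << 4;        /* accumulate      */
    regs->conv_ctrl = cmd;
}

int main(void)
{
    /* On the target, the block would be mapped at ACCEL_BASE_ADDR, e.g.
     * accel_regs_t *regs = (accel_regs_t *)(uintptr_t)ACCEL_BASE_ADDR;
     * here a local dummy is used so the sketch runs on a host machine. */
    accel_regs_t dummy = {0};
    start_conv_example(&dummy);
    printf("conv_ctrl written as 0x%08x\n", (unsigned)dummy.conv_ctrl);
    return 0;
}
```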
The data forwarding unit comprises two modules, Data_Scatter and Data_Gather. The Data_Scatter module receives data sent by the PS side over the data bus, performs bit-width conversion, and distributes the data to the shared memory (Share_IFM), the private memories (Private_IFM), or the weight memory (Weight). The Data_Gather module reads the data in the 4 output feature map (Private_OFM) memories, performs bit-width conversion, and sends the data to the off-chip DDR3 memory buffer on the PS side over the data bus.
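To make the bit-width conversion concrete, the following C sketch repacks a stream of 8-bit quantized values into 1024-bit words, matching the 1024-bit data alignment used in the data distribution module. The packing order and zero-padding rule are assumptions made for illustration.

```c
/* Illustration of the Data_Scatter idea: repack a byte stream of 8-bit
 * quantized values into 1024-bit (128-byte) aligned words so that a wide
 * on-chip memory port can consume many values per cycle. The packing
 * order and padding rule are assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WORD_BYTES 128               /* 1024 bits = 128 x 8-bit lanes */

typedef struct { uint8_t lane[WORD_BYTES]; } wide_word_t;

/* Pack n bytes into ceil(n/128) wide words, zero-padding the tail. */
static size_t pack_1024(const uint8_t *src, size_t n, wide_word_t *dst)
{
    size_t words = (n + WORD_BYTES - 1) / WORD_BYTES;
    for (size_t w = 0; w < words; ++w) {
        size_t take = (n - w * WORD_BYTES < WORD_BYTES)
                          ? (n - w * WORD_BYTES) : WORD_BYTES;
        memset(dst[w].lane, 0, WORD_BYTES);
        memcpy(dst[w].lane, src + w * WORD_BYTES, take);
    }
    return words;
}

int main(void)
{
    uint8_t stream[300];
    wide_word_t words[4];

    for (size_t i = 0; i < sizeof stream; ++i)
        stream[i] = (uint8_t)i;
    size_t n = pack_1024(stream, sizeof stream, words);
    printf("packed %zu bytes into %zu x 1024-bit words\n", sizeof stream, n);
    return 0;
}
```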
The input data cache unit is mainly used to cache feature map data and weight data in the shared feature map memory (Share_IFM), the private feature map memories (Private_IFM), and the weight memory (Weight), respectively. When the 4 convolution kernels share the same feature map data, the feature map data are loaded into Share_IFM, otherwise they are loaded into the 4 Private_IFM memories respectively; since the amount of convolution kernel weight data per convolutional layer is generally small, all the convolution kernel weight data can be loaded into the Weight memory.
The convolution processing unit is used to compute the convolutional layers or fully connected layers in the convolutional neural network. The convolution calculation unit comprises 4 convolution kernels which can work independently and in parallel, giving high parallelism. Each convolution kernel is composed of 64 DSP multipliers and 63 adders forming a multiply-add tree with 5 pipeline stages and 6 output levels; each level of the multiply-add tree can be output independently, producing 1, 2, 4, 16, or 64 results respectively, so that convolution is efficient for different kernel sizes. Moreover, by multiplexing the DSP blocks, each DSP can compute two 8-bit multiplications simultaneously. Through the pipelined design and DSP multiplexing, each convolution kernel can compute at most 128 multiplications and 127 additions at the same time, giving high DSP utilization. When convolution starts, each convolution kernel reads data from the corresponding memory through its own bus interface to perform the calculation.
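The following C sketch gives a behavioural (software) model of one convolution kernel's multiply-add tree: 64 int8 products reduced in stages, with partial results exposed at the 64-, 16-, 4-, 2-, and 1-result levels so that smaller kernel sizes can tap an intermediate level. The exact grouping of the reduction stages is an illustrative reading of the description, not an RTL specification.

```c
/* Behavioural sketch of one convolution kernel's multiply-add tree.
 * The stage grouping (64 -> 16 -> 4 -> 2 -> 1) is an assumption used
 * only to illustrate the idea of per-level partial outputs. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int32_t p64[64];   /* products (level "64 results") */
    int32_t p16[16];   /* after first reduction         */
    int32_t p4[4];
    int32_t p2[2];
    int32_t p1;        /* fully reduced dot product     */
} madd_tree_t;

static void madd_tree(const int8_t a[64], const int8_t w[64], madd_tree_t *t)
{
    for (int i = 0; i < 64; ++i)                 /* multiplier stage */
        t->p64[i] = (int32_t)a[i] * (int32_t)w[i];
    for (int i = 0; i < 16; ++i)                 /* 64 -> 16         */
        t->p16[i] = t->p64[4*i] + t->p64[4*i+1] + t->p64[4*i+2] + t->p64[4*i+3];
    for (int i = 0; i < 4; ++i)                  /* 16 -> 4          */
        t->p4[i] = t->p16[4*i] + t->p16[4*i+1] + t->p16[4*i+2] + t->p16[4*i+3];
    t->p2[0] = t->p4[0] + t->p4[1];              /* 4 -> 2           */
    t->p2[1] = t->p4[2] + t->p4[3];
    t->p1 = t->p2[0] + t->p2[1];                 /* 2 -> 1           */
}

int main(void)
{
    int8_t a[64], w[64];
    madd_tree_t t;

    for (int i = 0; i < 64; ++i) { a[i] = (int8_t)(i % 5); w[i] = 1; }
    madd_tree(&a[0] ? a : a, w, &t);
    /* p4[0] could serve, e.g., as one partial sum of a small kernel tile. */
    printf("full dot product = %d, first 4-way partial = %d\n", t.p1, t.p4[0]);
    return 0;
}
```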
The bias activation unit is used to compute the bias layer or activation layer in the convolutional neural network; in a layer-fusion mode it can apply bias or ReLU activation directly to the output feature map of a convolutional layer, or it can apply bias or ReLU activation to an input feature map on its own. When a bias operation is executed, the bias value is read from the bias control register, the bias calculation is performed, and the result is output to the next-stage processing unit.
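The arithmetic performed by the bias activation unit can be summarized by the following reference code: a per-channel bias is added to the 32-bit convolution results and ReLU is optionally applied. The function name and interface are illustrative.

```c
/* Reference behaviour of the bias/activation unit: add a per-channel
 * bias to 32-bit convolution results, then optionally apply ReLU.
 * Data widths follow the 32-bit output path; names are illustrative. */
#include <stdint.h>
#include <stdio.h>

static void bias_relu(int32_t *x, int n, int32_t bias, int relu_enable)
{
    for (int i = 0; i < n; ++i) {
        int32_t v = x[i] + bias;          /* bias layer      */
        if (relu_enable && v < 0)         /* ReLU activation */
            v = 0;
        x[i] = v;
    }
}

int main(void)
{
    int32_t feature[5] = { -30, -5, 0, 12, 40 };
    bias_relu(feature, 5, 10, 1);
    for (int i = 0; i < 5; ++i)
        printf("%d ", feature[i]);        /* prints: 0 5 10 22 50 */
    printf("\n");
    return 0;
}
```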
The pooling unit is used to compute the pooling layers in the convolutional neural network and supports max pooling and average pooling with a pooling size of 2 × 2 and a stride of 2. Each pooling calculation module comprises two comparison units for max pooling and a 4-input accumulation unit for average pooling.
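The following reference code shows the behaviour of one pooling calculation module for a 2 × 2 window with stride 2, in both max and average modes. The rounding of the average (integer division) is an assumption; the patent does not state it.

```c
/* Reference behaviour of the pooling unit: 2x2 window, stride 2,
 * max or average mode. Integer-division rounding of the average is
 * an assumption made for illustration. */
#include <stdint.h>
#include <stdio.h>

#define H 4
#define W 4

static void pool2x2(const int32_t in[H][W], int32_t out[H/2][W/2], int use_max)
{
    for (int y = 0; y < H; y += 2) {
        for (int x = 0; x < W; x += 2) {
            int32_t a = in[y][x],     b = in[y][x + 1];
            int32_t c = in[y + 1][x], d = in[y + 1][x + 1];
            if (use_max) {
                int32_t m1 = a > b ? a : b;          /* comparator 1 */
                int32_t m2 = c > d ? c : d;          /* comparator 2 */
                out[y / 2][x / 2] = m1 > m2 ? m1 : m2;
            } else {
                out[y / 2][x / 2] = (a + b + c + d) / 4;  /* 4-input sum */
            }
        }
    }
}

int main(void)
{
    int32_t in[H][W] = { {  1,  2,  3,  4 },
                         {  5,  6,  7,  8 },
                         {  9, 10, 11, 12 },
                         { 13, 14, 15, 16 } };
    int32_t out[H/2][W/2];

    pool2x2(in, out, 1);
    printf("max pool: %d %d %d %d\n", out[0][0], out[0][1], out[1][0], out[1][1]);
    pool2x2(in, out, 0);
    printf("avg pool: %d %d %d %d\n", out[0][0], out[0][1], out[1][0], out[1][1]);
    return 0;
}
```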
The output data cache unit is mainly used to cache the output results of the 4 convolution kernels, has an accumulation function, and can select whether the outputs of multiple channels are accumulated. For standard convolution, the convolution results of multiple convolution kernels must be accumulated to obtain a valid output feature map, whereas for depthwise separable convolution no accumulation is needed. Therefore, the accumulation function is placed on the output feature map memory side, and the system control unit controls whether the memory side accumulates and caches the convolution results of the current layer. Once the partial computation of the current layer is completely finished, data are read from the 4 Private_OFM memories and cached in the off-chip DDR3 memory to improve data transfer efficiency.
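The accumulation choice on the output feature map side can be illustrated as follows: for standard convolution the partial results of the 4 convolution kernels are summed into one output channel, while for depthwise (DW) convolution each kernel's result is kept as a separate channel. The array sizes and names in this sketch are illustrative.

```c
/* Behaviour of the output-buffer accumulation: for standard convolution
 * the partial results produced by the 4 convolution kernels (one input-
 * channel group each) are summed into the same output feature map; for
 * depthwise convolution each kernel's result is its own channel. */
#include <stdint.h>
#include <stdio.h>

#define NKERN 4
#define NPIX  8

static void collect(const int32_t partial[NKERN][NPIX],
                    int32_t ofm[NKERN][NPIX], int accumulate)
{
    if (accumulate) {                        /* standard convolution       */
        for (int p = 0; p < NPIX; ++p) {
            int32_t sum = 0;
            for (int k = 0; k < NKERN; ++k)
                sum += partial[k][p];
            ofm[0][p] = sum;                 /* one valid output channel   */
        }
    } else {                                 /* depthwise (DW) convolution */
        for (int k = 0; k < NKERN; ++k)
            for (int p = 0; p < NPIX; ++p)
                ofm[k][p] = partial[k][p];   /* 4 independent channels     */
    }
}

int main(void)
{
    int32_t partial[NKERN][NPIX], ofm[NKERN][NPIX];
    for (int k = 0; k < NKERN; ++k)
        for (int p = 0; p < NPIX; ++p)
            partial[k][p] = (k + 1) * 10 + p;

    collect(partial, ofm, 1);
    printf("standard conv, pixel 0: %d\n", ofm[0][0]);   /* 10+20+30+40 = 100 */
    collect(partial, ofm, 0);
    printf("depthwise conv, kernel 2, pixel 0: %d\n", ofm[2][0]);
    return 0;
}
```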
The invention has the following beneficial effects: the general convolutional neural network acceleration system designed by the invention achieves high-speed computation of convolutional neural networks on Intel Arria 10 series FPGA chips. It can, in principle, accelerate a wide range of convolutional neural networks and be adapted to common convolutional neural network models; its acceleration performance is fully realized simply by pairing it with the corresponding low-level drivers and application software. Compared with CPUs and GPUs, the accelerator has a higher energy-efficiency ratio and better portability, and therefore higher practical reference value; in actual use, the application software and quantization strategy can be designed according to the architecture of the invention and the network model structure so that the acceleration performance of the system is exploited effectively.
The accelerator balances accelerator performance with adaptability to both standard convolution and DW (DepthWise) convolution, can be ported to Intel-series SoC FPGAs, and has broad application prospects. The system of the invention comprises: an ARM processor, a DDR3 memory, an AXI bus interconnect, a convolutional neural network accelerator system control module, a data distribution module, a convolution operation engine module, a bias activation module, a pooling module, input/output data memories and their control modules, and a data collection module. To improve system parallelism, 1024-bit data alignment is adopted in the data distribution module, which enables parallel data transfer and computation and improves the computational efficiency and transmission bandwidth of the system.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. An FPGA-based general convolutional neural network accelerator, comprising: a system control unit, a data forwarding unit, an input data cache unit, a convolution processing unit, a bias activation unit, a pooling unit, and an output data cache unit.
2. The FPGA-based general convolutional neural network accelerator system of claim 1, wherein:
the system control unit is used to control the whole accelerator system, including data loading, execution of computation, and data output;
the data forwarding unit is used for data bit-width conversion and comprises a Data_Scatter module and a Data_Gather module;
(1) the Data_Scatter module receives data sent by the PS side over the data bus, performs bit-width conversion, and distributes the converted data to a shared memory, a private memory, or a weight memory;
(2) the Data_Gather module reads the data in the 4 output feature map memories, performs bit-width conversion, and sends the data to the off-chip DDR3 memory buffer on the PS side over the data bus.
3. The FPGA-based general convolutional neural network accelerator system of claim 1, wherein:
the input data cache unit is mainly used to cache feature map data and weight data in a shared feature map memory, private feature map memories, and a weight memory, respectively; when the 4 convolution kernels share the same feature map data, the feature map data are loaded into Share_IFM, otherwise they are loaded into the 4 Private_IFM memories respectively.
4. The FPGA-based general convolutional neural network accelerator system of claim 1, wherein: the convolution processing unit is used to compute the convolutional layers or fully connected layers in the convolutional neural network; each convolution kernel reads data from memory through its own bus interface to perform the computation, and the system control unit controls the convolution type and the amount of data to be processed.
5. The FPGA-based general convolutional neural network accelerator system of claim 4, wherein: the convolution calculation unit comprises 4 convolution kernels which can work independently and in parallel, giving high parallelism; each convolution kernel is composed of 64 DSP multipliers and 63 adders forming a multiply-add tree, which has 5 pipeline stages and 6 output levels; each level of the multiply-add tree can be output independently, producing 1, 2, 4, 16, or 64 results respectively.
6. The FPGA-based general convolutional neural network accelerator system of claim 1, wherein: the bias activation unit is used to compute the bias layer or activation layer in the convolutional neural network and supports ReLU activation; when a bias operation is executed, the bias value is read from the bias control register, the bias calculation is performed, and the result is output to the next-stage processing unit.
7. The FPGA-based general convolutional neural network accelerator system of claim 1, wherein: the pooling unit is used to compute the pooling layers in the convolutional neural network, supports both max pooling and average pooling, with a pooling size of 2 × 2 and a pooling stride of 2; each pooling calculation module comprises two comparison units for max pooling, and a 4-input accumulation unit with two pipeline stages implements average pooling;
the output data cache unit is mainly used to cache the output results of the convolution processing unit and has an accumulation function, selecting whether the outputs of multiple channels are accumulated before caching: for standard convolution, the results of multiple convolution kernels must be accumulated to obtain a valid output feature map, whereas for depthwise separable convolution no accumulation is needed;
the accumulation function is placed on the output feature map memory side, the system control unit controls whether the memory side accumulates and caches the convolution results of the current layer, and data are read from the 4 Private_OFM memories and cached in the off-chip DDR3 memory once the partial computation of the current layer is complete.
8. An FPGA-based general convolutional neural network acceleration system, characterized in that:
1) a pre-trained convolutional neural network inference model and its parameters are converted into command parameters and weight data that the acceleration system can recognize;
2) in the system initialization stage, the float32 weight data are quantized in advance into 8-bit integer data, and the resulting quantized weights are cached in the off-chip DDR3 memory; the inference model is parsed to obtain the model parameters of each layer, which are stored in the internal memory of the ARM processor;
3) when an image is to be inferred, the ARM processor performs 8-bit quantization on the image data, and the image data are tiled and ordered according to the model parameters;
4) when the inference process starts, the PS side loads data to the accelerator over the data bus; if standard convolution is executed, the same image data are loaded into the Share_IFM on-chip memory and 4 groups of convolution kernel parameters are loaded into the Weight on-chip memory, so that the 4 convolution kernels reuse the image data; if DW convolution is executed, 4 groups of image data are loaded into the 4 Private_IFM private memories and 4 groups of convolution kernel parameters are loaded into the Weight on-chip memory, so that the 4 convolution kernels work independently;
5) after data loading is finished, the PS side sends a command over the control bus to start the 4 convolution kernels; the 4 convolution kernels each load image data from the Share_IFM memory or their Private_IFM private memory, load weight data from the Weight memory, and cache the calculation results in the OFM output feature map memory;
6) when data are cached into the OFM output feature map memory, whether the convolution kernel outputs are accumulated before caching is selected according to the type of the current convolutional layer;
7) after the convolution operation is finished, the PS side sends a command over the control bus to tell the accelerator whether bias, activation, and pooling need to be executed; after the corresponding modules have been invoked in turn, the partial inference results on the PL side are read over the data bus and cached in the off-chip DDR3 memory;
8) steps 4) to 7) are repeated until inference of the whole image is finished; finally, the ARM processor outputs the inference result.
CN202210135910.5A 2022-02-15 2022-02-15 General convolutional neural network accelerator based on FPGA and system thereof Pending CN114881217A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210135910.5A CN114881217A (en) 2022-02-15 2022-02-15 General convolutional neural network accelerator based on FPGA and system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210135910.5A CN114881217A (en) 2022-02-15 2022-02-15 General convolutional neural network accelerator based on FPGA and system thereof

Publications (1)

Publication Number Publication Date
CN114881217A true CN114881217A (en) 2022-08-09

Family

ID=82667548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210135910.5A Pending CN114881217A (en) 2022-02-15 2022-02-15 General convolutional neural network accelerator based on FPGA and system thereof

Country Status (1)

Country Link
CN (1) CN114881217A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118070855A (en) * 2024-04-18 2024-05-24 南京邮电大学 Convolutional neural network accelerator based on RISC-V architecture


Similar Documents

Publication Publication Date Title
CN108647773B (en) Hardware interconnection system capable of reconstructing convolutional neural network
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
CN110852428B (en) Neural network acceleration method and accelerator based on FPGA
CN110084739A (en) A kind of parallel acceleration system of FPGA of the picture quality enhancement algorithm based on CNN
CN112799726B (en) Data processing device, method and related product
CN113033794B (en) Light weight neural network hardware accelerator based on deep separable convolution
CN114781632A (en) Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine
Chen et al. Dygnn: Algorithm and architecture support of dynamic pruning for graph neural networks
CN111832718A (en) Chip architecture
CN113792621B (en) FPGA-based target detection accelerator design method
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN111860773B (en) Processing apparatus and method for information processing
CN110209627A (en) A kind of hardware-accelerated method of SSD towards intelligent terminal
CN111752879B (en) Acceleration system, method and storage medium based on convolutional neural network
CN114881217A (en) General convolutional neural network accelerator based on FPGA and system thereof
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
Xia et al. PAI-FCNN: FPGA based inference system for complex CNN models
CN114595813A (en) Heterogeneous acceleration processor and data calculation method
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
Liu et al. Design an efficient DNN inference framework with PS-PL synergies in FPGA for edge computing
CN114154630A (en) Hardware accelerator for quantifying MobileNet and design method thereof
Zhu et al. Tanji: A general-purpose neural network accelerator with unified crossbar architecture
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
Hazarika et al. Hardware efficient convolution processing unit for deep neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination