CN115577747A - High-parallelism heterogeneous convolutional neural network accelerator and acceleration method

High-parallelism heterogeneous convolutional neural network accelerator and acceleration method

Info

Publication number
CN115577747A
Authority
CN
China
Prior art keywords
module
subsystem
data
chip storage
storage module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211155291.2A
Other languages
Chinese (zh)
Inventor
潘晓英
穆元震
王昊
贾凝心
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications filed Critical Xian University of Posts and Telecommunications
Priority to CN202211155291.2A priority Critical patent/CN115577747A/en
Publication of CN115577747A publication Critical patent/CN115577747A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a high-parallelism heterogeneous convolutional neural network accelerator and an acceleration method. The accelerator comprises a control subsystem, a parallel processing subsystem and a storage subsystem. A control module in the control subsystem loads input data from an off-chip storage module to an on-chip storage module, or writes output data back from the on-chip storage module to the off-chip storage module. The parallel processing subsystem processes the input data and the weights and performs convolution or pooling operations according to the instructions it receives. The invention fully utilizes the control and logic computing resources of a heterogeneous processor, achieves data processing with higher parallelism, and enables the system to implement convolutional neural networks with lower power consumption and higher performance.

Description

High-parallelism heterogeneous convolutional neural network accelerator and acceleration method
Technical Field
The invention belongs to the field of convolutional neural network accelerators, and particularly relates to a heterogeneous convolutional neural network accelerator with high parallelism and an acceleration method.
Background
Convolutional neural networks have made great progress in the field of computer vision and are widely applied in scenarios such as image classification, target detection and security-monitoring video analysis. Compared with traditional methods, convolutional neural networks rely on deeper network layers and more model parameters to gain a strong ability to learn and extract features from large amounts of data, and they surpass human accuracy in many visual tasks.
As convolutional neural network applications are increasingly deployed in production, they place high demands on processing latency and device power consumption. At present, the computing platforms for convolutional neural networks fall mainly into three categories: CPUs, GPUs and FPGAs. CPUs are good at serial control flow but are at a clear disadvantage when processing the large-scale parallel neuron computations of a neural network; GPUs offer a degree of parallelism but consume too much power to meet the power budget of deployed services; FPGAs have strong parallel processing capability, are well suited to large amounts of parallel operations, and feature dynamic configurability and low operating power consumption, but have certain shortcomings in task control. How to exploit the advantages of multiple computing platforms and design a convolutional neural network accelerator based on a heterogeneous processor, so as to improve processing real-time performance and reduce power consumption, has attracted increasing attention from researchers. Related research on heterogeneous convolutional neural network accelerators is being actively explored at home and abroad; existing structures include DPU acceleration engines based on the Zynq platform and TPU accelerators adopting a systolic array structure. Although some results have been achieved, these designs are far from mature: their computational parallelism is low, they cannot fully utilize computing resources, and in edge-computing scenarios the execution of the convolutional neural network accelerator still suffers from high power consumption and poor real-time performance. The field therefore retains great research value and room for development.
Disclosure of Invention
The invention aims to provide a high-parallelism heterogeneous convolutional neural network accelerator and an acceleration method, to solve the technical problems of high power consumption and poor real-time performance that embedded systems face during convolutional neural network inference in edge-computing scenarios, and to provide a new technical approach for accelerating convolutional neural network processing.
To achieve this goal, the technical scheme of the invention is as follows. A high-parallelism heterogeneous convolutional neural network accelerator comprises a control subsystem, a parallel processing subsystem and a storage subsystem; the control subsystem is connected with the parallel processing subsystem, the control subsystem is connected with the storage subsystem, and the storage subsystem is connected with the parallel processing subsystem. The control subsystem comprises a control module, an instruction module and a configurable weight quantization module; the parallel processing subsystem comprises a convolution parallel computing module and a pooling module; the storage subsystem comprises an off-chip storage module and an on-chip storage module.
Furthermore, the convolution calculation module is composed of a multi-level array of processing units: the first-level processing unit array handles the input channels in parallel, and the second-level processing unit array handles the output channels in parallel. The processing is pipelined, with the three stages of reading data, computing, and writing data back forming a multi-stage pipeline.
Furthermore, the on-chip memory module is composed of a buffer memory hierarchy, a cache memory hierarchy and a first-in first-out memory hierarchy.
Further, the acceleration method of the heterogeneous convolutional neural network accelerator with high parallelism comprises the following steps:
Step one: in the control subsystem, the control module initializes the off-chip storage module and writes the corresponding configuration file, network structure file, input pictures and weight data into the off-chip storage module;
Step two: the control module drives the instruction module; the instruction module reads the network structure file in the off-chip storage module, automatically parses the file, generates the corresponding configuration instruction and scheduling instruction, sends the configuration instruction to the configurable weight quantization module, and sends the scheduling instruction to the parallel processing subsystem;
Step three: the configurable weight quantization module reads the configuration file in the external memory, calls a quantization program according to the configuration file, quantizes the weight data accordingly, and writes the quantized weight data back to the off-chip storage module;
Step four: the on-chip storage module reads the input pictures and weight data from the off-chip storage module, buffers the current line of data and the two adjacent lines of data, and loads the corresponding data into the corresponding FIFO blocks;
Step five: the parallel processing subsystem parses the scheduling instruction and executes the convolution calculation module and the pooling calculation module in turn, following the order of the network structure;
Step six: when the convolution calculation module unrolls the input and output feature maps, the parallelism is set to Tm and Tn; Tm input feature maps and Tn output feature maps are unrolled at a time, and Tm × Tn groups of vectors are processed in parallel at a time. During computation, the convolution calculation module unrolls the multi-channel input feature map along the two dimensions of input channel and output channel and performs multiply-accumulate operations on the data of the two dimensions in parallel;
Step seven: the pooling calculation module takes the calculation results of the convolution calculation module as input, performs multiple comparisons to obtain the final calculation result, and writes the result back to the off-chip storage module through the on-chip storage module.
Compared with the prior art, the invention has the beneficial effects that:
1. The parallel computing module proposed by this method resolves the read-write dependence of convolution operations present in traditional accelerator computing designs: the read and write accesses of two adjacent cycles to the same memory block are redirected to memory addresses in different blocks (a minimal ping-pong buffering sketch follows this list), so the convolution operation can be fully pipelined, the utilization of the processor's computing and logic resources increases, and the operation time is reduced.
2. The on-chip storage module proposed by this method caches the current line and its two adjacent lines through a multi-level storage structure, improving data reuse and reducing the number of memory accesses. By designing FIFO cache levels that correspond to the parallel computing units, each FIFO block provides data reading and write-back only for its corresponding processing unit array, so data are loaded to the computing units in parallel and processed with higher parallelism.
3. The accelerator proposed by this method supports dynamic bit-width quantization of the weight data, making it suitable for operating on weight data of various bit widths.
4. The accelerator proposed by this method is programmable and dynamically reconfigurable: different network structures can be accelerated by setting different configuration commands in the instruction module of the control subsystem, and the numbers of input and output channels, the convolution kernel size, and the computational parallelism can also be programmed and configured dynamically.
5. The accelerator proposed by the invention implements a hardware-optimized network structure that pools before activating: the pooling function reduces the size of the feature map fed into the activation function, thereby reducing the amount of computation in the network.
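As a concrete illustration of benefit 1, the following C++ sketch shows the ping-pong (double) buffering pattern in which consecutive iterations write and read different memory blocks, so the write-back of one tile never blocks the computation of the next. The bank depth, the stand-in compute and drain loops, and the function name pipelined_layer are assumptions for illustration only; the patent itself only states that adjacent cycles access different memory blocks.

```cpp
#include <cstdint>
#include <cstddef>

using acc32 = int32_t;
constexpr std::size_t DEPTH = 256;   // illustrative tile size

// Ping-pong buffering sketch: tile t fills one bank while the other bank,
// written by tile t-1, is drained, so the convolution loop can be fully pipelined.
void pipelined_layer(std::size_t tiles)
{
    static acc32 bank[2][DEPTH] = {};             // two independent on-chip blocks
    for (std::size_t t = 0; t < tiles; ++t) {
        acc32* produce       = bank[t & 1];       // this tile writes one bank...
        const acc32* consume = bank[(t + 1) & 1]; // ...while the other is drained

        for (std::size_t i = 0; i < DEPTH; ++i)   // stand-in for the MAC stage
            produce[i] = static_cast<acc32>(t + i);

        acc32 sink = 0;                           // stand-in for the write-back stage;
        for (std::size_t i = 0; i < DEPTH; ++i)   // in hardware the two stages overlap
            sink += consume[i];
        (void)sink;
    }
}
```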
Drawings
FIG. 1 is a schematic diagram of the overall architecture of a convolutional neural network accelerator used in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a configurable weight quantization module method used in embodiments of the present invention;
FIG. 3 is a schematic diagram of a method for using an on-chip memory module in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a parallel computing subsystem method used in embodiments of the present invention;
FIG. 5 is a schematic diagram of a convolution calculation module method used in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A high-parallelism heterogeneous convolutional neural network accelerator is designed and implemented on a Xilinx Zynq UltraScale+ MPSoC heterogeneous processor chip. As shown in fig. 1, the accelerator comprises a control subsystem, a parallel processing subsystem and a storage subsystem; the control subsystem is connected with the parallel processing subsystem, the control subsystem is connected with the storage subsystem, and the storage subsystem is connected with the parallel processing subsystem. The control subsystem comprises a control module, an instruction module and a configurable weight quantization module; the parallel processing subsystem comprises a convolution parallel computing module and a pooling module; the storage subsystem comprises an off-chip storage module and an on-chip storage module.
The control module loads input data from the off-chip storage module to the on-chip storage module, or writes output data from the on-chip storage module back to the off-chip storage module. The on-chip storage module caches the input data to be processed and the operation results produced by the parallel computing module. The configurable weight quantization module reads weights from the on-chip storage module and quantizes them according to the configuration file. The instruction module mainly obtains instructions from the off-chip storage module and drives the parallel computing module to execute operations. The parallel processing subsystem processes the input data and the weights and performs the operations of the convolution calculation module or the pooling calculation module according to the instructions.
Specifically, for the control subsystem:
the control module is realized based on an ARM CPU processor and a Linux operating system, and the configuration of the instruction module and the quantification module and the scheduling of the storage module and the calculation module are completed by deploying the corresponding control system; the interface between the control system and the outside is determined by an on-chip bus protocol, the instruction transmission interface is realized by an AXI-Lite interface, and the transmission interfaces among the on-chip memory module, the off-chip memory module and the parallel computing module are realized by an AXI interface. The control module has the function of loading input data from the off-chip storage module to the on-chip storage module or writing the data from the off-chip storage module back to the on-chip storage module.
The instruction module mainly obtains instructions from the external memory and drives the parallel computing module to execute the corresponding operations, or generates configuration files with different quantization bit widths. It is responsible for initializing the configurable weight quantization module and for the initialization, execution and scheduling of the parallel computing module; it reads the network structure and configuration parameters stored in the external memory and performs the parameter configuration for the initialization, data scheduling and computation scheduling of the parallel computing module. The configurable weight quantization module reads the weights from the on-chip storage module and quantizes them according to the configuration file; it supports multiple quantization bit widths.
The parallel processing subsystem comprises:
the convolution calculation module expands the multi-channel input characteristic graph along two dimensions of an input channel and an output channel, performs multiply-accumulate operation on data of the two dimensions in parallel, and consists of a multi-level processing unit array, wherein the processing unit array of a first level is responsible for parallel processing of the input channel, and the processing unit of a second level is responsible for parallel processing of the output channel; the processing process is realized based on a pipeline mode, and multi-stage pipeline is realized by three processes of reading data, operating and writing back data.
The convolutional neural network model mainly comprises convolutional layers, pooling layers and fully-connected layers. The convolutional layer is essentially a matrix multiply-accumulate operation, and the computation of the fully-connected layer can also be abstracted as a matrix multiply-accumulate operation. During computation, the operations of both layer types are therefore mapped onto the same operation and completed by reusing the convolution calculation module.
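A minimal C++ sketch of this point: a fully-connected layer is the same matrix multiply-accumulate that the convolution module performs, so the same hardware can be reused. The fixed-point stand-in types and the function signature are illustrative assumptions.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

using fixed16 = int16_t;   // stand-in for a 16-bit fixed-point type
using acc32   = int32_t;   // wider accumulator

// A fully-connected layer expressed as a matrix multiply-accumulate: every
// output neuron is a dot product over all input activations, i.e. the same
// MAC pattern the convolution module applies per output channel.
std::vector<acc32> fully_connected(const std::vector<fixed16>& in,             // N inputs
                                   const std::vector<std::vector<fixed16>>& w) // M x N weights
{
    std::vector<acc32> out(w.size(), 0);
    for (std::size_t m = 0; m < w.size(); ++m)        // output neurons (output channels)
        for (std::size_t n = 0; n < in.size(); ++n)   // inputs (input channels)
            out[m] += static_cast<acc32>(w[m][n]) * static_cast<acc32>(in[n]);
    return out;
}
```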
The convolution calculation module consists of multiply-accumulate units built from a multiplier, an adder and an intermediate-result register, and is mainly responsible for the multiply-accumulate operations of the matrix; each multiply-accumulate unit computes several adjacent convolution sliding windows in a time-shared manner. The data are unrolled along the two dimensions of input channel and output channel, and the multiply-accumulate operations of the two dimensions are performed in parallel. In the hardware structure, parallel computation is implemented by a multi-level array of processing units: the first-level processing unit array handles the input channels in parallel, and the second-level processing unit array handles the output channels in parallel; the number of channels processed in parallel at each level is determined by the parallelism in the configuration file.
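The following HLS-style C++ sketch illustrates one pipeline iteration of the two-level unrolling described above. The concrete values of Tm, Tn and K, the fixed-point stand-in types and the pragma placement are assumptions for illustration rather than the patent's actual implementation.

```cpp
#include <cstdint>

// Illustrative compile-time parallelism (the patent reads these from the
// configuration file at run time; fixed here for simplicity).
constexpr int Tn = 4;   // input-channel parallelism
constexpr int Tm = 8;   // output-channel parallelism

using fixed16 = int16_t;   // stand-in for a 16-bit fixed-point type
using acc32   = int32_t;   // wider accumulator for the intermediate-result register

// One pipeline iteration: multiply-accumulate a Tn-wide slice of the input
// window against Tm x Tn weights, accumulating into Tm partial sums.
void conv_mac_tile(const fixed16 in_vec[Tn],          // Tn input channels, same pixel
                   const fixed16 weights[Tm][Tn],     // weights for Tm output channels
                   acc32 partial_sum[Tm])             // intermediate-result registers
{
    for (int m = 0; m < Tm; ++m) {        // unrolled in hardware: output channels
#pragma HLS UNROLL
        acc32 acc = 0;
        for (int n = 0; n < Tn; ++n) {    // unrolled in hardware: input channels
#pragma HLS UNROLL
            acc += static_cast<acc32>(in_vec[n]) * static_cast<acc32>(weights[m][n]);
        }
        partial_sum[m] += acc;            // Tm x Tn MACs retired per pipelined cycle
    }
}
```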
The pooling calculation module is composed of comparators and implements the comparison of two numbers. It takes the output feature map produced by the convolution calculation module as input, feeds the data within a sliding window into the comparator, and obtains the maximum value in the window. This operation is repeated until the sliding window has traversed the entire input, yielding the final pooled result, which is written back to the off-chip storage module.
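A minimal C++ sketch of the comparator-based maximum pooling described above, assuming a P x P window and the same illustrative 16-bit fixed-point stand-in type.

```cpp
#include <algorithm>
#include <cstdint>

using fixed16 = int16_t;  // illustrative fixed-point stand-in

// Max pooling over one P x P sliding window, built from pairwise comparisons
// exactly as a comparator-based pooling unit would: repeated two-input
// compares until the window is exhausted.
template <int P>
fixed16 max_pool_window(const fixed16 window[P][P])
{
    fixed16 current_max = window[0][0];
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j)
            current_max = std::max(current_max, window[i][j]);  // one comparison per step
    return current_max;
}
```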
For the storage subsystem:
the on-chip memory module is composed of a plurality of memory hierarchies: buffer storage hierarchy, cache storage hierarchy, FIFO storage hierarchy. Buffer storage hierarchy is mainly used for storing input image data and quantized weight data, cache storage hierarchy is mainly used for storing data of a current line and data of two adjacent lines, and FIFO storage hierarchy is mainly used for storing data in a sliding window being processed.
The off-chip storage module consists of DRAM and is mainly used to store the model's input data, the model's weight parameters, the network structure files and the configuration files.
A method for accelerating a heterogeneous convolutional neural network accelerator with high parallelism comprises the following steps:
Step 1: referring to fig. 1, in the control subsystem, the control module initializes the off-chip storage module and stores the network structure file, the input image, the configuration file and the weight data of the convolutional neural network model onto the SD card or into the external DDR memory.
Step 2: the control module drives the instruction module, and the instruction module reads the network structure file in the external memory, automatically analyzes the file and generates a corresponding configuration instruction and a corresponding scheduling instruction.
Step 3: the configurable weight quantization module reads the configuration file in the external memory, calls a quantization program according to the configuration file, quantizes the weight data accordingly (see fig. 2), and writes the quantized weight data back to the external memory. The quantization scheme is quantization-aware training, realized in three stages: model training, parameter quantization and quantization calibration. The model is first trained offline on a cloud server; the model parameters are then quantized by a quantization tool, converting 32-bit floating-point numbers into 16-bit fixed-point numbers; finally, an accuracy check is performed with an unlabeled calibration data set to ensure that the accuracy remains within an acceptable range.
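A hedged C++ sketch of the 32-bit float to 16-bit fixed-point conversion step follows; the number of fractional bits, the rounding rule and the saturation policy are assumptions, since the patent only specifies the source and target formats.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Convert 32-bit floating-point weights to 16-bit fixed-point values with a
// configurable number of fractional bits (as would be read from the
// configuration file). Rounding and saturation here are illustrative choices.
std::vector<int16_t> quantize_weights(const std::vector<float>& weights, int frac_bits)
{
    const float scale = std::ldexp(1.0f, frac_bits);   // 2^frac_bits
    std::vector<int16_t> out;
    out.reserve(weights.size());
    for (float w : weights) {
        long q = std::lround(w * scale);                // round to nearest integer
        q = std::clamp(q, -32768L, 32767L);             // saturate to the int16 range
        out.push_back(static_cast<int16_t>(q));
    }
    return out;
}
```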
Step 4: referring to fig. 3, the input data and weight data are loaded from the off-chip storage module to the on-chip storage module, the current line of data and the two adjacent lines of data are buffered, and the corresponding data are loaded into the corresponding FIFO blocks.
Step 5: the parallel processing subsystem parses the scheduling instruction, completes the initial configuration of the parallel computing module, and determines the hardware template parameters of the convolution operation: the number of input channels, the number of output channels, the convolution kernel size and the parallelism; the convolution calculation module and the pooling calculation module are then executed in turn, following the order of the network structure.
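A small C++ sketch gathering the hardware template parameters listed in this step into one configuration record; the field names and widths are illustrative assumptions.

```cpp
#include <cstdint>

// Hardware template parameters resolved from the scheduling instruction for one
// layer: input channels, output channels, kernel size and the two parallelism
// factors. Field names are assumptions for illustration.
struct ConvLayerConfig {
    uint16_t in_channels;    // number of input feature maps
    uint16_t out_channels;   // number of output feature maps
    uint8_t  kernel_size;    // convolution kernel width/height
    uint8_t  tm;             // feature-map parallelism factor Tm
    uint8_t  tn;             // feature-map parallelism factor Tn
};

// Example: a 3x3 layer with 64 inputs and 128 outputs, unrolled 8x4.
constexpr ConvLayerConfig layer0{64, 128, 3, 8, 4};
```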
Step 6: referring to fig. 5, in the parallel processing subsystem the convolution calculation module unrolls the input and output feature maps as follows: the parallelism is set to Tm and Tn, Tm input feature maps and Tn output feature maps are unrolled at a time, and Tm × Tn groups of vectors are processed in parallel at a time. Referring to fig. 4, the convolution operation unrolls the multi-channel input feature map along the two dimensions of input channel and output channel and performs multiply-accumulate operations on the data of the two dimensions in parallel. Its processing is divided into three stages: loading the input picture data and weight data to be processed from the FIFO cache blocks of the on-chip storage module, executing the convolution operation to multiply and accumulate the inputs and weights, and writing the resulting output feature map back from the on-chip storage module to the off-chip storage module.
Step 7: the processing of the pooling calculation module is likewise divided into three stages: reading the output feature map from the external memory and loading it into the on-chip storage module, executing the comparison or averaging operation, and writing the final result back to the off-chip storage module.
The present invention has been described in terms of specific examples, which are provided to aid understanding of the invention and are not intended to be limiting. Any partial modification or replacement within the technical scope of the present disclosure by a person skilled in the art should be included in the scope of the present disclosure.

Claims (4)

1. A heterogeneous convolutional neural network accelerator with high parallelism is characterized by comprising a control subsystem, a parallel processing subsystem and a storage subsystem, wherein the control subsystem is connected with the parallel processing subsystem, the control subsystem is connected with the storage subsystem, and the storage subsystem is connected with the parallel processing subsystem; the control subsystem comprises a control module, an instruction module and a configurable weight quantization module; the processing subsystem comprises a convolution parallel computing module and a pooling module; the storage subsystem comprises an off-chip storage module and an on-chip storage module.
2. The heterogeneous convolutional neural network accelerator of claim 1, wherein the convolutional calculation module is composed of multiple hierarchical processing unit arrays, the processing unit array of the first hierarchical level is responsible for parallel processing of input channels, and the processing unit of the second hierarchical level is responsible for parallel processing of output channels; the processing process is realized based on a pipeline mode, and the three processes of reading data, operating and writing back data realize multi-stage pipeline.
3. The heterogeneous convolutional neural network accelerator of claim 2, wherein the on-chip memory module is comprised of a buffer memory hierarchy, a cache memory hierarchy, and a first-in-first-out memory hierarchy.
4. The acceleration method of the heterogeneous convolutional neural network accelerator with high parallelism as claimed in claim 1, comprising the following steps:
Step one: in the control subsystem, the control module initializes the off-chip storage module and writes the corresponding configuration file, network structure file, input pictures and weight data into the off-chip storage module;
Step two: the control module drives the instruction module; the instruction module reads the network structure file in the off-chip storage module, automatically parses the file, generates the corresponding configuration instruction and scheduling instruction, sends the configuration instruction to the configurable weight quantization module, and sends the scheduling instruction to the parallel processing subsystem;
Step three: the configurable weight quantization module reads the configuration file in the external memory, calls a quantization program according to the configuration file, quantizes the weight data accordingly, and writes the quantized weight data back to the off-chip storage module;
Step four: the on-chip storage module reads the input pictures and weight data from the off-chip storage module, buffers the current line of data and the two adjacent lines of data, and loads the corresponding data into the corresponding FIFO blocks;
Step five: the parallel processing subsystem parses the scheduling instruction and executes the convolution calculation module and the pooling calculation module in turn, following the order of the network structure;
Step six: when the convolution calculation module unrolls the input and output feature maps, the parallelism is set to Tm and Tn; Tm input feature maps and Tn output feature maps are unrolled at a time, and Tm × Tn groups of vectors are processed in parallel at a time. During computation, the convolution calculation module unrolls the multi-channel input feature map along the two dimensions of input channel and output channel and performs multiply-accumulate operations on the data of the two dimensions in parallel;
Step seven: the pooling calculation module takes the calculation results of the convolution calculation module as input, performs multiple comparisons to obtain the final calculation result, and writes the result back to the off-chip storage module through the on-chip storage module.
CN202211155291.2A 2022-09-22 2022-09-22 High-parallelism heterogeneous convolutional neural network accelerator and acceleration method Pending CN115577747A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211155291.2A CN115577747A (en) 2022-09-22 2022-09-22 High-parallelism heterogeneous convolutional neural network accelerator and acceleration method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211155291.2A CN115577747A (en) 2022-09-22 2022-09-22 High-parallelism heterogeneous convolutional neural network accelerator and acceleration method

Publications (1)

Publication Number Publication Date
CN115577747A true CN115577747A (en) 2023-01-06

Family

ID=84580590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211155291.2A Pending CN115577747A (en) 2022-09-22 2022-09-22 High-parallelism heterogeneous convolutional neural network accelerator and acceleration method

Country Status (1)

Country Link
CN (1) CN115577747A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115906948A (en) * 2023-03-09 2023-04-04 浙江芯昇电子技术有限公司 Full-connection-layer hardware acceleration device and method


Similar Documents

Publication Publication Date Title
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
CN108229670B (en) Deep neural network acceleration platform based on FPGA
CN111967468B (en) Implementation method of lightweight target detection neural network based on FPGA
CN111178518A (en) Software and hardware cooperative acceleration method based on FPGA
US11748599B2 (en) Super-tiling in neural network processing to enable analytics at lower memory speed
CN111626403B (en) Convolutional neural network accelerator based on CPU-FPGA memory sharing
CN113792621B (en) FPGA-based target detection accelerator design method
CN111240743A (en) Artificial intelligence integrated circuit
CN113051216A (en) MobileNet-SSD target detection device and method based on FPGA acceleration
CN111105023A (en) Data stream reconstruction method and reconfigurable data stream processor
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
EP4044070A2 (en) Neural network processing unit, neural network processing method and device
CN115577747A (en) High-parallelism heterogeneous convolutional neural network accelerator and acceleration method
CN113392973A (en) AI chip neural network acceleration method based on FPGA
CN112799599A (en) Data storage method, computing core, chip and electronic equipment
CN113158968A (en) Embedded object cognitive system based on image processing
US20080082790A1 (en) Memory Controller for Sparse Data Computation System and Method Therefor
CN116822600A (en) Neural network search chip based on RISC-V architecture
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
CN114662681B (en) YOLO algorithm-oriented general hardware accelerator system platform capable of being rapidly deployed
CN112035056B (en) Parallel RAM access equipment and access method based on multiple computing units
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
CN111325327B (en) Universal convolution neural network operation architecture based on embedded platform and use method
KR20220125117A (en) Neural processor
WO2021120036A1 (en) Data processing apparatus and data processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination