CN112329910A - Deep convolutional neural network compression method for structure pruning combined quantization - Google Patents

Deep convolutional neural network compression method for structure pruning combined quantization

Info

Publication number
CN112329910A
Authority
CN
China
Prior art keywords
weight
pruning
network
unit
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011071970.2A
Other languages
Chinese (zh)
Other versions
CN112329910B (en)
Inventor
陆生礼
付成龙
庞伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202011071970.2A priority Critical patent/CN112329910B/en
Publication of CN112329910A publication Critical patent/CN112329910A/en
Application granted granted Critical
Publication of CN112329910B publication Critical patent/CN112329910B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a deep convolutional neural network compression method for structure pruning combined with quantization, belonging to the technical field of computing, calculating or counting. The compression method updates the weight parameters by gradient descent during back propagation and performs structure pruning on the updated weights according to the network accuracy loss, so that the weight parameters can be updated continuously and convolution calculation is accelerated. An accelerator implemented on an FPGA takes the weight codes produced by the compression method as input data and performs highly parallel shift-and-accumulate operations between feature values and the structure-pruned weights through a parallel computing array of 448 SQPEs. It supports convolution operations of structurally pruned sparse networks with different compression ratios, realizing a hardware-friendly, low-power, high-throughput accelerator.

Description

Deep convolutional neural network compression method for structure pruning combined quantization
Technical Field
The invention relates to electronic information and deep learning technology, in particular to a deep convolutional neural network accelerator for structure pruning combined with quantization, and belongs to the technical field of computing, calculating or counting.
Background
In recent years, the accumulation of big data, innovation in theoretical algorithms, improvements in computing power and the evolution of network infrastructure have enabled the artificial intelligence industry, which has been accumulating for more than half a century, to make revolutionary progress again, and research and application of artificial intelligence have entered a brand-new stage of development. New deep neural networks applied to artificial intelligence are updated and iterated constantly, and the number of layers keeps growing. From the earliest LeNet for digit recognition, to AlexNet, the 2012 ILSVRC champion, to GoogLeNet in 2014, image recognition accuracy keeps increasing, accompanied by a multiplication of network computation and network parameters. However, in fields with strong real-time requirements and in low-power mobile scenarios, such as smartphones and autonomous driving, the huge amounts of computation and parameters become the main obstacles to deploying deep neural networks. How to design a deep convolutional neural network accelerator with high throughput and high energy efficiency has therefore become the key problem.
From current research, pruning and quantization are the trend for designing convolutional neural networks. Many zero-valued parameters exist in a network; pruning can remove most redundant parameters and store only the effective weight information. Reasonable pruning has little influence on network accuracy, and removing part of the redundant parameters through network pruning can even improve the accuracy of network inference. However, the parameters removed by pruning are randomly distributed, which causes load imbalance and low computational efficiency in the accelerator. Quantization converts high-precision weights into low-precision discrete values, which reduces memory requirements and operation power consumption. Some designs quantize the weight parameters with an incremental quantization scheme and replace multiplication by shifting according to the quantization characteristics, but they only support a fixed range of weight values and cannot support quantization when the parameters of different layers have different ranges. The invention aims to design a compression algorithm combining structured pruning with quantization and to realize a hardware-friendly, low-power, high-throughput accelerator.
Disclosure of Invention
The invention aims to provide a deep convolutional neural network compression method for structure pruning combined with quantization that addresses the defects of the background art. An accelerator realized on the basis of the compression method applies structure pruning and quantization suited to the regular data flow of a systolic array, and solves the technical problems that existing pruning and quantization schemes suffer from load imbalance and low computational efficiency and cannot support quantization of the parameters of different layers.
The present design is based on a systolic array architecture. An accelerator combining structure pruning with a quantization algorithm has obvious advantages: structure pruning is regular pruning and fits the regular data flow of a systolic array, so the load-imbalance problem of a sparse network in the accelerator can be solved and the throughput improved. Moreover, no complicated encoders or decoders need to be designed, which reduces design complexity.
The invention adopts the following technical scheme for realizing the aim of the invention:
combining structure-oriented pruning with a quantization compression algorithm, firstly, carrying out quantization processing on a network, and directly setting a pre-trained network weight parameter to be a power number adjacent to the value of the pre-trained network weight parameter according to the value of the pre-trained network weight parameter; then retraining, adjusting the network weight parameter value, determining whether the parameter is continuously changed according to the gradient of the back propagation, wherein the gradient is a regular parameter and jumps down by an exponential level to reach 2-13The value boundary is not changed, if the gradient is negative, the parameter jumps up by an exponential step, if the value reaches 24The boundary of (2) is kept unchanged. The formula for updating the weight parameter is shown in 1-1, wherein w' represents the updated weight, w represents the original weight, and L represents the loss function, and the cross entropy is adopted as the loss function in the experiment. After training is finished, counting the numerical range of the weight parameter of each layer, and configuring the parameter range of each layer of the circuit according to the range; then, performing structured pruning on the network weight parameters, specifically, inputting the weight parameters into channels, grouping each group of eight channels, and then reserving a certain value at all the same positions in the eight channels, for example, reserving one value at the position of eight channels (0,0) and reserving one value at the position of (0,1), so that the eight channels are compressed into one channel; and finally, evaluating which value is kept to have the minimum influence on the loss of the network precision by adopting a self-adaptive algorithm, extracting power value data of the quantized weight and the relative position of a reserved channel in the structural pruning, and performing offline coding.
Formula 1-1:
w' = w/2, if ∂L/∂w > 0 and |w| > 2^(-13)
w' = 2w, if ∂L/∂w < 0 and |w| < 2^4
w' = w, otherwise
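As an illustration, the following is a minimal Python sketch of the compression flow described above. The group size of eight and the exponent bounds 2^(-13) to 2^4 are taken from the description; the function names are illustrative, and the adaptive evaluation of which value to retain is simplified here to keeping the largest-magnitude value per position, which is an assumption rather than the adaptive algorithm of the invention.

import numpy as np

LOW_EXP, HIGH_EXP = -13, 4   # exponent bounds from the description: 2**-13 .. 2**4

def quantize_to_power_of_two(w):
    # Set every weight to the power of two adjacent to its value, keeping the sign.
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), LOW_EXP, HIGH_EXP)
    return np.sign(w) * np.exp2(exp)

def update_weights(w, grad):
    # Formula 1-1: jump one exponent level down (positive gradient) or up (negative gradient).
    exp = np.round(np.log2(np.abs(w) + 1e-12))
    exp = np.where(grad > 0, np.maximum(exp - 1, LOW_EXP), exp)
    exp = np.where(grad < 0, np.minimum(exp + 1, HIGH_EXP), exp)
    return np.sign(w) * np.exp2(exp)

def structured_prune(kernel, group=8):
    # kernel: (C_in, K, K). Compress every `group` input channels into one channel by
    # retaining, at each position, a single value across the group (largest magnitude
    # here, as a stand-in for the adaptive evaluation in the description).
    vals, idxs = [], []
    for g in range(0, kernel.shape[0], group):
        block = kernel[g:g + group]                     # (group, K, K)
        idx = np.argmax(np.abs(block), axis=0)          # retained channel per position
        vals.append(np.take_along_axis(block, idx[None], axis=0)[0])
        idxs.append(idx)                                # relative position for offline coding
    return np.stack(vals), np.stack(idxs)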
A structure-oriented pruning combined quantization deep convolutional neural network hardware accelerator comprises: a bus interface unit, a configuration register unit, a storage scheduling unit, an accelerator control unit, an on-chip cache unit, a parallel computing unit and a functional unit. The bus interface unit provides master and slave AXI4 bus interfaces, so the accelerator can be mounted on any bus using the AXI4 protocol. An external controller (e.g., a CPU) sends control words to the configuration register unit through the bus interface unit to complete the initial configuration of the accelerator. After the initial configuration is completed, the accelerator control unit controls the operation of the circuit according to the configuration information. The accelerator control unit first starts the storage scheduling unit to read external data; according to the configuration information, the storage scheduling unit only needs a start address and a data amount to carry data automatically. The storage scheduling unit actively reads data from the off-chip memory through the master interface of the bus interface unit and places the data in the corresponding area of the on-chip cache unit according to the data type. The data cached by the on-chip cache unit includes feature values, weights, normalization parameters, partial sums and final output results; the on-chip cache unit also contains a feature value prefetch cache, which provides parallel input data streams for the PAU.
The PAU consists of two 14 × 16 arrays composed of 448 SQPE units in total, and reads data from the feature value prefetch cache and the weight sub-buffer to perform convolution operations. After one shift-and-accumulate pass, each SQPE in the same row obtains one point of a different output feature map, and when the SQPEs in the same column finish computing, one column of data of the same output feature map is obtained. The PAU organizes the SQPEs so that feature values are multiplexed in the row direction and weights are multiplexed in the column direction, which yields relatively fixed data streams and reduces the complexity of the read-out control logic of the external cache unit caused by network sparsity.
The SQPE unit is the basic building block of the PAU and is composed of a decoder, a feature value selector, a general shifter and a controller. The controller controls the running state of the SQPE unit: starting shift accumulation, completing calculation, or sequentially outputting the calculation results. The input weight is the offline-coded weight, and the channel selection index, shift amplitude and shift direction are obtained through the decoder. Because of the channel pruning, several feature values along the channel direction of the feature map are input to the SQPE unit in one clock; the data to be calculated are selected from these feature values according to the channel selection index, and the selected feature value is sent to the general shifter, which shifts it according to the amplitude and direction obtained by the decoder to produce a shift result. The shift result is placed in a register and accumulated with the subsequent results.
As a preferable scheme of the above technical scheme, the accelerator control unit has five states, corresponding respectively to waiting, writing a feature map, writing a convolution kernel, performing convolution calculation and sending a calculation result. In each state, the corresponding control signals are sent to the corresponding sub-modules to complete the corresponding functions.
As a preferable scheme of the above technical scheme, the running state of the SQPE control unit has three states: starting shift accumulation, completing calculation and outputting results. The interval between two successive starts of shift accumulation is determined by the relationship between the number of accumulation clocks and the number of clocks needed for a feature value to propagate from the first SQPE in a row to the last SQPE in that row: if the number of accumulation clocks is greater than or equal to the number of propagation clocks, there is no interval between the two starts; if it is smaller, the difference between the two must be waited. In the calculation-completion state, the final accumulated result is truncated to 16 bits and placed into the result register of the SQPE.
As a preferable scheme of the above technical scheme, the input weight is an 8-bit offline-coded weight. A most significant bit of 1 indicates that the weight is negative, and 0 indicates a positive number. The middle 3 bits represent the channel selection index: the convolution kernels are grouped by channel, into groups of eight, four or two channels, and the channel selection index records the position of the non-zero value within one channel group, which is used to select the corresponding feature value data, so that operations on the remaining zero weights are avoided. The lowest four bits hold the weight value, i.e. the quantized weight value, which can represent numbers in the range 2^(-13) to 2^4.
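For illustration, a small Python sketch of packing and unpacking this 8-bit weight code is given below. The sign / 3-bit channel index / 4-bit value layout follows the description; interpreting the 4-bit value field as an exponent offset from a per-layer base exponent is an assumption (the text only states the representable range 2^(-13) to 2^4 and that per-layer parameter ranges are configured through registers).

def encode_weight(negative, channel_index, exp_field):
    # Pack sign (1 bit, MSB), channel selection index (3 bits), value field (4 bits).
    assert 0 <= channel_index < 8 and 0 <= exp_field < 16
    return (int(negative) << 7) | (channel_index << 4) | exp_field

def decode_weight(code, layer_base_exp=-13):
    # Unpack an 8-bit weight code; the value field is read as an offset from the
    # per-layer base exponent (assumed interpretation).
    negative = bool((code >> 7) & 0x1)
    channel_index = (code >> 4) & 0x7
    exponent = layer_base_exp + (code & 0xF)
    return negative, channel_index, exponent

# Example: a negative weight 2^-2 retained at channel 3 of its group.
code = encode_weight(True, 3, exp_field=-2 - (-13))
print(decode_weight(code))   # -> (True, 3, -2)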
As a preferable scheme of the above technical scheme, the data bit width of the AXI4 bus interface is larger than the data bit width of a single weight or feature map element, so several data items are spliced into one multi-bit word for transmission, which improves the data transmission speed.
By adopting the technical scheme, the invention has the following beneficial effects:
(1) The invention compresses the deep convolutional neural network with an adaptive pruning method and a quantization method and updates the weight parameters by gradient descent during back propagation; the parameter update is fast and the parameters can be updated continuously, so the accuracy of the network drops by less than 1% at a 32× compression ratio, and hardware implementation is taken into account from the very beginning of the algorithm design.
(2) By recording the redundant position information of the structurally pruned sparse network and the low-bit weight values after power-of-two quantization, the invention accelerates convolution calculation and reduces circuit area. Accelerating the convolutional neural network on an FPGA hardware platform supports replacing multiplication with shifting after power-of-two weight quantization and supports convolution and fully connected operations of structurally pruned sparse networks with different compression ratios, which improves the real-time performance of the deep neural network, achieves higher computational performance and reduces energy consumption.
Drawings
FIG. 1 is a flow chart of the deep convolutional neural network compression algorithm of the present invention.
FIG. 2 is a schematic diagram of the structure of the deep convolutional neural network of the present invention.
FIG. 3 is a schematic diagram of a computational array according to the present invention.
FIG. 4 is a schematic diagram of the logic operation of the SQPE calculating unit according to the present invention.
Fig. 5 is a schematic diagram of a state transition of the control unit.
Detailed Description
The technical scheme of the invention is explained in detail in the following with reference to the attached drawings.
FIG. 1 is a flowchart of the deep convolutional neural network compression algorithm of the invention. First, the network is quantized: each pre-trained network weight parameter is directly set to the power of two adjacent to its value. The network is then retrained and the weight parameter values are adjusted; whether a weight parameter keeps changing is determined by the back-propagated gradient: if the gradient is positive, the weight parameter jumps down one exponent level and stays unchanged once it reaches the lower bound 2^(-13); if the gradient is negative, it jumps up one exponent level and stays unchanged once it reaches the upper bound 2^4. The weight parameter is updated accordingly. After training, the numerical range of the weight parameters of each layer is counted and the parameter range of each layer of the circuit is configured accordingly. Next, structured pruning is applied to the network weight parameters: the input channels are divided into groups of eight, and within each group only one value is retained at each position across the eight channels (for example, one value at position (0,0) of the eight channels and one value at position (0,1)), so that the eight channels are compressed into one channel. Finally, an adaptive algorithm evaluates which value to retain so that the loss of network accuracy is minimal, and the power-of-two exponent of the quantized weight and the relative position of the retained channel in the structured pruning are extracted and encoded offline.
Fig. 2 shows the hardware structure of the convolutional neural network accelerator of the invention. The operation of the PE array is described below, taking a 14 × 16 array, a convolution stride of 1 and a convolution kernel size of 3 × 3 as an example.
The PC writes the configuration data of the accelerator into the CRU (Configuration Register Unit) through the AXI4-S interface; the configuration data includes the start address of the image, the start address where the weight parameters are stored, the intermediate data storage address, the network layer number and the network configuration. After the configuration is completed, the PC starts the accelerator by writing a register through the AXI4-S interface. The SSU (Storage Scheduling Unit) carries off-chip data to the OCCU (On-Chip Cache Unit) through the AXI4-M interface of the BIU (Bus Interface Unit). There are three data transport (reuse) modes; the purpose of defining these three kinds of reuse is to reduce the number of accesses to off-chip storage and to make full use of the data already carried to the on-chip cache, so as to reduce power consumption. Taking the output reuse mode as an example, the SSU first reads the normalization data; if the number of output channels is N, N pairs of normalization data are put into the normalization buffer of the OCCU to provide BN data for the subsequent FU (Function Unit). The SSU then reads the off-chip feature value data. Since the computing array is 14 × 16, a column of 14 points of one output feature map can be computed in parallel, so with a 3 × 3 kernel and stride 1 the required input feature map data amounts to 14 + 3 - 1 = 16 rows. Assuming 64 input channels, only 32 channels of data are processed at a time because of the limited on-chip buffer capacity. After these data are stored in the feature value buffer of the OCCU, the convolution kernels of the corresponding 32 input channels are read sequentially from off-chip through the SSU; the number of output channels is 64, and each batch can process at most 32 output channels because of the size of the computing array. Under output reuse, the result computed from this batch of data is only a partial sum; the feature values and convolution kernels of the remaining 32 input channels must then be loaded, and the partial sums of the two batches are added to obtain the first 14 rows of data of 32 output channels.
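To make the output-reuse schedule concrete, the following is a simplified numpy sketch that accumulates partial sums over two batches of 32 input channels. The array sizes follow the example above, while the naive direct convolution and the variable names are only illustrative.

import numpy as np

def conv_partial(ifm, kernels):
    # Naive 3x3, stride-1 convolution over a subset of input channels.
    # ifm: (C_in_batch, H, W); kernels: (C_out, C_in_batch, 3, 3); returns (C_out, H-2, W-2).
    c_out, c_in, k, _ = kernels.shape
    h, w = ifm.shape[1] - k + 1, ifm.shape[2] - k + 1
    out = np.zeros((c_out, h, w))
    for co in range(c_out):
        for ci in range(c_in):
            for i in range(k):
                for j in range(k):
                    out[co] += kernels[co, ci, i, j] * ifm[ci, i:i + h, j:j + w]
    return out

ifm = np.random.randn(64, 16, 18)          # 64 input channels; 16 input rows feed 14 output rows
kernels = np.random.randn(32, 64, 3, 3)    # one batch of 32 output channels
partial = conv_partial(ifm[:32], kernels[:, :32])            # batch 1: partial sums kept on chip
result = partial + conv_partial(ifm[32:], kernels[:, 32:])   # batch 2: accumulate to the final sum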
The on-chip cache distribution for the data stream of the computing array of the invention is shown in FIG. 3. There are 8 Input Buffer banks in total, belonging to the input feature value buffer unit of the OCCU; before being supplied to the computing array, the data must be input in parallel to an RF (register file), which provides parallel input data streams for the 14 rows of the computing array and reuses data between rows and columns as the convolution kernel slides. The Weight Buffer is the weight parameter buffer unit of the OCCU and contains 36 small buffer units, each storing one convolution kernel; before the weights are supplied to the computing array, a decoding unit decomposes the encoded 8-bit weights. The decoded weight data are broadcast for use within one column. The computed data pass through the FU for ReLU, BN or Pooling operations and are then placed in the result buffer of the OCCU.
Referring to fig. 4, the SQPE (Sparse Quantization Processing Unit) is the basic component of the PAU (Parallel Computing Unit) and is composed of a decoder, a feature value selector, a general shifter and a controller. The controller controls the running state of the SQPE unit: starting shift accumulation, completing calculation, or sequentially outputting the calculation results. The input weight is the offline-coded weight; the channel selection index, shift amplitude and shift direction are obtained through the decoder. Because of the channel pruning, several feature values along the channel direction of the feature map are input to the SQPE unit in one clock, and the data to be calculated are selected from them according to the channel selection index. The selected feature value is sent to the general shifter, which shifts it according to the amplitude and direction obtained by the decoder to produce one shift result. The shift result is placed in a register and accumulated with the subsequent results.
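A behavioural Python sketch of one SQPE step as just described is given below. The field layout reuses the 8-bit code above, and the per-layer base exponent and the function name are assumptions.

def sqpe_step(feature_values, weight_code, accumulator, layer_base_exp=-13):
    # feature_values: feature values of one channel group arriving in the same clock;
    # weight_code: the 8-bit offline-coded weight; accumulator: the SQPE partial-sum register.
    negative = bool((weight_code >> 7) & 0x1)            # decoder: sign
    channel_index = (weight_code >> 4) & 0x7             # decoder: channel selection index
    exponent = layer_base_exp + (weight_code & 0xF)      # decoder: shift amplitude and direction
    x = feature_values[channel_index]                    # feature value selector
    shifted = x * 2.0 ** exponent                        # general shifter
    return accumulator + (-shifted if negative else shifted)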
Referring to fig. 5, the convolution operation is unrolled into vector multiply-accumulate operations so that the network structure and the hardware architecture match better; the calculation is simplified according to the operation information and the algorithm structure, which improves computational efficiency and reduces energy consumption. The state transition process of this embodiment is as follows: after initialization, the accelerator enters the IDLE state; the state controller waits for the reuse-type signal or_en sent by the SSU and then enters the FT_IFM (write feature value) state; when ft_ifm_done (write-feature-value-complete signal) from the SSU is detected, the FT_WT (write weight index) state is entered; when ft_wt_done (write-weight-complete signal) from the SSU is detected, the CAL (calculation) state is entered; when cal_ofm_done (calculation-complete signal) from the SSU is detected as 0, the FT_IFM state is re-entered, and when cal_ofm_done is detected as 1, the TX_OFM (transmit result) state is entered; if last_ofm_flg from the SSU is detected, the OR_DONE state is entered.
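For reference, a compact Python model of the state transitions just described follows; the state and signal names are taken from the text, while the dictionary-based encoding is only illustrative.

def next_state(state, sig):
    # State transition of the accelerator control unit (see FIG. 5).
    if state == "IDLE" and sig.get("or_en"):
        return "FT_IFM"                                  # write feature values
    if state == "FT_IFM" and sig.get("ft_ifm_done"):
        return "FT_WT"                                   # write weight indices
    if state == "FT_WT" and sig.get("ft_wt_done"):
        return "CAL"                                     # convolution calculation
    if state == "CAL" and "cal_ofm_done" in sig:
        return "TX_OFM" if sig["cal_ofm_done"] else "FT_IFM"
    if state == "TX_OFM" and sig.get("last_ofm_flg"):
        return "OR_DONE"
    return state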
The parameters can be modified through the state controller; modification of the image size, convolution kernel size, stride, output feature map size and number of output channels at run time is supported. Using the running state and the algorithm structure, redundant calculations are skipped, which reduces unnecessary calculation and memory access, improves the efficiency of the convolutional neural network accelerator and reduces energy consumption.
The present embodiment is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made based on the technical idea proposed by the present invention and the disclosed technical solution falls within the protection scope of the present invention.

Claims (10)

1. A deep convolutional neural network compression method combining quantization and structural pruning is characterized in that pre-trained network weight parameters are quantized into power value data adjacent to numerical values of the pre-trained network weight parameters, each network weight parameter is adjusted according to a gradient of back propagation, structural pruning is performed in a mode of keeping nonzero numerical values in each channel network weight parameter, offline coding is performed on the pruned weights according to the power value data corresponding to each channel network weight parameter and the position of the nonzero numerical value of the pruned weight data, and training of a deep convolutional neural network is performed according to the offline coded weights.
2. The structure-oriented pruning combined quantization deep convolutional neural network compression method as claimed in claim 1, wherein the method for adjusting each network weight parameter according to the back-propagated gradient comprises the following steps: the exponent level of the power value data is adjusted downwards when the gradient is positive, the exponent level of the power value data is adjusted upwards when the gradient is negative, and the exponent level of the power value data is kept unchanged when the power value data reaches a boundary value.
3. The structure-pruning-oriented and quantization-combined deep convolutional neural network compression method as claimed in claim 1, wherein the method for performing structure pruning in a manner of retaining non-zero values in each channel's network weight parameters comprises: combining the non-zero values in the network weight parameters of each channel to obtain a single-channel weight parameter, evaluating, with an adaptive algorithm, the influence of each position of the single-channel weight parameter on the network precision loss, and retaining, at each position of the single-channel weight parameter, the value whose influence on the network precision loss is minimal.
4. The structure-oriented pruning combined quantization deep convolutional neural network compression method as claimed in claim 1, wherein, in the weight code obtained by offline coding of the pruned weights, the highest bit represents the sign of the weight, the middle bits record the channel index of the non-zero value position, and the low bits record the quantized value of the pruned weight.
5. The structure-oriented pruning combined quantization deep convolutional neural network compression method as claimed in claim 2, wherein the expression for adjusting each network weight parameter according to the back-propagated gradient is as follows:
w' = w/2, if ∂L/∂w > 0 and |w| > 2^(-13); w' = 2w, if ∂L/∂w < 0 and |w| < 2^4; w' = w, otherwise
wherein, w and w' are network weight parameters before and after adjustment, and L is a loss function.
6. The method of claim 1, wherein forward inference of the deep convolutional neural network based on the offline-coded weights is implemented by an accelerator comprising:
the bus interface unit, which acquires data containing offline weight codes and control words from the outside;
the configuration register unit, which stores the control words;
the accelerator control unit, which generates an initial configuration instruction after reading the control words, sends an external data reading instruction containing an initial address and a data quantity to the storage scheduling unit, and sends a working state conversion instruction to the parallel computing unit;
the storage scheduling unit, which writes the external data acquired by the bus interface unit into the on-chip cache unit after receiving the external data reading instruction;
the on-chip cache unit, which caches the external data written by the storage scheduling unit and the partial sums and results output by the parallel computing unit; and
the parallel computing unit, which reads the initial configuration instruction to complete the initial configuration, receives the working state conversion instruction sent by the accelerator control unit to complete the reading of external data and the convolution operation, and writes the partial sums and results of the convolution operation into the on-chip cache unit.
7. The method as claimed in claim 6, wherein the parallel computing unit multiplexes feature values in the row direction and weight values in the column direction.
8. The structure-oriented pruning combined quantization deep convolutional neural network compression method as claimed in claim 6, wherein a basic unit of the parallel computing unit is an SQPE unit, and the SQPE unit comprises:
the decoder, which reads the offline weight code and outputs a channel selection index, a shift amplitude and a shift direction;
the feature value selector, which receives the externally read feature map data and the channel selection index and outputs the feature value at the position corresponding to the channel selection index;
the general shifter, which carries out the shift operation on the feature value output by the feature value selector according to the shift amplitude and shift direction output by the decoder after receiving the start-shift-accumulation instruction;
the partial sum accumulator, which accumulates the current shift result onto the partial sum output by the previous convolution operation to output the partial sum of the current convolution operation, stops accumulating when a calculation completion instruction is received, and outputs the partial sum obtained by each convolution operation when a sequential output instruction is received; and
the controller, which outputs the start-shift-accumulation instruction, the calculation completion instruction and the sequential output instruction.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the compression method as claimed in claim 1.
10. A miniaturized embedded terminal device, characterized by comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the compression method according to claim 1 with a set bit width when executing the program.
CN202011071970.2A 2020-10-09 2020-10-09 Deep convolution neural network compression method for structure pruning combined quantization Active CN112329910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011071970.2A CN112329910B (en) 2020-10-09 2020-10-09 Deep convolution neural network compression method for structure pruning combined quantization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011071970.2A CN112329910B (en) 2020-10-09 2020-10-09 Deep convolution neural network compression method for structure pruning combined quantization

Publications (2)

Publication Number Publication Date
CN112329910A true CN112329910A (en) 2021-02-05
CN112329910B CN112329910B (en) 2024-06-04

Family

ID=74314675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011071970.2A Active CN112329910B (en) 2020-10-09 2020-10-09 Deep convolution neural network compression method for structure pruning combined quantization

Country Status (1)

Country Link
CN (1) CN112329910B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112932344A (en) * 2021-04-02 2021-06-11 深圳乐居智能电子有限公司 Sweeping method of sweeping robot and sweeping robot
CN114723034A (en) * 2022-06-10 2022-07-08 之江实验室 Separable image processing neural network accelerator and acceleration method
WO2023019899A1 (en) * 2021-08-20 2023-02-23 中国科学院计算技术研究所 Real-time pruning method and system for neural network, and neural network accelerator
WO2023029579A1 (en) * 2021-08-31 2023-03-09 上海商汤智能科技有限公司 Neural network inference quantization method and apparatus, electronic device, and storage medium
CN116992946A (en) * 2023-09-27 2023-11-03 荣耀终端有限公司 Model compression method, apparatus, storage medium, and program product

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110073329A (en) * 2016-12-16 2019-07-30 华为技术有限公司 Memory access equipment calculates equipment and the equipment applied to convolutional neural networks operation
CN110991630A (en) * 2019-11-10 2020-04-10 天津大学 Convolutional neural network processor for edge calculation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110073329A (en) * 2016-12-16 2019-07-30 华为技术有限公司 Memory access equipment calculates equipment and the equipment applied to convolutional neural networks operation
CN110991630A (en) * 2019-11-10 2020-04-10 天津大学 Convolutional neural network processor for edge calculation

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112932344A (en) * 2021-04-02 2021-06-11 深圳乐居智能电子有限公司 Sweeping method of sweeping robot and sweeping robot
WO2023019899A1 (en) * 2021-08-20 2023-02-23 中国科学院计算技术研究所 Real-time pruning method and system for neural network, and neural network accelerator
WO2023029579A1 (en) * 2021-08-31 2023-03-09 上海商汤智能科技有限公司 Neural network inference quantization method and apparatus, electronic device, and storage medium
CN114723034A (en) * 2022-06-10 2022-07-08 之江实验室 Separable image processing neural network accelerator and acceleration method
CN114723034B (en) * 2022-06-10 2022-10-04 之江实验室 Separable image processing neural network accelerator and acceleration method
CN116992946A (en) * 2023-09-27 2023-11-03 荣耀终端有限公司 Model compression method, apparatus, storage medium, and program product
CN116992946B (en) * 2023-09-27 2024-05-17 荣耀终端有限公司 Model compression method, apparatus, storage medium, and program product

Also Published As

Publication number Publication date
CN112329910B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
CN112329910B (en) Deep convolution neural network compression method for structure pruning combined quantization
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
CN110390383B (en) Deep neural network hardware accelerator based on power exponent quantization
CN110555508B (en) Artificial neural network adjusting method and device
CN109409510B (en) Neuron circuit, chip, system and method thereof, and storage medium
CN110880038B (en) System for accelerating convolution calculation based on FPGA and convolution neural network
CN110413255B (en) Artificial neural network adjusting method and device
US20180046895A1 (en) Device and method for implementing a sparse neural network
CN107423816B (en) Multi-calculation-precision neural network processing method and system
CN110991608B (en) Convolutional neural network quantitative calculation method and system
CN113222133B (en) FPGA-based compressed LSTM accelerator and acceleration method
CN109993275B (en) Signal processing method and device
US20210294874A1 (en) Quantization method based on hardware of in-memory computing and system thereof
CN114970853A (en) Cross-range quantization convolutional neural network compression method
CN113762493A (en) Neural network model compression method and device, acceleration unit and computing system
CN112668708A (en) Convolution operation device for improving data utilization rate
CN111105007A (en) Compression acceleration method of deep convolutional neural network for target detection
CN114626516A (en) Neural network acceleration system based on floating point quantization of logarithmic block
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
CN116976428A (en) Model training method, device, equipment and storage medium
CN111344719A (en) Data processing method and device based on deep neural network and mobile device
CN113313244B (en) Near-storage neural network accelerator for addition network and acceleration method thereof
CN112561049A (en) Resource allocation method and device of DNN accelerator based on memristor
CN113128688B (en) General AI parallel reasoning acceleration structure and reasoning equipment
CN113238976B (en) Cache controller, integrated circuit device and board card

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant