CN115115044A - Configurable sparse convolution hardware acceleration method and system based on channel fusion - Google Patents


Info

Publication number
CN115115044A
CN115115044A (Application No. CN202210789002.8A)
Authority
CN
China
Prior art keywords
data
convolution
sparse
channel
effective
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210789002.8A
Other languages
Chinese (zh)
Inventor
王琴
莫志文
蒋剑飞
景乃锋
绳伟光
贺光辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority claimed from CN202210789002.8A
Publication of CN115115044A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a configurable sparse convolution hardware acceleration method and system based on channel fusion, comprising the following steps. Step 1: acquire the offset addresses of all non-zero valid activation values and of the corresponding convolution kernel weight data. Step 2: store the valid activation values and the offset addresses of the corresponding convolution kernel weights, and perform multiply-accumulate operations on the corresponding valid data pairs in sequence. Step 3: provide a data selector and a data splitter in the multiply-accumulate queue, which route the results computed with fused convolution kernels back to their corresponding output channels for accumulation. Step 4: redistribute the data of the different channels, sending the fused results of each output channel back to that channel's pre-fusion position. By applying channel fusion preprocessing to the sparse convolution kernels, the invention balances the amount of valid data across the fused output channels as far as possible, yielding higher hardware utilization and higher sparse convolution acceleration efficiency.

Description

Configurable sparse convolution hardware acceleration method and system based on channel fusion
Technical Field
The invention relates to the technical field of convolutional neural networks, in particular to a configurable sparse convolution hardware acceleration method and system based on channel fusion.
Background
In recent years, advances in algorithms and computing power have driven rapid progress in artificial intelligence and reshaped industry. However, the excellent performance of large convolutional neural networks comes with extremely high demands on computing and storage resources. In resource-constrained scenarios in particular, this overhead must be reduced by some means; that is, a lightweight model is required.
The non-linear activation layers (ReLU, etc.) in convolutional neural networks make the activation values quite sparse: a typical network exhibits about 70% activation sparsity. L1 and L2 regularization accelerate training, avoid overfitting, and introduce more zero values into the weights. Furthermore, Han et al. have shown that after pruning, Dropout, quantization and similar operations, the valid neuron connections of a convolutional neural network can be compressed to 1/9 to 1/13 of the original while maintaining the same accuracy, and the weight data shrinks to about 1/10.
Therefore, one promising approach is to exploit the sparsity of the convolutional neural network to reduce resource requirements. Since the network contains many invalid zero-value data that need be neither stored nor computed, skipping this invalid data in a well-designed way can save a large amount of storage space and computing resources.
However, the sparsity produced by pruning, Dropout, ReLU, regularization and the like is usually irregular, and such unstructured sparsity makes it difficult for general-purpose hardware to benefit from it. Structured pruning typically suffers a larger accuracy loss than unstructured pruning; meanwhile, CPUs and GPUs usually compute convolution by converting it into matrix multiplication, and irregular sparse matrix multiplication can even add overhead because of branch divergence and load imbalance across threads. Unstructured sparsity therefore needs customized hardware for acceleration.
Patent document CN107341544B (application number: CN201710524017.0) discloses a reconfigurable accelerator based on a partitionable array and an implementation method thereof, the reconfigurable accelerator including: the scratchpad memory buffer area is used for realizing data reuse of convolution calculation and sparse full-connection calculation; the divisible computing array comprises a plurality of reconfigurable computing units and is divided into a convolution computing array and a sparse fully-connected computing array; the register cache region is a storage region formed by a plurality of registers and provides input data, weight data and corresponding output results for convolution calculation and sparse full-connection calculation; the input data and the weight data of the convolution calculation are respectively input into a convolution calculation array, and a convolution calculation result is output; and respectively inputting the input data and the weight data of the sparse full-connection calculation into a sparse full-connection calculation array, and outputting a sparse full-connection calculation result.
Existing hardware acceleration designs for sparse convolutional neural networks skip sparse activation values, skip sparse weight values, or skip both. Such designs achieve good sparse acceleration at low parallelism, but as operator parallelism grows, the large differences in the amount of valid data between parallel lanes cause load imbalance and low hardware utilization. The present invention therefore proposes a configurable sparse convolution hardware acceleration design based on channel fusion: it skips sparse activation values while applying channel fusion preprocessing to the sparse weight data of different output channels, skipping invalid weight computation; at the same time, the channel fusion algorithm keeps the valid data of the output channels as balanced as possible, so that a high sparse speed-up ratio and high hardware utilization are maintained even at high operator parallelism.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a configurable sparse convolution hardware acceleration method and system based on channel fusion.
The configurable sparse convolution hardware acceleration method based on channel fusion provided by the invention comprises the following steps:
Step 1: acquire the offset addresses of all non-zero valid activation values and of the corresponding convolution kernel weight data in a bitmap sliding-window manner;
Step 2: store the valid activation values and the offset addresses of the corresponding convolution kernel weights, and perform multiply-accumulate operations on the corresponding valid data pairs in sequence;
Step 3: provide a data selector and a data splitter in the multiply-accumulate queue, routing the results computed with fused convolution kernels back to their corresponding output channels for accumulation;
Step 4: redistribute the data of the different channels, sending the fused results of each output channel back to that channel's pre-fusion position.
Preferably, for the case where the sparse distribution of the convolution kernels is not uniform, spatial redundancy is added to the 3 × 3 hardware operator at the position where valid convolution-kernel values occur most frequently, so that the hardware convolution operator and the corresponding sliding window grow to size 3 × 3 + 1.
Preferably, for channel fusion:
first, according to the input parallelism C_in_parallel of the operation unit, the convolution kernels of the K output channels are divided into C/C_in_parallel groups, each group covering the K output channels and C_in_parallel input channels;
then, for each group of convolution kernels, the amount of valid data contained in all C_in_parallel input channels under each output channel is counted and sorted;
finally, the output channels are matched pairwise, pairing the channel with the least valid data with the one with the most; a match succeeds if no position holds a valid value in both channels at the same time;
each successfully matched pair is fused into one channel, and a channel that fails to fuse tries the next channel in order, until all remaining unmatched channels have been traversed.
Preferably, a bitmap scheme is used to store the sparse activation values: if the final activation values are stored with a bit width of n bits, the bitmap scheme records whether every datum is valid, plus all the actually valid data, at an extra cost of only 1/n.
Preferably, for sparse weight convolution kernels, the server performs channel fusion on the sparse weight data, which reduces the storage overhead of the weights, skips invalid zero-value computation, and balances the amount of valid computation across the operation units.
The configurable sparse convolution hardware acceleration system based on channel fusion provided by the invention comprises:
Module M1: acquires the offset addresses of all non-zero valid activation values and of the corresponding convolution kernel weight data in a bitmap sliding-window manner;
Module M2: stores the valid activation values and the offset addresses of the corresponding convolution kernel weights, and performs multiply-accumulate operations on the corresponding valid data pairs in sequence;
Module M3: provides a data selector and a data splitter in the multiply-accumulate queue, routing the results computed with fused convolution kernels back to their corresponding output channels for accumulation;
Module M4: redistributes the data of the different channels, sending the fused results of each output channel back to that channel's pre-fusion position.
Preferably, for the case where the sparse distribution of the convolution kernels is not uniform, spatial redundancy is added to the 3 × 3 hardware operator at the position where valid convolution-kernel values occur most frequently, so that the hardware convolution operator and the corresponding sliding window grow to size 3 × 3 + 1.
Preferably, for channel fusion:
first, according to the input parallelism C_in_parallel of the operation unit, the convolution kernels of the K output channels are divided into C/C_in_parallel groups, each group covering the K output channels and C_in_parallel input channels;
then, for each group of convolution kernels, the amount of valid data contained in all C_in_parallel input channels under each output channel is counted and sorted;
finally, the output channels are matched pairwise, pairing the channel with the least valid data with the one with the most; a match succeeds if no position holds a valid value in both channels at the same time;
each successfully matched pair is fused into one channel, and a channel that fails to fuse tries the next channel in order, until all remaining unmatched channels have been traversed.
Preferably, a bitmap scheme is used to store the sparse activation values: if the final activation values are stored with a bit width of n bits, the bitmap scheme records whether every datum is valid, plus all the actually valid data, at an extra cost of only 1/n.
Preferably, for sparse weight convolution kernels, the server performs channel fusion on the sparse weight data, which reduces the storage overhead of the weights, skips invalid zero-value computation, and balances the amount of valid computation across the operation units.
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, while the sparse activation value is skipped, the channel fusion preprocessing is carried out on the sparse convolution kernel, so that the effective data volume among the fused output channels is balanced as much as possible, and thus higher hardware utilization rate and higher sparse convolution acceleration efficiency are brought; for the convolution kernels with uneven distribution, the channel fusion efficiency when the sparse data of the convolution kernels are uneven in distribution can be improved through the spatial redundancy design of the convolution kernel units.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of a channel fusion algorithm for a general convolution operator according to the present invention;
FIG. 2 is a schematic diagram of a channel fusion failure of a general convolution operator provided in the present invention;
FIG. 3 is a schematic diagram of the success of channel fusion of the custom convolution operator provided by the present invention;
FIG. 4 is a schematic diagram of a channel fusion-based sparse convolution operator zero-jump calculation scheme provided by the present invention;
FIG. 5 is a schematic diagram of a two-dimensional sparse convolution operator array provided by the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but do not limit it in any way. It should be noted that various changes and modifications, obvious to those skilled in the art, can be made without departing from the spirit of the invention, and all of these fall within its scope.
Example (b):
the invention can compress and skip the storage of the activation value and the weight of the sparse convolution and the invalid calculation, and simultaneously can balance the effective data amount among different output channels as much as possible through channel fusion preprocessing, thereby still having higher hardware utilization rate and acceleration efficiency when the parallelism of a calculation unit is expanded, and simultaneously having good channel fusion and acceleration support for the condition of convolution kernels with uneven sparse distribution through a customized hardware operator.
The sparse convolution computation of the invention is divided into two parts, the sparse activation values and the sparse weight kernels:
1. For sparse activation values, the invention uses bitmap storage: if the final activation values are stored with a bit width of n bits, the bitmap scheme only needs to record whether every datum is valid, plus all the actually valid data, at an extra cost of 1/n (see the sketch after this list).
2. For sparse weight convolution kernels, the invention performs channel fusion on the sparse weight data at the server side, reducing the storage overhead of the weights, skipping invalid zero-value computation, and balancing the amount of valid computation across the operation units.
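As an illustration of item 1, the following Python sketch shows one way such a bitmap encoding could work; the function names and the use of NumPy are our own assumptions for illustration, not part of the claimed hardware:

import numpy as np

def bitmap_encode(activations):
    # One flag bit per element: 1 where the activation is a valid (non-zero) value.
    bitmap = (activations != 0).astype(np.uint8)
    values = activations[activations != 0]  # only the valid data is stored
    return bitmap, values

def bitmap_decode(bitmap, values):
    # Scatter the stored valid values back to their flagged positions.
    restored = np.zeros(bitmap.shape, dtype=values.dtype)
    restored[bitmap == 1] = values
    return restored

For n-bit activations the bitmap costs one extra bit per element, i.e. the 1/n overhead described above.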
Specifically, for the channel fusion algorithm:
First, according to the input parallelism C_in_parallel of the operation unit, the convolution kernels of the K output channels are divided into C/C_in_parallel groups, each group covering the K output channels and C_in_parallel input channels. Then, for each group of convolution kernels, the amount of valid data contained in all C_in_parallel input channels under each output channel is counted and sorted. Finally, the output channels are matched pairwise, pairing the channel with the least valid data with the one with the most; a match succeeds if no position holds a valid value in both channels at the same time. Each successfully matched pair is fused into one channel, and a channel that fails to fuse tries the next channel in order, until all remaining unmatched channels have been traversed. The goal of the channel fusion algorithm is to minimize the number of valid channels after fusion; it uses an approximately exhaustive traversal and therefore has relatively high complexity, but since the weights do not change once the model is trained, fusion needs to run only once, so this complexity is acceptable.
The invention performs channel fusion on the sparse convolution kernels and bitmap sparse storage of the activation values, and also applies zero-jump acceleration to the invalid computations of the sparse convolution.
Specifically, for the zero-jump computation of sparse convolution:
For the sparse activation values, the bitmap is traversed with a sliding window and fed to a first-non-zero detection unit. This unit sends the relative offset addresses of all non-zero entries, in order, to the corresponding valid-activation storage unit and fused convolution-kernel weight storage unit; the storage units feed the corresponding valid activation values and weight data to the multiplication and addition units for multiply-accumulate operations; and a data selector and a data splitter in the corresponding multiply-accumulate queue route the results computed with the fused convolution kernels back to their corresponding output channels for accumulation and final output (a software sketch of this dataflow follows below). The parallel output units of the different channels finally feed a data redistribution unit, which sends the fused results of each output channel back to that channel's pre-fusion position.
In particular, when the sparse distribution of the convolution kernels is uneven, a spatial redundancy design is applied to the 3 × 3 hardware convolution operator at the position where valid kernel values occur most frequently, so that the operator and the corresponding sliding window grow to size 3 × 3 + 1; this avoids the drop in channel fusion efficiency that unevenly distributed sparse kernel data would otherwise cause.
The configurable sparse convolution hardware acceleration system based on channel fusion provided by the invention comprises: Module M1, which acquires the offset addresses of all non-zero valid activation values and of the corresponding convolution kernel weight data in a bitmap sliding-window manner; Module M2, which stores the valid activation values and the offset addresses of the corresponding convolution kernel weights and performs multiply-accumulate operations on the corresponding valid data pairs in sequence; Module M3, which provides a data selector and a data splitter in the multiply-accumulate queue, routing the results computed with fused convolution kernels back to their corresponding output channels for accumulation; and Module M4, which redistributes the data of the different channels, sending the fused results of each output channel back to that channel's pre-fusion position.
For the case where the sparse distribution of the convolution kernels is not uniform, spatial redundancy is added to the 3 × 3 hardware operator at the position where valid kernel values occur most frequently, so that the hardware convolution operator and the corresponding sliding window grow to size 3 × 3 + 1.
For channel fusion: first, according to the input parallelism C_in_parallel of the operation unit, the convolution kernels of the K output channels are divided into C/C_in_parallel groups, each group covering the K output channels and C_in_parallel input channels; then, for each group of convolution kernels, the amount of valid data contained in all C_in_parallel input channels under each output channel is counted and sorted; finally, the output channels are matched pairwise, pairing the channel with the least valid data with the one with the most, where a match succeeds if no position holds a valid value in both channels at the same time; each successfully matched pair is fused into one channel, and a channel that fails to fuse tries the next channel in order, until all remaining unmatched channels have been traversed.
The sparse activation values are stored in bitmap form: if the final activation values are stored with a bit width of n bits, the bitmap scheme records whether every datum is valid, plus all the actually valid data, at an extra cost of 1/n. For sparse weight convolution kernels, channel fusion of the sparse weight data is performed at the server side, reducing the storage overhead of the weights, skipping invalid zero-value computation, and balancing the amount of valid computation across the operation units.
Fig. 1 is a schematic diagram of the channel fusion algorithm of the invention applied to sparse weight convolution kernels, in which the two kernels of each pair never hold a valid value at the same position. Through the channel fusion algorithm, the sparse convolution kernels of four different output channels are thus fused into dense kernels occupying two channels. While the algorithm compresses the storage of sparse weights and eliminates invalid computation, the fusion preprocessing also balances the amount of valid data across the output channels as far as possible, so hardware utilization and acceleration efficiency remain high as the parallelism of the computing unit scales up.
Fig. 2 shows an example where channel fusion fails for a general convolution operator when the sparse weight distribution is uneven. In a RepVGG-style network, for example, the center of the convolution kernel is almost always a valid value; under such a sparse distribution, a general operator can hardly support channel fusion of the weight data.
Fig. 3 shows a successful example of channel fusion on unevenly distributed sparse weights using the configurable custom operator proposed by the invention. When the kernel sparsity is uneven, the invention applies a spatial redundancy design to the 3 × 3 hardware operator at the position where valid kernel values occur most frequently, so that the operator and the corresponding sliding window grow to size 3 × 3 + 1, avoiding the loss of fusion efficiency that uneven kernel data would otherwise cause. As the figure shows, when the center of the 3 × 3 kernel is usually a valid value, adding spatial redundancy for that center position greatly improves the efficiency of the channel fusion algorithm at the cost of a small amount of storage.
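The effect of the redundant position can be illustrated with a small sketch (a toy model with names of our own choosing, not the patented circuit): two 3 × 3 kernels whose centers are both valid cannot be fused directly, but a 3 × 3 + 1 layout that duplicates the center slot makes their valid positions disjoint again:

import numpy as np

def fusible(mask_a, mask_b):
    # Two channels can fuse only if no position is valid in both.
    return not np.any(mask_a & mask_b)

# 3 x 3 validity masks flattened to 9 slots; slot 4 is the center.
a = np.array([0, 1, 0, 0, 1, 0, 0, 0, 0], dtype=np.uint8)
b = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0], dtype=np.uint8)
print(fusible(a, b))   # False: both centers are valid, so the kernels collide

# 3 x 3 + 1 layout: slot 9 duplicates the center, and b's center moves there.
a10 = np.append(a, 0)
b10 = np.append(np.array([0, 0, 0, 0, 0, 0, 0, 1, 0], dtype=np.uint8), 1)
print(fusible(a10, b10))  # True: the valid positions are now disjoint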
For the custom operator, the channel fusion algorithm can be described in pseudo-code as follows; its goal is to minimize the number of valid channels after fusion.
Customized channel fusion algorithm:

for (c = 1; c <= C; c += I_parallel) {
    for (k = 1; k <= K; k += O_parallel) {
        kernel_cnt[c][k] = sum(kernel[1:3][1:3][c][k]);  // count the valid weights of each input/output channel group
    }
}
for (c = 1; c <= C; c += I_parallel) {
    for (n = 0; n < I_parallel; n++) {
        for (k = 1; k <= K; k++) {
            tmp_kernel_cnt[k] += kernel_cnt[c + n][k];   // tmp_kernel_cnt[k]: valid data of output channel k over the I_parallel input channels; pairing is based on this
        }
    }
    channel_index = sort(tmp_kernel_cnt);                // ascending sort; channel_index holds the channel subscripts in ascending order
    for (k = 1; k <= K; k++) {                           // start from the channel with the least valid data
        if (tmp_kernel_cnt[k] != 0) {                    // all-zero channels skip computation directly
            for (m = K; m > k; m--) {                    // try to fuse the least-loaded channel with the most-loaded one
                tmp_flag = True;                         // flag: can these two output channels be fused?
                for (n = 0; n < I_parallel; n++) {
                    if (kernel[1:3][1:3][c + n][channel_index[k]] & kernel[1:3][1:3][c + n][channel_index[m]] != 0) {
                        tmp_flag = False;                // all I_parallel input channels in parallel must be fusible
                    }
                }
                if (tmp_flag == True) {
                    fuse(channel_index[k], channel_index[m]);  // if the flag is still true, fuse the two channels
                }
            }
        }
    }
}
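For reference, the pseudo-code above can be transcribed into executable Python roughly as follows. This is a sketch under our reading of the algorithm; the array shape, the function name fuse_channels, and the restriction to a single input-channel group are assumptions made for illustration:

import numpy as np

def fuse_channels(group_masks):
    # group_masks: 0/1 validity bitmaps with shape (I_parallel, K, 3, 3) for one
    # group of I_parallel input channels across all K output channels.
    I_parallel, K = group_masks.shape[0], group_masks.shape[1]
    # Valid weights per output channel, summed over the input-channel group.
    cnt = group_masks.reshape(I_parallel, K, -1).sum(axis=(0, 2))
    order = np.argsort(cnt)            # ascending by amount of valid data
    fused, used = [], set()
    for i, k in enumerate(order):
        if cnt[k] == 0 or int(k) in used:    # all-zero or already-fused channels are skipped
            continue
        for m in reversed(order[i + 1:]):    # try the most-loaded partner first
            if int(m) in used:
                continue
            # Fusible only if no position is valid in both channels,
            # simultaneously for all I_parallel input channels.
            if not np.any(group_masks[:, k] & group_masks[:, m]):
                fused.append((int(k), int(m)))
                used.update((int(k), int(m)))
                break
    return fused  # pairs of output channels merged into one physical channel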
Fig. 4 is a schematic diagram of the zero-jump computation scheme of the customized sparse convolution operator based on channel fusion. First, the bitmap of the sparse activation values is converted and fed, in a sliding-window manner, to a first-non-zero detection unit. That unit sends the relative offset addresses of all non-zero entries, in order, to the corresponding valid-activation storage unit and fused convolution-kernel weight storage unit; the storage units feed the corresponding valid activation values and weight data to the multiplication and addition units for multiply-accumulate operations; and a data selector and a data splitter in the corresponding multiply-accumulate queue route the results computed with the fused convolution kernels back to their corresponding output channels for accumulation and final output.
Fig. 5 shows the two-dimensional parallel operator design of the configurable channel-fusion-based sparse convolution hardware accelerator: on the left are the valid activation values delivered by the different input channels, and on the top are the fused convolution kernels of the different output channels. After the data enter the two-dimensional sparse convolution operator array, the array passes the results of the different output channels to the channel redistribution unit below, which sends the fused results of each output channel back to that channel's pre-fusion position (a sketch of this gather step follows).
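The redistribution step can be modeled as a simple gather; this is an illustrative sketch in which the mapping name origin_of and the dictionary layout are our own assumptions:

def redistribute(fused_outputs, origin_of):
    # fused_outputs: dict fused_channel -> {original_channel: accumulated value},
    # as produced by the per-channel accumulation in the operator array
    # origin_of[ch]: the fused channel that absorbed pre-fusion output channel ch
    restored = {}
    for ch, fused_ch in enumerate(origin_of):
        restored[ch] = fused_outputs[fused_ch].get(ch, 0)  # back to the pre-fusion position
    return restored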
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A configurable sparse convolution hardware acceleration method based on channel fusion, characterized by comprising the following steps:
Step 1: acquiring the offset addresses of all non-zero valid activation values and of the corresponding convolution kernel weight data in a bitmap sliding-window manner;
Step 2: storing the valid activation values and the offset addresses of the corresponding convolution kernel weights, and performing multiply-accumulate operations on the corresponding valid data pairs in sequence;
Step 3: providing a data selector and a data splitter in the multiply-accumulate queue, routing the results computed with fused convolution kernels back to their corresponding output channels for accumulation;
Step 4: redistributing the data of the different channels, sending the fused results of each output channel back to that channel's pre-fusion position.
2. The configurable sparse convolution hardware acceleration method based on channel fusion according to claim 1, characterized in that, for the case where the sparse distribution of the convolution kernels is not uniform, spatial redundancy is added to the 3 × 3 hardware operator at the position where valid convolution-kernel values occur most frequently, so that the hardware convolution operator and the corresponding sliding window grow to size 3 × 3 + 1.
3. The configurable sparse convolution hardware acceleration method based on channel fusion according to claim 1, characterized in that, for channel fusion:
first, according to the input parallelism C_in_parallel of the operation unit, the convolution kernels of the K output channels are divided into C/C_in_parallel groups, each group covering the K output channels and C_in_parallel input channels;
then, for each group of convolution kernels, the amount of valid data contained in all C_in_parallel input channels under each output channel is counted and sorted;
finally, the output channels are matched pairwise, pairing the channel with the least valid data with the one with the most, wherein a match succeeds if no position holds a valid value in both channels at the same time;
each successfully matched pair is fused into one channel, and a channel that fails to fuse tries the next channel in order, until all remaining unmatched channels have been traversed.
4. The configurable sparse convolution hardware acceleration method based on channel fusion according to claim 1, characterized in that a bitmap scheme is used to store the sparse activation values, and if the final activation values are stored with a bit width of n bits, the bitmap scheme records whether every datum is valid, plus all the actually valid data, at an extra cost of 1/n.
5. The configurable sparse convolution hardware acceleration method based on channel fusion according to claim 1, characterized in that, for sparse weight convolution kernels, channel fusion is performed on the sparse weight data at the server side, reducing the storage overhead of the weights, skipping invalid zero-value computation, and balancing the amount of valid computation across the operation units.
6. A configurable sparse convolution hardware acceleration system based on channel fusion, characterized by comprising:
Module M1: acquiring the offset addresses of all non-zero valid activation values and of the corresponding convolution kernel weight data in a bitmap sliding-window manner;
Module M2: storing the valid activation values and the offset addresses of the corresponding convolution kernel weights, and performing multiply-accumulate operations on the corresponding valid data pairs in sequence;
Module M3: providing a data selector and a data splitter in the multiply-accumulate queue, routing the results computed with fused convolution kernels back to their corresponding output channels for accumulation;
Module M4: redistributing the data of the different channels, sending the fused results of each output channel back to that channel's pre-fusion position.
7. The configurable sparse convolution hardware acceleration system based on channel fusion according to claim 6, characterized in that, for the case where the sparse distribution of the convolution kernels is not uniform, spatial redundancy is added to the 3 × 3 hardware operator at the position where valid convolution-kernel values occur most frequently, so that the hardware convolution operator and the corresponding sliding window grow to size 3 × 3 + 1.
8. The configurable sparse convolution hardware acceleration system based on channel fusion according to claim 6, characterized in that, for channel fusion:
first, according to the input parallelism C_in_parallel of the operation unit, the convolution kernels of the K output channels are divided into C/C_in_parallel groups, each group covering the K output channels and C_in_parallel input channels;
then, for each group of convolution kernels, the amount of valid data contained in all C_in_parallel input channels under each output channel is counted and sorted;
finally, the output channels are matched pairwise, pairing the channel with the least valid data with the one with the most, wherein a match succeeds if no position holds a valid value in both channels at the same time;
each successfully matched pair is fused into one channel, and a channel that fails to fuse tries the next channel in order, until all remaining unmatched channels have been traversed.
9. The configurable sparse convolution hardware acceleration system based on channel fusion according to claim 6, characterized in that a bitmap scheme is used to store the sparse activation values, and if the final activation values are stored with a bit width of n bits, the bitmap scheme records whether every datum is valid, plus all the actually valid data, at an extra cost of 1/n.
10. The configurable sparse convolution hardware acceleration system based on channel fusion according to claim 6, characterized in that, for sparse weight convolution kernels, channel fusion is performed on the sparse weight data at the server side, reducing the storage overhead of the weights, skipping invalid zero-value computation, and balancing the amount of valid computation across the operation units.
CN202210789002.8A 2022-07-06 2022-07-06 Configurable sparse convolution hardware acceleration method and system based on channel fusion Pending CN115115044A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210789002.8A CN115115044A (en) 2022-07-06 2022-07-06 Configurable sparse convolution hardware acceleration method and system based on channel fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210789002.8A CN115115044A (en) 2022-07-06 2022-07-06 Configurable sparse convolution hardware acceleration method and system based on channel fusion

Publications (1)

Publication Number Publication Date
CN115115044A 2022-09-27

Family

ID=83333355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210789002.8A Pending CN115115044A (en) 2022-07-06 2022-07-06 Configurable sparse convolution hardware acceleration method and system based on channel fusion

Country Status (1)

Country Link
CN (1) CN115115044A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274027A (en) * 2023-08-22 2023-12-22 北京辉羲智能科技有限公司 Image processing chip with hardware safety redundancy
CN117274027B (en) * 2023-08-22 2024-05-24 北京辉羲智能科技有限公司 Image processing chip with hardware safety redundancy


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination