CN109993279B

CN109993279B - Double-layer same-or binary neural network compression method based on lookup table calculation

Info

Publication number: CN109993279B
Application number: CN201910178528.0A
Authority: CN
Inventors: 张萌; 李建军; 李国庆; 沈旭照; 曹晗翔; 刘雪梅; 陈子洋
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2019-03-11
Filing date: 2019-03-11
Publication date: 2023-08-04
Anticipated expiration: 2039-03-11
Also published as: CN109993279A

Abstract

The invention discloses a double-layer same-or binary-value neural network compression method based on lookup table calculation, which is completed by a double-layer convolution structure, and the algorithm comprises the following steps: firstly, after nonlinear activation, batch normalization and binary activation are carried out on an input feature map, grouping is carried out on first-layer convolution operations with different convolution kernel sizes to obtain a first-layer output result; then, a second layer convolution operation with a size of 1×1 is adopted for the first layer output result to obtain an output characteristic diagram. In the hardware implementation, the improved double-layer convolution uses the three-input exclusive nor operation of double-layer parallel computation to replace the traditional double-layer sequential computation mode, and all double-layer convolution operations are completed by using a lookup table mode, so that the utilization rate of hardware resources is improved. The compression method provided by the invention is an algorithm hardware collaborative compression scheme integrating full-precision high-efficiency neural network skills and a lookup table calculation mode, has a better compression effect structurally, and reduces the consumption of logic resources in hardware.

Description

Double-layer same-or binary neural network compression method based on lookup table calculation

Technical Field

The invention relates to an FPGA (field programmable gate array) optimization design technology of a binary neural network, belonging to the technical field of digital image processing.

Background

Based on the vigorous development of deep learning technology, convolutional Neural Networks (CNNs) are increasingly used in the field of digital image processing. Starting from the most classical AlexNet to the res net residual neural network proposed by Facebook institute, deep convolutional neural networks began to step into a high-speed development period, and the performance of the neural networks was also gradually rising. In the aspect of practical application, *** has achieved remarkable results in aspects of automatic driving, face recognition and the like by using a convolutional neural network. At the same time, convolutional neural networks have also encountered challenges in the development process, such as the high computational power and high complexity characteristics of convolutional neural networks, which make CNNs more difficult to apply to embedded devices.

With the popularity of mobile intelligent terminal devices, it is desirable to implement algorithms for neural networks on some devices that have only low-performance processors. Thus, a variant BCNN (binary convolutional neural network) of CNN is attracting attention in terms of lightweight and low power consumption by virtue of its ability to extract features without performing multiplication operations. A new binarization algorithm based on convolutional neural networks was proposed in university of montreal, canada, courbariaux et al, (Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or-1) (arXiv preprint arXiv:1602.02830,2016), in 2016, which binarizes the weights of the neural network and the activation values of each layer, saves a lot of storage space, calculation resources and forward propagation time, and can theoretically reduce the computational complexity by 60% without greatly reducing the model accuracy by compensating the convolution weights (weights) and the output feature map (FeatureMap) by coefficients. The binarization method can effectively reduce the consumption of hardware resources, reduce the calculation cost, improve the processing speed of the neural network and be beneficial to realizing the neural network algorithm on a chip. In the same year, the XNORNet proposed by Mohammad Rastegari of Washington university changes the traditional convolution multiplication convolution operation into an exclusive OR operation, so that the hardware implementation of the binary neural network is simpler and more convenient. However, compared with the classification capability of the full-precision convolutional neural network, the feature extraction capability of the binary neural network is also lacking, and the binary neural network is equivalent to the regularization process of the full-precision neural network, and further the complexity of the sparse network. The binarization process causes larger loss after the feature extracted by the network is subjected to binary activation, and how to extract more effective features under the condition of binarization becomes a key problem of the binary neural network. In recent two years, different proprietary binary algorithms are proposed, such as parallel network PC-BNN ABC-Net and the like, achieve better effects, but the binary algorithm identification performance is improved, meanwhile, the cost of hardware implementation is not greatly reduced, and the problems that the algorithm is simple but not suitable for hardware and the like are solved. In summary, the algorithm of the binary convolutional neural network is at an initial scale, and further development of the algorithm of the binary neural network which is beneficial to hardware implementation is one direction of the development of the binary neural network in the future.

Because of the huge calculation amount of the neural network algorithms, it becomes extremely difficult to directly realize the algorithms in the form of software at the terminal, and research on the special acceleration hardware of the neural network is a current development trend. Therefore, different acceleration structures dedicated to neural networks have been proposed, and a major consideration in designing acceleration hardware is how to run faster and save hardware resources. Aiming at the problem of faster running, experts mainly research a parallelization operation mode of the neural network algorithm, and the parallelization operation mode is matched with the characteristic that hardware can execute in parallel to accelerate the execution of the algorithm. Aiming at the problem of saving hardware resources, the main research direction is data multiplexing and a function multiplexing part which are contained in a neural network algorithm, so that the cost of the hardware resources can be reduced.

In terms of the hardware implementation of the binarization neural network, the problem of low efficiency and high cost exists in the operation in the BCNN according to the existing application of the general full-precision artificial intelligent acceleration chip. These high performance processors are not available for embedded systems and other low power operations. Due to the rapid development of algorithms of the binary neural network, the FPGA design realization of the network becomes more and more popular due to the structural deformation characteristics of the algorithms. The university of Qinghua realizes a general binary neural network accelerator in a paper of FP-BNN: binarized neural network on FPGA (neuron, 2018, 275:1072-1086), 11.6 times of CPU computing speed and 2.75 times of GPU accelerating computing capacity are realized on the AlexNet structure, and the whole model reaches 384GOP/s/w computing speed on an FPGA.

But the architecture consumes relatively large amounts of power and logic resources. Therefore, in order to be able to use the algorithm with high recognition rate in the low-power consumption embedded device, a software and hardware collaborative design method for optimizing the algorithm and exclusively deploying the FPGA, which are beneficial to the hardware implementation, is needed.

Disclosure of Invention

The invention aims to: in order to overcome the defects of the prior art, the invention provides a double-layer same-or binary-value neural network compression method based on lookup table calculation, which reduces network parameters, improves calculation efficiency and reduces resource consumption.

The technical scheme is as follows: a double-layer same or binary neural network compression method based on lookup table calculation;

the compression method is completed by a double-layer convolution structure, and the algorithm comprises the following steps:

firstly, after nonlinear activation, batch normalization and binary activation are carried out on an input feature map, grouping is carried out on first-layer convolution operations with different convolution kernel sizes to obtain a first-layer output result;

then, a second layer convolution operation with a size of 1×1 is adopted for the first layer output result to obtain an output characteristic diagram.

The hardware implementation steps of the double-layer convolution structure comprise:

(1) After the nonlinear activation, batch normalization and binary activation processes are realized by hardware, the convolution of the second layer is performed simultaneously in the same-or processing process of the first layer convolution module, so that double-layer convolution simultaneous calculation is realized;

(2) And (3) carrying out addition operation in a 6:3 compression tree mode on the output values obtained by double-layer convolution and simultaneous calculation in the step (1).

The double-layer convolution simultaneously calculates and adopts three-input simultaneous OR processing, wherein the input three values are respectively an input characteristic diagram value, a first-layer convolution weight value and a second-layer convolution weight value.

The double-layer convolution is formed by connecting a group convolution formed by different convolution kernel sizes and a second-layer convolution with the size of 1 multiplied by 1.

The double-layer convolution adopts a lookup table mode to perform simultaneous calculation, and three input simultaneous or processing basic units formed by the double-layer convolution simultaneous calculation are realized in one lookup table according to the basic characteristics of multiple inputs and single outputs in the lookup table.

The beneficial effects are that: the invention provides a double-layer same-or binary-value network calculated based on a lookup table, wherein a traditional convolution kernel is replaced by a composite double-layer convolution kernel with stronger characteristic extraction capability, and the condition that convolution operation is not binary in double-layer convolution is eliminated by adopting three-input same-or calculation, so that the parameter quantity and calculation complexity of a binary-value neural network are further reduced. The structure adopts CIFAR-10 data set to verify the effectiveness of the algorithm.

Drawings

FIG. 1 is a double-layer schematic of a single module;

FIG. 2 is a graph of a sign function and a gradient update function;

FIG. 3 is a diagram of a binary neural network algorithm and an improved architecture;

FIG. 4 is a three-input or hardware implementation process conversion diagram;

FIG. 5 is a block diagram of an overall hardware implementation of a two-layer convolution.

Detailed Description

The technical scheme of the invention is further described below with reference to the accompanying drawings.

A double-layer same or binary neural network compression method based on lookup table calculation is completed by a double-layer convolution structure, and the algorithm comprises the following steps: firstly, after nonlinear activation, batch normalization and binary activation are carried out on an input feature map, grouping is carried out on first-layer convolution operations with different convolution kernel sizes, and a first-layer output result is obtained. Then, a second layer convolution operation with a size of 1×1 is adopted for the first layer output result to obtain an output characteristic diagram.

The hardware implementation steps comprise:

The invention is further illustrated by the following examples using convolution sizes of 3 x 3, 1x 3, 3 x1, etc. Convolution operation as shown in fig. 1, the normal 3 x 3 convolution operation in fig. 1 (a) is replaced with a double-layer convolution on the right side of fig. 1. Assuming that the number of input activated channels is n and the number of output activated channels is m, the size of a traditional convolution kernel on the left side is n.times.9, pconv3×3 on the right side represents 3×3 convolution, pconv1×3 represents 1×3 convolution, pconv3×1 represents 3×1 convolution, parameters are respectively n/8*m/8*9, n/8*m/8*3, n/8*m/8.times.3, the operation size of 1×1 convolution is n.times.m, the sum of double-layer convolution parameters on the right side is n.times.m.1.75, the parameter number is 1/5 of the common convolution, and the number of parameters and the convolution calculation amount are greatly reduced. The reduction of parameters does not bring further loss of precision, namely, the binary activation process is not carried out between double-layer convolutions for the first time, and the binarization process is realized by adopting hardware; each binary activation process in the binary neural network converts the features extracted by the original network convolution process into new features with partial effective features, and meanwhile, the reverse gradient propagation process of the network is seriously influenced, so that the gradient cannot propagate forward when propagating to the position, as shown in (a) of fig. 2, the sign function is greater than zero and is 1, less than zero and is 0, the gradient is infinity at zero, and the gradient is zero at other positions. A new gradient function (as shown in fig. 2 (b)) is needed to solve the problem of gradient inability to counter-propagate. The structure uses a new gradient function similar to Gaussian distribution, not only meets the gradient distribution of a sign function to a certain extent, but also further reduces the normal loss process of binarization, plays a certain role in correction, and ensures that the training speed and accuracy of the network are improved to a certain extent. However, the correction of the gradient only solves the gradient propagation problem, and the loss in the forward propagation process cannot be effectively solved, so that the loss caused by the binary activation process is required to be reduced as much as possible in the process of feature extraction, and the training difficulty and the accuracy loss of the network are reduced. From the above description, it is known that in the process of designing the binary neural network algorithm, more effective features should be extracted with fewer layers as possible, and the algorithm features of the binary network are exactly satisfied by using the two-layer network as in (b) of fig. 1. The test of the neural network proves that the feature extraction in the direction of the feature map is more important than the feature extraction among channels, and more features can be extracted under fewer parameters by increasing the number of channels of the first layer and reducing the number of channels of the second layer.

To verify the algorithm part of the invention, experiments utilize Tensorflow to build a binary double-layer co-or convolutional neural network based algorithm, using 4 parallel 3×3 convolutional kernels, 2 1×3 convolutional kernels, 2 3×1 convolutional kernels and 11×1 convolutional kernel instead of one 3×3 convolutional kernel. The convolutional neural network structure shown in fig. 3 is used for comparison test, the left side of fig. 3 is a common residual neural network, the convolutional neural network comprises seven convolutional modules, in the first module, binary weight convolutional operation is used after batch normalization operation, and the number of channels is 128; in the second through seventh convolution modules, each module has 1 convolution operation of 3×3, and the number of channels is 128, 256, and 512, respectively; each convolution operation is followed by a PBA layer which consists of a nonlinear function activation layer, a batch normalization processing layer and a binary activation layer, and a maximum pooling layer is arranged after the second, fourth and seventh convolution modules; the seventh convolution module is followed by a full connection layer, and since a three-channel color picture CIFAR-10 data set with the size of 32×32 is used for testing and training, wherein CIFAR-10 is classified by 10, the output channel number of the last full connection layer of CIFAR-10 is 10, and finally a normalized exponential function (Softmax) layer is connected to complete classification operation, the right side of fig. 3 is a network improved on the left side of fig. 3 by using the method, as shown by a broken line box module, from the second convolution module, a double-layer identical or convolution kernel is used to replace the original 3×3 convolution kernel, the characteristic channel number of an intermediate layer of the double-layer convolution structure can be defined by itself, and the whole performance of the network can be enhanced by proper increase, and other parts can keep the same as the original network. The network model of fig. 3 was trained and tested using a Tensorflow setup, and table 1 gives model comparisons for 250 rounds of training at the same layer number.

As shown in Table 1, after the same number of layers is trained for 250 rounds, the network test accuracy of ResNet (residual neural network-7 layer) is 87% under CIFAR-10, and the test accuracy of the binary residual network (PM-ResNet-7) improved by the invention is 86.1%; table I shows the comparison of network parameters under CIFAR-10 data set, the number of original network parameters is 2.83M, and the improved network is only 1.08M, and the reduction rate reaches 63%. The reduction of parameters inevitably reduces the convolution operation times of the network, so that the calculation complexity of the network is greatly reduced and the calculation time is saved under the condition of ensuring the test accuracy and full binary values.

Table 1 comparison of parameters and accuracy of different network models

Data set	Model	Number of parameters	Accuracy rate of
				CIFAR-10	ResNet-7	2.83M	87％
CIFAR-10	PM-ResNet-7	1.08M	86.1％

In terms of hardware implementation, as shown in fig. 4 (a), the normal convolution operation steps of the dual-layer network are that output values o11, o12, o13 of the first layer convolution are calculated, and then the calculated values are subjected to convolution operation of 1x1 to obtain o1, o2, o3. The first calculation mode is shown in formulas (1), (2), (3) and (4), and the values obtained by the first layer convolution operation need to be summed correspondingly, so that the result is a multi-bit value.

O11＝I11W111+I12W112+…+I21W121+I22W122+…+I31W131+I32W132+…+I39W139 (1)

O12＝I11W211+I12W212+…+I21W221+I22W222+…+I31W231+I32W232+…+I39W239 (2)

O13＝I11W311+I12W312+…+I21W321+I22W322+…+I31W331+I32W332+…+I39W339 (3)

O1＝O11*x11+O12*x12+O13*x13 (4)

When the next step of O1 result calculation is performed, the weight parts x11, x12 and x13 are single bits, but the output results O11, O12 and O13 of the first layer are multi-bit values, so that only addition and subtraction operations can be performed, the advantage that the convolution operation can use the same or operation when binary neural network hardware is realized is lost, and the hardware resource cost of the method is also larger. The improved calculation mode is shown in fig. 4 (b), and the method of fusion contract or is adopted, and only three input values are adopted for the same or judgment, so that the output result of the first layer can be removed, and the problem that the middle layer is not single bit is solved. The difficulty of simultaneous or calculation can not be carried out any more, and the precision loss caused by the binary activation process is eliminated, so that the binary neural network can extract more characteristics under the condition of fewer parameters. From equation (5), the final result O1 is calculated by performing a 1-bit summation after performing a three-bit exclusive nor operation. Because multi-bit adders consume more resources than single-bit adders, the resulting clock delay is also more, adding complexity to the implementation of the circuit.

At the same time, three single bit simultaneous or networks have more advantages over common computing methods in terms of computation on look-up tables. Taking FPGA as an example, the programmable logic resource of FPGA is mainly composed of two parts, one part is a look-up table (LUT) to realize the combination circuit, and the other part is a register to realize the sequential circuit. The most consumed in the process of realizing the neural network hardware is convolution operation, the convolution operation is mainly multiplication and addition and subtraction operation, and the convolution operation is mainly addition, subtraction and addition or operation in the binary neural network, but a large amount of combinational logic circuits are consumed, so that the optimization of the combinational logic circuits is a key point. According to the characteristics of the LUTs, each LUT is a lookup table, different logic functions can be realized, but input and output modes are fixed, and aiming at the FPGA device realized by the currently mainstream neural network hardware, the LUTs of the FPGA are mostly in a 4-6 input mode. For the 6-input mode, the LUT has three modes, and one 6-input-1 output, two 3-input-1 outputs or one 5-input-2 output mode can be realized respectively. According to the same or principle, two single bit inputs are subjected to same or to obtain single bit output, and three single bit inputs are subjected to same or still single bit output. Meanwhile, according to the output bit limit of the LUT, three two-input or operation cannot be realized on 1 LUT, so that the LUT required by two-input or operation and the LUT required by two three-input or operation are the same under normal conditions, and the second convolution operation is required after the normal or convolution in the first mode, so that more logic resources are required to be consumed by consumed hardware resources compared with the second one-time calculation double-layer convolution, and the second mode disclosed by the invention can save considerable hardware resources.

For the whole hardware implementation of the double-layer network, as shown in fig. 5, in this example, 8 parallel modules are adopted to perform parallel computation, a convolution module with the size of an input channel p and a feature map m×n is divided into 8 modules, the size of the input feature map p/8×m×n is equal to that of a 3×3 convolution module; the upper left corner of the input feature diagram is first convolved with p/8 x 3 values and 4 different 3 x 3 convolution kernels, and the weight matrix p/8 x 3 is then convolved. The obtained intermediate result is further convolved with the next 1x1 convolution kernel, the intermediate matrix is subjected to sliding exclusive nor operation by 128 4*1 matrixes to obtain matrixes of output 128 channels, the size of the matrixes is 128 x p/8 x 4 x 9, the obtained matrixes are summed in the non-channel direction, the sum of each matrix is p/8 x 4 x 9, and the channel directions do not need to be accumulated. The adder of the matrix adopts addition operation of a 6:3 compression tree mode, and firstly, according to the characteristics of 5 input and 2 output of the LUT, the consumption of 2 LUT is only needed for 5 value summation. Secondly, according to the characteristic that the resource utilization rate is higher as the number of parallel inputs is larger in the 5-in 3-addition pipeline operation, the resource utilization rate is greatly improved by carrying out parallel addition calculation on the double-layer exclusive nor value. And calculating the 1 multiplied by 3 or 3 multiplied by 1 convolution module by adopting the same processing mode, obtaining 8 128 multiplied by 1 matrix values by final parallel calculation, and carrying out 1-direction summation to obtain final 128 output characteristic diagram values.

It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be comprehended within the scope of the present invention. The components not explicitly described in this embodiment can be implemented by using the prior art.

Claims

1. A double-layer same or binary neural network compression method based on lookup table calculation is characterized in that:

then, a second layer convolution operation with the size of 1 multiplied by 1 is adopted for the output result of the first layer to obtain an output characteristic diagram;

(2) Adding the output values obtained by double-layer convolution in the step (1) in a 6:3 compressed tree mode;

the double-layer convolution simultaneously calculates and adopts three-input simultaneous or processing, wherein the input three values are respectively an input characteristic diagram value, a first-layer convolution weight value and a second-layer convolution weight value;

2. The method for compressing the two-layer co-or binary neural network based on the lookup table calculation as claimed in claim 1, wherein the method comprises the following steps: the double-layer convolution is formed by connecting a group convolution formed by different convolution kernel sizes and a second-layer convolution with the size of 1 multiplied by 1.