CN109886391B - Neural network compression method based on space forward and backward diagonal convolution - Google Patents

Info

Publication number
CN109886391B
CN109886391B CN201910089080.5A CN201910089080A CN109886391B CN 109886391 B CN109886391 B CN 109886391B CN 201910089080 A CN201910089080 A CN 201910089080A CN 109886391 B CN109886391 B CN 109886391B
Authority
CN
China
Prior art keywords
convolution
neural network
diagonal
network
convolution operation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910089080.5A
Other languages
Chinese (zh)
Other versions
CN109886391A (en)
Inventor
张萌
沈旭照
李国庆
李建军
刘文昭
郭晟昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201910089080.5A
Publication of CN109886391A
Application granted
Publication of CN109886391B
Status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a neural network compression method based on spatial forward and backward diagonal convolution. Convolutional neural networks are at the core of many current computer vision and digital image processing solutions, but their computational complexity and parameter counts remain limiting factors in many application scenarios. To improve the computational efficiency of the convolutional neural network and reduce the number of network parameters, the invention spatially replaces a pair of consecutive conventional square convolution kernels with forward and backward diagonal convolution kernels: a forward diagonal convolution operation is performed first, followed by batch normalization and nonlinear function activation, and then an anti-diagonal convolution operation. This further reduces the computational complexity of the convolutional neural network while retaining an effective local receptive field center and accelerates network propagation; the diagonal convolutions also have a certain regularization effect, improving the robustness of the network and reducing model overfitting, so the overall effect of the compressed network is markedly improved.

Description

Neural network compression method based on space forward and backward diagonal convolution
Technical Field
The invention relates to neural network pruning and convolution decomposition technology, and belongs to the technical field of digital image processing.
Background
In recent years, deep learning has achieved remarkable results on high-level abstract cognition problems, and the convolutional neural network is one of its most important tools. Its weight-sharing structure resembles a biological neural network, which reduces the number of weights and the scale of the model. Convolutional neural networks adapt well to image deformations such as translation, scaling, and rotation, and are widely applied in fields such as image recognition and object detection; for example, Microsoft uses convolutional neural networks in handwriting recognition systems for Arabic and Chinese, and Google uses them to recognize faces, license plates, and other objects in street-view images.
A convolutional neural network initially contains two structures: convolutional layers and pooling layers. The convolution units in a convolutional layer act as feature detectors. The correlations among the pixels of an image are local, much as the human eye perceives small image patches through the optic nerve rather than each neuron perceiving the whole picture; each convolution kernel therefore produces one output, and the features of the whole picture are obtained after comprehensive judgment, as in the brain. Weight sharing is a defining feature of convolutional neural networks: low-level features of an image, such as edges, are universally applicable and independent of position, so a similar feature extractor can extract edge features from local regions above or below an image, and only one convolution kernel is needed to extract a given local feature. For the first few layers of a network, which mainly extract low-level features, weight sharing greatly reduces the number of parameters. While the convolutional layers identify features, pooled sampling can merge fine features, because nearby regions often carry subtle mutual relations that pooling integrates; owing to this unique property of the convolutional neural network structure, pooling is widely applied in the field of image recognition.
Convolutional neural networks reduce the number of network parameters and the network scale while fully preserving image characteristics, but their computational complexity and parameter counts remain limiting factors in many application scenarios. In particular, on mobile terminals a single forward propagation consumes enormous computational resources and considerable time, which hinders deployment in applications with strict real-time processing requirements.
Yann LeCun et al., in Optimal Brain Damage (Advances in Neural Information Processing Systems (NIPS), 1990: 598-605), observed that a neural network has numerous parameters, but some contribute little to the final output and are redundant; model neurons should be ranked by their contribution to the final output, and those with low contribution discarded, making the model run faster with fewer parameters. Han Song et al., in Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (International Conference on Learning Representations (ICLR), 2016), used pruning: among all connections, those with weights below a certain threshold are removed and the network is then retrained; by this method the parameters of the AlexNet and VGG-16 models were reduced by factors of 9 and 13, respectively. Both of the above methods require a continual iterative process, i.e., model pruning and network retraining are repeated alternately.
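The threshold pruning step described above can be made concrete with a minimal NumPy sketch (an illustration only, not Han et al.'s actual implementation; retraining with the mask held fixed would follow):

```python
import numpy as np

def magnitude_prune(weights, threshold):
    """Zero out connections whose absolute weight falls below a threshold,
    as in magnitude-based pruning; retraining would follow."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.random.randn(4, 4)
w_pruned, kept = magnitude_prune(w, 0.5)
print(f"kept {kept.mean():.0%} of connections")
```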
Szegedy C. et al., in Rethinking the Inception Architecture for Computer Vision (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 2818-2826), spatially decomposed the traditional convolution into asymmetric convolutions, replacing any n×n convolution with a 1×n convolution followed by an n×1 convolution; the number of parameters is significantly reduced, but this decomposition does not work well in the first few layers of a network and is effective mainly in the middle layers.
Simonyan K. et al., in Very Deep Convolutional Networks for Large-Scale Image Recognition (arXiv preprint arXiv:1409.1556, 2014), proposed the VGG models and were the first to suggest that multiple 3×3 convolution kernels can replace a convolutional layer with a larger kernel while providing an equivalent receptive field; for example, two consecutive 3×3 convolution operations have an equivalent receptive field of 5×5. On the one hand this reduces parameters; on the other it introduces more nonlinear mappings, strengthening the fitting capability of the network. Many other methods accelerate and compress models, such as reducing numerical precision, global average pooling, optimizing the activation and cost functions, and hardware acceleration. To deploy models in mobile-terminal applications with strict real-time processing requirements, convolutional neural networks therefore need a more efficient pruning algorithm.
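The parameter arithmetic behind these two prior-art decompositions can be checked with a back-of-the-envelope sketch (illustrative only; single input/output channel, no bias terms, and the helper names are ours):

```python
def square_params(n):
    return n * n            # one n-by-n kernel

def asymmetric_params(n):
    return n + n            # 1-by-n followed by n-by-1 (Inception-style)

def stacked_3x3_params(k):
    return k * 3 * 3        # k stacked 3x3 kernels

# Two 3x3 convolutions cover a 5x5 receptive field with 18 weights vs 25.
print(square_params(5), stacked_3x3_params(2))   # 25 18
# An asymmetric split of a 7x7 kernel needs 14 weights vs 49.
print(square_params(7), asymmetric_params(7))    # 49 14
```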
Disclosure of Invention
The invention aims to address the following problem: although existing convolutional neural networks reduce the number of network parameters and the network scale while fully preserving image characteristics, their computational complexity and parameter counts remain limiting factors in many application scenarios; in particular, on mobile terminals a single forward propagation consumes enormous computational resources and considerable time, which hinders deployment in applications with strict real-time processing requirements.
Technical scheme: to solve the above technical problems, the invention provides the following technical scheme.
a neural network compression method based on space forward and backward diagonal convolution comprises the following steps:
(1) Filling zero padding is carried out on an input characteristic diagram of the convolutional neural network;
(2) Performing dead angle convolution operation on the feature map after zero filling;
(3) Firstly carrying out batch normalization processing on the output characteristic diagram after the dead angle convolution operation, then carrying out nonlinear function activation processing, and keeping the size of the processed characteristic diagram unchanged;
(4) And (3) filling zero padding is carried out on the feature map processed in the step (3), and anti-diagonal convolution operation is carried out.
Further, in step (2), compared with a conventional square convolution kernel, the kernel of the forward diagonal convolution operation has parameters only on the forward diagonal; the remaining entries are 0.
Further, in step (4), compared with a conventional square convolution kernel, the kernel of the anti-diagonal convolution operation has parameters only on the anti-diagonal; the remaining entries are 0.
Further, in step (2) and step (4), a pair of consecutive conventional square convolution operations is spatially replaced by a pair of consecutive forward and backward diagonal convolution operations.
Further, in step (1) and step (4), the amount of zero padding is determined by the size N of the diagonal convolution kernel, the total numbers of padded rows and columns each equaling N−1; when N−1 is odd, (N−2)/2 columns and rows are added on the left and top of the feature map and N/2 columns and rows on the right and bottom; when N−1 is even, (N−1)/2 rows and columns are added on each of the four sides; a minimal sketch of this rule is given below.
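As referenced above, the padding rule can be written out directly; the following minimal Python sketch (the helper name diag_padding is ours, not the patent's) computes the per-side padding for a given diagonal kernel size N:

```python
def diag_padding(n):
    """Per-side zero padding for a diagonal kernel of size n (total n - 1)."""
    if (n - 1) % 2:  # N-1 odd: (N-2)/2 on left/top, N/2 on right/bottom
        return {'top': (n - 2) // 2, 'left': (n - 2) // 2,
                'bottom': n // 2, 'right': n // 2}
    s = (n - 1) // 2  # N-1 even: (N-1)/2 on all four sides
    return {'top': s, 'left': s, 'bottom': s, 'right': s}

print(diag_padding(2))  # {'top': 0, 'left': 0, 'bottom': 1, 'right': 1}
print(diag_padding(3))  # {'top': 1, 'left': 1, 'bottom': 1, 'right': 1}
```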
Beneficial effects: compared with the prior art, the invention has the following advantages. The invention provides a neural network based on spatial forward and backward diagonal convolution, which spatially replaces a pair of consecutive conventional square convolution operations with a pair of consecutive forward and backward diagonal convolution operations. It uses fewer parameters while keeping an equivalent local receptive field center, improves the computational efficiency of the convolutional neural network, and accelerates network propagation; the added intermediate nonlinear processing step also has a certain regularization effect and reduces model overfitting. Experimental results show that, compared with a traditional convolutional neural network, the network pruning effect is markedly improved.
Drawings
FIG. 1 is a schematic diagram of spatial diagonal convolution;
FIG. 2 is a schematic diagram of the forward diagonal convolution operation with spatial diagonal convolution kernel size 2;
FIG. 3 is a schematic diagram of the anti-diagonal convolution operation with spatial diagonal convolution kernel size 2;
FIG. 4 is a schematic diagram of the equivalent receptive field center of forward and backward diagonal convolution with spatial diagonal convolution kernel size 2;
FIG. 5 is a schematic diagram of the forward diagonal convolution operation with spatial diagonal convolution kernel size 3;
FIG. 6 is a schematic diagram of the anti-diagonal convolution operation with spatial diagonal convolution kernel size 3;
FIG. 7 is a schematic diagram of the equivalent receptive field center of forward and backward diagonal convolution with spatial diagonal convolution kernel size 3;
FIG. 8 is a structural diagram of the convolutional neural network used in the experiments of the invention.
Detailed Description
A neural network compression method based on spatial forward and backward diagonal convolution comprises the following steps:
(1) Zero padding is applied to the input feature map of the convolutional neural network;
(2) A forward diagonal convolution operation is performed on the zero-padded feature map;
(3) Batch normalization is first applied to the output feature map of the forward diagonal convolution, followed by nonlinear function activation; the size of the processed feature map remains unchanged;
(4) Zero padding is applied to the feature map processed in step (3), and an anti-diagonal convolution operation is performed.
The spatial forward and backward diagonal convolution operation is shown in FIG. 1: the input feature map is zero-padded, the forward diagonal convolution operation is performed, and the result then passes through an intermediate processing stage before the anti-diagonal convolution. In the intermediate stage, batch normalization is applied first: from each input pixel x_i the batch mean μ is subtracted, and the result is divided by the square root of the batch variance plus a small constant, giving the normalized value

x̂_i = (x_i − μ) / √(σ² + ε), where μ = (1/m) Σ x_i and σ² = (1/m) Σ (x_i − μ)².

A scale transformation and offset then give the batch-normalized value

y_i = γ x̂_i + β,

where m is the batch size, ε is a fixed small constant, and γ and β are learned parameters. Nonlinear function activation is then applied using the simplest linear rectification function (ReLU): for inputs greater than or equal to 0 the output equals the input, and for inputs less than 0 the output equals 0. After zero padding is applied again, the anti-diagonal convolution operation is performed. For the zero-padding operations in step (1) and step (4), the amount of padding is determined by the size N of the diagonal convolution kernel: the total numbers of padded rows and columns each equal N−1. When N−1 is odd, (N−2)/2 columns and rows are added on the left and top of the feature map and N/2 columns and rows on the right and bottom; when N−1 is even, (N−1)/2 rows and columns are added on each of the four sides.
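The intermediate processing stage can be written out directly from the formulas above; the following NumPy sketch (illustrative, with γ and β fixed rather than learned) applies batch normalization followed by ReLU to a feature map:

```python
import numpy as np

def batch_norm_relu(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization followed by ReLU, per the formulas above.

    gamma and beta would be learned during training; they are fixed
    here purely for illustration.
    """
    mu = x.mean()                          # batch mean
    var = ((x - mu) ** 2).mean()           # batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized value
    y = gamma * x_hat + beta               # scale and shift
    return np.maximum(y, 0.0)              # ReLU activation
```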
FIGS. 2 and 3 illustrate the forward and backward diagonal convolution operations for diagonal kernel size N = 2; for ease of demonstration, all parameters of the forward and backward diagonal kernels are set to 1. As shown in the left part of FIG. 2, the input feature map is 7×7. From the kernel size N = 2, the total numbers of rows and columns to be zero-padded are N−1 = 1; since 1 is odd, (N−2)/2 = 0 columns and rows are added on the left and top of the feature map and N/2 = 1 column and row on the right and bottom, so the zero-padded feature map is 8×8. In the forward diagonal convolution, as shown in the right part of FIG. 2, the first output at the top-left corner of the output feature map equals a11 and a22 multiplied by their corresponding weights and summed; with all weights set to 1, this first output is a11+a22. The forward diagonal kernel then slides to the right to give the second output a12+a23, and continues sliding row by row until the last result a77 of the output feature map is computed. After the forward diagonal convolution, the 7×7 output feature map is passed to the intermediate processing stage.
As shown in the left part of FIG. 3, with diagonal kernel size 2 the forward diagonal output feature map after zero padding is 8×8. In the anti-diagonal convolution, as shown in the right part of FIG. 3, the first output of the output feature map equals a12+a23 and a21+a32 multiplied by their corresponding weights and summed; with all weights set to 1, this first output is a12+a23+a21+a32. The anti-diagonal kernel then slides to the right row by row, producing the corresponding outputs in turn; the output feature map is 7×7.
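This walk-through can be reproduced numerically; the sketch below (an illustration, not the patent's implementation) applies the padding rule and a diagonal kernel of ones to a 7×7 map and checks the first and last outputs named above:

```python
import numpy as np

def diag_conv(x, n, anti=False):
    """Diagonal convolution with an n-by-n kernel of ones on one diagonal,
    using the zero-padding rule of steps (1) and (4)."""
    k = np.eye(n)[::-1] if anti else np.eye(n)
    if (n - 1) % 2:               # odd total padding: extra on right/bottom
        t = l = (n - 2) // 2
        b = r = n // 2
    else:                         # even total padding: equal on all sides
        t = b = l = r = (n - 1) // 2
    x = np.pad(x, ((t, b), (l, r)))
    h, w = x.shape[0] - n + 1, x.shape[1] - n + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + n, j:j + n] * k)
    return out

a = np.arange(1, 50, dtype=float).reshape(7, 7)  # stands in for a11..a77
y = diag_conv(a, 2)                              # forward diagonal pass
assert y.shape == (7, 7)
assert y[0, 0] == a[0, 0] + a[1, 1]              # first output: a11 + a22
assert y[-1, -1] == a[-1, -1]                    # last output: a77
z = diag_conv(y, 2, anti=True)                   # anti-diagonal pass
assert z[0, 0] == y[0, 1] + y[1, 0]              # (a12+a23) + (a21+a32)
```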
As shown in the right part of FIG. 3, for the specifically marked value a45+a56+a54+a65 in the output feature map of the forward and backward diagonal convolution, the equivalent receptive field center on the initial input feature map is the marked part of FIG. 4, a diamond-shaped receptive center. To cover this diamond receptive center, a pair of conventional square convolution kernels needs 2×N×N = 8 parameters, whereas the pair of forward and backward diagonal kernels needs only 2×N = 4; the number of parameters is thus reduced by 2×N×(N−1) = 4 while a more effective receptive center is preserved, which greatly increases the sparsity of the network and exerts a regularizing effect on it.
FIGS. 5 and 6 illustrate the forward and backward diagonal convolution operations for diagonal kernel size N = 3; for ease of demonstration, all parameters of the forward and backward diagonal kernels are set to 1. As shown in the left part of FIG. 5, the input feature map is 7×7. From the kernel size N = 3, the total numbers of rows and columns to be zero-padded are N−1 = 2; since 2 is even, (N−1)/2 = 1 column and row are added on the left and top of the feature map and (N−1)/2 = 1 column and row on the right and bottom, so the zero-padded feature map is 9×9. In the forward diagonal convolution, as shown in the right part of FIG. 5, the first output at the top-left corner of the output feature map equals 0, a11, and a22 multiplied by their corresponding weights and summed; with all weights set to 1, this first output is a11+a22. The forward diagonal kernel then slides to the right to give the second output a12+a23, and continues sliding row by row until the last result a66+a77 of the output feature map is computed. After the forward diagonal convolution, the 7×7 output feature map is passed to the intermediate processing stage.
As shown in the upper part of FIG. 6, with diagonal kernel size 3 the forward diagonal output feature map after zero padding is 9×9. In the anti-diagonal convolution, as shown in the lower part of FIG. 6, the first output of the output feature map equals 0, a11+a22, and 0 multiplied by their corresponding weights and summed; with all weights set to 1, this first output is a11+a22. The anti-diagonal kernel then slides to the right row by row, producing the corresponding outputs in turn; the output feature map is 7×7.
As shown in the lower part of FIG. 6, for the specifically marked value a24+a35+a46+a33+a44+a55+a42+a53+a64 in the output feature map of the forward and backward diagonal convolution, the equivalent receptive field center on the initial input feature map is the marked part of FIG. 7, a diamond-like receptive center. A pair of conventional square convolution kernels needs 2×N×N = 18 parameters, whereas the pair of forward and backward diagonal kernels needs only 2×N = 6; the number of parameters is thus reduced by 2×N×(N−1) = 12 while a more effective receptive center is preserved, further cutting the network parameter count, greatly improving the computational efficiency of the model, and yielding a pronounced model-pruning effect. A short computation of these counts follows below.
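The parameter arithmetic for both kernel sizes can be checked in one place; this tiny sketch simply restates the counts derived above:

```python
def param_counts(n):
    square = 2 * n * n        # a pair of conventional n-by-n kernels
    diagonal = 2 * n          # a pair of diagonal kernels, n weights each
    return square, diagonal, square - diagonal

for n in (2, 3):
    print(n, param_counts(n))  # 2: (8, 4, 4);  3: (18, 6, 12)
```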
To verify the invention, a comparison experiment was performed using the convolutional neural network structures shown in FIG. 8. FIG. 8(a) is the VGG-19 network proposed by Simonyan K. et al., comprising five convolution modules. The first and second convolution modules each contain two consecutive convolution operations with different channel counts, 64 and 128, respectively; the third to fifth convolution modules each contain four consecutive convolution operations, with 256, 512, and 512 channels, respectively. Each convolution operation is followed by a batch normalization layer and a nonlinear function activation layer, not shown in the figure, and there is a max-pooling layer between consecutive convolution modules. The fifth convolution module is followed by three fully connected layers. Because the three-channel 32×32 color-image datasets CIFAR-10 and CIFAR-100 are used for training and testing, with 10 and 100 classes respectively, the number of output channels of the last fully connected layer is 10 for CIFAR-10 and 100 for CIFAR-100; a normalized exponential function (Softmax) layer is attached last to complete the classification.
FIG. 8(b) shows the network of FIG. 8(a) modified by the invention, indicated by the dashed-box modules: starting from the third convolution module, spatial forward and backward diagonal convolution kernels replace the original square convolution kernels, with both the forward diagonal and anti-diagonal convolution operations followed by a batch normalization layer and a nonlinear function activation layer. The networks were built, trained, and tested with TensorFlow; Table 1 compares the models after 200 training epochs under identical hyperparameters.
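The patent does not publish its TensorFlow code. As one plausible way to realize steps (1) to (4) in modern TensorFlow/Keras, the sketch below masks a standard Conv2D kernel so that only one diagonal carries weights; the class and function names are ours, and Keras 'same' padding only approximates the left/top versus right/bottom split of the padding rule for even N:

```python
import numpy as np
import tensorflow as tf

class DiagonalMask(tf.keras.constraints.Constraint):
    """Re-zeroes off-diagonal kernel entries after every weight update."""
    def __init__(self, n, anti=False):
        eye = np.eye(n, dtype=np.float32)
        if anti:
            eye = eye[::-1]                  # flip rows -> anti-diagonal
        # Conv2D kernels have shape (h, w, in_channels, out_channels).
        self.mask = tf.constant(eye[:, :, None, None])

    def __call__(self, w):
        return w * self.mask

def diagonal_conv_pair(filters, n):
    """Forward diagonal conv -> BN -> ReLU -> anti-diagonal conv."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(filters, n, padding='same', use_bias=False,
                               kernel_constraint=DiagonalMask(n)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(filters, n, padding='same', use_bias=False,
                               kernel_constraint=DiagonalMask(n, anti=True)),
    ])
```

In a full replication one would also mask the kernel initializer, since a Keras constraint only takes effect after the first weight update.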
Table 1. Network model comparison (200 training epochs, identical hyperparameters)

Model | CIFAR-10 accuracy | CIFAR-100 accuracy | Parameters (CIFAR-10)
VGG-19 | 92.83% | 69.92% | 45.23M
Improved network (this invention) | 92.89% | 69.78% | 32.05M
As Table 1 shows, under identical hyperparameters and after 200 training epochs, the VGG-19 network achieves 92.83% test accuracy on CIFAR-10 versus 92.89% for the improved network; on CIFAR-100, VGG-19 achieves 69.92% versus 69.78% for the improved network, so the test accuracies on the two datasets differ little. The parameter counts differ slightly between datasets because the output channel count of the last fully connected layer differs, but the overall difference is small. Table 1 gives the parameter comparison on CIFAR-10: the original network has 45.23M parameters, while the improved network has only 32.05M, a reduction of 13.18M, or 29.13%. With test accuracy preserved, the computational complexity of the network is thus greatly reduced and its computational efficiency improved.
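A quick arithmetic check of the reported reduction (a sketch; the last digit differs slightly from the reported 29.13% because the table's parameter counts are rounded to two decimals):

```python
original, improved = 45.23, 32.05  # parameters in millions, from Table 1
saved = original - improved        # 13.18M
print(f"{saved:.2f}M removed, {100 * saved / original:.2f}% reduction")
# -> 13.18M removed, 29.14% reduction
```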

Claims (2)

1. A neural network compression method based on spatial forward and backward diagonal convolution, applied to the field of image recognition, characterized by comprising the following steps:
(1) Zero padding is applied to the input feature map of the convolutional neural network;
(2) A forward diagonal convolution operation is performed on the zero-padded feature map, the kernel of the forward diagonal convolution having parameters only on the forward diagonal, the remaining entries being 0;
(3) Batch normalization is first applied to the output feature map of the forward diagonal convolution, followed by nonlinear function activation; the size of the processed feature map remains unchanged;
(4) Zero padding is applied to the feature map processed in step (3), and an anti-diagonal convolution operation is performed, the kernel of the anti-diagonal convolution having parameters only on the anti-diagonal, the remaining entries being 0;
in step (1) and step (4), the amount of zero padding is determined by the size N of the diagonal convolution kernel, the total numbers of padded rows and columns each equaling N−1; when N−1 is odd, (N−2)/2 columns and rows are added on the left and top of the feature map and N/2 columns and rows on the right and bottom; when N−1 is even, (N−1)/2 rows and columns are added on each of the four sides.
2. The neural network compression method of claim 1, wherein in step (2) and step (4), a pair of consecutive conventional square convolution operations is spatially replaced by a pair of consecutive forward and backward diagonal convolution operations.
CN201910089080.5A 2019-01-30 2019-01-30 Neural network compression method based on space forward and backward diagonal convolution Active CN109886391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910089080.5A CN109886391B (en) 2019-01-30 2019-01-30 Neural network compression method based on space forward and backward diagonal convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910089080.5A CN109886391B (en) 2019-01-30 2019-01-30 Neural network compression method based on space forward and backward diagonal convolution

Publications (2)

Publication Number Publication Date
CN109886391A CN109886391A (en) 2019-06-14
CN109886391B true CN109886391B (en) 2023-04-28

Family

ID=66927371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910089080.5A Active CN109886391B (en) 2019-01-30 2019-01-30 Neural network compression method based on space forward and backward diagonal convolution

Country Status (1)

Country Link
CN (1) CN109886391B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782001B (en) * 2019-09-11 2024-04-09 东南大学 Improved method for using shared convolution kernel based on group convolution neural network
CN114730331A (en) * 2019-12-18 2022-07-08 华为技术有限公司 Data processing apparatus and data processing method
CN112101547B (en) * 2020-09-14 2024-04-16 中国科学院上海微***与信息技术研究所 Pruning method and device for network model, electronic equipment and storage medium
CN112288829A (en) * 2020-11-03 2021-01-29 中山大学 Compression method and device for image restoration convolutional neural network
CN112766392B (en) * 2021-01-26 2023-10-24 杭州师范大学 Image classification method of deep learning network based on parallel asymmetric hole convolution
CN113283351B (en) * 2021-05-31 2024-02-06 深圳神目信息技术有限公司 Video plagiarism detection method using CNN optimization similarity matrix

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639932A (en) * 2008-07-28 2010-02-03 汉王科技股份有限公司 Method and system for enhancing digital image resolution
CN106127297A (en) * 2016-06-02 2016-11-16 中国科学院自动化研究所 The acceleration of degree of depth convolutional neural networks based on resolution of tensor and compression method
WO2018073975A1 (en) * 2016-10-21 2018-04-26 Nec Corporation Improved sparse convolution neural network

Also Published As

Publication number Publication date
CN109886391A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
CN109886391B (en) Neural network compression method based on space forward and backward diagonal convolution
CN112991354B (en) High-resolution remote sensing image semantic segmentation method based on deep learning
CN108596143B (en) Face recognition method and device based on residual error quantization convolutional neural network
CN110517329A (en) A kind of deep learning method for compressing image based on semantic analysis
CN110880165A (en) Image defogging method based on contour and color feature fusion coding
CN110222760B (en) Quick image processing method based on winograd algorithm
CN110020639B (en) Video feature extraction method and related equipment
Wang et al. Channel and space attention neural network for image denoising
CN111523546A (en) Image semantic segmentation method, system and computer storage medium
CN110780923A (en) Hardware accelerator applied to binary convolution neural network and data processing method thereof
CN110909874A (en) Convolution operation optimization method and device of neural network model
CN110598673A (en) Remote sensing image road extraction method based on residual error network
CN111340047A (en) Image semantic segmentation method and system based on multi-scale feature and foreground and background contrast
Aldroubi et al. Similarity matrix framework for data from union of subspaces
CN113971735A (en) Depth image clustering method, system, device, medium and terminal
CN115131558B (en) Semantic segmentation method in environment with few samples
CN113822825B (en) Optical building target three-dimensional reconstruction method based on 3D-R2N2
CN110264483B (en) Semantic image segmentation method based on deep learning
CN114882278A (en) Tire pattern classification method and device based on attention mechanism and transfer learning
Li et al. A graphical approach for filter pruning by exploring the similarity relation between feature maps
CN110782001A (en) Improved method for using shared convolution kernel based on group convolution neural network
Peng et al. New network based on D-LinkNet and densenet for high resolution satellite imagery road extraction
CN111914993B (en) Multi-scale deep convolutional neural network model construction method based on non-uniform grouping
CN107820087A (en) According to the method for mobile detection result dynamic regulation code check
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant