CN114372565A - Target detection network compression method for edge device - Google Patents

Target detection network compression method for edge device

Info

Publication number
CN114372565A
Authority
CN
China
Prior art keywords
network
skynet
quantization
layer
merging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210038592.0A
Other languages
Chinese (zh)
Inventor
钟胜
唐维伟
邹旭
徐文辉
谭富中
卢金仪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202210038592.0A priority Critical patent/CN114372565A/en
Publication of CN114372565A publication Critical patent/CN114372565A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention belongs to the technical field of target detection on edge devices, and discloses a target detection network compression method for edge devices, comprising the following steps: optimizing the network structure of SkyNet and quantizing the feature maps and weight parameters; reconstructing the forward-inference structure and merging part of the computation in the depthwise separable convolutions. Starting from the angle of algorithm optimization and taking SkyNet as an example, the invention provides a compression processing technique for target detection networks, reducing the difficulty of deploying them on edge devices. The invention prunes the network, making it better suited to edge devices; applies quantization, greatly reducing the size of the network model; and applies merging, greatly reducing the computational load of the network.

Description

Target detection network compression method for edge device
Technical Field
The invention belongs to the technical field of target detection on edge devices, and particularly relates to a target detection network compression method for edge devices.
Background
At present, the task of target detection is to find and identify targets of interest in an image; it is widely used in scenes such as autonomous driving, face detection and video surveillance. In recent years, target detection algorithms based on convolutional neural networks have achieved better performance than traditional methods, but because of their huge computation and parameter counts, most convolutional neural networks are deployed on general-purpose CPUs or GPUs, which consume much power and occupy a large volume, and real-time detection on edge devices remains difficult to achieve. A lightweight network, or a way of compressing the network, is therefore urgently needed so that the convolutional neural network can perform inference directly on the edge device.
To solve this problem, prior art 1 proposes the lightweight network MobileNet, and prior art 2 proposes the lightweight network Xception; both use depthwise separable convolution (DSC) in place of standard convolution to reduce computation and parameter counts, improving the operating efficiency of target detection networks on edge devices to some extent. Prior art 3 designs SkyNet, a hardware-friendly network for edge devices built on depthwise separable convolution; compared with MobileNet and Xception, the SkyNet structure is more regular and its module reuse rate is higher, but it still places high computational demands on the deployment platform.
To meet edge application scenarios, a number of neural network optimization techniques have been proposed, chiefly network compression, which divides roughly into quantization of the network and pruning of the network. Quantization replaces high-precision numbers with low-precision numbers in the convolution computation, trading a small loss of precision to avoid floating-point arithmetic, or replaces a group of weight values with a central value obtained by clustering. Pruning exploits network sparsity: the smaller a weight in the neural network, the less it influences the final prediction, so computation can be skipped by testing whether a weight of the network is zero.
Through the above analysis, the problems and defects of the prior art are as follows: existing target detection networks run inefficiently on edge devices, demand capable hardware, consume much power, occupy a large volume, and cannot perform real-time detection on edge devices.
The difficulty in solving these problems and defects is: convolutional neural networks can achieve high-precision target detection, but their large parameter and computation counts make implementation on edge devices difficult.
The significance of solving them is: with this set of compression methods, the parameter count and computation of the network can be reduced, the network can be deployed on low-power, low-cost edge devices, and the precision loss stays within a reasonable range.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a target detection network compression method for edge devices.
The invention is realized as follows: a target detection network compression method for edge devices, comprising:
optimizing the network structure of SkyNet and quantizing the feature maps and weight parameters; reconstructing the forward-inference structure and merging part of the computation in the depthwise separable convolutions.
Further, the target detection network compression method for edge devices comprises the following steps:
step one, performing optimized pruning of the SkyNet network by removing the bypass branch structure and deleting part of the channels output by the first layer, to obtain the optimized SkyNet network;
step two, compressing the SkyNet network: the weights are quantized to 7 bits, the feature maps are quantized to 8 bits, and the bias parameters and scaling coefficients are fused and fixed-pointed to 32 bits, after which the network is retrained;
step three, merging the SkyNet network structure: the convolution layers and normalization layers are merged, and the activation, quantization, inverse quantization and saturation truncation are merged into a single FETCH operation.
Further, the optimized pruning of the SkyNet network comprises the following steps:
first, the SkyNet branches are pruned, and, taking each depthwise separable convolution as a minimum unit layer, the output of the first layer is pruned to 32 channels;
second, a pooling operation is added after the sixth layer and the last layer is changed to a depthwise separable convolution, optimizing the whole network into a straight-through, single-branch structure.
Further, the optimized SkyNet network includes:
a 3-channel input layer CHL3; intermediate layers CHL32, CHL96, CHL192, CHL384, CHL512 and CHL96; and a regression layer CHL30;
the convolutions between the layers of the optimized SkyNet network are depthwise separable convolutions.
Further, the compression of the retrained SkyNet network comprises the following steps:
(1) selecting, for each output channel, the maximum absolute value in the corresponding convolution kernel as the quantization maximum, and performing weight quantization in maximum-value mode:
scale_w = 63/max(|w|);
q_w = w × scale_w;
wherein w is the vector of original weights for each channel; q_w is the vector of quantized weights; scale_w is the scalar scaling coefficient;
(2) selecting a threshold using the KL relative entropy and quantizing the feature map with saturating quantization: a threshold T is selected, values of the original distribution within ±T are scaled proportionally into -127 to +127, and values outside that range are saturated, i.e. represented directly by the saturation value.
Further, selecting the threshold using the KL relative entropy and quantizing the feature map with saturating quantization comprises:
1) selecting the threshold using the KL relative entropy:
H(p) = -Σ p(x) log p(x);
H(p, q) = -Σ p(x) log q(x);
D_KL(p||q) = H(p, q) - H(p);
wherein p is the original distribution before quantization; q is the distribution after quantization with threshold T; H(p) is the information entropy of the original distribution; H(p, q) is the cross entropy of the original and quantized distributions; D_KL(p||q) is the KL relative entropy;
2) calculating the scaling coefficient scale_fm:
scale_fm = 127/T;
3) fixed-pointing the bias and scaling coefficients: the floating-point factors appearing in forward inference are merged, amplified and rounded, and the final coefficients are stored as 32-bit integers;
4) calculating the merged, amplified and rounded inverse quantization coefficient:
Bias_merge = int(bias × scale_next_fm × shift_coe);
Scale_merge = int(scale_next_fm × shift_coe/(scale_w × scale_fm));
wherein scale_w is the weight quantization coefficient; scale_fm is the quantization coefficient of the current layer's feature map; bias is the bias; scale_next_fm is the next layer's quantization coefficient; Scale_merge is the merged, amplified and rounded inverse quantization coefficient; Bias_merge is the merged bias coefficient; shift_coe is the amplification factor.
Further, merging the SkyNet network structure comprises:
(1) merging the convolution layer and the normalization layer, the merged output y3 being:
y3 = Wx + B;
wherein:
W = γw/√(σ² + ε);
B = γ(b - μ)/√(σ² + ε) + β;
wherein y1 is the convolution output; x is the input; w is the weight; b is the bias; x, w and b are vectors; μ is the mean; σ is the standard deviation; γ is the scaling coefficient; β is the scaling offset; ε is 1e-6; W is the post-fusion weight; B is the post-fusion bias;
(2) merging the activation, quantization, inverse quantization and saturation truncation into a FETCH operation; the FETCH operation converts 32 bits to 8 bits, performing ReLU activation and saturation truncation.
Further, performing ReLU activation and saturation truncation comprises:
checking the sign of the input data: if it is positive, saturation truncation is performed; if it is negative, the activated value is 0.
Another object of the present invention is to provide a target detection network compression system for edge devices, comprising:
a network pruning module, used to perform optimized pruning of the SkyNet network by removing the bypass branch structure and deleting part of the channels output by the first layer, obtaining the optimized SkyNet network;
a network compression module, used to compress the SkyNet network: the weights are quantized to 7 bits, the feature maps are quantized to 8 bits, and the bias parameters and scaling coefficients are fused and fixed-pointed to 32 bits, after which the network is retrained;
a network structure merging module, used to merge the SkyNet network structure: the convolution layers and normalization layers are merged, and the activation, quantization, inverse quantization and saturation truncation are merged into a single FETCH operation.
Combining all of the above technical schemes, the advantages and positive effects of the invention are:
Starting from the angle of algorithm optimization and taking SkyNet as an example, the invention provides a compression processing technique for target detection networks, reducing the difficulty of deploying them on edge devices. The invention prunes the network, making it better suited to edge devices; applies quantization, greatly reducing the size of the network model; and applies merging, greatly reducing the computational load of the network.
Drawings
Fig. 1 is a flowchart of a target detection network compression method for an edge device according to an embodiment of the present invention.
Fig. 2 is a diagram of the optimized Skynet network structure according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a saturated truncated scaling quantization according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a merged convolutional layer and a normalization layer according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of the FETCH operation provided by the embodiment of the present invention.
Fig. 6 is a schematic diagram of a calculation process of each layer before fixed-point processing according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a calculation process of each layer after the fixed point processing provided by the embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In view of the problems in the prior art, the present invention provides a target detection network compression method for an edge device, and the following describes the present invention in detail with reference to the accompanying drawings.
The target detection network compression method for edge devices provided by the embodiment of the invention comprises the following steps:
optimizing the network structure of SkyNet and quantizing the feature maps and weight parameters; reconstructing the forward-inference structure and merging part of the computation in the depthwise separable convolutions.
As shown in fig. 1, the target detection network compression method for edge devices according to an embodiment of the present invention comprises the following steps:
S101, performing optimized pruning of the SkyNet network by removing the bypass branch structure and deleting part of the channels output by the first layer, to obtain the optimized SkyNet network;
S102, compressing the SkyNet network: the weights are quantized to 7 bits, the feature maps are quantized to 8 bits, and the bias parameters and scaling coefficients are fused and fixed-pointed to 32 bits, after which the network is retrained;
S103, merging the SkyNet network structure: the convolution layers and normalization layers are merged, and the activation, quantization, inverse quantization and saturation truncation are merged into a single FETCH operation.
The optimized pruning of the SkyNet network provided by the embodiment of the invention comprises the following steps:
first, the SkyNet branches are pruned, and, taking each depthwise separable convolution as a minimum unit layer, the output of the first layer is pruned to 32 channels;
second, a pooling operation is added after the sixth layer and the last layer is changed to a depthwise separable convolution, optimizing the whole network into a straight-through, single-branch structure.
The optimized SkyNet network provided by the embodiment of the invention comprises:
a 3-channel input layer CHL3; intermediate layers CHL32, CHL96, CHL192, CHL384, CHL512 and CHL96; and a regression layer CHL30;
the convolutions between the layers of the optimized SkyNet network are depthwise separable convolutions.
The compression of the retrained SkyNet network provided by the embodiment of the invention comprises the following steps:
(1) selecting, for each output channel, the maximum absolute value in the corresponding convolution kernel as the quantization maximum, and performing weight quantization in maximum-value mode:
scale_w = 63/max(|w|);
q_w = w × scale_w;
wherein w is the vector of original weights for each channel; q_w is the vector of quantized weights; scale_w is the scalar scaling coefficient;
(2) selecting a threshold using the KL relative entropy and quantizing the feature map with saturating quantization: a threshold T is selected, values of the original distribution within ±T are scaled proportionally into -127 to +127, and values outside that range are saturated, i.e. represented directly by the saturation value.
Selecting the threshold using the KL relative entropy and quantizing the feature map with saturating quantization comprises:
1) selecting the threshold using the KL relative entropy:
H(p) = -Σ p(x) log p(x);
H(p, q) = -Σ p(x) log q(x);
D_KL(p||q) = H(p, q) - H(p);
wherein p is the original distribution before quantization; q is the distribution after quantization with threshold T; H(p) is the information entropy of the original distribution; H(p, q) is the cross entropy of the original and quantized distributions; D_KL(p||q) is the KL relative entropy;
2) calculating the scaling coefficient scale_fm:
scale_fm = 127/T;
3) fixed-pointing the bias and scaling coefficients: the floating-point factors appearing in forward inference are merged, amplified and rounded, and the final coefficients are stored as 32-bit integers;
4) calculating the merged, amplified and rounded inverse quantization coefficient:
Bias_merge = int(bias × scale_next_fm × shift_coe);
Scale_merge = int(scale_next_fm × shift_coe/(scale_w × scale_fm));
wherein scale_w is the weight quantization coefficient; scale_fm is the quantization coefficient of the current layer's feature map; bias is the bias; scale_next_fm is the next layer's quantization coefficient; Scale_merge is the merged, amplified and rounded inverse quantization coefficient; Bias_merge is the merged bias coefficient; shift_coe is the amplification factor.
The SkyNet network structure merging provided by the embodiment of the invention comprises:
(1) merging the convolution layer and the normalization layer, the merged output y3 being:
y3 = Wx + B;
wherein:
W = γw/√(σ² + ε);
B = γ(b - μ)/√(σ² + ε) + β;
wherein y1 is the convolution output; x is the input; w is the weight; b is the bias; x, w and b are vectors; μ is the mean; σ is the standard deviation; γ is the scaling coefficient; β is the scaling offset; ε is 1e-6; W is the post-fusion weight; B is the post-fusion bias;
(2) merging the activation, quantization, inverse quantization and saturation truncation into a FETCH operation; the FETCH operation converts 32 bits to 8 bits, performing ReLU activation and saturation truncation.
The ReLU activation and saturation truncation provided by the embodiment of the invention comprises:
checking the sign of the input data: if it is positive, saturation truncation is performed; if it is negative, the activated value is 0.
The technical solution of the present invention is further described with reference to the following specific embodiments.
Example 1:
the target detection network compression method for the edge device provided by the embodiment of the invention comprises the following steps:
(1) Optimized pruning
The SkyNet branches are pruned, and, taking each depthwise separable convolution as a minimum unit layer, the output of the first layer is pruned to 32 channels; a pooling operation is then added after the sixth layer; finally, the last layer is changed to a depthwise separable convolution, so that the whole network structure is optimized into a straight-through, single-branch form.
As shown in fig. 2, the optimized SkyNet comprises 8 layers in total: a 3-channel input layer (CHL3), intermediate layers (CHL32, CHL96, CHL192, CHL384, CHL512, CHL96) and a regression layer (CHL30). The convolution between the layers is implemented with depthwise separable convolution (DSC); the network structure is regular, which facilitates module reuse and yields an efficient computation structure.
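To make the pruned structure concrete, the following PyTorch sketch assembles the straight-through pipeline of fig. 2. It is a sketch under stated assumptions: the channel progression follows fig. 2 and the pooling layer after the sixth block is the one named in the text, but the remaining pooling positions, the use of max pooling, and the internal layout of each depthwise separable block are assumptions, since the patent does not spell them out.

```python
import torch
import torch.nn as nn

def dsc(cin, cout):
    """Depthwise separable convolution: 3x3 depthwise + 1x1 pointwise,
    each followed by batch normalization and ReLU (the BN layers are
    later folded into the convolutions for inference)."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, padding=1, groups=cin, bias=False),
        nn.BatchNorm2d(cin),
        nn.ReLU(inplace=True),
        nn.Conv2d(cin, cout, 1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class PrunedSkyNet(nn.Module):
    """Straight-through 8-layer pipeline: CHL3 -> CHL32 -> ... -> CHL30."""
    def __init__(self):
        super().__init__()
        chans = [3, 32, 96, 192, 384, 512, 96, 30]
        layers = []
        for i, (cin, cout) in enumerate(zip(chans[:-1], chans[1:])):
            layers.append(dsc(cin, cout))
            if i in (0, 1, 2, 5):               # the pool after the sixth layer is
                layers.append(nn.MaxPool2d(2))  # specified; the others are assumed
        self.body = nn.Sequential(*layers)

    def forward(self, x):  # x: (N, 3, 160, 160)
        return self.body(x)
```

With a 160 × 160 input, this assumed pooling placement yields a 10 × 10 × 30 regression map at the output.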
The input image size used in the present invention is 160 × 160, and the image size of each layer is shown in table 1:
TABLE 1 Detailed dimensions of the layers
[table rendered as an image in the original publication]
(2) Compressing the model
The quantization comprises three parts: quantization of the weights, quantization of the feature maps, and fixed-pointing of the bias and scaling coefficients.
The weights are quantized in maximum-value mode, the maximum being the largest absolute value in the convolution kernels corresponding to each output channel.
Let the original weight vector for each channel be w, the quantized weight vector q_w, and the scalar scaling coefficient scale_w; the correspondence is given by formulas (1) and (2). For ease of computation the boundary is taken as 63, so the resulting q_w is distributed over -63 to 63.
scale_w=63/max(|w|) (1)
q_w=w×scale_w (2)
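A minimal NumPy sketch of formulas (1) and (2), applied per output channel; rounding to the nearest integer when converting to fixed point is an assumption, as the patent does not state the rounding rule.

```python
import numpy as np

def quantize_weights(w):
    """Maximum-value weight quantization to [-63, 63] (7 bit), formulas (1)-(2).
    w: float array of shape (out_channels, ...)."""
    flat = w.reshape(w.shape[0], -1)
    scale_w = 63.0 / np.abs(flat).max(axis=1)   # formula (1), one scale per channel
    q_w = np.round(flat * scale_w[:, None])     # formula (2), assumed round-to-nearest
    return q_w.reshape(w.shape).astype(np.int8), scale_w
```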
Feature map quantization adopts saturating quantization, with the threshold selected using the KL relative entropy; this markedly reduces the precision loss.
Saturating quantization is shown in fig. 3: a threshold T is selected and the values of the original distribution within ±T are scaled proportionally into -127 to +127; the values marked in red in the figure fall outside this range and are saturated, the saturation value being taken directly to represent them.
Let the original distribution before quantization be p, the distribution after quantization with threshold T be q, the information entropy of the original distribution H(p), the cross entropy of the original and quantized distributions H(p, q), and the KL relative entropy D_KL(p||q); then:
H(p) = -Σ p(x) log p(x) (3)
H(p, q) = -Σ p(x) log q(x) (4)
D_KL(p||q) = H(p, q) - H(p) (5)
the available scaling factor is scale _ fm, where:
Figure BDA0003469305490000105
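The patent states the KL criterion but not the search procedure itself; the sketch below fills that gap with a common histogram-calibration loop in the style of TensorRT's entropy calibration. The histogram resolution (2048 bins), the 128 quantization levels and the smoothing constant are assumptions, not values from the patent.

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) evaluates the KL divergence of (3)-(5)

def choose_threshold(activations, nbins=2048, qlevels=128):
    """Search for the saturation threshold T minimizing D_KL(p||q)."""
    hist, edges = np.histogram(np.abs(activations), bins=nbins)
    hist = hist.astype(np.float64)
    best_i, best_kl = nbins, np.inf
    for i in range(qlevels, nbins + 1):
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()        # saturate everything beyond the candidate T
        # q: the same histogram re-expressed with only `qlevels` distinct levels
        q = np.zeros(i)
        for chunk in np.array_split(np.arange(i), qlevels):
            nz = hist[chunk] > 0
            if nz.any():
                q[chunk[nz]] = hist[chunk].sum() / nz.sum()
        kl = entropy(p + 1e-12, q + 1e-12)  # smoothed to keep the divergence finite
        if kl < best_kl:
            best_i, best_kl = i, kl
    T = edges[best_i]
    return T, 127.0 / T                # the threshold and scale_fm of formula (6)
```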
and (4) performing fixed-point processing on the bias and scaling coefficients, merging floating point number units appearing in forward reasoning, amplifying and rounding, and storing the coefficients by adopting 32-bit integer numbers. And setting the weight quantization coefficient as scale _ w, the current layer feature map quantization coefficient as scale _ fm, the bias as bias, and the next layer quantization coefficient as scale _ next _ fm.
Setting the inverse quantization coefficient after merging, amplification and rounding as Scale _ merge, the Bias coefficient as Bias _ merge, and the amplification factor as shift _ coe, including:
Bias_merge=int(bias×scale_next_fm×shift_coe) (7)
Figure BDA0003469305490000106
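A sketch of formulas (7) and (8), assuming scalar coefficients and shift_coe = 2^16; in practice scale_w, and therefore Scale_merge, would be computed per output channel.

```python
def merge_coefficients(scale_w, scale_fm, scale_next_fm, bias, shift_bits=16):
    """Fold the floating-point requantization factors into 32-bit integers,
    formulas (7)-(8). shift_coe = 2**shift_bits; 16 bits is an assumed choice."""
    shift_coe = 1 << shift_bits
    bias_merge = int(bias * scale_next_fm * shift_coe)                   # formula (7)
    scale_merge = int(scale_next_fm * shift_coe / (scale_w * scale_fm))  # formula (8)
    return scale_merge, bias_merge
```

Because shift_coe is a power of two, the later division by shift_coe reduces to a right shift, which is the point of the FETCH operation described below.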
(3) Merging the network structure
The convolution layer and the normalization layer are merged. The depthwise separable convolution includes a normalization layer which, during training, accelerates network convergence, controls overfitting and alleviates vanishing and exploding gradients. After model training is complete all parameters are fixed, and at that point the convolution layer parameters and normalization layer parameters in the network can be combined, as shown in fig. 4, effectively simplifying the network structure, reducing the amount of computation and improving computational efficiency.
Let the convolution output be y1, the input x, the weight w and the bias b, where x, w and b are vectors; let the normalization layer output be y2, the mean μ, the standard deviation σ, the scaling coefficient γ and the scaling offset β, and let ε be 1e-6. The convolution calculation formula is:
y1=wx+b (9)
the normalization layer calculation formula is as follows:
y2 = γ(y1 - μ)/√(σ² + ε) + β (10)
wherein:
μ = (1/m)Σ y1_i (11)
σ² = (1/m)Σ (y1_i - μ)² (12)
substituting equation (10) for equation (11), assuming W as the post-fusion weight and B as the post-fusion bias, with the combined output y 3:
Figure BDA0003469305490000114
wherein:
Figure BDA0003469305490000115
Figure BDA0003469305490000116
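A PyTorch sketch of this fusion, assuming μ and σ² are the running statistics stored in the BatchNorm2d layer:

```python
import torch

@torch.no_grad()
def fold_bn_into_conv(conv, bn):
    """Fold nn.BatchNorm2d `bn` into the preceding nn.Conv2d `conv`
    per formulas (13)-(15). Modifies `conv` in place and returns it."""
    std = torch.sqrt(bn.running_var + bn.eps)        # sqrt(sigma^2 + eps)
    scale = bn.weight / std                          # gamma / sqrt(sigma^2 + eps)
    conv.weight.mul_(scale.reshape(-1, 1, 1, 1))     # W = gamma*w/sqrt(...)        (14)
    b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    folded = scale * (b - bn.running_mean) + bn.bias # B = gamma*(b-mu)/sqrt(...)+beta (15)
    if conv.bias is None:
        conv.bias = torch.nn.Parameter(folded)
    else:
        conv.bias.copy_(folded)
    return conv                                      # y3 = W*x + B                 (13)
```

After folding, the normalization layer can be replaced by nn.Identity() in the model definition.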
and after fusion, normalization layer calculation is not needed any more, so that the size of the model is reduced, the calculation resource is saved, and the performance is improved for forward reasoning.
The activation, quantization, inverse quantization and saturation truncation operations are merged into the FETCH operation, as shown in fig. 5.
The FETCH operation performs the 32-bit to 8-bit conversion, essentially completing the ReLU activation and saturation truncation in one step. After computation with fixed-point data the result must be scaled back to 1/shift_coe of its value; on an edge device, since shift_coe is a power of 2, this can be done by directly taking the high-order bits. Along the way the sign of the input data is checked: if positive, saturation truncation is performed; if negative, the activated value is 0.
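A sketch of the FETCH step in pure integer arithmetic, using the merged coefficients of formulas (7) and (8); shift_bits (with shift_coe = 2^shift_bits) is an assumed parameter:

```python
def fetch(acc, scale_merge, bias_merge, shift_bits=16):
    """Requantize a 32-bit convolution accumulator to an 8-bit activation:
    ReLU activation and saturation truncation in one step."""
    out = acc * scale_merge + bias_merge    # fixed-point result, still 32-bit
    if out <= 0:
        return 0                            # negative input: ReLU outputs 0
    out >>= shift_bits                      # divide by shift_coe via bit shift
    return min(out, 127)                    # positive input: saturate to int8 max
```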
Fig. 6 shows the inference steps of each layer before fixed-pointing and network structure merging, and fig. 7 shows the inference steps after optimization.
Example 2:
(1) Optimized pruning and retraining
After the SkyNet network structure is optimized into the straight-through form, the model is retrained. Table 2 compares the accuracy of SkyNet before and after pruning; taking average precision (AP) as the evaluation index, the accuracy drops by less than 0.03, so the efficiency of real-time SkyNet computation on the edge device is greatly improved while practical application requirements are still met.
TABLE 2 Comparison of accuracy before and after SkyNet pruning

Model type             Average precision (AP)
Complete model         0.797
Model after pruning    0.770
(2) Compressing the model and merging the network structure
The pruned and retrained model is quantized, and the activation, quantization, inverse quantization and saturation truncation operations are merged. Compared with the uncompressed model, the precision drops by only 2.34%, while the size of the network model is reduced by 74.5%.
Model size, average precision and precision loss before and after compression optimization are compared in table 3.

TABLE 3 Comparison before and after model compression

                        Before compression       After compression
Data type               32-bit floating point    7-bit/8-bit/32-bit fixed point
Model size (MB)         1.41                     0.359
Average precision (AP)  0.770                    0.752
Precision loss          ---                      2.34%
Compression ratio       ---                      74.5%
The above description is only for the purpose of illustrating the present invention and is not intended to limit its scope; all modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention are intended to be covered by the appended claims.

Claims (8)

1. A target detection network compression method for an edge device, the method comprising:
optimizing the network structure of SkyNet and quantizing the feature maps and weight parameters; reconstructing the forward-inference structure and merging part of the computation in the depthwise separable convolutions.
2. The target detection network compression method for an edge device of claim 1, comprising the steps of:
step one, performing optimized pruning of the SkyNet network by removing the bypass branch structure and deleting part of the channels output by the first layer, to obtain the optimized SkyNet network;
step two, compressing the SkyNet network: the weights are quantized to 7 bits, the feature maps are quantized to 8 bits, and the bias parameters and scaling coefficients are fused and fixed-pointed to 32 bits, after which the network is retrained;
step three, merging the SkyNet network structure: the convolution layers and normalization layers are merged, and the activation, quantization, inverse quantization and saturation truncation are merged into a single FETCH operation.
3. The target detection network compression method for an edge device of claim 2, wherein the optimized pruning of the SkyNet network comprises the steps of:
first, the SkyNet branches are pruned, and, taking each depthwise separable convolution as a minimum unit layer, the output of the first layer is pruned to 32 channels;
second, a pooling operation is added after the sixth layer and the last layer is changed to a depthwise separable convolution, optimizing the whole network into a straight-through, single-branch structure.
4. The target detection network compression method for an edge device of claim 2, wherein the optimized SkyNet network comprises:
a 3-channel input layer CHL3; intermediate layers CHL32, CHL96, CHL192, CHL384, CHL512 and CHL96; and a regression layer CHL30;
the convolutions between the layers of the optimized SkyNet network being depthwise separable convolutions.
5. The target detection network compression method for an edge device of claim 2, wherein the compression of the retrained SkyNet network comprises the steps of:
(1) selecting, for each output channel, the maximum absolute value in the corresponding convolution kernel as the quantization maximum, and performing weight quantization in maximum-value mode:
scale_w = 63/max(|w|);
q_w = w × scale_w;
wherein w is the vector of original weights for each channel; q_w is the vector of quantized weights; scale_w is the scalar scaling coefficient;
(2) selecting a threshold using the KL relative entropy and quantizing the feature map with saturating quantization: a threshold T is selected, values of the original distribution within ±T are scaled proportionally into -127 to +127, and values outside that range are saturated, i.e. represented directly by the saturation value.
6. The method according to claim 5, wherein selecting the threshold using the KL relative entropy and quantizing the feature map with saturating quantization comprises:
1) selecting the threshold using the KL relative entropy:
H(p) = -Σ p(x) log p(x);
H(p, q) = -Σ p(x) log q(x);
D_KL(p||q) = H(p, q) - H(p);
wherein p is the original distribution before quantization; q is the distribution after quantization with threshold T; H(p) is the information entropy of the original distribution; H(p, q) is the cross entropy of the original and quantized distributions; D_KL(p||q) is the KL relative entropy;
2) calculating the scaling coefficient scale_fm:
scale_fm = 127/T;
3) fixed-pointing the bias and scaling coefficients: the floating-point factors appearing in forward inference are merged, amplified and rounded, and the final coefficients are stored as 32-bit integers;
4) calculating the merged, amplified and rounded inverse quantization coefficient:
Bias_merge = int(bias × scale_next_fm × shift_coe);
Scale_merge = int(scale_next_fm × shift_coe/(scale_w × scale_fm));
wherein scale_w is the weight quantization coefficient; scale_fm is the quantization coefficient of the current layer's feature map; bias is the bias; scale_next_fm is the next layer's quantization coefficient; Scale_merge is the merged, amplified and rounded inverse quantization coefficient; Bias_merge is the merged bias coefficient; shift_coe is the amplification factor.
7. The target detection network compression method for an edge device of claim 2, wherein merging the SkyNet network structure comprises:
(1) merging the convolution layer and the normalization layer, the merged output y3 being:
y3 = Wx + B;
wherein:
W = γw/√(σ² + ε);
B = γ(b - μ)/√(σ² + ε) + β;
wherein y1 is the convolution output; x is the input; w is the weight; b is the bias; x, w and b are vectors; μ is the mean; σ is the standard deviation; γ is the scaling coefficient; β is the scaling offset; ε is 1e-6; W is the post-fusion weight; B is the post-fusion bias;
(2) merging the activation, quantization, inverse quantization and saturation truncation into a FETCH operation, the FETCH operation converting 32 bits to 8 bits and performing ReLU activation and saturation truncation, wherein performing ReLU activation and saturation truncation comprises:
checking the sign of the input data: if it is positive, saturation truncation is performed; if it is negative, the activated value is 0.
8. A target detection network compression system for an edge device, implementing the target detection network compression method for an edge device according to any one of claims 1 to 7, the system comprising:
a network pruning module, used to perform optimized pruning of the SkyNet network by removing the bypass branch structure and deleting part of the channels output by the first layer, obtaining the optimized SkyNet network;
a network compression module, used to compress the SkyNet network: the weights are quantized to 7 bits, the feature maps are quantized to 8 bits, and the bias parameters and scaling coefficients are fused and fixed-pointed to 32 bits, after which the network is retrained;
a network structure merging module, used to merge the SkyNet network structure: the convolution layers and normalization layers are merged, and the activation, quantization, inverse quantization and saturation truncation are merged into a single FETCH operation.
CN202210038592.0A 2022-01-13 2022-01-13 Target detection network compression method for edge device Pending CN114372565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210038592.0A CN114372565A (en) 2022-01-13 2022-01-13 Target detection network compression method for edge device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210038592.0A CN114372565A (en) 2022-01-13 2022-01-13 Target detection network compression method for edge device

Publications (1)

Publication Number Publication Date
CN114372565A true CN114372565A (en) 2022-04-19

Family

ID=81144914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210038592.0A Pending CN114372565A (en) 2022-01-13 2022-01-13 Target detection network compression method for edge device

Country Status (1)

Country Link
CN (1) CN114372565A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505774A (en) * 2021-07-14 2021-10-15 青岛全掌柜科技有限公司 Novel policy identification model size compression method


Similar Documents

Publication Publication Date Title
CN108764471B (en) Neural network cross-layer pruning method based on feature redundancy analysis
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
CN111079781B (en) Lightweight convolutional neural network image recognition method based on low rank and sparse decomposition
CN113159173B (en) Convolutional neural network model compression method combining pruning and knowledge distillation
CN109002889B (en) Adaptive iterative convolution neural network model compression method
CN112163628A (en) Method for improving target real-time identification network structure suitable for embedded equipment
CN112329922A (en) Neural network model compression method and system based on mass spectrum data set
CN110533022B (en) Target detection method, system, device and storage medium
CN113222138A (en) Convolutional neural network compression method combining layer pruning and channel pruning
CN113011570A (en) Adaptive high-precision compression method and system of convolutional neural network model
CN110781912A (en) Image classification method based on channel expansion inverse convolution neural network
CN111696149A (en) Quantization method for stereo matching algorithm based on CNN
CN113610192A (en) Neural network lightweight method and system based on continuous pruning
CN114742211B (en) Convolutional neural network deployment and optimization method facing microcontroller
CN112597919A (en) Real-time medicine box detection method based on YOLOv3 pruning network and embedded development board
CN113595993A (en) Vehicle-mounted sensing equipment joint learning method for model structure optimization under edge calculation
CN114372565A (en) Target detection network compression method for edge device
Qi et al. Learning low resource consumption cnn through pruning and quantization
CN112561054B (en) Neural network filter pruning method based on batch characteristic heat map
CN116959477B (en) Convolutional neural network-based noise source classification method and device
CN117333497A (en) Mask supervision strategy-based three-dimensional medical image segmentation method for efficient modeling
CN112613604A (en) Neural network quantification method and device
CN117151178A (en) FPGA-oriented CNN customized network quantification acceleration method
CN112308213A (en) Convolutional neural network compression method based on global feature relationship
CN116757255A (en) Method for improving weight reduction of mobile NetV2 distracted driving behavior detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination