CN112257794B - YOLO-based lightweight target detection method - Google Patents

Info

Publication number
CN112257794B
CN112257794B
Authority
CN
China
Prior art keywords
module
channel
convolution
target detection
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011164112.2A
Other languages
Chinese (zh)
Other versions
CN112257794A (en)
Inventor
李晨
许虞俊
杜文娟
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202011164112.2A
Publication of CN112257794A
Application granted
Publication of CN112257794B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight YOLO-based target detection method built from repeatedly stacked channel non-scaling convolution modules. By combining these repeated channel non-scaling convolution blocks with ordinary 1x1 and 3x3 convolutions, the model size is greatly reduced, while an ECA structure redistributes the weight of each channel, strengthening each channel's adaptive learning of different target classes. Following the YOLO family of frameworks, the network outputs three feature maps at different scales, each responsible for predicting objects of a corresponding size, so the model achieves high detection accuracy with an extremely small number of parameters.

Description

YOLO-based lightweight target detection method
Technical Field
The invention relates to the field of computer-vision target detection, and in particular to a lightweight multi-scale target detection method.
Background
Target detection has long been an important research area within computer vision, and with the development of deep learning, detection algorithms have shifted from traditional methods based on hand-crafted features to techniques based on deep convolutional neural networks. As accuracy requirements have risen and detection tasks have grown harder, increasingly complex large networks have been designed, such as SSD, R-CNN and Mask R-CNN. Their parameter counts often exceed 100M: Faster R-CNN reaches 132M parameters and AmoebaNet reaches 209M. Although a larger model and a deeper network allow more deep features to be extracted and thus improve accuracy, they introduce enormous parameter and computation costs, and such networks cannot be deployed in memory-limited application scenarios. Lightweight target detection networks have therefore long been a research field of great interest in industry.
When deploying a target detection network in mobile scenarios, one must consider not only the computational complexity and parameter count of the model but also its detection accuracy. Common methods such as network pruning and network parameter quantization optimize a network model that has already been designed.
To better suit mobile scenarios, a lightweight network tailored to them is needed, one that addresses the limited memory and computing power of such scenarios so that a lightweight target detection network can be readily deployed.
Disclosure of Invention
Purpose of the invention: to overcome the defects of the prior art, the invention provides a YOLO-based lightweight target detection method that effectively reduces the number of network parameters while improving detection accuracy.
Technical scheme: to achieve this purpose, the invention adopts the following technical scheme:
A YOLO-based lightweight target detection method comprises a feature extraction module, a multi-scale receptive field fusion module and a target detection module. The feature extraction module mainly comprises 1x1 convolutions, 3x3 convolutions and the channel non-scaling convolution block NEP; stacking these modules yields a lightweight backbone network with a large receptive field for feature extraction. Three branches are drawn from the extracted semantic features at different scales and sent to the multi-scale receptive field fusion module, and the three fused feature maps of different scales are finally used to predict objects of different sizes. The Loss function is computed with Distance-IoU Loss to improve the regression accuracy of the detection boxes and obtain the final target detection network. The method specifically comprises the following steps:
step 1, collecting detection pictures to form a training set.
Step 2, feature extraction: the training set is fed into the feature extraction module to extract semantic features; three branches are drawn from the extracted semantic features at different scales and sent to the multi-scale receptive field fusion module. The feature extraction module comprises a first 1x1 convolution, a first 3x3 convolution and the channel non-scaling convolution block NEP, connected in sequence. The NEP comprises a first-layer network, a second-layer network, an attention module ECA and a third-layer network connected in sequence, where the first-layer network is a first Ghost module, the second-layer network is a 3x3 depthwise separable convolution block, and the third-layer network is a second Ghost module. The first and second Ghost modules each comprise a second 1x1 convolution and a second 3x3 depthwise separable convolution connected in sequence, and they replace the commonly used 1x1 convolution block.
The Ghost module from the GhostNet network structure is used as the basic convolution module of the network; it realizes the function of a conventional 1x1 convolution through the combination of a 1x1 convolution and a 3x3 depthwise separable convolution. By introducing the 3x3 depthwise separable convolution, the invention ensures that the receptive field of the network keeps expanding during channel fusion, addressing the insufficient feature extraction caused by the shallow depth of a lightweight network. Because the Ghost module replaces the commonly used 1x1 convolution block, its internal 3x3 depthwise separable convolution further enlarges the receptive field of the lightweight network and effectively improves its feature extraction performance.
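The parameter saving of a Ghost module over an ordinary pointwise convolution can be sketched with a quick count (a rough illustration assuming the common GhostNet ratio s = 2, a 3x3 cheap kernel and no biases; the patent does not state its exact Ghost configuration):

```python
def conv1x1_params(c_in, c_out):
    # ordinary pointwise (1x1) convolution: one weight per (input, output) channel pair
    return c_in * c_out

def ghost_module_params(c_in, c_out, kernel=3, ratio=2):
    # primary 1x1 convolution produces c_out // ratio "intrinsic" feature maps;
    # a cheap depthwise kernel x kernel convolution generates the remaining "ghost" maps
    primary = c_in * (c_out // ratio)
    cheap = (c_out - c_out // ratio) * kernel * kernel
    return primary + cheap

plain = conv1x1_params(128, 128)        # 16384 weights
ghost = ghost_module_params(128, 128)   # 8192 + 576 = 8768 weights
print(plain, ghost)
```

For a 128-to-128 channel layer the Ghost variant needs roughly half the weights, which is where the model-size reduction claimed above comes from.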
The attention module ECA learns the channel weights of the channel non-scaling convolution block NEP through a weight-shared 1-dimensional convolution applied to the one-dimensional feature map obtained after global average pooling, where the size k×1 of the 1-dimensional convolution kernel represents the module's cross-channel information interaction rate and k is adjusted dynamically with the number of channels. The learned weight of each channel is then assigned to the corresponding feature channel of the NEP; the reweighted channels finally undergo weighted feature fusion, and the fused weighted features pass through the second Ghost module to produce the semantic features. By replacing the fully connected channel-scaling scheme with a 1-dimensional convolution, the ECA module avoids information loss and greatly reduces the parameter count of the network. The invention adopts this module to effectively balance the channel weights of the lightweight neural network, so that with a limited parameter budget the network pays more attention to learning the parameters of the important feature channels.
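The patent states only that k adapts to the channel count. As an assumption, the adaptive rule from the original ECA design (k derived from log2 of the channel count with γ = 2, b = 1, rounded to the nearest odd size) would look like this:

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    # ECA-style adaptive 1-D kernel size: t = |(log2(C) + b) / gamma|, forced odd
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1

for c in (64, 128, 256, 512):
    print(c, eca_kernel_size(c))  # wider layers get a wider cross-channel interaction range
```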
Finally, under a YOLO-family network framework, the stacked channel non-scaling modules NEP are combined with ordinary 1x1 and 3x3 convolutions, features are fused in a manner similar to a feature pyramid network, and after fusion the network outputs three-dimensional tensors through several convolutions; tensors of different scales are responsible for predicting target detection boxes of different scales.
Within the NEP, a Ghost module adjusts the number of channels, a 3x3 depthwise separable convolution extracts features, the output is sent to the attention module ECA to compute per-channel weights, the computed weights are assigned to the feature channels, and finally a Ghost module fuses the reweighted channels to produce the complete network output.
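The channel and spatial bookkeeping of that NEP pipeline can be sketched at the shape level (NumPy stand-ins only: the Ghost and depthwise convolutions below produce correctly shaped dummy tensors rather than learned features, and the residual rule follows the stride conditions described elsewhere in this document):

```python
import numpy as np

def ghost(x, c_out):
    # stand-in Ghost module: adjusts only the channel count (1x1 conv + cheap depthwise)
    n, _, h, w = x.shape
    return np.zeros((n, c_out, h, w))

def dwconv3x3(x, stride):
    # stand-in 3x3 depthwise convolution with "same" padding; channel count unchanged
    n, c, h, w = x.shape
    return np.zeros((n, c, -(-h // stride), -(-w // stride)))

def eca(x):
    # squeeze each channel to one value (global average pool), then reweight channels
    w = 1.0 / (1.0 + np.exp(-x.mean(axis=(2, 3), keepdims=True)))  # sigmoid gate
    return x * w

def nep_block(x, c_out, stride=1):
    y = ghost(x, c_out)       # Ghost module adjusts the number of channels
    y = dwconv3x3(y, stride)  # depthwise conv extracts features / downsamples
    y = eca(y)                # ECA reallocates the channel weights
    y = ghost(y, c_out)       # Ghost module fuses the reweighted channels
    if stride == 1 and y.shape == x.shape:
        y = y + x             # residual connection only in the stride-1 case
    return y

x = np.random.rand(1, 32, 64, 64)
print(nep_block(x, 32, stride=1).shape)  # (1, 32, 64, 64)
print(nep_block(x, 64, stride=2).shape)  # (1, 64, 32, 32)
```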
Step 3, a multi-scale receptive field fusion module: and the multi-scale receptive field fusion module performs multi-scale fusion according to the semantic features of the three branches to obtain three fused feature maps with different scales.
Step 4, the target detection module: and respectively predicting objects with different sizes by using the three fused feature maps with different scales.
Step 5, a loss calculation module: and calculating a Loss function by adopting Distance-IoU Loss, and improving the regression precision of the detection frame to obtain a final target detection network.
Preferably: when the feature extraction module uses the channel non-scaling convolution block NEP to down-sample the current feature map, the number of channels is expanded, avoiding the loss of feature information caused by down-sampling.
Preferably: when the stride of the depthwise separable convolution in the channel non-scaling convolution block NEP is 2, no residual connection is used; when the stride is 1, a residual connection is added.
Preferably: the feature map used by the channel non-scaling convolution block NEP is not scaled in the channel dimension during computation, keeping the original number of channels unchanged; this avoids both the loss of feature information caused by reducing the channel dimension and the excess parameters caused by expanding the number of channels.
Preferably: in step 3, the multi-scale receptive field fusion module uses the outputs of the feature extraction module at different scales and, through 1x1 convolution, fuses the smaller-scale high-level semantic features into the larger-scale features with smaller receptive fields, improving the detection of small and medium targets.
Preferably: in step 5, the Distance-IoU Loss is:

L_DIoU = 1 - IoU + ρ²(b, b^gt) / c²

where L_DIoU denotes the position loss of the prediction box, B denotes the prediction box, B^gt denotes the ground-truth box, b and b^gt denote the center points of the prediction box and the ground-truth box, ρ(·) denotes the Euclidean distance, and c denotes the diagonal length of the smallest rectangle containing both the prediction box and the ground-truth box.
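For concreteness, the formula can be evaluated directly on corner-format boxes (a minimal sketch; the (x1, y1, x2, y2) box layout is an assumption, and no degenerate-box handling is included):

```python
def diou_loss(pred, gt):
    # boxes given as (x1, y1, x2, y2); returns L_DIoU = 1 - IoU + rho^2 / c^2
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # intersection over union
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union

    # squared distance between the two box centres (rho^2)
    rho2 = (((px1 + px2) - (gx1 + gx2)) ** 2
            + ((py1 + py2) - (gy1 + gy2)) ** 2) / 4.0

    # squared diagonal of the smallest rectangle enclosing both boxes (c^2)
    cx1, cy1 = min(px1, gx1), min(py1, gy1)
    cx2, cy2 = max(px2, gx2), max(py2, gy2)
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2

    return 1.0 - iou + rho2 / c2

print(diou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0 for identical boxes
```

Unlike plain IoU, the penalty term keeps a useful gradient even when the boxes do not overlap, which is why it improves box regression.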
Preferably, the following components: and finally, training the data used by the network in an actual application scene in a centralized manner, and after the training is finished, quantizing the parameters of the final target detection network into 8 bits, so that the size of the model of the network is further reduced, and the model can be deployed in an application scene with limited memory. .
Compared with the prior art, the invention has the following beneficial effects:
The lightweight target detection method of the invention uses the Ghost module as the basic module for channel adjustment and channel feature fusion, introducing a 3x3 depthwise separable convolution on top of the ordinary 1x1 convolution to remedy the insufficient receptive field and insufficient semantic features of a lightweight detection network. Channel weights are redistributed by the ECA module, fully exploiting the available channel capacity of the lightweight convolution, and the NEP module performs no channel scaling during computation, reducing the loss of feature information and effectively improving detection accuracy. The proposed network structure thus overcomes the excessive parameter complexity of deep convolutional neural networks; its accuracy improves on current mainstream lightweight detection algorithms, and quantizing the parameters to 8 bits further reduces the model size while retaining high-accuracy detection.
Drawings
Fig. 1 is a diagram of a network architecture of the present invention.
Fig. 2 is the Ghost module.
Fig. 3 is an ECA module.
Fig. 4 shows the structure of the proposed channel non-scaling convolution block NEP, with convolution stride 2 on the left and stride 1 on the right.
Fig. 5 shows the structure designed in the embodiment on the basis of the channel non-scaling convolution block, following the network framework adopted by the invention. The number in parentheses after each module name is the tensor dimension of that module's output.
Detailed Description
The present invention is further illustrated in the accompanying drawings and the following detailed description. It should be understood that these examples are included solely for purposes of illustration and are not intended to limit the scope of the invention; various equivalent modifications will become apparent to those skilled in the art after reading this specification, and all such modifications falling within the scope of the appended claims are intended to be covered.
In a YOLO-based lightweight target detection method, a target detection algorithm identifies the target objects contained in an image and outputs their positions. This example uses the YOLO-based lightweight detection algorithm, trains it on a training set and verifies the detection effect of the model on a test set. The method is built from repeatedly stacked channel non-scaling convolution modules; the combination of repeated channel non-scaling convolution blocks with ordinary 1x1 and 3x3 convolutions greatly reduces the model size, while an ECA structure redistributes the channel weights and strengthens each channel's adaptive learning of different target classes. Following the YOLO family of frameworks, the network outputs three feature maps of different scales, each responsible for predicting objects of a corresponding size, so the model achieves high detection accuracy with an extremely small number of parameters. As shown in Figs. 1-5, the method comprises the following steps:
step 1, collecting detection pictures to form a training set.
Step 2, feature extraction: the training set is fed into the feature extraction module to extract semantic features; three branches are drawn from the extracted semantic features at different scales and sent to the multi-scale receptive field fusion module. The feature extraction module comprises a first 1x1 convolution, a first 3x3 convolution and the channel non-scaling convolution block NEP, connected in sequence. As shown in Fig. 4, the channel non-scaling convolution block NEP includes a first-layer network, a second-layer network, an attention module ECA and a third-layer network connected in sequence, where the first-layer network is a first Ghost module, the second-layer network is a 3x3 depthwise separable convolution block, and the third-layer network is a second Ghost module; the first and second Ghost modules each comprise a second 1x1 convolution and a second 3x3 depthwise separable convolution connected in sequence, and they replace the commonly used 1x1 convolution block.
As shown in Fig. 2, the lightweight detection algorithm used in this example employs the Ghost module from the GhostNet network structure as the basic convolution module; the Ghost module realizes the function of a conventional 1x1 convolution through the combination of a 1x1 convolution and a 3x3 depthwise separable convolution. By introducing the 3x3 depthwise separable convolution, the invention ensures that the receptive field of the network keeps expanding during channel fusion, addressing the insufficient feature extraction caused by the shallow depth of a lightweight network.
To further improve the detection effect of the lightweight network and make full use of its limited parameters, this example uses the channel attention module ECA shown in Fig. 3. On the one-dimensional feature map obtained after global average pooling, the weight of each channel is learned through a weight-shared 1-dimensional convolution, where the size k×1 of the 1-dimensional convolution kernel represents the module's cross-channel information interaction rate and k is adjusted dynamically with the number of channels. The module replaces the fully connected channel-scaling scheme with a 1-dimensional convolution, avoiding information loss and greatly reducing the parameter count of the network; the invention adopts it to effectively balance the lightweight network's channel weights.
The attention module ECA learns the channel weights of the channel non-scaling convolution block NEP through a weight-shared 1-dimensional convolution on the one-dimensional feature map obtained after global average pooling. The learned weight of each channel is then assigned to the corresponding feature channel of the NEP; the reweighted channels finally undergo weighted feature fusion, and the fused weighted features pass through the second Ghost module to produce the semantic features.
To make the model more compact, this example fuses the Ghost module and the channel attention module ECA into a new convolution module without channel-number scaling, the NEP. The NEP first adjusts the number of channels through a Ghost module, then extracts features with a 3x3 depthwise separable convolution, sends the output to the channel attention module ECA to compute per-channel weights, assigns the computed weights to the feature channels, and finally fuses the reweighted channels through another Ghost module to obtain the complete network output.
In this embodiment, the stacked channel non-scaling modules NEP are combined with ordinary 1x1 and 3x3 convolutions, features are fused in a manner similar to a feature pyramid network, and after the network features are fused by weight, three-dimensional tensors are output through several convolutions; tensors of different scales are responsible for predicting target detection boxes of different scales.
When the feature extraction module uses the channel non-scaling convolution block NEP to down-sample the current feature map, the number of channels is expanded, avoiding the loss of feature information caused by down-sampling.
As shown in Fig. 4, when the stride of the NEP's depthwise separable convolution is 2, no residual connection is used; when the stride is 1 and no downsampling occurs, a residual connection is added, alleviating the vanishing-gradient problem of shallow networks during training.
The feature map used by the NEP is not scaled in the channel dimension during computation, keeping the original number of channels unchanged; this avoids both the loss of feature information caused by reducing the channel dimension and the excess parameters caused by expanding it.
Step 3, a multi-scale receptive field fusion module: and the multi-scale receptive field fusion module performs multi-scale fusion according to the semantic features of the three branches to obtain three fused feature maps with different scales.
The multi-scale receptive field fusion module uses the outputs of the feature extraction module at different scales and, through 1x1 convolution, fuses the smaller-scale high-level semantic features into the larger-scale features with smaller receptive fields, improving the detection of small and medium targets.
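At the shape level, that fusion step can be sketched as follows (NumPy stand-ins with YOLO-style 13x13 and 26x26 grids as assumed example sizes; the 1x1 convolution is a dummy that only adjusts channels):

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour 2x upsampling of an NCHW feature map
    return x.repeat(2, axis=2).repeat(2, axis=3)

def conv1x1(x, c_out):
    # stand-in 1x1 convolution: only adjusts the channel count
    n, _, h, w = x.shape
    return np.zeros((n, c_out, h, w), dtype=x.dtype)

# the small, high-level 13x13 feature is fused into the larger 26x26 branch
p_small = np.zeros((1, 256, 13, 13), dtype=np.float32)
p_mid = np.zeros((1, 128, 26, 26), dtype=np.float32)
lateral = upsample2x(conv1x1(p_small, 128))
fused = np.concatenate([lateral, p_mid], axis=1)
print(fused.shape)  # (1, 256, 26, 26)
```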
Step 4, the target detection module: and respectively predicting objects with different sizes by using the three fused feature maps with different scales.
Step 5, a loss calculation module: and calculating a Loss function by adopting Distance-IoU Loss, and improving the regression precision of the detection frame to obtain a final target detection network.
The training Loss consists mainly of three parts: the classification loss Loss_class, the position loss Loss_location of the prediction boxes, and the confidence loss Loss_confidence on whether each box contains an object and how accurate the box is. The position loss Loss_location is computed with Distance-IoU, with the formula:

L_DIoU = 1 - IoU + ρ²(b, b^gt) / c²

Distance-IoU Loss adds a penalty term to the original IoU computation: b and b^gt denote the center points of the prediction box and the ground-truth box, ρ(·) denotes the Euclidean distance, and c denotes the diagonal length of the smallest rectangle containing both boxes. The added penalty term shortens the center distance between the predicted and ground-truth boxes and reflects the actual error between them.
In this embodiment, as shown in fig. 5, the training step of the model includes:
a network model is built by using a Keras framework, 1 Tesla V100 GPU video card is used for model training, and a Pascal VOC data set is used as a data set. Training by using a Train and val data set (5011 sheets in total) of VOC2007 and a Train and val data set (11540 sheets in total) of VOC2012, then testing by using Test data (4952 sheets in total) of VOC2007, fixing the pixels of an input picture to be 416x416, and using data enhancement including turning, scaling, random clipping and HSV enhancement on the input picture; using an Adam optimizer, an initial learning rate of 0.001, a batch size of 16, 250 epochs were trained. The decline of the learning rate is realized by monitoring the val _ loss, when the val _ loss keeps 10 epochs not to decline any more, the learning rate is reduced to 0.5 time of the original learning rate, and the network training uses a loss function Distance-IoU loss to guide the network optimization. The performance of the network is measured by the 20-class multi-class average accuracy of the VOC, as well as the model size.
The finally trained network model has only 1.54M parameters and reaches 72.1% mAP on the VOC2007 test set. A comparison with some mainstream lightweight target detection networks is as follows:
Model Name    Parameters  mAP (VOC2007)  FLOPs
Tiny YOLOv2   15.1M       57.1%          6.97B
Tiny YOLOv3   8.4M        58.4%          5.52B
YOLO Nano     4.0M        69.1%          4.67B
Ours          1.54M       72.1%          2.69B
These results show that the proposed network has the smallest parameter count, only 1.54M, close to 1/3 of YOLO Nano, and the lowest floating-point computation at only 2.69B, while achieving the highest detection accuracy of all the models compared. The available capacity of the lightweight network is fully utilized, so the network strikes a better balance between reducing model parameters and improving model expressiveness.
The above description covers only preferred embodiments of the invention. It should be noted that various modifications and adaptations can be made by those skilled in the art without departing from the principles of the invention, and these are also intended to fall within its scope.

Claims (7)

1. A lightweight target detection method based on YOLO is characterized by comprising the following steps:
step 1, collecting detection pictures to form a training set;
step 2, feature extraction: inputting the training set into a feature extraction module to extract semantic features, drawing three branches from the extracted semantic features at different scales, and sending the three branches into a multi-scale receptive field fusion module; the feature extraction module comprises a first 1x1 convolution, a first 3x3 convolution and a channel non-scaling convolution block NEP which are connected in sequence; the channel non-scaling convolution block NEP comprises a first layer network, a second layer network, an attention module ECA and a third layer network which are connected in sequence, wherein the first layer network is a first Ghost module, the second layer network is a 3x3 depthwise separable convolution block, and the third layer network is a second Ghost module, the first Ghost module and the second Ghost module each comprising a second 1x1 convolution and a second 3x3 depthwise separable convolution connected in sequence, the first Ghost module and the second Ghost module replacing a commonly used 1x1 convolution block; the attention module ECA learns each channel weight of the channel non-scaling convolution block NEP through a weight-shared 1-dimensional convolution on the one-dimensional feature map obtained after global average pooling, wherein the size k×1 of the 1-dimensional convolution kernel represents the cross-channel information interaction rate of the module and k is dynamically adjusted with the number of channels; the obtained weight of each channel is then assigned to each feature channel of the channel non-scaling convolution block NEP, the reweighted channels finally undergo weighted feature fusion, and the fused weighted features pass through the second Ghost module to obtain the semantic features;
step 3, a multi-scale receptive field fusion module: the multi-scale receptive field fusion module performs multi-scale fusion according to the semantic features of the three branches to obtain three fused feature maps with different scales;
step 4, the target detection module: respectively predicting objects with different sizes by using the three fused feature maps with different scales;
step 5, a loss calculation module: and calculating a Loss function by adopting Distance-IoU Loss, and improving the regression precision of the detection frame to obtain a final target detection network.
2. The YOLO-based lightweight target detection method of claim 1, wherein: when the feature extraction module uses the channel non-scaling convolution block NEP to down-sample the current feature map, the number of channels is expanded, avoiding the loss of feature information caused by down-sampling.
3. The YOLO-based lightweight target detection method of claim 2, wherein: when the depth separable convolution step length of the channel non-shrinkage and unwinding block NEP is 2, residual connection is not used; the depth separable convolution step length of the channel non-shrinkage volume block NEP is 1, and residual error connection is added.
4. The YOLO-based lightweight target detection method according to claim 3, wherein: the feature map used by the channel non-scaling convolution block NEP is not scaled in the channel dimension during computation, and the original number of channels is kept unchanged, which avoids both the feature information loss caused by reducing the channel dimension and the excess parameters caused by expanding the number of channels.
5. The YOLO-based lightweight target detection method according to claim 4, wherein: in the step 3, the multi-scale receptive field fusion module takes the outputs of the feature extraction module at different scales and, through 1x1 convolutions, fuses the smaller-scale high-level semantic features into the larger-scale features with smaller receptive fields, thereby improving the detection of small and medium targets.
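The fusion step in claim 5 can be sketched in NumPy: project the small-scale, semantically rich map with a 1x1 convolution, upsample it to the larger scale, and combine it with the larger-scale map. The concatenation and nearest-neighbour upsampling below are illustrative choices; the patent only specifies the 1x1 convolution.

```python
import numpy as np

def conv1x1(x, w):
    # 1x1 convolution as per-pixel channel mixing: (C_out, C_in) applied
    # to every spatial position of x with shape (C_in, H, W).
    return np.einsum("oc,chw->ohw", w, x)

def upsample2x(x):
    # Nearest-neighbour upsampling to match the larger-scale feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_scales(high, low, w):
    # high: smaller-scale, semantically richer map (C_h, H, W);
    # low:  larger-scale map (C_l, 2H, 2W) with a smaller receptive field.
    # Project 'high' with a 1x1 convolution, upsample, and concatenate
    # along the channel axis.
    return np.concatenate([upsample2x(conv1x1(high, w)), low], axis=0)
```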
6. The YOLO-based lightweight target detection method according to claim 5, wherein: in the step 5, distance-IoU Loss is as follows:
L_DIoU = 1 - IoU + p^2(b, b_gt) / c^2
wherein L_DIoU represents the position loss of the prediction box; IoU is the intersection-over-union of the prediction box B and the real box B_gt; b and b_gt represent the center points of the prediction box and the real box, respectively; p(.) represents the Euclidean distance; and c represents the length of the diagonal of the smallest rectangle containing both the prediction box and the real box.
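The loss above can be computed directly from corner coordinates; a minimal sketch for a single pair of boxes given as (x1, y1, x2, y2):

```python
def diou_loss(box_p, box_g):
    # L_DIoU = 1 - IoU + rho^2(b, b_gt) / c^2, where rho is the distance
    # between the two box centers and c is the diagonal of the smallest
    # rectangle enclosing both boxes.
    x1, y1, x2, y2 = box_p
    g1, h1, g2, h2 = box_g
    inter_w = max(0.0, min(x2, g2) - max(x1, g1))
    inter_h = max(0.0, min(y2, h2) - max(y1, h1))
    inter = inter_w * inter_h
    union = (x2 - x1) * (y2 - y1) + (g2 - g1) * (h2 - h1) - inter
    iou = inter / union
    # squared Euclidean distance between the center points
    rho2 = ((x1 + x2 - g1 - g2) ** 2 + (y1 + y2 - h1 - h2) ** 2) / 4.0
    # squared diagonal of the smallest enclosing rectangle
    c2 = (max(x2, g2) - min(x1, g1)) ** 2 + (max(y2, h2) - min(y1, h1)) ** 2
    return 1.0 - iou + rho2 / c2
```

For identical boxes the loss is 0; for disjoint boxes the IoU term contributes 1 and the center-distance term adds a further penalty, which is what gives DIoU a useful gradient even without overlap.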
7. The YOLO-based lightweight target detection method of claim 6, wherein: after step 5, the parameters of the final target detection network are quantized to 8 bits.
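The claim does not specify the quantization scheme; a common choice for 8-bit parameter quantization is symmetric per-tensor scaling, sketched below as an assumption rather than the patented method.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor 8-bit quantization: real weight w is
    # approximated as scale * q with integer q in [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float tensor from the int8 codes.
    return q.astype(np.float32) * scale
```

With this scheme the round-trip error per weight is bounded by half the scale, i.e. by max|w| / 254.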
CN202011164112.2A 2020-10-27 2020-10-27 YOLO-based lightweight target detection method Active CN112257794B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011164112.2A CN112257794B (en) 2020-10-27 2020-10-27 YOLO-based lightweight target detection method


Publications (2)

Publication Number Publication Date
CN112257794A CN112257794A (en) 2021-01-22
CN112257794B true CN112257794B (en) 2022-10-28

Family

ID=74261337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011164112.2A Active CN112257794B (en) 2020-10-27 2020-10-27 YOLO-based lightweight target detection method

Country Status (1)

Country Link
CN (1) CN112257794B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819771A (en) * 2021-01-27 2021-05-18 东北林业大学 Wood defect detection method based on improved YOLOv3 model
CN112836751A (en) * 2021-02-03 2021-05-25 歌尔股份有限公司 Target detection method and device
CN112950565A (en) * 2021-02-25 2021-06-11 山东英信计算机技术有限公司 Method and device for detecting and positioning water leakage of data center and data center
CN113052812B (en) * 2021-03-22 2022-06-24 山西三友和智慧信息技术股份有限公司 AmoebaNet-based MRI prostate cancer detection method
CN113112456B (en) * 2021-03-25 2022-05-13 湖南工业大学 Thick food filling finished product defect detection method based on target detection algorithm
CN113011365A (en) * 2021-03-31 2021-06-22 中国科学院光电技术研究所 Target detection method combined with lightweight network
CN112926605B (en) * 2021-04-01 2022-07-08 天津商业大学 Multi-stage strawberry fruit rapid detection method in natural scene
CN113065558B (en) * 2021-04-21 2024-03-22 浙江工业大学 Lightweight small target detection method combined with attention mechanism
CN113421222B (en) * 2021-05-21 2023-06-23 西安科技大学 Lightweight coal gangue target detection method
CN113435269A (en) * 2021-06-10 2021-09-24 华东师范大学 Improved water surface floating object detection and identification method and system based on YOLOv3
CN113536963B (en) * 2021-06-25 2023-08-15 西安电子科技大学 SAR image airplane target detection method based on lightweight YOLO network
CN113688759A (en) * 2021-08-31 2021-11-23 重庆科技学院 Safety helmet identification method based on deep learning
CN114418064B (en) * 2021-12-27 2023-04-18 西安天和防务技术股份有限公司 Target detection method, terminal equipment and storage medium
CN114898171B (en) * 2022-04-07 2023-09-22 中国科学院光电技术研究所 Real-time target detection method suitable for embedded platform
CN115880564A (en) * 2022-11-29 2023-03-31 沈阳新松机器人自动化股份有限公司 Lightweight target detection method
CN115661614B (en) * 2022-12-09 2024-05-24 江苏稻源科技集团有限公司 Target detection method based on lightweight YOLO v1

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522966B (en) * 2018-11-28 2022-09-27 中山大学 Target detection method based on dense connection convolutional neural network
CN110796037B (en) * 2019-10-15 2022-03-15 武汉大学 Satellite-borne optical remote sensing image ship target detection method based on lightweight receptive field pyramid
CN110941995A (en) * 2019-11-01 2020-03-31 中山大学 Real-time target detection and semantic segmentation multi-task learning method based on lightweight network
CN111507271B (en) * 2020-04-20 2021-01-12 北京理工大学 Airborne photoelectric video target intelligent detection and identification method

Also Published As

Publication number Publication date
CN112257794A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN112257794B (en) YOLO-based lightweight target detection method
CN111626330B (en) Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation
CN110188685B (en) Target counting method and system based on double-attention multi-scale cascade network
CN110175671B (en) Neural network construction method, image processing method and device
CN110414377B (en) Remote sensing image scene classification method based on scale attention network
CN113435590B (en) Edge calculation-oriented searching method for heavy parameter neural network architecture
CN111126472A (en) Improved target detection method based on SSD
CN110046550B (en) Pedestrian attribute identification system and method based on multilayer feature learning
CN112990116B (en) Behavior recognition device and method based on multi-attention mechanism fusion and storage medium
CN110309835B (en) Image local feature extraction method and device
CN111723915B (en) Target detection method based on deep convolutional neural network
CN114255361A (en) Neural network model training method, image processing method and device
CN110008853B (en) Pedestrian detection network and model training method, detection method, medium and equipment
CN113326930A (en) Data processing method, neural network training method, related device and equipment
CN111860683B (en) Target detection method based on feature fusion
CN115116054B (en) Multi-scale lightweight network-based pest and disease damage identification method
CN111079767B (en) Neural network model for segmenting image and image segmentation method thereof
CN117037215B (en) Human body posture estimation model training method, estimation device and electronic equipment
CN112308825B (en) SqueezeNet-based crop leaf disease identification method
CN115035418A (en) Remote sensing image semantic segmentation method and system based on improved deep LabV3+ network
CN112561028A (en) Method for training neural network model, and method and device for data processing
CN114821058A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CN115601692A (en) Data processing method, training method and device of neural network model
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN112597919A (en) Real-time medicine box detection method based on YOLOv3 pruning network and embedded development board

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant