CN115880564A - Lightweight target detection method


Info

Publication number: CN115880564A
Application number: CN202211511383.XA
Authority: CN (China)
Prior art keywords: layer, target, network, frame, prediction
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 郑宇, 李邦宇
Current Assignee: Shenyang Siasun Robot and Automation Co Ltd
Original Assignee: Shenyang Siasun Robot and Automation Co Ltd
Priority date / Filing date: 2022-11-29
Publication date: 2023-03-31
Application filed by Shenyang Siasun Robot and Automation Co Ltd

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a lightweight target detection method based on an improved YOLOv3 target detection algorithm. It adopts the depthwise separable convolution (DW) idea from GhostNet, replacing Conv2d with the Ghost Module, which splits multi-channel data into several groups processed in parallel, thereby reducing the number of network parameters. A feature output layer is added to the three feature output layers of the original network, and new prior-box sizes are computed with the k-means algorithm so that smaller target objects can be recognized. The original IoU calculation is further replaced by DIoU, whose added penalty term gives the generated preselection boxes faster regression and better results during network training. While preserving the feature-extraction accuracy of the YOLOv3 network, the method reduces the computational cost of the model and improves detection speed.

Description

Lightweight target detection method
Technical Field
The invention belongs to the field of intelligent machine learning, and particularly relates to an improved lightweight target detection algorithm based on YOLOv3.
Background
Object detection is a very simple task for humans: even infants a few months old can recognize some common objects. For machines, however, learning object detection was initially a difficult task. Object detection requires identifying and localizing all instances of a given object in the field of view; together with related tasks such as classification, segmentation, motion estimation, and scene understanding, it constitutes a fundamental problem in computer vision. Target detection is currently a popular direction in computer vision and digital image processing, widely applied in robot navigation, intelligent video surveillance, industrial inspection, aerospace, and many other fields; by reducing the consumption of human capital through computer vision, it has important practical significance. Target detection has therefore become a research hotspot in both theory and application in recent years. It is an important branch of image processing and computer vision and a core part of intelligent monitoring systems; it is also a basic algorithm in the field of general identity recognition, playing a vital role in subsequent tasks such as face recognition, gait recognition, crowd counting, and instance segmentation.
For an inspection robot, target recognition must be performed on humans and large animals in its working environment. Existing target detection methods comprise traditional detection methods and deep-learning-based detection methods. A traditional target detection algorithm can be summarized in the following steps: first, traverse the whole image with a sliding window to generate a number of candidate boxes; second, extract features from each candidate box; finally, classify the extracted features with classifiers such as a Support Vector Machine (SVM) to obtain the result. Owing to the lack of effective image representations at the time, people could only design complex hand-crafted feature representations and exploit limited computational resources through various acceleration tricks. Manual feature extraction has inherent limitations, and the performance of hand-crafted features tends to saturate; traditional target detection algorithms are therefore only suitable for scenes with obvious features and simple backgrounds. Since 2012, the wide application of convolutional neural networks has opened new avenues for target detection. With the emergence of the R-CNN algorithm in 2014, target detection developed at an unprecedented speed and entered the deep learning era. In practical applications, both the background and the targets to be detected are complex and variable, and detection is hard to accomplish with generic abstract features; deep learning trains models on huge and rich data, giving the algorithm stronger generalization ability and making it easier to apply to real scenes. In the present application, the robot's working environment is special: shrubs and bushes strongly interfere with target recognition and can cause inaccurate recognition or loss of target objects. Moreover, the robot's processor must handle many kinds of data, so the amount of computation must be compressed as much as possible, while the original YOLOv3 network has too many parameters and tends to miss small objects. The original network is therefore optimized to reduce the number of parameters and increase the accuracy of object recognition in this environment.
Disclosure of Invention
The invention discloses a lightweight target detection method based on an improved YOLOv3 target detection algorithm. It adopts the depthwise separable convolution (DW) idea from GhostNet, replacing Conv2d with the Ghost Module, which splits multi-channel data into several groups processed in parallel, thereby reducing the number of network parameters. A feature output layer is added to the three feature output layers of the original network, and new anchor sizes are computed with the k-means algorithm so that smaller target objects can be recognized. The original IoU calculation is further replaced by DIoU, whose added penalty term gives the generated preselection boxes faster regression and better results during network training. While preserving the feature-extraction accuracy of the YOLOv3 network, the method reduces the computational cost of the model and improves detection speed.
YOLOv3 detection is carried out in two steps: determining the position of the detected object, and classifying it. The detailed procedure of the modified YOLOv3 is as follows. First, YOLOv3 extracts multiple feature layers for target detection, four layers in total, and fuses them with an FPN. Second, the prediction results are decoded: the coordinates and the width and height of the finally displayed bounding box are computed, giving its position. Third, the predicted bounding-box scores are screened by sorting and non-maximum suppression: for each class, the boxes whose scores exceed a threshold are taken out and sorted, and non-maximum suppression is performed using the positions and scores of the boxes. Finally, the bounding box with the highest probability, i.e. the box finally displayed, is obtained.
The technical scheme adopted by the invention to achieve this purpose is as follows: a lightweight target detection method, comprising the steps of:
preprocessing pictures acquired by a robot;
the preprocessed picture enters the backbone layer of a YOLOv3 network for feature extraction, and the class of the target and the position and size of a prediction box are obtained through the neck layer and the prediction layer in sequence. The class is a biological category, including human and animal.
The preprocessing of the acquired picture is specifically: the picture is resized to 416 × 416, the image is centered, and the blank portions are filled with gray.
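For illustration only (not part of the claimed method), a minimal Python/Pillow sketch of this letterboxing step; the function name letterbox and the gray value 128 are illustrative assumptions:

```python
from PIL import Image

def letterbox(image, target_size=416):
    """Resize to target_size x target_size while keeping the aspect ratio:
    the scaled image is centered and the blank borders are filled with gray."""
    w, h = image.size
    scale = min(target_size / w, target_size / h)
    new_w, new_h = int(w * scale), int(h * scale)
    resized = image.resize((new_w, new_h), Image.BICUBIC)
    canvas = Image.new("RGB", (target_size, target_size), (128, 128, 128))
    # paste the resized image at the center of the gray canvas
    canvas.paste(resized, ((target_size - new_w) // 2, (target_size - new_h) // 2))
    return canvas
```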
The backbone layer is a Ghost_Darknet-53 network.
In the backbone layer, the Ghost Module replaces the Conv2d convolution layers of the original Darknet-53 network.
In the Ghost Module, a 1 × 1 convolutional layer first compresses the number of channels of the input picture; a 3 × 3 depthwise separable convolution then obtains more feature maps; the different feature maps are stacked together to form a new output, which is added to the input, and the result serves as the input of the next network layer.
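A minimal PyTorch sketch of the stacking part of the Ghost Module described above; class and channel choices are illustrative assumptions, and the residual addition to the input is shown in the backbone-stage sketch in the detailed description:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """A 1x1 convolution compresses the channels, a 3x3 depthwise convolution
    generates additional ("ghost") feature maps, and both sets are stacked."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        hidden = out_channels // 2  # channel compression by the 1x1 conv
        self.primary = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(  # depthwise: groups == channels
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True))

    def forward(self, x):
        x1 = self.primary(x)               # compressed feature maps
        x2 = self.cheap(x1)                # extra maps from the depthwise conv
        return torch.cat([x1, x2], dim=1)  # stacked output
```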
The feature extraction in the backbone layer of the YOLOv3 network comprises the following steps:
first, a convolution layer, BN normalization, and an activation operation are applied, followed by five groups of downsampling convolutions, each group consisting of a Ghost Module and a residual structure; the groups repeat feature extraction 1, 2, 8, 8, and 4 times respectively;
the outputs of the last four groups are taken as the extracted features and serve as the input of the neck layer.
A fourth output layer is added after the neck layer and the prediction layer, and FPN feature fusion is used, forming the improved YOLOv3, specifically as follows:
the neck layer and the prediction layer are collectively called the output layers; the original output sizes are 13 × 13, 26 × 26, and 52 × 52 in sequence, used respectively to detect large, medium, and small targets, where large, medium, and small are relative sizes;
the fourth output layer has a size of 104 × 104 and fuses the 52 × 52 features to detect targets smaller than those handled by the 52 × 52 layer.
The dimensions of the prior boxes are recalculated from the label files of the data set using a k-means clustering algorithm, comprising the steps of: clustering the sample annotation boxes in the training sample image set and determining the target YOLOv3 preselection boxes corresponding to the training sample image set.
The intersection-over-union between the preselection box and the ground-truth box is changed to DIoU, specifically as follows:
to the original intersection-over-union IoU, the ratio of the squared Euclidean distance between the centers of the prediction box and the target box to the squared length of the diagonal of the smallest rectangle covering both boxes is introduced as a penalty term, so that the prediction box moves toward the target box even when the boxes do not overlap.
A lightweight target detection system, comprising:
the preprocessing module is used for preprocessing the acquired picture;
and the classification and detection module is used for enabling the preprocessed pictures to enter a backbone layer of the YOLOv3 network for feature extraction, and obtaining the category of the target and the position and the size of a prediction frame through a neck layer and a prediction layer in sequence.
The invention has the following beneficial effects and advantages:
1. The invention provides a lightweight target detection method based on an improved YOLOv3, which reduces the number of model parameters while keeping the extracted features effective, and increases the model's recognition accuracy for small objects.
2. In model training, the intersection-over-union calculation is changed to DIoU with an added penalty term, so that the generated preselection boxes regress faster and with better results.
3. Drawing on the depthwise separable convolution idea of the GhostNet network, the original Darknet-53 backbone network is modified, greatly reducing the parameter computation of the whole network.
4. A fourth feature output layer is added, fusing the features of the third layer to recognize smaller target objects.
5. The dimensions of the prior boxes are recalculated using the k-means clustering algorithm.
Drawings
FIG. 1 is a flow chart of the system of the present invention;
FIG. 2 is a diagram of the network architecture.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
For an inspection robot working among shrubs and grasses, the environment strongly interferes with the target recognition of living objects and can cause inaccurate recognition or loss of target objects; the robot's processor must handle many kinds of data and the amount of computation must be compressed as much as possible, so the original network needs to be modified.
A lightweight target detection method is an implementation of an improved target detection algorithm based on YOLOv3, comprising the following: adopting the idea of depthwise separable convolution, replacing Conv2d with the Ghost Module and changing the Darknet-53 backbone network to reduce the computation of network parameters; changing the intersection-over-union between the preselection box and the ground-truth box to DIoU; adding a fourth feature output layer to detect smaller target objects; and recalculating the dimensions of the prior boxes with a k-means clustering algorithm.
The implementation of the improved YOLOv3-based target detection algorithm comprises the following steps:
calculating the prior-box sizes for the four feature output layers using a k-means clustering algorithm;
adding a fourth output layer to the original three feature output layers and fusing them with FPN features;
adding a penalty term to the intersection-over-union calculation;
compressing the model parameters with depthwise separable convolution while keeping feature extraction effective.
Recalculating the dimensions of the prior boxes from the label files of the data set with the k-means clustering algorithm specifically comprises: clustering the sample annotation boxes in the training sample image set and determining the target YOLOv3 preselection boxes corresponding to the training sample image set.
Adding the fourth feature output layer specifically comprises: the original YOLOv3 network has three feature output layers, of sizes 13 × 13, 26 × 26, and 52 × 52 respectively. The 13 × 13 output detects large targets, the 26 × 26 output medium targets, and the 52 × 52 output small targets; the whole forms a feature pyramid (FPN). A 104 × 104 feature layer is added and fused with the 52 × 52 features to detect even smaller targets.
The feature pyramid (FPN) as a whole is specifically as follows: although deep layers of the network respond to semantic features, their feature maps are too small and carry little geometric information, which is unfavorable for target detection; shallow layers contain more geometric information but few semantic features, which is unfavorable for image classification. This problem is more pronounced in small-target detection, so deep and shallow features are combined to meet the requirements of both target detection and image classification. For example, the features extracted at the 13 × 13 layer are upsampled to change their size and merged into the 26 × 26 layer.
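A minimal PyTorch sketch of this upsample-and-concatenate merge; module and channel choices are illustrative assumptions, picked so the example reproduces the (26, 26, 768) shape mentioned below:

```python
import torch
import torch.nn as nn

class UpsampleMerge(nn.Module):
    """Reduce the deep feature map's channels with a 1x1 conv, upsample it to
    the shallower scale, and concatenate it with the shallow feature map."""
    def __init__(self, deep_channels, reduced_channels):
        super().__init__()
        self.reduce = nn.Conv2d(deep_channels, reduced_channels, kernel_size=1)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, deep, shallow):
        x = self.upsample(self.reduce(deep))   # e.g. 13x13 -> 26x26
        return torch.cat([x, shallow], dim=1)  # concat along channels

merge = UpsampleMerge(deep_channels=1024, reduced_channels=256)
fused = merge(torch.randn(1, 1024, 13, 13), torch.randn(1, 512, 26, 26))
print(fused.shape)  # torch.Size([1, 768, 26, 26])
```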
Changing the intersection-over-union between the preselection box and the ground-truth box to DIoU specifically comprises: IoU is a very important quantity in computing the detection performance metric mAP and the loss value. To the original IoU, the ratio of the squared Euclidean distance between the centers of the prediction box and the target box to the squared length of the diagonal of the smallest rectangle covering both boxes is introduced as a penalty term, so that the prediction box can move toward the target box even when the boxes do not overlap.
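A minimal PyTorch sketch of this DIoU computation; the function name and the (x1, y1, x2, y2) box format are illustrative assumptions:

```python
import torch

def diou(pred, target, eps=1e-7):
    """DIoU = IoU - rho^2 / c^2 for (N, 4) boxes in (x1, y1, x2, y2) format,
    where rho is the distance between box centers and c is the diagonal of
    the smallest rectangle covering both boxes."""
    lt = torch.max(pred[:, :2], target[:, :2])   # intersection top-left
    rb = torch.min(pred[:, 2:], target[:, 2:])   # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    center_p = (pred[:, :2] + pred[:, 2:]) / 2
    center_t = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((center_p - center_t) ** 2).sum(dim=1)   # squared center distance
    enc_lt = torch.min(pred[:, :2], target[:, :2])   # enclosing rectangle
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1)         # squared diagonal length
    # the penalty term keeps pulling non-overlapping boxes together
    return iou - rho2 / (c2 + eps)
```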
As shown in FIG. 1, the flow of the system of the invention is as follows: pictures are preprocessed before entering the network. The picture is resized to 416 × 416, the image is centered, and the blank portions are filled with gray. The preprocessed picture enters the backbone layer of the network for feature extraction; the extracted features are computed and processed by the neck layer, and the prediction layer determines the class of the target and the position and size of the prediction box from the output of the neck layer.
Because an output layer (104 × 104 × 128) is added to the original three YOLOv3 feature output layers to detect smaller target objects, the prior-box sizes must be recalculated before the network is used. The method uses a k-means clustering algorithm to read the xml annotation files of the VOC data set and computes 12 prior-box sizes.
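A minimal NumPy sketch of this clustering; parsing the VOC xml files into (width, height) pairs is omitted, all names are illustrative, and following common YOLO practice the distance metric is assumed to be 1 - IoU (the patent only specifies k-means over the annotation boxes):

```python
import numpy as np

def kmeans_anchors(wh, k=12, iters=100, seed=0):
    """Cluster ground-truth (width, height) pairs into k prior-box sizes."""
    boxes = np.asarray(wh, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # IoU between every box and every center, assuming aligned corners
        inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0])
                 * np.minimum(boxes[:, None, 1], centers[None, :, 1]))
        union = (boxes[:, 0] * boxes[:, 1])[:, None] \
            + centers[:, 0] * centers[:, 1] - inter
        assign = np.argmax(inter / union, axis=1)  # highest IoU = closest
        new_centers = np.array(
            [boxes[assign == i].mean(axis=0) if np.any(assign == i)
             else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers.prod(axis=1))]  # small to large
```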
The backbone layer is a Ghost_Darknet-53 network, which replaces the ordinary convolution operations of the original Darknet-53 network with the Ghost Module. It mainly consists of a Ghost computing unit and a 3 × 3 convolution kernel with stride 2 (hereinafter conv2d). The Ghost computing unit consists of two groups of a Conv2d_DW convolution layer, a BN normalization layer, and an activation function layer; the processing inside the computing unit is as follows: first, features are extracted through the Conv2d_DW convolutional layer; then the BN normalization layer processes them, reducing the parameter values and thereby the computation of the network; finally, the result of the activation function becomes the input of the next layer.
After preprocessing, the image enters the backbone network. A 3 × 3 convolution kernel with stride 1 first changes the number of channels of the image to 32, and conv2d then changes the number of channels to 64. The image next enters a computing unit: a 1 × 1 convolution halves the number of channels of the input, and the result, processed by the normalization layer and the activation function layer, gives output x1; a 3 × 3 convolution kernel applies layer-by-layer (depthwise) convolution to x1, and the result, processed by the normalization layer and the activation function layer, gives output x2; stacking x1 and x2 gives x3; keeping the original Darknet-53 residual structure, x3 is added to the input to give the final output out, and conv2d then changes the number of channels to 128. In the backbone network the computing unit repeats feature extraction in the sequence 1, 2, 8, 8, 4, with conv2d processing performed after each of the first four repetition groups; the numbers of channels output by the last four groups are 128, 256, 512, and 1024 respectively, and these four outputs are exported as the input of the following neck layer.
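A minimal PyTorch sketch of one such computing unit and one downsampling stage, combining the Ghost Module sketch above with the retained residual structure; all names and layer choices are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GhostResidualBlock(nn.Module):
    """One computing unit: a 1x1 conv halves the channels (x1), a 3x3
    depthwise conv gives x2, stacking gives x3, and the Darknet-53 residual
    adds x3 back to the block input."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.primary = nn.Sequential(
            nn.Conv2d(channels, half, 1, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))

    def forward(self, x):
        x1 = self.primary(x)             # halved channels
        x2 = self.cheap(x1)              # layer-by-layer (depthwise) conv
        x3 = torch.cat([x1, x2], dim=1)  # stack back to `channels`
        return x + x3                    # residual add

def make_stage(in_ch, out_ch, repeats):
    """A stride-2 3x3 conv (the conv2d above) followed by repeated blocks."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
              nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    layers += [GhostResidualBlock(out_ch) for _ in range(repeats)]
    return nn.Sequential(*layers)

# repetition counts 1, 2, 8, 8, 4; the last four stages output 128, 256,
# 512, and 1024 channels, which feed the neck layer
stages = [make_stage(i, o, n) for i, o, n in
          [(32, 64, 1), (64, 128, 2), (128, 256, 8),
           (256, 512, 8), (512, 1024, 4)]]
```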
The neck layer and the prediction layer are collectively referred to as the output layer. The input of the neck layer is the four feature layers of different scales output by the backbone layer, corresponding to the four output layers. The detailed steps are as follows:
1. The fourth feature layer (13, 13, 1024) undergoes 5 convolutions (feature processing and extraction). One part of the processed feature layer goes to convolution and upsampling; the other part serves as input to the prediction layer, which outputs the corresponding prediction result (13, 13, 75); two convolutions, Conv2d 3 × 3 and Conv2d 1 × 1, adjust the channels to the size required by the output.
2. The feature layer (26, 26, 256) obtained after convolution and upsampling is concatenated with the (26, 26, 512) feature layer of the Darknet-53 network, giving shape (26, 26, 768). After 5 convolutions, one part of the result goes to convolution and upsampling, and the other part serves as input to the prediction layer, which outputs the corresponding prediction result (26, 26, 75).
3. The feature layer from convolution and upsampling in step 2 is then concatenated (Concat) with the feature layer of shape (52, 52, 256), and 5 convolutions give the feature layer of shape (52, 52, 128); one part goes to convolution and upsampling, and the other part serves as input to the prediction layer, giving the (52, 52, 75) feature layer.
4. The feature layer from convolution and upsampling in step 3 is then concatenated (Concat) with the feature layer of shape (104, 104, 128); convolution gives a feature layer of shape (104, 104, 128), which is finally input to the prediction layer to give the (104, 104, 75) feature layer.
YOLOv3 implements this multi-scale feature fusion by upsampling: the two tensors joined by each Concat in FIG. 2 have the same scale (the three concatenations are at the 26 × 26, 52 × 52, and 104 × 104 scales respectively, and the convolution and upsampling operations guarantee that the concatenated tensors have the same scale).
After the four feature outputs are obtained, the prediction results are decoded; only one feature layer can be decoded at a time, so decoding loops 4 times. From the prediction results, the width and height adjustment parameters, the center adjustment parameters, the confidence, and the class probabilities of the prediction boxes are obtained; decoding computes the coordinates bx, by and the width and height bw, bh of the finally displayed bounding box, giving its position. The predicted bounding-box scores are then sorted and screened by non-maximum suppression: for each class, the boxes whose scores exceed a threshold are taken out with their scores and sorted, and non-maximum suppression is performed using the positions and scores of the boxes. Finally, the bounding box with the highest probability, i.e. the box finally displayed, is obtained.
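A minimal PyTorch sketch of the decoding and screening; the sigmoid/exponential decoding shown is the standard YOLOv3 scheme that the description follows, torchvision's nms is used for the suppression step, and all names are illustrative assumptions:

```python
import torch
from torchvision.ops import nms

def decode_layer(raw, anchors, stride):
    """Decode one feature layer of shape (N, H, W, A, 5 + num_classes) holding
    (tx, ty, tw, th, objectness, class scores); anchors are (pw, ph) pixels."""
    n, h, w, a, _ = raw.shape
    gy, gx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((gx, gy), dim=-1).view(1, h, w, 1, 2).float()
    anchor_wh = torch.tensor(anchors).view(1, 1, 1, a, 2).float()
    bxy = (torch.sigmoid(raw[..., 0:2]) + grid) * stride  # bx, by in pixels
    bwh = torch.exp(raw[..., 2:4]) * anchor_wh            # bw, bh in pixels
    scores = torch.sigmoid(raw[..., 4:5]) * torch.sigmoid(raw[..., 5:])
    return bxy, bwh, scores

def screen(boxes_xyxy, scores, score_thresh=0.5, iou_thresh=0.45):
    """Keep boxes of one class above the score threshold, then apply NMS;
    in the method above this is repeated for each class."""
    keep = scores > score_thresh
    boxes_xyxy, scores = boxes_xyxy[keep], scores[keep]
    return boxes_xyxy[nms(boxes_xyxy, scores, iou_thresh)]
```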

Claims (10)

1. A lightweight target detection method is characterized by comprising the following steps:
preprocessing a picture acquired by a robot;
the preprocessed picture enters the backbone layer of a YOLOv3 network for feature extraction, and the class of the target and the position and size of a prediction box are obtained through the neck layer and the prediction layer in sequence.
2. The lightweight target detection method according to claim 1, wherein the preprocessing of the acquired picture is specifically: the picture is resized to 416 × 416, the image is centered, and the blank portions are filled with gray.
3. The lightweight target detection method according to claim 1, wherein the backbone layer is a Ghost_Darknet-53 network.
4. The lightweight target detection method according to claim 3, wherein in the backbone layer the Ghost Module replaces the Conv2d convolutional layers of the original Darknet-53 network.
5. The lightweight target detection method according to claim 4, wherein in the Ghost Module a 1 × 1 convolutional layer first compresses the number of channels of the input picture; a 3 × 3 depthwise separable convolution then obtains more feature maps; the different feature maps are stacked together to form a new output, which is added to the input; the result serves as the input of the next network layer.
6. The lightweight target detection method according to claim 1, wherein the feature extraction in the backbone layer of the YOLOv3 network comprises the following steps:
first, a convolution layer, BN normalization, and an activation operation are applied, followed by five groups of downsampling convolutions, each group consisting of a Ghost Module and a residual structure; the groups repeat feature extraction 1, 2, 8, 8, and 4 times respectively;
the outputs of the last four groups are taken as the extracted features and serve as the input of the neck layer.
7. The lightweight target detection method according to claim 1, wherein a fourth output layer is added after the neck layer and the prediction layer, and FPN feature fusion is used to form the improved YOLOv3, specifically as follows:
the neck layer and the prediction layer are collectively called the output layers; the original output sizes are 13 × 13, 26 × 26, and 52 × 52 in sequence, used respectively to detect large, medium, and small targets;
the fourth output layer has a size of 104 × 104 and fuses the 52 × 52 features to detect targets smaller than those handled by the 52 × 52 layer.
8. The lightweight target detection method according to claim 1, wherein recalculating the dimensions of the prior boxes from the label files of the data set using a k-means clustering algorithm comprises the following steps: clustering the sample annotation boxes in the training sample image set and determining the target YOLOv3 preselection boxes corresponding to the training sample image set.
9. The lightweight target detection method according to claim 1, wherein changing the intersection-over-union between the preselection box and the ground-truth box to DIoU specifically comprises:
to the original intersection-over-union IoU, the ratio of the squared Euclidean distance between the centers of the prediction box and the target box to the squared length of the diagonal of the smallest rectangle covering both boxes is introduced as a penalty term, so that the prediction box moves toward the target box even when the boxes do not overlap.
10. A lightweight target detection system, comprising:
the preprocessing module is used for preprocessing the pictures acquired by the robot;
and the classification and detection module is used for enabling the preprocessed pictures to enter a backbone layer of the YOLOv3 network for feature extraction, and obtaining the category of the target and the position and the size of a prediction frame through a neck layer and a prediction layer in sequence.
Priority Application (1)

Application number: CN202211511383.XA
Priority date / Filing date: 2022-11-29
Title: Lightweight target detection method
Status: Pending

Publication (1)

Publication number: CN115880564A
Publication date: 2023-03-31
Family ID: 85764636
Country status: CN (CN115880564A)

Patent Citations (4)

* Cited by examiner, † Cited by third party

• CN112257794A * (priority 2020-10-27, published 2021-01-22): YOLO-based lightweight target detection method
• CN113920400A * (priority 2021-10-14, published 2022-01-11): Metal surface defect detection method based on improved YOLOv3
• CN114067211A * (priority 2021-11-22, published 2022-02-18): Lightweight safety helmet detection method and system for mobile terminal
• CN115223219A * (priority 2022-06-17, published 2022-10-21): Goat face identification method based on improved YOLOv4


Legal Events

• PB01: Publication
• SE01: Entry into force of request for substantive examination
• RJ01: Rejection of invention patent application after publication (application publication date: 2023-03-31)