CN115171074A - Vehicle target identification method based on multi-scale yolo algorithm - Google Patents

Vehicle target identification method based on multi-scale yolo algorithm

Info

Publication number
CN115171074A
CN115171074A (application CN202210806937.2A)
Authority
CN
China
Prior art keywords
convolution
layer
scale
input
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210806937.2A
Other languages
Chinese (zh)
Inventor
王英立
史肖波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202210806937.2A
Publication of CN115171074A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08 Detecting or categorising vehicles
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A vehicle target recognition method based on a multi-scale yolo algorithm belongs to the field of target recognition methods. The real-time performance, accuracy, and robustness of existing target detection methods in complex environments need improvement. The vehicle target identification method based on the multi-scale yolo algorithm is realized by the following steps: preprocessing a data set; extracting features with the backbone network; performing feature fusion with the PANet; applying non-maximum suppression (NMS); and outputting a target calibration decision. In addition, because the class-imbalance problem of the sample data set affects classification precision, a multi-loss-function alternating training strategy is adopted: the cross-entropy loss function and the focal loss function are used alternately at different stages of network training to alleviate the sample-imbalance problem.

Description

Vehicle target identification method based on multi-scale yolo algorithm
Technical Field
The invention relates to a vehicle identification method, in particular to a vehicle target identification method based on a multi-scale yolo algorithm.
Background
Deep learning has good self-learning capability and strong expression and processing capability, and target detection is now generally accomplished with deep learning. Convolutional Neural Networks (CNNs) are among the most widely used deep-learning models; the series of models from R-CNN to Fast R-CNN and Cascade R-CNN has been continuously developed and improved, greatly increasing detection accuracy and detection efficiency. Deep-learning-based algorithm models have therefore become among the most widely used in the field of object detection.
Intelligent driver-assistance technology requires a large amount of image recognition and processing. Information collected by vehicles is usually input as video or images, and the on-board computer must recognize valuable targets and content from this visual information to support the vehicle's subsequent behavior decisions. Correctly and quickly identifying targets in images is therefore the foundation of intelligent driver-assistance technology. Although target detection technology has achieved some results, improving its real-time performance, accuracy, and robustness in more complex environments remains an important research area.
Disclosure of Invention
The invention aims to solve the problem that the real-time performance, accuracy, and robustness of existing target detection methods in complex environments need improvement, and provides a vehicle target identification method based on a multi-scale yolo algorithm.
A vehicle object identification method based on a multi-scale yolo algorithm, the method being implemented by the following steps:
preprocessing a data set;
extracting features of the backbone network;
performing feature fusion on the PANet;
a non-maximum suppression (NMS) step;
outputting a target calibration decision;
in addition, because the class-imbalance problem of the sample data set affects classification precision, a multi-loss-function alternating training strategy is adopted: the cross-entropy loss function and the focal loss function are used alternately at different stages of network training to alleviate the sample-imbalance problem.
Preferably, the step of extracting the features of the backbone network specifically includes:
(1) Designing a convolution algorithm;
the convolution operation means that each pixel in the output image is obtained as a weighted sum of the pixels in a small region at the corresponding position of the input image, the weight template being called a convolution kernel; certain features of the image are extracted by convolving the image with a convolution kernel;
consider the pixel array of an 8×8 two-dimensional gray-scale image and a 3×3 convolution kernel; if the convolution kernel moves with a step size of 1, i.e., one cell per move, then when the kernel reaches row i, column j, the input-image values and the corresponding kernel values are multiplied pairwise and summed to determine the output value at row i, column j of the output image; taking the pixel values shown in fig. 3 as an example, the value at row 2, column 3 of the output is 1×1 + 2×0 + 3×(−1) + 4×1 + 5×0 + 6×(−1) + 7×1 + 8×0 + 9×(−1) = −6; the number of layers of the convolution kernel equals that of the input data: if the input image is a three-channel color image, the kernel also has three layers; three-dimensional convolution is essentially the same as two-dimensional convolution, the output value being the weighted sum of the input values and the corresponding kernel values;
one convolution layer contains a plurality of convolution kernels, the number of layers of the output pixel array after the convolution layer is related to the number of the convolution kernels, and if the convolution layer contains n convolution kernels, the number of layers of the output pixel array after the convolution layer is also n;
(2) Designing an activation function;
after the convolutional layer, an activation function is used to make the data nonlinear; if input and output were always linearly related, then after several layers the overall input and output would still be linearly related, and no matter how many layers lie in between, they would be equivalent to a single layer, as shown in the formulas:
Y = aX + b; Z = cY + d
Z = c(aX + b) + d = (ac)X + (bc + d)
selecting a ReLU function as an activation function;
(3) Designing a pooling algorithm;
performing pooling operation on an input image after convolution operation;
if the output data is the maximum value of the input data within the corresponding pooling window, the operation is max pooling; if it is the average value, the operation is average pooling;
(4) Performing a spatial pyramid pooling structure;
introducing an SPP structure, and establishing a mapping relation between the candidate region and the input characteristic diagram;
(5) Designing a MobileNetv3 network structure;
a D×D×3 input feature map is convolved with 3×3 convolution kernels to output a D×D×N feature map; in the standard convolution process, each of N 3×3 convolution kernels is convolved with every channel of the input feature map, finally producing a new feature map with N channels;
the depthwise separable convolution instead convolves 3 separate 3×3 kernels with the channels of the input feature map, one kernel per channel, producing a feature map whose output channel count equals the input's, and then convolves this feature map with N 1×1 kernels to obtain a new N-channel feature map;
the parameter quantities used by the two convolutions are calculated separately as follows:
P1 = D × D × 3 × N (1)
P2 = D × D × 3 + D × D × 1 × N (2)
where P1 is the number of parameters used by the standard convolution, P2 is the number of parameters used by the depthwise separable convolution, D is the length and width of the input feature map, and N is the number of convolution kernels;
in standard convolution, the number of input channels is far smaller than the number of output channels; comparing formula (1) with formula (2) gives formula (3):
P2/P1 = (D × D × 3 + D × D × 1 × N) / (D × D × 3 × N) = 1/N + 1/3 (3)
it can be seen that P2/P1 is far less than 1: using the depthwise separable convolution greatly reduces the number of parameters while achieving an effect similar to the standard convolution;
(6) Designing RFBs structures;
Dilated (hole) convolution is introduced into the structure module. The RFBs structure first applies a 1×1 convolution to the feature map for channel transformation, and then applies multi-branch dilated-convolution processing to obtain multi-scale target information features. The multi-branch structure combines ordinary convolutional layers with dilated convolutional layers: in the ordinary convolutional layers, the 3×3 convolution kernel of the original RFB structure is replaced by parallel 1×3 and 3×1 kernels, and the 5×5 kernel is replaced by two series-connected 1×3 and 3×1 kernels; the dilated convolutional layers each consist of 3 kernels of size 3×3 with dilation rates of 1, 3 and 5, respectively, which prevents degradation of the convolutional layer caused by an overly large dilation rate. Finally, a concat operation is performed on the differently sized feature layers processed by the multi-branch dilated structure, and a new fused feature layer is output;
(7) A step of modified SPPNet;
Inspired by the backbone feature extraction network CSPDarknet in YOLOv4, a CSP structure is introduced into the SPP: before the multi-scale features are fused, the network is divided into two parts, and one part of the features is merged directly with the SPP-fused features through a shortcut connection.
Preferably, the step of performing feature fusion on the PANet specifically comprises:
The CSPDarknet-53 network contains a large number of convolution operations: many residual modules built from 3×3 and 1×1 convolutions are stacked, and a 3×3 convolutional layer with stride 2 is used to halve the size of the feature map. Residual denotes a residual module, and the n× to the right of the rectangular frame is the number of times the residual module is repeated. A 3×3 convolutional layer with stride 2 is used 5 times in total; each use halves the length and width of the feature map, replacing the downsampling function of a pooling layer in a conventional convolutional network. In the prediction stage, the feature layers F1, F2 and F3 obtained from the CSPDarknet-53 feature extraction network are input into the multi-scale prediction network. F3 is convolved to obtain coarse-scale feature layer 3 for detecting large-scale targets; feature layer 3 is upsampled, fused with F2, and then convolved to obtain mesoscale feature layer 2 for detecting medium-scale targets; feature layer 2 is upsampled, fused with F1, and convolved to obtain fine-scale feature layer 1 for detecting small-scale targets. This Feature Pyramid Network (FPN) structure gives the algorithm a good detection effect on targets of different sizes and scales. Finally, the prediction information of the three feature layers of different scales is combined, and the final detection result is obtained through a non-maximum suppression post-processing algorithm;
The resulting 9 prediction boxes of different sizes are (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198) and (373×326). On the 13×13 coarse-scale feature layer 3, (116×90), (156×198) and (373×326) are applied to detect larger targets; on the 26×26 mesoscale feature layer 2, the medium prediction boxes (30×61), (62×45) and (59×119) are applied to detect medium targets; on the 52×52 fine-scale feature layer 1, (10×13), (16×30) and (33×23) are applied to detect smaller targets.
The invention has the beneficial effects that:
Through the study of the YOLOv4 network, the basic experimental method and the experimental principles of the network are understood, and the basic structure of YOLOv4 and how each sub-network passes data are determined. The network is then modified according to the feasibility of the experimental environment and the problems of the network structure, such as missed detection of small targets, occlusion, detection against complex backgrounds, and multi-scale target detection. This improves the target detection accuracy of the network; in addition, the network is pruned, reducing the network parameters and increasing the training speed of the model.
1. Compared with the residual skip-connection scheme of CSPDarknet, the backbone feature extraction network of YOLOv4, the method adopts MobileNetv3 to replace the original backbone feature extraction network. The depthwise separable convolution of MobileNetv3 helps optimize the network structure and reduce network training parameters, so using MobileNetv3 as the backbone feature extraction network strengthens the effective use and transmission of features within the network, lets the network learn more feature information, makes the network structure lighter, and increases the speed of target detection.
2. Inspired by the backbone feature extraction network CSPDarknet in YOLOv4, the invention introduces a CSP structure into the SPP: before the multi-scale features are fused, the network is divided into two parts, and one part of the features is merged directly with the SPP-fused features through a shortcut connection, which reduces the amount of computation by 40%. The structure is a fusion of several shallow networks; shallow networks do not suffer from vanishing gradients during training, so network convergence is accelerated.
3. Low-level feature maps carry less semantic information but locate targets accurately, whereas high-level feature maps carry rich semantic information but locate targets coarsely. RFBNet is therefore introduced to fuse the network's original receptive fields multiple times, and dilated convolution is introduced, increasing the receptive field and fusing features of different sizes.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram illustrating a basic method of convolution operation according to the present invention;
FIG. 3 shows the number of output layers of a pixel array after passing through a convolutional layer according to the present invention;
FIG. 4 is a graph of two activation functions involved in the present invention;
FIG. 5 is a diagram of two pooling methods involved in the present invention;
FIG. 6 shows the number of output layers of a pixel array after passing through a pooling layer according to the present invention;
FIG. 7 is an SPP structure with a fixed-size (21-dimensional) output according to the present invention;
fig. 8 shows a network structure of YOLOv3 according to the present invention.
Detailed Description
Embodiment 1:
in the embodiment, as shown in fig. 1, the vehicle target identification method based on the multi-scale yolo algorithm is implemented by the following steps:
preprocessing a data set;
extracting features of the backbone network;
performing feature fusion on the PANet;
a non-maximum suppression (NMS) step;
outputting a target calibration decision;
in addition, because the class-imbalance problem of the sample data set affects classification precision, a multi-loss-function alternating training strategy is adopted: the cross-entropy loss function and the focal loss function are used alternately at different stages of network training to alleviate the sample-imbalance problem.
Sample imbalance mainly involves two aspects. On one hand, most samples in the data set are relatively easy to learn, the so-called "simple samples": they make up most of the data set, their features are distinct and well distributed, interference from the surrounding background is low, and the network can easily learn to recognize them. The remaining samples are "difficult samples": their features are weak because of small size or close proximity to other samples, or their main features are lost through occlusion or changes in light and shadow, so the network can hardly learn their discriminative features and the detection effect is poor. In addition, difficult samples appear infrequently in the data set and are unbalanced in proportion to the simple samples, so they receive insufficient training, which further aggravates the problems of difficult learning and inaccurate detection.
The invention does not use the focal loss function to completely replace the cross-entropy loss function when training the network, because actual training and testing on the XDUAV data set showed that using only the focal loss function greatly improves certain rare categories (such as buses and tank trucks) but has little effect on categories that are rare, small, and weak in their own features (such as bicycles and motorcycles). Our analysis is that the features of these samples are very limited; although the focal loss function increases their training weight, the lack of data still prevents the network from learning good feature expressions for these classes. Therefore, the cross-entropy loss function is used in the early stage of training, with the goal of learning the overall feature distribution of the samples. For classes with little data, such as bicycles and motorcycles, the features are similar, and the cross-entropy loss function lets the network learn these similar classes as a whole. The focal loss function is then substituted, and the network is trained to learn the discriminative features among similar-class samples so as to distinguish them better.
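A minimal sketch of this alternating strategy follows (PyTorch-style; the switch epoch and the focal-loss hyperparameters gamma and alpha are assumptions, not values from the patent):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Standard binary focal loss; down-weights easy samples so the
    # rare/difficult ones dominate the gradient. targets are floats in {0, 1}.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # probability assigned to the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

def classification_loss(logits, targets, epoch, switch_epoch=50):
    # Early epochs: plain cross entropy to learn the overall feature
    # distribution; later epochs: focal loss to sharpen discrimination
    # among similar, under-represented classes.
    if epoch < switch_epoch:
        return F.binary_cross_entropy_with_logits(logits, targets)
    return focal_loss(logits, targets)
```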
Embodiment 2:
Different from Embodiment 1, in the vehicle target identification method based on the multi-scale yolo algorithm of this embodiment, the step of extracting the features of the backbone network specifically includes:
(1) Designing a convolution algorithm;
The convolution operation means that each pixel in the output image is obtained as a weighted sum of the pixels in a small region at the corresponding position of the input image, the weight template being called a convolution kernel. Certain features of the image are extracted by convolving the image with a convolution kernel; the basic method of the convolution operation is shown in fig. 2.
Fig. 2 shows the pixel array of an 8×8 two-dimensional gray-scale image and a 3×3 convolution kernel. If the convolution kernel moves with a step size of 1, i.e., one cell per move, then when the kernel reaches row i, column j, the input-image values and the corresponding kernel values are multiplied pairwise and summed to determine the output value at row i, column j of the output image. Taking the pixel values shown in fig. 3 as an example, the value at row 2, column 3 of the output is 1×1 + 2×0 + 3×(−1) + 4×1 + 5×0 + 6×(−1) + 7×1 + 8×0 + 9×(−1) = −6. The number of layers of the convolution kernel equals that of the input data: if the input image is a three-channel color image, the kernel also has three layers. Three-dimensional convolution is essentially the same as two-dimensional convolution, the output value being the weighted sum of the input values and the corresponding kernel values. The step size and whether the input data is padded are set when the network is designed;
one convolution layer contains a plurality of convolution kernels, the number of layers of the output pixel array after the convolution layer is related to the number of convolution kernels, and if the convolution layer contains n convolution kernels, the number of layers of the output pixel array after the convolution layer is also n, as shown in fig. 3.
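The weighted-sum convolution just described can be sketched in a few lines of NumPy. The image contents and the vertical-edge kernel below are illustrative assumptions chosen so that the first output value reproduces the −6 of the worked example:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    # Slide the kernel over the image (no padding); each output pixel
    # is the sum of elementwise products over the covered patch.
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.zeros((8, 8))
image[0:3, 0:3] = np.arange(1, 10).reshape(3, 3)  # patch holding 1..9
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                   # assumed vertical-edge kernel
print(conv2d(image, kernel)[0, 0])                # -6.0
```

A layer with n such kernels simply stacks n of these output maps, giving the n-layer output array described above.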
(2) Designing an activation function;
After the convolutional layer, an activation function is used to make the data nonlinear, so that the input–output relationship is nonlinear and can characterize more complex variations of the input. If input and output were always linearly related, then after several layers the overall input and output would still be linearly related; no matter how many layers lie in between, they would be equivalent to a single layer, as shown in the formulas:
Y = aX + b; Z = cY + d
Z = c(aX + b) + d = (ac)X + (bc + d)
selecting a ReLU function as an activation function;
Commonly used activation functions include the ReLU function and the Mish function; their graphs are shown in fig. 4. Because the output of the Mish function changes little when the input is very large or very small, the ReLU function has greater advantages in avoiding vanishing gradients and improving training speed;
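For reference, minimal NumPy definitions of the two activation functions compared in fig. 4 (Mish is written in its published form x·tanh(softplus(x))):

```python
import numpy as np

def relu(x):
    # max(0, x): zero for negative inputs, identity for positive ones
    return np.maximum(0.0, x)

def mish(x):
    # x * tanh(softplus(x)), using a numerically stable softplus
    return x * np.tanh(np.logaddexp(0.0, x))
```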
(3) Designing a pooling algorithm;
After the input image undergoes the convolution operation, each pixel in the map may influence multiple output points, which can cause information redundancy and reduce algorithm performance, so a pooling operation is applied. The most common pooling algorithms are average pooling and max pooling, as shown in fig. 5; max pooling is the most commonly used.
If the output data is the maximum value of the input data within the corresponding pooling window, the operation is max pooling; if it is the average value, the operation is average pooling.
The pooling algorithm is similar to the convolution algorithm, but whereas the number of layers of a convolution kernel must match that of the input data, the pooling window is a two-dimensional array that moves row by row, column by column, and layer by layer; the number of layers of the output data after the pooling layer is therefore the same as that of the input data, as shown in fig. 6.
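A sketch of the two pooling modes on a single two-dimensional layer follows; the window size and stride of 2 are illustrative assumptions. Applying it layer by layer preserves the number of layers, as stated above:

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    # Move the pooling window over one 2-D layer; take the max or the
    # mean of the covered values depending on the mode.
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    reduce = np.max if mode == "max" else np.mean
    for i in range(oh):
        for j in range(ow):
            out[i, j] = reduce(x[i * stride:i * stride + size,
                                 j * stride:j * stride + size])
    return out
```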
(4) A step of performing a spatial pyramid pooling structure (SPPNet);
The SPP structure is generally located between the convolutional layer and the fully-connected layer and is used to generate a fixed-size output from an input of any size. The process is as follows: an input feature map of any size is pooled in parallel with 1×1, 2×2 and 4×4 grids of pooling windows to obtain 1 + 4 + 16 = 21 picture blocks, and then the maximum value (max pooling) or the average value (average pooling) of each block is computed, yielding a fixed 21-dimensional feature vector that can be fed directly to the fully-connected layer. Features are extracted from receptive fields of different sizes by adjusting the size and number of grid cells, and the outputs of the three pooling specifications are concatenated to obtain a fused feature vector. The ability to produce a fixed-size output from an input of any size is the most important characteristic of the SPP structure. Fig. 7 shows the SPP structure.
Taking the classic two-stage target detection algorithm R-CNN as an example: in the region-proposal stage the algorithm generates a large number of candidate regions (about 2000), each candidate region is cropped or warped to a fixed-size input image, and the CNN model then processes it to obtain a fixed-size feature vector that can be fed into the fully-connected layers for classification and regression. Every candidate region must be sent through the CNN model to produce its fixed-size feature vector, and this large number of candidate regions inevitably makes the algorithm inefficient. After the SPP structure is introduced to optimize R-CNN, the input image is sent through the CNN model once to obtain the overall input feature map; a mapping is then established between the candidate regions and the input feature map to obtain the feature vector of every candidate region (without passing each region through the CNN model), and classification and regression proceed as before. With the SPP structure, only the mapping between the candidate regions and the input feature map needs to be established and the forward computation of the CNN model is not repeated; the detection speed is 24–120 times that of R-CNN, with higher detection accuracy;
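The fixed 21-dimensional output can be sketched with adaptive pooling. PyTorch is used here purely for illustration (the experiments in this application use TensorFlow):

```python
import torch
import torch.nn.functional as F

def spp(feature_map, grids=(1, 2, 4)):
    # Pool the map into 1x1, 2x2 and 4x4 grids and flatten, giving
    # (1 + 4 + 16) = 21 values per channel regardless of input size.
    n = feature_map.shape[0]
    outs = [F.adaptive_max_pool2d(feature_map, g).reshape(n, -1)
            for g in grids]
    return torch.cat(outs, dim=1)

x = torch.randn(1, 1, 13, 17)  # any spatial size works
print(spp(x).shape)            # torch.Size([1, 21])
```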
(5) Designing a MobileNetv3 network structure;
The core idea of MobileNet is to decompose a complete convolution operation into two steps: depthwise convolution and pointwise convolution. Efficient neural networks are mainly obtained by: 1. reducing the number of parameters; 2. quantizing the parameters to reduce the memory occupied by each parameter. Current research falls into two directions: one compresses a trained complex model into a small model; the other directly designs and trains small models (MobileNet belongs to this category).
A D×D×3 input feature map is convolved with 3×3 convolution kernels to output a D×D×N feature map. In the standard convolution process, each of N 3×3 convolution kernels is convolved with every channel of the input feature map, finally producing a new feature map with N channels.
The depthwise separable convolution instead convolves 3 separate 3×3 kernels with the channels of the input feature map, one kernel per channel, producing a feature map whose output channel count equals the input's; N 1×1 kernels are then convolved with this feature map to obtain a new N-channel feature map.
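The two-step decomposition reads naturally as a small PyTorch module. This is a sketch of the depthwise-separable primitive only; a full MobileNetv3 block adds expansion, squeeze-and-excitation, and activation layers:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # groups=c_in gives one independent 3x3 kernel per input channel
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        # 1x1 pointwise convolution mixes the channels up to c_out
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```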
The parameter quantities used by the two convolutions are calculated separately as follows:
P1 = D × D × 3 × N (1)
P2 = D × D × 3 + D × D × 1 × N (2)
where P1 is the number of parameters used by the standard convolution, P2 is the number of parameters used by the depthwise separable convolution, D is the length and width of the input feature map, and N is the number of convolution kernels.
In standard convolution, the number of input channels is far smaller than the number of output channels; comparing formula (1) with formula (2) gives formula (3):
P2/P1 = (D × D × 3 + D × D × 1 × N) / (D × D × 3 × N) = 1/N + 1/3 (3)
It can be seen that P2/P1 is far less than 1: using the depthwise separable convolution greatly reduces the number of parameters while achieving an effect similar to the standard convolution.
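Equations (1)–(3) can be checked numerically; the sketch below evaluates the two counts exactly as written above and the ratio 1/N + 1/3 they imply:

```python
def conv_param_counts(D, N):
    # Eq. (1): standard convolution; Eq. (2): depthwise separable.
    p1 = D * D * 3 * N
    p2 = D * D * 3 + D * D * 1 * N
    return p1, p2, p2 / p1          # Eq. (3): equals 1/N + 1/3

print(conv_param_counts(3, 64))     # (1728, 603, ~0.349): well below 1
```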
(6) Designing RFBs structures;
To capture multi-scale feature information of the target, an RFBs structure module is connected after the improved backbone network. Dilated (hole) convolution is introduced into this structure module, increasing the receptive field and fusing features of different sizes.
The RFBs structure first applies a 1×1 convolution to the feature map for channel transformation, and then applies multi-branch dilated-convolution processing to obtain multi-scale target information features. The multi-branch structure combines ordinary convolutional layers with dilated convolutional layers: in the ordinary convolutional layers, the 3×3 convolution kernel of the original RFB structure is replaced by parallel 1×3 and 3×1 kernels, and the 5×5 kernel is replaced by two series-connected 1×3 and 3×1 kernels, which effectively reduces the computation of the network and keeps the whole network lightweight. The dilated convolutional layers each consist of 3 kernels of size 3×3 with dilation rates of 1, 3 and 5, respectively, which prevents degradation of the convolutional layer caused by an overly large dilation rate. Finally, a concat operation is performed on the differently sized feature layers processed by the multi-branch dilated structure, and a new fused feature layer is output. To retain the original information of the input feature map, the new fused feature layer is summed with a large residual edge formed from the original feature map through a 1×1 convolution channel-transformation layer before being output.
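One possible PyTorch rendering of a single branch is sketched below; the channel counts are illustrative, and the shortcut edge and final 1×1 fusion described above are omitted for brevity:

```python
import torch
import torch.nn as nn

class RFBBranch(nn.Module):
    # 1x1 channel reduction -> factorized 1x3 and 3x1 convolutions ->
    # 3x3 dilated convolution with this branch's dilation rate.
    def __init__(self, c_in, c_mid, dilation):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1),
            nn.Conv2d(c_mid, c_mid, (1, 3), padding=(0, 1)),
            nn.Conv2d(c_mid, c_mid, (3, 1), padding=(1, 0)),
            nn.Conv2d(c_mid, c_mid, 3, padding=dilation, dilation=dilation),
        )

    def forward(self, x):
        return self.body(x)

# Three parallel branches with dilation rates 1, 3 and 5, concatenated.
branches = [RFBBranch(256, 64, d) for d in (1, 3, 5)]
x = torch.randn(1, 256, 52, 52)
fused = torch.cat([b(x) for b in branches], dim=1)  # (1, 192, 52, 52)
```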
(7) A step of modified SPPNet;
Inspired by the backbone feature extraction network CSPDarknet in YOLOv4, a CSP structure is introduced into the SPP: the network is divided into two parts before the multi-scale features are fused, and one part of the features is merged directly with the SPP-fused features through a shortcut connection, which reduces the amount of computation by 40%. The structure is a fusion of several shallow networks; shallow networks do not suffer from vanishing gradients during training, so network convergence is accelerated.
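A compact sketch of this CSP-wrapped SPP follows. The 5/9/13 max-pooling kernels follow the SPP commonly used in YOLOv4, and the channel split and fusion details are assumptions rather than the patent's exact layout:

```python
import torch
import torch.nn as nn

class CSPSPP(nn.Module):
    def __init__(self, c):
        super().__init__()
        half = c // 2
        self.shortcut = nn.Conv2d(c, half, 1)   # part merged directly
        self.pre = nn.Conv2d(c, half, 1)        # part sent through SPP
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13))
        self.fuse = nn.Conv2d(half * 5, c, 1)   # 4*half (SPP) + half

    def forward(self, x):
        a = self.shortcut(x)
        b = self.pre(x)
        b = torch.cat([b] + [p(b) for p in self.pools], dim=1)
        return self.fuse(torch.cat([a, b], dim=1))
```

The shortcut half never passes through the pooling pyramid, which is where the computational saving comes from.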
Embodiment 3:
Different from Embodiment 2, in the vehicle target identification method based on the multi-scale yolo algorithm of this embodiment, the step of performing feature fusion with the PANet specifically includes:
The YOLOv4 algorithm is an improvement based on YOLOv2 and YOLOv3. Unlike two-stage object detection algorithms such as Faster R-CNN, it divides the image into grids, with each grid responsible for the corresponding objects; it supports multi-class object detection and achieves a higher detection speed while maintaining accuracy. The network structure of YOLOv3, shown in fig. 8, is an end-to-end real-time target detection framework consisting mainly of a Darknet-53 backbone feature extraction network and a multi-scale feature-fusion prediction network.
The CSPDarknet-53 network contains a large number of convolution operations: many residual modules built from 3×3 and 1×1 convolutions are stacked, and a 3×3 convolutional layer with stride 2 is used to halve the size of the feature map. Residual in the rectangular frames of the figure denotes a residual module, and the n× on the right of each frame is the number of times the module is repeated. A 3×3 convolutional layer with stride 2 is used 5 times in total, each use halving the length and width of the feature map, replacing the downsampling function of a pooling layer in a traditional convolutional network. For example, a 256×256 input image is halved 5 times by this fully convolutional network, yielding a feature map of size 8×8 (1/2 to the fifth power is 1/32). The main characteristic is the Residual Block: its residual skip-connection structure adds a learning path between the input and output layers and alleviates problems such as vanishing gradients as the network deepens. The convolutional layers in the Residual Block alternate between 1×1 and 3×3 kernels; each convolution is followed by a BN (batch normalization) layer that normalizes the input data to mean 0 and variance 1, and after the BN layer a Leaky ReLU activation function performs the nonlinear operation so that the network can be applied to nonlinear models.
In the prediction stage, the feature layers F1, F2 and F3 obtained from the CSPDarknet-53 feature extraction network are input into the multi-scale prediction network. F3 is convolved to obtain coarse-scale feature layer 3 for detecting large-scale targets; feature layer 3 is upsampled, fused with F2, and then convolved to obtain mesoscale feature layer 2 for detecting medium-scale targets; feature layer 2 is upsampled, fused with F1, and convolved to obtain fine-scale feature layer 1 for detecting small-scale targets. This Feature Pyramid Network (FPN) structure gives the algorithm good detection performance on targets of different sizes and scales. Finally, the prediction information of the three feature layers of different scales is combined, and the final detection result is obtained through a non-maximum suppression post-processing algorithm.
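A plain greedy form of this non-maximum suppression step is sketched below; the (x1, y1, x2, y2) box format is an assumption, and the 0.5 overlap threshold matches the IoU threshold used in the experiments later:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: repeatedly keep the highest-scoring box and discard
    # every remaining box that overlaps it by more than iou_thresh.
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```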
Because more convolutional layers lose more feature information from the input image, the YOLOv4 network uses K-means clustering to obtain the prediction-box sizes, setting 3 different prediction boxes for each scale of the feature pyramid network, for a total of 9 prediction boxes of different sizes: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326). On the 13×13 coarse-scale feature layer 3, (116×90), (156×198) and (373×326) are applied to detect larger targets; on the 26×26 mesoscale feature layer 2, the medium prediction boxes (30×61), (62×45) and (59×119) are applied to detect medium targets; on the 52×52 fine-scale feature layer 1, (10×13), (16×30) and (33×23) are applied to detect smaller targets.
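The scale-to-anchor assignment just described can be written down directly; the dictionary keys are illustrative names, not taken from the patent:

```python
# Nine K-means prediction boxes grouped by the feature layer that uses them.
ANCHORS = {
    "coarse_13x13": [(116, 90), (156, 198), (373, 326)],  # large targets
    "medium_26x26": [(30, 61), (62, 45), (59, 119)],      # medium targets
    "fine_52x52":   [(10, 13), (16, 30), (33, 23)],       # small targets
}
```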
The detection precision and speed are thus improved by using the YOLOv4 network.
Simulation experiment:
the hardware configuration used for creating experiment environment experiments is Inter (R) Core i7-9700K CPU, NVIDIA Geforce RTX 2080Ti graphics card and Windows operating system, the software environment is CUDA11.0, cudnn8.0, and a Tensorflow1.14 deep learning framework is adopted. During training, an Adam optimizer is adopted, the initial learning rate is 0.001, the momentum is 0.9, the Batchsize is set to be 8, the iteration times are 500, and a Mosaic data enhancement technique and a Dropblock regularization mode are adopted. The data are randomly divided into a training set, a verification set and a test set according to the following steps of 6. Before training, the size of an input image is adjusted to 608 multiplied by 608, and the width, height and center point coordinates of a boundary box of labeling information are normalized according to a PASCAL VOC data set format, so that the influence of abnormal samples on data is reduced.
In the detection process, whether the target position is successfully predicted is measured by the IoU between the prediction result and the actual target. The IoU threshold is set to 0.5: a result with IoU greater than 0.5 is recorded as a correct prediction, otherwise as a wrong prediction. Mean Average Precision (mAP) and recall are used as the accuracy indices of the network.
P denotes Precision, the probability that a detected target is correct among all detected targets; R denotes Recall, the probability that a correct target is detected among all positive samples. The calculation formulas of precision and recall are:
P = TP / (TP + FP)
R = TP / (TP + FN)
TP is the number of positive samples predicted correctly, FP is the number of samples predicted as positive but actually negative, and FN is the number of samples predicted as negative but actually positive. In practical applications networks are often deployed on mobile devices, so the size and detection speed of the network cannot be neglected. The size of the network is determined by the number of parameters (weights), and the detection speed is measured in FPS (Frames Per Second), defined as the number of pictures that can be detected per second.
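The evaluation quantities defined above translate directly into code; this sketch assumes the same (x1, y1, x2, y2) box format as the NMS example:

```python
def iou(a, b):
    # Intersection-over-Union of two (x1, y1, x2, y2) boxes; a
    # prediction counts as correct here when IoU > 0.5.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    # P = TP / (TP + FP), R = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)
```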
Compared with the residual skip-connection scheme of CSPDarknet, the backbone feature extraction network of YOLOv4, the depthwise separable convolution of MobileNetv3 helps optimize the network structure and reduce training parameters, so the MobileNetv3 structure replaces the CSPDarknet module, strengthening the effective use and transmission of features in the network, letting the network learn more feature information, and further increasing the speed of target detection. Low-level feature maps carry less semantic information but accurate target positions, while high-level feature maps carry rich semantic information but coarse target positions; RFBNet is therefore introduced to fuse the network's original receptive fields multiple times, and dilated convolution is introduced to enlarge the receptive field and fuse features of different sizes. Inspired by the backbone feature extraction network CSPDarknet in YOLOv4, this application introduces a CSP structure into the SPP: the network is divided into two parts before the multi-scale features are fused, and one part of the features is merged directly with the SPP-fused features through a shortcut connection, reducing computation by 40%. The structure is a fusion of several shallow networks, which do not suffer from vanishing gradients during training and thus accelerate convergence.
The embodiments disclosed above are preferred embodiments of the invention, but the invention is not limited thereto; those skilled in the art can readily grasp the spirit of the invention and make various extensions and changes without departing from it.

Claims (3)

1. A vehicle target identification method based on a multi-scale yolo algorithm is characterized by comprising the following steps: the method is realized by the following steps:
preprocessing a data set;
extracting features of the backbone network;
performing feature fusion on the PANet;
a non-maximum suppression (NMS) step;
outputting a target calibration decision;
in addition, because the class-imbalance problem of the sample data set affects classification precision, a multi-loss-function alternating training strategy is adopted: the cross-entropy loss function and the focal loss function are used alternately at different stages of network training to alleviate the sample-imbalance problem.
2. The method for recognizing the target of the vehicle based on the multi-scale yolo algorithm as claimed in claim 1, wherein: the step of extracting the features of the backbone network specifically comprises the following steps:
(1) Designing a convolution algorithm;
the convolution operation means that each pixel in the output image is obtained as a weighted sum of the pixels in a small region at the corresponding position of the input image, the weight template being called a convolution kernel; certain features of the image are extracted by convolving the image with a convolution kernel;
consider the pixel array of an 8×8 two-dimensional gray-scale image and a 3×3 convolution kernel; if the convolution kernel moves with a step size of 1, i.e., one cell per move, then when the kernel reaches row i, column j, the input-image values and the corresponding kernel values are multiplied pairwise and summed to determine the output value at row i, column j of the output image; taking the pixel values shown in fig. 3 as an example, the value at row 2, column 3 of the output is 1×1 + 2×0 + 3×(−1) + 4×1 + 5×0 + 6×(−1) + 7×1 + 8×0 + 9×(−1) = −6; the number of layers of the convolution kernel equals that of the input data: if the input image is a three-channel color image, the kernel also has three layers; three-dimensional convolution is essentially the same as two-dimensional convolution, the output value being the weighted sum of the input values and the corresponding kernel values;
one convolution layer contains a plurality of convolution kernels, the number of layers of the output pixel array after the convolution layer is related to the number of the convolution kernels, and if the convolution layer contains n convolution kernels, the number of layers of the output pixel array after the convolution layer is also n;
(2) Designing an activation function;
after the convolutional layer, an activation function is used to make the data nonlinear; if input and output were always linearly related, then after several layers the overall input and output would still be linearly related, and no matter how many layers lie in between, they would be equivalent to a single layer, as shown in the formulas:
Y = aX + b; Z = cY + d
Z = c(aX + b) + d = (ac)X + (bc + d)
selecting a ReLU function as an activation function;
(3) Designing a pooling algorithm;
performing pooling operation on an input image after convolution operation;
if the output data is the maximum value of the input data within the corresponding pooling window, the operation is max pooling; if it is the average value, the operation is average pooling;
(4) Performing a spatial pyramid pooling structure;
introducing an SPP structure, and establishing a mapping relation between a candidate region and an input characteristic diagram;
(5) Designing a MobileNetv3 network structure;
a D×D×3 input feature map is convolved with 3×3 convolution kernels to output a D×D×N feature map; in the standard convolution process, each of N 3×3 convolution kernels is convolved with every channel of the input feature map, finally producing a new feature map with N channels;
the depthwise separable convolution instead convolves 3 separate 3×3 kernels with the channels of the input feature map, one kernel per channel, producing a feature map whose output channel count equals the input's, and then convolves this feature map with N 1×1 kernels to obtain a new N-channel feature map;
the parameter quantities used by the two convolutions are calculated separately as follows:
P1 = D × D × 3 × N (1)
P2 = D × D × 3 + D × D × 1 × N (2)
where P1 is the number of parameters used by the standard convolution, P2 is the number of parameters used by the depthwise separable convolution, D is the length and width of the input feature map, and N is the number of convolution kernels;
in standard convolution, the number of input channels is far smaller than the number of output channels; comparing formula (1) with formula (2) gives formula (3):
P2/P1 = (D × D × 3 + D × D × 1 × N) / (D × D × 3 × N) = 1/N + 1/3 (3)
it can be seen that P2/P1 is far less than 1: using the depthwise separable convolution greatly reduces the number of parameters while achieving an effect similar to the standard convolution;
(6) Designing RFBs structures;
dilated (hole) convolution is introduced into the structure module; the RFBs structure first applies a 1×1 convolution to the feature map for channel transformation, and then applies multi-branch dilated-convolution processing to obtain multi-scale target information features; the multi-branch structure combines ordinary convolutional layers with dilated convolutional layers: in the ordinary convolutional layers, the 3×3 convolution kernel of the original RFB structure is replaced by parallel 1×3 and 3×1 kernels, and the 5×5 kernel is replaced by two series-connected 1×3 and 3×1 kernels; the dilated convolutional layers each consist of 3 kernels of size 3×3 with dilation rates of 1, 3 and 5, respectively, which prevents degradation of the convolutional layer caused by an overly large dilation rate; finally, a concat operation is performed on the differently sized feature layers processed by the multi-branch dilated structure, and a new fused feature layer is output;
(7) A step of modified SPPNet;
inspired by the backbone feature extraction network CSPDarknet in YOLOv4, a CSP structure is introduced into the SPP: before the multi-scale features are fused, the network is divided into two parts, and one part of the features is merged directly with the SPP-fused features through a shortcut connection.
3. The method for recognizing the target of the vehicle based on the multi-scale yolo algorithm according to the claim 1 or 2, wherein: the method for performing the feature fusion by using the PANet specifically comprises the following steps:
The CSPDarknet-53 network contains a large number of convolution operations: many residual modules built from 3×3 and 1×1 convolutions are stacked, and a 3×3 convolutional layer with stride 2 is used to halve the size of the feature map. Residual denotes a residual module, and the n× to the right of the rectangular frame is the number of times the residual module is repeated. A 3×3 convolutional layer with stride 2 is used 5 times in total; each use halves the length and width of the feature map, replacing the downsampling function of a pooling layer in a conventional convolutional network. In the prediction stage, the feature layers F1, F2 and F3 obtained from the CSPDarknet-53 feature extraction network are input into the multi-scale prediction network. F3 is convolved to obtain coarse-scale feature layer 3 for detecting large-scale targets; feature layer 3 is upsampled, fused with F2, and then convolved to obtain mesoscale feature layer 2 for detecting medium-scale targets; feature layer 2 is upsampled, fused with F1, and convolved to obtain fine-scale feature layer 1 for detecting small-scale targets. This Feature Pyramid Network (FPN) structure gives the algorithm a good detection effect on targets of different sizes and scales. Finally, the prediction information of the three feature layers of different scales is combined, and the final detection result is obtained through a non-maximum suppression post-processing algorithm;
the resulting 9 prediction boxes of different sizes are (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198) and (373×326); on the 13×13 coarse-scale feature layer 3, (116×90), (156×198) and (373×326) are applied to detect larger targets; on the 26×26 mesoscale feature layer 2, the medium prediction boxes (30×61), (62×45) and (59×119) are applied to detect medium targets; on the 52×52 fine-scale feature layer 1, (10×13), (16×30) and (33×23) are applied to detect smaller targets.
CN202210806937.2A 2022-07-08 2022-07-08 Vehicle target identification method based on multi-scale yolo algorithm Pending CN115171074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210806937.2A CN115171074A (en) 2022-07-08 2022-07-08 Vehicle target identification method based on multi-scale yolo algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210806937.2A CN115171074A (en) 2022-07-08 2022-07-08 Vehicle target identification method based on multi-scale yolo algorithm

Publications (1)

Publication Number Publication Date
CN115171074A true CN115171074A (en) 2022-10-11

Family

ID=83493785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210806937.2A Pending CN115171074A (en) 2022-07-08 2022-07-08 Vehicle target identification method based on multi-scale yolo algorithm

Country Status (1)

Country Link
CN (1) CN115171074A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937703A (en) * 2022-11-30 2023-04-07 南京林业大学 Enhanced feature extraction method for remote sensing image target detection
CN115937703B (en) * 2022-11-30 2024-05-03 南京林业大学 Enhanced feature extraction method for remote sensing image target detection
CN117853891A (en) * 2024-02-21 2024-04-09 广东海洋大学 Underwater garbage target identification method capable of being integrated on underwater robot platform


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination