CN115171074A - Vehicle target identification method based on multi-scale yolo algorithm - Google Patents

Vehicle target identification method based on multi-scale yolo algorithm

Info

Publication number
CN115171074A
CN115171074A (application CN202210806937.2A)
Authority
CN
China
Prior art keywords
convolution
layer
scale
input
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210806937.2A
Other languages
Chinese (zh)
Inventor
王英立
史肖波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202210806937.2A
Publication of CN115171074A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08 Detecting or categorising vehicles
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A vehicle target recognition method based on a multi-scale yolo algorithm belongs to the field of target recognition methods. The real-time performance, accuracy, and robustness of existing target detection methods in complex environments need improvement. The vehicle target identification method based on the multi-scale yolo algorithm is realized by the following steps: preprocessing a data set; extracting features with the backbone network; performing feature fusion with the PANet; applying non-maximum suppression (NMS); and outputting a target calibration decision. In addition, because the class-imbalance problem of the sample data set affects classification precision, a multi-loss-function alternating training strategy is adopted: the cross-entropy loss function and the focal loss function are used alternately at different stages of network training to alleviate the sample-imbalance problem.

Description

Vehicle target identification method based on multi-scale yolo algorithm
Technical Field
The invention relates to a vehicle identification method, in particular to a vehicle target identification method based on a multi-scale yolo algorithm.
Background
Deep learning has good self-learning capability and strong expression and processing capability, and target detection is now generally accomplished with deep learning. Convolutional Neural Networks (CNNs) are among the most widely used deep-learning models; the series of models from R-CNN to Fast R-CNN and Cascade R-CNN has been continuously developed and improved, greatly increasing detection accuracy and detection efficiency. Deep-learning-based algorithm models have therefore become among the most widely used in the field of object detection.
Intelligent driver-assistance technology requires a large amount of image recognition and processing. Information collected by vehicles is usually input as video or images, and the on-board computer must recognize valuable targets and content from this visual information to support the vehicle's subsequent behavior decisions. Correctly and quickly identifying targets in images is therefore the foundation of intelligent driver-assistance technology. Although target detection technology has achieved some results, improving its real-time performance, accuracy, and robustness in more complex environments remains an important research area.
Disclosure of Invention
The invention aims to solve the problem that the real-time performance, accuracy, and robustness of existing target detection methods in complex environments need improvement, and provides a vehicle target identification method based on a multi-scale yolo algorithm.
A vehicle object identification method based on a multi-scale yolo algorithm, the method being implemented by the following steps:
preprocessing a data set;
extracting features of the backbone network;
performing feature fusion on the PANet;
a non-maximum suppression (NMS) step;
outputting a target calibration decision;
in addition, because the class-imbalance problem of the sample data set affects classification precision, a multi-loss-function alternating training strategy is adopted: the cross-entropy loss function and the focal loss function are used alternately at different stages of network training to alleviate the sample-imbalance problem.
Preferably, the step of extracting the features of the backbone network specifically includes:
(1) Designing a convolution algorithm;
the convolution operation means that each pixel in the output image is obtained as a weighted sum of the pixels in a small region at the corresponding position of the input image, the weight template being called a convolution kernel; certain features of the image are extracted by convolving the image with a convolution kernel;
consider the pixel array of an 8×8 two-dimensional gray-scale image and a 3×3 convolution kernel; if the convolution kernel moves with a step size of 1, i.e., one cell per move, then when the kernel reaches row i, column j, the input-image values and the corresponding kernel values are multiplied pairwise and summed to determine the output value at row i, column j of the output image; taking the pixel values shown in fig. 3 as an example, the value at row 2, column 3 of the output is 1×1 + 2×0 + 3×(−1) + 4×1 + 5×0 + 6×(−1) + 7×1 + 8×0 + 9×(−1) = −6; the number of layers of the convolution kernel equals that of the input data: if the input image is a three-channel color image, the kernel also has three layers; three-dimensional convolution is essentially the same as two-dimensional convolution, the output value being the weighted sum of the input values and the corresponding kernel values;
one convolution layer contains a plurality of convolution kernels, the number of layers of the output pixel array after the convolution layer is related to the number of the convolution kernels, and if the convolution layer contains n convolution kernels, the number of layers of the output pixel array after the convolution layer is also n;
(2) Designing an activation function;
after the convolutional layer, an activation function is used to make the data nonlinear; if input and output were always linearly related, then after several layers the overall input and output would still be linearly related, and no matter how many layers lie in between, they would be equivalent to a single layer, as shown in the formulas:
Y = aX + b; Z = cY + d
Z = c(aX + b) + d = (ac)X + (bc + d)
selecting a ReLU function as an activation function;
(3) Designing a pooling algorithm;
performing pooling operation on an input image after convolution operation;
if the output data is the maximum value of the input data within the corresponding pooling window, the operation is max pooling; if it is the average value, the operation is average pooling;
(4) Performing a spatial pyramid pooling structure;
introducing an SPP structure, and establishing a mapping relation between the candidate region and the input characteristic diagram;
(5) Designing a MobileNetv3 network structure;
a D×D×3 input feature map is convolved with 3×3 convolution kernels to output a D×D×N feature map; in the standard convolution process, each of N 3×3 convolution kernels is convolved with every channel of the input feature map, finally producing a new feature map with N channels;
the depthwise separable convolution instead convolves 3 separate 3×3 kernels with the channels of the input feature map, one kernel per channel, producing a feature map whose output channel count equals the input's, and then convolves this feature map with N 1×1 kernels to obtain a new N-channel feature map;
the parameter quantities used by the two convolutions are calculated separately as follows:
P1 = D × D × 3 × N (1)
P2 = D × D × 3 + D × D × 1 × N (2)
where P1 is the number of parameters used by the standard convolution, P2 is the number of parameters used by the depthwise separable convolution, D is the length and width of the input feature map, and N is the number of convolution kernels;
in standard convolution, the number of input channels is far smaller than the number of output channels; comparing formula (1) with formula (2) gives formula (3):
P2/P1 = (D × D × 3 + D × D × 1 × N) / (D × D × 3 × N) = 1/N + 1/3 (3)
it can be seen that P2/P1 is far less than 1: using the depthwise separable convolution greatly reduces the number of parameters while achieving an effect similar to the standard convolution;
(6) Designing RFBs structures;
Dilated (hole) convolution is introduced into the structure module. The RFBs structure first applies a 1×1 convolution to the feature map for channel transformation, and then applies multi-branch dilated-convolution processing to obtain multi-scale target information features. The multi-branch structure combines ordinary convolutional layers with dilated convolutional layers: in the ordinary convolutional layers, the 3×3 convolution kernel of the original RFB structure is replaced by parallel 1×3 and 3×1 kernels, and the 5×5 kernel is replaced by two series-connected 1×3 and 3×1 kernels; the dilated convolutional layers each consist of 3 kernels of size 3×3 with dilation rates of 1, 3 and 5, respectively, which prevents degradation of the convolutional layer caused by an overly large dilation rate. Finally, a concat operation is performed on the differently sized feature layers processed by the multi-branch dilated structure, and a new fused feature layer is output;
(7) A step of modified SPPNet;
Inspired by the backbone feature extraction network CSPDarknet in YOLOv4, a CSP structure is introduced into the SPP: before the multi-scale features are fused, the network is divided into two parts, and one part of the features is merged directly with the SPP-fused features through a shortcut connection.
Preferably, the step of performing feature fusion on the PANet specifically comprises:
The CSPDarknet-53 network contains a large number of convolution operations: many residual modules built from 3×3 and 1×1 convolutions are stacked, and a 3×3 convolutional layer with stride 2 is used to halve the size of the feature map. Residual denotes a residual module, and the n× to the right of the rectangular frame is the number of times the residual module is repeated. A 3×3 convolutional layer with stride 2 is used 5 times in total; each use halves the length and width of the feature map, replacing the downsampling function of a pooling layer in a conventional convolutional network. In the prediction stage, the feature layers F1, F2 and F3 obtained from the CSPDarknet-53 feature extraction network are input into the multi-scale prediction network. F3 is convolved to obtain coarse-scale feature layer 3 for detecting large-scale targets; feature layer 3 is upsampled, fused with F2, and then convolved to obtain mesoscale feature layer 2 for detecting medium-scale targets; feature layer 2 is upsampled, fused with F1, and convolved to obtain fine-scale feature layer 1 for detecting small-scale targets. This Feature Pyramid Network (FPN) structure gives the algorithm a good detection effect on targets of different sizes and scales. Finally, the prediction information of the three feature layers of different scales is combined, and the final detection result is obtained through a non-maximum suppression post-processing algorithm;
The resulting 9 prediction boxes of different sizes are (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198) and (373×326). On the 13×13 coarse-scale feature layer 3, (116×90), (156×198) and (373×326) are applied to detect larger targets; on the 26×26 mesoscale feature layer 2, the medium prediction boxes (30×61), (62×45) and (59×119) are applied to detect medium targets; on the 52×52 fine-scale feature layer 1, (10×13), (16×30) and (33×23) are applied to detect smaller targets.
The invention has the beneficial effects that:
Through the study of the YOLOv4 network, the basic experimental method and the experimental principles of the network are understood, and the basic structure of YOLOv4 and how each sub-network passes data are determined. The network is then modified according to the feasibility of the experimental environment and the problems of the network structure, such as missed detection of small targets, occlusion, detection against complex backgrounds, and multi-scale target detection. This improves the target detection accuracy of the network; in addition, the network is pruned, reducing the network parameters and increasing the training speed of the model.
1. Compared with the residual skip-connection scheme of CSPDarknet, the backbone feature extraction network of YOLOv4, the method adopts MobileNetv3 to replace the original backbone feature extraction network. The depthwise separable convolution of MobileNetv3 helps optimize the network structure and reduce network training parameters, so using MobileNetv3 as the backbone feature extraction network strengthens the effective use and transmission of features within the network, lets the network learn more feature information, makes the network structure lighter, and increases the speed of target detection.
2. Inspired by the backbone feature extraction network CSPDarknet in YOLOv4, the invention introduces a CSP structure into the SPP: before the multi-scale features are fused, the network is divided into two parts, and one part of the features is merged directly with the SPP-fused features through a shortcut connection, which reduces the amount of computation by 40%. The structure is a fusion of several shallow networks; shallow networks do not suffer from vanishing gradients during training, so network convergence is accelerated.
3. Low-level feature maps carry less semantic information but locate targets accurately, whereas high-level feature maps carry rich semantic information but locate targets coarsely. RFBNet is therefore introduced to fuse the network's original receptive fields multiple times, and dilated convolution is introduced, increasing the receptive field and fusing features of different sizes.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram illustrating a basic method of convolution operation according to the present invention;
FIG. 3 shows the number of output layers of a pixel array after passing through a convolutional layer according to the present invention;
FIG. 4 is a graph of two activation functions involved in the present invention;
FIG. 5 is a diagram of two pooling methods involved in the present invention;
FIG. 6 shows the number of output layers of a pixel array after passing through a pooling layer according to the present invention;
FIG. 7 is an SPP structure with a fixed-size (21-dimensional) output according to the present invention;
fig. 8 shows a network structure of YOLOv3 according to the present invention.
Detailed Description
Embodiment 1:
in the embodiment, as shown in fig. 1, the vehicle target identification method based on the multi-scale yolo algorithm is implemented by the following steps:
preprocessing a data set;
extracting features of the backbone network;
performing feature fusion on the PANet;
a non-maximum suppression (NMS) step;
outputting a target calibration decision;
in addition, because the class-imbalance problem of the sample data set affects classification precision, a multi-loss-function alternating training strategy is adopted: the cross-entropy loss function and the focal loss function are used alternately at different stages of network training to alleviate the sample-imbalance problem.
Sample imbalance mainly involves two aspects. On one hand, most samples in the data set are relatively easy to learn, the so-called "simple samples": they make up most of the data set, their features are distinct and well distributed, interference from the surrounding background is low, and the network can easily learn to recognize them. The remaining samples are "difficult samples": their features are weak because of small size or close proximity to other samples, or their main features are lost through occlusion or changes in light and shadow, so the network can hardly learn their discriminative features and the detection effect is poor. In addition, difficult samples appear infrequently in the data set and are unbalanced in proportion to the simple samples, so they receive insufficient training, which further aggravates the problems of difficult learning and inaccurate detection.
The invention does not use the focal loss function to completely replace the cross-entropy loss function when training the network, because actual training and testing on the XDUAV data set showed that using only the focal loss function greatly improves certain rare categories (such as buses and tank trucks) but has little effect on categories that are rare, small, and weak in their own features (such as bicycles and motorcycles). Our analysis is that the features of these samples are very limited; although the focal loss function increases their training weight, the lack of data still prevents the network from learning good feature expressions for these classes. Therefore, the cross-entropy loss function is used in the early stage of training, with the goal of learning the overall feature distribution of the samples. For classes with little data, such as bicycles and motorcycles, the features are similar, and the cross-entropy loss function lets the network learn these similar classes as a whole. The focal loss function is then substituted, and the network is trained to learn the discriminative features among similar-class samples so as to distinguish them better.
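A minimal sketch of this alternating strategy follows (PyTorch-style; the switch epoch and the focal-loss hyperparameters gamma and alpha are assumptions, not values from the patent):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # Standard binary focal loss; down-weights easy samples so the
    # rare/difficult ones dominate the gradient. targets are floats in {0, 1}.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # probability assigned to the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

def classification_loss(logits, targets, epoch, switch_epoch=50):
    # Early epochs: plain cross entropy to learn the overall feature
    # distribution; later epochs: focal loss to sharpen discrimination
    # among similar, under-represented classes.
    if epoch < switch_epoch:
        return F.binary_cross_entropy_with_logits(logits, targets)
    return focal_loss(logits, targets)
```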
Embodiment 2:
Different from Embodiment 1, in the vehicle target identification method based on the multi-scale yolo algorithm of this embodiment, the step of extracting the features of the backbone network specifically includes:
(1) Designing a convolution algorithm;
The convolution operation means that each pixel in the output image is obtained as a weighted sum of the pixels in a small region at the corresponding position of the input image, the weight template being called a convolution kernel. Certain features of the image are extracted by convolving the image with a convolution kernel; the basic method of the convolution operation is shown in fig. 2.
Fig. 2 shows the pixel array of an 8×8 two-dimensional gray-scale image and a 3×3 convolution kernel. If the convolution kernel moves with a step size of 1, i.e., one cell per move, then when the kernel reaches row i, column j, the input-image values and the corresponding kernel values are multiplied pairwise and summed to determine the output value at row i, column j of the output image. Taking the pixel values shown in fig. 3 as an example, the value at row 2, column 3 of the output is 1×1 + 2×0 + 3×(−1) + 4×1 + 5×0 + 6×(−1) + 7×1 + 8×0 + 9×(−1) = −6. The number of layers of the convolution kernel equals that of the input data: if the input image is a three-channel color image, the kernel also has three layers. Three-dimensional convolution is essentially the same as two-dimensional convolution, the output value being the weighted sum of the input values and the corresponding kernel values. The step size and whether the input data is padded are set when the network is designed;
one convolution layer contains a plurality of convolution kernels, the number of layers of the output pixel array after the convolution layer is related to the number of convolution kernels, and if the convolution layer contains n convolution kernels, the number of layers of the output pixel array after the convolution layer is also n, as shown in fig. 3.
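The weighted-sum convolution just described can be sketched in a few lines of NumPy. The image contents and the vertical-edge kernel below are illustrative assumptions chosen so that the first output value reproduces the −6 of the worked example:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    # Slide the kernel over the image (no padding); each output pixel
    # is the sum of elementwise products over the covered patch.
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.zeros((8, 8))
image[0:3, 0:3] = np.arange(1, 10).reshape(3, 3)  # patch holding 1..9
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                   # assumed vertical-edge kernel
print(conv2d(image, kernel)[0, 0])                # -6.0
```

A layer with n such kernels simply stacks n of these output maps, giving the n-layer output array described above.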
(2) Designing an activation function;
After the convolutional layer, an activation function is used to make the data nonlinear, so that the input–output relationship is nonlinear and can characterize more complex variations of the input. If input and output were always linearly related, then after several layers the overall input and output would still be linearly related; no matter how many layers lie in between, they would be equivalent to a single layer, as shown in the formulas:
Y = aX + b; Z = cY + d
Z = c(aX + b) + d = (ac)X + (bc + d)
selecting a ReLU function as an activation function;
Commonly used activation functions include the ReLU function and the Mish function; their graphs are shown in fig. 4. Because the output of the Mish function changes little when the input is very large or very small, the ReLU function has greater advantages in avoiding vanishing gradients and improving training speed;
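For reference, minimal NumPy definitions of the two activation functions compared in fig. 4 (Mish is written in its published form x·tanh(softplus(x))):

```python
import numpy as np

def relu(x):
    # max(0, x): zero for negative inputs, identity for positive ones
    return np.maximum(0.0, x)

def mish(x):
    # x * tanh(softplus(x)), using a numerically stable softplus
    return x * np.tanh(np.logaddexp(0.0, x))
```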
(3) Designing a pooling algorithm;
After the input image undergoes the convolution operation, each pixel in the map may influence multiple output points, which can cause information redundancy and reduce algorithm performance, so a pooling operation is applied. The most common pooling algorithms are average pooling and max pooling, as shown in fig. 5; max pooling is the most commonly used.
If the output data is the maximum value of the input data within the corresponding pooling window, the operation is max pooling; if it is the average value, the operation is average pooling.
The pooling algorithm is similar to the convolution algorithm, but whereas the number of layers of a convolution kernel must match that of the input data, the pooling window is a two-dimensional array that moves row by row, column by column, and layer by layer; the number of layers of the output data after the pooling layer is therefore the same as that of the input data, as shown in fig. 6.
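A sketch of the two pooling modes on a single two-dimensional layer follows; the window size and stride of 2 are illustrative assumptions. Applying it layer by layer preserves the number of layers, as stated above:

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    # Move the pooling window over one 2-D layer; take the max or the
    # mean of the covered values depending on the mode.
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    reduce = np.max if mode == "max" else np.mean
    for i in range(oh):
        for j in range(ow):
            out[i, j] = reduce(x[i * stride:i * stride + size,
                                 j * stride:j * stride + size])
    return out
```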
(4) A step of performing a spatial pyramid pooling structure (SPPNet);
The SPP structure is generally located between the convolutional layer and the fully-connected layer and is used to generate a fixed-size output from an input of any size. The process is as follows: an input feature map of any size is pooled in parallel with 1×1, 2×2 and 4×4 grids of pooling windows to obtain 1 + 4 + 16 = 21 picture blocks, and then the maximum value (max pooling) or the average value (average pooling) of each block is computed, yielding a fixed 21-dimensional feature vector that can be fed directly to the fully-connected layer. Features are extracted from receptive fields of different sizes by adjusting the size and number of grid cells, and the outputs of the three pooling specifications are concatenated to obtain a fused feature vector. The ability to produce a fixed-size output from an input of any size is the most important characteristic of the SPP structure. Fig. 7 shows the SPP structure.
Taking the classic two-stage target detection algorithm R-CNN as an example: in the region-proposal stage the algorithm generates a large number of candidate regions (about 2000), each candidate region is cropped or warped to a fixed-size input image, and the CNN model then processes it to obtain a fixed-size feature vector that can be fed into the fully-connected layers for classification and regression. Every candidate region must be sent through the CNN model to produce its fixed-size feature vector, and this large number of candidate regions inevitably makes the algorithm inefficient. After the SPP structure is introduced to optimize R-CNN, the input image is sent through the CNN model once to obtain the overall input feature map; a mapping is then established between the candidate regions and the input feature map to obtain the feature vector of every candidate region (without passing each region through the CNN model), and classification and regression proceed as before. With the SPP structure, only the mapping between the candidate regions and the input feature map needs to be established and the forward computation of the CNN model is not repeated; the detection speed is 24–120 times that of R-CNN, with higher detection accuracy;
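The fixed 21-dimensional output can be sketched with adaptive pooling. PyTorch is used here purely for illustration (the experiments in this application use TensorFlow):

```python
import torch
import torch.nn.functional as F

def spp(feature_map, grids=(1, 2, 4)):
    # Pool the map into 1x1, 2x2 and 4x4 grids and flatten, giving
    # (1 + 4 + 16) = 21 values per channel regardless of input size.
    n = feature_map.shape[0]
    outs = [F.adaptive_max_pool2d(feature_map, g).reshape(n, -1)
            for g in grids]
    return torch.cat(outs, dim=1)

x = torch.randn(1, 1, 13, 17)  # any spatial size works
print(spp(x).shape)            # torch.Size([1, 21])
```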
(5) Designing a MobileNetv3 network structure;
The core idea of MobileNet is to decompose a complete convolution operation into two steps: depthwise convolution and pointwise convolution. Efficient neural networks are mainly obtained by: 1. reducing the number of parameters; 2. quantizing the parameters to reduce the memory occupied by each parameter. Current research falls into two directions: one compresses a trained complex model into a small model; the other directly designs and trains small models (MobileNet belongs to this category).
A D×D×3 input feature map is convolved with 3×3 convolution kernels to output a D×D×N feature map. In the standard convolution process, each of N 3×3 convolution kernels is convolved with every channel of the input feature map, finally producing a new feature map with N channels.
The depthwise separable convolution instead convolves 3 separate 3×3 kernels with the channels of the input feature map, one kernel per channel, producing a feature map whose output channel count equals the input's; N 1×1 kernels are then convolved with this feature map to obtain a new N-channel feature map.
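The two-step decomposition reads naturally as a small PyTorch module. This is a sketch of the depthwise-separable primitive only; a full MobileNetv3 block adds expansion, squeeze-and-excitation, and activation layers:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # groups=c_in gives one independent 3x3 kernel per input channel
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        # 1x1 pointwise convolution mixes the channels up to c_out
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```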
The parameter quantities used by the two convolutions are calculated separately as follows:
P1 = D × D × 3 × N (1)
P2 = D × D × 3 + D × D × 1 × N (2)
where P1 is the number of parameters used by the standard convolution, P2 is the number of parameters used by the depthwise separable convolution, D is the length and width of the input feature map, and N is the number of convolution kernels.
In standard convolution, the number of input channels is far smaller than the number of output channels; comparing formula (1) with formula (2) gives formula (3):
P2/P1 = (D × D × 3 + D × D × 1 × N) / (D × D × 3 × N) = 1/N + 1/3 (3)
It can be seen that P2/P1 is far less than 1: using the depthwise separable convolution greatly reduces the number of parameters while achieving an effect similar to the standard convolution.
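Equations (1)–(3) can be checked numerically; the sketch below evaluates the two counts exactly as written above and the ratio 1/N + 1/3 they imply:

```python
def conv_param_counts(D, N):
    # Eq. (1): standard convolution; Eq. (2): depthwise separable.
    p1 = D * D * 3 * N
    p2 = D * D * 3 + D * D * 1 * N
    return p1, p2, p2 / p1          # Eq. (3): equals 1/N + 1/3

print(conv_param_counts(3, 64))     # (1728, 603, ~0.349): well below 1
```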
(6) Designing RFBs structures;
To capture multi-scale feature information of the target, an RFBs structure module is connected after the improved backbone network. Dilated (hole) convolution is introduced into this structure module, increasing the receptive field and fusing features of different sizes.
The RFBs structure first applies a 1×1 convolution to the feature map for channel transformation, and then applies multi-branch dilated-convolution processing to obtain multi-scale target information features. The multi-branch structure combines ordinary convolutional layers with dilated convolutional layers: in the ordinary convolutional layers, the 3×3 convolution kernel of the original RFB structure is replaced by parallel 1×3 and 3×1 kernels, and the 5×5 kernel is replaced by two series-connected 1×3 and 3×1 kernels, which effectively reduces the computation of the network and keeps the whole network lightweight. The dilated convolutional layers each consist of 3 kernels of size 3×3 with dilation rates of 1, 3 and 5, respectively, which prevents degradation of the convolutional layer caused by an overly large dilation rate. Finally, a concat operation is performed on the differently sized feature layers processed by the multi-branch dilated structure, and a new fused feature layer is output. To retain the original information of the input feature map, the new fused feature layer is summed with a large residual edge formed from the original feature map through a 1×1 convolution channel-transformation layer before being output.
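One possible PyTorch rendering of a single branch is sketched below; the channel counts are illustrative, and the shortcut edge and final 1×1 fusion described above are omitted for brevity:

```python
import torch
import torch.nn as nn

class RFBBranch(nn.Module):
    # 1x1 channel reduction -> factorized 1x3 and 3x1 convolutions ->
    # 3x3 dilated convolution with this branch's dilation rate.
    def __init__(self, c_in, c_mid, dilation):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1),
            nn.Conv2d(c_mid, c_mid, (1, 3), padding=(0, 1)),
            nn.Conv2d(c_mid, c_mid, (3, 1), padding=(1, 0)),
            nn.Conv2d(c_mid, c_mid, 3, padding=dilation, dilation=dilation),
        )

    def forward(self, x):
        return self.body(x)

# Three parallel branches with dilation rates 1, 3 and 5, concatenated.
branches = [RFBBranch(256, 64, d) for d in (1, 3, 5)]
x = torch.randn(1, 256, 52, 52)
fused = torch.cat([b(x) for b in branches], dim=1)  # (1, 192, 52, 52)
```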
(7) A step of modified SPPNet;
Inspired by the backbone feature extraction network CSPDarknet in YOLOv4, a CSP structure is introduced into the SPP: the network is divided into two parts before the multi-scale features are fused, and one part of the features is merged directly with the SPP-fused features through a shortcut connection, which reduces the amount of computation by 40%. The structure is a fusion of several shallow networks; shallow networks do not suffer from vanishing gradients during training, so network convergence is accelerated.
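A compact sketch of this CSP-wrapped SPP follows. The 5/9/13 max-pooling kernels follow the SPP commonly used in YOLOv4, and the channel split and fusion details are assumptions rather than the patent's exact layout:

```python
import torch
import torch.nn as nn

class CSPSPP(nn.Module):
    def __init__(self, c):
        super().__init__()
        half = c // 2
        self.shortcut = nn.Conv2d(c, half, 1)   # part merged directly
        self.pre = nn.Conv2d(c, half, 1)        # part sent through SPP
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13))
        self.fuse = nn.Conv2d(half * 5, c, 1)   # 4*half (SPP) + half

    def forward(self, x):
        a = self.shortcut(x)
        b = self.pre(x)
        b = torch.cat([b] + [p(b) for p in self.pools], dim=1)
        return self.fuse(torch.cat([a, b], dim=1))
```

The shortcut half never passes through the pooling pyramid, which is where the computational saving comes from.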
Embodiment 3:
Different from Embodiment 2, in the vehicle target identification method based on the multi-scale yolo algorithm of this embodiment, the step of performing feature fusion with the PANet specifically includes:
The YOLOv4 algorithm is an improvement based on YOLOv2 and YOLOv3. Unlike two-stage object detection algorithms such as Faster R-CNN, it divides the image into grids, with each grid responsible for the corresponding objects; it supports multi-class object detection and achieves a higher detection speed while maintaining accuracy. The network structure of YOLOv3, shown in fig. 8, is an end-to-end real-time target detection framework consisting mainly of a Darknet-53 backbone feature extraction network and a multi-scale feature-fusion prediction network.
The CSPDarknet-53 network contains a large number of convolution operations: many residual modules built from 3×3 and 1×1 convolutions are stacked, and a 3×3 convolutional layer with stride 2 is used to halve the size of the feature map. Residual in the rectangular frames of the figure denotes a residual module, and the n× on the right of each frame is the number of times the module is repeated. A 3×3 convolutional layer with stride 2 is used 5 times in total, each use halving the length and width of the feature map, replacing the downsampling function of a pooling layer in a traditional convolutional network. For example, a 256×256 input image is halved 5 times by this fully convolutional network, yielding a feature map of size 8×8 (1/2 to the fifth power is 1/32). The main characteristic is the Residual Block: its residual skip-connection structure adds a learning path between the input and output layers and alleviates problems such as vanishing gradients as the network deepens. The convolutional layers in the Residual Block alternate between 1×1 and 3×3 kernels; each convolution is followed by a BN (batch normalization) layer that normalizes the input data to mean 0 and variance 1, and after the BN layer a Leaky ReLU activation function performs the nonlinear operation so that the network can be applied to nonlinear models.
In the prediction stage, the feature layers F1, F2 and F3 obtained from the CSPDarknet-53 feature extraction network are input into the multi-scale prediction network. F3 is convolved to obtain coarse-scale feature layer 3 for detecting large-scale targets; feature layer 3 is upsampled, fused with F2, and then convolved to obtain mesoscale feature layer 2 for detecting medium-scale targets; feature layer 2 is upsampled, fused with F1, and convolved to obtain fine-scale feature layer 1 for detecting small-scale targets. This Feature Pyramid Network (FPN) structure gives the algorithm good detection performance on targets of different sizes and scales. Finally, the prediction information of the three feature layers of different scales is combined, and the final detection result is obtained through a non-maximum suppression post-processing algorithm.
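A plain greedy form of this non-maximum suppression step is sketched below; the (x1, y1, x2, y2) box format is an assumption, and the 0.5 overlap threshold matches the IoU threshold used in the experiments later:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: repeatedly keep the highest-scoring box and discard
    # every remaining box that overlaps it by more than iou_thresh.
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```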
Because more convolutional layers lose more feature information from the input image, the YOLOv4 network uses K-means clustering to obtain the prediction-box sizes, setting 3 different prediction boxes for each scale of the feature pyramid network, for a total of 9 prediction boxes of different sizes: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326). On the 13×13 coarse-scale feature layer 3, (116×90), (156×198) and (373×326) are applied to detect larger targets; on the 26×26 mesoscale feature layer 2, the medium prediction boxes (30×61), (62×45) and (59×119) are applied to detect medium targets; on the 52×52 fine-scale feature layer 1, (10×13), (16×30) and (33×23) are applied to detect smaller targets.
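The scale-to-anchor assignment just described can be written down directly; the dictionary keys are illustrative names, not taken from the patent:

```python
# Nine K-means prediction boxes grouped by the feature layer that uses them.
ANCHORS = {
    "coarse_13x13": [(116, 90), (156, 198), (373, 326)],  # large targets
    "medium_26x26": [(30, 61), (62, 45), (59, 119)],      # medium targets
    "fine_52x52":   [(10, 13), (16, 30), (33, 23)],       # small targets
}
```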
The detection precision and speed are thus improved by using the YOLOv4 network.
Simulation experiment:
the hardware configuration used for creating experiment environment experiments is Inter (R) Core i7-9700K CPU, NVIDIA Geforce RTX 2080Ti graphics card and Windows operating system, the software environment is CUDA11.0, cudnn8.0, and a Tensorflow1.14 deep learning framework is adopted. During training, an Adam optimizer is adopted, the initial learning rate is 0.001, the momentum is 0.9, the Batchsize is set to be 8, the iteration times are 500, and a Mosaic data enhancement technique and a Dropblock regularization mode are adopted. The data are randomly divided into a training set, a verification set and a test set according to the following steps of 6. Before training, the size of an input image is adjusted to 608 multiplied by 608, and the width, height and center point coordinates of a boundary box of labeling information are normalized according to a PASCAL VOC data set format, so that the influence of abnormal samples on data is reduced.
In the detection process, whether the target position is successfully predicted is measured by the IoU between the prediction result and the actual target. The IoU threshold is set to 0.5: a result with IoU greater than 0.5 is recorded as a correct prediction, otherwise as a wrong prediction. Mean Average Precision (mAP) and recall are used as the accuracy indices of the network.
P denotes Precision, the probability that a detected target is correct among all detected targets; R denotes Recall, the probability that a correct target is detected among all positive samples. The calculation formulas of precision and recall are:
P = TP / (TP + FP)
R = TP / (TP + FN)
TP is the number of positive samples predicted correctly, FP is the number of samples predicted as positive but actually negative, and FN is the number of samples predicted as negative but actually positive. In practical applications networks are often deployed on mobile devices, so the size and detection speed of the network cannot be neglected. The size of the network is determined by the number of parameters (weights), and the detection speed is measured in FPS (Frames Per Second), defined as the number of pictures that can be detected per second.
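The evaluation quantities defined above translate directly into code; this sketch assumes the same (x1, y1, x2, y2) box format as the NMS example:

```python
def iou(a, b):
    # Intersection-over-Union of two (x1, y1, x2, y2) boxes; a
    # prediction counts as correct here when IoU > 0.5.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    # P = TP / (TP + FP), R = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)
```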
Compared with the residual skip-connection scheme of CSPDarknet, the backbone feature extraction network of YOLOv4, the depthwise separable convolution of MobileNetv3 helps optimize the network structure and reduce training parameters, so the MobileNetv3 structure replaces the CSPDarknet module, strengthening the effective use and transmission of features in the network, letting the network learn more feature information, and further increasing the speed of target detection. Low-level feature maps carry less semantic information but accurate target positions, while high-level feature maps carry rich semantic information but coarse target positions; RFBNet is therefore introduced to fuse the network's original receptive fields multiple times, and dilated convolution is introduced to enlarge the receptive field and fuse features of different sizes. Inspired by the backbone feature extraction network CSPDarknet in YOLOv4, this application introduces a CSP structure into the SPP: the network is divided into two parts before the multi-scale features are fused, and one part of the features is merged directly with the SPP-fused features through a shortcut connection, reducing computation by 40%. The structure is a fusion of several shallow networks, which do not suffer from vanishing gradients during training and thus accelerate convergence.
The embodiments disclosed above are preferred embodiments of the invention, but the invention is not limited thereto; those skilled in the art can readily grasp the spirit of the invention and make various extensions and changes without departing from it.

Claims (3)

1. A vehicle target identification method based on a multi-scale yolo algorithm is characterized by comprising the following steps: the method is realized by the following steps:
preprocessing a data set;
extracting features of the backbone network;
performing feature fusion on the PANet;
a non-maximum suppression (NMS) step;
outputting a target calibration decision;
in addition, because the class-imbalance problem of the sample data set affects classification precision, a multi-loss-function alternating training strategy is adopted: the cross-entropy loss function and the focal loss function are used alternately at different stages of network training to alleviate the sample-imbalance problem.
2. The method for recognizing the target of the vehicle based on the multi-scale yolo algorithm as claimed in claim 1, wherein: the step of extracting the features of the backbone network specifically comprises the following steps:
(1) Designing a convolution algorithm;
the convolution operation means that each pixel in the output image is obtained as a weighted sum of the pixels in a small region at the corresponding position of the input image, the weight template being called a convolution kernel; certain features of the image are extracted by convolving the image with a convolution kernel;
consider the pixel array of an 8×8 two-dimensional gray-scale image and a 3×3 convolution kernel; if the convolution kernel moves with a step size of 1, i.e., one cell per move, then when the kernel reaches row i, column j, the input-image values and the corresponding kernel values are multiplied pairwise and summed to determine the output value at row i, column j of the output image; taking the pixel values shown in fig. 3 as an example, the value at row 2, column 3 of the output is 1×1 + 2×0 + 3×(−1) + 4×1 + 5×0 + 6×(−1) + 7×1 + 8×0 + 9×(−1) = −6; the number of layers of the convolution kernel equals that of the input data: if the input image is a three-channel color image, the kernel also has three layers; three-dimensional convolution is essentially the same as two-dimensional convolution, the output value being the weighted sum of the input values and the corresponding kernel values;
one convolution layer contains a plurality of convolution kernels, the number of layers of the output pixel array after the convolution layer is related to the number of the convolution kernels, and if the convolution layer contains n convolution kernels, the number of layers of the output pixel array after the convolution layer is also n;
(2) Designing an activation function;
after the convolutional layer, an activation function is used to make the data nonlinear; if input and output were always linearly related, then after several layers the overall input and output would still be linearly related, and no matter how many layers lie in between, they would be equivalent to a single layer, as shown in the formulas:
Y = aX + b; Z = cY + d
Z = c(aX + b) + d = (ac)X + (bc + d)
selecting a ReLU function as an activation function;
(3) Designing a pooling algorithm;
performing pooling operation on an input image after convolution operation;
if the output data is the maximum value of the input data within the corresponding pooling window, the operation is max pooling; if it is the average value, the operation is average pooling;
(4) Performing a spatial pyramid pooling structure;
introducing an SPP structure, and establishing a mapping relation between a candidate region and an input characteristic diagram;
(5) Designing a MobileNetv3 network structure;
a D×D×3 input feature map is convolved with 3×3 convolution kernels to output a D×D×N feature map; in the standard convolution process, each of N 3×3 convolution kernels is convolved with every channel of the input feature map, finally producing a new feature map with N channels;
the depthwise separable convolution instead convolves 3 separate 3×3 kernels with the channels of the input feature map, one kernel per channel, producing a feature map whose output channel count equals the input's, and then convolves this feature map with N 1×1 kernels to obtain a new N-channel feature map;
the parameter quantities used by the two convolutions are calculated separately as follows:
P1 = D × D × 3 × N (1)
P2 = D × D × 3 + D × D × 1 × N (2)
where P1 is the number of parameters used by the standard convolution, P2 is the number of parameters used by the depthwise separable convolution, D is the length and width of the input feature map, and N is the number of convolution kernels;
in standard convolution, the number of input channels is far smaller than the number of output channels; comparing formula (1) with formula (2) gives formula (3):
P2/P1 = (D × D × 3 + D × D × 1 × N) / (D × D × 3 × N) = 1/N + 1/3 (3)
it can be seen that P2/P1 is far less than 1: using the depthwise separable convolution greatly reduces the number of parameters while achieving an effect similar to the standard convolution;
(6) Designing RFBs structures;
dilated (hole) convolution is introduced into the structure module; the RFBs structure first applies a 1×1 convolution to the feature map for channel transformation, and then applies multi-branch dilated-convolution processing to obtain multi-scale target information features; the multi-branch structure combines ordinary convolutional layers with dilated convolutional layers: in the ordinary convolutional layers, the 3×3 convolution kernel of the original RFB structure is replaced by parallel 1×3 and 3×1 kernels, and the 5×5 kernel is replaced by two series-connected 1×3 and 3×1 kernels; the dilated convolutional layers each consist of 3 kernels of size 3×3 with dilation rates of 1, 3 and 5, respectively, which prevents degradation of the convolutional layer caused by an overly large dilation rate; finally, a concat operation is performed on the differently sized feature layers processed by the multi-branch dilated structure, and a new fused feature layer is output;
(7) A step of modified SPPNet;
inspired by the backbone feature extraction network CSPDarknet in YOLOv4, a CSP structure is introduced into the SPP: before the multi-scale features are fused, the network is divided into two parts, and one part of the features is merged directly with the SPP-fused features through a shortcut connection.
3. The method for recognizing the target of the vehicle based on the multi-scale yolo algorithm according to the claim 1 or 2, wherein: the method for performing the feature fusion by using the PANet specifically comprises the following steps:
The CSPDarknet-53 network contains a large number of convolution operations: many residual modules built from 3×3 and 1×1 convolutions are stacked, and a 3×3 convolutional layer with stride 2 is used to halve the size of the feature map. Residual denotes a residual module, and the n× to the right of the rectangular frame is the number of times the residual module is repeated. A 3×3 convolutional layer with stride 2 is used 5 times in total; each use halves the length and width of the feature map, replacing the downsampling function of a pooling layer in a conventional convolutional network. In the prediction stage, the feature layers F1, F2 and F3 obtained from the CSPDarknet-53 feature extraction network are input into the multi-scale prediction network. F3 is convolved to obtain coarse-scale feature layer 3 for detecting large-scale targets; feature layer 3 is upsampled, fused with F2, and then convolved to obtain mesoscale feature layer 2 for detecting medium-scale targets; feature layer 2 is upsampled, fused with F1, and convolved to obtain fine-scale feature layer 1 for detecting small-scale targets. This Feature Pyramid Network (FPN) structure gives the algorithm a good detection effect on targets of different sizes and scales. Finally, the prediction information of the three feature layers of different scales is combined, and the final detection result is obtained through a non-maximum suppression post-processing algorithm;
the resulting 9 prediction boxes of different sizes are (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198) and (373×326); on the 13×13 coarse-scale feature layer 3, (116×90), (156×198) and (373×326) are applied to detect larger targets; on the 26×26 mesoscale feature layer 2, the medium prediction boxes (30×61), (62×45) and (59×119) are applied to detect medium targets; on the 52×52 fine-scale feature layer 1, (10×13), (16×30) and (33×23) are applied to detect smaller targets.
CN202210806937.2A 2022-07-08 2022-07-08 Vehicle target identification method based on multi-scale yolo algorithm Pending CN115171074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210806937.2A CN115171074A (en) 2022-07-08 2022-07-08 Vehicle target identification method based on multi-scale yolo algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210806937.2A CN115171074A (en) 2022-07-08 2022-07-08 Vehicle target identification method based on multi-scale yolo algorithm

Publications (1)

Publication Number Publication Date
CN115171074A true CN115171074A (en) 2022-10-11

Family

ID=83493785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210806937.2A Pending CN115171074A (en) 2022-07-08 2022-07-08 Vehicle target identification method based on multi-scale yolo algorithm

Country Status (1)

Country Link
CN (1) CN115171074A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937703A (en) * 2022-11-30 2023-04-07 南京林业大学 Enhanced feature extraction method for remote sensing image target detection
CN115937703B (en) * 2022-11-30 2024-05-03 南京林业大学 Enhanced feature extraction method for remote sensing image target detection
CN117853891A (en) * 2024-02-21 2024-04-09 广东海洋大学 Underwater garbage target identification method capable of being integrated on underwater robot platform


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination