CN114332485A - Higher-order target detection method based on YOLOv3 model - Google Patents


Info

Publication number: CN114332485A
Authority: CN (China)
Legal status: Pending
Application number: CN202111387117.6A
Other languages: Chinese (zh)
Inventors: 邵海见, 严晨旭, 邓星
Assignee: Jiangsu University of Science and Technology
Priority date: 2021-11-22
Filing date: 2021-11-22
Publication date: 2022-04-12
Application filed by Jiangsu University of Science and Technology
Priority to CN202111387117.6A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a high-order target detection method based on a YOLOv3 model, which comprises the following steps. Step 1: preset an image data set. Step 2: design a high-distinguishability feature extraction method: extract features in the horizontal, vertical and diagonal directions of a pre-input image through a high-pass filter and a low-pass filter. Step 3: design a high-order target detection method: add a high-order calculation before upsampling on the feature layer, split the input image to obtain sub-tensors, perform layer product and addition operations on the sub-tensors, and then add an activation function and regularize. Step 4: adjust the parameters of the convolutional neural network. Step 5: extract the high-distinguishability features of the image data set with the high-distinguishability feature extraction method, import the obtained features of the image data set into a deep learning network that integrates the high-order target detection method, and complete the fitting of the image data set. The invention improves feature extraction.

Description

Higher-order target detection method based on YOLOv3 model
Technical Field
The invention relates to the technical field of deep learning target detection, in particular to a high-order target detection method based on a YOLOv3 model.
Background
The main purpose of object detection is to detect and locate a specific number of objects in a picture. The core problem of target detection is to locate and classify the content to be detected: the shape, size and position of a target appearing in the image must be determined under different conditions of the detected object, such as illumination and shading, while ensuring higher precision and shorter detection time. Traditional target detection methods are based on hand-crafted feature methods, such as the HOG features + SVM algorithm, the Haar features + AdaBoost algorithm and the DPM algorithm. These target detection algorithms use a sliding window for region selection and obtain features through manual extraction. However, this extraction is complex: if the feature extraction is poor, the training result is biased; the approach has poor portability; and the algorithm must be redesigned for each specific task. In recent years, convolutional neural networks have surpassed traditional pattern recognition and machine learning algorithms in more and more fields and achieved top-level performance and precision. The target detection performance of improved feature extraction methods based on convolutional neural networks has increased remarkably, so they have gradually become a hot topic of theoretical research and practical application in recent years, and second-order and first-order detection models have appeared in succession.
The second-order model extracts features by generating candidate regions (Region Proposal) and a convolutional neural network, then feeds the features into a classifier and corrects the positions. For example, the R-CNN target detection method proposed by Girshick in 2014 borrows the idea of the sliding window: a convolutional neural network is used on the candidate regions to extract fixed-length feature vectors, and the extracted features are then fed into a classifier for classification. Target detection results on PASCAL VOC show that mAP can be raised to 62.4%, nearly 20 percentage points higher than traditional algorithms. The subsequently improved Faster R-CNN increased the test speed by a factor of hundreds while also improving accuracy. Although such detection models basically achieve real-time detection, they still suffer from long detection times and many training parameters, mainly because they have not completely broken away from the traditional target detection paradigm.
The first-order detection model completely abandons the region-selection and classifier-classification pattern inherent to two-stage models and converts the classification problem into a regression problem, which greatly shortens detection time. The YOLO target detection model has only one training network, and its test speed on a single Titan X reaches 45 frames per second, 9 times that of Faster R-CNN. After the YOLO and YOLOv2 target detection networks, the YOLOv3 target detection network proposed by Joseph Redmon and Ali Farhadi can detect a 320 × 320 picture within 22 milliseconds, with a detection precision of 57.9% in the Titan X environment. Although the detection speed of the YOLO series of target detection networks has improved, a natural image contains various complex backgrounds, a single feature can hardly extract a salient object from such backgrounds, and there is still considerable room for improvement in large-target recognition and accuracy.
Disclosure of Invention
The invention provides a high-order target detection method based on a YOLOv3 model, aiming to solve the problems of long detection time and low accuracy of large-target recognition in the prior art.
The invention provides a high-order target detection method based on a YOLOv3 model, which comprises the following steps:
step 1: presetting an image data set;
step 2: designing a high-distinguishability feature extraction method: extracting features in the horizontal, vertical and diagonal directions of a pre-input image through a high-pass filter and a low-pass filter;
step 3: designing a high-order target detection method: adding a high-order calculation before upsampling on the feature layer, splitting the input image to obtain sub-tensors, performing layer product and addition operations on the sub-tensors, and then adding an activation function and regularizing;
step 4: adjusting the parameters of the convolutional neural network;
step 5: extracting the high-distinguishability features of the image data set using the high-distinguishability feature extraction method, importing the obtained features of the image data set into a deep learning network that integrates the high-order target detection method, and completing the fitting of the image data set.
Further, the step 2 includes the following processes:
step 2.1: obtaining the vector space V_i from the Haar scale function:

V_i = { s^{j/2} φ(s^j x − k) | j, k ∈ Z }

where s^j is a scale variable, a change of j shrinking or enlarging the space, and k is a translation variable, a change of k affecting the position of the space on the x-axis;
step 2.2: obtaining the vector space W_i from the Haar wavelet function:

W_i = { s^{j/2} Ψ(s^j x − k) | j, k ∈ Z }

where W_i represents the vector space formed by the wavelet function of the i-th order, s^j is a scale variable, a change of j shrinking or expanding the space, and k is a translation variable, a change of k affecting the position of the space on the x-axis;
Step 2.3: expressing the vector space V_i through the vector spaces W_i:

V_n = V_0 ⊕ W_0 ⊕ W_1 ⊕ … ⊕ W_{n−1}    (1)

where ⊕ denotes the direct sum of mutually orthogonal subspaces;
Step 2.4: the features in the horizontal, vertical and diagonal directions of the pre-input image are extracted by formula (1) obtained in step 2.3.
Further, the step 3 includes the following processes:
step 3.1: multiplying the feature matrices element by element: after three downsamplings, the Darknet53 backbone network outputs features of three shapes; before the first two features are upsampled, they are halved and copied into three parts, and the three feature matrices are multiplied element by element through successive layer multiplication operations, as shown in the following formula:

f(x) = Σ_{r=1}^{R} ⟨w_r, x^{⊗r}⟩

where w_r represents the r-th feature vector, x is the local feature of the corresponding vector, and x^{⊗r} = x ⊗ x ⊗ … ⊗ x, the r-order auto-outer product of x, can represent the interaction of high-order features;
step 3.2: feature layer nonlinear activation: activating f(x) by using a sigmoid function, and outputting the processed features through a residual unit:

F(x) = sigmoid(f(x)) + x

where f(x) is the result obtained by the high-order calculation and F(x) is transmitted to the next layer of the convolutional neural network.
Further, in step 3.1, the three shapes are (13, 13, 1024), (26, 26, 768) and (52, 52, 128), respectively.
Further, the step 4 comprises: setting the prior boxes: according to the clustering algorithm, 9 prior boxes of different sizes are chosen: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326); the smaller prior boxes (10×13), (16×30), (33×23) are used on the 52 × 52 feature map, the medium prior boxes (30×61), (62×45), (59×119) on the 26 × 26 feature map, and the large prior boxes (116×90), (156×198), (373×326) on the 13 × 13 feature map.
Further, the step 4 further includes: setting the other parameters: the GPU memory fraction is set to 0.7 so that the video memory is not fully occupied, and memory is allocated on demand using the BFC algorithm; batch_size is set to 8, the learning rate to 1e-4, and 500 batches are trained.
The invention has the beneficial effects that:
(1) The images are preprocessed with Haar wavelets: the original image is stripped layer by layer through a high-pass filter and a low-pass filter, yielding the high-frequency and low-frequency features of the image at different resolutions, and these features are integrated into the training data set, which markedly improves feature extraction.
(2) A high-order calculation is added before upsampling on the feature layer: the input image is split into sub-tensors, layer product and addition operations are performed on the sub-tensors, and an activation function and regularization are added after each layer, improving the overall expressive capacity of the network. Through the interaction of high-order information, the high-order calculation further improves the resolution of the image and compensates for the loss of spatial detail of the target object in the original target detection network.
(3) The Haar wavelet transform and the high-order calculation complement each other: the high-frequency and low-frequency features obtained by the wavelet transform at different resolutions are amplified with the help of the high-order calculation, so that detail features receive more attention during training.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the invention in any way, and in which:
FIG. 1A is a flow chart of wavelet decomposition in an embodiment of the present invention;
FIG. 1B is a sample diagram of wavelet decomposition in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a convolutional neural network structure incorporating a high-order target detection calculation method and wavelet transformation according to the present invention;
FIG. 3 is a pictorial view of a practical implementation of the present invention;
FIG. 4 is a polar pie chart of the accuracy of each model over each class in the dataset.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a high-order target detection method based on a YOLOv3 model, which comprises the following steps:
step 1: data set preparation: the data sets used in the present invention are VOC2007 and VOC2012. The VOC data set is divided into 4 general categories: vehicle, household, animal, person, which are subdivided into 20 subclasses in total.
Step 1.1: selecting the training and validation data set: the data used for training and validation is the trainval set of PASCAL VOC2007 and PASCAL VOC2012, 90% of which is used for training and 10% for testing; it contains 16551 images (5011 from VOC2007 and 11540 from VOC2012) and 40058 objects (12608 from VOC2007 and 27450 from VOC2012).
Step 1.2: selecting the test data set: the test set of VOC2007, with 4952 pictures and 12032 test objects in total, is selected.
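For illustration, the selection in step 1.1 can be sketched as follows. This is a minimal sketch assuming the standard VOC directory layout for the image-ID list files; the function name, random shuffle and fixed seed are illustrative assumptions, not a procedure fixed by the invention:

```python
# A minimal sketch of the 90/10 split over the combined VOC2007 + VOC2012
# trainval image IDs. Paths follow the standard VOC layout; the function
# name and random shuffle are illustrative assumptions.
import random

def split_trainval(id_files=('VOC2007/ImageSets/Main/trainval.txt',
                             'VOC2012/ImageSets/Main/trainval.txt'),
                   train_ratio=0.9, seed=0):
    ids = []
    for path in id_files:
        with open(path) as f:
            ids.extend(line.strip() for line in f if line.strip())
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]  # 90% for training, 10% held out
```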
As shown in fig. 1A, 1B;
step 2: design of the high-distinguishability feature extraction method: features in the horizontal, vertical and diagonal directions of the pre-input image are extracted through a high-pass filter and a low-pass filter, which solves the problem that most of the low-frequency and high-frequency information is lost when the pictures in the data set are fed directly into a deep learning network for training.
Step 2.1: haar wavelets have a strong multi-resolution decomposition capability, and the idea is to accurately represent the original signal using a scale function and a wavelet function. The Haar scale function is expressed as:
Figure BDA0003367486060000051
the vector space V formed by the scale function in layer-by-layer decompositioniComprises the following steps:
Vi=sj/2φ(sjx-k)|i,k∈Z (2)
wherein s isjIs a scale variable, a change in j will shrink or expand the space, k is a translation variable, and a change in k will affect the position of the space on the x-axis.
Step 2.2: determining the vector space W_i: the Haar wavelet function is expressed as:

Ψ(x) = φ(2x) − φ(2x − 1)    (3)

The vector space W_i formed by the wavelet function in the layer-by-layer decomposition is:

W_i = { s^{j/2} Ψ(s^j x − k) | j, k ∈ Z }    (4)

where W_i represents the vector space formed by the wavelet function of the i-th order, s^j is a scale variable, a change of j shrinking or expanding the space, and k is a translation variable, a change of k affecting the position of the space on the x-axis.
Step 2.3: the vector space V_i and the vector space W_i orthogonally express the feature information of the image at different resolutions: V_i and W_i are orthogonal, so the vector spaces W_i can represent the vector space V_i. Recursive expansion of the above formulas gives:

V_n = V_0 ⊕ W_0 ⊕ W_1 ⊕ … ⊕ W_{n−1}    (5)

where ⊕ denotes the direct sum of mutually orthogonal subspaces. It follows that the higher the number of decomposition stages, the more information is included. In the data preprocessing of Mul-YOLO, in order to fully extract the high-frequency and low-frequency information of each picture, the Haar wavelet strips the picture information layer by layer through a low-pass filter and a high-pass filter; after decomposition, the transformed feature parameters of the original image at different wavelet scales are obtained, yielding the low-frequency features and the high-frequency features in the horizontal, vertical and diagonal directions of a picture.
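As a concrete illustration of steps 2.1 to 2.3, one level of a discrete two-dimensional Haar transform yields exactly one low-frequency approximation and the three directional high-frequency sub-bands described above. The following is a minimal sketch assuming the PyWavelets package (pywt); the function and variable names are illustrative:

```python
# Layer-by-layer Haar stripping of an image into a low-frequency approximation
# and horizontal/vertical/diagonal high-frequency details.
import numpy as np
import pywt

def haar_features(image: np.ndarray, levels: int = 2):
    approx = image.astype(np.float64)
    details = []
    for _ in range(levels):
        # dwt2 applies the low-pass (scale) and high-pass (wavelet) Haar
        # filters along rows and columns, halving the resolution and yielding
        # one approximation and three directional detail sub-bands.
        approx, (horiz, vert, diag) = pywt.dwt2(approx, 'haar')
        details.append({'horizontal': horiz, 'vertical': vert, 'diagonal': diag})
    return approx, details
```

The detail coefficients collected at each level correspond to the horizontal, vertical and diagonal features that are integrated into the training data set.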
Step 3: designing the high-order target detection method: a high-order calculation is added before upsampling on the feature layer, the input image is split to obtain sub-tensors, layer product and addition operations are performed on the sub-tensors, and an activation function and regularization are added after each layer, improving the overall expressive capacity of the network. Through the interaction of high-order information, the high-order calculation further improves the resolution of the image and compensates for the loss of spatial detail of the target object in the original target detection network.
Step 3.1: multiplying the feature matrices element by element: the main improvement of Mul-YOLO concerns image resolution, and adding the high-order calculation to the original model structure can effectively increase the global resolution. After three downsamplings, the Darknet53 backbone network outputs features whose three shapes are (13, 13, 1024), (26, 26, 768) and (52, 52, 128), respectively. Before upsampling, half of each of the first two features is intercepted and copied into three parts, and the three feature matrices are multiplied element by element through successive layer multiplication operations, as shown in the following formula:

f(x) = Σ_{r=1}^{R} ⟨w_r, x^{⊗r}⟩    (6)

where w_r represents the r-th feature vector, x is the local feature of the corresponding vector, and x^{⊗r} = x ⊗ x ⊗ … ⊗ x, the r-order auto-outer product of x, can represent the interaction of higher-order features.
Step 3.2: feature layer nonlinear activation: f(x) is activated by a sigmoid function, and the processed features are output through a residual unit:

F(x) = sigmoid(f(x)) + x    (7)

where f(x) is the result obtained by the high-order calculation and F(x) is transmitted to the next layer of the convolutional neural network.
The whole high-order calculation process performs three layer multiplications and one layer addition, amplifying each pixel of the feature matrix. In the original YOLOv3 model, the output features are upsampled, spliced with the upper-layer features and then passed through several consecutive convolutions, which loses high-order spatial information and positional detail of the image; this loss greatly affects the accuracy of the final detection result.
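A minimal PyTorch sketch of this block is given below, under the assumption that the continuous layer multiplication denotes the element-wise product of three copies of the channel-halved feature map, followed by regularization and the sigmoid residual of formula (7); the module and its names are illustrative, not the patent's exact implementation:

```python
# Sketch of the high-order block: halve channels, copy three times, multiply
# element by element, regularize, then apply the sigmoid residual unit.
import torch
import torch.nn as nn

class HighOrderBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution halves the channel count ("halved and copied").
        self.reduce = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.norm = nn.BatchNorm2d(channels // 2)  # regularization after the product

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = self.reduce(x)
        f = t * t * t          # three copies multiplied element by element
        f = self.norm(f)
        # Residual unit F = sigmoid(f) + input; the channel-halved copy t is
        # used as the residual term so the tensor shapes of the summands match.
        return torch.sigmoid(f) + t
```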
As shown in fig. 2, Darknet-53 is used as the backbone network. Darknet-53 comprises 52 convolutional layers and 1 fully-connected layer; the 52 convolutional layers consist of 1 convolutional layer with 32 channels and 5 repeated residual units. Each of the 5 residual units has 1 independent convolutional layer and a group of repeated convolutional layers, repeated 1, 2, 8, 8 and 4 times respectively, where each repetition contains one 1 × 1 convolution (channel number halved) and one 3 × 3 convolution (channel number restored). The input image size is 416 × 416; since the first independent convolutional layer of each residual unit has stride 2, the three output feature layers are downsampled by factors of 8, 16 and 32, taking the input from 416 × 416 to 52 × 52, 26 × 26 and 13 × 13. The high-order target detection method is added before upsampling on the feature layers of sizes 13 × 13 and 26 × 26, so that the multi-scale features of the image information can be fused.
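The repeated convolution pair with its skip connection can be sketched as a single residual unit; the following simplified PyTorch rendering assumes only the structure stated above (a 1 × 1 convolution halving the channels, a 3 × 3 convolution restoring them, and an additive skip), not the exact Darknet-53 implementation:

```python
# One Darknet-53-style residual unit: 1x1 convolution halves the channels,
# 3x3 convolution restores them, and a skip connection adds the input back.
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.conv1 = nn.Conv2d(channels, half, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(half)
        self.conv2 = nn.Conv2d(half, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.act(self.bn1(self.conv1(x)))   # 1x1, channels halved
        y = self.act(self.bn2(self.conv2(y)))   # 3x3, channels restored
        return x + y                            # skip connection
```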
Step 4: adjusting the parameters of the convolutional neural network: before training, appropriate hyper-parameters are set according to factors such as the data volume, the data distribution characteristics and the hardware conditions, and the necessary data are recorded during training.
Step 4.1: setting the prior boxes: according to the clustering algorithm, 9 prior boxes of different sizes are chosen: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326). The principle of using larger prior boxes on smaller feature maps is followed, because a smaller feature map has a larger receptive field: the smaller prior boxes (10×13), (16×30), (33×23) are used on the 52 × 52 feature map, the medium prior boxes (30×61), (62×45), (59×119) on the 26 × 26 feature map, and the larger prior boxes (116×90), (156×198), (373×326) on the 13 × 13 feature map.
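The patent does not spell the clustering algorithm out; a common reading, and the recipe that yields prior boxes of this kind in the YOLO literature, is k-means over the ground-truth (width, height) pairs with a 1 − IoU distance. The sketch below makes that assumption, and its function names are illustrative:

```python
# k-means over (width, height) pairs with IoU as the similarity measure,
# a sketch of how 9 prior boxes can be derived from the training labels.
import numpy as np

def iou_wh(boxes: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    # IoU computed from widths and heights only (boxes aligned at the origin).
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] \
          + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes: np.ndarray, k: int = 9, iters: int = 100) -> np.ndarray:
    boxes = boxes.astype(np.float64)
    rng = np.random.default_rng(0)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # highest IoU wins
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)
    # sort small to large so the first three go to the 52 x 52 map, and so on
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```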
Step 4.2: setting the other parameters: the GPU memory fraction is set to 0.7 so that the video memory is not fully occupied, and memory is allocated on demand using the BFC algorithm; batch_size is set to 8, the learning rate to 1e-4, and 500 batches are trained.
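The memory settings of step 4.2 match TensorFlow 1.x session options (TensorFlow's GPU allocator implements the BFC algorithm); the following sketch shows how such a configuration is typically written, assuming a TF 1.x-style training script:

```python
# GPU memory capped at 70% and allocated on demand; the remaining
# hyper-parameters are those stated in step 4.2.
import tensorflow as tf

config = tf.compat.v1.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7  # do not occupy all video memory
config.gpu_options.allow_growth = True                    # allocate as needed (BFC allocator)
session = tf.compat.v1.Session(config=config)

BATCH_SIZE = 8
LEARNING_RATE = 1e-4
NUM_BATCHES = 500
```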
Step 5: the image is preprocessed with Haar wavelets to obtain the low-frequency and high-frequency information in the horizontal, vertical and diagonal directions, and the processed data set is fed into the deep learning network integrating the high-order target detection method; learning the feature information of the image at different resolutions enhances the generalization capability of the detection model and improves the target detection accuracy.
As shown in fig. 3, the method exhibits good classification and localization performance.
Fig. 4 is a polar pie chart of the detection accuracy for the 20 object classes in the 4952 pictures of the PASCAL VOC2007 test data set, comparing the high-order target detection method based on the YOLOv3 model provided by the invention with 6 other existing target detection methods; the minimum value is 0, the maximum value is 100, with intervals of 10 percentage points from the center point outward. It can be seen that the blue area covered by Mul-YOLO is larger than that of the other models, so its comprehensive effect is better.
The Mul-YOLO provided by the invention is superior to the prior art in both detection accuracy and detection efficiency, as shown in Table 1 below, which compares Mul-YOLO and the other models on mAP and FPS. Although its detection speed is 2 ms slower than that of the original YOLOv3, Mul-YOLO still has an advantage over the other models and improves mAP considerably, so it performs well in both precision and speed.
[Table 1: comparison of Mul-YOLO and the other models on mAP and FPS; the table is reproduced as an image in the original document]
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (6)

1. A high-order target detection method based on a YOLOv3 model, characterized by comprising the following steps:
step 1: presetting an image data set;
step 2: designing a high-distinguishability feature extraction method: extracting features in the horizontal, vertical and diagonal directions of a pre-input image through a high-pass filter and a low-pass filter;
step 3: designing a high-order target detection method: adding a high-order calculation before upsampling on the feature layer, splitting the input image to obtain sub-tensors, performing layer product and addition operations on the sub-tensors, and then adding an activation function and regularizing;
step 4: adjusting the parameters of the convolutional neural network;
step 5: extracting the high-distinguishability features of the image data set using the high-distinguishability feature extraction method, importing the obtained features of the image data set into a deep learning network that integrates the high-order target detection method, and completing the fitting of the image data set.
2. The higher-order target detection method based on the YOLOv3 model of claim 1, wherein the step 2 comprises the following processes:
step 2.1: obtaining the vector space V_i from the Haar scale function:

V_i = { s^{j/2} φ(s^j x − k) | j, k ∈ Z }

where s^j is a scale variable, a change of j shrinking or enlarging the space, and k is a translation variable, a change of k affecting the position of the space on the x-axis;
step 2.2: obtaining the vector space W_i from the Haar wavelet function:

W_i = { s^{j/2} Ψ(s^j x − k) | j, k ∈ Z }

where W_i represents the vector space formed by the wavelet function of the i-th order, s^j is a scale variable, a change of j shrinking or expanding the space, and k is a translation variable, a change of k affecting the position of the space on the x-axis;
Step 2.3: expressing the vector space V_i through the vector spaces W_i:

V_n = V_0 ⊕ W_0 ⊕ W_1 ⊕ … ⊕ W_{n−1}    (1)

where ⊕ denotes the direct sum of mutually orthogonal subspaces;
Step 2.4: the features in the horizontal, vertical and diagonal directions of the pre-input image are extracted by formula (1) obtained in step 2.3.
3. The higher-order target detection method based on the YOLOv3 model of claim 1, wherein the step 3 comprises the following processes:
step 3.1: multiplying the feature matrices element by element: after three downsamplings, the Darknet53 backbone network outputs features of three shapes; before the first two features are upsampled, they are halved and copied into three parts, and the three feature matrices are multiplied element by element through successive layer multiplication operations, as shown in the following formula:

f(x) = Σ_{r=1}^{R} ⟨w_r, x^{⊗r}⟩

where w_r represents the r-th feature vector, x is the local feature of the corresponding vector, and x^{⊗r} = x ⊗ x ⊗ … ⊗ x, the r-order auto-outer product of x, can represent the interaction of high-order features;
step 3.2: feature layer nonlinear activation: activating f(x) by using a sigmoid function, and outputting the processed features through a residual unit:

F(x) = sigmoid(f(x)) + x

where f(x) is the result obtained by the high-order calculation and F(x) is transmitted to the next layer of the convolutional neural network.
4. The higher-order target detection method based on the YOLOv3 model of claim 3, wherein in the step 3.1, the three shapes are (13, 13, 1024), (26, 26, 768) and (52, 52, 128), respectively.
5. The higher-order target detection method based on the YOLOv3 model of claim 1, wherein the step 4 comprises: setting the prior boxes: according to the clustering algorithm, 9 prior boxes of different sizes are chosen: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326); the smaller prior boxes (10×13), (16×30), (33×23) are used on the 52 × 52 feature map, the medium prior boxes (30×61), (62×45), (59×119) on the 26 × 26 feature map, and the large prior boxes (116×90), (156×198), (373×326) on the 13 × 13 feature map.
6. The higher-order target detection method based on the YOLOv3 model of claim 5, wherein the step 4 further comprises: setting the other parameters: the GPU memory fraction is set to 0.7 so that the video memory is not fully occupied, and memory is allocated on demand using the BFC algorithm; batch_size is set to 8, the learning rate to 1e-4, and 500 batches are trained.
Application CN202111387117.6A (priority date 2021-11-22, filed 2021-11-22): Higher-order target detection method based on YOLOv3 model. Status: Pending. Publication: CN114332485A (en).

Publications (1)

CN114332485A (en), published 2022-04-12



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination