CN113077496A - Real-time vehicle detection and tracking method and system based on lightweight YOLOv3 and medium - Google Patents
Real-time vehicle detection and tracking method and system based on lightweight YOLOv3 and medium
- Publication number
- CN113077496A (application number CN202110413744.6A)
- Authority
- CN
- China
- Prior art keywords
- tracking
- algorithm
- vehicle
- vehicles
- lightweight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/277—Analysis of motion involving stochastic approaches, e.g. using Kalman filters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
Abstract
The invention provides a real-time vehicle detection and tracking method based on lightweight YOLOv3, which comprises the following steps: step 1: detecting the vehicles in the traffic video by adopting a lightweight YOLOv3-based algorithm, and marking the positions of the vehicles in the video with prior frames; step 2: tracking the vehicle position at the next moment by applying a Kalman filtering algorithm; step 3: on the basis of Kalman filtering algorithm tracking, determining the unique tag ID of each detected target by using the Hungarian matching algorithm, realizing accurate positioning and tracking of multiple targets. The whole scheme of the invention has strong robustness and a low miss rate, is easy to extend to various vehicle categories, and meets the requirements of vehicle detection and continuous tracking in surveillance video.
Description
Technical Field
The invention relates to the technical field of deep learning, in particular to a real-time vehicle detection and tracking method and system based on lightweight YOLOv3 and a medium.
Background
With the rapid growth in the number of automobiles, many researchers have intensively studied advanced driver-assistance systems (ADAS), and vehicle detection has become a key topic in ADAS research. Effective detection and tracking of vehicles on the road ahead is an essential component of the judgment and early-warning functions of a safety-assisted driving system, and miniaturization of the detection model is a prerequisite for fast, real-time operation on vehicle-mounted embedded devices.
In target detection algorithms based on traditional methods, the classifier built on a hand-crafted feature extractor generalizes poorly: different features must be designed and selected for different scenes, reasonable feature extraction is difficult, the computational complexity is high, and practical application is limited.
Detection algorithms based on deep learning fall into two classes: region-based methods and regression-based methods. Region-based methods generate candidate regions with a selective-search algorithm and then classify them with a convolutional neural network; the main methods include R-CNN, Fast R-CNN and the like. Region-based methods detect in two steps and achieve high detection accuracy, but their networks are complex and their detection speed is low. Regression methods such as SSD and YOLO treat target detection as a regression problem and directly regress the object class probability and coordinate position. However, their network structures are still large, so porting and deploying them to in-vehicle embedded devices suffers from low running speed, high deployment cost and other drawbacks.
Disclosure of Invention
In view of the defects in the prior art, the present invention aims to provide a real-time vehicle detection and tracking method and system and medium based on lightweight YOLOv 3.
The invention provides a real-time vehicle detection and tracking method based on lightweight YOLOv3, which comprises the following steps:
step 1: detecting the vehicles in the traffic video by adopting a lightweight YOLOv 3-based algorithm, and marking the positions of the vehicles in the video by using a priori frames;
step 2: tracking the vehicle position at the next moment by applying a Kalman filtering algorithm;
step 3: on the basis of Kalman filtering algorithm tracking, the unique tag ID of the detected target is determined by using the Hungarian matching algorithm, and accurate positioning and tracking of a plurality of targets are realized.
Preferably, the backbone network of the lightweight YOLOv3 algorithm employs 7 convolutional layers, and in the residual network structure of each convolutional layer, fewer repeated residual units are employed.
Preferably, the step 1 comprises the steps of:
step 1.1: performing K-means++ clustering on the vehicle frames in the training set, selecting three types of vehicle frames with different sizes, selecting three vehicle frames with different shapes in each type of size, and taking nine vehicle frame shapes as prior frames;
step 1.2: acquiring a single-frame video image in a traffic video;
step 1.3: and predicting the coordinates of the candidate vehicles by adopting a target detection algorithm based on the lightweight YOLOv3, and framing all the candidate vehicles by using a priori frames according with the sizes of the vehicles.
Preferably, in step 2, a kalman filter algorithm is used to predict the position of the frame at the next time, and the state of the kalman filter is updated.
The invention also provides a real-time vehicle detection and tracking system based on the lightweight YOLOv3, which comprises the following modules:
module M1: detecting vehicles in the traffic video by adopting a lightweight YOLOv 3-based algorithm, and marking the positions of the vehicles by using a priori frame;
module M2: tracking the vehicle position at the next moment by applying a Kalman filtering algorithm;
module M3: on the basis of Kalman filtering algorithm tracking, the unique tag ID of the detected target is determined by using the Hungarian matching algorithm, and accurate positioning and tracking of a plurality of targets are realized.
Preferably, the backbone network of the lightweight YOLOv3 algorithm employs 7 convolutional layers, and in the residual network structure of each convolutional layer, fewer repeated residual units are employed.
Preferably, the module M1 includes the following modules:
module M1.1: performing K-means++ clustering on the vehicle frames in the training set, selecting three types of vehicle frames with different sizes, selecting three vehicle frames with different shapes in each type of size, and taking nine vehicle frame shapes as prior frames;
module M1.2: acquiring a single-frame video image in a traffic video;
module M1.3: and predicting the coordinates of the candidate vehicles by adopting a target detection algorithm based on the lightweight YOLOv3, and framing all the candidate vehicles by using a priori frames according with the sizes of the vehicles.
Preferably, the module M2 uses a kalman filter algorithm to predict the position of the box at the next time, and updates the state of the kalman filter.
The invention also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as set forth above.
Compared with the prior art, the invention has the following beneficial effects:
1. according to the invention, the lightweight YOLOv3 algorithm is used for vehicle detection; vehicles of different sizes can be detected and their positions framed, features do not need to be designed manually, the feature-selection process is omitted, the extracted features are of better quality, the method is more robust in complex scenes, and the miss rate is low;
2. the invention improves the traditional YOLOv3 algorithm; with the average accuracy basically unchanged, the network parameters are reduced, the network model shrinks to 1/4 of the original YOLOv3, and the detection speed is doubled;
3. the method adopts a Kalman filtering algorithm to track the vehicle position at the next moment after target detection;
4. on the basis of adopting a Kalman filtering algorithm, a Hungarian matching algorithm is utilized to carry out association matching on vehicles in adjacent frames of a video, the unique label ID of a detected target is determined, accurate positioning and tracking of a plurality of targets are realized, and the unstable detection conditions such as detection discontinuity, omission, target occlusion and the like are improved;
5. the detection tracking algorithm has stronger robustness in a complex road environment, and can meet the requirements on the precision and the speed of vehicle detection tracking in the actual intelligent driving process.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a neural network architecture of the present invention;
FIG. 3 is a diagram of the residual network architecture of the present invention;
FIG. 4 is a diagram illustrating the effect of the present invention on the KITTI training set;
FIG. 5 is a diagram of the effect of the model of the present invention on the validation set.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Referring to fig. 1, a real-time vehicle detection and tracking method in a traffic video is divided into a vehicle detection module and a vehicle tracking module.
Referring to fig. 3, in the vehicle detection stage a target detection algorithm based on lightweight YOLOv3 is proposed. The algorithm improves on the traditional YOLOv3 network structure: Darknet-53 is not used as the backbone network; instead, the design borrows from the YOLOv3-tiny structure, which is similar to Darknet-19. The backbone network uses 7 convolutional layers, and in the residual network structure attached to each convolutional layer, fewer repeated residual units are used, so the network depth is reduced by cutting convolutional layers. For the prior-frame sizes, a modified K-means++ algorithm selects suitable sizes. On the final network output, logistic regression is used so that the output prediction boxes better cover the vehicles to be detected.
The improved lightweight YOLOv3 network structure is shown in fig. 2 and can be divided into a backbone network and a detection head. First, the backbone network extracts features from the image; its input size is 416 × 416 × 3, and a convolutional layer with 16 convolution kernels of size 3 × 3 performs the initial feature extraction. The convolution formula for feature extraction is:
a_{i,j} = \sum_c \sum_m \sum_n w_{c,m,n} \, x_{c,i+m,j+n} + w_b,
where a_{i,j} is the value at coordinate (i, j) in the feature map; w_{c,m,n} is the kernel value at coordinate (m, n) of channel c of the convolution kernel; x_{c,i+m,j+n} is the input value at coordinate (i + m, j + n) of channel c; and w_b is the bias term of the convolution kernel.
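As an illustrative sketch (not part of the patent text), the convolution sum above can be evaluated at a single output position in NumPy; the toy input and kernel values are assumptions for the check:

```python
import numpy as np

def conv_at(x, w, w_b, i, j):
    """Value a_{i,j} of the feature map: sum over channels c and kernel
    offsets (m, n) of w[c, m, n] * x[c, i+m, j+n], plus bias w_b."""
    C, K, _ = w.shape
    return sum(
        w[c, m, n] * x[c, i + m, j + n]
        for c in range(C) for m in range(K) for n in range(K)
    ) + w_b

# Tiny check: 1 channel, 2x2 kernel of ones over a 3x3 input of ones -> 4 + bias
x = np.ones((1, 3, 3))
w = np.ones((1, 2, 2))
print(conv_at(x, w, 0.5, 0, 0))  # 4.5
```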
The output of the previous feature-extraction layer is used as the input of the next layer; filtering is then applied with 32 3 × 3 convolution kernels with stride 2, and a residual block is used to strengthen the network's feature extraction. As shown in fig. 3, the residual block connects 16 1 × 1 convolution kernels and 32 3 × 3 convolution kernels, both with stride 1. The residual network formula is:
out=f2[f1(in)]+in,
where in is the input, f_1 is the 1 × 1 convolutional layer, f_2 is the 3 × 3 convolutional layer, and out is the output.
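A minimal sketch of the skip connection, assuming simple stand-in callables in place of the actual 1 × 1 and 3 × 3 convolutions (this illustrates only the out = f2(f1(in)) + in structure, not the patent's layer weights):

```python
import numpy as np

def residual_block(x, f1, f2):
    """out = f2(f1(x)) + x  -- the skip connection from the formula above.
    f1 stands in for the 1x1 conv, f2 for the 3x3 conv (both hypothetical)."""
    return f2(f1(x)) + x

x = np.array([1.0, 2.0, 3.0])
double = lambda v: 2 * v      # stand-in for the 1x1 conv
add_one = lambda v: v + 1     # stand-in for the 3x3 conv
print(residual_block(x, double, add_one))  # elementwise: 4, 7, 10
```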
Following the same principle, the subsequent network stages are filtered in turn with 64, 128, 256, 512 and 1024 3 × 3 convolution kernels with stride 2, so the width and height of the feature-extraction layers shrink while the depth grows, and deeper features can be extracted. Meanwhile, 2, 4 and 4 residual blocks are attached in turn after the first of these subsequent convolution stages to strengthen feature extraction. Compared with YOLOv3, a large number of network layers are removed, the network structure is miniaturized, and both training and network forward inference are accelerated.
Referring to fig. 4 and 5, next is a detection head for performing target detection on a feature map given by the backbone network and giving the position of a detection frame.
The detection head separately extracts 3 feature-extraction convolutional layers, with sizes 52 × 52, 26 × 26 and 13 × 13 and depths 128, 256 and 1024 respectively. The 13 × 13 feature layer is convolved in turn with 512 1 × 1 convolution kernels, 1024 3 × 3 convolution kernels, and 18 1 × 1 convolution kernels, so that the output of this feature layer is 13 × 13 × 18. Each 1 × 1 × 18 vector is responsible for judging whether the object in its receptive field is a vehicle, and if so, the center coordinates and size of the prediction box are given at the same time.
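As a hedged shape check (the split of the 18 channels into 3 anchors × 6 values, i.e. 4 box values, 1 objectness score and 1 class score, is an assumption consistent with the single vehicle class described above):

```python
import numpy as np

# Hypothetical head output: 18 = 3 anchors x (4 box values + 1 objectness + 1 class)
head = np.zeros((13, 13, 18))
per_anchor = head.reshape(13, 13, 3, 6)   # one 6-vector per anchor per cell
print(per_anchor.shape)  # (13, 13, 3, 6)
```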
The 13 × 13 feature extraction layer does not only output its own result: the 13 × 13 features are also enlarged to 26 × 26 by upsampling, the upsampled features are concatenated with the original 26 × 26 feature layer, and the fused 26 × 26 layer is convolved and output simultaneously. The original 26 × 26 layer is likewise connected with the 52 × 52 layer for output in the same way, so as to obtain multi-scale target detection.
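The upsample-and-concatenate fusion can be sketched in NumPy as a shape check (the channel depths here are assumptions taken from the layer depths mentioned above, and nearest-neighbour upsampling is assumed):

```python
import numpy as np

# 13x13 feature map (128 channels, hypothetical) upsampled 2x and concatenated
# with the original 26x26 layer (256 channels) along the channel axis.
small = np.zeros((13, 13, 128))
up = small.repeat(2, axis=0).repeat(2, axis=1)   # nearest-neighbour upsample to 26x26
mid = np.zeros((26, 26, 256))
fused = np.concatenate([up, mid], axis=-1)
print(fused.shape)  # (26, 26, 384)
```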
For the size of the output prediction boxes, a clustering algorithm selects suitable prior-frame sizes from the training set to predict vehicles at different scales in different videos. The invention uses the K-means++ algorithm to compute the clusters; K-means++ improves the initial-point selection of the K-means algorithm so that the cluster centers are as far apart as possible. The key steps of the K-means++ algorithm are as follows:
Step S1: randomly take a sample from the data set as the initial clustering center u_1;
Step S2: calculate the distance D(x_i) of each sample x_i in the data set X from its nearest cluster center:
D(x_i) = min_r ||x_i − u_r||;
Step S3: calculate the probability of each sample being selected as the next cluster center, P(x_i) = D(x_i)² / Σ_{x∈X} D(x)², and select the next clustering center accordingly;
Step S4: repeat steps S2 and S3 until k clustering centers have been selected.
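The seeding steps S1 to S4 can be sketched as follows; this is an illustrative 1-D version with an assumed fixed random seed, not the patent's implementation (which clusters 2-D box shapes):

```python
import random

def kmeanspp_centers(points, k, rng=random.Random(0)):
    """K-means++ seeding: first center uniform at random (S1); each next
    center drawn with probability proportional to D(x)^2 (S2, S3),
    repeated until k centers are chosen (S4). Fixed seed for reproducibility."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]  # D(x)^2
        total = sum(d2)
        r, acc = rng.random() * total, 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:          # weighted sampling by D(x)^2
                centers.append(p)
                break
    return centers

centers = kmeanspp_centers([0.0, 0.1, 5.0, 5.1, 10.0], 3)
print(len(centers))  # 3
```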
After the 9 prior frames are obtained from the cluster centers, the larger 52 × 52 feature-extraction convolutional layer uses the 3 smallest prior frames and has the smallest receptive field, the middle 26 × 26 layer uses the 3 middle prior frames, and the smaller 13 × 13 layer uses the 3 largest prior frames and has the largest receptive field.
After the neural network outputs the prediction frame, in order to enable the prediction frame to better cover the detected target, a logistic regression function is used for carrying out confidence regression on each prior frame on different scales, the frame value of the object is predicted, and the most appropriate target category area is selected according to the confidence. The logistic regression function prediction formula is:
b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w e^{t_w}, b_h = p_h e^{t_h},
where c_x, c_y are the coordinate offsets of the grid cell within the image; p_w, p_h are the side lengths (width and height) of the prior frame; t_x, t_y, t_w, t_h are the target values learned by the deep network; and b_x, b_y, b_w, b_h are the coordinate values of the prediction box finally computed by the formula.
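A minimal sketch of this box decoding, with illustrative grid and prior values (the function name and inputs are assumptions for the demonstration):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv3-style decoding: sigmoid-squashed center offsets added to the
    grid cell position; prior size scaled by exp of the predicted log-ratios."""
    sig = lambda t: 1.0 / (1.0 + math.exp(-t))
    bx = sig(tx) + cx
    by = sig(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# Zero logits land the center in the middle of cell (3, 4), prior unscaled.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, pw=10, ph=20))
# (3.5, 4.5, 10.0, 20.0)
```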
In a vehicle tracking module, the invention uses Kalman filtering algorithm and Hungarian algorithm to realize the accurate positioning and tracking of a plurality of targets.
And calculating the centroid coordinate of the frame according to the prior frame detected by the lightweight YOLOv3 target detection algorithm, predicting the position of the frame at the next moment by using a Kalman filtering algorithm, and updating the state of a Kalman filter.
The Kalman filtering algorithm predicts the coordinate position of the target at the current time from the coordinate position of the vehicle detected at the previous time. First, the centroid coordinates (x, y) of the detected object are computed from the vehicle box given by the deep-learning detection algorithm and expressed as the current target state X_{t|t}; X_{t-1|t-1} is the target state at the previous time, X_{t|t-1} is the predicted target state at the current time, the observation Z_t is the centroid coordinate of the actually detected object, P_{t|t} is the estimation-error covariance at the current time, and P_{t|t-1} is the estimation-error covariance at the current time predicted from the previous time. A is the state-transition matrix, H is the observation matrix, K_t is the Kalman gain matrix, W_{t-1|t-1} is the excitation (process) noise at the previous time, and Q, R are the covariance matrices of the excitation noise and the observation noise respectively. The collected Kalman filtering and tracking formulas are:
X_{t|t-1} = A X_{t-1|t-1} + W_{t-1|t-1}, (1)
P_{t|t-1} = A P_{t-1|t-1} Aᵀ + Q, (2)
X_{t|t} = X_{t|t-1} + K_t (Z_t − H X_{t|t-1}), (3)
P_{t|t} = P_{t|t-1} − K_t H P_{t|t-1}, (4)
K_t = P_{t|t-1} Hᵀ (H P_{t|t-1} Hᵀ + R)⁻¹. (5)
the position of the vehicle detected at the previous time at the current time is predicted using equations (1), (2), and the state of the kalman filter is updated using equations (3), (4), (5).
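The predict/update cycle of equations (1) to (5) can be sketched in NumPy. This is a minimal constant-position model for a 2-D centroid; the matrices A, H, Q, R below are illustrative assumptions, not the patent's tuned motion model:

```python
import numpy as np

A = np.eye(2)          # state-transition matrix (constant-position assumption)
H = np.eye(2)          # observation matrix
Q = 0.01 * np.eye(2)   # process-noise covariance
R = 0.1 * np.eye(2)    # observation-noise covariance

def predict(x, P):
    """Eqs (1), (2): project the state and covariance forward one step."""
    return A @ x, A @ P @ A.T + Q

def update(x_pred, P_pred, z):
    """Eqs (3), (4), (5): fold the detected centroid z into the estimate."""
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)  # eq (5)
    x = x_pred + K @ (z - H @ x_pred)                       # eq (3)
    P = P_pred - K @ H @ P_pred                             # eq (4)
    return x, P

x, P = np.array([10.0, 20.0]), np.eye(2)
x_pred, P_pred = predict(x, P)
x_new, P_new = update(x_pred, P_pred, z=np.array([11.0, 21.0]))
# The updated state moves most of the way toward the new detection.
print(bool(np.all(np.abs(x_new - np.array([11.0, 21.0])) < 1.0)))  # True
```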
On the basis of tracking the vehicle position at the next moment after the target is detected by adopting a Kalman filtering algorithm, vehicles in adjacent frames of the video are associated and matched by utilizing a Hungarian matching algorithm, the unique ID label of the detected target is determined, and accurate positioning and tracking of a plurality of targets are realized.
When performing association matching, the Hungarian optimal matching algorithm uses the Euclidean distance between the two different coordinate sets as the cost matrix, and then performs feature association with the Hungarian algorithm: the minimum value d_min of the Euclidean distance between each centroid coordinate predicted at the previous time and the detected coordinates at the current time is solved, and the predicted coordinates of the previous time are assigned to and associated with the detected coordinates of the current time. The Euclidean distance is computed as:
d = √((x_p − x_d)² + (y_p − y_d)²),
where (x_p, y_p) belongs to the set of box centroid coordinates predicted from the previous times, and (x_d, y_d) belongs to the set of box centroid coordinates detected at the current time.
When a detection at the current time is not assigned to any predicted value from the previous time, that is, when the number of predicted coordinates at the previous time is smaller than the number of detected coordinates at the current time, the detection is tracked as a new target. The specific formula is:
n_{t-1} < n_t,
where n_{t-1} is the number of predicted coordinates at time t − 1 and n_t is the number of actually detected coordinates at time t.
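The distance-cost assignment and the new-target condition can be sketched with SciPy's Hungarian solver; the centroid values below are hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical centroids: two predictions from t-1, three detections at t.
preds = np.array([[10.0, 10.0], [50.0, 50.0]])
dets = np.array([[51.0, 49.0], [11.0, 10.0], [90.0, 90.0]])

# Cost matrix = pairwise Euclidean distances, then Hungarian assignment.
cost = np.linalg.norm(preds[:, None, :] - dets[None, :, :], axis=-1)
rows, cols = linear_sum_assignment(cost)
print(dict(zip(rows.tolist(), cols.tolist())))  # {0: 1, 1: 0}

# Detections left unassigned (n_{t-1} < n_t) start new tracks.
new_tracks = set(range(len(dets))) - set(cols.tolist())
print(new_tracks)  # {2}
```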
In practical tracking, considering missed detections and tracking failures, tracking is judged lost when the computed Euclidean distance exceeds a set threshold or the vehicle object fails to be detected for several consecutive frames. The specific formula is:
f > f_max ∨ d > d_max,
where f is the number of consecutive frames in which the target was not detected; f_max is the maximum number of lost frames; d is the Euclidean distance; and d_max is the maximum distance threshold.
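The track-loss test above reduces to a one-line predicate; the threshold values here are illustrative assumptions, not taken from the patent:

```python
def track_lost(missed_frames, dist, f_max=5, d_max=80.0):
    """Declare a track lost when the target went undetected for more than
    f_max consecutive frames OR the match distance exceeds d_max
    (both thresholds are hypothetical defaults)."""
    return missed_frames > f_max or dist > d_max

print(track_lost(6, 10.0))   # True  (too many missed frames)
print(track_lost(2, 10.0))   # False (still tracked)
```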
The invention also provides a real-time vehicle detection and tracking system based on the lightweight YOLOv3, which comprises the following modules: module M1: detecting vehicles in the traffic video by adopting a lightweight YOLOv 3-based algorithm, and marking the positions of the vehicles by using a priori frame; module M2: tracking the vehicle position at the next moment by applying a Kalman filtering algorithm; module M3: on the basis of Kalman filtering algorithm tracking, the unique tag ID of the detected target is determined by using Hungarian matching algorithm, and accurate positioning and tracking of a plurality of targets are realized.
The backbone network of the lightweight YOLOv3 algorithm employs 7 convolutional layers, and in the residual network structure of each convolutional layer, fewer repeated residual units are employed.
Module M1 includes the following modules: module M1.1: performing K-means++ clustering on the vehicle frames in the training set, selecting three types of vehicle frames with different sizes, selecting three vehicle frames with different shapes in each type of size, and taking nine vehicle frame shapes as prior frames; module M1.2: acquiring a single-frame video image in a traffic video; module M1.3: predicting the coordinates of the candidate vehicles by adopting a target detection algorithm based on the lightweight YOLOv3, and framing all the candidate vehicles with prior frames according with the sizes of the vehicles.
The kalman filter algorithm is used in block M2 to predict the position of the box at the next time while updating the state of the kalman filter.
The invention also provides a computer-readable storage medium having a computer program stored thereon, which, when being executed by a processor, carries out the steps of the method as described above.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (9)
1. A real-time vehicle detection and tracking method based on lightweight YOLOv3 is characterized by comprising the following steps:
step 1: detecting the vehicles in the traffic video by adopting a lightweight YOLOv 3-based algorithm, and marking the positions of the vehicles in the video by using a priori frames;
step 2: tracking the vehicle position at the next moment by applying a Kalman filtering algorithm;
and step 3: on the basis of Kalman filtering algorithm tracking, the unique tag ID of the detected target is determined by using Hungarian matching algorithm, and accurate positioning and tracking of a plurality of targets are realized.
2. The method as claimed in claim 1, wherein the backbone network of the lightweight YOLOv3 algorithm employs 7 convolutional layers, and in the residual network structure of each convolutional layer, fewer repeated residual units are employed.
3. The method for detecting and tracking the vehicle in real time based on the lightweight YOLOv3 as claimed in claim 1, wherein the step 1 comprises the following steps:
step 1.1: performing K-means++ clustering on the vehicle frames in the training set, selecting three types of vehicle frames with different sizes, selecting three vehicle frames with different shapes in each type of size, and taking nine vehicle frame shapes as prior frames;
step 1.2: acquiring a single-frame video image in a traffic video;
step 1.3: and predicting the coordinates of the candidate vehicles by adopting a target detection algorithm based on the lightweight YOLOv3, and framing all the candidate vehicles by using a priori frames according with the sizes of the vehicles.
4. The method as claimed in claim 1, wherein the step 2 uses kalman filter algorithm to predict the position of the frame at the next time, and updates the state of the kalman filter.
5. A real-time vehicle detection and tracking system based on lightweight YOLOv3 is characterized by comprising the following modules:
module M1: detecting vehicles in the traffic video by adopting a lightweight YOLOv 3-based algorithm, and marking the positions of the vehicles by using a priori frame;
module M2: tracking the vehicle position at the next moment by applying a Kalman filtering algorithm;
module M3: on the basis of Kalman filtering algorithm tracking, the unique tag ID of the detected target is determined by using Hungarian matching algorithm, and accurate positioning and tracking of a plurality of targets are realized.
6. The system of claim 5, wherein the backbone network of the lightweight YOLOv3 algorithm employs 7 convolutional layers, and fewer repeating residual units are employed in the residual network structure of each convolutional layer.
7. The system for real-time vehicle detection and tracking based on the lightweight YOLOv3 of claim 5, wherein the module M1 comprises the following modules:
module M1.1: performing K-means++ clustering on the vehicle frames in the training set, selecting three types of vehicle frames with different sizes, selecting three vehicle frames with different shapes in each type of size, and taking nine vehicle frame shapes as prior frames;
module M1.2: acquiring a single-frame video image in a traffic video;
module M1.3: predicting the coordinates of candidate vehicles with the target detection algorithm based on the lightweight YOLOv3, and framing all candidate vehicles with the prior boxes that match the vehicle sizes.
8. The system for real-time vehicle detection and tracking based on the lightweight YOLOv3 of claim 5, wherein the module M2 uses a Kalman filter algorithm to predict the position of the bounding box at the next time step and to update the state of the Kalman filter.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110413744.6A CN113077496A (en) | 2021-04-16 | 2021-04-16 | Real-time vehicle detection and tracking method and system based on lightweight YOLOv3 and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110413744.6A CN113077496A (en) | 2021-04-16 | 2021-04-16 | Real-time vehicle detection and tracking method and system based on lightweight YOLOv3 and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113077496A true CN113077496A (en) | 2021-07-06 |
Family
ID=76618134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110413744.6A Pending CN113077496A (en) | 2021-04-16 | 2021-04-16 | Real-time vehicle detection and tracking method and system based on lightweight YOLOv3 and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113077496A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113129336A (en) * | 2021-03-31 | 2021-07-16 | 同济大学 | End-to-end multi-vehicle tracking method, system and computer readable medium |
CN116778224A (en) * | 2023-05-09 | 2023-09-19 | 广州华南路桥实业有限公司 | Vehicle tracking method based on video stream deep learning |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110472467A (en) * | 2019-04-08 | 2019-11-19 | 江西理工大学 | The detection method for transport hub critical object based on YOLO v3 |
CN111126152A (en) * | 2019-11-25 | 2020-05-08 | 国网信通亿力科技有限责任公司 | Video-based multi-target pedestrian detection and tracking method |
CN111476826A (en) * | 2020-04-10 | 2020-07-31 | 电子科技大学 | Multi-target vehicle tracking method based on SSD target detection |
CN112241969A (en) * | 2020-04-28 | 2021-01-19 | 北京新能源汽车技术创新中心有限公司 | Target detection tracking method and device based on traffic monitoring video and storage medium |
Non-Patent Citations (2)
Title |
---|
HE DANNI: "Research on Multi-Vehicle Detection and Tracking Algorithms Based on Deep Learning", China Masters' Theses Full-text Database (Electronic Journals) * |
LIU JUN et al.: "Real-time Vehicle Detection and Tracking Based on the Enhanced Tiny YOLOV3 Algorithm", Transactions of the Chinese Society of Agricultural Engineering * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10733755B2 (en) | Learning geometric differentials for matching 3D models to objects in a 2D image | |
CN108805083B (en) | Single-stage video behavior detection method | |
CN110033002B (en) | License plate detection method based on multitask cascade convolution neural network | |
CN108509859B (en) | Non-overlapping area pedestrian tracking method based on deep neural network | |
Dewangan et al. | RCNet: road classification convolutional neural networks for intelligent vehicle system | |
CN112101221B (en) | Method for real-time detection and identification of traffic signal lamp | |
CN108197326B (en) | Vehicle retrieval method and device, electronic equipment and storage medium | |
CN110175649B (en) | Rapid multi-scale estimation target tracking method for re-detection | |
CN109800692B (en) | Visual SLAM loop detection method based on pre-training convolutional neural network | |
CN108171112A (en) | Vehicle identification and tracking based on convolutional neural networks | |
CN111667512B (en) | Multi-target vehicle track prediction method based on improved Kalman filtering | |
CN113378890B (en) | Lightweight pedestrian vehicle detection method based on improved YOLO v4 | |
US11741368B2 (en) | Image segmentation | |
CN103984948B (en) | A kind of soft double-deck age estimation method based on facial image fusion feature | |
CN112052802B (en) | Machine vision-based front vehicle behavior recognition method | |
CN111640136B (en) | Depth target tracking method in complex environment | |
CN113077496A (en) | Real-time vehicle detection and tracking method and system based on lightweight YOLOv3 and medium | |
CN113762209A (en) | Multi-scale parallel feature fusion road sign detection method based on YOLO | |
CN112738470B (en) | Method for detecting parking in highway tunnel | |
CN111626120B (en) | Target detection method based on improved YOLO-6D algorithm in industrial environment | |
CN112990065A (en) | Optimized YOLOv5 model-based vehicle classification detection method | |
Mayr et al. | Self-supervised learning of the drivable area for autonomous vehicles | |
CN113963333B (en) | Traffic sign board detection method based on improved YOLOF model | |
CN113792631B (en) | Aircraft detection and tracking method based on multi-scale self-adaption and side-domain attention | |
CN116630932A (en) | Road shielding target detection method based on improved YOLOV5 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210706 |