CN112927267A - Target tracking method under multi-camera scene - Google Patents
- Publication number
- CN112927267A (application CN202110275199.9A)
- Authority
- CN
- China
- Prior art keywords
- target
- tracking
- data set
- target object
- pictures
- Prior art date
- Legal status
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/292—Multi-camera tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention discloses a target tracking method for a multi-camera scene. It uses YOLO-V4 combined with an improved Deepsort algorithm, together with an image stitching algorithm that stitches the pictures from different cameras, and finally performs multi-target tracking in the stitched video. For data, a self-built intelligent vehicle data set and a self-built vehicle re-identification data set containing the intelligent vehicle are used. By constructing rich data sets, improving the models, and stitching and fusing the pictures, the invention achieves multi-target tracking in a multi-camera scene and improves the accuracy of vehicle re-identification.
Description
Technical Field
The invention relates to a target tracking method, in particular to a target tracking method in a multi-camera scene.
Background
Target detection and tracking is currently a research hotspot in computer vision, with wide applications in video surveillance, autonomous driving, human-computer interaction, smart homes and other fields. Moving-target tracking belongs to video analysis, which combines the mid- and high-level processing stages of computer vision research: an image sequence is processed to study the motion patterns of moving targets, or to provide semantic and non-semantic information (motion detection, target classification, target tracking, event detection, etc.) for a system's decision-making and alerting. Video target tracking is an important branch of computer vision and is increasingly applied across science and technology, national defense, aerospace, medicine and the national economy, so target tracking research has great practical value and broad development prospects.
With the development of neural networks, networks for object detection and tracking have evolved from machine learning to deep learning. Current target detection algorithms broadly fall into two categories. Two-stage detection algorithms split detection into two stages: candidate regions (region proposals) are generated first and then classified; typical representatives are the R-CNN, Fast R-CNN and Faster R-CNN family. Their false-detection and missed-detection rates are low, but they are slow and cannot meet real-time detection requirements. The other category, one-stage detection algorithms, needs no candidate-region stage: class probabilities and position coordinates are generated directly, and the final detection result is obtained in a single pass, so detection is faster; typical algorithms include YOLO, SSD, YOLOv3, YOLO-V4 and CenterNet. The main task of Multiple Object Tracking (MOT) is, given an image sequence, to find the moving objects in it and associate them across frames, i.e. to assign each a definite, accurate id; the objects may be arbitrary, such as pedestrians, vehicles and various animals. The mainstream tracking strategy currently studied in academia and industry is TBD (Tracking-by-Detection): target detection is run on each frame, and the detection results are then used for tracking; classical algorithms include SORT and DeepSORT.
Image stitching spatially aligns a group of image sequences with mutually overlapping parts and, after resampling and synthesis, forms a complete wide-angle image of the scene containing the information of each sequence. It compensates for the limited field of view of a single camera and extends the effective range of the equipment; representative algorithms are SIFT, SURF and ORB.
Target tracking has been applied in many scenarios, and many scholars have already produced good research results. In multi-target cross-camera tracking, however, most current academic research focuses on finding the overlapping region between the pictures captured by the cameras and using it as the basis for tracking across the different cameras. For example, an optimized SURF algorithm has been used to match the overlapping parts of two cameras' pictures, completing target handover between cameras and realizing cross-camera tracking. Yet when multiple targets appear in different camera pictures, such methods can only perform cross-camera tracking in the overlapping region, where a definite, accurate id is assigned; no suitable solution has been given for assigning ids to targets tracked in the non-overlapping regions. For target detection, the prior art uses the frame-difference method, which subtracts two consecutive frames of a video sequence to obtain the moving target's contour. The algorithm is simple to implement, cheap to program and fast to run, but it depends heavily on the chosen inter-frame interval and segmentation threshold, generalizes poorly and is easily constrained by the scene.
In order to realize multi-target tracking across multiple cameras and to make target detection and tracking more effective, the invention adopts YOLO-V4 combined with Deepsort, improves the target appearance feature extraction network in Deepsort, and introduces an attention mechanism to obtain better matching features.
Disclosure of Invention
The purpose of the invention is as follows: to provide a target tracking method for a multi-camera scene that tracks multiple targets with high identification accuracy.
The technical scheme is as follows: the invention discloses a target tracking method under a multi-camera scene, which comprises the following steps:
s1: shooting a picture of a target object, and labeling the picture to obtain a first target object data set;
s2: shuffling and mixing the first target object data set and the collected target-object-associated data set to obtain a total data set, and training a YOLO-V4 model with the total data set;
s3: shooting each target object at multiple angles to obtain pictures of each target object at different angles and obtain a second target object data set;
s4: an attention mechanism is introduced to improve a target appearance characteristic extraction network in a Deepsort algorithm;
s5: training the improved target appearance characteristic extraction network by using a second target object data set;
s6: combining the trained YOLO-V4 model with an improved Deepsort algorithm, obtaining a detection frame of a target object by using the YOLO-V4 model, and tracking the detected target object by using the improved Deepsort algorithm to obtain a target object tracking model;
s7: deploying a plurality of cameras at different positions and tracking the target objects to be tracked with the target object tracking model.
Beneficial effects: compared with the prior art, the invention has the following notable advantages:
the invention obtains the video spliced by multiple cameras by applying the SURF image splicing algorithm, and finally realizes multi-target tracking by applying the YOLO-V4 model in combination with the improved Deepsort algorithm in the video. The target appearance information extraction network in the Deepsort algorithm is optimized, a channel attention mechanism is introduced, and the vehicle weight identification accuracy is improved by 1.1% compared with the prior art. In conclusion, the multi-target tracking method and the multi-target tracking system realize multi-target tracking in a multi-camera scene, and the accuracy rate of vehicle weight identification is improved well.
Drawings
FIG. 1 compares the performance of the original target appearance information extraction network in the Deepsort algorithm and the improved network, both trained on the self-built vehicle re-identification data set of the present invention.
FIG. 2 shows the tracking effect for multiple intelligent vehicles in a single-camera scene using the YOLO-V4 model and the improved Deepsort algorithm.
FIG. 3 shows the tracking effect for multiple intelligent vehicles in a scene where the pictures of two cameras are fused, using the YOLO-V4 model and the improved Deepsort algorithm.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
The method uses YOLO-V4 combined with an improved Deepsort algorithm, together with an image stitching algorithm that stitches the pictures from different cameras, and finally performs multi-target tracking in the stitched video. For data, a self-built intelligent vehicle data set and a self-built vehicle re-identification data set containing the intelligent vehicle are used. The specific steps are as follows:
s1: shooting photos of the intelligent vehicle and labeling them to obtain an intelligent vehicle data set;
s2: combining the intelligent vehicle data set with the collected vehicle data set to obtain a total data set, and training a YOLO-V4 model with the total data set;
s3: shooting each intelligent vehicle from multiple angles to obtain pictures of each vehicle at different angles, cropping out the portions of the pictures containing the intelligent vehicles, and merging them with the collected vehicle re-identification data sets to obtain a vehicle re-identification data set containing the intelligent vehicles;
s4: improving the target appearance feature extraction network in the Deepsort algorithm by introducing an attention mechanism;
s5: training the improved target appearance feature extraction network in the Deepsort algorithm with the vehicle re-identification data set;
s6: combining the trained YOLO-V4 model with the improved Deepsort algorithm to obtain a model capable of tracking the intelligent vehicle;
s7: stitching the videos shot by the multiple cameras with the SURF algorithm to obtain a stitched video, and tracking the intelligent vehicles in that video with the YOLO-V4 model and the improved Deepsort algorithm.
In step S1, the target detection performance of the YOLO-V4 model is closely tied to the data set, so the data set must be sufficient. When producing it, every situation in which the intelligent vehicle may appear in the scene must be considered. Pictures of the intelligent vehicle are shot from different angles, at different distances and in different scenes, yielding 560 pictures containing the intelligent vehicle; the number, size and angle of the intelligent vehicles differ from picture to picture. The intelligent vehicles in the pictures are then labeled with data-set annotation software, producing a label file for each picture. Finally, the collected partial car data set is combined with the self-built intelligent vehicle data set to obtain the final data set, of which 80% is used for the training set, 10% for the validation set and the remaining 10% for the test set.
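The 80/10/10 split described above can be sketched as follows; the function is a generic illustration, not the patent's tooling.

```python
import random

def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle and split a data set into train/validation/test (80/10/10 by default)."""
    items = list(items)
    random.Random(seed).shuffle(items)     # deterministic shuffle for reproducibility
    n = len(items)
    n_train = int(n * train)
    n_val = int(n * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

# 560 pictures, as in the self-built intelligent vehicle data set
train_set, val_set, test_set = split_dataset(range(560))
```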
In step S2, the target detection network used is YOLO-V4, a one-stage detector of the YOLO series. It is an improved version of YOLOv3: many small improvements on top of YOLOv3 give a large gain in target detection accuracy without reducing the recognition rate. The main improvements of YOLO-V4 are as follows: 1. The YOLOv3 backbone feature extraction network Darknet53 is improved, and the activation function of its DarknetConv2D blocks is changed from LeakyReLU to Mish, where the Mish function is:
Mish(x) = x * tanh(ln(1 + e^x))
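As a quick numerical check of the formula above, a direct implementation (plain Python, for illustration only):

```python
import math

def mish(x):
    """Mish activation: x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))."""
    return x * math.tanh(math.log1p(math.exp(x)))
```

Mish is smooth and non-monotonic: it passes through zero at the origin, approaches the identity for large positive inputs, and decays toward zero for large negative inputs, which is what makes it attractive as a drop-in replacement for LeakyReLU.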
The network structure of Darknet53 is then modified to use the CSPNet structure, turning Darknet53 into CSPDarknet53. 2. The SPP and PANet structures are used. The SPP structure is attached to the last feature layer of CSPDarknet53: after three convolutions, that layer is processed with max pooling at four different scales, with pooling kernel sizes of 13x13, 9x9, 5x5 and 1x1, and features are repeatedly extracted with the up-sampling and down-sampling network of PANet. 3. Training uses the Mosaic data enhancement method: 4 pictures are read each time, each is flipped, scaled, color-gamut shifted, etc., and the four pictures are arranged in the four directions to compose a new picture. 4. CIoU is used as the bounding-box regression loss. CIoU takes into account the distance between the target and the prior box, the overlap rate, the scale and a penalty term, which makes target-box regression more stable. It is computed as:
CIOU = IOU - ρ²(b, b^gt)/c² - αv
where IOU is the intersection over union of the areas of the predicted box and the ground-truth box, ρ²(b, b^gt) is the squared Euclidean distance between the center points b and b^gt of the predicted and ground-truth boxes, and c is the diagonal length of the smallest enclosing region that contains both boxes. α and v are computed as:
v = (4/π²) * (arctan(w^gt/h^gt) - arctan(w/h))²
α = v / ((1 - IOU) + v)
where w, h and w^gt, h^gt are the width and height of the predicted box and the ground-truth box. The corresponding loss is:
LOSS_CIOU = 1 - CIOU
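A direct implementation of the CIoU loss described above (IOU, ρ², c², α, v), written as a hedged sketch in plain Python with boxes given as (x1, y1, x2, y2) corners:

```python
import math

def ciou_loss(box_pred, box_gt):
    """CIoU loss between two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_pred
    bx1, by1, bx2, by2 = box_gt
    # Intersection over union of the two box areas
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter)
    # rho^2: squared distance between box centers; c^2: enclosing-box diagonal squared
    rho2 = ((ax1 + ax2) - (bx1 + bx2)) ** 2 / 4 + ((ay1 + ay2) - (by1 + by2)) ** 2 / 4
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2
    # Aspect-ratio consistency term v and trade-off weight alpha
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    v = 4 / math.pi ** 2 * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)   # small epsilon avoids 0/0 for identical boxes
    ciou = iou - rho2 / c2 - alpha * v
    return 1 - ciou

loss_same = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))       # identical boxes
loss_far = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))        # disjoint boxes
```

Identical boxes give a loss of 0, while disjoint boxes are still penalized through the center-distance term even though their IOU is 0, which is the property that makes CIoU regression more stable than plain IOU loss.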
in step S3, the quality of the extraction capability of the target appearance feature extraction network in deep sort is closely related to the data set used for training the network, and therefore, a vehicle re-identification data set needs to be created. Every dolly all need take the picture of different angles separately, draws the position of the intelligent vehicle in the picture alone, and every dolly is about taking 40 pictures, then combines together the vehicle heavy identification data set that collects with the intelligent car heavy identification data set of self-control, obtains the vehicle heavy identification data set that contains the intelligent vehicle, and the data set contains 585 different vehicles, and every kind of vehicle possess about 40 pictures. Taking 90% as training set and 10% as testing set.
In steps S4-S5, the Deepsort target tracking algorithm is used and improved. Deepsort is an improvement on the Sort algorithm. Sort associates ids across frames by feeding the IOU between detection boxes and tracking boxes into the Hungarian algorithm for linear assignment; this tracks with high precision and accuracy but easily causes id switches. Deepsort therefore adds the target's appearance information to the matching computation, so an id can still be matched correctly when a target is occluded and reappears later, effectively reducing frequent id switches. The appearance information is extracted by a convolutional neural network that computes a 128-dimensional feature vector for each detection box, and how well this network extracts appearance information directly affects the tracking result. In this patent, the main target to identify is the intelligent vehicle, so a suitable convolutional neural network must be trained to extract its appearance information. To strengthen the network's feature extraction capability, the method improves the Deepsort feature extraction network by inserting the channel attention network ECA-Net after the original residual network. ECA-Net provides a local cross-channel interaction strategy without dimensionality reduction and a method for adaptively selecting the one-dimensional convolution kernel size, thereby improving performance.
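The ECA mechanism just described (global average pooling per channel, a 1-D convolution across the channel descriptor, a sigmoid gate, then channel-wise rescaling) can be sketched as a toy stand-in. This is an illustration only: the real ECA-Net learns the 1-D convolution weights during training, whereas fixed averaging weights are used here so the sketch stays self-contained.

```python
import math

def eca_attention(feature_map, k=3):
    """ECA-style channel attention on a feature map shaped [C][H][W] (nested lists).

    1) global average pooling per channel, 2) 1-D convolution of size k across
    the channel descriptor (zero padding, fixed averaging weights here),
    3) sigmoid gate, 4) channel-wise rescaling of the input.
    """
    C = len(feature_map)
    gap = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    pad = k // 2
    padded = [0.0] * pad + gap + [0.0] * pad
    conv = [sum(padded[i + j] / k for j in range(k)) for i in range(C)]
    gate = [1.0 / (1.0 + math.exp(-z)) for z in conv]
    return [[[v * gate[c] for v in row] for row in feature_map[c]] for c in range(C)]

# Two channels, 2x2 each: one active channel, one silent channel
fm = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
out = eca_attention(fm)
```

Note the key design point: the 1-D convolution only mixes each channel with its k neighbors, so the cost is O(C·k) rather than the O(C²/r) of a squeeze-and-excitation bottleneck, with no dimensionality reduction.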
In step S6, the YOLO-V4 model is combined with the improved Deepsort algorithm, and multi-target tracking is carried out in the scene of a single camera.
In step S7, the SURF algorithm is used to extract the feature points of the image sequences so that stitching is accurate, robust and real-time. Among current mainstream image stitching algorithms, SURF stands out for speed and matching quality, hence its name, Speeded-Up Robust Features. Multi-camera video stitching works as follows: first the pictures captured by each camera are read; the captured pictures are then stitched with the SURF algorithm to obtain stitched pictures; finally all stitched pictures are fused into the final multi-camera fused video. The intelligent vehicles in this video are then tracked with the YOLO-V4 model combined with the improved Deepsort algorithm.
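The geometric core of the stitch is mapping one camera's picture into the other's coordinate frame through a homography estimated from matched SURF keypoints (in practice via OpenCV). A minimal sketch of that geometry, with hypothetical helper names and the homography given directly instead of estimated:

```python
def apply_homography(H, pt):
    """Map a 2-D point through a 3x3 homography (row-major nested lists)."""
    x, y = pt
    denom = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / denom,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / denom)

def panorama_bounds(size_left, size_right, H_right_to_left):
    """Bounding box of the stitched canvas: the left image's corners plus the
    right image's corners mapped into the left image's coordinate frame."""
    w_l, h_l = size_left
    w_r, h_r = size_right
    corners = [(0, 0), (w_l, 0), (0, h_l), (w_l, h_l)]
    corners += [apply_homography(H_right_to_left, p)
                for p in [(0, 0), (w_r, 0), (0, h_r), (w_r, h_r)]]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return min(xs), min(ys), max(xs), max(ys)

# Pure translation: the right camera sees the scene shifted 500 px to the right
H = [[1, 0, 500], [0, 1, 0], [0, 0, 1]]
bounds = panorama_bounds((640, 480), (640, 480), H)
```

In a full pipeline, SURF keypoints matched between the overlapping parts of the two pictures would be fed to a RANSAC homography fit, and the warped images would then be blended onto this canvas.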
In order to better demonstrate the technical effect of the invention, the classification performance of the trained YOLO-V4 network is measured; the results are shown in Tables 1 and 2. In Table 1, AP is the average precision, reflecting the recognition accuracy of the YOLO-V4 network on a single category; mAP is the mean of the AP values over all categories, reflecting the accuracy of the network over all categories; and F1 is a comprehensive evaluation index of the model, reflecting the classification performance of the YOLO-V4 network.
TABLE 1 Classification Performance index of trained YOLO-V4 network
TABLE 2 Deepsort target appearance information extraction network comparison before and after improvement
Accuracy in Table 2 is the classification accuracy: the larger it is, the stronger the extraction capability of the target appearance information extraction network. Loss is the value of the loss function: the smaller it is, the stronger the extraction capability.
Claims (5)
1. A target tracking method under a multi-camera scene is characterized by comprising the following steps:
s1: shooting a picture of a target object, and labeling the picture to obtain a first target object data set;
s2: shuffling and mixing the first target object data set and the collected target-object-associated data set to obtain a total data set, and training a YOLO-V4 model with the total data set;
s3: shooting each target object at multiple angles to obtain pictures of each target object at different angles and obtain a second target object data set;
s4: an attention mechanism is introduced to improve a target appearance characteristic extraction network in a Deepsort algorithm;
s5: training the improved target appearance characteristic extraction network by using a second target object data set;
s6: combining the trained YOLO-V4 model with an improved Deepsort algorithm, obtaining a detection frame of a target object by using the YOLO-V4 model, and tracking the detected target object by using the improved Deepsort algorithm to obtain a target object tracking model;
s7: deploying a plurality of cameras at different positions and tracking the target objects to be tracked with the target object tracking model.
2. The method for tracking the target under the multi-camera scene according to claim 1, wherein the step S2 of building the YOLO-V4 model comprises:
(1) the YOLOv3 backbone feature extraction network Darknet53 is improved, and the activation function of its DarknetConv2D blocks is changed from LeakyReLU to Mish, where the Mish function is:
Mish(x) = x * tanh(ln(1 + e^x))
then the network structure of Darknet53 is modified to use the CSPNet structure, turning Darknet53 into CSPDarknet53;
(2) the SPP structure is attached to the last feature layer of CSPDarknet53: after three convolutions, that layer is processed with max pooling at four different scales, with pooling kernel sizes of 13x13, 9x9, 5x5 and 1x1, and features are repeatedly extracted with the up-sampling and down-sampling network of PANet;
(3) training uses the Mosaic data enhancement method: several pictures are read each time, each is flipped, scaled, color-gamut shifted, etc., and the pictures are placed in different directions to compose new pictures;
(4) CIoU is used as the bounding-box regression loss, computed as:
CIOU = IOU - ρ²(b, b^gt)/c² - αv
where IOU is the intersection over union of the areas of the predicted box and the ground-truth box, ρ²(b, b^gt) is the squared Euclidean distance between the center points of the predicted and ground-truth boxes, and c is the diagonal length of the smallest enclosing region that contains both boxes; α and v are computed as:
v = (4/π²) * (arctan(w^gt/h^gt) - arctan(w/h))²
α = v / ((1 - IOU) + v)
where w, h and w^gt, h^gt are the width and height of the predicted box and the ground-truth box, and the corresponding LOSS is:
LOSS_CIOU = 1 - CIOU
3. the method for tracking the target under the multi-camera scenario as claimed in claim 1, wherein the attention mechanism of step S4 is a channel attention mechanism network ECA-Net.
4. The method for tracking the target under the multi-camera scene according to claim 1, wherein the step S7 further comprises the steps of: firstly, reading pictures captured by each camera, splicing the captured pictures by using a SURF algorithm, and then fusing all the spliced pictures to obtain a final multi-camera fused video.
5. The method for tracking the target under the multi-camera scene as claimed in claim 1, wherein the target object is an intelligent vehicle, and the target-object-associated data are data of other objects similar in shape to the intelligent vehicle.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110275199.9A CN112927267A (en) | 2021-03-15 | 2021-03-15 | Target tracking method under multi-camera scene |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112927267A true CN112927267A (en) | 2021-06-08 |
Family
ID=76174965
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110275199.9A Withdrawn CN112927267A (en) | 2021-03-15 | 2021-03-15 | Target tracking method under multi-camera scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112927267A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114882393A (en) * | 2022-03-29 | 2022-08-09 | 华南理工大学 | Road reverse running and traffic accident event detection method based on target detection |
CN114882351A (en) * | 2022-03-31 | 2022-08-09 | 河海大学 | Multi-target detection and tracking method based on improved YOLO-V5s |
CN114882351B (en) * | 2022-03-31 | 2024-04-26 | 河海大学 | Multi-target detection and tracking method based on improved YOLO-V5s |
CN115035251A (en) * | 2022-06-16 | 2022-09-09 | 中交第二航务工程局有限公司 | Bridge deck vehicle real-time tracking method based on domain-enhanced synthetic data set |
CN115035251B (en) * | 2022-06-16 | 2024-04-09 | 中交第二航务工程局有限公司 | Bridge deck vehicle real-time tracking method based on field enhanced synthetic data set |
CN116993779A (en) * | 2023-08-03 | 2023-11-03 | 重庆大学 | Vehicle target tracking method suitable for monitoring video |
CN116993779B (en) * | 2023-08-03 | 2024-05-14 | 重庆大学 | Vehicle target tracking method suitable for monitoring video |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sun et al. | Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning | |
Asha et al. | Vehicle counting for traffic management system using YOLO and correlation filter | |
CN104244113B (en) | A kind of video abstraction generating method based on depth learning technology | |
CN112927267A (en) | Target tracking method under multi-camera scene | |
CN111914664A (en) | Vehicle multi-target detection and track tracking method based on re-identification | |
Lyu et al. | Small object recognition algorithm of grain pests based on SSD feature fusion | |
Jadhav et al. | Aerial multi-object tracking by detection using deep association networks | |
Tang et al. | Integrated feature pyramid network with feature aggregation for traffic sign detection | |
CN116402850A (en) | Multi-target tracking method for intelligent driving | |
Zhang et al. | Exploiting Offset-guided Network for Pose Estimation and Tracking. | |
CN108280844A (en) | A kind of video object localization method based on the tracking of region candidate frame | |
Adeli et al. | A component-based video content representation for action recognition | |
Shehzadi et al. | 2d object detection with transformers: a review | |
CN112232240A (en) | Road sprinkled object detection and identification method based on optimized intersection-to-parallel ratio function | |
Kalva et al. | Smart Traffic monitoring system using YOLO and deep learning techniques | |
Wang et al. | Summary of object detection based on convolutional neural network | |
Hassan et al. | Multi-object tracking: a systematic literature review | |
Alomari et al. | Smart real-time vehicle detection and tracking system using road surveillance cameras | |
CN114821482A (en) | Vector topology integrated passenger flow calculation method and system based on fisheye probe | |
Rabecka et al. | Assessing the performance of advanced object detection techniques for autonomous cars | |
CN113420660A (en) | Infrared image target detection model construction method, prediction method and system | |
Akdag et al. | Transformer-based fusion of 2D-pose and spatio-temporal embeddings for distracted driver action recognition | |
Tian et al. | Pedestrian multi-target tracking based on YOLOv3 | |
Kovbasiuk et al. | Detection of vehicles on images obtained from unmanned aerial vehicles using instance segmentation | |
Quang et al. | Character time-series matching for robust license plate recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20210608 |