CN114820699A - Multi-target tracking method, device, equipment and medium


Info

Publication number: CN114820699A (application CN202210325675.8A)
Authority: CN (China)
Prior art keywords: target, video frame, pose information, targets, tracking
Legal status: Granted
Application number: CN202210325675.8A
Other languages: Chinese (zh)
Other versions: CN114820699B (en)
Inventors: 刘洋 (Liu Yang), 赵雄 (Zhao Xiong)
Current Assignee: Xiaomi Automobile Technology Co Ltd
Original Assignee: Xiaomi Automobile Technology Co Ltd
Application filed by Xiaomi Automobile Technology Co Ltd
Priority: CN202210325675.8A
Publication of CN114820699A; application granted as CN114820699B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-target tracking method that addresses the technical problems of the prior art: uncontrollable multi-model invocation overhead, low feature-extraction and recognition efficiency, frequent target loss, and low tracking accuracy. The method comprises the following steps: acquiring a first video frame and a second video frame, where the first video frame is the current video frame and the second video frame is the previous video frame adjacent to the first video frame; processing the first video frame and the second video frame through a target detection network, determining a plurality of first targets in the first video frame and a plurality of second targets in the second video frame, and obtaining the pose information corresponding to each first target and each second target; calculating, from the pose information corresponding to the second targets, the predicted pose information of each second target in the first video frame; and matching targets based on the predicted pose information and the pose information corresponding to the first targets, and tracking according to the target matching result.

Description

Multi-target tracking method, device, equipment and medium
Technical Field
The invention relates to the technical fields of video image processing, computer vision and the like, in particular to a multi-target tracking method, a multi-target tracking device, multi-target tracking equipment and a multi-target tracking medium.
Background
The Simple Online and Realtime Tracking algorithm (SORT) is a classical online real-time tracking algorithm that decomposes the multi-target tracking problem into a target detection part, a state prediction part, and a data association part. The detection part of SORT usually adopts the Faster R-CNN (Faster Region-based Convolutional Neural Network) target detection algorithm. The advantage of this approach is that the state prediction and data association parts run fast, so online tracking can be realized. However, it does not consider the appearance features of each target: matching relies only on Intersection over Union (IoU), which often causes id-label switching in practice, and a lost target can never be recovered.
In the prior art, in order to solve the above problems, there are two solutions as follows:
the method comprises the following two steps:
the two-stage tracking method based on deep learning introduces a convolutional neural network to extract apparent features of a target on the basis of a sort algorithm, and adds a cascade matching strategy. Although the two-stage method improves the target tracking accuracy, the running time of the method is limited because two networks are required to be calculated in series, the consumed time is the sum of the two network times plus the time of a tracking module, and the number of times of calling of the apparent feature extraction model is multiplied with the increase of the number of targets.
(II) Single-stage method:
the Joint Detection and Embedding method (JDE) integrates a target Detection algorithm and a target re-identification algorithm into one network, and uses the same backbone network to extract features, so that the network can simultaneously output position information and apparent feature vectors of a target. However, the same network is used for training the target detection and the pedestrian re-identification reid features at the same time, and the imbalance of tasks of the target detection and the pedestrian re-identification reid features causes the difficulty in achieving higher accuracy of the model.
Therefore, a multi-target tracking method is needed that tracks quickly and makes it easy to balance re-identification features against target detection, so as to solve the problems of low recognition efficiency and easy target loss.
Disclosure of Invention
The invention provides a multi-target tracking method, device, equipment, and medium to solve the technical problems of the prior art: uncontrollable multi-model invocation overhead, low feature-extraction and recognition efficiency, frequent target loss, and low tracking accuracy.
In a first aspect, an embodiment of the present invention provides a multi-target tracking method, which is applied to an automobile, and includes:
acquiring a first video frame and a second video frame, wherein the first video frame is a current video frame, and the second video frame is a previous video frame adjacent to the first video frame;
respectively processing the first video frame and the second video frame through a target detection network, determining a plurality of first targets in the first video frame and a plurality of second targets in the second video frame, and obtaining pose information corresponding to each first target and each second target;
respectively calculating the predicted pose information of each second target in the first video frame according to the pose information corresponding to the plurality of second targets;
and matching the targets based on the predicted pose information and pose information corresponding to the first targets, and tracking according to the target matching result.
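The four steps above can be sketched as a per-frame loop. The `Track` class, constant-position `predict`, and greedy nearest-pose `match` below are hypothetical stand-ins for the patent's trained detection network and unscented-Kalman predictor, and the matching here is distance-only; the patent additionally checks re-identification features.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the patent's components; a real system would
# use a trained 3D detector and an unscented Kalman filter for prediction.
@dataclass
class Track:
    track_id: int
    pose: tuple      # (x, y) position; the patent uses richer pose information
    feature: tuple   # re-identification feature vector

def predict(track):
    # Trivial constant-position prediction standing in for UKF prediction.
    return track.pose

def match(tracks, detections, max_dist=2.0):
    """Greedy nearest-pose matching between predicted tracks and detections."""
    matches, unmatched = [], list(range(len(detections)))
    for t in tracks:
        px, py = predict(t)
        best, best_d = None, max_dist
        for j in unmatched:
            dx, dy = detections[j][0]
            d = ((px - dx) ** 2 + (py - dy) ** 2) ** 0.5
            if d < best_d:
                best, best_d = j, d
        if best is not None:
            matches.append((t.track_id, best))
            unmatched.remove(best)
    return matches, unmatched

# Second-frame tracks and first-frame detections as (pose, feature) pairs.
tracks = [Track(0, (0.0, 0.0), (1.0,)), Track(1, (5.0, 5.0), (0.5,))]
detections = [((5.1, 4.9), (0.5,)), ((0.2, 0.1), (1.0,)), ((9.0, 9.0), (0.2,))]
m, u = match(tracks, detections)   # unmatched detections start new tracks
```

Matched pairs update the corresponding tracks; unmatched detections would be initialized as new tracks, and unmatched tracks treated as lost, per the method above.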
In a possible implementation manner, in the method provided by the embodiment of the present invention, processing a first video frame and a second video frame through a target detection network, determining a plurality of first targets in the first video frame and a plurality of second targets in the second video frame, and obtaining pose information corresponding to each of the first targets and the second targets includes:
processing the second video frame through a target detection network to determine a plurality of second targets and obtain pose information and re-identification characteristics corresponding to each second target;
and processing the first video frame through a target detection network, determining a plurality of first targets, and obtaining pose information and re-identification characteristics corresponding to each first target.
In a possible implementation manner, in the method provided by the embodiment of the present invention, performing target matching based on the predicted pose information and pose information corresponding to a plurality of first targets includes:
determining a second target with a distance to any first target being smaller than a preset target threshold value as a third target based on the predicted pose information and pose information corresponding to the plurality of first targets;
and if the re-identification features of the third target are matched with the re-identification features of the first target, determining that the third target is matched with the first target.
In a possible implementation manner, in the method provided in an embodiment of the present invention, the method further includes:
and updating the pose information of a second target matched with the first target by using the pose information corresponding to the first target.
In a possible implementation manner, in the method provided in an embodiment of the present invention, the method further includes:
determining a second target that is not matched as a lost target;
and stopping tracking the lost target, and deleting the pose information corresponding to the lost target.
In a possible implementation manner, in the method provided in the embodiment of the present invention, the target detection network is trained by the following method:
obtaining pose information of the target from a pre-obtained video frame sample containing the target through marking;
and training the target detection network by using the pose information of the target.
In a possible implementation manner, in the method provided in an embodiment of the present invention, the training method further includes:
freezing the backbone of the target detection network, and performing scale fusion on the re-identification feature extraction branches of the target detection network.
In a second aspect, an embodiment of the present invention provides a multi-target tracking apparatus, including:
an acquisition unit, configured to acquire a first video frame and a second video frame, where the first video frame is the current video frame and the second video frame is the previous video frame adjacent to the first video frame;
the processing unit is used for respectively processing the first video frame and the second video frame through a target detection network, determining a plurality of first targets in the first video frame and a plurality of second targets in the second video frame, and obtaining pose information corresponding to each first target and each second target;
the calculation unit is used for respectively calculating the predicted pose information of each second target in the first video frame according to the pose information corresponding to the plurality of second targets;
and the matching unit is used for matching the targets based on the predicted pose information and the pose information corresponding to the first targets and tracking according to the target matching result.
In a possible implementation manner, in the apparatus provided in the embodiment of the present invention, the processing unit is specifically configured to:
processing the second video frame through a target detection network to determine a plurality of second targets and obtain pose information and re-identification characteristics corresponding to each second target;
and processing the first video frame through a target detection network, determining a plurality of first targets, and obtaining pose information and re-identification characteristics corresponding to each first target.
In a possible implementation manner, in the apparatus provided in the embodiment of the present invention, the matching unit is specifically configured to:
determining a second target with a distance to any first target being smaller than a preset target threshold value as a third target based on the predicted pose information and pose information corresponding to the plurality of first targets;
and if the re-identification features of the third target are matched with the re-identification features of the first target, determining that the third target is matched with the first target.
In a possible implementation manner, in the apparatus provided in the embodiment of the present invention, the matching unit is further configured to:
and updating the pose information of a second target matched with the first target by using the pose information corresponding to the first target.
In a possible implementation manner, in the apparatus provided in the embodiment of the present invention, the matching unit is specifically configured to:
determining a second target that is not matched as a lost target;
and stopping tracking the lost target, and deleting the pose information corresponding to the lost target.
In a possible implementation manner, in the apparatus provided in the embodiment of the present invention, the processing unit trains the target detection network by the following method:
obtaining pose information of the target from a pre-obtained video frame sample containing the target through marking;
and training the target detection network by using the pose information of the target.
In a possible implementation manner, in the apparatus provided in this embodiment of the present invention, the processing unit is further configured to:
freezing the backbone of the target detection network, and carrying out scale fusion on the re-identification feature extraction branches of the target detection network.
In a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor, at least one memory, and computer program instructions stored in the memory, which when executed by the processor, implement a method as provided by the first aspect of an embodiment of the invention.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium, on which computer program instructions are stored, which, when executed by a processor, implement the method as provided by the first aspect of embodiments of the present invention.
In the embodiment of the invention, a first video frame and a second video frame are obtained. Both frames are then processed through a target detection network to determine a plurality of first targets in the first video frame and a plurality of second targets in the second video frame, and to obtain the pose information corresponding to each first target and each second target. The predicted pose information of each second target in the first video frame is calculated from the pose information corresponding to the second targets. Finally, target matching is performed based on the predicted pose information and the pose information corresponding to the first targets, and tracking proceeds according to the matching result. Compared with the prior art, the method solves the problems of uncontrollable multi-model invocation overhead, low feature-extraction and recognition efficiency, easy target loss, and low tracking accuracy. While the running speed of the whole algorithm is preserved, tracking remains stable and robust, and external targets can be effectively perceived during automatic driving, so that effective interactions are made and driving safety is ensured.
Drawings
Fig. 1 is a schematic flow chart of a multi-target tracking method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a training method for a target detection network in multi-target tracking according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of pre-scale fusion features provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a feature after scale fusion provided by an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for predicting a target location in multi-target tracking according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of a matching tracking method in multi-target tracking according to an embodiment of the present invention;
fig. 7 is a schematic specific flowchart of a multi-target tracking method according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a multi-target tracking apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Some of the words that appear in the text are explained below:
1. the term "and/or" in the embodiments of the present invention describes an association relationship of associated objects, and indicates that three relationships may exist, for example, a and/or B may indicate: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
2. Pedestrian re-identification (reid) is a technique that uses computer vision techniques to determine whether a particular pedestrian is present in an image or video sequence.
3. Multi-Object Tracking (MOT) tracks multiple objects across continuous video frames; its essence is to associate the same object in consecutive frames and assign it a unique track id. The main task is: given an image sequence, find the moving objects in it, put the moving objects in different frames into one-to-one correspondence, and then produce the motion trajectories of the different objects. The objects may be anything: pedestrians, vehicles, various animals, and so on. In the three-layer structure of computer vision, target tracking belongs to the middle layer and is the basis of higher-level tasks (such as action recognition and behavior analysis). Target tracking includes single-target tracking and multi-target tracking. Besides the illumination, deformation, and occlusion problems met in single-target tracking, the multi-target tracking problem also requires association matching between targets. In addition, the multi-target tracking task frequently faces frequent occlusion of targets, unknown track start and end times, large target scale changes, similar appearances, interaction between targets, low frame rates, and the like.
4. The Simple Online and Realtime Tracking algorithm (SORT) is a classical online real-time tracking algorithm that decomposes the multi-target tracking problem into a target detection part, a state prediction part, and a data association part.
5. Faster R-CNN (Faster Region-based Convolutional Neural Network) is a two-stage target detector in the R-CNN family, the line of algorithms that first successfully applied deep learning to target detection. R-CNN realizes target detection on the basis of algorithms such as the Convolutional Neural Network (CNN), linear regression, and the Support Vector Machine (SVM).
6. The Intersection over Union (IoU) is an important metric of target detection performance; its value equals the ratio of the intersection to the union of the predicted box and the ground-truth box.
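The IoU definition above, for axis-aligned boxes given as (x1, y1, x2, y2), can be sketched as:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extents are clamped at zero when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

For two 2x2 boxes offset by (1, 1), the overlap is 1 and the union is 7, so the IoU is 1/7.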
7. The Joint Detection and Embedding method (JDE) is an anchor-based target detector characterized by integrating target detection and embedding learning in the same network, which makes it fast.
8. The Unscented Kalman Filter (UKF) combines the Unscented Transform (UT) with the standard Kalman filter; the unscented transform makes the nonlinear system equations fit the standard Kalman framework, which assumes linearity.
The Simple Online and Realtime Tracking algorithm (SORT) is a classical online real-time tracking algorithm that decomposes the multi-target tracking problem into a target detection part, a state prediction part, and a data association part. The detection part of SORT adopts the Faster R-CNN target detection algorithm, which outputs the positions and categories of targets in the input picture; the state of each detected target is then predicted and updated by Kalman filtering; finally, the Hungarian algorithm matches the predicted targets against the targets detected in the current frame, with IoU as the cost matrix. The advantage of SORT is that the state prediction and data association parts run fast, so online tracking can be realized. However, the method does not consider the appearance features of each target and matches only with IoU, so id-label switching often occurs in actual use, and a lost target can never be found again.
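The Kalman filtering step in the SORT pipeline above can be sketched for a 1D constant-velocity state [position, velocity]. This is a generic linear Kalman predict/update, not the patent's exact state parameterization; the matrices and noise values are illustrative assumptions.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    # x' = F x ; P' = F P F^T + Q
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, H, R):
    # Standard Kalman gain and measurement update.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity motion model
H = np.array([[1.0, 0.0]])              # only position is observed
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])

x, P = np.array([0.0, 1.0]), np.eye(2)  # start at 0 moving at 1 unit/frame
x, P = kf_predict(x, P, F, Q)           # predicted position is 1.0
x, P = kf_update(x, P, np.array([1.2]), H, R)  # detection at 1.2 pulls the estimate up
```

After the update the position estimate lies between the prediction (1.0) and the measurement (1.2), weighted by the Kalman gain.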
In the prior art, in order to solve the above problems, there are two solutions as follows:
the method comprises the following two steps:
the two-stage tracking method based on deep learning introduces a convolutional neural network to extract apparent features of a target on the basis of a sort algorithm, and adds a cascade matching strategy. The specific implementation is that a target re-recognition network is trained independently besides the target detection network. The cascade matching method effectively solves the problem that the target temporarily disappears due to occlusion and the like. And secondly, improving a target apparent feature extraction network on a model and a loss function so as to ensure that the distance in the same id is small enough. Although the two-stage method improves the target tracking accuracy, the running time of the method is limited because two networks are required to be calculated in series, the consumed time is the sum of the two network times plus the time of a tracking module, and the number of times of calling of the apparent feature extraction model is multiplied with the increase of the number of targets.
(II) Single-stage method:
the Joint Detection and Embedding method (JDE) integrates a target Detection algorithm and a target re-identification algorithm into one network, and uses the same backbone network to extract features, so that the network can simultaneously output position information and apparent feature vectors of a target. The method effectively improves the running speed of the algorithm. The FairMOT method is based on the idea of the JDE method, and is combined with a target detection algorithm CenterNet which does not need to set an anchor frame, so that the center position, the width and the height, the center point offset and the reid characteristics of a target are directly learned. However, the same network is used for training the target detection and the reid features at the same time, and the model is difficult to achieve high precision due to the imbalance of tasks of the target detection and the reid features.
Therefore, a multi-target tracking method is needed that tracks quickly and makes it easy to balance re-identification features against target detection, so as to solve the problems of low recognition efficiency and easy target loss.
In the technical scheme, the driving process involves video image processing, 3D target detection, reid feature extraction, unscented Kalman filter prediction, and multi-target tracking. The multi-target tracking method, device, equipment, and medium provided by the invention are mainly used for the automatic driving function; they are explained in more detail below with reference to the drawings and embodiments.
The embodiment of the invention provides a multi-target tracking method, as shown in fig. 1, comprising the following steps:
step 101, a first video frame and a second video frame are obtained.
During specific implementation, video frames are acquired in real time through a vehicle-mounted camera or other shooting equipment, the first video frame is a current video frame, and the second video frame is a previous video frame adjacent to the first video frame.
Step 102, the first video frame and the second video frame are processed through a target detection network respectively, a plurality of first targets in the first video frame and a plurality of second targets in the second video frame are determined, and pose information corresponding to each first target and each second target is obtained.
In a specific implementation, the second video frame is processed through the target detection network to determine a plurality of second targets and obtain the pose information and re-identification features corresponding to each second target; the first video frame is processed in the same way to determine a plurality of first targets and obtain the pose information and re-identification features corresponding to each first target.
In this step, pose information and re-identification features are obtained through the target detection network. The re-identification (reid) features can effectively solve the id-switching problem that occurs in multi-target tracking when a target is occluded, missed by the detector, or changes direction suddenly.
And 103, respectively calculating the predicted pose information of each second target in the first video frame according to the pose information corresponding to the plurality of second targets.
In a specific implementation, a target prediction method is applied to the pose information corresponding to the second targets to predict the pose information of each second target in the first video frame.
And 104, performing target matching based on the predicted pose information and the pose information corresponding to the plurality of first targets, and tracking according to a target matching result.
In a specific implementation, based on the predicted pose information and the pose information corresponding to the plurality of first targets, a second target whose distance to some first target is smaller than a preset target threshold is determined as a third target. If the re-identification features of the third target match the re-identification features of that first target, the third target is determined to match the first target. Tracking then follows the matching result: the pose information of a second target matched with a first target is updated with the pose information corresponding to that first target; any unmatched second target is determined to be a lost target, tracking of the lost target stops, and the pose information corresponding to the lost target is deleted.
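The two-stage check above, a distance gate followed by a re-identification feature comparison, can be sketched as follows. Cosine similarity and the threshold values are illustrative assumptions; the patent does not fix a particular feature metric.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_target(first_pose, first_feat, candidates,
                 dist_thresh=2.0, sim_thresh=0.7):
    """Return the index of the matched candidate second target, or None.

    candidates: list of (predicted_pose, reid_feature) for second targets.
    A candidate must first pass the distance gate (predicted pose close to
    the first target), then its reid features must match; both thresholds
    here are illustrative.
    """
    for idx, (pose, feat) in enumerate(candidates):
        if np.linalg.norm(np.asarray(first_pose) - np.asarray(pose)) < dist_thresh:
            if cosine_sim(np.asarray(first_feat), np.asarray(feat)) > sim_thresh:
                return idx
    return None
```

A nearby candidate with a dissimilar feature vector is rejected just like a distant one, which is what prevents id switches between close targets.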
As shown in fig. 2, the training process of the target detection network in multi-target tracking provided by the embodiment of the present invention may include the following steps:
step 201, obtaining pose information of the target from a pre-obtained video frame sample containing the target through marking.
And 202, training the target detection network by using the pose information of the target.
And step 203, freezing the backbone of the target detection network, and performing scale fusion on the re-identification feature extraction branches of the target detection network.
In a specific implementation, when the reid branch is trained, the detection model trained in step 202 is used as the pre-trained model input, the backbone network is frozen, and only the parameters of the scale fusion and the reid-related parts are updated, so that discriminative reid features are extracted without affecting detection precision. Fig. 3 shows the features before scale fusion and fig. 4 the features after scale fusion; comparing them shows that feature scale fusion enhances the semantic features.
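The freezing strategy above can be sketched framework-agnostically: backbone parameters are excluded from the gradient step while the scale-fusion/reid parameters are trained. The parameter names and the plain-SGD update are illustrative assumptions; in a framework such as PyTorch this corresponds to setting `requires_grad=False` on the backbone and passing only the reid-branch parameters to the optimizer.

```python
import numpy as np

# Toy parameter store; names are hypothetical stand-ins for real layers.
params = {
    "backbone.conv1": np.ones(3),
    "backbone.conv2": np.ones(3),
    "reid.fusion":    np.ones(3),
    "reid.head":      np.ones(3),
}
FROZEN_PREFIX = "backbone."

def sgd_step(params, grads, lr=0.1):
    for name, g in grads.items():
        if name.startswith(FROZEN_PREFIX):
            continue  # frozen: pretrained detection weights stay intact
        params[name] = params[name] - lr * g

grads = {name: np.full(3, 2.0) for name in params}
sgd_step(params, grads)  # only the reid.* parameters move
```

After the step, the backbone weights are unchanged while the reid-branch weights have been updated, mirroring how detection precision is preserved while the reid branch learns.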
In the target prediction stage of tracking, the common method is Kalman filtering, but its linear derivation and computation do not apply well to a nonlinear system. Pedestrian motion is nonlinear in the automatic driving scene, and the nonlinearity is more obvious at low frame rates; to increase the accuracy of target position prediction, this scheme therefore adopts the Unscented Kalman Filter (UKF). As shown in fig. 5, the target position prediction method in multi-target tracking provided by the embodiment of the invention may comprise the following steps:
step 501, data is initialized.
In particular, the state vector $\hat{x}_0$ and the state covariance matrix $P_0$ are initialized as:

$$\hat{x}_0 = E[x_0], \qquad P_0 = E\big[(x_0 - \hat{x}_0)(x_0 - \hat{x}_0)^T\big]$$
step 502, target prediction is performed.
The mean weights $W_i^m$ and covariance weights $W_i^c$ of the sigma points are obtained through the unscented transform:

$$W_0^m = \frac{\lambda}{n+\lambda}, \qquad W_0^c = \frac{\lambda}{n+\lambda} + 1 - \alpha^2 + \beta, \qquad W_i^m = W_i^c = \frac{1}{2(n+\lambda)}, \quad i = 1, \dots, 2n$$

where $n$ is the state dimension and $\lambda = \alpha^2(n+\kappa) - n$ is the scaling parameter.
Then the state is updated (time update), with the formulas:

$$\chi_{i,k|k-1} = f(\chi_{i,k-1}), \qquad \hat{x}_k^- = \sum_{i=0}^{2n} W_i^m\, \chi_{i,k|k-1}$$
$$P_k^- = \sum_{i=0}^{2n} W_i^c\, (\chi_{i,k|k-1} - \hat{x}_k^-)(\chi_{i,k|k-1} - \hat{x}_k^-)^T + Q$$
and the observation equation is updated, with the formulas:

$$Z_{i,k|k-1} = h(\chi_{i,k|k-1}), \qquad \hat{z}_k = \sum_{i=0}^{2n} W_i^m\, Z_{i,k|k-1}$$
$$P_{zz} = \sum_{i=0}^{2n} W_i^c\, (Z_{i,k|k-1} - \hat{z}_k)(Z_{i,k|k-1} - \hat{z}_k)^T + R, \qquad P_{xz} = \sum_{i=0}^{2n} W_i^c\, (\chi_{i,k|k-1} - \hat{x}_k^-)(Z_{i,k|k-1} - \hat{z}_k)^T$$
$$K_k = P_{xz} P_{zz}^{-1}, \qquad \hat{x}_k = \hat{x}_k^- + K_k (z_k - \hat{z}_k), \qquad P_k = P_k^- - K_k P_{zz} K_k^T$$

where $\hat{x}_k$ and $P_k$ are the latest filtered state estimate and filtering covariance, respectively. Performing unscented Kalman filtering with these formulas realizes pose prediction of the target.
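One predict/update cycle of the UKF described above can be sketched compactly in numpy. This is an illustrative sketch, not the patent's implementation: the function name, default scaling parameters, and the constant-velocity example models `f` and `h` are all assumptions.

```python
import numpy as np

def ukf_step(x, P, z, f, h, Q, R, alpha=1.0, beta=2.0, kappa=0.0):
    """One predict/update cycle of the unscented Kalman filter."""
    n = len(x)
    lam = alpha**2 * (n + kappa) - n
    Wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (n + lam)
    Wc[0] = Wm[0] + 1.0 - alpha**2 + beta

    # Sigma points: the state plus/minus the columns of the Cholesky
    # factor of (n + lam) * P.
    L = np.linalg.cholesky((n + lam) * P)
    chi = np.column_stack([x] + [x + L[:, i] for i in range(n)]
                              + [x - L[:, i] for i in range(n)])

    # Time update: propagate sigma points through the motion model f.
    chi_f = np.column_stack([f(chi[:, i]) for i in range(2 * n + 1)])
    x_pred = chi_f @ Wm
    P_pred = Q + sum(Wc[i] * np.outer(chi_f[:, i] - x_pred, chi_f[:, i] - x_pred)
                     for i in range(2 * n + 1))

    # Observation update: propagate through h, then the Kalman correction.
    Zs = np.column_stack([h(chi_f[:, i]) for i in range(2 * n + 1)])
    z_pred = Zs @ Wm
    Pzz = R + sum(Wc[i] * np.outer(Zs[:, i] - z_pred, Zs[:, i] - z_pred)
                  for i in range(2 * n + 1))
    Pxz = sum(Wc[i] * np.outer(chi_f[:, i] - x_pred, Zs[:, i] - z_pred)
              for i in range(2 * n + 1))
    K = Pxz @ np.linalg.inv(Pzz)
    x_new = x_pred + K @ (z - z_pred)
    P_new = P_pred - K @ Pzz @ K.T
    return x_new, P_new

# Usage on a toy constant-velocity model (dt = 1). For a linear model the
# unscented transform is exact, so the prediction lands on [1, 1], and a
# measurement that agrees with it leaves the state unchanged.
f = lambda s: np.array([s[0] + s[1], s[1]])   # state: [position, velocity]
h = lambda s: np.array([s[0]])                # observe position only
x, P = ukf_step(np.array([0.0, 1.0]), np.eye(2), np.array([1.0]),
                f, h, Q=np.zeros((2, 2)), R=np.array([[0.1]]))
```

For a real pedestrian-tracking state, `f` would encode the chosen motion model over the actual frame interval and `h` would map the pose state to the detector's measurement space.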
After the pose prediction, target matching is performed based on the predicted pose information and pose information corresponding to a plurality of first targets, and tracking is performed according to a target matching result, as shown in fig. 6, a matching tracking process in multi-target tracking provided by an embodiment of the present invention may include the following steps:
step 601, initializing a tracker for each target.
In a specific implementation, the first frame is processed first: the target positions and apparent feature vectors are obtained from the detection network, and a tracker is then initialized for each target.
Step 602, predicting the position of the target in the current frame and matching.
In a specific implementation, UKF prediction is performed on each target in the tracker to predict its position in the current frame; the target position information and reid features of the current frame are obtained from the detection network; the reid cost matrix and the 3D distance cost matrix between the current targets and the targets in the tracker are computed, and for any pair whose distance exceeds the threshold T_d the appearance cost is set to infinity (inf); the current targets are then matched to the targets in the tracker with the Hungarian algorithm, and the position, state, and features of every successfully matched target in the tracker are updated.
Step 603, performing secondary matching.
In a specific implementation, the 3D distance cost matrix of the detections and tracked targets left unmatched in step 602 is computed, secondary matching is performed with the Hungarian algorithm, and for each matched pair the position, state, and feature matrix of the target in the tracker are updated.
Step 604, perform a third matching.
In a specific implementation, the 3D distance cost matrix of the detections and unconfirmed tracked targets left unmatched in step 603 is computed, matching is performed with the Hungarian algorithm, and the position, state, and feature matrix of each matched target in the tracker are updated.
Step 605, mark lost targets.
In a specific implementation, any target in the tracker that remains unmatched has its tracking state marked as lost; once the lost state persists for more than T_t frames, the target is considered to have disappeared. Any detected target that remains unmatched is added to the tracker and set to the to-be-confirmed state.
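The gating-and-assignment core of steps 602–604 can be sketched as follows. This is an illustrative stand-in, not the patent's code: exhaustive search over permutations replaces the Hungarian algorithm (acceptable for the handful of targets per frame), and the gate value and coordinates are invented for the example.

```python
from itertools import permutations
import math

def gated_cost(tracks, detections, t_d=2.0):
    """3-D distance cost matrix; pairs farther apart than the gate T_d
    receive an infinite cost, as in step 602."""
    return [[math.dist(t, d) if math.dist(t, d) <= t_d else math.inf
             for d in detections] for t in tracks]

def assign(cost):
    """Optimal assignment by exhaustive search -- a stand-in for the
    Hungarian algorithm; assumes len(tracks) <= len(detections).
    Prefers more matches, then lower total cost."""
    n_t, n_d = len(cost), len(cost[0])
    best_key, best = (1, math.inf), []
    for perm in permutations(range(n_d), n_t):
        pairs = [(t, d) for t, d in enumerate(perm) if cost[t][d] < math.inf]
        key = (-len(pairs), sum(cost[t][d] for t, d in pairs))
        if key < best_key:
            best_key, best = key, pairs
    return best

# Two predicted track positions and two detections; gating rules out the
# distant pairings, so track 0 matches detection 1 and track 1 detection 0.
tracks = [(0.0, 0.0, 0.0), (5.0, 5.0, 0.0)]
detections = [(5.2, 5.1, 0.0), (0.3, 0.1, 0.0)]
matches = assign(gated_cost(tracks, detections))
```

In practice the Hungarian step would use an O(n³) solver such as `scipy.optimize.linear_sum_assignment`, and step 602 would additionally fuse the reid cost matrix with the gated 3D distance cost.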
As shown in fig. 7, the multi-target tracking method provided in the embodiment of the present invention is explained in detail.
Step 701, a first video frame and a second video frame are obtained.
During specific implementation, video frames are acquired in real time through a vehicle-mounted camera or other shooting equipment, the first video frame is a current video frame, and the second video frame is a previous video frame adjacent to the first video frame.
Step 702, processing the second video frame through the target detection network to determine a plurality of second targets, and obtaining pose information and re-identification characteristics corresponding to each second target.
Step 703, processing the first video frame through the target detection network, determining a plurality of first targets, and obtaining pose information and re-identification features corresponding to each first target.
In a specific implementation, in steps 702 and 703 the pose information and re-identification features are obtained through the target detection network; the re-identification (reid) features effectively solve the problem of identity (id) switches in multi-target tracking when a target is occluded, missed by the detector, changes direction suddenly, and the like.
In this step, an anchor-free object detection network is adopted, converting the detection problem into key-point regression: the target is represented by the center point of its box, the center offset (offset) and box size (size) are regressed to recover the actual box, and the classification information is represented by a heatmap. The network applies to 2D target detection and can be extended to 3D detection tasks, outputting the depth, length, width, height, and angle of the target, i.e., the pose information. The same network is also used to obtain the re-identification (reid) features, which are trained in a targeted manner with the target detection network training method shown in fig. 2; the specific process is not repeated here.
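The center-point decoding described above can be sketched in a few lines. This is a generic illustration of anchor-free (center-point) box decoding, not the patent's network: the function name, the feature-map stride of 4, and the example values are all assumptions.

```python
def decode_center(peak, offset, size, stride=4):
    """Recover a 2-D box from a heatmap peak plus the regressed center
    offset and box size. `peak` is the (x, y) cell index of a heatmap
    maximum, `offset` the sub-cell shift, `size` the box (w, h) in
    pixels, and `stride` the downsampling factor of the output map."""
    cx = (peak[0] + offset[0]) * stride
    cy = (peak[1] + offset[1]) * stride
    w, h = size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# A peak at feature cell (10, 6) with sub-cell offset (0.5, 0.25) and a
# regressed 32x48 box decodes to pixel coordinates.
box = decode_center((10, 6), (0.5, 0.25), (32, 48))
```

The 3D extension regresses additional heads (depth, dimensions, orientation) from the same center location instead of just the 2D size.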
And 704, respectively calculating the predicted pose information of each second target in the first video frame according to the pose information corresponding to the plurality of second targets.
In a specific implementation, the predicted pose information of each second target in the first video frame is computed from the pose information corresponding to that second target using a target prediction method. In this step, the target prediction method shown in fig. 5 is used, and it is not described again here.
Step 705, performing target matching based on the predicted pose information and pose information corresponding to the plurality of first targets.
In specific implementation, based on the predicted pose information and pose information corresponding to the multiple first targets, a second target with a distance to any one of the first targets being smaller than a preset target threshold is determined as a third target, and if the re-identification features of the third target are matched with those of the first targets, the third target is determined to be matched with the first targets.
And step 706, tracking according to the target matching result.
In a specific implementation, tracking is performed according to the target matching result: the pose information of each second target matched with a first target is updated with the pose information of that first target, and every unmatched second target is determined to be a lost target.
And step 707, stopping tracking the lost target, and deleting the pose information corresponding to the lost target.
As shown in fig. 8, based on the same inventive concept of the multi-target tracking method, the present invention further provides a multi-target tracking apparatus, including:
an obtaining unit 801, configured to obtain a first video frame and a second video frame, where the first video frame is a current video frame, and the second video frame is a previous video frame adjacent to the first video frame;
a processing unit 802, configured to process the first video frame and the second video frame through a target detection network, determine multiple first targets in the first video frame and multiple second targets in the second video frame, and obtain pose information corresponding to each first target and each second target;
a calculating unit 803, configured to calculate predicted pose information of each second target in the first video frame according to the pose information corresponding to the plurality of second targets;
and the matching unit 804 is used for matching the targets based on the predicted pose information and the pose information corresponding to the plurality of first targets and tracking according to the target matching result.
In a possible implementation manner, in the apparatus provided in the embodiment of the present invention, the processing unit 802 is specifically configured to:
processing the second video frame through a target detection network to determine a plurality of second targets and obtain pose information and re-identification characteristics corresponding to each second target;
and processing the first video frame through a target detection network, determining a plurality of first targets, and obtaining pose information and re-identification characteristics corresponding to each first target.
In a possible implementation manner, in the apparatus provided in the embodiment of the present invention, the matching unit 804 is specifically configured to:
determining a second target with a distance to any first target being smaller than a preset target threshold value as a third target based on the predicted pose information and pose information corresponding to the plurality of first targets;
and if the re-identification features of the third target are matched with the re-identification features of the first target, determining that the third target is matched with the first target.
In a possible implementation manner, in the apparatus provided in the embodiment of the present invention, the matching unit 804 is further configured to:
and updating the pose information of a second target matched with the first target by using the pose information corresponding to the first target.
In a possible implementation manner, in the apparatus provided in the embodiment of the present invention, the matching unit 804 is specifically configured to:
determining a second target that is not matched as a lost target;
and stopping tracking the lost target, and deleting the pose information corresponding to the lost target.
In a possible implementation manner, in the apparatus provided in this embodiment of the present invention, the processing unit 802 trains the target detection network by the following method:
obtaining pose information of the target from a pre-obtained video frame sample containing the target through marking;
and training the target detection network by using the pose information of the target.
In a possible implementation manner, in the apparatus provided in this embodiment of the present invention, the processing unit 802 is further configured to:
freezing the backbone of the target detection network, and carrying out scale fusion on the re-identification feature extraction branches of the target detection network.
In addition, the multi-target tracking method and apparatus of the embodiments of the invention described in conjunction with fig. 2-8 may be implemented by an electronic device. Fig. 9 is a schematic diagram illustrating a hardware structure of an electronic device according to an embodiment of the present invention.
The electronic device may comprise a processor 901 and a memory 902 storing computer program instructions.
Specifically, the processor 901 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present invention.
Memory 902 may include mass storage for data or instructions. By way of example, and not limitation, memory 902 may include a Hard Disk Drive (HDD), floppy Disk Drive, flash memory, optical Disk, magneto-optical Disk, tape, or Universal Serial Bus (USB) Drive or a combination of two or more of these. Memory 902 may include removable or non-removable (or fixed) media, where appropriate. The memory 902 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 902 is a non-volatile solid-state memory. In a particular embodiment, the memory 902 includes Read Only Memory (ROM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory or a combination of two or more of these.
The processor 901 realizes any one of the multi-target tracking methods in the above embodiments by reading and executing computer program instructions stored in the memory 902.
In one example, the electronic device can also include a communication interface 903 and a bus 910. As shown in fig. 9, the processor 901, the memory 902, and the communication interface 903 are connected via a bus 910 to complete communication with each other.
The communication interface 903 is mainly used for implementing communication between modules, apparatuses, units and/or devices in the embodiments of the present invention.
The bus 910 includes hardware, software, or both to couple the components of the electronic device to each other. By way of example, and not limitation, a bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a Hypertransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an infiniband interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a video electronics standards association local (VLB) bus, or other suitable bus or a combination of two or more of these. Bus 910 can include one or more buses, where appropriate. Although specific buses have been described and shown in the embodiments of the invention, any suitable buses or interconnects are contemplated by the invention.
The electronic device may execute the multi-target tracking method in the embodiment of the present invention based on the received video frame, thereby implementing the multi-target tracking method and apparatus described with reference to fig. 2 to 8.
In addition, in combination with the electronic device in the above embodiments, the embodiments of the present invention may be implemented by providing a computer-readable storage medium. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the multi-target tracking methods of the embodiments described above.
In the embodiment of the invention, a first video frame and a second video frame are obtained; the two frames are processed through a target detection network to determine a plurality of first targets in the first video frame and a plurality of second targets in the second video frame and to obtain the pose information corresponding to each first target and each second target; the predicted pose information of each second target in the first video frame is calculated from the pose information corresponding to the second targets; finally, target matching is performed based on the predicted pose information and the pose information corresponding to the first targets, and tracking is carried out according to the target matching result. Compared with the prior art, this solves the problems of uncontrollable multi-model invocation efficiency, inefficient feature extraction and recognition, easily lost targets, and low tracking accuracy: while the running speed of the overall algorithm is maintained, tracking remains stable and robust, and external targets can be effectively perceived during automated driving, enabling effective interaction and ensuring driving safety.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (16)

1. A multi-target tracking method is characterized by comprising the following steps:
acquiring a first video frame and a second video frame, wherein the first video frame is a current video frame, and the second video frame is a previous video frame adjacent to the first video frame;
respectively processing the first video frame and the second video frame through a target detection network, determining a plurality of first targets in the first video frame and a plurality of second targets in the second video frame, and obtaining pose information corresponding to each first target and each second target;
respectively calculating the predicted pose information of each second target in the first video frame according to the pose information corresponding to the plurality of second targets;
and matching the targets based on the predicted pose information and the pose information corresponding to the first targets, and tracking according to the target matching result.
2. The multi-target tracking method according to claim 1, wherein the processing the first video frame and the second video frame through a target detection network to determine a plurality of first targets in the first video frame and a plurality of second targets in the second video frame and obtain pose information corresponding to each of the first targets and the second targets comprises:
processing the second video frame through the target detection network to determine a plurality of second targets and obtain pose information and re-identification characteristics corresponding to each second target;
and processing the first video frame through a target detection network, determining a plurality of first targets, and obtaining pose information and re-identification characteristics corresponding to each first target.
3. The multi-target tracking method according to claim 2, wherein the performing target matching based on the predicted pose information and pose information corresponding to the plurality of first targets comprises:
determining a second target with a distance to any first target smaller than a preset target threshold value as a third target based on the predicted pose information and pose information corresponding to the plurality of first targets;
and if the re-identification feature of the third target is matched with the re-identification feature of the first target, determining that the third target is matched with the first target.
4. The multi-target tracking method of claim 3, further comprising:
and updating the pose information of a second target matched with the first target by using the pose information corresponding to the first target.
5. The multi-target tracking method of claim 4, further comprising:
determining a second target that is not matched as a lost target;
stopping tracking the lost target, and deleting the pose information corresponding to the lost target.
6. The multi-target tracking method according to any one of claims 1-5, wherein the target detection network is trained by:
obtaining pose information of the target from a pre-obtained video frame sample containing the target through marking;
and training the target detection network by using the pose information of the target.
7. The multi-target tracking method of claim 6, wherein the training method further comprises:
and freezing the backbone of the target detection network, and carrying out scale fusion on the re-identification feature extraction branches of the target detection network.
8. A multi-target tracking apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a first video frame and a second video frame, the first video frame is a current video frame, and the second video frame is a previous video frame adjacent to the first video frame;
the processing unit is used for respectively processing the first video frame and the second video frame through a target detection network, determining a plurality of first targets in the first video frame and a plurality of second targets in the second video frame, and obtaining pose information corresponding to each first target and each second target;
the calculation unit is used for calculating the predicted pose information of each second target in the first video frame according to the pose information corresponding to the plurality of second targets;
and the matching unit is used for matching targets based on the predicted pose information and the pose information corresponding to the plurality of first targets and tracking according to the target matching result.
9. The multi-target tracking device of claim 8, wherein the processing unit is specifically configured to:
processing the second video frame through the target detection network to determine a plurality of second targets and obtain pose information and re-identification characteristics corresponding to each second target;
and processing the first video frame through a target detection network, determining a plurality of first targets, and obtaining pose information and re-identification characteristics corresponding to each first target.
10. The multi-target tracking device of claim 9, wherein the matching unit is specifically configured to:
determining a second target with a distance to any first target smaller than a preset target threshold value as a third target based on the predicted pose information and pose information corresponding to the plurality of first targets;
and if the re-identification feature of the third target is matched with the re-identification feature of the first target, determining that the third target is matched with the first target.
11. The multi-target tracking device of claim 10, wherein the matching unit is further configured to:
and updating the pose information of a second target matched with the first target by using the pose information corresponding to the first target.
12. The multi-target tracking device of claim 11, wherein the matching unit is specifically configured to:
determining a second target that is not matched as a lost target;
stopping tracking the lost target, and deleting the pose information corresponding to the lost target.
13. The multi-target tracking device of any one of claims 8-12, wherein the processing unit trains the target detection network by:
obtaining pose information of the target from a pre-obtained video frame sample containing the target through marking;
and training the target detection network by using the pose information of the target.
14. The multi-target tracking device of claim 13, wherein the processing unit is further configured to:
and freezing the backbone of the target detection network, and carrying out scale fusion on the re-identification feature extraction branches of the target detection network.
15. An electronic device, comprising: at least one processor, at least one memory, and computer program instructions stored in the memory that, when executed by the processor, implement the method of any of claims 1-7.
16. A computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1-7.
CN202210325675.8A 2022-03-29 2022-03-29 Multi-target tracking method, device, equipment and medium Active CN114820699B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210325675.8A CN114820699B (en) 2022-03-29 2022-03-29 Multi-target tracking method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210325675.8A CN114820699B (en) 2022-03-29 2022-03-29 Multi-target tracking method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN114820699A true CN114820699A (en) 2022-07-29
CN114820699B CN114820699B (en) 2023-07-18

Family

ID=82532726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210325675.8A Active CN114820699B (en) 2022-03-29 2022-03-29 Multi-target tracking method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114820699B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908498A (en) * 2022-12-27 2023-04-04 清华大学 Multi-target tracking method and device based on category optimal matching

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170286774A1 (en) * 2016-04-04 2017-10-05 Xerox Corporation Deep data association for online multi-class multi-object tracking
CN110197502A (en) * 2019-06-06 2019-09-03 山东工商学院 A kind of multi-object tracking method that identity-based identifies again and system
CN110276783A (en) * 2019-04-23 2019-09-24 上海高重信息科技有限公司 A kind of multi-object tracking method, device and computer system
CN110399808A (en) * 2019-07-05 2019-11-01 桂林安维科技有限公司 A kind of Human bodys' response method and system based on multiple target tracking
CN111914664A (en) * 2020-07-06 2020-11-10 同济大学 Vehicle multi-target detection and track tracking method based on re-identification
WO2021017291A1 (en) * 2019-07-31 2021-02-04 平安科技(深圳)有限公司 Darkflow-deepsort-based multi-target tracking detection method, device, and storage medium
CN112419368A (en) * 2020-12-03 2021-02-26 腾讯科技(深圳)有限公司 Method, device and equipment for tracking track of moving target and storage medium
CN112767443A (en) * 2021-01-18 2021-05-07 深圳市华尊科技股份有限公司 Target tracking method, electronic equipment and related product
CN113139620A (en) * 2021-05-14 2021-07-20 重庆理工大学 End-to-end multi-target detection and tracking joint method based on target association learning
CN113313736A (en) * 2021-06-10 2021-08-27 厦门大学 Online multi-target tracking method for unified target motion perception and re-identification network
CN113343985A (en) * 2021-06-28 2021-09-03 展讯通信(上海)有限公司 License plate recognition method and device
CN113724293A (en) * 2021-08-23 2021-11-30 上海电科智能***股份有限公司 Vision-based intelligent internet public transport scene target tracking method and system
CN113807187A (en) * 2021-08-20 2021-12-17 北京工业大学 Unmanned aerial vehicle video multi-target tracking method based on attention feature fusion
CN114067270A (en) * 2021-11-18 2022-02-18 华南理工大学 Vehicle tracking method and device, computer equipment and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Wen Yuanqiao et al.: "Architecture and Motion Control of Unmanned Surface Vehicles", Wuhan: Wuhan University of Technology Press, 31 July 2019, pages 137-142 *
Zhai Guang: "Relative Navigation and Filtering Technology for Space Targets", Beijing: Beijing Institute of Technology Press, 29 February 2020, pages 88-92 *
Chen Shixuan: "Research on Vehicle Multi-Object Tracking and Detection Based on an Anchor-Free Model and an Attention Mechanism", China Master's Theses Full-text Database, Engineering Science and Technology II (monthly), no. 12, 15 December 2021 (2021-12-15), pages 1-81 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908498A (en) * 2022-12-27 2023-04-04 清华大学 Multi-target tracking method and device based on category optimal matching
CN115908498B (en) * 2022-12-27 2024-01-02 清华大学 Multi-target tracking method and device based on category optimal matching

Also Published As

Publication number Publication date
CN114820699B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
Hassaballah et al. Vehicle detection and tracking in adverse weather using a deep learning framework
CN112669349B (en) Passenger flow statistics method, electronic equipment and storage medium
Chan et al. Vehicle detection and tracking under various lighting conditions using a particle filter
KR101912914B1 (en) Method and system for recognition of speed limit sign using front camera
CN109754009B (en) Article identification method, article identification device, vending system and storage medium
CN112966697A (en) Target detection method, device and equipment based on scene semantics and storage medium
CN112529934B (en) Multi-target tracking method, device, electronic equipment and storage medium
Jadhav et al. Aerial multi-object tracking by detection using deep association networks
CN112434566A (en) Passenger flow statistical method and device, electronic equipment and storage medium
Liu et al. Multi-type road marking recognition using adaboost detection and extreme learning machine classification
Wang et al. Deep learning-based raindrop quantity detection for real-time vehicle-safety application
CN111062971A (en) Cross-camera mud head vehicle tracking method based on deep learning multi-mode
CN114972410A (en) Multi-level matching video racing car tracking method and system
CN114820699B (en) Multi-target tracking method, device, equipment and medium
CN115375736A (en) Image-based pedestrian trajectory tracking method and device
Li et al. Time-spatial multiscale net for vehicle counting and traffic volume estimation
CN111382606A (en) Tumble detection method, tumble detection device and electronic equipment
Tsai et al. Joint detection, re-identification, and LSTM in multi-object tracking
CN107256382A (en) Virtual bumper control method and system based on image recognition
Dai et al. A driving assistance system with vision based vehicle detection techniques
Hsu et al. Developing an on-road obstacle detection system using monovision
CN114359572A (en) Training method and device of multi-task detection model and terminal equipment
Nguyen et al. An algorithm using YOLOv4 and DeepSORT for tracking vehicle speed on highway
CN113221604A (en) Target identification method and device, storage medium and electronic equipment
CN113496188B (en) Apparatus and method for processing video content analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant