CN113409361B - Multi-target tracking method and device, computer and storage medium

Multi-target tracking method and device, computer and storage medium

Info

Publication number
CN113409361B
Authority
CN
China
Prior art keywords
target
module
information
loss function
target tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110922602.2A
Other languages
Chinese (zh)
Other versions
CN113409361A (en)
Inventor
林涛
张炳振
刘宇鸣
邓普阳
张枭勇
陈振武
王宇
周勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Urban Transport Planning Center Co Ltd
Original Assignee
Shenzhen Urban Transport Planning Center Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Urban Transport Planning Center Co Ltd filed Critical Shenzhen Urban Transport Planning Center Co Ltd
Priority to CN202110922602.2A priority Critical patent/CN113409361B/en
Publication of CN113409361A publication Critical patent/CN113409361A/en
Application granted granted Critical
Publication of CN113409361B publication Critical patent/CN113409361B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30241 Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-target tracking method and device, a computer, and a storage medium, and belongs to the technical field of artificial intelligence. First, a video is input into a fusion detection association module, down-sampling is performed to obtain feature maps, and the feature maps are input into a difference calculation network to obtain difference features. Second, the target category, the target position information, and a consistent trackID for the same target across different video frames are obtained through a multi-task learning method in deep learning. A trajectory prediction module then predicts the likely position of each target in the current frame from the target motion trajectory information of consecutive frames and provides a reference for the fusion detection association module. Finally, the multi-target tracking information is output. The method solves the technical problems in the prior art of low target tracking efficiency, easy target loss, and frequent target ID changes, improves the efficiency of multi-target tracking, and avoids the loss of tracked targets.

Description

Multi-target tracking method and device, computer and storage medium
Technical Field
The application relates to a target tracking method, in particular to a multi-target tracking method, a multi-target tracking device, a computer and a storage medium, and belongs to the technical field of artificial intelligence.
Background
Multi-target tracking means tracking multiple targets in a video simultaneously, with application scenes such as security surveillance and autonomous driving. In such scenes the number of people and vehicles is uncertain and the characteristics of each target are uncertain, and tracking the targets is the basis of other applications (such as target localization and target density calculation). Unlike single-target tracking, multi-target tracking maintains a unique ID for each target and ensures that the target is not lost during tracking. At the same time, the appearance of new targets and the disappearance of old targets are also problems that multi-target tracking must solve.
At present there is extensive research on multi-target tracking. The main tracking strategy is DBT (detection-based tracking), in which the detection module and the data association module are independent: a video sequence first passes through a detection algorithm to obtain target position information, and the final trajectory result is obtained after a data association algorithm is executed.
A representative algorithm in multi-target tracking is DeepSORT, a data association algorithm in MOT (multi-object tracking) that can be combined with any detector to realize multi-target tracking. The algorithm combines the Kalman filter with the Hungarian algorithm: the Kalman filter predicts the state of each detection box in the next frame, and the prediction is matched against the detection results of that frame. The matching step uses the Hungarian algorithm, and the motion features obtained from Kalman filtering are fused with appearance features extracted by a CNN (convolutional neural network) to compute a cost matrix.
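For background only, the following is a minimal sketch of how a DeepSORT-style tracker fuses motion and appearance cues into a single cost matrix; the weighting factor lambda_motion, the dictionary keys, and the helper structure are illustrative assumptions, not part of the patented method.

import numpy as np

def fused_cost_matrix(tracks, detections, lambda_motion=0.2):
    """Sketch of a DeepSORT-style association cost.

    tracks:     list of dicts with 'kf_mean', 'kf_cov' (Kalman-predicted box state) and 'feature'
    detections: list of dicts with 'box' (x, y, w, h as a numpy array) and 'feature' (CNN embedding)
    The squared Mahalanobis distance scores motion, cosine distance scores appearance.
    """
    cost = np.zeros((len(tracks), len(detections)), dtype=np.float64)
    for i, trk in enumerate(tracks):
        for j, det in enumerate(detections):
            # motion term: squared Mahalanobis distance between predicted state and detection
            diff = det["box"] - trk["kf_mean"]
            motion = float(diff @ np.linalg.inv(trk["kf_cov"]) @ diff)
            # appearance term: cosine distance between CNN embeddings
            a, b = trk["feature"], det["feature"]
            appearance = 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
            cost[i, j] = lambda_motion * motion + (1.0 - lambda_motion) * appearance
    return cost

# The Hungarian algorithm (e.g. scipy.optimize.linear_sum_assignment) is then run on
# the cost matrix to match tracks to detections.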
MOT is mainly applied to scenes such as security surveillance and autonomous driving, which place high demands on the real-time performance of the algorithm. With the hardware level fixed, the detection efficiency and detection accuracy of MOT should be improved as much as possible. In the prior art, MOT suffers from low efficiency in practical applications. Existing real-time MOT work usually focuses only on the data association step; it essentially completes only a part of MOT and cannot truly solve the efficiency problem.
In addition, different targets often occlude each other in real scenes, which causes problems such as target loss and target ID switching in MOT.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical elements of the invention, nor to limit the scope of the invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that follows.
In view of this, in order to solve the technical problems of low target tracking efficiency, easy target loss and easy target ID change in the prior art, the present invention provides a multi-target tracking method, apparatus, computer and storage medium.
The fusion detection association module outputs the position and category information of the different targets. The trajectory prediction module takes this information as input and learns the trajectory information of different types of targets, thereby improving target tracking efficiency and avoiding the loss of tracked targets.
A multi-target tracking method comprises the following steps:
s110, inputting the video into a fusion detection correlation module, performing down-sampling processing to obtain a feature map, and inputting the feature map into a difference calculation network to obtain difference features;
s120, calculating a loss function;
s130, acquiring data association relation among the target type, the target position information and the target; inputting target position information into a track prediction module, learning target movement by using convolution operation, outputting predicted position information, forming different types of target motion rule information and transmitting the different types of target motion rule information to a database and a fusion detection association module;
s140 outputs multi-target tracking.
Preferably, the specific method for obtaining the feature map in step S110 is:
1) 1/4 down-sampling the video through convolutional layer 1 to obtain feature map 1;
2) 1/8 down-sampling feature map 1 through convolutional layer 2 to obtain feature map 2;
3) 1/16 down-sampling feature map 2 through convolutional layer 3 to obtain feature map 3.
Preferably, the calculating the loss function in step S120 specifically includes the following three loss functions:
1) A target classification loss function;
2) A target location regression loss function;
3) Multi-objective cross entropy loss function.
Preferably, the calculation methods of the three loss functions in step S120 are specifically:
1) Target classification loss function L_cls:
L_cls = -Σ_i y_i·log(x_i) + λ·Σ_i ||x_i - c_{y_i}||²
wherein y_i represents the true class label of a target, x_i represents the model predicted value, and M represents the total number of target categories; c_{y_i} represents the class feature corresponding to the target class label y_i; λ represents a class-feature balance coefficient used to balance the influence of the class features on the overall loss function, and its value is 0.5.
The class features are randomly initialized at the beginning of training and then updated at each training iteration. The update formula is
c_j^(t+1) = c_j^t - α·Δc_j^t
wherein Δc_j represents the difference between the current data and the class feature, M represents the total number of target categories, c_{y_j} represents the class feature of the class label y_j, and x_i represents the model predicted value; Δc_j^t represents the difference between the current data and the class feature at the t-th iteration, after which the class feature is updated accordingly, while α is used to ensure that the class features remain stable; α takes the value 0.5.
2) Target position regression loss function:
The regression loss accumulates, over i ∈ {x, y, w, h}, the regression error between the model prediction t_i and the ground-truth value t_i*, wherein t_i represents the model target predicted value, t_i* represents the true value of the target, i can take the values x, y, w and h, x and y represent the coordinates of the centre point of the detection box, w represents the width of the detection box, and h represents the height of the detection box; regressing x, y, w and h yields the position and the size of the target detection box. If the target predicted position output by the trajectory prediction module is added, the target position regression loss additionally incorporates t'_i, the position output by the trajectory prediction module, which likewise contains x, y, w and h information.
3) Multi-target cross entropy loss function:
L_id = -Σ_i y_i·log(x_i)
wherein y_i represents the true class label of a target and x_i represents the model predicted value.
The fusion detection association module aims to generate the target category, the target position information and the trackID information of targets across different video frames, so the loss functions need to be weighted and summed into a total loss function, i.e. the loss function of the fusion detection association module needs to be calculated. The loss function of the fusion detection association module is the weighted sum of the three losses above, wherein the weights are multi-task weight parameters that can be set according to the requirements of different tasks.
Preferably, in S230 the target movement law is learned through a three-layer ConvLSTM network and the predicted position information is output; specifically, the first layer learns the feature information of the target, the second layer learns the position change information of the target between consecutive frames, and the third layer outputs the predicted position information.
A multi-target tracking device comprises a video input module, a fusion detection association module, a trajectory prediction module, an output module and a storage module; the video input module is connected in turn with the fusion detection association module and the output module; the video input module and the fusion detection association module are connected with the trajectory prediction module; the trajectory prediction module is connected with the storage module; the video input module is used for inputting video information; the fusion detection association module is used for acquiring the target category, the target position information and the data association relation among targets, and for outputting the target position information to the trajectory prediction module; the trajectory prediction module is used for acquiring the motion rule information of different types of targets and for outputting it to the storage module and the fusion detection association module; the output module is used for outputting the target tracking result output by the fusion detection association module; the storage module is used for storing the motion rule information of the different types of targets.
A computer comprising a memory storing a computer program and a processor implementing the steps of a multi-target tracking method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements a multi-target tracking method.
The invention has the following beneficial effects. The scheme fuses the detection algorithm and the data association algorithm into one module, reducing repeated calculation. The trajectory prediction module handles the matching of difficult targets well; the trackID generated from the data association relation among the target category, the target position information and the targets is more stable, which improves the accuracy of recognizing the same target across the previous and current frames and avoids frequent trackID switching. The problems of low computational efficiency and poor real-time performance of existing multi-target tracking technology are solved, while robustness against target loss caused by occlusion is high.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a fusion detection association module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a difference computing network according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a trajectory prediction module according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a ConvLSTM model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a multi-target tracking device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following further detailed description of the exemplary embodiments of the present application with reference to the accompanying drawings makes it clear that the described embodiments are only a part of the embodiments of the present application, and are not exhaustive of all embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Embodiment 1, this embodiment is described with reference to fig. 1 to 3, and a multi-target tracking method includes the following steps:
s110, inputting the video into a fusion detection association module, performing down-sampling processing to obtain a feature map, and inputting the feature map into a difference calculation network to obtain difference features;
firstly, inputting a video to a fusion detection association module to obtain a target position and data association information at one time, wherein a model of the fusion detection association module specifically refers to fig. 2.
The specific method of obtaining the feature maps by down-sampling is as follows. Assume the size of the input video frame is 1280 × 720 (length × width, i.e. 1280 pixels in length and 720 pixels in width); the image is resized to 896 × 896 to facilitate subsequent processing. The down-sampling process is:
(1) 1/4 down-sampling of the input image through convolutional layer 1 (kernel size 8×8, stride 8) to obtain feature map 1 with size 224 × 224;
(2) 1/8 down-sampling of feature map 1 through convolutional layer 2 (kernel size 2×2, stride 2) to obtain feature map 2 with size 112 × 112;
(3) 1/16 down-sampling of feature map 2 through convolutional layer 3 (kernel size 2×2, stride 2) to obtain feature map 3 with size 56 × 56.
At this point, three feature maps of different sizes have been obtained from the image through the down-sampling process. Every frame in the fusion detection association module undergoes this down-sampling calculation, and the 6 feature maps of the previous and current frames are passed as input into the difference calculation network. The purpose is to calculate and fuse difference features at different scales and, finally, to use a multi-task learning method to simultaneously predict the target category, the target position information and the data association relation among targets.
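Purely as an illustration, a minimal PyTorch sketch of this three-stage down-sampling is given below. The channel widths are assumptions; note also that the stated 8×8 kernel with stride 8 would yield a 112×112 map from an 896×896 input, so stride 4 is assumed here to match the reported 224×224 size of feature map 1.

import torch
import torch.nn as nn

class DownSampler(nn.Module):
    """Sketch of the down-sampling stage (layer parameters are assumptions)."""
    def __init__(self, in_ch=3, ch=(32, 64, 128)):
        super().__init__()
        # conv1: 896 -> 224 (1/4); stride 4 assumed so the output is 224x224
        self.conv1 = nn.Conv2d(in_ch, ch[0], kernel_size=8, stride=4, padding=2)
        # conv2: 224 -> 112 (1/8 of the original)
        self.conv2 = nn.Conv2d(ch[0], ch[1], kernel_size=2, stride=2)
        # conv3: 112 -> 56 (1/16 of the original)
        self.conv3 = nn.Conv2d(ch[1], ch[2], kernel_size=2, stride=2)

    def forward(self, x):                  # x: (B, 3, 896, 896)
        f1 = self.conv1(x)                 # feature map 1: (B, ch0, 224, 224)
        f2 = self.conv2(f1)                # feature map 2: (B, ch1, 112, 112)
        f3 = self.conv3(f2)                # feature map 3: (B, ch2, 56, 56)
        return f1, f2, f3

frames = torch.randn(2, 3, 896, 896)       # previous and current frame
maps = DownSampler()(frames)               # 3 feature maps per frame -> 6 maps in total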
The difference calculation network mainly comprises two structures, DenseBlock and Transition. A DenseBlock is composed of a BN layer + ReLU layer + 3×3 convolutional layer, and the input and output feature maps of a DenseBlock have the same size. A Transition is composed of a BN layer + ReLU layer + 1×1 convolutional layer + 2×2 average pooling layer, so the feature map size is halved after each Transition. In actual calculation, the 6 feature maps of the two frames are input into the difference calculation network. Similar to a twin network (Siamese network), the difference calculation network has two paths, corresponding to the 3 feature maps of the previous frame and the 3 feature maps of the current frame respectively. The two paths are identical in structure but do not share weights.
(1) Feature map 1, of size 224 × 224, is first input into each path; Transition1 reduces the size to 112 × 112, and DenseBlock1 then learns features, giving 112 × 112 features;
(2) The features obtained in the previous step are fused (added) with feature map 2 and passed through the Transition2 and DenseBlock2 networks, giving 56 × 56 features;
(3) Similarly, the features of the previous step are fused (added) with feature map 3 and passed into the DenseBlock3 network to learn further features;
(4) The previous frame and the current frame each yield a 56 × 56 feature map, and the difference between the two feature maps gives the difference feature of size 56 × 56.
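A minimal sketch of one path of this twin-path difference calculation network follows, assuming channel widths that match the down-sampling sketch above; the DenseBlock (BN + ReLU + 3×3 conv) and Transition (BN + ReLU + 1×1 conv + 2×2 average pooling) structures and the two weight-independent paths follow the description, while everything else is an illustrative assumption.

import torch
import torch.nn as nn

def dense_block(ch):
    # BN + ReLU + 3x3 conv; spatial size and channel count stay the same
    return nn.Sequential(nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                         nn.Conv2d(ch, ch, kernel_size=3, padding=1))

def transition(cin, cout):
    # BN + ReLU + 1x1 conv + 2x2 average pooling; halves the spatial size
    return nn.Sequential(nn.BatchNorm2d(cin), nn.ReLU(inplace=True),
                         nn.Conv2d(cin, cout, kernel_size=1),
                         nn.AvgPool2d(kernel_size=2))

class DiffPath(nn.Module):
    """One path of the difference network; the two paths share structure but not weights."""
    def __init__(self, ch=(32, 64, 128)):
        super().__init__()
        self.trans1, self.dense1 = transition(ch[0], ch[1]), dense_block(ch[1])
        self.trans2, self.dense2 = transition(ch[1], ch[2]), dense_block(ch[2])
        self.dense3 = dense_block(ch[2])

    def forward(self, f1, f2, f3):            # 224x224, 112x112, 56x56 feature maps
        x = self.dense1(self.trans1(f1))      # -> 112x112
        x = self.dense2(self.trans2(x + f2))  # fuse with feature map 2 -> 56x56
        x = self.dense3(x + f3)               # fuse with feature map 3 -> 56x56
        return x

prev_path, curr_path = DiffPath(), DiffPath()   # same structure, independent weights
# difference feature = prev_path(f1_prev, f2_prev, f3_prev) - curr_path(f1_cur, f2_cur, f3_cur)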
S120, calculating the loss functions. Since the network obtains the target category, the target position information and the data association relation among targets (i.e. the trackID information in the tracking process) in a single pass, the corresponding loss functions need to be calculated.
The calculation of the loss function specifically includes the following three loss functions:
1) A target classification loss function;
2) A target location regression loss function;
3) Multi-objective cross entropy loss function.
Wherein the target classification loss function L_cls is calculated as:
L_cls = -Σ_i y_i·log(x_i) + λ·Σ_i ||x_i - c_{y_i}||²
wherein y_i represents the true class label of a target, x_i represents the probability predicted by the model for the positive sample, and M represents the total number of target categories; c_{y_i} represents the class feature corresponding to the target class label y_i; λ represents a class-feature balance coefficient used to balance the influence of the class features on the overall loss function, and its value is 0.5.
The class features are randomly initialized at the beginning of training and then updated at each training iteration. The update formula is
c_j^(t+1) = c_j^t - α·Δc_j^t
wherein Δc_j represents the difference between the current data and the class feature, M represents the total number of target categories, c_{y_j} represents the class feature of the class label y_j, and x_i represents the model predicted value; Δc_j^t represents the difference between the current data and the class feature at the t-th iteration, after which the class feature is updated accordingly, while α is used to ensure that the class features remain stable; α takes the value 0.5.
Wherein the target position regression loss function accumulates, over i ∈ {x, y, w, h}, the regression error between the model prediction t_i and the ground-truth value t_i*, wherein t_i represents the model target predicted value, t_i* represents the true value of the target, i can take the values x, y, w and h, x and y represent the coordinates of the centre point of the detection box, w represents the width of the detection box, and h represents the height of the detection box; regressing x, y, w and h yields the position and the size of the target detection box. If the target predicted position output by the trajectory prediction module is added, the regression loss additionally incorporates t'_i, the position output by the trajectory prediction module, which likewise contains x, y, w and h information.
Wherein the multi-target cross entropy loss function is:
L_id = -Σ_i y_i·log(x_i)
wherein y_i represents the true class label of a target and x_i represents the model predicted value.
The fusion detection association module aims to generate the target category, the target position information and the trackID information of targets across different video frames, so the loss functions are weighted and summed into a total loss function, i.e. the loss function of the fusion detection association module needs to be calculated. The loss function of the fusion detection association module is the weighted sum of the three losses above, wherein the weights are multi-task weight parameters that can be set according to the requirements of different tasks.
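The following PyTorch sketch only illustrates the overall structure of the weighted multi-task loss described above; it assumes a cross-entropy-plus-class-feature form for the classification loss and a smooth-L1 penalty for the box regression, and the weight values w_reg and w_id, the tensor shapes and all variable names are illustrative assumptions rather than the patented formulas.

import torch
import torch.nn.functional as F

def fusion_loss(cls_logits, cls_labels, cls_centers,
                box_pred, box_gt, box_traj,
                id_logits, id_labels,
                lam=0.5, w_reg=1.0, w_id=1.0):
    """Sketch of the total loss of the fusion detection association module.

    cls_logits:  (N, M) class scores, cls_labels: (N,) class indices
    cls_centers: (M, M) learnable per-class features, compared (for simplicity)
                 against the logits themselves
    box_pred/box_gt/box_traj: (N, 4) boxes as (x, y, w, h); box_traj comes from
                 the trajectory prediction module
    id_logits:   (N, K) trackID scores, id_labels: (N,) trackID indices
    """
    # 1) target classification loss: cross entropy + lambda * distance to class feature
    ce = F.cross_entropy(cls_logits, cls_labels)
    center = ((cls_logits - cls_centers[cls_labels]) ** 2).sum(dim=1).mean()
    l_cls = ce + lam * center

    # 2) target position regression loss, with the trajectory-predicted box as extra reference
    l_reg = F.smooth_l1_loss(box_pred, box_gt) + F.smooth_l1_loss(box_pred, box_traj)

    # 3) multi-target (trackID) cross entropy loss
    l_id = F.cross_entropy(id_logits, id_labels)

    # weighted sum; the weights are multi-task parameters set per task requirements
    return l_cls + w_reg * l_reg + w_id * l_id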
S130, acquiring data association relation among the target type, the target position information and the target; inputting target position information into a track prediction module, learning target movement by using convolution operation, outputting predicted position information, forming different types of target motion rule information and transmitting the different types of target motion rule information to a database and a fusion detection association module;
the data association target is to obtain target trackID information in front and back video frames, and if a red vehicle appears in the previous frame and the red vehicle also appears in the current frame, the two vehicles can be judged to be the same trackID through data association. In order to find that the same object has the same trackID in different frames, a model should judge that the same object is closer to the space than different objects, and a common method in the prior art, namely a triplet loss function, is used in the MOT.
The specific algorithm is implemented as follows. The difference features are followed by a fully connected layer, whose number of nodes N means that there are at most N different trackIDs (N is a hyper-parameter that can be modified according to the needs of the scene; it usually takes the value N = 20000). The classification step classifies an object once it has been detected. If the target has appeared before, it is classified into its corresponding trackID; otherwise it is a new target with classification label -1, the parameters of the fully connected layer are updated, and a new trackID is added so that the object can be recognized in subsequent classification. Meanwhile, during the update of the model parameters, trackIDs that have not been detected for a long time are forgotten, ensuring that the total number of trackIDs recorded by the model does not exceed N.
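A minimal sketch of this trackID assignment step follows; the confidence threshold, the forgetting bookkeeping, and the variable names are illustrative assumptions.

import torch
import torch.nn as nn

N_TRACK_IDS = 20000          # hyper-parameter N: maximum number of distinct trackIDs

class TrackIdHead(nn.Module):
    """Fully connected layer on top of the difference features; each output node is one trackID."""
    def __init__(self, feat_dim, n_ids=N_TRACK_IDS):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_ids)
        self.last_seen = {}      # trackID -> frame index, used to forget stale IDs

    def assign(self, feat, frame_idx, new_id, score_thresh=0.5):
        """Classify one target embedding into an existing trackID or register a new one."""
        scores = torch.softmax(self.fc(feat), dim=-1)
        best_score, best_id = scores.max(dim=-1)
        if best_score.item() >= score_thresh:
            track_id = int(best_id)            # existing target: reuse its trackID
        else:
            track_id = new_id                  # new target (label -1 during training):
                                               # a new trackID is registered and the FC
                                               # layer parameters are updated afterwards
        self.last_seen[track_id] = frame_idx   # IDs unseen for a long time are forgotten
        return track_id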
S140, outputting the multi-target tracking information.
Embodiment 2. This embodiment is described with reference to fig. 4. The multi-target tracking method further comprises a trajectory prediction module, and the trajectory prediction module can learn the historical trajectory information of targets of different categories. The model structure of the trajectory prediction module is shown in fig. 4. The LSTM structure is a classical network structure for processing time-series data, while ConvLSTM is a network structure formed by combining the LSTM structure with convolution; its model structure is shown in fig. 5, in which X_t and X_{t+1} denote the inputs at times t and t+1, and H_t and H_{t+1} denote the outputs at the corresponding times. Such a structure can not only establish temporal relations but also exploit the property of convolution to describe the local spatial features of the image.
S210, inputting the target position information into the trajectory prediction module and computing the output variables C and H of the LSTM. The model takes a sequence of consecutive image frames as input, e.g. X_t and X_{t+1}; for two consecutive frame inputs, C (cell state) and H (hidden state) are computed; C and H are the output variables of the LSTM.
Wherein C represents the cell unit of the LSTM, used to store the medium- and long-term memory of the temporal information, and H represents the hidden unit, used to store the recent memory of the temporal information.
S220, estimating the C and H of the target time from the C and H input at past times by convolution operations;
s230, learning a target movement rule through a three-layer ConvLSTM network, outputting predicted position information, and forming different types of target movement rule information;
wherein the first layer learns the characteristic information of the target; a second layer learns position change information of the target between consecutive frames; the third layer outputs predicted position information.
S240, the motion rule information of the different types of targets is transmitted to the database and to the fusion detection association module respectively. When a target is occluded and the fusion detection association module cannot recognize it in the image information of the current frame, the motion trajectory in the next frame can be predicted from the motion rule information of the different target types obtained by training the trajectory prediction model.
In a traffic monitoring scene, the visual angle of the camera is generally fixed, so that the vehicle tracks in the pictures shot by the camera have certain similarity. The rule can be obtained through automatic learning of a special neural network structure. The track learning prediction module can also store the learned motion rule information in a database for a long time, and can be called at any time when the fusion detection association module needs to use the motion rule information.
After training and learning the target movement law, inputting a frame of image and the position information of the current target, and outputting the position information of the target at the next moment by the track prediction model, wherein the position information comprises x, y, w and h. The predicted target position can be added into a target position loss function of the fusion detection correlation module, and the position identification accuracy is improved.
The track prediction module predicts different positions of different types of targets, and optionally stores output results in a database for the fusion detection association module to utilize the information.
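As a sketch only, the three-layer ConvLSTM predictor could be organized as below; torch.nn has no built-in ConvLSTM, so the cell here is a simplified hand-rolled version, and the layer widths, the rasterized position input, and the output head are illustrative assumptions.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Simplified ConvLSTM cell: gates are computed by one convolution over [x, h]."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size=k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                                   # hidden state H and cell state C
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g                              # update cell state (long-term memory)
        h = o * torch.tanh(c)                          # update hidden state (recent memory)
        return h, c

class TrajectoryPredictor(nn.Module):
    """Three ConvLSTM layers: (1) target features, (2) inter-frame position change, (3) output."""
    def __init__(self, in_ch=1, hid_ch=16):
        super().__init__()
        self.layers = nn.ModuleList([
            ConvLSTMCell(in_ch, hid_ch),               # ConvLSTM-Encode
            ConvLSTMCell(hid_ch, hid_ch),              # ConvLSTM-Position
            ConvLSTMCell(hid_ch, hid_ch),              # ConvLSTM-Decode
        ])
        self.head = nn.Conv2d(hid_ch, 4, kernel_size=1)  # predicted (x, y, w, h) per location

    def forward(self, seq):                            # seq: (T, B, C, H, W) rasterized positions
        T, B, _, H, W = seq.shape
        states = [(torch.zeros(B, l.hid_ch, H, W, device=seq.device),
                   torch.zeros(B, l.hid_ch, H, W, device=seq.device)) for l in self.layers]
        for t in range(T):
            x = seq[t]
            for li, layer in enumerate(self.layers):
                h, c = layer(x, states[li])
                states[li] = (h, c)
                x = h
        return self.head(x)                            # prediction for the next time step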
The English terms appearing in this embodiment or in the drawings are explained below:
1) ConvLSTM-Encode: convolutional long short-term memory encoding layer;
2) ConvLSTM-Position: convolutional long short-term memory position layer;
3) ConvLSTM-Decode: convolutional long short-term memory decoding layer;
4) trackID: the same target should have the same trackID in different frames;
5) CNN: Convolutional Neural Network. The key parameters are the convolution kernel size and the stride; the kernel size determines the receptive field of the kernel in the image, and the stride determines how far the kernel moves at each step.
Embodiment 3. This embodiment is described with reference to fig. 6. The multi-target tracking device of this embodiment comprises a video input module, a fusion detection association module, a trajectory prediction module, an output module and a storage module; the video input module is connected in turn with the fusion detection association module and the output module; the video input module and the fusion detection association module are connected with the trajectory prediction module; the trajectory prediction module is connected with the storage module; the video input module is used for inputting video information; the fusion detection association module is used for acquiring the target category, the target position information and the data association relation among targets, and for outputting the target position information to the trajectory prediction module; the trajectory prediction module is used for acquiring the motion rule information of different types of targets and for outputting it to the storage module and the fusion detection association module; the output module is used for outputting the target tracking result output by the fusion detection association module; the storage module is used for storing the motion rule information of the different types of targets.
The video input module inputs the video to the fusion detection association module; the fusion detection association module obtains the target category, the target position information and the data association relation among targets, transmits the target position information to the trajectory prediction module, and transmits the target category and the data association relation to the output module. The trajectory prediction module obtains the motion rule information of the different target types from the received target position information and transmits it to the storage module and the fusion detection association module. When target tracking in the fusion detection association module is lost, the next video frame can be predicted from the target motion rule information.
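Purely as an illustration of how the modules of the device connect, a Python-style sketch of the data flow is given below; the class and method names are invented for the sketch and do not correspond to a concrete implementation.

class MultiTargetTracker:
    """Sketch of the device: video input -> fusion detection association -> output,
    with the trajectory prediction module and storage module in a feedback loop."""
    def __init__(self, fusion_module, trajectory_module, storage):
        self.fusion = fusion_module          # fusion detection association module
        self.trajectory = trajectory_module  # trajectory prediction module
        self.storage = storage               # stores per-category motion rule information

    def process(self, frame):
        motion_rules = self.storage.load()                         # learned motion rules
        categories, boxes, track_ids = self.fusion.run(frame, motion_rules)
        predicted_boxes = self.trajectory.update(boxes)            # learn/predict trajectories
        self.storage.save(self.trajectory.motion_rules())          # persist motion rules
        self.fusion.set_reference(predicted_boxes)                 # feed predictions back
        return categories, boxes, track_ids                        # tracking result to output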
The key technology of the invention is as follows:
1. The invention fuses the detection algorithm and the data association algorithm into one module, reducing repeated calculation: the target position information and the data association information between consecutive frames are obtained with a single calculation.
2. The detection association module learns multi-scale information of the video frames, performs difference-feature learning at different scales and, on this basis, fuses features across scales; finally, a multi-task learning method is used to output the final result.
3. The trajectory prediction module can learn historical trajectory information and help predict target trajectories, avoiding target loss caused by occlusion.
4. The invention fuses the detection module and the data association module into the same neural network; by sharing the same underlying features, the amount of computation is reduced and the running time is shortened.
The traditional DeepSORT algorithm runs at 26 FPS (FPS denotes how many frames can be processed per second and is a standard measure of algorithm execution speed: the higher the FPS, the faster the algorithm), while the present algorithm runs at 33 FPS.
The computer device of the present invention may be a device comprising a processor, a memory and the like, for example a single-chip microcomputer including a central processing unit. The processor is used for implementing the steps of the above multi-target tracking method when executing the computer program stored in the memory.
The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
Computer-readable storage medium embodiments
The computer-readable storage medium of the present invention may be any form of storage medium that can be read by a processor of a computer device, including but not limited to non-volatile memory, ferroelectric memory, etc. A computer program is stored on the computer-readable storage medium, and when the computer program stored in the memory is read and executed by the processor of the computer device, the above steps of the multi-target tracking method can be implemented.
The computer program comprises computer program code which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, read-Only Memory (ROM), random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative and not restrictive of the scope of the invention, which is defined by the appended claims.

Claims (4)

1. A multi-target tracking method is characterized by comprising the following steps:
s110, inputting the video into a fusion detection association module, performing down-sampling processing to obtain a feature map, inputting the feature map into a difference calculation network to obtain difference features, wherein the specific method for obtaining the feature map comprises the following steps:
1) 1/4 down-sampling the video through convolutional layer 1 to obtain feature map 1;
2) 1/8 down-sampling feature map 1 through convolutional layer 2 to obtain feature map 2;
3) 1/16 down-sampling feature map 2 through convolutional layer 3 to obtain feature map 3;
s120, calculating a loss function, specifically including the following three loss functions:
1) A target classification loss function;
2) A target location regression loss function;
3) A multi-objective cross entropy loss function;
s120, calculating a loss function, specifically:
1) Target classification loss function L_cls:
L_cls = -Σ_i y_i·log(x_i) + λ·Σ_i ||x_i - c_{y_i}||²
wherein y_i represents the true class label of a target, x_i represents the model predicted value, and M represents the total number of target categories; c_{y_i} represents the class feature corresponding to the target class label y_i; λ represents a class-feature balance coefficient, and the value of λ is 0.5;
the class features are randomly initialized at the beginning of training and then updated at each training iteration, the update formula being
c_j^(t+1) = c_j^t - α·Δc_j^t
wherein Δc_j represents the difference between the current data and the class feature, M represents the total number of target categories, c_{y_j} represents the class feature of the class label y_j, and x_i represents the model predicted value; Δc_j^t represents the difference between the current data and the class feature at the t-th iteration, after which the class feature is updated accordingly, while α is used to keep the class features stable; the value of α is 0.5;
2) Target position regression loss function:
the regression loss accumulates, over i taking the values x, y, w and h, the regression error between the model target predicted value t_i and the real value t_i* of the target, wherein x and y represent the coordinate values of the centre point of the detection box, w represents the width of the detection box, and h represents the height of the detection box; x, y, w and h can be regressed to obtain the position and the size of the target detection box; if the target predicted position output by the trajectory prediction module is added, the target position regression loss additionally incorporates t'_i, the position output by the trajectory prediction module, which comprises x, y, w and h information;
3) Multi-target cross entropy loss function:
L_id = -Σ_i y_i·log(x_i)
wherein y_i represents the true class label of a target and x_i represents the model predicted value;
s130, acquiring data association relation among the target type, the target position information and the target; inputting target position information into a track prediction module, learning target movement by using convolution operation, outputting predicted position information, forming different types of target motion rule information and transmitting the different types of target motion rule information to a database and a fusion detection association module;
learning the target movement by convolution operations and outputting the predicted position information; specifically, the first layer learns the feature information of the target, the second layer learns the position change information of the target between consecutive frames, and the third layer outputs the predicted position information;
s140 outputs the multi-target tracking.
2. A multi-target tracking device, for implementing the multi-target tracking method of claim 1, comprising a video input module, a fusion detection association module, a trajectory prediction module, an output module and a storage module; the video input module is sequentially connected with the fusion detection association module and the output module; the video input module and the fusion detection association module are connected with the track prediction module; the track prediction module is connected with the storage module; the video input module is used for inputting video information; the fusion detection association module is used for acquiring the data association relation among the target category, the target position information and the target and outputting the target position information to the track prediction module; the track prediction module is used for acquiring the motion rule information of different types of targets; outputting the motion rule information of the targets of different types to a storage module and a fusion detection association module; the output module is used for outputting the target tracking result output by the fusion detection correlation module; the storage module is used for storing different types of target motion rule information.
3. A computer comprising a memory storing a computer program and a processor, the processor implementing the steps of a multi-target tracking method as claimed in claim 1 when executing the computer program.
4. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a multi-target tracking method according to claim 1.
CN202110922602.2A 2021-08-12 2021-08-12 Multi-target tracking method and device, computer and storage medium Active CN113409361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110922602.2A CN113409361B (en) 2021-08-12 2021-08-12 Multi-target tracking method and device, computer and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110922602.2A CN113409361B (en) 2021-08-12 2021-08-12 Multi-target tracking method and device, computer and storage medium

Publications (2)

Publication Number Publication Date
CN113409361A CN113409361A (en) 2021-09-17
CN113409361B true CN113409361B (en) 2023-04-18

Family

ID=77688703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110922602.2A Active CN113409361B (en) 2021-08-12 2021-08-12 Multi-target tracking method and device, computer and storage medium

Country Status (1)

Country Link
CN (1) CN113409361B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114022509B (en) * 2021-09-24 2024-06-14 北京邮电大学 Target tracking method based on monitoring video of multiple animals and related equipment
CN113993172B (en) * 2021-10-24 2022-10-25 河南大学 Ultra-dense network switching method based on user movement behavior prediction
CN114170271B (en) * 2021-11-18 2024-04-12 安徽清新互联信息科技有限公司 Multi-target tracking method, equipment and storage medium with self-tracking consciousness
CN114419102B (en) * 2022-01-25 2023-06-06 江南大学 Multi-target tracking detection method based on frame difference time sequence motion information
CN116309692B (en) * 2022-09-08 2023-10-20 广东省机场管理集团有限公司工程建设指挥部 Method, device and medium for binding airport security inspection personal packages based on deep learning
CN117541625B (en) * 2024-01-05 2024-03-29 大连理工大学 Video multi-target tracking method based on domain adaptation feature fusion
CN117593340B (en) * 2024-01-18 2024-04-05 东方空间(江苏)航天动力有限公司 Method, device and equipment for determining swing angle of carrier rocket servo mechanism

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827325B (en) * 2019-11-13 2022-08-09 阿波罗智联(北京)科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN111127513B (en) * 2019-12-02 2024-03-15 北京交通大学 Multi-target tracking method
CN112001225B (en) * 2020-07-06 2023-06-23 西安电子科技大学 Online multi-target tracking method, system and application
CN111882580B (en) * 2020-07-17 2023-10-24 元神科技(杭州)有限公司 Video multi-target tracking method and system
CN111898504B (en) * 2020-07-20 2022-07-26 南京邮电大学 Target tracking method and system based on twin circulating neural network

Also Published As

Publication number Publication date
CN113409361A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN113409361B (en) Multi-target tracking method and device, computer and storage medium
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
CN110245659B (en) Image salient object segmentation method and device based on foreground and background interrelation
CN113468967B (en) Attention mechanism-based lane line detection method, attention mechanism-based lane line detection device, attention mechanism-based lane line detection equipment and attention mechanism-based lane line detection medium
CN111062413A (en) Road target detection method and device, electronic equipment and storage medium
CN112651995B (en) Online multi-target tracking method based on multifunctional aggregation and tracking simulation training
CN111382686B (en) Lane line detection method based on semi-supervised generation confrontation network
Patil et al. MsEDNet: Multi-scale deep saliency learning for moving object detection
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
CN115546705B (en) Target identification method, terminal device and storage medium
CN115345905A (en) Target object tracking method, device, terminal and storage medium
Liang et al. Cross-scene foreground segmentation with supervised and unsupervised model communication
CN112802076A (en) Reflection image generation model and training method of reflection removal model
CN115661767A (en) Image front vehicle target identification method based on convolutional neural network
CN117036397A (en) Multi-target tracking method based on fusion information association and camera motion compensation
CN116861262B (en) Perception model training method and device, electronic equipment and storage medium
CN113129336A (en) End-to-end multi-vehicle tracking method, system and computer readable medium
CN115880499B (en) Occluded target detection model training method, occluded target detection model training device, medium and device
CN115100565B (en) Multi-target tracking method based on spatial correlation and optical flow registration
CN114998814B (en) Target video generation method and device, computer equipment and storage medium
CN116129386A (en) Method, system and computer readable medium for detecting a travelable region
CN112818743B (en) Image recognition method and device, electronic equipment and computer storage medium
CN114972434A (en) End-to-end multi-target tracking system for cascade detection and matching
CN114511740A (en) Vehicle image classification method, vehicle track restoration method, device and equipment
Zhang et al. Vehicle detection and tracking in remote sensing satellite vidio based on dynamic association

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant