CN112116634B

CN112116634B - Multi-target tracking method of semi-online machine

Info

Publication number: CN112116634B
Application number: CN202010754142.2A
Authority: CN
Inventors: 刘龙军; 金焰明; 孙宏滨; 郑南宁
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2020-07-30
Filing date: 2020-07-30
Publication date: 2024-05-07
Anticipated expiration: 2040-07-30
Also published as: CN112116634A

Abstract

The multi-target tracking method of the semi-online machine-mounted comprises the steps of obtaining detection frames of pedestrians or moving targets according to videos of the pedestrians or moving targets, obtaining a Kalman sequence spectrum according to position change information among the detection frames in a period of time window, finding a pair of Kalman heads according to the Kalman sequence spectrum, obtaining the detection frames of the targets or the moving objects to be tracked in the next frame through similarity of an appearance model, a moving model and a size change model, and enabling the targets or the moving objects to be in the detection frames in the frame, otherwise, indicating that the targets are lost; and splicing the detection frames with the similarity higher than the threshold value into a Kalman sequence spectrum, updating a motion model and an appearance model in the Kalman sequence spectrum, and tracking a pedestrian or a moving object target in the next frame. The method is suitable for any track spliced multi-target tracking algorithm, namely, the method is not limited to the constraint of different tracks generated by a plurality of targets such as pedestrians, moving objects and the like, the tracking precision can be effectively improved, and the identity conversion value can be reduced.

Description

Multi-target tracking method of semi-online machine

Technical Field

The invention relates to a tracking method, in particular to a multi-target tracking method of a semi-online machine.

Background

The multi-target tracking method is mainly applied to track tracking of a plurality of people or moving objects in a video sequence shot by a camera: in the driving scene of the unmanned vehicle, real-time track tracking can be carried out on pedestrians or other vehicle targets on the road shot by a camera arranged in the unmanned vehicle, and the motion track of the pedestrians or other vehicle targets is predicted, so that the unmanned vehicle can carry out effective avoidance or automatic driving decision according to the motion of the targets; in a plurality of cross-camera monitoring scenes, a plurality of pedestrians in the cameras can be tracked according to requirements, and walking tracks and positioning of a plurality of pedestrian targets can be monitored through videos captured by different cameras; in sports scenes shot by the cameras, such as basketball games, the running tracks of a plurality of athletes shot by the cameras can be tracked respectively by a multi-target tracking method, and actions, behavior analysis and the like on the athletes on the field are performed based on the tracked tracks. The multi-target tracking method can also be applied to tracking a plurality of targets such as enemy ships, vehicles and the like in military scenes. The current tracking methods are numerous, but in order to efficiently track, prompt and optimization such as real-time performance, accuracy and the like must be performed on the multi-target tracking method.

MOT (multi-objective tracking) can be largely divided into online MOT and offline MOT, which differ in that: the former can be pushed backwards along with the real-time frame number, the tracking track result can be timely given, and overall, the real-time performance is higher than that of the latter, and the accuracy is lower than that of the latter; the latter must wait for the whole video sequence to finish the forward calculation, and track after obtaining the information such as the detection frames in all video frames, so it is difficult to meet the real-time requirement compared with the former, but the accuracy is generally higher due to better combination of global information. On-line tracking requires real-time track tracking to be completed immediately after the detection operation of each next frame is completed. Therefore, the online tracking algorithm has better real-time performance intuitively, but cannot effectively utilize global information of the video, so that accuracy may be reduced; in contrast, offline tracking is tracking a track after all frames of a given video sequence have been detected. The mode can well utilize global information, and tracking results are relatively accurate, but cannot meet the real-time requirement. The sizes of the time receptive fields of the online tracking, the semi-online tracking and the offline tracking are respectively the current frame, the time window and the global, and are sequentially increased; the real-time performance is reduced in turn.

Occlusion problems have been one of the difficulties in MOT, and although iterative updating of various algorithms is quite rapid, most of the algorithm performance remains difficult to maintain robust when severe occlusion is encountered. Regardless of whether an online MOT or an offline MOT, or MOTs constructed using deep learning methods, various approaches have been made to attempt to address occlusion problems when they are encountered. But essentially by sacrificing real-time. The accuracy and the precision are very important in the scene of actual tracking application, for example, the poor real-time performance of a tracking algorithm in an unmanned automobile can lead to the delay of vehicle judgment, the erroneous judgment or the delayed decision, and unnecessary traffic accidents are caused; the poor accuracy can lead to a plurality of targets to be tracked in disorder, so that tracking failure is caused, for example, when a criminal suspects are tracked by using a multi-target tracking algorithm in a plurality of intelligent cameras in a city, the criminal suspects can be tracked, or the tracked non-suspects can cause the true suspects to run away, and the like.

Disclosure of Invention

The invention aims to provide a semi-online multi-target tracking method.

In order to achieve the above object, the present invention is realized by the following technical scheme:

According to a multi-target tracking method of a semi-online machine, a detection frame of a pedestrian or a moving target is obtained through a YOLO-V3 detector according to a video of the pedestrian or the moving target, a Kalman sequence spectrum is obtained according to position change information among the detection frames in a period of time window, then a pair of Kalman heads are found according to the Kalman sequence spectrum, a detection frame of the target or the moving object to be tracked in a next frame is obtained through similarity of an appearance model, a moving model and a size change model, and the target or the moving object is located in a detection frame in the frame, otherwise, the target is indicated to be lost; and splicing the detection frames with the similarity higher than the threshold value into a Kalman sequence spectrum, updating a motion model and an appearance model in the Kalman sequence spectrum, and tracking a pedestrian or a moving object target in the next frame.

The invention is further improved in that the similarity of the appearance model is obtained through the following processes:

in the n _th frame video, the patch is fixed in size [64,128], there are D detection boxes, D patches, and the X detection box is expressed as Patch corresponding to the X-th detection frame is/>

In n _th frames, performing crop and restore operations on the area where each detection frame is located to obtain D patches with the number equal to the fixed size of the detection frame, then dividing the pixels of the D patches into a plurality of groups according to the color interval,

The matrix result reshape obtained by grouping is taken as a one-dimensional vector Tsr _X, and the one-dimensional vector Tsr _X is taken as a one-dimensional vectorTo obtain an appearance model, and to express the appearance model of the X-th detection frame and the Y-th trajectory as: f (X) and f (Y); finally, the appearance model is updated by vector fusion, denoted/>The similarity of the appearance model is as follows formula (3-1);

Wherein: Λ ^A (X, Y) represents the similarity that is the appearance model.

The invention is further improved in that the similarity between the motion model and the size change model is obtained through the following processes: the time difference between adjacent frames is delta _t, and the kth object in the nth frame isPosition center coordinates/>For (c, d), the velocity vector of the corresponding coordinate is/>Acceleration vector/>, corresponding to coordinatesFor/>The target corresponds to the detection frame size/>Is (w, h) corresponds to the change speed/>For/>The driving force of change is/>The detector influence factor is alpha;

Motion state of the kth object in the nth frame And size status/>Respectively/>And/>Covariance matrix/>, between element factors in motion stateCovariance matrix between element factors in size state is/>According to the law of physical motion, a position prediction equation and a size prediction equation for the next frame are obtained as follows:

I.e.

Order theSimplifying the two iterative state transfer equations and the covariance matrix updating equation as follows:

Carrying out Kalman filter prediction based on normal distribution by taking the formulas (3-8) and (3-9) as iterative equations of a motion model and a size model to obtain position prediction information of an n+1th frame And size prediction information/>

For any of the first segment of track X and the second segment of track Y,And/>A forward velocity vector pointing from the head to the tail of the first track X and a reverse velocity vector pointing from the tail to the head of the second track Y, respectively; /(I)Representing the course of motion simulated by a kalman filter; f (X, Y) is a forward similarity score pointing from the tail of track X toward the head of track Y; Is the reverse similarity score from the head of track Y to the tail of track X;

where Λ ^M (X, Y) represents the similarity between the first segment of track X and the second segment of track Y.

A further improvement of the invention is that the length of the time window is defined as N, the minimum instantiation length of the short track is T _m, the Kalman family chart is denoted as KFM, and the kth detection box in the nth frame is denoted as Representing a detection frame in a KFMOrder in the corresponding fragment trajectories;

If it is It indicates that the detection box has not been cascaded with any fragment trajectories in the KFM, x representsIs the (x+1) th member of a segment of fragment trajectories in the KFM, the ith fragment trajectory in the KFM is defined as TK _i, if the length of the ith fragment trajectory is greater than T _m and its motion model, appearance model, size model are not updated in the nth frame, then the ith fragment trajectory is instantiated as a reliable short trajectory ST _j, otherwise, the ith fragment trajectory is disassembled;

The invention is further improved in that the specific process of splicing the detection frames with the similarity higher than the threshold value into the Kalman sequence spectrum is as follows:

firstly, finding a detection frame KH of an nth frame of pedestrian picture pair: detection frame in nth frame and n+1st frame pictures In/>For the ith detection box in the nth frame,/>Searching each pair of detection frames which possibly belong to the same target and are close in IOU relation for the detection frames in the n+1st frame of picture; if/>Then it willAnd/>Marked 0 and 1, respectively, and will/>And/>Called a pair detection frame KH;

there will be several pairs KH in the nth and n +1 frames (e.g., And/>)；/>Representing detection frame/>, in KFMRepresenting the order in the corresponding fragment trajectories;

Second, predicting:

Predicting the position of the pedestrian target in the next frame of picture according to the motion module of each pedestrian target in the current n+1st frame of picture

Thirdly, track growth: selecting and selecting according to formula (3)The most similar detection box is/>Then, let/>And updating the motion model and the appearance model of the unstable track TK _i which is not yet instantiated;

predicting the position in the next frame by using the motion model and the appearance model of the updated unstable track TK _i;

fourth step: repeating the first to third steps for each frame of tracking;

fifth step, instantiating or backtracking: instantiating or backtracking short tracks in the KFM in the current frame according to the following conditions:

a) Instantiation: if the length of the unstable track TK _i in the Kalman sequence spectrum is not updated in the last frame, instantiating the unstable track TK _i as a new reliable track ST _j;

b) Backtracking: if the length of the unstable track TK _i in the Kalman sequence spectrum is smaller than the threshold T _m and is not updated in the last frame, the unstable track TK _i is deleted in the Kalman family graph KFM and set And marking the fragment track in the Kalman sequence spectrum as a forbidden route.

The invention is further improved in that the specific process of the second step is as follows: according toAnd/>Establishing a motion model of an unstable track TK _i, predicting the position of a pedestrian target belonging to a track TK _i in (n+2) _th frames according to the motion model, and defining the position as/>

Compared with the prior art, the invention has the beneficial effects that:

First: the method is suitable for any track spliced multi-target tracking algorithm, namely, the method is not limited to the constraint of different tracks generated by a plurality of targets such as pedestrians, moving objects and the like, the tracking precision can be effectively improved, and the identity conversion value can be reduced;

Second,: the generated tracking result can be checked, and the error tracking result is corrected, so that the algorithm is more robust, for example, the target positioning error in the current video frame caused in pedestrian target tracking, namely, when the tracking result of the online multi-target tracking algorithm is wrong, the error can be detected through the backtracking module in the time window of the method, so that the tracking track is corrected;

Third,: the method has the advantages that through masking and covering (IOU) of the intersection area between the targets, the degree of distinction between multiple targets is effectively improved with extremely low calculation cost, the problem of target feature disappearance caused by serious shielding in crowded places such as malls, station intersections and the like can be effectively relieved, and the feature distinction degree between incompletely shielded targets and shielding targets is effectively improved;

Fourth,: the invention can utilize the global information in the existing time window to check and correct the error tracking result in a certain time under the condition of meeting the real-time requirement. The method has very robust results in various extreme scenes and has very good suitability for other algorithms similar to short track splicing.

Drawings

FIG. 1 is a flowchart of the backtracking mechanism algorithm of the present invention.

FIG. 2 is a flowchart of the overall algorithm of the present invention.

FIG. 3 is a schematic diagram of an IOU mask module in accordance with an embodiment of the present invention.

FIG. 4 is a schematic diagram of appearance model creation in an example of the invention.

Fig. 5 is a comparison of the performance of each algorithm of MOT 2015.

Fig. 6 is a comparison of the MOT2015 algorithms FPS.

Detailed Description

The present invention will be described in detail with reference to the accompanying drawings.

The invention adopts MOT of semi-online mechanism, and the method can make good compromise and optimization in terms of real-time performance and precision.

Referring to fig. 1, the specific process of the present invention is: and obtaining a detection frame of the pedestrian or the moving object by using a video of the pedestrian or the moving object shot by a camera through a YOLO-V3 detector, namely, eliminating other objects or backgrounds for the frame of the pedestrian or the moving object. Acquiring video of a period of time, obtaining a Kalman sequence spectrum according to position change information among detection frames in a period of time window, finding a pair of Kalman heads (KALMAN HEAD, KH) according to the Kalman sequence spectrum, obtaining a detection frame of a target or a moving object to be tracked in the next frame through similarity of an appearance model, a motion model and a size change model, and enabling the target or the moving object to be always in a detection frame in the frame, otherwise, indicating that the target is lost. And splicing the detection frames with the similarity higher than the threshold value into a Kalman sequence spectrum, and updating a motion model and an appearance model in the Kalman sequence spectrum to be used for tracking the pedestrian or moving object target in the next frame.

The similarity of the appearance model is obtained through the following processes:

In frame n _th, patch size is fixed to [64,128], the number of pixel histogram bins is 64, there are D detection boxes, D patches, and the X detection box is expressed as Patch corresponding to the X-th detection frame is/>

Then, in the n _th frames, the present invention performs the crop and restore operations on the region where each detection box is located (e.g., the present invention cuts and resizes each patch to a tensor of the shape [64,128 ]. After these operations, the present invention obtains D patches equal in number to the fixed size of the detection frame. The invention then divides the pixels of the D patches into groups (e.g. 64 groups) each by color interval,

And grouping the resulting matrix result reshape into a one-dimensional vector Tsr _X, i.e., the present invention obtains a1 x 144 tensor from a 3 x 64 tensor remodel, which is obtained by grouping the color intervals. The invention then takes Tsr _X as theIs a representation vector of appearance features of (a). In combination with the appearance model function of [12], the invention obtains an appearance model, and represents the appearance model of the X-th detection frame and the Y-th track as: f (X) and f (Y). Finally, the present invention updates the appearance model by vector fusion, which can be expressed as/>Thus, the appearance similarity can be obtained as shown in the following formula (3-1).

Wherein: Λ ^A (X, Y) represents the similarity that is the appearance model. This approach would be an effective way to enhance the discrimination between targets when the present invention fails to pass through a single physical motion model, where the relationship between the trajectory and the detection frame is complex.

The similarity between the motion model and the size change model is obtained through the following processes: the time difference between adjacent frames is delta _t, and the kth object in the nth frame isPosition center coordinates/>For (c, d), the velocity vector of the corresponding coordinate is/>Acceleration vector/>, corresponding to coordinatesFor/>The target corresponds to the detection frame size/>Is (w, h) corresponds to the change speed/>For/>The driving force of change is/>The detector impact factor is α (the higher the mIOU of the detector the higher the value, defaulting to 0.7).

Motion state of the kth object in the nth frameAnd size status/>Respectively/>And/>Covariance matrix/>, between element factors in motion stateCovariance matrix between element factors in size state is/>According to the law of physical motion, a position prediction equation and a size prediction equation for the next frame are obtained as follows:

I.e.

For any two segments of the trajectories X and Y,And/>The forward velocity vector from the head to the tail of track X and the reverse velocity vector from the tail to the head of track Y are respectively available from equations (3-10) and (3-11). /(I)Representing the course of motion simulated by a kalman filter. F (X, Y) is a forward similarity score pointing from the tail of track X toward the head of track Y; /(I)Is the inverse similarity score pointing from the head of track Y to the tail of track X. The similarity is as follows:

the overall similarity can be expressed as (3-12), Λ ^M (X, Y) representing the similarity between the first segment of trajectory X and the second segment of trajectory Y calculated by equation (3-10) and equation (3-11). The value range of Λ ^M (X, Y) is [0,1], and the value of Λ ^M (X, Y) is closer to 1, which represents that the more likely the first segment of track X and the second segment of track Y belong to the same target in the physical motion simulation process of the model, which is an important basis for judging the connection between the fragment tracks.

Track confidence can be intuitively understood as the degree of matching between the constructed track and the actual track of the object. The confidence level of the trace conf (T _i) can be represented by equation (3-13).

Wherein: Representing the average similarity between the various detections in the existing trajectory,/> Representing two detection boxes/>, in track T _i And/> Representing the continuity of the trajectory, α is the number of frames that the object is lost, and β is a control parameter (default 0.4) related to the accuracy of the detector.

The video sequence selected by the semi-online mechanism on the time axis is positioned between the online mechanism and the offline mechanism, which is a good compromise between the online mechanism and the offline mechanism in performance, but the semi-online tracking mechanism can be well optimized in real-time performance and accuracy through optimization of algorithms such as shielding optimization, semantic segmentation optimization and the like.

The present invention defines the length of the time window as N. The minimum instantiation length of the short track is T _m. The kalman family graph may be denoted as KFM for recording the detection relationship between the motion model and the appearance model. Note that the kth detection box in the nth frame is represented asIt also contains the coordinates and reliability detected in the list: [ x, y, w, h, conf ]; use of the invention/>Representing detection frame/>, in KFMRepresenting the order in the corresponding fragment trajectories. It can be represented by the following mathematical expression:

If it is It indicates that the detection box has not cascaded with any fragment trajectories in the KFM. x representsIs the (x+1) th member of a segment of the fragment trajectory in the KFM. The ith fragment track in KFM is defined as TK _i, which is instantiated as a reliable short track ST _j if its length is greater than T _m and its motion model, appearance model, size model are not updated in the nth frame, otherwise the track will be disassembled.

Taking the situation of the n-th frame pedestrian picture as an example, the invention introduces a short track tracking process and a track backtracking strategy in the invention, as shown in fig. 2:

firstly, finding a detection frame KH of an nth frame of pedestrian picture pair: detection frame in nth frame and n+1st frame pictures In/>For the detection frame in the nth frame picture,/>And searching each pair of detection frames which possibly belong to the same target and are close in IOU relationship for the detection frames in the n+1st frame of picture. If/>Then/>AndMarked 0 and 1, respectively, and will/>And/>Referred to as a pair detection frame KH.

After this step, the present invention will have pairs of KH in the nth and n +1 frames (e.g.,And/>)。Representing detection frame/>, in KFMRepresenting the order in the corresponding patch trajectories, the ith detection box in the nth frame is denoted/>

Second, predicting:

And predicting the positions of the pedestrian targets in the next frame of picture according to the motion module of each pedestrian target in the current n+1st frame of picture. The specific process is as follows:

According to And/>And establishing a motion model of the unstable track TK _i. Based on the motion model, the position of the pedestrian object belonging to the track TK _i in the (n+2) _th frame is predicted and defined as/>

Thirdly, track growth: selecting and selecting a matching strategy according to formula (3)The most similar detection box is/>Then, let/>And updates the motion model and the appearance model of the unstable trajectory TK _i that has not been instantiated.

The position in the next frame is predicted using the motion model and the appearance model of the updated unstable trajectory TK _i.

Fourth step: the tracking for each frame is repeated through the first to third steps.

Fifth step, instantiating or backtracking: short tracks in a KFM (e.g., TK ₀,TK₁,…,TK_i) in the current frame are instantiated or traced back according to the following conditions:

a) Instantiation: if the length of the unstable track TK _i in the Kalman sequence spectrum is not updated in the previous frame, the unstable track TK _i is instantiated as a new reliable track ST _j.

For example, if the length of the unstable track TK _i is greater than or equal to the threshold T _m, it is a reliable track. The threshold T _m is determined according to the actual situation, and is generally taken to be 5.

B) Backtracking: if the length of the unstable track TK _i in the Kalman sequence spectrum is smaller than the threshold T _m and is not updated in the last frame, the unstable track TK _i is deleted in the Kalman family graph KFM and setThe patch trajectories in the kalman sequence spectrum are then marked as forbidden routes to avoid later exploration of the reoccurrence of the path.

The invention adopts the IOU shade module to process the mutual shielding situation of two or more targets, and the process is as follows. As shown in fig. 3, a scene in which targets are blocked from each other is shown. When the object A and the object B are mutually shielded, before extracting the features from the detection frame area, the IOU area between the A and the B is used as a mask to cover the pixel information of the IOU area, so that the related objects are prevented from sharing the feature information of the IOU area, and the distinguishing degree of appearance models of different objects is effectively improved. However, when a plurality of objects block each other, it is easy to cause a phenomenon that the detection area of the object is almost completely covered by a plurality of IOU masks, thereby causing the appearance feature of the blocked object to be completely covered. To avoid this, the present invention provides thresh _IOU to avoid the worst case.

Referring to fig. 3, in the nth frame, the kth detection frame is marked A set of IOU masks between each detection box is/>For the kth detection box, assume that the detection box set/>All of the detection frames are identical to/>When the shielding phenomenon occurs in the detection frame area, the detection frame set/>Detection box in the IOU mask/>, IOU mask/>Record as all occlusion merge area/>If it is/>Post-overlay/>Is/>Then by/>Cover/>The process of (1) is expressed as/>

If it is obtainedLess than a predetermined threshold value Thres ^IM, i.e./>The appearance characteristics of the target are difficult to express in an appearance model, so the invention adopts the method of detecting the frame set/>In detection frame and/>Sequencing the sizes of the shielding areas, and then sequentially removing the detection frames with the minimum shielding area one by one from the detection frame setIn addition, a new detection frame set/>, is obtainedAnd collect the new detection frame/>Re-take into equations (4) and (5) for calculation until/>When the final IOU mask is obtained, the final IOU mask/>As an IOU mask module, when appearance feature extraction is performed, setting the pixel value of the original image area where the appearance feature extraction is located to be zero: the shielded target and the shielded target pass through the shielding intersection area, so that the characteristic distinction between the targets is increased.

The following are specific examples.

The time window length is first set to 40 frames. The video and the detection frames of each frame are used as input, the patch of the corresponding detection frame obtained after clipping and resize is subjected to feature extraction, and appearance representation is carried out in a pixel histogram grouping mode as shown in fig. 3, so that an appearance model is established. The appearance model is built by the following process: in the n _th frame, the patch size is fixed to [64,128], the pixel histogram grouping number is 64, and there are D detection boxes in the frame, D patches, and the X detection box is expressed asPatch corresponding to the X-th detection frame is/>

In frame n _th, the crop and restore operations are performed on the region where each detection box is located (e.g., the present invention cuts and resizes each patch to a tensor of the shape [64,128 ]. After these operations, the present invention obtains D patches equal in number to the fixed size of the detection frame.

The invention then divides the pixels of the D patches into groups (e.g., 64 groups) each according to the color interval, and divides the matrix result reshape obtained by the grouping into a one-dimensional vector Tsr _X. That is, the present invention obtains a1×192 tensor remodeled from a 3×64 tensor, which is obtained by grouping color intervals. Then the invention takes the one-dimensional vector Tsr _X as the patch corresponding to the X-th detection frameIs a model of the appearance of (a).

The appearance model of the X-th detection frame and the appearance model of the Y-th track are expressed as: f (X) and f (Y). Finally, the present invention updates the appearance model by vector fusion, which can be expressed asThus, the appearance similarity can be obtained as shown in formula (7).

Wherein: Λ ^A (X, Y) represents the similarity that is the appearance model.

After the above steps, a chip track with greatly improved reliability can be obtained, and the noise occurring in the detection can be effectively examined, as shown in table 1.

Table 1 comparison of results of the algorithms on MOT15

Referring to fig. 5 and 6, the present invention combines the advantages of online and offline MOTs at the expense of a small amount of real-time, with good precision improvement at MOTA, MOTP, IDS, ML, MT, FM, and a suitable balance between real-time and accuracy.

The algorithm of the present invention performs substantially better on the data set MOT2015 than baseline, except fps, where MOTA, MOTP are raised by 12.6 and 6.3 percent, respectively, which illustrates a very large improvement in continuous tracking ability of the target over baseline, as a whole, which also illustrates that the short trace patch generated by the algorithm of the present invention is more accurate and robust; the algorithm of the present invention reduces 82 on IDS with less boost in the overall number of identity transitions. The algorithm of the invention has higher MT value and lower ML value, which shows that, to a certain extent, a more robust short track can reduce the number of missing frames between partial fragment tracks. The MOTA bit column of the algorithm of the present invention is first, and other metrics such as MOTP, recall, IDS are generally far better than average, from the overall performance among the algorithms, which means that the algorithm of the present invention has greater stability and generalization capability in the algorithm framework. It is particularly notable that the trace-back mechanism does not include any module involving complex computation, and only relies on a simple side appearance model and a motion model to perform the trace-back process through states between online and offline, so that the algorithm of the present invention has extremely obvious FPS advantages compared to other algorithms in the table.

The method adopts a semi-online mechanism to optimize the instantaneity and the accuracy of the multi-target tracking method. The method can detect and correct the established tracking result, effectively improves the target appearance characteristic degree, has high speed and small calculation resource requirement, can be used on an application-specific integrated circuit such as an Inweida TX2 and the like in the scenes of automatic driving, pedestrian tracking and the like, and effectively solves the problem that the real-time performance and the algorithm precision such as MOTA index of the current multi-target tracking algorithm are difficult to reach the best simultaneously.

Claims

1. A multi-target tracking method of semi-online machine-set is characterized in that a detection frame of a pedestrian target is obtained through a YOLO-V3 detector according to a pedestrian target video, a Kalman sequence spectrum is obtained according to position change information among the detection frames in a period of time window, then a pair of Kalman heads are found according to the Kalman sequence spectrum, the detection frame of the pedestrian target to be tracked in the next frame is obtained through similarity of an appearance model, a motion model and a size change model, and the target is located in a detection frame in the frame, otherwise, the target is indicated to be lost; splicing a detection frame with the similarity higher than a threshold value into a Kalman sequence spectrum, updating a motion model and an appearance model in the Kalman sequence spectrum, and tracking a pedestrian target in the next frame;

In the first place In the frame, mark the/>Individual detection frames/>，/>One set of IOU masks between each detection box isFor the/>The number of detection frames is assumed to be equal to the number of detection frames/>All of the detection frames are identical to/>Shielding in the detection frame region, the detection frame set/>The detection box in the box gets the IOU mask/>Record as all occlusion combined area/>; Is/>Post-overlay/>Is/>；

（4）

（5）

If it is obtainedLess than a predetermined threshold/>According to the detection frame set/>In detection frame and/>Sequencing the area of the shielding region, and then sequentially removing the detection frames with the smallest shielding region area from the detection frame set one by oneIn addition, a new detection frame set/>, is obtainedAnd collect the new detection frame/>Re-take into equations (4) and (5) for calculation until/>When a final IOU mask is obtained, the final/>As an IOU mask module, when appearance feature extraction is performed, setting the pixel value of the original image area where the appearance feature extraction is located to be zero: the shielded target and the shielded target pass through a shielding intersection area, so that the characteristic distinction degree among the targets is increased;

Defining the length of the time window as N, and the minimum instantiation length of the short track as The Kalman family graph is denoted KFM, and the kth detection box in the nth frame is denoted/>，/>Representation/>Detection frame/>Corresponding to the order in the patch trajectories;

（1）

If it is Then it indicates that the detection frame is at/>Not yet concatenated with any fragment trajectories,/>Representation/>Is thatFirst (/ >) of a segment of fragment trajectoryMembers,/>/>The individual fragment trajectories are defined as/>If/>The length of individual fragment trajectories is greater than/>And its motion model, appearance model, size model at/>Not updated in the frame, then the/>The individual fragment trajectories are instantiated as reliable short trajectories/>Otherwise, the/>The individual fragment trajectories are disassembled;

（2）

for the detection frames with similarity higher than the threshold value, the specific process of splicing the detection frames into the Kalman sequence spectrum is as follows:

First step, find the first Pair detection frame/>, of frame pedestrian picture: In/>Frame sum/>Detection frame (/ >) in frame picture,) In/>For/>Detection frame in frame map,/>Finding out each pair of detection frames which possibly belong to the same target and are close in IOU relation in the frame picture; if/>,/>Then willMarked as/>, respectivelyAnd/>And will/>Called/>；

At the nth frame andThere are several pairs/>；/>Representation/>Detection frame/>The ith detection box in the nth frame is denoted/>, in the order in the corresponding patch track；

Second, predicting:

Predicting the position of the pedestrian target in the next frame of picture according to the motion module of each pedestrian target in the current n+1st frame of picture ；

Thirdly, track growth: selecting and selecting according to formula (3)The most similar detection box is/>; Then, let theAnd updates the unstable track/>, which has not been instantiatedA motion model and an appearance model of (a);

else 0（3）

Using updated unstable trajectories Predicting the position in the next frame by the motion model and the appearance model;

fourth step: repeating the first to third steps for each frame of tracking;

a) Instantiation: if unstable trajectories in the Kalman sequence spectrum Length is greater than or equal to threshold/>If not updated in the previous frame, the unstable track/>Instantiate as new reliable track/>；

B) Backtracking: if unstable trajectories in the Kalman sequence spectrumIs less than a threshold/>And no update in the previous frame will be in the Kalman family map/>Unstable track/>And set/>And marking the fragment track in the Kalman sequence spectrum as a forbidden route.

2. The multi-target tracking method of a semi-online machine according to claim 1, wherein the specific process of the second step is: according toAnd/>Build unstable track/>Predicting a motion model belonging to a trajectory/>, based on the motion modelPedestrian target at/>The position of the frame and defining the position as/>。