CN114120188B - Multi-pedestrian tracking method based on joint global and local features

Multi-pedestrian tracking method based on joint global and local features

Info

Publication number
CN114120188B
CN114120188B
Authority
CN
China
Prior art keywords
detection frame
track
frame
detection
new
Prior art date
Legal status
Active
Application number
CN202111373622.5A
Other languages
Chinese (zh)
Other versions
CN114120188A (en)
Inventor
陈军
孙志宏
梁超
王晓芬
柴笑宇
杨斌
姚红豆
邱焰升
高浩
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN202111373622.5A
Publication of CN114120188A
Application granted
Publication of CN114120188B
Status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The invention discloses a multi-pedestrian tracking method based on joint global and local features, which comprises the following steps: first, fusing the original detection frames, detecting key points in the fused detection frames, and filtering the key points by confidence to determine the final retained key point coordinates; second, generating a new detection frame from the retained key point coordinates; then, extracting the global features and local features of the new pedestrian detection frame, and computing the similarity between tracks and detection frames with a metric designed over the joint global and local features; finally, executing a tracking management strategy that updates, terminates, and otherwise maintains tracks to obtain the final motion trajectories. The method addresses the inaccurate expression of pedestrian identity features caused by occlusion in crowded scenes, effectively improves the quality of the original detection results, strengthens the expressive power of the pedestrian identity features, raises data association precision, and improves tracking accuracy.

Description

Multi-pedestrian tracking method based on joint global and local features
Technical Field
The invention relates to the technical field of surveillance target tracking, and in particular to a multi-pedestrian tracking method based on joint global and local features.
Background
Multi-target tracking is a mid-level task in computer vision with broad application prospects, such as security surveillance, sports analysis, autonomous driving, and biomedicine. The task of multi-target tracking is to take a video segment as input and output the trajectories of the targets appearing in the video. Because pedestrian tracking has wide research value, multi-pedestrian tracking is the mainstream of research in the multi-target tracking field.
In recent years, as the performance of detection algorithms has continuously improved, multi-pedestrian tracking based on the tracking-by-detection framework has become the mainstream approach. The principle of the tracking-by-detection framework is to first detect the pedestrians in each frame of the video and then extract their appearance features to perform data association and form the final motion trajectories. Currently, most researchers use the global features of pedestrian detection frames to characterize pedestrian identities. However, in crowded scenes, frequent occlusion between pedestrians means that an occluded pedestrian's detection frame often contains information from other interfering objects, so the extracted global features contain interference, leading to inaccurate feature expression, degraded matching precision in data association, and reduced tracking accuracy. Therefore, the representation of the identity features of occluded pedestrians in crowded scenes plays a very important role in the accuracy of multi-pedestrian tracking.
Researchers have expressed pedestrian identity features by introducing local features. In [1], the authors design a part-based multi-target tracking method that uses a part detector to detect pedestrians, obtaining a pedestrian detection frame composed of several sub-blocks; the HOG feature of each sub-block is extracted, and the similarity between corresponding sub-blocks of two pedestrians is computed. The similarities of all sub-blocks are fused using the detection confidence of each sub-block as its weight. In [2], a multi-target tracking method based on locally online-learned appearance models is adopted: each pedestrian detection frame is normalized to a size of 24×58 and then cut into 15 small sub-blocks along the horizontal and vertical directions, and a color histogram is extracted from each of the 15 sub-blocks as its appearance feature. [3] builds a part-based multi-target tracking method to handle partial occlusion: the pedestrian detection frame is divided into 8 sub-blocks, HOG features of the sub-blocks are extracted, and a first-order Markov model performs data association. [4] proposes a multi-target tracking method based on main parts: a part whose appearance changes little over time is considered a main part, while a part whose appearance changes greatly is considered occluded. Although these methods of extracting local features account for occlusion to some extent, they ignore the overall internal relations between the sub-blocks of a detection frame, making it difficult to express the whole image effectively.
Related references:
[1] Izadinia H, Saleemi I, Li W, et al. (MP)2T: Multiple people multiple parts tracker[C]//Proceedings of the European Conference on Computer Vision. Springer, 2012: 100-114.
[2] Yang B, Nevatia R. Online learned discriminative part-based appearance models for multi-human tracking[C]//Proceedings of the European Conference on Computer Vision. Springer, 2012: 484-498.
[3] Liu H, Chang F. A novel multi-object tracking method based on main-parts model[C]//Proceedings of the Chinese Control and Decision Conference. IEEE, 2017: 4569-4573.
[4] Shu G, Dehghan A, Oreifej O, et al. Part-based multiple-person tracking with partial occlusion handling[C]//2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012: 1815-1821.
Disclosure of the Invention
In view of these problems, the invention provides a multi-pedestrian tracking method based on joint global and local features. The method first corrects the original detection frames with a detection corrector, improving the quality of the original detections. Second, the global features and local features of each detection frame are extracted to express pedestrian identity, and a similarity metric over the joint global and local features is designed to associate pedestrians. Finally, a tracking management strategy is executed: tracks associated with detection frames are updated, tracks not matched to any detection frame are suspended, and detection frames not matched to any track are initialized as new tracks.
The aim of the invention can be achieved by the following technical scheme:
a multi-pedestrian tracking method based on joint global and local features, comprising the steps of:
step 1, acquiring the detection results of each frame from a public data set;
step 2, detection frame fusion; the detection results on the public data set may suffer from missed detections and false detections and need to be corrected to obtain more accurate results, so detection frame fusion is adopted: whenever the effective overlap rate between two detection frames in a frame exceeds a certain threshold, they are merged into a new first detection frame;
step 3, key point detection; after the new detection frames are obtained through the fusion of step 2, key points are detected, each new detection frame containing a large number of key points;
step 4, key point filtering; a threshold is set and key points with low confidence are filtered out; if the number of remaining key points of a pedestrian exceeds a certain threshold, the detection is considered correct;
step 5, detection correction; a new second detection frame is recovered from the key points retained in the new detection frame, and a corrected detection frame is obtained from the boundary key points using the proportional relation between human key points and human height;
step 6, feature extraction, in which a convolutional neural network extracts features from the corrected detection frame; a pedestrian re-identification PCB network is first trained on a pedestrian re-identification data set and then used to extract features from the corrected detection frame; the PCB network divides the pedestrian into p blocks along the horizontal and vertical directions and obtains a feature vector for each block, and the visible-region labels of the pedestrian are computed by checking whether key points exist in each block of the pedestrian detection frame;
step 7, local appearance feature association: after features of a historical track and of a target detection frame in frame t are extracted with the method of step 6, the local feature association is computed as the cosine distance between their local feature vectors;
step 8, global appearance feature association; after features of a historical track and of a target detection frame in frame t are extracted with the method of step 6, the global feature association is computed as the cosine distance between their global feature vectors;
step 9, data association; the global appearance features, local appearance features, and visibility labels of a historical track and of a target detection frame in frame t from steps 6-8 are fused to compute the data association, and the Hungarian matching algorithm is then applied to obtain the optimal matching result;
step 10, tracking management; after the optimal solution of the data association is obtained through the Hungarian algorithm in step 9, the successfully matched detection frame and track pairs, the detection frames not matched to any track, and the tracks not matched to any detection frame are returned; a track matched to a detection frame is updated; a track not matched to any detection frame is set to the suspended state; a detection frame not matched to any track is treated as a new track, initialized, and added to the track set;
and step 11, repeating steps 2-10 until all frames are processed, and outputting the target trajectories.
Further, in step 2, the detection frames whose effective overlap rate in a frame exceeds a certain threshold are merged to obtain a new detection frame;

the effective overlap rate is calculated as follows:

$$r_{\mathrm{eff}}(D_t^A, D_t^B) = \frac{\operatorname{area}(D_t^A \cup D_t^B)}{\operatorname{area}\big(\operatorname{cover}(D_t^A, D_t^B)\big)}$$

where $D_t^A$ and $D_t^B$ denote the A-th and B-th detection frames in frame t, and $\operatorname{cover}(D_t^A, D_t^B)$ denotes the smallest rectangle that can cover the two frames;
the new detection frame is calculated as follows:

$$x_{new} = \min(x_A, x_B)$$
$$y_{new} = \min(y_A, y_B)$$
$$w_{new} = \max(x_A + w_A,\ x_B + w_B) - x_{new}$$
$$h_{new} = \max(y_A + h_A,\ y_B + h_B) - y_{new}$$

where $(x_A, y_A)$ are the top-left coordinates of detection frame A, $(w_A, h_A)$ its width and height, $(x_B, y_B)$ the top-left coordinates of detection frame B, and $(w_B, h_B)$ its width and height.
Further, the key points detected in step 3 are respectively: left eye, right eye, nose, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle.
Further, in step 4, a pedestrian detection is considered correct if its number of remaining key points exceeds a certain threshold, and the specific filtering formula is as follows:

$$\varepsilon_\tau = \begin{cases} 1, & CS_\tau > \gamma \\ 0, & \text{otherwise} \end{cases} \qquad \operatorname{count}(\beta) = \sum_{\tau} \varepsilon_\tau$$

where $\varepsilon_\tau$ is a binary variable: $\varepsilon_\tau = 1$ indicates that the confidence of the τ-th key point is greater than the threshold and the key point is kept, otherwise it is deleted; $\operatorname{count}(\beta)$ denotes the number of valid key points in the β-th detection frame, $CS_\tau$ denotes the confidence score of the τ-th key point, and γ denotes a preset threshold.
Further, in step 5, the newly corrected detection frame is obtained from the boundary key points; the specific correction formula is as follows:

$$D_{new} = \{L_x,\ T_y,\ R_x - L_x + \theta,\ B_y - T_y + \rho\}$$

where $L_x$ and $R_x$ are the leftmost and rightmost x-axis coordinates of the key points in the corrected detection frame, $T_y$ and $B_y$ are the uppermost and lowermost y-axis coordinates of the key points in the corrected detection frame, and θ and ρ are two parameters representing the x-axis offset and the y-axis offset respectively.
Further, in step 6, whether key points exist in each block of the pedestrian detection frame is checked to compute the visible-region labels of the pedestrian; the specific calculation formula is as follows:

$$l_v = \begin{cases} 1, & \exists\, s:\ \frac{(v-1)H}{p} \le y(q_s) < \frac{vH}{p} \\ 0, & \text{otherwise} \end{cases} \qquad v = 1, \ldots, p$$

where $l_v$ indicates whether the v-th block is visible (1 if visible, otherwise 0), $q_s$ denotes the s-th key point in the pedestrian detection frame with $y(q_s)$ its vertical coordinate, p denotes the total number of blocks of the detection frame, and H denotes the height of the image.
Further, in step 7, the local appearance feature association is calculated as follows:

$$d_v = 1 - \frac{f_{i,v} \cdot f_{j,v}}{\lVert f_{i,v} \rVert\, \lVert f_{j,v} \rVert}$$

where $f_{i,v}$ and $f_{j,v}$ are the feature vectors of the v-th block of historical track i and of a target detection frame j in frame t respectively, $d_v$ is the feature distance of the v-th block between them, and p denotes the total number of blocks of the detection frame.
Further, the global appearance feature similarity metric in step 8 is as follows:

$$d_g = 1 - \frac{f_i^g \cdot f_j^g}{\lVert f_i^g \rVert\, \lVert f_j^g \rVert}$$

where $f_i^g$ and $f_j^g$ are the global feature vectors of historical track i and of a target detection frame j in frame t respectively, and $d_g$ is the global feature distance between them.
Further, the data association in step 9 is implemented as follows;

the data association is realized by calculating the feature distance dist between historical track i and target detection frame j; the specific calculation formula is as follows:

$$\operatorname{dist} = \frac{d_g + \sum_{v=1}^{p} l_v^i\, l_v^j\, d_v}{1 + \sum_{v=1}^{p} l_v^i\, l_v^j}$$

where $l_v^i$ and $l_v^j$ are the visibility scores of the v-th block of historical track i and of a target detection frame j in frame t respectively (0 means the block is invisible, 1 means it is visible), and p denotes the total number of blocks of the detection frame.
Further, in step 10, the specific tracking management method is as follows:
the matching result of this step has 3 cases: detection frame and track pairs that matched successfully, tracks not matched to any detection frame, and detection frames not matched to any track;
for a successfully matched detection frame and track pair, the historical track i is updated with the target detection frame j of frame t that it matched;
for a track not matched to any detection frame, the state of historical track i is set to suspended and its track ID is added to the vanished-target set;
for a detection frame not matched to any track, the target detection frame j of frame t is initialized as a track, assigned a new track ID, and added to the track set.
Compared with the existing multi-pedestrian tracking technology, the invention has the following advantages and beneficial effects:
1) Compared with the prior art, the method solves the inability of multi-pedestrian tracking to handle missed and false detections in crowded scenes, greatly improving detection quality. The detection frame fusion strategy designed by the invention eliminates redundancy and false detections in the original detection results, and the detection frames missed in the original results are recovered through key point detection on the new detection frames.
2) The invention adopts a tracking strategy that fuses global and local features, which can express the identity of the whole pedestrian while also handling partial occlusion. The designed feature matching strategy is simple and effective, making the invention easier to realize in practical engineering and improving engineering efficiency.
Drawings
FIG. 1 is a diagram of a detection corrector designed by the present invention.
Fig. 2 is a system framework diagram of the present invention.
Detailed Description
To help those of ordinary skill in the art understand and practice the present invention, the invention is described in further detail below with reference to the accompanying drawings. It should be understood that the examples described herein are for illustration and explanation only and are not intended to limit the invention.
Compared with existing methods that use only global features, the present method can effectively express the identity of occluded pedestrians, improving the data association matching precision in multi-pedestrian tracking and the overall tracking accuracy. The invention first corrects the original detection results with a detection corrector to resolve missed detections, false detections, and similar problems; it then extracts the global and local features of the pedestrian detection frames and applies a newly designed feature matching strategy for multi-pedestrian tracking to perform data association. Finally, tracking management updates the tracks matched to detection frames after data association, suspends the tracks not matched to any detection frame, and initializes tracks for the detection frames not matched to any track, yielding the final pedestrian trajectories.
The specific implementation method comprises the following steps:
step S1: and (5) detecting pedestrians. Obtaining a pedestrian detection frame set D, D= { D by adopting an original detection result on the public data set 1 ,D 2 ,…,D t ,…};
Wherein D is t A set of boxes is detected for pedestrians over t frames.
Step S2: detection frame fusion. The detection results of step S1 may contain missed detections, false detections, and similar problems, and need to be corrected to obtain more accurate results. The detection frame fusion strategy merges the detection frames whose effective overlap rate in a frame exceeds a certain threshold into a new detection frame.

The effective overlap rate is calculated as follows:

$$r_{\mathrm{eff}}(D_t^A, D_t^B) = \frac{\operatorname{area}(D_t^A \cup D_t^B)}{\operatorname{area}\big(\operatorname{cover}(D_t^A, D_t^B)\big)}$$

where $D_t^A$ and $D_t^B$ denote the A-th and B-th detection frames in frame t, and $\operatorname{cover}(D_t^A, D_t^B)$ denotes the smallest rectangle that can cover the two frames.
The new detection frame is calculated as follows:

$$x_{new} = \min(x_A, x_B)$$
$$y_{new} = \min(y_A, y_B)$$
$$w_{new} = \max(x_A + w_A,\ x_B + w_B) - x_{new}$$
$$h_{new} = \max(y_A + h_A,\ y_B + h_B) - y_{new}$$

where $(x_A, y_A)$ are the top-left coordinates of detection frame A, $(w_A, h_A)$ its width and height, $(x_B, y_B)$ the top-left coordinates of detection frame B, and $(w_B, h_B)$ its width and height.
Step S2 is performed repeatedly until no pair of detection frames in frame t satisfies the condition.
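As a concrete illustration, the fusion strategy of step S2 can be sketched in a few lines of Python. This is a minimal sketch under the overlap-rate reading reconstructed above (union area over the area of the smallest covering rectangle); the threshold of 0.7 and all function names are illustrative assumptions rather than values fixed by the patent.

```python
def area(box):
    # box = (x, y, w, h): top-left corner, width, height
    return box[2] * box[3]

def cover_box(a, b):
    # Smallest axis-aligned rectangle covering both boxes; this is
    # exactly the new-detection-frame formula of step S2.
    x = min(a[0], b[0])
    y = min(a[1], b[1])
    w = max(a[0] + a[2], b[0] + b[2]) - x
    h = max(a[1] + a[3], b[1] + b[3]) - y
    return (x, y, w, h)

def intersection_area(a, b):
    w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def effective_overlap(a, b):
    # Assumed reading: union area over the minimal covering area.
    union = area(a) + area(b) - intersection_area(a, b)
    return union / area(cover_box(a, b))

def fuse_frame(boxes, thresh=0.7):
    # Repeatedly merge any pair whose effective overlap exceeds the
    # threshold, until no pair in the frame satisfies the condition.
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if effective_overlap(boxes[i], boxes[j]) > thresh:
                    boxes[j] = cover_box(boxes[i], boxes[j])
                    del boxes[i]
                    merged = True
                    break
            if merged:
                break
    return boxes
```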
Step S3: key point detection. After the new detection frames of each frame are obtained in step S2, key points are detected using an open-source human key point detection algorithm. Suppose K pedestrians are detected in the new detection frames; each pedestrian contains 17 key points: the left eye, right eye, nose, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle, giving 17 × K key points in total.
Step S4: key point filtering. A threshold is set and key points with low confidence are filtered out. A pedestrian detection is considered correct if its number of remaining key points exceeds a certain threshold. The specific filtering formula is as follows:

$$\varepsilon_\tau = \begin{cases} 1, & CS_\tau > \gamma \\ 0, & \text{otherwise} \end{cases} \qquad \operatorname{count}(\beta) = \sum_{\tau} \varepsilon_\tau$$

where $\varepsilon_\tau$ is a binary variable: $\varepsilon_\tau = 1$ means the confidence of the τ-th key point is greater than the threshold and the key point is kept, otherwise it is deleted; $\operatorname{count}(\beta)$ denotes the number of valid key points in the β-th detection frame, $CS_\tau$ the confidence score of the τ-th key point, and γ a preset threshold.
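A minimal sketch of this filtering rule, assuming the detector returns key points as (x, y, confidence) triples; the values of gamma and of the minimum valid-key-point count are illustrative placeholders, since the patent only states that both are preset thresholds.

```python
def filter_keypoints(keypoints, gamma=0.4, min_valid=6):
    """keypoints: list of (x, y, confidence) triples for one detection frame.
    Returns the kept key points, or [] when too few survive, in which
    case the frame is treated as an incorrect detection."""
    kept = [(x, y, c) for (x, y, c) in keypoints if c > gamma]  # eps_tau = 1
    return kept if len(kept) >= min_valid else []               # count(beta) test
```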
Step S5: detection correction. A new detection frame is recovered from the key points retained in the new detection frame. From the boundary key points in the new detection frame, such as the eyes and ankles, the corrected detection frame is obtained using the proportional relation between human key points and human height. The specific correction formula is as follows:

$$D_{new} = \{L_x,\ T_y,\ R_x - L_x + \theta,\ B_y - T_y + \rho\}$$

where $L_x$ and $R_x$ are the leftmost and rightmost x-axis coordinates of the key points in the corrected detection frame, $T_y$ and $B_y$ are the uppermost and lowermost y-axis coordinates of the key points in the corrected detection frame, and θ and ρ are two parameters representing the x-axis offset and the y-axis offset respectively, obtained by regression learning on training samples.
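The correction of step S5 can then be sketched as follows; the fixed theta and rho values are placeholders standing in for the offsets that the patent learns by regression on training samples.

```python
def correct_box(kept_keypoints, theta=10.0, rho=20.0):
    """Rebuild a detection frame from the retained key points.
    theta, rho: x- and y-axis offsets (placeholders here; learned
    by regression in the patent)."""
    xs = [x for (x, y, c) in kept_keypoints]
    ys = [y for (x, y, c) in kept_keypoints]
    L_x, R_x = min(xs), max(xs)   # leftmost / rightmost key point
    T_y, B_y = min(ys), max(ys)   # topmost / bottommost key point
    return (L_x, T_y, R_x - L_x + theta, B_y - T_y + rho)
```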
Step S6: feature extraction. A convolutional neural network extracts features from the corrected detection frame. A pedestrian re-identification PCB network is first trained on a pedestrian re-identification data set and then used for feature extraction. The PCB network divides the pedestrian into p blocks along the horizontal and vertical directions and obtains the feature vector $f_v$ ($v = 1, \ldots, p$) of each block. The visible-region labels of the pedestrian are computed from whether key points exist in each block of the pedestrian detection frame:

$$l_v = \begin{cases} 1, & \exists\, s:\ \frac{(v-1)H}{p} \le y(q_s) < \frac{vH}{p} \\ 0, & \text{otherwise} \end{cases} \qquad v = 1, \ldots, p$$

where $l_v$ indicates whether the v-th block is visible (1 if visible, otherwise 0), $q_s$ denotes the s-th key point in the pedestrian detection frame with $y(q_s)$ its vertical coordinate, p denotes the total number of blocks of the detection frame, and H denotes the height of the image.
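The visible-region labels might be computed as below, assuming the p blocks are equal-height horizontal strips of the detection frame (the usual PCB layout); p = 6 is the common PCB setting and an assumption here.

```python
def visibility_labels(kept_keypoints, box, p=6):
    """l_v for each of the p blocks of one detection frame.
    box: (x, y, w, h) of the corrected detection frame."""
    x0, y0, w, h = box
    labels = [0] * p
    for (x, y, c) in kept_keypoints:
        v = int((y - y0) / h * p)   # index of the strip containing this key point
        if 0 <= v < p:
            labels[v] = 1           # l_v = 1: at least one key point, block visible
    return labels
```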
Step S7: local appearance feature association. After the features of historical track i and of a target detection frame j in frame t are extracted in step S6, the local feature association is computed by the cosine distance of the block feature vectors:

$$d_v = 1 - \frac{f_{i,v} \cdot f_{j,v}}{\lVert f_{i,v} \rVert\, \lVert f_{j,v} \rVert}$$

where $f_{i,v}$ and $f_{j,v}$ are the feature vectors of the v-th block of historical track i and of target detection frame j in frame t respectively, $d_v$ is the feature distance of the v-th block between them, and p denotes the total number of blocks of the detection frame.
Step S8: global appearance feature association. After the features of historical track i and of a target detection frame j in frame t are extracted in step S6, the global appearance feature distance between them is calculated as follows:

$$d_g = 1 - \frac{f_i^g \cdot f_j^g}{\lVert f_i^g \rVert\, \lVert f_j^g \rVert}$$

where $f_i^g$ and $f_j^g$ are the global feature vectors of historical track i and of target detection frame j in frame t respectively, and $d_g$ is the global feature distance between them.
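Steps S7 and S8 both reduce to cosine distances and can be sketched together; the array shapes below are assumptions about how the global and PCB block features are stored.

```python
import numpy as np

def cosine_distance(f1, f2):
    # d = 1 - cos(f1, f2), used for both the global and the block features.
    return 1.0 - float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def local_distances(track_parts, det_parts):
    """Per-block distances d_v between a historical track and a detection
    frame; track_parts and det_parts have shape (p, dim), one row per block."""
    return [cosine_distance(t, d) for t, d in zip(track_parts, det_parts)]
```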
Step S9: data association. The global appearance features, local appearance features, and visibility labels of historical track i and of a target detection frame j in frame t from steps S6-S8 are fused to compute the data association, and the Hungarian matching algorithm is then applied to obtain the optimal matching result. The feature distance dist between historical track i and target detection frame j in frame t is calculated as follows:

$$\operatorname{dist} = \frac{d_g + \sum_{v=1}^{p} l_v^i\, l_v^j\, d_v}{1 + \sum_{v=1}^{p} l_v^i\, l_v^j}$$

where $l_v^i$ and $l_v^j$ are the visibility scores of the v-th block of historical track i and of target detection frame j in frame t respectively (0 means the block is invisible, 1 means it is visible), and p denotes the total number of blocks of the detection frame.
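The fusion and matching of step S9 can be sketched with SciPy's Hungarian solver. The fused-distance form follows the reconstruction above and is an assumed reading of the patent's formula, and max_dist is an illustrative gating parameter the patent does not specify.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fused_distance(d_g, d_parts, vis_track, vis_det):
    # Combine the global distance with the local distances of the
    # blocks that are visible in both the track and the detection.
    gates = [vi * vj for vi, vj in zip(vis_track, vis_det)]
    num = d_g + sum(g * d for g, d in zip(gates, d_parts))
    return num / (1 + sum(gates))

def associate(cost_matrix, max_dist=0.5):
    """Hungarian assignment over the tracks-by-detections cost matrix.
    Returns matched (track, detection) pairs plus the unmatched track
    and detection indices."""
    rows, cols = linear_sum_assignment(cost_matrix)
    matches = [(i, j) for i, j in zip(rows, cols) if cost_matrix[i, j] <= max_dist]
    matched_i = {i for i, _ in matches}
    matched_j = {j for _, j in matches}
    un_tracks = [i for i in range(cost_matrix.shape[0]) if i not in matched_i]
    un_dets = [j for j in range(cost_matrix.shape[1]) if j not in matched_j]
    return matches, un_tracks, un_dets
```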
Step S10: tracking management. After the optimal solution of the data association is obtained through the Hungarian algorithm in step S9, the successfully matched detection frame and track pairs, the detection frames not matched to any track, and the tracks not matched to any detection frame are returned. A track matched to a detection frame is updated; a track not matched to any detection frame is set to the suspended state; a detection frame not matched to any track is treated as a new track, initialized, and added to the track set.

The matching result of this step therefore has 3 cases: detection frame and track pairs that matched successfully, tracks not matched to any detection frame, and detection frames not matched to any track.

For a successfully matched detection frame and track pair, the historical track i is updated with the target detection frame j of frame t that it matched.

For a track not matched to any detection frame, the state of historical track i is set to suspended and its track ID is added to the vanished-target set.

For a detection frame not matched to any track, the target detection frame j of frame t is initialized as a track, assigned a new track ID, and added to the track set.
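The three cases of step S10 map naturally onto a small bookkeeping routine; the Track fields and the dictionary layout of the detections are assumptions made for this sketch, not structures prescribed by the patent.

```python
from dataclasses import dataclass
from itertools import count

_next_id = count(1)

@dataclass
class Track:
    track_id: int
    box: tuple            # last associated detection frame (x, y, w, h)
    features: object      # last global + block feature vectors
    suspended: bool = False

def manage_tracks(tracks, detections, matches, un_tracks, un_dets, lost_ids):
    """Apply the three association cases returned by step S9."""
    for i, j in matches:                 # case 1: matched pair, update track i
        tracks[i].box = detections[j]["box"]
        tracks[i].features = detections[j]["features"]
        tracks[i].suspended = False
    for i in un_tracks:                  # case 2: unmatched track, suspend it
        tracks[i].suspended = True
        lost_ids.add(tracks[i].track_id)
    for j in un_dets:                    # case 3: unmatched detection, new track
        d = detections[j]
        tracks.append(Track(next(_next_id), d["box"], d["features"]))
    return tracks
```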
Step S11: steps S2-S10 are repeated until all frames are processed, and the target trajectories are output.
It should be understood that the foregoing description of the preferred embodiments is for illustration only and is not intended to limit the scope of the invention, which is defined by the appended claims; those skilled in the art can make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims.

Claims (10)

1. A multi-pedestrian tracking method based on joint global and local features, comprising the steps of:
step 1, acquiring the detection results of each frame from a public data set;
step 2, detection frame fusion; the detection results on the public data set may suffer from missed detections and false detections and need to be corrected to obtain more accurate results, so detection frame fusion is adopted: whenever the effective overlap rate between two detection frames in a frame exceeds a certain threshold, they are merged into a new first detection frame;
step 3, key point detection; after the new detection frames are obtained through the fusion of step 2, key points are detected, each new detection frame containing a large number of key points;
step 4, key point filtering; a threshold is set and key points with low confidence are filtered out; if the number of remaining key points of a pedestrian exceeds a certain threshold, the detection is considered correct;
step 5, detection correction; a new second detection frame is recovered from the key points retained in the new detection frame, and a corrected detection frame is obtained from the boundary key points using the proportional relation between human key points and human height;
step 6, feature extraction, in which a convolutional neural network extracts features from the corrected detection frame; a pedestrian re-identification PCB network is first trained on a pedestrian re-identification data set and then used to extract features from the corrected detection frame; the PCB network divides the pedestrian into p blocks along the horizontal and vertical directions and obtains a feature vector for each block, and the visible-region labels of the pedestrian are computed by checking whether key points exist in each block of the pedestrian detection frame;
step 7, local appearance feature association: after features of a historical track and of a target detection frame in frame t are extracted with the method of step 6, the local feature association is computed as the cosine distance between their local feature vectors;
step 8, global appearance feature association; after features of a historical track and of a target detection frame in frame t are extracted with the method of step 6, the global feature association is computed as the cosine distance between their global feature vectors;
step 9, data association; the global appearance features, local appearance features, and visibility labels of a historical track and of a target detection frame in frame t from steps 6-8 are fused to compute the data association, and the Hungarian matching algorithm is then applied to obtain the optimal matching result;
step 10, tracking management; after the optimal solution of the data association is obtained through the Hungarian algorithm in step 9, the successfully matched detection frame and track pairs, the detection frames not matched to any track, and the tracks not matched to any detection frame are returned; a track matched to a detection frame is updated; a track not matched to any detection frame is set to the suspended state; a detection frame not matched to any track is treated as a new track, initialized, and added to the track set;
and step 11, repeating steps 2-10 until all frames are processed, and outputting the target trajectories.
2. The multi-pedestrian tracking method based on joint global and local features of claim 1, wherein: in step 2, the detection frames whose effective overlap rate in a frame exceeds a certain threshold are merged to obtain a new detection frame;

the effective overlap rate is calculated as follows:

$$r_{\mathrm{eff}}(D_t^A, D_t^B) = \frac{\operatorname{area}(D_t^A \cup D_t^B)}{\operatorname{area}\big(\operatorname{cover}(D_t^A, D_t^B)\big)}$$

where $D_t^A$ and $D_t^B$ denote the A-th and B-th detection frames in frame t, and $\operatorname{cover}(D_t^A, D_t^B)$ denotes the smallest rectangle that can cover the two frames;
the new detection frame is calculated as follows:

$$x_{new} = \min(x_A, x_B)$$
$$y_{new} = \min(y_A, y_B)$$
$$w_{new} = \max(x_A + w_A,\ x_B + w_B) - x_{new}$$
$$h_{new} = \max(y_A + h_A,\ y_B + h_B) - y_{new}$$

where $(x_A, y_A)$ are the top-left coordinates of detection frame A, $(w_A, h_A)$ its width and height, $(x_B, y_B)$ the top-left coordinates of detection frame B, and $(w_B, h_B)$ its width and height.
3. The multi-pedestrian tracking method based on joint global and local features of claim 1, wherein: the key points detected in step 3 are respectively: left eye, right eye, nose, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle.
4. The multi-pedestrian tracking method based on joint global and local features of claim 1, wherein: in step 4, a pedestrian detection is considered correct if its number of remaining key points exceeds a certain threshold, and the specific filtering formula is as follows:

$$\varepsilon_\tau = \begin{cases} 1, & CS_\tau > \gamma \\ 0, & \text{otherwise} \end{cases} \qquad \operatorname{count}(\beta) = \sum_{\tau} \varepsilon_\tau$$

where $\varepsilon_\tau$ is a binary variable: $\varepsilon_\tau = 1$ indicates that the confidence of the τ-th key point is greater than the threshold and the key point is kept, otherwise it is deleted; $\operatorname{count}(\beta)$ denotes the number of valid key points in the β-th detection frame, $CS_\tau$ denotes the confidence score of the τ-th key point, and γ denotes a preset threshold.
5. The multi-pedestrian tracking method based on joint global and local features of claim 1, wherein: in step 5, the newly corrected detection frame is obtained from the boundary key points; the specific correction formula is as follows:

$$D_{new} = \{L_x,\ T_y,\ R_x - L_x + \theta,\ B_y - T_y + \rho\}$$

where $L_x$ and $R_x$ are the leftmost and rightmost x-axis coordinates of the key points in the corrected detection frame, $T_y$ and $B_y$ are the uppermost and lowermost y-axis coordinates of the key points in the corrected detection frame, and θ and ρ are two parameters representing the x-axis offset and the y-axis offset respectively.
6. The multi-pedestrian tracking method based on joint global and local features of claim 1, wherein: in step 6, whether key points exist in each block of the pedestrian detection frame is checked to compute the visible-region labels of the pedestrian; the specific calculation formula is as follows:

$$l_v = \begin{cases} 1, & \exists\, s:\ \frac{(v-1)H}{p} \le y(q_s) < \frac{vH}{p} \\ 0, & \text{otherwise} \end{cases} \qquad v = 1, \ldots, p$$

where $l_v$ indicates whether the v-th block is visible (1 if visible, otherwise 0), $q_s$ denotes the s-th key point in the pedestrian detection frame with $y(q_s)$ its vertical coordinate, p denotes the total number of blocks of the detection frame, and H denotes the height of the image.
7. The multi-pedestrian tracking method based on joint global and local features of claim 1, wherein: in step 7, the local appearance feature association is calculated as follows:

$$d_v = 1 - \frac{f_{i,v} \cdot f_{j,v}}{\lVert f_{i,v} \rVert\, \lVert f_{j,v} \rVert}$$

where $f_{i,v}$ and $f_{j,v}$ are the feature vectors of the v-th block of historical track i and of a target detection frame j in frame t respectively, $d_v$ is the feature distance of the v-th block between them, and p denotes the total number of blocks of the detection frame.
8. The multi-pedestrian tracking method based on joint global and local features of claim 1, wherein: in step 8, the global appearance feature similarity metric is as follows:

$$d_g = 1 - \frac{f_i^g \cdot f_j^g}{\lVert f_i^g \rVert\, \lVert f_j^g \rVert}$$

where $f_i^g$ and $f_j^g$ are the global feature vectors of historical track i and of a target detection frame j in frame t respectively, and $d_g$ is the global feature distance between them.
9. The multi-pedestrian tracking method based on joint global and local features of claim 1, wherein: the data association of step 9 is implemented as follows;

the data association is realized by calculating the feature distance dist between historical track i and target detection frame j; the specific calculation formula is as follows:

$$\operatorname{dist} = \frac{d_g + \sum_{v=1}^{p} l_v^i\, l_v^j\, d_v}{1 + \sum_{v=1}^{p} l_v^i\, l_v^j}$$

where $l_v^i$ and $l_v^j$ are the visibility scores of the v-th block of historical track i and of a target detection frame j in frame t respectively (0 means the block is invisible, 1 means it is visible), and p denotes the total number of blocks of the detection frame.
10. The multi-pedestrian tracking method based on joint global and local features of claim 1, wherein: in step 10, the specific tracking management method is as follows:
the matching result of this step has 3 cases: detection frame and track pairs that matched successfully, tracks not matched to any detection frame, and detection frames not matched to any track;
for a successfully matched detection frame and track pair, the historical track i is updated with the target detection frame j of frame t that it matched;
for a track not matched to any detection frame, the state of historical track i is set to suspended and its track ID is added to the vanished-target set;
for a detection frame not matched to any track, the target detection frame j of frame t is initialized as a track, assigned a new track ID, and added to the track set.
CN202111373622.5A 2021-11-19 2021-11-19 Multi-pedestrian tracking method based on joint global and local features Active CN114120188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111373622.5A CN114120188B (en) Multi-pedestrian tracking method based on joint global and local features


Publications (2)

Publication Number Publication Date
CN114120188A CN114120188A (en) 2022-03-01
CN114120188B true CN114120188B (en) 2024-04-05

Family

ID=80396409


Country Status (1)

Country Link
CN (1) CN114120188B (en)





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant