CN116433723A - Multi-target tracking method for modeling relationship among targets

Info

Publication number: CN116433723A
Authority: CN (China)
Application number: CN202310238764.3A
Other languages: Chinese (zh)
Prior art keywords: target, track, matching, targets, frame
Inventors: 邓宸伟, 武家鹏, 韩煜祺, 唐林波, 王文正, 王旭辰
Current assignee: Beijing Institute of Technology BIT
Original assignee: Beijing Institute of Technology BIT
Application filed by Beijing Institute of Technology BIT
Priority to CN202310238764.3A
Publication of CN116433723A
Legal status: Pending (the listed status is an assumption, not a legal conclusion)


Classifications

    • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving models
    • G06N3/08 Neural networks; learning methods
    • G06V10/764 Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]


Abstract

The invention discloses a multi-target tracking method that models the relationships between targets. The method comprises the following steps: for the targets output by the detector on the current frame, an intra-frame graph is constructed from the positions and appearance features of the targets and the topological relations between them; the vertex and edge features of the intra-frame graph are then updated by a message-passing network to further fuse features across targets; the trajectory graph of past frames is then combined with the current intra-frame graph to construct an inter-frame graph, whose edges represent the feature similarity between trajectories and detections; a message-passing step is performed on the inter-frame graph for further fusion; finally, a fully connected network computes a score for the matching relationship represented by each edge, recovering low-confidence detections that are easily missed and trajectories lost to occlusion and the like. By modeling the topological relations between targets, the method achieves stable association under nonlinear camera motion and uses neighboring-target information to help recover occluded detections; it overcomes, to a certain extent, the shortcoming of mainstream multi-target tracking algorithms that process each target in isolation, and achieves stable tracking under nonlinear motion and occlusion.

Description

Multi-target tracking method for modeling relationship among targets
Technical Field
The invention belongs to the technical field of computer vision, and relates to a multi-target tracking method for modeling the relationship between targets.
Background
Multi-target tracking is an important computer vision task with significant value in security, autonomous driving, crowd density monitoring, behavior analysis, and video content understanding. By analyzing a video, it assigns a distinct identity to each target and infers each target's position in every frame, forming a trajectory for each target.
In recent years, the field of multi-target tracking has developed rapidly and achieved good results on datasets such as MOT, KITTI, and DanceTrack. However, because camera motion is uncertain, nonlinear target motion and occlusion between targets or by the background occur easily, and these situations cause targets to be assigned wrong identities and thus lead to unstable tracking.
To cope with target occlusion, there are mainly the following solutions. The first category optimizes the matching strategy: in the detection-trajectory association stage, a dual-threshold matching scheme is adopted and the representation of the association matrix is optimized. The former can recover low-confidence targets caused by occlusion; the latter combines appearance and motion features more effectively, biasing the fused feature toward whichever of the two is more reliable. The second category builds extra network branches to directly predict target occlusion and compensate for missed detections. The third category builds a memory bank of target features and refines the features of the past several frames into more robust ones, for example by attention computation, to cope with occlusion.
To cope with nonlinear target motion, there are mainly the following solutions. The first applies image registration before the association stage and updates the trajectories' motion coordinates with the resulting affine matrix, partially offsetting the coordinate jumps caused by camera motion. The second fits nonlinear motion by trajectory smoothing. The third abandons the Kalman filter used by classical algorithms when large nonlinear target motion is detected, and instead matches with, for example, rotation-invariant motion features.
However, all of the above algorithms process each target in isolation, i.e., without considering the relationships between targets, which makes stable tracking in complex scenes more difficult. Specifically, processing target features uniformly has limitations, since targets of the same class have similar appearances; in crowded scenes, targets frequently occlude one another, and processing them individually easily confuses them during occlusion; and when camera motion causes nonlinear target motion, uniformly re-processing trajectories or computing an affine matrix is computationally expensive, whereas the relative positions between targets remain essentially unchanged during camera motion, so modeling the relationships between targets can also handle this case well.
Therefore, a multi-target tracking method that models the relationships between targets is needed: by exploiting inter-target relationships it addresses occlusion, nonlinear motion, and similar problems, achieving stable tracking in complex scenes.
Disclosure of Invention
In view of the above, the invention provides a multi-target tracking method that models the relationships between targets, exploiting the appearance and topological relations among targets to achieve stable tracking in occlusion and nonlinear-motion scenes.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
a multi-target tracking method for modeling the relationship between targets comprises the following specific steps:
s1, constructing a high-performance detector and an appearance feature extraction network. The input of the detector is a video frame, and the output is the position and the category of each target. The input of the appearance characteristic extraction network is a cut target area, and the appearance characteristic extraction network outputs an appearance characteristic vector of a target.
S2, for the current frame, screen detection candidate boxes with a low confidence threshold to avoid missed detections caused by occlusion and the like. Then de-duplicate highly overlapping detection boxes using non-maximum suppression, and extract the appearance features of each target with the appearance feature extraction network.
S3, construct an intra-frame graph for the targets detected in the current frame: the vertices of the graph encode the appearance and motion features of the targets, and the edges represent the topological relation in feature space of the two targets they connect. Specifically, a vertex feature code consists of two parts: first, the target's position information and appearance features; second, the angles and normalized distances between the target and several neighboring targets. An edge feature code consists of three parts: the fused vertex features, the geometric distance, and the bounding-box similarity of the two targets connected by the edge. A message-passing step is then performed on the constructed intra-frame graph to fuse the features of edges and vertices. If the current frame is not the first frame, execute S4; otherwise execute S5.
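As a minimal sketch of the topological part of the vertex encoding in S3, the neighbor angles and normalized distances could be computed as below. The function name, array shapes, and the use of `arctan2` for the angles are illustrative assumptions and do not come from the patent text:

```python
import numpy as np

def topo_features(centers, idx, radius, k=20):
    """Angles and normalized distances from target `idx` to its neighbors.

    `centers` is an (N, 2) array of detection center points. Targets within
    `radius` of the chosen center count as neighbors; both k-dim vectors are
    zero-padded when there are fewer than k neighbors, and truncated to the
    k nearest otherwise.
    """
    delta = centers - centers[idx]                 # offsets to every target
    dist = np.linalg.norm(delta, axis=1)
    mask = (dist > 0) & (dist <= radius)           # neighbors, excluding self
    order = np.argsort(dist[mask])[:k]             # keep the k nearest
    d = dist[mask][order]
    ang = np.arctan2(delta[mask][order, 1], delta[mask][order, 0])
    d = d / d.max() if len(d) else d               # normalize by max distance
    pad = lambda v: np.pad(v, (0, k - len(v)))     # zero-fill up to k dims
    return pad(ang), pad(d)
```

Concatenating these two 20-dimensional vectors with the 64-dimensional appearance feature and the 4-dimensional position information yields the 108-dimensional vertex code described later.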
S4, construct a matching graph between the current frame and the trajectory graph. Since targets within a frame cannot match each other, the matching graph is bipartite: the vertices on one side are the features of active and temporarily lost trajectories, the vertices on the other side are the target features of the current frame, and the edges are similarity measures between the connected trajectory and current target. The edge feature codes of the matching graph are similar to those of the intra-frame graph: they consist of the feature-vector similarity, geometric distance, and bounding-box similarity between the trajectory and the current target. Vertex feature codes are kept consistent with the intra-frame graph to reduce complexity. A message-passing step is then performed on the constructed matching graph, and each edge is fed into an edge classifier (a fully connected network) to compute a score for the matching relationship it represents. The scores between trajectories and targets form a cost matrix, and the optimal solution of the cost matrix, i.e., the matching relation, is obtained by solving a linear assignment problem.
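This assignment step can be sketched as follows, using the thresholds given later in the text (scores below 0.4 rejected with a large cost of 1e5, other entries costing 1 minus the score). Exhaustive search over permutations stands in here for the Hungarian or Sinkhorn solver a real implementation would use, so the sketch only suits small matrices:

```python
from itertools import permutations
import numpy as np

BIG = 1e5  # cost written into rejected (score below threshold) positions

def match(scores, thresh=0.4):
    """Solve the trajectory-to-detection assignment on a small score matrix.

    `scores` is (n_tracks, n_dets). Scores below `thresh` are rejected by
    writing the large cost BIG; the remaining entries cost 1 - score.
    """
    scores = np.asarray(scores, dtype=float)
    cost = np.where(scores >= thresh, 1.0 - scores, BIG)
    n_tracks, n_dets = cost.shape
    best_total, best_cols = float("inf"), None
    for cols in permutations(range(n_dets), min(n_tracks, n_dets)):
        total = sum(cost[i, c] for i, c in enumerate(cols))
        if total < best_total:
            best_total, best_cols = total, cols
    # keep only matches that were not forced through rejected edges
    return [(i, c) for i, c in enumerate(best_cols) if cost[i, c] < BIG]
```

For example, a score matrix with strong diagonal entries yields the diagonal matching, while a matrix with all scores below 0.4 yields no matches at all.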
S5, resolve the current matching results. The following cases are distinguished:
An active trajectory matched with a high-confidence target: the match in this case is usually reliable; the target is appended to the corresponding trajectory, and the exponential smoothing of the target's current feature with the trajectory feature becomes the new trajectory feature.
An active trajectory matched with a low-confidence target: this indicates the target may be occluded; the target is recovered and appended to the corresponding trajectory, but the trajectory feature is not updated, since the current target feature is not trustworthy.
An inactive trajectory matched with a high-confidence target: this indicates the target has reappeared after being lost for some time; the target is appended to the corresponding trajectory, and the current target feature becomes the new trajectory feature.
A trajectory with no successful match: this indicates the target has left the field of view or is severely occluded. If the trajectory is active, mark it inactive and start recording the duration of inactivity; if the trajectory is already inactive and its inactive duration exceeds the maximum retention time, delete it from the trajectory set.
A target with no successful match: if the target's confidence exceeds a certain value, it is likely a new target and is initialized as a new trajectory.
After the above steps, the current trajectory graph is updated and the process returns to S2.
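The cases above amount to a small state machine per trajectory. A sketch with hypothetical field names ('active', 'feat', 'lost', 'conf'; none of these identifiers come from the patent), using the 0.5 high-confidence threshold and the 30-frame retention time stated later:

```python
def update_track(track, det, alpha=0.9, max_lost=30):
    """Apply one of the S5 cases to a trajectory (a dict with hypothetical
    fields 'active', 'feat', 'lost'; detections carry 'conf' and 'feat').

    Returns False when an inactive trajectory has outlived the retention
    window and should be deleted from the trajectory set.
    """
    if det is None:                               # no match this frame
        track["active"] = False
        track["lost"] = track.get("lost", 0) + 1
        return track["lost"] <= max_lost
    if track["active"] and det["conf"] >= 0.5:    # reliable match: EMA-smooth
        track["feat"] = [alpha * t + (1 - alpha) * d
                         for t, d in zip(track["feat"], det["feat"])]
    elif not track["active"]:                     # reappeared: replace feature
        track["feat"] = list(det["feat"])
    track["active"], track["lost"] = True, 0      # low-conf case: feat untouched
    return True
```

The smoothing factor alpha is an assumed value; the patent specifies exponential smoothing but not the coefficient.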
And finishing processing all frames in the video to be processed, and ending the flow.
Further, the number of vertices and edges of the intra-frame graph depends on the specific detection result. Neighboring targets are found by geometric distance: the range is a circle centered on the target with radius 0.1 times the smaller of the image height and width. The target appearance feature is a 64-dimensional vector; the angles to neighboring targets and the normalized distances are each preset as 20-dimensional vectors; if there are fewer than 20 neighbors the vectors are zero-padded, and if more than 20 the nearest 20 are taken.
Further, in the edge feature coding of the intra-frame graph, the fused vertex features are obtained by attention computation, the geometric distance is the Euclidean distance, and the bounding-box similarity comprises the center-point difference, the log aspect-ratio value, and the Wasserstein distance.
Further, the number of message-passing iterations for both the intra-frame graph and the matching graph is 3, with average aggregation as the message-passing scheme. The edge classifier of the matching graph has 2 fully connected layers, with a Sigmoid activation normalizing the output.
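One average-aggregation message-passing iteration could look like the sketch below. The function signature and the choice to concatenate the mean incident-edge feature with the vertex feature before the MLP are assumptions made for illustration:

```python
import numpy as np

def mp_step(V, edges, E, mlp):
    """One average-aggregation message-passing iteration (sketch).

    For each vertex, the features of its incident edges are averaged and
    concatenated with the vertex feature, then mapped by `mlp` (any
    callable standing in for a multi-layer perceptron). `edges` is a list
    of (i, j) index pairs; E is the matching (n_edges, d_e) feature array.
    """
    n, d_e = V.shape[0], E.shape[1]
    agg = np.zeros((n, d_e))
    cnt = np.zeros(n)
    for k, (i, j) in enumerate(edges):       # each edge messages both ends
        agg[i] += E[k]; agg[j] += E[k]
        cnt[i] += 1;    cnt[j] += 1
    agg /= np.maximum(cnt, 1)[:, None]       # mean over incident edges
    return mlp(np.concatenate([V, agg], axis=1))
```

Running this three times with a trained MLP corresponds to the 3-layer configuration named above; edge features are left unchanged between iterations, as the description specifies.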
Further, the high confidence threshold is 0.5, the low confidence threshold is 0.2, and the target initialization threshold is 0.6. After the edge scores of the matching graph are computed, edges with scores of at least 0.4 are kept; when the cost matrix is constructed, entries with scores below 0.4 are set to a large positive number (e.g., 1e5). The maximum retention time of a lost trajectory is 30 frames.
Beneficial effects:
1. The invention provides a multi-target tracking method based on modeling the relationships between targets, which performs well in different tracking scenes. It uses a graph structure to model inter-target relations by computing the appearance and topology information of neighboring targets, and constructs a matching graph in the matching stage, where edges represent the matching relation between trajectories and targets and matching scores are computed from the edge feature semantics. When considering inter-target relations, the invention combines the appearance features of a target with the topology information (angles and normalized distances) of its neighbors; this feature encoding associates a target with its surroundings, enriching its semantic information, which benefits nonlinear-motion and occlusion scenes. Specifically, when nonlinear camera motion causes abrupt changes in a target's absolute image position, current mainstream algorithms, which match only on discrete absolute positions, suffer identity switches because the target's displacement between adjacent frames is too large. The relative positions between targets remain essentially consistent during nonlinear motion, so the proposed feature codes stay stable in most nonlinear-motion scenes, enabling stable tracking. Regarding target occlusion, the mainstream practice of processing targets in isolation leads to missed detections when target confidence is low, while simply lowering the detector's confidence threshold introduces false positives.
The proposed inter-target relationship modeling can therefore recover low-confidence detections via the consistency between a target and its surroundings, and can suppress erroneous, fragmented false detections.
2. In addition, to achieve long-term tracking, the invention provides a robust trajectory management mechanism that fully considers recovering lost trajectories. Mainstream algorithms tend to recover trajectories with absolute position or appearance information alone; yet when a target reappears, its position has often changed substantially and the appearance of a single target is unstable, whereas the target and its neighbors usually retain the characteristics they had before the trajectory was lost. Modeling inter-target relations therefore fully exploits the semantic information of a target and its surroundings, so trajectories can be recovered accurately.
Drawings
FIG. 1 is a block diagram of an implementation of the present invention.
Fig. 2 is a schematic diagram of an intra-frame image construction process according to the present invention.
FIG. 3 is a schematic diagram of the matching graph correlation process of the present invention.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
Fig. 1 is the flowchart of the whole tracking algorithm. First, the current video frame is read and a high-performance detector network produces the frame's detection results, including each target's category and position. An intra-frame graph is then constructed, whose vertex and edge feature codes are built from the topological and appearance relations between targets. The intra-frame graph is iterated through a message-passing network to fuse vertex and edge features and enlarge the features' receptive field. A matching graph is then constructed with the current trajectory graph, whose edges represent the appearance and topological relations between the connected targets and trajectories. The matching graph is fed into the message-passing network for iteration, a cost matrix is built from the edge classification scores, matching is completed by solving the cost matrix, and finally the trajectories are updated and managed.
Specific steps will be described in detail below.
Step one, constructing a high-performance detector and appearance characteristic extraction network.
Step two: obtain the detection results by feeding the video frame into the detection network. Define a high confidence threshold and a low confidence threshold and screen the results accordingly: targets with confidence above the high threshold are marked as high-confidence targets, and targets with confidence less than or equal to it as low-confidence targets. A low-confidence target is one for which occlusion or the like has likely occurred. Then de-duplicate the detection boxes using non-maximum suppression, and feed each target region into the appearance feature extraction network to obtain and store each target's appearance features.
In this example, the detector network may be chosen from the YOLO series, CenterNet, etc.; the appearance feature extraction network is OSNet; the high confidence threshold is 0.5, the low confidence threshold is 0.2, and the dimension of each target's appearance feature vector is 64.
Step three: after obtaining the positions and appearance features of the current frame's targets, construct an intra-frame graph representing the relations between targets. Fig. 2 shows the feature encoding of the intra-frame graph's vertices and edges. Each vertex represents the features of one target of the current frame, and each edge represents similarity features of the two targets it connects.
For vertex feature encoding, first select a distance equal to 0.1 times the smaller of the image height and width; only targets inside the circle centered at the current vertex with this distance as radius are considered, and these are called "neighboring targets", the rest "non-neighboring targets". The distances between the target and all its neighbors are computed and normalized, specifically by dividing each distance by the maximum distance. The angles to the neighboring targets are then computed. The distances and angles describe the topological relation of the target to its neighbors. To keep the encoding consistent over time, the distance and angle features are sorted before being stored. The angles and normalized distances are each preset as 20-dimensional vectors; fewer than 20 neighbors are zero-padded, and with more than 20 the nearest 20 are taken. The target's appearance features, its position information (center point and bounding-box height and width), and the resulting topological relation to its neighbors are then combined to form the vertex feature code. The vertex feature dimension is 108.
For edge feature encoding, the appearance features of the two connected targets are fused, specifically by attention computation, yielding a vector of the same dimension. The similarity of the position information is then computed, and the fused vertex features combined with this similarity form the edge feature code. The position similarity consists of the difference of the bounding-box center points, the log ratios of the bounding-box heights and widths, and the Wasserstein distance between the two targets. The Wasserstein distance replaces the IoU distance used by mainstream algorithms and describes the overlap of small targets' bounding boxes more effectively. The edge feature code has dimension 69.
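These geometry terms might be computed as below. The closed-form normalized Wasserstein distance shown models each box as a 2-D Gaussian, a common formulation that is an assumption here: the patent names the distance but not its exact form, and the scale constant C is likewise a free choice:

```python
import numpy as np

def box_similarity(b1, b2, C=12.8):
    """Geometry terms of the edge feature for two boxes (cx, cy, w, h).

    Returns the center-point difference, the log ratio of widths/heights,
    and a normalized Wasserstein distance between the boxes modeled as
    2-D Gaussians (equal to 1 for identical boxes, decaying toward 0).
    """
    (cx1, cy1, w1, h1), (cx2, cy2, w2, h2) = b1, b2
    d_center = (cx1 - cx2, cy1 - cy2)
    log_ratio = (np.log(w1 / w2), np.log(h1 / h2))
    w2_sq = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
             + ((w1 - w2) / 2) ** 2 + ((h1 - h2) / 2) ** 2)
    nwd = np.exp(-np.sqrt(w2_sq) / C)        # 1 for identical boxes
    return d_center, log_ratio, nwd
```

Unlike IoU, this distance stays informative for small boxes that barely overlap, which is the property the description attributes to the Wasserstein distance.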
The constructed intra-frame graph is then updated via the message-passing network, which fuses the features between vertices and edges; after several iterations each vertex fuses features from a larger range, which benefits the modeling of target relations. To balance speed and accuracy, the message-passing network has 3 layers, vertex features are updated by average aggregation, and edge features are not updated; i.e., the features of a vertex and its incident edges are mapped to new features by a multi-layer perceptron.
Step four: after obtaining the intra-frame graph, construct the matching graph with the trajectory graph to complete the tracking step. If the current frame is the first frame, no trajectory graph exists yet, so the current intra-frame graph becomes the trajectory graph. Otherwise, as shown in steps one to four of fig. 3, a matching graph is constructed.
The matching graph is a bipartite graph composed of the trajectory graph and the intra-frame graph: the vertices of the trajectory graph are on one side and the vertices of the intra-frame graph on the other. An edge represents the feature measure for matching the connected trajectory and vertex; specifically, analogous to the edge features of the intra-frame graph, the fused appearance feature of the trajectory and the target is computed and, together with the similarity of their position information, forms the edge feature code. The edge feature dimension of the matching graph is likewise 69. The matching graph is then passed through the 3-layer message-passing network, with edge and vertex features updated by average aggregation. Each edge is evaluated by a 2-layer fully connected network to obtain a score between 0 and 1, representing the likelihood that the corresponding trajectory and target are connected. A matching cost matrix is constructed from these likelihood scores: positions corresponding to edges with score less than or equal to 0.4 are set to 1e5 to reject those matches, and the remaining positions are set to 1 minus the likelihood score. The optimal solution of the cost matrix, i.e., the correspondence between trajectories and targets, is then obtained with the Sinkhorn algorithm or the Hungarian algorithm.
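The Sinkhorn option can be sketched as iterated row and column normalization; the iteration count and the exponential map from scores to positive entries are illustrative assumptions:

```python
import numpy as np

def sinkhorn(scores, n_iters=50):
    """Normalize a square score matrix toward a doubly stochastic one.

    Alternating row and column normalization pushes the matrix toward a
    soft assignment; the per-row argmax then reads off trajectory-to-target
    correspondences, a differentiable alternative to the Hungarian solver.
    """
    P = np.exp(np.asarray(scores, dtype=float))  # positive entries
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)        # rows sum to 1
        P /= P.sum(axis=0, keepdims=True)        # columns sum to 1
    return P
```

For a matrix with dominant diagonal scores, the iteration converges to a matrix close to the identity, and the row-wise argmax gives the diagonal matching.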
Step five: as shown in step five of fig. 3, perform trajectory management and update the trajectory graph. The matching relations solved in step four are checked one by one. If an active trajectory matches a high-confidence target, the match is generally trustworthy, and the exponential smoothing of the target feature with the trajectory feature becomes the new trajectory feature. If an inactive trajectory matches a high-confidence target, the target has reappeared after disappearing for a period of time: the trajectory's state is restored to active, the original trajectory feature is discarded, and the current target feature becomes the trajectory feature. If an active trajectory matches a low-confidence target, the target is recovered but the trajectory feature is not updated. Remaining trajectories without a successful match are set to inactive, and a trajectory is deleted if its inactive duration exceeds 30 frames. For remaining targets without a successful match, if the confidence exceeds 0.6 the target is treated as new and initialized as a new trajectory for matching in subsequent frames. Once all frames of the video to be processed have been handled, the procedure ends.
This completes the overall design of the multi-target tracking algorithm that models the relationships between targets.
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A multi-target tracking method for modeling the relationships between targets, characterized by comprising the following steps:
step one: construct a high-performance detector and an appearance feature extraction network; the detector takes a video frame as input and outputs each target's position and category; the appearance feature extraction network takes a cropped target region as input and outputs the target's appearance feature vector;
step two: for the current frame, screen detection candidate boxes with a low confidence threshold to avoid missed detections caused by occlusion and the like; then de-duplicate highly overlapping detection boxes using non-maximum suppression, and extract each target's appearance features with the appearance feature extraction network;
step three: construct an intra-frame graph for the targets detected in the current frame, wherein the vertices of the graph encode the appearance and motion features of the targets, the edges represent the topological relation in feature space of the two connected targets, and the intra-frame graph features are updated through a message-passing network;
step four: construct a matching graph between the current frame and the trajectory graph, wherein the vertices on one side are the features of active and temporarily lost trajectories and the vertices on the other side are the target features of the current frame; the edges are similarity measures between the connected trajectories and current targets, and the matching graph features are updated through the message-passing network; each edge is input into an edge classifier (a fully connected network) to compute the score of the matching relation it represents; the scores between trajectories and targets form a cost matrix, and the optimal solution of the cost matrix, i.e., the matching relation, is obtained by solving a linear assignment problem;
step five: resolve the current matching results, recovering low-confidence detections and inactive trajectories caused by occlusion and the like.
2. The multi-target tracking method for modeling the relationships between targets according to claim 1, wherein step one further comprises: the detector network may be chosen from the YOLO series, CenterNet, etc.; the appearance feature extraction network is OSNet; the high confidence threshold is 0.5, the low confidence threshold is 0.2, and the dimension of each target's appearance feature vector is 64.
3. The multi-target tracking method for modeling relationships between targets according to claim 1, wherein the specific method for constructing the intra-frame map in the second step is as follows: the numbers of vertices and edges of the intra-frame graph depend on the actual detection result. Because the target density is not fixed, the neighbors of a target are found by geometric distance, within a circle centered on the target whose radius is 0.1 times the smaller of the image height and width. The target appearance feature is a 64-dimensional vector; the angles to neighboring targets and the normalized distances are each preset as 20-dimensional vectors: if there are fewer than 20 neighbors, the vectors are zero-padded; if there are more than 20, the nearest 20 are taken. In the edge feature encoding of the intra-frame graph, fused vertex features are obtained by attention computation; the geometric distance is the Euclidean distance; and the bounding-box similarity comprises the center-point difference, the logarithm of the aspect ratio, and the Wasserstein distance.
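The neighbor search and angle/distance encoding described in claim 3 can be sketched as below. The function name and the representation of targets by their box centers are assumptions for illustration; the radius (0.1 times the smaller image side), the limit of 20 neighbors, and the zero-padding follow the claim.

```python
import numpy as np

def neighbor_features(centers, img_h, img_w, k=20):
    """For each target center (N,2), find neighbors within a circle of
    radius 0.1 * min(img_h, img_w); encode the angle to each neighbor
    and the radius-normalized distance as k-dimensional vectors,
    zero-padded when fewer than k neighbors exist, truncated to the
    nearest k otherwise."""
    radius = 0.1 * min(img_h, img_w)
    n = len(centers)
    angles = np.zeros((n, k))
    dists = np.zeros((n, k))
    for i in range(n):
        d = np.linalg.norm(centers - centers[i], axis=1)
        # nearest first, excluding the target itself and anything outside the radius
        nbr = [j for j in np.argsort(d) if j != i and d[j] <= radius][:k]
        for slot, j in enumerate(nbr):
            dx, dy = centers[j] - centers[i]
            angles[i, slot] = np.arctan2(dy, dx)
            dists[i, slot] = d[j] / radius   # normalized to (0, 1]
    return angles, dists
```

These vectors would be concatenated with the 64-dimensional appearance feature to form the vertex encoding of the intra-frame graph.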
4. The multi-target tracking method for modeling relationships between targets according to claim 1, wherein the specific method for constructing the matching graph in the third step is to fuse the vertex features of the track graph with the vertex features of the intra-frame graph and to add a similarity of position information.
5. The multi-target tracking method for modeling relationships between targets according to claim 1, wherein the specific method for calculating the matching relationships from the matching graph in the third step is as follows: the matching relationship represented by each edge of the matching graph is input into an edge classifier for confidence computation; the edge classifier has 2 fully connected layers and uses a Sigmoid activation to normalize the output. A matching cost matrix is constructed from the edge scores: for edges with a score of 0.4 or less, the corresponding position is set to 1E5 to forbid the match; otherwise the corresponding position of the cost matrix is set to 1 minus the probability score. The Sinkhorn algorithm or the Hungarian algorithm is then used to obtain the optimal solution of the matching matrix, i.e., the correspondence between tracks and targets.
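The cost-matrix construction and assignment step of claim 5 can be sketched with SciPy's Hungarian-algorithm solver (the Sinkhorn variant is not shown). The function name is an assumption; the 0.4 score threshold, the 1 − score cost, and the 1E5 forbidden-match cost follow claims 5 and 8.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracks(edge_scores, score_thr=0.4, big_cost=1e5):
    """edge_scores: (num_tracks, num_dets) Sigmoid outputs of the edge
    classifier. Build the cost matrix as 1 - score, forbid low-scoring
    edges with a large cost, solve the linear assignment problem, and
    drop any pair the solver was forced to route through a forbidden
    entry."""
    cost = 1.0 - edge_scores
    cost[edge_scores <= score_thr] = big_cost
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < big_cost]
```

The returned pairs are the track-to-target correspondences handed on to the track organization step.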
6. The multi-target tracking method for modeling relationships between targets according to claim 1, wherein the update mode of the message-passing network in the third and fourth steps is as follows: the number of message-passing iterations for both the intra-frame graph and the matching graph is 3, and messages are aggregated by averaging; that is, the updated feature of each vertex is formed, with a residual connection, by concatenating the vertex's own feature from the previous layer with the average features of its neighboring vertices along incoming and outgoing edges.
7. The multi-target tracking method for modeling relationships between targets according to claim 1, wherein the specific method of track organization in the fifth step is as follows:
an active track successfully matched with a high-confidence target: such a match is usually reliable; the target is assigned to the corresponding track, and the target's current feature is exponentially smoothed with the track feature to form the new track feature;
an active track successfully matched with a low-confidence target: this indicates the target may be occluded; the target is recovered and assigned to the corresponding track, but the track feature is not updated because the current target feature is unreliable;
an inactive track successfully matched with a high-confidence target: this indicates the target has reappeared after being lost for some time; the target is assigned to the corresponding track, and the current target feature is taken as the new track feature;
tracks with no successful match: this indicates the target has left the field of view or is severely occluded. If the track is active, it is marked inactive and the duration of inactivity is recorded; if the track is already inactive and its inactive duration exceeds the maximum retention time, it is deleted from the track set;
targets with no successful match: if the target's confidence is greater than a certain value, it is likely a new target and is initialized as a new track.
8. The multi-target tracking method for modeling relationships between targets according to claim 7, wherein the specific parameters are defined as follows: after the edge scores of the matching graph are computed, edges with a score of at least 0.4 are screened first; when the cost matrix is constructed, the cost of entries with a score below 0.4 is set to a large positive number (for example, 1E5). The maximum retention time of a lost track is 30 frames.
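The track organization cases of claim 7 can be sketched as a single per-frame update routine. This is an illustrative sketch: the data structures, the smoothing factor `EMA = 0.9`, and the treatment of an inactive track matched to any detection as a reappearance are assumptions; the 30-frame maximum retention time follows claim 8 and the high-confidence threshold of 0.5 follows claim 2.

```python
MAX_LOST = 30   # frames an inactive track is retained (claim 8)
EMA = 0.9       # assumed exponential-smoothing factor for track features

def update_tracks(tracks, matches, detections, high_thr=0.5):
    """tracks: dict id -> {'feat', 'active', 'lost'}; detections: list of
    {'feat', 'score'}; matches: list of (track_id, det_idx) pairs from
    the assignment step."""
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    for t_id, d_idx in matches:
        trk, det = tracks[t_id], detections[d_idx]
        if trk['active'] and det['score'] >= high_thr:
            # reliable match: exponentially smooth the track feature
            trk['feat'] = EMA * trk['feat'] + (1 - EMA) * det['feat']
        elif trk['active']:
            pass  # likely occluded: keep the old feature, only extend the track
        else:
            trk['feat'] = det['feat']   # reappeared track: replace the feature
        trk['active'], trk['lost'] = True, 0
    # tracks with no match: deactivate, then delete once retained too long
    for t_id in list(tracks):
        if t_id not in matched_t:
            trk = tracks[t_id]
            trk['active'] = False
            trk['lost'] += 1
            if trk['lost'] > MAX_LOST:
                del tracks[t_id]
    # unmatched high-confidence detections start new tracks
    new_id = max(tracks, default=-1) + 1
    for d_idx, det in enumerate(detections):
        if d_idx not in matched_d and det['score'] >= high_thr:
            tracks[new_id] = {'feat': det['feat'], 'active': True, 'lost': 0}
            new_id += 1
    return tracks
```

Features are shown as scalars for brevity; in practice they would be the 64-dimensional appearance vectors.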
CN202310238764.3A 2023-03-08 2023-03-08 Multi-target tracking method for modeling relationship among targets Pending CN116433723A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310238764.3A CN116433723A (en) 2023-03-08 2023-03-08 Multi-target tracking method for modeling relationship among targets

Publications (1)

Publication Number Publication Date
CN116433723A true CN116433723A (en) 2023-07-14

Family

ID=87080664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310238764.3A Pending CN116433723A (en) 2023-03-08 2023-03-08 Multi-target tracking method for modeling relationship among targets

Country Status (1)

Country Link
CN (1) CN116433723A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173221A (en) * 2023-09-19 2023-12-05 浙江大学 Multi-target tracking method based on authenticity grading and occlusion recovery
CN117173221B (en) * 2023-09-19 2024-04-19 浙江大学 Multi-target tracking method based on authenticity grading and occlusion recovery

Similar Documents

Publication Publication Date Title
Cheng et al. Occlusion-aware networks for 3d human pose estimation in video
US11854240B2 (en) Vision based target tracking that distinguishes facial feature targets
Luo et al. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net
Li et al. Inner and inter label propagation: salient object detection in the wild
Antonakaki et al. Detecting abnormal human behaviour using multiple cameras
Lu et al. Deep learning for 3d point cloud understanding: a survey
CN109492576B (en) Image recognition method and device and electronic equipment
Piniés et al. CI‐Graph simultaneous localization and mapping for three‐dimensional reconstruction of large and complex environments using a multicamera system
Xing et al. DE‐SLAM: SLAM for highly dynamic environment
CN116433723A (en) Multi-target tracking method for modeling relationship among targets
Pu et al. Visual SLAM integration with semantic segmentation and deep learning: A review
CN111652181B (en) Target tracking method and device and electronic equipment
Chen et al. Multiperson tracking by online learned grouping model with nonlinear motion context
Pang et al. Deep learning and preference learning for object tracking: a combined approach
Biswas et al. Sift-based visual tracking using optical flow and belief propagation algorithm
Sharif et al. Deep Crowd Anomaly Detection: State-of-the-Art, Challenges, and Future Research Directions
Yang et al. Hybrid-sort: Weak cues matter for online multi-object tracking
Kopf et al. Shape-based posture and gesture recognition in videos
Parikh et al. Intelligent video analytic based framework for multi-view video summarization
Tang et al. Place recognition using line-junction-lines in urban environments
Joy et al. Visual tracking with conditionally adaptive multiple template update scheme for intricate videos
Gong et al. Discriminative correlation filter for long-time tracking
CN114972434A (en) End-to-end multi-target tracking system for cascade detection and matching
CN115100565A (en) Multi-target tracking method based on spatial correlation and optical flow registration
Li et al. Multi-memory video anomaly detection based on scene object distribution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination