CN114913209A - Multi-target tracking network construction method and device based on top-view projection - Google Patents

Multi-target tracking network construction method and device based on top-view projection

Info

Publication number
CN114913209A
CN114913209A
Authority
CN
China
Prior art keywords
frame
target
target tracking
spatial information
grid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210826068.XA
Other languages
Chinese (zh)
Other versions
CN114913209B (en)
Inventor
李勇
戴亮
戴红波
苏进和
张少成
耿阳
张维
郭志峰
汤青
王浩
郭旋
束长勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Xiangtai Electric Power Industry Co ltd
Nanjing Houmo Intelligent Technology Co ltd
Taizhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Jiangsu Xiangtai Electric Power Industry Co ltd
Nanjing Houmo Intelligent Technology Co ltd
Taizhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Xiangtai Electric Power Industry Co ltd, Nanjing Houmo Intelligent Technology Co ltd, Taizhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd filed Critical Jiangsu Xiangtai Electric Power Industry Co ltd
Priority to CN202210826068.XA priority Critical patent/CN114913209B/en
Publication of CN114913209A publication Critical patent/CN114913209A/en
Application granted granted Critical
Publication of CN114913209B publication Critical patent/CN114913209B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/08 - Projecting images onto non-planar surfaces, e.g. geodetic screens
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Radar Systems Or Details Thereof (AREA)

Abstract

The invention discloses a multi-target tracking network construction method and device based on top-view projection. The method comprises: acquiring training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and annotating each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set; constructing an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder; and iteratively training the initial network on the multi-target tracking data set until convergence, obtaining the top-view-projection-based multi-target tracking network. By constructing a multi-target tracking network based on top-view projection, the invention alleviates problems such as overlap, occlusion and low accuracy that arise when multi-target tracking is implemented with 2D detection, and improves multi-target tracking capability.

Description

Multi-target tracking network construction method and device based on top-view projection
Technical Field
The invention relates to the field of information processing technology, and in particular to a multi-target tracking network construction method and device based on top-view projection.
Background
Target tracking has important application value in autonomous driving and security systems. Current monocular tracking methods, which analyze images or video captured by a camera, are mainly built on 2D detection. Because 2D detection carries no spatial information, targets are prone to overlap and occlusion, which makes post-processing association based on 2D detections difficult and can cause tracking failure. Even objects that are far apart along the viewing direction in real space may still overlap on the image plane, since the pixel spacing perceived on the image plane is limited, so conventional target tracking methods based on 2D detection boxes struggle to distinguish overlapping targets.
For example, patent CN114419098A provides a moving-target trajectory prediction method and apparatus based on visual transformation. It performs trajectory prediction and target tracking by extracting depth features from acquired 2D bounding boxes to obtain predicted trajectory coordinates of the moving target, and then converts these predicted coordinates into the vehicle coordinate system using a pre-calibrated coordinate transformation between image pixels and the vehicle body. The extraction and identification of the depth features rely on the deep convolutional re-identification appearance model of the multi-target tracking algorithm DeepSORT, fused with a target tracking module based on a cascade matching algorithm.
This scheme can mitigate missed detections and target occlusion to some extent. However, the recognition error of the depth features depends on the attributes of the target features themselves, so the target tracking accuracy is still not good enough. In practice, tracking usually has to take the characteristics of the tracked targets into account comprehensively. For example, when tracking different moving objects, one must consider their differing inertia, their object characteristics, and their differing motion-speed profiles, such as the movement speeds and trajectories of vehicles versus pedestrians.
Therefore, how to construct a tracking network adapted to multiple targets, so as to alleviate and overcome problems such as overlap, occlusion and low accuracy that arise when multi-target tracking is implemented with 2D detection, and thereby improve multi-target tracking capability, is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
To address the deficiencies of the prior art, the invention provides a multi-target tracking network construction method and device based on top-view projection. The constructed top-view-projection-based multi-target tracking network alleviates problems such as overlap, occlusion and low accuracy that arise when multi-target tracking is implemented with 2D detection, and improves multi-target tracking capability.
In a first aspect, the invention provides a multi-target tracking network construction method based on top-view projection, comprising the following steps:
acquiring training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and annotating each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set;
constructing an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder;
iteratively training the initial network on the multi-target tracking data set until convergence to obtain the top-view-projection-based multi-target tracking network;
wherein the backbone network samples the training image groups to form feature map groups; the top-view encoder perceives spatial information in the backbone network's feature map groups, fuses it with category information, and forms a feature matrix containing spatial information through top-view projection; and the spatial decoder decodes the feature matrix containing spatial information frame by frame and updates the object query variables.
Further, the spatial information includes the coordinates of each target's center point in space.
Further, iteratively training the initial network on the multi-target tracking data set until convergence specifically includes:
S1: selecting the training image group of frame t from the multi-target tracking data set and inputting it into the backbone network;
S2: sampling with the backbone network to form the feature map group of frame t;
S3: inputting the feature map group of frame t into the top-view encoder, perceiving spatial information, fusing it with category information, and forming the frame-t feature matrix containing spatial information through top-view projection;
S4: inputting the frame-t feature matrix into the spatial decoder and obtaining the object query variables of frame t from the initial object query variables;
S5: selecting the training image group of frame t+1 from the multi-target tracking data set, repeating steps S2-S3 to obtain the frame-(t+1) feature matrix, and inputting the frame-(t+1) feature matrix together with the object query variables of frame t into the spatial decoder to obtain the object query variables of frame t+1;
S6: computing the loss from the object query variables of frames t and t+1, and continuing to take subsequent time frames and repeat steps S1-S5 until convergence.
Further, the top-view encoder comprises a spatial perception module and a projection module, and step S3 specifically includes:
inputting the feature map group of frame t into the spatial perception module, extracting the targets' spatial features, and fusing them with the category features to form the frame-t target feature map group;
transforming the frame-t target feature map group into a pseudo point cloud through the projection module's top-view projection;
forming the frame-t feature matrix containing spatial information from the transformed pseudo point cloud.
Further, inputting the feature map group of frame t into the spatial perception module, extracting the targets' spatial features, and fusing the category features to form the frame-t target feature map specifically includes:
inputting the feature map group of frame t into the spatial perception module, extracting category features of dimension N x k x H/α x W/β, and obtaining spatial features of dimension N x D x H/α x W/β through a convolutional mapping, where N is the number of feature maps in the frame-t feature map group, k is the number of category-feature channels, H is the feature map height, α is the compression constant applied to the height, W is the feature map width, β is the compression constant applied to the width, and D is the number of spatial-feature channels;
concatenating the category features and the spatial features along the channel dimension to form a frame-t target feature map of dimension N x (k+D) x H/α x W/β containing spatial information.
Further, transforming the frame-t target feature map into a pseudo point cloud through the projection module's top-view projection specifically includes:
establishing a world coordinate system and setting the size of the grid under the world coordinate system;
projecting the pixels of the frame-t target feature map into the grid of the world coordinate system according to a preset projection rule;
analyzing and processing the target features within the grid, and transforming them to obtain the pseudo point cloud.
Further, establishing a world coordinate system and setting the size of the grid under the world coordinate system specifically includes:
taking the center point of the device that captured the training image groups as the origin, drawing the Z axis vertically upward, and establishing the world coordinate system;
obtaining the grid size under the world coordinate system from the size of the capture device's detection range and the size of a single grid cell, where the size is calculated as:
x*y = int(L/l) * int(W/w)
where x*y is the grid size under the world coordinate system, int is the integer (floor) function, L is the length of the capture device's detection range, W is the width of the capture device's detection range, l is the length of a single grid cell, and w is the width of a single grid cell;
analyzing and processing the target features within the grid specifically includes:
counting the pixels that fall into the same grid cell and averaging their features to form the grid-cell feature value;
if the number of pixels in a grid cell is zero, setting that cell's feature value to zero.
Further, step S6 specifically includes:
comparing the object query variables of frames t and t+1, computing a first loss function over the object query variables of adjacent frames and a second loss function over the object query variables of the same frame;
repeating steps S1-S5 for subsequent time frames and continuously updating the first and second loss functions;
finishing convergence when both the first loss function and the second loss function are smaller than the set thresholds.
Further, the formula of the first loss function is as follows:
[first loss function formula, rendered as an image in the original publication]
and the formula of the second loss function is as follows:
[second loss function formula, rendered as an image in the original publication]
where the symbols denote, respectively, the first loss function, the second loss function, the cosine-similarity operation, the object query variables, and the labels of two different targets.
In a second aspect, the invention further provides a device for implementing any of the above multi-target tracking network construction methods, comprising:
an acquisition unit, configured to acquire training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and to annotate each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set;
a construction unit, configured to construct an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder;
wherein the backbone network samples the training image groups to form feature map groups; the top-view encoder perceives spatial information in the backbone network's feature map groups, fuses it with category information, and forms a feature matrix containing spatial information through top-view projection; and the spatial decoder decodes the feature matrix containing spatial information frame by frame and updates the object query variables;
and a training unit, configured to iteratively train the initial network on the multi-target tracking data set until convergence, obtaining the top-view-projection-based multi-target tracking network.
The multi-target tracking network construction method and device based on top-view projection provided by the invention have at least the following beneficial effects:
(1) The constructed multi-target tracking network works on feature maps containing spatial information; the top-view encoder and spatial decoder fuse spatial features with category features, and the loss of the object query variables is computed across time frames, which alleviates and overcomes problems such as overlap, occlusion and low accuracy that arise when multi-target tracking is implemented with 2D detection, and improves multi-target tracking capability.
(2) During tracking training, the output object query variables containing the targets at the previous time step are used as the input object query variables at the current time step, realizing single-stage input and making the tracking network more concise.
(3) Spatial information is introduced into the top-view encoder, so the object query variables contain spatial features as well as features characterizing the targets' appearance and category; these richer features benefit target association and tracking.
(4) Computing a first loss function and a second loss function over the object query variables of the same and adjacent time frames improves the accuracy and efficiency of tracking training and enables fast iteration to convergence.
Drawings
FIG. 1 is a flow chart of constructing the multi-target tracking network based on top-view projection according to the invention;
FIG. 2 is an overall architecture diagram of the initial network of the invention;
FIG. 3 is a structural diagram of the top-view encoder of the initial network of the invention;
FIG. 4 is a diagram of the spatial feature fusion process of the initial network of the invention;
FIG. 5 shows a device for implementing the multi-target tracking network construction method of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two.
It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the article or device that includes the element.
As shown in FIG. 1, the invention provides a multi-target tracking network construction method based on top-view projection, comprising the following steps:
acquiring training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and annotating each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set;
The training images in the multi-target tracking data set may come from any source containing any kind of multiple targets (such as pedestrians, animals, vehicles and the like).
constructing an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder;
iteratively training the initial network on the multi-target tracking data set until convergence to obtain the top-view-projection-based multi-target tracking network;
As shown in FIG. 2, the initial network comprises a backbone network, a top-view encoder and a spatial decoder. The backbone network samples the training image groups to form feature map groups; the top-view encoder perceives spatial information in the backbone network's feature map groups, fuses it with category information, and forms a feature matrix containing spatial information through top-view projection; and the spatial decoder decodes the feature matrix containing spatial information frame by frame and updates the object query variables.
The spatial information annotated for each target in each frame's training image group mainly comprises the coordinates of the target's center point in space.
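The three components can be pictured with the following PyTorch-style skeleton; the class name, constructor arguments, query dimension, and the idea of passing the backbone, encoder and decoder in as modules are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class TopViewTrackingNet(nn.Module):
    """Sketch of the initial network: backbone + top-view encoder + spatial decoder."""

    def __init__(self, backbone, topview_encoder, spatial_decoder, num_queries=100, dim=256):
        super().__init__()
        self.backbone = backbone                # samples a frame's images into a feature map group
        self.topview_encoder = topview_encoder  # fuses spatial and category info, projects to top view
        self.spatial_decoder = spatial_decoder  # decodes the top-view features, updates object queries
        self.init_queries = nn.Parameter(torch.randn(num_queries, dim))  # initial object query variables

    def forward(self, frames, prev_queries=None):
        # frames: (N, 3, H, W) training image group for one time frame
        feats = self.backbone(frames)                                    # feature map group
        bev = self.topview_encoder(feats)                                # feature matrix with spatial info
        queries = self.init_queries if prev_queries is None else prev_queries
        return self.spatial_decoder(bev, queries)                        # updated object query variables
```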
As shown in FIG. 2, starting from time frame t, iteratively training the initial network on the multi-target tracking data set until convergence specifically includes:
S1: selecting the training image group of frame t from the multi-target tracking data set and inputting it into the backbone network;
S2: sampling with the backbone network to form the feature map group of frame t;
S3: inputting the feature map group of frame t into the top-view encoder, perceiving spatial information, fusing it with category information, and forming the frame-t feature matrix containing spatial information through top-view projection;
S4: inputting the frame-t feature matrix into the spatial decoder and obtaining the object query variables of frame t from the initial object query variables;
S5: selecting the training image group of frame t+1 from the multi-target tracking data set, repeating steps S2-S3 to obtain the frame-(t+1) feature matrix, and inputting the frame-(t+1) feature matrix together with the object query variables of frame t into the spatial decoder to obtain the object query variables of frame t+1;
S6: computing the loss from the object query variables of frames t and t+1, and continuing to take subsequent time frames and repeat steps S1-S5 until convergence. That is, training continues with frame t+2, frame t+3, and so on, computing the loss over the object query variables of adjacent frames until convergence.
In target tracking training, the output object query variables containing the targets at the previous time step are used as the input object query variables at the current time step, realizing single-stage input and making the tracking network more concise. In addition, spatial information is introduced into the top-view encoder, so the object query variables contain spatial features as well as features characterizing the targets' appearance and category; these richer features benefit target association and tracking. A sketch of this training loop is given below.
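A minimal sketch of the frame-by-frame training loop of steps S1-S6 follows, assuming the network interface sketched earlier and user-supplied loss callables and optimizer; it is an illustration of the described procedure, not the patent's implementation.

```python
def train_sequence(net, frame_groups, adjacent_frame_loss, same_frame_loss, optimizer, threshold=0.05):
    """Iterate over consecutive time frames, chaining object queries (sketch of S1-S6)."""
    prev_queries = None
    for t in range(len(frame_groups) - 1):
        q_t = net(frame_groups[t], prev_queries)       # S1-S4: object queries for frame t
        q_t1 = net(frame_groups[t + 1], q_t)           # S5: object queries for frame t+1
        loss1 = adjacent_frame_loss(q_t, q_t1)         # first loss: adjacent-frame queries
        loss2 = same_frame_loss(q_t1)                  # second loss: same-frame queries
        (loss1 + loss2).backward()
        optimizer.step()
        optimizer.zero_grad()
        prev_queries = q_t1.detach()                   # previous output queries feed the next step
        if loss1.item() < threshold and loss2.item() < threshold:
            break                                      # S6: both losses below the set thresholds
    return net
```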
As shown in FIG. 3, the top-view encoder of the initial network comprises a spatial perception module and a projection module. In step S3, again taking frame t as an example, inputting the feature map group of frame t into the top-view encoder, perceiving spatial information, fusing it with category information, and forming the frame-t feature matrix containing spatial information through top-view projection specifically includes:
inputting the feature map group of frame t into the spatial perception module, extracting the targets' spatial features, and fusing them with the category features to form the frame-t target feature map group;
transforming the frame-t target feature map group into a pseudo point cloud through the projection module's top-view projection;
forming the frame-t feature matrix containing spatial information from the transformed pseudo point cloud.
The processing for other time frames is the same as for frame t and is not repeated here.
As shown in FIG. 4, inputting the feature map group of frame t into the spatial perception module, extracting the targets' spatial features, and fusing the category features to form the frame-t target feature map specifically includes:
inputting the feature map group of frame t into the spatial perception module, extracting category features of dimension N x k x H/α x W/β, and obtaining spatial features of dimension N x D x H/α x W/β through a convolutional mapping, where N is the number of feature maps in the frame-t feature map group, k is the number of category-feature channels, H is the feature map height, α is the compression constant applied to the height, W is the feature map width, β is the compression constant applied to the width, and D is the number of spatial-feature channels. The values of k, α and β can be set according to the requirements of the scene; for example, k may be set to 64 and α and β both to 16.
concatenating the category features and the spatial features along the channel dimension to form a frame-t target feature map of dimension N x (k+D) x H/α x W/β containing spatial information. A sketch of this fusion is given below.
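The dimension bookkeeping above can be illustrated with a small PyTorch sketch; the single 1x1 convolution used as the convolutional mapping and the example values k = 64, D = 64, α = β = 16 (giving a 32 x 32 map for a 512 x 512 input) are assumptions.

```python
import torch
import torch.nn as nn

class SpatialPerception(nn.Module):
    """Maps category features (N, k, H/α, W/β) to spatial features (N, D, H/α, W/β)
    and concatenates both into a fused map of shape (N, k + D, H/α, W/β)."""

    def __init__(self, k=64, d=64):
        super().__init__()
        self.to_spatial = nn.Conv2d(k, d, kernel_size=1)  # convolutional mapping to spatial features

    def forward(self, category_feats):
        spatial_feats = self.to_spatial(category_feats)
        return torch.cat([category_feats, spatial_feats], dim=1)  # channel-wise concatenation

# Example: N = 2 feature maps, k = 64 category channels, a 512 x 512 input compressed 16x to 32 x 32.
fused = SpatialPerception(k=64, d=64)(torch.randn(2, 64, 32, 32))
print(fused.shape)  # torch.Size([2, 128, 32, 32])
```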
Transforming the frame-t target feature map into a pseudo point cloud through the projection module's top-view projection specifically includes the following steps:
establishing a world coordinate system and setting the size of the grid under the world coordinate system;
projecting the pixels of the frame-t target feature map into the grid of the world coordinate system according to a preset projection rule;
analyzing and processing the target features within the grid, and transforming them to obtain the pseudo point cloud.
Establishing a world coordinate system and setting the size of the grid under the world coordinate system specifically includes the following steps:
taking the center point of the device that captured the training image groups as the origin, drawing the Z axis vertically upward, and establishing the world coordinate system;
obtaining the grid size under the world coordinate system from the size of the capture device's detection range and the size of a single grid cell, where the size is calculated as:
x*y = int(L/l) * int(W/w)
where x*y is the grid size under the world coordinate system, int is the integer (floor) function, L is the length of the capture device's detection range, W is the width of the capture device's detection range, l is the length of a single grid cell, and w is the width of a single grid cell;
analyzing and processing the target features within the grid specifically includes the following steps:
counting the pixels that fall into the same grid cell and averaging their features to form the grid-cell feature value;
if the number of pixels in a grid cell is zero, setting that cell's feature value to zero. A sketch of this grid pooling is given below.
In the specific process of iteratively training the initial network on the multi-target tracking data set until convergence, step S6 specifically includes:
comparing the object query variables of frames t and t+1, computing a first loss function over the object query variables of adjacent frames and a second loss function over the object query variables of the same frame;
repeating steps S1-S5 for subsequent time frames and continuously updating the first and second loss functions;
finishing convergence when both the first loss function and the second loss function are smaller than the set thresholds.
The detection loss for targets within the same time frame can be realized with the commonly used Hungarian matching algorithm, which is not specifically limited here. For target tracking across adjacent time frames, iteration is performed on the object query variables.
The first loss function is given by a formula that appears as an image in the original publication and is not reproduced here. The first loss function must be smaller than a set threshold; the specific value of the threshold is chosen according to the actual scene and is not specified here. The labels of different targets in this formula express that the object query variables of the same target at different times should be sufficiently similar, while the object query variables of different targets should be as dissimilar as possible.
The second loss function is likewise given by a formula that appears as an image in the original publication. In target tracking, even targets of the same class are independent instances, so a regularization term, namely the second loss function, is introduced to constrain the object query variables. The second loss function must also be smaller than a set threshold, which may be the same as or different from the threshold of the first loss function and is likewise chosen according to the actual scene. The labels of different targets in this formula express that different targets within the same frame should be as dissimilar as possible.
In both formulas, the symbols denote, respectively, the first loss function, the second loss function, the cosine-similarity operation, the object query variables, and the labels of two different targets.
Computing the first loss function and the second loss function improves the accuracy and efficiency of tracking training and enables fast iteration to convergence. An illustrative sketch of such cosine-similarity losses is given below.
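Because the exact loss formulas appear only as images in the original publication, the sketch below shows one plausible cosine-similarity realization consistent with the stated behaviour (queries of the same target similar across adjacent frames, queries of different targets dissimilar within and across frames); the specific form is an assumption, not the patent's formula, and it assumes that row i of the query tensors of frames t and t+1 refers to the same target.

```python
import torch
import torch.nn.functional as F

def adjacent_frame_loss(q_t, q_t1):
    """Assumed form of the first loss: pull queries of the same target in frames t and t+1
    together, push queries of different targets apart, using cosine similarity."""
    sim = F.normalize(q_t, dim=1) @ F.normalize(q_t1, dim=1).T   # pairwise cosine similarities
    same = torch.eye(sim.shape[0], dtype=torch.bool)
    return (1.0 - sim[same]).mean() + sim[~same].clamp(min=0.0).mean()

def same_frame_loss(q):
    """Assumed form of the second loss: regularizer making different targets in one frame dissimilar."""
    sim = F.normalize(q, dim=1) @ F.normalize(q, dim=1).T
    off_diag = ~torch.eye(sim.shape[0], dtype=torch.bool)
    return sim[off_diag].clamp(min=0.0).mean()
```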
In a second aspect, the invention further provides a device for implementing any of the above multi-target tracking network construction methods, comprising:
an acquisition unit 101, configured to acquire training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and to annotate each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set;
a construction unit 102, configured to construct an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder;
wherein the backbone network samples the training image groups to form feature map groups; the top-view encoder perceives spatial information in the backbone network's feature map groups, fuses it with category information, and forms a feature matrix containing spatial information through top-view projection; and the spatial decoder decodes the feature matrix containing spatial information frame by frame and updates the object query variables;
and a training unit 103, configured to iteratively train the initial network on the multi-target tracking data set until convergence, obtaining the top-view-projection-based multi-target tracking network.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A multi-target tracking network construction method based on top-view projection, characterized by comprising the following steps:
acquiring training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and annotating each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set;
constructing an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder;
iteratively training the initial network on the multi-target tracking data set until convergence to obtain the top-view-projection-based multi-target tracking network;
wherein the backbone network samples the training image groups to form feature map groups; the top-view encoder perceives spatial information in the backbone network's feature map groups, fuses it with category information, and forms a feature matrix containing spatial information through top-view projection; and the spatial decoder decodes the feature matrix containing spatial information frame by frame and updates the object query variables.
2. The construction method according to claim 1, wherein the spatial information includes the coordinates of each target's center point in space.
3. The construction method according to claim 1, wherein iteratively training the initial network on the multi-target tracking data set until convergence comprises:
S1: selecting the training image group of frame t from the multi-target tracking data set and inputting it into the backbone network;
S2: sampling with the backbone network to form the feature map group of frame t;
S3: inputting the feature map group of frame t into the top-view encoder, perceiving spatial information, fusing it with category information, and forming the frame-t feature matrix containing spatial information through top-view projection;
S4: inputting the frame-t feature matrix into the spatial decoder and obtaining the object query variables of frame t from the initial object query variables;
S5: selecting the training image group of frame t+1 from the multi-target tracking data set, repeating steps S2-S3 to obtain the frame-(t+1) feature matrix, and inputting the frame-(t+1) feature matrix together with the object query variables of frame t into the spatial decoder to obtain the object query variables of frame t+1;
S6: computing the loss from the object query variables of frames t and t+1, and continuing to take subsequent time frames and repeat steps S1-S5 until convergence.
4. The construction method according to claim 3, wherein the top-view encoder comprises a spatial perception module and a projection module, and step S3 specifically comprises:
inputting the feature map group of frame t into the spatial perception module, extracting the targets' spatial features, and fusing them with the category features to form the frame-t target feature map group;
transforming the frame-t target feature map group into a pseudo point cloud through the projection module's top-view projection;
forming the frame-t feature matrix containing spatial information from the transformed pseudo point cloud.
5. The construction method according to claim 4, wherein inputting the feature map group of frame t into the spatial perception module, extracting the targets' spatial features, and fusing the category features to form the frame-t target feature map specifically comprises:
inputting the feature map group of frame t into the spatial perception module, extracting category features of dimension N x k x H/α x W/β, and obtaining spatial features of dimension N x D x H/α x W/β through a convolutional mapping, wherein N is the number of feature maps in the frame-t feature map group, k is the number of category-feature channels, H is the feature map height, α is the compression constant applied to the height, W is the feature map width, β is the compression constant applied to the width, and D is the number of spatial-feature channels;
concatenating the category features and the spatial features along the channel dimension to form a frame-t target feature map of dimension N x (k+D) x H/α x W/β containing spatial information.
6. The construction method according to claim 4, wherein transforming the frame-t target feature map into a pseudo point cloud through the projection module's top-view projection specifically comprises:
establishing a world coordinate system and setting the size of the grid under the world coordinate system;
projecting the pixels of the frame-t target feature map into the grid of the world coordinate system according to a preset projection rule;
analyzing and processing the target features within the grid, and transforming them to obtain the pseudo point cloud.
7. The construction method according to claim 6, wherein establishing a world coordinate system and setting the size of the grid under the world coordinate system comprises:
taking the center point of the device that captured the training image groups as the origin, drawing the Z axis vertically upward, and establishing the world coordinate system;
obtaining the grid size under the world coordinate system from the size of the capture device's detection range and the size of a single grid cell, where the size is calculated as:
x*y = int(L/l) * int(W/w)
wherein x*y is the grid size under the world coordinate system, int is the integer (floor) function, L is the length of the capture device's detection range, W is the width of the capture device's detection range, l is the length of a single grid cell, and w is the width of a single grid cell;
and wherein analyzing and processing the target features within the grid comprises:
counting the pixels that fall into the same grid cell and averaging their features to form the grid-cell feature value;
if the number of pixels in a grid cell is zero, setting that cell's feature value to zero.
8. The construction method according to claim 3, wherein step S6 specifically comprises:
comparing the object query variables of frames t and t+1, computing a first loss function over the object query variables of adjacent frames and a second loss function over the object query variables of the same frame;
repeating steps S1-S5 for subsequent time frames and continuously updating the first and second loss functions;
finishing convergence when both the first loss function and the second loss function are smaller than the set thresholds.
9. The construction method according to claim 8, wherein the first loss function is given by the following formula:
[first loss function formula, rendered as an image in the original publication]
and the second loss function is given by the following formula:
[second loss function formula, rendered as an image in the original publication]
where the symbols denote, respectively, the first loss function, the second loss function, the cosine-similarity operation, the object query variables, and the labels of two different targets.
10. A device for implementing the multi-target tracking network construction method based on top-view projection according to any one of claims 1 to 9, comprising:
an acquisition unit, configured to acquire training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and to annotate each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set;
a construction unit, configured to construct an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder;
wherein the backbone network samples the training image groups to form feature map groups; the top-view encoder perceives spatial information in the backbone network's feature map groups, fuses it with category information, and forms a feature matrix containing spatial information through top-view projection; and the spatial decoder decodes the feature matrix containing spatial information frame by frame and updates the object query variables;
and a training unit, configured to iteratively train the initial network on the multi-target tracking data set until convergence, obtaining the top-view-projection-based multi-target tracking network.
CN202210826068.XA 2022-07-14 2022-07-14 Multi-target tracking network construction method and device based on overlook projection Active CN114913209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210826068.XA CN114913209B (en) 2022-07-14 2022-07-14 Multi-target tracking network construction method and device based on overlook projection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210826068.XA CN114913209B (en) 2022-07-14 2022-07-14 Multi-target tracking network construction method and device based on overlook projection

Publications (2)

Publication Number Publication Date
CN114913209A true CN114913209A (en) 2022-08-16
CN114913209B CN114913209B (en) 2022-10-28

Family

ID=82771876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210826068.XA Active CN114913209B (en) 2022-07-14 2022-07-14 Multi-target tracking network construction method and device based on overlook projection

Country Status (1)

Country Link
CN (1) CN114913209B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138826A1 (en) * 2016-11-14 2019-05-09 Zoox, Inc. Spatial and Temporal Information for Semantic Segmentation
CN110223324A (en) * 2019-06-05 2019-09-10 东华大学 A kind of method for tracking target of the twin matching network indicated based on robust features
CN112505684A (en) * 2020-11-17 2021-03-16 东南大学 Vehicle multi-target tracking method based on radar vision fusion under road side view angle in severe environment
CN113139986A (en) * 2021-04-30 2021-07-20 东风越野车有限公司 Integrated environment perception and multi-target tracking system
CN113139602A (en) * 2021-04-25 2021-07-20 南京航空航天大学 3D target detection method and system based on monocular camera and laser radar fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138826A1 (en) * 2016-11-14 2019-05-09 Zoox, Inc. Spatial and Temporal Information for Semantic Segmentation
CN110223324A (en) * 2019-06-05 2019-09-10 东华大学 A kind of method for tracking target of the twin matching network indicated based on robust features
CN112505684A (en) * 2020-11-17 2021-03-16 东南大学 Vehicle multi-target tracking method based on radar vision fusion under road side view angle in severe environment
CN113139602A (en) * 2021-04-25 2021-07-20 南京航空航天大学 3D target detection method and system based on monocular camera and laser radar fusion
CN113139986A (en) * 2021-04-30 2021-07-20 东风越野车有限公司 Integrated environment perception and multi-target tracking system

Also Published As

Publication number Publication date
CN114913209B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
CN111476822B (en) Laser radar target detection and motion tracking method based on scene flow
CN110415342B (en) Three-dimensional point cloud reconstruction device and method based on multi-fusion sensor
CN110569704B (en) Multi-strategy self-adaptive lane line detection method based on stereoscopic vision
CN109242884B (en) Remote sensing video target tracking method based on JCFNet network
US8582816B2 (en) Method and apparatus for video analytics based object counting
CN111563415A (en) Binocular vision-based three-dimensional target detection system and method
CN113592905B (en) Vehicle driving track prediction method based on monocular camera
US11651581B2 (en) System and method for correspondence map determination
CN116229408A (en) Target identification method for fusing image information and laser radar point cloud information
CN112381132A (en) Target object tracking method and system based on fusion of multiple cameras
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN115273034A (en) Traffic target detection and tracking method based on vehicle-mounted multi-sensor fusion
CN115661569A (en) High-precision fine-grained SAR target detection method
CN114648551B (en) Trajectory prediction method and apparatus
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN113408550B (en) Intelligent weighing management system based on image processing
CN114820765A (en) Image recognition method and device, electronic equipment and computer readable storage medium
CN114913209B (en) Multi-target tracking network construction method and device based on overlook projection
CN116645508A (en) Lightweight semantic target segmentation method based on local window cross attention
CN113920254B (en) Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof
CN113496501B (en) Method and system for detecting invader in dynamic scene based on video prediction
CN112215873A (en) Method for tracking and positioning multiple targets in transformer substation
CN112634331A (en) Optical flow prediction method and device
CN117726687B (en) Visual repositioning method integrating live-action three-dimension and video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant