CN114913209A - Multi-target tracking network construction method and device based on top-view projection - Google Patents

Multi-target tracking network construction method and device based on top-view projection

Info

Publication number
CN114913209A
CN114913209A
Authority
CN
China
Prior art keywords
frame
target
target tracking
spatial information
grid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210826068.XA
Other languages
Chinese (zh)
Other versions
CN114913209B (en)
Inventor
李勇
戴亮
戴红波
苏进和
张少成
耿阳
张维
郭志峰
汤青
王浩
郭旋
束长勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Xiangtai Electric Power Industry Co ltd
Nanjing Houmo Intelligent Technology Co ltd
Taizhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Jiangsu Xiangtai Electric Power Industry Co ltd
Nanjing Houmo Intelligent Technology Co ltd
Taizhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Xiangtai Electric Power Industry Co ltd, Nanjing Houmo Intelligent Technology Co ltd, Taizhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd filed Critical Jiangsu Xiangtai Electric Power Industry Co ltd
Priority to CN202210826068.XA priority Critical patent/CN114913209B/en
Publication of CN114913209A publication Critical patent/CN114913209A/en
Application granted granted Critical
Publication of CN114913209B publication Critical patent/CN114913209B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/08 - Projecting images onto non-planar surfaces, e.g. geodetic screens
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Radar Systems Or Details Thereof (AREA)

Abstract

The invention discloses a multi-target tracking network construction method and device based on top-view projection. The method comprises: acquiring training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and annotating each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set; constructing an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder; and iteratively training the initial network on the multi-target tracking data set until convergence, obtaining the top-view-projection-based multi-target tracking network. By constructing a multi-target tracking network based on top-view projection, the invention alleviates problems such as overlap, occlusion and low accuracy that arise when multi-target tracking is implemented with 2D detection, and improves multi-target tracking capability.

Description

Multi-target tracking network construction method and device based on top-view projection
Technical Field
The invention relates to the field of information processing technology, and in particular to a multi-target tracking network construction method and device based on top-view projection.
Background
Target tracking has important application value in autonomous driving and security systems. Current monocular tracking methods, which analyze images or video captured by a camera, are mainly built on 2D detection. Because 2D detection carries no spatial information, targets are prone to overlap and occlusion, which makes post-processing association based on 2D detections difficult and can cause tracking failure. Even objects that are far apart along the viewing direction in real space may still overlap on the image plane, since the pixel spacing perceived on the image plane is limited, so conventional target tracking methods based on 2D detection boxes struggle to distinguish overlapping targets.
For example, patent CN114419098A provides a moving-target trajectory prediction method and apparatus based on visual transformation. It performs trajectory prediction and target tracking by extracting depth features from acquired 2D bounding boxes to obtain predicted trajectory coordinates of the moving target, and then converts these predicted coordinates into the vehicle coordinate system using a pre-calibrated coordinate transformation between image pixels and the vehicle body. The extraction and identification of the depth features rely on the deep convolutional re-identification appearance model of the multi-target tracking algorithm DeepSORT, fused with a target tracking module based on a cascade matching algorithm.
This scheme can mitigate missed detections and target occlusion to some extent. However, the recognition error of the depth features depends on the attributes of the target features themselves, so the target tracking accuracy is still not good enough. In practice, tracking usually has to take the characteristics of the tracked targets into account comprehensively. For example, when tracking different moving objects, one must consider their differing inertia, their object characteristics, and their differing motion-speed profiles, such as the movement speeds and trajectories of vehicles versus pedestrians.
Therefore, how to construct a tracking network adapted to multiple targets, so as to alleviate and overcome problems such as overlap, occlusion and low accuracy that arise when multi-target tracking is implemented with 2D detection, and thereby improve multi-target tracking capability, is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
To address the deficiencies of the prior art, the invention provides a multi-target tracking network construction method and device based on top-view projection. The constructed top-view-projection-based multi-target tracking network alleviates problems such as overlap, occlusion and low accuracy that arise when multi-target tracking is implemented with 2D detection, and improves multi-target tracking capability.
In a first aspect, the invention provides a multi-target tracking network construction method based on top-view projection, comprising the following steps:
acquiring training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and annotating each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set;
constructing an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder;
iteratively training the initial network on the multi-target tracking data set until convergence to obtain the top-view-projection-based multi-target tracking network;
wherein the backbone network samples the training image groups to form feature map groups; the top-view encoder perceives spatial information in the backbone network's feature map groups, fuses it with category information, and forms a feature matrix containing spatial information through top-view projection; and the spatial decoder decodes the feature matrix containing spatial information frame by frame and updates the object query variables.
Further, the spatial information includes the coordinates of each target's center point in space.
Further, iteratively training the initial network on the multi-target tracking data set until convergence specifically includes:
S1: selecting the training image group of frame t from the multi-target tracking data set and inputting it into the backbone network;
S2: sampling with the backbone network to form the feature map group of frame t;
S3: inputting the feature map group of frame t into the top-view encoder, perceiving spatial information, fusing it with category information, and forming the frame-t feature matrix containing spatial information through top-view projection;
S4: inputting the frame-t feature matrix into the spatial decoder and obtaining the object query variables of frame t from the initial object query variables;
S5: selecting the training image group of frame t+1 from the multi-target tracking data set, repeating steps S2-S3 to obtain the frame-(t+1) feature matrix, and inputting the frame-(t+1) feature matrix together with the object query variables of frame t into the spatial decoder to obtain the object query variables of frame t+1;
S6: computing the loss from the object query variables of frames t and t+1, and continuing to take subsequent time frames and repeat steps S1-S5 until convergence.
Further, the top-view encoder comprises a spatial perception module and a projection module, and step S3 specifically includes:
inputting the feature map group of frame t into the spatial perception module, extracting the targets' spatial features, and fusing them with the category features to form the frame-t target feature map group;
transforming the frame-t target feature map group into a pseudo point cloud through the projection module's top-view projection;
forming the frame-t feature matrix containing spatial information from the transformed pseudo point cloud.
Further, inputting the feature map group of frame t into the spatial perception module, extracting the targets' spatial features, and fusing the category features to form the frame-t target feature map specifically includes:
inputting the feature map group of frame t into the spatial perception module, extracting category features of dimension N x k x H/α x W/β, and obtaining spatial features of dimension N x D x H/α x W/β through a convolutional mapping, where N is the number of feature maps in the frame-t feature map group, k is the number of category-feature channels, H is the feature map height, α is the compression constant applied to the height, W is the feature map width, β is the compression constant applied to the width, and D is the number of spatial-feature channels;
concatenating the category features and the spatial features along the channel dimension to form a frame-t target feature map of dimension N x (k+D) x H/α x W/β containing spatial information.
Further, transforming the frame-t target feature map into a pseudo point cloud through the projection module's top-view projection specifically includes:
establishing a world coordinate system and setting the size of the grid under the world coordinate system;
projecting the pixels of the frame-t target feature map into the grid of the world coordinate system according to a preset projection rule;
analyzing and processing the target features within the grid, and transforming them to obtain the pseudo point cloud.
Further, establishing a world coordinate system and setting the size of the grid under the world coordinate system specifically includes:
taking the center point of the device that captured the training image groups as the origin, drawing the Z axis vertically upward, and establishing the world coordinate system;
obtaining the grid size under the world coordinate system from the size of the capture device's detection range and the size of a single grid cell, where the size is calculated as:
x*y = int(L/l) * int(W/w)
where x*y is the grid size under the world coordinate system, int is the integer (floor) function, L is the length of the capture device's detection range, W is the width of the capture device's detection range, l is the length of a single grid cell, and w is the width of a single grid cell;
analyzing and processing the target features within the grid specifically includes:
counting the pixels that fall into the same grid cell and averaging their features to form the grid-cell feature value;
if the number of pixels in a grid cell is zero, setting that cell's feature value to zero.
Further, step S6 specifically includes:
comparing the object query variables of frames t and t+1, computing a first loss function over the object query variables of adjacent frames and a second loss function over the object query variables of the same frame;
repeating steps S1-S5 for subsequent time frames and continuously updating the first and second loss functions;
finishing convergence when both the first loss function and the second loss function are smaller than the set thresholds.
Further, the formula of the first loss function is as follows:
[first loss function formula, rendered as an image in the original publication]
and the formula of the second loss function is as follows:
[second loss function formula, rendered as an image in the original publication]
where the symbols denote, respectively, the first loss function, the second loss function, the cosine-similarity operation, the object query variables, and the labels of two different targets.
In a second aspect, the invention further provides a device for implementing any of the above multi-target tracking network construction methods, comprising:
an acquisition unit, configured to acquire training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and to annotate each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set;
a construction unit, configured to construct an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder;
wherein the backbone network samples the training image groups to form feature map groups; the top-view encoder perceives spatial information in the backbone network's feature map groups, fuses it with category information, and forms a feature matrix containing spatial information through top-view projection; and the spatial decoder decodes the feature matrix containing spatial information frame by frame and updates the object query variables;
and a training unit, configured to iteratively train the initial network on the multi-target tracking data set until convergence, obtaining the top-view-projection-based multi-target tracking network.
The multi-target tracking network construction method and device based on top-view projection provided by the invention have at least the following beneficial effects:
(1) The constructed multi-target tracking network works on feature maps containing spatial information; the top-view encoder and spatial decoder fuse spatial features with category features, and the loss of the object query variables is computed across time frames, which alleviates and overcomes problems such as overlap, occlusion and low accuracy that arise when multi-target tracking is implemented with 2D detection, and improves multi-target tracking capability.
(2) During tracking training, the output object query variables containing the targets at the previous time step are used as the input object query variables at the current time step, realizing single-stage input and making the tracking network more concise.
(3) Spatial information is introduced into the top-view encoder, so the object query variables contain spatial features as well as features characterizing the targets' appearance and category; these richer features benefit target association and tracking.
(4) Computing a first loss function and a second loss function over the object query variables of the same and adjacent time frames improves the accuracy and efficiency of tracking training and enables fast iteration to convergence.
Drawings
FIG. 1 is a flow chart of constructing the multi-target tracking network based on top-view projection according to the invention;
FIG. 2 is an overall architecture diagram of the initial network of the invention;
FIG. 3 is a structural diagram of the top-view encoder of the initial network of the invention;
FIG. 4 is a diagram of the spatial feature fusion process of the initial network of the invention;
FIG. 5 shows a device for implementing the multi-target tracking network construction method of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two.
It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the article or device that includes the element.
As shown in FIG. 1, the invention provides a multi-target tracking network construction method based on top-view projection, comprising the following steps:
acquiring training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and annotating each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set;
The training images in the multi-target tracking data set may come from any source containing any kind of multiple targets (such as pedestrians, animals, vehicles and the like).
constructing an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder;
iteratively training the initial network on the multi-target tracking data set until convergence to obtain the top-view-projection-based multi-target tracking network;
As shown in FIG. 2, the initial network comprises a backbone network, a top-view encoder and a spatial decoder. The backbone network samples the training image groups to form feature map groups; the top-view encoder perceives spatial information in the backbone network's feature map groups, fuses it with category information, and forms a feature matrix containing spatial information through top-view projection; and the spatial decoder decodes the feature matrix containing spatial information frame by frame and updates the object query variables.
The spatial information annotated for each target in each frame's training image group mainly comprises the coordinates of the target's center point in space.
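The three components can be pictured with the following PyTorch-style skeleton; the class name, constructor arguments, query dimension, and the idea of passing the backbone, encoder and decoder in as modules are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class TopViewTrackingNet(nn.Module):
    """Sketch of the initial network: backbone + top-view encoder + spatial decoder."""

    def __init__(self, backbone, topview_encoder, spatial_decoder, num_queries=100, dim=256):
        super().__init__()
        self.backbone = backbone                # samples a frame's images into a feature map group
        self.topview_encoder = topview_encoder  # fuses spatial and category info, projects to top view
        self.spatial_decoder = spatial_decoder  # decodes the top-view features, updates object queries
        self.init_queries = nn.Parameter(torch.randn(num_queries, dim))  # initial object query variables

    def forward(self, frames, prev_queries=None):
        # frames: (N, 3, H, W) training image group for one time frame
        feats = self.backbone(frames)                                    # feature map group
        bev = self.topview_encoder(feats)                                # feature matrix with spatial info
        queries = self.init_queries if prev_queries is None else prev_queries
        return self.spatial_decoder(bev, queries)                        # updated object query variables
```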
As shown in FIG. 2, starting from time frame t, iteratively training the initial network on the multi-target tracking data set until convergence specifically includes:
S1: selecting the training image group of frame t from the multi-target tracking data set and inputting it into the backbone network;
S2: sampling with the backbone network to form the feature map group of frame t;
S3: inputting the feature map group of frame t into the top-view encoder, perceiving spatial information, fusing it with category information, and forming the frame-t feature matrix containing spatial information through top-view projection;
S4: inputting the frame-t feature matrix into the spatial decoder and obtaining the object query variables of frame t from the initial object query variables;
S5: selecting the training image group of frame t+1 from the multi-target tracking data set, repeating steps S2-S3 to obtain the frame-(t+1) feature matrix, and inputting the frame-(t+1) feature matrix together with the object query variables of frame t into the spatial decoder to obtain the object query variables of frame t+1;
S6: computing the loss from the object query variables of frames t and t+1, and continuing to take subsequent time frames and repeat steps S1-S5 until convergence. That is, training continues with frame t+2, frame t+3, and so on, computing the loss over the object query variables of adjacent frames until convergence.
In target tracking training, the output object query variables containing the targets at the previous time step are used as the input object query variables at the current time step, realizing single-stage input and making the tracking network more concise. In addition, spatial information is introduced into the top-view encoder, so the object query variables contain spatial features as well as features characterizing the targets' appearance and category; these richer features benefit target association and tracking. A sketch of this training loop is given below.
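A minimal sketch of the frame-by-frame training loop of steps S1-S6 follows, assuming the network interface sketched earlier and user-supplied loss callables and optimizer; it is an illustration of the described procedure, not the patent's implementation.

```python
def train_sequence(net, frame_groups, adjacent_frame_loss, same_frame_loss, optimizer, threshold=0.05):
    """Iterate over consecutive time frames, chaining object queries (sketch of S1-S6)."""
    prev_queries = None
    for t in range(len(frame_groups) - 1):
        q_t = net(frame_groups[t], prev_queries)       # S1-S4: object queries for frame t
        q_t1 = net(frame_groups[t + 1], q_t)           # S5: object queries for frame t+1
        loss1 = adjacent_frame_loss(q_t, q_t1)         # first loss: adjacent-frame queries
        loss2 = same_frame_loss(q_t1)                  # second loss: same-frame queries
        (loss1 + loss2).backward()
        optimizer.step()
        optimizer.zero_grad()
        prev_queries = q_t1.detach()                   # previous output queries feed the next step
        if loss1.item() < threshold and loss2.item() < threshold:
            break                                      # S6: both losses below the set thresholds
    return net
```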
As shown in FIG. 3, the top-view encoder of the initial network comprises a spatial perception module and a projection module. In step S3, again taking frame t as an example, inputting the feature map group of frame t into the top-view encoder, perceiving spatial information, fusing it with category information, and forming the frame-t feature matrix containing spatial information through top-view projection specifically includes:
inputting the feature map group of frame t into the spatial perception module, extracting the targets' spatial features, and fusing them with the category features to form the frame-t target feature map group;
transforming the frame-t target feature map group into a pseudo point cloud through the projection module's top-view projection;
forming the frame-t feature matrix containing spatial information from the transformed pseudo point cloud.
The processing for other time frames is the same as for frame t and is not repeated here.
As shown in FIG. 4, inputting the feature map group of frame t into the spatial perception module, extracting the targets' spatial features, and fusing the category features to form the frame-t target feature map specifically includes:
inputting the feature map group of frame t into the spatial perception module, extracting category features of dimension N x k x H/α x W/β, and obtaining spatial features of dimension N x D x H/α x W/β through a convolutional mapping, where N is the number of feature maps in the frame-t feature map group, k is the number of category-feature channels, H is the feature map height, α is the compression constant applied to the height, W is the feature map width, β is the compression constant applied to the width, and D is the number of spatial-feature channels. The values of k, α and β can be set according to the requirements of the scene; for example, k may be set to 64 and α and β both to 16.
concatenating the category features and the spatial features along the channel dimension to form a frame-t target feature map of dimension N x (k+D) x H/α x W/β containing spatial information. A sketch of this fusion is given below.
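The dimension bookkeeping above can be illustrated with a small PyTorch sketch; the single 1x1 convolution used as the convolutional mapping and the example values k = 64, D = 64, α = β = 16 (giving a 32 x 32 map for a 512 x 512 input) are assumptions.

```python
import torch
import torch.nn as nn

class SpatialPerception(nn.Module):
    """Maps category features (N, k, H/α, W/β) to spatial features (N, D, H/α, W/β)
    and concatenates both into a fused map of shape (N, k + D, H/α, W/β)."""

    def __init__(self, k=64, d=64):
        super().__init__()
        self.to_spatial = nn.Conv2d(k, d, kernel_size=1)  # convolutional mapping to spatial features

    def forward(self, category_feats):
        spatial_feats = self.to_spatial(category_feats)
        return torch.cat([category_feats, spatial_feats], dim=1)  # channel-wise concatenation

# Example: N = 2 feature maps, k = 64 category channels, a 512 x 512 input compressed 16x to 32 x 32.
fused = SpatialPerception(k=64, d=64)(torch.randn(2, 64, 32, 32))
print(fused.shape)  # torch.Size([2, 128, 32, 32])
```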
Transforming the frame-t target feature map into a pseudo point cloud through the projection module's top-view projection specifically includes the following steps:
establishing a world coordinate system and setting the size of the grid under the world coordinate system;
projecting the pixels of the frame-t target feature map into the grid of the world coordinate system according to a preset projection rule;
analyzing and processing the target features within the grid, and transforming them to obtain the pseudo point cloud.
Establishing a world coordinate system and setting the size of the grid under the world coordinate system specifically includes the following steps:
taking the center point of the device that captured the training image groups as the origin, drawing the Z axis vertically upward, and establishing the world coordinate system;
obtaining the grid size under the world coordinate system from the size of the capture device's detection range and the size of a single grid cell, where the size is calculated as:
x*y = int(L/l) * int(W/w)
where x*y is the grid size under the world coordinate system, int is the integer (floor) function, L is the length of the capture device's detection range, W is the width of the capture device's detection range, l is the length of a single grid cell, and w is the width of a single grid cell;
analyzing and processing the target features within the grid specifically includes the following steps:
counting the pixels that fall into the same grid cell and averaging their features to form the grid-cell feature value;
if the number of pixels in a grid cell is zero, setting that cell's feature value to zero. A sketch of this grid pooling is given below.
In the specific process of iteratively training the initial network on the multi-target tracking data set until convergence, step S6 specifically includes:
comparing the object query variables of frames t and t+1, computing a first loss function over the object query variables of adjacent frames and a second loss function over the object query variables of the same frame;
repeating steps S1-S5 for subsequent time frames and continuously updating the first and second loss functions;
finishing convergence when both the first loss function and the second loss function are smaller than the set thresholds.
The detection loss for targets within the same time frame can be realized with the commonly used Hungarian matching algorithm, which is not specifically limited here. For target tracking across adjacent time frames, iteration is performed on the object query variables.
The first loss function is given by a formula that appears as an image in the original publication and is not reproduced here. The first loss function must be smaller than a set threshold; the specific value of the threshold is chosen according to the actual scene and is not specified here. The labels of different targets in this formula express that the object query variables of the same target at different times should be sufficiently similar, while the object query variables of different targets should be as dissimilar as possible.
The second loss function is likewise given by a formula that appears as an image in the original publication. In target tracking, even targets of the same class are independent instances, so a regularization term, namely the second loss function, is introduced to constrain the object query variables. The second loss function must also be smaller than a set threshold, which may be the same as or different from the threshold of the first loss function and is likewise chosen according to the actual scene. The labels of different targets in this formula express that different targets within the same frame should be as dissimilar as possible.
In both formulas, the symbols denote, respectively, the first loss function, the second loss function, the cosine-similarity operation, the object query variables, and the labels of two different targets.
Computing the first loss function and the second loss function improves the accuracy and efficiency of tracking training and enables fast iteration to convergence. An illustrative sketch of such cosine-similarity losses is given below.
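Because the exact loss formulas appear only as images in the original publication, the sketch below shows one plausible cosine-similarity realization consistent with the stated behaviour (queries of the same target similar across adjacent frames, queries of different targets dissimilar within and across frames); the specific form is an assumption, not the patent's formula, and it assumes that row i of the query tensors of frames t and t+1 refers to the same target.

```python
import torch
import torch.nn.functional as F

def adjacent_frame_loss(q_t, q_t1):
    """Assumed form of the first loss: pull queries of the same target in frames t and t+1
    together, push queries of different targets apart, using cosine similarity."""
    sim = F.normalize(q_t, dim=1) @ F.normalize(q_t1, dim=1).T   # pairwise cosine similarities
    same = torch.eye(sim.shape[0], dtype=torch.bool)
    return (1.0 - sim[same]).mean() + sim[~same].clamp(min=0.0).mean()

def same_frame_loss(q):
    """Assumed form of the second loss: regularizer making different targets in one frame dissimilar."""
    sim = F.normalize(q, dim=1) @ F.normalize(q, dim=1).T
    off_diag = ~torch.eye(sim.shape[0], dtype=torch.bool)
    return sim[off_diag].clamp(min=0.0).mean()
```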
In a second aspect, the invention further provides a device for implementing any of the above multi-target tracking network construction methods, comprising:
an acquisition unit 101, configured to acquire training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and to annotate each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set;
a construction unit 102, configured to construct an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder;
wherein the backbone network samples the training image groups to form feature map groups; the top-view encoder perceives spatial information in the backbone network's feature map groups, fuses it with category information, and forms a feature matrix containing spatial information through top-view projection; and the spatial decoder decodes the feature matrix containing spatial information frame by frame and updates the object query variables;
and a training unit 103, configured to iteratively train the initial network on the multi-target tracking data set until convergence, obtaining the top-view-projection-based multi-target tracking network.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A multi-target tracking network construction method based on top-view projection, characterized by comprising the following steps:
acquiring training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and annotating each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set;
constructing an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder;
iteratively training the initial network on the multi-target tracking data set until convergence to obtain the top-view-projection-based multi-target tracking network;
wherein the backbone network samples the training image groups to form feature map groups; the top-view encoder perceives spatial information in the backbone network's feature map groups, fuses it with category information, and forms a feature matrix containing spatial information through top-view projection; and the spatial decoder decodes the feature matrix containing spatial information frame by frame and updates the object query variables.
2. The construction method according to claim 1, wherein the spatial information includes the coordinates of each target's center point in space.
3. The construction method according to claim 1, wherein iteratively training the initial network on the multi-target tracking data set until convergence comprises:
S1: selecting the training image group of frame t from the multi-target tracking data set and inputting it into the backbone network;
S2: sampling with the backbone network to form the feature map group of frame t;
S3: inputting the feature map group of frame t into the top-view encoder, perceiving spatial information, fusing it with category information, and forming the frame-t feature matrix containing spatial information through top-view projection;
S4: inputting the frame-t feature matrix into the spatial decoder and obtaining the object query variables of frame t from the initial object query variables;
S5: selecting the training image group of frame t+1 from the multi-target tracking data set, repeating steps S2-S3 to obtain the frame-(t+1) feature matrix, and inputting the frame-(t+1) feature matrix together with the object query variables of frame t into the spatial decoder to obtain the object query variables of frame t+1;
S6: computing the loss from the object query variables of frames t and t+1, and continuing to take subsequent time frames and repeat steps S1-S5 until convergence.
4. The construction method according to claim 3, wherein the top-view encoder comprises a spatial perception module and a projection module, and step S3 specifically comprises:
inputting the feature map group of frame t into the spatial perception module, extracting the targets' spatial features, and fusing them with the category features to form the frame-t target feature map group;
transforming the frame-t target feature map group into a pseudo point cloud through the projection module's top-view projection;
forming the frame-t feature matrix containing spatial information from the transformed pseudo point cloud.
5. The construction method according to claim 4, wherein inputting the feature map group of frame t into the spatial perception module, extracting the targets' spatial features, and fusing the category features to form the frame-t target feature map specifically comprises:
inputting the feature map group of frame t into the spatial perception module, extracting category features of dimension N x k x H/α x W/β, and obtaining spatial features of dimension N x D x H/α x W/β through a convolutional mapping, wherein N is the number of feature maps in the frame-t feature map group, k is the number of category-feature channels, H is the feature map height, α is the compression constant applied to the height, W is the feature map width, β is the compression constant applied to the width, and D is the number of spatial-feature channels;
concatenating the category features and the spatial features along the channel dimension to form a frame-t target feature map of dimension N x (k+D) x H/α x W/β containing spatial information.
6. The construction method according to claim 4, wherein transforming the frame-t target feature map into a pseudo point cloud through the projection module's top-view projection specifically comprises:
establishing a world coordinate system and setting the size of the grid under the world coordinate system;
projecting the pixels of the frame-t target feature map into the grid of the world coordinate system according to a preset projection rule;
analyzing and processing the target features within the grid, and transforming them to obtain the pseudo point cloud.
7. The construction method according to claim 6, wherein establishing a world coordinate system and setting the size of the grid under the world coordinate system comprises:
taking the center point of the device that captured the training image groups as the origin, drawing the Z axis vertically upward, and establishing the world coordinate system;
obtaining the grid size under the world coordinate system from the size of the capture device's detection range and the size of a single grid cell, where the size is calculated as:
x*y = int(L/l) * int(W/w)
wherein x*y is the grid size under the world coordinate system, int is the integer (floor) function, L is the length of the capture device's detection range, W is the width of the capture device's detection range, l is the length of a single grid cell, and w is the width of a single grid cell;
and wherein analyzing and processing the target features within the grid comprises:
counting the pixels that fall into the same grid cell and averaging their features to form the grid-cell feature value;
if the number of pixels in a grid cell is zero, setting that cell's feature value to zero.
8. The construction method according to claim 3, wherein step S6 specifically comprises:
comparing the object query variables of frames t and t+1, computing a first loss function over the object query variables of adjacent frames and a second loss function over the object query variables of the same frame;
repeating steps S1-S5 for subsequent time frames and continuously updating the first and second loss functions;
finishing convergence when both the first loss function and the second loss function are smaller than the set thresholds.
9. The construction method according to claim 8, wherein the first loss function is given by the following formula:
[first loss function formula, rendered as an image in the original publication]
and the second loss function is given by the following formula:
[second loss function formula, rendered as an image in the original publication]
where the symbols denote, respectively, the first loss function, the second loss function, the cosine-similarity operation, the object query variables, and the labels of two different targets.
10. A device for implementing the multi-target tracking network construction method based on top-view projection according to any one of claims 1 to 9, comprising:
an acquisition unit, configured to acquire training image groups for a plurality of consecutive time frames, wherein each training image group contains the same set of targets, and to annotate each target's category, center-point coordinates, annotation-box width and height, and spatial information to obtain a multi-target tracking data set;
a construction unit, configured to construct an initial multi-target tracking network comprising a backbone network, a top-view encoder and a spatial decoder;
wherein the backbone network samples the training image groups to form feature map groups; the top-view encoder perceives spatial information in the backbone network's feature map groups, fuses it with category information, and forms a feature matrix containing spatial information through top-view projection; and the spatial decoder decodes the feature matrix containing spatial information frame by frame and updates the object query variables;
and a training unit, configured to iteratively train the initial network on the multi-target tracking data set until convergence, obtaining the top-view-projection-based multi-target tracking network.
CN202210826068.XA 2022-07-14 2022-07-14 Multi-target tracking network construction method and device based on overlook projection Active CN114913209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210826068.XA CN114913209B (en) 2022-07-14 2022-07-14 Multi-target tracking network construction method and device based on overlook projection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210826068.XA CN114913209B (en) 2022-07-14 2022-07-14 Multi-target tracking network construction method and device based on overlook projection

Publications (2)

Publication Number Publication Date
CN114913209A true CN114913209A (en) 2022-08-16
CN114913209B CN114913209B (en) 2022-10-28

Family

ID=82771876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210826068.XA Active CN114913209B (en) 2022-07-14 2022-07-14 Multi-target tracking network construction method and device based on overlook projection

Country Status (1)

Country Link
CN (1) CN114913209B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138826A1 (en) * 2016-11-14 2019-05-09 Zoox, Inc. Spatial and Temporal Information for Semantic Segmentation
CN110223324A (en) * 2019-06-05 2019-09-10 东华大学 A kind of method for tracking target of the twin matching network indicated based on robust features
CN112505684A (en) * 2020-11-17 2021-03-16 东南大学 Vehicle multi-target tracking method based on radar vision fusion under road side view angle in severe environment
CN113139986A (en) * 2021-04-30 2021-07-20 东风越野车有限公司 Integrated environment perception and multi-target tracking system
CN113139602A (en) * 2021-04-25 2021-07-20 南京航空航天大学 3D target detection method and system based on monocular camera and laser radar fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138826A1 (en) * 2016-11-14 2019-05-09 Zoox, Inc. Spatial and Temporal Information for Semantic Segmentation
CN110223324A (en) * 2019-06-05 2019-09-10 东华大学 A kind of method for tracking target of the twin matching network indicated based on robust features
CN112505684A (en) * 2020-11-17 2021-03-16 东南大学 Vehicle multi-target tracking method based on radar vision fusion under road side view angle in severe environment
CN113139602A (en) * 2021-04-25 2021-07-20 南京航空航天大学 3D target detection method and system based on monocular camera and laser radar fusion
CN113139986A (en) * 2021-04-30 2021-07-20 东风越野车有限公司 Integrated environment perception and multi-target tracking system

Also Published As

Publication number Publication date
CN114913209B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
CN111476822B (en) Laser radar target detection and motion tracking method based on scene flow
CN110415342B (en) Three-dimensional point cloud reconstruction device and method based on multi-fusion sensor
CN110569704B (en) Multi-strategy self-adaptive lane line detection method based on stereoscopic vision
CN109242884B (en) Remote sensing video target tracking method based on JCFNet network
US8582816B2 (en) Method and apparatus for video analytics based object counting
CN111563415A (en) Binocular vision-based three-dimensional target detection system and method
CN113592905B (en) Vehicle driving track prediction method based on monocular camera
US11651581B2 (en) System and method for correspondence map determination
CN116229408A (en) Target identification method for fusing image information and laser radar point cloud information
CN112381132A (en) Target object tracking method and system based on fusion of multiple cameras
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN115273034A (en) Traffic target detection and tracking method based on vehicle-mounted multi-sensor fusion
CN115661569A (en) High-precision fine-grained SAR target detection method
CN114648551B (en) Trajectory prediction method and apparatus
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN113408550B (en) Intelligent weighing management system based on image processing
CN114820765A (en) Image recognition method and device, electronic equipment and computer readable storage medium
CN114913209B (en) Multi-target tracking network construction method and device based on overlook projection
CN116645508A (en) Lightweight semantic target segmentation method based on local window cross attention
CN113920254B (en) Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof
CN113496501B (en) Method and system for detecting invader in dynamic scene based on video prediction
CN112215873A (en) Method for tracking and positioning multiple targets in transformer substation
CN112634331A (en) Optical flow prediction method and device
CN117726687B (en) Visual repositioning method integrating live-action three-dimension and video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant