CN114638862A - Visual tracking method and tracking device

Visual tracking method and tracking device

Info

Publication number
CN114638862A
Authority
CN
China
Prior art keywords
frame
tracking
loss
feature
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210297392.7A
Other languages
Chinese (zh)
Inventor
王好谦
闫嘉依
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202210297392.7A priority Critical patent/CN114638862A/en
Publication of CN114638862A publication Critical patent/CN114638862A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30241 Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a visual tracking method and a tracking device. The method comprises the following steps: acquiring, in real time, a video to be tested that contains a target person; constructing a tracking network and training it with a collected pedestrian video data set, wherein the tracking network, built on a contrastive-learning structure, uses a feature pool structure to update the template features; and determining the target person box of the target person in the video to be tested with the trained tracking network to obtain the tracking result. By adding the feature pool structure, the features of the template branch are optimized: the feature pool dynamically updates the template at low time complexity, matches the features of subsequent frames better, effectively reduces accumulated error, and alleviates tracking-box drift. The feature pool structure also keeps the tracking network model stable during long-term sequence tracking and improves the robustness of the tracking method.

Description

Visual tracking method and tracking device
Technical Field
The invention relates to the technical field of computer vision, in particular to a visual tracking method and a tracking device.
Background
Robotics involves multiple technologies such as perception, path planning, and mechanical control, and person tracking, as a key link in perception, determines the performance of a robot. In recent years, person tracking technology has advanced greatly for two reasons. First, hardware and computing power have improved: advanced computing paradigms such as GPUs and cloud computing, together with the ability to store and process massive data, allow deep learning methods to effectively extract person features in both the temporal and spatial dimensions, making it possible to apply deep learning to person tracking. Second, upstream tasks have matured: the upstream tasks of person tracking include person detection, pedestrian re-identification, and video and image processing, and the continued accumulation of these techniques pushes person tracking toward accurate, real-time tracking. Person tracking technology can therefore provide more accurate, real-time, and effective information for the planning and control stages of a robot.
Person tracking methods based on deep learning have strong generalization ability and can learn from and process large-scale data. However, in real scenes, tracking a target person with a deep-learning-based method suffers from tracking-box drift, which directly affects the performance of subsequent robot planning and control. The drift is especially severe when there is much interference around the target person, in the following three situations: first, when the target person is occluded by other people or objects, the tracking box has difficulty locating the occluded target; second, when the target person walks side by side with others, the tracking box may track a non-target person in the group, or track the target and a non-target person at the same time; third, when a person with a similar build or similar clothing passes by the target person, the tracking box is easily led away.
The prior art lacks a visual tracking method that addresses the problem of tracking-box drift.
The above background disclosure is only for the purpose of assisting understanding of the concept and technical solution of the present invention and does not necessarily belong to the prior art of the present patent application, and should not be used for evaluating the novelty and inventive step of the present application in the case that there is no clear evidence that the above content is disclosed at the filing date of the present patent application.
Disclosure of Invention
The invention provides a visual tracking method and a tracking device for solving the existing problems.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
a visual tracking method, comprising the steps of: s1: acquiring a video to be detected containing a target person in real time; s2: constructing a tracking network and training the tracking network by using the collected pedestrian video data set, wherein the tracking network uses a feature pool structure to update template features based on a contrast learning structure; s3: and determining a target character frame of a target character in the video to be detected by using the trained tracking network to obtain a tracking result.
Preferably, the tracking network is constructed as follows: a feature fusion network enhances and fuses the template features in the feature pool, the head features, and the features of subsequent frames to obtain a fused feature map; a prediction head network then performs prediction on the fused feature map to obtain the tracking result of the subsequent frame.
Preferably, the feature fusion network adopts a Transformer structure with two attention mechanisms, self-attention and cross-attention; the prediction head network comprises three parallel branches, namely a classification branch, a regression branch, and a centerness branch; the classification branch classifies foreground and background in the image frames of the pedestrian video data set; the regression branch regresses the bounding boxes of the image frames in the pedestrian video data set; the centerness branch normalizes the distance from a pixel in the prediction box to the target center.
Preferably, updating the template features with the feature pool structure based on the contrastive-learning structure comprises: a feature pool F = {f_i}, where f_i is a stored frame and i is a positive integer; the stored frames are kept in a queue ordered by the index, and the smaller the index, the closer the frame is to the head of the queue; when i = 1 the stored frame is the template frame, and when i > 1 it is a subsequent frame; for a subsequent frame, the larger the product of its classification-branch prediction confidence and its centerness-branch confidence, the smaller its index i. A threshold is preset: if the number of frames in the feature pool is greater than or equal to the threshold, the feature vectors of a number of frames equal to the threshold are fused to obtain the template features; if the number of frames in the feature pool is smaller than the threshold, the feature vectors of all frames in the pool are fused to obtain the template features.
Preferably, a head box and a whole-body box are added to the feature pool, and the relative position is constrained jointly by the ratio of the segment connecting the centers of the head box and the whole-body box to the diagonal of the whole-body box, and by the angle between that segment and the diagonal; the head box and the whole-body box each maintain a feature pool of this structure, and the head box and the whole-body box of the same frame of the target person occupy the same position in their respective pools.
Preferably, the features in the feature pool are fused by weighted fusion: the weight coefficient of each feature is a power term (the exact expression is given as formula image BDA0003562126120000021 of the original filing), and after the weighted sum of the features is computed the overall coefficient is rescaled so that the weight coefficients sum to 1. The fusion result is expressed by formula image BDA0003562126120000022 of the original filing, where X_k is the fused feature template obtained from the feature pool and k is the number of frames taken from the feature pool for fusion.
Preferably, the collected pedestrian video data set is used to train the tracking network with an overall tracking loss function L_T. The overall tracking loss is composed of two parts, a head constraint loss L_H and a dense loss L_C:

L_T = β L_H + (1 - β) L_C

where β is a hyperparameter.

The head constraint loss L_H is given by formula image BDA0003562126120000031 of the original filing, where γ_1 and γ_2 are hyperparameters, l is the distance between the annotated head box and the annotated whole-body box, L is the diagonal length of the annotated whole-body box, θ is the angle between l and L, and the hatted quantities are the corresponding predicted values.

The dense loss L_C is expressed as:

L_C = L_cls + λ_1 L_reg + λ_2 L_cent

where L_cls is the classification loss, L_reg is the regression loss, L_cent is the centerness loss, and λ_1 and λ_2 are weight parameters.

The classification loss and the centerness loss take the form of a cross-entropy loss (formula image BDA0003562126120000033 of the original filing), where a is cls or cent, L_a is the classification loss or the centerness loss, j indexes the j-th frame sample, y_aj is the label of the j-th frame, and p_aj is the prediction confidence of the classification branch or the centerness branch for the j-th frame.

The prediction confidence of the centerness branch is given by formula image BDA0003562126120000034 of the original filing, where l*, r*, t*, b* are the distances from the predicted center point to the left, right, top, and bottom boundaries of the label box, respectively.

The regression loss is expressed as:

L_reg = L_GIOU + α_1 L_agg + α_2 L_rep

where L_GIOU is the generalized intersection-over-union (GIoU) loss, L_agg is the aggregation loss, L_rep is the repulsion loss, and α_1 and α_2 are weight parameters.

The GIoU loss is:

L_GIOU = 1 - GIOU(gt, b_j)

where the generalized intersection over union GIOU is given by formula image BDA0003562126120000041 of the original filing, gt is the label box, b_j is the prediction box, and C is the smallest box enclosing both gt and b_j.

The aggregation loss is given by formula image BDA0003562126120000042 of the original filing, where gt_j is the label box of the target person in the j-th frame, p_i is a prediction box attributed to the label box of the j-th frame, |p_j+| is the number of candidate boxes of the j-th frame predicted as positive samples, and smooth_l1 is the smooth L1 function (formula image BDA0003562126120000043 of the original filing).

The repulsion loss is given by formula image BDA0003562126120000044 of the original filing, where b_i is a prediction box, g_j is the box, among those predicted as background, that has the largest intersection with the label box of the frame, |p_j+| is the number of candidate boxes of the j-th frame predicted as positive samples, and IOG is the overlap ratio of b_i and g_j; smooth_ln is the smoothed ln function (formula image BDA0003562126120000045 of the original filing), where σ ∈ [0,1) is a smoothing parameter.
Preferably, determining the target person box of the target person in the video to be tested with the trained tracking network comprises the following steps: inputting the features of an image frame of the video to be tested, together with the template features, into the trained tracking network; storing the image frame at the corresponding position of the feature pool according to the product of the classification-branch prediction confidence and the centerness-branch prediction confidence; taking the vector indices for which the square root of that product is greater than 0.5 as the candidate-box index set; finding the candidate-box set in the regression branch according to the candidate-box index set; and selecting, from the candidate-box set, the box with the highest classification-branch prediction confidence as the target person box.
Preferably, the method further comprises performing motion tracking of the target person according to the target person box to obtain the tracking result, specifically comprising: acquiring a depth map from the video to be tested; calibrating the depth map against the image frames of the video to be tested; and obtaining the depth value of each pixel in the region corresponding to the target person box and computing the average of these depth values as the distance of the target person.
The invention also provides a tracking device adopting the visual tracking method.
The invention has the beneficial effects that: the feature pool can dynamically update the template at low time complexity, better match the features of subsequent frames, effectively reduce accumulated errors, alleviate the problem of tracking frame drift and improve the robustness of the tracking method.
Furthermore, the characteristic pool structure can keep the tracking network model stable in long-term sequence tracking, and the robustness of the tracking method is improved.
Furthermore, the invention adds a constraint on the relative positions of the head box and the whole-body box. The head-body constraint assumes that the relative position of a person's head and body does not change while walking; by innovatively introducing this common-sense assumption and adding the relative-position constraint, the search space of the solution is reduced, drift of the tracking box is effectively suppressed, and the interference caused by the tracking box suddenly containing a confusing object is alleviated.
Furthermore, the invention optimizes the network with the overall tracking loss function, which takes into account the attraction exerted on the prediction box by the whole-body label box of the positive sample and the repulsion exerted on the prediction box by whole-body boxes treated as background. Training increases the attraction and repulsion between samples, reduces tracking-box drift, and yields a better tracking effect.
Drawings
Fig. 1 is a schematic diagram of a visual tracking method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a tracking method of a robot in the embodiment of the present invention.
Fig. 3 is a schematic diagram of a method for constructing a tracking network and training it with a collected pedestrian video data set according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a feature fusion network in the embodiment of the present invention.
Fig. 5 is a schematic structural diagram of the prediction head network in the embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a tracking network feature pool module in the embodiment of the present invention.
Fig. 7 is a schematic diagram of a method for determining a target person frame of a target person in the video to be tested by using a trained tracking network in the embodiment of the present invention.
Fig. 8 is a schematic diagram of a method for obtaining a tracking result in an embodiment of the present invention.
FIG. 9 is a diagram illustrating a method for tracking a target person's trajectory according to an embodiment of the invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing or a circuit communication.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
There are three reasons for tracking-box drift. First, the deep learning model is hard to converge: on the one hand, the deep network has high complexity and the data set, taken from real scenes, is itself complex; on the other hand, compared with image data, video data adds a temporal dimension, and person tracking must be trained on video data, which further increases the difficulty of convergence. Second, conventional algorithms ignore the constraints offered by extra supervisory information, such as walking-trajectory information, images of un-occluded parts, and ReID information, which makes them unstable and leads to tracking-box drift. Third, tracking algorithms post-process with non-maximum suppression to select the best tracking box from many candidates, but the box-deletion threshold is hard to tune because non-maximum suppression is sensitive to its threshold setting.
Based on the above analysis, the present invention provides a visual tracking method.
As shown in fig. 1, the present invention provides a visual tracking method, comprising the steps of:
S1: acquiring, in real time, a video to be tested that contains a target person;
S2: constructing a tracking network and training it with a collected pedestrian video data set, wherein the tracking network, built on a contrastive-learning structure, uses a feature pool structure to update the template features;
S3: determining the target person box of the target person in the video to be tested with the trained tracking network to obtain the tracking result.
According to the invention, adding the feature pool structure optimizes the features of the template branch: the feature pool dynamically updates the template at low time complexity, better matches the features of subsequent frames, effectively reduces accumulated error, alleviates tracking-box drift, and improves the robustness of the tracking method.
Furthermore, the characteristic pool structure can keep the tracking network model stable in long-term sequence tracking, and the robustness of the tracking method is improved.
In a specific embodiment, the above method is applied to a tracking robot.
As shown in fig. 2, the tracking device of the present invention is a tracking robot, and the tracking method of the tracking robot specifically comprises the following steps:
A1: acquiring information from the RGBD camera of the robot, and obtaining an RGB image and a depth map from it;
A2: inputting the RGB image into the tracking network RobTranSim, which updates the template features with the feature pool structure on the basis of the contrastive-learning structure;
A3: obtaining the target person box;
A4: acquiring depth information from the depth map;
A5: estimating the trajectory of the target person;
A6: driving the robot to follow the motion of the target person.
It can be understood that the RGBD camera on the tracking robot of the present invention acquires the video to be tested containing the target person.
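As a purely illustrative sketch (the patent provides no code), the robot-side loop corresponding to steps A1 to A6 could look like the following; camera, tracker, and robot are hypothetical interfaces standing in for the RGBD camera, the RobTranSim tracking network, and the robot controller:

```python
def tracking_loop(camera, tracker, robot):
    # camera.read(), tracker.track() and robot.follow() are hypothetical placeholders;
    # the depth map is assumed to be a numpy array aligned to the RGB image.
    while True:
        rgb, depth = camera.read()                     # A1: RGB image and depth map
        box = tracker.track(rgb)                       # A2-A3: target person box
        if box is None:
            continue
        x1, y1, x2, y2 = [int(v) for v in box]
        region = depth[y1:y2, x1:x2]
        valid = region[region > 0]                     # ignore missing depth readings
        if valid.size == 0:
            continue
        distance = float(valid.mean())                 # A4: average depth inside the box
        robot.follow(box, distance)                    # A5-A6: update trajectory, drive
```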
As shown in fig. 3, the present invention provides a method for constructing a tracking network and training the tracking network with a collected pedestrian video data set, which specifically includes the following steps:
B1: collecting an annotated pedestrian video data set;
B2: constructing the person tracking network RobTranSim;
B3: constructing a feature pool and updating the template features;
B4: training the tracking network on the collected pedestrian video data set with the overall tracking loss function L_T;
B5: tracking the target person in the video to be tested with the trained network to obtain the tracking result.
In step B1, pedestrian videos are first captured and then annotated frame by frame with label boxes. The label boxes include a whole-body box and a head box: the whole-body box records the horizontal and vertical coordinates of the upper-left corner of the person box together with the width and height of the box, and the head box records the horizontal and vertical coordinates of the head center. During tracking, the network outputs the prediction boxes of each frame, comprising a predicted head box and a predicted whole-body box.
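For illustration only (the patent does not specify a file format), one frame's annotation could be stored as a record like the following; all field names are hypothetical:

```python
# Hypothetical per-frame annotation record for the labels described above: the
# whole-body box as upper-left corner plus width and height, and the head box as
# the coordinates of the head center.
frame_label = {
    "frame_id": 17,
    "whole_body_box": {"x": 312, "y": 96, "w": 88, "h": 215},  # upper-left corner, width, height
    "head_center": {"x": 356, "y": 110},                        # head center coordinates
}
```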
In step B2, the tracking network adopts the following structure based on contrastive learning:
a feature fusion network enhances and fuses the template features in the feature pool, the head features, and the features of subsequent frames to obtain a fused feature map;
a prediction head network then performs prediction on the fused feature map to obtain the tracking result of the subsequent frame.
As shown in fig. 4, the feature fusion network adopts a Transformer structure with two attention mechanisms, self-attention and cross-attention. Compared with fusion by convolutional layers, the Transformer can learn global information, and its non-linear structure can fuse more effective representations.
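A minimal sketch of such a fusion block is given below, assuming flattened feature tokens and standard multi-head attention; the layer sizes and single-block design are illustrative assumptions, not the patent's exact RobTranSim architecture:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, search_feat, template_feat):
        # search_feat: (B, HW, C) flattened features of the subsequent (search) frame
        # template_feat: (B, N, C) fused template and head features from the feature pool
        x = self.n1(search_feat + self.self_attn(search_feat, search_feat, search_feat)[0])
        x = self.n2(x + self.cross_attn(x, template_feat, template_feat)[0])
        return self.n3(x + self.ffn(x))   # fused feature map (still in token form)
```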
As shown in fig. 5, the prediction head network comprises three parallel branches: a classification branch, a regression branch, and a centerness branch. The classification branch classifies foreground and background in the image frames of the pedestrian video data set; the regression branch regresses the bounding boxes of the image frames in the pedestrian video data set; the centerness branch normalizes the distance from a pixel in the prediction box to the target center.
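A minimal sketch of a three-branch head of this kind follows; the 3x3 convolutions and channel widths are assumptions rather than the patent's exact layers:

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        def tower():
            return nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.cls_tower, self.reg_tower = tower(), tower()
        self.cls_out = nn.Conv2d(dim, 1, 3, padding=1)    # foreground/background score
        self.cent_out = nn.Conv2d(dim, 1, 3, padding=1)   # centerness score
        self.reg_out = nn.Conv2d(dim, 4, 3, padding=1)    # box offsets (left, top, right, bottom)

    def forward(self, fused_map):                         # fused_map: (B, C, H, W)
        c, r = self.cls_tower(fused_map), self.reg_tower(fused_map)
        return (torch.sigmoid(self.cls_out(c)),           # classification confidence
                torch.sigmoid(self.cent_out(c)),          # centerness confidence
                torch.relu(self.reg_out(r)))              # non-negative boundary distances
```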
Boxes whose intersection over union with the label box annotated in step B1 is greater than 0.5 are assigned class 1, and boxes whose intersection over union is less than or equal to 0.5 are assigned class 0.
In step B3, updating the template features with the feature pool structure based on the contrastive-learning structure comprises the following.
A feature pool F = {f_i} is maintained, where f_i is a stored frame and i is a positive integer; the stored frames are kept in a queue ordered by the index, and the smaller the index, the closer the frame is to the head of the queue. When i = 1 the stored frame is the template frame, and when i > 1 it is a subsequent frame; for a subsequent frame, the larger the product of its classification-branch prediction confidence and its centerness-branch confidence, the smaller its index i.
Based on the assumption that the relative positions of a pedestrian's head and body do not change, a relative-position constraint is added, which reduces the search space of the solution and effectively suppresses tracking-box drift. A head box and a whole-body box are added to the feature pool, and the relative position is constrained jointly by the ratio of the segment connecting the centers of the head box and the whole-body box to the diagonal of the whole-body box, and by the angle between that segment and the diagonal.
A feature extraction network extracts the frame feature x_i of the current frame, which is stored in the feature pool and used to update it. A threshold is preset: if the number of frames in the feature pool is greater than or equal to the threshold, the feature vectors of a number of frames equal to the threshold are fused to obtain the template features; if the number of frames in the pool is smaller than the threshold, the feature vectors of all frames in the pool are fused. The head box and the whole-body box each maintain a feature pool of this structure, and the head box and the whole-body box of the same frame of the target person occupy the same position in their respective pools.
In a specific embodiment, the feature extraction network is ResNet 50.
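The following sketch illustrates the feature pool described above under stated assumptions: frames are ordered by the confidence product with the template fixed at the head of the queue, and fusion uses exponentially decaying weights normalized to sum to 1. The patent gives its exact weight expression only as a formula image (discussed next), so the decay factor here is an assumption:

```python
import numpy as np

class FeaturePool:
    def __init__(self, threshold=8, capacity=64, decay=0.5):
        # threshold: number of frames fused into the template (8 in this embodiment);
        # capacity and decay are illustrative assumptions.
        self.threshold, self.capacity, self.decay = threshold, capacity, decay
        self.entries = []                            # list of (confidence product, feature)

    def add_template(self, feat):
        # the template frame always occupies the head of the queue
        self.entries.insert(0, (float("inf"), feat))

    def add_frame(self, feat, cls_conf, cent_conf):
        score = cls_conf * cent_conf                 # larger product -> smaller index
        self.entries.append((score, feat))
        self.entries.sort(key=lambda e: -e[0])
        del self.entries[self.capacity:]             # bound the queue length

    def fused_template(self):
        if not self.entries:
            return None
        k = min(self.threshold, len(self.entries))
        feats = np.stack([f for _, f in self.entries[:k]])
        w = self.decay ** np.arange(k)               # assumed power-law weights
        w = w / w.sum()                              # rescale so the weights sum to 1
        return (w[:, None] * feats).sum(axis=0)
```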
In a preferred embodiment, the features in the feature pool are fused by weighted fusion: the weight coefficient of each feature is a power term (the exact expression is given as formula image BDA0003562126120000091 of the original filing), and after the weighted sum of the features is computed the overall coefficient is rescaled so that the weight coefficients sum to 1. The fusion result is expressed by formula image BDA0003562126120000092 of the original filing, where X_k is the fused feature template obtained from the feature pool and k is the number of frames taken from the feature pool for fusion.
This approach has three advantages. First, it fuses richer information: it does not consider only the features of the template frame, and more important features receive larger weight coefficients. Second, the amount of computation is small: when the feature pool is updated, only the incremental contribution of the frame features newly entering the pool needs to be added, and the fusion result does not have to be recomputed at every iteration. Third, it is more effective for long-term sequence tracking: the longer the sequence, the more stable the template-frame features become.
The tracking network is then trained with the overall tracking loss function L_T. The overall tracking loss takes into account the attraction exerted on the prediction box by the whole-body label box of the positive sample and the repulsion exerted on the prediction box by whole-body boxes treated as background; training increases the attraction and repulsion between samples, reduces tracking-box drift, and yields a better tracking effect.
In a specific embodiment, the overall tracking loss is composed of two parts, a head constraint loss L_H and a dense loss L_C:

L_T = β L_H + (1 - β) L_C

where β is a hyperparameter; during training, the overall tracking loss is minimized.

The head constraint loss L_H is given by formula image BDA0003562126120000093 of the original filing, where γ_1 and γ_2 are hyperparameters, l is the distance between the annotated head box and the annotated whole-body box, L is the diagonal length of the annotated whole-body box, θ is the angle between l and L, and the hatted quantities are the corresponding predicted values.

The dense loss L_C is expressed as:

L_C = L_cls + λ_1 L_reg + λ_2 L_cent

where L_cls is the classification loss, L_reg is the regression loss, L_cent is the centerness loss, and λ_1 and λ_2 are weight parameters.

The classification loss and the centerness loss take the form of a cross-entropy loss (formula image BDA0003562126120000101 of the original filing), where a is cls or cent, L_a is the classification loss or the centerness loss, j indexes the j-th frame sample, y_aj is the label of the j-th frame, and p_aj is the prediction confidence of the classification branch or the centerness branch for the j-th frame.

The prediction confidence of the centerness branch is given by formula image BDA0003562126120000102 of the original filing, where l*, r*, t*, b* are the distances from the predicted center point to the left, right, top, and bottom boundaries of the label box, respectively.
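Since the patent states the centerness expression only as a formula image, the sketch below uses the FCOS-style square-root form as an assumed stand-in consistent with the boundary-distance definition above:

```python
import numpy as np

def centerness(l, r, t, b):
    # l, r, t, b: distances from the predicted center point to the left, right,
    # top and bottom boundaries of the label box (assumed positive).
    return np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```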
the regression loss expression is:
Lreg=LGIOU1Lagg2Lrep
wherein L isGIOUIs the loss of the generalized cross-over ratio, LaggIs the loss of polymerization, LrepIs the rejection loss, alpha1、α2Is a weight parameter.
The generalized intersection-to-parallel ratio loss function expression is as follows:
LGIOU=1-GIOU(gt,bj)
the generalized cross-over ratio expression is as follows:
Figure BDA0003562126120000103
wherein gt is a whole body tag frame, bjIs a whole body prediction box, C is a box capable of enclosing gt and bjThe smallest box of (c).
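The GIoU loss itself is standard; a self-contained sketch using the usual published definition (IoU minus the fraction of the smallest enclosing box not covered by the union) is given below, with boxes in (x1, y1, x2, y2) format:

```python
def giou_loss(gt, pred):
    # intersection of the label box gt and the prediction box pred
    ix1, iy1 = max(gt[0], pred[0]), max(gt[1], pred[1])
    ix2, iy2 = min(gt[2], pred[2]), min(gt[3], pred[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    union = area_g + area_p - inter
    # smallest box C enclosing both gt and pred
    cx1, cy1 = min(gt[0], pred[0]), min(gt[1], pred[1])
    cx2, cy2 = max(gt[2], pred[2]), max(gt[3], pred[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = inter / union - (area_c - union) / area_c
    return 1.0 - giou
```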
The aggregation loss is given by formula image BDA0003562126120000104 of the original filing, where gt_j is the whole-body label box of the target person in the j-th frame, p_i is a whole-body prediction box attributed to the label box of the j-th frame, |p_j+| is the number of candidate boxes of the j-th frame predicted as positive samples, and smooth_l1 is the smooth L1 function (formula image BDA0003562126120000111 of the original filing).

The repulsion loss is given by formula image BDA0003562126120000112 of the original filing, where b_i is a whole-body prediction box, g_j is the box, among those predicted as background, that has the largest intersection with the whole-body label box of the frame, |p_j+| is the number of candidate boxes of the j-th frame predicted as positive samples, and IOG is the overlap ratio of b_i and g_j; smooth_ln is the smoothed ln function (formula image BDA0003562126120000113 of the original filing), where σ ∈ [0,1) is a smoothing parameter.
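Taking the component losses as given (their exact expressions appear as formula images above), the composition of the overall tracking loss follows directly from the stated formulas; the hyperparameter values below are placeholders:

```python
def overall_tracking_loss(L_H, L_cls, L_cent, L_giou, L_agg, L_rep,
                          beta=0.5, lam1=1.0, lam2=1.0, alpha1=1.0, alpha2=1.0):
    # L_H: head constraint loss; the remaining arguments are the dense-loss components,
    # assumed to be computed elsewhere.
    L_reg = L_giou + alpha1 * L_agg + alpha2 * L_rep   # regression loss
    L_C = L_cls + lam1 * L_reg + lam2 * L_cent         # dense loss
    return beta * L_H + (1.0 - beta) * L_C             # overall tracking loss L_T
```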
In step B5, in the feature pool structure shown in fig. 6, the template frame is stored at the head of the queue in the feature pool, and its position is not adjusted afterwards.
Each subsequent frame f_i of the video sequence is input into the feature extraction network of the feature pool to obtain its feature vector.
A threshold is set: if the number of frames in the feature pool is greater than or equal to the threshold, the feature vectors of the first k frames in the pool (k equal to the threshold) are fused to obtain the template features; if the number of frames in the pool is smaller than the threshold, the feature vectors of all frames in the pool are fused. The threshold is a positive integer greater than or equal to 1; in this embodiment the threshold is 8, at which the marginal utility of the feature pool is largest. In particular, when the threshold is 1, the network degenerates into a standard contrastive-learning network.
As shown in fig. 7, determining the whole-body box of the target person in the video to be tested with the trained tracking network comprises the following steps:
inputting the features of an image frame of the video to be tested, together with the template features, into the trained tracking network;
storing the image frame at the corresponding position of the feature pool according to the product of the classification-branch prediction confidence and the centerness-branch prediction confidence;
taking the vector indices for which the square root of that product is greater than 0.5 as the candidate-box index set, and finding the candidate-box set in the regression branch according to the candidate-box index set;
and selecting, from the candidate-box set, the box with the highest classification-branch prediction confidence as the whole-body box of the target person.
Specifically, the features of frame f_i and the template features are input into the trained network at the same time, and the prediction results of the four branches are computed as shown in fig. 5.
The box with the highest classification-branch prediction confidence in the candidate-box set is taken as the tracking result of the frame.
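A minimal sketch of this candidate selection step is given below; the flat per-location arrays are an assumed layout of the branch outputs:

```python
import numpy as np

def select_target_box(cls_conf, cent_conf, reg_boxes):
    # cls_conf, cent_conf: (N,) per-location confidences; reg_boxes: (N, 4) regressed boxes
    score = np.sqrt(cls_conf * cent_conf)
    candidates = np.flatnonzero(score > 0.5)          # candidate-box index set
    if candidates.size == 0:
        return None
    best = candidates[np.argmax(cls_conf[candidates])]
    return reg_boxes[best]                            # box with the highest cls confidence
```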
As shown in fig. 8, the tracking method of the present invention further comprises performing motion tracking of the target person according to the target person box to obtain the tracking result, specifically comprising:
acquiring a depth map from the video to be tested;
calibrating the depth map against the image frames of the video to be tested;
and obtaining the depth value of each pixel in the region corresponding to the whole-body box of the target person and computing the average of these depth values as the distance of the target person.
As shown in fig. 9, corresponding to step a5, the tracking of the target person's trajectory using the apparatus of the present invention includes the following steps:
c1: obtaining a depth map:
in the present embodiment, a depth map portion is extracted from information of the RGBD camera.
C2: the depth map is calibrated with the RGB map:
in order to obtain a clear depth map, in this embodiment, first, the opencv library is used for calibration, and internal and external parameters of the camera are acquired and corrected to realize epipolar alignment. And obtaining a disparity map, calculating a depth map by a HashMatch method, and splicing the calculated depth map and the original camera depth map.
C3: and (3) intercepting a depth map in the character frame:
in the present embodiment, in the depth map, the depth of the corresponding position in the prediction frame of the RGB map of each frame is acquired.
C4: calculating the average depth:
in the present embodiment, the average distance is calculated from the in-frame depth predicted from the RGB map per frame.
C5: and updating the distance and the angle of the robot.
In the above process, the method further comprises the following steps:
C6: adjusting the camera so that the person box stays at the center of the field of view;
C7: calculating the deflection angle of the camera relative to its last position.
In a specific example, a test was carried out with the method and apparatus described above. First, videos of the tracked target person were collected in scenes with dense crowds, and each frame was annotated with a head box and a whole-body box. The tracking network model was trained with the first 70% of the images on a server built from 8 RTX 2080 Ti GPUs; the tracking effect was tested with the remaining 30% of the images. Comparing the test results with prior-art single-target tracking methods shows that the tracking precision of this embodiment is higher.
TABLE 1 Experimental results

Tracking method      ATOM    SiamRPN++    TransT    Method of the invention
Precision (%)        80.5    76.8         83.9      85.3
An embodiment of the present application further provides a control apparatus, including a processor and a storage medium for storing a computer program; wherein a processor is adapted to perform at least the method as described above when executing the computer program.
Embodiments of the present application also provide a storage medium for storing a computer program, which when executed performs at least the method described above.
Embodiments of the present application further provide a processor, where the processor executes a computer program to perform at least the method described above.
The storage medium may be implemented by any type of volatile or non-volatile storage device, or a combination thereof. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The storage media described in connection with the embodiments of the invention are intended to comprise, without being limited to, these and any other suitable types of memory.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to arrive at new method embodiments.
The features disclosed in the several product embodiments presented in this application can be combined arbitrarily, without conflict, to arrive at new product embodiments.
The features disclosed in the several method or apparatus embodiments provided herein may be combined in any combination to arrive at a new method or apparatus embodiment without conflict.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention is not to be considered limited to these descriptions. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all of them shall be considered to fall within the scope of protection of the invention.

Claims (10)

1. A visual tracking method, comprising the following steps:
S1: acquiring, in real time, a video to be tested that contains a target person;
S2: constructing a tracking network and training it with a collected pedestrian video data set, wherein the tracking network, built on a contrastive-learning structure, uses a feature pool structure to update the template features;
S3: determining the target person box of the target person in the video to be tested with the trained tracking network to obtain the tracking result.
2. The visual tracking method of claim 1, wherein the tracking network is constructed as follows:
a feature fusion network enhances and fuses the template features in the feature pool, the head features, and the features of subsequent frames to obtain a fused feature map;
a prediction head network then performs prediction on the fused feature map to obtain the tracking result of the subsequent frame.
3. The visual tracking method of claim 2, wherein the feature fusion network adopts a Transformer structure with two attention mechanisms, self-attention and cross-attention;
the prediction head network comprises three parallel branches, namely a classification branch, a regression branch, and a centerness branch;
the classification branch classifies foreground and background in the image frames of the pedestrian video data set;
the regression branch regresses the bounding boxes of the image frames in the pedestrian video data set;
the centerness branch normalizes the distance from a pixel in the prediction box to the target center.
4. The visual tracking method of claim 3, wherein updating the template features with the feature pool structure based on the contrastive-learning structure comprises:
a feature pool F = {f_i}, where f_i is a stored frame and i is a positive integer; the stored frames are kept in a queue ordered by the index, and the smaller the index, the closer the frame is to the head of the queue; when i = 1 the stored frame is the template frame, and when i > 1 it is a subsequent frame; for a subsequent frame, the larger the product of its classification-branch prediction confidence and its centerness-branch confidence, the smaller its index i;
presetting a threshold, and if the number of frames in the feature pool is greater than or equal to the threshold, fusing the feature vectors of a number of frames equal to the threshold to obtain the template features; if the number of frames in the feature pool is smaller than the threshold, fusing the feature vectors of all frames in the pool to obtain the template features.
5. The visual tracking method of claim 4, wherein a head box and a whole-body box are added to the feature pool, and the relative position is constrained jointly by the ratio of the segment connecting the centers of the head box and the whole-body box to the diagonal of the whole-body box, and by the angle between that segment and the diagonal;
the head box and the whole-body box each maintain a feature pool of said structure, and the head box and the whole-body box of the same frame of the target person occupy the same position in their respective pools.
6. The visual tracking method of claim 5, wherein the features in the feature pool are fused by weighted fusion: the weight coefficient of each feature is a power term (the exact expression is given as formula image FDA0003562126110000021 of the original filing), and after the weighted sum of the features is computed the overall coefficient is rescaled so that the weight coefficients sum to 1; the fusion result is expressed by formula image FDA0003562126110000022 of the original filing, where X_k is the fused feature template obtained from the feature pool and k is the number of frames taken from the feature pool for fusion.
7. The visual tracking method of claim 6, wherein the collected pedestrian video data set is used to train the tracking network with an overall tracking loss function L_T;
the overall tracking loss is composed of two parts, a head constraint loss L_H and a dense loss L_C:

L_T = β L_H + (1 - β) L_C

where β is a hyperparameter;
the head constraint loss L_H is given by formula image FDA0003562126110000023 of the original filing, where γ_1 and γ_2 are hyperparameters, l is the distance between the annotated head box and the annotated whole-body box, L is the diagonal length of the annotated whole-body box, θ is the angle between l and L, and the hatted quantities are the corresponding predicted values;
the dense loss L_C is expressed as:

L_C = L_cls + λ_1 L_reg + λ_2 L_cent

where L_cls is the classification loss, L_reg is the regression loss, L_cent is the centerness loss, and λ_1 and λ_2 are weight parameters;
the classification loss and the centerness loss take the form of a cross-entropy loss (formula image FDA0003562126110000025 of the original filing), where a is cls or cent, L_a is the classification loss or the centerness loss, j indexes the j-th frame sample, y_aj is the label of the j-th frame, and p_aj is the prediction confidence of the classification branch or the centerness branch for the j-th frame;
the prediction confidence of the centerness branch is given by formula image FDA0003562126110000031 of the original filing, where l*, r*, t*, b* are the distances from the predicted center point to the left, right, top, and bottom boundaries of the whole-body label box, respectively;
the regression loss is expressed as:

L_reg = L_GIOU + α_1 L_agg + α_2 L_rep

where L_GIOU is the generalized intersection-over-union (GIoU) loss, L_agg is the aggregation loss, L_rep is the repulsion loss, and α_1 and α_2 are weight parameters;
the GIoU loss is:

L_GIOU = 1 - GIOU(gt, b_j)

where the generalized intersection over union GIOU is given by formula image FDA0003562126110000032 of the original filing, gt is the whole-body label box, b_j is the whole-body prediction box, and C is the smallest box enclosing both gt and b_j;
the aggregation loss is given by formula image FDA0003562126110000033 of the original filing, where gt_j is the whole-body label box of the target person in the j-th frame, p_i is a whole-body prediction box attributed to the label box of the j-th frame, |p_j+| is the number of candidate boxes of the j-th frame predicted as positive samples, and smooth_l1 is the smooth L1 function (formula image FDA0003562126110000034 of the original filing);
the repulsion loss is given by formula image FDA0003562126110000035 of the original filing, where b_i is a whole-body prediction box, g_j is the box, among those predicted as background, that has the largest intersection with the whole-body label box of the frame, |p_j+| is the number of candidate boxes of the j-th frame predicted as positive samples, and IOG is the overlap ratio of b_i and g_j; smooth_ln is the smoothed ln function (formula image FDA0003562126110000041 of the original filing), where σ ∈ [0,1) is a smoothing parameter.
8. The visual tracking method of claim 7, wherein determining the target person box of the target person in the video to be tested with the trained tracking network comprises:
inputting the features of an image frame of the video to be tested, together with the template features, into the trained tracking network;
storing the image frame at the corresponding position of the feature pool according to the product of the classification-branch prediction confidence and the centerness-branch prediction confidence;
taking the vector indices for which the square root of that product is greater than 0.5 as the candidate-box index set, and finding the candidate-box set in the regression branch according to the candidate-box index set;
and selecting, from the candidate-box set, the box with the highest classification-branch prediction confidence as the target person box.
9. The visual tracking method of claim 8, further comprising: performing motion tracking of the target person according to the target person box to obtain the tracking result, specifically comprising:
acquiring a depth map from the video to be tested;
calibrating the depth map against the image frames of the video to be tested;
and obtaining the depth value of each pixel in the region corresponding to the target person box and computing the average of these depth values as the distance of the target person.
10. A tracking device, characterized in that a visual tracking method according to any one of claims 1-9 is used.
CN202210297392.7A 2022-03-24 2022-03-24 Visual tracking method and tracking device Pending CN114638862A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210297392.7A CN114638862A (en) 2022-03-24 2022-03-24 Visual tracking method and tracking device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210297392.7A CN114638862A (en) 2022-03-24 2022-03-24 Visual tracking method and tracking device

Publications (1)

Publication Number Publication Date
CN114638862A true CN114638862A (en) 2022-06-17

Family

ID=81949472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210297392.7A Pending CN114638862A (en) 2022-03-24 2022-03-24 Visual tracking method and tracking device

Country Status (1)

Country Link
CN (1) CN114638862A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115393384A (en) * 2022-09-14 2022-11-25 清华大学 Cross-camera-based multi-target tracking model training method and device
CN116402858A (en) * 2023-04-11 2023-07-07 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method
CN116402858B (en) * 2023-04-11 2023-11-21 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination