CN114638862A - Visual tracking method and tracking device

Visual tracking method and tracking device

Info

Publication number
CN114638862A
Authority
CN
China
Prior art keywords
frame
tracking
loss
feature
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210297392.7A
Other languages
Chinese (zh)
Inventor
王好谦
闫嘉依
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202210297392.7A priority Critical patent/CN114638862A/en
Publication of CN114638862A publication Critical patent/CN114638862A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30241 Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a visual tracking method and a tracking device. The method comprises the following steps: acquiring, in real time, a video to be tested that contains a target person; constructing a tracking network and training it with a collected pedestrian video data set, wherein the tracking network, built on a contrastive-learning structure, uses a feature pool structure to update the template features; and determining the target person box of the target person in the video to be tested with the trained tracking network to obtain the tracking result. By adding the feature pool structure, the features of the template branch are optimized: the feature pool dynamically updates the template at low time complexity, matches the features of subsequent frames better, effectively reduces accumulated error, and alleviates tracking-box drift. The feature pool structure also keeps the tracking network model stable during long-term sequence tracking and improves the robustness of the tracking method.

Description

Visual tracking method and tracking device
Technical Field
The invention relates to the technical field of computer vision, in particular to a visual tracking method and a tracking device.
Background
Robotics involves multiple technologies such as perception, path planning, and mechanical control, and person tracking, as a key link in perception, determines the performance of a robot. In recent years, person tracking technology has advanced greatly for two reasons. First, hardware and computing power have improved: advanced computing paradigms such as GPUs and cloud computing, together with the ability to store and process massive data, allow deep learning methods to effectively extract person features in both the temporal and spatial dimensions, making it possible to apply deep learning to person tracking. Second, upstream tasks have matured: the upstream tasks of person tracking include person detection, pedestrian re-identification, and video and image processing, and the continued accumulation of these techniques pushes person tracking toward accurate, real-time tracking. Person tracking technology can therefore provide more accurate, real-time, and effective information for the planning and control stages of a robot.
Person tracking methods based on deep learning have strong generalization ability and can learn from and process large-scale data. However, in real scenes, tracking a target person with a deep-learning-based method suffers from tracking-box drift, which directly affects the performance of subsequent robot planning and control. The drift is especially severe when there is much interference around the target person, in the following three situations: first, when the target person is occluded by other people or objects, the tracking box has difficulty locating the occluded target; second, when the target person walks side by side with others, the tracking box may track a non-target person in the group, or track the target and a non-target person at the same time; third, when a person with a similar build or similar clothing passes by the target person, the tracking box is easily led away.
The prior art lacks a visual tracking method that addresses the problem of tracking-box drift.
The above background disclosure is only for the purpose of assisting understanding of the concept and technical solution of the present invention and does not necessarily belong to the prior art of the present patent application, and should not be used for evaluating the novelty and inventive step of the present application in the case that there is no clear evidence that the above content is disclosed at the filing date of the present patent application.
Disclosure of Invention
The invention provides a visual tracking method and a tracking device for solving the existing problems.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
a visual tracking method, comprising the steps of: s1: acquiring a video to be detected containing a target person in real time; s2: constructing a tracking network and training the tracking network by using the collected pedestrian video data set, wherein the tracking network uses a feature pool structure to update template features based on a contrast learning structure; s3: and determining a target character frame of a target character in the video to be detected by using the trained tracking network to obtain a tracking result.
Preferably, the tracking network is constructed as follows: a feature fusion network enhances and fuses the template features in the feature pool, the head features, and the features of subsequent frames to obtain a fused feature map; a prediction head network then performs prediction on the fused feature map to obtain the tracking result of the subsequent frame.
Preferably, the feature fusion network adopts a Transformer structure with two attention mechanisms, self-attention and cross-attention; the prediction head network comprises three parallel branches, namely a classification branch, a regression branch, and a centerness branch; the classification branch classifies foreground and background in the image frames of the pedestrian video data set; the regression branch regresses the bounding boxes of the image frames in the pedestrian video data set; the centerness branch normalizes the distance from a pixel in the prediction box to the target center.
Preferably, updating the template features with the feature pool structure based on the contrastive-learning structure comprises: a feature pool F = {f_i}, where f_i is a stored frame and i is a positive integer; the stored frames are kept in a queue ordered by the index, and the smaller the index, the closer the frame is to the head of the queue; when i = 1 the stored frame is the template frame, and when i > 1 it is a subsequent frame; for a subsequent frame, the larger the product of its classification-branch prediction confidence and its centerness-branch confidence, the smaller its index i. A threshold is preset: if the number of frames in the feature pool is greater than or equal to the threshold, the feature vectors of a number of frames equal to the threshold are fused to obtain the template features; if the number of frames in the feature pool is smaller than the threshold, the feature vectors of all frames in the pool are fused to obtain the template features.
Preferably, a head box and a whole-body box are added to the feature pool, and the relative position is constrained jointly by the ratio of the segment connecting the centers of the head box and the whole-body box to the diagonal of the whole-body box, and by the angle between that segment and the diagonal; the head box and the whole-body box each maintain a feature pool of this structure, and the head box and the whole-body box of the same frame of the target person occupy the same position in their respective pools.
Preferably, the features in the feature pool are fused by weighted fusion: the weight coefficient of each feature is a power term (the exact expression is given as formula image BDA0003562126120000021 of the original filing), and after the weighted sum of the features is computed the overall coefficient is rescaled so that the weight coefficients sum to 1. The fusion result is expressed by formula image BDA0003562126120000022 of the original filing, where X_k is the fused feature template obtained from the feature pool and k is the number of frames taken from the feature pool for fusion.
Preferably, the collected pedestrian video data set is used to train the tracking network with an overall tracking loss function L_T. The overall tracking loss is composed of two parts, a head constraint loss L_H and a dense loss L_C:

L_T = β L_H + (1 - β) L_C

where β is a hyperparameter.

The head constraint loss L_H is given by formula image BDA0003562126120000031 of the original filing, where γ_1 and γ_2 are hyperparameters, l is the distance between the annotated head box and the annotated whole-body box, L is the diagonal length of the annotated whole-body box, θ is the angle between l and L, and the hatted quantities are the corresponding predicted values.

The dense loss L_C is expressed as:

L_C = L_cls + λ_1 L_reg + λ_2 L_cent

where L_cls is the classification loss, L_reg is the regression loss, L_cent is the centerness loss, and λ_1 and λ_2 are weight parameters.

The classification loss and the centerness loss take the form of a cross-entropy loss (formula image BDA0003562126120000033 of the original filing), where a is cls or cent, L_a is the classification loss or the centerness loss, j indexes the j-th frame sample, y_aj is the label of the j-th frame, and p_aj is the prediction confidence of the classification branch or the centerness branch for the j-th frame.

The prediction confidence of the centerness branch is given by formula image BDA0003562126120000034 of the original filing, where l*, r*, t*, b* are the distances from the predicted center point to the left, right, top, and bottom boundaries of the label box, respectively.

The regression loss is expressed as:

L_reg = L_GIOU + α_1 L_agg + α_2 L_rep

where L_GIOU is the generalized intersection-over-union (GIoU) loss, L_agg is the aggregation loss, L_rep is the repulsion loss, and α_1 and α_2 are weight parameters.

The GIoU loss is:

L_GIOU = 1 - GIOU(gt, b_j)

where the generalized intersection over union GIOU is given by formula image BDA0003562126120000041 of the original filing, gt is the label box, b_j is the prediction box, and C is the smallest box enclosing both gt and b_j.

The aggregation loss is given by formula image BDA0003562126120000042 of the original filing, where gt_j is the label box of the target person in the j-th frame, p_i is a prediction box attributed to the label box of the j-th frame, |p_j+| is the number of candidate boxes of the j-th frame predicted as positive samples, and smooth_l1 is the smooth L1 function (formula image BDA0003562126120000043 of the original filing).

The repulsion loss is given by formula image BDA0003562126120000044 of the original filing, where b_i is a prediction box, g_j is the box, among those predicted as background, that has the largest intersection with the label box of the frame, |p_j+| is the number of candidate boxes of the j-th frame predicted as positive samples, and IOG is the overlap ratio of b_i and g_j; smooth_ln is the smoothed ln function (formula image BDA0003562126120000045 of the original filing), where σ ∈ [0,1) is a smoothing parameter.
Preferably, determining the target person box of the target person in the video to be tested with the trained tracking network comprises the following steps: inputting the features of an image frame of the video to be tested, together with the template features, into the trained tracking network; storing the image frame at the corresponding position of the feature pool according to the product of the classification-branch prediction confidence and the centerness-branch prediction confidence; taking the vector indices for which the square root of that product is greater than 0.5 as the candidate-box index set; finding the candidate-box set in the regression branch according to the candidate-box index set; and selecting, from the candidate-box set, the box with the highest classification-branch prediction confidence as the target person box.
Preferably, the method further comprises performing motion tracking of the target person according to the target person box to obtain the tracking result, specifically comprising: acquiring a depth map from the video to be tested; calibrating the depth map against the image frames of the video to be tested; and obtaining the depth value of each pixel in the region corresponding to the target person box and computing the average of these depth values as the distance of the target person.
The invention also provides a tracking device adopting the visual tracking method.
The invention has the beneficial effects that: the feature pool can dynamically update the template at low time complexity, better match the features of subsequent frames, effectively reduce accumulated errors, alleviate the problem of tracking frame drift and improve the robustness of the tracking method.
Furthermore, the characteristic pool structure can keep the tracking network model stable in long-term sequence tracking, and the robustness of the tracking method is improved.
Furthermore, the invention adds a constraint on the relative positions of the head box and the whole-body box. The head-body constraint assumes that the relative position of a person's head and body does not change while walking; by innovatively introducing this common-sense assumption and adding the relative-position constraint, the search space of the solution is reduced, drift of the tracking box is effectively suppressed, and the interference caused by the tracking box suddenly containing a confusing object is alleviated.
Furthermore, the invention optimizes the network with the overall tracking loss function, which takes into account the attraction exerted on the prediction box by the whole-body label box of the positive sample and the repulsion exerted on the prediction box by whole-body boxes treated as background. Training increases the attraction and repulsion between samples, reduces tracking-box drift, and yields a better tracking effect.
Drawings
Fig. 1 is a schematic diagram of a visual tracking method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a tracking method of a robot in the embodiment of the present invention.
Fig. 3 is a schematic diagram of a method for constructing a tracking network and training it with a collected pedestrian video data set according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a feature fusion network in the embodiment of the present invention.
Fig. 5 is a schematic structural diagram of the prediction head network in the embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a tracking network feature pool module in the embodiment of the present invention.
Fig. 7 is a schematic diagram of a method for determining a target person frame of a target person in the video to be tested by using a trained tracking network in the embodiment of the present invention.
Fig. 8 is a schematic diagram of a method for obtaining a tracking result in an embodiment of the present invention.
FIG. 9 is a diagram illustrating a method for tracking a target person's trajectory according to an embodiment of the invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing or a circuit communication.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
There are three reasons for tracking-box drift. First, the deep learning model is hard to converge: on the one hand, the deep network has high complexity and the data set, taken from real scenes, is itself complex; on the other hand, compared with image data, video data adds a temporal dimension, and person tracking must be trained on video data, which further increases the difficulty of convergence. Second, conventional algorithms ignore the constraints offered by extra supervisory information, such as walking-trajectory information, images of un-occluded parts, and ReID information, which makes them unstable and leads to tracking-box drift. Third, tracking algorithms post-process with non-maximum suppression to select the best tracking box from many candidates, but the box-deletion threshold is hard to tune because non-maximum suppression is sensitive to its threshold setting.
Based on the above analysis, the present invention provides a visual tracking method.
As shown in fig. 1, the present invention provides a visual tracking method, comprising the steps of:
S1: acquiring, in real time, a video to be tested that contains a target person;
S2: constructing a tracking network and training it with a collected pedestrian video data set, wherein the tracking network, built on a contrastive-learning structure, uses a feature pool structure to update the template features;
S3: determining the target person box of the target person in the video to be tested with the trained tracking network to obtain the tracking result.
According to the invention, adding the feature pool structure optimizes the features of the template branch: the feature pool dynamically updates the template at low time complexity, better matches the features of subsequent frames, effectively reduces accumulated error, alleviates tracking-box drift, and improves the robustness of the tracking method.
Furthermore, the characteristic pool structure can keep the tracking network model stable in long-term sequence tracking, and the robustness of the tracking method is improved.
In a specific embodiment, the above method is applied to a tracking robot.
As shown in fig. 2, the tracking device of the present invention is a tracking robot, and the tracking method of the tracking robot specifically comprises the following steps:
A1: acquiring information from the RGBD camera of the robot, and obtaining an RGB image and a depth map from it;
A2: inputting the RGB image into the tracking network RobTranSim, which updates the template features with the feature pool structure on the basis of the contrastive-learning structure;
A3: obtaining the target person box;
A4: acquiring depth information from the depth map;
A5: estimating the trajectory of the target person;
A6: driving the robot to follow the motion of the target person.
It can be understood that the RGBD camera on the tracking robot of the present invention acquires the video to be tested containing the target person.
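As a purely illustrative sketch (the patent provides no code), the robot-side loop corresponding to steps A1 to A6 could look like the following; camera, tracker, and robot are hypothetical interfaces standing in for the RGBD camera, the RobTranSim tracking network, and the robot controller:

```python
def tracking_loop(camera, tracker, robot):
    # camera.read(), tracker.track() and robot.follow() are hypothetical placeholders;
    # the depth map is assumed to be a numpy array aligned to the RGB image.
    while True:
        rgb, depth = camera.read()                     # A1: RGB image and depth map
        box = tracker.track(rgb)                       # A2-A3: target person box
        if box is None:
            continue
        x1, y1, x2, y2 = [int(v) for v in box]
        region = depth[y1:y2, x1:x2]
        valid = region[region > 0]                     # ignore missing depth readings
        if valid.size == 0:
            continue
        distance = float(valid.mean())                 # A4: average depth inside the box
        robot.follow(box, distance)                    # A5-A6: update trajectory, drive
```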
As shown in fig. 3, the present invention provides a method for constructing a tracking network and training the tracking network with a collected pedestrian video data set, which specifically includes the following steps:
B1: collecting an annotated pedestrian video data set;
B2: constructing the person tracking network RobTranSim;
B3: constructing a feature pool and updating the template features;
B4: training the tracking network on the collected pedestrian video data set with the overall tracking loss function L_T;
B5: tracking the target person in the video to be tested with the trained network to obtain the tracking result.
In step B1, pedestrian videos are first captured and then annotated frame by frame with label boxes. The label boxes include a whole-body box and a head box: the whole-body box records the horizontal and vertical coordinates of the upper-left corner of the person box together with the width and height of the box, and the head box records the horizontal and vertical coordinates of the head center. During tracking, the network outputs the prediction boxes of each frame, comprising a predicted head box and a predicted whole-body box.
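For illustration only (the patent does not specify a file format), one frame's annotation could be stored as a record like the following; all field names are hypothetical:

```python
# Hypothetical per-frame annotation record for the labels described above: the
# whole-body box as upper-left corner plus width and height, and the head box as
# the coordinates of the head center.
frame_label = {
    "frame_id": 17,
    "whole_body_box": {"x": 312, "y": 96, "w": 88, "h": 215},  # upper-left corner, width, height
    "head_center": {"x": 356, "y": 110},                        # head center coordinates
}
```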
In step B2, the tracking network adopts the following structure based on contrastive learning:
a feature fusion network enhances and fuses the template features in the feature pool, the head features, and the features of subsequent frames to obtain a fused feature map;
a prediction head network then performs prediction on the fused feature map to obtain the tracking result of the subsequent frame.
As shown in fig. 4, the feature fusion network adopts a Transformer structure with two attention mechanisms, self-attention and cross-attention. Compared with fusion by convolutional layers, the Transformer can learn global information, and its non-linear structure can fuse more effective representations.
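A minimal sketch of such a fusion block is given below, assuming flattened feature tokens and standard multi-head attention; the layer sizes and single-block design are illustrative assumptions, not the patent's exact RobTranSim architecture:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, search_feat, template_feat):
        # search_feat: (B, HW, C) flattened features of the subsequent (search) frame
        # template_feat: (B, N, C) fused template and head features from the feature pool
        x = self.n1(search_feat + self.self_attn(search_feat, search_feat, search_feat)[0])
        x = self.n2(x + self.cross_attn(x, template_feat, template_feat)[0])
        return self.n3(x + self.ffn(x))   # fused feature map (still in token form)
```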
As shown in fig. 5, the prediction head network comprises three parallel branches: a classification branch, a regression branch, and a centerness branch. The classification branch classifies foreground and background in the image frames of the pedestrian video data set; the regression branch regresses the bounding boxes of the image frames in the pedestrian video data set; the centerness branch normalizes the distance from a pixel in the prediction box to the target center.
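A minimal sketch of a three-branch head of this kind follows; the 3x3 convolutions and channel widths are assumptions rather than the patent's exact layers:

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        def tower():
            return nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.cls_tower, self.reg_tower = tower(), tower()
        self.cls_out = nn.Conv2d(dim, 1, 3, padding=1)    # foreground/background score
        self.cent_out = nn.Conv2d(dim, 1, 3, padding=1)   # centerness score
        self.reg_out = nn.Conv2d(dim, 4, 3, padding=1)    # box offsets (left, top, right, bottom)

    def forward(self, fused_map):                         # fused_map: (B, C, H, W)
        c, r = self.cls_tower(fused_map), self.reg_tower(fused_map)
        return (torch.sigmoid(self.cls_out(c)),           # classification confidence
                torch.sigmoid(self.cent_out(c)),          # centerness confidence
                torch.relu(self.reg_out(r)))              # non-negative boundary distances
```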
Boxes whose intersection over union with the label box annotated in step B1 is greater than 0.5 are assigned class 1, and boxes whose intersection over union is less than or equal to 0.5 are assigned class 0.
In step B3, updating the template features with the feature pool structure based on the contrastive-learning structure comprises the following.
A feature pool F = {f_i} is maintained, where f_i is a stored frame and i is a positive integer; the stored frames are kept in a queue ordered by the index, and the smaller the index, the closer the frame is to the head of the queue. When i = 1 the stored frame is the template frame, and when i > 1 it is a subsequent frame; for a subsequent frame, the larger the product of its classification-branch prediction confidence and its centerness-branch confidence, the smaller its index i.
Based on the assumption that the relative positions of a pedestrian's head and body do not change, a relative-position constraint is added, which reduces the search space of the solution and effectively suppresses tracking-box drift. A head box and a whole-body box are added to the feature pool, and the relative position is constrained jointly by the ratio of the segment connecting the centers of the head box and the whole-body box to the diagonal of the whole-body box, and by the angle between that segment and the diagonal.
A feature extraction network extracts the frame feature x_i of the current frame, which is stored in the feature pool and used to update it. A threshold is preset: if the number of frames in the feature pool is greater than or equal to the threshold, the feature vectors of a number of frames equal to the threshold are fused to obtain the template features; if the number of frames in the pool is smaller than the threshold, the feature vectors of all frames in the pool are fused. The head box and the whole-body box each maintain a feature pool of this structure, and the head box and the whole-body box of the same frame of the target person occupy the same position in their respective pools.
In a specific embodiment, the feature extraction network is ResNet 50.
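The following sketch illustrates the feature pool described above under stated assumptions: frames are ordered by the confidence product with the template fixed at the head of the queue, and fusion uses exponentially decaying weights normalized to sum to 1. The patent gives its exact weight expression only as a formula image (discussed next), so the decay factor here is an assumption:

```python
import numpy as np

class FeaturePool:
    def __init__(self, threshold=8, capacity=64, decay=0.5):
        # threshold: number of frames fused into the template (8 in this embodiment);
        # capacity and decay are illustrative assumptions.
        self.threshold, self.capacity, self.decay = threshold, capacity, decay
        self.entries = []                            # list of (confidence product, feature)

    def add_template(self, feat):
        # the template frame always occupies the head of the queue
        self.entries.insert(0, (float("inf"), feat))

    def add_frame(self, feat, cls_conf, cent_conf):
        score = cls_conf * cent_conf                 # larger product -> smaller index
        self.entries.append((score, feat))
        self.entries.sort(key=lambda e: -e[0])
        del self.entries[self.capacity:]             # bound the queue length

    def fused_template(self):
        if not self.entries:
            return None
        k = min(self.threshold, len(self.entries))
        feats = np.stack([f for _, f in self.entries[:k]])
        w = self.decay ** np.arange(k)               # assumed power-law weights
        w = w / w.sum()                              # rescale so the weights sum to 1
        return (w[:, None] * feats).sum(axis=0)
```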
In a preferred embodiment, the features in the feature pool are fused by weighted fusion: the weight coefficient of each feature is a power term (the exact expression is given as formula image BDA0003562126120000091 of the original filing), and after the weighted sum of the features is computed the overall coefficient is rescaled so that the weight coefficients sum to 1. The fusion result is expressed by formula image BDA0003562126120000092 of the original filing, where X_k is the fused feature template obtained from the feature pool and k is the number of frames taken from the feature pool for fusion.
This approach has three advantages. First, it fuses richer information: it does not consider only the features of the template frame, and more important features receive larger weight coefficients. Second, the amount of computation is small: when the feature pool is updated, only the incremental contribution of the frame features newly entering the pool needs to be added, and the fusion result does not have to be recomputed at every iteration. Third, it is more effective for long-term sequence tracking: the longer the sequence, the more stable the template-frame features become.
The tracking network is then trained with the overall tracking loss function L_T. The overall tracking loss takes into account the attraction exerted on the prediction box by the whole-body label box of the positive sample and the repulsion exerted on the prediction box by whole-body boxes treated as background; training increases the attraction and repulsion between samples, reduces tracking-box drift, and yields a better tracking effect.
In a specific embodiment, the overall tracking loss is composed of two parts, a head constraint loss L_H and a dense loss L_C:

L_T = β L_H + (1 - β) L_C

where β is a hyperparameter; during training, the overall tracking loss is minimized.

The head constraint loss L_H is given by formula image BDA0003562126120000093 of the original filing, where γ_1 and γ_2 are hyperparameters, l is the distance between the annotated head box and the annotated whole-body box, L is the diagonal length of the annotated whole-body box, θ is the angle between l and L, and the hatted quantities are the corresponding predicted values.

The dense loss L_C is expressed as:

L_C = L_cls + λ_1 L_reg + λ_2 L_cent

where L_cls is the classification loss, L_reg is the regression loss, L_cent is the centerness loss, and λ_1 and λ_2 are weight parameters.

The classification loss and the centerness loss take the form of a cross-entropy loss (formula image BDA0003562126120000101 of the original filing), where a is cls or cent, L_a is the classification loss or the centerness loss, j indexes the j-th frame sample, y_aj is the label of the j-th frame, and p_aj is the prediction confidence of the classification branch or the centerness branch for the j-th frame.

The prediction confidence of the centerness branch is given by formula image BDA0003562126120000102 of the original filing, where l*, r*, t*, b* are the distances from the predicted center point to the left, right, top, and bottom boundaries of the label box, respectively.
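Since the patent states the centerness expression only as a formula image, the sketch below uses the FCOS-style square-root form as an assumed stand-in consistent with the boundary-distance definition above:

```python
import numpy as np

def centerness(l, r, t, b):
    # l, r, t, b: distances from the predicted center point to the left, right,
    # top and bottom boundaries of the label box (assumed positive).
    return np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```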
the regression loss expression is:
Lreg=LGIOU1Lagg2Lrep
wherein L isGIOUIs the loss of the generalized cross-over ratio, LaggIs the loss of polymerization, LrepIs the rejection loss, alpha1、α2Is a weight parameter.
The generalized intersection-to-parallel ratio loss function expression is as follows:
LGIOU=1-GIOU(gt,bj)
the generalized cross-over ratio expression is as follows:
Figure BDA0003562126120000103
wherein gt is a whole body tag frame, bjIs a whole body prediction box, C is a box capable of enclosing gt and bjThe smallest box of (c).
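The GIoU loss itself is standard; a self-contained sketch using the usual published definition (IoU minus the fraction of the smallest enclosing box not covered by the union) is given below, with boxes in (x1, y1, x2, y2) format:

```python
def giou_loss(gt, pred):
    # intersection of the label box gt and the prediction box pred
    ix1, iy1 = max(gt[0], pred[0]), max(gt[1], pred[1])
    ix2, iy2 = min(gt[2], pred[2]), min(gt[3], pred[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    union = area_g + area_p - inter
    # smallest box C enclosing both gt and pred
    cx1, cy1 = min(gt[0], pred[0]), min(gt[1], pred[1])
    cx2, cy2 = max(gt[2], pred[2]), max(gt[3], pred[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = inter / union - (area_c - union) / area_c
    return 1.0 - giou
```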
The aggregation loss is given by formula image BDA0003562126120000104 of the original filing, where gt_j is the whole-body label box of the target person in the j-th frame, p_i is a whole-body prediction box attributed to the label box of the j-th frame, |p_j+| is the number of candidate boxes of the j-th frame predicted as positive samples, and smooth_l1 is the smooth L1 function (formula image BDA0003562126120000111 of the original filing).

The repulsion loss is given by formula image BDA0003562126120000112 of the original filing, where b_i is a whole-body prediction box, g_j is the box, among those predicted as background, that has the largest intersection with the whole-body label box of the frame, |p_j+| is the number of candidate boxes of the j-th frame predicted as positive samples, and IOG is the overlap ratio of b_i and g_j; smooth_ln is the smoothed ln function (formula image BDA0003562126120000113 of the original filing), where σ ∈ [0,1) is a smoothing parameter.
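Taking the component losses as given (their exact expressions appear as formula images above), the composition of the overall tracking loss follows directly from the stated formulas; the hyperparameter values below are placeholders:

```python
def overall_tracking_loss(L_H, L_cls, L_cent, L_giou, L_agg, L_rep,
                          beta=0.5, lam1=1.0, lam2=1.0, alpha1=1.0, alpha2=1.0):
    # L_H: head constraint loss; the remaining arguments are the dense-loss components,
    # assumed to be computed elsewhere.
    L_reg = L_giou + alpha1 * L_agg + alpha2 * L_rep   # regression loss
    L_C = L_cls + lam1 * L_reg + lam2 * L_cent         # dense loss
    return beta * L_H + (1.0 - beta) * L_C             # overall tracking loss L_T
```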
In step B5, in the feature pool structure shown in fig. 6, the template frame is stored at the head of the queue in the feature pool, and its position is not adjusted afterwards.
Each subsequent frame f_i of the video sequence is input into the feature extraction network of the feature pool to obtain its feature vector.
A threshold is set: if the number of frames in the feature pool is greater than or equal to the threshold, the feature vectors of the first k frames in the pool (k equal to the threshold) are fused to obtain the template features; if the number of frames in the pool is smaller than the threshold, the feature vectors of all frames in the pool are fused. The threshold is a positive integer greater than or equal to 1; in this embodiment the threshold is 8, at which the marginal utility of the feature pool is largest. In particular, when the threshold is 1, the network degenerates into a standard contrastive-learning network.
As shown in fig. 7, determining the whole-body box of the target person in the video to be tested with the trained tracking network comprises the following steps:
inputting the features of an image frame of the video to be tested, together with the template features, into the trained tracking network;
storing the image frame at the corresponding position of the feature pool according to the product of the classification-branch prediction confidence and the centerness-branch prediction confidence;
taking the vector indices for which the square root of that product is greater than 0.5 as the candidate-box index set, and finding the candidate-box set in the regression branch according to the candidate-box index set;
and selecting, from the candidate-box set, the box with the highest classification-branch prediction confidence as the whole-body box of the target person.
Specifically, the features of frame f_i and the template features are input into the trained network at the same time, and the prediction results of the four branches are computed as shown in fig. 5.
The box with the highest classification-branch prediction confidence in the candidate-box set is taken as the tracking result of the frame.
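A minimal sketch of this candidate selection step is given below; the flat per-location arrays are an assumed layout of the branch outputs:

```python
import numpy as np

def select_target_box(cls_conf, cent_conf, reg_boxes):
    # cls_conf, cent_conf: (N,) per-location confidences; reg_boxes: (N, 4) regressed boxes
    score = np.sqrt(cls_conf * cent_conf)
    candidates = np.flatnonzero(score > 0.5)          # candidate-box index set
    if candidates.size == 0:
        return None
    best = candidates[np.argmax(cls_conf[candidates])]
    return reg_boxes[best]                            # box with the highest cls confidence
```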
As shown in fig. 8, the tracking method of the present invention further comprises performing motion tracking of the target person according to the target person box to obtain the tracking result, specifically comprising:
acquiring a depth map from the video to be tested;
calibrating the depth map against the image frames of the video to be tested;
and obtaining the depth value of each pixel in the region corresponding to the whole-body box of the target person and computing the average of these depth values as the distance of the target person.
As shown in fig. 9, corresponding to step a5, the tracking of the target person's trajectory using the apparatus of the present invention includes the following steps:
c1: obtaining a depth map:
in the present embodiment, a depth map portion is extracted from information of the RGBD camera.
C2: the depth map is calibrated with the RGB map:
in order to obtain a clear depth map, in this embodiment, first, the opencv library is used for calibration, and internal and external parameters of the camera are acquired and corrected to realize epipolar alignment. And obtaining a disparity map, calculating a depth map by a HashMatch method, and splicing the calculated depth map and the original camera depth map.
C3: and (3) intercepting a depth map in the character frame:
in the present embodiment, in the depth map, the depth of the corresponding position in the prediction frame of the RGB map of each frame is acquired.
C4: calculating the average depth:
in the present embodiment, the average distance is calculated from the in-frame depth predicted from the RGB map per frame.
C5: and updating the distance and the angle of the robot.
In the above process, the method further comprises the following steps:
C6: adjusting the camera so that the person box stays at the center of the field of view;
C7: calculating the deflection angle of the camera relative to its last position.
In a specific example, a test was carried out with the method and apparatus described above. First, videos of the tracked target person were collected in scenes with dense crowds, and each frame was annotated with a head box and a whole-body box. The tracking network model was trained with the first 70% of the images on a server built from 8 RTX 2080 Ti GPUs; the tracking effect was tested with the remaining 30% of the images. Comparing the test results with prior-art single-target tracking methods shows that the tracking precision of this embodiment is higher.
TABLE 1 Experimental results

Tracking method      ATOM    SiamRPN++    TransT    Method of the invention
Precision (%)        80.5    76.8         83.9      85.3
An embodiment of the present application further provides a control apparatus, including a processor and a storage medium for storing a computer program; wherein a processor is adapted to perform at least the method as described above when executing the computer program.
Embodiments of the present application also provide a storage medium for storing a computer program, which when executed performs at least the method described above.
Embodiments of the present application further provide a processor, where the processor executes a computer program to perform at least the method described above.
The storage medium may be implemented by any type of volatile or non-volatile storage device, or a combination thereof. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The storage media described in connection with the embodiments of the invention are intended to comprise, without being limited to, these and any other suitable types of memory.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to arrive at new method embodiments.
The features disclosed in the several product embodiments presented in this application can be combined arbitrarily, without conflict, to arrive at new product embodiments.
The features disclosed in the several method or apparatus embodiments provided herein may be combined in any combination to arrive at a new method or apparatus embodiment without conflict.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention is not to be considered limited to these descriptions. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all of them shall be considered to fall within the scope of protection of the invention.

Claims (10)

1. A visual tracking method, comprising the following steps:
S1: acquiring, in real time, a video to be tested that contains a target person;
S2: constructing a tracking network and training it with a collected pedestrian video data set, wherein the tracking network, built on a contrastive-learning structure, uses a feature pool structure to update the template features;
S3: determining the target person box of the target person in the video to be tested with the trained tracking network to obtain the tracking result.
2. The visual tracking method of claim 1, wherein the tracking network is constructed as follows:
a feature fusion network enhances and fuses the template features in the feature pool, the head features, and the features of subsequent frames to obtain a fused feature map;
a prediction head network then performs prediction on the fused feature map to obtain the tracking result of the subsequent frame.
3. The visual tracking method of claim 2, wherein the feature fusion network adopts a Transformer structure with two attention mechanisms, self-attention and cross-attention;
the prediction head network comprises three parallel branches, namely a classification branch, a regression branch, and a centerness branch;
the classification branch classifies foreground and background in the image frames of the pedestrian video data set;
the regression branch regresses the bounding boxes of the image frames in the pedestrian video data set;
the centerness branch normalizes the distance from a pixel in the prediction box to the target center.
4. The visual tracking method of claim 3, wherein updating the template features with the feature pool structure based on the contrastive-learning structure comprises:
a feature pool F = {f_i}, where f_i is a stored frame and i is a positive integer; the stored frames are kept in a queue ordered by the index, and the smaller the index, the closer the frame is to the head of the queue; when i = 1 the stored frame is the template frame, and when i > 1 it is a subsequent frame; for a subsequent frame, the larger the product of its classification-branch prediction confidence and its centerness-branch confidence, the smaller its index i;
presetting a threshold, and if the number of frames in the feature pool is greater than or equal to the threshold, fusing the feature vectors of a number of frames equal to the threshold to obtain the template features; if the number of frames in the feature pool is smaller than the threshold, fusing the feature vectors of all frames in the pool to obtain the template features.
5. The visual tracking method of claim 4, wherein a head box and a whole-body box are added to the feature pool, and the relative position is constrained jointly by the ratio of the segment connecting the centers of the head box and the whole-body box to the diagonal of the whole-body box, and by the angle between that segment and the diagonal;
the head box and the whole-body box each maintain a feature pool of said structure, and the head box and the whole-body box of the same frame of the target person occupy the same position in their respective pools.
6. The visual tracking method of claim 5, wherein the features in the feature pool are fused by weighted fusion: the weight coefficient of each feature is a power term (the exact expression is given as formula image FDA0003562126110000021 of the original filing), and after the weighted sum of the features is computed the overall coefficient is rescaled so that the weight coefficients sum to 1; the fusion result is expressed by formula image FDA0003562126110000022 of the original filing, where X_k is the fused feature template obtained from the feature pool and k is the number of frames taken from the feature pool for fusion.
7. The visual tracking method of claim 6, wherein the collected pedestrian video data set is used to train the tracking network with an overall tracking loss function L_T;
the overall tracking loss is composed of two parts, a head constraint loss L_H and a dense loss L_C:

L_T = β L_H + (1 - β) L_C

where β is a hyperparameter;
the head constraint loss L_H is given by formula image FDA0003562126110000023 of the original filing, where γ_1 and γ_2 are hyperparameters, l is the distance between the annotated head box and the annotated whole-body box, L is the diagonal length of the annotated whole-body box, θ is the angle between l and L, and the hatted quantities are the corresponding predicted values;
the dense loss L_C is expressed as:

L_C = L_cls + λ_1 L_reg + λ_2 L_cent

where L_cls is the classification loss, L_reg is the regression loss, L_cent is the centerness loss, and λ_1 and λ_2 are weight parameters;
the classification loss and the centerness loss take the form of a cross-entropy loss (formula image FDA0003562126110000025 of the original filing), where a is cls or cent, L_a is the classification loss or the centerness loss, j indexes the j-th frame sample, y_aj is the label of the j-th frame, and p_aj is the prediction confidence of the classification branch or the centerness branch for the j-th frame;
the prediction confidence of the centerness branch is given by formula image FDA0003562126110000031 of the original filing, where l*, r*, t*, b* are the distances from the predicted center point to the left, right, top, and bottom boundaries of the whole-body label box, respectively;
the regression loss is expressed as:

L_reg = L_GIOU + α_1 L_agg + α_2 L_rep

where L_GIOU is the generalized intersection-over-union (GIoU) loss, L_agg is the aggregation loss, L_rep is the repulsion loss, and α_1 and α_2 are weight parameters;
the GIoU loss is:

L_GIOU = 1 - GIOU(gt, b_j)

where the generalized intersection over union GIOU is given by formula image FDA0003562126110000032 of the original filing, gt is the whole-body label box, b_j is the whole-body prediction box, and C is the smallest box enclosing both gt and b_j;
the aggregation loss is given by formula image FDA0003562126110000033 of the original filing, where gt_j is the whole-body label box of the target person in the j-th frame, p_i is a whole-body prediction box attributed to the label box of the j-th frame, |p_j+| is the number of candidate boxes of the j-th frame predicted as positive samples, and smooth_l1 is the smooth L1 function (formula image FDA0003562126110000034 of the original filing);
the repulsion loss is given by formula image FDA0003562126110000035 of the original filing, where b_i is a whole-body prediction box, g_j is the box, among those predicted as background, that has the largest intersection with the whole-body label box of the frame, |p_j+| is the number of candidate boxes of the j-th frame predicted as positive samples, and IOG is the overlap ratio of b_i and g_j; smooth_ln is the smoothed ln function (formula image FDA0003562126110000041 of the original filing), where σ ∈ [0,1) is a smoothing parameter.
8. The visual tracking method of claim 7, wherein determining the target person box of the target person in the video to be tested with the trained tracking network comprises:
inputting the features of an image frame of the video to be tested, together with the template features, into the trained tracking network;
storing the image frame at the corresponding position of the feature pool according to the product of the classification-branch prediction confidence and the centerness-branch prediction confidence;
taking the vector indices for which the square root of that product is greater than 0.5 as the candidate-box index set, and finding the candidate-box set in the regression branch according to the candidate-box index set;
and selecting, from the candidate-box set, the box with the highest classification-branch prediction confidence as the target person box.
9. The visual tracking method of claim 8, further comprising: performing motion tracking of the target person according to the target person box to obtain the tracking result, specifically comprising:
acquiring a depth map from the video to be tested;
calibrating the depth map against the image frames of the video to be tested;
and obtaining the depth value of each pixel in the region corresponding to the target person box and computing the average of these depth values as the distance of the target person.
10. A tracking device, characterized in that a visual tracking method according to any one of claims 1-9 is used.
CN202210297392.7A 2022-03-24 2022-03-24 Visual tracking method and tracking device Pending CN114638862A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210297392.7A CN114638862A (en) 2022-03-24 2022-03-24 Visual tracking method and tracking device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210297392.7A CN114638862A (en) 2022-03-24 2022-03-24 Visual tracking method and tracking device

Publications (1)

Publication Number Publication Date
CN114638862A true CN114638862A (en) 2022-06-17

Family

ID=81949472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210297392.7A Pending CN114638862A (en) 2022-03-24 2022-03-24 Visual tracking method and tracking device

Country Status (1)

Country Link
CN (1) CN114638862A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115393384A (en) * 2022-09-14 2022-11-25 清华大学 Cross-camera-based multi-target tracking model training method and device
CN116402858A (en) * 2023-04-11 2023-07-07 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method
CN116402858B (en) * 2023-04-11 2023-11-21 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination