CN112365523A - Target tracking method and device based on anchor-free twin network key point detection - Google Patents

Target tracking method and device based on anchor-free twin network key point detection

Info

Publication number
CN112365523A
CN112365523A (application CN202011225222.5A)
Authority
CN
China
Prior art keywords
estimation
corner
target
heat map
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011225222.5A
Other languages
Chinese (zh)
Inventor
钱诚
徐则中
游庆祥
赵宇航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Institute of Technology
Original Assignee
Changzhou Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Institute of Technology filed Critical Changzhou Institute of Technology
Priority to CN202011225222.5A
Publication of CN112365523A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248: Analysis of motion using feature-based methods, involving reference images or patches
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target tracking method and device based on anchor-free twin (Siamese) network key point detection, comprising the following steps: acquiring a target tracking video; feeding the target template image and the target search area image into a pre-trained center point estimation module to generate a center point position estimation heat map, an upper-left-corner-to-center-point offset estimation heat map and a center point position error estimation heat map; feeding the target search area image into a pre-trained corner estimation module to generate a corner position estimation heat map and a corner position error estimation heat map; estimating the corner coordinates from the upper-left-corner-to-center-point offset estimation heat map; and determining the target frame in the current frame image from the center point and corner coordinates, thereby completing tracking of the target. Target tracking is treated as the problem of determining the center point and the upper left corner point and is decomposed into corner position estimation and center point position estimation. This avoids the use of preset anchor points, reduces the number of output heat maps and therefore the number of network parameters, and speeds up the tracking algorithm.

Description

Target tracking method and device based on anchor-free twin network key point detection
Technical Field
The invention relates to the technical field of image processing, in particular to a target tracking method and device based on anchor-free twin network corner point generation.
Background
Generally, target tracking means determining, through continuous inference of a tracking method, the region where a target is located in subsequent video frames, given the target object to be tracked in the first frame of a video. Currently, some tracking methods use an anchor-free twin network to compute the similarity between candidate image regions and the target template image, and then determine the target in the subsequent frame from the maximum similarity.
Two papers published at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2020, "Siamese Box Adaptive Network for Visual Tracking" and "SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking", both propose technical schemes that build an object tracker on an anchor-free twin network and thereby determine the object in subsequent frames. Specifically:
the deep neural Network proposed in the "sieme Box Adaptive Network for Visual Tracking" paper consists of four components, including: the system comprises a main network module, a cross-correlation module, a classification module and a regression module, wherein the main network module is composed of 2 convolutional neural networks with the same parameters and structures, 1 network branch is used for extracting the characteristics of a target template image, and the other 1 network branch is used for extracting the characteristics of a target search area image area. In practical application, the output characteristics of the 3 rd convolution block conv-3, the 4 th convolution block conv-4 and the 5 th convolution block conv-5 of the residual error network ResNet-50 are taken as the extracted characteristics respectively. And the cross-correlation module performs cross-correlation convolution operation on the characteristics of the target template image and the characteristics of the target search area image area to obtain a cross-correlation diagram, and inputs the cross-correlation diagram into the classification module and the regression module respectively. The classification module and the regression module are both composed of 2 convolutional layers, the 2-channel heat map output by the classification module respectively represents the scores of the foreground and the background at each spatial position, and the point with the maximum score on the foreground map is used as the target center. The 4-channel heat map output by the regression module represents the distance from each spatial position to 4 edges of the target frame. And finally, carrying out linear addition on the classification heat map and the edge distance regression heat map respectively obtained by the output characteristics of the 3 convolution blocks to obtain a final 2-channel classification heat map and a final 4-channel edge distance regression heat map.
The deep neural network proposed in the "SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking" paper consists of five modules: a backbone network module, a cross-correlation module, a classification module, a regression module and a center-ness module. The backbone module is likewise composed of 2 convolutional neural networks with identical parameters and structure, and the output features of the 3rd convolution block conv-3, the 4th convolution block conv-4 and the 5th convolution block conv-5 of the residual network ResNet-50 are used as the extracted features. The target template features and the target search area image features output by each convolution block are then fed into the cross-correlation module for convolution. The outputs of the 3 convolution blocks are concatenated along the channel dimension and reduced to 256 channels by a 1 × 1 convolution. Finally, the resulting features are fed into the classification module, the regression module and the center-ness module: the classification module outputs 2-channel foreground/background heat maps, the regression module outputs the distance from each spatial position to the 4 edges of the target frame, and the center-ness module outputs the likelihood that each position is the target center.
In the above methods, each spatial position must be classified as foreground or background and the distance from each spatial position to the target frame must be estimated in order to determine the final target frame. This causes two problems: 1) both a foreground/background classification heat map and an edge-distance estimation heat map have to be output, which increases the number of network parameters; 2) the estimation of the target frame is easily affected by target deformation and produces larger errors, and pixel-level classification tends to produce more false positives, which makes the target frame result unreliable.
Disclosure of Invention
In view of these problems, the invention provides a target tracking method and device based on anchor-free twin network key point detection, which effectively address the technical problems of low accuracy and large computational cost of existing target tracking methods.
The technical scheme provided by the invention is as follows:
a target tracking method based on anchor-free twin network key point detection comprises the following steps:
acquiring a target tracking video, designating a target tracking frame of a first preset size in the first frame image of the target tracking video as the target template image, and, in the current frame image in which the target is to be tracked, selecting a target search area image of a second preset size framed with the target frame center point coordinate of the previous frame image as reference, wherein the second preset size is larger than the first preset size;
sending the target template image and the target search area image into a pre-trained central point estimation module to generate a central point position estimation heat map, an upper left corner to central point offset estimation heat map and a central point position error estimation heat map;
sending the target search area image to a pre-trained corner estimation module to generate a corner position estimation heat map and a corner position error estimation heat map;
obtaining a center point coordinate after position compensation in the current frame image according to the center point position estimation heat map and the center point position error estimation heat map generated by the center point estimation module, and estimating the corner point coordinate according to the offset estimation heat map from the upper left corner to the center point;
respectively calculating the correlation between the estimated corner coordinates and the preliminary estimation values of the positions of the upper left corners in the corner position estimation heat map generated by the corner estimation module, and selecting the preliminary estimation value of the position of the upper left corner with the maximum correlation as the position estimation value of the upper left corner;
generating position-compensated corner coordinates in the current frame image according to the corner position error estimation heat map generated by the corner estimation module and the selected upper left corner position estimation value;
and obtaining a target frame of the tracking target in the current frame image according to the center point coordinates and the corner point coordinates after the position compensation, and completing the tracking of the target.
The invention also provides a target tracking device based on the detection of the key points of the anchor-free twin network, which comprises the following components:
a tracking image obtaining module, configured to obtain a target tracking video, designate a target tracking frame of a first preset size in the first frame image of the target tracking video as the target template image, and, in the current frame image in which the target is to be tracked, select a target search area image of a second preset size framed with the target frame center point coordinate of the previous frame image as reference, wherein the second preset size is larger than the first preset size;
the central point estimation module is used for generating a central point position estimation heat map, an upper left corner to central point offset estimation heat map and a central point position error estimation heat map according to the input target template image and the target search area image;
the corner estimation module is used for generating a corner position estimation heat map and a corner position error estimation heat map according to the input target search area image;
the calculation module is used for obtaining the coordinates of the center point after position compensation in the current frame image according to the center point position estimation heat map and the center point position error estimation heat map generated by the center point estimation module, and estimating the coordinates of the corner point according to the offset estimation heat map from the upper left corner to the center point; respectively calculating the correlation between the estimated corner coordinates and the preliminary estimation values of the positions of the upper left corners in the corner position estimation heat map generated by the corner estimation module, and selecting the preliminary estimation value of the position of the upper left corner with the maximum correlation as the position estimation value of the upper left corner; generating position-compensated corner coordinates in the current frame image according to the corner position error estimation heat map generated by the corner estimation module and the selected upper left corner position estimation value;
and the target frame forming module is used for obtaining a target frame of the tracking target in the current frame image according to the center point coordinates and the corner point coordinates after the position compensation, and completing the tracking of the target.
Compared with the prior art, the target tracking method and the target tracking device based on the detection of the key point of the anchor-free twin network have the following advantages and characteristics:
(1) Target tracking is treated as the problem of determining the center point and the upper left corner point and is decomposed into corner position estimation and center point position estimation. This avoids the use of preset anchor points and reduces the number of output heat maps, i.e., the number of network parameters, thereby speeding up the tracking algorithm.
(2) In determining the upper left corner point, the offset from the corner point to the center point is introduced to constrain the position prediction of the upper left corner point. This effectively filters out ambiguous candidate corner points, compensates to a certain extent for the precision of the corner position estimation module, and improves the accuracy of the overall target tracking.
Drawings
The foregoing features, technical features, advantages and embodiments are further described in the following detailed description of the preferred embodiments, which is to be read in connection with the accompanying drawings.
FIG. 1 is a schematic flow chart of a target tracking method based on anchorless twin network key point detection according to the present invention;
FIG. 2 is a diagram of a network model architecture according to the present invention;
FIG. 3 is a diagram of a target tracking device based on detection of key points in an anchorless twin network according to the present invention;
fig. 4 is a schematic structural diagram of a terminal device in the present invention.
Reference numerals:
11 - target template image, 12 - target search area image, 13 - first residual neural network, 14 - second residual neural network, 15 - feature fusion network, 16 - center point position estimation network, 17 - center point position estimation heat map, 18 - upper-left-corner-to-center-point offset estimation network, 19 - upper-left-corner-to-center-point offset estimation heat map, 20 - center point position error estimation network, 21 - center point position error estimation heat map, 22 - hourglass neural network, 23 - pooling network, 24 - corner position estimation network, 25 - corner position estimation heat map, 26 - corner position error estimation network, 27 - corner position error estimation heat map, 100 - target tracking device, 110 - tracking image obtaining module, 120 - center point estimation module, 130 - corner point estimation module, 140 - calculation module, 150 - target frame forming module.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
As shown in fig. 1, which is a schematic flow chart of a target tracking method based on anchorless twin network key point detection provided by the present invention, it can be seen from the figure that the target tracking method includes:
acquiring a target tracking video, designating a target tracking frame of a first preset size in the first frame image of the target tracking video as the target template image, and, in the current frame image in which the target is to be tracked, selecting a target search area image of a second preset size framed with the target frame center point coordinate of the previous frame image as reference, wherein the second preset size is larger than the first preset size;
sending the target template image and the target search area image into a pre-trained central point estimation module to generate a central point position estimation heat map, an upper left corner to central point offset estimation heat map and a central point position error estimation heat map;
sending the target search area image into a pre-trained corner estimation module to generate a corner position estimation heat map and a corner position error estimation heat map;
obtaining a center point coordinate after position compensation in the current frame image according to the center point position estimation heat map and the center point position error estimation heat map generated by the center point estimation module, and estimating the corner point coordinate according to the upper left corner to center point offset estimation heat map;
respectively calculating the correlation between the estimated corner coordinates and the preliminary estimation values of the positions of the upper left corners in the corner position estimation heat map generated by the corner estimation module, and selecting the preliminary estimation value of the position of the upper left corner with the maximum correlation as the position estimation value of the upper left corner;
generating position-compensated corner coordinates in the current frame image according to the corner position error estimation heat map generated by the corner estimation module and the selected upper left corner position estimation value;
and obtaining a target frame of the tracking target in the current frame image according to the center point coordinates and the corner point coordinates after the position compensation, and completing the tracking of the target.
In order to obtain network parameters suitable for target tracking, training data are needed to adjust the network parameters so that the neural network meets the requirements of the target tracking task; the training data therefore need to be prepared in advance. Specifically, the prepared training data are organized in training set groups, each containing two pictures: one is a target template image containing the tracked target framed in a first frame image, and the other is a target search area image of a second preset size framed around the center point coordinate of the target frame in that first frame image. The created network model is then used to find the target frame of the target to be tracked in the target search area image based on the target template image. To improve tracking efficiency, the relationship between the first preset size and the second preset size may be set according to the actual situation; in principle, the second preset size is larger than the first preset size.
In one example, the training data are selected from the manually labeled object detection datasets VID and YouTube-BoundingBoxes. Two images whose frame difference is no more than 20 frames are randomly selected from each video segment. A rectangular box (of width w and height h) centered on the target to be tracked in the earlier frame is taken as the target template image and scaled to 127 × 127 as the raw input of the target template branch. A target search area image of width 2w and height 2h is cropped from the later frame around the center of the rectangular box of the earlier frame and scaled to 255 × 255. Each pair of target template image and target search area image constitutes 1 training sample (corresponding to the training set group described above).
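For concreteness, the following sketch shows how such a training pair could be cropped and scaled, assuming OpenCV-style images and a ground-truth box given as (cx, cy, w, h); the function names and the mean-color padding at image borders are illustrative choices, not prescribed by the patent.

```python
import cv2
import numpy as np

def crop_and_resize(frame, cx, cy, crop_w, crop_h, out_size):
    """Crop a (crop_w x crop_h) window centred on (cx, cy) and resize it."""
    x1 = int(round(cx - crop_w / 2))
    y1 = int(round(cy - crop_h / 2))
    # Pad with the mean colour when the window leaves the image (assumption).
    pad = max(0, -x1, -y1,
              x1 + int(crop_w) - frame.shape[1],
              y1 + int(crop_h) - frame.shape[0])
    if pad > 0:
        frame = cv2.copyMakeBorder(frame, pad, pad, pad, pad,
                                   cv2.BORDER_CONSTANT,
                                   value=frame.mean(axis=(0, 1)).tolist())
        x1, y1 = x1 + pad, y1 + pad
    patch = frame[y1:y1 + int(crop_h), x1:x1 + int(crop_w)]
    return cv2.resize(patch, (out_size, out_size))

def make_training_pair(frame_t, frame_t_plus, box):
    """box = (cx, cy, w, h) of the target in frame_t (<= 20 frames earlier)."""
    cx, cy, w, h = box
    template = crop_and_resize(frame_t, cx, cy, w, h, 127)            # 127 x 127 template
    search = crop_and_resize(frame_t_plus, cx, cy, 2 * w, 2 * h, 255)  # 255 x 255 search region
    return template, search
```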
The target frame in a video frame is determined by a center point and a corner point (the upper left corner point). The network model therefore mainly comprises a corner estimation module based on an hourglass convolutional network and a center point estimation module based on a twin network, and the target frame is obtained by determining the positions of the center point and the upper left corner point of the target frame.
As shown in fig. 2, the constructed center point estimation module is composed of a first residual neural network 13, a second residual neural network 14, a feature fusion network 15, a center point position estimation network 16, an upper-left-corner-to-center-point offset estimation network 18, and a center point position error estimation network 20. The outputs of the first residual neural network 13 and the second residual neural network 14 serve as inputs to the feature fusion network 15, and the output of the feature fusion network 15 serves as input to the center point position estimation network 16, the upper-left-corner-to-center-point offset estimation network 18, and the center point position error estimation network 20.
Specifically, the first residual neural network 13 and the second residual neural network 14, the backbone networks used to extract depth features, form a twin network and are both ResNet-50 residual networks. To alleviate the loss of feature map resolution as the network depth increases, the downsampling operations are removed from the last 2 convolution blocks of ResNet-50 (the 4th and 5th convolution blocks) and dilated convolutions are used to enlarge the receptive field; the dilation rate can be adjusted according to application requirements, for example the dilation rate in the 4th convolution block is set to 2 and that in the 5th convolution block to 4. The structures and parameters of the two convolutional neural networks are kept identical, and they extract the depth features of the target template image 11 and the target search area image 12 respectively. Considering that the features extracted by different layers of a convolutional neural network differ considerably, the features output by the 3rd, 4th and 5th convolution blocks are fused after feature extraction. The center point estimation module determines the center point position of the target frame from the depth features output by these 3 convolution blocks: in each branch the output of each convolution block is convolved with a 1 × 1 kernel to reduce the number of channels to 256, and then passed through 1 layer of 3 × 3 convolution to obtain a transformed feature map.
For the outputs of the 3rd, 4th and 5th convolution blocks, the feature map of the target template image is treated as a convolution kernel and convolved with the feature map of the target search area image, yielding a cross-correlation map that serves as the input for center point position estimation, upper-left-corner-to-center-point offset estimation and center point position error estimation. In this process, 3 cross-correlation maps are computed from the 3rd, 4th and 5th convolution blocks, and their average over corresponding channels is taken as the final cross-correlation map. Three output branches are then arranged, used for center point position estimation, center point position error estimation and upper-left-corner-to-center-point offset estimation respectively.
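A minimal sketch of this cross-correlation step is given below, assuming PyTorch tensors and a depth-wise (per-channel, grouped-convolution) correlation, which is one common way to slide the template feature map over the search-region feature map; the patent text does not fix the exact correlation variant.

```python
import torch
import torch.nn.functional as F

def xcorr_depthwise(search_feat, template_feat):
    """Treat the template feature map as a convolution kernel and slide it
    over the search-region feature map (one group per channel).

    search_feat:   (B, C, Hs, Ws)  features of the search area image
    template_feat: (B, C, Ht, Wt)  features of the template image
    returns:       (B, C, Hs-Ht+1, Ws-Wt+1) cross-correlation map
    """
    b, c, hs, ws = search_feat.shape
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    out = F.conv2d(search_feat.reshape(1, b * c, hs, ws), kernel, groups=b * c)
    return out.reshape(b, c, out.shape[2], out.shape[3])

def fuse_blocks(search_feats, template_feats):
    """Average the correlation maps computed from conv-3, conv-4 and conv-5."""
    maps = [xcorr_depthwise(s, t) for s, t in zip(search_feats, template_feats)]
    return torch.stack(maps, dim=0).mean(dim=0)
```

Here fuse_blocks averages the three correlation maps over corresponding channels, matching the channel-wise averaging described above.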
The 1st output branch is the center point position estimation branch (corresponding to the center point position estimation network 16). The cross-correlation map passes through 3 convolutional layers, each using a 3 × 3 kernel with padding 1 and 256 output channels; it is then fed into 1 convolutional layer with a 1 × 1 kernel that reduces the number of output channels to 1, finally yielding 1 center point position estimation heat map 17 representing the center point position (the preliminary center point position estimate).
The 2nd output branch is the upper-left-corner-to-center-point offset estimation branch (corresponding to the upper-left-corner-to-center-point offset estimation network 18). The cross-correlation map passes through 3 convolutional layers, each using a 3 × 3 kernel with padding 1 and 256 output channels; it is then fed into 1 convolutional layer with a 1 × 1 kernel that reduces the number of output channels to 2, finally yielding 2 estimation heat maps 19 representing the horizontal and vertical offsets from the upper left corner point to the center point.
The 3rd output branch is the center point position error estimation branch (corresponding to the center point position error estimation network 20). The cross-correlation map passes through 3 convolutional layers, each using a 3 × 3 kernel with padding 1 and 256 output channels; it is then fed into 1 convolutional layer with a 1 × 1 kernel that reduces the number of output channels to 2, finally yielding 2 center point position error estimation heat maps 21 representing the horizontal and vertical center point position errors.
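The three branches share the same head pattern, which could be sketched as follows; the ReLU activations between the 3 × 3 convolutions are an assumption, since the patent only specifies the kernel sizes, padding and channel counts.

```python
import torch.nn as nn

def make_head(out_channels):
    """Head pattern shared by the three output branches: three 3x3 conv layers
    (256 channels, padding 1) followed by a 1x1 conv that maps to the number
    of heat-map channels (1 for the centre score, 2 for offsets/errors)."""
    layers = []
    for _ in range(3):
        layers += [nn.Conv2d(256, 256, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]          # activation is an assumption
    layers.append(nn.Conv2d(256, out_channels, kernel_size=1))
    return nn.Sequential(*layers)

center_head = make_head(1)   # centre point position estimation heat map
offset_head = make_head(2)   # top-left-corner-to-centre offset heat maps (x, y)
error_head  = make_head(2)   # centre point position error heat maps (x, y)
```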
As shown in fig. 2, the corner estimation module includes: an hourglass-shaped neural network 22 for extracting features of an input target search area image, a pooling network 23 for obtaining pooling feature maps in vertical and horizontal directions according to the features output by the hourglass-shaped neural network, a corner position estimation network 24 for estimating corner position estimation heat maps according to the pooling feature maps output by the pooling network, and a corner position error estimation network 26 for estimating corner position error estimation heat maps according to the pooling feature maps output by the pooling network.
Specifically, a stacked 52-layer hourglass (Hourglass) neural network is used as the backbone, followed by a corner pooling layer (pooling network 23) to predict the spatial locations of the corners. The hourglass neural network serves as the backbone of this module for extracting image depth features. The resulting feature map is fed into two pooling components, each consisting of a standard convolution component and a maximum pooling component. The standard convolution component comprises a convolutional layer with a 3 × 3 kernel, 1 batch normalization layer and a ReLU (rectified linear unit) layer, with 128 output channels; the maximum pooling component takes the maximum value along the horizontal and vertical direction respectively as the value of each position in the feature map. Finally, the feature maps max-pooled along the horizontal and vertical directions are added to obtain the final pooled feature map.
To estimate the corner location, 1 corner position estimation heat map 25 is produced from the pooled feature map by 2 standard convolution components with 256 output channels followed by a convolutional layer with a 1 × 1 kernel that reduces the output to 1 channel. The value at each position of the corner position estimation heat map 25 represents the confidence that the point is the upper left corner point. To estimate the corner position error, 2 corner position error estimation heat maps 27 are likewise produced from the pooled feature map by 2 standard convolution components and a convolution with a 1 × 1 kernel; the value at each position of these heat maps represents the horizontal and vertical position error estimate for the point determined to be the upper left corner.
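The corner pooling component could be sketched as below. The directional details (maximum taken over positions to the right and below each location, as in CornerNet-style top-left corner pooling) are an assumption; the patent only states that maxima are taken along the horizontal and vertical directions and that the two pooled maps are added.

```python
import torch
import torch.nn as nn

class TopLeftCornerPool(nn.Module):
    """Corner pooling sketch: a 3x3 conv + BN + ReLU block (128 channels)
    feeds a directional max pool; the horizontally and vertically pooled
    maps are then added."""

    def __init__(self, in_channels=256, mid_channels=128):
        super().__init__()
        def conv_block():
            return nn.Sequential(
                nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(mid_channels),
                nn.ReLU(inplace=True),
            )
        self.conv_h = conv_block()   # branch pooled along the horizontal direction
        self.conv_v = conv_block()   # branch pooled along the vertical direction

    def forward(self, x):
        h = self.conv_h(x)
        v = self.conv_v(x)
        # Running max from right to left (horizontal) and bottom to top (vertical)
        # -- the direction is an assumption, following CornerNet top-left pooling.
        h_pool = h.flip(-1).cummax(dim=-1).values.flip(-1)
        v_pool = v.flip(-2).cummax(dim=-2).values.flip(-2)
        return h_pool + v_pool
```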
After the network models of the center point estimation module and the corner estimation module are built, the corner position label setting method, the center point position label setting method, the corner position error estimation method, the center point position error estimation method and the upper-left-corner-to-center-point offset estimation method are further configured, as follows.
In the label setting of the upper left corner position, a soft label y_{i,j} is set at each position of the corner position estimation heat map according to formula (1):

y_{i,j} = exp(-(i² + j²)/(2σ²)) if √(i² + j²) ≤ 3σ, and y_{i,j} = 0 otherwise (1)

where (i, j) denotes the offset of the current point coordinate from the upper left corner point coordinate of the real target frame, and σ denotes a preset distance threshold. As can be seen from this formula, the closer a point on the heat map is to the upper left corner of the real target frame, the higher the confidence that it is the upper left corner of the real target frame. When a point on the heat map is more than 3σ away from the upper left corner point of the real target frame, its confidence is set to 0.
To compensate for the loss of positioning accuracy caused by the downsampling operations of the convolutional neural network, a corner position error estimation network is provided to estimate, on the corner position error estimation heat map, the error o between the corner position estimate and the corresponding position in the current frame image, as in formula (2):

o = (x/s - ⌊x/s⌋, y/s - ⌊y/s⌋) (2)

where (x, y) denotes the coordinates of the corner point on the current frame image, (⌊x/s⌋, ⌊y/s⌋) denotes the coordinates of that point mapped onto the corner position heat map, and s denotes the ratio of the current frame image resolution to the corner position heat map resolution.
The labels of the center point positions are set in a similar way to the corner labels: on the center point position estimation heat map, a soft label is set according to formula (1), where (i, j) now denotes the offset of the current point coordinate from the center point coordinate of the real target frame. The center point position error is likewise estimated in the same way as the corner position error, using formula (2).
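Under the reconstructions of formulas (1) and (2) above (and formula (3) below), the training targets could be generated as in the following sketch; the Gaussian soft-label form and the floor-based sub-pixel error are assumptions consistent with the surrounding description rather than a verbatim reproduction of the patent's equations.

```python
import numpy as np

def soft_label_map(height, width, key_x, key_y, sigma):
    """Soft label heat map for a key point (corner or centre): a Gaussian that
    decays with the distance to the ground-truth key point, cut off beyond
    3 * sigma (reconstruction of formula (1))."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - key_x) ** 2 + (ys - key_y) ** 2
    label = np.exp(-d2 / (2.0 * sigma ** 2))
    label[np.sqrt(d2) > 3.0 * sigma] = 0.0
    return label

def position_error_target(x, y, stride):
    """Sub-pixel error lost by mapping an image coordinate onto the heat map
    (reconstruction of formula (2)); stride s is the resolution ratio."""
    return (x / stride - np.floor(x / stride),
            y / stride - np.floor(y / stride))

def center_offset_target(ct_x, ct_y, tl_x, tl_y):
    """Top-left-corner-to-centre offset target, formula (3)."""
    return (np.log(ct_x - tl_x), np.log(ct_y - tl_y))
```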
Because a distance constraint exists between the upper left corner point and the center point, the invention uses the offset from the upper left corner point to the center point to represent this relationship. The offset cs from the upper left corner point to the center point is defined as in formula (3):

cs = (log(ct_x - tl_x), log(ct_y - tl_y)) (3)

where (ct_x, ct_y) denotes the position coordinates of the center point and (tl_x, tl_y) denotes the position coordinates of the upper left corner point, thereby forming a distance constraint between the center point and the upper left corner point. In addition, during training of the network model the current frame image refers specifically to the second frame image; during target tracking it refers to the current frame in which the target is to be tracked, which can be any frame of the video other than the first frame. Since the input target template image does not change while the video is tracked, whereas the input target search area image does change, the frame image that currently provides the target search area image is called the current frame image.
To adjust the network parameters to the target tracking task, a corresponding loss function L is set, as in formula (4):

L = L_ct + λ1·L_tl + λ2·O_ct + λ3·O_tl + λ4·L_s (4)

where L_ct denotes the center point position estimation loss, as in formula (5):

L_ct = -(1/(H·W)) Σ_{i=1..H} Σ_{j=1..W} [ y_ij·log(p_ij) + (1 - y_ij)·log(1 - p_ij) ] (5)

where H and W denote the height and width of the center point position estimation heat map, p_ij denotes the confidence value predicted by the neural network at position (i, j) of the center point position estimation heat map, and y_ij is the corresponding soft label value; L_tl denotes the corner position estimation loss and is expressed in the same form as formula (5).

O_ct denotes the center point position error estimation loss, as in formula (6):

O_ct = SmoothL1(o_ce, ô_ce) (6)

where SmoothL1(·) denotes the smooth L1 loss function, and o_ce and ô_ce denote the true center point position error and the position error estimated by the neural network respectively; O_tl denotes the corner position error estimation loss and is expressed in the same form as formula (6).

L_s denotes the upper-left-corner-to-center-point offset estimation loss; the smooth L1 loss function is used to quantify this distance, as in formula (7):

L_s = SmoothL1(cs, ĉs) (7)

where cs and ĉs denote the true value of the offset from the upper left corner point to the center point and the estimate given by the neural network respectively, and λ1, λ2, λ3 and λ4 are positive regularization parameters.
Based on the above, during training the target template image and the target search area image in each training set group are used as the input of the center point estimation module, and the target search area image is used as the input of the corner estimation module. The constructed network model is trained by back-propagating the preset loss function of formula (4), and the network parameters are adjusted until the loss function converges, completing the training of the network model.
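A single training step under these assumptions might look like the sketch below. The binary cross-entropy on the soft-label heat maps, the smooth-L1 terms and the placement of the regularization weights λ1-λ4 follow the reconstructions of formulas (4)-(7) and are illustrative only; the model interface (a dict of named heat maps) is likewise hypothetical.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """One optimisation step over one training set group (template + search)."""
    template, search, targets = batch          # targets: dict of label maps / values
    out = model(template, search)              # dict of predicted heat maps (sigmoid-activated scores assumed)

    l_center = F.binary_cross_entropy(out["center_heatmap"], targets["center_label"])
    l_corner = F.binary_cross_entropy(out["corner_heatmap"], targets["corner_label"])
    l_center_err = F.smooth_l1_loss(out["center_error"], targets["center_error"])
    l_corner_err = F.smooth_l1_loss(out["corner_error"], targets["corner_error"])
    l_offset = F.smooth_l1_loss(out["center_offset"], targets["center_offset"])

    lam1, lam2, lam3, lam4 = lambdas
    loss = (l_center + lam1 * l_corner + lam2 * l_center_err
            + lam3 * l_corner_err + lam4 * l_offset)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```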
After the network training is completed, taking as an example a target search area image whose height and width are 2 times the height and width of the target template image, the target tracking process is as shown in fig. 1:
s1, at the starting stage of target tracking, a target tracking frame (including a tracking target) is designated in a first frame of video, and an image in the tracking frame is used as a target template image.
S2, in the subsequent tracking process, an image area whose height and width are 2 times the height and width of the target frame of the previous frame image is cropped from the current frame image, centered on the center of that target frame, and used as the target search area image of the current frame.
S3, based on the trained network model, the target template image obtained in step S1 and the target search area image obtained in step S2 are fed into the target template branch (the branch containing the first residual neural network) and the target search branch (the branch containing the second residual neural network) of the twin network respectively.
S4, from the 1 center point position estimation heat map output by the center point position estimation network, the position of the maximum value on the heat map is taken as the preliminary estimate of the center point position and mapped to the current frame image according to the resolution ratio s to obtain the center point position estimate. Assuming that the coordinate of the maximum value on the heat map is (i, j), the position estimate mapped to the current frame image is (s·i, s·j).

S5, the error value at the center point position is read from the center point position error estimation heat map output by the center point position error estimation network and added to the position estimate to obtain the precise center point position, which determines the center point coordinates of the target frame in the current frame image. If the position estimate in the current frame image is (x_c, y_c), the position-compensated center point coordinates are (x_c + Δx, y_c + Δy), where (Δx, Δy) is the center point position error estimated by the neural network.
S6, the upper-left-corner-to-center-point offset estimation network gives the horizontal and vertical offsets from the center point to the upper left corner point, so the corner position is estimated from the center point as (tl_x, tl_y) = (ct_x - exp(cs_x), ct_y - exp(cs_y)), where cs_x and cs_y denote the horizontal and vertical offset components.
S7, on the 1 corner position estimation heat map output by the corner position estimation network, the positions on the heat map are sorted by value from large to small, and the positions of the first 20 maxima (this number can be set according to actual requirements, e.g. 10, 15 or 25) are taken as the preliminary corner position estimates. Denote the 20 upper left corner coordinate estimates by (tl_x^k, tl_y^k), k = 1, 2, …, 20. The correlation between each of these upper left corner candidates and the corner position output in step S6 is then measured, for example as r_k = 1/(‖(tl_x^k, tl_y^k) - (tl_x, tl_y)‖ + ε), where ε > 0 is a positive constant. The upper left corner point with the maximum correlation is then selected as the final upper left corner position estimate. Finally, this corner position is mapped to the current frame image according to the resolution ratio s: if the coordinate of the selected position on the heat map is (i, j), the position estimate mapped to the current frame image is (s·i, s·j).
S8, the error value at the corner position is read from the corner position error estimation heat map output by the corner position error estimation network and added to the position estimate to obtain the precise position of the corner (upper left corner) point, thereby determining the target frame in the current frame image. If the position estimate in the current frame image is (x_t, y_t), the position-compensated corner coordinates are (x_t + Δx, y_t + Δy), where (Δx, Δy) is the corner position error estimated by the neural network.
S9, repeating the steps S2 to S8 until the task of target tracking on all video frames is completed.
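Steps S4-S8 can be summarized in the following decoding sketch, assuming NumPy heat maps and the inverse-distance form of the correlation in S7 described above; the box width and height recovered from the center and upper left corner by symmetry, and all names, are illustrative.

```python
import numpy as np

def decode_target_box(center_map, center_err, offset_map,
                      corner_map, corner_err, stride, top_k=20, eps=1e-6):
    """Decode one frame following steps S4-S8.  center_map and corner_map are
    2-D arrays; the error/offset maps have a leading channel for x and y."""
    # S4-S5: centre point with position compensation (error added to the mapped estimate).
    ci, cj = np.unravel_index(np.argmax(center_map), center_map.shape)
    ct_x = cj * stride + center_err[0, ci, cj]
    ct_y = ci * stride + center_err[1, ci, cj]

    # S6: corner position predicted from the centre and the offset heat maps.
    tl_x_pred = ct_x - np.exp(offset_map[0, ci, cj])
    tl_y_pred = ct_y - np.exp(offset_map[1, ci, cj])

    # S7: keep the top-k corner candidates and pick the one with maximum
    # 'correlation' (here: inverse distance to the centre-based prediction).
    cand = np.argsort(corner_map.ravel())[::-1][:top_k]
    best, best_score = None, -1.0
    for idx in cand:
        ki, kj = np.unravel_index(idx, corner_map.shape)
        x, y = kj * stride, ki * stride
        score = 1.0 / (np.hypot(x - tl_x_pred, y - tl_y_pred) + eps)
        if score > best_score:
            best, best_score = (ki, kj), score

    # S8: position compensation for the selected corner.
    ki, kj = best
    tl_x = kj * stride + corner_err[0, ki, kj]
    tl_y = ki * stride + corner_err[1, ki, kj]

    # Target box from centre and top-left corner (width/height by symmetry).
    w, h = 2.0 * (ct_x - tl_x), 2.0 * (ct_y - tl_y)
    return tl_x, tl_y, w, h
```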
In this process, for each tracking step the target search area image of the next frame is framed according to the target frame containing the tracked target in the previous frame, and the process repeats until target tracking has been completed for all video frames. Note that, when the method is used to track a target through a video, the target tracking frame specified in the first frame image serves as the reference for the whole tracking process, whereas the target search area image of the current frame is framed according to the target frame obtained in the previous frame. Specifically, an image of the second preset size is framed in the current frame image with the center point coordinate of the previous frame's target frame as reference and used as the target search area image. For example, when a video contains 3 frames (a first, second and third frame image) and tracking starts, after the target tracking frame is specified in the first frame image, the corresponding target frame is obtained in the second frame image by the above method; then, with the target frame of the second frame image as reference, the target search area image is framed in the third frame image to find the corresponding target frame (the target template image is still the tracking frame specified in the first frame image), and so on. In step S2, for the first tracking step, the previous frame refers to the first frame image.
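Putting the pieces together, a minimal per-frame tracking loop consistent with this description could look as follows; it reuses the hypothetical crop_and_resize and decode_target_box helpers sketched earlier and omits, for brevity, the mapping from search-crop coordinates back to full-frame coordinates.

```python
def track_video(frames, init_box, model, stride):
    """Minimal tracking loop: the template is fixed from the first frame,
    while the search region of every later frame is cropped around the centre
    of the previous frame's target box (names are illustrative)."""
    cx, cy, w, h = init_box
    template = crop_and_resize(frames[0], cx, cy, w, h, 127)
    boxes = [init_box]
    for frame in frames[1:]:
        search = crop_and_resize(frame, cx, cy, 2 * w, 2 * h, 255)
        maps = model(template, search)            # five heat maps in decode order (assumption)
        tl_x, tl_y, w, h = decode_target_box(*maps, stride)
        cx, cy = tl_x + w / 2, tl_y + h / 2       # centre used for the next crop
        boxes.append((cx, cy, w, h))
    return boxes
```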
As shown in fig. 3, the present invention further provides a target tracking apparatus 100 based on anchor-free twin network key point detection; as can be seen from the figure, the target tracking apparatus includes:
a tracking image obtaining module 110, configured to obtain a target tracking video, designate a target tracking frame of a first preset size in the first frame image of the target tracking video as the target template image, and, in the current frame image in which the target is to be tracked, select a target search area image of a second preset size framed with the target frame center point coordinate of the previous frame image as reference, wherein the second preset size is larger than the first preset size;
a center point estimation module 120, configured to generate a center point position estimation heat map, an upper left corner-to-center point offset estimation heat map, and a center point position error estimation heat map according to the input target template image and the target search area image;
a corner estimation module 130, configured to generate a corner position estimation heat map and a corner position error estimation heat map according to the input target search area image;
a calculating module 140, configured to obtain a position-compensated center point coordinate in the current frame image according to the center point position estimation heat map and the center point position error estimation heat map generated by the center point estimation module, and estimate the corner point coordinate according to the offset estimation heat map from the top left corner to the center point; respectively calculating the correlation between the estimated corner coordinates and the preliminary estimation values of the positions of the upper left corners in the corner position estimation heat map generated by the corner estimation module, and selecting the preliminary estimation value of the position of the upper left corner with the maximum correlation as the position estimation value of the upper left corner; generating position-compensated corner coordinates in the current frame image according to the corner position error estimation heat map generated by the corner estimation module and the selected upper left corner position estimation value;
and the target frame forming module 150 is configured to obtain a target frame of the tracked target in the current frame image according to the center point coordinates and the corner point coordinates after the position compensation, and complete the tracking of the target.
The method used by each module in the target tracking apparatus 100 is the same as that in the target tracking method, and details are not repeated here, and reference may be made to the description in the target tracking method.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of program modules is illustrated, and in practical applications, the above-described distribution of functions may be performed by different program modules, that is, the internal structure of the apparatus may be divided into different program units or modules to perform all or part of the above-described functions. Each program module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one processing unit, and the integrated unit may be implemented in a form of hardware, or may be implemented in a form of software program unit. In addition, the specific names of the program modules are only used for distinguishing the program modules from one another, and are not used for limiting the protection scope of the application.
Fig. 4 is a schematic structural diagram of a terminal device provided in an embodiment of the present invention. As shown, the terminal device 200 includes: a processor 220, a memory 210, and a computer program 211 stored in the memory 210 and executable on the processor 220, such as a target tracking program based on anchor-free twin network key point detection. The processor 220, when executing the computer program 211, implements the steps of each of the above-mentioned embodiments of the target tracking method based on anchor-free twin network key point detection, or implements the functions of each module in each of the above-mentioned embodiments of the target tracking device based on anchor-free twin network key point detection.
The terminal device 200 may be a notebook, a palm computer, a tablet computer, a mobile phone, or the like. Terminal device 200 may include, but is not limited to, processor 220, memory 210. Those skilled in the art will appreciate that fig. 4 is merely an example of terminal device 200, does not constitute a limitation of terminal device 200, and may include more or fewer components than shown, or some components may be combined, or different components, such as: terminal device 200 may also include input-output devices, display devices, network access devices, buses, and the like.
The Processor 220 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. The general purpose processor 220 may be a microprocessor or the processor may be any conventional processor or the like.
The memory 210 may be an internal storage unit of the terminal device 200, such as: a hard disk or a memory of the terminal device 200. The memory 210 may also be an external storage device of the terminal device 200, such as: a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal device 200. Further, the memory 210 may also include both an internal storage unit of the terminal device 200 and an external storage device. The memory 210 is used to store the computer program 211 and other programs and data required by the terminal device 200. The memory 210 may also be used to temporarily store data that has been output or is to be output.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or recited in detail in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed terminal device and method may be implemented in other ways. For example, the terminal device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
If the integrated modules/units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods of the above embodiments may also be implemented by instructing the relevant hardware through the computer program 211, which may be stored in a computer-readable storage medium; when executed by the processor 220, the computer program 211 implements the steps of the above method embodiments. The computer program 211 comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable storage medium may include: any entity or device capable of carrying the code of the computer program 211, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable storage medium may be adjusted according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in certain jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for persons skilled in the art, numerous modifications and adaptations can be made without departing from the principle of the present invention, and such modifications and adaptations should be considered as within the scope of the present invention.

Claims (10)

1. A target tracking method based on anchor-free twin network key point detection, characterized by comprising the following steps:
acquiring a target tracking video, designating a target tracking frame of a first preset size in the first frame image of the target tracking video as the target template image, and, in the current frame image in which the target is to be tracked, selecting a target search area image of a second preset size framed with the target frame center point coordinate of the previous frame image as reference, wherein the second preset size is larger than the first preset size;
sending the target template image and the target search area image into a pre-trained central point estimation module to generate a central point position estimation heat map, an upper left corner to central point offset estimation heat map and a central point position error estimation heat map;
sending the target search area image to a pre-trained corner estimation module to generate a corner position estimation heat map and a corner position error estimation heat map;
obtaining a center point coordinate after position compensation in the current frame image according to the center point position estimation heat map and the center point position error estimation heat map generated by the center point estimation module, and estimating the corner point coordinate according to the offset estimation heat map from the upper left corner to the center point;
respectively calculating the correlation between the estimated corner coordinates and the preliminary estimation values of the positions of the upper left corners in the corner position estimation heat map generated by the corner estimation module, and selecting the preliminary estimation value of the position of the upper left corner with the maximum correlation as the position estimation value of the upper left corner;
generating position-compensated corner coordinates in the current frame image according to the corner position error estimation heat map generated by the corner estimation module and the selected upper left corner position estimation value;
and obtaining a target frame of the tracking target in the current frame image according to the center point coordinates and the corner point coordinates after the position compensation, and completing the tracking of the target.
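The decoding flow of claim 1 (the steps after the heat maps are produced) can be summarized in a short NumPy sketch. Everything below is illustrative: the array layouts, the stride value, and the use of a nearest-peak rule in place of the claim's unspecified "correlation" measure are assumptions, not the patented implementation.

```python
import numpy as np

def decode_target_box(center_heat, center_err, tl_offset,
                      corner_heat, corner_err, stride=8):
    """Combine center-point and corner-point estimates into one target frame.

    center_heat / corner_heat: (H, W) heat maps.
    center_err / corner_err / tl_offset: (2, H, W) maps (x channel, y channel).
    """
    # 1. Peak of the center-point heat map, with position-error compensation.
    cy, cx = np.unravel_index(np.argmax(center_heat), center_heat.shape)
    ctx = (cx + center_err[0, cy, cx]) * stride
    cty = (cy + center_err[1, cy, cx]) * stride

    # 2. Corner estimate from the upper-left-to-center offset maps
    #    (claim 8 encodes the offset as log(ct - tl), hence exp here).
    tlx_est = ctx - np.exp(tl_offset[0, cy, cx])
    tly_est = cty - np.exp(tl_offset[1, cy, cx])

    # 3. Candidate upper-left corners from the corner heat map; keep the one
    #    closest to the estimate (a stand-in for the claim's correlation test).
    k = 5
    ys, xs = np.unravel_index(np.argsort(corner_heat.ravel())[-k:],
                              corner_heat.shape)
    best = np.argmin((xs * stride - tlx_est) ** 2 + (ys * stride - tly_est) ** 2)
    by, bx = ys[best], xs[best]

    # 4. Position-error compensation of the chosen corner.
    tlx = (bx + corner_err[0, by, bx]) * stride
    tly = (by + corner_err[1, by, bx]) * stride

    # 5. The target frame is symmetric about the compensated center point.
    return tlx, tly, 2 * (ctx - tlx), 2 * (cty - tly)
```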
2. The target tracking method of claim 1, wherein the center point estimation module comprises: a first residual neural network for performing feature extraction on the input target template image, a second residual neural network for performing feature extraction on the input target search area image, and a feature fusion network for fusing the features output by the first residual neural network and the second residual neural network, wherein the first residual neural network and the second residual neural network are twin networks.
3. The target tracking method of claim 2, wherein in the network model of the center point estimation module:
the first residual neural network and the second residual neural network are both ResNet-50 residual neural networks, and the 4th and 5th convolution blocks of the two residual neural networks both use dilated (hole) convolution;
in the feature fusion network, a feature map of the target template image is used as a convolution kernel and is convolved with a feature map of the target search area image to obtain a cross-correlation map, which is used as the input of the subsequent center point position estimation network, upper-left-corner-to-center-point offset estimation network and center point position error estimation network; the feature maps comprise the feature maps output by the 3rd, 4th and 5th convolution blocks of the residual neural network ResNet-50;
the center point position estimation network extracts features from the cross-correlation map output by the feature fusion network to obtain 1 center point position estimation heat map, which is used as the preliminary center point position estimation value;
the upper-left-corner-to-center-point offset estimation network extracts features from the cross-correlation map output by the feature fusion network to obtain 2 upper-left-corner-to-center-point offset estimation heat maps, which are used as the offset estimation values from the upper left corner point to the center point in the horizontal and vertical directions;
the center point position error estimation network extracts features from the cross-correlation map output by the feature fusion network to obtain 2 center point position error estimation heat maps, which are used as the center point position error values in the horizontal and vertical directions.
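As one way to picture the fusion step of claim 3, the sketch below performs a depth-wise cross-correlation in PyTorch, with the template feature map acting as the convolution kernel over the search-region feature map. The depth-wise grouping and all tensor shapes are assumptions for illustration; the claim only states that a convolution calculation between the two feature maps yields the cross-correlation map.

```python
import torch
import torch.nn.functional as F

def cross_correlation(template_feat, search_feat):
    """template_feat: (B, C, Ht, Wt); search_feat: (B, C, Hs, Ws)."""
    B, C, Ht, Wt = template_feat.shape
    # Depth-wise correlation: each (batch, channel) pair becomes its own group,
    # so every template channel correlates only with its matching search channel.
    search = search_feat.reshape(1, B * C, *search_feat.shape[2:])
    kernel = template_feat.reshape(B * C, 1, Ht, Wt)
    out = F.conv2d(search, kernel, groups=B * C)
    return out.reshape(B, C, *out.shape[2:])

# Example with conv-block-4-sized features (shapes are assumptions).
z = torch.randn(2, 256, 7, 7)     # template features
x = torch.randn(2, 256, 31, 31)   # search-region features
print(cross_correlation(z, x).shape)  # torch.Size([2, 256, 25, 25])
```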
4. The target tracking method of any one of claims 1-3, wherein the corner estimation module comprises: the system comprises an hourglass-shaped neural network used for extracting features of an input target search area image, a pooling network used for obtaining pooling feature maps in vertical and horizontal directions according to the features output by the hourglass-shaped neural network, a corner position estimation network used for estimating corner position estimation heat maps according to the pooling feature maps output by the pooling network, and a corner position error estimation network used for estimating corner position error estimation heat maps according to the pooling feature maps output by the pooling network.
5. The target tracking method of claim 4, wherein in the network model of the corner estimation module:
the hourglass-shaped neural network is formed by stacking 52 layers;
the pooling network comprises two pooling module components and a feature addition component; each pooling module component consists of a standard convolution module component and a max pooling component, the standard convolution module component comprising a convolution layer with 1 convolution kernel of size 3×3, 1 batch normalization layer and a linear rectification (ReLU) layer; the max pooling components in the two pooling module components find the maximum value along the horizontal direction and the vertical direction, respectively, as the value of each position in the feature map; the feature addition component adds the feature maps output by the two pooling module components to obtain the final pooled feature map;
the corner position estimation network extracts features from the pooled feature map output by the pooling network to obtain 1 corner position estimation heat map, which is used as the preliminary corner position estimation value;
the corner position error estimation network extracts features from the pooled feature map output by the pooling network to obtain 2 corner position error estimation heat maps, which are used as the corner position error values in the horizontal and vertical directions.
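A compact PyTorch rendering of the pooling network of claim 5 might look as follows. The cumulative max toward the upper-left corner (CornerNet-style corner pooling) is an assumption consistent with, but not dictated by, the claim's wording about taking maxima along the horizontal and vertical directions; class names are illustrative.

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """Standard convolution module: 3x3 conv + batch norm + ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class TopLeftPool(nn.Module):
    """Two directional max-pooled branches whose outputs are added."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = ConvBNReLU(channels)
        self.conv_v = ConvBNReLU(channels)
    def forward(self, x):
        h = self.conv_h(x)
        v = self.conv_v(x)
        # max over everything to the right of each pixel (horizontal direction)
        h = h.flip(-1).cummax(dim=-1).values.flip(-1)
        # max over everything below each pixel (vertical direction)
        v = v.flip(-2).cummax(dim=-2).values.flip(-2)
        return h + v

pool = TopLeftPool(256)
print(pool(torch.randn(1, 256, 25, 25)).shape)  # torch.Size([1, 256, 25, 25])
```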
6. The method of claim 1, 2, 3 or 5, further comprising, prior to acquiring the target tracking video, the step of creating and training the center point estimation module and the corner point estimation module, which comprises:
acquiring training video frames containing the target to be tracked; in the first of two frame images containing the same target to be tracked, framing a target template image containing the tracked target with a target frame of a first preset size; and, in the second frame image, selecting a target search area image of a second preset size with the center point coordinate of the target frame in the first frame image as a reference, so as to form a training set group, wherein the second preset size is larger than the first preset size;
constructing the network models of the center point estimation module and the corner point estimation module to be trained on the training set group, and configuring a corner position label setting method, a center point position label setting method, a corner position error estimation method, a center point position error estimation method and an upper-left-corner-to-center-point offset estimation method;
and training the constructed network models by taking the target template image and the target search area image in the training set group as the input of the center point estimation module and the target search area image as the input of the corner estimation module, performing back propagation through a preset loss function, and adjusting the network parameters until the loss function converges, thereby completing the training of the network models.
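The training procedure of claim 6 reduces to a standard supervised loop; the schematic below assumes the two modules are PyTorch nn.Module objects and that loss_fn implements the preset loss of claim 9. All names (center_net, corner_net, loss_fn, loader) and the choice of the Adam optimizer are illustrative assumptions.

```python
import torch

def train(center_net, corner_net, loss_fn, loader, epochs=20, lr=1e-4):
    params = list(center_net.parameters()) + list(corner_net.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for template, search, labels in loader:
            center_out = center_net(template, search)  # 3 groups of heat maps
            corner_out = corner_net(search)            # 2 groups of heat maps
            loss = loss_fn(center_out, corner_out, labels)
            opt.zero_grad()
            loss.backward()   # back-propagation through both modules
            opt.step()        # adjust the network parameters
```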
7. The target tracking method according to claim 6, wherein the method for setting the label y_{i,j} of each corner position in the corner position estimation heat map is:
Figure FDA0002763432490000031
wherein (i, j) represents the offset of the current point coordinate from the coordinate of the upper left corner point of the real target frame, and σ represents a preset distance threshold;
the method for estimating the corner position error o in the corner estimation module is:
Figure FDA0002763432490000032
wherein (x, y) denotes coordinates of a corner point on the current frame image,
Figure FDA0002763432490000033
represents the coordinates of the point on the current frame image mapped onto the corner position heat map, and s represents the ratio of the current frame image resolution to the corner position heat map resolution.
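The label and error formulas of claims 7 and 8 are present only as formula images in this text, so the helper functions below assume common choices that fit the surrounding wording: a Gaussian soft label centered on the true corner with σ as the spread, and a quantization-residue position error obtained when mapping image coordinates onto the heat map. Treat both as hypothetical reconstructions, not the claimed formulas.

```python
import numpy as np

def corner_label(i, j, sigma):
    """Soft label for a heat-map cell offset (i, j) from the true corner (assumed Gaussian)."""
    return np.exp(-(i ** 2 + j ** 2) / (2 * sigma ** 2))

def position_error(x, y, s):
    """Residue lost when mapping image coordinates (x, y) onto a heat map at ratio s."""
    return (x / s - np.floor(x / s), y / s - np.floor(y / s))
```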
8. The target tracking method of claim 6, wherein the method for setting the label y_{i,j} of each center point position in the center point position estimation heat map is:
Figure FDA0002763432490000041
wherein (i, j) represents the offset of the current point coordinate from the center point coordinate of the real target frame, and σ represents a preset distance threshold;
the method for estimating the center point position error o in the center point estimation module is:
Figure FDA0002763432490000042
wherein (x, y) represents coordinates of a point on the current frame image,
Figure FDA0002763432490000043
representing coordinates of points on the current frame image mapped to the corner position heat map, and s represents the ratio of the current frame image resolution to the corner position heat map resolution;
the method for estimating the upper-left-corner-to-center-point offset cs in the center point estimation module is:
cs = (log(ctx − tlx), log(cty − tly))
where, (ctx, cty) represents the position coordinates of the center point, and (tlx, tly) represents the position coordinates of the upper left corner point.
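The offset of claim 8 is stated explicitly, so it can be encoded and decoded directly; the helper names below are illustrative.

```python
import math

def encode_offset(ctx, cty, tlx, tly):
    """cs = (log(ctx - tlx), log(cty - tly)), as in claim 8."""
    return (math.log(ctx - tlx), math.log(cty - tly))

def decode_top_left(ctx, cty, cs):
    """Invert the encoding to recover the upper left corner from the center point."""
    return (ctx - math.exp(cs[0]), cty - math.exp(cs[1]))

print(encode_offset(120.0, 80.0, 100.0, 60.0))              # (log 20, log 20)
print(decode_top_left(120.0, 80.0, (math.log(20.0),) * 2))  # (100.0, 60.0)
```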
9. The target tracking method of claim 6, wherein the predetermined loss function L is:
Figure FDA0002763432490000044
wherein:
Figure FDA0002763432490000045
represents the loss of the center point position estimate:
Figure FDA0002763432490000046
H and W represent the height and width of the center point position estimation heat map, p_{ij} represents the confidence value predicted by the neural network at position (i, j) in the center point position estimation heat map, and y_{ij} is the corresponding soft label value;
Figure FDA0002763432490000047
represents the loss of the corner location estimate;
Figure FDA0002763432490000048
represents the loss of center point position error estimate:
Figure FDA0002763432490000049
SmoothL1(·) represents the smooth L1 loss function, o_ce and
Figure FDA0002763432490000051
respectively represent the true value of the center point position error and the position error estimated by the neural network;
Figure FDA0002763432490000052
represents the loss of the corner position error estimate;
L_s represents the upper-left-corner-to-center-point offset estimation loss:
Figure FDA0002763432490000053
cs and
Figure FDA0002763432490000054
respectively represent the true value of the offset from the upper left corner point to the center point and the estimated value given by the neural network; λ1, λ2, λ3 and λ4 are positive regularization parameters.
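Because the individual loss terms of claim 9 appear only as formula images, the PyTorch sketch below substitutes typical choices consistent with the text: a focal-style loss over the soft-labelled heat maps and smooth L1 losses for the error and offset terms, combined with weights λ1–λ4. The exact functional forms and the way the terms are weighted are assumptions, not the claimed formulas.

```python
import torch
import torch.nn.functional as F

def heatmap_loss(pred, soft_label, alpha=2.0):
    """Focal-style loss over a soft-labelled heat map (an assumed form)."""
    pos = soft_label.eq(1.0)  # cells at the true key point
    pos_loss = ((1 - pred[pos]) ** alpha * torch.log(pred[pos] + 1e-6)).sum()
    neg_loss = ((1 - soft_label[~pos]) * pred[~pos] ** alpha
                * torch.log(1 - pred[~pos] + 1e-6)).sum()
    return -(pos_loss + neg_loss) / pos.sum().clamp(min=1)

def total_loss(center_pred, center_label, corner_pred, corner_label,
               ce_pred, ce_true, oe_pred, oe_true, cs_pred, cs_true,
               lambdas=(1.0, 1.0, 1.0, 1.0)):
    l1, l2, l3, l4 = lambdas
    loss = heatmap_loss(center_pred, center_label)               # center position
    loss = loss + l1 * heatmap_loss(corner_pred, corner_label)   # corner position
    loss = loss + l2 * F.smooth_l1_loss(ce_pred, ce_true)        # center position error
    loss = loss + l3 * F.smooth_l1_loss(oe_pred, oe_true)        # corner position error
    loss = loss + l4 * F.smooth_l1_loss(cs_pred, cs_true)        # corner-to-center offset
    return loss
```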
10. A target tracking device based on anchor-free twin network key point detection, characterized by comprising:
a tracking image obtaining module, configured to obtain a target tracking video, designate a target tracking frame of a first preset size in the first frame image of the target tracking video as a target template image, and, in the current frame image in which the target needs to be tracked, select a target search area image of a second preset size with the center point coordinate of the target frame of the previous frame image as a reference, wherein the second preset size is larger than the first preset size;
the central point estimation module is used for generating a central point position estimation heat map, an upper left corner to central point offset estimation heat map and a central point position error estimation heat map according to the input target template image and the target search area image;
the corner estimation module is used for generating a corner position estimation heat map and a corner position error estimation heat map according to the input target search area image;
the calculation module is used for obtaining the coordinates of the center point after position compensation in the current frame image according to the center point position estimation heat map and the center point position error estimation heat map generated by the center point estimation module, and estimating the coordinates of the corner point according to the offset estimation heat map from the upper left corner to the center point; respectively calculating the correlation between the estimated corner coordinates and the preliminary estimation values of the positions of the upper left corners in the corner position estimation heat map generated by the corner estimation module, and selecting the preliminary estimation value of the position of the upper left corner with the maximum correlation as the position estimation value of the upper left corner; generating position-compensated corner coordinates in the current frame image according to the corner position error estimation heat map generated by the corner estimation module and the selected upper left corner position estimation value;
and the target frame forming module is used for obtaining a target frame of the tracking target in the current frame image according to the center point coordinates and the corner point coordinates after the position compensation, and completing the tracking of the target.
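Claim 10 only prescribes module responsibilities, so a device implementation can be as thin as the skeleton below; every class, attribute and method name is illustrative.

```python
class AnchorFreeSiameseTracker:
    """Composes the four modules named in claim 10 (names are assumptions)."""
    def __init__(self, image_source, center_net, corner_net, decoder):
        self.image_source = image_source  # tracking image obtaining module
        self.center_net = center_net      # center point estimation module
        self.corner_net = corner_net      # corner estimation module
        self.decoder = decoder            # calculation + target frame forming modules

    def track_frame(self):
        template, search = self.image_source.next_pair()
        center_maps = self.center_net(template, search)
        corner_maps = self.corner_net(search)
        return self.decoder(center_maps, corner_maps)  # target frame (box)
```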
CN202011225222.5A 2020-11-05 2020-11-05 Target tracking method and device based on anchor-free twin network key point detection Withdrawn CN112365523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011225222.5A CN112365523A (en) 2020-11-05 2020-11-05 Target tracking method and device based on anchor-free twin network key point detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011225222.5A CN112365523A (en) 2020-11-05 2020-11-05 Target tracking method and device based on anchor-free twin network key point detection

Publications (1)

Publication Number Publication Date
CN112365523A true CN112365523A (en) 2021-02-12

Family

ID=74510134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011225222.5A Withdrawn CN112365523A (en) 2020-11-05 2020-11-05 Target tracking method and device based on anchor-free twin network key point detection

Country Status (1)

Country Link
CN (1) CN112365523A (en)


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950477A (en) * 2021-03-15 2021-06-11 河南大学 High-resolution saliency target detection method based on dual-path processing
CN112950477B (en) * 2021-03-15 2023-08-22 河南大学 Dual-path processing-based high-resolution salient target detection method
CN113112523A (en) * 2021-03-26 2021-07-13 常州工学院 Target tracking method and device based on anchor-free twin network
CN113112523B (en) * 2021-03-26 2024-04-26 常州工学院 Target tracking method and device based on anchor-free twin network
CN112991452A (en) * 2021-03-31 2021-06-18 杭州健培科技有限公司 End-to-end centrum key point positioning measurement method and device based on centrum center point
CN113408376A (en) * 2021-06-03 2021-09-17 南京佑驾科技有限公司 Feature point tracking method based on twin network
WO2022260803A1 (en) * 2021-06-08 2022-12-15 Microsoft Technology Licensing, Llc Target region extraction for digital content addition
CN113344976A (en) * 2021-06-29 2021-09-03 常州工学院 Visual tracking method based on target object characterization point estimation
CN113344976B (en) * 2021-06-29 2024-01-23 常州工学院 Visual tracking method based on target object characterization point estimation
CN113538523A (en) * 2021-09-17 2021-10-22 魔视智能科技(上海)有限公司 Parking space detection tracking method, electronic equipment and vehicle
CN114463255A (en) * 2021-12-23 2022-05-10 国网江苏省电力有限公司电力科学研究院 Screw falling detection method based on anchor-free mechanism

Similar Documents

Publication Publication Date Title
CN112365523A (en) Target tracking method and device based on anchor-free twin network key point detection
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN110838125B (en) Target detection method, device, equipment and storage medium for medical image
CN110363817B (en) Target pose estimation method, electronic device, and medium
CN111931764B (en) Target detection method, target detection frame and related equipment
CN107329962B (en) Image retrieval database generation method, and method and device for enhancing reality
US20230043026A1 (en) Learning-based active surface model for medical image segmentation
CN111291768B (en) Image feature matching method and device, equipment and storage medium
CN111104925B (en) Image processing method, image processing apparatus, storage medium, and electronic device
WO2023151237A1 (en) Face pose estimation method and apparatus, electronic device, and storage medium
CN109934183B (en) Image processing method and device, detection equipment and storage medium
WO2019007253A1 (en) Image recognition method, apparatus and device, and readable medium
CN111523463B (en) Target tracking method and training method based on matching-regression network
CN110827320B (en) Target tracking method and device based on time sequence prediction
CN112084849A (en) Image recognition method and device
CN114937083B (en) Laser SLAM system and method applied to dynamic environment
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN115631112B (en) Building contour correction method and device based on deep learning
US20230153965A1 (en) Image processing method and related device
CN113592927A (en) Cross-domain image geometric registration method guided by structural information
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN116452631A (en) Multi-target tracking method, terminal equipment and storage medium
CN112508996A (en) Target tracking method and device for anchor-free twin network corner generation
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
US20050185834A1 (en) Method and apparatus for scene learning and three-dimensional tracking using stereo video cameras

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210212