CN114662572A - High-speed twin network target tracking method based on positioning perception - Google Patents

High-speed twin network target tracking method based on positioning perception

Info

Publication number
CN114662572A
CN114662572A (application CN202210220077.4A)
Authority
CN
China
Prior art keywords
feature
layer
template
phi
loss
Prior art date
Legal status
Granted
Application number
CN202210220077.4A
Other languages
Chinese (zh)
Other versions
CN114662572B (en)
Inventor
周丽芳
丁相
冷佳旭
王懿
王佩雯
罗俊
李佳其
Current Assignee
Jiangsu Xinchuang Wangan Data Technology Co ltd
Shenzhen Hongyue Enterprise Management Consulting Co ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202210220077.4A
Publication of CN114662572A
Application granted
Publication of CN114662572B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 - Retrieval using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-speed twin network target tracking method based on positioning perception, and belongs to the technical field of computer vision. The method mainly comprises the following steps: AlexNet is used as the feature extraction subnetwork to perform the feature extraction task for the template image and the search image; to enhance the representational ability of the features, the invention provides a context enhancement module that captures rich target information at both the local and global levels, together with a new feature fusion strategy that fully combines the context information of features at different scales; finally, the regression loss of the network is calculated with the Distance-IoU loss, guiding the tracker to select a more accurate bounding box. The method adds only a small number of parameters while maintaining a high tracking speed, and effectively improves the tracking performance of twin-network-based trackers in complex scenes.

Description

High-speed twin network target tracking method based on positioning perception
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a visual target tracking method.
Background
Target tracking is one of the most popular and challenging research topics in computer vision, and it has attracted the attention of researchers around the world. Many countries have invested substantial manpower, material, and financial resources in its study, and many excellent algorithms have emerged in succession. With the continuous updating of these algorithms, target tracking theory has become more and more mature, and it is now widely applied in fields such as video surveillance, intelligent human-computer interaction, autonomous driving, robotics, and modern military applications. However, given the size and position of the target in the initial frame of a video sequence, estimating the target position quickly and accurately in every subsequent frame under complex conditions (illumination change, scale change, occlusion, deformation, motion blur, out-of-plane rotation, etc.) remains a very challenging problem.
In recent years, tracking algorithms can be roughly classified into correlation-filtering-based methods and deep-learning-based methods. A correlation filtering method filters the image with a filter trained on the target image and finds the maximum value in the response map; the position of this maximum is the current position of the target. Because the method solves in the Fourier domain, it is naturally fast, but its reliance on simple features such as three-channel RGB color makes it difficult to achieve an excellent tracking effect when the target's appearance changes. In contrast, methods based on deep neural networks provide more discriminative feature representations and therefore yield more robust trackers; in particular, Siamese-network-based methods, which exploit similarity matching, have shown great potential and can achieve a balance of accuracy and beyond-real-time speed.
At present, Siamese methods that extract features with a shallow network can maintain a high tracking speed: on a device with an RTX 1080 Ti, SiamFC tracks at 86 frames per second (FPS), the AlexNet version of SiamFC++ at 160 FPS, and the AlexNet version of SiamGAT at 165 FPS. Compared with the standard real-time rate (25 FPS), these trackers maintain a high tracking speed. However, the features extracted by a shallow network are not sufficiently expressive, so the target cannot be accurately localized in complex tracking scenes and the tracking accuracy cannot be improved further. To make full use of the capability of modern deep neural networks, SiamDW combined experiments and theory to analyze in detail why deep network backbones such as VGG and ResNet, which have performed so well in fields such as object detection in recent years, do not work well in Siamese tracking, and then proposed a deeper and wider Siamese tracking framework. SiamRPN++ found through extensive experiments that, during training, if positive samples are no longer centered but the target is instead shifted around the center point with uniformly distributed sampling, then as the offset range increases, a deep network can gradually go from completely ineffective to effective. Compared with tracking methods based on shallow networks, trackers based on deep network backbones (SiamRPN++, Ocean, SiamAttn) obtain better tracking results and have proven their excellent tracking performance on many datasets. However, these trackers also have an obvious problem, namely speed: on a device with an RTX 1080 Ti, SiamRPN++ tracks at 35 FPS, Ocean at 56 FPS, and SiamAttn at 33 FPS.
As the above review shows, current Siamese methods have the following problems: 1) tracking methods based on shallow networks can maintain a high tracking speed, but because of the nature of the network, the extracted features are not strongly discriminative and the target cannot be localized well; 2) tracking methods based on deep networks can make full use of the capability of modern deep neural networks and accurately estimate the target position in every frame under various complex conditions (illumination change, scale change, occlusion, deformation, motion blur, out-of-plane rotation, etc.), but the number of parameters is huge, a large amount of computational overhead is required, and the tracking speed is very slow. To solve these problems, the invention provides a high-speed twin network target tracking method and system based on positioning perception.
After retrieval: CN111489361A, a real-time visual target tracking method based on deep feature aggregation in a twin network, comprises: step 1, constructing a ResNet22 deep twin neural network with a stride of 8 and using it to extract features; step 2, inputting the target image and the search image into the ResNet22 deep twin neural network, which generates the corresponding feature maps; and step 3, aggregating the different deep feature maps of the ResNet22 deep twin neural network and extracting multi-branch features to cooperatively infer the position of the target object, providing a more comprehensive description of the target object and thereby achieving more efficient tracking. That method can extract high-level semantic information of the target object, ensures the translation equivariance of the feature mapping, and extracts multi-branch features in a hierarchical aggregation manner to cooperatively infer the target location.
The invention with publication number CN111489361A uses the deep network ResNet22 to extract multi-layer features and then aggregates the different deep feature maps to infer the position of the target object. Although both are twin-network-based tracking methods, the present invention differs from CN111489361A in the following respects:
(1) Choice of feature extraction network: CN111489361A uses the deep network ResNet22, whereas the present invention uses the shallow network AlexNet. The present invention therefore has fewer parameters, lower computational cost, and a very high tracking speed.
(2) Manner of feature aggregation: CN111489361A aggregates features of different layers to cooperatively infer the position of the target object; the present invention, in order to enhance the expressive power of the features, uses a context enhancement module to generate position-aware features, then uses a positioning block to capture more complete target information, and finally performs feature aggregation.
(3) Manner of target prediction: CN111489361A regards the point with the largest value in the response map as the location of the predicted target, whereas the present invention divides tracking into classification and regression and introduces the Distance-IoU loss function in the regression branch, guiding the tracker to select a more accurate prediction box.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A high-speed twin network target tracking method based on positioning perception is provided. The technical scheme of the invention is as follows:
a high-speed twin network target tracking method based on positioning perception comprises the following steps:
step 1, inputting a template image and a search image into a feature extraction network based on AlexNet, and performing feature extraction to obtain features of the template image and features of the search image;
step 2, inputting the characteristics of the template image into a context enhancement module to generate position perception characteristics;
step 3, performing cross-correlation operation on the position perception characteristic of the template image and the characteristic of the search image;
step 4, inputting the feature maps after the cross-correlation operation into positioning blocks to strengthen the position information, and then performing multi-layer feature aggregation;
step 5, inputting the aggregated feature map into a prediction network, where the regression branch introduces the Distance-IoU loss to guide the tracker to select a more accurate bounding box for the tracked target.
Further, the step 1 of inputting the template image and the search image into the AlexNet-based feature extraction network and performing feature extraction to obtain the features of the template image and the features of the search image specifically comprises the following steps:
A1, cropping each frame of the video into a 127 × 127 patch as the template image z and a 255 × 255 patch as the search image x;
A2, inputting the template image z into the feature extraction network and extracting the layer-4 template feature φ_4(z) and the layer-5 template feature φ_5(z);
A3, inputting the search image x into the same feature extraction network and extracting the layer-4 search feature φ_4(x) and the layer-5 search feature φ_5(x).
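As an illustration of the cropping in step A1, the following is a minimal sketch in Python/NumPy. It assumes SiamFC-style center crops around the annotated target position with mean-padding at the frame border; the exact padding and context-margin rules are not specified in this disclosure, so those details are assumptions:

```python
import numpy as np

def center_crop(frame, cx, cy, size):
    """Crop a size x size patch centered at (cx, cy); regions falling outside
    the frame are filled by mean padding (an assumption of this sketch)."""
    pad = size // 2 + 1
    padded = np.pad(frame, ((pad, pad), (pad, pad), (0, 0)), mode="mean")
    x0 = int(round(cx)) + pad - size // 2
    y0 = int(round(cy)) + pad - size // 2
    return padded[y0:y0 + size, x0:x0 + size]

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # one video frame
z = center_crop(frame, 320, 240, 127)   # 127 x 127 template image z
x = center_crop(frame, 320, 240, 255)   # 255 x 255 search image x
```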
Further, the step 2 of inputting the features of the template image into the context enhancement module to generate position-aware features specifically comprises the following steps:
B1, inputting the layer-4 template feature φ_4(z) from step A2 into the context enhancement module to obtain the optimized template feature φ'_4(z);
B2, inputting the layer-5 template feature φ_5(z) from step A2 into the context enhancement module to obtain the optimized template feature φ'_5(z);
B3, the formula of the designed context enhancement module is:

F_output = F_input ⊗ σ(g(x) ⊕ l(x))   (1)

wherein F_output represents the optimized feature, F_input represents the feature before optimization, ⊗ represents the pixel-wise multiplication operation, σ represents the activation function, g(x) represents the global context information, l(x) represents the local context information, and ⊕ represents the pixel-wise addition operation.
Further, the step 3 of performing the cross-correlation operation on the position-aware features of the template image and the features of the search image specifically comprises the following steps:
C1, taking the optimized layer-4 template feature φ'_4(z) obtained in step B1 and performing a convolution operation on it so that it matches the size of the layer-5 template feature while the number of channels is kept, ensuring the integrity of the channel information of the template image;
C2, taking the layer-4 search feature φ_4(x) obtained in step A3 and performing a convolution operation on it so that it matches the size of the layer-5 search feature while the number of channels is kept, ensuring the integrity of the channel information of the search image;
C3, performing a cross-correlation operation on the optimized template feature φ'_4(z) and the search feature φ_4(x) to obtain the layer-4 feature map, with the formula:

F_1 = φ'_4(z) * φ_4(x)   (2)

wherein F_1 represents the feature map after the layer-4 cross-correlation, φ'_4(z) represents the optimized layer-4 template feature, φ_4(x) represents the layer-4 search feature, and * represents the cross-correlation operation;
C4, taking the optimized layer-5 template feature φ'_5(z) obtained in step B2 and performing a convolution operation on it before the cross-correlation;
C5, taking the layer-5 search feature φ_5(x) obtained in step A3 and performing a convolution operation on it before the cross-correlation;
C6, performing a cross-correlation operation on the optimized template feature φ'_5(z) and the search feature φ_5(x) to obtain the layer-5 feature map, with the formula:

F_2 = C_1(φ'_5(z)) * C_2(φ_5(x))   (3)

wherein F_2 represents the feature map after the layer-5 cross-correlation, C_1 and C_2 denote convolution operations, φ'_5(z) represents the optimized layer-5 template feature, φ_5(x) represents the layer-5 search feature, and * represents the cross-correlation operation.
Further, the step 4 of inputting the feature maps after the cross-correlation operation into positioning blocks to strengthen the position information and then performing multi-layer feature aggregation specifically comprises the following steps:
D1, taking the layer-4 feature map F_1 obtained in step C3 and inputting it into a positioning block, which down-samples while using the property of deformable convolution to strengthen the position information of the tracked target, obtaining the feature map F'_1;
D2, taking the layer-5 feature map F_2 obtained in step C6 and inputting it into a positioning block, which uses the property of deformable convolution to strengthen the position information of the tracked target, obtaining the feature map F'_2;
D3, concatenating the feature map F'_1 generated in step D1 with the feature map F'_2 generated in step D2;
D4, performing convolutional aggregation on the concatenated feature maps generated in step D3 to obtain the final feature map, which is rich in target information at different scales, with the formula:

F_final = C_3(Cat(L_1(F'_1), L_2(F'_2)))   (4)

wherein F_final represents the final feature map, C_3 represents the convolution operation, Cat represents the concatenation operation, L_1 and L_2 denote the operations of the different positioning blocks, F'_1 represents the layer-4 feature map after the positioning block, and F'_2 represents the layer-5 feature map after the positioning block.
Further, the step 5 of inputting the aggregated feature map into the prediction network, where the regression branch introduces the Distance-IoU loss to guide the tracker to select a more accurate bounding box for the tracked target, specifically comprises the following steps:
E1, according to step D4, performing convolution on the obtained feature map F_final and then performing the classification task with the conventional cross-entropy loss, where the classification loss function is:

L_c = (1/N_pos) Σ_{i=1}^{N} L_cls(p_i, p*_i)   (5)

wherein L_c represents the mean classification loss, N_pos denotes the number of positive sample pixels, L_cls(p, p*) represents the loss between the predicted value and the true value of a single pixel point, and Σ_{i=1}^{N} denotes summing the classification losses of the 1st to the N-th pixel points;
E2, according to step D4, performing convolution on the obtained feature map F_final and then performing the regression task with the Distance-IoU loss to obtain the regression loss L_r;
E3, from the classification loss L_c obtained in step E1 and the regression loss L_r obtained in step E2, calculating the final overall loss function:

L = λ_1 L_c + λ_2 L_r   (9)

wherein L represents the overall loss function, λ_1 represents the hyperparameter weighting the classification loss, and λ_2 represents the hyperparameter weighting the regression loss.
Further, the Distance-IoU loss and the regression loss function are as follows:

R_DIoU = ρ²(b, b_gt) / c²   (6)

wherein R_DIoU represents the weight factor measuring the regression direction, ρ(b, b_gt) represents the Euclidean distance between the center points of the prediction box b and the ground-truth box b_gt, and c represents the length of the diagonal of the minimum enclosing rectangle of the prediction box and the ground-truth box;

L_reg = 1 - IoU + R_DIoU   (7)

wherein L_reg represents the regression loss function; R_DIoU tending to 0 means that the prediction box is approaching the ground-truth box and the regression direction is correct; otherwise, the regression direction needs to be adjusted;

L_r = (1/N_pos) Σ_{i=1}^{N} L_reg(p_i, p*_i)   (8)

wherein L_r represents the mean regression loss, N_pos denotes the number of positive sample pixels, L_reg(p, p*) represents the loss between the predicted value and the true value of a single pixel point, and Σ_{i=1}^{N} denotes summing the regression losses of the 1st to the N-th pixel points.
The invention has the following advantages and beneficial effects:
1. Aiming at the trade-off between speed and accuracy in the field of target tracking: using a deep network can effectively improve the expressive power of the features, but it brings a huge number of parameters, high computational cost, and a very slow tracking speed. The invention therefore proposes the following idea: extract features with a shallow network to guarantee the tracking speed, while adopting a series of optimization means to compensate for the shortcomings of the shallow network at both the feature level and the loss level, thereby improving the tracking effect. Compared with current state-of-the-art trackers (SiamFC++, Ocean, SiamGAT), the invention exhibits superior tracking performance on numerous datasets while retaining a huge speed advantage (220 FPS).
2. The design of the feature aggregation network is crucial to improving tracking performance. The context enhancement module provided by the invention has the following advantages: at the local level, it can capture the features of local blocks of the tracked target, so the target can be better localized in low-resolution scenes; at the global level, the global context information can model long-range dependencies, enabling the tracker to better understand and remember the tracking scenario. By means of the context enhancement module, more effective and more discriminative location-aware features can be generated. Meanwhile, before feature aggregation, the deformable convolution in the positioning block can dynamically adjust its sampling locations according to the shape of the tracked target. After the features are aggregated, features with stronger expressive power and with context information at different scales are obtained.
3. Target tracking is divided into classification and regression tasks: classification determines positive and negative samples, and regression determines the bounding box of the target. Most current SiamRPN-based tracking methods use the Smooth L1 loss as the regression loss function, while the Distance-IoU loss was proposed in the DIoU paper as a more suitable loss function. This loss function converges faster and can guide the tracker to select a more accurate bounding box, so the invention introduces the Distance-IoU loss into the regression branch, further improving the tracking performance of the tracker.
Drawings
FIG. 1 is a general block diagram of a preferred embodiment of the method provided by the present invention;
FIG. 2 is a schematic diagram of a context enhancement module according to the present invention.
FIG. 3 is a schematic view of the positioning block structure of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in the attached figure 1, a high-speed twin network target tracking method based on positioning perception comprises the following steps:
1. As shown in FIG. 1, the template image and the search image are input into the feature extraction network to extract features:
1) each frame of the video is cropped into a 127 × 127 patch as the template image (denoted z) and a 255 × 255 patch as the search image (denoted x);
2) the template image z is input into the feature extraction network to extract the layer-4 template feature φ_4(z) and the layer-5 template feature φ_5(z);
3) the search image x is input into the same feature extraction network to extract the layer-4 search feature φ_4(x) and the layer-5 search feature φ_5(x).
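The following is a minimal PyTorch sketch of the shared feature extraction subnetwork in steps 1)-3) above, returning both the layer-4 and layer-5 features. The channel widths, strides, and use of batch normalization follow the common SiamFC-style AlexNet variant and are assumptions; the patent does not give the exact configuration:

```python
import torch
import torch.nn as nn

class AlexNetBackbone(nn.Module):
    """AlexNet-style feature extractor. Both the layer-4 and layer-5 outputs
    are returned, since the method aggregates features from both layers."""
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Conv2d(3, 96, 11, stride=2), nn.BatchNorm2d(96),
                                    nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2))
        self.layer2 = nn.Sequential(nn.Conv2d(96, 256, 5), nn.BatchNorm2d(256),
                                    nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2))
        self.layer3 = nn.Sequential(nn.Conv2d(256, 384, 3), nn.BatchNorm2d(384),
                                    nn.ReLU(inplace=True))
        self.layer4 = nn.Sequential(nn.Conv2d(384, 384, 3), nn.BatchNorm2d(384),
                                    nn.ReLU(inplace=True))
        self.layer5 = nn.Sequential(nn.Conv2d(384, 256, 3), nn.BatchNorm2d(256))

    def forward(self, img):
        f = self.layer3(self.layer2(self.layer1(img)))
        f4 = self.layer4(f)      # phi_4(.)
        f5 = self.layer5(f4)     # phi_5(.)
        return f4, f5

backbone = AlexNetBackbone()                     # one shared (twin) network
z4, z5 = backbone(torch.randn(1, 3, 127, 127))   # phi_4(z), phi_5(z)
x4, x5 = backbone(torch.randn(1, 3, 255, 255))   # phi_4(x), phi_5(x)
```

Because layer 5 applies one more 3 × 3 convolution, φ_5 is spatially smaller than φ_4, which is why the cross-correlation stage below first convolves the layer-4 features down to the layer-5 size.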
2. As shown in FIG. 1, the features of the template image are input into the context enhancement module for feature optimization. The context enhancement module contains a local context branch and a global context branch. The local context branch is composed of a convolutional layer (Conv), an activation function layer (ReLU), and a batch normalization layer (BN); the global context branch is composed of a global average pooling layer (GAP), a convolutional layer (Conv), an activation function layer (ReLU), and a batch normalization layer (BN). The module structure is shown in FIG. 2:
1) the layer-4 template feature φ_4(z) is input into the context enhancement module to obtain the optimized template feature φ'_4(z);
2) the layer-5 template feature φ_5(z) is input into the context enhancement module to obtain the optimized template feature φ'_5(z);
3) in the above steps, the formula of the designed context enhancement module is:

F_output = F_input ⊗ σ(g(x) ⊕ l(x))   (1)

wherein F_output represents the optimized feature, F_input represents the feature before optimization, ⊗ represents the pixel-wise multiplication operation, σ represents the activation function, g(x) represents the global context information, l(x) represents the local context information, and ⊕ represents the pixel-wise addition operation.
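A minimal PyTorch sketch of this module follows, with a local branch (Conv-ReLU-BN) and a global branch (GAP-Conv-ReLU-BN) combined as in eq. (1). The 1 × 1 kernel size is an assumption, as the patent does not state the kernel sizes:

```python
import torch
import torch.nn as nn

class ContextEnhancement(nn.Module):
    """Context enhancement module of FIG. 2: local and global context are
    added pixel-wise, squashed by a sigmoid (sigma), and used to re-weight
    the input feature, following eq. (1)."""
    def __init__(self, channels):
        super().__init__()
        self.local_ctx = nn.Sequential(
            nn.Conv2d(channels, channels, 1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(channels))
        self.global_ctx = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),             # GAP
            nn.Conv2d(channels, channels, 1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(channels))
        self.sigma = nn.Sigmoid()

    def forward(self, f):
        l = self.local_ctx(f)        # l(x): same spatial size as the input
        g = self.global_ctx(f)       # g(x): 1 x 1, broadcast over H x W
        return f * self.sigma(g + l) # eq. (1)

cem = ContextEnhancement(256).eval()     # eval mode: BN over a 1 x 1 map has
z5_opt = cem(torch.randn(1, 256, 6, 6))  # no batch statistics in training mode
```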
3. As shown in FIG. 1, the features of the template image are cross-correlated with the features of the search image:
1) the optimized layer-4 template feature φ'_4(z) undergoes a convolution operation so that it matches the size of the layer-5 template feature while the number of channels is kept, ensuring the integrity of the channel information of the template image;
2) the layer-4 search feature φ_4(x) undergoes a convolution operation so that it matches the size of the layer-5 search feature while the number of channels is kept, ensuring the integrity of the channel information of the search image;
3) a cross-correlation operation is performed on the optimized template feature φ'_4(z) and the search feature φ_4(x) to obtain the layer-4 feature map, with the formula:

F_1 = φ'_4(z) * φ_4(x)   (2)

wherein F_1 represents the feature map after the layer-4 cross-correlation, φ'_4(z) represents the optimized layer-4 template feature, φ_4(x) represents the layer-4 search feature, and * represents the cross-correlation operation;
4) the optimized layer-5 template feature φ'_5(z) undergoes a convolution operation before the cross-correlation, further reducing the influence of distractors;
5) the layer-5 search feature φ_5(x) undergoes a convolution operation before the cross-correlation, further reducing the influence of distractors;
6) a cross-correlation operation is performed on the optimized template feature φ'_5(z) and the search feature φ_5(x) to obtain the layer-5 feature map, with the formula:

F_2 = C_1(φ'_5(z)) * C_2(φ_5(x))   (3)

wherein F_2 represents the feature map after the layer-5 cross-correlation, C_1 and C_2 denote convolution operations, φ'_5(z) represents the optimized layer-5 template feature, φ_5(x) represents the layer-5 search feature, and * represents the cross-correlation operation.
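A sketch of the cross-correlation operator * in eqs. (2)-(3) follows. The template feature is used as a convolution kernel slid over the search feature; performing it depth-wise (channel by channel, as in SiamRPN++) is an assumption, since the patent only writes *:

```python
import torch
import torch.nn.functional as F

def xcorr(z_feat, x_feat):
    """Depth-wise cross-correlation: each channel of the template feature is
    correlated with the matching channel of the search feature."""
    b, c, hz, wz = z_feat.shape
    x = x_feat.reshape(1, b * c, x_feat.size(2), x_feat.size(3))
    kernel = z_feat.reshape(b * c, 1, hz, wz)
    out = F.conv2d(x, kernel, groups=b * c)
    return out.reshape(b, c, out.size(2), out.size(3))

# eq. (2): the layer-4 template feature has already been convolved down to the
# layer-5 template size (step 1 above), e.g. 6 x 6 against a 22 x 22 search map
F1 = xcorr(torch.randn(1, 256, 6, 6), torch.randn(1, 256, 22, 22))
print(F1.shape)   # torch.Size([1, 256, 17, 17])
```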
4. As shown in FIG. 1, the feature maps after the cross-correlation operation are input into positioning blocks to strengthen the position information, and the multi-layer features are then aggregated. The positioning block is composed of a deformable convolution, group normalization, an activation function, and a convolution. As shown in FIG. 3, the deformable convolution adds offsets to the convolution sampling grid and can extract more accurate features than a standard convolution:
1) the layer-4 feature map F_1 is input into a positioning block, which down-samples while using the property of deformable convolution to strengthen the position information of the tracked target, obtaining the feature map F'_1;
2) the layer-5 feature map F_2 is input into a positioning block, which uses the property of deformable convolution to strengthen the position information of the tracked target, obtaining the feature map F'_2;
3) the feature map F'_1 generated in step 1) is concatenated with the feature map F'_2 generated in step 2);
4) convolutional aggregation is performed on the concatenated feature maps generated in step 3) to obtain the final feature map, which is rich in target information at different scales, with the formula:

F_final = C_3(Cat(L_1(F'_1), L_2(F'_2)))   (4)

wherein F_final represents the final feature map, C_3 represents the convolution operation, Cat represents the concatenation operation, L_1 and L_2 denote the operations of the different positioning blocks, F'_1 represents the layer-4 feature map after the positioning block, and F'_2 represents the layer-5 feature map after the positioning block.
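A PyTorch sketch of the positioning block of FIG. 3 and the aggregation of eq. (4) follows, using torchvision's deformable convolution. Predicting the deformable-convolution offsets with a small parallel convolution is the usual construction and is an assumption here, as are the channel widths and the group count:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class PositioningBlock(nn.Module):
    """Deformable convolution -> group normalization -> activation -> conv;
    a stride of 2 would give the down-sampling used in the layer-4 branch."""
    def __init__(self, channels, stride=1):
        super().__init__()
        # 2 offsets (dx, dy) per position of the 3 x 3 kernel = 18 channels
        self.offset = nn.Conv2d(channels, 18, 3, stride=stride, padding=1)
        self.dcn = DeformConv2d(channels, channels, 3, stride=stride, padding=1)
        self.gn = nn.GroupNorm(32, channels)
        self.act = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f):
        f = self.dcn(f, self.offset(f))   # offsets bend the sampling grid
        return self.conv(self.act(self.gn(f)))

# eq. (4): F_final = C3(Cat(L1(F1'), L2(F2')))
L1, L2 = PositioningBlock(256), PositioningBlock(256)
C3 = nn.Conv2d(512, 256, 1)
F1, F2 = torch.randn(1, 256, 17, 17), torch.randn(1, 256, 17, 17)
F_final = C3(torch.cat([L1(F1), L2(F2)], dim=1))
```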
5. As shown in FIG. 1, the aggregated feature map is input into the prediction network, where the regression branch introduces the Distance-IoU loss to guide the tracker to select a more accurate bounding box for the tracked target:
1) the obtained feature map F_final undergoes convolution, and the classification task is then performed with the conventional cross-entropy loss, where the classification loss function is:

L_c = (1/N_pos) Σ_{i=1}^{N} L_cls(p_i, p*_i)   (5)

wherein L_c represents the mean classification loss, N_pos denotes the number of positive sample pixels, L_cls(p, p*) represents the loss between the predicted value and the true value of a single pixel point, and Σ_{i=1}^{N} denotes summing the classification losses of the 1st to the N-th pixel points.
2) the obtained feature map F_final undergoes convolution, and the regression task is then performed with the Distance-IoU loss, where the Distance-IoU loss and the regression loss function are as follows:

R_DIoU = ρ²(b, b_gt) / c²   (6)

wherein R_DIoU represents the weight factor measuring the regression direction, ρ(b, b_gt) represents the Euclidean distance between the center points of the prediction box b and the ground-truth box b_gt, and c represents the length of the diagonal of the minimum enclosing rectangle of the prediction box and the ground-truth box;

L_reg = 1 - IoU + R_DIoU   (7)

wherein L_reg represents the regression loss function; R_DIoU tending to 0 means that the prediction box is approaching the ground-truth box and the regression direction is correct; otherwise, the regression direction needs to be adjusted;

L_r = (1/N_pos) Σ_{i=1}^{N} L_reg(p_i, p*_i)   (8)

wherein L_r represents the mean regression loss, N_pos denotes the number of positive sample pixels, L_reg(p, p*) represents the loss between the predicted value and the true value of a single pixel point, and Σ_{i=1}^{N} denotes summing the regression losses of the 1st to the N-th pixel points.
3) from the classification loss L_c obtained in step 1) and the regression loss L_r obtained in step 2), the final overall loss function is calculated:

L = λ_1 L_c + λ_2 L_r   (9)

wherein L represents the overall loss function, λ_1 represents the hyperparameter weighting the classification loss, and λ_2 represents the hyperparameter weighting the regression loss.
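A sketch of the loss computation of eqs. (5)-(9) follows, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the gathering of the N_pos positive pixels is omitted, and the λ_1, λ_2 values shown are assumptions:

```python
import torch
import torch.nn.functional as F

def diou_loss(pred, target, eps=1e-7):
    """Distance-IoU regression loss, eqs. (6)-(7): L_reg = 1 - IoU + R_DIoU,
    with R_DIoU = rho^2(b, b_gt) / c^2. Boxes are (N, 4) tensors."""
    lt = torch.max(pred[:, :2], target[:, :2])   # intersection corners
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
    iou = inter / (area_p + area_t - inter + eps)
    # rho^2: squared distance between the two box centers
    rho2 = (((pred[:, :2] + pred[:, 2:]) -
             (target[:, :2] + target[:, 2:])) / 2).pow(2).sum(dim=1)
    # c^2: squared diagonal of the smallest rectangle enclosing both boxes
    enc = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    c2 = enc.pow(2).sum(dim=1) + eps
    return 1 - iou + rho2 / c2                   # per-sample L_reg

pred_box = torch.tensor([[10., 10., 50., 60.]])
gt_box = torch.tensor([[12., 8., 48., 62.]])
cls_logit, cls_label = torch.tensor([0.8]), torch.tensor([1.0])
L_c = F.binary_cross_entropy_with_logits(cls_logit, cls_label)  # eq. (5)
L_r = diou_loss(pred_box, gt_box).mean()                        # eq. (8)
lambda1, lambda2 = 1.0, 1.0    # hyperparameter values are assumptions
L = lambda1 * L_c + lambda2 * L_r                               # eq. (9)
```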
The invention aims to guarantee the tracking speed by extracting features with a shallow network, while adopting a series of optimization means, including using the context enhancement module to generate more effective and more discriminative location-aware features. Before feature aggregation, the deformable convolution in the positioning block can dynamically adjust its sampling locations according to the shape of the tracked target, capturing more complete target features. After the features are aggregated, features with context information at different scales and stronger expressive power are therefore obtained. This design compensates for the shortcomings of the shallow network at both the feature level and the loss level, thereby improving the tracking effect. Finally, the Distance-IoU loss is introduced into the regression branch to accelerate convergence and further guide the tracker to select a more accurate bounding box.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (7)

1. A high-speed twin network target tracking method based on positioning perception is characterized by comprising the following steps:
step 1, inputting a template image and a search image into a feature extraction network based on AlexNet, and performing feature extraction to obtain features of the template image and features of the search image;
step 2, inputting the characteristics of the template image into a context enhancement module to generate position perception characteristics;
step 3, performing cross-correlation operation on the position perception characteristic of the template image and the characteristic of the search image;
step 4, inputting the feature maps after the cross-correlation operation into positioning blocks to strengthen the position information, and then performing multi-layer feature aggregation;
step 5, inputting the aggregated feature map into a prediction network, where the regression branch introduces the Distance-IoU loss to guide the tracker to select a more accurate bounding box for the tracked target.
2. The high-speed twin network target tracking method based on positioning perception as claimed in claim 1, wherein the step 1 of inputting the template image and the search image into the AlexNet-based feature extraction network and performing feature extraction to obtain the features of the template image and the features of the search image specifically comprises the following steps:
A1, cropping each frame of the video into a 127 × 127 patch as the template image z and a 255 × 255 patch as the search image x;
A2, inputting the template image z into the feature extraction network and extracting the layer-4 template feature φ_4(z) and the layer-5 template feature φ_5(z);
A3, inputting the search image x into the same feature extraction network and extracting the layer-4 search feature φ_4(x) and the layer-5 search feature φ_5(x).
3. The high-speed twin network target tracking method based on positioning perception as claimed in claim 2, wherein the step 2 of inputting the features of the template image into the context enhancement module to generate position-aware features specifically comprises the following steps:
B1, inputting the layer-4 template feature φ_4(z) from step A2 into the context enhancement module to obtain the optimized template feature φ'_4(z);
B2, inputting the layer-5 template feature φ_5(z) from step A2 into the context enhancement module to obtain the optimized template feature φ'_5(z);
B3, the formula of the designed context enhancement module is:

F_output = F_input ⊗ σ(g(x) ⊕ l(x))   (1)

wherein F_output represents the optimized feature, F_input represents the feature before optimization, ⊗ represents the pixel-wise multiplication operation, σ represents the activation function, g(x) represents the global context information, l(x) represents the local context information, and ⊕ represents the pixel-wise addition operation.
4. The high-speed twin network target tracking method based on positioning perception as claimed in claim 3, wherein the step 3 of performing the cross-correlation operation on the features of the template image and the features of the search image specifically comprises the following steps:
C1, taking the optimized layer-4 template feature φ'_4(z) obtained in step B1 and performing a convolution operation on it so that it matches the size of the layer-5 template feature while the number of channels is kept, ensuring the integrity of the channel information of the template image;
C2, taking the layer-4 search feature φ_4(x) obtained in step A3 and performing a convolution operation on it so that it matches the size of the layer-5 search feature while the number of channels is kept, ensuring the integrity of the channel information of the search image;
C3, performing a cross-correlation operation on the optimized template feature φ'_4(z) and the search feature φ_4(x) to obtain the layer-4 feature map, with the formula:

F_1 = φ'_4(z) * φ_4(x)   (2)

wherein F_1 represents the feature map after the layer-4 cross-correlation, φ'_4(z) represents the optimized layer-4 template feature, φ_4(x) represents the layer-4 search feature, and * represents the cross-correlation operation;
C4, taking the optimized layer-5 template feature φ'_5(z) obtained in step B2 and performing a convolution operation on it before the cross-correlation;
C5, taking the layer-5 search feature φ_5(x) obtained in step A3 and performing a convolution operation on it before the cross-correlation;
C6, performing a cross-correlation operation on the optimized template feature φ'_5(z) and the search feature φ_5(x) to obtain the layer-5 feature map, with the formula:

F_2 = C_1(φ'_5(z)) * C_2(φ_5(x))   (3)

wherein F_2 represents the feature map after the layer-5 cross-correlation, C_1 and C_2 denote convolution operations, φ'_5(z) represents the optimized layer-5 template feature, φ_5(x) represents the layer-5 search feature, and * represents the cross-correlation operation.
5. The high-speed twin network target tracking method based on positioning perception as claimed in claim 4, wherein the step 4 of inputting the feature maps after the cross-correlation operation into positioning blocks to strengthen the position information and then performing multi-layer feature aggregation specifically comprises the following steps:
D1, taking the layer-4 feature map F_1 obtained in step C3 and inputting it into a positioning block, which down-samples while using the property of deformable convolution to strengthen the position information of the tracked target, obtaining the feature map F'_1;
D2, taking the layer-5 feature map F_2 obtained in step C6 and inputting it into a positioning block, which uses the property of deformable convolution to strengthen the position information of the tracked target, obtaining the feature map F'_2;
D3, concatenating the feature map F'_1 generated in step D1 with the feature map F'_2 generated in step D2;
D4, performing convolutional aggregation on the concatenated feature maps generated in step D3 to obtain the final feature map, which is rich in target information at different scales, with the formula:

F_final = C_3(Cat(L_1(F'_1), L_2(F'_2)))   (4)

wherein F_final represents the final feature map, C_3 represents the convolution operation, Cat represents the concatenation operation, L_1 and L_2 denote the operations of the different positioning blocks, F'_1 represents the layer-4 feature map after the positioning block, and F'_2 represents the layer-5 feature map after the positioning block.
6. The high-speed twin network target tracking method based on positioning perception as claimed in claim 5, wherein the step 5 of inputting the aggregated feature map into the prediction network, where the regression branch introduces the Distance-IoU loss to guide the tracker to select a more accurate bounding box for the tracked target, specifically comprises the following steps:
E1, according to step D4, performing convolution on the obtained feature map F_final and then performing the classification task with the conventional cross-entropy loss, where the classification loss function is:

L_c = (1/N_pos) Σ_{i=1}^{N} L_cls(p_i, p*_i)   (5)

wherein L_c represents the mean classification loss, N_pos denotes the number of positive sample pixels, L_cls(p, p*) represents the loss between the predicted value and the true value of a single pixel point, and Σ_{i=1}^{N} denotes summing the classification losses of the 1st to the N-th pixel points;
E2, according to step D4, performing convolution on the obtained feature map F_final and then performing the regression task with the Distance-IoU loss to obtain the regression loss L_r;
E3, from the classification loss L_c obtained in step E1 and the regression loss L_r obtained in step E2, calculating the final overall loss function:

L = λ_1 L_c + λ_2 L_r   (9)

wherein L represents the overall loss function, λ_1 represents the hyperparameter weighting the classification loss, and λ_2 represents the hyperparameter weighting the regression loss.
7. The high-speed twin network target tracking method based on positioning perception as claimed in claim 6, wherein the Distance-IoU loss and the regression loss function are as follows:

R_DIoU = ρ²(b, b_gt) / c²   (6)

wherein R_DIoU represents the weight factor measuring the regression direction, ρ(b, b_gt) represents the Euclidean distance between the center points of the prediction box b and the ground-truth box b_gt, and c represents the length of the diagonal of the minimum enclosing rectangle of the prediction box and the ground-truth box;

L_reg = 1 - IoU + R_DIoU   (7)

wherein L_reg represents the regression loss function; R_DIoU tending to 0 means that the prediction box is approaching the ground-truth box and the regression direction is correct; otherwise, the regression direction needs to be adjusted;

L_r = (1/N_pos) Σ_{i=1}^{N} L_reg(p_i, p*_i)   (8)

wherein L_r represents the mean regression loss, N_pos denotes the number of positive sample pixels, L_reg(p, p*) represents the loss between the predicted value and the true value of a single pixel point, and Σ_{i=1}^{N} denotes summing the regression losses of the 1st to the N-th pixel points.
CN202210220077.4A 2022-03-08 2022-03-08 High-speed twin network target tracking method based on positioning sensing Active CN114662572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210220077.4A CN114662572B (en) 2022-03-08 2022-03-08 High-speed twin network target tracking method based on positioning sensing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210220077.4A CN114662572B (en) 2022-03-08 2022-03-08 High-speed twin network target tracking method based on positioning sensing

Publications (2)

Publication Number Publication Date
CN114662572A (en) 2022-06-24
CN114662572B CN114662572B (en) 2024-07-19

Family

ID=82030258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210220077.4A Active CN114662572B (en) 2022-03-08 2022-03-08 High-speed twin network target tracking method based on positioning sensing

Country Status (1)

Country Link
CN (1) CN114662572B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150117760A1 (en) * 2013-10-30 2015-04-30 Nec Laboratories America, Inc. Regionlets with Shift Invariant Neural Patterns for Object Detection
US20200327679A1 (en) * 2019-04-12 2020-10-15 Beijing Moviebook Science and Technology Co., Ltd. Visual target tracking method and apparatus based on deeply and densely connected neural network
CN112509008A (en) * 2020-12-15 2021-03-16 重庆邮电大学 Target tracking method based on intersection-to-parallel ratio guided twin network
CN113807188A (en) * 2021-08-20 2021-12-17 北京工业大学 Unmanned aerial vehicle target tracking method based on anchor frame matching and Simese network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIFANG ZHOU et al.: "A location-aware siamese network for high-speed visual tracking", Applied Intelligence, vol. 53, no. 4, 10 June 2022, page 4431 *
WANG KEPING (王科平) et al.: "Object tracking algorithm with fusion of multiple spatio-temporal-aware correlation filters" (多时空感知相关滤波融合的目标跟踪算法), Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报), vol. 32, no. 11, 25 September 2020, pages 1840-1852 *

Also Published As

Publication number Publication date
CN114662572B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
CN108734723B (en) Relevant filtering target tracking method based on adaptive weight joint learning
Xu et al. Light-YOLOv3: fast method for detecting green mangoes in complex scenes using picking robots
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
CN111832443B (en) Construction method and application of construction violation detection model
Liang et al. Comparison detector for cervical cell/clumps detection in the limited data scenario
CN112395442B (en) Automatic identification and content filtering method for popular pictures on mobile internet
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN113902991A (en) Twin network target tracking method based on cascade characteristic fusion
CN115375737B (en) Target tracking method and system based on adaptive time and serialized space-time characteristics
CN113192124A (en) Image target positioning method based on twin network
CN114898403A (en) Pedestrian multi-target tracking method based on Attention-JDE network
CN116091979A (en) Target tracking method based on feature fusion and channel attention
CN114333062B (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
Gao et al. Feature alignment in anchor-free object detection
Ouyang et al. An anchor-free detector with channel-based prior and bottom-enhancement for underwater object detection
CN116797799A (en) Single-target tracking method and tracking system based on channel attention and space-time perception
CN114662572B (en) High-speed twin network target tracking method based on positioning sensing
CN113379794B (en) Single-target tracking system and method based on attention-key point prediction model
CN115937654A (en) Single-target tracking method based on multi-level feature fusion
CN111489361B (en) Real-time visual target tracking method based on deep feature aggregation of twin network
Pei et al. FGO-Net: Feature and Gaussian Optimization Network for visual saliency prediction
CN113112522A (en) Twin network target tracking method based on deformable convolution and template updating
CN112419227B (en) Underwater target detection method and system based on small target search scaling technology
Dou et al. Improved Siamese classification and regression adaptive network for visual tracking
Luo et al. Infrared Road Object Detection Based on Improved YOLOv8.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240619

Address after: Room 201, Building 3, Liye Park, No. 26 Zhishi Road, Qilin Street, Jiangning District, Nanjing City, Jiangsu Province, 211135

Applicant after: Jiangsu Xinchuang Wangan Data Technology Co.,Ltd.

Country or region after: China

Address before: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Applicant before: Shenzhen Hongyue Enterprise Management Consulting Co.,Ltd.

Country or region before: China

Effective date of registration: 20240619

Address after: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Applicant after: Shenzhen Hongyue Enterprise Management Consulting Co.,Ltd.

Country or region after: China

Address before: 400065 Chongwen Road, Nanshan Street, Nanan District, Chongqing

Applicant before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS

Country or region before: China

GR01 Patent grant