CN114662572A - High-speed twin network target tracking method based on positioning perception - Google Patents

High-speed twin network target tracking method based on positioning perception

Info

Publication number
CN114662572A
CN114662572A (application CN202210220077.4A)
Authority
CN
China
Prior art keywords
feature
layer
template
phi
loss
Prior art date
Legal status
Granted
Application number
CN202210220077.4A
Other languages
Chinese (zh)
Other versions
CN114662572B (en)
Inventor
周丽芳
丁相
冷佳旭
王懿
王佩雯
罗俊
李佳其
Current Assignee
Jiangsu Xinchuang Wangan Data Technology Co ltd
Shenzhen Hongyue Enterprise Management Consulting Co ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202210220077.4A
Publication of CN114662572A
Application granted
Publication of CN114662572B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 - Retrieval using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-speed twin network target tracking method based on positioning perception, and belongs to the technical field of computer vision. The method mainly comprises the following steps: AlexNet is used as the feature extraction subnetwork to perform the feature extraction task for the template image and the search image; to enhance the representational ability of the features, the invention provides a context enhancement module that captures rich target information at both the local and global levels, together with a new feature fusion strategy that fully combines the context information of features at different scales; finally, the regression loss of the network is calculated with the Distance-IoU loss, guiding the tracker to select a more accurate bounding box. The method adds only a small number of parameters while maintaining a high tracking speed, and effectively improves the tracking performance of twin-network-based trackers in complex scenes.

Description

High-speed twin network target tracking method based on positioning perception
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a visual target tracking method.
Background
Target tracking is one of the most popular and challenging research topics in computer vision, and it has attracted the attention of researchers around the world. Many countries have invested substantial manpower, material, and financial resources in its study, and many excellent algorithms have emerged in succession. With the continuous updating of these algorithms, target tracking theory has become more and more mature, and it is now widely applied in fields such as video surveillance, intelligent human-computer interaction, autonomous driving, robotics, and modern military applications. However, given the size and position of the target in the initial frame of a video sequence, estimating the target position quickly and accurately in every subsequent frame under complex conditions (illumination change, scale change, occlusion, deformation, motion blur, out-of-plane rotation, etc.) remains a very challenging problem.
In recent years, tracking algorithms can be roughly classified into correlation-filtering-based methods and deep-learning-based methods. A correlation filtering method filters the image with a filter trained on the target image and finds the maximum value in the response map; the position of this maximum is the current position of the target. Because the method solves in the Fourier domain, it is naturally fast, but its reliance on simple features such as three-channel RGB color makes it difficult to achieve an excellent tracking effect when the target's appearance changes. In contrast, methods based on deep neural networks provide more discriminative feature representations and therefore yield more robust trackers; in particular, Siamese-network-based methods, which exploit similarity matching, have shown great potential and can achieve a balance of accuracy and beyond-real-time speed.
At present, Siamese methods that extract features with a shallow network can maintain a high tracking speed: on a device with an RTX 1080 Ti, SiamFC tracks at 86 frames per second (FPS), the AlexNet version of SiamFC++ at 160 FPS, and the AlexNet version of SiamGAT at 165 FPS. Compared with the standard real-time rate (25 FPS), these trackers maintain a high tracking speed. However, the features extracted by a shallow network are not sufficiently expressive, so the target cannot be accurately localized in complex tracking scenes and the tracking accuracy cannot be improved further. To make full use of the capability of modern deep neural networks, SiamDW combined experiments and theory to analyze in detail why deep network backbones such as VGG and ResNet, which have performed so well in fields such as object detection in recent years, do not work well in Siamese tracking, and then proposed a deeper and wider Siamese tracking framework. SiamRPN++ found through extensive experiments that, during training, if positive samples are no longer centered but the target is instead shifted around the center point with uniformly distributed sampling, then as the offset range increases, a deep network can gradually go from completely ineffective to effective. Compared with tracking methods based on shallow networks, trackers based on deep network backbones (SiamRPN++, Ocean, SiamAttn) obtain better tracking results and have proven their excellent tracking performance on many datasets. However, these trackers also have an obvious problem, namely speed: on a device with an RTX 1080 Ti, SiamRPN++ tracks at 35 FPS, Ocean at 56 FPS, and SiamAttn at 33 FPS.
As the above review shows, current Siamese methods have the following problems: 1) tracking methods based on shallow networks can maintain a high tracking speed, but because of the nature of the network, the extracted features are not strongly discriminative and the target cannot be localized well; 2) tracking methods based on deep networks can make full use of the capability of modern deep neural networks and accurately estimate the target position in every frame under various complex conditions (illumination change, scale change, occlusion, deformation, motion blur, out-of-plane rotation, etc.), but the number of parameters is huge, a large amount of computational overhead is required, and the tracking speed is very slow. To solve these problems, the invention provides a high-speed twin network target tracking method and system based on positioning perception.
After retrieval: CN111489361A, a real-time visual target tracking method based on deep feature aggregation in a twin network, comprises: step 1, constructing a ResNet22 deep twin neural network with a stride of 8 and using it to extract features; step 2, inputting the target image and the search image into the ResNet22 deep twin neural network, which generates the corresponding feature maps; and step 3, aggregating the different deep feature maps of the ResNet22 deep twin neural network and extracting multi-branch features to cooperatively infer the position of the target object, providing a more comprehensive description of the target object and thereby achieving more efficient tracking. That method can extract high-level semantic information of the target object, ensures the translation equivariance of the feature mapping, and extracts multi-branch features in a hierarchical aggregation manner to cooperatively infer the target location.
The invention with publication number CN111489361A uses the deep network ResNet22 to extract multi-layer features and then aggregates the different deep feature maps to infer the position of the target object. Although both are twin-network-based tracking methods, the present invention differs from CN111489361A in the following respects:
(1) Choice of feature extraction network: CN111489361A uses the deep network ResNet22, whereas the present invention uses the shallow network AlexNet. The present invention therefore has fewer parameters, lower computational cost, and a very high tracking speed.
(2) Manner of feature aggregation: CN111489361A aggregates features of different layers to cooperatively infer the position of the target object; the present invention, in order to enhance the expressive power of the features, uses a context enhancement module to generate position-aware features, then uses a positioning block to capture more complete target information, and finally performs feature aggregation.
(3) Manner of target prediction: CN111489361A regards the point with the largest value in the response map as the location of the predicted target, whereas the present invention divides tracking into classification and regression and introduces the Distance-IoU loss function in the regression branch, guiding the tracker to select a more accurate prediction box.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A high-speed twin network target tracking method based on positioning perception is provided. The technical scheme of the invention is as follows:
a high-speed twin network target tracking method based on positioning perception comprises the following steps:
step 1, inputting a template image and a search image into a feature extraction network based on AlexNet, and performing feature extraction to obtain features of the template image and features of the search image;
step 2, inputting the characteristics of the template image into a context enhancement module to generate position perception characteristics;
step 3, performing cross-correlation operation on the position perception characteristic of the template image and the characteristic of the search image;
step 4, inputting the feature maps after the cross-correlation operation into positioning blocks to strengthen the position information, and then performing multi-layer feature aggregation;
step 5, inputting the aggregated feature map into a prediction network, where the regression branch introduces the Distance-IoU loss to guide the tracker to select a more accurate bounding box for the tracked target.
Further, the step 1 of inputting the template image and the search image into the AlexNet-based feature extraction network and performing feature extraction to obtain the features of the template image and the features of the search image specifically comprises the following steps:
A1, cropping each frame of the video into a 127 × 127 patch as the template image z and a 255 × 255 patch as the search image x;
A2, inputting the template image z into the feature extraction network and extracting the layer-4 template feature φ_4(z) and the layer-5 template feature φ_5(z);
A3, inputting the search image x into the same feature extraction network and extracting the layer-4 search feature φ_4(x) and the layer-5 search feature φ_5(x).
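As an illustration of the cropping in step A1, the following is a minimal sketch in Python/NumPy. It assumes SiamFC-style center crops around the annotated target position with mean-padding at the frame border; the exact padding and context-margin rules are not specified in this disclosure, so those details are assumptions:

```python
import numpy as np

def center_crop(frame, cx, cy, size):
    """Crop a size x size patch centered at (cx, cy); regions falling outside
    the frame are filled by mean padding (an assumption of this sketch)."""
    pad = size // 2 + 1
    padded = np.pad(frame, ((pad, pad), (pad, pad), (0, 0)), mode="mean")
    x0 = int(round(cx)) + pad - size // 2
    y0 = int(round(cy)) + pad - size // 2
    return padded[y0:y0 + size, x0:x0 + size]

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # one video frame
z = center_crop(frame, 320, 240, 127)   # 127 x 127 template image z
x = center_crop(frame, 320, 240, 255)   # 255 x 255 search image x
```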
Further, the step 2 of inputting the features of the template image into the context enhancement module to generate position-aware features specifically comprises the following steps:
B1, inputting the layer-4 template feature φ_4(z) from step A2 into the context enhancement module to obtain the optimized template feature φ'_4(z);
B2, inputting the layer-5 template feature φ_5(z) from step A2 into the context enhancement module to obtain the optimized template feature φ'_5(z);
B3, the formula of the designed context enhancement module is:

F_output = F_input ⊗ σ(g(x) ⊕ l(x))   (1)

wherein F_output represents the optimized feature, F_input represents the feature before optimization, ⊗ represents the pixel-wise multiplication operation, σ represents the activation function, g(x) represents the global context information, l(x) represents the local context information, and ⊕ represents the pixel-wise addition operation.
Further, the step 3 of performing the cross-correlation operation on the position-aware features of the template image and the features of the search image specifically comprises the following steps:
C1, taking the optimized layer-4 template feature φ'_4(z) obtained in step B1 and performing a convolution operation on it so that it matches the size of the layer-5 template feature while the number of channels is kept, ensuring the integrity of the channel information of the template image;
C2, taking the layer-4 search feature φ_4(x) obtained in step A3 and performing a convolution operation on it so that it matches the size of the layer-5 search feature while the number of channels is kept, ensuring the integrity of the channel information of the search image;
C3, performing a cross-correlation operation on the optimized template feature φ'_4(z) and the search feature φ_4(x) to obtain the layer-4 feature map, with the formula:

F_1 = φ'_4(z) * φ_4(x)   (2)

wherein F_1 represents the feature map after the layer-4 cross-correlation, φ'_4(z) represents the optimized layer-4 template feature, φ_4(x) represents the layer-4 search feature, and * represents the cross-correlation operation;
C4, taking the optimized layer-5 template feature φ'_5(z) obtained in step B2 and performing a convolution operation on it before the cross-correlation;
C5, taking the layer-5 search feature φ_5(x) obtained in step A3 and performing a convolution operation on it before the cross-correlation;
C6, performing a cross-correlation operation on the optimized template feature φ'_5(z) and the search feature φ_5(x) to obtain the layer-5 feature map, with the formula:

F_2 = C_1(φ'_5(z)) * C_2(φ_5(x))   (3)

wherein F_2 represents the feature map after the layer-5 cross-correlation, C_1 and C_2 denote convolution operations, φ'_5(z) represents the optimized layer-5 template feature, φ_5(x) represents the layer-5 search feature, and * represents the cross-correlation operation.
Further, the step 4 of inputting the feature maps after the cross-correlation operation into positioning blocks to strengthen the position information and then performing multi-layer feature aggregation specifically comprises the following steps:
D1, taking the layer-4 feature map F_1 obtained in step C3 and inputting it into a positioning block, which down-samples while using the property of deformable convolution to strengthen the position information of the tracked target, obtaining the feature map F'_1;
D2, taking the layer-5 feature map F_2 obtained in step C6 and inputting it into a positioning block, which uses the property of deformable convolution to strengthen the position information of the tracked target, obtaining the feature map F'_2;
D3, concatenating the feature map F'_1 generated in step D1 with the feature map F'_2 generated in step D2;
D4, performing convolutional aggregation on the concatenated feature maps generated in step D3 to obtain the final feature map, which is rich in target information at different scales, with the formula:

F_final = C_3(Cat(L_1(F'_1), L_2(F'_2)))   (4)

wherein F_final represents the final feature map, C_3 represents the convolution operation, Cat represents the concatenation operation, L_1 and L_2 denote the operations of the different positioning blocks, F'_1 represents the layer-4 feature map after the positioning block, and F'_2 represents the layer-5 feature map after the positioning block.
Further, the step 5 of inputting the aggregated feature map into the prediction network, where the regression branch introduces the Distance-IoU loss to guide the tracker to select a more accurate bounding box for the tracked target, specifically comprises the following steps:
E1, according to step D4, performing convolution on the obtained feature map F_final and then performing the classification task with the conventional cross-entropy loss, where the classification loss function is:

L_c = (1/N_pos) Σ_{i=1}^{N} L_cls(p_i, p*_i)   (5)

wherein L_c represents the mean classification loss, N_pos denotes the number of positive sample pixels, L_cls(p, p*) represents the loss between the predicted value and the true value of a single pixel point, and Σ_{i=1}^{N} denotes summing the classification losses of the 1st to the N-th pixel points;
E2, according to step D4, performing convolution on the obtained feature map F_final and then performing the regression task with the Distance-IoU loss to obtain the regression loss L_r;
E3, from the classification loss L_c obtained in step E1 and the regression loss L_r obtained in step E2, calculating the final overall loss function:

L = λ_1 L_c + λ_2 L_r   (9)

wherein L represents the overall loss function, λ_1 represents the hyperparameter weighting the classification loss, and λ_2 represents the hyperparameter weighting the regression loss.
Further, the Distance-IoU loss and the regression loss function are as follows:

R_DIoU = ρ²(b, b_gt) / c²   (6)

wherein R_DIoU represents the weight factor measuring the regression direction, ρ(b, b_gt) represents the Euclidean distance between the center points of the prediction box b and the ground-truth box b_gt, and c represents the length of the diagonal of the minimum enclosing rectangle of the prediction box and the ground-truth box;

L_reg = 1 - IoU + R_DIoU   (7)

wherein L_reg represents the regression loss function; R_DIoU tending to 0 means that the prediction box is approaching the ground-truth box and the regression direction is correct; otherwise, the regression direction needs to be adjusted;

L_r = (1/N_pos) Σ_{i=1}^{N} L_reg(p_i, p*_i)   (8)

wherein L_r represents the mean regression loss, N_pos denotes the number of positive sample pixels, L_reg(p, p*) represents the loss between the predicted value and the true value of a single pixel point, and Σ_{i=1}^{N} denotes summing the regression losses of the 1st to the N-th pixel points.
The invention has the following advantages and beneficial effects:
1. Aiming at the trade-off between speed and accuracy in the field of target tracking: using a deep network can effectively improve the expressive power of the features, but it brings a huge number of parameters, high computational cost, and a very slow tracking speed. The invention therefore proposes the following idea: extract features with a shallow network to guarantee the tracking speed, while adopting a series of optimization means to compensate for the shortcomings of the shallow network at both the feature level and the loss level, thereby improving the tracking effect. Compared with current state-of-the-art trackers (SiamFC++, Ocean, SiamGAT), the invention exhibits superior tracking performance on numerous datasets while retaining a huge speed advantage (220 FPS).
2. The design of the feature aggregation network is crucial to improving tracking performance. The context enhancement module provided by the invention has the following advantages: at the local level, it can capture the features of local blocks of the tracked target, so the target can be better localized in low-resolution scenes; at the global level, the global context information can model long-range dependencies, enabling the tracker to better understand and remember the tracking scenario. By means of the context enhancement module, more effective and more discriminative location-aware features can be generated. Meanwhile, before feature aggregation, the deformable convolution in the positioning block can dynamically adjust its sampling locations according to the shape of the tracked target. After the features are aggregated, features with stronger expressive power and with context information at different scales are obtained.
3. Target tracking is divided into classification and regression tasks: classification determines positive and negative samples, and regression determines the bounding box of the target. Most current SiamRPN-based tracking methods use the Smooth L1 loss as the regression loss function, while the Distance-IoU loss was proposed in the DIoU paper as a more suitable loss function. This loss function converges faster and can guide the tracker to select a more accurate bounding box, so the invention introduces the Distance-IoU loss into the regression branch, further improving the tracking performance of the tracker.
Drawings
FIG. 1 is a general block diagram of a preferred embodiment of the method provided by the present invention;
FIG. 2 is a schematic diagram of a context enhancement module according to the present invention.
FIG. 3 is a schematic view of the positioning block structure of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in the attached figure 1, a high-speed twin network target tracking method based on positioning perception comprises the following steps:
1. As shown in FIG. 1, the template image and the search image are input into the feature extraction network to extract features:
1) each frame of the video is cropped into a 127 × 127 patch as the template image (denoted z) and a 255 × 255 patch as the search image (denoted x);
2) the template image z is input into the feature extraction network to extract the layer-4 template feature φ_4(z) and the layer-5 template feature φ_5(z);
3) the search image x is input into the same feature extraction network to extract the layer-4 search feature φ_4(x) and the layer-5 search feature φ_5(x).
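The following is a minimal PyTorch sketch of the shared feature extraction subnetwork in steps 1)-3) above, returning both the layer-4 and layer-5 features. The channel widths, strides, and use of batch normalization follow the common SiamFC-style AlexNet variant and are assumptions; the patent does not give the exact configuration:

```python
import torch
import torch.nn as nn

class AlexNetBackbone(nn.Module):
    """AlexNet-style feature extractor. Both the layer-4 and layer-5 outputs
    are returned, since the method aggregates features from both layers."""
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Conv2d(3, 96, 11, stride=2), nn.BatchNorm2d(96),
                                    nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2))
        self.layer2 = nn.Sequential(nn.Conv2d(96, 256, 5), nn.BatchNorm2d(256),
                                    nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2))
        self.layer3 = nn.Sequential(nn.Conv2d(256, 384, 3), nn.BatchNorm2d(384),
                                    nn.ReLU(inplace=True))
        self.layer4 = nn.Sequential(nn.Conv2d(384, 384, 3), nn.BatchNorm2d(384),
                                    nn.ReLU(inplace=True))
        self.layer5 = nn.Sequential(nn.Conv2d(384, 256, 3), nn.BatchNorm2d(256))

    def forward(self, img):
        f = self.layer3(self.layer2(self.layer1(img)))
        f4 = self.layer4(f)      # phi_4(.)
        f5 = self.layer5(f4)     # phi_5(.)
        return f4, f5

backbone = AlexNetBackbone()                     # one shared (twin) network
z4, z5 = backbone(torch.randn(1, 3, 127, 127))   # phi_4(z), phi_5(z)
x4, x5 = backbone(torch.randn(1, 3, 255, 255))   # phi_4(x), phi_5(x)
```

Because layer 5 applies one more 3 × 3 convolution, φ_5 is spatially smaller than φ_4, which is why the cross-correlation stage below first convolves the layer-4 features down to the layer-5 size.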
2. As shown in FIG. 1, the features of the template image are input into the context enhancement module for feature optimization. The context enhancement module contains a local context branch and a global context branch. The local context branch is composed of a convolutional layer (Conv), an activation function layer (ReLU), and a batch normalization layer (BN); the global context branch is composed of a global average pooling layer (GAP), a convolutional layer (Conv), an activation function layer (ReLU), and a batch normalization layer (BN). The module structure is shown in FIG. 2:
1) the layer-4 template feature φ_4(z) is input into the context enhancement module to obtain the optimized template feature φ'_4(z);
2) the layer-5 template feature φ_5(z) is input into the context enhancement module to obtain the optimized template feature φ'_5(z);
3) in the above steps, the formula of the designed context enhancement module is:

F_output = F_input ⊗ σ(g(x) ⊕ l(x))   (1)

wherein F_output represents the optimized feature, F_input represents the feature before optimization, ⊗ represents the pixel-wise multiplication operation, σ represents the activation function, g(x) represents the global context information, l(x) represents the local context information, and ⊕ represents the pixel-wise addition operation.
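A minimal PyTorch sketch of this module follows, with a local branch (Conv-ReLU-BN) and a global branch (GAP-Conv-ReLU-BN) combined as in eq. (1). The 1 × 1 kernel size is an assumption, as the patent does not state the kernel sizes:

```python
import torch
import torch.nn as nn

class ContextEnhancement(nn.Module):
    """Context enhancement module of FIG. 2: local and global context are
    added pixel-wise, squashed by a sigmoid (sigma), and used to re-weight
    the input feature, following eq. (1)."""
    def __init__(self, channels):
        super().__init__()
        self.local_ctx = nn.Sequential(
            nn.Conv2d(channels, channels, 1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(channels))
        self.global_ctx = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),             # GAP
            nn.Conv2d(channels, channels, 1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(channels))
        self.sigma = nn.Sigmoid()

    def forward(self, f):
        l = self.local_ctx(f)        # l(x): same spatial size as the input
        g = self.global_ctx(f)       # g(x): 1 x 1, broadcast over H x W
        return f * self.sigma(g + l) # eq. (1)

cem = ContextEnhancement(256).eval()     # eval mode: BN over a 1 x 1 map has
z5_opt = cem(torch.randn(1, 256, 6, 6))  # no batch statistics in training mode
```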
3. As shown in FIG. 1, the features of the template image are cross-correlated with the features of the search image:
1) the optimized layer-4 template feature φ'_4(z) undergoes a convolution operation so that it matches the size of the layer-5 template feature while the number of channels is kept, ensuring the integrity of the channel information of the template image;
2) the layer-4 search feature φ_4(x) undergoes a convolution operation so that it matches the size of the layer-5 search feature while the number of channels is kept, ensuring the integrity of the channel information of the search image;
3) a cross-correlation operation is performed on the optimized template feature φ'_4(z) and the search feature φ_4(x) to obtain the layer-4 feature map, with the formula:

F_1 = φ'_4(z) * φ_4(x)   (2)

wherein F_1 represents the feature map after the layer-4 cross-correlation, φ'_4(z) represents the optimized layer-4 template feature, φ_4(x) represents the layer-4 search feature, and * represents the cross-correlation operation;
4) the optimized layer-5 template feature φ'_5(z) undergoes a convolution operation before the cross-correlation, further reducing the influence of distractors;
5) the layer-5 search feature φ_5(x) undergoes a convolution operation before the cross-correlation, further reducing the influence of distractors;
6) a cross-correlation operation is performed on the optimized template feature φ'_5(z) and the search feature φ_5(x) to obtain the layer-5 feature map, with the formula:

F_2 = C_1(φ'_5(z)) * C_2(φ_5(x))   (3)

wherein F_2 represents the feature map after the layer-5 cross-correlation, C_1 and C_2 denote convolution operations, φ'_5(z) represents the optimized layer-5 template feature, φ_5(x) represents the layer-5 search feature, and * represents the cross-correlation operation.
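A sketch of the cross-correlation operator * in eqs. (2)-(3) follows. The template feature is used as a convolution kernel slid over the search feature; performing it depth-wise (channel by channel, as in SiamRPN++) is an assumption, since the patent only writes *:

```python
import torch
import torch.nn.functional as F

def xcorr(z_feat, x_feat):
    """Depth-wise cross-correlation: each channel of the template feature is
    correlated with the matching channel of the search feature."""
    b, c, hz, wz = z_feat.shape
    x = x_feat.reshape(1, b * c, x_feat.size(2), x_feat.size(3))
    kernel = z_feat.reshape(b * c, 1, hz, wz)
    out = F.conv2d(x, kernel, groups=b * c)
    return out.reshape(b, c, out.size(2), out.size(3))

# eq. (2): the layer-4 template feature has already been convolved down to the
# layer-5 template size (step 1 above), e.g. 6 x 6 against a 22 x 22 search map
F1 = xcorr(torch.randn(1, 256, 6, 6), torch.randn(1, 256, 22, 22))
print(F1.shape)   # torch.Size([1, 256, 17, 17])
```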
4. As shown in FIG. 1, the feature maps after the cross-correlation operation are input into positioning blocks to strengthen the position information, and the multi-layer features are then aggregated. The positioning block is composed of a deformable convolution, group normalization, an activation function, and a convolution. As shown in FIG. 3, the deformable convolution adds offsets to the convolution sampling grid and can extract more accurate features than a standard convolution:
1) the layer-4 feature map F_1 is input into a positioning block, which down-samples while using the property of deformable convolution to strengthen the position information of the tracked target, obtaining the feature map F'_1;
2) the layer-5 feature map F_2 is input into a positioning block, which uses the property of deformable convolution to strengthen the position information of the tracked target, obtaining the feature map F'_2;
3) the feature map F'_1 generated in step 1) is concatenated with the feature map F'_2 generated in step 2);
4) convolutional aggregation is performed on the concatenated feature maps generated in step 3) to obtain the final feature map, which is rich in target information at different scales, with the formula:

F_final = C_3(Cat(L_1(F'_1), L_2(F'_2)))   (4)

wherein F_final represents the final feature map, C_3 represents the convolution operation, Cat represents the concatenation operation, L_1 and L_2 denote the operations of the different positioning blocks, F'_1 represents the layer-4 feature map after the positioning block, and F'_2 represents the layer-5 feature map after the positioning block.
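A PyTorch sketch of the positioning block of FIG. 3 and the aggregation of eq. (4) follows, using torchvision's deformable convolution. Predicting the deformable-convolution offsets with a small parallel convolution is the usual construction and is an assumption here, as are the channel widths and the group count:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class PositioningBlock(nn.Module):
    """Deformable convolution -> group normalization -> activation -> conv;
    a stride of 2 would give the down-sampling used in the layer-4 branch."""
    def __init__(self, channels, stride=1):
        super().__init__()
        # 2 offsets (dx, dy) per position of the 3 x 3 kernel = 18 channels
        self.offset = nn.Conv2d(channels, 18, 3, stride=stride, padding=1)
        self.dcn = DeformConv2d(channels, channels, 3, stride=stride, padding=1)
        self.gn = nn.GroupNorm(32, channels)
        self.act = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f):
        f = self.dcn(f, self.offset(f))   # offsets bend the sampling grid
        return self.conv(self.act(self.gn(f)))

# eq. (4): F_final = C3(Cat(L1(F1'), L2(F2')))
L1, L2 = PositioningBlock(256), PositioningBlock(256)
C3 = nn.Conv2d(512, 256, 1)
F1, F2 = torch.randn(1, 256, 17, 17), torch.randn(1, 256, 17, 17)
F_final = C3(torch.cat([L1(F1), L2(F2)], dim=1))
```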
5. As shown in FIG. 1, the aggregated feature map is input into the prediction network, where the regression branch introduces the Distance-IoU loss to guide the tracker to select a more accurate bounding box for the tracked target:
1) the obtained feature map F_final undergoes convolution, and the classification task is then performed with the conventional cross-entropy loss, where the classification loss function is:

L_c = (1/N_pos) Σ_{i=1}^{N} L_cls(p_i, p*_i)   (5)

wherein L_c represents the mean classification loss, N_pos denotes the number of positive sample pixels, L_cls(p, p*) represents the loss between the predicted value and the true value of a single pixel point, and Σ_{i=1}^{N} denotes summing the classification losses of the 1st to the N-th pixel points.
2) the obtained feature map F_final undergoes convolution, and the regression task is then performed with the Distance-IoU loss, where the Distance-IoU loss and the regression loss function are as follows:

R_DIoU = ρ²(b, b_gt) / c²   (6)

wherein R_DIoU represents the weight factor measuring the regression direction, ρ(b, b_gt) represents the Euclidean distance between the center points of the prediction box b and the ground-truth box b_gt, and c represents the length of the diagonal of the minimum enclosing rectangle of the prediction box and the ground-truth box;

L_reg = 1 - IoU + R_DIoU   (7)

wherein L_reg represents the regression loss function; R_DIoU tending to 0 means that the prediction box is approaching the ground-truth box and the regression direction is correct; otherwise, the regression direction needs to be adjusted;

L_r = (1/N_pos) Σ_{i=1}^{N} L_reg(p_i, p*_i)   (8)

wherein L_r represents the mean regression loss, N_pos denotes the number of positive sample pixels, L_reg(p, p*) represents the loss between the predicted value and the true value of a single pixel point, and Σ_{i=1}^{N} denotes summing the regression losses of the 1st to the N-th pixel points.
3) from the classification loss L_c obtained in step 1) and the regression loss L_r obtained in step 2), the final overall loss function is calculated:

L = λ_1 L_c + λ_2 L_r   (9)

wherein L represents the overall loss function, λ_1 represents the hyperparameter weighting the classification loss, and λ_2 represents the hyperparameter weighting the regression loss.
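A sketch of the loss computation of eqs. (5)-(9) follows, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the gathering of the N_pos positive pixels is omitted, and the λ_1, λ_2 values shown are assumptions:

```python
import torch
import torch.nn.functional as F

def diou_loss(pred, target, eps=1e-7):
    """Distance-IoU regression loss, eqs. (6)-(7): L_reg = 1 - IoU + R_DIoU,
    with R_DIoU = rho^2(b, b_gt) / c^2. Boxes are (N, 4) tensors."""
    lt = torch.max(pred[:, :2], target[:, :2])   # intersection corners
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
    iou = inter / (area_p + area_t - inter + eps)
    # rho^2: squared distance between the two box centers
    rho2 = (((pred[:, :2] + pred[:, 2:]) -
             (target[:, :2] + target[:, 2:])) / 2).pow(2).sum(dim=1)
    # c^2: squared diagonal of the smallest rectangle enclosing both boxes
    enc = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    c2 = enc.pow(2).sum(dim=1) + eps
    return 1 - iou + rho2 / c2                   # per-sample L_reg

pred_box = torch.tensor([[10., 10., 50., 60.]])
gt_box = torch.tensor([[12., 8., 48., 62.]])
cls_logit, cls_label = torch.tensor([0.8]), torch.tensor([1.0])
L_c = F.binary_cross_entropy_with_logits(cls_logit, cls_label)  # eq. (5)
L_r = diou_loss(pred_box, gt_box).mean()                        # eq. (8)
lambda1, lambda2 = 1.0, 1.0    # hyperparameter values are assumptions
L = lambda1 * L_c + lambda2 * L_r                               # eq. (9)
```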
The invention aims to guarantee the tracking speed by extracting features with a shallow network, while adopting a series of optimization means, including using the context enhancement module to generate more effective and more discriminative location-aware features. Before feature aggregation, the deformable convolution in the positioning block can dynamically adjust its sampling locations according to the shape of the tracked target, capturing more complete target features. After the features are aggregated, features with context information at different scales and stronger expressive power are therefore obtained. This design compensates for the shortcomings of the shallow network at both the feature level and the loss level, thereby improving the tracking effect. Finally, the Distance-IoU loss is introduced into the regression branch to accelerate convergence and further guide the tracker to select a more accurate bounding box.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (7)

1. A high-speed twin network target tracking method based on positioning perception is characterized by comprising the following steps:
step 1, inputting a template image and a search image into a feature extraction network based on AlexNet, and performing feature extraction to obtain features of the template image and features of the search image;
step 2, inputting the characteristics of the template image into a context enhancement module to generate position perception characteristics;
step 3, performing cross-correlation operation on the position perception characteristic of the template image and the characteristic of the search image;
step 4, inputting the feature maps after the cross-correlation operation into positioning blocks to strengthen the position information, and then performing multi-layer feature aggregation;
step 5, inputting the aggregated feature map into a prediction network, where the regression branch introduces the Distance-IoU loss to guide the tracker to select a more accurate bounding box for the tracked target.
2. The high-speed twin network target tracking method based on positioning perception as claimed in claim 1, wherein the step 1 of inputting the template image and the search image into the AlexNet-based feature extraction network and performing feature extraction to obtain the features of the template image and the features of the search image specifically comprises the following steps:
A1, cropping each frame of the video into a 127 × 127 patch as the template image z and a 255 × 255 patch as the search image x;
A2, inputting the template image z into the feature extraction network and extracting the layer-4 template feature φ_4(z) and the layer-5 template feature φ_5(z);
A3, inputting the search image x into the same feature extraction network and extracting the layer-4 search feature φ_4(x) and the layer-5 search feature φ_5(x).
3. The high-speed twin network target tracking method based on positioning perception as claimed in claim 2, wherein the step 2 of inputting the features of the template image into the context enhancement module to generate position-aware features specifically comprises the following steps:
B1, inputting the layer-4 template feature φ_4(z) from step A2 into the context enhancement module to obtain the optimized template feature φ'_4(z);
B2, inputting the layer-5 template feature φ_5(z) from step A2 into the context enhancement module to obtain the optimized template feature φ'_5(z);
B3, the formula of the designed context enhancement module is:

F_output = F_input ⊗ σ(g(x) ⊕ l(x))   (1)

wherein F_output represents the optimized feature, F_input represents the feature before optimization, ⊗ represents the pixel-wise multiplication operation, σ represents the activation function, g(x) represents the global context information, l(x) represents the local context information, and ⊕ represents the pixel-wise addition operation.
4. The high-speed twin network target tracking method based on positioning perception as claimed in claim 3, wherein the step 3 of performing the cross-correlation operation on the features of the template image and the features of the search image specifically comprises the following steps:
C1, taking the optimized layer-4 template feature φ'_4(z) obtained in step B1 and performing a convolution operation on it so that it matches the size of the layer-5 template feature while the number of channels is kept, ensuring the integrity of the channel information of the template image;
C2, taking the layer-4 search feature φ_4(x) obtained in step A3 and performing a convolution operation on it so that it matches the size of the layer-5 search feature while the number of channels is kept, ensuring the integrity of the channel information of the search image;
C3, performing a cross-correlation operation on the optimized template feature φ'_4(z) and the search feature φ_4(x) to obtain the layer-4 feature map, with the formula:

F_1 = φ'_4(z) * φ_4(x)   (2)

wherein F_1 represents the feature map after the layer-4 cross-correlation, φ'_4(z) represents the optimized layer-4 template feature, φ_4(x) represents the layer-4 search feature, and * represents the cross-correlation operation;
C4, taking the optimized layer-5 template feature φ'_5(z) obtained in step B2 and performing a convolution operation on it before the cross-correlation;
C5, taking the layer-5 search feature φ_5(x) obtained in step A3 and performing a convolution operation on it before the cross-correlation;
C6, performing a cross-correlation operation on the optimized template feature φ'_5(z) and the search feature φ_5(x) to obtain the layer-5 feature map, with the formula:

F_2 = C_1(φ'_5(z)) * C_2(φ_5(x))   (3)

wherein F_2 represents the feature map after the layer-5 cross-correlation, C_1 and C_2 denote convolution operations, φ'_5(z) represents the optimized layer-5 template feature, φ_5(x) represents the layer-5 search feature, and * represents the cross-correlation operation.
5. The high-speed twin network target tracking method based on positioning perception as claimed in claim 4, wherein the step 4 of inputting the feature maps after the cross-correlation operation into positioning blocks to strengthen the position information and then performing multi-layer feature aggregation specifically comprises the following steps:
D1, taking the layer-4 feature map F_1 obtained in step C3 and inputting it into a positioning block, which down-samples while using the property of deformable convolution to strengthen the position information of the tracked target, obtaining the feature map F'_1;
D2, taking the layer-5 feature map F_2 obtained in step C6 and inputting it into a positioning block, which uses the property of deformable convolution to strengthen the position information of the tracked target, obtaining the feature map F'_2;
D3, concatenating the feature map F'_1 generated in step D1 with the feature map F'_2 generated in step D2;
D4, performing convolutional aggregation on the concatenated feature maps generated in step D3 to obtain the final feature map, which is rich in target information at different scales, with the formula:

F_final = C_3(Cat(L_1(F'_1), L_2(F'_2)))   (4)

wherein F_final represents the final feature map, C_3 represents the convolution operation, Cat represents the concatenation operation, L_1 and L_2 denote the operations of the different positioning blocks, F'_1 represents the layer-4 feature map after the positioning block, and F'_2 represents the layer-5 feature map after the positioning block.
6. The high-speed twin network target tracking method based on positioning perception as claimed in claim 5, wherein the step 5 of inputting the aggregated feature map into the prediction network, where the regression branch introduces the Distance-IoU loss to guide the tracker to select a more accurate bounding box for the tracked target, specifically comprises the following steps:
E1, according to step D4, performing convolution on the obtained feature map F_final and then performing the classification task with the conventional cross-entropy loss, where the classification loss function is:

L_c = (1/N_pos) Σ_{i=1}^{N} L_cls(p_i, p*_i)   (5)

wherein L_c represents the mean classification loss, N_pos denotes the number of positive sample pixels, L_cls(p, p*) represents the loss between the predicted value and the true value of a single pixel point, and Σ_{i=1}^{N} denotes summing the classification losses of the 1st to the N-th pixel points;
E2, according to step D4, performing convolution on the obtained feature map F_final and then performing the regression task with the Distance-IoU loss to obtain the regression loss L_r;
E3, from the classification loss L_c obtained in step E1 and the regression loss L_r obtained in step E2, calculating the final overall loss function:

L = λ_1 L_c + λ_2 L_r   (9)

wherein L represents the overall loss function, λ_1 represents the hyperparameter weighting the classification loss, and λ_2 represents the hyperparameter weighting the regression loss.
7. The high-speed twin network target tracking method based on positioning perception as claimed in claim 6, wherein the Distance-IoU loss and the regression loss function are as follows:

R_DIoU = ρ²(b, b_gt) / c²   (6)

wherein R_DIoU represents the weight factor measuring the regression direction, ρ(b, b_gt) represents the Euclidean distance between the center points of the prediction box b and the ground-truth box b_gt, and c represents the length of the diagonal of the minimum enclosing rectangle of the prediction box and the ground-truth box;

L_reg = 1 - IoU + R_DIoU   (7)

wherein L_reg represents the regression loss function; R_DIoU tending to 0 means that the prediction box is approaching the ground-truth box and the regression direction is correct; otherwise, the regression direction needs to be adjusted;

L_r = (1/N_pos) Σ_{i=1}^{N} L_reg(p_i, p*_i)   (8)

wherein L_r represents the mean regression loss, N_pos denotes the number of positive sample pixels, L_reg(p, p*) represents the loss between the predicted value and the true value of a single pixel point, and Σ_{i=1}^{N} denotes summing the regression losses of the 1st to the N-th pixel points.
CN202210220077.4A 2022-03-08 2022-03-08 High-speed twin network target tracking method based on positioning sensing Active CN114662572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210220077.4A CN114662572B (en) 2022-03-08 2022-03-08 High-speed twin network target tracking method based on positioning sensing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210220077.4A CN114662572B (en) 2022-03-08 2022-03-08 High-speed twin network target tracking method based on positioning sensing

Publications (2)

Publication Number Publication Date
CN114662572A (en) 2022-06-24
CN114662572B CN114662572B (en) 2024-07-19

Family

ID=82030258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210220077.4A Active CN114662572B (en) 2022-03-08 2022-03-08 High-speed twin network target tracking method based on positioning sensing

Country Status (1)

Country Link
CN (1) CN114662572B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150117760A1 (en) * 2013-10-30 2015-04-30 Nec Laboratories America, Inc. Regionlets with Shift Invariant Neural Patterns for Object Detection
US20200327679A1 (en) * 2019-04-12 2020-10-15 Beijing Moviebook Science and Technology Co., Ltd. Visual target tracking method and apparatus based on deeply and densely connected neural network
CN112509008A (en) * 2020-12-15 2021-03-16 重庆邮电大学 Target tracking method based on intersection-to-parallel ratio guided twin network
CN113807188A (en) * 2021-08-20 2021-12-17 北京工业大学 Unmanned aerial vehicle target tracking method based on anchor frame matching and Simese network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIFANG ZHOU et al.: "A location-aware siamese network for high-speed visual tracking", Applied Intelligence, vol. 53, no. 4, 10 June 2022, page 4431 *
WANG KEPING (王科平) et al.: "Object tracking algorithm with fusion of multiple spatio-temporal-aware correlation filters" (多时空感知相关滤波融合的目标跟踪算法), Journal of Computer-Aided Design & Computer Graphics (计算机辅助设计与图形学学报), vol. 32, no. 11, 25 September 2020, pages 1840-1852 *

Also Published As

Publication number Publication date
CN114662572B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
CN108734723B (en) Relevant filtering target tracking method based on adaptive weight joint learning
Xu et al. Light-YOLOv3: fast method for detecting green mangoes in complex scenes using picking robots
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
CN111832443B (en) Construction method and application of construction violation detection model
Liang et al. Comparison detector for cervical cell/clumps detection in the limited data scenario
CN112395442B (en) Automatic identification and content filtering method for popular pictures on mobile internet
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN113902991A (en) Twin network target tracking method based on cascade characteristic fusion
CN115375737B (en) Target tracking method and system based on adaptive time and serialized space-time characteristics
CN113192124A (en) Image target positioning method based on twin network
CN114898403A (en) Pedestrian multi-target tracking method based on Attention-JDE network
CN116091979A (en) Target tracking method based on feature fusion and channel attention
CN114333062B (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
Gao et al. Feature alignment in anchor-free object detection
Ouyang et al. An anchor-free detector with channel-based prior and bottom-enhancement for underwater object detection
CN116797799A (en) Single-target tracking method and tracking system based on channel attention and space-time perception
CN114662572B (en) High-speed twin network target tracking method based on positioning sensing
CN113379794B (en) Single-target tracking system and method based on attention-key point prediction model
CN115937654A (en) Single-target tracking method based on multi-level feature fusion
CN111489361B (en) Real-time visual target tracking method based on deep feature aggregation of twin network
Pei et al. FGO-Net: Feature and Gaussian Optimization Network for visual saliency prediction
CN113112522A (en) Twin network target tracking method based on deformable convolution and template updating
CN112419227B (en) Underwater target detection method and system based on small target search scaling technology
Dou et al. Improved Siamese classification and regression adaptive network for visual tracking
Luo et al. Infrared Road Object Detection Based on Improved YOLOv8.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240619

Address after: Room 201, Building 3, Liye Park, No. 26 Zhishi Road, Qilin Street, Jiangning District, Nanjing City, Jiangsu Province, 211135

Applicant after: Jiangsu Xinchuang Wangan Data Technology Co.,Ltd.

Country or region after: China

Address before: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Applicant before: Shenzhen Hongyue Enterprise Management Consulting Co.,Ltd.

Country or region before: China

Effective date of registration: 20240619

Address after: 518000 1104, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Applicant after: Shenzhen Hongyue Enterprise Management Consulting Co.,Ltd.

Country or region after: China

Address before: 400065 Chongwen Road, Nanshan Street, Nanan District, Chongqing

Applicant before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS

Country or region before: China

GR01 Patent grant