CN116630369A - Unmanned aerial vehicle target tracking method based on space-time memory network - Google Patents
- Publication number: CN116630369A (application CN202310156686.2A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06T7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06N3/045 — Neural network architectures; combinations of networks
- G06V10/62 — Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V20/17 — Terrestrial scenes taken from planes or by drones
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V20/48 — Matching video sequences
- G06T2207/10016 — Video; image sequence
- G06T2207/10032 — Satellite or aerial image; remote sensing
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/30241 — Trajectory
- G06V2201/07 — Target detection
- Y02T10/40 — Engine management systems
Abstract
The invention discloses an unmanned aerial vehicle (UAV) target tracking method based on a space-time memory network, which comprises the following steps: step 1, sampling images from a data set and performing image enhancement to form a training data set; step 2, creating a UAV target tracking network model based on a space-time memory network; step 3, pre-training the model with a mask-reconstruction task; step 4, retraining the model pre-trained in step 3; and step 5, inputting the video to be tracked into the model trained in step 4 to obtain the tracking result. The method alleviates the problems caused by deformation of the tracked target and improves the tracking success rate and accuracy.
Description
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle target tracking, and particularly relates to an unmanned aerial vehicle target tracking method based on a space-time memory network.
Background
Visual object tracking is a highly active topic in the field of computer vision: given a video or image sequence, the position and size of an object are initialized in the first frame, and the object is then tracked from frame to frame. With the development of deep learning, target tracking has been widely applied in fields such as environmental monitoring, disaster detection, and intelligent surveillance. Unmanned aerial vehicles, as an emerging remote-sensing platform, have attracted growing attention owing to their small size, simple operation, and ability to adapt to a variety of environments and weather conditions. Against this background, vision-based UAV target tracking is increasingly favored and applied in the civil field.
Compared with ground-based target tracking, UAV target tracking faces additional difficulties: the shooting angle is higher and the captured video covers a wide area with abundant background information, so the target carries less feature information and is easily disturbed by surrounding objects and the background. In addition, camera shake and changes in flight speed during UAV flight readily cause complex situations such as target deformation and occlusion. UAV target tracking is therefore much harder than ground-based target tracking.
With the development of deep learning, the target tracking field has advanced remarkably and a group of outstanding algorithms has emerged, among which Siamese-network-based trackers are favored by many researchers. The fully convolutional Siamese network (SiamFC) uses a twin network to directly learn a matching function between the target template and candidate targets, compares the similarity between the template and the search region with this function, and finally obtains a score map of the search region that yields the position of the tracked target, effectively converting the tracking problem into a similarity-matching problem. To further improve performance, subsequent algorithms added feature fusion and attention mechanisms on this basis. However, SiamFC only obtains the score of the search region through the similarity function, i.e. only the position of the target and not its scale, so scale information is lost. The SiamRPN algorithm introduces a region proposal network (RPN) on top of the Siamese network, converting the tracking of each frame into a local detection task; through prior anchor-box settings it adapts to scale changes and achieves higher precision and speed, yet the probability of losing the target remains high when distractors surround the target or the target is occluded. In recent years, Transformers have been brought into computer vision models owing to their great success in tasks such as natural language processing and speech recognition, but their application in computer vision is still limited, mainly in combination with convolutional networks, replacing certain modules of the convolutional network while keeping the overall structure unchanged.
The above analysis shows that the existing methods have the following shortcomings:
(1) Tracking algorithms with simple model structures track specific targets well but lack robustness: they perform poorly under severe background interference and similar problems in target tracking, and their generalization is low.
(2) Most existing tracking algorithms use the first frame as the template frame, so the template features are single; problems such as target deformation are not handled well, tracking failure occurs easily, and the tracking success rate and precision are low.
Disclosure of Invention
The invention provides an unmanned aerial vehicle target tracking method based on a space-time memory network, which solves the problem caused by deformation of an unmanned aerial vehicle target and improves the tracking success rate and accuracy.
The technical scheme adopted by the invention is that the unmanned aerial vehicle target tracking method based on the space-time memory network comprises the following steps:
step 1, sampling images from a data set and performing image enhancement to form a training data set;
step 2, creating an unmanned aerial vehicle target tracking network model based on a space-time memory network;
step 3, pre-training based on mask reconstruction is carried out on the unmanned aerial vehicle target tracking network model based on the space-time memory network;
step 4, retraining the pre-trained unmanned aerial vehicle target tracking network model based on the space-time memory network in the step 3;
and 5, inputting the video to be tracked into the unmanned aerial vehicle target tracking network model based on the space-time memory network trained in the step 4, and obtaining a tracking result.
The present invention is also characterized in that,
the dataset in step 1 was TrackingNet, laSOT, GOT k or COCO; the images in step 1 are three frames of images sampled from the same video in the video dataset TrackingNet, laSOT or GOT10k, or the original image in the COCO dataset is shifted or dithered in brightness to generate two images, and the original image is added to obtain three frames of images.
The specific method for creating the UAV target tracking network model based on the space-time memory network is as follows: a memory-branch encoder, a query-branch encoder, a feature fusion module, a decoder, and a bounding-box prediction head are constructed with the Vision Transformer; the outputs of the memory-branch encoder and the query-branch encoder are both connected to the input of the feature fusion module, the output of the feature fusion module is connected to the input of the decoder, and the output of the decoder is connected to the bounding-box prediction head.
The bounding-box prediction head comprises a classification head and a regression head, each constructed from 3 sequentially connected convolution blocks.
Step 3 is specifically implemented according to the following steps:
step 3.1, taking one image in every group of three images in the training data set as the search image and the other two as template images; cropping the three images at a certain scale centered on the target, the template images being cropped to x×x and the search image to 2x×2x;
step 3.2, dividing the template images and the search image into non-overlapping image blocks of 16×16 pixels, obtaining the template image-block sequences S_T1 and S_T2 and the search image-block sequence S_S;
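As a non-limiting illustration (not part of the claimed method), the block division of step 3.2 can be sketched in NumPy; the function name `patchify` and the concrete image sizes are assumptions chosen for the example:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an HxWxC image into a sequence of non-overlapping
    patch x patch blocks, one flattened block per row.
    H and W are assumed to be multiples of `patch`."""
    h, w, c = image.shape
    blocks = image.reshape(h // patch, patch, w // patch, patch, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)        # (h/p, w/p, p, p, c)
    return blocks.reshape(-1, patch * patch * c)    # sequence of flattened blocks

# Illustrative sizes only: a 128x128 template gives an 8x8 = 64-block sequence,
# and a 256x256 (2x larger) search image gives 256 blocks.
template = np.zeros((128, 128, 3))
seq_t = patchify(template)      # S_T1 analogue
search = np.zeros((256, 256, 3))
seq_s = patchify(search)        # S_S analogue
```

With a 16×16 block and 3 channels, each row of the resulting sequence holds 16·16·3 = 768 pixel values.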
Step 3.3, for the search image block sequence S S Performing random masking, removing the masked image blocks from the sequence to obtain a masked image block sequence S' S Mask marking mask token Then S 'is carried out' S And S is T1 Are spliced together to obtain an image block sequence S' x ;
Step 3.4, splicing the image block sequence S' x Sending into query branch encoder, S T2 The person sending memory branch encoder constructs the relation between image blocks through the self-attention mechanism in Vision Transformer to obtain the encoded image block sequence S query_encode And S is mem_encode Wherein the attention calculation formula is as follows:
wherein Q, K, V is a matrix obtained by linear transformation of the input, d k Is the dimension of matrix Q, K, softmax () represents the normalized exponential function, and Attention () is the Attention calculation formula function;
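The attention computation above can be exercised with a small NumPy sketch (illustrative only; the patent's encoders are full Vision Transformer blocks, and the projection matrices here are random stand-ins for the learned linear transformations):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable normalized exponential.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.standard_normal((n, d))                  # toy image-block sequence
Wq = rng.standard_normal((d, d))                 # stand-in linear transforms
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))
out = attention(X @ Wq, X @ Wk, X @ Wv)          # same length as the input sequence
```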
step 3.5, symmetrically to the encoder, constructing a decoder with the Vision Transformer; splitting the encoded image-block sequence S_query_encode into the search image-block sequence S_S_encode and the template image-block sequence S_T_encode; splicing S_S_encode and the mask tokens into the query coding sequence S_query, where each mask token is a shared, learnable vector representing a missing image block to be predicted; splicing the template image-block sequence S_T_encode and S_mem_encode into the memory coding sequence S_memory; fusing the query coding sequence S_query with the memory coding sequence S_memory to obtain the fused feature S_feature, where the feature fusion is computed as

$$S_{feature}=w\,S_{memory},\qquad w=\mathrm{softmax}\!\left(\frac{S_{query}\,(S_{memory})^{T}}{s}\right)$$

where (S_memory)^T is the transpose of S_memory and w is the similarity weight between S_query and S_memory, whose elements are

$$w_{ij}=\frac{\exp\!\left(S_{query}^{\,i}\cdot S_{memory}^{\,j}/s\right)}{\sum_{j}\exp\!\left(S_{query}^{\,i}\cdot S_{memory}^{\,j}/s\right)}$$

where i is the index over the pixels of S_query, j is the index over the pixels of S_memory, · denotes the vector dot product, and s is a scale factor;
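The feature fusion of step 3.5 amounts to a cross-attention read-out from the memory sequence. A minimal NumPy sketch, with toy sizes and the scale factor s chosen as sqrt of the feature dimension (an assumption; the patent only states that s is a scale factor):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(s_query, s_memory, s=1.0):
    """Memory read-out: similarity weight between every query vector i
    and memory vector j, then a weighted sum over the memory.
    w = softmax(S_query @ S_memory^T / s);  S_feature = w @ S_memory."""
    w = softmax(s_query @ s_memory.T / s, axis=-1)
    return w @ s_memory, w

rng = np.random.default_rng(0)
s_query = rng.standard_normal((20, 32))    # 20 query vectors, dim 32
s_memory = rng.standard_normal((50, 32))   # 50 memory vectors, dim 32
s_feature, w = fuse(s_query, s_memory, s=np.sqrt(32))
```

Each row of w sums to 1, so every fused query vector is a convex combination of memory vectors.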
step 3.6, sending the fused feature S_feature into the decoder; the decoder performs mask reconstruction from its input, recovering the input image by predicting the pixel values of every masked image block; each element of the decoder output is the pixel-value vector of one image block, the number of output channels equals the number of pixel values in one image block, and the output is then reshaped into the reconstructed image;
and 3.7, sending the reconstructed image into the bounding-box prediction head and performing classification and regression to obtain the predicted bounding box; computing the mean-square-error losses between the reconstructed image and the original image and between the predicted bounding box and the ground-truth bounding box; back-propagating the losses and updating the model weights, which leads the model to learn a strong representation capability and improves its generalization performance.
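The mean-square-error loss of step 3.7 can be sketched as below. Restricting the reconstruction loss to the blocks that were actually masked is a common choice in masked-image modelling and is shown here as an assumption; the patent itself only states that the loss is computed between the reconstructed and original images:

```python
import numpy as np

def reconstruction_loss(pred_blocks, target_blocks, masked_idx):
    """Mean square error over the image blocks hidden by the mask
    (an assumed restriction; computing it over all blocks is the
    other plausible reading of step 3.7)."""
    diff = pred_blocks[masked_idx] - target_blocks[masked_idx]
    return float(np.mean(diff ** 2))

target = np.ones((64, 768))          # toy original image blocks
pred = np.zeros((64, 768))           # toy reconstruction
masked_idx = np.arange(48)           # e.g. 75% of 64 blocks were masked
loss = reconstruction_loss(pred, target, masked_idx)
```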
Step 4 is specifically implemented according to the following steps:
step 4.1, taking one image in every group of three images in the training data set as the search image and the other two as template images; cropping the images at a certain scale centered on the target, the template images being cropped to x×x and the search image to 2x×2x; dividing them into non-overlapping image blocks of equal size to obtain the template image-block sequences S_T1 and S_T2 and the search image-block sequence S_S;
Step 4.2, the template image block sequence S T1 And searching for a sequence of image blocks S S Are spliced together to obtain an image block sequence S x ;
Step 4.3, the spliced image block sequence S x Sending into query branch encoder, S T2 The person sending memory branch encoder constructs the relation between image blocks through the self-attention mechanism in Vision Transformer to obtain the encoded image block sequence S query_encode And S is mem_encode 。
Step 4.4, the coded image block sequence S query_encode Segmentation into a sequence of search image blocks S S_encode And template image block sequence S T_encode Template image block sequence S T_encode And S is mem_encode Splicing to form memory coding sequence S menory Using a sequence of search image blocks S S_encode And memory coding sequence S menory And carrying out feature fusion, sending the fused features to a decoder, and finally sending the decoded features to a boundary frame prediction head to obtain final target position prediction.
Step 5 is specifically implemented according to the following steps:
step 5.1, cropping an x×x image around the position of the given target in the first frame of the video sequence as the template image, cutting the template image into fixed-size image blocks to obtain the image-block sequence S_T, and sending S_T into the memory-branch encoder to obtain S_mem_encode;
Step 5.2, reading the next frame of image and cutting out an image with the size of 2x 2x by taking the previous frame of prediction target as the center as a search image, cutting out the search image into image blocks with fixed size to obtain an image block sequence S S Will S T And S is S Splicing together while embedding position codes to represent the relative positions of image blocks to obtain an input sequence S inpute Will S inpute Sending the image block sequence into a trained query branch encoder to divide the encoded image block sequence into a search image block sequence S S_encode And template image block sequence S T_encode Template image block sequence S T_encode And S is mem_encode Splicing to form memory coding sequence S memory Using a sequence of search image blocks S S_encode And memory coding sequence S memory Feature fusion is carried out, and the fused features are sent to a decoder;
step 5.3, sending the decoded features into the bounding-box prediction head to obtain the target position predicted for the current frame;
step 5.4, cropping the current frame to x×x centered on the predicted target position, cutting it into fixed-size image blocks, and sending them into the memory-branch encoder to obtain S_mem; splicing S_mem onto S_mem_encode;
And 5.5, reading the next frame and repeating steps 5.2 to 5.4 until the end of the video sequence, obtaining the tracking result for the input video.
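The online tracking loop of steps 5.1–5.5 can be summarized in a control-flow skeleton. All callables below (`crop`, the encoders, the prediction head) are toy stand-ins introduced for the sketch, not the patent's trained components; only the growing-memory structure is the point:

```python
import numpy as np

def track(frames, init_box, encode_mem, encode_query, predict_box):
    """Skeleton of the on-line tracking loop: the memory sequence starts
    from the first-frame template (step 5.1) and grows with an encoded
    crop of every tracked frame (step 5.4)."""
    box = init_box
    memory = [encode_mem(crop(frames[0], box))]            # step 5.1
    results = [box]
    for frame in frames[1:]:
        search = crop(frame, box, scale=2)                 # step 5.2
        box = predict_box(encode_query(search),
                          np.concatenate(memory))          # step 5.3
        results.append(box)
        memory.append(encode_mem(crop(frame, box)))        # step 5.4
    return results                                         # step 5.5

# Toy stand-ins, just to exercise the control flow.
def crop(frame, box, scale=1):  return frame
def enc(x):                     return np.zeros((4, 8))
def head(q, mem):               return (0, 0, 10, 10)

frames = [np.zeros((32, 32)) for _ in range(5)]
boxes = track(frames, (0, 0, 10, 10), enc, enc, head)
```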
The beneficial effects of the invention are as follows:
(1) Aiming at problems such as severe background interference and easy blurring of the target in UAV video, which require the tracking model to generalize well in order to predict the target, a pre-training method based on mask reconstruction is proposed: the Vision Transformer reconstructs the masked image and thereby acquires a stronger target representation capability, and training then proceeds through a target detection task, effectively improving the generalization of the tracking model.
(2) Aiming at the problems that the target easily deforms and is occluded in UAV target tracking, and that using only the initial frame as the template provides little target feature information, a memory network is proposed to store the feature information of historical frames; using this information yields a more complete feature description of the tracked target, improving tracking accuracy and precision.
Drawings
FIG. 1 is a general framework of the method of the present invention;
FIG. 2 is a flow chart of a video sequence tracking process in the method of the present invention;
FIG. 3 is a tracking effect diagram of the 100th frame of the video in embodiment 1 of the present invention;
FIG. 4 is a tracking effect diagram of the 400th frame of the video in embodiment 1 of the present invention;
FIG. 5 is a graph of the tracking precision of the method of the present invention under different position-error thresholds on the general UAV data set UAV123;
FIG. 6 is a graph of the tracking success rate of the method of the present invention under different overlap-rate thresholds on the general UAV data set UAV123.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The UAV target tracking method based on a space-time memory network disclosed by the invention, as shown in FIG. 1, comprises three parts: mask pre-training, network fine-tuning, and online tracking. The specific steps are as follows:
step 1, three frames of images are sampled from the data sets TrackingNet, laSOT, GOT k and the COCO, wherein the three frames of images are directly sampled from a video at intervals of a certain frame number for the video data sets TrackingNet, laSOT and the GOT10k, the COCO data set is added to solve the problem of insufficient sample types in the video data sets, two frames of images are additionally generated by adopting translation or brightness dithering for an original image in the COCO data set, three frames of images are obtained by adding the original image, and finally, data enhancement operations of translation, clipping and gray scale change are carried out on all the images to form a training data set.
Step 2, building the UAV target tracking network model based on the space-time memory network, specifically: a memory-branch encoder, a query-branch encoder, a feature fusion module, a decoder, and a bounding-box prediction head are constructed with the Vision Transformer; the outputs of the memory-branch encoder and the query-branch encoder are both connected to the input of the feature fusion module, the output of the feature fusion module is connected to the input of the decoder, and the output of the decoder is connected to the bounding-box prediction head. The bounding-box prediction head comprises a classification head and a regression head, each constructed from 3 sequentially connected convolution blocks.
Step 3, pre-training based on mask reconstruction is carried out on the UAV target tracking network model based on the space-time memory network using the training data set, through a mask-reconstruction task followed by a target detection task, obtaining a pre-trained model with improved representation capability. The pre-training method based on mask reconstruction is as follows:
step 3.1, taking one image in every three images in the training data set as a search image and the other two images as template images; cutting three images by a certain scale by taking a target as a center, wherein the template image is cut into x-x size, and then the search image is cut into 2 x-2 x size;
step 3.2, dividing the template image and the search image into non-overlapping image blocks with the pixel size of 16 x 16 respectively to obtain a template image block sequence S T1 、S T2 And searching for a sequence of image blocks S S ;
Step 3.3, performing random masking on the search image-block sequence S_S and removing the masked image blocks from the sequence, obtaining the masked image-block sequence S'_S and the mask tokens; then splicing S'_S and S_T1 together to obtain the image-block sequence S'_x;
Step 3.4, sending the spliced image-block sequence S'_x into the query-branch encoder and S_T2 into the memory-branch encoder; the self-attention mechanism in the Vision Transformer builds the relations between image blocks, giving the encoded image-block sequences S_query_encode and S_mem_encode, where the attention is computed as

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where Q, K, and V are matrices obtained by linear transformations of the input, d_k is the dimension of the matrices Q and K, softmax() denotes the normalized exponential function, and Attention() is the attention function.
Step 3.5, symmetrically to the encoder, constructing a decoder with the Vision Transformer; splitting the encoded image-block sequence S_query_encode into the search image-block sequence S_S_encode and the template image-block sequence S_T_encode; splicing S_S_encode and the mask tokens into the query coding sequence S_query, where each mask token is a shared, learnable vector representing a missing image block to be predicted; splicing the template image-block sequence S_T_encode and S_mem_encode into the memory coding sequence S_memory; fusing the query coding sequence S_query with the memory coding sequence S_memory to obtain the fused feature S_feature, where the feature fusion is computed as

$$S_{feature}=w\,S_{memory},\qquad w=\mathrm{softmax}\!\left(\frac{S_{query}\,(S_{memory})^{T}}{s}\right)$$

where (S_memory)^T is the transpose of S_memory and w is the similarity weight between S_query and S_memory, whose elements are

$$w_{ij}=\frac{\exp\!\left(S_{query}^{\,i}\cdot S_{memory}^{\,j}/s\right)}{\sum_{j}\exp\!\left(S_{query}^{\,i}\cdot S_{memory}^{\,j}/s\right)}$$

where i is the index over the pixels of S_query, j is the index over the pixels of S_memory, · denotes the vector dot product, and s is a scale factor.
Step 3.6, sending the fused feature S_feature into the decoder; the decoder performs mask reconstruction from its input, recovering the input image by predicting the pixel values of every masked image block; each element of the decoder output is the pixel-value vector of one image block, the number of output channels equals the number of pixel values in one image block, and the output is then reshaped into the reconstructed image;
step 3.7, sending the reconstructed image into a boundary frame pre-measuring head, respectively classifying and regressing to obtain a predicted boundary frame, calculating the mean square error loss between the reconstructed image and the original image and between the predicted boundary frame and the real boundary frame, carrying out back propagation on the loss, updating model weight, leading the model to learn strong characterization capability, and improving generalization performance;
step 4: retraining the pre-trained unmanned aerial vehicle target tracking network model based on the space-time memory network to obtain a trained unmanned aerial vehicle target tracking network model based on the space-time memory network, enabling the model to be more focused on learning target characteristics by utilizing a target detection task to ensure that the model can be better applied to the unmanned aerial vehicle target tracking task, wherein the retraining process is as follows:
step 4.1, taking one image in every three images in the training data set as a search image and the other two images as template images; cutting two images by a certain scale by taking a target as a center, wherein if the template image is cut into x size, the search image is cut into 2x 2x size; dividing the image blocks into non-overlapping image blocks with the same size to obtain a template image block sequence S T1 、S T2 And searching for a sequence of image blocks S S ;
Step 4.2, splicing the template image-block sequence S_T1 and the search image-block sequence S_S together to obtain the image-block sequence S_x;
Step 4.3, sending the spliced image-block sequence S_x into the query-branch encoder and S_T2 into the memory-branch encoder; the self-attention mechanism in the Vision Transformer builds the relations between image blocks, giving the encoded image-block sequences S_query_encode and S_mem_encode.
Step 4.4, the encoded image block sequence S_query_encode is split into the search image block sequence S_S_encode and the template image block sequence S_T_encode; S_T_encode and S_mem_encode are spliced into the memory coding sequence S_memory. Feature fusion is performed with S_S_encode and S_memory, the fused features are sent to the decoder, and the decoded features are finally sent to the bounding box prediction head to obtain the final target position prediction.
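The fusion of search tokens with the memory sequence in step 4.4 can be sketched as a cross-attention step; the shapes, the softmax normalization, and the default scale factor are assumptions for illustration:

```python
import numpy as np

def fuse(S_query, S_memory, s=None):
    """Cross-attention feature fusion: each query token attends over the
    memory sequence; `s` is the scale factor (assumed sqrt(dim) here)."""
    if s is None:
        s = np.sqrt(S_query.shape[-1])
    w = S_query @ S_memory.T / s                 # pairwise dot products
    w -= w.max(axis=-1, keepdims=True)
    w = np.exp(w)
    w /= w.sum(axis=-1, keepdims=True)           # similarity weights
    return w @ S_memory                          # fused features

rng = np.random.default_rng(1)
S_S_encode = rng.standard_normal((256, 8))       # search tokens
S_memory = rng.standard_normal((128, 8))         # template + memory tokens
S_feature = fuse(S_S_encode, S_memory)
print(S_feature.shape)  # (256, 8)
```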
Step 5: inputting the video to be tracked into the unmanned aerial vehicle target tracking network model based on the space-time memory network trained in the step 4, and obtaining a tracking result. As shown in fig. 2, the specific procedure is as follows:
Step 5.1, an image of size x × x around the given target position in the first frame of the video sequence is cropped as the template image; the template image is cut into fixed-size image blocks to obtain the image block sequence S_T, and S_T is sent to the memory branch encoder to obtain S_mem_encode;
Step 5.2, the next frame is read and an image of size 2x × 2x centered on the target predicted in the previous frame is cropped as the search image; the search image is cut into fixed-size image blocks to obtain the image block sequence S_S. S_T and S_S are spliced together, with position codes embedded to represent the relative positions of the image blocks, giving the input sequence S_input. S_input is sent to the trained query branch encoder, and the encoded image block sequence is split into the search image block sequence S_S_encode and the template image block sequence S_T_encode; S_T_encode and S_mem_encode are spliced into the memory coding sequence S_memory. Feature fusion is performed with S_S_encode and S_memory, and the fused features are sent to the decoder.
Step 5.3, the decoded features are sent to the bounding box prediction head to obtain the target position predicted for the current frame;
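A simplified stand-in for the bounding box prediction head of step 5.3: the real head uses classification and regression branches built from convolution blocks; here the peak of a score map plus four offsets is read out instead, purely to show the classify-then-regress readout:

```python
import numpy as np

def predict_box(cls_map, reg_map):
    """Pick the highest-scoring location in the classification map and
    read the (l, t, r, b) offsets from the regression map there."""
    y, x = np.unravel_index(np.argmax(cls_map), cls_map.shape)
    l, t, r, b = reg_map[:, y, x]
    return np.array([x - l, y - t, x + r, y + b])

cls_map = np.zeros((16, 16)); cls_map[8, 10] = 1.0   # peak at (y=8, x=10)
reg_map = np.full((4, 16, 16), 2.0)                  # uniform offsets of 2
print(predict_box(cls_map, reg_map))  # [ 8.  6. 12. 10.]
```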
Step 5.4, the current frame image is cropped to x × x centered on the predicted target position, cut into fixed-size image blocks, and sent to the memory branch encoder to obtain S_mem; S_mem is spliced onto S_mem_encode.
Step 5.5, the next frame is read and steps 5.2 to 5.4 are repeated until the end of the video sequence, giving the tracking result for the input video.
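The steps 5.1 to 5.5 above can be sketched as a control-flow skeleton. Every network stage is a trivial stub with a placeholder name (`crop`, `encode_template`, `encode_query`, `fuse`, `decode`, `predict_box` are not the patent's API); only the memory-update loop structure is taken from the text:

```python
# Minimal stubs so the control flow runs; the real model replaces these.
def crop(frame, box, scale=1): return frame
def encode_template(x): return x
def encode_query(x): return x
def fuse(q, memory): return q
def decode(x): return x
def predict_box(x): return x

def track(frames, first_box):
    """Steps 5.1-5.5: build the memory from frame 1, then for each new
    frame encode the search region, fuse with the memory, predict a box,
    and append the new template encoding to the memory."""
    memory = [encode_template(crop(frames[0], first_box))]    # step 5.1
    box, results = first_box, []
    for frame in frames[1:]:
        search = crop(frame, box, scale=2)                    # step 5.2
        fused = fuse(encode_query(search), memory)            # fusion
        box = predict_box(decode(fused))                      # step 5.3
        memory.append(encode_template(crop(frame, box)))      # step 5.4
        results.append(box)                                   # step 5.5 loops
    return results

print(len(track([0, 1, 2, 3], (0, 0, 10, 10))))  # 3 boxes for 3 new frames
```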
Example 1
In this embodiment, a video from UAV123 is used as the video to be tracked and steps 1 to 5 are executed, where the template images in steps 3.1 and 4.1 are cropped to 128 × 128 and the search image is cropped to 256 × 256; the image block size is 16 × 16.
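The sizes in this example imply the following token counts (arithmetic only; the sequence names are illustrative):

```python
# Token counts implied by the Example 1 sizes (128/256 crops, 16x16 blocks).
patch = 16
template_tokens = (128 // patch) ** 2        # 8 x 8 = 64 blocks per template
search_tokens = (256 // patch) ** 2          # 16 x 16 = 256 blocks
query_len = template_tokens + search_tokens  # spliced sequence S_x
print(template_tokens, search_tokens, query_len)  # 64 256 320
```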
The results are shown in figs. 3-4, which give the visualized tracking results for frames 100 and 400 of the video, yielding the position of the target in the image.
Figs. 5-6 show the tracking precision at different position-error thresholds and the tracking success rate at different overlap-rate thresholds, respectively. The average tracking success rate of this embodiment reaches 0.57, and the tracking precision reaches 0.742 at an error threshold of 20 pixels. The following tables give the tracking success rate and precision of this implementation under the different environmental attributes of the UAV123 data set, and a comparison of tracking success rate and precision between this implementation and several tracking algorithms on the UAV123 data set.
Table 1 Tracking success rate and accuracy of the present implementation under different environments

Environmental attribute | Tracking success rate | Tracking accuracy |
---|---|---|
Target partially occluded | 0.563 | 0.780 |
Target out of view | 0.592 | 0.790 |
Target scale change | 0.616 | 0.822 |
Illumination variation | 0.606 | 0.827 |
Rapid movement | 0.576 | 0.768 |
Viewing angle change | 0.657 | 0.856 |
Background interference | 0.646 | 0.860 |
Small target | 0.598 | 0.829 |
Table 2 Comparison of the present implementation with other tracking algorithms

Tracking algorithm | Tracking success rate | Tracking accuracy |
---|---|---|
SiamFC | 0.498 | 0.726 |
MDNet | 0.528 | 0.735 |
ECO | 0.525 | 0.741 |
SiamRPN | 0.527 | 0.748 |
Tracking algorithm of the invention | 0.627 | 0.835 |
As can be seen from Table 1, the invention achieves good tracking success rate and accuracy in most environments, effectively handles problems common in unmanned aerial vehicle videos such as severe background interference and target blur, and improves model generalization.
As can be seen from Table 2, the average tracking success rate of the invention on UAV123 reaches 0.627, the average tracking precision is 0.835, and the tracking speed reaches 45 FPS.
Aimed at targets in unmanned aerial vehicle videos that are easily occluded, deformed, or interfered with by similar objects, the unmanned aerial vehicle target tracking method based on the space-time memory network obtains more robust feature information through the pre-trained network model, reduces the influence of complex backgrounds on the tracking algorithm, and improves model generalization; the designed memory network stores the target feature information of historical frames, alleviating the problems caused by target deformation and improving the model's tracking success rate and accuracy.
Claims (7)
1. The unmanned aerial vehicle target tracking method based on the space-time memory network is characterized by comprising the following steps of:
step 1, sampling images from a data set and performing image enhancement to form a training data set;
step 2, creating an unmanned aerial vehicle target tracking network model based on a space-time memory network;
step 3, pre-training based on mask reconstruction is carried out on the unmanned aerial vehicle target tracking network model based on the space-time memory network;
step 4, retraining the pre-trained unmanned aerial vehicle target tracking network model based on the space-time memory network in the step 3;
and step 5, inputting the video to be tracked into the unmanned aerial vehicle target tracking network model based on the space-time memory network trained in step 4, and obtaining a tracking result.
2. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 1, wherein the data set in step 1 is TrackingNet, LaSOT, GOT-10k or COCO; the images in step 1 are three frames sampled from the same video of the video data set TrackingNet, LaSOT or GOT-10k, or two images generated by shifting or brightness-jittering an original image of the COCO data set, plus the original image, giving three frames.
3. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 1, wherein the specific method for creating the unmanned aerial vehicle target tracking network model based on the space-time memory network in step 2 is as follows: the memory branch encoder, the query branch encoder, the feature fusion module, the decoder and the bounding box prediction head are constructed using the Vision Transformer; the outputs of the memory branch encoder and the query branch encoder are both connected to the input of the feature fusion module, the output of the feature fusion module is connected to the input of the decoder, and the output of the decoder is connected to the bounding box prediction head.
4. The unmanned aerial vehicle target tracking method based on the space-time memory network of claim 3, wherein the bounding box prediction head comprises a classification head and a regression head connected in sequence, the classification head and the regression head each being constructed from 3 convolution blocks.
5. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 3 or 4, wherein the step 3 is specifically implemented according to the following steps:
step 3.1, taking one image of every three images in the training data set as the search image and the other two as template images; cropping the three images at a certain scale centered on the target, wherein the template images are cropped to x × x and the search image is cropped to 2x × 2x;
step 3.2, dividing the template images and the search image into non-overlapping image blocks of 16 × 16 pixels to obtain the template image block sequences S_T1 and S_T2 and the search image block sequence S_S;
step 3.3, applying random masking to the search image block sequence S_S and removing the masked image blocks from the sequence to obtain the masked image block sequence S'_S and the mask tokens mask_token, then splicing S'_S and S_T1 together to obtain the image block sequence S'_x;
step 3.4, sending the spliced image block sequence S'_x into the query branch encoder and S_T2 into the memory branch encoder, constructing the relation between image blocks through the self-attention mechanism in the Vision Transformer to obtain the encoded image block sequences S_query_encode and S_mem_encode, wherein the attention calculation formula is as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

wherein Q, K, V are matrices obtained by linear transformation of the input, d_k is the dimension of the matrices Q and K, softmax() represents the normalized exponential function, and Attention() is the attention calculation function;
step 3.5, similarly to the encoder, constructing a symmetric decoder using the Vision Transformer; segmenting the encoded image block sequence S_query_encode into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing S_S_encode with the mask tokens mask_token to form the query coding sequence S_query, wherein each mask token is a shared, learnable vector representing a missing image block to be predicted; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion with the query coding sequence S_query and the memory coding sequence S_memory to obtain the fused features S_feature, wherein the feature fusion calculation formula is as follows:

S_feature = w · S_memory

wherein (S_memory)^T is the transpose of S_memory and w is the similarity weight between S_query and S_memory; w is calculated as follows:

w = softmax( (S_query · (S_memory)^T) / s )

wherein the entry w_ij compares the i-th pixel of S_query with the j-th pixel of S_memory, · denotes the vector dot product, and s is a scale factor;
step 3.6, sending the fused features S_feature to the decoder, the decoder carrying out mask reconstruction according to the input information, reconstructing the input image by predicting the pixel values of each image block occluded by the mask, each element output by the decoder representing the pixel-value vector of one image block, the number of output channels being equal to the number of pixel values in one image block, and then reshaping the output into the reconstructed image;
and step 3.7, sending the reconstructed image into the bounding box prediction head, performing classification and regression respectively to obtain the predicted bounding box, calculating the mean square error losses between the reconstructed image and the original image and between the predicted bounding box and the ground-truth bounding box, back-propagating the loss, and updating the model weights.
6. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 5, wherein the step 4 is specifically implemented according to the following steps:
step 4.1, taking one image of every three images in the training data set as the search image and the other two as template images; cropping the images centered on the target, wherein if the template images are cropped to x × x, the search image is cropped to 2x × 2x; dividing the images into non-overlapping image blocks of equal size to obtain the template image block sequences S_T1 and S_T2 and the search image block sequence S_S;
step 4.2, splicing the template image block sequence S_T1 and the search image block sequence S_S together to obtain the image block sequence S_x;
step 4.3, sending the spliced image block sequence S_x into the query branch encoder and S_T2 into the memory branch encoder, constructing the relation between image blocks through the self-attention mechanism in the Vision Transformer to obtain the encoded image block sequences S_query_encode and S_mem_encode;
step 4.4, segmenting the encoded image block sequence S_query_encode into the search image block sequence S_S_encode and the template image block sequence S_T_encode, splicing S_T_encode and S_mem_encode to form the memory coding sequence S_memory, performing feature fusion with the search image block sequence S_S_encode and the memory coding sequence S_memory, sending the fused features to the decoder, and finally sending the decoded features to the bounding box prediction head to obtain the final target position prediction.
7. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 6, wherein the step 5 is specifically implemented according to the following steps:
step 5.1, cropping an image of size x × x around the given target position in the first frame of the video sequence as the template image, cutting the template image into fixed-size image blocks to obtain the image block sequence S_T, and sending S_T to the memory branch encoder to obtain S_mem_encode;
step 5.2, reading the next frame and cropping an image of size 2x × 2x centered on the target predicted in the previous frame as the search image, cutting the search image into fixed-size image blocks to obtain the image block sequence S_S, splicing S_T and S_S together while embedding position codes representing the relative positions of the image blocks to obtain the input sequence S_input, sending S_input into the trained query branch encoder, segmenting the encoded image block sequence into the search image block sequence S_S_encode and the template image block sequence S_T_encode, splicing S_T_encode and S_mem_encode to form the memory coding sequence S_memory, performing feature fusion with S_S_encode and S_memory, and sending the fused features to the decoder;
step 5.3, sending the decoded features into the bounding box prediction head to obtain the target position predicted for the current frame;
step 5.4, cropping the current frame image to x × x centered on the predicted target position, cutting it into fixed-size image blocks, sending them into the memory branch encoder to obtain S_mem, and splicing S_mem onto S_mem_encode;
and step 5.5, reading the next frame and repeating steps 5.2 to 5.4 until the end of the video sequence, obtaining the tracking result of the input video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310156686.2A CN116630369A (en) | 2023-02-23 | 2023-02-23 | Unmanned aerial vehicle target tracking method based on space-time memory network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116630369A true CN116630369A (en) | 2023-08-22 |
Family
ID=87615821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310156686.2A Pending CN116630369A (en) | 2023-02-23 | 2023-02-23 | Unmanned aerial vehicle target tracking method based on space-time memory network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116630369A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117333514A (en) * | 2023-12-01 | 2024-01-02 | 科大讯飞股份有限公司 | Single-target video tracking method, device, storage medium and equipment |
CN117333514B (en) * | 2023-12-01 | 2024-04-16 | 科大讯飞股份有限公司 | Single-target video tracking method, device, storage medium and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11176381B2 (en) | Video object segmentation by reference-guided mask propagation | |
CN109711463B (en) | Attention-based important object detection method | |
CN111079532A (en) | Video content description method based on text self-encoder | |
CN115393396B (en) | Unmanned aerial vehicle target tracking method based on mask pre-training | |
Chen et al. | Log hyperbolic cosine loss improves variational auto-encoder | |
CN110334589B (en) | High-time-sequence 3D neural network action identification method based on hole convolution | |
CN113657388B (en) | Image semantic segmentation method for super-resolution reconstruction of fused image | |
CN111428718A (en) | Natural scene text recognition method based on image enhancement | |
CN113344932B (en) | Semi-supervised single-target video segmentation method | |
CN111476133B (en) | Unmanned driving-oriented foreground and background codec network target extraction method | |
CN111738054A (en) | Behavior anomaly detection method based on space-time self-encoder network and space-time CNN | |
CN113139468A (en) | Video abstract generation method fusing local target features and global features | |
CN115690152A (en) | Target tracking method based on attention mechanism | |
CN116630369A (en) | Unmanned aerial vehicle target tracking method based on space-time memory network | |
CN113869234B (en) | Facial expression recognition method, device, equipment and storage medium | |
CN115393690A (en) | Light neural network air-to-ground observation multi-target identification method | |
CN116863384A (en) | CNN-Transfomer-based self-supervision video segmentation method and system | |
CN115713546A (en) | Lightweight target tracking algorithm for mobile terminal equipment | |
CN112418229A (en) | Unmanned ship marine scene image real-time segmentation method based on deep learning | |
CN115690917B (en) | Pedestrian action identification method based on intelligent attention of appearance and motion | |
CN116543338A (en) | Student classroom behavior detection method based on gaze target estimation | |
CN116597503A (en) | Classroom behavior detection method based on space-time characteristics | |
CN117196963A (en) | Point cloud denoising method based on noise reduction self-encoder | |
CN116503314A (en) | Quality inspection system and method for door manufacturing | |
CN115830505A (en) | Video target segmentation method and system for removing background interference through semi-supervised learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||