CN116630369A - Unmanned aerial vehicle target tracking method based on space-time memory network - Google Patents

Unmanned aerial vehicle target tracking method based on space-time memory network

Info

Publication number
CN116630369A
CN116630369A (Application CN202310156686.2A)
Authority
CN
China
Prior art keywords
image
sequence
encode
memory
unmanned aerial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310156686.2A
Other languages
Chinese (zh)
Inventor
梁继民
牟剑
郑洋
卫晨
郭开泰
胡海虹
王梓宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202310156686.2A priority Critical patent/CN116630369A/en
Publication of CN116630369A publication Critical patent/CN116630369A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/62 - Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G06V20/17 - Terrestrial scenes taken from planes or by drones
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/48 - Matching video sequences
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10032 - Satellite or aerial image; Remote sensing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30241 - Trajectory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Remote Sensing (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unmanned aerial vehicle target tracking method based on a space-time memory network, which comprises the following steps: step 1, sampling images from a data set and performing image enhancement to form a training data set; step 2, creating an unmanned aerial vehicle target tracking network model based on the space-time memory network; step 3, performing pre-training based on mask reconstruction on the model; step 4, retraining the model pre-trained in step 3; and step 5, inputting the video to be tracked into the model trained in step 4 to obtain the tracking result. The method alleviates the problems caused by deformation of the unmanned aerial vehicle target and improves the tracking success rate and accuracy.

Description

Unmanned aerial vehicle target tracking method based on space-time memory network
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle target tracking, and particularly relates to an unmanned aerial vehicle target tracking method based on a space-time memory network.
Background
Visual object tracking is a highly active topic in the field of computer vision. Given a video or image sequence, its aim is to initialize the position and size of an object in the initial frame and then track that object from frame to frame. With the development of deep learning, target tracking has been widely applied to fields such as environmental monitoring, disaster detection and intelligent surveillance. Unmanned aerial vehicles, as an emerging remote sensing platform, have attracted increasing attention owing to their small size, simple operation and ability to adapt to various environments and weather conditions. Against the background of the trend toward intelligent systems, target tracking based on unmanned aerial vehicle vision is increasingly favoured and is finding more and more civil applications.
Compared with ground target tracking, unmanned aerial vehicle target tracking faces additional difficulties: the shooting viewpoint is higher and the captured video covers a wide area containing much background information, so the target carries less feature information and is easily disturbed by surrounding objects and the background. In addition, camera shake and changes in flight speed occur readily during flight, causing complex situations such as target deformation and occlusion. Unmanned aerial vehicle target tracking is therefore much more difficult than ground target tracking.
With the development of deep learning, the field of target tracking has advanced remarkably and a group of outstanding algorithms has emerged, among which tracking algorithms based on the Siamese network are favoured by many researchers. The fully convolutional Siamese network algorithm (SiamFC) uses a twin network to directly learn a matching function between the target template and candidate targets, compares the similarity between the target template and the search region with this matching function, and finally obtains a score map of the search region from which the position of the tracked target is derived, effectively converting the target tracking problem into a similarity matching problem. To further improve model performance, subsequent algorithms added feature fusion and attention mechanisms on this basis. However, such algorithms only obtain the score of the search region through the similarity function and hence the position of the target; they do not estimate the target scale, so scale information is lost. The SiamRPN algorithm introduces a region proposal network (RPN) on top of the Siamese network, converts the tracking in each frame into a local detection task, and adapts to scale changes through prior anchor box settings, achieving higher precision and speed; nevertheless, when distractors surround the target or the target is occluded, the probability of losing the track remains high. In recent years, Transformers have been applied to computer vision models owing to their great success in tasks such as natural language processing and speech recognition, but their application in computer vision is still limited and mainly consists of combining them with convolutional networks to replace certain modules while keeping the overall structure unchanged.
From the above analysis, existing methods have the following shortcomings:
(1) Tracking algorithms with a simple model structure achieve good results on specific targets but lack strong robustness; they perform poorly under severe background interference and other difficulties in target tracking, and their generalization is low.
(2) Most existing tracking algorithms use only the first frame as the template frame. The template features are therefore single, problems such as target deformation are not handled well, tracking failure occurs easily, and the tracking success rate and precision are low.
Disclosure of Invention
The invention provides an unmanned aerial vehicle target tracking method based on a space-time memory network, which alleviates the problems caused by deformation of the unmanned aerial vehicle target and improves the tracking success rate and accuracy.
The technical scheme adopted by the invention is that the unmanned aerial vehicle target tracking method based on the space-time memory network comprises the following steps:
step 1, sampling images from a data set and performing image enhancement to form a training data set;
step 2, creating an unmanned aerial vehicle target tracking network model based on a space-time memory network;
step 3, pre-training based on mask reconstruction is carried out on the unmanned aerial vehicle target tracking network model based on the space-time memory network;
step 4, retraining the pre-trained unmanned aerial vehicle target tracking network model based on the space-time memory network in the step 3;
and 5, inputting the video to be tracked into the unmanned aerial vehicle target tracking network model based on the space-time memory network trained in the step 4, and obtaining a tracking result.
The present invention is also characterized in that,
the dataset in step 1 was TrackingNet, laSOT, GOT k or COCO; the images in step 1 are three frames of images sampled from the same video in the video dataset TrackingNet, laSOT or GOT10k, or the original image in the COCO dataset is shifted or dithered in brightness to generate two images, and the original image is added to obtain three frames of images.
The specific method for creating the unmanned aerial vehicle target tracking network model based on the space-time memory network comprises the following steps: the memory branch encoder, the query branch encoder, the feature fusion module, the decoder and the boundary frame prediction head are constructed by utilizing Vision Transformer, the output of the memory branch encoder and the output of the query branch encoder are both connected with the input of the feature fusion module, the output of the feature fusion module is connected with the input of the decoder, and the output end of the decoder is connected with the boundary frame prediction head.
The boundary frame prediction head comprises a classification head and a regression head which are sequentially connected, wherein the classification head and the regression head are constructed by 3 convolution blocks.
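By way of non-limiting illustration, the bounding box prediction head can be sketched in Python (PyTorch) as follows; the channel widths, the use of batch normalization and ReLU inside each convolution block, and the parallel arrangement of the two heads are assumptions made only for this sketch.

```python
# Hedged sketch of a bounding box prediction head with a classification head and a
# regression head, each stacking three convolutions (assumed composition and widths).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # one assumed convolution block: 3x3 conv + batch norm + ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class BBoxPredictionHead(nn.Module):
    def __init__(self, in_ch=256, mid_ch=128):
        super().__init__()
        # classification head: foreground score map (1 channel)
        self.cls_head = nn.Sequential(
            conv_block(in_ch, mid_ch), conv_block(mid_ch, mid_ch),
            nn.Conv2d(mid_ch, 1, kernel_size=3, padding=1),
        )
        # regression head: box offsets (4 channels)
        self.reg_head = nn.Sequential(
            conv_block(in_ch, mid_ch), conv_block(mid_ch, mid_ch),
            nn.Conv2d(mid_ch, 4, kernel_size=3, padding=1),
        )

    def forward(self, feat):  # feat: (B, C, H, W) decoded feature map
        return self.cls_head(feat), self.reg_head(feat)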
Step 3 is specifically implemented according to the following steps:
Step 3.1, taking one image in every group of three images in the training data set as the search image and the other two as template images; cropping the three images at a certain scale around the target, the template images being cropped to x × x and the search image to 2x × 2x;
Step 3.2, dividing the template images and the search image into non-overlapping image blocks of 16 × 16 pixels, obtaining the template image block sequences S_T1 and S_T2 and the search image block sequence S_S;
Step 3.3, performing random masking on the search image block sequence S_S and removing the masked image blocks from the sequence, obtaining the masked image block sequence S'_S and the mask tokens (mask_token); then splicing S'_S and S_T1 together to obtain the image block sequence S'_x;
Step 3.4, sending the spliced image block sequence S'_x into the query branch encoder and S_T2 into the memory branch encoder; building the relations between image blocks through the self-attention mechanism in the Vision Transformer, obtaining the encoded image block sequences S_query_encode and S_mem_encode, wherein the attention calculation formula is as follows (a code sketch of this computation is given after step 3.7):
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
wherein Q, K, V are matrices obtained by linear transformation of the input, d_k is the dimension of the matrices Q and K, softmax() represents the normalized exponential function, and Attention() is the attention function;
Step 3.5, similarly to the encoder, constructing a symmetric decoder using the Vision Transformer; splitting the encoded image block sequence S_query_encode into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing S_S_encode and the mask tokens (mask_token) together to compose the query coding sequence S_query, wherein each mask token is a shared, learnable vector representing a missing image block to be predicted; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the query coding sequence S_query and the memory coding sequence S_memory to obtain the fused feature S_feature (a code sketch of this fusion is given after step 3.7), wherein the feature fusion calculation formula is as follows:
S_feature = w · S_memory, with w = softmax(S_query (S_memory)^T / s)
wherein (S_memory)^T is the transpose of S_memory and w is the similarity weight between S_query and S_memory, calculated element-wise as:
w_ij = exp(S_query^i · S_memory^j / s) / Σ_j exp(S_query^i · S_memory^j / s)
wherein i is the index of a pixel of S_query, j is the index of a pixel of S_memory, · denotes the vector dot product, and s is a scale factor;
Step 3.6, sending the fused feature S_feature into the decoder, which performs mask reconstruction according to the input information and reconstructs the input image by predicting the pixel values of each image block occluded by the mask; each element output by the decoder represents the pixel value vector of one image block, the number of output channels equals the number of pixel values in one image block, and the output is then reshaped into the reconstructed image;
Step 3.7, sending the reconstructed image into the bounding box prediction head, performing classification and regression respectively to obtain a predicted bounding box, calculating the mean square error losses between the reconstructed image and the original image and between the predicted bounding box and the ground-truth bounding box, back-propagating the losses and updating the model weights, so that the model learns a strong representation capability and its generalization performance is improved.
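As referenced in step 3.4 above, a minimal Python (PyTorch) sketch of the attention computation is given below; the batch-first tensor layout is an assumption.

```python
# Sketch of Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q, K, V: (batch, num_tokens, d_k), obtained by linear transformations of the input
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # pairwise token similarities
    weights = F.softmax(scores, dim=-1)             # normalized exponential function
    return weights @ V                              # weighted sum of the values
```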
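Likewise, as referenced in step 3.5 above, a minimal sketch of the feature fusion that reads the memory coding sequence with similarity weights is given below; the use of √C as the default scale factor s is an assumption of this sketch.

```python
import torch

def memory_fusion(S_query, S_memory, s=None):
    # S_query:  (num_query_tokens, C)  query coding sequence
    # S_memory: (num_memory_tokens, C) memory coding sequence
    if s is None:
        s = S_query.size(-1) ** 0.5        # assumed default scale factor
    sim = S_query @ S_memory.T / s         # dot products S_query^i . S_memory^j / s
    w = torch.softmax(sim, dim=-1)         # similarity weights w_ij
    return w @ S_memory                    # fused feature S_feature
```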
Step 4 is specifically implemented according to the following steps:
Step 4.1, taking one image in every group of three images in the training data set as the search image and the other two as template images; cropping the images at a certain scale around the target, the template images being cropped to x × x and the search image to 2x × 2x; dividing them into non-overlapping image blocks of equal size, obtaining the template image block sequences S_T1 and S_T2 and the search image block sequence S_S;
Step 4.2, splicing the template image block sequence S_T1 and the search image block sequence S_S together to obtain the image block sequence S_x;
Step 4.3, sending the spliced image block sequence S_x into the query branch encoder and S_T2 into the memory branch encoder; building the relations between image blocks through the self-attention mechanism in the Vision Transformer, obtaining the encoded image block sequences S_query_encode and S_mem_encode;
Step 4.4, splitting the encoded image block sequence S_query_encode into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the search image block sequence S_S_encode and the memory coding sequence S_memory, sending the fused features to the decoder, and finally sending the decoded features to the bounding box prediction head to obtain the final target position prediction.
Step 5 is specifically implemented according to the following steps:
Step 5.1, cropping an image of size x × x around the position of the given target in the first frame of the video sequence as the template image, cutting the template image into image blocks of fixed size to obtain the image block sequence S_T, and sending S_T to the memory branch encoder to obtain S_mem_encode;
Step 5.2, reading the next frame and cropping an image of size 2x × 2x centred on the target predicted in the previous frame as the search image, cutting the search image into image blocks of fixed size to obtain the image block sequence S_S; splicing S_T and S_S together while embedding position codes to represent the relative positions of the image blocks, obtaining the input sequence S_input; sending S_input into the trained query branch encoder and splitting the encoded image block sequence into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the search image block sequence S_S_encode and the memory coding sequence S_memory, and sending the fused features to the decoder;
Step 5.3, sending the decoded features into the bounding box prediction head to obtain the target position predicted for the current frame;
Step 5.4, cropping the current frame to size x × x centred on the predicted target position, cutting it into image blocks of fixed size and sending them into the memory branch encoder to obtain S_mem; splicing S_mem onto S_mem_encode;
Step 5.5, reading the next frame and repeating steps 5.2 to 5.4 until the whole video sequence is finished, obtaining the tracking result for the input video.
The beneficial effects of the invention are as follows:
(1) In view of the problems that targets in unmanned aerial vehicle video suffer severe background interference and are prone to blurring, so that predicting the target requires a tracking model with good generalization performance, a pre-training method based on mask reconstruction is provided: the image mask is reconstructed with the Vision Transformer so that a stronger target representation capability is obtained, training is then performed through a target detection task, and the generalization of the tracking model is effectively improved.
(2) In view of the problems that the target is prone to deformation and occlusion in unmanned aerial vehicle target tracking, and that using only the initial frame as the template provides little target feature information, a memory network is provided to store the feature information of the target in historical frames; with this historical feature information a more complete feature description of the tracked target is obtained, so that tracking accuracy and precision are improved.
Drawings
FIG. 1 is the overall framework of the method of the present invention;
FIG. 2 is a flow chart of the video sequence tracking process in the method of the present invention;
FIG. 3 is a tracking effect diagram of the 100th frame of the video in embodiment 1 of the present invention;
FIG. 4 is a tracking effect diagram of the 400th frame of the video in embodiment 1 of the present invention;
FIG. 5 shows the tracking precision of the method of the present invention under different position error thresholds on the unmanned aerial vehicle data set UAV123;
FIG. 6 shows the tracking success rate of the method of the present invention under different overlap rate thresholds on the unmanned aerial vehicle data set UAV123.
Detailed Description
The invention will be described in detail below with reference to the drawings and specific embodiments.
The unmanned aerial vehicle target tracking method based on a space-time memory network disclosed by the invention, as shown in FIG. 1, comprises three parts: mask pre-training, network fine-tuning and online tracking. The specific steps are as follows:
step 1, three frames of images are sampled from the data sets TrackingNet, laSOT, GOT k and the COCO, wherein the three frames of images are directly sampled from a video at intervals of a certain frame number for the video data sets TrackingNet, laSOT and the GOT10k, the COCO data set is added to solve the problem of insufficient sample types in the video data sets, two frames of images are additionally generated by adopting translation or brightness dithering for an original image in the COCO data set, three frames of images are obtained by adding the original image, and finally, data enhancement operations of translation, clipping and gray scale change are carried out on all the images to form a training data set.
Step 2, the unmanned aerial vehicle target tracking network model based on the space-time memory network is built. Specifically, a memory branch encoder, a query branch encoder, a feature fusion module, a decoder and a bounding box prediction head are constructed using a Vision Transformer; the outputs of the memory branch encoder and of the query branch encoder are both connected to the input of the feature fusion module, the output of the feature fusion module is connected to the input of the decoder, and the output of the decoder is connected to the bounding box prediction head. The bounding box prediction head comprises a classification head and a regression head, each constructed from 3 convolution blocks. A code sketch of this wiring is given below.
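A minimal sketch of this wiring in Python (PyTorch) follows; standard transformer layers stand in for the Vision Transformer encoders and the symmetric decoder, and the layer count, width, head count and the simple convolutional stand-in for the prediction head are assumptions.

```python
import torch
import torch.nn as nn

class SpaceTimeMemoryTracker(nn.Module):
    def __init__(self, dim=256, depth=6, heads=8):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.memory_encoder = nn.TransformerEncoder(make(), depth)   # memory branch encoder
        self.query_encoder = nn.TransformerEncoder(make(), depth)    # query branch encoder
        self.decoder = nn.TransformerEncoder(make(), depth)          # symmetric decoder
        # stand-in for the bounding box prediction head: 1 score channel + 4 box channels
        self.head = nn.Conv2d(dim, 5, kernel_size=3, padding=1)

    def fuse(self, s_query, s_memory):
        # feature fusion module: similarity-weighted read from the memory coding sequence
        w = torch.softmax(s_query @ s_memory.transpose(-2, -1) / s_query.size(-1) ** 0.5, dim=-1)
        return w @ s_memory

    def forward(self, query_tokens, memory_tokens, feat_hw):
        s_query_encode = self.query_encoder(query_tokens)    # spliced template + search blocks
        s_mem_encode = self.memory_encoder(memory_tokens)    # memory template blocks
        fused = self.decoder(self.fuse(s_query_encode, s_mem_encode))
        b, n, c = fused.shape
        h, w = feat_hw
        # assume the search-region tokens sit at the end of the spliced sequence
        fmap = fused[:, -h * w:].transpose(1, 2).reshape(b, c, h, w)
        return self.head(fmap)                                # score map and box offsets
```

For example, with dim=256 and a 256 × 256 search image cut into 16 × 16 blocks, feat_hw would be (16, 16).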
Step 3, pre-training based on mask reconstruction is carried out on the unmanned aerial vehicle target tracking network model based on the space-time memory network using the training data set, through a mask reconstruction task and a target detection task following the mask reconstruction, obtaining a pre-trained model with an improved representation capability. The pre-training method based on mask reconstruction comprises the following steps:
Step 3.1, taking one image in every group of three images in the training data set as the search image and the other two as template images; cropping the three images at a certain scale around the target, the template images being cropped to x × x and the search image to 2x × 2x;
Step 3.2, dividing the template images and the search image into non-overlapping image blocks of 16 × 16 pixels, obtaining the template image block sequences S_T1 and S_T2 and the search image block sequence S_S;
Step 3.3, performing random masking on the search image block sequence S_S and removing the masked image blocks from the sequence, obtaining the masked image block sequence S'_S and the mask tokens (mask_token); then splicing S'_S and S_T1 together to obtain the image block sequence S'_x (a code sketch of the masking and reconstruction bookkeeping is given after step 3.7);
Step 3.4, sending the spliced image block sequence S'_x into the query branch encoder and S_T2 into the memory branch encoder; building the relations between image blocks through the self-attention mechanism in the Vision Transformer, obtaining the encoded image block sequences S_query_encode and S_mem_encode, wherein the attention calculation formula is as follows:
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
wherein Q, K, V are matrices obtained by linear transformation of the input, d_k is the dimension of the matrices Q and K, softmax() represents the normalized exponential function, and Attention() is the attention function.
Step 3.5, similarly to the encoder, constructing a symmetric decoder using the Vision Transformer; splitting the encoded image block sequence S_query_encode into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing S_S_encode and the mask tokens (mask_token) together to compose the query coding sequence S_query, wherein each mask token is a shared, learnable vector representing a missing image block to be predicted; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the query coding sequence S_query and the memory coding sequence S_memory to obtain the fused feature S_feature, wherein the feature fusion calculation formula is as follows:
S_feature = w · S_memory, with w = softmax(S_query (S_memory)^T / s)
wherein (S_memory)^T is the transpose of S_memory and w is the similarity weight between S_query and S_memory, calculated element-wise as:
w_ij = exp(S_query^i · S_memory^j / s) / Σ_j exp(S_query^i · S_memory^j / s)
wherein i is the index of a pixel of S_query, j is the index of a pixel of S_memory, · denotes the vector dot product, and s is a scale factor.
Step 3.6, sending the fused feature S_feature into the decoder, which performs mask reconstruction according to the input information and reconstructs the input image by predicting the pixel values of each image block occluded by the mask; each element output by the decoder represents the pixel value vector of one image block, the number of output channels equals the number of pixel values in one image block, and the output is then reshaped into the reconstructed image;
Step 3.7, sending the reconstructed image into the bounding box prediction head, performing classification and regression respectively to obtain a predicted bounding box, calculating the mean square error losses between the reconstructed image and the original image and between the predicted bounding box and the ground-truth bounding box, back-propagating the losses and updating the model weights, so that the model learns a strong representation capability and its generalization performance is improved.
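As referenced in step 3.3 above, the masking and reconstruction bookkeeping of steps 3.2 to 3.7 can be sketched as follows; the mask ratio, the 16 × 16 patch size default, the (1, 1, C) shape assumed for the shared mask token and the unit loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def patchify(img, p=16):
    # (B, 3, H, W) -> (B, N, p*p*3): non-overlapping p x p image blocks
    b, c, h, w = img.shape
    x = img.unfold(2, p, p).unfold(3, p, p)                    # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)

def random_mask(tokens, ratio=0.75):
    # keep a random subset of image blocks; return the kept blocks and their indices
    b, n, c = tokens.shape
    n_keep = int(n * (1 - ratio))
    order = torch.rand(b, n).argsort(dim=1)                    # random permutation per sample
    keep = order[:, :n_keep]
    kept = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, c))
    return kept, keep

def add_mask_tokens(encoded_kept, keep, n_total, mask_token):
    # re-insert the shared learnable mask token at every masked position,
    # restoring the original block order of the search sequence
    b, n_keep, c = encoded_kept.shape
    full = mask_token.expand(b, n_total, c).clone()
    full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, c), encoded_kept)
    return full

def pretrain_loss(pred_pixels, target_pixels, pred_box, gt_box):
    # step 3.7: mean square error on the reconstructed pixels and on the predicted box
    return F.mse_loss(pred_pixels, target_pixels) + F.mse_loss(pred_box, gt_box)
```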
step 4: retraining the pre-trained unmanned aerial vehicle target tracking network model based on the space-time memory network to obtain a trained unmanned aerial vehicle target tracking network model based on the space-time memory network, enabling the model to be more focused on learning target characteristics by utilizing a target detection task to ensure that the model can be better applied to the unmanned aerial vehicle target tracking task, wherein the retraining process is as follows:
step 4.1, taking one image in every three images in the training data set as a search image and the other two images as template images; cutting two images by a certain scale by taking a target as a center, wherein if the template image is cut into x size, the search image is cut into 2x 2x size; dividing the image blocks into non-overlapping image blocks with the same size to obtain a template image block sequence S T1 、S T2 And searching for a sequence of image blocks S S
Step 4.2, the template image block sequence S T1 And searching for a sequence of image blocks S S Are spliced together to obtain an image block sequence S x
Step 4.3, spliced graphsImage block sequence S x Sending into query branch encoder, S T2 The person sending memory branch encoder constructs the relation between image blocks through the self-attention mechanism in Vision Transformer to obtain the encoded image block sequence S query_encode And S is mem_encode
Step 4.4, the coded image block sequence S query_encode Segmentation into a sequence of search image blocks S S_encode And template image block sequence S T_encode Template image block sequence S T_encode And S is mem_encode Splicing to form memory coding sequence S memory Using a sequence of search image blocks S S_encode And memory coding sequence S memory And carrying out feature fusion, sending the fused features to a decoder, and finally sending the decoded features to a boundary frame prediction head to obtain final target position prediction.
Step 5: the video to be tracked is input into the unmanned aerial vehicle target tracking network model based on the space-time memory network trained in step 4, and the tracking result is obtained. As shown in FIG. 2, the specific procedure is as follows:
Step 5.1, cropping an image of size x × x around the position of the given target in the first frame of the video sequence as the template image, cutting the template image into image blocks of fixed size to obtain the image block sequence S_T, and sending S_T to the memory branch encoder to obtain S_mem_encode;
Step 5.2, reading the next frame and cropping an image of size 2x × 2x centred on the target predicted in the previous frame as the search image, cutting the search image into image blocks of fixed size to obtain the image block sequence S_S; splicing S_T and S_S together while embedding position codes to represent the relative positions of the image blocks, obtaining the input sequence S_input; sending S_input into the trained query branch encoder and splitting the encoded image block sequence into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the search image block sequence S_S_encode and the memory coding sequence S_memory, and sending the fused features to the decoder;
Step 5.3, sending the decoded features into the bounding box prediction head to obtain the target position predicted for the current frame;
Step 5.4, cropping the current frame to size x × x centred on the predicted target position, cutting it into image blocks of fixed size and sending them into the memory branch encoder to obtain S_mem; splicing S_mem onto S_mem_encode;
Step 5.5, reading the next frame and repeating steps 5.2 to 5.4 until the whole video sequence is finished, obtaining the tracking result for the input video.
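By way of non-limiting illustration, the online tracking loop of step 5 can be sketched as follows in Python (PyTorch); crop_around, patchify and model.head_predict are hypothetical helper names, the positional embedding of step 5.2 is omitted for brevity, and tensor layouts are assumed to be batch-first.

```python
import torch

def track(video_frames, init_box, model, x=128):
    # step 5.1: crop the template around the given target in the first frame
    template = crop_around(video_frames[0], init_box, size=(x, x))      # hypothetical helper
    s_t = patchify(template)                                             # template sequence S_T
    s_mem_encode = model.memory_encoder(s_t)                             # memory coding S_mem_encode
    box, results = init_box, [init_box]
    for frame in video_frames[1:]:
        # step 5.2: crop a 2x * 2x search region centred on the previous prediction
        search = crop_around(frame, box, size=(2 * x, 2 * x))
        s_s = patchify(search)                                           # search sequence S_S
        s_input = torch.cat([s_t, s_s], dim=1)                           # splice template + search
        encoded = model.query_encoder(s_input)
        s_t_enc, s_s_enc = encoded[:, :s_t.size(1)], encoded[:, s_t.size(1):]
        s_memory = torch.cat([s_t_enc, s_mem_encode], dim=1)             # memory coding sequence
        fused = model.fuse(s_s_enc, s_memory)                            # feature fusion
        # step 5.3: decode and predict the target box for the current frame
        box = model.head_predict(model.decoder(fused))                   # hypothetical helper
        results.append(box)
        # step 5.4: store the current target appearance in the memory branch
        s_mem = model.memory_encoder(patchify(crop_around(frame, box, size=(x, x))))
        s_mem_encode = torch.cat([s_mem_encode, s_mem], dim=1)
    return results
```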
Example 1
In this embodiment, a video from the UAV123 data set is used as the video to be tracked, and steps 1 to 5 are executed,
wherein the template images in step 3.1 and step 4.1 are cropped to 128 × 128 and the search image is cropped to 256 × 256; the image block size is 16 × 16.
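With these settings, each 128 × 128 template yields (128/16)² = 64 image blocks and the 256 × 256 search image yields (256/16)² = 256 image blocks, so the spliced query-branch input contains 64 + 256 = 320 tokens before any masking.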
The results obtained are shown in FIGS. 3-4, which are the visualized tracking results of the 100th and 400th frames of the video, respectively, giving the position information of the target in the image.
FIGS. 5-6 show the tracking precision under different position error thresholds and the tracking success rate under different overlap rate thresholds, respectively. As shown in FIGS. 5-6, the average tracking success rate of this embodiment reaches 0.57, and the tracking precision reaches 0.742 at a position error threshold of 20 pixels. The tables below give the tracking success rate and precision of this embodiment under different environmental attributes of the UAV123 data set, and a comparison of the tracking success rate and precision of this embodiment with some other tracking algorithms on the UAV123 data set.
Table 1: Tracking success rate and precision of this embodiment under different environmental attributes
Environmental attribute | Tracking success rate | Tracking precision
Target partially occluded | 0.563 | 0.780
Target out of view | 0.592 | 0.790
Target scale change | 0.616 | 0.822
Illumination variation | 0.606 | 0.827
Rapid movement | 0.576 | 0.768
Viewing angle change | 0.657 | 0.856
Background interference | 0.646 | 0.860
Small target | 0.598 | 0.829
Table 2: Comparison of this embodiment with other tracking algorithms on the UAV123 data set
Tracking algorithm | Tracking success rate | Tracking precision
SiamFC | 0.498 | 0.726
MDNet | 0.528 | 0.735
ECO | 0.525 | 0.741
SiamRPN | 0.527 | 0.748
Tracking algorithm of the invention | 0.627 | 0.835
As can be seen from Table 1, the invention achieves a good tracking success rate and precision in most environments, effectively handles problems such as severe background interference and frequent blurring of targets in unmanned aerial vehicle videos, and clearly improves model generalization.
As can be seen from Table 2, the average tracking success rate of the invention on UAV123 reaches 0.627, the average tracking precision is 0.835, and the tracking speed reaches 45 FPS.
In view of the problems that targets in unmanned aerial vehicle videos are easily occluded and deformed and are easily disturbed by similar objects, the unmanned aerial vehicle target tracking method based on the space-time memory network obtains more robust feature information through the pre-trained network model, reduces the influence of complex backgrounds on the tracking algorithm and improves model generalization; a memory network is designed to store the target feature information of historical frames, which alleviates the problems caused by deformation of the unmanned aerial vehicle target and improves the tracking success rate and accuracy of the model.

Claims (7)

1. An unmanned aerial vehicle target tracking method based on a space-time memory network, characterized by comprising the following steps:
step 1, sampling images from a data set and performing image enhancement to form a training data set;
step 2, creating an unmanned aerial vehicle target tracking network model based on a space-time memory network;
step 3, pre-training based on mask reconstruction is carried out on the unmanned aerial vehicle target tracking network model based on the space-time memory network;
step 4, retraining the pre-trained unmanned aerial vehicle target tracking network model based on the space-time memory network in the step 3;
and 5, inputting the video to be tracked into the unmanned aerial vehicle target tracking network model based on the space-time memory network trained in the step 4, and obtaining a tracking result.
2. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 1, wherein the data set in step 1 is TrackingNet, LaSOT, GOT-10k or COCO; the images in step 1 are three frames sampled from the same video in the video data set TrackingNet, LaSOT or GOT-10k, or, for the COCO data set, an original image is translated or brightness-jittered to generate two additional images which, together with the original image, give three frames.
3. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 1, wherein the specific method for creating the unmanned aerial vehicle target tracking network model based on the space-time memory network in step 2 is as follows: a memory branch encoder, a query branch encoder, a feature fusion module, a decoder and a bounding box prediction head are constructed using a Vision Transformer; the outputs of the memory branch encoder and of the query branch encoder are both connected to the input of the feature fusion module, the output of the feature fusion module is connected to the input of the decoder, and the output of the decoder is connected to the bounding box prediction head.
4. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 3, wherein the bounding box prediction head comprises a classification head and a regression head connected in turn, each of which is constructed from 3 convolution blocks.
5. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 3 or 4, wherein step 3 is specifically implemented according to the following steps:
Step 3.1, taking one image in every group of three images in the training data set as the search image and the other two as template images; cropping the three images at a certain scale around the target, the template images being cropped to x × x and the search image to 2x × 2x;
Step 3.2, dividing the template images and the search image into non-overlapping image blocks of 16 × 16 pixels, obtaining the template image block sequences S_T1 and S_T2 and the search image block sequence S_S;
Step 3.3, performing random masking on the search image block sequence S_S and removing the masked image blocks from the sequence, obtaining the masked image block sequence S'_S and the mask tokens (mask_token); then splicing S'_S and S_T1 together to obtain the image block sequence S'_x;
Step 3.4, sending the spliced image block sequence S'_x into the query branch encoder and S_T2 into the memory branch encoder; building the relations between image blocks through the self-attention mechanism in the Vision Transformer, obtaining the encoded image block sequences S_query_encode and S_mem_encode, wherein the attention calculation formula is as follows:
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
wherein Q, K, V are matrices obtained by linear transformation of the input, d_k is the dimension of the matrices Q and K, softmax() represents the normalized exponential function, and Attention() is the attention function;
Step 3.5, similarly to the encoder, constructing a symmetric decoder using the Vision Transformer; splitting the encoded image block sequence S_query_encode into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing S_S_encode and the mask tokens (mask_token) together to compose the query coding sequence S_query, wherein each mask token is a shared, learnable vector representing a missing image block to be predicted; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the query coding sequence S_query and the memory coding sequence S_memory to obtain the fused feature S_feature, wherein the feature fusion calculation formula is as follows:
S_feature = w · S_memory, with w = softmax(S_query (S_memory)^T / s)
wherein (S_memory)^T is the transpose of S_memory and w is the similarity weight between S_query and S_memory, calculated element-wise as:
w_ij = exp(S_query^i · S_memory^j / s) / Σ_j exp(S_query^i · S_memory^j / s)
wherein i is the index of a pixel of S_query, j is the index of a pixel of S_memory, · denotes the vector dot product, and s is a scale factor;
Step 3.6, sending the fused feature S_feature into the decoder, which performs mask reconstruction according to the input information and reconstructs the input image by predicting the pixel values of each image block occluded by the mask; each element output by the decoder represents the pixel value vector of one image block, the number of output channels equals the number of pixel values in one image block, and the output is then reshaped into the reconstructed image;
Step 3.7, sending the reconstructed image into the bounding box prediction head, performing classification and regression respectively to obtain a predicted bounding box, calculating the mean square error losses between the reconstructed image and the original image and between the predicted bounding box and the ground-truth bounding box, back-propagating the losses, and updating the model weights.
6. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 5, wherein step 4 is specifically implemented according to the following steps:
Step 4.1, taking one image in every group of three images in the training data set as the search image and the other two as template images; cropping the images at a certain scale around the target, the template images being cropped to x × x and the search image to 2x × 2x; dividing them into non-overlapping image blocks of equal size, obtaining the template image block sequences S_T1 and S_T2 and the search image block sequence S_S;
Step 4.2, splicing the template image block sequence S_T1 and the search image block sequence S_S together to obtain the image block sequence S_x;
Step 4.3, sending the spliced image block sequence S_x into the query branch encoder and S_T2 into the memory branch encoder; building the relations between image blocks through the self-attention mechanism in the Vision Transformer, obtaining the encoded image block sequences S_query_encode and S_mem_encode;
Step 4.4, splitting the encoded image block sequence S_query_encode into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the search image block sequence S_S_encode and the memory coding sequence S_memory, sending the fused features to the decoder, and finally sending the decoded features to the bounding box prediction head to obtain the final target position prediction.
7. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 6, wherein step 5 is specifically implemented according to the following steps:
Step 5.1, cropping an image of size x × x around the position of the given target in the first frame of the video sequence as the template image, cutting the template image into image blocks of fixed size to obtain the image block sequence S_T, and sending S_T to the memory branch encoder to obtain S_mem_encode;
Step 5.2, reading the next frame and cropping an image of size 2x × 2x centred on the target predicted in the previous frame as the search image, cutting the search image into image blocks of fixed size to obtain the image block sequence S_S; splicing S_T and S_S together while embedding position codes to represent the relative positions of the image blocks, obtaining the input sequence S_input; sending S_input into the trained query branch encoder and splitting the encoded image block sequence into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the search image block sequence S_S_encode and the memory coding sequence S_memory, and sending the fused features to the decoder;
Step 5.3, sending the decoded features into the bounding box prediction head to obtain the target position predicted for the current frame;
Step 5.4, cropping the current frame to size x × x centred on the predicted target position, cutting it into image blocks of fixed size and sending them into the memory branch encoder to obtain S_mem; splicing S_mem onto S_mem_encode;
Step 5.5, reading the next frame and repeating steps 5.2 to 5.4 until the whole video sequence is finished, obtaining the tracking result for the input video.
CN202310156686.2A 2023-02-23 2023-02-23 Unmanned aerial vehicle target tracking method based on space-time memory network Pending CN116630369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310156686.2A CN116630369A (en) 2023-02-23 2023-02-23 Unmanned aerial vehicle target tracking method based on space-time memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310156686.2A CN116630369A (en) 2023-02-23 2023-02-23 Unmanned aerial vehicle target tracking method based on space-time memory network

Publications (1)

Publication Number Publication Date
CN116630369A true CN116630369A (en) 2023-08-22

Family

ID=87615821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310156686.2A Pending CN116630369A (en) 2023-02-23 2023-02-23 Unmanned aerial vehicle target tracking method based on space-time memory network

Country Status (1)

Country Link
CN (1) CN116630369A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333514A (en) * 2023-12-01 2024-01-02 科大讯飞股份有限公司 Single-target video tracking method, device, storage medium and equipment
CN117333514B (en) * 2023-12-01 2024-04-16 科大讯飞股份有限公司 Single-target video tracking method, device, storage medium and equipment

Similar Documents

Publication Publication Date Title
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
CN109711463B (en) Attention-based important object detection method
CN111079532A (en) Video content description method based on text self-encoder
CN115393396B (en) Unmanned aerial vehicle target tracking method based on mask pre-training
Chen et al. Log hyperbolic cosine loss improves variational auto-encoder
CN110334589B (en) High-time-sequence 3D neural network action identification method based on hole convolution
CN113657388B (en) Image semantic segmentation method for super-resolution reconstruction of fused image
CN111428718A (en) Natural scene text recognition method based on image enhancement
CN113344932B (en) Semi-supervised single-target video segmentation method
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN111738054A (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN113139468A (en) Video abstract generation method fusing local target features and global features
CN115690152A (en) Target tracking method based on attention mechanism
CN116630369A (en) Unmanned aerial vehicle target tracking method based on space-time memory network
CN113869234B (en) Facial expression recognition method, device, equipment and storage medium
CN115393690A (en) Light neural network air-to-ground observation multi-target identification method
CN116863384A (en) CNN-Transfomer-based self-supervision video segmentation method and system
CN115713546A (en) Lightweight target tracking algorithm for mobile terminal equipment
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning
CN115690917B (en) Pedestrian action identification method based on intelligent attention of appearance and motion
CN116543338A (en) Student classroom behavior detection method based on gaze target estimation
CN116597503A (en) Classroom behavior detection method based on space-time characteristics
CN117196963A (en) Point cloud denoising method based on noise reduction self-encoder
CN116503314A (en) Quality inspection system and method for door manufacturing
CN115830505A (en) Video target segmentation method and system for removing background interference through semi-supervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination