CN116630369A - Unmanned aerial vehicle target tracking method based on space-time memory network - Google Patents

Unmanned aerial vehicle target tracking method based on space-time memory network

Info

Publication number
CN116630369A
CN116630369A (Application CN202310156686.2A)
Authority
CN
China
Prior art keywords
image
sequence
encode
memory
unmanned aerial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310156686.2A
Other languages
Chinese (zh)
Inventor
梁继民
牟剑
郑洋
卫晨
郭开泰
胡海虹
王梓宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202310156686.2A priority Critical patent/CN116630369A/en
Publication of CN116630369A publication Critical patent/CN116630369A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/62 - Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G06V20/17 - Terrestrial scenes taken from planes or by drones
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/48 - Matching video sequences
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10032 - Satellite or aerial image; Remote sensing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30241 - Trajectory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Remote Sensing (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unmanned aerial vehicle target tracking method based on a space-time memory network, which comprises the following steps: step 1, sampling images from a data set and performing image enhancement to form a training data set; step 2, creating an unmanned aerial vehicle target tracking network model based on the space-time memory network; step 3, performing pre-training based on mask reconstruction on the model; step 4, retraining the model pre-trained in step 3; and step 5, inputting the video to be tracked into the model trained in step 4 to obtain the tracking result. The method alleviates the problems caused by deformation of the unmanned aerial vehicle target and improves the tracking success rate and accuracy.

Description

Unmanned aerial vehicle target tracking method based on space-time memory network
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle target tracking, and particularly relates to an unmanned aerial vehicle target tracking method based on a space-time memory network.
Background
Visual object tracking is a highly active topic in the field of computer vision. Given a video or image sequence, its aim is to initialize the position and size of an object in the initial frame and then track that object from frame to frame. With the development of deep learning, target tracking has been widely applied to fields such as environmental monitoring, disaster detection and intelligent surveillance. Unmanned aerial vehicles, as an emerging remote sensing platform, have attracted increasing attention owing to their small size, simple operation and ability to adapt to various environments and weather conditions. Against the background of the trend toward intelligent systems, target tracking based on unmanned aerial vehicle vision is increasingly favoured and is finding more and more civil applications.
Compared with ground target tracking, unmanned aerial vehicle target tracking faces additional difficulties: the shooting viewpoint is higher and the captured video covers a wide area containing much background information, so the target carries less feature information and is easily disturbed by surrounding objects and the background. In addition, camera shake and changes in flight speed occur readily during flight, causing complex situations such as target deformation and occlusion. Unmanned aerial vehicle target tracking is therefore much more difficult than ground target tracking.
With the development of deep learning, the field of target tracking has advanced remarkably and a group of outstanding algorithms has emerged, among which tracking algorithms based on the Siamese network are favoured by many researchers. The fully convolutional Siamese network algorithm (SiamFC) uses a twin network to directly learn a matching function between the target template and candidate targets, compares the similarity between the target template and the search region with this matching function, and finally obtains a score map of the search region from which the position of the tracked target is derived, effectively converting the target tracking problem into a similarity matching problem. To further improve model performance, subsequent algorithms added feature fusion and attention mechanisms on this basis. However, such algorithms only obtain the score of the search region through the similarity function and hence the position of the target; they do not estimate the target scale, so scale information is lost. The SiamRPN algorithm introduces a region proposal network (RPN) on top of the Siamese network, converts the tracking in each frame into a local detection task, and adapts to scale changes through prior anchor box settings, achieving higher precision and speed; nevertheless, when distractors surround the target or the target is occluded, the probability of losing the track remains high. In recent years, Transformers have been applied to computer vision models owing to their great success in tasks such as natural language processing and speech recognition, but their application in computer vision is still limited and mainly consists of combining them with convolutional networks to replace certain modules while keeping the overall structure unchanged.
From the above analysis, existing methods have the following shortcomings:
(1) Tracking algorithms with a simple model structure achieve good results on specific targets but lack strong robustness; they perform poorly under severe background interference and other difficulties in target tracking, and their generalization is low.
(2) Most existing tracking algorithms use only the first frame as the template frame. The template features are therefore single, problems such as target deformation are not handled well, tracking failure occurs easily, and the tracking success rate and precision are low.
Disclosure of Invention
The invention provides an unmanned aerial vehicle target tracking method based on a space-time memory network, which alleviates the problems caused by deformation of the unmanned aerial vehicle target and improves the tracking success rate and accuracy.
The technical scheme adopted by the invention is that the unmanned aerial vehicle target tracking method based on the space-time memory network comprises the following steps:
step 1, sampling images from a data set and performing image enhancement to form a training data set;
step 2, creating an unmanned aerial vehicle target tracking network model based on a space-time memory network;
step 3, pre-training based on mask reconstruction is carried out on the unmanned aerial vehicle target tracking network model based on the space-time memory network;
step 4, retraining the pre-trained unmanned aerial vehicle target tracking network model based on the space-time memory network in the step 3;
and 5, inputting the video to be tracked into the unmanned aerial vehicle target tracking network model based on the space-time memory network trained in the step 4, and obtaining a tracking result.
The present invention is also characterized in that,
the dataset in step 1 was TrackingNet, laSOT, GOT k or COCO; the images in step 1 are three frames of images sampled from the same video in the video dataset TrackingNet, laSOT or GOT10k, or the original image in the COCO dataset is shifted or dithered in brightness to generate two images, and the original image is added to obtain three frames of images.
The specific method for creating the unmanned aerial vehicle target tracking network model based on the space-time memory network comprises the following steps: the memory branch encoder, the query branch encoder, the feature fusion module, the decoder and the boundary frame prediction head are constructed by utilizing Vision Transformer, the output of the memory branch encoder and the output of the query branch encoder are both connected with the input of the feature fusion module, the output of the feature fusion module is connected with the input of the decoder, and the output end of the decoder is connected with the boundary frame prediction head.
The boundary frame prediction head comprises a classification head and a regression head which are sequentially connected, wherein the classification head and the regression head are constructed by 3 convolution blocks.
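By way of non-limiting illustration, the bounding box prediction head can be sketched in Python (PyTorch) as follows; the channel widths, the use of batch normalization and ReLU inside each convolution block, and the parallel arrangement of the two heads are assumptions made only for this sketch.

```python
# Hedged sketch of a bounding box prediction head with a classification head and a
# regression head, each stacking three convolutions (assumed composition and widths).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # one assumed convolution block: 3x3 conv + batch norm + ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class BBoxPredictionHead(nn.Module):
    def __init__(self, in_ch=256, mid_ch=128):
        super().__init__()
        # classification head: foreground score map (1 channel)
        self.cls_head = nn.Sequential(
            conv_block(in_ch, mid_ch), conv_block(mid_ch, mid_ch),
            nn.Conv2d(mid_ch, 1, kernel_size=3, padding=1),
        )
        # regression head: box offsets (4 channels)
        self.reg_head = nn.Sequential(
            conv_block(in_ch, mid_ch), conv_block(mid_ch, mid_ch),
            nn.Conv2d(mid_ch, 4, kernel_size=3, padding=1),
        )

    def forward(self, feat):  # feat: (B, C, H, W) decoded feature map
        return self.cls_head(feat), self.reg_head(feat)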
Step 3 is specifically implemented according to the following steps:
Step 3.1, taking one image in every group of three images in the training data set as the search image and the other two as template images; cropping the three images at a certain scale around the target, the template images being cropped to x × x and the search image to 2x × 2x;
Step 3.2, dividing the template images and the search image into non-overlapping image blocks of 16 × 16 pixels, obtaining the template image block sequences S_T1 and S_T2 and the search image block sequence S_S;
Step 3.3, performing random masking on the search image block sequence S_S and removing the masked image blocks from the sequence, obtaining the masked image block sequence S'_S and the mask tokens (mask_token); then splicing S'_S and S_T1 together to obtain the image block sequence S'_x;
Step 3.4, sending the spliced image block sequence S'_x into the query branch encoder and S_T2 into the memory branch encoder; building the relations between image blocks through the self-attention mechanism in the Vision Transformer, obtaining the encoded image block sequences S_query_encode and S_mem_encode, wherein the attention calculation formula is as follows (a code sketch of this computation is given after step 3.7):
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
wherein Q, K, V are matrices obtained by linear transformation of the input, d_k is the dimension of the matrices Q and K, softmax() represents the normalized exponential function, and Attention() is the attention function;
Step 3.5, similarly to the encoder, constructing a symmetric decoder using the Vision Transformer; splitting the encoded image block sequence S_query_encode into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing S_S_encode and the mask tokens (mask_token) together to compose the query coding sequence S_query, wherein each mask token is a shared, learnable vector representing a missing image block to be predicted; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the query coding sequence S_query and the memory coding sequence S_memory to obtain the fused feature S_feature (a code sketch of this fusion is given after step 3.7), wherein the feature fusion calculation formula is as follows:
S_feature = w · S_memory, with w = softmax(S_query (S_memory)^T / s)
wherein (S_memory)^T is the transpose of S_memory and w is the similarity weight between S_query and S_memory, calculated element-wise as:
w_ij = exp(S_query^i · S_memory^j / s) / Σ_j exp(S_query^i · S_memory^j / s)
wherein i is the index of a pixel of S_query, j is the index of a pixel of S_memory, · denotes the vector dot product, and s is a scale factor;
Step 3.6, sending the fused feature S_feature into the decoder, which performs mask reconstruction according to the input information and reconstructs the input image by predicting the pixel values of each image block occluded by the mask; each element output by the decoder represents the pixel value vector of one image block, the number of output channels equals the number of pixel values in one image block, and the output is then reshaped into the reconstructed image;
Step 3.7, sending the reconstructed image into the bounding box prediction head, performing classification and regression respectively to obtain a predicted bounding box, calculating the mean square error losses between the reconstructed image and the original image and between the predicted bounding box and the ground-truth bounding box, back-propagating the losses and updating the model weights, so that the model learns a strong representation capability and its generalization performance is improved.
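As referenced in step 3.4 above, a minimal Python (PyTorch) sketch of the attention computation is given below; the batch-first tensor layout is an assumption.

```python
# Sketch of Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q, K, V: (batch, num_tokens, d_k), obtained by linear transformations of the input
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # pairwise token similarities
    weights = F.softmax(scores, dim=-1)             # normalized exponential function
    return weights @ V                              # weighted sum of the values
```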
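Likewise, as referenced in step 3.5 above, a minimal sketch of the feature fusion that reads the memory coding sequence with similarity weights is given below; the use of √C as the default scale factor s is an assumption of this sketch.

```python
import torch

def memory_fusion(S_query, S_memory, s=None):
    # S_query:  (num_query_tokens, C)  query coding sequence
    # S_memory: (num_memory_tokens, C) memory coding sequence
    if s is None:
        s = S_query.size(-1) ** 0.5        # assumed default scale factor
    sim = S_query @ S_memory.T / s         # dot products S_query^i . S_memory^j / s
    w = torch.softmax(sim, dim=-1)         # similarity weights w_ij
    return w @ S_memory                    # fused feature S_feature
```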
Step 4 is specifically implemented according to the following steps:
Step 4.1, taking one image in every group of three images in the training data set as the search image and the other two as template images; cropping the images at a certain scale around the target, the template images being cropped to x × x and the search image to 2x × 2x; dividing them into non-overlapping image blocks of equal size, obtaining the template image block sequences S_T1 and S_T2 and the search image block sequence S_S;
Step 4.2, splicing the template image block sequence S_T1 and the search image block sequence S_S together to obtain the image block sequence S_x;
Step 4.3, sending the spliced image block sequence S_x into the query branch encoder and S_T2 into the memory branch encoder; building the relations between image blocks through the self-attention mechanism in the Vision Transformer, obtaining the encoded image block sequences S_query_encode and S_mem_encode;
Step 4.4, splitting the encoded image block sequence S_query_encode into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the search image block sequence S_S_encode and the memory coding sequence S_memory, sending the fused features to the decoder, and finally sending the decoded features to the bounding box prediction head to obtain the final target position prediction.
Step 5 is specifically implemented according to the following steps:
Step 5.1, cropping an image of size x × x around the position of the given target in the first frame of the video sequence as the template image, cutting the template image into image blocks of fixed size to obtain the image block sequence S_T, and sending S_T to the memory branch encoder to obtain S_mem_encode;
Step 5.2, reading the next frame and cropping an image of size 2x × 2x centred on the target predicted in the previous frame as the search image, cutting the search image into image blocks of fixed size to obtain the image block sequence S_S; splicing S_T and S_S together while embedding position codes to represent the relative positions of the image blocks, obtaining the input sequence S_input; sending S_input into the trained query branch encoder and splitting the encoded image block sequence into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the search image block sequence S_S_encode and the memory coding sequence S_memory, and sending the fused features to the decoder;
Step 5.3, sending the decoded features into the bounding box prediction head to obtain the target position predicted for the current frame;
Step 5.4, cropping the current frame to size x × x centred on the predicted target position, cutting it into image blocks of fixed size and sending them into the memory branch encoder to obtain S_mem; splicing S_mem onto S_mem_encode;
Step 5.5, reading the next frame and repeating steps 5.2 to 5.4 until the whole video sequence is finished, obtaining the tracking result for the input video.
The beneficial effects of the invention are as follows:
(1) In view of the problems that targets in unmanned aerial vehicle video suffer severe background interference and are prone to blurring, so that predicting the target requires a tracking model with good generalization performance, a pre-training method based on mask reconstruction is provided: the image mask is reconstructed with the Vision Transformer so that a stronger target representation capability is obtained, training is then performed through a target detection task, and the generalization of the tracking model is effectively improved.
(2) In view of the problems that the target is prone to deformation and occlusion in unmanned aerial vehicle target tracking, and that using only the initial frame as the template provides little target feature information, a memory network is provided to store the feature information of the target in historical frames; with this historical feature information a more complete feature description of the tracked target is obtained, so that tracking accuracy and precision are improved.
Drawings
FIG. 1 is the overall framework of the method of the present invention;
FIG. 2 is a flow chart of the video sequence tracking process in the method of the present invention;
FIG. 3 is a tracking effect diagram of the 100th frame of the video in embodiment 1 of the present invention;
FIG. 4 is a tracking effect diagram of the 400th frame of the video in embodiment 1 of the present invention;
FIG. 5 shows the tracking precision of the method of the present invention under different position error thresholds on the unmanned aerial vehicle data set UAV123;
FIG. 6 shows the tracking success rate of the method of the present invention under different overlap rate thresholds on the unmanned aerial vehicle data set UAV123.
Detailed Description
The invention will be described in detail below with reference to the drawings and specific embodiments.
The unmanned aerial vehicle target tracking method based on a space-time memory network disclosed by the invention, as shown in FIG. 1, comprises three parts: mask pre-training, network fine-tuning and online tracking. The specific steps are as follows:
step 1, three frames of images are sampled from the data sets TrackingNet, laSOT, GOT k and the COCO, wherein the three frames of images are directly sampled from a video at intervals of a certain frame number for the video data sets TrackingNet, laSOT and the GOT10k, the COCO data set is added to solve the problem of insufficient sample types in the video data sets, two frames of images are additionally generated by adopting translation or brightness dithering for an original image in the COCO data set, three frames of images are obtained by adding the original image, and finally, data enhancement operations of translation, clipping and gray scale change are carried out on all the images to form a training data set.
Step 2, the unmanned aerial vehicle target tracking network model based on the space-time memory network is built. Specifically, a memory branch encoder, a query branch encoder, a feature fusion module, a decoder and a bounding box prediction head are constructed using a Vision Transformer; the outputs of the memory branch encoder and of the query branch encoder are both connected to the input of the feature fusion module, the output of the feature fusion module is connected to the input of the decoder, and the output of the decoder is connected to the bounding box prediction head. The bounding box prediction head comprises a classification head and a regression head, each constructed from 3 convolution blocks. A code sketch of this wiring is given below.
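A minimal sketch of this wiring in Python (PyTorch) follows; standard transformer layers stand in for the Vision Transformer encoders and the symmetric decoder, and the layer count, width, head count and the simple convolutional stand-in for the prediction head are assumptions.

```python
import torch
import torch.nn as nn

class SpaceTimeMemoryTracker(nn.Module):
    def __init__(self, dim=256, depth=6, heads=8):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.memory_encoder = nn.TransformerEncoder(make(), depth)   # memory branch encoder
        self.query_encoder = nn.TransformerEncoder(make(), depth)    # query branch encoder
        self.decoder = nn.TransformerEncoder(make(), depth)          # symmetric decoder
        # stand-in for the bounding box prediction head: 1 score channel + 4 box channels
        self.head = nn.Conv2d(dim, 5, kernel_size=3, padding=1)

    def fuse(self, s_query, s_memory):
        # feature fusion module: similarity-weighted read from the memory coding sequence
        w = torch.softmax(s_query @ s_memory.transpose(-2, -1) / s_query.size(-1) ** 0.5, dim=-1)
        return w @ s_memory

    def forward(self, query_tokens, memory_tokens, feat_hw):
        s_query_encode = self.query_encoder(query_tokens)    # spliced template + search blocks
        s_mem_encode = self.memory_encoder(memory_tokens)    # memory template blocks
        fused = self.decoder(self.fuse(s_query_encode, s_mem_encode))
        b, n, c = fused.shape
        h, w = feat_hw
        # assume the search-region tokens sit at the end of the spliced sequence
        fmap = fused[:, -h * w:].transpose(1, 2).reshape(b, c, h, w)
        return self.head(fmap)                                # score map and box offsets
```

For example, with dim=256 and a 256 × 256 search image cut into 16 × 16 blocks, feat_hw would be (16, 16).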
Step 3, pre-training based on mask reconstruction is carried out on the unmanned aerial vehicle target tracking network model based on the space-time memory network using the training data set, through a mask reconstruction task and a target detection task following the mask reconstruction, obtaining a pre-trained model with an improved representation capability. The pre-training method based on mask reconstruction comprises the following steps:
Step 3.1, taking one image in every group of three images in the training data set as the search image and the other two as template images; cropping the three images at a certain scale around the target, the template images being cropped to x × x and the search image to 2x × 2x;
Step 3.2, dividing the template images and the search image into non-overlapping image blocks of 16 × 16 pixels, obtaining the template image block sequences S_T1 and S_T2 and the search image block sequence S_S;
Step 3.3, performing random masking on the search image block sequence S_S and removing the masked image blocks from the sequence, obtaining the masked image block sequence S'_S and the mask tokens (mask_token); then splicing S'_S and S_T1 together to obtain the image block sequence S'_x (a code sketch of the masking and reconstruction bookkeeping is given after step 3.7);
Step 3.4, sending the spliced image block sequence S'_x into the query branch encoder and S_T2 into the memory branch encoder; building the relations between image blocks through the self-attention mechanism in the Vision Transformer, obtaining the encoded image block sequences S_query_encode and S_mem_encode, wherein the attention calculation formula is as follows:
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
wherein Q, K, V are matrices obtained by linear transformation of the input, d_k is the dimension of the matrices Q and K, softmax() represents the normalized exponential function, and Attention() is the attention function.
Step 3.5, similarly to the encoder, constructing a symmetric decoder using the Vision Transformer; splitting the encoded image block sequence S_query_encode into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing S_S_encode and the mask tokens (mask_token) together to compose the query coding sequence S_query, wherein each mask token is a shared, learnable vector representing a missing image block to be predicted; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the query coding sequence S_query and the memory coding sequence S_memory to obtain the fused feature S_feature, wherein the feature fusion calculation formula is as follows:
S_feature = w · S_memory, with w = softmax(S_query (S_memory)^T / s)
wherein (S_memory)^T is the transpose of S_memory and w is the similarity weight between S_query and S_memory, calculated element-wise as:
w_ij = exp(S_query^i · S_memory^j / s) / Σ_j exp(S_query^i · S_memory^j / s)
wherein i is the index of a pixel of S_query, j is the index of a pixel of S_memory, · denotes the vector dot product, and s is a scale factor.
Step 3.6, sending the fused feature S_feature into the decoder, which performs mask reconstruction according to the input information and reconstructs the input image by predicting the pixel values of each image block occluded by the mask; each element output by the decoder represents the pixel value vector of one image block, the number of output channels equals the number of pixel values in one image block, and the output is then reshaped into the reconstructed image;
Step 3.7, sending the reconstructed image into the bounding box prediction head, performing classification and regression respectively to obtain a predicted bounding box, calculating the mean square error losses between the reconstructed image and the original image and between the predicted bounding box and the ground-truth bounding box, back-propagating the losses and updating the model weights, so that the model learns a strong representation capability and its generalization performance is improved.
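As referenced in step 3.3 above, the masking and reconstruction bookkeeping of steps 3.2 to 3.7 can be sketched as follows; the mask ratio, the 16 × 16 patch size default, the (1, 1, C) shape assumed for the shared mask token and the unit loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def patchify(img, p=16):
    # (B, 3, H, W) -> (B, N, p*p*3): non-overlapping p x p image blocks
    b, c, h, w = img.shape
    x = img.unfold(2, p, p).unfold(3, p, p)                    # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)

def random_mask(tokens, ratio=0.75):
    # keep a random subset of image blocks; return the kept blocks and their indices
    b, n, c = tokens.shape
    n_keep = int(n * (1 - ratio))
    order = torch.rand(b, n).argsort(dim=1)                    # random permutation per sample
    keep = order[:, :n_keep]
    kept = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, c))
    return kept, keep

def add_mask_tokens(encoded_kept, keep, n_total, mask_token):
    # re-insert the shared learnable mask token at every masked position,
    # restoring the original block order of the search sequence
    b, n_keep, c = encoded_kept.shape
    full = mask_token.expand(b, n_total, c).clone()
    full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, c), encoded_kept)
    return full

def pretrain_loss(pred_pixels, target_pixels, pred_box, gt_box):
    # step 3.7: mean square error on the reconstructed pixels and on the predicted box
    return F.mse_loss(pred_pixels, target_pixels) + F.mse_loss(pred_box, gt_box)
```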
step 4: retraining the pre-trained unmanned aerial vehicle target tracking network model based on the space-time memory network to obtain a trained unmanned aerial vehicle target tracking network model based on the space-time memory network, enabling the model to be more focused on learning target characteristics by utilizing a target detection task to ensure that the model can be better applied to the unmanned aerial vehicle target tracking task, wherein the retraining process is as follows:
step 4.1, taking one image in every three images in the training data set as a search image and the other two images as template images; cutting two images by a certain scale by taking a target as a center, wherein if the template image is cut into x size, the search image is cut into 2x 2x size; dividing the image blocks into non-overlapping image blocks with the same size to obtain a template image block sequence S T1 、S T2 And searching for a sequence of image blocks S S
Step 4.2, the template image block sequence S T1 And searching for a sequence of image blocks S S Are spliced together to obtain an image block sequence S x
Step 4.3, spliced graphsImage block sequence S x Sending into query branch encoder, S T2 The person sending memory branch encoder constructs the relation between image blocks through the self-attention mechanism in Vision Transformer to obtain the encoded image block sequence S query_encode And S is mem_encode
Step 4.4, the coded image block sequence S query_encode Segmentation into a sequence of search image blocks S S_encode And template image block sequence S T_encode Template image block sequence S T_encode And S is mem_encode Splicing to form memory coding sequence S memory Using a sequence of search image blocks S S_encode And memory coding sequence S memory And carrying out feature fusion, sending the fused features to a decoder, and finally sending the decoded features to a boundary frame prediction head to obtain final target position prediction.
Step 5: the video to be tracked is input into the unmanned aerial vehicle target tracking network model based on the space-time memory network trained in step 4, and the tracking result is obtained. As shown in FIG. 2, the specific procedure is as follows:
Step 5.1, cropping an image of size x × x around the position of the given target in the first frame of the video sequence as the template image, cutting the template image into image blocks of fixed size to obtain the image block sequence S_T, and sending S_T to the memory branch encoder to obtain S_mem_encode;
Step 5.2, reading the next frame and cropping an image of size 2x × 2x centred on the target predicted in the previous frame as the search image, cutting the search image into image blocks of fixed size to obtain the image block sequence S_S; splicing S_T and S_S together while embedding position codes to represent the relative positions of the image blocks, obtaining the input sequence S_input; sending S_input into the trained query branch encoder and splitting the encoded image block sequence into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the search image block sequence S_S_encode and the memory coding sequence S_memory, and sending the fused features to the decoder;
Step 5.3, sending the decoded features into the bounding box prediction head to obtain the target position predicted for the current frame;
Step 5.4, cropping the current frame to size x × x centred on the predicted target position, cutting it into image blocks of fixed size and sending them into the memory branch encoder to obtain S_mem; splicing S_mem onto S_mem_encode;
Step 5.5, reading the next frame and repeating steps 5.2 to 5.4 until the whole video sequence is finished, obtaining the tracking result for the input video.
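By way of non-limiting illustration, the online tracking loop of step 5 can be sketched as follows in Python (PyTorch); crop_around, patchify and model.head_predict are hypothetical helper names, the positional embedding of step 5.2 is omitted for brevity, and tensor layouts are assumed to be batch-first.

```python
import torch

def track(video_frames, init_box, model, x=128):
    # step 5.1: crop the template around the given target in the first frame
    template = crop_around(video_frames[0], init_box, size=(x, x))      # hypothetical helper
    s_t = patchify(template)                                             # template sequence S_T
    s_mem_encode = model.memory_encoder(s_t)                             # memory coding S_mem_encode
    box, results = init_box, [init_box]
    for frame in video_frames[1:]:
        # step 5.2: crop a 2x * 2x search region centred on the previous prediction
        search = crop_around(frame, box, size=(2 * x, 2 * x))
        s_s = patchify(search)                                           # search sequence S_S
        s_input = torch.cat([s_t, s_s], dim=1)                           # splice template + search
        encoded = model.query_encoder(s_input)
        s_t_enc, s_s_enc = encoded[:, :s_t.size(1)], encoded[:, s_t.size(1):]
        s_memory = torch.cat([s_t_enc, s_mem_encode], dim=1)             # memory coding sequence
        fused = model.fuse(s_s_enc, s_memory)                            # feature fusion
        # step 5.3: decode and predict the target box for the current frame
        box = model.head_predict(model.decoder(fused))                   # hypothetical helper
        results.append(box)
        # step 5.4: store the current target appearance in the memory branch
        s_mem = model.memory_encoder(patchify(crop_around(frame, box, size=(x, x))))
        s_mem_encode = torch.cat([s_mem_encode, s_mem], dim=1)
    return results
```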
Example 1
In this embodiment, a video from the UAV123 data set is used as the video to be tracked, and steps 1 to 5 are executed,
wherein the template images in step 3.1 and step 4.1 are cropped to 128 × 128 and the search image is cropped to 256 × 256; the image block size is 16 × 16.
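With these settings, each 128 × 128 template yields (128/16)² = 64 image blocks and the 256 × 256 search image yields (256/16)² = 256 image blocks, so the spliced query-branch input contains 64 + 256 = 320 tokens before any masking.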
The results obtained are shown in FIGS. 3-4, which are the visualized tracking results of the 100th and 400th frames of the video, respectively, giving the position information of the target in the image.
FIGS. 5-6 show the tracking precision under different position error thresholds and the tracking success rate under different overlap rate thresholds, respectively. As shown in FIGS. 5-6, the average tracking success rate of this embodiment reaches 0.57, and the tracking precision reaches 0.742 at a position error threshold of 20 pixels. The tables below give the tracking success rate and precision of this embodiment under different environmental attributes of the UAV123 data set, and a comparison of the tracking success rate and precision of this embodiment with some other tracking algorithms on the UAV123 data set.
Table 1: Tracking success rate and precision of this embodiment under different environmental attributes
Environmental attribute | Tracking success rate | Tracking precision
Target partially occluded | 0.563 | 0.780
Target out of view | 0.592 | 0.790
Target scale change | 0.616 | 0.822
Illumination variation | 0.606 | 0.827
Rapid movement | 0.576 | 0.768
Viewing angle change | 0.657 | 0.856
Background interference | 0.646 | 0.860
Small target | 0.598 | 0.829
Table 2: Comparison of this embodiment with other tracking algorithms on the UAV123 data set
Tracking algorithm | Tracking success rate | Tracking precision
SiamFC | 0.498 | 0.726
MDNet | 0.528 | 0.735
ECO | 0.525 | 0.741
SiamRPN | 0.527 | 0.748
Tracking algorithm of the invention | 0.627 | 0.835
As can be seen from Table 1, the invention achieves a good tracking success rate and precision in most environments, effectively handles problems such as severe background interference and frequent blurring of targets in unmanned aerial vehicle videos, and clearly improves model generalization.
As can be seen from Table 2, the average tracking success rate of the invention on UAV123 reaches 0.627, the average tracking precision is 0.835, and the tracking speed reaches 45 FPS.
In view of the problems that targets in unmanned aerial vehicle videos are easily occluded and deformed and are easily disturbed by similar objects, the unmanned aerial vehicle target tracking method based on the space-time memory network obtains more robust feature information through the pre-trained network model, reduces the influence of complex backgrounds on the tracking algorithm and improves model generalization; a memory network is designed to store the target feature information of historical frames, which alleviates the problems caused by deformation of the unmanned aerial vehicle target and improves the tracking success rate and accuracy of the model.

Claims (7)

1. An unmanned aerial vehicle target tracking method based on a space-time memory network, characterized by comprising the following steps:
step 1, sampling images from a data set and performing image enhancement to form a training data set;
step 2, creating an unmanned aerial vehicle target tracking network model based on a space-time memory network;
step 3, pre-training based on mask reconstruction is carried out on the unmanned aerial vehicle target tracking network model based on the space-time memory network;
step 4, retraining the pre-trained unmanned aerial vehicle target tracking network model based on the space-time memory network in the step 3;
and 5, inputting the video to be tracked into the unmanned aerial vehicle target tracking network model based on the space-time memory network trained in the step 4, and obtaining a tracking result.
2. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 1, wherein the data set in step 1 is TrackingNet, LaSOT, GOT-10k or COCO; the images in step 1 are three frames sampled from the same video in the video data set TrackingNet, LaSOT or GOT-10k, or, for the COCO data set, an original image is translated or brightness-jittered to generate two additional images which, together with the original image, give three frames.
3. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 1, wherein the specific method for creating the unmanned aerial vehicle target tracking network model based on the space-time memory network in step 2 is as follows: a memory branch encoder, a query branch encoder, a feature fusion module, a decoder and a bounding box prediction head are constructed using a Vision Transformer; the outputs of the memory branch encoder and of the query branch encoder are both connected to the input of the feature fusion module, the output of the feature fusion module is connected to the input of the decoder, and the output of the decoder is connected to the bounding box prediction head.
4. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 3, wherein the bounding box prediction head comprises a classification head and a regression head connected in turn, each of which is constructed from 3 convolution blocks.
5. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 3 or 4, wherein step 3 is specifically implemented according to the following steps:
Step 3.1, taking one image in every group of three images in the training data set as the search image and the other two as template images; cropping the three images at a certain scale around the target, the template images being cropped to x × x and the search image to 2x × 2x;
Step 3.2, dividing the template images and the search image into non-overlapping image blocks of 16 × 16 pixels, obtaining the template image block sequences S_T1 and S_T2 and the search image block sequence S_S;
Step 3.3, performing random masking on the search image block sequence S_S and removing the masked image blocks from the sequence, obtaining the masked image block sequence S'_S and the mask tokens (mask_token); then splicing S'_S and S_T1 together to obtain the image block sequence S'_x;
Step 3.4, sending the spliced image block sequence S'_x into the query branch encoder and S_T2 into the memory branch encoder; building the relations between image blocks through the self-attention mechanism in the Vision Transformer, obtaining the encoded image block sequences S_query_encode and S_mem_encode, wherein the attention calculation formula is as follows:
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
wherein Q, K, V are matrices obtained by linear transformation of the input, d_k is the dimension of the matrices Q and K, softmax() represents the normalized exponential function, and Attention() is the attention function;
Step 3.5, similarly to the encoder, constructing a symmetric decoder using the Vision Transformer; splitting the encoded image block sequence S_query_encode into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing S_S_encode and the mask tokens (mask_token) together to compose the query coding sequence S_query, wherein each mask token is a shared, learnable vector representing a missing image block to be predicted; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the query coding sequence S_query and the memory coding sequence S_memory to obtain the fused feature S_feature, wherein the feature fusion calculation formula is as follows:
S_feature = w · S_memory, with w = softmax(S_query (S_memory)^T / s)
wherein (S_memory)^T is the transpose of S_memory and w is the similarity weight between S_query and S_memory, calculated element-wise as:
w_ij = exp(S_query^i · S_memory^j / s) / Σ_j exp(S_query^i · S_memory^j / s)
wherein i is the index of a pixel of S_query, j is the index of a pixel of S_memory, · denotes the vector dot product, and s is a scale factor;
Step 3.6, sending the fused feature S_feature into the decoder, which performs mask reconstruction according to the input information and reconstructs the input image by predicting the pixel values of each image block occluded by the mask; each element output by the decoder represents the pixel value vector of one image block, the number of output channels equals the number of pixel values in one image block, and the output is then reshaped into the reconstructed image;
Step 3.7, sending the reconstructed image into the bounding box prediction head, performing classification and regression respectively to obtain a predicted bounding box, calculating the mean square error losses between the reconstructed image and the original image and between the predicted bounding box and the ground-truth bounding box, back-propagating the losses, and updating the model weights.
6. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 5, wherein step 4 is specifically implemented according to the following steps:
Step 4.1, taking one image in every group of three images in the training data set as the search image and the other two as template images; cropping the images at a certain scale around the target, the template images being cropped to x × x and the search image to 2x × 2x; dividing them into non-overlapping image blocks of equal size, obtaining the template image block sequences S_T1 and S_T2 and the search image block sequence S_S;
Step 4.2, splicing the template image block sequence S_T1 and the search image block sequence S_S together to obtain the image block sequence S_x;
Step 4.3, sending the spliced image block sequence S_x into the query branch encoder and S_T2 into the memory branch encoder; building the relations between image blocks through the self-attention mechanism in the Vision Transformer, obtaining the encoded image block sequences S_query_encode and S_mem_encode;
Step 4.4, splitting the encoded image block sequence S_query_encode into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the search image block sequence S_S_encode and the memory coding sequence S_memory, sending the fused features to the decoder, and finally sending the decoded features to the bounding box prediction head to obtain the final target position prediction.
7. The unmanned aerial vehicle target tracking method based on the space-time memory network according to claim 6, wherein step 5 is specifically implemented according to the following steps:
Step 5.1, cropping an image of size x × x around the position of the given target in the first frame of the video sequence as the template image, cutting the template image into image blocks of fixed size to obtain the image block sequence S_T, and sending S_T to the memory branch encoder to obtain S_mem_encode;
Step 5.2, reading the next frame and cropping an image of size 2x × 2x centred on the target predicted in the previous frame as the search image, cutting the search image into image blocks of fixed size to obtain the image block sequence S_S; splicing S_T and S_S together while embedding position codes to represent the relative positions of the image blocks, obtaining the input sequence S_input; sending S_input into the trained query branch encoder and splitting the encoded image block sequence into the search image block sequence S_S_encode and the template image block sequence S_T_encode; splicing the template image block sequence S_T_encode and S_mem_encode to form the memory coding sequence S_memory; performing feature fusion using the search image block sequence S_S_encode and the memory coding sequence S_memory, and sending the fused features to the decoder;
Step 5.3, sending the decoded features into the bounding box prediction head to obtain the target position predicted for the current frame;
Step 5.4, cropping the current frame to size x × x centred on the predicted target position, cutting it into image blocks of fixed size and sending them into the memory branch encoder to obtain S_mem; splicing S_mem onto S_mem_encode;
Step 5.5, reading the next frame and repeating steps 5.2 to 5.4 until the whole video sequence is finished, obtaining the tracking result for the input video.
CN202310156686.2A 2023-02-23 2023-02-23 Unmanned aerial vehicle target tracking method based on space-time memory network Pending CN116630369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310156686.2A CN116630369A (en) 2023-02-23 2023-02-23 Unmanned aerial vehicle target tracking method based on space-time memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310156686.2A CN116630369A (en) 2023-02-23 2023-02-23 Unmanned aerial vehicle target tracking method based on space-time memory network

Publications (1)

Publication Number Publication Date
CN116630369A true CN116630369A (en) 2023-08-22

Family

ID=87615821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310156686.2A Pending CN116630369A (en) 2023-02-23 2023-02-23 Unmanned aerial vehicle target tracking method based on space-time memory network

Country Status (1)

Country Link
CN (1) CN116630369A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333514A (en) * 2023-12-01 2024-01-02 科大讯飞股份有限公司 Single-target video tracking method, device, storage medium and equipment
CN117333514B (en) * 2023-12-01 2024-04-16 科大讯飞股份有限公司 Single-target video tracking method, device, storage medium and equipment

Similar Documents

Publication Publication Date Title
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
CN109711463B (en) Attention-based important object detection method
CN111079532A (en) Video content description method based on text self-encoder
CN115393396B (en) Unmanned aerial vehicle target tracking method based on mask pre-training
Chen et al. Log hyperbolic cosine loss improves variational auto-encoder
CN110334589B (en) High-time-sequence 3D neural network action identification method based on hole convolution
CN113657388B (en) Image semantic segmentation method for super-resolution reconstruction of fused image
CN111428718A (en) Natural scene text recognition method based on image enhancement
CN113344932B (en) Semi-supervised single-target video segmentation method
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN111738054A (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN113139468A (en) Video abstract generation method fusing local target features and global features
CN115690152A (en) Target tracking method based on attention mechanism
CN116630369A (en) Unmanned aerial vehicle target tracking method based on space-time memory network
CN113869234B (en) Facial expression recognition method, device, equipment and storage medium
CN115393690A (en) Light neural network air-to-ground observation multi-target identification method
CN116863384A (en) CNN-Transfomer-based self-supervision video segmentation method and system
CN115713546A (en) Lightweight target tracking algorithm for mobile terminal equipment
CN112418229A (en) Unmanned ship marine scene image real-time segmentation method based on deep learning
CN115690917B (en) Pedestrian action identification method based on intelligent attention of appearance and motion
CN116543338A (en) Student classroom behavior detection method based on gaze target estimation
CN116597503A (en) Classroom behavior detection method based on space-time characteristics
CN117196963A (en) Point cloud denoising method based on noise reduction self-encoder
CN116503314A (en) Quality inspection system and method for door manufacturing
CN115830505A (en) Video target segmentation method and system for removing background interference through semi-supervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination