CN115661688B - Unmanned aerial vehicle target re-identification method, system and equipment with rotation invariance - Google Patents
- Publication number: CN115661688B
- Application number: CN202211225141.4A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a rotation-invariant target re-identification method, system and equipment for unmanned aerial vehicles. A block generation module divides the original picture into a plurality of partially overlapping small blocks; the small blocks are flattened into a sequence, a randomly initialized block is added as the subsequent global classification feature, and all blocks are input into Transformer layers of depth h. The features learned by the Transformer layers enter two branches: one branch performs feature-level rotation to obtain a plurality of rotation features, and the other branch passes through a further Transformer layer to obtain an original feature. The average of the rotation features and the original feature are then optimized with a rotation-invariance constraint loss; the rotation features and the original feature are optimized with a triplet loss; finally, the pictures are classified and identified through a fully connected layer and a batch normalization layer. The invention enhances generalization to target angle changes in unmanned aerial vehicle scenes and improves retrieval accuracy.
Description
Technical Field
The invention belongs to the technical field of computer vision target retrieval, relates to an unmanned aerial vehicle target re-identification method, system and equipment, and in particular relates to an unmanned aerial vehicle target re-identification method, system and equipment with rotation invariance.
Background
Target re-identification (Re-ID) is the task of retrieving a specific object (e.g., a pedestrian or vehicle) across non-overlapping cameras [documents 1-3]. Prior research in this field has focused mainly on urban cameras. However, conventional city cameras are limited in capturing images, particularly over large open areas: their positions are fixed, the shooting range is limited, and blind spots exist [document 4]. With the rapid development of unmanned aerial vehicles in video monitoring, drones can now easily cover large and hard-to-reach areas, offering a wider variety of irreplaceable viewing angles [documents 5-6]. This technique can be applied in many scenarios, such as urban security and the management of large public places. We define a new task that is more challenging than ordinary target re-identification: target re-identification in unmanned aerial vehicle scenes, i.e., recognizing a particular target among many aerial images captured from high-altitude, overhead viewing angles.
Compared to a fixed city camera, the rapid movement and constantly changing height of the drone produce large viewing-angle differences [document 6]. To correctly identify an identity, the image needs to contain the whole body of the target. This brings two major difficulties: 1) The resulting bounding-box shapes vary greatly, and the bounding box contains more background area than under normal viewing angles, which makes the model more susceptible to meaningless content. 2) Within the bounding box, the same person's body appears in different rotation directions, so the intra-class distance in unmanned aerial vehicle target re-identification is larger than in conventional target re-identification. Widely used convolutional neural network models struggle to identify targets under such large rotation and viewing-angle changes.
A large number of target re-identification methods based on convolutional neural networks [documents 2, 7, 8, 9, 10] have achieved great success in urban camera scenarios, but they have difficulty handling the rotation problem in unmanned aerial vehicle scenarios. Pedestrian images captured by drones inevitably contain a large portion of background, and convolution is a typical operation between locally adjacent pixels [document 11]. Convolutional methods therefore spend too much capacity on the background to accurately model the target regions that provide useful information, limiting their applicability to aerial scenes. The Transformer, by contrast, is a structure based entirely on the attention mechanism, and the Vision Transformer [document 12] exhibits a powerful ability to model global and long-range relationships between input image blocks. This property motivates us to study a rotation-invariant solution under the Transformer framework.
On the rotation problem, some studies based on convolutional neural networks achieve rotation invariance for image classification, object detection, and other visual tasks [documents 13-15]. For example, adaptability to image transformations is improved by inserting a learnable module into a convolutional neural network [document 15]; other methods achieve rotation invariance by forcing training samples before and after rotation to share similar feature representations [document 14]. But these methods, based on convolution and two-dimensional image-level operations, are difficult to apply to a Transformer because of its block-based operation.
In summary, it is important to design a rotation-invariant feature learning model for unmanned aerial vehicle Re-ID to solve the above-mentioned problems.
[Document 1] Ying-Cong Chen, Xiatian Zhu, Wei-Shi Zheng, and Jian-Huang Lai. 2017. Person re-identification by camera correlation aware feature augmentation. IEEE TPAMI 40, 2 (2017), 392-408.
[Document 2] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven C. H. Hoi. 2021. Deep learning for person re-identification: A survey and outlook. IEEE TPAMI (2021), 1-1.
[Document 3] Liang Zheng, Yi Yang, and Alexander G Hauptmann. 2016. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984 (2016).
[Document 4] Shizhou Zhang, Qi Zhang, Yifei Yang, Xing Wei, Peng Wang, Bingliang Jiao, and Yanning Zhang. 2020. Person re-identification in aerial imagery. IEEE TMM 23 (2020), 281-291.
[Document 5] SV Kumar, Ehsan Yaghoubi, Abhijit Das, BS Harish, and Hugo Proença. 2020. The P-DESTRE: a fully annotated dataset for pedestrian detection, tracking, re-identification and search from aerial devices. arXiv preprint arXiv:2004.02782 (2020).
[Document 6] Tianjiao Li, Jun Liu, Wei Zhang, Yun Ni, Wenqian Wang, and Zhiheng Li. 2021. UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles. In CVPR. 16266-16275.
[Document 7] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV. 480-496.
[Document 8] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. 2018. Learning discriminative features with multiple granularities for person re-identification. In ACM MM. 274-282.
[Document 9] Hao Luo, Wei Jiang, Youzhi Gu, Fuxu Liu, Xingyu Liao, Shenqi Lai, and Jianyang Gu. 2019. A strong baseline and batch normalization neck for deep person re-identification. IEEE TMM 22, 10 (2019), 2597-2609.
[Document 10] Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang. 2019. Omni-scale feature learning for person re-identification. In ICCV. 3702-3712.
[Document 11] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In CVPR. 7794-7803.
[Document 12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[Document 13] Aharon Azulay and Yair Weiss. 2018. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177 (2018).
[Document 14] Gong Cheng, Peicheng Zhou, and Junwei Han. 2016. RIFD-CNN: Rotation-invariant and Fisher discriminative convolutional neural networks for object detection. In CVPR. 2884-2893.
[Document 15] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. Advances in Neural Information Processing Systems 28 (2015), 2017-2025.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a Vision Transformer (ViT)-based unmanned aerial vehicle target re-identification method, system and equipment with rotation invariance, which improve the accuracy of target re-identification in unmanned aerial vehicle scenes.
The technical scheme adopted by the method is as follows: the unmanned aerial vehicle target re-identification method with rotation invariance adopts a rotation invariance target identification network to carry out unmanned aerial vehicle target re-identification; the rotation-invariant target recognition network comprises a block generation module and a plurality of Transformer layers;
The method specifically comprises the following steps:
Step 1: dividing an original picture into a plurality of partially overlapping small blocks through the block generation module;
Step 2: flattening the small blocks into a sequence, adding a randomly initialized block as the subsequent global classification feature, and inputting all the blocks into Transformer layers of depth h;
Step 3: passing the features learned by the Transformer layers in step 2 into two branches, where one branch performs feature-level rotation to obtain a plurality of rotation features, and the other branch passes through a further Transformer layer to obtain an original feature;
Step 4: optimizing the average of the plurality of rotation features and the original feature with a rotation-invariance constraint loss;
Step 5: optimizing the plurality of rotation features and the original feature processed in step 4 with a triplet loss;
Step 6: classifying and identifying the pictures processed in step 5 through the fully connected layer and the batch normalization layer.
The system of the invention adopts the technical proposal that: an unmanned aerial vehicle target re-identification system with rotation invariance is used for carrying out unmanned aerial vehicle target re-identification by adopting a rotation invariance target identification network; the rotation-invariant target recognition network comprises a block generation module and a plurality of Transformer layers;
the method specifically comprises the following modules:
The module 1 is used for dividing an original picture into a plurality of partially overlapping small blocks through the block generation module;
The module 2 is used for flattening the small blocks into a sequence, adding a randomly initialized block as the subsequent global classification feature, and inputting all the blocks into Transformer layers of depth h;
The module 3 is used for passing the features learned by the Transformer layers in module 2 into two branches, where one branch performs feature-level rotation to obtain a plurality of rotation features, and the other branch passes through a further Transformer layer to obtain an original feature;
a module 4 for optimizing an average of a plurality of rotation features and an original feature using rotation invariant constraint losses;
A module 5, configured to optimize the multiple rotation features and the original feature processed by the module 4 by using a triplet loss;
and the module 6 is used for classifying and identifying the pictures processed by the module 5 through the full connection layer and the batch normalization layer.
The technical scheme adopted by the equipment is as follows: an unmanned aerial vehicle target re-identification device with rotational invariance, comprising:
One or more processors;
And the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize the unmanned aerial vehicle target re-identification method with rotation invariance.
The invention has the following advantages:
(1) The invention designs a novel feature-level rotation strategy to enhance generalization to the rotation changes of unmanned aerial vehicle imagery.
(2) The invention integrates rotation invariance constraint into the characteristic learning process, enhances the robustness to space change and reduces the error classification caused by rotation change.
(3) The method provided by the invention evaluates on the unmanned aerial vehicle and the urban camera, and obtains better performance than the prior art. On the challenging PRAI-1581 dataset, rank-1/mAP was promoted from 63.3%/55.1% to 70.8%/63.7%.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a diagram of a rotation invariant object identification network architecture in accordance with an embodiment of the present invention;
Fig. 3 is a feature level rotation schematic of an embodiment of the present invention.
Detailed Description
To facilitate understanding and practice of the invention by those of ordinary skill in the art, it is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described herein are for illustration and explanation only and are not intended to limit the scope of the invention.
ViT has strong modeling and generalization capability and performs excellently on common target recognition tasks. The core idea of the invention is to design, on top of ViT, a new feature-level rotation strategy to enhance generalization to rotation variation, and to integrate a rotation-invariance constraint into the feature learning process, enhancing robustness to spatial variation and reducing misclassification caused by rotation changes. In particular, the invention simulates block-feature rotation at the feature level to produce rotated features. Finally, the invention establishes strong constraints between the plurality of rotation features and the original feature, and optimizes them together with the original target, thereby improving the retrieval rate.
Referring to fig. 1, the unmanned aerial vehicle target re-identification method with rotation invariance provided by the invention adopts a rotation-invariant target recognition network; the rotation-invariant target recognition network comprises a block generation module and a plurality of Transformer layers.
Referring to fig. 2, the block generation module of this embodiment is a convolution layer that divides the source image into 16×16 blocks in an overlapping manner: the convolution kernel size is 16×16 with a stride of 12. Each Transformer layer consists of MSA (multi-headed self-attention) and an MLP (two fully connected layers with GELU activation); LayerNorm is applied before both the MSA and the MLP, and residual connections are added around them.
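As a concrete illustration of the overlapping block generation described above, the sketch below slices 16×16 patches with stride 12 out of a 256×256 image (the patch size and stride are from the text; the NumPy implementation and helper names are my own, not the patent's code):

```python
import numpy as np

def overlap_patch_grid(W, H, P=16, S=12):
    """Number of patches per axis for overlapping patch embedding
    with patch size P and stride S."""
    X = (W - P) // S + 1
    Y = (H - P) // S + 1
    return X, Y

def extract_patches(img, P=16, S=12):
    """Slice an H x W x C image into overlapping P x P patches,
    flattened into a sequence of length X * Y."""
    H, W = img.shape[:2]
    X, Y = overlap_patch_grid(W, H, P, S)
    patches = [img[y*S:y*S+P, x*S:x*S+P]
               for y in range(Y) for x in range(X)]
    return np.stack(patches)  # shape (N, P, P, C)

img = np.zeros((256, 256, 3))
X, Y = overlap_patch_grid(256, 256)
seq = extract_patches(img)
print(X, Y, seq.shape[0])  # 21 21 441
```

With these settings a 256×256 image yields a 21×21 grid of 441 overlapping patches, which is the block sequence the Transformer layers consume.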
The method of the embodiment specifically comprises the following steps:
Step 1: dividing an original picture into a plurality of partially overlapping small blocks through the block generation module;
Step 2: flattening the small blocks into a sequence, adding a randomly initialized block as the subsequent global classification feature, and inputting all the blocks into Transformer layers of depth h;
Step 3: passing the features learned by the Transformer layers in step 2 into two branches, where one branch performs feature-level rotation to obtain a plurality of rotation features, and the other branch passes through a further Transformer layer to obtain an original feature;
global feature representation by web learning N+1 here consists of a sequence of blocks of length N (denoted f p) and one global classification feature (denoted c O, comprising N original features). To simulate a rotation operation in two dimensions, the present embodiment will/>Reconstruction as/>X and Y herein represent the spatial size of step S generated by the overlapped block embedding. The calculation formula of X and Y is:
wherein W, H is the length and width of the image, P is the size of one block, and D is the dimension;
Referring to fig. 3, this embodiment treats each block as a pixel, so f_res can be viewed as a two-dimensional matrix, and an operation analogous to a rotation matrix can be applied at the block-feature level. Because the drone moves continuously, the angle of the captured image varies randomly; this embodiment therefore randomly generates a series of angles A = {θ_i | i = 1, 2, …, n}. With the coordinates of each block vector in the two-dimensional matrix denoted (x, y), rotation by angle θ maps them to:
x' = x·cos θ − y·sin θ
y' = x·sin θ + y·cos θ
Unlike pixel-based picture rotation, feature-level rotation operates on larger blocks, so rotating by a small angle actually simulates a relatively large rotation. This embodiment therefore defines a parameter α to limit the magnitude of the generated angles, θ ∈ [−α, α]. Performing the above rotation operation yields a series of multi-angle rotated features F_r = {f_r1, f_r2, …, f_rn}. In this step, multi-angle features of the unmanned aerial vehicle scene are introduced into the model in advance to simulate diversified rotations, while the global classification feature learns all information from the original pictures.
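The feature-level rotation of this step can be sketched as below. The grid shape and the angle limit α follow the text; the nearest-neighbour sampling, rotation about the grid centre, and zero-filling of out-of-grid cells are my assumptions about the implementation:

```python
import numpy as np

def rotate_patch_grid(f_res, theta_deg):
    """Rotate a (Y, X, D) grid of patch features by theta degrees,
    treating each block vector as one 'pixel'. Uses the inverse
    rotation with nearest-neighbour sampling; cells whose source
    falls outside the grid stay zero."""
    Y, X, D = f_res.shape
    t = np.deg2rad(theta_deg)
    cy, cx = (Y - 1) / 2.0, (X - 1) / 2.0   # rotate about the grid centre
    out = np.zeros_like(f_res)
    for y in range(Y):
        for x in range(X):
            # inverse map: where does target cell (x, y) come from?
            xs = (x - cx) * np.cos(t) + (y - cy) * np.sin(t) + cx
            ys = -(x - cx) * np.sin(t) + (y - cy) * np.cos(t) + cy
            xi, yi = int(round(xs)), int(round(ys))
            if 0 <= xi < X and 0 <= yi < Y:
                out[y, x] = f_res[yi, xi]
    return out

alpha = 15.0                                  # angle limit from the text
rng = np.random.default_rng(0)
f_res = rng.normal(size=(21, 21, 8))          # toy patch-feature grid
angles = rng.uniform(-alpha, alpha, size=4)   # theta_i in [-alpha, alpha]
rotated = [rotate_patch_grid(f_res, a) for a in angles]
```

Rotating by 0 degrees leaves the grid unchanged, which is a quick sanity check on the coordinate mapping.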
Step 4: optimizing an average value of a plurality of rotation features and an original feature by adopting rotation invariant constraint loss;
The feature-level rotation in step 3 improves the network's generalization to angle changes from the perspective of diversity. Furthermore, both the rotation features and the original feature represent the same object. This embodiment therefore adds an invariance constraint between the rotated features and the original feature to the loss function to establish a relationship between them; in this way, the intra-class distance is shortened to better facilitate correct classification [Mang Ye, Jianbing Shen, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. 2020. Augmentation invariant and instance spreading feature for softmax embedding. IEEE TPAMI (2020)]. There is a many-to-one relationship between the set of global classification features c_r of the rotated features and the global classification feature c_O of the original feature; establishing a constraint between every pair of original and rotated features would incur significant computational cost. To avoid redundant computation, the mean of the rotation features is used to build invariance:
c_avg = (c_r1 + c_r2 + … + c_rn) / n
where c_r1, c_r2, …, c_rn are the global classification features added for the respective rotation features.
The aim of this embodiment is to limit the difference between the average rotation feature and the original feature, while ensuring that the discriminability of the rotation features is not impaired. Mean squared error (MSE), the sum of squared differences between predicted and target values, is the most commonly used loss function; this embodiment instead selects the Smooth L1 loss to compute the difference, which effectively prevents the gradient explosion problem. The rotation-invariance constraint of this part is expressed as:
L_RI = SmoothL1(c_avg, c_O)
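A minimal sketch of this invariance constraint, assuming the standard Smooth L1 definition with β = 1 (function names and the NumPy implementation are mine):

```python
import numpy as np

def smooth_l1(a, b, beta=1.0):
    """Element-wise Smooth L1 (Huber-style) difference, averaged."""
    d = np.abs(a - b)
    return float(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean())

def rotation_invariance_loss(rotated_cls, original_cls):
    """Constrain the MEAN of the rotated classification features to stay
    close to the original one: a single constraint instead of n pairwise
    constraints, as described in the text."""
    c_avg = np.mean(rotated_cls, axis=0)
    return smooth_l1(c_avg, original_cls)

rng = np.random.default_rng(0)
c_o = rng.normal(size=768)                                    # original [cls] feature
c_rs = [c_o + 0.01 * rng.normal(size=768) for _ in range(4)]  # 4 rotated ones
loss = rotation_invariance_loss(c_rs, c_o)
```

When the rotated features coincide with the original, the constraint is exactly zero; small perturbations give a small positive penalty.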
During the training phase, the overall loss function consists of three parts. While the rotated features are updated, the original feature is also input to a Transformer layer to further update the global classification feature; the original feature learned by the multiple Transformer layers is denoted c_O. After batch normalization, the triplet loss and cross-entropy loss are applied to it:
L_O = L_ID(c_O) + L_Tri(c_O)
Furthermore, the average rotation feature is an auxiliary feature representation that accommodates angular diversity, and the invariance constraint controls the difference between the original and rotated features. The overall learning objective is:
L = λ·L_O + (1 − λ)·L_R + L_RI
where L_O and L_R are the losses on the original and rotated features, and λ and 1 − λ are their respective weights.
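The three-part objective above can be combined as sketched below; treating the invariance term as unweighted is my assumption, since the text only assigns λ and 1 − λ to the original and rotated losses:

```python
def overall_loss(loss_orig, loss_rot, loss_ri, lam=0.5):
    """Overall objective: lambda weights the original-feature loss L_O,
    (1 - lambda) weights the rotated-feature loss L_R, and the
    rotation-invariance constraint L_RI is added on top (assumed
    unweighted here)."""
    return lam * loss_orig + (1 - lam) * loss_rot + loss_ri

# with lambda = 0.5 (the value used in the experiments), the original
# and rotated losses contribute equally
total = overall_loss(loss_orig=1.2, loss_rot=0.8, loss_ri=0.05)
print(round(total, 4))  # 1.05
```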
Step 5: adopting triple loss optimization to the plurality of rotation features and one original feature processed in the step 4;
Step 6: and (5) classifying and identifying the pictures processed in the step (5) through the full connection layer and the batch normalization layer (BN layer).
The rotation-invariant target recognition network of this embodiment is trained as follows. Due to the randomness of the rotation, each rotation feature contains different information and can be seen as a new feature. To learn these various features, the two-dimensional f_res ∈ R^(X×Y×D) is flattened back to R^(N×D) so that the Transformer can receive the block sequence. Each rotated feature has the same N blocks as the original feature, and it is difficult for the blocks alone to cover all information when classifying; the global classification features learned by the multiple Transformer layers integrate them into a global feature expression. This embodiment therefore prepends a copy of the original global classification feature c_O to each of the n rotation features, so that each rotation feature can be classified through a learnable global classification token, giving samples c_r1, c_r2, …, c_rn for the n rotated block sequences. A Transformer layer is then built for each sample to ensure learning diversity. During training, the global classification feature c_r of each rotation feature is updated from the original global classification feature c_O, which already contains rich feature information; this effectively avoids information loss in the rotated features. This embodiment sets up n classifiers for the Transformer-updated global classification features of the rotation features, and the most common cross-entropy loss function is applied after batch normalization [document 9].
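The flattening and global-token replication described above can be sketched as follows (the 21×21 grid matches a 256×256 image with patch 16 and stride 12; the toy dimension D = 8 and all names are mine):

```python
import numpy as np

Y, X, D, n = 21, 21, 8, 4                   # 21x21 grid, n = 4 rotations
rng = np.random.default_rng(0)
f_res = rng.normal(size=(Y, X, D))          # one rotated 2D patch-feature grid
c_o = rng.normal(size=(1, D))               # original global classification feature

# flatten the 2D grid back to a length-N sequence for the Transformer
f_flat = f_res.reshape(Y * X, D)

# prepend a copy of c_O to each of the n rotated features, giving n
# independently classifiable sequences whose [cls] tokens become c_r1..c_rn
rotated_seqs = [np.concatenate([c_o, f_flat], axis=0) for _ in range(n)]
print(len(rotated_seqs), rotated_seqs[0].shape)  # 4 (442, 8)
```

Each of the n sequences would then pass through its own Transformer layer and classifier, as the text describes.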
Furthermore, for fine-grained identification, the triplet loss [Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In Defense of the Triplet Loss for Person Re-Identification. arXiv preprint arXiv:1703.07737 (2017)] is applied to each global classification feature of the rotation features. The final loss on the rotation features is:
L_R = (1/n) · Σ_{i=1..n} [L_ID(c_ri) + L_Tri(c_ri)]
Each global classification feature representing a rotation feature plays an equivalent role in updating the entire model.
The principles of the present embodiment are further described below in connection with specific experiments.
The deep learning framework adopted in this embodiment is PyTorch. The hardware environment of the experiment comprises eight NVIDIA GeForce RTX 3090 graphics cards and an Intel(R) Xeon(R) Gold 6240 processor. The experimental procedure is as follows:
the first step: rotation feature generation network construction
The experiment uses a Vision Transformer (ViT) network as the feature extractor, simulates block-feature rotation at the feature level to generate rotation features, and finally establishes the constraint between the original and rotated features. Identity classification loss, triplet loss, Smooth L1 loss and cross-entropy loss are adopted, and the feature extractor, rotation-feature generation network and rotation-invariance constraint are trained end to end.
And a second step of: network training
Divide the target-object photos and the unmanned aerial vehicle photos into a training set and a test set. The target-object photos are fed into the feature rotation network for training, and the network parameters are optimized and updated using forward and backward propagation.
And a third step of: network testing
The images of the target objects in the test set serve as the query set, and the samples shot by the unmanned aerial vehicle serve as the gallery set. The model with the best performance during training is used for inference to obtain the final retrieval results on the test set. The evaluation indices are the Rank-1, Rank-5, mAP and mINP matching precision, which reflect the probability of retrieving the correct re-identification image.
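The Rank-1 and mAP indices mentioned above can be computed from a query-by-gallery distance matrix as sketched below (a simplified version that ignores camera-ID filtering; function and variable names are mine):

```python
import numpy as np

def rank1_and_map(dist, q_ids, g_ids):
    """Rank-1 accuracy and mean average precision for a
    (num_query, num_gallery) distance matrix."""
    g_ids = np.asarray(g_ids)
    r1_hits, aps = [], []
    for i, qid in enumerate(q_ids):
        order = np.argsort(dist[i])        # gallery sorted by distance
        matches = g_ids[order] == qid      # True where the identity matches
        r1_hits.append(bool(matches[0]))   # correct at rank 1?
        hit_ranks = np.where(matches)[0]
        # precision at each correct-match rank, averaged -> AP for this query
        aps.append(np.mean([(k + 1) / (r + 1) for k, r in enumerate(hit_ranks)]))
    return float(np.mean(r1_hits)), float(np.mean(aps))

dist = np.array([[0.1, 0.9],
                 [0.9, 0.1]])
rank1, mAP = rank1_and_map(dist, q_ids=[0, 1], g_ids=[0, 1])
print(rank1, mAP)  # 1.0 1.0
```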
PRAI-1581, UAVHuman and VRAI are three datasets shot by unmanned aerial vehicles; Market-1501 and MSMT17 are two commonly used pedestrian re-identification datasets collected with ground monitoring cameras. PRAI-1581 is a dataset proposed for unmanned aerial vehicle tasks; it consists of 39461 images of 1581 pedestrians shot by two drones flying at heights of 20 to 60 meters. UAVHuman is mainly used for drone pedestrian-behavior research and can also be used for various tasks such as pedestrian re-identification, action recognition and altitude estimation; this dataset contains 1444 pedestrians and 41290 images. VRAI is a vehicle re-identification dataset consisting of 137613 photographs of 13033 vehicles; the vehicle pictures are collected by drones flying at different places at heights of 15 to 80 meters, with rich annotations including colors, vehicle categories, attributes and discriminative parts.
The invention uniformly resizes the images to 256×256. In addition, padding of 10 pixels, random cropping and random erasing with a probability of 0.5 are employed on the training data. Network parameters are initialized with weights pre-trained on ImageNet-1K. In the overlapped block embedding stage, the patch size is set to 16 and the stride to 12. In the feature-level rotation, the number n of rotation features is 4, and the random rotation angle ranges from −15 to 15 degrees; the angle is not set too large because the rotation is block-based. For the original and rotated features extracted from the backbone, a triplet loss without margin is used, and the cross-entropy loss is applied after the features pass through the batch normalization layer. The weight λ of the original feature is 0.5 and the weight 1 − λ of the rotated features is 0.5. The Smooth L1 loss is applied between the average rotation feature and the original feature. During training, a stochastic gradient descent (SGD) optimizer is used with an initial learning rate of 0.008 and cosine learning-rate decay, for 200 training epochs. The batch size is set to 64, comprising 16 identities with 4 images each. In the test phase, the distance matrix is calculated using only the original features. The whole experiment is implemented in PyTorch.
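The cosine learning-rate decay from the training setup can be sketched as below (the base LR of 0.008 and 200 epochs come from the text; the exact schedule shape, including a minimum LR of 0, is my assumption):

```python
import math

def cosine_lr(epoch, total_epochs=200, base_lr=0.008, min_lr=0.0):
    """Cosine learning-rate decay from base_lr at epoch 0 down to
    min_lr at the final epoch."""
    cos = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos

print(round(cosine_lr(0), 5))    # 0.008  (start of training)
print(round(cosine_lr(200), 5))  # 0.0    (end of training)
```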
To verify the effectiveness of the invention, this section compares its retrieval results with existing unmanned aerial vehicle re-identification methods. The existing target re-identification methods compared are mainly the following:
(1) PCB: Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C. Kot, and Gang Wang. 2018. Dual attention matching network for context-aware feature sequence based person re-identification. In CVPR. 5363-5372.
(2) SP: Shizhou Zhang, Qi Zhang, Yifei Yang, Xing Wei, Peng Wang, Bingliang Jiao, and Yanning Zhang. 2020. Person re-identification in aerial imagery. IEEE TMM 23 (2020), 281-291.
(3) AGW: Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven C. H. Hoi. 2021. Deep learning for person re-identification: A survey and outlook. IEEE TPAMI (2021), 1-1.
(4) Multi-task: Peng Wang, Bingliang Jiao, Lu Yang, Yifei Yang, Shizhou Zhang, Wei Wei, and Yanning Zhang. 2019. Vehicle re-identification in aerial imagery: Dataset and approach. In ICCV. 460-469.
(5) Baseline (ViT): Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, and Wei Jiang. 2021. TransReID: Transformer-based object re-identification. In ICCV. 15013-15022.
(6) TransReID: Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, and Wei Jiang. 2021. TransReID: Transformer-based object re-identification. In ICCV. 15013-15022.
Tests were performed on the PRAI-1581, UAVHuman and VRAI datasets; the results are shown in Table 1.
TABLE 1
Tests were performed on the Market-1501 and MSMT17 datasets; the results are shown in Table 2.
TABLE 2
As can be seen from Tables 1 and 2, compared with recent Re-ID methods, the method provided by the invention improves the retrieval rate both in unmanned aerial vehicle target re-identification and in urban-camera target re-identification. On the PRAI-1581 dataset, the performance of the method is clearly superior to all methods in the table, exceeding the current best method TransReID by 4.8% on Rank-1 and 5.9% on mAP. On the UAVHuman dataset, the mAP is 2% better than that of TransReID. On the VRAI dataset, the method achieves a Rank-1 accuracy of 83.5% and an mAP of 84.8% without using any auxiliary information, exceeding all other methods. On the Market-1501 and MSMT17 datasets, the experiments also show that the method has strong generalization capability in ordinary city-camera scenes, with mAP and Rank-1 improved by 5.4% and 3.2%, respectively, over the current best method. The experimental results on three datasets collected by unmanned aerial vehicles and two datasets collected by ground cameras demonstrate the effectiveness and generalization of the method.
It should be understood that the foregoing description of the preferred embodiments is not intended to limit the scope of the invention, which is defined by the appended claims; those skilled in the art may make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims.
Claims (8)
1. The unmanned aerial vehicle target re-identification method with rotation invariance is characterized by comprising the following steps: performing unmanned aerial vehicle target re-identification by adopting a rotation-invariant target recognition network; the rotation-invariant target recognition network comprises a block generation module and a plurality of Transformer layers;
The method specifically comprises the following steps:
Step 1: dividing an original picture into a plurality of small blocks in a partially overlapped manner through the block generation module;
Step 2: flattening the small blocks into a sequence, adding a randomly initialized small block as a subsequent global classification feature, and inputting all the small blocks into a Transformer layer with depth h;
Step 3: the features obtained through the Transformer-layer learning in step 2 enter two branches: one branch performs feature-level rotation to obtain a plurality of rotation features, and the other branch learns through a Transformer layer to obtain an original feature;
Step 4: optimizing the average value of the plurality of rotation features and the original feature by adopting a rotation-invariance constraint loss;
Step 5: applying triplet loss optimization to the plurality of rotation features and the original feature processed in step 4;
Step 6: classifying and identifying the pictures processed in step 5 through the fully connected layer and the batch normalization layer.
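The two-branch flow of steps 2-6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the module names are hypothetical, a standard PyTorch Transformer encoder layer stands in for the patent's Transformer layers, and the feature-level rotation itself (claim 4) is sketched separately, so each rotated branch here simply re-encodes the tokens as a stand-in.

```python
# Two branches after the shared Transformer layers: an original feature and
# n_rot rotation features, then batch normalization + fully connected
# classification on the original branch (step 6).
import torch
import torch.nn as nn

class TwoBranchHead(nn.Module):
    def __init__(self, dim=768, n_rot=4, n_classes=751):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.bn = nn.BatchNorm1d(dim)        # batch normalization layer (step 6)
        self.fc = nn.Linear(dim, n_classes)  # fully connected classifier (step 6)
        self.n_rot = n_rot

    def forward(self, tokens):
        # tokens: (B, N + 1, dim), with the added classification token at index 0
        f_orig = self.encoder(tokens)[:, 0]  # original-feature branch (step 3)
        # rotated branches; real feature-level rotation would be applied first
        f_rots = [self.encoder(tokens)[:, 0] for _ in range(self.n_rot)]
        logits = self.fc(self.bn(f_orig))
        return f_orig, f_rots, logits
```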
2. The unmanned aerial vehicle target re-identification method with rotation invariance according to claim 1, wherein: the block generation module comprises a convolution layer with a kernel size of 16 x 16 and a stride of 12; the original picture is divided in units of 16 x 16 in an overlapping manner.
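The overlapped block embedding of claim 2 can be realized as a single strided convolution, shown below as a sketch (channel width 768 is an assumption borrowed from standard ViT configurations, not stated in the claim). With kernel 16 and stride 12, neighbouring 16 x 16 blocks overlap by 4 pixels.

```python
# Overlapped patch embedding: kernel 16 x 16, stride 12, so a 256 x 256 image
# yields a (256 - 16) // 12 + 1 = 21 x 21 grid of partially overlapping blocks.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=12)
x = torch.rand(1, 3, 256, 256)
tokens = patch_embed(x)  # (1, 768, 21, 21)
```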
3. The unmanned aerial vehicle target re-identification method with rotation invariance according to claim 1, wherein: the Transformer layer consists of multi-headed self-attention (MSA) and a two-layer fully connected network (MLP) using the GELU activation function, where both the MSA and the MLP are preceded by LayerNorm and equipped with residual connections.
4. The unmanned aerial vehicle target re-identification method with rotation invariance according to claim 1, wherein: in step 3, to obtain the plurality of rotation features at the feature level, the block sequence f_p ∈ R^(N×D) is reconstructed as f_res ∈ R^(X×Y×D), where X and Y represent the spatial size generated by the overlapped block embedding with stride S; the calculation formulas of X and Y are:
X = (H - P)/S + 1, Y = (W - P)/S + 1
where W and H are the width and height of the image, P is the size of one block, and D is the feature dimension; f_p denotes the sequence of N blocks, and the global feature f_O ∈ R^((N+1)×D) consists of the block sequence f_p of length N and a global classification feature c_O;
Each block is regarded as a pixel, so that f_res can be viewed as a two-dimensional matrix; a series of angles A = {θ_i | i = 1, 2, …, n} is randomly generated, the coordinates of each block vector in the two-dimensional matrix are expressed as (x, y), and the rotation formula is:
x' = x cos θ_i - y sin θ_i, y' = x sin θ_i + y cos θ_i
The rotation operation yields a plurality of rotation features F_r = {f_r1, f_r2, …, f_rn}.
5. The unmanned aerial vehicle target re-identification method with rotation invariance according to claim 1, wherein the rotation-invariance constraint in step 4 is:
L_inv = SmoothL1( (1/n) Σ_{i=1}^{n} c_ri , c_O )
where c_r1, c_r2, …, c_rn respectively represent the global classification features added for each rotation feature.
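The constraint of claim 5 is a Smooth L1 penalty between the average of the rotated global features and the original global feature (as also stated in the implementation details). A minimal sketch, with an illustrative function name:

```python
# Rotation-invariance constraint: SmoothL1(mean of rotated class features,
# original class feature).
import torch
import torch.nn.functional as F

def rotation_invariant_loss(c_rots, c_orig):
    # c_rots: list of n tensors of shape (B, D); c_orig: (B, D)
    c_mean = torch.stack(c_rots).mean(dim=0)
    return F.smooth_l1_loss(c_mean, c_orig)
```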
6. The unmanned aerial vehicle target re-identification method with rotation invariance according to any one of claims 1-5, wherein: the rotation-invariant target recognition network is a trained rotation-invariant target recognition network; the loss function adopted in the training process is:
L = λ (L_tri^O + L_cls^O) + (1 - λ)·(1/n) Σ_{i=1}^{n} (L_tri^{ri} + L_cls^{ri}) + L_inv
where L_tri^{ri} represents the triplet loss of the rotation features, L_cls^{ri} represents the classification loss of the rotation features, and 1 ≤ i ≤ n.
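The combination of the loss terms can be sketched as below. This is an illustration only: the individual triplet and classification losses are passed in as already-computed scalars, and the λ-weighting follows the description (λ = 0.5 for the original branch, 1-λ for the rotated branches).

```python
# Combine per-branch losses into the total training objective:
# lambda * (original triplet + cls) + (1 - lambda) * mean over rotated
# branches of (triplet + cls) + rotation-invariance term.
def total_loss(l_tri_o, l_cls_o, l_tri_r, l_cls_r, l_inv, lam=0.5):
    # l_tri_r, l_cls_r: lists of per-rotation losses, one entry per rotation
    n = len(l_tri_r)
    rot = sum(t + c for t, c in zip(l_tri_r, l_cls_r)) / n
    return lam * (l_tri_o + l_cls_o) + (1 - lam) * rot + l_inv
```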
7. An unmanned aerial vehicle target re-identification system with rotation invariance, characterized in that: it performs unmanned aerial vehicle target re-identification by adopting a rotation-invariant target recognition network; the rotation-invariant target recognition network comprises a block generation module and a plurality of Transformer layers;
The system specifically comprises the following modules:
Module 1, configured to divide an original picture into a plurality of small blocks in a partially overlapped manner through the block generation module;
Module 2, configured to flatten the small blocks into a sequence, add a randomly initialized small block as a subsequent global classification feature, and input all the small blocks into a Transformer layer with depth h;
Module 3, configured to pass the features obtained through the Transformer-layer learning in module 2 into two branches, where one branch performs feature-level rotation to obtain a plurality of rotation features, and the other branch learns through a Transformer layer to obtain an original feature;
Module 4, configured to optimize the average value of the plurality of rotation features and the original feature using a rotation-invariance constraint loss;
Module 5, configured to optimize the plurality of rotation features and the original feature processed by module 4 using a triplet loss;
Module 6, configured to classify and identify the pictures processed by module 5 through the fully connected layer and the batch normalization layer.
8. An unmanned aerial vehicle target re-identification device with rotation invariance, comprising:
one or more processors;
storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the unmanned aerial vehicle target re-identification method with rotation invariance of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211225141.4A CN115661688B (en) | 2022-10-09 | 2022-10-09 | Unmanned aerial vehicle target re-identification method, system and equipment with rotation invariance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115661688A CN115661688A (en) | 2023-01-31 |
CN115661688B true CN115661688B (en) | 2024-04-26 |
Family
ID=84986413
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112101150A (en) * | 2020-09-01 | 2020-12-18 | 北京航空航天大学 | Multi-feature fusion pedestrian re-identification method based on orientation constraint |
CN113469119A (en) * | 2021-07-20 | 2021-10-01 | 合肥工业大学 | Cervical cell image classification method based on visual converter and graph convolution network |
CN114037899A (en) * | 2021-12-01 | 2022-02-11 | 福州大学 | VIT-based hyperspectral remote sensing image-oriented classification radial accumulation position coding system |
CN114708620A (en) * | 2022-05-10 | 2022-07-05 | 山东交通学院 | Pedestrian re-identification method and system applied to unmanned aerial vehicle at aerial view angle |
Non-Patent Citations (2)
Title |
---|
"AN IMAGE IS WORTH 16×16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE";Alexey Dosovitskiy etc.;《arXiv:2010.11929v2[cs.CV]》;20210603;全文 * |
"Exploring Vision Transformers for Polarimetric SAR Image Classification";Hongwei Dong etc.;《IEEE Transactions on Geoscience and Remote Sensing》;20211222;全文 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||