CN116403015A

CN116403015A - Unsupervised target re-identification method and system based on perception-aided learning transducer model

Info

Publication number: CN116403015A
Application number: CN202310248659.8A
Authority: CN
Inventors: 叶茫; 陈朔怡; 李辰玥
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2023-03-13
Filing date: 2023-03-13
Publication date: 2023-07-07
Anticipated expiration: 2043-03-13
Also published as: CN116403015B

Abstract

The invention discloses an unsupervised target re-identification method and system based on a perception-assisted learning transducer model, and designs a mutual learning method taking discrimination information and detail perception into consideration on the basis of utilizing the advantages of the transducer in terms of global modeling and learning structural information aiming at target re-identification tasks under an unsupervised scene. For discriminant feature learning, based on pseudo tags generated by clustering, the model is optimized by clustering and instance-level combination loss. For perception aided learning, the invention performs local masking on the image at the block level and constructs an alignment strategy guided by the original visual signal to achieve the purpose of fine-grained modeling. Furthermore, the present invention proposes a target-aware masking strategy to avoid part of the background interference. Under the condition that the test setting and reasoning time is not increased, the invention greatly improves the retrieval accuracy of the unsupervised target re-identification task.

Description

Unsupervised target re-identification method and system based on perception-aided learning transducer model

Technical Field

The invention belongs to the technical field of computer vision image retrieval, relates to a target re-identification method and a target re-identification system, and particularly relates to an unsupervised target re-identification method and a target re-identification system based on a perception-assisted learning transducer model.

Background

Unsupervised target Re-identification (Re-ID) is a task of retrieving specific objects (e.g. pedestrians, vehicles) by non-overlapping cameras without tags of data, whose query set consists of captured images of multiple pedestrians or vehicles. Much research in this area has focused mainly on supervised learning methods that require tags. In an actual application scene, a large amount of monitoring image data is marked, so that a large amount of time and labor cost are consumed, and the unsupervised target re-identification can save a large amount of labor cost, so that the method has a great application value.

The lack of identity tags to supervise the syndrome model learning makes unsupervised target re-identification very challenging. The existing unsupervised target re-identification research is mainly based on a convolutional neural network, and the main stream method is to train a model by using clustering pseudo tags. Some of these methods are based on feature extraction of convolutional neural models, some of which focus mainly on generating high quality pseudo tags, while others focus more on the design of clustering algorithms and training strategies. Lin et al (document 1) propose a bottom-up clustering method that exploits diversity among individuals and intra-individual similarity. Dai et al (literature 2) designed an unsupervised baseline based on clusters for storing features and calculating contrast loss at cluster level. RLCCs proposed by Zhang et al (document 3) note improving cluster quality and they propose a method of generating samples that can provide supplemental information to aid in clustering. In addition, PPLR (document 4) uses a relationship between local features and global features to reduce tag noise to improve pseudo tag quality. However, convolutional neural network structures are limited by local receptive fields, and long-distance relationships are difficult to establish in the early stages. Recently, luo et al (document 5) employed a visual transducer-based self-monitoring method, pre-training using a large scale unlabeled pedestrian re-recognition dataset LUperson (document 6). The study shows that the application of the pre-trained visual transducer directly to the existing method can significantly improve the performance of the Re-ID task. Indeed, the self-attention mechanism of the visual transducer has natural long-range properties for efficient global modeling. Furthermore, the transformation is more prone to learn shape and structure information than CNNs that rely on local texture information. For common challenges such as occlusion and interference faced by Re-ID tasks, visual transformers have shape recognition capabilities comparable to the human visual system, with greater robustness (document 7). The potential of visual transformers in the field of unsupervised Re-ID can be further exploited.

On the other hand, while visual transducer has demonstrated a powerful capability to extract feature representations, simply applying it to existing methods still suffers from the lack of capturing fine-grained information. Because the existing unsupervised Re-ID methods are based on global discriminant learning of pseudo tags, they are mainly focused on identity-related attributes at the category level. In fact, the visual perception of the details of the image itself is not well exploited. Compared with convolutional neural networks, visual transformers have greater potential in learning visual information that is rich in images. Using visual transducer to introduce the concept of blocking, mae (document 8) builds self-supervised training by randomly masking blocks and then performing pixel-level reconstruction. Similarly, simMIM (document 9) learns better feature representations by predicting the original signal of the occluded region to enhance the model's understanding of visual information. These studies indicate that model learning can also benefit from low-level visual signals. Also, introducing visual information learning strategies (e.g., masking) into convolutional neural networks typically requires very complex designs, because the feature map generated by the convolutional operation retains a large number of interference edges. However, these vision transducer-based methods can only learn generalized features, often requiring task-specific supervised fine tuning when applied to different types of downstream tasks. For Re-ID tasks, learning identity discrimination features plays a key role, while local fine-grained information helps to further distinguish difficult samples (documents 10-14).

Therefore, how to combine discrimination information and local detail perception when feature learning is performed under a unified framework, and improving fine granularity modeling capability while performing identity discrimination is a critical problem in an unsupervised target re-recognition task.

[ document 1]Yutian Lin,Xuanyi Dong,Liang Zheng,Yan Yan,and Yi Yang.A bottom-up clustering approach to unsupervised person re-identification.In AAAI, volume 33, pages 8738-8745,2019.

[ document 2]Zuozhuo Dai,Guangyuan Wang,Weihao Yuan,Xiaoli Liu,Siyu Zhu,and Ping Tan.Cluster contrast for unsupervised person re-identification.arXiv preprintarXiv:2103.11568,2021.

[ document 3]Xiao Zhang,Yixiao Ge,Yu Qiao,and Hongsheng Li.Refining pseudo labels with clustering consensus over generations for unsupervised object re-identification. In CVPR, pages 3436-3445,2021.

[ document 4]YoonkiCho,Woo Jae Kim,SeunghoonHong,and Sung-EuiYoon. Part-based pseudo label refinement for unsupervised person re-identification. InCVPR, pages 7308-7318,2022.

[ 5]Hao Luo,Pichao Wang,Yi Xu,Feng Ding,Yanxin Zhou,Fan Wang,Hao Li,and Rong Jin.Self-supervised pre-training for transformer-based person re-identification.arXiv preprint arXiv:2111.12084,2021.

[ document 6]Dengpan Fu,Dongdong Chen,Jianmin Bao,Hao Yang,Lu Yuan,Lei Zhang,Houqiang Li,and Dong Chen.Unsupervised pre-training for person re-identification.InCVPR, pages 14750-14759,2021.

[ document 7]Muhammad Muzammal Naseer,Kanchana Ranasinghe,Salman H Khan,Munawar Hayat,Fahad Shahbaz Khan,and Ming-Hua yang. Including properties of vision transformers. Neurops, 34:23296-23308,2021.

[ 8]Kaiming He,Xinlei Chen,Saining Xie,Yanghao Li,Piotr Doll ar, and Ross Girsheck. Molded autoencoders are scalable vision features. In CVPR, pages 16000-16009,2022.

[ document 9]ZhendaXie,Zheng Zhang,Yue Cao,Yutong Lin,Jianmin Bao,Zhuliang Yao,Qi Dai,and Han Hu.Simmim:A simple framework for masked image modeling.In CVPR,pages 9653-9663,2022.

[ document 10]GuanshuoWang,YufengYuan,XiongChen,JiweiLi,and Xi Zhou.Learning discriminative features with multiple granularities for person re-identification. In ACM MM pages 274-282,2018.

[ document 11]Yifan Sun,Liang Zheng,Yi Yang,Qi Tian,and Shengjin Wang.Beyond part models:Person retrieval with refined part pooling (and a strong convolutional baseline): in ECCV pages480-496,2018.

[ document 12]YoonkiCho,Woo Jae Kim,SeunghoonHong,and Sung-EuiYoon. Part-based pseudo label refinement for unsupervised person re-identification. InCVPR, pages 7308-7318,2022.

[ document 13]Kuan Zhu,Haiyun Guo,Tianyi Yan,Yousong Zhu,Jinqiao Wang,and Ming Tang.Pass:Part-aware self-supervised pretraining for person re-identification.In ECCV, pages 198-214.Springer Nature Switzerland Cham,2022.

[ document 14]Yifan Sun,Qin Xu,Yali Li,Chi Zhang,Yikang Li,Shengjin Wang,and Jian Sun.Perceive where to focus:Learning visibility-aware part-level features for partial person re-identification.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pages 393-402,2019.

Disclosure of Invention

Aiming at the defects of the prior invention, the invention provides an unsupervised target re-identification method and system based on a perception-aided learning transducer model, and designs a target perception local mask alignment method to mine fine-grained visual perception information of an image, so as to assist and supplement the learning of discrimination features, and achieve the aim of improving the retrieval accuracy of the unsupervised target re-identification model.

The technical scheme adopted by the method is as follows: an unsupervised target re-identification method based on a perception-assisted learning transducer model comprises the following steps:

step 1: constructing a transducer model based on perception aided learning;

the perception assisted learning transducer model comprises a block generation module, a target perception mask module, a transducer backbone network module and a mask alignment module;

the block generation module comprises four convolution layers which are sequentially connected, wherein the convolution kernel of the first convolution layer is 7*7, a half of channels are processed by adopting a batch normalization layer after convolution, the other half of channels are processed by adopting an example normalization layer, and a function layer is activated by RELU after processing; the convolution kernel of the second convolution layer is 3*3, after convolution, batch normalization layer processing is adopted for half channels, the other half channels are processed by example normalization layers, and after processing, function layers are activated by RELU; the convolution kernel of the third convolution layer is 3*3, all channels are processed by adopting a batch normalization layer after convolution, and a function layer is activated by RELU after processing; the convolution kernel size of the fourth convolution layer is 16×16;

the target perception mask module comprises a plurality of random initialization mask modules, each mask module is a trainable parameter, the data format is the same as the block length of the block generation module, and the mask modules are used for replacing a designated part of common blocks and then are used as input of a transform backbone network;

the converter backbone network comprises a plurality of converter layers, each layer consists of a multi-head self-attention MSA and two layers of full-connection network MLP using GELU activation function, and LayerNorm and residual connection are arranged in front of the MSA and the MLP;

the mask alignment module comprises a dimension conversion function, converts the original image dimension into a characteristic dimension and is used for a subsequent pixel level alignment loss function;

step 2: and inputting the image to be identified into the transducer model based on the perception aided learning to obtain a target re-identification result.

The system of the invention adopts the technical proposal that: an unsupervised target re-recognition system based on a perceptually aided learning transducer model, comprising:

one or more processors;

and a storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement an unsupervised target re-identification method based on a perceptually assisted learning transducer model as described.

The invention has the following advantages:

(1) The invention applies the Transformer to the unsupervised target re-identification work for the first time. Using Vision Transformer remote attention modeling and more powerful feature extraction we propose a mutual learning framework that comprehensively considers discriminating features and detail perception.

(2) The invention designs a perception-assisted learning strategy based on target perception mask alignment, which helps a transducer learn block-level local details. Better discrimination characteristics supplemented by local details can be extracted under the mutual learning of the models;

(3) Compared with a model based on a convolutional neural network, the method provided by the invention effectively improves the retrieval accuracy of the model under the condition that the test setting and the deducing time are kept unchanged.

Drawings

FIG. 1 is a block diagram of a generating module according to an embodiment of the present invention;

FIG. 2 is a schematic block diagram of a perception aided learning system according to an embodiment of the present invention;

fig. 3 is a schematic diagram of a framework for learning perception assistance and discrimination features based on a transducer according to an embodiment of the present invention.

Detailed Description

In order to facilitate the understanding and practice of the invention, those of ordinary skill in the art will now make further details with reference to the drawings and examples, it being understood that the examples described herein are for the purpose of illustration and explanation only and are not intended to limit the invention thereto.

The invention provides an unsupervised target re-identification method based on a perception-assisted learning transducer model, which comprises the following steps:

step 1: constructing a transducer model based on perception aided learning;

the transducer model for perception aided learning comprises a block generation module, a target perception mask module, a transducer backbone network module and a mask alignment module;

please refer to fig. 1, the block generating module of the present embodiment includes four sequentially connected convolution layers, wherein the convolution kernel of the first convolution layer is 7*7, after convolution, a batch normalization layer (BN layer) is adopted for half channels, and the other half channels are processed by an example normalization layer (IN layer), and after processing, a function layer is activated by RELU; the convolution kernel of the second convolution layer is 3*3, after convolution, batch normalization layer processing is adopted for half channels, the other half channels are processed by example normalization layers, and after processing, function layers are activated by RELU; the convolution kernel of the third convolution layer is 3*3, all channels are processed by adopting a batch normalization layer after convolution, and a function layer is activated by RELU after processing; the convolution kernel size of the fourth convolution layer is 16×16;

the target perception mask module of the embodiment comprises a plurality of random initialization mask modules, wherein each mask module is a trainable parameter, the data format is identical to the block length of the block generation module and is 768-dimensional vector, and the mask module is used for replacing a designated part of common blocks and then is used as input of a transformer backbone network;

the converter backbone network of the embodiment comprises a plurality of converter layers, wherein each layer consists of a multi-head self-attention MSA and two layers of full-connection network MLP using GELU activation function, and LayerNorm and residual connection are arranged in front of the MSA and the MLP;

the mask alignment module of the embodiment includes a dimension conversion function, which converts the original image dimension into the feature dimension for the subsequent pixel level alignment loss function;

step 2: and inputting the image to be identified into a transducer model based on perception aided learning to obtain a target re-identification result.

The transducer model based on perception aided learning in this embodiment is a trained model; according to the method, the importance of fine granularity information on the ReID task and incapability of capturing detail perception information by judging and learning based on the conventional pseudo tag generated based on the clustering algorithm are comprehensively considered. The method has the core idea that the distinguishing characteristics are supplemented by improving the fine granularity modeling capability of the model through rich visual signals. Specifically, the perception-assisted learning transducer model provided by the invention utilizes the advantages of the visual transducer in terms of global modeling and learning structural information on the model to extract more robust discrimination features. The whole is composed of two branches, namely, distinguishing characteristic learning and perception auxiliary learning. For discriminant feature learning, the model is trained by combining the loss through a cluster level and an instance set with the help of pseudo tags generated by the clusters. For perceptually assisted learning, the invention locally masks the image at the block level and builds an alignment guided by the original visual signal. Wherein the unmasked image blocks may be considered as representations of local information, the model needs to use existing partially visible information to infer the visual signal of the blank area. It improves the model's ability to understand local detail. Finally, the two branches act together to complete the training process in a mutual learning mode.

Furthermore, it is contemplated that the direct use of random masking may be affected by a large area background, resulting in a model focused on some areas of interference. The present invention proposes a target-aware masking approach that is more prone to selecting a central region of the target to more align key regions of the target during training.

Under the condition that the test setting and reasoning time is not increased, the invention greatly improves the retrieval accuracy of the unsupervised target re-identification task.

The deep learning framework adopted in this embodiment is Pytorch. The hardware environment of the experiment is NVIDIA GeForce RTX3090 x 8 graphic card, and the processor is

Gold 6240, memory 256g 27999ddr 4. The specific implementation flow of the unsupervised target re-identification method based on the transformer is as follows:

the first step: constructing a transducer model based on perception aided learning;

please refer to fig. 3, in this experiment, vision Transformer (ViT) network is adopted as the feature extractor, and training is completed by discriminating the feature learning branch and the perception assisting learning branch to learn each other. In the feature learning, the obtained features are clustered to obtain pseudo tags, and then the network updating is guided by adopting contrast learning according to the pseudo tags. In the perceptually assisted learning branch, the model is guided to learn fine-grained information by aligning the original pixels of the mask portion to construct a supervisory signal.

And a second step of: training a transducer model based on perception-aided learning;

dividing the photographed pictures into a training set and a testing set. The training set images are fed into a network using a perceptually assisted learning algorithm and based on a transducer. The network parameters are optimized and updated using forward and backward propagation.

(1) Discriminating characteristic learning;

definition set

To represent the input of a discriminating characteristic learning branch, where n is the number of training images; each input image is +.>

H and W represent the width and height of the image, respectively, C represents the three channels of RGB, ++>

Representing a real number; after the block embedding operation, the image is divided into N block embedments with D dimensions; then introducing a learnable global classification feature for the sequence representation of the global feature, and a learnable position embedding to maintain the spatial position relationship; during training, the perceptually assisted learning transducer model extracts features from all training images to obtain initial features represented by global classification features>

The initial features are clustered by the usual clustering algorithm (Martin Ester, hans-Peter Kriegel,)>

San der, xiaoiei Xu, et al, a density-based allgorithm for discovering clusters in large spatial databases with noise, in kdd, volume 96, pages 226-231,1996.), generating pseudo tags from the clustering result; in addition, a memory dictionary is established to store the representative characteristics of each cluster and the corresponding pseudo tag; initializing representative features of each cluster by a cluster average value, and subsequently carrying out momentum update; the examples are considered positive samples with the same pseudo tags in the memory dictionary, the rest are negative samples; the contrast loss at cluster level is:

wherein m is _j Representative features representing cluster levels in memory dictionary, m ₊ Representing the corresponding positive features in the memory dictionary, wherein k is the clustering number, and τ is a self-defined temperature parameter; as the network updates in each iteration, the memory dictionary will also update the features to maintain consistency;

momentum updates are as follows:

m _j ←μm _j +(1-μ)f ^h ，(2)

wherein f ^h Samples representing the most difficult of the batches (the one of the sample features of the same pseudo tag that is the least similar to the feature in the in-memory dictionary)μ represents momentum.

(2) Example level loss function design.

The invention takes the examples with the same pseudo labels as the input samples in the batch as positive samples and the others as negative samples. Pulling the positive samples apart a reduced distance and letting the negative samples get far away can bring the instances in the cluster together, which is easier to distinguish. This procedure is represented as follows:

in the process, ,

representing instance level losses, f ₊ Is characteristic of positive samples, f _- Is characteristic of a negative sample.

(3) The target perceives the mask.

For block embedding input similar to initial and discriminant feature learning, the set processed by the target perception mask is adopted

As input to perceptually assisted learning. The invention chooses to mask the partial block embedding near the center of the image by excluding the block embedding at the c-turn at the image edge and then randomly masking the remainder. These masks are defined as randomly initialized learner block embedding +.>

For subsequent direct learning of local visual perception wherein +.>

Indicating that a certain one of the learner-able mask blocks is embedded, and m indicates the number of mask blocks. The present embodiment replaces the block after the embedding operation corresponding to the center portion of the image with the mask feature as the final input of the perceptual learning branch.

(4) Visual perception alignment. Please refer to fig. 2The present invention performs a dimensional transformation on the original image input to facilitate alignment of pixels with block features,

will be reconfigured as +.>

The alignment formula for the perceived block and the block of the mask area of the image is:

wherein,,

representing the learned features of the ith mask block,/-)>

Representing pixel values of a corresponding i-th block portion of the image; m represents the number of mask blocks;

to enhance visual perception and fine-grained modeling capabilities to aid and supplement discriminant feature learning, the present invention establishes a direct correlation between feature level information and pixel information. The final loss function is expressed as:

wherein lambda is ₁ And lambda (lambda) ₂ Representing the weight of each part.

And a third step of: transducer model test based on perception aided learning;

in the test stage, the method only uses a trained visual transducer model to extract the characteristics, and then calculates the similarity between the target characteristics to be queried and all image characteristics of the database to obtain a search result sequence with high similarity.

The image of the target object in the test set is used as a set to be queried, and the rest of the shooting images are used as a gallery set. And (3) reasoning by adopting a model with the best effect in the training process to obtain a final retrieval result on the test set. The evaluation index adopts Rank-1, mAP and mINP matching precision, and the precision reflects the retrieval probability of the correct re-identification image.

In the experiment, the invention performs experimental verification on two common pedestrian re-identification data sets Market1501 and MSMT7 collected by using a ground monitoring camera. The mark 1501 includes 1501 pedestrians photographed by 6 cameras (5 high-definition cameras and 1 low-definition camera), 32668 detected pedestrian rectangular frames. Each pedestrian is captured by at least 2 cameras and may have multiple images in one camera. The training set comprises 751 people and comprises 12936 images, and 17 training data of each person are obtained on average; the test set contained 750 people, 19732 images, and an average of 26 test data for each person. The pedestrian detection rectangle of 3368 query images was manually drawn, and the pedestrian detection box in the gamma was detected using a DPM detector. The fixed number of training sets and test sets provided by the data set may be used under single-shot or multi-shot test settings. The MSMT7 employs a 15 camera network placed in a campus, including 12 outdoor cameras and 3 indoor cameras, resulting in a 126441 pedestrian rectangular frame for 4101 pedestrians.

The invention uniformly adjusts the images to 256 x 128. In addition, data enhancement methods such as padding of 10 pixels, random clipping, and random erasure with a probability of 0.5 are employed in the training data. The block size is set to 16 x 16, resulting in a feature dimension of 768. The batch size is set to 256, including 32 identities, each with 8 images. The training batch number is 50, and the iteration number of each batch is 400. We used DBSCAN clustering algorithm with the maximum neighborhood distance on the mark 1501 test set to 0.5 and the maximum neighborhood distance of MSMT17 set to 0.7. The memory dictionary is initialized with the mean value of the cluster features, and the momentum mu of the memory update of the most difficult sample is set to 0.2. For the discriminating characteristic learning branch, the temperature of the cluster level contrast loss is set to 0.05. For the perceptually assisted learning branch, the feature centered in the image is masked with a mask ratio of 25%. In terms of the penalty function, the weight of the cluster-level penalty is 1, the weight of the instance-level penalty is 0.4 for Market1501, 0.6 for MSMT17, and the weight of the mask alignment penalty is 1. A random gradient descent (SGD) optimizer was used during training. The initial learning rate was 0.00035, 10-fold decrease per 20 training sessions.

In order to verify the effectiveness of the invention, the search result of the invention is compared with the existing unsupervised and self-supervised target re-identification method, and the existing unsupervised and self-supervised target re-identification method mainly comprises the following steps:

(1)BagTricks：Hao Luo,Youzhi Gu,Xingyu Liao,Shenqi Lai,and Wei Jiang.Bag of tricks and a strong baseline for deep person re-identification.In CVPRW,pages 0–0,2019.

(2)SCSN：Xuesong Chen,Canmiao Fu,Yong Zhao,Feng Zheng,Jingkuan Song,Rongrong Ji,and Yi Yang.Salience-guided cascaded suppression network for person re-identification.In CVPR,pages 3300–3310,2020.

(3)AGW：Mang Ye,Jianbing Shen,Gaojie Lin,Tao Xiang,Ling Shao,and Steven C.H.Hoi.2021.Deep learning for person re-identification:A survey and outlook.IEEE TPAMI(2021),1–1.

(4)TransReID：Shuting He,Hao Luo,Pichao Wang,Fan Wang,Hao Li,and Wei Jiang.2021.Transreid:Transformer-based object re-identification.In ICCV.15013–15022.

(5)SpCL:Yixiao Ge,Feng Zhu,Dapeng Chen,Rui Zhao,et al.Selfpaced contrastive learning with hybrid memory for domain adaptive object re-id.NeurIPS,33:11309–11321,2020.

(6)ICE:Hao Chen,Benoit Lagadec,and Francois Bremond.Ice:Inter-instance contrastive encoding for unsupervised person re-identification.In ICCV,pages 14960–14969,2021.

(7)Cluster-Contrast:Zuozhuo Dai,Guangyuan Wang,Weihao Yuan,Xiaoli Liu,Siyu Zhu,and Ping Tan.Cluster contrast for unsupervised person re-identification.arXiv preprint arXiv:2103.11568,2021.

(8)IIDS:Shiyu Xuan and Shiliang Zhang.Intra-inter domain similarity forunsupervised person re-identification.IEEE TPAMI,2022.

(9)ISE:Xinyu Zhang,Dongdong Li,Zhigang Wang,Jian Wang,Errui Ding,Javen Qinfeng Shi,Zhaoxiang Zhang,and Jingdong Wang.Implicit sample extension for unsupervised person re-identification.In CVPR,pages 7369–7378,2022.

(10)PPLR:Yoonki Cho,Woo Jae Kim,Seunghoon Hong,and Sung-EuiYoon.Part-based pseudo label refinement for unsupervised person re-identification.In CVPR,pages 7308–7318,2022.

(11)PASS:Kuan Zhu,Haiyun Guo,Tianyi Yan,Yousong Zhu,Jinqiao Wang,and Ming Tang.Pass:Part-aware self-supervised pretraining for person re-identification.In ECCV,pages 198–214.Springer Nature Switzerland Cham,2022.

(12)TransReID-SSL:Hao Luo,Pichao Wang,Yi Xu,Feng Ding,Yanxin Zhou,Fan Wang,Hao Li,and Rong Jin.Self-supervised pre-training for transformer-based person re-identification.arXiv preprint arXiv:2111.12084,2021.

tests were performed on the mark-1501 and MSMT17 datasets and the results are shown in Table 1:

TABLE 1

As can be seen from table 1: evaluation is carried out on a Market1501 data set, the highest 96.0% accuracy of the method Rank1 is achieved, and the mAP is 91.0%. In addition, the evaluation result of the method on a larger and more complex data set MSMT17 is obviously superior to that of a CNN-based method, and also exceeds that of two methods based on a transducer which are self-supervised pre-training and fine tuning by using ClusterContrast, and the accuracy rate of Rank1 and mAP respectively reach 78.6% and 56.2%, which are improved by 3.6% and 5.6% compared with the previous unsupervised method. In contrast to the supervised approach, although the proposed approach is purely unsupervised, it is still not inferior in performance to the supervised approach. On the mark 1501 dataset, the proposed method Rank1 achieves 96.0% accuracy, mAP achieves 91.0%, superior to most of the most advanced and supervised methods in the past two years. The experimental results on both data sets demonstrate the effectiveness and superiority of the present invention.

The method provided by the invention tests on a large number of unsupervised target re-identification data sets, and the obtained result is superior to the current most advanced unsupervised target re-identification method, and even has competitive power in the supervised target re-identification method.

It should be understood that the foregoing description of the preferred embodiments is not intended to limit the scope of the invention, but rather to limit the scope of the claims, and that those skilled in the art can make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims.

Claims

1. An unsupervised target re-identification method based on a perception-assisted learning transducer model is characterized by comprising the following steps:

step 1: constructing a transducer model based on perception aided learning;

2. The method for unsupervised target re-identification based on a perception-assisted learning transducer model according to claim 1, wherein the method comprises the following steps:

the transducer model based on perception aided learning is a trained model; in the training process, mutual learning is carried out through a distinguishing feature learning branch and a perception auxiliary learning branch to finish training together; in the distinguishing feature learning, the obtained features are clustered to obtain pseudo tags, and then the parameters of the guiding model are updated by adopting contrast learning according to the pseudo tags; in the perception auxiliary learning branch, a supervision signal is constructed by aligning the original pixels of the mask part, and the model is guided to learn fine granularity information;

the distinguishing feature learning and defining set

H and W represent the width and height of the image, respectively, C represents the three passes of RGBLane, I/O (the)>

Representing a real number; after the block embedding operation, the image is divided into N block embedments with D dimensions; then introducing a learnable global classification feature for the sequence representation of the global feature, and a learnable position embedding to maintain the spatial position relationship; during training, the perceptually assisted learning transform model extracts features from all training images to obtain initial features represented by global classification features +.>

Adopting a clustering algorithm to the initial characteristics, and generating a pseudo tag according to a clustering result; storing representative features of each cluster and corresponding pseudo tags through a memory dictionary; initializing representative features of each cluster by a cluster average value, and subsequently carrying out momentum update; the examples are considered positive samples with the same pseudo tags in the memory dictionary, the rest are negative samples; the contrast loss at cluster level is:

momentum updates are as follows:

m _j ←μm _j +(1-μ)f ^h ， (2)

wherein f ^h Represents the most difficult sample in the batch, μ represents momentum.

3. The method for unsupervised target re-identification based on a perception-assisted learning transducer model according to claim 2, wherein the method comprises the following steps: during clustering, the pulling distance between positive samples is reduced, and negative samples are far away from each other, so that examples in the clusters are clustered together;

wherein,,

4. The method for unsupervised target re-identification based on a perception-assisted learning transducer model according to claim 2, wherein the method comprises the following steps: processed set using target perception mask

As input for perception-assisted learning; selecting to mask partial block embedding near the center of the image, specifically, excluding block embedding at the edge c circle of the image, and then randomly masking the rest part; these masks are defined as randomly initialized learner block embedding +.>

Wherein (1)>

Representing that a certain one of the learnable mask blocks is embedded, and m represents the number of mask blocks for subsequent direct learning of local visual perception.

5. The method for unsupervised target re-identification based on a perception-assisted learning transducer model according to claim 2, wherein the method comprises the following steps: the original image input is dimensionally transformed to facilitate alignment of pixels with block features,

will be reconfigured as +.>

wherein f _i ^p Representing the features learned by the ith mask block,

the final loss function is expressed as:

6. An unsupervised target re-recognition system based on a perception-assisted learning transducer model, comprising:

one or more processors;

storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the unsupervised target re-identification method based on a perceptually assisted learning transducer model as claimed in any one of claims 1 to 5.