CN115601791A - Unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution - Google Patents


Info

Publication number
CN115601791A
Authority
CN
China
Prior art keywords
camera
network
domain
training
clustering
Prior art date
Legal status
Granted
Application number
CN202211404730.9A
Other languages
Chinese (zh)
Other versions
CN115601791B (en)
Inventor
蒋敏
张千
孔军
陶雪峰
Current Assignee
Jiangnan University
Original Assignee
Jiangnan University
Priority date
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202211404730.9A
Publication of CN115601791A
Application granted
Publication of CN115601791B
Status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution. A multi-branch network recognition model, Multiformer, is constructed based on the Transformer network and comprises single-camera-domain Intraformer networks and a multi-camera-domain Interformer network; all single-camera-domain Intraformer networks share backbone network parameters, which enhances generalization capability, alleviates to a certain extent the inter-domain differences caused by the backgrounds, illumination and the like of different camera domains, improves the robustness of the model to noisy pseudo labels, and further improves the accuracy of unsupervised pedestrian re-identification. Adaptive outlier sample redistribution expands the number of pseudo labels and enhances the feature-representation capability of the multi-branch network recognition model Multiformer. During model training, the joint learning composed of instance-level contrastive learning and cluster-level contrastive learning greatly improves clustering accuracy and alleviates the noisy-pseudo-label problem, thereby effectively improving the accuracy and robustness of unsupervised pedestrian re-identification.

Description

Unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution
Technical Field
The invention relates to an unsupervised pedestrian re-identification method, in particular to an unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution.
Background
With extensive research in both the theory and practice of computer vision, pedestrian re-identification has become an important branch of the field; it aims to identify a target pedestrian across non-overlapping cameras. Pedestrian re-identification has a wide range of real-world applications, such as criminal search, multi-camera tracking, and missing-person search.
At present, traditional pedestrian re-identification research relies on large numbers of manually annotated images, which is inefficient and expensive. Unsupervised pedestrian re-identification addresses this problem: it requires no additional annotation of pedestrian identities, and therefore has a much wider application space than traditional pedestrian re-identification.
Owing to the diversity of objective environments and the complexity of pedestrian behaviour, unsupervised pedestrian re-identification still faces many urgent problems, chiefly the following. 1) Without real identity labels, the model must determine pseudo identity labels for the training data; at present, similar images are mainly assigned the same label through clustering or KNN search to generate pseudo labels for training, but if the estimated identity is incorrect, the learning of the model is hindered. 2) Because pedestrian images suffer from occlusion, differing viewpoints, background interference, and similar factors, the estimated pseudo labels are noisy; the main task of a pedestrian re-identification model is to learn discriminative pedestrian feature representations from different pedestrian images, minimizing the influence of noisy pseudo labels while maximizing the discriminability of the model, which is a core challenge of unsupervised pedestrian re-identification. 3) Pedestrian re-identification is essentially a multi-camera retrieval task; how to fully learn pedestrian features that remain invariant across cameras, despite the differences in background, viewpoint, lighting, and the like among different cameras, is also a problem to be solved.
In addition, the traditional unsupervised pedestrian re-identification task mainly adopts a CNN as the backbone network for feature extraction. A CNN can only process one local neighborhood at a time, its receptive field is limited, and it cannot capture global information well; moreover, the convolution and down-sampling operations of a CNN cause a great loss of detail and spatial information, so it cannot effectively meet the requirements of unsupervised pedestrian re-identification.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution, effectively improving the accuracy and robustness of unsupervised pedestrian re-identification.
According to the technical scheme provided by the invention, the unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution comprises the following steps:
constructing a Transformer-network-based multi-branch network recognition model Multiformer, so as to use the constructed Multiformer model to perform the required unsupervised pedestrian re-identification on pedestrian images acquired by m cameras, wherein,

the constructed multi-branch network recognition model Multiformer comprises a single-camera-domain Intraformer network constructed on the basis of the Transformer network for each camera and a multi-camera-domain Interformer network constructed on the basis of the Transformer network for all cameras;

when the multi-branch network recognition model Multiformer is constructed, the single-camera-domain Intraformer networks of all cameras and the multi-camera-domain Interformer network adopt the same backbone network, and the single-camera-domain Intraformer networks of all cameras share the backbone network parameters during training;

when pedestrian re-identification is performed, the multi-camera-domain Interformer network extracts features from an identification image containing the pedestrian to be identified, so that the pedestrian images matching the extracted pedestrian features are searched for and determined among the pedestrian images collected by the m cameras.
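As an illustrative sketch of this retrieval step (not code from the patent; all names here are hypothetical), matching by feature similarity can be organized as follows:

```python
import torch

def retrieve(query_feat: torch.Tensor, gallery_feats: torch.Tensor, top_k: int = 10):
    """Rank gallery pedestrian images by cosine similarity to the query feature.

    query_feat:    (D,) feature of the identification image, extracted by the
                   multi-camera-domain Interformer network.
    gallery_feats: (G, D) features of the pedestrian images collected by the m cameras.
    """
    q = torch.nn.functional.normalize(query_feat.unsqueeze(0), dim=1)   # (1, D)
    g = torch.nn.functional.normalize(gallery_feats, dim=1)             # (G, D)
    sims = (q @ g.t()).squeeze(0)                                       # (G,)
    return torch.topk(sims, k=min(top_k, g.size(0))).indices            # best matches
```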
When the multi-branch network recognition model Multiformer is constructed, the construction steps comprise:

constructing a Transformer-network-based multi-branch network recognition basic model, which comprises a Transformer-network-based multi-camera-domain basic network and m Transformer-network-based single-camera-domain basic networks, where a classifier is configured in the multi-camera-domain basic network and in each single-camera-domain basic network, and each configured classifier is adaptively connected to the corresponding backbone network of its multi-camera-domain or single-camera-domain basic network;

when the multi-branch network recognition basic model is constructed, pre-training the backbone network used to construct the multi-camera-domain basic network on the ImageNet data set to obtain the multi-camera-domain backbone-network pre-training parameters of the multi-camera-domain basic network;

when the constructed single-camera-domain basic networks are trained, loading the obtained multi-camera-domain backbone-network pre-training parameters into the backbone networks of all single-camera-domain basic networks, so that the single-camera-domain basic networks of all cameras share the backbone network parameters;
performing required training on the constructed multi-branch network identification basic model, so as to form a corresponding single-camera-domain Intraformer network based on a trained single-camera-domain basic network and form a multi-camera-domain Interformer network based on a trained multi-camera-domain basic network when a target training state is reached;
and forming a multi-branch network identification model Multiformer by using the multi-camera domain Interformer network and the m single-camera domain Intraformer networks.
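The branch structure described above can be summarized in the following minimal sketch, assuming a ViT-style encoder module as the shared backbone; all class, function, and parameter names are hypothetical rather than the patent's actual implementation:

```python
import copy
import torch.nn as nn

class Branch(nn.Module):
    """One branch: a Transformer backbone plus a branch-specific classifier
    adaptively connected through an MLP (classifier parameters are NOT shared)."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(),   # MLP
            nn.Linear(feat_dim, num_classes))           # Classifier
    def forward(self, x):
        token = self.backbone(x)                        # Cls-token feature
        return token, self.classifier(token)

class Multiformer(nn.Module):
    """m single-camera-domain Intraformer branches + one multi-camera-domain
    Interformer branch. The m Intraformer branches reference the SAME backbone
    object, so their backbone parameters are shared during training."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int, m: int):
        super().__init__()
        self.interformer = Branch(copy.deepcopy(backbone), feat_dim, num_classes)
        shared = backbone                               # one shared module instance
        self.intraformers = nn.ModuleList(
            Branch(shared, feat_dim, num_classes) for _ in range(m))
```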
When the constructed multi-branch network recognition basic model is trained, the training process comprises the following steps:
Step 1, performing feature extraction on a training data set by using the multi-branch network recognition basic model to obtain the multi-camera-domain picture features F_mc and the single-camera-domain picture features F_c_i of the i-th camera, i = 1, …, m;

Step 2, clustering the obtained multi-camera-domain picture features F_mc and the single-camera-domain picture features F_c_i of the i-th camera, wherein the successfully clustered pictures form the clustering points (Inliers) and are assigned clustering-point pseudo labels, and the unsuccessfully clustered pictures form the Outliers;

Step 3, generating the clustering-point pseudo-label cluster centers from the clustering-point pseudo labels, performing adaptive outlier sample redistribution on the Outliers using the generated cluster centers, assigning the corresponding clustering-point pseudo labels to the outlier samples in the Outliers after the adaptive redistribution, and forming a pseudo-label training set from all clustering-point pseudo labels;

Step 4, performing joint contrastive learning on the multi-branch network recognition basic model so as to optimize its network parameters based on joint contrastive learning, wherein,

for the i-th single-camera-domain basic network, joint contrastive learning is performed based on the training data set, the single-camera-domain picture features F_c_i of the i-th camera, and the clustering-point pseudo-label cluster centers;

for the multi-camera-domain basic network, joint contrastive learning is performed based on the training data set, the multi-camera-domain picture features F_mc, and the clustering-point pseudo-label cluster centers;

the joint contrastive learning comprises cluster-level contrastive learning and instance-level contrastive learning;

Step 5, performing collaborative training of the single-camera-domain basic networks and the multi-camera-domain basic network on the multi-branch network recognition basic model optimized by joint contrastive learning, wherein,

the multi-camera-domain basic network is trained with the multi-camera-domain picture features F_mc and the pseudo-label training set;

the i-th single-camera-domain basic network is trained with the single-camera-domain picture features F_c_i of the i-th camera and the pseudo-label training set;

Step 6, repeating the training process from step 1 to step 5 until the target training state is reached; one training round is sketched below.
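The following schematic summarizes one training round (steps 1 to 5); every helper function here is a hypothetical placeholder for the operation named in the corresponding step, not code from the patent:

```python
def train_round(model, dataset, m, nu):
    # Step 1: feature extraction with the multi-branch basic model.
    F_mc = extract_features(model.interformer, dataset.all())           # multi-camera domain
    F_c = [extract_features(model.intraformers[i], dataset.camera(i))   # i-th camera domain
           for i in range(m)]
    # Step 2: clustering; inliers receive clustering-point pseudo labels.
    labels, inliers, outliers = cluster_features(F_mc)
    # Step 3: cluster centers + adaptive outlier sample redistribution (AORA).
    centers = compute_cluster_centers(F_mc, labels, inliers)
    labels = reassign_outlier_labels(F_mc, centers, outliers, labels, nu)
    # Step 4: joint contrastive learning (cluster-level + instance-level).
    optimize_joint_contrast(model, dataset, labels, centers)
    # Step 5: collaborative training of all branches on the pseudo-label set.
    cotrain(model, dataset, labels)
```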
For step 1, when extracting the multi-camera-domain picture features F_mc, Split processing is performed on any training picture in the training data set, a parameter Cls token is connected to the image blocks obtained by the Split processing, and the position information of each image block and the camera information encoding of the training picture are embedded, to configure and form the training-picture multi-camera-domain feature-extraction information;

the multi-camera-domain basic network processes the training-picture multi-camera-domain feature-extraction information to extract the multi-camera-domain picture features F_mc.

When extracting the single-camera-domain picture features F_c_i of the i-th camera, Split processing is performed on the training pictures acquired by the i-th camera, a parameter Cls token is connected to the image blocks obtained by the Split processing, and the position information of each image block is embedded to form the training-picture single-camera-domain feature-extraction information;

the single-camera-domain basic network corresponding to the i-th camera processes the training-picture single-camera-domain feature-extraction information to extract the single-camera-domain picture features F_c_i.
In step 2, when clustering the obtained multi-camera-domain picture features F_mc and all single-camera-domain picture features F_c, the clustering method includes the DBSCAN clustering method.
In step 3, the clustering-point pseudo-label cluster centers are given by:

$$\Phi_i = \frac{1}{num_i}\sum_{j=1}^{num_i} f_j, \quad i = 1,\dots,Y$$

where Y is the number of categories of the clustering-point pseudo labels, Φ_i is the cluster-center feature of the i-th category, f_j is the feature of the j-th picture in the i-th category, and num_i is the number of pictures contained in the i-th category;

the generated clustering-point pseudo-label cluster centers are stored in a cluster-center feature repository (Center Memory Bank);
an affinity matrix between the outlier samples within the Outliers and the clustering-point pseudo-label cluster centers is computed, wherein,

the affinity matrix between the Outliers and the clustering-point pseudo-label cluster centers is:

$$AFM(i,j) = \frac{\sum_{r=1}^{N}\Phi_{i\_r}\,O_{j\_r}}{\sqrt{\sum_{r=1}^{N}\Phi_{i\_r}^{2}}\;\sqrt{\sum_{r=1}^{N}O_{j\_r}^{2}}}$$

where AFM(i,j) is the mutual-similarity value in the affinity matrix AFM between the i-th cluster-center feature Φ_i and the j-th outlier sample, O_j is the feature of the j-th outlier sample, Φ_i_r denotes the r-th element of the i-th cluster-center feature Φ_i, O_j_r denotes the r-th element of the j-th outlier-sample feature O_j, and N is the feature dimension;
and when adaptive outlier sample redistribution is performed based on the computed affinity matrix AFM, each outlier sample is assigned to the clustering-point pseudo-label cluster center with which it has the strongest mutual similarity.
A mutual-similarity threshold ν is configured for the mutual similarity between the outlier samples and the clustering-point pseudo-label cluster centers, wherein,

ν is scheduled over training as a function of the training round epoch: starting from ν_start, it increases at a rate governed by γ while epoch < e_peak and decays thereafter, where Num_O is the number of outlier samples within the Outliers, ν_start is the initial value of the mutual-similarity threshold ν, γ is the threshold decay rate, epoch is the training round, e_peak is the training round at which the mutual-similarity threshold ν reaches its peak, and 𝕀(·) is an indicator function whose value is 1 when the training round is less than e_peak, i.e. 𝕀(·) = 𝕀{epoch < e_peak}.
When outlier samples are assigned based on the configured mutual-similarity threshold ν, the j-th outlier sample whose mutual-similarity value AFM(i,j) is greater than the threshold ν is assigned to the clustering-point pseudo-label cluster center with the strongest mutual similarity.
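Reading the affinity as cosine similarity, as in the formula above, the redistribution step can be sketched as follows (a sketch under that assumption; the function and argument names are hypothetical):

```python
import numpy as np

def reassign_outliers(outlier_feats: np.ndarray, centers: np.ndarray, nu: float) -> np.ndarray:
    """outlier_feats: (Num_O, N) features O_j; centers: (Y, N) features Phi_i.
    Returns the assigned cluster index per outlier, or -1 when the best
    mutual-similarity value does not exceed the threshold nu."""
    O = outlier_feats / np.linalg.norm(outlier_feats, axis=1, keepdims=True)
    C = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    afm = C @ O.T                          # AFM(i, j), shape (Y, Num_O)
    best = afm.argmax(axis=0)              # center with the strongest mutual similarity
    best_sim = afm.max(axis=0)
    return np.where(best_sim > nu, best, -1)
```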
In step 4, during joint contrastive learning, cluster-level contrastive learning yields the cluster contrast loss l_c, and instance-level contrastive learning yields the instance contrast loss l_t, wherein,

for the cluster contrast loss l_c:

$$l_c = -\log \frac{\exp\left(f(q)\cdot \Phi_{+}/\gamma\right)}{\sum_{i=1}^{Y}\exp\left(f(q)\cdot \Phi_{i}/\gamma\right)}$$

where Φ_+ is the cluster center of the positive samples of the sample picture q, γ is a set temperature parameter, and f(q) is the query instance feature of the sample picture q;
for the instance contrast loss l_t:

$$l_t = \sum_{i=1}^{P}\sum_{a=1}^{K}\left[\beta + \max_{p=1,\dots,K}\left\|f\!\left(x_a^{i}\right)-f\!\left(x_p^{i}\right)\right\| - \min_{\substack{j=1,\dots,P,\; j\neq i\\ n=1,\dots,K}}\left\|f\!\left(x_a^{i}\right)-f\!\left(x_n^{j}\right)\right\|\right]_{+}$$

where P is the number of different pedestrians selected in a given sample, K is the number of sample pictures selected for each pedestrian in the given sample, a indexes one picture among the K sample pictures, x_a^i is an anchor image with identity i, x_p^i is a positive sample with identity i, x_n^j is a negative sample with identity j, β is the minimum gap between the similarity of a positive sample pair and the similarity of a negative sample pair, and f(x_a^i) is the image feature extracted from the anchor image x_a^i.
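A sketch of the two losses in PyTorch, under the batch-hard reading of l_t used above (the names and default values are illustrative, not the patent's settings):

```python
import torch
import torch.nn.functional as F

def cluster_contrast_loss(q_feat, centers, pos_idx, gamma=0.05):
    """l_c: InfoNCE over cluster centers -- pull f(q) toward its positive
    center Phi_+, push it away from the other centers."""
    q = F.normalize(q_feat, dim=-1)                      # (D,)
    c = F.normalize(centers, dim=-1)                     # (Y, D)
    logits = (c @ q) / gamma                             # (Y,)
    target = torch.tensor([pos_idx], device=q.device)
    return F.cross_entropy(logits.unsqueeze(0), target)

def instance_contrast_loss(feats, labels, beta=0.3):
    """l_t: batch-hard triplet loss over a P x K batch with margin beta."""
    dist = torch.cdist(feats, feats)                     # (PK, PK) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(beta + hardest_pos - hardest_neg).mean()
```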
When model network-parameter optimization based on joint contrastive learning is performed, the model network parameters θ are determined so as to minimize the loss function of the N_H training samples under the determined parameters θ, wherein,

during optimization, the multi-camera-domain basic network and all single-camera-domain basic networks are optimized simultaneously:

$$\theta^{*}=\arg\min_{\theta}\sum_{a=1}^{N_H}\Big(l_c\big(f(x_a);\theta\big)+l_t\big(f(x_a);\theta\big)\Big)$$

where f(x_a) is the image feature extracted from the anchor image x_a.
For the collaborative-training identity loss, there is:

$$l_{id} = -\frac{1}{N_z}\sum_{i=1}^{N_z}\log p\big(\tilde{y}_i \mid x_i\big)$$

where l_id is the collaborative-training identity loss, ỹ_i is the pseudo label of x_i, N_z is the number of training samples in the training data set, and p(ỹ_i | x_i) is the probability that the multi-branch network recognition basic model outputs the real identity label ỹ_i for the training sample x_i.
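Combining the identity loss with simultaneous optimization of all branches, one collaborative-training step might look like the following sketch, reusing the hypothetical Multiformer structure sketched earlier; all names are assumptions:

```python
import torch.nn.functional as F

def cotrain_step(model, images, pseudo_labels, cam_ids, optimizer):
    """The Interformer branch sees the whole batch; each Intraformer branch sees
    only its own camera's images; l_id is the cross-entropy against pseudo labels."""
    _, logits_mc = model.interformer(images)
    loss = F.cross_entropy(logits_mc, pseudo_labels)         # l_id, multi-camera branch
    for i, branch in enumerate(model.intraformers):
        mask = cam_ids == i
        if mask.any():
            _, logits_c = branch(images[mask])
            loss = loss + F.cross_entropy(logits_c, pseudo_labels[mask])
    optimizer.zero_grad()
    loss.backward()          # all branches (and the shared backbone) updated together
    optimizer.step()
    return loss.item()
```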
The invention has the advantages that: the multi-branch network recognition model Multiformer is constructed based on the Transformer network and comprises single-camera-domain Intraformer networks and a multi-camera-domain Interformer network; all single-camera-domain Intraformer networks share backbone network parameters, which enhances generalization capability, alleviates to a certain extent the inter-domain differences caused by the backgrounds, illumination and the like of different camera domains, improves the robustness of the model to noisy pseudo labels, and further improves the accuracy of unsupervised pedestrian re-identification.

Adaptive outlier sample redistribution expands the number of pseudo labels and enhances the feature-representation capability of the multi-branch network recognition model Multiformer. During model training, the joint learning composed of instance-level contrastive learning and cluster-level contrastive learning greatly improves clustering accuracy and alleviates the noisy-pseudo-label problem, thereby effectively improving the accuracy and robustness of unsupervised pedestrian re-identification.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a flowchart of an embodiment of constructing a multi-branch network recognition model Multiformer according to the present invention.
Fig. 3 is a diagram of an embodiment of a multi-branch network recognition model Multiformer according to the present invention.
FIG. 4 is a schematic diagram of a single-camera-domain Intraformer network and the multi-camera-domain Interformer network according to an embodiment of the present invention.
Fig. 5 is a diagram illustrating the visualization effect of the multi-branch network according to the present invention.
FIG. 6 is a schematic diagram of an embodiment of counting the distribution of Outliers after clustering according to the present invention.
FIG. 7 is a diagram illustrating adaptive outlier sample reallocation according to the present invention.
FIG. 8 is a schematic diagram of the joint contrastive learning of the present invention.
Fig. 9 is a schematic diagram of the visualization effect in the comparative example of the present invention.
Detailed Description
The invention is further illustrated by the following specific figures and examples.
In order to effectively improve the accuracy and robustness of unsupervised pedestrian re-identification, the invention adopts an unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution. In an embodiment of the invention, the unsupervised pedestrian re-identification method comprises the following steps:

constructing a Transformer-network-based multi-branch network recognition model Multiformer, so as to use the constructed Multiformer model to perform the required unsupervised pedestrian re-identification on pedestrian images acquired by m cameras, wherein,

the constructed multi-branch network recognition model Multiformer comprises a single-camera-domain Intraformer network constructed on the basis of the Transformer network for each camera and a multi-camera-domain Interformer network constructed on the basis of the Transformer network for all cameras;

when the multi-branch network recognition model Multiformer is constructed, the single-camera-domain Intraformer networks of all cameras and the multi-camera-domain Interformer network adopt the same backbone network, and the single-camera-domain Intraformer networks of all cameras share the backbone network parameters during training;

when pedestrian re-identification is performed, the multi-camera-domain Interformer network extracts features from an identification image containing the pedestrian to be identified, so that the pedestrian images matching the extracted pedestrian features are searched for and determined among the pedestrian images collected by the m cameras.
Fig. 1 shows an implementation flowchart of unsupervised pedestrian re-identification. To implement unsupervised pedestrian re-identification, a Transformer-network-based multi-branch network recognition model Multiformer needs to be constructed. The scene range of pedestrian image acquisition determined by the m cameras is the range of the pedestrian re-identification area of the Multiformer model; the constructed model can then perform unsupervised pedestrian re-identification on the pedestrian images acquired by the m cameras. Here a camera is any device capable of acquiring pedestrian images, such as a still camera or a video camera; the specific types and the number of cameras can be selected as required so as to meet the required unsupervised pedestrian re-identification. In addition, the m cameras are generally installed in different areas, that is, the m cameras can acquire images of pedestrians in m different area scenes.
To improve the accuracy and robustness of unsupervised pedestrian re-identification, in an embodiment of the present invention the multi-branch network recognition model Multiformer needs to include a single-camera-domain Intraformer network constructed on the basis of the Transformer network for each camera and a multi-camera-domain Interformer network constructed on the basis of the Transformer network for all cameras, where the single-camera domain specifically refers to the acquisition range of pedestrian images of one camera, and the multi-camera domain refers to the acquisition range of pedestrian images of the m cameras. Because the single-camera-domain Intraformer networks and the multi-camera-domain Interformer network are constructed on the basis of the Transformer network, the characteristics of the Transformer network can be used to better acquire global information and picture details, enhancing the utilization of globally effective information.

In one embodiment of the invention, the single-camera-domain Intraformer networks of all cameras adopt the same backbone network and share the network backbone parameters, which can enhance the generalization capability of the multi-branch network recognition model Multiformer, alleviate to a certain extent the inter-domain differences brought by the backgrounds, illumination and the like of different camera domains, improve robustness to noisy pseudo labels, and further improve the accuracy of unsupervised pedestrian re-identification.
FIG. 5 is a t-SNE plot drawn on the public data set Market-1501. Plot (a) shows the feature distribution obtained without the multi-branch network recognition model Multiformer of the present invention, and plot (b) shows the feature distribution obtained with Multiformer feature extraction. Dots of the same color represent the same camera; the Market-1501 data set contains pictures from 6 cameras, so there are 6 colors in the figure. Plot (a) is influenced by the domain differences between cameras, so image features from the same camera are more similar, which means the attention of the network is driven not by the pedestrian but by noise. In plot (b) the image features of each camera are uniformly distributed; it can be seen that, after the multi-branch network recognition model Multiformer is introduced, the domain differences among cameras are clearly alleviated.
In an embodiment of the present invention, when the multi-branch network recognition model Multiformer is constructed, the construction steps include:

constructing a Transformer-network-based multi-branch network recognition basic model, which comprises a Transformer-network-based multi-camera-domain basic network and m Transformer-network-based single-camera-domain basic networks, where a classifier is configured in the multi-camera-domain basic network and in each single-camera-domain basic network, and each configured classifier is adaptively connected to the corresponding backbone network of its multi-camera-domain or single-camera-domain basic network;

when the multi-branch network recognition basic model is constructed, pre-training the backbone network used to construct the multi-camera-domain basic network on the ImageNet data set to obtain the multi-camera-domain backbone-network pre-training parameters of the multi-camera-domain basic network;

when the constructed single-camera-domain basic networks are trained, loading the obtained multi-camera-domain backbone-network pre-training parameters into the backbone networks of all single-camera-domain basic networks, so that the single-camera-domain basic networks of all cameras share the network backbone parameters;

performing the required training on the constructed multi-branch network recognition basic model, so that, when the target training state is reached, the corresponding single-camera-domain Intraformer networks are formed from the trained single-camera-domain basic networks and the multi-camera-domain Interformer network is formed from the trained multi-camera-domain basic network;
and forming a multi-branch network identification model Multiformer by using the multi-camera domain Interformer network and the m single-camera domain Intraformer networks.
As can be seen from the above description, since the multi-branch network recognition model comprises single-camera-domain Intraformer networks and a multi-camera-domain Interformer network, the constructed multi-branch network recognition basic model at least comprises m single-camera-domain basic networks for forming the single-camera-domain Intraformer networks and a multi-camera-domain basic network for forming the multi-camera-domain Interformer network; that is, the m single-camera-domain basic networks correspond to the m finally formed single-camera-domain Intraformer networks, and the multi-camera-domain basic network corresponds to the multi-camera-domain Interformer network.

In an embodiment of the present invention, the single-camera-domain basic networks and the multi-camera-domain basic network all use the same backbone network, for example the Encoder of the Transformer network. In addition, a classifier is configured in the multi-camera-domain basic network and in each single-camera-domain basic network, forming a multi-branch classifier.

Fig. 3 shows a schematic diagram of the architectures of the multi-camera-domain Interformer network and a single-camera-domain Intraformer network after the target training state is reached; since training only optimizes and adjusts the corresponding network parameters, the corresponding architectures of the constructed single-camera-domain basic network and multi-camera-domain basic network can also be read from Fig. 3.
In Fig. 3 and Fig. 4, for the multi-camera-domain Interformer network, Split slices the input picture into a number of image blocks. Linear Projection of Flattened Patches denotes the linear projection and dimension transformation, Embedding denotes the data embedding, and Feature Extraction denotes the feature extraction, in which E_mc, the Blocks, and the Token are obtained in sequence. In Fig. 4, Branch-1 to Branch-m are the Intraformer networks of the m single-camera domains.

Affinity Matrix is the affinity matrix, Pseudo Label is the pseudo label, AORA is the adaptive outlier sample redistribution strategy, Joint Contrast Learning (JCL) is the joint contrastive learning, MLP is the multilayer perceptron, and Classifier is the classifier. The joint contrastive learning (JCL) comprises instance-level contrastive learning and cluster-level contrastive learning.

Fig. 4 shows an implementation of the backbone network corresponding to the multi-camera-domain Interformer network and the single-camera-domain Intraformer networks; the backbone network in Fig. 4 comprises the above-mentioned Split, linear projection, dimension transformation, and so on, and the specific manner of forming these backbone networks on the basis of the Transformer is the same as in the prior art.
In one embodiment of the invention, the Classifier is adaptively connected to the backbone network via the MLP; the information handled by the Classifier is thereby determined. In specific implementation, all Classifier classifiers use the same classifier form, and normal initialization can be adopted for all of them. After the constructed multi-branch network recognition basic model is trained to the target state, the corresponding Classifier classifiers are obtained respectively.

For the single-camera-domain Intraformer networks, since they use the same backbone network as the multi-camera-domain Interformer network, the specific case of the m single-camera-domain Intraformer networks in Fig. 3 may refer to the corresponding description of the multi-camera-domain Interformer network and is not repeated here.
In order to realize the sharing of network backbone parameters, in one embodiment of the invention, the backbone network of the multi-camera-domain basic network is constructed and pre-trained on the ImageNet data set to obtain the multi-camera-domain backbone-network pre-training parameters of the multi-camera-domain basic network;

when the single-camera-domain basic networks are constructed, the multi-camera-domain backbone-network pre-training parameters are loaded into the backbone networks of all single-camera-domain basic networks, so that the single-camera-domain basic networks of all cameras share the network backbone parameters.

In specific implementation, the ImageNet data set is a commonly used public data set, and the method and process of pre-training the backbone network of the multi-camera-domain basic network with the ImageNet data set are consistent with the prior art. In the multi-camera-domain basic network, after pre-training yields the multi-camera-domain network pre-training parameters, a Classifier is added. In each single-camera-domain basic network, after the multi-camera-domain network pre-training parameters are loaded, a Classifier is likewise added.
The Classifier can take any currently common classification form; the manner of adding the Classifier and its specific form can be selected according to actual needs so as to meet the required classification. Once all classifiers are added, the construction of the multi-branch network recognition basic model is complete, and the model then needs to be trained.

As can be seen from the above description, the backbone networks of the single-camera-domain basic networks constructed for each camera on the basis of the Transformer network share network parameters, but the corresponding parameters of the Classifiers are not shared. In specific implementation, after the backbone networks of the single-camera-domain basic networks share network parameters, the corresponding backbone-network parameters remain essentially consistent.
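This parameter-sharing scheme can be sketched as follows, assuming a hypothetical checkpoint file and the branch structure sketched earlier; if the Intraformer branches tie the same backbone object, a single load suffices:

```python
import torch

# Sketch: load ImageNet pre-training parameters of the multi-camera-domain
# backbone into every single-camera-domain backbone. Classifier parameters
# are NOT shared; each branch keeps its own (normally initialized) classifier.
pretrained = torch.load("imagenet_pretrained_backbone.pth")   # hypothetical file
model.interformer.backbone.load_state_dict(pretrained)
for branch in model.intraformers:
    branch.backbone.load_state_dict(pretrained)               # shared backbone parameters
```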
In an embodiment of the present invention, the training required for the constructed multi-branch network recognition basic model is specifically configured so that the single-camera-domain basic networks of all cameras first share the network backbone parameters, and the resulting multi-branch network recognition basic model is then trained until the target training state is reached.
When the constructed multi-branch network recognition basic model is trained, the training process comprises the following steps:

Step 1, performing feature extraction on a training data set by using the multi-branch network recognition basic model to obtain the multi-camera-domain picture features F_mc and the single-camera-domain picture features F_c_i of the i-th camera, i = 1, …, m;

Step 2, clustering the obtained multi-camera-domain picture features F_mc and the single-camera-domain picture features F_c_i of the i-th camera, wherein the successfully clustered pictures form the clustering points (Inliers) and are assigned clustering-point pseudo labels, and the unsuccessfully clustered pictures form the Outliers;

Step 3, generating the clustering-point pseudo-label cluster centers from the clustering-point pseudo labels, performing adaptive outlier sample redistribution on the Outliers using the generated cluster centers, assigning the corresponding clustering-point pseudo labels to the outlier samples in the Outliers after the adaptive redistribution, and forming a pseudo-label training set from all clustering-point pseudo labels;

Step 4, performing joint contrastive learning on the multi-branch network recognition basic model so as to optimize its network parameters based on joint contrastive learning, wherein,

for the i-th single-camera-domain basic network, joint contrastive learning is performed based on the training data set, the single-camera-domain picture features F_c_i of the i-th camera, and the clustering-point pseudo-label cluster centers;

for the multi-camera-domain basic network, joint contrastive learning is performed based on the training data set, the multi-camera-domain picture features F_mc, and the clustering-point pseudo-label cluster centers;

the joint contrastive learning comprises cluster-level contrastive learning and instance-level contrastive learning;

Step 5, performing collaborative training of the single-camera-domain basic networks and the multi-camera-domain basic network on the multi-branch network recognition basic model optimized by joint contrastive learning, wherein,

the multi-camera-domain basic network is trained with the multi-camera-domain picture features F_mc and the pseudo-label training set;

the i-th single-camera-domain basic network is trained with the single-camera-domain picture features F_c_i of the i-th camera and the pseudo-label training set;

during collaborative training, network-parameter optimization based on collaborative training is performed on the multi-branch network recognition basic model using the computed collaborative-training identity loss;

Step 6, repeating the training process from step 1 to step 5 until the target training state is reached.
Fig. 2 shows an embodiment of the training process of the multi-branch network recognition basic model: during training, the steps of feature extraction, clustering to generate partial pseudo labels and outlier samples, adaptive outlier sample allocation, joint contrastive learning, and multi-branch network collaborative training are generally performed. The termination condition of training is generally whether the model converges: when the model is judged to have converged after training, training terminates and the target training state is reached; otherwise training is repeated. In specific implementation, the model is judged to be in the convergence state when, during training, the precision of the model no longer increases and the loss of the model no longer decreases.
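This convergence criterion (precision no longer increasing, loss no longer decreasing) could be checked with a sketch like the following; the patience window is an assumption:

```python
def converged(acc_history, loss_history, patience=3, eps=1e-4):
    """True when precision has stopped increasing AND loss has stopped
    decreasing over the last `patience` training rounds."""
    if len(acc_history) <= patience or len(loss_history) <= patience:
        return False
    acc_stalled = max(acc_history[-patience:]) <= max(acc_history[:-patience]) + eps
    loss_stalled = min(loss_history[-patience:]) >= min(loss_history[:-patience]) - eps
    return acc_stalled and loss_stalled
```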
The training procedure is described in detail below.
Specifically, during training, a training data set needs to be provided or configured; the training data set consists of images captured and collected by the m cameras, and its size can be chosen according to actual needs so as to meet the required training requirements.
In an embodiment of the present invention, for step 1, when extracting the multi-camera-domain picture features F_mc, Split processing is performed on any training picture in the training data set, a parameter Cls token is connected to the image blocks obtained by the Split processing, and the position information of each image block and the camera information encoding of the training picture are embedded to form the training-picture multi-camera-domain feature-extraction information;

the multi-camera-domain basic network processes the training-picture multi-camera-domain feature-extraction information to extract the multi-camera-domain picture features F_mc.

When extracting the single-camera-domain picture features F_c_i of the i-th camera, Split processing is performed on the training pictures acquired by the i-th camera, a parameter Cls token is connected to the image blocks obtained by the Split processing, and the position information of each image block is embedded to form the training-picture single-camera-domain feature-extraction information;

the single-camera-domain basic network corresponding to the i-th camera processes the training-picture single-camera-domain feature-extraction information to extract the single-camera-domain picture features F_c_i.
In specific implementation, the input training data is the set X_mc ∈ R^{B×C×H×W} of all camera pictures (i.e. the pictures collected by the m cameras), where H×W is the resolution of the input picture, C is the number of channels (for RGB pictures, C = 3), and B is the batch size, which can be selected and determined according to the actual application scenario. The input picture is segmented (Split) and the spatial dimensions are flattened to obtain the image-block Patch sequence

$$X_p \in \mathbb{R}^{B\times N\times (P_h \cdot P_w \cdot C)}$$

where N is the number of Patches obtained by the division and P_h × P_w is the size of each cut image block Patch.

The image-block Patch sequence X_p is linearly projected and dimension-transformed to obtain the image-block Patch encoding E_mc ∈ R^{B×N×D}, where D is the generated feature dimension. A parameter Cls token representing global features is connected to the image-block Patch encoding E_mc, and the position encoding and the camera information encoding are embedded, giving E_mc_cls ∈ R^{B×N′×D}. After training, the Cls token parameter contains a feature representation of the input picture for classification. The size of the parameter Cls token is R^{B×1×D}; after the parameter Cls token is connected to the image-block Patch encoding E_mc, the Patch-number dimension N increases by 1, i.e. N′ = N + 1.

The parameter Cls token is a learnable parameter of size R^{B×1×D}; the position encoding is the position information of each image block Patch in the original picture after division, the camera information encoding is formed from the camera-number information of the picture, and the sizes of the position encoding and the camera information encoding are both R^{1×N′×D}, with all initial values 0.

E_mc_cls is fed into the Block network of the Transformer network, and the Block network processes the training-picture multi-camera-domain feature-extraction information to extract the multi-camera-domain picture features F_mc; the extracted multi-camera-domain picture features F_mc are the Token generated by the Interformer network in Fig. 3 and Fig. 4.
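The input pipeline just described (Split, linear projection, Cls token, position encoding, camera information encoding) corresponds to a ViT-style patch embedding; the following is a minimal sketch with illustrative default sizes, not the patent's exact configuration:

```python
import torch
import torch.nn as nn

class MultiCameraPatchEmbed(nn.Module):
    """Split -> linear projection -> Cls token -> position (+ camera) encoding."""
    def __init__(self, H=256, W=128, P_h=16, P_w=16, C=3, D=768, m=6):
        super().__init__()
        N = (H // P_h) * (W // P_w)                        # number of Patches
        self.proj = nn.Conv2d(C, D, kernel_size=(P_h, P_w), stride=(P_h, P_w))
        self.cls = nn.Parameter(torch.zeros(1, 1, D))      # parameter Cls token
        self.pos = nn.Parameter(torch.zeros(1, N + 1, D))  # position encoding, init 0
        self.cam = nn.Parameter(torch.zeros(m, 1, D))      # camera information encoding
    def forward(self, x, cam_id=None):                     # x: (B, C, H, W)
        E = self.proj(x).flatten(2).transpose(1, 2)        # (B, N, D) Patch encoding E_mc
        cls = self.cls.expand(E.size(0), -1, -1)
        E = torch.cat([cls, E], dim=1)                     # N' = N + 1
        E = E + self.pos                                   # embed position information
        if cam_id is not None:                             # Interformer branch only
            E = E + self.cam[cam_id]                       # (B, 1, D), broadcast over tokens
        return E                                           # E_mc_cls, fed to the Blocks
```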
For the single-camera-domain basic networks, the input training pictures are classified according to their camera labels and sent into the corresponding single-camera-domain basic network; for example, for the i-th camera, the training pictures sent into the corresponding single-camera-domain basic network are the pictures acquired by the i-th camera. Specifically, the input data of each single-camera-domain basic network is the set X_c_i ∈ R^{B×C×H×W} of single-camera pictures, where c_i denotes the i-th camera. After the same segmentation and dimension transformation as in the multi-camera-domain basic network, the image-block Patch encoding E_c_i ∈ R^{B×N×D} is obtained. A parameter Cls token is connected to the image-block Patch encoding E_c_i, into which the position encoding of the image blocks is also embedded. Subsequently, the image-block Patch encoding E_c_i is passed into the Block network of the Transformer network to extract the single-camera-domain picture features F_c_i; the extracted features F_c_i are the Token generated by the Intraformer network in Fig. 3 and Fig. 4.
Fig. 4 shows an embodiment in which the Block network processes E_mc_cls, i.e. the position information and camera information, to obtain the Token; the specific method and process by which the Block network obtains the Token can refer to the processing procedure in Fig. 4, which is similar to the Block network processing in the existing Transformer network and is not detailed here.
In the multi-camera-domain basic network used to form the multi-camera-domain Interformer network and the single-camera-domain basic networks used to form the single-camera-domain Intraformer networks, the parameters of the Block network are determined through the above steps, so the picture features can be extracted directly with the Block network; the specifics of the Block network are well known to those skilled in the art and are not described here again.
In an embodiment of the present invention, in step 2, when clustering the obtained multi-camera-domain picture features F_mc and all single-camera-domain picture features F_c, the clustering method includes the DBSCAN clustering method.
In the clustering process, some of the extracted image features are disturbed by noise such as pedestrian pose and background, so they lie far from any cluster center and cannot be clustered successfully; such samples are called outlier samples, and all outlier samples form the Outliers. In an embodiment of the invention, unsupervised pedestrian re-identification is performed through collaborative training with the pseudo labels obtained by clustering, and outlier samples, lacking labels, cannot be used in training.

In specific implementation, the clustering method can adopt the DBSCAN clustering method, which does not require the number of clusters to be specified and can learn the number of cluster categories on its own. After clustering, clustering-point pseudo labels are assigned to the successfully clustered pictures, and the unsuccessfully clustered pictures form the Outliers. Of course, other common clustering forms may also be adopted, provided the actual clustering requirements are met. When the DBSCAN clustering method is adopted, the specific conditions under which the clustering points (Inliers) and the outlier samples (Outliers) are formed can be selected and determined according to actual needs.
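A sketch of this clustering step with scikit-learn's DBSCAN (the eps and min_samples values are illustrative assumptions, not the patent's settings):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_pseudo_labels(feats: np.ndarray, eps: float = 0.6, min_samples: int = 4):
    """DBSCAN learns the number of categories on its own; samples it cannot
    cluster are labeled -1 and form the Outliers."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    inliers = np.where(labels >= 0)[0]      # clustering points (Inliers)
    outliers = np.where(labels < 0)[0]      # Outliers awaiting redistribution
    return labels, inliers, outliers
```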
Fig. 6 counts the number of outlier samples of the pedestrian re-identification data set Market-1501 after DBSCAN clustering: outlier samples account for more than 60% of all samples at the initial training stage, and still account for more than 10% after the model has iterated multiple times. Compared with convolutional neural networks, the Transformer network has less inductive bias for the structure of input data, such as locality and translation invariance; therefore more data is needed to train the Transformer network, especially in the early stage of model training. To obtain better results, outlier samples need to be fully utilized.
In an embodiment of the present invention, in step 3, the clustering-point pseudo-label cluster centers are given by:

$$\Phi_i = \frac{1}{num_i}\sum_{j=1}^{num_i} f_j, \quad i = 1,\dots,Y$$

where Y is the number of categories of the clustering-point pseudo labels, Φ_i is the cluster-center feature of the i-th category, f_j is the feature of the j-th picture in the i-th category, and num_i is the number of pictures contained in the i-th category;

the generated clustering-point pseudo-label cluster centers are stored in the cluster-center feature repository (Center Memory Bank);
an affinity matrix between the outlier samples within the Outliers and the clustering-point pseudo-label cluster centers is computed, wherein,

the affinity matrix between the Outliers and the clustering-point pseudo-label cluster centers is:

$$AFM(i,j) = \frac{\sum_{r=1}^{N}\Phi_{i\_r}\,O_{j\_r}}{\sqrt{\sum_{r=1}^{N}\Phi_{i\_r}^{2}}\;\sqrt{\sum_{r=1}^{N}O_{j\_r}^{2}}}$$

where AFM(i,j) is the mutual-similarity value in the affinity matrix AFM between the i-th cluster-center feature Φ_i and the j-th outlier sample, O_j is the feature of the j-th outlier sample, Φ_i_r denotes the r-th element of the i-th cluster-center feature Φ_i, O_j_r denotes the r-th element of the j-th outlier-sample feature O_j, and N is the feature dimension;

when adaptive outlier sample redistribution is performed based on the computed affinity matrix AFM, each outlier sample is assigned to the clustering-point pseudo-label cluster center with which it has the strongest mutual similarity.
In specific implementation, after clustering with the DBSCAN clustering method, the number of categories Y of the clustering-point pseudo labels can be obtained from the clustering points (Inliers) formed; likewise, the feature f_j of the j-th picture of the i-th category and the number of pictures num_i contained in the i-th category can be obtained. Thus, after clustering, the clustering-point pseudo-label cluster centers {Φ_1, Φ_2, …, Φ_i, …, Φ_Y} can be generated.

The feature dimension N is determined by the constructed multi-branch network recognition model Multiformer; for a given Multiformer model, the feature dimension N remains fixed. The i-th cluster-center feature Φ_i therefore has the same feature dimension as the j-th outlier sample. For the affinity matrix AFM, the mutual-similarity values between each outlier sample and each cluster-center feature can be obtained in accordance with the prior art, that is, based on the affinity matrix AFM.
As can be seen from the above description, when determining convergence, the constructed multi-branch network recognition basic model generally needs to be trained multiple times. In an embodiment of the invention, the cluster-center feature repository (Center Memory Bank) is used to store the clustering-point pseudo-label cluster centers after clustering in each training round.

After the clustering-point pseudo-label cluster centers are stored in the cluster-center feature repository (Center Memory Bank), the affinity matrix between the outlier samples within the Outliers and the cluster centers can be computed, and adaptive outlier sample redistribution is carried out based on the computed affinity matrix; this expands the amount of data available for training the model, enhances the feature-representation capability of the model, and obtains better performance.

The mutual-similarity value AFM(i,j) between the i-th cluster-center feature Φ_i and the j-th outlier sample depends on the feature dimension N and on the i-th cluster-center feature Φ_i itself. In specific implementation, during adaptive outlier sample redistribution, the j-th outlier sample is assigned to the clustering-point pseudo-label cluster center with which the mutual similarity is strongest; that is, for the j-th outlier sample, the mutual-similarity value AFM(i,j) of the assigned center Φ_i is maximal.
In one embodiment of the invention, a mutual-similarity threshold ν is configured for the mutual similarity between the outlier samples and the clustering-point pseudo-label cluster centers, wherein,

ν is scheduled over training as a function of the training round epoch: starting from ν_start, it increases at a rate governed by γ while epoch < e_peak and decays thereafter, where Num_O is the number of outlier samples within the Outliers, ν_start is the initial value of the mutual-similarity threshold ν, γ is the threshold decay rate, epoch is the training round, e_peak is the training round at which the mutual-similarity threshold ν reaches its peak, and 𝕀(·) is an indicator function whose value is 1 when the training round is less than e_peak, i.e. 𝕀(·) = 𝕀{epoch < e_peak}.
When outlier samples are assigned based on the configured mutual-similarity threshold ν, the j-th outlier sample whose mutual-similarity value AFM(i,j) is greater than the threshold ν is assigned to the clustering-point pseudo-label cluster center with the strongest mutual similarity.
In specific implementation, when the multi-branch network recognition basic model is in the initial training stage, its feature extraction capability is poor and the accuracy of the extracted features is relatively low, so a smaller mutual similarity relation threshold ν is adopted. As training proceeds, the feature extraction capability of the model is gradually strengthened, and the mutual similarity relation threshold ν is adaptively increased accordingly. However, because some pictures in the data contain multiple pedestrians, occlusion, blurring and the like, a part of the outlier sample points perpetually oscillate or can never be clustered; these are called strong noise points. Therefore, after a certain number of iterations, the mutual similarity relation threshold ν is adaptively reduced to ignore the interference of the strong noise points on the model, as shown in fig. 7.
The mutual similarity relation threshold ν is thus configured according to the conditions of the different training stages. The initial value ν_start may generally be set to 0.6, the threshold decay rate γ may generally be set to 0.9, and e_peak is an empirical value that may generally be set to 10. epoch is the round of model training.
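The exact schedule formula is rendered only as an image in the original; purely as an assumed stand-in that mimics the behavior described above (small at the start, peaking at e_peak, then decaying at rate γ), one might write:

```python
def similarity_threshold(epoch, nu_start=0.6, gamma=0.9, e_peak=10):
    """Assumed schedule for the mutual similarity threshold nu: it rises
    toward nu_start until e_peak, then decays so that strong noise points
    are eventually excluded. This only mimics the described behavior; the
    patent's actual formula is not recoverable from the text."""
    if epoch < e_peak:                      # indicator II{epoch < e_peak} = 1
        return nu_start * gamma ** (e_peak - epoch)   # grows toward nu_start
    return nu_start * gamma ** (epoch - e_peak)       # decays after the peak
```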
After the mutual similarity relation threshold ν is configured, when the mutual similarity relation value AFM(i, j) is greater than the threshold ν, the jth outlier sample is allocated to the clustering point pseudo label clustering center with the strongest mutual similarity relation; otherwise the jth outlier sample is not allocated.
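A minimal sketch of this allocation rule, under the conventions of the DBSCAN sketch above (label -1 for outliers, AFM of shape (Y, Num_O) with columns ordered like the outlier samples):

```python
import numpy as np

def reassign_outliers(afm, labels, nu):
    """Adaptive outlier sample redistribution: the j-th outlier sample is
    given the pseudo label of the cluster center with the largest AFM(i, j),
    but only when that value exceeds the threshold nu; otherwise it stays
    unallocated for this training round."""
    new_labels = labels.copy()
    outlier_idx = np.where(labels == -1)[0]
    best_center = afm.argmax(axis=0)   # strongest mutual similarity per outlier
    best_value = afm.max(axis=0)
    for j, sample in enumerate(outlier_idx):
        if best_value[j] > nu:
            new_labels[sample] = best_center[j]
    return new_labels
```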
In an embodiment of the invention, in step 4, in the process of joint contrast learning, the cluster contrast loss l_c is obtained after the cluster-level contrast learning, and the example contrast loss l_t is obtained after the example-level contrast learning, wherein,
for the cluster contrast loss l_c:

l_c = −log [ exp(f(q)·Φ_+ / Γ) / Σ_{i=1}^{Y} exp(f(q)·Φ_i / Γ) ]

wherein Φ_+ is the positive sample (cluster center) of the sample picture q, Γ is a set parameter, and f(q) is the query instance feature of the sample picture q;
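A direct NumPy transcription of this loss for a single query feature, assuming the centers are stored row-wise and pos_id indexes Φ_+ (Γ = 0.5 follows the value mentioned further below):

```python
import numpy as np

def cluster_contrast_loss(f_q, centers, pos_id, gamma=0.5):
    """Cluster-level contrast loss l_c for one query feature f(q):
    softmax over similarities to all Y cluster centers, with Phi_+ the
    center of q's own cluster and Gamma a temperature parameter."""
    logits = centers @ f_q / gamma          # similarity to every Phi_i
    logits = logits - logits.max()          # numerical stabilization
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[pos_id]))
```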
for the example contrast loss l_t:

l_t = Σ_{i=1}^{P} Σ_{a=1}^{K} [ β + max_{p=1…K} ‖f(x_a^i) − f(x_p^i)‖_2 − min_{j≠i, n=1…K} ‖f(x_a^i) − f(x_n^j)‖_2 ]_+

wherein P is the number of different pedestrians selected in a given sample, K is the number of sample pictures selected for each pedestrian in the given sample, a indexes one picture among the K sample pictures, x_a^i is an anchor image with identity i, x_p^i is a positive sample with identity i, x_n^j is a negative sample with identity j, f(x_a^i), f(x_p^i) and f(x_n^j) are the corresponding extracted image features, β is the minimum gap between the similarity of the positive sample pair and the similarity of the negative sample pair, and [·]_+ denotes max(·, 0).
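A sketch of this batch-hard selection for one P×K batch, assuming `feats` holds the extracted features row-wise and `pids` the pedestrian identities, with at least two images per identity and two identities per batch (β = 0.3 follows the value mentioned further below):

```python
import numpy as np

def example_contrast_loss(feats, pids, beta=0.3):
    """Example-level contrast loss l_t: for every anchor, take the most
    dissimilar positive and the most similar negative under the two-norm,
    with margin beta."""
    n = len(feats)
    dist = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    same = pids[:, None] == pids[None, :]
    total = 0.0
    for a in range(n):
        pos_mask = same[a].copy()
        pos_mask[a] = False                     # exclude the anchor itself
        hardest_pos = dist[a][pos_mask].max()   # most dissimilar positive
        hardest_neg = dist[a][~same[a]].min()   # most similar negative
        total += max(0.0, beta + hardest_pos - hardest_neg)
    return total
```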
In specific implementation, in addition to the conventional example-level contrast learning, in order to improve the clustering effect of the model and reduce the distance between the outlier samples and the clustering point pseudo label clustering centers, an embodiment of the invention adds cluster-level contrast learning and trains it jointly with the example-level contrast learning. Cluster-level contrast learning mainly pulls samples toward their positive clusters and pushes them away from their negative clusters, and compared with example-level contrast learning it greatly reduces the computation of the model. Clustering is facilitated by the use of contrast samples, and this cluster-oriented contrast learning paradigm helps the model minimize the similarity between clusters so as to separate different clusters. The situation of joint contrast learning is shown in fig. 8.
In specific implementation, during training, the single-camera domain basic networks and the multi-camera domain basic network must undergo joint contrast learning. Within joint contrast learning, the purpose of cluster-level contrast learning is to minimize the distance between the sample picture q and its positive cluster and maximize the distance between the sample picture q and its negative clusters, so that after the cluster-level contrast learning the cluster contrast loss l_c can be obtained.
The samples of each batch in cluster-level contrast learning only need to be contrasted with the clustering point pseudo label cluster center features. In specific implementation, when cluster-level contrast learning is performed on a single-camera domain basic network, the sample picture q is a picture captured by the camera corresponding to that single-camera domain basic network; when cluster-level contrast learning is performed on the multi-camera domain basic network, the sample picture q is any picture in the training data set.
As can be seen from the above description, after clustering, the pictures in the training data set that clustered successfully are configured with clustering point pseudo labels. Hence, once the sample picture q is determined, a picture of the same category as q is a positive sample and a picture of a different category is a negative sample; that is, the positive sample Φ_+ of the sample picture q and the query instance feature f(q) of the sample picture q can be determined by technical means commonly used in this field. The cluster contrast losses l_c corresponding to the single-camera domain basic networks and the multi-camera domain basic network can thus be obtained respectively.
In specific implementation, the value range of the parameter Γ is [0,1]; for example, Γ may be taken as 0.5. In addition, the minimum gap β between the similarity of the positive sample pair and the similarity of the negative sample pair is an empirical value and may take, for example, 0.3.
In specific implementation, the purpose of example-level contrast learning is to increase the similarity between samples of the same identity and reduce the similarity between samples of different identities. For a given batch, the given samples are sample pictures selected from the training data set: P different pedestrians are selected, and K pictures are selected for each pedestrian. For each image a, the most dissimilar positive sample p and the most similar negative sample n are picked for example-level contrast learning. Generally, P may be set to 8 and K to 32 in a given sample. Identity i or identity j refers to one of the P different pedestrians.
Example-level contrast learning helps the multi-branch network recognition basic model learn the salient features that distinguish different samples, strengthening its feature representation capability. Combining the two forms of contrast learning greatly improves the clustering accuracy of the model and alleviates the problem of noisy pseudo labels.
In an embodiment of the present invention, the query instance feature f(q) is used for contrast learning against the clustering Center feature repository Center Memory Bank, and the Center Memory Bank is updated as follows: after the cluster contrast loss l_c is calculated, the cluster center feature is updated with the query instance feature f(q) according to

Φ_+ ← (1 − u)·f(q) + u·Φ_+

wherein u is a parameter used to update the features of the clustering Center feature repository Center Memory Bank slowly, avoiding the loss of feature consistency caused by drastic oscillation. The value range of u is generally [0,1], and its specific value can be chosen according to actual needs.
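As a one-line illustration of this update, with `center_bank` holding the Center Memory Bank row-wise and u = 0.9 merely an assumed value within the stated [0,1] range:

```python
import numpy as np

def update_center(center_bank, pos_id, f_q, u=0.9):
    """Slow momentum-style update of the Center Memory Bank after l_c is
    computed: Phi_+ <- (1 - u) * f(q) + u * Phi_+; u close to 1 keeps the
    stored center features from oscillating drastically."""
    center_bank[pos_id] = (1.0 - u) * f_q + u * center_bank[pos_id]
    return center_bank
```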
In one embodiment of the invention, during the model network parameter optimization based on joint contrast learning, the model network parameters θ are determined so as to minimize the loss function of the NH training samples under the determined parameters θ. When optimizing, the multi-camera domain basic network and all the single-camera domain basic networks are optimized simultaneously:

θ* = argmin_θ Σ_{a=1}^{NH} [ β + max d_{a,p} − min d_{a,n} ]_+

wherein f(x_a) is the extracted image feature of the anchor point image x_a, max d_{a,p} = max_p ‖f(x_a) − f(x_p)‖_2 and min d_{a,n} = min_n ‖f(x_a) − f(x_n)‖_2, i.e. the two-norm is applied. In specific implementation, the NH training samples are samples selected from the training data set, and their number may be chosen as needed.
In an embodiment of the present invention, the collaborative training identity loss is:

l_id = −(1/Nz) Σ_{i=1}^{Nz} log p(ỹ_i | x_i)

wherein l_id is the collaborative training identity loss, ỹ_i is the real identity label of x_i, Nz is the number of training samples in the training data set, and p(ỹ_i | x_i) is the probability that the multi-branch network recognition basic model outputs the real identity label ỹ_i for the training sample x_i.

In specific implementation, the number Nz of training samples can be determined from the provided training data set. As can be seen from the above description, during training, clustering is followed by the assignment of clustering point pseudo labels and the reassignment of outlier samples, after which the probability p(ỹ_i | x_i) that the multi-branch network recognition basic model outputs the real identity label ỹ_i for the training sample x_i can be obtained.
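A sketch of this loss from classifier logits, where `logits` has shape (Nz, number of identity classes) and `pseudo_labels` holds the assigned labels ỹ_i (the names are ours):

```python
import numpy as np

def cotraining_identity_loss(logits, pseudo_labels):
    """Collaborative training identity loss l_id: the average negative
    log-probability -(1/Nz) * sum_i log p(y_i | x_i), computed with a
    numerically stable log-softmax."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(pseudo_labels)), pseudo_labels].mean())
```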
In specific implementation, as can be seen from the above description, the cluster contrast loss l_c, the example contrast loss l_t and the collaborative training identity loss l_id are all obtained during training; the total loss of one training round can then be obtained by combining the three, e.g. l = l_c + l_t + l_id.

As can be seen from the above description, when judging whether the multi-branch network recognition basic model has converged, the main indexes are the precision of the model and the loss of the model: the precision is generally the mean average precision mAP, and the loss is the total loss l. After training, the specific computation of the mean average precision mAP can be consistent with the prior art, so convergence of the multi-branch network recognition basic model can be judged effectively. When training of the multi-branch network recognition basic model converges, the target training state is reached, and the multi-branch network recognition basic model is then used to form the multi-branch network recognition model Multiformer.
After the multi-branch network recognition model Multiformer is obtained, when unsupervised pedestrian re-identification is performed, a query picture R needs to be provided in order to search, in the picture set taken by the m cameras, for pedestrians with features similar to those of the query picture R. As can be seen from the above description, for the picture set captured by the m cameras, all features in the picture set are extracted with the multi-branch network recognition model Multiformer; the specific manner and process of extracting the picture features are as described above.

After the query picture R is processed by the same technical means, its corresponding features are extracted with the multi-camera domain Interformer network. Once the features of the query picture R are obtained, the feature similarity with the features extracted from the picture set is calculated; the specific manner and process of calculating the feature similarity may follow conventional practice. From the calculated feature similarities, the pedestrian images matching the extracted pedestrian features can be selected according to actual requirements; for example, a feature similarity threshold can be set, and all pedestrian images meeting the threshold are determined as the pedestrian images matching the query picture R. The feature similarity threshold and the like can be chosen as needed to satisfy the requirements of the actual application scenario.
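A minimal retrieval sketch under these conventions, with cosine similarity as the feature similarity and a purely illustrative threshold of 0.5:

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats, sim_threshold=0.5):
    """Compare the Interformer feature of query picture R with the features
    extracted from the m-camera picture set; return (index, similarity)
    pairs above the threshold, best matches first."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q
    order = np.argsort(-sims)                 # best matches first
    return [(int(i), float(sims[i])) for i in order if sims[i] >= sim_threshold]
```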
In order to verify the accuracy and robustness of the invention, experiments are carried out on three public data sets: Market-1501, MSMT17 and DukeMTMC-reID. Specifically, the DukeMTMC-reID data set contains 36411 images of 1812 identities taken by 8 cameras, with a training set of 702 identities containing 16522 images and a test set of 702 identities. The Market-1501 data set contains 1501 pedestrians photographed by 6 cameras, with 751 identities in the training set containing 12936 images and 750 identities in the test set containing 19732 images. The MSMT17 data set contains 4101 pedestrians and 126441 bounding boxes captured by 15 cameras; the training set contains 1041 pedestrians with 32621 bounding boxes, and the test set contains 3060 pedestrians with 93820 bounding boxes.
Because these data sets were collected by multiple camera devices, they contain diverse poses, viewing angles and illumination changes, along with heavily cluttered backgrounds and occlusion between pedestrians in different scenes, so all of them pose great challenges.
Table 1 data set introduction
Data set Number of categories Number of training classes Number of test classes Size of picture
DukeMTMC-reID 1812 702 1110 256*128
Market-1501 1501 751 750 256*128
MSMT17 4101 1041 3060 256*128
Table 1 gives the total number of categories, training categories and test categories of the three data sets; the picture size is uniformly set to 256*128.
TABLE 2 accuracy of the model over three pedestrian re-identification tasks
Data set Market-1501 DukeMTMC-reID MSMT17
mAP 79.1% 68.9% 36.0%
Table 2 shows the test results of the unsupervised pedestrian re-identification method provided by the invention on three unsupervised pedestrian re-identification tasks, Market-1501, DukeMTMC-reID and MSMT17, with the mean average precision mAP used as the evaluation index.
The invention achieves a high recognition rate on all three data sets. Although the three data sets present difficulties such as occlusion, deformation, background clutter and low resolution, the method benefits from the robust feature representation capability of the Multiformer, the ability of the joint contrast learning strategy to optimize the cluster representations, and the efficient data utilization of the adaptive outlier sample redistribution strategy; it is therefore robust to these difficulties and performs excellently.
In order to verify the performance improvement brought to the whole unsupervised pedestrian re-identification task by the multi-branch network identification model Multiformer, the adaptive outlier sample redistribution strategy and the joint contrast learning strategy, an ablation experiment is performed on the Market-1501 data set, as shown in Table 3. Specifically, VIT is taken as the baseline network, denoted Baseline; Multiformer denotes the multi-branch network identification model Multiformer of the invention; JCL denotes the Joint Contrast Learning module; and AORA denotes the adaptive outlier sample redistribution strategy.
As can be seen from Table 3, using the baseline network alone reaches an accuracy of only 59.6% on the Market-1501 unsupervised pedestrian re-identification task. Modifying the network model structure of the baseline network into the multi-branch network recognition model Multiformer raises the accuracy to 69.2%, which shows that the Multiformer improves the feature representation capability of the model.
After the cluster center features are established for joint contrast learning, the accuracy of the model reaches 77.1%, which shows that cluster-level contrast learning effectively enables the model to learn the similarity with positive clusters and the difference from negative clusters. On this basis, after the adaptive outlier sample redistribution strategy is added, the accuracy of the model reaches 79.1%, which shows that this module makes fuller use of the limited data samples, so that the model is trained more adequately.
TABLE 3 influence of different modules on Market-1501 unsupervised pedestrian re-identification task
Method mAP
Baseline 59.6%
Baseline+Multiformer 69.2%
Baseline+Multiformer+JCL 77.1%
Baseline+Multiformer+JCL+AORA 79.1%
In order to better show the effect of the Multiformer, the adaptive outlier sample redistribution strategy and the joint contrast learning strategy designed in the present invention, a visualization result is given in fig. 9.
In summary, the multi-branch network identification model Multiformer is constructed based on a Transformer network; the constructed Multiformer comprises single-camera-domain Intraformer networks and a multi-camera-domain Interformer network, and all the single-camera-domain Intraformer networks share the backbone network parameters. This strengthens the generalization capability, alleviates to a certain extent the inter-domain differences caused by the backgrounds, illumination and the like of different camera domains, improves the robustness of the model to noisy pseudo labels, and further improves the accuracy of unsupervised pedestrian re-identification.
By using adaptive outlier sample redistribution, the number of pseudo labels can be expanded and the feature representation capability of the multi-branch network recognition model Multiformer enhanced. During model training, example-level contrast learning and cluster-level contrast learning are combined, which greatly improves the clustering accuracy and alleviates the problem of noisy pseudo labels.

Claims (10)

1. An unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution is characterized by comprising the following steps:
constructing a multi-branch network identification model Multiformer based on a Transformer network to perform required unsupervised pedestrian re-identification on pedestrian images collected by m cameras by using the constructed multi-branch network identification model Multiformer, wherein,
for the constructed multi-branch network identification model Multiformer, the model comprises a single-camera-domain Intraformer network constructed on the basis of a Transformer network for each camera and a multi-camera-domain Interformer network constructed on the basis of the Transformer network for all cameras;
when the multi-branch network recognition model Multiformer is constructed, the single-camera-domain Intraformer networks of all cameras and the multi-camera-domain Interformer network adopt the same backbone network, and the single-camera-domain Intraformer networks of all cameras share the backbone network parameters during training;
when the pedestrian is re-identified, feature extraction is carried out on an identification image containing the pedestrian to be identified by using a multi-camera domain Interformer network, so that the pedestrian image matched with the extracted pedestrian feature is searched and determined in the pedestrian images collected by the m cameras according to the extracted pedestrian feature.
2. The unsupervised pedestrian re-identification method based on Multiformer and outlier sample redistribution as claimed in claim 1, wherein when constructing the multi-branch network identification model Multiformer, the construction step comprises:
constructing a multi-branch network identification basic model based on a transform network, wherein the multi-branch network identification basic model comprises a multi-camera domain basic network based on the transform network and m single-camera domain basic networks based on the transform network, a classifier is configured in the multi-camera domain basic network and all the single-camera domain basic networks, and the configured classifier is adaptively connected with corresponding backbone networks in the multi-camera domain basic network or the single-camera domain basic network;
when a multi-branch network identification basic model is constructed, pre-training a backbone network for constructing a multi-camera domain basic network on the basis of an ImageNet data set to obtain multi-camera domain backbone network pre-training parameters of the multi-camera domain basic network;
when the constructed single-camera domain basic network is trained, loading the obtained multi-camera domain backbone network pre-training parameters to the backbone networks of all the single-camera domain basic networks so as to enable the single-camera domain basic networks of all the cameras to share the network backbone parameters;
performing the required training on the constructed multi-branch network recognition basic model, so that when the target training state is reached, a corresponding single-camera-domain Intraformer network is formed from each trained single-camera-domain basic network and the multi-camera-domain Interformer network is formed from the trained multi-camera-domain basic network;
and forming a multi-branch network identification model Multiformer by using the multi-camera domain Interformer network and the m single-camera domain Intraformer networks.
3. The unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-allocation as claimed in claim 2, wherein when training the constructed multi-branch network identification basic model, the training process comprises:
step 1, performing feature extraction on a training data set by utilizing the multi-branch network identification basic model to obtain the multi-camera-domain picture features F_mc and the single-camera-domain picture features F_c_i of the ith camera, i = 1, …, m;
step 2, clustering the obtained multi-camera-domain picture features F_mc and the single-camera-domain picture features F_c_i of the ith camera, wherein successfully clustered pictures form the clustering points Inliers, clustering point pseudo labels are allocated to the pictures in the clustering points Inliers, and unsuccessfully clustered pictures form the Outliers;
step 3, generating clustering point pseudo label clustering centers based on the clustering point pseudo labels, performing adaptive outlier sample redistribution on the Outliers by using the generated clustering point pseudo label clustering centers, allocating corresponding clustering point pseudo labels to the outlier samples in the Outliers after the adaptive outlier sample redistribution, and forming a pseudo label training set from all the clustering point pseudo labels;
step 4, performing joint comparison learning on the multi-branch network identification basic model to perform model network parameter optimization based on the joint comparison learning on the multi-branch network identification basic model, wherein,
for the ith single-camera domain basic network, joint contrast learning is performed based on the training data set, the single-camera-domain picture features F_c_i of the ith camera, and the clustering point pseudo label clustering centers;
for the multi-camera domain basic network, joint contrast learning is performed based on the training data set, the multi-camera-domain picture features F_mc, and the clustering point pseudo label clustering centers;
the joint contrast learning comprises clustering level contrast learning and example level contrast learning;
step 5, carrying out the collaborative training of the single-camera domain basic network and the multi-camera domain basic network on the multi-branch network identification basic model after the optimization based on the joint comparison learning, wherein,
the multi-camera domain basic network is trained with the multi-camera-domain picture features F_mc and the pseudo label training set;
the ith single-camera domain basic network is trained with the single-camera-domain picture features F_c_i of the ith camera and the pseudo label training set;
and 6, repeating the training process from the step 1 to the step 5 until a target training state is reached.
4. The unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-allocation as claimed in claim 3, wherein, for step 1, when extracting the multi-camera-domain picture features F_mc, Split processing is performed on any training picture in the training data set, a parameter Cls token is connected to each image block obtained by the Split processing, and the position information of each image block and the camera information code of the training picture are embedded to configure and form the training picture multi-camera-domain feature extraction information;
the multi-camera-domain feature extraction information of the training picture is processed by the multi-camera domain basic network to extract the multi-camera-domain picture features F_mc;
when extracting the single-camera-domain picture features F_c_i of the ith camera, Split processing is performed on the training pictures acquired by the ith camera, a parameter Cls token is connected to each image block obtained by the Split processing, and the position information of each image block is embedded to form the training picture single-camera-domain feature extraction information;
the training picture single-camera-domain feature extraction information is processed by the single-camera domain basic network corresponding to the ith camera to extract the single-camera-domain picture features F_c_i.
5. The unsupervised pedestrian re-identification method based on Multiformer and outlier sample redistribution as claimed in claim 3, wherein in step 2, when clustering the obtained multi-camera-domain picture features F_mc and all the single-camera-domain picture features F_c_i, the clustering method comprises the DBSCAN clustering method.
6. The unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-allocation as claimed in claim 3, wherein in step 3, the clustering point pseudo label clustering centers are:

Φ_i = (1/num_i) · Σ_{j=1}^{num_i} f_j

wherein Y is the category number of the clustering point pseudo labels, Φ_i is the cluster center feature of the ith category, f_j is the feature of the jth picture of the ith category, and num_i is the number of pictures included in the ith category;
the generated clustering point pseudo label clustering centers are stored in the clustering Center feature repository Center Memory Bank;
an affinity matrix between the outlier samples within the Outliers and the clustering point pseudo label clustering centers is computed, wherein the affinity matrix is:

AFM(i, j) = Σ_{r=1}^{N} Φ_{i_r}·O_{j_r} / ( √(Σ_{r=1}^{N} Φ_{i_r}²) · √(Σ_{r=1}^{N} O_{j_r}²) )

wherein AFM(i, j) is the mutual similarity relation value in the AFM between the ith cluster center feature Φ_i and the jth outlier sample, O_j is the feature of the jth outlier sample, Φ_{i_r} denotes the rth element of the ith cluster center feature Φ_i, O_{j_r} denotes the rth element of the jth outlier sample feature O_j, and N denotes the feature dimension;
when adaptive outlier sample redistribution is performed based on the calculated affinity matrix AFM, each outlier sample is allocated to the clustering point pseudo label clustering center with the strongest mutual similarity relation.
7. The unsupervised pedestrian re-identification method based on Multiformer and outlier sample redistribution as claimed in claim 6, wherein a mutual similarity relation threshold ν is configured for the mutual similarity relation between the outlier samples and the clustering point pseudo label clustering centers, the threshold following a schedule over the training rounds (given as a formula image in the original), wherein Num_O is the number of outlier samples within the Outliers, ν_start is the initial value of the mutual similarity relation threshold ν, γ is the threshold decay rate, epoch is the training round, e_peak is the training round at which the threshold ν reaches its peak, and II(·) is an indicator function equal to 1 when the training round is less than e_peak, i.e. II(·) = II{epoch < e_peak};
when the outlier samples are allocated based on the configured mutual similarity relation threshold ν, the jth outlier sample whose mutual similarity relation value AFM(i, j) is greater than the threshold ν is allocated to the clustering point pseudo label clustering center with the strongest mutual similarity relation.
8. The unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-allocation as claimed in claim 6, wherein in step 4, in the process of joint contrast learning, the cluster contrast loss l_c is obtained after the cluster-level contrast learning and the example contrast loss l_t is obtained after the example-level contrast learning, wherein,
for the cluster contrast loss l_c:

l_c = −log [ exp(f(q)·Φ_+ / Γ) / Σ_{i=1}^{Y} exp(f(q)·Φ_i / Γ) ]

wherein Φ_+ is the positive sample of the sample picture q, Γ is a set parameter, and f(q) is the query instance feature of the sample picture q;
for the example contrast loss l_t:

l_t = Σ_{i=1}^{P} Σ_{a=1}^{K} [ β + max_{p=1…K} ‖f(x_a^i) − f(x_p^i)‖_2 − min_{j≠i, n=1…K} ‖f(x_a^i) − f(x_n^j)‖_2 ]_+

wherein P is the number of different pedestrians selected in a given sample, K is the number of sample pictures selected for each pedestrian in the given sample, a indexes one picture among the K sample pictures, x_a^i is an anchor image with identity i, x_p^i is a positive sample with identity i, x_n^j is a negative sample with identity j, f(x_a^i), f(x_p^i) and f(x_n^j) are the corresponding extracted image features, and β is the minimum gap between the similarity of the positive sample pair and the similarity of the negative sample pair.
9. The unsupervised pedestrian re-identification method based on Multiformer and outlier sample redistribution as claimed in claim 8, wherein, during the model network parameter optimization based on joint contrast learning, the model network parameters θ are determined so as to minimize the loss function of the NH training samples under the determined parameters θ, wherein the multi-camera domain basic network and all the single-camera domain basic networks are optimized simultaneously:

θ* = argmin_θ Σ_{a=1}^{NH} [ β + max d_{a,p} − min d_{a,n} ]_+

wherein f(x_a) is the extracted image feature of the anchor point image x_a, max d_{a,p} = max_p ‖f(x_a) − f(x_p)‖_2, and min d_{a,n} = min_n ‖f(x_a) − f(x_n)‖_2.
10. The unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-allocation as claimed in claim 8, wherein the collaborative training identity loss is:

l_id = −(1/Nz) Σ_{i=1}^{Nz} log p(ỹ_i | x_i)

wherein l_id is the collaborative training identity loss, ỹ_i is the real identity label of x_i, Nz is the number of training samples in the training data set, and p(ỹ_i | x_i) is the probability that the multi-branch network recognition basic model outputs the real identity label ỹ_i for the training sample x_i.