CN115641613A - Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning - Google Patents

Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning

Info

Publication number: CN115641613A
Application number: CN202211372036.3A
Authority: CN (China)
Legal status: Pending
Prior art keywords: training, picture, model, loss, representing
Inventors: 杨曦, 郑顾, 袁柳, 魏梓钰, 杨东
Assignees: Xidian University; China Academy of Electronic and Information Technology of CETC
Application filed by Xidian University and China Academy of Electronic and Information Technology of CETC
Priority: CN202211372036.3A
Classification: Image Analysis
Abstract

The invention relates to an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning, comprising: constructing two identical original convolutional neural networks and pre-training them on a source domain training set with different initialization parameters to obtain two pre-trained student models, then copying each to obtain two pre-trained teacher models; constructing a picture feature memory bank; performing multiple rounds of target domain interactive supervised learning on the two pre-trained student models and the two pre-trained teacher models with a target domain training set until a preset learning termination condition is met, obtaining two student models and two teacher models after cross-domain learning; and identifying a target domain query sample with any model after cross-domain learning and finding the pictures with the same label in the target domain gallery picture set, completing pedestrian re-identification. The method improves the recognition precision of unsupervised cross-domain pedestrian re-identification.

Description

Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning.
Background
Pedestrian re-identification is the task of retrieving, given an image of a certain pedestrian, images of the same identity from a given series of pedestrian images captured at different viewing angles or in different scenes, using computer vision or machine learning methods. Because it addresses problems such as target retrieval and real-time tracking in daily life, it has wide application in the field of intelligent video surveillance.
Cross-domain unsupervised pedestrian re-identification performs transfer learning with the help of another, labeled data set. Because no labor cost is spent on annotation, learning for a specific scene requires only automatic data collection by a machine and automatic training on existing models and data to fit parameters adapted to that scene. Developing an efficient and robust unsupervised pedestrian re-identification system is therefore very attractive in both academia and industry.
In the early stages of the cross-domain unsupervised pedestrian re-identification task, generative methods based on domain translation were popular. These methods train a generative model to synthesize, from source domain images, images in the same style as the target domain images, and then use the real labels of the source domain images to convert the cross-domain problem into a semi-supervised problem within a single domain, which is then solved with semi-supervised training. However, methods based on generated images must first train a generative model, which introduces unavoidable errors; these errors are carried into subsequent training, causing irreparable performance loss and greatly reducing the recognition accuracy of the model. Such methods have therefore gradually lost their mainstream position in cross-domain unsupervised pedestrian re-identification in recent years and have been replaced by methods based on pre-training and fine-tuning. However, fine-tuning-based methods still suffer from problems such as inaccurate pseudo labels and amplified clustering errors, which also hurt recognition accuracy.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning. The technical problem to be solved by the invention is realized by the following technical scheme:
the invention provides an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning, which comprises the following steps:
Step 1: constructing two identical original convolutional neural networks, pre-training the two original convolutional neural networks with the source domain training set using different initialization parameters respectively to obtain a pre-trained first pre-training student model and second pre-training student model, and copying the first pre-training student model and the second pre-training student model respectively to obtain a corresponding first pre-training teacher model and second pre-training teacher model;
Step 2: constructing a picture feature memory bank, wherein the picture feature memory bank is used for storing the picture features and corresponding labels of the target domain training set;
Step 3: performing multiple rounds of target domain interactive supervised learning with the target domain training set on the first pre-training student model, the second pre-training student model, the first pre-training teacher model and the second pre-training teacher model until a preset learning termination condition is met, obtaining a first student model, a second student model, a first teacher model and a second teacher model after cross-domain learning;
for each round of target domain interactive supervised learning, the target domain training set is input into the first pre-training student model and the second pre-training student model, DBSCAN clustering is performed on the picture features extracted by either pre-training student model, and the picture features and labels in the picture feature memory bank are updated according to the clustering result;
during each round of target domain interactive supervised learning, the parameters of the first pre-training student model, the second pre-training student model, the first pre-training teacher model and the second pre-training teacher model are updated with the network total loss, where the network total loss comprises a hard and soft pseudo label combined supervision total loss and a contrastive learning total loss;
Step 4: identifying the target domain query sample with any model after cross-domain learning, finding the pictures with the same label in the target domain gallery picture set, and completing pedestrian re-identification.
In an embodiment of the present invention, the source domain training set is a labeled picture set, and the target domain training set is an unlabeled picture set.
In one embodiment of the present invention, the step 1 comprises:
Step 1.1: constructing two identical original convolutional neural networks, and adopting different initialization parameters for the two original convolutional neural networks respectively;
Step 1.2: performing multiple rounds of pre-training on the two original convolutional neural networks with the source domain training set respectively until a preset pre-training termination condition is met, obtaining a pre-trained first pre-training student model Net1 and second pre-training student model Net2;
for each round of pre-training, different random enhancement modes are applied to the images input into the two original convolutional neural networks respectively, and random dropout is applied to the output features of the two original convolutional neural networks respectively; in each round of pre-training, the two original convolutional neural networks update their network parameters by back-propagating the Log softmax loss and the triplet loss;
Step 1.3: copying the structures and parameters of the first pre-training student model Net1 and the second pre-training student model Net2 respectively to obtain a corresponding first pre-training teacher model Mean-Net1 and second pre-training teacher model Mean-Net2.
In an embodiment of the present invention, after the picture feature memory bank is constructed, initializing the picture feature memory bank comprises:
extracting features from the target domain training set through the first pre-training student model and the second pre-training student model respectively to obtain picture features and labels corresponding to each picture, and storing the picture features and labels into the picture feature memory bank to complete initialization, wherein the picture feature F = (F1 + F2)/2, where F1 denotes the first picture feature obtained by extracting the picture through the first pre-training student model, and F2 denotes the second picture feature obtained by extracting the picture through the second pre-training student model.
In an embodiment of the present invention, inputting the target domain training set into the first pre-trained student model and the second pre-trained student model, performing DBSCAN clustering on the picture features extracted by either pre-trained student model, and updating the picture features and labels in the picture feature memory bank according to the clustering result comprises:
inputting the target domain training set into the first and second pre-trained student models;
calculating the jaccard distance according to the picture features extracted by any one of the pre-trained student models;
performing DBSCAN clustering on the extracted picture features according to the jaccard distance;
calculating a clustering center of each clustering category, and distributing corresponding pseudo labels to the clustering categories;
and updating the picture features and labels in the picture feature memory bank according to the extracted picture features and the corresponding pseudo labels.
In one embodiment of the present invention, updating parameters of the first pre-trained student model, the second pre-trained student model, the first pre-trained teacher model, and the second pre-trained teacher model with a total loss of the network comprises:
according to the total loss of the network, performing parameter updates on the first pre-training student model and the second pre-training student model through gradient back propagation, and then performing parameter updates on the first pre-training teacher model and the second pre-training teacher model through EMA (exponential moving average) updates;
wherein the first pre-training teacher model and the second pre-training teacher model update their parameters according to the following formula:

$$E^{(T)}[\theta_i]=\alpha E^{(T-1)}[\theta_i]+(1-\alpha)\,\theta_i^{(T)},\qquad i\in\{1,2\}$$

where $E[\theta]$ denotes the accumulated average of the network parameter $\theta$, $T$ denotes the $T$-th round of target domain interactive supervised learning, $\theta_1$ denotes the current-round parameters of the first pre-training student model, $\theta_2$ denotes the current-round parameters of the second pre-training student model, and $\alpha$ denotes the smoothing coefficient hyperparameter.
In one embodiment of the invention, the hard and soft pseudo label combined supervision total loss comprises: the classification loss when supervising with hard pseudo labels, the classification loss when supervising with soft pseudo labels, the triplet loss when supervising with hard pseudo labels, and the triplet loss when supervising with soft pseudo labels;
the hard and soft pseudo label combined supervision total loss is obtained according to the following formula:

$$L(\theta_1,\theta_2)=(1-\lambda_{ce}^t)L_{ce}^t+\lambda_{ce}^t L_{sce}^t+(1-\lambda_{tri}^t)L_{tri}^t+\lambda_{tri}^t L_{stri}^t$$

where $t$ is the current training round, $\lambda_{ce}^t$ denotes the soft pseudo label classification loss coefficient, $\lambda_{tri}^t$ denotes the soft pseudo label triplet loss coefficient, $L_{ce}^t$ denotes the classification loss when supervising with hard pseudo labels, $L_{sce}^t$ denotes the classification loss when supervising with soft pseudo labels, $L_{tri}^t$ denotes the triplet loss when supervising with hard pseudo labels, and $L_{stri}^t$ denotes the triplet loss when supervising with soft pseudo labels;
wherein the classification loss function when supervising with hard pseudo labels is:

$$L_{ce}^t=\frac{1}{N_t}\sum_{i=1}^{N_t}L_{ce}\!\left(C\big(F(x_i^t)\big),\,\tilde{y}_i^t\right)$$

where $N_t$ denotes the number of pictures, $L_{ce}$ denotes the multi-class cross-entropy loss function, $C(F(x_i^t))$ denotes the classification result of the picture after feature extraction and classification by a pre-training student model, $\tilde{y}_i^t$ denotes the hard pseudo label of a picture in the target domain training set, and $x_i^t$ denotes a picture in the target domain training set;
the classification loss function when supervising with soft pseudo labels is:

$$L_{sce}^t(\theta_1\mid\theta_2)=-\frac{1}{N_t}\sum_{i=1}^{N_t}\mathcal{M}\big(x_i'^t;E[\theta_2]\big)\cdot\log\mathcal{M}\big(x_i^t;\theta_1\big)$$

$$L_{sce}^t(\theta_2\mid\theta_1)=-\frac{1}{N_t}\sum_{i=1}^{N_t}\mathcal{M}\big(x_i^t;E[\theta_1]\big)\cdot\log\mathcal{M}\big(x_i'^t;\theta_2\big)$$

where $\mathcal{M}(x_i^t;E[\theta_1])$ denotes the classification prediction of the first pre-training teacher model, $\mathcal{M}(x_i'^t;E[\theta_2])$ denotes the classification prediction of the second pre-training teacher model, $\mathcal{M}(x_i^t;\theta_1)$ denotes the classification result of the picture after feature extraction and classification by the first pre-training student model, $\mathcal{M}(x_i'^t;\theta_2)$ denotes the classification result of the picture after feature extraction and classification by the second pre-training student model, and $x_i'^t$ denotes $x_i^t$ under a different random data enhancement mode;
the triplet loss function when supervising with hard pseudo labels is:

$$L_{tri}^t=\frac{1}{N_t}\sum_{i=1}^{N_t}\max\Big(0,\;\big\|F(x_i^t)-F(x_{i,p}^t)\big\|+m-\big\|F(x_i^t)-F(x_{i,n}^t)\big\|\Big)$$

where $\|\cdot\|$ denotes the Euclidean distance, $x_{i,p}^t$ and $x_{i,n}^t$ denote a positive sample and a negative sample of $x_i^t$ respectively, $m$ denotes the margin hyperparameter, $F(x_i^t)$ denotes the feature of the anchor sample of the input picture, $F(x_{i,p}^t)$ denotes the feature of a positive sample picture of the input picture, and $F(x_{i,n}^t)$ denotes the feature of a negative sample picture of the input picture;
the triplet loss function when supervising with soft pseudo labels is:

$$\mathcal{T}_i(\theta)=\frac{\exp\big(\|F(x_i^t;\theta)-F(x_{i,n}^t;\theta)\|\big)}{\exp\big(\|F(x_i^t;\theta)-F(x_{i,p}^t;\theta)\|\big)+\exp\big(\|F(x_i^t;\theta)-F(x_{i,n}^t;\theta)\|\big)}$$

$$L_{stri}^t(\theta_1\mid\theta_2)=\frac{1}{N_t}\sum_{i=1}^{N_t}L_{bce}\Big(\mathcal{T}_i(\theta_1),\,\mathcal{T}_i\big(E[\theta_2]\big)\Big)$$

$$L_{stri}^t(\theta_2\mid\theta_1)=\frac{1}{N_t}\sum_{i=1}^{N_t}L_{bce}\Big(\mathcal{T}_i(\theta_2),\,\mathcal{T}_i\big(E[\theta_1]\big)\Big)$$

where $\mathcal{T}_i(\theta)$ is the softmax-triplet defined in the detailed description below, and $L_{bce}(p,q)=-q\log p-(1-q)\log(1-p)$ denotes the binary cross-entropy loss function used when supervising with soft pseudo labels.
In one embodiment of the present invention, the contrast learning total loss includes a global contrast loss and a local contrast loss, wherein,
the global contrast loss is calculated as follows:
$$L_{GM}=-\log\frac{\exp\big(\langle q_i,f_+\rangle/\tau\big)}{\sum_{c=1}^{N_c}\exp\big(\langle q_i,f_c\rangle/\tau\big)+\sum_{k=1}^{N_o}\exp\big(\langle q_i,f_k\rangle/\tau\big)}$$

where $N_c$ denotes the number of clustered sample features in the current memory bank, $N_o$ denotes the number of un-clustered sample features in the current memory bank, $q_i$ denotes the $i$-th feature of the currently input mini-batch of pictures, $f_+$ denotes the positive sample feature corresponding to $q_i$ in the picture feature memory bank, $\tau$ is the temperature hyperparameter, $\langle\cdot,\cdot\rangle$ denotes the inner product between two feature vectors measuring their similarity, $f_c$ denotes the features of the already clustered samples, and $f_k$ denotes the un-clustered sample features in the picture feature memory bank;
the local contrast loss is calculated as follows:

$$L_{LB}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\big(\langle q_i,q_+\rangle/\tau\big)}{\exp\big(\langle q_i,q_+\rangle/\tau\big)+\sum_{j:\,y_j\neq y_i}\exp\big(\langle q_i,q_j\rangle/\tau\big)}$$

where $y_i$ and $y_j$ denote the pseudo labels of the features $q_i$ and $q_j$ in the currently input mini-batch respectively, $q_+$ denotes the hardest positive sample of $q_i$ within the mini-batch, and $B$ denotes the number of pictures in the currently input mini-batch.
Compared with the prior art, the invention has the beneficial effects that:
1. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning of the invention is based on an interactive average learning framework and overcomes the large risk of model collapse when the initial pseudo labels are very noisy; the method can gradually reduce pseudo label noise, improve pseudo label quality and improve clustering accuracy, thereby improving the recognition precision of unsupervised cross-domain pedestrian re-identification.
2. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning is based on a global and local contrastive learning method: a memory bank is introduced and a hard sample mining strategy is adopted, preventing training errors from being amplified by noisy pseudo class labels; more reliable target domain clusters are gradually generated and used to learn better features in the hybrid memory, which improves clustering and its accuracy, and thus the recognition precision of unsupervised cross-domain pedestrian re-identification.
The foregoing is only an overview of the technical solutions of the present invention. In order to make the technical means of the present invention more clearly understood and implementable in accordance with the content of the description, and to make the above and other objects, features and advantages of the present invention more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a flowchart of an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a pre-training scheme provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of unsupervised cross-domain pedestrian re-recognition training based on interactive average learning according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of unsupervised cross-domain pedestrian re-recognition training based on hybrid contrast learning according to an embodiment of the present invention;
fig. 6 is a schematic view of visualization of a picture classification test according to an embodiment of the present invention.
Detailed Description
In order to further explain the technical means and effects of the present invention adopted to achieve the predetermined invention purpose, the following will explain in detail an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to the present invention with reference to the accompanying drawings and the detailed embodiments.
The foregoing and other technical contents, features and effects of the present invention will be more clearly understood from the following detailed description of the embodiments taken in conjunction with the accompanying drawings. While the present invention has been described in connection with the preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Example one
Referring to fig. 1 and fig. 2 in combination, fig. 1 is a flowchart of an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to an embodiment of the present invention; fig. 2 is a schematic diagram of an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to an embodiment of the present invention. As shown in the figure, the unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning in the embodiment includes:
step 1: constructing two same original convolutional neural networks, pre-training the two original convolutional neural networks by using a source domain training set respectively by adopting different initialization parameters to obtain a first pre-training student model and a second pre-training student model which are pre-trained, and copying the first pre-training student model and the second pre-training student model respectively to obtain a corresponding first pre-training teacher model and a corresponding second pre-training teacher model;
referring to the schematic diagram of pre-training shown in fig. 3, in an alternative embodiment, step 1 includes:
step 1.1: constructing two same original convolutional neural networks, and respectively adopting different initialization parameters for the two original convolutional neural networks;
optionally, the original convolutional neural network comprises: the system comprises an original twin neural network, an original characteristic fusion module, an original characteristic optimization module and an original cross-correlation module.
Optionally, the backbone network ResNet-ibn50 may be used as the original twin network structure, where ResNet-ibn50 is formed by adding an IBN module to the bottleneck layers on the basis of the ResNet-50 network; it should be noted that the present invention does not specifically limit the original twin network structure. ResNet-50 is composed of 16 convolution blocks, each containing three convolutional layers: the first convolutional layer of each block has a 1×1 kernel, the second a 3×3 kernel, and the third a 1×1 kernel. In addition, the ResNet-50 framework is a typical residual network; it overcomes the degradation problem in which accuracy saturates and then drops as network depth increases, and can extract deep features from the picture.
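For concreteness, the following is a minimal sketch of step 1.1 under stated assumptions: torchvision's plain ResNet-50 stands in for ResNet-ibn50 (the IBN variant is not in torchvision), and the two-output forward signature and the class count (751, the number of Market-1501 training identities mentioned later) are illustrative choices, not fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ReIDNet(nn.Module):
    """ResNet-50 trunk with the final fc replaced by a re-ID classifier.
    (A stand-in for ResNet-ibn50; the IBN blocks are omitted here.)"""
    def __init__(self, num_classes: int = 751, feat_dim: int = 2048):
        super().__init__()
        trunk = resnet50(weights=None)
        self.features = nn.Sequential(*list(trunk.children())[:-1])  # drop fc
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        f = self.features(x).flatten(1)   # (B, 2048) picture features
        return f, self.classifier(f)      # features and class logits

# step 1.1: two identical architectures, different random initializations
torch.manual_seed(0); net1 = ReIDNet()
torch.manual_seed(1); net2 = ReIDNet()
```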
Step 1.2: performing multiple rounds of pre-training on the two original convolutional neural networks with the source domain training set respectively until the preset pre-training termination condition is met, obtaining the pre-trained first pre-training student model Net1 and second pre-training student model Net2;
In this embodiment, by inputting the source domain training set into an original convolutional neural network and setting initialization parameters such as preset input parameters, training parameters, sample parameters, training period parameters, learning rate parameters, the loss function and the gradient descent function, the original convolutional neural network can be pre-trained with the goal of minimizing the loss function until the preset pre-training termination condition is reached, that is, the preset number of training periods is reached or the loss function value reaches a preset threshold, obtaining the pre-trained student model.
In an optional embodiment, for each round of pre-training, different random enhancement modes, such as random cropping, random flipping and random erasing, are applied to the images input into the two original convolutional neural networks respectively, and random dropout is applied to the output features of the two original convolutional neural networks respectively.
In this embodiment, the source domain training set is a set of labeled pictures. Optionally, each training mini-batch contains 64 person images of 16 real or pseudo identities, each identity comprising 4 images. All images are resized to 256 × 128 before being input to the network.
Illustratively, the sample parameter is 64, i.e. the mini-batch size is 64; the training period parameter is 80, i.e. training lasts 80 rounds; the first 10 rounds use a warmup learning rate, i.e. the learning rate increases linearly from 0.000035 to 0.00035 over the first 10 training rounds, is then held from round 10, and is multiplied by 0.1 at rounds 40 and 70 respectively. An Adam optimizer with a weight decay of 0.0005 is used for gradient-descent optimization of the network.
During each round of pre-training, the two original convolutional neural networks update their network parameters by back-propagating the Log softmax loss and the Triplet loss: the total loss $L_{pre}$ of each network is computed and back-propagated to update that network's parameters, yielding the pre-trained first pre-training student model Net1 and second pre-training student model Net2.
The total loss $L_{pre}$ is:

$$L_{pre}=\lambda_{ls}L_{ls}(x_i)+\lambda_t L(a,p,n)\qquad(1)$$

where $\lambda_{ls}$ and $\lambda_t$ denote the coefficients of the classification loss $L_{ls}(x_i)$ and the triplet loss $L(a,p,n)$ respectively, set from empirical values.
The Log softmax loss is equivalent to taking the logarithm of the softmax output:

$$L_{ls}(x_i)=\mathrm{LogSoftmax}(x_i)=\log\frac{\exp(x_i)}{\sum_j\exp(x_j)}\qquad(2)$$

where $x_i$ denotes an element of the input feature matrix and $x_j$ ranges over the elements of that matrix traversed by row/column.
The Triplet loss formula is:

$$L(a,p,n)=\max\big(d(a,p)-d(a,n)+margin,\,0\big)\qquad(3)$$

The input is a triplet comprising an anchor sample (a), a positive sample (p) and a negative sample (n): the positive sample and a belong to the same class, the negative sample and a belong to different classes, and margin is a constant greater than 0. The final optimization objective pulls a and p closer together while pushing a and n farther apart.
Step 1.3: copying the structures and parameters of the first pre-training student model Net1 and the second pre-training student model Net2 respectively to obtain the corresponding first pre-training teacher model Mean-Net1 and second pre-training teacher model Mean-Net2.
In this embodiment, setting up the first pre-training teacher model Mean-Net1 and the second pre-training teacher model Mean-Net2 corresponding to the first pre-training student model Net1 and the second pre-training student model Net2 realizes two groups of teacher-student models under the interactive average learning framework; mutual supervised learning is carried out between teacher and student models and between the different networks, which improves the quality of the pseudo labels, makes clustering more accurate, and improves classification precision.
Step 2: constructing a picture feature memory bank, wherein the picture feature memory bank is used for storing the picture features and corresponding labels of the target domain training set;
In this embodiment, the target domain training set is an unlabeled picture set.
It should be noted that after the picture feature memory bank is constructed, it is initialized; the initialization is performed only once, before target domain interactive supervised learning is carried out on the first pre-training student model, the second pre-training student model, the first pre-training teacher model and the second pre-training teacher model.
The specific initialization process comprises: extracting features from the target domain training set through the first pre-training student model and the second pre-training student model respectively to obtain the picture features and labels corresponding to each picture, and storing the picture features and labels into the picture feature memory bank to complete initialization, where the picture feature F = (F1 + F2)/2, F1 denotes the first picture feature obtained by extracting the picture through the first pre-training student model, and F2 denotes the second picture feature obtained by extracting the picture through the second pre-training student model.
In this embodiment, the target domain training set is input into the first pre-training student model Net1 and the second pre-training student model Net2 respectively and the features are averaged, which makes the features captured by the networks more balanced and the target domain instance features clearer, improving the stability of the system.
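A minimal sketch of this initialization, assuming each network returns (features, logits) as in the earlier sketch; the L2 normalization and the -1 placeholder for not-yet-assigned labels are illustrative implementation choices.

```python
import torch

@torch.no_grad()
def init_memory_bank(net1, net2, target_loader, device='cuda'):
    """Initialize the picture feature memory bank with F = (F1 + F2) / 2,
    the average of the two pre-trained students' features. Labels start
    unset; they are filled in after the first round of clustering."""
    net1.eval(); net2.eval()
    feats = []
    for imgs in target_loader:            # unlabeled target-domain pictures
        imgs = imgs.to(device)
        f1, _ = net1(imgs)                # first student's features  F1
        f2, _ = net2(imgs)                # second student's features F2
        f = (f1 + f2) / 2                 # averaged picture feature  F
        feats.append(torch.nn.functional.normalize(f, dim=1).cpu())
    memory = torch.cat(feats)             # (N_target, D) memory bank
    labels = torch.full((memory.size(0),), -1, dtype=torch.long)  # -1 = none
    return memory, labels
```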
Step 3: performing multiple rounds of target domain interactive supervised learning with the target domain training set on the first pre-training student model, the second pre-training student model, the first pre-training teacher model and the second pre-training teacher model until the preset learning termination condition is met, obtaining a first student model, a second student model, a first teacher model and a second teacher model after cross-domain learning;
in this embodiment, the predetermined learning termination condition is that a predetermined learning period is reached or the loss function value reaches a predetermined threshold.
For each round of target domain interactive supervised learning, the target domain training set is input into the first pre-training student model and the second pre-training student model, DBSCAN clustering is performed on the picture features extracted by either pre-training student model, and the picture features and labels in the picture feature memory bank are updated according to the clustering result. This specifically comprises the following steps:
step (1): in each round of target domain interactive supervised learning, inputting a target domain training set into a first pre-training student model and a second pre-training student model;
step (2): calculating the jaccard distance according to the picture features extracted by any one of the pre-training student models (the first pre-training student model or the second pre-training student model);
The jaccard distance is derived from the jaccard coefficient, which reflects the similarity between two vectors whose elements take the value 0 or 1. For vectors $\vec{a}$ and $\vec{b}$, define:

$M_{00}$: the number of positions where the element of $\vec{a}$ is 0 and the element of $\vec{b}$ is 0;
$M_{10}$: the number of positions where the element of $\vec{a}$ is 1 and the element of $\vec{b}$ is 0;
$M_{01}$: the number of positions where the element of $\vec{a}$ is 0 and the element of $\vec{b}$ is 1;
$M_{11}$: the number of positions where the element of $\vec{a}$ is 1 and the element of $\vec{b}$ is 1.

Then the jaccard coefficient can be expressed as:

$$J(\vec{a},\vec{b})=\frac{M_{11}}{M_{01}+M_{10}+M_{11}}$$

Likewise, another, set-based representation may be used:

$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$

It should be noted that the larger the jaccard coefficient, the higher the similarity; the jaccard distance used for clustering is $d_J=1-J$.
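A minimal sketch of the pairwise jaccard distance on binary vectors, following the definition above. In practice re-ID pipelines often compute this distance over k-reciprocal nearest-neighbor encodings of the extracted features, a detail the text does not specify; the vectors below are purely illustrative.

```python
import numpy as np

def jaccard_distance(a: np.ndarray, b: np.ndarray) -> float:
    """d_J = 1 - J, with J = M11 / (M01 + M10 + M11) as defined above."""
    a, b = a.astype(bool), b.astype(bool)
    m11 = np.sum(a & b)          # both 1
    m10 = np.sum(a & ~b)         # a = 1, b = 0
    m01 = np.sum(~a & b)         # a = 0, b = 1
    union = m01 + m10 + m11
    return 1.0 if union == 0 else 1.0 - m11 / union

# e.g. neighbourhood-indicator vectors of two pictures
a = np.array([1, 0, 1, 1, 0]); b = np.array([1, 1, 1, 0, 0])
print(jaccard_distance(a, b))    # 0.5: two shared neighbours out of four
```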
step (3): performing DBSCAN clustering on the extracted picture features according to the jaccard distance;
DBSCAN clustering is a density-based spatial clustering algorithm: it divides regions of sufficient density into clusters, can find clusters of arbitrary shape in a spatial database with noise, and defines a cluster as the largest set of density-connected points. It partitions the training data into a clustered sample set $X_c$ and an un-clustered outlier sample set $X_o$.
It should be noted that the DBSCAN clustering algorithm has two important hyperparameters: one determines the maximum distance between features that can be grouped into the same class (two features closer than this distance are considered neighbors), and the other determines the minimum number of neighbors required for a feature to be a cluster center. In this embodiment, these two hyperparameters are empirically set to 0.5 and 5 respectively.
step (4): calculating the clustering center of each cluster class and assigning a corresponding pseudo label to each cluster class;
In this embodiment, samples within the same cluster share the same pseudo label, while each un-clustered outlier sample is treated as an independent class and assigned its own pseudo label. It should be noted that these pseudo labels serve as the hard pseudo labels in the subsequent network loss calculation.
step (5): updating the picture features and labels in the picture feature memory bank according to the extracted picture features and the corresponding pseudo labels.
It should be noted that after the clustering centers are recalculated from each round's clustering result and corresponding pseudo labels are assigned, normalization is performed according to the new clustering centers and the resulting parameters are assigned to the network's classifier.
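As a concrete illustration of steps (3)-(5), the sketch below clusters memory-bank features with scikit-learn's DBSCAN over a precomputed jaccard distance matrix and assigns pseudo labels, giving each outlier its own class. The eps=0.5 and min_samples=5 values are the empirical hyperparameters stated above; the function name and the center normalization are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_and_label(feats: np.ndarray, jaccard_dist: np.ndarray):
    """Step (3): DBSCAN over the precomputed jaccard distance matrix.
    Step (4): clustered samples share a pseudo label; every outlier
    (DBSCAN noise label -1) becomes its own independent class."""
    labels = DBSCAN(eps=0.5, min_samples=5,
                    metric='precomputed').fit_predict(jaccard_dist)
    next_id = labels.max() + 1
    for i in np.where(labels == -1)[0]:
        labels[i] = next_id               # one fresh class per outlier
        next_id += 1
    # cluster centers (mean feature per pseudo class), normalized; used to
    # re-initialize the classifier and refresh the memory bank (step (5))
    centers = np.stack([feats[labels == c].mean(0) for c in np.unique(labels)])
    centers /= np.linalg.norm(centers, axis=1, keepdims=True) + 1e-12
    return labels, centers
```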
This embodiment makes full use of the sample data: not only are pseudo labels assigned to the large number of clustered samples, but the small number of un-clustered outlier samples are also given independent pseudo labels. This avoids outlier samples being given wrong pseudo labels and amplifying errors during training, improves the accuracy of pseudo label assignment and of clustering, and further improves classification accuracy.
In an alternative embodiment, parameters of the first pre-training student model, the second pre-training student model, the first pre-training teacher model and the second pre-training teacher model are updated by using total network loss during each round of target domain interactive supervised learning, as shown in fig. 4.
In this embodiment, the network total loss comprises the hard and soft pseudo label combined supervision total loss and the contrastive learning total loss, i.e.,

$$L_{total}=L(\theta_1,\theta_2)+L_{L2G}\qquad(6)$$

where $L_{total}$ denotes the network total loss, $L(\theta_1,\theta_2)$ denotes the hard and soft pseudo label combined supervision total loss, and $L_{L2G}$ denotes the contrastive learning total loss.
Optionally, when each round of target domain interactive supervised learning is performed, according to the total loss of the network, parameter updating is performed on the first pre-trained student model and the second pre-trained student model through gradient back propagation, and then parameter updating is performed on the first pre-trained teacher model and the second pre-trained teacher model through an EMA mode.
In the EMA update mode, the parameter $E[\theta]$ is the accumulated average of the corresponding network parameter $\theta$. Specifically, the teacher parameters are not updated by back-propagating the loss function; instead, after each back propagation of the total loss through the student networks, the first and second pre-training teacher models update their parameters according to the following formula:

$$E^{(T)}[\theta_i]=\alpha E^{(T-1)}[\theta_i]+(1-\alpha)\,\theta_i^{(T)},\qquad i\in\{1,2\}\qquad(7)$$

where $E[\theta]$ denotes the accumulated average of the network parameter $\theta$, $T$ denotes the $T$-th round of target domain interactive supervised learning, $\theta_1$ denotes the current-round parameters of the first pre-training student model, $\theta_2$ denotes the current-round parameters of the second pre-training student model, and $\alpha$ denotes the smoothing coefficient hyperparameter. At initialization, $E^{(0)}[\theta_1]=\theta_1$ and $E^{(0)}[\theta_2]=\theta_2$.
The EMA update mode can be regarded as averaging the network's past parameters; through it the two pre-training teacher models accumulate over time, so they are more decoupled and their outputs are more independent and complementary.
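A minimal sketch of the EMA update of equation (7). The value alpha = 0.999 is a common smoothing choice, not one fixed by the patent, and copying the buffers (e.g. batch-norm statistics) from the student is an implementation assumption.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha: float = 0.999):
    """Mean-teacher update, Eq. (7): E[θ] ← α·E[θ] + (1-α)·θ."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(alpha).add_(s, alpha=1.0 - alpha)
    for tb, sb in zip(teacher.buffers(), student.buffers()):
        tb.copy_(sb)

# after each back-propagation step on the students:
# ema_update(mean_net1, net1); ema_update(mean_net2, net2)
```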
In this embodiment, using the interactive average learning framework and teacher-student models, the two groups of teacher-student networks supervise each other both within and across groups, so the learned features and the classification criteria are more stable. Noisy pseudo labels can no longer amplify errors and are gradually corrected by the stable system, improving pedestrian recognition precision.
In an alternative embodiment, the hard and soft pseudo-tags in combination with the supervised total loss comprise: the loss of classification when using hard pseudo labels for supervision, the loss of classification when using soft pseudo labels for supervision, the loss of triplets when using hard pseudo labels for supervision and the loss of triplets when using soft pseudo labels for supervision.
Optionally, the hard and soft pseudo label combined supervision total loss is calculated according to the following formula:

$$L(\theta_1,\theta_2)=(1-\lambda_{ce}^t)L_{ce}^t+\lambda_{ce}^t L_{sce}^t+(1-\lambda_{tri}^t)L_{tri}^t+\lambda_{tri}^t L_{stri}^t\qquad(8)$$

where $t$ is the current training round, $\lambda_{ce}^t$ denotes the soft pseudo label classification loss coefficient, and $\lambda_{tri}^t$ denotes the soft pseudo label triplet loss coefficient; in this embodiment, $\lambda_{ce}^t$ and $\lambda_{tri}^t$ take the value 0.5. $L_{ce}^t$ denotes the classification loss when supervising with hard pseudo labels, $L_{sce}^t$ denotes the classification loss when supervising with soft pseudo labels, $L_{tri}^t$ denotes the triplet loss when supervising with hard pseudo labels, and $L_{stri}^t$ denotes the triplet loss when supervising with soft pseudo labels.
The classification loss when supervising with hard pseudo labels can be expressed by the general multi-class cross-entropy loss function $L_{ce}$. Specifically, the classification loss function when supervising with hard pseudo labels is:

$$L_{ce}^t=\frac{1}{N_t}\sum_{i=1}^{N_t}L_{ce}\!\left(C\big(F(x_i^t)\big),\,\tilde{y}_i^t\right)\qquad(9)$$

where $N_t$ denotes the number of pictures, $L_{ce}$ denotes the multi-class cross-entropy loss function, $C(F(x_i^t))$ denotes the classification result of the picture after feature extraction and classification by a pre-training student model, $\tilde{y}_i^t$ denotes the hard pseudo label of a picture in the target domain training set (the pseudo label assigned after clustering), and $x_i^t$ denotes a picture in the target domain training set.
Under the mutual average learning framework, the soft pseudo labels in the soft classification loss are the classification predictions $\mathcal{M}(\cdot;E[\theta])$ of the pre-training teacher models Mean-Net1/2; for the classification predictions, supervision uses the soft cross-entropy loss $-q\log p$ to reduce the distance between the two distributions. Specifically, the classification loss function when supervising with soft pseudo labels is:

$$L_{sce}^t(\theta_1\mid\theta_2)=-\frac{1}{N_t}\sum_{i=1}^{N_t}\mathcal{M}\big(x_i'^t;E[\theta_2]\big)\cdot\log\mathcal{M}\big(x_i^t;\theta_1\big)\qquad(10)$$

where $\mathcal{M}(x_i^t;E[\theta_1])$ denotes the classification prediction of the first pre-training teacher model, $\mathcal{M}(x_i'^t;E[\theta_2])$ denotes the classification prediction of the second pre-training teacher model, $\mathcal{M}(x_i^t;\theta_1)$ denotes the classification result of the picture after feature extraction and classification by the first pre-training student model, $\mathcal{M}(x_i'^t;\theta_2)$ denotes the classification result of the picture after feature extraction and classification by the second pre-training student model, and $x_i'^t$ denotes $x_i^t$ under a different random data enhancement mode; the symmetric loss $L_{sce}^t(\theta_2\mid\theta_1)$ is obtained by exchanging the roles of the two networks.
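A minimal sketch of equation (10); detaching the teacher logits reflects that the soft pseudo label supervises the student but is not itself optimized. The function name and the (features, logits) network signature are carried over from the earlier sketches.

```python
import torch
import torch.nn.functional as F

def soft_ce_loss(student_logits, teacher_logits):
    """Soft classification loss, Eq. (10): the peer teacher's class
    distribution supervises the student via soft cross-entropy -q·log p."""
    q = F.softmax(teacher_logits.detach(), dim=1)   # teacher soft pseudo label
    log_p = F.log_softmax(student_logits, dim=1)    # student log-prediction
    return -(q * log_p).sum(dim=1).mean()

# cross-supervision of the two branches, x and x_prime two augmentations:
# loss  = soft_ce_loss(net1(x)[1], mean_net2(x_prime)[1])
# loss += soft_ce_loss(net2(x_prime)[1], mean_net1(x)[1])
```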
The triplet loss function when supervising with hard pseudo labels is:

$$L_{tri}^t=\frac{1}{N_t}\sum_{i=1}^{N_t}\max\Big(0,\;\big\|F(x_i^t)-F(x_{i,p}^t)\big\|+m-\big\|F(x_i^t)-F(x_{i,n}^t)\big\|\Big)\qquad(11)$$

where $\|\cdot\|$ denotes the Euclidean distance, $x_{i,p}^t$ and $x_{i,n}^t$ denote a positive sample and a negative sample of $x_i^t$ respectively, $m$ denotes the margin hyperparameter, $F(x_i^t)$ denotes the feature of the anchor sample of the input picture, $F(x_{i,p}^t)$ denotes the feature of a positive sample picture of the input picture, and $F(x_{i,n}^t)$ denotes the feature of a negative sample picture of the input picture.
Illustratively, if an input mini-batch comprises 32 pictures of 4 identities, each identity comprises 8 pictures; for a given anchor sample, the other pictures of the same identity are its positive samples and the pictures of different identities are its negative samples. Each image in the input mini-batch is taken as the anchor sample in turn, the corresponding loss is calculated, and finally the 32 loss values are averaged to obtain the triplet loss when supervising with hard pseudo labels.
The triplet loss above uses hard triplet labels; refining the pseudo labels with a soft max-triplet label loss means applying reasonable soft pseudo labels and a corresponding soft triplet loss function on top of the triplet's image features to improve pseudo label quality. The softmax-triplet is used to represent the relationship between the features within a triplet:

$$\mathcal{T}_i(\theta)=\frac{\exp\big(\|F(x_i^t;\theta)-F(x_{i,n}^t;\theta)\|\big)}{\exp\big(\|F(x_i^t;\theta)-F(x_{i,p}^t;\theta)\|\big)+\exp\big(\|F(x_i^t;\theta)-F(x_{i,n}^t;\theta)\|\big)}\qquad(12)$$

The value of this expression lies in the range $[0,1]$.
At the same time, the supervision of the triplets is softened: the feature-distance ratio $\mathcal{T}_i(E^{(T)}[\theta])$ computed with the EMA-updated teacher replaces the hard pseudo label "1", and the softened value lies in $[0,1)$. Specifically, under the mutual average learning framework, the softmax-triplet computed from image features can be used as a "soft" pseudo label to supervise the training of triplets, and the triplet loss function when supervising with soft pseudo labels can be expressed as:

$$L_{stri}^t(\theta_1\mid\theta_2)=\frac{1}{N_t}\sum_{i=1}^{N_t}L_{bce}\Big(\mathcal{T}_i(\theta_1),\,\mathcal{T}_i\big(E^{(T)}[\theta_2]\big)\Big)\qquad(13)$$

where $L_{bce}(p,q)=-q\log p-(1-q)\log(1-p)$ denotes the binary cross-entropy loss function used when supervising with soft pseudo labels; the symmetric loss $L_{stri}^t(\theta_2\mid\theta_1)$ is defined analogously.
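A minimal sketch of equations (12)-(13). It uses the identity exp(d_an)/(exp(d_ap)+exp(d_an)) = sigmoid(d_an − d_ap); the index tensors locating each anchor's mined positive and negative within the batch are assumed given.

```python
import torch
import torch.nn.functional as F

def softmax_triplet(f_a, f_p, f_n):
    """Eq. (12): softmax over the anchor-negative vs anchor-positive
    Euclidean distances; values near 1 mean the negative is far away."""
    d_ap = (f_a - f_p).norm(dim=1)
    d_an = (f_a - f_n).norm(dim=1)
    return torch.sigmoid(d_an - d_ap)   # = exp(d_an) / (exp(d_ap) + exp(d_an))

def soft_triplet_loss(student_feats, teacher_feats, idx_p, idx_n):
    """Eq. (13): the teacher's softmax-triplet value replaces the hard
    label "1" and supervises the student via binary cross-entropy."""
    p = softmax_triplet(student_feats, student_feats[idx_p], student_feats[idx_n])
    with torch.no_grad():
        q = softmax_triplet(teacher_feats, teacher_feats[idx_p], teacher_feats[idx_n])
    return F.binary_cross_entropy(p.clamp(1e-6, 1 - 1e-6), q)
```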
In this embodiment, hard labels are softened into soft labels, the traditional classification loss and triplet loss are given soft-label counterparts, and the hard and soft pseudo labels are combined in the supervision total loss. This reduces the pseudo label noise of clustering-based unsupervised cross-domain pedestrian re-identification, improves pseudo label quality, better captures the global and local characteristics of the picture, and learns more discriminative person features, improving the accuracy of feature classification and thus of recognition.
As shown in fig. 5, in an alternative embodiment, the contrastive learning total loss comprises a global contrast loss and a local contrast loss, that is,
$$L_{L2G}=L_{LB}+L_{GM}\qquad(14)$$

where $L_{L2G}$ denotes the contrastive learning total loss, $L_{LB}$ denotes the local contrast loss, and $L_{GM}$ denotes the global contrast loss.
In this embodiment, the global contrast loss is computed with a hard sample mining strategy. For each training sample $x_i$ of the target domain training set, define $f_\theta(x_i)$ as its feature vector, abbreviated $f_i=f_\theta(x_i)$, $x_i\in X_o\cup X_c$, where $X_c$ is the set of already clustered samples and $X_o$ is the un-clustered outlier sample set. The global contrast loss $L_{GM}$ based on the dynamic memory bank is then defined as:

$$L_{GM}=-\log\frac{\exp\big(\langle q_i,f_+\rangle/\tau\big)}{\sum_{c=1}^{N_c}\exp\big(\langle q_i,f_c\rangle/\tau\big)+\sum_{k=1}^{N_o}\exp\big(\langle q_i,f_k\rangle/\tau\big)}\qquad(15)$$

where $N_c$ denotes the number of clustered sample features in the current memory bank, $N_o$ denotes the number of un-clustered sample features in the current memory bank, $q_i$ denotes the $i$-th feature of the currently input mini-batch of pictures, $f_+$ denotes the positive sample feature corresponding to $q_i$ in the picture feature memory bank, $\tau$ is the temperature hyperparameter, $\langle\cdot,\cdot\rangle$ denotes the inner product between two feature vectors measuring their similarity, $f_c$ denotes the mined features of the clustered classes, and $f_k$ denotes the un-clustered sample features in the picture feature memory bank.
If the current feature $q_i$ belongs to a cluster, then $f_+$ is the hardest positive sample of $q_i$ among the same-class samples in the picture feature memory bank, where the hardest positive sample is the feature within the same cluster that deviates farthest from $q_i$. Among the selected clustered samples, besides the one hardest positive chosen from $q_i$'s own class, the others are the hardest negative samples selected from each of the remaining $N_c-1$ classes, where a hardest negative is the feature within a cluster closest to $q_i$. All the un-clustered sample features $f_k$, $k\in\{1,\dots,N_o\}$, in the picture feature memory bank are then taken as negative samples. If $q_i$ is not a clustered sample feature, $f_+=f_k$ is set to the un-clustered sample feature corresponding to $q_i$ in the picture feature memory bank, and the hardest negatives are the features of all $N_c$ clustered classes closest to $q_i$. In this way all instance-level self-supervision signals can be fully exploited for self-contrastive learning. It should be noted that $q$ denotes features of the currently input mini-batch and $f$ denotes features in the picture feature memory bank.
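A hedged sketch of the global contrast loss with the hardest-sample selection just described. The per-sample Python loop, the temperature value of 0.05, and the handling of an un-clustered $q_i$ (its own memory entry is taken to be its most similar outlier feature) are simplifications for clarity, not the patent's implementation; features are assumed L2-normalised.

```python
import torch

def global_contrast_loss(q, q_labels, mem, mem_labels, tau=0.05):
    """Eq. (15) with hard mining. q: (B, D) mini-batch features with pseudo
    labels q_labels (-1 = outlier); mem/mem_labels: the memory bank."""
    sim = q @ mem.t()                       # (B, N) inner products <q_i, f>
    clustered = mem_labels >= 0
    cluster_ids = mem_labels[clustered].unique()
    losses = []
    for i in range(q.size(0)):
        terms, pos = [], None
        for c in cluster_ids:
            s = sim[i, mem_labels == c]
            if c == q_labels[i]:
                pos = s.min()               # hardest positive: farthest same-cluster feature
            terms.append(s.min() if c == q_labels[i] else s.max())  # hardest negative per cluster
        out = sim[i, ~clustered]            # all un-clustered memory features
        if pos is None:                     # q_i is itself an outlier:
            pos = out.max()                 #   its own memory feature (assumption)
        terms.append(out)
        logits = torch.cat([t.reshape(-1) for t in terms]) / tau
        losses.append(torch.logsumexp(logits, 0) - pos / tau)  # -log softmax(pos)
    return torch.stack(losses).mean()
```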
In this embodiment, the local contrast loss is also calculated with the hard sample mining strategy. Analogously to the global contrast loss, the local contrast loss $L_{LB}$ is:

$$L_{LB}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\big(\langle q_i,q_+\rangle/\tau\big)}{\exp\big(\langle q_i,q_+\rangle/\tau\big)+\sum_{j:\,y_j\neq y_i}\exp\big(\langle q_i,q_j\rangle/\tau\big)}\qquad(16)$$

where $y_i$ and $y_j$ denote the pseudo labels of the features $q_i$ and $q_j$ in the currently input mini-batch respectively, $q_+$ denotes the hardest positive sample of $q_i$ within the mini-batch, and $B$ denotes the number of pictures in the currently input mini-batch.
It should be noted that the local contrast loss $L_{LB}$ operates inside the currently input mini-batch of pictures; it differs from the conventional contrastive loss in the way positive and negative sample features are selected: they are chosen from the currently input mini-batch with the hardest sample mining strategy.
In this embodiment, the hard sample mining strategy is applied to both the global and the local contrastive learning losses: for each sample, the farthest feature of the same class is selected as the positive example and the closest features of different classes are selected as negative examples. During supervised learning, the mined hardest sample features dynamically update the instance features stored in the picture feature memory bank; better features are learned from the memory bank, and clustering is improved by the hard sample mining strategy.
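A matching sketch of the local contrast loss of equation (16) inside one mini-batch; skipping anchors whose pseudo class has no other in-batch member is an assumption made for the example.

```python
import torch

def local_contrast_loss(q, y, tau=0.05):
    """Eq. (16): each feature's hardest positive (farthest same-label
    feature) is contrasted against its different-label batch features."""
    sim = q @ q.t()                                    # (B, B) similarities
    same = y.unsqueeze(0) == y.unsqueeze(1)
    eye = torch.eye(len(y), dtype=torch.bool, device=q.device)
    losses = []
    for i in range(len(y)):
        pos_mask = same[i] & ~eye[i]
        if not pos_mask.any():                         # outlier: no in-batch positive
            continue
        pos = sim[i, pos_mask].min() / tau             # hardest positive
        neg = sim[i, ~same[i]] / tau                   # different-label features
        logits = torch.cat([pos.view(1), neg])
        losses.append(torch.logsumexp(logits, 0) - pos)
    return torch.stack(losses).mean() if losses else q.sum() * 0
```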
In the target domain interactive supervised learning process, the mined hardest sample features dynamically update the instance features stored in the picture feature memory bank. In the local-to-global contrastive learning process, the whole target domain training set can be used to mine, at the cluster level, the most valuable and most informative training examples, and training errors are prevented from being amplified by wrong clustering during the whole optimization of the model; this maintains the stability and effectiveness of the training process, yields higher robustness, and also improves recognition accuracy.
Step 4: identifying the target domain query sample with any model after cross-domain learning, finding the pictures with the same label in the target domain gallery picture set, and completing pedestrian re-identification.
In this embodiment, the target domain picture set is the gallery, used to match the identities of the query set (the target domain query samples).
Please refer to the visualization of the picture classification test shown in fig. 6, where the first column is the target domain query sample and the remaining 5 columns are the pictures with the same label in the target domain gallery picture set.
The unsupervised cross-domain pedestrian re-identification method is verified based on experiments, and the specific experimental parameters are as follows:
Database: evaluation was performed on two widely used person re-identification data sets, Market-1501 and DukeMTMC-reID. The Market-1501 data set consists of 32,668 annotated images of 1,501 identities captured by 6 cameras, with 12,936 images of 751 identities used for training and 19,732 images of 750 identities in the test set. DukeMTMC-reID contains 16,522 person images of the 702 identities used for training, with the remaining images of another 702 identities used for testing; all of its images were collected from 8 cameras.
Evaluation criteria: the evaluation criteria used in this example are the Cumulative Matching Characteristic (CMC) and mean average precision (mAP). CMC is usually reported as Rank-1, Rank-5 and so on, and reflects retrieval accuracy: the Rank-n recognition rate is the proportion of test samples for which a correct match appears within the top n results under the given similarity matching rule. mAP is the mean, over all queries, of the average precision of the returned ranking.
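For reference, a compact sketch of how Rank-n and mAP are typically computed from a query-gallery distance matrix; the standard Market-1501 protocol additionally filters out same-camera gallery entries, which is omitted here for brevity, and every query is assumed to have at least one gallery match.

```python
import numpy as np

def rank_n_and_map(dist, q_ids, g_ids):
    """dist: (Q, G) query-gallery distances; q_ids/g_ids: identity labels.
    Returns Rank-1, Rank-5 (needs >= 5 gallery entries) and mAP."""
    ranks = np.zeros(len(g_ids)); aps = []
    for i in range(len(q_ids)):
        order = np.argsort(dist[i])                 # gallery sorted by distance
        match = g_ids[order] == q_ids[i]
        ranks[np.argmax(match):] += 1               # CMC: hit from first match on
        hits = np.cumsum(match)                     # average precision of ranking
        prec = hits[match] / (np.flatnonzero(match) + 1)
        aps.append(prec.mean())
    cmc = ranks / len(q_ids)
    return cmc[0], cmc[4], float(np.mean(aps))      # Rank-1, Rank-5, mAP
```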
The results of the experiments are shown in tables 1-6.
TABLE 1 pedestrian re-identification accuracy (target domain Market-1501, batch size = 16)

TABLE 2 pedestrian re-identification accuracy (target domain Market-1501, batch size = 32)

TABLE 3 pedestrian re-identification accuracy (target domain Market-1501, batch size = 64)

[The numerical entries of Tables 1-3 are embedded as images in the original document and cannot be recovered here.]
Comparing Tables 1, 2 and 3 shows that the accuracy of each model improves to some extent as the batch size increases, with the best effect at batch size = 64. Because hybrid contrastive learning needs a large batch size, its effect is not significant at the small batch size = 16, but becomes significant as the batch size grows to 64. Each table shows that applying interactive average learning to the pre-trained model greatly improves performance, and that introducing hybrid contrastive learning improves it further. Taking batch size = 64, which achieves the best effect, as an example: compared with the pre-trained model, the interactive average learning framework brings accuracy improvements of 47.3%/29.7%/20.3% in mAP/rank-1/rank-5 respectively, and on this basis hybrid contrastive learning still brings improvements of 1.8%/1.3%/0.3% in mAP/rank-1/rank-5 respectively. This demonstrates the excellent effect of the interactive average learning framework and hybrid contrastive learning on unsupervised cross-domain pedestrian re-identification.
TABLE 4 pedestrian re-identification accuracy (target domain DukeMTMC-reID, batch size = 16)

TABLE 5 pedestrian re-identification accuracy (target domain DukeMTMC-reID, batch size = 32)

TABLE 6 pedestrian re-identification accuracy (target domain DukeMTMC-reID, batch size = 64)

[The numerical entries of Tables 4-6 are embedded as images in the original document and cannot be recovered here.]
Tables 4, 5 and 6 show the results of experiments with the source domain and target domain of Tables 1, 2 and 3 exchanged. Similarly, the accuracy of each model improves to some extent as the batch size increases, with the best effect at batch size = 64. Again, because hybrid contrastive learning needs a large batch size, its effect is not significant at the small batch size = 16 but becomes significant as the batch size grows to 64. Taking batch size = 64, which achieves the best effect, as an example: compared with the pre-trained model, the interactive average learning framework brings accuracy improvements of 41.7%/36.0%/28.9% in mAP/rank-1/rank-5 respectively, and on this basis hybrid contrastive learning still brings improvements of 2.6%/2.8%/0.6% in mAP/rank-1/rank-5 respectively. This again demonstrates the excellent effect of the interactive average learning framework and hybrid contrastive learning on unsupervised cross-domain pedestrian re-identification.
The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning of the invention is based on the interactive average learning framework and overcomes the large risk of model collapse when the initial pseudo labels are very noisy; the method can gradually reduce pseudo label noise, improve pseudo label quality and improve clustering accuracy, thereby improving the recognition precision of unsupervised cross-domain pedestrian re-identification. In addition, based on the global and local contrastive learning method, a memory bank is introduced and a hard sample mining strategy is adopted, which prevents training errors from being amplified by noisy pseudo labels; more reliable target domain clusters are gradually generated and used to learn better features in the hybrid memory, improving clustering and its accuracy, and thus the recognition precision of unsupervised cross-domain pedestrian re-identification.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that an article or device that comprises a list of elements does not include only those elements but may include other elements not expressly listed. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the article or device comprising that element.
The foregoing is a detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention is not to be considered limited to these descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions may be made without departing from the spirit of the invention, and all of these shall be considered as falling within the protection scope of the invention.

Claims (8)

1. An unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning is characterized by comprising the following steps:
step 1: constructing two identical original convolutional neural networks, pre-training the two original convolutional neural networks on a source-domain training set with different initialization parameters to obtain a first pre-trained student model and a second pre-trained student model, and copying them respectively to obtain a corresponding first pre-trained teacher model and a corresponding second pre-trained teacher model;
step 2: constructing a picture feature memory library, wherein the picture feature memory library is used for storing the picture features and corresponding labels of a target-domain training set;
step 3: performing multiple rounds of target-domain interactive supervised learning on the first pre-trained student model, the second pre-trained student model, the first pre-trained teacher model and the second pre-trained teacher model by using the target-domain training set until a preset learning termination condition is met, obtaining a first student model, a second student model, a first teacher model and a second teacher model after cross-domain learning;
the method comprises the steps that a target domain training set is input into a first pre-training student model and a second pre-training student model aiming at each round of target domain interactive supervised learning, picture features extracted by any one pre-training student model are subjected to DBSCAN clustering, and the picture features and labels in a picture feature memory library are updated according to clustering results;
during each round of target-domain interactive supervised learning, the parameters of the first pre-trained student model, the second pre-trained student model, the first pre-trained teacher model and the second pre-trained teacher model are updated with the network total loss, wherein the network total loss comprises a hard-and-soft pseudo-label combined supervision total loss and a contrast learning total loss;
step 4: identifying the target-domain query sample by using any model after cross-domain learning, finding out the pictures with the same label in the target-domain picture set, and completing pedestrian re-identification.
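By way of illustration only, the following is a minimal runnable sketch of how steps 1 to 4 fit together; PyTorch is assumed, toy linear layers stand in for the convolutional networks, and all names, dimensions and round counts are hypothetical placeholders rather than the claimed implementation (per-step details appear in the sketches under the later claims).

    import copy
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    feat_dim, n_target = 32, 100

    # Step 1: two identical networks with different initializations; source-domain
    # pre-training and the exact backbone are omitted (toy linear layers used here).
    net1, net2 = nn.Linear(feat_dim, feat_dim), nn.Linear(feat_dim, feat_dim)
    teacher1, teacher2 = copy.deepcopy(net1), copy.deepcopy(net2)  # teacher copies

    # Step 2: picture feature memory library for the unlabeled target domain.
    memory_feat = torch.zeros(n_target, feat_dim)
    memory_label = torch.full((n_target,), -1, dtype=torch.long)

    # Step 3: each round re-clusters, refreshes the memory library, then updates models.
    target_pics = torch.randn(n_target, feat_dim)  # stand-in for target-domain images
    for round_idx in range(2):                     # toy number of rounds
        with torch.no_grad():
            memory_feat = net1(target_pics)        # features from one student model
        # DBSCAN pseudo-labeling and the hard/soft pseudo-label and contrast losses
        # of claims 5-8 would be computed here to update all four models.
        memory_label = torch.zeros(n_target, dtype=torch.long)  # placeholder labels

    # Step 4: retrieval ranks the picture set by feature distance to the query.
    query_feat = net1(torch.randn(1, feat_dim))
    ranking = torch.argsort((memory_feat - query_feat).norm(dim=1))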
2. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning as claimed in claim 1, wherein the source domain training set is a labeled picture set, and the target domain training set is an unlabeled picture set.
3. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to claim 2, wherein the step 1 comprises:
step 1.1: constructing two identical original convolutional neural networks, and adopting different initialization parameters for the two original convolutional neural networks respectively;
step 1.2: performing multiple rounds of pre-training on the two original convolutional neural networks respectively by using a source-domain training set until a preset pre-training termination condition is met, obtaining a pre-trained first student model Net1 and a pre-trained second student model Net2;
wherein, in each round of pre-training, different random enhancement modes are applied to the images input into the two original convolutional neural networks, random dropout is applied to the output features of the two original convolutional neural networks respectively, and both networks update their parameters by back-propagating a log-softmax loss and a triplet loss;
step 1.3: copying the structures and parameters of the first pre-trained student model Net1 and the second pre-trained student model Net2 respectively to obtain a corresponding first pre-trained teacher model Mean-Net1 and a corresponding second pre-trained teacher model Mean-Net2.
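As a hedged illustration of this pre-training step, the sketch below (PyTorch assumed; the tiny backbone, input size, identity count and fixed triplet slicing are simplifications, since a real implementation mines positive and negative samples by identity) combines a log-softmax classification loss with a triplet loss and applies random dropout to the output features.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    num_ids, feat_dim = 751, 128               # identity count is a placeholder value
    backbone = nn.Sequential(nn.Flatten(),
                             nn.Linear(3 * 64 * 32, feat_dim),
                             nn.Dropout(p=0.5))  # random dropout on output features
    classifier = nn.Linear(feat_dim, num_ids)

    x = torch.randn(15, 3, 64, 32)             # one randomly augmented source-domain mini-batch
    y = torch.randint(0, num_ids, (15,))
    feats = backbone(x)
    logits = classifier(feats)

    id_loss = F.nll_loss(F.log_softmax(logits, dim=1), y)  # log-softmax classification loss
    # Toy triplet slicing; real training pairs anchors with same/different identities.
    tri_loss = nn.TripletMarginLoss(margin=0.5)(feats[:5], feats[5:10], feats[10:15])
    (id_loss + tri_loss).backward()            # back-propagate both losses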
4. The unsupervised cross-domain pedestrian re-recognition method based on clustering and multi-scale learning according to claim 3, wherein after the picture feature memory library is constructed, the picture feature memory library is initialized, and the method comprises the following steps:
extracting features from the target-domain training set through the first pre-trained student model and the second pre-trained student model respectively to obtain the picture feature and label corresponding to each picture, and storing the picture features and labels into the picture feature memory library to complete initialization, wherein the picture feature F = (F1 + F2)/2, F1 represents the first picture feature obtained by extracting the picture through the first pre-trained student model, and F2 represents the second picture feature obtained by extracting the picture through the second pre-trained student model.
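A minimal sketch of this initialization, assuming the two students' features have already been extracted into two tensors (names and dimensions are illustrative, not from the source):

    import torch

    torch.manual_seed(0)
    n_pics, feat_dim = 100, 128
    f1 = torch.randn(n_pics, feat_dim)  # features from the first pre-trained student model
    f2 = torch.randn(n_pics, feat_dim)  # features from the second pre-trained student model

    memory_features = (f1 + f2) / 2     # F = (F1 + F2) / 2, as in the claim
    memory_labels = torch.full((n_pics,), -1, dtype=torch.long)  # assigned later by clustering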
5. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning as claimed in claim 4, wherein the step of inputting the target domain training set into the first pre-trained student model and the second pre-trained student model, performing DBSCAN clustering on picture features extracted by any one pre-trained student model, and updating the picture features and labels in the picture feature memory base according to the clustering result comprises:
inputting the target domain training set into the first and second pre-trained student models;
calculating the Jaccard distance according to the picture features extracted by either pre-trained student model;
performing DBSCAN clustering on the extracted picture features according to the Jaccard distance;
calculating a clustering center of each clustering category, and distributing corresponding pseudo labels to the clustering categories;
and updating the picture features and the labels in the picture feature memory library according to the extracted picture features and the corresponding pseudo labels.
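For illustration, a runnable sketch of this clustering step using scikit-learn's DBSCAN with a precomputed distance matrix; cosine distance is used here purely as a stand-in for the Jaccard distance the claim specifies (in re-identification practice the Jaccard distance is typically derived from k-reciprocal neighbor sets), and eps/min_samples are placeholder values.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import pairwise_distances

    rng = np.random.default_rng(0)
    feats = rng.random((200, 128)).astype(np.float32)      # stand-in for extracted picture features
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # L2-normalize

    dist = pairwise_distances(feats, metric="cosine")      # stand-in for the Jaccard distance
    labels = DBSCAN(eps=0.6, min_samples=4, metric="precomputed").fit_predict(dist)

    # Label -1 marks un-clustered outliers; each cluster id serves as a pseudo label,
    # and per-cluster mean features serve as cluster centers.
    centers = {c: feats[labels == c].mean(axis=0) for c in set(labels.tolist()) if c != -1}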
6. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning as claimed in claim 5, wherein updating the parameters of the first pre-trained student model, the second pre-trained student model, the first pre-trained teacher model and the second pre-trained teacher model with network total loss comprises:
according to the network total loss, updating the parameters of the first pre-trained student model and the second pre-trained student model through gradient back-propagation, and then updating the parameters of the first pre-trained teacher model and the second pre-trained teacher model through an exponential moving average (EMA);
wherein the parameters of the first pre-trained teacher model and the second pre-trained teacher model are updated according to the following formula:
$$E^{(T)}[\theta_1]=\alpha E^{(T-1)}[\theta_1]+(1-\alpha)\,\theta_1^{(T)},\qquad E^{(T)}[\theta_2]=\alpha E^{(T-1)}[\theta_2]+(1-\alpha)\,\theta_2^{(T)}$$
where $E[\theta]$ represents the accumulated average of the network parameters $\theta$, $T$ represents the $T$-th round of target-domain interactive supervised learning, $\theta_1$ represents the current-round parameters of the first pre-trained student model, $\theta_2$ represents the current-round parameters of the second pre-trained student model, and $\alpha$ represents the smoothing coefficient hyperparameter.
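A minimal runnable sketch of this EMA update (PyTorch assumed; the linear layers and the value of alpha are illustrative):

    import torch
    import torch.nn as nn

    def ema_update(teacher: nn.Module, student: nn.Module, alpha: float = 0.999) -> None:
        """Parameter-wise E[theta] <- alpha * E[theta] + (1 - alpha) * theta."""
        with torch.no_grad():
            for t_p, s_p in zip(teacher.parameters(), student.parameters()):
                t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

    student = nn.Linear(8, 4)
    teacher = nn.Linear(8, 4)
    teacher.load_state_dict(student.state_dict())  # teacher starts as a copy of the student
    ema_update(teacher, student, alpha=0.999)      # called after each student gradient step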
7. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to claim 6, wherein the hard-and-soft pseudo-label combined supervision total loss comprises: the classification loss when supervising with hard pseudo labels, the classification loss when supervising with soft pseudo labels, the triplet loss when supervising with hard pseudo labels, and the triplet loss when supervising with soft pseudo labels;
the hard-and-soft pseudo-label combined supervision total loss is obtained according to the following formula:
$$\mathcal{L}(\theta_1,\theta_2)=(1-\lambda^t_{id})\big(\mathcal{L}^t_{id}(\theta_1)+\mathcal{L}^t_{id}(\theta_2)\big)+\lambda^t_{id}\big(\mathcal{L}^t_{sid}(\theta_1\mid\theta_2)+\mathcal{L}^t_{sid}(\theta_2\mid\theta_1)\big)+(1-\lambda^t_{tri})\big(\mathcal{L}^t_{tri}(\theta_1)+\mathcal{L}^t_{tri}(\theta_2)\big)+\lambda^t_{tri}\big(\mathcal{L}^t_{stri}(\theta_1\mid\theta_2)+\mathcal{L}^t_{stri}(\theta_2\mid\theta_1)\big)$$
where $t$ is the current training round, $\lambda^t_{id}$ represents the soft pseudo-label classification loss coefficient, $\lambda^t_{tri}$ represents the soft pseudo-label triplet loss coefficient, $\mathcal{L}^t_{id}$ represents the classification loss when supervising with hard pseudo labels, $\mathcal{L}^t_{sid}$ represents the classification loss when supervising with soft pseudo labels, $\mathcal{L}^t_{tri}$ represents the triplet loss when supervising with hard pseudo labels, and $\mathcal{L}^t_{stri}$ represents the triplet loss when supervising with soft pseudo labels;
wherein the classification loss when supervising with hard pseudo labels is:
$$\mathcal{L}^t_{id}(\theta)=\frac{1}{N_t}\sum_{i=1}^{N_t} L_{ce}\big(C^t(F(x^t_i;\theta)),\ \tilde{y}^t_i\big)$$
where $N_t$ represents the number of pictures, $L_{ce}$ represents the multi-class cross-entropy loss function, $C^t(F(x^t_i;\theta))$ represents the classification result of a picture after feature extraction and classification by a pre-trained student model, $\tilde{y}^t_i$ represents the hard pseudo label of a picture in the target-domain training set, and $x^t_i$ represents a picture in the target-domain training set;
the classification loss when supervising with soft pseudo labels is:
$$\mathcal{L}^t_{sid}(\theta_1\mid\theta_2)=-\frac{1}{N_t}\sum_{i=1}^{N_t} C^t_2\big(F(x'^t_i;E^{(T)}[\theta_2])\big)\cdot\log C^t_1\big(F(x^t_i;\theta_1)\big)$$
$$\mathcal{L}^t_{sid}(\theta_2\mid\theta_1)=-\frac{1}{N_t}\sum_{i=1}^{N_t} C^t_1\big(F(x'^t_i;E^{(T)}[\theta_1])\big)\cdot\log C^t_2\big(F(x^t_i;\theta_2)\big)$$
where $C^t_1(F(x'^t_i;E^{(T)}[\theta_1]))$ represents the classification prediction value of the first pre-trained teacher model, $C^t_2(F(x'^t_i;E^{(T)}[\theta_2]))$ represents the classification prediction value of the second pre-trained teacher model, $C^t_1(F(x^t_i;\theta_1))$ represents the classification result of a picture after feature extraction and classification by the first pre-trained student model, $C^t_2(F(x^t_i;\theta_2))$ represents the classification result of a picture after feature extraction and classification by the second pre-trained student model, and $x'^t_i$ represents $x^t_i$ with a different random data enhancement mode applied;
the triplet loss when supervising with hard pseudo labels is:
$$\mathcal{L}^t_{tri}(\theta)=\frac{1}{N_t}\sum_{i=1}^{N_t}\max\Big(0,\ \big\|F(x^t_i;\theta)-F(x^t_{i,p};\theta)\big\|+m-\big\|F(x^t_i;\theta)-F(x^t_{i,n};\theta)\big\|\Big)$$
where $\|\cdot\|$ represents the Euclidean distance, $x^t_{i,p}$ and $x^t_{i,n}$ respectively represent a positive-sample picture and a negative-sample picture of $x^t_i$, $m$ represents the margin hyperparameter, $F(x^t_i;\theta)$ represents the feature of the anchor sample of the input picture, $F(x^t_{i,p};\theta)$ represents the feature of the positive-sample picture of the input picture, and $F(x^t_{i,n};\theta)$ represents the feature of the negative-sample picture of the input picture;
the triplet loss when supervising with soft pseudo labels is:
$$\mathcal{T}_i(\theta)=\frac{\exp\big(\|F(x^t_i;\theta)-F(x^t_{i,n};\theta)\|\big)}{\exp\big(\|F(x^t_i;\theta)-F(x^t_{i,p};\theta)\|\big)+\exp\big(\|F(x^t_i;\theta)-F(x^t_{i,n};\theta)\|\big)}$$
$$\mathcal{L}^t_{stri}(\theta_1\mid\theta_2)=\frac{1}{N_t}\sum_{i=1}^{N_t} L_{bce}\big(\mathcal{T}_i(\theta_1),\ \mathcal{T}_i(E^{(T)}[\theta_2])\big)$$
$$\mathcal{L}^t_{stri}(\theta_2\mid\theta_1)=\frac{1}{N_t}\sum_{i=1}^{N_t} L_{bce}\big(\mathcal{T}_i(\theta_2),\ \mathcal{T}_i(E^{(T)}[\theta_1])\big)$$
where $L_{bce}(p,q)=-q\log p-(1-q)\log(1-p)$ represents the binary cross-entropy loss function.
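For illustration, a runnable sketch (PyTorch assumed; all tensors are random placeholders rather than real model outputs) of the two soft-label terms: the soft classification loss as a cross entropy against the peer teacher's predictions, and the soft triplet loss as a binary cross entropy between the student's and teacher's softmax-triplet statistics.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    n, c = 16, 10                                  # toy batch size and pseudo-class count
    student_logits = torch.randn(n, c, requires_grad=True)
    teacher_logits = torch.randn(n, c)             # from the peer teacher model (no gradient)

    # Soft classification loss: -(1/N) * sum( teacher_prob * log student_prob ).
    soft_id = -(teacher_logits.softmax(dim=1)
                * student_logits.log_softmax(dim=1)).sum(dim=1).mean()

    # Softmax-triplet statistic T = exp(d_an) / (exp(d_ap) + exp(d_an)), student and teacher.
    d_ap_s, d_an_s = torch.rand(n), torch.rand(n)  # student anchor-positive / anchor-negative distances
    d_ap_t, d_an_t = torch.rand(n), torch.rand(n)  # the same distances under the teacher
    t_student = torch.exp(d_an_s) / (torch.exp(d_ap_s) + torch.exp(d_an_s))
    t_teacher = torch.exp(d_an_t) / (torch.exp(d_ap_t) + torch.exp(d_an_t))

    # Soft triplet loss: binary cross entropy with the teacher's statistic as soft target.
    soft_tri = F.binary_cross_entropy(t_student, t_teacher)
    (soft_id + soft_tri).backward()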
8. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning of claim 7, wherein the contrast learning total loss comprises a global contrast loss and a local contrast loss, wherein,
the global contrast loss is calculated as follows:
$$\mathcal{L}^{global}_{q_i}=-\log\frac{\exp\big(\langle q_i,\,f^{+}\rangle/\tau\big)}{\sum_{k=1}^{N_c}\exp\big(\langle q_i,\,f^{*}_{c_k}\rangle/\tau\big)+\sum_{k=1}^{N_0}\exp\big(\langle q_i,\,f_k\rangle/\tau\big)}$$
where $N_c$ represents the number of clustered sample features in the current memory library, $N_0$ represents the number of un-clustered sample features in the current memory library, $q_i$ represents the $i$-th feature of the currently input mini-batch of pictures, $f^{+}$ represents the positive class prototype of feature $q_i$ in the picture feature memory library, $\tau$ is the temperature hyperparameter, $\langle\cdot,\cdot\rangle$ represents the inner product between two feature vectors measuring their similarity, $f^{*}_{c_k}$ represents the clustered sample features, and $f_k$ represents all un-clustered sample features in the picture feature memory library;
the local contrast loss is calculated as follows:
[local contrast loss formula rendered as an image in the original publication; not reproduced]
where $y_i$ and $y_j$ respectively represent the labels of features $q_i$ and $q_j$ in the currently input mini-batch of pictures, and $B$ represents the number of pictures in the currently input mini-batch.
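A runnable sketch of the global contrast loss over a memory library holding both clustered class centers and un-clustered instance features (PyTorch assumed; the sizes, tau and the random positive assignments are placeholders, and every query is assumed to belong to some cluster).

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    tau, d = 0.05, 128
    q = F.normalize(torch.randn(32, d, requires_grad=True), dim=1)  # mini-batch features q_i
    centroids = F.normalize(torch.randn(50, d), dim=1)  # clustered class centers (N_c = 50)
    outliers = F.normalize(torch.randn(20, d), dim=1)   # un-clustered instance features (N_0 = 20)
    pos_idx = torch.randint(0, 50, (32,))               # index of each q_i's positive prototype f+

    # Inner-product similarities over all N_c + N_0 memory entries, scaled by tau.
    logits = torch.cat([q @ centroids.t(), q @ outliers.t()], dim=1) / tau
    # Cross entropy against the positive prototype index reproduces
    # -log( exp(<q_i, f+>/tau) / sum_k exp(<q_i, .>/tau) ) averaged over the batch.
    global_loss = F.cross_entropy(logits, pos_idx)
    global_loss.backward()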
CN202211372036.3A 2022-11-03 2022-11-03 Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning Pending CN115641613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211372036.3A CN115641613A (en) 2022-11-03 2022-11-03 Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211372036.3A CN115641613A (en) 2022-11-03 2022-11-03 Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning

Publications (1)

Publication Number Publication Date
CN115641613A true CN115641613A (en) 2023-01-24

Family

ID=84946978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211372036.3A Pending CN115641613A (en) 2022-11-03 2022-11-03 Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning

Country Status (1)

Country Link
CN (1) CN115641613A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325223A (en) * 2018-12-13 2020-06-23 中国电信股份有限公司 Deep learning model training method and device and computer readable storage medium
CN111325223B (en) * 2018-12-13 2023-10-24 中国电信股份有限公司 Training method and device for deep learning model and computer readable storage medium
CN116824695A (en) * 2023-06-07 2023-09-29 南通大学 Pedestrian re-identification non-local defense method based on feature denoising
CN117115641A (en) * 2023-07-20 2023-11-24 中国科学院空天信息创新研究院 Building information extraction method and device, electronic equipment and storage medium
CN117115641B (en) * 2023-07-20 2024-03-22 中国科学院空天信息创新研究院 Building information extraction method and device, electronic equipment and storage medium
CN116912535A (en) * 2023-09-08 2023-10-20 中国海洋大学 Unsupervised target re-identification method, device and medium based on similarity screening
CN116912535B (en) * 2023-09-08 2023-11-28 中国海洋大学 Unsupervised target re-identification method, device and medium based on similarity screening
CN117351522A (en) * 2023-12-06 2024-01-05 云南联合视觉科技有限公司 Pedestrian re-recognition method based on style injection and cross-view difficult sample mining
CN117556866A (en) * 2024-01-09 2024-02-13 南开大学 Data domain adaptation network construction method of passive domain diagram
CN117556866B (en) * 2024-01-09 2024-03-29 南开大学 Data domain adaptation network construction method of passive domain diagram
CN117993468A (en) * 2024-04-03 2024-05-07 杭州海康威视数字技术股份有限公司 Model training method and device, storage medium and electronic equipment
CN117993468B (en) * 2024-04-03 2024-06-28 杭州海康威视数字技术股份有限公司 Model training method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN115641613A (en) Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
Ren et al. Meta-learning for semi-supervised few-shot classification
Jing et al. Videossl: Semi-supervised learning for video classification
US10671853B2 (en) Machine learning for identification of candidate video insertion object types
CN112069929B (en) Unsupervised pedestrian re-identification method and device, electronic equipment and storage medium
CN111401281B (en) Unsupervised pedestrian re-identification method and system based on deep clustering and sample learning
CN112396027B (en) Vehicle re-identification method based on graph convolution neural network
CN110728294A (en) Cross-domain image classification model construction method and device based on transfer learning
CN108491766B (en) End-to-end crowd counting method based on depth decision forest
CN113111814B (en) Regularization constraint-based semi-supervised pedestrian re-identification method and device
WO2022062419A1 (en) Target re-identification method and system based on non-supervised pyramid similarity learning
CN109753897B (en) Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
CN116935447B (en) Self-adaptive teacher-student structure-based unsupervised domain pedestrian re-recognition method and system
CN111967325A (en) Unsupervised cross-domain pedestrian re-identification method based on incremental optimization
CN114283350A (en) Visual model training and video processing method, device, equipment and storage medium
Wu et al. An end-to-end exemplar association for unsupervised person re-identification
CN117152459B (en) Image detection method, device, computer readable medium and electronic equipment
CN105701516B (en) A kind of automatic image marking method differentiated based on attribute
US20230023164A1 (en) Systems and methods for rapid development of object detector models
CN115471739A (en) Cross-domain remote sensing scene classification and retrieval method based on self-supervision contrast learning
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN114299362A (en) Small sample image classification method based on k-means clustering
CN112183464A (en) Video pedestrian identification method based on deep neural network and graph convolution network
CN115761408A (en) Knowledge distillation-based federal domain adaptation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination