CN115641613A - Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning - Google Patents

Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning

Info

Publication number: CN115641613A
Application number: CN202211372036.3A
Authority: CN (China)
Legal status: Pending
Prior art keywords: training, picture, model, loss, representing
Inventors: 杨曦, 郑顾, 袁柳, 魏梓钰, 杨东
Assignees: Xidian University; China Academy of Electronic and Information Technology of CETC
Application filed by Xidian University and China Academy of Electronic and Information Technology of CETC
Priority: CN202211372036.3A
Classification: Image Analysis
Abstract

The invention relates to an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning, comprising: constructing two identical original convolutional neural networks and pre-training them on a source domain training set with different initialization parameters to obtain two pre-trained student models, then copying each to obtain two pre-trained teacher models; constructing a picture feature memory bank; performing multiple rounds of target domain interactive supervised learning on the two pre-trained student models and the two pre-trained teacher models with a target domain training set until a preset learning termination condition is met, obtaining two student models and two teacher models after cross-domain learning; and identifying a target domain query sample with any model after cross-domain learning and finding the pictures with the same label in the target domain gallery picture set, completing pedestrian re-identification. The method improves the recognition precision of unsupervised cross-domain pedestrian re-identification.

Description

Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning.
Background
Pedestrian re-identification is the task of retrieving, given an image of a certain pedestrian, images of the same identity from a given series of pedestrian images captured at different viewing angles or in different scenes, using computer vision or machine learning methods. Because it addresses problems such as target retrieval and real-time tracking in daily life, it has wide application in the field of intelligent video surveillance.
Cross-domain unsupervised pedestrian re-identification performs transfer learning with the help of another, labeled data set. Because no labor cost is spent on annotation, learning for a specific scene requires only automatic data collection by a machine and automatic training on existing models and data to fit parameters adapted to that scene. Developing an efficient and robust unsupervised pedestrian re-identification system is therefore very attractive in both academia and industry.
In the early stages of the cross-domain unsupervised pedestrian re-identification task, generative methods based on domain translation were popular. These methods train a generative model to synthesize, from source domain images, images in the same style as the target domain images, and then use the real labels of the source domain images to convert the cross-domain problem into a semi-supervised problem within a single domain, which is then solved with semi-supervised training. However, methods based on generated images must first train a generative model, which introduces unavoidable errors; these errors are carried into subsequent training, causing irreparable performance loss and greatly reducing the recognition accuracy of the model. Such methods have therefore gradually lost their mainstream position in cross-domain unsupervised pedestrian re-identification in recent years and have been replaced by methods based on pre-training and fine-tuning. However, fine-tuning-based methods still suffer from problems such as inaccurate pseudo labels and amplified clustering errors, which also hurt recognition accuracy.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning. The technical problem to be solved by the invention is realized by the following technical scheme:
the invention provides an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning, which comprises the following steps:
Step 1: constructing two identical original convolutional neural networks, pre-training the two original convolutional neural networks with the source domain training set using different initialization parameters respectively to obtain a pre-trained first pre-training student model and second pre-training student model, and copying the first pre-training student model and the second pre-training student model respectively to obtain a corresponding first pre-training teacher model and second pre-training teacher model;
Step 2: constructing a picture feature memory bank, wherein the picture feature memory bank is used for storing the picture features and corresponding labels of the target domain training set;
Step 3: performing multiple rounds of target domain interactive supervised learning with the target domain training set on the first pre-training student model, the second pre-training student model, the first pre-training teacher model and the second pre-training teacher model until a preset learning termination condition is met, obtaining a first student model, a second student model, a first teacher model and a second teacher model after cross-domain learning;
for each round of target domain interactive supervised learning, the target domain training set is input into the first pre-training student model and the second pre-training student model, DBSCAN clustering is performed on the picture features extracted by either pre-training student model, and the picture features and labels in the picture feature memory bank are updated according to the clustering result;
during each round of target domain interactive supervised learning, the parameters of the first pre-training student model, the second pre-training student model, the first pre-training teacher model and the second pre-training teacher model are updated with the network total loss, where the network total loss comprises a hard and soft pseudo label combined supervision total loss and a contrastive learning total loss;
Step 4: identifying the target domain query sample with any model after cross-domain learning, finding the pictures with the same label in the target domain gallery picture set, and completing pedestrian re-identification.
In an embodiment of the present invention, the source domain training set is a labeled picture set, and the target domain training set is an unlabeled picture set.
In one embodiment of the present invention, the step 1 comprises:
Step 1.1: constructing two identical original convolutional neural networks, and adopting different initialization parameters for the two original convolutional neural networks respectively;
Step 1.2: performing multiple rounds of pre-training on the two original convolutional neural networks with the source domain training set respectively until a preset pre-training termination condition is met, obtaining a pre-trained first pre-training student model Net1 and second pre-training student model Net2;
for each round of pre-training, different random enhancement modes are applied to the images input into the two original convolutional neural networks respectively, and random dropout is applied to the output features of the two original convolutional neural networks respectively; in each round of pre-training, the two original convolutional neural networks update their network parameters by back-propagating the Log softmax loss and the triplet loss;
Step 1.3: copying the structures and parameters of the first pre-training student model Net1 and the second pre-training student model Net2 respectively to obtain a corresponding first pre-training teacher model Mean-Net1 and second pre-training teacher model Mean-Net2.
In an embodiment of the present invention, after the picture feature memory bank is constructed, initializing the picture feature memory bank comprises:
extracting features from the target domain training set through the first pre-training student model and the second pre-training student model respectively to obtain picture features and labels corresponding to each picture, and storing the picture features and labels into the picture feature memory bank to complete initialization, wherein the picture feature F = (F1 + F2)/2, where F1 denotes the first picture feature obtained by extracting the picture through the first pre-training student model, and F2 denotes the second picture feature obtained by extracting the picture through the second pre-training student model.
In an embodiment of the present invention, inputting the target domain training set into the first pre-trained student model and the second pre-trained student model, performing DBSCAN clustering on the picture features extracted by either pre-trained student model, and updating the picture features and labels in the picture feature memory bank according to the clustering result comprises:
inputting the target domain training set into the first and second pre-trained student models;
calculating the jaccard distance according to the picture features extracted by any one of the pre-trained student models;
performing DBSCAN clustering on the extracted picture features according to the jaccard distance;
calculating a clustering center of each clustering category, and distributing corresponding pseudo labels to the clustering categories;
and updating the picture features and labels in the picture feature memory bank according to the extracted picture features and the corresponding pseudo labels.
In one embodiment of the present invention, updating parameters of the first pre-trained student model, the second pre-trained student model, the first pre-trained teacher model, and the second pre-trained teacher model with a total loss of the network comprises:
according to the total loss of the network, performing parameter updates on the first pre-training student model and the second pre-training student model through gradient back propagation, and then performing parameter updates on the first pre-training teacher model and the second pre-training teacher model through EMA (exponential moving average) updates;
wherein the first pre-training teacher model and the second pre-training teacher model update their parameters according to the following formula:

$$E^{(T)}[\theta_i]=\alpha E^{(T-1)}[\theta_i]+(1-\alpha)\,\theta_i^{(T)},\qquad i\in\{1,2\}$$

where $E[\theta]$ denotes the accumulated average of the network parameter $\theta$, $T$ denotes the $T$-th round of target domain interactive supervised learning, $\theta_1$ denotes the current-round parameters of the first pre-training student model, $\theta_2$ denotes the current-round parameters of the second pre-training student model, and $\alpha$ denotes the smoothing coefficient hyperparameter.
In one embodiment of the invention, the hard and soft pseudo label combined supervision total loss comprises: the classification loss when supervising with hard pseudo labels, the classification loss when supervising with soft pseudo labels, the triplet loss when supervising with hard pseudo labels, and the triplet loss when supervising with soft pseudo labels;
the hard and soft pseudo label combined supervision total loss is obtained according to the following formula:

$$L(\theta_1,\theta_2)=(1-\lambda_{ce}^t)L_{ce}^t+\lambda_{ce}^t L_{sce}^t+(1-\lambda_{tri}^t)L_{tri}^t+\lambda_{tri}^t L_{stri}^t$$

where $t$ is the current training round, $\lambda_{ce}^t$ denotes the soft pseudo label classification loss coefficient, $\lambda_{tri}^t$ denotes the soft pseudo label triplet loss coefficient, $L_{ce}^t$ denotes the classification loss when supervising with hard pseudo labels, $L_{sce}^t$ denotes the classification loss when supervising with soft pseudo labels, $L_{tri}^t$ denotes the triplet loss when supervising with hard pseudo labels, and $L_{stri}^t$ denotes the triplet loss when supervising with soft pseudo labels;
wherein the classification loss function when supervising with hard pseudo labels is:

$$L_{ce}^t=\frac{1}{N_t}\sum_{i=1}^{N_t}L_{ce}\!\left(C\big(F(x_i^t)\big),\,\tilde{y}_i^t\right)$$

where $N_t$ denotes the number of pictures, $L_{ce}$ denotes the multi-class cross-entropy loss function, $C(F(x_i^t))$ denotes the classification result of the picture after feature extraction and classification by a pre-training student model, $\tilde{y}_i^t$ denotes the hard pseudo label of a picture in the target domain training set, and $x_i^t$ denotes a picture in the target domain training set;
the classification loss function when supervising with soft pseudo labels is:

$$L_{sce}^t(\theta_1\mid\theta_2)=-\frac{1}{N_t}\sum_{i=1}^{N_t}\mathcal{M}\big(x_i'^t;E[\theta_2]\big)\cdot\log\mathcal{M}\big(x_i^t;\theta_1\big)$$

$$L_{sce}^t(\theta_2\mid\theta_1)=-\frac{1}{N_t}\sum_{i=1}^{N_t}\mathcal{M}\big(x_i^t;E[\theta_1]\big)\cdot\log\mathcal{M}\big(x_i'^t;\theta_2\big)$$

where $\mathcal{M}(x_i^t;E[\theta_1])$ denotes the classification prediction of the first pre-training teacher model, $\mathcal{M}(x_i'^t;E[\theta_2])$ denotes the classification prediction of the second pre-training teacher model, $\mathcal{M}(x_i^t;\theta_1)$ denotes the classification result of the picture after feature extraction and classification by the first pre-training student model, $\mathcal{M}(x_i'^t;\theta_2)$ denotes the classification result of the picture after feature extraction and classification by the second pre-training student model, and $x_i'^t$ denotes $x_i^t$ under a different random data enhancement mode;
the triplet loss function when supervising with hard pseudo labels is:

$$L_{tri}^t=\frac{1}{N_t}\sum_{i=1}^{N_t}\max\Big(0,\;\big\|F(x_i^t)-F(x_{i,p}^t)\big\|+m-\big\|F(x_i^t)-F(x_{i,n}^t)\big\|\Big)$$

where $\|\cdot\|$ denotes the Euclidean distance, $x_{i,p}^t$ and $x_{i,n}^t$ denote a positive sample and a negative sample of $x_i^t$ respectively, $m$ denotes the margin hyperparameter, $F(x_i^t)$ denotes the feature of the anchor sample of the input picture, $F(x_{i,p}^t)$ denotes the feature of a positive sample picture of the input picture, and $F(x_{i,n}^t)$ denotes the feature of a negative sample picture of the input picture;
the triplet loss function when supervising with soft pseudo labels is:

$$\mathcal{T}_i(\theta)=\frac{\exp\big(\|F(x_i^t;\theta)-F(x_{i,n}^t;\theta)\|\big)}{\exp\big(\|F(x_i^t;\theta)-F(x_{i,p}^t;\theta)\|\big)+\exp\big(\|F(x_i^t;\theta)-F(x_{i,n}^t;\theta)\|\big)}$$

$$L_{stri}^t(\theta_1\mid\theta_2)=\frac{1}{N_t}\sum_{i=1}^{N_t}L_{bce}\Big(\mathcal{T}_i(\theta_1),\,\mathcal{T}_i\big(E[\theta_2]\big)\Big)$$

$$L_{stri}^t(\theta_2\mid\theta_1)=\frac{1}{N_t}\sum_{i=1}^{N_t}L_{bce}\Big(\mathcal{T}_i(\theta_2),\,\mathcal{T}_i\big(E[\theta_1]\big)\Big)$$

where $\mathcal{T}_i(\theta)$ is the softmax-triplet defined in the detailed description below, and $L_{bce}(p,q)=-q\log p-(1-q)\log(1-p)$ denotes the binary cross-entropy loss function used when supervising with soft pseudo labels.
In one embodiment of the present invention, the contrast learning total loss includes a global contrast loss and a local contrast loss, wherein,
the global contrast loss is calculated as follows:
$$L_{GM}=-\log\frac{\exp\big(\langle q_i,f_+\rangle/\tau\big)}{\sum_{c=1}^{N_c}\exp\big(\langle q_i,f_c\rangle/\tau\big)+\sum_{k=1}^{N_o}\exp\big(\langle q_i,f_k\rangle/\tau\big)}$$

where $N_c$ denotes the number of clustered sample features in the current memory bank, $N_o$ denotes the number of un-clustered sample features in the current memory bank, $q_i$ denotes the $i$-th feature of the currently input mini-batch of pictures, $f_+$ denotes the positive sample feature corresponding to $q_i$ in the picture feature memory bank, $\tau$ is the temperature hyperparameter, $\langle\cdot,\cdot\rangle$ denotes the inner product between two feature vectors measuring their similarity, $f_c$ denotes the features of the already clustered samples, and $f_k$ denotes the un-clustered sample features in the picture feature memory bank;
the local contrast loss is calculated as follows:

$$L_{LB}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\big(\langle q_i,q_+\rangle/\tau\big)}{\exp\big(\langle q_i,q_+\rangle/\tau\big)+\sum_{j:\,y_j\neq y_i}\exp\big(\langle q_i,q_j\rangle/\tau\big)}$$

where $y_i$ and $y_j$ denote the pseudo labels of the features $q_i$ and $q_j$ in the currently input mini-batch respectively, $q_+$ denotes the hardest positive sample of $q_i$ within the mini-batch, and $B$ denotes the number of pictures in the currently input mini-batch.
Compared with the prior art, the invention has the beneficial effects that:
1. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning of the invention is based on an interactive average learning framework and overcomes the large risk of model collapse when the initial pseudo labels are very noisy; the method can gradually reduce pseudo label noise, improve pseudo label quality and improve clustering accuracy, thereby improving the recognition precision of unsupervised cross-domain pedestrian re-identification.
2. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning is based on a global and local contrastive learning method: a memory bank is introduced and a hard sample mining strategy is adopted, preventing training errors from being amplified by noisy pseudo class labels; more reliable target domain clusters are gradually generated and used to learn better features in the hybrid memory, which improves clustering and its accuracy, and thus the recognition precision of unsupervised cross-domain pedestrian re-identification.
The foregoing is only an overview of the technical solutions of the present invention. In order to make the technical means of the present invention more clearly understood and implementable in accordance with the content of the description, and to make the above and other objects, features and advantages of the present invention more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a flowchart of an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a pre-training scheme provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of unsupervised cross-domain pedestrian re-recognition training based on interactive average learning according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of unsupervised cross-domain pedestrian re-recognition training based on hybrid contrast learning according to an embodiment of the present invention;
fig. 6 is a schematic view of visualization of a picture classification test according to an embodiment of the present invention.
Detailed Description
In order to further explain the technical means and effects of the present invention adopted to achieve the predetermined invention purpose, the following will explain in detail an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to the present invention with reference to the accompanying drawings and the detailed embodiments.
The foregoing and other technical contents, features and effects of the present invention will be more clearly understood from the following detailed description of the embodiments taken in conjunction with the accompanying drawings. While the present invention has been described in connection with the preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Example one
Referring to fig. 1 and fig. 2 in combination, fig. 1 is a flowchart of an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to an embodiment of the present invention; fig. 2 is a schematic diagram of an unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to an embodiment of the present invention. As shown in the figure, the unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning in the embodiment includes:
step 1: constructing two same original convolutional neural networks, pre-training the two original convolutional neural networks by using a source domain training set respectively by adopting different initialization parameters to obtain a first pre-training student model and a second pre-training student model which are pre-trained, and copying the first pre-training student model and the second pre-training student model respectively to obtain a corresponding first pre-training teacher model and a corresponding second pre-training teacher model;
referring to the schematic diagram of pre-training shown in fig. 3, in an alternative embodiment, step 1 includes:
step 1.1: constructing two same original convolutional neural networks, and respectively adopting different initialization parameters for the two original convolutional neural networks;
optionally, the original convolutional neural network comprises: the system comprises an original twin neural network, an original characteristic fusion module, an original characteristic optimization module and an original cross-correlation module.
Optionally, the backbone network ResNet-ibn50 may be used as the original twin network structure, where ResNet-ibn50 is formed by adding an IBN module to the bottleneck layers on the basis of the ResNet-50 network; it should be noted that the present invention does not specifically limit the original twin network structure. ResNet-50 is composed of 16 convolution blocks, each containing three convolutional layers: the first convolutional layer of each block has a 1×1 kernel, the second a 3×3 kernel, and the third a 1×1 kernel. In addition, the ResNet-50 framework is a typical residual network; it overcomes the degradation problem in which accuracy saturates and then drops as network depth increases, and can extract deep features from the picture.
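For concreteness, the following is a minimal sketch of step 1.1 under stated assumptions: torchvision's plain ResNet-50 stands in for ResNet-ibn50 (the IBN variant is not in torchvision), and the two-output forward signature and the class count (751, the number of Market-1501 training identities mentioned later) are illustrative choices, not fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ReIDNet(nn.Module):
    """ResNet-50 trunk with the final fc replaced by a re-ID classifier.
    (A stand-in for ResNet-ibn50; the IBN blocks are omitted here.)"""
    def __init__(self, num_classes: int = 751, feat_dim: int = 2048):
        super().__init__()
        trunk = resnet50(weights=None)
        self.features = nn.Sequential(*list(trunk.children())[:-1])  # drop fc
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        f = self.features(x).flatten(1)   # (B, 2048) picture features
        return f, self.classifier(f)      # features and class logits

# step 1.1: two identical architectures, different random initializations
torch.manual_seed(0); net1 = ReIDNet()
torch.manual_seed(1); net2 = ReIDNet()
```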
Step 1.2: performing multiple rounds of pre-training on the two original convolutional neural networks with the source domain training set respectively until the preset pre-training termination condition is met, obtaining the pre-trained first pre-training student model Net1 and second pre-training student model Net2;
In this embodiment, by inputting the source domain training set into an original convolutional neural network and setting initialization parameters such as preset input parameters, training parameters, sample parameters, training period parameters, learning rate parameters, the loss function and the gradient descent function, the original convolutional neural network can be pre-trained with the goal of minimizing the loss function until the preset pre-training termination condition is reached, that is, the preset number of training periods is reached or the loss function value reaches a preset threshold, obtaining the pre-trained student model.
In an optional embodiment, for each round of pre-training, different random enhancement modes, such as random cropping, random flipping and random erasing, are applied to the images input into the two original convolutional neural networks respectively, and random dropout is applied to the output features of the two original convolutional neural networks respectively.
In this embodiment, the source domain training set is a set of labeled pictures. Optionally, each training mini-batch contains 64 person images of 16 real or pseudo identities, each identity comprising 4 images. All images are resized to 256 × 128 before being input to the network.
Illustratively, the sample parameter is 64, i.e. the mini-batch size is 64; the training period parameter is 80, i.e. training lasts 80 rounds; the first 10 rounds use a warmup learning rate, i.e. the learning rate increases linearly from 0.000035 to 0.00035 over the first 10 training rounds, is then held from round 10, and is multiplied by 0.1 at rounds 40 and 70 respectively. An Adam optimizer with a weight decay of 0.0005 is used for gradient-descent optimization of the network.
During each round of pre-training, the two original convolutional neural networks update their network parameters by back-propagating the Log softmax loss and the Triplet loss: the total loss $L_{pre}$ of each network is computed and back-propagated to update that network's parameters, yielding the pre-trained first pre-training student model Net1 and second pre-training student model Net2.
The total loss $L_{pre}$ is:

$$L_{pre}=\lambda_{ls}L_{ls}(x_i)+\lambda_t L(a,p,n)\qquad(1)$$

where $\lambda_{ls}$ and $\lambda_t$ denote the coefficients of the classification loss $L_{ls}(x_i)$ and the triplet loss $L(a,p,n)$ respectively, set from empirical values.
The Log softmax loss is equivalent to taking the logarithm of the softmax output:

$$L_{ls}(x_i)=\mathrm{LogSoftmax}(x_i)=\log\frac{\exp(x_i)}{\sum_j\exp(x_j)}\qquad(2)$$

where $x_i$ denotes an element of the input feature matrix and $x_j$ ranges over the elements of that matrix traversed by row/column.
The Triplet loss formula is:

$$L(a,p,n)=\max\big(d(a,p)-d(a,n)+margin,\,0\big)\qquad(3)$$

The input is a triplet comprising an anchor sample (a), a positive sample (p) and a negative sample (n): the positive sample and a belong to the same class, the negative sample and a belong to different classes, and margin is a constant greater than 0. The final optimization objective pulls a and p closer together while pushing a and n farther apart.
Step 1.3: copying the structures and parameters of the first pre-training student model Net1 and the second pre-training student model Net2 respectively to obtain the corresponding first pre-training teacher model Mean-Net1 and second pre-training teacher model Mean-Net2.
In this embodiment, setting up the first pre-training teacher model Mean-Net1 and the second pre-training teacher model Mean-Net2 corresponding to the first pre-training student model Net1 and the second pre-training student model Net2 realizes two groups of teacher-student models under the interactive average learning framework; mutual supervised learning is carried out between teacher and student models and between the different networks, which improves the quality of the pseudo labels, makes clustering more accurate, and improves classification precision.
Step 2: constructing a picture feature memory bank, wherein the picture feature memory bank is used for storing the picture features and corresponding labels of the target domain training set;
In this embodiment, the target domain training set is an unlabeled picture set.
It should be noted that after the picture feature memory bank is constructed, it is initialized; the initialization is performed only once, before target domain interactive supervised learning is carried out on the first pre-training student model, the second pre-training student model, the first pre-training teacher model and the second pre-training teacher model.
The specific initialization process comprises: extracting features from the target domain training set through the first pre-training student model and the second pre-training student model respectively to obtain the picture features and labels corresponding to each picture, and storing the picture features and labels into the picture feature memory bank to complete initialization, where the picture feature F = (F1 + F2)/2, F1 denotes the first picture feature obtained by extracting the picture through the first pre-training student model, and F2 denotes the second picture feature obtained by extracting the picture through the second pre-training student model.
In this embodiment, the target domain training set is input into the first pre-training student model Net1 and the second pre-training student model Net2 respectively and the features are averaged, which makes the features captured by the networks more balanced and the target domain instance features clearer, improving the stability of the system.
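A minimal sketch of this initialization, assuming each network returns (features, logits) as in the earlier sketch; the L2 normalization and the -1 placeholder for not-yet-assigned labels are illustrative implementation choices.

```python
import torch

@torch.no_grad()
def init_memory_bank(net1, net2, target_loader, device='cuda'):
    """Initialize the picture feature memory bank with F = (F1 + F2) / 2,
    the average of the two pre-trained students' features. Labels start
    unset; they are filled in after the first round of clustering."""
    net1.eval(); net2.eval()
    feats = []
    for imgs in target_loader:            # unlabeled target-domain pictures
        imgs = imgs.to(device)
        f1, _ = net1(imgs)                # first student's features  F1
        f2, _ = net2(imgs)                # second student's features F2
        f = (f1 + f2) / 2                 # averaged picture feature  F
        feats.append(torch.nn.functional.normalize(f, dim=1).cpu())
    memory = torch.cat(feats)             # (N_target, D) memory bank
    labels = torch.full((memory.size(0),), -1, dtype=torch.long)  # -1 = none
    return memory, labels
```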
Step 3: performing multiple rounds of target domain interactive supervised learning with the target domain training set on the first pre-training student model, the second pre-training student model, the first pre-training teacher model and the second pre-training teacher model until the preset learning termination condition is met, obtaining a first student model, a second student model, a first teacher model and a second teacher model after cross-domain learning;
in this embodiment, the predetermined learning termination condition is that a predetermined learning period is reached or the loss function value reaches a predetermined threshold.
For each round of target domain interactive supervised learning, the target domain training set is input into the first pre-training student model and the second pre-training student model, DBSCAN clustering is performed on the picture features extracted by either pre-training student model, and the picture features and labels in the picture feature memory bank are updated according to the clustering result. This specifically comprises the following steps:
step (1): in each round of target domain interactive supervised learning, inputting a target domain training set into a first pre-training student model and a second pre-training student model;
step (2): calculating the jaccard distance according to the picture features extracted by any one of the pre-training student models (the first pre-training student model or the second pre-training student model);
The jaccard distance is derived from the jaccard coefficient, which reflects the similarity between two vectors whose elements take the value 0 or 1. For vectors $\vec{a}$ and $\vec{b}$, define:

$M_{00}$: the number of positions where the element of $\vec{a}$ is 0 and the element of $\vec{b}$ is 0;
$M_{10}$: the number of positions where the element of $\vec{a}$ is 1 and the element of $\vec{b}$ is 0;
$M_{01}$: the number of positions where the element of $\vec{a}$ is 0 and the element of $\vec{b}$ is 1;
$M_{11}$: the number of positions where the element of $\vec{a}$ is 1 and the element of $\vec{b}$ is 1.

Then the jaccard coefficient can be expressed as:

$$J(\vec{a},\vec{b})=\frac{M_{11}}{M_{01}+M_{10}+M_{11}}$$

Likewise, another, set-based representation may be used:

$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$

It should be noted that the larger the jaccard coefficient, the higher the similarity; the jaccard distance used for clustering is $d_J=1-J$.
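A minimal sketch of the pairwise jaccard distance on binary vectors, following the definition above. In practice re-ID pipelines often compute this distance over k-reciprocal nearest-neighbor encodings of the extracted features, a detail the text does not specify; the vectors below are purely illustrative.

```python
import numpy as np

def jaccard_distance(a: np.ndarray, b: np.ndarray) -> float:
    """d_J = 1 - J, with J = M11 / (M01 + M10 + M11) as defined above."""
    a, b = a.astype(bool), b.astype(bool)
    m11 = np.sum(a & b)          # both 1
    m10 = np.sum(a & ~b)         # a = 1, b = 0
    m01 = np.sum(~a & b)         # a = 0, b = 1
    union = m01 + m10 + m11
    return 1.0 if union == 0 else 1.0 - m11 / union

# e.g. neighbourhood-indicator vectors of two pictures
a = np.array([1, 0, 1, 1, 0]); b = np.array([1, 1, 1, 0, 0])
print(jaccard_distance(a, b))    # 0.5: two shared neighbours out of four
```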
step (3): performing DBSCAN clustering on the extracted picture features according to the jaccard distance;
DBSCAN clustering is a density-based spatial clustering algorithm: it divides regions of sufficient density into clusters, can find clusters of arbitrary shape in a spatial database with noise, and defines a cluster as the largest set of density-connected points. It partitions the training data into a clustered sample set $X_c$ and an un-clustered outlier sample set $X_o$.
It should be noted that the DBSCAN clustering algorithm has two important hyperparameters: one determines the maximum distance between features that can be grouped into the same class (two features closer than this distance are considered neighbors), and the other determines the minimum number of neighbors required for a feature to be a cluster center. In this embodiment, these two hyperparameters are empirically set to 0.5 and 5 respectively.
step (4): calculating the clustering center of each cluster class and assigning a corresponding pseudo label to each cluster class;
In this embodiment, samples within the same cluster share the same pseudo label, while each un-clustered outlier sample is treated as an independent class and assigned its own pseudo label. It should be noted that these pseudo labels serve as the hard pseudo labels in the subsequent network loss calculation.
step (5): updating the picture features and labels in the picture feature memory bank according to the extracted picture features and the corresponding pseudo labels.
It should be noted that after the clustering centers are recalculated from each round's clustering result and corresponding pseudo labels are assigned, normalization is performed according to the new clustering centers and the resulting parameters are assigned to the network's classifier.
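As a concrete illustration of steps (3)-(5), the sketch below clusters memory-bank features with scikit-learn's DBSCAN over a precomputed jaccard distance matrix and assigns pseudo labels, giving each outlier its own class. The eps=0.5 and min_samples=5 values are the empirical hyperparameters stated above; the function name and the center normalization are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_and_label(feats: np.ndarray, jaccard_dist: np.ndarray):
    """Step (3): DBSCAN over the precomputed jaccard distance matrix.
    Step (4): clustered samples share a pseudo label; every outlier
    (DBSCAN noise label -1) becomes its own independent class."""
    labels = DBSCAN(eps=0.5, min_samples=5,
                    metric='precomputed').fit_predict(jaccard_dist)
    next_id = labels.max() + 1
    for i in np.where(labels == -1)[0]:
        labels[i] = next_id               # one fresh class per outlier
        next_id += 1
    # cluster centers (mean feature per pseudo class), normalized; used to
    # re-initialize the classifier and refresh the memory bank (step (5))
    centers = np.stack([feats[labels == c].mean(0) for c in np.unique(labels)])
    centers /= np.linalg.norm(centers, axis=1, keepdims=True) + 1e-12
    return labels, centers
```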
This embodiment makes full use of the sample data: not only are pseudo labels assigned to the large number of clustered samples, but the small number of un-clustered outlier samples are also given independent pseudo labels. This avoids outlier samples being given wrong pseudo labels and amplifying errors during training, improves the accuracy of pseudo label assignment and of clustering, and further improves classification accuracy.
In an alternative embodiment, parameters of the first pre-training student model, the second pre-training student model, the first pre-training teacher model and the second pre-training teacher model are updated by using total network loss during each round of target domain interactive supervised learning, as shown in fig. 4.
In this embodiment, the network total loss comprises the hard and soft pseudo label combined supervision total loss and the contrastive learning total loss, i.e.,

$$L_{total}=L(\theta_1,\theta_2)+L_{L2G}\qquad(6)$$

where $L_{total}$ denotes the network total loss, $L(\theta_1,\theta_2)$ denotes the hard and soft pseudo label combined supervision total loss, and $L_{L2G}$ denotes the contrastive learning total loss.
Optionally, when each round of target domain interactive supervised learning is performed, according to the total loss of the network, parameter updating is performed on the first pre-trained student model and the second pre-trained student model through gradient back propagation, and then parameter updating is performed on the first pre-trained teacher model and the second pre-trained teacher model through an EMA mode.
In the EMA update mode, the parameter $E[\theta]$ is the accumulated average of the corresponding network parameter $\theta$. Specifically, the teacher parameters are not updated by back-propagating the loss function; instead, after each back propagation of the total loss through the student networks, the first and second pre-training teacher models update their parameters according to the following formula:

$$E^{(T)}[\theta_i]=\alpha E^{(T-1)}[\theta_i]+(1-\alpha)\,\theta_i^{(T)},\qquad i\in\{1,2\}\qquad(7)$$

where $E[\theta]$ denotes the accumulated average of the network parameter $\theta$, $T$ denotes the $T$-th round of target domain interactive supervised learning, $\theta_1$ denotes the current-round parameters of the first pre-training student model, $\theta_2$ denotes the current-round parameters of the second pre-training student model, and $\alpha$ denotes the smoothing coefficient hyperparameter. At initialization, $E^{(0)}[\theta_1]=\theta_1$ and $E^{(0)}[\theta_2]=\theta_2$.
The EMA update mode can be regarded as averaging the network's past parameters; through it the two pre-training teacher models accumulate over time, so they are more decoupled and their outputs are more independent and complementary.
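A minimal sketch of the EMA update of equation (7). The value alpha = 0.999 is a common smoothing choice, not one fixed by the patent, and copying the buffers (e.g. batch-norm statistics) from the student is an implementation assumption.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha: float = 0.999):
    """Mean-teacher update, Eq. (7): E[θ] ← α·E[θ] + (1-α)·θ."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(alpha).add_(s, alpha=1.0 - alpha)
    for tb, sb in zip(teacher.buffers(), student.buffers()):
        tb.copy_(sb)

# after each back-propagation step on the students:
# ema_update(mean_net1, net1); ema_update(mean_net2, net2)
```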
In this embodiment, using the interactive average learning framework and teacher-student models, the two groups of teacher-student networks supervise each other both within and across groups, so the learned features and the classification criteria are more stable. Noisy pseudo labels can no longer amplify errors and are gradually corrected by the stable system, improving pedestrian recognition precision.
In an alternative embodiment, the hard and soft pseudo-tags in combination with the supervised total loss comprise: the loss of classification when using hard pseudo labels for supervision, the loss of classification when using soft pseudo labels for supervision, the loss of triplets when using hard pseudo labels for supervision and the loss of triplets when using soft pseudo labels for supervision.
Optionally, the hard and soft pseudo label combined supervision total loss is calculated according to the following formula:

$$L(\theta_1,\theta_2)=(1-\lambda_{ce}^t)L_{ce}^t+\lambda_{ce}^t L_{sce}^t+(1-\lambda_{tri}^t)L_{tri}^t+\lambda_{tri}^t L_{stri}^t\qquad(8)$$

where $t$ is the current training round, $\lambda_{ce}^t$ denotes the soft pseudo label classification loss coefficient, and $\lambda_{tri}^t$ denotes the soft pseudo label triplet loss coefficient; in this embodiment, $\lambda_{ce}^t$ and $\lambda_{tri}^t$ take the value 0.5. $L_{ce}^t$ denotes the classification loss when supervising with hard pseudo labels, $L_{sce}^t$ denotes the classification loss when supervising with soft pseudo labels, $L_{tri}^t$ denotes the triplet loss when supervising with hard pseudo labels, and $L_{stri}^t$ denotes the triplet loss when supervising with soft pseudo labels.
The classification loss when supervising with hard pseudo labels can be expressed by the general multi-class cross-entropy loss function $L_{ce}$. Specifically, the classification loss function when supervising with hard pseudo labels is:

$$L_{ce}^t=\frac{1}{N_t}\sum_{i=1}^{N_t}L_{ce}\!\left(C\big(F(x_i^t)\big),\,\tilde{y}_i^t\right)\qquad(9)$$

where $N_t$ denotes the number of pictures, $L_{ce}$ denotes the multi-class cross-entropy loss function, $C(F(x_i^t))$ denotes the classification result of the picture after feature extraction and classification by a pre-training student model, $\tilde{y}_i^t$ denotes the hard pseudo label of a picture in the target domain training set (the pseudo label assigned after clustering), and $x_i^t$ denotes a picture in the target domain training set.
Under the mutual average learning framework, the soft pseudo labels in the soft classification loss are the classification predictions $\mathcal{M}(\cdot;E[\theta])$ of the pre-training teacher models Mean-Net1/2; for the classification predictions, supervision uses the soft cross-entropy loss $-q\log p$ to reduce the distance between the two distributions. Specifically, the classification loss function when supervising with soft pseudo labels is:

$$L_{sce}^t(\theta_1\mid\theta_2)=-\frac{1}{N_t}\sum_{i=1}^{N_t}\mathcal{M}\big(x_i'^t;E[\theta_2]\big)\cdot\log\mathcal{M}\big(x_i^t;\theta_1\big)\qquad(10)$$

where $\mathcal{M}(x_i^t;E[\theta_1])$ denotes the classification prediction of the first pre-training teacher model, $\mathcal{M}(x_i'^t;E[\theta_2])$ denotes the classification prediction of the second pre-training teacher model, $\mathcal{M}(x_i^t;\theta_1)$ denotes the classification result of the picture after feature extraction and classification by the first pre-training student model, $\mathcal{M}(x_i'^t;\theta_2)$ denotes the classification result of the picture after feature extraction and classification by the second pre-training student model, and $x_i'^t$ denotes $x_i^t$ under a different random data enhancement mode; the symmetric loss $L_{sce}^t(\theta_2\mid\theta_1)$ is obtained by exchanging the roles of the two networks.
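A minimal sketch of equation (10); detaching the teacher logits reflects that the soft pseudo label supervises the student but is not itself optimized. The function name and the (features, logits) network signature are carried over from the earlier sketches.

```python
import torch
import torch.nn.functional as F

def soft_ce_loss(student_logits, teacher_logits):
    """Soft classification loss, Eq. (10): the peer teacher's class
    distribution supervises the student via soft cross-entropy -q·log p."""
    q = F.softmax(teacher_logits.detach(), dim=1)   # teacher soft pseudo label
    log_p = F.log_softmax(student_logits, dim=1)    # student log-prediction
    return -(q * log_p).sum(dim=1).mean()

# cross-supervision of the two branches, x and x_prime two augmentations:
# loss  = soft_ce_loss(net1(x)[1], mean_net2(x_prime)[1])
# loss += soft_ce_loss(net2(x_prime)[1], mean_net1(x)[1])
```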
The triplet loss function when supervising with hard pseudo labels is:

$$L_{tri}^t=\frac{1}{N_t}\sum_{i=1}^{N_t}\max\Big(0,\;\big\|F(x_i^t)-F(x_{i,p}^t)\big\|+m-\big\|F(x_i^t)-F(x_{i,n}^t)\big\|\Big)\qquad(11)$$

where $\|\cdot\|$ denotes the Euclidean distance, $x_{i,p}^t$ and $x_{i,n}^t$ denote a positive sample and a negative sample of $x_i^t$ respectively, $m$ denotes the margin hyperparameter, $F(x_i^t)$ denotes the feature of the anchor sample of the input picture, $F(x_{i,p}^t)$ denotes the feature of a positive sample picture of the input picture, and $F(x_{i,n}^t)$ denotes the feature of a negative sample picture of the input picture.
Illustratively, if an input mini-batch comprises 32 pictures of 4 identities, each identity comprises 8 pictures; for a given anchor sample, the other pictures of the same identity are its positive samples and the pictures of different identities are its negative samples. Each image in the input mini-batch is taken as the anchor sample in turn, the corresponding loss is calculated, and finally the 32 loss values are averaged to obtain the triplet loss when supervising with hard pseudo labels.
The triplet loss above uses hard triplet labels; refining the pseudo labels with a soft max-triplet label loss means applying reasonable soft pseudo labels and a corresponding soft triplet loss function on top of the triplet's image features to improve pseudo label quality. The softmax-triplet is used to represent the relationship between the features within a triplet:

$$\mathcal{T}_i(\theta)=\frac{\exp\big(\|F(x_i^t;\theta)-F(x_{i,n}^t;\theta)\|\big)}{\exp\big(\|F(x_i^t;\theta)-F(x_{i,p}^t;\theta)\|\big)+\exp\big(\|F(x_i^t;\theta)-F(x_{i,n}^t;\theta)\|\big)}\qquad(12)$$

The value of this expression lies in the range $[0,1]$.
At the same time, the supervision of the triplets is softened: the feature-distance ratio $\mathcal{T}_i(E^{(T)}[\theta])$ computed with the EMA-updated teacher replaces the hard pseudo label "1", and the softened value lies in $[0,1)$. Specifically, under the mutual average learning framework, the softmax-triplet computed from image features can be used as a "soft" pseudo label to supervise the training of triplets, and the triplet loss function when supervising with soft pseudo labels can be expressed as:

$$L_{stri}^t(\theta_1\mid\theta_2)=\frac{1}{N_t}\sum_{i=1}^{N_t}L_{bce}\Big(\mathcal{T}_i(\theta_1),\,\mathcal{T}_i\big(E^{(T)}[\theta_2]\big)\Big)\qquad(13)$$

where $L_{bce}(p,q)=-q\log p-(1-q)\log(1-p)$ denotes the binary cross-entropy loss function used when supervising with soft pseudo labels; the symmetric loss $L_{stri}^t(\theta_2\mid\theta_1)$ is defined analogously.
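A minimal sketch of equations (12)-(13). It uses the identity exp(d_an)/(exp(d_ap)+exp(d_an)) = sigmoid(d_an − d_ap); the index tensors locating each anchor's mined positive and negative within the batch are assumed given.

```python
import torch
import torch.nn.functional as F

def softmax_triplet(f_a, f_p, f_n):
    """Eq. (12): softmax over the anchor-negative vs anchor-positive
    Euclidean distances; values near 1 mean the negative is far away."""
    d_ap = (f_a - f_p).norm(dim=1)
    d_an = (f_a - f_n).norm(dim=1)
    return torch.sigmoid(d_an - d_ap)   # = exp(d_an) / (exp(d_ap) + exp(d_an))

def soft_triplet_loss(student_feats, teacher_feats, idx_p, idx_n):
    """Eq. (13): the teacher's softmax-triplet value replaces the hard
    label "1" and supervises the student via binary cross-entropy."""
    p = softmax_triplet(student_feats, student_feats[idx_p], student_feats[idx_n])
    with torch.no_grad():
        q = softmax_triplet(teacher_feats, teacher_feats[idx_p], teacher_feats[idx_n])
    return F.binary_cross_entropy(p.clamp(1e-6, 1 - 1e-6), q)
```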
In this embodiment, hard labels are softened into soft labels, the traditional classification loss and triplet loss are given soft-label counterparts, and the hard and soft pseudo labels are combined in the supervision total loss. This reduces the pseudo label noise of clustering-based unsupervised cross-domain pedestrian re-identification, improves pseudo label quality, better captures the global and local characteristics of the picture, and learns more discriminative person features, improving the accuracy of feature classification and thus of recognition.
As shown in fig. 5, in an alternative embodiment, the contrastive learning total loss comprises a global contrast loss and a local contrast loss, that is,
$$L_{L2G}=L_{LB}+L_{GM}\qquad(14)$$

where $L_{L2G}$ denotes the contrastive learning total loss, $L_{LB}$ denotes the local contrast loss, and $L_{GM}$ denotes the global contrast loss.
In this embodiment, the global contrast loss is computed with a hard sample mining strategy. For each training sample $x_i$ of the target domain training set, define $f_\theta(x_i)$ as its feature vector, abbreviated $f_i=f_\theta(x_i)$, $x_i\in X_o\cup X_c$, where $X_c$ is the set of already clustered samples and $X_o$ is the un-clustered outlier sample set. The global contrast loss $L_{GM}$ based on the dynamic memory bank is then defined as:

$$L_{GM}=-\log\frac{\exp\big(\langle q_i,f_+\rangle/\tau\big)}{\sum_{c=1}^{N_c}\exp\big(\langle q_i,f_c\rangle/\tau\big)+\sum_{k=1}^{N_o}\exp\big(\langle q_i,f_k\rangle/\tau\big)}\qquad(15)$$

where $N_c$ denotes the number of clustered sample features in the current memory bank, $N_o$ denotes the number of un-clustered sample features in the current memory bank, $q_i$ denotes the $i$-th feature of the currently input mini-batch of pictures, $f_+$ denotes the positive sample feature corresponding to $q_i$ in the picture feature memory bank, $\tau$ is the temperature hyperparameter, $\langle\cdot,\cdot\rangle$ denotes the inner product between two feature vectors measuring their similarity, $f_c$ denotes the mined features of the clustered classes, and $f_k$ denotes the un-clustered sample features in the picture feature memory bank.
If the current feature $q_i$ belongs to a cluster, then $f_+$ is the hardest positive sample of $q_i$ among the same-class samples in the picture feature memory bank, where the hardest positive sample is the feature within the same cluster that deviates farthest from $q_i$. Among the selected clustered samples, besides the one hardest positive chosen from $q_i$'s own class, the others are the hardest negative samples selected from each of the remaining $N_c-1$ classes, where a hardest negative is the feature within a cluster closest to $q_i$. All the un-clustered sample features $f_k$, $k\in\{1,\dots,N_o\}$, in the picture feature memory bank are then taken as negative samples. If $q_i$ is not a clustered sample feature, $f_+=f_k$ is set to the un-clustered sample feature corresponding to $q_i$ in the picture feature memory bank, and the hardest negatives are the features of all $N_c$ clustered classes closest to $q_i$. In this way all instance-level self-supervision signals can be fully exploited for self-contrastive learning. It should be noted that $q$ denotes features of the currently input mini-batch and $f$ denotes features in the picture feature memory bank.
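A hedged sketch of the global contrast loss with the hardest-sample selection just described. The per-sample Python loop, the temperature value of 0.05, and the handling of an un-clustered $q_i$ (its own memory entry is taken to be its most similar outlier feature) are simplifications for clarity, not the patent's implementation; features are assumed L2-normalised.

```python
import torch

def global_contrast_loss(q, q_labels, mem, mem_labels, tau=0.05):
    """Eq. (15) with hard mining. q: (B, D) mini-batch features with pseudo
    labels q_labels (-1 = outlier); mem/mem_labels: the memory bank."""
    sim = q @ mem.t()                       # (B, N) inner products <q_i, f>
    clustered = mem_labels >= 0
    cluster_ids = mem_labels[clustered].unique()
    losses = []
    for i in range(q.size(0)):
        terms, pos = [], None
        for c in cluster_ids:
            s = sim[i, mem_labels == c]
            if c == q_labels[i]:
                pos = s.min()               # hardest positive: farthest same-cluster feature
            terms.append(s.min() if c == q_labels[i] else s.max())  # hardest negative per cluster
        out = sim[i, ~clustered]            # all un-clustered memory features
        if pos is None:                     # q_i is itself an outlier:
            pos = out.max()                 #   its own memory feature (assumption)
        terms.append(out)
        logits = torch.cat([t.reshape(-1) for t in terms]) / tau
        losses.append(torch.logsumexp(logits, 0) - pos / tau)  # -log softmax(pos)
    return torch.stack(losses).mean()
```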
In this embodiment, the local contrast loss is also calculated with the hard sample mining strategy. Analogously to the global contrast loss, the local contrast loss $L_{LB}$ is:

$$L_{LB}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\big(\langle q_i,q_+\rangle/\tau\big)}{\exp\big(\langle q_i,q_+\rangle/\tau\big)+\sum_{j:\,y_j\neq y_i}\exp\big(\langle q_i,q_j\rangle/\tau\big)}\qquad(16)$$

where $y_i$ and $y_j$ denote the pseudo labels of the features $q_i$ and $q_j$ in the currently input mini-batch respectively, $q_+$ denotes the hardest positive sample of $q_i$ within the mini-batch, and $B$ denotes the number of pictures in the currently input mini-batch.
It should be noted that the local contrast loss $L_{LB}$ operates inside the currently input mini-batch of pictures; it differs from the conventional contrastive loss in the way positive and negative sample features are selected: they are chosen from the currently input mini-batch with the hardest sample mining strategy.
In this embodiment, the hard sample mining strategy is applied to both the global and the local contrastive learning losses: for each sample, the farthest feature of the same class is selected as the positive example and the closest features of different classes are selected as negative examples. During supervised learning, the mined hardest sample features dynamically update the instance features stored in the picture feature memory bank; better features are learned from the memory bank, and clustering is improved by the hard sample mining strategy.
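A matching sketch of the local contrast loss of equation (16) inside one mini-batch; skipping anchors whose pseudo class has no other in-batch member is an assumption made for the example.

```python
import torch

def local_contrast_loss(q, y, tau=0.05):
    """Eq. (16): each feature's hardest positive (farthest same-label
    feature) is contrasted against its different-label batch features."""
    sim = q @ q.t()                                    # (B, B) similarities
    same = y.unsqueeze(0) == y.unsqueeze(1)
    eye = torch.eye(len(y), dtype=torch.bool, device=q.device)
    losses = []
    for i in range(len(y)):
        pos_mask = same[i] & ~eye[i]
        if not pos_mask.any():                         # outlier: no in-batch positive
            continue
        pos = sim[i, pos_mask].min() / tau             # hardest positive
        neg = sim[i, ~same[i]] / tau                   # different-label features
        logits = torch.cat([pos.view(1), neg])
        losses.append(torch.logsumexp(logits, 0) - pos)
    return torch.stack(losses).mean() if losses else q.sum() * 0
```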
In the target domain interactive supervised learning process, the mined hardest sample features dynamically update the instance features stored in the picture feature memory bank. In the local-to-global contrastive learning process, the whole target domain training set can be used to mine, at the cluster level, the most valuable and most informative training examples, and training errors are prevented from being amplified by wrong clustering during the whole optimization of the model; this maintains the stability and effectiveness of the training process, yields higher robustness, and also improves recognition accuracy.
Step 4: identifying the target domain query sample with any model after cross-domain learning, finding the pictures with the same label in the target domain gallery picture set, and completing pedestrian re-identification.
In this embodiment, the target domain picture set is the gallery, used to match the identities of the query set (the target domain query samples).
Please refer to the visualization of the picture classification test shown in fig. 6, where the first column is the target domain query sample and the remaining 5 columns are the pictures with the same label in the target domain gallery picture set.
The unsupervised cross-domain pedestrian re-identification method is verified based on experiments, and the specific experimental parameters are as follows:
Database: evaluation was performed on two widely used person re-identification data sets, Market-1501 and DukeMTMC-reID. The Market-1501 data set consists of 32,668 annotated images of 1,501 identities captured by 6 cameras, with 12,936 images of 751 identities used for training and 19,732 images of 750 identities in the test set. DukeMTMC-reID contains 16,522 person images of the 702 identities used for training, with the remaining images of another 702 identities used for testing; all of its images were collected from 8 cameras.
Evaluation criteria: the evaluation criteria used in this example are the Cumulative Matching Characteristic (CMC) and mean average precision (mAP). CMC is usually reported as Rank-1, Rank-5 and so on, and reflects retrieval accuracy: the Rank-n recognition rate is the proportion of test samples for which a correct match appears within the top n results under the given similarity matching rule. mAP is the mean, over all queries, of the average precision of the returned ranking.
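For reference, a compact sketch of how Rank-n and mAP are typically computed from a query-gallery distance matrix; the standard Market-1501 protocol additionally filters out same-camera gallery entries, which is omitted here for brevity, and every query is assumed to have at least one gallery match.

```python
import numpy as np

def rank_n_and_map(dist, q_ids, g_ids):
    """dist: (Q, G) query-gallery distances; q_ids/g_ids: identity labels.
    Returns Rank-1, Rank-5 (needs >= 5 gallery entries) and mAP."""
    ranks = np.zeros(len(g_ids)); aps = []
    for i in range(len(q_ids)):
        order = np.argsort(dist[i])                 # gallery sorted by distance
        match = g_ids[order] == q_ids[i]
        ranks[np.argmax(match):] += 1               # CMC: hit from first match on
        hits = np.cumsum(match)                     # average precision of ranking
        prec = hits[match] / (np.flatnonzero(match) + 1)
        aps.append(prec.mean())
    cmc = ranks / len(q_ids)
    return cmc[0], cmc[4], float(np.mean(aps))      # Rank-1, Rank-5, mAP
```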
The results of the experiments are shown in tables 1-6.
TABLE 1 pedestrian re-identification accuracy (target domain Market-1501, batch size = 16)

TABLE 2 pedestrian re-identification accuracy (target domain Market-1501, batch size = 32)

TABLE 3 pedestrian re-identification accuracy (target domain Market-1501, batch size = 64)

[The numerical entries of Tables 1-3 are embedded as images in the original document and cannot be recovered here.]
Comparing Tables 1, 2 and 3 shows that the accuracy of each model improves to some extent as the batch size increases, with the best effect at batch size = 64. Because hybrid contrastive learning needs a large batch size, its effect is not significant at the small batch size = 16, but becomes significant as the batch size grows to 64. Each table shows that applying interactive average learning to the pre-trained model greatly improves performance, and that introducing hybrid contrastive learning improves it further. Taking batch size = 64, which achieves the best effect, as an example: compared with the pre-trained model, the interactive average learning framework brings accuracy improvements of 47.3%/29.7%/20.3% in mAP/rank-1/rank-5 respectively, and on this basis hybrid contrastive learning still brings improvements of 1.8%/1.3%/0.3% in mAP/rank-1/rank-5 respectively. This demonstrates the excellent effect of the interactive average learning framework and hybrid contrastive learning on unsupervised cross-domain pedestrian re-identification.
TABLE 4 pedestrian re-identification accuracy (target domain DukeMTMC-reID, batch size = 16)

TABLE 5 pedestrian re-identification accuracy (target domain DukeMTMC-reID, batch size = 32)

TABLE 6 pedestrian re-identification accuracy (target domain DukeMTMC-reID, batch size = 64)

[The numerical entries of Tables 4-6 are embedded as images in the original document and cannot be recovered here.]
Tables 4, 5 and 6 show the results of experiments with the source domain and target domain of Tables 1, 2 and 3 exchanged. Similarly, the accuracy of each model improves to some extent as the batch size increases, with the best effect at batch size = 64. Again, because hybrid contrastive learning needs a large batch size, its effect is not significant at the small batch size = 16 but becomes significant as the batch size grows to 64. Taking batch size = 64, which achieves the best effect, as an example: compared with the pre-trained model, the interactive average learning framework brings accuracy improvements of 41.7%/36.0%/28.9% in mAP/rank-1/rank-5 respectively, and on this basis hybrid contrastive learning still brings improvements of 2.6%/2.8%/0.6% in mAP/rank-1/rank-5 respectively. This again demonstrates the excellent effect of the interactive average learning framework and hybrid contrastive learning on unsupervised cross-domain pedestrian re-identification.
The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning of the invention is based on the interactive average learning framework and overcomes the large risk of model collapse when the initial pseudo labels are very noisy; the method can gradually reduce pseudo label noise, improve pseudo label quality and improve clustering accuracy, thereby improving the recognition precision of unsupervised cross-domain pedestrian re-identification. In addition, based on the global and local contrastive learning method, a memory bank is introduced and a hard sample mining strategy is adopted, which prevents training errors from being amplified by noisy pseudo labels; more reliable target domain clusters are gradually generated and used to learn better features in the hybrid memory, improving clustering and its accuracy, and thus the recognition precision of unsupervised cross-domain pedestrian re-identification.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that an article or device that comprises a list of elements does not include only those elements but may include other elements not expressly listed. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the article or device comprising that element.
The foregoing is a detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention is not to be considered limited to these descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions may be made without departing from the spirit of the invention, and all of these shall be considered as falling within the protection scope of the invention.

Claims (8)

1. An unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning is characterized by comprising the following steps:
step 1: constructing two identical original convolutional neural networks, pre-training the two original convolutional neural networks on a source-domain training set with different initialization parameters to obtain a first pre-trained student model and a second pre-trained student model, and copying them respectively to obtain a corresponding first pre-trained teacher model and a corresponding second pre-trained teacher model;
step 2: constructing a picture feature memory library, wherein the picture feature memory library is used for storing the picture features and corresponding labels of a target-domain training set;
step 3: performing multiple rounds of target-domain interactive supervised learning on the first pre-trained student model, the second pre-trained student model, the first pre-trained teacher model and the second pre-trained teacher model by using the target-domain training set until a preset learning termination condition is met, obtaining a first student model, a second student model, a first teacher model and a second teacher model after cross-domain learning;
the method comprises the steps that a target domain training set is input into a first pre-training student model and a second pre-training student model aiming at each round of target domain interactive supervised learning, picture features extracted by any one pre-training student model are subjected to DBSCAN clustering, and the picture features and labels in a picture feature memory library are updated according to clustering results;
during each round of target-domain interactive supervised learning, the parameters of the first pre-trained student model, the second pre-trained student model, the first pre-trained teacher model and the second pre-trained teacher model are updated with the network total loss, wherein the network total loss comprises a hard-and-soft pseudo-label combined supervision total loss and a contrast learning total loss;
step 4: identifying the target-domain query sample by using any model after cross-domain learning, finding out the pictures with the same label in the target-domain picture set, and completing pedestrian re-identification.
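By way of illustration only, the following is a minimal runnable sketch of how steps 1 to 4 fit together; PyTorch is assumed, toy linear layers stand in for the convolutional networks, and all names, dimensions and round counts are hypothetical placeholders rather than the claimed implementation (per-step details appear in the sketches under the later claims).

    import copy
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    feat_dim, n_target = 32, 100

    # Step 1: two identical networks with different initializations; source-domain
    # pre-training and the exact backbone are omitted (toy linear layers used here).
    net1, net2 = nn.Linear(feat_dim, feat_dim), nn.Linear(feat_dim, feat_dim)
    teacher1, teacher2 = copy.deepcopy(net1), copy.deepcopy(net2)  # teacher copies

    # Step 2: picture feature memory library for the unlabeled target domain.
    memory_feat = torch.zeros(n_target, feat_dim)
    memory_label = torch.full((n_target,), -1, dtype=torch.long)

    # Step 3: each round re-clusters, refreshes the memory library, then updates models.
    target_pics = torch.randn(n_target, feat_dim)  # stand-in for target-domain images
    for round_idx in range(2):                     # toy number of rounds
        with torch.no_grad():
            memory_feat = net1(target_pics)        # features from one student model
        # DBSCAN pseudo-labeling and the hard/soft pseudo-label and contrast losses
        # of claims 5-8 would be computed here to update all four models.
        memory_label = torch.zeros(n_target, dtype=torch.long)  # placeholder labels

    # Step 4: retrieval ranks the picture set by feature distance to the query.
    query_feat = net1(torch.randn(1, feat_dim))
    ranking = torch.argsort((memory_feat - query_feat).norm(dim=1))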
2. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning as claimed in claim 1, wherein the source domain training set is a labeled picture set, and the target domain training set is an unlabeled picture set.
3. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to claim 2, wherein the step 1 comprises:
step 1.1: constructing two identical original convolutional neural networks, and adopting different initialization parameters for the two original convolutional neural networks respectively;
step 1.2: performing multiple rounds of pre-training on the two original convolutional neural networks respectively by using a source-domain training set until a preset pre-training termination condition is met, obtaining a pre-trained first student model Net1 and a pre-trained second student model Net2;
wherein, in each round of pre-training, different random enhancement modes are applied to the images input into the two original convolutional neural networks, random dropout is applied to the output features of the two original convolutional neural networks respectively, and both networks update their parameters by back-propagating a log-softmax loss and a triplet loss;
step 1.3: copying the structures and parameters of the first pre-trained student model Net1 and the second pre-trained student model Net2 respectively to obtain a corresponding first pre-trained teacher model Mean-Net1 and a corresponding second pre-trained teacher model Mean-Net2.
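As a hedged illustration of this pre-training step, the sketch below (PyTorch assumed; the tiny backbone, input size, identity count and fixed triplet slicing are simplifications, since a real implementation mines positive and negative samples by identity) combines a log-softmax classification loss with a triplet loss and applies random dropout to the output features.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    num_ids, feat_dim = 751, 128               # identity count is a placeholder value
    backbone = nn.Sequential(nn.Flatten(),
                             nn.Linear(3 * 64 * 32, feat_dim),
                             nn.Dropout(p=0.5))  # random dropout on output features
    classifier = nn.Linear(feat_dim, num_ids)

    x = torch.randn(15, 3, 64, 32)             # one randomly augmented source-domain mini-batch
    y = torch.randint(0, num_ids, (15,))
    feats = backbone(x)
    logits = classifier(feats)

    id_loss = F.nll_loss(F.log_softmax(logits, dim=1), y)  # log-softmax classification loss
    # Toy triplet slicing; real training pairs anchors with same/different identities.
    tri_loss = nn.TripletMarginLoss(margin=0.5)(feats[:5], feats[5:10], feats[10:15])
    (id_loss + tri_loss).backward()            # back-propagate both losses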
4. The unsupervised cross-domain pedestrian re-recognition method based on clustering and multi-scale learning according to claim 3, wherein after the picture feature memory library is constructed, the picture feature memory library is initialized, and the method comprises the following steps:
extracting features from the target-domain training set through the first pre-trained student model and the second pre-trained student model respectively to obtain the picture feature and label corresponding to each picture, and storing the picture features and labels into the picture feature memory library to complete initialization, wherein the picture feature F = (F1 + F2)/2, F1 represents the first picture feature obtained by extracting the picture through the first pre-trained student model, and F2 represents the second picture feature obtained by extracting the picture through the second pre-trained student model.
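A minimal sketch of this initialization, assuming the two students' features have already been extracted into two tensors (names and dimensions are illustrative, not from the source):

    import torch

    torch.manual_seed(0)
    n_pics, feat_dim = 100, 128
    f1 = torch.randn(n_pics, feat_dim)  # features from the first pre-trained student model
    f2 = torch.randn(n_pics, feat_dim)  # features from the second pre-trained student model

    memory_features = (f1 + f2) / 2     # F = (F1 + F2) / 2, as in the claim
    memory_labels = torch.full((n_pics,), -1, dtype=torch.long)  # assigned later by clustering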
5. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning as claimed in claim 4, wherein the step of inputting the target domain training set into the first pre-trained student model and the second pre-trained student model, performing DBSCAN clustering on picture features extracted by any one pre-trained student model, and updating the picture features and labels in the picture feature memory base according to the clustering result comprises:
inputting the target domain training set into the first and second pre-trained student models;
calculating the Jaccard distance according to the picture features extracted by either pre-trained student model;
performing DBSCAN clustering on the extracted picture features according to the Jaccard distance;
calculating a clustering center of each clustering category, and distributing corresponding pseudo labels to the clustering categories;
and updating the picture features and the labels in the picture feature memory library according to the extracted picture features and the corresponding pseudo labels.
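For illustration, a runnable sketch of this clustering step using scikit-learn's DBSCAN with a precomputed distance matrix; cosine distance is used here purely as a stand-in for the Jaccard distance the claim specifies (in re-identification practice the Jaccard distance is typically derived from k-reciprocal neighbor sets), and eps/min_samples are placeholder values.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import pairwise_distances

    rng = np.random.default_rng(0)
    feats = rng.random((200, 128)).astype(np.float32)      # stand-in for extracted picture features
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # L2-normalize

    dist = pairwise_distances(feats, metric="cosine")      # stand-in for the Jaccard distance
    labels = DBSCAN(eps=0.6, min_samples=4, metric="precomputed").fit_predict(dist)

    # Label -1 marks un-clustered outliers; each cluster id serves as a pseudo label,
    # and per-cluster mean features serve as cluster centers.
    centers = {c: feats[labels == c].mean(axis=0) for c in set(labels.tolist()) if c != -1}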
6. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning as claimed in claim 5, wherein updating the parameters of the first pre-trained student model, the second pre-trained student model, the first pre-trained teacher model and the second pre-trained teacher model with network total loss comprises:
according to the network total loss, updating the parameters of the first pre-trained student model and the second pre-trained student model through gradient back-propagation, and then updating the parameters of the first pre-trained teacher model and the second pre-trained teacher model through an exponential moving average (EMA);
wherein the parameters of the first pre-trained teacher model and the second pre-trained teacher model are updated according to the following formula:
$$E^{(T)}[\theta_1]=\alpha E^{(T-1)}[\theta_1]+(1-\alpha)\,\theta_1^{(T)},\qquad E^{(T)}[\theta_2]=\alpha E^{(T-1)}[\theta_2]+(1-\alpha)\,\theta_2^{(T)}$$
where $E[\theta]$ represents the accumulated average of the network parameters $\theta$, $T$ represents the $T$-th round of target-domain interactive supervised learning, $\theta_1$ represents the current-round parameters of the first pre-trained student model, $\theta_2$ represents the current-round parameters of the second pre-trained student model, and $\alpha$ represents the smoothing coefficient hyperparameter.
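A minimal runnable sketch of this EMA update (PyTorch assumed; the linear layers and the value of alpha are illustrative):

    import torch
    import torch.nn as nn

    def ema_update(teacher: nn.Module, student: nn.Module, alpha: float = 0.999) -> None:
        """Parameter-wise E[theta] <- alpha * E[theta] + (1 - alpha) * theta."""
        with torch.no_grad():
            for t_p, s_p in zip(teacher.parameters(), student.parameters()):
                t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

    student = nn.Linear(8, 4)
    teacher = nn.Linear(8, 4)
    teacher.load_state_dict(student.state_dict())  # teacher starts as a copy of the student
    ema_update(teacher, student, alpha=0.999)      # called after each student gradient step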
7. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning according to claim 6, wherein the hard-and-soft pseudo-label combined supervision total loss comprises: the classification loss when supervising with hard pseudo labels, the classification loss when supervising with soft pseudo labels, the triplet loss when supervising with hard pseudo labels, and the triplet loss when supervising with soft pseudo labels;
the hard-and-soft pseudo-label combined supervision total loss is obtained according to the following formula:
$$\mathcal{L}(\theta_1,\theta_2)=(1-\lambda^t_{id})\big(\mathcal{L}^t_{id}(\theta_1)+\mathcal{L}^t_{id}(\theta_2)\big)+\lambda^t_{id}\big(\mathcal{L}^t_{sid}(\theta_1\mid\theta_2)+\mathcal{L}^t_{sid}(\theta_2\mid\theta_1)\big)+(1-\lambda^t_{tri})\big(\mathcal{L}^t_{tri}(\theta_1)+\mathcal{L}^t_{tri}(\theta_2)\big)+\lambda^t_{tri}\big(\mathcal{L}^t_{stri}(\theta_1\mid\theta_2)+\mathcal{L}^t_{stri}(\theta_2\mid\theta_1)\big)$$
where $t$ is the current training round, $\lambda^t_{id}$ represents the soft pseudo-label classification loss coefficient, $\lambda^t_{tri}$ represents the soft pseudo-label triplet loss coefficient, $\mathcal{L}^t_{id}$ represents the classification loss when supervising with hard pseudo labels, $\mathcal{L}^t_{sid}$ represents the classification loss when supervising with soft pseudo labels, $\mathcal{L}^t_{tri}$ represents the triplet loss when supervising with hard pseudo labels, and $\mathcal{L}^t_{stri}$ represents the triplet loss when supervising with soft pseudo labels;
wherein the classification loss when supervising with hard pseudo labels is:
$$\mathcal{L}^t_{id}(\theta)=\frac{1}{N_t}\sum_{i=1}^{N_t} L_{ce}\big(C^t(F(x^t_i;\theta)),\ \tilde{y}^t_i\big)$$
where $N_t$ represents the number of pictures, $L_{ce}$ represents the multi-class cross-entropy loss function, $C^t(F(x^t_i;\theta))$ represents the classification result of a picture after feature extraction and classification by a pre-trained student model, $\tilde{y}^t_i$ represents the hard pseudo label of a picture in the target-domain training set, and $x^t_i$ represents a picture in the target-domain training set;
the classification loss when supervising with soft pseudo labels is:
$$\mathcal{L}^t_{sid}(\theta_1\mid\theta_2)=-\frac{1}{N_t}\sum_{i=1}^{N_t} C^t_2\big(F(x'^t_i;E^{(T)}[\theta_2])\big)\cdot\log C^t_1\big(F(x^t_i;\theta_1)\big)$$
$$\mathcal{L}^t_{sid}(\theta_2\mid\theta_1)=-\frac{1}{N_t}\sum_{i=1}^{N_t} C^t_1\big(F(x'^t_i;E^{(T)}[\theta_1])\big)\cdot\log C^t_2\big(F(x^t_i;\theta_2)\big)$$
where $C^t_1(F(x'^t_i;E^{(T)}[\theta_1]))$ represents the classification prediction value of the first pre-trained teacher model, $C^t_2(F(x'^t_i;E^{(T)}[\theta_2]))$ represents the classification prediction value of the second pre-trained teacher model, $C^t_1(F(x^t_i;\theta_1))$ represents the classification result of a picture after feature extraction and classification by the first pre-trained student model, $C^t_2(F(x^t_i;\theta_2))$ represents the classification result of a picture after feature extraction and classification by the second pre-trained student model, and $x'^t_i$ represents $x^t_i$ with a different random data enhancement mode applied;
the triplet loss when supervising with hard pseudo labels is:
$$\mathcal{L}^t_{tri}(\theta)=\frac{1}{N_t}\sum_{i=1}^{N_t}\max\Big(0,\ \big\|F(x^t_i;\theta)-F(x^t_{i,p};\theta)\big\|+m-\big\|F(x^t_i;\theta)-F(x^t_{i,n};\theta)\big\|\Big)$$
where $\|\cdot\|$ represents the Euclidean distance, $x^t_{i,p}$ and $x^t_{i,n}$ respectively represent a positive-sample picture and a negative-sample picture of $x^t_i$, $m$ represents the margin hyperparameter, $F(x^t_i;\theta)$ represents the feature of the anchor sample of the input picture, $F(x^t_{i,p};\theta)$ represents the feature of the positive-sample picture of the input picture, and $F(x^t_{i,n};\theta)$ represents the feature of the negative-sample picture of the input picture;
the triplet loss when supervising with soft pseudo labels is:
$$\mathcal{T}_i(\theta)=\frac{\exp\big(\|F(x^t_i;\theta)-F(x^t_{i,n};\theta)\|\big)}{\exp\big(\|F(x^t_i;\theta)-F(x^t_{i,p};\theta)\|\big)+\exp\big(\|F(x^t_i;\theta)-F(x^t_{i,n};\theta)\|\big)}$$
$$\mathcal{L}^t_{stri}(\theta_1\mid\theta_2)=\frac{1}{N_t}\sum_{i=1}^{N_t} L_{bce}\big(\mathcal{T}_i(\theta_1),\ \mathcal{T}_i(E^{(T)}[\theta_2])\big)$$
$$\mathcal{L}^t_{stri}(\theta_2\mid\theta_1)=\frac{1}{N_t}\sum_{i=1}^{N_t} L_{bce}\big(\mathcal{T}_i(\theta_2),\ \mathcal{T}_i(E^{(T)}[\theta_1])\big)$$
where $L_{bce}(p,q)=-q\log p-(1-q)\log(1-p)$ represents the binary cross-entropy loss function.
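For illustration, a runnable sketch (PyTorch assumed; all tensors are random placeholders rather than real model outputs) of the two soft-label terms: the soft classification loss as a cross entropy against the peer teacher's predictions, and the soft triplet loss as a binary cross entropy between the student's and teacher's softmax-triplet statistics.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    n, c = 16, 10                                  # toy batch size and pseudo-class count
    student_logits = torch.randn(n, c, requires_grad=True)
    teacher_logits = torch.randn(n, c)             # from the peer teacher model (no gradient)

    # Soft classification loss: -(1/N) * sum( teacher_prob * log student_prob ).
    soft_id = -(teacher_logits.softmax(dim=1)
                * student_logits.log_softmax(dim=1)).sum(dim=1).mean()

    # Softmax-triplet statistic T = exp(d_an) / (exp(d_ap) + exp(d_an)), student and teacher.
    d_ap_s, d_an_s = torch.rand(n), torch.rand(n)  # student anchor-positive / anchor-negative distances
    d_ap_t, d_an_t = torch.rand(n), torch.rand(n)  # the same distances under the teacher
    t_student = torch.exp(d_an_s) / (torch.exp(d_ap_s) + torch.exp(d_an_s))
    t_teacher = torch.exp(d_an_t) / (torch.exp(d_ap_t) + torch.exp(d_an_t))

    # Soft triplet loss: binary cross entropy with the teacher's statistic as soft target.
    soft_tri = F.binary_cross_entropy(t_student, t_teacher)
    (soft_id + soft_tri).backward()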
8. The unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning of claim 7, wherein the contrast learning total loss comprises a global contrast loss and a local contrast loss, wherein,
the global contrast loss is calculated as follows:
$$\mathcal{L}^{global}_{q_i}=-\log\frac{\exp\big(\langle q_i,\,f^{+}\rangle/\tau\big)}{\sum_{k=1}^{N_c}\exp\big(\langle q_i,\,f^{*}_{c_k}\rangle/\tau\big)+\sum_{k=1}^{N_0}\exp\big(\langle q_i,\,f_k\rangle/\tau\big)}$$
where $N_c$ represents the number of clustered sample features in the current memory library, $N_0$ represents the number of un-clustered sample features in the current memory library, $q_i$ represents the $i$-th feature of the currently input mini-batch of pictures, $f^{+}$ represents the positive class prototype of feature $q_i$ in the picture feature memory library, $\tau$ is the temperature hyperparameter, $\langle\cdot,\cdot\rangle$ represents the inner product between two feature vectors measuring their similarity, $f^{*}_{c_k}$ represents the clustered sample features, and $f_k$ represents all un-clustered sample features in the picture feature memory library;
the local contrast loss is calculated as follows:
[local contrast loss formula rendered as an image in the original publication; not reproduced]
where $y_i$ and $y_j$ respectively represent the labels of features $q_i$ and $q_j$ in the currently input mini-batch of pictures, and $B$ represents the number of pictures in the currently input mini-batch.
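A runnable sketch of the global contrast loss over a memory library holding both clustered class centers and un-clustered instance features (PyTorch assumed; the sizes, tau and the random positive assignments are placeholders, and every query is assumed to belong to some cluster).

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    tau, d = 0.05, 128
    q = F.normalize(torch.randn(32, d, requires_grad=True), dim=1)  # mini-batch features q_i
    centroids = F.normalize(torch.randn(50, d), dim=1)  # clustered class centers (N_c = 50)
    outliers = F.normalize(torch.randn(20, d), dim=1)   # un-clustered instance features (N_0 = 20)
    pos_idx = torch.randint(0, 50, (32,))               # index of each q_i's positive prototype f+

    # Inner-product similarities over all N_c + N_0 memory entries, scaled by tau.
    logits = torch.cat([q @ centroids.t(), q @ outliers.t()], dim=1) / tau
    # Cross entropy against the positive prototype index reproduces
    # -log( exp(<q_i, f+>/tau) / sum_k exp(<q_i, .>/tau) ) averaged over the batch.
    global_loss = F.cross_entropy(logits, pos_idx)
    global_loss.backward()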
CN202211372036.3A 2022-11-03 2022-11-03 Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning Pending CN115641613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211372036.3A CN115641613A (en) 2022-11-03 2022-11-03 Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211372036.3A CN115641613A (en) 2022-11-03 2022-11-03 Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning

Publications (1)

Publication Number Publication Date
CN115641613A true CN115641613A (en) 2023-01-24

Family

ID=84946978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211372036.3A Pending CN115641613A (en) 2022-11-03 2022-11-03 Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning

Country Status (1)

Country Link
CN (1) CN115641613A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325223A (en) * 2018-12-13 2020-06-23 中国电信股份有限公司 Deep learning model training method and device and computer readable storage medium
CN111325223B (en) * 2018-12-13 2023-10-24 中国电信股份有限公司 Training method and device for deep learning model and computer readable storage medium
CN116824695A (en) * 2023-06-07 2023-09-29 南通大学 Pedestrian re-identification non-local defense method based on feature denoising
CN117115641A (en) * 2023-07-20 2023-11-24 中国科学院空天信息创新研究院 Building information extraction method and device, electronic equipment and storage medium
CN117115641B (en) * 2023-07-20 2024-03-22 中国科学院空天信息创新研究院 Building information extraction method and device, electronic equipment and storage medium
CN116912535A (en) * 2023-09-08 2023-10-20 中国海洋大学 Unsupervised target re-identification method, device and medium based on similarity screening
CN116912535B (en) * 2023-09-08 2023-11-28 中国海洋大学 Unsupervised target re-identification method, device and medium based on similarity screening
CN117351522A (en) * 2023-12-06 2024-01-05 云南联合视觉科技有限公司 Pedestrian re-recognition method based on style injection and cross-view difficult sample mining
CN117556866A (en) * 2024-01-09 2024-02-13 南开大学 Data domain adaptation network construction method of passive domain diagram
CN117556866B (en) * 2024-01-09 2024-03-29 南开大学 Data domain adaptation network construction method of passive domain diagram
CN117993468A (en) * 2024-04-03 2024-05-07 杭州海康威视数字技术股份有限公司 Model training method and device, storage medium and electronic equipment
CN117993468B (en) * 2024-04-03 2024-06-28 杭州海康威视数字技术股份有限公司 Model training method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN115641613A (en) Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
Ren et al. Meta-learning for semi-supervised few-shot classification
Jing et al. Videossl: Semi-supervised learning for video classification
US10671853B2 (en) Machine learning for identification of candidate video insertion object types
CN112069929B (en) Unsupervised pedestrian re-identification method and device, electronic equipment and storage medium
CN111401281B (en) Unsupervised pedestrian re-identification method and system based on deep clustering and sample learning
CN112396027B (en) Vehicle re-identification method based on graph convolution neural network
CN110728294A (en) Cross-domain image classification model construction method and device based on transfer learning
CN108491766B (en) End-to-end crowd counting method based on depth decision forest
CN113111814B (en) Regularization constraint-based semi-supervised pedestrian re-identification method and device
WO2022062419A1 (en) Target re-identification method and system based on non-supervised pyramid similarity learning
CN109753897B (en) Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
CN116935447B (en) Self-adaptive teacher-student structure-based unsupervised domain pedestrian re-recognition method and system
CN111967325A (en) Unsupervised cross-domain pedestrian re-identification method based on incremental optimization
CN114283350A (en) Visual model training and video processing method, device, equipment and storage medium
Wu et al. An end-to-end exemplar association for unsupervised person re-identification
CN117152459B (en) Image detection method, device, computer readable medium and electronic equipment
CN105701516B (en) A kind of automatic image marking method differentiated based on attribute
US20230023164A1 (en) Systems and methods for rapid development of object detector models
CN115471739A (en) Cross-domain remote sensing scene classification and retrieval method based on self-supervision contrast learning
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN114299362A (en) Small sample image classification method based on k-means clustering
CN112183464A (en) Video pedestrian identification method based on deep neural network and graph convolution network
CN115761408A (en) Knowledge distillation-based federal domain adaptation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination