CN117830616A - Remote sensing image unsupervised cross-domain target detection method based on progressive pseudo tag - Google Patents

Remote sensing image unsupervised cross-domain target detection method based on progressive pseudo tag

Info

Publication number
CN117830616A
CN117830616A (application CN202311807723.8A)
Authority
CN
China
Prior art keywords
image
domain
network
pseudo tag
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311807723.8A
Other languages
Chinese (zh)
Inventor
耿杰
齐浩
陈文会
蒋雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202311807723.8A priority Critical patent/CN117830616A/en
Publication of CN117830616A publication Critical patent/CN117830616A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V 10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N 3/088 — Non-supervised learning, e.g. competitive learning
    • G06V 10/82 — Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/10 — Terrestrial scenes
    • Y02T 10/40 — Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a remote sensing image unsupervised cross-domain target detection method based on a progressive pseudo tag. It mainly addresses two problems: most unsupervised domain-adaptive target detection methods are built on the two-stage Faster R-CNN, which is less efficient than single-stage detectors, and existing unsupervised cross-domain target detection methods achieve poor detection accuracy when applied directly to multi-platform remote sensing imagery. When model training starts, the method filters out low-quality pseudo tags with a deliberately high threshold, owing to the domain gap between the source domain and the target domain. As training proceeds, the student network adaptively learns domain-invariant features between the two domains through multi-scale image-level contrastive domain adaptation and instance-level contrastive domain adaptation, which improves the quality of the pseudo tags generated by the teacher network. In the invention, a nonlinear function is used as the weight of the threshold to help the teacher network generate suitable pseudo tags, thereby improving the performance of the cross-domain target detection model.

Description

Remote sensing image unsupervised cross-domain target detection method based on progressive pseudo tag
Technical Field
The invention relates to the field of computer vision, in particular to an image unsupervised cross-domain target detection method.
Background
Performance degradation in cross-domain detection of multi-platform remote sensing images stems from domain distribution shifts between different platforms. Such shifts manifest as differences in viewing angle, illumination, resolution, and so on, making it difficult to generalize a model to a new platform in the object detection task. Research on unsupervised domain-adaptive target detection aims to solve this domain distribution shift problem. The three main approaches are adversarial-learning-based methods, pseudo-tag-based self-training methods, and image-translation-based methods. Adversarial-learning-based domain-adaptive target detection addresses the distribution gap between the source domain and the target domain by introducing a domain discriminator to assist in training the detection model. The task of the domain discriminator is to determine whether the input features come from the source domain or the target domain, while the detector is trained to generate features that fool the domain discriminator, so that it cannot reliably distinguish source-domain features from target-domain features. When the detector and the domain discriminator reach a dynamic equilibrium, the detector produces domain-invariant features, thereby enhancing performance on the target domain. Pseudo-tag-based self-training is an iterative training strategy: an object detection model is first trained on the source-domain data; this model then predicts on the target-domain data, and its high-confidence predictions are kept as pseudo tags; the target-domain data and their corresponding pseudo tags are then used to train the target-domain model. Because the pseudo tags become increasingly accurate, this iterative training progressively improves the performance of the target-domain model.
The image-to-image translation approach converts the target-domain images into the style of the source-domain images, or the source-domain images into the style of the target-domain images, by introducing an additional translation model; this helps reduce the visual distribution gap between the source domain and the target domain. For example, translation from the target domain to the source domain can be achieved with a Generative Adversarial Network (GAN), so that the target-domain images are distributed more closely to the source-domain images; in this way the performance of the detector on the target domain can be improved. However, most unsupervised domain-adaptive target detection methods are based on the two-stage Faster R-CNN, and two-stage detectors are inefficient compared with single-stage detectors. Existing unsupervised contrastive domain adaptation methods from general computer vision give poor detection accuracy when applied directly to multi-platform remote sensing imagery, as do existing methods that perform unsupervised domain adaptation with the mean teacher network from semi-supervised learning.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a remote sensing image unsupervised cross-domain target detection method based on a progressive pseudo tag.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
step 1: generating a pseudo tag on the target domain data by utilizing a pre-trained average teacher network;
D_S and D_T are image training datasets, where D_S is the source-domain image set, which carries labels, and D_T is the target-domain image set, which carries no labels; both the teacher network and the student network in the average teacher network use the single-stage detector YOLOv5; the student network differs from the teacher network in that its 4th, 5th and 6th convolution layers are each followed by a gradient reversal layer and a domain discriminator; the average teacher network uses CSPDarknet53 for feature extraction;
Step 1-1: apply a weak augmentation to D_T; the weak augmentation takes D_T as input and applies random horizontal flipping and cropping in sequence;
Step 1-2: input the weakly augmented result of step 1-1 into the teacher network to obtain predictions on D_T, namely the predicted bounding-box coordinates and classification results; the teacher network's predictions on D_T serve as the pseudo tags;
step 1-3: setting a pseudo tag dynamic optimization strategy to filter the pseudo tag;
step 2: training a student network using the pseudo tag;
Step 2-1: apply a strong augmentation to D_T; the strong augmentation applies random color jitter, grayscale conversion, and Gaussian blur in sequence;
Step 2-2: use the predicted bounding-box coordinates and classification results retained after the pseudo tag filtering of step 1-3 as the pseudo tags of the strongly augmented D_T obtained in step 2-1;
Step 2-3: input D_S together with the D_T processed in step 2-2 into the student network;
Step 2-4: D_S and D_T are passed through the student network: the 4th convolution layer outputs shallow features f_S1 and f_T1, the 5th convolution layer outputs mid-level features f_S2 and f_T2, and the 6th convolution layer outputs high-level features f_S3 and f_T3; the 4th, 5th and 6th convolution layers are each followed by a gradient reversal layer and a domain discriminator; f_S1, f_S2, f_S3 and f_T1, f_T2, f_T3 are fed into the corresponding gradient reversal layers and domain discriminators, whose outputs are f′_S1, f′_S2, f′_S3 and f′_T1, f′_T2, f′_T3; the domain discriminator distinguishes the domain labels, and the image-level adversarial domain adaptation loss is as follows:
L_img^(s) = -Σ_n Σ_(h,w) [ (1 - D_n)·log p_n^(h,w) + D_n·log(1 - p_n^(h,w)) ],  s = 1, 2, 3
where L_img^(1) denotes the image-level adversarial domain adaptation loss on the shallow features, L_img^(2) that on the mid-level features, and L_img^(3) that on the high-level features; D_n is the domain label of the nth training image, 0 if the input comes from D_S and 1 if it comes from D_T; and p_n^(h,w) is the predicted probability that the point at position (h, w) of the nth training image comes from D_S;
step 3: realizing domain self-adaption in a student network by utilizing contrast learning;
Step 3-1: the outputs f′_S1, f′_S2, f′_S3 and f′_T1, f′_T2, f′_T3 of step 2-4 are passed through an FPN network to obtain the fused image feature f_S of D_S and the fused image feature f_T of D_T;
Step 3-2: f_S is partitioned into classes according to the true labels of D_S, and f_T according to the pseudo tags generated in step 1-3; among the features f_S and f_T of step 3-1, image features of the same class are taken as positive samples and image features of different classes as negative samples; f_ijk denotes the jth image feature vector of the ith image, belonging to class k; the image-level contrast loss L_InfoNCE is then constructed;
Step 4: training an average teacher network using the total loss function; the total loss L is the detector loss plus weighted image level contrast loss plus image level contrast domain adaptation at three scalesLambda super parameter takes value of 0.1, L det As a loss function of yolov5 detector, L InfoNCE Loss for image level contrast; the student network updates the parameters of the teacher network through the index moving average EMA;
θ_t ← α·θ_t + (1 - α)·θ_s
where θ_t and θ_s are the parameters of the teacher network and the student network respectively, and α is the EMA decay rate, set to 0.999.
Further, in step 1-3, the pseudo tag dynamic optimization strategy δ is set as follows: δ decreases nonlinearly from 1 to 0.5 with the training epoch, where e_t is the current training epoch, E is the total number of training epochs, and ε is a hyper-parameter;
δ is used in the teacher network as a weight on the confidence threshold for filtering low-quality pseudo tags.
Furthermore, in the pseudo tag dynamic optimization strategy δ, ε takes the value 0.5, the confidence threshold at the start of training is 0.8, the confidence threshold of the current training epoch is 0.8·δ, and this weighted threshold decreases to 0.4 as training progresses; the teacher network outputs as pseudo tags only the predictions whose confidence exceeds 0.8·δ.
Further, in step 3-2, the image-level contrast loss L_InfoNCE is constructed as follows: K is the number of categories; G_S^i denotes the set of image features of the ith source-domain image, and f_ijk^S, the jth feature vector of the ith source-domain image, corresponds to category k; the positive sample set consists of image features of the same category k whose predicted probability value is greater than the positive-sample threshold δ_S, with δ_S taking the value 0.5; the negative sample set has two parts: image features whose category is not k, and image features of category k whose predicted probability is less than the positive-sample threshold δ_S.
G_T^i denotes the set of image features of the ith target-domain image, and f_ijk^T, the jth feature vector of the ith target-domain image, corresponds to category k; since the target-domain images carry no label information, the pseudo tags generated by the teacher network in step 1-3 are used as labels: image features of category k whose predicted probability value is greater than the positive-sample threshold δ_T = 0.5 constitute the positive samples, and the negative sample set again has two parts: image features whose category is not k, and image features of category k whose predicted probability is less than the positive-sample threshold δ_T.
The invention has the beneficial effects that: the common class, automobiles, of the VisDrone dataset and the DIOR dataset is used as the detection target, and the performance evaluation index is mAP, the precision measure commonly used in target detection tasks; compared with the mAPs of four other cross-domain target detection methods, the method of the invention achieves the best detection effect in cross-domain target detection from the VisDrone dataset to the DIOR dataset;
TABLE 1

Method                                 VisDrone→DIOR mAP (%)
The method adopted by the invention    55.9
ConfMIX                                49.6
SSDA                                   46.3
AcroFOD                                46.8
MS-DA                                  52.7
The mAP of MS-DA, the most accurate existing algorithm, is 52.7%, and the method of the invention raises this to 55.9%, an improvement of 3.2 percentage points. This demonstrates the good cross-domain target detection performance of the model.
Drawings
FIG. 1 is a schematic diagram of a detection model of the present invention;
fig. 2 is a cross-domain detection flow diagram.
Detailed Description
The invention will be further described with reference to the drawings and examples.
Step 1: generating pseudo tags on target domain data using a pretrained average teacher network
D_S and D_T are image training datasets, where D_S is the source-domain image set, which carries labels, and D_T is the target-domain image set, which carries no labels; the backbone of both the teacher network and the student network in the average teacher network is the single-stage detector YOLOv5; the student network differs from the teacher network in that its 4th, 5th and 6th convolution layers are each followed by a gradient reversal layer and a domain discriminator; the average teacher network uses CSPDarknet53 as the backbone feature extractor;
Step 1-1: apply a weak augmentation to D_T; the weak augmentation takes D_T as input and applies random horizontal flipping and cropping;
Step 1-2: inputting the result obtained after the weak enhancement operation in the step 1-1 into a teacher network to obtain a result in D T Upper prediction result, D T The predicted result is the predicted boundary frame coordinates and the predicted classification result; network teacher in D T The predicted result is used as a pseudo tag;
step 1-3: setting a dynamic optimization strategy of the pseudo tag, and improving the quality of the pseudo tag;
the method for setting the dynamic optimization strategy of the pseudo tag comprises the following steps:
wherein δ decreases nonlinearly from 1 to 0.5 with the training epoch; e_t is the current training epoch, E is the total number of training epochs, and ε is a hyper-parameter;
δ is used in the teacher network as a weight on the confidence threshold for filtering low-quality pseudo tags;
ε takes the value 0.5, the confidence threshold at the start of training is 0.8, the confidence threshold of the current training epoch is 0.8·δ, and this weighted threshold decreases to 0.4 as training progresses; the teacher network outputs as pseudo tags only the predictions whose confidence exceeds 0.8·δ.
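The threshold schedule above can be sketched in code. Because the patent text reproduces the nonlinear function only by its behaviour (δ falls from 1 to 0.5, so the threshold 0.8·δ falls from 0.8 to 0.4), the decay form below, δ = 1 − 0.5·(e_t/E)^ε, is a hypothetical choice that merely matches those stated endpoints; the function names are likewise illustrative.

```python
def delta(e_t, total_epochs, eps=0.5):
    # Hypothetical nonlinear decay with the stated endpoints:
    # delta(0) = 1.0 and delta(total_epochs) = 0.5. The patent's exact
    # formula is given only as an image, so this is one schedule
    # consistent with the description, not the patented formula itself.
    return 1.0 - 0.5 * (e_t / total_epochs) ** eps

def confidence_threshold(e_t, total_epochs, base=0.8, eps=0.5):
    # The teacher keeps a prediction as a pseudo tag only if its confidence
    # exceeds base * delta: 0.8 at the start of training, 0.4 at the end.
    return base * delta(e_t, total_epochs, eps)
```

Early in training only very confident pseudo tags survive; as the student learns domain-invariant features and the teacher improves, the threshold relaxes and more pseudo tags are admitted.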
Step 2: training a student network using the pseudo tag;
the student network adopts a yolov5 single-stage target detection algorithm commonly used in target detection, and uses CSPDarknet53 as the feature extraction of a backbone network;
CSPDarknet53 is a deep convolutional neural network composed of a series of convolution layers, pooling layers and residual connections; these layers form a feature extraction structure that progressively extracts feature information from the input image. YOLOv5 extracts shallow, mid-level and high-level image features through CSPDarknet53; information from the different layers is fused by an FPN network, and the prediction result is finally obtained at the YOLOv5 output head;
gradient Reversal Layer gradient inversion layer allows the gradient direction to be automatically inverted during counter-propagation; the domain arbiter uses the input features to determine whether the sample is from a source domain or a target domain;
Step 2-1: apply a strong augmentation to D_T; the strong augmentation applies random color jitter, grayscale conversion, and Gaussian blur in sequence;
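The weak augmentation of step 1-1 and the strong augmentation above can be sketched on toy pixel lists; this shows one weak operation (random horizontal flip) and one strong operation (grayscale conversion). A real pipeline would also implement cropping, color jitter and Gaussian blur through an image library; the function names and the luminance weights are illustrative assumptions.

```python
import random

def weak_augment(img, rng):
    # Weak augmentation feeding the teacher: random horizontal flip
    # (the random-crop step is omitted here for brevity).
    if rng.random() < 0.5:
        img = [row[::-1] for row in img]
    return img

def to_grayscale(img):
    # One of the strong augmentations feeding the student:
    # RGB triples -> single luminance values (ITU-R BT.601 weights).
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in img]
```

The asymmetry is deliberate: the teacher sees a mildly perturbed image and produces stable pseudo tags, while the student must match them under much harsher perturbations.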
Step 2-2: taking predicted boundary frame coordinates and predicted classification results obtained after pseudo tag filtering in the step 1-3 as D after the step 2-1 strong enhancement operation T Is a pseudo tag of (2);
Step 2-3: input D_S together with the D_T processed in step 2-2 into the student network;
Step 2-4: the 4th convolution layer of the student network outputs shallow features f_1, the 5th convolution layer outputs mid-level features f_2, and the 6th convolution layer outputs high-level features f_3; the three convolution layers are each followed by a gradient reversal layer and a domain discriminator; f_1, f_2, f_3 are fed into the corresponding gradient reversal layers, and the domain discriminators output f′_1, f′_2, f′_3; the domain discriminator distinguishes the domain labels, and the image-level adversarial domain adaptation loss is as follows:
L_img^(s) = -Σ_n Σ_(h,w) [ (1 - D_n)·log p_n^(h,w) + D_n·log(1 - p_n^(h,w)) ],  s = 1, 2, 3
where L_img^(1) denotes the image-level adversarial domain adaptation loss on the shallow features, L_img^(2) that on the mid-level features, and L_img^(3) that on the high-level features; D_n is the domain label of the nth training image, 0 if the input comes from D_S and 1 if it comes from D_T; and p_n^(h,w) is the predicted probability that the point at position (h, w) of the nth training image comes from D_S;
in a classical average teacher network, since only a source domain image has tag information, the learned characteristics of the teacher network and a student network are easily biased to the characteristics of the source domain image; the multi-scale resistance learning is introduced into a student network in an average teacher network, and domain invariant features of a source domain image and a target domain image are learned; by the method, domain offset phenomenon can be effectively relieved, and the performance of cross-domain target detection is improved;
step 3: realizing domain self-adaption in a student network by utilizing contrast learning;
step 3-1: f output in step 2-4 S ' 1 ,f S ' 2 ,f S ' 3 And f T ' 1 ,f T ' 2 ,f T ' 3 Respectively obtaining D after passing through FPN network S Is a fused image feature f S And D T Is a fused image feature f T ;;
Step 3-2: f_S is partitioned into classes according to the true labels of D_S, and f_T according to the pseudo tags generated in step 1-3; among the features f_S and f_T of step 3-1, image features of the same class are taken as positive samples and image features of different classes as negative samples; f_ijk denotes the jth image feature vector of the ith image, belonging to class k; from these, the image-level contrastive domain adaptation loss is constructed.
The image-level contrast loss L_InfoNCE is defined as follows:
K is the number of categories; G_S^i denotes the set of image features of the ith source-domain image, and f_ijk^S, the jth feature vector of the ith source-domain image, corresponds to category k; the positive sample set consists of image features of the same category k whose predicted probability value is greater than the positive-sample threshold δ_S, with δ_S taking the value 0.5; the negative sample set has two parts: image features whose category is not k, and image features of category k whose predicted probability is less than the positive-sample threshold δ_S.
G_T^i denotes the set of image features of the ith target-domain image, and f_ijk^T, the jth feature vector of the ith target-domain image, corresponds to category k; since the target-domain images carry no label information, the pseudo tags generated by the teacher network in step 1-3 are used as labels: image features of category k whose predicted probability value is greater than the positive-sample threshold δ_T = 0.5 constitute the positive samples, and the negative sample set again has two parts: image features whose category is not k, and image features of category k whose predicted probability is less than the positive-sample threshold δ_T.
To further address the image-level domain shift problem in cross-domain target detection, contrastive learning at the image-feature level pulls features of the same category from the source-domain and target-domain images closer together, reducing the intra-class difference between source and target image features and improving the performance of cross-domain target detection;
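An InfoNCE-style term over the positive and negative sets defined above can be sketched for a single anchor feature. The patent text specifies only how the sets are built, so the dot-product similarity and the temperature tau below are assumed details, and the function name is illustrative:

```python
import math

def info_nce(anchor, positives, negatives, tau=0.1):
    """InfoNCE-style contrastive loss for one anchor feature vector.

    positives: features of the same class as the anchor (prediction above
    the positive-sample threshold); negatives: all other features.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    loss = 0.0
    for pos in positives:
        num = math.exp(dot(anchor, pos) / tau)
        den = num + sum(math.exp(dot(anchor, neg) / tau) for neg in negatives)
        # Small when the anchor is much closer to its positives than to
        # any negative; large otherwise.
        loss -= math.log(num / den)
    return loss / len(positives)
```

Summing this over every source and target anchor (with labels on the source side and pseudo tags on the target side) yields the image-level contrast loss L_InfoNCE of the total objective.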
Step 4: train the average teacher network with the total loss function; the total loss L is the detector loss plus the weighted image-level contrast loss plus the image-level adversarial domain adaptation losses at the three scales: L = L_det + λ·L_InfoNCE + L_img^(1) + L_img^(2) + L_img^(3), where the hyper-parameter λ takes the value 0.1 and L_det is the loss function of the YOLOv5 detector; the student network updates the parameters of the teacher network through an exponential moving average (EMA);
θ_t ← α·θ_t + (1 - α)·θ_s
where θ_t and θ_s are the parameters of the teacher network and the student network respectively, and α is the EMA decay rate, set to 0.999.
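The EMA update is a one-line element-wise rule; sketched over flat parameter lists (a framework implementation would iterate over parameter tensors instead):

```python
def ema_update(theta_teacher, theta_student, alpha=0.999):
    # theta_t <- alpha * theta_t + (1 - alpha) * theta_s, element-wise.
    # With alpha = 0.999 the teacher is a slowly moving average of the
    # student, which stabilizes the pseudo tags it generates.
    return [alpha * t + (1 - alpha) * s
            for t, s in zip(theta_teacher, theta_student)]
```

Only the student receives gradients; the teacher is never trained directly, it just trails the student through this update.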

Claims (4)

1. A remote sensing image non-supervision cross-domain target detection method based on progressive pseudo tags is characterized by comprising the following steps of: the method comprises the following steps:
step 1: generating a pseudo tag on the target domain data by utilizing a pre-trained average teacher network;
D_S and D_T are image training datasets, where D_S is the source-domain image set, which carries labels, and D_T is the target-domain image set, which carries no labels; both the teacher network and the student network in the average teacher network use the single-stage detector YOLOv5; the student network differs from the teacher network in that its 4th, 5th and 6th convolution layers are each followed by a gradient reversal layer and a domain discriminator; the average teacher network uses CSPDarknet53 for feature extraction;
Step 1-1: apply a weak augmentation to D_T; the weak augmentation takes D_T as input and applies random horizontal flipping and cropping in sequence;
Step 1-2: input the weakly augmented result of step 1-1 into the teacher network to obtain predictions on D_T, namely the predicted bounding-box coordinates and classification results; the teacher network's predictions on D_T serve as the pseudo tags;
step 1-3: setting a pseudo tag dynamic optimization strategy to filter the pseudo tag;
step 2: training a student network using the pseudo tag;
Step 2-1: apply a strong augmentation to D_T; the strong augmentation applies random color jitter, grayscale conversion, and Gaussian blur in sequence;
Step 2-2: use the predicted bounding-box coordinates and classification results retained after the pseudo tag filtering of step 1-3 as the pseudo tags of the strongly augmented D_T obtained in step 2-1;
Step 2-3: input D_S together with the D_T processed in step 2-2 into the student network;
Step 2-4: D_S and D_T are passed through the student network: the 4th convolution layer outputs shallow features f_S1 and f_T1, the 5th convolution layer outputs mid-level features f_S2 and f_T2, and the 6th convolution layer outputs high-level features f_S3 and f_T3; the 4th, 5th and 6th convolution layers are each followed by a gradient reversal layer and a domain discriminator; f_S1, f_S2, f_S3 and f_T1, f_T2, f_T3 are fed into the corresponding gradient reversal layers and domain discriminators, whose outputs are f′_S1, f′_S2, f′_S3 and f′_T1, f′_T2, f′_T3; the domain discriminator distinguishes the domain labels, and the image-level adversarial domain adaptation loss is as follows:
L_img^(s) = -Σ_n Σ_(h,w) [ (1 - D_n)·log p_n^(h,w) + D_n·log(1 - p_n^(h,w)) ],  s = 1, 2, 3
where L_img^(1) denotes the image-level adversarial domain adaptation loss on the shallow features, L_img^(2) that on the mid-level features, and L_img^(3) that on the high-level features; D_n is the domain label of the nth training image, 0 if the input comes from D_S and 1 if it comes from D_T; and p_n^(h,w) is the predicted probability that the point at position (h, w) of the nth training image comes from D_S;
step 3: realizing domain self-adaption in a student network by utilizing contrast learning;
step 3-1: f 'output in step 2-4' S1 ,f′ S2 ,f′ S3 And f' T1 ,f′ T2 ,f′ T3 Respectively obtaining D after passing through FPN network S Is a fused image feature f S And D T Is a fused graph of (1)Image feature f T
Step 3-2: f_S is partitioned into classes according to the true labels of D_S, and f_T according to the pseudo tags generated in step 1-3; among the features f_S and f_T of step 3-1, image features of the same class are taken as positive samples and image features of different classes as negative samples; f_ijk denotes the jth feature vector of the ith image, belonging to class k; the image-level contrast loss L_InfoNCE is then constructed;
Step 4: training an average teacher network using the total loss function; the total loss L is the detector loss plus weighted image level contrast loss plus image level contrast domain adaptation at three scalesLambda super parameter takes value of 0.1, L det As a loss function of yolov5 detector, L InfoNCE Loss for image level contrast; the student network updates the parameters of the teacher network through the index moving average EMA;
θ_t ← α·θ_t + (1 - α)·θ_s
where θ_t and θ_s are the parameters of the teacher network and the student network respectively, and α is the EMA decay rate, set to 0.999.
2. The method for detecting the remote sensing image unsupervised cross-domain target based on the progressive pseudo tag according to claim 1, wherein the method comprises the following steps: in the step 1-3, the method for setting the dynamic optimization strategy delta of the pseudo tag is as follows:
wherein δ is a function of the training progress e, decreasing nonlinearly from 1 to 0.5; e = e_t / E, where e_t is the current training period and E is the total number of training periods; ε is a hyperparameter;
δ is used in the teacher network as a weight on the confidence threshold for filtering out low-quality pseudo tags.
3. The remote sensing image unsupervised cross-domain target detection method based on the progressive pseudo tag according to claim 2, wherein: in the setting of the pseudo tag dynamic optimization strategy δ, ε takes the value 0.5; the confidence threshold at the start of training is 0.8, the confidence threshold of the current training period is 0.8·δ, and it decreases to 0.4 as the training period increases; the outputs of the teacher network whose confidence exceeds 0.8·δ are taken as pseudo tags.
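A minimal sketch of the progressive filtering described in claims 2 and 3. The decay formula below is a hypothetical stand-in: the claims state only that δ falls nonlinearly from 1 to 0.5 with ε = 0.5, not its exact expression, so only the endpoints and the 0.8·δ filtering rule follow the text:

```python
def delta_schedule(e_t, total_epochs, eps=0.5):
    # Hypothetical nonlinear decay from 1.0 (start of training) to 0.5
    # (end of training); the claims do not give the exact formula.
    progress = e_t / total_epochs
    return 0.5 + 0.5 * (1.0 - progress) ** (1.0 / eps)

def filter_pseudo_labels(detections, delta, base_thresh=0.8):
    # Keep teacher outputs whose confidence exceeds 0.8 * delta; the
    # effective threshold relaxes from 0.8 toward 0.4 as delta decays.
    return [d for d in detections if d["conf"] > base_thresh * delta]
```

Early in training only very confident detections survive; later, the relaxed threshold admits more pseudo tags as the teacher improves.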
4. The remote sensing image unsupervised cross-domain target detection method based on the progressive pseudo tag according to claim 1, wherein:
in step 3-2, the image-level contrastive loss L_InfoNCE is constructed as follows:
K is the number of categories; the image features of the ith source-domain image form a feature set, and f_{ijk}^S denotes the jth feature vector of the ith source-domain image belonging to category k; the positive sample set consists of the image features of the same category k whose predicted probability value is greater than the positive-sample threshold δ_S, where δ_S takes the value 0.5; the negative sample set consists of two parts: image features whose category is not k, and image features of category k whose predicted probability is less than the positive-sample threshold δ_S;
likewise, the image features of the ith target-domain image form a feature set, and f_{ijk}^T denotes the jth feature vector of the ith target-domain image belonging to category k; since the target-domain images have no label information, the pseudo tags generated by the teacher network in step 1-3 are used as labels; the image features of category k whose predicted probability value is greater than the positive-sample threshold δ_T = 0.5 form the positive sample set; the negative sample set again consists of two parts: image features whose category is not k, and image features of category k whose predicted probability is less than the positive-sample threshold δ_T.
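The positive/negative partition and a per-anchor InfoNCE term of claim 4 can be sketched as follows. This is pure Python with cosine similarity and a temperature tau = 0.07 chosen for illustration; the patent's exact loss expression is not reproduced:

```python
import math

def split_samples(anchor_class, feats, thresh=0.5):
    # feats: list of (feature, class_id, pred_prob). Positives share the
    # anchor's category k AND exceed the probability threshold (0.5 in
    # the claims); every other feature is a negative.
    pos = [f for f, c, p in feats if c == anchor_class and p > thresh]
    neg = [f for f, c, p in feats if c != anchor_class or p <= thresh]
    return pos, neg

def info_nce(anchor, positives, negatives, tau=0.07):
    # Per-anchor InfoNCE: -log(pos_mass / (pos_mass + neg_mass)), where
    # each mass is a sum of exp(cos_sim / tau) over the sample set.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    pos = sum(math.exp(cos(anchor, p) / tau) for p in positives)
    neg = sum(math.exp(cos(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))
```

Minimizing this loss pulls same-category features of the two domains together while pushing apart different categories and low-confidence same-category features.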
CN202311807723.8A 2023-12-26 2023-12-26 Remote sensing image unsupervised cross-domain target detection method based on progressive pseudo tag Pending CN117830616A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311807723.8A CN117830616A (en) 2023-12-26 2023-12-26 Remote sensing image unsupervised cross-domain target detection method based on progressive pseudo tag

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311807723.8A CN117830616A (en) 2023-12-26 2023-12-26 Remote sensing image unsupervised cross-domain target detection method based on progressive pseudo tag

Publications (1)

Publication Number Publication Date
CN117830616A true CN117830616A (en) 2024-04-05

Family

ID=90520441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311807723.8A Pending CN117830616A (en) 2023-12-26 2023-12-26 Remote sensing image unsupervised cross-domain target detection method based on progressive pseudo tag

Country Status (1)

Country Link
CN (1) CN117830616A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118097759A (en) * 2024-04-23 2024-05-28 齐鲁工业大学(山东省科学院) Cross-domain face counterfeiting detection method based on double-branch collaborative learning


Similar Documents

Publication Publication Date Title
CN110555475A (en) few-sample target detection method based on semantic information fusion
CN104462494B (en) A kind of remote sensing image retrieval method and system based on unsupervised feature learning
CN106446896A (en) Character segmentation method and device and electronic equipment
CN104866810A (en) Face recognition method of deep convolutional neural network
CN113486764B (en) Pothole detection method based on improved YOLOv3
CN110991257B (en) Polarized SAR oil spill detection method based on feature fusion and SVM
CN111783841A (en) Garbage classification method, system and medium based on transfer learning and model fusion
CN117830616A (en) Remote sensing image unsupervised cross-domain target detection method based on progressive pseudo tag
CN112052817A (en) Improved YOLOv3 model side-scan sonar sunken ship target automatic identification method based on transfer learning
CN109543585A (en) Underwater optics object detection and recognition method based on convolutional neural networks
CN113343989B (en) Target detection method and system based on self-adaption of foreground selection domain
CN116977710A (en) Remote sensing image long tail distribution target semi-supervised detection method
CN110852358A (en) Vehicle type distinguishing method based on deep learning
CN114842343A (en) ViT-based aerial image identification method
CN117152503A (en) Remote sensing image cross-domain small sample classification method based on false tag uncertainty perception
CN116342942A (en) Cross-domain target detection method based on multistage domain adaptation weak supervision learning
CN113129336A (en) End-to-end multi-vehicle tracking method, system and computer readable medium
CN112883931A (en) Real-time true and false motion judgment method based on long and short term memory network
CN113052215A (en) Sonar image automatic target identification method based on neural network visualization
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN116912568A (en) Noise-containing label image recognition method based on self-adaptive class equalization
CN117152606A (en) Confidence dynamic learning-based remote sensing image cross-domain small sample classification method
CN113808123B (en) Dynamic detection method for liquid medicine bag based on machine vision
CN114549909A (en) Pseudo label remote sensing image scene classification method based on self-adaptive threshold
CN113505712B (en) Sea surface oil spill detection method of convolutional neural network based on quasi-balance loss function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination