CN115861625A - Self-label modifying method for processing noise label - Google Patents

Self-label modifying method for processing noise label

Info

Publication number
CN115861625A
Authority
CN
China
Prior art keywords
data
data samples
clean
label
noisy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211554141.9A
Other languages
Chinese (zh)
Inventor
张宇
林凡
米思娅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202211554141.9A
Publication of CN115861625A
Legal status: Pending


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a self-label modification method for processing noisy labels. A small batch of data samples is randomly selected and data enhancement is applied to obtain different views, which are used as the input of a pseudo-twin neural network that outputs the predicted probabilities of the sample classes. The JS divergence between the label distribution of each data sample and the predictions of the different networks on the different views is calculated and used to judge how likely the sample is to be a clean data sample. According to a given judgment threshold, the batch is divided into clean data samples and noisy data samples; the labels of the clean data samples are only smoothed, while the noisy data samples are dynamically weighted according to the model prediction and the sample label so as to give them reliable labels. Finally, the model is updated with a classification loss function and a consistency loss function. The method is used to solve the image classification task under label noise and achieves a good performance effect.

Description

Self-label modifying method for processing noise label
Technical Field
The invention belongs to the technical field of computer vision and mainly relates to a self-label modification method for processing noisy labels.
Background
Deep neural networks have made tremendous progress in various computer vision tasks, progress that cannot be separated from large-scale datasets with reliable annotations such as ImageNet. However, collecting well-annotated datasets is very expensive in labor, material, and time, especially in domains requiring expertise (e.g., fine-grained classification). The high cost of acquiring large-scale, well-labeled data constitutes a bottleneck for the use of deep neural networks in real-world scenarios. To alleviate this problem, data annotation companies resort to crowdsourcing for data collection and annotation, data are crawled from the web, only one or a few annotators label the data because of limited budget, or alternative methods such as online queries are adopted to improve labeling efficiency. Unfortunately, although these ways of obtaining data are cheaper and easier to implement, they inevitably produce noisy labels, because annotation from non-experts or error-prone automated labeling systems is unreliable, the number of annotators is limited, and repeated checks cannot be performed.
Due to the complexity of the network structure, deep networks have a very strong ability to overfit noisy labels, and noisy labeled samples are inevitably fitted by the deep neural network, which degrades the performance of the model. Research on robust learning methods against noisy labels is therefore urgent.
Existing research on the noisy label problem mainly follows these directions: (1) estimating the underlying noise transition matrix; the main difficulty is that the noise transition matrix must be estimated accurately, which requires good prior knowledge. (2) Designing noise-tolerant loss functions and correcting the loss according to the predictions of the deep neural network; however, such methods tend to fail when the dataset is large. (3) Training the deep neural network with selected or re-weighted training samples; the main challenge is to design a proper criterion to identify clean data samples, and how to select reliable clean data samples must be considered. (4) Modifying the labels of data samples, mainly by combining the output of a prediction network to modify the labels of samples considered noisy; however, how much confidence to give to the network prediction must be considered. (5) Studying the noisy label problem in the semi-supervised learning setting; the accuracy of many semi-supervised learning classifiers can drop significantly in the presence of label noise.
Disclosure of Invention
Aiming at the problem of model performance degradation caused by noisy labels in the prior art, the invention provides a self-label modification method for processing noisy labels. A small batch of data samples is randomly selected and data enhancement is applied to obtain different views, which are used as the input of a pseudo-twin neural network that outputs the predicted probabilities of the sample classes; the JS divergence between the data sample labels and the predictions of the different networks on the different views is calculated and used to judge how likely each sample is to be a clean data sample. According to a given judgment threshold, the batch is divided into clean data samples and noisy data samples; the labels of the clean data samples are smoothed, while the noisy data samples are dynamically weighted according to the model predictions and the sample labels so as to give them reliable labels, and the model is updated with the proposed classification loss function and consistency loss function. The method achieves a good performance effect both on artificially synthesized noisy datasets and on large-scale noisy datasets from real scenes, and converges faster during training.
In order to achieve the above purpose, the invention adopts the following technical scheme: a self-label modification method for processing noisy labels, comprising the following steps:
S1, in the process of training the model with a data set, randomly select a small batch of data samples
Figure BDA0003982576160000021
and process each data sample X with two data enhancement modes, scaling and cropping, to obtain different views V and V';
S2, take the different views V and V' obtained in step S1 as the input of two pseudo-twin neural networks, and pass the outputs of the two networks through softmax layers to obtain the final predicted outputs P_1, P'_1, P_2, P'_2, where P_1 and P'_1 are generated from the output of network one through its softmax layer with inputs V and V', and P_2 and P'_2 are generated from the output of network two through its softmax layer with inputs V and V';
S3, calculate the difference between the pseudo-twin neural network outputs of step S2 and the label distribution given for the sample, specifically
d_i = D_JS(P_i || Y_i) = (1/2) D_KL(P_i || (P_i + Y_i)/2) + (1/2) D_KL(Y_i || (P_i + Y_i)/2)
where P_i = [P_i^1, P_i^2, ..., P_i^C] is the predicted probability distribution of data sample x_i, the difference being measured with the Jensen-Shannon (JS) divergence; Y_i = [Y_i^1, Y_i^2, ..., Y_i^C] is the real label distribution given for data sample x_i; D_KL(·||·) denotes the Kullback-Leibler (KL) divergence function;
the label distribution of a data sample is a 0-1 distribution in which only the class to which the sample belongs is marked 1 and the rest are 0; to prevent the true number of the logarithm from being 0 during the calculation, the distribution is converted into the following formula for calculation:
Figure BDA0003982576160000033
where the given label is l_i ∈ {1, 2, 3, ..., C} and ε is a hyper-parameter used to control the smoothness of the label distribution;
S4, use the distribution difference d_i obtained in step S3 to calculate the probability that data sample x_i is a clean sample, expressed as
1 - d_i
where 1 - d_i represents the consistency between p_i and y_i;
S5, calculate the threshold for clean data sample selection according to the training round; after the threshold τ_clean is determined, data sample x_i can be preliminarily judged to be a clean data sample if it satisfies the following condition:
1 - d_i ≥ τ_clean
S6, select clean data samples according to the outputs of the two pseudo-twin neural networks; a data sample participates in subsequent model updating only when both neural networks judge it to be clean, and the selected sample set is expressed as:
D_clean = D_clean^(1) ∩ D_clean^(2)
where D_clean^(1) and D_clean^(2) are the results of judging the data samples using the output predictions of the two neural networks, respectively;
S7, divide the training data into two subsets through the judgment and selection of the data samples: a clean data sample set D_clean and a noisy data sample set D_noisy;
S8, process the sample labels in the clean data sample set D_clean with the smoothed label distribution; the expression is as follows:
Figure BDA0003982576160000044
S9, process the sample labels in the noisy data sample set with the help of the pseudo-twin neural networks of step S2; the expression is as follows:
ŷ_i = (1 - ∈) · Ỹ_i + ∈ · p_i
where Ỹ_i is the label smoothed in step S3; p_i is the prediction result output by the pseudo-twin neural network of step S2, one of the two networks being selected; ∈ is the weight given to the model output;
S10, compute the cross-entropy loss between the label distributions modified in steps S8 and S9 and the probability distributions predicted by the model, and calculate the classification loss function; the classification loss is expressed as follows:
Figure BDA0003982576160000047
where data sample x_i yields two different views v_i and v_i' after different data enhancement processing, and the predicted probability distributions output by the two networks for these inputs are denoted p_i1, p'_i1, p_i2, p'_i2; ŷ_i is the modified label distribution, obtained in step S8 for a selected clean data sample and in step S9 for a data sample deemed noisy; N is the number of data samples processed;
S11, calculate the consistency loss function, specifically:
Figure BDA00039825761600000411
where D_KL(·||·) denotes the Kullback-Leibler (KL) divergence function, and p_i1, p'_i1, p_i2, p'_i2 and N have the same meanings as in the classification loss function;
S12, integrate the classification loss function obtained in step S10 and the consistency loss function obtained in step S11, and calculate the overall loss function, expressed as follows:
Figure BDA0003982576160000051
where α is a hyper-parameter used to adjust the weights of the two losses;
S13, calculate the gradient with the overall loss function and update the parameters of the model to obtain the optimal model for handling noisy labels:
Figure BDA0003982576160000052
where θ = {θ_1, θ_2}, and θ_1, θ_2 represent the parameters of the two networks respectively; the training process is repeated after updating; if the set number of iterations has not been reached, step S1 is executed; otherwise the current training round ends and the next training round is executed until training is finished.
Compared with the prior art, the invention has the following beneficial effects: the self-label modification method for processing noisy labels is used to solve the image classification task under label noise and can achieve a higher performance effect on the given datasets. The training data are fully utilized during training, and training relies only on the model itself without depending on an additional auxiliary model; the method achieves a good performance effect both on artificially synthesized noisy datasets and on large-scale noisy datasets from real scenes, and converges faster during training.
Drawings
FIG. 1 is a block diagram of the method of the present invention;
FIG. 2 is a comparison of the classification performance of the invention on the Clothing1M dataset with existing mainstream methods, including Decoupling, Co-training+, JoCoR, and Jo-SRC, where Standard denotes training the network directly on the noisy dataset;
FIG. 3 is a comparison of the classification performance of the invention on the Food101N dataset with existing mainstream methods, including CleanNet and DeepSelf, where Standard denotes training the network directly on the noisy dataset;
FIG. 4 is a comparison of the classification performance of the invention on noisy datasets artificially synthesized from the CIFAR100 dataset with existing mainstream methods, including Decoupling, Co-training+, JoCoR, and Jo-SRC, where Standard denotes training the network directly on the noisy dataset.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific embodiments, which are to be understood as merely illustrative of the invention and not as limiting the scope of the invention.
Example 1
A self-label modification method for processing noisy labels is provided; the framework of the method is shown in FIG. 1. The invention uses a pseudo-twin neural network to strictly judge whether a data sample is a clean sample. For data samples considered noisy, the method performs self-label modification; it does not depend on an additional auxiliary network, but relies only on the pseudo-twin neural network within the method. Meanwhile, a dynamic weight is given to the output prediction of the pseudo-twin neural network, so that the confidence placed in it becomes more reasonable as training progresses. Finally, the proposed consistency loss and classification loss are used to update the model. The method specifically comprises the following steps:
step S1: in the process of training the model by using the data set, randomly selecting a small batch of data samples
Figure BDA0003982576160000061
For each data sample X, processing the data sample by using a data enhancement technology to obtain different views V and V', and specifically for the same data sample, processing the data sample by using two data enhancement modes of scaling and cropping to obtain two views.
Step S2: the different views V and V' for each sample are taken as inputs to network one and network two. Wherein network one and network two are pseudo-twin neural networks, which can predict labels separately, with different parameters, but updated simultaneously by the same loss function. The output of the network obtains the final predicted output P through the soft-max layer 1 ,P 1 ’,P 2 ,P 2 ', wherein P 1 And P 1 'is generated by the output of network one through soft-max layer, the inputs are V and V', P 2 And P 2 'the output from network two is generated by the soft-max layer, with inputs V and V', respectively. The network architectures of the first network and the second network are the same, but do not share parameters, and the two networks are updated simultaneously by using a random gradient descent method by using the same loss function.
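A minimal PyTorch-style sketch of steps S1-S2 may help make the two-view, two-network structure concrete. The backbone (ResNet-18), image size, crop scale, and class count below are illustrative assumptions; the patent only specifies scaling and cropping as the two enhancement modes and two networks with identical architecture but unshared parameters.

```python
# Hedged sketch of steps S1-S2: two augmented views fed to a pseudo-twin
# (pseudo-siamese) pair of networks. Backbone, image size, and augmentation
# parameters are assumptions for illustration only.
import torch.nn.functional as F
from torchvision import models, transforms

# Two views via scaling and cropping (assumed RandomResizedCrop settings);
# V and V' come from applying the augmentation twice to the same image:
#   v, v_prime = augment(img), augment(img)
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.ToTensor(),
])

# Two networks with identical architecture but independent parameters.
net1 = models.resnet18(num_classes=100)
net2 = models.resnet18(num_classes=100)

def forward_views(v, v_prime):
    """Return softmax predictions P1, P1', P2, P2' for the two views."""
    p1, p1_prime = F.softmax(net1(v), dim=1), F.softmax(net1(v_prime), dim=1)
    p2, p2_prime = F.softmax(net2(v), dim=1), F.softmax(net2(v_prime), dim=1)
    return p1, p1_prime, p2, p2_prime
```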
Step S3: calculate the difference between the pseudo-twin neural network outputs of step S2 and the label distribution given for the sample. For data sample x_i, the method uses the Jensen-Shannon (JS) divergence to measure the difference between the predicted probability distribution P_i = [P_i^1, P_i^2, ..., P_i^C] and the given real label distribution Y_i = [Y_i^1, Y_i^2, ..., Y_i^C]. It is expressed as follows:
d_i = D_JS(P_i || Y_i) = (1/2) D_KL(P_i || (P_i + Y_i)/2) + (1/2) D_KL(Y_i || (P_i + Y_i)/2)   (1)
where D_KL(·||·) denotes the Kullback-Leibler (KL) divergence function.
The label distribution of a data sample is a 0-1 distribution: only the class to which the sample belongs is marked 1 and the rest are 0. To prevent the true number of the logarithm from being 0 during the calculation, the smoothed label distribution is used in formula (1), converting the calculation into formula (2):
Figure BDA0003982576160000072
where the given label is l_i ∈ {1, 2, 3, ..., C} and ε is a hyper-parameter used to control the smoothness of the label distribution. After tuning, 0.7 is selected as the smoothing hyper-parameter ε.
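The following sketch illustrates the smoothed label distribution and the JS divergence of step S3. The exact smoothing formula (2) is not reproduced in the text; the code assumes the true class receives weight ε and the remaining classes share 1 − ε, which is consistent with the reported choice ε = 0.7, and uses base-2 logarithms so that the divergence lies in [0, 1] as stated in step S4.

```python
# Hedged sketch of step S3: smoothed labels and JS divergence between the
# prediction and the label distribution. The smoothing form is an assumption
# consistent with epsilon = 0.7; the formula images are not reproduced here.
import torch

def smooth_labels(labels, num_classes, epsilon=0.7):
    """labels: LongTensor of shape (N,) with values in {0, ..., C-1}."""
    y = torch.full((labels.size(0), num_classes), (1.0 - epsilon) / (num_classes - 1))
    y.scatter_(1, labels.unsqueeze(1), epsilon)   # assumed: true class gets weight epsilon
    return y

def js_divergence(p, y, eps=1e-12):
    """Jensen-Shannon divergence between rows of p and y, in [0, 1] with log base 2."""
    m = 0.5 * (p + y)
    kl_pm = (p * ((p + eps) / (m + eps)).log2()).sum(dim=1)
    kl_ym = (y * ((y + eps) / (m + eps)).log2()).sum(dim=1)
    return 0.5 * kl_pm + 0.5 * kl_ym
```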
Step S4: calculate the probability that data sample x_i is a clean sample. The JS divergence measures the dissimilarity between two probability distributions and lies between 0 and 1. Intuitively, d_i can therefore be used to measure the probability that data sample x_i is a clean sample, which is expressed as
1 - d_i   (3)
In fact, 1 - d_i represents the consistency between p_i and y_i.
Step S5: calculate the threshold for clean data sample selection according to the training round. The invention dynamically adjusts the threshold used to judge whether a data sample is clean in the following manner:
Figure BDA0003982576160000075
where t denotes the training round, Δτ = τ_m - τ_c, τ_c is a hyper-parameter, and τ_m is a user-defined constant. The threshold τ_clean is handled in two stages. In the first stage, 1 ≤ t ≤ t_w, only clean data samples are selected, and the model is updated with the selected data samples without modifying the labels. In the second stage, t_w ≤ t ≤ t_max, the model has acquired a certain prediction capability, and label modification is performed on the data samples judged to be noisy in order to use the data more effectively. The threshold τ_clean changes linearly with training in both stages.
Given a suitable threshold τ_clean, data sample x_i can be preliminarily judged to be a clean data sample if it satisfies the following condition:
1 - d_i ≥ τ_clean   (5)
The proportion of the two training stages is not fixed; during training, the first stage should last long enough for the training performance to approach saturation within that stage, and the second stage needs sufficient training to reach higher performance. The threshold for clean data sample selection during the initial phase should be low enough to prevent too few samples from being selected.
Step S6: select clean data samples by combining the output predictions of the two networks. Because the two networks in the framework have different learning capabilities, they can filter errors caused by different types of noisy labels; to improve the reliability of sample selection, the invention uses the dual-model structure to strengthen the screening of clean labels. A data sample participates in subsequent model updating only when both networks judge it to be clean, and the selected sample set is expressed as:
D_clean = D_clean^(1) ∩ D_clean^(2)   (6)
where D_clean^(1) and D_clean^(2) are the results of judging the data samples using the output predictions of the two networks, respectively.
Although the two networks can filter different types of noise, the thresholds by which they determine whether a data sample is clean are the same at the same stage.
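In code, the dual-network agreement of step S6 reduces to requiring that the clean probability 1 − d_i exceed the shared threshold under both networks' predictions. The sketch below reuses the js_divergence and smooth_labels helpers from the earlier sketch; it is a sketch of the selection rule as rewritten above, not of the original formula image.

```python
# Hedged sketch of step S6: a sample is kept as "clean" only when both
# networks judge it clean under the shared threshold tau_clean.
def select_clean(p1, p2, y_smooth, tau_clean):
    """p1, p2: softmax predictions of network one and two; y_smooth: smoothed labels."""
    w1 = 1.0 - js_divergence(p1, y_smooth)   # clean probability under network one
    w2 = 1.0 - js_divergence(p2, y_smooth)   # clean probability under network two
    clean_mask = (w1 >= tau_clean) & (w2 >= tau_clean)
    return clean_mask                        # True = clean, False = noisy
```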
Step S7: partition the training data. Through the judgment and selection of the data samples, the training data are divided into two subsets: a clean data sample set D_clean and a noisy data sample set D_noisy.
Step S8: process the sample labels of the clean data sample set D_clean. The invention keeps their labels essentially unchanged, but to improve generalization performance and to prevent the true number of the logarithm from being 0 when calculating the cross entropy, the smoothed label distribution is adopted; the expression is as follows:
Figure BDA0003982576160000088
The labels of the samples judged to be clean thus remain consistent with the smoothing used in the previous step when calculating the difference between the network output and the given label distribution of the sample.
Step S9: process the sample labels of the noisy data sample set D_noisy. The invention adopts a self-label modification scheme and relies only on the pseudo-twin neural networks of step S2 to process the sample labels in the noisy data sample set. When a data sample is predicted to be noisy, the given label and the model prediction are in conflict; since it cannot be completely determined whether its label is wrong or correct, the sample's own label and the model prediction output should be given different weights, expressed as follows:
ŷ_i = (1 - ∈) · Ỹ_i + ∈ · p_i   (8)
where Ỹ_i is the label smoothed by formula (2), p_i is the prediction result output by the pseudo-twin neural network of step S2 (one of the two networks is selected), and ∈ is the weight given to the model output; ∈ determines how much the label distribution predicted by the model should be trusted. Considering that the model should become more reliable as training progresses, ∈ should be dynamic and increase as training progresses, while also taking into account whether the predicted output distribution is reasonable. For this purpose ∈ is defined as:
∈=g(t)×l(p) (9)
where g(t) determines how much the learner can be trusted; it is data-independent, and its expression is as follows:
Figure BDA0003982576160000094
where Γ represents the total number of training iteration rounds and t represents the current training round.
l(p) determines how much the predicted label distribution is trusted; it is data-dependent, and its expression is as follows:
l(p) = 1 - H(p)/H(u)   (11)
where H(p) represents the information entropy of the prediction output by the model, and the expression for H(u) is as follows:
H(u) = -log(1/C)   (12)
The model output p_i used here is obtained by weighting the prediction outputs of the two networks, each with a weight of 0.5.
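Step S9's dynamic weight and label modification can be sketched as follows. The form of g(t) (formula (10)) is not reproduced in the text and is assumed here to be the linear ramp t/Γ; l(p) = 1 − H(p)/H(u) and H(u) = −log(1/C) follow the formulas above, and the 0.5/0.5 averaging of the two networks' predictions follows the preceding sentence.

```python
# Hedged sketch of step S9: dynamic weight eps = g(t) * l(p) and self-label
# modification for samples judged noisy. g(t) = t / Gamma is an assumption.
import math
import torch

def dynamic_weight(p, t, total_rounds, num_classes):
    """eps = g(t) * l(p): trust placed in the model prediction, per sample."""
    g_t = t / total_rounds                                  # assumed form of g(t)
    h_p = -(p * (p + 1e-12).log()).sum(dim=1)               # entropy of prediction H(p)
    h_u = -math.log(1.0 / num_classes)                      # uniform entropy H(u) = log C
    l_p = 1.0 - h_p / h_u                                   # l(p) = 1 - H(p)/H(u)
    return g_t * l_p                                        # shape (N,)

def modify_noisy_labels(y_smooth, p1, p2, t, total_rounds, num_classes):
    """Convex combination of the smoothed label and the (averaged) model prediction."""
    p = 0.5 * (p1 + p2)                                     # 0.5/0.5 weighting of the two networks
    eps = dynamic_weight(p, t, total_rounds, num_classes).unsqueeze(1)
    return (1.0 - eps) * y_smooth + eps * p
```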
Step S10: calculate the classification loss function. Cross-entropy losses are computed between the modified label distributions and the probability distributions predicted by the model; the classification loss is expressed as follows:
Figure BDA0003982576160000097
where data sample x_i yields two different views v_i and v_i' after different data enhancement processing, and the predicted probability distributions output by the two networks for these inputs are denoted p_i1, p'_i1, p_i2, p'_i2; ŷ_i is the modified label distribution, derived from formula (7) for a selected clean data sample and from formula (8) for a data sample deemed noisy; N is the number of processed data samples. In the first stage of the dynamic threshold, in order to give the model a certain prediction capability as soon as possible, only clean data sample selection is performed and no label processing is applied to noisy data samples; the classification loss is then:
Figure BDA0003982576160000104
where the sample set in the summation is obtained according to formula (6), i.e. it contains the samples that both networks judge to be clean.
In the second stage of the dynamic threshold, the model has a certain prediction capability and performs label modification on the data judged to be noisy, where N is the number of data samples in the small batch.
The labels used for calculating the classification loss are thus processed differently in the two stages: in the first stage the labels are only smoothed, while in the second stage clean data samples and noisy data samples are processed differently through the steps above.
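A sketch of the classification loss of step S10, assuming a soft cross-entropy between the modified label distribution and each of the four predicted distributions (two networks × two views), averaged over the batch; the exact form of formula (13) is not reproduced in the text, so the averaging scheme is an assumption.

```python
# Hedged sketch of step S10: soft cross-entropy between the modified labels
# and the four predictions. The averaging over the four predictions is assumed.
import torch

def soft_cross_entropy(pred, target, eps=1e-12):
    """Cross entropy with a soft target distribution, averaged over the batch."""
    return -(target * (pred + eps).log()).sum(dim=1).mean()

def classification_loss(p11, p11_prime, p21, p21_prime, y_modified):
    """p11, p11': network one on views V, V'; p21, p21': network two on views V, V'."""
    preds = [p11, p11_prime, p21, p21_prime]
    return sum(soft_cross_entropy(p, y_modified) for p in preds) / len(preds)
```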
Step S11: calculate the consistency loss function. The invention designs a consistency loss that maximizes the consistency between the two classifiers and the consistency between the output predictions of the same network for different views of the input. The expression is as follows:
Figure BDA0003982576160000106
where D_KL(·||·) denotes the Kullback-Leibler (KL) divergence function, and p_i1, p'_i1, p_i2, p'_i2 and N have the same meanings as in the classification loss function.
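A sketch of the consistency loss of step S11; the exact combination of KL terms in formula (15) is not reproduced in the text, so the symmetric cross-network and cross-view terms below are an assumption consistent with the stated goal of maximizing consistency between the two classifiers and between the predictions of the same network on different views.

```python
# Hedged sketch of step S11: consistency built from KL divergences between
# the two networks and between the two views of the same network.
import torch

def kl(p, q, eps=1e-12):
    """KL(p || q) for row-wise probability distributions, averaged over the batch."""
    return (p * ((p + eps) / (q + eps)).log()).sum(dim=1).mean()

def consistency_loss(p11, p11_prime, p21, p21_prime):
    cross_network = kl(p11, p21) + kl(p21, p11)            # agreement between the two networks
    cross_view = kl(p11, p11_prime) + kl(p21, p21_prime)   # agreement across views within a network
    return cross_network + cross_view
```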
Step S12: calculate the total loss function. The classification loss function and the consistency loss function are combined; the overall loss function is expressed as follows:
Figure BDA0003982576160000107
where α is a hyper-parameter used to adjust the weights of the two losses.
Step S13: compute the gradient of the overall loss function and update the parameters of the model:
Figure BDA0003982576160000111
where θ = {θ_1, θ_2} and θ_1, θ_2 represent the parameters of the two networks, respectively. The training process is repeated after updating; if the set number of iterations has not been reached, step S1 is executed again; otherwise the current training round ends and the next training round is executed until training is finished.
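Steps S12-S13 can be sketched as a single joint update of both networks. The combination L = L_cls + α·L_con and the optimizer settings are assumptions (the patent only states that α balances the two losses and that both networks are updated simultaneously by stochastic gradient descent); net1 and net2 refer to the networks from the earlier sketch.

```python
# Hedged sketch of steps S12-S13: total loss and a joint SGD update of both
# networks. The loss combination and optimizer hyper-parameters are assumed.
import itertools
import torch

alpha = 1.0  # assumed weighting hyper-parameter for the consistency loss
optimizer = torch.optim.SGD(
    itertools.chain(net1.parameters(), net2.parameters()),
    lr=0.01, momentum=0.9,
)

def train_step(loss_cls, loss_con):
    loss = loss_cls + alpha * loss_con   # assumed form of the overall loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```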
Test example
The classification performance of the method of the invention on the Clothing1M dataset, the Food101N dataset, and the CIFAR100 dataset is compared with existing advanced methods in the field of noisy label processing. The methods compared differ for each dataset; the specific methods are listed in the description of the drawings, and the comparison results are shown in FIGS. 2-4.
FIG. 2 shows the comparison of the classification performance of the invention with existing advanced methods in the field of noisy label processing on the Clothing1M dataset, including Decoupling, Co-training+, JoCoR, Jo-SRC, and Standard, where Standard trains the network directly on the noisy dataset. It can be seen that the invention obtains the best result, with a performance about 0.2% higher than that of the previously best-performing method, Jo-SRC. However, the training process of Jo-SRC requires predictions from a teacher model and therefore relies on an accurate auxiliary model to generate predictions. The invention makes the network model lighter during training and obtains a higher performance effect on the Clothing1M dataset.
FIG. 3 shows the comparison of the invention with existing advanced methods in the field of noisy label processing on the Food101N dataset, including CleanNet, DeepSelf, and Standard, where Standard trains the network directly on the noisy dataset. The method slightly exceeds the performance of the previously best-performing method, Jo-SRC, which also verifies its effectiveness in handling real-world noise. Moreover, Jo-SRC uses a teacher model during training, which again shows that the present method obtains a better performance effect without depending on an additional auxiliary model.
FIG. 4 shows the comparison of the method with existing advanced methods in the field of noisy label processing on noisy datasets synthesized from the CIFAR100 dataset, including Decoupling, Co-training+, JoCoR, Jo-SRC, and Standard, where Standard trains the network directly on the noisy dataset. The noise types include the "symmetric" and "asymmetric" types; the noise ratio under the symmetric type is set to 0.2, 0.4, and 0.8, and the noise ratio under the asymmetric type is set to 0.4. As shown, the method is consistently superior to the existing advanced methods in the field of noisy label processing.
In summary, a simple and effective method is proposed to solve the performance degradation caused by noisy labels in image classification. Aiming at the lack of reliability of existing methods when judging whether a data sample is clean, the method adopts a dual-model structure to filter the errors caused by different types of noisy labels and maximizes the consistency between the prediction outputs of the two models. For clean data samples, the method smooths their labels to improve the generalization performance of the model and to prevent the true number of the logarithm from being 0 when calculating the cross entropy. For noisy data samples, the method determines their labels from both the model prediction and the originally annotated label, giving a dynamic weight between the prediction and the label; this modification does not depend on other models but only on the model of the framework itself. In addition, a classification loss function and a consistency loss function are proposed to update the model; experiments on synthesized noisy datasets and on a large-scale noisy dataset from a real scene obtain good performance effects and demonstrate the effectiveness of the method.
It should be noted that the above-mentioned contents only illustrate the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and it will be apparent to those skilled in the art that several modifications and embellishments can be made without departing from the principle of the present invention, and these modifications and embellishments fall within the protection scope of the claims of the present invention.

Claims (7)

1. A self-label modification method for processing noisy labels, characterized in that: a small batch of data samples is randomly selected and data enhancement processing is applied to the data samples to obtain different views, which are used as the input of a pseudo-twin neural network that outputs the predicted probabilities of the data sample classes; the JS divergence between the data sample label distribution and the predictions of the different networks on the different views is calculated and used to judge how likely each data sample is to be a clean data sample; according to a given judgment threshold, the batch of data samples is divided into clean data samples and noisy data samples, the labels of the clean data samples are smoothed, the noisy data samples are dynamically weighted according to the model prediction and the sample labels so as to give them reliable labels, and the model is updated with a classification loss function and a consistency loss function.
2. The self-label modification method for processing noisy labels as claimed in claim 1, characterized by comprising the following steps:
S1, in the process of training the model with a data set, randomly select a small batch of data samples
Figure FDA0003982576150000014
and process each data sample X with two data enhancement modes, scaling and cropping, to obtain different views V and V';
S2, take the different views V and V' obtained in step S1 as the input of two pseudo-twin neural networks, and pass the outputs of the two networks through softmax layers to obtain the final predicted outputs P_1, P'_1, P_2, P'_2, where P_1 and P'_1 are generated from the output of network one through its softmax layer with inputs V and V', and P_2 and P'_2 are generated from the output of network two through its softmax layer with inputs V and V';
S3, calculate the difference between the pseudo-twin neural network outputs of step S2 and the label distribution given for the sample, specifically:
d_i = D_JS(P_i || Y_i) = (1/2) D_KL(P_i || (P_i + Y_i)/2) + (1/2) D_KL(Y_i || (P_i + Y_i)/2)
where P_i = [P_i^1, P_i^2, ..., P_i^C] is the predicted probability distribution of data sample x_i, the difference being measured with the Jensen-Shannon (JS) divergence; Y_i = [Y_i^1, Y_i^2, ..., Y_i^C] is the real label distribution given for data sample x_i; D_KL(·||·) denotes the Kullback-Leibler (KL) divergence function;
the label distribution of a data sample is a 0-1 distribution in which only the class to which the sample belongs is marked 1 and the rest are 0; to prevent the true number of the logarithm from being 0 during the calculation, the distribution is converted into the following formula for calculation:
Figure FDA0003982576150000021
where the given label is l_i ∈ {1, 2, 3, ..., C} and ε is a hyper-parameter used to control the smoothness of the label distribution;
S4, use the distribution difference d_i obtained in step S3 to calculate the probability that data sample x_i is a clean sample, expressed as
1 - d_i
where 1 - d_i represents the consistency between p_i and y_i;
S5, calculate the threshold for clean data sample selection according to the training round; after the threshold τ_clean is determined, data sample x_i can be preliminarily judged to be a clean data sample if it satisfies the following condition:
1 - d_i ≥ τ_clean
S6, select clean data samples according to the outputs of the two pseudo-twin neural networks; a data sample participates in subsequent model updating only when both neural networks judge it to be clean, and the selected sample set is expressed as:
D_clean = D_clean^(1) ∩ D_clean^(2)
where D_clean^(1) and D_clean^(2) are the results of judging the data samples using the output predictions of the two neural networks, respectively;
S7, divide the training data into two subsets through the judgment and selection of the data samples: a clean data sample set D_clean and a noisy data sample set D_noisy;
S8, process the sample labels in the clean data sample set D_clean with the smoothed label distribution; the expression is as follows:
Figure FDA0003982576150000034
S9, process the sample labels in the noisy data sample set D_noisy with the help of the pseudo-twin neural networks of step S2; the expression is as follows:
ŷ_i = (1 - ∈) · Ỹ_i + ∈ · p_i
where Ỹ_i is the label smoothed in step S3; p_i is the prediction result output by the pseudo-twin neural network of step S2; ∈ is the weight given to the model output;
S10, compute the cross-entropy loss between the label distributions modified in steps S8 and S9 and the probability distributions predicted by the model, and calculate the classification loss function; the classification loss is expressed as follows:
Figure FDA0003982576150000038
where data sample x_i yields two different views v_i and v_i' after different data enhancement processing, and the predicted probability distributions output by the two networks for these inputs are denoted p_i1, p'_i1, p_i2, p'_i2; ŷ_i is the modified label distribution, obtained in step S8 for a selected clean data sample and in step S9 for a data sample deemed noisy; N is the number of data samples processed;
S11, calculate the consistency loss function, specifically:
Figure FDA00039825761500000312
where D_KL(·||·) denotes the Kullback-Leibler (KL) divergence function, and p_i1, p'_i1, p_i2, p'_i2 and N have the same meanings as in the classification loss function;
S12, integrate the classification loss function obtained in step S10 and the consistency loss function obtained in step S11, and calculate the overall loss function, expressed as follows:
Figure FDA0003982576150000041
where α is a hyper-parameter used to adjust the weights of the two losses;
S13, calculate the gradient with the overall loss function and update the parameters of the model to obtain the optimal model for handling noisy labels:
Figure FDA0003982576150000042
where θ = {θ_1, θ_2}, and θ_1, θ_2 represent the parameters of the two networks respectively; the training process is repeated after updating; if the set number of iterations has not been reached, step S1 is executed; otherwise the current training round ends and the next training round is executed until training is finished.
3. The self-label modification method for processing noisy labels as claimed in claim 2, characterized in that: in step S2, the two pseudo-twin neural networks have the same network structure but do not share parameters, and the two neural networks are updated simultaneously with the same loss function using stochastic gradient descent.
4. The self-label modification method for processing noisy labels as claimed in claim 2, characterized in that: in step S5, the threshold for judging whether a data sample is clean is dynamically adjusted in the following manner:
Figure FDA0003982576150000043
where t denotes the training round; Δτ = τ_m - τ_c, τ_c is a hyper-parameter; τ_m is a user-defined constant; the threshold τ_clean is handled in two stages: in the first stage, 1 ≤ t ≤ t_w, only clean data samples are selected, and the model is updated with the selected data samples without modifying the labels; in the second stage, t_w ≤ t ≤ t_max, the labels of the data samples judged to be noisy are modified; the threshold τ_clean changes linearly with training in both stages.
5. The self-label modification method for processing noisy labels as claimed in claim 2, characterized in that: in step S6, the thresholds by which the two neural networks judge whether a data sample is a clean data sample are the same.
6. The self-label modification method for processing noisy labels as claimed in claim 4 or 5, characterized in that: in step S9, the weight ∈ of the model output is dynamic and increases continuously as training progresses, i.e. ∈ can be defined as:
∈ = g(t) × l(p)
Figure FDA0003982576150000051
l(p) = 1 - H(p)/H(u)
H(u) = -log(1/C)
where Γ represents the total number of training iteration rounds; t represents the current training round; H(p) represents the information entropy of the prediction output by the model.
7. The self-label modification method for processing noisy labels as claimed in claim 6, characterized in that: in step S10, in the first stage of the dynamic threshold, only clean data sample selection is performed and no label processing is applied to the noisy data samples; the classification loss is then:
Figure FDA0003982576150000053
where the data sample set in the loss is the set of samples that both networks judge to be clean;
in the second stage of the dynamic threshold, the model performs label modification on the data judged to be noisy, where N is the number of data samples in the small batch.
CN202211554141.9A 2022-12-06 2022-12-06 Self-label modifying method for processing noise label Pending CN115861625A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211554141.9A CN115861625A (en) 2022-12-06 2022-12-06 Self-label modifying method for processing noise label

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211554141.9A CN115861625A (en) 2022-12-06 2022-12-06 Self-label modifying method for processing noise label

Publications (1)

Publication Number Publication Date
CN115861625A true CN115861625A (en) 2023-03-28

Family

ID=85670166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211554141.9A Pending CN115861625A (en) 2022-12-06 2022-12-06 Self-label modifying method for processing noise label

Country Status (1)

Country Link
CN (1) CN115861625A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274682A (en) * 2023-09-14 2023-12-22 电子科技大学 Label-containing noise data classification method based on asynchronous co-training


Similar Documents

Publication Publication Date Title
WO2021155706A1 (en) Method and device for training business prediction model by using unbalanced positive and negative samples
CN109741318B (en) Real-time detection method of single-stage multi-scale specific target based on effective receptive field
US11804074B2 (en) Method for recognizing facial expressions based on adversarial elimination
CN109800778A (en) A kind of Faster RCNN object detection method for dividing sample to excavate based on hardly possible
CN110458084B (en) Face age estimation method based on inverted residual error network
CN107392919B (en) Adaptive genetic algorithm-based gray threshold acquisition method and image segmentation method
CN113806546B (en) Graph neural network countermeasure method and system based on collaborative training
CN111382686B (en) Lane line detection method based on semi-supervised generation confrontation network
CN107945210B (en) Target tracking method based on deep learning and environment self-adaption
CN110532880B (en) Sample screening and expression recognition method, neural network, device and storage medium
CN112651998B (en) Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network
Lin et al. Fairgrape: Fairness-aware gradient pruning method for face attribute classification
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN111145145B (en) Image surface defect detection method based on MobileNet
CN113537630A (en) Training method and device of business prediction model
CN112927266A (en) Weak supervision time domain action positioning method and system based on uncertainty guide training
Duan et al. Age estimation using aging/rejuvenation features with device-edge synergy
CN116343080A (en) Dynamic sparse key frame video target detection method, device and storage medium
CN115861625A (en) Self-label modifying method for processing noise label
CN113989256A (en) Detection model optimization method, detection method and detection device for remote sensing image building
CN113179276A (en) Intelligent intrusion detection method and system based on explicit and implicit feature learning
CN116935054A (en) Semi-supervised medical image segmentation method based on hybrid-decoupling training
CN116884071A (en) Face detection method and device, electronic equipment and storage medium
CN116645562A (en) Detection method for fine-grained fake image and model training method thereof
JP7073171B2 (en) Learning equipment, learning methods and programs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination