CN114972839B

CN114972839B - Generalized continuous classification method based on online comparison distillation network

Info

Publication number: CN114972839B
Application number: CN202210326319.8A
Authority: CN
Inventors: 冀中; 黎晋
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2022-03-30
Filing date: 2022-03-30
Publication date: 2024-06-25
Anticipated expiration: 2042-03-30
Also published as: CN114972839A

Abstract

The invention discloses a generalized continuous classification method based on an online comparison distillation network, which establishes a classification model based on knowledge distillation; establishing a buffer area and updating the buffer area by using a reservoir sampling method; randomly sampling S samples from the buffer area and respectively inputting the S samples into a teacher model and a student model to obtain classification output and feature embedding of the corresponding models; calculating and classifying output quality scores according to a teacher model, adjusting knowledge distillation loss function weights of different samples, and calculating distillation lossFeature embedding between the two models is compared, and the distillation loss of the comparison relation between the two models is calculatedCalculating self-supervision loss of student modelAnd supervised contrast learning lossCalculating the cross entropy classification loss of student modelAnd determining the parameters of the maximum optimization objective optimization student model by adding the loss weights. The parameters of the teacher model are updated by the parameters of the student model. The invention has good classification accuracy for both new tasks and old tasks.

Description

Generalized continuous classification method based on online comparison distillation network

Technical Field

The invention relates to a generalized continuous classification method, in particular to a generalized continuous classification method based on an online comparison distillation network.

Background

Currently, in recent years, deep learning has achieved good results in computer vision tasks such as image classification, object detection, and semantic segmentation. However, when a neural network trained on an old task is trained directly on a new task, the new task can severely interfere with the performance of the old task, creating a catastrophic forgetting (Catastrophic Forgetting) problem. Having the neural network train from scratch obviously consumes more time and computing resources, and the data of the previous task is not necessarily re-acquired due to privacy issues, etc. The human beings have the ability to learn continuously, and can learn new knowledge quickly on the basis of old knowledge without compromising the stability of previously learned knowledge. It is desirable that the neural network has the ability to learn continuously in humans, and continuous learning (Continual Learning), also called incremental learning (INCREMENTAL LEARNING), is proposed to overcome the catastrophic forgetfulness problem. In recent years, a large amount of continuous learning work has employed the idea of empirical playback (Experience Replay), storing samples of a portion of an old task, and playing back stored samples while training a new task to alleviate the catastrophic forgetfulness problem.

In the existing continuous learning technology, it is often required to assume that the categories between the tasks are mutually disjoint, i.e., the categories in the new task are all not found in the old task, a clear task boundary exists between the tasks, and in the real world task, such a priori knowledge is most likely not present. Many existing techniques utilize a priori knowledge that is unlikely to occur in such a real-world task, simplifying the difficulty of continuing the learning problem. For example, when the model output of the old sample at the past moment is used for outputting the model output of the regular old sample at the current moment to relieve catastrophic forgetting, the dimension of the old model output and the dimension of the new model output become inconsistent due to the arrival of new categories, and under the condition that the categories between tasks are assumed to be complementary, the output of the new model can be only partially overlapped with the old model. The prior continuous method using the mutually disjoint categories between tasks can not be applied to the setting of generalized continuous learning. For this reason, generalized continuous learning (General Continual Learning) technology to solve the catastrophic forgetfulness problem in real-world scenarios is attracting attention. The goal of generalized continuous learning is to consolidate learned knowledge from non-stationary infinite data streams and learn new knowledge quickly. Under the setting of generalized continuous learning, the categories among each task may be intersected, new samples of old categories may appear in new tasks, and the previous method for solving continuous learning by means of priori knowledge which does not necessarily exist in the real world is difficult to apply to generalized continuous learning.

Generalized continuous learning is a general continuous learning scenario, and can also be applied to classical class incremental learning (CLASS INCREMENTAL LEARNING), task incremental learning (TASK INCREMENTAL LEARNING) and Domain incremental learning (Domain INCREMENTAL LEARNING) scenarios. However, the specific prior knowledge in these classical scenes cannot be utilized to alleviate catastrophic forgetfulness when image classification is performed in a scene of generalized continuous learning. This means that at experience playback, some inherent non-specific scene information must be mined to consolidate knowledge of the old task.

Disclosure of Invention

The invention provides a generalized continuous classification method based on an online comparison distillation network for solving the technical problems in the prior art.

The invention adopts the technical proposal for solving the technical problems in the prior art that: a generalized continuous classification method based on an online comparison distillation network comprises the following steps:

Step 1, establishing a classification model based on knowledge distillation, wherein the classification model comprises a teacher model and a student model; the teacher model and the student model are respectively provided with a feature encoder, a classifier and a feature mapper; setting an optimization target of a student model; initializing parameters of a teacher model and a student model and giving a buffer zone with a fixed size;

Step 2, counting the number of samples currently encountered when a batch of data stream containing R samples arrives, and updating a buffer area by using a reservoir sampling method;

Step 3, randomly sampling S samples from the buffer area, respectively inputting the S samples into a teacher model and a student model, and respectively obtaining classified output data of the teacher model and the student model corresponding to the S samples through processing of respective feature encoders and classifiers of the teacher model and the student model; respectively obtaining feature embedded data of a teacher model and a student model corresponding to the S samples through processing of respective feature encoders and feature mappers of the two samples;

Step 4, calculating the mass fraction of the classified output data of the teacher model, adjusting the weight of the online knowledge distillation loss function of different samples according to the mass fraction of the classified output data of the teacher model, and further calculating the online distillation loss of the teacher model and the student model

Step 5, comparing the characteristic embedded data between the teacher model and the student model, and calculating the comparative relation distillation loss of the teacher model and the student model

Step 6, utilizing self-supervision learning and supervision contrast learning to help the student model to extract discriminant features, and calculating self-supervision loss of the student modelLearning loss/>, as compared with supervision

Step 7, based on experience playback, calculating cross entropy classification loss of student model

Step 8, calculating the total optimization target of the student model Alpha ₁ to alpha ₃ are hyper-parameters of each corresponding loss function; optimizing parameters of the student model by using a random gradient descent algorithm;

and 9, directly updating the parameters of the teacher model by using the parameters of the student model.

Further, in step 2, assuming that the non-stationary data stream is composed of n sample-disjoint tasks { T ₁,T₂,...,T_n }, the training set of each task T _n is composed of labeled dataThe composition, wherein m is the number of samples of the task T _n training set, x _i is the ith image sample in the task T _n training set, and y _i is the category marked by the ith image sample x _i in the task T _n training set; buffer/>Is/>X _j is the jth image sample in the buffer, y _j is the class marked by the jth image sample x _j in the buffer; the reservoir sampling method comprises the following steps:

Step A1, judging the number num of samples and the buffer capacity of the current encountered sample The size of the two parts ifStoring samples (x _i,y_i) directly to the buffer/>In (a) and (b); x _i is the ith image sample in the training set of task T _n, y _i is the category marked by the ith image sample x _i in the training set of task T _n;

Step A2, if Generating a random integer rand_num, wherein the minimum value of the random integer is 0, and the maximum value of the random integer is num-1; if/>Replacing samples in the buffer with samples (x _i,y_i) (x _{rand_num},y_{rand_num});x_{rand_num} denotes buffer/>, indexed rand_numY _{rand_num} denotes the buffer/>, indexed rand_numIs included.

Further, in step 4, the method for calculating the quality score of the teacher model classified output data is as follows:

Setting: representing a capacity of/> Is a buffer of (1); x _j is the jth image sample in the buffer, y _j is the class marked by the jth image sample x _j in the buffer; /(I)The sample x _j is represented to be sequentially processed by a feature encoder and a classifier of the teacher model to obtain classified output data; omega (x _j) is the mass fraction of the teacher model class output data corresponding to sample x _j; the formula for ω (x _j) is shown below:

Wherein:

ρ represents a temperature coefficient;

c represents the number of all possible categories;

exp (·) represents an exponential function based on a natural constant e;

Output data for classification/> The classification output data of the corresponding category y _j;

Representing classified output data/> The classification of each category of the data is output.

Further, in step 4, set upThe sample x _j is represented to be sequentially processed by a feature encoder and a classifier of the student model to obtain classified output data; calculating online distillation loss/>, of teacher model and student modelThe method of (2) is different from the following method:

wherein: | ₂ denotes the l ₂ norm; Representing a mathematical expectation function.

Further, in step 5, set: representing a capacity of/> Is a buffer of (1); x _j is the jth image sample in the buffer, y _j is the class marked by the jth image sample x _j in the buffer; /(I)After a sample x _j is input into a teacher model, feature embedded data of the teacher model is obtained through a feature encoder and a feature mapper; /(I)After the sample x _j is input into the student model, feature embedded data of the student model are obtained through a feature encoder and a feature mapper; z ^t is that after all samples x _j of the current batch are input into a teacher model, all teacher model feature embedded data/>, which is obtained through feature encoder and feature mapper processingIs a collection of (3); z ^s is that after all samples x _j of the current batch are input into the student model, all student model feature embedded data/>, which are obtained through a feature encoder and a feature mapperIs a collection of (3); /(I)Representing feature embedded data sampled from z ^s; z ^t+ represents and/>Teacher model features with the same class labels are embedded into the dataset; /(I)Representing feature embedded data sampled from z ^t+; /(I)Representing feature embedded data sampled from z ^t; calculating comparative relation distillation loss/>, of teacher model and student modelThe method of (2) is as follows:

Wherein:

representing a mathematical expectation function;

| ₂ denotes the l ₂ norm;

log (·) represents a natural log function based on a natural constant e;

representing a judging function for judging feature embedded data/> And/>Whether derived from their joint distribution/>

Representing a judging function for judging feature embedded data/>And/>Whether derived from their joint distribution/>

(. Cndot.) ^T represents a transpose;

exp (·) represents an exponential function based on a natural constant e;

τ represents the temperature coefficient.

Further, the step 6 includes the following sub-steps:

Step B1, setting Θ ^t,Φ^t,Ψ^t as a feature encoder, a classifier and a feature mapper of the teacher model, and setting Θ ^s,Φ^s,Ψ^s as a feature encoder, a classifier and a feature mapper of the teacher model; each training sample (x, y) of the student model is subjected to random geometric transformation once to obtain amplified training samples Where x represents the image sample and y is the category marked by the image sample x,/>For geometrically transformed image samples,/>A label that is geometrically transformed; amplified training samples/>Inputting the data into a student model, and processing the data by a feature encoder and a feature mapper of the student model to obtain corresponding student model feature data F ^s and feature embedded data/>Wherein:

Step B2, inputting the obtained student model feature data F ^s to a multi-layer sensor In the method for judging training samples/>The kind of geometric transformation is performed; the calculation formula for setting the output of the multilayer sensor as S ^s,S^s is as follows:

step B3, calculating self-supervision loss The calculation formula of (2) is as follows:

Wherein the method comprises the steps of Representing a mathematical expectation function;

softmax (·) represents the softmax function;

l (·) represents a cross entropy loss function;

step B4, setting: representing a capacity of/> Is a buffer of (1); x _j is the jth image sample in the buffer, y _j is the class marked by the jth image sample x _j in the buffer; /(I)After the sample x _j is input into the student model, feature embedded data of the student model are obtained through a feature encoder and a feature mapper; /(I)Feature embedded data/>, representing the resulting overall student modelAnd/>Is a collection of (3); /(I)The expression slave/>Embedding the characteristics obtained by sampling into data; /(I)Representation and/>Student model features with the same class labels are embedded into the dataset; /(I)The expression slave/>Embedding the characteristics obtained by sampling into data; /(I)The expression slave/>Embedding the characteristics obtained by sampling into data; based on the original feature embedded data and the amplified feature embedded data, performing supervision and contrast learning by utilizing the feature embedded data in the student model, and monitoring and contrast learning a loss function/>The calculation formula of (2) is as follows:

Wherein:

Wherein the method comprises the steps of Representing mathematical expectations;

| ₂ denotes the l ₂ norm;

log (·) represents a natural log function based on a natural constant e;

representing feature embedded data/> And/>Is a distance of (2);

exp (·) represents an exponential function based on a natural constant e;

Representing a transpose;

τ represents a temperature coefficient;

step B5, self-monitoring loss Learning loss/>, as compared with supervisionCombining to obtain cooperative contrast lossHelping student model to better extract discriminant features,/>The calculation formula of (2) is as follows:

Further, in step B1, the geometric transformation includes rotating, scaling and adjusting the aspect ratio of the image.

Further, in step 7, assuming that the non-stationary data stream is composed of n sample-disjoint tasks { T ₁,T₂,...,T_n }, let x denote task T _n and bufferY is the category marked by the image sample x; cross entropy classification loss/>, of student modelThe calculation formula of (2) is as follows:

Wherein:

representing a mathematical expectation function;

softmax (·) represents the softmax function;

l (·) represents a cross entropy loss function;

r ^s (x) represents classified output data of the image sample x after sequentially passing through the feature encoder and the classifier of the student model.

Further, in step 9, a specific method for updating the parameters of the teacher model by using the parameters of the student model is as follows:

Setting Θ ^t,Φ^t,Ψ^t as a feature encoder, a classifier and a feature mapper of the teacher model, and setting Θ ^s,Φ^s,Ψ^s as a feature encoder, a classifier and a feature mapper of the teacher model; the updating method of the teacher model parameters comprises the following steps:

Θ^t←mΘ^t+(1-m)[(1-X)Θ^t+XΘ^s]；

Φ^t←mΦ^t+(1-m)[(1-X)Φ^t+XΦ^s]；

Ψ^t←mΨ^t+(1-m)[(1-X)Ψ^t+XΨ^s]；

where m represents a momentum factor and X obeys a Bernoulli distribution (also referred to as a 0-1 distribution), defined as:

P(X＝k)＝p^k(1-p)^1-k，k＝{0，1}；

The value range of the Bernoulli probability p is (0, 1), and the updating frequency of the teacher model is controlled through the Bernoulli probability p.

Further, the calculation formula of the momentum factor m is as follows:

m＝min(itera/(itera+1)，η)；

Wherein itera is the iteration number of the current student model, min (itera/(itera +1), η) represents the smaller one of itera/(itera +1) and η, η is a constant, and is generally set to 0.999.

The invention has the advantages and positive effects that: according to the generalized continuous classification method based on the online comparison distillation network, the old task knowledge is effectively consolidated by using a teacher-student framework in online knowledge distillation, so that the model has good classification accuracy for both new tasks and old tasks. In the training stage, the training strategy of contrast learning is introduced into online knowledge distillation, the teacher model realizes the accumulation of knowledge by integrating weights of student models at all moments, and the student models relieve catastrophic forgetting by distilling classification output data and contrast relations to the teacher model. The teacher model and the student model cooperate with each other, so that the student model keeps the performance of the old task, the teacher model accumulates more balanced weights on the old task and the new task when accumulating weights, and the teacher model can better guide the student model to consolidate the knowledge of the old task when the student model trains the new task. In the test stage, the invention adopts the teacher model for testing, because the teacher model integrates the advantages of distinguishing different categories of student models at different moments, the teacher model has good classification performance for all the categories. Therefore, the invention can effectively integrate the advantages of the student network and improve the classification accuracy of the teacher network during testing.

Drawings

FIG. 1 is a workflow diagram of a generalized continuous classification method based on an online comparative distillation network according to the present invention.

Detailed Description

For a further understanding of the invention, its features and advantages, reference is now made to the following examples, which are illustrated in the accompanying drawings in which:

Referring to fig. 1, a generalized continuous classification method based on an online comparative distillation network includes the following steps:

Step 1, establishing a classification model based on knowledge distillation, wherein the classification model comprises a teacher model and a student model; the teacher model and the student model are respectively provided with a feature encoder, a classifier and a feature mapper; setting an optimization target of a student model; parameters of the teacher model and the student model are initialized and a buffer of a fixed size is given.

And 2, counting the number of samples currently encountered when a batch data stream containing R samples arrives, and updating a buffer zone by using a reservoir sampling method.

Step 3, randomly sampling S samples from the buffer area, respectively inputting the S samples into a teacher model and a student model, and respectively obtaining classified output data of the teacher model and the student model corresponding to the S samples through processing of respective feature encoders and classifiers of the teacher model and the student model; respectively obtaining feature embedded data of a teacher model and a student model corresponding to the S samples through processing of respective feature encoders and feature mappers of the two samples; namely: processing the S samples sequentially through a feature encoder and a classifier of the teacher model to obtain a classified output data set of the teacher model; sequentially processing the characteristic encoder and the classifier of the student model to obtain a classified output data set of the student model; sequentially processing by a feature encoder and a feature mapper of the teacher model to obtain a feature embedded data set of the teacher model; and sequentially processing the characteristics of the student model by a characteristic encoder and a characteristic mapper to obtain a characteristic embedded data set of the student model.

Step 8, calculating the total optimization target of the student model Alpha ₁ to alpha ₃ are hyper-parameters of each corresponding loss function; and optimizing parameters of the student model by using a random gradient descent algorithm.

Preferably, in step 2, it may be assumed that the non-stationary data stream is composed of n sample-disjoint tasks { T ₁,T₂,...,T_n }, the training set of each task T _n is composed of tagged dataThe composition, wherein m is the number of samples of the task T _n training set, x _i is the ith image sample in the task T _n training set, and y _i is the category marked by the ith image sample x _i in the task T _n training set; buffer/>Is/>X _j is the jth image sample in the buffer, y _j is the class marked by the jth image sample x _j in the buffer; the method of reservoir sampling may comprise the steps of:

Step A1, judging the size between the number of samples num currently encountered and the buffer capacity B, if num is less than or equal to B, directly storing the samples (x _i,y_i) into the buffer In (a) and (b); x _i is the ith image sample in the training set of task T _n, and y _i is the category labeled by the ith image sample x _i in the training set of task T _n.

Step A2, ifGenerating a random integer rand_num, wherein the minimum value of the random integer is 0, and the maximum value of the random integer is num-1; if/>Replacing samples in the buffer with samples (x _i,y_i) (x _{rand_num},y_{rand_num});x_{rand_num} denotes buffer/>, indexed rand_numY _{rand_num} denotes the buffer/>, indexed rand_numIs included.

Preferably, in step 4, the method for calculating the quality score of the teacher model classified output data may be as follows:

The method can be provided with: representing a capacity of/> Is a buffer of (1); x _j is the jth image sample in the buffer, y _j is the class marked by the jth image sample x _j in the buffer; /(I)The sample x _j is represented to be sequentially processed by a feature encoder and a classifier of the teacher model to obtain classified output data; omega (x _j) is the mass fraction of the teacher model class output data corresponding to sample x _j; the formula for ω (x _j) can be as follows:

Wherein: ρ represents a temperature coefficient; c represents the number of all possible categories; exp (·) represents an exponential function based on a natural constant e; Output data for classification/> The classification output data of the corresponding category y _j; /(I)Representing classified output dataThe classification of each category of the data is output.

Preferably, in step 4, there is providedThe sample x _j is represented to be sequentially processed by a feature encoder and a classifier of the student model to obtain classified output data; calculating online distillation loss/>, of teacher model and student modelThe method of (2) is as follows:

wherein: | ₂ denotes the l ₂ norm; representing a mathematical expectation function; exp (·) represents an exponential function based on a natural constant e.

Preferably, in step 5, it may be provided that: representing a capacity of/> Is a buffer of (1); x _j is the jth image sample in the buffer, y _j is the class marked by the jth image sample x _j in the buffer; /(I)After a sample x _j is input into a teacher model, feature embedded data of the teacher model is obtained through a feature encoder and a feature mapper; /(I)After the sample x _j is input into the student model, feature embedded data of the student model are obtained through a feature encoder and a feature mapper; z ^t is that after all samples x _j of the current batch are input into a teacher model, all teacher model feature embedded data/>, which is obtained through feature encoder and feature mapper processingIs a collection of (3); z ^s is that after all samples x _j of the current batch are input into the student model, all student model feature embedded data/>, which are obtained through a feature encoder and a feature mapperIs a collection of (3); /(I)Representing feature embedded data sampled from z ^s; z ^t+ represents and/>Teacher model features with the same class labels are embedded into the dataset; /(I)Representing feature embedded data sampled from z ^t+; /(I)Representing feature embedded data sampled from z ^t; calculating comparative relation distillation loss/>, of teacher model and student modelThe method of (2) can be as follows:

Wherein: Representing a mathematical expectation function; | ₂ denotes the l ₂ norm; log (·) represents a natural log function based on a natural constant e; /(I) Representing a judging function for judging feature embedded data/>And/>Whether derived from their joint distribution/> Representing a judging function for judging feature embedded data/>And/>Whether derived from their joint distribution/> Representing a transpose; exp (·) represents an exponential function based on a natural constant e; τ represents the temperature coefficient.

Preferably, the step 6 may include the following sub-steps:

Step B1, a feature encoder, a classifier and a feature mapper corresponding to Θ ^t,Φ^t,Ψ^t and to the teacher model can be provided, a feature encoder, a classifier and a feature mapper corresponding to Θ ^s,Φ^s,Ψ^s and to the teacher model can be provided, each training sample (x, y) of the student model is subjected to a random geometric transformation to obtain amplified training samples Where x represents the image sample and y is the category marked by the image sample x,/>For geometrically transformed image samples,/>A label that is geometrically transformed; amplified training samples/>Inputting the data into a student model, and processing the data by a feature encoder and a feature mapper of the student model to obtain corresponding student model feature data F ^s and feature embedded data/>Wherein:

Step B2, the obtained student model feature data F ^s can be input into a multi-layer sensor In the method for judging training samples/>The kind of geometric transformation is performed; the calculation formula for S ^s,S^s, which can set the output of the multi-layer sensor, can be as follows:

step B3, calculating self-supervision loss The calculation formula of (2) can be as follows:

Wherein the method comprises the steps of Representing a mathematical expectation function; softmax (·) represents the softmax function; l (·) represents a cross entropy loss function;

Step B4, the following steps are set: representing a capacity of/> Is a buffer of (1); x _j is the jth image sample in the buffer, y _j is the class marked by the jth image sample x _j in the buffer; /(I)After the sample x _j is input into the student model, feature embedded data of the student model are obtained through a feature encoder and a feature mapper; /(I)Feature embedded data/>, representing the resulting overall student modelAnd/>Is a collection of (3); /(I)The expression slave/>Embedding the characteristics obtained by sampling into data; /(I)Representation and/>Student model features with the same class labels are embedded into the dataset; /(I)The expression slave/>Embedding the characteristics obtained by sampling into data; /(I)The expression slave/>Embedding the characteristics obtained by sampling into data; based on the original feature embedded data and the amplified feature embedded data, the original feature embedded data and the amplified feature embedded data in the student model can be utilized to conduct supervision and contrast learning, and a loss function/>, of the supervision and contrast learning is utilizedThe calculation formula of (2) can be as follows:

Wherein:

Wherein the method comprises the steps of Representing mathematical expectations; | ₂ denotes the l ₂ norm; log (·) represents a natural log function based on a natural constant e; /(I)Representing feature embedded data/>And/>Is a distance of (2); /(I)Representing feature embedded data/>And/>Is a distance of (2); exp (·) represents an exponential function based on a natural constant e; /(I)Representing a transpose; τ represents a temperature coefficient;

Step B5, self-monitoring loss Learning loss/>, as compared with supervisionCombining to obtain the collaborative contrast loss/>Helping student model to better extract discriminant features,/>The calculation formula of (2) can be as follows:

preferably, in step B1, the geometric transformation may comprise rotating, scaling and adjusting the aspect ratio of the image.

Preferably, in step 7, it may be assumed that the non-stationary data stream consists of n sample-disjoint tasks { T ₁,T₂,…,T_n }, let x denote task T _n and bufferY is the category marked by the image sample x; cross entropy classification loss/>, of student modelThe calculation formula of (2) can be as follows:

Wherein: Representing a mathematical expectation function; softmax (·) represents the softmax function; l (·) represents a cross entropy loss function; r ^s (x) represents classified output data of the image sample x after sequentially passing through the feature encoder and the classifier of the student model.

Preferably, in step 9, a specific method for updating the parameters of the teacher model by using the parameters of the student model may be as follows:

Setting Θ ^t,Φ^t,Ψ^t as a feature encoder, a classifier and a feature mapper of the teacher model, and setting Θ ^s,Φ^s,Ψ^s as a feature encoder, a classifier and a feature mapper of the teacher model; the updating method of the teacher model parameters can be as follows:

Θ^t←mΘ^t+(1-m)[(1-X)Θ^t+XΘ^s]；

Φ^t←mΦ^t+(1-m)[(1-X)Φ^t+XΦ^s]；

Ψ^t←mΨ^t+(1-m)[(1-X)Ψ^t+XΨ^s]；

where m represents a momentum factor and X obeys a Bernoulli distribution (also referred to as a 0-1 distribution), which can be defined as:

P(X＝k)＝p^k(1-p)^1-k，k＝{0，1}；

wherein the value range of the Bernoulli probability p is (0, 1), and the updating frequency of the teacher model can be controlled through the Bernoulli probability p.

Preferably, the calculation formula of the momentum factor m can be as follows:

m＝min(itera/(itera+1)，η)；

Wherein itera is the iteration number of the current student model, min (itera/(itera +1), η) represents the smaller one of itera/(itera +1) and η, η is a constant, and can be set to 0.999.

The workflow and working principle of the invention are further described in the following with a preferred embodiment of the invention:

The generalized continuous learning is to consolidate old knowledge from non-stationary data streams and accumulate new knowledge at the same time, and finally complete classification prediction of all the seen class images. Assuming that the non-stationary data stream consists of N sample disjoint tasks { T ₁,T₂,...,T_N }, the training set of each task T _n is composed of tagged data Composition, where m is the number of samples of the task T _n training set, x _i is the ith image sample in the task T _n training set, and y _i is the category labeled by the ith image sample x _i in the task T _n training set. In the test stage, the generalized continuous learning method can finish classification tasks for all the categories which are seen at present. The test set for each task T _n is made up of tagged data/>Composition, where p is the number of samples of task T _n test set, x _q is the qth image sample in task T _n test set, and y _q is the category marked by the qth image sample x _q in task T _n test set. The generalized continuous learning task is to conduct category prediction on all task { T ₁,T₂,...,T_n } test sets trained currently.

FIG. 1 is a workflow diagram of a generalized continuous classification method based on an online comparative distillation network according to the present invention. Wherein,Representing a capacity of/>X _j is the j-th image sample in the buffer, and y _j is the class marked by the j-th image sample x _j in the buffer. /(I)Representing on-line distillation loss,/>Representing comparative distillation losses. Setting Θ ^t,Φ^t,Ψ^t as the feature encoder, classifier and feature mapper of teacher model, setting Θ ^s,Φ^s,Ψ^s as the feature encoder, classifier and feature mapper of student model,

The invention discloses a generalized continuous classification method based on an online comparison distillation network, which comprises the following steps:

Step1, before a task starts, firstly initializing parameters of a teacher model and a student model and giving a buffer zone with a fixed size: the component (a) is ^t＝Θ^s,Φ^t＝Φ^s,Ψ^t＝Ψ^s,

Step 2, counting the number num of samples currently encountered when a batch data stream containing bsz samples arrives, and updating a buffer zone by using a reservoir sampling methodThis ensures that the probability that all samples are stored in the buffer is equal. For a particular sample, the specific steps of sampling by using the reservoir include:

(1) Judging the number of samples num and the buffer capacity of the current encountered sample The size of the space, if/>Storing the samples (x _i,y_i) directly into a buffer;

(2) If it is A random integer rand_num is generated, the minimum value of the random integer is 0, and the maximum value is num-1. If/>Replacing samples in the buffer with samples (x _i,Y_i) (x _{rand_num},y_{rand_num});x_rand-num denotes index/>Y _{rand_num} denotes the index/>Image sample tags in a buffer of (a). /(I)

Step 3, from the buffer areaRandomly sampling S samples x _j to consolidate old knowledge, respectively inputting the S samples x _j into a teacher model and a student model, and respectively obtaining classified output data of the teacher model and the student model through a feature encoder and a classifier, wherein the classified output data are as follows:

the feature embedded data of the teacher model and the student model obtained through the feature encoder and the feature mapper are respectively as follows:

Step 4, setting: The sample x _j is represented to be sequentially processed by a feature encoder and a classifier of the teacher model to obtain classified output data; omega (x _j) is the mass fraction of the teacher model class output data corresponding to sample x _j; /(I) Representing classified output data/>Classification output data of each category; let/>The sample x _j is represented to be sequentially processed by a feature encoder and a classifier of the student model to obtain classified output data; /(I)Output data for classification/>The classification of the corresponding category y _j.

The quality fraction omega (x _j) of the classified output data of each sample is obtained through the calculation and sampling of the perceptron:

Where ρ is the temperature coefficient, C represents the number of all possible classes, exp (·) represents an exponential function based on the natural constant e.

Calculating an on-line distillation loss according to formulas (1), (2) and (5)

Wherein | ₂ denotes the l ₂ norm,Representing a mathematical expectation function. By giving the difference between outputs of teacher model and student model/>Weight ω (x _j) to let the student model focus more on samples with high mass fractions of the samples.

Step 5, comparing the characteristic embedded data between the teacher model and the student model, and calculating the comparison relation distillation loss according to formulas (3) and (4)

Wherein the method comprises the steps ofRepresenting a mathematical expectation function, log (·) representing a natural logarithmic function based on a natural constant e,/>After a sample x _j is input into a teacher model, feature embedded data of the teacher model is obtained through a feature encoder and a feature mapper; /(I)After the sample x _j is input into the student model, feature embedded data of the student model are obtained through a feature encoder and a feature mapper; z ^t is that after all samples x _j of the current batch are input into a teacher model, all teacher model feature embedded data/>, which is obtained through feature encoder and feature mapper processingIs a collection of (3); z ^s is that after all samples x _j of the current batch are input into the student model, all student model feature embedded data/>, which are obtained through a feature encoder and a feature mapperIs a collection of (3); Representing feature embedded data sampled from z ^s; z ^t+ represents and/> Teacher model features with the same class labels are embedded into the dataset; /(I)Representing feature embedded data sampled from z ^t+; /(I)Representing feature embedded data sampled from z ^t.

Representing a judging function for judging feature embedded data/>And/>Whether derived from their joint distribution/>The calculation formula is as follows:

Where exp (·) represents an exponential function based on a natural constant e, | ₂ represents the l ₂ norm, (·) ^T represents the transpose, τ represents the temperature coefficient.

Step 6, utilizing self-supervision learning and supervision contrast learning to help the student model extract discriminant features, comprising the following specific steps:

(1) Each training sample (x, y) of the student model is subjected to random geometric transformation once to obtain amplified training samples Where x represents the image sample and y is the category marked by the image sample x,/>For geometrically transformed image samples,/>Is a geometrically transformed label. The geometric transformation includes rotating, scaling, and adjusting the aspect ratio of the image. Thus, the number of training images of the student model is doubled. For this set of images/>, with random geometric transformationsInputting them into the student network, corresponding student model features and feature embedded data are obtained:

(2) Inputting the obtained student model features into a multi-layer sensor In the method for judging training samples/>Types of geometric transformations:

(3) Calculating self-supervision loss />

Wherein the method comprises the steps ofRepresenting the mathematical expectation function, softmax (·) represents the softmax function, and l (·) represents the cross entropy loss function.

(4) Is provided withAfter the sample x _j is input into the student model, feature embedded data of the student model are obtained through a feature encoder and a feature mapper; /(I)Feature embedded data/>, representing the resulting overall student modelAnd/>Is a collection of (3); /(I)Representing slaveEmbedding the characteristics obtained by sampling into data; /(I)Representation and/>Student model features with the same class labels are embedded into the dataset; the expression slave/> Embedding the characteristics obtained by sampling into data; /(I)The expression slave/>Embedding the characteristics obtained by sampling into data; based on the original feature embedded data and the amplified feature embedded data, performing supervision and contrast learning by utilizing the feature embedded data in the student model, and monitoring and contrast learning a loss function/>The calculation formula of (2) is as follows:

Wherein:

representing mathematical expectations; log (·) represents a natural log function based on a natural constant e;

representing feature embedded data/> And/>Is a distance of (2); /(I)Representing feature embedded data/>And/>Is a distance of (2); exp (·) represents an exponential function based on a natural constant e; | ₂ denotes the l ₂ norm; /(I)Representing a transpose; τ represents the temperature coefficient.

(5) For self-supervision lossLearning loss/>, as compared with supervisionCombining to obtain cooperative contrast lossHelping student models to better extract discriminant features:

step 7, based on experience playback, calculating the cross entropy classification loss of the student model:

Where x represents task T _n and buffer Y is the category marked by the image sample x; Representing a mathematical expectation function; softmax (·) represents the softmax function; l (·) represents a cross entropy loss function; r ^s (x) represents classified output data of the image sample x after sequentially passing through the feature encoder and the classifier of the student model.

R ^s (x) represents the output of the image sample x through the feature encoder Θ ^s and classifier Φ ^s of the student model:

Step 8, calculating the total optimization target of the student model Optimizing parameters of the student model by using a random gradient descent algorithm:

Wherein α ₁、α₂ and α ₃ represent hyper-parameters.

Step 9, the teacher model directly uses the parameters of the student model to update the parameters of the teacher model, and the parameters do not involve gradient feedback, and Θ ^t,Φ^t,Ψ^t corresponds to a feature encoder, a classifier and a feature mapper of the teacher model, and Θ ^s,Φ^s,Ψ^s corresponds to a feature encoder, a classifier and a feature mapper of the teacher model. The updating method comprises the following steps:

Θ^t←mΘ^t+(1-m)[(1-X)Θ^t+XΘ^s] (21)；

Φ^t←mΦ^t+(1-m)[(1-X)Φ^t+XΦ^s] (22)；

Ψ^t←mΨ^t+(1-m)[(1-X)Ψ^t+XΨ^s] (23)；

P(X＝k)＝p^k(1-p)^1-k，k＝{0，1} (24)；

In order for the teacher model to learn new knowledge quickly at an early stage of model training, the momentum factor m is designed as:

m＝min(itera/(itera+1)，η) (25)；

The generalized continuous classification method based on the online comparison distillation network can test at any time. In the test stage, a teacher model is adopted for testing. The reason is that student models at different times are good at classifying different categories, and teacher models learned from the student models can cumulatively learn their advantages. Thus, the teacher model has a greater ability to distinguish all of the categories seen than the student model.

The above-described embodiments are only for illustrating the technical spirit and features of the present invention, and it is intended to enable those skilled in the art to understand the content of the present invention and to implement it accordingly, and the scope of the present invention is not limited to the embodiments, i.e. equivalent changes or modifications to the spirit of the present invention are still within the scope of the present invention.

Claims

1. The generalized continuous classification method based on the online comparison distillation network is characterized by comprising the following steps of:

Step 2, assuming that the non-stationary data stream is composed of tasks with disjoint samples, the training set of each task is composed of labeled data The method comprises the steps of forming a task training set, wherein the number of samples is the number of samples in the task training set, the number of samples in the task training set is the number of samples in the first image sample in the task training set, and the number of the samples in the task training set is the number of the samples in the first image sample in the task training set; buffer/>The capacity of the image sensor is that the first image sample in the buffer area is the category marked by the first image sample in the buffer area; when a batch of data flow containing R samples arrives, counting the number of the samples currently encountered, and updating a buffer area by using a reservoir sampling method;

Step 8, calculating the total optimization target of the student model To the super parameter of each corresponding loss function; optimizing parameters of the student model by using a random gradient descent algorithm;

2. The generalized continuous classification method based on an on-line comparative distillation network according to claim 1, wherein in step 2, the method of reservoir sampling comprises the steps of:

Step A1, judging the number num of samples and the buffer capacity of the current encountered sample The size of the space, if/>Storing samples (x _i,y_i) directly to the buffer/>In (a) and (b); x _i is the ith image sample in the training set of task T _n, y _i is the category marked by the ith image sample x _i in the training set of task T _n;

3. The generalized continuous classification method based on online comparative distillation network according to claim 1, wherein in step 4, the method for calculating the mass fraction of the teacher model classification output data is as follows:

Wherein:

ρ represents a temperature coefficient;

c represents the number of all possible categories;

exp (·) represents an exponential function based on a natural constant e;

4. The generalized continuous classification method based on online comparative distillation network according to claim 3, wherein in step 4, there is providedThe sample x _j is represented to be sequentially processed by a feature encoder and a classifier of the student model to obtain classified output data; calculating online distillation loss/>, of teacher model and student modelThe method of (2) is as follows:

Wherein: i/₂ represents A norm; /(I)Representing a mathematical expectation function.

5. The generalized continuous classification method based on an online contrastive distillation network according to claim 1, wherein in step 5, it is set that: representing a capacity of/> Is a buffer of (1); x _j is the jth image sample in the buffer, y _j is the class marked by the jth image sample x _j in the buffer; /(I)After a sample x _j is input into a teacher model, feature embedded data of the teacher model is obtained through a feature encoder and a feature mapper; /(I)After the sample x _j is input into the student model, feature embedded data of the student model are obtained through a feature encoder and a feature mapper; z ^t is that after all samples x _j of the current batch are input into a teacher model, all teacher model feature embedded data/>, which is obtained through feature encoder and feature mapper processingIs a collection of (3); z ^s is that after all samples x _j of the current batch are input into the student model, all student model feature embedded data/>, which are obtained through a feature encoder and a feature mapperIs a collection of (3); /(I)Representing feature embedded data sampled from z ^s; z ^t+ represents and/>Teacher model features with the same class labels are embedded into the dataset; /(I)Representing feature embedded data sampled from z ^t+; /(I)Representing feature embedded data sampled from z ^t; calculating comparative relation distillation loss/>, of teacher model and student modelThe method of (2) is as follows:

Wherein:

representing a mathematical expectation function;

i/₂ represents A norm;

log (·) represents a natural log function based on a natural constant e;

Representing a transpose;

exp (·) represents an exponential function based on a natural constant e;

τ represents the temperature coefficient.

6. The generalized continuous classification method based on an online comparative distillation network according to claim 5, wherein step 6 comprises the following sub-steps:

softmax (·) represents the softmax function;

representing a cross entropy loss function;

Wherein:

i/₂ represents A norm;

log (·) represents a natural log function based on a natural constant e;

representing feature embedded data/> And/>Is a distance of (2);

exp (·) represents an exponential function based on a natural constant e;

Representing a transpose;

τ represents a temperature coefficient;

step B5, self-monitoring loss Learning loss/>, as compared with supervisionCombining to obtain the collaborative contrast loss/>Helping student model to better extract discriminant features,/>The calculation formula of (2) is as follows:

7. The generalized continuous classification method based on an online contrasted distillation network according to claim 1, wherein in step B1, the geometric transformation includes rotating, scaling and adjusting the aspect ratio of the image.

8. The generalized continuous classification method based on online contrastive distillation network according to claim 1, wherein in step 7, assuming that the non-stationary data stream is composed of n sample disjoint tasks { T ₁,T₂,...,T_n }, let x denote task T _n and bufferY is the category marked by the image sample x; cross entropy classification loss/>, of student modelThe calculation formula of (2) is as follows:

Wherein:

representing a mathematical expectation function;

softmax (·) represents the softmax function;

representing a cross entropy loss function;

9. The generalized continuous classification method based on online contrasted distillation network according to claim 1, wherein in step 9, a specific method for updating parameters of a teacher model by using parameters of a student model is as follows:

Θ^t←mΘ^t+(1-m)[(1-X)Θ^t+XΘ^s]；

Φ^t←mΦ^t+(1-m)[(1-X)Φ^t+XΦ^s]；

Ψ^t←mΨ^t+(1-m)[(1-X)Ψ^t+XΨ^s]；

Where m represents a momentum factor and X obeys the bernoulli distribution, defined as:

P(X＝k)＝p^k(1-p)^1-k，k＝{0，1}；

10. The generalized continuous classification method based on an online contrastive distillation network according to claim 9, wherein the calculation formula of the momentum factor m is as follows:

m＝min(itera/(itera+1)，η)；

wherein itera is the iteration number of the current student model, min (itera/(itera +1), η) represents the smaller one of itera/(itera +1) and η, η is a constant, and is set to 0.999.