CN113963165A

CN113963165A - Small sample image classification method and system based on self-supervision learning

Info

Publication number: CN113963165A
Application number: CN202111098484.4A
Authority: CN
Inventors: 王蕊; 施璠; 操晓春
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2021-09-18
Filing date: 2021-09-18
Publication date: 2022-01-21
Anticipated expiration: 2041-09-18

Abstract

The invention discloses a small sample (few-shot) image classification method and system based on self-supervised learning, belonging to the technical field of computer vision. A feature extractor with good generalization performance is trained on all training data of a data set by means of self-supervised learning, contrastive learning and co-learning. Self-supervised learning is applied to the training of small sample learning, which improves the representation capability of the feature extractor; contrastive learning is applied to small sample learning while the metric function is optimized, so that the features learned by the feature extractor have clearer classification boundaries; co-learning is applied to the training of small sample learning, which introduces a regularization constraint and improves the generalization performance of the network.

Description

Small sample image classification method and system based on self-supervision learning

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a method and a system capable of classifying images in a small sample scene.

Background

Image classification is a fundamental task in computer vision. The current trend in deep learning is to keep deepening the network structure in order to improve classification accuracy. However, a deeper model has more learnable parameters and needs more data to train them, which is why data-driven model training has become one of the mainstream trends in deep learning. This data-driven approach typically requires a large amount of annotated data, which poses a significant problem. On the one hand, labeling data is very labor-intensive; semi-supervised and unsupervised learning methods start from this observation and try to reduce the dependence of the network on labeled data. On the other hand, in many application scenarios, such as rare species and newly emerging things, large-scale data cannot be acquired at all. In both cases, traditional deep learning methods have difficulty achieving ideal classification results.

Small sample learning aims to solve the problem that models are difficult to train when little data is available. It generally requires only a small number of images of a class in order to make classification predictions for that class. However, the quality of a small sample learning model depends largely on the difference between the data distributions of the test set and the training set, because many deep learning networks can handle tasks similar to those seen in training but adapt poorly to domains they have not encountered. This is also what distinguishes the small sample learning task from the traditional image classification task: the network needs the ability to process unseen data classes.

In terms of type, small sample learning is mainly divided into transductive learning and inductive learning. The main difference is whether the unlabeled samples to be predicted are available during training. Transductive learning has access to the test data during training, and its final goal is only to predict the labels of that test data; when a new sample to be predicted appears, the model must be retrained. Inductive learning does not require the test data during training, i.e. the trained model can be used directly to predict unknown test data. Of the two, inductive small sample learning can handle more data that the network has not seen and needs no retraining, but it requires the network to generalize; its range of application is wider.

The main approaches to small sample learning fall into the following three categories:

1) Optimization-based methods. Small sample learning is regarded as a new task and the concept of meta-learning, i.e. learning to learn, is used; the final aim is for the model to converge as quickly as possible when it faces a new group of learning tasks. Training with this kind of method usually consists of two loops. For example, MAML is composed of a base learner and a meta learner: during training, the inner loop trains the base learner on each independent task, the outer loop optimizes the meta learner according to the validation performance of the base learner, and the result is an optimal initialization of the base learner's parameters that can adapt quickly to a new task.

2) Augmentation methods based on generated data. At the data level, the problem of too little original data can be alleviated by generating new data samples; at the feature level, generative methods can learn not only the classification boundary between specific categories but also, by introducing the concept of data distribution, the complete boundary of the category distribution, so that the problem of category combination is solved.

3) Metric-learning-based methods. A feature extractor is first obtained through training; an image is mapped by the feature extractor to a feature vector in a feature space, the distance between different images is computed with a suitable metric function (such as Euclidean distance or cosine distance), and the images are finally classified according to these distances. RFS also points out the importance of obtaining a good feature space in small sample learning.
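The metric-based approach in item 3) can be illustrated with a minimal NumPy sketch. The function names, the use of class-mean prototypes and the toy two-dimensional feature vectors are assumptions made for illustration; the text above only states that feature vectors are compared with a metric such as Euclidean or cosine distance.

```python
import numpy as np

def cosine_distance(a, b):
    """Pairwise cosine distance between rows of a (n, d) and rows of b (m, d)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_n @ b_n.T

def nearest_prototype(support_feats, support_labels, query_feats):
    """Assign each query to the class whose mean support feature is closest."""
    classes = np.unique(support_labels)
    # One prototype per class: the mean of that class's support features (an assumption).
    prototypes = np.stack(
        [support_feats[support_labels == c].mean(axis=0) for c in classes]
    )
    dist = cosine_distance(query_feats, prototypes)
    return classes[np.argmin(dist, axis=1)]

# Toy features standing in for the output of a trained feature extractor.
support = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]])
labels = np.array([0, 0, 1, 1])
queries = np.array([[0.8, 0.2], [0.2, 0.9]])
pred = nearest_prototype(support, labels, queries)
```

Swapping `cosine_distance` for a Euclidean distance changes only the metric, not the classification rule.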

Small sample learning currently still faces great challenges. In the training process, because the amount of data is small, a deeper model tends to under-fit and has poor representation capability, while a shallower model easily over-fits and generalizes poorly. In the test process, a support set that is too small can differ greatly from the real data distribution and cannot represent the new class well, so the classification accuracy on the query set is low. These problems remain to be solved.

Disclosure of Invention

The invention aims to provide a small sample image classification method and system based on self-supervised learning, which can realize image classification by using less data volume based on the self-supervised learning.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

a small sample image classification method based on self-supervision learning comprises the following steps:

constructing two image classification networks with the same structure but without sharing weights, wherein each image classification network comprises a feature extractor, a rotation class classifier, a supervised contrast learning classifier and an image class classifier;

in the training stage, image training data with class labels are input into each of the two image classification networks, and the two image classification networks are trained simultaneously, wherein the training steps are as follows:

rotating each input image every 90 degrees to obtain four images in four directions, and extracting feature vectors of the images through a feature extractor respectively;

inputting the feature vector of the image obtained by rotation into a rotation category classifier to classify the rotation direction of the image, and calculating the cross entropy loss of the rotation category classifier;

for each image, taking its own rotated images and the rotated images of other images of the same class as positive examples, taking the rotated images of images of other classes as negative examples, inputting the feature vectors of the positive examples and the negative examples into the supervised contrast learning classifier for classification to obtain the probability of belonging to the same class, and calculating the cross entropy loss of the supervised contrast learning classifier;

directly inputting the feature vector of each image into an image category classifier for classification, and calculating the cross entropy loss of the image category classifier;

performing co-learning between the outputs of the image class classifiers of the two image classification networks through a KL divergence constraint, and calculating the cross entropy loss of the co-learning;

carrying out a weighted summation of the cross entropy loss of the rotation class classifier, the cross entropy loss of the supervised contrast learning classifier, the cross entropy loss of the image class classifier and the cross entropy loss of the co-learning to obtain the overall loss; through iterative training, the overall loss is minimized and a trained feature extractor is obtained;

in the using stage, classifying the image to be classified, and the steps are as follows:

inputting training images whose class labels are consistent with the classes of the images to be classified into the trained feature extractor to extract feature vectors, and training a rotation class classifier, a supervised contrast learning classifier and an image class classifier with these feature vectors;

and inputting the images to be classified into the trained feature extractor to extract feature vectors, inputting the extracted feature vectors into the trained rotation class classifier, supervised contrast learning classifier and image class classifier, and outputting the image classification results.

A small sample image classification system based on self-supervision learning comprises two image classification networks with the same structure and without sharing weight, wherein each image classification network comprises:

a feature extractor for extracting a feature vector of the image obtained by the rotation;

a rotation category classifier for classifying the feature vectors of the image obtained by rotation according to a rotation direction;

a supervised contrast learning classifier for classifying the feature vectors of positive examples and negative examples to obtain the probability of belonging to the same class, wherein, for each image, the positive examples are its own rotated images and the rotated images of other images of the same class, and the negative examples are the rotated images of images of other classes;

the image category classifier is used for classifying the images according to the input feature vector of each image;

the feature extractor is trained through image training data with class labels in advance, the outputs of the image class classifiers of the two image classification networks are jointly learned through KL divergence constraint during training, overall loss is minimized through iterative training, and training is completed.

The main innovation points of the invention comprise the following three points:

1) Self-supervised learning is applied to the training of small sample learning, which improves the representation capability of the feature extractor in small sample learning.

2) Contrastive learning is applied to small sample learning while the metric function is optimized, so that the features learned by the feature extractor have clearer classification boundaries.

3) Co-learning is applied to the training of small sample learning, which introduces a regularization constraint and improves the generalization performance of the network.

Drawings

FIG. 1 is a schematic diagram of the network structure in the training stage of the method of the present invention;

FIG. 2 is a schematic diagram of the network structure in the testing stage of the method of the present invention.

Detailed Description

In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.

In this embodiment, an image classification network structure as shown in fig. 1 is established, and is composed of two networks with the same structure but not sharing weights, and each image classification network includes a feature extractor, a rotation class classifier, a supervised contrast learning classifier, and an image class classifier.

During training, each training image with a category label (for example, birds, fruits, billboards) is augmented by rotation (0°, 90°, 180°, 270°), and the resulting images in four directions are fed as network input to the feature extractor, which extracts their feature vectors.

1. In each network, the features extracted by the feature extractor are input to the following three classifiers:

1) Rotation class classifier Θ

A self-supervised learning task is constructed by rotating the images. Intuitively, the shooting angle of an image is closely related to the orientation of the objects in it: in a normally shot picture, for example, the roof of a house is at the top and a billboard stands on the ground, so the correct orientation of a rotated picture can be inferred from such clues. In the experiments, the rotation angle of each input image has to be predicted, and the clockwise rotations of 0°, 90°, 180° and 270° are taken as 4 rotation classes, i.e. a four-class classification problem. The loss sums the cross entropy over all images and all rotation classes:

$$L_{rot} = \sum_{x \in D} \sum_{r \in C_r} L\left(\Theta_i\left(F_i(x^r)\right), r\right)$$

where D is the small sample training data set, C_r is the set of 4 rotation classes, x^r denotes the r-th rotation transformation applied to the input x, L is the cross entropy loss function, F_i is the feature extractor of the i-th network, and Θ_i is the rotation class classifier of the i-th network.
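The rotation task can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the flattening step and the random linear layer `theta` are hypothetical stand-ins for the feature extractor F_i and the rotation class classifier Θ_i, and the loss is averaged over the four views rather than summed.

```python
import numpy as np

def rotate_four_ways(image):
    """Return the 0/90/180/270 degree clockwise rotations of an (H, W, C) image and their labels."""
    # np.rot90 rotates counter-clockwise for positive k, so k=-r gives r clockwise quarter turns.
    rotated = np.stack([np.rot90(image, k=-r, axes=(0, 1)) for r in range(4)])
    return rotated, np.arange(4)

def cross_entropy(logits, targets):
    """Mean cross entropy of integer targets under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))  # square image so that all four rotations can be stacked
views, rot_labels = rotate_four_ways(image)

# Hypothetical stand-ins for F_i and the rotation classifier: flatten, then a random linear layer.
features = views.reshape(4, -1)
theta = rng.normal(size=(features.shape[1], 4)) * 0.01
loss_rot = cross_entropy(features @ theta, rot_labels)
```

The rotation labels come for free from the augmentation itself, which is what makes the task self-supervised.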

2) Supervised contrast learning classifier Φ

The features obtained by the feature extractor are input into a multi-layer neural network, which maps them into a smaller space. In this space, features of the same class should be as close together as possible and features of different classes as far apart as possible, which yields more robust classification boundaries. Specifically, within a training batch, the four rotated versions of an image, the other images of the same class and their four rotated versions are positive examples of that image, while the images of other classes in the batch and their rotated versions are negative examples. This constructs a binary classification task: the feature vectors of the positive and negative examples are input into the supervised contrast learning classifier Φ, and cross entropy is used as the loss function.

The similarity of the features is computed within a training batch, with positive examples labeled 1 and negative examples labeled 0, and the cross entropy loss L_scl is computed from these similarities. In the loss formula, D^* denotes the rotation-augmented data set; B(x, y) denotes the samples in the same training batch as x that are labeled y, and its counterpart set denotes the samples in the same data set as x that are labeled other than y; F_i is the feature extractor of the i-th network and Φ_i is the supervised contrast learning classifier of the i-th network. τ denotes the temperature coefficient: a lower temperature generally gives better training results, but a very low temperature makes the network harder to train.

E is the mathematical expectation, and its subscript (x, y) ∈ D^* indicates the range of the data. x and two further image variables are input images that belong to D^*, to B(x, y) and to the counterpart set of B(x, y), respectively; y and a second label variable denote different labels. The base of the logarithm is not restricted.
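One way to realize the binary positive/negative task described above is sketched below in NumPy. The exact loss formula is not reproduced in the text, so this pairwise sigmoid form is an assumption, not the patented formula: every ordered pair in a batch gets a temperature-scaled cosine similarity as its logit, a target of 1 if the two samples share a class label and 0 otherwise, and binary cross entropy is averaged over all pairs. The function name and the toy embeddings are hypothetical.

```python
import numpy as np

def supervised_contrast_bce(embeddings, class_labels, tau=0.5):
    """Binary cross entropy over all ordered pairs (i, j), i != j, in a batch."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = (z @ z.T) / tau  # temperature-scaled cosine similarity
    # Target 1 for pairs with the same class label (positives), 0 otherwise (negatives).
    same = (class_labels[:, None] == class_labels[None, :]).astype(float)
    off_diag = ~np.eye(len(z), dtype=bool)  # drop self-pairs
    s, t = logits[off_diag], same[off_diag]
    # Numerically stable BCE-with-logits: max(s, 0) - s*t + log(1 + exp(-|s|)).
    return np.mean(np.maximum(s, 0) - s * t + np.log1p(np.exp(-np.abs(s))))

# Rotated views keep the class label of their source image, so they count as positives.
class_labels = np.array([0, 0, 1, 1])
clustered = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, -0.1]])
mixed = np.array([[1.0, 0.0], [-1.0, 0.0], [0.9, 0.1], [-0.9, -0.1]])
loss_clustered = supervised_contrast_bce(clustered, class_labels)
loss_mixed = supervised_contrast_bce(mixed, class_labels)
```

Embeddings that are grouped by class give a lower loss than embeddings in which same-class samples point in opposite directions, which is the behavior the contrast task is meant to encourage.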

3) Image class classifier Ψ

The feature vector of each rotated image is input into the image class classifier Ψ for class prediction, which determines the true class of the image. Cross entropy is used as the loss function:

$$L_{cls} = E_{(x, y) \in D^*}\left[ L\left(\Psi_i\left(F_i(x)\right), y\right) \right]$$

where D^* denotes the rotation-augmented data set, L is the cross entropy loss function, F_i is the feature extractor of the i-th network, and Ψ_i is the image class classifier of the i-th network.

2. Co-learning between two feature extractors

In the co-learning method, the outputs of the two image class classifiers learn from each other under a KL divergence constraint.

In the corresponding loss L_kl, D^* denotes the rotation-augmented data set, F_i is the feature extractor of the i-th network, Ψ_i is the image class classifier of the i-th network, and KL denotes the KL divergence formula. Because the KL divergence is asymmetric, the positions of the two terms are exchanged in the co-learning, so that the two networks learn from each other.
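The co-learning constraint can be sketched in NumPy as below. The symmetric sum of the two KL terms and the way the temperature T softens the logits are assumptions made for illustration, since the formula itself is not reproduced in the text; the function names and example logits are hypothetical.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Row-wise softmax of logits divided by the temperature T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Mean over the batch of KL(p || q) for row-wise distributions."""
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1))

def co_learning_loss(logits_1, logits_2, T=4.0):
    """Symmetric KL between the two networks' temperature-softened class predictions."""
    p1, p2 = softmax(logits_1, T), softmax(logits_2, T)
    # KL is asymmetric, so both orderings are included (assumed form).
    return kl_divergence(p1, p2) + kl_divergence(p2, p1)

# Hypothetical class logits from the image class classifiers of network 1 and network 2.
logits_a = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
logits_b = np.array([[1.0, 1.0, -0.5], [0.0, 2.0, 0.0]])
loss_same = co_learning_loss(logits_a, logits_a)
loss_diff = co_learning_loss(logits_a, logits_b)
```

The loss is zero when both networks predict the same distribution and grows as their predictions diverge, which is what lets it act as a regularizer between the two networks.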

3. Overall loss function

The above loss functions are multiplied by respective coefficients and added to obtain an overall loss function, which is specifically expressed as follows, wherein α, β, γ, and η represent weight coefficients of the loss functions.

L_total = α·L_cls + β·L_rot + γ·L_scl + η·L_kl

4. Algorithm flow

The algorithm flow of the training process is shown in table 1: in steps 2-3 the data set D is changed into D^* through rotation augmentation, and finally the two feature extractors F₁ and F₂, which are used for extracting features in the testing stage, are output.

The image classification in the small sample scene based on the self-supervised learning mainly comprises a training stage and a testing stage.

1) Training phase

In the training process, unlike the episode-based training used in optimization-based small sample learning methods, each image and its class label are used as one data unit in this embodiment. The self-supervised learning used in the training process does not require additional annotation of rotation-direction labels; the labels are constructed from the data itself. The co-learning strategy is a variant of knowledge distillation: unlike the iterative process in knowledge distillation, in which a teacher network is trained first and a student network afterwards, the two networks are trained together and training is completed in one step.

In the training process, one image is input, four images are obtained through rotation in sequence, and the four images are input into a feature extractor to obtain respective feature vectors. And inputting each feature vector into a rotation class classifier to obtain the rotation angle of the image. All pictures in a training batch are input into a supervised contrast learning classifier, and the probability that other images and an image in the training batch belong to the same class is obtained. Each feature vector is input into an image class classifier to obtain the class of the image.

The same steps are repeated in the other, identical network. Because classification is the main task and the other tasks are auxiliary, the KL divergence is constructed only between the classification results of the two networks; this realizes the co-learning and improves the generalization performance of the model.

In the co-learning, the KL divergence adjusts the degree of the co-learning by the temperature coefficient T, which is 4 in this embodiment. In contrast learning, the loss is adjusted by the temperature coefficient τ, and the value of τ in this embodiment is 0.5. In the overall loss function, the weighting coefficients α, β, γ, η of the respective loss functions are all 0.5 in this embodiment.
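As a small worked example of the overall loss L_total = α·L_cls + β·L_rot + γ·L_scl + η·L_kl with the weights of this embodiment, the sketch below combines four loss values. The class name `LossWeights` and the individual loss values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LossWeights:
    alpha: float = 0.5  # weight of L_cls
    beta: float = 0.5   # weight of L_rot
    gamma: float = 0.5  # weight of L_scl
    eta: float = 0.5    # weight of L_kl

def total_loss(l_cls, l_rot, l_scl, l_kl, w=LossWeights()):
    """Weighted sum of the four loss terms."""
    return w.alpha * l_cls + w.beta * l_rot + w.gamma * l_scl + w.eta * l_kl

# Hypothetical per-batch loss values: 0.5 * (1.2 + 0.8 + 0.6 + 0.1).
example_total = total_loss(l_cls=1.2, l_rot=0.8, l_scl=0.6, l_kl=0.1)
```

With all weights equal, the overall loss is simply half the sum of the four terms; changing a single weight shifts the balance between the main classification task and the auxiliary tasks.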

2) Testing phase

During testing, the test set is divided into a support set and a query set. The feature extractor trained in the training stage extracts the features of the support set, a classifier is trained on these features, and the classifier is then used to classify the query set and obtain its predicted labels. This embodiment uses the two common settings for small sample learning, namely support sets of "five classes with one image each" (5-way 1-shot) and "five classes with five images each" (5-way 5-shot). The classifier obtained on the support set is used to predict the query set and the classification accuracy is computed; in each experiment 15 images are taken per class, for a total of 75 query images. The experiment is repeated 600 times and the average accuracy is computed. The above steps are repeated three times, and the median of the average accuracies is taken as the final result.
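The evaluation protocol (5 classes, 1 or 5 support images and 15 query images per class, 600 repetitions) can be sketched as follows. The synthetic features, the nearest-centroid classifier and all function names are assumptions made to keep the example self-contained; they stand in for the features of a trained extractor and for the classifier trained on the support set.

```python
import numpy as np

def sample_episode(features, labels, rng, n_way=5, k_shot=1, n_query=15):
    """Draw one N-way K-shot episode with episode-local labels 0..n_way-1."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    s_x, s_y, q_x, q_y = [], [], [], []
    for new_label, c in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(labels == c))[: k_shot + n_query]
        s_x.append(features[idx[:k_shot]])
        s_y.extend([new_label] * k_shot)
        q_x.append(features[idx[k_shot:]])
        q_y.extend([new_label] * n_query)
    return np.concatenate(s_x), np.array(s_y), np.concatenate(q_x), np.array(q_y)

def centroid_predict(s_x, s_y, q_x):
    """Nearest class centroid under Euclidean distance (a stand-in classifier)."""
    centroids = np.stack([s_x[s_y == c].mean(axis=0) for c in np.unique(s_y)])
    d = np.linalg.norm(q_x[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(d, axis=1)

rng = np.random.default_rng(0)
# Synthetic "extracted features": 20 test classes, 30 samples each, well separated.
labels = np.repeat(np.arange(20), 30)
features = np.eye(20)[labels] * 8.0 + rng.normal(size=(600, 20))

accs = []
for _ in range(600):  # 600 episodes, as in the protocol above
    s_x, s_y, q_x, q_y = sample_episode(features, labels, rng)
    accs.append(np.mean(centroid_predict(s_x, s_y, q_x) == q_y))
mean_acc = float(np.mean(accs))
```

Setting `k_shot=5` gives the second setting; repeating the whole loop three times and taking the median of the three `mean_acc` values follows the final step of the protocol.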

The network structure of the testing stage is shown in fig. 2, where F is the feature extractor obtained in the training stage, and the classifier is trained, for a single task, on the support set in the testing stage and is finally used for prediction on the query set. The algorithm flow is shown in algorithm 2; at this stage the data set D does not need rotation data augmentation, and either F₁ or F₂ may be selected as the feature extractor F. S and Q in the algorithm denote the support set and the query set of the testing stage, and LR denotes logistic regression. Finally, the average classification accuracy of small sample learning is output.

The method mainly addresses the image classification task in the small sample scene, and trains a feature extractor with good generalization performance on all training data of a data set through self-supervised learning, contrastive learning and co-learning. In the testing process, each classification task of small sample learning consists of a support set and a query set. The support set contains the classes to be learned, and each class generally has only a small number of data samples. The query set contains samples to be predicted whose classes appear in the support set, and the prediction accuracy on the query set is the accuracy of the small sample learning. Features of the support set are obtained through the feature extractor, a classification function is obtained on these features by methods such as logistic regression, Euclidean distance or cosine distance, the classes of the query set are predicted on this basis, and the classification accuracy is computed.

When the method is applied in practice, the procedure is substantially the same as the testing stage, except that no query set needs to be separated: the trained feature extractor extracts the feature vectors of the images to be classified, the classifier is temporarily trained on images of the specified classes, and once trained it has the knowledge needed to distinguish the specified classes and can be used to classify the images to be classified.

The method for classifying images in a small sample scene provided by the invention was evaluated experimentally as follows:

(1) Test environment:

System environment: CentOS 7;

Hardware environment: memory: 64 GB; GPU: TITAN XP; hard disk: 2 TB;

(2) experimental data:

experiments were performed on four datasets, MiniImageNet, Tiered ImageNet, CIFAR-FS, FC 100.

MiniImageNet is a subset of ImageNet, with data sizes and dimensions much smaller than ImageNet, on which training requires fewer resources, often used in the task of small sample classification. There are 100 classes, of which 64 classes are training sets, 16 classes are validation sets, and 20 classes are test sets, each class having 600 images, each image having a size of 84 × 84.

Tiered ImageNet is another subset of ImageNet, slightly larger than MiniImageNet. It has 608 categories in total, which can be grouped into 34 major categories, of which 20 major categories form the training set, 6 the validation set and 8 the test set, with 779165 images in total.

CIFAR-FS is a small sample learning data set constructed based on CIFAR100, 100 classes are randomly divided into 64 classes, 16 classes and 20 classes which are respectively used as a training set, a verification set and a test set, each class comprises 600 images, and the size of each image is 32 x 32.

FC100 is another small sample learning data set constructed from CIFAR100 and is more complex than CIFAR-FS. Its 100 classes belong to 20 major classes: 60 classes belonging to 12 major classes are used as the training set, and two groups of 20 classes, each belonging to 4 major classes, are used as the validation set and the test set respectively. Each class has 600 images, and the size of each image is 32 × 32.

For CIFAR-FS and FC100, in the training stage each image is padded with a border of size 4 and then randomly cropped to 32 × 32, color jitter and horizontal flipping are added as data augmentation, and the image is normalized; in the test stage only the normalization is performed.

For MiniImageNet and Tiered ImageNet, in the training stage each image is padded with a border of size 4 and then randomly cropped to 84 × 84, color jitter and horizontal flipping are added as data augmentation, and the image is normalized; in the test stage only the normalization is performed.
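The pad-and-crop part of this preprocessing can be sketched in NumPy as below. Zero padding, the 0.5 flip probability and the function name are assumptions; the color perturbation and the final normalization step are omitted to keep the sketch short.

```python
import numpy as np

def pad_crop_flip(image, rng, pad=4):
    """Pad by `pad` pixels per side, crop back to the original size at a random offset, flip with p=0.5."""
    h, w = image.shape[:2]
    # Zero padding is an assumption; the padding value is not specified in the text.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    out = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # CIFAR-sized example; use 84 x 84 for the ImageNet subsets
augmented = pad_crop_flip(image, rng)
```

Because the crop has the same size as the input, the augmentation amounts to a random shift of up to 4 pixels in each direction, optionally followed by a mirror.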

Training optimization: the Adam optimizer is used with an initial learning rate of 0.05. For MiniImageNet, the learning rate is decayed by a factor of 0.1 at the 60th and 80th training epochs, for a total of 90 epochs; for FC100 and CIFAR-FS, it is decayed by a factor of 0.1 at the 50th, 65th and 80th epochs, for a total of 90 epochs; for Tiered ImageNet, it is decayed by a factor of 0.1 at the 30th, 40th and 50th epochs, for a total of 60 epochs.
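The schedule can be written down as a small step function. This sketch reads the 0.1 attenuation as a multiplicative decay of the learning rate, which is an interpretation of the text, and it assumes the decay takes effect from the listed epoch onward; the names `learning_rate` and `SCHEDULES` are hypothetical.

```python
def learning_rate(epoch, milestones, base_lr=0.05, decay=0.1):
    """Step schedule: multiply base_lr by `decay` once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** passed

# Total epochs and decay milestones per data set, as listed in the text.
SCHEDULES = {
    "MiniImageNet": {"epochs": 90, "milestones": (60, 80)},
    "CIFAR-FS": {"epochs": 90, "milestones": (50, 65, 80)},
    "FC100": {"epochs": 90, "milestones": (50, 65, 80)},
    "Tiered ImageNet": {"epochs": 60, "milestones": (30, 40, 50)},
}

# Learning rate per epoch for one data set.
mini = SCHEDULES["MiniImageNet"]
mini_lrs = [learning_rate(e, mini["milestones"]) for e in range(mini["epochs"])]
```

For MiniImageNet this gives 0.05 up to epoch 59, 0.005 from epoch 60 and 0.0005 from epoch 80 onward.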

(3) The experimental results are as follows:

The results of the invention are compared with those of current mainstream methods in tables 1 and 2 below; the experiments were performed on CIFAR-FS and FC100, and on MiniImageNet and Tiered ImageNet, respectively. The experimental results show that the method outperforms the current mainstream algorithms and improves the classification accuracy of small sample learning under different experimental settings on multiple data sets.

TABLE 1 comparison of the Experimental results on CIFAR-FS and FC-100

TABLE 2 comparison of the Experimental Effect on MiniImageNet and Tiered ImageNet

Although the present invention has been described with reference to the above embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A small sample image classification method based on self-supervision learning is characterized by comprising the following steps:

2. The method of claim 1, wherein the function that computes the cross-entropy loss of the rotation class classifier is as follows:

wherein L_rot represents the cross-entropy loss of the rotation class classifier, D represents the small sample training data set, C_r represents the set of four rotation direction classes, x^r represents the r-th rotation transformation applied to the input image x, L represents the cross entropy loss function, F_i represents the feature extractor of the i-th image classification network, and Θ_i represents the rotation class classifier of the i-th image classification network.

3. The method of claim 1, wherein the function of cross entropy loss for the supervised contrast learning classifier is calculated as follows:

wherein L_scl represents the cross-entropy loss of the supervised contrast learning classifier; E represents the mathematical expectation; D^* represents the rotation-augmented data set; x and two further image variables represent input images; y and a second label variable represent labels of different classes; B(x, y) represents the samples in the same training batch as x that are labeled y, and its counterpart set represents the samples in the same data set as x that are labeled other than y; F_i represents the feature extractor of the i-th image classification network; Φ_i represents the supervised contrast learning classifier of the i-th image classification network; τ represents the temperature coefficient; and the base of the logarithm is not limited.

4. The method of claim 1, wherein the function that calculates the cross-entropy loss for an image class classifier is as follows:

wherein L_cls represents the cross-entropy loss of the image class classifier, E represents the mathematical expectation, D^* represents the rotation-augmented data set, x represents the input image, y represents the class label, L represents the cross entropy loss function, F_i represents the feature extractor of the i-th image classification network, and Ψ_i represents the image class classifier of the i-th image classification network.

5. The method of claim 1, wherein the function that computes cross-entropy loss for co-learning is as follows:

wherein L_kl represents the cross-entropy loss of the co-learning corresponding to the KL divergence, E represents the mathematical expectation, D^* represents the rotation-augmented data set, x represents the input image, y represents the class label, F₁ and F₂ represent the feature extractors of the 1st and 2nd image classification networks, and Ψ₁ and Ψ₂ represent the image class classifiers of the 1st and 2nd image classification networks.

6. The method according to claim 1 or 5, wherein the KL divergence adjusts the degree of co-learning by means of a temperature coefficient T.

7. The method of any one of claims 1-5, wherein the function of the overall loss is calculated as follows:

L_total = α·L_cls + β·L_rot + γ·L_scl + η·L_kl;

wherein L_total denotes the overall loss, L_cls represents the cross-entropy loss of the image class classifier, L_rot represents the cross-entropy loss of the rotation class classifier, L_scl represents the cross-entropy loss of the supervised contrast learning classifier, L_kl represents the cross-entropy loss of the co-learning corresponding to the KL divergence, and α, β, γ and η represent weight coefficients.

8. A small sample image classification system based on self-supervised learning, for implementing the method of any one of claims 1-7, wherein the system includes two image classification networks with the same structure but not sharing weight, and each image classification network includes: