CN107491782B

CN107491782B - Image classification method for small amount of training data by utilizing semantic space information

Info

Publication number: CN107491782B
Application number: CN201710603221.1A
Authority: CN
Inventors: 付彦伟; 林航宇; 马建奇; 姜育刚; 张寅达; 薛向阳
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2017-07-22
Filing date: 2017-07-22
Publication date: 2020-11-20
Anticipated expiration: 2037-07-22
Also published as: CN107491782A

Abstract

The invention belongs to the technical field of computer image processing, and particularly relates to an image classification method for a small amount of training data by utilizing semantic spatial information. The invention utilizes semantic space information combined with an automatic encoder to amplify data, thereby obtaining more effective samples under the condition of a small amount of samples; training a deep neural network-based classifier using the augmented data; and then the classifier network and the feature extraction network are connected together for training to obtain an end-to-end neural network, so that the functions of giving a picture and directly outputting classification information are realized. The method uses a data amplification method to increase the owned data, so that the deep neural network training becomes more effective; the algorithm is an end-to-end neural network, so that only one picture needs to be given to give a corresponding classification result.

Description

Image classification method for small amount of training data by utilizing semantic space information

Technical Field

The invention belongs to the technical field of computer image processing, and particularly relates to an image classification method for a small amount of training data by utilizing semantic spatial information.

Background

A significant portion of the current progress in the field of machine learning and deep learning is the reliance on large amounts of labeled data. However, in practical situations, acquiring a large amount of labeled data requires a large amount of manpower and material resources, which is not practical in many cases. On the other hand, it is known that humans can learn how to correctly identify objects with only a small amount of data (for example, we can identify other apples after seeing several apples). It is therefore meaningful and practical to study how to train classifiers with a small amount of data, and in fact this problem is known in the field of artificial intelligence as the One-shot Learning problem. Although One-shot Learning is a classical problem, there is still no very effective method and model for fine-grained image recognition.

The One-shot Learning problem is derived from the ability of humans to learn how to recognize objects from a small number of samples. However, training with a small amount of data is in a sense contrary to the currently existing machine learning methods [1 ]. In this case, the general gradient optimization-based models are not well suited, so the first developed models are Bayesian non-exponential methods [2], deep generative models [3], Bayesian auto-encoders [4 ]. In a broader case, the existing knowledge of another domain can be utilized to assist learning in the current domain, which is called transfer learning [5 ]. The key of the transfer learning is to utilize the existing knowledge in other fields for learning. The method used in our topic can also be considered as a kind of transfer learning. From another perspective, another approach to solve the One-shot Learning problem is to amplify data, and we know that machine Learning algorithms all require a large amount of data, and this is also a solution to the problem if more effective data can be generated by reusing other knowledge through the existing data. There are currently several related approaches: (1) adaptive leveling from large-amplitude of off-shape derived models [6 ]; (2) borrowing amides from free derived sources or vocalbulares [7 ];

(3) composition synthesized representations [8 ]. The above methods basically use only a single method to solve the problem, and most of the problems solved by the methods are classification of coarse-grained categories. In the invention, the data autography deep learning classifier is trained by utilizing knowledge of semantic space to solve the problem of fine-grained image recognition under the One-shot condition.

Disclosure of Invention

The invention aims to provide an image classification method aiming at a small amount of training data by utilizing semantic space information so as to solve the problem of fine-grained image identification under the One-shot condition.

The image classification method for a small amount of training data by utilizing the semantic space information provided by the invention is characterized in that the semantic space information is combined with an automatic encoder to amplify the data, so that more effective samples are obtained under the condition of a small amount of samples; training a deep neural network-based classifier using the augmented data; and then the classifier network and the feature extraction network are connected together for training to obtain an end-to-end neural network, so that the functions of giving a picture and directly outputting classification information are realized.

The method comprises the following specific steps:

(1) the data set is segmented into a training data set and a testing data set, and the image characteristics of the two data sets are extracted by using the same neural network. The same neural network is used for extracting the neural network by utilizing the existing effective characteristics, such as vgg16 network;

(2) and acquiring vocabulary vectors of the two data sets as semantic features. Specifically, corresponding text corpora are trained through a word2vec method, and corresponding mapping from vocabularies to vocabulary vectors is obtained. For two data sets, the vocabulary vectors corresponding to the marked vocabularies are the semantic features of the two data sets;

(3) and constructing an automatic encoder neural network, which specifically comprises two parts, wherein the first part is the neural network composed of the fully-connected layers and inputting the image features serving as training data and outputting the corresponding semantic features, and the second part is the neural network composed of the fully-connected layers and inputting the corresponding semantic features and outputting the reconstructed image features. The two parts are connected together, namely the output of the first part is the input of the second part, and the training aims to ensure that the difference between the semantic features output by the first part and the real semantic features is as small as possible, and the difference between the finally output image features and the real image features is as small as possible;

(4) utilizing the last half part of the automatic encoder network obtained in the step (3) to output nearest neighbor reconstructed image features of semantic features corresponding to the training data, and adding a training data set to complete data amplification;

(5) training an end-to-end neural network by using the obtained training data, and outputting a classification result, wherein the classification result is directly obtained for a given picture after the deep neural network model is trained.

In the invention, in the neural network of the automatic encoder, the first part is a feature extraction network, and the second part is a classification network; these networks are made up of fully connected layers.

In the invention, the fully-connected layer can be repeated for a plurality of times in the deep neural network structure.

The innovation of the invention is that:

1. the information of the semantic space is used for data amplification so as to solve the problem of image classification of a small number of samples, and the problem of insufficient information caused by a small number of samples can be well solved through the information of the semantic space, so that the deep neural network can be effectively trained;

2. the algorithm is an end-to-end neural network, so that only one picture needs to be given to give a corresponding classification result.

Drawings

Fig. 1 is a schematic structural diagram of a designed deep neural network.

Fig. 2 is a schematic structural diagram of an automatic encoder. .

FIG. 3 is a schematic diagram of data amplification.

Detailed Description

Step 1, acquiring corresponding data sets, segmenting the data sets into a training data set and a testing data set, wherein the segmented data sets have the characteristic of less training data, and extracting the image characteristics of the two data sets by using the same neural network. The neural network is a pre-trained deep convolutional neural network. We use a VGG-16 neural network comprising 13 convolutional layers and 3 fully-connected layers, with the output being a 4096-dimensional feature vector for each input picture.

And 2, training corresponding text corpora through a word2vec method to obtain a corresponding mapping or dictionary from the vocabulary to the vocabulary vector. Since we have a labeled vocabulary for each picture, we can use the mapping obtained before to obtain a corresponding vocabulary vector to obtain the semantic features we need.

And 3, constructing an automatic encoder neural network, which specifically comprises two parts, wherein the first part is the neural network composed of the full connection layers and inputting the image characteristics as training data and outputting the corresponding semantic characteristics, and the second part is the neural network composed of the full connection layers and inputting the corresponding semantic characteristics and outputting the reconstructed image characteristics. As shown in fig. 2, our automatic encoder structure is composed of two parts, the first part f (x) is composed of a fully connected neural network of 4096 × 2048 × 1024 × 512 × 256 × 100, and we hope that the output of this part is similar to the real semantic features. The second part g (x) is formed by a fully connected network of 100 × 256 × 512 × 1024 × 2048 × 4096, symmetrical to the previous network, the input of this network being the output of the previous part, the output being the reconstructed image features we have obtained. The training aims to make the difference between the semantic features output by the first part and the real semantic features as small as possible, and the difference between the finally output image features and the real image features as small as possible. A specific loss function is given here.

Where Θ denotes the parameter set of the auto-encoder, D_sRepresenting a training data set, u_iIs a lexical vector, x_iIs an image feature vector, x'_i，u′_iRespectively representing the results generated by the network, P (theta) represents the regular term of the parameter, and lambda is the parameter of the regular term and can be adjusted by self to obtain the best result. The loss function can be divided into three parts, the first part being

This term requires that the difference between the image features generated by the auto-encoder reconstruction and the real image features be minimized, the second termIs that

This term minimizes the gap between the output of the first partial neural network and the true semantic features, and the last term λ P (Θ) is a regularization term to prevent overfitting.

And 4, acquiring the nearest neighbor (multiple nearest neighbors) of the labeled vocabulary of each training data by using the dictionary obtained in the step 2. After obtaining the nearest neighbor, the second half of the auto-encoder constructed in step (3) is utilized to obtain the corresponding reconstructed image feature. Thus, image features that are common to, but not exactly the same as, the original image are obtained. Thus, the data amplification process is completed; these augmented data features are added to the data set for subsequent training. As shown in fig. 3, we search the nearest neighbor of the semantic space corresponding to the beaver, here, muskrat, beaver, and then we input the vectors of these words in the semantic space into the second half g (x) of the auto-encoder of step (3) to obtain the corresponding image features, so as to obtain the augmented data. The process utilizes the image characteristics of the data and the semantic characteristics of the image, so that the problem of small information amount of a small amount of samples is solved, and the training of an effective deep neural network classifier is facilitated.

And 5, training an end-to-end deep neural network, wherein the training process is divided into two parts, namely a first part and a classifier from image features to classification results. And a second part, expanding the network into an end-to-end network for fine-tuning. The two-part training process is described in detail below.

In the first part, a classifier is trained from image features to classification results, and this task is defined as follows.

In the image classification task, the method takes the prediction of the class as a process of solving a classification vector, and outputs the probability of belonging to the class for each class and then takes the highest probability as the class to which the class belongs. Assuming that the classification result y ∈ R of 200 dimensionalities is finally obtained²⁰⁰Then the corresponding sequence number is obtained as the final classification result after taking the maximum value for y.

Here, the network consists of 3 fully-connected layers plus one softmax layer, with the parameter dimensions W₁∈R^4096×1024,W₂∈R^1024×256,W₃∈R^256×dAnd d represents the number of classifications. The loss function is:

here, y_iRepresents the true classification result vector, y'_iRepresenting the predicted classification result (probabilistic form). By training this neural network, a more successful classifier can be obtained because the data from the data amplification is used, otherwise it is difficult to train an effective classifier for a small number of samples.

And secondly, training an end-to-end network, as shown in the figure 1, extracting the characteristics of a given picture by using the image characteristic extraction network, and inputting the characteristics into a trained classifier to obtain a classification result. To get better results, we put the existing data into the whole network and train again, i.e. fine-tuning. This gives the final result.

Reference to the literature

[1]S.Thrun.Learning To Learn:Introduction.Kluwer Academic Publishers,1996.1,1.1

[2]L.Fei-Fei,R.Fergus,and P.Perona.A bayesian approach to unsupervised one-shot learning of object3categories.In IEEE International Conference on Computer Vision,2003.1.1

[3]D.J.Rezende,S.Mohamed,I.Danihelka,K.Gregor,and D.Wierstra.One-shot generalization in deep generative models.In ICML,2016.1.1

[4]D.Kingma and M.Welling.Auto-encoding variational bayes.In ICLR,2014.1.1

[5]E.Bart and S.Ullman.Cross-generalization:learning novel classes from a single example by featurereplacement.In CVPR,2005.1.1

[6]Y.Wang and M.Hebert.Learning from small sample sets by combining unsupervised meta-training withcnns.In NIPS,2016.1,1.1

[7]J.Lim,R.Salakhutdinov,and A.Torralba.Transfer learning by borrowing examples for multiclass objectdetection.In NIPS,2011.1,1.1

[8]Y.Movshovitz-Attias.Dataset curation through renders and ontology matching.In Ph.D.thesis,CMU,2015.1,1.1。

Claims

1. An image classification method aiming at a small amount of training data by utilizing semantic space information is characterized in that the semantic space information is combined with an automatic encoder to amplify the data, so that more effective samples are obtained under the condition of a small amount of samples; training a deep neural network-based classifier using the augmented data; then, the classifier network and the feature extraction network are connected together for training to obtain an end-to-end neural network, so that the functions of giving a picture and directly outputting classification information are realized;

the method comprises the following specific steps:

(1) segmenting a data set into a training data set and a testing data set, and extracting image characteristics of the two data sets by using the same neural network; the same neural network is used for extracting the neural network by utilizing the existing effective characteristics;

(2) acquiring vocabulary vectors of two data sets as semantic features; specifically, training a corresponding text corpus by a word2vec method to obtain a corresponding mapping from a vocabulary to a vocabulary vector; for two data sets, the vocabulary vectors corresponding to the marked vocabularies are the semantic features of the two data sets;

(3) constructing an automatic encoder neural network, which specifically comprises two parts, wherein the first part is a neural network composed of a full connection layer and inputting image characteristics of training data, outputting the image characteristics as corresponding semantic characteristics, and the second part is a neural network composed of a full connection layer and inputting corresponding semantic characteristics, outputting the image characteristics as reconstruction; the two parts are connected together, namely the output of the first part is the input of the second part, the training aims to ensure that the difference between the semantic features output by the first part and the real semantic features is small, and the difference between the finally output image features and the real image features is small;

2. The image classification method according to claim 1, characterized in that in step (3), the loss function used is:

where Θ denotes the parameter set of the auto-encoder, D_sRepresenting a training data set, u_iIs a lexical vector, x_iIs an image feature vector, x'_i，u′_iRespectively representing the results generated by the network, P (theta) represents a regular term of the parameter, and lambda is a regular term parameter and can be adjusted.

3. The image classification method according to claim 1, wherein in step (5), the training of an end-to-end neural network is divided into two parts: a first part, training a classifier from image features to classification results; a second part, expanding the network into an end-to-end network for fine-tuning; wherein:

in the first part, the process of training a classifier from image features to classification results is as follows:

here, the network consists of 3 fully-connected layers plus one softmax layer, with the parameter dimensions W₁∈R^4096×1024,W₂∈R^1024×256,W₃∈R^256×dD represents the number of classifications; the loss function is:

here, y_iRepresents the true classification result vector, y'_iRepresenting the predicted classification result; by training this neural network, a successful classifier can be obtained;

training an end-to-end network, extracting the characteristics of a given picture by using an image characteristic extraction network, and inputting the characteristics into a trained classifier to obtain a classification result; in order to obtain a better result, putting the existing data into the whole network, and training again, namely fine-tuning; this gives the final result.