CN111401122B - Knowledge classification-based complex target asymptotic identification method and device - Google Patents


Info

Publication number
CN111401122B
Authority
CN
China
Prior art keywords
resolution
asymptotic
low
bilinear
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911377824.XA
Other languages
Chinese (zh)
Other versions
CN111401122A (en)
Inventor
胡君
贺东华
方标新
韦章兵
贾小月
殷贺琦
刘丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aisino Corp
Original Assignee
Aisino Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aisino Corp filed Critical Aisino Corp
Priority to CN201911377824.XA priority Critical patent/CN111401122B/en
Publication of CN111401122A publication Critical patent/CN111401122A/en
Application granted granted Critical
Publication of CN111401122B publication Critical patent/CN111401122B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a knowledge-classification-based complex target asymptotic identification method and device. The method comprises the following steps: image preprocessing, namely dividing an original image dataset I into datasets at multiple resolution levels, which serve as the reference datasets for asymptotic identification of a complex target; inputting the images in batches into a VGG-16 network pre-trained on the ImageNet dataset for feature extraction; performing bilinear feature fusion and trilinear feature fusion calculations on the features extracted at the various resolutions; and predicting the category using the fused features. The method combines the characteristics of trilinear pooling and bilinear pooling and plans the coarse-grained and fine-grained tasks of complex targets in a unified framework, addressing the feature references provided by coarse-grained tasks that are neglected in real-world fine-grained recognition.

Description

Knowledge classification-based complex target asymptotic identification method and device
Technical Field
The invention belongs to the field of image recognition, relates to fine-grained image recognition and retrieval, and particularly relates to a knowledge classification-based complex target asymptotic recognition method and device.
Background
In recent years, fine-grained image recognition and retrieval has become a research hotspot in the fields of visual computing and information retrieval. Although image recognition technology has been greatly developed in recent years, there are still many technical difficulties in fine-grained image recognition and retrieval.
The fine-grained image classification problem is to identify sub-classes within a broad class. What distinguishes fine-grained image analysis from general image tasks, and what makes it harder, is that the granularity of the classes to which the images belong is finer. The difficulty and challenge of fine-grained image tasks are certainly greater, both for computers and for the average person.
Although the prior art easily distinguishes objects with obvious appearance differences, such as cats and dogs, it still struggles with objects whose appearance differs only subtly, such as a Boeing 737 and a Boeing 747 airliner; the recognition of objects in such sub-classes is easily affected by their pose, viewing direction, and relative position.
However, with the development of artificial intelligence, more and more application scenarios require finer feature distinctions among objects of the same category, for example the identification of brands by merchants or of plants by botanists. Fine-grained image classification has broad research demand and application scenarios in both industry and academia. The related research subjects mainly include identifying different species of birds, dogs, flowers, cars, airplanes, and so on. In real life there is a great need to identify different sub-categories; in ecological protection, for example, efficiently identifying different species of organisms is an important prerequisite for ecological research.
Unlike the general image classification task, which distinguishes basic categories, fine-grained recognition is very challenging. In real-life scenarios, however, fine-grained tasks often appear together with coarse-grained tasks as the distance between the observer and the observed object shortens. In previous work, the combination of fine-grained and coarse-grained tasks was usually ignored: researchers focused on the fine-grained level and overlooked the instructive feature references provided by the accompanying coarse-grained tasks.
Therefore, it is necessary to propose a method for planning coarse-granularity tasks and fine-granularity tasks of complex targets in a unified framework, and further for fine-granularity image recognition.
Disclosure of Invention
The invention addresses the feature references provided by coarse-grained tasks, which are neglected in real-world fine-grained recognition.
According to one aspect of the present invention, there is provided a knowledge classification-based complex object asymptotic recognition method, the method comprising:
image preprocessing, namely dividing an original image dataset I into datasets at multiple resolution levels, which serve as the reference datasets for asymptotic identification of a complex target;
inputting the images in batches into a VGG-16 network pre-trained on the ImageNet dataset for feature extraction;
performing bilinear feature fusion and trilinear feature fusion calculations on the features extracted at the various resolutions;
and predicting the category using the fused features.
Further, the original image dataset I is divided into three image datasets with resolution from high to low: I_high, I_medium, I_low.
Further, the resolution r of the original image dataset is defined as the high resolution r_high, and the corresponding image dataset is determined as I_high.
The resolution of the original image dataset is gradually reduced to obtain image datasets at two further resolutions:
when the accuracy falls below the threshold t_med, the dataset at resolution r_med is determined as I_medium;
when the accuracy falls below the threshold t_low, the dataset at resolution r_low is determined as I_low.
Further, the three resolution levels are mapped one-to-one onto the biological taxonomy:
I_high corresponds to species, I_medium to genus, and I_low to family.
Further, the images are classified with an SVM classification algorithm, starting from the high resolution r_high, with classification through the taxonomic levels governed by the accuracy thresholds t_med and t_low.
Further, inputting the images in batches into the VGG-16 network pre-trained on the ImageNet dataset for feature extraction includes: extracting the relu5_1, relu5_2 and relu5_3 features of the three resolution image sets.
Further, the combination of bilinear features f_A(I) ∈ R^{hw×c} and f_B(I) ∈ R^{hw×c} equals f_A(I)^T f_B(I) ∈ R^{c×c}, where c is the number of feature maps and h and w represent the height and width of the feature maps;
the cross-layer factorized bilinear pooling is expressed as:
f = P^T (U^T X ∘ V^T Y)
where X represents one layer and Y another, U ∈ R^{hw×d} and V ∈ R^{hw×d} are projection matrices, P ∈ R^{d×c} is the classifier matrix, ∘ is the Hadamard product, d represents the dimension of the joint embedding, and f is the output of the bilinear model.
Further, the trilinear pooling method is expressed as:
f = P^T (U^T X ∘ V^T Y ∘ W^T Z)
where W ∈ R^{hw×d} represents a projection matrix and f combines three separate layers, X representing one layer and Y, Z the other two.
Further, the trilinear feature and the three bilinear features are fused, and a SoftMax vector is computed to obtain the prediction;
the three loss functions add up to the total loss:
l_full = l_high + l_medium + l_low,
where the loss at each resolution is defined as
l_high = loss(I_high), l_medium = loss(I_medium) and l_low = loss(I_low).
According to another aspect of the present invention, there is provided a complex object asymptotic recognition apparatus based on knowledge classification, the apparatus including: a memory storing computer executable instructions;
a processor executing computer executable instructions in the memory, the processor performing the steps of:
image preprocessing, namely dividing an original image dataset I into datasets at multiple resolution levels, which serve as the reference datasets for asymptotic identification of a complex target;
inputting the images in batches into a VGG-16 network pre-trained on the ImageNet dataset for feature extraction;
performing bilinear feature fusion and trilinear feature fusion calculations on the features extracted at the various resolutions;
and predicting the category using the fused features.
The invention provides a trilinear pooling method that integrates the characteristics of trilinear and bilinear pooling. It takes the feature interactions between layers into account while avoiding additional training parameters, captures inter-layer feature relationships better, and retains the efficiency and power of the cross-layer bilinear method.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout exemplary embodiments of the disclosure.
FIG. 1 is a flow chart of a knowledge classification based complex target asymptotic identification method of the present invention.
Fig. 2 is a schematic diagram illustrating an application of a complex object asymptotic recognition method according to an embodiment of the present invention.
FIG. 3 shows a sample of the correct predictions of the present invention on CUB200-2011.
FIG. 4 compares the recognition accuracy of the present invention on the CUB200-2011, Stanford Cars and FGVC-Aircraft datasets.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The present invention aims to solve the problem of asymptotically identifying complex objects in real life, that is, identifying the class of an object at multiple resolutions (from low to high). To this end, the invention provides a complex target asymptotic identification method based on knowledge classification. The method combines the characteristics of trilinear pooling and bilinear pooling and plans the coarse-grained and fine-grained tasks of complex targets in a unified framework, addressing the feature references provided by coarse-grained tasks that are neglected in real-world fine-grained recognition.
FIG. 1 is a flow chart of a knowledge classification based complex target asymptotic identification method of the present invention. As shown in fig. 1, the present invention proposes a knowledge classification-based complex target asymptotic recognition method, which includes:
image preprocessing, namely dividing an original image dataset I into datasets at multiple resolution levels, which serve as the reference datasets for asymptotic identification of a complex target;
inputting the images in batches into a VGG-16 network pre-trained on the ImageNet dataset for feature extraction;
performing bilinear feature fusion and trilinear feature fusion calculations on the features extracted at the various resolutions;
and predicting the category using the fused features.
First, image preprocessing is performed.
The original image dataset I is divided into three resolution levels (high to low). The three newly generated image datasets I_high, I_medium, I_low serve as the reference datasets for asymptotic identification of the complex target. Specifically, the three resolutions are defined as follows:
First, we define the resolution r of the original images as the high resolution r_high, and classify these images at the finest taxonomic level with an SVM classification algorithm. We then gradually decrease the resolution of the original image dataset to obtain the other two resolutions.
As the resolution decreases, the species-level classification accuracy necessarily drops. When the accuracy falls below the threshold t_med, i.e. the classifier is no longer as accurate as the high-resolution classifier, we take the resolution at that moment as r_med and determine the corresponding image dataset as I_medium. The classification target is then changed to the genus level, and the same process is repeated, finally yielding r_low and I_low. The three resolutions and their corresponding datasets are thus determined by two parameters: the accuracy thresholds t_med and t_low.
In an embodiment of the invention, the actual settings are t_med = 0.8 and t_low = 0.8. Further, we map the three resolution levels one-to-one onto the biological taxonomy. For example, 200 species in total can be merged into 113 genera and 36 families. The original classification task is then re-planned as follows: I_high corresponds to the 200 species, while I_medium and I_low are used to classify the 113 genera and 36 families, respectively. Note that the three classifiers can be defined with the same CNN model, and all of their loss functions are added together.
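The two-threshold resolution search described above can be sketched in plain Python. Here `accuracy_at` is a hypothetical callback standing in for training and evaluating the SVM classifier at a given resolution and taxonomic level; the function and its names are illustrative, not taken from the patent.

```python
def choose_resolutions(resolutions, accuracy_at, t_med=0.8, t_low=0.8):
    """Scan resolutions from high to low.  The first resolution whose
    species-level accuracy drops below t_med becomes r_med; continuing
    from there, the first whose genus-level accuracy drops below t_low
    becomes r_low.  A sketch of the two-threshold procedure only."""
    r_med = r_low = None
    for r in resolutions:              # assumed sorted high -> low
        if r_med is None:
            if accuracy_at(r, "species") < t_med:
                r_med = r              # species accuracy degraded: fix I_medium
        elif r_low is None:
            if accuracy_at(r, "genus") < t_low:
                r_low = r              # genus accuracy degraded: fix I_low
    return r_med, r_low
```

Any monotone-decreasing accuracy estimate can drive the search; in the patent the estimate comes from re-evaluating the SVM classifier on down-sampled copies of the dataset.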
Next, the images are batch input to a VGG-16 network pre-trained on the ImageNet dataset to extract features.
The model input image size is 488×488. The projection-layer and normalized-exponential (softmax) layer parameters are initialized randomly. First, the parameters of the other layers are kept fixed and only the softmax layer is trained. The whole network is then fine-tuned by stochastic gradient descent with a step size of 8, momentum 0.9, weight decay 5×10^-4, learning rate 1×10^-3 and a periodic annealing factor of 0.5. Empirically, the dimension of the projection layer is set to 8192.
Notably, training over the three levels is cyclic: the 200-dimensional softmax layer is first tuned with I_high, the 113-dimensional softmax layer is then trained with I_medium, and finally the 36-dimensional classifier is trained with I_low, after which training returns to the highest dimension.
The present invention uses standard data augmentation methods. For example, the original image is first resized to 512×S, S being the larger edge, then randomly sampled and horizontally flipped during training (only center cropping is used at test time). The whole model is trained end to end.
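The fine-tuning described above is plain SGD with momentum and weight decay. A minimal sketch of a single parameter update under the stated hyperparameters; this is a generic illustration of the optimizer, not code from the patent.

```python
def sgd_momentum_step(w, grad, vel, lr=1e-3, momentum=0.9, weight_decay=5e-4):
    """One SGD update with momentum and L2 weight decay.
    w, grad, vel are flat lists of parameters, gradients and velocities."""
    # the weight-decay term is folded into the effective gradient
    vel = [momentum * v + g + weight_decay * p
           for v, g, p in zip(vel, grad, w)]
    w = [p - lr * v for p, v in zip(w, vel)]
    return w, vel
```

The periodic annealing mentioned in the description would simply multiply `lr` by 0.5 at each annealing step.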
Bilinear feature fusion and trilinear feature fusion are then computed on the relu5_1, relu5_2 and relu5_3 features extracted at the three resolutions.
Taking image I as input, two feature functions f_A and f_B (typically the last convolutional layer of a neural network) extract two features from the image. A bilinear vector is obtained at each position using the matrix outer product: the combination of bilinear features f_A(I) ∈ R^{hw×c} and f_B(I) ∈ R^{hw×c} equals f_A(I)^T f_B(I) ∈ R^{c×c}, where c is the number of feature maps and h and w represent their height and width. Note that h×w must be fixed, while c may be chosen differently for the two feature functions.
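For small inputs, the sum-pooled outer product f_A(I)^T f_B(I) can be written out directly. A pure-Python sketch for illustration; real implementations compute this on hw×c tensors with a single matrix multiplication.

```python
def bilinear_outer(fa, fb):
    """Combine features fa, fb (each an hw x c list of rows) into the
    c x c bilinear matrix fa^T fb by summing the per-position outer
    products over all hw spatial locations."""
    c = len(fa[0])
    out = [[0.0] * c for _ in range(c)]
    for row_a, row_b in zip(fa, fb):   # one (row_a, row_b) per position
        for i, a in enumerate(row_a):
            for j, b in enumerate(row_b):
                out[i][j] += a * b
    return out
```
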
In this invention, the cross-layer factorized bilinear pooling is expressed as:
f = P^T (U^T X ∘ V^T Y)
where X and Y are two different layers, U ∈ R^{hw×d} and V ∈ R^{hw×d} are projection matrices, P ∈ R^{d×c} is the classifier matrix, ∘ is the Hadamard product, d represents the dimension of the joint embedding, and f is the output of the bilinear model.
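The factorized form replaces the full c×c outer product with projections into a joint d-dimensional embedding followed by an elementwise Hadamard product. A minimal sketch with plain Python lists; matrices are stored row-wise, so the transposes in the formula fold into the storage convention, and the tiny dimensions and helper names are illustrative only.

```python
def project(M, x):
    """Matrix-vector product: M is a list of rows, x a plain vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def bilinear_pool(x, y, U, V, P):
    """Cross-layer factorized bilinear pooling f = P(Ux o Vy), where o is
    the elementwise (Hadamard) product of the two projected features."""
    joint = [a * b for a, b in zip(project(U, x), project(V, y))]
    return project(P, joint)
```

With identity matrices for U, V, P this reduces to the plain Hadamard product of the two layer features, which makes the role of the joint embedding easy to check by hand.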
Then, a trilinear feature is extracted with the trilinear pooling method provided by the invention, which uses three different layers X, Y and Z for feature extraction. The trilinear pooling replaces the Hadamard product over only two layers with one over three, and is therefore expressed as:
f = P^T (U^T X ∘ V^T Y ∘ W^T Z)
where W ∈ R^{hw×d} is a projection matrix and f combines the three separate layers.
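Extending the same factorized form to three layers replaces the two-way Hadamard product with a three-way one, adding only the extra projection matrix W. A self-contained pure-Python sketch with row-wise matrices; dimensions and helper names are illustrative, not from the patent.

```python
def project(M, x):
    """Matrix-vector product: M is a list of rows, x a plain vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def trilinear_pool(x, y, z, U, V, W, P):
    """Trilinear pooling f = P(Ux o Vy o Wz): project the three layer
    features into the joint embedding, multiply them elementwise,
    then map the result through the classifier matrix P."""
    joint = [a * b * c for a, b, c in
             zip(project(U, x), project(V, y), project(W, z))]
    return project(P, joint)
```
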
And finally, predicting the category by using the fused characteristics.
Finally, the trilinear feature and the three bilinear features are fused, and a SoftMax vector is computed to obtain the prediction. The loss function of the invention is expressed as:
l_full = l_high + l_medium + l_low,
where the loss at each resolution is defined as l_high = loss(I_high), l_medium = loss(I_medium) and l_low = loss(I_low). This concludes the description of the knowledge-classification-based complex target asymptotic recognition method.
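The total loss simply sums one term per resolution level. A pure-Python sketch assuming a softmax cross-entropy per-level loss; the patent states only that the three losses are added, so the concrete per-level loss here is an assumption.

```python
import math

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return -math.log(exps[label] / sum(exps))

def full_loss(level_outputs):
    """l_full = l_high + l_medium + l_low: one (logits, true_label) pair
    per classifier head (species, genus, family)."""
    return sum(cross_entropy(lg, lb) for lg, lb in level_outputs)
```
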
According to another embodiment of the present invention, there is provided a complex object asymptotic recognition apparatus based on knowledge classification, the apparatus including: a memory storing computer executable instructions;
a processor executing computer executable instructions in the memory, the processor performing the steps of:
image preprocessing, namely dividing an original image dataset I into datasets at multiple resolution levels, which serve as the reference datasets for asymptotic identification of a complex target;
inputting the images in batches into a VGG-16 network pre-trained on the ImageNet dataset for feature extraction;
performing bilinear feature fusion and trilinear feature fusion calculations on the features extracted at the various resolutions;
and predicting the category using the fused features.
Fig. 2 is a schematic diagram illustrating an application of a complex object asymptotic recognition method according to an embodiment of the present invention. As shown in FIG. 2, the identification method of the present invention is illustrated by the identification of a golden samara.
First, the pictures are divided into three sets by resolution: I_high, I_medium, I_low. A VGG-16 network is then trained to extract the relu5_1, relu5_2 and relu5_3 features of the three resolution images.
The bilinear features are combined on the basis of the three features relu5_1, relu5_2 and relu5_3; bilinear feature fusion with the cross-layer factorized bilinear pooling method then yields three bilinear features.
A trilinear feature is extracted using the trilinear pooling method.
Finally, the trilinear feature and the three bilinear features are fused, and a SoftMax vector is computed to obtain the prediction: the family classifier, the genus classifier and the species classifier determine in turn the family, the genus (Caragana) and finally the exact species.
FIG. 3 shows a sample of the correct predictions of the present invention on CUB200-2011. The CUB200-2011 dataset, proposed by the California Institute of Technology in 2010, is the benchmark image dataset for current fine-grained classification research. It contains 11,788 bird pictures covering 200 species, 113 genera and 36 families. Using the present identification method, some pictures from CUB200-2011 were taken for testing; the third row shows, via a visualization tool, categories mispredicted by the HBP algorithm, whereas our MLPH model predicts these categories correctly.
FIG. 4 compares the recognition accuracy of the method of the present invention on the CUB200-2011, Stanford Cars and FGVC-Aircraft datasets. The Stanford Cars data contain 16,185 car pictures in 196 categories, of which 8,144 are training data and 8,041 are test data; the categories are defined by year, manufacturer and model, and the 196 categories belong to 13 families. The FGVC-Aircraft dataset, a classical benchmark in fine-grained image classification research released in 2013 with the involvement of the Toyota Technological Institute at Chicago, comprises 10,000 aircraft pictures divided, according to the three-layer hierarchy of manufacturer, family and variant, into 100 variants, 70 families and 30 manufacturers. Comparative tests show that the recognition accuracy of the present method is significantly higher than that of the HBP method.
The invention plans the coarse-grained and fine-grained tasks of complex targets in a unified framework, addressing the feature references provided by coarse-grained tasks that are neglected in real-world fine-grained recognition. Experiments show that, compared with existing methods, the knowledge-classification-based complex target asymptotic recognition method achieves clearly improved recognition accuracy on the CUB200-2011, Stanford Cars and FGVC-Aircraft datasets, reaching the best accuracy on each.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (8)

1. A knowledge classification-based complex target asymptotic identification method is characterized by comprising the following steps:
image preprocessing, namely dividing an original image dataset I into datasets at multiple resolution levels, which serve as the reference datasets for asymptotic identification of a complex target;
inputting the images in batches into a VGG-16 network pre-trained on the ImageNet dataset for feature extraction;
performing bilinear feature fusion and trilinear feature fusion calculations on the features extracted at the various resolutions;
the combination of bilinear features f_A(I) ∈ R^{hw×c} and f_B(I) ∈ R^{hw×c} equals f_A(I)^T f_B(I) ∈ R^{c×c}, where c is the number of feature maps and h and w represent the height and width of the feature maps;
the cross-layer factorized bilinear pooling is expressed as:
f = P^T (U^T X ∘ V^T Y)
wherein X represents one layer and Y another, U ∈ R^{hw×d} and V ∈ R^{hw×d} are projection matrices, P ∈ R^{d×c} is a classifier matrix, ∘ is the Hadamard product, d represents the dimension of the joint embedding, and f is the output of the bilinear model;
the trilinear pooling method is expressed as:
f = P^T (U^T X ∘ V^T Y ∘ W^T Z)
wherein W ∈ R^{hw×d} represents a projection matrix, and f combines three separate layers, X representing one layer and Y, Z the other two;
and predicting the category by using the fused characteristics.
2. The knowledge-classification-based complex target asymptotic recognition method according to claim 1, characterized in that the original image dataset I is divided into three image datasets with resolution from high to low: I_high, I_medium, I_low.
3. The knowledge-classification-based complex target asymptotic recognition method according to claim 2, characterized in that the resolution r of the original image dataset is defined as the high resolution r_high, and the corresponding image dataset is determined as I_high;
the resolution of the original image dataset is gradually reduced to obtain image datasets at two further resolutions:
when the accuracy falls below the threshold t_med, the dataset at resolution r_med is determined as I_medium;
when the accuracy falls below the threshold t_low, the dataset at resolution r_low is determined as I_low.
4. The knowledge-classification-based complex target asymptotic recognition method according to claim 3, characterized in that the three resolution levels are mapped one-to-one onto the biological taxonomy:
I_high corresponds to species, I_medium to genus, and I_low to family.
5. The knowledge-classification-based complex target asymptotic recognition method according to claim 3, characterized in that the images are classified with an SVM classification algorithm, starting from the high resolution r_high, with classification through the taxonomic levels governed by the accuracy thresholds t_med and t_low.
6. The knowledge-classification-based complex target asymptotic recognition method of claim 1, wherein inputting the images in batches into a VGG-16 network pre-trained on the ImageNet dataset for feature extraction includes: extracting the relu5_1, relu5_2 and relu5_3 features of the three resolution image sets.
7. The knowledge-classification-based complex target asymptotic recognition method of claim 1, wherein the trilinear feature and the three bilinear features are fused, and a SoftMax vector is computed to obtain the prediction;
the three loss functions add up to the total loss:
l_full = l_high + l_medium + l_low,
wherein the loss at each resolution is defined as
l_high = loss(I_high), l_medium = loss(I_medium) and l_low = loss(I_low).
8. A knowledge classification-based complex target asymptotic recognition device, the device comprising: a memory storing computer executable instructions;
a processor executing computer executable instructions in the memory, the processor performing the steps of:
image preprocessing, namely dividing an original image dataset I into datasets at multiple resolution levels, which serve as the reference datasets for asymptotic identification of a complex target;
inputting the images in batches into a VGG-16 network pre-trained on the ImageNet dataset for feature extraction;
performing bilinear feature fusion and trilinear feature fusion calculations on the features extracted at the various resolutions;
the combination of bilinear features f_A(I) ∈ R^{hw×c} and f_B(I) ∈ R^{hw×c} equals f_A(I)^T f_B(I) ∈ R^{c×c}, where c is the number of feature maps and h and w represent the height and width of the feature maps;
the cross-layer factorized bilinear pooling is expressed as:
f = P^T (U^T X ∘ V^T Y)
wherein X represents one layer and Y another, U ∈ R^{hw×d} and V ∈ R^{hw×d} are projection matrices, P ∈ R^{d×c} is a classifier matrix, ∘ is the Hadamard product, d represents the dimension of the joint embedding, and f is the output of the bilinear model;
the trilinear pooling method is expressed as:
f = P^T (U^T X ∘ V^T Y ∘ W^T Z)
wherein W ∈ R^{hw×d} represents a projection matrix, and f combines three separate layers, X representing one layer and Y, Z the other two;
and predicting the category by using the fused characteristics.
CN201911377824.XA 2019-12-27 2019-12-27 Knowledge classification-based complex target asymptotic identification method and device Active CN111401122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911377824.XA CN111401122B (en) 2019-12-27 2019-12-27 Knowledge classification-based complex target asymptotic identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911377824.XA CN111401122B (en) 2019-12-27 2019-12-27 Knowledge classification-based complex target asymptotic identification method and device

Publications (2)

Publication Number Publication Date
CN111401122A CN111401122A (en) 2020-07-10
CN111401122B true CN111401122B (en) 2023-09-26

Family

ID=71430306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911377824.XA Active CN111401122B (en) 2019-12-27 2019-12-27 Knowledge classification-based complex target asymptotic identification method and device

Country Status (1)

Country Link
CN (1) CN111401122B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380392A (en) * 2020-11-17 2021-02-19 北京百度网讯科技有限公司 Method, apparatus, electronic device and readable storage medium for classifying video
US11748865B2 (en) 2020-12-07 2023-09-05 International Business Machines Corporation Hierarchical image decomposition for defect detection

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875826A (en) * 2018-06-15 2018-11-23 武汉大学 A kind of multiple-limb method for checking object based on the compound convolution of thickness granularity
CN109086792A (en) * 2018-06-26 2018-12-25 上海理工大学 Based on the fine granularity image classification method for detecting and identifying the network architecture
CN109685115A (en) * 2018-11-30 2019-04-26 西北大学 A kind of the fine granularity conceptual model and learning method of bilinearity Fusion Features
CN110188816A (en) * 2019-05-28 2019-08-30 东南大学 Based on the multiple dimensioned image fine granularity recognition methods for intersecting bilinearity feature of multithread
CN110210550A (en) * 2019-05-28 2019-09-06 东南大学 Image fine granularity recognition methods based on integrated study strategy
WO2019169816A1 (en) * 2018-03-09 2019-09-12 中山大学 Deep neural network for fine recognition of vehicle attributes, and training method thereof
CN110263863A (en) * 2019-06-24 2019-09-20 南京农业大学 Fine granularity mushroom phenotype recognition methods based on transfer learning Yu bilinearity InceptionResNetV2

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009076218A2 (en) * 2007-12-07 2009-06-18 University Of Maryland Composite images for medical procedures


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Shangwang; Gao Xiang. Fine-grained image classification method based on deep model transfer. Journal of Computer Applications, 2018, (Issue 08), full text. *
Liang Huagang; Wen Xiaoqian; Liang Dandan; Li Huaide; Ru Feng. Fine-grained food image recognition with a multi-level convolutional feature pyramid. Journal of Image and Graphics, 2019, (Issue 06), full text. *

Also Published As

Publication number Publication date
CN111401122A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN109196514B (en) Image classification and labeling
Endres et al. Category-independent object proposals with diverse ranking
US9111375B2 (en) Evaluation of three-dimensional scenes using two-dimensional representations
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
CN101877064B (en) Image classification method and image classification device
CN111052144A (en) Attribute-aware zero-sample machine vision system by joint sparse representation
Zhou et al. Scene classification using multi-resolution low-level feature combination
Veeravasarapu et al. Adversarially tuned scene generation
CN111401122B (en) Knowledge classification-based complex target asymptotic identification method and device
JPWO2019146057A1 (en) Learning device, live-action image classification device generation system, live-action image classification device generation device, learning method and program
CN104966052A (en) Attributive characteristic representation-based group behavior identification method
Yadav et al. An improved deep learning-based optimal object detection system from images
Lou et al. Extracting 3D layout from a single image using global image structures
Boutell et al. Multi-label Semantic Scene Classification
Singh et al. Semantically guided geo-location and modeling in urban environments
CN112183464A (en) Video pedestrian identification method based on deep neural network and graph convolution network
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN114492634B (en) Fine granularity equipment picture classification and identification method and system
WO2020119624A1 (en) Class-sensitive edge detection method based on deep learning
Shuai et al. Regression convolutional network for vanishing point detection
Ali et al. Human-inspired features for natural scene classification
CN116258937A (en) Small sample segmentation method, device, terminal and medium based on attention mechanism
CN113408546B (en) Single-sample target detection method based on mutual global context attention mechanism
Patil Car damage recognition using the expectation maximization algorithm and mask R-CNN
Anggoro et al. Classification of Solo Batik patterns using deep learning convolutional neural networks algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant