CN110390350B

CN110390350B - Hierarchical classification method based on bilinear structure

Info

Publication number: CN110390350B
Application number: CN201910548377.3A
Authority: CN
Inventors: 范建平; 张翔; 赵万青; 罗迒哉; 彭进业; 张二磊; 赵超
Original assignee: Northwestern University
Current assignee: Northwestern University
Priority date: 2019-06-24
Filing date: 2019-06-24
Publication date: 2021-06-15
Anticipated expiration: 2039-06-24
Also published as: CN110390350A

Abstract

The invention discloses a hierarchical classification method based on a bilinear structure, which provides a hierarchical classification network of the bilinear structure, wherein a seed network structure can optimize a final classification structure by utilizing a hierarchical class relationship, and the characteristic of a deep convolutional network is utilized to be combined with the prior knowledge of a class hierarchical structure, so that the human comprehensible concept can be learned in different layers of the network; meanwhile, the invention also combines the hierarchical network structure with a bilinear model to further improve the classification effect; the invention combines the hierarchical network structure and the bilinear model to effectively distinguish the targets which belong to the same genus but are different, thereby further improving the target identification effect.

Description

Hierarchical classification method based on bilinear structure

Technical Field

The invention relates to the field of computer vision, relates to the technologies of pattern recognition, image processing and deep learning, and particularly relates to a hierarchical classification method based on a bilinear structure.

Background

In recent years, a deep convolutional neural network has been remarkably achieved because of its excellent performance in image classification. The conventional CNN-based classification model is designed as a sequential network structure and outputs a unique prediction result at the end of the model without any network branches at the output since the conventional neural network structure treats all target classes equally. However, in real life, there is a connection between categories, such as "orange" and "table tennis" have similar appearance structures, and it is much more difficult to distinguish the two categories than "orange" and "chair", but traditional deep learning considers the distinguishing capability of the two categories to be the same.

Disclosure of Invention

The invention aims to provide a hierarchical classification method based on a bilinear structure, which combines the hierarchical structure with a bilinear model to further improve the classification effect.

In order to realize the task, the invention adopts the following technical scheme:

a hierarchical classification method based on a bilinear structure comprises the following steps:

step 1, constructing a hierarchical classification network with a bilinear structure

The classification network comprises a hierarchical network and a bilinear network, the hierarchical network comprises a first convolution module to a fifth convolution module which are sequentially arranged, the bilinear network comprises two convolution neural networks which are arranged in parallel, and the hierarchical network comprises:

the first convolution module is connected with a first branch module, the first convolution module is used for carrying out convolution processing twice and maximum pooling processing once on an input image with a category structure label, an output feature map is input into the second convolution module on one hand, and is input into the first branch module on the other hand, and the first branch module is internally provided with three full connection layers so as to classify the category label of the input image;

the second convolution module is connected with a second branch module, after performing convolution processing twice and maximum pooling processing once on the feature map from the first convolution module, the output feature map is input into a third convolution module on one hand and is input into the second branch module on the other hand, and the classification of the department label of the input image is continuously performed in the second branch module through three full connection layers;

the third convolution module is connected with a third branch module, after performing convolution processing for three times and maximum pooling processing for one time on the feature map from the second convolution module, the output feature map is input into the fourth convolution module on one hand and is input into the third branch module on the other hand, and the third branch module is subjected to three full-connection layers so as to classify the attribute labels of the input image;

the fourth convolution module is connected with a fourth branch module, after performing convolution processing for three times and maximum pooling processing for one time on the feature map from the third convolution module, the output feature map is input into the fifth convolution module on one hand, and is input into the fourth branch module on the other hand, and the fourth branch module is subjected to three full-connection layers so as to classify the seed-level labels of the input image;

the fifth convolution module performs convolution processing on the feature map from the fourth convolution module for two times, outputs the feature map into two paths, and is respectively connected with the two convolution neural networks for feature extraction; performing outer product operation on the features extracted by the two convolutional neural networks, summing all the extracted features in a summing pooling mode, squaring and normalizing the obtained features, and finally passing through a full connection layer;

calculating five loss functions through cross entropy by utilizing the output of the first branch module to the fourth branch module and the output of the bilinear network, linearly adding the five loss functions, and giving different weights to the loss functions;

step 2, training the hierarchical classification network of the bilinear structure

And training the hierarchical classification network with the bilinear structure, optimizing a final classification result by using the weight distribution of the loss function during training, and storing the trained network model for image classification.

The invention has the following technical characteristics:

1. the network structure provided by the invention can optimize the final classification structure by utilizing the hierarchical category relationship, and combines the characteristics of a deep convolutional network with the prior knowledge of the category hierarchical structure by utilizing the characteristics of the deep convolutional network, so that the human comprehensible concept can be learned in different layers of the network.

2. According to the invention, the hierarchical structure information of the image tag is embedded into the deep learning network, and different types of targets with similar characteristics can be effectively identified.

3. The invention combines the hierarchical network structure and the bilinear model to effectively distinguish the targets which belong to the same genus but are different, thereby further improving the target identification effect.

Drawings

FIG. 1 is a schematic diagram of a network architecture of a hierarchical network;

FIG. 2 is a schematic diagram of a label tree corresponding to a hierarchical network;

FIG. 3 is a schematic diagram of a network structure of a bilinear network;

fig. 4 is a schematic diagram of a hierarchical classification network with a bilinear structure proposed in the present invention.

Detailed Description

In recent years, popular neural network structures such as AlexNet, vggtet, ResNet and the like adopt a single network structure and do not design the structural characteristics of the image label into the network structure. For the large-scale fine-grained image classification task, the ideal result is difficult to achieve only by means of the network structure of the mode, because many different types of targets have similar characteristics for the large-scale fine-grained image classification, and the error condition can be well restrained by the characteristics of the label.

Therefore, the scheme adopts a layered network structure to realize image classification. The hierarchical network structure needs to classify target categories, such as that "apple" belongs to the coarse category of "fruit" and that it also belongs to the fine category of "apple", and then optimize the final classification result by means of hierarchical classification. This hierarchical network structure can limit errors to subclasses because it constrains the fine class ("orange") by the coarse class ("fruit") to reduce the errors of subclasses ("oranges") to other classes ("table tennis"). In addition, the bilinear model can capture more characteristics with distinctiveness and has the function of extracting region information, so that the bilinear model is combined with the hierarchical structure to further improve the image recognition result.

It should be noted that the network structure proposed by the present solution is directed to data having a hierarchical structure (genus structure) or data can be classified into categories of different granularities (coarse category and fine category) by a clustering method.

1. Hierarchical classification network

The structure of the hierarchical classification network is shown in fig. 1, and the corresponding label tree is shown in fig. 2. In the tag tree, fine tags are object classes that are presented in the form of leaves and aggregated into coarse categories that can be constructed manually or generated by unsupervised methods.

The hierarchical network model uses existing CNN components as building blocks to build a network with internal output branches. The network shown at the bottom of fig. 1 is a conventional convolutional network, and the middle part of fig. 1 shows the output branch networks of the hierarchical network, each branch network generating a prediction at a corresponding level of the label tree.

The structure of the hierarchical classification network is as follows:

the network input is a 224 × 224 × 3 image and a label (having a category structure label), the first two layers of the network are 64 filters with a size of 3 × 3, the output result is 224 × 224 × 64, then the network is subjected to maxpool operation, a feature map with a size of 112 × 112 × 64 is output, then the network is divided into two paths, one path is named as LA1, the other path is named as LA2, the other path of LA1 is mainly subjected to three full connection layers to respectively output vectors with a size of 1 × 256,1 × 256 and 1 × C1, wherein C1 represents the number of categories of coarse categories (families), the other path of LA1 is mainly used for classifying the labels with the department, the other path of LA2 is subjected to convolution with two layers, the convolution size is 3 × 3, the number of convolution kernels is 128, then the operation is performed by one maxpool, the feature map with a size of 56 × 56 128 is output, then the two paths are divided into two paths, the path is named as LB1, the other path of LB2, the path of LB1 is mainly subjected to three full connection layers respectively 256, vectors of 1 × 256 and 1 × C1, where C1 represents the number of categories of coarse categories (families), LB1 is mainly for classifying labels of families, LB2 is mainly convolved by three layers, the convolution size is 3 × 3, the number of convolution kernels is 256, then a maxpool operation is performed to output a characteristic map of 28 × 28 × 256, then two paths are separated, one path is named LC1, one path is named LC2, LC1 is mainly passed through three fully connected layers to output vectors of 1 × 1024,1 × 1024 and 1 × C2, where C2 represents the number of categories of coarse categories (genera), LC1 is mainly for classifying labels of families, LC2 is passed through convolution by three layers, the convolution size is 3 × 3, the number of kernels is 256, then a maxpool operation is performed, a feature map of 14 × 14 × 256 is output, then two paths are separated into LD1 and 2, the LD1 path mainly passes through three fully connected layers and respectively outputs vectors of 1 x 1024,1 x 1024 and 1 x C, wherein C represents the category number of the fine category (species), the LD1 path mainly classifies the labels of the species level, the DC2 path passes through convolution of three layers, the convolution size is 3 x 3, the number of convolution kernels is 512, and then the three fully connected layers respectively output vectors of 1 x 4096,1 x 4096 and 1 x C, wherein C represents the category number of the fine category (species). Finally, 5 loss functions can be calculated through cross entropy, the 5 loss functions are linearly added, and different weights are given to the loss functions to optimize the network.

2. Bilinear network

The structure of the bilinear network is shown in fig. 3. As can be seen from the structural diagram of the bilinear classification model, two convolutional neural networks are used to extract the features of the image, a bilinear posing function is used to combine the two groups of features extracted by the CNN, and finally the two groups of features are substituted into the softmax layer for classification. The network has two neural networks A and B, the input image is adjusted to 448 × 448 size, then the two networks are used to respectively extract the feature of the image, at each position of the image, the two networks respectively generate the feature of 1 × 512 size, at each position, the outer product operation is performed to the extracted features A (I) and B (I) of the two networks, the formula is as follows:

X(I)＝A(I)^TB(I)

and then summing the bilinear features of all positions by adopting a summing pooling mode:

where l denotes the position, the resulting features are then calculated as follows:

finally, carrying out normalization operation on the obtained characteristics:

and taking the normalized result as the feature of the picture and using the feature for classification. Such bilinear features can achieve better results in classification than features extracted by a single convolutional network, because the two convolutional neural networks function as region detection and feature extraction, respectively.

3. The invention provides a method for further improving the classification accuracy of images by combining a bilinear network model with a hierarchical network model, wherein the proposed model network structure is shown in FIG. 4, and the specific method comprises the following steps:

the first convolution module is connected with a first branch module LA1, and is used for performing two-time convolution on an input image with a category structure label by adopting 64 filters of 3 x 3, wherein the output result is 224 x 64, and then the maxpool operation is performed; the output feature map is input to the second convolution module LA2 on the one hand and the first branching module LA1 on the other hand, and three full-connection operations are performed in the first branching module LA1 to output vectors of 1 × 256,1 × 256 and 1 × C1, wherein C1 represents the number of categories of rough categories (departments), and LA1 is mainly used for classifying the labels of the departments of the input image.

The second convolution module is connected with a second branch module LB1, the second convolution module LA2 performs two times of convolution processing on the feature map from the first convolution module, the convolution size is 3 x 3, the number of convolution kernels is 128, and then a maxpool operation is performed to output a feature map with the size of 56 x 128; the output feature map is input into the third convolution module LB2 on the one hand and the second branch module LB1 on the other hand, and three full-join operations are performed in the second branch module LB1 to output vectors of 1 × 256,1 × 256 and 1 × C1, respectively, where C1 represents the number of classes of the rough classes (departments), and LB1 is mainly used to classify the labels of the departments.

The third convolution module is connected with a third branch module LC1, the third convolution module LB2 performs convolution processing on the feature map from the second convolution module for three times, the convolution size is 3 x 3, the number of convolution kernels is 256, and then a maxpool operation is performed to output a feature map with the size of 28 x 256; the output feature map is input into the fourth convolution module LC2 on one hand and the third branch module LC1 on the other hand, and three full-connection operations are performed in the third branch module LC1, and vectors of 1 × 1024,1 × 1024 and 1 × C2 are respectively input, wherein C2 represents the number of categories of the rough category (genus), and LC1 is mainly used for classifying the labels of the genus.

The fourth convolution module is connected with a fourth branch module LD1, the fourth convolution module LC2 performs convolution processing on the feature map from the third convolution module for three times, the convolution size is 3 x 3, the number of convolution kernels is 256, and then the feature map with the size of 14 x 256 is output after a maxpool operation; the output feature map is input into the fifth convolution module LD2 on one hand and the fourth branching module LD1 on the other hand, and vectors of 1 × 1024,1 × 1024 and 1 × C are respectively input into the fourth branching module LD1 in three full-connection operations, wherein C represents the category number of the subclass (species), and the LD1 is mainly used for classifying the labels of the species classes.

The fifth convolution module performs convolution processing on the feature map from the fourth convolution module twice, the convolution size is 3 multiplied by 3, the number of convolution kernels is 512, the output is divided into two paths, and the two paths are respectively connected with the two convolution neural networks LE1 and LE2 for feature extraction; LE1 and LE pass through the convolution of one deck respectively, the convolution size is 3 x 3, the quantity of convolution kernel is 512, get a characteristic map of 14 x 512 size, in every position of the picture, two networks generate the characteristic of 1 x 512 size respectively, in every position, do the outer product operation to the characteristic that two convolution neural networks extract, reuse the way of summing pooling, sum the bilinear characteristic of all positions, and then carry on the evolution and normalization of the characteristic obtained, as the characteristic of the input image, pass through the full connection layer finally.

Five loss functions loss 1-loss 5 are calculated through cross entropy by using the outputs of the first branch module to the fourth branch module and the output of the bilinear network, the five loss functions are linearly added, and different weights w 1-w 5 are given to the loss functions.

The training strategy of the network utilizes the weight distribution of the loss function to optimize the final classification result. During training, the distribution of the weight of the loss function is modified, for example, for a three-layer hierarchical network structure, the initialization weight should be given as [1,0,0], the weight is modified into [0,1,0] after 20 iterations, and the weight is modified into [0,0,1] after 50 iterations. The training strategy of the network is as follows:

(1) the initialization weight should be given as [1,0,0,0,0], and is mainly used to train the network of the module (first convolution module + first branch module) of loss1, and optimize the classification result of the rough class (department);

(2) after the Num1 iterations, the weight is modified to [0,1,0,0,0], and the method is mainly used for training the network of the module (second convolution module + second branch module) of loss2 and optimizing the classification result of the rough class (family).

(3) After the Num2 iterations, the weight is modified to [0,0,1,0,0], and the method is mainly used for training the network of the module of loss3 (the third convolution module + the third branch module) and optimizing the classification result of the coarser classes (genera).

(4) After the Num3 iterations, the weight is modified to [0,0,0,1,0], and the method is mainly used for training the network of the module of loss4 (the fourth convolution module + the fourth branch module) and optimizing the classification result of the finer class (species).

(5) After the Num4 times of iteration, the weight is modified to [0,0,0,0,1] for training the whole network and optimizing the classification result of the finer class (species).

Experimental part:

experimental data were used in four databases: CIFAR-10, CIFAR-100, "original" plant database. Wherein the "Orchid" plant database is "Orchid family" collected by the inventor team.

CIFAR-10: the CIFAR-10 data set contains 10 object classes, of which 50000 and 10000 images are respectively in the training set and the test set. These images are all 32 x 32 size color images. The 10 fine classes (airplane, ship, truck, dog, cat, der horse) of CIFAR-10 are divided into 7 coarse classes (sky, water, road, bird, reptile, pet, medium), and the 7 coarse classes can be further divided into 2 coarser classes (animal).

CIFAR-100: the CIFAR-100 database contains 100 classes of objects, each class containing 600 pictures and 32 x 32 color images. The 100 fine classes of CIFAR-100 are divided into 20 coarse classes, and these 20 coarse classes can be further divided into 8 coarser classes.

"original" plant database: 51 plants of the orchid family were collected, and a total of 32064 training pictures and 7894 test pictures were obtained. These 52 plants of the orchid family are further grouped into 8 major groups. It is worth noting that some different species of these orchids have very similar shape structures, which is a challenging task for classification.

1. Classification result of hierarchical network structure

(1) Training strategy

The training strategy of the hierarchical network structure utilizes the weight distribution of the loss function to optimize the final classification result. During training, the distribution of the loss function is modified, for example, for a three-layer hierarchical network structure, the initialization weight should be given as [1,0,0], the weight is modified to [0,1,0] after 20 iterations, and the weight is modified to [0,0,1] after 50 iterations.

The weight assignments for the loss functions on CIFAR-10 and CIFAR-100 are shown in Table 1, and the weight assignments for the loss functions on the "original" dataset are shown in Table 2.

TABLE 1 weight of loss function on CIFAR-10 and CIFAR-100 datasets

	CIFAR-10	CIFAR-100
			Number of iterations	Loss weight distribution	Loss weight distribution
1	[1,0,0,0,0]	[1,0,0,0,0]
			13	[0,1,0,0,0]	[0,1,0,0,0]
23	[0,0,1,0,0]	[0,0,1,0,0]
			33	[0,0,0,1,0]	[0,0,0,1,0]
43	[0,0,0,0,1]	[0,0,0,0,1]

TABLE 2 weight of loss function on "Orchid" dataset

	Orchid
		Number of iterations	Loss weight distribution
1	[1,0,0,0]
		15	[0,1,0,0]
50	[0,0,1,0]
		100	[0,0,0,1]

(2) Results of the experiment

The experimental results are shown in tables 3, 4 and 5, and it can be seen from the experimental results that the proposed hierarchical network structure achieves a good classification effect under different data sets.

TABLE 3 CIFAR-10 database recognition accuracy

Network architecture	Percent identification (%)
		vgg16	88.11
Hierarchical network structure	88.75

TABLE 4 CIFAR-100 database recognition accuracy

Network architecture	Percent identification (%)
		vgg16	62.97
Hierarchical network structure	64.57

TABLE 5 "Orchid" database recognition accuracy

Network architecture	Percent identification (%)
		vgg16	84.02
Hierarchical network structure	84.78

2. Classification result of hierarchical classification network with bilinear structure

(1) Training strategy

The training strategy utilizes the weight distribution of the loss function to optimize the final classification result. During training, the distribution of the loss function is modified, for example, for a three-layer hierarchical network structure, the initialization weight should be given as [1,0,0], the weight is modified to [0,1,0] after 20 iterations, and the weight is modified to [0,0,1] after 50 iterations. In the fifth module of the network we use a bilinear structure to optimize the classification effect.

(2) Results of the experiment

VGG16, the bilinear network and a hierarchical network model based on the bilinear structure are realized on an 'original' database, an experimental structure is shown in a table 6, and compared with a traditional VGG16 classification network, the hierarchical network model based on the bilinear structure provided by the scheme can effectively improve the classification effect.

TABLE 6 "Orchid" database identification accuracy

Network architecture	Percent identification (%)
		vgg16	84.02
Bilinear	89.4
		Hierarchy + bilinear	91.1

Claims

1. A hierarchical classification method based on a bilinear structure is characterized by comprising the following steps: