CN113139587A

CN113139587A - Biquadratic pooling model for adaptive interaction structure learning

Info

Publication number: CN113139587A
Application number: CN202110350164.7A
Authority: CN
Inventors: 谭敏; 袁富; 俞俊
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2021-07-20
Anticipated expiration: 2041-03-31
Also published as: CN113139587B

Abstract

The invention provides a biquadratic pooling model for adaptive interaction structure learning. The method comprises the following steps. First, a hierarchical depth model extracts multi-level deep features of an image; after several groups of biquadratic pooling features are obtained between cross-level features, a weight vector is constructed whose dimension equals the number of pooling groups, a module that multiplies the weights by the pooling features is added to the deep network, and classification is performed on the weighted pooling features. Second, an L1-norm sparse constraint is applied to the whole weight vector. Third, a supervision module is designed that builds a classification loss on all the weighted pooling features. Fourth, a multi-task end-to-end deep learning model is established according to the above steps, the whole network is trained and fine-tuned on a specific data set, and the performance of the final model is evaluated on a test set. The invention can adaptively mine the interaction structure best suited to a specific data set, and has strong practicability and universality.

Description

Biquadratic pooling model for adaptive interaction structure learning

Technical Field

The invention relates to the field of fine-grained image classification, and in particular to learning an adaptive interaction structure based on biquadratic pooling, which achieves competitive classification results on common benchmark data sets using visual information alone.

Background

Image classification is a popular research topic in computer vision. With the progress of deep learning, fine-grained image classification has received considerable attention, and the technology plays an important role in many fields such as the protection of endangered species, commodity identification and the management of vehicles involved in traffic violations. Many fine-grained classification methods based on deep learning have been proposed in recent years. Fine-grained image classification aims to distinguish objects of different sub-categories within a general category, e.g. different kinds of birds or dogs, or different types of cars. The task is very challenging: objects from similar sub-categories may differ only subtly, while objects of the same sub-category may vary greatly in appearance because of different shooting scales or viewpoints, different object poses, complex backgrounds and occlusion. Fine-grained image classification therefore still faces significant challenges.

Fine-grained image classification methods fall into two categories, strongly supervised and weakly supervised, according to whether manual annotation is used. Strongly supervised fine-grained classification requires annotation during training, mainly bounding boxes and local part locations, and relies on this information to localize parts accurately and to extract the foreground object. Manual annotation is expensive, however, which limits the practicability of such algorithms.

Weakly supervised fine-grained classification requires only a class label for each image, so its application scenarios are wider. Most algorithms of recent years study weakly supervised fine-grained image classification and have achieved major breakthroughs; a common approach is high-order pooling. A series of specific algorithms has been derived from the idea of high-order pooling, such as the Bilinear Convolutional Neural Network (BCNN), Hierarchical Bilinear Pooling (HBP) and biquadratic pooling (HQP). These models aim to fully exploit the discriminative information in images. However, existing pooling methods usually consider only fixed feature interactions: they do not fully explore the complementarity of feature interactions across different levels and scales of a deep neural network, and they do not consider how to select the most suitable feature combination or interaction structure from several groups of pooled features.

Disclosure of Invention

Based on the ideas of biquadratic pooling (HQP) and adaptive learning, the invention provides a biquadratic pooling model for adaptive interaction structure learning. The method integrates adaptive interaction structure selection and image classification into a unified multi-task framework that can be trained end to end, and it achieves competitive classification accuracy on common fine-grained classification benchmark data sets. It comprises the following steps:

Step (1): image data preprocessing

Because the images in the data set differ in size, they are resized and subjected to conventional data augmentation before model training so that all images have the same size.

Step (2): constructing a hierarchical depth model based on multi-scale feature interaction with biquadratic pooling

In a convolutional neural network, the outputs of different levels contain target features of different granularities; features from coarse to fine correspond to model outputs from deep to shallow layers. Fusing features of different levels by biquadratic pooling (HQP) effectively extracts the detail features that are critical to classification.

Step (3): constructing the weight vector

A plurality of biquadratic pooling features of the preprocessed image are extracted with the hierarchical depth model, and a weight vector is constructed whose dimension equals the number of biquadratic pooling features. Weighted pooling features are then added to the hierarchical depth model; each weighted pooling feature is obtained by multiplying a biquadratic pooling feature by the corresponding entry of the weight vector.

Step (4): applying a sparse constraint to the weight vector

During training of the hierarchical depth model, an L1-norm sparse constraint is applied to the weight vector, so that the model can more easily reach good classification performance while optimizing the weighted pooling features.

Step (5): designing the supervision module

To guarantee the convergence of training and the stability of the gradient flow, a supervision module is designed that constructs a global classification loss from all weighted pooling features.

Step (6): model training and testing

A multi-task end-to-end hierarchical depth model is established according to the above steps; the whole model is trained and fine-tuned on a specified data set, and the performance of the final hierarchical depth model is evaluated on a test set.

The image data preprocessing in the step (1) comprises the following specific steps:

Because the images in the data sets differ in size, all images are first resized to a specified size by bilinear interpolation; the best size differs between data sets. The resized image is then randomly cropped to 448 × 448, and the cropped image is flipped horizontally with a probability of 50%. Finally, the image is normalized.
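The preprocessing chain described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function names, the corner-aligned sampling grid and the normalization constants (mean 0.5, standard deviation 0.5 on pixel values scaled to [0, 1]) are hypothetical, since the text says only that the image is normalized.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize an H x W x C float image with bilinear interpolation."""
    in_h, in_w = img.shape[:2]
    # Sampling positions in the source image (corner-aligned grid, an assumption).
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def preprocess(img, resize_hw, rng, crop=448, mean=0.5, std=0.5):
    """Resize, random-crop to crop x crop, flip with probability 0.5, normalize."""
    img = bilinear_resize(img.astype(np.float64), *resize_hw)
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop + 1)    # upper bound is exclusive
    left = rng.integers(0, w - crop + 1)
    img = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        img = img[:, ::-1]                 # horizontal flip
    # mean and std are placeholder constants, not values from the source.
    return (img / 255.0 - mean) / std
```

At test time the same function could be used with the flip disabled, which matches the statement later in the description that testing omits the random horizontal flip.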

Establishing a hierarchical depth model based on the biquadratic pooling multi-scale feature interaction in the step (2), wherein the specific process is as follows:

In a convolutional neural network, the size of the features output by the convolutional layers decreases gradually as the network goes from shallow to deep. The network is therefore divided into stages, such that convolutional layers whose outputs have the same feature size belong to the same stage.

2-1. In the same convolutional neural network, the last three stages are selected and called the low-level, middle-level and high-level stages in order of decreasing feature size. The features of one or more convolutional layers are selected from each stage; the features selected in the three stages are called the low-level, middle-level and high-level feature groups, respectively. The low-level feature group contains at least one convolutional layer feature and at most all convolutional layer features of the low-level stage; the middle-level and high-level feature groups each contain at least two convolutional layer features and at most all convolutional layer features of the corresponding stage. The features in the low-level and middle-level feature groups are then adjusted by a residual downsampling module so that their size matches that of the features in the high-level feature group.

The residual downsampling module is as follows:

The residual downsampling structure has two branches. The main branch contains a max pooling layer of size k × k with stride k, followed by a convolutional layer whose kernel size and stride are both 1. The residual branch contains a convolutional layer whose kernel size and stride are both k; it compensates for the information lost to max pooling in the main branch. Finally, the features of the two branches are added and passed through a normalization layer.
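A minimal NumPy sketch of the two-branch structure, for a single C × H × W feature with H and W divisible by k. The function names are hypothetical, biases are omitted, and the per-channel standardization merely stands in for the normalization layer, whose exact type the text does not specify.

```python
import numpy as np

def max_pool(x, k):
    """k x k max pooling with stride k on a C x H x W array."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))

def conv_1x1(x, weight):
    """1 x 1 convolution with stride 1; weight has shape C_out x C_in."""
    return np.einsum("oc,chw->ohw", weight, x)

def conv_kxk_stride_k(x, weight):
    """k x k convolution with stride k; weight has shape C_out x C_in x k x k."""
    c, h, w = x.shape
    k = weight.shape[-1]
    patches = x.reshape(c, h // k, k, w // k, k)   # non-overlapping k x k patches
    return np.einsum("ocij,chiwj->ohw", weight, patches)

def normalize(x, eps=1e-5):
    """Per-channel standardization (assumed stand-in for the normalization layer)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_downsample(x, w_main, w_res):
    """Main branch: max pool then 1 x 1 conv. Residual branch: k x k conv, stride k."""
    k = w_res.shape[-1]
    main = conv_1x1(max_pool(x, k), w_main)
    residual = conv_kxk_stride_k(x, w_res)
    return normalize(main + residual)
```

Because both branches reduce height and width by the same factor k and map to the same number of output channels, their outputs can be added directly.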

2-2. Biquadratic pooling is performed between the features of the low-level, middle-level and high-level feature groups. After the new low-level and middle-level feature groups are obtained from the residual downsampling module in step 2-1, an inner product is first taken between each pair of features drawn from two different feature groups, so that features of different levels interact across layers. A matrix outer product is then taken between each interacted feature and its own transpose, yielding the biquadratic pooling features ("pooling features" for short) and thus a hierarchical depth model based on biquadratic pooling.
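The pooling of one cross-layer feature pair can be sketched as below. The sketch assumes that the "inner product" between two features is an element-wise multiplication, which is consistent with the statement elsewhere in the description that the feature size is unchanged by this step; the function name is hypothetical.

```python
import numpy as np

def biquadratic_pool(x, y):
    """Pool two C x H x W features of equal shape taken from different groups."""
    c = x.shape[0]
    z = (x * y).reshape(c, -1)     # cross-layer interaction, reshaped to C x (H*W)
    return (z @ z.T).reshape(-1)   # product with its own transpose, flattened to C*C
```

For C = 512 this gives a vector of length 512 × 512 = 262144 per feature pair.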

The weight vector construction process in the step (3) is specifically as follows:

3-1. After the hierarchical depth model established in step 2-2 generates a plurality of pooling features, a weight vector is constructed whose dimension equals the number of pooling features.

3-2. Because the importance of a pooling feature is positively correlated with the "significance" of the output visual feature, during the first round of training the mean value of each pooling feature produced by the hierarchical depth model is computed, and the weight vector is initialized with these means. During the training iterations the weight vector w is normalized so that each of its values lies in [0, 1]; the specific formula is as follows:

where max{·} and min{·} take the maximum and minimum, respectively, over all values in the weight vector, and ReLU(w) denotes the rectified linear unit activation function.

3-3. The normalized weight vector is multiplied entry by entry with the corresponding pooling features to obtain the weighted pooling features.
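Steps 3-1 to 3-3 can be sketched as follows. The normalization formula itself is not reproduced in this text, so the form used here (ReLU followed by min-max scaling) is only an assumption that is compatible with the description of max, min, ReLU and the [0, 1] range; the function names and the small eps term are likewise hypothetical.

```python
import numpy as np

def init_weights(pooled):
    """Initialize one weight per pooling feature with that feature's mean (pooled: M x D)."""
    return pooled.mean(axis=1)

def normalize_weights(w, eps=1e-12):
    """Assumed form: ReLU, then min-max scaling so every value lies in [0, 1]."""
    r = np.maximum(w, 0.0)
    return (r - r.min()) / (r.max() - r.min() + eps)

def weight_pooled_features(pooled, w):
    """Multiply each of the M pooling features by its normalized weight."""
    return normalize_weights(w)[:, None] * pooled
```

Under this assumed form, the pooling feature with the smallest weight is scaled to zero in each forward pass.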

The sparse constraint in step (4) applies L1-norm regularization to the weight vector during model training, which keeps the weight vector sparse and improves the final classification performance of the model.

The supervision module designed in step (5) uses all weighted pooling features to construct a global classification loss.

Since concatenating all weighted pooling features would give a very high dimension, all weighted pooling features are instead averaged and then classified through one fully connected layer. This branch is called the supervision module because: 1) it provides smooth gradients to all network sub-branches involved in the weighted pooling features, which facilitates stable training; 2) it helps to learn a more reasonable weight vector by minimizing the overall loss over all weighted pooling features; 3) the global classification loss is used only to supervise training and is ignored at test time. The supervision module thus stabilizes training and makes the learned weight vector more reliable.
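A sketch of the supervision branch for one sample: average the weighted pooling features, apply one fully connected layer, and compute a softmax cross-entropy loss. The function and parameter names are hypothetical, and the use of cross-entropy follows the detailed description rather than this paragraph.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Numerically stable cross-entropy of softmax(logits) against an integer label."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def supervision_loss(weighted, fc_weight, fc_bias, label):
    """weighted: M x D weighted pooling features; fc_weight: classes x D."""
    mean_feature = weighted.mean(axis=0)          # average over the M features
    logits = fc_weight @ mean_feature + fc_bias   # one fully connected layer
    return softmax_cross_entropy(logits, label)
```

Averaging keeps the input of the fully connected layer at dimension D instead of M × D, which is the motivation given above for not concatenating all features.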

Constructing the multi-task deep learning model in step (6) means that, after an end-to-end framework is established according to steps (2), (3), (4) and (5), the actual classification loss, the global classification loss of the supervision module and the sparse constraint on the weight vector are optimized simultaneously on a specified data set.

The actual classification loss construction process is as follows:

The weighted pooling features corresponding to the K largest values in the weight vector are selected and concatenated, and the concatenated feature is passed through one fully connected layer for final classification; the resulting classification loss is called the actual classification loss. K is at least 1 and at most the number of weighted pooling features.
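The top-K selection and concatenation can be sketched as below; the function name is hypothetical, and the subsequent fully connected layer is omitted.

```python
import numpy as np

def select_top_k(weighted, w, k):
    """Concatenate the k weighted pooling features whose weights are largest.

    weighted: M x D array of weighted pooling features; w: length-M weight vector.
    Returns the concatenated feature (length k*D) and the selected indices.
    """
    order = np.argsort(w)[::-1][:k]   # indices of the k largest weights, descending
    return weighted[order].reshape(-1), order
```

For example, with four features and weights [0.1, 0.9, 0.0, 0.5], k = 2 selects features 1 and 3, in that order.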

First, a hierarchical depth model is built on a specific convolutional neural network; adding the weight vector, the sparse constraint module and the supervision module to it gives the final biquadratic pooling model for adaptive interaction structure learning. During training, the parameters of the convolutional neural network part, pretrained on the ImageNet data set, are first fixed and only the parameters of the newly added modules are trained; the whole network is then fine-tuned to obtain the final model, which is evaluated on the test set. The optimization objective of the whole model is as follows:

where θ and w denote the parameters of the model and the weight vector, respectively; y^s denotes the label of sample s; the two output terms denote, respectively, the actual classification output of the model for sample s and the global classification output of the supervision module; α, λ, λ' and δ are the coefficients that balance the loss terms; and N is the number of images in the training set.
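The objective formula is not reproduced in this text, so the following is only a rough sketch of how the three named terms (actual classification loss, global classification loss and L1 sparse constraint) might be combined. It uses two hypothetical coefficients, alpha and lam, and does not attempt to reproduce the four coefficients α, λ, λ' and δ of the original formula.

```python
import numpy as np

def total_loss(actual_ce, global_ce, w, alpha=1.0, lam=1e-3):
    """Mean actual loss + alpha * mean global loss + lam * L1 norm of w.

    actual_ce, global_ce: per-sample cross-entropy values over N samples.
    alpha, lam: hypothetical balancing coefficients (not from the source).
    """
    sparse = np.abs(w).sum()
    return float(np.mean(actual_ce) + alpha * np.mean(global_ce) + lam * sparse)
```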

The invention has the beneficial effects that:

Based on the ideas of biquadratic pooling (HQP) and adaptive learning, a biquadratic pooling model for adaptive interaction structure learning (MSHQP) is proposed for fine-grained image classification. Through the weight vector, the model adaptively selects, from a plurality of pooling features, the combination best suited to a specific data set, and achieves leading or competitive accuracy on common benchmark data sets. Moreover, the proposed adaptive interaction structure learning is not limited to fine-grained image classification: it can serve as a more general module that is easily applied to various other tasks, improving model performance without affecting inference efficiency.

Drawings

FIG. 1 is a schematic flow diagram of the process of the present invention.

FIG. 2 is a schematic diagram of a model framework constructed in the method of the present invention.

Detailed Description

The present invention will be further described with reference to FIGS. 1 and 2.

First, a hierarchical depth model extracts multi-level deep features of an image; after several groups of biquadratic pooling features are obtained between cross-level features, a weight vector is constructed whose dimension equals the number of pooling groups, a module that multiplies the weights by the pooling features is added to the deep network, and classification is performed on the weighted pooling features. Second, an L1-norm sparse constraint is applied to the whole weight vector. Third, to guarantee the convergence of model training and the stability of the gradient flow, a supervision module is designed for the training process that builds a classification loss on all the weighted pooling features. Fourth, a multi-task end-to-end deep learning model is established according to the above steps, the whole network is trained and fine-tuned on a specific data set, and the performance of the final model is evaluated on a test set. The invention can adaptively mine the interaction structure best suited to a specific data set, and has strong practicability and universality.

The invention is specifically implemented in the following steps:

The first step:

We use three data sets, CUB-200-2011, Stanford Cars and FGVC-Aircraft, to validate the biquadratic pooling model for adaptive interaction structure learning. For training, the images of the three data sets are first resized to 600 × 600, 500 × 500 and 500 × 480, respectively, by bilinear interpolation; each image is then randomly cropped to 448 × 448 and flipped horizontally with a probability of 50%, and finally its pixel values are normalized. For testing, the data processing is similar to that of training, but without the random horizontal flip.

The second step:

The following procedure uses ResNet34 as an example. In ResNet34, the low-level feature group is Conv3_4, whose original feature size is 128 × 56 × 56 (number of channels × feature height × feature width). The middle-level feature group consists of Conv4_2, Conv4_4 and Conv4_6, with original feature size 256 × 28 × 28. The high-level feature group consists of Conv5_1, Conv5_2 and Conv5_3, with original feature size 512 × 14 × 14. All features in the low-level and middle-level feature groups pass through residual downsampling modules with k = 4 and k = 2, respectively, and the downsampled features have size 512 × 14 × 14.

When biquadratic pooling (HQP) is performed, an inner product is taken between two features from different feature groups; the result still has size 512 × 14 × 14 and is reshaped to 512 × 196. The outer product of the reshaped feature with its own transpose gives a feature of size 512 × 512, which is reshaped again into a pooled feature of size 1 × 262144. Biquadratic pooling between the low-level and middle-level feature groups gives three pooled features, between the low-level and high-level feature groups three pooled features, and between the middle-level and high-level feature groups nine pooled features. In total, biquadratic pooling features of dimension 15 × 262144 are obtained.
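The count of 15 pooled features and the dimension 262144 follow directly from the group sizes and the channel count. A short check using the layer names of the ResNet34 example (the function name and dictionary layout are illustrative only):

```python
from itertools import product

groups = {
    "low": ["Conv3_4"],
    "mid": ["Conv4_2", "Conv4_4", "Conv4_6"],
    "high": ["Conv5_1", "Conv5_2", "Conv5_3"],
}

def cross_group_pairs(groups):
    """Enumerate every pair of features drawn from two different groups."""
    names = list(groups)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pairs.extend(product(groups[a], groups[b]))
    return pairs

pairs = cross_group_pairs(groups)   # 1*3 + 1*3 + 3*3 pairs
pooled_dim = 512 * 512              # size of one flattened 512 x 512 outer product
```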

The third step:

A trainable one-dimensional weight vector of length 15 is constructed. The pooled features of dimension 15 × 262144 output by the first forward pass of training are averaged over the second dimension to give a 15 × 1 vector, whose values initialize the weight vector. The weight vector is normalized according to Formula 1 in every forward pass and then multiplied with the corresponding pooled features of dimension 15 × 262144 to obtain the weighted pooling features. Finally, the K weighted pooling features with the largest weights are selected and concatenated as the final classification feature; the cross-entropy loss computed after a fully connected layer and softmax is called the actual classification loss. In our experiments K ranges from 1 to 5, and the classification accuracy on the three data sets is shown in Table 1 below. The best classification performance is obtained at K = 3, i.e. when the three groups of weighted pooling features with the largest weights are concatenated.

TABLE 1 Classification accuracy (%) for different numbers K of selected weighted pooling features

K	1	2	3	4	5
CUB-200-2011	87.2	87.9	88.5	88.2	88.3
Stanford Cars	94.0	93.9	94.4	94.1	94.1
FGVC-Aircraft	92.0	92.3	92.8	92.4	92.5

The fourth step:

During model training, an L1-norm or L2-norm sparse constraint alone, or a combined L1 and L2 constraint, is applied to the weight vector. As shown in Table 2 below, on the CUB-200-2011 data set the model classifies best when the L1 norm is used for the sparse constraint.

TABLE 2 Classification accuracy (%) on CUB-200-2011 under different sparse constraints

Sparse constraint	L2	L1	L2+L1
CUB-200-2011	88.1	88.5	88.2
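The three penalty variants compared above can be written as a single helper. This is a sketch only: the function name is hypothetical, the L2 penalty is assumed to be the squared L2 norm, and the combined variant is assumed to be an unweighted sum, neither of which is stated in the text.

```python
import numpy as np

def weight_penalty(w, mode="L1"):
    """Penalty on the weight vector w for the modes 'L1', 'L2' and 'L2+L1'."""
    l1 = float(np.abs(w).sum())       # L1 norm
    l2 = float(np.square(w).sum())    # squared L2 norm (assumed form)
    return {"L1": l1, "L2": l2, "L2+L1": l1 + l2}[mode]
```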

The fifth step:

The weighted pooling features from the third step are averaged over the first dimension to give an average weighted pooling feature of dimension 1 × 262144; the cross-entropy loss of this feature after a fully connected layer and softmax is called the global classification loss of the supervision module. To verify the proposed adaptive interaction structure learning model, an ablation experiment is carried out on the sparse constraint and the supervision module. The results are shown in Table 3 below, where the baseline (reference) is the adaptive interaction structure learning model with neither the sparse constraint nor the supervision module, and joint control is the model with both.

TABLE 3 Ablation experiment on the adaptive interaction structure learning module (classification accuracy, %)

Data set	Baseline	Sparse constraint	Supervision signal	Joint control
CUB-200-2011	87.6	87.2	88.0	88.5
Stanford Cars	94.1	93.9	94.3	94.4
FGVC-Aircraft	92.1	92.1	92.6	92.8

The sixth step:

The parameters of the ResNet34 model pretrained on the ImageNet data set are loaded, and the final fully connected layer is removed so that the network serves as a hierarchical deep visual feature extraction model. The biquadratic pooling and adaptive interaction structure learning modules are built after the visual feature extraction model according to the second, third, fourth and fifth steps. The actual classification loss, the global classification loss of the supervision module and the sparse constraint loss on the weight vector together form the final model loss. First, the parameters of the visual feature extraction module are fixed and the parameters of the subsequent pooling and adaptive interaction structure learning parts are trained alone; when the model approaches convergence, the parameters of the whole model are fine-tuned until the model converges completely. At inference time, the supervision module branch is discarded to reduce the number of model parameters and speed up inference.
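The two-stage schedule and the inference-time pruning can be summarized as a small lookup of which parts of the model are active in each phase. The group names and stage names below are hypothetical labels chosen for illustration; they do not come from the source.

```python
NEW_MODULES = frozenset({
    "residual_downsampling",   # hypothetical name for the downsampling modules
    "weight_vector",
    "classifier",              # fully connected layer for the actual classification
    "supervision_head",        # fully connected layer of the supervision module
})

def trainable_groups(stage):
    """Return the parameter groups that are updated in the given training stage."""
    if stage == "warmup":      # pretrained backbone is frozen
        return set(NEW_MODULES)
    if stage == "finetune":    # the whole model is updated
        return set(NEW_MODULES) | {"backbone"}
    raise ValueError(f"unknown stage: {stage}")

def inference_modules():
    """Modules kept at inference time: everything except the supervision branch."""
    return trainable_groups("finetune") - {"supervision_head"}
```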

Finally, we added the Stanford-Dog and VegFru data sets and validated the proposed biquadratic pooling model for adaptive interaction structure learning on the VGG16, ResNet34, ResNet50 and ResNet152 convolutional neural networks; the classification performance on the five benchmark data sets is given in Table 4 below.

TABLE 4 Classification accuracy under different convolutional neural networks


Claims

1. A bi-quadratic pooling model for self-adaptive interaction structure learning is characterized in that self-adaptive interaction structure selection and image classification are fused in a unified multi-task model framework, training can be completed end to end, and competitive classification accuracy is realized on a fine-grained classification benchmark data set, and the specific realization steps are as follows:

step (1): image data pre-processing

Before model training, carrying out size transformation and data enhancement operation on the images to ensure that the sizes of the images are consistent;

step (2): constructing a hierarchical depth model based on the multi-scale feature interaction of biquadratic pooling;

step (3): constructing a weight vector

extracting a plurality of biquadratic pooling features of the preprocessed image by using the hierarchical depth model, and constructing a weight vector whose dimension equals the number of biquadratic pooling features; adding weighted pooling features to the hierarchical depth model, each weighted pooling feature being obtained by multiplying a biquadratic pooling feature by the corresponding entry of the weight vector;

step (4): applying a sparse constraint to the weight vector;

step (5): designing a supervision module that constructs a global classification loss from all the weighted pooling features;

step (6): training and testing the model.

2. The bi-quadratic pooling model for adaptive interaction structure learning of claim 1, wherein the step (2) is implemented as follows:

2-1, selecting the last three stages in the same convolutional neural network, and respectively calling the three stages as a low-stage, a middle-stage and a high-stage according to the characteristic size of each stage from big to small; selecting the characteristics of one or more convolution layers from each stage, wherein the characteristics selected in the three stages are respectively called a low-layer characteristic group, a middle-layer characteristic group and a high-layer characteristic group; the low-level feature group comprises at least one convolutional layer feature and at most all convolutional layer features in a low-level stage; the middle layer feature group and the high layer feature group respectively at least comprise two convolution layer features and at most comprise all convolution layer features of the whole corresponding stage; then utilizing a residual error down-sampling module to respectively adjust the features contained in the low-layer feature group and the middle-layer feature group so as to enable the feature size to be consistent with the feature size in the high-layer feature group;

2-2, performing biquadratic pooling between the features of the low-level feature group, the middle-level feature group and the high-level feature group; after the new low-level and middle-level feature groups are obtained from the residual downsampling module in step 2-1, first taking an inner product between each pair of features drawn from two different feature groups, so that features of different levels interact across layers; and then taking a matrix outer product between each interacted feature and its own transpose to obtain the biquadratic pooling features, called pooling features for short, thereby obtaining a hierarchical depth model based on biquadratic pooling.

3. The bi-quadratic pooling model for adaptive interaction structure learning of claim 2, wherein the residual down-sampling module is as follows:

the residual downsampling structure has two branches: the main branch comprises a maximum pooling layer with the size of k × k and the step size of k, and then a convolution layer with the convolution kernel size and the step size of 1; the other residual branch comprises a convolution layer with the convolution kernel size and the step length both being k and is used for compensating the information lost due to the maximum pooling in the main branch; finally, the characteristics of the two branches are added and then pass through a normalization layer.

4. The bi-quadratic pooling model for adaptive interaction structure learning according to claim 2 or 3, wherein the weight vector construction process in step (3) is specifically as follows:

3-1, after the hierarchical depth model generates a plurality of pooling features, constructing weight vectors with dimensions equal to the number of the pooling features;

3-2, when the hierarchical depth model is subjected to first-round training, solving the mean value of each pooled feature obtained by the hierarchical depth model, initializing a weight vector by using the mean value of all pooled features, and normalizing the weight vector in the training iteration process to ensure that the range of each value in the weight vector w is in [0,1], wherein the specific formula is as follows:

wherein max{·} and min{·} take the maximum value and the minimum value, respectively, over all values in the weight vector; ReLU(w) denotes the rectified linear unit activation function;

5. The bi-quadratic pooling model of adaptive interaction structure learning of claim 4, wherein said design supervision module of step (5) is configured to use all weighted pooling features to construct global classification loss.

6. The bi-quadratic pooling model for adaptive interaction structure learning according to claim 5, wherein the step (6) is implemented by constructing a multi-task deep learning model, specifically, after an end-to-end frame is established according to the steps (2), (3), (4) and (5), on a designated data set, the actual classification loss, the global classification loss of a supervision module and the sparse constraint of a weight vector are optimized simultaneously;

the actual classification loss construction process is as follows:

selecting the weighting pooling features corresponding to the maximum K numerical values according to the magnitude of each numerical value in the weight vector, splicing the selected weighting pooling features, and then using the splicing result for final classification through a layer of full connection, wherein the generated classification loss is called actual classification loss; wherein the K value is minimum 1 and maximum number of weighted pooling features;

first, a hierarchical depth model is built on a specific convolutional neural network, and the final biquadratic pooling model for adaptive interaction structure learning is obtained by adding the weight vector, the sparse constraint module and the supervision module to the hierarchical depth model; during training, the parameters of the convolutional neural network part, pretrained on the ImageNet data set, are first fixed and only the parameters of the newly added modules are trained; the whole network is then fine-tuned to obtain the final model, which is evaluated on the test set; the specific optimization objective function of the whole model is as follows: