CN113139587B - Bi-quadratic pooling model for adaptive interaction structure learning - Google Patents

Bi-quadratic pooling model for adaptive interaction structure learning

Info

Publication number
CN113139587B
Authority
CN
China
Prior art keywords
pooling
features
model
layer
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110350164.7A
Other languages
Chinese (zh)
Other versions
CN113139587A (en)
Inventor
Min Tan
Fu Yuan
Jun Yu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110350164.7A priority Critical patent/CN113139587B/en
Publication of CN113139587A publication Critical patent/CN113139587A/en
Application granted granted Critical
Publication of CN113139587B publication Critical patent/CN113139587B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a bi-quadratic pooling model for adaptive interaction structure learning. The method comprises the following steps. First, multi-level depth features of an image are extracted with a hierarchical depth model, and after multiple groups of bi-quadratic pooling features are obtained from cross-level feature interactions, a weight vector whose dimension equals the number of pooling groups is constructed. A module multiplying the weights with the pooling features is added to the deep network, and classification is performed on the weighted pooling features. Second, an L1-norm sparsity constraint is applied to the whole weight vector. A supervision module is then designed to build a classification loss over all weighted pooling features. Finally, a multi-task end-to-end deep learning model is established according to the above steps, the whole network is trained and fine-tuned on a specific data set, and the performance of the final model is tested on a test set. The invention can adaptively mine the most suitable interaction structure for a specific data set, and has strong practicality and universality.

Description

Bi-quadratic pooling model for adaptive interaction structure learning
Technical Field
The invention relates to the field of fine-grained image classification, and in particular to adaptive interaction structure learning based on bi-quadratic pooling, achieving competitive classification results on common benchmark data sets using purely visual information.
Background
Image classification is a popular research topic in computer vision. With the progress of deep learning, fine-grained image classification has attracted considerable attention, and the technique plays an important role in many fields such as protection of endangered species, commodity identification, and management of traffic-violating vehicles. Many deep-learning-based fine-grained classification methods have been proposed in recent years. Fine-grained image classification aims at distinguishing objects of different sub-categories within a general category, e.g., different species of birds or dogs, or different models of cars. It is a very challenging task: objects from similar sub-categories may differ only subtly, while objects of the same sub-category may exhibit large appearance variations due to differences in shooting scale or viewing angle, changes in object pose, complex backgrounds, and occlusion. For these reasons, fine-grained image classification still faces significant challenges.
Depending on whether manual annotation is required, fine-grained image classification methods can be divided into strongly supervised and weakly supervised methods. Strongly supervised fine-grained classification requires annotation information during training, mainly bounding boxes and part-level locations, and relies on this information to accurately localize parts and foreground objects; however, the high cost of manual annotation limits the practicality of such algorithms.
Weakly supervised fine-grained classification requires only category labels for the images, so its application scenarios are much wider. Most recent algorithms study weakly supervised fine-grained image classification and have made considerable breakthroughs; one common line of attack is high-order pooling. A series of specific algorithms were derived from the idea of high-order pooling, such as Bilinear Convolutional Neural Networks (BCNN), Hierarchical Bilinear Pooling (HBP), and bi-quadratic pooling (HQP). These models aim to fully exploit the discriminative information in the image. However, existing pooling methods only consider fixed feature interactions: they neither fully explore the complementarity of feature interactions at different scales and levels of the deep neural network, nor consider how to select the most suitable feature combination or interaction structure from multiple groups of pooled features.
Disclosure of Invention
Based on bi-quadratic pooling (HQP) and the idea of adaptive learning, the invention provides a bi-quadratic pooling model for adaptive interaction structure learning. The method fuses adaptive interaction structure selection and image classification into a unified multi-task model framework, can be trained end to end, and achieves competitive classification accuracy on common fine-grained classification benchmark data sets. It comprises the following steps:
step (1): image data preprocessing
Because the images in the (existing) data sets differ in size, they must undergo size transformation and conventional data enhancement before model training so that all images have a consistent size.
Step (2): and constructing a hierarchical depth model based on the bi-quadratic pooling multi-scale feature interaction.
In convolutional neural networks, the outputs of different layers contain target features of different granularities, with coarse-to-fine features corresponding to deep-to-shallow layers. Fusing features of different layers with the bi-quadratic pooling (HQP) method effectively extracts the detail features that are critical for classification.
Step (3): constructing weight vectors
A number of bi-quadratic pooling features of the preprocessed image are extracted with the hierarchical depth model, and a weight vector whose dimension equals the number of bi-quadratic pooling features is constructed. Weighted pooling features are added to the hierarchical depth model; a weighted pooling feature is obtained by multiplying the weight vector with the corresponding bi-quadratic pooling feature.
Step (4): sparsity constraint on weight vectors
During training of the hierarchical model, an L1-norm sparsity constraint is applied to the weight vector, so that the hierarchical depth model more easily attains excellent classification performance while optimizing the weighted pooling features.
Step (5): design supervision module
To ensure convergence of hierarchical depth model training and stability of the gradient flow during training, a supervision module is designed that constructs a global classification loss using all weighted pooling features.
Step (6): model training and testing
A multi-task end-to-end hierarchical depth model is established according to the above steps, the whole model is trained and fine-tuned on a designated data set, and the performance of the final hierarchical depth model is tested on a test set.
The preprocessing of the image data in the step (1) specifically comprises the following steps:
Because the images in the data sets differ in size, all images are first uniformly resized to a designated size by bilinear interpolation; the optimal designated size differs across data sets. The resized image is then randomly cropped to 448×448. The cropped image is flipped horizontally with 50% probability. Finally, the image is normalized.
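As an illustration, the resize–crop–flip–normalize pipeline of step (1) can be sketched in plain NumPy. The helper names and the per-image mean/std normalization are illustrative assumptions (the patent does not specify the normalization statistics; in practice a framework's built-in transforms and data-set statistics would be used):

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear_resize(img, out_h, out_w):
    """Naive bilinear resize for an H x W x C image array."""
    h, w, _ = img.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]; wx = (xs - x0)[None, :, None]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def preprocess(img, resize=(600, 600), crop=448, train=True):
    """Resize -> random 448x448 crop -> random horizontal flip -> normalize."""
    img = bilinear_resize(img.astype(np.float64), *resize)
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    img = img[top:top + crop, left:left + crop]
    if train and rng.random() < 0.5:          # 50% horizontal flip
        img = img[:, ::-1]
    img = img / 255.0                          # scale pixel values to [0, 1]
    return (img - img.mean()) / (img.std() + 1e-8)

x = preprocess(rng.integers(0, 256, (500, 500, 3)))
```

At test time the same pipeline would run with `train=False`, dropping only the random flip, as the text describes.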
The multi-scale feature interaction based on double secondary pooling in the step (2) builds a layering depth model, and the specific process is as follows:
In convolutional neural networks, as the network goes from shallow to deep, the sizes of the features output by convolution layers at different depths gradually decrease. The convolutional neural network is therefore divided into multiple stages, with the criterion that convolution layers outputting features of the same size belong to the same stage.
2-1. The last three stages of the same convolutional neural network are selected and, according to their feature sizes, called the low, middle, and high stages respectively. One or more convolution-layer features are selected from each stage; the features selected in the three stages are called the low-level, middle-level, and high-level feature groups respectively. The low-level feature group contains at least one convolution-layer feature and at most all convolution-layer features of the low stage; the middle-level and high-level feature groups each contain at least two convolution-layer features and at most all convolution-layer features of the corresponding stage. Then, the features in the low-level and middle-level feature groups are adjusted by residual downsampling modules so that their feature sizes are consistent with those of the high-level feature group.
The residual downsampling module is as follows:
The residual downsampling structure has two branches: the main branch comprises a k×k max pooling with stride k, followed by a convolution layer with kernel size 1 and stride 1. The other, residual branch contains a convolution layer with kernel size k and stride k, which compensates for the information lost in the main branch due to the max pooling. Finally, the features of the two branches are added and passed through a normalization layer.
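A minimal NumPy sketch of this two-branch residual downsampling module, using untrained random weights in place of a deep-learning framework (the function name and channel sizes are illustrative; the ResNet-34 example later in the text uses k = 4 for the low-level group, which is what the shapes below follow):

```python
import numpy as np

def residual_downsample(x, w_main, w_res, k=2, eps=1e-8):
    """Two-branch residual downsampling (sketch).
    x: (C, H, W) with H, W divisible by k.
    Main branch: k x k max pooling (stride k), then 1x1 conv (w_main: C_out x C).
    Residual branch: k x k conv with stride k (w_res: C_out x C x k x k).
    """
    c, h, w = x.shape
    # max pooling via reshape: group into non-overlapping k x k windows
    pooled = x.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))
    main = np.einsum('oc,chw->ohw', w_main, pooled)        # 1x1 convolution
    # stride-k conv == weighted sum over non-overlapping k x k patches
    patches = x.reshape(c, h // k, k, w // k, k)
    res = np.einsum('ocij,chiwj->ohw', w_res, patches)
    y = main + res                                          # fuse the branches
    return (y - y.mean()) / (y.std() + eps)                 # normalization layer

rng = np.random.default_rng(1)
x = rng.standard_normal((128, 56, 56))                      # low-level feature
y = residual_downsample(x, rng.standard_normal((512, 128)),
                        rng.standard_normal((512, 128, 4, 4)), k=4)
```

With k = 4 a 128×56×56 feature is mapped to 512×14×14, matching the high-level feature size in the worked example.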
2-2. Bi-quadratic pooling is performed on the features between the low-level, middle-level, and high-level feature groups. After the new low-level and middle-level feature groups are obtained via the residual downsampling modules of step 2-1, all features contained in feature groups of different levels are first paired across levels and their element-wise inner products are taken, so that every pair of features from different-level groups interacts; each interacted feature is then multiplied (matrix outer product) with its own transpose to obtain a bi-quadratic pooling feature, i.e., a pooling feature, yielding the hierarchical depth model based on bi-quadratic pooling.
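The bi-quadratic pooling of step 2-2 — element-wise interaction between two cross-level features followed by an outer product with its own transpose — can be sketched as follows (the function name is an illustrative assumption; the shapes follow the ResNet-34 worked example given later in the text):

```python
import numpy as np

def biquadratic_pool(f1, f2):
    """Bi-quadratic pooling (HQP) of two same-sized feature maps (C, H, W):
    element-wise interaction, then outer product with its own transpose."""
    c = f1.shape[0]
    z = (f1 * f2).reshape(c, -1)   # cross-level interaction, reshaped to (C, H*W)
    outer = z @ z.T                # (C, C) second-order statistics
    return outer.reshape(1, -1)    # flattened pooling feature, (1, C*C)

rng = np.random.default_rng(2)
a = rng.standard_normal((512, 14, 14))   # feature from one level group
b = rng.standard_normal((512, 14, 14))   # feature from another level group
p = biquadratic_pool(a, b)
```

Each pooled feature is symmetric by construction (it is Z·Zᵀ), so its flattened dimension is C² = 512² = 262144.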
The weight vector construction process in the step (3) specifically comprises the following steps:
and 3-1, after the hierarchical depth model established in the step 2-2 generates a plurality of pooling features, constructing weight vectors with the dimension equal to the number of the pooling features.
3-2. Because the importance of a pooling feature is positively correlated with the "saliency" of the visual features it outputs, during the first training round the mean value of each pooling feature produced by the hierarchical depth model is computed, and the weight vector is initialized with the means of all pooling features. During training iterations the weight vector is normalized so that every value of the weight vector w lies in [0, 1], according to:

$$\tilde{w} = \frac{\mathrm{ReLU}(w) - \min(\mathrm{ReLU}(w))}{\max(\mathrm{ReLU}(w)) - \min(\mathrm{ReLU}(w))} \qquad (1)$$

where max(·) and min(·) take the maximum and minimum over all values of the weight vector respectively, and ReLU(·) denotes the linear rectification activation function.
3-3. The normalized weight vector is multiplied element-wise with all pooling features to obtain the weighted pooling features.
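Steps 3-1 to 3-3 can be sketched as follows — the ReLU followed by min-max normalization described in step 3-2, with initialization from the per-feature means of the first forward pass. The 4096-dimensional features stand in for the full 262144-dimensional pooling features, and the helper name is illustrative:

```python
import numpy as np

def normalize_weights(w):
    """Step 3-2: ReLU then min-max normalization of the weight vector into [0, 1]."""
    w = np.maximum(w, 0.0)                               # ReLU(w)
    return (w - w.min()) / (w.max() - w.min() + 1e-12)

rng = np.random.default_rng(3)
# 15 pooling features; 4096 dims stand in for the full 262144
pooled = np.abs(rng.standard_normal((15, 4096)))
w = pooled.mean(axis=1)             # step 3-2: initialize from per-feature means
w_norm = normalize_weights(w)       # normalized on every forward pass
weighted = w_norm[:, None] * pooled # step 3-3: weighted pooling features
```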
The sparsity constraint in step (4) refers to an L1-norm regularization constraint on the weight vector during model training, which guarantees the sparsity of the weight vector and improves the final classification performance of the model.
The supervision module designed in step (5) constructs the global classification loss using all weighted pooling features.
Because fully concatenating all weighted pooling features yields a very high dimension, the weighted pooling features are averaged and then classified through one fully connected layer. This classification based on the average weighted pooling feature is called a supervision module because: 1) it provides a smooth gradient for all network sub-branches involved in the weighted pooling features, promoting stable training; 2) it helps learn a more reasonable weight vector by minimizing the overall loss over all weighted pooling features; 3) this global classification loss is used only to supervise the training process and is ignored during testing. The supervision module thus ensures training stability and the reliability of the weight vector.
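A sketch of the supervision module's global classification loss — average all weighted pooling features, then one fully connected layer followed by softmax cross-entropy. The function names, the 4096-dimensional stand-in features, and the untrained random classifier weights are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def supervision_loss(weighted_pooled, fc_w, label):
    """Global classification loss: average the weighted pooling features,
    then one fully connected layer + softmax cross-entropy."""
    avg = weighted_pooled.mean(axis=0)      # average weighted pooling feature
    logits = fc_w @ avg                     # fully connected layer, (num_classes,)
    probs = softmax(logits)
    return -np.log(probs[label] + 1e-12)    # cross-entropy for the true label

rng = np.random.default_rng(4)
feats = rng.standard_normal((15, 4096))               # 15 weighted pooling features
fc_w = rng.standard_normal((200, 4096)) * 0.01        # e.g. 200 CUB classes
loss = supervision_loss(feats, fc_w, label=7)
```

At inference time this branch is simply dropped, as the text notes, so it adds no cost to deployment.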
Building the multi-task deep learning model in step (6) specifically means that, after the end-to-end framework is established according to steps (2), (3), (4), and (5), the actual classification loss, the global classification loss of the supervision module, and the sparsity constraint on the weight vector are jointly optimized on the designated data set.
The actual classification loss is constructed as follows:
According to the magnitude of each value in the weight vector, the weighted pooling features corresponding to the K largest values are selected and concatenated, and the concatenated feature is used for final classification through one fully connected layer; the resulting loss is called the actual classification loss. K ranges from 1 to the number of weighted pooling features.
First, a hierarchical depth model is built on a specific convolutional neural network, and the weight vector, sparsity constraint, and supervision module are added to it to obtain the final bi-quadratic pooling model for adaptive interaction structure learning. During training, the parameters of the convolutional neural network part, pretrained on an image data set, are first fixed while the parameters of the newly added modules are trained; the whole network is then fine-tuned to obtain the final model, whose training effect is tested on the test set. The overall optimization objective of the model is:

$$\min_{\theta, w}\; \frac{1}{N}\sum_{s=1}^{N}\Big[\alpha\,\ell\big(\hat{y}_s, y_s\big) + \lambda\,\ell\big(\tilde{y}_s, y_s\big)\Big] + \lambda'\,\|w\|_1 + \delta\,\|w\|_2^2$$

where θ and w denote the parameters and the weight vector of the model respectively; y_s is the label of sample s; ŷ_s and ỹ_s denote the actual classification output of the model and the global classification output of the supervision module for sample s respectively; α, λ, λ′, δ are the ratios between the individual losses; and N is the number of pictures in the training set.
The invention has the beneficial effects that:
based on the concept of bi-quadratic pooling (HQP) and adaptive learning, a bi-quadratic pooling Model (MSHQP) for adaptive interaction structure learning for fine-grained image classification is proposed. The model can adaptively select the optimal pooling feature combination suitable for a specific data set from a plurality of pooling features through the weight vector, and the current leading or competitive accuracy is obtained on a common reference data set. In addition, the model for self-adaptive interactive structure learning provided by the invention not only can be applied to fine-granularity image classification, but also can be used as a more universal module to be conveniently applied to various other tasks, and the performance of the model is improved under the condition that the model reasoning efficiency is not influenced.
Drawings
FIG. 1 is a schematic illustration of a specific flow of the method of the present invention.
FIG. 2 is a schematic view of a model framework constructed in the method of the present invention.
Detailed Description
The invention is further described in detail below with reference to fig. 1 and 2.
First, multi-level depth features of an image are extracted with the hierarchical depth model, and after multiple groups of bi-quadratic pooling features are obtained from cross-level feature interactions, a weight vector whose dimension equals the number of pooling groups is constructed. A module multiplying the weights with the pooling features is added to the deep network, and classification is performed on the weighted pooling features. Second, an L1-norm sparsity constraint is applied to the whole weight vector. Then, to ensure convergence of model training and stability of the gradient flow, a supervision module is designed during training, and a classification loss is constructed over all weighted pooling features. Finally, a multi-task end-to-end deep learning model is established according to the above steps, the whole network is trained and fine-tuned on a specific data set, and the performance of the final model is tested on a test set. The invention adaptively mines the most suitable interaction structure for a specific data set, and has strong practicality and universality.
The method comprises the following specific implementation steps:
the first step:
Three data sets, CUB-200-2011, Stanford Cars, and FGVC-Aircraft, are used to validate the proposed bi-quadratic pooling model for adaptive interaction structure learning. For training, the images in the three data sets are first resized by bilinear interpolation to 600×600, 500×500, and 500×480 respectively; each image is then randomly cropped to 448×448, randomly flipped horizontally with 50% probability, and finally its pixel values are normalized. For testing, the data processing is similar to training but without the random horizontal flip.
And a second step of:
the following procedure is illustrated with Resnet34 as an example. In Resnet34, conv3_4 is selected as the lower layer feature set, and the original feature size is 128×56×56, which respectively represents the feature width of the channel number×feature height. The middle layer feature set selects Conv4_2, conv4_4 and Conv4_6, and the original feature size is 256×28×28. The higher layer feature set selects conv5_1, conv5_2, conv5_3, and the original feature size is 512×14×14. All the features in the low-layer feature set and the middle-layer feature set pass through residual downsampling modules with k being 4 and 2 respectively, and the feature sizes after new downsampling are 512 x 14.
When performing bi-quadratic pooling (HQP), two features from different-level feature groups are combined by element-wise inner product; the feature size after the inner product is still 512×14×14, which is reshaped to 512×196. The outer product of the reshaped feature with its transpose gives a 512×512 outer-product feature, which is reshaped again into a 1×262144 pooling feature. Bi-quadratic pooling between the low-level and middle-level feature groups yields three pooling features, between the low-level and high-level groups another three, and between the middle-level and high-level groups nine, for fifteen groups in total. All bi-quadratic pooling features together thus have dimension 15×262144.
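The group count and feature dimension above follow directly from the ResNet-34 configuration (1 low-level, 3 middle-level, 3 high-level features, each pooled feature being 512×512 flattened):

```python
# Cross-level pairing count for the ResNet-34 example:
# every low-mid, low-high, and mid-high pair yields one pooled feature.
low, mid, high = 1, 3, 3
groups = low * mid + low * high + mid * high   # 3 + 3 + 9 pooled groups
dim = 512 * 512                                # each flattens to 1 x 262144
```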
And a third step of:
A trainable one-dimensional weight vector of length 15 is constructed. The 15×262144 pooling features output by the first forward pass during model training are averaged over the second dimension to obtain a 15×1 vector, whose values initialize the weight vector. The weight vector is normalized according to Equation 1 in every forward pass. The normalized weight vector is then multiplied with the 15×262144 pooling features to obtain the weighted pooling features. Finally, the K weighted pooling features corresponding to the largest values in the weight vector are selected and concatenated as the final actual classification feature, and the cross-entropy loss after a fully connected layer and softmax is called the actual classification loss. In our experiments K was varied from 1 to 5; the classification results on the three data sets are shown in Table 1 below. When K equals 3, i.e., the 3 weighted pooling features with the largest weights are selected and concatenated, the best classification performance is obtained.
Table 1. Classification accuracy (%) when selecting different numbers of weighted pooling features

K              1     2     3     4     5
CUB-200-2011   87.2  87.9  88.5  88.2  88.3
Stanford Cars  94.0  93.9  94.4  94.1  94.1
FGVC-Aircraft  92.0  92.3  92.8  92.4  92.5
Fourth step:
During model training, the weight vector is subjected to an L1-norm or L2-norm sparsity constraint alone, or to a combined L1+L2 constraint. As shown in Table 2 below, the model achieves the best classification result on the CUB-200-2011 data set when the L1-norm sparsity constraint is used.
Table 2. Classification accuracy (%) on CUB-200-2011 under different sparsity constraints

Sparsity constraint   L2    L1    L2+L1
Accuracy              88.1  88.5  88.2
Fifth step:
The weighted pooling features from the third step are averaged over the first dimension to obtain a 1×262144 average weighted pooling feature; the cross-entropy loss of this feature after a fully connected layer and softmax is called the global classification loss of the supervision module. To verify the proposed adaptive interaction structure learning model, an ablation experiment is carried out on the sparsity constraint and supervision modules; the results are shown in Table 3 below, where "Baseline" denotes the model without the sparsity constraint and supervision module, and "Joint" denotes the method with both the sparsity constraint and the supervision module.
Table 3. Ablation experiments (%) on the adaptive interaction structure learning modules

                Baseline  Sparsity constraint  Supervision signal  Joint
CUB-200-2011    87.6      87.2                 88.0                88.5
Stanford Cars   94.1      93.9                 94.3                94.4
FGVC-Aircraft   92.1      92.1                 92.6                92.8
Sixth step:
and loading pre-training parameters of the Resnet34 model on the image data set, and removing the last full-connection layer to serve as a hierarchical depth visual feature extraction model. And after the visual characteristic extraction model, a bi-quadratic pooling and self-adaptive interactive structure learning module is built according to the second, third, fourth and fifth steps. The actual classification loss, the global classification loss of the supervision module and the weight vector sparse constraint loss are taken as the final model loss. Firstly, fixing parameters of a visual characteristic extraction module, independently training parameters of a subsequent pooling and self-adaptive interactive structure learning part, and fine-tuning parameters of the whole model until the model is completely converged when the model is close to convergence. In the model reasoning stage, the supervision module branch in the model is thrown away to reduce model parameters and accelerate model reasoning speed.
Finally, the Stanford Dogs and VegFru data sets are added, and the proposed bi-quadratic pooling model for adaptive interaction structure learning is validated on the VGG16, ResNet-34, ResNet-50, and ResNet-152 convolutional neural networks; the classification performance on the five benchmark data sets is given in Table 4 below.
TABLE 4 Classification accuracy under different convolutional neural networks

Claims (3)

1. A bi-quadratic pooling model for adaptive interaction structure learning, characterized in that adaptive interaction structure selection and image classification are fused into a unified multi-task model framework, training can be completed end to end, and competitive classification accuracy is achieved on fine-grained classification benchmark data sets; the specific implementation steps are:
step (1): image data preprocessing
Performing size transformation and data enhancement operation on the images before model training to ensure that the sizes of the images are consistent;
step (2): constructing a layering depth model based on multi-scale feature interaction of double secondary pooling;
step (3): constructing weight vectors
Extracting a number of bi-quadratic pooling features of the preprocessed image with the hierarchical depth model, and constructing a weight vector whose dimension equals the number of bi-quadratic pooling features; adding weighted pooling features to the hierarchical depth model; a weighted pooling feature being obtained by multiplying the weight vector with the corresponding bi-quadratic pooling feature;
step (4): sparsity constraint on weight vectors
Step (5): designing a supervision module, and then constructing a global classification loss by utilizing all the weighted pooling features;
step (6): model training and testing;
the step (2) is specifically realized as follows:
2-1. Selecting the last three stages of the same convolutional neural network and, according to feature size from large to small, calling them the low, middle, and high stages respectively; selecting one or more convolution-layer features from each stage, the features selected in the three stages being called the low-level, middle-level, and high-level feature groups respectively; the low-level feature group containing at least one convolution-layer feature and at most all convolution-layer features of the low stage; the middle-level and high-level feature groups each containing at least two convolution-layer features and at most all convolution-layer features of the corresponding stage; then adjusting the features in the low-level and middle-level feature groups with residual downsampling modules so that their feature sizes are consistent with those of the high-level feature group;
2-2. Performing bi-quadratic pooling on the features between the low-level, middle-level, and high-level feature groups; after the new low-level and middle-level feature groups are obtained via the residual downsampling modules of step 2-1, first pairing all features contained in feature groups of different levels across levels and taking element-wise inner products, so that every pair of features from different-level groups interacts; then taking the matrix outer product of each interacted feature with its own transpose to obtain a bi-quadratic pooling feature, i.e., a pooling feature, yielding the hierarchical depth model based on bi-quadratic pooling;
the residual downsampling module is as follows:
the residual downsampling structure has two branches: the main branch comprises a max pooling with a k×k kernel and a stride of k, followed by a convolutional layer with a kernel size of 1 and a stride of 1; the other, residual branch comprises a convolutional layer whose kernel size and stride are both k, which compensates for the information lost through the max pooling in the main branch; finally, the features of the two branches are added and passed through a normalization layer;
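The two-branch structure just described can be sketched as follows; this is a minimal NumPy illustration in which plain standardization stands in for the normalization layer, and all shapes, names and the normalization choice are assumptions:

```python
import numpy as np

def residual_downsample(x, w1x1, wkxk, k=2):
    """Sketch of the two-branch residual downsampling module (layout assumed).

    x:     input feature map, shape (C, H, W), with H and W divisible by k
    w1x1:  weights of the 1x1, stride-1 convolution in the main branch, (C_out, C)
    wkxk:  weights of the k x k, stride-k convolution in the residual branch,
           (C_out, C, k, k)
    """
    C, H, W = x.shape
    # Main branch: k x k max pooling with stride k ...
    blocks = x.reshape(C, H // k, k, W // k, k)
    pooled = blocks.max(axis=(2, 4))                      # (C, H/k, W/k)
    # ... followed by a 1x1 convolution, i.e. a per-pixel linear map.
    main = np.einsum('oc,chw->ohw', w1x1, pooled)
    # Residual branch: k x k stride-k convolution over non-overlapping patches,
    # compensating the information discarded by the max pooling.
    patches = blocks.transpose(1, 3, 0, 2, 4)             # (H/k, W/k, C, k, k)
    residual = np.einsum('ocij,hwcij->ohw', wkxk, patches)
    out = main + residual
    # Normalization layer (standardization stands in for e.g. batch norm).
    return (out - out.mean()) / (out.std() + 1e-5)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 14, 14))
w1 = rng.standard_normal((16, 8)) * 0.1
wk = rng.standard_normal((16, 8, 2, 2)) * 0.1
y = residual_downsample(x, w1, wk, k=2)
print(y.shape)  # (16, 7, 7)
```

Because kernel size equals stride, the convolution reduces to a linear map over non-overlapping k×k blocks, which is why a reshape suffices here.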
the weight-vector construction process in the step (3) is specifically as follows:
3-1, after the hierarchical depth model generates a plurality of pooling features, constructing a weight vector whose dimension equals the number of the pooling features;
3-2, when the hierarchical depth model is trained for the first time, calculating the mean value of each pooling feature obtained by the hierarchical depth model and initializing the weight vector with the mean values of all pooling features; during the training iterations, normalizing the weight vector so that each value of the weight vector w lies in the range [0, 1], the specific formula being:

w = (relu(w) − min) / (max − min)
wherein max and min are respectively the maximum and the minimum of all values in the weight vector; relu(w) denotes the linear rectification (ReLU) activation function;
and 3-3, multiplying the normalized weight vector element-wise with all the pooling features to obtain the weighted pooling features.
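Steps 3-1 to 3-3 can be sketched as follows; this is a minimal NumPy illustration in which the per-feature mean initialization and the relu min-max normalization follow the text, while all names, shapes and the small epsilon guard are illustrative:

```python
import numpy as np

def init_weight_vector(pooled_features):
    """Step 3-2: initialize the weight vector with the mean of each pooling feature."""
    return np.array([f.mean() for f in pooled_features])

def normalize_weight_vector(w):
    """Step 3-2 formula: relu followed by min-max scaling into [0, 1]."""
    w = np.maximum(w, 0.0)                              # relu(w)
    return (w - w.min()) / (w.max() - w.min() + 1e-12)  # epsilon avoids 0/0

def weight_pooled_features(w, pooled_features):
    """Step 3-3: multiply each pooling feature by its corresponding weight."""
    return [wi * f for wi, f in zip(w, pooled_features)]

rng = np.random.default_rng(0)
feats = [rng.standard_normal((64, 64)) for _ in range(6)]  # 6 pooling features
w = normalize_weight_vector(init_weight_vector(feats))
weighted = weight_pooled_features(w, feats)
print(w.min(), w.max())  # all values lie in [0, 1]
```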
2. The model of claim 1, wherein the supervision module designed in step (5) constructs a global classification loss using all the weighted pooling features.
3. The model of claim 1, wherein in the step (6) multi-task deep-learning model training is performed; specifically, after the end-to-end framework is established according to the steps (2), (3), (4) and (5), the actual classification loss, the global classification loss of the supervision module and the sparse constraint on the weight vector are optimized on a designated dataset;
the actual classification loss is constructed as follows:
according to the magnitude of each value in the weight vector, selecting the weighted pooling features corresponding to the largest K values, concatenating the selected weighted pooling features, and passing the concatenated feature through one fully connected layer for the final classification, the resulting classification loss being called the actual classification loss; wherein K is at least 1 and at most the number of the weighted pooling features;
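The top-K selection and concatenation described above can be sketched as follows; the function and parameter names are illustrative, and the class count and feature shapes are assumptions:

```python
import numpy as np

def actual_classification_logits(w, weighted_feats, fc_weight, k):
    """Sketch of the actual-classification branch (names are illustrative).

    Select the weighted pooling features corresponding to the K largest
    weight values, concatenate them, and pass the result through one
    fully connected layer.
    """
    top_k = np.argsort(w)[::-1][:k]   # indices of the K largest weights
    concat = np.concatenate([weighted_feats[i].ravel() for i in top_k])
    return fc_weight @ concat         # one fully connected layer

rng = np.random.default_rng(0)
w = rng.random(6)                                  # 6 weighted pooling features
feats = [rng.standard_normal((8, 8)) for _ in range(6)]
k = 3
fc = rng.standard_normal((10, k * 64)) * 0.01      # 10 classes (assumed)
logits = actual_classification_logits(w, feats, fc, k)
print(logits.shape)  # (10,)
```

In training, a standard classification loss (e.g. cross-entropy) over these logits would serve as the actual classification loss.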
firstly, constructing a hierarchical depth model based on a specific convolutional neural network, and adding the weight vector with its sparse constraint and the supervision module to the hierarchical depth model to obtain the final double secondary pooling model for self-adaptive interactive structure learning; in the training process, firstly fixing the parameters of the specific convolutional-neural-network part obtained by pretraining on an image dataset and training the parameters of the other newly added modules; then fine-tuning the whole network to obtain the final model, and testing the training effect on a test set; the specific optimization objective function of the whole model is as follows:
wherein θ and w respectively denote the parameters and the weight vector of the model; y_s denotes the label of sample s; ŷ_s and ỹ_s respectively denote the actual classification output of the model and the global classification output of the supervision module for sample s; α, λ, λ′ and δ respectively denote the weights balancing the individual losses; and N denotes the number of images in the training set.
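The objective function itself appears only as an image in the source; a plausible reconstruction from the components described above (the interpretation of δ as the coefficient of a weight-decay term on θ is an assumption) is:

```latex
\min_{\theta,\, w}\;
\frac{1}{N}\sum_{s=1}^{N}\Big[\,\alpha\,\ell\big(\hat{y}_s,\, y_s\big)
+ \lambda\,\ell\big(\tilde{y}_s,\, y_s\big)\Big]
+ \lambda' \lVert w \rVert_1
+ \delta \lVert \theta \rVert_2^2
```

where \ell denotes a classification loss such as cross-entropy, the first term is the actual classification loss, the second is the global classification loss of the supervision module, and the \ell_1 penalty on w realizes the sparse constraint on the weight vector.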
CN202110350164.7A 2021-03-31 2021-03-31 Double secondary pooling model for self-adaptive interactive structure learning Active CN113139587B (en)


Publications (2)

Publication Number Publication Date
CN113139587A CN113139587A (en) 2021-07-20
CN113139587B (en) 2024-02-06



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant