CN112861970B

CN112861970B - Fine-grained image classification method based on feature fusion

Info

Publication number: CN112861970B
Application number: CN202110179265.2A
Authority: CN
Inventors: 初妍; 王丽娜; 莫世奇; 李思纯; 李松; 时洁; 胡博; 苗晓晨; 赵佳昕
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2021-02-09
Filing date: 2021-02-09
Publication date: 2023-01-03
Anticipated expiration: 2041-02-09
Also published as: CN112861970A

Abstract

The invention belongs to the technical field of image recognition in computer vision, and particularly relates to a fine-grained image classification method based on feature fusion. The invention realizes the extraction of local detail characteristics of the fine-grained images on the classification task, accurately positions the fine-grained images in the concerned target area, solves the difficulty of small intra-class difference of the fine-grained images on the classification task, utilizes the improved non-maximum value to inhibit the soft-NMS optimization area to suggest the RPN to acquire the target object, and avoids the interference of background information. According to the invention, the bilinear convolutional neural network B-CNNs are improved through the attention module SCA and used for a fine-grained classification task so as to obtain attention characteristics with different dimensions. Compared with the existing classification method, the method is positioned in the key part of the distinction, and has higher accuracy.

Description

Fine-grained image classification method based on feature fusion

Technical Field

The invention belongs to the technical field of image recognition in computer vision, and particularly relates to a fine-grained image classification method based on feature fusion.

Background

The traditional classification task, multi-finger gross classification, is for example cat and dog. Due to their many distinctive features, it is relatively easier than fine-grained image classification. Fine-grained image classification is a subtask of image classification, mainly identifying hundreds of sub-categories under the same basic category, such as hundreds of sub-categories of birds, cars, pets, flowers, and airplanes. Different from a general classification task, fine-grained image classification has the characteristic of small intra-class difference, and the fine and local difference is the key of fine-grained image classification.

Due to the slight intra-class differences, different sub-classes can often be distinguished only by slight local differences. The fine-grained classification method mainly comprises two methods: one is a classification model based on strong supervision, which needs to use additional information such as manually labeled object labeling boxes and part labeling points in addition to the class labels of the images in order to obtain better classification accuracy. For example, the Part R-CNN algorithm adopts a recursive convolutional neural network to detect objects and local regions in an image. The practicability of the algorithm is limited to a great extent because the acquisition cost of the label information is very expensive. The other is a classification model based on weak supervision, which only relies on class labels to complete good classification without using additional part labeling information. Like a Two-level attention (Two-level attention) algorithm, does not depend on additional labeling information, and only uses class labels to complete fine-grained image classification. Although the extracted features have certain expression capability, how to effectively extract the features of the discriminant parts of the key attention area categories on the premise of only having category labels is challenging.

Disclosure of Invention

The invention aims to realize the extraction of local detail features of fine-grained images on a classification task and the accurate positioning in a concerned target area, and provides a fine-grained image classification method based on feature fusion.

The purpose of the invention is realized by the following technical scheme: the method comprises the following steps:

step 1: acquiring an image data set to be classified, taking partial image data to construct a training set, and forming a test set by the rest data; labeling the images in the training set to obtain a category label corresponding to each image;

and 2, step: extracting a feature map of each image in the training set by using a VGG-19 convolutional neural network, and obtaining a feature vector of each image in the training set through sliding window operation on the final conv5-3 feature map;

and step 3: inputting the feature vector of each image in the training set into a regression layer and a classification layer to obtain a regional candidate detection frame set of each image in the training set; calculating a confidence score f for each detection frame in the set of region candidate detection frames _i Selecting a detection frame with the highest confidence coefficient to cut the image to obtain a cut image training set;

and 4, step 4: inputting the cut image training set into an SC-B-CNNs model for training;

the SC-B-CNNs model comprises a first ResNet-50 network, a second ResNet-50 network and a softmax classifier; the first ResNet-50 network is a ResNet-50 network which is pre-trained on ImageNet and removes a final full connection layer, and an attention module SCA is added between conv2 and conv3 volume blocks of the ResNet-50 network; the second ResNet-50 network does not perform pre-training and has an attention module SCA added between its conv4 and conv5 volume blocks;

step 4.1: respectively inputting the cut image training set into a first ResNet-50 network and a second ResNet-50 network, wherein the first ResNet-50 network outputs a first weighted feature map f of each image _A The second ResNet-50 network outputs a second weighted feature map f for each image _B ；

Step 4.2: the first weight characteristic graph f of each image in the cut image training set _A And a second weighted feature map f _B Obtaining each image in the cut image training set through bilinear poolingBilinear feature vectors of the sheet image;

step 4.3: inputting the bilinear feature vector of each image in the cut image training set into a softmax classifier to obtain the category of the image;

and 5: and inputting the test set into the trained SC-B-CNNs model to obtain a classification result of the image data set to be classified.

The present invention may further comprise:

the attention module SCA is used for extracting a feature map F with weight distribution of the input feature map G _sc The method comprises the following specific steps:

step 4.1.1: generating a feature map F by 1 multiplied by 1 convolution for the feature map G input to the attention module SCA;

step 4.1.2: feature graph F is dimensionality reduced using global mean pooling by having a parameter W _fc The full-connection layer assigns weight to the full-connection layer, then compresses the w multiplied by h multiplied by 1 characteristic diagram into a channel according to the channel direction through convolution operation, and generates a space attention diagram A by adopting a sigmoid activation function _s ；

Wherein G ∈ R ^w×h×c W is the length of the feature map G, h is the width of the feature map G, and w × h represents the two-dimensional space size of the feature map G; c represents the number of channels; f. of ^7×7 Representing the size of the convolution kernel; σ () represents a sigmoid activation function;

step 4.1.3: element-by-element dot multiplication method for spatial attention diagram A _s Performing feature fusion with the feature map F to obtain a spatial attention feature F _s ：

Step 4.1.4: feature spatial attention F _s Compressing according to the spatial dimension w multiplied by h to generate a global compressed feature vector z of the current feature map _c ；

Wherein, f _sq () Representing a compression operation; u. of _c Representing the c channel characteristic diagram;

step 4.1.5: obtaining the weight value of each channel in the feature map through two full-connection layers, and obtaining a feature map F with weight distribution by using sigmoid activation _sc ；

A＝σ(W _s2 ×tanh(W _s1 ×z _c ))

Wherein σ () represents a sigmoid activation function, and tanh () represents a tanh activation function; a is a feature vector of weight distribution; w _s1 Is the weight of the first fully connected layer; w is a group of _s2 Is the weight of the second fully connected layer; u. of _c Representing the c channel feature map;

representing element-by-element dot multiplication.

The invention has the beneficial effects that:

the invention realizes the extraction of local detail characteristics of the fine-grained images on the classification task, accurately positions the fine-grained images in the concerned target area, solves the difficulty of small intra-class difference of the fine-grained images on the classification task, utilizes the improved non-maximum value to inhibit the soft-NMS optimization area to suggest the RPN to acquire the target object, and avoids the interference of background information. According to the invention, the bilinear convolutional neural network B-CNNs are improved through the attention module SCA and used for a fine-grained classification task so as to obtain attention characteristics with different dimensions. Compared with the existing classification method, the method is positioned in the key part of the distinction, and has higher accuracy.

Drawings

Fig. 1 is a frame diagram of the fine-grained image classification method based on feature fusion according to the present invention.

Fig. 2 is a specific flowchart of the RPN network according to the present invention.

FIG. 3 is a schematic diagram of the framework of the B-CNNs based on SCA in the invention.

FIG. 4 is a schematic diagram of the attention module SCA of the present invention.

Fig. 5 is a specific algorithm code diagram of the SCA-based bilinear CNNs in the present invention.

FIG. 6 is a table of the results of comparative experiments performed on three datasets CUB-200, stanford cars and Oxford flowers.

FIG. 7 is a table of the results of comparative experiments performed on the CUB-200 dataset.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

The invention aims to realize extraction of local detail features of fine-grained images on a classification task and accurate positioning in a concerned target area, provides a feature fusion-based weak supervision fine-grained image classification method, and aims to suppress a soft-Non-Maximum-supervision (NMS) optimization area recommendation Network (RPN) to obtain a target object by using an improved Non-Maximum value so as to avoid interference of background information. An Attention module SCA (Spatial-Channel Attention) is designed to improve bilinear convolutional neural networks (B-CNNs) for a fine-grained classification task so as to acquire Attention features of different dimensions. Compared with the existing classification method, the method is positioned in the key part of the distinction, and has higher accuracy.

Step 1, inputting images in a data set and corresponding class labels, and extracting a feature map of each image by using a VGG-19 convolutional neural network;

step 2, obtaining a 256-dimensional feature vector through 3 multiplied by 3 sliding window operation on the final conv5-3 feature map;

step 3, inputting 256-dimensional feature vectors into two full-connection layers, namely a boundary regression layer and a classification layer to obtain a regional candidate frame set;

step 4, selecting a detection frame with the highest confidence level in the frames to be detected by using an improved soft-NMS algorithm;

step 5, cutting and dividing the detected target area with the highest confidence coefficient;

step 6, inputting the cut image;

step 7, two ResNet-50 networks with the last full connection layer removed are respectively used for extracting convolution characteristics of the input image;

step 8, the first branch network uses ResNet-50 pre-trained on ImageNet, and adds a designed attention module SCA between conv2 and conv3 volume blocks to obtain a weighted feature map;

step 9, the second network uses ResNet-50 without pre-training and adds a designed attention module SCA between the conv4 and conv5 volume blocks to obtain a weighted feature map;

step 10, obtaining bilinear feature vectors by bilinear pooling operation on the weighted feature maps in the

steps

8 and 9;

step 11, inputting the bilinear feature vectors into a softmax classifier to obtain the category of the image;

step 12 inputs the test data set and calculates the accuracy of the model classification.

The invention extracts the image characteristics through the RPN network and completes the selection of the candidate frame. And taking the picture as an input, extracting the rough features of the detected image by using VGG-19, and outputting an RPN (robust pitch contour) which is a region of interest obtained by convolving the feature map. To prevent overfitting, the RPN network is optimized using a modified soft-NMS, selecting the region where the higher confidence target is located. And optimizing the preset region, selecting anchors with 3 scales and 3 aspect ratios, namely generating 9 types of anchors, outputting 18 confidence values at each sliding window position classification layer, and outputting 36 target interested region position information by the regression layer to obtain more accurate candidate regions. Carrying out parameterized calculation on the target according to the boundary coordinates, wherein the formula is as follows:

t _x ＝(x-x _a )/w _a ，t _y ＝(y-y _a )/h _a

t _w =log(w/w _a )，t _h =log(h/h _a )

wherein, x, y, w, h respectively represent the horizontal and vertical coordinates and the length and width of the frame center of the prediction matrix. t is t _i Representing parameterization of object boundary coordinates.

Indicating annotation information associated with the positive anchor point. x is a radical of a fluorine atom _a ,y _a ,w _a ,h _a Respectively representing the horizontal and vertical coordinates and the length and width, x, of the anchor point frame ^* ,y ^* ,w ^* ,h ^* Respectively representing the abscissa and ordinate of the true position of the label and the length and width.

Sorting all the detected detection boxes according to their scores (when the score is scored by using a classifier, a probability value is obtained, and the probability value represents the probability that the current detection box is the object to be detected), selecting the detection box A with the largest score, setting a threshold b, calculating loU (interaction over Unit) between the current detection box and the largest detection box A in the rest detection boxes, and if the loU is larger than the threshold b, obtaining the detection box with high overlapping rate. Deleting the detection boxes; if the detection frames are not overlapped with the current detection frame completely or the overlapping area of the detection frames is very small (the loU is smaller than the threshold b), then the detection frames which are not processed are reordered, the detection frame with the largest score is also selected after the ordering is finished, then the loU values of other detection frames and the largest detection frame are calculated, then the detection frames with the loU larger than a certain threshold are deleted again, and the process is iterated continuously until all the detection frames are processed, and the final detection result is output.

The RPN extracted candidate frames will be highly overlapping. To reduce redundancy, the improved soft-NMS was optimized according to the classification score of the detection box. And when the score of the detection frame is larger than the threshold value t, putting the detection frame into a final detection result set. When the areas are overlapped, the score of the detected frame is multiplied by an attenuation function, so that the error probability is effectively reduced, and the detection accuracy is improved. The specific calculation formula is as follows:

wherein: f. of _i The score corresponding to the ith detection box is shown, and t is a threshold value.

The SC-B-CNNs network architecture provided by the invention can be formed by a quaternary function B = (f) _A ,f _B P, C), bilinear features are obtained by performing bilinear combination through outer product operation, and the calculation formula is as follows:

b＝f _A ^T ·f _B

wherein f is _A And f _B The feature function containing the added attention block SCA, P is the pooling function and C is the classification function.

The feature outputs for each location are combined using bilinear pooling. The bilinear pooling operation of the input image l at position I is defined as:

bilinear(l,I,f _A ,f _B )＝f _A (l,I) ^T f _B (l,I)

wherein f is _A And f _B Are the output of two feature extraction functions of the B-CNNs.

Firstly, a feature graph extracted by a feature function is used as an original input G, G belongs to R ^w×h×c Where w × h represents the two-dimensional space size of G, and c represents the number of channels. Feature map F is generated by a 1 × 1 convolution, and F is dimensionality reduced using Global Average Pooling (Global Average Pooling), by having a parameter W _fc The full-connection layer assigns weight to the full-connection layer, then compresses the w multiplied by h multiplied by 1 characteristic diagram into a channel according to the channel direction through convolution operation, and generates a space attention diagram A by adopting a sigmoid activation function _s ，A _s ∈R ^w×h×1 . The process of spatial attention extraction is expressed as the formula:

wherein: f. of ^7×7 Representing the size of the convolution kernel, σ () representing the sigmoid activation function, W _fc Is represented by having a parameter W _fc The full interconnect layer of (1).

Then, the space attention diagram A is multiplied by the element points _s Performing feature fusion with the original input F to obtain a spatial attention feature F _s ：

And compressing the global space information into the channel description characteristic information. Generating a global compressed feature vector z of the current feature map by compressing the feature map Fs in a spatial dimension w × h _c The specific calculation formula is as follows:

wherein, f _sq () Denotes a compression operation, u _c Showing the c-th channel profile.

Then, an activation operation is carried out, and by learning the weight parameters, the nonlinear correlation between the channels is found. And obtaining the weight value of each channel in the feature map through the two fully-connected layers, and taking the weighted feature map as the input of the next layer of network. The weight assignment calculation formula of the channel is as follows:

A _c ＝f _eq (z,W)＝σ(W _s2 ×tanh(W _s1 ×z _c ))

wherein f is _eq () Represents a compression operation, z represents a global compressed feature vector, σ () represents a sigmoid activation function, and tanh () represents a tanh activation function.

After the weight distribution vector of the feature map is obtained through the operation, simple gate control is selected and used, sigmoid activation is used, and the feature map F with the weight distribution is obtained _sc The calculation process is as follows:

wherein A is _c Is a feature vector of weight distribution, u _c Showing the characteristic diagram of the c-th channel,

representing element-by-element dot multiplication.

The function of using two fully-connected layers is to ensure the consistency of input and output. The first full-connection layer firstly reduces the dimension of the channel to 1/16 of the original dimension, and after the channel passes through the tanh activation function, the channel is restored to the original input dimension through one full-connection layer.

The specific algorithm of the SCA-based bilinear CNNs is shown in FIG. 5. To demonstrate the effectiveness of the proposed method, comparative experiments were performed on three datasets, CUB-200, stanford cars and Oxford flowers, respectively, and the results are shown in FIG. 6. To further verify the validity and accuracy of the improved RPN network and SCA, comparative experiments were performed on the CUB-200 dataset, with the results shown in fig. 7.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A fine-grained image classification method based on feature fusion is characterized by comprising the following steps:

step 1: acquiring an image data set to be classified, taking partial image data to construct a training set, and forming a test set by the rest data; labeling the images in the training set to obtain class labels corresponding to the images;

step 4.1.1: for the feature map G input to the attention module SCA, generating a feature map F by 1 × 1 convolution;

step 4.1.2: feature graph F is dimensionality reduced using global average pooling by having parameter W _fc The full connection layer is assigned with weight, then the characteristic diagram of w multiplied by h multiplied by 1 is compressed into a channel according to the channel direction through convolution operation, and the sigmoid activation function is adopted to generate a space attention diagram A _s ；

step 4.1.3: element-by-element dot multiplication method for spatial attention diagram A _s Performing feature fusion with the feature map F to obtain a spatial attention feature map F _s ：

Step 4.1.4: spatial attention feature map F _s Compressing according to the spatial dimension w multiplied by h to generate a spatial attention feature map F _s Global compressed feature vector z of _c ；

Step 4.1.5: obtaining a spatial attention feature map F through two fully-connected layers _s Using sigmoid to activate the weight value of each channel to obtain a feature graph F with weight distribution _sc ；

A＝σ(W _s2 ×tanh(W _s1 ×z _c ))

Wherein σ () represents a sigmoid activation function, and tanh () represents a tanh activation function; a is a weight distribution feature vector; w _s1 Is the weight of the first fully connected layer; w _s2 Is the weight of the second fully connected layer; u. of _c Representing the c channel characteristic diagram;

representing element-by-element dot multiplication;

and 4.2: the first weight characteristic graph f of each image in the cut image training set _A And a second weighted feature map f _B Obtaining a bilinear feature vector of each image in the cut image training set through bilinear pooling operation;