CN112132004B - Fine granularity image recognition method based on multi-view feature fusion - Google Patents
- Publication number
- CN112132004B (application CN202010992253.7A)
- Authority
- CN
- China
- Prior art keywords
- feature
- loss function
- bilinear
- image
- fine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
Abstract
A fine-grained image recognition method based on multi-view feature fusion relates to the technical field of image processing. It addresses the shortcomings of existing fine-grained recognition methods: detail information in images is ignored, adaptability to visual differences between images is poor, the introduced loss functions are complex, and the number of model parameters grows. The invention introduces a suppression branch that masks the most salient region of the image, forcing the network to find subtle discriminative features among easily confused categories. A similar-sample comparison module fuses the feature vectors of samples of the same class, increasing the interaction of information between different images of that class. A center loss function is also introduced to minimize the distance between features and their class centers, making the learned features more discriminative. The accuracy of fine-grained image recognition is thereby improved.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to a fine-granularity image recognition method based on multi-view feature fusion.
Background
Fine-grained image classification distinguishes finer subcategories within a basic category, such as species of birds or breeds of dogs. The problem therefore requires capturing subtle inter-class differences and fully mining the discriminative features of an image.
Fine-grained objects are ubiquitous in real life, and the corresponding fine-grained image recognition is an important research topic in computer vision. It currently presents three main challenges: (1) images of the same category may vary greatly because of differences in pose, background, and shooting angle; (2) different categories under the same parent category differ only in a few subtle regions, such as the beak or tail of a bird; (3) collecting and labeling fine-grained images is time-consuming and labor-intensive. These challenges are illustrated in Fig. 1.
The existing method mainly achieves the aim of identification through the following three aspects: (1) Fine-grained image recognition is performed based on a location-classification network. (2) More discriminant characterization is learned directly by developing a powerful depth model for fine-grained recognition. (3) And combining the global features and the local features of the image to realize fine-grained classification of the image.
In prior art 1, bilinear-pooling fine-grained image classification (bilinear pooling) extracts features through a pretrained twin convolutional neural network and performs bilinear pooling over the channel dimension of the features to obtain a high-order representation, which strengthens the discriminative power of the features. The improvement in fine-grained recognition accuracy comes from this new pooling scheme.
Although this method provides a new bilinear pooling scheme, it makes no effective design for fine-grained recognition with respect to the relations among fine-grained categories, the number of model parameters, or the number of detail regions. It also ignores the fact that fine-grained images contain rich detail information under small inter-class and large intra-class differences.
Prior art 2 adopts a multi-attention multi-class constraint network (Multi-Attention Multi-Class Constraint). Multiple attention regions of the input image are extracted by a one-squeeze multi-excitation module, and metric learning is then introduced: the network is trained with a triplet loss and a softmax loss that pull together the same attention of similar features and push apart different attentions or features of different classes. This strengthens the relations among parts and improves fine-grained recognition accuracy.
This method mainly uses metric learning to improve the sample distribution in feature space, so it adapts poorly to mining the visual differences between a pair of images. Moreover, the introduced loss function is complex, a large number of sample pairs must be constructed, and the number of model parameters increases greatly.
Disclosure of Invention
The invention provides a fine-grained image recognition method based on multi-view feature fusion, which aims to solve the problems of existing fine-grained recognition methods: detail information of images is ignored, adaptability to visual differences between images is poor, the introduced loss function is complex, and the number of model parameters increases.
A fine granularity image recognition method based on multi-view feature fusion is realized by the following steps:
Step one, bilinear feature extraction;
Inputting an original image into a bilinear feature extraction network, and fusing the feature maps output by different convolution layers to obtain bilinear feature vectors; the feature extraction network adopts a network structure pre-trained on the ImageNet dataset;
Step two, suppression branch learning, with the following specific process:
Step 2.1, generating an attention map according to the values of the feature maps output by different convolution layers of the feature extraction network in step one and a threshold;
Step 2.2, generating a suppression mask from the attention map of step 2.1 and overlaying it on the original image to produce a suppression image with the local area masked;
Step 2.3, performing bilinear feature extraction as in step one on the suppression image of step 2.2 to obtain a bilinear feature vector, inputting it into a fully connected layer to obtain predicted class probabilities, and computing the multi-class cross entropy over these predictions;
Step three, similar-sample comparison module learning;
Step 3.1, randomly selecting N other images of the same category as the original image as positive sample images;
Step 3.2, sending the target image and the positive sample images of step 3.1 into the feature extraction network of step one for bilinear feature vector fusion, obtaining bilinear feature vectors whose fusion integrates several images of the same category;
Step 3.3, averaging the bilinear feature vectors of the several same-category images obtained in step 3.2 to obtain a fused feature vector, inputting it into a fully connected layer to obtain a predicted probability, and computing the multi-class cross entropy for this same-category prediction;
Step four, calculating a center loss function L C;
Let v_i be the bilinear feature of the i-th sample, c_i the mean feature of all samples of the class of sample i (i.e. the class center), and N the number of samples in the current batch; the center loss function L_C is then:

L_C = (1/(2N)) · Σ_{i=1}^{N} ||v_i − c_i||_2^2
Step five, calculating a model optimization loss function;
The cross-entropy loss function of the original image's bilinear feature vector, the cross-entropy loss function of the suppression image's bilinear feature vector, the cross-entropy loss function of the fused feature, and the center loss function are weighted and summed to obtain the model's optimization loss function.
The invention has the beneficial effects that: the invention comprehensively considers factors such as large intra-class difference, small inter-class difference, large background noise influence and the like of fine-granularity images, introduces a suppression branch, and forces the network to find subtle distinguishing characteristics among confusable classes by suppressing the most obvious area in the images. A similar comparison learning module is also introduced, and feature vectors of similar samples are fused, so that interaction information of different images under the same category is increased. Meanwhile, a center loss function is introduced, the distance between the features and the corresponding class centers is minimized, and the learned features are more discriminant.
Taken together, the method makes comprehensive use of global and local features during discrimination, achieves clear performance gains on several fine-grained image classification tasks, is more robust than existing methods, and is easy to deploy in practice. The accuracy of fine-grained image recognition is improved.
Drawings
A, b, c and d in fig. 1 are all schematic diagrams of the existing 4-group fine-grained images;
FIG. 2 is a schematic diagram of bilinear feature extraction in a fine-grained image recognition method based on multi-view feature fusion according to the invention;
FIG. 3 is a schematic diagram of similar contrast learning in a fine-grained image recognition method based on multi-view feature fusion according to the invention;
FIG. 4 is a schematic diagram of model optimization loss function calculation in a fine-grained image recognition method based on multi-view feature fusion according to the invention;
Fig. 5 is a feature visualization effect diagram obtained by a fine-grained image recognition method based on multi-view feature fusion according to the invention.
Detailed Description
The present embodiment will be described with reference to fig. 2 to 5, which are a fine-grained image recognition method based on multi-view feature fusion, and the method is implemented by the following steps:
Step one, bilinear feature extraction: a ResNet-50 network pre-trained on ImageNet takes fixed-size original images as input, and the feature maps output by different convolution layers are fused to obtain bilinear feature vectors.
Referring to Fig. 2, the feature extraction step uses a network pre-trained on the ImageNet dataset as the base network; a common image classification network such as VGGNet, GoogLeNet, or ResNet can be fine-tuned to adapt the model to the specific task. Specifically, the original image is fed into the feature extraction network to obtain the feature maps output by the last two convolution layers, denoted F_1 ∈ R^{D1×H×W} and F_2 ∈ R^{D2×H×W}, where D1 and D2 are the channel counts of the two features and H and W are the height and width of the feature maps. To keep the dimension of the fused feature manageable while the generated feature vector still carries enough information, only the feature maps of n randomly selected channels of F_2 are fused with F_1. At each spatial position, the vectors f_1 ∈ R^{D1} and f_2 ∈ R^{D2} taken along the channel dimension of F_1 and F_2 are multiplied to give a bilinear matrix B = f_1 · f_2^T ∈ R^{D1×D2}. The bilinear matrices of all positions in the feature map are summed and the result is flattened into a vector, the bilinear vector v ∈ R^d with d = D1 × D2. Compared with a linear model, the bilinear vector provides a stronger representation of the features.
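The bilinear fusion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: it skips the random selection of n channels, and the signed square root and L2 normalisation at the end are common post-processing for bilinear features that the patent does not specify — both are assumptions here.

```python
import numpy as np

def bilinear_pool(f1: np.ndarray, f2: np.ndarray) -> np.ndarray:
    """Sum-pooled outer product of two feature maps.

    f1: (D1, H, W) and f2: (D2, H, W) feature maps from two conv layers.
    Returns the bilinear vector of length d = D1 * D2.
    """
    d1, h, w = f1.shape
    d2 = f2.shape[0]
    # B = sum over all positions (x, y) of the outer product
    #     f1(:, x, y) * f2(:, x, y)^T, computed as one matrix product.
    bilinear = f1.reshape(d1, h * w) @ f2.reshape(d2, h * w).T  # (D1, D2)
    v = bilinear.reshape(-1)                                    # flatten to d
    # Assumed post-processing: signed square root, then L2 normalisation.
    v = np.sign(v) * np.sqrt(np.abs(v))
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

If D1 and D2 are in the thousands, d = D1 × D2 explodes, which is what motivates the patent's random selection of only n channels of F_2 before fusing.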
Step two, suppression branch learning:
A. Attention map generation: generate the attention map from the feature map values and a threshold.
B. Suppression mask generation: generate a suppression mask from the attention map and overlay it on the original image to produce a suppression image with the local area masked.
C. Multi-class cross-entropy calculation: extract the bilinear feature vector of the suppression image via step one, input it into a fully connected layer to obtain predicted class probabilities, and compute the multi-class cross entropy over them.
In the suppression branch learning step, the following three aspects are included:
Step A: for the feature map F ∈ R^{D×H×W} output by a convolution layer of the feature extraction network, average-pool each channel to obtain the channel means P_k (k = 1, …, D), sort them in descending order, and select the top-5 values p_k (normalized to sum to 1) to compute the entropy:

E = −Σ_{k=1}^{5} p_k · log p_k
The attention map A is then constructed by comparing the entropy E with the magnitude of the threshold δ.
Step B: enlarge the attention map A to the original image size and compute its average value m; with a threshold θ set in the range 0 to 1, take m·θ as the cut-off, set elements of A larger than m·θ to 0 and all other elements to 1, obtaining the suppression mask M:

M(x, y) = 0, if A(x, y) > m·θ; M(x, y) = 1, otherwise
Step C: overlay the suppression mask onto the original image to obtain a suppression image with the local area masked:

I_s(x, y) = I(x, y) · M(x, y)

where I(x, y) is the value at position (x, y) of the original image I.
Because the most salient regions of the image are suppressed, attention is dispersed and the neural network is forced to learn discriminative information from other regions. This reduces the network's dependence on particular training samples, prevents overfitting, and further improves the robustness of the model.
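Steps A–C above can be sketched as follows. This is a hedged NumPy sketch: the patent does not disclose the exact rule that turns the entropy comparison into an attention map, so the rule in `attention_map` (low entropy → average only the top-5 channels, otherwise average all channels) and the nearest-neighbour mask upsampling are illustrative assumptions; the mask rule M(x, y) = 0 iff A(x, y) > m·θ follows the text.

```python
import numpy as np

def attention_map(feat: np.ndarray, delta: float = 1.2) -> np.ndarray:
    """Build an attention map from a (D, H, W) feature map.

    Channels are ranked by their spatial mean; the entropy of the
    normalised top-5 means is compared with the threshold delta.
    The selection rule below is an assumption, not the patent's.
    """
    means = feat.mean(axis=(1, 2))
    order = np.argsort(means)[::-1]                 # channels, best first
    top5 = np.clip(means[order[:5]], 1e-12, None)
    p = top5 / top5.sum()
    entropy = -(p * np.log(p)).sum()
    chosen = order[:5] if entropy < delta else order
    return feat[chosen].mean(axis=0)                # (H, W)

def suppression_mask(att: np.ndarray, theta: float = 0.5) -> np.ndarray:
    """M(x, y) = 0 where A(x, y) > m * theta (m = mean of A), else 1."""
    m = att.mean()
    return np.where(att > m * theta, 0.0, 1.0)

def suppress(image: np.ndarray, att: np.ndarray, theta: float = 0.5) -> np.ndarray:
    """I_s(x, y) = I(x, y) * M(x, y); the mask is upsampled to image size."""
    h, w = image.shape[:2]
    mask = suppression_mask(att, theta)
    ys = np.arange(h) * att.shape[0] // h           # nearest-neighbour rows
    xs = np.arange(w) * att.shape[1] // w           # nearest-neighbour cols
    big = mask[np.ix_(ys, xs)]
    return image * (big if image.ndim == 2 else big[..., None])
```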
Step three, similar-sample comparison module learning:
A. Image sampling: randomly select N other images of the same category as positive samples.
B. Feature fusion: fuse the bilinear feature vectors (obtained as in step one) of the target image and the randomly sampled positive sample images; the fused feature integrates the feature information of several images of the same category.
C. Fusion-feature loss calculation: input the fused feature vector into a fully connected layer to obtain a predicted probability, and compute the multi-class cross entropy for the class prediction.
Referring to fig. 3, step a randomly selects N images belonging to the same category as the input image, and all the N images are sent to the bilinear feature extraction network of step one.
Step B averages the bilinear feature vectors of the several same-category images output by step A to obtain the fused feature vector:

V'(j) = (V(j) + Σ_{r=1}^{T} V_r(j)) / (T + 1)

where j is the position in the feature vector, V(j) is the value of the target image's feature vector at position j, T is the number of selected positive samples, and V_r(j) is the value of the r-th positive sample at position j;
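The averaging step is straightforward; a minimal sketch (assuming, as in the formula above, that the target's own vector is included in the average alongside its T positive samples):

```python
import numpy as np

def fuse_features(v: np.ndarray, positives: list) -> np.ndarray:
    """Fused vector V'(j) = (V(j) + sum_{r=1..T} V_r(j)) / (T + 1):
    the element-wise mean of the target vector and its T positive samples."""
    return np.stack([v, *positives]).mean(axis=0)
```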
step four, calculating center loss;
A. Class center generation: in the training process, the feature vectors of the learned centers of all the categories are continuously updated.
B. Center loss calculation step: the distance between the bilinear feature vector and the class center vector obtained by each input image is used as the center loss, and the distance is continuously optimized in the training process.
In this embodiment, a feature vector is maintained for each class as its class center and is continuously updated as training progresses. By penalizing the offset between each sample's bilinear feature vector and the center of its class, samples of the same class are pulled together, and the costly construction of sample pairs is avoided. Let v_i be the bilinear feature of the i-th sample, c_i the mean feature of all samples of the class of sample i (i.e. the class center), and N the number of samples in the current batch; then:

L_C = (1/(2N)) · Σ_{i=1}^{N} ||v_i − c_i||_2^2
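A sketch of the center loss and a running-mean center update. The 1/2 scaling follows the original center-loss formulation, and the update rate `alpha` is an assumed hyperparameter: the patent only states that centers are continuously updated during training.

```python
import numpy as np

def center_loss(features: np.ndarray, labels: np.ndarray,
                centers: np.ndarray) -> float:
    """L_C = (1/(2N)) * sum_i ||v_i - c_i||^2 over the current batch."""
    diff = features - centers[labels]            # v_i - c_i for each sample
    return 0.5 * float(np.mean(np.sum(diff ** 2, axis=1)))

def update_centers(features: np.ndarray, labels: np.ndarray,
                   centers: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Move each class center toward the batch mean of its samples."""
    new = centers.copy()
    for c in np.unique(labels):
        new[c] = (1 - alpha) * centers[c] + alpha * features[labels == c].mean(axis=0)
    return new
```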
step five, calculating a model optimization loss function:
The cross-entropy loss function of the original image's bilinear features, the cross-entropy loss function of the suppression image's bilinear features, the cross-entropy loss function of the fused features, and the center loss function are weighted and summed to obtain the model's optimized loss function.
Referring to fig. 4, denote the cross-entropy loss of the original image's bilinear feature vector by L_CE1, the cross-entropy loss of the suppression image's bilinear feature vector by L_CE2, the cross-entropy loss of the fused feature by L_CE3, and the center loss by L_C. The weighted sum of these loss functions gives the model's optimization loss L:

L = L_CE1 + L_CE2 + L_CE3 + λ·L_C
Where λ is the weight of the center loss function.
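The four-term objective can be sketched as follows. Unit weights on the three cross-entropy terms are an assumption — the patent names only λ, the weight of the center loss — and `softmax_cross_entropy` is the standard multi-class cross entropy computed for a single prediction.

```python
import numpy as np

def softmax_cross_entropy(logits: np.ndarray, label: int) -> float:
    """Multi-class cross entropy for one sample: -log softmax(logits)[label]."""
    z = logits - logits.max()            # stabilise the exponentials
    p = np.exp(z) / np.exp(z).sum()
    return float(-np.log(p[label]))

def total_loss(l_ce1: float, l_ce2: float, l_ce3: float,
               l_c: float, lam: float) -> float:
    """L = L_CE1 + L_CE2 + L_CE3 + lambda * L_C."""
    return l_ce1 + l_ce2 + l_ce3 + lam * l_c
```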
Referring to fig. 5: the first row shows original images randomly selected from the dataset, the second row the class activation maps produced by the global branch for these inputs, and the third row the class activation maps produced by the suppression branch. In the global branch the network learns the most salient regions of the image, such as a bird's beak or a car's headlights, while in the suppression branch it learns the subtle features that aid fine-grained classification, such as a bird's torso or a car's wheels. Combining the two views gives the network a more comprehensive basis for its decisions: it captures both the salient regions and the subtle fine-grained cues.
The fine-grained image recognition method of this embodiment introduces a new data augmentation scheme: guided by the attention map, salient part regions of the image are suppressed so that attention is dispersed and the network learns more complementary regional feature information. The similar-sample comparison module fuses feature information from several images of the same category, so that representations of same-category images lie as close as possible in the embedding space, improving classification performance.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.
Claims (2)
1. A fine granularity image recognition method based on multi-view feature fusion is characterized by comprising the following steps: the method is realized by the following steps:
Step one, bilinear feature extraction;
Inputting an original image into a bilinear feature extraction network, and fusing the feature maps output by different convolution layers to obtain bilinear feature vectors; the feature extraction network adopts a network structure pre-trained on the ImageNet dataset;
Step two, suppression branch learning, with the following specific process:
Step 2.1, generating an attention map according to the values of the feature maps output by different convolution layers of the feature extraction network in step one and a threshold;
In step 2.1, the specific process of generating the attention map is as follows:
For the feature map F ∈ R^{D×H×W} output by the last convolution layer of the feature extraction network, where D is the number of channels and H and W are the height and width of the feature map, average-pool each channel to obtain P_k, sort the channels by these average values, and compute the entropy E over the top-5 values p_k (normalized to sum to 1):

E = −Σ_{k=1}^{5} p_k · log p_k

The attention map A is constructed by comparing the entropy E with the magnitude of the threshold δ, where F_k is the two-dimensional feature map of each channel after channel sorting;
Step 2.2, generating a suppression mask from the attention map of step 2.1 and overlaying it on the original image to produce a suppression image with the local area masked;
In step 2.2, the specific process of generating the suppression mask is as follows:
Enlarge the attention map of step 2.1 to the original image size, compute its average value m, and set a threshold θ in the range 0 to 1; taking m·θ as the cut-off, set elements of the attention map larger than m·θ to 0 and all other elements to 1, obtaining the suppression mask M:

M(x, y) = 0, if A(x, y) > m·θ; M(x, y) = 1, otherwise

where A(x, y) is the value at position (x, y) of attention map A;
Overlay the suppression mask onto the original image to obtain a suppression image I_s with the local area masked:

I_s(x, y) = I(x, y) · M(x, y)

where I(x, y) is the value at position (x, y) of the original image I;
Step 2.3, performing bilinear feature extraction as in step one on the suppression image of step 2.2 to obtain a bilinear feature vector, inputting it into a fully connected layer to obtain predicted class probabilities, and computing the multi-class cross entropy over these predictions;
Step three, similar-sample comparison module learning;
Step 3.1, randomly selecting N other images of the same category as the original image as positive sample images;
Step 3.2, sending the target image and the positive sample images of step 3.1 into the feature extraction network of step one for bilinear feature vector fusion, obtaining bilinear feature vectors whose fusion integrates several images of the same category;
Step 3.3, averaging the bilinear feature vectors of the several same-category images obtained in step 3.2 to obtain a fused feature vector, inputting it into a fully connected layer to obtain a predicted probability, and computing the multi-class cross entropy for this same-category prediction;
Step four, calculating a center loss function L C;
Let v_i be the bilinear feature of the i-th sample, c_i the mean feature of all samples of the class of sample i (i.e. the class center), and N the number of samples in the current batch; the center loss function L_C is then:

L_C = (1/(2N)) · Σ_{i=1}^{N} ||v_i − c_i||_2^2
Step five, calculating a model optimization loss function;
The model's optimized loss function is obtained by a weighted sum of the cross-entropy loss function of the original image's bilinear feature vector, the cross-entropy loss function of the suppression image's bilinear feature vector, the cross-entropy loss function of the fused feature, and the center loss function;
In step five, with L_CE1 the cross-entropy loss function of the original image's bilinear feature vector, L_CE2 that of the suppression image's bilinear feature vector, L_CE3 that of the fused feature, and L_C the center loss function, the weighted sum gives the model's optimized loss function L, finally realizing fine-grained image recognition; it is expressed as:

L = L_CE1 + L_CE2 + L_CE3 + λ·L_C
Where λ is the weight of the center loss function.
2. The fine-grained image recognition method based on multi-view feature fusion according to claim 1, characterized in that: in step 3.3, the bilinear feature vectors of the several same-category images are averaged to obtain the fused feature vector V'(j), expressed as:

V'(j) = (V(j) + Σ_{r=1}^{T} V_r(j)) / (T + 1)

where j is the position in the feature vector, V(j) is the value of the feature vector at position j, T is the number of selected positive samples, and V_r(j) is the value of the r-th positive sample at position j.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010992253.7A CN112132004B (en) | 2020-09-21 | 2020-09-21 | Fine granularity image recognition method based on multi-view feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010992253.7A CN112132004B (en) | 2020-09-21 | 2020-09-21 | Fine granularity image recognition method based on multi-view feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112132004A CN112132004A (en) | 2020-12-25 |
CN112132004B true CN112132004B (en) | 2024-06-25 |
Family
ID=73841694
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010992253.7A Active CN112132004B (en) | 2020-09-21 | 2020-09-21 | Fine granularity image recognition method based on multi-view feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112132004B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112733912B (en) * | 2020-12-31 | 2023-06-09 | 华侨大学 | Fine granularity image recognition method based on multi-granularity countering loss |
CN112712066B (en) * | 2021-01-19 | 2023-02-28 | 腾讯科技(深圳)有限公司 | Image recognition method and device, computer equipment and storage medium |
CN112766378B (en) * | 2021-01-19 | 2023-07-21 | 北京工商大学 | Cross-domain small sample image classification model method focusing on fine granularity recognition |
CN112800927B (en) * | 2021-01-25 | 2024-03-29 | 北京工业大学 | Butterfly image fine-granularity identification method based on AM-Softmax loss |
CN112990270B (en) * | 2021-02-10 | 2023-04-07 | 华东师范大学 | Automatic fusion method of traditional feature and depth feature |
CN113255793B (en) * | 2021-06-01 | 2021-11-30 | 之江实验室 | Fine-grained ship identification method based on contrast learning |
CN113449613B (en) * | 2021-06-15 | 2024-02-27 | 北京华创智芯科技有限公司 | Multi-task long tail distribution image recognition method, system, electronic equipment and medium |
CN113642571B (en) * | 2021-07-12 | 2023-10-10 | 中国海洋大学 | Fine granularity image recognition method based on salient attention mechanism |
CN113705489B (en) * | 2021-08-31 | 2024-06-07 | 中国电子科技集团公司第二十八研究所 | Remote sensing image fine-granularity airplane identification method guided by prior region knowledge |
CN115424086A (en) * | 2022-07-26 | 2022-12-02 | 北京邮电大学 | Multi-view fine-granularity identification method and device, electronic equipment and medium |
CN117725483A (en) * | 2023-09-26 | 2024-03-19 | 电子科技大学 | Supervised signal classification method based on neural network |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109685115A (en) * | 2018-11-30 | 2019-04-26 | 西北大学 | A fine-grained conceptual model and learning method based on bilinear feature fusion |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110135502B (en) * | 2019-05-17 | 2023-04-18 | 东南大学 | Image fine-grained identification method based on reinforcement learning strategy |
CN110210550A (en) * | 2019-05-28 | 2019-09-06 | 东南大学 | Image fine granularity recognition method based on ensemble learning strategy |
CN110222636B (en) * | 2019-05-31 | 2023-04-07 | 中国民航大学 | Pedestrian attribute identification method based on background suppression |
CN110807465B (en) * | 2019-11-05 | 2020-06-30 | 北京邮电大学 | Fine-grained image identification method based on channel loss function |
CN111523534B (en) * | 2020-03-31 | 2022-04-05 | 华东师范大学 | Image description method |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109685115A (en) * | 2018-11-30 | 2019-04-26 | 西北大学 | A fine-grained conceptual model and learning method based on bilinear feature fusion |
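The cited work builds on bilinear feature fusion. As a rough illustration only (a sketch of classic bilinear pooling, not the patented method or the citation's exact model), two convolutional feature maps can be fused by sum-pooling their per-location outer products, followed by signed square-root and L2 normalization; all shapes and names below are assumptions:

```python
import numpy as np

def bilinear_pool(fa: np.ndarray, fb: np.ndarray) -> np.ndarray:
    """Fuse two feature maps of shape (C1, N) and (C2, N), where N is the
    number of spatial positions, into one bilinear feature vector."""
    assert fa.shape[1] == fb.shape[1], "feature maps must share spatial size"
    b = fa @ fb.T                        # (C1, C2): sum of outer products over positions
    v = b.flatten()                      # bilinear vector of length C1 * C2
    v = np.sign(v) * np.sqrt(np.abs(v))  # signed square-root normalization
    n = np.linalg.norm(v)
    return v / n if n > 0 else v         # L2 normalization

# two hypothetical 8-channel feature maps over a 4x4 spatial grid
rng = np.random.default_rng(0)
fa = rng.random((8, 16))
fb = rng.random((8, 16))
v = bilinear_pool(fa, fb)
print(v.shape)  # (64,)
```

The resulting unit-norm vector captures pairwise channel interactions between the two views and is typically fed to a linear classifier.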
Non-Patent Citations (1)
Title |
---|
Fine-grained image classification method based on multi-view fusion; 黄伟锋; 张甜; 常东良; 闫冬; 王嘉希; 王丹; 马占宇; Journal of Signal Processing (09); full text * |
Also Published As
Publication number | Publication date |
---|---|
CN112132004A (en) | 2020-12-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112132004B (en) | Fine granularity image recognition method based on multi-view feature fusion | |
Zhao et al. | Discriminative feature learning for unsupervised change detection in heterogeneous images based on a coupled neural network | |
CN110443143B (en) | Multi-branch convolutional neural network fused remote sensing image scene classification method | |
CN111881714B (en) | Unsupervised cross-domain pedestrian re-identification method | |
Xie et al. | Multilevel cloud detection in remote sensing images based on deep learning | |
Awad et al. | Multicomponent image segmentation using a genetic algorithm and artificial neural network | |
CN107016357B (en) | Video pedestrian detection method based on time domain convolutional neural network | |
Chuang et al. | A feature learning and object recognition framework for underwater fish images | |
Dornaika et al. | Building detection from orthophotos using a machine learning approach: An empirical study on image segmentation and descriptors | |
Ghamisi et al. | Multilevel image segmentation based on fractional-order Darwinian particle swarm optimization | |
Firpi et al. | Swarmed feature selection | |
CN111832514B (en) | Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on soft multiple labels | |
CN108596211B (en) | Shielded pedestrian re-identification method based on centralized learning and deep network learning | |
CN111639564B (en) | Video pedestrian re-identification method based on multi-attention heterogeneous network | |
CN104346620A (en) | Inputted image pixel classification method and device, and image processing system | |
JP6867054B2 (en) | A learning method and a learning device for improving segmentation performance used for detecting road user events by utilizing a double embedding configuration in a multi-camera system, and a testing method and a testing device using the same | |
CN105761238B (en) | A method of passing through gray-scale statistical data depth information extraction well-marked target | |
CN111709313B (en) | Pedestrian re-identification method based on local and channel combination characteristics | |
CN110705591A (en) | Heterogeneous transfer learning method based on optimal subspace learning | |
CN111161307A (en) | Image segmentation method and device, electronic equipment and storage medium | |
Geetha et al. | Detection and estimation of the extent of flood from crowd sourced images | |
CN110427835B (en) | Electromagnetic signal identification method and device for graph convolution network and transfer learning | |
CN117475236B (en) | Data processing system and method for mineral resource exploration | |
CN111274964B (en) | Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle | |
Alsanad et al. | Real-time fuel truck detection algorithm based on deep convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |