CN116935100A - Multi-label image classification method based on feature fusion and self-attention mechanism


Info

Publication number: CN116935100A
Authority: CN (China)
Prior art keywords: image, matrix, feature, global, label
Legal status: Pending
Application number: CN202310728668.7A
Other languages: Chinese (zh)
Inventors: 高世杰 (Gao Shijie), 韩立新 (Han Lixin)
Current Assignee: Hohai University (HHU)
Original Assignee: Hohai University (HHU)
Application filed by Hohai University (HHU)
Priority to CN202310728668.7A
Publication of CN116935100A

Classifications

    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/048: Activation functions
    • G06N 3/0499: Feedforward networks
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • Y02T 10/40: Engine management systems (climate change mitigation technologies related to transportation)


Abstract

The application discloses a multi-label image classification method based on feature fusion and a self-attention mechanism, comprising the following main steps: extracting the global features of the image with a deep convolutional neural network; extracting the local features of the image by applying a 1×1 convolution to a feature map produced by an intermediate layer of the deep convolutional neural network; fusing the extracted global and local features through a self-attention mechanism to generate a feature expression for each category; and performing multi-label classification by generating image labels from the fused feature expressions through a fully connected layer and a sigmoid activation function. The proposed method fuses the local and global features of the image, can effectively model the visual characteristics of small targets in the image, takes the semantic correlation among labels into account, and can thereby improve multi-label classification accuracy.

Description

Multi-label image classification method based on feature fusion and self-attention mechanism
Technical Field
The application belongs to the field of image recognition, and particularly relates to a multi-label image classification method based on fusing the global and local features of an image and introducing a self-attention mechanism.
Background
In the information age, images have become a medium and carrier for conveying information and are widely used in many fields. Classifying massive numbers of digital images quickly and accurately is a central research topic in current image applications. Although convolutional neural networks (CNNs) perform well in single-label image classification tasks, most real-world images contain more than one scene or object, and a single image may carry multiple labels corresponding to different objects, scenes, actions, and attributes.
Extracting the rich semantic information in an image requires multi-label generation techniques that identify all categories present in the image as accurately as possible. Traditional classification is usually hard classification: each sample is assigned to exactly one category, which is exclusive by nature. For image labeling this means each image receives only one label, which is a clear limitation. Moreover, in a typical multi-label image, objects of different categories appear at different positions with different scales and poses, and occlusion, overlap, and illumination effects among objects make multi-label images harder to recognize and classify. Multi-label image classification is therefore the more general and practical problem: it models the rich semantic information in images and the dependencies among those semantics, and completing multi-label classification and recognition efficiently and accurately has become an important research direction (see Ji Zhong, Li Huihui, He Yuqing. Zero-shot multi-label image classification based on deep instance differentiation [J]. Computer Science and Exploration, 2019, 13(1): 9), with wide applications in image retrieval, portrait grouping, medical image recognition, scene understanding, and many other fields.
The success of CNNs in single-label image classification offers insight into the multi-label problem. Convolution is translation-invariant: wherever an object appears in the image, the same features are detected and the same response is produced, including when multiple objects appear in the image. The vector output by the fully connected layer of a CNN model can therefore simply be passed through a sigmoid function to obtain probabilities between 0 and 1, giving the probability that the sample belongs to each category. The per-class probabilities output by such a model are independent; that is, the multi-label problem is decomposed into several independent binary classification problems. However, this approach ignores the semantic correlation between labels: when an image carries one label, the probability that it simultaneously contains the content of a related label can be high. Sky and clouds, for example, commonly appear together, whereas water and cars almost never do. Furthermore, although the repeated convolution and pooling operations in a deep CNN reduce the number of model parameters through weight sharing and downsampling, they also continually enlarge the neurons' receptive fields, so the deep feature maps of the model reflect mostly global image features. This is beneficial in single-label classification tasks, where the image has a single target, but in multi-label classification the image contains small targets of varying size, position, and shape, and the local features of those small targets are easily ignored or diluted under the large receptive fields of the deep layers. Extracting global features directly from the whole image therefore tends to lose the visual features of small targets during feature extraction, which hurts multi-label classification accuracy.
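To make the decomposition concrete, the following minimal PyTorch sketch (shapes and names are illustrative, not taken from the patent) shows how independent sigmoid outputs turn one multi-label prediction into separate binary decisions:

```python
import torch

# A minimal sketch of the naive decomposition described above: the fully
# connected layer's raw outputs are squashed independently by a sigmoid,
# turning the multi-label problem into independent binary problems.
num_classes = 5
logits = torch.randn(2, num_classes)   # a batch of 2 images, raw FC outputs
probs = torch.sigmoid(logits)          # independent per-class probabilities in (0, 1)
pred_labels = (probs > 0.5).int()      # threshold each class to get the label set
```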
Xiao Lin et al. (see Xiao Lin, Chen Boli, Huang Xin, et al. Multi-label text classification based on label semantic attention [J]. Journal of Software, 2020, 31(4): 11) proposed a multi-label text classification method based on label semantic attention that relies on the document text and its corresponding labels: a bidirectional long short-term memory network obtains a hidden representation of each word, a label semantic attention mechanism weights each word in the document, and labels are treated as interrelated in the semantic space. Zhang Yong et al. (see Zhang Yong, Liu Haoke, Zhang Jie. Multi-label classification algorithm based on generic features and instance correlations [J]. Pattern Recognition and Artificial Intelligence, 2020, 33(5): 10) proposed a multi-label classification algorithm based on generic features and instance correlations that considers not only label correlations but also correlations among instance features, learning similarity in the instance feature space by constructing a similarity graph. Mou Jiapeng et al. (see Mou Jiapeng, Cai Jian, Yu Mengchi, Xu Jian. Generic attribute multi-label classification algorithm based on label correlation [J]. Application Research of Computers, 2020, 37(9): 4) proposed a label-correlation-based multi-label classification algorithm over generic attributes that measures the correlation between labels by the distance between them and attaches correlated labels to the generic attribute space, thereby improving classification performance. Chen et al. (see Chen Z M, Wei X S, Wang P, et al. Multi-label image recognition with graph convolutional networks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 5177-5186) proposed using a graph convolutional network (GCN) to explicitly model the correlations between class labels, learning interdependent per-class classifiers through the GCN's mapping function; the generated classifiers can be applied to the image features learned by any CNN model, giving high extensibility and flexibility. Lanchantin et al. (see Lanchantin J, Wang T, Ordonez V, et al. General multi-label image classification with transformers [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 16478-16488) proposed a Transformer model with a Label Mask Training strategy that randomly masks part of the ground-truth labels during training and lets the model predict the masked labels, thereby exploring the complex dependencies between image features and labels and within the label set.
Common methods thus lose the visual features of some small targets while extracting the global features of the image, and labels in the multi-label setting exhibit dependency relationships. It is therefore necessary to design an efficient multi-label image classification model that effectively models both the local features of small targets in the image and the dependencies among multiple labels.
Disclosure of Invention
In order to overcome the defects of the prior art, the application provides a multi-label image classification method based on feature fusion and a self-attention mechanism.
In order to achieve the above purpose, the application adopts the following technical scheme:
step 1: the ResNet50 model structure and parameters are initialized and 1*1 convolution operations are performed on the feature map output by the third convolution block of ResNet50 to extract image local features. The number of channels of the 1*1 convolution kernel should be consistent with the total number of categories of the current multi-tag classification task.
Step 2: the feature map output by the original ResNet50 model continues to pass through a subsequent convolution block and is subjected to Average Pooling (Average Pooling) to obtain a global feature matrix of the image.
Step 3: in order to discover the dependency relationship between the labels, the feature vectors are fused through a self-attention mechanism, and specifically comprises the following steps:
(1) Flattening the image local feature matrix obtained in the step 1 into a one-dimensional vector in each dimension of the channel dimension respectively; flattening the global feature matrix of the image obtained in the step 2 into a one-dimensional vector; splicing the vectors into a matrix E according to rows, wherein the local eigenvectors are subjected to linear transformation to ensure that the dimensions of the local eigenvectors are consistent with those of the global eigenvectors;
(2) Initializing a weight matrix W Q 、W K 、W V
(3) Respectively combining (1) the feature matrix E with (2) the weight vector W Q 、W K 、W V Multiplying to obtain a Query matrix, a Key matrix and a Value matrix, wherein each row of the matrixThe vectors are all associated with the aforementioned one-dimensional global feature vector or local feature vector.
(4) An attention score is calculated. Multiplying the Query matrix by the transpose of the Key matrix to obtain the attention Score matrix Score, and dividing the attention Score matrix Score by the value(d k The number of columns of the Key matrix) and then normalize each row in the matrix using the Softmax function. At this time, the numerical value of each element in the Score matrix represents the attention Score between every two feature vectors in the feature matrix E;
(5) The attention Score matrix Score is multiplied by the Value matrix, so that each row of vectors in the Value matrix is obtained by weighted summation of the attention Score and other rows of vectors.
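Written compactly, steps (2) through (5) amount to standard scaled dot-product self-attention over the stacked feature matrix E; the following summary in LaTeX notation is a restatement, not text from the original filing:

```latex
Q = E\,W_Q, \qquad K = E\,W_K, \qquad V = E\,W_V
\qquad
\mathrm{Score} = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right),
\qquad
F_{\mathrm{mixed}} = \mathrm{Score}\cdot V
```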
Step 4: and (3) inputting the Value matrix into a fully-connected neural network for calculation, and finally generating a vector with the dimension equal to the category number for each image through a Sigmoid activation function, wherein the numerical Value of each dimension of the vector represents the probability that the image belongs to the corresponding category.
The beneficial effects of the application are as follows:
(1) The application takes both the global and the local features of the image into account, alleviating to some extent the loss of small-target feature information in traditional image feature extraction networks;
(2) By performing the convolution with a 1×1 kernel whose channel count equals the total number of categories, the method computes a separate feature map for each category, improving classification accuracy over common methods in which all categories share one feature map;
(3) The method models the dependency relationships among labels with a self-attention mechanism, exploiting the semantic relevance of labels in the multi-label setting to markedly improve the model's classification performance;
(4) The method has good anti-interference capability and strong robustness, and can meet practical multi-label image classification requirements.
Drawings
FIG. 1 is a flow chart of a multi-label image classification method based on feature fusion and self-attention mechanism.
Fig. 2 is a flow chart of a feature fusion process based on a self-attention mechanism.
FIG. 3 is a schematic diagram of a neural network model of a multi-label image classification method based on feature fusion and self-attention mechanisms.
Detailed Description
The application is further illustrated by the following figures and specific examples, which are intended to illustrate the application rather than to limit its scope; after reading the application, various equivalent modifications by those skilled in the art fall within the scope defined by the appended claims.
The following example uses a classification task with a total of 5 classes.
S1: the ResNet50 model structure and parameters are initialized, where the parameters refer to weight data from the ResNet50 pre-training on the ImageNet large-scale visual recognition dataset. Thereafter, a 1*1 convolution operation is performed on the feature map output by the third convolution block of ResNet50 to extract image local features. Specifically, the feature map with the shape of 512 x 28 is subjected to convolution operation by using the convolution check with the shape of 5 x 1, so as to obtain the feature map with the shape of 5 x 28, wherein each channel corresponds to a corresponding category.
S2: the feature map output by the original ResNet50 model continues to pass through a subsequent convolution block and is subjected to Average Pooling (Average Pooling) to obtain a global feature map of the image, wherein the shape of the global feature map is 1 x 2048.
S3: in order to discover the dependency relationship between the labels, the feature vectors are fused through a self-attention mechanism, and specifically comprises the following steps:
S31: Flatten each channel of the local feature matrix obtained in S1 into a one-dimensional vector, i.e., reshape the 5×28×28 feature map into a 5×784 matrix, denoted F_regional; flatten the global feature matrix obtained in S2 into a 2048-dimensional vector, denoted F_global; stack these vectors row-wise into a matrix E, where the local feature vectors are first linearly transformed so that their dimension matches that of the global feature vector. The resulting E matrix has shape 6×2048.
A 784×2048 parameter matrix θ is defined to linearly transform the feature matrix F_regional:

F′ = F_regional · θ (9)

E = concat(F′; F_global) (10)
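A sketch of S31 under the shapes stated above; the stand-in tensors and the random initialization of θ are illustrative assumptions:

```python
import torch

num_classes = 5
local_feat = torch.randn(1, num_classes, 28, 28)  # stand-in for the S1 output
global_feat = torch.randn(1, 2048)                # stand-in for the S2 output

F_regional = local_feat.flatten(2).squeeze(0)     # 5 x 784: each channel flattened
theta = torch.randn(784, 2048) * 0.02             # learnable 784 x 2048 projection θ
F_prime = F_regional @ theta                      # equation (9): 5 x 2048
E = torch.cat([F_prime, global_feat], dim=0)      # equation (10): 6 x 2048
```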
S32: Initialize 2048×512-dimensional weight matrices W_Q, W_K, and W_V.
S33: Multiply the feature matrix E from S31 by each of the weight matrices W_Q, W_K, and W_V from S32 to obtain the Query, Key, and Value matrices; each row vector of these matrices is associated with one of the aforementioned one-dimensional global or local feature vectors. Each of these matrices has shape 6×512.
The specific calculation is:

Query = E · W_Q (11)

Key = E · W_K (12)

Value = E · W_V (13)
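A sketch of S32 and S33; the random initialization of the weight matrices and the stand-in for E are illustrative assumptions:

```python
import torch

E = torch.randn(6, 2048)             # stand-in for the stacked feature matrix of S31
d_k = 512
W_Q = torch.randn(2048, d_k) * 0.02  # S32: weight matrices, randomly initialized here
W_K = torch.randn(2048, d_k) * 0.02
W_V = torch.randn(2048, d_k) * 0.02

Query = E @ W_Q                      # equation (11): 6 x 512
Key = E @ W_K                        # equation (12): 6 x 512
Value = E @ W_V                      # equation (13): 6 x 512
```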
S34: Compute the attention scores. Multiply the Query matrix by the transpose of the Key matrix to obtain the attention score matrix Score (of shape 6×6), divide by √d_k (where d_k is the number of columns of the Key matrix), and then normalize each row with the Softmax function. Each element of the Score matrix then represents the attention score between a pair of feature vectors in E.
The specific calculation is:

Score = Softmax(Query · Key^T / √d_k) (14)
S35: Multiply the attention score matrix Score by the Value matrix; each row vector of the resulting matrix F_mixed is thus the weighted sum, by attention score, of the corresponding row vectors of the Value matrix.

F_mixed = Score · Value (15)
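A sketch of S34 and S35 with stand-in projections; torch.softmax applies the row-wise normalization described above:

```python
import math
import torch

Query = torch.randn(6, 512)  # stand-ins for the S33 projections
Key = torch.randn(6, 512)
Value = torch.randn(6, 512)
d_k = Key.shape[1]

# S34 / equation (14): scaled dot-product scores, row-normalized by Softmax.
Score = torch.softmax(Query @ Key.T / math.sqrt(d_k), dim=-1)  # 6 x 6

# S35 / equation (15): each output row is an attention-weighted sum of Value rows.
F_mixed = Score @ Value  # 6 x 512
```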
S4: will F mixed The matrix is input into a fully-connected neural network for calculation, finally, a vector with the dimension equal to the category number is generated for each image through a Sigmoid activation function, and the numerical value of each dimension of the vector represents the probability that the image belongs to the corresponding category. The fully-connected neural network is specifically defined as follows, wherein the parameter matrix ω c Is 512 x 256, ω b Is 256 x 64, omega a Is 64 x 5:
out=Sigmoid(((F mixed ·ω c )x elu ·ω b ) relu ·ω a ) (16)。
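A sketch of S4 under the dimensions stated above, using bias-free linear layers to stand in for the matrices ω_c, ω_b, and ω_a; note that equation (16) yields one 5-dimensional row per row of F_mixed, and the patent does not spell out how these rows collapse into a single per-image prediction:

```python
import torch
import torch.nn as nn

F_mixed = torch.randn(6, 512)  # stand-in for the fused matrix of S35

# Equation (16): three fully connected layers with ReLU between them, then a
# sigmoid. Bias-free nn.Linear layers stand in for ω_c, ω_b, ω_a.
omega_c = nn.Linear(512, 256, bias=False)
omega_b = nn.Linear(256, 64, bias=False)
omega_a = nn.Linear(64, 5, bias=False)

hidden = torch.relu(omega_b(torch.relu(omega_c(F_mixed))))
out = torch.sigmoid(omega_a(hidden))  # 6 x 5, per-class probabilities per row

# Pooling over the six rows is one possible way to reduce to a single
# 5-dimensional per-image probability vector (an assumption, not stated above).
```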

Claims (5)

1. A multi-label image classification method based on feature fusion and a self-attention mechanism, characterized by comprising the following steps:
step 1: initializing a model and extracting local features of an image;
step 2: extracting global features of the image;
step 3: fusing global features and local features of the image through a self-attention mechanism;
step 4: and based on the fused characteristics, performing image multi-label classification by using a fully connected neural network.
2. The multi-label image classification method based on feature fusion and self-attention mechanism according to claim 1, wherein in step 1 the model is initialized and local image features are extracted as follows:
Initialize the ResNet50 model structure and parameters, where the parameters are the weights obtained by pre-training ResNet50 on the ImageNet large-scale visual recognition dataset; then apply a 1×1 convolution to the feature map output by the third convolution block of ResNet50 to extract local image features.
3. The multi-label image classification method based on feature fusion and self-attention mechanism according to claim 1, wherein step 2 extracts the global features of the image as follows:
Pass the feature map output by the original ResNet50 model in step 1 through the subsequent convolution blocks and apply average pooling (Average Pooling) to obtain the global feature matrix of the image.
4. The multi-label image classification method based on feature fusion and self-attention mechanism according to claim 1, wherein step 3 fuses the global and local features of the image through a self-attention mechanism as follows:
(1) Flatten each channel of the local feature matrix obtained in step 1 into a one-dimensional vector, denoted F_regional; flatten the global feature matrix obtained in step 2 into a one-dimensional vector, denoted F_global; stack these vectors row-wise into a matrix E, where the local feature vectors are first linearly transformed so that their dimension matches that of the global feature vector; a parameter matrix θ is defined to linearly transform the feature matrix F_regional:

F′ = F_regional · θ (1)

E = concat(F′; F_global) (2)
(2) Initialize weight matrices W_Q, W_K, and W_V;
(3) Multiply the feature matrix E from (1) by each of the weight matrices W_Q, W_K, and W_V from (2) to obtain the Query, Key, and Value matrices; each row vector of these matrices is associated with one of the aforementioned one-dimensional global or local feature vectors.
The specific calculation is:

Query = E · W_Q (3)

Key = E · W_K (4)

Value = E · W_V (5)
(4) Compute the attention scores. Multiply the Query matrix by the transpose of the Key matrix to obtain the attention score matrix Score, divide by √d_k (where d_k is the number of columns of the Key matrix), and then normalize each row with the Softmax function. Each element of the Score matrix then represents the attention score between a pair of feature vectors in E.
The specific calculation is:

Score = Softmax(Query · Key^T / √d_k) (6)
(5) Multiply the attention score matrix Score by the Value matrix, so that each row vector of the resulting matrix F_mixed is the weighted sum, by attention score, of the corresponding row vectors of the Value matrix.

F_mixed = Score · Value (7)
5. The multi-label image classification method based on feature fusion and self-attention mechanism according to claim 1, wherein step 4 performs multi-label image classification with a fully connected neural network based on the fused features, as follows:
Input the F_mixed matrix into a fully connected neural network, and finally generate, through a Sigmoid activation function, a vector whose dimension equals the number of categories for each image; the value of each dimension represents the probability that the image belongs to the corresponding category. The fully connected network is defined as:

out = Sigmoid(((F_mixed · ω_c)_relu · ω_b)_relu · ω_a) (8)
CN202310728668.7A 2023-06-19 2023-06-19 Multi-label image classification method based on feature fusion and self-attention mechanism Pending CN116935100A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310728668.7A CN116935100A (en) 2023-06-19 2023-06-19 Multi-label image classification method based on feature fusion and self-attention mechanism


Publications (1)

Publication Number Publication Date
CN116935100A true CN116935100A (en) 2023-10-24

Family

ID=88388585


Country Status (1)

Country Link
CN (1) CN116935100A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117876797A (en) * 2024-03-11 2024-04-12 中国地质大学(武汉) Image multi-label classification method, device and storage medium
CN117876797B (en) * 2024-03-11 2024-06-04 中国地质大学(武汉) Image multi-label classification method, device and storage medium

Similar Documents

Publication Publication Date Title
Han et al. A unified metric learning-based framework for co-saliency detection
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
Zhang et al. Lightweight deep network for traffic sign classification
Zhao et al. Recurrent attention model for pedestrian attribute recognition
CN111639544B (en) Expression recognition method based on multi-branch cross-connection convolutional neural network
Xue et al. DIOD: Fast and efficient weakly semi-supervised deep complex ISAR object detection
Lin et al. RSCM: Region selection and concurrency model for multi-class weather recognition
CN111783831B (en) Complex image accurate classification method based on multi-source multi-label shared subspace learning
CN111582409A (en) Training method of image label classification network, image label classification method and device
Ji et al. Combining multilevel features for remote sensing image scene classification with attention model
Rad et al. Image annotation using multi-view non-negative matrix factorization with different number of basis vectors
Ajmal et al. Convolutional neural network based image segmentation: a review
Xia et al. Weakly supervised multimodal kernel for categorizing aerial photographs
Sun et al. Scene categorization using deeply learned gaze shifting kernel
CN113868448A (en) Fine-grained scene level sketch-based image retrieval method and system
Zhang et al. Bioinspired scene classification by deep active learning with remote sensing applications
Bouchakwa et al. A review on visual content-based and users’ tags-based image annotation: methods and techniques
CN116935100A (en) Multi-label image classification method based on feature fusion and self-attention mechanism
CN113642602B (en) Multi-label image classification method based on global and local label relation
Kumar et al. Logo detection using weakly supervised saliency map
Nie et al. Multi-label image recognition with attentive transformer-localizer module
Juyal et al. Multilabel image classification using the CNN and DC-CNN model on Pascal VOC 2012 dataset
Sun et al. A novel semantics-preserving hashing for fine-grained image retrieval
Liu et al. Iterative deep neighborhood: a deep learning model which involves both input data points and their neighbors
Oluwasanmi et al. Attentively conditioned generative adversarial network for semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication