CN114648635A - Multi-label image classification method fusing strong correlation among labels - Google Patents


Info

Publication number
CN114648635A
Authority
CN
China
Prior art keywords
label
matrix
node
layer
occurrence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210250180.3A
Other languages
Chinese (zh)
Other versions
CN114648635B (en)
Inventor
张辉宜
夏媛龙
黄俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Technology AHUT
Original Assignee
Anhui University of Technology AHUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Technology AHUT filed Critical Anhui University of Technology AHUT
Priority to CN202210250180.3A priority Critical patent/CN114648635B/en
Publication of CN114648635A publication Critical patent/CN114648635A/en
Application granted granted Critical
Publication of CN114648635B publication Critical patent/CN114648635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (G PHYSICS; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F18/00 Pattern recognition; G06F18/20 Analysing; G06F18/24 Classification techniques)
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/25 Fusion techniques
    • G06N3/045 Combinations of networks (G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-label image classification method fusing strong correlation among labels, which comprises the following steps: clustering the labels in the data set into M communities and splitting the traditional label co-occurrence matrix into an M-slice label co-occurrence tensor; sending the picture to be trained into a general convolutional neural network and applying M-fold generic pooling after the last level to obtain multiple generic feature maps; sending the label co-occurrence tensor and the label embedding matrix into a multi-graph convolutional neural network, and fusing the M label representation matrices into one label representation matrix with an attention fusion mechanism after the last multi-graph convolutional layer; fusing the sub-label semantic relations into the intermediate levels of the convolutional neural network; integrating community coding information into the label representation matrix and performing label-level multiplication with the multiple generic feature maps; and constructing a global objective function. The invention learns both the strong correlations inside label communities and the correlations among labels and fuses them into the feature maps, thereby improving the performance of the multi-label classification task.

Description

Multi-label image classification method fusing strong correlation among labels
Technical Field
The invention relates to the technical field of multi-label image classification, in particular to a multi-label image classification method fusing strong correlation among labels.
Background
Image classification is an important topic in the field of machine learning and the core of computer vision. The traditional approach extracts image features with a convolutional neural network, for example AlexNet, VGG or ResNet, and achieves high accuracy in single-label image classification. In multi-label image recognition, however, label correlation is difficult for such networks to represent, so in recent years researchers have tried to incorporate the correlation between labels into convolutional neural networks by different methods.
To deal with multi-label image classification, previous researchers often adopted recurrent neural networks and graph convolutional neural networks to model the correlation between labels. However, since recurrent neural networks are designed for serialized data, they struggle to express the complex internal co-occurrence relations when modeling label correlation. Graph convolutional neural networks, by contrast, have strong modeling capacity for non-Euclidean data (such as graph data), so the co-occurrence relations between labels can be learned well from a pre-constructed label co-occurrence graph. Researchers typically build a label node representation matrix with a word embedding model, construct a label co-occurrence adjacency matrix from the co-occurrence relations in the data set, feed both into a graph convolutional network to learn the correlations among labels, and finally fuse the learned label correlations into the last-level feature map of the convolutional neural network.
Conventional methods that construct label correlation with a graph convolutional network and fuse it into the last layer of the convolutional neural network, such as ML-GCN, have two disadvantages. First, only the last convolutional layer is fused with label correlation, while convolutional neural networks, especially ResNet-family models, stack very many convolutional layers, so the fusion of label correlation is insufficient. Second, conventional label correlation adjacency matrices simply establish co-occurrence relations between labels over the whole data set, ignoring the strong relations inside label groups. For example, {"person", "cup", "bowl", "table", "umbrella", "chair", "car"} may all be somewhat related, but {"cup", "bowl", "table", "chair"} and {"person", "umbrella", "car"} have stronger internal correlations.
The existing MGTN model fuses the strong connections inside labels into the convolutional neural network with the Graph Transformer method: it divides the label nodes into different communities with a community detection algorithm and fuses the label nodes of different communities into different feature maps using Multiple CNNs (a set of several convolutional neural networks). However, Multiple CNNs require a very large amount of computation to learn even a small number of communities, so the number of communities is difficult to scale up.
Disclosure of Invention
1. Technical problem to be solved by the invention
In view of the defects of the prior art, the invention provides a multi-label image classification method fusing strong correlation among labels; according to the invention, the traditional label co-occurrence adjacency matrix is converted into a plurality of sub-label co-occurrence adjacency matrixes, and the strong correlation inside the sub-labels is learned through the multi-graph convolutional neural network, so that the co-occurrence information of the sub-labels is more fully fused into the convolutional neural network, and the image classification performance of the convolutional neural network is improved.
2. Technical scheme
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
the invention discloses a multi-label image classification method fusing strong correlation among labels, which comprises the following steps:
s1, constructing a label co-occurrence matrix by using the label co-occurrence relation in the data set; dividing the label nodes into M communities by using a community detection algorithm; setting a threshold value to divide the label co-occurrence matrix into M co-occurrence subgraphs;
s2, obtaining an image file of training data, sending the image file into a convolutional neural network to extract image features to obtain a feature map, converting the feature map into M two-dimensional generic feature maps through multiple generic pooling operations, and splicing the two-dimensional generic feature maps in parallel;
s3, acquiring a label embedding matrix, acquiring a label co-occurrence tensor, sending the M label co-occurrence tensors and the label embedding matrix into a multi-graph convolutional neural network to obtain an M-fold label node representation tensor, and fusing the M-fold label node representation tensors into a label node representation matrix by utilizing a label level attention fusion mechanism;
s4, fusing an M-ply label node expression tensor generated by the multi-ply graph convolutional layer into a feature graph output by a middle level of the general convolutional neural network by using an attention mechanism;
s5, building a community coding M matrix according to M communities divided by the label nodes by a community detection algorithm, multiplying the M matrix by a label node expression matrix fused by a fusion mechanism to obtain a label node expression matrix with community coding, multiplying the matrix by a multiple generic feature graph spliced together in the column direction, and using the obtained prediction label for classification;
and S6, constructing a loss function, wherein the loss function comprises a multi-label classification loss function and an objective function for learning a parameter matrix in the multiple generic pooling layer.
3. Advantageous effects
Compared with the prior art, the technical scheme provided by the invention has the following remarkable effects:
(1) the traditional multi-label deep learning algorithm usually uses a convolutional neural network to learn the correlations between labels but ignores the strong connections inside sub-label groups; the multi-label image classification method fusing strong correlation among labels converts the traditional label co-occurrence adjacency matrix into several sub-label co-occurrence adjacency matrices and learns the strong correlations inside the sub-labels through the multi-graph convolutional neural network, which is more conducive to improving multi-label image classification performance.
(2) In the multi-label image classification method fusing strong correlation among labels, the label-level attention fusion mechanism can learn the correlations between label nodes, not only the correlations between subgraphs as in the Graph Transformer algorithm. The subgraph label node representations are fused with an attention mechanism, so the co-occurrence information of the sub-labels is fused more fully into the convolutional neural network, improving its image classification performance.
Drawings
FIG. 1 is a flow chart of a multi-label image classification method fusing strong correlation between labels according to the present invention;
FIG. 2 is a schematic diagram of Resnet-101 extracting image features according to the present invention;
FIG. 3 is a schematic diagram of generic pooling of the present invention;
FIG. 4 is a schematic diagram of the multiple generic pooling of the present invention;
FIG. 5 is a schematic diagram of the fusion of multiple convolutional layers and a sub-graph tag representation matrix according to the present invention;
FIG. 6 is a schematic diagram of an intermediate layer of the present invention for fusing sub-graph label correlations to a convolutional neural network;
FIG. 7 is an internal structure diagram of the M matrix according to the present invention;
FIG. 8 is a schematic diagram of the tag level multiplication of the present invention.
Detailed Description
For a further understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings and examples.
Example 1
The multi-label image classification method fusing strong correlation among labels of this embodiment learns both the relations between labels and the strong relations inside label groups so as to better perform the multi-label classification task, and comprises the following steps:
s1, constructing a label co-occurrence matrix by using the label co-occurrence relation in the data set; dividing the label nodes into M communities by utilizing a community detection algorithm; setting a threshold value to divide the label co-occurrence matrix into M co-occurrence subgraphs, namely a label co-occurrence tensor; the method specifically comprises the following steps:
(1-1) Obtain the image files of the training data and the labels they carry, and establish a co-occurrence adjacency matrix $A \in \mathbb{R}^{C\times C}$ for the labels, where $A_{ij}=1$ denotes that when label $L_i$ appears, label $L_j$ also appears with a certain probability, and $A_{ij}=0$ otherwise. Using the basic conditional probability formula and the co-occurrence count matrix $Z$, the conditional probability matrix $P \in \mathbb{R}^{C\times C}$ is constructed, where $P_{ij}=P(L_j\mid L_i)$ denotes the probability that label $L_j$ appears when label $L_i$ appears. A threshold $\tau$ is set to binarize the $P$ matrix and construct the label co-occurrence matrix $A \in \mathbb{R}^{C\times C}$ as follows:

$$A_{ij}=\begin{cases}1, & P_{ij}\ge\tau\\ 0, & P_{ij}<\tau\end{cases}$$
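For illustration, a minimal NumPy sketch of this construction (the toy `labels` indicator matrix, the variable names, and the threshold value are assumptions for illustration, not values from the patent):

```python
import numpy as np

# labels: (N_images, C) binary indicator matrix; a hypothetical toy input
labels = np.array([[1, 1, 0],
                   [1, 0, 1],
                   [1, 1, 1]])

Z = labels.T @ labels                  # Z_ij: co-occurrence count of labels i and j
np.fill_diagonal(Z, 0)                 # ignore self co-occurrence
occur = labels.sum(axis=0)             # number of images containing each label
P = Z / np.maximum(occur[:, None], 1)  # conditional probability P_ij = P(L_j | L_i)

tau = 0.4                              # binarization threshold (assumed value)
A = (P >= tau).astype(np.float32)      # binarized label co-occurrence matrix
```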
(1-2) Obtain the label co-occurrence adjacency matrix and divide it into M communities and M subgraphs (the label sub-relation adjacency tensor). The label nodes are divided into M communities with a community detection algorithm. First, the modularity of the adjacency matrix $A$ is computed:

$$Q=\frac{1}{2m}\sum_{i,j}\left[A_{ij}-\frac{d_i d_j}{2m}\right]\delta(c_i,c_j)$$

where $m=\frac{1}{2}\sum_{i,j}A_{ij}$, $d_i=\sum_k A_{i,k}$ is the degree of the $i$-th node, and $\delta(c_i,c_j)$ detects whether nodes $i$ and $j$ are in the same community: if so, $\delta(c_i,c_j)=1$; otherwise $\delta(c_i,c_j)=0$.

When the community detection algorithm starts, each node forms its own community; to maximize the modularity, other label nodes are merged into the communities. When the modularity no longer changes, the M communities have been detected.
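A sketch of the community division using greedy modularity maximization from NetworkX; the patent does not name a specific community detection algorithm, so this particular choice is an assumption:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Build a label graph from the binarized co-occurrence matrix A of step (1-1)
G_label = nx.from_numpy_array(A)

# Greedily merge nodes into communities until the modularity Q stops improving
communities = list(greedy_modularity_communities(G_label))
M = len(communities)

# S[i] = community number p in {1, ..., M} of label node i (used later in step S5)
S = {node: p + 1 for p, comm in enumerate(communities) for node in comm}
```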
(1-3) The label sub-relation adjacency tensor is constructed by setting thresholds, as follows: set a threshold vector $T=[t_1,\dots,t_M]$ on the conditional probability matrix $P$, where $t_i\in[0,1]$, and construct the label sub-relation adjacency tensor $\mathcal{A}\in\mathbb{R}^{M\times C\times C}$ as follows:

$$\mathcal{A}^k_{ij}=\begin{cases}1, & P_{ij}\ge t_k\\ 0, & P_{ij}<t_k\end{cases}$$

where $\mathcal{A}$ is composed of the label co-occurrence adjacency matrices $\mathcal{A}^k\in\mathbb{R}^{C\times C}$, $k\in\{1,\dots,M\}$, and $\mathcal{A}^k$ represents the $k$-th subgraph, in which label $L_j$ also appears with a certain probability when label $L_i$ appears.

In this embodiment, the number of communities is equal to the number of subgraphs.
S2, obtaining an image file of training data, sending the image file into a convolutional neural network to extract image features to obtain a feature map, converting the feature map into M two-dimensional generic feature maps through multiple generic pooling operations, and splicing the two-dimensional generic feature maps in parallel; the method specifically comprises the following steps:
(2-1) Obtain the image files of the training data, crop the images to a uniform size (3 × 448 × 448) and input them into the Resnet-101 neural network. In the network, the feature map passes through a 7 × 7 convolutional layer, a pooling layer, and four intermediate levels: layer1, layer2, layer3 and layer4. The feature map output by the four intermediate levels is $F^s\in\mathbb{R}^{C_s\times W_s\times H_s}$, where $s$ is the level number with value range {1, 2, 3, 4}, and $W_s$, $H_s$ and $C_s$ are respectively the width, height and channel number of the output feature map of the $s$-th level. Specifically, the feature map dimensions of the four level outputs are $F^1\in\mathbb{R}^{256\times112\times112}$, $F^2\in\mathbb{R}^{512\times56\times56}$, $F^3\in\mathbb{R}^{1024\times28\times28}$ and $F^4\in\mathbb{R}^{2048\times14\times14}$. A multiple generic pooling layer is then attached to obtain the multiple generic feature map, as shown in FIG. 2.
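A sketch of extracting the four intermediate-level feature maps from a torchvision Resnet-101 with forward hooks (one possible extraction method; the patent does not prescribe it):

```python
import torch
from torchvision.models import resnet101

backbone = resnet101(weights=None)
feats = {}

def save(name):
    def hook(module, inputs, output):
        feats[name] = output
    return hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(backbone, name).register_forward_hook(save(name))

x = torch.randn(1, 3, 448, 448)  # image cropped to the uniform 3 x 448 x 448 size
backbone(x)
# feats["layer1"]: (1, 256, 112, 112), ..., feats["layer4"]: (1, 2048, 14, 14)
```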
(2-2) Obtain the last-level feature map $F^4\in\mathbb{R}^{2048\times14\times14}$ and denote it $X_{conv}\in\mathbb{R}^{D_c\times H\times W}$, where $H$, $W$ and $D_c$ are the height, width and channel number of the feature map output by the last level of Resnet-101. The generic pooling method for extracting picture features is:

$$X_{lsp}=W_{lsp}\times R(X_{conv})$$

The dimension transformation operation $R(\cdot)$ maps $X_{conv}$ from a three-dimensional tensor (dimension $D_c\times H\times W$) to a two-dimensional matrix (dimension $HW\times D_c$), that is, $R(X_{conv})\in\mathbb{R}^{HW\times D_c}$; left-multiplying by the parameter matrix $W_{lsp}\in\mathbb{R}^{C\times HW}$ yields the pooled matrix $X_{lsp}\in\mathbb{R}^{C\times D_c}$, where $C$ is the number of label categories. The generic pooling method is shown in FIG. 3.
(2-3) Finally, the generic pooling operation is performed M times in the same way, specifically:

$$X^i_{lsp}=W^i_{lsp}\times R(X_{conv}),\quad i\in\{1,\dots,M\}$$

where $i$ is a positive integer from 1 to M and M is the number of subgraphs (or communities). The M generic feature maps $X^i_{lsp}\in\mathbb{R}^{C\times D_c}$ are spliced in the column direction to obtain $X_{mullsp}\in\mathbb{R}^{C\times MD_c}$. The multiple generic pooling layer is shown in FIG. 4.
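A minimal PyTorch module sketching the multiple generic pooling layer; the class name and initialization scale are hypothetical:

```python
import torch
import torch.nn as nn

class MultipleGenericPooling(nn.Module):
    """M learnable C x HW matrices, each pooling the reshaped feature map."""
    def __init__(self, num_labels, hw, num_subgraphs):
        super().__init__()
        self.W_lsp = nn.Parameter(torch.randn(num_subgraphs, num_labels, hw) * 0.01)

    def forward(self, x_conv):            # x_conv: (Dc, H, W), e.g. (2048, 14, 14)
        dc = x_conv.shape[0]
        r = x_conv.reshape(dc, -1).T      # R(X_conv): (HW, Dc)
        pooled = self.W_lsp @ r           # (M, C, Dc): the M generic feature maps
        c = pooled.shape[1]
        return pooled.permute(1, 0, 2).reshape(c, -1)  # column splice: (C, M*Dc)
```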
(2-4) Finally, the parameter matrices $W^i_{lsp}$ are learned with the help of the subgraph label adjacency matrices, where $i$ is a positive integer from 1 to M and M is the number of subgraphs. The guiding objective $L_{lsp}(\theta)$, given in step S6, applies an $l_2$ norm to keep each parameter matrix $W^i_{lsp}$ sparse; learning the parameter matrices $W^i_{lsp}$ with the label co-occurrence tensor $\mathcal{A}^i$ accelerates their learning and fuses the different subgraphs differentially so that the generic feature maps learn the label semantic relations. Through multiple generic pooling, generic feature maps $X^i_{lsp}$ containing different features are obtained.
S3, acquiring a label embedding matrix, acquiring a label co-occurrence tensor, sending the M label co-occurrence tensors and the label embedding matrix into a multi-graph convolutional neural network to obtain an M-fold label node representation tensor, and fusing the M-fold label node representation tensors into a label node representation matrix by utilizing a label level attention fusion mechanism; the method specifically comprises the following steps:
(3-1) Send the subgraph label relation adjacency tensor $\mathcal{A}$ and the label embedding matrix $E\in\mathbb{R}^{C\times D}$ into the multi-graph convolutional neural network as follows:

$$H^{l+1}_k=h\left(\hat{\mathcal{A}}^k H^l_k W^l\right)$$

The first-layer input of the multi-graph convolutional layer is the label embedding matrix $E\in\mathbb{R}^{C\times D}$ and the normalized multiple adjacency subgraphs $\hat{\mathcal{A}}^k$; the subgraphs are sent into different graph convolutional layers, and after the multiple graph convolutional layers the label representation tensor $H^l\in\mathbb{R}^{M\times C\times L_l}$ is obtained, where $L_l$ is the node dimension of the label representation tensor input to the $l$-th graph convolutional layer, $l\ge2$, and $h(\cdot)$ is a nonlinear activation function. The formula above is the expression of the multi-graph convolutional layer beyond the first layer, where $W^l$ is the parameter matrix of the graph convolutional layer and the node representation dimension of each layer is determined by $W^l$. Similarly, the first layer is expressed as $H^2_k=h(\hat{\mathcal{A}}^k E W^1)$.
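A sketch of one multi-graph convolution layer: the standard GCN propagation applied per normalized subgraph. Sharing the layer weight $W^l$ across subgraphs and using LeakyReLU as $h(\cdot)$ are assumptions here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 applied to each subgraph."""
    A_hat = A + torch.eye(A.shape[-1])
    d = A_hat.sum(-1)                          # degrees, >= 1 due to self-loops
    d_inv_sqrt = torch.diag_embed(d.pow(-0.5))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

class MultiGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(in_dim, out_dim) * 0.01)

    def forward(self, A_tensor, H):  # A_tensor: (M, C, C), H: (M, C, L_in)
        A_hat = normalize_adj(torch.as_tensor(A_tensor, dtype=torch.float32))
        return F.leaky_relu(A_hat @ H @ self.W)  # H^{l+1}: (M, C, out_dim)
```

For the first layer, H is the label embedding matrix E expanded along the subgraph axis, e.g. `E.unsqueeze(0).expand(M, -1, -1)`.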
(3-2) At this point the label-level attention fusion mechanism fuses the M-fold label node representation tensor $H=[G^1,\dots,G^M]$ into a single label node representation matrix $G\in\mathbb{R}^{C\times L}$, where $G^k\in\mathbb{R}^{C\times L}$ denotes the $k$-th node representation matrix and $k$ is an integer between 1 and M.

First, the importance of each subgraph's node representation matrix is learned, as follows:

$$w_k=\frac{1}{C}\sum_{i=1}^{C}q\cdot\tanh\left(W\cdot\left(g^k_i\right)^T+b\right)$$

where $W$ is a parameter vector $W\in\mathbb{R}^{1\times C}$, $b$ is an offset vector $b\in\mathbb{R}^{1\times C}$, $\tanh(\cdot)$ is an activation function, and $q$ is a parameter vector $q\in\mathbb{R}^{1\times C}$ whose purpose is to convert the vector generated by $\tanh(\cdot)$ into a scalar, so that both sides of the equation are scalar values. $g^k_i$ is the label node representation of the $i$-th node in the $k$-th representation matrix, $w_k$ represents the importance weight of the label embedding matrix and the $k$-th subgraph when learned through the graph convolutional neural network, and $i$ is an integer between 1 and C.

After computing the node weight $w_k$ on each subgraph, normalization is done through the softmax function:

$$\beta_k=\text{softmax}(w_k)=\frac{\exp(w_k)}{\sum_{j=1}^{M}\exp(w_j)}$$

This yields the importance scores $(\beta_1,\dots,\beta_M)$ of the nodes in each subgraph. The importance scores are multiplied by the corresponding label node representations, summed, and then multiplied by the parameter matrix $\tilde{W}$ for dimension transformation, giving the fused label node representation matrix $G$:

$$G=\left[\sum_{k=1}^{M}\beta_k G^k\right]\tilde{W}$$

where $G\in\mathbb{R}^{C\times L}$, $C$ is the number of label nodes, and the label node representation dimension $L$ is determined by the parameter matrix $\tilde{W}$. The multi-graph convolutional layer and label-level attention fusion network structure is shown in FIG. 5.
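A sketch of the label-level attention fusion following the formulas above; the parameter shapes are taken from the text as stated and treated as assumptions:

```python
import torch
import torch.nn as nn

class LabelAttentionFusion(nn.Module):
    def __init__(self, num_labels, node_dim, out_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_labels, node_dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(num_labels))
        self.q = nn.Parameter(torch.randn(num_labels) * 0.01)
        self.W_out = nn.Parameter(torch.randn(node_dim, out_dim) * 0.01)  # W~

    def forward(self, G_tensor):                           # G_tensor: (M, C, L)
        # w_k = (1/C) sum_i q . tanh(W (g_i^k)^T + b), one scalar per subgraph
        scores = torch.tanh(G_tensor @ self.W.T + self.b)  # (M, C, C)
        w = (scores @ self.q).mean(dim=1)                  # (M,)
        beta = torch.softmax(w, dim=0)                     # importance scores
        fused = (beta[:, None, None] * G_tensor).sum(0)    # (C, L)
        return fused @ self.W_out                          # G: (C, out_dim)
```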
S4, fusing the M-fold label node representation tensor generated by the multi-graph convolutional layers into the feature map output by an intermediate level of the general convolutional neural network by using an attention mechanism; comprising the following steps:
(4-1) Fuse $H^l\in\mathbb{R}^{M\times C\times L_l}$ into the feature map output by an intermediate level of Resnet-101 with an attention mechanism, as follows:

The feature maps output by the first three levels of Resnet-101 are $F^s\in\mathbb{R}^{C_s\times W_s\times H_s}$, where $1\le s\le3$ and $C_s$, $W_s$ and $H_s$ are the channel number, width and height of the $s$-th intermediate level. $F^s$ is denoted $X_l\in\mathbb{R}^{C_l\times W\times H}$, where $C_l=C_s$, $W=W_s$, $H=H_s$, and $l$ indicates that $F^s$ is fused with the label node correlations output by the $l$-th graph convolutional layer.

Since $L_l$ in $H^l\in\mathbb{R}^{M\times C\times L_l}$ is determined by the parameter matrix $W^{l-1}$, one can let $L_l=C_l=P$, i.e., $X_l\in\mathbb{R}^{P\times W\times H}$ and $H^l\in\mathbb{R}^{M\times C\times P}$. Using the dimension transformation operation $R(\cdot)$, $X_l$ and $H^l$ are transformed into $X_R\in\mathbb{R}^{WH\times P}$ and $H_R\in\mathbb{R}^{MC\times P}$. Taking $X_R\in\mathbb{R}^{WH\times P}$ as key and value and $H_R\in\mathbb{R}^{MC\times P}$ as query, they are fed into the Transformer model decoder, and the Multi-Head Attention mechanism in the Transformer decoder fuses the label correlations as follows:

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

$$\text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O$$

$$\text{head}_i=\text{Attention}\left(QW^Q_i,KW^K_i,VW^V_i\right)$$

This embodiment uses the standard Transformer structure; the difference from the conventional Transformer is that the inputs come from different sources. Here, key and value in the MultiHead structure come from the feature maps output by the intermediate levels of the convolutional neural network Resnet-101, and query comes from the label node representation tensor output by the multi-graph convolutional network. The correlation between the image features and the sub-labels is learned through query and key: $QK^T$ computes the degree of correlation between the Q and K matrices, the softmax function normalizes it into correlation scores, and multiplying the scores into value fuses the label correlations into the feature map. The multi-head attention mechanism is consistent with the conventional Transformer, i.e. $W^Q_i,W^K_i,W^V_i\in\mathbb{R}^{P\times d_k}$ and $W^O\in\mathbb{R}^{hd_k\times P}$, with $h=8$ and $P=hd_k$.

The obtained feature $X_R$ fused with the label correlations undergoes the inverse of the dimension transformation $R(\cdot)$, converting $X_R\in\mathbb{R}^{WH\times P}$ into $X_M\in\mathbb{R}^{P\times W\times H}$. The newly obtained $X_M\in\mathbb{R}^{P\times W\times H}$ and the original feature map $X_l\in\mathbb{R}^{P\times W\times H}$ are sent to the Add & Norm layer of the Transformer decoder for normalization and residual connection, and the feature map $X_M\in\mathbb{R}^{P\times W\times H}$ fused with the label correlations is then fed into the FFN layer of the Transformer decoder:

$$X_F=\text{FFN}(X_M)$$

where $\text{FFN}(x)=\max(0,xW_1+b_1)W_2+b_2$, consisting of two linear transformations with a ReLU activation in between.

The FFN layer in the conventional Transformer consists of two linear layers with a ReLU activation in the middle; the difference in this embodiment is that the parameters $W_1$ and $W_2$ are three-dimensional tensors, where $d_{ff}$ is the output dimension of the first linear layer.

The newly obtained $X_F\in\mathbb{R}^{P\times W\times H}$ and the feature map $X_M\in\mathbb{R}^{P\times W\times H}$ already fused with the label correlations are sent to the Add & Norm layer of the Transformer decoder for normalization and residual connection. The final feature map $X_l\in\mathbb{R}^{P\times W\times H}$ fused with the label correlations is sent to the next layer of the convolutional neural network for training; the network structure in which the convolutional neural network attention fuses the subgraph label correlations is shown in FIG. 6.
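A sketch of this decoder-style fusion with PyTorch's standard multi-head attention. The patent assigns query to the label tensor and key/value to the feature map; to keep the output aligned with the $WH$ feature-map tokens for the residual connection, this sketch lets the feature-map tokens attend to the label representations, so the exact wiring should be treated as an interpretive assumption (as should $d_{ff}=4P$ and the values of M and C):

```python
import torch
import torch.nn as nn

M, C, P = 3, 80, 256                   # subgraphs, labels, shared dim (assumed)
mha = nn.MultiheadAttention(embed_dim=P, num_heads=8, batch_first=True)
norm1, norm2 = nn.LayerNorm(P), nn.LayerNorm(P)
ffn = nn.Sequential(nn.Linear(P, 4 * P), nn.ReLU(), nn.Linear(4 * P, P))

X_l = torch.randn(1, P, 112, 112)      # intermediate feature map (layer1-sized)
H_l = torch.randn(1, M * C, P)         # label representation tensor, flattened

X_R = X_l.flatten(2).transpose(1, 2)   # R(X_l): (1, WH, P) feature tokens
attn, _ = mha(query=X_R, key=H_l, value=H_l)         # fuse label correlations
X_M = norm1(X_R + attn)                # Add & Norm: residual + normalization
X_F = norm2(X_M + ffn(X_M))            # FFN, then Add & Norm again
X_out = X_F.transpose(1, 2).reshape(1, P, 112, 112)  # back to (P, W, H)
```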
S5, constructing a community coding M matrix according to the M communities into which the community detection algorithm divides the label nodes, multiplying (Hadamard product) the M matrix by the label node representation matrix produced by the fusion mechanism to obtain a label node representation matrix with community coding, multiplying (label-level multiplication) this matrix by the multiple generic feature maps spliced together in the column direction, and using the obtained predicted labels for classification; the method comprises the following steps:
(5-1) Divide the label nodes in the data set into M communities with the community detection method, keeping M consistent with the number of subgraphs. Since the community detection algorithm assigns every label node to a community, each label node has a community it belongs to; the communities are numbered $p$ in the order $\{1,\dots,M\}$, where M is the number of communities, and the community number of each label node is stored in $S\in\mathbb{Z}^C$, where $C$ is the number of label categories and $S_i=p$, $i\in\{1,\dots,C\}$.

The multiple label generic feature map obtained through the multiple generic pooling layer is $X_{mullsp}\in\mathbb{R}^{C\times MD_c}$, which can be expressed as the combination of generic feature maps $X_{mullsp}=[X^1_{lsp},\dots,X^M_{lsp}]$, with $i$ a positive integer between 1 and M and $X^i_{lsp}\in\mathbb{R}^{C\times D_c}$. The maps $X^i_{lsp}$ are used to distinguish and match the different communities, and the community code $p$ determines which generic label map the $p$-th community matches.

The community coding M matrix is constructed as follows: for the $i$-th label, $S_i=p$ means the community it belongs to is $p$, and the node needs to match the generic feature map $X^p_{lsp}$ belonging to community $p$; the positions in the M matrix corresponding to that feature map are therefore set to $\tau$, and the remaining positions are set to the complementary value $1-\tau$, where $\tau\in[0,1]$. In this way the M matrix achieves the purpose of differentially fusing different label nodes with different generic features. The internal structure of the community coding M matrix is shown in FIG. 7.
(5-2) Take the Hadamard product of the community coding M matrix and the label node representation matrix $G\in\mathbb{R}^{C\times L}$, and then perform label-level multiplication with $X_{mullsp}$, so that the model fuses different generic features with the label nodes of different communities, as follows:

Since $L$ in $G\in\mathbb{R}^{C\times L}$ is determined by the parameter matrix $\tilde{W}$, $L$ can be made equal to $MD_c$. The community coding matrix $M\in\mathbb{R}^{C\times MD_c}$ and the label node representation $G\in\mathbb{R}^{C\times MD_c}$ are multiplied element-wise (Hadamard product), so that the label nodes are fused with the community coding information, yielding the matrix $W\in\mathbb{R}^{C\times MD_c}$:

$$W=M\odot G$$
(5-3) As shown in FIG. 8, $W\in\mathbb{R}^{C\times MD_c}$ and $X_{mullsp}\in\mathbb{R}^{C\times MD_c}$ are multiplied at the label level for the final classification. $W$ can be written as $W=[w_1,\dots,w_C]^T$ and $X_{mullsp}$ as $X_{mullsp}=[x_1,\dots,x_C]$, where $C$ is the number of label categories; the prediction function is:

$$\hat{y}_i=w_i^T x_i,\quad i\in\{1,\dots,C\}$$
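A sketch of the Hadamard product and the label-level multiplication, producing one prediction score per label (`Mmat`, `G` and `X_mullsp` come from the previous steps):

```python
import torch

# Mmat: (C, M*Dc) community coding; G: (C, M*Dc) fused label representations;
# X_mullsp: (C, M*Dc) multiple generic feature map
W = Mmat * G                       # Hadamard product: inject community coding
y_hat = (W * X_mullsp).sum(dim=1)  # label-level multiplication: y_hat_i = w_i^T x_i
```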
and S6, constructing a loss function, wherein the loss function comprises a multi-label classification loss function and an objective function for learning a parameter matrix in the multiple generic pooling layer. The method comprises the following steps:
(6-1) The objective function is defined as:

$$\min_\theta\;L(\hat{y},y)+L_{lsp}(\theta)$$

where $L(\cdot)$ is the loss function of the multi-label classification, $L_{lsp}$ is used to accelerate the learning of the parameters $W^i_{lsp}$ and to distinguish the different subgraphs, and $\theta$ is the global parameter to be learned.
(6-2) A binary classification loss is used as the multi-label classification loss, as follows:

$$L(\hat{y},y)=-\sum_{c=1}^{C}\left[y^c\log\left(\sigma(\hat{y}^c)\right)+(1-y^c)\log\left(1-\sigma(\hat{y}^c)\right)\right]$$

where $\sigma(\cdot)$ is the sigmoid function. The objective function $L_{lsp}(\theta)$ that guides the generic features imposes an $l_2$-norm term on each parameter matrix $W^i_{lsp}$, guided by the corresponding subgraph $\mathcal{A}^i$ and summed over $i=1,\dots,M$.
(6-3) The overall loss function is therefore the sum of the two terms above:

$$\mathcal{L}(\theta)=L(\hat{y},y)+L_{lsp}(\theta)$$

where the parameters $W^i_{lsp}$ are part of the parameter $\theta$.

Through the above steps, label semantic correlations of different strengths are fused into the convolutional neural network to obtain the predicted labels $\hat{y}$; the loss between the true labels and the predicted labels is minimized and optimized with SGD, achieving the goal of improving multi-label image classification performance.
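A sketch of the overall optimization with binary cross-entropy and SGD. Since the exact closed form of $L_{lsp}$ is not recoverable from the text, the regularization term below is a stand-in $l_2$ penalty, and `model`, `loader` and the `pooling.W_lsp` attribute are hypothetical:

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()        # multi-label binary classification loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for images, targets in loader:            # targets: (B, C) multi-hot labels
    logits = model(images)                # predicted label scores (B, C)
    l_lsp = sum(w.norm(p=2) for w in model.pooling.W_lsp)  # stand-in l2 term
    loss = criterion(logits, targets.float()) + l_lsp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```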
In this embodiment, a ResNet-family network is used as the base network to extract image features, and the traditional average or maximum pooling layer after the last level is replaced with a multiple generic pooling layer that transforms the feature map dimensions. The parameters are learned under the label sub-relation adjacency tensor to obtain generic feature maps with different features. This replaces the way MGTN matches different communities with Multiple CNNs, so the amount of computation is greatly reduced, more communities can be learned, and the parameters learned under the different sub-label adjacency matrices distinguish the label features of different subgraphs.

In addition, this embodiment trains the sub-label co-occurrence adjacency tensor and the label node representation matrix with a multi-graph convolutional neural network, which differs from the method in MGTN: this embodiment can learn the semantic relations of the label nodes under the different sub-label co-occurrence adjacency matrices, whereas the Graph Transformer method only learns the correlations between the different sub-label co-occurrence adjacency matrices, i.e., the correlations between meta-paths.

This embodiment uses an attention mechanism, the Transformer decoder structure, to fuse the label correlations into the intermediate levels of the convolutional neural network. The Multi-Head Attention mechanism takes the feature maps output by the convolutional neural network as key and value and the label node representation tensor output by the multi-graph convolutional network as query, so that the sub-label correlations are fused into the intermediate-level feature maps of the convolutional neural network, improving the degree of fusion between the label correlations and the convolutional neural network.

In summary, this embodiment can learn both the correlations between labels and the correlations inside label groups and fuse them more fully into the convolutional neural network; the method scales more easily in the number of communities and subgraphs, which is more conducive to improving multi-label image classification performance.
It is worth noting that the number of multi-graph convolutional layers is not limited to three. If the multi-graph convolutional network has three layers, they must correspond one-to-one with the first three intermediate levels of the convolutional neural network, and the sub-label correlations are fused into those intermediate levels; if there are fewer than three layers, the sub-label correlations are selectively fused into some of the first three intermediate levels; the number of multi-graph convolutional layers, however, cannot exceed three.
If the number of multi-graph convolutional layers is three, the label representation tensors output by the multi-graph convolutional layers must correspond one-to-one with the feature maps output by the first three intermediate levels of the convolutional neural network, and the Transformer decoder fuses the sub-label correlations into the intermediate levels. The dimensions of the feature maps output by the first three intermediate levels are $F^1\in\mathbb{R}^{256\times112\times112}$, $F^2\in\mathbb{R}^{512\times56\times56}$ and $F^3\in\mathbb{R}^{1024\times28\times28}$, and the label representation tensors output by the multi-graph convolutional layers have the matching dimensions $\mathbb{R}^{M\times C\times256}$, $\mathbb{R}^{M\times C\times512}$ and $\mathbb{R}^{M\times C\times1024}$, respectively. If the number of multi-graph convolutional layers is less than three, a similar method keeps the feature map and label representation tensor dimensions consistent.
Compared with matching different communities by the Multiple CNNs method, this method can expand the number of subgraphs or communities at a much smaller computational cost and learns the label correlations and the strong correlations inside the sub-labels more easily, thereby improving multi-label image classification performance.
The invention and its embodiments have been described above schematically, and the description is not limiting; what is shown in the drawings is only one embodiment of the invention, and the actual structure is not limited to it. Therefore, similar structures and embodiments designed by a person skilled in the art in light of this teaching without inventive effort and without departing from the spirit of the invention shall fall within the protection scope of the invention.

Claims (7)

1. A multi-label image classification method fusing strong correlation among labels is characterized by comprising the following steps:
s1, constructing a label co-occurrence matrix by using the label co-occurrence relation in the data set; dividing the label nodes into M communities by using a community detection algorithm; setting a threshold value to divide the label co-occurrence matrix into M co-occurrence subgraphs;
s2, obtaining an image file of training data, sending the image file into a convolutional neural network to extract image features to obtain a feature map, converting the feature map into M two-dimensional generic feature maps through multiple generic pooling operations, and splicing the two-dimensional generic feature maps in parallel;
s3, acquiring a label embedding matrix, acquiring a label co-occurrence tensor, sending the M label co-occurrence tensors and the label embedding matrix into a multi-graph convolutional neural network to obtain an M-fold label node representation tensor, and fusing the M-fold label node representation tensors into a label node representation matrix by utilizing a label level attention fusion mechanism;
s4, fusing the M-ply label node expression tensor generated by the multi-ply convolutional layer into a feature graph output by the middle layer of the general convolutional neural network by using an attention mechanism;
s5, constructing a community coding M matrix according to M communities divided by the label nodes by a community detection algorithm, multiplying the M matrix by a label node expression matrix fused by a fusion mechanism to obtain a label node expression matrix with community coding, multiplying the matrix by multiple generic feature graphs spliced together in the column direction, and using the obtained prediction label for classification;
and S6, constructing a loss function, wherein the loss function comprises a multi-label classification loss function and an objective function for learning a parameter matrix in the multiple generic pooling layer.
2. The multi-label image classification method fusing strong correlation between labels as claimed in claim 1, wherein step S1 specifically includes the following steps:
(1-1) obtaining the image files of the training data and the labels in the training data, and establishing a co-occurrence relation adjacency matrix $A\in\mathbb{R}^{C\times C}$ for the labels, wherein $A_{ij}=1$ denotes that when label $L_i$ appears, label $L_j$ may also appear, and $A_{ij}=0$ otherwise, as follows:

firstly, counting the co-occurrence relations of the labels in the data set and constructing the matrix $Z\in\mathbb{R}^{C\times C}$, wherein $Z_{ij}$ indicates the number of times labels $L_i$ and $L_j$ appear together in the data set; constructing from the Z matrix and the basic conditional probability formula the conditional probability matrix $P\in\mathbb{R}^{C\times C}$, $P_{ij}=P(L_j\mid L_i)$, denoting the probability that label $L_j$ appears when label $L_i$ appears; and setting a threshold $\tau$ to binarize the P matrix and construct the label co-occurrence matrix $A\in\mathbb{R}^{C\times C}$ as follows:

$$A_{ij}=\begin{cases}1,&P_{ij}\ge\tau\\0,&P_{ij}<\tau\end{cases}$$
(1-2) dividing the label nodes into M communities by using a community detection algorithm, wherein the modularity of the adjacency matrix A is defined as follows:

$$Q=\frac{1}{2m}\sum_{i,j}\left[A_{ij}-\frac{d_i d_j}{2m}\right]\delta(c_i,c_j)$$

wherein $m=\frac{1}{2}\sum_{i,j}A_{ij}$, $d_i=\sum_k A_{i,k}$ is the degree of the $i$-th node, and $\delta(c_i,c_j)$ detects whether nodes $i$ and $j$ are in the same community: if so, $\delta(c_i,c_j)=1$; otherwise $\delta(c_i,c_j)=0$;

each node initially forms its own community, and, to maximize the modularity, other label nodes are merged into the communities;

when the modularity no longer changes, the M communities have been detected, i.e., all the label nodes in the data set are divided into M communities;
(1-3) setting thresholds to construct the label co-occurrence tensor as follows: setting a threshold vector $T=[t_1,\dots,t_M]$ on the conditional probability matrix P, wherein $t_i\in[0,1]$, and constructing the label co-occurrence tensor $\mathcal{A}\in\mathbb{R}^{M\times C\times C}$ as follows:

$$\mathcal{A}^k_{ij}=\begin{cases}1,&P_{ij}\ge t_k\\0,&P_{ij}<t_k\end{cases}$$

wherein $\mathcal{A}$ is composed of the label co-occurrence matrices $\mathcal{A}^k\in\mathbb{R}^{C\times C}$, $k\in\{1,\dots,M\}$, and $\mathcal{A}^k$ represents the $k$-th subgraph, in which label $L_j$ may also appear when label $L_i$ appears.
3. The multi-label image classification method fusing the strong correlation between labels as claimed in claim 2, wherein: step S2 specifically includes the following steps:
(2-1) acquiring the image files of the training data and inputting them into the Resnet-101 convolutional neural network; obtaining the feature map output by the last level of the Resnet-101 feature extraction and denoting it $X_{conv}\in\mathbb{R}^{D_c\times H\times W}$, wherein $H$, $W$ and $D_c$ are respectively the height, width and channel number of the feature map;

(2-2) changing the dimensions of the original feature map $X_{conv}$:

$$X_{lsp}=W_{lsp}\times R(X_{conv})$$

transforming $X_{conv}$ from a three-dimensional tensor into a two-dimensional matrix through the dimension transformation operation $R(\cdot)$, and left-multiplying by the parameter matrix $W_{lsp}\in\mathbb{R}^{C\times HW}$, wherein $C$ is the number of label categories, to obtain the dimension-transformed generic feature map $X_{lsp}\in\mathbb{R}^{C\times D_c}$;

(2-3) repeating the same method M times, specifically as follows:

$$X^i_{lsp}=W^i_{lsp}\times R(X_{conv}),\quad i\in\{1,\dots,M\}$$

wherein $i$ is a positive integer between 1 and M and M is the number of label co-occurrence subgraphs (or communities); splicing the M generic feature maps $X^i_{lsp}\in\mathbb{R}^{C\times D_c}$ in the column direction to obtain the multiple generic feature map $X_{mullsp}\in\mathbb{R}^{C\times MD_c}$;

(2-4) learning the parameter matrices $W^i_{lsp}$ by using the label co-occurrence relations after splitting the subgraphs, wherein $\mathcal{A}^i$ is the $i$-th subgraph in the label co-occurrence tensor.
4. The multi-label image classification method fusing strong correlation between labels as claimed in claim 3, wherein step S3 specifically comprises the following steps:
(3-1) sending the label co-occurrence tensor $\mathcal{A}$ and the label embedding matrix $E\in\mathbb{R}^{C\times D}$ into the multi-graph convolutional neural network as follows:

$$H^{l+1}_k=h\left(\hat{\mathcal{A}}^k H^l_k W^l\right)$$

the first-layer input of the multi-graph convolutional layer is the label embedding matrix $E\in\mathbb{R}^{C\times D}$ and the normalized multiple adjacency subgraphs $\hat{\mathcal{A}}^k$; the subgraphs are sent into different graph convolutional layers, and after the multiple graph convolutional layers the label representation tensor $H^l\in\mathbb{R}^{M\times C\times L_l}$ is obtained, wherein $L_l$ is the node dimension of the label representation tensor input to the $l$-th graph convolutional layer, $l\ge2$, and $h(\cdot)$ is a nonlinear activation function; the above formula is the expression of the multi-graph convolutional layer beyond the first layer, wherein $W^l$ is the parameter matrix of the graph convolutional layer and the node representation dimension of each layer is determined by $W^l$; similarly, the first layer is expressed as $H^2_k=h(\hat{\mathcal{A}}^k E W^1)$;

(3-2) fusing, by the label-level attention fusion mechanism, the label node representation tensor $H=[G^1,\dots,G^M]$ learned by the multi-graph convolutional layers into a single label node representation matrix $G\in\mathbb{R}^{C\times L}$, wherein $G^k\in\mathbb{R}^{C\times L}$ denotes the $k$-th node representation matrix and $k$ is an integer between 1 and M; the importance of the label node representation matrix learned on each subgraph is:

$$w_k=\frac{1}{C}\sum_{i=1}^{C}q\cdot\tanh\left(W\cdot\left(g^k_i\right)^T+b\right)$$

wherein $W$ is a parameter vector $W\in\mathbb{R}^{1\times C}$, $b$ is an offset vector $b\in\mathbb{R}^{1\times C}$, $\tanh(\cdot)$ is an activation function, $q$ is a parameter vector $q\in\mathbb{R}^{1\times C}$, $g^k_i$ is the label node representation of the $i$-th node in the $k$-th representation matrix, $w_k$ represents the importance weight of the label embedding matrix and the $k$-th subgraph when learned through the graph convolutional neural network, and $i$ is an integer between 1 and C;

after computing the node weight $w_k$ on each subgraph, normalizing through the softmax function to calculate the importance score of the nodes on the subgraph:

$$\beta_k=\text{softmax}(w_k)=\frac{\exp(w_k)}{\sum_{j=1}^{M}\exp(w_j)}$$

thereby obtaining the importance scores $(\beta_1,\dots,\beta_M)$ of the nodes on each subgraph, multiplying the importance scores by the corresponding label node representation matrices respectively, summing, and then multiplying by the parameter matrix $\tilde{W}$ for dimension transformation to obtain the fused label node representation matrix G:

$$G=\left[\sum_{k=1}^{M}\beta_k G^k\right]\tilde{W}$$

wherein $G\in\mathbb{R}^{C\times L}$, $C$ is the number of label nodes, and the label node representation dimension $L$ is determined by the parameter matrix $\tilde{W}$.
5. The multi-label image classification method fusing strong correlation between labels as claimed in claim 4, wherein step S4 specifically comprises the following steps:
(4-1) the multi-graph convolutional neural network outputs the multiple label node representation tensor $H^l\in\mathbb{R}^{M\times C\times L_l}$, wherein M is the number of subgraphs, C is the number of label categories, and $L_l$ is the label node representation dimension output by the $l$-th graph convolutional layer; $H^l$ is fused with the feature map output by an intermediate level of Resnet-101 through the attention mechanism as follows:

the feature maps output by the first three intermediate levels of Resnet-101 are $F^s\in\mathbb{R}^{C_s\times W_s\times H_s}$, $1\le s\le3$, wherein $C_s$, $W_s$ and $H_s$ are the channel number, width and height of the $s$-th intermediate level; $F^s$ is denoted $X_l\in\mathbb{R}^{C_l\times W\times H}$, wherein $C_l=C_s$, $W=W_s$, $H=H_s$, and $l$ indicates that $F^s$ is fused with the label node correlations output by the $l$-th graph convolutional layer;

since $L_l$ in $H^l\in\mathbb{R}^{M\times C\times L_l}$ is determined by the parameter matrix $W^{l-1}$, $L_l=C_l=P$ can be made to hold, i.e., $X_l\in\mathbb{R}^{P\times W\times H}$ and $H^l\in\mathbb{R}^{M\times C\times P}$; $X_l$ and $H^l$ are transformed through the dimension transformation operation $R(\cdot)$ into $X_R\in\mathbb{R}^{WH\times P}$ and $H_R\in\mathbb{R}^{MC\times P}$, which are fed into the decoder of the Transformer model, and the label correlations are fused using the Multi-Head Attention mechanism in the Transformer decoder as follows:

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

wherein $\text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O$,

$$\text{head}_i=\text{Attention}\left(QW^Q_i,KW^K_i,VW^V_i\right)$$

wherein $W^Q_i,W^K_i,W^V_i\in\mathbb{R}^{P\times d_k}$ and $W^O\in\mathbb{R}^{hd_k\times P}$;

the obtained feature $X_R$ fused with the label correlations undergoes the inverse of the dimension transformation $R(\cdot)$, converting $X_R\in\mathbb{R}^{WH\times P}$ into $X_M\in\mathbb{R}^{P\times W\times H}$; the newly obtained $X_M\in\mathbb{R}^{P\times W\times H}$ and the original feature map $X_l\in\mathbb{R}^{P\times W\times H}$ are sent to the Add & Norm layer of the Transformer decoder for normalization and residual connection, and the feature map $X_M\in\mathbb{R}^{P\times W\times H}$ fused with the label correlations is then fed into the FFN layer of the Transformer decoder as follows:

$$X_F=\text{FFN}(X_M)$$

wherein $\text{FFN}(x)=\max(0,xW_1+b_1)W_2+b_2$, consisting of two linear transformations with a ReLU activation in between;

the newly obtained $X_F\in\mathbb{R}^{P\times W\times H}$ and the feature map $X_M\in\mathbb{R}^{P\times W\times H}$ fused with the label correlations are sent to the Add & Norm layer of the Transformer decoder for normalization and residual connection; the final feature map $X_l\in\mathbb{R}^{P\times W\times H}$ fused with the label correlations is sent to the next layer of the convolutional neural network for training.
6. The method for classifying multi-label images fused with strong correlation between labels as claimed in claim 5, wherein step S5 comprises the following steps:
(5-1) dividing the labels in the data set into M communities by using the community detection algorithm so that each label node has its own community, numbering the communities $p$ in the order $\{1,\dots,M\}$, wherein M is the number of communities, and storing the community number of each label node in $S\in\mathbb{Z}^C$, wherein C is the number of label categories and $S_i=p$, $i\in\{1,\dots,C\}$;

obtaining the multiple label generic feature map $X_{mullsp}\in\mathbb{R}^{C\times MD_c}$, which can be expressed as the combination of generic feature maps $X_{mullsp}=[X^1_{lsp},\dots,X^M_{lsp}]$; using $X^i_{lsp}$ to distinguish and match the different communities, determining from the community code $p$ which generic label map the $p$-th community matches, and constructing the community coding M matrix, wherein the positions corresponding to the matched generic feature map $X^p_{lsp}$ are set to $\tau$ and the remaining positions are set to the complementary value $1-\tau$, wherein $\tau\in[0,1]$;

(5-2) taking the Hadamard product of the community coding M matrix and the label node representation matrix $G\in\mathbb{R}^{C\times L}$ and performing label-level multiplication with the multiple generic feature map $X_{mullsp}$, so that the model fuses different generic features with the label nodes of different communities, as follows:

since $L$ in $G\in\mathbb{R}^{C\times L}$ is determined by the parameter matrix $\tilde{W}$, $L$ can be made equal to $MD_c$; the community coding matrix $M\in\mathbb{R}^{C\times MD_c}$ and the label node representation $G\in\mathbb{R}^{C\times MD_c}$ are multiplied element-wise, so that the label nodes are fused with the community coding information, obtaining the matrix $W\in\mathbb{R}^{C\times MD_c}$ as follows:

$$W=M\odot G$$

(5-3) performing label-level multiplication of $W\in\mathbb{R}^{C\times MD_c}$ and $X_{mullsp}\in\mathbb{R}^{C\times MD_c}$ and using the obtained predicted labels for the final classification, wherein $W$ can be expressed as $W=[w_1,\dots,w_C]^T$ and the multiple generic feature map as $X_{mullsp}=[x_1,\dots,x_C]$, wherein C is the number of label categories, and the prediction function is:

$$\hat{y}_i=w_i^T x_i,\quad i\in\{1,\dots,C\}$$
7. the method for classifying multi-label images fused with strong correlation between labels as claimed in claim 6, wherein step S6 comprises the following steps:
the objective function is defined as:

$$\min_\theta\;L(\hat{y},y)+L_{lsp}(\theta)$$

wherein $L(\cdot)$ is the loss function of the multi-label classification, $L_{lsp}(\theta)$ is the objective function for learning the parameter matrices of the multiple generic pooling layer, and $\theta$ is the global parameter to be learned;

a binary classification loss is used as the multi-label classification loss as follows:

$$L(\hat{y},y)=-\sum_{c=1}^{C}\left[y^c\log\left(\sigma(\hat{y}^c)\right)+(1-y^c)\log\left(1-\sigma(\hat{y}^c)\right)\right]$$

the objective function $L_{lsp}(\theta)$ for guiding the generic features imposes an $l_2$-norm term on each parameter matrix $W^i_{lsp}$, guided by the corresponding subgraph $\mathcal{A}^i$ and summed over $i=1,\dots,M$;

therefore, the overall loss function is the sum of the two terms above, wherein the parameters $W^i_{lsp}$ are part of the parameter $\theta$;

through the above steps, the label semantic correlations of different strengths are fused into the convolutional neural network to obtain the predicted labels $\hat{y}$, the loss function is used to minimize the difference between the true labels and the predicted labels, and SGD is adopted for optimization, achieving the purpose of improving multi-label image classification performance.
CN202210250180.3A 2022-03-15 2022-03-15 Multi-label image classification method fusing strong correlation among labels Active CN114648635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210250180.3A CN114648635B (en) 2022-03-15 2022-03-15 Multi-label image classification method fusing strong correlation among labels


Publications (2)

Publication Number Publication Date
CN114648635A true CN114648635A (en) 2022-06-21
CN114648635B CN114648635B (en) 2024-07-09

Family

ID=81993189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210250180.3A Active CN114648635B (en) 2022-03-15 2022-03-15 Multi-label image classification method fusing strong correlation among labels

Country Status (1)

Country Link
CN (1) CN114648635B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019100724A1 (en) * 2017-11-24 2019-05-31 华为技术有限公司 Method and device for training multi-label classification model
WO2019100723A1 (en) * 2017-11-24 2019-05-31 华为技术有限公司 Method and device for training multi-label classification model
US20200210773A1 (en) * 2019-01-02 2020-07-02 Boe Technology Group Co., Ltd. Neural network for image multi-label identification, related method, medium and device
CN111476315A (en) * 2020-04-27 2020-07-31 中国科学院合肥物质科学研究院 Image multi-label identification method based on statistical correlation and graph convolution technology
CN112308115A (en) * 2020-09-25 2021-02-02 安徽工业大学 Multi-label image deep learning classification method and equipment
CN112906720A (en) * 2021-03-19 2021-06-04 河北工业大学 Multi-label image identification method based on graph attention network
CN113657425A (en) * 2021-06-28 2021-11-16 华南师范大学 Multi-label image classification method based on multi-scale and cross-modal attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张辉宜 et al., "Multi-label image classification model based on graph attention network" (基于图注意力网络的多标签图像分类模型), Journal of Chongqing Technology and Business University, vol. 39, no. 1, 28 February 2022 (2022-02-28), pages 34-41 *
陈科峻; 张叶, "Multi-label aerial image classification with recurrent neural networks" (循环神经网络多标签航空图像分类), Optics and Precision Engineering, no. 06, 9 June 2020 (2020-06-09) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115964626A (en) * 2022-10-27 2023-04-14 河南大学 Community detection method based on dynamic multi-scale feature fusion network
CN117893839A (en) * 2024-03-15 2024-04-16 华东交通大学 Multi-label classification method and system based on graph attention mechanism
CN117893839B (en) * 2024-03-15 2024-06-07 华东交通大学 Multi-label classification method and system based on graph attention mechanism

Also Published As

Publication number Publication date
CN114648635B (en) 2024-07-09

Similar Documents

Publication Publication Date Title
CN110222140B (en) Cross-modal retrieval method based on counterstudy and asymmetric hash
CN108960073B (en) Cross-modal image mode identification method for biomedical literature
CN112085012B (en) Project name and category identification method and device
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
CN112380435A (en) Literature recommendation method and recommendation system based on heterogeneous graph neural network
CN114648635A (en) Multi-label image classification method fusing strong correlation among labels
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN112199520A (en) Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN110210534B (en) Multi-packet fusion-based high-resolution remote sensing image scene multi-label classification method
CN114398491A (en) Semantic segmentation image entity relation reasoning method based on knowledge graph
CN114386534A (en) Image augmentation model training method and image classification method based on variational self-encoder and countermeasure generation network
CN115145551A (en) Intelligent auxiliary system for machine learning application low-code development
CN113095314B (en) Formula identification method, device, storage medium and equipment
Menaga et al. Deep learning: a recent computing platform for multimedia information retrieval
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN114863407A (en) Multi-task cold start target detection method based on visual language depth fusion
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN114328934A (en) Attention mechanism-based multi-label text classification method and system
CN113240033B (en) Visual relation detection method and device based on scene graph high-order semantic structure
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN117371523A (en) Education knowledge graph construction method and system based on man-machine hybrid enhancement
CN117807232A (en) Commodity classification method, commodity classification model construction method and device
CN115329210A (en) False news detection method based on interactive graph layered pooling
CN114943990A (en) Continuous sign language recognition method and device based on ResNet34 network-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant