CN112784909B - Image classification and identification method based on self-attention mechanism and self-adaptive sub-network


Info

Publication number
CN112784909B
CN112784909B (application CN202110119391.9A)
Authority
CN
China
Prior art keywords
network
layer
neuron
attention
image classification
Prior art date
Legal status
Active
Application number
CN202110119391.9A
Other languages
Chinese (zh)
Other versions
CN112784909A (en)
Inventor
Li Hui (李惠)
Xu Yang (徐阳)
Hu Fangqiao (胡芳侨)
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202110119391.9A
Publication of CN112784909A
Application granted
Publication of CN112784909B

Classifications

    • G06F18/24 Classification techniques (G Physics; G06 Computing; G06F Electric digital data processing; G06F18/00 Pattern recognition; G06F18/20 Analysing)
    • G06N3/04 Architecture, e.g. interconnection topology (G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks)
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image classification and identification method based on a self-attention mechanism and an adaptive sub-network, which comprises the following steps: constructing a neuron calculation model for initial image classification and identification; using a self-attention mechanism to make each layer of neurons of the model focus attention on a region of interest of a preset image, and extracting an attention proportion coefficient of the region of interest; using an adaptive sub-network to make single neurons of the model learn a nonlinear expression capability so as to extract high-level features of the preset image; and controlling the amount of computation in the image classification and identification process by setting the attention proportion coefficient, the number of sub-network layers and the number of sub-network nodes, so that the computational cost is kept in check while a high-accuracy image classification and identification result is obtained. The method solves the difficult problem that a single neuron lacks complex nonlinear expression capability, and can concentrate attention and locally deepen and widen the network in complex application scenarios, thereby improving the expression capability and identification accuracy of the network.

Description

Image classification and identification method based on self-attention mechanism and self-adaptive sub-network
Technical Field
The invention relates to the technical fields of artificial intelligence, neural networks, computer vision and deep learning, in particular to an image classification and identification method based on a self-attention mechanism and an adaptive sub-network.
Background
At present, a neuron calculation model commonly used in the field of artificial intelligence comprises two parts, a linear operation (weight multiplication and bias addition) and a nonlinear activation function. Its mathematical expression is:

x_j^{l+1} = σ( Σ_{i=1}^{n_l} w_{ij}^l · x_i^l + b_j^{l+1} )    (1)

where x_i^l represents the ith neuron of the lth layer, x_j^{l+1} represents the jth neuron of the (l+1)th layer, w_{ij}^l represents the weight coefficient connecting the ith neuron of the lth layer with the jth neuron of the (l+1)th layer, b_j^{l+1} represents the bias coefficient corresponding to the jth neuron of the (l+1)th layer, n_l represents the number of neurons in the lth layer, and σ represents a nonlinear activation function, common forms of which include sigmoid, tanh and ReLU.
Formula (1) is the neuron calculation model in common use today: it takes the node values of the previous layer's neurons as input, multiplies each by the weight coefficient connecting it to the current-layer neuron, accumulates the products, and adds the bias coefficient of the current-layer neuron, which yields the result of the first step, a linear transformation; a nonlinear activation function then transforms this linear result so that the neuron acquires nonlinear expression capability.
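As a concrete illustration, the following is a minimal NumPy sketch of the conventional neuron model of formula (1); the array shapes, the ReLU choice of σ, and the random values are illustrative assumptions, not part of the patent.

```python
import numpy as np

def relu(z):
    # A common choice for the nonlinear activation sigma in formula (1)
    return np.maximum(0.0, z)

def dense_layer(x_prev, W, b, activation=relu):
    """x_prev: (n_l,) previous-layer neuron values; W: (n_l, n_next) weight
    coefficients; b: (n_next,) bias coefficients. Implements formula (1)."""
    z = x_prev @ W + b           # linear step: weighted sum plus bias
    return activation(z)         # nonlinear step: activation function

x_l = np.random.rand(8)                  # 8 neurons in layer l
W = np.random.randn(8, 4) * 0.1          # weights to 4 neurons in layer l+1
b = np.zeros(4)
x_next = dense_layer(x_l, W, b)          # each x_next[j] realizes formula (1)
```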
As shown in fig. 1, on the basis of the single-neuron operation, a plurality of neuron nodes are arranged in each layer of the neural network, and the transmission process from front-layer input to rear-layer output is repeated several times, thus forming a multi-layer neural network and realizing an approximate fit of more complex nonlinear mapping relations.
As shown in fig. 2, deep learning builds on this multi-layer operation mode with far deeper network architectures. Moreover, the convolutional neural network, highly popular in the deep learning field, operates on the same principle as the ordinary neural network and merely uses convolution kernels for its implementation: the weight coefficients in a kernel are first multiplied with the pixels in the receptive field and the bias coefficient is added to complete the linear operation, and a nonlinear activation operation follows, exactly as in formula (1).
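To make the analogy explicit, here is a minimal sketch of a single convolutional "neuron" under the same illustrative assumptions as above (a 3×3 kernel and a ReLU activation; sizes and values are ours, not the patent's).

```python
import numpy as np

def conv_neuron(patch, kernel, bias):
    """Convolutional analogue of formula (1): the kernel's weight coefficients
    multiply the pixels in the receptive field, the bias coefficient is added,
    and a nonlinear activation (ReLU here) is applied."""
    return max(0.0, float(np.sum(patch * kernel) + bias))

patch = np.random.rand(3, 3)             # 3x3 receptive field of pixels
kernel = np.random.randn(3, 3) * 0.1     # 3x3 convolution kernel weights
out = conv_neuron(patch, kernel, bias=0.0)
```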
In summary, the network architectures currently common in the field of artificial intelligence rest on the neuron calculation model of formula (1): inside a single neuron, the inputs are multiplied by weight coefficients and summed, a bias coefficient is added, and nonlinearity is finally introduced through a nonlinear activation function.
However, when facing a practical complex application scenario, the conventional neuron computational model has the following disadvantages:
(1) owing to the limited design of formula (1), no high-order operation can be realized inside a single neuron; only simple linear (multiply-add) and nonlinear (sigmoid, tanh, ReLU, etc.) operations are possible, so the expression capability of a single neuron is limited;
(2) human brain nerve cells release very complex chemical transmitters and electrical signals while working; a single pass of simple linear and nonlinear operations as in formula (1) cannot effectively express this process, i.e., the traditional neuron calculation model cannot truly reflect how human brain nerve cells process data;
(3) at present, deep learning obtains sufficiently strong nonlinear expression capability only through very deep network architectures; one important reason is that each layer of neurons can represent only one nonlinear activation operation, so a large number of layers is needed to construct a complex nonlinear approximation function. However, the nerve cells of the human cerebral cortex are not layered without limit; there are only a few distinct functional areas. The real human brain therefore does not obtain its nonlinear expression capability through the connection of many layers of neurons, which means the single-neuron calculation model itself must possess sufficiently strong nonlinear expression capability, something formula (1) lacks;
(4) for some special problems, such as image classification and identification in blurred, occluded and similar scenes, recognition performance can be improved if neurons adaptively focus attention on particular regions and extract local high-level features of the image for classification and identification; however, the existing neuron model has no mechanism for distinguishing the importance of nodes.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
To this end, an object of the present invention is to provide an image classification and identification method based on a self-attention mechanism and an adaptive sub-network.
In order to achieve the above object, an embodiment of the present invention provides an image classification and identification method based on a self-attention mechanism and an adaptive sub-network, including the following steps: step S1, constructing a neuron calculation model for initial image classification and identification; step S2, using a self-attention mechanism to make each layer of neurons of the initial image-classification-identification neuron calculation model focus attention on a region of interest of a preset image, and extracting an attention proportion coefficient of the region of interest; step S3, using an adaptive sub-network to make single neurons of the initial image-classification-identification neuron calculation model learn a nonlinear expression capability so as to extract high-level features of the preset image, the nonlinear expression capability being controlled by the number of sub-network layers and the number of sub-network nodes; and step S4, controlling the amount of computation in the image classification and identification process by setting the attention proportion coefficient, the number of sub-network layers and the number of sub-network nodes, thereby controlling the computational cost while obtaining a high-accuracy image classification and identification result.
The image classification and identification method based on the self-attention mechanism and the adaptive sub-network adds a self-attention module and an adaptive sub-network module to the traditional neuron calculation process. Its key control parameters include the attention proportion coefficient, the number of sub-network layers and the sub-network node-number proportion coefficient. Only the kernel of the traditional neuron calculation model needs to be modified: the growth of the amount of computation in the image classification and identification process is controlled by setting the attention proportion coefficient, the number of sub-network layers and the sub-network node-number proportion coefficient, the accuracy of the image classification and identification result is guaranteed, and the method can be applied to any neural network architecture and to tasks such as data mining, image classification and structural damage identification.
In addition, the image classification and identification method based on the self-attention mechanism and the adaptive sub-network according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, the step S2 specifically includes:
α^l = G[softmax(X^l), β]

softmax(X^l)_i = exp(x_i^l) / Σ_{m=1}^{n_l} exp(x_m^l)

wherein exp is the exponential operation and softmax performs exponential normalization over all input neurons of the lth layer, yielding a weight vector softmax(X^l) whose values lie in (0,1); G is a gate operation and β is the attention proportion coefficient with value range (0,1), meaning that the largest β·n_l elements of the weight vector are kept and all other elements are set to 0; α^l is the resulting attention coefficient vector of all neurons in the lth layer.
Further, in an embodiment of the present invention, the step S3 specifically includes:
F_j^{l+1}(·; k, h)

h = γ · n_{l+1}

wherein F_j^{l+1}(·; k, h) represents the adaptive sub-network, k is the number of hidden layers in the adaptive sub-network, at least 2, and h is the number of neuron nodes in each layer of the adaptive sub-network, determined by the sub-network node coefficient γ together with the number of nodes n_{l+1} of the (l+1)th layer; γ takes values between 0 and 1.
Further, in an embodiment of the present invention, the width of the adaptive sub-network is proportional to the sub-network node coefficient γ, and the depth of the adaptive sub-network is proportional to the number of hidden layers k in the adaptive sub-network.
Further, in an embodiment of the present invention, the operation structure in the step S4 is:
x_j^{l+1} = σ( F_j^{l+1}( Σ_{i=1}^{n_l} w_{ij}^l · α_i^l · x_i^l ; θ_j^{l+1} ) + b_j^{l+1} )

wherein F_j^{l+1} represents the sub-network implicit in the jth neuron of the (l+1)th layer, α^l is the attention coefficient vector of all neurons in the lth layer, X^l is the vector formed by all neuron values of the lth layer, and θ_j^{l+1} is the set of network parameters corresponding to the internal sub-network F_j^{l+1} of the jth neuron of the (l+1)th layer.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a data flow diagram of a conventional neuron computational model;
FIG. 2 is a schematic diagram of a conventional multi-layer neural network;
FIG. 3 is a flow chart of the image classification and identification method based on the self-attention mechanism and the adaptive sub-network according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the operation of the self-attention module according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the operation of an adaptive subnetwork in accordance with an embodiment of the present invention;
FIG. 6 shows the novel neuron calculation model constructed by the image classification and identification method based on the self-attention mechanism and the adaptive sub-network according to an embodiment of the present invention;
fig. 7 is a schematic diagram illustrating recognition on the handwritten digit classification task in a blurred scene according to the first embodiment of the present invention, in which (a) shows the digit 1 at different blur levels, (b) the digit 5 at different blur levels, and (c) the digit 6 at different blur levels;
FIG. 8 is a schematic diagram illustrating a fusion manner on a U-Net semantic segmentation architecture according to a second embodiment of the present invention;
fig. 9 is a schematic diagram of the identification effect on the steel box girder micro fatigue crack semantic segmentation task according to the second embodiment of the present invention, where (a) is the original image, (b) the ground-truth label, and (c) the prediction result of the novel neuron calculation model;
fig. 10 compares the traditional neuron and the novel neuron model on the XOR operation according to the third embodiment of the present invention, where (a) is the XOR classification problem, (b) the novel neuron model, and (c) the traditional multilayer neural network model.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
An image classification and identification method based on a self-attention mechanism and an adaptive sub-network according to an embodiment of the present invention is described below with reference to the accompanying drawings.
FIG. 3 is a flowchart of the image classification and identification method based on the self-attention mechanism and the adaptive sub-network according to an embodiment of the present invention.
As shown in fig. 3, the image classification and identification method based on the self-attention mechanism and the adaptive sub-network comprises the following steps:
in step S1, a neuron computational model for initial image classification recognition is constructed.
That is, an initial image-classification-identification neuron calculation model is established; it is based on the existing general neuron calculation model:

x_j^{l+1} = σ( Σ_{i=1}^{n_l} w_{ij}^l · x_i^l + b_j^{l+1} )    (1)

where x_i^l represents the ith neuron of the lth layer, x_j^{l+1} represents the jth neuron of the (l+1)th layer, w_{ij}^l represents the weight coefficient connecting the ith neuron of the lth layer with the jth neuron of the (l+1)th layer, b_j^{l+1} represents the bias coefficient corresponding to the jth neuron of the (l+1)th layer, n_l represents the number of neurons in the lth layer, and σ represents the nonlinear activation function.
In step S2, each layer of neurons of the neuron computational model identified by the initial image classification is made to concentrate attention on a region of interest of the preset image by using the self-attention mechanism, and an attention proportion coefficient of the region of interest is extracted.
That is, as shown in FIG. 4, a self-attention module is applied on top of the initial neuron calculation model:

α^l = G[softmax(X^l), β]    (2)

softmax(X^l)_i = exp(x_i^l) / Σ_{m=1}^{n_l} exp(x_m^l)    (3)

wherein exp is the exponential operation and softmax performs exponential normalization over all input neurons of the lth layer, yielding a weight vector softmax(X^l) whose values lie in (0,1), i.e., the attention over the region of interest of the preset image; G is a gate operation and β is the attention proportion coefficient with value range (0,1), meaning that the largest β·n_l elements of the weight vector are kept and all other elements are set to 0; α^l is the resulting attention coefficient vector of all neurons in the lth layer.
Further, according to formulas (2) and (3), the invention can truncate inputs in the neurons of the next layer according to the importance of all neurons of the previous layer, retaining the relatively important nodes and ignoring the less important ones; that is, the region of interest of the preset image is retained and regions of no interest are ignored, thereby saving part of the computation. Meanwhile, the degree to which the computation is reduced can be controlled by the attention proportion coefficient β.
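The following is a minimal sketch of the gate operation of formulas (2) and (3); reading β as the fraction of neurons kept is our interpretation of the patent's "first β elements", and the input values are illustrative.

```python
import numpy as np

def attention_gate(x_l, beta):
    """Formulas (2)-(3): softmax over the layer's neuron values, then the
    gate G keeps only the largest beta fraction of attention weights."""
    e = np.exp(x_l - x_l.max())                 # numerically stable exp
    alpha = e / e.sum()                         # softmax(X^l), values in (0,1)
    k = max(1, int(np.ceil(beta * x_l.size)))   # number of neurons retained
    cutoff = np.sort(alpha)[-k]                 # k-th largest attention weight
    return np.where(alpha >= cutoff, alpha, 0.0)

alpha_l = attention_gate(np.random.rand(8), beta=0.5)  # half the nodes kept
```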
In step S3, the adaptive sub-network is used to make single neurons of the initial image-classification-identification neuron calculation model learn a nonlinear expression capability, controlled by the number of sub-network layers and the number of sub-network nodes, so as to extract the high-level features of the preset image.
Specifically, as shown in fig. 5, an adaptive sub-network is employed on top of the self-attention module:

F_j^{l+1}(·; k, h)    (4)

h = γ · n_{l+1}    (5)

wherein F_j^{l+1}(·; k, h) represents the adaptive sub-network, k is the number of hidden layers in the adaptive sub-network, at least 2, and h is the number of neuron nodes in each layer of the adaptive sub-network, determined by the sub-network node coefficient γ together with the number of nodes n_{l+1} of the (l+1)th layer; γ takes values between 0 and 1.
Obviously, the larger the value of the sub-network node coefficient γ, the wider the adaptive sub-network; the larger the number of hidden layers k, the deeper the sub-network. The adaptive sub-network can therefore widen and deepen the original network within a local range, improving its feature-extraction capability, i.e., its ability to extract the high-level features of the preset image. Meanwhile, there is a trade-off between the network's expression capability and its computational consumption: the stronger the sub-network's expression, the larger the amount of computation the overall network consumes.
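Below is a minimal sketch of an adaptive sub-network per formulas (4) and (5); the tanh hidden activation and the random stand-in weights are illustrative assumptions (in practice these would be the trainable parameters θ).

```python
import numpy as np

def subnetwork(v, k, gamma, n_next, rng=np.random.default_rng(0)):
    """Formulas (4)-(5): k hidden layers (k >= 2), each of width
    h = gamma * n_{l+1}; returns a scalar, the sub-network's output."""
    h = max(1, int(gamma * n_next))              # h = gamma * n_{l+1}
    out = np.atleast_1d(np.asarray(v, dtype=float))
    for _ in range(k):                           # k hidden layers
        W = rng.standard_normal((out.size, h)) * 0.1
        out = np.tanh(out @ W)                   # each layer adds nonlinearity
    w_out = rng.standard_normal(h) * 0.1
    return float(out @ w_out)                    # scalar output of the neuron

y = subnetwork(v=[0.3], k=2, gamma=0.5, n_next=4)
```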
In step S4, the amount of computation in the image classification and identification process is controlled by setting the attention proportion coefficient, the number of sub-network layers and the number of sub-network nodes, thereby controlling the computational cost while obtaining a high-accuracy image classification and identification result.
Specifically, as shown in fig. 6, a new neuron calculation model is obtained after step S3:

x_j^{l+1} = σ( F_j^{l+1}( Σ_{i=1}^{n_l} w_{ij}^l · α_i^l · x_i^l ; θ_j^{l+1} ) + b_j^{l+1} )    (6)

wherein F_j^{l+1} represents the sub-network implicit in the jth neuron of the (l+1)th layer, α^l is the attention coefficient vector of all neurons in the lth layer, X^l is the vector formed by all neuron values of the lth layer, and θ_j^{l+1} is the set of network parameters corresponding to the internal sub-network F_j^{l+1} of the jth neuron of the (l+1)th layer.
The invention only needs to modify the kernel of the initial neuron calculation model (adding the self-attention module and the adaptive sub-network module); that is, by setting the attention proportion coefficient, the number of sub-network layers and the sub-network node-number proportion coefficient, it controls the growth of the amount of computation in the image classification and identification process, obtains image classification and identification results of higher accuracy, and can be applied to any neural network architecture.
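Putting the pieces together, here is a minimal sketch of one neuron of formula (6), reusing the attention_gate and subnetwork functions from the sketches above; the ReLU choice of σ and all shapes are illustrative assumptions.

```python
import numpy as np

def new_neuron(x_l, w_j, b_j, beta, k, gamma, n_next):
    """One neuron of formula (6): attention-gated inputs are linearly
    combined, passed through the internal sub-network, biased, activated."""
    alpha_l = attention_gate(x_l, beta)       # formulas (2)-(3)
    v = np.dot(w_j, alpha_l * x_l)            # attention-weighted linear step
    s = subnetwork([v], k, gamma, n_next)     # formulas (4)-(5)
    return max(0.0, s + b_j)                  # sigma(s + b), ReLU assumed

x_next_j = new_neuron(np.random.rand(8), np.random.randn(8) * 0.1,
                      b_j=0.0, beta=0.5, k=2, gamma=0.5, n_next=4)
```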
The image classification and identification method based on the self-attention mechanism and the adaptive sub-network is further described below with reference to three specific examples.
First embodiment, the method is applied to recognition of the handwritten digit classification task in the complex scene
Firstly, a neuron calculation model for initial image classification and identification is constructed. Then the self-attention module makes each layer of neurons of that model focus its analysis on the locally important region of the image, while the attention proportion coefficient of that region is extracted. Next, the adaptive sub-network lets single neurons of the model learn a nonlinear expression capability, so that the internal sub-network can extract high-level features of the image. Finally, the amount of computation is controlled via the attention proportion coefficient and the nonlinear expression capability, which improves the classification accuracy of the image classification result and gives good performance for image recognition in complex scenes.
Specifically, consider first the blurred scene. As shown in fig. 7, classifying the MNIST handwritten digit images 0-9 under blur improves the classification accuracy by 4.76% compared with the traditional model (from 82.02% to 86.78%).
Next, consider the occlusion scene. As shown in table 1 below, classifying the MNIST handwritten digit images 0-9 under occlusion, with 1, 3 and 5 random occlusion regions at a 50% occlusion rate, improves the classification accuracy over the traditional model by 0.63%, 1.58% and 1.45% respectively, an average improvement of 1.22% (from 68.41% to 69.63%).
TABLE 1 Recognition results of the novel neuron model on the handwritten digit classification task in an occluded scene
[Table 1 appears as an image in the original document; its values are summarized in the preceding paragraph.]
Second embodiment, the method is applied to semantic segmentation task of the micro fatigue cracks of the steel box girder
As shown in FIG. 8, the method can be fused into the U-Net semantic segmentation network, replacing the original convolution and nonlinear activation operations with the calculation process of formula (6), and the fused network is verified on a steel box girder micro fatigue crack data set.
As shown in FIG. 9, the invention can accurately identify the fine crack pixels in fatigue crack images of a steel box girder despite complex background interference. The quantitative evaluation index is the overlap ratio between the predicted fatigue-crack pixel region and the ground-truth region; compared with the original U-Net, fusing the novel neuron calculation model increases this ratio from 0.315 to 0.342, a gain of 0.027, i.e., a relative improvement of 8.5%.
Further, beyond image classification and recognition, the method can be applied to other problems, as follows:
third embodiment, the method is applied to the task of classifying the XOR problem
As shown in fig. 10, the complex XOR operation can be achieved with a single neuron of the present invention; for the traditional neuron (the initial neuron calculation model), at least a two-layer neural network is required to achieve this function.
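As an illustration of why one such neuron suffices, here is a minimal sketch in which a single neuron's internal sub-network (one hidden layer of two ReLU nodes with hand-set weights, our illustrative choice) realizes XOR, something no single conventional linear-plus-activation neuron can do, since XOR is not linearly separable.

```python
def xor_neuron(a, b):
    """A single neuron whose internal sub-network computes XOR:
    hidden h1 = ReLU(a+b), h2 = ReLU(a+b-1), output = h1 - 2*h2."""
    s = a + b
    h1 = max(0.0, s)          # ReLU(a + b)
    h2 = max(0.0, s - 1.0)    # ReLU(a + b - 1)
    return h1 - 2.0 * h2      # yields 0, 1, 1, 0 on the four Boolean inputs

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_neuron(a, b))   # prints 0, 1, 1, 0
```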
To sum up, the image classification and identification method based on the self-attention mechanism and the adaptive sub-network provided by the embodiments of the invention fundamentally solves the problem that the current single-neuron calculation model lacks complex nonlinear expression capability, and also has the following advantages:
(1) it breaks through the limitation of the traditional single-neuron calculation model, in which linear and nonlinear operations are combined simply and only once, and realizes high-order nonlinear operation inside the neuron through the sub-network module, improving the expression capability of the single neuron; the order can be controlled by the sub-network parameters;
(2) it better simulates the calculation process by which human brain nerve cells process data: high-level features of the input can be extracted inside a single neuron, node importance can be analyzed through the self-attention mechanism, computing power is concentrated on the nodes that need emphasis, and unimportant nodes are abandoned to reduce the amount of computation;
(3) the growth of the amount of computation is controlled by setting parameters such as the attention proportion coefficient, the number of sub-network layers and the sub-network node-number proportion coefficient, while recognition accuracy is maintained, thereby reducing the computational cost;
(4) it can be extended to any neural network architecture; only the kernel of the traditional neuron calculation model needs to be modified, which makes application very convenient, and fusion into any neural network, including but not limited to multilayer perceptrons and convolutional neural networks, is achieved by replacing the underlying code of the original neuron calculation model with the code of the proposed model;
(5) the complex XOR operation can be realized with only one novel neuron, whereas the traditional neuron requires at least a two-layer neural network to realize this function;
(6) for the image classification problem of blurred scenes, a 4.76% improvement in classification accuracy is obtained on the MNIST data set compared with the traditional neuron calculation model;
(7) for the image classification problem of occluded scenes, an average improvement of 1.22% in classification accuracy is obtained on the MNIST data set, over three working conditions with the same occlusion rate but different numbers of occlusion regions, compared with the traditional neuron calculation model;
(8) the method can also be applied to the field of structural damage recognition: adding the novel neuron calculation model to a U-Net semantic segmentation framework yields an overlap-ratio gain of 0.027 (2.7 percentage points) on the steel box girder micro fatigue crack data set.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (3)

1. An image classification and identification method based on a self-attention mechanism and an adaptive sub-network, characterized by comprising the following steps:
step S1, constructing a neuron calculation model for initial image classification and identification;
step S2, using a self-attention mechanism to make each layer of neurons of the initial image-classification-identification neuron calculation model focus attention on a region of interest of a preset image, and extracting an attention proportion coefficient of the region of interest, wherein the step S2 is specifically:
α^l = G[softmax(X^l), β]

softmax(X^l)_i = exp(x_i^l) / Σ_{m=1}^{n_l} exp(x_m^l)

wherein exp is the exponential operation and softmax performs exponential normalization over all input neurons of the lth layer, yielding a weight vector softmax(X^l) whose values lie in (0,1); G is a gate operation and β is the attention proportion coefficient with value range (0,1), meaning that the largest β·n_l elements of the weight vector are selected and all other elements are set to 0; α^l is the resulting attention coefficient vector of all neurons in the lth layer; X^l is the vector formed by all neuron values of the lth layer; n_l is the number of all neurons in the lth layer;
step S3, using an adaptive sub-network to make a single neuron of the initial image-classification-identification neuron calculation model learn a nonlinear expression capability so as to extract high-level features of the preset image, wherein the nonlinear expression capability is controlled by the number of sub-network layers and the number of sub-network nodes, and the step S3 is specifically:
F_j^{l+1}(·; k, h)

h = γ · n_{l+1}

wherein F_j^{l+1} represents the adaptive sub-network corresponding to the jth neuron of the (l+1)th layer, k is the number of hidden layers in the adaptive sub-network, at least 2; h is the number of neuron nodes in each layer of the adaptive sub-network, determined by the sub-network node coefficient γ together with the number of nodes n_{l+1} of the (l+1)th layer; γ takes values between 0 and 1; θ_j^{l+1} is the set of network parameters corresponding to the internal sub-network F_j^{l+1} of the jth neuron of the (l+1)th layer;
and step S4, controlling the amount of computation in the image classification and identification process by setting the attention proportion coefficient, the number of sub-network layers and the number of sub-network nodes, thereby controlling the computational cost while obtaining a high-accuracy image classification and identification result.
2. The method according to claim 1, wherein the width of the adaptive sub-network is proportional to the node coefficient γ of the adaptive sub-network, and the depth of the adaptive sub-network is proportional to the number of hidden layers k in the adaptive sub-network.
3. The image classification and identification method based on the self-attention mechanism and the adaptive sub-network according to claim 1, wherein the operation architecture in the step S4 is as follows:
x_j^{l+1} = σ( F_j^{l+1}( Σ_{i=1}^{n_l} w_{ij}^l · α_i^l · x_i^l ; θ_j^{l+1} ) + b_j^{l+1} )

wherein x_j^{l+1} is the jth neuron of the (l+1)th layer, σ is the nonlinear activation function, n_l is the number of all neurons in the lth layer, w_{ij}^l is the weight coefficient connecting the ith neuron of the lth layer with the jth neuron of the (l+1)th layer, x_i^l is the ith neuron of the lth layer, b_j^{l+1} is the bias coefficient corresponding to the jth neuron of the (l+1)th layer, F_j^{l+1} represents the adaptive sub-network corresponding to the jth neuron of the (l+1)th layer, α^l is the attention coefficient vector of all neurons in the lth layer, X^l is the vector formed by all neuron values of the lth layer, and θ_j^{l+1} is the set of network parameters corresponding to the internal sub-network F_j^{l+1} of the jth neuron of the (l+1)th layer.
CN202110119391.9A, filed 2021-01-28 (priority date 2021-01-28): Image classification and identification method based on self-attention mechanism and self-adaptive sub-network. Granted as CN112784909B (Active).

Priority Applications (1)

Application Number: CN202110119391.9A; Priority/Filing Date: 2021-01-28; Title: Image classification and identification method based on self-attention mechanism and self-adaptive sub-network
Publications (2)

CN112784909A, published 2021-05-11
CN112784909B, published 2021-09-28

Family ID: 75759480 (one family application: CN202110119391.9A, Active, granted as CN112784909B)
Country: CN (China)



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant