CN112580777A - Attention mechanism-based deep neural network plug-in and image identification method

Info

Publication number
CN112580777A
CN112580777A
Authority
CN
China
Prior art keywords
matrix
feature
lstm
layer
cnn
Prior art date
Legal status
Pending
Application number
CN202011256575.1A
Other languages
Chinese (zh)
Inventor
李海良
刘敏
郭焕
庄师强
张明
Current Assignee
Jinan University
Original Assignee
Jinan University
Priority date: 2020-11-11
Filing date: 2020-11-11
Publication date: 2021-03-30
Application filed by Jinan University
Priority to CN202011256575.1A
Publication of CN112580777A
Legal status: Pending

Classifications

    • G06N3/045 Combinations of networks
    • G06F18/24 Classification techniques
    • G06N3/048 Activation functions
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

The invention relates to a deep neural network plug-in based on an attention mechanism and an image identification method. The plug-in consists of two LSTM layers of the same size and a CNN with a multilayer structure. One LSTM layer memorizes context information and generates a mask image with salient features; the other LSTM layer implements a "glimpse" function and generates a classification confidence. The multilayer CNN performs down-sampling, extracts image features and transmits context information to the LSTM unit. By using the multilayer CNN and applying an attention mechanism to guide it to focus on the key features of the target object in an image while removing secondary features and background pixels, the plug-in achieves high recognition capability: the target object is recognized step by step through multiple propagations, the main features of the object are gradually remembered while its secondary features are forgotten, and the repeated feedback gives the network an error-correction function.

Description

Attention mechanism-based deep neural network plug-in and image identification method
Technical Field
The invention belongs to the technical field of artificial intelligence and image recognition, and particularly relates to a deep neural network plug-in based on an attention mechanism and an image recognition method.
Background
Currently, the deep neural networks used for image recognition are mainly convolutional neural networks (CNN); however, the applicant has found that the existing CNN suffers from the following defects in the image identification process:
1. objects in an image usually occupy only a part, often a small part, of the entire space, and in many cases the image contains a large number of background pixels, many of which are irrelevant to, or even interfere with, the identification target; for CNN, however, all pixels in the image carry equal weight;
2. during recognition, the CNN propagates forward only once; therefore, if the perturbations in an adversarial sample take effect during this single pass, the CNN is likely to misidentify the image.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a deep neural network plug-in with high recognition capability and an error-correction function, which uses a multilayer CNN and applies an attention mechanism to guide the multilayer CNN to focus on the key features of a target object in an image while removing secondary features and background pixels, as well as an image recognition method using the attention-mechanism-based deep neural network plug-in.
In order to solve the technical problems, the invention adopts the following technical scheme:
a deep neural network plug-in based on an attention mechanism is composed of two LSTM layers of the same size and a CNN with a multilayer structure; wherein:
one LSTM layer is used for memorizing context information and generating a mask image with salient features, and the other LSTM layer is used for realizing a "glimpse" function and generating a classification confidence;
the CNN of the multi-layer structure is used for down-sampling, extracting image features and transmitting context information to the LSTM unit.
Further, the multilayer CNN includes four convolutional layers: the first convolutional layer generates a feature vector and transmits it to the LSTM unit; the second convolutional layer performs secondary feature extraction on the dot-multiplied data to obtain a feature matrix containing attention information; and the fourth convolutional layer outputs a one-dimensional vector whose elements are spliced from a multi-dimensional feature map.
An image recognition method adopting the attention-mechanism-based deep neural network plug-in: the multilayer CNN performs down-sampling, extracts image features and transmits context information to one LSTM layer; that LSTM layer then screens, forgets and memorizes the received image features to generate a mask image with salient features; the multilayer CNN then performs secondary feature extraction to obtain a feature matrix containing attention information, and an attention mechanism with context-association capability is formed through cyclic transfer. The method specifically comprises the following steps:
a1. initializing the LSTM: an average value is calculated from the input feature vectors to initialize the hidden-layer states h_0 and storage states c_0 of the two LSTM layers;
a2. generating a mask image: a feature vector x generated by the multilayer CNN is transmitted to the LSTM unit of one LSTM layer; that layer computes on the feature vector x to obtain an analysed and filtered feature matrix, which is dot-multiplied with the original image to generate a feature picture with a mask, displaying the key features retained by the feature matrix and hiding the features at other positions, thereby completing the task of high-resolution information reconstruction; the result is then passed on to the next convolutional layer;
a3. generating a mask image with salient features: the multilayer CNN first re-extracts a feature map of the mask image to obtain a feature matrix containing attention information; the feature matrix is then transferred to the output gate of one LSTM layer and the output feature vector is addressed; the feature matrix serves on the one hand as feedback input, and on the other hand the weights of the two LSTM layers and the multilayer CNN are adjusted through back-propagation; finally, after multiple iterations, the information of the hidden state changes continuously while the information of the storage state is retained, so that the key features of the target object in the mask image are kept, non-key or irrelevant features are hidden, and a mask image with salient features is generated.
Further, the initial hidden-layer state h_0 and storage state c_0 of the two LSTM layers are obtained by calculation from the feature vectors according to formula one:
h_0 = c_0 = (1/L)·Σ_{i=1}^{L} x_i (formula one, the mean of the L input feature vectors)
Further, step a2 specifically includes:
first, according to the hidden-layer state h_{t-1} at the previous time, the LSTM layer feeds the feature-vector matrix x transmitted this time into the input gate, the forget gate and the output gate to perform the operations of formulas two to eight:
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (formula two),
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c) (formula three),
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (formula four),
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (formula five),
c_t = f_t*c_{t-1} + i_t*c̃_t (formulas six and seven, the storage-state update),
h_t = o_t*tanh(c_t) (formula eight),
wherein i_t refers to the state matrix of the updated information, c̃_t refers to the new information to be stored, f_t refers to the state matrix of the forgotten information, o_t refers to the state matrix of the output information, σ refers to the sigmoid function, W_i refers to the weight matrix of the input gate, W_c refers to the weight matrix of the new cell, W_f refers to the weight matrix of the forget gate, W_o refers to the weight matrix of the output gate, h_{t-1} refers to the cell output at the previous time, x_t refers to the current input, [h_{t-1}, x_t] means that the two input vectors are merged, b_i refers to the input gate bias, b_c refers to the new cell bias, b_f refers to the forget gate bias, b_o refers to the output gate bias, h_t refers to the current cell output, c_t refers to the current storage state, and c_{t-1} refers to the stored (or memorized) state at the previous moment;
then, the LSTM layer screens, forgets and memorizes the features to obtain a feature-matrix mask after feature deconstruction and filtering; the mask is passed through the linear-transformation formula y = xW^T + b and a fully connected output, and the fully connected output matrix is spliced and converted into a matrix A of the same size as the original image, where x is the input matrix, W is a weight matrix, and b is a bias matrix;
matrix A is then dot-multiplied with the original image, namely according to formula nine, to generate a feature picture with a mask, displaying the key features retained by the feature-matrix mask and hiding the features at other positions, thereby completing the task of high-resolution information reconstruction; the result is passed down to the next convolutional layer of the multilayer CNN;
C = A ⊙ B, i.e. C_{ij} = A_{ij}·B_{ij} (formula nine, the element-wise product)
where matrix A is the output o_t of the LSTM (the matrix consists only of 0s and 1s: 1 means the features of that region are retained, 0 means they are forgotten), and matrix B is the original image.
Further, in step a3, the loss function is calculated through back-propagation of the cross-entropy loss function, formula ten:
L = -Σ_i y_i·log(ŷ_i) (formula ten, where y_i is the true label distribution and ŷ_i the predicted probability)
the invention mainly has the following beneficial effects:
the attention mechanism-based deep neural network plug-in provided by the invention has the advantages that the CNN of the multilayer structure is used, the attention mechanism is simultaneously applied to guide the CNN of the multilayer structure to focus the key features of the target object in the image, the secondary features and background pixels are eliminated, the recognition capability can be improved, the focus of the image is automatically focused, the plug-in has a more humanoid function and is more intelligent, the target object is gradually recognized through multiple transmissions, the main features of the object are gradually remembered in the transmission process, and the secondary features of the object are forgotten. Due to the multi-propagation mechanism, even if an error occurs occasionally during a certain propagation, the error is ignored by considering the context information, and a plurality of backward feedbacks, like human "glimpses", have an error correction function.
Drawings
FIG. 1 is a schematic structural diagram of the multilayer CNN in a deep neural network plug-in based on an attention mechanism according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of an image recognition method using a deep neural network plug-in based on an attention mechanism according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the effect of a mask image generated by the image recognition method according to the present invention;
FIG. 4 is a schematic diagram illustrating the effect of a mask image with salient features generated in the image recognition method according to the present invention;
FIG. 5 is a schematic diagram of an image recognition method using a deep neural network plug-in based on an attention mechanism according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The deep neural network plug-in based on the attention mechanism is composed of two LSTM layers of the same size and a CNN with a multilayer structure, wherein:
one LSTM layer is used to memorize context information and generate a mask image with salient features, and the other LSTM layer is used to implement a "glimpse" function and generate the classification confidence. Here "glimpse" refers to a process equivalent to how the human eye distinguishes an object: when seeing a person, one may not see clearly at first sight, often "glimpses" several more times, and finally recognizes the person;
the CNN of the multilayer structure is used for down-sampling, extracting image characteristics and transmitting context information to the LSTM unit; as shown in fig. 1, the CNN of the multilayer structure includes four convolutional layers; the first convolutional layer is used for generating a feature vector and transmitting the feature vector to the LSTM unit, the second convolutional layer is used for performing secondary feature extraction on data after point multiplication to obtain a feature matrix containing attention information, and the fourth convolutional layer is used for outputting a one-dimensional vector containing a plurality of elements spliced by a multi-dimensional feature map.
In the image identification method of the invention using the attention-mechanism-based deep neural network plug-in, the multilayer CNN performs down-sampling, extracts image features and transmits context information to one LSTM layer; that layer screens, forgets and memorizes the received image features to generate a mask image with salient features; the multilayer CNN then performs secondary feature extraction to obtain a feature matrix containing attention information, and an attention mechanism with context-association capability is formed through cyclic transfer.
As shown in fig. 2 to 5, the image recognition method of the present invention specifically includes the following steps:
s100, initializing the LSTM, and calculating an average value from the input feature vectors to perform comparison on the hidden layer states h of the two LSTMs of the layer0And storage state c0Carrying out initialization; specifically, the initial hidden layer states h of the two LSTMs in the layer can be obtained by calculation according to a formula I and a feature vector0And storage state c0
h_0 = (1/L)·Σ_{i=1}^{L} x_i,
c_0 = (1/L)·Σ_{i=1}^{L} x_i (formula one)
In the LSTM, the hidden state h_0 acts as short-term memory: its information consists of background information or secondary features and will disappear in subsequent calculations. The storage state c_0 acts as long-term memory: the key features of the object that it carries remain unchanged. When the model starts running, the hidden state h_0 and the storage state c_0 are initialized with the average value of the generated feature vectors.
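As a minimal sketch of this initialization (assuming, per formula one, that h_0 and c_0 are simply set to the mean of the feature vectors; the tensor layout is an illustrative assumption):

import torch

def init_lstm_states(features: torch.Tensor):
    """Initialize h_0 and c_0 from the mean of the input feature vectors.

    features: (batch, L, hidden), the L feature vectors produced by the
    first convolutional layer.  Formula one: h_0 = c_0 = (1/L) * sum_i x_i.
    """
    mean = features.mean(dim=1)   # (batch, hidden)
    h0 = mean.clone()             # short-term memory: background/secondary features
    c0 = mean.clone()             # long-term memory: key object features
    return h0, c0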
S200, generating a mask image: the feature vector x generated by the multilayer CNN is transmitted to the LSTM unit of one LSTM layer; that layer computes on the feature vector x to obtain an analysed and filtered feature matrix, which is dot-multiplied with the original image to generate a feature image with a mask, so that the key features retained by the feature matrix are displayed while the features at other positions are hidden, thereby completing the task of high-resolution information reconstruction; the result is then passed on to the next convolutional layer. The step specifically comprises:
first, according to the hidden-layer state h_{t-1} at the previous time, the LSTM layer feeds the feature-vector matrix x transmitted this time into the input gate, the forget gate and the output gate to perform the operations of formulas two to eight, obtaining the feature vectors to be retained (namely the key features of the object):
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (formula two),
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c) (formula three),
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (formula four),
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (formula five),
c_t = f_t*c_{t-1} + i_t*c̃_t (formulas six and seven, the storage-state update),
h_t = o_t*tanh(c_t) (formula eight),
wherein i_t, c̃_t, f_t, o_t, σ, W_i, W_c, W_f, W_o, h_{t-1}, x_t, b_i, b_c, b_f, b_o, h_t, c_t and c_{t-1} are as defined for formulas two to eight above;
then, the LSTM layer screens, forgets and memorizes the features to obtain a feature-matrix mask after feature deconstruction and filtering; the mask passes through the linear-transformation formula y = xW^T + b and a fully connected output, and the fully connected output matrix is spliced and converted into a matrix A of the same size as the original picture, where x is the input matrix, W is the weight matrix and b is the bias matrix. For example: the feature-matrix mask is a 1 × 512 mask feature matrix; it is output through the linear-transformation formula and a fully connected batchsize × 1024 matrix, and the fully connected matrix is spliced and converted into a 32 × 32 matrix A of the same size as the original image, W being a 512 × 1024 weight matrix. Because h_{t-1} participates in the calculation, when the attention mechanism selects the current feature point for memorizing, the context information, namely the features memorized in the past, is fully considered;
then, matrix A is dot-multiplied with the original image, namely according to formula nine, to generate a feature picture with a mask (as shown in FIG. 3, the brighter pixels are key features, namely key parts of the target object, and the weak pixels are discardable features, namely non-key or irrelevant parts), displaying the key features retained by the feature-matrix mask and hiding the features at other positions, thereby completing the task of high-resolution information reconstruction; the result is passed down to the next convolutional layer of the multilayer CNN;
C = A ⊙ B, i.e. C_{ij} = A_{ij}·B_{ij} (formula nine, the element-wise product)
where matrix A is the output o_t of the LSTM (the matrix consists only of 0s and 1s: 1 means the features of that region are retained, 0 means they are forgotten), and matrix B is the original image;
in this process, the LSTM passes through the operations of the forget gate, the input gate and the output gate. The forget gate decides which feature information should be discarded or retained: the information from the previous hidden state h_{t-1} and the current input x_t is passed to the sigmoid function, whose output lies between 0 and 1; the closer to 0, the more the information should be forgotten, and the closer to 1, the more it should be retained; the result is multiplied by the storage state at the previous moment to obtain the state matrix of forgotten information, determining which stored information from the previous moment is forgotten. The input gate determines which information in the current input is important and needs to be added: the hidden-layer state h_{t-1} of the last iteration and the feature-vector matrix x_t passed in this iteration are fed into the input gate, and the storage state c is updated. Specifically: first, the information of the previous hidden state h_{t-1} and the current input x_t is passed to the sigmoid function, which sets a value between 0 and 1 to decide which features in x_t are important and need updating; then the feature vector of the previous hidden state and that of the current input are passed to the tanh activation function to compute the information to be stored in the cell state; the control signal of the input gate is multiplied by this information to obtain the updated state matrix, and the previously computed forgetting state matrix is added to obtain the storage state c_t at the current moment. The output gate determines the value of the next hidden state, which contains previous information. Specifically: the previous hidden state h_{t-1} and the current input x_t are first passed to the output gate; the sigmoid function then sets a value between 0 and 1 to decide which feature information is output, generating the state matrix of the output information; the current storage state c_t is passed to the tanh function, and the tanh output is multiplied by the output state matrix to determine the information that the hidden state h_t should carry; finally the new storage state c_t and the new hidden state h_t are passed to the next time step. Through the operations of the forget gate, the input gate and the output gate, the hidden layer outputs a feature-vector matrix (namely the 1 × 512 mask matrix); the mask matrix then undergoes a linear transformation to 1024 dimensions and is reshaped to 32 × 32, consistent with the original image. Because the feature vector at h_{t-1} participates in the calculation, the attention mechanism considers the context information, namely the feature vectors memorized in the past, when selecting the current feature vector.
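The following sketch ties the gate operations above to the mask construction: one LSTM step yields the mask feature, the linear transformation y = xW^T + b lifts it from 512 to 1024 dimensions, the result is reshaped to 32 × 32, and the element-wise product of formula nine is taken with the original image. The 0.5 threshold used to binarize matrix A is an assumption (the text only states that A consists of 0s and 1s); in practice a soft sigmoid mask would be kept so that gradients can flow during training.

import torch
import torch.nn as nn

hidden, img_side = 512, 32                        # assumed sizes from the 1x512 / 32x32 example
lstm = nn.LSTMCell(hidden, hidden)                # gate operations of formulas two to eight
to_mask = nn.Linear(hidden, img_side * img_side)  # y = xW^T + b: 512 -> 1024

def mask_step(x_t, h_prev, c_prev, image):
    """One attention step: LSTM gates, mask reshaping, element-wise product."""
    h_t, c_t = lstm(x_t, (h_prev, c_prev))        # input/forget/output gates + state update
    mask_feat = to_mask(h_t)                      # (batch, 1024) mask feature
    # Binarize into matrix A of 0s and 1s: 1 keeps a region's features,
    # 0 forgets them. (A soft mask, torch.sigmoid(mask_feat), is the
    # differentiable alternative.)
    A = (torch.sigmoid(mask_feat) > 0.5).float().view(-1, 1, img_side, img_side)
    masked = A * image                            # formula nine: C_ij = A_ij * B_ij
    return masked, h_t, c_t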
S300, generating a mask image with salient features: the multilayer CNN first re-extracts a feature map of the mask image to obtain a feature matrix containing attention information; the feature matrix is then transferred to the output gate of one LSTM layer and the output feature vector is addressed (namely addressed according to step S200); the feature matrix serves on the one hand as feedback input, and on the other hand the weights of the two LSTM layers and the multilayer CNN are adjusted through back-propagation; finally, after multiple iterations, the information of the hidden state changes continuously while the information of the storage state is retained, so that the key features of the target object in the mask image are kept, non-key or irrelevant features are hidden, and a mask image with salient features is generated. The loss function is calculated through back-propagation of the cross-entropy loss function, formula ten:
L = -Σ_i y_i·log(ŷ_i) (formula ten, where y_i is the true label distribution and ŷ_i the predicted probability)
namely: the second convolutional layer of the multilayer CNN performs secondary feature extraction on the dot-multiplied data to obtain a feature matrix containing attention information, and an attention mechanism with context-association capability is formed through cyclic transfer. Using this principle, under the action of the two attached LSTM layers, the context of the key features is memorized over the long term: the two LSTM layers imitate the memorization of context in the text field, so that features appearing more often in the picture receive context-enhanced memorization while features appearing at lower frequency are forgotten, and adjacent features related to the previous memory are continuously added into the memory, expanding the range of the high-frequency features. The picture processed by this context-associating attention mechanism can then be output with its main features ever clearer; as shown in FIG. 4, the light and dark pixels of the mask image change gradually as the number of iterations increases, and the key features of the target object are progressively highlighted.
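Putting the pieces together, one training iteration might look like the following sketch: several glimpse passes refine the mask, the second LSTM produces the classification confidence, and the cross-entropy loss of formula ten is back-propagated to adjust the weights of the two LSTMs and the multilayer CNN. The number of glimpses, the (batch, L, hidden) feature layout and all function names are illustrative assumptions.

import torch
import torch.nn.functional as F

def train_step(extract_features, mask_step, glimpse_lstm, classifier,
               optimizer, image, label, n_glimpses=4):
    """One training iteration of the plug-in, as a sketch.

    extract_features is assumed to return a (batch, L, hidden) stack of
    feature vectors from the first convolutional layer; label holds
    (batch,) class indices.
    """
    feats = extract_features(image)
    h = c = feats.mean(dim=1)                       # formula one: h_0 = c_0 = mean
    hg, cg = h.clone(), c.clone()
    masked = image
    for _ in range(n_glimpses):                     # multiple propagations ("glimpses")
        x_t = extract_features(masked).mean(dim=1)  # re-extract features of masked image
        masked, h, c = mask_step(x_t, h, c, image)  # mask LSTM: salient-feature mask
        hg, cg = glimpse_lstm(x_t, (hg, cg))        # glimpse LSTM accumulates evidence
    logits = classifier(hg)                         # classification confidence
    loss = F.cross_entropy(logits, label)           # formula ten: -sum_i y_i log(y_hat_i)
    optimizer.zero_grad()
    loss.backward()                                 # adjusts both LSTMs and the CNN
    optimizer.step()
    return loss.item()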
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (7)

1. A deep neural network plug-in based on an attention mechanism, characterized by consisting of two LSTM layers of the same size and a CNN with a multilayer structure; wherein:
one LSTM layer is used for memorizing context information and generating a mask image with salient features, and the other LSTM layer is used for realizing a "glimpse" function and generating a classification confidence;
the CNN of the multi-layer structure is used for down-sampling, extracting image features and transmitting context information to the LSTM unit.
2. The attention-mechanism-based deep neural network plug-in of claim 1, wherein the multilayer CNN is composed of four convolutional layers and is used for generating feature vectors and transmitting them to the LSTM unit, performing secondary feature extraction on the dot-multiplied data to obtain a feature matrix containing attention information, and outputting as a one-dimensional vector the multiple elements spliced from a multi-dimensional feature map.
3. An image recognition method using the attention-mechanism-based deep neural network plug-in as claimed in claim 1 or 2, characterized in that the multilayer CNN performs down-sampling, extracts image features and transmits context information to one LSTM layer; that LSTM layer screens, forgets and memorizes the received image features to generate a mask image with salient features; the multilayer CNN then performs secondary feature extraction to obtain a feature matrix containing attention information, and an attention mechanism with context-association capability is formed through cyclic transfer.
4. A method according to claim 3, characterized by the steps of:
a1. initializing the LSTM: an average value is calculated from the feature vectors input by the multilayer CNN to initialize the hidden-layer states h_0 and storage states c_0 of the two LSTM layers;
a2. generating a mask image: a feature vector x generated by the multilayer CNN is transmitted to the LSTM unit of one LSTM layer; that layer computes on the feature vector x to obtain an analysed and filtered feature matrix, which is dot-multiplied with the original image to generate a feature picture with a mask, displaying the key features retained by the feature matrix and hiding the features at other positions, thereby completing the task of high-resolution information reconstruction; the result is then passed on to the next convolutional layer;
a3. generating a mask image with salient features: the multilayer CNN first re-extracts a feature map of the mask image to obtain a feature matrix containing attention information; the feature matrix is then transferred to the output gate of one LSTM layer and the output feature vector is addressed; the feature matrix serves on the one hand as feedback input, and on the other hand the weights of the two LSTM layers and the multilayer CNN are adjusted through back-propagation; finally, after multiple iterations, the information of the hidden state changes continuously while the information of the storage state is retained, so that the key features of the target object in the mask image are kept, non-key or irrelevant features are hidden, and a mask image with salient features is generated.
5. The method of claim 4, wherein the initial hidden-layer states h_0 and storage states c_0 of the two LSTM layers are calculated from the feature vectors according to formula one:
h_0 = (1/L)·Σ_{i=1}^{L} x_i,
c_0 = (1/L)·Σ_{i=1}^{L} x_i (formula one)
6. The method according to claim 4, wherein step a2 is specifically:
first, according to the hidden-layer state h_{t-1} at the previous time, the LSTM layer feeds the feature-vector matrix x transmitted this time into the input gate, the forget gate and the output gate to perform the operations of formulas two to eight, obtaining the feature vectors to be retained:
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (formula two),
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c) (formula three),
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (formula four),
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (formula five),
c_t = f_t*c_{t-1} + i_t*c̃_t (formulas six and seven, the storage-state update),
h_t = o_t*tanh(c_t) (formula eight),
wherein i_t refers to the state matrix of the updated information, c̃_t refers to the new information to be stored, f_t refers to the state matrix of the forgotten information, o_t refers to the state matrix of the output information, σ refers to the sigmoid function, W_i refers to the weight matrix of the input gate, W_c refers to the weight matrix of the new cell, W_f refers to the weight matrix of the forget gate, W_o refers to the weight matrix of the output gate, h_{t-1} refers to the cell output at the previous time, x_t refers to the current input, [h_{t-1}, x_t] means that the two input vectors are merged, b_i refers to the input gate bias, b_c refers to the new cell bias, b_f refers to the forget gate bias, b_o refers to the output gate bias, h_t refers to the current cell output, c_t refers to the current storage state, and c_{t-1} refers to the stored (or memorized) state at the previous moment;
this layer of LSTM then performs the featureScreening, forgetting and memorizing to obtain a feature matrix mask after feature deconstruction and filtering, wherein the feature matrix mask is obtained by a linear transformation formula of y-xWT+ b and a matrix which is output by full connection, splicing the matrix which is output by full connection, and converting the matrix into a matrix A which has the same size as the original image, wherein x is the input matrix, W is a weight matrix, and b is an offset matrix;
matrix A is then dot-multiplied with the original image, namely according to formula nine, to generate a feature picture with a mask, displaying the key features retained by the feature-matrix mask and hiding the features at other positions, thereby completing the task of high-resolution information reconstruction; the result is passed down to the next convolutional layer of the multilayer CNN;
C = A ⊙ B, i.e. C_{ij} = A_{ij}·B_{ij} (formula nine, the element-wise product)
where matrix A is the output o_t of the LSTM (the matrix consists only of 0s and 1s: 1 means the features of that region are retained, 0 means they are forgotten), and matrix B is the original image.
7. The method according to claim 4, wherein in step a3 the loss function is calculated through back-propagation of the cross-entropy loss function, formula ten:
L = -Σ_i y_i·log(ŷ_i) (formula ten)
CN202011256575.1A 2020-11-11 2020-11-11 Attention mechanism-based deep neural network plug-in and image identification method Pending CN112580777A (en)

Priority Applications (1)

Application Number: CN202011256575.1A · Priority/Filing Date: 2020-11-11 · Title: Attention mechanism-based deep neural network plug-in and image identification method

Publications (1)

Publication Number Publication Date
CN112580777A (en) 2021-03-30

Family

ID=75122439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011256575.1A Pending CN112580777A (en) 2020-11-11 2020-11-11 Attention mechanism-based deep neural network plug-in and image identification method

Country Status (1)

Country Link
CN (1) CN112580777A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326739A (en) * 2021-05-07 2021-08-31 山东大学 Online learning participation degree evaluation method based on space-time attention network, evaluation system, equipment and storage medium
CN113326739B (en) * 2021-05-07 2022-08-09 山东大学 Online learning participation degree evaluation method based on space-time attention network, evaluation system, equipment and storage medium
CN113536989A (en) * 2021-06-29 2021-10-22 广州博通信息技术有限公司 Refrigerator frosting monitoring method and system based on camera video frame-by-frame analysis


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination