CN112580777A - Attention mechanism-based deep neural network plug-in and image identification method

Info

Publication number
CN112580777A
CN112580777A
Authority
CN
China
Prior art keywords
matrix
feature
lstm
layer
cnn
Prior art date
Legal status
Pending
Application number
CN202011256575.1A
Other languages
Chinese (zh)
Inventor
李海良
刘敏
郭焕
庄师强
张明
Current Assignee
Jinan University
Original Assignee
Jinan University
Priority date: 2020-11-11
Filing date: 2020-11-11
Publication date: 2021-03-30
Application filed by Jinan University
Priority to CN202011256575.1A
Publication of CN112580777A
Legal status: Pending

Classifications

    • G06N3/045 Combinations of networks
    • G06F18/24 Classification techniques
    • G06N3/048 Activation functions
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

The invention relates to a deep neural network plug-in based on an attention mechanism and an image identification method. The plug-in consists of two LSTM layers of the same size and a CNN with a multilayer structure. One LSTM layer memorizes context information and generates a mask image with salient features; the other LSTM layer implements a "glimpse" function and generates a classification confidence. The multilayer CNN performs down-sampling, extracts image features and transmits context information to the LSTM unit. By using the multilayer CNN and applying an attention mechanism to guide it to focus on the key features of the target object in an image while removing secondary features and background pixels, the plug-in achieves high recognition capability: the target object is recognized step by step through multiple propagations, the main features of the object are gradually remembered while its secondary features are forgotten, and the repeated feedback gives the network an error-correction function.

Description

Attention mechanism-based deep neural network plug-in and image identification method
Technical Field
The invention belongs to the technical field of artificial intelligence and image recognition, and particularly relates to a deep neural network plug-in based on an attention mechanism and an image recognition method.
Background
Currently, the deep neural networks used for image recognition are mainly convolutional neural networks (CNN); however, the applicant has found that the existing CNN suffers from the following defects in the image identification process:
1. objects in an image usually occupy only a part, often a small part, of the entire space, and in many cases the image contains a large number of background pixels, many of which are irrelevant to, or even interfere with, the identification target; for CNN, however, all pixels in the image carry equal weight;
2. during recognition, the CNN propagates forward only once; therefore, if the perturbations in an adversarial sample take effect during this single pass, the CNN is likely to misidentify the image.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a deep neural network plug-in with high recognition capability and an error-correction function, which uses a multilayer CNN and applies an attention mechanism to guide the multilayer CNN to focus on the key features of a target object in an image while removing secondary features and background pixels, as well as an image recognition method using the attention-mechanism-based deep neural network plug-in.
In order to solve the technical problems, the invention adopts the following technical scheme:
a deep neural network plug-in based on an attention mechanism is composed of two LSTM layers of the same size and a CNN with a multilayer structure; wherein:
one LSTM layer is used for memorizing context information and generating a mask image with salient features, and the other LSTM layer is used for realizing a "glimpse" function and generating a classification confidence;
the CNN of the multi-layer structure is used for down-sampling, extracting image features and transmitting context information to the LSTM unit.
Further, the multilayer CNN includes four convolutional layers: the first convolutional layer generates a feature vector and transmits it to the LSTM unit; the second convolutional layer performs secondary feature extraction on the dot-multiplied data to obtain a feature matrix containing attention information; and the fourth convolutional layer outputs a one-dimensional vector whose elements are spliced from a multi-dimensional feature map.
An image recognition method adopting the attention-mechanism-based deep neural network plug-in: the multilayer CNN performs down-sampling, extracts image features and transmits context information to one LSTM layer; that LSTM layer then screens, forgets and memorizes the received image features to generate a mask image with salient features; the multilayer CNN then performs secondary feature extraction to obtain a feature matrix containing attention information, and an attention mechanism with context-association capability is formed through cyclic transfer. The method specifically comprises the following steps:
a1. initializing the LSTM: an average value is calculated from the input feature vectors to initialize the hidden-layer states h_0 and storage states c_0 of the two LSTM layers;
a2. generating a mask image: a feature vector x generated by the multilayer CNN is transmitted to the LSTM unit of one LSTM layer; that layer computes on the feature vector x to obtain an analysed and filtered feature matrix, which is dot-multiplied with the original image to generate a feature picture with a mask, displaying the key features retained by the feature matrix and hiding the features at other positions, thereby completing the task of high-resolution information reconstruction; the result is then passed on to the next convolutional layer;
a3. generating a mask image with salient features: the multilayer CNN first re-extracts a feature map of the mask image to obtain a feature matrix containing attention information; the feature matrix is then transferred to the output gate of one LSTM layer and the output feature vector is addressed; the feature matrix serves on the one hand as feedback input, and on the other hand the weights of the two LSTM layers and the multilayer CNN are adjusted through back-propagation; finally, after multiple iterations, the information of the hidden state changes continuously while the information of the storage state is retained, so that the key features of the target object in the mask image are kept, non-key or irrelevant features are hidden, and a mask image with salient features is generated.
Further, the initial hidden-layer state h_0 and storage state c_0 of the two LSTM layers are obtained by calculation from the feature vectors according to formula one:
h_0 = c_0 = (1/L)·Σ_{i=1}^{L} x_i (formula one, the mean of the L input feature vectors)
Further, step a2 specifically includes:
first, according to the hidden-layer state h_{t-1} at the previous time, the LSTM layer feeds the feature-vector matrix x transmitted this time into the input gate, the forget gate and the output gate to perform the operations of formulas two to eight:
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (formula two),
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c) (formula three),
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (formula four),
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (formula five),
c_t = f_t*c_{t-1} + i_t*c̃_t (formulas six and seven, the storage-state update),
h_t = o_t*tanh(c_t) (formula eight),
wherein i_t refers to the state matrix of the updated information, c̃_t refers to the new information to be stored, f_t refers to the state matrix of the forgotten information, o_t refers to the state matrix of the output information, σ refers to the sigmoid function, W_i refers to the weight matrix of the input gate, W_c refers to the weight matrix of the new cell, W_f refers to the weight matrix of the forget gate, W_o refers to the weight matrix of the output gate, h_{t-1} refers to the cell output at the previous time, x_t refers to the current input, [h_{t-1}, x_t] means that the two input vectors are merged, b_i refers to the input gate bias, b_c refers to the new cell bias, b_f refers to the forget gate bias, b_o refers to the output gate bias, h_t refers to the current cell output, c_t refers to the current storage state, and c_{t-1} refers to the stored (or memorized) state at the previous moment;
then, the LSTM layer screens, forgets and memorizes the features to obtain a feature-matrix mask after feature deconstruction and filtering; the mask is passed through the linear-transformation formula y = xW^T + b and a fully connected output, and the fully connected output matrix is spliced and converted into a matrix A of the same size as the original image, where x is the input matrix, W is a weight matrix, and b is a bias matrix;
matrix A is then dot-multiplied with the original image, namely according to formula nine, to generate a feature picture with a mask, displaying the key features retained by the feature-matrix mask and hiding the features at other positions, thereby completing the task of high-resolution information reconstruction; the result is passed down to the next convolutional layer of the multilayer CNN;
C = A ⊙ B, i.e. C_{ij} = A_{ij}·B_{ij} (formula nine, the element-wise product)
where matrix A is the output o_t of the LSTM (the matrix consists only of 0s and 1s: 1 means the features of that region are retained, 0 means they are forgotten), and matrix B is the original image.
Further, in step a3, the loss function is calculated through back-propagation of the cross-entropy loss function, formula ten:
L = -Σ_i y_i·log(ŷ_i) (formula ten, where y_i is the true label distribution and ŷ_i the predicted probability)
the invention mainly has the following beneficial effects:
the attention mechanism-based deep neural network plug-in provided by the invention has the advantages that the CNN of the multilayer structure is used, the attention mechanism is simultaneously applied to guide the CNN of the multilayer structure to focus the key features of the target object in the image, the secondary features and background pixels are eliminated, the recognition capability can be improved, the focus of the image is automatically focused, the plug-in has a more humanoid function and is more intelligent, the target object is gradually recognized through multiple transmissions, the main features of the object are gradually remembered in the transmission process, and the secondary features of the object are forgotten. Due to the multi-propagation mechanism, even if an error occurs occasionally during a certain propagation, the error is ignored by considering the context information, and a plurality of backward feedbacks, like human "glimpses", have an error correction function.
Drawings
FIG. 1 is a schematic structural diagram of the multilayer CNN in a deep neural network plug-in based on an attention mechanism according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of an image recognition method using a deep neural network plug-in based on an attention mechanism according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the effect of a mask image generated by the image recognition method according to the present invention;
FIG. 4 is a schematic diagram illustrating the effect of a mask image with salient features generated in the image recognition method according to the present invention;
FIG. 5 is a schematic diagram of an image recognition method using a deep neural network plug-in based on an attention mechanism according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The deep neural network plug-in based on the attention mechanism is composed of two LSTM layers of the same size and a CNN with a multilayer structure, wherein:
one LSTM layer is used to memorize context information and generate a mask image with salient features, and the other LSTM layer is used to implement a "glimpse" function and generate the classification confidence. Here "glimpse" refers to a process equivalent to how the human eye distinguishes an object: when seeing a person, one may not see clearly at first sight, often "glimpses" several more times, and finally recognizes the person;
the CNN of the multilayer structure is used for down-sampling, extracting image characteristics and transmitting context information to the LSTM unit; as shown in fig. 1, the CNN of the multilayer structure includes four convolutional layers; the first convolutional layer is used for generating a feature vector and transmitting the feature vector to the LSTM unit, the second convolutional layer is used for performing secondary feature extraction on data after point multiplication to obtain a feature matrix containing attention information, and the fourth convolutional layer is used for outputting a one-dimensional vector containing a plurality of elements spliced by a multi-dimensional feature map.
In the image identification method of the invention using the attention-mechanism-based deep neural network plug-in, the multilayer CNN performs down-sampling, extracts image features and transmits context information to one LSTM layer; that layer screens, forgets and memorizes the received image features to generate a mask image with salient features; the multilayer CNN then performs secondary feature extraction to obtain a feature matrix containing attention information, and an attention mechanism with context-association capability is formed through cyclic transfer.
As shown in fig. 2 to 5, the image recognition method of the present invention specifically includes the following steps:
s100, initializing the LSTM, and calculating an average value from the input feature vectors to perform comparison on the hidden layer states h of the two LSTMs of the layer0And storage state c0Carrying out initialization; specifically, the initial hidden layer states h of the two LSTMs in the layer can be obtained by calculation according to a formula I and a feature vector0And storage state c0
h_0 = (1/L)·Σ_{i=1}^{L} x_i,
c_0 = (1/L)·Σ_{i=1}^{L} x_i (formula one)
In the LSTM, the hidden state h_0 acts as short-term memory: its information consists of background information or secondary features and will disappear in subsequent calculations. The storage state c_0 acts as long-term memory: the key features of the object that it carries remain unchanged. When the model starts running, the hidden state h_0 and the storage state c_0 are initialized with the average value of the generated feature vectors.
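As a minimal sketch of this initialization (assuming, per formula one, that h_0 and c_0 are simply set to the mean of the feature vectors; the tensor layout is an illustrative assumption):

import torch

def init_lstm_states(features: torch.Tensor):
    """Initialize h_0 and c_0 from the mean of the input feature vectors.

    features: (batch, L, hidden), the L feature vectors produced by the
    first convolutional layer.  Formula one: h_0 = c_0 = (1/L) * sum_i x_i.
    """
    mean = features.mean(dim=1)   # (batch, hidden)
    h0 = mean.clone()             # short-term memory: background/secondary features
    c0 = mean.clone()             # long-term memory: key object features
    return h0, c0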
S200, generating a mask image: the feature vector x generated by the multilayer CNN is transmitted to the LSTM unit of one LSTM layer; that layer computes on the feature vector x to obtain an analysed and filtered feature matrix, which is dot-multiplied with the original image to generate a feature image with a mask, so that the key features retained by the feature matrix are displayed while the features at other positions are hidden, thereby completing the task of high-resolution information reconstruction; the result is then passed on to the next convolutional layer. The step specifically comprises:
first, according to the hidden-layer state h_{t-1} at the previous time, the LSTM layer feeds the feature-vector matrix x transmitted this time into the input gate, the forget gate and the output gate to perform the operations of formulas two to eight, obtaining the feature vectors to be retained (namely the key features of the object):
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (formula two),
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c) (formula three),
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (formula four),
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (formula five),
c_t = f_t*c_{t-1} + i_t*c̃_t (formulas six and seven, the storage-state update),
h_t = o_t*tanh(c_t) (formula eight),
wherein i_t, c̃_t, f_t, o_t, σ, W_i, W_c, W_f, W_o, h_{t-1}, x_t, b_i, b_c, b_f, b_o, h_t, c_t and c_{t-1} are as defined for formulas two to eight above;
then, the LSTM layer screens, forgets and memorizes the features to obtain a feature-matrix mask after feature deconstruction and filtering; the mask passes through the linear-transformation formula y = xW^T + b and a fully connected output, and the fully connected output matrix is spliced and converted into a matrix A of the same size as the original picture, where x is the input matrix, W is the weight matrix and b is the bias matrix. For example: the feature-matrix mask is a 1 × 512 mask feature matrix; it is output through the linear-transformation formula and a fully connected batchsize × 1024 matrix, and the fully connected matrix is spliced and converted into a 32 × 32 matrix A of the same size as the original image, W being a 512 × 1024 weight matrix. Because h_{t-1} participates in the calculation, when the attention mechanism selects the current feature point for memorizing, the context information, namely the features memorized in the past, is fully considered;
then, matrix A is dot-multiplied with the original image, namely according to formula nine, to generate a feature picture with a mask (as shown in FIG. 3, the brighter pixels are key features, namely key parts of the target object, and the weak pixels are discardable features, namely non-key or irrelevant parts), displaying the key features retained by the feature-matrix mask and hiding the features at other positions, thereby completing the task of high-resolution information reconstruction; the result is passed down to the next convolutional layer of the multilayer CNN;
C = A ⊙ B, i.e. C_{ij} = A_{ij}·B_{ij} (formula nine, the element-wise product)
where matrix A is the output o_t of the LSTM (the matrix consists only of 0s and 1s: 1 means the features of that region are retained, 0 means they are forgotten), and matrix B is the original image;
in this process, the LSTM passes through the operations of the forget gate, the input gate and the output gate. The forget gate decides which feature information should be discarded or retained: the information from the previous hidden state h_{t-1} and the current input x_t is passed to the sigmoid function, whose output lies between 0 and 1; the closer to 0, the more the information should be forgotten, and the closer to 1, the more it should be retained; the result is multiplied by the storage state at the previous moment to obtain the state matrix of forgotten information, determining which stored information from the previous moment is forgotten. The input gate determines which information in the current input is important and needs to be added: the hidden-layer state h_{t-1} of the last iteration and the feature-vector matrix x_t passed in this iteration are fed into the input gate, and the storage state c is updated. Specifically: first, the information of the previous hidden state h_{t-1} and the current input x_t is passed to the sigmoid function, which sets a value between 0 and 1 to decide which features in x_t are important and need updating; then the feature vector of the previous hidden state and that of the current input are passed to the tanh activation function to compute the information to be stored in the cell state; the control signal of the input gate is multiplied by this information to obtain the updated state matrix, and the previously computed forgetting state matrix is added to obtain the storage state c_t at the current moment. The output gate determines the value of the next hidden state, which contains previous information. Specifically: the previous hidden state h_{t-1} and the current input x_t are first passed to the output gate; the sigmoid function then sets a value between 0 and 1 to decide which feature information is output, generating the state matrix of the output information; the current storage state c_t is passed to the tanh function, and the tanh output is multiplied by the output state matrix to determine the information that the hidden state h_t should carry; finally the new storage state c_t and the new hidden state h_t are passed to the next time step. Through the operations of the forget gate, the input gate and the output gate, the hidden layer outputs a feature-vector matrix (namely the 1 × 512 mask matrix); the mask matrix then undergoes a linear transformation to 1024 dimensions and is reshaped to 32 × 32, consistent with the original image. Because the feature vector at h_{t-1} participates in the calculation, the attention mechanism considers the context information, namely the feature vectors memorized in the past, when selecting the current feature vector.
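The following sketch ties the gate operations above to the mask construction: one LSTM step yields the mask feature, the linear transformation y = xW^T + b lifts it from 512 to 1024 dimensions, the result is reshaped to 32 × 32, and the element-wise product of formula nine is taken with the original image. The 0.5 threshold used to binarize matrix A is an assumption (the text only states that A consists of 0s and 1s); in practice a soft sigmoid mask would be kept so that gradients can flow during training.

import torch
import torch.nn as nn

hidden, img_side = 512, 32                        # assumed sizes from the 1x512 / 32x32 example
lstm = nn.LSTMCell(hidden, hidden)                # gate operations of formulas two to eight
to_mask = nn.Linear(hidden, img_side * img_side)  # y = xW^T + b: 512 -> 1024

def mask_step(x_t, h_prev, c_prev, image):
    """One attention step: LSTM gates, mask reshaping, element-wise product."""
    h_t, c_t = lstm(x_t, (h_prev, c_prev))        # input/forget/output gates + state update
    mask_feat = to_mask(h_t)                      # (batch, 1024) mask feature
    # Binarize into matrix A of 0s and 1s: 1 keeps a region's features,
    # 0 forgets them. (A soft mask, torch.sigmoid(mask_feat), is the
    # differentiable alternative.)
    A = (torch.sigmoid(mask_feat) > 0.5).float().view(-1, 1, img_side, img_side)
    masked = A * image                            # formula nine: C_ij = A_ij * B_ij
    return masked, h_t, c_t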
S300, generating a mask image with salient features: the multilayer CNN first re-extracts a feature map of the mask image to obtain a feature matrix containing attention information; the feature matrix is then transferred to the output gate of one LSTM layer and the output feature vector is addressed (namely addressed according to step S200); the feature matrix serves on the one hand as feedback input, and on the other hand the weights of the two LSTM layers and the multilayer CNN are adjusted through back-propagation; finally, after multiple iterations, the information of the hidden state changes continuously while the information of the storage state is retained, so that the key features of the target object in the mask image are kept, non-key or irrelevant features are hidden, and a mask image with salient features is generated. The loss function is calculated through back-propagation of the cross-entropy loss function, formula ten:
L = -Σ_i y_i·log(ŷ_i) (formula ten, where y_i is the true label distribution and ŷ_i the predicted probability)
namely: the second convolutional layer of the multilayer CNN performs secondary feature extraction on the dot-multiplied data to obtain a feature matrix containing attention information, and an attention mechanism with context-association capability is formed through cyclic transfer. Using this principle, under the action of the two attached LSTM layers, the context of the key features is memorized over the long term: the two LSTM layers imitate the memorization of context in the text field, so that features appearing more often in the picture receive context-enhanced memorization while features appearing at lower frequency are forgotten, and adjacent features related to the previous memory are continuously added into the memory, expanding the range of the high-frequency features. The picture processed by this context-associating attention mechanism can then be output with its main features ever clearer; as shown in FIG. 4, the light and dark pixels of the mask image change gradually as the number of iterations increases, and the key features of the target object are progressively highlighted.
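Putting the pieces together, one training iteration might look like the following sketch: several glimpse passes refine the mask, the second LSTM produces the classification confidence, and the cross-entropy loss of formula ten is back-propagated to adjust the weights of the two LSTMs and the multilayer CNN. The number of glimpses, the (batch, L, hidden) feature layout and all function names are illustrative assumptions.

import torch
import torch.nn.functional as F

def train_step(extract_features, mask_step, glimpse_lstm, classifier,
               optimizer, image, label, n_glimpses=4):
    """One training iteration of the plug-in, as a sketch.

    extract_features is assumed to return a (batch, L, hidden) stack of
    feature vectors from the first convolutional layer; label holds
    (batch,) class indices.
    """
    feats = extract_features(image)
    h = c = feats.mean(dim=1)                       # formula one: h_0 = c_0 = mean
    hg, cg = h.clone(), c.clone()
    masked = image
    for _ in range(n_glimpses):                     # multiple propagations ("glimpses")
        x_t = extract_features(masked).mean(dim=1)  # re-extract features of masked image
        masked, h, c = mask_step(x_t, h, c, image)  # mask LSTM: salient-feature mask
        hg, cg = glimpse_lstm(x_t, (hg, cg))        # glimpse LSTM accumulates evidence
    logits = classifier(hg)                         # classification confidence
    loss = F.cross_entropy(logits, label)           # formula ten: -sum_i y_i log(y_hat_i)
    optimizer.zero_grad()
    loss.backward()                                 # adjusts both LSTMs and the CNN
    optimizer.step()
    return loss.item()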
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (7)

1. A deep neural network plug-in based on an attention mechanism, characterized by consisting of two LSTM layers of the same size and a CNN with a multilayer structure; wherein:
one LSTM layer is used for memorizing context information and generating a mask image with salient features, and the other LSTM layer is used for realizing a "glimpse" function and generating a classification confidence;
the CNN of the multi-layer structure is used for down-sampling, extracting image features and transmitting context information to the LSTM unit.
2. The attention-mechanism-based deep neural network plug-in of claim 1, wherein the multilayer CNN is composed of four convolutional layers and is used for generating feature vectors and transmitting them to the LSTM unit, performing secondary feature extraction on the dot-multiplied data to obtain a feature matrix containing attention information, and outputting as a one-dimensional vector the multiple elements spliced from a multi-dimensional feature map.
3. An image recognition method using the attention-mechanism-based deep neural network plug-in as claimed in claim 1 or 2, characterized in that the multilayer CNN performs down-sampling, extracts image features and transmits context information to one LSTM layer; that LSTM layer screens, forgets and memorizes the received image features to generate a mask image with salient features; the multilayer CNN then performs secondary feature extraction to obtain a feature matrix containing attention information, and an attention mechanism with context-association capability is formed through cyclic transfer.
4. A method according to claim 3, characterized by the steps of:
a1. initializing the LSTM: an average value is calculated from the feature vectors input by the multilayer CNN to initialize the hidden-layer states h_0 and storage states c_0 of the two LSTM layers;
a2. generating a mask image: a feature vector x generated by the multilayer CNN is transmitted to the LSTM unit of one LSTM layer; that layer computes on the feature vector x to obtain an analysed and filtered feature matrix, which is dot-multiplied with the original image to generate a feature picture with a mask, displaying the key features retained by the feature matrix and hiding the features at other positions, thereby completing the task of high-resolution information reconstruction; the result is then passed on to the next convolutional layer;
a3. generating a mask image with salient features: the multilayer CNN first re-extracts a feature map of the mask image to obtain a feature matrix containing attention information; the feature matrix is then transferred to the output gate of one LSTM layer and the output feature vector is addressed; the feature matrix serves on the one hand as feedback input, and on the other hand the weights of the two LSTM layers and the multilayer CNN are adjusted through back-propagation; finally, after multiple iterations, the information of the hidden state changes continuously while the information of the storage state is retained, so that the key features of the target object in the mask image are kept, non-key or irrelevant features are hidden, and a mask image with salient features is generated.
5. The method of claim 4, wherein the initial hidden-layer states h_0 and storage states c_0 of the two LSTM layers are calculated from the feature vectors according to formula one:
h_0 = (1/L)·Σ_{i=1}^{L} x_i,
c_0 = (1/L)·Σ_{i=1}^{L} x_i (formula one)
6. The method according to claim 4, wherein step a2 is specifically:
first, according to the hidden-layer state h_{t-1} at the previous time, the LSTM layer feeds the feature-vector matrix x transmitted this time into the input gate, the forget gate and the output gate to perform the operations of formulas two to eight, obtaining the feature vectors to be retained:
i_t = σ(W_i·[h_{t-1}, x_t] + b_i) (formula two),
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c) (formula three),
f_t = σ(W_f·[h_{t-1}, x_t] + b_f) (formula four),
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (formula five),
c_t = f_t*c_{t-1} + i_t*c̃_t (formulas six and seven, the storage-state update),
h_t = o_t*tanh(c_t) (formula eight),
wherein i_t refers to the state matrix of the updated information, c̃_t refers to the new information to be stored, f_t refers to the state matrix of the forgotten information, o_t refers to the state matrix of the output information, σ refers to the sigmoid function, W_i refers to the weight matrix of the input gate, W_c refers to the weight matrix of the new cell, W_f refers to the weight matrix of the forget gate, W_o refers to the weight matrix of the output gate, h_{t-1} refers to the cell output at the previous time, x_t refers to the current input, [h_{t-1}, x_t] means that the two input vectors are merged, b_i refers to the input gate bias, b_c refers to the new cell bias, b_f refers to the forget gate bias, b_o refers to the output gate bias, h_t refers to the current cell output, c_t refers to the current storage state, and c_{t-1} refers to the stored (or memorized) state at the previous moment;
this layer of LSTM then performs the featureScreening, forgetting and memorizing to obtain a feature matrix mask after feature deconstruction and filtering, wherein the feature matrix mask is obtained by a linear transformation formula of y-xWT+ b and a matrix which is output by full connection, splicing the matrix which is output by full connection, and converting the matrix into a matrix A which has the same size as the original image, wherein x is the input matrix, W is a weight matrix, and b is an offset matrix;
matrix A is then dot-multiplied with the original image, namely according to formula nine, to generate a feature picture with a mask, displaying the key features retained by the feature-matrix mask and hiding the features at other positions, thereby completing the task of high-resolution information reconstruction; the result is passed down to the next convolutional layer of the multilayer CNN;
C = A ⊙ B, i.e. C_{ij} = A_{ij}·B_{ij} (formula nine, the element-wise product)
where matrix A is the output o_t of the LSTM (the matrix consists only of 0s and 1s: 1 means the features of that region are retained, 0 means they are forgotten), and matrix B is the original image.
7. The method according to claim 4, wherein in step a3 the loss function is calculated through back-propagation of the cross-entropy loss function, formula ten:
L = -Σ_i y_i·log(ŷ_i) (formula ten)
CN202011256575.1A 2020-11-11 2020-11-11 Attention mechanism-based deep neural network plug-in and image identification method Pending CN112580777A (en)

Priority Applications (1)

Application Number: CN202011256575.1A · Priority/Filing Date: 2020-11-11 · Title: Attention mechanism-based deep neural network plug-in and image identification method

Publications (1)

Publication Number Publication Date
CN112580777A (en) 2021-03-30

Family

ID=75122439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011256575.1A Pending CN112580777A (en) 2020-11-11 2020-11-11 Attention mechanism-based deep neural network plug-in and image identification method

Country Status (1)

Country Link
CN (1) CN112580777A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326739A (en) * 2021-05-07 2021-08-31 山东大学 Online learning participation degree evaluation method based on space-time attention network, evaluation system, equipment and storage medium
CN113326739B (en) * 2021-05-07 2022-08-09 山东大学 Online learning participation degree evaluation method based on space-time attention network, evaluation system, equipment and storage medium
CN113536989A (en) * 2021-06-29 2021-10-22 广州博通信息技术有限公司 Refrigerator frosting monitoring method and system based on camera video frame-by-frame analysis


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination