CN110704665A - Image feature expression method and system based on visual attention mechanism - Google Patents

Image feature expression method and system based on visual attention mechanism

Info

Publication number
CN110704665A
Authority
CN
China
Prior art keywords: module, network, attention, value, layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910818508.5A
Other languages
Chinese (zh)
Inventor
段凌宇 (Lingyu Duan)
白燕 (Yan Bai)
楼燚航 (Yihang Lou)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2019-08-30
Publication date: 2020-01-17
Application filed by Peking University
Priority to CN201910818508.5A
Publication of CN110704665A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval of still image data
    • G06F 16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583: Retrieval using metadata automatically derived from the content
    • G06F 16/53: Querying
    • G06F 16/532: Query formulation, e.g. graphical querying
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of computer vision, in particular to an image feature expression method and system based on a visual attention mechanism. A picture is input into a trained deep network model, which extracts features from the picture to obtain its attention feature value; the distance between the feature value of the picture and the feature values of target pictures is calculated, and the several closest target pictures are selected for display. The invention uses a multi-scale attention network to realize a target retrieval framework that integrates feature extraction and distance measurement; compared with traditional algorithms, both processing speed and accuracy are improved.

Description

Image feature expression method and system based on visual attention mechanism
Technical Field
The invention relates to the field of computer vision, in particular to an image feature expression method and system based on a visual attention mechanism.
Background
Image retrieval, which aims at retrieving from an image dataset the images that depict the same particular object as a given query image, has attracted considerable research attention. The success of convolutional neural networks has greatly advanced image retrieval in recent years, benefiting from their discriminative power and compact representations. Although significant performance gains have been achieved with deep-learning-based image descriptors, two challenges remain in practical applications: background interference and scale variation. First, background clutter, as extraneous information, severely degrades the feature representation of the informative regions used for image retrieval; second, the objects of interest in the query and reference images usually differ in scale. In this work, we focus on multi-scale feature representations of the information-rich regions in the image.
In practical scenarios, background interference can severely disrupt feature matching; for image retrieval, attending to the information-rich regions of an image helps produce effective features. Recently, most CNN-based features are trained as global descriptors using siamese or triplet networks. These global features are extracted directly from the output of the last convolutional layer using a max or average pooling operation, which handles complex scenes poorly: the target objects in the images are often misaligned and, in extreme cases, occupy only a small part of the image. It is therefore necessary to selectively focus on certain regions and ignore irrelevant ones. This selective focusing, known as the attention mechanism, has proven effective in a variety of research areas, such as machine translation, speech recognition and image captioning. One typical way to apply attention in a CNN is to predict attention maps, where the value at each location indicates the amount of information at the corresponding position.
Scale is another main factor influencing feature representation in image retrieval, and the attended regions may differ across scales. One representative work is the Scale-Invariant Feature Transform (SIFT), which finds extremal responses in a multi-scale Gaussian pyramid as feature points for image matching. However, in current deep-learning-based approaches, the multi-scale context, i.e., the relevance between regions of interest at different scales, has not been fully explored. Current networks for generating scale-robust features are typically equipped with data augmentation during training (e.g., randomly resizing or cropping training images) or obtain fully-connected features from input images of different scales as the final feature. In extreme cases where the object of interest occupies only a small part of the input image, its response is hard to retain as the feature map keeps shrinking during forward propagation. To obtain reliable attention at different scales, multi-scale context information must be acquired.
Disclosure of Invention
The embodiments of the invention provide an image feature expression method and system based on a visual attention mechanism. A multi-scale attention network is used to realize a target retrieval framework that integrates feature extraction and distance measurement; compared with traditional algorithms, both processing speed and accuracy are improved.
According to a first aspect of the embodiments of the present invention, an image feature expression method based on a visual attention mechanism includes
Inputting the picture into a trained deep network model to perform feature extraction on the picture to obtain an attention feature value of the picture,
calculating the distance between the attention characteristic value of the picture and the characteristic value of the target picture,
and selecting a plurality of target pictures with the closest distances for display.
The deep network model comprises
a classification network, wherein visual attention modules are inserted between some of its convolutional layers, and the output of the preceding convolutional layer is processed by the visual attention module before being input into the next convolutional layer;
two layers of long short-term memory (LSTM) networks, wherein the LSTM modules in each layer correspond one-to-one to the visual attention modules;
the output value of the visual attention module is input into an LSTM module of a first layer long-short term memory network; the hidden state value of the LSTM module of the first layer long-short term memory network is input into the LSTM module of the second layer long-short term memory network;
and calculating the output value of the last LSTM module of the second layer of long-short term memory network and the output value of the classification network to obtain the attention characteristic value.
The training of the deep network model comprises
using triplet elements <x, x_p, x_n> to train the deep network model: a triplet <x, x_p, x_n> is input into the deep network model, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; the corresponding network outputs are obtained, and the deep network model is updated by back-propagating a loss function;
and returning to continue updating until the loss function falls below a threshold or a set number of iterations is reached (a minimal training-loop sketch is given below).
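By way of illustration only, the following Python sketch shows one way such a training loop could be organized; the model, data loader, optimizer, margin and stopping thresholds are assumed placeholders rather than the configuration disclosed in this application.

    import torch
    import torch.nn.functional as F

    def train_triplet_model(model, triplet_loader, margin=0.3, lr=1e-4,
                            loss_threshold=1e-3, max_iterations=10000):
        # Hypothetical training loop: triplets <x, x_p, x_n> are fed to the model,
        # the triplet loss is back-propagated, and training stops once the loss
        # falls below a threshold or the iteration budget is exhausted.
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        iteration = 0
        for anchor, positive, negative in triplet_loader:
            f_a, f_p, f_n = model(anchor), model(positive), model(negative)
            loss = F.triplet_margin_loss(f_a, f_p, f_n, margin=margin)
            optimizer.zero_grad()
            loss.backward()      # back-propagate the loss function
            optimizer.step()     # update the deep network model
            iteration += 1
            if loss.item() < loss_threshold or iteration >= max_iterations:
                break
        return model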
The loss function is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0), with d(a, b) denoting the Euclidean distance ||f(a) - f(b)|| between the network outputs,
where α is a margin parameter, x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
The output value of the preceding convolutional layer is processed by the visual attention module and then input into the next convolutional layer, which specifically comprises:
processing the output value of the preceding convolutional layer with the visual attention module to obtain the value of the base branch, and inputting the value of the base branch into the next convolutional layer, where the processing is computed as
b_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x) + f_{i,j}(x)
where i, j denotes a position on the feature map; b_{i,j}(x) is the value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); f(x) is the output feature; and ⊙ is the element-wise product.
The attention feature value is obtained from the output value of the last LSTM module of the second-layer long short-term memory network and the output value of the classification network as
F′(x) = F(x) + F(x) ⊙ S(x)
where S(x) is the output value of the last LSTM module of the second-layer LSTM and F(x) is the output value of the classification network.
The LSTM module state update rule of the second-layer long short-term memory network is
c_t = g_t ⊙ i_t ⊙ u_t + (1 - g_t) ⊙ f_t ⊙ c_{t-1}
where g_t is the normalized attention gate for the step-t input, i_t is the input gate at step t, f_t is the forget gate at step t, u_t is the intermediate variable at step t, and c_t is the memory cell at step t.
An image feature expression system based on a visual attention mechanism comprises a deep network module and a feature calculation module,
the deep network module is used for inputting the picture into the trained deep network module to perform feature extraction on the picture to obtain the attention feature value of the picture,
the feature calculation module is used for calculating the distance between the attention feature value of the picture and the feature value of the target picture, and selecting a plurality of target pictures with the closest distances for display.
The deep network module comprises
a classification network module, wherein visual attention modules are inserted between some of its convolutional layers, and the output of the preceding convolutional layer is processed by the visual attention module before being input into the next convolutional layer;
the system also comprises two layers of long short-term memory (LSTM) networks, and the LSTM modules in the LSTM networks correspond one-to-one to the visual attention modules;
the output value of the visual attention module is input into an LSTM module of the first-layer long short-term memory network; the hidden state value of the LSTM module of the first-layer long short-term memory network is input into the LSTM module of the second-layer long short-term memory network;
and calculating the output value of the last LSTM module of the second layer of long-short term memory network and the output value of the classification network to obtain the attention characteristic value.
The system also comprises a pre-training module configured to train the deep network module with triplet elements <x, x_p, x_n>: a triplet <x, x_p, x_n> is input into the deep network module, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; the corresponding network outputs are obtained, and the deep network model is updated by back-propagating a loss function;
and updating continues until the loss function falls below a threshold or a set number of iterations is reached.
The loss function is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0), with d(a, b) denoting the Euclidean distance ||f(a) - f(b)|| between the network outputs,
where α is a margin parameter, x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
the multi-scale attention network is utilized to realize a target retrieval frame integrating feature extraction and distance measurement, and compared with the traditional algorithm, the processing speed and accuracy are improved better;
through a plurality of modules with different scales and attention and context modeling among the modules, the characteristics of network learning can better resist the influence caused by scale and background interference when image content is searched and matched, and the performance of image searching is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flowchart of an image feature expression method based on a visual attention mechanism according to a first embodiment;
FIG. 2 is a flow chart of a method for expressing image features based on a visual attention mechanism;
FIG. 3 is a block diagram of an image feature representation system based on a visual attention mechanism.
Detailed Description
Example one
As shown in fig. 1, the present invention provides an image feature expression method based on a visual attention mechanism, including:
inputting the picture into a trained deep network model to perform feature extraction on the picture to obtain an attention feature value of the picture,
calculating the distance between the attention characteristic value of the picture and the characteristic value of the target picture,
and selecting a plurality of target pictures with the closest distances for display.
Preferably, training the deep network model comprises extracting category-level and instance-level features of the same target object appearing in different images through a deep neural network with a specific structure, so that in a high-dimensional Euclidean space the features of the same target object are close to each other while the features of different objects are far apart;
after preparing the data and the corresponding annotation information, forward propagation and backward propagation are executed repeatedly according to the deep neural network training procedure until the output loss function converges.
Preferably, the design of the deep network model includes a classification network with a CNN structure as the base network,
a visual attention module inserted after each intermediate convolutional layer,
and two layers of long short-term memory (LSTM) networks;
the output value of each visual attention module is input to an LSTM module of the first-layer LSTM network; the hidden state output by an LSTM module of the first-layer LSTM network is input both to the corresponding LSTM module of the second-layer LSTM network and to the next LSTM module of the first-layer LSTM network; the hidden state output by an LSTM module of the second-layer LSTM network is input to the next LSTM module of the second-layer LSTM network.
Preferably, the output value of the last LSTM module of the second layer Long Short Term Memory (LSTM) network is the final visual attention feature map, and the final visual attention feature map is combined with the output of the last convolutional layer of the classification network and output as the final feature representation map. That is, the output of the last LSTM module of the second layer is treated as the final attention weight.
F′(x)=F(x)+F(x)⊙S(x)
where S(x) is the output value of the last LSTM module of the second-layer LSTM, and F(x) is the output value of the classification network.
Preferably, the first LSTM module of the second-layer long short-term memory (LSTM) network receives as input the hidden state value of the first LSTM module of the first-layer LSTM network and a multi-scale context memory M, where M is obtained by averaging the hidden states h_t over all T steps. It can be expressed as follows:
M = (1/T) Σ_{t=1}^{T} h_t
where h_t is the hidden state value at step t.
After adding the attention gate g_t, the LSTM cell state update rule in the second-layer LSTM changes as follows:
c_t = g_t ⊙ i_t ⊙ u_t + (1 - g_t) ⊙ f_t ⊙ c_{t-1}
where g_t is the normalized attention gate for the step-t input, i_t is the input gate at step t, f_t is the forget gate at step t, u_t is the intermediate variable at step t, and c_t is the memory cell at step t.
Preferably, a triple loss function is adopted to train the deep network model.
A triplet loss function is used to train our proposed model. The triplet network aims at projecting samples into an embedding space in which samples belonging to the same class lie closer together than samples from different classes. <x, x_p, x_n> denotes a triplet element, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class. The constraint can be expressed as:
d(x, x_p) + α ≤ d(x, x_n)
where α is a scalar that controls the margin between positive and negative samples, and d(·, ·) denotes the Euclidean distance between network outputs. The corresponding triplet loss is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0)
where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
Due to the size of the training data and the sensitivity of triplet selection, directly optimizing the triplet loss function is inefficient: each iteration requires tens of triplet elements when computing the loss, but only a few of them may violate the constraint. Therefore, we perform online hard sample mining to improve training efficiency. We define a hard sample as one that does not satisfy the margin constraint. These "filtered" hard triplet elements are then fed into the network again to compute the loss and perform back propagation.
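A minimal sketch of this loss with online hard sample mining, assuming batched anchor/positive/negative embeddings and an illustrative margin value, might look as follows (the batching scheme is an assumption, not the exact mining procedure of the application):

    import torch

    def triplet_loss_with_hard_mining(f_a, f_p, f_n, alpha=0.3):
        # f_a, f_p, f_n: (N, D) embeddings of anchors, positives and negatives.
        d_pos = torch.norm(f_a - f_p, dim=1)   # d(x, x_p)
        d_neg = torch.norm(f_a - f_n, dim=1)   # d(x, x_n)
        # A triplet is "hard" when it violates d(x, x_p) + alpha <= d(x, x_n).
        hard = d_pos + alpha > d_neg
        if hard.sum() == 0:
            # No violating triplets in this batch: contribute a zero loss
            # while keeping the computation graph intact.
            return d_pos.sum() * 0.0
        # Only the filtered hard triplets contribute to the back-propagated loss.
        losses = torch.clamp(d_pos[hard] - d_neg[hard] + alpha, min=0.0)
        return losses.mean()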
Preferably, the visual attention module can be viewed as an additional branch that computes an importance score for each location in the feature map. Given an input x, the output features f(x) of the network are obtained, and the attention module computes an attention (importance) score s(x), which is softly weighted onto the output features. The gated value of the base branch can be expressed as
d_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x)
where i, j denotes a position on the feature map; d_{i,j}(x) is the gated value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); and ⊙ is the element-wise product.
This element-wise product suppresses interfering responses and boosts the response of the object of interest. The model can be trained end-to-end using the back-propagation algorithm; the partial derivative of d_{i,j}(x) with respect to the attention module parameters θ is given by
∂d_{i,j}(x)/∂θ = f_{i,j}(x) ⊙ ∂s_{i,j}(x)/∂θ
where θ denotes the parameters of the attention module, and f(x) is the output feature. Notably, s_{i,j}(x) is constrained to be non-negative during training.
To build the multi-scale context, we add an attention module after each residual block of ResNet-101. Specifically, our attention module consists of two convolutional layers with kernel size 1 × 1, and the attention score for each location is obtained by applying the softmax function to the output of the second layer. Directly feeding the weighted features into the next layer may affect the stability of network learning: since the attention weights range from 0 to 1, the original identity mapping in the residual block is changed. Therefore, we also add an identity mapping to the attention, which can be expressed as follows:
b_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x) + f_{i,j}(x)
where i, j denotes a position on the feature map; b_{i,j}(x) is the value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); f(x) is the output feature; and ⊙ is the element-wise product.
The motivation is similar to residual learning: the identity mapping in the attention module ensures that adding attention performs no worse than the network without it.
Example two
As shown in FIG. 2, the present invention relates to an image feature expression method based on a visual attention mechanism, which comprises
a deep network training step, which comprises extracting category-level and instance-level features from different pictures of the same target object through a deep neural network with a specific structure, so that in a high-dimensional Euclidean space the features of the same target object are close to each other while the features of different objects are far apart;
an accurate target retrieval step, which comprises extracting features from the pictures with the trained deep network model, calculating the Euclidean distances between several pictures in the Euclidean space, and achieving accurate target retrieval through ranking;
in the deep neural network training step, the method further comprises the following steps:
a) a network structure design step: a multi-scale visual attention module is designed to obtain final visual attention that better resists background interference and scale variation, and ultimately helps generate a discriminative global image descriptor.
b) a model training step: after preparing the data and the corresponding annotation information, forward propagation and backward propagation are executed repeatedly on the network model designed in step a) according to the deep neural network training procedure until the output loss function converges.
In the step of designing the network structure, the method further comprises the following steps:
a) selecting a general classification network as a basic network structure;
b) inserting a plurality of visual attention modules in the middle of the classification network;
c) inputting the attention maps output by the visual attention modules of each layer into an LSTM (long short-term memory) network, which finally outputs the final visual attention feature map.
d) weighting the final depth features of the network with the final visual attention feature map to obtain the final global image descriptor.
e) applying the triplet loss on the final global image descriptor.
The LSTM long short-term memory network step further comprises:
a) inputting the visual attention feature map output by each attention module into the LSTM;
b) averaging the hidden states output by the first LSTM layer to obtain the context information of visual attention at different scales;
c) dynamically introducing the visual attention information in the first-layer LSTM, together with the context information, into the LSTM according to a gating mechanism to generate the final visual attention.
In the retrieval step, the method further comprises:
a) extracting features of all target pictures with the trained network model;
b) calculating the Euclidean distances between the features of different pictures;
c) ranking by distance to obtain the degree of difference between pictures and achieve accurate target retrieval (a retrieval sketch is given below).
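A brief retrieval sketch under these steps, assuming pre-extracted descriptors stored as tensors, could be:

    import torch

    def retrieve(query_feat, gallery_feats, top_k=10):
        # query_feat: (D,) descriptor of the query picture.
        # gallery_feats: (N, D) descriptors of all target pictures.
        dists = torch.norm(gallery_feats - query_feat.unsqueeze(0), dim=1)  # Euclidean distances
        dists_sorted, order = torch.sort(dists)   # ascending: most similar first
        return order[:top_k], dists_sorted[:top_k]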
Specifically, this context within the attention sequence is modeled by a two-layer long short-term memory (LSTM) network: the first layer encodes the attention maps at different scales and generates an initial multi-scale context memory. This context memory is then input to the second LSTM layer to help the network selectively focus on informative attention and further produce multi-scale-aware attention. If the attention response at a particular scale is informative with respect to the multi-scale context, the LSTM network will import more attention information from that scale.
a) Selecting a general classification network as a basic network structure; preferably, the classification network may be of a basic CNN structure;
b) inserting a visual attention module after each intermediate convolutional layer of the classification network;
Preferably, we design a soft attention module to actively select responses in the network. For example, using ResNet-101 as the base network, the attention module can be viewed as an additional branch that computes an importance score for each location in the feature map. Given an input x, the output features f(x) of the network are obtained, and the attention module computes an attention (importance) score s(x), which is softly weighted onto the output features of the network. The output attention score s_{i,j}(x) can be regarded as a gate on the base branch; the gated value d_{i,j}(x) of the base branch can be expressed as:
d_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x)
where i, j denotes a position on the feature map; d_{i,j}(x) is the gated value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); and ⊙ is the element-wise product.
the process is a dot product process of one element, so that the interference response can be suppressed and the response of the object of interest can be boosted. The model can be trained end-to-end using a back propagation algorithm, di,j(x) The partial derivative of (a) is given by:
Figure BDA0002186918780000101
where θ is a parameter in the attention module; wherein i, j represents a position on the entire feature map; di,j(x) The value of the intermediate variable at i, j for the base branch; si,j(x) The attention score at i, j is output; f. ofi,j(x) Is the value of the output feature at i, j; (x) is an output characteristic; notably, during the training period si,j(x) Is restricted to be non-negative.
To build the multi-scale context, we add an attention module after each residual block of ResNet-101. Specifically, our attention module consists of two convolutional layers with kernel size 1 × 1, and the attention score for each location is obtained by applying the softmax function to the output of the second layer. Directly feeding the weighted features into the next layer may affect the stability of network learning: since the attention weights range from 0 to 1, the original identity mapping in the residual block is changed. Therefore, we also add an identity mapping to the attention, which can be expressed as follows:
b_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x) + f_{i,j}(x)
where i, j denotes a position on the feature map; b_{i,j}(x) is the value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); f(x) is the output feature; and ⊙ is the element-wise product.
The motivation is similar to residual learning: the identity mapping in the attention module ensures that adding attention performs no worse than the network without it.
c) The attention map output by the visual attention module of each layer is input into the corresponding LSTM long short-term memory network, and the LSTM finally outputs the final visual attention feature map.
A long short-term memory (LSTM) network is constructed by stacking LSTM cells. A typical LSTM cell includes an input gate i_t, a forget gate f_t, an output gate o_t, a hidden state h_t and a memory cell c_t. The computation of an LSTM cell can be expressed as follows:
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
u_t = tanh(W_u x_t + U_u h_{t-1} + b_u)
c_t = i_t ⊙ u_t + f_t ⊙ c_{t-1}
h_t = o_t ⊙ tanh(c_t)
where x_t is the input at step t, u_t is the intermediate variable (the modulated input) at step t, h_{t-1} is the hidden state of the previous step, σ denotes the sigmoid activation function in the LSTM cell, c_t is the memory cell at step t, and W, U and b are learnable weights and biases.
The operation ⊙ denotes the element-wise product. The three gates, the input gate i_t, the forget gate f_t and the output gate o_t, are different components of the LSTM cell and serve different purposes: the input gate determines how much information from the modulated input u_t is imported to update c_t, the forget gate f_t controls how much of the previous state c_{t-1} is retained at step t, and the output gate determines how much of the memory cell is output.
d) And weighting the final visual attention feature map to the final depth feature of the network to obtain a final global image descriptor.
Since we aim to selectively introduce attention from different scales, a multi-scale context memory must be obtained. We use the output of the first LSTM layer to generate the multi-scale context memory. Specifically, we use the average of the hidden states h_t over all T steps in the first LSTM layer to obtain the context memory M, which can be expressed as follows:
M = (1/T) Σ_{t=1}^{T} h_t
At each step t, the LSTM accepts the attention map from a particular scale. We also considered feeding all hidden states of the first layer to another feed-forward network, but to keep the model simple and compact we adopt the averaging method, since adding another sub-network would introduce more parameters.
Attention in the second-layer LSTM: we further evaluate how informative the input at each step is at the second level, where an attention gate g_t is set to selectively control the attention information imported from each scale; it receives the input h_t and the multi-scale context memory M, which can be expressed as follows:
Figure BDA0002186918780000121
Figure BDA0002186918780000122
where g_t is the normalized attention gate for the step-t input, h_t is the hidden state from the first layer at step t, and M is the context memory.
After adding the attention gate g_t, the cell state update rule in the second-layer LSTM changes as follows:
c_t = g_t ⊙ i_t ⊙ u_t + (1 - g_t) ⊙ f_t ⊙ c_{t-1}
where g_t is the normalized attention gate for the step-t input, i_t is the input gate at step t, f_t is the forget gate at step t, u_t is the intermediate variable at step t, and c_t is the memory cell at step t.
This means that if the input attention h_t is informative with respect to the multi-scale context, the cells in the second layer will import more attention information from it; if it carries little information, we tend to block it and rely more on the historical information held in the LSTM cell.
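The following sketch illustrates this gated update. The exact expression for g_t appears only as an equation image in the publication, so the sigmoid gate computed from [h_t, M] below, like the layer sizes, is an assumption made purely for illustration.

    import torch
    import torch.nn as nn

    class GatedContextLSTMCell(nn.Module):
        # Second-layer LSTM cell with the update
        #   c_t = g_t * i_t * u_t + (1 - g_t) * f_t * c_{t-1}.
        def __init__(self, input_dim, hidden_dim):
            super().__init__()
            # One linear map producing pre-activations for i_t, f_t, o_t and u_t.
            self.ifou = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)
            # Assumed form of the attention gate g_t, computed from [h_t, M].
            self.gate = nn.Linear(2 * input_dim, 1)

        def forward(self, x_t, M, h_prev, c_prev):
            # x_t: (N, D_in) attention input at step t (first-layer hidden state h_t).
            # M:   (N, D_in) multi-scale context memory, e.g. the mean of the
            #      first-layer hidden states over all T steps.
            z = self.ifou(torch.cat([x_t, h_prev], dim=1))
            i_t, f_t, o_t, u_t = z.chunk(4, dim=1)
            i_t, f_t, o_t = torch.sigmoid(i_t), torch.sigmoid(f_t), torch.sigmoid(o_t)
            u_t = torch.tanh(u_t)                                        # modulated input
            g_t = torch.sigmoid(self.gate(torch.cat([x_t, M], dim=1)))  # (N, 1), broadcast
            # Import attention information when g_t is large; otherwise rely on
            # the historical memory c_{t-1}.
            c_t = g_t * i_t * u_t + (1.0 - g_t) * f_t * c_prev
            h_t = o_t * torch.tanh(c_t)
            return h_t, c_t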
The goal of our MSCAN is to generate more discriminative global descriptors that can learn the multi-scale context when modeling attention. The output of the last convolutional layer is then combined with the final attention as follows:
F′(x)=F(x)+F(x)⊙S(x)
then we perform a global maximization pooling operation on F' (x) to generate a depth global descriptor; s (x) is the output value of the last LSTM module of the second layer of LSTM.
We adopt a triplet loss function to train our proposed model. The triplet network aims at projecting samples into an embedding space in which samples belonging to the same class lie closer together than samples from different classes. <x, x_p, x_n> denotes a triplet element, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class. The constraint can be expressed as:
d(x, x_p) + α ≤ d(x, x_n)
where α is a scalar that controls the margin between positive and negative samples, and d(·, ·) denotes the Euclidean distance between network outputs. The corresponding triplet loss is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0)
where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
Due to the size of the training data and the sensitivity of triplet selection, directly optimizing the triplet loss function is inefficient: each iteration requires tens of triplet elements when computing the loss, but only a few of them may violate the constraint. Therefore, we perform online hard sample mining to improve training efficiency. We define a hard sample as one that does not satisfy the margin constraint. These "filtered" hard triplet elements are then fed into the network again to compute the loss and perform back propagation.
The main contributions of this patent are as follows:
first, we propose a multi-scale contextual attention network that stacks multiple attention modules in network layers of different depths and scales. Therefore, we can capture the region with the largest amount of visual information from multiple scales.
Second, we model contextual information of visual attention at different scales. Such contextual information is modeled by an LSTM network with contextual memory capability to adaptively select visual attention information from multiple scales.
Thirdly, the method provided by the invention achieves excellent performance on all evaluated image retrieval benchmarks, and the visualization result further proves the effectiveness of the method.
An image feature expression system based on a visual attention mechanism comprises a depth network module and a feature calculation module,
the deep network module is used for inputting the picture into the trained deep network module to perform feature extraction on the picture to obtain the attention feature value of the picture,
the characteristic calculation module is used for calculating the distance between the characteristic value of the picture and the characteristic value of the target picture and selecting a plurality of target pictures with the closest distance for display.
The deep network module comprises
a classification network module, wherein visual attention modules are inserted between some of its layers, and the output value of the preceding layer is processed by the visual attention module before being input into the next layer;
the system also comprises two layers of long short-term memory (LSTM) networks, and the LSTM modules in the LSTM networks correspond one-to-one to the visual attention modules;
the output value of the visual attention module is input into an LSTM module of the first-layer long short-term memory network; the hidden state value of the LSTM module of the first-layer long short-term memory network is input into the LSTM module of the second-layer long short-term memory network;
and calculating the output value of the last LSTM module of the second layer of long-short term memory network and the output value of the classification network to obtain the attention characteristic value.
The system also comprises a pre-training module configured to train the deep network module with triplet elements <x, x_p, x_n>: a triplet <x, x_p, x_n> is input into the deep network module, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; the corresponding network outputs are obtained, and the deep network model is updated by back-propagating a loss function;
and updating continues until the loss function falls below a threshold or a set number of iterations is reached.
The loss function is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0), with d(a, b) denoting the Euclidean distance ||f(a) - f(b)|| between the network outputs,
where α is a margin parameter, x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
Preferably, the output value of the preceding layer is processed by the visual attention module to obtain the value of the base branch, and the value of the base branch is input into the next layer, where the processing is computed as
b_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x) + f_{i,j}(x)
where i, j denotes a position on the feature map; b_{i,j}(x) is the value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); f(x) is the output feature; and ⊙ is the element-wise product.
Preferably, the attention feature value is calculated from the output value of the last LSTM module of the second-layer long short-term memory network and the output value of the classification network as
F′(x) = F(x) + F(x) ⊙ S(x)
where S(x) is the output value of the last LSTM module of the second-layer LSTM and F(x) is the output value of the classification network.
Preferably, the LSTM module state update rule of the second-layer long short-term memory network is
c_t = g_t ⊙ i_t ⊙ u_t + (1 - g_t) ⊙ f_t ⊙ c_{t-1}
where g_t is the normalized attention gate for the step-t input, i_t is the input gate at step t, f_t is the forget gate at step t, u_t is the intermediate variable at step t, and c_t is the memory cell at step t.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. An image feature expression method based on a visual attention mechanism is characterized by comprising
Inputting the picture into a trained deep network model to perform feature extraction on the picture to obtain an attention feature value of the picture,
calculating the distance between the attention characteristic value of the picture and the attention characteristic value of the target picture,
and selecting a plurality of target pictures with the closest distances for display.
2. The image feature expression method based on the visual attention mechanism as claimed in claim 1, wherein the depth network model comprises
a classification network, wherein visual attention modules are inserted between some of its convolutional layers, and the output of the preceding convolutional layer is processed by the visual attention module before being input into the next convolutional layer;
two layers of long and short term memory networks, wherein the LSTM modules in each layer of long and short term memory network correspond to the visual attention modules one by one;
the output value of the visual attention module is input into an LSTM module of a first layer long-short term memory network; the hidden state value of the LSTM module of the first layer long-short term memory network is input into the LSTM module of the second layer long-short term memory network;
and calculating the output value of the last LSTM module of the second layer of long-short term memory network and the output value of the classification network to obtain the attention characteristic value.
3. The image feature expression method based on the visual attention mechanism as claimed in claim 2, wherein the training of the deep network model comprises
using triplet elements <x, x_p, x_n> to train the deep network model: a triplet <x, x_p, x_n> is input into the deep network model, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; the corresponding network outputs are obtained, and the deep network model is updated by back-propagating a loss function;
and returning to continue updating until the loss function falls below a threshold or a set number of iterations is reached.
4. The image feature expression method based on the visual attention mechanism as claimed in claim 3, wherein the loss function is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0), with d(a, b) denoting the Euclidean distance ||f(a) - f(b)|| between the network outputs,
where α is a margin parameter, x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
5. The image feature expression method based on the visual attention mechanism as claimed in claim 4, wherein inputting the output value of the preceding convolutional layer into the next convolutional layer after processing by the visual attention module comprises
processing the output value of the preceding convolutional layer with the visual attention module to obtain the value of the base branch, and inputting the value of the base branch into the next convolutional layer, where the processing is computed as
b_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x) + f_{i,j}(x)
where i, j denotes a position on the feature map; b_{i,j}(x) is the value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); f(x) is the output feature; and ⊙ is the element-wise product.
6. The image feature expression method based on the visual attention mechanism as claimed in claim 5, wherein calculating the attention feature value from the output value of the last LSTM module of the second-layer long short-term memory network and the output value of the classification network comprises
attention feature value F′(x) = F(x) + F(x) ⊙ S(x)
where S(x) is the output value of the last LSTM module of the second-layer LSTM and F(x) is the output value of the classification network.
7. The image feature expression method based on the visual attention mechanism as claimed in claim 6, wherein the LSTM module state update rule of the second-layer long short-term memory network is
c_t = g_t ⊙ i_t ⊙ u_t + (1 - g_t) ⊙ f_t ⊙ c_{t-1}
where g_t is the normalized attention gate for the step-t input, i_t is the input gate at step t, f_t is the forget gate at step t, u_t is the intermediate variable at step t, and c_t is the memory cell at step t.
8. An image feature expression system based on a visual attention mechanism is characterized by comprising a deep network module and a feature calculation module,
the deep network module is used for inputting the picture into the trained deep network module to perform feature extraction on the picture to obtain the attention feature value of the picture,
the feature calculation module is used for calculating the distance between the attention feature value of the picture and the feature value of the target picture, and selecting a plurality of target pictures with the closest distances for display.
9. The image feature expression system based on the visual attention mechanism as claimed in claim 8, wherein the deep network module comprises
a classification network module, wherein visual attention modules are inserted between some of its convolutional layers, and the output of the preceding convolutional layer is processed by the visual attention module before being input into the next convolutional layer;
the system also comprises two layers of long short-term memory (LSTM) networks, and the LSTM modules in the LSTM networks correspond one-to-one to the visual attention modules;
the output value of the visual attention module is input into an LSTM module of the first-layer long short-term memory network; the hidden state value of the LSTM module of the first-layer long short-term memory network is input into the LSTM module of the second-layer long short-term memory network;
and calculating the output value of the last LSTM module of the second layer of long-short term memory network and the output value of the classification network to obtain the attention characteristic value.
10. The image feature expression system based on the visual attention mechanism as claimed in claim 9, further comprising a pre-training module configured to train the deep network module with triplet elements <x, x_p, x_n>: a triplet <x, x_p, x_n> is input into the deep network module, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; the corresponding network outputs are obtained, and the deep network model is updated by back-propagating a loss function;
and updating continues until the loss function falls below a threshold or a set number of iterations is reached;
the loss function is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0), with d(a, b) denoting the Euclidean distance ||f(a) - f(b)|| between the network outputs,
where α is a margin parameter, x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
CN201910818508.5A 2019-08-30 2019-08-30 Image feature expression method and system based on visual attention mechanism Pending CN110704665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910818508.5A CN110704665A (en) 2019-08-30 2019-08-30 Image feature expression method and system based on visual attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910818508.5A CN110704665A (en) 2019-08-30 2019-08-30 Image feature expression method and system based on visual attention mechanism

Publications (1)

Publication Number Publication Date
CN110704665A true CN110704665A (en) 2020-01-17

Family

ID=69194227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910818508.5A Pending CN110704665A (en) 2019-08-30 2019-08-30 Image feature expression method and system based on visual attention mechanism

Country Status (1)

Country Link
CN (1) CN110704665A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611420A (en) * 2020-05-26 2020-09-01 北京字节跳动网络技术有限公司 Method and apparatus for generating image description information
CN111696137A (en) * 2020-06-09 2020-09-22 电子科技大学 Target tracking method based on multilayer feature mixing and attention mechanism
CN111709458A (en) * 2020-05-25 2020-09-25 中国自然资源航空物探遥感中心 Automatic quality inspection method for top-resolution five-number images

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291945A (en) * 2017-07-12 2017-10-24 上海交通大学 The high-precision image of clothing search method and system of view-based access control model attention model
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108647736A (en) * 2018-05-16 2018-10-12 南京大学 A kind of image classification method based on perception loss and matching attention mechanism
CN109902750A (en) * 2019-03-04 2019-06-18 山西大学 Method is described based on two-way single attention mechanism image
CN110084128A (en) * 2019-03-29 2019-08-02 安徽艾睿思智能科技有限公司 Scene chart generation method based on semantic space constraint and attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN107291945A (en) * 2017-07-12 2017-10-24 上海交通大学 The high-precision image of clothing search method and system of view-based access control model attention model
CN108647736A (en) * 2018-05-16 2018-10-12 南京大学 A kind of image classification method based on perception loss and matching attention mechanism
CN109902750A (en) * 2019-03-04 2019-06-18 山西大学 Method is described based on two-way single attention mechanism image
CN110084128A (en) * 2019-03-29 2019-08-02 安徽艾睿思智能科技有限公司 Scene chart generation method based on semantic space constraint and attention mechanism

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
ASKE R. LEJBØLLE等: "Attention in Multimodal Neural Networks for Person Re-identification", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW)》 *
YIHANG LOU等: "Multi-Scale Context Attention Network for Image Retrieval", 《MM "18: PROCEEDINGS OF THE 26TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA》 *
ZHOUXIA WANG等: "Multi-label Image Recognition by Recurrently Discovering Attentional Regions", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 *
周博通 et al.: "Automatic question answering over large-scale knowledge bases based on LSTM" (基于LSTM的大规模知识库自动问答), Journal of Peking University (Natural Science Edition) *
李玉刚 et al.: "Research on attention-based recognition of visual relationships in images" (基于注意力的图像视觉关系识别研究), Journal of Communication University of China (Natural Science Edition) *
牛斌 et al.: "An image captioning method based on attention mechanism and multimodality" (一种基于注意力机制与多模态的图像描述方法), Journal of Liaoning University (Natural Science Edition) *
陈宜明 et al.: "A distributed visual retrieval model based on latent topics" (基于潜在主题的分布式视觉检索模型), Computer Engineering (计算机工程) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709458A (en) * 2020-05-25 2020-09-25 中国自然资源航空物探遥感中心 Automatic quality inspection method for top-resolution five-number images
CN111709458B (en) * 2020-05-25 2021-04-13 中国自然资源航空物探遥感中心 Automatic quality inspection method for top-resolution five-number images
CN111611420A (en) * 2020-05-26 2020-09-01 北京字节跳动网络技术有限公司 Method and apparatus for generating image description information
CN111611420B (en) * 2020-05-26 2024-01-23 北京字节跳动网络技术有限公司 Method and device for generating image description information
CN111696137A (en) * 2020-06-09 2020-09-22 电子科技大学 Target tracking method based on multilayer feature mixing and attention mechanism
CN111696137B (en) * 2020-06-09 2022-08-02 电子科技大学 Target tracking method based on multilayer feature mixing and attention mechanism

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110334705B (en) Language identification method of scene text image combining global and local information
Wen et al. End-to-end detection-segmentation system for face labeling
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN112966127A (en) Cross-modal retrieval method based on multilayer semantic alignment
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN108921047B (en) Multi-model voting mean value action identification method based on cross-layer fusion
CN111444968A (en) Image description generation method based on attention fusion
CN108446334B (en) Image retrieval method based on content for unsupervised countermeasure training
CN110704665A (en) Image feature expression method and system based on visual attention mechanism
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN114387366A (en) Method for generating image by sensing combined space attention text
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN113095251B (en) Human body posture estimation method and system
CN112836702B (en) Text recognition method based on multi-scale feature extraction
CN113111968A (en) Image recognition model training method and device, electronic equipment and readable storage medium
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Rizvi et al. Deep extreme learning machine-based optical character recognition system for nastalique urdu-like script languages
CN110991500A (en) Small sample multi-classification method based on nested integrated depth support vector machine
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN115035341A (en) Image recognition knowledge distillation method capable of automatically selecting student model structure
CN114997287A (en) Model training and data processing method, device, equipment and storage medium
CN112613451A (en) Modeling method of cross-modal text picture retrieval model
CN117009570A (en) Image-text retrieval method and device based on position information and confidence perception

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200117

WD01 Invention patent application deemed withdrawn after publication