CN110704665A - Image feature expression method and system based on visual attention mechanism - Google Patents

Image feature expression method and system based on visual attention mechanism

Info

Publication number
CN110704665A
Authority
CN
China
Prior art keywords: module, network, attention, value, layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910818508.5A
Other languages
Chinese (zh)
Inventor
段凌宇 (Lingyu Duan)
白燕 (Yan Bai)
楼燚航 (Yihang Lou)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2019-08-30
Publication date: 2020-01-17
Application filed by Peking University
Priority to CN201910818508.5A
Publication of CN110704665A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval of still image data
    • G06F 16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583: Retrieval using metadata automatically derived from the content
    • G06F 16/53: Querying
    • G06F 16/532: Query formulation, e.g. graphical querying
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of computer vision, in particular to an image feature expression method and system based on a visual attention mechanism. A picture is input into a trained deep network model, which extracts features from the picture to obtain its attention feature value; the distance between the feature value of the picture and the feature values of target pictures is calculated, and the several closest target pictures are selected for display. The invention uses a multi-scale attention network to realize a target retrieval framework that integrates feature extraction and distance measurement; compared with traditional algorithms, both processing speed and accuracy are improved.

Description

Image feature expression method and system based on visual attention mechanism
Technical Field
The invention relates to the field of computer vision, in particular to an image feature expression method and system based on a visual attention mechanism.
Background
Image retrieval, which aims at retrieving from an image dataset the images that depict the same particular object as a given query image, has attracted considerable research attention. The success of convolutional neural networks has greatly advanced image retrieval in recent years, benefiting from their discriminative power and compact representations. Although significant performance gains have been achieved with deep-learning-based image descriptors, two challenges remain in practical applications: background interference and scale variation. First, background clutter, as extraneous information, severely degrades the feature representation of the informative regions used for image retrieval; second, the objects of interest in the query and reference images usually differ in scale. In this work, we focus on multi-scale feature representations of the information-rich regions in the image.
In practical scenarios, background interference can severely disrupt feature matching; for image retrieval, attending to the information-rich regions of an image helps produce effective features. Recently, most CNN-based features are trained as global descriptors using siamese or triplet networks. These global features are extracted directly from the output of the last convolutional layer using a max or average pooling operation, which handles complex scenes poorly: the target objects in the images are often misaligned and, in extreme cases, occupy only a small part of the image. It is therefore necessary to selectively focus on certain regions and ignore irrelevant ones. This selective focusing, known as the attention mechanism, has proven effective in a variety of research areas, such as machine translation, speech recognition and image captioning. One typical way to apply attention in a CNN is to predict attention maps, where the value at each location indicates the amount of information at the corresponding position.
Scale is another main factor influencing feature representation in image retrieval, and the attended regions may differ across scales. One representative work is the Scale-Invariant Feature Transform (SIFT), which finds extremal responses in a multi-scale Gaussian pyramid as feature points for image matching. However, in current deep-learning-based approaches, the multi-scale context, i.e., the relevance between regions of interest at different scales, has not been fully explored. Current networks for generating scale-robust features are typically equipped with data augmentation during training (e.g., randomly resizing or cropping training images) or obtain fully-connected features from input images of different scales as the final feature. In extreme cases where the object of interest occupies only a small part of the input image, its response is hard to retain as the feature map keeps shrinking during forward propagation. To obtain reliable attention at different scales, multi-scale context information must be acquired.
Disclosure of Invention
The embodiments of the invention provide an image feature expression method and system based on a visual attention mechanism. A multi-scale attention network is used to realize a target retrieval framework that integrates feature extraction and distance measurement; compared with traditional algorithms, both processing speed and accuracy are improved.
According to a first aspect of the embodiments of the present invention, an image feature expression method based on a visual attention mechanism includes
Inputting the picture into a trained deep network model to perform feature extraction on the picture to obtain an attention feature value of the picture,
calculating the distance between the attention characteristic value of the picture and the characteristic value of the target picture,
and selecting a plurality of target pictures with the closest distances for display.
The deep network model comprises
a classification network, wherein visual attention modules are inserted between some of its convolutional layers, and the output of the preceding convolutional layer is processed by the visual attention module before being input into the next convolutional layer;
two layers of long short-term memory (LSTM) networks, wherein the LSTM modules in each layer correspond one-to-one to the visual attention modules;
the output value of the visual attention module is input into an LSTM module of a first layer long-short term memory network; the hidden state value of the LSTM module of the first layer long-short term memory network is input into the LSTM module of the second layer long-short term memory network;
and calculating the output value of the last LSTM module of the second layer of long-short term memory network and the output value of the classification network to obtain the attention characteristic value.
The training of the deep network model comprises
using triplet elements <x, x_p, x_n> to train the deep network model: a triplet <x, x_p, x_n> is input into the deep network model, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; the corresponding network outputs are obtained, and the deep network model is updated by back-propagating a loss function;
and returning to continue updating until the loss function falls below a threshold or a set number of iterations is reached (a minimal training-loop sketch is given below).
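By way of illustration only, the following Python sketch shows one way such a training loop could be organized; the model, data loader, optimizer, margin and stopping thresholds are assumed placeholders rather than the configuration disclosed in this application.

    import torch
    import torch.nn.functional as F

    def train_triplet_model(model, triplet_loader, margin=0.3, lr=1e-4,
                            loss_threshold=1e-3, max_iterations=10000):
        # Hypothetical training loop: triplets <x, x_p, x_n> are fed to the model,
        # the triplet loss is back-propagated, and training stops once the loss
        # falls below a threshold or the iteration budget is exhausted.
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        iteration = 0
        for anchor, positive, negative in triplet_loader:
            f_a, f_p, f_n = model(anchor), model(positive), model(negative)
            loss = F.triplet_margin_loss(f_a, f_p, f_n, margin=margin)
            optimizer.zero_grad()
            loss.backward()      # back-propagate the loss function
            optimizer.step()     # update the deep network model
            iteration += 1
            if loss.item() < loss_threshold or iteration >= max_iterations:
                break
        return model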
The loss function is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0), with d(a, b) denoting the Euclidean distance ||f(a) - f(b)|| between the network outputs,
where α is a margin parameter, x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
The output value of the preceding convolutional layer is processed by the visual attention module and then input into the next convolutional layer, which specifically comprises:
processing the output value of the preceding convolutional layer with the visual attention module to obtain the value of the base branch, and inputting the value of the base branch into the next convolutional layer, where the processing is computed as
b_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x) + f_{i,j}(x)
where i, j denotes a position on the feature map; b_{i,j}(x) is the value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); f(x) is the output feature; and ⊙ is the element-wise product.
The attention feature value is obtained from the output value of the last LSTM module of the second-layer long short-term memory network and the output value of the classification network as
F′(x) = F(x) + F(x) ⊙ S(x)
where S(x) is the output value of the last LSTM module of the second-layer LSTM and F(x) is the output value of the classification network.
The LSTM module state update rule of the second-layer long short-term memory network is
c_t = g_t ⊙ i_t ⊙ u_t + (1 - g_t) ⊙ f_t ⊙ c_{t-1}
where g_t is the normalized attention gate for the step-t input, i_t is the input gate at step t, f_t is the forget gate at step t, u_t is the intermediate variable at step t, and c_t is the memory cell at step t.
An image feature expression system based on a visual attention mechanism comprises a deep network module and a feature calculation module,
the deep network module is used for inputting the picture into the trained deep network module to perform feature extraction on the picture to obtain the attention feature value of the picture,
the feature calculation module is used for calculating the distance between the attention feature value of the picture and the feature value of the target picture, and selecting a plurality of target pictures with the closest distances for display.
The deep network module comprises
a classification network module, wherein visual attention modules are inserted between some of its convolutional layers, and the output of the preceding convolutional layer is processed by the visual attention module before being input into the next convolutional layer;
the system also comprises two layers of long short-term memory (LSTM) networks, and the LSTM modules in the LSTM networks correspond one-to-one to the visual attention modules;
the output value of the visual attention module is input into an LSTM module of the first-layer long short-term memory network; the hidden state value of the LSTM module of the first-layer long short-term memory network is input into the LSTM module of the second-layer long short-term memory network;
and calculating the output value of the last LSTM module of the second layer of long-short term memory network and the output value of the classification network to obtain the attention characteristic value.
The system also comprises a pre-training module configured to train the deep network module with triplet elements <x, x_p, x_n>: a triplet <x, x_p, x_n> is input into the deep network module, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; the corresponding network outputs are obtained, and the deep network model is updated by back-propagating a loss function;
and updating continues until the loss function falls below a threshold or a set number of iterations is reached.
The loss function is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0), with d(a, b) denoting the Euclidean distance ||f(a) - f(b)|| between the network outputs,
where α is a margin parameter, x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
the multi-scale attention network is utilized to realize a target retrieval frame integrating feature extraction and distance measurement, and compared with the traditional algorithm, the processing speed and accuracy are improved better;
through a plurality of modules with different scales and attention and context modeling among the modules, the characteristics of network learning can better resist the influence caused by scale and background interference when image content is searched and matched, and the performance of image searching is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flowchart of an image feature expression method based on a visual attention mechanism according to a first embodiment;
FIG. 2 is a flow chart of a method for expressing image features based on a visual attention mechanism;
FIG. 3 is a block diagram of an image feature representation system based on a visual attention mechanism.
Detailed Description
Example one
As shown in fig. 1, the present invention provides an image feature expression method based on a visual attention mechanism, including:
inputting the picture into a trained deep network model to perform feature extraction on the picture to obtain an attention feature value of the picture,
calculating the distance between the attention characteristic value of the picture and the characteristic value of the target picture,
and selecting a plurality of target pictures with the closest distances for display.
Preferably, training the deep network model comprises extracting category-level and instance-level features of the same target object appearing in different images through a deep neural network with a specific structure, so that in a high-dimensional Euclidean space the features of the same target object are close to each other while the features of different objects are far apart;
after preparing the data and the corresponding annotation information, forward propagation and backward propagation are executed repeatedly according to the deep neural network training procedure until the output loss function converges.
Preferably, the design of the deep network model includes a classification network with a CNN structure as the base network,
a visual attention module inserted after each intermediate convolutional layer,
and two layers of long short-term memory (LSTM) networks;
the output value of each visual attention module is input to an LSTM module of the first-layer LSTM network; the hidden state output by an LSTM module of the first-layer LSTM network is input both to the corresponding LSTM module of the second-layer LSTM network and to the next LSTM module of the first-layer LSTM network; the hidden state output by an LSTM module of the second-layer LSTM network is input to the next LSTM module of the second-layer LSTM network.
Preferably, the output value of the last LSTM module of the second layer Long Short Term Memory (LSTM) network is the final visual attention feature map, and the final visual attention feature map is combined with the output of the last convolutional layer of the classification network and output as the final feature representation map. That is, the output of the last LSTM module of the second layer is treated as the final attention weight.
F′(x)=F(x)+F(x)⊙S(x)
where S(x) is the output value of the last LSTM module of the second-layer LSTM, and F(x) is the output value of the classification network.
Preferably, the first LSTM module of the second-layer long short-term memory (LSTM) network receives as input the hidden state value of the first LSTM module of the first-layer LSTM network and a multi-scale context memory M, where M is obtained by averaging the hidden states h_t over all T steps. It can be expressed as follows:
M = (1/T) Σ_{t=1}^{T} h_t
where h_t is the hidden state value at step t.
After adding the attention gate g_t, the LSTM cell state update rule in the second-layer LSTM changes as follows:
c_t = g_t ⊙ i_t ⊙ u_t + (1 - g_t) ⊙ f_t ⊙ c_{t-1}
where g_t is the normalized attention gate for the step-t input, i_t is the input gate at step t, f_t is the forget gate at step t, u_t is the intermediate variable at step t, and c_t is the memory cell at step t.
Preferably, a triple loss function is adopted to train the deep network model.
A triplet loss function is used to train our proposed model. The triplet network aims at projecting samples into an embedding space in which samples belonging to the same class lie closer together than samples from different classes. <x, x_p, x_n> denotes a triplet element, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class. The constraint can be expressed as:
d(x, x_p) + α ≤ d(x, x_n)
where α is a scalar that controls the margin between positive and negative samples, and d(·, ·) denotes the Euclidean distance between network outputs. The corresponding triplet loss is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0)
where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
Due to the size of the training data and the sensitivity of triplet selection, directly optimizing the triplet loss function is inefficient: each iteration requires tens of triplet elements when computing the loss, but only a few of them may violate the constraint. Therefore, we perform online hard sample mining to improve training efficiency. We define a hard sample as one that does not satisfy the margin constraint. These "filtered" hard triplet elements are then fed into the network again to compute the loss and perform back propagation.
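A minimal sketch of this loss with online hard sample mining, assuming batched anchor/positive/negative embeddings and an illustrative margin value, might look as follows (the batching scheme is an assumption, not the exact mining procedure of the application):

    import torch

    def triplet_loss_with_hard_mining(f_a, f_p, f_n, alpha=0.3):
        # f_a, f_p, f_n: (N, D) embeddings of anchors, positives and negatives.
        d_pos = torch.norm(f_a - f_p, dim=1)   # d(x, x_p)
        d_neg = torch.norm(f_a - f_n, dim=1)   # d(x, x_n)
        # A triplet is "hard" when it violates d(x, x_p) + alpha <= d(x, x_n).
        hard = d_pos + alpha > d_neg
        if hard.sum() == 0:
            # No violating triplets in this batch: contribute a zero loss
            # while keeping the computation graph intact.
            return d_pos.sum() * 0.0
        # Only the filtered hard triplets contribute to the back-propagated loss.
        losses = torch.clamp(d_pos[hard] - d_neg[hard] + alpha, min=0.0)
        return losses.mean()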
Preferably, the visual attention module can be viewed as an additional branch that computes an importance score for each location in the feature map. Given an input x, the output features f(x) of the network are obtained, and the attention module computes an attention (importance) score s(x), which is softly weighted onto the output features. The gated value of the base branch can be expressed as
d_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x)
where i, j denotes a position on the feature map; d_{i,j}(x) is the gated value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); and ⊙ is the element-wise product.
This element-wise product suppresses interfering responses and boosts the response of the object of interest. The model can be trained end-to-end using the back-propagation algorithm; the partial derivative of d_{i,j}(x) with respect to the attention module parameters θ is given by
∂d_{i,j}(x)/∂θ = f_{i,j}(x) ⊙ ∂s_{i,j}(x)/∂θ
where θ denotes the parameters of the attention module, and f(x) is the output feature. Notably, s_{i,j}(x) is constrained to be non-negative during training.
To build the multi-scale context, we add an attention module after each residual block of ResNet-101. Specifically, our attention module consists of two convolutional layers with kernel size 1 × 1, and the attention score for each location is obtained by applying the softmax function to the output of the second layer. Directly feeding the weighted features into the next layer may affect the stability of network learning: since the attention weights range from 0 to 1, the original identity mapping in the residual block is changed. Therefore, we also add an identity mapping to the attention, which can be expressed as follows:
b_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x) + f_{i,j}(x)
where i, j denotes a position on the feature map; b_{i,j}(x) is the value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); f(x) is the output feature; and ⊙ is the element-wise product.
The motivation is similar to residual learning: the identity mapping in the attention module ensures that adding attention performs no worse than the network without it.
Example two
As shown in FIG. 2, the present invention relates to an image feature expression method based on a visual attention mechanism, which comprises
a deep network training step, which comprises extracting category-level and instance-level features from different pictures of the same target object through a deep neural network with a specific structure, so that in a high-dimensional Euclidean space the features of the same target object are close to each other while the features of different objects are far apart;
an accurate target retrieval step, which comprises extracting features from the pictures with the trained deep network model, calculating the Euclidean distances between several pictures in the Euclidean space, and achieving accurate target retrieval through ranking;
in the deep neural network training step, the method further comprises the following steps:
a) a network structure design step: a multi-scale visual attention module is designed to obtain final visual attention that better resists background interference and scale variation, and ultimately helps generate a discriminative global image descriptor.
b) a model training step: after preparing the data and the corresponding annotation information, forward propagation and backward propagation are executed repeatedly on the network model designed in step a) according to the deep neural network training procedure until the output loss function converges.
In the step of designing the network structure, the method further comprises the following steps:
a) selecting a general classification network as a basic network structure;
b) inserting a plurality of visual attention modules in the middle of the classification network;
c) inputting the attention maps output by the visual attention modules of each layer into an LSTM (long short-term memory) network, which finally outputs the final visual attention feature map.
d) weighting the final depth features of the network with the final visual attention feature map to obtain the final global image descriptor.
e) applying the triplet loss on the final global image descriptor.
The LSTM long short-term memory network step further comprises:
a) inputting the visual attention feature map output by each attention module into the LSTM;
b) averaging the hidden states output by the first LSTM layer to obtain the context information of visual attention at different scales;
c) dynamically introducing the visual attention information in the first-layer LSTM, together with the context information, into the LSTM according to a gating mechanism to generate the final visual attention.
In the retrieval step, the method further comprises:
a) extracting features of all target pictures with the trained network model;
b) calculating the Euclidean distances between the features of different pictures;
c) ranking by distance to obtain the degree of difference between pictures and achieve accurate target retrieval (a retrieval sketch is given below).
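A brief retrieval sketch under these steps, assuming pre-extracted descriptors stored as tensors, could be:

    import torch

    def retrieve(query_feat, gallery_feats, top_k=10):
        # query_feat: (D,) descriptor of the query picture.
        # gallery_feats: (N, D) descriptors of all target pictures.
        dists = torch.norm(gallery_feats - query_feat.unsqueeze(0), dim=1)  # Euclidean distances
        dists_sorted, order = torch.sort(dists)   # ascending: most similar first
        return order[:top_k], dists_sorted[:top_k]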
Specifically, this context within the attention sequence is modeled by a two-layer long short-term memory (LSTM) network: the first layer encodes the attention maps at different scales and generates an initial multi-scale context memory. This context memory is then input to the second LSTM layer to help the network selectively focus on informative attention and further produce multi-scale-aware attention. If the attention response at a particular scale is informative with respect to the multi-scale context, the LSTM network will import more attention information from that scale.
a) Selecting a general classification network as a basic network structure; preferably, the classification network may be of a basic CNN structure;
b) inserting a visual attention module after each intermediate convolutional layer of the classification network;
Preferably, we design a soft attention module to actively select responses in the network. For example, using ResNet-101 as the base network, the attention module can be viewed as an additional branch that computes an importance score for each location in the feature map. Given an input x, the output features f(x) of the network are obtained, and the attention module computes an attention (importance) score s(x), which is softly weighted onto the output features of the network. The output attention score s_{i,j}(x) can be regarded as a gate on the base branch; the gated value d_{i,j}(x) of the base branch can be expressed as:
d_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x)
where i, j denotes a position on the feature map; d_{i,j}(x) is the gated value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); and ⊙ is the element-wise product.
the process is a dot product process of one element, so that the interference response can be suppressed and the response of the object of interest can be boosted. The model can be trained end-to-end using a back propagation algorithm, di,j(x) The partial derivative of (a) is given by:
Figure BDA0002186918780000101
where θ is a parameter in the attention module; wherein i, j represents a position on the entire feature map; di,j(x) The value of the intermediate variable at i, j for the base branch; si,j(x) The attention score at i, j is output; f. ofi,j(x) Is the value of the output feature at i, j; (x) is an output characteristic; notably, during the training period si,j(x) Is restricted to be non-negative.
To build the multi-scale context, we add an attention module after each residual block of ResNet-101. Specifically, our attention module consists of two convolutional layers with kernel size 1 × 1, and the attention score for each location is obtained by applying the softmax function to the output of the second layer. Directly feeding the weighted features into the next layer may affect the stability of network learning: since the attention weights range from 0 to 1, the original identity mapping in the residual block is changed. Therefore, we also add an identity mapping to the attention, which can be expressed as follows:
b_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x) + f_{i,j}(x)
where i, j denotes a position on the feature map; b_{i,j}(x) is the value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); f(x) is the output feature; and ⊙ is the element-wise product.
The motivation is similar to residual learning: the identity mapping in the attention module ensures that adding attention performs no worse than the network without it.
c) The attention map output by the visual attention module of each layer is input into the corresponding LSTM long short-term memory network, and the LSTM finally outputs the final visual attention feature map.
A long short-term memory (LSTM) network is constructed by stacking LSTM cells. A typical LSTM cell includes an input gate i_t, a forget gate f_t, an output gate o_t, a hidden state h_t and a memory cell c_t. The computation of an LSTM cell can be expressed as follows:
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
u_t = tanh(W_u x_t + U_u h_{t-1} + b_u)
c_t = i_t ⊙ u_t + f_t ⊙ c_{t-1}
h_t = o_t ⊙ tanh(c_t)
where x_t is the input at step t, u_t is the intermediate variable (the modulated input) at step t, h_{t-1} is the hidden state of the previous step, σ denotes the sigmoid activation function in the LSTM cell, c_t is the memory cell at step t, and W, U and b are learnable weights and biases.
The operation ⊙ denotes the element-wise product. The three gates, the input gate i_t, the forget gate f_t and the output gate o_t, are different components of the LSTM cell and serve different purposes: the input gate determines how much information from the modulated input u_t is imported to update c_t, the forget gate f_t controls how much of the previous state c_{t-1} is retained at step t, and the output gate determines how much of the memory cell is output.
d) And weighting the final visual attention feature map to the final depth feature of the network to obtain a final global image descriptor.
Since we aim to selectively introduce attention from different scales, a multi-scale context memory must be obtained. We use the output of the first LSTM layer to generate the multi-scale context memory. Specifically, we use the average of the hidden states h_t over all T steps in the first LSTM layer to obtain the context memory M, which can be expressed as follows:
M = (1/T) Σ_{t=1}^{T} h_t
At each step t, the LSTM accepts the attention map from a particular scale. We also considered feeding all hidden states of the first layer to another feed-forward network, but to keep the model simple and compact we adopt the averaging method, since adding another sub-network would introduce more parameters.
Attention in the second-layer LSTM: we further evaluate how informative the input at each step is at the second level, where an attention gate g_t is set to selectively control the attention information imported from each scale; it receives the input h_t and the multi-scale context memory M, which can be expressed as follows:
Figure BDA0002186918780000121
Figure BDA0002186918780000122
where g_t is the normalized attention gate for the step-t input, h_t is the hidden state from the first layer at step t, and M is the context memory.
After adding the attention gate g_t, the cell state update rule in the second-layer LSTM changes as follows:
c_t = g_t ⊙ i_t ⊙ u_t + (1 - g_t) ⊙ f_t ⊙ c_{t-1}
where g_t is the normalized attention gate for the step-t input, i_t is the input gate at step t, f_t is the forget gate at step t, u_t is the intermediate variable at step t, and c_t is the memory cell at step t.
This means that if the input attention h_t is informative with respect to the multi-scale context, the cells in the second layer will import more attention information from it; if it carries little information, we tend to block it and rely more on the historical information held in the LSTM cell.
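The following sketch illustrates this gated update. The exact expression for g_t appears only as an equation image in the publication, so the sigmoid gate computed from [h_t, M] below, like the layer sizes, is an assumption made purely for illustration.

    import torch
    import torch.nn as nn

    class GatedContextLSTMCell(nn.Module):
        # Second-layer LSTM cell with the update
        #   c_t = g_t * i_t * u_t + (1 - g_t) * f_t * c_{t-1}.
        def __init__(self, input_dim, hidden_dim):
            super().__init__()
            # One linear map producing pre-activations for i_t, f_t, o_t and u_t.
            self.ifou = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)
            # Assumed form of the attention gate g_t, computed from [h_t, M].
            self.gate = nn.Linear(2 * input_dim, 1)

        def forward(self, x_t, M, h_prev, c_prev):
            # x_t: (N, D_in) attention input at step t (first-layer hidden state h_t).
            # M:   (N, D_in) multi-scale context memory, e.g. the mean of the
            #      first-layer hidden states over all T steps.
            z = self.ifou(torch.cat([x_t, h_prev], dim=1))
            i_t, f_t, o_t, u_t = z.chunk(4, dim=1)
            i_t, f_t, o_t = torch.sigmoid(i_t), torch.sigmoid(f_t), torch.sigmoid(o_t)
            u_t = torch.tanh(u_t)                                        # modulated input
            g_t = torch.sigmoid(self.gate(torch.cat([x_t, M], dim=1)))  # (N, 1), broadcast
            # Import attention information when g_t is large; otherwise rely on
            # the historical memory c_{t-1}.
            c_t = g_t * i_t * u_t + (1.0 - g_t) * f_t * c_prev
            h_t = o_t * torch.tanh(c_t)
            return h_t, c_t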
The goal of our MSCAN is to generate more discriminative global descriptors that can learn the multi-scale context when modeling attention. The output of the last convolutional layer is then combined with the final attention as follows:
F′(x)=F(x)+F(x)⊙S(x)
then we perform a global maximization pooling operation on F' (x) to generate a depth global descriptor; s (x) is the output value of the last LSTM module of the second layer of LSTM.
We adopt a triplet loss function to train our proposed model. The triplet network aims at projecting samples into an embedding space in which samples belonging to the same class lie closer together than samples from different classes. <x, x_p, x_n> denotes a triplet element, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class. The constraint can be expressed as:
d(x, x_p) + α ≤ d(x, x_n)
where α is a scalar that controls the margin between positive and negative samples, and d(·, ·) denotes the Euclidean distance between network outputs. The corresponding triplet loss is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0)
where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
Due to the size of the training data and the sensitivity of triplet selection, directly optimizing the triplet loss function is inefficient: each iteration requires tens of triplet elements when computing the loss, but only a few of them may violate the constraint. Therefore, we perform online hard sample mining to improve training efficiency. We define a hard sample as one that does not satisfy the margin constraint. These "filtered" hard triplet elements are then fed into the network again to compute the loss and perform back propagation.
The main contributions of this patent are as follows:
first, we propose a multi-scale contextual attention network that stacks multiple attention modules in network layers of different depths and scales. Therefore, we can capture the region with the largest amount of visual information from multiple scales.
Second, we model contextual information of visual attention at different scales. Such contextual information is modeled by an LSTM network with contextual memory capability to adaptively select visual attention information from multiple scales.
Thirdly, the method provided by the invention achieves excellent performance on all evaluated image retrieval benchmarks, and the visualization result further proves the effectiveness of the method.
An image feature expression system based on a visual attention mechanism comprises a depth network module and a feature calculation module,
the deep network module is used for inputting the picture into the trained deep network module to perform feature extraction on the picture to obtain the attention feature value of the picture,
the characteristic calculation module is used for calculating the distance between the characteristic value of the picture and the characteristic value of the target picture and selecting a plurality of target pictures with the closest distance for display.
The deep network module comprises
a classification network module, wherein visual attention modules are inserted between some of its layers, and the output value of the preceding layer is processed by the visual attention module before being input into the next layer;
the system also comprises two layers of long short-term memory (LSTM) networks, and the LSTM modules in the LSTM networks correspond one-to-one to the visual attention modules;
the output value of the visual attention module is input into an LSTM module of the first-layer long short-term memory network; the hidden state value of the LSTM module of the first-layer long short-term memory network is input into the LSTM module of the second-layer long short-term memory network;
and calculating the output value of the last LSTM module of the second layer of long-short term memory network and the output value of the classification network to obtain the attention characteristic value.
The system also comprises a pre-training module configured to train the deep network module with triplet elements <x, x_p, x_n>: a triplet <x, x_p, x_n> is input into the deep network module, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; the corresponding network outputs are obtained, and the deep network model is updated by back-propagating a loss function;
and updating continues until the loss function falls below a threshold or a set number of iterations is reached.
The loss function is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0), with d(a, b) denoting the Euclidean distance ||f(a) - f(b)|| between the network outputs,
where α is a margin parameter, x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
Preferably, the output value of the preceding layer is processed by the visual attention module to obtain the value of the base branch, and the value of the base branch is input into the next layer, where the processing is computed as
b_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x) + f_{i,j}(x)
where i, j denotes a position on the feature map; b_{i,j}(x) is the value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); f(x) is the output feature; and ⊙ is the element-wise product.
Preferably, the attention feature value is calculated from the output value of the last LSTM module of the second-layer long short-term memory network and the output value of the classification network as
F′(x) = F(x) + F(x) ⊙ S(x)
where S(x) is the output value of the last LSTM module of the second-layer LSTM and F(x) is the output value of the classification network.
Preferably, the LSTM module state update rule of the second-layer long short-term memory network is
c_t = g_t ⊙ i_t ⊙ u_t + (1 - g_t) ⊙ f_t ⊙ c_{t-1}
where g_t is the normalized attention gate for the step-t input, i_t is the input gate at step t, f_t is the forget gate at step t, u_t is the intermediate variable at step t, and c_t is the memory cell at step t.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. An image feature expression method based on a visual attention mechanism is characterized by comprising
Inputting the picture into a trained deep network model to perform feature extraction on the picture to obtain an attention feature value of the picture,
calculating the distance between the attention characteristic value of the picture and the attention characteristic value of the target picture,
and selecting a plurality of target pictures with the closest distances for display.
2. The image feature expression method based on the visual attention mechanism as claimed in claim 1, wherein the depth network model comprises
a classification network, wherein visual attention modules are inserted between some of its convolutional layers, and the output of the preceding convolutional layer is processed by the visual attention module before being input into the next convolutional layer;
two layers of long and short term memory networks, wherein the LSTM modules in each layer of long and short term memory network correspond to the visual attention modules one by one;
the output value of the visual attention module is input into an LSTM module of a first layer long-short term memory network; the hidden state value of the LSTM module of the first layer long-short term memory network is input into the LSTM module of the second layer long-short term memory network;
and calculating the output value of the last LSTM module of the second layer of long-short term memory network and the output value of the classification network to obtain the attention characteristic value.
3. The image feature expression method based on the visual attention mechanism as claimed in claim 2, wherein the training of the deep network model comprises
using triplet elements <x, x_p, x_n> to train the deep network model: a triplet <x, x_p, x_n> is input into the deep network model, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; the corresponding network outputs are obtained, and the deep network model is updated by back-propagating a loss function;
and returning to continue updating until the loss function falls below a threshold or a set number of iterations is reached.
4. The image feature expression method based on the visual attention mechanism as claimed in claim 3, wherein the loss function is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0), with d(a, b) denoting the Euclidean distance ||f(a) - f(b)|| between the network outputs,
where α is a margin parameter, x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
5. The image feature expression method based on the visual attention mechanism as claimed in claim 4, wherein inputting the output value of the preceding convolutional layer into the next convolutional layer after processing by the visual attention module comprises
processing the output value of the preceding convolutional layer with the visual attention module to obtain the value of the base branch, and inputting the value of the base branch into the next convolutional layer, where the processing is computed as
b_{i,j}(x) = f_{i,j}(x) ⊙ s_{i,j}(x) + f_{i,j}(x)
where i, j denotes a position on the feature map; b_{i,j}(x) is the value of the base branch at (i, j); s_{i,j}(x) is the output attention score at (i, j); f_{i,j}(x) is the value of the output features at (i, j); f(x) is the output feature; and ⊙ is the element-wise product.
6. The image feature expression method based on the visual attention mechanism as claimed in claim 5, wherein calculating the attention feature value from the output value of the last LSTM module of the second-layer long short-term memory network and the output value of the classification network comprises
attention feature value F′(x) = F(x) + F(x) ⊙ S(x)
where S(x) is the output value of the last LSTM module of the second-layer LSTM and F(x) is the output value of the classification network.
7. The image feature expression method based on the visual attention mechanism as claimed in claim 6, wherein the LSTM module state update rule of the second-layer long short-term memory network is
c_t = g_t ⊙ i_t ⊙ u_t + (1 - g_t) ⊙ f_t ⊙ c_{t-1}
where g_t is the normalized attention gate for the step-t input, i_t is the input gate at step t, f_t is the forget gate at step t, u_t is the intermediate variable at step t, and c_t is the memory cell at step t.
8. An image feature expression system based on a visual attention mechanism is characterized by comprising a deep network module and a feature calculation module,
the deep network module is used for inputting the picture into the trained deep network module to perform feature extraction on the picture to obtain the attention feature value of the picture,
the feature calculation module is used for calculating the distance between the attention feature value of the picture and the feature value of the target picture, and selecting a plurality of target pictures with the closest distances for display.
9. The image feature expression system based on the visual attention mechanism as claimed in claim 8, wherein the deep network module comprises
a classification network module, wherein visual attention modules are inserted between some of its convolutional layers, and the output of the preceding convolutional layer is processed by the visual attention module before being input into the next convolutional layer;
the system also comprises two layers of long short-term memory (LSTM) networks, and the LSTM modules in the LSTM networks correspond one-to-one to the visual attention modules;
the output value of the visual attention module is input into an LSTM module of the first-layer long short-term memory network; the hidden state value of the LSTM module of the first-layer long short-term memory network is input into the LSTM module of the second-layer long short-term memory network;
and calculating the output value of the last LSTM module of the second layer of long-short term memory network and the output value of the classification network to obtain the attention characteristic value.
10. The image feature expression system based on the visual attention mechanism as claimed in claim 9, further comprising a pre-training module configured to train the deep network module with triplet elements <x, x_p, x_n>: a triplet <x, x_p, x_n> is input into the deep network module, where x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; the corresponding network outputs are obtained, and the deep network model is updated by back-propagating a loss function;
and updating continues until the loss function falls below a threshold or a set number of iterations is reached;
the loss function is
L(x, x_p, x_n) = max(d(x, x_p) - d(x, x_n) + α, 0), with d(a, b) denoting the Euclidean distance ||f(a) - f(b)|| between the network outputs,
where α is a margin parameter, x is an anchor sample, x_p belongs to the same class as x, and x_n belongs to a different class; f() is the network output.
CN201910818508.5A 2019-08-30 2019-08-30 Image feature expression method and system based on visual attention mechanism Pending CN110704665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910818508.5A CN110704665A (en) 2019-08-30 2019-08-30 Image feature expression method and system based on visual attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910818508.5A CN110704665A (en) 2019-08-30 2019-08-30 Image feature expression method and system based on visual attention mechanism

Publications (1)

Publication Number Publication Date
CN110704665A true CN110704665A (en) 2020-01-17

Family

ID=69194227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910818508.5A Pending CN110704665A (en) 2019-08-30 2019-08-30 Image feature expression method and system based on visual attention mechanism

Country Status (1)

Country Link
CN (1) CN110704665A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611420A (en) * 2020-05-26 2020-09-01 北京字节跳动网络技术有限公司 Method and apparatus for generating image description information
CN111696137A (en) * 2020-06-09 2020-09-22 电子科技大学 Target tracking method based on multilayer feature mixing and attention mechanism
CN111709458A (en) * 2020-05-25 2020-09-25 中国自然资源航空物探遥感中心 Automatic quality inspection method for top-resolution five-number images

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291945A (en) * 2017-07-12 2017-10-24 上海交通大学 The high-precision image of clothing search method and system of view-based access control model attention model
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108647736A (en) * 2018-05-16 2018-10-12 南京大学 A kind of image classification method based on perception loss and matching attention mechanism
CN109902750A (en) * 2019-03-04 2019-06-18 山西大学 Method is described based on two-way single attention mechanism image
CN110084128A (en) * 2019-03-29 2019-08-02 安徽艾睿思智能科技有限公司 Scene chart generation method based on semantic space constraint and attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN107291945A (en) * 2017-07-12 2017-10-24 上海交通大学 The high-precision image of clothing search method and system of view-based access control model attention model
CN108647736A (en) * 2018-05-16 2018-10-12 南京大学 A kind of image classification method based on perception loss and matching attention mechanism
CN109902750A (en) * 2019-03-04 2019-06-18 山西大学 Method is described based on two-way single attention mechanism image
CN110084128A (en) * 2019-03-29 2019-08-02 安徽艾睿思智能科技有限公司 Scene chart generation method based on semantic space constraint and attention mechanism

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
ASKE R. LEJBØLLE等: "Attention in Multimodal Neural Networks for Person Re-identification", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW)》 *
YIHANG LOU等: "Multi-Scale Context Attention Network for Image Retrieval", 《MM "18: PROCEEDINGS OF THE 26TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA》 *
ZHOUXIA WANG等: "Multi-label Image Recognition by Recurrently Discovering Attentional Regions", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 *
周博通 et al.: "Automatic question answering over large-scale knowledge bases based on LSTM" (基于LSTM的大规模知识库自动问答), Journal of Peking University (Natural Science Edition) *
李玉刚 et al.: "Research on attention-based recognition of visual relationships in images" (基于注意力的图像视觉关系识别研究), Journal of Communication University of China (Natural Science Edition) *
牛斌 et al.: "An image captioning method based on attention mechanism and multimodality" (一种基于注意力机制与多模态的图像描述方法), Journal of Liaoning University (Natural Science Edition) *
陈宜明 et al.: "A distributed visual retrieval model based on latent topics" (基于潜在主题的分布式视觉检索模型), Computer Engineering (计算机工程) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709458A (en) * 2020-05-25 2020-09-25 中国自然资源航空物探遥感中心 Automatic quality inspection method for top-resolution five-number images
CN111709458B (en) * 2020-05-25 2021-04-13 中国自然资源航空物探遥感中心 Automatic quality inspection method for top-resolution five-number images
CN111611420A (en) * 2020-05-26 2020-09-01 北京字节跳动网络技术有限公司 Method and apparatus for generating image description information
CN111611420B (en) * 2020-05-26 2024-01-23 北京字节跳动网络技术有限公司 Method and device for generating image description information
CN111696137A (en) * 2020-06-09 2020-09-22 电子科技大学 Target tracking method based on multilayer feature mixing and attention mechanism
CN111696137B (en) * 2020-06-09 2022-08-02 电子科技大学 Target tracking method based on multilayer feature mixing and attention mechanism

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110334705B (en) Language identification method of scene text image combining global and local information
Wen et al. End-to-end detection-segmentation system for face labeling
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN112966127A (en) Cross-modal retrieval method based on multilayer semantic alignment
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN108921047B (en) Multi-model voting mean value action identification method based on cross-layer fusion
CN111444968A (en) Image description generation method based on attention fusion
CN108446334B (en) Image retrieval method based on content for unsupervised countermeasure training
CN110704665A (en) Image feature expression method and system based on visual attention mechanism
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN114387366A (en) Method for generating image by sensing combined space attention text
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN113095251B (en) Human body posture estimation method and system
CN112836702B (en) Text recognition method based on multi-scale feature extraction
CN113111968A (en) Image recognition model training method and device, electronic equipment and readable storage medium
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Rizvi et al. Deep extreme learning machine-based optical character recognition system for nastalique urdu-like script languages
CN110991500A (en) Small sample multi-classification method based on nested integrated depth support vector machine
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN115035341A (en) Image recognition knowledge distillation method capable of automatically selecting student model structure
CN114997287A (en) Model training and data processing method, device, equipment and storage medium
CN112613451A (en) Modeling method of cross-modal text picture retrieval model
CN117009570A (en) Image-text retrieval method and device based on position information and confidence perception

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200117

WD01 Invention patent application deemed withdrawn after publication