CN111782853B - Semantic image retrieval method based on attention mechanism - Google Patents

Semantic image retrieval method based on attention mechanism

Info

Publication number
CN111782853B
CN111782853B CN202010582273.7A CN202010582273A
Authority
CN
China
Prior art keywords
vector
pictures
semantic feature
semantic
feature vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010582273.7A
Other languages
Chinese (zh)
Other versions
CN111782853A (en)
Inventor
韩红
杨慎全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202010582273.7A priority Critical patent/CN111782853B/en
Publication of CN111782853A publication Critical patent/CN111782853A/en
Application granted granted Critical
Publication of CN111782853B publication Critical patent/CN111782853B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/464 - Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/50 - Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/56 - Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic image retrieval method based on an attention mechanism, which mainly addresses the problem that the semantic gap degrades retrieval accuracy in the image retrieval process. The implementation steps are as follows: 1) constructing and training a CNN-RNN network model containing an attention mechanism; 2) extracting the text features of the pictures in the image library with the trained network model; 3) extracting the semantic feature vectors of the text features with the text-vector doc2vec model and storing them; 4) extracting the text features of the query picture with the trained network model, and extracting the semantic feature vector corresponding to those text features; 5) comparing the feature vector of the query picture with the feature vectors in the image library by the cosine method, and outputting the result. The method effectively reduces the influence of the semantic gap, so that the system can retrieve by similarity over the semantic information expressed by a picture; it can be used for fast retrieval over massive internet data and for searching mobile-phone photos in daily life.

Description

Semantic image retrieval method based on attention mechanism
Technical Field
The invention belongs to the technical field of image processing and further relates to image-based pattern recognition, in particular to a semantic image retrieval method based on an attention mechanism. In the image retrieval process, given a query picture (query image), the images in the image library that are similar to it are found and output.
Background
Image retrieval refers to giving an image containing specific content and then finding images containing similar content in an image database. Because different images vary greatly under the influence of shooting angle, occlusion, lighting and other factors, quickly finding the desired image under such uncontrollable factors is a challenging topic. In today's network era, huge numbers of images are uploaded to servers every moment, especially with the rise of social networks; for example, nearly 6 billion pictures are stored on Tencent's servers, and these pictures contain very rich information. Exploiting the advantages of computers in processing massive image data to quickly and accurately find the pictures a user is interested in therefore has great value and practical significance, and more and more researchers are entering this field.
Most conventional image retrieval methods adopt models such as the histogram of oriented gradients (HOG) and the scale-invariant feature transform (SIFT) to extract feature vectors of an image, and then output similar images by computing distances between the feature vectors. However, these models are easily affected by noise, are slow to compute, and have low retrieval accuracy, so new research methods are urgently needed.
In recent years, with the surge of deep learning research, the convolutional neural network (CNN) has become a research hotspot in speech analysis and image recognition. Its weight sharing, receptive fields and related structural properties give it a dominant position in the image field and allow images to serve directly as network input, avoiding the large computation and low speed of traditional image retrieval algorithms.
With the rapid development of CNNs, a large number of convolutional-neural-network-based image retrieval algorithms have been proposed. The most classical among them is the CNN- and hashing-based method Deep Supervised Hashing for Fast Image Retrieval (Haomiao Liu, Ruiping Wang, Shiguang Shan, Xilin Chen; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2064-2072), which effectively extracts feature vectors of images and reduces their dimensionality with binary codes, achieving good speed and precision. Many improved algorithms have since appeared on the basis of CNN + hash coding, but this approach still has a drawback: the "semantic gap" problem in image retrieval has not been completely solved, i.e., it cannot retrieve similar pictures from the viewpoint of picture semantics.
The patent "A CNN-based rapid image searching method" filed by the University of Science and Technology of China (application No. CN201610211503.2, publication No. CN105912611A) provides an image searching method. In the first stage, vector features are extracted with a CNN network pre-trained by Google; in the second stage, a K-nearest-neighbor search over the vector features is performed in a feature database. The patent builds on the fast-search idea of product quantization (PQ), adds an inverted-index strategy from text retrieval, takes the application's data volume into account, arranges the system parameters reasonably, and improves the re-ranking of search results. However, because the scheme uses CNN feature extraction, the feature-vector dimensionality is high and retrieval efficiency is low.
The patent "A semantic-analysis-based network image retrieval method" filed by the Institute of Automation, Chinese Academy of Sciences (application No. CN200910089536.4, publication No. CN101751447A) first performs content-based image retrieval for each feature to find a set of visually similar network images. Semantic learning is performed on the related text information of each image in the network image set to obtain a semantic representation of the query image. The semantic consistency of the retrieval image sets corresponding to the various features is judged on the text information; this consistency measures the descriptive power of each feature and assigns different confidence levels. Text-based image retrieval over the image library using the query image's semantics and semantic consistency yields the semantic relevance between each library image and the query image; content-based retrieval using the low-level features yields the visual relevance of each library image to the query image. The semantic and visual relevance are fused by a linear function, so the images returned to the user are similar at both the semantic and the visual level. The drawback of this method is that the retrieval system is too complex and uses too many kinds of features, which greatly reduces retrieval speed and cannot effectively overcome or reduce the semantic gap in the retrieval process.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a semantic image retrieval method based on an attention mechanism. A CNN-RNN depth model with an attention mechanism extracts text features describing the image content of the query picture, the text-vector doc2vec model extracts the semantic feature vector corresponding to those text features, and this feature vector is compared with the feature vectors in an image feature library to obtain the similar pictures in the library. The accuracy of image retrieval is effectively improved, and the influence of the semantic gap is reduced.
The method comprises the following specific steps:
(1) Constructing a CNN-RNN network model containing an attention mechanism and training:
(1a) Preprocessing the pictures and the corresponding image titles in the MS COCO data set;
(1b) Constructing a convolutional neural network VGG encoder and a recurrent neural network LSTM decoder, and adding an attention mechanism into the decoder to obtain a CNN-RNN network model consisting of the encoder and the decoder;
(1c) Dividing the preprocessed data into a training data set and a testing data set, training the network model by adopting the training data set, and testing by using the testing data set to obtain a final CNN-RNN network model;
(2) Extracting the image titles of all pictures in the image library to be retrieved by using the final CNN-RNN network model, namely extracting the text features corresponding to the pictures, and storing the extracted text features in a database;
(3) Extracting semantic feature vectors of text features in the database by using a text vector doc2vec model and storing the semantic feature vectors:
(3a) Sequentially processing all the text features obtained in step (2) by using the text vector doc2vec model in the gensim library to obtain the semantic feature vectors corresponding to all the pictures;
(3b) Storing the obtained semantic feature vectors and the corresponding pictures in a database, and matching the semantic feature vectors and the corresponding pictures;
(4) Extracting text features of the query picture by using the final CNN-RNN network model, and extracting corresponding semantic feature vectors;
(5) Comparing the semantic feature vector of the query picture with semantic feature vectors of other pictures in the image library by using a cosine similarity comparison method to obtain similar semantic feature vectors;
(6) Outputting the pictures corresponding to the similar semantic feature vectors, i.e., the pictures similar to the query picture.
Compared with the prior art, the invention has the following advantages:
Firstly, the invention combines computer vision with related techniques from natural language processing: an attention mechanism is introduced into a CNN-RNN network, so the network can effectively extract high-level concepts related to a picture and express them in natural-language form. The scheme embeds content-based image retrieval within the text-based image retrieval idea, so the advantages of both are realized, while the burden of manually labeling text and the influence of the semantic gap are effectively overcome.
Secondly, the invention adopts recently and rapidly developed word-embedding technology and uses the text vector doc2vec built on word vectors, which effectively preserves word order; when converting natural-language descriptions into a vector space, it achieves a better conversion effect than the word-vector word2vec model adopted in the prior art.
Drawings
FIG. 1 is a flow chart of an implementation of the method of the present invention;
FIG. 2 is a schematic diagram of a CNN-RNN network architecture with attention mechanism in the present invention;
FIG. 3 is a schematic diagram of the core structure of the convolutional neural network VGG encoder in the present invention.
Detailed Description
The invention is explained in further detail below with reference to the figures and examples:
Referring to fig. 1, the method of the invention includes the following steps:
step 1, constructing a CNN-RNN network model containing an attention mechanism and training:
(1a) Preprocessing the pictures and the corresponding image titles in the MS COCO data set, where the preprocessing includes word segmentation, syntactic analysis, word-vector conversion and the like;
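The text side of this preprocessing can be illustrated roughly as follows (a minimal sketch assuming English MS COCO captions; the tokenizer and the toy vocabulary are hypothetical stand-ins, not the patent's exact pipeline):

```python
import re

def preprocess_caption(caption, vocab):
    """Tokenize a caption and map each word to an integer id, as a
    stand-in for the word-segmentation and word-vector preprocessing."""
    tokens = re.findall(r"[a-z']+", caption.lower())
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

# Hypothetical toy vocabulary; in practice it is built from the data set.
vocab = {"<unk>": 0, "a": 1, "dog": 2, "runs": 3}
print(preprocess_caption("A dog runs!", vocab))  # -> [1, 2, 3]
```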
(1b) Constructing a convolutional neural network VGG encoder and a cyclic neural network LSTM decoder, and adding an attention mechanism into the decoder to obtain a CNN-RNN network model consisting of the encoder and the decoder;
the core structure of the convolutional neural network VGG encoder, i.e., the initiation module, as shown in fig. 3, forms an initiation v2 network by stacking the modules; specifically, the construction of the convolutional neural network VGG encoder is to output the output of the last convolutional layer of the network as the characteristics of the picture, that is, at least 5 characteristic graphs of the last convolutional layer are selected as characteristic vectors to be output. The convolutional neural network is composed of 5 convolutional layers, 3 full-link layers and a softmax output layer, the layers are separated by using maximum pooling, and all hidden layer neurons adopt ReLU activation functions.
The input of the LSTM decoder comprises the word vector of the current step, the output vector of the previous time step, and the weighted vector formed by the attention mechanism; its output is the word vector produced at the current time step. Adding the attention mechanism to the decoder means that at each time step the decoder decodes, the feature vectors output by the recurrent neural network LSTM decoder are weighted and averaged to obtain a context vector, and this context vector also serves as one input of the decoder network to guide the decoding at the current time step. The CNN-RNN network model obtained by combining the LSTM decoder better mitigates the vanishing- and exploding-gradient problems.
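One common way to realize such a weighted context vector is standard soft attention; the following is a sketch under general assumptions, not the patent's exact formulation (dimensions follow the encoder sketch above):

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Scores each image region against the decoder's hidden state and
    returns their weighted average as the context vector."""
    def __init__(self, region_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.w_region = nn.Linear(region_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (batch, 49, 512); hidden: (batch, 512)
        e = self.score(torch.tanh(
            self.w_region(regions) + self.w_hidden(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)           # attention weights
        context = (alpha * regions).sum(dim=1)    # weighted average
        return context, alpha

# At each time step the context vector is concatenated with the current
# word vector and fed to the LSTM cell alongside the previous state.
```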
(1c) Dividing the preprocessed data into a training data set and a testing data set, training the network model with the training data set, and testing with the testing data set to obtain the final CNN-RNN network model.
Step 2, extracting the image titles of all pictures in the image library to be retrieved with the final CNN-RNN network model, namely processing the pictures in the image library to be retrieved with the pre-trained encoding-decoding network, sequentially extracting the text features (natural-language descriptions) corresponding to the pictures, and storing the extracted text features in a database.
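Extracting a caption then amounts to running the trained encoder-decoder greedily over each picture. The following schematic sketch reuses the SoftAttention module from the sketch above; the vocabulary size, the token ids and the (here untrained) weights are placeholders, not the patent's trained model:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10000, 256, 512
embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTMCell(embed_dim + 512, hidden_dim)  # word vector + context vector
to_vocab = nn.Linear(hidden_dim, vocab_size)
attend = SoftAttention()                         # from the sketch above

def greedy_caption(regions, max_len=20, start_id=1, end_id=2):
    """Decode one caption token-by-token from the region features."""
    h = torch.zeros(1, hidden_dim)
    c = torch.zeros(1, hidden_dim)
    word = torch.tensor([start_id])
    tokens = []
    with torch.no_grad():
        for _ in range(max_len):
            context, _ = attend(regions, h)      # attention-weighted average
            step_in = torch.cat([embed(word), context], dim=1)
            h, c = lstm(step_in, (h, c))
            word = to_vocab(h).argmax(dim=1)     # greedy word choice
            if word.item() == end_id:
                break
            tokens.append(word.item())
    return tokens                                # word ids of the caption

caption_ids = greedy_caption(torch.randn(1, 49, 512))
```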
Step 3, extracting semantic feature vectors of text features in the database by using a text vector doc2vec model and storing the semantic feature vectors:
(3a) Sequentially processing all the text features obtained in step 2 with the text vector doc2vec model in the gensim library, namely converting the extracted natural language into a feature-vector space to obtain the semantic feature vector corresponding to each picture. Specifically, each natural-language sentence is processed with the doc2vec model to obtain the semantic feature vector corresponding to each picture's image title (caption), i.e., the semantic feature vector corresponding to the picture;
(3b) Storing the obtained semantic feature vectors and the corresponding pictures in a database and matching them with each other.
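Steps (3a)-(3b) can be sketched with the gensim library roughly as follows (a minimal sketch; the two example captions and the 100-dimension vector size are assumptions for illustration, not values fixed by the patent):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical captions produced by the CNN-RNN model in step 2,
# one natural-language sentence per picture in the library.
captions = ["a brown dog runs across the grass",
            "two people ride bicycles on a city street"]

documents = [TaggedDocument(words=caption.split(), tags=[i])
             for i, caption in enumerate(captions)]

# Train doc2vec on the caption corpus.
model = Doc2Vec(documents, vector_size=100, min_count=1, epochs=40)

# Semantic feature vector for each picture, stored alongside the picture.
vectors = [model.infer_vector(caption.split()) for caption in captions]
```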
Step 4, extracting the text features of the query picture with the final CNN-RNN network model and extracting the corresponding semantic feature vector, i.e., combining the attention-based CNN-RNN network with the doc2vec model to extract the query picture's image title and convert it into a feature vector. When a query picture is to be retrieved, it is processed with the encoding-decoding network and the doc2vec model in turn, in the same way as the other pictures in the image library were processed, to obtain the feature vector corresponding to the query picture.
Step 5, comparing the semantic feature vector of the query picture with semantic feature vectors of other pictures in the image library by using a cosine similarity comparison method to obtain similar semantic feature vectors;
the cosine similarity comparison method is also called cosine similarity calculation, specifically, similarity between two semantic feature vectors is evaluated by calculating cosine values of included angles of the two semantic feature vectors, and the calculation formula is as follows:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$$
wherein A and B respectively represent two different semantic feature vectors; in this embodiment, A is the semantic feature vector of the query picture, and B is the semantic feature vector of another picture in the image library.
By computing and ranking the similarity between the query picture's feature vector and the feature vectors in the image library, the semantic feature vectors similar to the query picture are obtained, and from them the corresponding pictures, i.e., exactly which pictures in the image library are highly similar to the query picture.
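A minimal sketch of this similarity calculation and ranking (the vector dimension, library size and top-10 cut-off below are assumptions for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 100-d semantic feature vectors: one query, a small library.
query_vec = np.random.rand(100)
library_vecs = np.random.rand(1000, 100)   # one row per library picture

# Score every library picture against the query and keep the top 10.
scores = np.array([cosine_similarity(query_vec, v) for v in library_vecs])
top10 = np.argsort(scores)[::-1][:10]      # indices of most similar pictures
```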
Step 6, outputting the pictures corresponding to the similar semantic feature vectors, i.e., the pictures similar to the query picture;
and outputting the pictures corresponding to the sequenced similar semantic feature vectors according to the result of the last step and the requirement of the user to finish the retrieval.
The effects of the invention can be further illustrated by simulation:
1. Simulation experiment conditions:
The data set used by the invention is NUS-WIDE, a database of real-world pictures that can be used for a variety of image-processing tasks. It contains 269,648 pictures from Flickr with 5,018 associated tags; six types of extracted low-level features (a 64-dimensional color histogram, a 144-dimensional color correlogram, a 73-dimensional edge direction histogram, a 128-dimensional wavelet texture, 225-dimensional block-wise color moments, and a 500-dimensional bag-of-words feature based on SIFT descriptors); and user information for 247,849 of the images.
The hardware platform is as follows: intel Core i5-4210U CPU;
the software platform is as follows: visual studio code.
2. Contents and results of the experiments
The invention was tested on the NUS-WIDE data set: natural-language descriptions of the pictures were extracted, the feature vectors containing the pictures' semantic information were then extracted to form a feature image library, the query picture was processed in the same way, and the final result was obtained by computation among the vectors. On 3000 test samples, the results were compared with the Learning to Hash with Binary Reconstructive Embeddings (BRE), Deep Learning of Binary Hash Codes for Fast Image Retrieval (DLBHC), and Deep Supervised Hashing for Fast Image Retrieval (DSH) algorithms; as shown in Table 1, the method achieves higher efficiency in image retrieval.
TABLE 1 mAP index comparison of the present invention and prior methods
(Table data reproduced in the original document only as an image; it lists the mAP values of BRE, DLBHC, DSH and the proposed method.)
The simulation analysis proves the correctness and the effectiveness of the method provided by the invention.
Parts of the invention that are not described in detail belong to the common general knowledge of those skilled in the art.
The above description is only one specific embodiment of the present invention and should not be construed as limiting the invention in any way. It will be apparent to those skilled in the art that, after understanding the content and principle of the invention, various modifications and variations in form and detail can be made without departing from the principle of the invention; such modifications and variations still fall within the scope of the appended claims.

Claims (9)

1. A semantic image retrieval method based on an attention mechanism is characterized by comprising the following steps:
(1) Constructing a CNN-RNN network model containing an attention mechanism and training:
(1a) Preprocessing the images and the corresponding image titles in the MS COCO data set;
(1b) Constructing a convolutional neural network VGG encoder and a recurrent neural network LSTM decoder, and adding an attention mechanism into the decoder to obtain a CNN-RNN network model consisting of the encoder and the decoder;
(1c) Dividing the preprocessed data into a training data set and a testing data set, training the network model by adopting the training data set, and testing by using the testing data set to obtain a final CNN-RNN network model;
(2) Extracting the image titles of all pictures in the image library to be retrieved by using the final CNN-RNN network model, namely extracting the text features corresponding to the pictures, and storing the extracted text features in a database;
(3) Extracting semantic feature vectors of text features in the database by using a text vector doc2vec model and storing the semantic feature vectors:
(3a) Sequentially processing all the text features obtained in step (2) by using the text vector doc2vec model in the gensim library to obtain the semantic feature vectors corresponding to all the pictures;
(3b) Storing the obtained semantic feature vectors and the corresponding pictures in a database, and matching the semantic feature vectors and the corresponding pictures;
(4) Extracting text features of the query picture by using the final CNN-RNN network model, and extracting corresponding semantic feature vectors;
(5) Comparing the semantic feature vector of the query picture with semantic feature vectors of other pictures in the image library by using a cosine similarity comparison method to obtain similar semantic feature vectors;
(6) Outputting the pictures corresponding to the similar semantic feature vectors, i.e., the pictures similar to the query picture.
2. The method of claim 1, wherein: the text feature is a short text describing the picture content in natural language.
3. The method of claim 1, wherein: the preprocessing in the step (1 a) comprises word segmentation, syntactic analysis and word vectors.
4. The method of claim 1, wherein: constructing the convolutional neural network VGG encoder in step (1b) specifically means outputting the output of the network's last convolutional layer as the picture's features, i.e., at least 5 feature maps of the last convolutional layer are selected and output as feature vectors.
5. The method of claim 4, wherein: the network structure of the convolutional neural network VGG encoder is composed of 5 convolutional layers, 3 full-connection layers and a softmax output layer, the layers are separated by using maximum pooling, and all hidden layer neurons adopt ReLU activation functions.
6. The method of claim 1, wherein: adding the attention mechanism to the decoder in step (1b) means that, at each time step the decoder decodes, the feature vectors output by the recurrent neural network LSTM decoder are weighted and averaged to obtain a context vector, and this context vector also serves as an input of the decoder network to guide the decoding at the current time step.
7. The method of claim 1, wherein: the input of the recurrent neural network LSTM decoder in step (1b) comprises the word vector of the current step, the output vector of the previous time step, and the weighted vector formed by the attention mechanism, and the output is the word vector output at the current time step.
8. The method of claim 1, wherein: extracting the semantic feature vectors of the text features in the database in step (3) means converting the natural-language descriptions of the picture content into semantic feature vectors.
9. The method of claim 1, wherein: the cosine similarity in the step (5) is calculated according to the following formula:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$$
wherein A represents the semantic feature vector of the query picture, and B represents the semantic feature vectors of other pictures in the image library.
CN202010582273.7A 2020-06-23 2020-06-23 Semantic image retrieval method based on attention mechanism Active CN111782853B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010582273.7A CN111782853B (en) 2020-06-23 2020-06-23 Semantic image retrieval method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010582273.7A CN111782853B (en) 2020-06-23 2020-06-23 Semantic image retrieval method based on attention mechanism

Publications (2)

Publication Number Publication Date
CN111782853A CN111782853A (en) 2020-10-16
CN111782853B true CN111782853B (en) 2022-12-02

Family

ID=72757038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010582273.7A Active CN111782853B (en) 2020-06-23 2020-06-23 Semantic image retrieval method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN111782853B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256727B (en) * 2020-10-19 2021-10-15 东北大学 Database query processing and optimizing method based on artificial intelligence technology
CN112417190B (en) * 2020-11-27 2024-06-11 暨南大学 Retrieval method and application of ciphertext JPEG image
CN113868447A (en) * 2021-09-27 2021-12-31 新智认知数据服务有限公司 Picture retrieval method, electronic device and computer-readable storage medium
CN113705576B (en) * 2021-11-01 2022-03-25 江西中业智能科技有限公司 Text recognition method and device, readable storage medium and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018010365A1 (en) * 2016-07-11 2018-01-18 北京大学深圳研究生院 Cross-media search method
CN109766468A * 2019-01-04 2019-05-17 广东技术师范学院 A kind of implementation method and device for appearance patent image retrieval and management based on an image description algorithm
WO2019235458A1 (en) * 2018-06-04 2019-12-12 国立大学法人大阪大学 Recalled image estimation device, recalled image estimation method, control program, and recording medium
CN111222049A (en) * 2020-01-08 2020-06-02 东北大学 Top-k similarity searching method on semantically enhanced heterogeneous information network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018010365A1 (en) * 2016-07-11 2018-01-18 北京大学深圳研究生院 Cross-media search method
WO2019235458A1 (en) * 2018-06-04 2019-12-12 国立大学法人大阪大学 Recalled image estimation device, recalled image estimation method, control program, and recording medium
CN109766468A * 2019-01-04 2019-05-17 广东技术师范学院 A kind of implementation method and device for appearance patent image retrieval and management based on an image description algorithm
CN111222049A (en) * 2020-01-08 2020-06-02 东北大学 Top-k similarity searching method on semantically enhanced heterogeneous information network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Deep Learning Based Classification Using Academic Studies in Doc2Vec Model";Yaşar Safali等;《 2019 International Artificial Intelligence and Data Processing Symposium (IDAP)》;20191021;第1-5页 *
"基于视觉注意力机制的图像检索研究";梁晔等;《北京联合大学学报(自然科学版)》;20100331;第30-35页 *

Also Published As

Publication number Publication date
CN111782853A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111782853B (en) Semantic image retrieval method based on attention mechanism
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN111198959B (en) Two-stage image retrieval method based on convolutional neural network
CN106033426B (en) Image retrieval method based on latent semantic minimum hash
CN110083729B (en) Image searching method and system
Qian et al. Image location inference by multisaliency enhancement
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN111782852B (en) Deep learning-based high-level semantic image retrieval method
CN111339343A (en) Image retrieval method, device, storage medium and equipment
CN114461839B (en) Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
Song et al. A weighted topic model learned from local semantic space for automatic image annotation
Zhao et al. An angle structure descriptor for image retrieval
CN115187910A (en) Video classification model training method and device, electronic equipment and storage medium
Li et al. Structure-adaptive neighborhood preserving hashing for scalable video search
CN114168773A (en) Semi-supervised sketch image retrieval method based on pseudo label and reordering
CN110110120B (en) Image retrieval method and device based on deep learning
Song et al. Hierarchical deep hashing for image retrieval
Xue et al. Mobile image retrieval using multi-photos as query
CN111783734B (en) Original edition video recognition method and device
CN112883216A (en) Semi-supervised image retrieval method and device based on disturbance consistency self-integration
Zhang et al. Improved image retrieval algorithm of GoogLeNet neural network
Ghosh et al. Efficient indexing for query by string text retrieval
CN112597329B (en) Real-time image retrieval method based on improved semantic segmentation network
CN113190706A (en) Twin network image retrieval method based on second-order attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant