CN108647350A - Image-text associated retrieval method based on two-channel network - Google Patents
Image-text associated retrieval method based on two-channel network
- Publication number
- CN108647350A (application CN201810465884.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- data
- network model
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention provides an image-text associated retrieval method based on a two-channel network, comprising the following steps: constructing a training data set; constructing an image depth network model, constructing a text depth network model, constructing an associated target loss function of the image features and the text features, and training the image depth network model and the text depth network model; performing feature extraction on the image data and text data in the search library through the image depth network model and the text depth network model respectively, extracting image features and text features of corresponding depth, and storing them in association to form an index database; extracting the features of the query data, matching the extracted features of the query data with the corresponding text features or image features in the index database, and sorting and returning the query results according to the matching results.
Description
Technical field
The invention belongs to the technical field of information retrieval, and more particularly relates to an image-text associated retrieval method based on a two-channel network.
Background technology
With the rapid development of technologies such as "Internet+" and big data, the volume of image and text data available to people is growing at an astonishing rate, and how to find the content a user needs in large-scale data has become an important research topic in information retrieval. Traditional techniques that query text with text or query images with images can no longer meet practical demands. Users increasingly wish to retrieve related images or text with a text query, or to retrieve related images or text by shooting or submitting a photo. Querying text with text or images with images belongs to single-modality retrieval; how to search information across multiple modalities, that is, cross-modal retrieval, is attracting more and more attention. Cross-modal retrieval — querying text with an image, or images with text — must solve the problem of a unified feature representation for data of different modalities. Since text data and image data differ greatly in data format, how to represent different modalities uniformly has become the key problem of multi-modal and cross-modal retrieval.
To solve this key problem, conventional multi-modal or cross-modal representation methods typically derive semantic text from image data through semantic understanding and then query with that semantic text, but such methods are limited by the accuracy of semantic understanding and have not developed well. In recent years, with the development of deep learning, especially convolutional neural networks, markedly more effective feature representations can be extracted through operations such as convolution and pooling, and the semantic understanding and feature representation of images have shown very good properties. Meanwhile, neural network language models for text data have also developed rapidly; recurrent neural networks, thanks to gated units such as long short-term memory, exhibit powerful long-term memory in modeling sequence data and can be used to model text data. The feature representations produced by convolutional neural networks for images and by recurrent neural networks for sequence data can reach comparable representational power, but how to combine convolutional and recurrent networks to jointly learn a consistent representation of different-modality data remains an important problem limiting multi-modal and cross-modal information retrieval.
Invention content
The object of the invention is to overcome the drawbacks and problems of the prior art by providing an image-text associated retrieval method based on a two-channel network.
The technical scheme of the invention is as follows. An image-text associated retrieval method based on a two-channel network includes the following steps: constructing a training data set, the training data set containing multiple pairs of image data and text data; constructing an image depth network model that performs image feature extraction on the image data, constructing a text depth network model that performs text feature extraction on the text data, constructing an associated target loss function of the image features and the text features, and training the image depth network model and the text depth network model according to the associated target loss function; performing feature extraction on the image data and text data in the search library through the image depth network model and the text depth network model respectively, extracting image features and text features of corresponding depth, and saving the two in association to form an index database; extracting the features of the query data, matching the extracted features of the query data against the corresponding text features or image features in the index database, and sorting and returning the query results according to the matching results.
Preferably, constructing the training data set, in which the training data set contains multiple pairs of image data and text data, specifically comprises the following steps:
obtaining a preset quantity of image data and normalizing its size to 224 × 224 pixels;
manually writing a text description for each image, the description usually being a sentence of tens to over a hundred words;
preprocessing the text description, for example by word segmentation, to obtain a text word sequence;
representing each word after segmentation as a vector, so that a passage of text is represented as a vector sequence containing N words, N being a positive integer.
Preferably, constructing the image depth network model that performs image feature extraction on the image data specifically includes:
constructing a neural network model comprising several convolution units and pooling layers, each convolution unit containing a batch normalization layer, a convolutional layer and a nonlinear activation layer, the neural network model finally outputting features through a global pooling layer.
Preferably, constructing the text depth network model that performs text feature extraction on the text data specifically includes:
constructing a recurrent neural network model comprising a gated unit, where the gated unit cyclically receives the current input vector and the previous moment's output, and after processing by the gated unit outputs one vector as the text feature.
Preferably, constructing the associated target loss function of the image features and the text features specifically includes:
letting the feature vector output by the network for each data sample in the training data set be f; then, given an image and a passage of text whose network outputs are f_i and f_t respectively, defining the target loss between the two features as L(f_i, f_t);
adding a regularization term, defined as L(W) where W denotes the parameters, to prevent over-fitting;
thus obtaining the associated target loss function L = L(f_i, f_t) + λL(W), where λ is the regularization parameter.
Preferably, training the image depth network model and the text depth network model according to the associated target loss function specifically includes:
given a batch of training data, computing the associated target loss by forward propagation;
computing the gradient of the objective with respect to the input data from the associated target loss function;
computing the gradients layer by layer through the back-propagation algorithm and updating the parameters;
repeating the above steps iteratively, and stopping training once the number of iterations reaches a preset number;
saving the trained network parameters to computer disk for retrieval.
Preferably, the step of performing feature extraction on the image data and text data in the search library through the image depth network model and the text depth network model respectively, extracting image features and text features of corresponding depth, and saving the two in association to form an index database specifically comprises:
given the search library data, extracting image features with the image depth network model for image data, and extracting text features with the text depth network model for text data;
saving the extracted image features and text features into the index library using a hash index, forming the index database.
Preferably, the step of matching the extracted features of the query data against the corresponding text features or image features in the index database and sorting and returning the query results according to the matching results specifically comprises:
given a query image, extracting its image features with the image depth network model;
given a query sentence, extracting its text features with the text depth network model;
using the extracted image features or text features to find, in the index database, the image data or text data whose similarity is higher than a preset value;
sorting the returned results and finally returning them to the user.
The technical solution provided by the invention has the following beneficial effects:
in the image-text associated retrieval method based on the two-channel network, the given retrieval query data can be either an image or text; moreover, after the features are extracted, candidate results are quickly returned from the index library by hashing, the similarity between the query features and the library features is then computed, results whose similarity exceeds a preset threshold are taken as the returned data, and the results are returned sorted by similarity from high to low, so that the most similar results are returned first, improving the user experience of retrieval.
Description of the drawings
Fig. 1 is a flow diagram of the image-text associated retrieval method based on a two-channel network provided by an embodiment of the present invention.
Specific implementation mode
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Unless the context clearly indicates otherwise, the elements and components described in the present invention may exist in single or multiple form, and the present invention is not limited in this respect. Although the steps in the present invention are arranged with labels, the labels are not intended to limit the order of the steps; unless the order is expressly stated or the execution of a step requires other steps as its basis, the relative order of the steps is adjustable. It will be appreciated that the term "and/or" as used herein covers any and all possible combinations of one or more of the associated listed items.
As shown in Fig. 1, the image-text associated retrieval method based on a two-channel network provided by the invention includes the following steps.
S1. Constructing a training data set, the training data set containing multiple pairs of image data and text data.
Specifically, in step S1, the first step is to acquire the training image data. On the Internet, images are automatically downloaded from domain-specific website nodes by a web crawler to build the image data set. The Internet can be modeled as a graph of nodes and directed edges, where a directed edge is a URL link in a web page and a node is a web page file or a media file. Nodes are divided into leaf nodes and non-leaf nodes: a leaf node can be a web page file without hyperlinks or a media file such as an image, video or audio file, while a non-leaf node is a web page file containing hyperlinks. When crawling pages, the web crawler can traverse the nodes of the network using the depth-first or breadth-first algorithm for directed graphs and download the images in the leaf nodes to construct the training image set. The second step is to generate the paired text data. Each image in the training image set is manually described with a sentence of tens to over a hundred words, generating the corresponding text data. Owing to human subjectivity and differences in language, the same image may admit several possible descriptions, so one image can be described by several people, each description and its corresponding image forming one associated pair. In practice, since the number of images is large and the description process slow, Internet "crowdsourcing" can be used to speed up image description.
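The crawl described above can be sketched as a breadth-first traversal of the page graph, downloading the media leaf nodes. The toy link graph, the `.jpg` naming convention, and the function names below are illustrative assumptions, not part of the patent:

```python
from collections import deque

def crawl_images(start, links, is_image):
    """Breadth-first traversal of a page graph, collecting image leaf nodes."""
    seen, images = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if is_image(node):                 # leaf node holding a media file
            images.append(node)
            continue
        for nxt in links.get(node, []):    # hyperlinks = directed edges
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return images

# Toy site: pages "a" and "b" link onward; *.jpg files are leaf nodes.
site = {"a": ["b", "cat.jpg"], "b": ["dog.jpg", "a"]}
found = crawl_images("a", site, lambda n: n.endswith(".jpg"))
```

A depth-first variant would simply replace `popleft()` with `pop()`; the `seen` set prevents cycles such as the back-link from "b" to "a".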
Specifically, step S1 includes the following steps:
obtaining a preset quantity of image data and normalizing its size to 224 × 224 pixels;
manually writing a text description for each image, usually a sentence of tens to over a hundred words;
preprocessing the text description, for example by word segmentation, to obtain a text word sequence;
representing each word after segmentation as a vector, so that a passage of text is represented as a vector sequence containing N words, N being a positive integer.
For example, on the basis of the paired image and text training data, the images and text also need to be preprocessed. For image data, the images are uniformly scaled to 224 × 224 pixels as the input of the following model. For text data, each description passage is first segmented into words and stop words are removed; each remaining word is represented by a word2vec vector of fixed dimension (e.g. 300), so every text description is finally represented as a vector sequence of variable length and fixed dimension.
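A minimal sketch of this preprocessing, assuming whitespace tokenization and random stand-in vectors in place of a trained word2vec model (the 300-dimension figure follows the example in the text; the stop-word list and all names are hypothetical):

```python
import numpy as np

def text_to_vectors(sentence, embeddings, stopwords, dim=300):
    """Tokenize, drop stop words, and map each word to a fixed-dim vector."""
    words = [w for w in sentence.lower().split() if w not in stopwords]
    rng = np.random.default_rng(0)
    out = []
    for w in words:
        if w not in embeddings:            # stand-in for a word2vec lookup
            embeddings[w] = rng.standard_normal(dim)
        out.append(embeddings[w])
    return np.stack(out)                   # shape: (N words, dim)

emb, stops = {}, {"a", "of", "the"}
seq = text_to_vectors("a photo of a dog", emb, stops)
```

The result is exactly the representation described: a variable-length sequence (here N = 2 after stop-word removal) of fixed-dimension vectors.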
S2. Constructing the image depth network model that performs image feature extraction on the image data.
In the present embodiment, the image feature extraction depth network model is constructed according to the basic principle of deep convolutional neural networks. The model consists of several convolution units, each comprising a batch normalization layer, a convolutional layer and a ReLU layer. The whole image passes through several (e.g. five) convolution units, each followed by a max-pooling layer of size 2 and stride 2 that halves the size of the feature maps. For a given image, after the forward propagation of the network, a feature map set of d feature maps is obtained; a global pooling layer turns each feature map into a single value, so the d feature maps are finally turned into a d-dimensional vector. The number of feature maps input to the global pooling layer therefore determines the dimension of the final image feature vector, and this feature vector is the image feature extracted by the image feature extraction depth network.
Specifically, step S2 includes the following:
constructing a neural network model comprising several convolution units and pooling layers, each convolution unit containing a batch normalization layer, a convolutional layer and a nonlinear activation layer, the neural network model finally outputting features through a global pooling layer.
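The pooling behavior described above — 2×2 stride-2 max pooling that halves each feature map, then a global pooling layer that collapses d maps into one d-dimensional vector — can be sketched in NumPy. The convolution and batch-normalization layers are omitted, and mean pooling is assumed for the unspecified global pooling operation:

```python
import numpy as np

def max_pool2(fm):
    """2x2 max pooling with stride 2: halves each spatial side."""
    h, w = fm.shape[0] // 2 * 2, fm.shape[1] // 2 * 2
    fm = fm[:h, :w]
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def global_pool(maps):
    """Turn d feature maps into one d-dimensional vector (one value per map)."""
    return np.array([m.mean() for m in maps])

maps = [np.ones((8, 8)) * k for k in range(5)]   # d = 5 toy feature maps
pooled = [max_pool2(m) for m in maps]            # each map shrinks to 4x4
feat = global_pool(pooled)                       # the 5-dim image feature
```

As the text notes, the dimension of `feat` is fixed by the number of feature maps entering the global pooling layer, not by the input image size.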
S3. Constructing the text depth network model that performs text feature extraction on the text data.
In the present embodiment, the text sequence information is modeled with a gated unit according to the basic principle of recurrent neural networks. The gated unit can be a long short-term memory unit (LSTM), a gated recurrent unit (GRU) or the structurally simpler m-reluGRU unit; these gated units differ little in modeling effect but decrease successively in computational complexity, so the m-reluGRU unit is suggested. For an input vector sequence, the gated memory unit cyclically accepts and processes the vectors and finally outputs one d-dimensional vector, which is the feature extracted by the text feature extraction depth network.
Specifically, step S3 includes the following:
constructing a recurrent neural network model comprising a gated unit, where the gated unit cyclically receives the current input vector and the previous moment's output, and after processing by the gated unit outputs one vector as the text feature.
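A minimal NumPy sketch of one gated-unit step cycling over a vector sequence, as described above. A standard GRU cell is used because the patent does not give the equations of its m-reluGRU unit; the dimensions and random weights are purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W):
    """One gated-unit step: mixes the current input with the previous output."""
    z = sigmoid(W["z"] @ np.concatenate([x, h]))            # update gate
    r = sigmoid(W["r"] @ np.concatenate([x, h]))            # reset gate
    h_tilde = np.tanh(W["h"] @ np.concatenate([x, r * h]))  # candidate state
    return (1 - z) * h + z * h_tilde

d_in, d_h = 4, 3
rng = np.random.default_rng(1)
W = {k: rng.standard_normal((d_h, d_in + d_h)) * 0.1 for k in "zrh"}
h = np.zeros(d_h)
for x in rng.standard_normal((6, d_in)):   # a 6-word vector sequence
    h = gru_step(x, h, W)                  # the final h is the text feature
```

The loop is exactly the "cyclically receives the current input vector and the previous moment's output" behavior in the text; an m-reluGRU would presumably replace the `tanh` with a ReLU.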
S4. Constructing the associated target loss function of the image features and the text features.
In the present embodiment, the associated target loss function of the image and text features is mainly used to measure the relevance of an image feature and a text feature: if the two are related the loss is 0, otherwise the loss is nonzero. The target loss function is defined so that training the feature extraction network parameters according to this principle makes the loss as small as possible. Specifically, suppose an image I, a passage of text T and their association relation s(I, T) ∈ {0, 1} are given, where a value of 0 indicates the two are unrelated and a value of 1 indicates they are related. Let the feature vectors extracted from the image and the text by the corresponding feature extraction networks be f_i and f_t respectively; the loss between f_i and f_t is defined as L(f_i, f_t), whose concrete functional form is determined by the similarity measure used in retrieval. For example, if cosine similarity is used, L(f_i, f_t) is defined from cos(f_i, f_t), so that the network parameters guided by this objective learn representations better suited to that similarity measure.
To prevent over-fitting, a regularization term is added to the loss objective: all parameters are constrained by 2-norm regularization, defined as L(W) = Σ_k ||W_k||², where k denotes the k-th layer parameters of the network. The final objective function is the sum of the loss and the regularization term, L = L(f_i, f_t) + λL(W), where λ is the regularization parameter.
Specifically, step S4 includes the following steps:
letting the feature vector output by the network for each data sample in the training data set be f; then, given an image and a passage of text whose network outputs are f_i and f_t respectively, defining the target loss between the two features as L(f_i, f_t);
adding a regularization term, defined as L(W) where W denotes the parameters, to prevent over-fitting;
thus obtaining the associated target loss function L = L(f_i, f_t) + λL(W), where λ is the regularization parameter.
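The combined objective L = L(f_i, f_t) + λL(W) can be sketched as follows. The pairwise term is assumed here to be 1 − cos(f_i, f_t), so that it is zero for fully correlated features as the text requires; the patent itself only states that the term is built from the cosine similarity:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def associated_loss(f_i, f_t, params, lam=1e-4):
    """Pair loss (assumed 1 - cosine similarity) plus an L2 regularizer."""
    pair = 1.0 - cos(f_i, f_t)                 # 0 when fully correlated
    reg = sum(np.sum(W * W) for W in params)   # sum_k ||W_k||^2
    return pair + lam * reg

f = np.array([1.0, 0.0, 1.0])
loss_same = associated_loss(f, f, params=[])   # identical features -> loss 0
```

With an empty parameter list the regularizer vanishes and identical features give a loss of exactly zero, matching the "loss is 0 if the two are related" requirement.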
S5. Training the image depth network model and the text depth network model according to the associated target loss function.
In the present embodiment, a batch of training data is given, comprising a set of paired image and text data and the association relations between the images and text; the features of the image and text data are obtained through the corresponding networks, after which the loss is calculated. After the loss is obtained, the partial derivatives of the loss with respect to the inputs f_i and f_t, ∂L/∂f_i and ∂L/∂f_t, are computed; then, by the chain rule of derivatives, the partial derivatives of the loss with respect to each layer's input and each layer's parameters are computed backwards, and the parameters are finally updated according to the stochastic gradient descent rule W ← W − η·∂L/∂W, where η is the learning rate of the parameter update, usually a small value that can be adjusted according to the data set. Finally, the above forward and backward computation and parameter update are executed repeatedly; learning is terminated when the objective function no longer decreases or the number of iterations reaches a preset number, and each layer's learned parameters and the basic network structure are stored to local disk.
Specifically, step S5 includes the following steps:
given a batch of training data, computing the associated target loss by forward propagation;
computing the gradient of the objective with respect to the input data from the associated target loss function;
computing the gradients layer by layer through the back-propagation algorithm and updating the parameters;
repeating the above steps iteratively, and stopping training once the number of iterations reaches a preset number;
saving the trained network parameters to computer disk for retrieval.
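The update rule W ← W − η·∂L/∂W from the training procedure above, sketched on a toy quadratic loss standing in for the associated objective (in the real method the gradients would come from back-propagation through both networks):

```python
import numpy as np

def sgd_step(W, grad, lr=0.01):
    """W <- W - eta * dL/dW, the update applied after back-propagation."""
    return W - lr * grad

# Toy loss L = ||W||^2 standing in for the associated-objective loss.
W = np.array([2.0, -3.0])
for _ in range(200):          # fixed iteration budget, then stop training
    grad = 2.0 * W            # dL/dW for L = ||W||^2
    W = sgd_step(W, grad, lr=0.1)
# np.save("params.npy", W)    # saving the learned parameters to disk
```

The loop mirrors the procedure: forward loss, gradient, parameter update, repeated for a preset number of iterations before the parameters are written out.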
S6. Performing feature extraction on the image data and text data in the search library through the image depth network model and the text depth network model respectively, extracting image features and text features of corresponding depth, and saving the two in association to form an index database.
In the present embodiment, features are extracted and an index is built for the search library data in advance, to improve search efficiency at retrieval time. With the network learned in step S5, the image data in the search library are scaled to 224 × 224 pixels and fed into the image feature extraction subnet to extract features, while the text data, after word segmentation, preprocessing and vectorization, are fed into the text feature extraction network to extract features; unlike in step S5, the two feature extraction subnets run independently. The feature dimension obtained by forward computation is generally hundreds to thousands; to improve the efficiency of feature matching and search, the features are hash-indexed and stored in the index library.
Specifically, step S6 includes the following steps:
given the search library data, extracting image features with the image depth network model for image data, and extracting text features with the text depth network model for text data;
saving the extracted image features and text features into the index library using a hash index, forming the index database.
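A sketch of hash-indexing the extracted features. The patent does not name the hashing scheme; sign random projection, a common locality-sensitive hash for real-valued vectors, is assumed here, with illustrative code length and feature dimension:

```python
import numpy as np

def hash_code(feat, planes):
    """Sign random projection: a d-dim feature -> a short binary code."""
    bits = (planes @ feat > 0).astype(int)
    return "".join(map(str, bits))

rng = np.random.default_rng(2)
planes = rng.standard_normal((8, 16))       # 8-bit codes for 16-dim features

index = {}                                  # code -> list of (id, feature)
for i in range(50):
    f = rng.standard_normal(16)             # stand-in for an extracted feature
    index.setdefault(hash_code(f, planes), []).append((i, f))
```

Similar features tend to share a bucket, so at query time only the features in the query's bucket need to be scored, which is what makes the retrieval-time lookup fast.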
S7. Extracting the features of the query data, matching the extracted features of the query data against the corresponding text features or image features in the index database, and sorting and returning the query results according to the matching results.
In the present embodiment, the given retrieval query data can be either an image or text. After the corresponding network extracts the features, candidate results are quickly returned from the index library by hashing; the similarity between the query features and the library features is then computed, results whose similarity exceeds a preset threshold are taken as the returned data, and the results are returned sorted by similarity from high to low, so that the most similar results are returned first, improving the user experience of retrieval.
Specifically, step S7 includes the following steps:
given a query image, extracting its image features with the image depth network model;
given a query sentence, extracting its text features with the text depth network model;
using the extracted image features or text features to find, in the index database, the image data or text data whose similarity is higher than a preset value;
sorting the returned results and finally returning them to the user.
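The matching step above — keep the library items whose similarity to the query exceeds a preset threshold, then return them sorted from high to low — can be sketched with cosine similarity, the measure suggested in the loss construction; the item ids and threshold are illustrative:

```python
import numpy as np

def search(query, library, threshold=0.5):
    """Score library features against the query by cosine similarity, keep
    those above the threshold, and sort from most to least similar."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(item_id, cos(query, f)) for item_id, f in library]
    hits = [(i, s) for i, s in scored if s > threshold]
    return sorted(hits, key=lambda t: t[1], reverse=True)

lib = [("img1", np.array([1.0, 0.0])),
       ("img2", np.array([0.0, 1.0])),
       ("img3", np.array([1.0, 0.2]))]
results = search(np.array([1.0, 0.0]), lib)   # img2 falls below threshold
```

Because both channels map into the same feature space, the same routine serves image-to-text and text-to-image queries; only the extraction network differs.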
Based on the foregoing description, the present invention can segment an image into image blocks with independent semantics and determine a corresponding visual word for each image block; the determined visual words can then be encoded, so as to determine the feature vector corresponding to each image. These feature vectors can form the image index library, and when a target image to be retrieved is input, the target feature vector of the target image can be matched against the feature vectors in the image index library, so as to feed back retrieval results relevant to the target image. The present invention utilizes deep features and a clustering algorithm to build an accurate image index library, thereby improving the precision of image retrieval.
It is obvious to a person skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from whatever point of view, the present embodiments are to be considered as illustrative and not restrictive, and the scope of the present invention is defined by the appended claims rather than by the above description; it is intended that all changes falling within the meaning and scope of equivalents of the claims be included within the present invention. Any reference signs in the claims should not be construed as limiting the claims involved.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is merely for clarity, and those skilled in the art should consider the specification as a whole, as the technical solutions in the various embodiments may also be suitably combined to form other embodiments that can be appreciated by those skilled in the art.
Claims (8)
1. An image-text associated retrieval method based on a two-channel network, characterized by comprising the following steps:
constructing a training data set, the training data set containing multiple pairs of image data and text data;
constructing an image depth network model that performs image feature extraction on the image data;
constructing a text depth network model that performs text feature extraction on the text data;
constructing an associated target loss function of the image features and the text features;
training the image depth network model and the text depth network model according to the associated target loss function;
performing feature extraction on the image data and text data in the search library through the image depth network model and the text depth network model respectively, extracting image features and text features of corresponding depth, and saving the two in association to form an index database;
extracting the features of the query data, matching the extracted features of the query data against the corresponding text features or image features in the index database, and sorting and returning the query results according to the matching results.
2. The image-text associated retrieval method based on a two-channel network according to claim 1, wherein constructing the training dataset comprising a plurality of paired image data and text data specifically comprises the following steps:
obtaining a preset number of image data items, and normalizing their size to 224 × 224 pixels;
manually writing a text description for each image, a description typically being a passage of some tens of words or a sentence of up to a hundred words;
preprocessing the text description, including word segmentation, to obtain a word sequence;
performing vector quantization on each segmented word, so that a text is represented as a vector sequence containing N words, where N is a positive integer.
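Claim 2 leaves the segmentation and vectorization methods open. A minimal sketch, assuming whitespace tokenization and a randomly initialised embedding table as a stand-in for a trained word-embedding model (the vocabulary, embedding dimension, and seed are all illustrative):

```python
import numpy as np

def vectorize_text(description, vocab, dim=8, seed=0):
    """Map a segmented text description to a sequence of N word vectors."""
    rng = np.random.default_rng(seed)
    table = rng.normal(size=(len(vocab), dim))          # one vector per vocabulary word
    words = [w for w in description.split() if w in vocab]
    return np.stack([table[vocab[w]] for w in words])   # shape: (N, dim)

vocab = {w: i for i, w in enumerate("a dog runs on the beach".split())}
seq = vectorize_text("a dog runs on the beach", vocab)
print(seq.shape)  # (6, 8): N = 6 words, each an 8-dimensional vector
```

In a real system the table would come from a trained embedding model; only the sequence-of-vectors representation is fixed by the claim.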
3. The image-text associated retrieval method based on a two-channel network according to claim 1, wherein constructing the image deep network model for performing image feature extraction on the image data specifically comprises:
constructing a neural network model comprising several convolution units and pooling layers, each convolution unit comprising a batch normalization layer, a convolutional layer and a nonlinear activation layer, the neural network model finally outputting the feature through a global pooling layer.
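The claim fixes only the unit structure (batch normalization, convolution, nonlinear activation) and the final global pooling. A toy single-unit sketch in NumPy, with whole-tensor normalization statistics, a 3×3 valid convolution, and ReLU all assumed for illustration:

```python
import numpy as np

def conv_unit(x, kernels):
    """One convolution unit: normalise, 3x3 valid convolution, ReLU."""
    x = (x - x.mean()) / (x.std() + 1e-5)               # batch normalisation (toy, whole-tensor stats)
    h, w = x.shape[0] - 2, x.shape[1] - 2
    out = np.zeros((h, w, kernels.shape[0]))
    for k in range(kernels.shape[0]):                   # one 3x3 kernel per output channel
        for i in range(h):
            for j in range(w):
                out[i, j, k] = np.sum(x[i:i+3, j:j+3] * kernels[k])
    return np.maximum(out, 0.0)                         # nonlinear activation (ReLU)

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))                         # toy single-channel input
feat = conv_unit(image, rng.normal(size=(4, 3, 3)))     # 4 output channels
vec = feat.mean(axis=(0, 1))                            # global average pooling -> feature vector
print(vec.shape)  # (4,)
```

A real model would stack several such units with intermediate pooling layers before the global pool.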
4. The image-text associated retrieval method based on a two-channel network according to claim 1, wherein constructing the text deep network model for performing text feature extraction on the text data specifically comprises:
constructing a recurrent neural network model comprising a gated unit, the gated unit cyclically receiving the current input vector and the output of the previous time step, and, after the information is processed by the gated unit, outputting a vector as the text feature.
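Claim 4 does not name a specific gated unit; a GRU-style step is one common choice. A minimal NumPy sketch, with the hidden size, weight shapes, and initialisation scale chosen purely for illustration:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h, W):
    """One gated-unit step: combine the current input x with the previous output h."""
    z = sigmoid(W["z"] @ np.concatenate([x, h]))        # update gate
    r = sigmoid(W["r"] @ np.concatenate([x, h]))        # reset gate
    n = np.tanh(W["n"] @ np.concatenate([x, r * h]))    # candidate state
    return (1 - z) * n + z * h

rng = np.random.default_rng(0)
d, hdim = 8, 16
W = {k: rng.normal(scale=0.1, size=(hdim, d + hdim)) for k in "zrn"}
h = np.zeros(hdim)
for x in rng.normal(size=(6, d)):    # a sequence of 6 word vectors
    h = gru_step(x, h, W)            # the final h serves as the text feature
print(h.shape)  # (16,)
```

The unit is applied cyclically over the word-vector sequence from claim 2; the last output vector is the text feature.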
5. The image-text associated retrieval method based on a two-channel network according to claim 1, wherein constructing the associated target loss function of the image features and the text features specifically comprises:
letting the feature vector output by the network for each data sample in the training dataset be f; then, given an image and a piece of text whose output feature vectors are f_i and f_t respectively, defining the target loss between the two features as L(f_i, f_t);
adding a regularization term to prevent overfitting, defined as L(W), where W denotes the parameters;
thereby obtaining the associated target loss function L = L(f_i, f_t) + λ·L(W), where λ is the regularization parameter.
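The claim does not fix the concrete form of L(f_i, f_t) or L(W); a sketch assuming squared Euclidean distance for the matching term and an L2 weight penalty for the regulariser:

```python
import numpy as np

def associated_loss(f_i, f_t, params, lam=1e-3):
    """L = L(f_i, f_t) + lambda * L(W): matching term plus regulariser."""
    match_loss = np.sum((f_i - f_t) ** 2)        # squared Euclidean distance (assumed form)
    reg = sum(np.sum(W ** 2) for W in params)    # L(W): sum of squared weights
    return match_loss + lam * reg

f_i = np.array([1.0, 0.0, 0.0])                  # toy image feature
f_t = np.array([0.8, 0.1, 0.0])                  # toy text feature of the paired caption
params = [np.ones((2, 2))]                       # toy network parameters W
loss = associated_loss(f_i, f_t, params, lam=0.1)
print(round(loss, 6))  # 0.45
```

Contrastive or ranking losses over mismatched pairs are other common choices for L(f_i, f_t) in cross-modal retrieval.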
6. The image-text associated retrieval method based on a two-channel network according to claim 1, wherein training the image deep network model and the text deep network model according to the associated target loss function specifically comprises:
given a batch of training data, computing the associated target loss by forward propagation;
computing the gradient of the target with respect to the input data from the associated target loss function;
computing the gradients layer by layer with the back-propagation algorithm, and updating the parameters accordingly;
repeating the above steps to train iteratively, and stopping training once the number of iterations reaches a preset count;
saving the trained network parameters to computer disk for use in retrieval.
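The training loop of claim 6 can be sketched with a linear toy model standing in for the image branch (the learning rate, iteration count, and loss form are all illustrative assumptions, and a real system would save the final parameters to disk):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy batch: 5 paired samples of (image input, target text feature)
pairs = [(rng.normal(size=4), rng.normal(size=3)) for _ in range(5)]

def total_loss(W, lam=1e-3):
    """Associated target loss over the batch: matching term plus regulariser."""
    match = sum(np.sum((W @ x - f_t) ** 2) for x, f_t in pairs)
    return match + lam * np.sum(W ** 2)

W = rng.normal(size=(3, 4))                     # toy model parameters
loss_before = total_loss(W)
for _ in range(500):                            # stop once the preset iteration count is reached
    grad = 2e-3 * W                             # gradient of the regulariser (lam = 1e-3)
    for x, f_t in pairs:
        grad += 2 * np.outer(W @ x - f_t, x)    # gradient of the matching term
    W -= 0.01 * grad                            # parameter update (gradient descent step)
loss_after = total_loss(W)
print(loss_after < loss_before)  # True: training reduced the associated loss
```

In the patented method the gradients would instead flow through both deep branches via back-propagation, but the forward-pass/loss/gradient/update cycle is the same.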
7. The image-text associated retrieval method based on a two-channel network according to claim 1, wherein performing feature extraction on the image data and the text data in the search library with the image deep network model and the text deep network model respectively, extracting the corresponding deep image features and text features, and saving the two in association to form the index database specifically comprises the following steps:
given search library data, extracting image features with the image deep network model for the image data, and extracting text features with the text deep network model for the text data;
saving the extracted image features and text features into an index repository using hash indexing, thereby forming the index database.
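Claim 7 only says "hash indexing" without naming a scheme; a random-hyperplane (LSH-style) hash is one common choice for dense features. A minimal sketch with assumed code length and feature dimension:

```python
import numpy as np

def hash_code(f, planes):
    """Binary hash key from the signs of random hyperplane projections (LSH-style)."""
    return tuple(int(b) for b in (planes @ f > 0))

rng = np.random.default_rng(0)
planes = rng.normal(size=(8, 16))        # 8-bit codes for 16-dimensional deep features
index = {}                               # the index database: hash key -> stored item ids
for item_id in range(100):
    f = rng.normal(size=16)              # stand-in for an extracted image or text feature
    index.setdefault(hash_code(f, planes), []).append(item_id)
print(sum(len(v) for v in index.values()))  # 100: every item is stored in some bucket
```

Nearby features tend to share a bucket, so a query only needs to compare against items under its own hash key rather than the whole library.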
8. The image-text associated retrieval method based on a two-channel network according to claim 1, wherein extracting the features of the query data, matching the extracted features of the query data against the corresponding text features or image features in the index database, and sorting according to the matching results and returning the query results specifically comprises the following steps:
given a query image, extracting its image features using the image deep network model;
given a query sentence, extracting its text features using the text deep network model;
searching the index database with the extracted image features or text features for the image data or text data whose similarity exceeds a preset value;
sorting the returned results, and finally returning them to the user.
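The claim does not fix the similarity measure; a sketch assuming cosine similarity, a threshold filter, and best-first ranking (database contents and the threshold are illustrative):

```python
import numpy as np

def query(f_q, db, threshold=0.5):
    """Match a query feature against stored features; keep items above the
    similarity threshold, ranked best-first."""
    sims = []
    for item_id, f in db.items():
        s = float(f_q @ f / (np.linalg.norm(f_q) * np.linalg.norm(f)))  # cosine similarity
        if s > threshold:                    # keep only items above the preset value
            sims.append((s, item_id))
    return [item_id for s, item_id in sorted(sims, reverse=True)]       # ranked result list

db = {"img1": np.array([1.0, 0.0]),
      "img2": np.array([0.9, 0.1]),
      "img3": np.array([0.0, 1.0])}
print(query(np.array([1.0, 0.05]), db))  # ['img1', 'img2'] -- img3 falls below the threshold
```

With the hash index of claim 7, `db` would hold only the candidates from the query's bucket instead of the whole library.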
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810465884.6A CN108647350A (en) | 2018-05-16 | 2018-05-16 | Image-text associated retrieval method based on two-channel network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108647350A true CN108647350A (en) | 2018-10-12 |
Family
ID=63755876
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810465884.6A Pending CN108647350A (en) | 2018-05-16 | 2018-05-16 | Image-text associated retrieval method based on two-channel network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108647350A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202413A (en) * | 2016-07-11 | 2016-12-07 | 北京大学深圳研究生院 | A kind of cross-media retrieval method |
US20170024461A1 (en) * | 2015-07-23 | 2017-01-26 | International Business Machines Corporation | Context sensitive query expansion |
CN106547826A (en) * | 2016-09-30 | 2017-03-29 | 西安电子科技大学 | A kind of cross-module state search method, device and computer-readable medium |
CN107563407A (en) * | 2017-08-01 | 2018-01-09 | 同济大学 | A kind of character representation learning system of the multi-modal big data in network-oriented space |
CN107832351A (en) * | 2017-10-21 | 2018-03-23 | 桂林电子科技大学 | Cross-module state search method based on depth related network |
Non-Patent Citations (1)
Title |
---|
Zhao Jinfeng: "Research on the Fusion of Text and Image Information in Cross-Media Retrieval", China Masters' Theses Full-text Database * |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109756778A (en) * | 2018-12-06 | 2019-05-14 | 中国人民解放军陆军工程大学 | frame rate conversion method based on self-adaptive motion compensation |
CN109756778B (en) * | 2018-12-06 | 2021-09-14 | 中国人民解放军陆军工程大学 | Frame rate conversion method based on self-adaptive motion compensation |
CN109492839A (en) * | 2019-01-17 | 2019-03-19 | 东华大学 | A kind of mineral hot furnace operating condition prediction technique based on RNN-LSTM network |
CN111612025A (en) * | 2019-02-25 | 2020-09-01 | 北京嘀嘀无限科技发展有限公司 | Description model training method, text description device and electronic equipment |
CN111612025B (en) * | 2019-02-25 | 2023-12-12 | 北京嘀嘀无限科技发展有限公司 | Description model training method, text description device and electronic equipment |
CN110222560A (en) * | 2019-04-25 | 2019-09-10 | 西北大学 | A kind of text people search's method being embedded in similitude loss function |
CN110222560B (en) * | 2019-04-25 | 2022-12-23 | 西北大学 | Text person searching method embedded with similarity loss function |
CN112182281A (en) * | 2019-07-05 | 2021-01-05 | 腾讯科技(深圳)有限公司 | Audio recommendation method and device and storage medium |
CN112182281B (en) * | 2019-07-05 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Audio recommendation method, device and storage medium |
CN112883218A (en) * | 2019-11-29 | 2021-06-01 | 智慧芽信息科技(苏州)有限公司 | Image-text combined representation searching method, system, server and storage medium |
CN111143400B (en) * | 2019-12-26 | 2024-05-14 | 新长城科技有限公司 | Full stack type retrieval method, system, engine and electronic equipment |
CN111143400A (en) * | 2019-12-26 | 2020-05-12 | 长城计算机软件与***有限公司 | Full-stack type retrieval method, system, engine and electronic equipment |
CN111241310A (en) * | 2020-01-10 | 2020-06-05 | 济南浪潮高新科技投资发展有限公司 | Deep cross-modal Hash retrieval method, equipment and medium |
CN111353076B (en) * | 2020-02-21 | 2023-10-10 | 华为云计算技术有限公司 | Method for training cross-modal retrieval model, cross-modal retrieval method and related device |
CN111353076A (en) * | 2020-02-21 | 2020-06-30 | 华为技术有限公司 | Method for training cross-modal retrieval model, cross-modal retrieval method and related device |
WO2021164772A1 (en) * | 2020-02-21 | 2021-08-26 | 华为技术有限公司 | Method for training cross-modal retrieval model, cross-modal retrieval method, and related device |
CN111626058B (en) * | 2020-04-15 | 2023-05-30 | 井冈山大学 | Based on CR 2 Image-text double-coding realization method and system of neural network |
CN111626058A (en) * | 2020-04-15 | 2020-09-04 | 井冈山大学 | Based on CR2Method and system for realizing image-text double coding of neural network |
CN111897950A (en) * | 2020-07-29 | 2020-11-06 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating information |
CN112364197A (en) * | 2020-11-12 | 2021-02-12 | 四川省人工智能研究院(宜宾) | Pedestrian image retrieval method based on text description |
CN112861882B (en) * | 2021-03-10 | 2023-05-09 | 齐鲁工业大学 | Image-text matching method and system based on frequency self-adaption |
CN112861882A (en) * | 2021-03-10 | 2021-05-28 | 齐鲁工业大学 | Image-text matching method and system based on frequency self-adaption |
CN113094534A (en) * | 2021-04-09 | 2021-07-09 | 陕西师范大学 | Multi-mode image-text recommendation method and device based on deep learning |
CN113127672A (en) * | 2021-04-21 | 2021-07-16 | 鹏城实验室 | Generation method, retrieval method, medium and terminal of quantized image retrieval model |
WO2023273572A1 (en) * | 2021-06-28 | 2023-01-05 | 北京有竹居网络技术有限公司 | Feature extraction model construction method and target detection method, and device therefor |
CN114120074A (en) * | 2021-11-05 | 2022-03-01 | 北京百度网讯科技有限公司 | Training method and training device of image recognition model based on semantic enhancement |
CN114120074B (en) * | 2021-11-05 | 2023-12-12 | 北京百度网讯科技有限公司 | Training method and training device for image recognition model based on semantic enhancement |
CN114896438A (en) * | 2022-05-10 | 2022-08-12 | 西安电子科技大学 | Image-text retrieval method based on hierarchical alignment and generalized pooling graph attention machine mechanism |
CN114896373A (en) * | 2022-07-15 | 2022-08-12 | 苏州浪潮智能科技有限公司 | Image-text mutual inspection model training method and device, image-text mutual inspection method and equipment |
CN115455228A (en) * | 2022-11-08 | 2022-12-09 | 苏州浪潮智能科技有限公司 | Multi-mode data mutual detection method, device, equipment and readable storage medium |
CN117932161A (en) * | 2024-03-22 | 2024-04-26 | 成都数据集团股份有限公司 | Visual search method and system for multi-source multi-mode data |
CN117932161B (en) * | 2024-03-22 | 2024-05-28 | 成都数据集团股份有限公司 | Visual search method and system for multi-source multi-mode data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108647350A (en) | Image-text associated retrieval method based on two-channel network | |
CN106909924B (en) | Remote sensing image rapid retrieval method based on depth significance | |
CN105912611B (en) | A kind of fast image retrieval method based on CNN | |
Kang et al. | Learning consistent feature representation for cross-modal multimedia retrieval | |
CN103778227B (en) | The method screening useful image from retrieval image | |
CN109960763B (en) | Photography community personalized friend recommendation method based on user fine-grained photography preference | |
CN107562812A (en) | A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space | |
CN112819023B (en) | Sample set acquisition method, device, computer equipment and storage medium | |
CN107992531A (en) | News personalization intelligent recommendation method and system based on deep learning | |
CN107220277A (en) | Image retrieval algorithm based on cartographical sketching | |
CN103942571B (en) | Graphic image sorting method based on genetic programming algorithm | |
CN113297369B (en) | Intelligent question-answering system based on knowledge graph subgraph retrieval | |
Liang et al. | Self-paced cross-modal subspace matching | |
CN107291825A (en) | With the search method and system of money commodity in a kind of video | |
CN105528437A (en) | Question-answering system construction method based on structured text knowledge extraction | |
CN107291895B (en) | Quick hierarchical document query method | |
CN114358188A (en) | Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment | |
CN108388639B (en) | Cross-media retrieval method based on subspace learning and semi-supervised regularization | |
CN112988917A (en) | Entity alignment method based on multiple entity contexts | |
Wang et al. | Deep enhanced weakly-supervised hashing with iterative tag refinement | |
CN115687760A (en) | User learning interest label prediction method based on graph neural network | |
CN113380360B (en) | Similar medical record retrieval method and system based on multi-mode medical record map | |
CN108647295B (en) | Image labeling method based on depth collaborative hash | |
Le Huy et al. | Keyphrase extraction model: a new design and application on tourism information | |
CN117435685A (en) | Document retrieval method, document retrieval device, computer equipment, storage medium and product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181012 |