CN110110122A - Image-text cross-modal retrieval based on a multi-level semantic deep hashing algorithm - Google Patents

Image-text cross-modal retrieval based on a multi-level semantic deep hashing algorithm

Info

Publication number
CN110110122A
Authority
CN
China
Prior art keywords
multi-level
data
text
modal
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810649234.7A
Other languages
Chinese (zh)
Inventor
冀振燕
姚伟娜
杨文韬
皮怀雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN201810649234.7A priority Critical patent/CN110110122A/en
Publication of CN110110122A publication Critical patent/CN110110122A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to an image-text cross-modal retrieval model that combines deep learning with hashing. To overcome the limitation of traditional deep-learning-based cross-modal hashing methods, which simply reduce multi-label data to a single-label problem, a deep cross-modal hashing algorithm based on multi-level semantics is proposed. The similarity between data items is defined through the co-occurrence relations among their multiple labels and is used as the supervision information for network training. A loss function that jointly considers multi-level semantic similarity and binary similarity is designed to train the network, so that feature extraction and hash-code learning are unified in one framework and end-to-end learning is realized. The algorithm makes full use of the semantic correlation information between data and improves retrieval accuracy.

Description

Image-text cross-modal retrieval based on a multi-level semantic deep hashing algorithm
Technical field
The present invention relates to the field of cross-modal retrieval, and in particular to an image-text cross-modal retrieval algorithm based on multi-level semantics that combines deep learning with hashing.
Background art
With the development of the mobile internet and the popularity of devices such as smartphones and digital cameras, multimedia data on the internet is growing explosively. In the field of information retrieval, the continuing growth of multimedia big data creates demand for cross-modal retrieval applications. However, today's mainstream search engines, such as Baidu, Google, and Bing, return results of only a single modality. In addition, as deep learning achieves a series of breakthroughs in fields such as computer vision and natural language processing, combining multimedia big data with artificial intelligence is a common development trend for both fields. Therefore, exploring new cross-modal retrieval paradigms that combine new technologies with new demands has become one of the pressing challenges in information retrieval.
Traditional cross-modal retrieval generally relies on hand-designed features that depend on domain knowledge, and the "semantic gap" problem remains the key difficulty in this field. Applying deep learning to cross-modal retrieval not only helps bridge the "media gap" between heterogeneous data of different modalities, but also draws on a large body of advanced research results in feature learning and representation. However, as multimedia data keeps growing, deep feature representations face storage and retrieval-efficiency challenges because of their high dimensionality, which makes them unsuitable for large-scale multimedia retrieval tasks. Meanwhile, cross-modal retrieval must also deal with the fact that real-world data carry multiple labels. Most existing solutions convert the problem into a binary, single-label learning problem, so the learned model cannot fully preserve the association relations of the data in the original semantic space, which degrades the final retrieval results.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by combining deep-learning-based feature representations with a hashing method that simultaneously considers the binary similarity and the multi-level semantic similarity of image and text data, learning the mapping from data to hash codes through network training, and thereby providing an image-text cross-modal retrieval method with higher retrieval accuracy.
To achieve the above object, the technical solution provided by the present invention is as follows:
The method is divided into three modules: a deep feature extraction module, a similarity matrix generation module, and a hash-code learning module.
The deep feature extraction module extracts image and text features with deep neural networks. It uses two sub-networks to extract features from the image and text modalities respectively, i.e. it contains two deep neural networks, one for extracting image features and one for extracting text features. Image features are extracted with the deep convolutional network CNN-F, whose structure consists of 5 convolutional layers and 3 fully connected layers. In the text feature extraction stage, the text data is first modeled as a bag-of-words (BOW) vector. On top of this bag-of-words representation, the text feature extraction network uses a multi-layer perceptron (MLP) composed of three fully connected layers to extract text features.
The similarity matrix generation module comprises binary similarity matrix generation and multi-level semantic similarity matrix generation, each of which produces a cross-modal similarity matrix. For the binary similarity matrix S^B: when image i and text j are similar, the corresponding entry S^B_ij is 1; when image i and text j are dissimilar, S^B_ij is 0. For the multi-level semantic similarity matrix S^M, the computation is designed according to label co-occurrence relations, so that the more similar labels the class-label sets of two samples share, the larger the similarity of the samples; when the two label sets are identical, S^M_ij reaches its maximum value 1, and when the labels in the two samples' label sets are completely different, S^M_ij takes its minimum value 0.
For the hash-code learning module, in order to make the learned hash codes preserve the semantic information contained in both the binary similarity matrix S^B and the multi-level semantic similarity matrix S^M, an objective function is designed that combines a binary-similarity loss, a multi-level semantic loss, a regularization term, and a bit-balance term; its complete form is given in the detailed embodiment below.
By optimizing this objective function, the network parameters are learned and the mapping from data to hash codes is obtained.
Compared with the prior art, the principle and advantages of this solution are as follows:
This solution combines deep learning with hashing, overcoming both the limited representational power of traditional hand-crafted features and the excessive dimensionality of deep features, which is unfavorable for data storage and computation. By combining binary similarity with multi-level semantic similarity, it fully accounts for the complex similarity relations between data of different modalities, so that the learned hash codes preserve more semantic information and retrieval accuracy is improved.
Brief description of the drawings
Fig. 1 is the overall framework diagram of the image-text cross-modal retrieval based on the multi-level semantic deep hashing algorithm of the present invention.
Specific embodiment
The invention is further described below with reference to a specific example:
Throughout the present invention, the two modalities of image and text are taken as an example.
The present invention provides an image-text cross-modal retrieval method based on a multi-level semantic deep hashing algorithm (Deep Multi-Level Semantic Hashing for Cross-modal Retrieval, DMSH), which comprises three modules: a deep feature extraction module, a similarity matrix generation module, and a hash-code learning module, as shown in Fig. 1.
Table 1: Image feature extraction network structure
The deep feature extraction module extracts image and text features with deep neural networks. Image features are extracted with the deep convolutional network CNN-F; the network configuration is shown in Table 1. In the text feature extraction stage, the text is first modeled as a bag-of-words vector. On top of the bag-of-words representation, the text feature extraction network uses a multi-layer perceptron composed of three fully connected layers to extract text features; its configuration is shown in Table 2.
In the image network, the conv1 layer uses stride-4 convolution, while conv2-conv5 use stride-1 convolution. Pad denotes the padding (border filling) scheme; padding is usually applied at the image border so that the feature map output by a convolution keeps the same size as its input. LRN denotes Local Response Normalization, which mimics the lateral inhibition mechanism of biological neurons: it creates a competition mechanism over the activity of local neurons so that larger responses become relatively larger while neurons with smaller feedback are suppressed, enhancing the generalization ability of the model. Pooling uses the MAX operation, taking the maximum value within a window of the input, which effectively reduces the number of model parameters and prevents overfitting. In addition, the Dropout regularization technique randomly discards a certain number of neurons during training to prevent the network from overfitting.
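Purely as an illustration (Table 1 is not reproduced in this text), an image branch in the spirit of CNN-F could be sketched in PyTorch as follows; using torchvision's AlexNet as a stand-in backbone and a 64-bit code length are assumptions of the sketch, not the patented configuration.

    import torch
    import torch.nn as nn
    from torchvision import models

    class ImageBranch(nn.Module):
        """Hypothetical image feature network: a 5-conv / 3-fc backbone standing in for
        CNN-F (stride-4 conv1, MAX pooling, Dropout in the fully connected part), with
        the last fully connected layer resized to the hash-code length."""
        def __init__(self, hash_len=64):
            super().__init__()
            backbone = models.alexnet(weights=None)            # 5 conv + 3 fc layers
            backbone.classifier[-1] = nn.Linear(4096, hash_len)
            self.net = backbone

        def forward(self, x):                                  # x: (batch, 3, 224, 224)
            return torch.tanh(self.net(x))                     # real-valued output, later binarised by sign()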
Table 2: Text feature extraction network structure
In the text network, the first hidden layer is a fully connected layer with the same length as the input bag-of-words vector, the second hidden layer is a 4096-dimensional fully connected layer, and the third layer is a fully connected layer whose length equals the hash-code length. The output of the network is the text feature vector.
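Likewise, a minimal sketch of the three-layer text network just described; the layer sizes follow the description above, while the 64-bit code length is an assumed example.

    import torch
    import torch.nn as nn

    class TextBranch(nn.Module):
        """Text feature network: a 3-layer MLP over the bag-of-words vector."""
        def __init__(self, bow_len, hash_len=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(bow_len, bow_len), nn.ReLU(),        # layer 1: same length as the BOW vector
                nn.Linear(bow_len, 4096), nn.ReLU(),           # layer 2: 4096-dimensional fully connected layer
                nn.Linear(4096, hash_len),                     # layer 3: length equals the hash-code length
            )

        def forward(self, bow):                                # bow: (batch, bow_len)
            return torch.tanh(self.net(bow))                   # text feature vector, later binarised by sign()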
The similarity matrix generation module comprises binary similarity matrix generation and multi-level semantic similarity matrix generation, each producing a cross-modal similarity matrix. For the binary similarity matrix S^B: when image i and text j are similar, the corresponding entry S^B_ij is 1; when image i and text j are dissimilar, S^B_ij is 0. The similarity between data of different modalities is measured through class labels: if image i and text j share at least one class label, they are considered similar; otherwise they are considered dissimilar. Formally, S^B_ij = 1 if image i and text j share at least one class label, and S^B_ij = 0 otherwise.
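A minimal sketch of the binary similarity matrix just defined, assuming the class labels of each sample are stored as a multi-hot indicator vector (the storage format is an assumption for illustration only):

    import numpy as np

    def binary_similarity(img_labels, txt_labels):
        """S^B[i, j] = 1 if image i and text j share at least one class label, else 0.
        img_labels: (n_img, n_labels) multi-hot matrix; txt_labels: (n_txt, n_labels)."""
        return ((img_labels @ txt_labels.T) > 0).astype(np.float32)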
For the multi-level semantic similarity matrix S^M, a calculation method based on class-label co-occurrence relations is used; the specific generation procedure is described below.
For two class labels t_i and t_j, the label similarity s(t_i, t_j) is defined according to formula (2) in terms of their semantic distance d(t_i, t_j). The distance d(t_i, t_j) is computed from occurrence statistics in the training set: the numbers of times t_i and t_j each appear, the number of times t_i and t_j appear together, and N_c, the number of all labels in the training set.
From definition (2) it follows that s(t_i, t_j) ∈ [0, 1], and that the more often two labels occur together, the larger their similarity. Based on the label similarity s, the similarity between samples can then be defined.
For two samples D_m and D_n, the sample similarity S^M_mn is defined from their label sets, where t_m and t_n denote the class-label sets of D_m and D_n, and |t_m| and |t_n| denote the numbers of labels in t_m and t_n respectively. By this definition, the more similar labels the class-label sets of the two samples share, the larger the sample similarity; when the two label sets t_m and t_n are identical, S^M_mn reaches its maximum value 1, and when every label in t_m is dissimilar to every label in t_n, S^M_mn takes its minimum value 0. The multi-label-based semantic similarity matrix S^M can therefore be used as the supervision information for the hash-code learning process. Compared with the binary similarity matrix S^B, S^M extends the cross-modal similarity from the discrete values {0, 1} to continuous values in the interval [0, 1], preserving more of the rich semantic information implicit in the data's class labels.
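The exact formulas for s(t_i, t_j), d(t_i, t_j), and S^M appear as numbered equations in the original filing and are not reproduced in this text; the NumPy sketch below is therefore only an assumed stand-in that satisfies the stated properties (values in [0, 1], similarity growing with joint occurrences, 1 for identical label sets, 0 for fully dissimilar ones). The normalisation by the smaller occurrence count and the best-match averaging over label sets are assumptions of the sketch, not the patented formulas.

    import numpy as np

    def label_similarity(train_labels, eps=1e-12):
        """Pairwise label similarity from co-occurrence counts (assumed formula).
        train_labels: (num_samples, num_labels) multi-hot matrix of the training set."""
        counts = train_labels.sum(axis=0)                      # occurrences of each label
        co = train_labels.T @ train_labels                     # joint occurrences of label pairs
        s = co / (np.minimum.outer(counts, counts) + eps)      # assumed normalisation, lies in [0, 1]
        np.fill_diagonal(s, 1.0)                               # a label is fully similar to itself
        return np.clip(s, 0.0, 1.0)

    def multilevel_similarity(img_labels, txt_labels, s):
        """Cross-modal multi-level similarity S^M between image and text samples.
        Assumption: two label sets are compared by averaging each label's best match
        in the other set, which yields 1 for identical sets and 0 for fully
        dissimilar sets, as the description above requires."""
        S = np.zeros((img_labels.shape[0], txt_labels.shape[0]))
        for m in range(img_labels.shape[0]):
            tm = np.flatnonzero(img_labels[m])
            for n in range(txt_labels.shape[0]):
                tn = np.flatnonzero(txt_labels[n])
                if tm.size == 0 or tn.size == 0:
                    continue
                sub = s[np.ix_(tm, tn)]                        # similarities between the two label sets
                S[m, n] = 0.5 * (sub.max(axis=1).mean() + sub.max(axis=0).mean())
        return S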
In the hash-code learning module, let F(g)*i denote the learned image feature of sample D_i, i.e. the output of the image feature extraction network, and let F(x)*j denote the learned text feature of sample D_j, i.e. the output of the text feature extraction network; the two deep networks each have their own parameters, which are learned during training.
In order to make the learned hash codes preserve the semantic information in the binary similarity matrix S^B, a sigmoid cross-entropy loss function is adopted (formula (3-5)). To guarantee the stability of the training process and to avoid numerical overflow, an equivalent form of (3-5) is used in the implementation phase.
On the basis of the above binary semantic loss, a multi-level semantic loss is further introduced, so that the learned model also preserves the richer semantic information contained in the multi-level semantic similarity matrix S^M. The same numerically stable equivalent form of the sigmoid cross-entropy loss is used here.
The complete form of the objective function is thus obtained. Here, F(g) and F(x) denote the learned feature matrices of the images and texts, which contain the semantic information of the similarity matrices; C(g) and C(x) denote the hash codes of the images and texts; sign(·) denotes the sign function, defined as in formulas (9) and (10); the semantic information in F(g) and F(x) is passed to C(g) and C(x) through the sign function; ‖·‖_F denotes the Frobenius norm; E denotes a vector whose elements are all 1; and μ, ρ, τ are hyper-parameters.
C(g)=sign (F(g)) (9)
C(x)=sign (F(x)) (10)
The first two terms of the objective function are negative log-likelihood functions of the cross-modal similarity. It can be shown that optimizing them makes the similarity between F(g)*i and F(x)*j larger when the corresponding similarity supervision is larger, and smaller when it is smaller. Optimizing the first and second terms therefore guarantees that the image and text features learned by the network preserve the cross-modal similarity of the original semantic space.
The third term of the objective function is a regularization term. Optimizing it yields the hash codes C(g) and C(x) of the images and texts while preserving the similarity of the features F(g)*i and F(x)*j extracted by the network. Since F(g)*i and F(x)*j maintain the cross-modal similarity of the semantic space, the resulting hash codes also preserve that cross-modal similarity.
Optimizing the fourth term of the objective function makes the numbers of "1" and "-1" values at each position of the final hash codes balanced over the whole training set, i.e. at any given bit position of the hash codes, "1" and "-1" each account for roughly half of the samples. This constraint ensures that the information carried by each bit of the hash codes is maximized.
Experiments show that, during network training, forcing the image and the text from the same data point to take identical hash codes further improves the performance of the network. Therefore, the constraint C(g) = C(x) = C is added to the original objective function, giving the final objective function.
By optimizing this objective function, the network learns the feature extraction parameters and the hash-code representation simultaneously, i.e. feature learning and hash-code learning are unified in one deep learning framework, realizing end-to-end learning.
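The closed-form objective is given as a numbered formula in the original filing; as a rough, non-authoritative PyTorch sketch, its four terms and the shared-code constraint described above could be written as follows. The inner-product score theta, the per-term normalisation, and the placement of the hyper-parameters mu, rho, tau are assumptions of the sketch.

    import torch
    import torch.nn.functional as F

    def dmsh_loss(Fg, Fx, S_bin, S_multi, mu=1.0, rho=1.0, tau=1.0):
        """Sketch of the four-term objective described above (not the patented formula).
        Fg: (n, k) image features; Fx: (n, k) text features;
        S_bin / S_multi: (n, n) binary and multi-level cross-modal similarity matrices."""
        theta = 0.5 * Fg @ Fx.t()                              # pairwise cross-modal scores (assumed form)

        # Terms 1-2: negative log-likelihood of cross-modal similarity, written in the
        # numerically stable softplus form of the sigmoid cross-entropy.
        loss_bin = torch.mean(F.softplus(theta) - S_bin * theta)
        loss_multi = torch.mean(F.softplus(theta) - S_multi * theta)

        # Shared hash codes with the constraint C(g) = C(x) = C.
        C = torch.sign(Fg.detach() + Fx.detach())

        # Term 3: regularization term binding the hash codes to the learned features.
        loss_quant = ((C - Fg).pow(2).sum() + (C - Fx).pow(2).sum()) / Fg.numel()

        # Term 4: bit balance, i.e. each bit position should be +1 and -1 about equally
        # often over the training set, so the column sums should stay near zero.
        ones = torch.ones(Fg.size(0), 1, device=Fg.device)
        loss_bal = ((Fg.t() @ ones).pow(2).sum() + (Fx.t() @ ones).pow(2).sum()) / Fg.numel()

        return loss_bin + mu * loss_multi + rho * loss_quant + tau * loss_bal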
In the test and application stage, image or text data of any single modality is input, and the trained network generates its corresponding binary code vector, i.e. its hash code.
Specifically, the image modality g_i of data point D_i is fed into the network, and its hash-code representation is produced by a forward pass of the network.
Similarly, for the text modality x_j of data point D_j, the corresponding hash code can be generated by a forward pass of the network.
Therefore, given query data of either modality (image or text), the DMSH retrieval model proposed herein can return the top-k most similar results from the database of the other modality. During retrieval, the distance between the hash code of the query data and the hash codes stored in the database to be searched is computed first, and then the k nearest hash codes are selected; the k data items they correspond to are the final retrieval results.
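For illustration, ranking by Hamming distance over learned ±1 hash codes can be sketched as below; the function name and the value of k are illustrative, not part of the patent.

    import numpy as np

    def hamming_rank(query_code, db_codes, k=10):
        """Return the indices of the k database codes nearest to the query code.
        Codes are ±1 vectors, as produced by sign(); for code length L the
        Hamming distance equals (L - <query, code>) / 2."""
        L = query_code.shape[0]
        dists = (L - db_codes @ query_code) / 2                # Hamming distance to every database code
        return np.argsort(dists)[:k]                           # indices of the k nearest items

    # Usage sketch (hypothetical variables): retrieve texts for an image query.
    # img_code = np.sign(image_branch_output)                  # hash code of the query image
    # txt_codes = np.sign(text_branch_outputs)                 # hash codes of the text database
    # top_k = hamming_rank(img_code, txt_codes, k=10)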

Claims (5)

1. An image-text cross-modal retrieval method based on a multi-level semantic deep hashing algorithm, characterized in that: the overall framework comprises three modules: a deep feature extraction module, a similarity matrix generation module, and a hash-code learning module; two deep neural networks are respectively adopted to extract image and text features, feature learning and hash-code learning are unified in one framework, and the whole training process is guided by introducing multi-level semantic supervision information based on label co-occurrence, so that the generated binary codes not only preserve the basic similar/dissimilar relations of the original sample space but can also distinguish the degree of similarity between samples, thereby largely preserving the high-level semantics between samples and improving retrieval accuracy; structurally, the network is trained under the constraint that "images and texts that are similar in the semantic space have similar hash codes in the Hamming space", and the hash codes are taken directly as the network output, realizing end-to-end learning and ensuring that the learned features are adapted to the specific retrieval task.
2. The image-text cross-modal retrieval method based on a multi-level semantic deep hashing algorithm according to claim 1, characterized in that: the overall framework consists of three parts, namely the deep feature extraction module, the similarity matrix generation module, and the hash-code learning module, and the data of the original space are mapped to binary code vectors of uniform form composed of "+1/-1" in the Hamming space, which reduces storage space and improves computational efficiency.
3. The image-text cross-modal retrieval method based on a multi-level semantic deep hashing algorithm according to claim 1, characterized in that: the deep feature extraction module adopts different deep neural networks for image data and text data respectively to extract the semantic features of the two modalities, using an improved CNN-F network for image data and a multi-layer perceptron network for text data.
4. The image-text cross-modal retrieval method based on a multi-level semantic deep hashing algorithm according to claim 1, characterized in that: the similarity matrix generation module generates the binary similarity matrix according to whether data of different modalities share common labels, and generates the multi-level semantic similarity matrix according to the degree of similarity between the labels of different modalities' data, preserving more of the implicit semantic information provided by the multiple labels.
5. The image-text cross-modal retrieval method based on a multi-level semantic deep hashing algorithm according to claim 1, characterized in that: the hash-code learning module designs an objective function that simultaneously preserves the binary similarity information and the multi-level semantic similarity information of the data in the original semantic space, trains the network with it, and learns the mapping from the feature space to the Hamming space.
CN201810649234.7A 2018-06-22 2018-06-22 Image-text cross-modal retrieval based on a multi-level semantic deep hashing algorithm Pending CN110110122A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810649234.7A CN110110122A (en) 2018-06-22 2018-06-22 Image-text cross-modal retrieval based on a multi-level semantic deep hashing algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810649234.7A CN110110122A (en) 2018-06-22 2018-06-22 Image-text cross-modal retrieval based on a multi-level semantic deep hashing algorithm

Publications (1)

Publication Number Publication Date
CN110110122A true CN110110122A (en) 2019-08-09

Family

ID=67483310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810649234.7A Pending CN110110122A (en) 2018-06-22 2018-06-22 Image-text cross-modal retrieval based on a multi-level semantic deep hashing algorithm

Country Status (1)

Country Link
CN (1) CN110110122A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004006128A2 (en) * 2002-07-09 2004-01-15 Koninklijke Philips Electronics N.V. Method and apparatus for classification of a data object in a database
CN104166982A (en) * 2014-06-30 2014-11-26 复旦大学 Image optimization clustering method based on typical correlation analysis
CN104834748A (en) * 2015-05-25 2015-08-12 中国科学院自动化研究所 Image retrieval method utilizing deep semantic to rank hash codes
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
CN107679580A (en) * 2017-10-21 2018-02-09 桂林电子科技大学 A kind of isomery shift image feeling polarities analysis method based on the potential association of multi-modal depth
CN107766555A (en) * 2017-11-02 2018-03-06 电子科技大学 Image search method based on the unsupervised type cross-module state Hash of soft-constraint
CN108170755A (en) * 2017-12-22 2018-06-15 西安电子科技大学 Cross-module state Hash search method based on triple depth network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUE CAO et al.: "Deep Visual-Semantic Hashing for Cross-Modal Retrieval", Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16) *
ZHENYAN JI et al.: "A Survey of Personalised Image Retrieval and Recommendation", Theoretical Computer Science (2017) *
姚伟娜: "Research on image-text cross-modal retrieval based on deep hashing algorithms", China Masters' Theses Full-text Database, Information Science and Technology Series *
张玉宏 et al.: "A methodological analysis of deep learning", Journal of Chongqing University of Technology (Social Sciences) *

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110597878A (en) * 2019-09-16 2019-12-20 广东工业大学 Cross-modal retrieval method, device, equipment and medium for multi-modal data
CN110597878B (en) * 2019-09-16 2023-09-15 广东工业大学 Cross-modal retrieval method, device, equipment and medium for multi-modal data
CN110765281A (en) * 2019-11-04 2020-02-07 山东浪潮人工智能研究院有限公司 Multi-semantic depth supervision cross-modal Hash retrieval method
CN111026887A (en) * 2019-12-09 2020-04-17 武汉科技大学 Cross-media retrieval method and system
CN111026887B (en) * 2019-12-09 2023-05-23 武汉科技大学 Cross-media retrieval method and system
CN111125457A (en) * 2019-12-13 2020-05-08 山东浪潮人工智能研究院有限公司 Deep cross-modal Hash retrieval method and device
CN110990597A (en) * 2019-12-19 2020-04-10 中国电子科技集团公司信息科学研究院 Cross-modal data retrieval system based on text semantic mapping and retrieval method thereof
CN110990597B (en) * 2019-12-19 2022-11-25 中国电子科技集团公司信息科学研究院 Cross-modal data retrieval system based on text semantic mapping and retrieval method thereof
CN111177421A (en) * 2019-12-30 2020-05-19 论客科技(广州)有限公司 Method and device for generating email historical event axis facing digital human
WO2021136318A1 (en) * 2019-12-30 2021-07-08 论客科技(广州)有限公司 Digital humanities-oriented email history eventline generating method and apparatus
CN111221993B (en) * 2020-01-09 2023-07-07 山东建筑大学 Visual media retrieval method based on depth binary detail perception hash
CN111221993A (en) * 2020-01-09 2020-06-02 山东建筑大学 Visual media retrieval method based on depth binary detail perception hash
CN111353076B (en) * 2020-02-21 2023-10-10 华为云计算技术有限公司 Method for training cross-modal retrieval model, cross-modal retrieval method and related device
CN111353076A (en) * 2020-02-21 2020-06-30 华为技术有限公司 Method for training cross-modal retrieval model, cross-modal retrieval method and related device
CN111368176A (en) * 2020-03-02 2020-07-03 南京财经大学 Cross-modal Hash retrieval method and system based on supervision semantic coupling consistency
CN111368176B (en) * 2020-03-02 2023-08-18 南京财经大学 Cross-modal hash retrieval method and system based on supervision semantic coupling consistency
CN111651660A (en) * 2020-05-28 2020-09-11 拾音智能科技有限公司 Method for cross-media retrieval of difficult samples
CN111651660B (en) * 2020-05-28 2023-05-02 拾音智能科技有限公司 Method for cross-media retrieval of difficult samples
CN111813967A (en) * 2020-07-14 2020-10-23 中国科学技术信息研究所 Retrieval method, retrieval device, computer equipment and storage medium
CN111813967B (en) * 2020-07-14 2024-01-30 中国科学技术信息研究所 Retrieval method, retrieval device, computer equipment and storage medium
CN111897909A (en) * 2020-08-03 2020-11-06 兰州理工大学 Ciphertext voice retrieval method and system based on deep perception Hash
CN111914156A (en) * 2020-08-14 2020-11-10 中国科学院自动化研究所 Cross-modal retrieval method and system for self-adaptive label perception graph convolution network
CN111914156B (en) * 2020-08-14 2023-01-20 中国科学院自动化研究所 Cross-modal retrieval method and system for self-adaptive label perception graph convolution network
CN112035700A (en) * 2020-08-31 2020-12-04 兰州理工大学 Voice deep hash learning method and system based on CNN
CN112100413A (en) * 2020-09-07 2020-12-18 济南浪潮高新科技投资发展有限公司 Cross-modal Hash retrieval method
CN112199520A (en) * 2020-09-19 2021-01-08 复旦大学 Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix
CN112199520B (en) * 2020-09-19 2022-07-22 复旦大学 Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix
CN112613451A (en) * 2020-12-29 2021-04-06 民生科技有限责任公司 Modeling method of cross-modal text picture retrieval model
CN113095415A (en) * 2021-04-15 2021-07-09 齐鲁工业大学 Cross-modal hashing method and system based on multi-modal attention mechanism
CN113095415B (en) * 2021-04-15 2022-06-14 齐鲁工业大学 Cross-modal hashing method and system based on multi-modal attention mechanism
CN113157739B (en) * 2021-04-23 2024-01-09 平安科技(深圳)有限公司 Cross-modal retrieval method and device, electronic equipment and storage medium
CN113157739A (en) * 2021-04-23 2021-07-23 平安科技(深圳)有限公司 Cross-modal retrieval method and device, electronic equipment and storage medium
CN113270199A (en) * 2021-04-30 2021-08-17 贵州师范大学 Medical cross-modal multi-scale fusion class guidance hash method and system thereof
CN113270199B (en) * 2021-04-30 2024-04-26 贵州师范大学 Medical cross-mode multi-scale fusion class guide hash method and system thereof
CN113342922A (en) * 2021-06-17 2021-09-03 北京邮电大学 Cross-modal retrieval method based on fine-grained self-supervision of labels
CN113177132A (en) * 2021-06-30 2021-07-27 中国海洋大学 Image retrieval method based on depth cross-modal hash of joint semantic matrix
CN113536067B (en) * 2021-07-20 2024-01-05 南京邮电大学 Cross-modal information retrieval method based on semantic fusion
CN113536067A (en) * 2021-07-20 2021-10-22 南京邮电大学 Cross-modal information retrieval method based on semantic fusion
CN113658683A (en) * 2021-08-05 2021-11-16 重庆金山医疗技术研究院有限公司 Disease diagnosis system and data recommendation method
CN113806580A (en) * 2021-09-28 2021-12-17 西安电子科技大学 Cross-modal Hash retrieval method based on hierarchical semantic structure
CN113806580B (en) * 2021-09-28 2023-10-20 西安电子科技大学 Cross-modal hash retrieval method based on hierarchical semantic structure
CN113792207B (en) * 2021-09-29 2023-11-17 嘉兴学院 Cross-modal retrieval method based on multi-level feature representation alignment
CN113792207A (en) * 2021-09-29 2021-12-14 嘉兴学院 Cross-modal retrieval method based on multi-level feature representation alignment
CN114239730A (en) * 2021-12-20 2022-03-25 华侨大学 Cross-modal retrieval method based on neighbor sorting relation
CN114780777B (en) * 2022-04-06 2022-12-20 中国科学院上海高等研究院 Cross-modal retrieval method and device based on semantic enhancement, storage medium and terminal
CN114780777A (en) * 2022-04-06 2022-07-22 中国科学院上海高等研究院 Semantic enhancement based cross-modal retrieval method and device, storage medium and terminal
CN116955675A (en) * 2023-09-21 2023-10-27 中国海洋大学 Hash image retrieval method and network based on fine-grained similarity relation contrast learning
CN116955675B (en) * 2023-09-21 2023-12-12 中国海洋大学 Hash image retrieval method and network based on fine-grained similarity relation contrast learning

Similar Documents

Publication Publication Date Title
CN110110122A (en) Image-text cross-modal retrieval based on a multi-level semantic deep hashing algorithm
Yan et al. Cross-modality bridging and knowledge transferring for image understanding
Tautkute et al. Deepstyle: Multimodal search engine for fashion and interior design
Piras et al. Information fusion in content based image retrieval: A comprehensive overview
Hoi et al. A semi-supervised active learning framework for image retrieval
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
Castellano et al. Leveraging knowledge graphs and deep learning for automatic art analysis
CN108959522B (en) Migration retrieval method based on semi-supervised countermeasure generation network
Lin et al. Mask cross-modal hashing networks
De Beul et al. An ontology for video human movement representation based on benesh notation
US11860932B2 (en) Scene graph embeddings using relative similarity supervision
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN112528136A (en) Viewpoint label generation method and device, electronic equipment and storage medium
Maheshwari et al. Scene graph embeddings using relative similarity supervision
Singhania et al. Text-based image retrieval using deep learning
Bouchakwa et al. A review on visual content-based and users’ tags-based image annotation: methods and techniques
Li et al. Multimodal fusion with co-attention mechanism
Zhu et al. Cross-modal retrieval: a systematic review of methods and future directions
Sharma et al. Evolution of visual data captioning Methods, Datasets, and evaluation Metrics: A comprehensive survey
Xue et al. Learning semantic dependencies with channel correlation for multi-label classification
CN109255098B (en) Matrix decomposition hash method based on reconstruction constraint
Zhou et al. Disambiguating named entities with deep supervised learning via crowd labels
Goyal et al. A Review on Different Content Based Image Retrieval Techniques Using High Level Semantic Feature
Ke et al. Real web community based automatic image annotation
An et al. Pedestrian Reidentification Algorithm Based on Deconvolution Network Feature Extraction‐Multilayer Attention Mechanism Convolutional Neural Network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190809

RJ01 Rejection of invention patent application after publication