CN109284414A

CN109284414A - Semantic-preserving cross-modal content retrieval method and system

Info

Publication number: CN109284414A
Application number: CN201811156579.5A
Authority: CN
Inventors: 王树徽; 吴益灵; 黄庆明
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2018-09-30
Filing date: 2018-09-30
Publication date: 2019-01-29
Anticipated expiration: 2038-09-30
Also published as: CN109284414B

Abstract

The present invention relates to a semantic-preserving cross-modal content retrieval method, comprising: constructing a first feature graph and a second feature graph whose nodes are the feature vectors of first-modality samples and second-modality samples, respectively; extracting the label vectors of all samples as nodes to construct a semantic graph; obtaining the neighbor nodes of each node; constructing a first mapping function and a second mapping function for mapping the first-modality samples and the second-modality samples, respectively, to latent representations; learning the mapping functions so as to approximately maximize the likelihood of the neighbor nodes of each node occurring, while enabling each latent representation to reconstruct the label information of its corresponding node; mapping a query sample to a query latent representation with the first mapping function, and mapping each second-modality sample to a target latent representation with the second mapping function; and obtaining the distance between the query latent representation and each target latent representation, and taking the second-modality samples corresponding to all distances smaller than a retrieval threshold as the retrieval result.

Description

Semantic-preserving cross-modal content retrieval method and system

Technical field

The present invention relates to cross-modal retrieval technology in the multimedia field, and in particular to cross-modal content retrieval technology.

Background art

With the development of multimedia technology, data of various modalities has become widespread on the Internet, and cross-modal retrieval is one of the important research topics in the multimedia field. In a traditional single-modality retrieval system, the query sample and the retrieval results are confined to a single modality, which cannot satisfy the growing needs of users. A cross-modal retrieval system differs in that the query sample and the retrieval results belong to different modalities; for example, an image, video or audio sample is used as the query sample to retrieve text content. Cross-modal retrieval offers users a more convenient way to search, helps them obtain the information they need in multiple modalities, and improves the user experience. Because the query sample and the retrieval results belong to different modalities, how to compare the similarity of samples of different modalities at the semantic level is a problem worth studying.

Since different modalities are heterogeneous, the key to cross-modal retrieval is how to associate them. At present, the vast majority of cross-modal retrieval algorithms map samples of different modalities into a low-dimensional latent space. According to the type of latent representation learned, these algorithms can be divided into real-valued-representation and binary-representation cross-modal retrieval methods. According to the type of information used, they can be divided into unsupervised and supervised methods. Unsupervised methods use only the co-occurrence information of samples of different modalities, whereas supervised methods also use the label information of the samples. In general, the more information is used, the better the cross-modal retrieval algorithm performs.

Label information can serve as high-level semantic information to guide the establishment of relationships between samples of different modalities: although samples of different modalities have different feature spaces, they share the same label space. In existing methods, label information is used as an additional modality, or to compute similarity for associating image-text pairs, or as the representation in the latent space. Existing methods use label information in a relatively simple way: they consider only associations between modalities and ignore associations within a modality, yet intra-modality association information is crucial. Within one modality, samples with similar semantics should have similar latent representations; combined with the requirement that samples with similar semantics across modalities have similar latent representations, this consistent similarity ensures that all semantically similar samples have similar latent representations. The present invention aims to create one semantic graph containing all samples to provide a high-level semantic constraint, together with two feature graphs, each containing the samples of one modality, to provide manifold constraints, and to reconstruct the label information to provide a global semantic constraint. In addition, when the number of nodes is M, traditional graph-based methods need to create a graph of complexity O(M²), and solving it requires a complicated eigenvalue decomposition; a more efficient algorithm for learning from graph structure is therefore needed.

Summary of the invention

In view of the above problems, the present invention discloses a semantic-preserving cross-modal content retrieval method and system, comprising: constructing a retrieval set from first-modality samples and a target set from second-modality samples; extracting the feature vectors of the first-modality samples as nodes to construct a first feature graph; extracting the feature vectors of the second-modality samples as nodes to construct a second feature graph; extracting the label vectors of the label information of all samples in the retrieval set and the target set as nodes to construct a semantic graph; obtaining the neighbor nodes of each node; constructing a first mapping function for mapping the first-modality samples to latent representations, and a second mapping function for mapping the second-modality samples to latent representations; learning the first mapping function and the second mapping function so as to approximately maximize the likelihood of the neighbor nodes of each node occurring, while enabling each latent representation to reconstruct the label information of its corresponding node; taking a first-modality sample as the query sample, mapping the query sample to a query latent representation with the first mapping function, and mapping each second-modality sample to a target latent representation with the second mapping function; and obtaining the distance between the query latent representation and each target latent representation, and taking the second-modality samples corresponding to all distances smaller than a retrieval threshold as the retrieval result for the query sample.

In the cross-modal content retrieval method of the present invention, the first mapping function and the second mapping function are learned using neighbor sampling and negative sampling: a multinomial distribution is established according to the weights of the edges from a sampled node to its adjacent nodes; nodes connected to the sampled node are sampled from this multinomial distribution as neighbor nodes; and nodes not connected to the sampled node are selected by a uniform distribution as negative nodes.

In the cross-modal content retrieval method of the present invention, the distance is the Euclidean distance d(x_i, x_j) = (x_i - x_j)² or the cosine distance between the query latent representation and the target latent representation, where x_i is the query latent representation and x_j is the target latent representation.

In the cross-modal content retrieval method of the present invention, the modality of the first-modality samples includes the visual modality, the audio modality and the text modality, and the modality of the second-modality samples includes the visual modality, the audio modality and the text modality.

In the cross-modal content retrieval method of the present invention, if the modality of the first-modality samples and/or the second-modality samples is the visual modality, the feature vectors of the first-modality samples and/or the second-modality samples are scale-invariant feature transform features, visual-modality convolutional neural network features, or histogram of oriented gradients features; if the modality of the first-modality samples and/or the second-modality samples is the text modality, the feature vectors of the first-modality samples and/or the second-modality samples are term frequency-inverse document frequency features or text-modality deep convolutional/recurrent neural network features.

The present invention also discloses a semantic-preserving cross-modal content retrieval system, comprising:

a sample set construction module, for constructing a retrieval set from first-modality samples and a target set from second-modality samples;

a feature graph construction module, for constructing a first feature graph, a second feature graph and a semantic graph, and obtaining the neighbor nodes of each node in the first feature graph and the second feature graph; wherein the feature vectors of the first-modality samples are extracted as nodes to construct the first feature graph, the feature vectors of the second-modality samples are extracted as nodes to construct the second feature graph, and the label vectors of the label information of all samples in the retrieval set and the target set are extracted as nodes to construct the semantic graph; and the neighbor nodes of each node are obtained;

a mapping function learning module, for constructing mapping functions and learning the mapping functions; wherein a first mapping function for mapping the first-modality samples to latent representations and a second mapping function for mapping the second-modality samples to latent representations are constructed; and the first mapping function and the second mapping function are learned so as to approximately maximize the likelihood of the neighbor nodes of each node occurring, while enabling each latent representation to reconstruct the label information of its corresponding node;

a sample retrieval module, for obtaining the retrieval result; wherein a first-modality sample is taken as the query sample, the query sample is mapped to a query latent representation with the first mapping function, and each second-modality sample is mapped to a target latent representation with the second mapping function; and the distance between the query latent representation and each target latent representation is obtained, and the second-modality samples corresponding to all distances smaller than a retrieval threshold are taken as the retrieval result for the query sample.

In the cross-modal content retrieval system of the present invention, the mapping function learning module comprises:

a neighbor sampling module, for learning the first mapping function and the second mapping function using neighbor sampling; wherein a multinomial distribution is established according to the weights of the edges from a sampled node to its adjacent nodes, and nodes connected to the sampled node are sampled from this multinomial distribution as neighbor nodes;

a negative sampling module, for learning the first mapping function and the second mapping function using negative sampling; wherein nodes not connected to the sampled node are sampled from a multinomial distribution as negative nodes.

In the cross-modal content retrieval system of the present invention, in the retrieval and result acquisition module, the distance is the Euclidean distance d(x_i, x_j) = (x_i - x_j)² or the cosine distance between the query latent representation and the target latent representation, where x_i is the query latent representation and x_j is the target latent representation.

In the cross-modal content retrieval system of the present invention, the modality of the first-modality samples includes the visual modality, the audio modality and the text modality, and the modality of the second-modality samples includes the visual modality, the audio modality and the text modality.

In the cross-modal content retrieval system of the present invention, if the modality of the first-modality samples and/or the second-modality samples is the visual modality, the feature vectors of the first-modality samples and/or the second-modality samples are scale-invariant feature transform features, visual-modality convolutional neural network features, or histogram of oriented gradients features; if the modality of the first-modality samples and/or the second-modality samples is the text modality, the feature vectors of the first-modality samples and/or the second-modality samples are term frequency-inverse document frequency features or text-modality deep convolutional/recurrent neural network features.

Brief description of the drawings

Fig. 1 is a flow chart of the semantic-preserving cross-modal content retrieval method of an embodiment of the present invention.

Fig. 2 is a schematic diagram of the feature graphs and the semantic graph of the semantic-preserving cross-modal content retrieval method of an embodiment of the present invention.

Fig. 3 is a schematic diagram of the mapping functions of the semantic-preserving cross-modal content retrieval method of an embodiment of the present invention.

Fig. 4 is a schematic diagram of the semantic-preserving cross-modal content retrieval system of an embodiment of the present invention.

Detailed description of the embodiments

In order to make the objectives, technical solutions and advantages of the present invention clearer, the semantic-preserving cross-modal content retrieval method and system proposed by the present invention are further described below with reference to the accompanying drawings. It should be understood that the specific implementations described here are only intended to explain the present invention and are not intended to limit it.

The present invention proposes a semantic-preserving cross-modal retrieval method that involves multiple modalities. For ease of description, the embodiments of the present invention involve only the two modalities of text and image; it should be understood, however, that the cross-modal content retrieval method according to the present invention is broadly applicable to modalities such as text, vision and hearing and to multi-modalities such as video, and is not limited to the above modalities. The cross-modal retrieval method according to the present invention is roughly divided into three steps: first, raw features are extracted from each sample by feature extraction; next, mapping functions are learned that map each sample from its raw features to a latent representation; finally, the distance between the latent representation of the query sample and the latent representation of each sample in the target set is computed, the distances are sorted, and the target-set samples whose distance to the query sample is below a threshold are selected as the retrieval result.

Fig. 1 is a flow chart of the semantic-preserving cross-modal content retrieval method of an embodiment of the present invention. As shown in Fig. 1, in an embodiment of the present invention, the semantic-preserving cross-modal retrieval method specifically comprises:

Step S1: construct a retrieval set and a target set, wherein all samples of the retrieval set have a first modality and are referred to as first-modality samples, while all samples of the target set have a second modality and are referred to as second-modality samples. The modalities of the first-modality samples and the second-modality samples include the visual modality, the audio modality, the text modality and so on, or a multi-modality comprising the visual modality and the audio modality, such as the video modality; the present invention is not limited thereto. The first-modality samples and the second-modality samples have different modalities; in this embodiment of the present invention, the first modality is the image modality and the second modality is the text modality.

Step S2: extract the feature vectors of all first-modality samples as nodes to construct a first feature graph; extract the feature vectors of all second-modality samples as nodes to construct a second feature graph; extract the label information of the semantic labels of all first-modality samples and second-modality samples as label vectors, and construct a semantic graph with each label vector as a node. In this embodiment of the present invention, where the first-modality samples are image samples and the second-modality samples are text samples, the feature vectors of the image samples and the text samples are extracted first. The feature vector of an image sample may be, for example, a SIFT (scale-invariant feature transform) feature, a visual-modality CNN (convolutional neural network) feature, or a HOG (histogram of oriented gradients) feature; the feature vector of a text sample may be a TF-IDF (term frequency-inverse document frequency) feature, a text-modality CNN (convolutional neural network) feature, or a text-modality CNN/RNN (deep convolutional/recurrent neural network) feature; the present invention is not limited thereto.
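As an illustration of the text-side feature extraction named in step S2, the following is a minimal sketch of TF-IDF computed with the Python standard library. The function name `tfidf_vectors`, the whitespace tokenization and the unsmoothed `log(N / df)` weighting are assumptions made for this example; the patent does not specify a particular TF-IDF variant.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors); vectors[d][k] is the TF-IDF of word k in doc d.

    Hypothetical helper: whitespace tokenization and plain log(N / df) weighting
    are illustrative assumptions, not choices stated in the source.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents in which each word occurs.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append(
            [(counts[w] / total) * math.log(n_docs / doc_freq[w]) for w in vocab]
        )
    return vocab, vectors
```

Each returned vector could then serve as the node attribute of one text sample in the second feature graph.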

Fig. 2 is a schematic diagram of the feature graphs and the semantic graph of the semantic-preserving cross-modal content retrieval method of an embodiment of the present invention. As shown in Fig. 2, three graphs are built from the first-modality samples and the second-modality samples: the semantic graph Gs, the first feature graph (image feature graph Gt) and the second feature graph (text feature graph Gi). Every label vector extracted from the semantic labels of the text samples and image samples is a node in the semantic graph Gs.

Step S3: obtain the neighbor nodes of each node from the three graphs, where the three graphs are the semantic graph, the first feature graph and the second feature graph. Because the semantic graph Gs contains the semantic labels of both the text samples and the image samples, it contains both inter-modality and intra-modality semantic information. With the label information of the image samples and text samples as label vectors, the connections between the nodes of the semantic graph are established by one of two methods:

First method: an edge is established between two nodes of the semantic graph if and only if their label vectors have at least one dimension in which both values are non-zero. The similarity between the label vectors is computed as the weight of the edge between the nodes, using either the cosine similarity or an exponential similarity, where z_i and z_j denote the label vectors of nodes i and j, respectively, and σ is the scaling factor of the exponential similarity.

Second method: the connections between the nodes of the semantic graph are established using an existing knowledge graph. For example, the concepts in WordNet corresponding to the labels of the image samples and text samples are found, and the similarity between entities in the knowledge graph is computed, for instance by shortest path, as the weight of the edge between nodes in the semantic graph. In the multi-label case, the similarities between all labels are averaged to give the edge weight in the semantic graph Gs. In the first feature graph (image feature graph), the distance between any two nodes is computed from the image feature vectors; if one node is a k-nearest neighbor of the other, the two nodes are connected and the edge weight is 1. In the second feature graph (text feature graph), the distance between any two nodes is computed from the text feature vectors; if one node is a k-nearest neighbor of the other, the two nodes are connected and the edge weight is 1.
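The graph construction of step S3 can be sketched as follows, covering the first (label-overlap) method for the semantic graph and the k-nearest-neighbor rule for the feature graphs. This is a brute-force illustration over all node pairs, written with the standard library only; the function names, the edge-dictionary layout and the use of squared Euclidean distance for the k-nearest-neighbor search are assumptions for this example, and only the cosine-similarity weighting is shown.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_graph(labels):
    """Return {(i, j): weight} for i < j.

    Two nodes are linked only if some label dimension is non-zero in both;
    the edge weight is the cosine similarity of the two label vectors.
    """
    edges = {}
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if any(a != 0 and b != 0 for a, b in zip(labels[i], labels[j])):
                edges[(i, j)] = cosine_similarity(labels[i], labels[j])
    return edges

def knn_graph(features, k):
    """Return {(i, j): 1.0} for i < j when one node is a k-nearest neighbor of the other."""
    edges = {}
    for i, feat_i in enumerate(features):
        # Squared Euclidean distance to every other node (an assumed metric).
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(feat_i, feat_j)), j)
            for j, feat_j in enumerate(features)
            if j != i
        )
        for _, j in dists[:k]:
            edges[(min(i, j), max(i, j))] = 1.0
    return edges
```

For example, with multi-hot label vectors `[[1, 0, 0], [1, 1, 0], [0, 0, 1]]`, only nodes 0 and 1 share a non-zero label dimension, so the semantic graph holds a single edge between them.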

Step S4: construct a first mapping function for mapping the first-modality samples to latent representations and a second mapping function for mapping the second-modality samples to latent representations; learn the first mapping function and the second mapping function so as to approximately maximize the likelihood of the neighbor nodes of each node occurring, while enabling each latent representation to reconstruct the label information of its corresponding node. For an image sample v_i, the latent representation is f_v(v_i); for a text sample t_i, the latent representation is f_t(t_i). Both latent representations are written uniformly as f(n_i), where n_i is an image sample or a text sample. In order to preserve the local structure of the semantic graph, the first feature graph and the second feature graph, the present invention maximizes, for each graph separately, the probability of the neighbor nodes of each node occurring.

For a node n_i, a set of neighbor samples P(n_i) is sampled, and the probability Pr(P(n_i) | n_i) is maximized. Here V is the set of all nodes in the semantic graph, the first feature graph and the second feature graph, P(n_i) denotes the samples corresponding to the neighbor nodes of node n_i, i.e. the neighbor samples, and T denotes vector transposition.

When the number of nodes is large, negative samples are drawn so that maximizing the aforementioned probability Pr(P(n_i) | n_i) is relaxed into minimizing a loss, in which N(n_i) denotes the negative samples of node n_i. Neighbor samples are obtained by neighbor sampling, i.e. a multinomial distribution is established according to the weights of the edges from each neighbor node to node n_i, and neighbor nodes are sampled from this multinomial distribution. Negative samples are obtained by negative sampling, i.e. nodes not connected to n_i are selected by a uniform distribution as negative samples. In all three graphs, the local structure is preserved using the same kind of neighbor sampling and negative sampling. In this loss, G can be the semantic graph G_s, the text feature graph G_i or the image feature graph G_t; that is, the corresponding local-structure losses can be obtained for an image sample v_i and for a text sample t_i.
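The sampling scheme described above can be sketched as follows. Neighbor sampling and uniform negative sampling follow the text directly. The loss function, however, is an assumption: the formula is not reproduced in this text, so the sketch uses the widely known skip-gram negative-sampling form, which rewards a large inner product f(n_j)^T f(n_i) for neighbor samples and a small one for negative samples. All function names and the adjacency-dictionary layout are hypothetical.

```python
import math
import random

def sample_neighbors(node, weights, n_samples, rng):
    """Draw neighbors of `node` from a multinomial distribution over edge weights.

    weights: dict mapping node -> {neighbor: edge_weight} (assumed layout).
    """
    neighbors = list(weights[node])
    edge_weights = [weights[node][nb] for nb in neighbors]
    return rng.choices(neighbors, weights=edge_weights, k=n_samples)

def sample_negatives(node, weights, all_nodes, n_samples, rng):
    """Draw nodes that are not connected to `node`, uniformly at random."""
    candidates = [n for n in all_nodes if n != node and n not in weights[node]]
    return rng.choices(candidates, k=n_samples)

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(emb, node, positives, negatives):
    """Assumed skip-gram style loss, not the formula from the source.

    emb maps each node to its latent representation f(n).
    """
    loss = -sum(log_sigmoid(dot(emb[p], emb[node])) for p in positives)
    loss -= sum(log_sigmoid(-dot(emb[n], emb[node])) for n in negatives)
    return loss
```

Under this assumed loss, a node whose latent representation is aligned with its sampled neighbors and opposed to its negative samples receives a lower loss than in the reverse situation, which is the behavior the local-structure constraint is meant to encourage.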

In addition, a global semantic preservation condition is introduced, i.e. the mapped latent representation must be able to recover the semantic label information. Let g(·) be the function from the latent representation to the semantic label; the global semantic preservation loss then measures how well g(f(n_i)) recovers the semantic label of node n_i.

Overall, for an image sample v_i, the loss to be optimized combines the above terms, where α and β are balance coefficients; similarly, a corresponding loss is optimized for a text sample t_i.
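Because the loss formulas themselves are not reproduced in this text, the following is only a sketch of how the terms might be combined. It assumes a squared-error reconstruction loss between g(f(n_i)) and the label vector, and assumes that α weights the feature-graph term and β weights the global term; both the function names and this particular weighting are hypothetical.

```python
def reconstruction_loss(predicted, label):
    """Assumed squared error between g(f(n_i)) and the label vector of n_i."""
    return sum((p - y) ** 2 for p, y in zip(predicted, label))

def total_loss(semantic_graph_loss, feature_graph_loss, global_loss, alpha, beta):
    """Assumed weighting: semantic term + alpha * feature term + beta * global term."""
    return semantic_graph_loss + alpha * feature_graph_loss + beta * global_loss
```

For an image sample the feature-graph term would come from the image feature graph, and for a text sample from the text feature graph, with the semantic-graph term shared by both.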

In order to model the non-linear relationship between the raw features and the latent representations, the present invention uses a neural network structure. Fig. 3 is a schematic diagram of the mapping functions of the semantic-preserving cross-modal content retrieval method of an embodiment of the present invention. As shown in Fig. 3, f_v(·) and f_t(·) map images and text into a unified latent representation space, which g(·) then maps to the semantic label space. The form of the network can differ between concrete applications; for example, the number of layers of f_v(·), f_t(·) and g(·) can be increased or reduced. Finally, the loss function is optimized using stochastic gradient descent and the error back-propagation algorithm to learn the mapping functions.
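The structure in Fig. 3 can be illustrated with a forward-pass-only sketch in which each mapping function is a small stack of fully connected layers. The layer sizes, the tanh activation, the random initialization and the helper names are assumptions chosen for illustration; training by stochastic gradient descent and back-propagation is omitted.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(layer, x, activation=None):
    """Apply one fully connected layer to the vector x."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(o) for o in out] if activation else out

def make_mapping(dims, rng):
    """Build a mapping function as a stack of tanh layers with sizes `dims`."""
    layers = [init_layer(a, b, rng) for a, b in zip(dims, dims[1:])]

    def mapping(x):
        for layer in layers:
            x = dense(layer, x, math.tanh)
        return x

    return mapping

rng = random.Random(0)
f_v = make_mapping([8, 6, 4], rng)  # image features (dim 8) -> latent space (dim 4)
f_t = make_mapping([5, 6, 4], rng)  # text features (dim 5) -> latent space (dim 4)
g = make_mapping([4, 3], rng)       # latent space -> label space (3 labels)
```

The point of the sketch is that f_v and f_t accept inputs of different dimensionality but emit vectors in the same latent space, so a single g can be applied to either output.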

Step S5: compute the latent representation of each sample using the learned mapping functions. Given a first-modality sample in the retrieval set (the query sample), compute the distance between its latent representation and the latent representation of each second-modality sample in the target set. The distance may be the Euclidean distance d(x_i, x_j) = (x_i - x_j)² or the cosine distance; the present invention is not limited thereto. Here x_i and x_j denote the latent representation of the first-modality sample and the latent representation of the second-modality sample, respectively. All distances obtained are sorted in ascending order and, according to a preset retrieval threshold N, the second-modality samples corresponding to the first N entries of the sorted distances are selected as the retrieval result for the query sample.
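The retrieval in step S5 can be sketched as follows. The squared Euclidean distance matches the formula given above, summed over vector components. The cosine distance formula is not reproduced in this text, so the sketch assumes the common convention of one minus the cosine similarity and assumes non-zero vectors; the function names are hypothetical.

```python
import math

def squared_euclidean(x, y):
    """Sum over components of (x_i - x_j)^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def cosine_distance(x, y):
    """Assumed convention: 1 - cosine similarity, for non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

def retrieve(query, targets, top_n, distance=squared_euclidean):
    """Return the indices of the top_n targets closest to the query, nearest first."""
    order = sorted(range(len(targets)), key=lambda j: distance(query, targets[j]))
    return order[:top_n]
```

Here `query` stands for the query latent representation and `targets` for the list of target latent representations; both would be produced by the learned mapping functions.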

The present invention also discloses a semantic-preserving cross-modal content retrieval system. Fig. 4 is a schematic diagram of the semantic-preserving cross-modal content retrieval system of an embodiment of the present invention. As shown in Fig. 4, the cross-modal content retrieval system of the present invention comprises a sample set construction module, a feature graph construction module, a mapping function learning module and a sample retrieval module. The sample set construction module is for constructing a retrieval set and a target set, wherein all samples in the retrieval set have a first modality and are referred to as first-modality samples, while all samples in the target set have a second modality and are referred to as second-modality samples. The feature graph construction module is for extracting the feature vectors of all first-modality samples as nodes to construct a first feature graph, extracting the feature vectors of all second-modality samples as nodes to construct a second feature graph, extracting the label vectors of the label information of all first-modality samples and second-modality samples as nodes to construct a semantic graph, and obtaining the neighbor nodes of each node. The mapping function learning module is for constructing a first mapping function and a second mapping function, wherein the first mapping function maps the first-modality samples to latent representations and the second mapping function maps the second-modality samples to latent representations; the first mapping function and the second mapping function are learned so as to approximately maximize the likelihood of the neighbor nodes of each node occurring, while enabling each latent representation to reconstruct the label information of its corresponding node. The sample retrieval module is for obtaining the retrieval result, wherein a first-modality sample is taken as the query sample, the query sample is mapped to a query latent representation with the first mapping function, each second-modality sample in the target set is mapped to a target latent representation with the second mapping function, the distance between the query latent representation and each target latent representation is obtained, all distances obtained are sorted in ascending order and, according to a preset retrieval threshold N, the second-modality samples corresponding to the first N entries of the sorted distances are selected as the retrieval result for the query sample.

Claims

1. A semantic-preserving cross-modal content retrieval method, characterized by comprising:

constructing a retrieval set from first-modality samples and a target set from second-modality samples;

extracting the feature vectors of the first-modality samples as nodes to construct a first feature graph; extracting the feature vectors of the second-modality samples as nodes to construct a second feature graph; extracting the label vectors of the label information of all samples in the retrieval set and the target set as nodes to construct a semantic graph; and obtaining the neighbor nodes of each node;

constructing a first mapping function for mapping the first-modality samples to latent representations and a second mapping function for mapping the second-modality samples to latent representations; and learning the first mapping function and the second mapping function so as to approximately maximize the likelihood of the neighbor nodes of each node occurring, while enabling each latent representation to reconstruct the label information of its corresponding node;

taking a first-modality sample as the query sample, mapping the query sample to a query latent representation with the first mapping function, and mapping each second-modality sample to a target latent representation with the second mapping function; and obtaining the distance between the query latent representation and each target latent representation, and taking the second-modality samples corresponding to all distances smaller than a retrieval threshold as the retrieval result for the query sample.

2. The cross-modal content retrieval method according to claim 1, characterized in that the first mapping function and the second mapping function are learned using neighbor sampling and negative sampling, wherein a multinomial distribution is established according to the weights of the edges from a sampled node to its adjacent nodes, nodes connected to the sampled node are sampled from the multinomial distribution as neighbor nodes, and nodes not connected to the sampled node are selected by a uniform distribution as negative nodes.

3. The cross-modal content retrieval method according to claim 1, characterized in that the distance is the Euclidean distance d(x_i, x_j) = (x_i - x_j)² or the cosine distance between the query latent representation and the target latent representation, where x_i is the query latent representation and x_j is the target latent representation.

4. The cross-modal content retrieval method according to claim 1, characterized in that the modality of the first-modality samples includes the visual modality, the audio modality and the text modality, and the modality of the second-modality samples includes the visual modality, the audio modality and the text modality.

5. The cross-modal content retrieval method according to claim 4, characterized in that, if the modality of the first-modality samples and/or the second-modality samples is the visual modality, the feature vectors of the first-modality samples and/or the second-modality samples are scale-invariant feature transform features, visual-modality convolutional neural network features, or histogram of oriented gradients features; and if the modality of the first-modality samples and/or the second-modality samples is the text modality, the feature vectors of the first-modality samples and/or the second-modality samples are term frequency-inverse document frequency features or text-modality deep convolutional/recurrent neural network features.

6. A semantic-preserving cross-modal content retrieval system, characterized by comprising:

a sample set construction module, for constructing a retrieval set from first-modality samples and a target set from second-modality samples;

a feature graph construction module, for constructing a first feature graph, a second feature graph and a semantic graph, and obtaining the neighbor nodes of each node in the first feature graph and the second feature graph; wherein the feature vectors of the first-modality samples are extracted as nodes to construct the first feature graph, the feature vectors of the second-modality samples are extracted as nodes to construct the second feature graph, and the label vectors of the label information of all samples in the retrieval set and the target set are extracted as nodes to construct the semantic graph; and the neighbor nodes of each node are obtained;

a mapping function learning module, for constructing mapping functions and learning the mapping functions; wherein a first mapping function for mapping the first-modality samples to latent representations and a second mapping function for mapping the second-modality samples to latent representations are constructed; and the first mapping function and the second mapping function are learned so as to approximately maximize the likelihood of the neighbor nodes of each node occurring, while enabling each latent representation to reconstruct the label information of its corresponding node;

a sample retrieval module, for obtaining the retrieval result; wherein a first-modality sample is taken as the query sample, the query sample is mapped to a query latent representation with the first mapping function, and each second-modality sample is mapped to a target latent representation with the second mapping function; and the distance between the query latent representation and each target latent representation is obtained, and the second-modality samples corresponding to all distances smaller than a retrieval threshold are taken as the retrieval result for the query sample.

7. The cross-modal content retrieval system according to claim 6, characterized in that the mapping function learning module comprises:

a neighbor sampling module, for learning the first mapping function and the second mapping function using neighbor sampling; wherein a multinomial distribution is established according to the weights of the edges from a sampled node to its adjacent nodes, and nodes connected to the sampled node are sampled from the multinomial distribution as neighbor nodes;

a negative sampling module, for approximately learning the first mapping function and the second mapping function using negative sampling; wherein nodes not connected to the sampled node are sampled from a multinomial distribution as negative nodes.

8. The cross-modal content retrieval system according to claim 6, characterized in that, in the retrieval and result acquisition module, the distance is the Euclidean distance d(x_i, x_j) = (x_i - x_j)² or the cosine distance between the query latent representation and the target latent representation, where x_i is the query latent representation and x_j is the target latent representation.

9. The cross-modal content retrieval system according to claim 6, characterized in that the modality of the first-modality samples includes the visual modality, the audio modality and the text modality, and the modality of the second-modality samples includes the visual modality, the audio modality and the text modality.

10. The cross-modal content retrieval system according to claim 9, characterized in that, if the modality of the first-modality samples and/or the second-modality samples is the visual modality, the feature vectors of the first-modality samples and/or the second-modality samples are scale-invariant feature transform features, visual-modality convolutional neural network features, or histogram of oriented gradients features; and if the modality of the first-modality samples and/or the second-modality samples is the text modality, the feature vectors of the first-modality samples and/or the second-modality samples are term frequency-inverse document frequency features or text-modality deep convolutional/recurrent neural network features.