CN107562812A

CN107562812A - A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space

Info

Publication number: CN107562812A
Application number: CN201710684763.6A
Authority: CN
Inventors: 彭宇新; 綦金玮
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2017-08-11
Filing date: 2017-08-11
Publication date: 2018-01-09
Anticipated expiration: 2037-08-11
Also published as: CN107562812B

Abstract

The present invention relates to a kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space, comprise the following steps：1. cross-module state database is established, wherein comprising multiple modalities categorical data, and the data in database are divided into training set, test set and checking and collected.2. for every kind of modality type in cross-module state database, construction, by other modality type data projections to the semantic space, obtains the cross-module state similarity for the modality-specific for the semantic space of the modality-specific.3. the cross-module state similarity for modality-specific obtained from different modal semantic spaces is merged, final cross-module state similarity is obtained.4. any one modality type in test set is taken, using another modality type as target modalities, to calculate inquiry sample as inquiry mode and inquire about the similitude of target, the correlated results list of target modalities data is obtained according to similitude.The present invention can improve the accuracy rate of cross-module state retrieval.

Description

A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space

Technical field

The present invention relates to multimedia retrieval field, and in particular to a kind of cross-module state based on the modeling of modality-specific semantic space Similarity-based learning method.

Background technology

Nowadays, the multi-modal data including image, video, text and audio is widely present on the internet, these Multi-modal data is to aid in the basis of artificial intelligence cognition real world.Some research work are being attempted to break different modalities Isomery wide gap between data, and cross-module state retrieval studying a question as one of focus, it is possible to achieve cross over different moulds The information retrieval of state data, and there is extensive practical application request, such as search engine and digital library etc..Traditional Single mode is retrieved, such as image retrieval, video frequency searching etc., is all confined to the form of single mode, can only be returned and inquire about and be identical The retrieval result of modality type.It is different, the retrieval of cross-module state it is more convenient with it is useful, any modality type can be passed through Query and search obtain the retrieval result of different modalities.

A major challenge of cross-module state retrieval is how to tackle the inconsistency of different modalities, and learns inherent pass therebetween Connection.Because different modalities data have diversified representation and distribution character, and it is empty to be dispersed in respective feature Between, this isomery characteristic make it that the similitude measured between different modalities is very difficult, such as piece image and a section audio it Between similitude.In view of the above-mentioned problems, researcher proposes certain methods, the character representation of different modalities data is projected Same uniform spaces learn Unified Characterization, so as to which the similitude between different modalities data corresponding can be united by calculating its One characterize between distance obtain.Conventional method for different modalities data by learning mapping matrix to maximize pass therebetween Connection, such as different modalities are analyzed by canonical correlation analysis (Canonical Correlation Analysis, abbreviation CCA) Paired incidence relation between data, different modalities data are mapped to the public subspace of same dimension.In addition, Zhai et al. In document " Learning Cross-Media Joint Representation with Sparse and Semi- The method based on figure stipulations is proposed in Supervised Regularization ", is different modalities data configuration graph model, Cross-module state association study is carried out simultaneously and high-level semantic is abstracted.

In recent years, the huge progress that deep learning obtains is promoted researcher and different moulds is modeled using deep neural network Incidence relation, Feng et al. between state data is in document " Cross-modal Retrieval with Correspondence Corresponding self-encoding encoder (Correspondence Autoencoder, abbreviation Corr-AE) is proposed in Autoencoder ", passes through structure The connected network structure of two-way is built, while models the incidence relation and reconstruction information of different modalities data.Peng et al. is in document “Cross-media shared representation by hierarchical learning with multiple Deep networks " propose cross-module state Multi net voting structural model (Cross-media Multiple Deep Network, letter Claim CMDN), it represents that the study stage models between semantic information and different modalities in mode simultaneously in single-mode separation Related information, then in Unified Characterization study stage structure multitiered network structure, fusion single mode semantic abstraction represents and single mode State association represents, and the mode learnt using stacking models reconstruction simultaneously and related information learns to obtain cross-module state Unified Characterization.

But above-mentioned existing method is mostly comparably to throw the data of different modalities by mapping matrix or depth model Uniform spaces are mapped to excavate potential alignment relation therebetween, it means that the information excavated from different modal datas is equivalent 's.But in general, different modalities data, such as image and text, relation therebetween is often unequal and complementary.When When they describe same semantic jointly, the information of inequality may be included, because information exclusive inside some mode is not Content that can be well with the statement of other mode is alignd.Therefore, it is potential to excavate comparably to treat different modalities data Fine granularity alignment content simultaneously builds a uniform spaces, can lose information exclusive and useful in mode, and can not make full use of The abundant internal information that every kind of mode provides.

The content of the invention

In view of the shortcomings of the prior art, the present invention proposes a kind of cross-module state phase based on the modeling of modality-specific semantic space Like sexology learning method, construction is trained circulation notice network to the modality-specific data, built for the semantic space of modality-specific Fine granularity information and spatial context information inside mould mode, study is then associated by the joint based on notice mechanism Other modal datas are projected to the semantic space of the mode, fully learn unbalanced related information between different modalities, most The cross-module state similarity for modality-specific obtained from different modal semantic spaces is carried out using the mode of dynamic fusion afterwards Fusion, the complementarity of different modalities semantic space is further excavated, improve the accuracy rate of cross-module state retrieval.

To achieve the above objectives, the technical solution adopted by the present invention is as follows：

A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space, specific mould is directed to for constructing The semantic space of state, and the cross-module state similarity for modality-specific obtained from different modal semantic spaces is merged, The similarity of different modalities data is obtained, so as to realize that cross-module state is retrieved, is comprised the following steps, wherein step (1)-(3) obtain Cross-module state similarity, step (4) further realize that cross-module state is retrieved：

(1) cross-module state database is established, wherein including the data of multiple modalities type；

(2) the every kind of modality type being directed in cross-module state database, construction is directed to the semantic space of the modality-specific, by it His modality type data projection obtains the cross-module state similarity for the modality-specific to the semantic space；

(3) the cross-module state similarity for modality-specific obtained from the semantic space of different modalities is merged, obtained To final cross-module state similarity；

(4) any one modality type is used, using another modality type as target modalities, will to be looked into as inquiry mode Each data of inquiry mode, which are used as, inquires about sample, the data in searched targets mode, calculates inquiry sample and the phase of inquiry target Like property, the correlated results list of target modalities data is obtained according to similitude.

Further, a kind of above-mentioned cross-module state similarity-based learning method based on the modeling of modality-specific semantic space, the step Suddenly (1) cross-module state database can include multiple modalities type, such as image, text etc..

Further, a kind of above-mentioned cross-module state similarity-based learning method based on the modeling of modality-specific semantic space, the step Suddenly the semantic space building method for modality-specific of (2), the modality-specific data are trained with circulation notice network, then Other modality type data projections are obtained to the semantic space of the mode by the joint association study based on notice mechanism For the cross-module state similarity of the modality-specific.

Further, a kind of above-mentioned cross-module state similarity-based learning method based on the modeling of modality-specific semantic space, the step Suddenly (3) mid-span mode similarity learning method, it is directed to using the mode of dynamic fusion by what is obtained from different modal semantic spaces The cross-module state similarity of modality-specific is merged.

Further, a kind of above-mentioned cross-module state similarity-based learning method based on the modeling of modality-specific semantic space, the step Suddenly the retrieval mode of (4) is that, using a kind of modality type as inquiry mode, another modality type is as target modalities. Each data for inquiring about mode are used as inquiry sample, after similitude is calculated according to step (3), with target modalities All data calculate similitude, are then sorted from big to small according to similitude, obtain correlated results list.

Effect of the invention is that：Compared with the conventional method, this method is directed to the semantic space of modality-specific by constructing, The fine granularity information and spatial context information inside mode can be fully modeled, then passes through the connection based on notice mechanism Association study is closed, fully learns unbalanced related information between different modalities, it is finally further using the mode of dynamic fusion The complementarity of different modalities semantic space is excavated, improves the accuracy rate of cross-module state retrieval.

Why this method has foregoing invention effect, and its reason is：For the semantic space of modality-specific, to the spy Determine modal data training circulation notice network, model the fine granularity information and spatial context information inside mode, then By the joint association study based on notice mechanism by the semantic space of other modality type data projections to the mode, fully Unbalanced related information between study different modalities, will be from different modal semantic spaces finally using the mode of dynamic fusion The obtained cross-module state similarity for modality-specific is merged, and further excavates the complementarity of different modalities semantic space, Improve the accuracy rate of cross-module state retrieval.

Brief description of the drawings

Fig. 1 is a kind of cross-module state similarity-based learning method flow based on the modeling of modality-specific semantic space of the present invention Figure.

Fig. 2 is the schematic diagram of the complete network structure of the present invention.

Embodiment

The present invention is described in further detail with specific embodiment below in conjunction with the accompanying drawings.

A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space of the present invention, its flow is as schemed Shown in 1, comprise the steps of：

(1) cross-module state database is established, wherein including the data of multiple modalities type, and the data in database are divided into Training set, test set and checking collection.

In the present embodiment, the cross-module state database can include multiple modalities type, including image, text.

Cross-module state data set, D={ D are represented with D⁽ⁱ⁾,D^(t), wherein

For medium type r, wherein r=i, t (i represents image, and t represents text), n is defined^(r)For its data amount check.Instruction Each data that white silk is concentrated have and an only semantic classes.

DefinitionFor the characteristic vector of p-th of data in medium type r, it represents that structure is a d^(r)× 1 to Amount, wherein d^(r)Presentation medium type r characteristic vector dimension.

DefinitionSemantic label be set toIt represents the vector that structure is c × 1, and wherein c represents semantic classes Total amount.In have and it is only one-dimensional be 1, remaining is 0, represents the semantic classes of the data for mark that value is corresponding to 1 row Label.

(2) the every kind of modality type being directed in cross-module state database, construction is directed to the semantic space of the modality-specific, by it His modality type data projection obtains the cross-module state similarity for the modality-specific to the semantic space.

The process of the step is as shown in Figure 2.In the present embodiment, for image, semantic spatial configuration, circulation notice is used Network model modeled images data, zoom to 256 × 256, and be input in convolutional neural networks by original image first.Then From convolutional neural networks, last pond layer (pooling layer) is the different respective character representation of extracted region of imageAnd the regional in an image is organized into a sequence in order, use LSTM (Long-Short Term Memory, shot and long term memory) spatial context information between neural net model establishing different images region, its sequence exported can To be expressed asTraining pattern is set to focus on prior image-region followed by notice mechanism, specifically Ground, fully-connected network and Softmax active coatings are constructed, passes through equation below computation vision notice weight：

WhereinWithFor the network parameter of each layer, and aⁱInclude the visual attention weight of different zones in image.Cause This, the characteristic vector in n-th of region can be expressed as in an image(in image, semantic space in Fig. 2It is shown), Local the fine granularity information and spatial context information of image are contained simultaneously.In next step, text data is projected into image Semantic space learns to carry out the association of cross-module state.Specifically, the term vector that k dimensions are first extracted for each word in text data is special Sign, a text then comprising n word can be expressed as n × k matrix, be input to text convolutional neural networks and obtain the sentence The character representation of wordsThen image i_pWith text t_pCross-module state similarity in image, semantic space is defined as follows (in such as Fig. 2 In image, semantic spaceIt is shown)：

WhereinRepresent image i_pIn j-th of provincial characteristics vector.Loss function realization is finally defined as follows to be based on The association study of notice：

Two of above-mentioned formula are defined respectively as：

WhereinThe picture/text pair of matching is represented,WithRepresent unmatched picture/text pair, α It is boundary parameter, and N represents the triple number of sampling.So far, can be obtained for image modalities from image, semantic space Cross-module state similarity sim_i, expression study and measuring similarity learning process are incorporated, while fully modeled inside image Unbalanced related information between fine granularity information and different modalities.

It is first using circulation notice network model modeling text data for text semantic spatial configuration in the present embodiment For each text data, the term vector feature of k dimensions is extracted for wherein each word, then text comprising n word can be with N × k matrix is expressed as, is input to text convolutional neural networks, and from last pond layer (pooling layer) of network Extract the character representation of different text blocks.Then it is successively inputted in LSTM neutral nets, to model the context of text letter Breath, its sequence exported can be expressed asTraining pattern is focused on followed by notice mechanism heavier The text fragments wanted, specifically, construct fully-connected network and Softmax active coatings, and text notice is calculated by equation below Weight：

WhereinWithFor the network parameter of each layer, and a^tInclude the text notice weight of different fragments in text.Cause This, the characteristic vector of m-th of fragment can be expressed as in a text(in Fig. 2 Chinese version semantic spacesInstitute Show), while contain local the fine granularity information and spatial context information of text.In next step, view data is projected Text semantic space learns to carry out the association of cross-module state.Specifically, first the overall feature of image is extracted using convolutional neural networks RepresentThen image i_pWith text t_pCross-module state similarity in text semantic space is defined as follows (such as Fig. 2 Chinese versions semanteme In spaceIt is shown)：

WhereinRepresent text t_pIn j-th of segment characterizations vector.Loss function realization is finally defined as follows to be based on The association study of notice：

Two of above-mentioned formula are defined respectively as：

WhereinThe picture/text pair of matching is represented,WithRepresent unmatched picture/text pair, β It is boundary parameter, and M represents the triple number of sampling.So far, can obtain being directed to text modality from text semantic space Cross-module state similarity sim_t, expression study and measuring similarity learning process are incorporated, while fully modeled inside text Fine granularity information and different modalities between unbalanced related information.

(3) the cross-module state similarity for modality-specific obtained from different modal semantic spaces is merged, obtained Final cross-module state similarity.

In the present embodiment, using dynamic fusion mode by from different modal semantic spaces obtain for modality-specific Cross-module state similarity is merged.First, it is the cross-module state for modality-specific obtained from different modal semantic spaces is similar Degree is normalized between 0 to 1 according to formula below：

Then, for picture/text to (i_p,t_p), obtain branch's conduct after normalization is calculated from image, semantic space The picture/text is to the changeable weight in text space, and the branch that obtains after normalization is calculated from text semantic space makees It is the picture/text to the changeable weight in image space.Therefore, final cross-module state similarity is defined as follows：

Sim(i_p,t_p)=r_t(i_p,t_p)·sim_i(i_p,t_p)+r_i(i_p,t_p)·sim_t(i_p,t_p)

The complementarity of different modalities semantic space can be fully excavated, and further lifts the effect of cross-module state retrieval.

(4) any one modality type in test set is used to be used as mesh using another modality type as inquiry mode Mark mode.Using each data for inquiring about mode as inquiring about sample, the data in searched targets mode, according in step (3) Mode, calculate inquiry sample and inquire about the similitude of target, by similitude according to sorting from big to small, obtain target modalities data Correlated results list.

It is following test result indicates that, compared with the conventional method, cross-module state of the present invention based on more granularity hierarchical networks is closed Join learning method, higher retrieval rate can be obtained.

The present embodiment employs Wikipedia cross-module state data sets and tested, and the data set is by document " A New Approach to Cross-Modal Multimedia Retrieval " (author N.Rasiwasia, J.Pereira, E.Coviello, G.Doyle, G.Lanckriet, R.Levy and N.Vasconcelos, it is published in the ACM of 2010 years International conference on Multimedia) propose, including 2866 sections of texts and 2866 images, and Text and image are one-to-one, are always divided into 10 classifications, wherein 2173 sections of texts and 2173 images are as training set, 231 sections of texts and 231 images collect as checking, and 492 sections of texts and 492 images are as test set.Test following 3 kinds of sides Method is as Experimental comparison：

Existing method one：Document " Learning Cross-Media Joint Representation with Sparse Joint in and Semi-Supervised Regularization " (author X.Zhai, Y.Peng, and J.Xiao) represents Learn (Joint Representation Learning, abbreviation JRL) method, build graph model for different modalities data, simultaneously Carry out cross-module state association study and high-level semantic is abstracted, and introduce sparse and semi-supervised stipulations.

Existing method two：Document " Cross-modal Retrieval with Correspondence Corresponding self-encoding encoder network (Correspondence in Autoencoder " (author F.Feng, X.Wang, and R.Li) Autoencoder, abbreviation Corr-AE) method, two road networks are constructed, and be connected to model related information simultaneously in intermediate layer With reconstruction information.

Existing method three：Document " Cross-media shared representation by hierarchical Cross-module state in learning with multiple deep networks " (author Y.Peng, X.Huang, and J.Qi) Multi net voting structure (Cross-media Multiple Deep Network, abbreviation CMDN), study rank is represented in single-mode separation Section models the related information between semantic information and different modalities in mode simultaneously, then learns stage structure in Unified Characterization Build multitiered network structure, and using the mode of stacking study model reconstruction simultaneously and related information learns to obtain cross-module state and unifies table Sign.

The present invention：The method of the present embodiment.

Experiment evaluates and tests cross-module state using conventional MAP (the mean average precision) indexs of information retrieval field The accuracy of retrieval, MAP refer to the average value of each inquiry sample retrieval accuracy, and MAP value is bigger, illustrates the retrieval of cross-module state As a result it is better.

The Experimental results show of the present invention of table 1.

	Image querying text	Text query image	It is average
				Existing method one	0.479	0.428	0.454
Existing method two	0.442	0.429	0.436
				Existing method three	0.487	0.427	0.457
The present invention	0.516	0.458	0.487

As it can be seen from table 1 the present invention compares existing method in two image querying text, text query image tasks Achieve larger raising.Existing method one builds graph model under conventional frame and different modalities data is linearly mapped to unified sky Between, it is difficult to the fully complicated cross-module state incidence relation of modeling.Existing method two and existing method three use depth network structure, But the data of different modalities are comparably projected into uniform spaces to excavate potential alignment association therebetween by depth model, Information exclusive and useful in mode can be lost, and the internal information that every kind of mode can not be made full use of to provide.A side of the invention Surface construction is directed to the semantic space of modality-specific, models fine granularity information and spatial context information inside mode, simultaneously Fully unbalanced related information between study different modalities.On the other hand, will be from different modalities using the mode of dynamic fusion The cross-module state similarity for modality-specific that semantic space obtains is merged, and further excavates different modalities semantic space Complementarity, so as to improve the accuracy rate of cross-module state retrieval.

In other embodiments, the method for the construction modality-specific semantic space in step (2) of the present invention, uses LSTM The contextual information of (Long-Short Term Memory, shot and long term memory) neural net model establishing image and text data, together Sample can use Recognition with Recurrent Neural Network (Recurrent Neural Network, abbreviation RNN) as replacement.

Obviously, those skilled in the art can carry out the essence of various changes and modification without departing from the present invention to the present invention God and scope.So, if these modifications and variations of the present invention belong to the scope of the claims in the present invention and its equivalent technologies Within, then the present invention is also intended to comprising including these changes and modification.

Claims

1. a kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space, comprises the following steps：

(2) the every kind of modality type being directed in cross-module state database, construction is directed to the semantic space of the modality-specific, by other moulds State categorical data projects the semantic space, obtains the cross-module state similarity for the modality-specific；

(3) the cross-module state similarity for modality-specific obtained from the semantic space of different modalities is merged, obtained most Whole cross-module state similarity.

2. the method as described in claim 1, it is characterised in that the cross-module state database includes multiple modalities type, described Multiple modalities type includes image, text.

3. the method as described in claim 1, it is characterised in that the semantic space for modality-specific in step (2) constructs Method is：To the data training circulation notice network of the modality-specific, then associated by the joint based on notice mechanism By the semantic space of the data projection of other modality types to the mode, the cross-module state for obtaining being directed to the modality-specific is similar for study Degree.

4. method as claimed in claim 3, it is characterised in that the building method in image, semantic space is：

A) by original image and it is input in convolutional neural networks；

B) from convolutional neural networks, last pond layer is the different respective character representation of extracted region of image And the regional in an image is organized into a sequence in order, built using LSTM neutral nets or RNN neutral nets Spatial context information between mould different images region, its sequence exported are expressed as

C) training pattern is focused on important image-region using notice mechanism, first construct fully-connected network and Softmax active coatings, then pass through equation below computation vision notice weight：

WhereinWithFor the network parameter of each layer, and aⁱInclude the visual attention weight of different zones in image, therefore, one The characteristic vector in n-th of region is expressed as in individual imageLocal fine granularity information and the space of image are contained simultaneously Contextual information；

D) text data is projected into image, semantic space to carry out the association study of cross-module state, first to be each in text data The term vector feature of word extraction k dimensions, the matrix that a text representation then comprising n word is n × k, is input to text convolution Neutral net obtains the character representation of wordThen image i is defined_pWith text t_pCross-module state phase in image, semantic space It is as follows like spending：

<mrow> <msub> <mi>sim</mi> <mi>i</mi> </msub> <mrow> <mo>(</mo> <msub> <mi>i</mi> <mi>p</mi> </msub> <mo>,</mo> <msub> <mi>t</mi> <mi>p</mi> </msub> <mo>)</mo> </mrow> <mo>=</mo> <munderover> <mo>&Sigma;</mo> <mrow> <mi>j</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>n</mi> </munderover> <msubsup> <mi>a</mi> <mi>j</mi> <msub> <mi>i</mi> <mi>p</mi> </msub> </msubsup> <msubsup> <mi>h</mi> <mi>j</mi> <msub> <mi>i</mi> <mi>p</mi> </msub> </msubsup> <mo>&CenterDot;</mo> <msubsup> <mi>q</mi> <mi>p</mi> <mi>t</mi> </msubsup> <mo>,</mo> </mrow>

WhereinRepresent image i_pIn j-th of provincial characteristics vector；

E) it is defined as follows loss function and realizes the association study based on notice：

<mrow> <msub> <mi>L</mi> <mi>i</mi> </msub> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>n</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>N</mi> </munderover> <msub> <mi>l</mi> <mrow> <mi>i</mi> <mn>1</mn> </mrow> </msub> <mrow> <mo>(</mo> <msubsup> <mi>i</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>-</mo> </msubsup> <mo>)</mo> </mrow> <mo>+</mo> <msub> <mi>l</mi> <mrow> <mi>i</mi> <mn>2</mn> </mrow> </msub> <mrow> <mo>(</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>i</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>i</mi> <mi>n</mi> <mo>-</mo> </msubsup> <mo>)</mo> </mrow> <mo>,</mo> </mrow>

Two in above-mentioned formula are defined respectively as：

<mrow> <msub> <mi>l</mi> <mrow> <mi>i</mi> <mn>1</mn> </mrow> </msub> <mrow> <mo>(</mo> <msubsup> <mi>i</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>-</mo> </msubsup> <mo>)</mo> </mrow> <mo>=</mo> <mi>m</mi> <mi>a</mi> <mi>x</mi> <mrow> <mo>(</mo> <mn>0</mn> <mo>,</mo> <mi>&alpha;</mi> <mo>+</mo> <msub> <mi>sim</mi> <mi>i</mi> </msub> <mo>(</mo> <mrow> <msubsup> <mi>i</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>+</mo> </msubsup> </mrow> <mo>)</mo> <mo>-</mo> <msub> <mi>sim</mi> <mi>i</mi> </msub> <mo>(</mo> <mrow> <msubsup> <mi>i</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>-</mo> </msubsup> </mrow> <mo>)</mo> <mo>)</mo> </mrow> <mo>,</mo> </mrow>

<mrow> <msub> <mi>l</mi> <mrow> <mi>i</mi> <mn>2</mn> </mrow> </msub> <mrow> <mo>(</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>i</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>i</mi> <mi>n</mi> <mo>-</mo> </msubsup> <mo>)</mo> </mrow> <mo>=</mo> <mi>m</mi> <mi>a</mi> <mi>x</mi> <mrow> <mo>(</mo> <mn>0</mn> <mo>,</mo> <mi>&alpha;</mi> <mo>+</mo> <msub> <mi>sim</mi> <mi>i</mi> </msub> <mo>(</mo> <mrow> <msubsup> <mi>i</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>+</mo> </msubsup> </mrow> <mo>)</mo> <mo>-</mo> <msub> <mi>sim</mi> <mi>i</mi> </msub> <mo>(</mo> <mrow> <msubsup> <mi>i</mi> <mi>n</mi> <mo>-</mo> </msubsup> <mo>,</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>+</mo> </msubsup> </mrow> <mo>)</mo> <mo>)</mo> </mrow> <mo>,</mo> </mrow>

WhereinThe picture/text pair of matching is represented,WithUnmatched picture/text pair is represented, α is side Boundary's parameter, and N represents the triple number of sampling.

5. method as claimed in claim 4, it is characterised in that the building method in text semantic space is：

A) for each text data, the term vector feature of k dimensions is extracted for wherein each word, then a text for including n word Originally n × k matrix is expressed as, is input to text convolutional neural networks；

B) from the character representation of last different text block of pond layer extraction of convolutional neural networks, then it is successively inputted to In LSTM neutral nets or RNN neutral nets, to model the contextual information of text, its sequence exported is expressed as

C) training pattern is focused on important text fragments using notice mechanism, first construct fully-connected network and Softmax active coatings, text notice weight is then calculated by equation below：

WhereinWithFor the network parameter of each layer, and a^tInclude the text notice weight of different fragments in text, therefore one The characteristic vector of m-th of fragment is expressed as in individual textLocal fine granularity information and the space of text are contained simultaneously Contextual information；

D) view data is projected into text semantic space to carry out the association study of cross-module state, carried first by convolutional neural networks Take the overall character representation of imageThen image i is defined_pWith text t_pText semantic space cross-module state similarity such as Under：

<mrow> <msub> <mi>sim</mi> <mi>t</mi> </msub> <mrow> <mo>(</mo> <msub> <mi>i</mi> <mi>p</mi> </msub> <mo>,</mo> <msub> <mi>t</mi> <mi>p</mi> </msub> <mo>)</mo> </mrow> <mo>=</mo> <munderover> <mo>&Sigma;</mo> <mrow> <mi>j</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>m</mi> </munderover> <msubsup> <mi>a</mi> <mi>j</mi> <msub> <mi>t</mi> <mi>p</mi> </msub> </msubsup> <msubsup> <mi>h</mi> <mi>j</mi> <msub> <mi>t</mi> <mi>p</mi> </msub> </msubsup> <mo>&CenterDot;</mo> <msubsup> <mi>q</mi> <mi>p</mi> <mi>i</mi> </msubsup> <mo>,</mo> </mrow>

WhereinRepresent text t_pIn j-th of segment characterizations vector；

<mrow> <msub> <mi>L</mi> <mi>t</mi> </msub> <mo>=</mo> <munderover> <mi>&Sigma;</mi> <mrow> <mi>n</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>M</mi> </munderover> <msub> <mi>l</mi> <mrow> <mi>t</mi> <mn>1</mn> </mrow> </msub> <mrow> <mo>(</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>i</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>i</mi> <mi>n</mi> <mo>-</mo> </msubsup> <mo>)</mo> </mrow> <mo>+</mo> <msub> <mi>l</mi> <mrow> <mi>t</mi> <mn>2</mn> </mrow> </msub> <mrow> <mo>(</mo> <msubsup> <mi>i</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>+</mo> </msubsup> <mo>,</mo> <msubsup> <mi>t</mi> <mi>n</mi> <mo>-</mo> </msubsup> <mo>)</mo> </mrow> <mo>,</mo> </mrow>

Two in above-mentioned formula are defined respectively as：

WhereinThe picture/text pair of matching is represented,WithUnmatched picture/text pair is represented, β is side Boundary's parameter, and M represents the triple number of sampling.

6. the method as described in claim 1, it is characterised in that step (3) will be from different modalities using the mode of dynamic fusion The cross-module state similarity for modality-specific that semantic space obtains is merged, and is comprised the following steps：First, will be from different moulds The cross-module state similarity for modality-specific that state semantic space obtains is normalized between 0 to 1 according to formula below：

Then, for picture/text to (i_p,t_p), the score after normalization is calculated as the figure from image, semantic space Picture/text is to the changeable weight in text space, and the branch that obtains after normalization is calculated from text semantic space is used as this Picture/text is to the changeable weight in image space；Final cross-module state similarity is defined as follows：

Sim(i_p,t_p)=r_t(i_p,t_p)·sim_i(i_p,t_p)+r_i(i_p,t_p)·sim_t(i_p,t_p)。

7. a kind of cross-module state search method, comprises the following steps：

1) any claim methods described in claim 1 to 6 is used to calculate cross-module state similarity；

2) a kind of modality type is used, using another modality type as target modalities, mode will to be inquired about as inquiry mode Each data as the data in inquiry sample searched targets mode, inquire about sample and inquire about the similitude of target by calculating, according to Similitude obtains the retrieval result of target modalities data.

8. method as claimed in claim 7, it is characterised in that step 2) is calculated inquiry sample and inquires about the similar of target After property, sorted from big to small according to similitude, obtain correlated results list.