CN109783655B - Cross-modal retrieval method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN109783655B
CN109783655B (application CN201811490973.2A)
Authority
CN
China
Prior art keywords
data
modality
matched
cross
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811490973.2A
Other languages
Chinese (zh)
Other versions
CN109783655A (en)
Inventor
宋彬
姚继鹏
郭洁
罗文雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201811490973.2A priority Critical patent/CN109783655B/en
Publication of CN109783655A publication Critical patent/CN109783655A/en
Application granted granted Critical
Publication of CN109783655B publication Critical patent/CN109783655B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of multi-modal data retrieval, in particular to a cross-modal retrieval method, a cross-modal retrieval device, computer equipment and a storage medium. The method comprises the following steps: acquiring data to be matched in a first modality, wherein the data to be matched in the first modality comprises image data and text data; when the data to be matched in the first modality is image data, extracting a feature vector by using a depth residual error network ResNet model, and when the data to be matched in the first modality is text data, extracting the feature vector by using a variational self-encoder model; mapping the feature vector to a public representation space by using a preset mapping function; and calculating the similarity between the data to be matched of the first modality and the paired data of the second modality in the public representation space, and outputting the corresponding paired data of the second modality according to the similarity to complete the cross-modal retrieval. The invention can extract data features more fully and improve the retrieval accuracy.

Description

Cross-modal retrieval method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of multi-modal data retrieval, in particular to a cross-modal retrieval method, a cross-modal retrieval device, computer equipment and a storage medium.
Background
With the rapid development of deep learning and the rapid growth of multi-modal data in recent years, researchers have begun to combine two previously largely independent fields, computer vision and natural language processing, in order to realize joint visual-semantic embedding. This task requires representing image and text data as fixed-length vectors and then embedding them into the same vector space. Cross-modal retrieval is a typical application of visual-semantic joint embedding. At present, text, image, audio and other data are growing exponentially and information carriers are becoming more diverse, and users wish to retrieve information across different carriers. Most existing information retrieval systems are limited to retrieval within a single modality and can only realize functions such as searching images by image or searching text by text, or they use keywords as retrieval conditions and rely on a search engine to find the objects on the network that best match the query. As demands change, users expect an information retrieval system to support retrieval across modalities and to focus on the content itself rather than on one or two keywords. However, data in most modalities are unstructured, and because different modalities use different feature extraction methods, the dimensions of their features usually differ, so the information contained in the features cannot be compared directly. This widens the semantic gap between high-level semantics and low-level features.
Zhejiang University, in its patent document "A cross-modal retrieval method based on a topic model" (patent application number: 201410532057.6), proposes a cross-modal retrieval method based on a topic model. The method first performs feature extraction and label recording on the multi-modal data in a database, and then builds a topic-based cross-modal retrieval graph model for retrieval.
Guilin University of Electronic Technology, in its patent document "A cross-modal retrieval method based on a deep association network" (patent application number: 201710989497.8), provides another cross-modal retrieval method. The method is divided into three modules: first, original features of image-modality data are extracted with methods such as the pyramid histogram of visual words (PHOW) and Gist global features, and original features of text-modality data are extracted with a bag-of-words model; then, high-level representation vectors are learned with a restricted Boltzmann machine model and an auto-encoder model; finally, similarity matching is performed and a retrieval list is given according to the calculation result.
Thus, the prior-art feature extraction methods for image and text data are based on traditional algorithms: the extracted features are too shallow, part of the feature information is lost, and the cross-modal retrieval accuracy is therefore not high.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a cross-modal search method, apparatus, computer device, and storage medium.
The invention provides a cross-modal retrieval method, comprising the following steps:
acquiring data to be matched in a first modality, wherein the data to be matched in the first modality comprises image data and text data;
when the data to be matched in the first mode is image data, extracting a feature vector of the data to be matched by using a depth residual error network ResNet model, and when the data to be matched in the first mode is text data, extracting the feature vector of the data to be matched by using a variational self-encoder model;
mapping the feature vectors to a public expression space by using a preset mapping function;
calculating the similarity between the data to be matched of the first modality and the pairing data of the second modality in the public representation space, and outputting the corresponding pairing data of the second modality according to the similarity to finish the cross-modality retrieval;
the second modality pairing data comprises image data and text data, and in one round of retrieval, the first modality data to be matched and the second modality pairing data are different types of data.
In an embodiment of the present invention, there is also provided a cross-modal search apparatus, including:
the data acquisition module is used for acquiring data to be matched in a first modality, and the data to be matched in the first modality comprises image data and text data;
the characteristic vector extraction module is used for extracting characteristic vectors of the data to be matched in the first mode by using a depth residual error network ResNet model when the data to be matched in the first mode is image data, and extracting the characteristic vectors of the data to be matched by using a variational self-encoder model when the data to be matched in the first mode is text data;
the mapping module is used for mapping the characteristic vector to a public expression space by utilizing a preset mapping function;
the matching module is used for calculating the similarity between the data to be matched of the first modality and the pairing data of the second modality in the public representation space, outputting the corresponding pairing data of the second modality according to the similarity and finishing the cross-modality retrieval;
the second modality pairing data comprises image data and text data, and in one round of retrieval, the first modality data to be matched and the second modality pairing data are different types of data.
In addition, an embodiment of the present invention further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the steps of the cross-modal search method.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the processor is caused to execute the steps of the cross-modal retrieval method.
The embodiments of the invention provide a cross-modal retrieval method, a cross-modal retrieval device, computer equipment and a storage medium. The method extracts the feature vectors of image data through a depth residual error network ResNet model and extracts the feature vectors of text data through a variational self-encoder model, which solves the problem that traditional feature extraction methods for image and text data lose part of the feature information and thereby reduce retrieval accuracy; moreover, the network structures involved in the cross-modal retrieval method are simple and easy to train.
Drawings
FIG. 1 is a diagram of an application environment of a cross-modal search method provided in an embodiment;
FIG. 2 is a schematic flow chart illustrating a cross-modal search method according to an embodiment;
FIG. 3 is a diagram of a variational self-encoder model architecture provided in one embodiment;
FIG. 4 is a diagram illustrating experimental effects of text retrieval images according to an embodiment;
FIG. 5 is a diagram illustrating experimental effects of image retrieval text according to an embodiment;
FIG. 6 is a block diagram of an exemplary cross-modal search apparatus;
FIG. 7 is a block diagram showing an internal configuration of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that, as used herein, the terms "first," "second," and the like may be used to describe various elements, but these elements are not limited by these terms unless otherwise specified. These terms are only used to distinguish one element from another. For example, a first element may be referred to as a second element, and similarly a second element may be referred to as a first element, without departing from the scope of the present application.
Fig. 1 is a diagram of an application environment of a cross-modal retrieval method provided in an embodiment, as shown in fig. 1, in the application environment, including a terminal 110 and a computer device 120.
In the present invention, the terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal 110 and the computer device 120 may be connected through a network, and the present invention is not limited thereto.
In the present invention, the computer device 120 may be an independent physical server or terminal, a server cluster formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud computing, cloud databases, cloud storage and CDN services.
As shown in fig. 2, in an embodiment, a cross-modal retrieval method is provided, and this embodiment is mainly exemplified by applying the method to the terminal 110 (or the server 120) in fig. 1. The method specifically comprises the following steps:
step S201, acquiring data to be matched in a first modality, wherein the data to be matched in the first modality comprises image data and text data;
step S202, when the data to be matched in the first mode is image data, extracting a feature vector of the data to be matched by using a depth residual error network ResNet model, and when the data to be matched in the first mode is text data, extracting the feature vector of the data to be matched by using a variational self-encoder model;
step S203, mapping the feature vector to a public expression space by using a preset mapping function;
step S204, calculating the similarity between the data to be matched of the first modality and the pairing data of the second modality in the public representation space, and outputting the corresponding pairing data of the second modality according to the similarity to complete the cross-modality retrieval;
the second modality pairing data comprises image data and text data, and in one round of retrieval, the first modality data to be matched and the second modality pairing data are different types of data.
In the present invention, in step S201, the data to be matched in the first modality may be image data or text data, and the cross-modal retrieval of the present invention refers to retrieval performed between image data and text data across modalities. It can be understood that the data to be matched in the first modality and the paired data in the second modality are different types of data: in one round of retrieval, when the data to be matched in the first modality is image data, the paired data in the second modality is text data; when the data to be matched in the first modality is text data, the paired data in the second modality is image data. In the present invention, the data to be matched in the first modality is data submitted by a data request terminal (for example, the terminal 110 or the computer device 120), and the paired data in the second modality is data returned by a responding terminal from a retrieval database; the database may be located on a local device or on a cloud server, which is not limited by the present invention.
In the invention, in step S202, a deep convolutional neural network ResNet (Residual Neural Network) containing 50 layers is constructed as the image feature extraction model. The network is divided into five parts, which from front to back are conv1, conv2_x, conv3_x, conv4_x and conv5_x. Picture data is obtained through the input layer and features are extracted from it layer by layer, which ensures complete extraction of the picture information while keeping the model training difficulty manageable.
In the invention, in step S202, text information is extracted through a variational self-encoder model (Variational Auto-Encoder). Compared with traditional methods such as word-frequency statistics and with currently popular methods such as recurrent neural networks, this overcomes the problem that the semantic information of the text data is largely ignored, and the variational self-encoder model used in the invention performs better at retaining feature information.
In the present invention, in step S204, the corresponding second-modality paired data is output according to the similarity. It should be understood that the output data need not be unique; for example, several items may all be output when their similarities are the same or close, or the several paired items with the highest similarity may be output by default, which is not limited by the present invention.
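The following is a minimal illustrative sketch (not taken from the patent text) of how second-modality candidates could be ranked by similarity in the common space and the top-k matches returned; the function name and the choice of k are assumptions.

```python
import torch

def retrieve_top_k(query_vec: torch.Tensor, candidate_vecs: torch.Tensor, k: int = 5):
    """query_vec: (d,) embedding of the first-modality query in the common space.
    candidate_vecs: (N, d) embeddings of all second-modality candidates."""
    scores = candidate_vecs @ query_vec                      # (N,) inner-product similarities
    top_scores, top_idx = torch.topk(scores, k=min(k, candidate_vecs.size(0)))
    return top_idx.tolist(), top_scores.tolist()
```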
Compared with the prior art, the invention uses a depth residual error network ResNet model and a variational self-encoder model to extract features from image data and text data. Compared with processing image data by methods such as the pyramid histogram of visual words (PHOW) and Gist global features, and processing text data by methods such as the bag-of-words model, the feature vectors obtained by the invention give a richer and more accurate representation. Especially for text data, traditional methods such as word-frequency statistics largely ignore the semantic information present in the text, and compared with the currently popular recurrent neural networks, the variational self-encoder used in the invention performs better at retaining feature information.
In an embodiment, when the data to be matched in the first modality is image data in step S202, extracting a feature vector of the data to be matched in the first modality by using a depth residual error network ResNet model may specifically include the following steps:
step S301, adjusting the data to be matched in the first mode to a first pixel size, and cutting out a partial region with a second pixel size within the range of the first pixel size, wherein the second pixel size is not larger than the first pixel size;
and step S302, extracting and storing the characteristic vector of the data to be matched in the first mode by using a depth residual error network ResNet model.
In the embodiment of the present invention, in step S301, the image data is resized to a first pixel size in order to unify the image size for input to the model, and a region of a second pixel size is then cropped from it, which serves as data augmentation. It is to be understood that the first pixel size is obtained by resizing the image, not by cropping, while the second pixel size is cropped from within the first pixel size; for example, the first pixel size may be 256 × 256 and the corresponding second pixel size 224 × 224, and the invention is not limited to this particular combination.
In the invention, in step S302, a depth residual error network ResNet model with 50 layers is constructed, and the ResNet parameter weights pre-trained on ImageNet (a large visual database for visual object recognition research) are downloaded and loaded into the constructed ResNet model. The picture processed in step S301 is fed in from the input layer and sequentially undergoes batch normalization, convolution operations and nonlinear ReLU (Rectified Linear Unit) activations; after the convolutional layers and the fully connected layer, a feature vector of the image is extracted, with a dimension of 2048. The extracted 2048-dimensional vector of each image is stored so that the subsequent network architecture can use it directly.
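A minimal sketch of steps S301 and S302, assuming PyTorch/torchvision as the framework (the patent does not name one): images are resized to 256 × 256, a 224 × 224 region is cropped, and a ResNet-50 pre-trained on ImageNet with its classification layer removed yields a 2048-dimensional feature vector per image. The exact normalization statistics and crop policy are assumptions.

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),        # first pixel size: unify image size
    transforms.RandomCrop(224),           # second pixel size: crop for data augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet50(pretrained=True)                         # weights pre-trained on ImageNet
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop final fc layer
feature_extractor.eval()

def extract_image_feature(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    with torch.no_grad():
        feat = feature_extractor(img)                               # (1, 2048, 1, 1)
    return feat.flatten(1).squeeze(0)                               # (2048,) feature vector
```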
In the invention, feature vectors are extracted from the image data through the depth residual error network ResNet model. Compared with the prior art, the extracted feature vectors give a richer representation with less information loss, which reduces the loss of feature information and improves retrieval accuracy.
In an embodiment, as shown in fig. 3, when the data to be matched in the first modality in step S202 is text data, performing feature vector extraction on the data to be matched by using a variational self-encoder model may specifically include the following steps:
step S401, truncating the data to be matched into a preset length;
step S402, using a word vector model to code and represent each word of the data to be matched, and cascading the codes;
and S403, processing the cascaded data by using a variational self-encoder model to obtain and store a feature vector of the data to be matched in the first mode.
In the present invention, step S401 aims to unify the length of each piece of text data, retaining as much information as possible without causing data redundancy. For example, each text is truncated to 25 words, and texts shorter than 25 words are padded with the code 0.
In the invention, step S402 encodes the length-adjusted text data: the encoding of each word is set to 300 dimensions, and the word codes in each sentence are then concatenated to obtain the vector of the corresponding sentence, whose dimension is 7500.
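An illustrative sketch of steps S401 and S402 (the source of the 300-dimensional word vectors is not specified in the patent and is assumed here to be a pre-built dictionary): each sentence is truncated or zero-padded to 25 tokens, every token is mapped to a 300-dimensional word vector, and the vectors are concatenated into a 7500-dimensional sentence representation.

```python
import numpy as np

MAX_LEN, EMB_DIM = 25, 300

def sentence_to_vector(tokens, word_vectors):
    """tokens: list of words; word_vectors: dict mapping word -> np.ndarray of shape (300,)."""
    tokens = tokens[:MAX_LEN]                                   # truncate to 25 words
    emb = [word_vectors.get(w, np.zeros(EMB_DIM)) for w in tokens]
    emb += [np.zeros(EMB_DIM)] * (MAX_LEN - len(emb))           # pad short sentences with code 0
    return np.concatenate(emb)                                  # shape (7500,)
```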
In the present invention, in step S403, the 7500-dimensional vector describing the sentence is fed to the input layer of the variational self-encoder and propagated forward to the fully connected layer, after which the mean and standard deviation vector layers are obtained, giving two vectors of dimension n (n being the dimension of the latent vector space): one mean vector and one standard deviation vector. The vector representation of the latent vector space layer is then obtained from the mean vector and the standard deviation vector, and finally this representation is fed to the decoder for decoding and reconstruction.
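A minimal sketch of the variational self-encoder described above, assuming PyTorch; the hidden width and the latent dimension n (512 here) are placeholders, since only the 7500-dimensional input is fixed by the text.

```python
import torch
import torch.nn as nn

class TextVAE(nn.Module):
    def __init__(self, input_dim=7500, hidden_dim=1024, latent_dim=512):
        super().__init__()
        self.fc = nn.Linear(input_dim, hidden_dim)          # fully connected encoder layer
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mean vector layer
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # (log-)variance vector layer
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),               # reconstruct the 7500-d input
        )

    def forward(self, x):
        h = torch.relu(self.fc(x))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)                # reparameterization trick
        return self.decoder(z), mu, logvar
```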
Model construction also includes a training process for the variational self-encoder, and the objective function used during training is:
L(θ, φ) = −E_{q_φ(Z|X)}[log p_θ(X|Z)] + KL(q_φ(Z|X) ‖ p(Z))

wherein: φ and θ respectively denote the encoding and decoding network layer parameters of the variational self-encoder; log p_θ(X|Z) is the log-likelihood estimate of the reconstructed sample; q_φ(Z|X) is the variational approximation, with encoding network parameters φ, of the posterior probability distribution function p_θ(Z|X); p(Z) is the prior normal distribution; and KL denotes the KL divergence, used to measure the similarity of two distribution functions. The KL distance reaches its minimum value 0 when the two distributions are identical, so in the neural network it serves as a constraint term that makes certain variables obey the set probability distribution.
This loss function consists of two terms. The first term represents the reconstruction loss; compared with an ordinary auto-encoder, only an expectation operator is added, because the samples are drawn from the distribution. The second term is the relative entropy, whose goal is to make the learned distribution close to a standard normal distribution, i.e. the mean close to 0 and the standard deviation close to 1. The first term corresponds to the decoding process: it is a log-likelihood estimate used to reconstruct the original sample data. The second term corresponds to the encoding process and measures the similarity between the approximate posterior probability distribution function and the prior distribution.
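A hedged sketch of this objective (reconstruction term plus KL divergence to a standard normal prior), matching the TextVAE sketch above; the choice of mean-squared error as the reconstruction term is an assumption, since the patent only specifies a log-likelihood term.

```python
import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, logvar):
    recon = F.mse_loss(x_recon, x, reduction="sum")                # stands in for -log p_theta(x|z)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL(q_phi(z|x) || N(0, I))
    return recon + kl
```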
The training steps of the variational self-encoder are as follows:
Step 1, forward propagation stage: samples are input into the variational self-encoder and the corresponding actual output is calculated. In this stage, information is transmitted from the input layer of the variational self-encoder to its output layer through stage-by-stage encoding and decoding transformations.
Step 2, backward propagation stage: according to the loss objective function above and an error-minimization method, back propagation is performed to adjust the model parameters of the variational self-encoder.
Step 3, the operations of step 1 and step 2 are repeated until the objective function of the variational self-encoder falls below a certain threshold, yielding the trained variational self-encoder model.
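A sketch of the forward/backward training procedure in steps 1 to 3, assuming the TextVAE and vae_loss defined in the sketches above; the optimizer, learning rate, batch data and stopping threshold are all assumptions.

```python
import torch

# dummy stand-in data: in practice these would be batches of concatenated word vectors
sentence_batches = [torch.randn(32, 7500) for _ in range(10)]

model = TextVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_threshold = 1.0   # hypothetical stopping threshold

for epoch in range(1000):
    epoch_loss = 0.0
    for x in sentence_batches:
        x_recon, mu, logvar = model(x)               # step 1: forward propagation
        loss = vae_loss(x_recon, x, mu, logvar)
        optimizer.zero_grad()
        loss.backward()                              # step 2: backward propagation
        optimizer.step()
        epoch_loss += loss.item()
    if epoch_loss / len(sentence_batches) < loss_threshold:   # step 3: stop at threshold
        break
```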
In the present invention, in step S403, the trained variational self-encoder model is used to perform feature extraction on the text data, and the feature vectors in the latent vector space layer are taken out as the extracted features of the text description.
The embodiment of the invention provides a cross-modal retrieval method that extracts text information through a variational self-encoder model, so that the feature vectors retain richer and more accurate information and the retrieval accuracy is improved.
In an embodiment, in step S204, the similarity is measured by an inner product of feature vectors of the data to be matched in the first modality and the paired data in the second modality, and the inner product is calculated by a scoring function, which is specifically as follows:
s(i, c) = f(i; W_f, θ_φ) · g(c; W_g, θ_ψ)

wherein: s(i, c) represents the scoring function, (i, c) represents a paired image-text item, and f(i; W_f, θ_φ) and g(c; W_g, θ_ψ) are the vector representations of the feature vectors of the image data and of the text data, respectively, within the common representation space.
In the invention, the extracted image and text feature vectors are mapped to a common vector representation space through the mapping functions; after mapping, the vector dimensions of the two modalities' data are the same, and are set here to 1024.
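A sketch of the projections into the 1024-dimensional common representation space and of the inner-product scoring function s(i, c), assuming PyTorch, the 2048-dimensional image features produced earlier and the 512-dimensional latent text features of the TextVAE sketch; representing each mapping weight matrix as a bias-free linear layer is a modeling assumption, not the patent's stated implementation.

```python
import torch
import torch.nn as nn

class CommonSpaceProjection(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=512, common_dim=1024):
        super().__init__()
        self.W_f = nn.Linear(img_dim, common_dim, bias=False)   # image mapping weight matrix
        self.W_g = nn.Linear(txt_dim, common_dim, bias=False)   # text mapping weight matrix

    def forward(self, img_feat, txt_feat):
        f = self.W_f(img_feat)        # f(i; W_f, theta_phi), shape (B, 1024)
        g = self.W_g(txt_feat)        # g(c; W_g, theta_psi), shape (B, 1024)
        return f, g

def score(f, g):
    """Pairwise scores s(i, c) = f · g for a batch: returns a (B, B) matrix."""
    return f @ g.t()
```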
The embodiment of the invention gives a specific form of the mapping functions. By establishing a common representation space and jointly embedding the visual and semantic information of the image and text data into it, the semantic gap between data of different modalities is effectively overcome, the dimensions of different modalities' data in the common representation space are made consistent, and similarity measurement becomes convenient.
As an optimization of the foregoing embodiment, the mapping functions are optimized through a ranking loss function and are used to map the feature vectors of the image and text data to the common representation space. The specific form of the mapping functions is as follows:
f(i; W_f, θ_φ) = W_f^T · φ(i; θ_φ)

g(c; W_g, θ_ψ) = W_g^T · ψ(c; θ_ψ)

wherein: i represents image data and c represents text data; θ_φ and θ_ψ are the model parameters of the image and text feature extraction models, respectively; φ(i; θ_φ) and ψ(c; θ_ψ) are the extracted image and text feature vectors, respectively; and W_f and W_g are the mapping weight matrices of the image and text data obtained after optimizing the ranking loss function.
As a further optimization of the above optimization scheme, the ranking loss function is designed according to the score function, and is used to optimize a mapping weight matrix of the mapping function, specifically:
ℓ(i, c) = [α − s(i, c) + s(i, c′)]_+ + [α − s(i, c) + s(i′, c)]_+

wherein: [x]_+ = max(x, 0) and α is the margin; i′ = arg max_{j≠i} s(j, c) is the image data with the highest similarity to the text c among the unpaired data in the database; c′ = arg max_{d≠c} s(i, d) is the text data with the highest similarity to the image i among the unpaired data in the database.
In the present invention, the prior art loss function is of the form:
ℓ(i, c) = Σ_{c′≠c} [α − s(i, c) + s(i, c′)]_+ + Σ_{i′≠i} [α − s(i, c) + s(i′, c)]_+

wherein: [x]_+ = max(x, 0); s(i, c) represents the score of a paired image-text item, and s(i, c′) and s(i′, c) represent the scores of unpaired image-text items. The first summation of the loss function sums, for a given image query i, over all unmatched text data c′; the second summation sums, for a given text query c, over all unmatched image data i′. The overall purpose of the existing loss function is to make each pair of matched image-text data lie closer in the common representation space than any pair of unmatched image-text data.
The invention instead adopts hard negative samples: given a query, the pair with the highest score is found among all unmatched image/text data pairs, i.e. i′ = arg max_{j≠i} s(j, c) and c′ = arg max_{d≠c} s(i, d) (where (j, c) and (i, d) denote unpaired image-text data), which yields the form given above for the invention.
Compared with the prior art, the cross-modal retrieval method provided by the invention adopts the modified ranking loss function as the target loss function, so that the correlations between data of different modalities are mined more fully and accurately; compared with the existing ranking loss function, the modified loss function requires less computation, is faster, and improves accuracy.
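A sketch of a max-hinge ranking loss with hard negatives of the kind described above, computed over an in-batch score matrix (PyTorch assumed); the margin value and the in-batch, rather than whole-database, negative mining are assumptions.

```python
import torch

def hard_negative_ranking_loss(scores: torch.Tensor, margin: float = 0.2):
    """scores: (B, B) matrix where scores[i, c] = s(i, c) and the diagonal holds
    the matched image-text pairs."""
    diag = scores.diag().view(-1, 1)                         # s(i, c) for matched pairs
    cost_txt = (margin + scores - diag).clamp(min=0)         # image query i vs. unmatched texts
    cost_img = (margin + scores - diag.t()).clamp(min=0)     # text query c vs. unmatched images
    eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(eye, 0)
    cost_img = cost_img.masked_fill(eye, 0)
    # keep only the hardest negative for each query, then sum over the batch
    return cost_txt.max(dim=1)[0].sum() + cost_img.max(dim=0)[0].sum()
```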
The technical effect achieved by the present invention will be further explained with reference to a specific embodiment.
The simulation hardware environment of this embodiment is: Intel Core(TM) [email protected] × 8, GPU NVIDIA GeForce GTX 1070, 8 GB memory; software environment: Ubuntu 16.04, Python 3.6.
The simulation experiments in this embodiment were carried out on the Flickr30K dataset. First, feature extraction is performed on the image data and the text data with the depth residual error network ResNet model and the variational self-encoder model, respectively; then the feature vectors of the image data and of the text data are mapped to the common representation space through the mapping functions, and similarity is calculated in that space. The commonly used evaluation criterion for cross-modal retrieval is R@N, the proportion of queries for which a correct result appears among the top N returned results; the larger the value, the better. Table 1 compares the results of the prior-art methods with the method used in the invention.
TABLE 1 Cross-modal search method effect comparison Table
Comparing Table 1 with Fig. 4 and Fig. 5, it can be seen that, compared with the other methods, the method of the present invention achieves higher accuracy and better results both in retrieving images from text and in retrieving text from images.
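A sketch of the R@N criterion used in these experiments: the fraction of queries whose ground-truth match appears among the top N retrieved items. It assumes a (Q, C) similarity matrix in which query q is paired with candidate q; this indexing convention is an assumption for illustration.

```python
import torch

def recall_at_n(scores: torch.Tensor, n: int) -> float:
    """scores[q, c] = similarity between query q and candidate c;
    the ground truth for query q is candidate q."""
    ranks = scores.argsort(dim=1, descending=True)         # candidates sorted by score
    gt = torch.arange(scores.size(0)).unsqueeze(1)
    hits = (ranks[:, :n] == gt).any(dim=1)                 # is the ground truth in the top N?
    return hits.float().mean().item()
```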
As shown in fig. 6, in an embodiment, a cross-modal retrieval apparatus is provided, which may specifically include:
the data acquiring module 601 is configured to acquire data to be matched in a first modality, where the data to be matched in the first modality includes image data and text data;
a feature vector extraction module 602, configured to, when the data to be matched in the first modality is image data, perform feature vector extraction on the data to be matched by using a depth residual error network ResNet model, and when the data to be matched in the first modality is text data, perform feature vector extraction on the data to be matched by using a variational self-encoder model;
a mapping module 603, configured to map the feature vector to a common representation space by using a preset mapping function;
a matching module 604, configured to calculate a similarity between the data to be matched in the first modality and the paired data in the second modality in the common representation space, and output the corresponding paired data in the second modality according to the similarity, thereby completing a cross-modality search;
the second modality pairing data comprises image data and text data, and in one round of retrieval, the first modality data to be matched and the second modality pairing data are different types of data.
In the present invention, the data to be matched in the first modality handled by the data acquisition module 601 may be image data or text data, and the cross-modal retrieval of the present invention refers to retrieval performed between image data and text data across modalities. It can be understood that the data to be matched in the first modality and the paired data in the second modality are different types of data: in one round of retrieval, when the data to be matched in the first modality is image data, the paired data in the second modality is text data; when the data to be matched in the first modality is text data, the paired data in the second modality is image data. In the present invention, the data to be matched in the first modality is data submitted by a data request terminal (for example, the terminal 110 or the computer device 120), and the paired data in the second modality is data returned by a responding terminal from a retrieval database; the database may be located on a local device or on a cloud server, which is not limited by the present invention.
In the invention, the feature vector extraction module 602 constructs a deep convolutional neural network ResNet (Residual Neural Network) containing 50 layers as the image feature extraction model. The network is divided into five parts, which from front to back are conv1, conv2_x, conv3_x, conv4_x and conv5_x. Picture data is obtained through the input layer and features are extracted from it layer by layer, which ensures complete extraction of the image information while keeping the model training difficulty manageable.
In the invention, the feature vector extraction module 602 extracts text information through a variational self-encoder model (Variational Auto-Encoder). Compared with traditional methods such as word-frequency statistics and with currently popular methods such as recurrent neural networks, this overcomes the problem that the semantic information of the text data is largely ignored, and the variational self-encoder model used in the invention performs better at retaining feature information.
In the present invention, the matching module 604 outputs the corresponding second modality paired data according to the similarity, it should be understood that the output data may not be unique, for example, when the similarities are the same or similar, a plurality of data are all output, or the first several paired data with the highest similarity are output by default, and the present invention does not limit this too much.
Compared with the prior art, the invention uses a depth residual error network ResNet model and a variational self-encoder model to extract features from image data and text data. Compared with processing image data by methods such as the pyramid histogram of visual words (PHOW) and Gist global features, and processing text data by methods such as the bag-of-words model, the feature vectors obtained by the invention give a richer and more accurate representation. Especially for text data, traditional methods such as word-frequency statistics largely ignore the semantic information present in the text, and compared with the currently popular recurrent neural networks, the variational self-encoder used in the invention performs better at retaining feature information.
In an embodiment, the feature vector extraction module 602 is configured to, when the data to be matched in the first modality is image data, extract a feature vector for the data to be matched in the first modality by using a depth residual error network ResNet model, specifically:
adjusting the data to be matched in the first mode to a first pixel size, and cutting out a partial region with a second pixel size within the range of the first pixel size, wherein the second pixel size is not larger than the first pixel size;
and extracting and storing the characteristic vector of the data to be matched in the first mode by using a depth residual error network ResNet model.
In the embodiment of the invention, the image data is resized to a first pixel size in order to unify the image size for input to the model, and a region of a second pixel size is then cropped from it, which serves as data augmentation. It is to be understood that the first pixel size is obtained by resizing the image, not by cropping, while the second pixel size is cropped from within the first pixel size; for example, the first pixel size may be 256 × 256 and the corresponding second pixel size 224 × 224, and the invention is not limited to this particular combination.
In the invention, a depth residual error network ResNet model with 50 layers is constructed, and the ResNet parameter weights pre-trained on ImageNet (a large visual database for visual object recognition research) are downloaded and loaded into the constructed ResNet model. The picture processed in the previous step is fed in from the input layer and sequentially undergoes batch normalization, convolution operations and nonlinear ReLU (Rectified Linear Unit) activations; after the convolutional layers and the fully connected layer, a feature vector of the image is extracted, with a dimension of 2048. The extracted 2048-dimensional vector of each image is stored so that the subsequent network architecture can use it directly.
In the invention, feature vectors are extracted from the image data through the depth residual error network ResNet model. Compared with the prior art, the extracted feature vectors give a richer representation with less information loss, which reduces the loss of feature information and improves retrieval accuracy.
In an embodiment, as shown in fig. 3, the feature vector extraction module 602 is configured to, when the data to be matched in the first modality is text data, perform feature vector extraction on the data to be matched by using a variational self-encoder model, and specifically is configured to:
cutting off the data to be matched into preset length;
using a word vector model to encode and represent each word of the data to be matched, and cascading the codes;
and processing the cascaded data by using a variational self-encoder model to obtain and store the characteristic vector of the data to be matched in the first mode.
In the invention, the data to be matched is truncated to a preset length in order to unify the length of each piece of text data, retaining as much information as possible without causing data redundancy. For example, each text is truncated to 25 words, and texts shorter than 25 words are padded with the code 0.
In the invention, a word vector model is used to encode each word of the data to be matched and the codes are concatenated; this encodes the length-adjusted text data, with the encoding of each word set to 300 dimensions, and the word codes in each sentence are concatenated to obtain the vector of the corresponding sentence, whose dimension is 7500.
In the invention, the concatenated data is processed with the variational self-encoder model to obtain and store the feature vector of the data to be matched in the first modality. The 7500-dimensional vector describing the sentence is fed to the input layer of the variational self-encoder and propagated forward to the fully connected layer, after which the mean and standard deviation vector layers are obtained, giving two vectors of dimension n (n being the dimension of the latent vector space): one mean vector and one standard deviation vector. The vector representation of the latent vector space layer is then obtained from the mean vector and the standard deviation vector, and finally this representation is fed to the decoder for decoding and reconstruction.
In the process of model construction, the process also comprises a training process of a variational self-encoder, and an objective function during training is as follows:
L(θ, φ) = −E_{q_φ(Z|X)}[log p_θ(X|Z)] + KL(q_φ(Z|X) ‖ p(Z))

wherein φ and θ respectively denote the encoding and decoding network layer parameters of the variational self-encoder; log p_θ(X|Z) is the log-likelihood estimate of the reconstructed sample; q_φ(Z|X) is the variational approximation, with encoding network parameters φ, of the posterior probability distribution function p_θ(Z|X); p(Z) is the prior normal distribution; and KL denotes the KL divergence, used to measure the similarity of two distribution functions. The KL distance reaches its minimum value 0 when the two distributions are identical, so in the neural network it serves as a constraint term that makes certain variables obey the set probability distribution.
This loss function consists of two terms. The first term represents the reconstruction loss; compared with an ordinary auto-encoder, only an expectation operator is added, because the samples are drawn from the distribution. The second term is the relative entropy, whose goal is to make the learned distribution close to a standard normal distribution, i.e. the mean close to 0 and the standard deviation close to 1. The first term corresponds to the decoding process: it is a log-likelihood estimate used to reconstruct the original sample data. The second term corresponds to the encoding process and measures the similarity between the approximate posterior probability distribution function and the prior distribution.
The training steps of the variational self-encoder are as follows:
Step 1, forward propagation stage: samples are input into the variational self-encoder and the corresponding actual output is calculated. In this stage, information is transmitted from the input layer of the variational self-encoder to its output layer through stage-by-stage encoding and decoding transformations.
Step 2, backward propagation stage: according to the loss objective function described above and an error-minimization method, back propagation is performed to adjust the model parameters of the variational self-encoder.
Step 3, the operations of step 1 and step 2 are repeated until the objective function of the variational self-encoder falls below a certain threshold, yielding the trained variational self-encoder model.
In the invention, the trained variational self-encoder model is adopted to extract the characteristics of the text data, and the characteristic vectors in the implicit vector space layer are taken out and used as the extracted characteristics of the text description.
The embodiment of the invention provides a cross-modal retrieval device that extracts text information through a variational self-encoder model, so that the feature vectors retain richer and more accurate information and the retrieval accuracy is improved.
In an embodiment, the matching module 604 measures the similarity by an inner product of feature vectors of the data to be matched in the first modality and the paired data in the second modality, where the inner product is calculated by a scoring function, and specifically, the inner product is as follows:
s(i, c) = f(i; W_f, θ_φ) · g(c; W_g, θ_ψ)

wherein: s(i, c) represents the scoring function, (i, c) represents a paired image-text item, and f(i; W_f, θ_φ) and g(c; W_g, θ_ψ) are the vector representations of the feature vectors of the image data and of the text data, respectively, within the common representation space.
In the invention, the extracted image and text feature vectors are mapped to a common vector representation space through the mapping functions; after mapping, the vector dimensions of the two modalities' data are the same, and are set here to 1024.
The embodiment of the invention gives a specific form of the mapping functions. By establishing a common representation space and jointly embedding the visual and semantic information of the image and text data into it, the semantic gap between data of different modalities is effectively overcome, the dimensions of different modalities' data in the common representation space are made consistent, and similarity measurement becomes convenient.
As an optimization of the foregoing embodiment, the mapping functions are optimized through a ranking loss function and are used to map the feature vectors of the image and text data to the common representation space. The specific form of the mapping functions is as follows:
f(i; W_f, θ_φ) = W_f^T · φ(i; θ_φ)

g(c; W_g, θ_ψ) = W_g^T · ψ(c; θ_ψ)

wherein: i represents image data and c represents text data; θ_φ and θ_ψ are the model parameters of the image and text feature extraction models, respectively; φ(i; θ_φ) and ψ(c; θ_ψ) are the extracted image and text feature vectors, respectively; and W_f and W_g are the mapping weight matrices of the image and text data obtained after optimizing the ranking loss function.
As a further optimization of the above optimization scheme, the ranking loss function is designed according to the score function, and is used to optimize the mapping weight matrix of the mapping function, specifically:
ℓ(i, c) = [α − s(i, c) + s(i, c′)]_+ + [α − s(i, c) + s(i′, c)]_+

wherein: [x]_+ = max(x, 0) and α is the margin; i′ = arg max_{j≠i} s(j, c) is the image data with the highest similarity to the text c among the unpaired data in the database; c′ = arg max_{d≠c} s(i, d) is the text data with the highest similarity to the image i among the unpaired data in the database.
In the present invention, the prior art loss function is of the form:
ℓ(i, c) = Σ_{c′≠c} [α − s(i, c) + s(i, c′)]_+ + Σ_{i′≠i} [α − s(i, c) + s(i′, c)]_+

wherein [x]_+ = max(x, 0); s(i, c) represents the score of a paired image-text item, and s(i, c′) and s(i′, c) represent the scores of unpaired image-text items. The first summation of the loss function sums, for a given image query i, over all unmatched text data c′; the second summation sums, for a given text query c, over all unmatched image data i′. The overall purpose of the existing loss function is to make each pair of matched image-text data lie closer in the common representation space than any pair of unmatched image-text data.
The invention instead adopts hard negative samples: given a query, the pair with the highest score is found among all unmatched image/text data pairs, i.e. i′ = arg max_{j≠i} s(j, c) and c′ = arg max_{d≠c} s(i, d) (where (j, c) and (i, d) denote unpaired image-text data), which yields the form given above for the invention.
Compared with the prior art, the cross-modal retrieval device provided by the invention adopts the modified ranking loss function as the target loss function, so that the correlations between data of different modalities are mined more fully and accurately; compared with the existing ranking loss function, the modified loss function requires less computation, is faster, and improves accuracy.
FIG. 7 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 (or the server 120) in fig. 1. As shown in fig. 7, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. The memory comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement a cross-modal retrieval method. The internal memory may also have a computer program stored therein, which when executed by the processor, causes the processor to perform a cross-modal search method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the configuration shown in fig. 7 is a block diagram of only a portion of the configuration associated with the present application, and is not intended to limit the computing device to which the present application may be applied, and that a particular computing device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the cross-modal retrieval apparatus provided by the present application may be implemented in the form of a computer program, which may be run on a computer device as shown in fig. 7. The memory of the computer device may store various program modules constituting the cross-modality retrieval apparatus, such as a data acquisition module 601, a feature vector extraction module 602, a mapping module 603, and a matching module 604 shown in fig. 6. The computer program constituted by the program modules causes the processor to execute the steps in the cross-modal retrieval method of the embodiments of the present application described in the present specification.
For example, the computer device shown in fig. 7 may execute step S201 through the data obtaining module 601 in the cross-modal retrieval apparatus shown in fig. 6; the computer device may perform step S202 through the feature vector extraction module 602; the computer device may perform step S203 through the mapping module 603; the computer device may perform step S204 through the matching module 604.
In one embodiment, a computer device is proposed, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
step S201, acquiring data to be matched in a first modality, wherein the data to be matched in the first modality comprises image data and text data;
step S202, when the data to be matched in the first mode is image data, extracting a feature vector of the data to be matched by using a depth residual error network ResNet model, and when the data to be matched in the first mode is text data, extracting the feature vector of the data to be matched by using a variational self-encoder model;
step S203, mapping the feature vector to a public expression space by using a preset mapping function;
step S204, calculating the similarity between the data to be matched of the first modality and the pairing data of the second modality in the public representation space, and outputting the corresponding pairing data of the second modality according to the similarity to finish the cross-modality retrieval;
the second modality pairing data comprises image data and text data, and in one round of retrieval, the first modality data to be matched and the second modality pairing data are different types of data.
In one embodiment, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, causes the processor to perform the following steps:
Step S201, acquiring data to be matched in a first modality, wherein the data to be matched in the first modality comprises image data and text data;
step S202, when the data to be matched in the first mode is image data, extracting a feature vector of the data to be matched by using a depth residual error network ResNet model, and when the data to be matched in the first mode is text data, extracting the feature vector of the data to be matched by using a variational self-encoder model;
step S203, mapping the feature vector to a public expression space by using a preset mapping function;
step S204, calculating the similarity between the data to be matched of the first modality and the pairing data of the second modality in the public representation space, and outputting the corresponding pairing data of the second modality according to the similarity to complete the cross-modality retrieval;
the second modality pairing data comprises image data and text data, and in one round of retrieval, the first modality data to be matched and the second modality pairing data are different types of data.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in sequence as indicated by the arrows, the steps are not necessarily performed in sequence as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least a portion of the steps in various embodiments may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of performance of the sub-steps or stages is not necessarily sequential, but may be performed in turn or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
All possible combinations of the technical features of the above embodiments may not be described for the sake of brevity, but should be considered as within the scope of the present disclosure as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (9)

1. A cross-modal search method, the method comprising:
acquiring data to be matched in a first modality, wherein the data to be matched in the first modality comprises image data and text data;
when the data to be matched in the first modality is image data, extracting a feature vector of the data to be matched by using a deep residual network (ResNet) model, and when the data to be matched in the first modality is text data, extracting the feature vector of the data to be matched by using a variational autoencoder model;
mapping the feature vector to a common representation space by using a preset mapping function;
calculating, in the common representation space, the similarity between the data to be matched of the first modality and paired data of the second modality, and outputting the corresponding paired data of the second modality according to the similarity to complete the cross-modal retrieval;
wherein the paired data of the second modality comprises image data and text data, and in any single round of retrieval, the data to be matched of the first modality and the paired data of the second modality are of different data types.
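By way of illustration only (this block is not part of the claim language), the retrieval flow of claim 1 can be sketched in Python roughly as follows; the names extractors, map_fn and gallery_features are hypothetical stand-ins for the modality-specific feature extractors, the preset mapping functions and the pre-computed common-space vectors of the second-modality database:

```python
import torch

def cross_modal_retrieve(query, query_modality, extractors, map_fn, gallery_features):
    """Return second-modality candidates ranked for a first-modality query.

    query_modality is "image" or "text"; gallery_features is an (N, d) tensor of
    already-mapped common-space vectors of the opposite modality. All names here
    are illustrative, not taken from the patent text.
    """
    # Steps 1-2: extract a feature vector with the modality-specific model
    feat = extractors[query_modality](query)            # ResNet features or VAE features
    # Step 3: map the feature vector into the common representation space
    q = map_fn[query_modality](feat)                    # W_f or W_g projection
    # Step 4: inner-product similarity against every candidate, then rank
    scores = gallery_features @ q                       # shape: (N,)
    return torch.argsort(scores, descending=True)       # candidate indices, best first
```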
2. The cross-modal retrieval method according to claim 1, wherein, when the data to be matched in the first modality is image data, extracting the feature vector of the image data by using the deep residual network (ResNet) model comprises the following steps:
resizing the data to be matched in the first modality to a first pixel size, and cropping out a partial region of a second pixel size from within the first pixel size, wherein the second pixel size is not larger than the first pixel size;
and extracting and storing the feature vector of the data to be matched in the first modality by using the deep residual network (ResNet) model.
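A minimal sketch of this image branch, assuming a first pixel size of 256, a second (cropped) pixel size of 224 and a torchvision ResNet-152 backbone with its classification head removed; these concrete values and the ResNet depth are illustrative assumptions, since the claim fixes none of them:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Resize to a first pixel size, then crop a region of a second, smaller pixel size.
preprocess = transforms.Compose([
    transforms.Resize(256),        # first pixel size (assumed)
    transforms.CenterCrop(224),    # second pixel size (assumed, not larger than the first)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()    # keep the pooled feature vector, drop the classifier
resnet.eval()

def image_feature(path: str) -> torch.Tensor:
    """Extract (and later store) the ResNet feature vector of one image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return resnet(img).squeeze(0)   # e.g. a 2048-dimensional vector
```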
3. The cross-modal retrieval method according to claim 1, wherein, when the data to be matched in the first modality is text data, extracting the feature vector of the data to be matched by using the variational autoencoder model comprises the following steps:
truncating the data to be matched to a preset length;
encoding each word of the data to be matched by using a word vector model, and concatenating the resulting word vectors;
and processing the concatenated data by using the variational autoencoder model to obtain and store the feature vector of the data to be matched in the first modality.
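The text branch can be sketched as below; the maximum length, word-vector dimension and layer sizes are assumed values, and the toy encoder only illustrates the truncate, embed, concatenate and variationally-encode order of operations rather than the actual network used:

```python
import torch
import torch.nn as nn

MAX_LEN, EMB_DIM, FEAT_DIM = 32, 300, 1024   # assumed hyper-parameters

class VAEEncoder(nn.Module):
    """Toy variational encoder: concatenated word vectors -> latent feature."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(MAX_LEN * EMB_DIM, 2048)
        self.mu = nn.Linear(2048, FEAT_DIM)
        self.logvar = nn.Linear(2048, FEAT_DIM)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick; at retrieval time mu alone can serve as the feature.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

def text_feature(words: list[str], word_vectors: dict, encoder: VAEEncoder) -> torch.Tensor:
    """Truncate to the preset length, embed each word, concatenate, then encode."""
    tokens = words[:MAX_LEN] + ["<pad>"] * max(0, MAX_LEN - len(words))  # padding is an added convenience
    embeddings = [word_vectors.get(t, torch.zeros(EMB_DIM)) for t in tokens]
    x = torch.cat(embeddings).unsqueeze(0)        # shape: (1, MAX_LEN * EMB_DIM)
    _, mu, _ = encoder(x)
    return mu.squeeze(0)
```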
4. The cross-modal retrieval method according to claim 1, wherein the similarity is measured by the inner product of the feature vectors of the data to be matched in the first modality and the paired data in the second modality, and the inner product is calculated by a scoring function of the following form:
s(i, c) = f(i; W_f, θ_φ) · g(c; W_g, θ_ψ)
wherein: s(i, c) denotes the scoring function, and f(i; W_f, θ_φ) and g(c; W_g, θ_ψ) denote the representations, within the common representation space, of the feature vectors of the image data i and the text data c, respectively.
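In code, this scoring function reduces to a dot product between the two common-space vectors; a batched form (one way it could be vectorised, not dictated by the claim) might look like:

```python
import torch

def score(image_vecs: torch.Tensor, text_vecs: torch.Tensor) -> torch.Tensor:
    """s(i, c) for every image/text pair.

    image_vecs: (num_images, d) common-space vectors f(i; W_f, θ_φ)
    text_vecs:  (num_texts,  d) common-space vectors g(c; W_g, θ_ψ)
    Returns the (num_images, num_texts) matrix of inner products.
    """
    return image_vecs @ text_vecs.t()
```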
5. The cross-modal retrieval method according to claim 4, wherein the mapping function, which maps the feature vectors of the image data and the text data to the common representation space, is optimized by a ranking loss function, and the mapping function has the following specific form:
f(i; W_f, θ_φ) = W_f^T · φ(i; θ_φ)
g(c; W_g, θ_ψ) = W_g^T · ψ(c; θ_ψ)
wherein: i denotes image data and c denotes text data; θ_φ and θ_ψ are the parameters of the image and text feature-extraction models, respectively; φ(i; θ_φ) and ψ(c; θ_ψ) are the extracted feature vectors of the image and the text, respectively; and W_f and W_g are the mapping weight matrices of the image data and the text data obtained by optimizing the ranking loss function.
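One plausible reading of this mapping is a learned linear projection per modality, as sketched below; the dimensions and the L2 normalisation are assumptions made here, not limitations of the claim:

```python
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceMapping(nn.Module):
    """Projects ResNet image features and VAE text features into one shared space."""
    def __init__(self, img_dim=2048, txt_dim=1024, common_dim=512):
        super().__init__()
        self.W_f = nn.Linear(img_dim, common_dim, bias=False)   # image mapping weight matrix
        self.W_g = nn.Linear(txt_dim, common_dim, bias=False)   # text mapping weight matrix

    def forward(self, phi_i, psi_c):
        # f(i; W_f, θ_φ) and g(c; W_g, θ_ψ), here unit-normalised so that the
        # inner product of claim 4 behaves like a cosine similarity (an assumption).
        f = F.normalize(self.W_f(phi_i), dim=-1)
        g = F.normalize(self.W_g(psi_c), dim=-1)
        return f, g
```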
6. The cross-modal retrieval method according to claim 5, wherein the ranking loss function is designed according to the scoring function and is used to optimize the mapping weight matrices of the mapping function, and has the following specific form:
ℓ(i, c) = max(0, α − s(i, c) + s(i′, c)) + max(0, α − s(i, c) + s(i, c′))
wherein: α is a margin constant; i′ = argmax_{j≠i} s(j, c) is the image data in the database having the highest similarity to the paired text data c; and c′ = argmax_{d≠c} s(i, d) is the text data in the database having the highest similarity to the paired image data i.
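A hedged sketch of a hardest-negative ranking loss consistent with this claim; the margin value and the exact hinge form are assumptions, the claim only requiring that the loss be built from the scoring function and the hardest negatives i′ and c′:

```python
import torch

def hardest_negative_ranking_loss(scores: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """scores[i, j] = s(image_i, text_j) for a batch of matched pairs; true pairs lie on the diagonal."""
    pos = scores.diag()                                    # s(i, c) for the matched pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    neg = scores.masked_fill(mask, float("-inf"))          # exclude the true pair itself
    hardest_text = neg.max(dim=1).values                   # s(i, c'): hardest negative text per image
    hardest_image = neg.max(dim=0).values                  # s(i', c): hardest negative image per text
    loss = torch.clamp(margin - pos + hardest_text, min=0) \
         + torch.clamp(margin - pos + hardest_image, min=0)
    return loss.mean()
```

Minimising such a loss over training pairs is what yields the mapping weight matrices W_f and W_g referred to in claim 5.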
7. A cross-modal retrieval apparatus, the apparatus comprising:
a data acquisition module, configured to acquire data to be matched in a first modality, wherein the data to be matched in the first modality comprises image data and text data;
a feature vector extraction module, configured to extract a feature vector of the data to be matched in the first modality by using a deep residual network (ResNet) model when the data to be matched is image data, and by using a variational autoencoder model when the data to be matched is text data;
a mapping module, configured to map the feature vector to a common representation space by using a preset mapping function;
and a matching module, configured to calculate, in the common representation space, the similarity between the data to be matched of the first modality and paired data of the second modality, and to output the corresponding paired data of the second modality according to the similarity to complete the cross-modal retrieval;
wherein the paired data of the second modality comprises image data and text data, and in any single round of retrieval, the data to be matched of the first modality and the paired data of the second modality are of different data types.
8. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to carry out the steps of the cross-modal retrieval method according to any one of claims 1 to 6.
9. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the steps of the cross-modal retrieval method of any one of claims 1 to 6.
CN201811490973.2A 2018-12-07 2018-12-07 Cross-modal retrieval method and device, computer equipment and storage medium Active CN109783655B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811490973.2A CN109783655B (en) 2018-12-07 2018-12-07 Cross-modal retrieval method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811490973.2A CN109783655B (en) 2018-12-07 2018-12-07 Cross-modal retrieval method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109783655A CN109783655A (en) 2019-05-21
CN109783655B true CN109783655B (en) 2022-12-30

Family

ID=66496838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811490973.2A Active CN109783655B (en) 2018-12-07 2018-12-07 Cross-modal retrieval method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109783655B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230019211A1 (en) * 2021-06-30 2023-01-19 Nvidia Corporation Pretraining framework for neural networks

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182281B (en) * 2019-07-05 2023-09-19 腾讯科技(深圳)有限公司 Audio recommendation method, device and storage medium
CN110457516A (en) * 2019-08-12 2019-11-15 桂林电子科技大学 A kind of cross-module state picture and text search method
CN112420202A (en) * 2019-08-23 2021-02-26 阿里巴巴集团控股有限公司 Data processing method, device and equipment
CN110674294A (en) * 2019-08-29 2020-01-10 维沃移动通信有限公司 Similarity determination method and electronic equipment
CN110517203B (en) * 2019-08-30 2023-06-23 山东工商学院 Defogging method based on reference image reconstruction
CN110597878B (en) * 2019-09-16 2023-09-15 广东工业大学 Cross-modal retrieval method, device, equipment and medium for multi-modal data
CN110659392B (en) * 2019-09-29 2022-05-06 北京市商汤科技开发有限公司 Retrieval method and device, and storage medium
CN110807122B (en) * 2019-10-18 2022-07-08 浙江大学 Image-text cross-modal feature disentanglement method based on depth mutual information constraint
CN110826627A (en) * 2019-11-06 2020-02-21 广东三维家信息科技有限公司 Image similarity measuring method and device and electronic equipment
CN111091010A (en) * 2019-11-22 2020-05-01 京东方科技集团股份有限公司 Similarity determination method, similarity determination device, network training device, network searching device and storage medium
CN112883218A (en) * 2019-11-29 2021-06-01 智慧芽信息科技(苏州)有限公司 Image-text combined representation searching method, system, server and storage medium
CN111460231A (en) * 2020-03-10 2020-07-28 华为技术有限公司 Electronic device, search method for electronic device, and medium
CN111415009B (en) * 2020-03-19 2021-02-09 四川大学 Convolutional variational self-encoder network structure searching method based on genetic algorithm
WO2022041940A1 (en) * 2020-08-31 2022-03-03 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Cross-modal retrieval method, training method for cross-modal retrieval model, and related device
CN112015923A (en) * 2020-09-04 2020-12-01 平安科技(深圳)有限公司 Multi-mode data retrieval method, system, terminal and storage medium
CN112256899B (en) * 2020-09-23 2022-05-10 华为技术有限公司 Image reordering method, related device and computer readable storage medium
CN112464087B (en) * 2020-11-23 2024-03-01 北京明略软件***有限公司 Recommendation probability output method and device, storage medium and electronic equipment
CN112528062B (en) * 2020-12-03 2024-03-22 成都航天科工大数据研究院有限公司 Cross-modal weapon retrieval method and system
CN112287159B (en) * 2020-12-18 2021-04-09 北京世纪好未来教育科技有限公司 Retrieval method, electronic device and computer readable medium
CN113656668B (en) * 2021-08-19 2022-10-11 北京百度网讯科技有限公司 Retrieval method, management method, device, equipment and medium of multi-modal information base
CN113656660B (en) * 2021-10-14 2022-06-28 北京中科闻歌科技股份有限公司 Cross-modal data matching method, device, equipment and medium
CN113627151B (en) * 2021-10-14 2022-02-22 北京中科闻歌科技股份有限公司 Cross-modal data matching method, device, equipment and medium
CN114092704B (en) * 2021-10-22 2022-10-21 北京大数据先进技术研究院 Example matching method, device, equipment and storage medium based on neighbor propagation
CN114580425B (en) * 2022-05-06 2022-09-09 阿里巴巴(中国)有限公司 Named entity recognition method and device, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017210949A1 (en) * 2016-06-06 2017-12-14 北京大学深圳研究生院 Cross-media retrieval method
CN107832351A (en) * 2017-10-21 2018-03-23 桂林电子科技大学 Cross-module state search method based on depth related network
CN108319686A (en) * 2018-02-01 2018-07-24 北京大学深圳研究生院 Antagonism cross-media retrieval method based on limited text space

Also Published As

Publication number Publication date
CN109783655A (en) 2019-05-21

Similar Documents

Publication Publication Date Title
CN109783655B (en) Cross-modal retrieval method and device, computer equipment and storage medium
US11593612B2 (en) Intelligent image captioning
US20220309762A1 (en) Generating scene graphs from digital images using external knowledge and image reconstruction
US11544474B2 (en) Generation of text from structured data
US20180336193A1 (en) Artificial Intelligence Based Method and Apparatus for Generating Article
US11238093B2 (en) Video retrieval based on encoding temporal relationships among video frames
WO2018153265A1 (en) Keyword extraction method, computer device, and storage medium
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
US20180268548A1 (en) Automatically segmenting images based on natural language phrases
CN111581510A (en) Shared content processing method and device, computer equipment and storage medium
EP3764276A1 (en) Video processing method and apparatus, video retrieval method and apparatus, storage medium and server
CN110188775B (en) Image content description automatic generation method based on joint neural network model
CN110990533B (en) Method and device for determining standard text corresponding to query text
CN111259113B (en) Text matching method, text matching device, computer readable storage medium and computer equipment
CN109710921B (en) Word similarity calculation method, device, computer equipment and storage medium
CN111291177A (en) Information processing method and device and computer storage medium
CN115982403B (en) Multi-mode hash retrieval method and device
CN111444715A (en) Entity relationship identification method and device, computer equipment and storage medium
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN115391578A (en) Cross-modal image-text retrieval model training method and system
CN110377618B (en) Method, device, computer equipment and storage medium for analyzing decision result
CN115408558A (en) Long video retrieval method and device based on multi-scale multi-example similarity learning
CN111222010A (en) Method for solving video time sequence positioning problem by using semantic completion neural network
US11281714B2 (en) Image retrieval
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant