CN112148916A - Cross-modal retrieval method, device, equipment and medium based on supervision - Google Patents

Cross-modal retrieval method, device, equipment and medium based on supervision

Info

Publication number
CN112148916A
CN112148916A (application CN202011044741.1A)
Authority
CN
China
Prior art keywords
data
retrieval
text
image
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011044741.1A
Other languages
Chinese (zh)
Inventor
李国徽
袁凌
周思远
徐志鹏
潘鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202011044741.1A priority Critical patent/CN112148916A/en
Publication of CN112148916A publication Critical patent/CN112148916A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/538Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention discloses a supervision-based cross-modal retrieval method, device, equipment and medium. The method comprises the following steps: extracting features from training sample data of an image modality and a text modality; mapping the extracted image and text data features to a common representation space; separately calculating the label-space loss, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, and combining them with different weights to obtain the loss function of a retrieval model; optimizing the parameters of the retrieval model by minimizing the loss function; and mapping target retrieval data to the common representation space with the optimized retrieval model and calculating its similarity to the data in the image-text data set to obtain the corresponding retrieval ranking result. The method thus preserves both the discriminability of samples with different semantics and the semantic information of the original data, computes correlations between cross-modal data more effectively, and achieves higher retrieval accuracy.

Description

Cross-modal retrieval method, device, equipment and medium based on supervision
Technical Field
The invention relates to the technical field of data retrieval, and in particular to a supervision-based cross-modal retrieval method, device, equipment and medium.
Background
With the rapid development of science and technology, the forms in which scientific and technical information is produced and acquired are increasingly rich. Such information is no longer limited to plain text and is gradually shifting toward mixed data types of other modalities, such as pictures and videos, whose expression is more vivid and whose content is richer. Traditional single-modality retrieval works well within one modality, but because data of different modalities may exhibit feature heterogeneity and weak correlation, their feature vectors differ in dimension and attributes and cannot directly participate in a common computation, so single-modality retrieval is not suitable for retrieval across multiple modalities. Cross-modal retrieval exploits the semantic similarity that exists between different modalities to retrieve similar content among multi-modal data. It can therefore meet the need for multi-angle intelligent analysis of multi-modal scientific and technical information.
The core of cross-modal retrieval is how to measure the content similarity between data of different modalities, i.e. how to overcome the heterogeneity between them. Representation learning is a general approach to this problem: it designs suitable functions that map data of different modalities into a common representation space in which, because the dimensions are consistent, the similarity between data of different modalities can be computed directly. To construct a suitable representation space, researchers have proposed many ways of designing the mapping functions.
Conventional methods use statistical correlation analysis to learn linear functions by optimizing a target statistic, but the correlations between multi-modal data in the real world are complex and cannot be fully modelled by linear functions. Because deep neural networks perform well in representation learning, many deep-learning-based methods have been used to learn a common representation space for multi-modal data. Compared with unsupervised approaches, supervised deep learning can learn more discriminative representation features and thus separate data of different classes better in the common representation space. Existing supervised cross-modal retrieval methods include learning discriminative features between multi-modal data using label information, and learning semantics or discriminability within each modality using classification information. Although these methods use classification information, they use it only to learn discriminative features within each modality or across modalities, and therefore do not make full use of the semantic information in the original data.
Disclosure of Invention
In view of the above shortcomings and needs for improvement of the prior art, the present invention provides a supervision-based cross-modal retrieval method, device, equipment and medium, aiming to solve the technical problem that existing cross-modal retrieval methods do not make full use of the semantic information in the original data, resulting in low retrieval accuracy.
In order to achieve the above object, the present invention provides a supervision-based cross-modal retrieval method, which comprises the following steps: S1: respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is obtained from an image-text data set; S2: mapping the extracted image data features and text data features to a common representation space; S3: respectively calculating the label-space loss, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, and combining them with different weights to obtain the loss function of a retrieval model; S4: optimizing parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model; S5: mapping target retrieval data to the common representation space using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text data set, so as to obtain a retrieval ranking result corresponding to the target retrieval data.
Further, the step S1 includes: s11: performing feature extraction on training sample data of an image mode by using a deep convolutional neural network, and adding a first full connection layer after an image extraction sub-network; s12: and performing feature extraction on training sample data of a text mode by using a natural language processing model, and adding a second full connection layer after the text extraction sub-network.
Further, the step S2 includes: and adding a third full connection layer after the first full connection layer and the second full connection layer, and mapping the extracted image data features and text data features to a common representation space through the third full connection layer.
Further, a linear classifier is added after the third fully-connected layer to predict the class of images and texts and compare with the real class, thereby calculating the loss of tag space.
Further, the loss function is expressed as: l ═ λ L1+μL2+ηL3Wherein, in the step (A),
Figure BDA0002707645000000031
L1for the loss of label space, n is the number of the picture text data pairs, | | · uFRepresenting Frobenius norm, P is a projection matrix of a linear classifier, alpha and beta are weights corresponding to image and text prediction labels respectively, and U, V, Y is a representation matrix of an image, a representation matrix of a text and a representation matrix of a corresponding label in a public representation space respectively;
Figure BDA0002707645000000032
L2for losses within individual modalities and between different modalities in the common representation space,ij=cos(ui,vj),Φij=cos(ui,uj),Θij=cos(vi,vj),
Figure BDA0002707645000000033
Figure BDA0002707645000000034
cos is a cosine function used to measure similarity; sgn is a sign function, and is 1 if the two representing elements belong to the same class, otherwise is 0;
Figure BDA0002707645000000035
for mapping the modality of the image,
Figure BDA0002707645000000036
for mapping text modalities, wherein
Figure BDA0002707645000000037
And
Figure BDA0002707645000000038
y for the ith image sample and the jth text sampleαAnd upsilonβIs a learnable parameter;
Figure BDA0002707645000000039
L3loss of invariance between image and text modalities;
λ, μ, η are L1、L2、L3The weight coefficient of (2).
Further, in step S5, calculating the similarity between the target retrieval data and the data in the image-text data set includes: and calculating the similarity between the target retrieval data and the data in the image-text data set by carrying out weighted average on the cross-modal data similarity and the homomodal data similarity.
In another aspect, the present invention provides a cross-modal search apparatus based on supervision, including:
the feature extraction module is used for respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is obtained from an image-text data set;
the public representation space learning module is used for mapping the extracted image data features and the extracted text data features to a public representation space;
the loss function calculation module is used for calculating the loss of a label space, the loss in each mode and among different modes in the public expression space and the invariance loss among image modes and text modes respectively, and adding different weights to obtain a loss function of the retrieval model;
the retrieval model optimization module is used for optimizing parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;
and the retrieval result determining module is used for mapping the target retrieval data to the public representation space by using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text data set so as to obtain a retrieval sequencing result corresponding to the target retrieval data.
The invention also provides an electronic device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The invention also provides a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method of any of the above.
In general, the above technical solution conceived by the present invention can achieve the following beneficial effects:
(1) The method maps the extracted image data features and text data features to a common representation space; calculates the label-space loss, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, and combines them with different weights to obtain the loss function of a retrieval model; and optimizes the parameters of the retrieval model by minimizing the loss function to obtain the optimized retrieval model. An end-to-end supervised cross-modal deep learning framework is adopted, which makes full use of the classification information of multi-modal data to learn the common representation space, preserves the discriminability of samples with different semantics, and eliminates the differences between cross-modal data. Compared with traditional hash-based retrieval methods, the real-valued cross-modal representation learning adopted here retains the semantic information of the original data, can compute correlations between cross-modal data more effectively, eliminates the influence of heterogeneity between different modalities, and achieves higher retrieval accuracy.
(2) The present invention makes full use of classification information to learn the common representation space. At the same time, it is a real-valued cross-modal representation learning method; compared with binary representation learning or cross-modal hash retrieval methods, it retains the information of the original data and achieves higher retrieval accuracy.
Drawings
FIG. 1 is a schematic flow chart of a cross-modal search method based on supervision according to the present invention;
fig. 2 is a block diagram of a cross-modal search apparatus based on supervision according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. The features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Also, the order of the steps or actions in the method descriptions may be adjusted or rearranged, as will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings are for the purpose of describing certain embodiments only and are not intended to imply a required sequence unless otherwise indicated where such sequence must be followed. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a cross-modal retrieval method based on supervision, as shown in fig. 1, comprising the following steps:
S1: respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is obtained from an image-text data set;
S2: mapping the extracted image data features and text data features to a common representation space;
S3: respectively calculating the label-space loss, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, and combining them with different weights to obtain the loss function of a retrieval model;
S4: optimizing parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;
S5: mapping target retrieval data to the common representation space using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text data set, so as to obtain a retrieval ranking result corresponding to the target retrieval data.
The present invention supervises the model to learn discriminative features by minimizing the discrimination losses in both the label space and the common representation space. At the same time, modality-invariant features in the common representation space are learned by minimizing the invariance loss between different modalities and by using a weight-sharing strategy. With this learning strategy, the pairwise label information and the classification information are utilized as fully as possible to ensure that the learned representations are discriminative in semantic structure and invariant across all modalities.
Specifically, the method comprises the following steps:
1. data preprocessing and image text feature extraction
The invention can be applied to cross-modal retrieval between different modalities; the description mainly takes image and text data as an example. Assume n image-text data pairs defined as D = {(x_i, t_i)}_{i=1}^n, where x_i is the ith image sample and t_i is the ith text sample. Each image-text pair corresponds to a semantic label vector y_i ∈ {0,1}^c, where c denotes the number of categories; if the ith instance belongs to the jth category, the corresponding component of the label vector is 1, otherwise it is 0.
The data preprocessing of the images includes resizing, cropping, normalization, etc.; the text preprocessing includes denoising, word segmentation, stop-word filtering, etc. The feature extraction method is described below.
For image data, a deep convolutional neural network is used to extract the feature vector of the image. The convolution kernels of the convolutional neural network are applied to the input image by element-wise multiplication and summation, projecting the information in each receptive field into an element of the feature map and thereby extracting image features. By controlling the dimension of each layer, this part finally expresses the image as a feature vector of a specific dimension. Generally, a convolutional neural network has multiple layers, and each successive layer represents the image data in a more abstract way, so the output of the second-to-last layer can be used as the feature vector. Feature extraction therefore uses a specific convolutional neural network to extract single-modality image features, yielding the image feature vector for the subsequent representation learning. Concretely, the fc7 layer of the 19-layer VGGNet is used to extract a 4096-dimensional high-level feature vector as the semantic vector of the image. A fully connected layer then performs common representation learning to obtain the common representation vector u_i of the image.
For text data, a feature extraction algorithm based on a language model and a neural network is adopted to convert the unstructured original text into a high-level semantic feature vector representation. For the text sub-network, a BERT model (BERT-base, Chinese) is used to extract the feature vectors of Chinese text. BERT uses the Transformer structure to build a multi-layer bidirectional encoder network and can directly convert the original text into high-level semantic vectors carrying semantic features; a fully connected layer then performs common representation learning to obtain the common representation vector v_j of the text.
The feature extraction module is used to extract the high-level semantic feature vectors of scientific and technical information images and texts respectively. There are two sub-networks, one for the image modality and one for the text modality, and both are trained end to end.
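By way of illustration only, a minimal PyTorch-style sketch of the two feature-extraction sub-networks is given below; the class names, the 1024-dimensional output size and the use of the torchvision and transformers model loaders are assumptions made for the example rather than part of the original disclosure.

```python
import torch.nn as nn
from torchvision import models
from transformers import BertModel

class ImageSubNet(nn.Module):
    """Image sub-network: 19-layer VGGNet up to fc7 (4096-d), then one fully connected layer."""
    def __init__(self, out_dim=1024):
        super().__init__()
        vgg = models.vgg19(weights=None)  # in practice, ImageNet-pretrained weights would be loaded
        self.features = vgg.features
        self.avgpool = vgg.avgpool
        self.fc7 = nn.Sequential(*list(vgg.classifier.children())[:-1])  # drop the 1000-way head
        self.fc_img = nn.Linear(4096, out_dim)  # the "first fully connected layer"

    def forward(self, images):
        h = self.avgpool(self.features(images)).flatten(1)
        return self.fc_img(self.fc7(h))  # image feature vector

class TextSubNet(nn.Module):
    """Text sub-network: BERT-base Chinese encoder, then one fully connected layer."""
    def __init__(self, out_dim=1024, name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.fc_txt = nn.Linear(self.bert.config.hidden_size, out_dim)  # the "second fully connected layer"

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return self.fc_txt(h)  # text feature vector
```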
2. Common representation space learning
For the common representation learning module, in order to ensure that the two sub-networks can learn a common representation space, a fully connected layer with weight sharing is used at the end of the model. Then, based on the assumption that the representation vectors in the common space are well suited for classification, a linear classifier with parameter matrix P is connected to the end of the model to learn discriminative features from the label information. In this way, the method can learn cross-modal correlations well and accurately distinguish the representation features in the common space.
Since the image and text data lie in different representation spaces, their similarity cannot be compared directly, so the method learns two mapping functions that map images and texts respectively into the common representation space. Define u_i = f_I(x_i; γ_α) as the mapping of the image modality and v_j = f_T(t_j; υ_β) as the mapping of the text modality, where γ_α and υ_β are learnable parameters.
The representation matrix of the images in the common space is defined as U = [u_1, u_2, ..., u_n], the representation matrix of the texts as V = [v_1, v_2, ..., v_n], and the matrix of the corresponding labels as Y = [y_1, y_2, ..., y_n].
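Continuing the sketch above, the weight-shared fully connected layer and the linear classifier with projection matrix P could be assembled roughly as follows; the common-space dimension and the class count are illustrative placeholders.

```python
import torch.nn as nn

class CrossModalNet(nn.Module):
    """Common representation learning: one weight-shared FC layer on top of both sub-networks,
    plus a linear classifier whose weight matrix plays the role of the projection matrix P."""
    def __init__(self, img_net, txt_net, feat_dim=1024, common_dim=512, num_classes=10):
        super().__init__()
        self.img_net, self.txt_net = img_net, txt_net
        self.shared = nn.Linear(feat_dim, common_dim)                     # weight-shared FC layer
        self.classifier = nn.Linear(common_dim, num_classes, bias=False)  # linear classifier P

    def forward(self, images, input_ids, attention_mask):
        u = self.shared(self.img_net(images))                        # common representation u_i
        v = self.shared(self.txt_net(input_ids, attention_mask))     # common representation v_j
        return u, v, self.classifier(u), self.classifier(v)          # predicted labels used in L1
```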
3. Training and objective functions of models
The goal of the model is to learn the semantic structure of the data, i.e. to learn a common space in which samples from the same semantic class are similar even though they may come from different modalities, while samples from different semantic classes are dissimilar. To learn discriminative representation features of multi-modal data, the method minimizes the discrimination loss in the label space and in the common representation space, and at the same time minimizes the distance between the representations of each image-text pair to reduce cross-modal differences.
In order to preserve the differences between samples of different classes after feature projection, the method assumes that the common representation features are well suited for classification and uses a linear classifier to predict the semantic labels of the sample features in the common representation space. Specifically, a fully connected layer connects the ends of the image-modality network and the text-modality network. The classifier classifies the representation features of the training data in the common space, generating a predicted label for each data sample.
The objective function can be divided into three parts. The first part measures the loss in the label space, as shown in equation (1).
L1 = (1/n)(α||P^T U − Y||_F + β||P^T V − Y||_F)   (1)
where ||·||_F denotes the Frobenius norm, P is the projection matrix of the linear classifier, and α and β are the weights corresponding to the image and text prediction labels respectively. Since images and texts extract high-level semantic vectors in different ways, their prediction losses after mapping to the common representation space are not balanced, so different weights are applied to the image and text prediction labels to balance this difference.
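Assuming U, V and Y are held as row-per-sample matrices and the predicted labels come from the linear classifier above, the label-space loss of equation (1) can be sketched as follows (the α/β defaults are illustrative):

```python
import torch

def label_space_loss(pred_u, pred_v, y, alpha=0.5, beta=0.5):
    """Equation (1): weighted Frobenius-norm distance between predicted and true label matrices."""
    n = y.size(0)
    return (alpha * torch.norm(pred_u - y, p="fro")
            + beta * torch.norm(pred_v - y, p="fro")) / n
```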
The second part of the objective function directly measures the discrimination loss between data of different modalities and within each modality in the common representation space, as shown in equation (2).
L2 = −(1/n²)Σ_{i,j}(S_ij Γ_ij − log(1 + e^{Γ_ij})) − (1/n²)Σ_{i,j}(S_ij Φ_ij − log(1 + e^{Φ_ij})) − (1/n²)Σ_{i,j}(S_ij Θ_ij − log(1 + e^{Θ_ij}))   (2)
where Γ_ij = cos(u_i, v_j), Φ_ij = cos(u_i, u_j), Θ_ij = cos(v_i, v_j), cos is the cosine function used to measure similarity, and S_ij is given by the sign function sgn, which is 1 if the two represented elements belong to the same class and 0 otherwise.
The likelihood function used to measure the cross-modal similarity is shown in equation (3).
p(S_ij | u_i, v_j) = σ(Γ_ij)^{S_ij} (1 − σ(Γ_ij))^{1 − S_ij},  with σ(x) = 1/(1 + e^{−x})   (3)
Since maximizing the likelihood is equivalent to minimizing its negative logarithm, equation (3) reduces to the first line of equation (2). It can be seen that the larger the cosine similarity cos(u_i, v_j), the larger Γ_ij and thus the larger the probability p(1 | u_i, v_j); that is, the common representation space is organized according to similarity. Similarly, the second and third lines of equation (2) measure the similarity within the image samples and within the text samples respectively. The formula is therefore a reasonable measure of similarity in the common representation space and a criterion for learning discriminative features.
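A sketch of the discriminative loss of equation (2) is given below, with S_ij derived from shared class labels; the exact tensor layout (row-per-sample matrices of common representations and multi-hot label vectors) is an assumption made for the example.

```python
import torch
import torch.nn.functional as F

def nll_term(a, b, labels_a, labels_b):
    """-(1/n^2) * sum_ij [ S_ij * sim_ij - log(1 + exp(sim_ij)) ], where sim_ij is the cosine
    similarity between row i of a and row j of b, and S_ij = 1 iff the samples share a class."""
    sim = F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()
    s = (labels_a.float() @ labels_b.float().t() > 0).float()
    return -(s * sim - torch.log1p(torch.exp(sim))).mean()

def discriminative_loss(u, v, y):
    """Equation (2): the cross-modal term plus the two intra-modal terms."""
    return nll_term(u, v, y, y) + nll_term(u, u, y, y) + nll_term(v, v, y, y)
```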
To eliminate cross-modal differences, the third part of the objective function minimizes the distance between the representations of all image-text pairs, as shown in equation (4).
L3 = (1/n)||U − V||_F   (4)
Combining (1), (2) and (4) with different weights gives the overall objective function of the model, as shown in equation (5).
L = λL1 + μL2 + ηL3   (5)
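Reusing the helper functions sketched above, the overall objective of equation (5), including the invariance loss of equation (4), could be combined as follows; the default weight values are placeholders, not values fixed by the disclosure.

```python
import torch

def total_loss(u, v, y, classifier, lam=1.0, mu=1.0, eta=1.0, alpha=0.5, beta=0.5):
    """Equation (5): L = lambda*L1 + mu*L2 + eta*L3, with L3 = (1/n) * ||U - V||_F (equation (4))."""
    l1 = label_space_loss(classifier(u), classifier(v), y, alpha, beta)  # label-space loss, eq. (1)
    l2 = discriminative_loss(u, v, y)                                    # discriminative loss, eq. (2)
    l3 = torch.norm(u - v, p="fro") / y.size(0)                          # invariance loss, eq. (4)
    return lam * l1 + mu * l2 + eta * l3
```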
4. Similarity calculation and result display
Finally, the model is trained and put to use. The trained model maps the target retrieval data into the common representation space, and the similarity between the target retrieval data and the data in the scientific and technical image-text data set is calculated. In practical use, the similarity to same-modality data can be added to improve retrieval accuracy, as shown in equation (6).
S = αSimilarity(x, U) + βSimilarity(x, V)   (6)
where α and β are weights, Similarity() is a function measuring similarity, x is the input image or text data, and S is the final score; the results are sorted by S, and the top-ranked data are returned as the final result.
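A sketch of the ranking step of equation (6), using cosine similarity as the Similarity() function (an assumption for the example; the disclosure does not fix the similarity measure here):

```python
import torch
import torch.nn.functional as F

def retrieve(query_vec, U, V, alpha=0.5, beta=0.5, top_k=10):
    """Equation (6): S = alpha*Similarity(x, U) + beta*Similarity(x, V); return top-ranked indices.
    query_vec is the common-space representation of the input image or text sample."""
    q = F.normalize(query_vec.unsqueeze(0), dim=1)
    s = alpha * (q @ F.normalize(U, dim=1).t()) + beta * (q @ F.normalize(V, dim=1).t())
    s = s.squeeze(0)
    return torch.topk(s, k=min(top_k, s.numel())).indices  # indices of the top-ranked database items
```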
In another aspect, the present invention provides a cross-modal search apparatus based on supervision, including:
the feature extraction module 11 is configured to perform feature extraction on training sample data of an image modality and a text modality respectively, where the training sample data is obtained from an image-text data set;
a common representation space learning module 21 for mapping the extracted image data features and text data features to a common representation space;
a loss function calculating module 31, configured to calculate a loss of a tag space, a loss in each modality and between different modalities in the common representation space, and an invariance loss between image modalities and text modalities, and add different weights to obtain a loss function of the retrieval model;
a retrieval model optimization module 41, configured to optimize parameters of the retrieval model by minimizing the loss function, so as to obtain an optimized retrieval model;
and a retrieval result determining module 51, configured to map the target retrieval data to the common representation space by using the optimized retrieval model, and calculate a similarity between the target retrieval data and the data in the image-text data set, so as to obtain a retrieval ordering result corresponding to the target retrieval data.
The division of each module in the above-mentioned cross-modal search apparatus is merely for illustration, and in other embodiments, the cross-modal search apparatus may be divided into different modules as required to complete all or part of the functions of the above-mentioned apparatus.
The implementation of each module in the cross-modal retrieval apparatus based on supervision provided in the embodiment of the present application may be in the form of a computer program. The computer program may be run on a terminal or a server. Program modules constituted by such computer programs may be stored on the memory of the electronic device. Which when executed by a processor, performs the steps of the method described in the embodiments of the present application.
The embodiment of the application also provides a computer-readable storage medium: one or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the supervision-based cross-modal retrieval method.
A computer program product containing instructions which, when run on a computer, cause the computer to perform the supervision-based cross-modal retrieval method.
Any reference to memory, storage, database, or other medium used herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be readily understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, so that various changes, modifications and substitutions may be made without departing from the spirit and scope of the invention.

Claims (9)

1. A cross-modal retrieval method based on supervision is characterized by comprising the following steps:
S1: respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is obtained from an image-text data set;
S2: mapping the extracted image data features and text data features to a common representation space;
S3: respectively calculating the label-space loss, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, and combining them with different weights to obtain the loss function of a retrieval model;
S4: optimizing parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;
S5: mapping target retrieval data to the common representation space using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text data set, so as to obtain a retrieval ranking result corresponding to the target retrieval data.
2. The supervised-based cross-modal retrieval method of claim 1, wherein the step S1 comprises:
s11: performing feature extraction on training sample data of an image mode by using a deep convolutional neural network, and adding a first full connection layer after an image extraction sub-network;
s12: and performing feature extraction on training sample data of a text mode by using a natural language processing model, and adding a second full connection layer after the text extraction sub-network.
3. The supervised-based cross-modal retrieval method of claim 2, wherein the step S2 comprises:
and adding a third full connection layer after the first full connection layer and the second full connection layer, and mapping the extracted image data features and text data features to a common representation space through the third full connection layer.
4. The supervised-based cross-modal search method of claim 3, wherein a linear classifier is added after the third fully connected layer to predict image and text classes and compare them with the true classes to calculate the loss of label space.
5. The supervised-based cross-modal retrieval method of claim 1,
the loss function is expressed as: l ═ λ L1+μL2+ηL3Wherein, in the step (A),
Figure FDA0002707644990000021
L1for the loss of label space, n is the number of the picture text data pairs, | | · uFRepresenting Frobenius norm, P is a projection matrix of a linear classifier, alpha and beta are weights corresponding to image and text prediction labels respectively, and U, V, Y is a representation matrix of an image, a representation matrix of a text and a representation matrix of a corresponding label in a public representation space respectively;
Figure FDA0002707644990000022
L2for losses within individual modalities and between different modalities in the common representation space,ij=cos(ui,vj),Φij=cos(ui,uj),Θij=cos(vi,vj),
Figure FDA0002707644990000023
Figure FDA0002707644990000024
cos is a cosine function used to measure similarity; sgn is a sign function, and is 1 if the two representing elements belong to the same class, otherwise is 0;
Figure FDA0002707644990000026
for mapping the modality of the image,
Figure FDA0002707644990000029
for mapping text modalities, wherein
Figure FDA0002707644990000027
And
Figure FDA0002707644990000028
y for the ith image sample and the jth text sampleαAnd upsilonβIs a learnable parameter;
Figure FDA0002707644990000025
L3loss of invariance between image and text modalities;
λ, μ, η are L1、L2、L3The weight coefficient of (2).
6. The supervised-based cross-modal retrieval method of claim 1, wherein the calculating of the similarity between the target retrieval data and the data in the teletext data set in step S5 comprises: and calculating the similarity between the target retrieval data and the data in the image-text data set by carrying out weighted average on the cross-modal data similarity and the homomodal data similarity.
7. A cross-modal search apparatus based on supervision, comprising:
the feature extraction module is used for respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is obtained from an image-text data set;
the public representation space learning module is used for mapping the extracted image data features and the extracted text data features to a public representation space;
the loss function calculation module is used for calculating the loss of a label space, the loss in each mode and among different modes in the public expression space and the invariance loss among image modes and text modes respectively, and adding different weights to obtain a loss function of the retrieval model;
the retrieval model optimization module is used for optimizing parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;
and the retrieval result determining module is used for mapping the target retrieval data to the public representation space by using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text data set so as to obtain a retrieval sequencing result corresponding to the target retrieval data.
8. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202011044741.1A 2020-09-28 2020-09-28 Cross-modal retrieval method, device, equipment and medium based on supervision Pending CN112148916A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011044741.1A CN112148916A (en) 2020-09-28 2020-09-28 Cross-modal retrieval method, device, equipment and medium based on supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011044741.1A CN112148916A (en) 2020-09-28 2020-09-28 Cross-modal retrieval method, device, equipment and medium based on supervision

Publications (1)

Publication Number Publication Date
CN112148916A true CN112148916A (en) 2020-12-29

Family

ID=73896074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011044741.1A Pending CN112148916A (en) 2020-09-28 2020-09-28 Cross-modal retrieval method, device, equipment and medium based on supervision

Country Status (1)

Country Link
CN (1) CN112148916A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010720A (en) * 2021-02-24 2021-06-22 华侨大学 Deep supervision cross-modal retrieval method based on key object characteristics
CN113033622A (en) * 2021-03-05 2021-06-25 北京百度网讯科技有限公司 Training method, device, equipment and storage medium for cross-modal retrieval model
CN113157739A (en) * 2021-04-23 2021-07-23 平安科技(深圳)有限公司 Cross-modal retrieval method and device, electronic equipment and storage medium
CN113220919A (en) * 2021-05-17 2021-08-06 河海大学 Dam defect image text cross-modal retrieval method and model
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast
CN113297485A (en) * 2021-05-24 2021-08-24 中国科学院计算技术研究所 Method for generating cross-modal representation vector and cross-modal recommendation method
CN113360700A (en) * 2021-06-30 2021-09-07 北京百度网讯科技有限公司 Method, device, equipment and medium for training image-text retrieval model and image-text retrieval
CN113408282A (en) * 2021-08-06 2021-09-17 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for topic model training and topic prediction
CN113449070A (en) * 2021-05-25 2021-09-28 北京有竹居网络技术有限公司 Multimodal data retrieval method, device, medium and electronic equipment
CN113779283A (en) * 2021-11-11 2021-12-10 南京码极客科技有限公司 Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN113792207A (en) * 2021-09-29 2021-12-14 嘉兴学院 Cross-modal retrieval method based on multi-level feature representation alignment
CN114691907A (en) * 2022-05-31 2022-07-01 上海蜜度信息技术有限公司 Cross-modal retrieval method, device and medium
CN114840734A (en) * 2022-04-29 2022-08-02 北京百度网讯科技有限公司 Training method of multi-modal representation model, cross-modal retrieval method and device
CN114841243A (en) * 2022-04-02 2022-08-02 中国科学院上海高等研究院 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium
CN114861016A (en) * 2022-07-05 2022-08-05 人民中科(北京)智能技术有限公司 Cross-modal retrieval method and device and storage medium
CN115082930A (en) * 2021-03-11 2022-09-20 腾讯科技(深圳)有限公司 Image classification method and device, electronic equipment and storage medium
WO2023024413A1 (en) * 2021-08-25 2023-03-02 平安科技(深圳)有限公司 Information matching method and apparatus, computer device and readable storage medium
CN115827954B (en) * 2023-02-23 2023-06-06 中国传媒大学 Dynamic weighted cross-modal fusion network retrieval method, system and electronic equipment
CN116775918A (en) * 2023-08-22 2023-09-19 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN116955699A (en) * 2023-07-18 2023-10-27 北京邮电大学 Video cross-mode search model training method, searching method and device
WO2024011814A1 (en) * 2022-07-12 2024-01-18 苏州元脑智能科技有限公司 Image-text mutual retrieval method, system and device, and nonvolatile readable storage medium
CN117891960A (en) * 2024-01-19 2024-04-16 中国科学技术大学 Multi-mode hash retrieval method and system based on adaptive gradient modulation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309331A (en) * 2019-07-04 2019-10-08 哈尔滨工业大学(深圳) A kind of cross-module state depth Hash search method based on self-supervisory

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309331A (en) * 2019-07-04 2019-10-08 哈尔滨工业大学(深圳) A kind of cross-module state depth Hash search method based on self-supervisory

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIANGLI ZHEN et al.: "Deep Supervised Cross-Modal Retrieval", IEEE, 31 January 2020 (2020-01-31), pages 10394 - 10403 *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010720A (en) * 2021-02-24 2021-06-22 华侨大学 Deep supervision cross-modal retrieval method based on key object characteristics
CN113010720B (en) * 2021-02-24 2022-06-07 华侨大学 Deep supervision cross-modal retrieval method based on key object characteristics
CN113033622A (en) * 2021-03-05 2021-06-25 北京百度网讯科技有限公司 Training method, device, equipment and storage medium for cross-modal retrieval model
CN115082930B (en) * 2021-03-11 2024-05-28 腾讯科技(深圳)有限公司 Image classification method, device, electronic equipment and storage medium
CN115082930A (en) * 2021-03-11 2022-09-20 腾讯科技(深圳)有限公司 Image classification method and device, electronic equipment and storage medium
CN113157739A (en) * 2021-04-23 2021-07-23 平安科技(深圳)有限公司 Cross-modal retrieval method and device, electronic equipment and storage medium
CN113220919B (en) * 2021-05-17 2022-04-22 河海大学 Dam defect image text cross-modal retrieval method and model
CN113220919A (en) * 2021-05-17 2021-08-06 河海大学 Dam defect image text cross-modal retrieval method and model
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast
CN113297485A (en) * 2021-05-24 2021-08-24 中国科学院计算技术研究所 Method for generating cross-modal representation vector and cross-modal recommendation method
CN113297485B (en) * 2021-05-24 2023-01-24 中国科学院计算技术研究所 Method for generating cross-modal representation vector and cross-modal recommendation method
CN113449070A (en) * 2021-05-25 2021-09-28 北京有竹居网络技术有限公司 Multimodal data retrieval method, device, medium and electronic equipment
CN113360700B (en) * 2021-06-30 2023-09-29 北京百度网讯科技有限公司 Training of image-text retrieval model, image-text retrieval method, device, equipment and medium
CN113360700A (en) * 2021-06-30 2021-09-07 北京百度网讯科技有限公司 Method, device, equipment and medium for training image-text retrieval model and image-text retrieval
CN113408282A (en) * 2021-08-06 2021-09-17 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for topic model training and topic prediction
CN113408282B (en) * 2021-08-06 2021-11-09 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for topic model training and topic prediction
WO2023024413A1 (en) * 2021-08-25 2023-03-02 平安科技(深圳)有限公司 Information matching method and apparatus, computer device and readable storage medium
CN113792207A (en) * 2021-09-29 2021-12-14 嘉兴学院 Cross-modal retrieval method based on multi-level feature representation alignment
CN113792207B (en) * 2021-09-29 2023-11-17 嘉兴学院 Cross-modal retrieval method based on multi-level feature representation alignment
CN113779283B (en) * 2021-11-11 2022-04-01 南京码极客科技有限公司 Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN113779283A (en) * 2021-11-11 2021-12-10 南京码极客科技有限公司 Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN114841243A (en) * 2022-04-02 2022-08-02 中国科学院上海高等研究院 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium
CN114840734A (en) * 2022-04-29 2022-08-02 北京百度网讯科技有限公司 Training method of multi-modal representation model, cross-modal retrieval method and device
CN114691907A (en) * 2022-05-31 2022-07-01 上海蜜度信息技术有限公司 Cross-modal retrieval method, device and medium
CN114861016A (en) * 2022-07-05 2022-08-05 人民中科(北京)智能技术有限公司 Cross-modal retrieval method and device and storage medium
WO2024011814A1 (en) * 2022-07-12 2024-01-18 苏州元脑智能科技有限公司 Image-text mutual retrieval method, system and device, and nonvolatile readable storage medium
CN115827954B (en) * 2023-02-23 2023-06-06 中国传媒大学 Dynamic weighted cross-modal fusion network retrieval method, system and electronic equipment
CN116955699A (en) * 2023-07-18 2023-10-27 北京邮电大学 Video cross-mode search model training method, searching method and device
CN116955699B (en) * 2023-07-18 2024-04-26 北京邮电大学 Video cross-mode search model training method, searching method and device
CN116775918A (en) * 2023-08-22 2023-09-19 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN116775918B (en) * 2023-08-22 2023-11-24 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN117891960A (en) * 2024-01-19 2024-04-16 中国科学技术大学 Multi-mode hash retrieval method and system based on adaptive gradient modulation

Similar Documents

Publication Publication Date Title
CN112148916A (en) Cross-modal retrieval method, device, equipment and medium based on supervision
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN111914558A (en) Course knowledge relation extraction method and system based on sentence bag attention remote supervision
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN110516095A (en) Weakly supervised depth Hash social activity image search method and system based on semanteme migration
CN111881262A (en) Text emotion analysis method based on multi-channel neural network
CN111475622A (en) Text classification method, device, terminal and storage medium
CN110502757B (en) Natural language emotion analysis method
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN113948217A (en) Medical nested named entity recognition method based on local feature integration
CN113836992A (en) Method for identifying label, method, device and equipment for training label identification model
CN113378938B (en) Edge transform graph neural network-based small sample image classification method and system
CN114882521A (en) Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on multi-branch network
CN106227836B (en) Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN110705384B (en) Vehicle re-identification method based on cross-domain migration enhanced representation
CN115374786A (en) Entity and relationship combined extraction method and device, storage medium and terminal
CN112364166A (en) Method for establishing relation extraction model and relation extraction method
CN111428502A (en) Named entity labeling method for military corpus
CN111723572B (en) Chinese short text correlation measurement method based on CNN convolutional layer and BilSTM
CN111898528B (en) Data processing method, device, computer readable medium and electronic equipment
CN113536015A (en) Cross-modal retrieval method based on depth identification migration
CN111859979A (en) Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium
WO2023134085A1 (en) Question answer prediction method and prediction apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination