CN110647904B - Cross-modal retrieval method and system based on unmarked data migration - Google Patents

Cross-modal retrieval method and system based on unmarked data migration

Info

Publication number
CN110647904B
CN110647904B (application CN201910707010.1A)
Authority
CN
China
Prior art keywords
image
text
data
modal
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910707010.1A
Other languages
Chinese (zh)
Other versions
CN110647904A (en)
Inventor
朱福庆
王雪如
张卫博
戴娇
虎嵩林
韩冀中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201910707010.1A priority Critical patent/CN110647904B/en
Publication of CN110647904A publication Critical patent/CN110647904A/en
Application granted granted Critical
Publication of CN110647904B publication Critical patent/CN110647904B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23211Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a cross-modal retrieval method and system based on label-free data migration. The invention effectively alleviates the problem that cross-modal datasets are small in scale and better matches real scenarios in which user queries fall outside the predefined category range; at the same time, it can better extract the high-level semantic information of data from different modalities, overcome the heterogeneity gap between modalities, increase the similarity between modalities, and improve the accuracy of cross-modal retrieval.

Description

Cross-modal retrieval method and system based on unmarked data migration
Technical Field
The invention relates to the technical field of cross-modal data retrieval, in particular to a cross-modal retrieval method and a cross-modal retrieval system based on unmarked data migration.
Background
Data of different modalities, such as images and texts, are ubiquitous on the internet and show a trend of fusing with one another. The cross-modal retrieval task tries to break the boundary between data of different modalities and realize information retrieval across them: given a sample of one modality, it retrieves samples of other modalities that are semantically similar to it. Cross-modal retrieval is widely used in search engines and big data management. Existing cross-modal retrieval methods map the feature representations of different modalities into a common space to learn a unified representation and measure similarity by the distance between the corresponding unified representations. However, because of the heterogeneity of different modalities, their data distributions and representations are inconsistent, semantic association is difficult to establish, and cross-modal similarity remains hard to measure.
Although the internet contains a large amount of image and text data, most of it is unlabeled and difficult to use, even though it contains rich semantic information. On the one hand, data annotation is expensive; on the other hand, internet information is constantly updated, and every new trending event is accompanied by a large amount of image and text data of new categories, so it is impossible to annotate data of all categories. How to make full use of unlabeled data is therefore a major challenge for the traditional cross-modal retrieval task.
In real scenarios, the queries submitted by users often fall outside the predefined category range, and the training set and the test set sometimes do not share the same categories. Existing cross-modal retrieval methods generally only handle the case where the training data and the test data belong to the same categories (non-extensible cross-modal retrieval). How to better construct a cross-modal common space so that, given a query of one modality, the related multi-modal data can be retrieved whether its category is known or unknown, is of great significance in practical applications.
Disclosure of Invention
In order to solve problems such as the heterogeneity of data from different modalities, the abundance of unusable unlabeled data, insufficient training data, and non-extensibility, the invention provides a cross-modal retrieval method and system based on unmarked data migration.
The technical scheme of the invention is as follows:
a cross-mode retrieval method based on unmarked data migration comprises the following steps:
inputting a sample to be retrieved into a trained cross-modal data retrieval model to obtain the feature representation of the sample;
calculating the Euclidean distances between each sample to be retrieved and all samples of the other modality and ranking them, wherein the samples of the other modality whose distance is smaller than a specified threshold are the retrieval results;
the training process of the cross-modal data retrieval model is as follows:
(1) setting pseudo labels for the unmarked images and the texts respectively by a clustering method;
(2) respectively transferring knowledge contained in unmarked images and texts with pseudo labels to image and text parts of a cross-modal data set, and learning the independent expression of the images and texts of the cross-modal data set;
(3) transmitting the independent expressions of the images and the texts into the same network, and learning the common expression of the images and the texts in the same semantic space.
Further, the threshold is determined as follows: the cross-modal knowledge loss Loss_cross-modal computed during training is the distance between paired image and text features. According to the Loss_cross-modal values, 10-20 initial thresholds are set and the retrieval mAP (mean Average Precision) is calculated under each threshold; mAP measures the quality of the learned model over all queries, i.e. the average of all AP values, where AP (Average Precision) measures the quality of the learned model on a single query. The threshold with the maximum mAP value is taken as the retrieval threshold. The loss function of cross-modal knowledge, Loss_cross-modal, is:

Loss_cross-modal = Σ_{l ∈ {l6, l7}} (1/n_l) Σ_{p=1}^{n_l} || g_l(i_p) - g_l(t_p) ||_2^2

where l6 and l7 denote the two fully connected layers shared by the image and text branches of the cross-modal dataset, n_l is the number of input image-text pairs, (i_p, t_p) is the p-th image-text pair, and the image and text are mapped into feature vectors using g(·).
A cross-modal retrieval system based on unmarked data migration, comprising:
The system comprises a label-free data clustering module, a data migration module and a common space learning module; a migration dataset is constructed by the label-free data clustering module and used as the migration source domain of the data migration module, and finally the common space learning module learns a unified representation of the images and texts obtained by the data migration module and establishes the similarity measurement basis of the cross-modal data, thereby realizing cross-modal retrieval.
Further, the label-free data clustering module comprises an image clustering submodule and a text clustering submodule. The module extracts the features of all unlabeled images/texts and then performs unsupervised clustering to obtain a series of cluster centers; the image/text samples under the same cluster center are classified into one class and assigned the same label, which completes the construction of the migration dataset.
Further, the data migration module comprises an image migration submodule and a text migration submodule, and migration only occurs within the same submodule. For each submodule, the migration source domain is the clustered unlabeled data of the corresponding modality, and the target domain is the data of the corresponding modality in the cross-modal dataset. Transfer learning is achieved by minimizing the distribution loss between the source domain and the target domain. The samples of the cross-modal dataset are input in pairs and belong to the same category, so the representations finally generated for them should be similar: by minimizing the pairwise Euclidean distance between the two modalities, images and texts with the same semantic information are drawn as close as possible and those with different semantics are pushed as far apart as possible, independently of modality.
Furthermore, the common space learning module feeds the separate representations of the images and texts obtained by the data migration module into the same network to learn a unified representation of the data of different modalities. The network comprises several shared fully connected layers, and the word embedding vectors of the cross-modal dataset categories are added to the network, which increases the semantic association between different modalities and further strengthens the semantic information.
The method has the beneficial effects that:
according to the method, a large number of unmarked monomodal data sets are clustered and are distributed with the pseudo labels, and the clustered unmarked data are transferred to the cross-modal data set, so that the problem of small data scale of the cross-modal data set is well solved, and the method is more suitable for the condition that the actual user query is not in the predefined category range. By the method, the upper-layer semantic information of data in different modes can be better extracted, the heterogeneity difference between the modes is overcome, the similarity between the modes is increased, and the accuracy of cross-mode retrieval is improved. The method achieves good effects in both public data sets and practical applications.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a data migration flow diagram;
FIG. 3 is a flow diagram of a feature extraction system.
Detailed Description
This section mainly describes the modeling of the cross-modal retrieval network based on transfer learning, unlabeled data clustering, data migration, common representation learning, and the testing process.
The method will be further described with reference to the accompanying drawings.
Modeling of a cross-modal retrieval network based on transfer learning:
Unlabeled data clustering: given an unlabeled dataset S, the image clustering algorithm C_i clusters the unlabeled images S_i into k_i classes and the text clustering algorithm C_t clusters the unlabeled texts S_t into k_t classes; within each class, all images and texts belonging to the same cluster center are marked with the same pseudo label y_i. The clustered unlabeled dataset S is then migrated to the cross-modal dataset D using a transfer learning algorithm T, and joint training generates the separate vector representations R_i, R_t of the images and texts of the cross-modal dataset. Finally, the separate representations R_i, R_t and the category word embedding vector V are fed into the same fully connected network F, generating the common representation R of images and texts in the same space. Wherein:
Unlabeled dataset S = {S_i, S_t}: the source domain of transfer learning, where S_i is the unlabeled image dataset and S_t is the unlabeled text dataset.
Cross-modal dataset D = {D_i, D_t}: D_i and D_t are the images and texts of the cross-modal dataset; the images and texts are input in pairs and are correlated, i.e. for each image/text pair the image and text come from the same article, or the text is a description of the image.
Word embedding vector V: all known categories of the cross-modal dataset are converted into 300-dimensional word vectors by the Word2vec model.
Text input: a text is a description of an image and may be an article, a paragraph, a sentence, a word, etc. Text vectors are extracted using Bert and have 768 dimensions.
Image input: in this network, the input images are 224 x 224 images.
Clustering algorithm C = {C_i, C_t}: C_i is the image clustering algorithm and C_t is the text clustering algorithm.
Numbers of clusters k_i, k_t: determined empirically through repeated runs.
Migration algorithm T: an algorithm that acquires knowledge from a source domain to improve the target task, where the source domain differs from the target domain or the source task differs from the target task.
Common representation vector R: the finally generated vector representation of the images and texts.
A label-free data clustering module:
For the unlabeled images, which contain rich semantic information, a pre-trained VGG network is first used to extract a feature vector for each image, and the images are then clustered with the KMeans method. The specific procedure is as follows: according to the number and distribution of the unlabeled images, the initial number of cluster centers (i.e. k_i) is set and k_i images are randomly selected as the initial cluster centers. All images are traversed, each image is assigned to the nearest cluster center, the mean of each cluster is updated as the new cluster center, and this is iterated until the clusters no longer change or the maximum number of iterations is reached. All samples of the same cluster are classified into one class and given the same label, which forms the source domain dataset for image migration.
For the unlabeled texts, Bert is first used to extract the features of each text; the same unsupervised clustering method as for the images is then applied to group similar texts into the same cluster and assign them the same label, which forms the source domain dataset for text migration.
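A minimal sketch of this pseudo-label construction is given below, assuming the VGG image features and Bert text features have already been extracted into arrays; the helper name and variable names are illustrative assumptions only:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_pseudo_labeled_source(features, k, max_iter=300, seed=0):
    """Cluster unlabeled image (VGG) or text (Bert) features into k groups and
    use the cluster index of each sample as its pseudo label."""
    km = KMeans(n_clusters=k, max_iter=max_iter, random_state=seed)
    pseudo_labels = km.fit_predict(features)      # one pseudo label per sample
    return pseudo_labels, km.cluster_centers_

# Illustrative usage with hypothetical arrays:
# image_feats: (N_img, 4096) VGG features; text_feats: (N_txt, 768) Bert vectors
# img_labels, _ = build_pseudo_labeled_source(image_feats, k_i)
# txt_labels, _ = build_pseudo_labeled_source(text_feats, k_t)
```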
Method for determining a suitable number of cluster centers: according to the size of the unlabeled data, initial values of k are set in the range 5-15; clustering is run for each value of k and the corresponding SSE is recorded (sum of squared errors, i.e. the sum of the squared distances between each sample point and its cluster center). As the number of clusters increases, the samples are divided more finely, the cohesion of each cluster gradually improves, and the SSE gradually decreases. When k is smaller than the optimal number of clusters, increasing k greatly improves the cohesion of each cluster, so the SSE drops sharply; once k reaches the optimal number of clusters, the return from further increasing k diminishes rapidly, so the drop in SSE levels off as k continues to grow. The relationship between k and SSE is plotted, and the point where the slope changes is the optimal value of k.
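The elbow criterion described above can be computed from the KMeans inertia (the SSE) for each candidate k, as in the following sketch; the helper name is an illustrative assumption:

```python
from sklearn.cluster import KMeans

def sse_curve(features, k_values):
    """Compute the sum of squared errors (KMeans inertia) for each candidate k;
    the 'elbow' where the SSE drop flattens is taken as the cluster count."""
    curve = {}
    for k in k_values:
        km = KMeans(n_clusters=k, random_state=0).fit(features)
        curve[k] = km.inertia_   # sum of squared distances to the closest center
    return curve

# e.g. candidates 5..15 as in the text; plot k vs. SSE and pick the elbow:
# curve = sse_curve(image_feats, range(5, 16))
```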
A data migration module:
the data migration module comprises two parts, namely monomodal knowledge migration and cross-modal knowledge sharing.
Single-modality migration refers to migrating the clustered unlabeled images to the image part of the cross-modal dataset and migrating the clustered unlabeled texts to the text part of the cross-modal dataset. The module therefore comprises two single-modality migration submodules, one for images and one for texts.
Referring to fig. 2, for image migration, the migration source domain is the clustered unlabeled images and the target domain is the image part of the cross-modal data. The images of the source domain and the target domain are first fed into the network, pass through the first five convolutional layers of the AlexNet network, and are followed by three fully connected layers fc6, fc7 and fc8, where the loss function on the source domain is the SoftMax loss. Knowledge migration for the image modality is achieved by minimizing the MMD (Maximum Mean Discrepancy) loss between the source domain and the target domain, which measures the difference between two different but related distributions. Let the distribution of the image target domain be X_i and the distribution of the image source domain be Y_i; the migration loss of the image modality is:

Loss_img = MMD^2(X_i, Y_i) = || (1/m) Σ_{p=1}^{m} f(y_p) - (1/n) Σ_{q=1}^{n} f(x_q) ||_H^2

where f(·) maps the data into a reproducing kernel Hilbert space (RKHS) and ||·||_H denotes the distance measured in that space, y_p are samples of the source domain, x_q are samples of the target domain, m is the number of samples of the source domain data, and n is the number of samples of the target domain data.
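The MMD term can be estimated from mini-batches of source and target features. The sketch below uses a Gaussian-kernel estimate of squared MMD, which is one common way of measuring the distance between mean embeddings in an RKHS; the kernel choice, the bandwidth `sigma`, and the function names are assumptions for illustration rather than the exact form used in the patent:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a (m, d) and b (n, d)."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2(source, target, sigma=1.0):
    """Biased estimate of squared MMD between source and target feature batches,
    i.e. the squared distance of their mean embeddings in the RKHS induced by
    the Gaussian kernel."""
    k_ss = rbf_kernel(source, source, sigma).mean()
    k_tt = rbf_kernel(target, target, sigma).mean()
    k_st = rbf_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st

# loss_img = mmd2(fc_feats_source_images, fc_feats_target_images)
# loss_txt = mmd2(fc_feats_source_texts, fc_feats_target_texts)
```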
The text migration process is similar to image migration: the migration source domain is the clustered unlabeled texts and the target domain is the text part of the cross-modal data. Text feature vectors of the source and target domains are extracted with the NLP model Bert released by GOOGLE and then passed through three fully connected layers fc6, fc7 and fc8, where the loss function on the source domain is the SoftMax loss and the migration loss is the MMD loss. Let the distribution of the text target domain be X_t and the distribution of the text source domain be Y_t; the migration loss of the text modality is:

Loss_txt = MMD^2(X_t, Y_t) = || (1/m) Σ_{p=1}^{m} f(y_p) - (1/n) Σ_{q=1}^{n} f(x_q) ||_H^2
the purpose of setting the cross-modal knowledge sharing layer is to fully utilize similar semantic information among different modalities, overcome the heterogeneity difference among the modalities, and no matter which modality the data comes from, as long as the data contains the same semantic information, the data should have similar feature vectors, contain different semantic information, and the distance of the feature vectors should be far away. The similarity of vectors is measured using Euclidean distances (fc6-img/fc6-txt and fc7-img/fc7-txt), which should be as small as possible for each pair of similar images and text that are input. The loss function across modal knowledge is:
Figure GDA0003391471780000063
where l6, l7 refers to two fully connected layers connected across the modal dataset image text, nl refers to the logarithm of the incoming image and text,
Figure GDA0003391471780000064
for the p-th image-text pair, the image and text are mapped into feature vectors using g ().
After passing through the two monomodal knowledge migration modules and the cross-modal knowledge sharing module, the model makes full use of unmarked data, has stronger semantic discrimination capability, and generates a separate representation for each sample in the cross-modal data set.
The final loss function of the migration module is:
Loss_transfer = Loss_img + Loss_txt + Loss_cross-modal
a common space learning module:
the cross-modal target domain internal semantic association also provides key semantic information for the cross-modal common space construction, and in order to further enhance the semantic correlation of image and text features, a common space learning module is further designed to enhance the correlation. The module is a simple and efficient structure comprising two fully connected layers and a common classification layer. Word embedding (word embedding) vectors of image features, text features and categories are introduced into the module, since the parameters of fc8, fc9 are shared by two modalities, so that semantic relevance of different modalities can be guaranteed with supervisory information in the cross-modality target domain. Considering the labels of two paired modalities in the target domain, the correlation penalty is:
Figure GDA0003391471780000071
wherein f is s In order to be a function of the SoftMax loss,
Figure GDA0003391471780000072
for the p-th relevant image-text pair input,/ p A category label for the image text pair.
The migration module and the common space learning module form a unified network structure; the two modules are trained together and reinforce each other. The overall network loss is therefore:

Loss = Loss_transfer + Loss_common
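The overall objective can then be assembled from the individual terms. The sketch below combines the two MMD losses, the cross-modal pair loss, and a SoftMax (cross-entropy) classification loss applied to both modalities of each pair; the helper names and the plain-numpy formulation are assumptions for illustration:

```python
import numpy as np

def softmax_ce(logits, labels):
    """Mean SoftMax cross-entropy of (n, num_classes) logits against integer labels."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(log_p[np.arange(len(labels)), labels])

def total_loss(loss_img, loss_txt, loss_cross_modal, img_logits, txt_logits, labels):
    """Loss = Loss_transfer + Loss_common: Loss_transfer sums the two
    single-modality MMD losses and the cross-modal pair loss, and Loss_common
    applies the shared SoftMax classifier to both modalities of each pair."""
    loss_transfer = loss_img + loss_txt + loss_cross_modal
    loss_common = softmax_ce(img_logits, labels) + softmax_ce(txt_logits, labels)
    return loss_transfer + loss_common
```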
the embodiment is as follows:
the invention comprises a training system, a feature extraction system and a retrieval three parts: the three modules are combined to form the overall structure (figure 1) of the invention, and training data are transmitted into a training system for training and are stored to obtain a training model. The parameters of the feature extraction system (fig. 3) and the training system are the same, but structures such as data migration and category word embedding are not needed, and the test set is transmitted to the feature extraction system to obtain vector representation of each sample of the test set. And during retrieval, calculating the distance between the sample to be retrieved and all samples in other modes, wherein the distance smaller than a specified threshold value is a retrieval result.
A training system:
as shown in fig. 1, the three modules (the unlabeled data clustering module, the data migration module, and the co-expression learning module) are combined to form a training system. The specific training steps are as follows:
1. Image source domain preprocessing: for each image in the unlabeled image set, image features are extracted with a pre-trained VGG network; k_i images are selected as initial cluster centers, each image is assigned to the nearest cluster center, the mean of each cluster is updated as the new cluster center, and this is iterated until the clusters no longer change or the maximum number of iterations is reached. All samples of the same cluster are classified into one class and given the same label l_i (l_i between 0 and k_i-1) for constructing the migration dataset. The image path and the pseudo label are stored in the same txt file, one image per line, in the format "image path l_i".
2. Text source domain preprocessing: for each text in the unlabeled text set, features are extracted with Bert; the number of clusters is set to k_t, and the same unsupervised clustering method as for the images is then used to group similar texts into the same cluster and assign them the same label l_t (l_t between 0 and k_t-1). The text path and the pseudo label are stored in the same txt file, one text per line, in the format "text path l_t"; a sketch of this list-file writing is given after the training steps.
3. Cross-modality data set preprocessing: the images and texts correspond to each other one by one across the modal dataset and are input in pairs. The images are stored in a txt document in the format "image path similarity", each line representing an image. The text is firstly converted into a vector, and the vector and the category label are stored in the lmdb file.
4. The network learning rate is fixed, with a base learning rate of 0.01; training runs for 500 iterations, and the network parameters are updated with the stochastic gradient descent (SGD) algorithm.
5. The image source domain, the text source domain, and the cross-modal dataset are fed into the model, and training begins. After the images and texts pass through the migration module and the common space learning module, the representation R of the images and texts in the common space is obtained.
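The list files written in steps 1 and 2 are plain text with one "path label" entry per line. A minimal sketch of this bookkeeping step is shown below; the function and file names are illustrative assumptions, not part of the original disclosure:

```python
def save_pseudo_labeled_list(paths, pseudo_labels, out_txt):
    """Write one 'path label' line per sample, matching the source-domain list
    format described in steps 1 and 2 ('image path l_i' / 'text path l_t')."""
    with open(out_txt, "w", encoding="utf-8") as f:
        for path, label in zip(paths, pseudo_labels):
            f.write(f"{path} {label}\n")

# Illustrative usage with hypothetical variables:
# save_pseudo_labeled_list(image_paths, img_labels, "image_source_domain.txt")
# save_pseudo_labeled_list(text_paths, txt_labels, "text_source_domain.txt")
```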
A feature extraction system:
the inventive feature extraction process block diagram is shown in fig. 3, which is a system that has fewer migration source domains, word embedding vectors for classes, and SoftMax loss functions than a training system, and does not require pairwise input across modal datasets. The feature extraction system firstly extracts feature representation of the image/text, wherein the input mode of the image/text is consistent with the training process, the image/text is sent into a CNN model after learning optimization in the training process, and the response of the last but one full connection layer is taken as the feature representation of the image/text. And after the characteristic representation of the image/text is obtained, cross-modal retrieval is carried out.
Retrieval:
1. transmitting the images and texts of all the test sets into a feature extraction system to obtain feature representations of the images and the texts;
2. Realizing "searching images by text" and "searching text by images": the Euclidean distances between each image and all texts are calculated and ranked, and the several texts closest to the image are the retrieval results; the same applies when retrieving images with a text query.
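A minimal sketch of this retrieval step in the common space, covering both "text searching for pictures" and "picture searching for text", assuming the feature matrices have already been extracted; the function and variable names are illustrative:

```python
import numpy as np

def retrieve(query_feats, gallery_feats, top_k=None, threshold=None):
    """Rank gallery samples of the other modality by Euclidean distance to each
    query in the common space; keep the top_k closest and/or those whose
    distance is below the threshold."""
    results = []
    for q in query_feats:
        dist = np.linalg.norm(gallery_feats - q, axis=1)
        order = np.argsort(dist)
        if threshold is not None:
            order = order[dist[order] < threshold]
        if top_k is not None:
            order = order[:top_k]
        results.append(order)
    return results

# "text searching for pictures": retrieve(text_feats, image_feats, top_k=10)
# "picture searching for text":  retrieve(image_feats, text_feats, top_k=10)
```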

Claims (4)

1. A cross-modal retrieval method based on label-free data migration, wherein the migration comprises monomodal knowledge migration and cross-modal knowledge sharing, and the method comprises the following steps:
inputting a sample to be retrieved into a trained cross-modal data retrieval model to obtain the feature representation of the sample;
calculating the Euclidean distances between each sample to be retrieved and all samples of the other modality and ranking them, wherein the samples of the other modality whose distance is smaller than a specified threshold are the retrieval results;
the training process of the cross-modal data retrieval model is as follows:
(1) collecting unmarked images and unmarked texts;
(2) extracting a feature vector of each image by using a pre-trained VGG network, determining the number of image clustering centers according to the data volume of the unlabeled images, and selecting the unlabeled images with the same number as the number of the image clustering centers as the initial clustering centers, wherein the determining of the number of the image clustering centers comprises the following steps:
setting an initial value range of the image clustering center number according to the data size of the unmarked image, clustering each initial value and recording the error square sum;
drawing a relational graph of the number of the image clustering centers and the sum of squares of errors, and obtaining the number of the image clustering centers based on the slope change in the relational graph; with the increase of the number of clusters, the sample division is more fine, the aggregation degree of each cluster is gradually improved, and the sum of squares of errors is gradually reduced; when the number of the image clustering centers is less than the optimal clustering number, the increase of the number of the image clustering centers can increase the aggregation degree of each cluster, and the reduction range of the error square sum is large; when the number of the image clustering centers reaches the optimal clustering number and the number of the image clustering centers is increased, the descending amplitude of the error square sum is suddenly reduced, and the slope tends to be gentle as the number of the image clustering centers is continuously increased;
(3) traversing all the unmarked images, distributing each unmarked image to the nearest cluster center, updating the mean value of each cluster as a new cluster center, and iterating for multiple times until each cluster is not changed any more or the maximum iteration times is reached;
(4) classifying all the unmarked images of the same cluster into one class and setting the same label, thereby obtaining the unmarked images with the pseudo labels;
(5) extracting the features of each unlabeled text by using Bert, and performing unsupervised clustering on the features to obtain unlabeled texts with pseudo labels;
(6) respectively transferring the knowledge contained in the pseudo-labeled unlabeled images and texts to the image and text parts of the cross-modal dataset to generate separate expressions of the images and texts of the cross-modal dataset, wherein the loss function is Loss_transfer = Loss_img + Loss_txt + Loss_cross-modal;
the knowledge migration loss of the image modality is
Loss_img = MMD^2(X_i, Y_i) = || (1/m) Σ_{p=1}^{m} f(y_p) - (1/n) Σ_{q=1}^{n} f(x_q) ||_H^2
where f(·) maps the data into the reproducing kernel Hilbert space and ||·||_H represents the distance measured in that space; X_i is the distribution of the image target domain, Y_i is the distribution of the image source domain, k is the number of cluster centers, m is the number of samples of the source domain data, and n is the number of samples of the target domain data;
the method for realizing the knowledge migration of the image modality comprises: firstly, transmitting the images of the source domain and the target domain into a network, passing through the first five convolutional layers of the AlexNet network, and then adding three fully connected layers, wherein the loss function of the source domain is the SoftMax loss; the knowledge migration of the image modality is realized by minimizing the MMD loss function between the source domain and the target domain;
the knowledge migration loss of the text modality is
Loss_txt = MMD^2(X_t, Y_t) = || (1/m) Σ_{p=1}^{m} f(y_p) - (1/n) Σ_{q=1}^{n} f(x_q) ||_H^2
where X_t is the distribution of the text target domain and Y_t is the distribution of the text source domain;
the method for realizing knowledge migration in the text mode comprises the following steps: respectively extracting text characteristic vectors of a source domain and a target domain by using Bert, and then passing through three full-connection layers, wherein a loss function of the source domain is SoftMax loss, and a loss function of migration is MMD loss;
loss function across modal knowledge
Loss_cross-modal = Σ_{l ∈ {l6, l7}} (1/n_l) Σ_{p=1}^{n_l} || g_l(i_p) - g_l(t_p) ||_2^2
where l6 and l7 denote the two fully connected layers shared by the image and text branches of the cross-modal dataset, n_l is the number of input image-text pairs, (i_p, t_p) is the p-th image-text pair, and the image and text are mapped into feature vectors using g(·);
(7) transmitting the independent expressions of the images and the texts into the same network, and learning the common expression of the images and the texts in the same semantic space.
2. The cross-modal retrieval method based on markerless data migration of claim 1, wherein the common space learning loss function Loss_common is:
Loss_common = (1/n) Σ_{p=1}^{n} [ f_s(i_p, l_p) + f_s(t_p, l_p) ]
where f_s is the SoftMax loss function, (i_p, t_p) is the p-th related image-text pair input, l_p is the category label of the image-text pair, and n is the number of image-text pairs.
3. The cross-modal retrieval method based on unmarked data migration as claimed in claim 1, wherein the threshold is determined as follows: the value of the cross-modal knowledge loss function Loss_cross-modal during training is the distance between paired image and text features; 10-20 initial thresholds are set according to the Loss_cross-modal values, the retrieval mAP value under each threshold is calculated, and the threshold with the maximum mAP value is the retrieval threshold.
4. A cross-modal retrieval system based on markerless data migration, the migration comprising unimodal knowledge migration and cross-modal knowledge sharing, the system comprising: a label-free data clustering module, a data migration module and a common space learning module;
the unmarked data clustering module is used for constructing a migration dataset and providing it as the migration source domain of the data migration module, and comprises the following steps:
collecting unmarked images and unmarked texts;
extracting a feature vector of each image by using a pre-trained VGG network, determining the number of image clustering centers according to the data volume of the unlabeled images, and selecting the unlabeled images with the same number as the number of the image clustering centers as initial clustering centers; wherein the determining the number of image clustering centers comprises:
setting an initial value range of the image clustering center number according to the data size of the unmarked image, clustering each initial value and recording the error square sum;
drawing a relational graph of the number of the image clustering centers and the sum of squares of errors, and obtaining the number of the image clustering centers based on the slope change in the relational graph; with the increase of the number of clusters, the sample division is more fine, the aggregation degree of each cluster is gradually improved, and the sum of squares of errors is gradually reduced; when the number of the image clustering centers is less than the optimal clustering number, the increase of the number of the image clustering centers can increase the aggregation degree of each cluster, and the reduction range of the error square sum is large; when the number of the image clustering centers reaches the optimal clustering number and the number of the image clustering centers is increased, the descending amplitude of the error square sum is suddenly reduced, and the slope tends to be gentle as the number of the image clustering centers is continuously increased;
traversing all the unmarked images, distributing each unmarked image to the nearest cluster center, updating the mean value of each cluster as a new cluster center, and iterating for multiple times until each cluster is not changed any more or the maximum iteration times is reached;
classifying all the unmarked images of the same cluster into one class and setting the same label, thereby obtaining the unmarked images with the pseudo labels;
extracting the features of each unlabeled text, and performing unsupervised clustering on the features to obtain unlabeled texts with pseudo labels;
the data migration module is used for migrating the knowledge contained in the pseudo-labeled images and texts to the image and text parts of the cross-modal dataset; the common space learning module learns a unified representation of the images and texts obtained by the data migration module and establishes the similarity measurement basis of the cross-modal data, thereby realizing cross-modal retrieval; wherein the loss function of the data migration module is Loss_transfer = Loss_img + Loss_txt + Loss_cross-modal;
Knowledge migration loss for image modalities
Loss_img = MMD^2(X_i, Y_i) = || (1/m) Σ_{p=1}^{m} f(y_p) - (1/n) Σ_{q=1}^{n} f(x_q) ||_H^2
where f(·) maps the data into the reproducing kernel Hilbert space and ||·||_H represents the distance measured in that space; X_i is the distribution of the image target domain, Y_i is the distribution of the image source domain, k is the number of cluster centers, m is the number of samples of the source domain data, and n is the number of samples of the target domain data;
the method for realizing knowledge migration of the image modality comprises the following steps: firstly, transmitting images of a source domain and a target domain into a network, passing through the first five convolutional layers of the AlexNet network, and then adding three full-connection layers, wherein the loss function of the source domain is SoftMax loss; the knowledge transfer of the image modality is realized by minimizing the loss function MMD of the source domain and the target domain;
knowledge migration loss for text modalities
Loss_txt = MMD^2(X_t, Y_t) = || (1/m) Σ_{p=1}^{m} f(y_p) - (1/n) Σ_{q=1}^{n} f(x_q) ||_H^2
where X_t is the distribution of the text target domain and Y_t is the distribution of the text source domain;
the method for realizing the knowledge migration of the text modality comprises the following steps: respectively extracting text characteristic vectors of a source domain and a target domain by using Bert, and then passing through three full-connection layers, wherein a loss function of the source domain is SoftMax loss, and a loss function of migration is MMD loss;
loss function across modal knowledge
Loss_cross-modal = Σ_{l ∈ {l6, l7}} (1/n_l) Σ_{p=1}^{n_l} || g_l(i_p) - g_l(t_p) ||_2^2
where l6 and l7 denote the two fully connected layers shared by the image and text branches of the cross-modal dataset, n_l is the number of input image-text pairs, (i_p, t_p) is the p-th image-text pair, and the image and text are mapped into feature vectors using g(·);
and the common space learning module is used for transmitting the independent expressions of the images and the texts into the same network and learning the common expression of the images and the texts in the same semantic space.
CN201910707010.1A 2019-08-01 2019-08-01 Cross-modal retrieval method and system based on unmarked data migration Active CN110647904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910707010.1A CN110647904B (en) 2019-08-01 2019-08-01 Cross-modal retrieval method and system based on unmarked data migration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910707010.1A CN110647904B (en) 2019-08-01 2019-08-01 Cross-modal retrieval method and system based on unmarked data migration

Publications (2)

Publication Number Publication Date
CN110647904A CN110647904A (en) 2020-01-03
CN110647904B true CN110647904B (en) 2022-09-23

Family

ID=68989992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910707010.1A Active CN110647904B (en) 2019-08-01 2019-08-01 Cross-modal retrieval method and system based on unmarked data migration

Country Status (1)

Country Link
CN (1) CN110647904B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353076B (en) * 2020-02-21 2023-10-10 华为云计算技术有限公司 Method for training cross-modal retrieval model, cross-modal retrieval method and related device
CN111898663B (en) * 2020-07-20 2022-05-13 武汉大学 Cross-modal remote sensing image matching method based on transfer learning
CN112016523B (en) * 2020-09-25 2023-08-29 北京百度网讯科技有限公司 Cross-modal face recognition method, device, equipment and storage medium
CN112732956A (en) * 2020-12-24 2021-04-30 江苏智水智能科技有限责任公司 Efficient query method based on perception multi-mode big data
CN112669331B (en) * 2020-12-25 2023-04-18 上海交通大学 Target data migration iterative learning method and target data migration iterative learning system
CN113515657B (en) * 2021-07-06 2022-06-14 天津大学 Cross-modal multi-view target retrieval method and device
CN114120074B (en) * 2021-11-05 2023-12-12 北京百度网讯科技有限公司 Training method and training device for image recognition model based on semantic enhancement
CN116777896B (en) * 2023-07-07 2024-03-19 浙江大学 Negative migration inhibition method for cross-domain classification and identification of apparent defects
CN117636100B (en) * 2024-01-25 2024-04-30 北京航空航天大学杭州创新研究院 Pre-training task model adjustment processing method and device, electronic equipment and medium


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102881019A (en) * 2012-10-08 2013-01-16 江南大学 Fuzzy clustering image segmenting method with transfer learning function
CN103020122A (en) * 2012-11-16 2013-04-03 哈尔滨工程大学 Transfer learning method based on semi-supervised clustering
CN107220337A (en) * 2017-05-25 2017-09-29 北京大学 A kind of cross-media retrieval method based on mixing migration network
CN107273517A (en) * 2017-06-21 2017-10-20 复旦大学 Picture and text cross-module state search method based on the embedded study of figure
CN108460134A (en) * 2018-03-06 2018-08-28 云南大学 The text subject disaggregated model and sorting technique of transfer learning are integrated based on multi-source domain
CN109784405A (en) * 2019-01-16 2019-05-21 山东建筑大学 Cross-module state search method and system based on pseudo label study and semantic consistency

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Cross-modal Common Representation Learning by Hybrid Transfer Network";Xin Huang et al.;《arXiv》;20170624;第1-8页 *
"基于域与样例平衡的多源迁移学习方法";季鼎承 等;《电子学报》;20190331;第47卷(第3期);第692-699页 *
"基于迁移学习的图像检索算法";李晓雨 等;《计算机科学》;20190131;第46卷(第1期);第73-77页 *
"混合迁移学习方法在医学图像检索中的应用";贾刚 等;《哈尔滨工程大学学报》;20150731;第36卷(第7期);第938-942页 *

Also Published As

Publication number Publication date
CN110647904A (en) 2020-01-03

Similar Documents

Publication Publication Date Title
CN110647904B (en) Cross-modal retrieval method and system based on unmarked data migration
CN109918532B (en) Image retrieval method, device, equipment and computer readable storage medium
CN107273517B (en) Graph-text cross-modal retrieval method based on graph embedding learning
CN106202256B (en) Web image retrieval method based on semantic propagation and mixed multi-instance learning
JP5749279B2 (en) Join embedding for item association
CN112905822B (en) Deep supervision cross-modal counterwork learning method based on attention mechanism
WO2019015246A1 (en) Image feature acquisition
CN107220337B (en) Cross-media retrieval method based on hybrid migration network
CN106095829A (en) Cross-media retrieval method based on degree of depth study with the study of concordance expression of space
CN110941734B (en) Depth unsupervised image retrieval method based on sparse graph structure
CN109902714B (en) Multi-modal medical image retrieval method based on multi-graph regularization depth hashing
CN110309268A (en) A kind of cross-language information retrieval method based on concept map
CN109829065B (en) Image retrieval method, device, equipment and computer readable storage medium
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN112487822A (en) Cross-modal retrieval method based on deep learning
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN115309930A (en) Cross-modal retrieval method and system based on semantic identification
CN116610831A (en) Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system
CN109635004B (en) Object description providing method, device and equipment of database
CN114118310A (en) Clustering method and device based on comprehensive similarity
CN117273134A (en) Zero-sample knowledge graph completion method based on pre-training language model
CN111339258A (en) University computer basic exercise recommendation method based on knowledge graph
CN113779287B (en) Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
Su et al. Deep supervised hashing with hard example pairs optimization for image retrieval

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant