CN110647904B - Cross-modal retrieval method and system based on unmarked data migration - Google Patents

Cross-modal retrieval method and system based on unmarked data migration

Info

Publication number
CN110647904B
CN110647904B (application CN201910707010.1A)
Authority
CN
China
Prior art keywords
image
text
data
modal
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910707010.1A
Other languages
Chinese (zh)
Other versions
CN110647904A (en)
Inventor
朱福庆
王雪如
张卫博
戴娇
虎嵩林
韩冀中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201910707010.1A priority Critical patent/CN110647904B/en
Publication of CN110647904A publication Critical patent/CN110647904A/en
Application granted granted Critical
Publication of CN110647904B publication Critical patent/CN110647904B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23211Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a cross-modal retrieval method and system based on label-free data migration. The invention effectively alleviates the problem that cross-modal datasets are small in scale and better matches real scenarios in which user queries fall outside the predefined category range; at the same time, it can better extract the high-level semantic information of data from different modalities, overcome the heterogeneity gap between modalities, increase the similarity between modalities, and improve the accuracy of cross-modal retrieval.

Description

Cross-modal retrieval method and system based on unmarked data migration
Technical Field
The invention relates to the technical field of cross-modal data retrieval, in particular to a cross-modal retrieval method and a cross-modal retrieval system based on unmarked data migration.
Background
Data of different modalities, such as images and texts, are ubiquitous on the internet and show a trend of fusing with one another. The cross-modal retrieval task tries to break the boundary between data of different modalities and realize information retrieval across them: given a sample of one modality, it retrieves samples of other modalities that are semantically similar to it. Cross-modal retrieval is widely used in search engines and big data management. Existing cross-modal retrieval methods map the feature representations of different modalities into a common space to learn a unified representation and measure similarity by the distance between the corresponding unified representations. However, because of the heterogeneity of different modalities, their data distributions and representations are inconsistent, semantic association is difficult to establish, and cross-modal similarity remains hard to measure.
Although the internet contains a large amount of image and text data, most of it is unlabeled and difficult to use, even though it contains rich semantic information. On the one hand, data annotation is expensive; on the other hand, internet information is constantly updated, and every new trending event is accompanied by a large amount of image and text data of new categories, so it is impossible to annotate data of all categories. How to make full use of unlabeled data is therefore a major challenge for the traditional cross-modal retrieval task.
In real scenarios, the queries submitted by users often fall outside the predefined category range, and the training set and the test set sometimes do not share the same categories. Existing cross-modal retrieval methods generally only handle the case where the training data and the test data belong to the same categories (non-extensible cross-modal retrieval). How to better construct a cross-modal common space so that, given a query of one modality, the related multi-modal data can be retrieved whether its category is known or unknown, is of great significance in practical applications.
Disclosure of Invention
In order to solve problems such as the heterogeneity of data from different modalities, the abundance of unusable unlabeled data, insufficient training data, and non-extensibility, the invention provides a cross-modal retrieval method and system based on unmarked data migration.
The technical scheme of the invention is as follows:
a cross-mode retrieval method based on unmarked data migration comprises the following steps:
inputting a sample to be retrieved into a trained cross-modal data retrieval model to obtain the feature representation of the sample;
calculating the Euclidean distances between each sample to be retrieved and all samples of the other modality and ranking them, wherein the samples of the other modality whose distance is smaller than a specified threshold are the retrieval results;
the training process of the cross-modal data retrieval model is as follows:
(1) setting pseudo labels for the unmarked images and the texts respectively by a clustering method;
(2) respectively transferring knowledge contained in unmarked images and texts with pseudo labels to image and text parts of a cross-modal data set, and learning the independent expression of the images and texts of the cross-modal data set;
(3) transmitting the independent expressions of the images and the texts into the same network, and learning the common expression of the images and the texts in the same semantic space.
Further, the threshold is determined as follows: the cross-modal knowledge loss Loss_cross-modal computed during training is the distance between paired image and text features. According to the Loss_cross-modal values, 10-20 initial thresholds are set and the retrieval mAP (mean Average Precision) is calculated under each threshold; mAP measures the quality of the learned model over all queries, i.e. the average of all AP values, where AP (Average Precision) measures the quality of the learned model on a single query. The threshold with the maximum mAP value is taken as the retrieval threshold. The loss function of cross-modal knowledge, Loss_cross-modal, is:

Loss_cross-modal = Σ_{l ∈ {l6, l7}} (1/n_l) Σ_{p=1}^{n_l} || g_l(i_p) - g_l(t_p) ||_2^2

where l6 and l7 denote the two fully connected layers shared by the image and text branches of the cross-modal dataset, n_l is the number of input image-text pairs, (i_p, t_p) is the p-th image-text pair, and the image and text are mapped into feature vectors using g(·).
A cross-modal retrieval system based on unmarked data migration, comprising:
The system comprises a label-free data clustering module, a data migration module and a common space learning module; a migration dataset is constructed by the label-free data clustering module and used as the migration source domain of the data migration module, and finally the common space learning module learns a unified representation of the images and texts obtained by the data migration module and establishes the similarity measurement basis of the cross-modal data, thereby realizing cross-modal retrieval.
Further, the label-free data clustering module comprises an image clustering submodule and a text clustering submodule. The module extracts the features of all unlabeled images/texts and then performs unsupervised clustering to obtain a series of cluster centers; the image/text samples under the same cluster center are classified into one class and assigned the same label, which completes the construction of the migration dataset.
Further, the data migration module comprises an image migration submodule and a text migration submodule, and migration only occurs within the same submodule. For each submodule, the migration source domain is the clustered unlabeled data of the corresponding modality, and the target domain is the data of the corresponding modality in the cross-modal dataset. Transfer learning is achieved by minimizing the distribution loss between the source domain and the target domain. The samples of the cross-modal dataset are input in pairs and belong to the same category, so the representations finally generated for them should be similar: by minimizing the pairwise Euclidean distance between the two modalities, images and texts with the same semantic information are drawn as close as possible and those with different semantics are pushed as far apart as possible, independently of modality.
Furthermore, the common space learning module feeds the separate representations of the images and texts obtained by the data migration module into the same network to learn a unified representation of the data of different modalities. The network comprises several shared fully connected layers, and the word embedding vectors of the cross-modal dataset categories are added to the network, which increases the semantic association between different modalities and further strengthens the semantic information.
The method has the beneficial effects that:
according to the method, a large number of unmarked monomodal data sets are clustered and are distributed with the pseudo labels, and the clustered unmarked data are transferred to the cross-modal data set, so that the problem of small data scale of the cross-modal data set is well solved, and the method is more suitable for the condition that the actual user query is not in the predefined category range. By the method, the upper-layer semantic information of data in different modes can be better extracted, the heterogeneity difference between the modes is overcome, the similarity between the modes is increased, and the accuracy of cross-mode retrieval is improved. The method achieves good effects in both public data sets and practical applications.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a data migration flow diagram;
FIG. 3 is a flow diagram of a feature extraction system.
Detailed Description
This section mainly describes the modeling of the cross-modal retrieval network based on transfer learning, unlabeled data clustering, data migration, common representation learning, and the testing process.
The method will be further described with reference to the accompanying drawings.
Modeling of a cross-modal retrieval network based on transfer learning:
Unlabeled data clustering: given an unlabeled dataset S, the image clustering algorithm C_i clusters the unlabeled images S_i into k_i classes and the text clustering algorithm C_t clusters the unlabeled texts S_t into k_t classes; within each class, all images and texts belonging to the same cluster center are marked with the same pseudo label y_i. The clustered unlabeled dataset S is then migrated to the cross-modal dataset D using a transfer learning algorithm T, and joint training generates the separate vector representations R_i, R_t of the images and texts of the cross-modal dataset. Finally, the separate representations R_i, R_t and the category word embedding vector V are fed into the same fully connected network F, generating the common representation R of images and texts in the same space. Wherein:
Unlabeled dataset S = {S_i, S_t}: the source domain of transfer learning, where S_i is the unlabeled image dataset and S_t is the unlabeled text dataset.
Cross-modal dataset D = {D_i, D_t}: D_i and D_t are the images and texts of the cross-modal dataset; the images and texts are input in pairs and are correlated, i.e. for each image/text pair the image and text come from the same article, or the text is a description of the image.
Word embedding vector V: all known categories of the cross-modal dataset are converted into 300-dimensional word vectors by the Word2vec model.
Text input: a text is a description of an image and may be an article, a paragraph, a sentence, a word, etc. Text vectors are extracted using Bert and have 768 dimensions.
Image input: in this network, the input images are 224 x 224 images.
Clustering algorithm C = {C_i, C_t}: C_i is the image clustering algorithm and C_t is the text clustering algorithm.
Numbers of clusters k_i, k_t: determined empirically through repeated runs.
Migration algorithm T: an algorithm that acquires knowledge from a source domain to improve the target task, where the source domain differs from the target domain or the source task differs from the target task.
Common representation vector R: the finally generated vector representation of the images and texts.
A label-free data clustering module:
For the unlabeled images, which contain rich semantic information, a pre-trained VGG network is first used to extract a feature vector for each image, and the images are then clustered with the KMeans method. The specific procedure is as follows: according to the number and distribution of the unlabeled images, the initial number of cluster centers (i.e. k_i) is set and k_i images are randomly selected as the initial cluster centers. All images are traversed, each image is assigned to the nearest cluster center, the mean of each cluster is updated as the new cluster center, and this is iterated until the clusters no longer change or the maximum number of iterations is reached. All samples of the same cluster are classified into one class and given the same label, which forms the source domain dataset for image migration.
For the unlabeled texts, Bert is first used to extract the features of each text; the same unsupervised clustering method as for the images is then applied to group similar texts into the same cluster and assign them the same label, which forms the source domain dataset for text migration.
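A minimal sketch of this pseudo-label construction is given below, assuming the VGG image features and Bert text features have already been extracted into arrays; the helper name and variable names are illustrative assumptions only:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_pseudo_labeled_source(features, k, max_iter=300, seed=0):
    """Cluster unlabeled image (VGG) or text (Bert) features into k groups and
    use the cluster index of each sample as its pseudo label."""
    km = KMeans(n_clusters=k, max_iter=max_iter, random_state=seed)
    pseudo_labels = km.fit_predict(features)      # one pseudo label per sample
    return pseudo_labels, km.cluster_centers_

# Illustrative usage with hypothetical arrays:
# image_feats: (N_img, 4096) VGG features; text_feats: (N_txt, 768) Bert vectors
# img_labels, _ = build_pseudo_labeled_source(image_feats, k_i)
# txt_labels, _ = build_pseudo_labeled_source(text_feats, k_t)
```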
Method for determining a suitable number of cluster centers: according to the size of the unlabeled data, initial values of k are set in the range 5-15; clustering is run for each value of k and the corresponding SSE is recorded (sum of squared errors, i.e. the sum of the squared distances between each sample point and its cluster center). As the number of clusters increases, the samples are divided more finely, the cohesion of each cluster gradually improves, and the SSE gradually decreases. When k is smaller than the optimal number of clusters, increasing k greatly improves the cohesion of each cluster, so the SSE drops sharply; once k reaches the optimal number of clusters, the return from further increasing k diminishes rapidly, so the drop in SSE levels off as k continues to grow. The relationship between k and SSE is plotted, and the point where the slope changes is the optimal value of k.
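The elbow criterion described above can be computed from the KMeans inertia (the SSE) for each candidate k, as in the following sketch; the helper name is an illustrative assumption:

```python
from sklearn.cluster import KMeans

def sse_curve(features, k_values):
    """Compute the sum of squared errors (KMeans inertia) for each candidate k;
    the 'elbow' where the SSE drop flattens is taken as the cluster count."""
    curve = {}
    for k in k_values:
        km = KMeans(n_clusters=k, random_state=0).fit(features)
        curve[k] = km.inertia_   # sum of squared distances to the closest center
    return curve

# e.g. candidates 5..15 as in the text; plot k vs. SSE and pick the elbow:
# curve = sse_curve(image_feats, range(5, 16))
```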
A data migration module:
the data migration module comprises two parts, namely monomodal knowledge migration and cross-modal knowledge sharing.
Single-modality migration refers to migrating the clustered unlabeled images to the image part of the cross-modal dataset and migrating the clustered unlabeled texts to the text part of the cross-modal dataset. The module therefore comprises two single-modality migration submodules, one for images and one for texts.
Referring to fig. 2, for image migration, the migration source domain is the clustered unlabeled images and the target domain is the image part of the cross-modal data. The images of the source domain and the target domain are first fed into the network, pass through the first five convolutional layers of the AlexNet network, and are followed by three fully connected layers fc6, fc7 and fc8, where the loss function on the source domain is the SoftMax loss. Knowledge migration for the image modality is achieved by minimizing the MMD (Maximum Mean Discrepancy) loss between the source domain and the target domain, which measures the difference between two different but related distributions. Let the distribution of the image target domain be X_i and the distribution of the image source domain be Y_i; the migration loss of the image modality is:

Loss_img = MMD^2(X_i, Y_i) = || (1/m) Σ_{p=1}^{m} f(y_p) - (1/n) Σ_{q=1}^{n} f(x_q) ||_H^2

where f(·) maps the data into a reproducing kernel Hilbert space (RKHS) and ||·||_H denotes the distance measured in that space, y_p are samples of the source domain, x_q are samples of the target domain, m is the number of samples of the source domain data, and n is the number of samples of the target domain data.
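The MMD term can be estimated from mini-batches of source and target features. The sketch below uses a Gaussian-kernel estimate of squared MMD, which is one common way of measuring the distance between mean embeddings in an RKHS; the kernel choice, the bandwidth `sigma`, and the function names are assumptions for illustration rather than the exact form used in the patent:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a (m, d) and b (n, d)."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2(source, target, sigma=1.0):
    """Biased estimate of squared MMD between source and target feature batches,
    i.e. the squared distance of their mean embeddings in the RKHS induced by
    the Gaussian kernel."""
    k_ss = rbf_kernel(source, source, sigma).mean()
    k_tt = rbf_kernel(target, target, sigma).mean()
    k_st = rbf_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st

# loss_img = mmd2(fc_feats_source_images, fc_feats_target_images)
# loss_txt = mmd2(fc_feats_source_texts, fc_feats_target_texts)
```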
The text migration process is similar to image migration: the migration source domain is the clustered unlabeled texts and the target domain is the text part of the cross-modal data. Text feature vectors of the source and target domains are extracted with the NLP model Bert released by GOOGLE and then passed through three fully connected layers fc6, fc7 and fc8, where the loss function on the source domain is the SoftMax loss and the migration loss is the MMD loss. Let the distribution of the text target domain be X_t and the distribution of the text source domain be Y_t; the migration loss of the text modality is:

Loss_txt = MMD^2(X_t, Y_t) = || (1/m) Σ_{p=1}^{m} f(y_p) - (1/n) Σ_{q=1}^{n} f(x_q) ||_H^2
the purpose of setting the cross-modal knowledge sharing layer is to fully utilize similar semantic information among different modalities, overcome the heterogeneity difference among the modalities, and no matter which modality the data comes from, as long as the data contains the same semantic information, the data should have similar feature vectors, contain different semantic information, and the distance of the feature vectors should be far away. The similarity of vectors is measured using Euclidean distances (fc6-img/fc6-txt and fc7-img/fc7-txt), which should be as small as possible for each pair of similar images and text that are input. The loss function across modal knowledge is:
Figure GDA0003391471780000063
where l6, l7 refers to two fully connected layers connected across the modal dataset image text, nl refers to the logarithm of the incoming image and text,
Figure GDA0003391471780000064
for the p-th image-text pair, the image and text are mapped into feature vectors using g ().
After passing through the two monomodal knowledge migration modules and the cross-modal knowledge sharing module, the model makes full use of unmarked data, has stronger semantic discrimination capability, and generates a separate representation for each sample in the cross-modal data set.
The final loss function of the migration module is:
Loss_transfer = Loss_img + Loss_txt + Loss_cross-modal
a common space learning module:
the cross-modal target domain internal semantic association also provides key semantic information for the cross-modal common space construction, and in order to further enhance the semantic correlation of image and text features, a common space learning module is further designed to enhance the correlation. The module is a simple and efficient structure comprising two fully connected layers and a common classification layer. Word embedding (word embedding) vectors of image features, text features and categories are introduced into the module, since the parameters of fc8, fc9 are shared by two modalities, so that semantic relevance of different modalities can be guaranteed with supervisory information in the cross-modality target domain. Considering the labels of two paired modalities in the target domain, the correlation penalty is:
Figure GDA0003391471780000071
wherein f is s In order to be a function of the SoftMax loss,
Figure GDA0003391471780000072
for the p-th relevant image-text pair input,/ p A category label for the image text pair.
The migration module and the common space learning module form a unified network structure; the two modules are trained together and reinforce each other. The overall network loss is therefore:

Loss = Loss_transfer + Loss_common
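The overall objective can then be assembled from the individual terms. The sketch below combines the two MMD losses, the cross-modal pair loss, and a SoftMax (cross-entropy) classification loss applied to both modalities of each pair; the helper names and the plain-numpy formulation are assumptions for illustration:

```python
import numpy as np

def softmax_ce(logits, labels):
    """Mean SoftMax cross-entropy of (n, num_classes) logits against integer labels."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(log_p[np.arange(len(labels)), labels])

def total_loss(loss_img, loss_txt, loss_cross_modal, img_logits, txt_logits, labels):
    """Loss = Loss_transfer + Loss_common: Loss_transfer sums the two
    single-modality MMD losses and the cross-modal pair loss, and Loss_common
    applies the shared SoftMax classifier to both modalities of each pair."""
    loss_transfer = loss_img + loss_txt + loss_cross_modal
    loss_common = softmax_ce(img_logits, labels) + softmax_ce(txt_logits, labels)
    return loss_transfer + loss_common
```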
the embodiment is as follows:
the invention comprises a training system, a feature extraction system and a retrieval three parts: the three modules are combined to form the overall structure (figure 1) of the invention, and training data are transmitted into a training system for training and are stored to obtain a training model. The parameters of the feature extraction system (fig. 3) and the training system are the same, but structures such as data migration and category word embedding are not needed, and the test set is transmitted to the feature extraction system to obtain vector representation of each sample of the test set. And during retrieval, calculating the distance between the sample to be retrieved and all samples in other modes, wherein the distance smaller than a specified threshold value is a retrieval result.
A training system:
as shown in fig. 1, the three modules (the unlabeled data clustering module, the data migration module, and the co-expression learning module) are combined to form a training system. The specific training steps are as follows:
1. Image source domain preprocessing: for each image in the unlabeled image set, image features are extracted with a pre-trained VGG network; k_i images are selected as initial cluster centers, each image is assigned to the nearest cluster center, the mean of each cluster is updated as the new cluster center, and this is iterated until the clusters no longer change or the maximum number of iterations is reached. All samples of the same cluster are classified into one class and given the same label l_i (l_i between 0 and k_i-1) for constructing the migration dataset. The image path and the pseudo label are stored in the same txt file, one image per line, in the format "image path l_i".
2. Text source domain preprocessing: for each text in the unlabeled text set, features are extracted with Bert; the number of clusters is set to k_t, and the same unsupervised clustering method as for the images is then used to group similar texts into the same cluster and assign them the same label l_t (l_t between 0 and k_t-1). The text path and the pseudo label are stored in the same txt file, one text per line, in the format "text path l_t"; a sketch of this list-file writing is given after the training steps.
3. Cross-modality data set preprocessing: the images and texts correspond to each other one by one across the modal dataset and are input in pairs. The images are stored in a txt document in the format "image path similarity", each line representing an image. The text is firstly converted into a vector, and the vector and the category label are stored in the lmdb file.
4. The network learning rate is fixed, with a base learning rate of 0.01; training runs for 500 iterations, and the network parameters are updated with the stochastic gradient descent (SGD) algorithm.
5. The image source domain, the text source domain, and the cross-modal dataset are fed into the model, and training begins. After the images and texts pass through the migration module and the common space learning module, the representation R of the images and texts in the common space is obtained.
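The list files written in steps 1 and 2 are plain text with one "path label" entry per line. A minimal sketch of this bookkeeping step is shown below; the function and file names are illustrative assumptions, not part of the original disclosure:

```python
def save_pseudo_labeled_list(paths, pseudo_labels, out_txt):
    """Write one 'path label' line per sample, matching the source-domain list
    format described in steps 1 and 2 ('image path l_i' / 'text path l_t')."""
    with open(out_txt, "w", encoding="utf-8") as f:
        for path, label in zip(paths, pseudo_labels):
            f.write(f"{path} {label}\n")

# Illustrative usage with hypothetical variables:
# save_pseudo_labeled_list(image_paths, img_labels, "image_source_domain.txt")
# save_pseudo_labeled_list(text_paths, txt_labels, "text_source_domain.txt")
```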
A feature extraction system:
the inventive feature extraction process block diagram is shown in fig. 3, which is a system that has fewer migration source domains, word embedding vectors for classes, and SoftMax loss functions than a training system, and does not require pairwise input across modal datasets. The feature extraction system firstly extracts feature representation of the image/text, wherein the input mode of the image/text is consistent with the training process, the image/text is sent into a CNN model after learning optimization in the training process, and the response of the last but one full connection layer is taken as the feature representation of the image/text. And after the characteristic representation of the image/text is obtained, cross-modal retrieval is carried out.
Retrieval:
1. transmitting the images and texts of all the test sets into a feature extraction system to obtain feature representations of the images and the texts;
2. Realizing "searching images by text" and "searching text by images": the Euclidean distances between each image and all texts are calculated and ranked, and the several texts closest to the image are the retrieval results; the same applies when retrieving images with a text query.
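A minimal sketch of this retrieval step in the common space, covering both "text searching for pictures" and "picture searching for text", assuming the feature matrices have already been extracted; the function and variable names are illustrative:

```python
import numpy as np

def retrieve(query_feats, gallery_feats, top_k=None, threshold=None):
    """Rank gallery samples of the other modality by Euclidean distance to each
    query in the common space; keep the top_k closest and/or those whose
    distance is below the threshold."""
    results = []
    for q in query_feats:
        dist = np.linalg.norm(gallery_feats - q, axis=1)
        order = np.argsort(dist)
        if threshold is not None:
            order = order[dist[order] < threshold]
        if top_k is not None:
            order = order[:top_k]
        results.append(order)
    return results

# "text searching for pictures": retrieve(text_feats, image_feats, top_k=10)
# "picture searching for text":  retrieve(image_feats, text_feats, top_k=10)
```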

Claims (4)

1. A cross-modal retrieval method based on label-free data migration, wherein the migration comprises monomodal knowledge migration and cross-modal knowledge sharing, and the method comprises the following steps:
inputting a sample to be retrieved into a trained cross-modal data retrieval model to obtain the feature representation of the sample;
calculating the Euclidean distances between each sample to be retrieved and all samples of the other modality and ranking them, wherein the samples of the other modality whose distance is smaller than a specified threshold are the retrieval results;
the training process of the cross-modal data retrieval model is as follows:
(1) collecting unmarked images and unmarked texts;
(2) extracting a feature vector of each image by using a pre-trained VGG network, determining the number of image clustering centers according to the data volume of the unlabeled images, and selecting the unlabeled images with the same number as the number of the image clustering centers as the initial clustering centers, wherein the determining of the number of the image clustering centers comprises the following steps:
setting an initial value range of the image clustering center number according to the data size of the unmarked image, clustering each initial value and recording the error square sum;
drawing a relational graph of the number of the image clustering centers and the sum of squares of errors, and obtaining the number of the image clustering centers based on the slope change in the relational graph; with the increase of the number of clusters, the sample division is more fine, the aggregation degree of each cluster is gradually improved, and the sum of squares of errors is gradually reduced; when the number of the image clustering centers is less than the optimal clustering number, the increase of the number of the image clustering centers can increase the aggregation degree of each cluster, and the reduction range of the error square sum is large; when the number of the image clustering centers reaches the optimal clustering number and the number of the image clustering centers is increased, the descending amplitude of the error square sum is suddenly reduced, and the slope tends to be gentle as the number of the image clustering centers is continuously increased;
(3) traversing all the unmarked images, distributing each unmarked image to the nearest cluster center, updating the mean value of each cluster as a new cluster center, and iterating for multiple times until each cluster is not changed any more or the maximum iteration times is reached;
(4) classifying all the unmarked images of the same cluster into one class and setting the same label, thereby obtaining the unmarked images with the pseudo labels;
(5) extracting the features of each unlabeled text by using Bert, and performing unsupervised clustering on the features to obtain unlabeled texts with pseudo labels;
(6) respectively transferring the knowledge contained in the pseudo-labeled unlabeled images and texts to the image and text parts of the cross-modal dataset to generate separate expressions of the images and texts of the cross-modal dataset, wherein the loss function is Loss_transfer = Loss_img + Loss_txt + Loss_cross-modal;
the knowledge migration loss of the image modality is
Loss_img = MMD^2(X_i, Y_i) = || (1/m) Σ_{p=1}^{m} f(y_p) - (1/n) Σ_{q=1}^{n} f(x_q) ||_H^2
where f(·) maps the data into the reproducing kernel Hilbert space and ||·||_H represents the distance measured in that space; X_i is the distribution of the image target domain, Y_i is the distribution of the image source domain, k is the number of cluster centers, m is the number of samples of the source domain data, and n is the number of samples of the target domain data;
the method for realizing the knowledge migration of the image modality comprises: firstly, transmitting the images of the source domain and the target domain into a network, passing through the first five convolutional layers of the AlexNet network, and then adding three fully connected layers, wherein the loss function of the source domain is the SoftMax loss; the knowledge migration of the image modality is realized by minimizing the MMD loss function between the source domain and the target domain;
the knowledge migration loss of the text modality is
Loss_txt = MMD^2(X_t, Y_t) = || (1/m) Σ_{p=1}^{m} f(y_p) - (1/n) Σ_{q=1}^{n} f(x_q) ||_H^2
where X_t is the distribution of the text target domain and Y_t is the distribution of the text source domain;
the method for realizing knowledge migration in the text mode comprises the following steps: respectively extracting text characteristic vectors of a source domain and a target domain by using Bert, and then passing through three full-connection layers, wherein a loss function of the source domain is SoftMax loss, and a loss function of migration is MMD loss;
loss function across modal knowledge
Loss_cross-modal = Σ_{l ∈ {l6, l7}} (1/n_l) Σ_{p=1}^{n_l} || g_l(i_p) - g_l(t_p) ||_2^2
where l6 and l7 denote the two fully connected layers shared by the image and text branches of the cross-modal dataset, n_l is the number of input image-text pairs, (i_p, t_p) is the p-th image-text pair, and the image and text are mapped into feature vectors using g(·);
(7) transmitting the independent expressions of the images and the texts into the same network, and learning the common expression of the images and the texts in the same semantic space.
2. The cross-modal retrieval method based on markerless data migration of claim 1, wherein the common space learning loss function Loss_common is:
Loss_common = (1/n) Σ_{p=1}^{n} [ f_s(i_p, l_p) + f_s(t_p, l_p) ]
where f_s is the SoftMax loss function, (i_p, t_p) is the p-th related image-text pair input, l_p is the category label of the image-text pair, and n is the number of image-text pairs.
3. The cross-modal retrieval method based on unmarked data migration as claimed in claim 1, wherein the threshold is determined as follows: the value of the cross-modal knowledge loss function Loss_cross-modal during training is the distance between paired image and text features; 10-20 initial thresholds are set according to the Loss_cross-modal values, the retrieval mAP value under each threshold is calculated, and the threshold with the maximum mAP value is the retrieval threshold.
4. A cross-modal retrieval system based on markerless data migration, the migration comprising unimodal knowledge migration and cross-modal knowledge sharing, the system comprising: a label-free data clustering module, a data migration module and a common space learning module;
the unmarked data clustering module is used for constructing a migration dataset and providing it as the migration source domain of the data migration module, and comprises the following steps:
collecting unmarked images and unmarked texts;
extracting a feature vector of each image by using a pre-trained VGG network, determining the number of image clustering centers according to the data volume of the unlabeled images, and selecting the unlabeled images with the same number as the number of the image clustering centers as initial clustering centers; wherein the determining the number of image clustering centers comprises:
setting an initial value range of the image clustering center number according to the data size of the unmarked image, clustering each initial value and recording the error square sum;
drawing a relational graph of the number of the image clustering centers and the sum of squares of errors, and obtaining the number of the image clustering centers based on the slope change in the relational graph; with the increase of the number of clusters, the sample division is more fine, the aggregation degree of each cluster is gradually improved, and the sum of squares of errors is gradually reduced; when the number of the image clustering centers is less than the optimal clustering number, the increase of the number of the image clustering centers can increase the aggregation degree of each cluster, and the reduction range of the error square sum is large; when the number of the image clustering centers reaches the optimal clustering number and the number of the image clustering centers is increased, the descending amplitude of the error square sum is suddenly reduced, and the slope tends to be gentle as the number of the image clustering centers is continuously increased;
traversing all the unmarked images, distributing each unmarked image to the nearest cluster center, updating the mean value of each cluster as a new cluster center, and iterating for multiple times until each cluster is not changed any more or the maximum iteration times is reached;
classifying all the unmarked images of the same cluster into one class and setting the same label, thereby obtaining the unmarked images with the pseudo labels;
extracting the features of each unlabeled text, and performing unsupervised clustering on the features to obtain unlabeled texts with pseudo labels;
the data migration module is used for migrating the knowledge contained in the pseudo-labeled images and texts to the image and text parts of the cross-modal dataset; the common space learning module learns a unified representation of the images and texts obtained by the data migration module and establishes the similarity measurement basis of the cross-modal data, thereby realizing cross-modal retrieval; wherein the loss function of the data migration module is Loss_transfer = Loss_img + Loss_txt + Loss_cross-modal;
Knowledge migration loss for image modalities
Loss_img = MMD^2(X_i, Y_i) = || (1/m) Σ_{p=1}^{m} f(y_p) - (1/n) Σ_{q=1}^{n} f(x_q) ||_H^2
where f(·) maps the data into the reproducing kernel Hilbert space and ||·||_H represents the distance measured in that space; X_i is the distribution of the image target domain, Y_i is the distribution of the image source domain, k is the number of cluster centers, m is the number of samples of the source domain data, and n is the number of samples of the target domain data;
the method for realizing knowledge migration of the image modality comprises the following steps: firstly, transmitting images of a source domain and a target domain into a network, passing through the first five convolutional layers of the AlexNet network, and then adding three full-connection layers, wherein the loss function of the source domain is SoftMax loss; the knowledge transfer of the image modality is realized by minimizing the loss function MMD of the source domain and the target domain;
knowledge migration loss for text modalities
Loss_txt = MMD^2(X_t, Y_t) = || (1/m) Σ_{p=1}^{m} f(y_p) - (1/n) Σ_{q=1}^{n} f(x_q) ||_H^2
where X_t is the distribution of the text target domain and Y_t is the distribution of the text source domain;
the method for realizing the knowledge migration of the text modality comprises the following steps: respectively extracting text characteristic vectors of a source domain and a target domain by using Bert, and then passing through three full-connection layers, wherein a loss function of the source domain is SoftMax loss, and a loss function of migration is MMD loss;
loss function across modal knowledge
Loss_cross-modal = Σ_{l ∈ {l6, l7}} (1/n_l) Σ_{p=1}^{n_l} || g_l(i_p) - g_l(t_p) ||_2^2
where l6 and l7 denote the two fully connected layers shared by the image and text branches of the cross-modal dataset, n_l is the number of input image-text pairs, (i_p, t_p) is the p-th image-text pair, and the image and text are mapped into feature vectors using g(·);
and the common space learning module is used for transmitting the independent expressions of the images and the texts into the same network and learning the common expression of the images and the texts in the same semantic space.
CN201910707010.1A 2019-08-01 2019-08-01 Cross-modal retrieval method and system based on unmarked data migration Active CN110647904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910707010.1A CN110647904B (en) 2019-08-01 2019-08-01 Cross-modal retrieval method and system based on unmarked data migration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910707010.1A CN110647904B (en) 2019-08-01 2019-08-01 Cross-modal retrieval method and system based on unmarked data migration

Publications (2)

Publication Number Publication Date
CN110647904A CN110647904A (en) 2020-01-03
CN110647904B true CN110647904B (en) 2022-09-23

Family

ID=68989992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910707010.1A Active CN110647904B (en) 2019-08-01 2019-08-01 Cross-modal retrieval method and system based on unmarked data migration

Country Status (1)

Country Link
CN (1) CN110647904B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353076B (en) * 2020-02-21 2023-10-10 华为云计算技术有限公司 Method for training cross-modal retrieval model, cross-modal retrieval method and related device
CN111898663B (en) * 2020-07-20 2022-05-13 武汉大学 Cross-modal remote sensing image matching method based on transfer learning
CN112016523B (en) * 2020-09-25 2023-08-29 北京百度网讯科技有限公司 Cross-modal face recognition method, device, equipment and storage medium
CN112732956A (en) * 2020-12-24 2021-04-30 江苏智水智能科技有限责任公司 Efficient query method based on perception multi-mode big data
CN112669331B (en) * 2020-12-25 2023-04-18 上海交通大学 Target data migration iterative learning method and target data migration iterative learning system
CN113515657B (en) * 2021-07-06 2022-06-14 天津大学 Cross-modal multi-view target retrieval method and device
CN114120074B (en) * 2021-11-05 2023-12-12 北京百度网讯科技有限公司 Training method and training device for image recognition model based on semantic enhancement
CN116777896B (en) * 2023-07-07 2024-03-19 浙江大学 Negative migration inhibition method for cross-domain classification and identification of apparent defects
CN117636100B (en) * 2024-01-25 2024-04-30 北京航空航天大学杭州创新研究院 Pre-training task model adjustment processing method and device, electronic equipment and medium


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102881019A (en) * 2012-10-08 2013-01-16 江南大学 Fuzzy clustering image segmenting method with transfer learning function
CN103020122A (en) * 2012-11-16 2013-04-03 哈尔滨工程大学 Transfer learning method based on semi-supervised clustering
CN107220337A (en) * 2017-05-25 2017-09-29 北京大学 A kind of cross-media retrieval method based on mixing migration network
CN107273517A (en) * 2017-06-21 2017-10-20 复旦大学 Picture and text cross-module state search method based on the embedded study of figure
CN108460134A (en) * 2018-03-06 2018-08-28 云南大学 The text subject disaggregated model and sorting technique of transfer learning are integrated based on multi-source domain
CN109784405A (en) * 2019-01-16 2019-05-21 山东建筑大学 Cross-module state search method and system based on pseudo label study and semantic consistency

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Cross-modal Common Representation Learning by Hybrid Transfer Network";Xin Huang et al.;《arXiv》;20170624;第1-8页 *
"基于域与样例平衡的多源迁移学习方法";季鼎承 等;《电子学报》;20190331;第47卷(第3期);第692-699页 *
"基于迁移学习的图像检索算法";李晓雨 等;《计算机科学》;20190131;第46卷(第1期);第73-77页 *
"混合迁移学习方法在医学图像检索中的应用";贾刚 等;《哈尔滨工程大学学报》;20150731;第36卷(第7期);第938-942页 *

Also Published As

Publication number Publication date
CN110647904A (en) 2020-01-03

Similar Documents

Publication Publication Date Title
CN110647904B (en) Cross-modal retrieval method and system based on unmarked data migration
CN109918532B (en) Image retrieval method, device, equipment and computer readable storage medium
CN107273517B (en) Graph-text cross-modal retrieval method based on graph embedding learning
CN106202256B (en) Web image retrieval method based on semantic propagation and mixed multi-instance learning
JP5749279B2 (en) Join embedding for item association
CN112905822B (en) Deep supervision cross-modal counterwork learning method based on attention mechanism
WO2019015246A1 (en) Image feature acquisition
CN107220337B (en) Cross-media retrieval method based on hybrid migration network
CN106095829A (en) Cross-media retrieval method based on degree of depth study with the study of concordance expression of space
CN110941734B (en) Depth unsupervised image retrieval method based on sparse graph structure
CN109902714B (en) Multi-modal medical image retrieval method based on multi-graph regularization depth hashing
CN110309268A (en) A kind of cross-language information retrieval method based on concept map
CN109829065B (en) Image retrieval method, device, equipment and computer readable storage medium
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN112487822A (en) Cross-modal retrieval method based on deep learning
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN113537304A (en) Cross-modal semantic clustering method based on bidirectional CNN
CN115309930A (en) Cross-modal retrieval method and system based on semantic identification
CN116610831A (en) Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system
CN109635004B (en) Object description providing method, device and equipment of database
CN114118310A (en) Clustering method and device based on comprehensive similarity
CN117273134A (en) Zero-sample knowledge graph completion method based on pre-training language model
CN111339258A (en) University computer basic exercise recommendation method based on knowledge graph
CN113779287B (en) Cross-domain multi-view target retrieval method and device based on multi-stage classifier network
Su et al. Deep supervised hashing with hard example pairs optimization for image retrieval

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant