CN114461836A - Cross-modal retrieval method for image-text - Google Patents

Cross-modal retrieval method for image-text

Info

Publication number
CN114461836A
CN114461836A (application number CN202210124470.3A)
Authority
CN
China
Prior art keywords
text
image
retrieval
modal
anchor point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210124470.3A
Other languages
Chinese (zh)
Inventor
张师超
石慧敏
章成源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210124470.3A priority Critical patent/CN114461836A/en
Publication of CN114461836A publication Critical patent/CN114461836A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 - Querying
    • G06F16/538 - Presentation of query results
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal retrieval method for image-text data, which comprises: obtaining an image-text pair data set and constructing a cross-modal image-text retrieval initial model; processing the image-text pair data set to obtain feature vectors; projecting the feature vectors into a public space to obtain feature vectors of uniform dimension, performing label classification and calculating the label classification loss; performing weighted sampling based on multiple negative samples on the image-text pair data set and calculating the multi-negative-sample weighted contrastive loss; optimizing the cross-modal image-text retrieval initial model through an optimizer to obtain the cross-modal image-text retrieval model; and performing actual image-text cross-modal retrieval with the cross-modal image-text retrieval model. The invention pre-constructs a cross-modal image-text retrieval model, projects the image and text feature vectors into a unified public space, introduces multi-negative-sample sampling and weighted learning for model training, and uses the trained model for cross-modal retrieval; the method therefore achieves high retrieval accuracy, good reliability and high retrieval speed.

Description

Cross-modal retrieval method for image-text
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a cross-modal retrieval method for image-text data.
Background
With the development of economy and technology, retrieval technology has been widely applied in people's work and daily life, bringing great convenience.
Massive amounts of data are generated on the Internet at all times, including multimedia information such as text, images, video and audio. Because this information is rich in content and diverse in form, obtaining the information one actually needs from such massive data has become an urgent problem. At the same time, the large-scale growth of multimedia data calls for more efficient techniques for its retrieval. Traditional retrieval stays within a single modality: entering text retrieves text, and entering images retrieves images. In today's big-data era, traditional single-modality retrieval can no longer meet people's new requirements for information retrieval.
Cross-modal retrieval means that the query data and the retrieval results come from different modalities, for example finding the image that best matches the text description of a commodity, or finding the most suitable text description for an image. Cross-modal retrieval removes the limitation of traditional single-modality retrieval and enables retrieval across different modalities, so it has strong practical applicability. Cross-modal information retrieval therefore has broad application prospects and important research significance.
The core of existing cross-modal retrieval techniques is to learn a common subspace, by linear projection or deep learning, in which data from different modalities can undergo semantic similarity metric learning; retrieval results similar to the query data are then returned by ranking. Nikhil et al. proposed an image-text retrieval method based on canonical correlation analysis, which obtains feature projection vectors in the subspace by linear transformation and then maximizes the correlation between the two modalities. Ding et al. proposed a cross-modal retrieval method based on collaborative matrix factorization, which finds the common semantics of data from different modalities through collaborative matrix factorization and projects them into a common space. These methods adopt various forms of feature projection, but their sampling strategies are simple: they rely on loss measures computed over only a few sampled instances, so the optimization gradients lack flexibility.
Disclosure of Invention
The invention aims to provide a cross-modal retrieval method for image-text data which has high retrieval accuracy, good reliability and high retrieval speed.
The invention provides a cross-modal retrieval method for image-text, which comprises the following steps:
s1, acquiring a graph-text pair data set, and constructing a cross-mode graph-text retrieval initial model;
s2, processing the image-text data set obtained in the step S1 to obtain a feature vector;
s3, obtaining the uniform dimension characteristic vector in the public space by the characteristic vector obtained in the step S2 through a projection function, carrying out label classification and calculating the label classification loss;
s4, carrying out weighted sampling based on multiple negative samples on the image-text data set, and calculating the weighted contrast loss of the multiple negative samples;
s5, optimizing the constructed cross-modal image-text retrieval initial model through an optimizer according to the label classification loss obtained in the step S3 and the multi-negative-sample weighted comparison loss obtained in the step S4 to obtain a cross-modal image-text retrieval model;
and S6, performing actual image-text cross-modal retrieval by adopting the cross-modal image-text retrieval model obtained in the step S5.
The obtaining of the image-text pair data set and the constructing of the cross-modality image-text retrieval initial model in step S1 specifically include the following steps:
A. acquiring an image-text pair data set: the image-text pair data set comprises an image data set and a text data set;
B. constructing the cross-modal image-text retrieval initial model: an image feature vector set is extracted from the image data set through a convolutional neural network; a text feature vector set is extracted from the text data set through a bag-of-words model; the image feature vector set and the text feature vector set are then projected into a public space through a projection function; finally, similarity measurement and label classification are carried out in the public space.
The processing of the image-text pair data set obtained in step S1 to obtain a feature vector in step S2 specifically includes the following steps:
extracting image features from the image data set in the image-text pair data set obtained in step S1 through a convolutional neural network, thereby obtaining an image feature vector set; and extracting key semantic words from the text data set in the image-text pair data set obtained in step S1 through a bag-of-words model, thereby obtaining a text feature vector set.
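The patent names only a convolutional neural network for the image side and a bag-of-words model for the text side. The sketch below illustrates this step with a pretrained ResNet-50 and scikit-learn's CountVectorizer; the backbone choice, vocabulary size and preprocessing are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from sklearn.feature_extraction.text import CountVectorizer
from PIL import Image

# Image branch: a pretrained CNN with its classification head removed,
# so the output is a fixed-length image feature vector (2048-d for ResNet-50).
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = nn.Identity()
cnn.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_image_feature(path: str) -> torch.Tensor:
    """Return a 2048-d feature vector for a single image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return cnn(img).squeeze(0)

# Text branch: a bag-of-words vectorizer fitted on the text side of the
# image-text pair data set; each sentence becomes a count vector over the vocabulary.
texts = ["a dog runs on the grass", "a red car parked on the street"]  # placeholder corpus
vectorizer = CountVectorizer(max_features=5000)     # vocabulary size is an assumption
text_features = vectorizer.fit_transform(texts)     # sparse matrix, shape (num_texts, vocab_size)
```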
Step S3, obtaining the uniform dimension feature vector in the public space by projecting the feature vector obtained in step S2 through a projection function, performing label classification, and calculating a label classification loss, specifically including the following steps:
a. passing the image feature vectors in the image feature vector set obtained in step S2 through a three-layer fully connected network;
b. passing the text feature vectors in the text feature vector set obtained in step S2 through a three-layer fully connected network;
c. after the processing of step a and step b, the image feature vectors and the text feature vectors lie in the same real-valued space; parameters are then shared through the same fully connected layer, and finally the image feature vectors and the text feature vectors are projected into the same low-dimensional public space for label classification;
d. calculating the label classification loss L_1 by the following formula:
[formula published as image BDA0003499873720000031]
where n is the number of training samples; y_i is the label of each training instance; p_i(u_i) is the probability distribution generated for the image; p_i(v_i) is the probability distribution generated for the text.
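The formula itself is published only as an image. A form consistent with the variables defined above, assuming a standard cross-entropy computed on both the image-side and the text-side label predictions (an assumption for illustration, not a reproduction of the patent's exact expression), would be:

\[
L_1 = -\frac{1}{n}\sum_{i=1}^{n}\Big( y_i \log p_i(u_i) + y_i \log p_i(v_i) \Big)
\]

A minimal PyTorch sketch of steps a to c and of this assumed classification loss follows; the feature dimensions (2048-d image features, 5000-d bag-of-words text features, 512-d public space), the hidden width and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProjectionBranch(nn.Module):
    """Three fully connected layers mapping one modality into a shared real-valued space."""
    def __init__(self, in_dim: int, hidden_dim: int = 1024, out_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class CrossModalModel(nn.Module):
    """Image branch, text branch, a shared (parameter-sharing) layer, and a label classifier."""
    def __init__(self, img_dim=2048, txt_dim=5000, common_dim=512, num_classes=10):
        super().__init__()
        self.img_branch = ProjectionBranch(img_dim, out_dim=common_dim)
        self.txt_branch = ProjectionBranch(txt_dim, out_dim=common_dim)
        self.shared = nn.Linear(common_dim, common_dim)   # same layer applied to both modalities
        self.classifier = nn.Linear(common_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        u = self.shared(self.img_branch(img_feat))        # image embedding in the public space
        v = self.shared(self.txt_branch(txt_feat))        # text embedding in the public space
        return u, v, self.classifier(u), self.classifier(v)

# Assumed form of the label classification loss L1: cross-entropy on both modalities.
criterion = nn.CrossEntropyLoss()

def label_classification_loss(img_logits, txt_logits, labels):
    return criterion(img_logits, labels) + criterion(txt_logits, labels)
```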
Step S4, performing weighted sampling based on multiple negative samples on the image-text pair data set and calculating the multi-negative-sample weighted contrastive loss, specifically includes the following steps:
if the anchor is image-modality data, acquiring one positive sample and several negative samples of text-modality data, calculating the distances between the anchor and the positive sample and between the anchor and each negative sample, and adjusting the contrastive weights of the positive and negative samples so that the anchor and the positive sample are as close as possible in the public space and the anchor and each negative sample are as far apart as possible in the public space;
if the anchor is text-modality data, acquiring one positive sample and several negative samples of image-modality data, calculating the distances between the anchor and the positive sample and between the anchor and each negative sample, and adjusting the contrastive weights of the positive and negative samples so that the anchor and the positive sample are as close as possible in the public space and the anchor and each negative sample are as far apart as possible in the public space;
The multi-negative-sample weighted contrastive loss L_2 is calculated by the following formula:
[formula published as image BDA0003499873720000041]
where w_u is the weight when the anchor is image-modality data; f(·) is the similarity measure function; u is the image-modality anchor; v^+ is the positive text sample similar to the image-modality anchor u; N_1 is the number of sampled negative samples dissimilar to the image-modality anchor u; v_k is the k-th sampled negative sample dissimilar to the image-modality anchor u; w_v is the weight when the anchor is text-modality data; v is the text-modality anchor; u^+ is the positive image sample similar to the text-modality anchor v; N_2 is the number of sampled negative samples dissimilar to the text-modality anchor v; u_k is the k-th sampled negative sample dissimilar to the text-modality anchor v.
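The L_2 formula is likewise published only as an image. One weighted multi-negative contrastive form consistent with the variables above, written here in an InfoNCE-like style purely as an assumed illustration rather than the patent's exact expression, is:

\[
L_2 = -\, w_u \log \frac{\exp f(u, v^+)}{\exp f(u, v^+) + \sum_{k=1}^{N_1} \exp f(u, v_k)}
\; - \; w_v \log \frac{\exp f(v, u^+)}{\exp f(v, u^+) + \sum_{k=1}^{N_2} \exp f(v, u_k)}
\]

A PyTorch sketch of this assumed form follows, using cosine similarity for the similarity measure f(·); the temperature parameter is an added assumption not mentioned in the patent.

```python
import torch
import torch.nn.functional as F

def weighted_multi_negative_loss(anchor, positive, negatives, weight=1.0, temperature=0.1):
    """Assumed weighted contrastive term for a single anchor (a sketch, not the patent's exact formula).

    anchor:    (d,)   embedding of the anchor in the public space
    positive:  (d,)   embedding of the matching sample from the other modality
    negatives: (N, d) embeddings of N sampled dissimilar samples from the other modality
    """
    sim_pos = F.cosine_similarity(anchor.unsqueeze(0), positive.unsqueeze(0)) / temperature  # shape (1,)
    sim_neg = F.cosine_similarity(anchor.unsqueeze(0), negatives) / temperature              # shape (N,)
    logits = torch.cat([sim_pos, sim_neg])                # the positive sits at index 0
    target = torch.zeros(1, dtype=torch.long)
    return weight * F.cross_entropy(logits.unsqueeze(0), target)

def contrastive_loss_both_directions(u, v_pos, v_negs, v, u_pos, u_negs, w_u=1.0, w_v=1.0):
    """Image-anchored term plus text-anchored term, mirroring the two cases of step S4."""
    return (weighted_multi_negative_loss(u, v_pos, v_negs, w_u)
            + weighted_multi_negative_loss(v, u_pos, u_negs, w_v))
```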
The distances between the anchor and the positive sample and between the anchor and the negative samples are specifically computed as cosine distances.
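For two embeddings x and y in the public space, the cosine distance referred to here is the standard quantity:

\[
d_{\cos}(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
\]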
In step S5, the constructed cross-modal image-text retrieval initial model is optimized by an optimizer according to the label classification loss obtained in step S3 and the multi-negative-sample weighted contrastive loss obtained in step S4 to obtain the cross-modal image-text retrieval model, which specifically includes the following steps:
updating with an Adam optimizer according to the label classification loss obtained in step S3 and the multi-negative-sample weighted contrastive loss obtained in step S4, and continuously optimizing the parameters of the cross-modal image-text retrieval initial model to obtain the final cross-modal image-text retrieval model.
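A skeleton of this optimization step, continuing the sketches above, is given below. The learning rate, number of epochs, batch construction and the dummy data generator are hypothetical choices for illustration and are not specified by the patent.

```python
import torch

# Dummy data standing in for a real DataLoader over the image-text pair data set
# (dimensions and sizes are illustrative assumptions).
def dummy_batches(num_batches=10, batch=32, img_dim=2048, txt_dim=5000, num_negs=8, num_classes=10):
    for _ in range(num_batches):
        yield (torch.randn(batch, img_dim),              # image features
               torch.randn(batch, txt_dim),              # bag-of-words text features
               torch.randint(0, num_classes, (batch,)),  # labels
               torch.randn(batch, num_negs, 512),        # negative image embeddings per anchor
               torch.randn(batch, num_negs, 512))        # negative text embeddings per anchor

model = CrossModalModel()                        # defined in the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(3):
    for img_feat, txt_feat, labels, img_negs, txt_negs in dummy_batches():
        u, v, img_logits, txt_logits = model(img_feat, txt_feat)

        l1 = label_classification_loss(img_logits, txt_logits, labels)
        # One anchor per direction shown for brevity; in practice the contrastive
        # term would be accumulated over the whole mini-batch.
        l2 = contrastive_loss_both_directions(u[0], v[0], txt_negs[0],
                                              v[0], u[0], img_negs[0])
        loss = l1 + l2

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```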
Step S6, which is to perform actual image-text cross-modality retrieval by using the cross-modality image-text retrieval model obtained in step S5, specifically includes the following steps:
when the retrieval object is text-modality data, calculating the distance between the text-modality data and each item in the image retrieval library, selecting the several closest image items according to the obtained distances, and returning them as the final retrieval result, thereby completing the actual image-text cross-modal retrieval;
when the retrieval object is image-modality data, calculating the distance between the image-modality data and each item in the text retrieval library, selecting the several closest text items according to the obtained distances, and returning them as the final retrieval result, thereby completing the actual image-text cross-modal retrieval.
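A sketch of this retrieval step, assuming the query and the retrieval library have already been projected into the public space by the trained model (names and dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_embedding: torch.Tensor,
                   gallery_embeddings: torch.Tensor,
                   k: int = 5) -> torch.Tensor:
    """Return indices of the k gallery items with the smallest cosine distance to the query."""
    sims = F.cosine_similarity(query_embedding.unsqueeze(0), gallery_embeddings)  # shape (N,)
    distances = 1.0 - sims
    return torch.topk(distances, k, largest=False).indices

# Example: a text query ranked against an image retrieval library of 1000 embeddings.
query = torch.randn(512)          # embedding of the text query in the public space
gallery = torch.randn(1000, 512)  # embeddings of the image retrieval library
top_images = retrieve_top_k(query, gallery, k=5)
```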
The cross-modal retrieval method for image-text of the invention pre-constructs a cross-modal image-text retrieval model; image data and text data are input, feature vectors are extracted through an image processing network and a text processing network, and the effective information in the data is extracted; the image feature vectors and text feature vectors are projected into a unified public space by a projection function, which improves the matching speed; multi-negative-sample sampling and weighted learning are introduced, shortening the distance between positive sample pairs in the public space and lengthening the distance between negative sample pairs, so the semantic distinguishability is high; cross-modal retrieval is performed with the trained model, the data are ranked by cosine distance, and the top-ranked data are returned as the retrieval result, with high retrieval accuracy and strong robustness; the method therefore achieves high retrieval accuracy, good reliability and high retrieval speed.
Drawings
FIG. 1 is a schematic process flow diagram of the process of the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of the method of the present invention: the invention provides a cross-modal retrieval method for image-text, which comprises the following steps:
s1, acquiring a graph-text pair data set, and constructing a cross-mode graph-text retrieval initial model; the method specifically comprises the following steps:
A. acquiring an image-text pair data set: the image-text pair data set comprises an image data set and a text data set;
B. constructing the cross-modal image-text retrieval initial model: an image feature vector set is extracted from the image data set through a convolutional neural network; a text feature vector set is extracted from the text data set through a bag-of-words model; the image feature vector set and the text feature vector set are then projected into a public space through a projection function; finally, similarity measurement and label classification are carried out in the public space;
s2, processing the image-text data set obtained in the step S1 to obtain a feature vector; the method specifically comprises the following steps:
extracting image features from the image data set in the image-text pair data set obtained in the step S1 through a convolutional neural network, thereby obtaining an image feature vector set; extracting key semantic words from the text data set in the image-text pair data set obtained in the step S1 through a bag-of-words model, thereby obtaining a text feature vector set;
s3, obtaining the uniform dimension characteristic vector in the public space by the characteristic vector obtained in the step S2 through a projection function, carrying out label classification and calculating the label classification loss; the method specifically comprises the following steps:
a. passing the image feature vectors in the image feature vector set obtained in step S2 through a three-layer fully connected network;
b. passing the text feature vectors in the text feature vector set obtained in step S2 through a three-layer fully connected network;
c. after the processing of step a and step b, the image feature vectors and the text feature vectors lie in the same real-valued space; parameters are then shared through the same fully connected layer, and finally the image feature vectors and the text feature vectors are projected into the same low-dimensional public space for label classification;
d. calculating the label classification loss L_1 by the following formula:
[formula published as image BDA0003499873720000071]
where n is the number of training samples; y_i is the label of each training instance; p_i(u_i) is the probability distribution generated for the image; p_i(v_i) is the probability distribution generated for the text;
s4, carrying out weighted sampling based on multiple negative samples on the image-text data set, and calculating the weighted contrast loss of the multiple negative samples; the method specifically comprises the following steps:
if the anchor is image-modality data, acquiring one positive sample and several negative samples of text-modality data, calculating the distances (preferably cosine distances) between the anchor and the positive sample and between the anchor and each negative sample, and adjusting the contrastive weights of the positive and negative samples so that the anchor and the positive sample are as close as possible in the public space and the anchor and each negative sample are as far apart as possible in the public space;
if the anchor is text-modality data, acquiring one positive sample and several negative samples of image-modality data, calculating the distances (preferably cosine distances) between the anchor and the positive sample and between the anchor and each negative sample, and adjusting the contrastive weights of the positive and negative samples so that the anchor and the positive sample are as close as possible in the public space and the anchor and each negative sample are as far apart as possible in the public space;
The multi-negative-sample weighted contrastive loss L_2 is calculated by the following formula:
[formula published as image BDA0003499873720000072]
where w_u is the weight when the anchor is image-modality data; f(·) is the similarity measure function; u is the image-modality anchor; v^+ is the positive text sample similar to the image-modality anchor u; N_1 is the number of sampled negative samples dissimilar to the image-modality anchor u; v_k is the k-th sampled negative sample dissimilar to the image-modality anchor u; w_v is the weight when the anchor is text-modality data; v is the text-modality anchor; u^+ is the positive image sample similar to the text-modality anchor v; N_2 is the number of sampled negative samples dissimilar to the text-modality anchor v; u_k is the k-th sampled negative sample dissimilar to the text-modality anchor v;
s5, optimizing the constructed cross-modal image-text retrieval initial model through an optimizer according to the label classification loss obtained in the step S3 and the multi-negative-sample weighted comparison loss obtained in the step S4 to obtain a cross-modal image-text retrieval model; the method specifically comprises the following steps:
updating by adopting an Adam optimizer according to the label classification loss obtained in the step S3 and the multi-negative-sample weighted comparison loss obtained in the step S4, and continuously optimizing parameters in the cross-modal image-text retrieval initial model to obtain a final cross-modal image-text retrieval model;
s6, performing actual image-text cross-modal retrieval by adopting the cross-modal image-text retrieval model obtained in the step S5; the method specifically comprises the following steps:
when the retrieval object is text-modality data, calculating the distance (preferably cosine distance) between the text-modality data and each item in the image retrieval library, selecting the several closest image items according to the obtained distances, and returning them as the final retrieval result, thereby completing the actual image-text cross-modal retrieval;
when the retrieval object is image-modality data, calculating the distance (preferably cosine distance) between the image-modality data and each item in the text retrieval library, selecting the several closest text items according to the obtained distances, and returning them as the final retrieval result, thereby completing the actual image-text cross-modal retrieval.

Claims (8)

1. A cross-modal retrieval method for image-text comprises the following steps:
s1, acquiring a graph-text pair data set, and constructing a cross-mode graph-text retrieval initial model;
s2, processing the image-text data set obtained in the step S1 to obtain a feature vector;
s3, obtaining the uniform dimension characteristic vector in the public space by the characteristic vector obtained in the step S2 through a projection function, carrying out label classification and calculating the label classification loss;
s4, carrying out weighted sampling based on multiple negative samples on the image-text data set, and calculating the weighted contrast loss of the multiple negative samples;
s5, optimizing the constructed cross-modal image-text retrieval initial model through an optimizer according to the label classification loss obtained in the step S3 and the multi-negative-sample weighted comparison loss obtained in the step S4 to obtain a cross-modal image-text retrieval model;
and S6, performing actual image-text cross-modal retrieval by adopting the cross-modal image-text retrieval model obtained in the step S5.
2. The method according to claim 1, wherein the step S1 of obtaining the image-text pair data set and constructing a cross-modal image-text retrieval initial model comprises the following steps:
A. acquiring an image-text pair data set: the image-text pair data set comprises an image data set and a text data set;
B. constructing the cross-modal image-text retrieval initial model: an image feature vector set is extracted from the image data set through a convolutional neural network; a text feature vector set is extracted from the text data set through a bag-of-words model; the image feature vector set and the text feature vector set are then projected into a public space through a projection function; finally, similarity measurement and label classification are carried out in the public space.
3. The method as claimed in claim 2, wherein the step S2 of processing the image-text pair data set obtained in step S1 to obtain the feature vectors comprises the following steps:
extracting image features from the image data set in the image-text pair data set obtained in step S1 through a convolutional neural network, thereby obtaining an image feature vector set; and extracting key semantic words from the text data set in the image-text pair data set obtained in step S1 through a bag-of-words model, thereby obtaining a text feature vector set.
4. The method of claim 3, wherein the step S3 of projecting the feature vectors obtained in step S2 into the public space through a projection function to obtain feature vectors of uniform dimension, performing label classification and calculating the label classification loss specifically comprises the following steps:
a. passing the image feature vectors in the image feature vector set obtained in step S2 through a three-layer fully connected network;
b. passing the text feature vectors in the text feature vector set obtained in step S2 through a three-layer fully connected network;
c. after the processing of step a and step b, the image feature vectors and the text feature vectors lie in the same real-valued space; parameters are then shared through the same fully connected layer, and finally the image feature vectors and the text feature vectors are projected into the same low-dimensional public space for label classification;
d. calculating the label classification loss L_1 by the following formula:
[formula published as image FDA0003499873710000021]
where n is the number of training samples; y_i is the label of each training instance; p_i(u_i) is the probability distribution generated for the image; p_i(v_i) is the probability distribution generated for the text.
5. The method according to claim 4, wherein the step S4 of performing weighted sampling based on multiple negative samples on the image-text pair data set and calculating the multi-negative-sample weighted contrastive loss comprises the following steps:
if the anchor is image-modality data, acquiring one positive sample and several negative samples of text-modality data, calculating the distances between the anchor and the positive sample and between the anchor and each negative sample, and adjusting the contrastive weights of the positive and negative samples so that the anchor and the positive sample are as close as possible in the public space and the anchor and each negative sample are as far apart as possible in the public space;
if the anchor is text-modality data, acquiring one positive sample and several negative samples of image-modality data, calculating the distances between the anchor and the positive sample and between the anchor and each negative sample, and adjusting the contrastive weights of the positive and negative samples so that the anchor and the positive sample are as close as possible in the public space and the anchor and each negative sample are as far apart as possible in the public space;
The multi-negative-sample weighted contrastive loss L_2 is calculated by the following formula:
[formula published as image FDA0003499873710000031]
where w_u is the weight when the anchor is image-modality data; f(·) is the similarity measure function; u is the image-modality anchor; v^+ is the positive text sample similar to the image-modality anchor u; N_1 is the number of sampled negative samples dissimilar to the image-modality anchor u; v_k is the k-th sampled negative sample dissimilar to the image-modality anchor u; w_v is the weight when the anchor is text-modality data; v is the text-modality anchor; u^+ is the positive image sample similar to the text-modality anchor v; N_2 is the number of sampled negative samples dissimilar to the text-modality anchor v; u_k is the k-th sampled negative sample dissimilar to the text-modality anchor v.
6. The method according to claim 5, wherein the distances between the anchor and the positive sample and between the anchor and the negative samples are specifically computed as cosine distances.
7. The method of claim 5, wherein the step S5 of optimizing the constructed cross-modal image-text retrieval initial model through an optimizer according to the label classification loss obtained in step S3 and the multi-negative-sample weighted contrastive loss obtained in step S4 to obtain the cross-modal image-text retrieval model specifically comprises the following steps:
updating with an Adam optimizer according to the label classification loss obtained in step S3 and the multi-negative-sample weighted contrastive loss obtained in step S4, and continuously optimizing the parameters of the cross-modal image-text retrieval initial model to obtain the final cross-modal image-text retrieval model.
8. The method as claimed in claim 6, wherein the step S6 of performing actual cross-modal image-text retrieval by using the cross-modal image-text retrieval model obtained in step S5 includes the following steps:
when the retrieval object is text-modality data, calculating the distance between the text-modality data and each item in the image retrieval library, selecting the several closest image items according to the obtained distances, and returning them as the final retrieval result, thereby completing the actual image-text cross-modal retrieval;
when the retrieval object is image-modality data, calculating the distance between the image-modality data and each item in the text retrieval library, selecting the several closest text items according to the obtained distances, and returning them as the final retrieval result, thereby completing the actual image-text cross-modal retrieval.
CN202210124470.3A 2022-02-10 2022-02-10 Cross-modal retrieval method for image-text Pending CN114461836A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210124470.3A CN114461836A (en) 2022-02-10 2022-02-10 Cross-modal retrieval method for image-text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210124470.3A CN114461836A (en) 2022-02-10 2022-02-10 Cross-modal retrieval method for image-text

Publications (1)

Publication Number Publication Date
CN114461836A true CN114461836A (en) 2022-05-10

Family

ID=81414549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210124470.3A Pending CN114461836A (en) 2022-02-10 2022-02-10 Cross-modal retrieval method for image-text

Country Status (1)

Country Link
CN (1) CN114461836A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115909317A (en) * 2022-07-15 2023-04-04 广东工业大学 Learning method and system for three-dimensional model-text joint expression
CN115391578A (en) * 2022-08-03 2022-11-25 北京乾图科技有限公司 Cross-modal image-text retrieval model training method and system
CN116383342A (en) * 2023-04-07 2023-07-04 四川大学 Robust cross-domain text retrieval method under noise label
CN116383342B (en) * 2023-04-07 2023-11-14 四川大学 Robust cross-domain text retrieval method under noise label
CN116167434A (en) * 2023-04-24 2023-05-26 清华大学 Training method and device for weak supervision visual language pre-training model
CN116167434B (en) * 2023-04-24 2023-07-04 清华大学 Training method and device for weak supervision visual language pre-training model
CN116975318A (en) * 2023-08-03 2023-10-31 四川大学 Half-pairing image-text retrieval method based on cross-correlation mining
CN116975318B (en) * 2023-08-03 2024-01-23 四川大学 Half-pairing image-text retrieval method based on cross-correlation mining

Similar Documents

Publication Publication Date Title
CN114461836A (en) Cross-modal retrieval method for image-text
Yuan et al. Video summarization by learning deep side semantic embedding
CN111666406B (en) Short text classification prediction method based on word and label combination of self-attention
CN113780003B (en) Cross-modal enhancement method for space-time data variable-division encoding and decoding
CN108959522B (en) Migration retrieval method based on semi-supervised countermeasure generation network
CN113177141B (en) Multi-label video hash retrieval method and device based on semantic embedded soft similarity
CN112100410A (en) Cross-modal retrieval method and system based on semantic condition association learning
CN113239159B (en) Cross-modal retrieval method for video and text based on relational inference network
CN113705218A (en) Event element gridding extraction method based on character embedding, storage medium and electronic device
Zhao et al. Disentangled representation learning and residual GAN for age-invariant face verification
CN110866129A (en) Cross-media retrieval method based on cross-media uniform characterization model
Su et al. Semi-supervised knowledge distillation for cross-modal hashing
Xiong et al. Affective impression: Sentiment-awareness POI suggestion via embedding in heterogeneous LBSNs
CN110705384B (en) Vehicle re-identification method based on cross-domain migration enhanced representation
CN110442736B (en) Semantic enhancer spatial cross-media retrieval method based on secondary discriminant analysis
CN115203529A (en) Deep neural network recommendation model and method based on multi-head self-attention mechanism
CN114662652A (en) Expert recommendation method based on multi-mode information learning
CN114239730A (en) Cross-modal retrieval method based on neighbor sorting relation
CN115481313A (en) News recommendation method based on text semantic mining
Liu et al. A multimodal approach for multiple-relation extraction in videos
CN110674265B (en) Unstructured information oriented feature discrimination and information recommendation system
CN117033804A (en) Click induction detection method under subjective and objective visual angle guidance
CN116701569A (en) Multi-field false news detection method based on multi-view collaboration
CN114842301A (en) Semi-supervised training method of image annotation model
Zhang et al. A social commerce information propagation prediction model based on transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination