CN114461836A - Cross-modal retrieval method for image-text - Google Patents

Cross-modal retrieval method for image-text

Info

Publication number
CN114461836A
CN114461836A (application number CN202210124470.3A)
Authority
CN
China
Prior art keywords
text
image
retrieval
modal
anchor point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210124470.3A
Other languages
Chinese (zh)
Inventor
张师超
石慧敏
章成源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210124470.3A priority Critical patent/CN114461836A/en
Publication of CN114461836A publication Critical patent/CN114461836A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 - Querying
    • G06F16/538 - Presentation of query results
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal retrieval method for image-text data, which comprises: obtaining an image-text pair data set and constructing a cross-modal image-text retrieval initial model; processing the image-text pair data set to obtain feature vectors; projecting the feature vectors into a public space to obtain feature vectors of uniform dimension, performing label classification and calculating the label classification loss; performing weighted sampling based on multiple negative samples on the image-text pair data set and calculating the multi-negative-sample weighted contrastive loss; optimizing the cross-modal image-text retrieval initial model through an optimizer to obtain the cross-modal image-text retrieval model; and performing actual image-text cross-modal retrieval with the cross-modal image-text retrieval model. The invention pre-constructs a cross-modal image-text retrieval model, projects the image and text feature vectors into a unified public space, introduces multi-negative-sample sampling and weighted learning for model training, and uses the trained model for cross-modal retrieval; the method therefore achieves high retrieval accuracy, good reliability and high retrieval speed.

Description

Cross-modal retrieval method for image-text
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a cross-modal retrieval method for image-text data.
Background
With the development of economy and technology, retrieval technology has been widely applied in people's work and daily life, bringing great convenience.
Massive amounts of data are generated on the Internet at all times, including multimedia information such as text, images, video and audio. Because this information is rich in content and diverse in form, obtaining the information one actually needs from such massive data has become an urgent problem. At the same time, the large-scale growth of multimedia data calls for more efficient techniques for its retrieval. Traditional retrieval stays within a single modality: entering text retrieves text, and entering images retrieves images. In today's big-data era, traditional single-modality retrieval can no longer meet people's new requirements for information retrieval.
Cross-modal retrieval means that the query data and the retrieval results come from different modalities, for example finding the image that best matches the text description of a commodity, or finding the most suitable text description for an image. Cross-modal retrieval removes the limitation of traditional single-modality retrieval and enables retrieval across different modalities, so it has strong practical applicability. Cross-modal information retrieval therefore has broad application prospects and important research significance.
The core of existing cross-modal retrieval techniques is to learn a common subspace, by linear projection or deep learning, in which data from different modalities can undergo semantic similarity metric learning; retrieval results similar to the query data are then returned by ranking. Nikhil et al. proposed an image-text retrieval method based on canonical correlation analysis, which obtains feature projection vectors in the subspace by linear transformation and then maximizes the correlation between the two modalities. Ding et al. proposed a cross-modal retrieval method based on collaborative matrix factorization, which finds the common semantics of data from different modalities through collaborative matrix factorization and projects them into a common space. These methods adopt various forms of feature projection, but their sampling strategies are simple: they rely on loss measures computed over only a few sampled instances, so the optimization gradients lack flexibility.
Disclosure of Invention
The invention aims to provide a cross-modal retrieval method for image-text data which has high retrieval accuracy, good reliability and high retrieval speed.
The invention provides a cross-modal retrieval method for image-text, which comprises the following steps:
s1, acquiring a graph-text pair data set, and constructing a cross-mode graph-text retrieval initial model;
s2, processing the image-text data set obtained in the step S1 to obtain a feature vector;
s3, obtaining the uniform dimension characteristic vector in the public space by the characteristic vector obtained in the step S2 through a projection function, carrying out label classification and calculating the label classification loss;
s4, carrying out weighted sampling based on multiple negative samples on the image-text data set, and calculating the weighted contrast loss of the multiple negative samples;
s5, optimizing the constructed cross-modal image-text retrieval initial model through an optimizer according to the label classification loss obtained in the step S3 and the multi-negative-sample weighted comparison loss obtained in the step S4 to obtain a cross-modal image-text retrieval model;
and S6, performing actual image-text cross-modal retrieval by adopting the cross-modal image-text retrieval model obtained in the step S5.
The obtaining of the image-text pair data set and the constructing of the cross-modality image-text retrieval initial model in step S1 specifically include the following steps:
A. acquiring an image-text pair data set: the image-text pair data set comprises an image data set and a text data set;
B. constructing the cross-modal image-text retrieval initial model: an image feature vector set is extracted from the image data set through a convolutional neural network; a text feature vector set is extracted from the text data set through a bag-of-words model; the image feature vector set and the text feature vector set are then projected into a public space through a projection function; finally, similarity measurement and label classification are carried out in the public space.
The processing of the image-text pair data set obtained in step S1 to obtain a feature vector in step S2 specifically includes the following steps:
extracting image features from the image data set in the image-text pair data set obtained in step S1 through a convolutional neural network, thereby obtaining an image feature vector set; and extracting key semantic words from the text data set in the image-text pair data set obtained in step S1 through a bag-of-words model, thereby obtaining a text feature vector set.
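The patent names only a convolutional neural network for the image side and a bag-of-words model for the text side. The sketch below illustrates this step with a pretrained ResNet-50 and scikit-learn's CountVectorizer; the backbone choice, vocabulary size and preprocessing are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from sklearn.feature_extraction.text import CountVectorizer
from PIL import Image

# Image branch: a pretrained CNN with its classification head removed,
# so the output is a fixed-length image feature vector (2048-d for ResNet-50).
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = nn.Identity()
cnn.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_image_feature(path: str) -> torch.Tensor:
    """Return a 2048-d feature vector for a single image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return cnn(img).squeeze(0)

# Text branch: a bag-of-words vectorizer fitted on the text side of the
# image-text pair data set; each sentence becomes a count vector over the vocabulary.
texts = ["a dog runs on the grass", "a red car parked on the street"]  # placeholder corpus
vectorizer = CountVectorizer(max_features=5000)     # vocabulary size is an assumption
text_features = vectorizer.fit_transform(texts)     # sparse matrix, shape (num_texts, vocab_size)
```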
Step S3, obtaining the uniform dimension feature vector in the public space by projecting the feature vector obtained in step S2 through a projection function, performing label classification, and calculating a label classification loss, specifically including the following steps:
a. passing the image feature vectors in the image feature vector set obtained in step S2 through a three-layer fully connected network;
b. passing the text feature vectors in the text feature vector set obtained in step S2 through a three-layer fully connected network;
c. after the processing of step a and step b, the image feature vectors and the text feature vectors lie in the same real-valued space; parameters are then shared through the same fully connected layer, and finally the image feature vectors and the text feature vectors are projected into the same low-dimensional public space for label classification;
d. calculating the label classification loss L_1 by the following formula:
[formula published as image BDA0003499873720000031]
where n is the number of training samples; y_i is the label of each training instance; p_i(u_i) is the probability distribution generated for the image; p_i(v_i) is the probability distribution generated for the text.
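The formula itself is published only as an image. A form consistent with the variables defined above, assuming a standard cross-entropy computed on both the image-side and the text-side label predictions (an assumption for illustration, not a reproduction of the patent's exact expression), would be:

\[
L_1 = -\frac{1}{n}\sum_{i=1}^{n}\Big( y_i \log p_i(u_i) + y_i \log p_i(v_i) \Big)
\]

A minimal PyTorch sketch of steps a to c and of this assumed classification loss follows; the feature dimensions (2048-d image features, 5000-d bag-of-words text features, 512-d public space), the hidden width and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProjectionBranch(nn.Module):
    """Three fully connected layers mapping one modality into a shared real-valued space."""
    def __init__(self, in_dim: int, hidden_dim: int = 1024, out_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class CrossModalModel(nn.Module):
    """Image branch, text branch, a shared (parameter-sharing) layer, and a label classifier."""
    def __init__(self, img_dim=2048, txt_dim=5000, common_dim=512, num_classes=10):
        super().__init__()
        self.img_branch = ProjectionBranch(img_dim, out_dim=common_dim)
        self.txt_branch = ProjectionBranch(txt_dim, out_dim=common_dim)
        self.shared = nn.Linear(common_dim, common_dim)   # same layer applied to both modalities
        self.classifier = nn.Linear(common_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        u = self.shared(self.img_branch(img_feat))        # image embedding in the public space
        v = self.shared(self.txt_branch(txt_feat))        # text embedding in the public space
        return u, v, self.classifier(u), self.classifier(v)

# Assumed form of the label classification loss L1: cross-entropy on both modalities.
criterion = nn.CrossEntropyLoss()

def label_classification_loss(img_logits, txt_logits, labels):
    return criterion(img_logits, labels) + criterion(txt_logits, labels)
```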
Step S4, performing weighted sampling based on multiple negative samples on the image-text pair data set and calculating the multi-negative-sample weighted contrastive loss, specifically includes the following steps:
if the anchor is image-modality data, acquiring one positive sample and several negative samples of text-modality data, calculating the distances between the anchor and the positive sample and between the anchor and each negative sample, and adjusting the contrastive weights of the positive and negative samples so that the anchor and the positive sample are as close as possible in the public space and the anchor and each negative sample are as far apart as possible in the public space;
if the anchor is text-modality data, acquiring one positive sample and several negative samples of image-modality data, calculating the distances between the anchor and the positive sample and between the anchor and each negative sample, and adjusting the contrastive weights of the positive and negative samples so that the anchor and the positive sample are as close as possible in the public space and the anchor and each negative sample are as far apart as possible in the public space;
The multi-negative-sample weighted contrastive loss L_2 is calculated by the following formula:
[formula published as image BDA0003499873720000041]
where w_u is the weight when the anchor is image-modality data; f(·) is the similarity measure function; u is the image-modality anchor; v^+ is the positive text sample similar to the image-modality anchor u; N_1 is the number of sampled negative samples dissimilar to the image-modality anchor u; v_k is the k-th sampled negative sample dissimilar to the image-modality anchor u; w_v is the weight when the anchor is text-modality data; v is the text-modality anchor; u^+ is the positive image sample similar to the text-modality anchor v; N_2 is the number of sampled negative samples dissimilar to the text-modality anchor v; u_k is the k-th sampled negative sample dissimilar to the text-modality anchor v.
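The L_2 formula is likewise published only as an image. One weighted multi-negative contrastive form consistent with the variables above, written here in an InfoNCE-like style purely as an assumed illustration rather than the patent's exact expression, is:

\[
L_2 = -\, w_u \log \frac{\exp f(u, v^+)}{\exp f(u, v^+) + \sum_{k=1}^{N_1} \exp f(u, v_k)}
\; - \; w_v \log \frac{\exp f(v, u^+)}{\exp f(v, u^+) + \sum_{k=1}^{N_2} \exp f(v, u_k)}
\]

A PyTorch sketch of this assumed form follows, using cosine similarity for the similarity measure f(·); the temperature parameter is an added assumption not mentioned in the patent.

```python
import torch
import torch.nn.functional as F

def weighted_multi_negative_loss(anchor, positive, negatives, weight=1.0, temperature=0.1):
    """Assumed weighted contrastive term for a single anchor (a sketch, not the patent's exact formula).

    anchor:    (d,)   embedding of the anchor in the public space
    positive:  (d,)   embedding of the matching sample from the other modality
    negatives: (N, d) embeddings of N sampled dissimilar samples from the other modality
    """
    sim_pos = F.cosine_similarity(anchor.unsqueeze(0), positive.unsqueeze(0)) / temperature  # shape (1,)
    sim_neg = F.cosine_similarity(anchor.unsqueeze(0), negatives) / temperature              # shape (N,)
    logits = torch.cat([sim_pos, sim_neg])                # the positive sits at index 0
    target = torch.zeros(1, dtype=torch.long)
    return weight * F.cross_entropy(logits.unsqueeze(0), target)

def contrastive_loss_both_directions(u, v_pos, v_negs, v, u_pos, u_negs, w_u=1.0, w_v=1.0):
    """Image-anchored term plus text-anchored term, mirroring the two cases of step S4."""
    return (weighted_multi_negative_loss(u, v_pos, v_negs, w_u)
            + weighted_multi_negative_loss(v, u_pos, u_negs, w_v))
```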
The distances between the anchor and the positive sample and between the anchor and the negative samples are specifically computed as cosine distances.
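For two embeddings x and y in the public space, the cosine distance referred to here is the standard quantity:

\[
d_{\cos}(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
\]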
In step S5, the constructed cross-modal image-text retrieval initial model is optimized by an optimizer according to the label classification loss obtained in step S3 and the multi-negative-sample weighted contrastive loss obtained in step S4 to obtain the cross-modal image-text retrieval model, which specifically includes the following steps:
updating with an Adam optimizer according to the label classification loss obtained in step S3 and the multi-negative-sample weighted contrastive loss obtained in step S4, and continuously optimizing the parameters of the cross-modal image-text retrieval initial model to obtain the final cross-modal image-text retrieval model.
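A skeleton of this optimization step, continuing the sketches above, is given below. The learning rate, number of epochs, batch construction and the dummy data generator are hypothetical choices for illustration and are not specified by the patent.

```python
import torch

# Dummy data standing in for a real DataLoader over the image-text pair data set
# (dimensions and sizes are illustrative assumptions).
def dummy_batches(num_batches=10, batch=32, img_dim=2048, txt_dim=5000, num_negs=8, num_classes=10):
    for _ in range(num_batches):
        yield (torch.randn(batch, img_dim),              # image features
               torch.randn(batch, txt_dim),              # bag-of-words text features
               torch.randint(0, num_classes, (batch,)),  # labels
               torch.randn(batch, num_negs, 512),        # negative image embeddings per anchor
               torch.randn(batch, num_negs, 512))        # negative text embeddings per anchor

model = CrossModalModel()                        # defined in the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(3):
    for img_feat, txt_feat, labels, img_negs, txt_negs in dummy_batches():
        u, v, img_logits, txt_logits = model(img_feat, txt_feat)

        l1 = label_classification_loss(img_logits, txt_logits, labels)
        # One anchor per direction shown for brevity; in practice the contrastive
        # term would be accumulated over the whole mini-batch.
        l2 = contrastive_loss_both_directions(u[0], v[0], txt_negs[0],
                                              v[0], u[0], img_negs[0])
        loss = l1 + l2

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```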
Step S6, which is to perform actual image-text cross-modality retrieval by using the cross-modality image-text retrieval model obtained in step S5, specifically includes the following steps:
when the retrieval object is text-modality data, calculating the distance between the text-modality data and each item in the image retrieval library, selecting the several closest image items according to the obtained distances, and returning them as the final retrieval result, thereby completing the actual image-text cross-modal retrieval;
when the retrieval object is image-modality data, calculating the distance between the image-modality data and each item in the text retrieval library, selecting the several closest text items according to the obtained distances, and returning them as the final retrieval result, thereby completing the actual image-text cross-modal retrieval.
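A sketch of this retrieval step, assuming the query and the retrieval library have already been projected into the public space by the trained model (names and dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_embedding: torch.Tensor,
                   gallery_embeddings: torch.Tensor,
                   k: int = 5) -> torch.Tensor:
    """Return indices of the k gallery items with the smallest cosine distance to the query."""
    sims = F.cosine_similarity(query_embedding.unsqueeze(0), gallery_embeddings)  # shape (N,)
    distances = 1.0 - sims
    return torch.topk(distances, k, largest=False).indices

# Example: a text query ranked against an image retrieval library of 1000 embeddings.
query = torch.randn(512)          # embedding of the text query in the public space
gallery = torch.randn(1000, 512)  # embeddings of the image retrieval library
top_images = retrieve_top_k(query, gallery, k=5)
```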
The cross-modal retrieval method for image-text of the invention pre-constructs a cross-modal image-text retrieval model; image data and text data are input, feature vectors are extracted through an image processing network and a text processing network, and the effective information in the data is extracted; the image feature vectors and text feature vectors are projected into a unified public space by a projection function, which improves the matching speed; multi-negative-sample sampling and weighted learning are introduced, shortening the distance between positive sample pairs in the public space and lengthening the distance between negative sample pairs, so the semantic distinguishability is high; cross-modal retrieval is performed with the trained model, the data are ranked by cosine distance, and the top-ranked data are returned as the retrieval result, with high retrieval accuracy and strong robustness; the method therefore achieves high retrieval accuracy, good reliability and high retrieval speed.
Drawings
FIG. 1 is a schematic process flow diagram of the process of the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of the method of the present invention: the invention provides a cross-modal retrieval method for image-text, which comprises the following steps:
s1, acquiring a graph-text pair data set, and constructing a cross-mode graph-text retrieval initial model; the method specifically comprises the following steps:
A. acquiring an image-text pair data set: the image-text pair data set comprises an image data set and a text data set;
B. constructing the cross-modal image-text retrieval initial model: an image feature vector set is extracted from the image data set through a convolutional neural network; a text feature vector set is extracted from the text data set through a bag-of-words model; the image feature vector set and the text feature vector set are then projected into a public space through a projection function; finally, similarity measurement and label classification are carried out in the public space;
s2, processing the image-text data set obtained in the step S1 to obtain a feature vector; the method specifically comprises the following steps:
extracting image features from the image data set in the image-text pair data set obtained in the step S1 through a convolutional neural network, thereby obtaining an image feature vector set; extracting key semantic words from the text data set in the image-text pair data set obtained in the step S1 through a bag-of-words model, thereby obtaining a text feature vector set;
s3, obtaining the uniform dimension characteristic vector in the public space by the characteristic vector obtained in the step S2 through a projection function, carrying out label classification and calculating the label classification loss; the method specifically comprises the following steps:
a. passing the image feature vectors in the image feature vector set obtained in step S2 through a three-layer fully connected network;
b. passing the text feature vectors in the text feature vector set obtained in step S2 through a three-layer fully connected network;
c. after the processing of step a and step b, the image feature vectors and the text feature vectors lie in the same real-valued space; parameters are then shared through the same fully connected layer, and finally the image feature vectors and the text feature vectors are projected into the same low-dimensional public space for label classification;
d. calculating the label classification loss L_1 by the following formula:
[formula published as image BDA0003499873720000071]
where n is the number of training samples; y_i is the label of each training instance; p_i(u_i) is the probability distribution generated for the image; p_i(v_i) is the probability distribution generated for the text;
s4, carrying out weighted sampling based on multiple negative samples on the image-text data set, and calculating the weighted contrast loss of the multiple negative samples; the method specifically comprises the following steps:
if the anchor is image-modality data, acquiring one positive sample and several negative samples of text-modality data, calculating the distances (preferably cosine distances) between the anchor and the positive sample and between the anchor and each negative sample, and adjusting the contrastive weights of the positive and negative samples so that the anchor and the positive sample are as close as possible in the public space and the anchor and each negative sample are as far apart as possible in the public space;
if the anchor is text-modality data, acquiring one positive sample and several negative samples of image-modality data, calculating the distances (preferably cosine distances) between the anchor and the positive sample and between the anchor and each negative sample, and adjusting the contrastive weights of the positive and negative samples so that the anchor and the positive sample are as close as possible in the public space and the anchor and each negative sample are as far apart as possible in the public space;
The multi-negative-sample weighted contrastive loss L_2 is calculated by the following formula:
[formula published as image BDA0003499873720000072]
where w_u is the weight when the anchor is image-modality data; f(·) is the similarity measure function; u is the image-modality anchor; v^+ is the positive text sample similar to the image-modality anchor u; N_1 is the number of sampled negative samples dissimilar to the image-modality anchor u; v_k is the k-th sampled negative sample dissimilar to the image-modality anchor u; w_v is the weight when the anchor is text-modality data; v is the text-modality anchor; u^+ is the positive image sample similar to the text-modality anchor v; N_2 is the number of sampled negative samples dissimilar to the text-modality anchor v; u_k is the k-th sampled negative sample dissimilar to the text-modality anchor v;
s5, optimizing the constructed cross-modal image-text retrieval initial model through an optimizer according to the label classification loss obtained in the step S3 and the multi-negative-sample weighted comparison loss obtained in the step S4 to obtain a cross-modal image-text retrieval model; the method specifically comprises the following steps:
updating by adopting an Adam optimizer according to the label classification loss obtained in the step S3 and the multi-negative-sample weighted comparison loss obtained in the step S4, and continuously optimizing parameters in the cross-modal image-text retrieval initial model to obtain a final cross-modal image-text retrieval model;
s6, performing actual image-text cross-modal retrieval by adopting the cross-modal image-text retrieval model obtained in the step S5; the method specifically comprises the following steps:
when the retrieval object is text-modality data, calculating the distance (preferably cosine distance) between the text-modality data and each item in the image retrieval library, selecting the several closest image items according to the obtained distances, and returning them as the final retrieval result, thereby completing the actual image-text cross-modal retrieval;
when the retrieval object is image-modality data, calculating the distance (preferably cosine distance) between the image-modality data and each item in the text retrieval library, selecting the several closest text items according to the obtained distances, and returning them as the final retrieval result, thereby completing the actual image-text cross-modal retrieval.

Claims (8)

1. A cross-modal retrieval method for image-text comprises the following steps:
s1, acquiring a graph-text pair data set, and constructing a cross-mode graph-text retrieval initial model;
s2, processing the image-text data set obtained in the step S1 to obtain a feature vector;
s3, obtaining the uniform dimension characteristic vector in the public space by the characteristic vector obtained in the step S2 through a projection function, carrying out label classification and calculating the label classification loss;
s4, carrying out weighted sampling based on multiple negative samples on the image-text data set, and calculating the weighted contrast loss of the multiple negative samples;
s5, optimizing the constructed cross-modal image-text retrieval initial model through an optimizer according to the label classification loss obtained in the step S3 and the multi-negative-sample weighted comparison loss obtained in the step S4 to obtain a cross-modal image-text retrieval model;
and S6, performing actual image-text cross-modal retrieval by adopting the cross-modal image-text retrieval model obtained in the step S5.
2. The method according to claim 1, wherein the step S1 of obtaining the image-text pair data set and constructing a cross-modal image-text retrieval initial model comprises the following steps:
A. acquiring an image-text pair data set: the image-text pair data set comprises an image data set and a text data set;
B. constructing the cross-modal image-text retrieval initial model: an image feature vector set is extracted from the image data set through a convolutional neural network; a text feature vector set is extracted from the text data set through a bag-of-words model; the image feature vector set and the text feature vector set are then projected into a public space through a projection function; finally, similarity measurement and label classification are carried out in the public space.
3. The method as claimed in claim 2, wherein the step S2 of processing the image-text pair data set obtained in step S1 to obtain the feature vectors comprises the following steps:
extracting image features from the image data set in the image-text pair data set obtained in step S1 through a convolutional neural network, thereby obtaining an image feature vector set; and extracting key semantic words from the text data set in the image-text pair data set obtained in step S1 through a bag-of-words model, thereby obtaining a text feature vector set.
4. The method of claim 3, wherein the step S3 of projecting the feature vectors obtained in step S2 into the public space through a projection function to obtain feature vectors of uniform dimension, performing label classification and calculating the label classification loss specifically comprises the following steps:
a. passing the image feature vectors in the image feature vector set obtained in step S2 through a three-layer fully connected network;
b. passing the text feature vectors in the text feature vector set obtained in step S2 through a three-layer fully connected network;
c. after the processing of step a and step b, the image feature vectors and the text feature vectors lie in the same real-valued space; parameters are then shared through the same fully connected layer, and finally the image feature vectors and the text feature vectors are projected into the same low-dimensional public space for label classification;
d. calculating the label classification loss L_1 by the following formula:
[formula published as image FDA0003499873710000021]
where n is the number of training samples; y_i is the label of each training instance; p_i(u_i) is the probability distribution generated for the image; p_i(v_i) is the probability distribution generated for the text.
5. The method according to claim 4, wherein the step S4 of performing weighted sampling based on multiple negative samples on the image-text pair data set and calculating the multi-negative-sample weighted contrastive loss comprises the following steps:
if the anchor is image-modality data, acquiring one positive sample and several negative samples of text-modality data, calculating the distances between the anchor and the positive sample and between the anchor and each negative sample, and adjusting the contrastive weights of the positive and negative samples so that the anchor and the positive sample are as close as possible in the public space and the anchor and each negative sample are as far apart as possible in the public space;
if the anchor is text-modality data, acquiring one positive sample and several negative samples of image-modality data, calculating the distances between the anchor and the positive sample and between the anchor and each negative sample, and adjusting the contrastive weights of the positive and negative samples so that the anchor and the positive sample are as close as possible in the public space and the anchor and each negative sample are as far apart as possible in the public space;
The multi-negative-sample weighted contrastive loss L_2 is calculated by the following formula:
[formula published as image FDA0003499873710000031]
where w_u is the weight when the anchor is image-modality data; f(·) is the similarity measure function; u is the image-modality anchor; v^+ is the positive text sample similar to the image-modality anchor u; N_1 is the number of sampled negative samples dissimilar to the image-modality anchor u; v_k is the k-th sampled negative sample dissimilar to the image-modality anchor u; w_v is the weight when the anchor is text-modality data; v is the text-modality anchor; u^+ is the positive image sample similar to the text-modality anchor v; N_2 is the number of sampled negative samples dissimilar to the text-modality anchor v; u_k is the k-th sampled negative sample dissimilar to the text-modality anchor v.
6. The method according to claim 5, wherein the distances between the anchor and the positive sample and between the anchor and the negative samples are specifically computed as cosine distances.
7. The method of claim 5, wherein the step S5 of optimizing the constructed cross-modal image-text retrieval initial model through an optimizer according to the label classification loss obtained in step S3 and the multi-negative-sample weighted contrastive loss obtained in step S4 to obtain the cross-modal image-text retrieval model specifically comprises the following steps:
updating with an Adam optimizer according to the label classification loss obtained in step S3 and the multi-negative-sample weighted contrastive loss obtained in step S4, and continuously optimizing the parameters of the cross-modal image-text retrieval initial model to obtain the final cross-modal image-text retrieval model.
8. The method as claimed in claim 6, wherein the step S6 of performing actual cross-modal image-text retrieval by using the cross-modal image-text retrieval model obtained in step S5 includes the following steps:
when the retrieval object is text-modality data, calculating the distance between the text-modality data and each item in the image retrieval library, selecting the several closest image items according to the obtained distances, and returning them as the final retrieval result, thereby completing the actual image-text cross-modal retrieval;
when the retrieval object is image-modality data, calculating the distance between the image-modality data and each item in the text retrieval library, selecting the several closest text items according to the obtained distances, and returning them as the final retrieval result, thereby completing the actual image-text cross-modal retrieval.
CN202210124470.3A 2022-02-10 2022-02-10 Cross-modal retrieval method for image-text Pending CN114461836A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210124470.3A CN114461836A (en) 2022-02-10 2022-02-10 Cross-modal retrieval method for image-text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210124470.3A CN114461836A (en) 2022-02-10 2022-02-10 Cross-modal retrieval method for image-text

Publications (1)

Publication Number Publication Date
CN114461836A true CN114461836A (en) 2022-05-10

Family

ID=81414549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210124470.3A Pending CN114461836A (en) 2022-02-10 2022-02-10 Cross-modal retrieval method for image-text

Country Status (1)

Country Link
CN (1) CN114461836A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115909317A (en) * 2022-07-15 2023-04-04 广东工业大学 Learning method and system for three-dimensional model-text joint expression
CN115391578A (en) * 2022-08-03 2022-11-25 北京乾图科技有限公司 Cross-modal image-text retrieval model training method and system
CN116383342A (en) * 2023-04-07 2023-07-04 四川大学 Robust cross-domain text retrieval method under noise label
CN116383342B (en) * 2023-04-07 2023-11-14 四川大学 Robust cross-domain text retrieval method under noise label
CN116167434A (en) * 2023-04-24 2023-05-26 清华大学 Training method and device for weak supervision visual language pre-training model
CN116167434B (en) * 2023-04-24 2023-07-04 清华大学 Training method and device for weak supervision visual language pre-training model
CN116975318A (en) * 2023-08-03 2023-10-31 四川大学 Half-pairing image-text retrieval method based on cross-correlation mining
CN116975318B (en) * 2023-08-03 2024-01-23 四川大学 Half-pairing image-text retrieval method based on cross-correlation mining

Similar Documents

Publication Publication Date Title
CN114461836A (en) Cross-modal retrieval method for image-text
Yuan et al. Video summarization by learning deep side semantic embedding
CN111666406B (en) Short text classification prediction method based on word and label combination of self-attention
CN113780003B (en) Cross-modal enhancement method for space-time data variable-division encoding and decoding
CN108959522B (en) Migration retrieval method based on semi-supervised countermeasure generation network
CN113177141B (en) Multi-label video hash retrieval method and device based on semantic embedded soft similarity
CN112100410A (en) Cross-modal retrieval method and system based on semantic condition association learning
CN113239159B (en) Cross-modal retrieval method for video and text based on relational inference network
CN113705218A (en) Event element gridding extraction method based on character embedding, storage medium and electronic device
Zhao et al. Disentangled representation learning and residual GAN for age-invariant face verification
CN110866129A (en) Cross-media retrieval method based on cross-media uniform characterization model
Su et al. Semi-supervised knowledge distillation for cross-modal hashing
Xiong et al. Affective impression: Sentiment-awareness POI suggestion via embedding in heterogeneous LBSNs
CN110705384B (en) Vehicle re-identification method based on cross-domain migration enhanced representation
CN110442736B (en) Semantic enhancer spatial cross-media retrieval method based on secondary discriminant analysis
CN115203529A (en) Deep neural network recommendation model and method based on multi-head self-attention mechanism
CN114662652A (en) Expert recommendation method based on multi-mode information learning
CN114239730A (en) Cross-modal retrieval method based on neighbor sorting relation
CN115481313A (en) News recommendation method based on text semantic mining
Liu et al. A multimodal approach for multiple-relation extraction in videos
CN110674265B (en) Unstructured information oriented feature discrimination and information recommendation system
CN117033804A (en) Click induction detection method under subjective and objective visual angle guidance
CN116701569A (en) Multi-field false news detection method based on multi-view collaboration
CN114842301A (en) Semi-supervised training method of image annotation model
Zhang et al. A social commerce information propagation prediction model based on transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination