CN112148916A - Cross-modal retrieval method, device, equipment and medium based on supervision - Google Patents

Cross-modal retrieval method, device, equipment and medium based on supervision

Info

Publication number
CN112148916A
CN112148916A (application CN202011044741.1A)
Authority
CN
China
Prior art keywords
data
retrieval
text
image
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011044741.1A
Other languages
Chinese (zh)
Inventor
李国徽
袁凌
周思远
徐志鹏
潘鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202011044741.1A priority Critical patent/CN112148916A/en
Publication of CN112148916A publication Critical patent/CN112148916A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/538Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention discloses a supervision-based cross-modal retrieval method, device, equipment and medium. The method comprises the following steps: extracting features from training sample data of an image modality and a text modality; mapping the extracted image and text data features to a common representation space; separately calculating the label-space loss, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, and combining them with different weights to obtain the loss function of a retrieval model; optimizing the parameters of the retrieval model by minimizing the loss function; and mapping target retrieval data to the common representation space with the optimized retrieval model and calculating its similarity to the data in the image-text data set to obtain the corresponding retrieval ranking result. The method thus preserves both the discriminability of samples with different semantics and the semantic information of the original data, computes correlations between cross-modal data more effectively, and achieves higher retrieval accuracy.

Description

Cross-modal retrieval method, device, equipment and medium based on supervision
Technical Field
The invention relates to the technical field of data retrieval, and in particular to a supervision-based cross-modal retrieval method, device, equipment and medium.
Background
With the rapid development of science and technology, the forms in which scientific and technical information is produced and acquired are increasingly rich. Such information is no longer limited to plain text and is gradually shifting toward mixed data types of other modalities, such as pictures and videos, whose expression is more vivid and whose content is richer. Traditional single-modality retrieval works well within one modality, but because data of different modalities may exhibit feature heterogeneity and weak correlation, their feature vectors differ in dimension and attributes and cannot directly participate in a common computation, so single-modality retrieval is not suitable for retrieval across multiple modalities. Cross-modal retrieval exploits the semantic similarity that exists between different modalities to retrieve similar content among multi-modal data. It can therefore meet the need for multi-angle intelligent analysis of multi-modal scientific and technical information.
The core of cross-modal retrieval is how to measure the content similarity between data of different modalities, i.e. how to overcome the heterogeneity between them. Representation learning is a general approach to this problem: it designs suitable functions that map data of different modalities into a common representation space in which, because the dimensions are consistent, the similarity between data of different modalities can be computed directly. To construct a suitable representation space, researchers have proposed many ways of designing the mapping functions.
Conventional methods use statistical correlation analysis to learn linear functions by optimizing a target statistic, but the correlations between multi-modal data in the real world are complex and cannot be fully modelled by linear functions. Because deep neural networks perform well in representation learning, many deep-learning-based methods have been used to learn a common representation space for multi-modal data. Compared with unsupervised approaches, supervised deep learning can learn more discriminative representation features and thus separate data of different classes better in the common representation space. Existing supervised cross-modal retrieval methods include learning discriminative features between multi-modal data using label information, and learning semantics or discriminability within each modality using classification information. Although these methods use classification information, they use it only to learn discriminative features within each modality or across modalities, and therefore do not make full use of the semantic information in the original data.
Disclosure of Invention
In view of the above shortcomings and needs for improvement of the prior art, the present invention provides a supervision-based cross-modal retrieval method, device, equipment and medium, aiming to solve the technical problem that existing cross-modal retrieval methods do not make full use of the semantic information in the original data, resulting in low retrieval accuracy.
In order to achieve the above object, the present invention provides a supervision-based cross-modal retrieval method, which comprises the following steps: S1: respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is obtained from an image-text data set; S2: mapping the extracted image data features and text data features to a common representation space; S3: respectively calculating the label-space loss, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, and combining them with different weights to obtain the loss function of a retrieval model; S4: optimizing parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model; S5: mapping target retrieval data to the common representation space using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text data set, so as to obtain a retrieval ranking result corresponding to the target retrieval data.
Further, the step S1 includes: s11: performing feature extraction on training sample data of an image mode by using a deep convolutional neural network, and adding a first full connection layer after an image extraction sub-network; s12: and performing feature extraction on training sample data of a text mode by using a natural language processing model, and adding a second full connection layer after the text extraction sub-network.
Further, the step S2 includes: and adding a third full connection layer after the first full connection layer and the second full connection layer, and mapping the extracted image data features and text data features to a common representation space through the third full connection layer.
Further, a linear classifier is added after the third fully-connected layer to predict the class of images and texts and compare with the real class, thereby calculating the loss of tag space.
Further, the loss function is expressed as: l ═ λ L1+μL2+ηL3Wherein, in the step (A),
Figure BDA0002707645000000031
L1for the loss of label space, n is the number of the picture text data pairs, | | · uFRepresenting Frobenius norm, P is a projection matrix of a linear classifier, alpha and beta are weights corresponding to image and text prediction labels respectively, and U, V, Y is a representation matrix of an image, a representation matrix of a text and a representation matrix of a corresponding label in a public representation space respectively;
Figure BDA0002707645000000032
L2for losses within individual modalities and between different modalities in the common representation space,ij=cos(ui,vj),Φij=cos(ui,uj),Θij=cos(vi,vj),
Figure BDA0002707645000000033
Figure BDA0002707645000000034
cos is a cosine function used to measure similarity; sgn is a sign function, and is 1 if the two representing elements belong to the same class, otherwise is 0;
Figure BDA0002707645000000035
for mapping the modality of the image,
Figure BDA0002707645000000036
for mapping text modalities, wherein
Figure BDA0002707645000000037
And
Figure BDA0002707645000000038
y for the ith image sample and the jth text sampleαAnd upsilonβIs a learnable parameter;
Figure BDA0002707645000000039
L3loss of invariance between image and text modalities;
λ, μ, η are L1、L2、L3The weight coefficient of (2).
Further, in step S5, calculating the similarity between the target retrieval data and the data in the image-text data set includes: and calculating the similarity between the target retrieval data and the data in the image-text data set by carrying out weighted average on the cross-modal data similarity and the homomodal data similarity.
In another aspect, the present invention provides a cross-modal search apparatus based on supervision, including:
the feature extraction module is used for respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is obtained from an image-text data set;
the public representation space learning module is used for mapping the extracted image data features and the extracted text data features to a public representation space;
the loss function calculation module is used for calculating the loss of a label space, the loss in each mode and among different modes in the public expression space and the invariance loss among image modes and text modes respectively, and adding different weights to obtain a loss function of the retrieval model;
the retrieval model optimization module is used for optimizing parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;
and the retrieval result determining module is used for mapping the target retrieval data to the public representation space by using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text data set so as to obtain a retrieval sequencing result corresponding to the target retrieval data.
The invention also provides an electronic device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The invention also provides a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method of any of the above.
In general, the above technical solution conceived by the present invention can achieve the following beneficial effects:
(1) The method maps the extracted image data features and text data features to a common representation space; calculates the label-space loss, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, and combines them with different weights to obtain the loss function of a retrieval model; and optimizes the parameters of the retrieval model by minimizing the loss function to obtain the optimized retrieval model. An end-to-end supervised cross-modal deep learning framework is adopted, which makes full use of the classification information of multi-modal data to learn the common representation space, preserves the discriminability of samples with different semantics, and eliminates the differences between cross-modal data. Compared with traditional hash-based retrieval methods, the real-valued cross-modal representation learning adopted here retains the semantic information of the original data, can compute correlations between cross-modal data more effectively, eliminates the influence of heterogeneity between different modalities, and achieves higher retrieval accuracy.
(2) The present invention makes full use of classification information to learn the common representation space. At the same time, it is a real-valued cross-modal representation learning method; compared with binary representation learning or cross-modal hash retrieval methods, it retains the information of the original data and achieves higher retrieval accuracy.
Drawings
FIG. 1 is a schematic flow chart of a cross-modal search method based on supervision according to the present invention;
fig. 2 is a block diagram of a cross-modal search apparatus based on supervision according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. The features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Also, the order of the steps or actions in the method descriptions may be adjusted or rearranged, as will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings are for the purpose of describing certain embodiments only and are not intended to imply a required sequence unless otherwise indicated where such sequence must be followed. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a cross-modal retrieval method based on supervision, as shown in fig. 1, comprising the following steps:
S1: respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is obtained from an image-text data set;
S2: mapping the extracted image data features and text data features to a common representation space;
S3: respectively calculating the label-space loss, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, and combining them with different weights to obtain the loss function of a retrieval model;
S4: optimizing parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;
S5: mapping target retrieval data to the common representation space using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text data set, so as to obtain a retrieval ranking result corresponding to the target retrieval data.
The present invention supervises the model to learn discriminative features by minimizing the discrimination losses in both the label space and the common representation space. At the same time, modality-invariant features in the common representation space are learned by minimizing the invariance loss between different modalities and by using a weight-sharing strategy. With this learning strategy, the pairwise label information and the classification information are utilized as fully as possible to ensure that the learned representations are discriminative in semantic structure and invariant across all modalities.
Specifically, the method comprises the following steps:
1. data preprocessing and image text feature extraction
The invention can be applied to cross-modal retrieval between different modalities; the description mainly takes image and text data as an example. Assume n image-text data pairs defined as D = {(x_i, t_i)}_{i=1}^n, where x_i is the ith image sample and t_i is the ith text sample. Each image-text pair corresponds to a semantic label vector y_i ∈ {0,1}^c, where c denotes the number of categories; if the ith instance belongs to the jth category, the corresponding component of the label vector is 1, otherwise it is 0.
The data preprocessing of the images includes resizing, cropping, normalization, etc.; the text preprocessing includes denoising, word segmentation, stop-word filtering, etc. The feature extraction method is described below.
For image data, a deep convolutional neural network is used to extract the feature vector of the image. The convolution kernels of the convolutional neural network are applied to the input image by element-wise multiplication and summation, projecting the information in each receptive field into an element of the feature map and thereby extracting image features. By controlling the dimension of each layer, this part finally expresses the image as a feature vector of a specific dimension. Generally, a convolutional neural network has multiple layers, and each successive layer represents the image data in a more abstract way, so the output of the second-to-last layer can be used as the feature vector. Feature extraction therefore uses a specific convolutional neural network to extract single-modality image features, yielding the image feature vector for the subsequent representation learning. Concretely, the fc7 layer of the 19-layer VGGNet is used to extract a 4096-dimensional high-level feature vector as the semantic vector of the image. A fully connected layer then performs common representation learning to obtain the common representation vector u_i of the image.
For text data, a feature extraction algorithm based on a language model and a neural network is adopted to convert the unstructured original text into a high-level semantic feature vector representation. For the text sub-network, a BERT model (BERT-base, Chinese) is used to extract the feature vectors of Chinese text. BERT uses the Transformer structure to build a multi-layer bidirectional encoder network and can directly convert the original text into high-level semantic vectors carrying semantic features; a fully connected layer then performs common representation learning to obtain the common representation vector v_j of the text.
The feature extraction module is used to extract the high-level semantic feature vectors of scientific and technical information images and texts respectively. There are two sub-networks, one for the image modality and one for the text modality, and both are trained end to end.
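By way of illustration only, a minimal PyTorch-style sketch of the two feature-extraction sub-networks is given below; the class names, the 1024-dimensional output size and the use of the torchvision and transformers model loaders are assumptions made for the example rather than part of the original disclosure.

```python
import torch.nn as nn
from torchvision import models
from transformers import BertModel

class ImageSubNet(nn.Module):
    """Image sub-network: 19-layer VGGNet up to fc7 (4096-d), then one fully connected layer."""
    def __init__(self, out_dim=1024):
        super().__init__()
        vgg = models.vgg19(weights=None)  # in practice, ImageNet-pretrained weights would be loaded
        self.features = vgg.features
        self.avgpool = vgg.avgpool
        self.fc7 = nn.Sequential(*list(vgg.classifier.children())[:-1])  # drop the 1000-way head
        self.fc_img = nn.Linear(4096, out_dim)  # the "first fully connected layer"

    def forward(self, images):
        h = self.avgpool(self.features(images)).flatten(1)
        return self.fc_img(self.fc7(h))  # image feature vector

class TextSubNet(nn.Module):
    """Text sub-network: BERT-base Chinese encoder, then one fully connected layer."""
    def __init__(self, out_dim=1024, name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.fc_txt = nn.Linear(self.bert.config.hidden_size, out_dim)  # the "second fully connected layer"

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return self.fc_txt(h)  # text feature vector
```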
2. Common representation space learning
For the common representation learning module, in order to ensure that the two sub-networks can learn a common representation space, a fully connected layer with weight sharing is used at the end of the model. Then, based on the assumption that the representation vectors in the common space are well suited for classification, a linear classifier with parameter matrix P is connected to the end of the model to learn discriminative features from the label information. In this way, the method can learn cross-modal correlations well and accurately distinguish the representation features in the common space.
Since the image and text data lie in different representation spaces, their similarity cannot be compared directly, so the method learns two mapping functions that map images and texts respectively into the common representation space. Define u_i = f_I(x_i; γ_α) as the mapping of the image modality and v_j = f_T(t_j; υ_β) as the mapping of the text modality, where γ_α and υ_β are learnable parameters.
The representation matrix of the images in the common space is defined as U = [u_1, u_2, ..., u_n], the representation matrix of the texts as V = [v_1, v_2, ..., v_n], and the matrix of the corresponding labels as Y = [y_1, y_2, ..., y_n].
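Continuing the sketch above, the weight-shared fully connected layer and the linear classifier with projection matrix P could be assembled roughly as follows; the common-space dimension and the class count are illustrative placeholders.

```python
import torch.nn as nn

class CrossModalNet(nn.Module):
    """Common representation learning: one weight-shared FC layer on top of both sub-networks,
    plus a linear classifier whose weight matrix plays the role of the projection matrix P."""
    def __init__(self, img_net, txt_net, feat_dim=1024, common_dim=512, num_classes=10):
        super().__init__()
        self.img_net, self.txt_net = img_net, txt_net
        self.shared = nn.Linear(feat_dim, common_dim)                     # weight-shared FC layer
        self.classifier = nn.Linear(common_dim, num_classes, bias=False)  # linear classifier P

    def forward(self, images, input_ids, attention_mask):
        u = self.shared(self.img_net(images))                        # common representation u_i
        v = self.shared(self.txt_net(input_ids, attention_mask))     # common representation v_j
        return u, v, self.classifier(u), self.classifier(v)          # predicted labels used in L1
```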
3. Training and objective functions of models
The goal of the model is to learn the semantic structure of the data, i.e. to learn a common space in which samples from the same semantic class are similar even though they may come from different modalities, while samples from different semantic classes are dissimilar. To learn discriminative representation features of multi-modal data, the method minimizes the discrimination loss in the label space and in the common representation space, and at the same time minimizes the distance between the representations of each image-text pair to reduce cross-modal differences.
In order to preserve the differences between samples of different classes after feature projection, the method assumes that the common representation features are well suited for classification and uses a linear classifier to predict the semantic labels of the sample features in the common representation space. Specifically, a fully connected layer connects the ends of the image-modality network and the text-modality network. The classifier classifies the representation features of the training data in the common space, generating a predicted label for each data sample.
The objective function can be divided into three parts. The first part measures the loss in the label space, as shown in equation (1).
L1 = (1/n)(α||P^T U − Y||_F + β||P^T V − Y||_F)   (1)
where ||·||_F denotes the Frobenius norm, P is the projection matrix of the linear classifier, and α and β are the weights corresponding to the image and text prediction labels respectively. Since images and texts extract high-level semantic vectors in different ways, their prediction losses after mapping to the common representation space are not balanced, so different weights are applied to the image and text prediction labels to balance this difference.
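Assuming U, V and Y are held as row-per-sample matrices and the predicted labels come from the linear classifier above, the label-space loss of equation (1) can be sketched as follows (the α/β defaults are illustrative):

```python
import torch

def label_space_loss(pred_u, pred_v, y, alpha=0.5, beta=0.5):
    """Equation (1): weighted Frobenius-norm distance between predicted and true label matrices."""
    n = y.size(0)
    return (alpha * torch.norm(pred_u - y, p="fro")
            + beta * torch.norm(pred_v - y, p="fro")) / n
```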
The second part of the objective function directly measures the discrimination loss between data of different modalities and within each modality in the common representation space, as shown in equation (2).
L2 = −(1/n²)Σ_{i,j}(S_ij Γ_ij − log(1 + e^{Γ_ij})) − (1/n²)Σ_{i,j}(S_ij Φ_ij − log(1 + e^{Φ_ij})) − (1/n²)Σ_{i,j}(S_ij Θ_ij − log(1 + e^{Θ_ij}))   (2)
where Γ_ij = cos(u_i, v_j), Φ_ij = cos(u_i, u_j), Θ_ij = cos(v_i, v_j), cos is the cosine function used to measure similarity, and S_ij is given by the sign function sgn, which is 1 if the two represented elements belong to the same class and 0 otherwise.
The likelihood function used to measure the cross-modal similarity is shown in equation (3).
p(S_ij | u_i, v_j) = σ(Γ_ij)^{S_ij} (1 − σ(Γ_ij))^{1 − S_ij},  with σ(x) = 1/(1 + e^{−x})   (3)
Since maximizing the likelihood is equivalent to minimizing its negative logarithm, equation (3) reduces to the first line of equation (2). It can be seen that the larger the cosine similarity cos(u_i, v_j), the larger Γ_ij and thus the larger the probability p(1 | u_i, v_j); that is, the common representation space is organized according to similarity. Similarly, the second and third lines of equation (2) measure the similarity within the image samples and within the text samples respectively. The formula is therefore a reasonable measure of similarity in the common representation space and a criterion for learning discriminative features.
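A sketch of the discriminative loss of equation (2) is given below, with S_ij derived from shared class labels; the exact tensor layout (row-per-sample matrices of common representations and multi-hot label vectors) is an assumption made for the example.

```python
import torch
import torch.nn.functional as F

def nll_term(a, b, labels_a, labels_b):
    """-(1/n^2) * sum_ij [ S_ij * sim_ij - log(1 + exp(sim_ij)) ], where sim_ij is the cosine
    similarity between row i of a and row j of b, and S_ij = 1 iff the samples share a class."""
    sim = F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()
    s = (labels_a.float() @ labels_b.float().t() > 0).float()
    return -(s * sim - torch.log1p(torch.exp(sim))).mean()

def discriminative_loss(u, v, y):
    """Equation (2): the cross-modal term plus the two intra-modal terms."""
    return nll_term(u, v, y, y) + nll_term(u, u, y, y) + nll_term(v, v, y, y)
```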
To eliminate cross-modal differences, the third part of the objective function minimizes the distance between the representations of all image-text pairs, as shown in equation (4).
L3 = (1/n)||U − V||_F   (4)
Combining (1), (2) and (4) with different weights gives the overall objective function of the model, as shown in equation (5).
L = λL1 + μL2 + ηL3   (5)
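Reusing the helper functions sketched above, the overall objective of equation (5), including the invariance loss of equation (4), could be combined as follows; the default weight values are placeholders, not values fixed by the disclosure.

```python
import torch

def total_loss(u, v, y, classifier, lam=1.0, mu=1.0, eta=1.0, alpha=0.5, beta=0.5):
    """Equation (5): L = lambda*L1 + mu*L2 + eta*L3, with L3 = (1/n) * ||U - V||_F (equation (4))."""
    l1 = label_space_loss(classifier(u), classifier(v), y, alpha, beta)  # label-space loss, eq. (1)
    l2 = discriminative_loss(u, v, y)                                    # discriminative loss, eq. (2)
    l3 = torch.norm(u - v, p="fro") / y.size(0)                          # invariance loss, eq. (4)
    return lam * l1 + mu * l2 + eta * l3
```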
4. Similarity calculation and result display
Finally, the model is trained and put to use. The trained model maps the target retrieval data into the common representation space, and the similarity between the target retrieval data and the data in the scientific and technical image-text data set is calculated. In practical use, the similarity to same-modality data can be added to improve retrieval accuracy, as shown in equation (6).
S = αSimilarity(x, U) + βSimilarity(x, V)   (6)
where α and β are weights, Similarity() is a function measuring similarity, x is the input image or text data, and S is the final score; the results are sorted by S, and the top-ranked data are returned as the final result.
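A sketch of the ranking step of equation (6), using cosine similarity as the Similarity() function (an assumption for the example; the disclosure does not fix the similarity measure here):

```python
import torch
import torch.nn.functional as F

def retrieve(query_vec, U, V, alpha=0.5, beta=0.5, top_k=10):
    """Equation (6): S = alpha*Similarity(x, U) + beta*Similarity(x, V); return top-ranked indices.
    query_vec is the common-space representation of the input image or text sample."""
    q = F.normalize(query_vec.unsqueeze(0), dim=1)
    s = alpha * (q @ F.normalize(U, dim=1).t()) + beta * (q @ F.normalize(V, dim=1).t())
    s = s.squeeze(0)
    return torch.topk(s, k=min(top_k, s.numel())).indices  # indices of the top-ranked database items
```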
In another aspect, the present invention provides a cross-modal search apparatus based on supervision, including:
the feature extraction module 11 is configured to perform feature extraction on training sample data of an image modality and a text modality respectively, where the training sample data is obtained from an image-text data set;
a common representation space learning module 21 for mapping the extracted image data features and text data features to a common representation space;
a loss function calculating module 31, configured to calculate a loss of a tag space, a loss in each modality and between different modalities in the common representation space, and an invariance loss between image modalities and text modalities, and add different weights to obtain a loss function of the retrieval model;
a retrieval model optimization module 41, configured to optimize parameters of the retrieval model by minimizing the loss function, so as to obtain an optimized retrieval model;
and a retrieval result determining module 51, configured to map the target retrieval data to the common representation space by using the optimized retrieval model, and calculate a similarity between the target retrieval data and the data in the image-text data set, so as to obtain a retrieval ordering result corresponding to the target retrieval data.
The division of each module in the above-mentioned cross-modal search apparatus is merely for illustration, and in other embodiments, the cross-modal search apparatus may be divided into different modules as required to complete all or part of the functions of the above-mentioned apparatus.
The implementation of each module in the cross-modal retrieval apparatus based on supervision provided in the embodiment of the present application may be in the form of a computer program. The computer program may be run on a terminal or a server. Program modules constituted by such computer programs may be stored on the memory of the electronic device. Which when executed by a processor, performs the steps of the method described in the embodiments of the present application.
The embodiment of the application also provides a computer-readable storage medium: one or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the supervision-based cross-modal retrieval method.
A computer program product containing instructions which, when run on a computer, cause the computer to perform the supervision-based cross-modal retrieval method.
Any reference to memory, storage, database, or other medium used herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be readily understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, so that various changes, modifications and substitutions may be made without departing from the spirit and scope of the invention.

Claims (9)

1. A cross-modal retrieval method based on supervision is characterized by comprising the following steps:
S1: respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is obtained from an image-text data set;
S2: mapping the extracted image data features and text data features to a common representation space;
S3: respectively calculating the label-space loss, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, and combining them with different weights to obtain the loss function of a retrieval model;
S4: optimizing parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;
S5: mapping target retrieval data to the common representation space using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text data set, so as to obtain a retrieval ranking result corresponding to the target retrieval data.
2. The supervised-based cross-modal retrieval method of claim 1, wherein the step S1 comprises:
s11: performing feature extraction on training sample data of an image mode by using a deep convolutional neural network, and adding a first full connection layer after an image extraction sub-network;
s12: and performing feature extraction on training sample data of a text mode by using a natural language processing model, and adding a second full connection layer after the text extraction sub-network.
3. The supervised-based cross-modal retrieval method of claim 2, wherein the step S2 comprises:
and adding a third full connection layer after the first full connection layer and the second full connection layer, and mapping the extracted image data features and text data features to a common representation space through the third full connection layer.
4. The supervised-based cross-modal search method of claim 3, wherein a linear classifier is added after the third fully connected layer to predict image and text classes and compare them with the true classes to calculate the loss of label space.
5. The supervised-based cross-modal retrieval method of claim 1,
the loss function is expressed as: l ═ λ L1+μL2+ηL3Wherein, in the step (A),
Figure FDA0002707644990000021
L1for the loss of label space, n is the number of the picture text data pairs, | | · uFRepresenting Frobenius norm, P is a projection matrix of a linear classifier, alpha and beta are weights corresponding to image and text prediction labels respectively, and U, V, Y is a representation matrix of an image, a representation matrix of a text and a representation matrix of a corresponding label in a public representation space respectively;
Figure FDA0002707644990000022
L2for losses within individual modalities and between different modalities in the common representation space,ij=cos(ui,vj),Φij=cos(ui,uj),Θij=cos(vi,vj),
Figure FDA0002707644990000023
Figure FDA0002707644990000024
cos is a cosine function used to measure similarity; sgn is a sign function, and is 1 if the two representing elements belong to the same class, otherwise is 0;
Figure FDA0002707644990000026
for mapping the modality of the image,
Figure FDA0002707644990000029
for mapping text modalities, wherein
Figure FDA0002707644990000027
And
Figure FDA0002707644990000028
y for the ith image sample and the jth text sampleαAnd upsilonβIs a learnable parameter;
Figure FDA0002707644990000025
L3loss of invariance between image and text modalities;
λ, μ, η are L1、L2、L3The weight coefficient of (2).
6. The supervised-based cross-modal retrieval method of claim 1, wherein the calculating of the similarity between the target retrieval data and the data in the teletext data set in step S5 comprises: and calculating the similarity between the target retrieval data and the data in the image-text data set by carrying out weighted average on the cross-modal data similarity and the homomodal data similarity.
7. A cross-modal search apparatus based on supervision, comprising:
the feature extraction module is used for respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is obtained from an image-text data set;
the public representation space learning module is used for mapping the extracted image data features and the extracted text data features to a public representation space;
the loss function calculation module is used for calculating the loss of a label space, the loss in each mode and among different modes in the public expression space and the invariance loss among image modes and text modes respectively, and adding different weights to obtain a loss function of the retrieval model;
the retrieval model optimization module is used for optimizing parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;
and the retrieval result determining module is used for mapping the target retrieval data to the public representation space by using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text data set so as to obtain a retrieval sequencing result corresponding to the target retrieval data.
8. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202011044741.1A 2020-09-28 2020-09-28 Cross-modal retrieval method, device, equipment and medium based on supervision Pending CN112148916A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011044741.1A CN112148916A (en) 2020-09-28 2020-09-28 Cross-modal retrieval method, device, equipment and medium based on supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011044741.1A CN112148916A (en) 2020-09-28 2020-09-28 Cross-modal retrieval method, device, equipment and medium based on supervision

Publications (1)

Publication Number Publication Date
CN112148916A true CN112148916A (en) 2020-12-29

Family

ID=73896074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011044741.1A Pending CN112148916A (en) 2020-09-28 2020-09-28 Cross-modal retrieval method, device, equipment and medium based on supervision

Country Status (1)

Country Link
CN (1) CN112148916A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010720A (en) * 2021-02-24 2021-06-22 华侨大学 Deep supervision cross-modal retrieval method based on key object characteristics
CN113033622A (en) * 2021-03-05 2021-06-25 北京百度网讯科技有限公司 Training method, device, equipment and storage medium for cross-modal retrieval model
CN113157739A (en) * 2021-04-23 2021-07-23 平安科技(深圳)有限公司 Cross-modal retrieval method and device, electronic equipment and storage medium
CN113220919A (en) * 2021-05-17 2021-08-06 河海大学 Dam defect image text cross-modal retrieval method and model
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast
CN113297485A (en) * 2021-05-24 2021-08-24 中国科学院计算技术研究所 Method for generating cross-modal representation vector and cross-modal recommendation method
CN113360700A (en) * 2021-06-30 2021-09-07 北京百度网讯科技有限公司 Method, device, equipment and medium for training image-text retrieval model and image-text retrieval
CN113408282A (en) * 2021-08-06 2021-09-17 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for topic model training and topic prediction
CN113449070A (en) * 2021-05-25 2021-09-28 北京有竹居网络技术有限公司 Multimodal data retrieval method, device, medium and electronic equipment
CN113779283A (en) * 2021-11-11 2021-12-10 南京码极客科技有限公司 Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN113792207A (en) * 2021-09-29 2021-12-14 嘉兴学院 Cross-modal retrieval method based on multi-level feature representation alignment
CN114691907A (en) * 2022-05-31 2022-07-01 上海蜜度信息技术有限公司 Cross-modal retrieval method, device and medium
CN114840734A (en) * 2022-04-29 2022-08-02 北京百度网讯科技有限公司 Training method of multi-modal representation model, cross-modal retrieval method and device
CN114841243A (en) * 2022-04-02 2022-08-02 中国科学院上海高等研究院 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium
CN114861016A (en) * 2022-07-05 2022-08-05 人民中科(北京)智能技术有限公司 Cross-modal retrieval method and device and storage medium
CN115082930A (en) * 2021-03-11 2022-09-20 腾讯科技(深圳)有限公司 Image classification method and device, electronic equipment and storage medium
WO2023024413A1 (en) * 2021-08-25 2023-03-02 平安科技(深圳)有限公司 Information matching method and apparatus, computer device and readable storage medium
CN115827954B (en) * 2023-02-23 2023-06-06 中国传媒大学 Dynamic weighted cross-modal fusion network retrieval method, system and electronic equipment
CN116775918A (en) * 2023-08-22 2023-09-19 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN116955699A (en) * 2023-07-18 2023-10-27 北京邮电大学 Video cross-mode search model training method, searching method and device
WO2024011814A1 (en) * 2022-07-12 2024-01-18 苏州元脑智能科技有限公司 Image-text mutual retrieval method, system and device, and nonvolatile readable storage medium
CN117891960A (en) * 2024-01-19 2024-04-16 中国科学技术大学 Multi-mode hash retrieval method and system based on adaptive gradient modulation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309331A (en) * 2019-07-04 2019-10-08 哈尔滨工业大学(深圳) A kind of cross-module state depth Hash search method based on self-supervisory

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309331A (en) * 2019-07-04 2019-10-08 哈尔滨工业大学(深圳) A kind of cross-module state depth Hash search method based on self-supervisory

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIANGLI ZHEN et al.: "Deep Supervised Cross-Modal Retrieval", IEEE, 31 January 2020 (2020-01-31), pages 10394 - 10403 *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010720A (en) * 2021-02-24 2021-06-22 华侨大学 Deep supervision cross-modal retrieval method based on key object characteristics
CN113010720B (en) * 2021-02-24 2022-06-07 华侨大学 Deep supervision cross-modal retrieval method based on key object characteristics
CN113033622A (en) * 2021-03-05 2021-06-25 北京百度网讯科技有限公司 Training method, device, equipment and storage medium for cross-modal retrieval model
CN115082930B (en) * 2021-03-11 2024-05-28 腾讯科技(深圳)有限公司 Image classification method, device, electronic equipment and storage medium
CN115082930A (en) * 2021-03-11 2022-09-20 腾讯科技(深圳)有限公司 Image classification method and device, electronic equipment and storage medium
CN113157739A (en) * 2021-04-23 2021-07-23 平安科技(深圳)有限公司 Cross-modal retrieval method and device, electronic equipment and storage medium
CN113220919B (en) * 2021-05-17 2022-04-22 河海大学 Dam defect image text cross-modal retrieval method and model
CN113220919A (en) * 2021-05-17 2021-08-06 河海大学 Dam defect image text cross-modal retrieval method and model
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast
CN113297485A (en) * 2021-05-24 2021-08-24 中国科学院计算技术研究所 Method for generating cross-modal representation vector and cross-modal recommendation method
CN113297485B (en) * 2021-05-24 2023-01-24 中国科学院计算技术研究所 Method for generating cross-modal representation vector and cross-modal recommendation method
CN113449070A (en) * 2021-05-25 2021-09-28 北京有竹居网络技术有限公司 Multimodal data retrieval method, device, medium and electronic equipment
CN113360700B (en) * 2021-06-30 2023-09-29 北京百度网讯科技有限公司 Training of image-text retrieval model, image-text retrieval method, device, equipment and medium
CN113360700A (en) * 2021-06-30 2021-09-07 北京百度网讯科技有限公司 Method, device, equipment and medium for training image-text retrieval model and image-text retrieval
CN113408282A (en) * 2021-08-06 2021-09-17 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for topic model training and topic prediction
CN113408282B (en) * 2021-08-06 2021-11-09 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for topic model training and topic prediction
WO2023024413A1 (en) * 2021-08-25 2023-03-02 平安科技(深圳)有限公司 Information matching method and apparatus, computer device and readable storage medium
CN113792207A (en) * 2021-09-29 2021-12-14 嘉兴学院 Cross-modal retrieval method based on multi-level feature representation alignment
CN113792207B (en) * 2021-09-29 2023-11-17 嘉兴学院 Cross-modal retrieval method based on multi-level feature representation alignment
CN113779283B (en) * 2021-11-11 2022-04-01 南京码极客科技有限公司 Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN113779283A (en) * 2021-11-11 2021-12-10 南京码极客科技有限公司 Fine-grained cross-media retrieval method with deep supervision and feature fusion
CN114841243A (en) * 2022-04-02 2022-08-02 中国科学院上海高等研究院 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium
CN114840734A (en) * 2022-04-29 2022-08-02 北京百度网讯科技有限公司 Training method of multi-modal representation model, cross-modal retrieval method and device
CN114691907A (en) * 2022-05-31 2022-07-01 上海蜜度信息技术有限公司 Cross-modal retrieval method, device and medium
CN114861016A (en) * 2022-07-05 2022-08-05 人民中科(北京)智能技术有限公司 Cross-modal retrieval method and device and storage medium
WO2024011814A1 (en) * 2022-07-12 2024-01-18 苏州元脑智能科技有限公司 Image-text mutual retrieval method, system and device, and nonvolatile readable storage medium
CN115827954B (en) * 2023-02-23 2023-06-06 中国传媒大学 Dynamic weighted cross-modal fusion network retrieval method, system and electronic equipment
CN116955699A (en) * 2023-07-18 2023-10-27 北京邮电大学 Video cross-mode search model training method, searching method and device
CN116955699B (en) * 2023-07-18 2024-04-26 北京邮电大学 Video cross-mode search model training method, searching method and device
CN116775918A (en) * 2023-08-22 2023-09-19 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN116775918B (en) * 2023-08-22 2023-11-24 四川鹏旭斯特科技有限公司 Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning
CN117891960A (en) * 2024-01-19 2024-04-16 中国科学技术大学 Multi-mode hash retrieval method and system based on adaptive gradient modulation

Similar Documents

Publication Publication Date Title
CN112148916A (en) Cross-modal retrieval method, device, equipment and medium based on supervision
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN111914558A (en) Course knowledge relation extraction method and system based on sentence bag attention remote supervision
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN110516095A (en) Weakly supervised depth Hash social activity image search method and system based on semanteme migration
CN111881262A (en) Text emotion analysis method based on multi-channel neural network
CN111475622A (en) Text classification method, device, terminal and storage medium
CN110502757B (en) Natural language emotion analysis method
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN113948217A (en) Medical nested named entity recognition method based on local feature integration
CN113836992A (en) Method for identifying label, method, device and equipment for training label identification model
CN113378938B (en) Edge transform graph neural network-based small sample image classification method and system
CN114882521A (en) Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on multi-branch network
CN106227836B (en) Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN110705384B (en) Vehicle re-identification method based on cross-domain migration enhanced representation
CN115374786A (en) Entity and relationship combined extraction method and device, storage medium and terminal
CN112364166A (en) Method for establishing relation extraction model and relation extraction method
CN111428502A (en) Named entity labeling method for military corpus
CN111723572B (en) Chinese short text correlation measurement method based on CNN convolutional layer and BilSTM
CN111898528B (en) Data processing method, device, computer readable medium and electronic equipment
CN113536015A (en) Cross-modal retrieval method based on depth identification migration
CN111859979A (en) Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium
WO2023134085A1 (en) Question answer prediction method and prediction apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination