CN112148916A - Cross-modal retrieval method, device, equipment and medium based on supervision - Google Patents
Cross-modal retrieval method, device, equipment and medium based on supervision
- Publication number: CN112148916A
- Application number: CN202011044741.1A
- Authority: CN (China)
- Prior art keywords: data, retrieval, text, image, loss
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/583: Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content
- G06F16/538: Information retrieval of still image data; presentation of query results
- G06F16/5846: Information retrieval of still image data; retrieval using metadata automatically derived from the content using extracted text
- G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
- G06N3/045: Neural networks; combinations of networks
- G06N3/08: Neural networks; learning methods
Abstract
The invention discloses a supervision-based cross-modal retrieval method, device, equipment and medium. The method comprises the following steps: extracting features from training sample data of an image modality and a text modality; mapping the extracted image data features and text data features to a common representation space; calculating the loss in the label space, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, then combining these with different weights to obtain the loss function of a retrieval model; optimizing the parameters of the retrieval model by minimizing the loss function; and mapping target retrieval data to the common representation space with the optimized retrieval model, then calculating the similarity between the target retrieval data and the data in the image-text dataset to obtain the corresponding retrieval ranking result. The method thus preserves both the discriminability of data samples with different semantics and the semantic information of the original data, calculates correlations between cross-modal data more effectively, and achieves higher retrieval accuracy.
Description
Technical Field
The invention relates to the technical field of data retrieval, and in particular to a supervision-based cross-modal retrieval method, device, equipment and medium.
Background
With the rapid development of science and technology, the forms in which scientific and technical information is produced and acquired have become increasingly rich. Such information is no longer expressed only as text data but increasingly as mixed data of other modalities, such as pictures and videos, whose forms of expression are more vivid and whose content is richer. Traditional single-modality retrieval methods perform well when querying within one modality; however, because data of different modalities may exhibit feature heterogeneity and weak correlation, their feature vectors differ in dimension and attributes and cannot directly participate in a common calculation, so single-modality retrieval is not suitable for retrieval across multiple modalities. Cross-modal retrieval methods instead exploit the semantic similarity that exists between different modalities to retrieve similar content among multimodal data. Such methods can meet the need for multi-angle intelligent analysis of multimodal scientific and technical information.
The core of cross-modal retrieval is how to measure content similarity between data of different modalities, that is, how to overcome the heterogeneity between them. Representation learning is a general approach to this problem: it aims to design suitable functions that map data of different modalities into a common representation space in which, because the dimensions are consistent, the similarity between data of different modalities can be computed directly. To construct a suitable representation space, researchers have proposed many ways of designing the mapping functions.
Conventional methods use statistical correlation analysis to learn a linear function by optimizing a target statistic, but correlations between multimodal data in the real world are complex, and a linear function cannot fully model the mapping relation. Because deep neural networks perform well in representation learning, many deep-learning-based methods have been used to learn a common representation space for multimodal data. Compared with unsupervised approaches, supervised deep-learning approaches can learn more discriminative representation features and thus better separate data of different classes in the common representation space. Existing supervised cross-modal retrieval methods include learning discriminative features between multimodal data using label information, and learning semantic or discriminative properties within each modality using classification information. Although these methods use classification information, they use it only to learn discriminant features within or across modalities, so they do not fully utilize the semantic information in the raw data.
Disclosure of Invention
Aiming at the defects and improvement needs of the prior art, the invention provides a supervision-based cross-modal retrieval method, device, equipment and medium, and aims to solve the technical problem that existing cross-modal retrieval methods do not fully utilize the semantic information in the original data, resulting in low retrieval accuracy.
In order to achieve the above object, the present invention provides a supervision-based cross-modal retrieval method, which comprises the following steps: S1: respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is obtained from an image-text dataset; S2: mapping the extracted image data features and text data features to a common representation space; S3: respectively calculating the loss of the label space, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image modality and the text modality, and adding them with different weights to obtain the loss function of a retrieval model; S4: optimizing the parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model; S5: mapping target retrieval data to the common representation space using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text dataset, so as to obtain a retrieval ranking result corresponding to the target retrieval data.
Further, the step S1 includes: s11: performing feature extraction on training sample data of an image mode by using a deep convolutional neural network, and adding a first full connection layer after an image extraction sub-network; s12: and performing feature extraction on training sample data of a text mode by using a natural language processing model, and adding a second full connection layer after the text extraction sub-network.
Further, the step S2 includes: and adding a third full connection layer after the first full connection layer and the second full connection layer, and mapping the extracted image data features and text data features to a common representation space through the third full connection layer.
Further, a linear classifier is added after the third fully connected layer to predict the classes of images and texts and compare them with the real classes, thereby calculating the loss of the label space.
Further, the loss function is expressed as: L = λL1 + μL2 + ηL3, wherein:

L1 is the loss of the label space, L1 = (α/n)·||PᵀU − Y||_F + (β/n)·||PᵀV − Y||_F, where n is the number of image-text data pairs, ||·||_F denotes the Frobenius norm, P is the projection matrix of the linear classifier, α and β are the weights corresponding to the image and text prediction labels respectively, and U, V, Y are the representation matrices in the common representation space of the images, the texts and the corresponding labels respectively;

L2 is the loss within individual modalities and between different modalities in the common representation space, L2 = (1/n²)·Σij [log(1+e^Γij) − Sij·Γij] + (1/n²)·Σij [log(1+e^Φij) − Sij·Φij] + (1/n²)·Σij [log(1+e^Θij) − Sij·Θij], where Γij = cos(ui, vj), Φij = cos(ui, uj), Θij = cos(vi, vj), cos is a cosine function used to measure similarity, and Sij is given by a sign function that is 1 if the two represented elements belong to the same class and 0 otherwise;

L3 is the invariance loss between the image modality and the text modality, L3 = (1/n)·||U − V||_F, with f(·; γα) mapping the image modality and g(·; υβ) mapping the text modality, where ui = f(xi; γα) and vj = g(tj; υβ) for the ith image sample xi and the jth text sample tj, and γα and υβ are learnable parameters;

λ, μ, η are the weight coefficients of L1, L2, L3 respectively.
Further, in step S5, calculating the similarity between the target retrieval data and the data in the image-text dataset comprises: calculating a weighted average of the cross-modal data similarity and the same-modality data similarity.
In another aspect, the present invention provides a supervision-based cross-modal retrieval apparatus, comprising:

a feature extraction module for respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is acquired from an image-text dataset;

a common representation space learning module for mapping the extracted image data features and text data features to a common representation space;

a loss function calculation module for respectively calculating the loss of the label space, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, and adding different weights to obtain the loss function of the retrieval model;

a retrieval model optimization module for optimizing the parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;

and a retrieval result determining module for mapping target retrieval data to the common representation space using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text dataset, so as to obtain a retrieval ranking result corresponding to the target retrieval data.
The invention also provides an electronic device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The invention also provides a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method of any of the above.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The method maps the extracted image data features and text data features to a common representation space; calculates the loss of the label space, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, adding different weights to obtain the loss function of the retrieval model; and optimizes the parameters of the retrieval model by minimizing the loss function to obtain the optimized retrieval model. An end-to-end supervised cross-modal deep learning framework is adopted, which makes full use of the classification information of multimodal data to learn the common representation space, preserving the discriminability of data samples with different semantics while eliminating the differences between cross-modal data. Compared with traditional hash-based retrieval methods, the method adopts real-valued cross-modal representation learning, which retains the semantic information of the original data, calculates correlations between cross-modal data more effectively, eliminates the influence of heterogeneity between different modalities, and achieves higher retrieval accuracy.
(2) The present invention makes full use of classification information to learn the common representation space. Meanwhile, it is a real-valued cross-modal representation learning method; compared with binary representation learning or cross-modal hash retrieval methods, it retains the information of the original data and achieves higher retrieval accuracy.
Drawings
FIG. 1 is a schematic flow chart of a cross-modal search method based on supervision according to the present invention;
fig. 2 is a block diagram of a cross-modal search apparatus based on supervision according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it. The features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Also, the various steps or actions in the method descriptions may be interchanged or reordered, as will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings are for the purpose of describing certain embodiments only and are not intended to imply a required sequence unless otherwise indicated where such sequence must be followed. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a cross-modal retrieval method based on supervision, as shown in fig. 1, comprising the following steps:
S1: respectively extracting features from training sample data of an image modality and a text modality, wherein the training sample data is obtained from an image-text dataset;

S2: mapping the extracted image data features and text data features to a common representation space;

S3: respectively calculating the loss of the label space, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image modality and the text modality, and adding different weights to obtain the loss function of a retrieval model;

S4: optimizing the parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;

S5: mapping target retrieval data to the common representation space using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text dataset, so as to obtain a retrieval ranking result corresponding to the target retrieval data.
The present invention supervises the model to learn discriminative features by minimizing the discrimination losses in both the label space and the common representation space. At the same time, modality-invariant features in the common representation space are learned by minimizing the invariance loss between different modalities and using a weight-sharing strategy. Under this learning strategy, pairwise label information and classification information are utilized as fully as possible to ensure that the learned representations are discriminative in semantic structure and invariant across all modalities.
Specifically, the method comprises the following steps:
1. data preprocessing and image text feature extraction
The invention can be applied to cross-modal retrieval between different modalities; this description mainly takes image and text data as an example. Assume n image-text data pairs are defined as {(xi, ti)} for i = 1, ..., n, where xi is the ith image sample and ti is the ith text sample. Each image-text pair corresponds to a semantic label vector yi ∈ {0,1}^c, where c represents the number of categories; if the ith instance belongs to the jth category, the corresponding component of the label vector is 1, otherwise it is 0.
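The label-vector convention just described can be made concrete with a short sketch; the category names here are invented for illustration and are not from the patent.

```python
# Build multi-hot semantic label vectors y_i in {0,1}^c: component j is 1
# exactly when the instance belongs to category j.
categories = ["aerospace", "energy", "biotech", "materials"]   # c = 4, hypothetical
c = len(categories)

def label_vector(instance_categories):
    """Return the c-dimensional {0,1} label vector for an instance."""
    return [1 if cat in instance_categories else 0 for cat in categories]

y = label_vector({"energy", "materials"})
print(y)  # [0, 1, 0, 1]
```

An instance may belong to several categories at once, which is why the vector is multi-hot rather than a single class index.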
The data preprocessing of images comprises resizing, cropping, normalization and the like; the preprocessing of text comprises denoising, word segmentation, stop-word filtering and the like. The feature extraction methods are described below.
For picture data, a deep convolutional neural network is used to extract the feature vector of an image. A convolution kernel performs an element-wise product-and-sum with the input image over its receptive field, projecting the information in the receptive field into one element of the feature map and thereby performing feature extraction. By controlling the dimension of each layer, this part finally expresses the picture as a feature vector of a specific dimension. Generally, convolutional neural networks are multi-layered, and each successive layer represents the picture data more abstractly, so the output of the second-to-last layer can be used as the feature vector. Accordingly, a 4096-dimensional high-level feature vector is extracted from the fc7 layer of the 19-layer VGGNet as the semantic vector of the image. A fully connected layer is then used for common representation learning to obtain the common representation vector ui of the image.
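The receptive-field product-and-sum operation described above can be illustrated with a minimal "valid" 2-D convolution. This is a sketch of the operation only, not of the patent's VGGNet sub-network.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output element is the sum of the
    element-wise product between the kernel and the receptive field it covers."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])        # horizontal-difference kernel
print(conv2d_valid(img, edge))        # every entry is -1: neighbouring pixels differ by 1
```

Stacking many such kernels, with nonlinearities between layers, is what lets the network represent the image progressively more abstractly.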
For text data, a feature extraction algorithm based on a language model and a neural network converts unstructured raw text into a high-level semantic feature vector representation. For the text sub-network, a BERT model (BERT-base, Chinese) is used to extract the feature vectors of Chinese text. BERT uses the Transformer structure to build a multilayer bidirectional encoder network that can directly convert raw text into a high-level semantic vector form with semantic features; a fully connected layer is then used for common representation learning to obtain the common representation vector vj of the text.
A feature extraction module extracts the high-level semantic feature vectors of the scientific and technical information images and texts respectively. The image and text sub-networks, one for the image modality and the other for the text modality, are both trained end to end.
2. Common representation space learning
For the common representation learning module, to ensure that the two sub-networks can learn a common representation space, a fully connected layer with weight sharing is used at the end of the model. Then, based on the assumption that the representation vectors in the common space are well suited to classification, a linear classifier with parameter matrix P is connected to the end of the model to learn discriminant features from the label information. In this way, the method learns cross-modal correlations well and accurately distinguishes the representation features of the common space.
Since image and text data lie in different representation spaces, their similarity cannot be compared directly; the method therefore learns two mapping functions that map images and texts respectively into a common representation space. Define f(·; γα) to map the image modality and g(·; υβ) to map the text modality, where γα and υβ are learnable parameters.

Define the representation matrix of the images in the common space as U = [u1, u2, ..., un], the representation matrix of the texts as V = [v1, v2, ..., vn], and the representation matrix of the corresponding labels as Y = [y1, y2, ..., yn].
3. Training and objective functions of models
The goal of the model is to learn the semantic structure of the data, i.e. to learn a common space where samples from the same semantic class should be similar, even though the data may be from different modalities, and samples from different semantic classes should be dissimilar. To learn discriminative representation features of multimodal data, the method minimizes the discriminative loss in label space and common representation space, while at the same time minimizing the distance between each image text pair representation to reduce cross-modality differences.
In order to preserve the difference of different classes of samples after feature projection, the method assumes that the common representation features are very suitable for classification, and uses a linear classifier to predict semantic labels of the sample features in the common representation space. In particular, a full connectivity layer is used to connect the end of the image modality network and the text modality network. The classifier classifies representative features of training data in a common space to generate a prediction label for each data sample.
The objective function can be divided into three parts. The first part measures the loss in the label space, as shown in equation (1):

L1 = (α/n)·||PᵀU − Y||_F + (β/n)·||PᵀV − Y||_F  (1)

where ||·||_F denotes the Frobenius norm, P is the projection matrix of the linear classifier, and α and β are the weights corresponding to the image and text prediction labels respectively. Since images and texts extract high-level semantic vectors in different ways, the prediction losses of their features mapped into the common representation space are not uniform; different weights are therefore applied to the image and text prediction labels to balance this difference.
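A minimal NumPy sketch of the label-space loss follows, under the reconstruction of equation (1) above; the matrix shapes, random data, and the α, β values are illustrative, not taken from the patent.

```python
import numpy as np

def label_space_loss(P, U, V, Y, alpha=0.5, beta=0.5):
    """L1 = (alpha/n)*||P^T U - Y||_F + (beta/n)*||P^T V - Y||_F, with the
    columns of U, V, Y holding per-sample representations and label vectors."""
    n = U.shape[1]
    img_term = np.linalg.norm(P.T @ U - Y, "fro")
    txt_term = np.linalg.norm(P.T @ V - Y, "fro")
    return (alpha * img_term + beta * txt_term) / n

rng = np.random.default_rng(1)
d, c, n = 16, 4, 8                       # representation dim, classes, sample pairs
U = rng.standard_normal((d, n))          # image representations (one column each)
V = rng.standard_normal((d, n))          # text representations
Y = np.eye(c)[:, rng.integers(0, c, n)]  # one-hot label columns
P = rng.standard_normal((d, c))          # linear-classifier projection matrix
print(label_space_loss(P, U, V, Y))
```

The loss is zero only when the classifier's projections of both modalities reproduce the label matrix exactly, which is what drives the common space toward classification-friendly representations.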
The second part of the objective function directly measures the discriminant loss between data of different modalities and within each modality in the common representation space, as shown in equation (2):

L2 = (1/n²)·Σij [log(1+e^Γij) − Sij·Γij] + (1/n²)·Σij [log(1+e^Φij) − Sij·Φij] + (1/n²)·Σij [log(1+e^Θij) − Sij·Θij]  (2)

where Γij = cos(ui, vj), Φij = cos(ui, uj), Θij = cos(vi, vj), and cos is a cosine function used to measure similarity; Sij is given by a sign function that is 1 if the two represented elements belong to the same class and 0 otherwise.
The likelihood function for measuring the similarity between the modalities is shown in equation (3).
Since the logarithm of the maximum likelihood function, i.e., the minimum likelihood function, takes a negative sign, equation (3) can be reduced to the first line of equation (2). It can be deduced that the cosine similarity cos (u)i,vj) The larger the size of the tube is,ijthe greater, and thus the probability p (1| u)i,vj) The larger, that is, the common representation space is classified by similarity. Similarly, the second and third lines in equation (2) measure the similarity of the respective interior of the image and text sample. Therefore, the formula is a reasonable method for measuring the common representation space similarity and is a criterion for learning the discriminant features.
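The discrimination loss of equation (2), as reconstructed above, can be sketched in NumPy; the mean over all pairs supplies the 1/n² factor, Sij is the same-class indicator, and the toy data are invented for illustration.

```python
import numpy as np

def cos_matrix(A, B):
    """Pairwise cosine similarities between the columns of A and of B."""
    An = A / (np.linalg.norm(A, axis=0, keepdims=True) + 1e-12)
    Bn = B / (np.linalg.norm(B, axis=0, keepdims=True) + 1e-12)
    return An.T @ Bn

def nll_term(Gamma, S):
    """Mean over pairs of the negative log-likelihood log(1+e^Gamma) - S*Gamma."""
    return np.mean(np.logaddexp(0.0, Gamma) - S * Gamma)

def discrimination_loss(U, V, labels):
    """L2: inter-modal term on cos(u_i, v_j) plus intra-modal terms on
    cos(u_i, u_j) and cos(v_i, v_j); S_ij = 1 iff samples i and j share a class."""
    S = (labels[:, None] == labels[None, :]).astype(float)
    return (nll_term(cos_matrix(U, V), S)
            + nll_term(cos_matrix(U, U), S)
            + nll_term(cos_matrix(V, V), S))

# Well-aligned representations of two classes score lower than shuffled ones.
U = np.array([[1.0, 1.0, -1.0, -1.0], [0.2, 0.1, 0.1, 0.2]])
labels = np.array([0, 0, 1, 1])
print(discrimination_loss(U, U, labels))           # paired, well separated
print(discrimination_loss(U, U[:, ::-1], labels))  # text columns shuffled: larger
```

Misaligning the text columns flips same-class cosine similarities negative, which the negative log-likelihood penalizes, matching the intuition given for equation (3).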
To eliminate cross-modal differences, the third part of the objective function minimizes the distance between the representations of all image-text pairs, as shown in equation (4):

L3 = (1/n)·||U − V||_F  (4)

Combining (1), (2) and (4) and applying different weights to each gives the total objective function of the model, as shown in equation (5):

L = λL1 + μL2 + ηL3  (5)
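The invariance loss (4) and the weighted total objective (5) are straightforward to sketch; the weight values λ, μ, η below are placeholders, since the patent does not fix them.

```python
import numpy as np

def invariance_loss(U, V):
    """L3 = (1/n)*||U - V||_F : distance between paired representations,
    with one sample per column."""
    n = U.shape[1]
    return np.linalg.norm(U - V, "fro") / n

def total_loss(L1, L2, L3, lam=1.0, mu=1.0, eta=0.1):
    """Equation (5): weighted sum of the three terms (weights are placeholders)."""
    return lam * L1 + mu * L2 + eta * L3

U = np.ones((4, 3))
V = np.zeros((4, 3))
print(invariance_loss(U, V))   # ||U - V||_F = sqrt(12), divided by n = 3
print(total_loss(0.5, 1.2, invariance_loss(U, V)))
```

Driving L3 toward zero forces each image representation onto its paired text representation, which is exactly the cross-modal difference the third term is meant to remove.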
4. Similarity calculation and result display
Finally, the model is trained and used. The trained model maps target retrieval data to the common representation space, and the similarity between the target retrieval data and the data in the scientific and technical image-text dataset is calculated. In practical use, the similarity of same-modality data can be added to improve retrieval accuracy, as shown in equation (6).
S=αSimilarity(x,U)+βSimilarity(x,V) (6)
where α and β are weights, Similarity(·) is a function for measuring similarity, x is the input image or text data, and S is the final returned score; the results are sorted by S, and the top-ranked data are taken as the final result.
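Equation (6) can be sketched as follows, assuming cosine similarity for Similarity(·) (the patent leaves the measure abstract, though it uses cosine elsewhere); the representations are random stand-ins for data in the common space.

```python
import numpy as np

def cosine(a, B):
    """Cosine similarity between query vector a and each column of B."""
    return (B.T @ a) / (np.linalg.norm(B, axis=0) * np.linalg.norm(a) + 1e-12)

def retrieve(x, U, V, alpha=0.5, beta=0.5, top_k=3):
    """Equation (6): S = alpha*Similarity(x, U) + beta*Similarity(x, V);
    return the indices of the top_k items by descending score, and all scores."""
    S = alpha * cosine(x, U) + beta * cosine(x, V)
    return np.argsort(-S)[:top_k], S

rng = np.random.default_rng(2)
U = rng.standard_normal((8, 5))             # image representations (columns)
V = U + 0.05 * rng.standard_normal((8, 5))  # closely matching text representations
x = U[:, 3]                                 # query identical to image item 3
top, scores = retrieve(x, U, V)
print(top[0])  # 3: the matching item ranks first
```

Blending the two similarity terms means an item scores highly only when both its image-side and text-side representations agree with the query, which is the stated motivation for adding the same-modality similarity.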
In another aspect, the present invention provides a cross-modal search apparatus based on supervision, including:
the feature extraction module 11 is configured to perform feature extraction on training sample data of an image modality and a text modality respectively, wherein the training sample data is acquired from an image-text dataset;

the common representation space learning module 21 is configured to map the extracted image data features and text data features to a common representation space;

the loss function calculation module 31 is configured to calculate the loss of the label space, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image and text modalities, and to add different weights to obtain the loss function of the retrieval model;

the retrieval model optimization module 41 is configured to optimize the parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;

and the retrieval result determining module 51 is configured to map target retrieval data to the common representation space using the optimized retrieval model, and to calculate the similarity between the target retrieval data and the data in the image-text dataset, so as to obtain a retrieval ranking result corresponding to the target retrieval data.
The division of the modules in the above cross-modal retrieval apparatus is merely illustrative; in other embodiments, the apparatus may be divided into different modules as required to complete all or part of its functions.
Each module in the supervision-based cross-modal retrieval apparatus provided in the embodiments of the present application may be implemented in the form of a computer program. The computer program may run on a terminal or a server. The program modules constituted by such a computer program may be stored in the memory of the electronic device, and, when executed by a processor, perform the steps of the method described in the embodiments of the present application.
The embodiments of the present application also provide a computer-readable storage medium: one or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the supervision-based cross-modal retrieval method.
Also provided is a computer program product containing instructions which, when run on a computer, cause the computer to perform the supervision-based cross-modal retrieval method.
Any reference to memory, storage, database, or other medium used herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be readily understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, so that various changes, modifications and substitutions may be made without departing from the spirit and scope of the invention.
Claims (9)
1. A cross-modal retrieval method based on supervision is characterized by comprising the following steps:
S1: respectively performing feature extraction on training sample data of an image modality and a text modality, wherein the training sample data is acquired from an image-text data set;
S2: mapping the extracted image data features and text data features to a common representation space;
S3: respectively calculating the loss of a label space, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image modality and the text modality, and adding different weights to obtain a loss function of a retrieval model;
S4: optimizing parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;
S5: mapping target retrieval data to the common representation space by using the optimized retrieval model, and calculating the similarity between the target retrieval data and the data in the image-text data set, so as to obtain a retrieval ranking result corresponding to the target retrieval data.
2. The supervision-based cross-modal retrieval method of claim 1, wherein step S1 comprises:
S11: performing feature extraction on the training sample data of the image modality by using a deep convolutional neural network, and adding a first fully connected layer after the image extraction sub-network;
S12: performing feature extraction on the training sample data of the text modality by using a natural language processing model, and adding a second fully connected layer after the text extraction sub-network.
3. The supervision-based cross-modal retrieval method of claim 2, wherein step S2 comprises:
adding a third fully connected layer after the first fully connected layer and the second fully connected layer, and mapping the extracted image data features and text data features to the common representation space through the third fully connected layer.
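A minimal sketch of the two-branch structure described in claims 2 and 3, with toy linear layers standing in for the deep convolutional neural network and the natural language processing model; all dimensions and the ReLU activation are illustrative assumptions, not specified by the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, W, b):
    """A fully connected layer with a ReLU activation (assumed here)."""
    return np.maximum(0.0, x @ W + b)

# Hypothetical dimensions: raw image features 16-d, raw text features 12-d,
# common representation space 8-d.
d_img, d_txt, d_common = 16, 12, 8

# First fully connected layer after the image extraction sub-network (S11).
W1, b1 = rng.normal(size=(d_img, d_common)), np.zeros(d_common)
# Second fully connected layer after the text extraction sub-network (S12).
W2, b2 = rng.normal(size=(d_txt, d_common)), np.zeros(d_common)
# Third fully connected layer, shared by both branches, mapping into the
# common representation space (claim 3).
W3, b3 = rng.normal(size=(d_common, d_common)), np.zeros(d_common)

img_feat = rng.normal(size=(5, d_img))  # stand-in for deep CNN outputs
txt_feat = rng.normal(size=(5, d_txt))  # stand-in for NLP-model outputs

U = fc(fc(img_feat, W1, b1), W3, b3)    # image representations in common space
V = fc(fc(txt_feat, W2, b2), W3, b3)    # text representations in common space
```

Because the third layer is shared, both modalities land in the same 8-dimensional space, which is what makes the cross-modal similarity comparisons of step S5 possible.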
4. The supervision-based cross-modal retrieval method of claim 3, wherein a linear classifier is added after the third fully connected layer to predict image and text classes, which are compared with the true classes to calculate the loss of the label space.
5. The supervision-based cross-modal retrieval method of claim 1, wherein
the loss function is expressed as: L = λL1 + μL2 + ηL3, wherein,
L1 is the loss of the label space, L1 = (α/n)‖PᵀU − Y‖F + (β/n)‖PᵀV − Y‖F, where n is the number of image-text data pairs, ‖·‖F denotes the Frobenius norm, P is the projection matrix of the linear classifier, α and β are the weights corresponding to the image and text prediction labels respectively, and U, V, Y are respectively the representation matrix of the images, the representation matrix of the texts, and the matrix of the corresponding labels in the common representation space;
L2 is the loss within each modality and between different modalities in the common representation space, defined over the cosine similarities Γij = cos(ui, vj), Φij = cos(ui, uj) and Θij = cos(vi, vj), where cos is the cosine function used to measure similarity; Sij = sgn(·) is a sign function that equals 1 if the two elements belong to the same class and 0 otherwise; ui = f(xi; yα) maps the image modality and vj = g(tj; υβ) maps the text modality, wherein ui and vj are the common representations of the i-th image sample and the j-th text sample, and yα and υβ are learnable parameters;
L3 is the invariance loss between the image modality and the text modality;
λ, μ, η are respectively the weight coefficients of L1, L2, L3.
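A sketch of the combined loss L = λL1 + μL2 + ηL3 of claim 5. The exact forms of L1 and L2 appear only as images in the original publication, so this sketch assumes the DSCMR-style formulation of the cited non-patent reference (Zhen et al.); the function names, the row-wise sample layout, and the negative log-likelihood form of L2 are assumptions, not the patent's verbatim equations:

```python
import numpy as np

def frobenius(M):
    """Frobenius norm of a matrix."""
    return np.linalg.norm(M, ord="fro")

def label_space_loss(P, U, V, Y, alpha=0.5, beta=0.5):
    """L1: classifier predictions on U, V compared with the true labels Y
    (samples are rows here, so P^T U becomes U @ P)."""
    n = U.shape[0]
    return (alpha / n) * frobenius(U @ P - Y) + (beta / n) * frobenius(V @ P - Y)

def modality_loss(U, V, labels):
    """L2: negative log-likelihood over the cosine similarities Γ, Φ, Θ,
    where S_ij = 1 when samples i and j share a class (assumed form)."""
    def nll(A, B):
        An = A / np.linalg.norm(A, axis=1, keepdims=True)
        Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
        C = An @ Bn.T                                      # cosine similarities
        S = (labels[:, None] == labels[None, :]).astype(float)
        return -np.mean(S * C - np.log1p(np.exp(C)))
    return nll(U, V) + nll(U, U) + nll(V, V)               # Γ, Φ, Θ terms

def invariance_loss(U, V):
    """L3: invariance loss between paired image and text representations."""
    return frobenius(U - V) / U.shape[0]

def total_loss(P, U, V, Y, labels, lam=1.0, mu=1.0, eta=1.0):
    """L = λL1 + μL2 + ηL3."""
    return (lam * label_space_loss(P, U, V, Y)
            + mu * modality_loss(U, V, labels)
            + eta * invariance_loss(U, V))
```

Minimizing this scalar with respect to the network parameters (step S4) is what aligns the two modalities: L1 keeps the common representations discriminative, L2 pulls same-class pairs together across and within modalities, and L3 forces paired image and text representations to coincide.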
6. The supervision-based cross-modal retrieval method of claim 1, wherein the calculation of the similarity between the target retrieval data and the data in the image-text data set in step S5 comprises: calculating the similarity between the target retrieval data and the data in the image-text data set as a weighted average of the cross-modal data similarity and the same-modality data similarity.
7. A supervision-based cross-modal retrieval apparatus, comprising:
a feature extraction module, configured to respectively perform feature extraction on training sample data of an image modality and a text modality, wherein the training sample data is acquired from an image-text data set;
a common representation space learning module, configured to map the extracted image data features and text data features to a common representation space;
a loss function calculation module, configured to respectively calculate the loss of a label space, the losses within each modality and between different modalities in the common representation space, and the invariance loss between the image modality and the text modality, and to add different weights to obtain a loss function of the retrieval model;
a retrieval model optimization module, configured to optimize parameters of the retrieval model by minimizing the loss function to obtain an optimized retrieval model;
and a retrieval result determining module, configured to map target retrieval data to the common representation space by using the optimized retrieval model, and calculate the similarity between the target retrieval data and the data in the image-text data set, so as to obtain a retrieval ranking result corresponding to the target retrieval data.
8. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011044741.1A CN112148916A (en) | 2020-09-28 | 2020-09-28 | Cross-modal retrieval method, device, equipment and medium based on supervision |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011044741.1A CN112148916A (en) | 2020-09-28 | 2020-09-28 | Cross-modal retrieval method, device, equipment and medium based on supervision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112148916A true CN112148916A (en) | 2020-12-29 |
Family
ID=73896074
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011044741.1A Pending CN112148916A (en) | 2020-09-28 | 2020-09-28 | Cross-modal retrieval method, device, equipment and medium based on supervision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112148916A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110309331A (en) * | 2019-07-04 | 2019-10-08 | Harbin Institute of Technology (Shenzhen) | A self-supervised cross-modal deep hashing retrieval method |
Non-Patent Citations (1)
Title |
---|
LIANGLI ZHEN et al.: "Deep Supervised Cross-Modal Retrieval", IEEE, 31 January 2020 (2020-01-31), pages 10394-10403 *
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113010720A (en) * | 2021-02-24 | 2021-06-22 | 华侨大学 | Deep supervision cross-modal retrieval method based on key object characteristics |
CN113010720B (en) * | 2021-02-24 | 2022-06-07 | 华侨大学 | Deep supervision cross-modal retrieval method based on key object characteristics |
CN113033622A (en) * | 2021-03-05 | 2021-06-25 | 北京百度网讯科技有限公司 | Training method, device, equipment and storage medium for cross-modal retrieval model |
CN115082930B (en) * | 2021-03-11 | 2024-05-28 | 腾讯科技(深圳)有限公司 | Image classification method, device, electronic equipment and storage medium |
CN115082930A (en) * | 2021-03-11 | 2022-09-20 | 腾讯科技(深圳)有限公司 | Image classification method and device, electronic equipment and storage medium |
CN113157739A (en) * | 2021-04-23 | 2021-07-23 | 平安科技(深圳)有限公司 | Cross-modal retrieval method and device, electronic equipment and storage medium |
CN113220919B (en) * | 2021-05-17 | 2022-04-22 | 河海大学 | Dam defect image text cross-modal retrieval method and model |
CN113220919A (en) * | 2021-05-17 | 2021-08-06 | 河海大学 | Dam defect image text cross-modal retrieval method and model |
CN113239214A (en) * | 2021-05-19 | 2021-08-10 | 中国科学院自动化研究所 | Cross-modal retrieval method, system and equipment based on supervised contrast |
CN113297485A (en) * | 2021-05-24 | 2021-08-24 | 中国科学院计算技术研究所 | Method for generating cross-modal representation vector and cross-modal recommendation method |
CN113297485B (en) * | 2021-05-24 | 2023-01-24 | 中国科学院计算技术研究所 | Method for generating cross-modal representation vector and cross-modal recommendation method |
CN113449070A (en) * | 2021-05-25 | 2021-09-28 | 北京有竹居网络技术有限公司 | Multimodal data retrieval method, device, medium and electronic equipment |
CN113360700B (en) * | 2021-06-30 | 2023-09-29 | 北京百度网讯科技有限公司 | Training of image-text retrieval model, image-text retrieval method, device, equipment and medium |
CN113360700A (en) * | 2021-06-30 | 2021-09-07 | 北京百度网讯科技有限公司 | Method, device, equipment and medium for training image-text retrieval model and image-text retrieval |
CN113408282A (en) * | 2021-08-06 | 2021-09-17 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for topic model training and topic prediction |
CN113408282B (en) * | 2021-08-06 | 2021-11-09 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for topic model training and topic prediction |
WO2023024413A1 (en) * | 2021-08-25 | 2023-03-02 | 平安科技(深圳)有限公司 | Information matching method and apparatus, computer device and readable storage medium |
CN113792207A (en) * | 2021-09-29 | 2021-12-14 | 嘉兴学院 | Cross-modal retrieval method based on multi-level feature representation alignment |
CN113792207B (en) * | 2021-09-29 | 2023-11-17 | 嘉兴学院 | Cross-modal retrieval method based on multi-level feature representation alignment |
CN113779283B (en) * | 2021-11-11 | 2022-04-01 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method with deep supervision and feature fusion |
CN113779283A (en) * | 2021-11-11 | 2021-12-10 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method with deep supervision and feature fusion |
CN114841243A (en) * | 2022-04-02 | 2022-08-02 | 中国科学院上海高等研究院 | Cross-modal retrieval model training method, cross-modal retrieval method, device and medium |
CN114840734A (en) * | 2022-04-29 | 2022-08-02 | 北京百度网讯科技有限公司 | Training method of multi-modal representation model, cross-modal retrieval method and device |
CN114691907A (en) * | 2022-05-31 | 2022-07-01 | 上海蜜度信息技术有限公司 | Cross-modal retrieval method, device and medium |
CN114861016A (en) * | 2022-07-05 | 2022-08-05 | 人民中科(北京)智能技术有限公司 | Cross-modal retrieval method and device and storage medium |
WO2024011814A1 (en) * | 2022-07-12 | 2024-01-18 | 苏州元脑智能科技有限公司 | Image-text mutual retrieval method, system and device, and nonvolatile readable storage medium |
CN115827954B (en) * | 2023-02-23 | 2023-06-06 | 中国传媒大学 | Dynamic weighted cross-modal fusion network retrieval method, system and electronic equipment |
CN116955699A (en) * | 2023-07-18 | 2023-10-27 | 北京邮电大学 | Video cross-mode search model training method, searching method and device |
CN116955699B (en) * | 2023-07-18 | 2024-04-26 | 北京邮电大学 | Video cross-mode search model training method, searching method and device |
CN116775918A (en) * | 2023-08-22 | 2023-09-19 | 四川鹏旭斯特科技有限公司 | Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning |
CN116775918B (en) * | 2023-08-22 | 2023-11-24 | 四川鹏旭斯特科技有限公司 | Cross-modal retrieval method, system, equipment and medium based on complementary entropy contrast learning |
CN117891960A (en) * | 2024-01-19 | 2024-04-16 | 中国科学技术大学 | Multi-mode hash retrieval method and system based on adaptive gradient modulation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112148916A (en) | Cross-modal retrieval method, device, equipment and medium based on supervision | |
CN106547880B (en) | Multi-dimensional geographic scene identification method fusing geographic area knowledge | |
CN111488931B (en) | Article quality evaluation method, article recommendation method and corresponding devices | |
CN111914558A (en) | Course knowledge relation extraction method and system based on sentence bag attention remote supervision | |
CN112052684A (en) | Named entity identification method, device, equipment and storage medium for power metering | |
CN110516095A (en) | Weakly supervised depth Hash social activity image search method and system based on semanteme migration | |
CN111881262A (en) | Text emotion analysis method based on multi-channel neural network | |
CN111475622A (en) | Text classification method, device, terminal and storage medium | |
CN110502757B (en) | Natural language emotion analysis method | |
CN113051914A (en) | Enterprise hidden label extraction method and device based on multi-feature dynamic portrait | |
CN113948217A (en) | Medical nested named entity recognition method based on local feature integration | |
CN113836992A (en) | Method for identifying label, method, device and equipment for training label identification model | |
CN113378938B (en) | Edge transform graph neural network-based small sample image classification method and system | |
CN114882521A (en) | Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on multi-branch network | |
CN106227836B (en) | Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters | |
CN113657115A (en) | Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion | |
CN110705384B (en) | Vehicle re-identification method based on cross-domain migration enhanced representation | |
CN115374786A (en) | Entity and relationship combined extraction method and device, storage medium and terminal | |
CN112364166A (en) | Method for establishing relation extraction model and relation extraction method | |
CN111428502A (en) | Named entity labeling method for military corpus | |
CN111723572B (en) | Chinese short text correlation measurement method based on CNN convolutional layer and BilSTM | |
CN111898528B (en) | Data processing method, device, computer readable medium and electronic equipment | |
CN113536015A (en) | Cross-modal retrieval method based on depth identification migration | |
CN111859979A (en) | Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium | |
WO2023134085A1 (en) | Question answer prediction method and prediction apparatus, electronic device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |