CN116756363A - Strong-correlation non-supervision cross-modal retrieval method guided by information quantity - Google Patents

Strong-correlation non-supervision cross-modal retrieval method guided by information quantity

Info

Publication number
CN116756363A
CN116756363A (application CN202310657100.0A)
Authority
CN
China
Prior art keywords
image
features
feature
local
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310657100.0A
Other languages
Chinese (zh)
Inventor
蓝如师
戴六连
李芳
杨睿
罗笑南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanning Guidian Electronic Technology Research Institute Co ltd
Guilin University of Electronic Technology
Original Assignee
Nanning Guidian Electronic Technology Research Institute Co ltd
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanning Guidian Electronic Technology Research Institute Co ltd, Guilin University of Electronic Technology filed Critical Nanning Guidian Electronic Technology Research Institute Co ltd
Priority to CN202310657100.0A priority Critical patent/CN116756363A/en
Publication of CN116756363A publication Critical patent/CN116756363A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of cross-modal retrieval, and in particular to a strong-correlation unsupervised cross-modal retrieval method guided by information quantity. The method is realized by the following steps: first, local and global features of the images and features of the texts are extracted; the local and global image features are enhanced; the enhanced local features are regularized; the global and local image features are then orthogonally fused by an image feature fusion network; next, the image features and text features are fused by a multi-mode fusion network according to the ratio of the information quantities of the features of the different modalities; finally, the features of the different modalities are mapped into hash codes, and similarity ranking by Hamming distance yields the retrieval result. By focusing on the enhancement and fusion of data features, the invention captures more semantic information and improves retrieval efficiency.

Description

Strong-correlation non-supervision cross-modal retrieval method guided by information quantity
Technical Field
The invention relates to the technical field of cross-modal retrieval, and in particular to a strong-correlation unsupervised cross-modal retrieval method guided by information quantity.
Background
With the rapid growth of text, image, video and audio data, the diversity of multimedia data has been greatly enriched, and users' retrieval needs on the Internet are evolving from single-modal towards cross-modal. In the computer field, a retrieval task aims to find data that is semantically similar to a given query, whereas cross-modal retrieval takes data of one type as the query to retrieve related data of another type, for example using an input text query to retrieve related pictures or video data. How to efficiently retrieve database data of a different modality from the query data is the main challenge currently faced by cross-modal retrieval.
Existing cross-modal hashing methods have made considerable progress and fall into two main categories: supervised and unsupervised. Supervised retrieval methods focus on using existing labels to construct intra-modal and inter-modal similarity relations for measuring the similarity of data from different modalities, whereas unsupervised methods aim to discover structural information in the data itself to compare similarity; they do not rely on manual annotation and are therefore better suited to the real world. Traditional hash retrieval methods extract handcrafted features, separating feature extraction from hash learning, which reduces the discriminability of the hash codes to some extent. In recent years, thanks to the strong feature-extraction capability of deep learning, extracting effective representations of different modalities with neural networks, mapping them into a common Hamming space, and establishing high-level semantic associations between modalities has become an effective cross-modal hash retrieval approach. However, existing methods ignore the co-occurrence information in image-text pairs and the semantic correlation between the feature vectors of one modality and the corresponding hash codes of the other modality, and therefore cannot accurately capture the relationships between data of different modalities.
Disclosure of Invention
The invention aims to provide a strong-correlation unsupervised cross-modal retrieval method guided by information quantity, in order to solve the technical problems of poor retrieval efficiency and accuracy caused by existing cross-modal retrieval methods ignoring image-text co-occurrence information and the semantic correlation between different modalities.
In order to achieve the above object, the present invention provides a strong-correlation unsupervised cross-modal retrieval method guided by information amount, comprising the steps of:
step 1: preprocessing an image data set and a text data set respectively, and extracting features of the processed image and text data to obtain image local features, image global features and text features;
step 2: processing to obtain local features of the enhanced image and global features of the enhanced image, and generating a complete text feature vector;
step 3: regularization processing is carried out on the local features of the enhanced image, to avoid repeatedly attending to the same regional feature in the image and to reduce the similarity between the enhanced local features;
step 4: inputting the global features of the enhanced image and the regularized local features of the enhanced image into an image feature fusion network to obtain a complete image feature vector;
step 5: inputting the complete text feature vector and the complete image feature vector into a multi-mode fusion network, and fusing the image feature and the text feature according to the information quantity conversion proportion to obtain a multi-mode fusion feature vector;
step 6: and mapping the complete feature vectors of different modes into respective hash codes to respectively obtain hash vectors of the images and the texts, and calculating Hamming distances between different modes through the hash vectors, wherein the smaller the distance is, the larger the similarity is, and the higher the retrieval precision is.
Optionally, the image dataset and the text dataset are the public MIRFlickr and NUS-WIDE datasets, both of which contain image and text data; during preprocessing, the image data is cropped, reshaped and mapped into one-dimensional feature vectors, position information is embedded to form the local feature vectors, and all local feature vectors are concatenated to obtain the global feature vector; for text data, the original latent Dirichlet allocation topic vector is used as the original text feature.
Optionally, in obtaining the enhanced local image features and the enhanced global image features, the local image feature vectors are input into the feature extraction network of the local attention mechanism module to obtain local image features with stronger representational capability, and the global image features are input into the channel attention module and the spatial attention module for enhancement.
Optionally, after the position information is embedded, the local attention mechanism module follows the formulas:
W_I = σ(WX + d)W
X′ = X + W_I X
where σ(·) is the Sigmoid function, W and d are the shared parameters of the fully connected layer, and X is the image feature vector.
Optionally, the regularization process uses the following formula:
min Re = ||X′_i^T ⊙ (X′_i − I)||_2
where ⊙ denotes matrix multiplication and I is the identity matrix.
Optionally, in inputting the enhanced global image feature and the regularized enhanced local image features into the image feature fusion network to obtain the complete image feature vector, the image fusion network uses a DOLG model for orthogonal fusion: the projection of each local feature X′_{i,k} onto the global feature X_g is calculated first, where k indexes the blocks of each image, i = 1, …, m, and X′_{i,k} denotes the k-th local feature of the i-th picture; after the projection matrix is obtained, the orthogonal components are calculated, and the orthogonal components are combined with the global feature through a Hadamard operation to obtain the complete image fusion feature.
Optionally, after the complete text feature vector and the complete image feature vector are input into the multi-mode fusion network, the similarity matrices of the different modalities are calculated from the generated image features and text features, and the similarity within the image/text modality is calculated with the cosine formula applied to the image fusion features and the text features, respectively.
Optionally, the Hamming distance is reflected by the angular distance between the hash codes, and the neighbor relations in Hamming space are described by a pairwise cosine similarity matrix whose (i, j) entry represents the similarity between the i-th and the j-th instance, with b_{I,i} denoting the hash vector of the i-th instance in the image hash matrix B_I.
The invention provides a strong-correlation unsupervised cross-modal retrieval method guided by information quantity, realized by the following steps: first, local and global features of the images and features of the texts are extracted; the local and global image features are enhanced; the enhanced local features are regularized; the global and local image features are then orthogonally fused by an image feature fusion network; next, the image features and text features are fused by a multi-mode fusion network according to the ratio of the information quantities of the features of the different modalities; finally, the features of the different modalities are mapped into hash codes, and similarity ranking by Hamming distance yields the retrieval result. By focusing on the enhancement and fusion of data features, the invention captures more semantic information and improves retrieval efficiency, solving the technical problems of poor retrieval efficiency and accuracy caused by existing cross-modal retrieval methods ignoring image-text co-occurrence information and the semantic correlation of different modalities.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of the information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of the present invention.
Fig. 2 is a schematic diagram of an image feature enhancement flow of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The invention provides a strong-correlation non-supervision cross-modal retrieval method guided by information quantity, which comprises the following steps:
s1: preprocessing an image data set and a text data set respectively, and extracting features of the processed image and text data to obtain image local features, image global features and text features;
s2: processing to obtain local features of the enhanced image and global features of the enhanced image, and generating a complete text feature vector;
s3: regularization processing is carried out on the local features of the enhanced image, so that the repeated attention of a certain regional feature in the image is avoided, and the similarity between the enhanced local features is reduced;
s4: inputting the global features of the enhanced image and the regularized local features of the enhanced image into an image feature fusion network to obtain a complete image feature vector;
s5: inputting the complete text feature vector and the complete image feature vector into a multi-mode fusion network, and fusing the image feature and the text feature according to the information quantity conversion proportion to obtain a multi-mode fusion feature vector;
s6: and mapping the complete feature vectors of different modes into respective hash codes to respectively obtain hash vectors of the images and the texts, and calculating Hamming distances between different modes through the hash vectors, wherein the smaller the distance is, the larger the similarity is, and the higher the retrieval precision is.
Specifically, fig. 1 is a schematic flow chart of a strong-correlation non-supervision cross-modal retrieval method guided by information quantity.
The datasets adopted in step S1 are the public MIRFlickr and NUS-WIDE datasets, both of which contain image and text data. The specific process of step S1 is as follows:
1.1 For image data, the short side of the picture is scaled to 256 and the other side is scaled proportionally; the picture is then randomly cropped around the center into a 224 x 224 square. The cropped image is reshaped into a series of two-dimensional image blocks, each block is mapped into a one-dimensional feature vector, and position information is embedded to form the local feature vectors; all local feature vectors are concatenated to obtain the global feature vector. For text data, the original Latent Dirichlet Allocation (LDA) topic vector is used as the original text feature.
1.2 For each image, the image features consist of a global feature vector I_k, representing the global feature of the k-th image, and local feature vectors h_i, each representing the feature of the i-th image block; m is the total number of images and D_I is the image feature dimension, with 4094-dimensional image features extracted here. n is the number of blocks each image is cut into and is set to 9 here. The text features are recorded as text feature vectors T_k, with D_T the text feature dimension.
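As an illustration of steps 1.1 and 1.2, the following is a minimal PyTorch-style sketch of one possible patch-splitting and embedding routine. It is a sketch under stated assumptions: the 3x3 block grid, the PatchEmbed module name, the linear projection and the learned position embedding are illustrative choices, not the patent's implementation.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Step 1.1: scale the short side to 256, then crop a 224x224 square.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])

class PatchEmbed(nn.Module):
    """Split an image into n blocks, map each block to a 1-D feature vector
    and embed position information; the global feature is taken here as the
    concatenation of all local feature vectors (per step 1.1)."""
    def __init__(self, n_blocks=9, feat_dim=4096):   # feat_dim stands in for D_I (assumed)
        super().__init__()
        self.block = 224 // 3                         # 3x3 grid of ~74x74 blocks (assumed layout)
        self.proj = nn.Linear(3 * self.block * self.block, feat_dim)
        self.pos = nn.Parameter(torch.zeros(n_blocks, feat_dim))   # position embedding

    def forward(self, img):                           # img: (3, 224, 224)
        b = self.block
        blocks = img.unfold(1, b, b).unfold(2, b, b)  # (3, 3, 3, b, b)
        blocks = blocks.permute(1, 2, 0, 3, 4).reshape(-1, 3 * b * b)
        local = self.proj(blocks) + self.pos          # local feature vectors h_1..h_n
        global_feat = local.reshape(-1)               # concatenated global feature vector
        return local, global_feat
```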
In step S2, the image feature enhancement flow shown in fig. 2 uses a ResNet network as the base network. When the input is the whole image, a global channel attention module and a global spatial attention module are used for feature enhancement; when the input is an image block, position information is embedded first and a local attention module then applies weights to the local features. The local attention module works as follows:
W_I = σ(WX + d)W (1)
X′ = X + W_I X (2)
where σ(·) is the Sigmoid function, W and d are the shared parameters of the fully connected layer, and X is the image feature vector.
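A minimal PyTorch sketch of equations (1) and (2) follows. It reads W_I X as an element-wise re-weighting of the features and treats W and d as the weight and bias of a single shared fully connected layer; both points are assumptions where the text above is ambiguous.

```python
import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    """Local attention of Eqs. (1)-(2): W_I = sigma(W X + d) W,  X' = X + W_I X."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)                 # shared parameters: W (weight) and d (bias)

    def forward(self, x):                             # x: (n_blocks, dim) position-embedded locals
        w, d = self.fc.weight, self.fc.bias
        w_i = torch.sigmoid(x @ w.t() + d) @ w        # Eq. (1): W_I = sigma(W X + d) W
        return x + w_i * x                            # Eq. (2), W_I applied element-wise (assumed)
```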
The regularization in step S3 is performed with the following formula:
min Re = ||X′_i^T ⊙ (X′_i − I)||_2 (3)
where ⊙ denotes matrix multiplication and I is the identity matrix.
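The decorrelation intent of Eq. (3), keeping the enhanced local features of one image dissimilar so that attention does not repeatedly focus on the same region, can be sketched as below. The sketch penalizes the off-diagonal entries of the Gram matrix of the L2-normalized local features, i.e. ||X′X′^T − I||; this is an interpretation of the formula, not a verbatim transcription.

```python
import torch
import torch.nn.functional as F

def local_feature_regularization(x_local):
    """Eq. (3) sketch: push the enhanced local features of one image apart."""
    x = F.normalize(x_local, dim=-1)                  # (n_blocks, dim), unit-norm rows
    gram = x @ x.t()                                  # pairwise similarities between local features
    eye = torch.eye(gram.size(0), device=gram.device)
    return torch.linalg.norm(gram - eye)              # Frobenius (L2) norm of the residual
```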
The image fusion network in step S4 uses a DOLG model for orthogonal fusion. First, the projection of each local feature X′_{i,k} onto the global feature X_g is calculated, where k indexes the blocks of each image, i = 1, …, m, and X′_{i,k} denotes the k-th local feature of the i-th picture. After the projection matrix is obtained, the orthogonal components are computed by removing these projections from the local features, and the orthogonal components are combined with the global feature through a Hadamard operation to obtain the complete image fusion feature.
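The projection and orthogonal decomposition of step S4 follow the standard DOLG formulation; the sketch below is a hedged reading of it. The final combination step (mean-pooling the orthogonal components and concatenating them with the global feature) is an assumption, since the exact operands of the Hadamard operation are not fully specified in the text above.

```python
import torch

def orthogonal_fusion(local_feats, global_feat, eps=1e-12):
    """DOLG-style orthogonal fusion (step S4 sketch).
    local_feats: (n_blocks, dim) regularized enhanced local features X'
    global_feat: (dim,) enhanced global feature X_g (same dimension assumed)."""
    g = global_feat
    coef = (local_feats @ g) / (g @ g + eps)          # projection coefficients onto X_g
    proj = coef.unsqueeze(1) * g.unsqueeze(0)         # projection of each local feature onto X_g
    orth = local_feats - proj                         # orthogonal components
    fused = torch.cat([orth.mean(dim=0), g])          # complete image feature vector (assumed form)
    return fused
```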
The specific process of step S5 is as follows:
Using the generated image features and text features, the similarity matrices of the different modalities are calculated separately, with the similarity within the image/text modality measured by the cosine formula. The similarity between the image data is computed in this way, and likewise for the text data. Since the text data describes the image content, the proportions of high and low similarity should be comparable across the two modalities. Text features generally have stronger representational capability than image features; to better integrate the features of the different modalities, the image feature matrix information amount H(G_I; S_I) and the text feature matrix information amount H(G_T; S_T) are computed, where S_I and S_T are the similarity matrices of the images and the text, respectively. To simplify the calculation of the information amount, G_I and G_T are taken to be equal.
The feature information amounts of the image and the text are obtained from formula (8), and feature fusion is carried out according to the ratio of the information amounts of the different modalities.
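Since the body of formula (8) is not reproduced above, the sketch below uses the Shannon entropy of each modality's row-normalized intra-modal similarity matrix as a stand-in for the information amount H(G; S), and fuses the image and text features (assumed to share a common dimension) by the resulting ratio. The entropy definition, the softmax normalization and the additive fusion are all assumptions; only the idea of fusing according to the information-quantity ratio comes from the text above.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(feats):
    """Intra-modal cosine similarity matrix used in step S5."""
    f = F.normalize(feats, dim=1)
    return f @ f.t()

def information_amount(sim):
    """Stand-in for Eq. (8): mean Shannon entropy of the row-normalized similarities."""
    p = F.softmax(sim, dim=1)
    return -(p * torch.log(p + 1e-12)).sum(dim=1).mean()

def fuse_by_information_ratio(img_feats, txt_feats):
    """Step S5 sketch: weight each modality by its share of the total information amount."""
    h_i = information_amount(cosine_similarity_matrix(img_feats))
    h_t = information_amount(cosine_similarity_matrix(txt_feats))
    w_i, w_t = h_i / (h_i + h_t), h_t / (h_i + h_t)   # information-quantity ratio
    return w_i * img_feats + w_t * txt_feats          # fusion assumed additive
```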
The specific process of step S6 is as follows:
The complete image feature vector and the complete text feature vector are input into the hash layers of their respective networks to generate the image hash codes B_I and the text hash codes B_T. The angular distance between hash codes reflects their Hamming distance; therefore, to describe the neighbor relations in Hamming space, a pairwise cosine similarity matrix is calculated, whose (i, j) entry represents the similarity between the i-th and the j-th instance, with b_{I,i} denoting the hash vector of the i-th instance in B_I and b_{T,j} obtained in the same way from B_T. The loss function maximizes the semantic information carried by the hash codes.
The similarity information of the hash codes is aligned with the intra-modality information using equation (11) and with the inter-modality information using equation (12):
s.t. (a, b) ∈ {(I, I), (I, T), (T, T)} (12)
where the alignment target is the multimodal fusion matrix from step S5. Finally, the overall objective function is obtained by combining these terms.
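Because the bodies of the loss formulas referenced above are not reproduced in this text, the following is only a hedged sketch of the alignment idea: the cosine similarity of each hash-code pair (a, b) in {(I, I), (I, T), (T, T)} is pulled toward the multimodal fusion similarity matrix of step S5. The squared-error form and the unweighted sum over pairs are assumptions.

```python
import torch
import torch.nn.functional as F

def hash_cosine_matrix(b_a, b_b):
    """Pairwise cosine similarities between (relaxed) hash vectors; their angular
    distance reflects the Hamming distance of the corresponding hash codes."""
    return F.normalize(b_a, dim=1) @ F.normalize(b_b, dim=1).t()

def alignment_loss(b_img, b_txt, s_fusion):
    """Sketch of the alignment terms, s.t. (a, b) in {(I, I), (I, T), (T, T)}."""
    pairs = [(b_img, b_img), (b_img, b_txt), (b_txt, b_txt)]
    return sum(((hash_cosine_matrix(a, b) - s_fusion) ** 2).mean() for a, b in pairs)
```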
After multiple rounds of training with this objective function, the optimal hash mapping functions for the different modalities are learned. When query data is input, it is mapped to the corresponding hash vector with the learned hash mapping function, and the Hamming distances between the query hash vector and the sample hash vectors in the retrieval database are calculated; the smaller the Hamming distance, the higher the similarity, and the retrieval results are output in order of increasing Hamming distance.
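Retrieval at query time thus reduces to Hamming-distance ranking; a minimal sketch for hash codes with values in {-1, +1}, where the Hamming distance equals (bits minus inner product) divided by two, is shown below. The top-k cutoff is an illustrative parameter, not part of the patent.

```python
import torch

def hamming_retrieve(query_hash, db_hashes, topk=10):
    """Rank database items by Hamming distance to the query hash vector.
    query_hash: (bits,) in {-1, +1};  db_hashes: (num_items, bits) in {-1, +1}."""
    bits = db_hashes.size(1)
    dist = (bits - db_hashes @ query_hash) / 2        # Hamming distance for +/-1 codes
    return torch.argsort(dist)[:topk]                 # smallest distance = most similar
```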
The above disclosure is only a preferred embodiment of the present invention, and it should be understood that the scope of the invention is not limited thereto, and those skilled in the art will appreciate that all or part of the procedures described above can be performed according to the equivalent changes of the claims, and still fall within the scope of the present invention.

Claims (8)

1. An information-quantity-guided strong-correlation unsupervised cross-modal retrieval method, characterized by comprising the steps of:
step 1: preprocessing an image data set and a text data set respectively, and extracting features of the processed image and text data to obtain image local features, image global features and text features;
step 2: processing to obtain local features of the enhanced image and global features of the enhanced image, and generating a complete text feature vector;
step 3: regularization processing is carried out on the local features of the enhanced image, to avoid repeatedly attending to the same regional feature in the image and to reduce the similarity between the enhanced local features;
step 4: inputting the global features of the enhanced image and the regularized local features of the enhanced image into an image feature fusion network to obtain a complete image feature vector;
step 5: inputting the complete text feature vector and the complete image feature vector into a multi-mode fusion network, and fusing the image feature and the text feature according to the information quantity conversion proportion to obtain a multi-mode fusion feature vector;
step 6: and mapping the complete feature vectors of different modes into respective hash codes to respectively obtain hash vectors of the images and the texts, and calculating Hamming distances between different modes through the hash vectors, wherein the smaller the distance is, the larger the similarity is, and the higher the retrieval precision is.
2. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 1,
wherein the image dataset and the text dataset are the public MIRFlickr and NUS-WIDE datasets, both of which contain image and text data; during preprocessing, the image data is cropped, reshaped and mapped into one-dimensional feature vectors, position information is embedded to form the local feature vectors, and all local feature vectors are concatenated to obtain the global feature vector; for text data, the original latent Dirichlet allocation topic vector is used as the original text feature.
3. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 2,
wherein, in obtaining the enhanced local image features and the enhanced global image features, the local image feature vectors are input into the feature extraction network of the local attention mechanism module to obtain local image features with stronger representational capability, and the global image features are input into the channel attention module and the spatial attention module for enhancement.
4. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 3,
wherein, after the position information is embedded, the local attention mechanism module follows the formulas:
W_I = σ(WX + d)W
X′ = X + W_I X
where σ(·) is the Sigmoid function, W and d are the shared parameters of the fully connected layer, and X is the image feature vector.
5. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 4,
wherein the regularization process employs the following formula:
min Re = ||X′_i^T ⊙ (X′_i − I)||_2
where ⊙ denotes matrix multiplication and I is the identity matrix.
6. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 5,
wherein, in inputting the enhanced global image feature and the regularized enhanced local image features into the image feature fusion network to obtain the complete image feature vector, the image fusion network uses a DOLG model for orthogonal fusion: the projection of each local feature X′_{i,k} onto the global feature X_g is calculated first, where k indexes the blocks of each image, i = 1, …, m, and X′_{i,k} denotes the k-th local feature of the i-th picture; after the projection matrix is obtained, the orthogonal components are calculated, and the orthogonal components are combined with the global feature through a Hadamard operation to obtain the complete image fusion feature.
7. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 6,
wherein, after the complete text feature vector and the complete image feature vector are input into the multi-mode fusion network, the similarity matrices of the different modalities are calculated from the generated image features and text features, and the similarity within the image/text modality is calculated with the cosine formula applied to the image fusion features and the text features, respectively.
8. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 7,
wherein the Hamming distance is reflected by the angular distance between the hash codes, and the neighbor relations in Hamming space are described by a pairwise cosine similarity matrix whose (i, j) entry represents the similarity between the i-th and the j-th instance, with b_{I,i} denoting the hash vector of the i-th instance in the image hash matrix B_I.
CN202310657100.0A 2023-06-05 2023-06-05 Strong-correlation non-supervision cross-modal retrieval method guided by information quantity Pending CN116756363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310657100.0A CN116756363A (en) 2023-06-05 2023-06-05 Strong-correlation non-supervision cross-modal retrieval method guided by information quantity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310657100.0A CN116756363A (en) 2023-06-05 2023-06-05 Strong-correlation non-supervision cross-modal retrieval method guided by information quantity

Publications (1)

Publication Number Publication Date
CN116756363A true CN116756363A (en) 2023-09-15

Family

ID=87950685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310657100.0A Pending CN116756363A (en) 2023-06-05 2023-06-05 Strong-correlation non-supervision cross-modal retrieval method guided by information quantity

Country Status (1)

Country Link
CN (1) CN116756363A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874278A (en) * 2024-03-11 2024-04-12 盛视科技股份有限公司 Image retrieval method and system based on multi-region feature combination
CN117874278B (en) * 2024-03-11 2024-05-28 盛视科技股份有限公司 Image retrieval method and system based on multi-region feature combination

Similar Documents

Publication Publication Date Title
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
US10922350B2 (en) Associating still images and videos
Kaur et al. Comparative analysis on cross-modal information retrieval: A review
CN108694225B (en) Image searching method, feature vector generating method and device and electronic equipment
Wang et al. Self-constraining and attention-based hashing network for bit-scalable cross-modal retrieval
Wang et al. Annotating images by mining image search results
CN106202256B (en) Web image retrieval method based on semantic propagation and mixed multi-instance learning
US8594468B2 (en) Statistical approach to large-scale image annotation
CN113779361A (en) Construction method and application of cross-modal retrieval model based on multi-layer attention mechanism
Xiao et al. Convolutional hierarchical attention network for query-focused video summarization
CN111079444A (en) Network rumor detection method based on multi-modal relationship
CN112241468A (en) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
Cornia et al. Explaining digital humanities by aligning images and textual descriptions
Gao et al. A hierarchical recurrent approach to predict scene graphs from a visual‐attention‐oriented perspective
CN111782852B (en) Deep learning-based high-level semantic image retrieval method
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN111461175A (en) Label recommendation model construction method and device of self-attention and cooperative attention mechanism
Caicedo et al. Multimodal fusion for image retrieval using matrix factorization
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN117765450B (en) Video language understanding method, device, equipment and readable storage medium
CN116756363A (en) Strong-correlation non-supervision cross-modal retrieval method guided by information quantity
Wang et al. Listen, look, and find the one: Robust person search with multimodality index
Papapanagiotou et al. Improving concept-based image retrieval with training weights computed from tags
Lu et al. Inferring user image-search goals under the implicit guidance of users

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination