CN116756363A - Strong-correlation non-supervision cross-modal retrieval method guided by information quantity - Google Patents

Strong-correlation non-supervision cross-modal retrieval method guided by information quantity

Info

Publication number
CN116756363A
CN116756363A (application CN202310657100.0A)
Authority
CN
China
Prior art keywords
image
features
feature
local
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310657100.0A
Other languages
Chinese (zh)
Inventor
蓝如师
戴六连
李芳
杨睿
罗笑南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanning Guidian Electronic Technology Research Institute Co ltd
Guilin University of Electronic Technology
Original Assignee
Nanning Guidian Electronic Technology Research Institute Co ltd
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanning Guidian Electronic Technology Research Institute Co ltd, Guilin University of Electronic Technology filed Critical Nanning Guidian Electronic Technology Research Institute Co ltd
Priority to CN202310657100.0A priority Critical patent/CN116756363A/en
Publication of CN116756363A publication Critical patent/CN116756363A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of cross-modal retrieval, and in particular to a strong-correlation unsupervised cross-modal retrieval method guided by information quantity. The method is realized by the following steps: first, local and global features of the images and features of the texts are extracted; the local and global image features are enhanced; the enhanced local features are regularized; the global and local image features are then orthogonally fused by an image feature fusion network; next, the image features and text features are fused by a multi-mode fusion network according to the ratio of the information quantities of the features of the different modalities; finally, the features of the different modalities are mapped into hash codes, and similarity ranking by Hamming distance yields the retrieval result. By focusing on the enhancement and fusion of data features, the invention captures more semantic information and improves retrieval efficiency.

Description

Strong-correlation non-supervision cross-modal retrieval method guided by information quantity
Technical Field
The invention relates to the technical field of cross-modal retrieval, and in particular to a strong-correlation unsupervised cross-modal retrieval method guided by information quantity.
Background
With the rapid growth of text, image, video and audio data, the diversity of multimedia data has been greatly enriched, and users' retrieval needs on the Internet are evolving from single-modal towards cross-modal. In the computer field, a retrieval task aims to find data that is semantically similar to a given query, whereas cross-modal retrieval takes data of one type as the query to retrieve related data of another type, for example using an input text query to retrieve related pictures or video data. How to efficiently retrieve database data of a different modality from the query data is the main challenge currently faced by cross-modal retrieval.
Existing cross-modal hashing methods have made considerable progress and fall into two main categories: supervised and unsupervised. Supervised retrieval methods focus on using existing labels to construct intra-modal and inter-modal similarity relations for measuring the similarity of data from different modalities, whereas unsupervised methods aim to discover structural information in the data itself to compare similarity; they do not rely on manual annotation and are therefore better suited to the real world. Traditional hash retrieval methods extract handcrafted features, separating feature extraction from hash learning, which reduces the discriminability of the hash codes to some extent. In recent years, thanks to the strong feature-extraction capability of deep learning, extracting effective representations of different modalities with neural networks, mapping them into a common Hamming space, and establishing high-level semantic associations between modalities has become an effective cross-modal hash retrieval approach. However, existing methods ignore the co-occurrence information in image-text pairs and the semantic correlation between the feature vectors of one modality and the corresponding hash codes of the other modality, and therefore cannot accurately capture the relationships between data of different modalities.
Disclosure of Invention
The invention aims to provide a strong-correlation unsupervised cross-modal retrieval method guided by information quantity, in order to solve the technical problems of poor retrieval efficiency and accuracy caused by existing cross-modal retrieval methods ignoring image-text co-occurrence information and the semantic correlation between different modalities.
In order to achieve the above object, the present invention provides a strong-correlation unsupervised cross-modal retrieval method guided by information amount, comprising the steps of:
step 1: preprocessing an image data set and a text data set respectively, and extracting features of the processed image and text data to obtain image local features, image global features and text features;
step 2: processing to obtain local features of the enhanced image and global features of the enhanced image, and generating a complete text feature vector;
step 3: regularization processing is carried out on the local features of the enhanced image, to avoid repeatedly attending to the same regional feature in the image and to reduce the similarity between the enhanced local features;
step 4: inputting the global features of the enhanced image and the regularized local features of the enhanced image into an image feature fusion network to obtain a complete image feature vector;
step 5: inputting the complete text feature vector and the complete image feature vector into a multi-mode fusion network, and fusing the image feature and the text feature according to the information quantity conversion proportion to obtain a multi-mode fusion feature vector;
step 6: and mapping the complete feature vectors of different modes into respective hash codes to respectively obtain hash vectors of the images and the texts, and calculating Hamming distances between different modes through the hash vectors, wherein the smaller the distance is, the larger the similarity is, and the higher the retrieval precision is.
Optionally, the image dataset and the text dataset are the public MIRFlickr and NUS-WIDE datasets, both of which contain image and text data; during preprocessing, the image data is cropped, reshaped and mapped into one-dimensional feature vectors, position information is embedded to form the local feature vectors, and all local feature vectors are concatenated to obtain the global feature vector; for text data, the original latent Dirichlet allocation topic vector is used as the original text feature.
Optionally, in obtaining the enhanced local image features and the enhanced global image features, the local image feature vectors are input into the feature extraction network of the local attention mechanism module to obtain local image features with stronger representational capability, and the global image features are input into the channel attention module and the spatial attention module for enhancement.
Optionally, after the position information is embedded, the local attention mechanism module follows the formulas:
W_I = σ(WX + d)W
X′ = X + W_I X
where σ(·) is the Sigmoid function, W and d are the shared parameters of the fully connected layer, and X is the image feature vector.
Optionally, the regularization process uses the following formula:
min Re = ||X′_i^T ⊙ (X′_i − I)||_2
where ⊙ denotes matrix multiplication and I is the identity matrix.
Optionally, in inputting the enhanced global image feature and the regularized enhanced local image features into the image feature fusion network to obtain the complete image feature vector, the image fusion network uses a DOLG model for orthogonal fusion: the projection of each local feature X′_{i,k} onto the global feature X_g is calculated first, where k indexes the blocks of each image, i = 1, …, m, and X′_{i,k} denotes the k-th local feature of the i-th picture; after the projection matrix is obtained, the orthogonal components are calculated, and the orthogonal components are combined with the global feature through a Hadamard operation to obtain the complete image fusion feature.
Optionally, after the complete text feature vector and the complete image feature vector are input into the multi-mode fusion network, the similarity matrices of the different modalities are calculated from the generated image features and text features, and the similarity within the image/text modality is calculated with the cosine formula applied to the image fusion features and the text features, respectively.
Optionally, the Hamming distance is reflected by the angular distance between the hash codes, and the neighbor relations in Hamming space are described by a pairwise cosine similarity matrix whose (i, j) entry represents the similarity between the i-th and the j-th instance, with b_{I,i} denoting the hash vector of the i-th instance in the image hash matrix B_I.
The invention provides a strong-correlation unsupervised cross-modal retrieval method guided by information quantity, realized by the following steps: first, local and global features of the images and features of the texts are extracted; the local and global image features are enhanced; the enhanced local features are regularized; the global and local image features are then orthogonally fused by an image feature fusion network; next, the image features and text features are fused by a multi-mode fusion network according to the ratio of the information quantities of the features of the different modalities; finally, the features of the different modalities are mapped into hash codes, and similarity ranking by Hamming distance yields the retrieval result. By focusing on the enhancement and fusion of data features, the invention captures more semantic information and improves retrieval efficiency, solving the technical problems of poor retrieval efficiency and accuracy caused by existing cross-modal retrieval methods ignoring image-text co-occurrence information and the semantic correlation of different modalities.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of the information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of the present invention.
Fig. 2 is a schematic diagram of an image feature enhancement flow of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The invention provides a strong-correlation non-supervision cross-modal retrieval method guided by information quantity, which comprises the following steps:
s1: preprocessing an image data set and a text data set respectively, and extracting features of the processed image and text data to obtain image local features, image global features and text features;
s2: processing to obtain local features of the enhanced image and global features of the enhanced image, and generating a complete text feature vector;
s3: regularization processing is carried out on the local features of the enhanced image, so that the repeated attention of a certain regional feature in the image is avoided, and the similarity between the enhanced local features is reduced;
s4: inputting the global features of the enhanced image and the regularized local features of the enhanced image into an image feature fusion network to obtain a complete image feature vector;
s5: inputting the complete text feature vector and the complete image feature vector into a multi-mode fusion network, and fusing the image feature and the text feature according to the information quantity conversion proportion to obtain a multi-mode fusion feature vector;
s6: and mapping the complete feature vectors of different modes into respective hash codes to respectively obtain hash vectors of the images and the texts, and calculating Hamming distances between different modes through the hash vectors, wherein the smaller the distance is, the larger the similarity is, and the higher the retrieval precision is.
Specifically, fig. 1 is a schematic flow chart of a strong-correlation non-supervision cross-modal retrieval method guided by information quantity.
The datasets adopted in step S1 are the public MIRFlickr and NUS-WIDE datasets, both of which contain image and text data. The specific process of step S1 is as follows:
1.1 For image data, the short side of the picture is scaled to 256 and the other side is scaled proportionally; the picture is then randomly cropped around the center into a 224 x 224 square. The cropped image is reshaped into a series of two-dimensional image blocks, each block is mapped into a one-dimensional feature vector, and position information is embedded to form the local feature vectors; all local feature vectors are concatenated to obtain the global feature vector. For text data, the original Latent Dirichlet Allocation (LDA) topic vector is used as the original text feature.
1.2 For each image, the image features consist of a global feature vector I_k, representing the global feature of the k-th image, and local feature vectors h_i, each representing the feature of the i-th image block; m is the total number of images and D_I is the image feature dimension, with 4094-dimensional image features extracted here. n is the number of blocks each image is cut into and is set to 9 here. The text features are recorded as text feature vectors T_k, with D_T the text feature dimension.
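As an illustration of steps 1.1 and 1.2, the following is a minimal PyTorch-style sketch of one possible patch-splitting and embedding routine. It is a sketch under stated assumptions: the 3x3 block grid, the PatchEmbed module name, the linear projection and the learned position embedding are illustrative choices, not the patent's implementation.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Step 1.1: scale the short side to 256, then crop a 224x224 square.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])

class PatchEmbed(nn.Module):
    """Split an image into n blocks, map each block to a 1-D feature vector
    and embed position information; the global feature is taken here as the
    concatenation of all local feature vectors (per step 1.1)."""
    def __init__(self, n_blocks=9, feat_dim=4096):   # feat_dim stands in for D_I (assumed)
        super().__init__()
        self.block = 224 // 3                         # 3x3 grid of ~74x74 blocks (assumed layout)
        self.proj = nn.Linear(3 * self.block * self.block, feat_dim)
        self.pos = nn.Parameter(torch.zeros(n_blocks, feat_dim))   # position embedding

    def forward(self, img):                           # img: (3, 224, 224)
        b = self.block
        blocks = img.unfold(1, b, b).unfold(2, b, b)  # (3, 3, 3, b, b)
        blocks = blocks.permute(1, 2, 0, 3, 4).reshape(-1, 3 * b * b)
        local = self.proj(blocks) + self.pos          # local feature vectors h_1..h_n
        global_feat = local.reshape(-1)               # concatenated global feature vector
        return local, global_feat
```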
In step S2, the image feature enhancement flow shown in fig. 2 uses a ResNet network as the base network. When the input is the whole image, a global channel attention module and a global spatial attention module are used for feature enhancement; when the input is an image block, position information is embedded first and a local attention module then applies weights to the local features. The local attention module works as follows:
W_I = σ(WX + d)W (1)
X′ = X + W_I X (2)
where σ(·) is the Sigmoid function, W and d are the shared parameters of the fully connected layer, and X is the image feature vector.
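A minimal PyTorch sketch of equations (1) and (2) follows. It reads W_I X as an element-wise re-weighting of the features and treats W and d as the weight and bias of a single shared fully connected layer; both points are assumptions where the text above is ambiguous.

```python
import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    """Local attention of Eqs. (1)-(2): W_I = sigma(W X + d) W,  X' = X + W_I X."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)                 # shared parameters: W (weight) and d (bias)

    def forward(self, x):                             # x: (n_blocks, dim) position-embedded locals
        w, d = self.fc.weight, self.fc.bias
        w_i = torch.sigmoid(x @ w.t() + d) @ w        # Eq. (1): W_I = sigma(W X + d) W
        return x + w_i * x                            # Eq. (2), W_I applied element-wise (assumed)
```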
The regularization in step S3 is performed with the following formula:
min Re = ||X′_i^T ⊙ (X′_i − I)||_2 (3)
where ⊙ denotes matrix multiplication and I is the identity matrix.
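The decorrelation intent of Eq. (3), keeping the enhanced local features of one image dissimilar so that attention does not repeatedly focus on the same region, can be sketched as below. The sketch penalizes the off-diagonal entries of the Gram matrix of the L2-normalized local features, i.e. ||X′X′^T − I||; this is an interpretation of the formula, not a verbatim transcription.

```python
import torch
import torch.nn.functional as F

def local_feature_regularization(x_local):
    """Eq. (3) sketch: push the enhanced local features of one image apart."""
    x = F.normalize(x_local, dim=-1)                  # (n_blocks, dim), unit-norm rows
    gram = x @ x.t()                                  # pairwise similarities between local features
    eye = torch.eye(gram.size(0), device=gram.device)
    return torch.linalg.norm(gram - eye)              # Frobenius (L2) norm of the residual
```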
The image fusion network in step S4 uses a DOLG model for orthogonal fusion. First, the projection of each local feature X′_{i,k} onto the global feature X_g is calculated, where k indexes the blocks of each image, i = 1, …, m, and X′_{i,k} denotes the k-th local feature of the i-th picture. After the projection matrix is obtained, the orthogonal components are computed by removing these projections from the local features, and the orthogonal components are combined with the global feature through a Hadamard operation to obtain the complete image fusion feature.
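The projection and orthogonal decomposition of step S4 follow the standard DOLG formulation; the sketch below is a hedged reading of it. The final combination step (mean-pooling the orthogonal components and concatenating them with the global feature) is an assumption, since the exact operands of the Hadamard operation are not fully specified in the text above.

```python
import torch

def orthogonal_fusion(local_feats, global_feat, eps=1e-12):
    """DOLG-style orthogonal fusion (step S4 sketch).
    local_feats: (n_blocks, dim) regularized enhanced local features X'
    global_feat: (dim,) enhanced global feature X_g (same dimension assumed)."""
    g = global_feat
    coef = (local_feats @ g) / (g @ g + eps)          # projection coefficients onto X_g
    proj = coef.unsqueeze(1) * g.unsqueeze(0)         # projection of each local feature onto X_g
    orth = local_feats - proj                         # orthogonal components
    fused = torch.cat([orth.mean(dim=0), g])          # complete image feature vector (assumed form)
    return fused
```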
The specific process of step S5 is as follows:
Using the generated image features and text features, the similarity matrices of the different modalities are calculated separately, with the similarity within the image/text modality measured by the cosine formula. The similarity between the image data is computed in this way, and likewise for the text data. Since the text data describes the image content, the proportions of high and low similarity should be comparable across the two modalities. Text features generally have stronger representational capability than image features; to better integrate the features of the different modalities, the image feature matrix information amount H(G_I; S_I) and the text feature matrix information amount H(G_T; S_T) are computed, where S_I and S_T are the similarity matrices of the images and the text, respectively. To simplify the calculation of the information amount, G_I and G_T are taken to be equal.
The feature information amounts of the image and the text are obtained from formula (8), and feature fusion is carried out according to the ratio of the information amounts of the different modalities.
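Since the body of formula (8) is not reproduced above, the sketch below uses the Shannon entropy of each modality's row-normalized intra-modal similarity matrix as a stand-in for the information amount H(G; S), and fuses the image and text features (assumed to share a common dimension) by the resulting ratio. The entropy definition, the softmax normalization and the additive fusion are all assumptions; only the idea of fusing according to the information-quantity ratio comes from the text above.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(feats):
    """Intra-modal cosine similarity matrix used in step S5."""
    f = F.normalize(feats, dim=1)
    return f @ f.t()

def information_amount(sim):
    """Stand-in for Eq. (8): mean Shannon entropy of the row-normalized similarities."""
    p = F.softmax(sim, dim=1)
    return -(p * torch.log(p + 1e-12)).sum(dim=1).mean()

def fuse_by_information_ratio(img_feats, txt_feats):
    """Step S5 sketch: weight each modality by its share of the total information amount."""
    h_i = information_amount(cosine_similarity_matrix(img_feats))
    h_t = information_amount(cosine_similarity_matrix(txt_feats))
    w_i, w_t = h_i / (h_i + h_t), h_t / (h_i + h_t)   # information-quantity ratio
    return w_i * img_feats + w_t * txt_feats          # fusion assumed additive
```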
The specific process of step S6 is as follows:
The complete image feature vector and the complete text feature vector are input into the hash layers of their respective networks to generate the image hash codes B_I and the text hash codes B_T. The angular distance between hash codes reflects their Hamming distance; therefore, to describe the neighbor relations in Hamming space, a pairwise cosine similarity matrix is calculated, whose (i, j) entry represents the similarity between the i-th and the j-th instance, with b_{I,i} denoting the hash vector of the i-th instance in B_I and b_{T,j} obtained in the same way from B_T. The loss function maximizes the semantic information carried by the hash codes.
The similarity information of the hash codes is aligned with the intra-modality information using equation (11) and with the inter-modality information using equation (12):
s.t. (a, b) ∈ {(I, I), (I, T), (T, T)} (12)
where the alignment target is the multimodal fusion matrix from step S5. Finally, the overall objective function is obtained by combining these terms.
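Because the bodies of the loss formulas referenced above are not reproduced in this text, the following is only a hedged sketch of the alignment idea: the cosine similarity of each hash-code pair (a, b) in {(I, I), (I, T), (T, T)} is pulled toward the multimodal fusion similarity matrix of step S5. The squared-error form and the unweighted sum over pairs are assumptions.

```python
import torch
import torch.nn.functional as F

def hash_cosine_matrix(b_a, b_b):
    """Pairwise cosine similarities between (relaxed) hash vectors; their angular
    distance reflects the Hamming distance of the corresponding hash codes."""
    return F.normalize(b_a, dim=1) @ F.normalize(b_b, dim=1).t()

def alignment_loss(b_img, b_txt, s_fusion):
    """Sketch of the alignment terms, s.t. (a, b) in {(I, I), (I, T), (T, T)}."""
    pairs = [(b_img, b_img), (b_img, b_txt), (b_txt, b_txt)]
    return sum(((hash_cosine_matrix(a, b) - s_fusion) ** 2).mean() for a, b in pairs)
```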
After multiple rounds of training with this objective function, the optimal hash mapping functions for the different modalities are learned. When query data is input, it is mapped to the corresponding hash vector with the learned hash mapping function, and the Hamming distances between the query hash vector and the sample hash vectors in the retrieval database are calculated; the smaller the Hamming distance, the higher the similarity, and the retrieval results are output in order of increasing Hamming distance.
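Retrieval at query time thus reduces to Hamming-distance ranking; a minimal sketch for hash codes with values in {-1, +1}, where the Hamming distance equals (bits minus inner product) divided by two, is shown below. The top-k cutoff is an illustrative parameter, not part of the patent.

```python
import torch

def hamming_retrieve(query_hash, db_hashes, topk=10):
    """Rank database items by Hamming distance to the query hash vector.
    query_hash: (bits,) in {-1, +1};  db_hashes: (num_items, bits) in {-1, +1}."""
    bits = db_hashes.size(1)
    dist = (bits - db_hashes @ query_hash) / 2        # Hamming distance for +/-1 codes
    return torch.argsort(dist)[:topk]                 # smallest distance = most similar
```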
The above disclosure is only a preferred embodiment of the present invention, and it should be understood that the scope of the invention is not limited thereto, and those skilled in the art will appreciate that all or part of the procedures described above can be performed according to the equivalent changes of the claims, and still fall within the scope of the present invention.

Claims (8)

1. An information-quantity-guided strong-correlation unsupervised cross-modal retrieval method, characterized by comprising the steps of:
step 1: preprocessing an image data set and a text data set respectively, and extracting features of the processed image and text data to obtain image local features, image global features and text features;
step 2: processing to obtain local features of the enhanced image and global features of the enhanced image, and generating a complete text feature vector;
step 3: regularization processing is carried out on the local features of the enhanced image, to avoid repeatedly attending to the same regional feature in the image and to reduce the similarity between the enhanced local features;
step 4: inputting the global features of the enhanced image and the regularized local features of the enhanced image into an image feature fusion network to obtain a complete image feature vector;
step 5: inputting the complete text feature vector and the complete image feature vector into a multi-mode fusion network, and fusing the image feature and the text feature according to the information quantity conversion proportion to obtain a multi-mode fusion feature vector;
step 6: and mapping the complete feature vectors of different modes into respective hash codes to respectively obtain hash vectors of the images and the texts, and calculating Hamming distances between different modes through the hash vectors, wherein the smaller the distance is, the larger the similarity is, and the higher the retrieval precision is.
2. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 1,
wherein the image dataset and the text dataset are the public MIRFlickr and NUS-WIDE datasets, both of which contain image and text data; during preprocessing, the image data is cropped, reshaped and mapped into one-dimensional feature vectors, position information is embedded to form the local feature vectors, and all local feature vectors are concatenated to obtain the global feature vector; for text data, the original latent Dirichlet allocation topic vector is used as the original text feature.
3. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 2,
wherein, in obtaining the enhanced local image features and the enhanced global image features, the local image feature vectors are input into the feature extraction network of the local attention mechanism module to obtain local image features with stronger representational capability, and the global image features are input into the channel attention module and the spatial attention module for enhancement.
4. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 3,
wherein, after the position information is embedded, the local attention mechanism module follows the formulas:
W_I = σ(WX + d)W
X′ = X + W_I X
where σ(·) is the Sigmoid function, W and d are the shared parameters of the fully connected layer, and X is the image feature vector.
5. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 4,
wherein the regularization process employs the following formula:
min Re = ||X′_i^T ⊙ (X′_i − I)||_2
where ⊙ denotes matrix multiplication and I is the identity matrix.
6. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 5,
wherein, in inputting the enhanced global image feature and the regularized enhanced local image features into the image feature fusion network to obtain the complete image feature vector, the image fusion network uses a DOLG model for orthogonal fusion: the projection of each local feature X′_{i,k} onto the global feature X_g is calculated first, where k indexes the blocks of each image, i = 1, …, m, and X′_{i,k} denotes the k-th local feature of the i-th picture; after the projection matrix is obtained, the orthogonal components are calculated, and the orthogonal components are combined with the global feature through a Hadamard operation to obtain the complete image fusion feature.
7. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 6,
wherein, after the complete text feature vector and the complete image feature vector are input into the multi-mode fusion network, the similarity matrices of the different modalities are calculated from the generated image features and text features, and the similarity within the image/text modality is calculated with the cosine formula applied to the image fusion features and the text features, respectively.
8. The information-quantity-guided strong-correlation unsupervised cross-modal retrieval method of claim 7,
wherein the Hamming distance is reflected by the angular distance between the hash codes, and the neighbor relations in Hamming space are described by a pairwise cosine similarity matrix whose (i, j) entry represents the similarity between the i-th and the j-th instance, with b_{I,i} denoting the hash vector of the i-th instance in the image hash matrix B_I.
CN202310657100.0A 2023-06-05 2023-06-05 Strong-correlation non-supervision cross-modal retrieval method guided by information quantity Pending CN116756363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310657100.0A CN116756363A (en) 2023-06-05 2023-06-05 Strong-correlation non-supervision cross-modal retrieval method guided by information quantity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310657100.0A CN116756363A (en) 2023-06-05 2023-06-05 Strong-correlation non-supervision cross-modal retrieval method guided by information quantity

Publications (1)

Publication Number Publication Date
CN116756363A true CN116756363A (en) 2023-09-15

Family

ID=87950685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310657100.0A Pending CN116756363A (en) 2023-06-05 2023-06-05 Strong-correlation non-supervision cross-modal retrieval method guided by information quantity

Country Status (1)

Country Link
CN (1) CN116756363A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874278A (en) * 2024-03-11 2024-04-12 盛视科技股份有限公司 Image retrieval method and system based on multi-region feature combination
CN117874278B (en) * 2024-03-11 2024-05-28 盛视科技股份有限公司 Image retrieval method and system based on multi-region feature combination

Similar Documents

Publication Publication Date Title
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
US10922350B2 (en) Associating still images and videos
Kaur et al. Comparative analysis on cross-modal information retrieval: A review
CN108694225B (en) Image searching method, feature vector generating method and device and electronic equipment
Wang et al. Self-constraining and attention-based hashing network for bit-scalable cross-modal retrieval
Wang et al. Annotating images by mining image search results
CN106202256B (en) Web image retrieval method based on semantic propagation and mixed multi-instance learning
US8594468B2 (en) Statistical approach to large-scale image annotation
CN113779361A (en) Construction method and application of cross-modal retrieval model based on multi-layer attention mechanism
Xiao et al. Convolutional hierarchical attention network for query-focused video summarization
CN111079444A (en) Network rumor detection method based on multi-modal relationship
CN112241468A (en) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
Cornia et al. Explaining digital humanities by aligning images and textual descriptions
Gao et al. A hierarchical recurrent approach to predict scene graphs from a visual‐attention‐oriented perspective
CN111782852B (en) Deep learning-based high-level semantic image retrieval method
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN111461175A (en) Label recommendation model construction method and device of self-attention and cooperative attention mechanism
Caicedo et al. Multimodal fusion for image retrieval using matrix factorization
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN117765450B (en) Video language understanding method, device, equipment and readable storage medium
CN116756363A (en) Strong-correlation non-supervision cross-modal retrieval method guided by information quantity
Wang et al. Listen, look, and find the one: Robust person search with multimodality index
Papapanagiotou et al. Improving concept-based image retrieval with training weights computed from tags
Lu et al. Inferring user image-search goals under the implicit guidance of users

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination