CN115935194A - Visual and text cross-modal matching method based on consensus embedding space and similarity - Google Patents

Visual and text cross-modal matching method based on consensus embedding space and similarity

Info

Publication number
CN115935194A
Authority
CN
China
Prior art keywords
consensus
word
text
feature
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211385488.5A
Other languages
Chinese (zh)
Inventor
梁雪峰
林坚
王晨阳
玄慧君
刘真佑
杨小慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202211385488.5A priority Critical patent/CN115935194A/en
Publication of CN115935194A publication Critical patent/CN115935194A/en
Pending legal-status Critical Current

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a visual-text cross-modal matching method, which comprises the following steps: determining candidate objects according to an object input into a retrieval platform; determining the similarity between the object and each candidate object through the model of the retrieval platform, and screening the matching object of the object according to the similarity. The model is used for obtaining consensus target word features of common words in the text; obtaining the consensus local visual features of the instance regions and the consensus word features of the words according to the local visual features of the instance regions in the image, the word features of the words in the text, and the consensus target word features; obtaining the consensus global visual feature of the image according to its consensus local visual features, and the consensus global text feature of the text according to its consensus word features; and determining the local similarity between the consensus word features and the consensus local visual features and the global similarity between the consensus global visual feature and the consensus global text feature, and obtaining the similarity between the image and the description text according to the local similarity and the global similarity.

Description

Visual and text cross-modal matching method based on consensus embedding space and similarity
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a vision and text cross-modal matching method based on consensus embedding space and similarity.
Background
Visual-text matching refers to extracting features from visual content (such as images and videos) and from text, and measuring the similarity between them so as to obtain visual-text pairs that express similar content. Visual-text matching is becoming increasingly important in various vision-language tasks, such as cross-modal retrieval, image description generation, text-to-image synthesis, multi-modal neural machine translation, and intelligent warehousing robots. In recent years, techniques have been developed that exploit both the global alignment between images and sentences and the local alignment between regions and words. However, matching images and text remains a challenging problem because of the complex matching patterns and the large semantic gap between the two modalities.
To address this problem, many current approaches encode images and text into compact feature representations using deep neural networks and natural language techniques, and attempt to map the entire image and the complete text into a joint embedding space under the guidance of a feature-matching loss function, in which the feature similarity between different modalities can be measured directly. To improve the discriminative power of the unified embedding, strategies such as semantic concept learning and regional relation reasoning have been developed, which enhance the visual features by integrating local region semantics. However, these methods fail to capture the local interactions between image regions and sentence fragments, which limits the gains in interpretability and performance. To solve this, Anderson et al. proposed a top-down and bottom-up attention mechanism to further improve model performance: the visual features of local instances are extracted by Faster R-CNN, the visual and textual features are then fused through different attention mechanisms, and the text feature extractor is trained by maximizing the similarity between positive samples. This approach is well interpretable and can attend to local information, which helps with complex visual-semantic pattern matching tasks. However, such methods usually take only the average of all local features as the global feature (or as the query for the global feature), which leads the network to ignore the spatial and positional relationships among local features. Moreover, the visual feature extractor (R-CNN) in these methods is much larger than the text feature extractor, yet a simple dot product or a shallow attention layer is used to represent the similarity between the embedded features of the two modalities, which makes the "heterogeneous gap" difficult to overcome.
Currently, the pre-training and fine-tuning scheme has been extended to the joint field of vision and language, giving rise to vision-and-language pre-training (VLP) models. CLIP, a representative VLP model, attracted wide attention as soon as it was proposed and shows considerable zero-shot accuracy and good generalization across various downstream tasks: it performs classification by supervising images with text, and is contrastively pre-trained on 400 million image-text pairs, giving the network strong cross-modal matching capability. Although such large-scale pre-trained models achieve good performance on many downstream tasks, they perform poorly on complex and abstract problems (such as recognizing counts or distances). This is related to a training objective dominated by language-supervised visual classification, which makes the network focus on the main targets in a picture and ignore local information.
That is, visual-text pre-training models in the prior art have the following two problems when performing cross-modal matching tasks:
(1) The lack of attention to local information and spatial location makes them underperform on abstract or systematic tasks (e.g., counting the objects in an image) and on more complex tasks (e.g., predicting how close the nearest car in a photograph is).
(2) The similarity between the embedded features of the two modalities is represented by a simple dot product, so the information learned by the powerful text and visual feature extractors is not used efficiently.
Disclosure of Invention
In order to solve the above problems in the related art, the present invention provides a cross-modality matching method for visual and text based on consensus embedding space and similarity. The technical problem to be solved by the invention is realized by the following technical scheme:
the invention provides a vision and text cross-modal matching method based on consensus embedding space and similarity, which comprises the following steps:
acquiring an input object of an input retrieval platform;
determining candidate objects of the input object; when the input object is a text, the candidate objects are a plurality of preset images of the retrieval platform; when the input object is an image, the candidate object is a description text corresponding to each preset image;
determining the similarity between the input object and each candidate object through a pre-training matching model deployed on the retrieval platform, and screening the matching objects of the input object according to the similarity;
the pre-training matching model is obtained by adopting a sample image and a sample description text and carrying out iterative training on an initial matching model; the pre-training matching model is used for obtaining the co-recognition target word characteristics of the preset words according to the word characteristics of the preset words, the occurrence frequency of the preset words and the co-occurrence times among different preset words, wherein the occurrence frequency of the preset words in the description text meets the frequency threshold; obtaining the consensus local visual feature of the example region through an attention mechanism according to the local visual feature of the example region in the image and the consensus target word feature, and obtaining the consensus word feature of each word through the attention mechanism according to the word feature of each word in the description text and the consensus target word feature; obtaining the consensus global visual feature of each image according to the consensus local visual feature of each image through an attention mechanism, and obtaining the consensus global text feature of each description text according to the consensus word feature of each description text; obtaining local similarity according to the consensus word characteristics and the consensus local visual characteristics, obtaining global similarity according to the consensus global visual characteristics and the consensus global text characteristics, and obtaining global output through a self-attention mechanism according to the local similarity and the global similarity; and according to the global output, obtaining the similarity between the image and the description text.
The invention has the following beneficial technical effects:
the pre-training matching model can obtain the consensus target word features of the preset words according to the word features of the preset words whose occurrence frequency in the description texts meets the frequency threshold, the occurrence frequency of the preset words, and the number of co-occurrences between different preset words; it obtains the consensus local visual features of the instance regions through an attention mechanism according to the local visual features of the instance regions in the image and the consensus target word features, obtains the consensus word features of the words through the attention mechanism according to the word features of the words in the description text and the consensus target word features, obtains the consensus global visual feature of each image through an attention mechanism according to the consensus local visual features of that image, and obtains the consensus global text feature of each description text according to the consensus word features of that text. In this way, top-down and bottom-up attention methods are introduced into feature extraction, so that global and local features can be considered simultaneously, more attention is paid to local, spatial and structural information, and rich feature representations are provided for the model. In addition, consensus knowledge is introduced into the matching of the global image and the global text, so that the global image and global text features are optimized, global semantic information is better expressed, and the performance of image-text cross-modal matching is improved. Moreover, the pre-training matching model can obtain local similarities according to the consensus word features and the consensus local visual features, obtain a global similarity according to the consensus global visual feature and the consensus global text feature, obtain a global output corresponding to the global similarity through a self-attention mechanism according to the local similarities and the global similarity, and obtain the similarity value between the corresponding image and description text according to the global output. In this way, a learnable similarity representation and self-attention-based similarity reasoning are used to fully characterize the association between the local features of the instance regions in the image and the features of the words in the description text, and information interaction between the local and global similarities is introduced, thereby improving the accuracy of image-text matching.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
FIG. 1 is an alternative flowchart of a visual and text cross-modality matching method based on consensus embedding space and similarity according to an embodiment of the present invention;
FIG. 2A is a retrieval result page of an exemplary retrieval platform according to an embodiment of the present invention;
FIG. 2B is another retrieval result page of the exemplary retrieval platform according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process of determining the similarity between an image and a text by using a pre-trained matching model according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of the technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples described in this specification can be combined and combined by those skilled in the art.
While the invention has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
FIG. 1 is an alternative flowchart of a visual and text cross-modality matching method based on consensus embedding space and similarity according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
s101, acquiring an input object of an input retrieval platform.
In the embodiment of the invention, the retrieval platform is provided with a text library and an image library, and the input object can be a text or an image. The text may be characters input by the user or texts selected from a text library, and correspondingly, the image may be an image input by the user or an image selected from an image library.
S102, determining candidate objects of the input object; when the input object is a text, the candidate objects are a plurality of preset images of the retrieval platform; and when the input object is an image, the candidate object is a description text corresponding to each preset image.
In the embodiment of the invention, the retrieval platform can retrieve the matched text from the text library of the platform according to the input image, and can retrieve the matched image from the image library of the platform according to the input text.
S103, determining the similarity between the input object and each candidate object through a pre-training matching model deployed on the retrieval platform, and screening the matching object of the input object according to the similarity. The pre-training matching model is obtained by iteratively training an initial matching model with sample images and sample description texts. The pre-training matching model is used for: obtaining the consensus target word features of preset words according to the word features of the preset words whose occurrence frequency in the description texts meets a frequency threshold, the occurrence frequency of the preset words, and the number of co-occurrences between different preset words; obtaining the consensus local visual features of the instance regions through an attention mechanism according to the local visual features of the instance regions in the image and the consensus target word features, and obtaining the consensus word features of the words through the attention mechanism according to the word features of the words in the description text and the consensus target word features; obtaining the consensus global visual feature of each image through an attention mechanism according to the consensus local visual features of the image, and obtaining the consensus global text feature of each description text according to the consensus word features of the description text; obtaining local similarities according to the consensus word features and the consensus local visual features, obtaining a global similarity according to the consensus global visual feature and the consensus global text feature, and obtaining a global output through a self-attention mechanism according to the local similarities and the global similarity; and obtaining the similarity between the image and the description text according to the global output.
In the embodiment of the invention, when the retrieval platform obtains the similarity between the input object and each candidate object, one or more candidate objects with the similarity meeting a preset similarity threshold or with the highest similarity can be selected from all the candidate objects according to the similarity and serve as the matching objects of the input object.
Here, the retrieval platform may be a server program built according to the pre-trained matching model, and the server program may present the matching result in the form of an interactive interface.
Illustratively, FIG. 2A shows a retrieval result page of the retrieval platform. As shown in FIG. 2A, when the input object is an image, the image, each instance region of the image (the cropped instances in FIG. 2A), and the texts matching the image (the 5 texts below "retrieval result" in FIG. 2A) may be displayed on the retrieval result page. FIG. 2B shows another retrieval result page of the retrieval platform. As shown in FIG. 2B, when the input object is a text, the input text (the text in the search box in FIG. 2B) and the images matching the input text (the 5 images below "queried picture" in FIG. 2B) may be displayed on the retrieval result page.
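For illustration, the following Python sketch shows how the screening step above could rank candidates by the similarity returned by the matching model; the model object, its similarity() method, and the threshold handling are assumed names and not part of the patent.

```python
# A minimal sketch of the screening step, assuming a matching model object with
# a similarity(query, candidate) method; names and defaults are illustrative.
from typing import Any, List, Optional, Tuple

def screen_matches(match_model: Any, query: Any, candidates: List[Any],
                   top_k: int = 5, threshold: Optional[float] = None) -> List[Tuple[int, float]]:
    """Score every candidate against the query and keep the best matches."""
    scored = [(idx, float(match_model.similarity(query, cand)))
              for idx, cand in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)   # highest similarity first
    if threshold is not None:                             # keep only candidates above the preset threshold
        scored = [pair for pair in scored if pair[1] >= threshold]
    return scored[:top_k]                                 # (candidate index, similarity) pairs
```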
In the embodiment of the present invention, the pre-training matching model may include: a preprocessing module, a pre-training feature extraction model, a consensus determination module, a pre-training graph convolution network, and a matching module containing pre-training learning parameters. The preprocessing module is used for determining the instance regions of the image and the words in the description text. The pre-training feature extraction model is used for extracting the global visual feature of the image and the local visual features of each instance region, and extracting the global text feature of the description text and the word features of each word. The consensus determination module is used for determining the word features of the preset words whose occurrence frequency in the description texts meets the frequency threshold, the occurrence frequency of the preset words, and the number of co-occurrences between different preset words; the occurrence frequency of each word is the total number of occurrences of the word in the description texts. The pre-training graph convolution network is used for obtaining the consensus target word features of the preset words according to the word features, the occurrence frequency, the number of co-occurrences, and the first pre-training learning parameters. The matching module is used for obtaining the consensus local visual features of the instance regions through an attention mechanism according to the local visual features of the instance regions in the image, the consensus target word features, and the second pre-training learning parameters, and obtaining the consensus word features of each word through the attention mechanism according to the word features of each word in the description text, the consensus target word features, and the second pre-training learning parameters; obtaining the consensus global visual feature of each image through an attention mechanism according to the consensus local visual features and the third pre-training learning parameters of the image, and obtaining the consensus global text feature of each description text according to the consensus word features and the third pre-training learning parameters of the text; obtaining local similarities according to the consensus word features, the consensus local visual features, and the fourth pre-training learning parameters, and obtaining a global similarity according to the consensus global visual feature, the consensus global text feature, and the fifth pre-training learning parameters; obtaining a global output through a self-attention mechanism according to the local similarities, the global similarity, and the sixth pre-training learning parameters; and obtaining the similarity between the image and the description text according to the global output and the seventh pre-training learning parameters.
In some embodiments, when the input object is an input image and the candidate objects are a plurality of candidate texts, the determining the similarity between the input object and each candidate object through the pre-trained matching model deployed in the search platform in S103 may be implemented by:
and S1031, determining global visual features of the input image, local visual features of each instance region in the input image, global text features of any candidate text and word features of words in any candidate text through a pre-training matching model for the input image and any candidate text.
Here, object detection may be performed on the input image by the preprocessing module to obtain the coordinates of each target, and the input image is cropped according to these coordinates to obtain the instance regions of the input image; in addition, a candidate text of length L may be divided into L words by the preprocessing module. For example, the preprocessing module may include an object detection model (e.g., a Mask R-CNN model) and a tokenizer. Then, the global visual feature of the input image, the local visual features of the instance regions, the global text feature of the candidate text, and the word features of the words can be extracted by the pre-training feature extraction model.
Illustratively, the pre-training feature extraction model may be a pre-trained CLIP (Contrastive Language-Image Pre-training) model. The instance regions and the input image may be fed into the visual branch of the CLIP model to extract the local and global features of the input image; meanwhile, the candidate text and each of its words are fed into the language branch of the CLIP model to extract the local and global features of the candidate text. Here, using the CLIP model, which has strong cross-modal capability, as the feature extractor can mitigate the effect of the "heterogeneous gap" inherent between visual and textual features on model performance.
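A minimal sketch of this preprocessing and feature-extraction step is given below, assuming the OpenAI CLIP package and precomputed region bounding boxes; the Mask R-CNN detection step is omitted for brevity and the function signature is an illustrative assumption.

```python
# A sketch of global/local feature extraction with a pre-trained CLIP model.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_features(image: Image.Image, boxes, caption: str, words):
    # Global visual feature plus one local visual feature per instance region.
    crops = [image.crop(box) for box in boxes]
    batch = torch.stack([preprocess(im) for im in [image] + crops]).to(device)
    with torch.no_grad():
        vis = model.encode_image(batch)                 # (1 + K, d)
    global_visual, local_visual = vis[0], vis[1:]

    # Global text feature plus one word feature per word of the description.
    tokens = clip.tokenize([caption] + list(words)).to(device)
    with torch.no_grad():
        txt = model.encode_text(tokens)                 # (1 + L, d)
    global_text, word_features = txt[0], txt[1:]
    return global_visual, local_visual, global_text, word_features
```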
S1032, according to the occurrence frequency of the words in the candidate texts, selecting a plurality of words with the occurrence frequency meeting a frequency threshold value from the words in the candidate texts as preset words, and determining the word characteristics of each preset word and the co-occurrence times of different preset words in the same candidate text.
And S1033, obtaining the consensus target word characteristics corresponding to each preset word according to the word characteristics, the occurrence frequency and the co-occurrence times of the preset words.
In the embodiment of the invention, a plurality of words whose occurrence frequency meets the frequency threshold can be selected from the words of the candidate texts as preset words according to the occurrence frequency of the words in the candidate texts; the word features of each preset word and the number of co-occurrences of different preset words in the same candidate text are determined; and the consensus target word feature corresponding to each preset word is obtained according to the word features, the occurrence frequency, and the number of co-occurrences of the preset words.
in some embodiments, a relationship graph which takes preset words as nodes and represents the co-occurrence relationship between two preset words by whether connecting edges with directions exist between the nodes or not can be constructed according to the word features, the occurrence frequency and the co-occurrence times of the preset words; performing graph convolution processing on the word features and the relation graph of the preset words according to the first pre-training learning parameter to obtain the consensus target word feature corresponding to each preset word; the consensus target word feature represents a feature obtained by mapping the word feature of the preset word to the consensus space.
Illustratively, the occurrence frequency of each word in all candidate texts can be counted, and the q' words with the highest occurrence frequency can be selected as the q' preset words, where the q' preset words may comprise nouns, verbs and adjectives distributed according to a preset proportion. Taking any two preset words as nodes i and j, the edge indicators G_ij and G_ji of the relation graph can be expressed by formula (1) and formula (2), respectively:
$$G_{ij}=\begin{cases}1, & B_{ij}\ge e\\ 0, & B_{ij}<e\end{cases} \qquad (1)$$

$$G_{ji}=\begin{cases}1, & B_{ji}\ge e\\ 0, & B_{ji}<e\end{cases} \qquad (2)$$

wherein, in formula (1), $E_{ij}$ denotes the number of co-occurrences between node i and node j, $N_i$ denotes the occurrence frequency of node i, $P_{ij}$ denotes the probability that node i appears when node j appears, s and u denote preset scaling parameters, e denotes a preset probability threshold, and $B_{ij}$ denotes the probability that node i appears when node j appears after rescaling $P_{ij}$ with s and u. $G_{ij}=0$ indicates that there is no directed edge from node i to node j, and $G_{ij}=1$ indicates that there is a directed edge pointing from node i to node j. After the relation graph corresponding to the preset words is obtained, a multi-layer graph convolution network (for example, 2 layers) with the first pre-training learning parameters can be used to perform graph convolution on the word features of the preset words and the relation graph, thereby mapping the word feature of each preset word $C_i$ into the consensus space to obtain the corresponding consensus target word feature $z_i$. The graph convolution at the l-th layer can be expressed by formula (3):

$$H^{(l+1)}=\rho\left(\widehat{A}\,H^{(l)}W^{(l)}\right) \qquad (3)$$

where $\rho$ is the activation function, $\widehat{A}$ is the symmetrically normalized adjacency matrix of the relation graph, $H^{(0)}$ is the matrix of word features of the preset words $C_i$, $H^{(l+1)}$ is the output of the l-th layer, and $W^{(l)}$ denotes the first pre-training learning parameters.
Here, through the above steps, common knowledge shared between the image and text modalities (referred to as consensus knowledge) can be incorporated into image-text matching, which compensates for the learning of low-frequency and unseen words in image and text descriptions and enriches the text features. Moreover, the co-occurrence probabilities are rescaled with the preset scaling parameters and the preset probability threshold, which suppresses the noise contained in long-tail data, makes the co-occurrence relationships between words more reliable, and improves the precision of subsequent image-text matching.
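The sketch below illustrates this consensus-knowledge step: a binary co-occurrence relation graph is built from corpus statistics and a 2-layer graph convolution maps the preset-word features into the consensus space. The exact confidence-scaling function and the numeric defaults for s, u and the threshold are assumptions, not values given in the patent.

```python
# A sketch, under assumptions, of building the relation graph and the consensus GCN.
import torch
import torch.nn as nn

def build_relation_graph(cooccur: torch.Tensor, freq: torch.Tensor,
                         s: float = 5.0, u: float = 0.02, eps: float = 0.3) -> torch.Tensor:
    """cooccur[i, j] = E_ij, freq[j] = N_j; returns the binary adjacency G."""
    P = cooccur / freq.clamp(min=1.0).unsqueeze(0)      # conditional co-occurrence probability
    B = s ** P - s ** u                                 # rescaling with s and u (assumed form)
    return (B >= eps).float()                           # keep an edge only above the threshold e

class ConsensusGCN(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)       # first pre-training learning parameters
        self.w2 = nn.Linear(dim, dim, bias=False)

    def forward(self, word_feats: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
        A = G + torch.eye(G.size(0), device=G.device)   # add self-loops
        d_inv_sqrt = A.sum(dim=1).pow(-0.5)
        A_hat = d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)   # symmetric normalisation
        h = torch.relu(self.w1(A_hat @ word_feats))
        return self.w2(A_hat @ h)                       # consensus target word features Z
```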
S1034, respectively taking the local visual features of the input image and the word features of any candidate text as queries, taking the consensus target word features as keys and values, and respectively obtaining the consensus local visual features of the instance regions and the consensus word features of each word through an attention mechanism.
In the embodiment of the invention, the local visual features of the input image can be used as the query, and the consensus target word features as the key and value; the consensus local visual features of all instance regions of the input image are then obtained through an attention mechanism according to the second pre-training learning parameters. Likewise, with the word features of any candidate text as the query and the consensus target word features as the key and value, the consensus word features of each word of the candidate text are obtained through the attention mechanism according to the second pre-training learning parameters.
Illustratively, when the input image has 36 example regions, taking the local visual features of the example regions of the input image as an example, the consensus local visual feature of each example region of the input image can be calculated by formula (4); the principle of obtaining the consensus word feature of any candidate text is the same as that of obtaining the consensus local visual feature of the input image. Equation (4) is as follows:
$$V_C=\mathrm{Softmax}\left(\lambda V W_v Z^{T}\right)\times Z \qquad (4)$$

where $V_C$ is the matrix formed by the consensus local visual features of the 36 instance regions, $\lambda$ is a preset smoothing parameter of the Softmax function, $V\in\mathbb{R}^{36\times d}$ is the matrix of local visual features of the 36 instance regions, d is the feature dimension, $W_v$ is the second pre-training learning parameter (a matrix parameter), Z is the matrix formed by the consensus target word features of the q' preset words, and $Z^{T}$ is the transpose of Z.
By introducing self-attention, better representations of the local visual features and local text features can be obtained, facilitating subsequent accurate matching between images and texts.
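A minimal sketch of equation (4) is given below; the local visual features V act as the query and the q' consensus target word features Z as key and value, and W_v corresponds to the second pre-training learning parameter. The value of the smoothing parameter is an assumption.

```python
# A sketch of the consensus local attention of equation (4).
import torch
import torch.nn as nn

class ConsensusLocalAttention(nn.Module):
    def __init__(self, dim: int, smooth: float = 10.0):
        super().__init__()
        self.w_v = nn.Linear(dim, dim, bias=False)      # W_v
        self.smooth = smooth                            # lambda, the Softmax smoothing parameter

    def forward(self, V: torch.Tensor, Z: torch.Tensor) -> torch.Tensor:
        # V: (K, d) local visual features; Z: (q', d) consensus target word features.
        attn = torch.softmax(self.smooth * self.w_v(V) @ Z.t(), dim=-1)   # (K, q')
        return attn @ Z                                 # V_C: consensus local visual features
```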
S1035, taking the global visual feature of the input image and the global text feature of any candidate text as queries, taking the consensus local visual features of the input image and the consensus word features of the candidate text as keys and values, respectively, and obtaining the consensus global visual feature of the input image and the consensus global text feature of the candidate text through an attention mechanism.
In the embodiment of the invention, the global visual feature of the input image can be used as the query and the consensus local visual features of the input image as the key and value, and the consensus global visual feature of the input image is obtained through an attention mechanism according to the third pre-training learning parameters; likewise, with the global text feature of any candidate text as the query and the consensus word features of that text as the key and value, the consensus global text feature of the candidate text is obtained through the attention mechanism according to the third pre-training learning parameters.
Illustratively, when the input image has 36 example regions, taking the global visual feature of the input image as an example, the consensus global visual feature of the input image can be calculated by formula (5); the obtaining principle of the consensus global text feature of any candidate text is the same as the obtaining principle of the consensus global visual feature of the input image. Equation (5) is as follows:
$$\hat{V}_g=\mathrm{Softmax}\left(\lambda V_g W_l V_l^{T}\right)\times V_l \qquad (5)$$

where $\hat{V}_g$ is the consensus global visual feature of the input image, $V_g\in\mathbb{R}^{1\times d}$ is the global visual feature of the input image, $V_l\in\mathbb{R}^{36\times d}$ is the matrix of the 36 consensus local visual features of the input image, $V_l^{T}$ is the transpose of $V_l$, and $W_l$ denotes the third pre-training learning parameter (a matrix parameter).
By introducing self-attention, better representations of the global visual feature and the global text feature can be obtained, facilitating subsequent accurate matching between images and texts.
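A sketch of equation (5) follows, under the same assumptions as the sketch of equation (4): the global feature is the query, the consensus local features are key and value, and W_l corresponds to the third pre-training learning parameter; the same module serves the text side analogously.

```python
# A sketch of the consensus global attention of equation (5).
import torch
import torch.nn as nn

class ConsensusGlobalAttention(nn.Module):
    def __init__(self, dim: int, smooth: float = 10.0):
        super().__init__()
        self.w_l = nn.Linear(dim, dim, bias=False)      # W_l
        self.smooth = smooth

    def forward(self, g: torch.Tensor, local: torch.Tensor) -> torch.Tensor:
        # g: (1, d) global visual (or text) feature; local: (K, d) consensus local features.
        attn = torch.softmax(self.smooth * self.w_l(g) @ local.t(), dim=-1)  # (1, K)
        return attn @ local                             # consensus global feature, (1, d)
```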
And S1036, performing feature alignment on the consensus local visual features of the input image and the consensus word features of any candidate text to obtain the attention visual features corresponding to the consensus word features.
In the embodiment of the invention, the obtained consensus local visual feature of the input image and the consensus word feature of any candidate text can be subjected to feature alignment between modalities based on cross attention, so that better expression of the local visual feature is obtained.
For example, text-to-visual attention may be used to determine the regions corresponding to each word in the candidate text, resulting in an attention visual feature corresponding to each consensus word feature. For the consensus word feature $t_j$, the corresponding attention visual feature $\hat{a}_j^{v}$ is given by formula (6):

$$\hat{a}_j^{v}=\sum_{i=1}^{K}a_{ij}v_i,\qquad a_{ij}=\frac{\exp\left(\lambda \bar{C}_{ij}\right)}{\sum_{i=1}^{K}\exp\left(\lambda \bar{C}_{ij}\right)},\qquad \bar{C}_{ij}=\frac{\left[C_{ij}\right]_+}{\sqrt{\sum_{j=1}^{L}\left[C_{ij}\right]_+^{2}}} \qquad (6)$$

where K denotes the total number of consensus local visual features of the input image, $v_i$ denotes the i-th consensus local visual feature of the input image, $a_{ij}$ denotes the attention weight between $v_i$ and $t_j$, $\lambda$ denotes a preset parameter of the Softmax function, $C_{ij}$ denotes the cosine similarity between $v_i$ and $t_j$, $\bar{C}_{ij}$ is the normalized cosine similarity, L denotes the total number of consensus word features of the candidate text, and $[x]_+=\max(x,0)$ with x being $C_{ij}$.
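The following sketch illustrates the text-to-visual attention of formula (6), assuming the normalization of the cosine-similarity matrix described above; the smoothing value is an assumption.

```python
# A sketch of equation (6): one attention visual feature per consensus word feature.
import torch
import torch.nn.functional as F

def attention_visual_features(v: torch.Tensor, t: torch.Tensor, smooth: float = 9.0) -> torch.Tensor:
    # v: (K, d) consensus local visual features; t: (L, d) consensus word features.
    C = F.normalize(v, dim=-1) @ F.normalize(t, dim=-1).t()              # (K, L) cosine similarities C_ij
    C = C.clamp(min=0)                                                   # [x]_+ = max(x, 0)
    C = C / C.pow(2).sum(dim=1, keepdim=True).clamp(min=1e-8).sqrt()     # normalise over the L words
    alpha = torch.softmax(smooth * C, dim=0)                             # attention over the K regions
    return alpha.t() @ v                                                 # (L, d): attention visual features
```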
S1037, obtaining a plurality of local similarities according to the attention visual features and the corresponding consensus word features, and obtaining global similarities according to the consensus global visual features of the input image and the consensus global text features of any candidate text.
In the embodiment of the invention, a local similarity can be calculated according to the fourth pre-training learning parameter, the attention visual characteristics and the consensus word characteristics corresponding to the attention visual characteristics, so that the local similarity with the same number as the attention visual characteristics is obtained; and calculating a global similarity according to the fifth pre-training learning parameter, the consensus global visual feature and the consensus global text feature.
For example, for an attention visual feature and its corresponding consensus word feature, the local similarity can be expressed as formula (7); for the consensus global visual feature and the consensus global text feature, the global similarity is expressed as formula (8):

$$s_j^{l}=\frac{w_l\left|\hat{a}_j^{v}-t_j\right|^{2}}{\left\|\hat{a}_j^{v}-t_j\right\|_2} \qquad (7)$$

where $\hat{a}_j^{v}$ denotes the attention visual feature, $t_j$ denotes the corresponding consensus word feature, $|\cdot|^{2}$ denotes the element-wise square, $\|\cdot\|_2$ denotes the $l_2$ norm, and $w_l$ denotes the fourth pre-training learning parameter;

$$s^{g}=\frac{w_g\left|\hat{V}_g-\hat{T}_g\right|^{2}}{\left\|\hat{V}_g-\hat{T}_g\right\|_2} \qquad (8)$$

where $\hat{V}_g$ denotes the consensus global visual feature, $\hat{T}_g$ denotes the consensus global text feature, and $w_g$ denotes the fifth pre-training learning parameter.
Here, in the related art, the similarity between two features is mostly measured by cosine similarity or Euclidean distance. Such scalar similarity measures can capture the correlation between two feature vectors to some extent, but lack detailed correspondences. The above way of calculating similarity can capture more detailed associations between features from different modalities, thereby improving the precision of subsequent image-text matching.
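A minimal sketch of the learnable vector similarity of formulas (7) and (8) is given below: the element-wise squared difference, scaled by its l2 norm, is projected by a learned matrix (the fourth or fifth pre-training learning parameter); the similarity dimension is an assumption.

```python
# A sketch of the learnable similarity representation of equations (7) and (8).
import torch
import torch.nn as nn

class VectorSimilarity(nn.Module):
    def __init__(self, dim: int, sim_dim: int = 256):
        super().__init__()
        self.w = nn.Linear(dim, sim_dim, bias=False)    # w_l (local) or w_g (global)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        diff = x - y
        sq = diff.pow(2)                                         # |x - y|^2, element-wise square
        norm = diff.norm(dim=-1, keepdim=True).clamp(min=1e-8)   # ||x - y||_2
        return self.w(sq / norm)                                 # similarity representation vector
```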
And S1038, obtaining global output through a self-attention mechanism according to the local similarity and the global similarity, and obtaining the similarity between the input image and any candidate text according to the global output.
In the embodiment of the invention, the local similarities and the global similarity can be concatenated to obtain a similarity graph comprising a plurality of similarities; then, self-attention is computed over the similarity graph a preset number of times according to the sixth pre-training learning parameters to obtain the global output corresponding to the global similarity, and the similarity is obtained according to the seventh pre-training learning parameters and the global output.
Illustratively, the similarity graph can be expressed as $N=\{s_1^{l},\ldots,s_L^{l},s^{g}\}$, where L denotes the total number of consensus word features of the candidate text, $s_1^{l},\ldots,s_L^{l}$ denote the L local similarities, and $s^{g}$ denotes the global similarity. Self-attention may be computed over the similarity graph a preset number of times (for example, 3 times) according to formula (9), so as to obtain a local output corresponding to each local similarity and a global output corresponding to the global similarity. Formula (9) is as follows:

$$\hat{s}_p^{\,n+1}=\mathrm{ReLU}\left(\sum_{q=1}^{Q}\frac{\exp\left(\left(w_{in}^{n}\hat{s}_p^{\,n}\right)^{T}\left(w_{out}^{n}\hat{s}_q^{\,n}\right)\right)}{\sum_{q=1}^{Q}\exp\left(\left(w_{in}^{n}\hat{s}_p^{\,n}\right)^{T}\left(w_{out}^{n}\hat{s}_q^{\,n}\right)\right)}\,w_{\gamma}^{n}\,\hat{s}_q^{\,n}\right) \qquad (9)$$

where n denotes the index of the self-attention calculation, and $\hat{s}_p^{\,0}=s_p$ when n = 0; $s_p$ is any one of the similarities in the similarity graph N, $s_q$ is any similarity in N other than $s_p$, and Q is the total number of similarities other than $s_p$; exp(·) is the exponential function with the natural constant e as base; $w_{in}$, $w_{out}$ and $w_{\gamma}$ are the sixth pre-training learning parameters, with $w_{in}^{n}$, $w_{out}^{n}$ and $w_{\gamma}^{n}$ being the sixth pre-training learning parameters used in the n-th calculation (the same parameters are shared across calculations); $\hat{s}_p^{\,n+1}$ is the output corresponding to $s_p$ at the (n+1)-th calculation, and ReLU(·) is the activation function.
For example, when a global output corresponding to the global similarity is obtained, the global output may be input into the fully-connected layer having the seventh pre-training learning parameter, so as to obtain a value, and the value is used as the similarity between the input image and a corresponding one of the candidate texts.
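A sketch of the similarity-graph reasoning of formula (9) plus the final fully connected layer is given below; w_in, w_out and w_gamma correspond to the sixth pre-training learning parameter (shared across the preset number of calculations), fc to the seventh, and vector-valued similarity nodes are assumed.

```python
# A sketch of the self-attention reasoning over the similarity graph.
import torch
import torch.nn as nn

class SimilarityGraphReasoning(nn.Module):
    def __init__(self, sim_dim: int, steps: int = 3):
        super().__init__()
        self.w_in = nn.Linear(sim_dim, sim_dim, bias=False)
        self.w_out = nn.Linear(sim_dim, sim_dim, bias=False)
        self.w_gamma = nn.Linear(sim_dim, sim_dim, bias=False)
        self.fc = nn.Linear(sim_dim, 1)                 # seventh pre-training learning parameter
        self.steps = steps                              # preset number of self-attention calculations

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (L + 1, sim_dim), the L local similarities followed by the global similarity.
        mask = torch.eye(nodes.size(0), dtype=torch.bool, device=nodes.device)
        for _ in range(self.steps):
            scores = self.w_in(nodes) @ self.w_out(nodes).t()
            scores = scores.masked_fill(mask, float("-inf"))   # attend only to the other nodes
            attn = torch.softmax(scores, dim=-1)
            nodes = torch.relu(attn @ self.w_gamma(nodes))
        return self.fc(nodes[-1])                       # global output -> similarity value
```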
FIG. 3 is a schematic diagram of the process of determining the similarity between an image and a text by using the pre-trained matching model. As shown in FIG. 3, the pre-trained matching model first determines the instance regions of the image and the words in the text, and then extracts the global visual feature of the image and the local visual features of the instance regions, as well as the global text feature of the text and the word features of the words, by using the pre-trained CLIP model. The consensus target word features are determined through consensus knowledge (represented by the consensus knowledge base in FIG. 3); the consensus local visual features are then determined based on the consensus target word features and the local visual features, and the consensus word features are determined based on the consensus target word features and the word features. Next, the consensus global visual feature is determined based on the consensus local visual features, and the consensus global text feature is determined based on the consensus word features. The consensus word features and the consensus local visual features are aligned to obtain the attention visual features corresponding to the consensus word features (local alignment in FIG. 3); a similarity graph is constructed based on the consensus global visual feature, the consensus global text feature, and the attention visual features with their corresponding consensus word features; the global output is determined based on the similarity graph; and finally the global output is fed into a fully connected layer to obtain the similarity value between the image and the text (graph inference in FIG. 3).
According to the embodiment of the invention, a top-down and bottom-up attention method can be introduced into feature extraction, so that global and local features can be considered at the same time, local information, spatial information and structural information can be concerned more, and abundant feature representation is provided for a model; in addition, consensus knowledge can be introduced into the matching of the global image and the global text, so that the characteristics of the global image and the characteristics of the global text can be optimized, the global semantic information can be expressed better, and the performance of image-text cross-mode matching is improved. And the learnable similarity representation and the similarity inference based on the self-attention mechanism can be used for fully characterizing the association between the local features of the example areas in the image and the features describing the words in the text, and the information interaction between the local similarity and the global similarity can be introduced, so that the accuracy of image-text matching is improved.
In some embodiments, the initial matching model may include: the device comprises a pre-training feature extraction model, a consensus determination module, an initial graph convolution network with a first initial learning parameter, and a matching module containing a second initial learning parameter, a third initial learning parameter, a fourth initial learning parameter, a fifth initial learning parameter, a sixth initial learning parameter and a seventh initial learning parameter. Based on this, before S103, steps S201 to S209 may be further included:
s201, obtaining a plurality of sample images and a plurality of sample texts which are in one-to-one correspondence with the sample images.
S202, during current training, determining an example area of each sample image by adopting a preprocessing module, and determining each sample word in each sample text.
S203, extracting the global visual feature of each sample image, the local visual feature of each example area in the sample image, the global text feature of each sample text and the word feature of each sample word of the sample text by adopting a pre-training feature extraction model.
Here, the principle of S202 to S203 is the same as that of the corresponding content in the above-described part S1031.
S204, selecting a plurality of words with the occurrence frequency meeting a frequency threshold value from the sample words by using a consensus determining module as preset sample words, and determining the word characteristics of each preset sample word and the co-occurrence times of different preset sample words in the same sample text; the frequency of occurrence of each sample word is the total number of occurrences of the sample word in the plurality of sample texts.
S205, obtaining the consensus target word characteristics corresponding to each preset sample word by using an initial graph convolution network according to the word characteristics, the occurrence frequency, the number of co-occurrences and the first initial learning parameters of the preset sample word.
Here, the principle of S204 to S205 is the same as that of the corresponding contents in the above-described sections S1032 to S1033.
S206, obtaining the consensus local visual feature of the example region of each sample image through an attention mechanism according to the local visual feature, the consensus target word feature and the second initial learning parameter of the example region in each sample image by adopting a matching module, and obtaining the consensus word feature of each sample word through the attention mechanism according to the word feature, the consensus target word feature and the second initial learning parameter of each sample word in each sample text; obtaining a consensus global visual feature of the sample image according to the consensus local visual feature and the third initial learning parameter corresponding to the sample image through an attention mechanism, and obtaining a consensus global text feature of the sample text according to the consensus word feature and the third initial learning parameter corresponding to the sample text; obtaining local similarity according to the consensus word feature, the consensus local visual feature and the fourth initial learning parameter, obtaining global similarity according to the consensus global visual feature, the consensus global text feature and the fifth initial learning parameter, and obtaining global output corresponding to the global similarity through a self-attention mechanism according to the local similarity, the global similarity and the sixth initial learning parameter; and obtaining the similarity between the sample image and the sample text according to the global output and the seventh initial learning parameter.
Here, the principle of S206 is the same as that of the above-described portions S1034 to S1038.
S207, determining a loss value at the current time according to the similarity between each sample image and each sample text and the real label between each sample image and each sample text; the true label is used to indicate whether there is a match between the sample image and the sample text.
For example, the true label between each sample image and each sample text may be represented by 0 or 1. For sample image A and sample text B, a true label of 0 means that sample image A and sample text B do not match, and a true label of 1 means that they match. In this manner, the model parameters may be updated by maximizing the similarity scores of positive sample pairs and minimizing the similarity scores of negative sample pairs.
Illustratively, the loss value may be calculated using a cross-entropy loss function.
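A sketch of such a batch-wise cross-entropy matching loss follows, under the assumption that each sample image is paired with its matching sample text on the diagonal of the similarity matrix and that all other texts in the batch act as negatives, and vice versa.

```python
# A sketch of a symmetric cross-entropy matching loss over a batch.
import torch
import torch.nn.functional as F

def matching_loss(sim_matrix: torch.Tensor) -> torch.Tensor:
    # sim_matrix[i, j]: similarity between sample image i and sample text j,
    # with matching pairs (true label 1) on the diagonal.
    targets = torch.arange(sim_matrix.size(0), device=sim_matrix.device)
    loss_i2t = F.cross_entropy(sim_matrix, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(sim_matrix.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```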
And S208, adjusting the initial learning parameter according to the current loss value to correspondingly obtain a first updated learning parameter, a second updated learning parameter, a third updated learning parameter, a fourth updated learning parameter, a fifth updated learning parameter, a sixth updated learning parameter and a seventh updated learning parameter.
S209, performing next training based on the updated learning parameters until the obtained loss value meets the preset condition, obtaining a pre-training graph convolution network with a first pre-training learning parameter, and a matching module comprising a second pre-training learning parameter, a third pre-training learning parameter, a fourth pre-training learning parameter, a fifth pre-training learning parameter, a sixth pre-training learning parameter and a seventh pre-training learning parameter, thereby obtaining a pre-training matching model.
In the embodiment of the present invention, when the current loss value is obtained, a back-propagation gradient method may be adopted to adjust the first to seventh initial learning parameters, so as to obtain the first to seventh updated learning parameters. Based on these updated learning parameters, the next training iteration is performed according to the same principle as S202 to S208, until the loss value obtained over several consecutive iterations (for example, two or three) after the learning rate has been decayed no longer decreases. The last updated learning parameters are then used as the pre-training learning parameters, yielding the pre-training graph convolution network with the first pre-training learning parameters and the matching module containing the second to seventh pre-training learning parameters. Finally, the model composed of the preprocessing module, the pre-training feature extraction model, the consensus determination module, the pre-training graph convolution network, and the matching module containing the first to seventh pre-training learning parameters may be used as the pre-training matching model.
The matching accuracy of the pre-training matching model in the embodiment of the invention is further explained by experimental data.
1. Description of data set
The model of the invention was evaluated on the MSCOCO and Flickr30K datasets. The images of both datasets are natural images and the texts are English sentences; previous work has conducted experiments on these two datasets, so they are selected for the evaluation experiments. The MSCOCO dataset contains 123,287 images, each with 5 annotated captions (5 texts). The dataset is divided into 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. Results are reported both by averaging over 5 folds of 1K test images and by testing on the full 5K test images. The Flickr30K dataset contains 31,783 images, each with 5 corresponding captions (5 texts); 1,000 images are used for validation, 1,000 images for testing, and the rest for training.
2. Description of evaluation index
For image-text matching, the commonly used evaluation metric is Recall at K (R@K). Specifically, given a query object (an image or a text), the model computes similarity values between the query and all retrieval objects in the retrieval corpus (a text corpus or an image corpus), and then ranks the retrieval objects by similarity. R@K denotes the proportion of queries for which the target object corresponding to the query is ranked within the top K retrieved objects. The invention adopts R@1, R@5 and R@10 as evaluation metrics to evaluate model performance.
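A short sketch of the R@K computation described above follows, under the assumption that queries and their ground-truth retrieval objects are index-aligned.

```python
# A sketch of Recall@K evaluation over a query-vs-corpus similarity matrix.
import torch

def recall_at_k(sim_matrix: torch.Tensor, ks=(1, 5, 10)) -> dict:
    # sim_matrix[i, j]: similarity of query i to retrieval object j.
    ranking = sim_matrix.argsort(dim=1, descending=True)           # best match first
    gt = torch.arange(sim_matrix.size(0), device=sim_matrix.device).unsqueeze(1)
    rank_of_gt = (ranking == gt).float().argmax(dim=1)             # rank position of the ground truth
    return {f"R@{k}": (rank_of_gt < k).float().mean().item() for k in ks}
```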
Table 1. Cross-modal retrieval results of the proposed model on MSCOCO and Flickr30K compared with directly using CLIP.

Table 2. Ablation results of consensus knowledge on Flickr30K.
As can be seen from Table 1, the present invention achieves significant improvements on MSCOCO and Flickr30K compared with directly using CLIP for the cross-modal retrieval task. To verify the effectiveness of the consensus knowledge used in the invention, further ablation experiments were performed; the results on Flickr30K are shown in Table 2. Both image retrieval and text retrieval improve after consensus knowledge is used, with the R@1 of image retrieval improving the most. That is, compared with the original pre-trained model, the pre-training matching model provided by the invention yields a substantial improvement on text and image retrieval tasks.
According to the embodiment of the invention, a visual-text pre-training network with strong cross-modal capability is used as the feature extractor, which mitigates the effect of the inherent "heterogeneous gap" between visual and textual features on model performance. By introducing top-down and bottom-up attention into the visual-text pre-training network, global and local features can be considered simultaneously, more attention is paid to local, spatial and structural information, and rich feature representations are provided for the model. Consensus knowledge is introduced into the matching of the global image and the global text: common words are selected from the caption corpus of the images, and attending to common semantic connections in advance further mines the word feature information with local connection relationships, so that the global image and global text features are optimized, global semantic information is better expressed, and the cross-modal matching performance of image and text is improved. A learnable similarity representation and self-attention-based similarity reasoning are used to fully characterize the relevance between the local regions and the word features, thereby introducing information interaction between local and global similarities and improving the matching accuracy.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A vision and text cross-modal matching method based on consensus embedding space and similarity is characterized by comprising the following steps:
acquiring an input object of an input retrieval platform;
determining candidate objects of the input object; when the input object is a text, the candidate objects are a plurality of preset images of the retrieval platform; when the input object is an image, the candidate object is a description text corresponding to each preset image;
determining the similarity between the input object and each candidate object through a pre-training matching model deployed on the retrieval platform, and screening matching objects of the input object according to the similarity;
the pre-training matching model is obtained by iteratively training an initial matching model with sample images and sample description texts; the pre-training matching model is used for obtaining the consensus target word feature of each preset word according to the word feature of the preset word whose occurrence frequency in the description texts meets the frequency threshold, the occurrence frequency of the preset word and the co-occurrence counts among different preset words; obtaining the consensus local visual feature of each example region through an attention mechanism according to the local visual feature of the example region in the image and the consensus target word features, and obtaining the consensus word feature of each word through the attention mechanism according to the word feature of each word in the description text and the consensus target word features; obtaining the consensus global visual feature of each image through an attention mechanism according to the consensus local visual features of the image, and obtaining the consensus global text feature of each description text according to the consensus word features of the description text; obtaining local similarities according to the consensus word features and the consensus local visual features, obtaining a global similarity according to the consensus global visual feature and the consensus global text feature, and obtaining a global output through a self-attention mechanism according to the local similarities and the global similarity; and obtaining the similarity between the image and the description text according to the global output.
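As a purely illustrative aid to reading claim 1 above (and not part of the claim), the following sketch shows the consensus-space projection and pooling steps using plain scaled dot-product attention. The toy sizes (36 regions, 12 words, 300 preset words, 512-dimensional features), the random feature tensors and the use of a stock attention operator are all assumptions made for illustration, since the claim does not fix them.

```python
import torch
import torch.nn.functional as F

def attend(query, key, value):
    """Plain scaled dot-product attention, used here as an illustrative
    stand-in for the attention mechanisms named in the claim."""
    scores = query @ key.t() / key.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ value

# Assumed toy sizes: 36 example regions, 12 caption words,
# 300 preset (frequent) words, all embedded in a 512-d space.
d = 512
regions    = torch.randn(36, d)    # local visual features of the example regions
words      = torch.randn(12, d)    # word features of the description text
consensus  = torch.randn(300, d)   # consensus target word features (from the graph convolution)
img_global = torch.randn(1, d)     # global visual feature of the image
txt_global = torch.randn(1, d)     # global text feature of the description text

# Regions / words act as queries; consensus target word features are keys and values.
v_local = attend(regions, consensus, consensus)   # consensus local visual features
t_words = attend(words, consensus, consensus)     # consensus word features

# Global features attend over their own consensus local / word features.
v_glob = attend(img_global, v_local, v_local)     # consensus global visual feature
t_glob = attend(txt_global, t_words, t_words)     # consensus global text feature

print(v_local.shape, t_words.shape, v_glob.shape, t_glob.shape)
```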
2. The consensus-embedding-space-and-similarity-based visual and text cross-modal matching method of claim 1, wherein the pre-trained matching model comprises:
the preprocessing module is used for determining an example area of the image and determining words in the description text;
the pre-training feature extraction model is used for extracting global visual features of the image and local visual features of each example region, and extracting global text features of the description text and word features of each word;
the consensus determining module is used for determining word characteristics of preset words, the occurrence frequency of the preset words and the number of co-occurrences between different preset words, wherein the occurrence frequency of the preset words in the description text meets the frequency threshold; the frequency of occurrence of each word is the total number of occurrences of the word in the description text;
the pre-training graph convolutional network is used for obtaining the consensus target word characteristics of the preset words according to the word characteristics, the occurrence frequency, the co-occurrence times and the first pre-training learning parameters of the preset words;
the matching module comprises pre-training learning parameters and is used for obtaining the consensus local visual characteristics of the example region through an attention mechanism according to the local visual characteristics of the example region in the image, the consensus target word characteristics and the second pre-training learning parameters, and obtaining the consensus word characteristics of each word through the attention mechanism according to the word characteristics of each word in the description text, the consensus target word characteristics and the second pre-training learning parameters; obtaining a consensus global visual feature of each image according to the consensus local visual feature and the third pre-training learning parameter of each image through an attention mechanism, and obtaining a consensus global text feature of each description text according to the consensus word feature of each description text and the third pre-training learning parameter; obtaining local similarity according to the consensus word feature, the consensus local visual feature and a fourth pre-training learning parameter, obtaining global similarity according to the consensus global visual feature, the consensus global text feature and a fifth pre-training learning parameter, and obtaining global output through a self-attention mechanism according to the local similarity, the global similarity and a sixth pre-training learning parameter; and obtaining the similarity between the image and the description text according to the global output and the seventh pre-training learning parameter.
3. The visual and text cross-modal matching method based on consensus embedding space and similarity according to claim 1 or 2, wherein when the input object is an input image and the candidate object is a plurality of candidate texts, the determining the similarity between the input object and each candidate object through a pre-trained matching model deployed on the search platform comprises:
for an input image and any candidate text, determining global visual features of the input image, local visual features of each example region in the input image, global text features of any candidate text and word features of words in any candidate text through the pre-training matching model;
selecting a plurality of words with the occurrence frequency meeting the frequency threshold value from the words of the candidate texts as the preset words according to the occurrence frequency of the words in the candidate texts, and determining the word characteristics of each preset word and the co-occurrence times of different preset words in the same candidate text;
obtaining the consensus target word characteristics corresponding to each preset word according to the word characteristics, the occurrence frequency and the co-occurrence times of the preset words;
respectively obtaining the consensus local visual features of the example regions and the consensus word features of each word through an attention mechanism, by using the local visual features of the input image and the word features of any candidate text as queries and the consensus target word features as key values and value items;
respectively obtaining the consensus global visual feature of the input image and the consensus global text feature of any candidate text through an attention mechanism, by taking the global visual feature of the input image and the global text feature of any candidate text as queries, the consensus local visual features of the input image as key values and value items, and the consensus word features of any candidate text as key values and value items;
carrying out feature alignment on the consensus local visual feature of the input image and the consensus word feature of any candidate text to obtain an attention visual feature corresponding to each consensus word feature;
obtaining a plurality of local similarities according to the attention visual features and the corresponding consensus word features, and obtaining global similarities according to the consensus global visual features of the input image and the consensus global text features of any candidate text;
and obtaining global output through a self-attention mechanism according to the local similarity and the global similarity, and obtaining the similarity between the input image and any candidate text according to the global output.
4. The cross-modal matching method for vision and text based on consensus embedding space and similarity according to claim 3, wherein the obtaining of the consensus target word feature corresponding to each preset word according to the word features, the frequency of occurrence and the number of co-occurrences of the preset word comprises:
constructing, according to the word features, the occurrence frequencies and the co-occurrence counts of the preset words, a relation graph which takes the preset words as nodes and represents the co-occurrence relations among the preset words by whether directed edges exist between the nodes;
performing graph convolution processing on the word features of the preset words and the relation graph according to a first pre-training learning parameter to obtain a consensus target word feature corresponding to each preset word; the consensus target word feature represents a feature obtained by mapping a word feature of a preset word to a consensus space.
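A minimal sketch of the step recited in claim 4 is given below, under one plausible construction: directed edge weights derived from conditional co-occurrence frequencies, plus a single graph-convolution layer. The exact normalization, the network depth and all names and toy sizes are assumptions for illustration; the claim itself does not fix them.

```python
import torch
import torch.nn.functional as F

def consensus_word_features(word_feats, freq, cooccur, W):
    """Map preset-word features into the consensus space with one
    graph-convolution step over a co-occurrence relation graph.

    word_feats : (N, d)  word features of the N preset (frequent) words
    freq       : (N,)    occurrence counts of each preset word
    cooccur    : (N, N)  co-occurrence counts between preset words
    W          : (d, d)  learnable projection (standing in for the
                         "first pre-training learning parameter")
    """
    # Directed edge weights: how often word j co-occurs given word i appears.
    adj = cooccur / freq.clamp(min=1).unsqueeze(1)
    # Keep self-loops so each word retains its own feature.
    adj = adj + torch.eye(adj.shape[0])
    # Row-normalize the adjacency, then propagate and project.
    adj = adj / adj.sum(dim=1, keepdim=True)
    return F.relu(adj @ word_feats @ W)

# Toy example: 5 preset words with 300-d word features.
N, d = 5, 300
feats = torch.randn(N, d)
freq = torch.tensor([50., 40., 30., 20., 10.])
cooc = torch.randint(0, 10, (N, N)).float()
W = torch.randn(d, d) * 0.01
print(consensus_word_features(feats, freq, cooc, W).shape)  # torch.Size([5, 300])
```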
5. The visual and text cross-modal matching method based on consensus embedding space and similarity according to claim 3, wherein the local visual features of the input image and the word features of any candidate text are respectively used as queries, the consensus target word features are key values and value items, and the consensus local visual features of the instance region and the consensus word features of each word are respectively obtained through an attention mechanism, comprising:
obtaining the consensus local visual features of the example areas of the input images through an attention mechanism according to a second pre-training learning parameter by taking the local visual features corresponding to the input images as queries and the consensus target word features as key values and value items;
and obtaining the consensus word features of each word of any candidate text through an attention mechanism according to the second pre-training learning parameter, by taking the word features corresponding to any candidate text as queries and the consensus target word features as key values and value items.
6. The cross-modal visual and text matching method according to claim 3, wherein the obtaining of the consensus global visual feature of the input image and the consensus global text feature of the candidate text through an attention mechanism by using the global visual feature of the input image and the global text feature of the candidate text as a query, the consensus local visual feature of the input image as a key value and a value item, and the consensus word feature of the candidate text as a key value and a value item respectively comprises:
obtaining the consensus global visual feature of the input image through an attention mechanism according to a third pre-training learning parameter by taking the global visual feature of the input image as a query and the consensus local visual feature of the input image as a key value and a value item;
and obtaining the consensus global text feature of any candidate text through an attention mechanism according to the third pre-training learning parameter by taking the global text feature of any candidate text as a query and the consensus word feature of any candidate text as a key value and a value item.
7. The consensus embedding space and similarity-based visual and text cross-modal matching method of claim 3,
for an attention visual feature and a corresponding consensus word feature, the resulting local similarity is expressed as:

s_j^l = w_l · |v̂_j − t_j|² / ‖v̂_j − t_j‖₂

wherein v̂_j denotes the attention visual feature, t_j denotes the corresponding consensus word feature, |·|² denotes the element-wise square, ‖·‖₂ denotes the l₂ norm, and w_l denotes the fourth pre-training learning parameter;

for the consensus global visual feature and the consensus global text feature, the obtained global similarity is expressed as:

s^g = w_g · |v̄ − t̄|² / ‖v̄ − t̄‖₂

wherein v̄ denotes the consensus global visual feature, t̄ denotes the consensus global text feature, and w_g denotes the fifth pre-training learning parameter.
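For illustration only, the similarity representation recited in claim 7 can be computed as follows. Treating w_l and w_g as projection matrices that map the d-dimensional normalized squared difference to an m-dimensional similarity vector is an assumption, as are the toy dimensions; the random tensors merely stand in for features produced by the earlier steps.

```python
import torch

def sim_vector(x, y, W):
    """Learnable similarity representation: element-wise squared difference,
    normalized by the l2 norm of the difference, then linearly projected."""
    diff = x - y
    num = diff ** 2                                              # |x - y|^2, element-wise
    den = diff.norm(p=2, dim=-1, keepdim=True).clamp(min=1e-8)   # ||x - y||_2
    return (num / den) @ W

d, m = 512, 256                              # feature dim and similarity dim (assumed)
v_hat = torch.randn(12, d)                   # attention visual features, one per word
t = torch.randn(12, d)                       # consensus word features
v_bar, t_bar = torch.randn(1, d), torch.randn(1, d)   # consensus global features

W_l = torch.randn(d, m) * 0.01               # stands in for the fourth learning parameter
W_g = torch.randn(d, m) * 0.01               # stands in for the fifth learning parameter

s_local = sim_vector(v_hat, t, W_l)          # (12, m) local similarity vectors
s_global = sim_vector(v_bar, t_bar, W_g)     # (1, m) global similarity vector
print(s_local.shape, s_global.shape)
```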
8. The method for cross-modal matching of visual and text based on consensus embedding space and similarity according to claim 3, wherein the obtaining a global output through a self-attention mechanism according to the local similarity and the global similarity and obtaining a similarity between the input image and any one of the candidate texts according to the global output comprises:
splicing the local similarity and the global similarity to obtain a similar graph containing a plurality of similarities;
according to a sixth pre-training learning parameter, performing self-attention calculation on the similarity graph for a preset time to obtain global output of global similarity;
and obtaining the similarity according to a seventh pre-training learning parameter and the global output.
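A minimal sketch of claim 8 is shown below, using a stock multi-head self-attention module as a stand-in for the claimed self-attention computation over the similarity graph; the number of rounds, the residual update and the linear scoring head are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Assumed sizes: 12 local similarity vectors plus 1 global, each m-dimensional.
m, steps = 256, 3                       # similarity dim and preset number of rounds
s_local = torch.randn(12, m)
s_global = torch.randn(1, m)

# Splice the local and global similarities into one sequence ("similarity graph").
nodes = torch.cat([s_global, s_local], dim=0).unsqueeze(0)    # (1, 13, m)

# Self-attention applied a preset number of times; the sixth learning
# parameter corresponds to the weights inside this module.
attn = nn.MultiheadAttention(embed_dim=m, num_heads=4, batch_first=True)
for _ in range(steps):
    out, _ = attn(nodes, nodes, nodes)
    nodes = nodes + out                 # residual update of the similarity graph

# Global output of the global-similarity node -> scalar similarity
# via a linear head (standing in for the seventh learning parameter).
score_head = nn.Linear(m, 1)
similarity = score_head(nodes[:, 0])
print(similarity.item())
```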
9. The consensus embedding space and similarity-based visual and text cross-modal matching method of claim 1, wherein the initial matching model comprises: the device comprises a pre-training feature extraction model, a consensus determination module, an initial graph convolution network with a first initial learning parameter, and a matching module containing a second initial learning parameter, a third initial learning parameter, a fourth initial learning parameter, a fifth initial learning parameter, a sixth initial learning parameter and a seventh initial learning parameter;
before the determining the similarity between the input object and each candidate object through a pre-trained matching model deployed on the retrieval platform, the method further comprises:
acquiring a plurality of sample images and a plurality of sample texts which are in one-to-one correspondence with the sample images;
during the current training, determining an example area of each sample image by adopting the preprocessing module, and determining each sample word in each sample text;
extracting the global visual feature of each sample image, the local visual feature of each example area in the sample image, the global text feature of each sample text and the word feature of each sample word of the sample text by adopting the pre-training feature extraction model;
selecting a plurality of words with the occurrence frequency meeting the frequency threshold value from the sample words by adopting the consensus determining module as the preset sample words, and determining the word characteristics of each preset sample word and the co-occurrence times of different preset sample words in the same sample text; the frequency of occurrence of each sample word is the total number of occurrences of the sample word in the plurality of sample texts;
obtaining the consensus target word characteristics corresponding to each preset sample word by adopting the initial graph convolution network according to the word characteristics, the occurrence frequency, the number of co-occurrences of the preset sample words and the first initial learning parameters;
obtaining, by adopting the matching module, the consensus local visual feature of each example region of each sample image through an attention mechanism according to the local visual feature of the example region in the sample image, the consensus target word features and the second initial learning parameter, and obtaining the consensus word feature of each sample word through the attention mechanism according to the word feature of each sample word in each sample text, the consensus target word features and the second initial learning parameter; obtaining a consensus global visual feature of the sample image through an attention mechanism according to the consensus local visual features corresponding to the sample image and the third initial learning parameter, and obtaining a consensus global text feature of the sample text according to the consensus word features corresponding to the sample text and the third initial learning parameter; obtaining local similarities according to the consensus word features, the consensus local visual features and the fourth initial learning parameter, obtaining a global similarity according to the consensus global visual feature, the consensus global text feature and the fifth initial learning parameter, and obtaining a global output corresponding to the global similarity through a self-attention mechanism according to the local similarities, the global similarity and the sixth initial learning parameter; and obtaining the similarity between the sample image and the sample text according to the global output and the seventh initial learning parameter;
determining a loss value at the current time according to the similarity between each sample image and each sample text and the real label between each sample image and each sample text; the real label is used for indicating whether the sample image is matched with the sample text;
adjusting the initial learning parameter according to the current loss value to correspondingly obtain a first updated learning parameter, a second updated learning parameter, a third updated learning parameter, a fourth updated learning parameter, a fifth updated learning parameter, a sixth updated learning parameter and a seventh updated learning parameter;
and training the next time based on the updated learning parameters until the obtained loss value meets the preset condition, obtaining a pre-training graph convolution network with a first pre-training learning parameter, and a matching module comprising a second pre-training learning parameter, a third pre-training learning parameter, a fourth pre-training learning parameter, a fifth pre-training learning parameter, a sixth pre-training learning parameter and a seventh pre-training learning parameter, thereby obtaining the pre-training matching model.
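Claim 9 does not fix the form of the loss; as one common choice in image-text matching, a hinge-based bidirectional triplet ranking loss with hardest negatives is sketched below. The margin value and the batch layout (matched image-text pairs on the diagonal of the similarity matrix) are assumptions made for illustration.

```python
import torch

def triplet_ranking_loss(sim, margin=0.2):
    """Example loss only; `sim` is a (B, B) similarity matrix for a batch
    whose matched image-text pairs lie on the diagonal."""
    pos = sim.diag().unsqueeze(1)                       # similarity of true pairs
    cost_txt = (margin + sim - pos).clamp(min=0)        # image -> negative texts
    cost_img = (margin + sim - pos.t()).clamp(min=0)    # text  -> negative images
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    # Hardest-negative variant: keep only the largest violation per query.
    return cost_txt.max(dim=1).values.mean() + cost_img.max(dim=0).values.mean()

sim = torch.randn(8, 8, requires_grad=True)             # similarities from the model
loss = triplet_ranking_loss(sim)
loss.backward()   # gradients would drive the parameter updates described in claim 9
print(float(loss))
```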
10. The consensus-embedding-space-and-similarity-based visual and text cross-modal matching method according to claim 2 or 9, wherein the pre-trained feature extraction model is a CLIP model.
CN202211385488.5A 2022-11-07 2022-11-07 Visual and text cross-modal matching method based on consensus embedding space and similarity Pending CN115935194A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211385488.5A CN115935194A (en) 2022-11-07 2022-11-07 Visual and text cross-modal matching method based on consensus embedding space and similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211385488.5A CN115935194A (en) 2022-11-07 2022-11-07 Visual and text cross-modal matching method based on consensus embedding space and similarity

Publications (1)

Publication Number Publication Date
CN115935194A true CN115935194A (en) 2023-04-07

Family

ID=86648004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211385488.5A Pending CN115935194A (en) 2022-11-07 2022-11-07 Visual and text cross-modal matching method based on consensus embedding space and similarity

Country Status (1)

Country Link
CN (1) CN115935194A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116578734A (en) * 2023-05-20 2023-08-11 重庆师范大学 Probability embedding combination retrieval method based on CLIP
CN116578734B (en) * 2023-05-20 2024-04-30 重庆师范大学 Probability embedding combination retrieval method based on CLIP
CN117407785A (en) * 2023-12-15 2024-01-16 西安晟昕科技股份有限公司 Training method of radar signal recognition model, radar signal recognition method and device
CN117407785B (en) * 2023-12-15 2024-03-01 西安晟昕科技股份有限公司 Training method of radar signal recognition model, radar signal recognition method and device

Similar Documents

Publication Publication Date Title
WO2021223323A1 (en) Image content automatic description method based on construction of chinese visual vocabulary list
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN106372061B (en) Short text similarity calculation method based on semantics
CN111401077B (en) Language model processing method and device and computer equipment
CN115935194A (en) Visual and text cross-modal matching method based on consensus embedding space and similarity
CN110728151B (en) Information depth processing method and system based on visual characteristics
CN108509521B (en) Image retrieval method for automatically generating text index
CN112307364B (en) Character representation-oriented news text place extraction method
CN111222330B (en) Chinese event detection method and system
CN110956044A (en) Attention mechanism-based case input recognition and classification method for judicial scenes
Zhou et al. Neural storyline extraction model for storyline generation from news articles
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN116450883A (en) Video moment retrieval method based on video content fine granularity information
Olaleye et al. Attention-based keyword localisation in speech using visual grounding
Vahdat-Nejad et al. Russia-ukraine war: Modeling and clustering the sentiments trends of various countries
Li Text recognition and classification of english teaching content based on SVM
Lei et al. Open domain question answering with character-level deep learning models
CN117437422A (en) Medical image recognition method and device
CN116804998A (en) Medical term retrieval method and system based on medical semantic understanding
Zhang et al. Wikitag: Wikipedia-based knowledge embeddings towards improved acoustic event classification
CN117216617A (en) Text classification model training method, device, computer equipment and storage medium
Kádár et al. Learning word meanings from images of natural scenes
Han et al. Unsupervised Word Sense Disambiguation based on Word Embedding and Collocation.
CN116151258A (en) Text disambiguation method, electronic device and storage medium
Liu Toward robust and efficient interpretations of idiomatic expressions in context

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination