CN113191357B - Multilevel image-text matching method based on graph attention network - Google Patents

Multilevel image-text matching method based on graph attention network

Info

Publication number
CN113191357B
CN113191357B (application CN202110550780.7A)
Authority
CN
China
Prior art keywords
image
text
graph
matching
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110550780.7A
Other languages
Chinese (zh)
Other versions
CN113191357A (en
Inventor
吴杰
吴春雷
王雷全
路静
段海龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN202110550780.7A priority Critical patent/CN113191357B/en
Publication of CN113191357A publication Critical patent/CN113191357A/en
Application granted granted Critical
Publication of CN113191357B publication Critical patent/CN113191357B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multilevel image-text matching method based on a graph attention network. A key challenge of this task is learning the correspondence between images and text. Most existing work learns only the local semantic relationships between objects, and little research addresses the local phrase-level relationships between objects and their relations. From a graph view, the invention constructs a multilevel image-text matching method based on a graph attention network. The network builds attention-weighted graph structures over image regions and text words and performs graph matching on them, so as to infer fine-grained structured phrase correspondences. Meanwhile, global semantics are inferred from the constructed graph structures for global matching as a complement to graph matching, achieving more comprehensive cross-media semantic matching. Extensive experiments show that the multilevel image-text matching method based on the graph attention network can learn graph-view matching and global-view matching simultaneously and obtains competitive results on the MSCOCO and Flickr30K datasets.

Description

Multilevel image-text matching method based on graph attention network
Technical Field
The invention relates to an image-text matching method in the technical fields of computer vision and natural language processing.
Background
Vision and language are two core components of how people understand everyday matters. However, there is a large semantic gap between visual and textual data, which makes it very challenging to measure their semantic similarity effectively, and cross-modal tasks therefore attract much attention from researchers. The key to this task is a common understanding of vision and text, and the semantic correspondence between the two information sources.
With the rapid development of deep learning, deep neural networks have shown great potential in various multimedia applications such as cross-media retrieval and text generation. Current matching methods based on deep learning can be roughly divided into two categories: one-to-one methods and many-to-many methods. The former take the whole image and the whole text as study objects and learn their correspondence; the latter infer image-sentence similarity by aligning visual regions with text words.
Based on the one-to-one approach, researchers at home and abroad have explored various common-space representation methods, hoping to project images and texts into a common space and directly compare their similarity. In early work, many researchers attempted to directly compare the similarity between images and text by projecting them into a common space. Kiros et al. proposed unifying visual-semantic embeddings with multimodal neural language models, introducing an encoder-decoder pipeline that learns a multimodal joint embedding space of images and text and designing a new decoder language model for decoding distributed representations from that space. On this basis, Wu et al. proposed an online learning method that learns image-text correspondence by maintaining bidirectional relative similarity; however, it does not consider the feature distribution within a single modality. Zheng et al. therefore proposed a two-path CNN model for visual-text embedding learning and added an instance loss to account for the data distribution within each modality. Huang et al. proposed learning semantic concepts and organizing them in the correct semantic order to improve image representation, while Li et al. reasoned about visual representations by capturing objects and their semantic relationships. Although these efforts have made great progress in image-text alignment, they lack local fine-grained analysis of image-text pairs, and cross-media matching performance is limited because fine-grained information is ignored and some redundant information is mixed in.
Subsequently, local cross-media matching methods were proposed, which perform local similarity learning between all region-word pairs. Karpathy et al. first proposed learning the relationship between all region-word pairs by computing their similarity. Following this idea, Lee et al. proposed a hierarchical LSTM to associate the shared semantics of words and regions. On this basis, Wu et al. proposed learning image-text alignment by measuring bidirectional relative semantic similarity. To further address visual-semantic differences, Niu et al. organized the text into a semantic tree with one phrase per node and then used a hierarchical long short-term memory network (LSTM, a variant of RNN) to extract phrase-level text features for computing the similarity between image regions and words. Ma et al. proposed a novel multi-faceted feature matching network (MFM) that explores the matching relationship between the two modalities by describing multi-faceted representations of images and text; however, these methods do not take into account the different importance of each region-word pair when computing global similarity. In addition, Lee et al. devised a stacked cross-attention network that infers image-text matching by attending to the words related to each region, or the regions related to each word.
However, most of these efforts learn only the local semantic relationships between objects, and little work addresses the local phrase-level semantic relationships between objects and their relations. In contrast to these prior methods, we propose a multilevel image-text matching method based on a graph attention network. The network constructs graph structures for the picture and the text and executes a phrase graph matching module to perform graph matching, so as to infer fine-grained structured correspondences. Meanwhile, global semantics are inferred from the constructed graph structures for global matching as a complement to graph matching, achieving more comprehensive cross-media semantic matching.
Disclosure of Invention
The invention aims to solve the problem that most existing image-text matching methods learn only the local semantic relationships between objects and rarely consider the local phrase-level semantic relationships between objects and their relations.
The technical scheme adopted by the invention to solve this technical problem is as follows:
S1, constructing graph structures for the images and the texts, and assigning different weights to regions or words according to their importance to the image or the text.
S2, inferring global semantic features from the image graph and the text graph of S1 to perform global matching.
S3, constructing a phrase graph matching module, which first performs node matching between the text graph and the image graph and then performs phrase matching.
S4, combining the networks of S2 and S3 to construct the overall framework of the multilevel image-text matching method based on the graph attention network.
S5, training the multilevel image-text matching method based on the graph attention network.
First, for the image part, each image is represented as an undirected fully connected graph whose nodes are the salient regions detected by Faster R-CNN, and each node is connected to all other nodes. Here X denotes the region features and E_x is the affinity matrix computed over each pair of region features x_i and x_j. The image graph is defined as:
G_x = (X, E_x)   (1)
E_x(x_i, x_j) = (x_i)^T x_j   (2)
In this way, the more relevant two image regions are, the higher the affinity score of their edge, which yields the fully connected image graph G_x. The graph is then processed with a graph attention network (GAT), which outputs features enhanced by global semantic relationships. The attention coefficients are computed and normalized with the softmax function as follows:
e_ij = a(W_q x_i, W_k x_j)   (3)
[equation (4); formula image not reproduced]
μ_ij = Softmax(e_ij)   (5)
W_q and W_k are learnable parameters. Meanwhile, the attention coefficients are computed with multi-head dot-product attention, which in practice is faster and more space-efficient:
[equations (6)-(7): multi-head dot-product attention; formula images not reproduced]
In equation (6), || denotes concatenation, and the parameters are defined in a formula image that is not reproduced. H = 8 parallel attention heads are used, with d = D/8. Then, under a nonlinear activation function, the final output features are computed as:
[equation (8): final output features under the nonlinear activation; formula image not reproduced]
where N is the neighborhood of node i in the graph. To speed up training, batch normalization is added to the graph attention module. For the text graph, the same graph attention operation is applied to obtain a finer text representation:
G_y = (Y, E_y)   (9)
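For illustration only, the graph construction and graph attention of equations (1)-(9) could be sketched in PyTorch roughly as follows. This is a minimal sketch under assumptions, not the exact implementation of the invention: the value projection, the residual connection, the 1/sqrt(d) scaling of the dot product, and the tensor shapes are all assumptions not specified above.

import torch
import torch.nn as nn


class GraphAttention(nn.Module):
    # Multi-head dot-product graph attention over a fully connected region/word graph,
    # sketching equations (3)-(8). Because the graph is fully connected, every node
    # simply attends to every other node; the affinity matrix E_x of equation (2) is
    # implicit in the dot products. Assumed: value projection, residual, 1/sqrt(d) scale.
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d = heads, dim // heads
        self.w_q = nn.Linear(dim, dim, bias=False)  # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)  # W_k
        self.w_v = nn.Linear(dim, dim, bias=False)  # value projection (assumed)
        self.bn = nn.BatchNorm1d(dim)               # batch normalization, as in the text

    def forward(self, x):
        # x: (batch, nodes, dim) region features X or word features Y of one graph
        b, n, _ = x.shape
        q = self.w_q(x).view(b, n, self.heads, self.d).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.heads, self.d).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.heads, self.d).transpose(1, 2)
        # e_ij = (W_q x_i)^T (W_k x_j), normalized with softmax (equations (3)-(5))
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)   # concatenate the H = 8 heads
        out = torch.relu(out + x)                            # nonlinear activation (equation (8))
        return self.bn(out.transpose(1, 2)).transpose(1, 2)  # BN over the feature axis


# Example: enhance an image graph of 36 detected regions (G_x) or a word graph (G_y)
enhanced_regions = GraphAttention()(torch.randn(2, 36, 1024))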
The global semantic matching of the invention differs from previous research methods, which extract global semantics directly from the image and the text. Here, the image graph G_x and the text graph G_y are constructed first; after the graph attention network, the relationships between the nodes in the image graph and the text graph are strengthened, and the final image and text representations are obtained as X_global = Mean(G_x) and Y_global = Mean(G_y), where Mean denotes average pooling. The global matching similarity score of an image-text pair is computed as:
S_global = R_global(X_global, Y_global)   (10)
After obtaining the two modal representations X_global and Y_global, a hinge-based triplet ranking loss is employed to supervise the latent-space learning process. The loss function looks for the hardest negatives in a mini-batch, which form triplets together with the positive samples and the ground-truth queries. The loss function is defined as follows.
L_global = [η + R_global(X'_global, Y_global) - R_global(X_global, Y_global)]_+ + [η + R_global(X_global, Y'_global) - R_global(X_global, Y_global)]_+   (11)
where R_global(·) is a similarity function, implemented as cosine similarity in the model; X'_global and Y'_global denote the hardest negative image and text in the mini-batch, and [·]_+ = max(·, 0).
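As a hedged illustration of equation (11), the hinge-based triplet ranking loss with in-batch hardest negatives could look roughly like the sketch below. The in-batch mining follows the description above; the function name, the assumption that matched pairs lie on the diagonal of the score matrix, and the final averaging are illustrative choices.

import torch
import torch.nn.functional as F


def triplet_ranking_loss(x_glo, y_glo, margin=0.2):
    # x_glo, y_glo: (batch, dim) global image / text embeddings of matched pairs.
    # R_global is taken as cosine similarity, as stated in the text.
    x = F.normalize(x_glo, dim=-1)
    y = F.normalize(y_glo, dim=-1)
    scores = x @ y.t()                      # R_global for every image-text pair in the batch
    pos = scores.diag().unsqueeze(1)        # R_global(X_global, Y_global) of matched pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg_y = scores.masked_fill(mask, -1e9).max(dim=1).values.unsqueeze(1)  # hardest negative text
    neg_x = scores.masked_fill(mask, -1e9).max(dim=0).values.unsqueeze(1)  # hardest negative image
    # the two hinge terms of equation (11), averaged over the mini-batch
    return ((margin + neg_y - pos).clamp(min=0) + (margin + neg_x - pos).clamp(min=0)).mean()


# Example with a mini-batch of 64 matched pairs
loss = triplet_ranking_loss(torch.randn(64, 1024), torch.randn(64, 1024))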
For phrase-graph matching, the process is split into two directions: image-to-text and text-to-image. The similarity between image nodes and text nodes, denoted S, is computed first, and a softmax is then taken along the image axis; the similarity value measures how well an image node corresponds to each text node. The text nodes are then aggregated as a weighted combination of feature vectors, where the weights are the computed similarities. This process can be expressed as:
[equation (12): similarity between image and text nodes; formula image not reproduced]
F_{i→t} = softmax_α(S) Y_β   (13)
Then, the i-th image node feature and the corresponding aggregated text feature are each segmented into m blocks, denoted [x_{i1}, x_{i2}, ..., x_{im}] and [y_{i1}, y_{i2}, ..., y_{im}]. Multi-block similarity is computed within each block pair; for example, the similarity of the h-th block is computed as s_{ih} = cos(x_{ih}, y_{ih}), where s_{ih} is a scalar and cos(·) denotes cosine similarity. The matching vector of the i-th image node is obtained by concatenating the similarities of all blocks:
s_i = s_{i1} || s_{i2} || ... || s_{im}   (14)
where "||" denotes concatenation. Symmetrically, when the text graph is given, node-level matching is performed for each text node, and the corresponding image nodes are associated with different weights. Each text node, together with its associated image nodes, is then processed by the multi-block module to produce a matching vector s. In this way, each image node is associated with its matching text nodes, and this correspondence is propagated to neighboring nodes in the final graph matching to guide them to learn fine-grained phrase correspondences.
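The node-level matching of equations (12)-(14) can be sketched as below. This is an assumption-laden illustration rather than the implementation of the invention: the block count m = 8, the use of cosine similarity for S, the softmax axis, and the aggregation of text nodes per image node are all inferred from the description above.

import torch
import torch.nn.functional as F


def node_matching_vectors(x_nodes, y_nodes, blocks=8):
    # x_nodes: (n_regions, dim) image-node features; y_nodes: (n_words, dim) text-node features.
    s = F.normalize(x_nodes, dim=-1) @ F.normalize(y_nodes, dim=-1).t()  # similarity S (eq. 12)
    attn = torch.softmax(s, dim=1)                  # attention weights from the similarities
    agg = attn @ y_nodes                            # aggregated text feature per image node (eq. 13)
    xb = x_nodes.view(x_nodes.size(0), blocks, -1)  # [x_i1, ..., x_im]
    yb = agg.view(agg.size(0), blocks, -1)          # [y_i1, ..., y_im]
    # s_ih = cos(x_ih, y_ih); stacking over h gives the matching vector s_i (eq. 14)
    return F.cosine_similarity(xb, yb, dim=-1)      # (n_regions, blocks)


# Example: 36 image nodes matched against 12 word nodes, m = 8 blocks per node
match_vectors = node_matching_vectors(torch.randn(36, 1024), torch.randn(12, 1024))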
Graph matching takes the node matching vectors as input and propagates them to neighbors along the edges of the graph. The matching vector of each node is updated from the matching vectors in its neighborhood using a graph convolutional network (GCN); how the GCN learns to integrate the neighborhood matching vectors is shown in the following formula:
[equation (15): graph convolution over the neighborhood matching vectors; formula image not reproduced]
where N_j is the neighborhood of the j-th node. Polar coordinates (ρ, θ) are computed from the centers of the bounding boxes of each pair of regions, and the edge weight matrix W_e is set from these polar-coordinate pairs; W_k and b are the parameters to be learned for the k-th kernel. The output of the spatial convolution is defined as the concatenation of the outputs of the k kernels, producing a convolution vector that reflects the correspondence of the connected nodes. These nodes constitute a structured phrase.
Phrase correspondences can be inferred by propagating the correspondences of neighboring nodes, and from them the graph matching score of an image-text pair is inferred. The convolution vectors are fed into a multi-layer perceptron (MLP) to jointly consider the learned correspondences of all phrases and infer the graph matching score, which represents how well the image graph matches the text graph. This process is expressed as:
[equation (16): image-to-text graph matching score; formula image not reproduced]
The symmetric phrase matching process from the text graph to the image graph is as follows:
[equation (17): text-to-image graph matching score; formula image not reproduced]
where W_s and b_s are the parameters of the MLP, which consists of two fully connected layers, and the function σ(·) denotes the tanh activation. The graph matching similarity score of an image-text pair is computed over the two directions:
[equation (18): bidirectional graph matching score S_graph; formula image not reproduced]
Finally, the image-text similarity combining multilevel semantic alignment is computed as:
Sim = S_global + S_graph   (19)
Therefore, by utilizing global, local, and relational information, the complementarity between images and sentences is fully exploited, their correlation is comprehensively modeled, and cross-media matching is promoted.
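For illustration, a strongly simplified sketch of the graph matching stage of equations (15)-(19) follows. It replaces the polar-coordinate edge weights and the k spatial kernels with a plain row-normalized adjacency aggregation, so it only demonstrates the propagate-then-score idea; the hidden sizes, the mean pooling of node scores, and the class name are assumptions.

import torch
import torch.nn as nn


class PhraseGraphMatching(nn.Module):
    # Propagates node matching vectors over the graph with a simple graph convolution
    # and scores the result with a two-layer MLP (tanh), echoing equations (15)-(18).
    def __init__(self, match_dim=8, hidden=32):
        super().__init__()
        self.gcn = nn.Linear(match_dim, hidden)  # stands in for W_k, b of the spatial kernels
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, match_vecs, adj):
        # match_vecs: (nodes, match_dim) node matching vectors s_i
        # adj: (nodes, nodes) affinity matrix of the image (or text) graph
        norm_adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-8)
        phrase = torch.tanh(self.gcn(norm_adj @ match_vecs))  # propagate to neighbours (eq. 15)
        return self.mlp(phrase).mean()                        # one direction of S_graph (eqs. 16-18)


# Example: Sim = S_global + S_graph (equation (19)), with S_global computed upstream
s_graph = PhraseGraphMatching()(torch.randn(36, 8), torch.rand(36, 36))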
The multilevel image-text matching method based on the graph attention network comprises a graph construction module, a phrase graph matching module, and a global semantic matching network.
Finally, the training method of the multilevel image-text matching method based on the graph attention network is as follows:
During training, all experiments were implemented with Python 3.6 and the PyTorch framework and run on a computer with an Nvidia Tesla P100 GPU. For each sentence, the word embedding size is set to 300 dimensions, and words are encoded into 1024-dimensional vectors with a bidirectional GRU. Image preprocessing uses a bottom-up attention model to extract region features; each image feature vector is set to 1024 dimensions, the same dimensionality as the text features. The model is trained with the Adam optimizer with a batch size of 64: 20 epochs on the MSCOCO dataset with a 10% learning-rate decay every 15 epochs, and 30 epochs on the Flickr30k dataset with the same decay every 15 epochs. The learning rate is set to 0.0005 on the MSCOCO dataset and 0.0002 on the Flickr30k dataset. In addition, the two hyperparameters are set to 20 and 0.2, respectively.
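A minimal sketch of this training schedule (MSCOCO settings) is given below; the placeholder model, the reading of a "10% decay" as multiplying the learning rate by 0.9 every 15 epochs, and the omitted data-loading and loss computation are assumptions.

import torch

model = torch.nn.Linear(1024, 1024)   # placeholder for the full matching network
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)                       # Adam, lr 0.0005 (MSCOCO)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.9)   # decay every 15 epochs

for epoch in range(20):               # 20 epochs on MSCOCO (30 on Flickr30k, lr 0.0002)
    # ... iterate over mini-batches of 64 image-text pairs, compute Sim and the triplet
    # ranking loss, then optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()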
Compared with the prior art, the beneficial effects of the invention are:
1. The invention provides a novel multilevel image-text matching method based on a graph attention network, which constructs an image structure graph and a text structure graph, performs feature weighting with a graph attention module, and carries out phrase semantic graph matching to achieve structured phrase alignment.
2. For the first time, a method for learning global and local consistency is proposed that learns global and structured-phrase similarities for matching, rather than global and local-object similarities.
Drawings
Fig. 1 is a schematic structural diagram of a multilevel image-text matching method based on a graph attention network.
Fig. 2 is a schematic model diagram of the phrase graph matching module.
Fig. 3 and 4 are graphs comparing the results of image-text matching on MSCOCO and Flickr30K datasets for a multi-level image-text matching method based on a graph attention network and other networks.
Fig. 5 and 6 are graphs of visualization results of image matching text and text matching images.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
The invention is further illustrated below with reference to the figures and examples.
Fig. 1 is a schematic structural diagram of the multilevel image-text matching method based on the graph attention network. As shown in Fig. 1, Faster R-CNN and GRU operations are first applied to the pictures and texts to extract region features and text features, from which the image graph and the text graph are constructed. The image graph and the text graph are then weighted by the graph attention network; the nodes of a graph can be objects, relations, or attributes, and an edge exists between any two nodes that are semantically dependent. Finally, phrase graph matching and global semantic matching are performed respectively.
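As an illustrative sketch of the text-side feature extraction mentioned above (300-dimensional word embeddings encoded into 1024-dimensional word nodes with a bidirectional GRU), the following snippet could be used; the vocabulary size and the averaging of the two GRU directions are assumptions.

import torch
import torch.nn as nn


class TextNodeEncoder(nn.Module):
    # Word indices -> 300-dim embeddings -> bidirectional GRU -> 1024-dim word-node features.
    def __init__(self, vocab_size=10000, embed_dim=300, feat_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, feat_dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) integer word indices
        h, _ = self.gru(self.embed(word_ids))  # (batch, seq_len, 2 * feat_dim)
        fwd, bwd = h.chunk(2, dim=-1)
        return (fwd + bwd) / 2                 # (batch, seq_len, 1024) word-node features


# Example: word nodes for a 12-word sentence
word_nodes = TextNodeEncoder()(torch.randint(0, 10000, (1, 12)))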
Fig. 2 is a schematic model diagram of the phrase graph matching module. As shown in Fig. 2, phrase graph matching is divided into two processes: image-to-text and text-to-image. The similarity between image nodes and text nodes, denoted S, is computed first, and a softmax is then taken along the image axis; the similarity value measures the degree of correspondence between an image node and each text node. The text nodes are then aggregated as a weighted combination of feature vectors, where the weights are the computed similarities. This process can be expressed as:
[equation (20): similarity between image and text nodes; formula image not reproduced]
F_{i→t} = softmax_α(S) Y_β   (21)
Then, the i-th image node feature and the corresponding aggregated text feature are each segmented into m blocks, denoted [x_{i1}, x_{i2}, ..., x_{im}] and [y_{i1}, y_{i2}, ..., y_{im}]. Multi-block similarity is computed within each block pair; for example, the similarity of the h-th block is computed as s_{ih} = cos(x_{ih}, y_{ih}), where s_{ih} is a scalar and cos(·) denotes cosine similarity. The matching vector of the i-th image node is obtained by concatenating the similarities of all blocks:
s_i = s_{i1} || s_{i2} || ... || s_{im}   (22)
where "||" denotes concatenation. Symmetrically, when the text graph is given, node-level matching is performed for each text node, and the corresponding image nodes are associated with different weights. Each text node, together with its associated image nodes, is then processed by the multi-block module to produce a matching vector s. In this way, each image node is associated with its matching text nodes, and this correspondence is propagated to neighboring nodes in the final graph matching to guide them to learn fine-grained phrase correspondences.
Graph matching takes the node matching vectors as input and propagates them to neighbors along the edges of the graph. The matching vector of each node is updated from the matching vectors in its neighborhood using a graph convolutional network (GCN); how the GCN learns to integrate the neighborhood matching vectors is shown in the following formula:
[equation (23): graph convolution over the neighborhood matching vectors; formula image not reproduced]
where N_j is the neighborhood of the j-th node. Polar coordinates (ρ, θ) are computed from the centers of the bounding boxes of each pair of regions, and the edge weight matrix W_e is set from these polar-coordinate pairs; W_k and b are the parameters to be learned for the k-th kernel. The output of the spatial convolution is defined as the concatenation of the outputs of the k kernels, producing a convolution vector that reflects the correspondence of the connected nodes. These nodes constitute a structured phrase.
Phrase correspondences can be inferred by propagating the correspondences of neighboring nodes, and from them the graph matching score of the image-text pair is inferred. The convolution vectors are fed into a multi-layer perceptron (MLP) to jointly consider the learned correspondences of all phrases and infer the graph matching score, which represents how well the image graph matches the text graph. This process is expressed as:
[equation (24): image-to-text graph matching score; formula image not reproduced]
The symmetric phrase matching process from the text graph to the image graph is as follows:
[equation (25): text-to-image graph matching score; formula image not reproduced]
where W_s and b_s are the parameters of the MLP, which consists of two fully connected layers, and the function σ(·) denotes the tanh activation. The graph matching similarity score of an image-text pair is computed over the two directions:
[equation (26): bidirectional graph matching score; formula image not reproduced]
fig. 3 and 4 are graphs comparing the results of image-text matching on MSCOCO and Flickr30K datasets for a multi-level image-text matching method based on a graph attention network and other networks. As shown in fig. 3 and 4, the multi-level image-text matching result based on the graph attention network is more accurate than other models.
Fig. 5 and 6 are graphs of visualization results of image matching text and text matching images. As shown in fig. 5, given an image, a multi-level model based on the graph attention network can match corresponding texts. Given text, a multi-level model based on a graph attention network can match corresponding pictures, as shown in fig. 6.
The invention provides a multilevel image-text matching method based on a graph attention network, which selects nodes for a picture with a bottom-up attention model, constructs an image structure graph and a text structure graph from the image nodes and word nodes, enhances the graph features through graph attention, and executes a phrase graph matching module to perform graph matching, thereby achieving structured phrase alignment. Meanwhile, global semantic matching is added, and more accurate cross-media correlations are mined by learning global and structured local similarities. Extensive experiments on the MSCOCO and Flickr30K datasets show that this model has a positive effect on image-text matching. In future work, we will continue to explore how to better learn the semantic correspondence between images and text.
Finally, the above-described examples are merely illustrations of the present invention; any modifications, improvements, or substitutions of these examples made by those skilled in the art fall within the scope of protection claimed by the present invention.

Claims (5)

1. The multilevel image-text matching method based on the graph attention network is characterized by comprising the following steps:
S1, constructing graph structures for images and texts, and assigning different weights to regions or words according to their importance to the image or the text;
S2, inferring global semantic features from the image graph and the text graph of S1 to perform global matching;
S3, constructing a phrase graph matching module, which first performs node matching between the text graph and the image graph and then performs phrase matching;
for graph-level matching, the similarity between image nodes and text nodes is first computed, denoted S, and a softmax is then taken along the image axis, the similarity value measuring the degree of correspondence between an image node and each text node; the text nodes are then aggregated as a weighted combination of feature vectors, where the weights are the computed similarities, the process being expressed as:
[equation (12): similarity between image and text nodes; formula image not reproduced]
F_{i→t} = softmax_α(S) Y_β   (13)
then, the i-th image node feature and the corresponding aggregated text feature are each segmented into m blocks, denoted [x_{i1}, x_{i2}, ..., x_{im}] and [y_{i1}, y_{i2}, ..., y_{im}]; multi-block similarity is computed within each block pair, the similarity of the h-th block being computed as s_{ih} = cos(x_{ih}, y_{ih}), where s_{ih} is a scalar and cos(·) denotes cosine similarity; the matching vector of the i-th image node is obtained by concatenating the similarities of all blocks, namely:
s_i = s_{i1} || s_{i2} || ... || s_{im}   (14)
where "||" denotes concatenation; symmetrically, when the text graph is given, node-level matching is performed for each text node, and the corresponding image nodes are associated with different weights; each text node, together with its associated image nodes, is then processed by the multi-block module to produce a matching vector s; in this way, each image node is associated with its matching text nodes, and this correspondence is propagated to neighboring nodes in the final graph matching to guide them to learn fine-grained phrase correspondences;
graph matching takes the node matching vectors as input and propagates them to neighbors along the edges of the graph; the matching vector of each node is updated from the matching vectors in its neighborhood using a graph convolutional network (GCN), which learns how to integrate the neighborhood matching vectors according to the following formula:
[equation (15): graph convolution over the neighborhood matching vectors; formula image not reproduced]
wherein N_j is the neighborhood of the j-th node; polar coordinates (ρ, θ) are computed from the centers of the bounding boxes of each pair of regions, the edge weight matrix W_e is set from these polar-coordinate pairs, and W_k and b are the parameters to be learned for the k-th kernel; the output of the spatial convolution is defined as the concatenation of the outputs of the k kernels, producing a convolution vector that reflects the correspondence of the connected nodes; these nodes constitute a structured phrase;
phrase correspondences may be inferred by propagating the correspondences of neighboring nodes, from which the graph matching score of an image-text pair is inferred; here, the convolution vectors are input to a multi-layer perceptron (MLP) to jointly consider the learned correspondences of all phrases and infer the graph matching score, which represents how well the image graph matches the text graph, this process being expressed as:
[equation (16): image-to-text graph matching score; formula image not reproduced]
[equation (17): text-to-image graph matching score; formula image not reproduced]
wherein W_s and b_s are the parameters of the MLP, which contains two fully connected layers, and the function σ(·) denotes the tanh activation; the graph matching similarity score of an image-text pair is computed over the two directions:
[equation (18): bidirectional graph matching score S_graph; formula image not reproduced]
finally, the image-text similarity combining multilevel semantic alignment is computed as:
Sim = S_global + S_graph   (19)
therefore, by utilizing global, local, and relational information, the complementarity between images and sentences is fully exploited, their correlation is comprehensively modeled, and cross-media matching is promoted;
S4, combining the networks of S2 and S3 to construct the overall framework of the multilevel image-text matching method based on the graph attention network;
and S5, training the multilevel image-text matching method based on the graph attention network.
2. The multi-level image-text matching method based on graph attention network of claim 1,
the specific process of S1 is as follows:
for the image part, each image is represented as an undirected fully connected graph whose nodes are the salient regions detected by Faster R-CNN, and each node is connected to all other nodes, where X denotes the region features and E_x is the affinity matrix computed over each pair of region features x_i and x_j; the image graph is defined as:
G_x = (X, E_x)   (1)
E_x(x_i, x_j) = (x_i)^T x_j   (2)
in this case, the more relevant two image regions are, the higher the affinity score of their edge, which yields the fully connected image graph G; the graph G is then processed with the graph attention network GAT, which outputs features enhanced by global semantic relationships; the attention coefficients are computed and normalized with the softmax function as follows:
e_ij = a(W_q x_i, W_k x_j)   (3)
[equation (4); formula image not reproduced]
μ_ij = Softmax(e_ij)   (5)
W_q and W_k are learnable parameters; meanwhile, the attention coefficients are computed with multi-head dot-product attention, which in practice is faster and more space-efficient:
[equations (6)-(7): multi-head dot-product attention; formula images not reproduced]
in equation (6), || denotes concatenation, and the parameters are defined in a formula image that is not reproduced; H = 8 parallel attention heads are used, with d = D/8; then, under a nonlinear activation function, the final output features are computed as:
[equation (8): final output features under the nonlinear activation; formula image not reproduced]
N is the neighborhood of node i in the graph, and batch normalization is added to the graph attention module to speed up training; for the text graph, the same graph attention operation is applied to obtain a finer text representation
G_y = (Y, E_y)   (9).
3. The multi-level image-text matching method based on graph attention network of claim 1,
the specific process of S2 is as follows:
unlike previous research methods, which extract global semantics directly from the image and the text, the image graph G_x and the text graph G_y are constructed here; after the graph attention network, the relationships between the nodes in the image graph and the text graph are strengthened, and the final image and text representations are obtained as X_global = Mean(G_x) and Y_global = Mean(G_y), where Mean denotes average pooling; the global matching similarity score of an image-text pair is computed as:
S_global = R_global(X_global, Y_global)   (10)
after obtaining the two modal representations X_global and Y_global, a hinge-based triplet ranking loss is employed to supervise the latent-space learning process; the loss function looks for the hardest negatives in a mini-batch, which form triplets together with the positive samples and the ground-truth queries; the loss function is defined as follows:
L_global = [η + R_global(X'_global, Y_global) - R_global(X_global, Y_global)]_+ + [η + R_global(X_global, Y'_global) - R_global(X_global, Y_global)]_+   (11)
where R_global(·) is a similarity function, implemented as cosine similarity in the model.
4. The multi-level image-text matching method based on graph attention network of claim 1,
the specific process of S4 is as follows:
the multilevel image-text matching method based on the graph attention network comprises a graph construction module, a phrase graph matching module, and a global semantic matching network.
5. The multi-level image-text matching method based on graph attention network of claim 1,
the specific process of S5 is as follows:
the training method of the multilevel image-text matching method based on the graph attention network comprises the following steps:
in the training process, all experiments are implemented with Python 3.6 and the PyTorch framework and run on a computer with an Nvidia Tesla P100 GPU; for each sentence, the word embedding size is set to 300 dimensions, and words are encoded into 1024-dimensional vectors with a bidirectional GRU; image preprocessing uses a bottom-up attention model to extract region features, each image feature vector is set to 1024 dimensions, the same dimensionality as the text features; the model is trained with the Adam optimizer with a batch size of 64, for 20 epochs on the MSCOCO dataset with a 10% learning-rate decay every 15 epochs, and for 30 epochs on the Flickr30k dataset with the same decay every 15 epochs; the learning rate is set to 0.0005 on the MSCOCO dataset and 0.0002 on the Flickr30k dataset; in addition, the two hyperparameters are set to 20 and 0.2, respectively.
CN202110550780.7A 2021-05-18 2021-05-18 Multilevel image-text matching method based on graph attention network Active CN113191357B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110550780.7A CN113191357B (en) 2021-05-18 2021-05-18 Multilevel image-text matching method based on graph attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110550780.7A CN113191357B (en) 2021-05-18 2021-05-18 Multilevel image-text matching method based on graph attention network

Publications (2)

Publication Number Publication Date
CN113191357A CN113191357A (en) 2021-07-30
CN113191357B true CN113191357B (en) 2023-01-17

Family

ID=76982677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110550780.7A Active CN113191357B (en) 2021-05-18 2021-05-18 Multilevel image-text matching method based on graph attention network

Country Status (1)

Country Link
CN (1) CN113191357B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114021558B (en) * 2021-11-10 2022-05-10 北京航空航天大学杭州创新研究院 Intelligent evaluation method for consistency of graph and text meaning based on layering
CN114492451B (en) * 2021-12-22 2023-10-24 马上消费金融股份有限公司 Text matching method, device, electronic equipment and computer readable storage medium
CN114547235B (en) * 2022-01-19 2024-04-16 西北大学 Construction method of image text matching model based on priori knowledge graph
CN114863241A (en) * 2022-04-22 2022-08-05 厦门大学 Movie and television animation evaluation method based on spatial layout and deep learning
CN115062208B (en) * 2022-05-30 2024-01-23 苏州浪潮智能科技有限公司 Data processing method, system and computer equipment
CN115098646B (en) * 2022-07-25 2024-03-29 北方民族大学 Multistage relation analysis and mining method for graphic data
CN116484878B (en) * 2023-06-21 2023-09-08 国网智能电网研究院有限公司 Semantic association method, device, equipment and storage medium of power heterogeneous data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108132968B (en) * 2017-12-01 2020-08-04 西安交通大学 Weak supervision learning method for associated semantic elements in web texts and images
CN109710923B (en) * 2018-12-06 2020-09-01 浙江大学 Cross-language entity matching method based on cross-media information
CN110516530A (en) * 2019-07-09 2019-11-29 杭州电子科技大学 A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature
CN112101043B (en) * 2020-09-22 2021-08-24 浙江理工大学 Attention-based semantic text similarity calculation method
CN112417097B (en) * 2020-11-19 2022-09-16 中国电子科技集团公司电子科学研究院 Multi-modal data feature extraction and association method for public opinion analysis

Also Published As

Publication number Publication date
CN113191357A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
CN113191357B (en) Multilevel image-text matching method based on graph attention network
CN111488734B (en) Emotional feature representation learning system and method based on global interaction and syntactic dependency
CN111581395B (en) Model fusion triplet representation learning system and method based on deep learning
CN110083705B (en) Multi-hop attention depth model, method, storage medium and terminal for target emotion classification
Gan et al. Sparse attention based separable dilated convolutional neural network for targeted sentiment analysis
CN112966127A (en) Cross-modal retrieval method based on multilayer semantic alignment
US11410031B2 (en) Dynamic updating of a word embedding model
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN111897944B (en) Knowledge graph question-answering system based on semantic space sharing
Wang et al. Multi-modal knowledge graphs representation learning via multi-headed self-attention
CN113779220A (en) Mongolian multi-hop question-answering method based on three-channel cognitive map and graph attention network
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN112084358B (en) Image-text matching method based on area strengthening network with subject constraint
CN115309927B (en) Multi-label guiding and multi-view measuring ocean remote sensing image retrieval method and system
CN114925176B (en) Method, system and medium for constructing intelligent multi-modal cognitive map
CN114254093A (en) Multi-space knowledge enhanced knowledge graph question-answering method and system
CN115331075A (en) Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
CN117556067B (en) Data retrieval method, device, computer equipment and storage medium
CN110889505A (en) Cross-media comprehensive reasoning method and system for matching image-text sequences
Jia et al. Semantic association enhancement transformer with relative position for image captioning
CN114169408A (en) Emotion classification method based on multi-mode attention mechanism
Zhou et al. Relation-Aware Entity Matching Using Sentence-BERT.
CN117131933A (en) Multi-mode knowledge graph establishing method and application
CN117150069A (en) Cross-modal retrieval method and system based on global and local semantic comparison learning
CN116595222A (en) Short video multi-label classification method and device based on multi-modal knowledge distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant