CN113191357B - Multilevel image-text matching method based on graph attention network - Google Patents

Multilevel image-text matching method based on graph attention network

Info

Publication number
CN113191357B
CN113191357B (application CN202110550780.7A)
Authority
CN
China
Prior art keywords
image
text
graph
matching
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110550780.7A
Other languages
Chinese (zh)
Other versions
CN113191357A (en
Inventor
吴杰
吴春雷
王雷全
路静
段海龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN202110550780.7A priority Critical patent/CN113191357B/en
Publication of CN113191357A publication Critical patent/CN113191357A/en
Application granted granted Critical
Publication of CN113191357B publication Critical patent/CN113191357B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multilevel image-text matching method based on a graph attention network. A key challenge of this task is learning the correspondence between images and text. Most existing work learns only the local semantic relationships between objects, and little research addresses the local phrase-level relationships between objects and their relations. From a graph view, the invention constructs a multilevel image-text matching method based on a graph attention network. The network builds attention-weighted graph structures over image regions and text words and performs graph matching on them, so as to infer fine-grained structured phrase correspondences. Meanwhile, global semantics are inferred from the constructed graph structures for global matching as a complement to graph matching, achieving more comprehensive cross-media semantic matching. Extensive experiments show that the multilevel image-text matching method based on the graph attention network can learn graph-view matching and global-view matching simultaneously and obtains competitive results on the MSCOCO and Flickr30K datasets.

Description

Multilevel image-text matching method based on graph attention network
Technical Field
The invention relates to an image-text matching method in the technical fields of computer vision and natural language processing.
Background
Vision and language are two core components of how people understand everyday matters. However, there is a large semantic gap between visual and textual data, which makes it very challenging to measure their semantic similarity effectively, and cross-modal tasks therefore attract much attention from researchers. The key to this task is a common understanding of vision and text, and the semantic correspondence between the two information sources.
With the rapid development of deep learning, deep neural networks have shown great potential in various multimedia applications such as cross-media retrieval and text generation. Current matching methods based on deep learning can be roughly divided into two categories: one-to-one methods and many-to-many methods. The former take the whole image and the whole text as study objects and learn their correspondence; the latter infer image-sentence similarity by aligning visual regions with text words.
Based on the one-to-one approach, researchers at home and abroad have explored various common-space representation methods, hoping to project images and texts into a common space and directly compare their similarity. In early work, many researchers attempted to directly compare the similarity between images and text by projecting them into a common space. Kiros et al. proposed unifying visual-semantic embeddings with multimodal neural language models, introducing an encoder-decoder pipeline that learns a multimodal joint embedding space of images and text and designing a new decoder language model for decoding distributed representations from that space. On this basis, Wu et al. proposed an online learning method that learns image-text correspondence by maintaining bidirectional relative similarity; however, it does not consider the feature distribution within a single modality. Zheng et al. therefore proposed a two-path CNN model for visual-text embedding learning and added an instance loss to account for the data distribution within each modality. Huang et al. proposed learning semantic concepts and organizing them in the correct semantic order to improve image representation, while Li et al. reasoned about visual representations by capturing objects and their semantic relationships. Although these efforts have made great progress in image-text alignment, they lack local fine-grained analysis of image-text pairs, and cross-media matching performance is limited because fine-grained information is ignored and some redundant information is mixed in.
Subsequently, local cross-media matching methods were proposed, which perform local similarity learning between all region-word pairs. Karpathy et al. first proposed learning the relationship between all region-word pairs by computing their similarity. Following this idea, Lee et al. proposed a hierarchical LSTM to associate the shared semantics of words and regions. On this basis, Wu et al. proposed learning image-text alignment by measuring bidirectional relative semantic similarity. To further address visual-semantic differences, Niu et al. organized the text into a semantic tree with one phrase per node and then used a hierarchical long short-term memory network (LSTM, a variant of RNN) to extract phrase-level text features for computing the similarity between image regions and words. Ma et al. proposed a novel multi-faceted feature matching network (MFM) that explores the matching relationship between the two modalities by describing multi-faceted representations of images and text; however, these methods do not take into account the different importance of each region-word pair when computing global similarity. In addition, Lee et al. devised a stacked cross-attention network that infers image-text matching by attending to the words related to each region, or the regions related to each word.
However, most of these efforts learn only the local semantic relationships between objects, and little work addresses the local phrase-level semantic relationships between objects and their relations. In contrast to these prior methods, we propose a multilevel image-text matching method based on a graph attention network. The network constructs graph structures for the picture and the text and executes a phrase graph matching module to perform graph matching, so as to infer fine-grained structured correspondences. Meanwhile, global semantics are inferred from the constructed graph structures for global matching as a complement to graph matching, achieving more comprehensive cross-media semantic matching.
Disclosure of Invention
The invention aims to solve the problem that most existing image-text matching methods learn only the local semantic relationships between objects and rarely consider the local phrase-level semantic relationships between objects and their relations.
The technical scheme adopted by the invention to solve this technical problem is as follows:
S1, constructing graph structures for the images and the texts, and assigning different weights to regions or words according to their importance to the image or the text.
S2, inferring global semantic features from the image graph and the text graph of S1 to perform global matching.
S3, constructing a phrase graph matching module, which first performs node matching between the text graph and the image graph and then performs phrase matching.
S4, combining the networks of S2 and S3 to construct the overall framework of the multilevel image-text matching method based on the graph attention network.
S5, training the multilevel image-text matching method based on the graph attention network.
First, for the image part, each image is represented as an undirected fully connected graph whose nodes are the salient regions detected by Faster R-CNN, and each node is connected to all other nodes. Here X denotes the region features and E_x is the affinity matrix computed over each pair of region features x_i and x_j. The image graph is defined as:
G_x = (X, E_x)   (1)
E_x(x_i, x_j) = (x_i)^T x_j   (2)
In this way, the more relevant two image regions are, the higher the affinity score of their edge, which yields the fully connected image graph G_x. The graph is then processed with a graph attention network (GAT), which outputs features enhanced by global semantic relationships. The attention coefficients are computed and normalized with the softmax function as follows:
e_ij = a(W_q x_i, W_k x_j)   (3)
[equation (4); formula image not reproduced]
μ_ij = Softmax(e_ij)   (5)
W_q and W_k are learnable parameters. Meanwhile, the attention coefficients are computed with multi-head dot-product attention, which in practice is faster and more space-efficient:
[equations (6)-(7): multi-head dot-product attention; formula images not reproduced]
In equation (6), || denotes concatenation, and the parameters are defined in a formula image that is not reproduced. H = 8 parallel attention heads are used, with d = D/8. Then, under a nonlinear activation function, the final output features are computed as:
[equation (8): final output features under the nonlinear activation; formula image not reproduced]
where N is the neighborhood of node i in the graph. To speed up training, batch normalization is added to the graph attention module. For the text graph, the same graph attention operation is applied to obtain a finer text representation:
G_y = (Y, E_y)   (9)
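For illustration only, the graph construction and graph attention of equations (1)-(9) could be sketched in PyTorch roughly as follows. This is a minimal sketch under assumptions, not the exact implementation of the invention: the value projection, the residual connection, the 1/sqrt(d) scaling of the dot product, and the tensor shapes are all assumptions not specified above.

import torch
import torch.nn as nn


class GraphAttention(nn.Module):
    # Multi-head dot-product graph attention over a fully connected region/word graph,
    # sketching equations (3)-(8). Because the graph is fully connected, every node
    # simply attends to every other node; the affinity matrix E_x of equation (2) is
    # implicit in the dot products. Assumed: value projection, residual, 1/sqrt(d) scale.
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d = heads, dim // heads
        self.w_q = nn.Linear(dim, dim, bias=False)  # W_q
        self.w_k = nn.Linear(dim, dim, bias=False)  # W_k
        self.w_v = nn.Linear(dim, dim, bias=False)  # value projection (assumed)
        self.bn = nn.BatchNorm1d(dim)               # batch normalization, as in the text

    def forward(self, x):
        # x: (batch, nodes, dim) region features X or word features Y of one graph
        b, n, _ = x.shape
        q = self.w_q(x).view(b, n, self.heads, self.d).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.heads, self.d).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.heads, self.d).transpose(1, 2)
        # e_ij = (W_q x_i)^T (W_k x_j), normalized with softmax (equations (3)-(5))
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)   # concatenate the H = 8 heads
        out = torch.relu(out + x)                            # nonlinear activation (equation (8))
        return self.bn(out.transpose(1, 2)).transpose(1, 2)  # BN over the feature axis


# Example: enhance an image graph of 36 detected regions (G_x) or a word graph (G_y)
enhanced_regions = GraphAttention()(torch.randn(2, 36, 1024))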
The global semantic matching of the invention differs from previous research methods, which extract global semantics directly from the image and the text. Here, the image graph G_x and the text graph G_y are constructed first; after the graph attention network, the relationships between the nodes in the image graph and the text graph are strengthened, and the final image and text representations are obtained as X_global = Mean(G_x) and Y_global = Mean(G_y), where Mean denotes average pooling. The global matching similarity score of an image-text pair is computed as:
S_global = R_global(X_global, Y_global)   (10)
After obtaining the two modal representations X_global and Y_global, a hinge-based triplet ranking loss is employed to supervise the latent-space learning process. The loss function looks for the hardest negatives in a mini-batch, which form triplets together with the positive samples and the ground-truth queries. The loss function is defined as follows.
L_global = [η + R_global(X'_global, Y_global) - R_global(X_global, Y_global)]_+ + [η + R_global(X_global, Y'_global) - R_global(X_global, Y_global)]_+   (11)
where R_global(·) is a similarity function, implemented as cosine similarity in the model; X'_global and Y'_global denote the hardest negative image and text in the mini-batch, and [·]_+ = max(·, 0).
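As a hedged illustration of equation (11), the hinge-based triplet ranking loss with in-batch hardest negatives could look roughly like the sketch below. The in-batch mining follows the description above; the function name, the assumption that matched pairs lie on the diagonal of the score matrix, and the final averaging are illustrative choices.

import torch
import torch.nn.functional as F


def triplet_ranking_loss(x_glo, y_glo, margin=0.2):
    # x_glo, y_glo: (batch, dim) global image / text embeddings of matched pairs.
    # R_global is taken as cosine similarity, as stated in the text.
    x = F.normalize(x_glo, dim=-1)
    y = F.normalize(y_glo, dim=-1)
    scores = x @ y.t()                      # R_global for every image-text pair in the batch
    pos = scores.diag().unsqueeze(1)        # R_global(X_global, Y_global) of matched pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg_y = scores.masked_fill(mask, -1e9).max(dim=1).values.unsqueeze(1)  # hardest negative text
    neg_x = scores.masked_fill(mask, -1e9).max(dim=0).values.unsqueeze(1)  # hardest negative image
    # the two hinge terms of equation (11), averaged over the mini-batch
    return ((margin + neg_y - pos).clamp(min=0) + (margin + neg_x - pos).clamp(min=0)).mean()


# Example with a mini-batch of 64 matched pairs
loss = triplet_ranking_loss(torch.randn(64, 1024), torch.randn(64, 1024))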
For phrase-graph matching, the process is split into two directions: image-to-text and text-to-image. The similarity between image nodes and text nodes, denoted S, is computed first, and a softmax is then taken along the image axis; the similarity value measures how well an image node corresponds to each text node. The text nodes are then aggregated as a weighted combination of feature vectors, where the weights are the computed similarities. This process can be expressed as:
[equation (12): similarity between image and text nodes; formula image not reproduced]
F_{i→t} = softmax_α(S) Y_β   (13)
Then, the i-th image node feature and the corresponding aggregated text feature are each segmented into m blocks, denoted [x_{i1}, x_{i2}, ..., x_{im}] and [y_{i1}, y_{i2}, ..., y_{im}]. Multi-block similarity is computed within each block pair; for example, the similarity of the h-th block is computed as s_{ih} = cos(x_{ih}, y_{ih}), where s_{ih} is a scalar and cos(·) denotes cosine similarity. The matching vector of the i-th image node is obtained by concatenating the similarities of all blocks:
s_i = s_{i1} || s_{i2} || ... || s_{im}   (14)
where "||" denotes concatenation. Symmetrically, when the text graph is given, node-level matching is performed for each text node, and the corresponding image nodes are associated with different weights. Each text node, together with its associated image nodes, is then processed by the multi-block module to produce a matching vector s. In this way, each image node is associated with its matching text nodes, and this correspondence is propagated to neighboring nodes in the final graph matching to guide them to learn fine-grained phrase correspondences.
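The node-level matching of equations (12)-(14) can be sketched as below. This is an assumption-laden illustration rather than the implementation of the invention: the block count m = 8, the use of cosine similarity for S, the softmax axis, and the aggregation of text nodes per image node are all inferred from the description above.

import torch
import torch.nn.functional as F


def node_matching_vectors(x_nodes, y_nodes, blocks=8):
    # x_nodes: (n_regions, dim) image-node features; y_nodes: (n_words, dim) text-node features.
    s = F.normalize(x_nodes, dim=-1) @ F.normalize(y_nodes, dim=-1).t()  # similarity S (eq. 12)
    attn = torch.softmax(s, dim=1)                  # attention weights from the similarities
    agg = attn @ y_nodes                            # aggregated text feature per image node (eq. 13)
    xb = x_nodes.view(x_nodes.size(0), blocks, -1)  # [x_i1, ..., x_im]
    yb = agg.view(agg.size(0), blocks, -1)          # [y_i1, ..., y_im]
    # s_ih = cos(x_ih, y_ih); stacking over h gives the matching vector s_i (eq. 14)
    return F.cosine_similarity(xb, yb, dim=-1)      # (n_regions, blocks)


# Example: 36 image nodes matched against 12 word nodes, m = 8 blocks per node
match_vectors = node_matching_vectors(torch.randn(36, 1024), torch.randn(12, 1024))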
Graph matching takes the node matching vectors as input and propagates them to neighbors along the edges of the graph. The matching vector of each node is updated from the matching vectors in its neighborhood using a graph convolutional network (GCN); how the GCN learns to integrate the neighborhood matching vectors is shown in the following formula:
[equation (15): graph convolution over the neighborhood matching vectors; formula image not reproduced]
where N_j is the neighborhood of the j-th node. Polar coordinates (ρ, θ) are computed from the centers of the bounding boxes of each pair of regions, and the edge weight matrix W_e is set from these polar-coordinate pairs; W_k and b are the parameters to be learned for the k-th kernel. The output of the spatial convolution is defined as the concatenation of the outputs of the k kernels, producing a convolution vector that reflects the correspondence of the connected nodes. These nodes constitute a structured phrase.
Phrase correspondences can be inferred by propagating the correspondences of neighboring nodes, and from them the graph matching score of an image-text pair is inferred. The convolution vectors are fed into a multi-layer perceptron (MLP) to jointly consider the learned correspondences of all phrases and infer the graph matching score, which represents how well the image graph matches the text graph. This process is expressed as:
[equation (16): image-to-text graph matching score; formula image not reproduced]
The symmetric phrase matching process from the text graph to the image graph is as follows:
[equation (17): text-to-image graph matching score; formula image not reproduced]
where W_s and b_s are the parameters of the MLP, which consists of two fully connected layers, and the function σ(·) denotes the tanh activation. The graph matching similarity score of an image-text pair is computed over the two directions:
[equation (18): bidirectional graph matching score S_graph; formula image not reproduced]
Finally, the image-text similarity combining multilevel semantic alignment is computed as:
Sim = S_global + S_graph   (19)
Therefore, by utilizing global, local, and relational information, the complementarity between images and sentences is fully exploited, their correlation is comprehensively modeled, and cross-media matching is promoted.
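For illustration, a strongly simplified sketch of the graph matching stage of equations (15)-(19) follows. It replaces the polar-coordinate edge weights and the k spatial kernels with a plain row-normalized adjacency aggregation, so it only demonstrates the propagate-then-score idea; the hidden sizes, the mean pooling of node scores, and the class name are assumptions.

import torch
import torch.nn as nn


class PhraseGraphMatching(nn.Module):
    # Propagates node matching vectors over the graph with a simple graph convolution
    # and scores the result with a two-layer MLP (tanh), echoing equations (15)-(18).
    def __init__(self, match_dim=8, hidden=32):
        super().__init__()
        self.gcn = nn.Linear(match_dim, hidden)  # stands in for W_k, b of the spatial kernels
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, match_vecs, adj):
        # match_vecs: (nodes, match_dim) node matching vectors s_i
        # adj: (nodes, nodes) affinity matrix of the image (or text) graph
        norm_adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-8)
        phrase = torch.tanh(self.gcn(norm_adj @ match_vecs))  # propagate to neighbours (eq. 15)
        return self.mlp(phrase).mean()                        # one direction of S_graph (eqs. 16-18)


# Example: Sim = S_global + S_graph (equation (19)), with S_global computed upstream
s_graph = PhraseGraphMatching()(torch.randn(36, 8), torch.rand(36, 36))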
The multilevel image-text matching method based on the graph attention network comprises a graph construction module, a phrase graph matching module, and a global semantic matching network.
Finally, the training method of the multilevel image-text matching method based on the graph attention network is as follows:
During training, all experiments were implemented with Python 3.6 and the PyTorch framework and run on a computer with an Nvidia Tesla P100 GPU. For each sentence, the word embedding size is set to 300 dimensions, and words are encoded into 1024-dimensional vectors with a bidirectional GRU. Image preprocessing uses a bottom-up attention model to extract region features; each image feature vector is set to 1024 dimensions, the same dimensionality as the text features. The model is trained with the Adam optimizer with a batch size of 64: 20 epochs on the MSCOCO dataset with a 10% learning-rate decay every 15 epochs, and 30 epochs on the Flickr30k dataset with the same decay every 15 epochs. The learning rate is set to 0.0005 on the MSCOCO dataset and 0.0002 on the Flickr30k dataset. In addition, the two hyperparameters are set to 20 and 0.2, respectively.
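A minimal sketch of this training schedule (MSCOCO settings) is given below; the placeholder model, the reading of a "10% decay" as multiplying the learning rate by 0.9 every 15 epochs, and the omitted data-loading and loss computation are assumptions.

import torch

model = torch.nn.Linear(1024, 1024)   # placeholder for the full matching network
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)                       # Adam, lr 0.0005 (MSCOCO)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.9)   # decay every 15 epochs

for epoch in range(20):               # 20 epochs on MSCOCO (30 on Flickr30k, lr 0.0002)
    # ... iterate over mini-batches of 64 image-text pairs, compute Sim and the triplet
    # ranking loss, then optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()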
Compared with the prior art, the beneficial effects of the invention are:
1. The invention provides a novel multilevel image-text matching method based on a graph attention network, which constructs an image structure graph and a text structure graph, performs feature weighting with a graph attention module, and carries out phrase semantic graph matching to achieve structured phrase alignment.
2. For the first time, a method for learning global and local consistency is proposed that learns global and structured-phrase similarities for matching, rather than global and local-object similarities.
Drawings
Fig. 1 is a schematic structural diagram of a multilevel image-text matching method based on a graph attention network.
Fig. 2 is a schematic model diagram of the phrase graph matching module.
Fig. 3 and 4 are graphs comparing the results of image-text matching on MSCOCO and Flickr30K datasets for a multi-level image-text matching method based on a graph attention network and other networks.
Fig. 5 and 6 are graphs of visualization results of image matching text and text matching images.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
The invention is further illustrated below with reference to the figures and examples.
Fig. 1 is a schematic structural diagram of the multilevel image-text matching method based on the graph attention network. As shown in Fig. 1, Faster R-CNN and GRU operations are first applied to the pictures and texts to extract region features and text features, from which the image graph and the text graph are constructed. The image graph and the text graph are then weighted by the graph attention network; the nodes of a graph can be objects, relations, or attributes, and an edge exists between any two nodes that are semantically dependent. Finally, phrase graph matching and global semantic matching are performed respectively.
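As an illustrative sketch of the text-side feature extraction mentioned above (300-dimensional word embeddings encoded into 1024-dimensional word nodes with a bidirectional GRU), the following snippet could be used; the vocabulary size and the averaging of the two GRU directions are assumptions.

import torch
import torch.nn as nn


class TextNodeEncoder(nn.Module):
    # Word indices -> 300-dim embeddings -> bidirectional GRU -> 1024-dim word-node features.
    def __init__(self, vocab_size=10000, embed_dim=300, feat_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, feat_dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) integer word indices
        h, _ = self.gru(self.embed(word_ids))  # (batch, seq_len, 2 * feat_dim)
        fwd, bwd = h.chunk(2, dim=-1)
        return (fwd + bwd) / 2                 # (batch, seq_len, 1024) word-node features


# Example: word nodes for a 12-word sentence
word_nodes = TextNodeEncoder()(torch.randint(0, 10000, (1, 12)))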
Fig. 2 is a schematic model diagram of the phrase graph matching module. As shown in Fig. 2, phrase graph matching is divided into two processes: image-to-text and text-to-image. The similarity between image nodes and text nodes, denoted S, is computed first, and a softmax is then taken along the image axis; the similarity value measures the degree of correspondence between an image node and each text node. The text nodes are then aggregated as a weighted combination of feature vectors, where the weights are the computed similarities. This process can be expressed as:
[equation (20): similarity between image and text nodes; formula image not reproduced]
F_{i→t} = softmax_α(S) Y_β   (21)
Then, the i-th image node feature and the corresponding aggregated text feature are each segmented into m blocks, denoted [x_{i1}, x_{i2}, ..., x_{im}] and [y_{i1}, y_{i2}, ..., y_{im}]. Multi-block similarity is computed within each block pair; for example, the similarity of the h-th block is computed as s_{ih} = cos(x_{ih}, y_{ih}), where s_{ih} is a scalar and cos(·) denotes cosine similarity. The matching vector of the i-th image node is obtained by concatenating the similarities of all blocks:
s_i = s_{i1} || s_{i2} || ... || s_{im}   (22)
where "||" denotes concatenation. Symmetrically, when the text graph is given, node-level matching is performed for each text node, and the corresponding image nodes are associated with different weights. Each text node, together with its associated image nodes, is then processed by the multi-block module to produce a matching vector s. In this way, each image node is associated with its matching text nodes, and this correspondence is propagated to neighboring nodes in the final graph matching to guide them to learn fine-grained phrase correspondences.
Graph matching takes the node matching vectors as input and propagates them to neighbors along the edges of the graph. The matching vector of each node is updated from the matching vectors in its neighborhood using a graph convolutional network (GCN); how the GCN learns to integrate the neighborhood matching vectors is shown in the following formula:
[equation (23): graph convolution over the neighborhood matching vectors; formula image not reproduced]
where N_j is the neighborhood of the j-th node. Polar coordinates (ρ, θ) are computed from the centers of the bounding boxes of each pair of regions, and the edge weight matrix W_e is set from these polar-coordinate pairs; W_k and b are the parameters to be learned for the k-th kernel. The output of the spatial convolution is defined as the concatenation of the outputs of the k kernels, producing a convolution vector that reflects the correspondence of the connected nodes. These nodes constitute a structured phrase.
Phrase correspondences can be inferred by propagating the correspondences of neighboring nodes, and from them the graph matching score of the image-text pair is inferred. The convolution vectors are fed into a multi-layer perceptron (MLP) to jointly consider the learned correspondences of all phrases and infer the graph matching score, which represents how well the image graph matches the text graph. This process is expressed as:
[equation (24): image-to-text graph matching score; formula image not reproduced]
The symmetric phrase matching process from the text graph to the image graph is as follows:
[equation (25): text-to-image graph matching score; formula image not reproduced]
where W_s and b_s are the parameters of the MLP, which consists of two fully connected layers, and the function σ(·) denotes the tanh activation. The graph matching similarity score of an image-text pair is computed over the two directions:
[equation (26): bidirectional graph matching score; formula image not reproduced]
fig. 3 and 4 are graphs comparing the results of image-text matching on MSCOCO and Flickr30K datasets for a multi-level image-text matching method based on a graph attention network and other networks. As shown in fig. 3 and 4, the multi-level image-text matching result based on the graph attention network is more accurate than other models.
Fig. 5 and 6 are graphs of visualization results of image matching text and text matching images. As shown in fig. 5, given an image, a multi-level model based on the graph attention network can match corresponding texts. Given text, a multi-level model based on a graph attention network can match corresponding pictures, as shown in fig. 6.
The invention provides a multilevel image-text matching method based on a graph attention network, which selects nodes for a picture with a bottom-up attention model, constructs an image structure graph and a text structure graph from the image nodes and word nodes, enhances the graph features through graph attention, and executes a phrase graph matching module to perform graph matching, thereby achieving structured phrase alignment. Meanwhile, global semantic matching is added, and more accurate cross-media correlations are mined by learning global and structured local similarities. Extensive experiments on the MSCOCO and Flickr30K datasets show that this model has a positive effect on image-text matching. In future work, we will continue to explore how to better learn the semantic correspondence between images and text.
Finally, the above-described examples are merely illustrations of the present invention; any modifications, improvements, or substitutions of these examples made by those skilled in the art fall within the scope of protection claimed by the present invention.

Claims (5)

1. The multilevel image-text matching method based on the graph attention network is characterized by comprising the following steps:
S1, constructing graph structures for images and texts, and assigning different weights to regions or words according to their importance to the image or the text;
S2, inferring global semantic features from the image graph and the text graph of S1 to perform global matching;
S3, constructing a phrase graph matching module, which first performs node matching between the text graph and the image graph and then performs phrase matching;
for graph-level matching, the similarity between image nodes and text nodes is first computed, denoted S, and a softmax is then taken along the image axis, the similarity value measuring the degree of correspondence between an image node and each text node; the text nodes are then aggregated as a weighted combination of feature vectors, where the weights are the computed similarities, the process being expressed as:
[equation (12): similarity between image and text nodes; formula image not reproduced]
F_{i→t} = softmax_α(S) Y_β   (13)
then, the i-th image node feature and the corresponding aggregated text feature are each segmented into m blocks, denoted [x_{i1}, x_{i2}, ..., x_{im}] and [y_{i1}, y_{i2}, ..., y_{im}]; multi-block similarity is computed within each block pair, the similarity of the h-th block being computed as s_{ih} = cos(x_{ih}, y_{ih}), where s_{ih} is a scalar and cos(·) denotes cosine similarity; the matching vector of the i-th image node is obtained by concatenating the similarities of all blocks, namely:
s_i = s_{i1} || s_{i2} || ... || s_{im}   (14)
where "||" denotes concatenation; symmetrically, when the text graph is given, node-level matching is performed for each text node, and the corresponding image nodes are associated with different weights; each text node, together with its associated image nodes, is then processed by the multi-block module to produce a matching vector s; in this way, each image node is associated with its matching text nodes, and this correspondence is propagated to neighboring nodes in the final graph matching to guide them to learn fine-grained phrase correspondences;
graph matching takes the node matching vectors as input and propagates them to neighbors along the edges of the graph; the matching vector of each node is updated from the matching vectors in its neighborhood using a graph convolutional network (GCN), which learns how to integrate the neighborhood matching vectors according to the following formula:
[equation (15): graph convolution over the neighborhood matching vectors; formula image not reproduced]
wherein N_j is the neighborhood of the j-th node; polar coordinates (ρ, θ) are computed from the centers of the bounding boxes of each pair of regions, the edge weight matrix W_e is set from these polar-coordinate pairs, and W_k and b are the parameters to be learned for the k-th kernel; the output of the spatial convolution is defined as the concatenation of the outputs of the k kernels, producing a convolution vector that reflects the correspondence of the connected nodes; these nodes constitute a structured phrase;
phrase correspondences may be inferred by propagating the correspondences of neighboring nodes, from which the graph matching score of an image-text pair is inferred; here, the convolution vectors are input to a multi-layer perceptron (MLP) to jointly consider the learned correspondences of all phrases and infer the graph matching score, which represents how well the image graph matches the text graph, this process being expressed as:
[equation (16): image-to-text graph matching score; formula image not reproduced]
[equation (17): text-to-image graph matching score; formula image not reproduced]
wherein W_s and b_s are the parameters of the MLP, which contains two fully connected layers, and the function σ(·) denotes the tanh activation; the graph matching similarity score of an image-text pair is computed over the two directions:
[equation (18): bidirectional graph matching score S_graph; formula image not reproduced]
finally, the image-text similarity combining multilevel semantic alignment is computed as:
Sim = S_global + S_graph   (19)
therefore, by utilizing global, local, and relational information, the complementarity between images and sentences is fully exploited, their correlation is comprehensively modeled, and cross-media matching is promoted;
S4, combining the networks of S2 and S3 to construct the overall framework of the multilevel image-text matching method based on the graph attention network;
and S5, training the multilevel image-text matching method based on the graph attention network.
2. The multi-level image-text matching method based on graph attention network of claim 1,
the specific process of S1 is as follows:
for the image part, each image is represented as an undirected fully connected graph whose nodes are the salient regions detected by Faster R-CNN, and each node is connected to all other nodes, where X denotes the region features and E_x is the affinity matrix computed over each pair of region features x_i and x_j; the image graph is defined as:
G_x = (X, E_x)   (1)
E_x(x_i, x_j) = (x_i)^T x_j   (2)
in this case, the more relevant two image regions are, the higher the affinity score of their edge, which yields the fully connected image graph G; the graph G is then processed with the graph attention network GAT, which outputs features enhanced by global semantic relationships; the attention coefficients are computed and normalized with the softmax function as follows:
e_ij = a(W_q x_i, W_k x_j)   (3)
[equation (4); formula image not reproduced]
μ_ij = Softmax(e_ij)   (5)
W_q and W_k are learnable parameters; meanwhile, the attention coefficients are computed with multi-head dot-product attention, which in practice is faster and more space-efficient:
[equations (6)-(7): multi-head dot-product attention; formula images not reproduced]
in equation (6), || denotes concatenation, and the parameters are defined in a formula image that is not reproduced; H = 8 parallel attention heads are used, with d = D/8; then, under a nonlinear activation function, the final output features are computed as:
[equation (8): final output features under the nonlinear activation; formula image not reproduced]
N is the neighborhood of node i in the graph, and batch normalization is added to the graph attention module to speed up training; for the text graph, the same graph attention operation is applied to obtain a finer text representation
G_y = (Y, E_y)   (9).
3. The multi-level image-text matching method based on graph attention network of claim 1,
the specific process of S2 is as follows:
unlike previous research methods, which extract global semantics directly from the image and the text, the image graph G_x and the text graph G_y are constructed here; after the graph attention network, the relationships between the nodes in the image graph and the text graph are strengthened, and the final image and text representations are obtained as X_global = Mean(G_x) and Y_global = Mean(G_y), where Mean denotes average pooling; the global matching similarity score of an image-text pair is computed as:
S_global = R_global(X_global, Y_global)   (10)
after obtaining the two modal representations X_global and Y_global, a hinge-based triplet ranking loss is employed to supervise the latent-space learning process; the loss function looks for the hardest negatives in a mini-batch, which form triplets together with the positive samples and the ground-truth queries; the loss function is defined as follows:
L_global = [η + R_global(X'_global, Y_global) - R_global(X_global, Y_global)]_+ + [η + R_global(X_global, Y'_global) - R_global(X_global, Y_global)]_+   (11)
where R_global(·) is a similarity function, implemented as cosine similarity in the model.
4. The multi-level image-text matching method based on graph attention network of claim 1,
the specific process of S4 is as follows:
the multilevel image-text matching method based on the graph attention network comprises a graph construction module, a phrase graph matching module, and a global semantic matching network.
5. The multi-level image-text matching method based on graph attention network of claim 1,
the specific process of S5 is as follows:
the training method of the multilevel image-text matching method based on the graph attention network comprises the following steps:
in the training process, all experiments are implemented with Python 3.6 and the PyTorch framework and run on a computer with an Nvidia Tesla P100 GPU; for each sentence, the word embedding size is set to 300 dimensions, and words are encoded into 1024-dimensional vectors with a bidirectional GRU; image preprocessing uses a bottom-up attention model to extract region features, each image feature vector is set to 1024 dimensions, the same dimensionality as the text features; the model is trained with the Adam optimizer with a batch size of 64, for 20 epochs on the MSCOCO dataset with a 10% learning-rate decay every 15 epochs, and for 30 epochs on the Flickr30k dataset with the same decay every 15 epochs; the learning rate is set to 0.0005 on the MSCOCO dataset and 0.0002 on the Flickr30k dataset; in addition, the two hyperparameters are set to 20 and 0.2, respectively.
CN202110550780.7A 2021-05-18 2021-05-18 Multilevel image-text matching method based on graph attention network Active CN113191357B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110550780.7A CN113191357B (en) 2021-05-18 2021-05-18 Multilevel image-text matching method based on graph attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110550780.7A CN113191357B (en) 2021-05-18 2021-05-18 Multilevel image-text matching method based on graph attention network

Publications (2)

Publication Number Publication Date
CN113191357A CN113191357A (en) 2021-07-30
CN113191357B true CN113191357B (en) 2023-01-17

Family

ID=76982677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110550780.7A Active CN113191357B (en) 2021-05-18 2021-05-18 Multilevel image-text matching method based on graph attention network

Country Status (1)

Country Link
CN (1) CN113191357B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114021558B (en) * 2021-11-10 2022-05-10 北京航空航天大学杭州创新研究院 Intelligent evaluation method for consistency of graph and text meaning based on layering
CN114492451B (en) * 2021-12-22 2023-10-24 马上消费金融股份有限公司 Text matching method, device, electronic equipment and computer readable storage medium
CN114547235B (en) * 2022-01-19 2024-04-16 西北大学 Construction method of image text matching model based on priori knowledge graph
CN114863241A (en) * 2022-04-22 2022-08-05 厦门大学 Movie and television animation evaluation method based on spatial layout and deep learning
CN115062208B (en) * 2022-05-30 2024-01-23 苏州浪潮智能科技有限公司 Data processing method, system and computer equipment
CN115098646B (en) * 2022-07-25 2024-03-29 北方民族大学 Multistage relation analysis and mining method for graphic data
CN116484878B (en) * 2023-06-21 2023-09-08 国网智能电网研究院有限公司 Semantic association method, device, equipment and storage medium of power heterogeneous data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108132968B (en) * 2017-12-01 2020-08-04 西安交通大学 Weak supervision learning method for associated semantic elements in web texts and images
CN109710923B (en) * 2018-12-06 2020-09-01 浙江大学 Cross-language entity matching method based on cross-media information
CN110516530A (en) * 2019-07-09 2019-11-29 杭州电子科技大学 A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature
CN112101043B (en) * 2020-09-22 2021-08-24 浙江理工大学 Attention-based semantic text similarity calculation method
CN112417097B (en) * 2020-11-19 2022-09-16 中国电子科技集团公司电子科学研究院 Multi-modal data feature extraction and association method for public opinion analysis

Also Published As

Publication number Publication date
CN113191357A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
CN113191357B (en) Multilevel image-text matching method based on graph attention network
CN111488734B (en) Emotional feature representation learning system and method based on global interaction and syntactic dependency
CN111581395B (en) Model fusion triplet representation learning system and method based on deep learning
CN110083705B (en) Multi-hop attention depth model, method, storage medium and terminal for target emotion classification
Gan et al. Sparse attention based separable dilated convolutional neural network for targeted sentiment analysis
CN112966127A (en) Cross-modal retrieval method based on multilayer semantic alignment
US11410031B2 (en) Dynamic updating of a word embedding model
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN111897944B (en) Knowledge graph question-answering system based on semantic space sharing
Wang et al. Multi-modal knowledge graphs representation learning via multi-headed self-attention
CN113779220A (en) Mongolian multi-hop question-answering method based on three-channel cognitive map and graph attention network
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN112084358B (en) Image-text matching method based on area strengthening network with subject constraint
CN115309927B (en) Multi-label guiding and multi-view measuring ocean remote sensing image retrieval method and system
CN114925176B (en) Method, system and medium for constructing intelligent multi-modal cognitive map
CN114254093A (en) Multi-space knowledge enhanced knowledge graph question-answering method and system
CN115331075A (en) Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
CN117556067B (en) Data retrieval method, device, computer equipment and storage medium
CN110889505A (en) Cross-media comprehensive reasoning method and system for matching image-text sequences
Jia et al. Semantic association enhancement transformer with relative position for image captioning
CN114169408A (en) Emotion classification method based on multi-mode attention mechanism
Zhou et al. Relation-Aware Entity Matching Using Sentence-BERT.
CN117131933A (en) Multi-mode knowledge graph establishing method and application
CN117150069A (en) Cross-modal retrieval method and system based on global and local semantic comparison learning
CN116595222A (en) Short video multi-label classification method and device based on multi-modal knowledge distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant