CN113535949A - Multi-modal joint event detection method based on pictures and sentences - Google Patents
Multi-modal joint event detection method based on pictures and sentences

- Publication number: CN113535949A
- Application number: CN202110660692.2A
- Authority: CN (China)
- Prior art keywords: picture, event, sentence, word, entity
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications

- G06F16/35 — Information retrieval of unstructured textual data: clustering; classification (G — Physics; G06 — Computing; G06F — Electric digital data processing)
- G06F16/55 — Information retrieval of still image data: clustering; classification
- G06F18/285 — Pattern recognition: selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
Abstract
The invention discloses a multi-modal joint event detection method based on pictures and sentences, which recognizes events from pictures and sentences simultaneously. On one hand, the method uses existing single-modal datasets to learn the image and text event classifiers separately; on the other hand, it uses existing picture-caption pairs to train a picture-sentence matching module that finds the picture and sentence with the highest semantic similarity in a multi-modal article, thereby obtaining feature representations of picture entities and words in a common space. These features make it possible to share parameters between the picture and text event classifiers, yielding a shared event classifier. Finally, the model is tested with a small amount of multi-modal annotated data, and the shared event classifier is used to obtain the events described by the pictures and sentences, together with their types. Because the invention recognizes events from pictures and sentences jointly and exploits the complementarity of visual and textual features, it not only improves single-modal event classification performance but also discovers more complete event information in articles.
Description
Technical Field
The invention relates to an event detection method, in particular to a multi-modal joint event detection method based on pictures and sentences, and belongs to the field of multi-modal information extraction.
Background
As modern technologies such as computers and mobile phones have gradually entered ordinary households, interacting on social platforms and browsing news websites have become the main ways people obtain online information, greatly simplifying how netizens acquire it. Consequently, the number of network users consuming information keeps growing: according to the 47th Statistical Report on China's Internet Development issued by the China Internet Network Information Center, the number of Internet users in China had reached 989 million by December 2020, an increase of 85.4 million over March of the same year. As a result, a great deal of new information floods into the network every day, spreading among the public in forms such as text, pictures and audio. Faced with such massive and disordered network information, information extraction technology can process the data and present structured information to users, accurately providing them with valuable and interesting information.
Information extraction extracts structured information from pictures, text or audio for storage and display, and is an important technical means for constructing knowledge graphs. It generally comprises three subtasks: named entity recognition, relation extraction and event extraction. Taking text as an example, the named entity recognition task discovers entities such as geopolitical entities, facilities and person names. The relation extraction task determines the binary semantic relation between two entities. The event extraction task comprises two stages: event detection (finding trigger words in a sentence and determining their event types) and argument identification (assigning argument roles to each entity participating in an event). Compared with relation extraction, event extraction can capture the interrelations among multiple entities at once, yielding finer-grained structured information, and is therefore more challenging.
Event detection is a key stage of the event extraction task: it identifies the picture actions and text trigger words that mark the occurrence of events and classifies them into predefined event types. It is widely applied in fields such as online public opinion analysis and intelligence gathering.
Disclosure of Invention
The information provided by single-modal data such as a picture or a sentence alone is often insufficient for correct event classification, and feature information from the other modality is usually needed. Aiming at this problem, the invention proposes a multi-modal joint event detection method based on pictures and sentences, which recognizes events from pictures and sentences simultaneously.
The multi-modal joint event detection method based on pictures and sentences comprises the following steps:

Step 1. The text event detection module first encodes the text features to obtain the feature representation sequence $C^T=\{c_1^T,c_2^T,\dots,c_n^T\}$ of the words in a sentence. For the $j$-th candidate trigger word, its feature vector $c_j^T$ is then fed into the text event classifier $\mathrm{Softmax}_T$ to obtain the probability distribution over the event types triggered by the $j$-th candidate trigger word; the loss function of the text event classifier is defined as $L_T$.

Step 2. The picture event detection module encodes the picture features to obtain the feature representation sequence $H^I=\{h_v^I,h_1^I,\dots,h_m^I\}$ of the action and the entities described in the picture. The action feature vector $h_v^I$ is then fed into the picture event classifier $\mathrm{Softmax}_I$ to obtain the probability distribution over the event types described by the current picture; the loss function of the picture event classifier is defined as $L_I$.

Step 3. The picture-sentence matching module first computes the association weight between each pair of picture entity and word using a Cross-Modal Attention Mechanism (CMAM). For the $j$-th word, CMAM locates the important picture entities, assigns them weights, and aggregates the word-related visual features by weighted averaging to obtain the word's feature representation $h_j^{T\leftarrow I}$ in the picture modality. Conversely, for the $i$-th entity in the picture, the related words in the sentence to be matched are found and assigned weights, and the semantic information related to the picture entity is captured by weighted averaging, giving the picture entity's feature representation $h_i^{I\leftarrow T}$ in the text modality. Then the Euclidean distance $D_{T\leftarrow I}$ between each sentence and its feature representation sequence in the picture modality and the Euclidean distance $D_{I\leftarrow T}$ between all entities in the picture and their feature representation sequence in the text modality are added as the similarity of the picture and the sentence. The loss function of the picture-sentence matching module is defined as $L_m$.

Step 4. A shared event classifier is obtained by jointly optimizing the text event detection module, the picture event detection module and the picture-sentence matching module.

Step 5. In the testing stage, for a multi-modal article, the picture and sentence with the highest similarity are first found using the picture-sentence matching module, giving the feature representation $h_i^{I\leftarrow T}$ of the $i$-th picture entity in the text modality and the feature representation $h_j^{T\leftarrow I}$ of the $j$-th word in the picture modality. A gated attention mechanism then assigns weights to the picture entity feature vector $h_i^I$ and to $h_i^{I\leftarrow T}$, the multi-modal feature vector of the $i$-th picture entity is obtained by weighted averaging, and the shared event classifier yields the event type described by the picture. Likewise, another gated attention mechanism assigns weights to $c_j^T$ and $h_j^{T\leftarrow I}$, the multi-modal feature representation of the $j$-th word is obtained by weighted averaging, and the shared event classifier yields the event type triggered by the $j$-th word.
Further, step 1 is implemented as follows:

1-1. A text event classifier is trained on the KBP2017 English dataset. The annotated data are first preprocessed to obtain the entity types, the event trigger words and the event types corresponding to the trigger words; the data comprise 5 entity types and 18 event types. Stanford CoreNLP is then used to split the raw text into sentences and words and to obtain each sentence's part-of-speech tags and syntactic dependency structure. A part-of-speech vector table and an entity type vector table are created, each containing an initialization vector for the type "null".

1-2. The pre-trained GloVe word vector matrix is queried to obtain the word vector $w_{emd}$ of each word in the sentence; the part-of-speech vector table is queried to obtain the part-of-speech vector $w_{pos}$; and the entity type vector table is queried to obtain the entity type vector $w_{entity}$. The real-valued vector of each word is $x=\{w_{emd},w_{pos},w_{entity}\}$, so the sentence's real-valued vector sequence is denoted $W=\{x_1,x_2,\dots,x_{n-1},x_n\}$, where $n$ is the length of the sentence.

1-3. The sentence's real-valued vector sequence $W=\{x_1,x_2,\dots,x_{n-1},x_n\}$ is taken as the input of a Bi-LSTM to obtain the hidden state vector sequence $H^L$ of the sentence. A graph convolutional network is constructed based on the sentence's syntactic dependency structure, and $H^L$ is fed into the GCN to obtain the sentence's convolution vector sequence $H^T$. Finally, attention over the sequence $H^T$ computes the influence weight of each element in the sentence on the candidate trigger word, giving the sentence encoding $C^T=\{c_1^T,c_2^T,\dots,c_n^T\}$. $C^T$ is used as the feature representation sequence of the words in the common space.

1-4. Each word in the sentence is regarded as a candidate trigger word. For the $j$-th ($j\le n$) candidate trigger word, its feature vector $c_j^T$ is fed into the text event classifier:

$$P(y_j^T\mid S)=\mathrm{Softmax}_T(W_T\,c_j^T+b_T),\qquad type_{w,j}=\arg\max P(y_j^T\mid S)$$

where $W_T$ and $b_T$ are the weight matrix and bias term of the text event classifier $\mathrm{Softmax}_T$, $P(y_j^T\mid S)$ is the probability distribution over the event types triggered by the $j$-th candidate trigger word $w_j$ in sentence $S$, and $type_{w,j}$ denotes the event type triggered by $w_j$. The loss function of the text event classifier is defined as:

$$L_T=-\sum_{i=1}^{T}\sum_{j=1}^{n}\log P\left(y_j^T=\hat{y}_j^T\,\middle|\,S_i\right)$$

where $T$ is the number of annotated sentences in the KBP2017 English dataset, $\hat{y}_j^T$ is the annotated event type of word $w_j$, and $S_i$ is the $i$-th sentence in the dataset, with sentence length $n$.
Further, step 2 is implemented as follows:

2-1. A picture event classifier is trained on the imSitu picture dataset, in which 504 verbs are defined to record the actions described by pictures and 11538 entity types describe the entities appearing in pictures. First, one VGG16 ($\mathrm{VGG16}_v$) extracts the action features in the picture, and a multi-layer perceptron $\mathrm{MLP}_v$ converts the verb features into the verb vector $h_v$. Meanwhile, another VGG16 ($\mathrm{VGG16}_o$) extracts the entity set $O=\{o_1,o_2,\dots,o_{m-1},o_m\}$ in the picture, after which a multi-layer perceptron $\mathrm{MLP}_o$ converts all entities into their corresponding noun vector sequence. Each picture is then represented by a star-shaped graph built from the action and entities it describes: the action described by the picture serves as the central node, and each entity is connected to the action node. A graph convolutional network then encodes the word vector sequence corresponding to the picture features, so that the convolved vector of the action node stores the entity feature information. The encoded picture entity feature vector sequence is $H^I=\{h_v^I,h_1^I,\dots,h_m^I\}$, where $h_v^I$ denotes the convolution vector of the picture action node (for convenience of calculation, the invention treats the picture action as a picture entity); likewise, $H^I$ is the feature representation sequence of the picture's action and entity set in the common space.

2-2. The convolved action vector $h_v^I$ in picture $I$ is taken as the input of the picture event classifier, and the probability distribution over the event types described by the picture is obtained as:

$$P(y_I\mid I)=\mathrm{Softmax}_I(W_I\,h_v^I+b_I),\qquad type_I=\arg\max P(y_I\mid I)$$

where $W_I$ and $b_I$ are the weight matrix and bias term of the picture event classifier $\mathrm{Softmax}_I$, $P(y_I\mid I)$ is the probability distribution over the event types described by picture $I$, and $type_I$ denotes the event type described by picture $I$. The loss function of the picture event classifier is defined as:

$$L_I=-\sum_{i=1}^{N}\log P\left(y_I=\hat{y}_I\,\middle|\,I_i\right)$$

where $N$ is the number of annotated picture event samples in imSitu, $\hat{y}_I$ is the annotated event type of picture $I_i$, and $I_i$ denotes the $i$-th picture sample in the picture dataset.
Further, step 3 is implemented as follows:

3-1. The picture-sentence matching module finds the picture and sentence with the highest semantic similarity in a multi-modal document containing several pictures and sentences. A cross-modal attention mechanism first computes the association weight between each pair of picture entity and word, learning word-based picture entity feature representations and picture-entity-based word feature representations. More specifically, for each word, CMAM locates the important picture entities, assigns them weights, and aggregates the word-related visual features by weighted averaging to obtain the word's feature representation in the picture modality. Conversely, for each entity in the picture, the related words in the sentence to be matched are found and assigned weights, and the semantic information related to the picture entity is captured by weighted averaging, giving the picture entity's feature representation in the text modality. Given the entity feature vector sequence $H^I$ corresponding to picture $I$ and the word feature vector sequence $C^T$ of sentence $S$, the cross-modal attention mechanism is first used to obtain the representations of words and picture entities in the other modality.

3-2. To obtain the word-based picture entity feature representation, the cross-modal attention mechanism first computes the association degree $Score_{ij}$ between the $i$-th entity in the picture and the $j$-th word:

$$Score_{ij}=\cos\left(h_i^I,\,c_j^T\right)$$

where $\cos(h_i^I,c_j^T)$ is the cosine similarity between the feature vector $h_i^I$ of the $i$-th entity in the picture and the feature vector $c_j^T$ of the $j$-th word, with value range $[0,1]$. The influence weight $A_{ij}$ of the $i$-th picture entity on the $j$-th word is then computed by normalizing $Score_{ij}$ over all picture entities:

$$A_{ij}=\frac{\exp(Score_{ij})}{\sum_{k=1}^{m}\exp(Score_{kj})}$$

Finally, the picture entity feature representation based on the $j$-th word is aggregated by weighted averaging: $h_j^{T\leftarrow I}=\sum_{i=1}^{m}A_{ij}\,h_i^I$. The invention therefore uses $H^{T\leftarrow I}=\{h_1^{T\leftarrow I},\dots,h_n^{T\leftarrow I}\}$ to denote the feature representation sequence of the whole sentence in the picture modality.

3-3. To obtain the picture-entity-based word feature representation, the same calculation as for $h_j^{T\leftarrow I}$ is adopted: for the $i$-th entity in the picture, an attention weight is assigned to the $j$-th word according to its relevance to the current picture entity:

$$A'_{ij}=\frac{\exp(Score_{ij})}{\sum_{k=1}^{n}\exp(Score_{ik})}$$

Then the word feature representation based on the $i$-th picture entity is captured by weighted averaging: $h_i^{I\leftarrow T}=\sum_{j=1}^{n}A'_{ij}\,c_j^T$. Likewise, the representation of all entities in the picture in the text modality is $H^{I\leftarrow T}=\{h_v^{I\leftarrow T},h_1^{I\leftarrow T},\dots,h_m^{I\leftarrow T}\}$.

3-4. To obtain the semantic similarity between the picture and the sentence, a weak-consistency alignment is adopted: the similarity of the picture and the sentence is defined as the sum of the Euclidean distance between all entities in the picture and their feature representation sequence in the text modality and the Euclidean distance between each sentence and its feature representation sequence in the picture modality.

First, the Euclidean distance between each sentence and its feature representation sequence in the picture modality is computed:

$$D_{T\leftarrow I}=\sum_{j=1}^{n}\left\lVert c_j^T-h_j^{T\leftarrow I}\right\rVert_2$$

Then the Euclidean distance between all entities in the picture and their feature representation sequence in the text modality is:

$$D_{I\leftarrow T}=\sum_{i=1}^{m}\left\lVert h_i^I-h_i^{I\leftarrow T}\right\rVert_2$$

Thus the semantic similarity between picture $I$ and sentence $S$ is defined as $\langle I,S\rangle=D_{T\leftarrow I}+D_{I\leftarrow T}$ (a distance, so smaller values indicate higher similarity). Finally, to obtain the picture-sentence pair with the highest semantic similarity $\langle I,S\rangle$, the picture-sentence matching module is optimized with a triplet loss. For each pair of correctly matched picture and sentence, the invention additionally samples a picture $I^-$ that does not match sentence $S$ and a sentence $S^-$ that does not match picture $I$, forming two negative pairs $\langle I,S^-\rangle$ and $\langle I^-,S\rangle$. The loss function of the picture-sentence matching module is defined as:

$$L_m=\max\left(0,\,1+\langle I,S\rangle-\langle I,S^-\rangle\right)+\max\left(0,\,1+\langle I,S\rangle-\langle I^-,S\rangle\right)$$
Further, step 4 is implemented as follows:

4-1. To obtain event classifiers sharing the weight matrix and bias term, the invention takes the feature representations of words and of picture actions in the common space as the inputs of the text and picture event classifiers respectively, and jointly optimizes the model by minimizing the objective function $L=L_T+L_I+L_m$, so that the text event classifier $\mathrm{Softmax}_T$ and the picture event classifier $\mathrm{Softmax}_I$ can share their weight matrix and bias term. In the testing stage, this shared event classifier is therefore used to predict the event types described by the picture and the sentence simultaneously.
Further, step 5 is implemented as follows:

5-1. The trained model is tested with the M2E2 multi-modal annotated data. For an article with $k$ sentences $S_1,S_2,\dots,S_{k-1},S_k$ and $l$ pictures $I_1,I_2,\dots,I_{l-1},I_l$, the picture-sentence matching module first finds the picture-sentence pair with the highest semantic similarity $\langle I,S\rangle$, and simultaneously obtains the picture-entity-based word feature representation sequence $H^{I\leftarrow T}$ and the word-based picture entity feature representation sequence $H^{T\leftarrow I}$.

5-2. In feature fusion, for word $w_j$, the invention considers that $c_j^T$ and $h_j^{T\leftarrow I}$ contribute feature information of different degrees to trigger word $w_j$. A gated attention mechanism therefore assigns weights to the different feature information; the weight $\beta_j$ of $h_j^{T\leftarrow I}$ is computed from the cosine similarity between the two vectors:

$$\beta_j=\sigma\left(\cos\left(c_j^T,\,h_j^{T\leftarrow I}\right)\right)$$

where $\cos(c_j^T,h_j^{T\leftarrow I})$ is the cosine similarity, with value range $[-1,1]$, between the $j$-th word feature vector $c_j^T$ and its feature representation $h_j^{T\leftarrow I}$ in the picture modality. Weighted averaging then fuses the picture feature information related to $w_j$ to obtain the multi-modal feature representation vector $\tilde{c}_j$ corresponding to $w_j$:

$$\tilde{c}_j=(1-\beta_j)\,c_j^T+\beta_j\,h_j^{T\leftarrow I}$$

Here $\beta_j$ is a value between 0 and 1 that controls the influence of $h_j^{T\leftarrow I}$ on the fused multi-modal feature $\tilde{c}_j$: when $\beta_j$ is small, the fused feature preserves more textual information, whereas a large $\beta_j$ indicates that the picture features contribute more information to the event classification of word $w_j$.

Finally, the multi-modal feature $\tilde{c}_j$ corresponding to candidate trigger word $w_j$ is fed into the shared event classifier to obtain the event type triggered by $w_j$.

5-3. Likewise, for picture $I$, another gated attention controls the influence of the word features on the picture event classification. The gated attention mechanism first assigns weights $1-\beta_i$ and $\beta_i$ to the original feature $h_i^I$ corresponding to the picture action and its feature representation $h_i^{I\leftarrow T}$ in the text modality, where $\beta_i$ is computed as:

$$\beta_i=\sigma\left(\cos\left(h_i^I,\,h_i^{I\leftarrow T}\right)\right)$$

Then the original feature $h_i^I$ of the $i$-th picture entity and its feature representation $h_i^{I\leftarrow T}$ in the text modality are fused by weighted averaging, giving the updated multi-modal feature vector $\tilde{h}_i=(1-\beta_i)\,h_i^I+\beta_i\,h_i^{I\leftarrow T}$, where $i\in\{v,1,\dots,m\}$. Finally, the shared event classifier classifies $\tilde{h}_v$ to obtain the event type $\arg\max P(y_I\mid I)$ to which the action described by the picture belongs.
The invention has the following beneficial effects:
Aiming at the defects of the prior art, a multi-modal joint event detection method based on pictures and sentences is provided, which recognizes events from pictures and sentences simultaneously. Because sufficient multi-modal annotated data are lacking, the invention adopts a joint optimization scheme: on one hand, existing single-modal datasets (the imSitu picture dataset and the KBP2017 English dataset) are used to learn the picture and text event classifiers separately; on the other hand, existing picture-caption pairs are used to train the picture-sentence matching module, which finds the picture and sentence with the highest semantic similarity in a multi-modal article and thereby yields feature representations of picture entities and words in the common space. These features make it possible to share parameters between the picture and text event classifiers, producing a shared event classifier. Finally, a small amount of multi-modal annotated data (the M2E2 multi-modal dataset) is used to test the model, and the shared event classifier obtains the events described by the pictures and sentences, together with their types. The invention recognizes events from pictures and sentences jointly and exploits the complementarity of visual and textual features, thereby not only improving single-modal event classification performance but also discovering more complete event information in articles.
Drawings
FIG. 1 is a flow chart of the overall implementation of the present invention.
FIG. 2 is a block diagram of the model training phase of the present invention.
Detailed Description
The accompanying drawings disclose, in a non-limiting way, a flow chart of a preferred embodiment of the invention; the technical solution of the invention is described in detail below with reference to the drawings.
Event detection is a key stage of the event extraction task: it identifies the picture actions and text trigger words that mark the occurrence of events and classifies them into predefined event types, and it is widely applied in fields such as online public opinion analysis and intelligence gathering. With the diversification of the carriers of network information, researchers have turned to event detection in different settings, i.e., how to automatically acquire events of interest from different information carriers such as unstructured pictures and text. The same event may appear in different forms in pictures and sentences. Existing models, however, perform only single-modal event detection on sentences or pictures, or consider only the influence of picture features on text event detection while ignoring the influence of textual context on picture event classification. To solve these problems, the invention provides a multi-modal joint event detection method based on pictures and sentences.
As shown in FIGS. 1-2, the multi-modal joint event detection method based on pictures and sentences comprises the following steps:

Step 1. The text event detection module first encodes the text features to obtain the feature representation sequence $C^T=\{c_1^T,c_2^T,\dots,c_n^T\}$ of the words in a sentence. For the $j$-th candidate trigger word, its feature vector $c_j^T$ is then fed into the text event classifier $\mathrm{Softmax}_T$ to obtain the probability distribution over the event types triggered by the $j$-th candidate trigger word; the loss function of the text event classifier is defined as $L_T$.

Step 2. The picture event detection module encodes the picture features to obtain the feature representation sequence $H^I=\{h_v^I,h_1^I,\dots,h_m^I\}$ of the action and the entities described in the picture. The action feature vector $h_v^I$ is then fed into the picture event classifier $\mathrm{Softmax}_I$ to obtain the probability distribution over the event types described by the current picture; the loss function of the picture event classifier is defined as $L_I$.

Step 3. The picture-sentence matching module first computes the association weight between each pair of picture entity and word using a Cross-Modal Attention Mechanism (CMAM). For the $j$-th word, CMAM locates the important picture entities, assigns them weights, and aggregates the word-related visual features by weighted averaging to obtain the word's feature representation $h_j^{T\leftarrow I}$ in the picture modality. Conversely, for the $i$-th entity in the picture, the related words in the sentence to be matched are found and assigned weights, and the semantic information related to the picture entity is captured by weighted averaging, giving the picture entity's feature representation $h_i^{I\leftarrow T}$ in the text modality. Then the Euclidean distance $D_{T\leftarrow I}$ between each sentence and its feature representation sequence in the picture modality and the Euclidean distance $D_{I\leftarrow T}$ between all entities in the picture and their feature representation sequence in the text modality are added as the similarity of the picture and the sentence. The loss function of the picture-sentence matching module is defined as $L_m$.

Step 4. A shared event classifier is obtained by jointly optimizing the text event detection module, the picture event detection module and the picture-sentence matching module.

Step 5. In the testing stage, for a multi-modal article, the picture and sentence with the highest similarity are first found using the picture-sentence matching module, giving the feature representation $h_i^{I\leftarrow T}$ of the $i$-th picture entity in the text modality and the feature representation $h_j^{T\leftarrow I}$ of the $j$-th word in the picture modality. A gated attention mechanism then assigns weights to the picture entity feature vector $h_i^I$ and to $h_i^{I\leftarrow T}$, the multi-modal feature vector of the $i$-th picture entity is obtained by weighted averaging, and the shared event classifier yields the event type described by the picture. Likewise, another gated attention mechanism assigns weights to $c_j^T$ and $h_j^{T\leftarrow I}$, the multi-modal feature representation of the $j$-th word is obtained by weighted averaging, and the shared event classifier yields the event type triggered by the $j$-th word.
Further, step 1 is implemented as follows:

1-1. A text event classifier is trained on the KBP2017 English dataset. The annotated data are first preprocessed to obtain the entity types, the event trigger words and the event types corresponding to the trigger words; the data comprise 5 entity types and 18 event types. Stanford CoreNLP is then used to split the raw text into sentences and words and to obtain each sentence's part-of-speech tags and syntactic dependency structure. A part-of-speech vector table and an entity type vector table are created, each containing an initialization vector for the type "null".

1-2. The pre-trained GloVe word vector matrix is queried to obtain the word vector $w_{emd}$ of each word in the sentence; the part-of-speech vector table is queried to obtain the part-of-speech vector $w_{pos}$; and the entity type vector table is queried to obtain the entity type vector $w_{entity}$. The real-valued vector of each word is $x=\{w_{emd},w_{pos},w_{entity}\}$, so the sentence's real-valued vector sequence is denoted $W=\{x_1,x_2,\dots,x_{n-1},x_n\}$, where $n$ is the length of the sentence.

1-3. The sentence's real-valued vector sequence $W=\{x_1,x_2,\dots,x_{n-1},x_n\}$ is taken as the input of a Bi-LSTM to obtain the hidden state vector sequence $H^L$ of the sentence. A graph convolutional network is constructed based on the sentence's syntactic dependency structure, and $H^L$ is fed into the GCN to obtain the sentence's convolution vector sequence $H^T$. Finally, attention over the sequence $H^T$ computes the influence weight of each element in the sentence on the candidate trigger word, giving the sentence encoding $C^T=\{c_1^T,c_2^T,\dots,c_n^T\}$. $C^T$ is used as the feature representation sequence of the words in the common space.
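For concreteness, the encoding chain of 1-3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the single GCN layer, the mean-style graph convolution, and the dot-product attention scores are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Sketch of 1-3: W -> Bi-LSTM -> H^L -> GCN over the dependency
    graph -> H^T -> attention -> C^T (word features in the common space)."""

    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hid_dim // 2,
                              bidirectional=True, batch_first=True)
        self.gcn_w = nn.Linear(hid_dim, hid_dim)  # single GCN layer (assumed)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (1, n, in_dim) word vectors [w_emd; w_pos; w_entity]
        # adj: (n, n) symmetric dependency adjacency with self-loops
        h_l, _ = self.bilstm(x)                    # H^L: (1, n, hid_dim)
        h_l = h_l.squeeze(0)
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        h_t = F.relu(self.gcn_w(adj @ h_l / deg))  # H^T: neighbourhood average
        # per-candidate attention over all tokens (dot-product scores assumed)
        att = F.softmax(h_t @ h_t.t(), dim=-1)     # (n, n)
        return att @ h_t                           # C^T: one vector per word
```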
1-4. Each word in the sentence is regarded as a candidate trigger word. For the $j$-th ($j\le n$) candidate trigger word, its feature vector $c_j^T$ is fed into the text event classifier:

$$P(y_j^T\mid S)=\mathrm{Softmax}_T(W_T\,c_j^T+b_T),\qquad type_{w,j}=\arg\max P(y_j^T\mid S)$$

where $W_T$ and $b_T$ are the weight matrix and bias term of the text event classifier $\mathrm{Softmax}_T$, $P(y_j^T\mid S)$ is the probability distribution over the event types triggered by the $j$-th candidate trigger word $w_j$ in sentence $S$, and $type_{w,j}$ denotes the event type triggered by $w_j$. The loss function of the text event classifier is defined as:

$$L_T=-\sum_{i=1}^{T}\sum_{j=1}^{n}\log P\left(y_j^T=\hat{y}_j^T\,\middle|\,S_i\right)$$

where $T$ is the number of annotated sentences in the KBP2017 English dataset, $\hat{y}_j^T$ is the annotated event type of word $w_j$, and $S_i$ is the $i$-th sentence in the dataset, with sentence length $n$.
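The classifier of 1-4 with its loss $L_T$ (and, once step 4 shares the weight matrix and bias term, the picture classifier of 2-2 with $L_I$) amounts to a linear layer trained with softmax negative log-likelihood. A minimal sketch, in which the feature dimension of 256 and an extra assumed "null" class for non-trigger words are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_types = 18 + 1                      # 18 KBP2017 event types + assumed "null"
classifier = nn.Linear(256, n_types)  # W_T/b_T; also W_I/b_I once shared (step 4)

def event_loss(feats: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """feats: (k, 256) word features c_j^T or action features h_v^I;
    gold: (k,) annotated event-type ids. Returns L_T (or L_I)."""
    return F.nll_loss(F.log_softmax(classifier(feats), dim=-1), gold)

def predict(feats: torch.Tensor) -> torch.Tensor:
    """Predicted event type: argmax of P(y | feature)."""
    return classifier(feats).argmax(dim=-1)
```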
Further, step 2 is implemented as follows:

2-1. A picture event classifier is trained on the imSitu picture dataset, in which 504 verbs are defined to record the actions described by pictures and 11538 entity types describe the entities appearing in pictures. First, one VGG16 ($\mathrm{VGG16}_v$) extracts the action features in the picture, and a multi-layer perceptron $\mathrm{MLP}_v$ converts the verb features into the verb vector $h_v$. Meanwhile, another VGG16 ($\mathrm{VGG16}_o$) extracts the entity set $O=\{o_1,o_2,\dots,o_{m-1},o_m\}$ in the picture, after which a multi-layer perceptron $\mathrm{MLP}_o$ converts all entities into their corresponding noun vector sequence. Each picture is then represented by a star-shaped graph built from the action and entities it describes: the action described by the picture serves as the central node, and each entity is connected to the action node. A graph convolutional network then encodes the word vector sequence corresponding to the picture features, so that the convolved vector of the action node stores the entity feature information. The encoded picture entity feature vector sequence is $H^I=\{h_v^I,h_1^I,\dots,h_m^I\}$, where $h_v^I$ denotes the convolution vector of the picture action node (for convenience of calculation, the invention treats the picture action as a picture entity); likewise, $H^I$ is the feature representation sequence of the picture's action and entity set in the common space.
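A sketch of the star-shaped graph encoding of 2-1, with the VGG16/MLP feature extraction stubbed out by random tensors and the dimensions chosen arbitrarily:

```python
import torch
import torch.nn.functional as F

def encode_picture(verb_vec: torch.Tensor, noun_vecs: torch.Tensor,
                   gcn_w: torch.Tensor) -> torch.Tensor:
    """verb_vec: (d,) MLP_v output; noun_vecs: (m, d) MLP_o outputs;
    gcn_w: (d, d) GCN weight. Returns H^I = [h_v^I, h_1^I, ..., h_m^I]."""
    m = noun_vecs.shape[0]
    nodes = torch.cat([verb_vec.unsqueeze(0), noun_vecs], dim=0)  # (m+1, d)
    # star-shaped adjacency: the action (node 0) is linked to every entity;
    # self-loops keep each node's own features
    adj = torch.eye(m + 1)
    adj[0, 1:] = 1.0
    adj[1:, 0] = 1.0
    deg = adj.sum(-1, keepdim=True)
    return F.relu((adj @ nodes / deg) @ gcn_w)  # row 0 is h_v^I

# usage with random stand-in features (d = 256, m = 5)
h_img = encode_picture(torch.randn(256), torch.randn(5, 256),
                       torch.randn(256, 256))
```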
2-2. The convolved action vector $h_v^I$ in picture $I$ is taken as the input of the picture event classifier, and the probability distribution over the event types described by the picture is obtained as:

$$P(y_I\mid I)=\mathrm{Softmax}_I(W_I\,h_v^I+b_I),\qquad type_I=\arg\max P(y_I\mid I)$$

where $W_I$ and $b_I$ are the weight matrix and bias term of the picture event classifier $\mathrm{Softmax}_I$, $P(y_I\mid I)$ is the probability distribution over the event types described by picture $I$, and $type_I$ denotes the event type described by picture $I$. The loss function of the picture event classifier is defined as:

$$L_I=-\sum_{i=1}^{N}\log P\left(y_I=\hat{y}_I\,\middle|\,I_i\right)$$

where $N$ is the number of annotated picture event samples in imSitu, $\hat{y}_I$ is the annotated event type of picture $I_i$, and $I_i$ denotes the $i$-th picture sample in the picture dataset.
Further, step 3 is implemented as follows:

3-1. The picture-sentence matching module finds the picture and sentence with the highest semantic similarity in a multi-modal document containing several pictures and sentences. A cross-modal attention mechanism first computes the association weight between each pair of picture entity and word, learning word-based picture entity feature representations and picture-entity-based word feature representations. More specifically, for each word, CMAM locates the important picture entities, assigns them weights, and aggregates the word-related visual features by weighted averaging to obtain the word's feature representation in the picture modality. Conversely, for each entity in the picture, the related words in the sentence to be matched are found and assigned weights, and the semantic information related to the picture entity is captured by weighted averaging, giving the picture entity's feature representation in the text modality. Given the entity feature vector sequence $H^I$ corresponding to picture $I$ and the word feature vector sequence $C^T$ of sentence $S$, the cross-modal attention mechanism is first used to obtain the representations of words and picture entities in the other modality.

3-2. To obtain the word-based picture entity feature representation, the cross-modal attention mechanism first computes the association degree $Score_{ij}$ between the $i$-th entity in the picture and the $j$-th word:

$$Score_{ij}=\cos\left(h_i^I,\,c_j^T\right)$$

where $\cos(h_i^I,c_j^T)$ is the cosine similarity between the feature vector $h_i^I$ of the $i$-th entity in the picture and the feature vector $c_j^T$ of the $j$-th word, with value range $[0,1]$. The influence weight $A_{ij}$ of the $i$-th picture entity on the $j$-th word is then computed by normalizing $Score_{ij}$ over all picture entities:

$$A_{ij}=\frac{\exp(Score_{ij})}{\sum_{k=1}^{m}\exp(Score_{kj})}$$

Finally, the picture entity feature representation based on the $j$-th word is aggregated by weighted averaging: $h_j^{T\leftarrow I}=\sum_{i=1}^{m}A_{ij}\,h_i^I$. The invention therefore uses $H^{T\leftarrow I}=\{h_1^{T\leftarrow I},\dots,h_n^{T\leftarrow I}\}$ to denote the feature representation sequence of the whole sentence in the picture modality.

3-3. To obtain the picture-entity-based word feature representation, the same calculation as for $h_j^{T\leftarrow I}$ is adopted: for the $i$-th entity in the picture, an attention weight is assigned to the $j$-th word according to its relevance to the current picture entity:

$$A'_{ij}=\frac{\exp(Score_{ij})}{\sum_{k=1}^{n}\exp(Score_{ik})}$$

Then the word feature representation based on the $i$-th picture entity is captured by weighted averaging: $h_i^{I\leftarrow T}=\sum_{j=1}^{n}A'_{ij}\,c_j^T$. Likewise, the representation of all entities in the picture in the text modality is $H^{I\leftarrow T}=\{h_v^{I\leftarrow T},h_1^{I\leftarrow T},\dots,h_m^{I\leftarrow T}\}$.
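The two attention directions of 3-2 and 3-3 are one score matrix normalized along different axes. A sketch, in which the softmax normalization of the weights $A_{ij}$ and $A'_{ij}$ is an assumption consistent with the weighted averaging described above:

```python
import torch
import torch.nn.functional as F

def cmam(h_img: torch.Tensor, c_txt: torch.Tensor):
    """h_img: (m, d) picture entity features H^I; c_txt: (n, d) word
    features C^T. Returns (H^{T<-I}, H^{I<-T})."""
    # Score_ij: cosine similarity of every entity-word pair, (m, n)
    score = F.cosine_similarity(h_img.unsqueeze(1), c_txt.unsqueeze(0), dim=-1)
    a_word = F.softmax(score, dim=0)   # A_ij: entity weights for each word
    a_ent = F.softmax(score, dim=1)    # A'_ij: word weights for each entity
    h_t_from_i = a_word.t() @ h_img    # (n, d): words in the picture modality
    h_i_from_t = a_ent @ c_txt         # (m, d): entities in the text modality
    return h_t_from_i, h_i_from_t
```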
3-4. To obtain the semantic similarity between the picture and the sentence, a weak-consistency alignment is adopted: the similarity of the picture and the sentence is defined as the sum of the Euclidean distance between all entities in the picture and their feature representation sequence in the text modality and the Euclidean distance between each sentence and its feature representation sequence in the picture modality.

First, the Euclidean distance between each sentence and its feature representation sequence in the picture modality is computed:

$$D_{T\leftarrow I}=\sum_{j=1}^{n}\left\lVert c_j^T-h_j^{T\leftarrow I}\right\rVert_2$$

Then the Euclidean distance between all entities in the picture and their feature representation sequence in the text modality is:

$$D_{I\leftarrow T}=\sum_{i=1}^{m}\left\lVert h_i^I-h_i^{I\leftarrow T}\right\rVert_2$$

Thus the semantic similarity between picture $I$ and sentence $S$ is defined as $\langle I,S\rangle=D_{T\leftarrow I}+D_{I\leftarrow T}$ (a distance, so smaller values indicate higher similarity). Finally, to obtain the picture-sentence pair with the highest semantic similarity $\langle I,S\rangle$, the picture-sentence matching module is optimized with a triplet loss. For each pair of correctly matched picture and sentence, a picture $I^-$ that does not match sentence $S$ and a sentence $S^-$ that does not match picture $I$ are additionally sampled, forming two negative pairs $\langle I,S^-\rangle$ and $\langle I^-,S\rangle$. The loss function of the picture-sentence matching module is defined as:

$$L_m=\max\left(0,\,1+\langle I,S\rangle-\langle I,S^-\rangle\right)+\max\left(0,\,1+\langle I,S\rangle-\langle I^-,S\rangle\right)$$
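Since $\langle I,S\rangle$ is a summed distance, the triplet loss of 3-4 pushes each matched pair at least one margin unit closer than its two sampled negatives. A sketch:

```python
import torch

def pair_distance(h_img, c_txt, h_t_from_i, h_i_from_t):
    """<I, S> = D_{T<-I} + D_{I<-T}: summed Euclidean distances between
    each feature and its cross-modal representation (smaller = more similar)."""
    d_t = (c_txt - h_t_from_i).norm(dim=-1).sum()   # D_{T<-I}
    d_i = (h_img - h_i_from_t).norm(dim=-1).sum()   # D_{I<-T}
    return d_t + d_i

def matching_loss(pos, neg_sent, neg_img, margin=1.0):
    """L_m for one positive pair <I, S> and its two negatives <I, S^->
    and <I^-, S>; each argument is a precomputed pair_distance value."""
    zero = torch.tensor(0.0)
    return (torch.maximum(zero, margin + pos - neg_sent)
            + torch.maximum(zero, margin + pos - neg_img))
```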
Further, step 4 is implemented as follows:

4-1. To obtain event classifiers sharing the weight matrix and bias term, the feature representations of words and of picture actions in the common space are taken as the inputs of the text and picture event classifiers respectively, and the model is jointly optimized by minimizing the objective function $L=L_T+L_I+L_m$, so that the text event classifier $\mathrm{Softmax}_T$ and the picture event classifier $\mathrm{Softmax}_I$ can share their weight matrix and bias term. In the testing stage, this shared event classifier is therefore used to predict the event types described by the picture and the sentence simultaneously.
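One joint update of 4-1 can be sketched as below, building on the event_loss and matching_loss sketches above; the Adam optimizer, the learning rate, and the params/loader placeholders are all assumptions:

```python
import torch

# `params` gathers all trainable tensors of the text encoder, picture
# encoder, matching module and the single shared classifier; `loader`
# yields mixed batches -- both are placeholders for this sketch.
optimizer = torch.optim.Adam(params, lr=1e-4)

for batch in loader:
    l_t = event_loss(batch.word_feats, batch.word_labels)          # L_T, step 1
    l_i = event_loss(batch.action_feats, batch.img_labels)         # L_I, step 2
    l_m = matching_loss(batch.pos, batch.neg_sent, batch.neg_img)  # L_m, step 3
    loss = l_t + l_i + l_m        # L = L_T + L_I + L_m
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```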
Further, step 5 is implemented as follows:

5-1. The trained model is tested with the M2E2 multi-modal annotated data. For an article with $k$ sentences $S_1,S_2,\dots,S_{k-1},S_k$ and $l$ pictures $I_1,I_2,\dots,I_{l-1},I_l$, the picture-sentence matching module first finds the picture-sentence pair with the highest semantic similarity $\langle I,S\rangle$, and simultaneously obtains the picture-entity-based word feature representation sequence $H^{I\leftarrow T}$ and the word-based picture entity feature representation sequence $H^{T\leftarrow I}$.

5-2. In feature fusion, for word $w_j$, the invention considers that $c_j^T$ and $h_j^{T\leftarrow I}$ contribute feature information of different degrees to trigger word $w_j$. A gated attention mechanism therefore assigns weights to the different feature information; the weight $\beta_j$ of $h_j^{T\leftarrow I}$ is computed from the cosine similarity between the two vectors:

$$\beta_j=\sigma\left(\cos\left(c_j^T,\,h_j^{T\leftarrow I}\right)\right)$$

where $\cos(c_j^T,h_j^{T\leftarrow I})$ is the cosine similarity, with value range $[-1,1]$, between the $j$-th word feature vector $c_j^T$ and its feature representation $h_j^{T\leftarrow I}$ in the picture modality. Weighted averaging then fuses the picture feature information related to $w_j$ to obtain the multi-modal feature representation vector $\tilde{c}_j$ corresponding to $w_j$:

$$\tilde{c}_j=(1-\beta_j)\,c_j^T+\beta_j\,h_j^{T\leftarrow I}$$

Here $\beta_j$ is a value between 0 and 1 that controls the influence of $h_j^{T\leftarrow I}$ on the fused multi-modal feature $\tilde{c}_j$: when $\beta_j$ is small, the fused feature preserves more textual information, whereas a large $\beta_j$ indicates that the picture features contribute more information to the event classification of word $w_j$.

Finally, the multi-modal feature $\tilde{c}_j$ corresponding to candidate trigger word $w_j$ is fed into the shared event classifier to obtain the event type triggered by $w_j$.
5-3. Likewise, for picture $I$, another gated attention controls the influence of the word features on the picture event classification. The gated attention mechanism first assigns weights $1-\beta_i$ and $\beta_i$ to the original feature $h_i^I$ corresponding to the picture action and its feature representation $h_i^{I\leftarrow T}$ in the text modality, where $\beta_i$ is computed as:

$$\beta_i=\sigma\left(\cos\left(h_i^I,\,h_i^{I\leftarrow T}\right)\right)$$

Then the original feature $h_i^I$ of the $i$-th picture entity and its feature representation $h_i^{I\leftarrow T}$ in the text modality are fused by weighted averaging, giving the updated multi-modal feature vector $\tilde{h}_i=(1-\beta_i)\,h_i^I+\beta_i\,h_i^{I\leftarrow T}$, where $i\in\{v,1,\dots,m\}$. Finally, the shared event classifier classifies $\tilde{h}_v$ to obtain the event type $\arg\max P(y_I\mid I)$ to which the action described by the picture belongs.
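Putting 5-1 through 5-3 together, test-stage prediction for one best-matching picture-sentence pair could look as follows, reusing the cmam, gated_fuse and predict functions from the sketches above:

```python
import torch

def detect_events(c_txt: torch.Tensor, h_img: torch.Tensor):
    """c_txt: (n, d) word features C^T of the matched sentence;
    h_img: (m+1, d) picture features H^I with the action node at row 0."""
    h_t_from_i, h_i_from_t = cmam(h_img, c_txt)   # step 5-1
    fused_words = gated_fuse(c_txt, h_t_from_i)   # step 5-2
    fused_ents = gated_fuse(h_img, h_i_from_t)    # step 5-3
    word_types = predict(fused_words)   # event type per candidate trigger
    img_type = predict(fused_ents[:1])  # event type of the action node
    return word_types, img_type
```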
Claims (6)
1. A multi-modal joint event detection method based on pictures and sentences, characterized by comprising the following steps:

Step 1. A text event detection module first encodes the text features to obtain the feature vector representation sequence $C^T=\{c_1^T,c_2^T,\dots,c_n^T\}$ of the words in a sentence; for the $j$-th candidate trigger word, the feature vector $c_j^T$ of the corresponding candidate trigger word is then fed into the text event classifier $\mathrm{Softmax}_T$ to obtain the probability distribution over the event types triggered by the $j$-th candidate trigger word, wherein the loss function of the text event classifier is defined as $L_T$;

Step 2. A picture event detection module encodes the picture features to obtain the picture entity feature vector representation sequence $H^I=\{h_v^I,h_1^I,\dots,h_m^I\}$ of the action and the entities described in the picture; the action feature vector $h_v^I$ is then fed into the picture event classifier $\mathrm{Softmax}_I$ to obtain the probability distribution over the event types described by the current picture, wherein the loss function of the picture event classifier is defined as $L_I$;

Step 3. A picture-sentence matching module first computes the association weight between each pair of picture entity and word using a cross-modal attention mechanism CMAM;

for the $j$-th word, CMAM locates the important picture entities, assigns them weights, and obtains the word's feature representation $h_j^{T\leftarrow I}$ in the picture modality by weighted averaging of the word-related picture entity features;

meanwhile, for the $i$-th entity in the picture, the related words in the sentence to be matched are found and assigned weights, and the semantic information related to the picture entity is captured by weighted averaging, giving the picture entity's feature representation $h_i^{I\leftarrow T}$ in the text modality;

then the Euclidean distance $D_{T\leftarrow I}$ between each sentence to be matched and its feature representation sequence in the picture modality and the Euclidean distance $D_{I\leftarrow T}$ between all entities in the picture and their feature representation sequence in the text modality are added to obtain the similarity of the picture and the sentence, wherein the loss function of the picture-sentence matching module is defined as $L_m$;

Step 4. A shared event classifier is obtained by jointly optimizing the text event detection module, the picture event detection module and the picture-sentence matching module;

Step 5. In the testing stage, for a multi-modal article, the picture and sentence with the highest similarity are first found using the picture-sentence matching module, giving the feature representation $h_i^{I\leftarrow T}$ of the $i$-th picture entity in the text modality and the feature representation $h_j^{T\leftarrow I}$ of the $j$-th word in the picture modality; a gated attention mechanism then assigns weights to the picture entity feature vector $h_i^I$ and the feature representation $h_i^{I\leftarrow T}$, and the multi-modal feature vector of the $i$-th picture entity is obtained by weighted averaging; the shared event classifier then yields the event type described by the picture; similarly, another gated attention mechanism assigns weights to the candidate trigger word feature vector $c_j^T$ and the feature representation $h_j^{T\leftarrow I}$, the multi-modal feature representation of the $j$-th word is obtained by weighted averaging, and the shared event classifier then yields the event type triggered by the $j$-th word.
2. The multi-modal joint event detection method based on pictures and sentences according to claim 1, wherein step 1 is implemented as follows:

1-1. A text event classifier $\mathrm{Softmax}_T$ is trained on the KBP2017 English dataset; the annotated data are first preprocessed to obtain the entity types, the event trigger words and the event types corresponding to the event trigger words, comprising 5 entity types and 18 event types; Stanford CoreNLP is then used to split the raw text into sentences and words and to obtain each sentence's part-of-speech tags and syntactic dependency structure; a part-of-speech vector table and an entity type vector table are created, each containing an initialization vector for the type "null";

1-2. The pre-trained GloVe word vector matrix is queried to obtain the word vector $w_{emd}$ of each word in the sentence, the part-of-speech vector table is queried to obtain the part-of-speech vector $w_{pos}$, and the entity type vector table is queried to obtain the entity type vector $w_{entity}$; the real-valued vector of each word is $x=\{w_{emd},w_{pos},w_{entity}\}$, so the sentence's real-valued vector sequence is denoted $W=\{x_1,x_2,\dots,x_{n-1},x_n\}$, where $n$ is the length of the sentence;

1-3. The sentence's real-valued vector sequence $W=\{x_1,x_2,\dots,x_{n-1},x_n\}$ is taken as the input of a Bi-LSTM to obtain the hidden state vector sequence $H^L$ of the sentence; a graph convolutional network is constructed based on the sentence's syntactic dependency structure, and $H^L$ is fed into the GCN to obtain the sentence's convolution vector sequence $H^T$; finally, attention over the sequence $H^T$ computes the influence weight of each element in the sentence on the candidate trigger word, giving the sentence encoding $C^T=\{c_1^T,c_2^T,\dots,c_n^T\}$; meanwhile, $C^T$ is used as the feature representation sequence of the word sequence in the common space;

1-4. Each word in the sentence is regarded as a candidate trigger word; for the $j$-th ($j\le n$) candidate trigger word, its feature vector $c_j^T$ is fed into the text event classifier:

$$P(y_j^T\mid S)=\mathrm{Softmax}_T(W_T\,c_j^T+b_T),\qquad type_{w,j}=\arg\max P(y_j^T\mid S)$$

where $W_T$ and $b_T$ are the weight matrix and bias term of the text event classifier $\mathrm{Softmax}_T$, $P(y_j^T\mid S)$ is the probability distribution over the event types triggered by the $j$-th candidate trigger word $w_j$ in sentence $S$, and $type_{w,j}$ denotes the event type triggered by $w_j$; the loss function of the text event classifier is defined as:

$$L_T=-\sum_{i=1}^{T}\sum_{j=1}^{n}\log P\left(y_j^T=\hat{y}_j^T\,\middle|\,S_i\right)$$

where $T$ is the number of annotated sentences in the KBP2017 English dataset, $\hat{y}_j^T$ is the annotated event type of word $w_j$, and $S_i$ is the $i$-th sentence in the dataset.
3. The multi-modal joint event detection method based on pictures and sentences according to claim 2, wherein step 2 is implemented as follows:

2-1. A picture event classifier is trained on the imSitu picture dataset, in which 504 verbs are defined to record the actions described by pictures and 11538 entity types describe the entities appearing in pictures; first, one VGG16 ($\mathrm{VGG16}_v$) extracts the action features in the picture, and a multi-layer perceptron $\mathrm{MLP}_v$ converts the verb features into the verb vector $h_v$; meanwhile, another VGG16 ($\mathrm{VGG16}_o$) extracts the entity set $O=\{o_1,o_2,\dots,o_{m-1},o_m\}$ in the picture, after which a multi-layer perceptron $\mathrm{MLP}_o$ converts all entities into their corresponding noun vector sequence; each picture is then represented by a star-shaped graph constructed from the action and entities it describes, with the action described by the picture as the central node and each entity connected to the action node; a graph convolutional network then encodes the word vector sequence corresponding to the picture features, so that the convolved vector of the action node stores the entity feature information; the encoded picture entity feature vector sequence is $H^I=\{h_v^I,h_1^I,\dots,h_m^I\}$, where $h_v^I$ denotes the convolution vector of the picture action node; likewise, $H^I$ is the feature representation sequence of the picture's action and entity set in the common space;

2-2. The convolved action vector $h_v^I$ in picture $I$ is taken as the input of the picture event classifier, and the probability distribution over the event types described by the picture is obtained as:

$$P(y_I\mid I)=\mathrm{Softmax}_I(W_I\,h_v^I+b_I),\qquad type_I=\arg\max P(y_I\mid I)$$

where $W_I$ and $b_I$ are the weight matrix and bias term of the picture event classifier $\mathrm{Softmax}_I$, $P(y_I\mid I)$ is the probability distribution over the event types described by picture $I$, and $type_I$ denotes the event type described by picture $I$; the loss function of the picture event classifier is defined as:

$$L_I=-\sum_{i=1}^{N}\log P\left(y_I=\hat{y}_I\,\middle|\,I_i\right)$$

where $N$ is the number of annotated picture event samples in imSitu, $\hat{y}_I$ is the annotated event type of picture $I_i$, and $I_i$ denotes the $i$-th picture sample in the picture dataset.
4. The multi-modal joint event detection method based on pictures and sentences according to claim 3, wherein step 3 is implemented as follows:

3-1. Given the entity feature vector sequence $H^I$ corresponding to picture $I$ and the word feature vector sequence $C^T$ of sentence $S$, a cross-modal attention mechanism is first used to obtain the feature representations of words and picture entities in the other modality;

3-2. To obtain the word-based picture entity feature representation, the cross-modal attention mechanism first computes the association degree $Score_{ij}$ between the $i$-th entity in the picture and the $j$-th word in the sentence:

$$Score_{ij}=\cos\left(h_i^I,\,c_j^T\right)$$

where $\cos(h_i^I,c_j^T)$ is the cosine similarity between the feature vector $h_i^I$ of the $i$-th entity in the picture and the feature vector $c_j^T$ of the $j$-th word in the sentence, with value range $[0,1]$; the influence weight $A_{ij}$ of the $i$-th picture entity on the $j$-th word is then computed from $Score_{ij}$ as:

$$A_{ij}=\frac{\exp(Score_{ij})}{\sum_{k=1}^{m}\exp(Score_{kj})}$$

finally, the picture entity feature representation based on the $j$-th word is aggregated by weighted averaging, $h_j^{T\leftarrow I}=\sum_{i=1}^{m}A_{ij}\,h_i^I$, and $H^{T\leftarrow I}=\{h_1^{T\leftarrow I},\dots,h_n^{T\leftarrow I}\}$ denotes the feature representation sequence of the whole sentence in the picture modality;

3-3. To obtain the picture-entity-based word feature representation, the same calculation as for $h_j^{T\leftarrow I}$ is adopted: for the $i$-th entity in the picture, an attention weight is assigned to the $j$-th word according to its relevance to the current picture entity:

$$A'_{ij}=\frac{\exp(Score_{ij})}{\sum_{k=1}^{n}\exp(Score_{ik})}$$

then the word feature representation based on the $i$-th picture entity is captured by weighted averaging, $h_i^{I\leftarrow T}=\sum_{j=1}^{n}A'_{ij}\,c_j^T$; likewise, the representation of all entities in the picture in the text modality is $H^{I\leftarrow T}=\{h_v^{I\leftarrow T},h_1^{I\leftarrow T},\dots,h_m^{I\leftarrow T}\}$;

3-4. A weak-consistency alignment is adopted, defining the similarity of the picture and the sentences as the sum of the Euclidean distance between all entities in the picture and their feature representation sequence in the text modality and the Euclidean distance between each sentence and its feature representation sequence in the picture modality;

first, the Euclidean distance between each sentence and its feature representation sequence in the picture modality is computed:

$$D_{T\leftarrow I}=\sum_{j=1}^{n}\left\lVert c_j^T-h_j^{T\leftarrow I}\right\rVert_2$$

then the Euclidean distance between all entities in the picture and their feature representation sequence in the text modality is computed:

$$D_{I\leftarrow T}=\sum_{i=1}^{m}\left\lVert h_i^I-h_i^{I\leftarrow T}\right\rVert_2$$

thus the semantic similarity between picture $I$ and sentence $S$ is defined as $\langle I,S\rangle=D_{T\leftarrow I}+D_{I\leftarrow T}$;

to obtain the picture-sentence pair with the highest semantic similarity $\langle I,S\rangle$, a triplet loss is used to optimize the picture-sentence matching module; for each pair of correctly matched picture and sentence, a picture $I^-$ that does not match sentence $S$ and a sentence $S^-$ that does not match picture $I$ are additionally sampled, forming two negative pairs $\langle I,S^-\rangle$ and $\langle I^-,S\rangle$;

finally, the loss function of the picture-sentence matching module is defined as:

$$L_m=\max\left(0,\,1+\langle I,S\rangle-\langle I,S^-\rangle\right)+\max\left(0,\,1+\langle I,S\rangle-\langle I^-,S\rangle\right)$$
5. The multi-modal joint event detection method based on pictures and sentences according to claim 4, wherein step 4 is implemented as follows:

4-1. To obtain event classifiers sharing the weight matrix and bias term, the feature representations of words and of picture actions in the common space are taken as the inputs of the text and picture event classifiers respectively, and the model is jointly optimized by minimizing the objective function $L=L_T+L_I+L_m$, so that the text event classifier $\mathrm{Softmax}_T$ and the picture event classifier $\mathrm{Softmax}_I$ can share their weight matrix and bias term; in the testing stage, this shared event classifier is therefore used to predict the event types described by the picture and the sentence simultaneously.
6. The multi-modal joint event detection method based on pictures and sentences according to claim 5, wherein step 5 is implemented as follows:

5-1. The trained model is tested with the M2E2 multi-modal annotated data; for an article with $k$ sentences $S_1,S_2,\dots,S_{k-1},S_k$ and $l$ pictures $I_1,I_2,\dots,I_{l-1},I_l$, the picture-sentence matching module first finds the picture-sentence pair with the highest semantic similarity $\langle I,S\rangle$, and simultaneously obtains the picture-entity-based word feature representation sequence $H^{I\leftarrow T}$ and the word-based picture entity feature representation sequence $H^{T\leftarrow I}$;

5-2. In feature fusion, for candidate trigger word $w_j$, it is considered that $c_j^T$ and $h_j^{T\leftarrow I}$ contribute feature information of different degrees to the event type prediction of candidate trigger word $w_j$; the different feature information is therefore assigned weights using a gated attention mechanism, and the weight $\beta_j$ of $h_j^{T\leftarrow I}$ is computed as:

$$\beta_j=\sigma\left(\cos\left(c_j^T,\,h_j^{T\leftarrow I}\right)\right)$$

where $\cos(c_j^T,h_j^{T\leftarrow I})$ is the cosine similarity, with value range $[-1,1]$, between the feature vector $c_j^T$ of the $j$-th candidate trigger word and its feature representation $h_j^{T\leftarrow I}$ in the picture modality; the picture feature information related to $w_j$ is then fused by weighted averaging to obtain the multi-modal feature representation vector $\tilde{c}_j$ corresponding to $w_j$:

$$\tilde{c}_j=(1-\beta_j)\,c_j^T+\beta_j\,h_j^{T\leftarrow I}$$

where $\beta_j$ is a value between 0 and 1 that controls the influence of $h_j^{T\leftarrow I}$ on the fused multi-modal feature $\tilde{c}_j$; when $\beta_j$ is small, the fused feature preserves more textual information, whereas a large $\beta_j$ indicates that the picture features contribute more information to the event classification of word $w_j$;

finally, the multi-modal feature $\tilde{c}_j$ corresponding to candidate trigger word $w_j$ is fed into the shared event classifier to obtain the event type triggered by $w_j$;

5-3. Similarly, for picture $I$, another gated attention controls the influence of the word features on the picture event classification; the gated attention mechanism first assigns weights $1-\beta_i$ and $\beta_i$ to the original feature $h_i^I$ corresponding to the picture action and its feature representation $h_i^{I\leftarrow T}$ in the text modality, where $\beta_i$ is computed as:

$$\beta_i=\sigma\left(\cos\left(h_i^I,\,h_i^{I\leftarrow T}\right)\right)$$

then the original feature $h_i^I$ of the $i$-th picture entity and its feature representation $h_i^{I\leftarrow T}$ in the text modality are fused by weighted averaging to obtain the updated multi-modal feature vector $\tilde{h}_i=(1-\beta_i)\,h_i^I+\beta_i\,h_i^{I\leftarrow T}$, where $i\in\{v,1,\dots,m\}$; finally, the shared event classifier classifies $\tilde{h}_v$ to obtain the event type $\arg\max P(y_I\mid I)$ to which the action described by the picture belongs.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110660692.2A | 2021-06-15 | 2021-06-15 | Multi-modal joint event detection method based on pictures and sentences |

Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN113535949A (application) | 2021-10-22 |
| CN113535949B (granted) | 2022-09-13 |

Family ID: 78124947
Legal Events

| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |