CN113535949A

CN113535949A - Multi-mode combined event detection method based on pictures and sentences

Info

Publication number: CN113535949A
Application number: CN202110660692.2A
Authority: CN
Inventors: 张旻; 曹祥彪; 汤景凡; 姜明
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2021-06-15
Filing date: 2021-06-15
Publication date: 2021-10-22
Anticipated expiration: 2041-06-15
Also published as: CN113535949B

Abstract

The invention discloses a multi-modal combined event detection method based on pictures and sentences, and simultaneously, events are identified from the pictures and the sentences. On one hand, the method utilizes the existing single-mode data set to respectively learn the image and text event classifiers; on the other hand, the existing picture and title pair training picture sentence matching module is utilized to find out the picture and sentence with the highest semantic similarity in the multi-modal article, so that the characteristic representation of the picture entity and the word in the public space is obtained. These features help to share parameters between the picture and text event classifiers, resulting in a shared event classifier. And finally, testing the model by using a small amount of multi-modal labeled data, and respectively acquiring the events and the types thereof described by the pictures and the sentences by using a shared event classifier. The invention identifies the events from the pictures and the sentences, and utilizes the complementarity of the visual characteristics and the text characteristics, thereby not only improving the performance of single-mode event classification, but also finding more complete event information in the articles.

Description

Multi-mode combined event detection method based on pictures and sentences

Technical Field

The invention relates to an event detection method, in particular to a multi-mode combined event detection method based on pictures and sentences, belonging to the field of multi-mode information extraction.

Background

With the gradual introduction of modern technologies such as computers and mobile phones into common families, the participation in social platform interaction, news website browsing and other behaviors has become a main way for people to acquire network information, and the process of acquiring information by netizens is greatly simplified. It follows that the number of network users consuming information is increasing, and according to the 47 th statistical report of China Internet development status issued by the China Internet information center¹It shows that the number of people in China net reaches 98900 ten thousand by 12 months in 2020, and the number of people in net is increased by 8540 ten thousand compared with 3 months in the last year. As a result, a great deal of new information is being flooded into the network every day, and the information is often spread among the masses in various forms such as text, pictures, audio, and the like. When the massive and disordered network information is faced, the information extraction technology can process the data and display the structured information to the user, so that valuable and interesting information is accurately provided for the user.

The information extraction is to extract structured information from pictures, texts or audios for storage and display, is also an important technical means for constructing a knowledge graph, and generally comprises three subtasks of named entity identification, relationship extraction and event extraction. Using text as an example, the named entity recognition task is to discover entities that describe geopolitics, facilities, and names of people. The purpose of the relationship extraction task is to determine a binary semantic relationship between two entities. And the event extraction task comprises two links of event detection (finding out trigger words in the sentence and determining the event type of the trigger words) and argument identification (allocating argument roles to each entity participating in the event). Compared with the relation extraction, the event extraction task can simultaneously extract the mutual relation among multiple entities, so that the structured information with finer granularity is obtained. Thus, the event extraction task is more challenging.

Event detection is an important link of an event extraction task, and the link can identify picture actions and text trigger words which mark the occurrence of events and classify the picture actions and the text trigger words into predefined event types. The method is widely applied to the fields of network public opinion analysis, information collection and the like.

Disclosure of Invention

The information provided by the invention mainly aiming at the single-mode data such as pictures or sentences is often insufficient for carrying out correct event classification, and the characteristic information of other modes is usually required. A multi-modal combined event detection method based on pictures and sentences is provided, and events are simultaneously recognized from the pictures and the sentences. A method for multi-modal joint event detection based on pictures and sentences is proposed.

The multi-modal combined event detection method based on pictures and sentences comprises the following steps:

step 1, a text event detection module firstly encodes text features to obtain a feature representation sequence of words in a sentence

For the jth candidate trigger word, the corresponding feature vector is then used

Input text event classifier Softmax_TAnd acquiring event type probability distribution triggered by the jth candidate trigger word, wherein a loss function of the text event classifier is defined as L^T；

Step 2, coding the picture characteristics, and acquiring the characteristic representation sequences of the actions and the plurality of entities described in the picture

Then, the feature vector of the image entity

Input picture event classifier Softmax_IObtaining the event type probability distribution described by the current picture, wherein the loss function of the picture event classifier is defined as L^I；

Step 3, the image sentence matching module firstly uses a Cross-Modal Attention Mechanism (CM)AM) calculating the association weight between each pair of photo entities and the word. According to the jth word, the CMAM can locate important picture entities and assign weights, and obtains the feature representation of the word in the picture mode by aggregating visual features related to the word through weighted average

On the other hand, for the ith entity in the picture, relevant words are firstly searched in the sentence to be matched, weights are assigned to the words, semantic information relevant to the photo entity is captured through weighted average, and therefore the characteristic representation of the photo entity in the text mode is obtained

Then, Euclidean distance D of each sentence from the characteristic representation sequence of each sentence in the picture modality^T←IAnd Euclidean distance D of all entities in the picture and the characteristic representation sequence thereof in the text modality^I←TAnd adding is carried out as the similarity of the picture and the sentence. Wherein, the loss function of the picture sentence matching module is defined as L^m；

Step 4, acquiring a shared event classifier through a combined optimization text event detection module, a picture event detection module and a picture sentence matching module;

step 5, in the testing stage, for the multi-modal text, firstly, the picture and sentence with the highest similarity are found out by using the picture and sentence matching module, and the characteristic representation of the ith picture entity in the text mode is obtained

And the feature representation of the jth word in the picture modality

Then utilizing gate control attention mechanism to make picture entity feature vector

And

and distributing weights, acquiring a multi-modal feature vector corresponding to the ith picture entity through weighted average, and then acquiring the event type described by the picture by using a shared event classifier. Also, another gated attention mechanism is utilized

And

distributing weight, obtaining multi-modal feature representation of the jth word through weighted average, and then obtaining an event type triggered by the jth word by utilizing a shared event classifier;

further, the step 1 is specifically realized as follows:

1-1, training a text event classifier on a KBP2017 English data set, firstly preprocessing labeled data to obtain entity types, event trigger words and entity relations, wherein the entity types comprise 5 entity types and 18 event types, and then performing sentence segmentation and word segmentation on an original text by using Stanford CoreNLP to obtain the part of speech and the grammar dependency structure of the sentence. And respectively creating a part-of-speech vector table and an entity type vector table, wherein each vector table has an initialization vector corresponding to the type 'null'.

1-2, inquiring a pre-trained glove word vector matrix to obtain a word vector w of each word in a sentence_emdThen, the part-of-speech vector table is inquired to obtain a part-of-speech vector w_posAnd querying the entity type vector table to obtain an entity type vector w_entityThe real-valued vector x of each word ═ { w ═ w_emd,w_pos,w_entityThus the sentence real valued vector sequence is denoted W ═ x₁,x₂,...,x_n-1,x_nWhere n is the length of the sentence.

1-3. changing the sentence real value vector sequence W to { x ═ x₁,x₂,...,x_n-1,x_nTaking the hidden state vector sequence of the sentence as the input of Bi-LSTMs

Constructing a graph convolution network based on sentence grammar dependency structure, and then combining H^LInputting into GCNs to obtain the convolution vector sequence of sentences

Finally, the sequence H is calculated by using attention^TThe influence weight of each element in the sentence on the candidate trigger word is obtained, and the coding sequence of the sentence is obtained

At the same time, C is mixed^TAs a sequence of word features in a common space.

1-4, regarding each word in the sentence as a candidate trigger, regarding the jth (j is less than or equal to n) candidate trigger, and then corresponding feature vector thereof

Input text event classifier:

wherein, W^TAnd b^TSoftmax as a text event classifier_TThe weight matrix and the bias term of (c),

represents the jth candidate trigger word w in the sentence S_jProbability distribution of event type of trigger, and type_w,jDenotes w_jThe type of event triggered. Meanwhile, the loss function of the text event classifier is defined as:

wherein T is the number of sentences marked in the KBP2017 English data set,

as a word w_jAnnotated event type, S_iRepresenting the ith sentence in the data set, with a sentence length of n.

Further, step 2 is specifically implemented as follows:

2-1. a picture event classifier is trained on imSitu picture data sets, where a total of 504 verbs are defined to record the actions described by the picture, and 11538 entity types describe the entities that appear in the picture. First using VGG16_vTo extract the action characteristics in the picture and utilize the multi-layer perceptron MLP_vConverting verb features into verb vectors

At the same time, another VGG16 is utilized_oExtracting an entity set O ═ { O ] in a picture₁,o₂,...,o_m-1,o_mIs then passed through a multi-layer perceptron MLP_oConverting all entities into their corresponding noun vector sequences

Each picture is then represented by a mesh, which is built according to the actions and entities it describes. The action described by the picture is used as a central node of the mesh structure, and the entity is connected with the action node. And then, coding a word vector sequence corresponding to the picture characteristics by adopting a graph convolution network, so that the vector after the convolution calculation of the action nodes stores entity characteristic information. Wherein the coded picture entity characteristic vector sequence is

Wherein the content of the first and second substances,

convolution vector for representing picture action node (for convenient calculation, the invention uses picture as picture motion nodeAction as a pictorial entity), likewise, H^IAnd the characteristic representation sequence of the action and the entity set of the view slice in the common space.

2-2, convolving the motion vector in the picture I

As an input of the picture event classifier, obtaining a probability distribution of picture description event types as follows:

wherein, W^IAnd b^ISoftmax as a picture event classifier_IWeight matrix and bias term of P (y)^II) represents the probability distribution of event type triggered by Picture I, while type_IIndicating the type of event described in picture I. Meanwhile, the loss function of the picture event classifier is defined as:

wherein N represents the number of the event samples marked on the pictures in imSitu, y^IAs picture I_iAnnotated event type, I_iIndicating the ith picture sample in the picture data set.

Further, step 3 is specifically implemented as follows:

3-1, the picture sentence matching module is used for finding out pictures and sentences with highest semantic similarity from the multi-modal document containing a plurality of pictures and sentences. Firstly, calculating the association weight between each pair of photo entities and words by using a cross-modal attention mechanism, and learning word-based photo entity feature representation and word feature representation based on the photo entities. More specifically, from each word, the CMAM is able to locate significantAnd assigning weights to the photo entities, and acquiring feature representation of words in the photo modality by aggregating visual features related to the words through weighted average. On the other hand, for each entity in the picture, related words are firstly searched in the sentences to be matched, weights are assigned to the words, and semantic information related to the photo entities is captured through weighted average, so that the feature representation of the photo entities in the text mode is obtained. Giving out entity characteristic vector sequence corresponding to picture I

And word feature vector sequence of sentence S

A cross-modality attention mechanism is first utilized to obtain a characterization of word and pictorial entities in other modalities.

3-2, in order to obtain the word-based picture entity feature representation, firstly, calculating the association degree Score of the ith entity and the jth word in the picture by using a cross-modal attention mechanism_ij：

Wherein the content of the first and second substances,

feature vector representing ith entity in picture

Characteristic energy of j-th word

Has a cosine similarity of [0,1 ] in the value range]. Then according to Score_ijCalculating the influence weight of the ith picture entity on the jth wordHeavy A_ijComprises the following steps:

finally, aggregating the picture entity feature representation based on the jth word in a weighted average mode

Therefore, the invention uses

And representing the characteristic representation sequence of the whole sentence in the picture mode.

3-3, in order to obtain the word feature representation based on the picture entity, adopting and obtaining the vector

In the same calculation process, for the ith entity in the picture, according to the relevance of the jth word and the entity of the current picture, the attention weight is distributed to the jth word:

then, word feature representation based on the ith entity of the picture is captured by weighted averaging:

similarly, the representation of all entities in the picture in the text modality is:

3-4, in order to obtain semantic similarity between the picture and the sentence, adopting a weak consistency alignment mode to define the similarity between the picture and the sentence as the Euclidean distance between all entities in the picture and the characteristic representation sequence thereof in the text mode, and the sum of the Euclidean distance between each sentence and the characteristic representation sequence thereof in the picture mode.

First, the euclidean distance of each sentence from its sequence of feature representations in the picture modality is calculated:

then, the euclidean distances between all entities in the picture and their feature representation sequences in the text modality are:

thus, the semantic similarity between picture I and sentence S is defined as < I, S > ═ D^T←I+D^I←T. And finally, in order to obtain the picture sentence pair with the highest semantic similarity < I, S >, optimizing a picture sentence matching module by using the triplet loss. For each pair of correctly matched pictures and sentences, the invention additionally extracts a picture I which is not matched with the sentence S^-And a sentence S not matching with the picture I^-Form two negative pairs < I, S^-> and < I^-S >. Finally, the loss function of the picture sentence matching module is defined as:

L^m＝max(0,1+＜I,S＞-＜I,S-＞)+max(0,1+＜I,S＞-＜I^-,S＞) (15)

further, step 4 is specifically implemented as follows:

4-1. in order to obtain event classifiers sharing weight and bias term, the invention takes the feature representation of words and picture actions in a common space as the input of the text and picture event classifiers respectively, and finally, the target function L is minimized to be L^T+L^I+L^mTo the modelAnd (5) performing joint optimization. Let the text event classifier Softmax_TAnd picture event classifier Softmax_IThe weight matrix and bias terms can be shared. Thus, in the testing phase, the shared event classifier is used to predict the event types described by the picture and sentence simultaneously.

Further, step 5 is specifically implemented as follows:

5-1. with M²E²The multi-modal annotation data tests the trained model and tests the model for k sentences S₁,S₂...,S_k-1,S_kAnd l pictures I₁,I₂...,I_l-1,I_lThe article firstly utilizes a picture sentence matching module to find out the picture sentence pair with the highest semantic similarity < I, S >, and simultaneously obtains a word feature representation sequence H based on a picture entity^I←TAnd word-based photo entity feature representation sequence H^T←I。

5-2. in feature fusion, for word w_jThe invention considers

And

for trigger word w_jContribute different degrees of feature information. Therefore, the present invention utilizes a gated attention mechanism to assign weights to different feature information,

weight of (2)

The calculation method is as follows:

wherein the content of the first and second substances,

representing the jth word feature vector

With its feature representation in the picture modality

Has a cosine similarity of [ -1,1 ] in the range]. Then, the weighted average is used to fuse with w_jRelated picture characteristic information, obtaining w_jCorresponding multi-modal feature representation vectors

Wherein the content of the first and second substances,

the result of (A) is usually a value between 0 and 1, controlling

For fused multi-modal features

The degree of influence of (c). When in use

Smaller, fused features preserve more textual information, while

When larger, illustrate the picture feature to the word w_jMore information is contributed in the event classification process.

Finally, theCandidate trigger word w_jCorresponding multimodal features

Inputting a shared event classifier to obtain a word w_jEvent type of trigger

5-3. also for picture I, the influence of the word features on the picture event classification is controlled using another gated attention. Firstly, a gating attention machine is utilized to produce original characteristics corresponding to the picture action

With its feature representation in text modality

Assigning weights

And

wherein the content of the first and second substances,

the calculation method is as follows:

then, the original characteristics of the ith picture entity are fused through weighted average

And their feature representation in text modality

Obtaining updated multi-modal feature vectors

Finally, the shared event classifier pair is utilized

Classifying to obtain event type argmax (P (y) to which the picture description action belongs^II), wherein I ═ 1.

The invention has the following beneficial effects:

aiming at the defects of the prior art, a multi-modal joint event detection method based on pictures and sentences is provided, and events are simultaneously recognized from the pictures and the sentences. However, due to the lack of sufficient multi-modal annotation data, the invention adopts a joint optimization mode, on one hand, the existing single-modal data sets (imSitu picture data set and KBP2017 English data set) are utilized to respectively learn the picture and text event classifiers, and on the other hand, the existing picture and title pair training picture sentence matching module is utilized to find out the picture and sentence with the highest semantic similarity in the multi-modal article, so as to obtain the characteristic representation of the picture entity and the word in the public space. These features help to share parameters between the picture and text event classifiers, resulting in a shared event classifier. Finally, a small amount of multi-modal annotation data (M) is utilized²E²Multimodal datasets) to test the model, using a shared event classifier to obtain the events and their types described by the pictures and sentences, respectively. The invention identifies the events from the pictures and the sentences, and utilizes the complementarity of the visual characteristics and the text characteristics, thereby not only improving the performance of single-mode event classification, but also finding more complete event information in the articles.

Drawings

FIG. 1 is a flow chart of the overall implementation of the present invention.

FIG. 2 is a block diagram of the model training phase of the present invention

Detailed Description

The attached drawings disclose a flow chart of a preferred embodiment of the invention in a non-limiting way; the technical solution of the present invention will be described in detail below with reference to the accompanying drawings.

Event detection is an important link of an event extraction task, and the link can identify picture actions and text trigger words which mark the occurrence of events and classify the picture actions and the text trigger words into predefined event types. The method is widely applied to the fields of network public opinion analysis, information collection and the like. With the diversification of carriers for transmitting network information, researchers are paying attention to event detection tasks in different fields, namely how to automatically acquire interesting events from different information carriers such as unstructured pictures and texts. Also, the same event may appear in different forms in pictures and sentences. However, the existing model only aims at the single-mode event detection based on sentences or pictures, or only considers the influence of picture characteristics on the text event detection, and ignores the influence of text context on the picture event classification. In order to solve the problems, the invention provides a multi-mode combined event detection method based on pictures and sentences.

As shown in fig. 1-2, a method for multi-modal joint event detection based on pictures and sentences comprises the following steps:

Then, the feature vector of the image entity

Input devicePicture event classifier Softmax_IObtaining the event type probability distribution described by the current picture, wherein the loss function of the picture event classifier is defined as L^I；

And step 3, the picture sentence matching module firstly calculates the association weight between each pair of picture entities and words by using a Cross-modal attention Mechanism (CMAM). According to the jth word, the CMAM can locate important picture entities and assign weights, and obtains the feature representation of the word in the picture mode by aggregating visual features related to the word through weighted average

And the feature representation of the jth word in the picture modality

And

And

further, the step 1 is specifically realized as follows:

1-2, inquiring a pre-trained glove word vector matrix to obtain a word vector w of each word in a sentence_emdThen, the part-of-speech vector table is inquired to obtain a part-of-speech vector w_posAnd querying the entity type vector table to obtain an entity type vector w_entityThe real-valued vector x of each word ═ { w ═ w_emd,w_pos,w_entityThus the sentence real valued vector sequence is denoted W ═ x₁,x₂,...,x_n-1,x_nWhere n is a sentenceThe length of the seed.

At the same time, C is mixed^TAs a sequence of word features in a common space.

Input text event classifier:

wherein T is the number of sentences marked in the KBP2017 English data set,

Further, step 2 is specifically implemented as follows:

Wherein the content of the first and second substances,

convolution vector for representing picture action node (for the convenience of calculation, the picture action is regarded as a picture entity in the invention), and similarly, H^IAnd the characteristic representation sequence of the action and the entity set of the view slice in the common space.

2-2, convolving the motion vector in the picture I

type_I＝argmax(P(y^I|I))

Further, step 3 is specifically implemented as follows:

3-1, the picture sentence matching module is used for finding out pictures and sentences with highest semantic similarity from the multi-modal document containing a plurality of pictures and sentences. First, each pair of graphs is calculated by using a cross-modal attention mechanismAnd (3) association weight values between the fragment entities and the words, and learning word-based picture entity characteristic representation and word characteristic representation based on the picture entities. More specifically, from each word, the CMAM can locate important photo entities and assign weights, and obtain a feature representation of the word in the photo modality by aggregating visual features associated with the word by weighted average. On the other hand, for each entity in the picture, related words are firstly searched in the sentences to be matched, weights are assigned to the words, and semantic information related to the photo entities is captured through weighted average, so that the feature representation of the photo entities in the text mode is obtained. Giving out entity characteristic vector sequence corresponding to picture I

And word feature vector sequence of sentence S

Wherein the content of the first and second substances,

feature vector representing ith entity in picture

Characteristic energy of j-th word

Has a cosine similarity of [0,1 ] in the value range]. Then according to Score_ijCalculating the influence weight A of the ith photo entity on the jth word_ijComprises the following steps:

Therefore, the invention uses

then, word feature representation based on ith entity of picture is captured by weighted average

L^m＝max(0,1+＜I,S＞-＜I,S^-＞)+max(0,1+＜I,S＞-＜I^-,S＞)

further, step 4 is specifically implemented as follows:

4-1, in order to obtain event classifiers sharing weight and bias items, the invention takes the characteristic representation of words and picture actions in a common space as the input of the text and picture event classifiers respectively, and finally minimizes a target functionNumber L ═ L^T+L^I+L^mAnd jointly optimizing the models. Let the text event classifier Softmax_TAnd picture event classifier Softmax_IThe weight matrix and bias terms can be shared. Thus, in the testing phase, the shared event classifier is used to predict the event types described by the picture and sentence simultaneously.

Further, step 5 is specifically implemented as follows:

5-2. in feature fusion, for word w_jIn the present invention, it is considered that c_j ^TAnd h_j ^T←IFor trigger word w_jContribute different degrees of feature information. Therefore, the present invention utilizes a gated attention mechanism to assign weights to different feature information,

weight of (2)

The calculation method is as follows:

wherein the content of the first and second substances,

representing the jth word feature vector

With its feature representation in the picture modality

Wherein the content of the first and second substances,

the result of (A) is usually a value between 0 and 1, controlling

For fused multi-modal features

The degree of influence of (c). When in use

Smaller, fused features preserve more textual information, while

Finally, the candidate trigger word w_jCorresponding multimodal features

Inputting a shared event classifier to obtain a word w_jEvent type of trigger

With its feature representation in text modality

Assigning weights

And

wherein the content of the first and second substances,

the calculation method is as follows:

And their feature representation in text modality

Obtaining updated multi-modal feature vectors

Finally, the shared event classifier pair is utilized

Claims

1. The multimodal combined event detection method based on pictures and sentences is characterized by comprising the following steps:

step 1, a text event detection module firstly encodes text features to obtain a feature vector representation sequence of words in a sentence

For the jth candidate trigger word, the feature vector of the corresponding candidate trigger word is then used

Input text event classifier Softmax_TObtaining the probability distribution of event types triggered by the jth candidate trigger word, wherein the loss function of the text event classifier is defined as L^T；

Step 2, the picture event detection module encodes the picture characteristics to obtain picture entity characteristic vector representation sequences of the description actions and the plurality of entities in the picture

Then, the feature vector of the image entity

Input picture event classifier Softmax_IObtaining the event type probability distribution of the current picture description, wherein the loss function of the picture event classifier is defined as L^I；

Step 3, the picture sentence matching module firstly calculates the association weight between each pair of picture entities and words by using a cross-modal attention mechanism CMAM;

according to the jth word, the CMAM can locate and classify important picture entitiesWeighting, and acquiring the feature representation of the words in the picture mode by weighting and averaging the picture entity features related to the words

Meanwhile, for the ith entity in the picture, related words are searched in the sentence to be matched, weight is distributed to the words, semantic information related to the photo entity is captured through weighted average, and therefore characteristic representation of the photo entity in the text mode is obtained

And then, the Euclidean distance D between each sentence to be matched and the characteristic representation sequence thereof in the picture modality^T←IEuclidean distance D from all entities in the picture and their feature representation sequences in the text modality^I←TAdding the images to obtain similarity of the images and the sentences; wherein, the loss function of the picture sentence matching module is defined as L^m；

step 5, in the testing stage, for the multi-modal article, firstly, the picture and sentence with the highest similarity are found out by using the picture and sentence matching module, and the characteristic representation of the ith picture entity in the text mode is obtained

And the feature representation of the jth word in the picture modality

And a feature representation

Distributing weight, and obtaining a multi-modal feature vector corresponding to the ith picture entity through weighted average; then, acquiring an event type described by the picture by using a shared event classifier; similarly, another gated attention mechanism is used as the feature vector of the candidate trigger word

And a feature representation

And assigning weights, acquiring the multi-modal feature representation of the jth word by weighted average, and then acquiring the event type triggered by the jth word by using a shared event classifier.

2. Step 1 of the picture and sentence based multimodal combined event detection method according to claim 1 is implemented as follows:

1-1, training a text event classifier Softmax on KBP2017 English data set_TFirstly, preprocessing the labeled data to obtain an entity type, an event trigger word and an event type corresponding to the event trigger word; the method comprises 5 entity types and 18 event types; then, carrying out sentence segmentation and word segmentation on the original text by using Stanford CoreNLP to obtain the part of speech and the grammar dependency structure of the sentence; respectively creating a part-of-speech vector table and an entity type vector table, wherein each vector table has an initialization vector corresponding to a type 'null';

1-2, inquiring a pre-trained glove word vector matrix to obtain a word vector w of each word in a sentence_emdThen, the part-of-speech vector table is inquired to obtain a part-of-speech vector w_posAnd querying the entity type vector table to obtain an entity type vector w_entityThe real-valued vector x of each word ═ { w ═ w_emd,w_pos,w_entityThus the sentence real valued vector sequence is denoted W ═ x₁,x₂,...,x_n-1,x_nWhere n is the length of the sentence;

1-3. will sentenceSub-real-valued vector sequence W ═ x₁,x₂,...,x_n-1,x_nTaking the hidden state vector sequence of the sentence as the input of Bi-LSTMs

At the same time, C is mixed^TA feature representation sequence in a common space as a word sequence;

1-4, regarding each word in the sentence as a candidate trigger, regarding the j (j is less than or equal to n) first candidate trigger, and then using the corresponding feature vector thereof

Input text event classifier:

represents the jth candidate trigger word w in the sentence S_jThe probability distribution of the event type of the trigger,and type_w,jDenotes w_jThe type of event triggered; meanwhile, the loss function of the text event classifier is defined as:

wherein T is the number of sentences marked in the KBP2017 English data set,

3. Step 2 of the picture and sentence based multimodal combined event detection method according to claim 2 is implemented as follows:

2-1, training a picture event classifier on an imSitu picture dataset, wherein a total of 504 verbs are defined to record actions described by pictures, and 11538 entity types describe entities appearing in the pictures; first using VGG16_vTo extract the action characteristics in the picture and utilize the multi-layer perceptron MLP_vConverting verb features into verb vectors

Then, representing each picture by using a mesh structure, and constructing the mesh structure according to the described actions and entities; the action described by the picture is used as a central node of the mesh structure, and the entity is connected with the action node; then, the word vector corresponding to the picture features is subjected to graph convolution networkCoding the sequence so that the vector after the convolution calculation of the action node stores entity characteristic information; wherein the coded picture entity characteristic vector sequence is

Wherein the content of the first and second substances,

a convolution vector to represent a picture action node; likewise, H^IA characteristic representation sequence of the action of the view slice and the entity set in a common space;

2-2, convolving the motion vector in the picture I

type_I＝argmax(P(y^I|I))

wherein, W^IAnd b^ISoftmax as a picture event classifier_IWeight matrix and bias term of P (y)^II) representing Picture I_iProbability distribution of event type of trigger, and type_IRepresenting the event type described in the picture I; meanwhile, the loss function of the picture event classifier is defined as:

4. Step 3 of the picture and sentence based multimodal combined event detection method according to claim 3 is implemented as follows:

3-1, giving the entity characteristic vector sequence corresponding to the picture I

And word feature vector sequence of sentence S

Firstly, acquiring feature representations of word and picture entities in other modes by using a cross-mode attention mechanism;

3-2, in order to obtain the word-based picture entity feature representation, firstly, calculating the association degree Score of the ith entity in the picture and the jth word in the sentence by using a cross-mode attention mechanism_ij：

Wherein the content of the first and second substances,

feature vector representing ith entity in picture

Feature vector of j-th word in sentence

Has a cosine similarity of [0,1 ] in the value range](ii) a Then according to Score_ijCalculating the influence weight A of the ith photo entity on the jth word_ijComprises the following steps:

By using

A characteristic representation sequence representing the whole sentence in the picture mode;

Also, the representation of all entities in the picture in the text modality is:

3-4, defining the similarity of the picture and the sentences as the sum of Euclidean distances between all entities in the picture and the characteristic representation sequence thereof in the text mode and the Euclidean distance between each sentence and the characteristic representation sequence thereof in the picture mode by adopting a weak consistency alignment mode;

and then calculating Euclidean distances of all entities in the picture and the characteristic representation sequence of the entities in the text modality as follows:

thus, the semantic similarity between picture I and sentence S is defined as < I, S > ═ D^T←I+D^I←T；

In order to obtain the picture sentence pair with the semantic similarity of less than I and S greater than the highest, a trippletloss is used for optimizing a picture sentence matching module; for each pair of correctly matched pictures and sentences, additionally extracting a picture I which is not matched with the sentence S^-And a sentence S not matching with the picture I^-Form two negative pairs < I, S^-> and < I^-,S＞；

Finally, the loss function of the picture sentence matching module is defined as:

L^m＝max(0,(1+＜I,S＞-＜I,S^-＞))+max(0,(1+＜I,S＞-＜I^-,S＞))。

5. step 4 of the multi-modal combined event detection method based on pictures and sentences according to claim 4 is implemented as follows:

4-1. in order to obtain event classifiers sharing weight and bias term, respectively using the feature representation of words and picture actions in a common space as the input of the text and picture event classifiers, and finally, minimizing an objective function L ═ L^T+L^I+L^mPerforming combined optimization on the models; enabling text event classifiersSoftmax_TAnd picture event classifier Softmax_IThe weight matrix and bias terms can be shared; thus, in the testing phase, the shared event classifier is used to predict the event types described by the picture and sentence simultaneously.

6. Step 5 of the multi-modal combined event detection method based on pictures and sentences according to claim 5 is implemented as follows:

5-1. with M²E²The multi-modal annotation data tests the trained model and tests the model for k sentences S₁,S₂...,S_k-1,S_kAnd l pictures I₁,I₂...,I_l-1,I_lThe article firstly utilizes a picture sentence matching module to find out a picture sentence pair with the semantic similarity of less than I and S of more than the highest, and simultaneously obtains a word characteristic representation sequence H based on a picture entity^I←TAnd word-based photo entity feature representation sequence H^T←I；

5-2, in feature fusion, for candidate trigger word w_jIt is considered that

And

for candidate trigger word w_jThe event type prediction of (2) contributes different degrees of feature information; the different feature information is therefore assigned weights using a gated attention mechanism,

weight of (2)

The calculation method is as follows:

wherein the content of the first and second substances,

feature vector representing jth candidate trigger

With its feature representation in the picture modality

Has a cosine similarity of [ -1,1 ] in the range](ii) a Then, the weighted average is used to fuse with w_jRelated picture characteristic information, obtaining w_jCorresponding multi-modal feature representation vectors

Wherein the content of the first and second substances,

the result of (A) is usually a value between 0 and 1, controlling

For fused multi-modal features

The degree of influence of (c); when in use

When smaller, the fused features preserve more textual informationTo do so

When larger, illustrate the picture feature to the word w_jContributing more information in the event classification process;

finally, the candidate trigger word w_jCorresponding multimodal features

Inputting a shared event classifier to obtain a word w_jEvent type of trigger

5-3, similarly, for the picture I, controlling the influence of the word characteristics on the picture event classification by using another gating attention; firstly, a gating attention machine is utilized to produce original characteristics corresponding to the picture action