CN111428504B - Event extraction method and device - Google Patents

Event extraction method and device

Info

Publication number
CN111428504B
CN111428504B
Authority
CN
China
Prior art keywords
span
semantic representation
token
types
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010187298.7A
Other languages
Chinese (zh)
Other versions
CN111428504A (en)
Inventor
徐猛
付骁弈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN202010187298.7A
Publication of CN111428504A
Application granted
Publication of CN111428504B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an event extraction method and device. The method comprises: obtaining a vectorized semantic representation W1 of a sentence; performing trigger-word recognition on the tokens set in the vectorized semantic representation W1, and performing entity recognition on the span semantic representations obtained by span division of the vectorized semantic representation W1; and combining every token and span pairwise and marking whether each combined (token, span) is a (trigger word, argument) pair. With this scheme, more useful information can be obtained, giving the method stronger practical application value; the operation is simple, and the error accumulation caused by using natural language processing tools is avoided; and the span-division approach resolves the problems of sequence labeling, with higher efficiency and wider applicability.

Description

Event extraction method and device
Technical Field
The present disclosure relates to event data processing technologies, and in particular to an event extraction method and device.
Background
A large amount of news data describing events is generated on the Internet every day. However, because events are so varied, the event type and the information about each element of an event, such as its time, place and participants, cannot be distinguished quickly and accurately.
Distinguishing public events, or events occurring in a specific industry, and identifying their subjects makes it possible to track the development of events and of the whole industry in real time, support high-level decision making, and reduce risk. This has significant practical application value and research significance.
Existing recognition methods include: [1] models based on graph neural networks; [2] models based on deep learning, attention mechanisms, sequence labeling, and so on.
The existing methods have the following shortcomings:
1. Existing methods only perform event type detection, i.e. recognition of event trigger words, but do not extract the event subject (or entity); the task is narrow and of limited practical value. Some methods do perform trigger-word recognition and argument recognition, but they rely on manually pre-labeled entities, which are not available in practical applications.
2. Most existing event detection methods rely on off-the-shelf natural language processing tools, yet in practical applications the data cannot be preprocessed by these tools. Existing methods mainly use specific natural language processing tools such as Jieba, LTP and StanfordNLP to first segment sentences and build dependency trees, and then feed these features into a model. The drawbacks are: first, the processing is cumbersome; second, these tools introduce errors of their own, so error accumulation occurs in the subsequent modeling and analysis.
3. Models based on sequence labeling can hardly handle overlapping event subjects. For example, "Beijing Court" is an event subject (an organization), while "Beijing" on its own is also a subject/entity (a place name).
Disclosure of Invention
The event extraction method and device provided by the present application can obtain more useful information and thus have stronger practical application value; data processing and modeling are simple to operate, avoiding the error accumulation caused by natural language processing tools; and the span-division approach resolves the problems of sequence labeling, with higher efficiency and wider applicability.
The present application provides an event extraction method, which may include the following steps:
obtaining a vectorized semantic representation W1 of a sentence;
performing trigger-word recognition on the tokens set in the vectorized semantic representation W1, and performing entity recognition on the corresponding span semantic representations obtained by span division of the vectorized semantic representation W1;
combining every token and span pairwise, and marking whether each combined (token, span) is a (trigger word, argument) pair.
In an exemplary embodiment of the present application, obtaining the vectorized semantic representation W1 of the sentence may include: obtaining the vectorized semantic representation W1 of the sentence through a bidirectional LSTM network model or a BERT model.
In an exemplary embodiment of the present application, before obtaining the vectorized semantic representation W1 of the sentence through the bidirectional LSTM network, the method may further include: randomly initializing each of the a characters into a b-dimensional vector, yielding an embedding matrix D of dimension [a, b], where each index id from 0 to a-1 corresponds to a different character; for a sentence of length S, each character in the sentence can be looked up by its id in D, so that a vector of dimension [S, b] is obtained;
obtaining the vectorized semantic representation W1 of the sentence through the bidirectional LSTM network may include: inputting the vector of dimension [S, b] into a preset bidirectional LSTM neural network, and taking the output of the bidirectional LSTM neural network as the vectorized semantic representation W1 of the sentence;
wherein the dimension of the vectorized semantic representation W1 is [S, D1], and D1 is 2 times the number of LSTM hidden-layer nodes.
In an exemplary embodiment of the present application, obtaining the vectorized semantic representation W1 of the sentence through the BERT model may include: inputting the sentence directly into the BERT model, and taking the output of the BERT model as the vectorized semantic representation W1 of the sentence;
wherein the dimension of the vectorized semantic representation W1 is [S, D1], with D1 = 768.
In an exemplary embodiment of the present application, the method may further include:
dividing trigger word types into x classes, entity types into y classes, and event argument types into z classes, and treating any type other than the trigger word types, the entity types and the event argument types as an additional "other" class, wherein x, y and z are positive integers;
before obtaining the vectorized semantic representation W1 of the sentence, performing any one or more of the following operations:
setting one or more tokens in the sentence, where each token is used to mark whether the current word is a trigger word, and each marked token represents one of the x types;
performing span division on the sentence according to the set span width to divide the sentence into a plurality of spans, and marking each span to indicate whether the current span is an entity, where each mark represents one of the y types;
combining each labeled token and span pairwise, and marking whether each combined (token, span) is a (trigger word, argument) pair.
In an exemplary embodiment of the present application, performing trigger-word recognition on the tokens set in the vectorized semantic representation W1 may include:
classifying each token through a two-layer fully connected neural network and a softmax layer to obtain a vector W2 of dimension [S, x+1], where the vector W2 represents the probability that each token belongs to each trigger-word type.
In an exemplary embodiment of the present application, performing entity recognition on the corresponding span semantic representations obtained by span division of the vectorized semantic representation W1 may include:
performing span division on the vectorized semantic representation W1 to obtain a plurality of semantic segments, and average-pooling the plurality of semantic segments to obtain a representation W3 of each span;
taking the representation W3 of each span as input, classifying each span using a two-layer fully connected neural network and a softmax layer, and outputting a vector W4 of dimension [N, y+1], where the vector W4 represents the probability that each span belongs to each entity type.
In an exemplary embodiment of the present application, performing span division on the vectorized semantic representation W1 to obtain a plurality of semantic segments and average-pooling the plurality of semantic segments to obtain the representation W3 of each span may include:
obtaining the set maximum span width max_span_width; selecting segments of the vectorized semantic representation W1 with span widths from 1 to max_span_width in turn, to obtain the semantic representations span_embedding of N spans;
average-pooling the semantic representations span_embedding of the N spans to obtain the representation W3 of each span.
In an exemplary embodiment of the present application, combining every token and span pairwise and marking whether each combined (token, span) is a (trigger word, argument) pair may include:
copying and transforming the vectorized semantic representation W1 and the representation W3 of each span so that every token and span are concatenated pairwise, obtaining a vector W5 of dimension [S, N, 2*D1];
taking the vector W5 as input, classifying it through a two-layer fully connected neural network and a softmax layer, and outputting a vector W6 of dimension [S, N, z+1], where the vector W6 represents the probability that each combination belongs to each event-argument type.
The present application also provides an event extraction device, which may include a processor and a computer-readable storage medium storing instructions that, when executed by the processor, implement the event extraction method described above.
Compared with the related art, an embodiment of the present application may include: obtaining a vectorized semantic representation W1 of a sentence; performing trigger-word recognition on the tokens set in the vectorized semantic representation W1, and performing entity recognition on the corresponding span semantic representations obtained by span division of the vectorized semantic representation W1; and combining every token and span pairwise and marking whether each combined (token, span) is a (trigger word, argument) pair. This embodiment can extract the trigger words, arguments and entities of an event at the same time, obtaining more useful information and thus having strong practical application value. No existing natural language processing tool is used during data processing and modeling, so the operation is simple, the error accumulation caused by such tools is avoided, and the method is better suited to real application scenarios. The span-division approach resolves the problems of sequence labeling, with higher efficiency and wider applicability.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. Other advantages of the present application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The accompanying drawings are provided to aid understanding of the technical solution of the present application and constitute a part of this specification; together with the embodiments of the present application, they serve to explain the technical solution and do not limit it.
FIG. 1 is a flow chart of an event extraction method according to an embodiment of the present application;
FIG. 2 is a block diagram of an event extraction device according to an embodiment of the present application.
Detailed Description
The present application describes a number of embodiments, but the description is illustrative and not limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or in place of any other feature or element of any other embodiment unless specifically limited.
The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements of the present disclosure may also be combined with any conventional features or elements to form a unique inventive arrangement as defined in the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive arrangements to form another unique inventive arrangement as defined in the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not to be restricted except in light of the attached claims and their equivalents. Further, various modifications and changes may be made within the scope of the appended claims.
Furthermore, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other sequences of steps are possible as will be appreciated by those of ordinary skill in the art. Accordingly, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Furthermore, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.
In the exemplary embodiments of the present application, before the embodiments are described, terms involved in the embodiments are first explained:
1. event type and definition:
Event types are the categories to which different events belong; in the financial field, for example, event types include "change of actual controller", "credit default", "financial fraud", and the like. Event types are usually defined by experts or experienced practitioners in the field.
2. Trigger words:
Trigger words are words that clearly indicate the event type, such as "kill", "injured" and "fire".
3. Argument:
An argument is an element involved in an event, such as the time, place or participants of the event.
4. Event subject and definition:
The event subject is the primary party to which an event happens and the party most closely related to the event; it is defined as an entity. For example, in "the actual controller of XX Technology changed" and "YY Group and ZZ Group went through bankruptcy and reorganization", "XX Technology" is the subject of the event and its type is organization. Event subjects may be defined as entity types such as person names, place names, organization names, time, and so on.
5. Event extraction:
event extraction involves three tasks, namely subject (entity) recognition, trigger word recognition and argument recognition.
6. Span:
A span can be regarded as a region with a certain width, i.e. a segment of text selected with a fixed length. For example, for the sentence "I ate bread and drank milk today", a span width of 2 selects every two adjacent characters, yielding segments such as "我今" ("I to-"), "今天" ("today"), "天吃" ("-day ate"), and so on.
7. Span division:
Span division means dividing the text into spans from small widths to large, up to the set maximum span width. For example, if the maximum span width is 8, spans of widths 1 through 8 are each enumerated, and a number of spans are obtained.
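As an illustration only (not part of the claimed method), the division could be enumerated as in the following Python sketch; the function name and defaults are assumptions:

    def enumerate_spans(sentence, max_span_width=8):
        """Return (start, end) pairs for every span of width 1..max_span_width (end exclusive)."""
        spans = []
        for width in range(1, max_span_width + 1):
            for start in range(len(sentence) - width + 1):
                spans.append((start, start + width))
        return spans

    # A 10-character sentence with max_span_width = 8 yields 10 + 9 + ... + 3 = 52 candidate spans.
    print(len(enumerate_spans("0123456789")))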
8. Span classification:
Span classification means determining the type, i.e. the label, of a piece of data by a model or a specific method; in a classification task, each piece of data generally belongs to only one category.
The present application provides an event extraction method. As shown in FIG. 1, the method may include steps S101 to S103:
S101: obtain the vectorized semantic representation W1 of the sentence.
In an exemplary embodiment of the present application, the data from which events are to be extracted may first be preprocessed before the vectorized semantic representation W1 of the sentence is obtained. The preprocessing may include, but is not limited to: trigger-word labeling, entity labeling and argument labeling.
In an exemplary embodiment of the present application, the method may further include:
dividing trigger word types into x classes, entity types into y classes, and event argument types into z classes, and treating any type other than the trigger word types, the entity types and the event argument types as an additional "other" class, wherein x, y and z are positive integers;
before obtaining the vectorized semantic representation W1 of the sentence, performing any one or more of the following operations:
setting one or more tokens in the sentence, where each token is used to mark whether the current word is a trigger word, and each marked token represents one of the x types;
performing span division on the sentence according to the set span width to divide the sentence into a plurality of spans, and marking each span to indicate whether the current span is an entity, where each mark represents one of the y types;
combining each labeled token and span pairwise, and marking whether each combined (token, span) is a (trigger word, argument) pair.
In an exemplary embodiment of the present application, assume that the number of trigger-word types (which may be called event types) is n_event = 10, i.e. x = 10; the number of event-subject types (which may be called entity types) is n_entity = 20, i.e. y = 20; and the number of event-argument types is n_argument = 15, i.e. z = 15.
In an exemplary embodiment of the present application, trigger-word labeling may include: first, marking each token in the sentence with whether it is a trigger word (if so, with its trigger-word type, otherwise with "other"). The sentence (or each sentence of a document) is then span-divided; taking a single sentence as an example, if the maximum span width max_span_width = 8 is set, a number of spans are obtained.
In an exemplary embodiment of the present application, entity labeling may include: marking each span, i.e. determining whether each span is an entity; if it is, it is marked with its entity class, otherwise with "other".
Argument labeling may include: combining every token with every span pairwise and marking whether the pair is a (trigger word, argument) pair; a correct pairing is marked with its argument type, otherwise with "other".
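As a hedged illustration of how these three kinds of labels could be laid out (the function and argument names are hypothetical, and 0 is assumed to be the index of the shared "other" class):

    OTHER = 0  # assumed index of the shared "other" class

    def build_labels(sentence, spans, trigger_anns, entity_anns, argument_anns):
        """trigger_anns: {token_index: trigger_type_id}; entity_anns: {(start, end): entity_type_id};
        argument_anns: {(token_index, (start, end)): argument_type_id}; type ids start at 1."""
        token_labels = [trigger_anns.get(i, OTHER) for i in range(len(sentence))]
        span_labels = [entity_anns.get(span, OTHER) for span in spans]
        pair_labels = [[argument_anns.get((i, span), OTHER) for span in spans]
                       for i in range(len(sentence))]
        return token_labels, span_labels, pair_labels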
In an exemplary embodiment of the present application, since a computer cannot process Chinese directly, each character in the sentence is mapped to numbers; that is, the vectorized semantic representation W1 of the sentence is obtained.
In an exemplary embodiment of the present application, obtaining the vectorized semantic representation W1 of the sentence may include: obtaining the vectorized semantic representation W1 of the sentence through a bidirectional LSTM network model or a BERT model.
In an exemplary embodiment of the present application, before obtaining the vectorized semantic representation W1 of the sentence through the bidirectional LSTM network, the method may further include: randomly initializing each of the a characters into a b-dimensional vector, yielding an embedding matrix D of dimension [a, b], where each index id from 0 to a-1 corresponds to a different character; for a sentence of length S, each character in the sentence can be looked up by its id in D, so that a vector of dimension [S, b] is obtained;
obtaining the vectorized semantic representation W1 of the sentence through the bidirectional LSTM network may include: inputting the vector of dimension [S, b] into a preset bidirectional LSTM neural network, and taking the output of the bidirectional LSTM neural network as the vectorized semantic representation W1 of the sentence;
wherein the dimension of the vectorized semantic representation W1 is [S, D1], and D1 is 2 times the number of LSTM hidden-layer nodes.
In an exemplary embodiment of the present application, suppose the corpus contains 20000 different characters (Chinese characters and/or words; other common symbols may also be included). Each character may be randomly initialized to a 300-dimensional vector, giving an embedding matrix D of dimension [20000, 300], where each index id from 0 to 19999 corresponds to a different character. For each character of a sentence of length S, the corresponding id is looked up in D to obtain its vector, so a vector of dimension [S, 300] is obtained. The semantic representation vector W1 of the sentence is then derived using a bidirectional LSTM neural network.
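A minimal PyTorch sketch of this BiLSTM variant, using the illustrative sizes above (20000 characters, 300-dimensional embeddings); the hidden size of 256 and all names are assumptions rather than the patent's implementation:

    import torch
    import torch.nn as nn

    class CharBiLSTMEncoder(nn.Module):
        def __init__(self, vocab_size=20000, emb_dim=300, hidden_size=256):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)  # randomly initialised matrix D, dimension [a, b]
            self.bilstm = nn.LSTM(emb_dim, hidden_size, batch_first=True, bidirectional=True)

        def forward(self, char_ids):              # char_ids: [batch, S]
            embedded = self.embedding(char_ids)   # [batch, S, 300]
            w1, _ = self.bilstm(embedded)         # [batch, S, 2 * hidden_size] = [batch, S, D1]
            return w1

    encoder = CharBiLSTMEncoder()
    w1 = encoder(torch.randint(0, 20000, (1, 30)))  # one sentence of length S = 30
    print(w1.shape)                                 # torch.Size([1, 30, 512]), i.e. D1 = 512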
In an exemplary embodiment of the present application, obtaining the vectorized semantic representation W1 of the sentence through the BERT model may include: inputting the sentence directly into the BERT model, and taking the output of the BERT model as the vectorized semantic representation W1 of the sentence;
wherein the dimension of the vectorized semantic representation W1 is [S, D1], with D1 = 768.
In an exemplary embodiment of the present application, when the BERT model is used, the sentence may be input directly into the BERT model and the output of the BERT model taken as the vectorized semantic representation W1 of the sentence.
In an exemplary embodiment of the present application, let the semantic representation obtained by either of the above two methods be denoted W1; its dimension is [S, D1], where S is the sentence length. If W1 is obtained with the bidirectional LSTM network, D1 is 2 times the number of LSTM hidden nodes; if W1 is obtained with the BERT model, D1 = 768.
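For the BERT variant, a hedged sketch using the Hugging Face transformers library might look as follows; the "bert-base-chinese" checkpoint is an assumption (the patent does not name one), and the tokenizer adds [CLS] and [SEP] tokens to the length S:

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese")

    inputs = tokenizer("我今天吃面包", return_tensors="pt")   # an arbitrary example sentence
    with torch.no_grad():
        w1 = bert(**inputs).last_hidden_state               # [1, S, 768], i.e. D1 = 768
    print(w1.shape)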
S102: perform trigger-word recognition on the tokens set in the vectorized semantic representation W1, and perform entity recognition on the corresponding span semantic representations obtained by span division of the vectorized semantic representation W1.
In an exemplary embodiment of the present application, performing trigger-word recognition on the tokens set in the vectorized semantic representation W1 may include:
classifying each token through a two-layer fully connected neural network and a softmax layer to obtain a vector W2 of dimension [S, x+1], where the vector W2 represents the probability that each token belongs to each trigger-word type.
In an exemplary embodiment of the present application, on the basis of the semantic representation W1 obtained in step S101, each token is classified by a two-layer fully connected neural network and a softmax layer; the input of the network is W1 and the output is a vector W2 of dimension [S, n_event + 1], where W2 is the probability that each token belongs to each event (trigger-word) type.
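A minimal sketch of this trigger-word branch (layer sizes and names are assumptions; D1 = 512 matches the BiLSTM sketch above):

    import torch
    import torch.nn as nn

    class TriggerClassifier(nn.Module):
        def __init__(self, d1=512, hidden=256, n_event=10):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(d1, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_event + 1),   # n_event trigger-word types plus "other"
            )

        def forward(self, w1):                          # w1: [S, D1]
            return torch.softmax(self.ffn(w1), dim=-1)  # W2: [S, n_event + 1]

    w2 = TriggerClassifier()(torch.randn(30, 512))
    print(w2.shape)   # torch.Size([30, 11])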
In an exemplary embodiment of the present application, performing entity recognition on the corresponding span semantic representations obtained by span division of the vectorized semantic representation W1 may include:
performing span division on the vectorized semantic representation W1 to obtain a plurality of semantic segments, and average-pooling the plurality of semantic segments to obtain a representation W3 of each span;
taking the representation W3 of each span as input, classifying each span using a two-layer fully connected neural network and a softmax layer, and outputting a vector W4 of dimension [N, y+1], where the vector W4 represents the probability that each span belongs to each entity type.
In an exemplary embodiment of the present application, performing span division on the vectorized semantic representation W1 to obtain a plurality of semantic segments and average-pooling the plurality of semantic segments to obtain the representation W3 of each span may include:
obtaining the set maximum span width max_span_width; selecting segments of the vectorized semantic representation W1 with span widths from 1 to max_span_width in turn, to obtain the semantic representations span_embedding of N spans;
average-pooling the semantic representations span_embedding of the N spans to obtain the representation W3 of each span.
In an exemplary embodiment of the present application, the semantic representation W1 obtained in step S101 may be divided according to the set maximum span width max_span_width = 8. The division may proceed as follows: segments with span widths from 1 to max_span_width are selected in turn over the vector W1, giving the semantic representations of N spans, i.e. span_embedding.
In an exemplary embodiment of the present application, since the widths of the spans differ (a span_embedding has dimension [sw, D1], where sw ranges from 1 to max_span_width), the semantic representations of the N spans may be average-pooled, giving a span representation W3 of dimension [N, D1].
In an exemplary embodiment of the present application, after the span representation W3 is obtained, the spans may be classified using a two-layer fully connected neural network and a softmax layer to determine whether each span belongs to some entity class. The input of the network is W3 and the output is a vector W4 of dimension [N, n_entity + 1], where W4 is the probability that each span belongs to each entity class.
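The entity branch could be sketched, under the same assumed sizes, as follows (an illustration only, not the patent's implementation):

    import torch
    import torch.nn as nn

    def pool_spans(w1, spans):
        """w1: [S, D1]; spans: list of (start, end) pairs, end exclusive. Returns W3: [N, D1]."""
        return torch.stack([w1[start:end].mean(dim=0) for start, end in spans])

    class EntityClassifier(nn.Module):
        def __init__(self, d1=512, hidden=256, n_entity=20):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(d1, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_entity + 1),  # n_entity entity types plus "other"
            )

        def forward(self, w3):                          # w3: [N, D1]
            return torch.softmax(self.ffn(w3), dim=-1)  # W4: [N, n_entity + 1]

    w1 = torch.randn(30, 512)                                             # one sentence, S = 30
    spans = [(s, s + w) for w in range(1, 9) for s in range(30 - w + 1)]  # widths 1..8, N = 212
    w4 = EntityClassifier()(pool_spans(w1, spans))
    print(w4.shape)   # torch.Size([212, 21])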
S103: combine every token and span pairwise, and mark whether each combined (token, span) is a (trigger word, argument) pair.
In an exemplary embodiment of the present application, combining every token and span pairwise and marking whether each combined (token, span) is a (trigger word, argument) pair may include:
copying and transforming the vectorized semantic representation W1 and the representation W3 of each span so that every token and span are concatenated pairwise, obtaining a vector W5 of dimension [S, N, 2*D1];
taking the vector W5 as input, classifying it through a two-layer fully connected neural network and a softmax layer, and outputting a vector W6 of dimension [S, N, z+1], where the vector W6 represents the probability that each combination belongs to each event-argument type.
In an exemplary embodiment of the present application, because an argument is always attached to an event, every argument must belong to some event. Therefore, each token in the sentence is paired with each span, i.e. combined as (token, span), to determine whether the combination is a correct (trigger word, argument) pair. This may include: taking the obtained sentence representation W1 (dimension [S, D1]) and all span representations W3 (dimension [N, D1]) and concatenating them pairwise, i.e. by copying and reshaping, to obtain a vector W5 of dimension [S, N, 2*D1]; and then classifying W5 through a two-layer fully connected neural network and a softmax layer to determine which event argument each combination is. The input of the network is W5 and the output is a vector W6 of dimension [S, N, n_argument + 1], where W6 is the probability that each combination belongs to each event-argument type.
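A minimal sketch of this argument branch: every token representation is tiled against every span representation, concatenated into W5, and classified into W6 (all names and sizes are assumptions):

    import torch
    import torch.nn as nn

    def pair_tokens_and_spans(w1, w3):
        """w1: [S, D1]; w3: [N, D1]  ->  W5: [S, N, 2 * D1]."""
        s, d1 = w1.shape
        n = w3.shape[0]
        tokens = w1.unsqueeze(1).expand(s, n, d1)  # copy each token representation N times
        spans = w3.unsqueeze(0).expand(s, n, d1)   # copy each span representation S times
        return torch.cat([tokens, spans], dim=-1)

    class ArgumentClassifier(nn.Module):
        def __init__(self, d1=512, hidden=256, n_argument=15):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(2 * d1, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_argument + 1),  # n_argument argument types plus "other"
            )

        def forward(self, w5):                          # w5: [S, N, 2 * D1]
            return torch.softmax(self.ffn(w5), dim=-1)  # W6: [S, N, n_argument + 1]

    w5 = pair_tokens_and_spans(torch.randn(30, 512), torch.randn(212, 512))
    w6 = ArgumentClassifier()(w5)
    print(w5.shape, w6.shape)   # [30, 212, 1024] and [30, 212, 16]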
In an exemplary embodiment of the present application, during the training stage, the error between the above classification results and the label data prepared earlier (trigger-word labels, entity labels and argument labels) is computed and back-propagated, and the parameter update completes the training process.
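A hedged sketch of such a training step, summing cross-entropy losses over the three branches; w2, w4 and w6 are assumed to be the softmax outputs above (in practice the pre-softmax logits would normally be passed to F.cross_entropy instead):

    import torch
    import torch.nn.functional as F

    def training_step(w2, w4, w6, token_labels, span_labels, pair_labels, optimizer):
        """w2: [S, x+1]; w4: [N, y+1]; w6: [S, N, z+1]; labels are integer class ids, 0 = "other"."""
        loss = (F.nll_loss(w2.log(), token_labels)                  # trigger-word branch
                + F.nll_loss(w4.log(), span_labels)                 # entity branch
                + F.nll_loss(w6.log().view(-1, w6.size(-1)),        # argument branch
                             pair_labels.view(-1)))
        optimizer.zero_grad()
        loss.backward()       # back-propagate the combined error
        optimizer.step()      # update the parameters
        return loss.item()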
In an exemplary embodiment of the present application, during the prediction stage, the class corresponding to the index of the maximum probability in the softmax output is taken as the predicted type. Thus, for trigger-word recognition, whether a token is a trigger word is judged from the softmax output; for entity recognition, whether a span belongs to a certain entity type is judged from the softmax output; for argument recognition, a precondition must be satisfied: the token in the combination must already have been judged to be a trigger word, and only then is the (token, span) combination judged, from the softmax output, to be an argument of that event or not.
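Under the same assumptions, the decoding rule just described could be written as:

    import torch

    OTHER = 0  # assumed index of the "other" class

    def decode(w2, w4, w6, spans):
        """Take the argmax class of each softmax output and apply the trigger-word precondition."""
        trigger_types = w2.argmax(dim=-1)   # [S]
        entity_types = w4.argmax(dim=-1)    # [N]
        pair_types = w6.argmax(dim=-1)      # [S, N]

        triggers = {i: int(t) for i, t in enumerate(trigger_types) if int(t) != OTHER}
        entities = {spans[j]: int(t) for j, t in enumerate(entity_types) if int(t) != OTHER}
        arguments = [(i, spans[j], int(pair_types[i, j]))
                     for i in triggers                    # the token must already be a trigger word
                     for j in range(len(spans))
                     if int(pair_types[i, j]) != OTHER]
        return triggers, entities, arguments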
In this embodiment, the vectorized semantic representation W1 of the sentence is obtained through a bidirectional LSTM network or BERT and then fed to three branches: 1. the first branch performs trigger-word recognition; 2. the second branch performs entity recognition; before this, the vectorized semantic representation W1 of the sentence is span-divided into a plurality of semantic segments, each segment is average-pooled into a span representation, and the second branch classifies each span to determine whether it is an entity; 3. the third branch performs argument recognition, combining each token with each span to determine whether the combination is a (trigger word, argument) pair. Through this process, the entities, trigger words and corresponding arguments of all events in a sentence (or document) can be identified, improving the efficiency of event extraction with strong practicality. The model only needs the raw sentence as input, which avoids the dependence of existing methods on natural language processing tools.
In summary, the present application has at least the following advantages:
1. The trigger words, arguments and entities of events can be extracted at the same time, so more useful information is obtained and the method has stronger practical application value.
2. No existing natural language processing tool is used during data processing and modeling, so the operation is simple, the error accumulation caused by such tools is avoided, and the method is better suited to real application scenarios.
3. The span-division approach resolves the problems of sequence labeling, with higher efficiency and wider applicability.
In an exemplary embodiment of the present application, a trigger word is treated as a single token, so the method also applies to languages such as English in which a single word carries a meaning. For the entities and arguments of an event there is no restriction on language, because they are span-based.
The present application also provides an event extraction device 1. As shown in FIG. 2, it may include a processor 11 and a computer-readable storage medium 12, the computer-readable storage medium 12 storing instructions that, when executed by the processor 11, implement any one of the event extraction methods described above.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.

Claims (7)

1. A method of event extraction, the method comprising:
dividing trigger word types into x classes, entity types into y classes, and event argument types into z classes, and treating any type other than the trigger word types, the entity types and the event argument types as an additional "other" class; wherein x, y and z are positive integers;
before obtaining a vectorized semantic representation W1 of a sentence, performing any one or more of the following operations:
setting one or more tokens in the sentence, wherein each token is used to mark whether the current word is a trigger word, each marked token representing one of the x types;
performing span division on the sentence according to a set span width to divide the sentence into a plurality of spans, and marking each span to indicate whether the current span is an entity, each mark representing one of the y types;
combining each labeled token and span pairwise, and marking whether each combined (token, span) is a (trigger word, argument) pair;
obtaining a vectorized semantic representation W1 of the sentence;
performing trigger-word recognition on the tokens set in the vectorized semantic representation W1;
performing entity recognition on the corresponding span semantic representations obtained by span division of the vectorized semantic representation W1, comprising: performing span division on the vectorized semantic representation W1 to obtain a plurality of semantic segments; average-pooling the plurality of semantic segments to obtain a representation W3 of each span; and, taking the representation W3 of each span as input, classifying each span using a two-layer fully connected neural network and a softmax layer and outputting a vector W4 of dimension [N, y+1], wherein the vector W4 represents the probability that each span belongs to each entity type;
wherein performing span division on the vectorized semantic representation W1 to obtain a plurality of semantic segments and average-pooling the plurality of semantic segments to obtain the representation W3 of each span comprises: obtaining the set maximum span width max_span_width; selecting segments of the vectorized semantic representation W1 with span widths from 1 to max_span_width in turn to obtain the semantic representations span_embedding of N spans; and average-pooling the semantic representations span_embedding of the N spans to obtain the representation W3 of each span;
combining every token and span pairwise, and marking whether each combined (token, span) is a (trigger word, argument) pair.
2. The event extraction method according to claim 1, wherein obtaining the vectorized semantic representation W1 of the sentence comprises: obtaining the vectorized semantic representation W1 of the sentence through a bidirectional LSTM network model or a BERT model.
3. The event extraction method according to claim 2, wherein before obtaining the vectorized semantic representation W1 of the sentence through the bidirectional LSTM network, the method further comprises: randomly initializing each of the a characters into a b-dimensional vector, yielding an embedding matrix D of dimension [a, b], wherein each index id from 0 to a-1 corresponds to a different character; and, for a sentence of length S, looking up each character of the sentence by its id in D, so that a vector of dimension [S, b] is obtained;
obtaining the vectorized semantic representation W1 of the sentence through the bidirectional LSTM network comprises: inputting the vector of dimension [S, b] into a preset bidirectional LSTM neural network, and taking the output of the bidirectional LSTM neural network as the vectorized semantic representation W1 of the sentence;
wherein the dimension of the vectorized semantic representation W1 is [S, D1], and D1 is 2 times the number of LSTM hidden-layer nodes.
4. The event extraction method according to claim 2, wherein obtaining the vectorized semantic representation W1 of the sentence through the BERT model comprises: inputting the sentence directly into the BERT model, and taking the output of the BERT model as the vectorized semantic representation W1 of the sentence;
wherein the dimension of the vectorized semantic representation W1 is [S, D1], with D1 = 768.
5. The event extraction method according to claim 1, wherein performing trigger-word recognition on the tokens set in the vectorized semantic representation W1 comprises:
classifying each token through a two-layer fully connected neural network and a softmax layer to obtain a vector W2 of dimension [S, x+1], wherein the vector W2 represents the probability that each token belongs to each trigger-word type.
6. The event extraction method according to claim 4, wherein combining every token and span pairwise and marking whether each combined (token, span) is a (trigger word, argument) pair comprises:
copying and transforming the vectorized semantic representation W1 and the representation W3 of each span so that every token and span are concatenated pairwise, obtaining a vector W5 of dimension [S, N, 2*D1];
taking the vector W5 as input, classifying it through a two-layer fully connected neural network and a softmax layer, and outputting a vector W6 of dimension [S, N, z+1], wherein the vector W6 represents the probability that each combination belongs to each event-argument type.
7. An event extraction device, comprising a processor and a computer-readable storage medium having instructions stored therein which, when executed by the processor, implement the event extraction method according to any one of claims 1 to 6.
CN202010187298.7A 2020-03-17 2020-03-17 Event extraction method and device Active CN111428504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010187298.7A CN111428504B (en) 2020-03-17 2020-03-17 Event extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010187298.7A CN111428504B (en) 2020-03-17 2020-03-17 Event extraction method and device

Publications (2)

Publication Number Publication Date
CN111428504A CN111428504A (en) 2020-07-17
CN111428504B true CN111428504B (en) 2023-04-28

Family

ID=71553554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010187298.7A Active CN111428504B (en) 2020-03-17 2020-03-17 Event extraction method and device

Country Status (1)

Country Link
CN (1) CN111428504B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084381A (en) * 2020-09-11 2020-12-15 广东电网有限责任公司 Event extraction method, system, storage medium and equipment
CN112183058B (en) * 2020-09-22 2021-06-22 甘肃农业大学 Poetry generation method and device based on BERT sentence vector input
CN112580346B (en) * 2020-11-17 2022-05-06 深圳追一科技有限公司 Event extraction method and device, computer equipment and storage medium
CN112507697B (en) * 2020-11-30 2023-09-22 北京百度网讯科技有限公司 Event name generation method, device, equipment and medium
CN113779227B (en) * 2021-11-12 2022-01-25 成都数之联科技有限公司 Case fact extraction method, system, device and medium
CN116757159B (en) * 2023-08-15 2023-10-13 昆明理工大学 End-to-end multitasking joint chapter level event extraction method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325228A (en) * 2018-09-19 2019-02-12 苏州大学 English event trigger word abstracting method and system
CN110134720A (en) * 2019-05-17 2019-08-16 苏州大学 It merges local feature and combines abstracting method with the event of deep learning
CN110134757A (en) * 2019-04-19 2019-08-16 杭州电子科技大学 A kind of event argument roles abstracting method based on bull attention mechanism
CN110704598A (en) * 2019-09-29 2020-01-17 北京明略软件系统有限公司 Statement information extraction method, extraction device and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160162467A1 (en) * 2014-12-09 2016-06-09 Idibon, Inc. Methods and systems for language-agnostic machine learning in natural language processing using feature extraction

Also Published As

Publication number Publication date
CN111428504A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
CN111428504B (en) Event extraction method and device
CN112084337B (en) Training method of text classification model, text classification method and equipment
CN111309915B (en) Method, system, device and storage medium for training natural language of joint learning
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN110377759B (en) Method and device for constructing event relation graph
CN112434535B (en) Element extraction method, device, equipment and storage medium based on multiple models
CN111428511B (en) Event detection method and device
CN111723569A (en) Event extraction method and device and computer readable storage medium
CN109857846B (en) Method and device for matching user question and knowledge point
CN112988963B (en) User intention prediction method, device, equipment and medium based on multi-flow nodes
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
US20220300708A1 (en) Method and device for presenting prompt information and storage medium
CN110826316A (en) Method for identifying sensitive information applied to referee document
CN111814482B (en) Text key data extraction method and system and computer equipment
CN113297379A (en) Text data multi-label classification method and device
CN111581346A (en) Event extraction method and device
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN113486659B (en) Text matching method, device, computer equipment and storage medium
US11875128B2 (en) Method and system for generating an intent classifier
CN113569062A (en) Knowledge graph completion method and system
CN111368524A (en) Microblog viewpoint sentence recognition method based on self-attention bidirectional GRU and SVM
CN113392190B (en) Text recognition method, related equipment and device
CN114896415A (en) Entity relation joint extraction method and device based on lightweight self-attention mechanism
CN114691716A (en) SQL statement conversion method, device, equipment and computer readable storage medium
CN114626463A (en) Language model training method, text matching method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant