CN112231447B

CN112231447B - Method and system for extracting Chinese document events

Info

Publication number: CN112231447B
Application number: CN202011315453.5A
Authority: CN
Inventors: 王雷
Original assignee: Hangzhou Touzhi Information Technology Co ltd
Current assignee: Hangzhou Touzhi Information Technology Co ltd
Priority date: 2020-11-21
Filing date: 2020-11-21
Publication date: 2023-04-07
Anticipated expiration: 2040-11-21
Also published as: CN112231447A

Abstract

The invention discloses a method and a system for extracting events from Chinese documents. The method comprises the following steps: detecting the entities and entity types of the document; detecting the event type of each sentence in the document; obtaining, according to the event type, the argument roles required by the event type and their importance; obtaining the importance of each sentence according to the importance of the argument roles; detecting a central sentence of the document based on the sentence importance; and extracting entities as arguments based on the detected entities, the event type and the central sentence, and obtaining the argument role of each argument. By defining the importance of argument roles to event types, detecting the central sentence, and obtaining arguments and argument roles from the relationship of the entities and event type of each sentence to the central sentence, the event type, the arguments and the argument roles can be determined correctly without relying on trigger words, even when trigger words are absent, which improves the recall rate; at the same time, the workload of annotating trigger words in the training set is reduced.

Description

Method and system for extracting Chinese document events

Technical Field

The invention relates to the technical field of text information extraction, in particular to a method and a system for extracting Chinese document events.

Background

Event extraction is an important foundation for natural language understanding. It gives people a convenient way to acquire knowledge quickly, is a prerequisite for computers to understand natural language, and benefits applications such as automatic summarization, machine translation and question answering.

Event extraction detects the type of an event from unstructured text and extracts from the text the arguments that form the core elements of an event of that type, so that the event information contained in the text is expressed as a structured tuple (event type, event argument 1, event argument 2, …, event argument n). With the explosive growth of information on the internet, event extraction offers a way to obtain useful information efficiently from massive amounts of text, and it has become a research hotspot in both academia and industry.

In the prior art, identification of the event type depends explicitly on event trigger words. In practice, trigger words often have to be compiled manually as specific phrases, which increases the workload of annotating the training data set. Moreover, trigger words do not necessarily appear in the text describing an event, so the recall rate of event type detection is low.

Disclosure of Invention

Aiming at the technical problems in the prior art, the invention provides a method and a system for extracting Chinese document events, which improve the recall rate of event type detection.

The invention discloses a method for extracting Chinese document events, which comprises the following steps: detecting an entity and an entity type of the document; detecting an event type of a sentence in the document; obtaining argument roles and importance degrees thereof required by the event types according to the event types; obtaining the importance of each sentence according to the importance of the argument role; detecting a central sentence of the document based on the importance of the sentence; and extracting the entity as an argument based on the detected entity, the event type and the central sentence, and obtaining the argument role of the argument.

Preferably, the method for detecting the entity and the entity type comprises the following steps: creating a first training set, and marking entities and entity types for samples of the first training set; training the first training set based on a bidirectional LSTM network, an attention mechanism and a conditional random field to obtain an entity recognition model; and detecting the entity and the entity type of the document through an entity recognition model.

Preferably, the method of the present invention further comprises a Chinese character vectorization method: establishing a search matrix from Chinese characters to character vectors; inputting the Chinese characters into the search matrix to obtain character vectors.

Preferably, the method for detecting the entity and the entity type of the document through the entity recognition model comprises the following steps: labeling the position and type of entities in the sample with BIO tags; converting the Chinese characters of sentences in the sample into character vectors based on the search matrix; inputting the sentences converted into character vectors into the bidirectional LSTM network of the entity recognition model, and concatenating the forward and backward hidden layers to form the hidden vector of the current character of the sentence; inputting the hidden vector into an attention mechanism with 8 self-attention heads to obtain an output vector; and inputting the output vector into the conditional random field to calculate the predicted BIO labels of the sample.

Preferably, the method for detecting the event type of a sentence in the document comprises: performing a max pooling operation on the hidden vectors to obtain a first sentence vector; inputting the first sentence vector into a first fully connected matrix for binary classification to obtain the normalized probability of the sentence's event type; and obtaining the event type of the sentence according to the normalized probability.

Preferably, the method for obtaining the argument roles and the importance thereof required by the event types comprises the following steps: acquiring the occurrence frequency of an event type and the frequency of argument roles under the event type within the range of a first training set; and obtaining the importance of the argument role to the event type according to the times of the argument role under the event type.

Preferably, the relative importance of the argument role to the event type is:

where IR(r, v) is the relative importance of argument role r to event type v, j indexes each possible event type, and the count term is the number of times argument role r occurs under event type v in the training set;

the inverse importance of the argument role to the event type is:

where IC(r) is the inverse importance, $|V|$ is the number of event types in the event type set $V$, and $|\{v \in V : r \in v\}|$ is the number of event types that contain argument role r;

the normalized argument role r has an importance for event type v of:

where I (r, v) represents the importance of the normalized argument role to the event type.

Preferably, the method for extracting the entity as argument comprises:

converting each Chinese character of the detected entity into a character vector through the search matrix, and obtaining the entity vector through a max pooling operation:

$$e_l = \mathrm{MaxPooling}\{m_{i,j}, \ldots, m_{i,k}\}$$

where $e_l$ is the entity vector and $m_{i,j}, \ldots, m_{i,k}$ are the vectors of the characters covered by the entity, which spans the j-th through the k-th Chinese character of the sentence numbered i;

after the entity vector, splicing the entity type code and the distance code from the entity to the central sentence to obtain a third entity vector;

inputting the Chinese characters of the sentence where the entity is located into the search matrix, and performing maximum pooling operation on the output value to obtain a second sentence vector;

after the second sentence vector, sequentially splicing the event type code of the sentence and the distance code from the sentence to the central sentence to obtain a third sentence vector;

inputting the third entity vector and the third sentence vector into a 4-layer Transformer network to obtain a fourth entity vector and a fourth sentence vector which fully exchange text semantic information;

and inputting the fourth entity vector into a second full-connection matrix of the two categories to obtain a result of whether the entity is used as a sentence argument.

Preferably, the method of the present invention further comprises a method of calculating the predicted loss:

obtaining the entity detection loss $L_{see}$ according to the detected entities and their entity types and the true entities and entity types of the document;

obtaining the event type detection loss $L_{tri}$ according to the detected event type and the central sentence;

obtaining the argument extraction loss $L_{dee}$ according to the detected arguments and their argument roles and the true arguments and argument roles of the document;

obtaining the loss of Chinese document event extraction according to the entity detection loss, the event type detection loss and the argument extraction loss:

$$L_{total} = \lambda_1 L_{see} + \lambda_2 L_{tri} + \lambda_3 L_{dee}$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are weighting factors.

In another aspect, the present invention further provides a system for implementing the method, including: an entity detection module, an event type detection module, a central sentence detection module and an argument extraction module,

the entity detection module is used for detecting the entity and the entity type of the document;

the event type detection module is used for detecting the event type of sentences in the document;

the central sentence detection module is used for acquiring argument roles and importance thereof required by the event type according to the event type, acquiring the importance of each sentence according to the importance of the argument roles, and detecting the central sentence of the document based on the importance of the sentence;

the argument extraction module is used for extracting the entity as an argument and obtaining the argument role of the argument based on the detected entity, the event type and the central sentence.

Compared with the prior art, the invention has the beneficial effects that:

The importance of argument roles to event types is defined, a central sentence is detected, and arguments and argument roles are obtained from the relationship of the entities and event type of each sentence to the central sentence. The event type, the arguments and the argument roles can therefore be determined correctly without relying on trigger words, even when trigger words are absent, which improves the recall rate; at the same time, the workload of annotating trigger words in the training set is reduced, and the granularity of event extraction is raised from the sentence level to the document level.

Drawings

FIG. 1 is a flow chart of a method of Chinese document event extraction of the present invention;

FIG. 2 is a method flow diagram of a method of Chinese character vectorization;

FIG. 3 is a diagram of Chinese character conversion into character vectors;

FIG. 4 is a flow chart of a method of detecting entities and entity types;

FIG. 5 is a flow diagram of a method for detecting entities and entity types of the document through an entity recognition model;

FIG. 6 is a flow diagram of a method of detecting event types for sentences in the document;

FIG. 7 is a flow diagram of a method of obtaining argument roles and their importance as required by event types;

FIG. 8 is a flow diagram of a method of extracting entities as arguments;

FIG. 9 is a flow chart of a method of calculating predicted losses;

FIG. 10 is a logical block diagram of the system of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

The invention is described in further detail below with reference to the attached drawing figures:

A method for extracting Chinese document events, as shown in FIG. 1, comprises:

Step 101: Detect the entities and entity types of the document.

An entity refers to an object of a certain semantic category, such as a time, a place, a person's name, a place name or a number, and is a candidate for an argument. The entity type is the category to which the entity belongs, such as person name, location or time. An argument refers to an element involved in the occurrence of an event and is composed of a group of entities; the argument role declares the role an argument plays in an event, e.g., "Chen Liequan" is the "pledgor". In the field of event extraction in natural language processing, several argument roles are defined under each event type to describe the complete information of an event.

The structured event is composed of arguments and their argument roles, as shown in the following table:

Step 102: Detect the event type of each sentence in the document. Based on the entity recognition result and the wording of the text, the type of event that the text describes is detected.

Step 103: Obtain, according to the event type, the argument roles required by the event type and their importance. From the training set, the argument roles under each event type and their numbers of occurrences can be counted, and the importance of an argument role to the event type is judged from how frequently the role occurs.

In a specific embodiment, the argument roles required by the event type are acquired through a dictionary, which is constructed based on a training set and used for representing the argument role corresponding relation played by an argument under the event type, as shown in the following table:

Chen Liequan: pledgor
12,780,000 shares: pledged shares
Changjiang Securities (Shanghai) Asset Management Co., Ltd.: pledgee
December 30, 2015: start date
January 4, 2018: end date

Here, "Chen Liequan" is an argument and "pledgor" is its argument role.
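The structured event above can be held in a small data structure. The following is a minimal sketch, not the patent's implementation: the class names `Argument` and `Event`, the event type label "EquityPledge" and the English role labels are assumptions introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Argument:
    text: str  # surface string of the entity that fills the role
    role: str  # argument role the entity plays in the event


@dataclass
class Event:
    event_type: str
    arguments: List[Argument] = field(default_factory=list)

    def as_tuple(self) -> Tuple[str, ...]:
        """Flatten to (event type, argument 1, ..., argument n)."""
        return (self.event_type, *(a.text for a in self.arguments))

    def role_of(self, text: str) -> Optional[str]:
        """Return the role played by an argument, or None if it is absent."""
        for a in self.arguments:
            if a.text == text:
                return a.role
        return None


# "EquityPledge" is a hypothetical label; the source does not name the event type.
event = Event(
    event_type="EquityPledge",
    arguments=[
        Argument("Chen Liequan", "pledgor"),
        Argument("12,780,000 shares", "pledged shares"),
        Argument("Changjiang Securities (Shanghai) Asset Management Co., Ltd.", "pledgee"),
        Argument("December 30, 2015", "start date"),
        Argument("January 4, 2018", "end date"),
    ],
)
```

Calling `event.as_tuple()` yields the tuple form (event type, argument 1, ..., argument n) described in the Background, while `event.role_of(...)` answers which role a given argument plays.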

Step 104: and obtaining the importance of each sentence according to the importance of the argument role.

Step 105: and detecting a central sentence of the document based on the importance of the sentence. In a Chinese document, the central sentence serves as a summary and overview, and is more important than the other sentences in the document.

Step 106: and extracting the entity as an argument based on the detected entity, the event type and the central sentence, and obtaining the argument role of the argument. Entities can be extracted as arguments by the relationship of the entity and event type to the central sentence, such as semantic distance.

The method defines the importance of argument roles to event types, detects a central sentence, and obtains arguments and argument roles from the relationship of the entities and event type of each sentence to the central sentence. The event type, the arguments and the argument roles can therefore be determined correctly without relying on trigger words, even when trigger words are absent, which improves the recall rate. At the same time, the workload of annotating trigger words in the training set is reduced, the granularity of event extraction is raised from the sentence level to the document level, and because an end-to-end extraction method is used, the error propagation of a pipeline method is avoided.

As shown in FIG. 2, the invention also includes a method for vectorizing Chinese characters:

Step 201: Establish a search matrix from Chinese characters to character vectors.

In one embodiment, a lookup matrix from Chinese characters to character vector representations is trained on a large amount of Chinese text from the event extraction domain using the unsupervised skip-gram method. As shown in FIG. 3, the characters "中" and "国" are each input into the search matrix to obtain the corresponding character vectors. The closer two Chinese characters are semantically, the smaller the Euclidean distance between their character vectors in the multidimensional space; the more they differ semantically, the larger that distance. Converting Chinese characters into character vectors makes computation easier and also makes it possible to obtain the semantic distance between Chinese characters.

Step 202: inputting the Chinese characters into the search matrix to obtain character vectors.
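The lookup itself is a row selection: each character maps to an index, and the index selects a row of the matrix. The sketch below illustrates only that mechanism. The rows are filled with random numbers as a stand-in for skip-gram training, so the distances it produces carry no semantic meaning; the function names and the toy character set are assumptions.

```python
import random


def build_lookup(chars, dim, seed=0):
    """Build a character-to-row index and a matrix with one row per character."""
    rng = random.Random(seed)
    vocab = {ch: idx for idx, ch in enumerate(sorted(set(chars)))}
    # Placeholder values: a trained model would supply these rows instead.
    matrix = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in vocab]
    return vocab, matrix


def lookup(ch, vocab, matrix):
    """Return the character vector stored for a Chinese character."""
    return matrix[vocab[ch]]


def euclidean(u, v):
    """Euclidean distance between two character vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


vocab, matrix = build_lookup("中国股份质押", dim=4)
vec_zhong = lookup("中", vocab, matrix)
vec_guo = lookup("国", vocab, matrix)
distance = euclidean(vec_zhong, vec_guo)
```

With trained vectors, `euclidean` would be the measure of semantic closeness that the description refers to.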

Example 1

As shown in FIG. 4, this embodiment provides a method for detecting entities and entity types:

Step 301: Create a first training set, and label entities and entity types for the samples of the first training set.

In one embodiment, the BIO tagging scheme is used to tag entities and entity types, as shown in the following table:

where B marks the first Chinese character of an entity, I marks a Chinese character inside an entity, O marks a Chinese character that is not part of any entity, and the word after the "-" is the entity type.

The position of each argument may also be labeled in the first training set, for example as a tuple (send_idx, char_s, char_e), where send_idx is the number of the sentence in which the argument appears in the document, char_s is the starting character number of the argument, and char_e is the ending character number of the argument, as shown in the following table:

Chen Liequan: (0, 0, 3)
12,780,000 shares: (1, 3, 12)
Changjiang Securities (Shanghai) Asset Management Co., Ltd.: (0, 52, 68)
December 30, 2015: (1, 21, 32)
January 4, 2018: (1, 39, 48)
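Character spans and BIO tags are two views of the same annotation, and converting between them is mechanical. The sketch below assumes that the end index is exclusive, which is consistent with a three-character name being recorded as (0, 0, 3) but is not stated in the source. The toy sentence, the entity type names `PER` and `SHARES`, and the function names are hypothetical.

```python
def spans_to_bio(sentence, spans):
    """Turn (start, end, entity_type) spans into one BIO tag per character."""
    tags = ["O"] * len(sentence)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"          # first character of the entity
        for pos in range(start + 1, end):
            tags[pos] = f"I-{entity_type}"        # remaining characters
    return tags


def bio_to_spans(tags):
    """Recover (start, end, entity_type) spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for pos, tag in enumerate(tags + ["O"]):      # sentinel closes a trailing entity
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue                              # still inside the current entity
        if start is not None:
            spans.append((start, pos, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = pos, tag[2:]
    return spans


sentence = "张三质押100股"                          # hypothetical toy sentence
spans = [(0, 2, "PER"), (4, 8, "SHARES")]
bio_tags = spans_to_bio(sentence, spans)
recovered = bio_to_spans(bio_tags)
```

The round trip through `bio_to_spans` is what turns the model's predicted tag sequence back into entity positions and types.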

Step 302: and training the first training set based on the bidirectional LSTM network, the attention mechanism and the conditional random field to obtain an entity recognition model.

Step 303: and detecting the entity and the entity type of the document through an entity recognition model.

In a specific embodiment, as shown in FIG. 5, the method for detecting the entities and entity types of the document through the entity recognition model includes:

Step 401: Label the position and type of the entities in the sample with BIO tags.

Step 402: Convert the Chinese characters of the sentences in the sample into character vectors $x_{i,j}$ based on the search matrix.

Step 403: Input the sentence $s_i$, converted into character vectors, into the bidirectional LSTM network of the entity recognition model, and concatenate the forward and backward hidden layers to obtain the hidden vector $h_{i,j}$ of the current character in the sentence.

The bidirectional LSTM network processes the sentence in both the forward and the backward direction so as to improve the recognition of semantics.

Step 404: Input the hidden vector $h_{i,j}$ into an attention mechanism with 8 self-attention heads to obtain the output vector $m_{i,j}$.

The attention mechanism is a resource allocation scheme and a principal means of addressing information overload: it allocates computing resources to the more important parts of the task, so that the neural network is able to focus on a subset of its inputs (or features).

Step 405: Input the output vector $m_{i,j}$ into the conditional random field to obtain the predicted BIO labels of the sample.

The conditional random field (CRF) is a basic model in natural language processing. It is a discriminative probabilistic model and a type of random field, commonly used for labeling or analyzing sequence data such as natural language text.

The bidirectional LSTM network, the attention mechanism and the conditional random field are each prior art and are not described in detail here. Steps 401-405 describe the method for recognizing BIO tags and can also be applied to constructing the entity recognition model.
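For the last stage of this chain, a linear-chain CRF is commonly decoded with the Viterbi algorithm, which picks the tag sequence with the highest total of per-character scores and tag-to-tag transition scores. The source does not say which decoding procedure is used, so the following is a sketch under that assumption. The emission scores stand in for whatever the network would derive from $m_{i,j}$, and all numbers are invented for illustration.

```python
def viterbi(emissions, transitions, tags):
    """Best tag sequence for a linear-chain model.

    emissions[t][y]: score of tag y at character t.
    transitions[p][y]: score of moving from tag p to tag y.
    """
    n_tags = len(tags)
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for y in range(n_tags):
            best_prev = max(range(n_tags), key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y] + emissions[t][y])
            pointers.append(best_prev)
        score = new_score
        backptr.append(pointers)
    # Trace the best path backwards from the best final tag.
    path = [max(range(n_tags), key=lambda y: score[y])]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    path.reverse()
    return [tags[y] for y in path]


tags = ["O", "B-PER", "I-PER"]
# Hypothetical transition scores: forbid O -> I-PER, which is invalid in BIO.
transitions = [[0.0, 0.0, -1e4], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# Hypothetical per-character scores for a three-character sentence.
emissions = [[1.0, 0.8, 0.1], [0.2, 0.3, 2.0], [1.5, 0.1, 0.2]]
predicted = viterbi(emissions, transitions, tags)
```

Picking the best tag per character independently would give O followed by I-PER here, which is not a valid BIO sequence; the transition scores are what steer the decoder to B-PER, I-PER, O.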

Example 2

As shown in FIG. 6, this embodiment provides a method for detecting the event type of a sentence in the document:

Step 501: Perform a max pooling operation on the hidden vectors $h_{i,j}$ of the characters in the sentence to obtain a first sentence vector $g_i$.

The max pooling operation keeps the maximum value as the value retained by the pooling layer and discards all other feature values. Keeping the maximum means that only the strongest feature is retained and weaker features are dropped, which can reduce the number of model parameters and mitigate overfitting.

Step 502: Input the first sentence vector into a first fully connected matrix $W_v$ for binary classification to obtain the normalized probability of the sentence's event type.

The first fully connected matrix is used to calculate the normalized probability of the event type: an output of 1 indicates that the sentence belongs to the event type, and an output of 0 indicates that it does not. In one embodiment, the Softmax function is applied to the output of the first fully connected matrix to perform the binary classification; it converts the output values into a probability distribution whose values lie in [0, 1] and sum to 1.

Step 503: and obtaining the event type of the sentence according to the normalized probability.
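Steps 501 to 503 can be sketched in plain Python as follows. The hidden vectors, the weights of `W_v` and the zero bias are made-up numbers, and the decision rule (compare the two probabilities) is an assumption about how the normalized probability is turned into a yes or no answer.

```python
import math


def max_pool(vectors):
    """Element-wise maximum over the character vectors of one sentence."""
    return [max(column) for column in zip(*vectors)]


def softmax(logits):
    """Convert scores into probabilities in [0, 1] that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence_vector, weight, bias):
    """Two-class fully connected layer followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, sentence_vector)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)


# Hypothetical hidden vectors h_{i,j} for a three-character sentence.
hidden = [[0.2, -0.5, 0.9], [0.7, 0.1, -0.3], [-0.4, 0.6, 0.0]]
g = max_pool(hidden)                          # first sentence vector g_i

# Hypothetical weights for one event type: row 0 = "not this type", row 1 = "this type".
W_v = [[-1.0, 0.5, 0.2], [1.0, -0.5, 0.3]]
b_v = [0.0, 0.0]
probs = classify(g, W_v, b_v)
is_event = probs[1] > probs[0]
```

One such two-class layer per event type lets a sentence be tested against every event type independently.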

As shown in FIG. 7, the method for obtaining the argument roles required by the event type and their importance includes:

Step 601: Obtain, within the first training set, the number of occurrences of each event type and the number of occurrences of each argument role under that event type.

Step 602: Obtain the importance of the argument role to the event type according to the number of times the argument role occurs under the event type.

The relative importance of the argument role to the event type is:

the inverse importance of the argument role to the event type is:

the normalized argument role r has an importance for event type v of:

where I (r, v) represents the importance of the normalized argument role r to the event type v.

The argument roles required by the event type and their importance are obtained from the first training set; the importance of each sentence of the Chinese document under the event type is then obtained, and the sentence with the highest importance is selected as the central sentence.
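The formulas for IR(r, v), IC(r) and I(r, v) are not reproduced in this text, so the sketch below does not claim to match them. It assumes a TF-IDF-like form that fits the symbol definitions: IR as the share of a role's occurrences that fall under an event type, IC as a smoothed log of |V| divided by the number of event types containing the role, and I as their product normalized per event type. It further assumes that a sentence's importance is the sum of I over the roles it can fill. All names and data are hypothetical.

```python
import math
from collections import Counter


def role_importance(annotated_events):
    """annotated_events: list of (event_type, [argument roles filled])."""
    counts = Counter()  # counts[(r, v)]: occurrences of role r under event type v
    for event_type, roles in annotated_events:
        for role in roles:
            counts[(role, event_type)] += 1
    event_types = {v for v, _ in annotated_events}
    all_roles = {r for r, _ in counts}

    raw = {}
    for r in all_roles:
        containing = [v for v in event_types if counts[(r, v)] > 0]
        total_r = sum(counts[(r, v)] for v in containing)
        # ASSUMED inverse importance: log(|V| / #types containing r) + 1
        ic = math.log(len(event_types) / len(containing)) + 1.0
        for v in containing:
            ir = counts[(r, v)] / total_r     # ASSUMED relative importance
            raw[(r, v)] = ir * ic

    importance = {}
    for v in event_types:
        norm = sum(s for (_, vv), s in raw.items() if vv == v)
        for (r, vv), s in raw.items():
            if vv == v:
                importance[(r, v)] = s / norm  # ASSUMED normalization per event type
    return importance


def central_sentence(sentence_roles, event_type, importance):
    """Score each sentence by the roles it can fill; return (best index, scores)."""
    scores = [
        sum(importance.get((r, event_type), 0.0) for r in set(roles))
        for roles in sentence_roles
    ]
    return max(range(len(scores)), key=scores.__getitem__), scores


# Hypothetical training annotations and a three-sentence document.
annotated = [
    ("EquityPledge", ["pledgor", "pledgee", "start_date"]),
    ("EquityPledge", ["pledgor", "pledged_shares"]),
    ("EquityFreeze", ["holder", "start_date"]),
]
importance = role_importance(annotated)
sentence_roles = [["start_date"], ["pledgor", "pledgee", "pledged_shares"], []]
central_idx, sentence_scores = central_sentence(sentence_roles, "EquityPledge", importance)
```

Under these assumptions a role shared by many event types, such as a date, contributes less than a role specific to one event type, so the sentence carrying the type-specific roles is chosen as the central sentence.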

Example 3

As shown in FIG. 8, this embodiment provides a method for extracting entities as arguments:

Step 701: Convert each Chinese character of the detected entity into a character vector through the search matrix, and obtain an entity vector through a max pooling operation:

$$e_l = \mathrm{MaxPooling}\{m_{i,j}, \ldots, m_{i,k}\}$$

where $e_l$ is the entity vector and $m_{i,j}, \ldots, m_{i,k}$ are the vectors of the characters covered by the entity, which spans the j-th through the k-th Chinese character of the sentence numbered i.

Step 702: Append the entity type code and the code of the distance from the entity to the central sentence to the entity vector to obtain a third entity vector E'. The distance code represents the difference between the number of the sentence containing the entity and the number of the central sentence; for example, if S1 is the central sentence, the difference between S2 and the central sentence is 1.

Step 703: Input the Chinese characters of the sentence containing the entity into the search matrix, and perform a max pooling operation on the output to obtain a second sentence vector G'.

Step 704: Append the event type code of the sentence and the code of the distance from the sentence to the central sentence to the second sentence vector G' to obtain a third sentence vector. The event type code marks the event type.

Step 705: Input the third entity vector and the third sentence vector into a 4-layer Transformer network to obtain a fourth entity vector and a fourth sentence vector that have fully exchanged textual semantic information.

Step 706: Input the fourth entity vector into a second fully connected matrix $w_{v,r}$ for binary classification to obtain the result of whether the entity serves as an argument of the sentence, together with its argument role.

The second fully connected matrix $w_{v,r}$ is used to judge whether the entity is expanded into an argument of the current event type: an output of 1 means it is expanded, and an output of 0 means it is not. The argument role may be obtained from the dictionary.

All arguments of a sentence are extracted by processing each entity in the sentence in turn.
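The construction of the entity vector in steps 701 and 702 can be sketched as max pooling followed by concatenation. The encodings used here are assumptions: the entity type is appended as a one-hot code and the distance as a single number equal to the absolute difference of sentence numbers. The Transformer and the final classifier of steps 705 and 706 are not shown.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def one_hot(index, size):
    """One-hot code for a category index (an assumed encoding)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_entity_vector(char_vectors, start, end, type_index, n_types,
                        sent_idx, central_idx):
    """Pool the entity's character vectors, then append type and distance codes."""
    e = max_pool(char_vectors[start:end])      # entity vector e_l (end is exclusive)
    distance = abs(sent_idx - central_idx)     # difference in sentence numbers
    return e + one_hot(type_index, n_types) + [float(distance)]


# Hypothetical 2-dimensional character vectors for a four-character sentence.
char_vectors = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5], [0.0, 0.0]]
entity_vec = build_entity_vector(
    char_vectors, start=0, end=2,
    type_index=1, n_types=3,                   # hypothetical entity type inventory
    sent_idx=2, central_idx=1,
)
```

The sentence vector of steps 703 and 704 can be built the same way, pooling over all characters of the sentence and appending the event type code in place of the entity type code.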

Example 4

As shown in FIG. 9, this embodiment provides a method for calculating the prediction loss:

Step 801: Obtain the entity detection loss $L_{see}$ according to the detected entities and their entity types and the true entities and entity types of the document.

Step 802: Obtain the event type detection loss $L_{tri}$ according to the detected event type and the central sentence.

Step 803: Obtain the argument extraction loss $L_{dee}$ according to the detected arguments and their argument roles and the true arguments and argument roles of the document.

Step 804: Obtain the loss of Chinese document event extraction according to the entity detection loss, the event type detection loss and the argument extraction loss:

$$L_{total} = \lambda_1 L_{see} + \lambda_2 L_{tri} + \lambda_3 L_{dee}$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are weighting factors.
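The combined loss is a weighted sum of the three component losses. The short sketch below shows the arithmetic only; the source gives no values for the weighting factors, so the loss values and weights used here are invented.

```python
def total_loss(l_see, l_tri, l_dee, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of entity, event type and argument extraction losses."""
    lam1, lam2, lam3 = lambdas
    return lam1 * l_see + lam2 * l_tri + lam3 * l_dee


# Hypothetical component losses and weights.
loss = total_loss(l_see=0.5, l_tri=0.25, l_dee=1.0, lambdas=(1.0, 2.0, 0.5))
```

Because a single scalar is minimized, the gradients from all three tasks update the shared parameters together, which is what allows the modules to be trained jointly.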

The present invention also provides a system for implementing the above method, as shown in fig. 10, the system includes: an entity detection module 1, an event type detection module 2, a central sentence detection module 3 and an argument extraction module 4,

the entity detection module 1 is used for detecting the entity and the entity type of the document;

the event type detection module 2 is used for detecting the event type of the sentence in the document;

the central sentence detection module 3 is configured to obtain argument roles and importance thereof required by the event type according to the event type, obtain importance of each sentence according to the importance of the argument roles, and detect a central sentence of the document based on the importance of the sentence;

and the argument extraction module 4 is used for extracting the entity as an argument and obtaining the argument role of the argument based on the detected entity, the event type and the central sentence.

The system of the present invention may further comprise a loss calculation module 5 for calculating the loss of the chinese document event extraction.

In the invention, although the processing logic is divided into different modules, during the model training stage the model parameters are optimized jointly so as to minimize the overall model loss. In use, the model needs only the original text as input to produce the final event extraction result directly, without the modules having to be called manually in sequence; it is therefore an end-to-end model and avoids error propagation.

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for extracting Chinese document events is characterized by comprising the following steps:

step 101: detecting an entity and an entity type of the document;

step 102: detecting an event type of a sentence in the document;

step 103: obtaining argument roles and importance degrees thereof required by the event types according to the event types;

step 104: obtaining the importance of each sentence according to the importance of the argument role;

step 105: detecting a central sentence of the document based on the importance of the sentence;

step 106: extracting an entity as an argument based on the detected entity, the event type and the central sentence, and obtaining the argument role of the argument;

in step 103, the method for obtaining the argument roles and the importance thereof required by the event types includes:

acquiring the occurrence frequency of an event type and the frequency of argument roles under the event type within the range of a first training set;

obtaining the importance of the argument role to the event type according to the times of the argument role under the event type;

the relative importance of the argument role to the event type is:

the inverse importance of the argument role to the event type is:

the normalized argument role r has an importance for event type v of:

2. The method of chinese document event extraction as claimed in claim 1, wherein the method of detecting entities and entity types comprises:

creating a first training set, and marking entities and entity types for samples of the first training set;

training the first training set based on a bidirectional LSTM network, an attention mechanism and a conditional random field to obtain an entity recognition model;

and detecting the entity and the entity type of the document through an entity recognition model.

3. The method for extracting events from a chinese document as recited in claim 2, further comprising a method of vectorizing chinese characters:

establishing a search matrix from Chinese characters to character vectors;

inputting the Chinese characters into the search matrix to obtain character vectors.

4. The method of Chinese document event extraction as claimed in claim 3, wherein the method of detecting the entity and entity type of the document by entity recognition model comprises:

marking the position and the type of the entity in the sample of the first training set through the BIO label;

converting Chinese characters of sentences in the sample into character vectors based on the search matrix;

inputting the sentences converted into character vectors into a bidirectional LSTM network of an entity recognition model, and performing hidden layer splicing in a forward direction and a reverse direction to be used as hidden vectors of current characters of the sentences;

inputting the hidden vector into an attention mechanism with 8 self-attention heads to obtain an output vector;

and inputting the output vector into the conditional random field, and calculating the predicted BIO label of the sample.

5. The method of Chinese document event extraction as claimed in claim 4, wherein the method of detecting event type of sentences in the document comprises:

performing maximum pooling operation on the hidden vector to obtain a first sentence vector;

inputting the first sentence vector into a first full-connection matrix of two categories to obtain the normalized probability of the sentence event type;

and obtaining the event type of the sentence according to the normalized probability.

6. The method for extracting events of Chinese documents as claimed in claim 3, wherein the method for extracting entities as arguments comprises:

converting each Chinese character of the detected entity into a character vector through the search matrix, and obtaining an entity vector through a max pooling operation:

$$e_l = \mathrm{MaxPooling}\{m_{i,j}, \ldots, m_{i,k}\}$$

where $e_l$ is the entity vector and $m_{i,j}, \ldots, m_{i,k}$ are the vectors of the characters covered by the entity, which spans the j-th through the k-th Chinese character of the sentence numbered i;

inputting Chinese characters of the sentence where the entity is located into the search matrix, and performing maximum pooling operation on an output value to obtain a second sentence vector;

7. The method for Chinese document event extraction according to claim 1, further comprising a method for calculating prediction loss:

obtaining the argument extraction loss $L_{dee}$ according to the detected arguments and their argument roles and the true arguments and argument roles of the document;

$$L_{total} = \lambda_1 L_{see} + \lambda_2 L_{tri} + \lambda_3 L_{dee}$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are weighting factors.

8. A system for implementing the method of any one of claims 1-7, comprising: an entity detection module, an event type detection module, a central sentence detection module and an argument extraction module,