CN112231447B - Method and system for extracting Chinese document events - Google Patents

Method and system for extracting Chinese document events

Info

Publication number
CN112231447B
Authority
CN
China
Prior art keywords
entity
argument
sentence
event type
event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011315453.5A
Other languages
Chinese (zh)
Other versions
CN112231447A (en
Inventor
王雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Touzhi Information Technology Co ltd
Original Assignee
Hangzhou Touzhi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Touzhi Information Technology Co ltd filed Critical Hangzhou Touzhi Information Technology Co ltd
Priority to CN202011315453.5A priority Critical patent/CN112231447B/en
Publication of CN112231447A publication Critical patent/CN112231447A/en
Application granted granted Critical
Publication of CN112231447B publication Critical patent/CN112231447B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a system for extracting events from Chinese documents. The method comprises the following steps: detecting the entities of the document and their entity types; detecting the event type of the sentences in the document; obtaining, according to the event type, the argument roles required by the event type and their importance; obtaining the importance of each sentence according to the importance of the argument roles; detecting the central sentence of the document based on sentence importance; and, based on the detected entities, the event type and the central sentence, extracting entities as arguments and obtaining their argument roles. By defining the importance of argument roles to event types, detecting the central sentence, and obtaining arguments and argument roles from the relationship between sentence entities, event types and the central sentence, the event type, the arguments and the argument roles can be identified without relying on trigger words, even when trigger words are absent, which improves the recall rate; the workload of annotating trigger words in the training set is also reduced.

Description

Method and system for extracting Chinese document events
Technical Field
The invention relates to the technical field of text information extraction, in particular to a method and a system for extracting Chinese document events.
Background
Event extraction is an important basis for natural language understanding. It gives people a convenient way to acquire knowledge quickly, is a prerequisite for computers to understand natural language, and benefits applications such as automatic summarization, machine translation and question answering.
Event extraction detects the type of an event in unstructured text and extracts from the text the arguments that form the core elements of an event of that type, so that the event information contained in the text is expressed as a structured tuple (event type, event argument 1, event argument 2, …, event argument n). With the explosive growth of information on the internet, event extraction offers a way to obtain useful information efficiently from massive amounts of text, and it has become a research focus in academia and industry.
In the prior art, identifying the event type depends explicitly on event trigger words. In practice, trigger words usually have to be compiled manually as specific phrases, which increases the annotation workload for the training data set. Moreover, trigger words do not necessarily appear in the text that describes an event, so the recall rate of event type detection is low.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides a method and a system for extracting Chinese document events, which improve the recall rate of event type detection.
The invention discloses a method for extracting Chinese document events, which comprises the following steps: detecting an entity and an entity type of the document; detecting an event type of a sentence in the document; obtaining argument roles and importance degrees thereof required by the event types according to the event types; obtaining the importance of each sentence according to the importance of the argument role; detecting a central sentence of the document based on the importance of the sentence; and extracting the entity as an argument based on the detected entity, the event type and the central sentence, and obtaining the argument role of the argument.
Preferably, the method for detecting the entity and the entity type comprises the following steps: creating a first training set, and marking entities and entity types for samples of the first training set; training the first training set based on a bidirectional LSTM network, an attention mechanism and a conditional random field to obtain an entity recognition model; and detecting the entity and the entity type of the document through an entity recognition model.
Preferably, the method of the present invention further comprises a Chinese character vectorization method: establishing a search matrix from Chinese characters to character vectors; inputting the Chinese characters into the search matrix to obtain character vectors.
Preferably, the method for detecting the entity and the entity type of the document through the entity recognition model comprises the following steps: labeling the location and type of the entities in the sample with BIO tags; converting the Chinese characters of the sentences in the sample into character vectors based on the search matrix; inputting the sentences converted into character vectors into the bidirectional LSTM network of the entity recognition model, and concatenating the forward and backward hidden layers to form the hidden vector of the current character of the sentence; inputting the hidden vector into an attention mechanism with 8 self-attention heads to obtain an output vector; and inputting the output vector into the conditional random field to calculate the predicted BIO labels of the sample.
Preferably, the method for detecting the event type of the sentence in the document comprises: performing a max pooling operation on the hidden vectors to obtain a first sentence vector; inputting the first sentence vector into a first fully connected matrix for binary classification to obtain the normalized probability of the sentence event type; and obtaining the event type of the sentence according to the normalized probability.
Preferably, the method for obtaining the argument roles and the importance thereof required by the event types comprises the following steps: acquiring the occurrence frequency of an event type and the frequency of argument roles under the event type within the range of a first training set; and obtaining the importance of the argument role to the event type according to the times of the argument role under the event type.
Preferably, the relative importance of the argument role to the event type is:
Figure BDA0002791215450000021
where IR (r, v) is defined as the relative importance of the argument role to the event type, r is the argument role, v is the event type, j is used to traverse each possible event type,
Figure BDA0002791215450000022
representing the times of occurrence of argument role r in event type v in the training set;
the inverse importance of the argument role to the event type is:
Figure BDA0002791215450000023
wherein IC(r) is defined as the inverse importance, |V| is the number of event types in the event type set V, and |{v ∈ V : r ∈ v}| is the number of event types that contain argument role r;
the normalized argument role r has an importance for event type v of:
Figure BDA0002791215450000031
where I (r, v) represents the importance of the normalized argument role to the event type.
Preferably, the method for extracting the entity as argument comprises:
converting each Chinese character of the detected entity into a character vector through the search matrix, and obtaining the entity vector through a max pooling operation:
$e_l = \mathrm{Maxpooling}\{m_{i,j}, \ldots, m_{i,k}\}$
wherein $e_l$ is the entity vector, and $m_{i,j}, \ldots, m_{i,k}$ are the vectors of the entity $m$, which spans from the j-th to the k-th Chinese character of the sentence numbered i;
after the entity vector, splicing the entity type code and the distance code from the entity to the central sentence to obtain a third entity vector;
inputting the Chinese characters of the sentence where the entity is located into the search matrix, and performing maximum pooling operation on the output value to obtain a second sentence vector;
after the second sentence vector, sequentially splicing the event type code of the sentence and the distance code from the sentence to the central sentence to obtain a third sentence vector;
inputting the third entity vector and the third sentence vector into a 4-layer Transformer network to obtain a fourth entity vector and a fourth sentence vector which fully exchange text semantic information;
and inputting the fourth entity vector into a second fully connected matrix for binary classification to obtain the result of whether the entity serves as an argument of the sentence.
Preferably, the method of the present invention further comprises a method of calculating the predicted loss:
obtaining the entity detection loss $L_{see}$ according to the detected entities and their entity types and the real entities and their types in the document;
obtaining the event type detection loss $L_{tri}$ according to the detected event type and the central sentence;
obtaining the argument extraction loss $L_{dee}$ according to the detected arguments and their argument roles and the true arguments and their argument roles in the document;
obtaining the loss of Chinese document event extraction according to the entity detection loss, the event type detection loss and the argument extraction loss:
$L_{total} = \lambda_1 L_{see} + \lambda_2 L_{tri} + \lambda_3 L_{dee}$
wherein $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the respective weighting factors.
In another aspect, the present invention further provides a system for implementing the method, including: an entity detection module, an event type detection module, a central sentence detection module and an argument extraction module,
the entity detection module is used for detecting the entity and the entity type of the document;
the event type detection module is used for detecting the event type of sentences in the document;
the central sentence detection module is used for acquiring argument roles and importance thereof required by the event type according to the event type, acquiring the importance of each sentence according to the importance of the argument roles, and detecting the central sentence of the document based on the importance of the sentence;
the argument extraction module is used for extracting the entity as an argument and obtaining the argument role of the argument based on the detected entity, the event type and the central sentence.
Compared with the prior art, the invention has the beneficial effects that:
the method defines the importance of argument roles to event types, detects a central sentence, and obtains arguments and argument roles from the relationship between sentence entities, event types and the central sentence; the event type, the arguments and the argument roles can therefore be identified without relying on trigger words, even when trigger words are absent, which improves the recall rate; the workload of annotating trigger words in the training set is reduced; and the granularity of event extraction is raised from sentence level to document level.
Drawings
FIG. 1 is a flow chart of a method of Chinese document event extraction of the present invention;
FIG. 2 is a method flow diagram of a method of Chinese character vectorization;
FIG. 3 is a diagram of Chinese character conversion into character vectors;
FIG. 4 is a flow chart of a method of detecting entities and entity types;
FIG. 5 is a flow diagram of a method for detecting entities and entity types of the document through an entity recognition model;
FIG. 6 is a flow diagram of a method of detecting event types for sentences in the document;
FIG. 7 is a flow diagram of a method of obtaining argument roles and their importance as required by event types;
FIG. 8 is a flow diagram of a method of extracting entities as arguments;
FIG. 9 is a flow chart of a method of calculating predicted losses;
FIG. 10 is a logical block diagram of the system of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention is described in further detail below with reference to the attached drawing figures:
A method for extracting Chinese document events, as shown in FIG. 1, the method comprising:
Step 101: an entity and an entity type of the document are detected.
An Entity refers to an object with a certain semantic category, such as a time, a place, a person name, a place name or a number, and is a candidate for an argument. The entity type is the category to which the entity belongs, such as name, location or time. An Argument is an element involved in the occurrence of an event and is composed of a group of entities; the Argument Role declares the role that an argument plays in an event, e.g., "Chen Liequan" is the "pledgor". In the field of event extraction in natural language processing, several argument roles are defined under each event type to describe the complete information of an event.
The structured event is composed of arguments and their argument roles, as shown in the following table:
Figure BDA0002791215450000051
Step 102: detecting an event type of a sentence in the document. Based on the entity recognition result and the wording of the text, the type of event that the text describes is detected.
Step 103: obtaining, according to the event type, the argument roles required by the event type and their importance. From the training set, the argument roles under each event type and their numbers of occurrences can be counted, and the importance of an argument role to the event type is judged from how frequently the role occurs.
In a specific embodiment, the argument roles required by the event type are obtained from a dictionary, which is constructed from the training set and records the argument roles that arguments play under the event type, as shown in the following table:
Chen Liequan: pledgor
12780000 shares: pledged shares
Changjiang Securities (Shanghai) Asset Management Co., Ltd.: pledgee
December 30, 2015: start date
January 4, 2018: end date
where "Chen Liequan" is an argument and "pledgor" is its argument role.
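The dictionary and the structured event can be pictured with a small data structure. The following is a minimal sketch only: the event type name `EquityPledge`, the class `EventRecord`, the table `ROLE_DICTIONARY` and the helper `missing_roles` are all hypothetical names introduced for illustration and do not appear in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class EventRecord:
    """A structured event: an event type plus a role -> argument mapping."""
    event_type: str
    arguments: dict = field(default_factory=dict)


# Hypothetical dictionary built from a training set:
# event type -> argument roles that the event type requires.
ROLE_DICTIONARY = {
    "EquityPledge": ["pledgor", "pledged shares", "pledgee",
                     "start date", "end date"],
}


def missing_roles(record):
    """Return the required roles that have no argument in the record yet."""
    required = ROLE_DICTIONARY.get(record.event_type, [])
    return [role for role in required if role not in record.arguments]


record = EventRecord(
    event_type="EquityPledge",
    arguments={
        "pledgor": "Chen Liequan",
        "pledged shares": "12780000 shares",
        "pledgee": "Changjiang Securities (Shanghai) Asset Management Co., Ltd.",
        "start date": "December 30, 2015",
    },
)
print(missing_roles(record))  # ['end date']
```

Keeping the required roles in a per-event-type dictionary makes it easy to see which roles of an event are still unfilled after extraction.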
Step 104: and obtaining the importance of each sentence according to the importance of the argument role.
Step 105: and detecting a central sentence of the document based on the importance of the sentence. In a Chinese document, the central sentence serves as a summary and overview, and is more important than the other sentences in the document.
Step 106: and extracting the entity as an argument based on the detected entity, the event type and the central sentence, and obtaining the argument role of the argument. Entities can be extracted as arguments by the relationship of the entity and event type to the central sentence, such as semantic distance.
The method defines the importance of argument roles to event types, detects a central sentence, and obtains arguments and argument roles from the relationship between sentence entities, event types and the central sentence. The event type, the arguments and the argument roles can therefore be identified without relying on trigger words, even when trigger words are absent, which improves the recall rate. The workload of annotating trigger words in the training set is reduced, and the granularity of event extraction is raised from sentence level to document level. Because an end-to-end event extraction method is adopted, the error propagation of a pipeline method is avoided.
As shown in fig. 2, the present invention also includes a method for vectorizing chinese characters:
step 201: and establishing a search matrix from the Chinese characters to the character vectors.
In one embodiment, a lookup matrix from Chinese characters to character vector representations is trained on a large number of Chinese texts from the event extraction domain using the unsupervised skip-gram method. As shown in FIG. 3, the characters "中" ("middle") and "国" ("country") are each input into the search matrix to obtain the corresponding character vectors. The closer two Chinese characters are semantically, the smaller the Euclidean distance between their character vectors in the multidimensional space; the more they differ semantically, the larger the distance. Converting Chinese characters into character vectors makes them convenient for computation and also makes it possible to measure the semantic distance between characters.
Step 202: inputting the Chinese characters into the search matrix to obtain character vectors.
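Steps 201 and 202 can be sketched as a table lookup. This is an illustration under stated assumptions: the 4-dimensional vectors below are made-up toy values (a real search matrix would be learned from text), and `LOOKUP_MATRIX`, `UNKNOWN`, `vectorize` and `euclidean` are hypothetical names, including the zero-vector fallback for unseen characters.

```python
import math

# Toy character vectors with made-up values, standing in for a learned matrix.
LOOKUP_MATRIX = {
    "中": [0.9, 0.1, 0.3, 0.0],
    "国": [0.8, 0.2, 0.4, 0.1],
    "股": [0.0, 0.9, 0.1, 0.7],
}
UNKNOWN = [0.0, 0.0, 0.0, 0.0]  # assumed fallback for characters not in the matrix


def vectorize(sentence):
    """Map every character of the sentence to its character vector."""
    return [LOOKUP_MATRIX.get(ch, UNKNOWN) for ch in sentence]


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


vectors = vectorize("中国")
d_close = euclidean(LOOKUP_MATRIX["中"], LOOKUP_MATRIX["国"])
d_far = euclidean(LOOKUP_MATRIX["中"], LOOKUP_MATRIX["股"])
print(len(vectors), d_close < d_far)  # 2 True
```

With these toy values the two characters chosen to be "similar" lie closer together than the unrelated pair, which is the property the text attributes to a trained matrix.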
Example 1
As shown in fig. 4, the present embodiment provides a method for detecting an entity and an entity type:
step 301: a first training set is created and entities and entity types are labeled for samples of the first training set.
In one embodiment, the BIO tag system is used to tag entities and entity types as shown in the following table:
Figure BDA0002791215450000071
where B marks the first Chinese character of an entity, I marks a Chinese character inside an entity, O marks a Chinese character that is not part of any entity, and the word after the "-" is the entity type.
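The BIO scheme can be illustrated by converting entity spans into per-character tags. This is a minimal sketch: the sample sentence, the type names `PER` and `LOC`, and the function `bio_encode` are assumptions for illustration, and spans are taken to have an exclusive end index.

```python
def bio_encode(sentence, entities):
    """Build one BIO tag per character.

    entities: list of (start, end, entity_type) with an exclusive end index.
    """
    tags = ["O"] * len(sentence)
    for start, end, entity_type in entities:
        tags[start] = "B-" + entity_type          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + entity_type          # remaining characters
    return tags


sentence = "张三在北京"  # hypothetical sample: "Zhang San is in Beijing"
tags = bio_encode(sentence, [(0, 2, "PER"), (3, 5, "LOC")])
print(tags)  # ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC']
```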
The location of each argument may also be marked in the first training set, for example as a tuple (send_idx, char_s, char_e), where send_idx is the number of the sentence in which the argument appears in the document, char_s is the starting character number of the argument, and char_e is its ending character number, as shown in the following table:
Chen Liequan: (0, 0, 3)
12780000 shares: (1, 3, 12)
Changjiang Securities (Shanghai) Asset Management Co., Ltd.: (0, 52, 68)
December 30, 2015: (1, 21, 32)
January 4, 2018: (1, 39, 48)
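A location tuple of this kind can be resolved back to text by indexing the sentence and slicing the characters. The sketch below assumes that char_e is an exclusive end index; the patent does not state this. The two-sentence document and the function `argument_text` are hypothetical.

```python
def argument_text(document, location):
    """Return the argument text addressed by (send_idx, char_s, char_e).

    document: list of sentences (strings); char_e is assumed to be exclusive.
    """
    send_idx, char_s, char_e = location
    return document[send_idx][char_s:char_e]


# Hypothetical two-sentence document.
document = ["张三质押股份。", "共计100股。"]
print(argument_text(document, (0, 0, 2)))  # 张三
print(argument_text(document, (1, 2, 6)))  # 100股
```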
Step 302: and training the first training set based on the bidirectional LSTM network, the attention mechanism and the conditional random field to obtain an entity recognition model.
Step 303: and detecting the entity and the entity type of the document through an entity recognition model.
In a specific embodiment, as shown in fig. 5, the method for detecting the entity and the entity type of the document through the entity recognition model includes:
step 401: the location and type of entities in the sample are labeled by BIO tags.
Step 402: converting the Chinese characters of the sentences in the sample into character vectors $x_{i,j}$ based on the search matrix.
Step 403: inputting the sentence $s_i$, converted into character vectors, into the bidirectional LSTM network of the entity recognition model, and concatenating the forward and backward hidden layers to obtain the hidden vector $h_{i,j}$ of the current character in the sentence.
The bidirectional LSTM network processes each sentence in both the forward and the backward direction, which improves the recognition of semantics.
Step 404: inputting the hidden vector $h_{i,j}$ into an attention mechanism with 8 self-attention heads to obtain the output vector $m_{i,j}$.
The attention mechanism is a resource allocation scheme and a principal means of addressing information overload: it allocates computing resources to the more important tasks, giving the neural network the ability to focus on a subset of its inputs (or features).
Step 405: inputting the output vector $m_{i,j}$ into the conditional random field to obtain the predicted BIO labels of the sample.
A Conditional Random Field (CRF) is a discriminative probabilistic model and a type of random field; it is a basic model in natural language processing, commonly used to label or analyze sequence data such as natural language text.
The bidirectional LSTM network, the attention mechanism and the conditional random field are each prior art and are not described in detail here. Steps 401-405 describe how BIO tags are identified, and can also be applied to the construction of the entity recognition model.
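Once BIO labels have been predicted, entities and their types still have to be read off the label sequence. The patent does not describe this decoding step, so the following is only one plausible sketch; `bio_decode`, the sample sentence and the type names are hypothetical, and stray I- tags with no preceding B- tag are simply ignored.

```python
def bio_decode(sentence, tags):
    """Turn per-character BIO tags into (text, type, start, end) entities.

    The end index is exclusive. An I- tag that does not continue an open
    entity of the same type is ignored.
    """
    entities = []
    start, current_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and current_type == tag[2:]
        if start is not None and not continues:
            entities.append((sentence[start:i], current_type, start, i))
            start, current_type = None, None
        if tag.startswith("B-"):
            start, current_type = i, tag[2:]
    return entities


sentence = "张三在北京"  # hypothetical sample
predicted = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(bio_decode(sentence, predicted))
# [('张三', 'PER', 0, 2), ('北京', 'LOC', 3, 5)]
```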
Example 2
As shown in fig. 6, the present embodiment provides a method of detecting an event type of a sentence in the document:
Step 501: performing a max pooling operation on the hidden vectors $h_{i,j}$ of the characters in the sentence to obtain a first sentence vector $g_i$.
The max pooling operation (Maxpooling) keeps the maximum value as the output of the pooling layer and discards all other feature values; that is, only the strongest feature is retained and weaker features are dropped, which reduces the number of model parameters and mitigates overfitting.
Step 502: inputting the first sentence vector into a first fully connected matrix $W_v$ for binary classification to obtain the normalized probability of the sentence event type.
The first fully connected matrix is used to calculate the normalized probability of the event type: an output of 1 indicates that the sentence belongs to the event type, and an output of 0 indicates that it does not. In one embodiment, the Softmax function is used with the first fully connected matrix to perform the binary classification; Softmax converts the output values into a probability distribution whose values lie in [0,1] and sum to 1.
Step 503: and obtaining the event type of the sentence according to the normalized probability.
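Steps 501 to 503 can be sketched in plain Python. The hidden vectors and the weights of the two-row matrix below are made-up toy values, not trained parameters, and the function names are hypothetical; the sketch only shows how max pooling and a Softmax over two classes fit together.

```python
import math


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def softmax(scores):
    """Convert scores into probabilities in [0, 1] that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence_vector, weights):
    """weights has two rows: class 0 (not this event type) and class 1 (is)."""
    scores = [sum(w * x for w, x in zip(row, sentence_vector)) for row in weights]
    return softmax(scores)


# Toy hidden vectors for a three-character sentence.
hidden = [[0.1, -0.4, 0.3], [0.7, 0.2, -0.1], [0.0, 0.5, 0.2]]
g = max_pool(hidden)                         # first sentence vector
W_v = [[-1.0, 0.0, 0.5], [1.0, 1.0, 0.0]]    # toy weights for one event type
probs = classify(g, W_v)
label = 1 if probs[1] > probs[0] else 0
print(g, label)  # [0.7, 0.5, 0.3] 1
```

One such classifier per event type lets a sentence be tested independently against every event type.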
As shown in fig. 7, the method for obtaining the argument roles and their importance levels required by the event types includes:
step 601: acquiring the occurrence frequency of an event type and the frequency of argument roles under the event type within the range of a first training set;
step 602: and obtaining the importance of the argument role to the event type according to the times of the argument role under the event type.
Wherein, the relative importance of the argument role to the event type is as follows:
Figure BDA0002791215450000091
where IR (r, v) is defined as the relative importance of the argument role to the event type, r is the argument role, v is the event type, j is used to traverse each possible event type,
Figure BDA0002791215450000092
representing the times of occurrence of argument role r in event type v in the training set;
the inverse importance of the argument role to the event type is:
Figure BDA0002791215450000093
wherein IC(r) is defined as the inverse importance, |V| is the number of event types in the event type set V, and |{v ∈ V : r ∈ v}| is the number of event types that contain argument role r;
the normalized argument role r has an importance for event type v of:
Figure BDA0002791215450000094
where I (r, v) represents the importance of the normalized argument role r to the event type v.
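Because the three formulas are only available as images, their exact form cannot be read from this text. The sketch below therefore rests on explicit assumptions: IR(r, v) is taken as the count of role r under event type v divided by the sum of the counts of r over all event types j; IC(r) as the logarithm of |V| divided by the number of event types containing r; and I(r, v) as a Softmax of IR × IC over the roles of v. The counts and event type names are invented for illustration.

```python
import math


def role_importance(counts):
    """counts: {event_type: {role: occurrences in the training set}}.

    Returns {event_type: {role: normalized importance}} under the assumed
    formulas described in the accompanying text.
    """
    event_types = list(counts)
    importance = {}
    for v in event_types:
        scores = {}
        for r, n in counts[v].items():
            if n <= 0:
                continue  # a role that never occurs carries no evidence
            total = sum(counts[u].get(r, 0) for u in event_types)
            containing = sum(1 for u in event_types if counts[u].get(r, 0) > 0)
            relative = n / total                               # assumed IR(r, v)
            inverse = math.log(len(event_types) / containing)  # assumed IC(r)
            scores[r] = relative * inverse
        z = sum(math.exp(s) for s in scores.values())
        importance[v] = {r: math.exp(s) / z for r, s in scores.items()}  # assumed I(r, v)
    return importance


# Invented counts for two hypothetical event types.
counts = {
    "EquityPledge": {"pledgor": 10, "pledgee": 8, "date": 6},
    "EquityFreeze": {"shareholder": 5, "date": 5},
}
importance = role_importance(counts)
pledge = importance["EquityPledge"]
print(pledge["pledgor"] > pledge["date"])  # True
```

In this toy data "date" occurs under both event types, so its inverse importance is zero and it ends up less important than the roles that are specific to one event type.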
Based on the first training set, the argument roles required by each event type and their importance are obtained; the importance of each sentence of the Chinese document under the event type is then computed, and the sentence with the highest importance is selected as the central sentence.
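The text does not spell out how sentence importance is derived from role importance. One simple possibility, assumed here purely for illustration, is to sum the importance of the distinct argument roles found in each sentence and pick the highest-scoring sentence. The function `central_sentence` and the numbers are hypothetical.

```python
def central_sentence(sentence_roles, importance):
    """Pick the index of the most important sentence.

    sentence_roles: one list of candidate argument roles per sentence.
    importance: role -> importance for the detected event type.
    Assumption: a sentence's importance is the sum over its distinct roles.
    Ties are resolved in favour of the earliest sentence.
    """
    if not sentence_roles:
        raise ValueError("the document has no sentences")
    scores = [sum(importance.get(role, 0.0) for role in set(roles))
              for roles in sentence_roles]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


importance = {"pledgor": 0.4, "pledgee": 0.4, "date": 0.2}  # made-up values
sentence_roles = [["pledgor", "pledgee"], ["date"], ["pledgor", "date", "date"]]
best, scores = central_sentence(sentence_roles, importance)
print(best)  # 0
```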
Example 3
As shown in fig. 8, the present embodiment provides a method for extracting entities as arguments:
Step 701: converting each Chinese character of the detected entity into a character vector through the search matrix, and obtaining an entity vector through a max pooling operation:
$e_l = \mathrm{Maxpooling}\{m_{i,j}, \ldots, m_{i,k}\}$
wherein $e_l$ is the entity vector, and $m_{i,j}, \ldots, m_{i,k}$ are the vectors of the entity $m$, which spans from the j-th to the k-th Chinese character of the sentence numbered i.
Step 702: and after the entity vector, splicing the entity type code and the distance code from the entity to the central sentence to obtain a third entity vector E'. The distance code is used for representing the number difference between the sentence where the entity is located and the central sentence, if S1 is the central sentence, the number difference between S2 and the central sentence is 1.
Step 703: and inputting the Chinese characters of the sentence where the entity is located into the search matrix, and performing maximum pooling operation on the output value to obtain a second sentence vector G'.
Step 704: and after the second sentence vector G', splicing the event type code of the sentences and the distance code from the sentences to the central sentence to obtain a third sentence vector. Wherein the event type code is used to mark the event type.
Step 705: inputting the third entity vector and the third sentence vector into a 4-layer Transformer network to obtain a fourth entity vector and a fourth sentence vector which fully exchange text semantic information;
Step 706: inputting the fourth entity vector into a second fully connected matrix $w_{v,r}$ for binary classification to obtain the result of whether the entity serves as an argument of the sentence, together with its argument role.
The second fully connected matrix $w_{v,r}$ is used to judge whether the entity is an argument of the current event type: an output of 1 indicates that it is, and an output of 0 indicates that it is not. The argument role may be obtained from the dictionary.
All arguments to a sentence are extracted by computing each entity in the sentence in turn.
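Steps 701 and 702 can be sketched as max pooling followed by concatenation. How the entity type code and the distance code are encoded is not stated in the text; one-hot codes are assumed here, and `ENTITY_TYPES`, `MAX_DISTANCE` and `third_entity_vector` are hypothetical names with made-up values.

```python
ENTITY_TYPES = ["PER", "ORG", "DATE"]  # hypothetical entity type inventory
MAX_DISTANCE = 3                       # hypothetical cap on the sentence-number difference


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def one_hot(index, size):
    """One-hot code of the given size with a 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def third_entity_vector(char_vectors, entity_type, sentence_idx, central_idx):
    """Entity vector + entity type code + distance-to-central-sentence code."""
    entity_vector = max_pool(char_vectors)
    type_code = one_hot(ENTITY_TYPES.index(entity_type), len(ENTITY_TYPES))
    distance = min(abs(sentence_idx - central_idx), MAX_DISTANCE)
    distance_code = one_hot(distance, MAX_DISTANCE + 1)
    return entity_vector + type_code + distance_code


# A two-character entity in sentence 2, with sentence 1 as the central sentence.
vector = third_entity_vector([[0.2, 0.9], [0.6, 0.1]], "PER", 2, 1)
print(vector)  # [0.6, 0.9, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

The same pattern applies to the third sentence vector of step 704, with the event type code in place of the entity type code.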
Example 4
As shown in fig. 9, the present embodiment provides a method of calculating a predicted loss:
Step 801: obtaining the entity detection loss $L_{see}$ according to the detected entities and their entity types and the real entities and their types in the document.
Step 802: obtaining the event type detection loss $L_{tri}$ according to the detected event type and the central sentence.
Step 803: obtaining the argument extraction loss $L_{dee}$ according to the detected arguments and their argument roles and the true arguments and their argument roles in the document.
Step 804: obtaining the loss of Chinese document event extraction according to the entity detection loss, the event type detection loss and the argument extraction loss:
$L_{total} = \lambda_1 L_{see} + \lambda_2 L_{tri} + \lambda_3 L_{dee}$
wherein $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the respective weighting factors.
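Assuming that the total loss is the weighted sum of the three component losses, step 804 reduces to a one-line combination. The function name `total_loss`, the default weights and the sample loss values below are illustrative assumptions; the patent gives no concrete values for the weighting factors.

```python
def total_loss(l_see, l_tri, l_dee, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of entity detection, event type detection and
    argument extraction losses (weights are lambda_1..lambda_3)."""
    lambda_1, lambda_2, lambda_3 = weights
    return lambda_1 * l_see + lambda_2 * l_tri + lambda_3 * l_dee


# Made-up loss values and weights.
loss = total_loss(0.5, 0.2, 0.3, weights=(0.25, 0.25, 0.5))
print(round(loss, 3))  # 0.325
```

Optimizing this single scalar is what allows the three sub-tasks to be trained jointly rather than one after another.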
The present invention also provides a system for implementing the above method, as shown in fig. 10, the system includes: an entity detection module 1, an event type detection module 2, a central sentence detection module 3 and an argument extraction module 4,
the entity detection module 1 is used for detecting the entity and the entity type of the document;
the event type detection module 2 is used for detecting the event type of the sentence in the document;
the central sentence detection module 3 is configured to obtain argument roles and importance thereof required by the event type according to the event type, obtain importance of each sentence according to the importance of the argument roles, and detect a central sentence of the document based on the importance of the sentence;
and the argument extraction module 4 is used for extracting the entity as an argument and obtaining the argument role of the argument based on the detected entity, the event type and the central sentence.
The system of the present invention may further comprise a loss calculation module 5 for calculating the loss of the chinese document event extraction.
Although the processing logic of the invention is divided into separate modules, in the training stage the model parameters are optimized jointly so as to minimize the overall model loss. In use, the model produces the final event extraction result directly from the original text, without the modules having to be called manually one after another; it is therefore an end-to-end model and avoids error propagation.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A method for extracting Chinese document events, characterized by comprising the following steps:
step 101: detecting an entity and an entity type of the document;
step 102: detecting an event type of a sentence in the document;
step 103: obtaining argument roles and importance degrees thereof required by the event types according to the event types;
step 104: obtaining the importance of each sentence according to the importance of the argument role;
step 105: detecting a central sentence of the document based on the importance of the sentence;
step 106: extracting an entity as an argument based on the detected entity, the event type and the central sentence, and obtaining the argument role of the argument;
in step 103, the method for obtaining the argument roles and the importance thereof required by the event types includes:
acquiring the occurrence frequency of an event type and the frequency of argument roles under the event type within the range of a first training set;
obtaining the importance of the argument role to the event type according to the times of the argument role under the event type;
the relative importance of the argument role to the event type is:
Figure FDA0003901074580000011
where IR (r, v) is defined as the relative importance of the argument role to the event type, r is the argument role, v is the event type, j is used to traverse each possible event type,
Figure FDA0003901074580000012
representing the times of occurrence of argument role r in event type v in the training set;
the inverse importance of the argument role to the event type is:
Figure FDA0003901074580000013
wherein IC(r) is defined as the inverse importance, $|V|$ represents the number of event types in the event type set $V$, and $|\{v \in V : r \in v\}|$ represents the number of event types containing argument role r;
the normalized importance of argument role r for event type v is:
Figure FDA0003901074580000014
where I(r, v) represents the normalized importance of the argument role for the event type.
2. The method of Chinese document event extraction as claimed in claim 1, wherein the method of detecting entities and entity types comprises:
creating a first training set, and marking entities and entity types for samples of the first training set;
training the first training set based on a bidirectional LSTM network, an attention mechanism and a conditional random field to obtain an entity recognition model;
and detecting the entity and the entity type of the document through an entity recognition model.
3. The method for extracting events from a Chinese document as recited in claim 2, further comprising a method of vectorizing Chinese characters:
establishing a search matrix from Chinese characters to character vectors;
inputting the Chinese characters into the search matrix to obtain character vectors.
4. The method of Chinese document event extraction as claimed in claim 3, wherein the method of detecting the entity and entity type of the document by entity recognition model comprises:
marking the position and the type of the entity in the sample of the first training set through the BIO label;
converting Chinese characters of sentences in the sample into character vectors based on the search matrix;
inputting the sentences converted into character vectors into a bidirectional LSTM network of an entity recognition model, and concatenating the forward and backward hidden states to form the hidden vector of the current character of the sentence;
inputting the hidden vector into a self-attention mechanism with 8 attention heads to obtain an output vector;
and inputting the output vector into the conditional random field, and calculating the predicted BIO label of the sample.
5. The method of Chinese document event extraction as claimed in claim 4, wherein the method of detecting event type of sentences in the document comprises:
performing maximum pooling operation on the hidden vector to obtain a first sentence vector;
inputting the first sentence vector into a first fully connected matrix for binary classification to obtain the normalized probability of the sentence event type;
and obtaining the event type of the sentence according to the normalized probability.
6. The method for extracting events of Chinese documents as claimed in claim 3, wherein the method for extracting entities as arguments comprises:
converting each Chinese character of the detected entity into a character vector by searching the matrix, and obtaining an entity vector through maximum pooling operation:
$e_l = \mathrm{Maxpooling}\{m_{i,j}, \ldots, m_{i,k}\}$
wherein $e_l$ is the entity vector, and $m_{i,j}, \ldots, m_{i,k}$ are the character vectors of entity m, which spans from the j-th Chinese character to the k-th Chinese character of the sentence with serial number i;
after the entity vector, splicing the entity type code and the distance code from the entity to the central sentence to obtain a third entity vector;
inputting Chinese characters of the sentence where the entity is located into the search matrix, and performing maximum pooling operation on an output value to obtain a second sentence vector;
after the second sentence vector, sequentially splicing the event type code of the sentence and the distance code from the sentence to the central sentence to obtain a third sentence vector;
inputting the third entity vector and the third sentence vector into a 4-layer Transformer network to obtain a fourth entity vector and a fourth sentence vector which fully exchange text semantic information;
and inputting the fourth entity vector into a second fully connected matrix for binary classification to obtain a result of whether the entity is used as a sentence argument.
7. The method for Chinese document event extraction according to claim 1, further comprising a method for calculating prediction loss:
obtaining the entity detection loss $L_{see}$ according to the detected entity and its entity type and the real entity and its type in the document;
obtaining the event type detection loss $L_{tri}$ according to the detected event type and the central sentence;
obtaining the argument extraction loss $L_{dee}$ according to the detected argument and its argument role and the true argument and its argument role in the document;
obtaining the loss of Chinese document event extraction according to the entity detection loss, the event type detection loss and the argument extraction loss:
$L_{total} = \lambda_1 L_{see} + \lambda_2 L_{tri} + \lambda_3 L_{dee}$
wherein $\lambda_1$, $\lambda_2$ and $\lambda_3$ are weighting factors.
8. A system for implementing the method of any one of claims 1-7, comprising: an entity detection module, an event type detection module, a central sentence detection module and an argument extraction module,
the entity detection module is used for detecting the entity and the entity type of the document;
the event type detection module is used for detecting the event type of sentences in the document;
the central sentence detection module is used for acquiring argument roles and importance thereof required by the event type according to the event type, acquiring the importance of each sentence according to the importance of the argument roles, and detecting the central sentence of the document based on the importance of the sentence;
the argument extraction module is used for extracting the entity as an argument and obtaining the argument role of the argument based on the detected entity, the event type and the central sentence.
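The role-importance computation of claim 1 can be illustrated with a short Python sketch. The formula images are not reproduced in this text, so the exact expressions are unknown; the code below assumes a TF-IDF-like form that is consistent with the symbol definitions only: IR(r, v) as the share of role r's occurrences that fall under event type v, IC(r) as log(|V| / number of event types containing r), and I(r, v) as the product IR·IC normalized over the roles of v. The function name `role_importance`, the input layout, and the event type and role names in the example are all hypothetical.

```python
import math
from collections import defaultdict


def role_importance(counts):
    """Assumed TF-IDF-style importance of each argument role per event type.

    counts[v][r] is the number of times role r occurs under event type v
    in the training set. Returns a dict keyed by (role, event_type).
    """
    num_types = len(counts)
    role_totals = defaultdict(int)      # occurrences of r summed over all event types
    types_with_role = defaultdict(int)  # number of event types that contain r
    for roles in counts.values():
        for r, n in roles.items():
            if n > 0:
                role_totals[r] += n
                types_with_role[r] += 1

    importance = {}
    for v, roles in counts.items():
        raw = {}
        for r, n in roles.items():
            if n > 0:
                ir = n / role_totals[r]                       # assumed IR(r, v)
                ic = math.log(num_types / types_with_role[r])  # assumed IC(r)
                raw[r] = ir * ic
        total = sum(raw.values())
        for r, score in raw.items():
            # Normalize within the event type; fall back to 0.0 if every score is 0.
            importance[(r, v)] = score / total if total > 0 else 0.0
    return importance


# Hypothetical training-set counts for two event types.
example = role_importance({
    "pledge": {"pledger": 10, "date": 5},
    "freeze": {"frozen_shares": 8, "date": 5},
})
```

Under these assumptions, a role shared by every event type (here `date`) receives zero importance, while a role unique to one event type dominates that type's score. A sentence-importance score could then be built from these values, but the text here does not specify how, so that step is left out.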
CN202011315453.5A 2020-11-21 2020-11-21 Method and system for extracting Chinese document events Active CN112231447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011315453.5A CN112231447B (en) 2020-11-21 2020-11-21 Method and system for extracting Chinese document events

Publications (2)

Publication Number Publication Date
CN112231447A CN112231447A (en) 2021-01-15
CN112231447B true CN112231447B (en) 2023-04-07

Family

ID=74124326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011315453.5A Active CN112231447B (en) 2020-11-21 2020-11-21 Method and system for extracting Chinese document events

Country Status (1)

Country Link
CN (1) CN112231447B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant