CN113191118B - Text relation extraction method based on sequence annotation - Google Patents
- Publication number: CN113191118B (application CN202110501103.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- entity
- vector
- sequence
- relationship
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of data processing, in particular to a text relation extraction method based on sequence annotation. The method comprises: constructing a training data set similar to the prediction data, and presetting all possible bidirectional entity relations and three fixed dependency relations; dividing an input sentence into a word sequence and feeding it into a pre-training model to obtain a representation vector for each word in the sentence; forming a unique word-pair sequence from the word vector sequence using a handshake-like scheme; feeding the resulting vector-pair sequence into a neural network classification layer; computing the loss and back-propagating; judging the category of each word pair, i.e., whether the pair carries the relation corresponding to each position; and decoding the final result with the pseudo code shown in the drawings according to these correspondences, finally obtaining all extracted triples. The invention completes two tasks simultaneously, entity identification and relation classification, and significantly improves extraction precision and recall.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a text relation extraction method based on sequence labeling.
Background
Triples are widely used in natural language understanding as a knowledge representation that can be stored in a structured manner, and they play a major role. In natural language text, a piece of textual knowledge can always be represented by one or more triples. Extracting structured triples from one or more pieces of text, i.e., converting the knowledge expressed by unstructured text into a form a machine can understand or store, is the task referred to as relation extraction.
Relation extraction methods have evolved over many years, from early statistical methods to recent neural network methods, and the task itself has grown from simple single-relation extraction to overlapping-relation extraction. Single-relation extraction assumes that a piece of text contains exactly one triple, while overlapping-relation extraction handles text containing multiple, overlapping triples. Overlapping relations are divided into entity pair overlap and single entity overlap; to better explain overlapping relations, the three relation types are shown in FIG. 1.
Conventional methods have three disadvantages: 1) Some relation extraction methods treat the task as pure relation classification, i.e., classifying the relation between entities already annotated in the given text; this is unsuitable for practical applications. 2) Other methods simplify the problem and solve only single-relation extraction, ignoring the practical situation; in fact, multi-relation extraction is the most common case in real applications. 3) Pipeline methods split relation extraction into two independent tasks, entity recognition and relation classification; because they ignore the correlation between the two tasks, they suffer from exposure bias: when the final result is generated in sequence, the result of a later step is conditioned on the result of the earlier step.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a text relation extraction method based on sequence labeling, which can simultaneously complete two tasks: entity identification and relationship classification.
The technical scheme adopted for solving the technical problems is as follows:
a text relation extraction method based on sequence labeling comprises the following steps:
step 1, presetting all possible entity relation categories and establishing a relation set R;
step 2, constructing a training data set suitable for the service field, wherein entity relations in the training data must comprise all preset relation types;
step 3, expanding the relation set R by presetting all possible bidirectional entity relations, obtaining a relation set R';
step 4, constructing a dependency relationship set R_s for the triples;
step 5, dividing the input sentence into a word sequence and inputting it into a pre-training model to obtain the set of representation vectors of the words in the sentence, H = {h_1, h_2, …, h_n}, h_i ∈ ℝ^d, where d is the hidden-layer dimension (a preset hyperparameter) and h_i is a word vector;
step 6, forming a unique word-pair sequence from the word vector sequence H obtained in step 5, doubly traversing the word sequence to form word vector pairs;
step 7, constructing training targets for the word pairs generated in step 6: for each word pair (W_i, W_j), j ≥ i, construct a zero vector v_{i,j} of length 2r+3 (the target vector), where r is the number of relations in the relation set R;
step 8, inputting the word-vector-pair sequence obtained in step 6 into a neural network classification layer, which finally outputs 2r+3 classification scores; the classification layer functions are as follows:

h'_{i,j} = Post-Norm([h_i; h_j])
r_{i,j} = W_2 · h'_{i,j} + b_2

wherein h_i ∈ H, h_j ∈ H, [h_i; h_j] is the vector concatenation operation, h'_{i,j} is the intermediate vector obtained by applying the Post-Norm function to the concatenated vector, r_{i,j} is the vector after the linear transformation, W_1 and W_2 are trainable parameters, and b_1 and b_2 are bias parameters;
step 9, in the training stage, inputting r_{i,j} into a Circle Loss function, computing the loss, and back-propagating;
step 10, in the prediction stage, judging the r_{i,j} from step 8 to determine the category of each word pair; each position of r_{i,j} represents a different label: if the value at a position is greater than 0, the word pair is judged to carry the relation corresponding to that position; if it is less than 0, it is judged not to;
step 11, decoding the final result with pseudo code according to the correspondences of each word pair obtained in step 10, finally obtaining all extracted triples.
Further, in step 3, a bidirectional entity relationship consists of two relationships with different directions, namely a forward relationship and a reverse relationship.
Further, in step 4, the dependency relationship set R_s contains three dependencies: head-entity head to tail-entity head, head-entity tail to tail-entity tail, and entity head to entity tail.
Further, in step 5, the pre-training model is one of BERT, ALBERT, RoBERTa, ERNIE, XLNet, and the like.
Further, in step 6, the traversal process is: each word vector forms a word vector pair with itself and each word vector following it; traversing the word sequence yields n(n+1)/2 word vector pairs, where n is the number of word vectors.
Further, in step 9, the loss function is as follows:

ℒ = log(1 + Σ_{l∈L} e^{−s_l^{pos}}) + log(1 + Σ_{k∈K} e^{s_k^{neg}})

where s_l^{pos} and s_k^{neg} are the prediction scores of the positive and negative examples respectively, L and K are the positive and negative example sets respectively, ℒ is the Circle Loss value, and e is the base of the natural logarithm.
Further, in step 11, the pseudo code is as shown in fig. 3.
The technical effects of the invention:
Compared with the prior art, the text relation extraction method based on sequence labeling can complete two tasks simultaneously: entity identification and relation classification. For the entity overlap problem in text, the invention designs a novel labeling method that can handle overlap among multiple entities. To resolve exposure bias, the labeling method uses joint extraction, labeling the relations between entities at the same time as the entities themselves. Extraction precision and recall are both significantly improved.
Drawings
FIG. 1 is an exemplary diagram of an overlapping relationship in accordance with the present invention;
FIG. 2 is a diagram of a model of the present invention;
FIG. 3 is a pseudo code diagram of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings of the specification.
Example 1:
the text relation extraction method based on sequence labeling, which is related to the embodiment, comprises the following steps:
step 1, presetting all possible entity relation categories and establishing a relation set R: for the business scenario, key domain knowledge is determined through expert review, or the important attributes of the business objects in the scenario are determined. For example, in the medical field, for attribute analysis of drugs, the important attributes include indications, administration method, adverse reactions, precautions, and so on; from these, four or more preset relations for drugs can be obtained and added to the relation set R.
Step 2, constructing a training data set suited to the business domain: first, we define a complete piece of training data, which must contain the original sentence S and all triples in S whose relations exist in the relation set R. Second, given a passage or sentence of text, the required triples are extracted by manual analysis against the preset relations R to form the training data. For example, for the sentence S "The indications of drug A are B, C and D", three triples can be determined: (drug A, indication, B), (drug A, indication, C), (drug A, indication, D). An effective training sentence can thus be constructed.
Step 3, expanding the relation set R (containing r relations) by expanding each relation into two relations of different directions (forward and reverse), obtaining a relation set R' containing 2r relations. For example, the relation "indication" is extended to "indication-forward" and "indication-reverse". For two entities (A, B), the relation pointing from A to B is defined as the forward relation, and the relation pointing from B to A as the reverse relation.
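As an illustration, this expansion can be sketched in Python (a minimal sketch; the relation names below are hypothetical examples, not from the patent's data):

```python
def expand_relations(relations):
    """Expand each preset relation into a forward and a reverse variant."""
    expanded = []
    for rel in relations:
        expanded.append(rel + "-forward")  # relation pointing from A to B
        expanded.append(rel + "-reverse")  # relation pointing from B to A
    return expanded

R = ["indication", "administration"]  # hypothetical preset relation set R
R_prime = expand_relations(R)         # 2r = 4 bidirectional relations
```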
Step 4, constructing the dependency relationship set R_s for triples; there are three dependencies: head-to-head (start-entity head points to target-entity head), tail-to-tail (start-entity tail points to target-entity tail), and head-to-tail (entity head points to entity tail). Here, the start entity is the former entity in the triple, i.e., the initiator of the relation, and the target entity is the latter entity in the triple, i.e., the recipient of the relation.
Step 5, dividing the input sentence into a word sequence and inputting it into the BERT pre-training model to obtain the set of representation vectors of the words in the sentence, H = {h_1, h_2, …, h_n}, h_i ∈ ℝ^d, where d is the hidden-layer dimension (a preset hyperparameter) and h_i is a word vector.
Step 6, forming a unique word-pair sequence from the word vector sequence H obtained in step 5 using a handshake-like scheme: each word vector "shakes hands" with itself and every word vector after it to form a word vector pair; doubly traversing the word sequence in this way ultimately yields n(n+1)/2 word vector pairs, where n is the number of word vectors.
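A minimal Python sketch of this handshake traversal (indices stand in for the actual word vectors):

```python
def handshake_pairs(n):
    """Pair every position i with itself and every following position j >= i."""
    pairs = []
    for i in range(n):
        for j in range(i, n):  # "handshake" with itself and all later words
            pairs.append((i, j))
    return pairs

pairs = handshake_pairs(4)  # n(n+1)/2 = 10 pairs for n = 4
```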
Step 7, constructing training targets for the word pairs generated in step 6: for each word pair (W_i, W_j), j ≥ i, construct a zero vector v_{i,j} of length 2r+3 (the target vector), where r is the number of relations in the relation set R. For a relation triple (E_1, R_{i'}, E_2), R_{i'} ∈ R', define E_{j'}[1], j' ∈ {1, 2}, as the first character of entity E_{j'}, E_{j'}[-1] as its last character, and v_{i,j}[idx] as the value at position idx of the target vector of the word pair. If W_i ∈ E_1 and W_j ∈ E_2, set v_{i,j}[i'] = 1; if W_i = E_1[1] and W_j = E_1[-1], or W_i = E_2[1] and W_j = E_2[-1], set v_{i,j}[2r+3] = 1; if W_i = E_1[1] and W_j = E_2[1], set v_{i,j}[2r+2] = 1; if W_i = E_1[-1] and W_j = E_2[-1], set v_{i,j}[2r+1] = 1.
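One possible reading of these labeling rules, sketched in Python; the 1-based indexing and the encoding of the relation index rel_idx as a position in 1..2r follow the text above, but the exact index conventions are an assumption:

```python
def target_vector(r, i, j, triples):
    """Sketch: build the length-(2r+3) target for the word pair (i, j).
    triples: list of (rel_idx, (h1, t1), (h2, t2)) where rel_idx in 1..2r is the
    bidirectional relation index and (h1, t1)/(h2, t2) are the inclusive
    character spans of the start and target entities."""
    v = [0] * (2 * r + 3 + 1)  # slot 0 unused so indices match the 1-based text
    for rel_idx, (h1, t1), (h2, t2) in triples:
        if h1 <= i <= t1 and h2 <= j <= t2:
            v[rel_idx] = 1           # W_i in E1 and W_j in E2: relation label
        if (i, j) in ((h1, t1), (h2, t2)):
            v[2 * r + 3] = 1         # entity head -> entity tail
        if (i, j) == (h1, h2):
            v[2 * r + 2] = 1         # head-to-head link
        if (i, j) == (t1, t2):
            v[2 * r + 1] = 1         # tail-to-tail link
    return v[1:]
```

For r = 2 relations and a triple with relation index 1, start entity spanning positions 0..1 and target entity spanning 3..4, the pair (0, 3) receives both the relation label and the head-to-head mark.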
Step 8, inputting the word-vector-pair sequence obtained in step 6 into a neural network classification layer, which finally outputs 2r+3 classification scores. The classification layer functions are as follows:

h'_{i,j} = Post-Norm([h_i; h_j])
r_{i,j} = W_2 · h'_{i,j} + b_2

where h_i ∈ H, h_j ∈ H, [h_i; h_j] is the vector concatenation operation, h'_{i,j} is the intermediate vector obtained by applying the Post-Norm function to the concatenated vector, r_{i,j} is the vector after the linear transformation, W_1 and W_2 are trainable parameters, and b_1 and b_2 are bias parameters. Specifically, the Post-Norm function is:

h' = Post-Norm(h) = LayerNorm(DropOut(GELU(W_1·h + b_1)) + h)
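A minimal NumPy sketch of a classification layer of this Post-Norm form (dropout is omitted, as at inference time; the weight shapes and random initialization below are assumptions for illustration):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def post_norm(h, W1, b1):
    """Post-Norm(h) = LayerNorm(DropOut(GELU(W1 h + b1)) + h), dropout disabled."""
    return layer_norm(gelu(W1 @ h + b1) + h)

def classify_pair(h_i, h_j, W1, b1, W2, b2):
    """Score one word pair: concatenate, Post-Norm, then a linear map to 2r+3 logits."""
    h_cat = np.concatenate([h_i, h_j])   # [h_i; h_j], length 2d
    h_mid = post_norm(h_cat, W1, b1)     # intermediate vector h'_{i,j}
    return W2 @ h_mid + b2               # r_{i,j}, length 2r+3

# shapes for a toy hidden size d = 4 and r = 2 relations (2r + 3 = 7 outputs)
rng = np.random.default_rng(0)
d, out = 4, 7
W1, b1 = rng.normal(size=(2 * d, 2 * d)), np.zeros(2 * d)
W2, b2 = rng.normal(size=(out, 2 * d)), np.zeros(out)
scores = classify_pair(rng.normal(size=d), rng.normal(size=d), W1, b1, W2, b2)
```

Note the residual connection requires W1 to map the concatenated vector back to its own dimension 2d, so that the addition with h inside Post-Norm is well defined.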
Step 9, in the training stage, comparing r_{i,j} with the target vector v_{i,j}: according to v_{i,j}, each score in r_{i,j} is assigned to the positive or negative example set, and r_{i,j} is input into the Circle Loss function to compute the loss and back-propagate. The loss function is as follows:

ℒ = log(1 + Σ_{l∈L} e^{−s_l^{pos}}) + log(1 + Σ_{k∈K} e^{s_k^{neg}})

where s_l^{pos} and s_k^{neg} are the prediction scores of the positive and negative examples respectively, L and K are the positive and negative example sets respectively, ℒ is the Circle Loss value, and e is the base of the natural logarithm.
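This loss can be sketched in plain Python (a sketch of the formula as written; grouping the scores into the positive set L and the negative set K is done by the caller from the target vector):

```python
import math

def circle_style_loss(pos_scores, neg_scores):
    """loss = log(1 + sum_l exp(-s_l_pos)) + log(1 + sum_k exp(s_k_neg)).
    Pushes positive-example scores above 0 and negative-example scores below 0,
    matching the >0 / <0 decision rule used in the prediction stage."""
    pos_term = math.log(1.0 + sum(math.exp(-s) for s in pos_scores))
    neg_term = math.log(1.0 + sum(math.exp(s) for s in neg_scores))
    return pos_term + neg_term
```

With no examples the loss is 0, and well-separated scores (large positive for positives, large negative for negatives) also drive it toward 0.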
Step 10, in the prediction stage, judging the r_{i,j} from step 8 to determine the category of each word pair; each position of r_{i,j} represents a different label: if the value at a position is greater than 0, the word pair is judged to carry the relation corresponding to that position; if it is less than 0, it is judged not to.
Step 11, decoding the final result with the pseudo code shown in FIG. 3 according to the correspondences of each word pair obtained in step 10, finally obtaining all extracted triples.
The pseudo code of FIG. 3 receives the labeled sequence S, a dictionary M mapping indices of S to index pairs of the original sentence, and the original input sentence, and outputs the final set of triples:
- Line 1 initializes the set E of word-index pairs carrying a "head-to-tail" dependency.
- Line 2 initializes the dictionary H mapping each bidirectional entity relation to its "head-to-head" word-index pairs.
- Line 3 initializes the dictionary T mapping each bidirectional entity relation to its "tail-to-tail" word-index pairs.
- Line 4 initializes the result set R and establishes a dictionary mapping indices to relation types.
- Line 5 initializes the number r of bidirectional entity relations.
- Line 7 initializes the indices of the three dependency relations within the target vector.
- Lines 8-20 traverse the whole labeled sequence S: lines 9-11 convert indices of the labeled sequence belonging to the "head-to-tail" relation into head-tail word-index pairs of the original sentence via the dictionary M and add them to the set E; lines 12-19 traverse all bidirectional relations, mapping word-index pairs belonging to a "head-to-head" relation to the set of bidirectional relations they carry, and likewise for word-index pairs belonging to a "tail-to-tail" relation.
- Lines 21-38 perform a double traversal: lines 23-24 take E[i] and E[j] as the head-tail word-index pairs of two entities; line 25 defines P_h as the head-word index pair of entities E[i] and E[j], representing a "head-to-head" link; line 26 defines P_t as their tail-word index pair, representing a "tail-to-tail" link; line 27 looks up H[P_h], the bidirectional relations associated with the head-word index pair in dictionary H; line 28 looks up T[P_t], the bidirectional relations associated with the tail-word index pair in dictionary T; line 29 takes the intersection of H[P_h] and T[P_t] to obtain the relation set Set_r; lines 30-36 check that Set_r is non-empty and add triples to the result set R: line 31 extracts the start-entity fragment from the original sentence via the start entity's head and tail indices, line 32 extracts the target-entity fragment likewise, and lines 33-35 traverse Set_r, adding each predicted relation as a triple to the final output set R.
- Line 39 returns the final result and the procedure ends.
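A simplified Python sketch of this decoding procedure (the data structures E, H and T mirror those in the description; this is not a reproduction of the exact pseudo code of FIG. 3):

```python
def decode_triples(sentence, E, H, T):
    """Intersect head-to-head and tail-to-tail relation sets for every ordered
    pair of entity spans, emitting (start entity, relation, target entity).
    E: list of (head_idx, tail_idx) entity spans from head-to-tail links;
    H: dict {(head_i, head_j): set of relations} from head-to-head links;
    T: dict {(tail_i, tail_j): set of relations} from tail-to-tail links."""
    results = set()
    for i in range(len(E)):
        for j in range(len(E)):
            if i == j:
                continue
            rels = H.get((E[i][0], E[j][0]), set()) & T.get((E[i][1], E[j][1]), set())
            for rel in rels:
                start = sentence[E[i][0]:E[i][1] + 1]   # spans are inclusive
                target = sentence[E[j][0]:E[j][1] + 1]
                results.add((start, rel, target))
    return results
```

Requiring both the head-to-head and the tail-to-tail link for the same relation is what lets the scheme keep overlapping entity pairs apart.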
An entity is a general term for an objectively existing object or concept. A triple is a knowledge representation of the form (start entity, relation, target entity). The entity head is the first character of the entity's character span: for Chinese, the first Chinese character of the entity word; for English, the first word of the entity phrase. The entity tail is the last character of the entity's character span: for Chinese, the last Chinese character of the entity word; for English, the last word of the entity phrase.
test example:
The model of the invention reaches a leading level on the public data sets NYT and WebNLG, both English data sets. The NYT data set comes from the paper "Modeling Relations and Their Mentions without Labeled Text", which uses distant supervision to extract triples from the New York Times corpus; the WebNLG data set comes from the paper "Creating Training Corpora for NLG Micro-Planners", which uses a text-generation approach to generate a passage of text matched with preset triples. The experimental results are shown in Table 1.
Table 1 comparison between the method of this patent and other methods
The NovelTagging method is from the paper "Joint extraction of entities and relations based on a novel tagging scheme", the GraphRel method is from the paper "Extracting relational facts by an end-to-end neural model with copy mechanism", the OrderCopyRE method is from the paper "Learning the extraction order of multiple relational facts in a sentence with reinforcement learning", and the CasRel method is from the paper "A Novel Cascade Binary Tagging Framework for Relational Triple Extraction".
The F1 evaluation index is calculated as follows:

F1 = 2 × Precision × Recall / (Precision + Recall)
the F1 index represents the overall level of the predicted result and is a harmonic mean of the model accuracy and recall. As can be seen from Table 1, the method of the present invention achieves the best effect on both data sets, and can improve the key index F1 by 1-2% compared with the latest method, and the extraction accuracy and recall rate are significantly improved compared with other methods. The following three problems are better solved: 1) Meanwhile, the problems of entity extraction and relationship classification among entities are solved. 2) Overlapping triples in the text are extracted correctly. 3) The problem of exposure deviation is solved, and the extraction accuracy is improved.
The foregoing embodiments are merely examples of the present invention, and the scope of the present invention includes, but is not limited to, the forms and styles of the foregoing embodiments, and any suitable changes or modifications made by those skilled in the art, which are consistent with the claims of the present invention, shall fall within the scope of the present invention.
Claims (4)
1. A text relation extraction method based on sequence labeling, characterized by comprising the following steps:
step 1, presetting all possible entity relation categories and establishing a relation set R;
step 2, constructing a training data set suitable for the service field, wherein entity relations in the training data must comprise all preset relation types;
step 3, expanding the relation set R by presetting all possible bidirectional entity relations, obtaining a relation set R';
step 4, constructing a dependency relationship set R_s for the triples;
step 5, dividing the input sentence into a word sequence and inputting it into a pre-training model to obtain the set of representation vectors of the words in the sentence, H = {h_1, h_2, …, h_n}, h_i ∈ ℝ^d, where d is the hidden-layer dimension (a preset hyperparameter) and h_i is a word vector;
step 6, forming a unique word-pair sequence from the word vector sequence H obtained in step 5, doubly traversing the word sequence to form word vector pairs; the traversal process is: each word vector forms a word vector pair with itself and each word vector following it, and traversing the word sequence yields n(n+1)/2 word vector pairs, where n is the number of word vectors;
step 7, constructing training targets for the word pairs generated in step 6: for each word pair (W_i, W_j), j ≥ i, construct a zero vector v_{i,j} of length 2r+3 (the target vector), where r is the number of relations in the relation set R;
step 8, inputting the word-vector-pair sequence obtained in step 6 into a neural network classification layer, which finally outputs 2r+3 classification scores; the classification layer functions are as follows:

h'_{i,j} = Post-Norm([h_i; h_j])
r_{i,j} = W_2 · h'_{i,j} + b_2

wherein h_i ∈ H, h_j ∈ H, [h_i; h_j] is the vector concatenation operation, h'_{i,j} is the intermediate vector obtained by applying the Post-Norm function to the concatenated vector, r_{i,j} is the vector after the linear transformation, W_1 and W_2 are trainable parameters, and b_1 and b_2 are bias parameters;
step 9, in the training stage, inputting r_{i,j} into a Circle Loss function, computing the loss, and back-propagating; the loss function is as follows:

ℒ = log(1 + Σ_{l∈L} e^{−s_l^{pos}}) + log(1 + Σ_{k∈K} e^{s_k^{neg}})

where s_l^{pos} and s_k^{neg} are the prediction scores of the positive and negative examples respectively, L and K are the positive and negative example sets respectively, ℒ is the Circle Loss value, and e is the base of the natural logarithm;
step 10, in the prediction stage, judging the r_{i,j} from step 8 to determine the category of each word pair; each position of r_{i,j} represents a different label: if the value at a position is greater than 0, the word pair is judged to carry the relation corresponding to that position; if it is less than 0, it is judged not to;
step 11, decoding the final result with pseudo code according to the correspondences of each word pair obtained in step 10, finally obtaining all extracted triples.
2. The text relation extraction method based on sequence labeling as claimed in claim 1, wherein: in step 3, a bidirectional entity relationship consists of two relationships with different directions, namely a forward relationship and a reverse relationship.
3. The text relation extraction method based on sequence labeling as claimed in claim 1, wherein: in step 4, the dependency relationship set R_s contains three dependencies: head-entity head to tail-entity head, head-entity tail to tail-entity tail, and entity head to entity tail.
4. The text relation extraction method based on sequence labeling as claimed in claim 1, wherein: in step 5, the pre-training model is one of BERT, ALBERT, RoBERTa, ERNIE and XLNet.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110501103.6A CN113191118B (en) | 2021-05-08 | 2021-05-08 | Text relation extraction method based on sequence annotation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110501103.6A CN113191118B (en) | 2021-05-08 | 2021-05-08 | Text relation extraction method based on sequence annotation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113191118A CN113191118A (en) | 2021-07-30 |
CN113191118B true CN113191118B (en) | 2023-07-18 |
Family
ID=76984478
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110501103.6A Active CN113191118B (en) | 2021-05-08 | 2021-05-08 | Text relation extraction method based on sequence annotation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113191118B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113792539B (en) * | 2021-09-15 | 2024-02-20 | 平安科技(深圳)有限公司 | Entity relationship classification method and device based on artificial intelligence, electronic equipment and medium |
CN115358341B (en) * | 2022-08-30 | 2023-04-28 | 北京睿企信息科技有限公司 | Training method and system for instruction disambiguation based on relational model |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101425065A (en) * | 2007-10-31 | 2009-05-06 | NEC (China) Co., Ltd. | Entity relation mining method and device |
CN103678316A (en) * | 2012-08-31 | 2014-03-26 | Fujitsu Ltd. | Entity relationship classifying device and entity relationship classifying method |
CN106484675A (en) * | 2016-09-29 | 2017-03-08 | Beijing Institute of Technology | Character relation extraction method fusing distributional semantics and sentence-meaning features |
CN106649275A (en) * | 2016-12-28 | 2017-05-10 | Chengdu Shulian Mingpin Technology Co., Ltd. | Relation extraction method based on part-of-speech information and convolutional neural networks |
CN107133220A (en) * | 2017-06-07 | 2017-09-05 | Southeast University | Named entity recognition method for the geography field |
CN108280062A (en) * | 2018-01-19 | 2018-07-13 | Beijing University of Posts and Telecommunications | Entity and entity-relationship recognition method and device based on deep learning |
CN109408812A (en) * | 2018-09-30 | 2019-03-01 | Beijing University of Technology | Method for joint entity-relation extraction via sequence labeling based on an attention mechanism |
CN111931506A (en) * | 2020-05-22 | 2020-11-13 | Beijing Institute of Technology | Entity relationship extraction method based on graph information enhancement |
- 2021-05-08 CN CN202110501103.6A patent/CN113191118B/en active Active
Non-Patent Citations (4)
Title |
---|
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; Jacob Devlin et al.; arXiv; 1-16 * |
Embeddings of Label Components for Sequence Labeling: A Case Study of Fine-grained Named Entity Recognition; Takuma Kato et al.; arXiv; 1-8 * |
An Integrated Model of Chinese Word Segmentation and Part-of-Speech Tagging Based on CNN and Bidirectional LSTM; Zhang Jianhu; China Master's Theses Full-text Database; I138-1265 * |
Named Entity Recognition for Online Medical Consultation Texts Based on Deep Learning; Chen Hehong; China Master's Theses Full-text Database; E054-66 * |
Also Published As
Publication number | Publication date |
---|---|
CN113191118A (en) | 2021-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110032648B (en) | Medical record structured analysis method based on medical field entity | |
CN107748757B (en) | Question-answering method based on knowledge graph | |
CN110532554B (en) | Chinese abstract generation method, system and storage medium | |
WO2020063092A1 (en) | Knowledge graph processing method and apparatus | |
US7689527B2 (en) | Attribute extraction using limited training data | |
WO2018153215A1 (en) | Method for automatically generating sentence sample with similar semantics | |
CN106599032A | Text event extraction method combining sparse coding and a structured perceptron | |
CN110196906A | Text similarity detection method for the financial industry based on deep learning | |
CN112084381A (en) | Event extraction method, system, storage medium and equipment | |
CN113191118B (en) | Text relation extraction method based on sequence annotation | |
CN111222318B (en) | Trigger word recognition method based on double-channel bidirectional LSTM-CRF network | |
CN112989208B (en) | Information recommendation method and device, electronic equipment and storage medium | |
CN111581923A (en) | Method, device and equipment for generating file and computer readable storage medium | |
CN110188359B (en) | Text entity extraction method | |
CN114547298A | Biomedical relation extraction method, device and medium combining multi-head attention, graph convolutional networks and the R-Drop mechanism | |
CN112101014B (en) | Chinese chemical industry document word segmentation method based on mixed feature fusion | |
CN111125295A (en) | Method and system for obtaining food safety question answers based on LSTM | |
CN114564563A (en) | End-to-end entity relationship joint extraction method and system based on relationship decomposition | |
CN109815478A | Medical entity recognition method and system based on convolutional neural networks | |
Logacheva et al. | Word sense disambiguation for 158 languages using word embeddings only | |
Wang et al. | Aspect-based sentiment analysis with graph convolutional networks over dependency awareness | |
CN112667819A (en) | Entity description reasoning knowledge base construction and reasoning evidence quantitative information acquisition method and device | |
CN111492364A (en) | Data labeling method and device and storage medium | |
CN113468311B (en) | Knowledge graph-based complex question and answer method, device and storage medium | |
CN113886521A (en) | Text relation automatic labeling method based on similar vocabulary |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||