CN113191118A - Text relation extraction method based on sequence labeling - Google Patents

Text relation extraction method based on sequence labeling

Info

Publication number
CN113191118A
CN113191118A
Authority
CN
China
Prior art keywords
word
entity
sequence
vector
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110501103.6A
Other languages
Chinese (zh)
Other versions
CN113191118B (en)
Inventor
展一鸣
李钊
吴士伟
李慧娟
辛国茂
陈通
胡传会
张超
赵秀浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202110501103.6A priority Critical patent/CN113191118B/en
Publication of CN113191118A publication Critical patent/CN113191118A/en
Application granted granted Critical
Publication of CN113191118B publication Critical patent/CN113191118B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/103: Formatting, i.e. changing of presentation of documents
    • G06F 40/117: Tagging; Marking up; Designating a block; Setting of attributes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/237: Lexical tools
    • G06F 40/242: Dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of data processing, in particular to a text relation extraction method based on sequence labeling. The method comprises: constructing a training data set similar to the prediction data, and presetting all possible bidirectional entity relations together with three fixed dependency relations; segmenting an input sentence into a word sequence and feeding it into a pre-training model to obtain a representation vector for each word in the sentence; forming a unique word-pair sequence from the word-vector sequence in a handshake-like manner; inputting the resulting vector-pair sequence into a neural network classification layer; calculating the loss and performing back propagation; judging the category of each word pair, i.e., whether the pair carries the relation corresponding to each position; and decoding the final result with the pseudo code shown in the figure according to these correspondences, finally obtaining all extracted triples. The invention completes two tasks simultaneously: entity recognition and relation classification. Extraction precision and recall are both markedly improved.

Description

Text relation extraction method based on sequence labeling
Technical Field
The invention relates to the technical field of data processing, in particular to a text relation extraction method based on sequence labeling.
Background
The triple relation is a knowledge representation that can be stored in structured form; it is widely applied in natural language understanding and plays a large role there. In natural language text, a piece of textual knowledge can always be represented by one or more triples. Extracting structured triple relations from one or more pieces of text, i.e., converting the knowledge expressed in discrete text into a form a machine can understand or store, is the task referred to as relation extraction.
Relation extraction methods have evolved over many years, from early statistical methods to recent neural network methods, and the task itself has grown from simple single-relation extraction to overlapping-relation extraction. Single-relation extraction means that only one ternary relation exists in a text; overlapping-relation extraction means that a text contains several ternary relations that overlap. Overlapping relations are divided into entity-pair overlap and single-entity overlap; to better explain them, fig. 1 shows the three relation types.
Conventional methods have three shortcomings. 1) Some approaches treat relation extraction as pure relation classification: given a text whose entities are already labeled, they only classify the relation between those entities, which is impractical in real applications. 2) Some methods simplify the problem to single-relation extraction and ignore the actual situation; in practice, multiple-relation extraction is the most common case. 3) Pipeline methods split relation extraction into two separate tasks, entity recognition and relation classification, ignore the correlation between the two tasks, and cause exposure bias: when the final result is generated in a certain order, the result of each step is influenced by the errors of the previous step.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a text relation extraction method based on sequence labeling, which can simultaneously complete two tasks: entity identification and relationship classification.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a text relation extraction method based on sequence labeling comprises the following steps:
step 1, presetting all possible entity relation types and establishing a relation set R;
step 2, constructing a training data set suitable for the business field, wherein the entity relationship in the training data must include all preset relationship types;
step 3, expanding the relation set R by presetting all possible bidirectional entity relations, obtaining a relation set R';
step 4, constructing a dependency relation set R_s for the triples;
Step 5, segmenting the input sentence into word sequences, and inputting the word sequences into a pre-training model to obtain a set of expression vectors of words in each sentence
Figure BDA0003056534350000021
Wherein d is the dimension of the hidden layer vector represented by the preset hyper-parameter, hiIs a word vector;
step 6, forming a unique word-pair sequence from the word-vector sequence H obtained in step 5 by doubly traversing the word sequence to form word-vector pairs;
step 7, constructing a training target for the word pairs generated in step 6: for each word pair (W_i, W_j), j ≥ i, construct a zero vector v_{i,j} ∈ R^{2r+3} (the target vector), where r is the number of relations in the relation set R;
step 8, inputting the word-vector-pair sequence obtained in step 6 into a neural network classification layer whose final output has 2r+3 classes; the classification layer functions are as follows:
c_{i,j} = [h_i; h_j]
h'_{i,j} = Post-Norm(c_{i,j})
r_{i,j} = W_2 h'_{i,j} + b_2
wherein h_i ∈ H, h_j ∈ H, [h_i; h_j] is the vector concatenation operation, h'_{i,j} is the intermediate vector obtained by applying the Post-Norm function to the concatenated vector, r_{i,j} is the vector after the linear transformation, W_1 and W_2 are trainable parameters, and b_1 and b_2 are bias parameters;
step 9, in the training stage, inputting r_{i,j} into the Circle Loss function, calculating the loss and performing back propagation;
step 10, judging the prediction r_{i,j} from step 8, namely judging the category of each word pair; each position in r_{i,j} represents a different label: if the value at a position is greater than 0, the word pair is judged to have the relation corresponding to that position, and if it is less than 0, the word pair does not have that relation;
and step 11, decoding the final result by using a pseudo code according to the corresponding relation of each word pair obtained in the step 10, and finally obtaining all extracted triples.
Further, in step 3, a bidirectional entity relation is a relation with two different directions, namely a forward relation and a reverse relation.
Further, in step 4, the dependency relation set R_s contains three dependencies: head-entity head to tail-entity head, head-entity tail to tail-entity tail, and entity head to entity tail.
Further, in step 5, the pre-training model is one of BERT, ALBERT, RoBERTa, ERNIE, XLNet, etc.
Further, in step 6, the traversal process is as follows: each word vector forms a word-vector pair with each word vector after it (including itself), so traversing the word sequence yields n(n+1)/2 word-vector pairs, where n is the number of word vectors.
Further, in step 9, the loss function is as follows:
Loss = log(1 + Σ_{k∈K} e^{s_k}) + log(1 + Σ_{l∈L} e^{−s_l})
wherein s_l and s_k are the prediction scores of the positive and negative examples respectively, L and K are the positive-example and negative-example sets respectively, and e is the base of the natural logarithm function.
Further, in step 11, the pseudo code is as shown in fig. 3.
The invention has the technical effects that:
compared with the prior art, the text relation extraction method based on the sequence labeling can simultaneously complete two tasks: entity identification and relationship classification. Aiming at the problem of entity overlapping in a text, the invention designs a new labeling method which can solve the problem of overlapping of a plurality of entities. Also, in order to solve the exposure deviation, the labeling method of the present invention uses a joint extraction method, which can label the relationship between entities at the same time as labeling the entities. The method has the advantages that the extraction accuracy and the recall rate are remarkably improved, and the method is greatly improved.
Drawings
FIG. 1 is an exemplary diagram of an overlapping relationship according to the present invention;
FIG. 2 is a diagram of a model of the present invention;
FIG. 3 is a pseudo code diagram according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings of the specification.
Example 1:
the text relation extraction method based on sequence labeling comprises the following steps:
Step 1, preset all possible entity relation types and establish a relation set R: for the business scenario, determine the key knowledge information of the domain through expert consultation, or determine the important attributes of the business objects in the scenario. For example, in the medical field, the important attributes of a drug include indications, administration method, adverse reactions, precautions, etc.; from these, four or more preset relations for the drug can be obtained and added to the relation set R.
Step 2, construct a training data set suited to the business domain. First, we define a complete piece of training data: it must contain the original sentence S and all ternary relations in S that exist in the relation set R. Second, given a passage or sentence of text, extract the qualifying triples through manual analysis combined with the preset relation set R to form the training data. For example, if the statement S is "A drug is indicated with B, C, D.", three ternary relations can be determined for S: (A drug, indication, B), (A drug, indication, C), (A drug, indication, D). Accordingly, an effective training sentence can be constructed.
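A minimal sketch of what one such training example could look like as a Python structure; the field names and the relation set here are illustrative, not prescribed by the patent:

```python
# Hypothetical training example for the sentence "A drug is indicated with B, C, D."
# Field names ("text", "triples") are illustrative choices for this sketch.
example = {
    "text": "A drug is indicated with B, C, D.",
    "triples": [
        ("A drug", "indication", "B"),
        ("A drug", "indication", "C"),
        ("A drug", "indication", "D"),
    ],
}

# Every relation appearing in the training data must belong to the preset set R.
R = {"indication", "administration method", "adverse reaction", "precaution"}
assert all(rel in R for _, rel, _ in example["triples"])
```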
Step 3, expand the relation set R (containing r relations in total): expand each relation into two relations of different directions (forward and reverse) to obtain a relation set R' containing 2r relations in total. For example, the relation "indication" is expanded into the two relations "indication-forward" and "indication-reverse". For two entities (A, B), the relation pointing from A to B is the forward relation, and the relation pointing from B to A is the reverse relation.
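The expansion of R into R' can be sketched as follows; the "-forward"/"-reverse" suffixes are illustrative names, not the patent's exact labels:

```python
def expand_relations(relations):
    """Expand each relation into a forward and a reverse variant (step 3),
    so a set of r relations becomes a set of 2r directed relations."""
    expanded = []
    for rel in relations:
        expanded.append(rel + "-forward")   # A -> B direction
        expanded.append(rel + "-reverse")   # B -> A direction
    return expanded

R = ["indication", "adverse reaction"]
R_prime = expand_relations(R)
assert len(R_prime) == 2 * len(R)  # |R'| = 2r
```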
Step 4, construct the dependency relation set R_s for the triples; there are three dependencies in total: head-to-head (the starting entity's head points to the target entity's head), tail-to-tail (the starting entity's tail points to the target entity's tail), and head-to-tail (an entity's head points to that entity's tail). Here, the starting entity is the former entity in the triple, i.e., the initiator of the relation, and the target entity is the latter entity in the triple, i.e., the recipient of the relation.
Step 5, segment the input sentence into a word sequence, and input it into a BERT pre-training model to obtain the set of word representation vectors of each sentence
H = {h_1, h_2, …, h_n}, h_i ∈ R^d
wherein d is the hidden-layer vector dimension set by a preset hyper-parameter and h_i is a word vector.
Step 6, form a unique word-pair sequence from the word-vector sequence H obtained in step 5 in a handshake-like manner: each word vector "shakes hands" with every word vector after it (including itself) to form a word-vector pair, so doubly traversing the word sequence yields n(n+1)/2 word-vector pairs, where n is the number of word vectors;
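The handshake enumeration can be sketched on indices alone (an illustration of the pairing scheme, not the patent's exact implementation):

```python
def handshake_pairs(n):
    """Enumerate index pairs (i, j) with j >= i — the 'handshake' scheme of step 6.
    Yields n(n+1)/2 pairs for a sequence of n word vectors."""
    return [(i, j) for i in range(n) for j in range(i, n)]

pairs = handshake_pairs(4)
assert len(pairs) == 4 * 5 // 2          # n(n+1)/2 = 10 pairs
assert (0, 0) in pairs and (0, 3) in pairs
assert (3, 0) not in pairs               # only j >= i is kept
```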
Step 7, construct a training target for the word pairs generated in step 6: for each word pair (W_i, W_j), j ≥ i, construct a zero vector v_{i,j} ∈ R^{2r+3} (the target vector), where r is the number of relations in the relation set R. For a relational triple (E_1, R_{i'}, E_2) with R_{i'} ∈ R', we define E_{j'}[1], j' ∈ {1, 2}, as the first character of entity E_{j'}, E_{j'}[-1] as the last character of E_{j'}, and v_{i,j}[idx] as the value at position idx of the target vector of the word pair. If W_i ∈ E_1 and W_j ∈ E_2, set v_{i,j}[i'] = 1; if W_i = E_1[1] and W_j = E_1[-1], or W_i = E_2[1] and W_j = E_2[-1], set v_{i,j}[2r+3] = 1; if W_i = E_1[1] and W_j = E_2[1], set v_{i,j}[2r+2] = 1; if W_i = E_1[-1] and W_j = E_2[-1], set v_{i,j}[2r+1] = 1.
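The labeling rule can be sketched as below. The 0-based vector layout chosen here is an assumption made to mirror the 1-based positions of the text; span boundaries stand in for the first/last characters of each entity:

```python
def build_target_vector(i, j, triple_spans, r):
    """
    Sketch of the step-7 labeling rule for one word pair (W_i, W_j), j >= i.
    `triple_spans` holds (head_span, rel_idx, tail_span) with spans as (start, end)
    inclusive word indices; `rel_idx` indexes the expanded relation set R' (0..2r-1).
    Assumed 0-based layout: positions 0..2r-1 -> relation labels,
    2r -> tail-to-tail, 2r+1 -> head-to-head, 2r+2 -> entity head-to-tail.
    """
    v = [0] * (2 * r + 3)
    for (h_s, h_e), rel_idx, (t_s, t_e) in triple_spans:
        if h_s <= i <= h_e and t_s <= j <= t_e:
            v[rel_idx] = 1               # W_i inside head entity, W_j inside tail entity
        if (i, j) in ((h_s, h_e), (t_s, t_e)):
            v[2 * r + 2] = 1             # entity head -> entity tail (marks a span)
        if (i, j) == (h_s, t_s):
            v[2 * r + 1] = 1             # head-to-head dependency
        if (i, j) == (h_e, t_e):
            v[2 * r] = 1                 # tail-to-tail dependency
    return v
```

For one triple whose head entity spans words 0-1 and tail entity is word 4, the pair (0, 4) receives both the relation label and the head-to-head mark, while (0, 1) receives only the span mark.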
Step 8, input the word-vector-pair sequence obtained in step 6 into a neural network classification layer whose final output has 2r+3 classes. The classification layer functions are as follows:
c_{i,j} = [h_i; h_j]
h'_{i,j} = Post-Norm(c_{i,j})
r_{i,j} = W_2 h'_{i,j} + b_2
wherein h_i ∈ H, h_j ∈ H, [h_i; h_j] is the vector concatenation operation, h'_{i,j} is the intermediate vector obtained by applying the Post-Norm function to the concatenated vector, r_{i,j} is the vector after the linear transformation, W_1 and W_2 are trainable parameters, and b_1 and b_2 are bias parameters. In particular, the Post-Norm function is:
h' = Post-Norm(h) = LayerNorm(DropOut(GELU(W_1 h + b_1)) + h)
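Post-Norm and the classification layer can be sketched in NumPy. The weight shapes, the tanh approximation of GELU, and treating dropout as identity (as at inference time) are assumptions made for illustration:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def post_norm(h, W, b, drop=lambda x: x):
    """h' = LayerNorm(DropOut(GELU(W h + b)) + h); dropout is identity here."""
    return layer_norm(drop(gelu(W @ h + b)) + h)

def classify_pair(h_i, h_j, W1, b1, W2, b2):
    """Sketch of the step-8 classifier: concat -> Post-Norm -> linear -> 2r+3 scores."""
    h_cat = np.concatenate([h_i, h_j])   # [h_i; h_j]
    h_mid = post_norm(h_cat, W1, b1)     # intermediate vector h'_{i,j}
    return W2 @ h_mid + b2               # r_{i,j}, one score per label

rng = np.random.default_rng(0)
d, r = 8, 4                              # hidden size and relation count (toy values)
h_i, h_j = rng.normal(size=d), rng.normal(size=d)
W1, b1 = rng.normal(size=(2 * d, 2 * d)), np.zeros(2 * d)
W2, b2 = rng.normal(size=(2 * r + 3, 2 * d)), np.zeros(2 * r + 3)
scores = classify_pair(h_i, h_j, W1, b1, W2, b2)
assert scores.shape == (2 * r + 3,)
```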
Step 9, in the training stage, compare r_{i,j} with the target vector v_{i,j}; according to v_{i,j}, divide the entries of r_{i,j} into positive and negative examples, input r_{i,j} into the Circle Loss function, calculate the loss and perform back propagation. The loss function is as follows:
Loss = log(1 + Σ_{k∈K} e^{s_k}) + log(1 + Σ_{l∈L} e^{−s_l})
wherein s_l and s_k are the prediction scores of the positive and negative examples respectively, L and K are the positive-example and negative-example sets respectively, and e is the base of the natural logarithm function.
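A pure-Python sketch of this loss for a single word pair's scores, assuming the common form Loss = log(1 + Σ_neg e^s) + log(1 + Σ_pos e^{−s}) over the positive and negative label positions:

```python
import math

def circle_loss(scores, labels):
    """
    Circle-Loss-style objective for one word pair:
      L = log(1 + sum over negatives of e^s) + log(1 + sum over positives of e^-s)
    `scores` are the 2r+3 logits r_{i,j}; `labels` the 0/1 target vector v_{i,j}.
    Drives positive-label scores above 0 and negative-label scores below 0.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return (math.log(1 + sum(math.exp(s) for s in neg))
            + math.log(1 + sum(math.exp(-s) for s in pos)))
```

With a confident correct prediction the loss is near zero; flipping the signs makes it large, which is what back propagation then corrects.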
Step 10, judge the prediction r_{i,j} from step 8, namely judge the category of each word pair; each position in r_{i,j} represents a different label: if the value at a position is greater than 0, the word pair is judged to have the relation corresponding to that position, and if it is less than 0, the word pair does not have that relation;
and 11, decoding a final result by using the pseudo code shown in fig. 3 according to the corresponding relation of each word pair obtained in the step 10, and finally obtaining all extracted triples.
The pseudo code in FIG. 3 receives the labeled sequence S, a dictionary M mapping indices of S to index pairs of the original sentence, and the original input sentence, and finally outputs a set of triples. Line 1 initializes a word-index-pair set E for the head-to-tail dependency relation; line 2 initializes a dictionary H mapping each bidirectional entity relation to its head-to-head word index pairs; line 3 initializes a dictionary T mapping each bidirectional entity relation to its tail-to-tail word index pairs; line 4 initializes the result set R; line 5 builds a dictionary mapping indices to relation types; line 6 initializes the number r of bidirectional entity relations; line 7 initializes the indices of the three dependency relations within the target vector. Lines 8-20 traverse the whole labeled sequence S: lines 9-11 convert indices belonging to the head-to-tail relation into head-to-tail word index pairs of the original sentence via the dictionary M and add them to E; lines 12-19 traverse all bidirectional relations, lines 13-15 map word index pairs belonging to the head-to-head relation onto the set of bidirectional relations they contain, and lines 16-18 map word index pairs belonging to the tail-to-tail relation onto the set of bidirectional relations they contain. Lines 21-38 doubly traverse the set E: lines 23-24 take E[i] and E[j] as the head-and-tail word index pairs of two entities; line 25 defines P_h as the head-word index pair of entities E[i] and E[j], representing a head-to-head relation; line 26 defines P_t as their tail-word index pair, representing a tail-to-tail relation; line 27 defines H[P_h] as the bidirectional relations found by querying the dictionary H with the head-word index pair; line 28 defines T[P_t] as the bidirectional relations found by querying the dictionary T with the tail-word index pair; line 29 intersects the sets H[P_h] and T[P_t] to obtain the relation set Set_r; lines 30-36 check that Set_r is non-empty before adding triples to the result set R: line 31 extracts the starting-entity fragment from the original sentence via its head-and-tail indices, line 32 extracts the target-entity fragment likewise, and lines 33-35 traverse Set_r, adding each predicted relation as a triple to the final output set R. Line 39 returns the final result and ends the procedure.
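The core idea of the FIG. 3 decoding, intersecting head-to-head and tail-to-tail evidence over recognized entity spans, can be sketched as follows (a simplification with illustrative data structures, not the patent's exact pseudo code):

```python
def decode(head_tail_pairs, rel_heads, rel_tails):
    """
    Simplified decoding sketch:
      head_tail_pairs: set of (head_idx, tail_idx) marking entity spans (from head-to-tail)
      rel_heads[rel]:  set of (h1, h2) head-to-head index pairs for relation rel
      rel_tails[rel]:  set of (t1, t2) tail-to-tail index pairs for relation rel
    A triple (E1, rel, E2) is emitted when some relation links both the heads
    and the tails of two recognized entity spans.
    """
    entities = sorted(head_tail_pairs)
    triples = set()
    for h1, t1 in entities:
        for h2, t2 in entities:
            for rel in rel_heads:
                if (h1, h2) in rel_heads[rel] and (t1, t2) in rel_tails.get(rel, set()):
                    triples.add(((h1, t1), rel, (h2, t2)))
    return triples

# Toy example: entity spans (0,1) and (4,4), linked head-to-head and tail-to-tail.
triples = decode({(0, 1), (4, 4)},
                 {"indication": {(0, 4)}},
                 {"indication": {(1, 4)}})
```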
An entity is the unified name of an objectively existing object or concept. A triple is a knowledge representation of the form (starting entity, relation, target entity). The entity head is the first character of the entity fragment: for Chinese, the first Chinese character of the entity word; for English, the first word of the entity phrase. The entity tail is the last character of the entity fragment: for Chinese, the last Chinese character of the entity word; for English, the last word of the entity phrase.
test example:
the model of the invention achieves the leading level in public data sets NYT and WEBNLG, both of which are English data sets, wherein the NYT data set is from a paper Modeling relationships and theoretical relationships with out LabeledText, which extracts triples on the new york time corpus by using a remote supervised learning method; the WEBNLG dataset is from a paper Creating training corppora for ng micro-planes, which uses text generation to generate a piece of text and matches a preset ternary relationship. The results of the experiment are shown in table 1.
TABLE 1 Comparison of the method of this patent with other methods (the table itself appears as an image in the original publication)
Among them, the NovelTagging method is from the paper "Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme", the GraphRel method is from the paper "GraphRel: Modeling Text as Relational Graphs for Joint Entity and Relation Extraction", the OrderCopyRE method is from the paper "Learning the Extraction Order of Multiple Relational Facts in a Sentence with Reinforcement Learning", and the CasRel method is from the paper "A Novel Cascade Binary Tagging Framework for Relational Triple Extraction".
The F1 evaluation index is calculated as follows:
F1 = 2 × Precision × Recall / (Precision + Recall)
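The F1 computation can be checked with a small sketch; the tp/fp/fn counts below are illustrative, not taken from Table 1:

```python
def f1_score(tp, fp, fn):
    """F1 = 2PR/(P+R): harmonic mean of precision P and recall R,
    computed from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# With P = R = 0.8 the harmonic mean equals 0.8 as well.
assert abs(f1_score(8, 2, 2) - 0.8) < 1e-9
```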
the F1 index represents the overall level of prediction results and is a harmonic average of model accuracy and recall. As can be seen from Table 1, the method of the present invention achieves the best effect on both data sets, and can improve the key index F1 by 1-2% compared with the latest method, and the extraction accuracy and the recall rate are both significantly improved compared with other methods. The following three problems are solved well: 1) and simultaneously, the problems of entity extraction and relationship classification among entities are solved. 2) And correctly extracting the overlapped triples in the text. 3) The problem of exposure deviation is solved, and the accuracy of extraction is improved.
The above embodiments are only specific examples of the present invention, and the protection scope of the present invention includes but is not limited to the product forms and styles of the above embodiments, and any suitable changes or modifications made by those skilled in the art according to the claims of the present invention shall fall within the protection scope of the present invention.

Claims (7)

1. A text relation extraction method based on sequence labeling is characterized in that: the method comprises the following steps:
step 1, presetting all possible entity relation types and establishing a relation set R;
step 2, constructing a training data set suitable for the business field, wherein the entity relationship in the training data must include all preset relationship types;
step 3, expanding the relation set R by presetting all possible bidirectional entity relations, obtaining a relation set R';
step 4, constructing a dependency relation set R_s for the triples;
step 5, segmenting the input sentence into a word sequence, and inputting it into a pre-training model to obtain the set of word representation vectors of each sentence
H = {h_1, h_2, …, h_n}, h_i ∈ R^d
wherein d is the hidden-layer vector dimension set by a preset hyper-parameter and h_i is a word vector;
step 6, forming a unique word-pair sequence from the word-vector sequence H obtained in step 5 by doubly traversing the word sequence to form word-vector pairs;
step 7, constructing a training target for the word pairs generated in step 6: for each word pair (W_i, W_j), j ≥ i, construct a zero vector v_{i,j} ∈ R^{2r+3} (the target vector), where r is the number of relations in the relation set R;
step 8, inputting the word-vector-pair sequence obtained in step 6 into a neural network classification layer whose final output has 2r+3 classes; the classification layer functions are as follows:
c_{i,j} = [h_i; h_j]
h'_{i,j} = Post-Norm(c_{i,j})
r_{i,j} = W_2 h'_{i,j} + b_2
wherein h_i ∈ H, h_j ∈ H, [h_i; h_j] is the vector concatenation operation, h'_{i,j} is the intermediate vector obtained by applying the Post-Norm function to the concatenated vector, r_{i,j} is the vector after the linear transformation, W_1 and W_2 are trainable parameters, and b_1 and b_2 are bias parameters;
step 9, in the training stage, inputting r_{i,j} into the Circle Loss function, calculating the loss and performing back propagation;
step 10, judging the prediction r_{i,j} from step 8, namely judging the category of each word pair; each position in r_{i,j} represents a different label: if the value at a position is greater than 0, the word pair is judged to have the relation corresponding to that position, and if it is less than 0, the word pair does not have that relation;
and step 11, decoding the final result by using a pseudo code according to the corresponding relation of each word pair obtained in the step 10, and finally obtaining all extracted triples.
2. The method of extracting text relations based on sequence labeling according to claim 1, wherein: in step 3, a bidirectional entity relation is a relation with two different directions, namely a forward direction and a reverse direction.
3. The method of extracting text relations based on sequence labeling according to claim 1, wherein: in step 4, the dependency relation set R_s contains three dependencies: head-entity head to tail-entity head, head-entity tail to tail-entity tail, and entity head to entity tail.
4. The method of extracting text relations based on sequence labeling according to claim 1, wherein: in step 5, the pre-training model is one of BERT, ALBERT, RoBERTa, ERNIE, and XLNet.
5. The method of extracting text relations based on sequence labeling according to claim 1, wherein: in step 6, the traversal process is as follows: each word vector forms a word-vector pair with each word vector after it (including itself), so traversing the word sequence yields n(n+1)/2 word-vector pairs, where n is the number of word vectors.
6. The method of extracting text relations based on sequence labeling according to claim 1, wherein: in step 9, the loss function is as follows:
Loss = log(1 + Σ_{k∈K} e^{s_k}) + log(1 + Σ_{l∈L} e^{−s_l})
wherein s_l and s_k are the prediction scores of the positive and negative examples respectively, L and K are the positive-example and negative-example sets respectively, and e is the base of the natural logarithm function.
7. The method of extracting text relations based on sequence labeling according to claim 1, wherein: in step 11, the pseudo code is as shown in fig. 3.
CN202110501103.6A 2021-05-08 2021-05-08 Text relation extraction method based on sequence annotation Active CN113191118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110501103.6A CN113191118B (en) 2021-05-08 2021-05-08 Text relation extraction method based on sequence annotation

Publications (2)

Publication Number Publication Date
CN113191118A true CN113191118A (en) 2021-07-30
CN113191118B CN113191118B (en) 2023-07-18

Family

ID=76984478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110501103.6A Active CN113191118B (en) 2021-05-08 2021-05-08 Text relation extraction method based on sequence annotation

Country Status (1)

Country Link
CN (1) CN113191118B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101425065A (en) * 2007-10-31 2009-05-06 日电(中国)有限公司 Entity relation excavating method and device
CN103678316A (en) * 2012-08-31 2014-03-26 富士通株式会社 Entity relationship classifying device and entity relationship classifying method
CN106484675A (en) * 2016-09-29 2017-03-08 北京理工大学 Fusion distributed semantic and the character relation abstracting method of sentence justice feature
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network
CN107133220A (en) * 2017-06-07 2017-09-05 东南大学 Name entity recognition method in a kind of Geography field
CN108280062A (en) * 2018-01-19 2018-07-13 北京邮电大学 Entity based on deep learning and entity-relationship recognition method and device
CN109408812A (en) * 2018-09-30 2019-03-01 北京工业大学 A method of the sequence labelling joint based on attention mechanism extracts entity relationship
CN111931506A (en) * 2020-05-22 2020-11-13 北京理工大学 Entity relationship extraction method based on graph information enhancement

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JACOB DEVLIN et al.: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv, pages 1-16 *
TAKUMA KATO et al.: "Embeddings of Label Components for Sequence Labeling: A Case Study of Fine-grained Named Entity Recognition", arXiv, pages 1-8 *
ZHANG Jianhu: "An Integrated Model for Chinese Word Segmentation and Part-of-Speech Tagging Based on CNN and Bidirectional LSTM", China Masters' Theses Full-text Database, pages 138-1265 *
CHEN Hehong: "Named Entity Recognition for Online Medical Consultation Texts Based on Deep Learning", China Masters' Theses Full-text Database, pages 054-66 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792539A (en) * 2021-09-15 2021-12-14 平安科技(深圳)有限公司 Entity relation classification method and device based on artificial intelligence, electronic equipment and medium
CN113792539B (en) * 2021-09-15 2024-02-20 平安科技(深圳)有限公司 Entity relationship classification method and device based on artificial intelligence, electronic equipment and medium
CN115358341A (en) * 2022-08-30 2022-11-18 北京睿企信息科技有限公司 Relation model-based reference disambiguation training method and system

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110162593B (en) Search result processing and similarity model training method and device
US20220309762A1 (en) Generating scene graphs from digital images using external knowledge and image reconstruction
CN108182295B (en) Enterprise knowledge graph attribute extraction method and system
CN108829722A (en) A kind of Dual-Attention relationship classification method and system of remote supervisory
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
US11687716B2 (en) Machine-learning techniques for augmenting electronic documents with data-verification indicators
CN111753024A (en) Public safety field-oriented multi-source heterogeneous data entity alignment method
US20170075872A1 (en) Information analysis system
CN111222318B (en) Trigger word recognition method based on double-channel bidirectional LSTM-CRF network
CN110851176B (en) Clone code detection method capable of automatically constructing and utilizing pseudo-clone corpus
CN115438215B (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN113191118A (en) Text relation extraction method based on sequence labeling
CN110188359B (en) Text entity extraction method
CN116484024A (en) Multi-level knowledge base construction method based on knowledge graph
CN111209362A (en) Address data analysis method based on deep learning
CN112347761B (en) BERT-based drug relation extraction method
CN113821635A (en) Text abstract generation method and system for financial field
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
CN112148886A (en) Method and system for constructing content knowledge graph
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN114880427A (en) Model based on multi-level attention mechanism, event argument extraction method and system
CN113342950A (en) Answer selection method and system based on semantic union
CN113626584A (en) Automatic text abstract generation method, system, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant