CN113887211A - Entity relation joint extraction method and system based on relation guidance

Info

Publication number
CN113887211A
Authority
CN
China
Prior art keywords
entity
relationship
sentence
relation
target text
Legal status
Pending
Application number
CN202111232526.9A
Other languages
Chinese (zh)
Inventor
尹美娟
胡红卫
刘晓楠
伍润民
刘威
罗向阳
颜志豪
Current Assignee
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of natural language processing, and particularly relates to a method and a system for entity relationship joint extraction based on relationship guidance. Sentences in a target text are encoded to obtain sentence vector representations; a relation extraction module extracts, from the sentence vector representations, the relation types contained in the target text; and the extracted relation types are fused, as prior knowledge, with the word vector representations of the target text sentences, after which an entity identification module identifies the entities in the target text that correspond to the extracted relation types. The method and system reduce attention to irrelevant entities and avoid extracting redundant entities; they identify the entity pairs corresponding to each of the identified relation types separately, which solves the entity overlap problem; and they finally extract all entity relationship triples contained in a sentence, improving the accuracy of entity relationship identification and facilitating application in practical scenarios.

Description

Entity relation joint extraction method and system based on relation guidance
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a method and a system for entity relationship joint extraction based on relationship guidance.
Background
With the advent of the big data era, a large amount of data is generated on the internet at every moment. Most of this data exists in unstructured form, and how to convert massive unstructured data into structured information is a problem of current concern. Information extraction technology was developed in this context. Information extraction refers to automatically extracting information such as events, entities and relationships from natural language text and outputting it in a structured form. Entity relationship extraction is a subtask of information extraction that aims to identify the entities in a text and the relationships between them, and represents them as triples (head entity, relationship, tail entity). Entity relationship triples are the basic building blocks of knowledge graphs; entity relationship extraction is therefore important for knowledge graphs and also has research value in applications such as intelligent search, automatic question answering and persona profiling.
Early methods based on rules and traditional machine learning required many manually defined rules and features and a great deal of expert knowledge, and transferred poorly across domains, so they could not cope with large-scale entity relationship extraction. With the development of deep learning, neural networks can extract features automatically, which saves considerable labor cost, transfers well across domains, and handles large-scale entity relationship extraction effectively; more and more researchers have therefore adopted deep learning for entity relationship extraction. Early deep-learning-based entity relationship extraction adopted a pipeline framework, in which the task is decomposed into two subtasks, entity identification and relation classification: all entities in the text are identified first, and the relation type between each entity pair is then classified. Although the models for entity identification and relation classification can be chosen independently, which gives great flexibility, the pipeline approach has several drawbacks: error propagation, where inaccurate entity identification degrades the subsequent relation classification; neglect of the intrinsic connection between the two tasks of entity identification and relation classification; and redundant entity pairs, because the identified entities are paired with one another before their relation types are classified, so that N entities generate $N^2$ entity pairs, a large number of which have no relation, and these redundant pairs interfere with the relation classifier.
Disclosure of Invention
Therefore, the invention provides an entity relationship joint extraction method and system based on relationship guidance, which first extracts the relation types contained in the text and then identifies the entities corresponding to each relation type, thereby reducing attention to irrelevant entities, avoiding the extraction of redundant entities, solving the entity overlap problem, making entity relationship extraction more accurate, and facilitating application in practical scenarios.
According to the design scheme provided by the invention, the entity relationship joint extraction method based on relationship guidance comprises the following:
encoding the sentences in the target text to obtain sentence vector representations of the target text;
extracting, from the sentence vector representations, the relation types contained in the target text by using a relation extraction module;
and fusing the extracted relation types, as prior knowledge, with the word vector representations of the target text sentences, and identifying the entities corresponding to the extracted relation types in the target text by using an entity identification module.
Further, in the entity relationship joint extraction method based on relationship guidance, a pre-trained BERT model is used to encode the sentences in the target text, obtaining a word embedding vector for each word in the target text and generating a vector representation of the sentence context by capturing sentence features.
Further, a classification identifier is added at the beginning of each sentence in the target text, the sentence with the classification identifier is used as the input to the BERT model, and the output sentence vector representation is obtained by encoding with the BERT model.
Further, in the relation extraction module, a sigmoid function is used to model relation extraction as a multi-label binary classification task, so that the multiple relation types in the sentence vector representation are identified and output.
Further, the multi-label binary classification task is expressed as $p_r = \sigma(W_r \cdot h_{cls} + b_r)$, where $p_r$ is the output relation type label, $W_r \in \mathbb{R}^{N \times d}$, $N$ is the total number of relation types, $d$ is the dimension of the sentence vector representation, $b_r$ is the bias vector, $\sigma$ denotes the sigmoid function, and $h_{cls}$ is the sentence vector.
Further, the loss function of the relation extraction module is the binary cross-entropy function, expressed as follows:
[Formula image BDA0003316430180000021 in the original: binary cross-entropy loss of the relation extraction module]
where $y_i$ is the true relation type label.
Further, for the extracted relation types $(r_1, r_2, \ldots, r_m)$, where $m$ is the number of extracted relations, each relation type is encoded by table lookup according to its index to obtain a relation type encoding vector, and the word vector representations of the sentence are superposed and fused with the relation type encoding vector, so that the entity identification module outputs the entities corresponding to the extracted relation type.
Further, the extracted relation type and the word vector representations of the target text sentence are superposed and fused to form the input vector of the entity identification module; for each extracted relation type, binary pointers are used to mark the positions of the corresponding entities, and the entity triple (head entity, relationship, tail entity) corresponding to the relation type is obtained from the entity positions.
Further, in marking the entity positions, two identical binary classifiers are used to decode each of the head entity and the tail entity: one classifier marks the start position of the entity and the other marks the end position. A binary label is assigned to each word of the input vector, the probability that the word is the start position or the end position of an entity is detected, and, following the nearest-match principle, the end position closest to the start position is selected to generate the entity corresponding to the relation type.
Further, the present invention also provides a system for extracting entity relationship based on relationship guidance, which comprises: a sentence coding module, a relation extracting module and an entity identifying module, wherein,
the sentence encoding module is used for encoding the sentences in the target text to obtain sentence vector representations of the target text;
the relation extraction module is used for extracting, from the sentence vector representations, the relation types contained in the target text;
and the entity identification module is used for fusing the extracted relation types, as prior knowledge, with the word vector representations of the target text sentences, and identifying the entities corresponding to the extracted relation types in the target text.
The invention has the beneficial effects that:
According to the invention, the relation types contained in the text are first extracted by the relation extraction module, and the relation types extracted in advance are then fused into the entity identification module. This reduces attention to irrelevant entities and avoids extracting redundant entities; the entity pairs corresponding to each identified relation type are identified separately, which solves the entity overlap problem; and all entity relationship triples contained in the sentence are finally extracted, improving the accuracy of entity relationship identification. Furthermore, experimental results on the DuIE data set show that the F1 value of the entity relationship joint extraction model comprising the relation extraction module and the entity identification module reaches 78.4%, an improvement of 1-28% over the F1 values of the baseline models, so the method has good application prospects.
Description of the Drawings
FIG. 1 is a schematic flow chart of the entity relationship joint extraction method based on relationship guidance in an embodiment;
FIG. 2 is a schematic diagram of entity overlap in an embodiment;
FIG. 3 is a schematic diagram of the entity relationship joint extraction model in an embodiment;
FIG. 4 is an example from the data set in an embodiment;
FIG. 5 is a schematic diagram of the results of the entity overlap experiment in an embodiment;
FIG. 6 is a schematic diagram of the results of the entity redundancy experiment in an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and technical solutions.
Entity relationship extraction is the basic task of information extraction, which aims at extracting triples from unstructured data. Most of the existing combined extraction methods adopt the idea of firstly identifying entities and then extracting relationships, and the problems of entity overlapping and entity redundancy exist. The embodiment of the invention, as shown in fig. 1, provides a relationship-oriented entity relationship joint extraction method, which includes the following steps:
S101, encoding the sentences in the target text to obtain sentence vector representations of the target text;
S102, extracting, from the sentence vector representations, the relation types contained in the target text by using a relation extraction module;
S103, fusing the extracted relation types, as prior knowledge, with the word vector representations of the target text sentences, and identifying the entities corresponding to the extracted relation types in the target text by using an entity identification module.
To overcome the drawbacks of the pipeline method, joint entity relationship extraction frameworks let the two subtasks of entity identification and relation classification share an underlying text encoding layer, so that the two tasks reinforce each other during training and task interaction is achieved. However, because decoding still identifies the entities first and classifies the relations second, the problems of error propagation and redundant entities remain. A further approach uses a unified joint labeling scheme for entities and relations: entity relationship extraction is treated as a sequence labeling problem, entity labels and relation labels are encoded in a unified way, and entities and relations are decoded at the same time. Entity overlap refers to the situation in which, when a text contains multiple triples, the entities of different triples coincide; it includes single entity overlap and entity pair overlap. Single entity overlap (SingleEntityOverlap, SEO) means that one entity has relationships with several other entities, and entity pair overlap (EntityPairOverlap, EPO) means that the same entity pair has multiple relationships. As shown in fig. 2, in the SEO example the single entity "smile" has a relationship with both the entity "liu dao" and the entity "xiao", and in the EPO example the entity pair (away, xiao) has both the "composition" relationship and the "singer" relationship. A sequence-to-sequence model with a copy mechanism has also been used to address entity overlap, but because only a single word can be copied from the sentence when identifying an entity, that method cannot identify multi-word entities.
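The overlap categories defined above can be checked mechanically on a list of annotated triples. The following is a minimal sketch, not part of the patent: the function name `overlap_types` is hypothetical, and treating an entity pair as unordered is an assumption made here for illustration.

```python
from collections import defaultdict


def overlap_types(triples):
    """Return the overlap categories found among (head, relation, tail) triples."""
    pair_relations = defaultdict(set)  # entity pair -> relations linking that pair
    entity_pairs = defaultdict(set)    # entity -> distinct pairs it takes part in
    for head, relation, tail in triples:
        pair = frozenset((head, tail))  # assumption: the pair is unordered
        pair_relations[pair].add(relation)
        entity_pairs[head].add(pair)
        entity_pairs[tail].add(pair)

    found = set()
    # EPO: the same entity pair carries more than one relation.
    if any(len(rels) > 1 for rels in pair_relations.values()):
        found.add("EPO")
    # SEO: one entity appears in more than one distinct pair.
    if any(len(pairs) > 1 for pairs in entity_pairs.values()):
        found.add("SEO")
    return found or {"Normal"}


print(overlap_types([("A", "r1", "B"), ("A", "r2", "B"), ("A", "r3", "C")]))
```

A sentence can fall into both categories at once, which is why the function returns a set rather than a single label.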
The entity-oriented cascaded extraction scheme handles the entity overlap problem well, but tail entity extraction must be performed for every relation of every extracted head entity; since many of these relations have no tail entity for a given head entity, redundant entities are extracted. As is known from the nature of natural language, relationships can often be obtained from the sentence context without relying on entities. For example, in a sentence stating that a work is a suspense spy drama directed by some director, the word "directed" alone shows that the sentence contains the director relationship, without reference to any specific entity. In this scheme, a joint extraction model first extracts the relation types contained in the text and then identifies the entities corresponding to each relation; the extracted relations are fused into the entity identification module, which reduces attention to irrelevant entities, avoids extracting redundant entities, solves the entity overlap problem, improves the accuracy of entity relationship identification, and facilitates application in practical scenarios.
In the entity relationship joint extraction method based on relationship guidance of the embodiment of the invention, further, a pre-trained BERT model is used to encode the sentences in the target text, obtaining a word embedding vector for each word in the target text and generating a vector representation of the sentence context by capturing sentence features. Further, a classification identifier is added at the beginning of each sentence in the target text, the sentence with the classification identifier is used as the input to the BERT model, and the output sentence vector representation is obtained by encoding with the BERT model. Further, in the relation extraction module, a sigmoid function is used to model relation extraction as a multi-label binary classification task, so that the multiple relation types in the sentence vector representation are identified and output.
For a given sentence $S = \{w_1, w_2, \ldots, w_n\}$, where $w_i$ denotes the $i$-th word in $S$, and a predefined relation set $R$, the task of entity relationship extraction is to extract all possible triples $(h, r, t)$ in $S$, where $(h, t)$ is an entity pair in the sentence, $h$ is the head entity, $t$ is the tail entity, and $r \in R$ is the type of relation between them.
The purpose of entity relationship extraction is to extract the triples existing in the sentence as much as possible, and there may be overlap between entities in different triples. Unlike the entity-oriented cascaded extraction method, the extraction scheme can be modeled as follows:
$p((h, r, t) \mid S) = p(r \mid S)\, p((h, t) \mid r, S)$
The joint extraction model constructed according to this scheme decomposes entity relationship extraction into two parts, $p(r \mid S)$ and $p((h, t) \mid r, S)$: $p(r \mid S)$ first extracts the relation types $r$ in the sentence $S$, and $p((h, t) \mid r, S)$ then identifies, for each extracted relation $r$ and in combination with the sentence $S$, the entity pairs $(h, t)$ corresponding to that relation type.
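The control flow implied by this factorization can be sketched as follows. This is only an illustration under stated assumptions: `extract_triples` and the two toy stand-in functions are hypothetical names, and the keyword matching and lookup table merely take the place of the trained relation extraction and entity identification modules.

```python
def extract_triples(sentence, detect_relations, find_entity_pairs):
    """Relation-first extraction: p(r|S) first, then p((h,t)|r,S) per relation."""
    triples = []
    for relation in detect_relations(sentence):
        # Entity pairs are searched only for relations that were detected.
        for head, tail in find_entity_pairs(sentence, relation):
            triples.append((head, relation, tail))
    return triples


# Toy stand-ins for the two trained modules (hypothetical, illustration only).
def toy_detect_relations(sentence):
    cues = (("director", "directed"), ("actor", "stars"))
    return [relation for relation, cue in cues if cue in sentence]


def toy_find_entity_pairs(sentence, relation):
    table = {"director": [("FilmX", "PersonY")], "actor": [("FilmX", "PersonY")]}
    return table.get(relation, [])


example = "FilmX was directed by PersonY, who also stars in it."
print(extract_triples(example, toy_detect_relations, toy_find_entity_pairs))
```

Because each relation is decoded in its own pass, the same entity pair can appear under two relations, which is how a relation-first design accommodates entity pair overlap.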
With the introduction of pre-trained language models, many natural language processing tasks have made breakthrough progress; since the advent of the BERT model in particular, state-of-the-art (SOTA) results have repeatedly been achieved across natural language processing tasks. The BERT model, trained on large-scale corpora and based on the self-attention mechanism, contains rich semantic information and provides a good foundation for downstream tasks. To better capture sentence features, a pre-trained BERT model can be used to encode the sentence as the basis for the subsequent relation classification and entity identification. For a sentence S, a [CLS] identifier is added at the beginning of the sentence, and the input to the BERT model is as follows:
$S' = \{[\mathrm{CLS}], w_1, w_2, \ldots, w_n\}$
where [CLS] is the classification identifier, $w_i$ denotes a word in the sentence, and $n$ is the length of the original sentence, so the sentence $S'$ with the special identifier added has length $n + 1$. Encoding $S'$ with the BERT model gives the output vectors of the last layer of BERT:
$H_w = [h_{cls}, h_1, h_2, \ldots, h_n]$
where $h_{cls}$ is the encoded vector representation of the [CLS] identifier and $h_i$ is the encoded vector representation of the word $w_i$. $h_{cls}$ and $h_i$ have the same dimension, i.e. $h_{cls} \in \mathbb{R}^d$ and $h_i \in \mathbb{R}^d$, where $d$ is the dimension of the BERT encoding.
Unlike the entity-oriented cascaded extraction method, the ROJER model first detects the relation types contained in the sentence and uses them as prior knowledge to assist entity identification, which avoids extracting redundant entities. Different sentences contain different numbers of relations, and one sentence may contain several, so detecting the relations in a sentence can be regarded as a multi-label classification problem. The encoded [CLS] identifier vector $h_{cls}$ aggregates the feature information of the sentence and is commonly used for sentence-level classification tasks, so $h_{cls}$ can be used to extract the relations contained in the sentence. Because a sentence may contain several relations, relation detection is modeled as a multi-label binary classification task using a sigmoid function: when the sentence contains a given relation, the label at the corresponding position is 1, and otherwise it is 0.
$p_r = \sigma(W_r \cdot h_{cls} + b_r)$
where $W_r \in \mathbb{R}^{N \times d}$, $b_r$ is the bias vector, $N$ is the total number of relation types, and $\sigma$ denotes the sigmoid function.
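As a small numeric illustration of this multi-label classifier, the sketch below computes one independent sigmoid probability per relation type from a toy sentence vector. The function names, the toy weights and the decision threshold of 0.5 are assumptions for illustration; the patent does not state the threshold used for relation detection.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relation_probabilities(W_r, b_r, h_cls):
    """p_r = sigmoid(W_r . h_cls + b_r): one probability per relation type."""
    return [
        sigmoid(sum(w * h for w, h in zip(row, h_cls)) + b)
        for row, b in zip(W_r, b_r)
    ]


def detect_relations(probs, threshold=0.5):
    """Indices of relation types whose probability exceeds the threshold."""
    return [i for i, p in enumerate(probs) if p > threshold]


# Toy values: N = 3 relation types, d = 2.
W_r = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
b_r = [0.0, 0.0, 0.0]
h_cls = [1.0, 1.0]

probs = relation_probabilities(W_r, b_r, h_cls)
print(detect_relations(probs))
```

Unlike a softmax, the sigmoid scores do not compete with each other, so any number of relation types (including none) can be switched on for the same sentence.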
For the multiple relation types $(r_1, r_2, \ldots, r_m)$ identified in a sentence, where $m$ is the number of identified relations, the encoding vector of each relation type $r_j$, $1 \le j \le m$ [formula image BDA0003316430180000051 in the original], is obtained by table lookup according to its index, so that it can be fused with the sentence encoding for the corresponding entity identification. The lookup table is randomly initialized and is updated continuously during training.
Because each classification is binary and uses a sigmoid function, the loss function is the binary cross-entropy.
[Formula image BDA0003316430180000061 in the original: binary cross-entropy loss of the relation extraction module]
where $N$ is the number of relation types and $y_i$ is the true relation label in the training set.
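The formula itself survives only as an image reference. For orientation, the standard form of binary cross-entropy over $N$ relation types is given below; here $p_i$ is taken to be the $i$-th component of $p_r$, and both that symbol and the $1/N$ normalization are assumptions, since the patent's exact expression is not recoverable from this text.

```latex
L_{rel} = -\frac{1}{N} \sum_{i=1}^{N} \Bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \Bigr]
```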
In an embodiment of the entity relationship joint extraction method based on relationship guidance, further, the extracted relation type and the word vector representations of the target text sentence are superposed and fused to form the input vector of the entity identification module; for each extracted relation type, binary pointers are used to mark the positions of the corresponding entities, and the entity triple (head entity, relationship, tail entity) corresponding to the relation type is obtained from the entity positions. Further, in marking the entity positions, two identical binary classifiers are used to decode each of the head entity and the tail entity: one classifier marks the start position of the entity and the other marks the end position. A binary label is assigned to each word of the input vector, the probability that the word is the start position or the end position of an entity is detected, and, following the nearest-match principle, the end position closest to the start position is selected to generate the entity corresponding to the relation type.
The entity identification module aims to extract, for a given relation, the entity pairs in the sentence that correspond to that relation. The input vector for entity identification is formed by superposing the relation type encoding identified by relation classification on the encoding of each word in the sentence, so that the prior knowledge of the relation type is fused into entity identification; this reduces attention to entities unrelated to the relation type, avoids extracting redundant entities, and improves entity identification.
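[Formula image BDA0003316430180000062 in the original: fused input vector of the entity identification module]
where $h_i$ is the word vector after BERT encoding and [formula image BDA00033164301800000610 in the original] is the relation type encoding vector obtained by table lookup; $h_i \in \mathbb{R}^d$, the relation type encoding vector lies in $\mathbb{R}^l$ [formula image BDA0003316430180000063 in the original], and $l$ is the dimension of the relation type encoding.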
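The lookup and fusion step can be sketched as follows. The patent text describes the fusion only as "superposing"; the sketch assumes this means concatenation, which is suggested by the $d + l$ dimension quoted for the pointer weights nearby, but that reading is an assumption. The function names and the random initialization range are likewise hypothetical.

```python
import random


def build_relation_table(num_relations, dim, seed=0):
    """Randomly initialized lookup table: one l-dimensional vector per relation."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_relations)]


def fuse(word_vectors, relation_vector):
    """Append the same relation vector to every word vector: d -> d + l."""
    return [list(h_i) + list(relation_vector) for h_i in word_vectors]


table = build_relation_table(num_relations=49, dim=3)  # 49 types as in DuIE, l = 3
words = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]           # toy encoding: n = 3, d = 2
fused = fuse(words, table[7])                          # relation index 7, by lookup

print(len(fused), len(fused[0]))
```

In a trained model the table rows would be learned parameters updated with the rest of the network; here they stay fixed after initialization.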
The head entity and the tail entity are each decoded with two identical binary classifiers. One classifier identifies the start position of the entity, denoted start, and the other marks the end position of the entity, denoted end. The start and end positions of an entity are detected separately by assigning a binary label to each word: a word that is the start or end position of an entity is labeled 1, and otherwise 0.
[Formula images BDA0003316430180000064 to BDA0003316430180000067 in the original: pointer probabilities for the start and end positions of the head entity and of the tail entity]
Here [formula image BDA0003316430180000068 in the original] denotes the probabilities that the $i$-th word in the sentence is the start position and the end position of the head entity, and [formula image BDA0003316430180000069 in the original] denotes the probabilities that the $i$-th word is the start position and the end position of the tail entity. If a probability is greater than the set threshold, the corresponding position is labeled 1, and otherwise 0. The weight matrices [formula image BDA0003316430180000071 in the original] all have the same dimension, $\mathbb{R}^{n \times (d+l)}$, and [formula image BDA0003316430180000072 in the original] are the bias vectors. When a sentence contains several instances of the same relation, entities are extracted by the nearest-match principle: each start position is matched with the closest end position to its right to generate an entity. Because entity extraction for different relation types is independent, the same entity or entity pair can be identified for different relation types, which avoids the entity overlap problem.
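The threshold-and-nearest-match decoding described above can be sketched as follows. The function name `decode_spans`, the toy probabilities and the threshold of 0.5 are assumptions for illustration; allowing the end position to coincide with the start position (a single-word entity) is also an assumption.

```python
def decode_spans(start_probs, end_probs, threshold=0.5):
    """Pair each start position with the nearest end position at or after it."""
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        candidates = [e for e in ends if e >= s]
        if candidates:  # a start with no end to its right yields no entity
            spans.append((s, min(candidates)))
    return spans


tokens = ["w1", "w2", "w3", "w4", "w5", "w6"]
start_probs = [0.9, 0.1, 0.1, 0.8, 0.1, 0.1]
end_probs = [0.1, 0.7, 0.1, 0.1, 0.1, 0.9]

for s, e in decode_spans(start_probs, end_probs):
    print(tokens[s:e + 1])
```

The same routine would be run once for the head entity and once for the tail entity, each with its own pair of start and end probability sequences.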
Entity decoding also uses a sigmoid function, so its loss function is likewise the binary cross-entropy.
[Formula image BDA0003316430180000073 in the original: binary cross-entropy loss of the entity identification module]
where $n$ is the length of the sentence and [formula image BDA0003316430180000074 in the original] is the true label indicating whether the $i$-th word in the sentence is the start or end position of a head or tail entity.
To realize relation-guided entity relationship joint extraction, a relation-oriented joint extraction model framework, ROJER, can be constructed based on the above method. As shown in FIG. 3, it comprises three modules: sentence encoding, relation extraction and entity identification. The sentence encoding module uses a BERT model to encode the sentence and obtains the sentence vector and the vector representation of each word; the relation extraction module performs multi-label classification on the sentence vector and extracts the relation types contained in the sentence; the entity identification module identifies the corresponding head and tail entities based on the extracted relation types. With this joint extraction architecture, relations are extracted first and the corresponding entities are identified afterwards. Compared with entity-oriented joint extraction methods, the relation information extracted in advance is fused into entity identification, so the entity identification module pays less attention to entities unrelated to the relation type, avoids extracting redundant entities, and identifies entities more effectively. In addition, during entity identification the same entity or entity pair can be identified several times under different relation types, which avoids the entity overlap problem. During joint training, entity identification and relation extraction interact through the jointly trained loss function, which strengthens the intrinsic connection between the modules. The total loss function comprises the relation extraction part and the entity identification part: the loss functions of the modules are added to obtain the joint loss function of the model, and the parameters of the model are learned by minimizing this loss through joint training:
$L = L_{rel} + L_{ent}$
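A minimal numeric sketch of this joint objective is given below. It assumes that each loss is a mean binary cross-entropy and that the entity loss is the sum over the four pointer taggers (head start, head end, tail start, tail end); the function names, the averaging and the clamping constant are assumptions, not details taken from the patent.

```python
import math


def bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over paired probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)


def joint_loss(rel_probs, rel_labels, tagger_probs, tagger_labels):
    """L = L_rel + L_ent, with L_ent summed over the pointer taggers."""
    l_rel = bce(rel_probs, rel_labels)
    l_ent = sum(bce(p, y) for p, y in zip(tagger_probs, tagger_labels))
    return l_rel + l_ent


loss = joint_loss(
    rel_probs=[0.9, 0.2], rel_labels=[1, 0],
    tagger_probs=[[0.8, 0.1], [0.2, 0.7]], tagger_labels=[[1, 0], [0, 1]],
)
print(round(loss, 4))
```

Summing the two terms means a gradient step updates the shared sentence encoder with signals from both modules, which is the interaction the paragraph above describes.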
Further, based on the foregoing method, an embodiment of the present invention also provides an entity relationship joint extraction system based on relationship guidance, comprising a sentence encoding module, a relation extraction module and an entity identification module, wherein
the sentence encoding module is used for encoding the sentences in the target text to obtain sentence vector representations of the target text;
the relation extraction module is used for extracting, from the sentence vector representations, the relation types contained in the target text;
and the entity identification module is used for fusing the extracted relation types, as prior knowledge, with the word vector representations of the target text sentences, and identifying the entities corresponding to the extracted relation types in the target text by using the trained entity identification model.
To verify the validity of the scheme, it is further explained below in combination with experimental data:
The DuIE data set is used to train and test the ROJER model architecture of this scheme. The sentences in the DuIE data set are derived from Baidu Baike and Baidu news feed text. The data set contains 214739 instances and 49 relation types, with 173108 instances in the training set, 21639 in the validation set and 19992 in the test set; each instance consists of a sentence and one or more triples annotated from it. The data set annotates 458184 triples in total and covers three kinds of triple data, Normal, EPO and SEO, so it can be used to verify how the model architecture performs on the entity overlap problem. Because the test set is not published, 10% of the original training set is randomly selected as the validation set, the remaining 90% is used as the training set, and the original validation set is used as the test set. The statistics of the re-split data set and of entity overlap are shown in Table 1.
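A re-split of this kind can be sketched as follows, assuming the 10% is drawn from the published training set (the unpublished test set cannot itself be split). The function name `resplit`, the seed and the rounding rule are assumptions, so the resulting sizes need not match those in Table 1.

```python
import random


def resplit(train_instances, validation_fraction=0.1, seed=42):
    """Randomly hold out a fraction of the training instances for validation."""
    shuffled = list(train_instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * validation_fraction)
    return shuffled[cut:], shuffled[:cut]  # (new training set, new validation set)


# Integer ids stand in for the 173108 published training instances.
new_train, new_valid = resplit(range(173108))
print(len(new_train), len(new_valid))
```

Fixing the seed keeps the split reproducible across runs, which matters when several models are compared on the same held-out data.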
Table 1. Data set statistics (table provided as image BDA0003316430180000081)
The statistics show that entity overlap is widespread in the data set, and that a single sentence sample may contain both single entity overlap and entity pair overlap. For example, 347 sentence instances in the test set exhibit both kinds of overlap at the same time, which makes entity relation extraction more complicated. In the data set example shown in FIG. 4, the entity "Movie" forms an entity pair overlap with the entity "Cowitter" and a single entity overlap with "11.07.1988".
Precision, recall and the F1 value, which are commonly used in relation extraction, are adopted as evaluation metrics. An extracted triple is judged correct only if its head entity, relation type and tail entity are all exactly consistent with the annotation. The calculation formulas are as follows.
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$
Here TP is the number of positive samples predicted correctly, FP is the number of samples wrongly predicted as positive, and FN is the number of positive samples that the model fails to predict, where a positive sample is a triple.
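The exact-match evaluation can be sketched in a few lines. The function name `triple_prf` and the convention of returning 0.0 when a denominator is zero are assumptions for illustration; triples are represented as (head entity, relation type, tail entity) tuples, and the example values are hypothetical.

```python
def triple_prf(predicted, gold):
    """Precision, recall and F1 over exactly matching triples."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)   # predicted and annotated
    fp = len(pred_set - gold_set)   # predicted but not annotated
    fn = len(gold_set - pred_set)   # annotated but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical example: two predictions, one of which matches an annotation.
gold = [("a", "r1", "b"), ("d", "r2", "e"), ("f", "r3", "g")]
predicted = [("a", "r1", "b"), ("a", "r1", "c")]
precision, recall, f1 = triple_prf(predicted, gold)
```

A triple with the right entities but the wrong relation type counts as both a false positive and a false negative, which is why this metric is stricter than scoring entities and relations separately.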
The model architecture can be implemented on the basis of HuggingFace's Transformers library, and the BERT Chinese pre-trained language model adopted is "bert-base-chinese". The experimental environment is shown in Table 2.
Table 2. Experimental environment (table provided as image BDA0003316430180000091)
The hyper-parameter settings of the model are shown in table 3.
Table 3. Hyperparameter settings (table provided as image BDA0003316430180000092)
Several models are selected for comparison, including a pipeline model and joint extraction models; the joint extraction models cover one-pass decoding, separate decoding and cascade decoding. The basic principles of the five comparison models are as follows.
(1) Pipeline: a pipeline model in which the entity recognition model identifies all entities in the sentence by sequence labeling, and the relation classification model classifies the relation between entities using the BERT-encoded sentence vector and entity vectors.
(2) NovelTagging: a joint decoding model that encodes entity labels and relation labels in a unified tagging scheme; it decodes entities and relations simultaneously but cannot handle entity overlap.
(3) MHS (Multi-Head Selection): a joint extraction model that uses a CRF to predict entity labels and extracts relations and entities through multi-head selection, so as to handle entity overlap.
(4) SpERT (Span-based Entity and Relation Transformer): entities are identified first, and then the entities and the context between them are fused for relation classification. The two tasks share the underlying encoder.
(5) CasRel (Cascade Binary Tagging Framework): an entity-oriented cascade extraction model that first extracts head entities and then identifies the tail entities under each relation.
The performance of each model on the data set is shown in Table 4.
Table 4. Experimental results on the DuIE data set (table provided as image BDA0003316430180000101)
The results show that the ROJER model outperforms the other comparison models.
1) The NovelTagging model uses unified entity-relation encoding and does not need multiple decoding passes, so its precision is high; its recall, however, is low because it cannot handle entity overlap. The ROJER model architecture handles entity overlap; compared with NovelTagging, its recall is improved by 38.5% and its F1 value by 28.1%.
2) Although the Pipeline model uses a pipeline extraction method, its underlying encoder is BERT, and its F1 value is greatly improved over the NovelTagging and MHS models, which use static word vectors. The ROJER model also uses BERT as its encoder and its F1 value is likewise greatly improved, which indicates that the semantic representation capability of the pre-trained BERT model is better than that of static word vectors.
3) The SpERT model, the CasRel model and the ROJER model architecture are all BERT-based joint extraction models and perform better than the Pipeline model, which indicates that joint extraction can address the error propagation and lack of task interaction found in pipeline models.
4) The CasRel model, which first extracts head entities and then predicts tail entities for each relation, is better on every metric than the SpERT model, which first identifies entities and then classifies relations. Compared with the current advanced CasRel model, the precision, recall and F1 value of the ROJER model are improved by 1.9%, 0.7% and 1.2% respectively, which shows that first extracting relations and then predicting the corresponding entities for the identified relations improves triple extraction.
Entity overlap experiment results: to examine the model's ability to cope with entity overlap, the test data are divided into three classes, non-overlapping (Normal), single entity overlap (SEO) and entity pair overlap (EPO), and experiments are performed on each class separately. The results are shown in FIG. 5. The NovelTagging model can predict only one entity-relation label for each word, so its F1 value is low under both single entity overlap and entity pair overlap. The Pipeline and SpERT models first identify all entities and then extract the relations between them, and can predict only one relation for the same entity pair, so they perform poorly under entity pair overlap. The MHS model, the CasRel model and the ROJER model architecture can handle both kinds of entity overlap, and their performance is balanced across the three cases. By comparison, the model architecture of the present application, which identifies the corresponding entities for each of the relation types extracted in advance, has the greater advantage.
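Dividing sentences into the Normal, SEO and EPO classes can be sketched as follows. The sketch assumes that two triples form an entity pair overlap when they involve the same two entities regardless of direction, and a single entity overlap when they share exactly one entity; the source does not spell out these rules. The function name `overlap_types` and the relation names in the example are hypothetical.

```python
from itertools import combinations


def overlap_types(triples):
    """Return the overlap classes present among a sentence's triples."""
    labels = set()
    unique = sorted(set(triples))  # ignore exact duplicate triples
    for (h1, _, t1), (h2, _, t2) in combinations(unique, 2):
        pair1, pair2 = {h1, t1}, {h2, t2}
        if pair1 == pair2:
            labels.add("EPO")      # same entity pair, different triples
        elif pair1 & pair2:
            labels.add("SEO")      # one shared entity
    return labels or {"Normal"}


# Entities from the FIG. 4 example; relation names are placeholders.
example = [
    ("Movie", "relation_a", "Cowitter"),
    ("Movie", "relation_b", "Cowitter"),
    ("Movie", "relation_c", "11.07.1988"),
]
classes = overlap_types(example)
```

A sentence can fall into both SEO and EPO at once, which matches the observation that some test instances exhibit both kinds of overlap.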
To verify how well the models handle the entity redundancy problem, the detailed extraction results of each model are shown in FIG. 6. Because the data set contains a large number of entity overlap cases and the NovelTagging model cannot handle entity overlap, both the number of triples it extracts and the number of correct triples are small. The SpERT and Pipeline models both recognize the entities in a sentence first and then extract the relations between them; SpERT is a parameter-sharing joint extraction model. Compared with the Pipeline model, SpERT extracts fewer triples, but more of them are correct, which indicates that the joint extraction model SpERT can alleviate the entity redundancy problem of the Pipeline model. Compared with the advanced CasRel model, the ROJER model extracts fewer triples with higher precision, which shows that the ROJER model architecture, by first extracting the relation types in the sentence and then fusing this prior knowledge into entity recognition, reduces the attention the entity recognition module pays to irrelevant entities and avoids extracting redundant entities.
Entity relation extraction technology automatically extracts triples from unstructured text and is the basis of natural language applications such as knowledge graphs. In the experiments of this scheme, to address entity redundancy and entity overlap in triple extraction, the ROJER model first identifies all relation types contained in a sentence and then fuses the identified relation types into the entity recognition process, so that entities are recognized under a specified relation; this reduces interference from irrelevant entities and avoids entity redundancy. The start and end positions of the head entity and the tail entity are marked separately by pointer tagging to identify the entity pair corresponding to each relation type, which handles entity overlap. The F1 value on the DuIE data set reaches 78.4%, better than the comparison models, and the extraction results show that the model architecture of this scheme effectively addresses entity redundancy and entity overlap.
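The pointer tagging step can be illustrated with a small decoding sketch. It assumes per-word start and end probabilities produced by two binary classifiers, a decision threshold of 0.5, and that the nearest end position is searched at or after each start position, following the nearest-match rule stated in the claims; the threshold, the inclusive span indexes and the function name `decode_spans` are assumptions for illustration.

```python
def decode_spans(start_probs, end_probs, threshold=0.5):
    """Pair each predicted start with the nearest end at or after it.

    Returns (start, end) word indexes, both inclusive.
    """
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        following = [e for e in ends if e >= s]
        if following:
            spans.append((s, min(following)))
    return spans


# Toy probabilities for a five-word sentence under one relation type.
start_probs = [0.9, 0.1, 0.1, 0.8, 0.1]
end_probs = [0.1, 0.7, 0.1, 0.1, 0.9]
spans = decode_spans(start_probs, end_probs)
```

Decoding is repeated for the head entity and the tail entity under each extracted relation type, so a word can belong to different entities in different triples.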
Unless specifically stated otherwise, the relative arrangement of the components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the present invention.
Based on the foregoing method and/or system, an embodiment of the present invention further provides a server, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
Based on the above method and/or system, the embodiment of the invention further provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the above method.
In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention and shall be covered by its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An entity relation joint extraction method based on relation guidance, characterized by comprising the following steps:
encoding sentences in the target text to obtain sentence vector representations of the target text;
extracting, from the sentence vector representation, the relation types contained in the target text by using a relation extraction module;
and fusing the extracted relation types, as prior knowledge, with the word vector representations of the target text sentence, and identifying the entities corresponding to the extracted relation types in the target text by using an entity recognition module.
2. The entity relation joint extraction method based on relation guidance according to claim 1, wherein a pre-trained BERT model is used to encode the sentences in the target text, obtaining a word embedding vector for each word in the target text and generating a vector representation of the sentence context by capturing sentence features.
3. The entity relation joint extraction method based on relation guidance according to claim 2, wherein a classification identifier is added at the beginning of each sentence in the target text, the sentence with the classification identifier is used as the input of the BERT model, and the sentence vector representation is obtained as the output of encoding by the BERT model.
4. The entity relation joint extraction method based on relation guidance according to claim 1, wherein in the relation extraction module, relation extraction is modeled as a multi-label binary classification task using a sigmoid function, so as to identify and output the multiple relation types contained in the sentence vector representation.
5. The entity relation joint extraction method based on relation guidance according to claim 4, wherein the multi-label binary classification task is expressed as: $p_r = \sigma(W_r \cdot h_{cls} + b_r)$, where $p_r$ is the output relation type label, $W_r \in \mathbb{R}^{N \times d}$, $N$ is the total number of relation types, $d$ is the dimension of the sentence vector representation, $b_r$ is the bias vector, $\sigma$ denotes the sigmoid function, and $h_{cls}$ is the sentence vector.
6. The entity relation joint extraction method based on relation guidance according to claim 5, wherein the loss function of the relation extraction module is a binary cross-entropy function, expressed as:
(formula provided as image FDA0003316430170000011)
where $y_i$ is the true relation type label.
7. The entity relation joint extraction method based on relation guidance according to claim 1, 4, 5 or 6, wherein the extracted relation types $(r_1, r_2, ..., r_m)$, where $m$ is the number of extracted relations, are encoded by table lookup according to their indexes to obtain relation type encoding vectors, and the word vector representations in the sentence are superposed and fused with the relation type encoding vector, so that the entity recognition module outputs the entities corresponding to the extracted relation types.
8. The entity relation joint extraction method based on relation guidance according to claim 1, wherein the extracted relation types are superposed and fused with the word vectors of the target text sentence to serve as the input vectors of the entity recognition module; based on the extracted relation types, binary pointers are used to mark the entity positions corresponding to each relation type, and the entity triples (head entity, relation, tail entity) corresponding to the relation types are obtained from the entity positions.
9. The entity relation joint extraction method based on relation guidance according to claim 8, wherein, in marking the entity positions, the head entity and the tail entity are each decoded with two identical binary classifiers, one for marking the entity start position and the other for marking the entity end position; each word of the input vector is assigned a binary label by detecting the probability that it is an entity start or end position, and, following the nearest-match rule, the end position closest to a start position is selected to generate the entity corresponding to the relation type.
10. An entity relation joint extraction system based on relation guidance, comprising a sentence encoding module, a relation extraction module and an entity recognition module, wherein:
the sentence encoding module is used for encoding sentences in the target text to obtain sentence vector representations of the target text;
the relation extraction module is used for extracting, from the sentence vector representation, the relation types contained in the target text;
and the entity recognition module is used for fusing the extracted relation types, as prior knowledge, with the word vector representations of the target text sentence, and identifying the entities corresponding to the extracted relation types in the target text.
CN202111232526.9A 2021-10-22 2021-10-22 Entity relation joint extraction method and system based on relation guidance Pending CN113887211A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111232526.9A CN113887211A (en) 2021-10-22 2021-10-22 Entity relation joint extraction method and system based on relation guidance

Publications (1)

Publication Number Publication Date
CN113887211A true CN113887211A (en) 2022-01-04

Family

ID=79004316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111232526.9A Pending CN113887211A (en) 2021-10-22 2021-10-22 Entity relation joint extraction method and system based on relation guidance

Country Status (1)

Country Link
CN (1) CN113887211A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841122A (en) * 2022-01-25 2022-08-02 电子科技大学 Text extraction method combining entity identification and relationship extraction, storage medium and terminal
CN114528418A (en) * 2022-04-24 2022-05-24 杭州同花顺数据开发有限公司 Text processing method, system and storage medium
CN114841151A (en) * 2022-07-04 2022-08-02 武汉纺织大学 Medical text entity relation joint extraction method based on decomposition-recombination strategy
CN115186649A (en) * 2022-08-30 2022-10-14 北京睿企信息科技有限公司 Relational model-based segmentation method and system for ultra-long text
CN115186649B (en) * 2022-08-30 2023-01-06 北京睿企信息科技有限公司 Relational model-based segmentation method and system for ultra-long text
CN115130466A (en) * 2022-09-02 2022-09-30 杭州火石数智科技有限公司 Classification and entity recognition combined extraction method, computer equipment and storage medium
CN115130466B (en) * 2022-09-02 2022-12-02 杭州火石数智科技有限公司 Classification and entity recognition combined extraction method, computer equipment and storage medium
CN116167368A (en) * 2023-04-23 2023-05-26 昆明理工大学 Domain text entity relation extraction method based on type association feature enhancement
CN116167368B (en) * 2023-04-23 2023-06-27 昆明理工大学 Domain text entity relation extraction method based on type association feature enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination