CN114610819B - Entity relation extraction method - Google Patents


Info

Publication number
CN114610819B
Authority
CN
China
Prior art keywords
entity
character
relationship
model
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210264276.5A
Other languages
Chinese (zh)
Other versions
CN114610819A (en
Inventor
喻野
黄宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Shitong Hengqi Beijing Technology Co ltd
Original Assignee
Zhongke Shitong Hengqi Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Shitong Hengqi Beijing Technology Co ltd
Priority to CN202210264276.5A
Publication of CN114610819A
Application granted
Publication of CN114610819B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/042 Backward inferencing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an entity relationship extraction method, which comprises the following steps: acquiring a character attribute definition based on a preset method, wherein the preset method comprises one or a combination of full-text retrieval and gender prediction, and the character attribute definition comprises character basic attributes, social attributes and character social relationship attributes; pre-labeling the entities in the text by using the character attribute definition; and constructing corresponding triples based on the labeled entities, and taking the set of triples as a database.

Description

Entity relation extraction method
Technical Field
The invention relates to the technical field of information mining, in particular to an entity relationship extraction method.
Background
In an era of rapidly developing internet technology, entity relationship extraction has long been a research hotspot, as it is a core research direction of text mining and information extraction over the large amount of irregular, unstructured data in the open domain. Entity relationship extraction extracts or converts massive unstructured data into structured data, and provides data samples for constructing knowledge graphs, automatic question answering, machine translation, large-scale text summarization and the like.
At present, entity relationship extraction methods based on deep learning have gradually surpassed the classical methods based on features and kernel functions. Entity relationship extraction based on deep learning is mainly divided into supervised methods and distantly supervised methods, and the supervised methods mainly comprise pipeline methods and joint learning methods. Although the methods based on deep learning avoid the error accumulation caused by manual feature selection in the classical methods, a pipeline method performs relation classification after the entity recognition module. On the one hand, errors of entity recognition are passed on to relation classification, causing error propagation; on the other hand, the pipeline method ignores the association between the two subtasks, and the information lost in this way affects the model. Therefore, the accuracy of entity relationship extraction in the prior art is low.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an entity relationship extraction method to improve the accuracy of entity relationship extraction.
The invention solves the technical problems through the following technical scheme:
the invention provides a method for establishing a character attribute relation extraction database in a long text, which comprises the following steps:
acquiring a person attribute definition based on a preset method, wherein the person attribute definition comprises: character basic attributes, social attributes and character social relationship attributes;
pre-labeling entities in the text by using character attribute definition;
and constructing a corresponding triple based on the marked entity, and taking the set of the triples as a database.
Optionally, the obtaining of the person attribute definition based on the preset method includes:
determining whether the category is a target category according to the category of the relationship between the first entity and the second entity;
if yes, semantic reverse reasoning is carried out according to the target category to obtain a reasoning result, and the reasoning result is used as an accurate relation between the first entity and the second entity.
Optionally, the obtaining of the person attribute definition based on the preset method includes:
and generating entity labels based on the relevance of the relationship between the gender and the entities, and labeling the entities in the text by using the generated entity labels.
Optionally, the obtaining of the person attribute definition based on the preset method includes:
for person attribute definitions that cannot be specifically classified, classifying them, based on the semantic scope size relationship between the entities, into a set whose semantic scope is larger than that of the person attribute definitions that cannot be specifically classified.
Optionally, obtaining the person attribute definition based on a preset method includes:
and when the character attribute definition is a character social attribute, acquiring a subordinate concept corresponding to the character social attribute, and acquiring the character social attribute based on the relevance between the subordinate concept and the character.
The invention also provides a database for extracting the character attribute relationship in the long text, wherein the database is established by using any one of the methods.
The invention also provides an entity relationship extraction method, which comprises the following steps:
randomly initializing an embedding layer matrix in an initial model, wherein the initial model is a DGCNN model and an Attention model which are connected in series;
performing word segmentation on the text corresponding to the character attribute relation extraction database in the long text, obtaining word vectors through Word2Vec, and converting each word vector into a vector with the same dimension as the character vector by using a transformation matrix, wherein the character attribute relation extraction database in the long text is the database described above;
adding the word vectors to the character vectors position by position, each word vector being repeated so that it aligns with the positions of the characters of the word, to obtain an addition result;
inputting the addition result into a DGCNN model to obtain a coded vector sequence;
inputting the coded vector sequence into a first Attention model, wherein the first Attention model outputs a result through two classifiers, and each classifier comprises two convolution layers and a full-connection layer;
when the classifier identifies the entity label, the start and end vectors corresponding to the sequence fragment of the entity in the coded vector sequence are added position by position, and the result is input into a relative-position Embedding layer to obtain an embedding result;
inputting the coded vector sequence into a second Attention model, superposing the output of the second Attention model and the embedding result, taking the superposed result as the coded vector sequence, and returning to execute the step of inputting the coded vector sequence into the first Attention model until the model converges to obtain a trained target model;
and carrying out entity relation extraction by using the target model.
Compared with the prior art, the invention has the following advantages:
by applying the embodiment of the invention, a new person attribute relation schema system is designed, which greatly reduces the labeling workload and avoids the interference of redundant label information with model training, and therefore contributes to precise labeling and algorithm optimization.
Moreover, a semi-automatic visual labeling platform can be developed to reduce the labeling cost and achieve accurate labeling; the method can also meet many-to-many requirements in triple extraction, solve the problem of extracting overlapping entity relations, and extract character attribute relations in a way that suits long texts.
Drawings
Fig. 1 is a schematic flowchart of a method for establishing a character attribute relationship extraction database in a long text according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating definition and classification of character attributes in a method for establishing a character attribute relationship extraction database in a long text according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an entity labeling process in the method for establishing a character attribute relationship extraction database in a long text according to the embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating an entity extraction method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a model architecture used in the entity extraction method according to an embodiment of the present invention;
fig. 6 is a structural diagram of a convolution gate mechanism in the entity extraction method according to the embodiment of the present invention;
fig. 7 is a schematic diagram illustrating a principle of a standard convolution method and a dilation convolution method used in the entity extraction method according to the embodiment of the present invention.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
Example 1
Fig. 1 is a schematic flowchart of a method for establishing a character attribute relationship extraction database in a long text according to an embodiment of the present invention, where as shown in fig. 1, the method includes:
s101: acquiring character attribute definition based on a preset method, wherein the preset method comprises the following steps: one or a combination of full text retrieval and gender prediction, wherein the person attribute definition comprises: person base attributes and social attributes and person social relationship attributes.
In order to summarize the attributes and relationships in a person portrait more comprehensively, the person attribute relationship is divided into three parts, namely person basic attributes, social attributes and person social relationship attributes.
Fig. 2 is a schematic diagram illustrating definition and classification of character attributes in a method for establishing a character attribute relationship extraction database in a long text according to an embodiment of the present invention. As shown in fig. 2, the basic attributes of a person, the social attributes and the social relationship attributes of a person may each be used as schemas. In practical applications, a schema is a collection of one class of data in a database. The basic attributes of the character comprise 11 basic attributes, including alias, gender, nationality, birth date, birth place, contact information, certificate number, written works, award certificates and religious belief.
The person social attributes comprise 4 social attributes: organization of employment, school attended, activities participated in and social nickname;
the character social relationship attributes comprise 4 social relationship attributes: relatives, classmates, friends and other relationships.
S102: and pre-labeling the entity in the text by using the character attribute definition.
The relation extraction labeling platforms currently on the market cannot meet the needs of this patent, and the labeling tool directly determines how the labeled data is formed. The embodiment of the invention therefore provides a dedicated labeling platform that implements custom schemas, entity labeling, relation labeling and visual inspection, together with two special functions: automatic retrieval and adaptive labeling.
(1) Fig. 3 is a schematic diagram of an entity labeling process in the method for establishing a character attribute relationship extraction database in a long text according to the embodiment of the present invention. As shown in fig. 3, the automatic retrieval function runs during labeling. For example, if "China" has already been labeled as a country while the long text is labeled in sequence, every other occurrence of "China" in the full text is also automatically labeled as a country by full-text retrieval, so that repeated labeling of the same entity is avoided and the computation amount of the model is reduced.
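The automatic retrieval described in (1) can be illustrated with a minimal sketch. The function name `propagate_label`, the span format and the sample sentence are assumptions made for illustration only; the patent does not disclose how the labeling platform implements full-text retrieval.

```python
from typing import List, Tuple

# (start index, end index exclusive, label); a hypothetical span format
Span = Tuple[int, int, str]


def propagate_label(text: str, mention: str, label: str) -> List[Span]:
    """Return one labeled span for every occurrence of `mention` in `text`."""
    spans: List[Span] = []
    if not mention:
        return spans
    start = text.find(mention)
    while start != -1:
        spans.append((start, start + len(mention), label))
        # Resume the search after this occurrence (non-overlapping matches).
        start = text.find(mention, start + len(mention))
    return spans


# Illustrative text: once "China" is labeled as a country,
# every other occurrence receives the same label.
text = "China borders Russia. The capital of China is Beijing."
spans = propagate_label(text, "China", "country")
print(len(spans))  # prints: 2
```

A plain substring search is the simplest choice here; a real platform would probably also need to handle mentions that are substrings of longer entity names.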
In the step of decoding the schemas, the entities and the relations, an adaptive labeling mode is used: entity information and the corresponding entity relations are identified by calling an existing entity recognition algorithm or relation extraction algorithm, and are used as labels to pre-label the text. This greatly reduces the labeling workload and plays an important role in precise labeling and algorithm optimization.
In this step, in the process of entity pre-labeling in the text, the following contents are added in embodiment 2 of the present invention:
(2) Determining whether the category of the relationship between the first entity and the second entity is a target category; if yes, performing semantic reverse reasoning according to the target category to obtain a reasoning result, and using the reasoning result as the accurate relation between the first entity and the second entity. In the embodiment of the invention, reverse reasoning information is added. Taking the parent and child relationships as an example, once the algorithm identifies the parent relationship, the child relationship can be obtained by reasoning and no longer needs to be labeled or fitted by the algorithm, which reduces both the training errors caused by redundant labels in the schema and the difficulty of designing the schema.
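The reverse reasoning in (2) can be sketched as a lookup of inverse relations. The table `INVERSE_RELATION`, the function name and the sample triples are hypothetical; the sketch assumes that a target category is simply a relation that has an entry in the table, and it reads a triple (s, p, o) as "p of s is o".

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (s, p, o), read as "p of s is o"

# Hypothetical table of target categories and their inverse relations.
INVERSE_RELATION: Dict[str, str] = {"parent": "child", "child": "parent"}


def add_inverse_triples(triples: List[Triple]) -> List[Triple]:
    """Append the inferred inverse triple for every target-category relation."""
    result = list(triples)
    seen = set(triples)
    for s, p, o in triples:
        inverse = INVERSE_RELATION.get(p)
        if inverse is None:
            continue  # not a target category, nothing to infer
        inferred = (o, inverse, s)
        if inferred not in seen:
            seen.add(inferred)
            result.append(inferred)
    return result


labeled = [("Zhang San", "parent", "Zhang Da"),
           ("Zhang San", "birth place", "Beijing")]
print(add_inverse_triples(labeled)[-1])  # prints: ('Zhang Da', 'child', 'Zhang San')
```

Only one direction of each pair then has to be labeled and fitted, which is the saving the paragraph above describes.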
(3) Generating entity labels based on the relevance between gender and the relationship between entities, and labeling the entities in the text by using the generated entity labels. For example, the design of the schema is simplified by predicting the gender in the basic attributes of the person: for "Zhang San is the son of Zhang Da", the algorithm can identify that the parent of Zhang San is Zhang Da as an entity pair, and at the same time identify that Zhang San and Zhang Da are both male. In the embodiment of the invention, the label "son" for the entity "Zhang San" can be generated from the gender of Zhang San; similarly, because the relationship between "Zhang San" and "Zhang Da" is bidirectional, it can be obtained not only that Zhang San is the son of Zhang Da, but also that Zhang Da is the father of Zhang San. By analogy, the large number of labels such as mother, daughter, brother and sister found in most data sets can be removed, since they can be derived by crossing gender with the relationship between entities; only one additional gender prediction function is needed, which reduces the complexity of both the labeling work and the algorithm.
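As a rough illustration of (3), the following sketch crosses a generic relation with a predicted gender to produce a specific label. The lookup table, the function name and the gender values "male", "female" and "unknown" are assumptions; the patent states only that gender prediction is a three-class problem and does not name the classes.

```python
from typing import Dict, Tuple

# Hypothetical lookup: (generic relation, gender of the object entity) -> specific label.
SPECIFIC_LABEL: Dict[Tuple[str, str], str] = {
    ("parent", "male"): "father",
    ("parent", "female"): "mother",
    ("child", "male"): "son",
    ("child", "female"): "daughter",
}


def refine_relation(relation: str, object_gender: str) -> str:
    """Return the gender-specific label, or the generic one if none applies."""
    return SPECIFIC_LABEL.get((relation, object_gender), relation)


# "Zhang San is the son of Zhang Da": both entities are predicted to be male.
print(refine_relation("child", "male"))      # prints: son
print(refine_relation("parent", "male"))     # prints: father
print(refine_relation("parent", "unknown"))  # prints: parent
```

With this arrangement the schema needs only the generic relations plus one gender classifier, instead of a separate label for every gendered variant.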
Gender prediction is not part of the algorithm flow for the sentence, because the text may contain no word that denotes gender, and not every triple needs to output a gender. Therefore, when s and o are persons, the subsequences of the encoded vector sequence of the model that correspond to them can be fed to a classification model. It is worth noting that gender prediction is a three-class classification problem.
(4) For person attribute definitions that cannot be specifically classified, the embodiment of the invention can, based on the semantic scope size relationship between entities, place them into a set of person attribute definitions whose semantic scope is larger.
For example, for "Wang Di and Zhao Wu were very good friends at school", without context the algorithm cannot tell whether this is a college classmate, high school classmate or primary school classmate relationship; it can only identify a classmate relationship between the entity "Wang Di" and the entity "Zhao Wu". The classmate relationship is nevertheless relatively important information. A person attribute definition that cannot be specifically classified therefore means that the general relationship between two entities, such as couple or classmate, can be obtained, but the precise relationship, such as husband, wife, college classmate, high school classmate or primary school classmate, cannot. Such a relationship can be included in the designed classmate relationship; that is, the classmate label in the embodiment of the present invention is designed to cover the college classmate, high school classmate and primary school classmate relationships. Likewise, when gender prediction is inaccurate, "parent" and "couple" cover father and mother, and husband and wife: when the gender prediction results for the first entity and the second entity are inaccurate but the two entities have a marital relationship, the label "couple" may be used instead of the specific labels "husband" and "wife", and the label "parent" may replace the specific labels "father", "mother", "dad" and "mom". As shown in FIG. 2, each of the four major classes in the social relationship attributes covers its subclasses.
By applying the embodiment of the invention, these uncertain relations are also extracted, so that as much of the information contained in the text as possible is retrieved, and the influence of uncertain categories on algorithm fitting is reduced.
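The fallback to a label with a larger semantic scope, as described in (4), might look like the sketch below. The mapping `BROADER_LABEL` and the function `resolve_label` are hypothetical names, and the sketch assumes that an undetermined specific relation is represented by `None`.

```python
from typing import Dict, Optional

# Hypothetical mapping from a specific label to the broader label that covers it.
BROADER_LABEL: Dict[str, str] = {
    "college classmate": "classmate",
    "high school classmate": "classmate",
    "primary school classmate": "classmate",
    "husband": "couple",
    "wife": "couple",
    "father": "parent",
    "mother": "parent",
}


def resolve_label(general: str, specific: Optional[str]) -> str:
    """Use the specific label only when it is known and falls under `general`."""
    if specific is not None and BROADER_LABEL.get(specific) == general:
        return specific
    return general


# Without context only the general classmate relation can be identified.
print(resolve_label("classmate", None))                 # prints: classmate
print(resolve_label("classmate", "college classmate"))  # prints: college classmate
```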
(5) When the character attribute definition is a character social attribute, acquiring a subordinate concept corresponding to the character social attribute, and acquiring the character social attribute based on the relevance between the subordinate concept and the character.
For example, to trace specific social attributes more accurately, large-class-to-small-class relationships are added; that is, organizations, schools, activities and social platforms also serve as subject entity labels in triples. For example, in "Mingming obtained a degree in economics from the Wharton School of the University of Pennsylvania in 1968", it is necessary to label that the school Mingming attended is the University of Pennsylvania and that the college Mingming attended is the Wharton School, and also to label the subordinate concept of the University of Pennsylvania, namely that its college is the Wharton School. If other study experiences appear in the context, the complete social attribute can be obtained by combining the study time, namely that Mingming studied at the Wharton School of the University of Pennsylvania. With this method, detailed tracing can also be achieved for positions, activity identities, statements and the like.
S103: and constructing a corresponding triple based on the marked entity.
The extraction of the character attribute relationship is to extract all the triples contained in a sentence. A triple has the form (s, p, o), where s is the subject, i.e. the subject entity; o is the object, i.e. the object entity; and p is the predicate, i.e. the relationship between the two entities. (s, p, o) is understood to mean "p of s is o".
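A minimal sketch of the triple form and of a database built as a set of triples follows. The class name `Triple`, the `read` helper and the sample entries are illustrative assumptions; the patent specifies only the (s, p, o) form and that the set of triples serves as the database.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    s: str  # subject entity
    p: str  # predicate, i.e. the relationship or attribute
    o: str  # object entity or attribute value

    def read(self) -> str:
        """Render the triple as 'p of s is o'."""
        return f"{self.p} of {self.s} is {self.o}"


# The database is the set of triples, so duplicates collapse automatically.
database: Set[Triple] = set()
database.add(Triple("Zhang San", "father", "Zhang Da"))
database.add(Triple("Zhang San", "father", "Zhang Da"))  # duplicate, ignored by the set
database.add(Triple("Zhang San", "gender", "male"))

print(len(database))                                    # prints: 2
print(Triple("Zhang San", "father", "Zhang Da").read())  # prints: father of Zhang San is Zhang Da
```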
The embodiment of the invention outputs a training set comprising 53 schemas, which covers most of the schemas in current character attribute relationship extraction tasks, so that the training set constructed by the embodiment of the invention generalizes well in the field of character attribute relationship extraction;
the embodiment of the invention reduces the dimensionality of the data set through the schema design system on the one hand, and achieves efficient annotation through the annotation method on the other hand.
In addition, there are currently few standard data sets for the Chinese character attribute relationship extraction task in China, and most researchers collect their own data sets, so establishing and designing a standard data set for Chinese character attribute relationship extraction is a problem that urgently needs to be solved. Moreover, in the prior art, distant supervision aligns text with a remote knowledge base, which greatly reduces labor cost and gives stronger transferability across domains, but the data obtained by its automatic labeling is of lower quality, which ultimately affects the model training effect. The embodiment of the invention improves the prediction effect of the model by optimizing the data set.
Example 2
In the prior art, methods based on sequence labeling directly predict entity-relation triples by jointly labeling "position information-relation type-entity role", which solves the entity redundancy problem caused by invalid entities. However, the existing methods based on sequence labeling do not consider the diversity of words after the input sentence is encoded, so overlapping entity relations cannot be identified; for example, in "Someone's Autobiography", the model cannot identify the relationship between "Someone" and "Someone's Autobiography". Moreover, some articles are long texts, for example texts with more than 1000 characters. When long texts are recognized, the number of neurons in the neural network increases and the network becomes deeper, which further increases the amount of computation when the model runs.
To solve the above problems, embodiments of the present invention provide an approach based on sequence labeling that not only takes the information between entities into account and avoids entity redundancy, but also solves the problems of entity overlapping and excessive computation.
Fig. 4 is a schematic diagram of a principle of an entity extraction method according to an embodiment of the present invention. As shown in fig. 4, the method according to embodiment 2 includes:
first converting the triples into sequence data; then, in the labeling part, removing the relation type from the "position information-relation type-entity role" scheme used in sequence labeling; and, in the parameter-sharing part, first predicting one entity and then predicting the other entity of each relation conditioned on that entity. Because the text is encoded twice each time, the problem of entity overlapping can be solved.
S201 (not shown in the figure): randomly initializing one embedding layer matrix in an initial model, wherein the initial model is a DGCNN model and an Attention model which are connected in series.
Character introduction pages in an encyclopedia are crawled, and continuous text is extracted with regular expressions for labeling; combined with competition data and public data sets, a character attribute relation extraction data set of 11603 samples is constructed, with an average length of 427 characters and a maximum length of 512 characters. During labeling, the text length needs to be controlled manually, at least 500 samples are guaranteed for each class of p, and additional fine labeling is carried out for any class of p with low precision or recall; the visual labeling platform is used to full effect in this supplementing process.
Fig. 5 is a schematic diagram of a model architecture used in the entity extraction method provided in the embodiment of the present invention. As shown in fig. 5, the entity extraction model in embodiment 2 of the present invention uses a DGCNN + Attention structure, where the Attention is Google's Self Attention; the DGCNN may also use an existing neural network model. A text vector that takes the single character as its unit carries little meaning, since a single character has no semantic information on its own, so a character-word mixed Embedding layer is added to the joint learning method based on deep learning. Statistical analysis of the labeled data set showed that the number of o is a multiple of the number of s, and that sampling s is easier and more sufficient, so the approach adopted is to predict s first and then pass s in to predict the corresponding o and p.
The DGCNN model uses 12 layers in total: the dilation rates [1, 2, 5] are repeated three times in sequence, so that granularity is learned repeatedly from fine to coarse, and three layers with dilation rate [1] are then added for fine-grained fine-tuning. Since the prediction of the start and end positions is in fact two binary classification problems, the loss function is the binary cross entropy. The Adam optimizer is used for training, with a learning rate of 0.0001, 150 epochs, a maximum text length of 512 and a batch size of 8. To stabilize training, EMA (Exponential Moving Average) is applied to the model, with a decay rate of 0.9999. In the decoding part, the "start" threshold and "end" threshold for the start and end positions are 0.5 and 0.4 respectively.
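The layer schedule and the EMA update described above can be checked with a short sketch. The kernel size of 3 is an assumption, since the patent does not state the convolution window size; the function names are likewise illustrative. The receptive-field formula is the standard one for stacked stride-1 dilated convolutions.

```python
from typing import List


def dilation_schedule() -> List[int]:
    """[1, 2, 5] repeated three times, followed by three layers with rate 1."""
    return [1, 2, 5] * 3 + [1, 1, 1]


def receptive_field(dilations: List[int], kernel_size: int = 3) -> int:
    """Receptive field of stacked stride-1 dilated convolutions.

    Each layer widens the field by (kernel_size - 1) * dilation.
    kernel_size=3 is an assumed value, not taken from the patent.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def ema_update(shadow: float, value: float, decay: float = 0.9999) -> float:
    """One exponential-moving-average step for a single weight."""
    return decay * shadow + (1.0 - decay) * value


rates = dilation_schedule()
print(len(rates), receptive_field(rates))  # prints: 12 55
```

Under the assumed kernel size, each output position sees 55 input positions, which suggests why attention layers are combined with the convolutions for texts of up to 512 characters.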
Taking the sentence "Li Si held a wedding ceremony with her husband Zhang San in 2021" in the figure as an example, the process includes the following steps:
in order to preserve the flexibility of characters, randomly initializing a character embedding layer matrix (n x 32) and updating it during training to realize character vectorization, wherein n represents the number of characters in the dictionary, and 32 represents the vectorization dimension of each character;
S202 (not shown in the figure): performing word segmentation on the text corresponding to the character attribute relation extraction database in the long text, obtaining word vectors through Word2Vec, and converting each word vector into a vector with the same dimension as the character vector by using a transformation matrix, wherein the character attribute relation extraction database in the long text is the database in embodiment 1.
Word segmentation is performed on the text, a 300-dimensional word vector is obtained for each word through Word2Vec, and the word vector is converted into a vector with the same dimension as the character vector through a (300 x 32) transformation matrix;
S203 (not shown): the word vectors are aligned with and added to the character vectors by repeating each word vector at the position of every character it contains, for example in "reading makes people progress". A character-word mixed vector is thus obtained, which carries the semantic information brought by the word vector and contains the information of the character as well.
S204 (not shown): the addition result is input into the DGCNN model to the encoded vector sequence.
In the embodiment of the invention, in order to enhance the position sense of CNN and add the position Embedding layer, the method is the same as that of the word Embedding layer, because the subject is often at the beginning of a sentence, the object is often near the subject, and the maximum word number is 512 during training, the matrix dimension of one position Embedding layer is set to be 512 x 32;
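The character-word mixing in steps S202 to S204, together with the position Embedding, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `mixed_embedding`, the random tables standing in for trained parameters and for Word2Vec, the Chinese example sentence (an assumed back-translation of "reading makes people progress") and the choice to add the position vectors before the DGCNN are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
CHAR_DIM, WORD_DIM, MAX_LEN = 32, 300, 512


def mixed_embedding(char_ids, words, word_vectors, char_table, transform, pos_table):
    """Build character-word mixed vectors for one sentence.

    char_ids:     character indices, one per character of the sentence
    words:        segmented words whose characters concatenate to the sentence
    word_vectors: dict mapping word -> 300-d vector (stand-in for Word2Vec)
    """
    # Repeat each word vector once per character so both sequences align.
    repeated = np.stack([word_vectors[w] for w in words for _ in w])
    assert len(repeated) == len(char_ids), "segmentation must cover every character"
    char_part = char_table[char_ids]        # (L, 32)
    word_part = repeated @ transform        # (L, 300) @ (300, 32) -> (L, 32)
    pos_part = pos_table[: len(char_ids)]   # (L, 32)
    return char_part + word_part + pos_part


# Toy usage with random parameters (all values are placeholders).
words = ["读书", "使", "人", "进步"]
sentence = "".join(words)
vocab = {ch: i for i, ch in enumerate(sorted(set(sentence)))}
char_ids = [vocab[ch] for ch in sentence]

char_table = rng.normal(size=(len(vocab), CHAR_DIM))   # n x 32, trainable
transform = rng.normal(size=(WORD_DIM, CHAR_DIM))      # 300 x 32
pos_table = rng.normal(size=(MAX_LEN, CHAR_DIM))       # 512 x 32
word_vectors = {w: rng.normal(size=WORD_DIM) for w in words}

mixed = mixed_embedding(char_ids, words, word_vectors, char_table, transform, pos_table)
```

In this sketch the two characters of a two-character word receive the same projected word vector, so they differ only in their character and position components.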
The encoded vector sequence is obtained through the DGCNN model. DGCNN (Dilated Gated Convolutional Neural Network) is a convolutional neural network that adds two convolution techniques, namely dilated convolution and gated convolution. Gated convolution adds a gate mechanism to ordinary convolution:
Fig. 6 is a structural diagram of the convolution gate mechanism in the entity extraction method according to the embodiment of the present invention. As shown in Fig. 6, Conv1D1 and Conv1D2 have the same number of convolution kernels and the same window size but do not share weights. One of the convolutions is activated by a sigmoid function and the other has no activation function; finally the two outputs are multiplied element by element to give the output Y corresponding to the input X. The model structure in this step reduces the risk of vanishing gradients, because one of the convolutions has no activation function. When the output dimension and the input dimension are consistent, the output can be calculated using the following formula:
Figure GDA0003803855940000111
where Y is the output of the convolution gate; X is the input of the convolution gate; Conv1D1(X) is a convolutional layer without an activation function; the symbol shown in Figure GDA0003803855940000121 denotes element-wise multiplication; and σ() is the sigmoid function.
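A gated convolution of the kind described above can be sketched in plain NumPy. The sketch assumes the basic gate stated in the text (one convolution without activation multiplied element-wise by the sigmoid of a second convolution); the optional `residual` flag, which adds the input X when the dimensions match, is a commonly used variant and is an assumption here, since the exact formula is contained in the figure. The helper names `conv1d_same`, `sigmoid` and `gated_conv1d` are hypothetical.

```python
import numpy as np


def conv1d_same(x, kernel, bias, dilation=1):
    """1-D convolution with 'same' padding.

    x: (L, C_in); kernel: (K, C_in, C_out) with odd K; bias: (C_out,)
    """
    k = kernel.shape[0]
    assert k % 2 == 1, "this sketch only handles odd window sizes"
    pad = dilation * (k - 1) // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], kernel.shape[2])) + bias
    for j in range(k):
        # Tap j reads the input shifted by j * dilation positions.
        out += padded[j * dilation : j * dilation + x.shape[0]] @ kernel[j]
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_conv1d(x, k1, b1, k2, b2, dilation=1, residual=False):
    """Y = Conv1D1(X) * sigmoid(Conv1D2(X)); the two convolutions do not share weights."""
    h = conv1d_same(x, k1, b1, dilation)           # no activation function
    g = sigmoid(conv1d_same(x, k2, b2, dilation))  # gate with values in (0, 1)
    y = h * g                                      # element-wise multiplication
    if residual:                                   # assumed variant, needs C_in == C_out
        y = x + y
    return y
```

Because the ungated branch has no activation function, its gradient passes through multiplied only by the gate value, which is the property the text credits with reducing vanishing gradients.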
Since the range of the sigmoid function is (0, 1), the gate has an effect similar to adding multi-channel transmission of information to Conv1D. To further enhance the convolution, a dilated convolution may be used in the convolution gate mechanism. The effect of the dilated convolution is that the convolutional network can attend to information at a greater distance without increasing the number of model parameters. Fig. 7 is a schematic diagram of the standard convolution method and the dilated convolution method used in the entity extraction method according to an embodiment of the present invention. As shown in Fig. 7, the left diagram shows the standard convolution method and the other diagram shows the dilated convolution method. At the last layer, the standard convolution in the figure can capture only the 3 inputs before and after a position, while the dilated convolution can capture the 7 inputs before and after it.
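The reach quoted for Fig. 7 follows from the usual receptive-field arithmetic for stacked convolutions. The sketch below assumes a window size of 3 and, for the dilated case in the figure, dilation rates of 1, 2 and 4 over three layers; neither value is stated in this passage, and the function name `receptive_field` is hypothetical.

```python
def receptive_field(dilations, kernel_size=3):
    """Number of input positions seen by one output position of a conv stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


standard = receptive_field([1, 1, 1])              # three ordinary layers
dilated = receptive_field([1, 2, 4])               # three dilated layers (assumed rates)
model = receptive_field([1, 2, 5] * 3 + [1] * 3)   # the 12-layer schedule in the text

# Inputs reachable on each side of the centre position.
print((standard - 1) // 2, (dilated - 1) // 2, (model - 1) // 2)  # prints: 3 7 27
```

Under these assumptions the 12-layer schedule lets each position see 27 characters on either side, with the same number of parameters as 12 ordinary layers of the same window size.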
S205 (not shown): the encoded vector sequence is input into a first Attention model, and the first Attention model outputs its result through two classifiers, each of which comprises two convolution layers and a fully connected layer.
The vector is then passed into the first Attention, which uses Google's Self Attention model, to help integrate the input text so that the model focuses more on important words.
(6) Finally, the result is fed into two convolution layers and a fully connected layer to construct two classifiers that respectively predict the start and end positions of s; the convolution kernel size is 3. In natural language processing, a word is a one-dimensional vector and a sentence is a two-dimensional matrix. The convolutional layer and the activation function map the vector features into a hidden-layer feature space, and the learned high-dimensional features are mapped layer by layer into the sample label space; these are the most basic parts of a CNN;
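How the two classifier outputs might be turned into spans of s can be sketched as follows, using the "start" threshold of 0.5 and the "end" threshold of 0.4 given earlier in the text. The pairing rule (each start is matched with the nearest end at or after it) is an assumption, as the text only gives the thresholds, and the function name `decode_spans` is hypothetical.

```python
import numpy as np


def decode_spans(start_prob, end_prob, start_th=0.5, end_th=0.4):
    """Pair every position above the start threshold with the nearest
    position at or after it that is above the end threshold."""
    starts = np.where(start_prob > start_th)[0]
    ends = np.where(end_prob > end_th)[0]
    spans = []
    for s in starts:
        later = ends[ends >= s]
        if len(later) > 0:
            spans.append((int(s), int(later[0])))  # inclusive (start, end)
    return spans


# Toy probabilities for a six-character sentence (placeholder values).
start_prob = np.array([0.1, 0.9, 0.2, 0.1, 0.7, 0.1])
end_prob = np.array([0.0, 0.1, 0.45, 0.1, 0.3, 0.8])
spans = decode_spans(start_prob, end_prob)  # [(1, 2), (4, 5)]
```

The lower "end" threshold makes it less likely that a detected start is left without a matching end.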
S206 (not shown): when the classifier identifies an entity label, the start and end vectors corresponding to the entity's segment in the encoded vector sequence are added position-wise, and the result is input into a relative-position Embedding layer to obtain an embedding result.
During training, s is randomly sampled from the labels, and the sampling result is taken as the output of the model. Once s is sampled, in order to predict o and p using the information of s, the start and end vectors of the s segment in the DGCNN-encoded sequence are added position-wise to obtain a vector with the same dimension as the character vector, and a relative-position Embedding layer is added. The relative-position Embedding layer is still trainable; it differs from the absolute position Embedding in that each character is indexed by its position relative to the head of s, so that the information of s is incorporated when o and p are predicted;
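The conditioning on s described in step S206 can be sketched as below. This is an illustration only: the function name `subject_condition`, the offset used to keep relative indices non-negative, and the size of the relative-position table (2 × 512 − 1 rows) are assumptions that the text does not specify.

```python
import numpy as np


def subject_condition(encoded, s_start, s_end, rel_pos_table, max_len=512):
    """Combine the start/end vectors of s with a relative-position embedding.

    encoded:       (L, D) sequence produced by the DGCNN encoder
    s_start/s_end: inclusive character indices of the sampled subject s
    rel_pos_table: (2 * max_len - 1, D) trainable table (assumed layout)
    """
    length = encoded.shape[0]
    s_vec = encoded[s_start] + encoded[s_end]      # position-wise addition, (D,)
    # Index of every character relative to the head of s, shifted so that
    # the smallest possible value (-(max_len - 1)) maps to row 0.
    rel_idx = np.arange(length) - s_start + (max_len - 1)
    rel_emb = rel_pos_table[rel_idx]               # (L, D)
    return s_vec[None, :] + rel_emb                # (L, D)


rng = np.random.default_rng(2)
encoded = rng.normal(size=(6, 32))
rel_pos_table = rng.normal(size=(2 * 512 - 1, 32))
cond = subject_condition(encoded, s_start=1, s_end=2, rel_pos_table=rel_pos_table)
```

The result has one row per character, so it can be superposed on the output of the second Attention model as described in the next step.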
S207 (not shown): the encoded vector sequence is input into a second Attention model, the output of the second Attention model is superposed with the embedding result, the superposed result is taken as the encoded vector sequence, and the process returns to the step of inputting the encoded vector sequence into the first Attention model until the model converges, yielding the trained target model.
The DGCNN-encoded vector sequence is passed into a second Self Attention, and its output is added to the superposed vector obtained above; that is, the vector sequence is input into the second Attention model, the output of the second Attention model is superposed with the embedding result, the superposed result is taken as the vector sequence, and the process returns to step (5) until the model converges.
In the embodiment of the invention, unlike other strategies, the start and end positions of o are predicted for each p through the CNN and the fully connected layer, so that o and p are predicted at the same time. The o and p prediction part therefore constructs 106 binary classifiers: 53 (types of p) × 2 (start and end positions of o).
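The 53 × 2 arrangement can be pictured as one probability array with a start channel and an end channel for every predicate. The sketch below decodes such an array into (p, o start, o end) tuples; the array layout, the reuse of the 0.5 and 0.4 thresholds for o, and the nearest-end pairing rule are assumptions, and `decode_objects` is a hypothetical name.

```python
import numpy as np

NUM_PREDICATES = 53  # number of p types given in the text


def decode_objects(probs, start_th=0.5, end_th=0.4):
    """probs: (L, 53, 2) array; channel 0 is the start probability of o for
    each p and channel 1 is the end probability (assumed layout)."""
    results = []
    for p in range(probs.shape[1]):
        starts = np.where(probs[:, p, 0] > start_th)[0]
        ends = np.where(probs[:, p, 1] > end_th)[0]
        for s in starts:
            later = ends[ends >= s]
            if len(later) > 0:
                results.append((p, int(s), int(later[0])))
    return results


# Toy output: predicate 10 has an object spanning characters 3 to 5.
probs = np.zeros((8, NUM_PREDICATES, 2))
probs[3, 10, 0] = 0.9
probs[5, 10, 1] = 0.8
triples_for_s = decode_objects(probs)  # [(10, 3, 5)]
```

Because every predicate has its own pair of classifiers, one subject can yield several objects under different predicates in a single pass.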
S208 (not shown): the trained target model is then used to extract entity relationships.
Furthermore, because the maximum length of the text used in each training pass of the model is set to 512, during entity relationship extraction the text is first divided into segments by sentence, which produces a sliding window that moves from the beginning of the text to the end. After each extraction, the window slides down to the start of the sentence containing the last extracted triple, and each window contains no more than 512 characters. This predict-while-sliding approach greatly improves the recall of the model on long texts, avoids the information loss caused by crude truncation, and makes the model friendlier to long texts.
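One possible reading of this predict-while-sliding procedure is sketched below. It is a simplified illustration, not the patented implementation: `sliding_extract` and the `extract` callback are hypothetical, the callback is assumed to report which sentence of the window each triple came from, and a single sentence longer than the limit is passed through unchanged.

```python
def sliding_extract(sentences, extract, max_len=512):
    """sentences: list of str. extract(window) returns a list of
    (triple, sentence_offset) pairs, where sentence_offset is the index,
    inside the window, of the sentence containing the triple."""
    found, start = [], 0
    while start < len(sentences):
        # Grow the window sentence by sentence while it fits in max_len
        # characters (the first sentence is always taken).
        end, size = start, 0
        while end < len(sentences) and (
            end == start or size + len(sentences[end]) <= max_len
        ):
            size += len(sentences[end])
            end += 1
        hits = extract(sentences[start:end])
        for triple, _ in hits:
            if triple not in found:
                found.append(triple)
        if end == len(sentences):
            break
        # Slide to the sentence holding the last triple; if that would not
        # move the window forward, continue after the current window.
        last = start + max((off for _, off in hits), default=0)
        start = last if last > start else end
    return found


def toy_extract(window):
    """Stand-in for the trained model: one 'triple' per non-filler sentence."""
    return [(s, i) for i, s in enumerate(window) if not s.startswith("x")]


sentences = ["A married B.", "x" * 500, "C works at D.", "E is F's son."]
triples = sliding_extract(sentences, toy_extract)
```

Restarting from the sentence of the last triple means that sentence is seen again together with the text that follows it, which is how the sketch avoids cutting context at a window boundary.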
In the embodiment of the invention, the labeled data set is optimized, and the task of jointly extracting triples from text is carried out with deep learning combined with character-word fusion, which greatly improves the accuracy of the model. To verify the technical advantages of the embodiments of the present invention, the inventors also trained the most common deep-learning-based relation extraction methods for comparative analysis, namely CNN (Convolutional Neural Network), PCNN (Pulse Coupled Neural Network), RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory network). The experimental results are shown below:
Table 1: Metrics for the extraction of person attributes and relationships
Figure GDA0003803855940000141
As shown in table 1, the accuracy, recall ratio, and F value of the method provided by the embodiment of the present invention are better than those of the existing models.
Example 3
Corresponding to embodiment 1 or 2, embodiment 3 of the present invention provides an apparatus for creating a person attribute relationship extraction database in a long text, where the apparatus includes:
the acquisition module is used for acquiring the character attribute definition based on a preset method, wherein the preset method comprises the following steps: one or a combination of full text retrieval and gender prediction; the person attribute definition includes: character basic attributes, social attributes and character social relationship attributes;
the labeling module is used for pre-labeling the entity in the text by utilizing the character attribute definition, wherein the entity comprises: one or a combination of a person, an article, an animal;
and the construction module is used for constructing the corresponding triple based on the marked entity and taking the set of the triples as a database.
In a specific implementation manner of the embodiment of the present invention, the acquisition module is configured to:
determining whether the category is a target category according to the category of the relationship between the first entity and the second entity;
if yes, semantic reverse reasoning is carried out according to the target category to obtain a reasoning result, and the reasoning result is used as an accurate relation between the first entity and the second entity.
In a specific implementation manner of the embodiment of the present invention, the acquisition module is configured to:
and generating entity labels based on the relevance of the relationship between the gender and the entities, and labeling the entities in the text by using the generated entity labels.
The above description is intended to be illustrative of the preferred embodiment of the present invention and should not be taken as limiting the invention, but rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (5)

1. An entity relationship extraction method, the method comprising:
randomly initializing an embedding layer matrix in an initial model, wherein the initial model is a DGCNN model and an Attention model which are connected in series;
performing word segmentation on a text corresponding to a person attribute relationship extraction database in a long text, obtaining word vectors through Word2Vec, and converting the word vectors into vectors with the same dimension as the character vectors by using a transformation matrix, wherein the establishment process of the person attribute relationship extraction database in the long text comprises the following steps: acquiring a person attribute definition based on a preset method, wherein the preset method comprises: one or a combination of full text retrieval and gender prediction; the person attribute definition includes: basic person attributes, social attributes and person social relationship attributes; and utilizing the person attribute definition to pre-label entities in the text, wherein the entity comprises: one or a combination of a person, an article, an animal; constructing corresponding triples based on the labeled entities, and taking the set of the triples as the database;
aligning and adding the characters and the words by repeating each word vector at the position of every character of the word, to obtain an addition result;
inputting the addition result into a DGCNN model to obtain a coded vector sequence;
inputting the coded vector sequence into a first Attention model, wherein the first Attention model outputs a result through two classifiers, and each classifier comprises two convolution layers and a full-connection layer;
when the classifier identifies the entity label, the start and end vectors corresponding to the sequence segment of the entity in the encoded vector sequence are added position-wise, and the result is input into a relative-position Embedding layer to obtain an embedding result;
inputting the coded vector sequence into a second Attention model, superposing the output of the second Attention model and the embedding result, taking the superposed result as the coded vector sequence, and returning to execute the step of inputting the coded vector sequence into the first Attention model until the model converges to obtain a trained target model;
and carrying out entity relation extraction by using the target model.
2. The entity relationship extraction method as claimed in claim 1, wherein the obtaining of the person attribute definition based on the preset method comprises:
determining whether the category is a target category according to the category of the relationship between the first entity and the second entity;
if yes, semantic reverse reasoning is carried out according to the target category to obtain a reasoning result, and the reasoning result is used as an accurate relation between the first entity and the second entity.
3. The entity relationship extraction method as claimed in claim 1, wherein the obtaining of the person attribute definition based on the preset method comprises:
and generating entity labels based on the relevance of the relationship between the gender and the entities, and labeling the entities in the text by using the generated entity labels.
4. The method for extracting entity relationship as claimed in claim 1, wherein the obtaining of the person attribute definition based on the preset method includes:
based on the semantic-scope relationship between entities, assigning the person attribute definitions that cannot be specifically classified to a set whose semantic scope is broader than that of those person attribute definitions.
5. The method of claim 1, wherein obtaining the person attribute definition based on a preset method comprises:
when the person attribute definition is a person social attribute, acquiring a subordinate concept corresponding to the person social attribute, and acquiring the person social attribute based on the relevance between the subordinate concept and the person.
CN202210264276.5A 2022-03-17 2022-03-17 Entity relation extraction method Active CN114610819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210264276.5A CN114610819B (en) 2022-03-17 2022-03-17 Entity relation extraction method

Publications (2)

Publication Number Publication Date
CN114610819A CN114610819A (en) 2022-06-10
CN114610819B true CN114610819B (en) 2022-10-11

Family

ID=81864911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210264276.5A Active CN114610819B (en) 2022-03-17 2022-03-17 Entity relation extraction method

Country Status (1)

Country Link
CN (1) CN114610819B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183465A (en) * 2020-10-26 2021-01-05 天津大学 Social relationship identification method based on character attributes and context
CN112487807A (en) * 2020-12-09 2021-03-12 重庆邮电大学 Text relation extraction method based on expansion gate convolution neural network
CN113553440A (en) * 2021-06-25 2021-10-26 武汉理工大学 Medical entity relationship extraction method based on hierarchical reasoning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140280378A1 (en) * 2013-03-14 2014-09-18 Adminovate, Inc. Database generator
CN104657750B (en) * 2015-03-23 2018-04-27 苏州大学张家港工业技术研究院 A kind of method and apparatus extracted for character relation
CN110222199A (en) * 2019-06-20 2019-09-10 青岛大学 A kind of character relation map construction method based on ontology and a variety of Artificial neural network ensembles
CN110555083B (en) * 2019-08-26 2021-06-25 北京工业大学 Non-supervision entity relationship extraction method based on zero-shot
CN111538849B (en) * 2020-04-29 2023-04-07 华中科技大学 Character relation graph construction method and system based on deep learning
CN112101009B (en) * 2020-09-23 2024-03-26 中国农业大学 Method for judging similarity of red-building dream character relationship frames based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant