CN114610819B

CN114610819B - Entity relation extraction method

Info

Publication number: CN114610819B
Application number: CN202210264276.5A
Authority: CN
Inventors: 喻野; 黄宇
Original assignee: Zhongke Shitong Hengqi Beijing Technology Co ltd
Current assignee: Zhongke Shitong Hengqi Beijing Technology Co ltd
Priority date: 2022-03-17
Filing date: 2022-03-17
Publication date: 2022-10-11
Anticipated expiration: 2042-03-17
Also published as: CN114610819A

Abstract

The invention discloses an entity relationship extraction method, comprising: acquiring person attribute definitions based on a preset method, wherein the preset method comprises one or a combination of full-text retrieval and gender prediction, and the person attribute definitions comprise person basic attributes, social attributes and person social relationship attributes; pre-labeling the entities in the text using the person attribute definitions; and constructing corresponding triples based on the labeled entities, and taking the set of triples as a database.

Description

Entity relation extraction method

Technical Field

The invention relates to the technical field of information mining, in particular to an entity relationship extraction method.

Background

In an era of rapidly developing internet technology, entity relationship extraction has remained a research hotspot as a core direction of text mining and information extraction over the large volume of irregular, unstructured data in the open domain. Entity relationship extraction extracts or converts massive unstructured data into structured data, and provides data samples for tasks such as knowledge graph construction, automatic question answering, machine translation and large-scale text summarization.

At present, entity relationship extraction methods based on deep learning have gradually surpassed the classical methods based on features and kernel functions. Deep-learning-based entity relationship extraction is mainly divided into supervised methods and distantly supervised methods, and the supervised methods mainly comprise pipeline methods and joint learning methods. Although deep-learning-based methods avoid the error accumulation caused by manual feature selection in the classical methods, a pipeline method performs relation classification after the entity recognition module. On the one hand, errors in entity recognition propagate into relation classification; on the other hand, the pipeline method ignores the information lost between the two subtasks and its effect on the model. Therefore, the accuracy of entity relationship extraction in the prior art is low.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an entity relationship extraction method to improve the accuracy of entity relationship extraction.

The invention solves the technical problems through the following technical scheme:

The invention provides a method for establishing a person attribute relation extraction database for long text, comprising:

acquiring person attribute definitions based on a preset method, wherein the person attribute definitions comprise: person basic attributes, social attributes and person social relationship attributes;

pre-labeling entities in the text using the person attribute definitions;

and constructing corresponding triples based on the labeled entities, and taking the set of triples as a database.

Optionally, the obtaining of the person attribute definition based on the preset method includes:

determining, according to the category of the relationship between a first entity and a second entity, whether the category is a target category;

if so, performing semantic reverse reasoning according to the target category to obtain a reasoning result, and taking the reasoning result as the exact relationship between the first entity and the second entity;

generating entity labels based on the association between gender and the relationships between entities, and labeling the entities in the text with the generated entity labels;

and, for a person attribute definition that cannot be specifically classified, assigning it, based on the semantic-scope relationship between the entities, to a set whose semantic scope is larger than that of the definition itself.

Optionally, obtaining the person attribute definition based on a preset method includes:

and when the character attribute is defined as the character social attribute, acquiring a lower concept corresponding to the character social attribute, and acquiring the character social attribute based on the relevance between the lower concept and the character.

The invention also provides a database for extracting the character attribute relationship in the long text, wherein the database is established by using any one of the methods.

The invention also provides an entity relationship extraction method, which comprises the following steps:

randomly initializing an embedding layer matrix in an initial model, wherein the initial model is a DGCNN model and an Attention model which are connected in series;

performing word segmentation on the text corresponding to the person attribute relation extraction database for long text, obtaining word vectors through Word2Vec, and converting each word vector into a vector with the same dimension as the character vectors by using a transformation matrix, wherein the person attribute relation extraction database for long text is the database described above;

repeating each word vector so that it aligns with the positions of the characters of the word, and adding the aligned vectors element-wise to obtain an addition result;

inputting the addition result into a DGCNN model to obtain a coded vector sequence;

inputting the coded vector sequence into a first Attention model, wherein the first Attention model outputs a result through two classifiers, and each classifier comprises two convolution layers and a full-connection layer;

when a classifier identifies an entity label, adding element-wise the start and end vectors corresponding to the sequence fragment of the entity in the encoded vector sequence, and inputting the result into a relative-position Embedding layer to obtain an embedding result;

inputting the encoded vector sequence into a second Attention model, superposing the output of the second Attention model with the embedding result, taking the superposed result as the encoded vector sequence, and returning to the step of inputting the encoded vector sequence into the first Attention model until the model converges, to obtain a trained target model;

and carrying out entity relation extraction by using the target model.

Compared with the prior art, the invention has the following advantages:

By applying the embodiments of the invention, a new person attribute relation schema system is designed. This greatly reduces the labeling workload and avoids the interference of redundant label information with model training, which in turn benefits precise labeling and algorithm optimization.

Moreover, a semi-automatic visual labeling platform can be developed to reduce the labeling cost and achieve accurate labeling. The method can also meet the many-to-many requirement in triple extraction, solve the problem of extracting overlapping entity relations, and achieve person attribute relation extraction that is friendly to long text.

Drawings

Fig. 1 is a schematic flowchart of a method for establishing a person attribute relation extraction database for long text according to an embodiment of the present invention;

Fig. 2 is a schematic diagram illustrating the definition and classification of person attributes in the method for establishing a person attribute relation extraction database for long text according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the entity labeling process in the method for establishing a person attribute relation extraction database for long text according to an embodiment of the present invention;

Fig. 4 is a schematic diagram illustrating the principle of the entity extraction method according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of the model architecture used in the entity extraction method according to an embodiment of the present invention;

Fig. 6 is a structural diagram of the convolution gate mechanism in the entity extraction method according to an embodiment of the present invention;

Fig. 7 is a schematic diagram illustrating the principle of the standard convolution method and the dilated convolution method used in the entity extraction method according to an embodiment of the present invention.

Detailed Description

The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.

Example 1

Fig. 1 is a schematic flowchart of a method for establishing a character attribute relationship extraction database in a long text according to an embodiment of the present invention, where as shown in fig. 1, the method includes:

S101: acquiring person attribute definitions based on a preset method, wherein the preset method comprises one or a combination of full-text retrieval and gender prediction, and the person attribute definitions comprise: person basic attributes, social attributes and person social relationship attributes.

In order to summarize more comprehensively the descriptions of attributes and relationships in a person portrait, the person attribute relationships are divided into three parts: person basic attributes, social attributes and person social relationship attributes.

Fig. 2 is a schematic diagram illustrating the definition and classification of person attributes in the method for establishing a person attribute relation extraction database for long text according to an embodiment of the present invention. As shown in Fig. 2, the person basic attributes, the social attributes and the person social relationship attributes may each be used as a schema. In practical applications, a schema is a collection of one class of data in a database. The person basic attributes comprise 11 basic attributes, including alias, gender, nationality, birth date, birth place, contact information, certificate number, writings, award certificate and religious belief.

The person social attributes comprise 4 social attributes: organization of employment, school attended, activities participated in and social nickname;

the person social relationship attributes comprise 4 social relationship attributes: relatives, classmates, friends and other relationships.

S102: and pre-labeling the entity in the text by using the character attribute definition.

The relation-extraction labeling platforms currently on the market cannot meet the needs of this patent, and the labeling tool directly determines the form of the labeled data. The embodiment of the invention therefore provides a dedicated labeling platform that supports custom schemas, entity labeling, relation labeling and visual inspection, together with two special functions: automatic retrieval and adaptive labeling.

(1) Fig. 3 is a schematic diagram of the entity labeling process in the method for establishing a person attribute relation extraction database for long text according to the embodiment of the present invention. As shown in Fig. 3, the automatic retrieval function works during labeling: for example, if "China" has already been labeled as a country while the long text is being labeled in order, every other occurrence of "China" in the full text is automatically labeled as a country by full-text retrieval. This avoids labeling the same entity repeatedly and reduces the computation of the model.
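The automatic retrieval function described in item (1) can be sketched as follows. This is only a minimal illustration, assuming plain exact-string matching; the function name `propagate_label` and the sample sentence are hypothetical and are not taken from the patent.

```python
def propagate_label(text, mention, label):
    """Return (start, end, label) spans for every occurrence of `mention` in `text`."""
    spans = []
    start = text.find(mention)
    while start != -1:
        end = start + len(mention)
        spans.append((start, end, label))
        # Continue the search after the occurrence just found.
        start = text.find(mention, end)
    return spans


# Hypothetical sample text: "China" is labeled once by the annotator,
# and the remaining occurrences are labeled automatically.
text = "China is large. She was born in China and later returned to China."
spans = propagate_label(text, "China", "country")
print(spans)
```

Exact matching keeps the sketch simple; a real platform would presumably also have to handle overlapping mentions and ambiguous strings.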

In the step of decoding the schemas, the entities and the relations, an adaptive labeling mode is used: entity information and the corresponding entity relations are identified by calling an existing entity recognition algorithm or relation extraction algorithm, and are used as labels to pre-label the text. This greatly reduces the labeling workload and also plays an important role in precise labeling and algorithm optimization.

In this step, in the process of entity pre-labeling in the text, the following contents are added in embodiment 2 of the present invention:

(2) Determining, according to the category of the relationship between the first entity and the second entity, whether the category is a target category; if so, performing semantic reverse reasoning according to the target category to obtain a reasoning result, and taking the reasoning result as the exact relationship between the first entity and the second entity. In the embodiment of the invention, reverse reasoning information is added. Taking the parent and child relationships as an example, once the algorithm identifies the parent relationship, the child relationship can be obtained by reasoning, with no further labeling or algorithm fitting. This reduces the training errors caused by redundant labels in the schema and lowers the difficulty of schema design.
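The reverse reasoning in item (2) can be illustrated with a small lookup table. This is a sketch under assumptions: the table `INVERSE` and the function `infer_inverse` are hypothetical names, and only the parent/child pair mentioned in the text is included. A triple (s, p, o) is read as "p of s is o".

```python
# Hypothetical table of target categories and their inverse relations.
INVERSE = {"parent": "child", "child": "parent"}


def infer_inverse(triple):
    """Given (s, p, o), return the inferred (o, inverse_p, s), or None if p is not a target category."""
    s, p, o = triple
    inverse_p = INVERSE.get(p)
    if inverse_p is None:
        return None
    return (o, inverse_p, s)


# "parent of Zhang San is Zhang Da" implies "child of Zhang Da is Zhang San".
print(infer_inverse(("Zhang San", "parent", "Zhang Da")))
```

Only one direction needs to be labeled and fitted; the other is derived, which is the redundancy reduction the paragraph describes.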

(3) Generating entity labels based on the association between gender and the relationships between entities, and labeling the entities in the text with the generated entity labels. For example, the schema design is reduced by predicting the gender in the person basic attributes: for the sentence "Zhang San is Zhang Da's son", the algorithm can identify Zhang San and Zhang Da as an entity pair in which Zhang San's parent is Zhang Da, and at the same time identify that Zhang San and Zhang Da are both male. In the embodiment of the invention, the label "son" can then be generated for the entity "Zhang San" from Zhang San's gender. Similarly, because the relationship between the entity "Zhang San" and "Zhang Da" is bidirectional, labeling "Zhang San" yields not only that Zhang San is Zhang Da's son but also that Zhang Da is Zhang San's father. By analogy, the large number of labels such as mother, daughter, brother and sister found in most data sets can be dropped, since they can be derived by crossing gender with the relationship between entities. Only one additional gender prediction function is needed to reduce the complexity of both the labeling work and the algorithm.

Gender prediction is not part of the sentence-level extraction flow, because the text may contain no word that expresses gender, and not every triple needs to output gender. Therefore, when s and o are persons, the corresponding subsequences of the model's encoded vector sequence can be used as the input of a classification model. It is worth noting that gender prediction is a three-class problem.
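Crossing a coarse relation with a predicted gender, as described in item (3), might look like the sketch below. The table `SPECIFIC`, the function `refine` and the third gender class "unknown" are assumptions made for illustration; the text only states that gender prediction has three classes.

```python
# Hypothetical lookup: (coarse relation, gender of the object entity) -> specific label.
SPECIFIC = {
    ("parent", "male"): "father",
    ("parent", "female"): "mother",
    ("child", "male"): "son",
    ("child", "female"): "daughter",
}


def refine(triple, gender):
    """Refine the relation of (s, p, o) with the predicted gender of o.

    If the gender is unknown or the pair is not in the table, the coarse relation is kept.
    """
    s, p, o = triple
    specific = SPECIFIC.get((p, gender.get(o, "unknown")), p)
    return (s, specific, o)


gender = {"Zhang San": "male", "Zhang Da": "male"}
print(refine(("Zhang San", "parent", "Zhang Da"), gender))  # father of Zhang San is Zhang Da
print(refine(("Zhang Da", "child", "Zhang San"), gender))   # son of Zhang Da is Zhang San
```

With this design the schema only needs the coarse relations plus gender, and the specific kinship labels are derived rather than annotated.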

(4) For the person attribute definitions which cannot be specifically classified, the embodiment of the invention can divide the person attribute definitions which cannot be specifically classified into a set of person attribute definitions of which the semantic range is larger than that of the person attribute definitions which cannot be specifically classified based on the semantic range size relationship between the entities.

For example, given "Wang Di and Zhao Wu were very good friends at school" with no further context, the algorithm cannot tell whether they were college classmates, high school classmates or primary school classmates; it can only identify a classmate relationship between the entity "Wang Di" and the entity "Zhao Wu". The classmate relationship is nevertheless relatively important information. A person attribute definition that cannot be specifically classified thus means that the general relationship between two entities, such as couple or classmate, can be obtained, while the precise relationship, such as husband, wife, college classmate, high school classmate or primary school classmate, cannot. Such a relationship can be included in the designed classmate relationship; that is, the classmate label in the embodiment of the invention is designed to cover the college classmate, high school classmate and primary school classmate relationships. Likewise, "parent" and "couple" are inclusive labels for father/mother and husband/wife when gender prediction is not accurate: when the gender prediction for the first entity and the second entity is not accurate but the two entities have a marital relationship, the label "couple" may be used instead of the specific labels "husband" and "wife", and the label "parent" replaces the specific labels "father", "mother", "dad" and "mom". As shown in Fig. 2, each of the four major classes in the social relationship attributes includes its subclasses.

By applying the embodiment of the invention, these uncertain relations are also extracted, so that as much of the information in the text as possible is retrieved, and the influence of uncertain categories on algorithm fitting is reduced.
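The fallback from a specific label to a label with a larger semantic scope can be sketched as a small hierarchy lookup. The mapping `COARSE` and the function `resolve` are hypothetical; the label names are the ones used in the examples above.

```python
# Hypothetical hierarchy: specific label -> label with a larger semantic scope.
COARSE = {
    "husband": "couple",
    "wife": "couple",
    "father": "parent",
    "mother": "parent",
    "college classmate": "classmate",
    "high school classmate": "classmate",
    "primary school classmate": "classmate",
}


def resolve(candidates):
    """Pick a label for a set of specific labels that cannot be told apart.

    One candidate: keep it. Several candidates sharing one coarse label: use that label.
    Otherwise return None (no single inclusive label exists in the table).
    """
    if len(candidates) == 1:
        return next(iter(candidates))
    parents = {COARSE.get(c) for c in candidates}
    if len(parents) == 1 and None not in parents:
        return parents.pop()
    return None


print(resolve({"husband", "wife"}))
print(resolve({"college classmate", "high school classmate", "primary school classmate"}))
```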

(5) When the person attribute definition is a person social attribute, acquiring the subordinate concept corresponding to the person social attribute, and acquiring the person social attribute based on the association between the subordinate concept and the person.

For example, to trace specific social attributes more accurately, relationships from large classes to small classes are added; that is, organizations, schools, activities and social platforms also serve as subject entity labels in triples. Take the sentence "obtained a degree in economics from the Wharton School of the University of Pennsylvania in 1968": it is necessary to label that Mingming's school is the University of Pennsylvania and that Mingming's college is the Wharton School, and also to label the subordinate concept of the University of Pennsylvania, namely that its college is the Wharton School. If other study experiences appear in the context, the complete social attribute can be obtained by combining the study time, namely that Mingming studied at the Wharton School of the University of Pennsylvania. In the same way, positions, activity identities, statements and the like can be traced in detail.

S103: and constructing a corresponding triple based on the marked entity.

Person attribute relationship extraction extracts all the triples contained in a sentence. A triple has the form (s, p, o), where s is the subject, i.e. the head entity; o is the object, i.e. the tail entity; and p is the predicate, i.e. the relationship between the two entities. (s, p, o) is read as "p of s is o".
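As a minimal sketch of the (s, p, o) form and of a database built as a set of triples, the following uses a hypothetical `Triple` type and illustrative values; none of these names come from the patent.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    s: str  # subject (head entity)
    p: str  # relationship between the two entities
    o: str  # object (tail entity)


def read(t):
    """Render a triple in the 'p of s is o' reading."""
    return f"{t.p} of {t.s} is {t.o}"


# The database is modelled as a set, so repeated triples collapse into one entry.
database = {
    Triple("Li Si", "husband", "Zhang San"),
    Triple("Li Si", "husband", "Zhang San"),
    Triple("Zhang San", "gender", "male"),
}

for t in sorted(database):
    print(read(t))
```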

The embodiment of the invention outputs a training set comprising 53 schemas, which covers most of the schemas in current person attribute relationship extraction tasks, so the training set constructed by the embodiment of the invention generalizes well in the field of person attribute relationship extraction.

On the one hand, the embodiment of the invention reduces the dimensionality of the data set through the design of the schema system; on the other hand, it makes annotation efficient through the annotation method.

In addition, there are currently few standard data sets in China for the Chinese person attribute relationship extraction task, and most researchers collect their own data, so establishing and designing a standard data set for Chinese person attribute relationship extraction is a problem that urgently needs to be solved. Moreover, distant supervision in the prior art aligns text with a remote knowledge base, which greatly reduces the labor cost and transfers better across domains, but the data obtained by its automatic labeling is of lower quality, which ultimately affects model training. The embodiment of the invention improves the prediction performance of the model by optimizing the data set.

Example 2

In the prior art, methods based on sequence labeling solve the entity redundancy caused by invalid entities by labeling "position information-relation type-entity role" simultaneously, so that entity-relation triples are predicted directly. However, existing sequence labeling methods do not consider the diversity of words after the input sentence is encoded, so overlapping entity relationships cannot be identified; for example, in "Someone's Autobiography", the model cannot identify the relationship between "Someone" and "Someone's Autobiography". Moreover, some articles are long texts, for example texts of more than 1000 characters. When long texts are recognized, the number of neurons in the neural network grows and the network becomes deeper, which further increases the computation when the model runs.

In order to solve the above problems, embodiments of the present invention provide a concept of sequence-based labeling, which can not only accommodate information between entities, but also avoid the problem of entity redundancy, and also solve the problems of entity overlapping and excessive computation.

Fig. 4 is a schematic diagram of the principle of the entity extraction method according to an embodiment of the present invention. As shown in Fig. 4, the method of embodiment 2 includes:

first converting the triples into sequence data; then, in the labeling part, removing the relation type from the "position information-relation type-entity role" scheme of sequence labeling; and, in the parameter sharing part, first predicting one entity and then, conditioned on that entity, predicting the other entity for each relation. Because the text is encoded twice in each pass, the problem of entity overlapping can be solved.

S201 (not shown in the figure): randomly initializing one embedding layer matrix in an initial model, wherein the initial model is a DGCNN model and an Attention model which are connected in series.

Person introduction pages in an encyclopedia are crawled, and continuous text is obtained with regular expressions for labeling; combined with competition data and public data sets, a person attribute relation extraction data set of 11603 samples is constructed, with an average length of 427 characters and a maximum length of 512 characters. During labeling, the text length needs to be controlled manually, and at least 500 samples are guaranteed for each class of p. Fine-grained labeling is added for any class of p with low accuracy or recall, and the visual labeling platform is put to full use in this supplementing process.

Fig. 5 is a schematic diagram of the model architecture used in the entity extraction method provided in the embodiment of the present invention. As shown in Fig. 5, the entity extraction model in embodiment 2 uses a DGCNN + Attention structure, where the Attention is Google's Self Attention; the DGCNN may also use an existing neural network model. A text vector built on single characters alone carries little meaning, because a single character has no semantic information, so word vectors are fused into the character Embedding layer on top of the most commonly used deep-learning-based joint learning method. Statistical analysis of the labeled data set shows that the number of o is a multiple of the number of s, and that sampling s is easier and more sufficient; therefore s is predicted first, and s is then passed in to predict the corresponding o and p.

The DGCNN model uses 12 layers in total: the dilation rates [1, 2, 5] are repeated three times in sequence, so that granularity is learned repeatedly from fine to coarse, and three layers with dilation rate [1] are then added to fine-tune at fine granularity. Since predicting the start and end positions is in fact two binary classification problems, the loss function is binary cross entropy. The Adam optimizer is used for training, with a learning rate of 0.0001, 150 epochs, a maximum text length of 512 and a batch size of 8. To stabilize training, EMA (Exponential Moving Average) is applied to the model with a decay rate of 0.9999. In the decoding part, the "start" threshold and the "end" threshold for the start and end positions are 0.5 and 0.4 respectively.
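The EMA step mentioned above can be illustrated on plain Python floats. This sketch assumes the usual update rule shadow = decay * shadow + (1 - decay) * weight; the function name `ema_update` is hypothetical, and the text does not say which framework or variant of EMA is used.

```python
def ema_update(shadow, weights, decay=0.9999):
    """Return the updated moving averages of the model weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


# Toy example: the averaged weight moves only slightly towards the new value,
# which smooths out step-to-step noise in training.
shadow = [1.0]
for step_weights in ([0.0], [0.0], [0.0]):
    shadow = ema_update(shadow, step_weights)
print(shadow)
```

With a decay of 0.9999, each training step changes the averaged weights by only 0.01% of the difference, so the averaged model is much steadier than the raw one.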

Taking the sentence in the figure, "In 2021, Li Si held a wedding ceremony with her husband Zhang San.", as an example, the process includes the following steps:

In order to preserve the flexibility of characters, a character embedding layer matrix (n x 32) is randomly initialized and updated during training to vectorize the characters, wherein n represents the number of characters in the dictionary and 32 represents the vector dimension of each character;

S202 (not shown in the figure): performing word segmentation on the text corresponding to the person attribute relation extraction database for long text, obtaining word vectors through Word2Vec, and converting each word vector into a vector with the same dimension as the character vectors by using a transformation matrix, wherein the person attribute relation extraction database for long text is the database in embodiment 1.

The text is segmented into words, a 300-dimensional word vector is obtained for each word through Word2Vec, and each word vector is converted into a vector with the same dimension as the character vectors through a (300 x 32) transformation matrix;

S203 (not shown): each word vector is repeated so that it aligns with the characters of the word, and the aligned vectors are added element-wise, for example, "reading makes people progress". A mixed character-word vector is thus obtained, which carries the semantic information brought by the word vectors as well as the information of the characters.
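Steps S202 and S203 can be sketched with NumPy as follows. The dimensions 32 and 300 and the (300 x 32) transformation matrix come from the text; everything else (random tables standing in for the trained character embedding and for Word2Vec, the toy tokens, the name `mix_embeddings`) is an assumption for illustration only.

```python
import numpy as np

CHAR_DIM, WORD_DIM = 32, 300
rng = np.random.default_rng(0)


def mix_embeddings(words, char_table, word_vectors, transform):
    """Build one mixed vector per character.

    Each word vector is projected to CHAR_DIM, repeated once per character of
    the word, and added element-wise to that character's own vector.
    """
    rows = []
    for word in words:
        projected = word_vectors[word] @ transform  # (300,) @ (300, 32) -> (32,)
        for ch in word:
            rows.append(char_table[ch] + projected)
    return np.stack(rows)


# Toy segmentation result; random vectors stand in for trained embeddings.
words = ["ab", "cde"]
char_table = {ch: rng.normal(size=CHAR_DIM) for ch in "abcde"}
word_vectors = {w: rng.normal(size=WORD_DIM) for w in words}
transform = rng.normal(size=(WORD_DIM, CHAR_DIM))

mixed = mix_embeddings(words, char_table, word_vectors, transform)
print(mixed.shape)
```

The output has one row per character, so the sequence length stays at character level while each row also carries the projected vector of the word it belongs to.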

S204 (not shown): the addition result is input into the DGCNN model to obtain the encoded vector sequence.

In the embodiment of the invention, a position Embedding layer is added to strengthen the CNN's sense of position, in the same way as the character Embedding layer. Because the subject is often at the beginning of a sentence, the object is often near the subject, and the maximum number of characters during training is 512, the matrix dimension of the position Embedding layer is set to 512 x 32;

The encoded vector sequence is obtained through the DGCNN model. DGCNN (Dilated Gated Convolutional Neural Network) is a convolutional neural network that adds two convolution methods, dilated convolution and gated convolution. Gated convolution adds a gate mechanism to ordinary convolution:

Fig. 6 is a structural diagram of the convolution gate mechanism in the entity extraction method according to the embodiment of the present invention. As shown in Fig. 6, Conv1D1 and Conv1D2 have the same number of convolution kernels and the same window size but do not share weights. One of them is activated by a sigmoid function and the other has no activation function, and the two are finally multiplied bit by bit to give the output Y, which corresponds to the input X. This structure helps prevent vanishing gradients, because one of the convolutions has no activation function. When the output dimension and the input dimension are consistent, the output can be calculated using the following formula:

wherein Y is the output of the convolution gate; X is the input of the convolution gate; Conv1D1(X) is a convolutional layer without an activation function; ⊗ is the bit-by-bit multiplication symbol; and σ() is the sigmoid function.
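The gate described in the prose above (one convolution without activation, multiplied bit by bit with a second, sigmoid-activated convolution) can be sketched in NumPy. This follows only the prose description of Fig. 6; the helper names, the zero "same" padding and the random weights are assumptions, and any additional term the original formula may contain is not modelled here.

```python
import numpy as np


def conv1d_same(x, kernel, dilation=1):
    """1-D convolution with zero 'same' padding.

    x: (seq_len, c_in); kernel: (k, c_in, c_out) with odd k.
    """
    k = kernel.shape[0]
    pad = dilation * (k - 1) // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    seq_len = x.shape[0]
    out = np.zeros((seq_len, kernel.shape[2]))
    for i in range(k):
        offset = i * dilation
        out += padded[offset:offset + seq_len] @ kernel[i]
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_conv(x, kernel1, kernel2, dilation=1):
    """Branch 1 has no activation; branch 2 is a sigmoid gate; the two are multiplied element-wise."""
    return conv1d_same(x, kernel1, dilation) * sigmoid(conv1d_same(x, kernel2, dilation))


rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))
kernel1 = rng.normal(size=(3, 4, 4))
kernel2 = rng.normal(size=(3, 4, 4))  # same shape as kernel1, weights not shared
y = gated_conv(x, kernel1, kernel2, dilation=2)
print(y.shape)
```

Because the gate output lies strictly between 0 and 1, it scales each feature of the un-activated branch rather than replacing it.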

Since the range of the sigmoid function is (0, 1), the gate in effect lets information be transmitted through multiple channels. To further enhance the convolution, dilated convolution may be used in the convolution gate mechanism. With dilated convolution, the network can capture information at a greater distance without increasing the parameters of the model. Fig. 7 is a schematic diagram of the standard convolution method and the dilated convolution method used in the entity extraction method according to the embodiment of the present invention; the left diagram in Fig. 7 shows standard convolution and the right diagram shows dilated convolution. In the figure, each node in the last layer of the standard convolution captures 3 inputs before and after it, while the dilated convolution captures 7 inputs before and after it.
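The figures 3 and 7 quoted above are consistent with three stacked convolution layers of window size 3, using dilation rates 1, 1, 1 for standard convolution and 1, 2, 4 for dilated convolution. Those rates are an assumption about what Fig. 7 shows; the 12-layer schedule is the one stated earlier in this embodiment. The helper `receptive_radius` is hypothetical.

```python
def receptive_radius(dilations, kernel_size=3):
    """Number of inputs a top-layer node can see on each side of its own position."""
    return sum(d * (kernel_size - 1) // 2 for d in dilations)


standard = receptive_radius([1, 1, 1])   # three ordinary layers
dilated = receptive_radius([1, 2, 4])    # assumed rates for the dilated example
schedule = [1, 2, 5] * 3 + [1, 1, 1]     # the 12-layer schedule of this embodiment
print(standard, dilated, receptive_radius(schedule))
```

Both three-layer stacks have the same number of parameters; only the spacing of the taps differs, which is why the reach grows without enlarging the model.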

S205 (not shown): and inputting the coded vector sequence into a first Attention model, wherein the first Attention model outputs a result through two classifiers, and each classifier comprises two convolution layers and a full-connection layer.

The vector sequence is then passed into the first Attention, which uses Google's Self Attention model, to integrate the input text so that the model focuses more on important words.

(6) Finally, the result is sent into two convolution layers and a fully connected layer to construct two classifiers, which predict the start position and the end position of s respectively; the convolution kernel size is 3. In natural language processing, a word is a one-dimensional vector and a sentence is a two-dimensional matrix. The convolutional layers and activation functions map the vector features into a hidden feature space, and the learned high-dimensional features are mapped layer by layer into the sample label space; these are the most basic parts of a CNN;
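Turning the two classifiers' per-position probabilities into spans could work as sketched below. The thresholds 0.5 and 0.4 are the ones given for the decoding part; the pairing rule (each start is matched with the nearest following end) and the name `decode_spans` are assumptions, since the text does not spell out how starts and ends are paired.

```python
def decode_spans(start_probs, end_probs, start_thr=0.5, end_thr=0.4):
    """Return (start, end) index pairs, both inclusive."""
    spans = []
    for i, p_start in enumerate(start_probs):
        if p_start <= start_thr:
            continue
        # Assumed rule: take the first position at or after i that passes the end threshold.
        for j in range(i, len(end_probs)):
            if end_probs[j] > end_thr:
                spans.append((i, j))
                break
    return spans


start_probs = [0.9, 0.1, 0.1, 0.7, 0.2]
end_probs = [0.1, 0.6, 0.1, 0.1, 0.45]
print(decode_spans(start_probs, end_probs))
```

Because start and end positions are scored independently, several spans can be returned for one sentence, which is what allows more than one s to be extracted.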

S206 (not shown): when a classifier identifies an entity label, the start and end vectors corresponding to the sequence fragment of the entity in the encoded vector sequence are added element-wise, and the result is input into a relative-position Embedding layer to obtain an embedding result.

During training, an s is randomly sampled from the labels, and the sampling result is taken as the s output by the model. Once s is sampled, in order to carry the information of s into the prediction of o and p, the start and end vectors of the s fragment in the DGCNN-encoded sequence are added element-wise to obtain a vector with the same dimension as the character vector, and a relative-position Embedding layer is added. The relative-position Embedding layer is still trainable; it differs from the absolute position embedding in that the index of each character is its position relative to the head of s, so that the information of s is added when o and p are predicted;
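A sketch of how the information of s might be assembled is given below: the start and end vectors of the s fragment are added, and a trainable table is indexed by each position's offset from the head of s. The table size (2 * max_len - 1 rows, to cover negative and positive offsets), the way the two parts are combined, and the name `subject_condition` are assumptions for illustration.

```python
import numpy as np


def subject_condition(encoded, s_start, s_end, rel_table, max_len):
    """Return one conditioning vector per position, shape (seq_len, dim).

    encoded: (seq_len, dim) output of the encoder.
    rel_table: (2 * max_len - 1, dim) relative-position embedding table.
    """
    seq_len = encoded.shape[0]
    s_vec = encoded[s_start] + encoded[s_end]      # element-wise sum of start and end vectors
    offsets = np.arange(seq_len) - s_start         # position relative to the head of s
    rel_idx = offsets + (max_len - 1)              # shift so that indices are non-negative
    return s_vec + rel_table[rel_idx]              # s_vec is broadcast over all positions


rng = np.random.default_rng(0)
max_len, dim = 8, 4
encoded = rng.normal(size=(6, dim))
rel_table = rng.normal(size=(2 * max_len - 1, dim))
cond = subject_condition(encoded, s_start=2, s_end=3, rel_table=rel_table, max_len=max_len)
print(cond.shape)
```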

S207 (not shown): inputting the coded vector sequence into a second Attention model, superposing the output of the second Attention model and the embedding result, taking the superposed result as the coded vector sequence, and returning to the step of inputting the coded vector sequence into the first Attention model until the model converges, thereby obtaining the trained target model.

The DGCNN-coded vector sequence is then passed into a second Self Attention layer, and its output is added to the superposed vector obtained above: that is, the vector sequence is input into the second Attention model, the output of the second Attention model is superposed with the embedding result, the superposed result is taken as the vector sequence, and the process returns to step (5) until the model converges.

In the embodiment of the invention, unlike other strategies, the start and end positions of o are predicted for each p through the CNN and the fully connected layer, so that o and p are predicted simultaneously. The o and p prediction part therefore comprises 53 (types of p) × 2 (start and end positions of o) = 106 binary classifiers.
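One way to read the output of these 106 binary classifiers is sketched below. It assumes that each classifier emits one probability per token, that a threshold of 0.5 is used, and that each start position is paired with the nearest end position at or after it; none of these decoding details is stated in the embodiment.

```python
NUM_PREDICATES = 53  # number of p types given in the embodiment

def decode_objects(start_probs, end_probs, threshold=0.5):
    """Return (p, o_start, o_end) spans from per-token probabilities.

    start_probs[i][p] / end_probs[i][p]: probability that token i starts /
    ends an object o for predicate p.
    """
    spans = []
    seq_len = len(start_probs)
    for p in range(len(start_probs[0])):
        for i in range(seq_len):
            if start_probs[i][p] > threshold:
                # Pair the start with the nearest end at or after it.
                for j in range(i, seq_len):
                    if end_probs[j][p] > threshold:
                        spans.append((p, i, j))
                        break
    return spans

# Toy output for a 5-token text: predicate 7 has an object at tokens 2..3.
start_probs = [[0.0] * NUM_PREDICATES for _ in range(5)]
end_probs = [[0.0] * NUM_PREDICATES for _ in range(5)]
start_probs[2][7] = 0.9
end_probs[3][7] = 0.8
spans = decode_objects(start_probs, end_probs)
```

Since the predicate index is part of each decoded span, p and o are obtained in a single pass, with no separate relation classification step.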

S208 (not shown): extracting entity relations by using the trained target model.

Furthermore, because the maximum text length for each training pass of the model is set to 512, the text is first divided into segments by sentence during prediction; that is, a sliding window moving from the beginning to the end of the text is generated. After each extraction, the window slides down to the start of the sentence containing the last extracted triple, and each window contains no more than 512 words. Predicting while sliding in this way greatly improves the recall of the model on long texts and avoids the information loss caused by hard truncation; it also makes the model better suited to long texts.
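The predict-and-slide procedure can be sketched as follows. The callback `extract` is a hypothetical stand-in for the trained model, length is counted in characters, and the rule for advancing when the window would otherwise not move is an assumption added to guarantee termination.

```python
def sliding_extract(sentences, extract, max_len=512):
    """Run `extract` over sentence-aligned windows of at most max_len characters.

    extract(window) takes a list of sentences and returns
    (sentence_offset, triple) pairs for the triples it finds.
    """
    triples, start = [], 0
    while start < len(sentences):
        end, length = start, 0
        # Grow the window sentence by sentence; a single over-long sentence
        # is still taken alone (truncation is left out of this sketch).
        while end < len(sentences) and (
            end == start or length + len(sentences[end]) <= max_len
        ):
            length += len(sentences[end])
            end += 1
        found = extract(sentences[start:end])
        for _, triple in found:
            if triple not in triples:
                triples.append(triple)
        last = max((offset for offset, _ in found), default=0)
        # Slide to the sentence holding the last triple; if that would not
        # advance the window, jump past the current window instead.
        start = start + last if last > 0 else end
    return triples

def toy_extract(window):
    """Toy model: every sentence mentioning 'A' yields one 'triple'."""
    return [(i, s) for i, s in enumerate(window) if "A" in s]

sentences = ["A married B.", "Filler text.", "C is the son of A.", "The end."]
short_windows = sliding_extract(sentences, toy_extract, max_len=30)
long_window = sliding_extract(sentences, toy_extract, max_len=512)
```

Restarting at the sentence of the last triple means consecutive windows overlap, so a relation whose entities straddle a window boundary still has a chance of being seen whole.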

In the embodiment of the invention, the labeled data set is optimized, and the task of jointly extracting the triples in a text is realized with a deep learning approach based on word fusion, which greatly improves the accuracy of the model. To verify the technical advantages of the embodiments of the present invention, the inventors also trained the most common deep-learning-based relation extraction methods, namely CNN (Convolutional Neural Network), PCNN (Piecewise Convolutional Neural Network), RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory network), for comparative analysis; the experimental results are shown below:

Table 1: Evaluation metrics for the extraction of person attributes and relationships

As shown in Table 1, the accuracy, recall, and F-value of the method provided by the embodiment of the present invention are all better than those of the existing models.
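For reference, metrics of this kind are commonly computed over sets of extracted triples as sketched below. The sketch assumes that the accuracy reported here corresponds to precision over predicted triples and that the F-value is the balanced F1 score; the example triples are invented for illustration and are not results from Table 1.

```python
def precision_recall_f1(predicted, gold):
    """Triple-level precision, recall, and F1 between two collections."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Invented example: one of two predicted triples matches the gold labels.
gold = [("Zhang San", "spouse", "Li Si"),
        ("Zhang San", "occupation", "teacher")]
predicted = [("Zhang San", "spouse", "Li Si"),
             ("Zhang San", "occupation", "doctor")]
p, r, f1 = precision_recall_f1(predicted, gold)
```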

Example 3

Corresponding to embodiment 1 or 2, embodiment 3 of the present invention provides an apparatus for creating a person attribute relationship extraction database in a long text, where the apparatus includes:

the acquisition module is used for acquiring the character attribute definition based on a preset method, wherein the preset method comprises the following steps: one or a combination of full text retrieval and gender prediction; the person attribute definition includes: character basic attributes, social attributes and character social relationship attributes;

the labeling module is used for pre-labeling the entity in the text by utilizing the character attribute definition, wherein the entity comprises: one or a combination of a person, an article, an animal;

and the construction module is used for constructing the corresponding triple based on the marked entity and taking the set of the triples as a database.
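The role of the construction module can be illustrated with a small sketch that turns pre-labeled annotations into a set of triples. The attribute names, the `Triple` type, and the filtering rule are hypothetical; the embodiment only specifies that attribute definitions cover basic attributes, social attributes, and social relationship attributes.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Hypothetical attribute definitions, grouped by the three categories.
ATTRIBUTE_DEFINITIONS = {
    "basic": {"gender", "birthplace"},
    "social": {"occupation"},
    "relationship": {"spouse", "father"},
}

def build_database(annotations):
    """Keep only annotations whose predicate is a defined attribute."""
    allowed = set().union(*ATTRIBUTE_DEFINITIONS.values())
    return {Triple(s, p, o) for s, p, o in annotations if p in allowed}

annotations = [
    ("Zhang San", "spouse", "Li Si"),
    ("Zhang San", "occupation", "teacher"),
    ("Zhang San", "favorite_color", "blue"),  # not a defined attribute
]
database = build_database(annotations)
```

Storing the triples as a set removes duplicate annotations of the same fact.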

In a specific implementation manner of the embodiment of the present invention, the obtaining module is configured to:

The above description is intended to be illustrative of the preferred embodiment of the present invention and should not be taken as limiting the invention, but rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims

1. An entity relationship extraction method, the method comprising:

performing word segmentation on a text corresponding to a character attribute relation extraction database in a long text to obtain a word vector through Word2Vec, and converting the word vector into a vector with the same dimension as the word vector by using a transformation matrix, wherein the establishment process of the character attribute relation extraction database in the long text comprises the following steps: acquiring character attribute definition based on a preset method, wherein the preset method comprises the following steps: one or a combination of full text retrieval and gender prediction; the person attribute definition includes: character basic attributes, social attributes and character social relationship attributes; and utilizing the character attribute definition to perform entity pre-labeling in the text, wherein the entity comprises: one or a combination of a person, an article, an animal; constructing a corresponding triple based on the marked entity, and taking the set of the triples as a database;

when the classifier identifies an entity label, the start and end vectors corresponding to the sequence fragment of the entity in the coded vector sequence are added element-wise, and the result is input into a relative-position Embedding layer to obtain an embedding result;

inputting the coded vector sequence into a second Attention model, superposing the output of the second Attention model and the embedding result, taking the superposed result as the coded vector sequence, and returning to execute the step of inputting the coded vector sequence into the first Attention model until the model converges to obtain a trained target model;

and carrying out entity relation extraction by using the target model.

2. The entity relationship extraction method as claimed in claim 1, wherein the obtaining of the person attribute definition based on the preset method comprises:

3. The entity relationship extraction method as claimed in claim 1, wherein the obtaining of the person attribute definition based on the preset method comprises:

4. The method for extracting entity relationship as claimed in claim 1, wherein the obtaining of the person attribute definition based on the preset method includes:

based on the semantic-scope relationship between entities, assigning the person attribute definitions that cannot be specifically classified to a set whose semantic scope is broader than that of those person attribute definitions.

5. The method of claim 1, wherein obtaining the person attribute definition based on a preset method comprises: