CN111125367B - Multi-character relation extraction method based on multi-level attention mechanism - Google Patents
- Publication number
- CN111125367B CN111125367B CN201911362557.9A CN201911362557A CN111125367B CN 111125367 B CN111125367 B CN 111125367B CN 201911362557 A CN201911362557 A CN 201911362557A CN 111125367 B CN111125367 B CN 111125367B
- Authority
- CN
- China
- Prior art keywords
- text
- layer
- vector
- word
- entity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a multi-character relation extraction method based on a multi-level attention mechanism, comprising the following steps: preprocessing the collected text; aligning and labeling the original person named entities by remote supervision to obtain text containing the entities together with entity description information; training Chinese word vectors on the obtained entity-bearing text; constructing a bidirectional long short-term memory (BiLSTM) network containing two levels of attention mechanisms and training it to obtain a multi-class model for extracting multiple character relations; and inputting the preprocessed text to obtain the relation extraction result. The invention remedies the shortcomings of prior relation extraction on texts containing multiple character relations and improves the experimental results on such texts.
Description
Technical Field
The invention relates to the field of natural language processing, in particular to a method for extracting various character relations based on a multi-level attention mechanism.
Background
With the rapid development of internet technology, text data on the network grows exponentially, but it is largely unstructured. Information extraction is a natural language processing task that aims to extract structured information from unstructured text. It covers two subtasks: named entity recognition, which discovers the entities present in a text, and relation extraction, which determines the relationship between the discovered entities, i.e., obtains for a given text the entity pair e_1 and e_2 and the relation r between them as a triple (e_1, r, e_2). Relation extraction has been widely used in knowledge graphs, information retrieval, and other fields.
Conventional non-deep-learning methods for relation extraction are typically supervised and fall into feature-based and kernel-based methods; both rely on existing NLP tools, which leads to downstream error accumulation. Deep learning removes the need for manual feature engineering, but supervised deep learning requires large amounts of training data to learn features, and labeling such data takes considerable time and effort and is biased toward a fixed domain. Mintz et al. proposed remote supervision (distant supervision) in 2009: under the strong assumption that entity relations in a knowledge base also hold in text, a large amount of training data is generated by aligning the knowledge base with the text.
However, the strong assumption of remote supervision does not always hold: the entity relation expressed in a text is not necessarily the same as the relation recorded in the knowledge base. To alleviate this, Riedel et al. introduced multi-instance learning. In 2016, Lin et al. first combined a piecewise convolutional neural network with a sentence-level attention mechanism; introducing deep learning and attention in this way achieved better relation extraction results.
Most relation extraction work targets English text. For Chinese text, and especially Chinese text containing multiple character relations, how to use deep learning with attention mechanisms to better extract multiple character relations still needs to be studied.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a multi-character relation extraction method based on a multi-level attention mechanism. The method obtains a global feature representation of the text with a bidirectional long short-term memory network and a word-level attention mechanism, where the word-level attention strengthens the weights of words that are more important for relation extraction; it then adopts multi-instance learning, in which a sentence-level attention mechanism builds a bag representation from multiple sentence representations, and the description information of the named entities is added to strengthen the bag representation. The invention obtains better experimental results on a remotely supervised relation extraction data set.
The invention can be realized by the following technical scheme:
a multi-character relation extraction method based on a multi-level attention mechanism comprises the following steps:
preprocessing the collected text;
the method comprises the steps of adopting a remote supervision technology to carry out alignment labeling on an original personage named entity to obtain a text containing the entity and entity description information;
training the Chinese word vector of the obtained text containing the entity;
constructing a bidirectional long-short-time memory network containing two levels of attention mechanisms, and training the constructed model to obtain a multi-classification model for extracting various character relations;
and inputting the preprocessed text to obtain a text relation extraction result.
Specifically, the pretreatment includes:
removing English data in the text;
removing emoticons and hyperlinks in the text;
removing stop words in the text according to the Chinese stop word list;
and performing Chinese word segmentation on the text subjected to the processing.
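As a sketch, the four preprocessing steps above could look like the following. The stop-word set and the character-level fallback tokenizer are illustrative stand-ins; a real pipeline would load a full Chinese stop-word file and segment with a tool such as jieba.

```python
import re

# Illustrative stop-word set; a real pipeline loads a Chinese stop-word file.
STOP_WORDS = {"的", "了", "是"}

def preprocess(text: str) -> str:
    """Clean one collected text: drop hyperlinks, English data, and
    emoticons, then remove stop words and return space-separated tokens."""
    text = re.sub(r"https?://\S+", " ", text)   # remove hyperlinks
    text = re.sub(r"[A-Za-z]+", " ", text)      # remove English data
    # remove common emoji/emoticon code-point ranges
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", " ", text)
    # Chinese word segmentation would normally use jieba.lcut(text);
    # a character-level split is used here as a self-contained fallback.
    tokens = [ch for ch in text if ch.strip()]
    return " ".join(t for t in tokens if t not in STOP_WORDS)
```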
Specifically, in the step of aligning and labeling the original character named entities by remote supervision, person name entries are acquired from the Chinese online Baidu Baike encyclopedia; two related persons and their relation form a triple, from which a character relation knowledge base is finally constructed. Entity pairs that appear both in the text and in the knowledge base are labeled with the relation of the corresponding triple. The final labeled data set in the invention has 35 relation types.
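A minimal sketch of the distant-supervision alignment described above; the knowledge-base triples below are hypothetical stand-ins for entries harvested from the encyclopedia.

```python
# Hypothetical character-relation knowledge base: (person1, person2) -> relation.
KB = {("张三", "李四"): "夫妻", ("王五", "赵六"): "师生"}

def label_sentence(sentence: str):
    """Distant supervision: if both persons of a knowledge-base pair occur
    in the sentence, label the sentence with that pair's relation triple."""
    labels = []
    for (e1, e2), rel in KB.items():
        if e1 in sentence and e2 in sentence:
            labels.append((e1, rel, e2))
    return labels
```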
Specifically, in the step of training Chinese word vectors on the text, the distributed word vector representation method Word2Vec is adopted, and the dimension of the output word vectors is set to 300.
Specifically, in the step of constructing the bidirectional long short-term memory network containing two levels of attention mechanisms, PyTorch is used to build the BiLSTM (bidirectional LSTM) and two-level attention network structure: the first layer of the network is an embedding layer, the second a bidirectional LSTM layer, the third a word-level attention layer, the fourth a sentence-level attention layer, and the fifth a softmax classifier layer.
Further, the input of the embedding layer is the trained word vector sequence. The length of a text sequence (the number of word vectors) is set to m; shorter sequences are padded with 0 and longer ones truncated to m, and the relative positions of the words in each text with respect to the two entities also have length m. The embedding layer uses Baidu Baike as the corpus, and the word vectors are obtained with Word2Vec using the Gensim tool. With word vector dimension d_w and randomly initialized position vector dimension d_p, a vector sequence w = {w_1, w_2, …, w_m}, w_i ∈ R^d is obtained, where d = d_w + 2·d_p.
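The padding, truncation, and relative-position scheme of the embedding layer input can be sketched as follows. The concrete values m=70 and the "<pad>" token are assumptions; the patent fixes the length only symbolically as m.

```python
def build_input(tokens, e1_idx, e2_idx, m=70):
    """Pad with a <pad> token (embedded as 0) or truncate to length m, and
    compute each position's relative distance to the two entity positions."""
    tokens = (tokens + ["<pad>"] * m)[:m]
    pos1 = [i - e1_idx for i in range(m)]  # relative position to entity 1
    pos2 = [i - e2_idx for i in range(m)]  # relative position to entity 2
    return tokens, pos1, pos2

# Each position is then mapped to a d-dimensional vector with d = dw + 2*dp,
# e.g. d = 300 + 2*5 = 310 if dp = 5 (the value of dp is an assumption).
```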
Still further, in the bidirectional LSTM layer, the unidirectional LSTM is expressed as follows, where i_t is the input gate, f_t the forget gate, c_t the cell state, o_t the output gate, h_t the hidden vector, and W_x, W_h, W_c are weights:

i_t = σ(W_xi x_t + W_hi h_(t-1) + W_ci c_(t-1) + b_i)
f_t = σ(W_xf x_t + W_hf h_(t-1) + W_cf c_(t-1) + b_f)
c_t = f_t c_(t-1) + i_t tanh(W_xc x_t + W_hc h_(t-1) + b_c)
o_t = σ(W_xo x_t + W_ho h_(t-1) + W_co c_t + b_o)
h_t = o_t tanh(c_t)

The hidden vector h_t of the bidirectional LSTM is obtained by combining the forward output h→_t and the backward output h←_t.
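One step of these gate equations can be sketched numerically as below. Diagonal (elementwise) peephole weights W_ci, W_cf, W_co are assumed, which is the standard reading of this gate formulation; the weight shapes are illustrative, not fixed by the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of the unidirectional LSTM equations above.
    W maps gate names to weights (matrices for the x/h terms, vectors for
    the elementwise peephole c terms); b maps gate names to bias vectors."""
    i = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])
    f = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])
    c = f * c_prev + i * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])
    o = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] * c + b["o"])
    h = o * np.tanh(c)
    return h, c
```

The bidirectional layer runs this step forward and backward over the sequence and combines the two hidden vectors at each position.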
Still further, the word-level attention layer is used to strengthen the weight of words that are more important for relation extraction.
In the word-level attention layer, u_i denotes the relevance score of each word in a sentence, r is a random query vector, and h_i is the hidden vector of the i-th word (the i-th h_t); the specific relationship is:

u_i = h_i · r

α_i is the weight obtained by the word-level attention mechanism, calculated as follows:

α_i = exp(u_i) / Σ_j exp(u_j)

s is the vector representation of the sentence, calculated as follows:

s = Σ_i α_i h_i
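A numerical sketch of the word-level attention computation; a numerically stabilized softmax is used for the weights.

```python
import numpy as np

def word_attention(H, r):
    """H: (m, d) hidden vectors h_i; r: (d,) random query vector.
    Computes u_i = h_i . r, alpha = softmax(u), s = sum_i alpha_i * h_i."""
    u = H @ r                   # relevance score of each word
    e = np.exp(u - u.max())     # stabilized softmax numerator
    alpha = e / e.sum()
    s = alpha @ H               # sentence representation
    return s, alpha
```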
still further, the sentence-level attention layer is configured to add entity description information.
In the sentence-level attention layer, e i Representing an input sentence s i Relation r relative to predictions k The matching degree of (2) is calculated by the following formula:
α i the weight obtained by the sentence level attention mechanism is calculated as follows:
b is a vector representation of a packet, equal to the weighted sum of all sentences, calculated as follows:
the obtained package represents the description information of the entity on the splice, namely the category information vector of the entity, and is expressed as follows:
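A sketch of the sentence-level attention and the concatenation of entity description information. The bilinear matching form s_i A r_k is an assumption in the spirit of selective attention over instances; the patent defines e_i only as a matching degree.

```python
import numpy as np

def bag_representation(S, A, r_k, desc):
    """S: (n, d) sentence vectors s_i; A: (d, d) weight matrix; r_k: (d,)
    predicted-relation vector; desc: entity category/description vector.
    Computes e_i = s_i A r_k, beta = softmax(e), b = sum_i beta_i * s_i,
    then concatenates the entity description information onto b."""
    e = S @ A @ r_k
    w = np.exp(e - e.max())     # stabilized softmax
    beta = w / w.sum()
    b = beta @ S                # bag representation
    return np.concatenate([b, desc]), beta
```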
the fifth layer is the classifier softmax layer for generating results of the relational extraction by the softmax multi-classifier.
The method obtains a global feature representation of the text using BiLSTM (bidirectional LSTM) and word-level attention; the word-level attention strengthens the weights of words that are more important for relation extraction. It then adopts multi-instance learning: sentence-level attention builds a bag representation from multiple sentence representations, and the description information of the named entities is added to strengthen the bag representation.
Compared with the prior art, the invention has the following beneficial effects:
for Chinese texts with various task relations, the invention better avoids noise caused by remote supervision by adopting two levels of attention mechanisms, and adds entity description information into the Chinese texts, so that the semantic characteristics of the texts are enhanced, and better relation extraction results are obtained.
Drawings
FIG. 1 is a flow chart of a method for extracting relationships between multiple people based on a multi-level attention mechanism according to the present invention.
FIG. 2 is a diagram of a multiple persona relationship extraction network model based on a multi-level attention mechanism in accordance with the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Examples
Fig. 1 is a flowchart of a method for extracting multiple character relationships based on a multi-level attention mechanism, the method comprising the steps of:
(1) Preprocessing the collected text;
the disclosed remote supervision multiple persona relationship extraction data sets (e.g., CCKS 2019 IPER data sets) are used in this embodiment. The following operations are performed: firstly, removing English data in a text;
removing special symbols in the text, such as: emoji and hyperlinks, representing emoji as "expression", removing hyperlinks, etc.; and removing stop words in the text according to the Chinese stop word list.
(2) The method comprises the steps of adopting a remote supervision technology to carry out alignment labeling on an original personage named entity to obtain a text containing the entity and entity description information;
in the entity labeling stage, the character relationship knowledge base is finally constructed by utilizing the acquired name entry of the Chinese online hundred degrees encyclopedia to form a triplet of two characters with relationship and the relationship thereof. Pairs of entities, i.e., relationships of two entities, that appear in the text with the knowledge base are labeled as relationships in triples. The final annotated dataset had 35 relationship types.
(3) Training the Chinese word vector of the obtained text containing the entity;
in the text vectorization step of this embodiment, the word2vec method is used, chinese word segmentation is performed on the text processed by the above method using a barker word segmentation tool, word2vec training is performed using a genesim package, and the vector dimension of each word is 300.
(4) Constructing a bidirectional long-short-time memory network containing two levels of attention mechanisms, and training the constructed model to obtain a multi-classification model for extracting various character relations;
as shown in fig. 2, the network model constructed in this embodiment includes: an embedded layer, a bi-directional LSTM layer, a word level attention layer, a sentence level attention layer, and a softmax classification layer.
The neural network model constructed in this embodiment is trained on the downloaded Chinese Baidu Baike data set. The loss function is cross-entropy and the optimizer is Adam. After tuning the other model parameters, training stops after 15 epochs or when the loss has not changed for 1000 batches. The test set is then evaluated, and the relation extraction result is measured with a P-R curve, which plots the precision and recall of the results; a curve lying higher in the two-dimensional coordinate system indicates a better relation extraction effect.
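The P-R curve evaluation mentioned above can be sketched as follows: predicted relation facts are ranked by confidence, and precision/recall are computed at each cut-off.

```python
def precision_recall_points(scored, n_positive):
    """scored: list of (confidence, is_correct) predictions;
    n_positive: total number of gold relation facts.
    Returns the (precision, recall) points tracing the P-R curve."""
    points, hits = [], 0
    ranked = sorted(scored, key=lambda t: -t[0])  # descending confidence
    for k, (_, correct) in enumerate(ranked, start=1):
        hits += int(correct)
        points.append((hits / k, hits / n_positive))
    return points
```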
(5) And preprocessing the text required to be extracted, and inputting the preprocessed text into a trained model to obtain a text relation extraction result.
In the multi-character relation extraction method established above, the input text is converted into vector form by the embedding layer; the BiLSTM layer yields hidden vectors with richer features; the word-level attention mechanism gives higher weights to the words of the text that are more important; and the sentence-level attention mechanism produces a better bag representation that can suppress the noise brought by remote supervision, thereby obtaining better experimental results.
The above examples are preferred embodiments of the invention, but embodiments of the invention are not limited to them; any other change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the invention shall be regarded as an equivalent replacement and is included in the protection scope of the invention.
Claims (3)
1. A multi-character relation extraction method based on a multi-level attention mechanism is characterized by comprising the following steps:
preprocessing the collected text;
the method comprises the steps of adopting a remote supervision technology to carry out alignment labeling on an original personage named entity to obtain a text containing the entity and entity description information;
training the Chinese word vector of the obtained text containing the entity;
constructing a bidirectional long-short-time memory network containing two levels of attention mechanisms, and training the constructed model to obtain a multi-classification model for extracting various character relations;
inputting the preprocessed text to obtain a text relation extraction result;
in the step of aligning and labeling the original character named entities by remote supervision, person name entries are acquired from the Chinese online Baidu Baike encyclopedia; two related persons and their relation form a triple, from which a character relation knowledge base is finally constructed; entity pairs that appear both in the text and in the knowledge base are labeled with the relation of the corresponding triple;
in the step of constructing a bidirectional long short-term memory network containing two levels of attention mechanisms, PyTorch is used to build the BiLSTM and two-level attention network structure, wherein the first layer of the network is an embedding layer, the second layer a bidirectional LSTM layer, the third layer a word-level attention layer, the fourth layer a sentence-level attention layer, and the fifth layer a softmax classifier layer;
the input of the embedding layer is a trained word vector sequence, the length of a text sequence, namely the number of word vector sequences is set to be m, the number of word vector sequences is less than m and filled with 0, the relative position length of words in each text relative to two entities is also m, the embedding layer adopts a pre-trained word vector dimension dw and a randomly initialized position vector dimension dp, and therefore a vector sequence w= { w is obtained 1 ,w 2 ,...,w m },w i ∈R d Wherein d=dw+dp×2;
in the bidirectional LSTM layer, the unidirectional LSTM is expressed as follows, where i_t is the input gate, f_t the forget gate, c_t the cell state, o_t the output gate, h_t the hidden vector, and W_x, W_h, W_c are weights:
i_t = σ(W_xi x_t + W_hi h_(t-1) + W_ci c_(t-1) + b_i)
f_t = σ(W_xf x_t + W_hf h_(t-1) + W_cf c_(t-1) + b_f)
c_t = f_t c_(t-1) + i_t tanh(W_xc x_t + W_hc h_(t-1) + b_c)
o_t = σ(W_xo x_t + W_ho h_(t-1) + W_co c_t + b_o)
h_t = o_t tanh(c_t)
the hidden vector h_t of the bidirectional LSTM is obtained by combining the forward output h→_t and the backward output h←_t;
in the word-level attention layer, u_i denotes the relevance score of each word in a sentence, r is a random query vector, and h_i is the hidden vector of the i-th word (the i-th h_t); the specific relationship is:
u_i = h_i · r
α_i is the weight obtained by the word-level attention mechanism, calculated as follows:
α_i = exp(u_i) / Σ_j exp(u_j)
s is the vector representation of the sentence, calculated as follows:
s = Σ_i α_i h_i
the sentence-level attention layer is used for adding entity description information;
in the sentence-level attention layer, e_i denotes the matching degree of an input sentence s_i relative to the predicted relation r_k, computed with a bilinear form (A is a learned weight matrix):
e_i = s_i A r_k
β_i is the weight obtained by the sentence-level attention mechanism, calculated as follows:
β_i = exp(e_i) / Σ_j exp(e_j)
b is the vector representation of a bag, equal to the weighted sum of all sentence representations:
b = Σ_i β_i s_i
the obtained bag representation is concatenated with the description information of the entities, i.e., the category information vectors of the two entities:
b* = [b; c_e1; c_e2].
2. the method of claim 1, wherein the preprocessing comprises:
removing English data in the text;
removing emoticons and hyperlinks in the text;
removing stop words in the text according to the Chinese stop word list;
and performing Chinese word segmentation on the text subjected to the processing.
3. The method of claim 1, wherein the step of training the text for the chinese Word vector uses a distributed Word vector representation method Word2Vec, and the dimension of the output Word vector is set to 300.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911362557.9A CN111125367B (en) | 2019-12-26 | 2019-12-26 | Multi-character relation extraction method based on multi-level attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911362557.9A CN111125367B (en) | 2019-12-26 | 2019-12-26 | Multi-character relation extraction method based on multi-level attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111125367A CN111125367A (en) | 2020-05-08 |
CN111125367B true CN111125367B (en) | 2023-05-23 |
Family
ID=70502727
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911362557.9A Active CN111125367B (en) | 2019-12-26 | 2019-12-26 | Multi-character relation extraction method based on multi-level attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111125367B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111651606B (en) * | 2020-06-05 | 2024-03-01 | 深圳市慧择时代科技有限公司 | Text processing method and device and electronic equipment |
CN112101009B (en) * | 2020-09-23 | 2024-03-26 | 中国农业大学 | Method for judging similarity of red-building dream character relationship frames based on knowledge graph |
CN112560490A (en) * | 2020-12-08 | 2021-03-26 | 吉林大学 | Knowledge graph relation extraction method and device, electronic equipment and storage medium |
CN112818683A (en) * | 2021-01-26 | 2021-05-18 | 山西三友和智慧信息技术股份有限公司 | Chinese character relationship extraction method based on trigger word rule and Attention-BilSTM |
CN112926325A (en) * | 2021-02-14 | 2021-06-08 | 北京工业大学 | Chinese character relation extraction construction method based on BERT neural network |
CN113128229B (en) * | 2021-04-14 | 2023-07-18 | 河海大学 | Chinese entity relation joint extraction method |
CN113919350A (en) * | 2021-09-22 | 2022-01-11 | 上海明略人工智能(集团)有限公司 | Entity identification method, system, electronic equipment and storage medium |
CN117057345B (en) * | 2023-10-11 | 2024-01-30 | 腾讯科技(深圳)有限公司 | Role relation acquisition method and related products |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408812A (en) * | 2018-09-30 | 2019-03-01 | 北京工业大学 | A method of the sequence labelling joint based on attention mechanism extracts entity relationship |
CN110502749A (en) * | 2019-08-02 | 2019-11-26 | 中国电子科技集团公司第二十八研究所 | A kind of text Relation extraction method based on the double-deck attention mechanism Yu two-way GRU |
-
2019
- 2019-12-26 CN CN201911362557.9A patent/CN111125367B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408812A (en) * | 2018-09-30 | 2019-03-01 | 北京工业大学 | A method of the sequence labelling joint based on attention mechanism extracts entity relationship |
CN110502749A (en) * | 2019-08-02 | 2019-11-26 | 中国电子科技集团公司第二十八研究所 | A kind of text Relation extraction method based on the double-deck attention mechanism Yu two-way GRU |
Also Published As
Publication number | Publication date |
---|---|
CN111125367A (en) | 2020-05-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111125367B (en) | Multi-character relation extraction method based on multi-level attention mechanism | |
CN110502749B (en) | Text relation extraction method based on double-layer attention mechanism and bidirectional GRU | |
CN110866117B (en) | Short text classification method based on semantic enhancement and multi-level label embedding | |
CN108984526B (en) | Document theme vector extraction method based on deep learning | |
CN108363743B (en) | Intelligent problem generation method and device and computer readable storage medium | |
CN111858944B (en) | Entity aspect level emotion analysis method based on attention mechanism | |
CN110765775B (en) | Self-adaptive method for named entity recognition field fusing semantics and label differences | |
Sun et al. | Sentiment analysis for Chinese microblog based on deep neural networks with convolutional extension features | |
CN111401061A (en) | Method for identifying news opinion involved in case based on BERT and Bi L STM-Attention | |
CN107943784B (en) | Relationship extraction method based on generation of countermeasure network | |
CN110969020A (en) | CNN and attention mechanism-based Chinese named entity identification method, system and medium | |
CN108388554B (en) | Text emotion recognition system based on collaborative filtering attention mechanism | |
CN111241816A (en) | Automatic news headline generation method | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN112069831A (en) | Unreal information detection method based on BERT model and enhanced hybrid neural network | |
CN113704416B (en) | Word sense disambiguation method and device, electronic equipment and computer-readable storage medium | |
Imaduddin et al. | Word embedding comparison for indonesian language sentiment analysis | |
Suyanto | Synonyms-based augmentation to improve fake news detection using bidirectional LSTM | |
CN116932661A (en) | Event knowledge graph construction method oriented to network security | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN111145914A (en) | Method and device for determining lung cancer clinical disease library text entity | |
CN111353032B (en) | Community question and answer oriented question classification method and system | |
CN113204975A (en) | Sensitive character wind identification method based on remote supervision | |
CN117216265A (en) | Improved graph annotation meaning network news topic classification method | |
Rafi et al. | A linear sub-structure with co-variance shift for image captioning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |