CN110134954B - Named entity recognition method based on Attention mechanism - Google Patents

Named entity recognition method based on Attention mechanism

Info

Publication number
CN110134954B
CN110134954B (application CN201910371706.1A)
Authority
CN
China
Prior art keywords
word
information
character
vector
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910371706.1A
Other languages
Chinese (zh)
Other versions
CN110134954A (en)
Inventor
王丹
徐书世
赵青
杜金莲
付利华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910371706.1A
Publication of CN110134954A
Application granted
Publication of CN110134954B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A named entity recognition method based on the Attention mechanism belongs to the field of computers; it improves the accuracy of named entity recognition by introducing Chinese literal information and character position weight information through the Attention mechanism. The method comprises three modules: a similar word extraction module, a feature construction module and a classifier module, where the feature construction module comprises four sub-modules: word similarity fusion, word feature extraction, character feature extraction and feature fusion. The method processes the contextual information in named entity recognition with a bidirectional LSTM (long short-term memory network) and predicts the entity tag classes with a CRF (conditional random field).

Description

Named entity recognition method based on Attention mechanism
Technical Field
The invention relates to a named entity recognition method based on an Attention mechanism, belonging to the field of computer software.
Background
The value of knowledge for artificial intelligence is that it gives machines cognitive and comprehension abilities; the process of constructing a knowledge graph is the process by which a machine forms cognitive ability, enabling it to understand the world. A knowledge graph is a semantic network that reveals relationships between entities and can formally describe real-world things and their relationships.
In the medical field, a medical knowledge graph is the cornerstone of intelligent healthcare, and the prerequisite for constructing one is knowledge extraction; case texts are an important source of such knowledge. Manual extraction, however, would be an extremely burdensome task. Named entity recognition provides an alternative to this human labor.
The concept of the named entity was proposed at the Sixth Message Understanding Conference (MUC-6) in 1995. Named entity recognition refers to identifying entities with specific meaning in text, such as names of persons, places and institutions. When processing case texts, the entities to be recognized include patient complaints, examination means, examination results, disease names, treatment means and the like. Named entity recognition is not only of great significance to graph construction, but is also an important basic tool in application fields such as information extraction, question answering systems and machine translation.
The history of named entity recognition can be summarized simply as: from rule-based methods, to statistics-based methods, to deep learning methods combined with statistics.
The rule-based method is the earliest approach to named entity recognition. It depends on rule templates manually constructed by linguists; the selected features generally include statistical information, keywords, punctuation marks, indicator words, direction words, position words, head words and the like, and matching is mainly performed on patterns and character strings. Representative rule-based systems include the ANNIE system in the GATE project and the FACILE system that participated in MUC evaluation. In general, if the extracted rules accurately reflect the linguistic phenomena, this method can obtain results superior to statistical methods; in practice, however, it is often difficult to obtain rules that meet the requirements, and rule formulation requires much manpower and time. Moreover, this approach is language-dependent and cannot be reused. Due to these drawbacks, rule-based methods are rarely used today.
Statistics-based methods are trained on manually annotated corpora; annotation does not require the help of linguistic experts and is relatively less time-consuming. Their portability is also better than that of rule-based methods: adapting to a new domain only requires training on a new corpus. Bikel et al. first proposed an English named entity recognition method based on a hidden Markov model and obtained excellent results extracting English place names, organization names and person names from the MUC-6 test set. McCallum et al. (2003) were the first to use conditional random fields for named entity recognition, which became popular because of their simplicity, ease of implementation and good performance. Statistical machine learning methods also include the hidden Markov model, the maximum entropy model, support vector machines and the like.
As machine learning has evolved, learning-based methods have become increasingly important. Some works use a bidirectional long short-term memory network (BiLSTM) to identify and classify entities in text. A BiLSTM consists of two LSTMs, one forward and one backward; the backward LSTM reads the original sequence (a series of words) in reverse. Combining the outputs of the two LSTMs over the same sequence makes effective use of the sequence's contextual information.
Combining deep learning and statistical learning, the BiLSTM-CRF model based on a bidirectional long short-term memory network and a conditional random field was proposed: the BiLSTM encodes the sentence, and its output is fed into a CRF layer for label decoding. This method achieves good results and is widely used.
Building on the original BiLSTM-CRF, Lample et al. learn the information of the characters within each word through another BiLSTM, and then concatenate the word information learned by the former with the character information obtained by the latter as the input of the final CRF layer, obtaining better results.
Rei et al. add an attention mechanism on top of Lample et al.'s model to improve how the word information and character information produced by the two BiLSTMs are combined. The original method is simple concatenation; Rei et al. instead use a two-layer neural network to learn attention weights and use them to compute a weighted sum of the word vector and the character vector, dynamically exploiting both kinds of information. The resulting experimental results are superior to those of the original concatenation method.
Bharadwaj et al. add phonological features and an attention mechanism on top of Lample et al.'s model. In contrast to the weighted summation proposed by Rei et al., their method applies attention over the character vectors to learn to focus on the more informative characters.
Named entity recognition is no longer a major research topic today, as part of academia considers it a solved problem. However, it achieves good results only on limited text types, mainly news corpora, for recognizing names, places and the like, and has been studied far less in other fields such as medical text. Moreover, the portability of named entity recognition is poor: most systems are tightly coupled to their domain, entities in different domains have different internal characteristics, and it is difficult to describe the entities of all domains with a unified model. To this end, a named entity recognition model based on Chinese character information for Chinese electronic medical records is disclosed.
Disclosure of Invention
The invention is summarized as follows:
a named entity recognition model based on an Attention mechanism is provided, and the model is suitable for recognizing entities in Chinese documents.
The invention introduces Chinese literal (character-level) information and character position weight information through the Attention mechanism. When performing named entity recognition on text, many similar words are encountered, for example two Chinese spellings of "chronic obstructive pulmonary disease" that differ by a single character: although they are not the identical word, they denote the same entity. Introducing Chinese literal information allows such similar words to be handled; it also alleviates the out-of-vocabulary (OOV) problem, since an entity unseen during training can be resolved through its similar words. Introducing character position weight information emphasizes the more important characters within a word.
The invention comprises three parts: a similar word extraction module, a feature construction module and a classifier module. The feature construction module comprises four sub-modules: word similarity fusion, word feature extraction, character feature extraction and feature fusion.
The similar word extraction module processes all words in the word stock before any information is input into the network; the word most similar to each word is found through a word similarity algorithm.
The word similarity fusion module handles similar words: before word information is processed, it uses an attention mechanism to combine each input word vector with the word vector of its most similar word, and its output serves as the input of the word feature extraction module. The attention mechanism finds the most relevant parts between the two.
The input of the feature fusion module is the output of the word feature extraction module and the character feature extraction module. This module combines the attention mechanism with character position weight information, and its output serves as the input of the classifier module. The attention mechanism finds the most valuable characters within each word.
Drawings
FIG. 1 is a diagram of an overall architecture
FIG. 2 is a flow chart of a method
FIG. 3 is a word information processing diagram
FIG. 4 is a character information processing diagram
FIG. 5 is a feature fusion diagram
Detailed Description
For Chinese documents, entities are labeled with the "BIESO" tagging scheme, where "B" denotes the beginning of an entity, "I" the middle of an entity, "E" the end of an entity, "S" a single-token entity, and "O" non-entity content.
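As a minimal illustration of the BIESO scheme, the following Python snippet tags a hypothetical sentence in which the final two-character word is an entity (the sentence and span are illustrative examples, not taken from the patent):

```python
# BIESO tagging: one tag per character of the Chinese sentence.
sentence = list("患者主诉头痛")  # "the patient complains of headache"
tags = ["O", "O", "O", "O", "B", "E"]  # "头痛" (headache): B = begin, E = end
assert len(sentence) == len(tags)
for character, tag in zip(sentence, tags):
    print(character, tag)
```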
The overall framework of the scheme is shown in FIG. 1 and is divided into (1) a similar word extraction module, (2) a feature construction module and (3) a classifier module. A specific flow chart of the scheme is shown in FIG. 2.
The scheme is based on the BiLSTM-CRF framework popular in current named entity recognition and extends it. The overall pipeline has four stages: preprocessing the data; combining Chinese literal information with the word vectors through an attention mechanism to handle synonyms; obtaining word information and character information through BiLSTM; and classifying with a CRF. These four stages are described separately below.
(1) Similar word extraction module
Let W = {W_i | 1 ≤ i ≤ n} (n is the set size, W_i a term in the set) be the set of all terms. Then for each term W_i, the term with the greatest similarity to it is:

W_j = argmax_{1 ≤ j ≤ n, j ≠ i} commonSimilarity(W_i, W_j)

where commonSimilarity(W_i, W_j) computes the similarity between W_i and W_j, defined as follows. Let the lengths of W_i and W_j be N and M respectively, let the length of their common part be S, and let the positions of the common part within the two words be ctrls and keys respectively. Then:

commonSimilarity(W_i, W_j) = (min(N, M) / max(N, M)) · Σ_{s=1..S} word(ctrls(s), N) · word(keys(s), M)

where min(N, M) / max(N, M) is the ratio of the lengths of the two words and is always less than or equal to 1, and word(x, L) denotes the weight of the grapheme at position x within a word of length L (its form is given in section (24)).
(21) Word similarity fusion
The Chinese literal information is introduced through an attention mechanism before the input information enters the network. Let the set of word vectors of the input sentence be V = {V_k | 1 ≤ k ≤ L} (L is the sentence length; each word vector has size H), and let the set of most-similar-word vectors corresponding to each word vector be SV = {SV_k | 1 ≤ k ≤ L} (each of size H).

The attention mechanism formula is:

Attention(Query, Key) = softmax(Query · Key^T)

where Query is SV, and Key and Value are both V. Multiplying SV by the transpose of V gives the similarity between each vector in SV and each vector in V; softmax then assigns a weight, namely the Attention.

After the Attention is obtained, it is substituted into the following formula:

O(Attention, Value) = Attention · Value

which gives the input O of the downstream network, of shape (L, H), where L is the sentence length and H is the word vector size; Value in the formula denotes the word vector set V.
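A minimal numpy sketch of this fusion step, assuming SV and V are given as (L, H) matrices of similar-word vectors and sentence word vectors:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_similar_words(SV, V):
    attention = softmax(SV @ V.T, axis=-1)  # (L, L): weights over V for each similar-word vector
    return attention @ V                    # O, shape (L, H): input of the word BiLSTM

L, H = 6, 100
V, SV = np.random.randn(L, H), np.random.randn(L, H)
O = fuse_similar_words(SV, V)
assert O.shape == (L, H)
```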
(22) Word feature extraction
This module uses a BiLSTM (bidirectional long short-term memory network) to process the sentence and extract word features. The BiLSTM structure for processing the sentence is shown in FIG. 3. The actual input to the word information network is the output vector O obtained in section (21).
As shown in the figure, the sentence is input in the forward direction into one LSTM to obtain forward information and in the reverse direction into another LSTM to obtain backward information; the two are then concatenated to obtain the contextual information of the sentence.
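A minimal PyTorch sketch of this sub-module; the hidden size and the batch of one sentence are illustrative assumptions:

```python
import torch
import torch.nn as nn

H, hidden = 100, 128
word_bilstm = nn.LSTM(input_size=H, hidden_size=hidden,
                      bidirectional=True, batch_first=True)

O = torch.randn(1, 6, H)   # one sentence of L = 6 fused word vectors from section (21)
WI, _ = word_bilstm(O)     # (1, L, 2*hidden): forward and backward states concatenated
```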
(23) Character feature extraction
The BiLSTM structure for processing characters is shown in FIG. 4. The input is formed by the characters of the actual words corresponding to the vector O obtained in section (21): the character vectors of each word are input into two LSTMs, one in the forward and one in the reverse direction, to obtain forward and backward information, which are then concatenated to obtain the contextual information.
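A matching sketch for the character level, running the characters of a single word through their own BiLSTM (the character embedding size is an illustrative assumption):

```python
import torch
import torch.nn as nn

char_dim, hidden = 50, 128
char_bilstm = nn.LSTM(input_size=char_dim, hidden_size=hidden,
                      bidirectional=True, batch_first=True)

word_chars = torch.randn(1, 4, char_dim)  # one word of 4 character vectors
CI_p, _ = char_bilstm(word_chars)         # (1, 4, 2*hidden): one context vector per character
```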
(24) Feature fusion
This section is shown in FIG. 5. The word information obtained in (22) and the character information obtained in (23) serve as the inputs of this module, and the output is the fusion of the two.
If word information and character information were simply concatenated as the input of the CRF, the character information of each word could not be well exploited; an attention mechanism is therefore used to process them, and character position weights are introduced to better select the more valuable character information.
Through (22), the word information set WI = {WI_p | 1 ≤ p ≤ L} representing one sentence is obtained (L is the sentence length), and through (23) the corresponding character information set CI = {CI_pq | 1 ≤ p ≤ L} is obtained, where CI_pq denotes the q-th character information vector in the p-th word.
The attention mechanism formula used in this section is as follows:

Attention(WI_p, CI_p) = softmax((WI_p · CI_p^T) · Weight_p) · CI_p

The inputs are the information vector of one word in the sentence and the corresponding set of character information vectors. WI_p has shape (1, H) and CI_p has shape (length_p, H), where length_p is the number of characters of the original word corresponding to the p-th word information vector. Here the word vector is treated as the Query and the character vectors as the Keys and Values; the subscript p denotes the p-th vector in the sentence. Inside the softmax, the similarity between the Query and each Key, obtained through matrix multiplication, is multiplied by the position weight information. After the softmax, the weight of each character within the word is obtained, and the final attention information is obtained by weighted summation, where the matrix multiplication by CI_p is equivalent to the weighted summation.
Weight_p in the formula is the set of character position weights, with shape (1, length_p), composed as:

Weight_p = (word(1, length_p), word(2, length_p), ..., word(length_p, length_p))

For each character in a word, let its position in the word be q; the character position weight is then:

word(q, length_p) = (length_p - q + 1) / Σ_{c=1..length_p} c

where c ranges over the items 1 to length_p of the denominator sum.
After the attention is obtained, it is substituted into the formula:

Combine_p = Attention(WI_p, CI_p) · WI_p

which gives the output result Combine of this section, where Combine_p, the p-th item of Combine, has shape (1, H), and Combine has shape (L, H).
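A minimal numpy sketch of this fusion. Two details are assumptions: the position weight follows the reconstruction above, and the final product of the attention information with WI_p is taken elementwise, since the "·" in the Combine_p formula is ambiguous in this extraction:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def position_weights(length_p):
    denom = length_p * (length_p + 1) / 2  # sum of 1..length_p
    return np.array([(length_p - q + 1) / denom for q in range(1, length_p + 1)])

def fuse(WI_p, CI_p):
    weight_p = position_weights(CI_p.shape[0])               # Weight_p, shape (length_p,)
    attention = softmax((WI_p @ CI_p.T).ravel() * weight_p)  # weight of each character
    att_info = attention @ CI_p                              # weighted sum, shape (H,)
    return att_info * WI_p.ravel()                           # Combine_p (elementwise, assumed)

H = 100
WI_p, CI_p = np.random.randn(1, H), np.random.randn(4, H)
assert fuse(WI_p, CI_p).shape == (H,)
```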
(3) Classifier module
The classifier uses a conditional random field (CRF); the output information Combine obtained in section (24) serves as the input of the CRF, which generates the classification label results.
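A minimal sketch of the classifier using the third-party pytorch-crf package; the patent only specifies a conditional random field, so the package choice and the tag count (four BIESO tags per entity category plus O, taking the five categories named in the background) are assumptions:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

num_tags = 5 * 4 + 1  # B/I/E/S for each of five entity categories, plus O
H, L = 100, 6
emit = nn.Linear(H, num_tags)  # projects Combine to per-tag emission scores
crf = CRF(num_tags, batch_first=True)

combine = torch.randn(1, L, H)             # output Combine of the feature fusion module
emissions = emit(combine)                  # (1, L, num_tags)
tags = torch.randint(0, num_tags, (1, L))  # gold labels during training
loss = -crf(emissions, tags)               # negative log-likelihood for training
pred = crf.decode(emissions)               # best BIESO tag sequence at inference
```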

Claims (1)

1. A named entity recognition method based on an Attention mechanism, characterized by comprising three modules: a similar word extraction module, a feature construction module and a classifier module;
(1) Similar word extraction module
Before any information is input into the network, all words in the word stock are processed, and the word most similar to each word is found through a word similarity algorithm;
(2) Feature construction module
The feature construction module comprises four sub-modules of word similarity fusion, word feature extraction, character feature extraction and feature fusion;
(3) Classifier module
The classifier module adopts a Conditional Random Field (CRF), takes output information Combine obtained in the feature fusion sub-module (24) as input of the CRF and generates a classification label result;
The similar word extraction module (1) and the feature construction module (2) are specifically as follows:
(1) Similar word extraction module
The similar word extraction module is a processing step performed before the information is input into the network; its purpose is to find, through a word similarity algorithm, the word with the highest similarity to each word. Let W = {W_i | 1 ≤ i ≤ n} (n is the set size, W_i a term in the set) be the set of all terms; then for each term W_i therein, the target W_j is:

W_j = argmax_{1 ≤ j ≤ n, j ≠ i} commonSimilarity(W_i, W_j)

where commonSimilarity(W_i, W_j) computes the similarity between W_i and W_j. Let the lengths of W_i and W_j be N and M respectively, let the length of their common part be S, and let the positions of the common part within the two words be ctrls and keys respectively; the similarity commonSimilarity(W_i, W_j) is then:

commonSimilarity(W_i, W_j) = (min(N, M) / max(N, M)) · Σ_{s=1..S} word(ctrls(s), N) · word(keys(s), M)

where min(N, M) / max(N, M) is the ratio of the lengths of the two words and is always less than or equal to 1, min(N, M) is the smaller of N and M, max(N, M) is the larger of N and M, word(x, L) represents the weight of the grapheme at position x within a word of length L, word(ctrls(s), N) represents the weight of the s-th element of ctrls, and word(keys(s), M) represents the weight of the s-th element of keys;
(2) Feature construction module
(21) Word similarity fusion submodule
After the similar words of each word are obtained, Chinese literal information is introduced through an attention mechanism, and the obtained result serves as the input of the word feature extraction sub-module (22);
Let the word vector set of the input sentence be V = {V_k | 1 ≤ k ≤ L} (L is the sentence length; each word vector has size H), and let the set of most-similar-word vectors corresponding to each word vector be SV = {SV_k | 1 ≤ k ≤ L} (each of size H);
the attention mechanism formula is as follows:

Attention(Query, Key) = softmax(Query · Key^T)

where Query is SV, and Key and Value are both V; multiplying SV by the transpose of V gives the similarity between each vector in SV and each vector in V, and softmax then assigns a weight to the vectors in SV, namely the Attention;
after the attention weight is obtained, it is substituted into the following formula:

O(Attention, Value) = Attention · Value

which gives the input O of the downstream network, of shape (L, H), where L is the sentence length, H is the word vector size, and Value in the formula denotes the word vector set V;
(22) Word feature extraction submodule
The word feature extraction sub-module uses a BiLSTM (bidirectional long short-term memory network) to process the sentence and extract word features; the actual input of the word information network is the output vector O obtained by the word similarity fusion sub-module (21); the sentence is input in the forward direction into one LSTM to obtain forward information and in the reverse direction into another LSTM to obtain backward information, and the two are then concatenated to obtain the contextual information of the sentence;
(23) Character feature extraction submodule
The character feature extraction sub-module uses a BiLSTM (bidirectional long short-term memory network) to process the characters of each word; its input is formed by the characters of the actual words corresponding to the output vector O obtained by the word similarity fusion sub-module (21); the character vectors of each word are input into two LSTMs, one in the forward and one in the reverse direction, to obtain forward and backward information, which are then concatenated to obtain the contextual information;
(24) Feature fusion submodule
The feature fusion sub-module uses an attention mechanism to process the word information and character information, and introduces character position weights to better select the more valuable character information;
the word information set WI (WI) representing a sentence can be obtained by the word feature extraction sub-module (22) p P is more than or equal to 1 and less than or equal to L, L is the sentence length), and the corresponding character information set CI (CI) can be obtained through the character feature extraction submodule (23) p P is more than or equal to 1 and less than or equal to L, wherein L is the sentence length);
The attention mechanism formula used by the feature fusion sub-module (24) is as follows:

Attention(WI_p, CI_p) = softmax((WI_p · CI_p^T) · Weight_p) · CI_p

In the formula, the inputs are the information vector of one word in the sentence and the corresponding set of character information vectors; WI_p has shape (1, H) and CI_p has shape (length_p, H), where length_p is the number of characters of the original word corresponding to the p-th word information vector; Attention(WI_p, CI_p) is the attention obtained after processing WI_p and CI_p; here the word vector is treated as the Query and the character vectors as the Keys and Values; the subscript p denotes the p-th vector in the sentence; inside the softmax function, the similarity between the Query and each Key, obtained through matrix multiplication, is multiplied by the position weight information; after the softmax processing, the weight of each character within the word is obtained, and the final attention information is obtained by weighted summation, the matrix multiplication in the formula being equivalent to the weighted summation;
Weight_p in the formula is the set of character position weights, with shape (1, length_p), composed as follows:

Weight_p = (word(1, length_p), word(2, length_p), ..., word(length_p, length_p))

for each character in a word, let its position in the word be q; the character position weight of the character is then:

word(q, length_p) = (length_p - q + 1) / Σ_{c=1..length_p} c

where c ranges over the items 1 to length_p of the denominator sum;
after attention is given, the formula is: combine p =Attention(WI p ,CI p )·WI p Obtaining (24) an output result Combine of the feature fusion submodule, wherein Combine is obtained p The p-th item in Combine has a structure of (1, H), and the Combine has a structure of (L, H).
CN201910371706.1A 2019-05-06 2019-05-06 Named entity recognition method based on Attention mechanism Active CN110134954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910371706.1A CN110134954B (en) 2019-05-06 2019-05-06 Named entity recognition method based on Attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910371706.1A CN110134954B (en) 2019-05-06 2019-05-06 Named entity recognition method based on Attention mechanism

Publications (2)

Publication Number Publication Date
CN110134954A CN110134954A (en) 2019-08-16
CN110134954B true CN110134954B (en) 2023-12-22

Family

ID=67576477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910371706.1A Active CN110134954B (en) 2019-05-06 2019-05-06 Named entity recognition method based on Attention mechanism

Country Status (1)

Country Link
CN (1) CN110134954B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110597970B (en) * 2019-08-19 2023-04-07 华东理工大学 Multi-granularity medical entity joint identification method and device
CN110688838B (en) * 2019-10-08 2023-07-18 北京金山数字娱乐科技有限公司 Idiom synonym list generation method and device
CN110866399B (en) * 2019-10-24 2023-05-02 同济大学 Chinese short text entity recognition and disambiguation method based on enhanced character vector
CN110825875B (en) * 2019-11-01 2022-12-06 科大讯飞股份有限公司 Text entity type identification method and device, electronic equipment and storage medium
CN111209738B (en) * 2019-12-31 2021-03-26 浙江大学 Multi-task named entity recognition method combining text classification
CN111310470B (en) * 2020-01-17 2021-11-19 西安交通大学 Chinese named entity recognition method fusing word and word features
CN111274794B (en) * 2020-01-19 2022-03-18 浙江大学 Synonym expansion method based on transmission
CN112699683A (en) * 2020-12-31 2021-04-23 大唐融合通信股份有限公司 Named entity identification method and device fusing neural network and rule
CN112989807B (en) * 2021-03-11 2021-11-23 重庆理工大学 Long digital entity extraction method based on continuous digital compression coding

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644014A (en) * 2017-09-25 2018-01-30 南京安链数据科技有限公司 A kind of name entity recognition method based on two-way LSTM and CRF
WO2018028077A1 (en) * 2016-08-11 2018-02-15 中兴通讯股份有限公司 Deep learning based method and device for chinese semantics analysis
CN108628823A (en) * 2018-03-14 2018-10-09 中山大学 In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian name entity recognition method neural network based and its identifying system


Also Published As

Publication number Publication date
CN110134954A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
CN110134954B (en) Named entity recognition method based on Attention mechanism
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
CN106980683B (en) Blog text abstract generating method based on deep learning
CN107291795B (en) Text classification method combining dynamic word embedding and part-of-speech tagging
CN109255118B (en) Keyword extraction method and device
US20230195773A1 (en) Text classification method, apparatus and computer-readable storage medium
CN110851599B (en) Automatic scoring method for Chinese composition and teaching assistance system
CN104794169B (en) A kind of subject terminology extraction method and system based on sequence labelling model
CN106980608A (en) A kind of Chinese electronic health record participle and name entity recognition method and system
CN110287323B (en) Target-oriented emotion classification method
CN109753567A (en) A kind of file classification method of combination title and text attention mechanism
CN110427608B (en) Chinese word vector representation learning method introducing layered shape-sound characteristics
CN106202068A (en) The machine translation method of semantic vector based on multi-lingual parallel corpora
WO2021082086A1 (en) Machine reading method, system, device, and storage medium
CN111710428B (en) Biomedical text representation method for modeling global and local context interaction
CN108345583A (en) Event recognition and sorting technique based on multi-lingual attention mechanism and device
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114239612A (en) Multi-modal neural machine translation method, computer equipment and storage medium
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN112784602A (en) News emotion entity extraction method based on remote supervision
CN113836306B (en) Composition automatic evaluation method, device and storage medium based on chapter component identification
CN111581364A (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN111767720B (en) Title generation method, computer and readable storage medium
CN112507717A (en) Medical field entity classification method fusing entity keyword features
Sun et al. Text sentiment analysis based on CNN-BiLSTM-attention model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant