CN107908614A

CN107908614A - A kind of name entity recognition method based on Bi LSTM

Info

Publication number: CN107908614A
Application number: CN201710946713.0A
Authority: CN
Inventors: 岳永鹏; 唐华阳
Original assignee: Beijing Future Information Technology Co Ltd
Current assignee: Beijing Future Information Technology Co Ltd
Priority date: 2017-10-12
Filing date: 2017-10-12
Publication date: 2018-04-13

Abstract

The present invention relates to a kind of name entity recognition method based on Bi LSTM.This method includes：1) training corpus for naming Entity recognition is labeled, forms mark language material；2) word marked in language material and character are converted into vector；3) Named Entity Extraction Model based on Bi LSTM is established using the vector of word and character, and trains the parameter of the Named Entity Extraction Model；4) trained Named Entity Extraction Model is utilized, data to be predicted are named with Entity recognition prediction.The present invention uses the word-based and vector of character, the problem of can obtaining the feature of character and word at the same time, while can also evade unregistered word；In addition using CRF model algorithms pure compared to tradition two-way shot and long term Memory Neural Networks Bi LSTM, it can absorb more characters and word feature, so as to further lift the precision of Entity recognition.

Description

A kind of name entity recognition method based on Bi-LSTM

Technical field

The invention belongs to information technology field, and in particular to a kind of name entity recognition method based on Bi-LSTM.

Background technology

Name Entity recognition (Named Entity Recognition, abbreviation NER) refers to having in identification text The entity of certain sense, mainly including name, place name, mechanism name, proper noun etc..

Naming the scene of putting into practice of the recognition methods of entity includes：

Scene 1：Event detection.Place, time, personage are several basic composition parts of time, in plucking for structure event When wanting, related person, place, unit etc. can be protruded.In event search system, relevant personage, time, place can be made For indexing key words.Relation between several composition parts of event, event has been described in more detail from semantic level.

Scene 2：Information retrieval.Name entity can be used for improving and improving the effect of searching system, when user's input " weight When greatly ", it can be found that what user more thought retrieval is " University Of Chongqing ", rather than its corresponding adjective implication.In addition, establishing When inverted index, if name entity is cut into multiple words, it will cause search efficiency to reduce.In addition, search engine Develop to the direction of semantic understanding, calculating answer.

Scene 3：Semantic network.Concept and example and its corresponding relation, such as " country " are generally comprised in semantic network It is a concept, China is an example, and " China " is the relation between one " country " expression entity and concept.Semantic network In example have be greatly name entity.

Scene 4：Machine translation.The translation of name entity often has some special translation rules, such as Chinese people's translation To be represented into during English using the phonetic of name, it is famous in the posterior rule of preceding surname, and common word will translate into correspondence English word.The name entity in text is recognized accurately, has important meaning to the effect for improving machine translation.

Scene 5：Question answering system.Accurately identify that each part to go wrong is especially important, the association area of problem, Related notion.At present, most of question answering system can only all search for answer, and cannot calculate answer.Search for answer and carry out keyword Matching, user manually extracts answer according to search result, and more friendly mode is that answer is calculated to be presented to user. Some problem needs to consider the relation between entity, such as " the 45th, U.S. president " in question answering system, at present Search engine can return to answer " Donald Trump " in a particular format.

Traditional name entity recognition method can be divided into the name entity recognition method based on dictionary, based on word frequency statistics Method and method based on artificial nerve network model.Name entity recognition method based on dictionary, its principle are：Will be to the greatest extent In the more different classes of entity vocabulary income dictionary of amount, when identification, is matched text message with the word in dictionary, That mixes is then labeled as corresponding entity class.Method based on word frequency statistics, such as CRF (condition random field), its principle be Learn the semantic information to preceding the latter word, then make classification and judge.

Name Entity recognition based on dictionary depends critically upon dictionary, it is impossible to identifies unregistered word.United based on word frequency It can only associate the semanteme of the previous word of current word by the HMM (hidden Markov) of meter and CRF (condition random field) method, identification Precision is not high enough, and the discrimination of especially unregistered word is relatively low.Method based on artificial nerve network model, exists in training Gradient disappearance problem, and the network number of plies is few in actual application, it is final to name Entity recognition result advantage unobvious.

The content of the invention

The present invention is based on Bi-LSTM (Bi-directional Long Short-Term in view of the above-mentioned problems, providing one kind Memory, two-way shot and long term Memory Neural Networks) name entity recognition method, can effectively improve name Entity recognition essence Degree.

In the present invention, posting term refers to being already present in the word in vocabulary, and unregistered word refers to not appearing in word Word in table.

The technical solution adopted by the present invention is as follows：

A kind of name entity recognition method based on Bi-LSTM, it is characterised in that comprise the following steps：

1) training corpus for naming Entity recognition is labeled, forms mark language material；

2) word marked in language material and character are converted into vector；

3) Named Entity Extraction Model based on Bi-LSTM is established using the vector of word and character, and the training name is real The parameter of body identification model；

4) trained Named Entity Extraction Model is utilized, data to be predicted are named with Entity recognition prediction.

Further, step 1) is labeled training corpus in the way of IOBES.

Further, the word of input is converted into vector by step 2) first, then carries out each character in word Disassemble, all characters for being included word with Bi-LSTM models are converted into vector, and the vector converted to word and character is spelled Connect.

Further, step 3) trains the parameter of Named Entity Extraction Model using Adam gradient descent algorithms.

Further, training corpus is carried out subordinate sentence by step 3) during training parameter according to Chinese syntactic rule Processing, and the sentence data 0 for being less than neuron number to character length after subordinate sentence are filled.

Further, it is random without in the slave training corpus data set put back to every time in the iteration of Adam gradient descent algorithms Random one sentence packet of selection, iterative data of some sentences as model single is extracted from sentence packet.

Further, step 4) pre-processes data to be predicted first, then into the vectorization of line character and word Processing, is then named Entity recognition prediction.

Further, the pretreatment includes subordinate sentence processing and word segmentation processing；The vectorization processing includes term vector Processing, character vectorization processing, and term vector, character vector are spliced.

Name entity recognition method of the invention based on Bi-LSTM, using word-based and character vector, can obtain at the same time The problem of obtaining the feature of character and word, while unregistered word can also be evaded；In addition two-way shot and long term Memory Neural Networks are used CRF model algorithms pure compared to tradition Bi-LSTM, it can absorb more characters and word feature, so as to more into one The precision of the lifting Entity recognition of step.

Brief description of the drawings

The step flow chart of Fig. 1 Bi-LSTM entity recognition methods of the present invention.

Fig. 2 Bi-LSTM entity recognition models schematic diagrames of the present invention.

Fig. 3 .LSTM cell schematics.

Fig. 4 .Bi-LSTM character vector structure charts.

Embodiment

In order to make the foregoing objectives, features and advantages of the present invention clearer and more comprehensible, below by specific implementation case And with reference to attached drawing, the present invention is described in further details.

The invention discloses a kind of name entity recognition method based on Bi-LSTM, for example know from non-structured text Do not go out name, place name, mechanism name, trade (brand) name, exabyte etc..The invention solves key problem include two：1. make The precision of name Entity recognition is improved with LSTM-CRF models；2. adding the feature of the character vector of word, solve to unregistered word Name the identification (Out of Vocabulary, OV) of entity.

In order to improve the precision of name Entity recognition, we add Bi-LSTM character features on traditional CRF models Add word characteristic layer with Bi-LSTM characters, its detailed structure is as shown in Figure 2, Figure 3 and Figure 4.

Because entity to be identified is all to be not logged in vocabulary under many situations, in order to improve the knowledge to unregistered word Not, present invention adds the feature extraction of the character vector shown in Fig. 4, so as to illustrate that an entity is not only deposited with word segmentation result In much relations, while also name " Zhao, money, grandson, Lee .. ... " work of the character with itself there are much relations, such as China When appearance for first character, there is a strong possibility is exactly a name for its word combination result closely followed.

The name entity recognition method flow of the present invention is as shown in Figure 1.This method is divided into two stages：Training stage, in advance The survey stage.

(1) training stage：(" training " in Fig. 1)

Step 1：Language material is marked to prepare.

Step 2：Character and term vector.

Step 3：Bi-LSTM entity recognition models are built.

Step 4：Model parameter is trained.

Step 5：Model result preserves.

(2) forecast period：(" prediction " in Fig. 1)

Step 1：Data prediction.

Step 2：Character and term vector.

Step 3：The model preserved using the step 4 of training stage (one) does prediction data Entity recognition prediction.

The specific implementation process in two stages is specifically described below.

(1) training stage：

Step 1：Language material is marked to prepare.

Training language of the language material in the way of IOBES (Inside, Other, Begin, End, Single) to Entity recognition Material is labeled and (can also be labeled using other manner, such as replaced with 0,1,2,3,4).If it is to a participle unit One single entity, then be labeled as (tag S- ...)；Start if a participle unit is an entity, be labeled as (tag B-…)；If a participle unit is vocabulary among an entity, it is labeled as (tag I- ...)；An if participle unit It is the end of an entity, then is labeled as (tag E- ...)；If a participle unit is not an entity, (tag is labeled as O).For example " Xiao Ming is born in Yunnan, knows wound space work in Sichuan Province China province Chengdu now.", with most common in entity Exemplified by name (PER), place name (LOC) and mechanism name (ORG), it is segmented and the result of corpus labeling is：

Xiao Ming S-PER

Be born O

In O

Yunnan S-LOC

, O

Present O

In O

Chinese B-LOC

Sichuan Province I-LOC

Chengdu E-LOC

Know B-ORG

Create space E-ORG

Work O

。O

Step 2：The vectorization of character and word.

Step 2-1：Term vector.

Because computer is only capable of calculating the type of numeric type, and the word x inputted is character type, and computer cannot be direct Calculate, it is therefore desirable to which word is converted into numerical value vector.Word is converted into by a vector, this hair using known word2vec herein The bright vector for word being converted into 300 dimensions.

Step 2-2：Character vector.

Carry out character vector conversion according to Bi-LSTM models shown in Fig. 4, word be converted into vector first, for example, by word " in State " be split as two characters " in " and " state ".A numerical value ID is then translated into, to LSTM neurons before being input to Structure, in forward direction LSTM neurons the output of i-th of neuron be finally aggregated into as the input of i+1 neuron The numerical value vector that one dimension is 64.On the other hand, character vector is input to consequent LSTM neural units, and it is rear to In the transmittance process of LSTM neuron neurons, the output of i+1 neuron as i-th of neuron input, finally Also the numerical value vector for being 64 for a dimension is collected.Then, the numerical value vector that forward and backward collects is done and spliced, be spliced into one A length is the vector of 128 dimensions.Such as the term vector " U.S. " obtained in step 2-1, " U.S. " and " state " two characters can be led to Cross the vector that two-way Bi-LSTM models are translated into one 128 dimension.

Step 2-3：Splicing.

Two vectors that step 2-1 and 2-2 are obtained are spliced, for example can obtain a 300+ to participle " U.S. " The vector of 128=428 dimensions.

Step 3：Establish the model of Bi-LSTM Entity recognitions.

Framework according to the Bi-LSTM entity recognition models of Fig. 2 builds the model of Entity recognition, the word that step 2 is spliced Symbol and term vector are input in first layer LSTM neuron elements (with " I am Chinese." exemplified by, as shown in Figure 2), while the The output of one layer of LSTM, i-th of the LSTM unit input as first layer LSTM i+1 LSTM units at the same time.Meanwhile it will walk The character and term vector of rapid 2 splicing are input in second layer LSTM neuron elements, while second layer LSTM i+1s LSTM is mono- The output of the member input as i-th of LSTM unit of first layer LSTM at the same time.Then by each neural unit of two-way LSTM Input of the output as sequence labelling MODEL C RF, every input character x so as to calculate_iThe y calculated by above-mentioned model_i。 And set the result of real marking in language material asConstruct a loss function L based on entropy：

Wherein, n represents training samples number.Then, this loss function L is converted into an optimization problem by the present invention, Solve：

" O " in the CRF Layer of Fig. 2 represents non-physical type, and Loc represents place name entity.

Fig. 3 is shown in detailed LSTM units description, wherein the implication of each symbol is described as follows：

w：Parameter list to be solved.

C_i-1, C_i：The semanteme that the semantic information and preceding i character that i-1 character is accumulated before representing respectively are accumulated Information.

h_i-1, h_i：The characteristic information of the characteristic information of the i-th -1 character of expression and i-th of character respectively.

f：Forget door, the accumulation semantic information (C for i-1 character before controlling_i-1) retain how much.

i：Input gate, for control input data (w and h_i-1) retain how much.

o：Out gate, how many characteristic information exported in the feature of i-th of character of output for controlling.

tanh：Hyperbolic tangent function.

u:tanh：With controlling i-th of character how many characteristic information to be retained in C together with input gate i_i-1In.

* ,+：Represent that step-by-step carries out multiplication and step-by-step carries out addition respectively.

Step 4：The training of model parameter.

For the parameter w in solving-optimizing function L, adopted in the present invention in known Adam gradient descent algorithms training L Parameter.During training parameter, include following several key issues：

Step 4-1：Subordinate sentence.

Training corpus is subjected to subordinate sentence processing according to Chinese syntactic rule.If l_iRepresent the sentence length of the i-th word, then will |l_i-l_j| the sentence of ＜ δ is included into one group, and wherein δ represents sentence length interval, if the data after packet are GroupData, one M groups are set to altogether.

Step 4-2：Input data is filled.

Because the neuron elements of its input data of the Bi-LSTM Entity recognition structural models of Fig. 2 are regular lengths, right Character length needs to be filled with data 0 less than the sentence of Bi-LSTM entity recognition model neuron numbers after subordinate sentence.

Step 4-3：The selection of iteration batch data.

In the iteration of Adam gradient descent algorithms the present invention every time it is random without in the slave training corpus data set put back to One sentence packet of selection of machine, extracts iterative data of the BatchSize sentence as model single from sentence packet (numerical value of BatchSize can be selected arbitrarily).

Step 4-4：The end condition of iteration.

In the selection of the model end condition of the parameter during Adam gradient descent algorithms train L, the present invention is provided with two A end condition：1) maximum iterations Max_Iteration；2) penalty values iteration changes | L_i-L_i+1| ＜ ε.Wherein ε tables Show an acceptable error range.

Step 5：Model result preserves.

Trained model parameter preserves during finally step 1-4 is walked, so that forecast period uses these parameters.

(2) forecast period：

Step 1：Data prediction.

The present invention mainly includes two steps to the data prediction of the forecast period of Entity recognition：

Step 1-1:Subordinate sentence.

One section of word of peer entities identification, do subordinate sentence processing first.Such as " Xiao Ming is born in China, he is Chinese, he Love China.Xiao Li is born in the U.S., he is American, he also likes China." according to the subordinate sentence result of Chinese grammar be：

First：Xiao Ming is born in China, and I am Chinese, I likes China.

Second：Xiao Li is born in the U.S., I is American, I also likes China.

Step 1-2：Participle.

The result of the subordinate sentence of step 1-1 is segmented, the present invention is based on dictionary+HMM (hidden horses in participle known to Er Kefu) jieba (stammerer) participles of identification of the model to unregistered word segment it.The word segmentation result of step 1-1 is：

First：Xiao Ming is born in China, he is Chinese, he likes China.

Second：Xiao Li is born in the U.S., he is American, he also likes China.

Step 2：Character and term vector.

Character and term vector can be split as three following steps：

Step 2-1：Term vector.

Word is converted into a vector, such as 1-2 points of step by the word segmentation result of step 1-2 using known word2vec " U.S. " in word result, first word2vec can be converted to " U.S. " vector of one 300 dimension.

Step 2-2：Character vector.

" U.S. " carried out according to Bi-LSTM models shown in Fig. 4 in character vector conversion, such as step 2-1 can be " U.S. " " state " two characters are translated into the vector of one 128 dimension by LSTM models.

Step 2-3：Splicing.

Step 3：Entity recognition is predicted.

The step 2-3 vector datas spliced are input in the model of (one) training stage step 5 preservation.Obtain every The prediction result of one input data.Sentence subordinate sentence and the input data filling to input are also needed during prediction Operation, the prediction process of Entity recognition is just completed to this.Prediction result to word segmentation result in step 1-2 is

First：Xiao Ming/S-PER births/O /O China/S-LOC ,/O he/O is /O China/S-ORG people/O ,/O he/O Love/O China/S-ORG./O

Second：Xiao Li/S-PER births/O /the O U.S./S-LOC ,/O he/O is /the O U.S./S-ORG people/O ,/O he/O Also/O love/O China/S-ORG./O

The above embodiments are merely illustrative of the technical solutions of the present invention rather than is limited, the ordinary skill of this area Personnel can be to technical scheme technical scheme is modified or replaced equivalently, without departing from the spirit and scope of the present invention, sheet The protection domain of invention should be subject to described in claims.

Claims

1. a kind of name entity recognition method based on Bi-LSTM, it is characterised in that comprise the following steps：

2) word marked in language material and character are converted into vector；

3) Named Entity Extraction Model based on Bi-LSTM, and training name entity knowledge are established using the vector of word and character The parameter of other model；

2. the method as described in claim 1, it is characterised in that step 1) is in the way of IOBES to training corpus into rower Note.

3. the method as described in claim 1, it is characterised in that the word of input is converted into vector by step 2) first, then will Each character in word is disassembled, and all characters for being included word with Bi-LSTM models are converted into vector, and to word Spliced with the vector of character conversion.

4. method as claimed in claim 3, it is characterised in that the name Entity recognition mould based on Bi-LSTM described in step 3) Type includes LSTM layers and CRF layers, and the character and term vector of step 2) splicing are input in first layer LSTM neuron elements, and first Input of the output of layer i-th of LSTM unit of LSTM as first layer LSTM i+1 LSTM units；Step 2) is spliced at the same time Character and term vector be input in second layer LSTM neuron elements, the output of second layer LSTM i+1 LSTM units is same The input of Shi Zuowei first layers i-th of LSTM unit of LSTM；Then using the output of two-way each neural unit of LSTM as The input of CRF models, so as to calculate corresponding each input character x_iY_i, and set the result of real marking in language material as Construct a loss function L based on entropy：

<mrow> <mi>L</mi> <mo>=</mo> <mfrac> <mn>1</mn> <mi>n</mi> </mfrac> <munder> <mo>&Sigma;</mo> <mi>i</mi> </munder> <msub> <mover> <mi>y</mi> <mo>&OverBar;</mo> </mover> <mi>i</mi> </msub> <mi>l</mi> <mi>o</mi> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>y</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> <mo>+</mo> <mrow> <mo>(</mo> <mn>1</mn> <mo>-</mo> <msub> <mover> <mi>y</mi> <mo>&OverBar;</mo> </mover> <mi>i</mi> </msub> <mo>)</mo> </mrow> <mi>l</mi> <mi>o</mi> <mi>g</mi> <mrow> <mo>(</mo> <mn>1</mn> <mo>-</mo> <msub> <mi>y</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> </mrow>

Wherein, n represents training samples number；This loss function L is then converted into an optimization problem, is solved：

<mrow> <mi>M</mi> <mi>i</mi> <mi>n</mi> <mi> </mi> <mi>L</mi> <mo>=</mo> <mfrac> <mn>1</mn> <mi>n</mi> </mfrac> <msub> <mi>&Sigma;</mi> <mi>i</mi> </msub> <msub> <mover> <mi>y</mi> <mo>&OverBar;</mo> </mover> <mi>i</mi> </msub> <mi>l</mi> <mi>o</mi> <mi>g</mi> <mrow> <mo>(</mo> <msub> <mi>y</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> <mo>+</mo> <mrow> <mo>(</mo> <mn>1</mn> <mo>-</mo> <msub> <mover> <mi>y</mi> <mo>&OverBar;</mo> </mover> <mi>i</mi> </msub> <mo>)</mo> </mrow> <mi>l</mi> <mi>o</mi> <mi>g</mi> <mrow> <mo>(</mo> <mn>1</mn> <mo>-</mo> <msub> <mi>y</mi> <mi>i</mi> </msub> <mo>)</mo> </mrow> <mo>.</mo> </mrow>

5. method as claimed in claim 4, it is characterised in that step 3) is using the ginseng in Adam gradient descent algorithms training L Number.

6. method as claimed in claim 5, it is characterised in that step 3) presses training corpus during training parameter Subordinate sentence processing is carried out according to Chinese syntactic rule, and the sentence data 0 for being less than neuron number to character length after subordinate sentence are filled.

7. method as claimed in claim 6, it is characterised in that random nothing is put every time in the iteration of Adam gradient descent algorithms Random one sentence packet of selection, some sentences are extracted as mould from sentence packet in the slave training corpus data set returned The iterative data of type single.

8. method as claimed in claim 5, it is characterised in that the end condition of iteration is in Adam gradient descent algorithms：1) Maximum iterations；2) penalty values iteration changes | L_i-L_i+1| ＜ ε, wherein ε represent acceptable error range.

9. the method as described in claim 1, it is characterised in that step 4) pre-processes data to be predicted first, so The vectorization processing of laggard line character and word, is then named Entity recognition prediction.

10. method as claimed in claim 9, it is characterised in that the pretreatment includes subordinate sentence processing and word segmentation processing；It is described Vectorization processing includes term vectorization processing, character vectorization processing, and term vector, character vector are spliced.