CN109858041A - Named entity recognition method combining semi-supervised learning with a user-defined dictionary - Google Patents

Named entity recognition method combining semi-supervised learning with a user-defined dictionary

Info

Publication number
CN109858041A
Authority
CN
China
Prior art keywords
model, training, custom dictionaries, LSTM, semi-supervised
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910172675.7A
Other languages
Chinese (zh)
Other versions
CN109858041B (en)
Inventor
苏海波
高体伟
孙伟
王然
于帮付
黄伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baifendian Information Science & Technology Co Ltd
Original Assignee
Beijing Baifendian Information Science & Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baifendian Information Science & Technology Co Ltd
Priority to CN201910172675.7A
Publication of CN109858041A
Application granted
Publication of CN109858041B
Legal status: Active

Landscapes

  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a named entity recognition method combining semi-supervised learning with a user-defined dictionary, comprising the following steps: S1, pre-train a Bi-LSTM language model with unlabeled data; S2, vectorize each character in the Embedding layer using a character vector model; S3, use a two-layer bidirectional LSTM as the sequence labelling model, the sequence labelling model being trained with labeled data; S4, add the user's custom dictionary; S5, find the maximum-probability path through the sequence using Viterbi decoding. The invention concatenates the output of the pre-trained language model with the output of the first bidirectional LSTM layer and uses the result as the input of the second bidirectional LSTM layer, which reduces the amount of labeled corpus required; when switching fields, only the labeled corpus of the new field needs to be replaced. In addition, at prediction time the emission matrix fed into Viterbi decoding can be modified by configuring the custom dictionary, so that the dictionary takes effect.

Description

Named entity recognition method combining semi-supervised learning with a user-defined dictionary
Technical field
The present invention relates to the field of data processing and to applications of named entity recognition technology, and in particular to a named entity recognition method combining semi-supervised learning with a user-defined dictionary.
Background technique
Named entity recognition (NER) refers to identifying entities of particular categories (usually nouns) in text, such as person names, place names, organization names, and proper nouns. NER is a foundational task for information retrieval, query classification, automatic question answering, and similar problems, and its quality directly affects downstream processing, which makes it a fundamental problem in natural language processing research.
Semi-supervised learning (SSL) is an important research topic in pattern recognition and machine learning; it is a learning method that combines supervised learning with unsupervised learning. Semi-supervised learning uses a large amount of unlabeled data, together with labeled data, to perform pattern recognition. Its basic idea is to build a learner from model assumptions about the data distribution and use it to label the unlabeled samples. Formally: given a sample set S = L ∪ U drawn from some unknown distribution, where L = {(x₁, y₁), (x₂, y₂), …, (x_|L|, y_|L|)} is the labeled sample set and U = {x′₁, x′₂, …, x′_|U|} is the unlabeled sample set, the goal is to obtain a function f: X → Y that accurately predicts the label y of a sample x. Here each xᵢ and x′ⱼ is a d-dimensional vector, yᵢ ∈ Y is the label of sample xᵢ, and |L| and |U| are the sizes of L and U, i.e. the number of samples each contains. Semi-supervised learning is then the search for the optimal learner on the sample set S. If S = L, the problem reduces to traditional supervised learning; conversely, if S = U, the problem reduces to traditional unsupervised learning. How to jointly exploit labeled and unlabeled samples is the central problem semi-supervised learning must solve.
A custom dictionary is a product of user needs: users in different fields and industries define and understand entities differently, so a word that appears to be an entity to one user may not be an entity to another. A user-defined dictionary is therefore necessary; it improves the accuracy of named entity recognition and makes the results better fit the user's needs.
Summary of the invention
In view of the deficiencies of the prior art, the present invention aims to provide a named entity recognition method combining semi-supervised learning with a user-defined dictionary.
To achieve the above technical purpose, the present invention adopts the following technical scheme:
A named entity recognition method combining semi-supervised learning with a user-defined dictionary, comprising the following steps:
S1, pre-train a Bi-LSTM language model with unlabeled data;
S2, vectorize each character in the Embedding layer using a character vector model;
S3, use a two-layer bidirectional LSTM as the sequence labelling model, the sequence labelling model being trained with labeled data;
During training of the sequence labelling model, concatenate the output vector of the first bidirectional LSTM layer of the sequence labelling model with the output of the Bi-LSTM language model pre-trained in step S1, then pass the concatenated vector through a fully connected layer and use it as the input of the second bidirectional LSTM layer of the sequence labelling model;
S4, add the user's custom dictionary:
The emission matrix X is obtained after the two bidirectional LSTM layers of the sequence labelling model; the CRF layer obtains the transition matrix Y by maximum likelihood; the emission probabilities are then adjusted according to the user's custom dictionary, yielding the adjusted emission matrix X;
S5, find the maximum-probability path through the sequence using Viterbi decoding:
Input the emission matrix X adjusted according to the user's custom dictionary and the transition matrix Y obtained in step S4 into the Viterbi decoding of the CRF layer to obtain the sequence labelling, i.e. the correct named entity recognition result.
Further, in step S2, the character vector model is a word2vec model.
Further, in step S2, the character vector model is trained specifically using the Skip-gram method.
Still further, the specific steps of training the character vector model with the Skip-gram method are:
(1) first collect a balanced corpus relevant to the application field;
(2) pre-process the corpus data collected in step (1), including filtering out junk data, low-frequency words, and meaningless symbols, and organize it into the training-data format to obtain the training data;
(3) feed the training data to the Skip-gram model; training yields the character vector model.
The beneficial effects of the present invention are as follows. Based on a pre-trained language model, character embeddings, a custom dictionary, semi-supervised training, a bidirectional LSTM (Long Short-Term Memory) network, a CRF (Conditional Random Field) model, and so on, NER training with semi-supervised learning is realized. Through the above methods and the special network structure, the output of the pre-trained language model is concatenated with the output of the first bidirectional LSTM layer and used as the input of the second bidirectional LSTM layer. This approach reduces the amount of labeled corpus required, and when switching fields only the labeled corpus of the new field needs to be replaced. In addition, at prediction time the emission matrix fed into Viterbi decoding can be modified by configuring the custom dictionary, so that the dictionary takes effect.
Detailed description of the invention
Fig. 1 is a schematic flow diagram of the method of the embodiment of the present invention;
Fig. 2 is a schematic diagram of the Bi-LSTM language model network in the embodiment of the present invention;
Fig. 3 is a schematic diagram of CBOW, a common word2vec training model, in the embodiment of the present invention;
Fig. 4 is a schematic diagram of skip-gram, a common word2vec training model, in the embodiment of the present invention;
Fig. 5 is a schematic flow diagram of character vector model training in the embodiment of the present invention;
Fig. 6 is a schematic diagram of the sequence labelling model in the embodiment of the present invention.
Specific embodiment
The invention will be further described below with reference to the accompanying drawings. It should be noted that this embodiment is premised on the present technical scheme and gives a detailed implementation method and specific operation process, but the protection scope of the present invention is not limited to this embodiment.
The technical terms involved in this embodiment are briefly explained below:
Named entity recognition: identifying specific proper nouns in given text data, such as person names, place names, organization names, time expressions, product names, etc.
Word2vec: an algorithm developed by Google. Through unsupervised training it turns each word into a vector of several hundred dimensions; this vector captures semantic correlations between words. Also called word vectors or word embeddings.
TensorFlow: Google's open-source deep learning platform, providing rich interfaces, multi-platform (CPU, GPU, Hadoop) and distributed support, and visual monitoring.
Skip-gram: the method Google uses to train word2vec on big data; it predicts the surrounding words from the current word to form the training objective function.
LSTM: LSTM (Long Short-Term Memory) is a long short-term memory network, a kind of recurrent neural network suited to processing and predicting events separated by relatively long intervals and delays in a time series. It controls the retention and discarding of historical information through a "memory gate" and a "forget gate", effectively solving the long-range dependency problem of conventional recurrent neural networks.
CRF: CRF (Conditional Random Field) is one of the algorithms commonly used in natural language processing in recent years, often applied to syntactic analysis, named entity recognition, part-of-speech tagging, etc. A CRF is a probabilistic transition model that takes a Markov chain as its latent variables and discriminates the latent variables from the observable states; it is a discriminative model.
Semi-supervised learning: semi-supervised learning (SSL) is an important research topic in pattern recognition and machine learning, a learning method that combines supervised learning with unsupervised learning. It uses a large amount of unlabeled data, together with labeled data, to perform pattern recognition.
Custom dictionary: a user-defined dictionary holds the special entities a user wants extracted when performing NER; configuring the dictionary ensures that they can be extracted.
This embodiment provides a named entity recognition method combining semi-supervised learning with a user-defined dictionary, comprising the following steps:
S1, pre-train a Bi-LSTM language model with unlabeled data.
Using a pre-trained Bi-LSTM language model has the following advantages:
1) It reduces the demand for labeled corpus. The main function of the language model is automatic feature extraction; pre-training on unlabeled data obtains the semantic information of each character in advance.
2) It reduces the training time of the model. Because pre-training has been done beforehand, the training time on labeled data is reduced.
The present invention trains the language model with a Bi-LSTM model, an unsupervised learning method that can train the model without manually labeled corpus. The network structure of the model is shown in Fig. 2.
Bi-LSTM (bidirectional LSTM) can provide the semantic information of a character by combining contextual information: a forward LSTM yields one semantic vector for each character, and a backward LSTM yields another. The two semantic vectors are concatenated at the output layer to obtain the final output. Since this language model is trained in unsupervised fashion, the more data available, the better.
As can be seen from Fig. 2, the Bi-LSTM used in this embodiment shares no parameters between the forward and backward directions. The two LSTMs are trained with different parameters; that is, the two LSTMs are independent.
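For illustration, a minimal PyTorch sketch of such a pre-training model is given below. The layer sizes, vocabulary handling, and per-direction prediction heads are assumptions for the sketch, not values disclosed in the patent; the one property taken from the description above is that the forward and backward LSTMs share no parameters.

```python
import torch
import torch.nn as nn

class BiLSTMLanguageModel(nn.Module):
    """Character-level Bi-LSTM language model for unsupervised pre-training.

    The forward LSTM predicts the next character, the backward LSTM the
    previous one, and the two directions share no parameters (they are
    independent, as in Fig. 2). All sizes are illustrative assumptions.
    """
    def __init__(self, vocab_size, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.fwd = nn.LSTM(emb_dim, hidden, batch_first=True)   # left-to-right
        self.bwd = nn.LSTM(emb_dim, hidden, batch_first=True)   # right-to-left
        self.out = nn.Linear(hidden, vocab_size)                # softmax projection

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer character indices
        emb = self.embed(char_ids)
        h_f, _ = self.fwd(emb)                 # forward hidden states
        h_b, _ = self.bwd(emb.flip(1))         # run on the reversed sequence
        h_b = h_b.flip(1)                      # re-align to the original order
        h_lm = torch.cat([h_f, h_b], dim=-1)   # spliced final output
        # Per-direction logits for next-/previous-character prediction losses.
        return h_lm, self.out(h_f), self.out(h_b)
```

During pre-training, the two logit streams would be trained with next- and previous-character cross-entropy losses; afterwards, only the spliced vector h_lm is consumed by the sequence labelling model.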
S2, vectorize each character in the Embedding layer using the word2vec model.
In this embodiment, the character vector model is obtained specifically by training with the Skip-gram method.
The word2vec model turns each token into a vector in a low-dimensional space, usually of several hundred dimensions, so that the semantic correlation between characters can be approximately described by the distance between their vectors. Compared with ordinary word vectors, character-based vectorization brings the following advantages:
1) it can characterize finer-grained character features;
2) since the number of characters is far smaller than the number of words, the resulting model occupies very little space, greatly improving model loading speed;
3) over time new words keep emerging, so a previously trained word vector model suffers an increasingly severe drop in feature hit rate, whereas character-based vectors effectively avoid this problem, because relatively few new characters are created each year.
This embodiment therefore selects character-based vectorization.
The word2vec model used in this embodiment is an unsupervised learning method, i.e. no manually labeled corpus is needed to train it. Two training methods are common: CBOW and Skip-gram, as shown in Figs. 3-4.
CBOW predicts the center word from its context: the character w(t) is predicted from the surrounding characters w(t-2), w(t-1), w(t+1), w(t+2), whose vectors are connected so that contextual information is fully preserved, as shown in Fig. 3. The Skip-gram method is the opposite: it uses w(t) to predict the surrounding words w(t-2), w(t-1), w(t+1), w(t+2), as shown in Fig. 4. Under big-data conditions, the Skip-gram method is the suitable choice.
As shown in Fig. 5, in this embodiment, the specific steps of training the model with the Skip-gram method are (a training sketch follows the list):
(1) first collect a relevant balanced corpus (since the learning is unsupervised, the more data the better, and no labeling is needed); the corpus should cover as many data types of the target application field's scenarios as possible;
(2) pre-process the corpus data collected in step (1), including filtering out junk data, low-frequency words, and meaningless symbols, and organize it into the training-data format, i.e. specify the inputs and outputs, to obtain the training data;
(3) feed the training data to the Skip-gram model; training yields the character vector model.
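As a concrete illustration, a character-level Skip-gram model of this kind could be trained with the gensim library (4.x API) roughly as follows. The file name, window size, dimensionality, and frequency threshold are assumptions for the sketch; the patent only specifies Skip-gram training on the pre-processed balanced corpus.

```python
from gensim.models import Word2Vec

# Character-level corpus: every pre-processed line becomes a list of characters,
# so word2vec learns character (rather than word) vectors. "corpus.txt" is a
# hypothetical file produced by steps (1)-(2).
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [list(line.strip()) for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=300,  # "usually several hundred dimensions"
    sg=1,             # sg=1 selects Skip-gram; sg=0 would select CBOW
    window=2,         # predict w(t-2)..w(t+2) around the centre character
    min_count=5,      # drop low-frequency characters, as in step (2)
    workers=4,
)
model.save("char_vectors.model")
```

The resulting vectors would then populate the Embedding layer of step S2, with semantic similarity between characters approximated by vector distance.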
S3, use a two-layer bidirectional LSTM as the sequence labelling model, and train the sequence labelling model with labeled data.
In this embodiment, the training data are labeled with the BIO tagging scheme. For example:
the label B-PER marks the beginning of a person name, I-ORG marks the interior of an organization name, and O marks everything else, as illustrated below.
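For instance, under this scheme a hypothetical sentence containing a person name and an organization name would be tagged character by character as follows:

```python
# Hypothetical example sentence "张三在北京大学工作"
# ("Zhang San works at Peking University"), tagged per character in BIO:
chars = ["张", "三", "在", "北", "京", "大", "学", "工", "作"]
tags  = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
```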
The sequence labelling model of this embodiment uses a two-layer bidirectional LSTM. Because training uses only a small amount of labeled data, the model's complexity is increased in the expectation of fitting the data better. Meanwhile, to reduce the required amount of labeled data, this embodiment introduces the pre-trained language model vector between the two bidirectional LSTM layers of the sequence labelling model; the specific model is shown in Fig. 6.
Specifically, during training of the sequence labelling model, the output vector of the first bidirectional LSTM layer is concatenated with the output of the Bi-LSTM language model; the concatenated vector is then passed through a fully connected layer and used as the input of the second bidirectional LSTM layer of the sequence labelling model.
In concrete terms, the input first enters the first bidirectional LSTM layer of the sequence labelling model. The forward LSTM outputs h_t^f and the backward LSTM outputs h_t^b; concatenating the two gives h_t^1 = [h_t^f, h_t^b], where the forward output h_t^f characterizes the historical context and the backward output h_t^b characterizes the future context. The output h^lm of the Bi-LSTM language model is then concatenated with the output of the first bidirectional LSTM layer, giving h_t = [h^lm, h_t^1]. After a fully connected layer, the result is input to the second bidirectional LSTM layer of the sequence labelling model. A sketch of this splice appears below.
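A minimal PyTorch sketch of this splice, under assumed layer dimensions (the patent does not disclose sizes), might read as follows; h_lm stands for the output of the pre-trained Bi-LSTM language model:

```python
import torch
import torch.nn as nn

class TwoLayerTagger(nn.Module):
    """Sequence-labelling backbone: Bi-LSTM -> splice in the LM vector ->
    fully connected layer -> Bi-LSTM -> emission scores for the CRF layer.
    Dimensions are illustrative assumptions."""
    def __init__(self, emb_dim=300, hidden=256, lm_dim=512, n_tags=9):
        super().__init__()
        self.lstm1 = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden + lm_dim, 2 * hidden)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, n_tags)   # rows of the emission matrix X

    def forward(self, char_emb, h_lm):
        # char_emb: (batch, seq, emb_dim) character embeddings from step S2
        # h_lm:     (batch, seq, lm_dim) output of the pre-trained language model
        h_t1, _ = self.lstm1(char_emb)              # h_t1 = [h_t^f, h_t^b]
        h_t = torch.cat([h_lm, h_t1], dim=-1)       # h_t  = [h^lm, h_t^1]
        h_t = torch.relu(self.fc(h_t))              # fully connected layer
        h_t2, _ = self.lstm2(h_t)                   # second bidirectional LSTM
        return self.emit(h_t2)                      # emission matrix X per sentence
```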
Recurrent neural networks (RNNs) are currently widely applied in natural language processing. For any input text sequence (x_1, x_2, …, x_n), an RNN returns an output set (h_1, h_2, …, h_n) for that sequence. Because traditional RNNs suffer from vanishing gradients during optimization, they cannot remember distant semantic information when predicting on long texts. The LSTM model instead uses separate gates to control the input and output of historical information; moreover, a bidirectional LSTM can draw not only on past historical information but also on future semantic information.
S4, add the user's custom dictionary.
The emission matrix X is obtained after the two bidirectional LSTM layers of the sequence labelling model; the CRF layer obtains the transition matrix Y by maximum likelihood; the emission probabilities are then adjusted according to the user's custom dictionary, yielding the adjusted emission matrix X.
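The patent does not state the exact adjustment rule, so the sketch below takes one plausible reading: every dictionary match adds a score bonus to the corresponding B-/I- tag columns of the emission matrix before decoding. The dictionary format, the tag_index mapping, and the additive boost value are all illustrative assumptions.

```python
import numpy as np

def apply_custom_dictionary(X, sentence, dictionary, tag_index, boost=5.0):
    """Raise emission scores for spans matched by the user's custom dictionary.

    X          : (seq_len, n_tags) emission matrix from the second Bi-LSTM
    sentence   : the input character sequence
    dictionary : {entry: entity_type}, e.g. {"北京大学": "ORG"} (hypothetical)
    tag_index  : {tag_name: column index of X}
    """
    text = "".join(sentence)
    for entry, ent_type in dictionary.items():
        start = text.find(entry)
        while start != -1:
            X[start, tag_index["B-" + ent_type]] += boost        # entry start
            for i in range(start + 1, start + len(entry)):
                X[i, tag_index["I-" + ent_type]] += boost        # entry body
            start = text.find(entry, start + 1)
    return X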
S5, find the maximum-probability path through the sequence using Viterbi decoding.
Specifically, the emission matrix X adjusted according to the user's custom dictionary and the transition matrix Y are input into the Viterbi decoding of the CRF layer to obtain the sequence labelling, i.e. the correct named entity recognition result.
The main role of the CRF layer, the last layer in the present invention, is to perform Viterbi decoding to find the optimal path. A conditional random field (CRF) is a discriminative probabilistic model and a kind of random field, commonly used to label or parse sequence data such as natural language text or biological sequences. The CRF model has the advantages of a discriminative model while, like a generative model, accounting for the transition probabilities between context labels; it performs global parameter optimization and decoding over the whole sequence, solving the label bias problem that other discriminative models (such as the maximum entropy Markov model) find hard to avoid.
Moreover, the conditional random field uses a probabilistic graphical model, can express long-range dependencies and overlapping features, better resolves problems such as labeling (classification) bias, and normalizes all features globally, so a globally optimal solution can be obtained. What is mainly used here is the CRF prediction algorithm: the Viterbi algorithm (a dynamic programming algorithm), sketched below.
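For reference, a plain dynamic-programming sketch of Viterbi decoding over an emission matrix X (seq_len × n_tags) and a transition matrix Y (n_tags × n_tags), as used in step S5:

```python
import numpy as np

def viterbi_decode(X, Y):
    """Return the maximum-probability tag path given emission scores X
    (seq_len, n_tags) and transition scores Y (n_tags, n_tags)."""
    seq_len, n_tags = X.shape
    score = X[0].copy()                            # best score ending in each tag
    back = np.zeros((seq_len, n_tags), dtype=int)  # best-predecessor table
    for t in range(1, seq_len):
        # total[i, j] = score of ending at tag j having come from tag i
        total = score[:, None] + Y + X[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]                   # best final tag
    for t in range(seq_len - 1, 0, -1):            # trace predecessors backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]                              # best tag index per position
```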
Those skilled in the art can make various corresponding changes and modifications according to the above technical solution and concept, and all such changes and modifications shall be construed as falling within the protection scope of the claims of the present invention.

Claims (4)

1. A named entity recognition method combining semi-supervised learning with a user-defined dictionary, characterized by comprising the following steps:
S1, pre-train a Bi-LSTM language model with unlabeled data;
S2, vectorize each character in the Embedding layer using a character vector model;
S3, use a two-layer bidirectional LSTM as the sequence labelling model, the sequence labelling model being trained with labeled data;
during training of the sequence labelling model, concatenate the output vector of the first bidirectional LSTM layer of the sequence labelling model with the output of the Bi-LSTM language model pre-trained in step S1, then pass the concatenated vector through a fully connected layer and use it as the input of the second bidirectional LSTM layer of the sequence labelling model;
S4, add the user's custom dictionary:
obtain the emission matrix X after the two bidirectional LSTM layers of the sequence labelling model, obtain the transition matrix Y by maximum likelihood through the CRF layer, and then adjust the emission probabilities according to the user's custom dictionary to obtain the adjusted emission matrix X;
S5, find the maximum-probability path through the sequence using Viterbi decoding:
input the emission matrix X adjusted according to the user's custom dictionary and the transition matrix Y obtained in step S4 into the Viterbi decoding of the CRF layer to obtain the sequence labelling, i.e. the correct named entity recognition result.
2. The named entity recognition method combining semi-supervised learning with a user-defined dictionary according to claim 1, characterized in that in step S2 the character vector model is a word2vec model.
3. The named entity recognition method combining semi-supervised learning with a user-defined dictionary according to claim 2, characterized in that in step S2 the character vector model is trained using the Skip-gram method.
4. The named entity recognition method combining semi-supervised learning with a user-defined dictionary according to claim 3, characterized in that the specific steps of training the character vector model with the Skip-gram method are:
(1) first collect a balanced corpus relevant to the application field;
(2) pre-process the corpus data collected in step (1), including filtering out junk data, low-frequency words, and meaningless symbols, and organize it into the training-data format to obtain the training data;
(3) feed the training data to the Skip-gram model; training yields the character vector model.
CN201910172675.7A 2019-03-07 2019-03-07 Named entity recognition method combining semi-supervised learning with user-defined dictionary Active CN109858041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910172675.7A CN109858041B (en) 2019-03-07 2019-03-07 Named entity recognition method combining semi-supervised learning with user-defined dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910172675.7A CN109858041B (en) 2019-03-07 2019-03-07 Named entity recognition method combining semi-supervised learning with user-defined dictionary

Publications (2)

Publication Number Publication Date
CN109858041A true CN109858041A (en) 2019-06-07
CN109858041B CN109858041B (en) 2023-02-17

Family

ID=66900199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910172675.7A Active CN109858041B (en) 2019-03-07 2019-03-07 Named entity recognition method combining semi-supervised learning with user-defined dictionary

Country Status (1)

Country Link
CN (1) CN109858041B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018218705A1 (en) * 2017-05-27 2018-12-06 中国矿业大学 Method for recognizing network text named entity based on neural network probability disambiguation
CN107133220A (en) * 2017-06-07 2017-09-05 东南大学 Name entity recognition method in a kind of Geography field
CN107797992A (en) * 2017-11-10 2018-03-13 北京百分点信息科技有限公司 Name entity recognition method and device
CN108628823A (en) * 2018-03-14 2018-10-09 中山大学 In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training
CN108388560A (en) * 2018-03-17 2018-08-10 北京工业大学 GRU-CRF meeting title recognition methods based on language model
CN109284400A (en) * 2018-11-28 2019-01-29 电子科技大学 A kind of name entity recognition method based on Lattice LSTM and language model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHRISTOPHER CLARK et al.: "Deep contextualized word representations", https://arxiv.org/pdf/1802.05365.pdf *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598203A (en) * 2019-07-19 2019-12-20 中国人民解放军国防科技大学 Military imagination document entity information extraction method and device combined with dictionary
CN111079418A (en) * 2019-11-06 2020-04-28 科大讯飞股份有限公司 Named body recognition method and device, electronic equipment and storage medium
CN111079418B (en) * 2019-11-06 2023-12-05 科大讯飞股份有限公司 Named entity recognition method, device, electronic equipment and storage medium
CN111079405A (en) * 2019-11-29 2020-04-28 微民保险代理有限公司 Text information identification method and device, storage medium and computer equipment
CN111062215A (en) * 2019-12-10 2020-04-24 金蝶软件(中国)有限公司 Named entity recognition method and device based on semi-supervised learning training
CN111062215B (en) * 2019-12-10 2024-02-13 金蝶软件(中国)有限公司 Named entity recognition method and device based on semi-supervised learning training
CN111079437A (en) * 2019-12-20 2020-04-28 深圳前海达闼云端智能科技有限公司 Entity identification method, electronic equipment and storage medium
CN111274814A (en) * 2019-12-26 2020-06-12 浙江大学 Novel semi-supervised text entity information extraction method
CN111274814B (en) * 2019-12-26 2021-09-24 浙江大学 Novel semi-supervised text entity information extraction method
CN111274817A (en) * 2020-01-16 2020-06-12 北京航空航天大学 Intelligent software cost measurement method based on natural language processing technology
CN111291550A (en) * 2020-01-17 2020-06-16 北方工业大学 Chinese entity extraction method and device
CN111291550B (en) * 2020-01-17 2021-09-03 北方工业大学 Chinese entity extraction method and device
WO2022022421A1 (en) * 2020-07-29 2022-02-03 北京字节跳动网络技术有限公司 Language representation model system, pre-training method and apparatus, device and medium
CN111985240A (en) * 2020-08-19 2020-11-24 腾讯云计算(长沙)有限责任公司 Training method of named entity recognition model, named entity recognition method and device
CN111985240B (en) * 2020-08-19 2024-02-27 腾讯云计算(长沙)有限责任公司 Named entity recognition model training method, named entity recognition method and named entity recognition device
CN112464645A (en) * 2020-10-30 2021-03-09 中国电力科学研究院有限公司 Semi-supervised learning method, system, equipment, storage medium and semantic analysis method
CN113761215A (en) * 2021-03-25 2021-12-07 中科天玑数据科技股份有限公司 Feedback self-learning-based dynamic dictionary base generation method

Also Published As

Publication number Publication date
CN109858041B (en) 2023-02-17

Similar Documents

Publication Publication Date Title
CN109858041A (en) A kind of name entity recognition method of semi-supervised learning combination Custom Dictionaries
CN110334354B (en) Chinese relation extraction method
CN110287481B (en) Named entity corpus labeling training system
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
CN107168945B (en) Bidirectional cyclic neural network fine-grained opinion mining method integrating multiple features
CN110929030B (en) Text abstract and emotion classification combined training method
CN107330032B (en) Implicit discourse relation analysis method based on recurrent neural network
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN111046179B (en) Text classification method for open network question in specific field
CN110532557B (en) Unsupervised text similarity calculation method
CN109003601A (en) A kind of across language end-to-end speech recognition methods for low-resource Tujia language
CN109871538A (en) A kind of Chinese electronic health record name entity recognition method
CN109871535A (en) A kind of French name entity recognition method based on deep neural network
CN110569508A (en) Method and system for classifying emotional tendencies by fusing part-of-speech and self-attention mechanism
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
CN110263325A (en) Chinese automatic word-cut
CN109189862A (en) A kind of construction of knowledge base method towards scientific and technological information analysis
CN114492441A (en) BilSTM-BiDAF named entity identification method based on machine reading understanding
CN111222318B (en) Trigger word recognition method based on double-channel bidirectional LSTM-CRF network
CN113761893B (en) Relation extraction method based on mode pre-training
CN110287482A (en) Semi-automation participle corpus labeling training device
Li et al. UD_BBC: Named entity recognition in social network combined BERT-BiLSTM-CRF with active learning
CN111914556A (en) Emotion guiding method and system based on emotion semantic transfer map
CN114490991A (en) Dialog structure perception dialog method and system based on fine-grained local information enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100081 No.101, 1st floor, building 14, 27 Jiancai Chengzhong Road, Haidian District, Beijing
Applicant after: Beijing PERCENT Technology Group Co.,Ltd.
Address before: 100081 16/F, block A, Beichen Century Center, building 2, courtyard 8, Beichen West Road, Chaoyang District, Beijing
Applicant before: BEIJING BAIFENDIAN INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.

GR01 Patent grant