CN110222343A

CN110222343A - Named entity recognition method for Chinese medicinal material planting resources

Info

Publication number: CN110222343A
Application number: CN201910512743.XA
Authority: CN
Inventors: 李巧勤; 蔡茁; 何家欢; 李杨; 刘勇国; 杨尚明
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2019-06-13
Filing date: 2019-06-13
Publication date: 2019-09-10

Abstract

The present invention provides a named entity recognition method for Chinese medicinal material planting resources, comprising the following steps: S1: obtaining Chinese medicinal material planting resource documents; S2: labeling the documents according to a given rule and splitting the labeled documents into text sentences; S3: looking up, one by one, the term vectors and character vectors corresponding to each text sentence, and training a GRU-CRF model with the term vectors and character vectors; S4: performing named entity recognition on unseen Chinese medicinal material planting resource documents using the trained GRU-CRF model. By constructing a GRU-CRF model, the method automatically recognizes named entities in Chinese medicinal material planting resource documents. It not only greatly improves recognition accuracy and efficiency and reduces the time spent on manual recognition, but can also be extended to other named entity categories, such as soil mineral content and moisture content.

Description

A named entity recognition method for Chinese medicinal material planting resources

Technical field

The present invention relates to the technical field of named entity recognition for Chinese medicine, and more particularly to a named entity recognition method for Chinese medicinal material planting resources.

Background art

The planting environment of Chinese medicinal materials has a significant influence on their quality. For example, medicinal materials from different places of origin differ considerably in appearance, active ingredients and medicinal efficacy. To study the influence of the planting environment on the quality of Chinese medicinal materials, named entities related to the planting environment must be identified in a large body of unstructured Chinese medicinal material planting resource documents, so as to provide basic data for further research. Examples of such named entities are the geographical location where a medicinal material is planted, the name of the medicinal material, the soil and the climate. At present these named entities are identified with a manual-plus-rule approach: the documents are first organized by hand and then matched with regular expressions to extract the required named entities. This approach is too inefficient to meet the needs of current research on the planting environment of Chinese medicinal materials.

Summary of the invention

The object of the invention is to solve the above problems of the prior art by providing a named entity recognition method for Chinese medicinal material planting resources with high recognition efficiency.

A named entity recognition method for Chinese medicinal material planting resources, comprising the following steps:

S1: obtaining Chinese medicinal material planting resource documents;

S2: labeling the Chinese medicinal material planting resource documents according to a given rule, and splitting the labeled documents into text sentences;

S3: looking up, one by one, the term vectors and character vectors corresponding to each text sentence, and training a GRU-CRF model with the term vectors and character vectors;

S4: performing named entity recognition on unseen Chinese medicinal material planting resource documents using the trained GRU-CRF model.

Further, in the method described above, step S3 comprises: looking up, one by one, the term vectors and character vectors corresponding to each text sentence; passing the character vectors of the characters contained in each text sentence through a bidirectional LSTM model to obtain hidden vectors; splicing the term vectors with the hidden vectors to obtain intermediate vectors; and training a bidirectional GRU-CRF model with the intermediate vectors as its input, thereby obtaining the trained GRU-CRF model.

Further, in the method described above, the given rule is: the Chinese medicinal material planting resource documents are labeled with {B-s, I-s, O-s, S-s} tags, where B denotes the beginning of a named entity, I denotes the remaining part of the named entity after its beginning, O denotes other (non-entity) text, and S denotes a named entity consisting of a single character; the suffix s denotes the type of the named entity.

Further, in the method described above, the term vectors and character vectors are obtained by word2vec embedding training on a corpus built from Wikipedia and Baidu Baike.

Further, in the method described above, the Chinese medicinal material planting resource documents are pre-processed after step S1 and before step S2.

The pre-processing comprises: unifying the format of all Chinese medicinal material planting resource documents, and deleting interference information from the unified documents.

Further, in the method described above, step S3 comprises:

When looking up the term vector corresponding to a word of a text sentence, the term vector library obtained by the embedding training is searched first. If the word is not found, it is first converted to a synonym via a domain dictionary to find a corresponding word, and the vector of that word is then looked up. The domain dictionary is generated from the corpus built from Wikipedia and Baidu Baike.

Beneficial effects:

By constructing a GRU-CRF model, the named entity recognition method provided by the invention automatically recognizes named entities in Chinese medicinal material planting resource documents. It not only greatly improves recognition accuracy and efficiency and reduces the time spent on manual recognition, but can also be extended to other named entity categories, such as soil mineral content and moisture content.

Brief description of the drawings

Fig. 1 is a flow chart of the named entity recognition method for Chinese medicinal material planting resources according to the invention;

Fig. 2 is a schematic diagram of labeling with the {B-s, I-s, O-s, S-s} rule;

Fig. 3 is a structural diagram of the GRU-CRF model.

Detailed description of the embodiments

To make the object, technical solutions and advantages of the invention clearer, the technical solutions of the invention are described clearly and completely below. The described embodiments are evidently only some, rather than all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

The invention provides a named entity recognition method for Chinese medicinal material planting resources. As shown in Fig. 1, the method comprises the following steps:

S1: obtaining Chinese medicinal material planting resource documents;

S2: labeling the Chinese medicinal material planting resource documents according to a given rule, and splitting the labeled documents into text sentences;

S3: looking up, one by one, the term vectors and character vectors corresponding to each text sentence, and training a GRU-CRF model with the term vectors and character vectors;

The given rule in step S2 is: the Chinese medicinal material planting resource documents are labeled with {B-s, I-s, O-s, S-s} tags, where B denotes the beginning of a named entity, I denotes the remaining part of the named entity after its beginning, O denotes other (non-entity) text, and S denotes a named entity consisting of a single character; the suffix s denotes the type of the named entity. For example, as shown in Fig. 2, the sentence "Shandong plants purple-flowered danshen" has the label sequence {B-Location, I-Location, O, O, B-Herb, I-Herb, I-Herb, I-Herb}, where "B-Location, I-Location" marks a location entity, namely "Shandong"; "B-Herb, I-Herb, I-Herb, I-Herb" marks a medicinal material entity, namely "purple-flowered danshen"; and O marks irrelevant text. The structure of the GRU-CRF model is shown in Fig. 3.
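The labeling rule can be illustrated with a short Python sketch. It assumes that labels are assigned per character and that entity spans are given as (start, end, type) triples with an exclusive end; the function name `tag_characters` and this span format are illustrative choices, not part of the patent.

```python
from typing import List, Tuple


def tag_characters(n_chars: int, entities: List[Tuple[int, int, str]]) -> List[str]:
    """Label each character position with a B/I/O/S tag.

    `entities` holds (start, end, type) spans with `end` exclusive.
    """
    tags = ["O"] * n_chars  # everything outside an entity stays "O"
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = f"S-{etype}"  # single-character entity
        else:
            tags[start] = f"B-{etype}"  # first character of the entity
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"  # remaining characters of the entity
    return tags


# The example sentence has 8 labeled positions: a 2-character place name,
# 2 other characters, and a 4-character medicinal material name.
tags = tag_characters(8, [(0, 2, "Location"), (4, 8, "Herb")])
print(tags)
```

The printed list reproduces the label sequence given for the example sentence.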

The term vectors and character vectors are obtained by word2vec embedding training on a corpus built from Wikipedia and Baidu Baike. Likewise, the domain dictionary for Chinese medicinal material planting is obtained from the corpus built from Wikipedia and Baidu Baike.

Further, to improve the efficiency of GRU-CRF model training, the Chinese medicinal material planting resource documents are pre-processed after step S1 and before step S2.

Specifically, the PDF documents are uniformly converted to text format, retaining only the effective textual content and deleting interference information such as author and affiliation information and references.
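A minimal Python sketch of this kind of clean-up, together with the sentence splitting of step S2, is given below. It assumes the PDF has already been converted to plain text and that the reference list starts at a line consisting only of a heading such as "References" or "参考文献"; both the heading pattern and the function names are hypothetical. Removing author and affiliation lines would need document-specific rules and is not shown.

```python
import re
from typing import List

# Hypothetical heading that marks the start of the reference list.
REFERENCE_HEADING = re.compile(r"^\s*(References|参考文献)\s*$", re.IGNORECASE)


def strip_references(text: str) -> str:
    """Drop everything from the first reference-list heading onward."""
    kept = []
    for line in text.splitlines():
        if REFERENCE_HEADING.match(line):
            break
        kept.append(line)
    return "\n".join(kept)


def split_sentences(text: str) -> List[str]:
    """Split text after sentence-ending punctuation (。！？ and ! ?).

    The ASCII full stop is deliberately not a delimiter here, so decimal
    numbers such as content percentages are not cut in half.
    """
    chunks = re.findall(r"[^。！？!?]+[。！？!?]?", text)
    return [c.strip() for c in chunks if c.strip()]
```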

Step S3 comprises: looking up, one by one, the term vectors and character vectors corresponding to each text sentence; passing the character vectors of the characters contained in each text sentence through a bidirectional LSTM model to obtain hidden vectors; splicing the term vectors with the hidden vectors to obtain intermediate vectors; and training a bidirectional GRU-CRF model with the intermediate vectors as its input, thereby obtaining the trained GRU-CRF model.
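The splicing of a term vector with the character-level hidden vector can be sketched in Python as follows. How the character-level hidden states of one word are reduced to a single hidden vector is not spelled out in the text; the sketch assumes the last forward state and the first backward state are used, since each of them has seen the whole word. The function name `splice` is illustrative.

```python
from typing import List

Vector = List[float]


def splice(term_vec: Vector, fwd_states: List[Vector], bwd_states: List[Vector]) -> Vector:
    """Concatenate a term vector with a character-level summary of the same word.

    fwd_states[k] and bwd_states[k] are the forward and backward hidden states
    at character k of the word. Assumption: the word is summarised by the last
    forward state and the first backward state.
    """
    if not fwd_states or len(fwd_states) != len(bwd_states):
        raise ValueError("need one forward and one backward state per character")
    hidden = fwd_states[-1] + bwd_states[0]  # list concatenation
    return term_vec + hidden                 # the intermediate vector
```

For a term vector of dimension d and hidden states of dimension k, the intermediate vector has dimension d + 2k.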

In addition, when looking up the term vector corresponding to a word, the term vector library obtained by word2vec embedding training on the corpus built from Wikipedia and Baidu Baike is searched first. If the word is not found, it is first converted to a synonym via the domain dictionary to find a corresponding word, and the vector of that word is then looked up.
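This lookup with a synonym fallback can be sketched in Python. The term vector library and the domain dictionary are modelled here as plain dictionaries, and the names `lookup_term_vector`, `term_vectors` and `synonyms` are illustrative; the behaviour when no synonym is found (returning `None`) is an assumption, since the text does not cover that case.

```python
from typing import Dict, List, Optional


def lookup_term_vector(
    word: str,
    term_vectors: Dict[str, List[float]],
    synonyms: Dict[str, List[str]],
) -> Optional[List[float]]:
    """Return the vector for `word`, falling back to a synonym from the domain dictionary."""
    if word in term_vectors:                    # direct hit in the term vector library
        return term_vectors[word]
    for candidate in synonyms.get(word, []):    # synonym conversion via the domain dictionary
        if candidate in term_vectors:
            return term_vectors[candidate]
    return None                                 # assumption: caller decides how to handle a miss


# Toy data for illustration only.
term_vectors = {"danshen": [0.1, 0.2]}
synonyms = {"red sage": ["danshen"]}
print(lookup_term_vector("red sage", term_vectors, synonyms))
```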

Step S3 is described in detail below and comprises the following sub-steps:

S31. Perform word2vec embedding training on corpora such as Wikipedia and Baidu Baike to obtain a character vector table CT and a term vector table WT.

S32. For a given text sentence, look up the character vector table CT to generate a character vector sequence $(c_1, c_2, \ldots, c_m)$, where m is the number of characters in the sentence. This sequence is the input of the bidirectional LSTM model, which is trained to extract the morphological features of words. The LSTM is computed as follows:

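$$i_t = \mathrm{sigmoid}(W_{ic} c_t + W_{ih} h_{t-1} + b_i) \tag{1}$$

$$f_t = \mathrm{sigmoid}(W_{fc} c_t + W_{fh} h_{t-1} + b_f) \tag{2}$$

$$g_t = \tanh(W_{gc} c_t + W_{gh} h_{t-1} + b_g) \tag{3}$$

$$o_t = \mathrm{sigmoid}(W_{oc} c_t + W_{oh} h_{t-1} + b_o) \tag{4}$$

$$s_t = f_t \odot s_{t-1} + i_t \odot g_t \tag{5}$$

$$h_t = o_t \odot \tanh(s_t) \tag{6}$$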

where W denotes a connection weight matrix and b a bias vector; f, i and o are the forget gate, input gate and output gate, respectively; g denotes the current (candidate) cell state; and s is the cell state, which carries long-term dependency information. $i_t$, $f_t$ and $g_t$ jointly update the cell state $s_{t-1}$ of time t-1 to obtain the cell state $s_t$ of time t. $h_t$ is the hidden state at time t and is determined jointly by $o_t$ and $s_t$. $\odot$ denotes element-wise multiplication. The forward LSTM layer and the backward LSTM layer yield the forward information $\overrightarrow{h_t}$ and the backward information $\overleftarrow{h_t}$ of the sentence, which together form the hidden-layer state, written here as $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$.
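Equations (1) to (6) can be traced with a small pure-Python sketch of a single LSTM step. Vectors are plain lists and the parameters are passed as a dictionary mapping each gate name to a `(W_c, W_h, b)` triple; this layout and the helper names are illustrative assumptions, and a practical implementation would use a deep learning framework instead.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Params = Dict[str, Tuple[Matrix, Matrix, Vector]]  # gate name -> (W_c, W_h, b)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W_c: Matrix, c: Vector, W_h: Matrix, h: Vector, b: Vector) -> Vector:
    """Return W_c·c + W_h·h + b, one output entry per matrix row."""
    return [
        sum(w * x for w, x in zip(row_c, c)) + sum(w * y for w, y in zip(row_h, h)) + bias
        for row_c, row_h, bias in zip(W_c, W_h, b)
    ]


def gate(p: Params, name: str, c_t: Vector, h_prev: Vector,
         act: Callable[[float], float]) -> Vector:
    W_c, W_h, b = p[name]
    return [act(v) for v in affine(W_c, c_t, W_h, h_prev, b)]


def lstm_step(c_t: Vector, h_prev: Vector, s_prev: Vector, p: Params) -> Tuple[Vector, Vector]:
    """One LSTM step on character vector c_t; returns (h_t, s_t)."""
    i = gate(p, "i", c_t, h_prev, sigmoid)      # input gate, eq. (1)
    f = gate(p, "f", c_t, h_prev, sigmoid)      # forget gate, eq. (2)
    g = gate(p, "g", c_t, h_prev, math.tanh)    # candidate cell state, eq. (3)
    o = gate(p, "o", c_t, h_prev, sigmoid)      # output gate, eq. (4)
    s_t = [fk * sk + ik * gk for fk, sk, ik, gk in zip(f, s_prev, i, g)]  # eq. (5)
    h_t = [ok * math.tanh(sk) for ok, sk in zip(o, s_t)]                  # eq. (6)
    return h_t, s_t
```

Running `lstm_step` over the character sequence from left to right gives the forward states; running a second set of parameters from right to left gives the backward states.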

S33. First check whether the term vector table WT contains a vector for the current word. If it does, use that term vector as input; if it does not, first convert the word to a synonym via the domain dictionary to find a corresponding word, and then look that word up in WT. For the given text sentence this yields a term vector sequence $(w_1, w_2, \ldots, w_n)$, where n is the number of words in the sentence. The hidden-layer states obtained in S32 are spliced with the term vector sequence $(w_1, w_2, \ldots, w_n)$ to generate the joint character-word vector sequence $(x_1, x_2, \ldots, x_n)$, which serves as the input of the bidirectional GRU-CRF network and is used to train the model. The GRU is computed as follows:

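$$r_t = \mathrm{sigmoid}(W_{rx} x_t + W_{ru} u_{t-1} + b_r) \tag{7}$$

$$z_t = \mathrm{sigmoid}(W_{zx} x_t + W_{zu} u_{t-1} + b_z) \tag{8}$$

$$n_t = \tanh(W_{nx} x_t + W_{nu}(r_t \odot u_{t-1}) + b_n) \tag{9}$$

$$u_t = (1 - z_t) \odot n_t + z_t \odot u_{t-1} \tag{10}$$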

where $z_t$ is the update gate of the GRU, which determines how much of the information in the hidden state $u_{t-1}$ of time t-1 is passed on to the hidden state $u_t$ of time t; $r_t$ is the reset gate, which determines how much of the information in $u_{t-1}$ is forgotten; and $n_t$ summarizes the input $x_t$ and the hidden state $u_{t-1}$ of time t-1, i.e. it is the current cell state. The forward GRU layer and the backward GRU layer yield the forward information $\overrightarrow{u_t}$ and the backward information $\overleftarrow{u_t}$ of the sentence, which together form the hidden-layer state, expressed as $u_t = [\overrightarrow{u_t}; \overleftarrow{u_t}]$.
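A single GRU step following equations (7) to (10) can be sketched in pure Python in the same spirit. The parameter layout (a dictionary mapping "r", "z" and "n" to `(W_x, W_u, b)` triples) and the helper names are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Params = Dict[str, Tuple[Matrix, Matrix, Vector]]  # "r" / "z" / "n" -> (W_x, W_u, b)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W_x: Matrix, x: Vector, W_u: Matrix, u: Vector, b: Vector) -> Vector:
    """Return W_x·x + W_u·u + b, one output entry per matrix row."""
    return [
        sum(w * a for w, a in zip(row_x, x)) + sum(w * c for w, c in zip(row_u, u)) + bias
        for row_x, row_u, bias in zip(W_x, W_u, b)
    ]


def gru_step(x_t: Vector, u_prev: Vector, p: Params) -> Vector:
    """One GRU step on joint vector x_t; returns the new hidden state u_t."""
    W_rx, W_ru, b_r = p["r"]
    W_zx, W_zu, b_z = p["z"]
    W_nx, W_nu, b_n = p["n"]
    r = [sigmoid(v) for v in affine(W_rx, x_t, W_ru, u_prev, b_r)]       # reset gate, eq. (7)
    z = [sigmoid(v) for v in affine(W_zx, x_t, W_zu, u_prev, b_z)]       # update gate, eq. (8)
    reset_u = [rk * uk for rk, uk in zip(r, u_prev)]                     # r_t ⊙ u_{t-1}
    n = [math.tanh(v) for v in affine(W_nx, x_t, W_nu, reset_u, b_n)]    # candidate, eq. (9)
    return [(1 - zk) * nk + zk * uk for zk, nk, uk in zip(z, n, u_prev)]  # eq. (10)
```

With all weights at zero, both gates equal 0.5 and the candidate is 0, so the new hidden state is half of the previous one; this makes the role of $z_t$ as a carry-over weight easy to see.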

S34. A CRF is used to predict the label sequence, and the optimal label sequence is finally mapped to the corresponding named entities, generating an annotation sequence based on {B, I, O, S} tags. Note the difference from step S2: in step S2 the text is labeled manually with {B, I, O, S} tags to produce the training set, whereas in this step the trained model makes the prediction, and its output is likewise expressed with {B, I, O, S} tags.
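Finding the optimal label sequence in a CRF layer is commonly done with Viterbi decoding. The Python sketch below assumes the sequence score is the sum of per-position scores `P[t][y]` from the bidirectional GRU and transition scores `A[i][j]` between consecutive labels; the function name and the use of integer label indices are illustrative, and start or end transitions are omitted.

```python
from typing import List


def viterbi(P: List[List[float]], A: List[List[float]]) -> List[int]:
    """Return the label index sequence maximising
    sum_t P[t][y_t] + sum_t A[y_{t-1}][y_t]."""
    n_labels = len(A)
    best = list(P[0])   # best score of any path ending in each label at position 0
    back = []           # back-pointers, one list per later position
    for t in range(1, len(P)):
        new_best, pointers = [], []
        for j in range(n_labels):
            # Best previous label for arriving at label j.
            prev = max(range(n_labels), key=lambda i: best[i] + A[i][j])
            pointers.append(prev)
            new_best.append(best[prev] + A[prev][j] + P[t][j])
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda j: best[j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

In the usage below, labels 0, 1, 2 stand for O, B, I. A strongly negative score for the transition O to I makes the decoder prefer "B I" even though "O" has the highest score at the first position.

```python
A = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
P = [[1.0, 0.9, 0.0], [0.0, 0.0, 1.0]]
print(viterbi(P, A))  # [1, 2]
```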

A predicted label sequence is defined as:

$$Y = (y_1, y_2, \ldots, y_n) \tag{11}$$

The score of the predicted sequence is computed from a state-transition matrix A and the output matrix P of the bidirectional GRU, where $A_{i,j}$ denotes the probability of transferring from label i to label j. From these scores the probability of every possible label sequence is obtained, where $Y_s$ denotes the set of all possible label sequences of sentence s. During model training, the log-probability of the true label sequence is maximized. During model prediction, the output sequence with the highest score is solved for. The result $y^*$ is the predicted optimal label sequence, i.e. the automatic recognition result for the named entities of Chinese medicinal material planting resources.
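The score, probability, training objective and decoding rule described above can be written out as follows. This is a sketch of the formulation commonly used for neural CRF taggers, assumed here because it matches the stated roles of A, P and $Y_s$; the symbol X for the input sentence and the exact index ranges are assumptions rather than the patent's own notation.

```latex
% Score of a label sequence Y for input sentence X:
% transition scores between consecutive labels plus per-position scores.
s(X, Y) = \sum_{i=1}^{n-1} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i}

% Probability of Y among all possible label sequences Y_s of the sentence.
p(Y \mid X) = \frac{\exp\big(s(X, Y)\big)}{\sum_{\tilde{Y} \in Y_s} \exp\big(s(X, \tilde{Y})\big)}

% Training objective: log-probability of the true label sequence.
\log p(Y \mid X) = s(X, Y) - \log \sum_{\tilde{Y} \in Y_s} \exp\big(s(X, \tilde{Y})\big)

% Prediction: the highest-scoring label sequence.
y^{*} = \arg\max_{\tilde{Y} \in Y_s} s(X, \tilde{Y})
```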

The recognition method of the invention is illustrated below by way of example:

First, the PDF documents on Chinese medicinal material planting resources are converted to TXT text. Take the sentence "Shandong plants purple-flowered danshen" from the text as an example. With this sentence as input, the character vector table and the term vector table are obtained from the word2vec model. The sentence is segmented into words, and each word is looked up in the term vector table to obtain its term vector. The character vector table is then searched to obtain the character vector of each character, and the character vectors of the characters contained in each word are passed through the bidirectional LSTM model to obtain a hidden vector. The term vector and the hidden vector are spliced into the intermediate vector, which is used as the input of the bidirectional GRU-CRF model. Running the model yields the optimal label sequence of the sentence, i.e. its annotation sequence "B-Location I-Location O O B-Herb I-Herb I-Herb I-Herb". The named entities recognized in this sentence are therefore a location entity and a medicinal material entity.
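Mapping the predicted annotation sequence back to named entities can be sketched in Python as follows. The function name `extract_entities` is illustrative, and the example uses pinyin syllables as stand-ins for the eight characters of the sentence, which is an assumption made only to keep the example readable.

```python
from typing import List, Optional, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (text, type) pairs from a B/I/O/S-tagged token sequence."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        """Close the entity being built, if any."""
        nonlocal current, current_type
        if current:
            entities.append(("".join(current), current_type))
        current, current_type = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B":                       # start a new multi-character entity
            flush()
            current, current_type = [token], etype
        elif prefix == "I" and current and etype == current_type:
            current.append(token)               # continue the open entity
        elif prefix == "S":                     # single-character entity
            flush()
            entities.append((token, etype))
        else:                                   # "O" or an inconsistent "I" tag
            flush()
    flush()
    return entities


# Pinyin stand-ins for the 8 characters of the example sentence.
tokens = ["Shan", "dong", "zhong", "zhi", "zi", "hua", "dan", "shen"]
tags = ["B-Location", "I-Location", "O", "O", "B-Herb", "I-Herb", "I-Herb", "I-Herb"]
print(extract_entities(tokens, tags))
```

For the example sentence this yields one Location entity and one Herb entity, matching the result stated above.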

Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the invention.

Claims

1. A named entity recognition method for Chinese medicinal material planting resources, characterized by comprising the following steps:

S1: obtaining Chinese medicinal material planting resource documents;

S2: labeling the Chinese medicinal material planting resource documents according to a given rule, and splitting the labeled documents into text sentences;

S3: looking up, one by one, the term vectors and character vectors corresponding to each text sentence, and training a GRU-CRF model with the term vectors and character vectors;

S4: performing named entity recognition on unseen Chinese medicinal material planting resource documents using the trained GRU-CRF model.

2. The named entity recognition method for Chinese medicinal material planting resources according to claim 1, characterized in that step S3 comprises: looking up, one by one, the term vectors and character vectors corresponding to each text sentence; passing the character vectors of the characters contained in each text sentence through a bidirectional LSTM model to obtain hidden vectors; splicing the term vectors with the hidden vectors to obtain intermediate vectors; and training a bidirectional GRU-CRF model with the intermediate vectors as its input, thereby obtaining the trained GRU-CRF model.

3. The named entity recognition method for Chinese medicinal material planting resources according to claim 1, characterized in that the given rule is: the Chinese medicinal material planting resource documents are labeled with {B-s, I-s, O-s, S-s} tags, where B denotes the beginning of a named entity, I denotes the remaining part of the named entity after its beginning, O denotes other text, and S denotes a named entity consisting of a single character; the suffix s denotes the type of the named entity.

4. The named entity recognition method for Chinese medicinal material planting resources according to claim 1, characterized in that the term vectors and character vectors are obtained by word2vec embedding training on a corpus built from Wikipedia and Baidu Baike.

5. The named entity recognition method for Chinese medicinal material planting resources according to claim 1, characterized in that the method further comprises, after step S1 and before step S2, pre-processing the Chinese medicinal material planting resource documents;

the pre-processing comprises: unifying the format of all Chinese medicinal material planting resource documents, and deleting interference information from the unified documents.

6. The named entity recognition method for Chinese medicinal material planting resources according to claim 4, characterized in that step S3 comprises:

when looking up the term vector corresponding to a word of a text sentence, first searching the term vector library obtained by the embedding training; if the word is not found, first converting it to a synonym via a domain dictionary to find a corresponding word, and then looking up the vector of that word, the domain dictionary being generated from the corpus built from Wikipedia and Baidu Baike.