CN110223737A

CN110223737A - Method and device for named entity recognition of chemical components of traditional Chinese medicine

Info

Publication number: CN110223737A
Application number: CN201910512263.3A
Authority: CN
Inventors: 刘勇国; 蒋羽; 李杨; 何家欢; 蔡茁; 杨尚明; 李巧勤
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2019-06-13
Filing date: 2019-06-13
Publication date: 2019-09-10

Abstract

The present invention provides a method and device for named entity recognition of chemical components of traditional Chinese medicine, comprising the following steps: S1: obtaining literature related to Chinese medicine chemical component named entities; S2: filtering the content of the obtained literature to obtain a corpus with standardized text content; S3: encoding and annotating the corpus to obtain an annotated corpus; S4: training a BiLSTM with the annotated corpus as training samples to obtain a trained BiLSTM; S5: inputting the literature in which Chinese medicine chemical component named entities are to be recognized into the trained BiLSTM, so as to recognize those entities. The method applies named entity recognition based on deep neural networks to the identification of Chinese medicine chemical components and is more efficient than manual identification; it also provides a data source for building a basic database of Chinese medicine chemical component named entities.

Description

Method and device for named entity recognition of chemical components of traditional Chinese medicine

Technical field

The present invention relates to the technical field of information processing, and in particular to a method and device, based on a deep neural network, for named entity recognition of chemical components of traditional Chinese medicine.

Background art

A Chinese medicine chemical component named entity is the name of a chemical component contained in a Chinese medicinal material, such as glycycoumarin, glucoperiplocymarin, dihydrocaffeic acid, or physcion-8-O-β-D-glucoside. Such names distinguish different components from one another and follow certain naming rules.

Research on Chinese medicine chemical component named entities is currently scattered, and there is no standard database of such entities to help researchers obtain the latest progress on Chinese medicine chemical components quickly and efficiently. There is therefore an urgent need to organize Chinese medicine chemical component named entities and to recognize them effectively in a large body of scattered research literature.

With the development of natural language processing, named entity recognition has begun to be used to identify the chemical components of Western medicines in biomedical literature. Common methods include pattern matching, machine learning, deep neural networks, and combinations of these. The chemical names of Western medicines strictly follow compound nomenclature rules and are therefore standardized. The names of the natural ingredients of Chinese medicines are formed differently: many contain special prefixes or suffixes, many are derived from the name of the source plant, and some are simply common names. As a result, named entity recognition has not yet been applied to the identification of Chinese medicine chemical components. These components are still organized manually, which is inefficient and hinders the establishment of a standard database of Chinese medicine chemical component named entities.

Summary of the invention

The object of the present invention is to solve the above problems of the prior art by providing a method and system, based on a deep neural network, for named entity recognition of Chinese medicine chemical components, thereby solving the problem of recognizing Chinese medicine chemical component names that do not follow standardized nomenclature.

A method for named entity recognition of chemical components of traditional Chinese medicine, comprising the following steps:

S1: obtaining literature related to Chinese medicine chemical component named entities;

S2: filtering the content of the obtained literature to obtain a corpus with standardized text content;

S3: encoding and annotating the corpus to obtain an annotated corpus;

S4: training a BiLSTM with the annotated corpus as training samples to obtain a trained BiLSTM;

S5: inputting the literature in which Chinese medicine chemical component named entities are to be recognized into the trained BiLSTM, so as to recognize those entities.

Further, in the method described above, S1 comprises:

S11: searching a Chinese periodical literature database by keyword and downloading the documents in PDF format; and searching Baidu Baike and saving the retrieved information as TXT files using a web crawler, so as to obtain a document set of Chinese medicine chemical component named entities;

S12: extracting the document content from the document set.

Further, in the method described above, S3 comprises:

encoding the corpus according to certain rules to obtain an encoded corpus;

annotating the encoded corpus according to certain rules to distinguish Chinese medicine chemical component named entities from all other text, so as to obtain the annotated corpus.

Further, in the method described above, the rules are formed as follows:

performing feature extraction on a large number of chemical component name samples to obtain feature attributes comprising chemical element names, chemical terms, chemical prepositions, specific prefixes, specific suffixes, numerals denoting a serial number, Chinese characters denoting a serial number, letters denoting a serial number, Chinese medicine names, regions, groups, and symbols;

defining a distinct character for each feature attribute, so as to form a rule lookup table in which characters and feature attributes correspond to one another.

Further, in the method described above, step S4 comprises:

S41: the annotated corpus is used as the model training input and fed into the forward layer and the backward layer of the BiLSTM, so that the preceding and following context of the current word vector is obtained at the same time. The vector $h_t$ denotes the concatenation of the bidirectional outputs $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ and is the output of the BiLSTM at time $t$, obtained from formulas (1)-(4):

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{1}$$

$$c_t = (1 - i_t) \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \tag{2}$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{3}$$

$$h_t = o_t \odot \tanh(c_t) \tag{4}$$

where $\sigma$ is a nonlinear function; $\{W_{xi}, W_{hi}, W_{ci}, W_{xc}, W_{hc}, W_{xo}, W_{ho}, W_{co}\}$ are the LSTM parameter matrices; $\{b_i, b_c, b_o\}$ are bias terms; $i_t$ and $o_t$ are the input gate and the output gate of the BiLSTM respectively; $\odot$ is the element-wise product; $c$ is the state of each memory cell in the BiLSTM; and $h_t$ is the final output;

S42: the representation of each word in the text is obtained with an attention mechanism. The attention $\alpha_i$ that the i-th word should be allocated over the full text is calculated by the following formulas:

$$\mathrm{energy}_i = f(\mathrm{attended}, \mathrm{state}_i, W) \tag{5}$$

$$\alpha_i = \mathrm{softmax}(\mathrm{energy}_i) \tag{6}$$

where attended is the combination of word vectors; $\mathrm{state}_i$ is the element of that combination corresponding to the i-th word; $W$ is a weight coefficient; and the function $f$ computes the correlation between state and attended, using the Manhattan distance as the similarity measure:

$$d(a, b) = \sum_i |a_i - b_i| \tag{7}$$

where $a$ and $b$ denote two word vectors, and $a_i$ and $b_i$ are the i-th elements of $a$ and $b$ respectively;

In addition, source denotes the output obtained by processing the entire article with the BiLSTM;

The context representation of the current word over the full text is then obtained and defined as glimpse (formula (8));
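The text names formula (8) but does not reproduce it. As a hedged sketch only, attention models commonly define this kind of full-text context as the attention-weighted sum of the BiLSTM outputs. The double index $\alpha_{i,j}$ (attention paid by word $i$ to word $j$) and the sum over $j$ are assumptions introduced here; only glimpse, $\alpha$, and source come from the description.

```latex
\mathrm{glimpse}_i = \sum_{j} \alpha_{i,j}\, \mathrm{source}_j
```

Under this assumed form, words whose states lie close to that of word $i$ contribute most to its full-text context.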

S43: the context representation of the word over the full text is combined with the context of its adjacent words and mapped through the tanh nonlinear function; the output is denoted context:

$$\mathrm{context}_i = \tanh(\mathrm{glimpse}_i, \mathrm{source}_i, U) \tag{9}$$

where $\mathrm{context}_i$ is the content of unit $i$ of the attention layer, and $U$ is a weight parameter learned during model training;

S44: a conditional random field (CRF) is used to obtain the label sequence of the entire article, by computing the total score of the entire article under a given label sequence.

$\theta'$ denotes all the parameters that the whole model needs to learn, comprising the original BiLSTM parameters and the label transition matrix $A$, whose entries give the score of transitioning from label $[m]_{t-1}$ to label $[m]_t$. The Softmax function is used to compute the probability $p$ that a word is assigned its true label. The model parameters are trained by maximizing the log-likelihood, which is computed from the true label sequence, the sentence, and all possible label sequences, and are optimized by gradient descent;

S45: the Viterbi algorithm is used to find the optimal label sequence, the maximum being taken over all possible label sequences.
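The formulas for S44 and S45 are not reproduced in this text. The block below sketches the form they usually take in BiLSTM-CRF models; it is not the applicant's exact notation. Assumed symbols: $[x]$ for the article, $[m]$ for a label sequence, $[y]$ for the true label sequence, $[j]$ for any candidate label sequence, $T$ for the number of words, and $f_{\theta}([x])_{k,t}$ for the score the network gives label $k$ at position $t$. Only $A$, $\theta'$, $[m]_{t-1}$ and $[m]_t$ appear in the description.

```latex
% Total score of article [x] under label sequence [m]
s([x], [m], \theta') = \sum_{t=1}^{T} \left( A_{[m]_{t-1},\,[m]_t} + f_{\theta}([x])_{[m]_t,\,t} \right)

% Log-likelihood of the true sequence [y], normalized over all sequences [j]
\log p([y] \mid [x], \theta') = s([x], [y], \theta') - \log \sum_{[j]} \exp s([x], [j], \theta')

% Viterbi decoding: the best-scoring label sequence
[m]^{*} = \arg\max_{[j]} \; s([x], [j], \theta')
```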

A device for named entity recognition of chemical components of traditional Chinese medicine, comprising:

an acquisition unit, configured to obtain literature related to Chinese medicine chemical component named entities;

a preprocessing unit, configured to filter the content of the obtained literature to obtain a corpus with standardized text content;

a coding and annotation unit, configured to encode and annotate the corpus to obtain an annotated corpus;

a model training unit, configured to train a BiLSTM with the annotated corpus as training samples to obtain a trained BiLSTM;

a recognition unit, configured to input the literature in which Chinese medicine chemical component named entities are to be recognized into the trained BiLSTM, so as to recognize those entities.

Further, in the device described above, the acquisition unit comprises:

a search unit, configured to search a Chinese periodical literature database by keyword and download the documents in PDF format, and to search Baidu Baike and save the retrieved information as TXT files using a web crawler, so as to obtain a document set of Chinese medicine chemical component named entities;

an extraction unit, configured to extract the document content from the document set.

Further, in the device described above, the coding and annotation unit comprises:

a coding unit, configured to encode the corpus according to certain rules to obtain an encoded corpus;

an annotation unit, configured to annotate the encoded corpus according to certain rules to distinguish Chinese medicine chemical component named entities from all other text, so as to obtain the annotated corpus.

An apparatus for named entity recognition of chemical components of traditional Chinese medicine, comprising a processor and a memory storing computer program code; when the computer program code is run by the processor, the apparatus is caused to execute any of the methods described above.

A computer-readable storage medium storing program code which, when executed, implements any of the methods described above.

Beneficial effects:

For the large volume of scientific literature containing Chinese medicine chemical component named entities, the present invention extracts and standardizes the body text of the relevant documents, builds a corpus and a rule base for Chinese medicine chemical component named entities, establishes and trains a deep neural network model, and uses the trained model to recognize Chinese medicine chemical component named entities. The method applies named entity recognition based on deep neural networks to the identification of Chinese medicine chemical components and is more efficient than manual identification; it also provides a data source for building a basic database of Chinese medicine chemical component named entities.

Brief description of the drawings

Fig. 1 is a flow chart of the method for named entity recognition of Chinese medicine chemical components according to the present invention;

Fig. 2 shows the storage format of the corpus after encoding and annotation in an embodiment of the present invention;

Fig. 3 is a structural diagram of the BiLSTM in an embodiment of the present invention;

Fig. 4 is a structural diagram of the device for named entity recognition of Chinese medicine chemical components according to the present invention.

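Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below. The described embodiments are evidently some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Embodiment 1:

Fig. 1 is a flow chart of the method for named entity recognition of Chinese medicine chemical components according to the present invention. As shown in Fig. 1, the method comprises the following steps:

S1: obtaining literature related to Chinese medicine chemical component named entities;

S2: filtering the content of the obtained literature to obtain a corpus with standardized text content;

S3: encoding and annotating the corpus to obtain an annotated corpus;

S4: training a BiLSTM with the annotated corpus as training samples to obtain a trained BiLSTM;

S5: inputting the literature in which Chinese medicine chemical component named entities are to be recognized into the trained BiLSTM, so as to recognize those entities.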

Embodiment 2:

Preferably, S1 comprises:

S11: searching a Chinese periodical literature database by keyword and downloading the documents in PDF format; and searching Baidu Baike and saving the retrieved information as TXT files using a web crawler, so as to obtain a document set of Chinese medicine chemical component named entities.

Specifically, this embodiment obtains literature on Chinese medicine chemical components from multiple sources. For example, keywords such as "ingredient" and "Chinese medicine" (or the name of a specific Chinese medicine) are used to search Chinese periodical literature databases (such as Wanfang and CNKI), and the documents are downloaded in PDF format; Baidu Baike is also searched, and the retrieved information is saved as TXT files using a web crawler.

S12: extracting the document content from the document set.

Specifically, different text extraction methods are needed for different types of input document. For example, Word documents can be read in Java with the Apache POI library, and text can be extracted from PDF documents in Java with the PDFBox toolkit.

Step S2 comprises document standardization. Because the documents that are read differ in type, their formats differ, and they contain interfering information. Text taken from scientific literature, for example, includes journal information, author information, postal codes, e-mail addresses, reference lists, and so on. This information must be filtered out so that only the body text is retained. The extracted text content is therefore standardized so that it meets the input requirements of the chemical name recognition model.
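The filtering in step S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the filter used in the embodiment: the regular expressions, the reference-list headings, and the helper name `clean_document` are all hypothetical, and dropping every line that contains a six-digit number is a deliberately crude stand-in for postal-code removal.

```python
import re

# Hypothetical patterns; the source does not specify how the filtering is done.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
POSTCODE = re.compile(r"(?<!\d)\d{6}(?!\d)")   # assumes six-digit postal codes
REFERENCE_HEADINGS = ("参考文献", "References")  # assumed section headings


def clean_document(text: str) -> str:
    """Keep body text only: cut the reference list and drop contact lines."""
    # Everything from the first reference heading onward is discarded.
    for heading in REFERENCE_HEADINGS:
        idx = text.find(heading)
        if idx != -1:
            text = text[:idx]
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if EMAIL.search(line) or POSTCODE.search(line):
            continue  # skip lines carrying e-mail addresses or postal codes
        kept.append(line)
    return "\n".join(kept)
```

A production filter would need journal-specific rules (headers, footers, author blocks); the sketch only shows the shape of the step.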

Embodiment 3:

Step S3 comprises:

S31: encoding the corpus according to certain rules to obtain an encoded corpus;

S32: annotating the encoded corpus according to certain rules to distinguish Chinese medicine chemical component named entities from all other text, so as to obtain the annotated corpus.

Specifically, a rule base must first be established before encoding, and the corpus is then encoded using the rule base.

The rule base is constructed as follows:

A corpus annotation expert group is set up. A large amount of text containing Chinese medicine and chemical component information is collected, and a standard corpus and a standard rule base are built by manual annotation. First, statistics on a large number of chemical component names show that such names follow clear word-formation patterns. Feature extraction on a large number of chemical component name samples shows that the names consist of symbols, Chinese numerals, English letters, Greek letters, Arabic numerals, chemical element names, chemical terms, specific prefixes and suffixes, and so on; by encoding each type differently, a chemical name can be converted into a character code. That is, the rules are formed as follows: feature extraction is performed on a large number of chemical component name samples to obtain feature attributes comprising chemical element names, chemical terms, chemical prepositions, specific prefixes, specific suffixes, numerals denoting a serial number, Chinese characters denoting a serial number, letters denoting a serial number, Chinese medicine names, regions, groups, and symbols; a distinct character is defined for each feature attribute, so as to form a rule lookup table in which characters and feature attributes correspond to one another. The lookup table is shown in Table 1:

Table 1: Rule lookup table

Label	Feature type	Examples
A	Chemical element	Hydrogen, helium, lithium, silver, platinum, gold ...
B	Chemical term	Acid (酸), amine (胺), lipid (脂), ketone (酮), glycoside (苷, 甙) ...
C	Chemical preposition	化, 合, 代, 聚, 去 ...
D	Specific prefix	亚 (sub-), 过 (per-), 偏 (meta-), 原 (ortho-) ...
E	Specific suffix	素, 宁, 定, 灵, 精 ...
F	Numeral denoting a serial number	Arabic numerals, Roman numerals, Chinese numerals
G	Chinese character denoting a serial number	The Heavenly Stems
H	Letter denoting a serial number	English letters, Greek letters
I	Chinese medicine name	Ginseng, licorice root (Radix Glycyrrhizae), danshen (Radix Salviae Miltiorrhizae) ...
J	Region	Sichuan, China, South America, Japan, Australia ...
K	Group	Hydroxyl, carboxyl, nitro, phenyl ...
L	Symbol	',', '(', ')', '-' ...
……	……	……

Based on the above rule lookup table, the extracted document content is encoded as follows: once the document content has been standardized, each character in the document is assigned a label according to the rules in Table 1. Take as an example a sentence containing Chinese medicine ingredients: "12 compounds were isolated from the ethyl acetate fraction of bletilla and identified respectively as physcion (1), erythroglaucin (2) ...". Encoded character by character in the original Chinese, the sentence becomes "从/O 白/I1 及/I2 的/O 乙/G 酸/B 乙/G 酯/B 萃/O 取/O 部/O 分/O 分/O 离/O 得/O 到/O 1/F 2/F 个/O 化/C 合/O 物/O ,/L 分/O 别/O 鉴/O 定/E 为/O 大/I1 黄/I2 素/E 甲/G 醚/B (/L 1/F )/L ,/L 红/O 灰/O 青/O 素/E (/L 2/F )/L …/L". This completes the encoding of the sentence; encoding a large number of sentences in this way yields the encoded corpus.
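The character-level encoding can be sketched as a table lookup. The rule excerpt, the herb list, and the function name `encode` below are hypothetical: the Chinese characters are assumed originals behind the translated examples, and the real rule base is larger and built by the expert group. The sketch assumes that characters with no rule get the label O and that the characters of a known Chinese medicine name get I1, I2, ... by position, as in the worked example.

```python
# Hypothetical excerpt of the rule lookup table: label -> characters.
CHAR_RULES = {
    "A": "氢氦锂银铂金",   # chemical elements
    "B": "酸胺酮苷酯醚",   # chemical terms
    "C": "化",             # chemical preposition
    "E": "素定",           # specific suffixes
    "G": "甲乙丙丁",       # Heavenly Stems used as serial numbers
}
HERB_NAMES = ["白及", "大黄"]   # assumed Chinese medicine name dictionary
PUNCTUATION = ",，()（）…-"


def encode(sentence):
    """Return (character, code) pairs for one sentence."""
    codes = ["O"] * len(sentence)
    for i, ch in enumerate(sentence):
        if ch.isdigit():
            codes[i] = "F"                 # numeral denoting a serial number
        elif ch in PUNCTUATION:
            codes[i] = "L"                 # symbol
        else:
            for label, chars in CHAR_RULES.items():
                if ch in chars:
                    codes[i] = label
                    break
    # Herb names override single-character rules and are numbered by position.
    for herb in HERB_NAMES:
        start = sentence.find(herb)
        while start != -1:
            for k in range(len(herb)):
                codes[start + k] = f"I{k + 1}"
            start = sentence.find(herb, start + len(herb))
    return list(zip(sentence, codes))
```

For instance, `encode("大黄素甲醚(1)")` yields the codes I1, I2, E, G, B, L, F, L, matching the physcion fragment of the worked example.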

The BIO scheme is then used to mark which parts of each sentence are chemical components and which are not. Under this scheme the preprocessed text is manually annotated with the labels {B, I, O, S} to generate a label sequence, where B marks the beginning of a word, I marks the remaining characters of the word after the beginning, O marks everything else, and S marks a word consisting of a single character. For example, the label sequence of "山东种植紫花丹参" ("violet-flowered danshen grown in Shandong") is {B-Location, I-Location, O, O, B-Herb, I-Herb, I-Herb, I-Herb}, where "B-Location, I-Location" marks a location entity, namely Shandong; "B-Herb, I-Herb, I-Herb, I-Herb" marks a herb entity, namely violet-flowered danshen; and O marks the irrelevant parts.

The corpus is annotated with the BIO scheme so that each part is marked as chemical component or non-chemical component. Taking again the sentence "12 compounds were isolated from the ethyl acetate fraction of bletilla and identified respectively as physcion (1), erythroglaucin (2) ...", the annotation result is "从/O 白/O 及/O 的/O 乙/O 酸/O 乙/O 酯/O 萃/O 取/O 部/O 分/O 分/O 离/O 得/O 到/O 1/O 2/O 个/O 化/O 合/O 物/O ,/O 分/O 别/O 鉴/O 定/O 为/O 大/B 黄/I 素/I 甲/I 醚/I (/O 1/O )/O ,/O 红/B 灰/I 青/I 素/I (/O 2/O )/O …/O".
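The BIO labelling itself is done manually in the embodiment. The sketch below only illustrates the label layout by deriving the same tags from a list of known entity strings; the function name `bio_tags` and the string-matching approach are assumptions made for illustration.

```python
def bio_tags(sentence, entities):
    """Tag each character: B/I for entity characters, S for one-character
    entities, O for everything else."""
    tags = ["O"] * len(sentence)
    for entity in entities:
        if not entity:
            continue  # an empty string would match everywhere
        start = sentence.find(entity)
        while start != -1:
            if len(entity) == 1:
                tags[start] = "S"
            else:
                tags[start] = "B"
                for k in range(start + 1, start + len(entity)):
                    tags[k] = "I"
            start = sentence.find(entity, start + len(entity))
    return tags
```

Typed labels such as B-Herb or B-Location could be produced the same way by passing an entity type alongside each string.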

Finally, the results of the two steps above are combined to obtain the annotated corpus. Its storage format in an actual file is shown in Fig. 2: each row stores one character; the first column is the character from the original sentence; the second column is the code assigned to the character according to the rule base; and the third column is the BIO sequence label indicating whether the character belongs to a Chinese medicine chemical component. Applying the operations of the above example to all of the collected documents yields a large text set in the storage format of Fig. 2. This text set is the corpus needed for model training and testing and is used as the input data for training the model. The structure of the model is shown in Fig. 3, and it is trained as follows:
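The three-column storage format can be sketched as below. The tab separator and the function name `format_rows` are assumptions, since Fig. 2 is not reproduced here; only the column order (character, rule code, BIO label) comes from the description.

```python
def format_rows(chars, codes, tags):
    """Join character, rule code, and BIO label into one row per character."""
    if not (len(chars) == len(codes) == len(tags)):
        raise ValueError("columns must have the same length")
    return "\n".join(
        f"{ch}\t{code}\t{tag}" for ch, code, tag in zip(chars, codes, tags)
    )
```

Sentence boundaries could be marked with a blank row between sentences, a common convention for sequence-labelling corpora, though the source does not say which convention is used.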

Step 1: the constructed corpus is used as the model training input and fed into the forward layer and the backward layer of the BiLSTM, so that the preceding and following context of the current word vector is obtained at the same time. The vector $h_t$ denotes the concatenation of the bidirectional outputs $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ and is the output of the BiLSTM at time $t$, obtained from formulas (1)-(4):

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{1}$$

$$c_t = (1 - i_t) \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \tag{2}$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{3}$$

$$h_t = o_t \odot \tanh(c_t) \tag{4}$$

where $\sigma$ is a nonlinear function; $\{W_{xi}, W_{hi}, W_{ci}, W_{xc}, W_{hc}, W_{xo}, W_{ho}, W_{co}\}$ are the LSTM parameter matrices; $\{b_i, b_c, b_o\}$ are bias terms; $i_t$ and $o_t$ are the input gate and the output gate of the BiLSTM respectively; $\odot$ is the element-wise product; $c$ is the state of each memory cell in the BiLSTM; and $h_t$ is the final output.
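Formulas (1)-(4) describe an LSTM variant in which the forget gate is coupled to the input gate as $1 - i_t$ and the gates see the cell state. A minimal pure-Python sketch of one step and of the bidirectional pass follows. It assumes $\sigma$ is the logistic sigmoid, takes the peephole matrices `W_ci` and `W_co` as full matrices, and starts both directions from zero states; the helper names and the parameter dictionary layout are hypothetical.

```python
import math


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of formulas (1)-(4); p maps parameter names to values."""
    # (1) input gate, with a peephole connection to the previous cell state
    i_t = sigmoid(add(matvec(p["W_xi"], x_t), matvec(p["W_hi"], h_prev),
                      matvec(p["W_ci"], c_prev), p["b_i"]))
    # (2) new cell state; (1 - i_t) plays the role of the forget gate
    cand = [math.tanh(z) for z in add(matvec(p["W_xc"], x_t),
                                      matvec(p["W_hc"], h_prev), p["b_c"])]
    c_t = [(1.0 - i) * c + i * g for i, c, g in zip(i_t, c_prev, cand)]
    # (3) output gate, with a peephole connection to the new cell state
    o_t = sigmoid(add(matvec(p["W_xo"], x_t), matvec(p["W_ho"], h_prev),
                      matvec(p["W_co"], c_t), p["b_o"]))
    # (4) hidden output
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def bilstm(xs, fwd, bwd, hidden):
    """Run forward and backward passes and concatenate them per position."""
    def run(seq, p):
        h, c, outputs = [0.0] * hidden, [0.0] * hidden, []
        for x in seq:
            h, c = lstm_step(x, h, c, p)
            outputs.append(h)
        return outputs

    forward = run(xs, fwd)
    backward = run(xs[::-1], bwd)[::-1]   # re-align to the original order
    return [hf + hb for hf, hb in zip(forward, backward)]
```

Each output vector has length `2 * hidden`, the forward half followed by the backward half.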

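Step 2: the representation of each word in the text is obtained with an attention mechanism. The attention $\alpha_i$ that the i-th word should be allocated over the full text is calculated by the following formulas:

$$\mathrm{energy}_i = f(\mathrm{attended}, \mathrm{state}_i, W) \tag{5}$$

$$\alpha_i = \mathrm{softmax}(\mathrm{energy}_i) \tag{6}$$

where attended is the combination of word vectors; $\mathrm{state}_i$ is the element of that combination corresponding to the i-th word; $W$ is a weight coefficient learned during model training; and the function $f$ computes the correlation between state and attended, using the Manhattan distance as the similarity measure:

$$d(a, b) = \sum_i |a_i - b_i| \tag{7}$$

where $a$ and $b$ denote two word vectors, and $a_i$ and $b_i$ are the i-th elements of $a$ and $b$ respectively.

In addition, source denotes the output obtained by processing the entire article with the BiLSTM.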
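The attention step can be sketched as follows. Several choices here are assumptions, because the text does not spell them out: the energy is taken as the negative Manhattan distance (so that closer states receive more attention), the weight coefficient W is omitted, and glimpse is taken as the attention-weighted sum of the BiLSTM outputs. The function names are hypothetical.

```python
import math


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(states, i):
    """Return (alphas, glimpse) for word i over the whole article.

    states: BiLSTM outputs for every word of the article ("source").
    """
    # Assumed energy: negative distance between word i and each word j.
    energies = [-manhattan(states[i], s) for s in states]
    alphas = softmax(energies)
    dim = len(states[0])
    # Assumed glimpse: attention-weighted sum of the BiLSTM outputs.
    glimpse = [sum(a * s[d] for a, s in zip(alphas, states)) for d in range(dim)]
    return alphas, glimpse
```

With this choice, repeated mentions of the same chemical name elsewhere in the article, whose states are close to each other, reinforce one another.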

Step 3: the context representation of the word over the full text is combined with the context of its adjacent words and mapped through the tanh nonlinear function (the Tanh layer in Fig. 3); the output is denoted context:

$$\mathrm{context}_i = \tanh(\mathrm{glimpse}_i, \mathrm{source}_i, U) \tag{9}$$

where $\mathrm{context}_i$ is the content of unit $i$ of the attention layer in Fig. 3, and $U$ is a weight parameter learned during model training.

Step 4: a conditional random field (CRF) is used to obtain the label sequence of the entire article, by computing the total score of the entire article under a given label sequence.

$\theta'$ denotes all the parameters that the whole model needs to learn, comprising the original BiLSTM parameters and the label transition matrix $A$, whose entries give the score of transitioning from label $[m]_{t-1}$ to label $[m]_t$. The Softmax function is used to compute the probability $p$ that a word is assigned its true label. The model parameters are trained by maximizing the log-likelihood and are optimized by gradient descent. Gradient descent is a commonly used optimization method in machine learning; it is a solution strategy for optimizing the objective functions of different models. Its purpose is to find an optimal solution, and computing that solution optimizes the parameters.
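The score and log-likelihood of Step 4 can be sketched with a standard linear-chain CRF. The representation is an assumption: `emissions[t][k]` stands for the network's score of label `k` at position `t`, `transitions[p][q]` for the entry of A from label `p` to label `q`, and labels are integer indices. No start or end transitions are modelled, which the source does not mention either way.

```python
import math


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emissions, transitions, labels):
    """Total score of one label sequence: emissions plus transitions."""
    score = emissions[0][labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return score


def log_partition(emissions, transitions):
    """Log of the summed exp-scores of all label sequences (forward algorithm)."""
    n = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[p] + transitions[p][q] for p in range(n)])
            + emissions[t][q]
            for q in range(n)
        ]
    return logsumexp(alpha)


def log_likelihood(emissions, transitions, labels):
    """Log-probability of the given label sequence; training maximizes this."""
    return sequence_score(emissions, transitions, labels) - log_partition(
        emissions, transitions
    )
```

The forward algorithm keeps the normalization linear in sequence length instead of enumerating every label sequence.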

Step 5: the Viterbi algorithm is used to find the optimal label sequence, the maximum being taken over all possible label sequences.
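Viterbi decoding can be sketched as follows, under the same assumed representation as a linear-chain CRF: `emissions[t][k]` is the score of label `k` at position `t`, `transitions[p][q]` is the score of moving from label `p` to label `q`, and labels are integer indices. Mapping indices back to B, I, O, S is left out.

```python
def viterbi(emissions, transitions):
    """Return (best label sequence, its score) by dynamic programming."""
    n = len(emissions[0])
    best = list(emissions[0])     # best score of any path ending in each label
    back = []                     # back-pointers, one list per position t >= 1
    for t in range(1, len(emissions)):
        pointers, scores = [], []
        for q in range(n):
            # Best previous label for arriving at label q.
            p_best = max(range(n), key=lambda p: best[p] + transitions[p][q])
            pointers.append(p_best)
            scores.append(best[p_best] + transitions[p_best][q] + emissions[t][q])
        back.append(pointers)
        best = scores
    # Trace the best path backwards from the best final label.
    last = max(range(n), key=lambda q: best[q])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, max(best)
```

The search costs on the order of T·n² operations for T positions and n labels, rather than the nᵀ needed to score every sequence.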

The present invention also provides a device for named entity recognition of Chinese medicine chemical components. As shown in Fig. 4, the device comprises an acquisition unit, a preprocessing unit, a coding and annotation unit, a model training unit, and a recognition unit, each as described above.

The acquisition unit comprises a search unit and an extraction unit.

The coding and annotation unit comprises a coding unit and an annotation unit.

The present invention also provides an apparatus for named entity recognition of Chinese medicine chemical components, comprising a processor and a memory storing computer program code; when the computer program code is run by the processor, the apparatus is caused to execute the method for named entity recognition of Chinese medicine chemical components according to the present invention.

The present invention also provides a computer-readable storage medium storing program code which, when executed, implements the method for named entity recognition of Chinese medicine chemical components according to the present invention.

Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in those embodiments may still be modified, and some of their technical features may be replaced by equivalents; such modifications and replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for named entity recognition of chemical components of traditional Chinese medicine, characterized by comprising the following steps:

S1: obtaining literature related to Chinese medicine chemical component named entities;

S2: filtering the content of the obtained literature to obtain a corpus with standardized text content;

S3: encoding and annotating the corpus to obtain an annotated corpus;

S4: training a BiLSTM with the annotated corpus as training samples to obtain a trained BiLSTM;

S5: inputting the literature in which Chinese medicine chemical component named entities are to be recognized into the trained BiLSTM, so as to recognize those entities.

2. The method according to claim 1, characterized in that S1 comprises:

S11: searching a Chinese periodical literature database by keyword and downloading the documents in PDF format; and searching Baidu Baike and saving the retrieved information as TXT files using a web crawler, so as to obtain a document set of Chinese medicine chemical component named entities;

S12: extracting the document content from the document set.

3. The method according to claim 1, characterized in that S3 comprises:

S31: encoding the corpus according to certain rules to obtain an encoded corpus;

S32: annotating the encoded corpus according to certain rules to distinguish Chinese medicine chemical component named entities from all other text, so as to obtain the annotated corpus.

4. The method according to claim 3, characterized in that the rules are formed as follows:

performing feature extraction on a large number of chemical component name samples to obtain feature attributes comprising chemical element names, chemical terms, chemical prepositions, specific prefixes, specific suffixes, numerals denoting a serial number, Chinese characters denoting a serial number, letters denoting a serial number, Chinese medicine names, regions, groups, and symbols;

defining a distinct character for each feature attribute, so as to form a rule lookup table in which characters and feature attributes correspond to one another.

5. The method according to claim 1, characterized in that step S4 comprises:

S41: the annotated corpus is used as the model training input and fed into the forward layer and the backward layer of the BiLSTM, so that the preceding and following context of the current word vector is obtained at the same time; the vector $h_t$ denotes the concatenation of the bidirectional outputs $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ and is the output of the BiLSTM at time $t$, obtained from formulas (1)-(4):

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{1}$$

$$c_t = (1 - i_t) \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \tag{2}$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{3}$$

$$h_t = o_t \odot \tanh(c_t) \tag{4}$$

where $\sigma$ is a nonlinear function; $\{W_{xi}, W_{hi}, W_{ci}, W_{xc}, W_{hc}, W_{xo}, W_{ho}, W_{co}\}$ are the LSTM parameter matrices; $\{b_i, b_c, b_o\}$ are bias terms; $i_t$ and $o_t$ are the input gate and the output gate of the BiLSTM respectively; $\odot$ is the element-wise product; $c$ is the state of each memory cell in the BiLSTM; and $h_t$ is the final output;

S42: the representation of each word in the text is obtained with an attention mechanism, and the attention $\alpha_i$ that the i-th word should be allocated over the full text is calculated by the following formulas:

$$\mathrm{energy}_i = f(\mathrm{attended}, \mathrm{state}_i, W) \tag{5}$$

$$\alpha_i = \mathrm{softmax}(\mathrm{energy}_i) \tag{6}$$

where attended is the combination of word vectors; $\mathrm{state}_i$ is the element of that combination corresponding to the i-th word; $W$ is a weight coefficient; and the function $f$ computes the correlation between state and attended, using the Manhattan distance as the similarity measure:

$$d(a, b) = \sum_i |a_i - b_i| \tag{7}$$

where $a$ and $b$ denote two word vectors, and $a_i$ and $b_i$ are the i-th elements of $a$ and $b$ respectively;

S43: the context representation of the word over the full text is combined with the context of its adjacent words and mapped through the tanh nonlinear function; the output is denoted context:

$$\mathrm{context}_i = \tanh(\mathrm{glimpse}_i, \mathrm{source}_i, U) \tag{9}$$

S44: a conditional random field is used to obtain the label sequence of the entire article, by computing the total score of the entire article under a given label sequence;

$\theta'$ denotes all the parameters that the whole model needs to learn, comprising the original BiLSTM parameters and the label transition matrix $A$, whose entries give the score of transitioning from label $[m]_{t-1}$ to label $[m]_t$; the Softmax function is used to compute the probability $p$ that a word is assigned its true label; the model parameters are trained by maximizing the log-likelihood and are optimized by gradient descent;

S45: the Viterbi algorithm is used to find the optimal label sequence, the maximum being taken over all possible label sequences.

6. A device for named entity recognition of chemical components of traditional Chinese medicine, characterized by comprising:

an acquisition unit, configured to obtain literature related to Chinese medicine chemical component named entities;

a preprocessing unit, configured to filter the content of the obtained literature to obtain a corpus with standardized text content;

a coding and annotation unit, configured to encode and annotate the corpus to obtain an annotated corpus;

a model training unit, configured to train a BiLSTM with the annotated corpus as training samples to obtain a trained BiLSTM;

a recognition unit, configured to input the literature in which Chinese medicine chemical component named entities are to be recognized into the trained BiLSTM, so as to recognize those entities.

7. The device according to claim 6, characterized in that the acquisition unit comprises:

a search unit, configured to search a Chinese periodical literature database by keyword and download the documents in PDF format, and to search Baidu Baike and save the retrieved information as TXT files using a web crawler, so as to obtain a document set of Chinese medicine chemical component named entities;

an extraction unit, configured to extract the document content from the document set.

8. The device according to claim 6, characterized in that the coding and annotation unit comprises:

a coding unit, configured to encode the corpus according to certain rules to obtain an encoded corpus;

an annotation unit, configured to annotate the encoded corpus according to certain rules to distinguish Chinese medicine chemical component named entities from all other text, so as to obtain the annotated corpus.

9. An apparatus for named entity recognition of chemical components of traditional Chinese medicine, characterized by comprising a processor and a memory storing computer program code; when the computer program code is run by the processor, the apparatus is caused to execute the method according to any one of claims 1 to 5.

10. A computer-readable storage medium, characterized in that program code is stored on the computer-readable storage medium and, when executed, implements the method according to any one of claims 1 to 5.