CN110223737A - Method and device for recognizing named entities of chemical components of Chinese materia medica - Google Patents

Method and device for recognizing named entities of chemical components of Chinese materia medica

Info

Publication number
CN110223737A
Authority
CN
China
Prior art keywords
chemical composition
materia medica
chinese materia
corpus
bilstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910512263.3A
Other languages
Chinese (zh)
Inventor
刘勇国
蒋羽
李杨
何家欢
蔡茁
杨尚明
李巧勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201910512263.3A
Publication of CN110223737A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70: Machine learning, data mining or chemometrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention provides a method and device for recognizing named entities of chemical components of Chinese materia medica. The method comprises the following steps: S1: obtaining literature related to chemical component named entities of Chinese materia medica; S2: filtering the content of the obtained literature to obtain a corpus with standardized text content; S3: encoding and annotating the corpus to obtain an annotated corpus; S4: training a BiLSTM with the annotated corpus as training samples to obtain a trained BiLSTM; S5: inputting the literature in which chemical component named entities of Chinese materia medica are to be recognized into the trained BiLSTM, so as to recognize those named entities. The method applies named entity recognition based on deep neural networks to the recognition of chemical components of Chinese materia medica and is more efficient than manual recognition. It also provides a data source for building a basic database of chemical component named entities of Chinese materia medica.

Description

Method and device for recognizing named entities of chemical components of Chinese materia medica
Technical field
The present invention relates to the technical field of information processing, and in particular to a method and device, based on a deep neural network, for recognizing named entities of chemical components of Chinese materia medica.
Background
A chemical component named entity of Chinese materia medica is the name of a chemical component contained in a Chinese medicine, such as glycycoumarin, glucoperiplocymarin, dihydrocaffeic acid or physcion-8-O-β-D-glucoside. Such names distinguish different components from one another and follow certain naming rules.
Research on chemical component named entities of Chinese materia medica is currently scattered, and there is no standard database of these entities to help researchers obtain the latest progress on the chemical components of Chinese medicines quickly and efficiently. There is therefore an urgent need to organize these named entities and to recognize them effectively in a large body of scattered research literature.
With the development of natural language processing, named entity recognition techniques have begun to be used to identify the chemical components of Western medicines in biomedical literature. Common approaches include pattern matching, machine learning and deep neural networks, or a combination of these. The chemical names of Western medicines strictly follow compound naming rules and are therefore standardized. The names of the natural medicinal components of Chinese medicines are formed differently. For example, many of them contain special prefixes or suffixes, or are derived from the name of the source plant, and some components are even known by a common name. As a result, named entity recognition has not yet been applied to the recognition of chemical components of Chinese materia medica. These components are still organized manually, which is inefficient and hinders the construction of a standard database of chemical component named entities of Chinese materia medica.
Summary of the invention
The object of the present invention is to solve the above problems of the prior art by providing a method and system, based on a deep neural network, for recognizing named entities of chemical components of Chinese materia medica, thereby solving the problem of recognizing chemical component names that do not follow standardized naming.
A method for recognizing named entities of chemical components of Chinese materia medica comprises the following steps:
S1: obtaining literature related to chemical component named entities of Chinese materia medica;
S2: filtering the content of the obtained literature to obtain a corpus with standardized text content;
S3: encoding and annotating the corpus to obtain an annotated corpus;
S4: training a BiLSTM with the annotated corpus as training samples to obtain a trained BiLSTM;
S5: inputting the literature in which chemical component named entities of Chinese materia medica are to be recognized into the trained BiLSTM, so as to recognize those named entities.
Further, in the method described above, S1 comprises:
S11: searching a Chinese periodical literature database by keyword and downloading the documents in PDF format; and searching Baidu Baike and saving the retrieved information as TXT text by means of a web crawler, so as to obtain a document set of chemical component named entities of Chinese materia medica;
S12: extracting the document content from the document set.
Further, in the method described above, S3 comprises:
encoding the corpus according to certain rules to obtain an encoded corpus;
annotating the encoded corpus according to certain rules to distinguish chemical component named entities of Chinese materia medica from other text, so as to obtain the annotated corpus.
Further, in the method described above, the rules are formed as follows:
feature extraction is performed on a large number of chemical component name samples to obtain feature attributes comprising chemical element names, chemical special terms, chemical prepositions, specific prefixes, specific suffixes, numerals denoting sequence numbers, Chinese characters denoting sequence numbers, letters denoting sequence numbers, Chinese medicine names, region indicators, groups and symbols;
each feature attribute is assigned its own character, forming a rule lookup table in which characters and feature attributes correspond to one another.
Further, in the method described above, step S4 comprises:
S41: The annotated corpus is used as the input for model training and is fed to the forward layer and the backward layer of the BiLSTM, so that the context on both sides of the current word vector is obtained at the same time. The vector formed by concatenating the forward output and the backward output is the output of the BiLSTM at time $t$, obtained by formulas (1)-(4):
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$ (1)
$c_t = (1 - i_t) \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$ (2)
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$ (3)
$h_t = o_t \odot \tanh(c_t)$ (4)
where $\sigma$ is a nonlinear function; $\{W_{xi}, W_{hi}, W_{ci}, W_{xc}, W_{hc}, W_{xo}, W_{ho}, W_{co}\}$ are the parameter matrices of the LSTM; $\{b_i, b_c, b_o\}$ are the bias terms; $i_t$ and $o_t$ are the input gate and the output gate of the BiLSTM; $\odot$ is the element-wise product; $c$ is the state of each memory cell in the BiLSTM; and $h_t$ is the final output;
S42: The representation of each word in the text is obtained with an attention mechanism. The attention $\alpha_i$ that the $i$-th word should receive over the whole text is calculated by:
$\mathrm{energy}_i = f(\mathrm{attended}, \mathrm{state}_i, W)$ (5)
$\alpha_i = \mathrm{softmax}(\mathrm{energy}_i)$ (6)
where attended is the combination of word vectors; $\mathrm{state}_i$ is the element of that combination corresponding to the $i$-th word; $W$ is a weight coefficient; and the function $f$ calculates the correlation between state and attended, using the Manhattan distance as the similarity measure:
$d(a, b) = \sum_i |a_i - b_i|$ (7)
where $a$ and $b$ are two word vectors, and $a_i$ and $b_i$ are the $i$-th elements of $a$ and $b$ respectively;
In addition, source denotes the output obtained by processing the whole article with the BiLSTM;
The context representation of the current word over the whole text is then obtained and defined as $\mathrm{glimpse}_i$.
S43: The context representation of the word over the whole text is combined with the context of its adjacent words and mapped through the tanh nonlinear function; the output is denoted $\mathrm{context}_i$:
$\mathrm{context}_i = \tanh(\mathrm{glimpse}_i, \mathrm{source}_i, U)$ (9)
where $\mathrm{context}_i$ is the content of unit $i$ of the attention layer, and $U$ is a weight parameter trained with the model;
S44: A conditional random field is used to obtain the label sequence of the whole article. The total score of the whole article under a given label sequence is calculated.
$\theta'$ denotes all the parameters that the whole model needs to learn, comprising the original BiLSTM parameters and the label transition matrix $A$, whose entries give the score of a transition from label $[m]_{t-1}$ to label $[m]_t$. The Softmax function is used to calculate the probability $p$ that a word is assigned its true label. The model parameters are trained by maximizing the log-likelihood and optimized by gradient descent;
Here the likelihood is defined over the true label sequence, the sentence, and any possible label sequence;
S45: The Viterbi algorithm is used to find the optimal label sequence, that is, the highest-scoring sequence among all possible label sequences.
A device for recognizing named entities of chemical components of Chinese materia medica comprises:
an acquisition unit, configured to obtain literature related to chemical component named entities of Chinese materia medica;
a preprocessing unit, configured to filter the content of the obtained literature to obtain a corpus with standardized text content;
an encoding and annotation unit, configured to encode and annotate the corpus to obtain an annotated corpus;
a model training unit, configured to train a BiLSTM with the annotated corpus as training samples to obtain a trained BiLSTM;
a recognition unit, configured to input the literature in which chemical component named entities of Chinese materia medica are to be recognized into the trained BiLSTM, so as to recognize those named entities.
Further, in the device described above, the acquisition unit comprises:
a search unit, configured to search a Chinese periodical literature database by keyword and download the documents in PDF format, and to search Baidu Baike and save the retrieved information as TXT text by means of a web crawler, so as to obtain a document set of chemical component named entities of Chinese materia medica;
an extraction unit, configured to extract the document content from the document set.
Further, in the device described above, the encoding and annotation unit comprises:
an encoding unit, configured to encode the corpus according to certain rules to obtain an encoded corpus;
an annotation unit, configured to annotate the encoded corpus according to certain rules to distinguish chemical component named entities of Chinese materia medica from other text, so as to obtain the annotated corpus.
An apparatus for recognizing named entities of chemical components of Chinese materia medica comprises a processor and a memory storing computer program code; when the computer program code is run by the processor, the computing apparatus is caused to perform the method for recognizing chemical component named entities of Chinese materia medica according to any one of the above.
A computer-readable storage medium stores program code which, when executed, implements the method for recognizing chemical component named entities of Chinese materia medica according to any one of the above.
Beneficial effects:
For the large volume of scientific literature that contains chemical component named entities of Chinese materia medica, the present invention extracts the body text of the relevant documents and standardizes it, builds a corpus and a rule base of chemical component named entities of Chinese materia medica, establishes a deep neural network model, trains the model, and uses the trained model to recognize the named entities. The method applies named entity recognition based on deep neural networks to the recognition of chemical components of Chinese materia medica and is more efficient than manual recognition. It also provides a data source for building a basic database of chemical component named entities of Chinese materia medica.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention for recognizing chemical component named entities of Chinese materia medica;
Fig. 2 shows the storage format of the corpus after encoding and annotation in an embodiment of the present invention;
Fig. 3 is a structure diagram of the BiLSTM in an embodiment of the present invention;
Fig. 4 is a structure diagram of the device of the present invention for recognizing chemical component named entities of Chinese materia medica.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the invention are described clearly and completely below. The described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Embodiment 1:
Fig. 1 is a flow chart of the method of the present invention for recognizing chemical component named entities of Chinese materia medica. As shown in Fig. 1, the method comprises the following steps:
S1: obtaining literature related to chemical component named entities of Chinese materia medica;
S2: filtering the content of the obtained literature to obtain a corpus with standardized text content;
S3: encoding and annotating the corpus to obtain an annotated corpus;
S4: training a BiLSTM with the annotated corpus as training samples to obtain a trained BiLSTM;
S5: inputting the literature in which chemical component named entities of Chinese materia medica are to be recognized into the trained BiLSTM, so as to recognize those named entities.
For the large volume of scientific literature that contains chemical component named entities of Chinese materia medica, the present invention extracts the body text of the relevant documents and standardizes it, builds a corpus and a rule base of chemical component named entities of Chinese materia medica, establishes a deep neural network model, trains the model, and uses the trained model to recognize the named entities. The method applies named entity recognition based on deep neural networks to the recognition of chemical components of Chinese materia medica and is more efficient than manual recognition. It also provides a data source for building a basic database of chemical component named entities of Chinese materia medica.
Embodiment 2:
Preferably, S1 comprises:
S11: searching a Chinese periodical literature database by keyword and downloading the documents in PDF format; and searching Baidu Baike and saving the retrieved information as TXT text by means of a web crawler, so as to obtain a document set of chemical component named entities of Chinese materia medica;
Specifically, this embodiment obtains literature that analyses the chemical components of Chinese medicines from several sources. For example, a Chinese periodical literature database (such as Wanfang or CNKI) is searched with keywords such as "component" and "Chinese medicine" (or the name of a specific Chinese medicine), and the documents are downloaded in PDF format; Baidu Baike is searched, and the retrieved information is saved as TXT text by means of a web crawler.
S12: extracting the document content from the document set.
Specifically, different types of input document require different text extraction methods. For example, Word documents can be read in Java with the Apache POI library, and the text content of PDF documents can be extracted in Java with the PDFBox toolkit.
Step S2 comprises document standardization. Because the documents read are of different types, their formats differ and they contain interfering information. The text of a scientific paper, for example, includes journal information, author information, postcodes, e-mail addresses, a reference list and so on. This information must be filtered out so that only the body text is kept. The extracted text content is therefore standardized so that it meets the input requirements of the chemical name recognition model.
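The filtering in step S2 can be illustrated with a minimal Python sketch. The regular expressions, the heading list and the function name `filter_body_text` are assumptions made for illustration and are not taken from the patent; a real corpus would need patterns tuned to its journals.

```python
import re

# Hypothetical patterns for non-body lines (assumptions, not the patent's rules).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
POSTCODE = re.compile(r"\b\d{6}\b")  # mainland China postcodes have six digits
REFERENCE_HEADINGS = ("References", "参考文献")


def filter_body_text(lines):
    """Keep body lines; drop e-mail and postcode lines and the reference list."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped in REFERENCE_HEADINGS:
            break  # assume the bibliography runs to the end of the document
        if EMAIL.search(stripped) or POSTCODE.search(stripped):
            continue  # affiliation or contact line
        kept.append(stripped)
    return kept
```

The postcode pattern is deliberately crude: it would also drop a body line that happens to contain a six-digit number, so a production filter would restrict it to the front matter.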
Embodiment 3:
Step S3 comprises:
S31: encoding the corpus according to certain rules to obtain an encoded corpus;
S32: annotating the encoded corpus according to certain rules to distinguish chemical component named entities of Chinese materia medica from other text, so as to obtain the annotated corpus.
Specifically, a rule base must be established before encoding, and the corpus is then encoded with this rule base.
The rule base is constructed as follows:
A corpus annotation expert group is set up. A large amount of text containing information on Chinese medicines and their chemical components is collected, and a standard corpus is built and a standard rule base extracted by manual annotation. First, a statistical analysis of a large number of chemical component names shows that these names follow clear word-formation rules. Feature extraction on a large number of sample names shows that component names contain symbols, Chinese numerals, English letters, Greek letters, Arabic numerals, chemical element names, chemical special terms, specific prefixes and suffixes, and so on. By assigning a code to each type, a chemical name can be converted into a character code. That is, the rules are formed as follows: feature extraction is performed on a large number of chemical component name samples to obtain feature attributes comprising chemical element names, chemical special terms, chemical prepositions, specific prefixes, specific suffixes, numerals denoting sequence numbers, Chinese characters denoting sequence numbers, letters denoting sequence numbers, Chinese medicine names, region indicators, groups and symbols; each feature attribute is assigned its own character, forming a rule lookup table in which characters and feature attributes correspond to one another, as shown in Table 1:
Table 1. Rule lookup table
Label | Feature type | Examples
A | Chemical element | hydrogen, helium, lithium, silver, platinum, gold ...
B | Chemical special term | 酸 (acid), 胺 (amine), 脂 (lipid), 酮 (ketone), 苷 (glycoside), 甙 (glucoside) ...
C | Chemical preposition | 化, 合, 代, 聚, 去 ...
D | Specific prefix | 亚, 过, 偏, 原 ...
E | Specific suffix | 素, 宁, 定, 灵, 精 ...
F | Numeral denoting a sequence number | Arabic numerals, Roman numerals, Chinese numerals
G | Chinese character denoting a sequence number | the Heavenly Stems
H | Letter denoting a sequence number | English letters, Greek letters
I | Chinese medicine name | ginseng, Radix Glycyrrhizae, Radix Salviae Miltiorrhizae ...
J | Region indicator | 川, 华, 南美 (South America), 日, 澳 (Australia) ...
K | Group | hydroxyl, carboxyl, nitro, phenyl ...
L | Symbol | ',', '-' ...
…… | …… | ……
Based on the rule lookup table above, the extracted document content is encoded as follows: once the document content has been standardized, each character in the document is assigned a label according to the rules in Table 1. Take as an example the following sentence containing Chinese medicine components: "Twelve compounds were isolated from the ethyl acetate extract fraction of bletilla and identified as physcion (1), erythroglaucin (2), ...". Encoded one character at a time, the sentence becomes "from/O it is white/I1 and/I2 /O second/G acid/B second/G ester/B extraction/O takes/portion O/O/O points/O obtains from/O/O to/O 1/F 2/F is a/Oization/C conjunction/O object/O ,/L point/O not/O mirror/O is fixed/E is/O is big/I1 Huang/I2 element/E first/G ether/B (/L 1/F )/L ,/L is red/O ash/O blueness/O element/E (/L 2/F )/L .../L". This completes the encoding of the sentence; encoding a large number of sentences in this way produces the encoded corpus.
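The character-level encoding described above can be sketched in Python as a dictionary lookup. The rule excerpt below is a hypothetical illustration: the characters chosen for each label are assumptions, the multi-character Chinese medicine names (labels I1, I2 in the example) are left out, and unknown characters fall back to O.

```python
# Hypothetical excerpt of a rule table; the characters are illustrative
# assumptions, not the patent's actual rule base.
RULES = {
    "B": "酸胺酮苷醚酯",   # chemical special terms
    "C": "化合代聚去",     # chemical prepositions
    "E": "素宁定灵精",     # specific suffixes
    "G": "甲乙丙丁",       # Heavenly Stems used as sequence markers
    "L": ",，、()（）-",   # symbols
}
# Invert the table so each character maps to its label.
CHAR_TO_LABEL = {ch: label for label, chars in RULES.items() for ch in chars}


def encode(text):
    """Return a (character, label) pair for every character of the text."""
    pairs = []
    for ch in text:
        if ch.isdigit():
            label = "F"  # numeral denoting a sequence number
        else:
            label = CHAR_TO_LABEL.get(ch, "O")  # O: no rule applies
        pairs.append((ch, label))
    return pairs
```

With this excerpt, the four characters of "ethyl acetate" in Chinese (乙酸乙酯) receive the labels G, B, G, B, which agrees with the encoded example in the text.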
Next, the BIO rule is used to mark which parts of a sentence are chemical components and which are not. Under the BIO rule, the preprocessed text is manually annotated with the labels {B, I, O, S} to generate a label sequence, where B marks the beginning of an entity, I marks the remaining characters of the entity after its beginning, O marks everything else, and S marks an entity that consists of a single character. For example, the sentence "Shandong plants purple-flowered danshen" has the label sequence {B-Location, I-Location, O, O, B-Herb, I-Herb, I-Herb, I-Herb}, one label per character, where "B-Location, I-Location" marks a location entity ("Shandong"), "B-Herb, I-Herb, I-Herb, I-Herb" marks a herb entity ("purple-flowered danshen"), and O marks the irrelevant characters.
The corpus is annotated with the BIO rule so that each part is marked as chemical component or non-chemical component. Taking the same sentence as before, "Twelve compounds were isolated from the ethyl acetate extract fraction of bletilla and identified as physcion (1), erythroglaucin (2), ...", the annotation result is "from/O it is white/O and/O /O second/O acid/O second/O ester/O extraction/O takes/and the portion O/O/O points/O obtains from/O/and O to/O 1/O 2/O/Oization/O conjunction/O object/O ,/O points/O is other/O mirror/O is fixed/and O is/O is big/and B Huang/I element/I first/I ether/I (/O 1/O )/O ,/O be red/B ash/I blueness/I element/I (/O 2/O )/O .../O".
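The B/I/O part of this labelling can be sketched in Python. The patent describes manual annotation; the function below is only an illustration that projects a known list of entity strings onto the characters of a sentence. The name `bio_tags` is hypothetical, and the S label and entity types such as Herb or Location are left out for brevity.

```python
def bio_tags(text, entities):
    """Tag each character: B starts an entity, I continues it, O is anything else."""
    tags = ["O"] * len(text)
    for entity in entities:
        if not entity:
            continue  # an empty string would match everywhere
        start = text.find(entity)
        while start != -1:
            tags[start] = "B"
            for k in range(start + 1, start + len(entity)):
                tags[k] = "I"
            # continue searching after this occurrence
            start = text.find(entity, start + len(entity))
    return tags
```

For the five-character Chinese name of physcion this yields B, I, I, I, I, the same pattern as in the annotated example above.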
Finally, the results of the two steps above are combined to give the annotated corpus, which is stored in the actual file as shown in Fig. 2. Each row stores one character: the first column is the character from the original sentence; the second column is its code according to the rule base; and the third column is its BIO label, which indicates whether the character belongs to a chemical component of Chinese materia medica. Applying the same procedure to all of the collected documents yields a large text set in this storage format. This text set is the corpus required for model training and testing, and it is fed to the model as input data for training. The structure of the model is shown in Fig. 3, and the model is trained as follows:
Step 1: The constructed corpus is used as the input for model training and is fed to the forward layer and the backward layer of the BiLSTM, so that the context on both sides of the current word vector is obtained at the same time. The vector formed by concatenating the forward output and the backward output is the output of the BiLSTM at time $t$, obtained by formulas (1)-(4).
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$, (1)
$c_t = (1 - i_t) \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$, (2)
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$, (3)
$h_t = o_t \odot \tanh(c_t)$. (4)
where $\sigma$ is a nonlinear function; $\{W_{xi}, W_{hi}, W_{ci}, W_{xc}, W_{hc}, W_{xo}, W_{ho}, W_{co}\}$ are the parameter matrices of the LSTM; $\{b_i, b_c, b_o\}$ are the bias terms; $i_t$ and $o_t$ are the input gate and the output gate of the BiLSTM; $\odot$ is the element-wise product; $c$ is the state of each memory cell in the BiLSTM; and $h_t$ is the final output;
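Formulas (1)-(4) can be traced step by step in a small pure-Python sketch. Several points are assumptions made for illustration: $\sigma$ is taken to be the logistic sigmoid, the peephole weights $W_{ci}$ and $W_{co}$ are treated as full matrices, the initial states are zero, and the parameter dictionary and function names are hypothetical.

```python
import math


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def add(*vectors):
    return [sum(xs) for xs in zip(*vectors)]


def mul(a, b):
    return [x * y for x, y in zip(a, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def tanh(v):
    return [math.tanh(x) for x in v]


def zero_params(n_in, n_hidden):
    """All-zero parameters, handy for checking shapes."""
    z = lambda rows, cols: [[0.0] * cols for _ in range(rows)]
    p = {k: z(n_hidden, n_in) for k in ("Wxi", "Wxc", "Wxo")}
    p.update({k: z(n_hidden, n_hidden) for k in ("Whi", "Wci", "Whc", "Who", "Wco")})
    p.update({k: [0.0] * n_hidden for k in ("bi", "bc", "bo")})
    return p


def lstm_step(x, h_prev, c_prev, p):
    # (1) input gate, with a peephole connection to c_{t-1}
    i = sigmoid(add(matvec(p["Wxi"], x), matvec(p["Whi"], h_prev),
                    matvec(p["Wci"], c_prev), p["bi"]))
    # (2) cell state: (1 - i_t) plays the role of the forget gate
    cand = tanh(add(matvec(p["Wxc"], x), matvec(p["Whc"], h_prev), p["bc"]))
    c = add(mul([1.0 - g for g in i], c_prev), mul(i, cand))
    # (3) output gate, with a peephole connection to c_t
    o = sigmoid(add(matvec(p["Wxo"], x), matvec(p["Who"], h_prev),
                    matvec(p["Wco"], c), p["bo"]))
    # (4) hidden output
    h = mul(o, tanh(c))
    return h, c


def bilstm(xs, fwd, bwd, n_hidden):
    """Run one LSTM left to right and one right to left, then concatenate per position."""
    def run(seq, p):
        h, c, outs = [0.0] * n_hidden, [0.0] * n_hidden, []
        for x in seq:
            h, c = lstm_step(x, h, c, p)
            outs.append(h)
        return outs
    forward = run(xs, fwd)
    backward = run(list(reversed(xs)), bwd)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

Each position of the result has length `2 * n_hidden`, which is the concatenated bidirectional output described in Step 1.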
Step 2: The representation of each word in the text is obtained with an attention mechanism. The attention $\alpha_i$ that the $i$-th word should receive over the whole text is calculated by:
$\mathrm{energy}_i = f(\mathrm{attended}, \mathrm{state}_i, W)$, (5)
$\alpha_i = \mathrm{softmax}(\mathrm{energy}_i)$. (6)
where attended is the combination of word vectors; $\mathrm{state}_i$ is the element of that combination corresponding to the $i$-th word; $W$ is a weight coefficient (trained with the model); and the function $f$ calculates the correlation between state and attended, using the Manhattan distance as the similarity measure:
$d(a, b) = \sum_i |a_i - b_i|$ (7)
where $a$ and $b$ are two word vectors, and $a_i$ and $b_i$ are the $i$-th elements of $a$ and $b$ respectively;
In addition, source denotes the output obtained by processing the whole article with the BiLSTM.
The context representation of the current word over the whole text is then obtained and defined as $\mathrm{glimpse}_i$.
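The attention step can be sketched in Python under explicit assumptions, because the text does not spell out the form of $f$ or of glimpse. Here the energy is taken to be the negative Manhattan distance between a query vector and each state (so that closer vectors receive more attention), the trainable weight $W$ is omitted, and the glimpse is taken to be the attention-weighted sum of the source vectors. All function names are hypothetical.

```python
import math


def manhattan(a, b):
    """Manhattan distance: sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def glimpse(query, states, sources):
    """Return attention weights and the weighted sum of the source vectors."""
    # Assumption: energy = negative distance, so similar states score higher.
    energies = [-manhattan(query, s) for s in states]
    alphas = softmax(energies)
    dim = len(sources[0])
    g = [sum(a * src[k] for a, src in zip(alphas, sources)) for k in range(dim)]
    return alphas, g
```

A state identical to the query has distance 0 and therefore the largest weight; the weights always sum to 1.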
Step 3: The context representation of the word over the whole text is combined with the context of its adjacent words and mapped through the tanh nonlinear function; the output (the Tanh layer in Fig. 3) is denoted $\mathrm{context}_i$:
$\mathrm{context}_i = \tanh(\mathrm{glimpse}_i, \mathrm{source}_i, U)$. (9)
where $\mathrm{context}_i$ is the content of unit $i$ of the Attention layer in Fig. 3, and $U$ is a weight parameter trained with the model.
Step 4: A conditional random field (CRF) is used to obtain the label sequence of the whole article. The total score of the whole article under a given label sequence is calculated.
$\theta'$ denotes all the parameters that the whole model needs to learn, comprising the original BiLSTM parameters and the label transition matrix $A$, whose entries give the score of a transition from label $[m]_{t-1}$ to label $[m]_t$. The Softmax function is used to calculate the probability $p$ that a word is assigned its true label. The model parameters are trained by maximizing the log-likelihood and optimized by gradient descent. Gradient descent is a commonly used optimization method in machine learning; it is a solution strategy that can be applied to the objective functions of different models. Its purpose is to find an optimal solution, and computing that solution is what optimizes the parameters.
Here the likelihood is defined over the true label sequence, the sentence, and any possible label sequence;
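The score and likelihood of Step 4 can be written in the standard linear-chain CRF form. This is a sketch consistent with the definitions above, not the patent's own formulas: the symbols $[w]_1^T$ for the sentence, $[j]_1^T$ for an arbitrary label sequence, and $f_{\theta}([m]_t \mid t)$ for the network's score of label $[m]_t$ at position $t$ are assumptions.

```latex
% Total score of a sentence under a given label sequence
s\big([w]_1^T, [m]_1^T, \theta'\big)
  = \sum_{t=1}^{T} \Big( A_{[m]_{t-1},\,[m]_t} + f_{\theta}\big([m]_t \mid t\big) \Big)

% Probability of the true label sequence (softmax over all label sequences)
p\big([m]_1^T \mid [w]_1^T, \theta'\big)
  = \frac{\exp s\big([w]_1^T, [m]_1^T, \theta'\big)}
         {\sum_{[j]_1^T} \exp s\big([w]_1^T, [j]_1^T, \theta'\big)}

% Log-likelihood maximized during training
\log p\big([m]_1^T \mid [w]_1^T, \theta'\big)
  = s\big([w]_1^T, [m]_1^T, \theta'\big)
  - \log \sum_{[j]_1^T} \exp s\big([w]_1^T, [j]_1^T, \theta'\big)
```

In this form the transition matrix $A$ rewards or penalizes label pairs, which is how a CRF layer can discourage sequences such as an I label directly after an O label.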
Step 5: The Viterbi algorithm is used to find the optimal label sequence, that is, the highest-scoring sequence among all possible label sequences.
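Viterbi decoding over per-position label scores and a transition matrix can be sketched as follows. The function name and the list-of-lists layout are assumptions for illustration: `emissions[t][k]` stands for the network's score of label `k` at position `t`, and `transitions[j][k]` for the score of moving from label `j` to label `k`.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label path and its total score."""
    n_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    back = []                   # back-pointers, one list per later position
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_labels):
            # best previous label for arriving at label k
            best_j = max(range(n_labels), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_j)
            new_score.append(score[best_j] + transitions[best_j][k] + emissions[t][k])
        score = new_score
        back.append(pointers)
    last = max(range(n_labels), key=lambda k: score[k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])  # follow back-pointers to the start
    path.reverse()
    return path, score[last]
```

Dynamic programming keeps only the best path into each label at each position, so the cost grows linearly with sentence length instead of enumerating every possible label sequence.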
The present invention also provides a device for recognizing named entities of chemical components of Chinese materia medica which, as shown in Fig. 4, comprises:
an acquisition unit, configured to obtain literature related to chemical component named entities of Chinese materia medica;
a preprocessing unit, configured to filter the content of the obtained literature to obtain a corpus with standardized text content;
an encoding and annotation unit, configured to encode and annotate the corpus to obtain an annotated corpus;
a model training unit, configured to train a BiLSTM with the annotated corpus as training samples to obtain a trained BiLSTM;
a recognition unit, configured to input the literature in which chemical component named entities of Chinese materia medica are to be recognized into the trained BiLSTM, so as to recognize those named entities.
The acquisition unit comprises:
a search unit, configured to search a Chinese periodical literature database by keyword and download the documents in PDF format, and to search Baidu Baike and save the retrieved information as TXT text by means of a web crawler, so as to obtain a document set of chemical component named entities of Chinese materia medica;
an extraction unit, configured to extract the document content from the document set.
The encoding and annotation unit comprises:
an encoding unit, configured to encode the corpus according to certain rules to obtain an encoded corpus;
an annotation unit, configured to annotate the encoded corpus according to certain rules to distinguish chemical component named entities of Chinese materia medica from other text, so as to obtain the annotated corpus.
The present invention also provides an apparatus for recognizing named entities of chemical components of Chinese materia medica, comprising a processor and a memory storing computer program code; when the computer program code is run by the processor, the computing apparatus is caused to perform the method of the present invention for recognizing chemical component named entities of Chinese materia medica.
The present invention also provides a computer-readable storage medium storing program code which, when executed, implements the method of the present invention for recognizing chemical component named entities of Chinese materia medica.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solutions of the present invention, rather than its limitations;Although Present invention has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that: it still may be used To modify the technical solutions described in the foregoing embodiments or equivalent replacement of some of the technical features; And these are modified or replaceed, technical solution of various embodiments of the present invention that it does not separate the essence of the corresponding technical solution spirit and Range.

Claims (10)

1. A method for recognizing named entities of chemical components of Chinese materia medica, characterized by comprising the following steps:
S1: obtaining literature related to named entities of chemical components of Chinese materia medica;
S2: filtering the content of the obtained literature, so as to obtain a corpus with standardized text content;
S3: encoding and labeling the corpus, so as to obtain a labeled corpus;
S4: training a BiLSTM using the labeled corpus as training samples, so as to obtain a trained BiLSTM;
S5: inputting the literature in which named entities of chemical components of Chinese materia medica are to be recognized into the trained BiLSTM for recognition, so as to recognize those named entities.
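Step S2 is stated only at a high level. As a minimal sketch, assuming that standardizing the text content includes Unicode normalization and whitespace cleanup (the claim does not list the actual filtering rules, and the function name `standardize_text` is hypothetical), the step might look like this:

```python
import re
import unicodedata


def standardize_text(raw: str) -> str:
    """Normalize one extracted document into a single clean line of text."""
    # Fold full-width digits, letters and punctuation into their ASCII forms.
    text = unicodedata.normalize("NFKC", raw)
    # Drop control and format characters, keeping ordinary whitespace.
    text = "".join(
        ch for ch in text
        if ch in "\n\t " or unicodedata.category(ch)[0] != "C"
    )
    # Collapse every run of whitespace into one space.
    return re.sub(r"\s+", " ", text).strip()


# Full-width "3", full-width "-", and an ideographic space, as often found
# in text extracted from PDF files.
sample = "\uff13\uff0d甲氧基\u3000 黄酮\n\n"
cleaned = standardize_text(sample)
```

Normalizing before encoding means that visually identical characters map to the same code in the later steps.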
2. The method for recognizing named entities of chemical components of Chinese materia medica according to claim 1, characterized in that S1 comprises:
S11: searching a Chinese periodical literature database using keywords and downloading the documents in PDF format; and searching Baidu Baike, using a web crawler to save the obtained information as TXT text, so as to obtain a document set on named entities of chemical components of Chinese materia medica;
S12: extracting document content from the document set.
3. The method for recognizing named entities of chemical components of Chinese materia medica according to claim 1, characterized in that S3 comprises:
S31: encoding the corpus according to certain rules, so as to obtain an encoded corpus;
S32: labeling the encoded corpus according to certain rules, distinguishing named entities of chemical components of Chinese materia medica from other text, so as to obtain the labeled corpus.
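The claim does not name a labeling scheme. As an illustration only, the sketch below assumes the common BIO scheme (B for the first character of an entity, I for the following characters, O for everything else); the function name `bio_tags` and the example sentence are hypothetical. Because labels are assigned per position, the same function could be applied to an encoded sequence of equal length.

```python
def bio_tags(text, entity_spans):
    """Return one BIO tag per character.

    entity_spans holds (start, end) index pairs, with end exclusive.
    """
    tags = ["O"] * len(text)
    for start, end in entity_spans:
        tags[start] = "B"                  # first character of the entity
        for pos in range(start + 1, end):
            tags[pos] = "I"                # remaining characters of the entity
    return tags


# "黄芩含黄芩苷": the herb name 黄芩 followed by the component name 黄芩苷.
sentence = "黄芩含黄芩苷"
labeled = list(zip(sentence, bio_tags(sentence, [(3, 6)])))
```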
4. The method for recognizing named entities of chemical components of Chinese materia medica according to claim 3, characterized in that the rules are formed as follows:
Features are extracted from a large number of chemical component name samples, yielding feature attributes that comprise chemical element names, specialized chemical terms, chemical prepositions, specific prefixes, specific affixes, digits indicating a serial number, Chinese characters indicating a serial number, letters indicating a serial number, Chinese medicine names, terms indicating a region, groups, and symbols;
Each feature attribute is defined by its own character, forming a rule lookup table in which characters and feature attributes correspond to one another.
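A rule lookup table of this kind can be sketched as two small dictionaries. The code characters, the attribute subset, and the tiny lexicon below are all hypothetical, since the claim does not disclose the actual table; the sketch only shows how each character of a name might be replaced by the code of its feature attribute.

```python
# Hypothetical one-character codes for a few of the feature attributes.
FEATURE_CODES = {
    "element": "E",
    "group": "G",
    "chinese_ordinal": "C",
    "digit_ordinal": "N",
    "letter_ordinal": "L",
    "symbol": "S",
    "other": "O",
}

# Tiny illustrative lexicon (an assumption, not the patent's word lists).
LEXICON = {
    "氧": "element", "氢": "element", "碳": "element",
    "基": "group",
    "甲": "chinese_ordinal", "乙": "chinese_ordinal", "丙": "chinese_ordinal",
}


def attribute_of(ch: str) -> str:
    """Look up the feature attribute of a single character."""
    if ch in LEXICON:
        return LEXICON[ch]
    if ch.isdigit():
        return "digit_ordinal"
    if ch.isascii() and ch.isalpha():
        return "letter_ordinal"
    if not ch.isalnum():
        return "symbol"
    return "other"


def encode_name(text: str) -> str:
    """Replace every character by the code of its feature attribute."""
    return "".join(FEATURE_CODES[attribute_of(ch)] for ch in text)


code = encode_name("3-甲氧基")  # digit, symbol, ordinal, element, group
```

Encoding in this way maps many different names onto a small alphabet of codes, so names with the same construction pattern receive the same code string.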
5. The method for recognizing named entities of chemical components of Chinese materia medica according to claim 1, characterized in that step S4 comprises:
S41: the labeled corpus is used as the input for model training and is fed into the forward layer and the backward layer of the BiLSTM, so that the preceding and following context of the current word vector is obtained at the same time; the vector $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, the concatenation of the two directional outputs $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$, serves as the output of the BiLSTM at time t, and is obtained from formulas (1)-(4):
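$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{1}$$
$$c_t = (1 - i_t) \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \tag{2}$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{3}$$
$$h_t = o_t \odot \tanh(c_t) \tag{4}$$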
where σ is a nonlinear function; {W_xi, W_hi, W_ci, W_xc, W_hc, W_xo, W_ho, W_co} are the parameter matrices of the LSTM; {b_i, b_c, b_o} are the bias terms; i_t and o_t are the input gate and the output gate of the BiLSTM, respectively; ⊙ denotes the element-wise product; c is the state of each memory cell in the BiLSTM; h_t is the final output;
S42: an attention mechanism is used to obtain the representation of each word within the text; the attention $\alpha_i$ to be assigned to the i-th word over the full text is calculated by the following formulas:
$$\mathrm{energy}_i = f(\mathrm{attended}, \mathrm{state}_i, W) \tag{5}$$
$$\alpha_i = \mathrm{softmax}(\mathrm{energy}_i) \tag{6}$$
where attended is the combination of word vectors; state_i is the vector corresponding to the i-th word in that combination; W is the weight coefficient; the function f computes the correlation between state and attended, using the Manhattan distance as the similarity measure:
$$d(a, b) = \sum_i |a_i - b_i| \tag{7}$$
where a and b denote two word vectors, and a_i and b_i are the i-th elements of a and b, respectively;
In addition, source denotes the output obtained by processing the entire article with the BiLSTM;
The context representation of the current word over the full text is then obtained, and is defined as glimpse;
S43: the context representation of the word over the full text is combined with the context of its adjacent words and mapped through the tanh nonlinear function; the result is denoted as the output:
$$\mathrm{context}_i = \tanh(\mathrm{glimpse}_i, \mathrm{source}_i, U) \tag{9}$$
where context_i denotes the content of unit i of the attention layer, and U is a weight parameter learned during model training;
S44: a conditional random field is used to obtain the label sequence of the entire article, and the total score of the entire article under a given label sequence is calculated;
θ' denotes all the parameters that the entire model needs to learn, including the original BiLSTM parameters to be learned and the label transition matrix A; the entries of A give the score of a transition from label $[m]_{t-1}$ to label $[m]_t$; the Softmax function is used to calculate the probability p that the word is assigned its true label; the model parameters are trained by maximizing the log-likelihood, and gradient descent is used to optimize the parameters;
The terms of the log-likelihood denote the true label sequence, the sentence, and any possible label sequence, respectively;
S45: the Viterbi algorithm is used to find the optimal label sequence among all possible label sequences.
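The computational core of steps S41, S42 and S45 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the claimed implementation: the helper names are hypothetical, the parameters W_ci and W_co are treated as full matrices, the attention energy is assumed to be the negated Manhattan distance, the glimpse is assumed to be the attention-weighted sum of the BiLSTM outputs, and `emissions` stands for per-label scores produced by the network.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def vadd(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(items) for items in zip(*vectors)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of formulas (1)-(4); p maps parameter names to values."""
    i_t = [sigmoid(z) for z in vadd(matvec(p["W_xi"], x_t), matvec(p["W_hi"], h_prev),
                                    matvec(p["W_ci"], c_prev), p["b_i"])]      # (1)
    cand = [math.tanh(z) for z in vadd(matvec(p["W_xc"], x_t),
                                       matvec(p["W_hc"], h_prev), p["b_c"])]
    c_t = [(1.0 - i) * c + i * g for i, c, g in zip(i_t, c_prev, cand)]        # (2)
    o_t = [sigmoid(z) for z in vadd(matvec(p["W_xo"], x_t), matvec(p["W_ho"], h_prev),
                                    matvec(p["W_co"], c_t), p["b_o"])]         # (3)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                         # (4)
    return h_t, c_t


def manhattan(a, b):
    """Manhattan (L1) distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def attention_glimpse(query, source):
    """Attention weights of `query` over all vectors in `source` (assumed form)."""
    energies = [-manhattan(query, s) for s in source]   # closer means higher energy
    peak = max(energies)
    exps = [math.exp(e - peak) for e in energies]       # numerically stable softmax
    total = sum(exps)
    alphas = [e / total for e in exps]
    glimpse = [sum(a * s[d] for a, s in zip(alphas, source))
               for d in range(len(query))]
    return alphas, glimpse


def viterbi_decode(emissions, transitions):
    """Best label path; transitions[j][k] scores a move from label j to label k."""
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_labels):
            best_j = max(range(n_labels), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_j)
            new_score.append(score[best_j] + transitions[best_j][k] + emissions[t][k])
        score = new_score
        backpointers.append(pointers)
    path = [max(range(n_labels), key=lambda k: score[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])   # follow the stored best predecessor
    path.reverse()
    return path, max(score)
```

A bidirectional layer would run `lstm_step` once from left to right and once from right to left with separate parameters, then concatenate the two outputs at each position. In `viterbi_decode`, a strongly negative transition score makes the corresponding label change unlikely, which is how a transition matrix such as A can discourage invalid label sequences.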
6. A device for recognizing named entities of chemical components of Chinese materia medica, characterized by comprising:
An acquiring unit, configured to obtain literature related to named entities of chemical components of Chinese materia medica;
A preprocessing unit, configured to filter the content of the obtained literature, so as to obtain a corpus with standardized text content;
An encoding and labeling unit, configured to encode and label the corpus, so as to obtain a labeled corpus;
A model training unit, configured to train a BiLSTM using the labeled corpus as training samples, so as to obtain a trained BiLSTM;
A recognition unit, configured to input the literature in which named entities of chemical components of Chinese materia medica are to be recognized into the trained BiLSTM for recognition, so as to recognize those named entities.
7. The device for recognizing named entities of chemical components of Chinese materia medica according to claim 6, characterized in that the acquiring unit comprises:
A search unit, configured to search a Chinese periodical literature database using keywords and download the documents in PDF format, and to search Baidu Baike, using a web crawler to save the obtained information as TXT text, so as to obtain a document set on named entities of chemical components of Chinese materia medica;
An extraction unit, configured to extract document content from the document set.
8. The device for recognizing named entities of chemical components of Chinese materia medica according to claim 6, characterized in that the encoding and labeling unit comprises:
An encoding unit, configured to encode the corpus according to certain rules, so as to obtain an encoded corpus;
A labeling unit, configured to label the encoded corpus according to certain rules, distinguishing named entities of chemical components of Chinese materia medica from other text, so as to obtain the labeled corpus.
9. Equipment for recognizing named entities of chemical components of Chinese materia medica, characterized by comprising a processor and a memory storing computer program code; when the computer program code is run by the processor, the computing equipment is caused to execute the method for recognizing named entities of chemical components of Chinese materia medica according to any one of claims 1-5.
10. A computer-readable storage medium, characterized in that program code is stored on the computer-readable storage medium; when the program code is executed, the method for recognizing named entities of chemical components of Chinese materia medica according to any one of claims 1 to 5 is implemented.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910512263.3A CN110223737A (en) 2019-06-13 2019-06-13 A kind of chemical composition of Chinese materia medica name entity recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910512263.3A CN110223737A (en) 2019-06-13 2019-06-13 A kind of chemical composition of Chinese materia medica name entity recognition method and device

Publications (1)

Publication Number Publication Date
CN110223737A true CN110223737A (en) 2019-09-10

Family

ID=67817075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910512263.3A Pending CN110223737A (en) 2019-06-13 2019-06-13 A kind of chemical composition of Chinese materia medica name entity recognition method and device

Country Status (1)

Country Link
CN (1) CN110223737A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837737A (en) * 2019-11-11 2020-02-25 中国电子科技集团公司信息科学研究院 Method for recognizing ability word entity
CN111160031A (en) * 2019-12-13 2020-05-15 华南理工大学 Social media named entity identification method based on affix perception
CN112699668A (en) * 2021-01-05 2021-04-23 广州楹鼎生物科技有限公司 Training method, extraction method, device, equipment and storage medium of chemical information extraction model
WO2021218024A1 (en) * 2020-04-29 2021-11-04 平安科技(深圳)有限公司 Method and apparatus for training named entity recognition model, and computer device
CN113723089A (en) * 2020-05-25 2021-11-30 阿里巴巴集团控股有限公司 Word segmentation model training method, word segmentation method, data processing method and data processing device
WO2022007871A1 (en) * 2020-07-09 2022-01-13 中国科学院上海药物研究所 Processing method and device for bidirectional automatic conversion of chemical structure and name of organic compound

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning
US20180300312A1 (en) * 2017-04-13 2018-10-18 Baidu Usa Llc Global normalized reader systems and methods
CN108874997A (en) * 2018-06-13 2018-11-23 广东外语外贸大学 A kind of name name entity recognition method towards film comment
CN108920465A (en) * 2018-07-13 2018-11-30 福州大学 A kind of agriculture field Relation extraction method based on syntactic-semantic
CN109471895A (en) * 2018-10-29 2019-03-15 清华大学 The extraction of electronic health record phenotype, phenotype name authority method and system
CN109522546A (en) * 2018-10-12 2019-03-26 浙江大学 Entity recognition method is named based on context-sensitive medicine
CN109635279A (en) * 2018-11-22 2019-04-16 桂林电子科技大学 A kind of Chinese name entity recognition method neural network based
CN109669994A (en) * 2018-12-21 2019-04-23 吉林大学 A kind of construction method and system of health knowledge map

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JASON P.C. CHIU等: "Named Entity Recognition with Bidirectional LSTM-CNNs", 《TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 *
LING LUO等: "A neural network approach to chemical and gene/protein entity recognition in patents", 《JOURNAL OF CHEMINFORMATICS》 *
LING LUO等: "An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition", 《BIOINFORMATICS》 *
RRUBAA PANCHENDRARAJAN等: "Bidirectional LSTM-CRF for Named Entity Recognition", 《32ND PACIFIC ASIA CONFERENCE ON LANGUAGE, INFORMATION AND COMPUTATION, HONG KONG》 *
郝伟学: "中医健康知识图谱的构建研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837737A (en) * 2019-11-11 2020-02-25 中国电子科技集团公司信息科学研究院 Method for recognizing ability word entity
CN111160031A (en) * 2019-12-13 2020-05-15 华南理工大学 Social media named entity identification method based on affix perception
WO2021218024A1 (en) * 2020-04-29 2021-11-04 平安科技(深圳)有限公司 Method and apparatus for training named entity recognition model, and computer device
CN113723089A (en) * 2020-05-25 2021-11-30 阿里巴巴集团控股有限公司 Word segmentation model training method, word segmentation method, data processing method and data processing device
CN113723089B (en) * 2020-05-25 2023-12-26 阿里巴巴集团控股有限公司 Word segmentation model training method, word segmentation method and data processing method and device
WO2022007871A1 (en) * 2020-07-09 2022-01-13 中国科学院上海药物研究所 Processing method and device for bidirectional automatic conversion of chemical structure and name of organic compound
CN112699668A (en) * 2021-01-05 2021-04-23 广州楹鼎生物科技有限公司 Training method, extraction method, device, equipment and storage medium of chemical information extraction model

Similar Documents

Publication Publication Date Title
CN110223737A (en) A kind of chemical composition of Chinese materia medica name entity recognition method and device
CN105404632B (en) System and method for carrying out serialized annotation on biomedical text based on deep neural network
CN110851599B (en) Automatic scoring method for Chinese composition and teaching assistance system
CN110598000A (en) Relationship extraction and knowledge graph construction method based on deep learning model
CN109697285A (en) Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness
CN106980609A (en) A kind of name entity recognition method of the condition random field of word-based vector representation
CN110134954B (en) Named entity recognition method based on Attention mechanism
CN107341264A (en) A kind of electronic health record system and method for supporting custom entities
CN106919793A (en) A kind of data standardization processing method and device of medical big data
CN111538845A (en) Method, model and system for constructing kidney disease specialized medical knowledge map
CN106909783A (en) A kind of case history textual medical Methods of Knowledge Discovering Based based on timeline
CN107247739A (en) A kind of financial publication text knowledge extracting method based on factor graph
Liu et al. Extracting the gist of chinese judgments of the supreme court
CN109033166B (en) Character attribute extraction training data set construction method
CN113946685B (en) Fishery standard knowledge graph construction method integrating rules and deep learning
CN112364623A (en) Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method
CN116719913A (en) Medical question-answering system based on improved named entity recognition and construction method thereof
CN107862069A (en) A kind of construction method of taxonomy database and the method for book classification
CN112989811B (en) History book reading auxiliary system based on BiLSTM-CRF and control method thereof
Jamali et al. Percqa: Persian community question answering dataset
Hafeez et al. Contextual urdu lemmatization using recurrent neural network models
CN109284391A (en) A kind of document automatic classification method
CN109189848A (en) Abstracting method, system, computer equipment and the storage medium of knowledge data
CN115438379A (en) Electronic medical record data desensitization method and system based on FLAT
Steigen Social support in nature-based services for young adults with mental health problems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190910