CN110222343A - Named entity recognition method for Chinese medicine plant resources - Google Patents

Named entity recognition method for Chinese medicine plant resources

Publication number
CN110222343A
Authority
CN
China
Prior art keywords
chinese medicine
vector
word
document
recognition method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910512743.XA
Other languages
Chinese (zh)
Inventor
李巧勤
蔡茁
何家欢
李杨
刘勇国
杨尚明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201910512743.XA
Publication of CN110222343A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a named entity recognition method for Chinese medicine plant resources, comprising the following steps: S1: obtain Chinese medicine germplasm resource documents; S2: annotate the Chinese medicine germplasm resource documents according to a defined rule, and split the annotated documents into text sentences; S3: look up, one by one, the term vectors and character vectors corresponding to each text sentence, and train a GRU-CRF model with the term vectors and character vectors; S4: perform named entity recognition on unseen Chinese medicine germplasm resource documents with the trained GRU-CRF model. By building a GRU-CRF model, the method automatically recognizes named entities in Chinese medicine plant resource documents. It substantially improves recognition accuracy and recognition efficiency, reduces the time spent on manual identification, and can be extended to other named entity categories such as soil mineral content and moisture content.

Description

Named entity recognition method for Chinese medicine plant resources
Technical field
The present invention relates to the technical field of named entity recognition for Chinese medicine, and in particular to a named entity recognition method for Chinese medicine plant resources.
Background art
The planting environment of Chinese medicinal materials has a significant influence on their quality. For example, Chinese medicinal materials from different places of origin differ considerably in appearance, active ingredients and medicinal efficacy. To study how the planting environment affects the quality of Chinese medicinal materials, named entities related to the planting environment must be identified in large volumes of unstructured Chinese medicine germplasm resource documents, providing basic data for further research. Examples of such named entities are the geographical location of a planting, the name of the medicinal material, the soil and the climate. At present these entities are identified with a manual-plus-rules approach: the documents are first organized by hand and then matched with regular expressions to extract the required named entities. This approach is too inefficient to meet the needs of current research on Chinese medicine planting environments.
Summary of the invention
The object of the invention is to solve the above problems of the prior art by providing a named entity recognition method for Chinese medicine plant resources with high recognition efficiency.
A named entity recognition method for Chinese medicine plant resources, comprising the following steps:
S1: obtain Chinese medicine germplasm resource documents;
S2: annotate the Chinese medicine germplasm resource documents according to a defined rule, and split the annotated documents into text sentences;
S3: look up, one by one, the term vectors and character vectors corresponding to each text sentence, and train a GRU-CRF model with the term vectors and character vectors;
S4: perform named entity recognition on unseen Chinese medicine germplasm resource documents with the trained GRU-CRF model.
Further, in the method described above, step S3 comprises: looking up, one by one, the term vectors and character vectors corresponding to each text sentence; applying a bidirectional LSTM model to the character vectors of the characters contained in each text sentence to obtain hidden vectors; splicing the term vectors with the hidden vectors to obtain intermediate vectors; and training a bidirectional GRU-CRF model with the intermediate vectors as input to obtain the trained GRU-CRF model.
Further, in the method described above, the defined rule is: annotate the Chinese medicine germplasm resource documents with {B-s, I-s, O-s, S-s} tags, where B marks the beginning of a named entity, I marks the remaining part of a named entity after its beginning, O marks other text, S marks a named entity consisting of a single character, and s denotes the attribute of the named entity.
Further, in the method described above, the term vectors and character vectors are obtained by word2vec embedding training on a corpus built from Wikipedia and Baidu Baike.
Further, in the method described above, after step S1 and before step S2, the method also includes preprocessing the Chinese medicine germplasm resource documents.
The preprocessing step comprises: unifying the format of all Chinese medicine germplasm resource documents, and performing a delete operation on the unified documents to remove interference information.
Further, in the method described above, step S3 comprises:
When looking up the term vector corresponding to a text sentence, the term vector library obtained by the embedding training is searched first. If the word is not found, synonym conversion is first performed with a domain dictionary to find a corresponding word, and the vector of that word is then looked up. The domain dictionary is generated from the corpus built from Wikipedia and Baidu Baike.
Beneficial effects:
By building a GRU-CRF model, the named entity recognition method for Chinese medicine plant resources provided by the invention automatically recognizes named entities in Chinese medicine plant resource documents. It substantially improves recognition accuracy and recognition efficiency, reduces the time spent on manual identification, and can be extended to other named entity categories such as soil mineral content and moisture content.
Brief description of the drawings
Fig. 1 is a flow chart of the named entity recognition method for Chinese medicine plant resources of the invention;
Fig. 2 is a schematic diagram of annotation with the {B-s, I-s, O-s, S-s} rule;
Fig. 3 is a structure diagram of the GRU-CRF model.
Detailed description of the embodiments
To make the object, technical solutions and advantages of the invention clearer, the technical solutions of the invention are described clearly and completely below. The described embodiments are evidently some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.
The present invention provides a named entity recognition method for Chinese medicine plant resources. As shown in Fig. 1, the method comprises the following steps:
S1: obtain Chinese medicine germplasm resource documents;
S2: annotate the Chinese medicine germplasm resource documents according to a defined rule, and split the annotated documents into text sentences;
S3: look up, one by one, the term vectors and character vectors corresponding to each text sentence, and train a GRU-CRF model with the term vectors and character vectors;
S4: perform named entity recognition on unseen Chinese medicine germplasm resource documents with the trained GRU-CRF model.
The defined rule in step S2 is: annotate the Chinese medicine germplasm resource documents with {B-s, I-s, O-s, S-s} tags, where B marks the beginning of a named entity, I marks the remaining part of a named entity after its beginning, O marks other text, S marks a named entity consisting of a single character, and s denotes the attribute of the named entity. For example, as shown in Fig. 2, the sentence "Shandong plants violet-flower danshen" has the tag sequence {B-Location, I-Location, O, O, B-Herb, I-Herb, I-Herb, I-Herb}, where "B-Location, I-Location" marks a location entity ("Shandong"), "B-Herb, I-Herb, I-Herb, I-Herb" marks a herb entity ("violet-flower danshen"), and O marks irrelevant text. The structure of the GRU-CRF model is shown in Fig. 3.
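The tagging rule can be illustrated by a small decoder that turns a tag sequence back into entities. This is a minimal sketch, not the patent's implementation: the function name `decode_entities` is hypothetical, and the Chinese sentence 山东种植紫花丹参 is an assumed original for the translated example, tagged one character at a time.

```python
from typing import List, Tuple

def decode_entities(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (text, attribute) pairs from per-character B/I/O/S tags."""
    entities: List[Tuple[str, str]] = []
    current_text, current_attr = "", None
    for ch, tag in zip(chars, tags):
        prefix, _, attr = tag.partition("-")  # "B-Herb" -> ("B", "-", "Herb")
        if prefix == "B":
            # A new entity starts; flush any entity still open.
            if current_attr is not None:
                entities.append((current_text, current_attr))
            current_text, current_attr = ch, attr
        elif prefix == "I" and current_attr == attr:
            current_text += ch  # continue the open entity
        else:
            # O, S, or a stray I tag closes the open entity.
            if current_attr is not None:
                entities.append((current_text, current_attr))
            current_text, current_attr = "", None
            if prefix == "S":
                entities.append((ch, attr))  # single-character entity
    if current_attr is not None:
        entities.append((current_text, current_attr))
    return entities

# Assumed Chinese original of the example sentence, one tag per character.
chars = list("山东种植紫花丹参")
tags = ["B-Location", "I-Location", "O", "O",
        "B-Herb", "I-Herb", "I-Herb", "I-Herb"]
entities = decode_entities(chars, tags)
print(entities)
```

Under these assumptions the decoder recovers one Location entity and one Herb entity, matching the description of Fig. 2.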
The term vectors and character vectors are obtained by word2vec embedding training on a corpus built from Wikipedia and Baidu Baike. The Chinese medicine planting domain dictionary is likewise obtained from the corpus built from Wikipedia and Baidu Baike.
Further, to improve the efficiency of GRU-CRF model training, the method also includes preprocessing the Chinese medicine germplasm resource documents after step S1 and before step S2.
The preprocessing step comprises: unifying the format of all Chinese medicine germplasm resource documents, and performing a delete operation on the unified documents to remove interference information.
Specifically, the PDF-format Chinese medicine germplasm resource documents are uniformly converted to text format; only the effective text content of each document is retained, and interference information such as author and affiliation information and references is deleted.
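The clean-up and the later sentence split can be sketched as follows. This is only an illustration under stated assumptions: the text is taken to be already extracted from the PDF, the heading and prefix markers in `REFERENCE_HEADINGS` and `METADATA_PREFIXES` are hypothetical heuristics, and the sample document is invented. The patent does not specify how interference information is detected.

```python
import re
from typing import List

# Assumed markers; real documents may need different heuristics.
REFERENCE_HEADINGS = ("参考文献", "References")
METADATA_PREFIXES = ("作者", "单位", "Author", "Affiliation")

def clean_document(text: str) -> str:
    """Keep body text; drop metadata lines and everything after the references heading."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(REFERENCE_HEADINGS):
            break  # the reference list and anything after it is discarded
        if not stripped or stripped.startswith(METADATA_PREFIXES):
            continue
        kept.append(stripped)
    return "".join(kept)

def split_sentences(text: str) -> List[str]:
    """Split on Chinese sentence-ending punctuation, keeping the terminator."""
    return re.findall(r"[^。！？]+[。！？]?", text)

# Invented sample document for illustration only.
raw = "作者：张三\n单位：某大学\n山东种植紫花丹参。\n土壤为砂质壤土。\n参考文献\n[1] 某文献"
sentences = split_sentences(clean_document(raw))
print(sentences)
```

In the method itself the split into sentences happens after annotation (step S2); the sketch applies it directly to cleaned text for brevity.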
Step S3 comprises: looking up, one by one, the term vectors and character vectors corresponding to each text sentence; applying a bidirectional LSTM model to the character vectors of the characters contained in each text sentence to obtain hidden vectors; splicing the term vectors with the hidden vectors to obtain intermediate vectors; and training a bidirectional GRU-CRF model with the intermediate vectors as input to obtain the trained GRU-CRF model.
In addition, when looking up the term vector for a word, the term vector library obtained by word2vec embedding training on the corpus built from Wikipedia and Baidu Baike is searched first. If the word is not found, synonym conversion is first performed with the domain dictionary to find a corresponding word, and the vector of that word is then looked up.
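The lookup with synonym fallback can be sketched with two dictionaries. The function name `lookup_term_vector`, the data layout, and the toy entries are all assumptions for illustration; the patent does not describe the data structures used for the term vector library or the domain dictionary.

```python
from typing import Dict, List, Optional

def lookup_term_vector(
    word: str,
    term_vectors: Dict[str, List[float]],
    synonyms: Dict[str, List[str]],
) -> Optional[List[float]]:
    """Return the vector for `word`, falling back to a synonym from the domain dictionary."""
    if word in term_vectors:
        return term_vectors[word]
    # Fallback: try each synonym listed in the (hypothetical) domain dictionary.
    for candidate in synonyms.get(word, []):
        if candidate in term_vectors:
            return term_vectors[candidate]
    return None  # no vector available; the caller decides how to handle this

# Toy data for illustration only.
term_vectors = {"丹参": [0.1, 0.2]}
synonyms = {"赤参": ["丹参"]}

direct = lookup_term_vector("丹参", term_vectors, synonyms)
via_synonym = lookup_term_vector("赤参", term_vectors, synonyms)
missing = lookup_term_vector("未知词", term_vectors, synonyms)
```

What happens when neither the word nor any synonym has a vector is not stated in the source, so the sketch simply returns `None`.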
Step S3 is described in detail below. It comprises the following steps:
S31. Perform word2vec embedding training on corpora such as Wikipedia and Baidu Baike to obtain a character vector table CT and a term vector table WT.
S32. For a given text sentence, look up the character vector table CT to generate a character vector sequence $(c_1, c_2, \dots, c_m)$, where m is the number of characters in the sentence. This sequence is used as the input of a bidirectional LSTM model, which is trained to extract the morphological features of words. The LSTM is computed as follows:
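$$i_t = \mathrm{sigmoid}(W_{ic} c_t + W_{ih} h_{t-1} + b_i) \tag{1}$$
$$f_t = \mathrm{sigmoid}(W_{fc} c_t + W_{fh} h_{t-1} + b_f) \tag{2}$$
$$g_t = \tanh(W_{gc} c_t + W_{gh} h_{t-1} + b_g) \tag{3}$$
$$o_t = \mathrm{sigmoid}(W_{oc} c_t + W_{oh} h_{t-1} + b_o) \tag{4}$$
$$s_t = f_t \odot s_{t-1} + i_t \odot g_t \tag{5}$$
$$h_t = o_t \odot \tanh(s_t) \tag{6}$$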
Here W denotes the connection weight matrices; f, i and o are the forget gate, input gate and output gate respectively; g is the candidate cell state computed from the current input; s is the cell state, which carries long-term dependency information. $i_t$, $f_t$ and $g_t$ together update the cell state $s_{t-1}$ at time t-1 to give the cell state $s_t$ at time t; $h_t$ is the hidden state at time t, determined jointly by $o_t$ and $s_t$. $\odot$ denotes element-wise multiplication. Computing a forward LSTM layer and a backward LSTM layer yields the forward information and the backward information of the sentence, which together form the hidden-layer state.
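Equations (1) to (6) can be traced with a one-dimensional sketch in which every weight, input and state is a scalar, so element-wise multiplication becomes ordinary multiplication. The parameter dictionary keys (`W_ic`, `b_i`, and so on), the zero initial states, and the use of the final state of each direction as the hidden vector are assumptions for illustration; a real model would use matrices and learned weights.

```python
import math
from typing import Dict, List, Tuple

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(c_t: float, h_prev: float, s_prev: float,
              p: Dict[str, float]) -> Tuple[float, float]:
    """One scalar LSTM step following equations (1)-(6)."""
    i_t = sigmoid(p["W_ic"] * c_t + p["W_ih"] * h_prev + p["b_i"])    # (1) input gate
    f_t = sigmoid(p["W_fc"] * c_t + p["W_fh"] * h_prev + p["b_f"])    # (2) forget gate
    g_t = math.tanh(p["W_gc"] * c_t + p["W_gh"] * h_prev + p["b_g"])  # (3) candidate state
    o_t = sigmoid(p["W_oc"] * c_t + p["W_oh"] * h_prev + p["b_o"])    # (4) output gate
    s_t = f_t * s_prev + i_t * g_t                                    # (5) cell state
    h_t = o_t * math.tanh(s_t)                                        # (6) hidden state
    return h_t, s_t

def run_lstm(inputs: List[float], p: Dict[str, float]) -> float:
    """Run the cell over a sequence from zero initial states; return the last hidden state."""
    h, s = 0.0, 0.0
    for c_t in inputs:
        h, s = lstm_step(c_t, h, s, p)
    return h

def bidirectional_hidden(inputs: List[float], fwd: Dict[str, float],
                         bwd: Dict[str, float]) -> List[float]:
    """Concatenate the final forward state and the final backward state."""
    return [run_lstm(inputs, fwd), run_lstm(list(reversed(inputs)), bwd)]
```

With all parameters set to zero, every gate evaluates to 0.5 and the candidate state to 0, so one step halves the previous cell state, which is a quick sanity check on equations (5) and (6).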
S33. First check whether the term vector table WT contains a vector for the current word. If it does, that term vector is used as input; if it does not, synonym conversion is first performed with the domain dictionary to find a corresponding word, and the term vector table WT is searched again. For a given text sentence this produces a term vector sequence $(w_1, w_2, \dots, w_n)$, where n is the number of words in the sentence. The hidden-layer state obtained in S32 is spliced with the term vector sequence $(w_1, w_2, \dots, w_n)$ to generate a combined character-word vector sequence $(x_1, x_2, \dots, x_n)$, which is used as the input of the bidirectional GRU-CRF network to train the model. The GRU is computed as follows:
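$$r_t = \mathrm{sigmoid}(W_{rx} x_t + W_{ru} u_{t-1} + b_r) \tag{7}$$
$$z_t = \mathrm{sigmoid}(W_{zx} x_t + W_{zu} u_{t-1} + b_z) \tag{8}$$
$$n_t = \tanh(W_{nx} x_t + W_{nu}(r_t \odot u_{t-1}) + b_n) \tag{9}$$
$$u_t = (1 - z_t) \odot n_t + z_t \odot u_{t-1} \tag{10}$$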
Here $z_t$ is the update gate of the GRU, which determines how much of the information in the hidden state $u_{t-1}$ at time t-1 is carried over to the hidden state $u_t$ at time t; $r_t$ is the reset gate, which determines how much of the information in $u_{t-1}$ is forgotten; $n_t$ summarizes the input $x_t$ and the hidden state $u_{t-1}$ at time t-1, i.e. it is the current candidate state. Computing a forward GRU layer and a backward GRU layer yields the forward information and the backward information of the sentence, which together form the hidden-layer state.
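A scalar sketch of equations (7) to (10), together with a forward and a backward pass over a sequence, may make the bidirectional computation concrete. The parameter names, the zero initial state, and pairing the two directions per position are assumptions for illustration, not details given in the source.

```python
import math
from typing import Dict, List, Tuple

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t: float, u_prev: float, p: Dict[str, float]) -> float:
    """One scalar GRU step following equations (7)-(10)."""
    r_t = sigmoid(p["W_rx"] * x_t + p["W_ru"] * u_prev + p["b_r"])            # (7) reset gate
    z_t = sigmoid(p["W_zx"] * x_t + p["W_zu"] * u_prev + p["b_z"])            # (8) update gate
    n_t = math.tanh(p["W_nx"] * x_t + p["W_nu"] * (r_t * u_prev) + p["b_n"])  # (9) candidate
    return (1.0 - z_t) * n_t + z_t * u_prev                                   # (10) new state

def bidirectional_gru(xs: List[float], fwd: Dict[str, float],
                      bwd: Dict[str, float]) -> List[Tuple[float, float]]:
    """Return (forward state, backward state) for every position in the sequence."""
    forward, u = [], 0.0
    for x in xs:
        u = gru_step(x, u, fwd)
        forward.append(u)
    backward, u = [], 0.0
    for x in reversed(xs):
        u = gru_step(x, u, bwd)
        backward.append(u)
    backward.reverse()  # realign backward states with the original positions
    return list(zip(forward, backward))
```

In the model described here, the per-position pairs would form the rows of the output matrix P that the CRF layer scores.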
S34. A CRF is used to predict the tag sequence. The optimal tag sequence is finally mapped to the corresponding named entities, generating an annotated sequence based on the {B, I, O, S} tags. In step S2 the text is annotated manually with {B, I, O, S} tags, producing the training set of expected outputs; in this step the trained model makes the prediction, and its output is likewise expressed with {B, I, O, S} tags.
The prediction for a sequence is defined as:
$$Y = (y_1, y_2, \dots, y_n) \tag{11}$$
The score of this predicted sequence is computed from two matrices: the state transition matrix A, where $A_{i,j}$ denotes the probability of transitioning from tag i to tag j, and P, the output matrix of the bidirectional GRU.
From these scores the probability of every possible tag sequence is obtained, normalized over $Y_s$, the set of all possible tag sequences of sentence s.
During model training, the log probability of the true tag sequence is maximized.
During model prediction, the output sequence with the highest score, $y^*$, is solved for.
$y^*$ is the predicted optimal tag sequence, i.e. the automatic recognition result for the named entities of Chinese medicine plant resources.
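The formulas for the sequence score, the sequence probability, the training objective and the decoding step do not appear in the text. A hedged reconstruction, assuming the standard linear-chain CRF formulation that fits the definitions of A, P and $Y_s$, is given below. The start and end tags $y_0$ and $y_{n+1}$, the index ranges and the symbol $\tilde{Y}$ are assumptions, not taken from the source.

```latex
% Score of a tag sequence Y for sentence s (assumed standard form)
\mathrm{score}(s, Y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i}

% Probability of Y, normalized over all possible tag sequences Y_s
p(Y \mid s) = \frac{\exp\bigl(\mathrm{score}(s, Y)\bigr)}
                   {\sum_{\tilde{Y} \in Y_s} \exp\bigl(\mathrm{score}(s, \tilde{Y})\bigr)}

% Training objective: log probability of the true tag sequence
\log p(Y \mid s) = \mathrm{score}(s, Y)
                 - \log \sum_{\tilde{Y} \in Y_s} \exp\bigl(\mathrm{score}(s, \tilde{Y})\bigr)

% Prediction: the highest-scoring tag sequence
y^{*} = \arg\max_{\tilde{Y} \in Y_s} \mathrm{score}(s, \tilde{Y})
```

In this form $P_{i, y_i}$ is the score the bidirectional GRU assigns to tag $y_i$ at position i, and the maximization in the last line is usually solved with the Viterbi algorithm.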
The recognition method of the invention is illustrated by the following example:
First, the Chinese medicine germplasm resource PDF document is converted to TXT text. Take the sentence "Shandong plants violet-flower danshen" from the text as the input. The character vector table and the term vector table are obtained with the word2vec model. The sentence is segmented into words, and each word is looked up in the term vector table to obtain its term vector. The character vector table is then searched to obtain the character vector of each character, and the bidirectional LSTM model is applied to the character vectors of the characters contained in each word to obtain a hidden vector. The term vector and the hidden vector are spliced into an intermediate vector, which is used as the input of the bidirectional GRU-CRF model. Running the model yields the optimal tag sequence of the sentence, i.e. its annotated sequence "B-Location I-Location O O B-Herb I-Herb I-Herb I-Herb". The named entities recognized in this sentence are therefore a location entity and a medicinal material entity.
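The final decoding in this example, finding the highest-scoring tag sequence, is commonly done with the Viterbi algorithm. The sketch below is an assumption about how such decoding could look, not the patent's stated procedure: the emission and transition scores are invented toy values rather than outputs of a trained model, and start and end transitions are omitted.

```python
from typing import Dict, List

def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            tags: List[str]) -> List[str]:
    """Return the tag path maximizing summed emission and transition scores."""
    best = {t: emissions[0][t] for t in tags}  # best score of a path ending in tag t
    backpointers: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        pointers, new_best = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[p] + transitions[p][cur])
            pointers[cur] = prev
            new_best[cur] = best[prev] + transitions[prev][cur] + scores[cur]
        backpointers.append(pointers)
        best = new_best
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

TAGS = ["B-Location", "I-Location", "B-Herb", "I-Herb", "O"]
gold = ["B-Location", "I-Location", "O", "O",
        "B-Herb", "I-Herb", "I-Herb", "I-Herb"]

# Toy emission scores: the expected tag gets 2.0, every other tag 0.0.
emissions = [{t: (2.0 if t == g else 0.0) for t in TAGS} for g in gold]
# Position 4 locally prefers I-Herb, which would be an invalid O -> I-Herb move.
emissions[4]["I-Herb"] = 2.5

# Toy transition scores: all 0.0 except a penalty for O -> I-*.
transitions = {p: {c: 0.0 for c in TAGS} for p in TAGS}
transitions["O"]["I-Herb"] = -10.0
transitions["O"]["I-Location"] = -10.0

best_path = viterbi(emissions, transitions, TAGS)
print(" ".join(best_path))
```

The transition penalty overrides the locally higher emission at position 4, so the decoded path is the tag sequence quoted in the example. This is the kind of sequence-level constraint a CRF layer adds on top of per-position scores.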
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the invention.

Claims (6)

1. A named entity recognition method for Chinese medicine plant resources, characterized by comprising the following steps:
S1: obtaining Chinese medicine germplasm resource documents;
S2: annotating the Chinese medicine germplasm resource documents according to a defined rule, and splitting the annotated documents into text sentences;
S3: looking up, one by one, the term vectors and character vectors corresponding to each text sentence, and training a GRU-CRF model with the term vectors and character vectors;
S4: performing named entity recognition on unseen Chinese medicine germplasm resource documents with the trained GRU-CRF model.
2. The named entity recognition method for Chinese medicine plant resources according to claim 1, characterized in that step S3 comprises: looking up, one by one, the term vectors and character vectors corresponding to each text sentence; applying a bidirectional LSTM model to the character vectors of the characters contained in each text sentence to obtain hidden vectors; splicing the term vectors with the hidden vectors to obtain intermediate vectors; and training a bidirectional GRU-CRF model with the intermediate vectors as input to obtain the trained GRU-CRF model.
3. The named entity recognition method for Chinese medicine plant resources according to claim 1, characterized in that the defined rule is: annotating the Chinese medicine germplasm resource documents with {B-s, I-s, O-s, S-s} tags, where B marks the beginning of a named entity, I marks the remaining part of a named entity after its beginning, O marks other text, S marks a named entity consisting of a single character, and s denotes the attribute of the named entity.
4. The named entity recognition method for Chinese medicine plant resources according to claim 1, characterized in that the term vectors and character vectors are obtained by word2vec embedding training on a corpus built from Wikipedia and Baidu Baike.
5. The named entity recognition method for Chinese medicine plant resources according to claim 1, characterized in that, after step S1 and before step S2, the method further comprises preprocessing the Chinese medicine germplasm resource documents;
the preprocessing step comprises: unifying the format of all Chinese medicine germplasm resource documents, and performing a delete operation on the unified documents to remove interference information.
6. The named entity recognition method for Chinese medicine plant resources according to claim 4, characterized in that step S3 comprises:
when looking up the term vector corresponding to a text sentence, first searching the term vector library obtained by the embedding training; if the word is not found, first performing synonym conversion with a domain dictionary to find a corresponding word, and then looking up the vector of that word, the domain dictionary being generated from the corpus built from Wikipedia and Baidu Baike.
CN201910512743.XA, priority date 2019-06-13, filing date 2019-06-13: Named entity recognition method for Chinese medicine plant resources, Pending, published as CN110222343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910512743.XA CN110222343A (en) 2019-06-13 2019-06-13 Named entity recognition method for Chinese medicine plant resources

Publications (1)

Publication Number Publication Date
CN110222343A (en) 2019-09-10

Family

ID=67816976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910512743.XA Named entity recognition method for Chinese medicine plant resources 2019-06-13 2019-06-13 (Pending, CN110222343A (en))

Country Status (1)

Country Link
CN (1) CN110222343A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644014A (en) * 2017-09-25 2018-01-30 南京安链数据科技有限公司 Named entity recognition method based on bidirectional LSTM and CRF
CN107748757A (en) * 2017-09-21 2018-03-02 北京航空航天大学 Question answering method based on knowledge graph
CN107908614A (en) * 2017-10-12 2018-04-13 北京知道未来信息技术有限公司 Named entity recognition method based on Bi-LSTM
CN108460012A (en) * 2018-02-01 2018-08-28 哈尔滨理工大学 Named entity recognition method based on GRU-CRF
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
许鑫: "《基于文本特征计算的信息分析方法》", 30 November 2015 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680843A (en) * 2020-06-12 2020-09-18 电子科技大学 Chinese medicinal material survival suitability area prediction method and system based on deep SVDD model
CN111680843B (en) * 2020-06-12 2022-06-28 电子科技大学 Chinese medicinal material survival area prediction method and system based on deep SVDD model
CN112257422A (en) * 2020-10-22 2021-01-22 京东方科技集团股份有限公司 Named entity normalization processing method and device, electronic equipment and storage medium
CN112257422B (en) * 2020-10-22 2024-06-11 京东方科技集团股份有限公司 Named entity normalization processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN104809176B (en) Tibetan language entity relation extraction method
US8478581B2 (en) Interlingua, interlingua engine, and interlingua machine translation system
CN110110054A (en) A method of obtaining question and answer pair in the slave non-structured text based on deep learning
CN108628828A (en) A kind of joint abstracting method of viewpoint and its holder based on from attention
CN106599032A (en) Text event extraction method in combination of sparse coding and structural perceptron
CN107193798A (en) A kind of examination question understanding method in rule-based examination question class automatically request-answering system
CN108920447B (en) Chinese event extraction method for specific field
CN107656921B (en) Short text dependency analysis method based on deep learning
CN111143574A (en) Query and visualization system construction method based on minority culture knowledge graph
Wyse et al. Generating questions from openlearn study units
Hämäläinen et al. Advances in synchronized XML-MediaWiki dictionary development in the context of endangered Uralic languages
CN110222343A (en) A kind of Chinese medicine plant resource name entity recognition method
CN108491399A (en) Chinese to English machine translation method based on context iterative analysis
Sitender et al. Sansunl: a Sanskrit to UNL enconverter system
Liu Corpus Design of Chinese Medicine English Vocabulary Translation Teaching System Based on Python
Baker FrameNet, present and future
De Melo et al. Towards universal multilingual knowledge bases
CN112182204A (en) Method and device for constructing corpus labeled by Chinese named entities
Mahlaza Grammars for generating isiXhosa and isiZulu weather bulletin verbs
CN113807102B (en) Method, device, equipment and computer storage medium for establishing semantic representation model
Ganbold et al. An experiment in managing language diversity across cultures
Ghosh et al. Clause identification and classification in bengali
CN113886521A (en) Text relation automatic labeling method based on similar vocabulary
Amezian et al. Training an LSTM-based Seq2Seq model on a Moroccan biscript lexicon
Mridha et al. Development of morphological rules for bangla words for universal networking language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190910