CN110222343A - A kind of Chinese medicine plant resource name entity recognition method - Google Patents
- Publication number
- CN110222343A (application CN201910512743.XA)
- Authority
- CN
- China
- Prior art keywords
- chinese medicine
- vector
- word
- document
- recognition method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a named entity recognition method for Chinese medicine plant resources, comprising the following steps: S1: obtain Chinese medicine germplasm resource documents; S2: annotate the documents according to a given rule and split the annotated documents into text sentences; S3: look up, one by one, the word vectors (word-level embeddings) and character vectors (character-level embeddings) corresponding to each text sentence, and train a GRU-CRF model with them; S4: use the trained GRU-CRF model to perform named entity recognition on unknown Chinese medicine germplasm resource documents. By building a GRU-CRF model, the method automatically recognizes named entities in Chinese medicine plant resource documents. It greatly improves recognition accuracy and recognition efficiency, reduces the time spent on manual identification, and can be extended to other named entity categories, such as soil mineral content and moisture content.
Description
Technical field
The invention relates to the technical field of named entity recognition for Chinese medicine, and in particular to a named entity recognition method for Chinese medicine plant resources.
Background technique
The planting environment of Chinese medicinal materials has a significant impact on their quality. For example, medicinal materials from different places of origin differ considerably in appearance, active ingredients, and medicinal efficacy. To study how the planting environment affects the quality of medicinal materials, named entities related to the planting environment must be identified in a large volume of unstructured Chinese medicine germplasm resource documents, providing basic data for further research. Such named entities include, for example, the geographical location where a medicinal plant is grown, the name of the medicinal material, the soil, and the climate. At present these entities are identified with a manual-plus-rule approach: the documents are first organized by hand and then matched with regular expressions to extract the required named entities. This approach is too inefficient to meet the needs of current research on Chinese medicine planting environments.
Summary of the invention
The object of the invention is to solve the above problems of the prior art by providing a named entity recognition method for Chinese medicine plant resources with high recognition efficiency.
A named entity recognition method for Chinese medicine plant resources, comprising the following steps:
S1: obtain Chinese medicine germplasm resource documents;
S2: annotate the Chinese medicine germplasm resource documents according to a given rule, and split the annotated documents into text sentences;
S3: look up, one by one, the word vectors and character vectors corresponding to each text sentence, and train a GRU-CRF model with the word vectors and character vectors;
S4: use the trained GRU-CRF model to perform named entity recognition on unknown Chinese medicine germplasm resource documents.
Further, in the named entity recognition method described above, step S3 comprises: looking up, one by one, the word vectors and character vectors corresponding to each text sentence; passing the character vectors of the characters contained in each text sentence through a bidirectional LSTM model to obtain hidden vectors; splicing the word vectors with the hidden vectors to obtain intermediate vectors; and training a bidirectional GRU-CRF model with the intermediate vectors as input to obtain the trained GRU-CRF model.
Further, in the named entity recognition method described above, the given rule is: annotate the Chinese medicine germplasm resource documents with { B-s, I-s, O-s, S-s } tags, where B marks the beginning of a named entity, I marks the remaining part of a named entity after its beginning, O marks other parts, and S marks a named entity consisting of a single character; the suffix s denotes the attribute of the named entity.
Further, in the named entity recognition method described above, the word vectors and character vectors are obtained by word2vec embedding training on a corpus built from Wikipedia and Baidu Baike.
Further, the named entity recognition method described above also includes, after step S1 and before step S2, preprocessing the Chinese medicine germplasm resource documents.
The preprocessing step comprises: unifying the format of all Chinese medicine germplasm resource documents, and deleting interference information from the unified documents.
Further, in the named entity recognition method described above, step S3 comprises:
When looking up the word vector corresponding to a text sentence, first search the word vector library obtained by the embedding training. If the word is not found, first perform synonym conversion through a domain dictionary to find a corresponding word, and then look up the vector of that word. The domain dictionary is generated from the corpus built from Wikipedia and Baidu Baike.
Beneficial effects:
By building a GRU-CRF model, the named entity recognition method provided by the invention automatically recognizes named entities in Chinese medicine plant resource documents. It greatly improves recognition accuracy and recognition efficiency, reduces the time spent on manual identification, and can be extended to other named entity categories, such as soil mineral content and moisture content.
Brief description of the drawings
Fig. 1 is a flow chart of the named entity recognition method for Chinese medicine plant resources of the invention;
Fig. 2 is a schematic diagram of annotation using the { B-s, I-s, O-s, S-s } rule;
Fig. 3 is a structure diagram of the GRU-CRF model.
Specific embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the technical solutions of the invention are described clearly and completely below. The described embodiments are evidently some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
The invention provides a named entity recognition method for Chinese medicine plant resources. As shown in Fig. 1, the method comprises the following steps:
S1: obtain Chinese medicine germplasm resource documents;
S2: annotate the Chinese medicine germplasm resource documents according to a given rule, and split the annotated documents into text sentences;
S3: look up, one by one, the word vectors and character vectors corresponding to each text sentence, and train a GRU-CRF model with the word vectors and character vectors;
S4: use the trained GRU-CRF model to perform named entity recognition on unknown Chinese medicine germplasm resource documents.
The given rule in step S2 is: annotate the Chinese medicine germplasm resource documents with { B-s, I-s, O-s, S-s } tags, where B marks the beginning of a named entity, I marks the remaining part of a named entity after its beginning, O marks other parts, and S marks a named entity consisting of a single character; the suffix s denotes the attribute of the named entity. For example, as shown in Fig. 2, the sentence "Shandong plants violet flower danshen" has the tag sequence { B-Location, I-Location, O, O, B-Herb, I-Herb, I-Herb, I-Herb }, with one tag per character of the Chinese original. Here "B-Location, I-Location" marks a location entity, "Shandong"; "B-Herb, I-Herb, I-Herb, I-Herb" marks a herb entity, "violet flower danshen"; and O marks the irrelevant parts. The structure of the GRU-CRF model is shown in Fig. 3.
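The tagging rule can be illustrated with a minimal Python sketch. The function name `tag_sentence`, the span format, and the Chinese sentence 山东种植紫花丹参 are assumptions made for illustration (the sentence is inferred from the eight-tag sequence in the example); non-entity characters receive a plain "O", as in the example above.

```python
from typing import List, Tuple

def tag_sentence(sentence: str, entities: List[Tuple[int, int, str]]) -> List[str]:
    """Assign one {B, I, O, S} tag per character.

    `entities` holds (start, end, attribute) spans with `end` exclusive.
    """
    tags = ["O"] * len(sentence)
    for start, end, attr in entities:
        if end - start == 1:
            tags[start] = f"S-{attr}"      # entity made of a single character
        else:
            tags[start] = f"B-{attr}"      # first character of the entity
            for i in range(start + 1, end):
                tags[i] = f"I-{attr}"      # remaining characters of the entity
    return tags

# Assumed Chinese original of the example sentence (hypothetical).
sentence = "山东种植紫花丹参"
entities = [(0, 2, "Location"), (4, 8, "Herb")]
tags = tag_sentence(sentence, entities)
for ch, tag in zip(sentence, tags):
    print(ch, tag)
```

Tagging per character rather than per word means the annotation does not depend on a word segmenter, which is one plausible reason for the character-level scheme.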
The word vectors and character vectors are obtained by word2vec embedding training on a corpus built from Wikipedia and Baidu Baike. Likewise, the domain dictionary for Chinese medicine planting is derived from the corpus built from Wikipedia and Baidu Baike.
Further, to improve the efficiency of GRU-CRF model training, the method also includes, after step S1 and before step S2, preprocessing the Chinese medicine germplasm resource documents.
The preprocessing step comprises: unifying the format of all Chinese medicine germplasm resource documents, and deleting interference information from the unified documents.
Specifically, the PDF-format Chinese medicine germplasm resource documents are uniformly converted to text format, only the effective body text of each document is retained, and interference information such as author and affiliation details and the bibliography is deleted.
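A minimal sketch of the cleaning and sentence-splitting steps, assuming the PDF has already been converted to plain text by some external tool. The heading and prefix lists are hypothetical heuristics, not rules taken from the source, and the sample text is made up.

```python
import re
from typing import List

# Hypothetical heuristics: headings that start a reference list, and
# line prefixes that carry author or affiliation information.
REFERENCE_HEADINGS = ("参考文献", "References")
METADATA_PREFIXES = ("作者", "单位", "Author", "Affiliation")

def clean_text(raw: str) -> str:
    """Keep only body text: cut the reference list and drop metadata lines."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped.startswith(REFERENCE_HEADINGS):
            break                      # everything after this is bibliography
        if not stripped or stripped.startswith(METADATA_PREFIXES):
            continue                   # skip blank and author/affiliation lines
        kept.append(stripped)
    return "".join(kept)               # Chinese text needs no joining space

def split_sentences(text: str) -> List[str]:
    """Split after sentence-ending punctuation, keeping the punctuation."""
    parts = re.split(r"(?<=[。！？!?])", text)
    return [p for p in parts if p]

# Made-up sample document, for illustration only.
raw = "作者：张三\n山东种植紫花丹参。\n土壤为砂质壤土。\n参考文献\n[1] 某文献。"
print(split_sentences(clean_text(raw)))
```

Real documents would likely need more robust rules (for example, page headers or multi-line affiliations), which the source does not specify.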
Step S3 comprises: looking up, one by one, the word vectors and character vectors corresponding to each text sentence; passing the character vectors of the characters contained in each text sentence through a bidirectional LSTM model to obtain hidden vectors; splicing the word vectors with the hidden vectors to obtain intermediate vectors; and training a bidirectional GRU-CRF model with the intermediate vectors as input to obtain the trained GRU-CRF model.
In addition, when looking up the word vector corresponding to a word, the search is first made in the word vector library obtained by word2vec embedding training on the corpus built from Wikipedia and Baidu Baike. If the word is not found, synonym conversion is first performed through the domain dictionary to find a corresponding word, and the vector of that word is then looked up.
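The lookup-with-fallback logic can be sketched as follows. The function name, the dictionary-based tables, and the toy entries are assumptions for illustration; the source does not say what happens when the synonym lookup also fails, so this sketch simply returns `None`.

```python
from typing import Dict, List, Optional

def lookup_word_vector(
    word: str,
    word_table: Dict[str, List[float]],
    synonyms: Dict[str, str],
) -> Optional[List[float]]:
    """Return the vector for `word`, falling back to a domain synonym."""
    if word in word_table:
        return word_table[word]
    canonical = synonyms.get(word)      # synonym conversion via the domain dictionary
    if canonical is not None and canonical in word_table:
        return word_table[canonical]
    return None                         # behaviour here is not specified by the source

# Toy data, invented for illustration only.
word_table = {"丹参": [0.1, 0.2], "山东": [0.3, 0.4]}
synonyms = {"赤参": "丹参"}

print(lookup_word_vector("赤参", word_table, synonyms))
```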
Step S3 is described in detail below and comprises the following steps:
S31. Perform word2vec embedding training on corpora such as Wikipedia and Baidu Baike to obtain a character vector table CT and a word vector table WT.
S32. For a given text sentence, look up the character vector table CT to generate a character vector sequence $(c_1, c_2, \ldots, c_m)$, where m is the number of characters in the sentence. This sequence is the input to a bidirectional LSTM model, which is trained to extract the morphological features of words. The LSTM is computed as follows:
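$i_t = \mathrm{sigmoid}(W_{ic} c_t + W_{ih} h_{t-1} + b_i)$ (1)
$f_t = \mathrm{sigmoid}(W_{fc} c_t + W_{fh} h_{t-1} + b_f)$ (2)
$g_t = \tanh(W_{gc} c_t + W_{gh} h_{t-1} + b_g)$ (3)
$o_t = \mathrm{sigmoid}(W_{oc} c_t + W_{oh} h_{t-1} + b_o)$ (4)
$s_t = f_t \odot s_{t-1} + i_t \odot g_t$ (5)
$h_t = o_t \odot \tanh(s_t)$ (6)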
Here W denotes a connection weight matrix and b a bias vector; f, i, and o are the forget gate, input gate, and output gate respectively; g is the candidate cell state computed from the current input; s is the cell state, which carries long-term dependency information. $i_t$, $f_t$, and $g_t$ together update the cell state $s_{t-1}$ at time t-1 to give the cell state $s_t$ at time t; $h_t$ is the hidden state at time t, determined jointly by $o_t$ and $s_t$. $\odot$ denotes element-wise multiplication. Computing a forward LSTM layer and a backward LSTM layer yields the forward information and the backward information of the sentence, and the two together constitute the hidden layer state.
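Equations (1) to (6) can be traced in a small pure-Python sketch of a single LSTM step. The parameter dictionary keys (`W_ic`, `b_i`, and so on) mirror the symbols in the equations; the helper names and the `run_lstm` wrapper for running one direction are assumptions for illustration, and a real model would use a tensor library with learned weights.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Params = Dict[str, list]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def affine(W_in: Matrix, x: Vector, W_rec: Matrix, h: Vector, b: Vector) -> Vector:
    """Row-wise W_in·x + W_rec·h + b."""
    return [
        sum(w * v for w, v in zip(row_in, x))
        + sum(w * v for w, v in zip(row_rec, h))
        + bias
        for row_in, row_rec, bias in zip(W_in, W_rec, b)
    ]

def lstm_step(c_t: Vector, h_prev: Vector, s_prev: Vector, p: Params) -> Tuple[Vector, Vector]:
    i = [sigmoid(v) for v in affine(p["W_ic"], c_t, p["W_ih"], h_prev, p["b_i"])]    # (1)
    f = [sigmoid(v) for v in affine(p["W_fc"], c_t, p["W_fh"], h_prev, p["b_f"])]    # (2)
    g = [math.tanh(v) for v in affine(p["W_gc"], c_t, p["W_gh"], h_prev, p["b_g"])]  # (3)
    o = [sigmoid(v) for v in affine(p["W_oc"], c_t, p["W_oh"], h_prev, p["b_o"])]    # (4)
    s_t = [fk * sk + ik * gk for fk, sk, ik, gk in zip(f, s_prev, i, g)]              # (5)
    h_t = [ok * math.tanh(sk) for ok, sk in zip(o, s_t)]                              # (6)
    return h_t, s_t

def run_lstm(chars: List[Vector], p: Params, hidden: int, reverse: bool = False) -> List[Vector]:
    """Run one direction over a character-vector sequence; outputs stay in sentence order."""
    h, s = [0.0] * hidden, [0.0] * hidden
    order = list(reversed(chars)) if reverse else list(chars)
    outputs = []
    for c_t in order:
        h, s = lstm_step(c_t, h, s, p)
        outputs.append(h)
    return outputs[::-1] if reverse else outputs
```

Calling `run_lstm` once with `reverse=False` and once with `reverse=True` (with separate parameter sets) gives the forward and backward information for each position; how the two are combined into one hidden vector is left open here because the source does not state it explicitly.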
S33. First check whether the word vector table WT contains a vector for the current word. If it does, use that word vector as input; if it does not, first perform synonym conversion through the domain dictionary to find a corresponding word, and then look that word up in WT. For a given text sentence this produces a word vector sequence $(w_1, w_2, \ldots, w_n)$, where n is the number of words in the sentence. The hidden layer states obtained in S32 are spliced with the word vector sequence $(w_1, w_2, \ldots, w_n)$ to generate the joint character-word vector sequence $(x_1, x_2, \ldots, x_n)$, which serves as the input for training the bidirectional GRU-CRF network. The GRU is computed as follows:
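$r_t = \mathrm{sigmoid}(W_{rx} x_t + W_{ru} u_{t-1} + b_r)$ (7)
$z_t = \mathrm{sigmoid}(W_{zx} x_t + W_{zu} u_{t-1} + b_z)$ (8)
$n_t = \tanh(W_{nx} x_t + W_{nu}(r_t \odot u_{t-1}) + b_n)$ (9)
$u_t = (1 - z_t) \odot n_t + z_t \odot u_{t-1}$ (10)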
Here $z_t$ is the update gate of the GRU, which determines how much of the information in the hidden state $u_{t-1}$ at time t-1 is passed on to the hidden state $u_t$ at time t; $r_t$ is the reset gate, which determines how much of the information in the hidden state $u_{t-1}$ at time t-1 is to be forgotten; $n_t$ summarizes the input $x_t$ and the hidden state $u_{t-1}$ at time t-1, i.e. it is the current candidate state. Computing a forward GRU layer and a backward GRU layer yields the forward information and the backward information of the sentence, and the two together constitute the hidden layer state.
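Equations (7) to (10) map directly onto a single GRU step. The following pure-Python sketch uses parameter keys that mirror the symbols in the equations; the helper names are assumptions for illustration, and the weights would be learned in practice.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]
Params = Dict[str, list]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def affine(W_in: Matrix, x: Vector, W_rec: Matrix, u: Vector, b: Vector) -> Vector:
    """Row-wise W_in·x + W_rec·u + b."""
    return [
        sum(w * v for w, v in zip(row_in, x))
        + sum(w * v for w, v in zip(row_rec, u))
        + bias
        for row_in, row_rec, bias in zip(W_in, W_rec, b)
    ]

def gru_step(x_t: Vector, u_prev: Vector, p: Params) -> Vector:
    r = [sigmoid(v) for v in affine(p["W_rx"], x_t, p["W_ru"], u_prev, p["b_r"])]    # (7) reset gate
    z = [sigmoid(v) for v in affine(p["W_zx"], x_t, p["W_zu"], u_prev, p["b_z"])]    # (8) update gate
    gated = [rk * uk for rk, uk in zip(r, u_prev)]                                   # r_t ⊙ u_{t-1}
    n = [math.tanh(v) for v in affine(p["W_nx"], x_t, p["W_nu"], gated, p["b_n"])]   # (9) candidate state
    return [(1.0 - zk) * nk + zk * uk for zk, nk, uk in zip(z, n, u_prev)]           # (10) new hidden state
```

With the update gate close to 1 the previous hidden state passes through almost unchanged, which matches the description of $z_t$ above.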
S34. Tag sequence prediction is performed with a CRF, and the optimal tag sequence is finally mapped to the corresponding named entities, producing an annotated sequence based on the { B, I, O, S } tags. Note the difference from step S2: in S2 the text is annotated manually with { B, I, O, S } tags, and the result serves as the training set; in this step the trained model makes the predictions, and its output is likewise expressed with { B, I, O, S } tags.
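Mapping a predicted tag sequence back to named entities can be sketched as below. The function name and the handling of malformed sequences (a stray I tag is skipped) are assumptions, since the source does not describe this step in detail; the Chinese sentence in the usage line is the assumed original of the example sentence.

```python
from typing import List, Tuple

def extract_entities(chars: str, tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, attribute) pairs from per-character tags."""
    entities: List[Tuple[str, str]] = []
    buffer, attr = "", ""
    for ch, tag in zip(chars, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and buffer and label == attr:
            buffer += ch                    # continue the open entity
            continue
        if buffer:                          # any other tag closes the open entity
            entities.append((buffer, attr))
            buffer, attr = "", ""
        if prefix == "B":
            buffer, attr = ch, label        # open a new multi-character entity
        elif prefix == "S":
            entities.append((ch, label))    # single-character entity
        # "O" and stray "I" tags are skipped
    if buffer:
        entities.append((buffer, attr))
    return entities

predicted = ["B-Location", "I-Location", "O", "O",
             "B-Herb", "I-Herb", "I-Herb", "I-Herb"]
print(extract_entities("山东种植紫花丹参", predicted))
```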
The prediction for a sequence is defined as:
$Y = (y_1, y_2, \ldots, y_n)$ (11)
The score of the predicted sequence is computed from A and P, where A is the state transition matrix, $A_{i,j}$ denotes the probability of transitioning from tag i to tag j, and P is the output matrix of the bidirectional GRU. From these scores the probability of each possible tag sequence is obtained, with $Y_s$ denoting the set of all possible tag sequences of sentence s. During model training, the log probability of the true tag sequence is maximized. During model prediction, the output sequence with the highest score is solved for, giving $y^*$, the predicted optimal tag sequence, which is the automatic recognition result for the named entities of the Chinese medicine plant resources.
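The score, probability, training objective, and prediction formulas for this passage are not reproduced in the text. A sketch of what they look like under the standard linear-chain CRF formulation commonly paired with bidirectional recurrent taggers is given below. This is an assumption, not the patent's own wording: the input symbol $X$, the start and end tags $y_0$ and $y_{n+1}$, and the index ranges are notation introduced here.

```latex
% Score of a tag sequence y for input X (A: transitions, P: BiGRU outputs)
s(X, y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i}

% Probability of y among all possible tag sequences Y_s
p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_s} e^{s(X, \tilde{y})}}

% Training objective: log probability of the true tag sequence
\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_s} e^{s(X, \tilde{y})}

% Prediction: highest-scoring tag sequence
y^{*} = \arg\max_{\tilde{y} \in Y_s} s(X, \tilde{y})
```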
The recognition method of the invention is illustrated below with an example:
First, the Chinese medicine germplasm resource PDF documents are converted to TXT text. Take the sentence "Shandong plants violet flower danshen" from the text as an example, with this sentence as the input. The character vector table and the word vector table are obtained through the word2vec model. The sentence is then segmented into words, and the word vector of each word is looked up in the word vector table. Next, the character vector table is searched to obtain the character vector of each character, and the character vectors of the characters contained in each word are passed through the bidirectional LSTM model to obtain hidden vectors. The word vectors and hidden vectors are spliced into intermediate vectors, which are the input to the bidirectional GRU-CRF model. Running the model yields the optimal tag sequence of the sentence, i.e. its annotated sequence: "B-Location I-Location O O B-Herb I-Herb I-Herb I-Herb". The result recognized from this sentence is therefore that the named entities comprise a location entity and a medicinal material entity.
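The final decoding step of the example, finding the highest-scoring tag sequence, is commonly solved with the Viterbi dynamic program. The source only states that the top-scoring sequence is solved for, so the use of Viterbi, the function name, and the omission of start and end transitions are assumptions of this sketch.

```python
from typing import List, Sequence

def viterbi(P: Sequence[Sequence[float]], A: Sequence[Sequence[float]]) -> List[int]:
    """Return the tag index sequence with the highest total score.

    P[t][j] is the emission score of tag j at position t (BiGRU output);
    A[i][j] is the transition score from tag i to tag j.
    """
    n, k = len(P), len(P[0])
    score = list(P[0])                  # best score of any path ending in each tag
    back: List[List[int]] = []          # back[t-1][j]: best previous tag for tag j at t
    for t in range(1, n):
        prev_best, new_score = [], []
        for j in range(k):
            i_best = max(range(k), key=lambda i: score[i] + A[i][j])
            prev_best.append(i_best)
            new_score.append(score[i_best] + A[i_best][j] + P[t][j])
        back.append(prev_best)
        score = new_score
    best = max(range(k), key=lambda j: score[j])
    path = [best]
    for prev_best in reversed(back):    # follow the back-pointers to the start
        path.append(prev_best[path[-1]])
    return path[::-1]

# Toy scores for tags [B, I, O], invented for illustration only.
P = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 3.0]]
A = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, -10.0, 0.0]]
print(viterbi(P, A))
```

The returned indices would then be mapped back to tag names and, from there, to entities.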
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the invention.
Claims (6)
1. A named entity recognition method for Chinese medicine plant resources, characterized by comprising the following steps:
S1: obtain Chinese medicine germplasm resource documents;
S2: annotate the Chinese medicine germplasm resource documents according to a given rule, and split the annotated documents into text sentences;
S3: look up, one by one, the word vectors and character vectors corresponding to each text sentence, and train a GRU-CRF model with the word vectors and character vectors;
S4: use the trained GRU-CRF model to perform named entity recognition on unknown Chinese medicine germplasm resource documents.
2. The named entity recognition method for Chinese medicine plant resources according to claim 1, characterized in that step S3 comprises: looking up, one by one, the word vectors and character vectors corresponding to each text sentence; passing the character vectors of the characters contained in each text sentence through a bidirectional LSTM model to obtain hidden vectors; splicing the word vectors with the hidden vectors to obtain intermediate vectors; and training a bidirectional GRU-CRF model with the intermediate vectors as input to obtain the trained GRU-CRF model.
3. The named entity recognition method for Chinese medicine plant resources according to claim 1, characterized in that the given rule is: annotate the Chinese medicine germplasm resource documents with { B-s, I-s, O-s, S-s } tags, where B marks the beginning of a named entity, I marks the remaining part of a named entity after its beginning, O marks other parts, and S marks a named entity consisting of a single character; the suffix s denotes the attribute of the named entity.
4. The named entity recognition method for Chinese medicine plant resources according to claim 1, characterized in that the word vectors and character vectors are obtained by word2vec embedding training on a corpus built from Wikipedia and Baidu Baike.
5. The named entity recognition method for Chinese medicine plant resources according to claim 1, characterized in that the method further comprises, after step S1 and before step S2, preprocessing the Chinese medicine germplasm resource documents;
the preprocessing step comprises: unifying the format of all Chinese medicine germplasm resource documents, and deleting interference information from the unified documents.
6. The named entity recognition method for Chinese medicine plant resources according to claim 4, characterized in that step S3 comprises:
when looking up the word vector corresponding to a text sentence, first searching the word vector library obtained by the embedding training; if the word is not found, first performing synonym conversion through a domain dictionary to find a corresponding word, and then looking up the vector of that word, the domain dictionary being generated from the corpus built from Wikipedia and Baidu Baike.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910512743.XA CN110222343A (en) | 2019-06-13 | 2019-06-13 | A kind of Chinese medicine plant resource name entity recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910512743.XA CN110222343A (en) | 2019-06-13 | 2019-06-13 | A kind of Chinese medicine plant resource name entity recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110222343A true CN110222343A (en) | 2019-09-10 |
Family
ID=67816976
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910512743.XA Pending CN110222343A (en) | 2019-06-13 | 2019-06-13 | A kind of Chinese medicine plant resource name entity recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110222343A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107644014A (en) * | 2017-09-25 | 2018-01-30 | 南京安链数据科技有限公司 | A kind of name entity recognition method based on two-way LSTM and CRF |
CN107748757A (en) * | 2017-09-21 | 2018-03-02 | 北京航空航天大学 | A kind of answering method of knowledge based collection of illustrative plates |
CN107908614A (en) * | 2017-10-12 | 2018-04-13 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on Bi LSTM |
CN108460012A (en) * | 2018-02-01 | 2018-08-28 | 哈尔滨理工大学 | A kind of name entity recognition method based on GRU-CRF |
CN108536679A (en) * | 2018-04-13 | 2018-09-14 | 腾讯科技(成都)有限公司 | Name entity recognition method, device, equipment and computer readable storage medium |
Non-Patent Citations (1)
Title |
---|
许鑫: "《基于文本特征计算的信息分析方法》", 30 November 2015 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111680843A (en) * | 2020-06-12 | 2020-09-18 | 电子科技大学 | Chinese medicinal material survival suitability area prediction method and system based on depth SVDD model |
CN111680843B (en) * | 2020-06-12 | 2022-06-28 | 电子科技大学 | Chinese medicinal material survival area prediction method and system based on depth SVDD model |
CN112257422A (en) * | 2020-10-22 | 2021-01-22 | 京东方科技集团股份有限公司 | Named entity normalization processing method and device, electronic equipment and storage medium |
CN112257422B (en) * | 2020-10-22 | 2024-06-11 | 京东方科技集团股份有限公司 | Named entity normalization processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104809176B (en) | Tibetan language entity relation extraction method | |
US8478581B2 (en) | Interlingua, interlingua engine, and interlingua machine translation system | |
CN110110054A (en) | A method of obtaining question and answer pair in the slave non-structured text based on deep learning | |
CN108628828A (en) | A kind of joint abstracting method of viewpoint and its holder based on from attention | |
CN106599032A (en) | Text event extraction method in combination of sparse coding and structural perceptron | |
CN107193798A (en) | A kind of examination question understanding method in rule-based examination question class automatically request-answering system | |
CN108920447B (en) | Chinese event extraction method for specific field | |
CN107656921B (en) | Short text dependency analysis method based on deep learning | |
CN111143574A (en) | Query and visualization system construction method based on minority culture knowledge graph | |
Wyse et al. | Generating questions from openlearn study units | |
Hämäläinen et al. | Advances in synchronized XML-MediaWiki dictionary development in the context of endangered Uralic languages | |
CN110222343A (en) | A kind of Chinese medicine plant resource name entity recognition method | |
CN108491399A (en) | Chinese to English machine translation method based on context iterative analysis | |
Sitender et al. | Sansunl: a Sanskrit to UNL enconverter system | |
Liu | Corpus Design of Chinese Medicine English Vocabulary Translation Teaching System Based on Python | |
Baker | FrameNet, present and future | |
De Melo et al. | Towards universal multilingual knowledge bases | |
CN112182204A (en) | Method and device for constructing corpus labeled by Chinese named entities | |
Mahlaza | Grammars for generating isiXhosa and isiZulu weather bulletin verbs | |
CN113807102B (en) | Method, device, equipment and computer storage medium for establishing semantic representation model | |
Ganbold et al. | An experiment in managing language diversity across cultures | |
Ghosh et al. | Clause identification and classification in bengali | |
CN113886521A (en) | Text relation automatic labeling method based on similar vocabulary | |
Amezian et al. | Training an LSTM-based Seq2Seq model on a Moroccan biscript lexicon | |
Mridha et al. | Development of morphological rules for bangla words for universal networking language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190910 |