CN110321550A

CN110321550A - Named entity recognition method and device for ancient traditional Chinese medicine (TCM) literature

Info

Publication number: CN110321550A
Application number: CN201910340359.6A
Authority: CN
Inventors: 谢永红; 夏超; 张德政; 阿孜古丽; 栗辉; 杨石兵
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2019-04-25
Filing date: 2019-04-25
Publication date: 2019-10-11

Abstract

An embodiment of the present invention discloses a named entity recognition method and device for ancient traditional Chinese medicine (TCM) literature. The method comprises: collating entity words of at least one entity type to obtain a first TCM domain vocabulary containing the entity types to be recognized; performing phrase mining on a TCM classical Chinese corpus using the AutoPhrase automatic phrase mining technique to obtain a second TCM domain vocabulary; labeling the entities that occur in the corpus according to a predetermined back-labeling strategy; obtaining annotated data for the corpus; generating a training data set, a validation data set, and a test data set, writing the training data set to a training file and the validation and test data sets to a test file; reading data from the training file and the test file, training an automatic named entity recognition model on that data, and running prediction on the corpus to obtain recognition results; and obtaining the recognized entities from the results.

Description

Named entity recognition method and device for ancient traditional Chinese medicine (TCM) literature

Technical field

The present invention relates to the field of Chinese language processing, and in particular to a named entity recognition method and device for ancient traditional Chinese medicine (TCM) literature.

Background technique

With the development of technology, named entity recognition needs to be performed on ancient TCM literature. Current methods all require large amounts of manually annotated data or hand-designed features. Annotation and feature design in the TCM domain require domain knowledge, so the cost is high.

Summary of the invention

In view of this, embodiments of the present invention provide a named entity recognition method and device for ancient TCM literature that can raise the level of automation of named entity recognition on such literature.

A named entity recognition method for ancient TCM literature, comprising:

S1: collating entity words of at least one entity type to obtain a first TCM domain vocabulary containing the entity types to be recognized, the first TCM domain vocabulary comprising entity words and their corresponding entity types;

S2: performing phrase mining on a TCM classical Chinese corpus using the AutoPhrase automatic phrase mining technique to obtain all possible entity words, yielding a second TCM domain vocabulary that comprises entity words;

S3: combining the first TCM domain vocabulary and the second TCM domain vocabulary and, according to a predetermined back-labeling strategy, labeling the entities that occur in the TCM classical Chinese corpus;

S4: combining the back-labeling results for the TCM classical Chinese corpus with the tie/break (connect/disconnect) annotation scheme to obtain annotated data for the corpus;

S5: combining the annotated data with the word embeddings of a pre-trained model trained on the TCM classical Chinese corpus to generate a training data set, a validation data set, and a test data set, writing the training data set to a training file and the validation and test data sets to a test file;

S6: reading data from the training file and the test file, training an AutoNER automatic named entity recognition model on that data, and using the trained AutoNER model to run prediction on the TCM classical Chinese corpus to obtain recognition results; obtaining the recognized entities from the results.

A named entity recognition device for ancient TCM literature, comprising:

a collating unit, which collates entity words of at least one entity type to obtain a first TCM domain vocabulary containing the entity types to be recognized, the first TCM domain vocabulary comprising entity words and their corresponding entity types;

a mining unit, which performs phrase mining on a TCM classical Chinese corpus using the AutoPhrase automatic phrase mining technique to obtain all possible entity words, yielding a second TCM domain vocabulary that comprises entity words;

a labeling unit, which combines the first TCM domain vocabulary and the second TCM domain vocabulary and, according to a predetermined back-labeling strategy, labels the entities that occur in the TCM classical Chinese corpus;

a processing unit, which combines the back-labeling results for the TCM classical Chinese corpus with the tie/break (connect/disconnect) annotation scheme to obtain annotated data for the corpus;

an output unit, which combines the annotated data with the word embeddings of a pre-trained model trained on the TCM classical Chinese corpus to generate a training data set, a validation data set, and a test data set, writing the training data set to a training file and the validation and test data sets to a test file;

a prediction unit, which reads data from the training file and the test file, trains an AutoNER automatic named entity recognition model on that data, and uses the trained AutoNER model to run prediction on the TCM classical Chinese corpus to obtain recognition results; the recognized entities are obtained from the results.

The present invention can perform named entity recognition on ancient TCM literature and raises the level of automated processing.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow diagram of the named entity recognition method for ancient TCM literature according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the named entity recognition method for ancient TCM literature in an application scenario of the present invention;

Fig. 3 shows the result after step 1 is executed in an embodiment of the present invention;

Fig. 4 shows the result after step 2 is executed in an embodiment of the present invention;

Fig. 5 shows the result after step 4 is executed in an embodiment of the present invention;

Fig. 6 shows the character and word vector results of the pre-trained word2vec model in the present invention;

Fig. 7 shows the result after step 6 is executed in an embodiment of the present invention;

Fig. 8 is a schematic diagram of the model training evaluation metrics in the present invention;

Fig. 9 is a connection diagram of the named entity recognition device for ancient TCM literature according to an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below with reference to the drawings.

It should be understood that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

For convenience of description, the device is described in terms of units/modules divided by function. When implementing the present invention, the functions of the units/modules may of course be realized in one or more pieces of software and/or hardware.

As shown in Fig. 1, a named entity recognition method for ancient TCM literature comprises:

S1: collating entity words of at least one entity type to obtain a first TCM domain vocabulary containing the entity types to be recognized, the first TCM domain vocabulary comprising entity words and their corresponding entity types. Fig. 3 shows the result of step S1: this step produces a vocabulary with types.

Format: one word and its corresponding type (ZZ denotes symptom and MX denotes pulse condition; the codes are the initial letters of the pinyin).

S2: performing phrase mining on a TCM classical Chinese corpus using the AutoPhrase automatic phrase mining technique to obtain all possible entity words, yielding a second TCM domain vocabulary that comprises entity words. Fig. 4 shows the result of AutoPhrase phrase mining: this step produces a vocabulary without types. Format: sorted by phrase quality from high to low.

S4: combining the back-labeling results for the TCM classical Chinese corpus with the tie/break (connect/disconnect) annotation scheme to obtain annotated data for the corpus. Fig. 5 shows the format for one complete sentence. "I None S" indicates that the token is not a possible entity; "O None D" indicates a possible entity whose type is unknown; "I + type + S" marks the beginning of a single entity, i.e. a break; "O + type + S" marks the middle or end of a single entity, i.e. a tie;

S5: combining the annotated data with the word embeddings of a pre-trained model trained on the TCM classical Chinese corpus to generate a training data set, a validation data set, and a test data set, writing the training data set to a training file and the validation and test data sets to a test file. Fig. 6 shows the character and word vector results of the pre-trained word2vec model: the result converts a character or a word into a vector. In Fig. 6, for example, a character is converted into a 200-dimensional vector, and 48355 indicates that there are 48355 characters and words. The training, validation, and test sets are three files whose contents are the results of S4 combined with the word2vec results and converted into purely numeric form, i.e. a format the computer can process.

S6: reading data from the training file and the test file, training an AutoNER automatic named entity recognition model on that data, and using the trained AutoNER model to run prediction on the TCM classical Chinese corpus to obtain recognition results; obtaining the recognized entities from the results. Fig. 7 shows the format of the model's prediction results: the first and second columns are the line numbers of the entity in the original corpus, the third column is the entity, and the fourth and fifth columns are the entity type id and the entity type.

Step S1 comprises:

S101: collating the entity words of at least one entity type, the collating including cleaning the entities by deleting spaces and punctuation marks;

S102: deleting ambiguous entities from the existing vocabulary and de-duplicating entities of the same type, to obtain the first TCM domain vocabulary composed of entities of the types to be recognized.
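Steps S101 and S102 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `build_typed_vocabulary`, the punctuation set, and the rule that a word listed under more than one type counts as ambiguous are all assumptions made for illustration. The type codes ZZ (symptom) and MX (pulse condition) follow the format described for Fig. 3.

```python
import re
from collections import defaultdict

# Whitespace plus a small, assumed set of ASCII and CJK punctuation marks.
_STRIP = re.compile(r"[\s,.;:!?，。；：！？、]+")

def build_typed_vocabulary(pairs):
    """Clean (entity_word, entity_type) pairs and return {word: type}.

    Hypothetical helper: words that end up with more than one type are
    treated as ambiguous and dropped; duplicates collapse automatically.
    """
    types_by_word = defaultdict(set)
    for word, etype in pairs:
        cleaned = _STRIP.sub("", word)  # S101: delete spaces and punctuation
        if cleaned:
            types_by_word[cleaned].add(etype)
    # S102: drop ambiguous words, keep one entry per remaining word.
    return {w: next(iter(t)) for w, t in types_by_word.items() if len(t) == 1}

vocab = build_typed_vocabulary([
    ("腹 痛", "ZZ"), ("腹痛", "ZZ"), ("浮脉。", "MX"),
    ("数", "ZZ"), ("数", "MX"),  # listed under two types: dropped
])
print(vocab)
```

A real vocabulary would likely need further ambiguity rules, such as the single-character and over-long entities mentioned in the application scenario.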

Step S2 comprises:

S201: setting, in the AutoPhrase automatic phrase mining script, the input and output paths, the minimum word frequency for phrase mining, and the number of threads the program runs with;

S202: maintaining a Chinese stop-word list according to the TCM classical Chinese corpus, adding the characters and words that need to be filtered out;

S203: adding TCM domain terms from the TCM classical Chinese corpus to the vocabulary to improve the quality of phrase mining, then performing phrase mining to generate the second TCM domain vocabulary, which is a vocabulary sorted by phrase quality score from high to low.
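As a rough illustration of how the ranked vocabulary from S203 could be consumed downstream, the sketch below parses phrase-mining output and sorts it by quality score. The line layout `score<TAB>phrase` and the function name `load_ranked_phrases` are assumptions; the actual output format should be checked against the AutoPhrase version in use.

```python
def load_ranked_phrases(lines):
    """Parse 'score<TAB>phrase' lines (assumed layout) into a list of
    (phrase, score) tuples sorted from highest to lowest quality."""
    phrases = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        score, phrase = line.split("\t", 1)
        phrases.append((phrase, float(score)))
    phrases.sort(key=lambda item: item[1], reverse=True)
    return phrases

sample = ["0.62\t上吐下泻", "0.95\t腹中痛", "", "0.80\t饮食少"]
ranked = load_ranked_phrases(sample)
for phrase, score in ranked:
    print(f"{score:.2f} {phrase}")
```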

Step S3 comprises:

S301: merging the first TCM domain vocabulary and the second TCM domain vocabulary; for the first TCM domain vocabulary, the entities and their corresponding types are read in;

S302: cleaning the second TCM domain vocabulary by deleting leading and trailing spaces and newline characters, and filtering out single-character entities; setting two confidence thresholds, one for single-word entities and one for multi-word entities, and screening each group against its threshold; entities that exceed the threshold are merged into the final vocabulary, where the entity word is saved and its entity type is set to NULL;

S303: according to the back-labeling strategy, for each sentence in the TCM classical Chinese text, returning the entities that may occur in it and their corresponding entity types;

The back-labeling strategy comprises:

where a long word contains at least two short words, applying long-word priority and labeling the long word with its corresponding entity type; where a long word is composed of at least two short words, labeling the entity type of each short word in the long word as NULL; where at least two words overlap and conflict, labeling the entity types of those words as NULL.
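The three back-labeling cases can be sketched in Python as follows. This is one possible reading of the strategy under stated assumptions, not the patented implementation: the function names, the character-offset span representation, the use of `None` for NULL, and the greedy tiling check are all illustrative choices, and the example strings are illustrative TCM terms.

```python
def find_matches(sentence, vocab):
    """Return sorted (start, end, type) spans for every vocabulary hit."""
    spans = []
    for word, etype in vocab.items():
        start = sentence.find(word)
        while start != -1:
            spans.append((start, start + len(word), etype))
            start = sentence.find(word, start + 1)
    return sorted(spans)

def is_tiled(span, spans):
    """True if `span` is exactly covered by consecutive shorter spans
    (greedy check; a simplification)."""
    start, end, _ = span
    inner = sorted(o for o in spans
                   if o != span and start <= o[0] and o[1] <= end)
    pos = start
    for s, e, _ in inner:
        if s == pos:
            pos = e
    return pos == end

def back_label(sentence, vocab):
    """Map (start, end) -> entity type, with None standing for NULL."""
    spans = find_matches(sentence, vocab)
    labels = {}
    for span in spans:
        s, e, etype = span
        parents = [o for o in spans if o != span and o[0] <= s and e <= o[1]]
        if parents:
            # Short word inside a longer one: kept (as NULL) only when the
            # longer word is fully composed of short words (case 2);
            # otherwise the long word takes priority (case 1).
            if any(is_tiled(p, spans) for p in parents):
                labels[(s, e)] = None
            continue
        if is_tiled(span, spans):
            continue  # case 2: the long word is replaced by its parts
        crossing = any(o[0] < s < o[1] < e or s < o[0] < e < o[1]
                       for o in spans)
        labels[(s, e)] = None if crossing else etype  # case 3 or plain hit
    return labels

print(back_label("病人少腹痛", {"少腹痛": "ZZ", "腹痛": "ZZ"}))
print(back_label("上吐下泻", {"上吐下泻": "ZZ", "上吐": "ZZ", "下泻": "ZZ"}))
print(back_label("饮食少", {"饮食": "ZZ", "食少": "ZZ"}))
```

Matching here is done on raw characters for brevity; the method itself labels segmented text, so a fuller version would align spans to token boundaries.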

Step S4 comprises:

S401: reading in the TCM classical Chinese corpus and segmenting it to obtain the segmentation result;

S402: splitting the TCM classical Chinese corpus into sentences at "。", retaining the original "。" after splitting, and filtering out sentences of fewer than 4 characters, which are not labeled;

S403: converting the corpus to the tie/break (connect/disconnect) annotation scheme. A row "<s> O None S" is added at the beginning of each sentence as its start marker. The segmented sentence is then traversed by index. If the current index is not equal to the starting index of the next entity in the results labeled in the third step, no entity word has been reached, and every token from the current index up to the starting index of the next entity word is labeled "token + I None S", one token per row. Conversely, if the current index equals the starting index of the next entity word, that entity word has been found. If the entity word has a corresponding type, its first token is labeled "token + I + type + S" and all its subsequent tokens are labeled "token + O + type + S";

If the entity word has no corresponding type, all the tokens it contains are labeled "token + O None D". This operation is repeated until no possible entity word remains in the sentence, at which point all remaining tokens are labeled "token + I None S". Finally, a row "<eof> I None S" is appended at the end of the sentence as its end marker. In this annotation scheme the label of each token occupies one row;

In this annotation scheme, "I None S" indicates that the token is not a possible entity; "O None D" indicates a possible entity whose type is unknown; "I + type + S" marks the beginning of a single entity, i.e. a break (disconnect); "O + type + S" marks the middle or end of a single entity word, i.e. a tie (connect). This completes the conversion of the TCM classical Chinese corpus to the tie/break (connect/disconnect) annotation scheme.
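The conversion in S403 can be sketched as below. The row layout follows the description above; the function name `to_tie_break`, the space separator between fields, and the representation of entities as sorted, non-overlapping `(start, end, type)` token spans with `None` for an unknown type are assumptions for illustration.

```python
def to_tie_break(tokens, entities):
    """Convert one segmented sentence to tie/break rows.

    tokens:   list of tokens (characters or words).
    entities: sorted, non-overlapping (start, end, type_or_None) spans
              over token indices, end exclusive (assumed representation).
    """
    rows = ["<s> O None S"]  # sentence start marker
    pending = list(entities)
    i = 0
    while i < len(tokens):
        if pending and pending[0][0] == i:
            start, end, etype = pending.pop(0)
            for j in range(start, end):
                if etype is None:
                    # Possible entity of unknown type.
                    rows.append(f"{tokens[j]} O None D")
                else:
                    tag = "I" if j == start else "O"  # break first, then tie
                    rows.append(f"{tokens[j]} {tag} {etype} S")
            i = end
        else:
            rows.append(f"{tokens[i]} I None S")  # not a possible entity
            i += 1
    rows.append("<eof> I None S")  # sentence end marker
    return rows

for row in to_tie_break(list("病人腹痛。"), [(2, 4, "ZZ")]):
    print(row)
```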

Step S5 comprises:

S501: pre-training on the TCM classical Chinese text and on its segmented form, respectively, to obtain word2vec character vectors and word vectors, i.e. the character and word vectors corresponding to the TCM corpus;

S502: for the special tokens that occur in the annotation results, '<s>', '<unk>', '<>', '<\n>', likewise training corresponding word2vec vectors and adding them to the character and word vectors, to obtain the complete character and word vectors;

S503: filtering the complete character and word vectors according to the characters and words that occur in the TCM classical Chinese corpus, keeping only those that occur in the corpus; one complete set of character and word vectors is used to process multiple TCM corpora;

S504: detecting possible errors in the annotation results; if a row with fewer than 4 fields is found in the annotation results, the annotation is judged to be faulty;

S505: automatically detecting the entity types that occur in the annotation results, returning a dictionary of all types, and sorting it simply to obtain all entity types and the embedding of the corresponding types; likewise sorting the tie/break (connect/disconnect) labels simply to obtain the corresponding dictionary, i.e. the tie/break embedding;

S506: obtaining the training data set from the input annotation results, the character and word vectors, and the embeddings; sorting the sentences by length, and finally writing the embedded training annotation data to the training file corresponding to the training set;

S507: obtaining the validation data set and the test data set from the input validation results, test results, and character and word vectors; sorting by sentence length; finally returning the embedded validation and test data, and writing the character and word vectors, the entity type embedding, the tie/break (connect/disconnect) embedding, and the embedded validation and test data sets to the test file.
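The filtering in S503 and the field-count check in S504 can be illustrated with a small sketch. The vectors are held in a plain dictionary for simplicity; the function names, the default special-token tuple, and the whitespace-separated row layout are assumptions.

```python
def filter_embeddings(vectors, corpus_tokens, specials=("<s>", "<unk>")):
    """Keep only vectors for tokens that occur in the corpus, plus the
    special tokens (S503). `vectors` maps token -> list of floats."""
    keep = set(corpus_tokens) | set(specials)
    return {tok: vec for tok, vec in vectors.items() if tok in keep}

def find_malformed_rows(rows, expected_fields=4):
    """Return indices of non-empty annotation rows that have fewer than
    `expected_fields` whitespace-separated fields (S504)."""
    return [i for i, row in enumerate(rows)
            if row.strip() and len(row.split()) < expected_fields]

vectors = {"腹": [0.1, 0.2], "痛": [0.3, 0.4], "药": [0.5, 0.6],
           "<s>": [0.0, 0.0], "<unk>": [0.0, 0.1]}
kept = filter_embeddings(vectors, ["腹", "痛"])
bad = find_malformed_rows(["腹 I ZZ S", "痛 O ZZ", "", "。 I None S"])
print(sorted(kept), bad)
```

Filtering before training keeps the embedding table small while letting one large pre-trained table serve several corpora.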

Step S6 comprises:

S601: passing the training file path and test file path obtained in step S5, the model output path, the learning rate, the hidden layer dimension, the dropout rate, the word vector dimension, the number of epochs, and the optimizer algorithm to the training model;

S602: during model training, first reading the test file to load the type labels, the tie/break (connect/disconnect) labels, and the validation and test data sets; then reading the training file to load the training data set; setting up the NER model, loading the pre-trained character and word vectors, and setting the parameters of the corresponding optimizer; during training, after every certain number of epochs or after training on a portion of the data, evaluating on the validation set to obtain precision, recall, and F1, recording the best F1 and the corresponding checkpoint, and testing on the test set; if performance on the validation set stops improving, adjusting the parameters; after several rounds of checking, the best checkpoint and the best evaluation metric values are obtained;

S603: passing the best checkpoint last recorded in S602, the path of the last saved model, and the TCM classical Chinese corpus to be predicted to the program that encodes the prediction set, which embeds the prediction corpus according to the passed corpus and model to obtain the encoded prediction data set;

S604: passing the prediction data set, the recorded checkpoint, the path of the last saved model, and the hidden layer dimension, dropout rate, and word vector dimension parameters to the prediction program; reading in the prediction data set, setting up the NER model, predicting a result for each input sentence, and saving the results to a decoding result file;

S605: reading in the decoding result file obtained from prediction, extracting entities by type and saving the different types to different files, while saving each entity's position in the TCM classical Chinese corpus, its type, and the entity word itself. This completes the named entity recognition task for ancient TCM literature.
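The checkpoint selection described in S602 can be sketched as follows, assuming entities are compared as exact `(start, end, type)` spans. The function names and the in-memory `evaluations` structure are hypothetical; in practice each evaluation would come from running a saved checkpoint on the validation set.

```python
def prf1(gold, predicted):
    """Span-level precision, recall, and F1 over sets of entity spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # exact-match true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def select_best_checkpoint(evaluations):
    """evaluations: iterable of (checkpoint_name, gold, predicted).
    Returns the (name, f1) pair with the highest validation F1."""
    best_name, best_f1 = None, -1.0
    for name, gold, predicted in evaluations:
        _, _, f1 = prf1(gold, predicted)
        if f1 > best_f1:
            best_name, best_f1 = name, f1
    return best_name, best_f1

gold = {(0, 2, "ZZ"), (5, 7, "MX")}
best = select_best_checkpoint([
    ("checkpoint_a", gold, {(0, 2, "ZZ")}),
    ("checkpoint_b", gold, {(0, 2, "ZZ"), (5, 7, "MX"), (9, 10, "ZZ")}),
])
print(best)
```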

An application scenario of the present invention is described below.

As shown in Fig. 2, a named entity recognition technique for ancient TCM literature according to the present invention comprises the steps:

S1: collating entity words of the various entity types to obtain a TCM domain vocabulary containing the entity types to be recognized. If, for example, entities such as symptoms, medicinal materials, and prescriptions are to be recognized, this step must produce a vocabulary containing those three classes of entity, in a format where each entity maps to its type;

S2: performing phrase mining on the TCM classical Chinese corpus using AutoPhrase to obtain all possible entities, and collating them into a TCM domain vocabulary that does not include entity types;

S3: combining the two vocabularies above and, according to the designed back-labeling strategy, labeling the entities that may occur in the TCM classical Chinese corpus;

S4: combining the back-labeling results for the TCM classical Chinese corpus with the tie/break annotation scheme to obtain annotated data for the classical text;

S5: combining the annotated data with the word embeddings of a pre-trained model trained on the original corpus to generate training, validation, and test data. Training set: used to train the model; validation set: used to tune the model parameters for the best performance; test set: used to test model performance;

S6: training the AutoNER model and using the trained model to run prediction on the corpus to obtain recognition results; obtaining the recognized entities from the results. Format: the first and second columns are the line numbers of the entity in the original corpus, the third column is the entity, and the fourth and fifth columns are the entity type id and the entity type.

The implementation of step S1 further comprises the following steps:

S101: collating the entity words of the various entity types, including cleaning the entities by deleting spaces, punctuation marks, and the like;

S102: deleting highly ambiguous entities from the existing vocabulary, such as single-character entities, entities with multiple meanings, and entities with too many characters; finally de-duplicating entities of the same type to obtain the vocabulary composed of entities of the types to be recognized.

The implementation of step S2 further comprises the following steps:

S201: modifying the input and output paths, the minimum word frequency for phrase mining, and the number of threads in the AutoPhrase script;

S202: maintaining the Chinese stop-word list according to the original corpus, adding the characters and words that need to be filtered out;

S203: adding high-quality phrases from the TCM classical Chinese corpus to the vocabulary to improve the quality of phrase mining, then performing phrase mining; the result of phrase mining is a vocabulary without types, sorted by phrase quality score from high to low.

The implementation of step S3 further comprises the following steps:

S301: before labeling the entities that may occur in the original corpus, first merging the two vocabularies generated in steps 1 and 2. For the vocabulary with types, only the entities and their corresponding types need to be read in;

S302: for the entities mined by AutoPhrase, first cleaning the words by deleting leading and trailing spaces and newline characters, and filtering out single-character entities because they are too ambiguous. Two manually defined confidence thresholds are set, one for single-word entities and one for multi-word entities, and each group is screened against its threshold. Entities that exceed the threshold are merged into the final vocabulary; only the entity needs to be saved when merging, and its entity type is set to NULL;

S303: the designed back-labeling strategy is as follows.

Entity back-labeling falls roughly into three cases:

(1) A long word contains multiple short words; for example, "pain in the abdomen" contains "pain".

(2) A long word is composed entirely of multiple short words; for example, 上吐下泻 (vomiting and diarrhea) is composed of 上吐 (vomiting) and 下泻 (diarrhea).

(3) Two words overlap and conflict; for example, 饮食少 (reduced food intake) can be split in two ways, as 饮食 or as 食少.

In the first case, long-word priority is applied and the long word is labeled with its entity type; in the second case, each short word of the long word is labeled NULL; in the third case, all the conflicting words are labeled NULL;

S304: according to the back-labeling strategy, for each sentence in the ancient TCM literature, returning the entities that may occur in it and their corresponding entity types, each of which is either an exact entity type or NULL. For the NULL part, no entity type is indicated in the annotation, so it has no effect on the model during training, but a possible entity type can be predicted from context at prediction time.
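The screening and merging in S302 might look like the following sketch. The function name `merge_vocabularies`, the default threshold values, and the assumption that the mined phrases arrive as two separate lists (single-word and multi-word) are illustrative; the source only states that two manually defined thresholds are used.

```python
def merge_vocabularies(typed_vocab, single_word, multi_word,
                       single_threshold=0.8, multi_threshold=0.5):
    """Merge mined phrases into the typed vocabulary.

    typed_vocab: {word: type} from step 1.
    single_word, multi_word: lists of (phrase, score) from step 2.
    Mined phrases get type None (NULL); thresholds are assumed values.
    """
    merged = dict(typed_vocab)
    groups = ((single_word, single_threshold), (multi_word, multi_threshold))
    for phrases, threshold in groups:
        for phrase, score in phrases:
            phrase = phrase.strip()  # delete surrounding spaces and newlines
            if len(phrase) < 2:      # filter out single-character entities
                continue
            if score > threshold:
                # Keep the known type if the word is already in the vocabulary.
                merged.setdefault(phrase, None)
    return merged

merged = merge_vocabularies(
    {"腹痛": "ZZ"},
    single_word=[("腹痛", 0.99), ("痛", 0.99), ("发热\n", 0.85), ("恶寒", 0.60)],
    multi_word=[("上吐下泻", 0.70), ("汗出而喘", 0.30)],
)
print(merged)
```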

The implementation of step S4 further comprises the following steps:

S401: first reading in the original corpus, then applying a segmentation method designed for ancient TCM literature to obtain the segmentation result;

S402: splitting the corpus into sentences at "。", retaining the original "。" after splitting, and filtering out sentences of fewer than 4 characters in the original corpus, which are not labeled;

S403: converting the corpus to the tie/break annotation scheme. A row "<s> O None S" is added at the beginning of each sentence as its start marker. The segmented sentence is then traversed by index. If the current index is not equal to the starting index of the next entity in the results labeled in the third step, no entity has been reached, and every token from the current index up to the starting index of the next entity is labeled "token + I None S", one token per row. Conversely, if the current index equals the starting index of the next entity, that entity has been found. If the entity has a type, its first token is labeled "token + I + type + S" and all its subsequent tokens are labeled "token + O + type + S";

If the entity has no type, all the tokens it contains are labeled "token + O None D". This operation is repeated until no possible entity remains in the sentence, at which point all remaining tokens are labeled "token + I None S". Finally, a row "<eof> I None S" is appended at the end of the sentence as its end marker.

In this annotation scheme the label of each token occupies one row. "I None S" indicates that the token is not a possible entity; "O None D" indicates a possible entity whose type is unknown; "I + type + S" marks the beginning of a single entity, i.e. a break; "O + type + S" marks the middle or end of a single entity, i.e. a tie. This completes the conversion of the TCM corpus to the tie/break annotation scheme.
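The sentence splitting in S402 can be illustrated with a short sketch. The function name `split_sentences` is hypothetical, and counting the retained "。" toward the 4-character minimum is an assumption, since the source does not say whether the delimiter is included in the count.

```python
import re

def split_sentences(text, min_chars=4):
    """Split at the Chinese full stop, keep the delimiter on each sentence,
    and drop sentences shorter than `min_chars` characters."""
    pieces = re.findall(r"[^。]+。?", text)  # each piece keeps its "。"
    sentences = [piece.strip() for piece in pieces]
    return [s for s in sentences if len(s) >= min_chars]

text = "太阳病，发热汗出。可。脉浮者，病在表。"
for sentence in split_sentences(text):
    print(sentence)
```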

The implementation of step S5 further comprises the following steps:

S501: pre-training on the original corpus and on its segmented form, respectively, to obtain the word2vec results, i.e. the character vectors and word vectors corresponding to the TCM corpus;

S502: for the special tokens that occur in the annotation results, such as '<s>', '<unk>', '<>', '<\n>', likewise training corresponding word2vec vectors and adding them to the character and word vectors obtained in the previous step, to obtain the complete character and word vectors;

S503: filtering the character and word vectors trained in the previous step according to the characters and words that occur in the original corpus, keeping only those that occur in it. This step removes from the original vectors the characters and words that do not occur in the current training corpus, reducing memory consumption. A more complete set of character and word vectors can also be used to process multiple TCM corpora, which removes the need to retrain the vectors for every training run;

S504: detecting possible errors in the annotation results; if a row with fewer than 4 fields is found in the annotation results, the annotation is faulty;

S505: automatically detecting the entity types that occur in the annotation results, returning a dictionary of all types, and sorting it simply to obtain all entity types and the embedding of the corresponding types; likewise sorting the tie/break labels simply to obtain the corresponding dictionary, i.e. the tie/break embedding;

S506: obtaining the training data set from the input annotation results, the character and word vectors, and the embeddings from the previous step; sorting by sentence length, and finally writing the embedded training annotation data to the training file corresponding to the training set;

S507: obtaining the validation and test data sets from the input validation annotation results, test annotation results, and character and word vectors; sorting by sentence length; finally returning the embedded validation and test annotation data, and writing the character and word vectors, the entity type embedding, the tie/break embedding, and the embedded validation and test data to the test file.
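Steps S505 to S507 can be sketched as building label dictionaries and converting annotated sentences to integer ids. All names below are hypothetical, and the id assignment (alphabetical order of the labels) is only one way to realize the "simple sort" the text mentions.

```python
def build_label_dicts(sentences):
    """sentences: list of sentences, each a list of 'token tag type flag'
    rows. Returns (type2id, tie2id) built from sorted label sets."""
    types = sorted({row.split()[2] for sent in sentences for row in sent})
    ties = sorted({row.split()[1] for sent in sentences for row in sent})
    return ({t: i for i, t in enumerate(types)},
            {t: i for i, t in enumerate(ties)})

def encode(sentences, token2id, type2id, tie2id, unk="<unk>"):
    """Sort sentences by length and map every field to an integer id;
    unseen tokens fall back to the id of `unk`."""
    encoded = []
    for sent in sorted(sentences, key=len):
        fields = [row.split() for row in sent]
        encoded.append({
            "tokens": [token2id.get(f[0], token2id[unk]) for f in fields],
            "ties": [tie2id[f[1]] for f in fields],
            "types": [type2id[f[2]] for f in fields],
        })
    return encoded

sents = [
    ["<s> O None S", "腹 I ZZ S", "痛 O ZZ S", "<eof> I None S"],
    ["<s> O None S", "<eof> I None S"],
]
type2id, tie2id = build_label_dicts(sents)
token2id = {"<unk>": 0, "<s>": 1, "<eof>": 2, "腹": 3}
data = encode(sents, token2id, type2id, tie2id)
print(type2id, tie2id, data)
```

Sorting by sentence length groups sentences of similar size, which reduces padding when batches are formed.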

It is further comprising the steps of in the implementation steps of step S6:

S601, will obtained in S5 step training file path, test file path, the outgoing route of model, learning rate, Hidden layer dimension, droprate, term vector dimension, epoch and optimizer algorithm etc. are passed in training pattern；Model and algorithm Specifically: in AutoNER on existing model training program, it is added to the process of load word vector, it is other the same.The mould Type uses the network structure based on Recognition with Recurrent Neural Network RNN, is obtained by incoming data set and model parameter training AutoNER model.

S602, during model training, first read the test file to load the type labels, the tie/break labels, and the validation and test sets; then read the training file to load the training set; set up the NER model, load the pretrained character and word vectors, and set the parameters of the optimizer, such as the optimizer type and the initial learning rate; finally start the training process of the model. During training, after every given number of epochs or every given portion of the training data, evaluate on the validation set to obtain precision, recall and F1, record the best F1 value and the corresponding checkpoint, and test on the test set; if the performance on the validation set stops improving, adjust parameters such as the learning rate; after repeated checks, the best checkpoint and the best evaluation metric values are obtained;

S603, pass the checkpoint finally recorded in S602, the path of the finally saved model, and the original corpus to be predicted into the program that encodes the prediction set; according to the passed-in prediction corpus and model, the prediction corpus is converted to Embeddings, giving the encoded prediction dataset;

S604, pass the prediction dataset, the recorded checkpoint, the path of the finally saved model, and parameters such as the hidden layer dimension, droprate and word vector dimension into the prediction program; read in the prediction dataset, set up the NER model, predict a result for each input separately, and save the results to the decoding result file;

S605, read in the decoding result file obtained from prediction, extract the entities by type, and save entities of different types to different files, together with each entity's position in the original corpus, its type, and the entity itself. This completes the named entity recognition task for Chinese medical book documents.
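
The extraction by type in S605 can be pictured with the sketch below. The layout of the decoding result file is not given in the text, so the sketch assumes each decoded token is a (character, flag, entity type) triple in which flag `I` (break) starts a new chunk and flag `O` (tie) continues the current one; the function name `group_entities` and the types `Herb` and `Disease` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Token = Tuple[str, str, str]  # (character, "I" = break / "O" = tie, entity type)


def group_entities(tokens: List[Token]) -> Dict[str, List[Tuple[int, str]]]:
    """Collect (start position, entity text) pairs for each entity type."""
    grouped: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    start, text, kind = 0, "", "None"

    def flush() -> None:
        # Store the chunk collected so far unless it is a non-entity.
        if text and kind != "None":
            grouped[kind].append((start, text))

    for pos, (char, flag, etype) in enumerate(tokens):
        if flag == "I":   # a break closes the previous chunk and opens a new one
            flush()
            start, text, kind = pos, char, etype
        else:             # a tie extends the current chunk
            text += char
    flush()
    return dict(grouped)


decoded = [
    ("以", "I", "None"),
    ("麻", "I", "Herb"), ("黄", "O", "Herb"),
    ("为", "I", "None"), ("使", "I", "None"),
    ("伤", "I", "Disease"), ("寒", "O", "Disease"),
]
entities = group_entities(decoded)
```

Each key of `entities` could then be written to its own file, with the stored start position giving the entity's location in the original corpus.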

In the present invention, the evaluation metrics of model training are shown in Figure 8.
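
For reference, the validation metrics recorded in S602 are commonly computed at the entity level as sketched below. This is an assumed formulation (exact span-and-type matching), not necessarily the one used by the AutoNER evaluation code; the checkpoint names and F1 values in the example are made up for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, entity type)


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 over exact span matches."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def best_checkpoint(f1_history: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Return the (checkpoint name, F1) pair with the highest validation F1."""
    return max(f1_history, key=lambda item: item[1])


gold = {(0, 2, "Herb"), (5, 7, "Disease")}
pred = {(0, 2, "Herb"), (3, 4, "Herb")}
p, r, f1 = precision_recall_f1(gold, pred)

# Hypothetical validation history collected during training.
best = best_checkpoint([("ckpt_1", 0.40), ("ckpt_2", 0.55), ("ckpt_3", 0.50)])
```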

The original Chinese medicine ancient-text corpus is as follows:

Taishan leaf grandson Xiang Guoyou avoids as taboo beneficial bud person, and body such as bright charcoal when prolonging remaining examine, opisthotonos, coma is in silence.Examine complete, Xiang Guowen It says；Does arteries and veins dissipate no? can you rescue no? it is remaining to say；In a panic without labor, without delivering, Gu Mai does not dissipate still this typhoid fever Grand Tutor, and disease gesture is also light, can answer Hand takes effect.Prime minister says；It is easier said than done, i.e., it with treatise on Febrile Diseases, can deliver within 1st or two, preferably conciliate within 3rd or four, it is 14 days modern, An Ganbiao? it is remaining to say；There is exterior syndrome table arteries and veins Ju to see, though more time again, still might as well with opening pores method solution it.High fever like the burning coal over the body, a sweat are It dissipates, is monarch with Rhizoma Et Radix Notopterygii then, pueraria lobata, radix bupleuri, cimicifugae foetidae are minister, and Rhizoma Chuanxiong, purple perilla, radix paeoniae rubra are assistant, and Chinese ephedra is to make, about 1 Qian Chongxian Ginger five, even must be Bulbus Allii Fistulosi five, two bowls of water, decoct one bowl.Hot drink and sweat such as rain are infused, and disease is recover, without tired medicine again.Public affairs, which are sighed, to be said；Preceding doctor Fire syndrome is takeed for, it is more to take freshener, so disease is more, public celestial, the evening what is met each other.It is remaining to say；Without very surprise, but feel the pulse not Accidentally, the appropriate ear of medication weight.

After sentence splitting:

Taishan leaf grandson Xiang Guoyou avoids as taboo beneficial bud person, and body such as bright charcoal when prolonging remaining examine, opisthotonos, coma is in silence.

Examine complete, prime minister, which asks, to be said；Does arteries and veins dissipate no? can you rescue no? it is remaining to say；In a panic without labor, this typhoid fever is without delivering, and Gu Mai is still by Grand Tutor It does not dissipate, disease gesture is also light, hand can be answered to take effect.

Prime minister says；It is easier said than done, i.e., it with treatise on Febrile Diseases, can deliver within 1st or two, preferably conciliate within 3rd or four, it is 14 days modern , An Ganbiao? it is remaining to say；There is exterior syndrome table arteries and veins Ju to see, though more time again, still might as well with opening pores method solution it.

High fever like the burning coal over the body, a sweat dissipate, and are monarch with Rhizoma Et Radix Notopterygii then, and pueraria lobata, radix bupleuri, cimicifugae foetidae are minister, and Rhizoma Chuanxiong, purple perilla, radix paeoniae rubra are assistant, Chinese ephedra is to make, and about 1 money weight fresh gingers five even must be Bulbus Allii Fistulosi five, two bowls of water, decocts one bowl.

Hot drink and sweat such as rain are infused, and disease is recover, without tired medicine again.

Public affairs, which are sighed, to be said；Preceding doctor takes for fire syndrome, and it is more to take freshener, so disease is more, public celestial, the evening what is met each other.

It is remaining to say；Without very surprise, but feeling the pulse does not miss, the appropriate ear of medication weight.

Result after word segmentation:

As shown in Figure 9, a named entity recognition device for Chinese medical book documents according to the present invention comprises:

A collating unit 41, which collates the entity words of at least one entity type to obtain a first Chinese medicine domain vocabulary containing the entity types to be recognized; the first Chinese medicine domain vocabulary comprises entity words and their corresponding entity types;

A mining unit 42, which uses the AutoPhrase automatic phrase mining technique to mine phrases from the Chinese medicine ancient-text corpus, obtaining all possible entity words and thereby a second Chinese medicine domain vocabulary; the second Chinese medicine domain vocabulary comprises entity words;

A labeling unit 43, which combines the first Chinese medicine domain vocabulary and the second Chinese medicine domain vocabulary and, according to a predetermined back-labeling strategy, labels the entities that occur in the Chinese medicine ancient-text corpus;

A processing unit 44, which combines the back-labeling result of the Chinese medicine ancient-text corpus with the tie/break (connect/disconnect) labeling scheme to obtain the annotated data of the Chinese medicine ancient-text corpus;

An output unit 45, which combines the annotated data with the pretrained Word Embedding obtained by training on the Chinese medicine ancient-text corpus to generate a training dataset, a validation dataset and a test dataset, outputs the training dataset to a training file, and outputs the validation dataset and the test dataset to a test file;

A prediction unit 46, which reads data from the training file and the test file, trains the AutoNER automatic named entity recognition model on the data read, and uses the trained AutoNER model to predict on the Chinese medicine ancient-text corpus to obtain the recognition result; the recognized entities are obtained from the result.

The present invention provides a named entity recognition technique for Chinese medical book documents. It performs named entity recognition on a Chinese medicine ancient-text corpus using the phrase mining technique AutoPhrase, a word segmentation method for Chinese medical book documents (patent applied for), and a modified version of AutoNER, an entity recognition technique originally developed for English. The method comprises: collating the entity words of the various entity types to obtain a Chinese medicine domain vocabulary containing the entity types to be recognized; using AutoPhrase to obtain, from the Chinese medicine ancient-text corpus, a Chinese medicine domain vocabulary composed of the phrases that occur in it; merging the two vocabularies and labeling them back into the original Chinese medicine ancient-text corpus according to a designed back-labeling strategy; combining the back-labeling result with the latest tie/break labeling scheme to obtain the annotated data of the corpus; combining the annotated data with the pretrained Word Embedding obtained by training on the original corpus to generate training, validation and test data; training the AutoNER model and using the trained model to predict on the corpus to obtain the recognition result; and obtaining the recognized entities from the result. The technique is based on distant supervision: it requires neither a manually annotated corpus nor manually engineered features, only a vocabulary of the entity types to be recognized, and can therefore recognize entities in Chinese medical book documents efficiently and intelligently. Because only a Chinese medicine domain vocabulary is needed and no manual annotation, the cost is greatly reduced.
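
As a rough picture of the back-labeling idea, the sketch below labels a sentence from a merged vocabulary using greedy left-to-right longest matching. It illustrates only the long-word-priority part of the strategy and ignores the rules for overlapping or conflicting words; the function name `label_back`, the vocabulary entries, and the types `Herb` and `Prescription` are hypothetical, with `None` standing in for an entry whose type is unknown.

```python
from typing import Dict, List, Optional, Tuple

Match = Tuple[int, int, str, Optional[str]]  # (start, end, surface, entity type)


def label_back(sentence: str, vocab: Dict[str, Optional[str]]) -> List[Match]:
    """Scan left to right, taking the longest vocabulary entry at each position."""
    max_len = max((len(w) for w in vocab), default=0)
    matches: List[Match] = []
    i = 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if candidate in vocab:
                matches.append((i, i + length, candidate, vocab[candidate]))
                i += length
                break
        else:
            i += 1  # no vocabulary entry starts here; move on one character
    return matches


# Hypothetical merged vocabulary: typed entries stand for the curated first
# vocabulary, untyped (None) entries for the mined second vocabulary.
vocab = {"麻黄": "Herb", "麻黄汤": "Prescription", "伤寒": None}
found = label_back("伤寒宜麻黄汤", vocab)
```

Here the longer entry 麻黄汤 wins over the shorter 麻黄 that it contains, which mirrors the long-word-priority rule.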

Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A named entity recognition method for Chinese medical book documents, characterized by comprising:

S1, collating the entity words of at least one entity type to obtain a first Chinese medicine domain vocabulary containing the entity types to be recognized; the first Chinese medicine domain vocabulary comprises entity words and their corresponding entity types;

S2, using the AutoPhrase automatic phrase mining technique to mine phrases from a Chinese medicine ancient-text corpus, obtaining all possible entity words and thereby a second Chinese medicine domain vocabulary; the second Chinese medicine domain vocabulary comprises entity words;

S3, combining the first Chinese medicine domain vocabulary and the second Chinese medicine domain vocabulary and, according to a predetermined back-labeling strategy, labeling the entities that occur in the Chinese medicine ancient-text corpus;

S4, combining the back-labeling result of the Chinese medicine ancient-text corpus with the tie/break (connect/disconnect) labeling scheme to obtain the annotated data of the Chinese medicine ancient-text corpus;

S6, reading data from the training file and the test file, training the AutoNER automatic named entity recognition model on the data read, and using the trained AutoNER model to predict on the Chinese medicine ancient-text corpus to obtain the recognition result; obtaining the recognized entities from the result.

2. The method according to claim 1, characterized in that step S1 comprises:

S101, collating the entity words of at least one entity type, the collating comprising: cleaning the entities by deleting spaces and punctuation marks;

S102, deleting ambiguous entities from the existing vocabulary and deduplicating the entities of the same type, to obtain the first Chinese medicine domain vocabulary composed of entities of the types to be recognized.

3. The method according to claim 1, characterized in that step S2 comprises:

S201, setting, in the AutoPhrase automatic phrase mining script, the input and output paths, the word frequency for phrase mining, and the number of threads with which the program runs;

S202, maintaining a Chinese stop-word list according to the Chinese medicine ancient-text corpus, adding the characters and words that need to be filtered;

S203, adding the specialized Chinese medicine terms in the Chinese medicine ancient-text corpus to the vocabulary to improve the quality of phrase mining, then performing phrase mining to generate the second Chinese medicine domain vocabulary, which is a vocabulary sorted by phrase quality score from high to low.

4. The method according to claim 1, characterized in that step S3 comprises:

S301, merging the first Chinese medicine domain vocabulary and the second Chinese medicine domain vocabulary; for the first Chinese medicine domain vocabulary, reading in the entities and their corresponding types;

S302, cleaning the second Chinese medicine domain vocabulary by deleting leading and trailing spaces and line breaks, and filtering out single-character entities; setting two confidence thresholds and screening separately the entities formed of a single word and those formed of multiple words; entities that exceed the confidence threshold are merged into the final vocabulary; on merging, the entity word is saved and its entity type is set to NULL;

S303, according to the back-labeling strategy, returning, for each sentence in the Chinese medicine ancient text, the entities that may occur in it and their corresponding entity types;

The back-labeling strategy comprises:

where a long word contains at least two short words, the long word takes priority and the entity type corresponding to that word is labeled; where a long word is composed of at least two short words, the entity type of each short word in the long word is labeled NULL; where at least two words overlap and conflict, the entity types of those words are all labeled NULL.

5. The method according to claim 1, characterized in that step S4 comprises:

S402, splitting the Chinese medicine ancient-text corpus into sentences at "。", retaining the original "。" after splitting, and filtering out the sentences in the corpus that contain fewer than 4 characters, which are not annotated;

S403, converting the corpus to the tie/break (connect/disconnect) labeling scheme: a row "<s> O None S" is added at the beginning of each sentence as the sentence start marker; for the word segmentation result of a sentence, the index is traversed; if the index is not equal to the starting index of the next entity in the result labeled in step S3, no entity word has been reached, and all words from the index up to the starting index of the next entity word are labeled in the form "word + I None S", one word per line; conversely, if the index is equal to the starting index of the next entity word, the entity word has been found; if the entity word has a corresponding type, the first word of the entity word is labeled "word + I + corresponding type + S" and all subsequent words of the entity word are labeled "word + O + corresponding type + S";

if the entity word has no corresponding type, all words that the entity word contains are labeled in the form "word + O None D"; this operation is repeated until no possible entity word remains in the sentence, at which point all subsequent words are labeled "word + I None S"; finally, a row "<eof> I None S" is appended at the end of the sentence as the sentence end marker; in this labeling scheme the label of each word occupies one line;

In the labeling scheme, "I None S" means that the word is not a possible entity; "O None D" means that the word is a possible entity whose type is unknown; "I + corresponding type + S" marks the beginning of a single entity, i.e. break (disconnect); "O + corresponding type + S" marks the middle or end of a single entity word, i.e. tie (connect). This completes the conversion of the Chinese medicine ancient-text corpus to the tie/break labeling scheme.

6. The method according to claim 1, characterized in that step S5 comprises:

S501, performing pre-training on the Chinese medicine ancient text and on its word segmentation result to obtain word2vec character vectors and word vectors respectively, i.e. the character and word vectors corresponding to the Chinese medicine corpus;

S503, filtering the complete character and word vectors according to the characters and words that occur in the Chinese medicine ancient-text corpus, keeping only the characters and words that occur in the corpus; a single complete set of character and word vectors is used to process multiple Chinese medicine corpora;

S504, detecting errors that may occur in the annotation results; if any data item in the annotation results has fewer than 4 parts, it is judged that the annotation is faulty;

S505, automatically detecting the entity types that occur in the annotation results, returning a dictionary of all types, and sorting it simply, so as to obtain all entity types and the Embedding corresponding to each type; likewise sorting the tie/break (connect/disconnect) labels simply to obtain the corresponding dictionary, i.e. the Embedding corresponding to tie/break;

S506, obtaining the training dataset from the input annotation results, the character and word vectors, and the Embeddings; sorting the sentences by sentence length, and finally outputting the Embedding result of the annotated training data to the training file corresponding to the training set;

S507, obtaining the validation dataset and the test dataset from the input validation results, test results, and character and word vectors; sorting by sentence length, finally returning the Embedding results of the validation data and the test data, and outputting the character and word vectors, the entity type Embedding, the tie/break Embedding, and the Embeddings of the validation dataset and the test dataset to the test file.

7. The method according to claim 1, characterized in that step S6 comprises:

S601, passing the training file path and test file path obtained in step S5, the model output path, the learning rate, the hidden layer dimension, the droprate, the word vector dimension, the number of epochs, and the optimizer algorithm into the training model;

S602, during model training, first reading the test file to load the type labels, the tie/break (connect/disconnect) labels, and the validation dataset and test dataset; then reading the training file to load the training dataset; setting up the NER model, then loading the pretrained character and word vectors and setting the parameters of the optimizer; during training, after every given number of epochs or every given portion of the training data, evaluating on the validation set to obtain precision, recall and F1, recording the best F1 value and the corresponding checkpoint, and testing on the test set; if the performance on the validation set stops improving, adjusting the parameters; after repeated checks, the best checkpoint and the best evaluation metric values are obtained;

S603, passing the best checkpoint finally recorded in S602, the path of the finally saved model, and the Chinese medicine ancient-text corpus to be predicted into the program that encodes the prediction set; according to the passed-in prediction corpus and model, the prediction corpus is converted to Embeddings, giving the encoded prediction dataset;

S604, passing the prediction dataset, the recorded checkpoint, the path of the finally saved model, and the hidden layer dimension, droprate and word vector dimension parameters into the prediction program; reading in the prediction dataset, setting up the NER named entity recognition model, predicting a result for each input separately, and saving the results to the decoding result file;

S605, reading in the decoding result file obtained from prediction, extracting the entities by type, and saving entities of different types to different files, together with each entity's position in the Chinese medicine ancient-text corpus, its type, and the entity word itself; this completes the named entity recognition task for Chinese medical book documents.

8. A named entity recognition device for Chinese medical book documents, characterized by comprising:

a collating unit, which collates the entity words of at least one entity type to obtain a first Chinese medicine domain vocabulary containing the entity types to be recognized; the first Chinese medicine domain vocabulary comprises entity words and their corresponding entity types;

a mining unit, which uses the AutoPhrase automatic phrase mining technique to mine phrases from a Chinese medicine ancient-text corpus, obtaining all possible entity words and thereby a second Chinese medicine domain vocabulary; the second Chinese medicine domain vocabulary comprises entity words;

a labeling unit, which combines the first Chinese medicine domain vocabulary and the second Chinese medicine domain vocabulary and, according to a predetermined back-labeling strategy, labels the entities that occur in the Chinese medicine ancient-text corpus;

a processing unit, which combines the back-labeling result of the Chinese medicine ancient-text corpus with the tie/break (connect/disconnect) labeling scheme to obtain the annotated data of the Chinese medicine ancient-text corpus;

an output unit, which combines the annotated data with the pretrained Word Embedding obtained by training on the Chinese medicine ancient-text corpus to generate a training dataset, a validation dataset and a test dataset, outputs the training dataset to a training file, and outputs the validation dataset and the test dataset to a test file;

a prediction unit, which reads data from the training file and the test file, trains the AutoNER automatic named entity recognition model on the data read, and uses the trained AutoNER model to predict on the Chinese medicine ancient-text corpus to obtain the recognition result; the recognized entities are obtained from the result.