CN110134766A

CN110134766A - Word segmentation method and device for ancient Traditional Chinese Medicine literature

Info

Publication number: CN110134766A
Application number: CN201910384880.XA
Authority: CN
Inventors: 谢永红; 周越; 张德政; 阿孜古丽; 栗辉; 贾麒
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2019-05-09
Filing date: 2019-05-09
Publication date: 2019-08-16
Anticipated expiration: 2039-05-09
Also published as: CN110134766B

Abstract

An embodiment of the present invention discloses a word segmentation method and device for ancient Traditional Chinese Medicine (TCM) literature. The method comprises: preprocessing ancient literature from the TCM field to generate a corpus for training a language model; training on the corpus to generate the language model; performing unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result; summarizing the preliminary segmentation result according to part-of-speech relationships, regular collocations of sentence patterns, and linguistic knowledge, so as to compile segmentation rules and form a rule file; and applying a first correction to the preliminary segmentation result according to the rules in the rule file to generate a first correction result.

Description

Word segmentation method and device for ancient Traditional Chinese Medicine literature

Technical field

The present invention relates to word segmentation methods for medical literature in the field of natural language processing, and in particular to a word segmentation method and device for ancient Traditional Chinese Medicine (TCM) literature.

Background

Chinese word segmentation is a basic step of Chinese text processing. Unlike English and similar languages, Chinese sentences do not use spaces to separate words, so word segmentation is a key preliminary step for Chinese information processing tasks such as text classification, information retrieval, information filtering, automatic document indexing, and automatic summarization. The correctness of the segmentation result directly affects the correctness of subsequent work.

Traditional Chinese medicine, which originated in primitive society and has developed continuously ever since, has accumulated a large body of ancient medical works. These works are numerous, complex in content, and wide-ranging, covering the theory of essence and qi (jingqi), the theory of yin-yang and the five elements (metal, wood, water, fire and earth), qi, blood and body fluids, visceral manifestation (zangxiang), meridians and collaterals, constitution, etiology, onset of disease, pathogenesis, principles of treatment, health preservation, and so on. Most of them are written in classical Chinese, in the colloquial language of the ancients, or as rhymed formulas, and their writing styles and dates of compilation all differ, so they differ considerably from modern Chinese. They also contain many proper nouns and technical terms specific to the TCM field. Reasonable word segmentation of ancient TCM literature is the basis for structuring TCM knowledge, but at present there is no segmenter designed specifically for the TCM field, and general-purpose segmenters do not handle the segmentation of ancient TCM books well.

Summary of the invention

In view of this, embodiments of the present invention provide a word segmentation method and device for ancient TCM literature that can improve the accuracy of word segmentation for TCM documents.

A word segmentation method for ancient TCM literature comprises:

preprocessing ancient literature from the TCM field to generate a corpus for training a language model;

training on the corpus to generate the language model;

performing unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result;

summarizing the preliminary segmentation result according to part-of-speech relationships, regular collocations of sentence patterns, and linguistic knowledge, so as to compile segmentation rules and form a rule file; and

applying a first correction to the preliminary segmentation result according to the rules in the rule file to generate a first correction result.

The method further comprises:

obtaining TCM terms compiled from the ancient literature as a vocabulary; and

correcting the first correction result using the vocabulary to obtain a final segmentation result.

The step of preprocessing the ancient literature comprises:

obtaining the original text of the ancient literature, deleting the table of contents from the original text, and deleting sentences that contain characters that cannot be represented in UTF-8, to generate a cleaned text; and

adding a space after each character of the cleaned text, the result serving as the corpus for training the language model.
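The preprocessing step above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the table of contents is assumed to have been removed already, sentences are assumed to end with "。", "！" or "？", and "cannot be represented in UTF-8" is interpreted here as either a replacement character (U+FFFD) left by a failed decode or a code point that Python refuses to encode. The function names are hypothetical and do not come from the patent.

```python
import re


def is_utf8_clean(sentence):
    """Return True if the sentence can be represented in UTF-8 (assumed interpretation)."""
    if "\ufffd" in sentence:  # replacement character left by an earlier failed decode
        return False
    try:
        sentence.encode("utf-8")  # lone surrogates raise UnicodeEncodeError
    except UnicodeEncodeError:
        return False
    return True


def clean_text(raw_text):
    """Drop every sentence that contains a character UTF-8 cannot represent."""
    # Split after each sentence-ending mark, keeping the mark with its sentence.
    sentences = re.split(r"(?<=[。！？])", raw_text)
    return "".join(s for s in sentences if s and is_utf8_clean(s))


def to_char_corpus(cleaned_text):
    """Add one space after every character, as required for the training corpus."""
    return "".join(ch + " " for ch in cleaned_text if not ch.isspace())
```

For instance, `to_char_corpus(clean_text(raw))` would turn a raw string into the space-separated character sequence that a character-level language model can be trained on.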

The step of performing unsupervised word segmentation on the ancient literature using the language model comprises:

The transition states of a character are divided into four kinds: first, a single-character word or the first character of a multi-character word, labelled a; second, the second character of a multi-character word, labelled b; third, the third character of a multi-character word, labelled c; fourth, the remaining part of a multi-character word, labelled d;

wherein a single-character word can only be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of a multi-character word; the second character of a multi-character word can only be followed by the third character of a multi-character word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by the remaining part of a multi-character word, or by a single-character word or the first character of a multi-character word; the remaining part of a multi-character word can be followed by a single-character word, the first character of a multi-character word, or the remaining part of a multi-character word; all other transitions have probability zero;

In summary, there are 8 non-zero state transition probabilities, namely:

p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d);

where p(a|a) is the conditional probability that, given that the previous character is a single-character word, the next character is a single-character word or the first character of a multi-character word; p(b|a) is the conditional probability that, given that the previous character is the first character of a multi-character word, the next character is the second character of a multi-character word; p(c|b) is the conditional probability that, given that the previous character is the second character of a multi-character word, the next character is the third character of a multi-character word; p(a|b) is the conditional probability that, given that the previous character is the second character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(a|c) is the conditional probability that, given that the previous character is the third character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(d|c) is the conditional probability that, given that the previous character is the third character of a multi-character word, the next character belongs to the remaining part of a multi-character word; p(a|d) is the conditional probability that, given that the previous character belongs to the remaining part of a multi-character word, the next character is a single-character word or the first character of a multi-character word; and p(d|d) is the conditional probability that, given that the previous character belongs to the remaining part of a multi-character word, the next character also belongs to the remaining part of a multi-character word;

Different transition probabilities are set and compared experimentally, yielding the following transition probabilities:

p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001;

conditional probabilities are computed using the language model; and

the optimal path, i.e. the path with the maximum conditional probability, is found by dynamic programming and taken as the segmentation, giving the preliminary segmentation result.

The step of computing conditional probabilities using the language model comprises:

$$p(w) = p(z_1)\,p(z_2 \mid z_1)\,p(z_3 \mid z_2 z_1) \cdots p(z_k \mid z_{k-1} \cdots z_2 z_1) \tag{1}$$

where the word w consists of k characters and z_1, z_2, ..., z_k are the 1st, 2nd, ..., k-th characters of w. The likelihood that w is a word, i.e. the probability p(w) that w exists, can thus be converted into the probabilities of its component characters. The probability of each character is calculated as the number of times that character occurs in the text used to train the language model divided by the total number of characters.
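The character probability described above (occurrence count divided by total character count) can be illustrated with a few lines of Python. This is a sketch only: the function name is hypothetical, and the input is assumed to be the space-separated training corpus, so whitespace is ignored when counting.

```python
from collections import Counter


def char_probabilities(corpus):
    """Relative frequency of each character in the training text, ignoring whitespace."""
    chars = [ch for ch in corpus if not ch.isspace()]
    total = len(chars)
    counts = Counter(chars)
    return {ch: n / total for ch, n in counts.items()}
```

In practice the language model toolkit estimates these values, along with the higher-order conditional probabilities in formula (1), during training.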

The step of summarizing the preliminary segmentation result according to part-of-speech relationships, regular collocations of sentence patterns, and linguistic knowledge, so as to compile segmentation rules and form a rule file, comprises:

sorting the preliminary segmentation result in pinyin order;

processing the sorted preliminary segmentation result in turn, the processing being specifically as follows:

for results in which a Chinese character and a punctuation mark have been segmented into one word, writing a rule to the rule file according to the punctuation mark;

for results in which Chinese characters have been segmented together into one word, examining the words that share the same first character and judging them according to part of speech and linguistic knowledge: a word of the form verb + noun is split into two words, the verb and the noun; a word of the form adjective + noun is split into two words, the adjective and the noun; and a rule is written to the rule file according to the verb;

for results in which Chinese characters have been segmented together into one word, examining the words that share the same first two characters and judging them according to part of speech and linguistic knowledge: for example, a word of the form verb + noun is split into two words, the verb and the noun; a word of the form adjective + noun is split into two words, the adjective and the noun; and a rule is written to the rule file according to the shared first two characters;

for results in which Chinese characters have been segmented together into one word, collecting the words that share the same ending character and judging them according to part of speech and linguistic knowledge: for example, a word of the form verb + noun is split into two words, the verb and the noun; a word of the form adjective + noun is split into two words, the adjective and the noun; and a rule is written to the rule file according to the ending character.
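The grouping that precedes the manual judgment above might be prepared as in the following sketch. It only collects candidates for a human reviewer; the part-of-speech judgment itself is not automated. The names, the small punctuation set, and the use of Python's default code-point ordering in place of pinyin ordering are all assumptions made for illustration.

```python
from collections import defaultdict

PUNCTUATION = set("，。、；：！？")  # assumed, deliberately small set


def group_candidates(words):
    """Group segmented words for manual review when compiling rules."""
    groups = {
        "with_punctuation": [],
        "first_char": defaultdict(list),
        "first_two": defaultdict(list),
        "last_char": defaultdict(list),
    }
    # sorted() orders by code point here; the text sorts by pinyin instead.
    for word in sorted(set(words)):
        if len(word) < 2:
            continue
        if any(ch in PUNCTUATION for ch in word):
            groups["with_punctuation"].append(word)
            continue
        groups["first_char"][word[0]].append(word)
        groups["last_char"][word[-1]].append(word)
        if len(word) >= 3:
            groups["first_two"][word[:2]].append(word)
    return groups
```

A reviewer would then inspect, for example, every word under `groups["first_char"]["食"]` and decide whether a rule for that prefix belongs in the rule file.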

The step of summarizing the preliminary segmentation result according to part-of-speech relationships, regular collocations of sentence patterns, and linguistic knowledge, so as to compile segmentation rules and form a rule file, further comprises:

counting all words in the preliminary segmentation result and the number of times each word occurs;

sorting by count from high to low and taking the top predetermined number of words;

judging whether the top predetermined number of words are in a preset vocabulary;

and, depending on the result of the judgment, establishing the rule file according to the word.
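The frequency count and vocabulary check above can be sketched as follows. Since the text does not state unambiguously which outcome of the check leads to a rule, this illustration simply returns both groups and leaves the decision to the reviewer; the function name and parameters are hypothetical.

```python
from collections import Counter


def top_words_by_vocab(tokens, vocab, n):
    """Take the n most frequent words and split them by vocabulary membership."""
    top = Counter(tokens).most_common(n)
    in_vocab = [(word, count) for word, count in top if word in vocab]
    not_in_vocab = [(word, count) for word, count in top if word not in vocab]
    return in_vocab, not_in_vocab
```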

The step of correcting the first correction result using the vocabulary to obtain the final segmentation result comprises:

finding the words of the vocabulary in the original text, as words to be corrected;

recording the positions of the words to be corrected in the original text;

locating, according to the recorded positions, the segmentation of each word to be corrected in the first correction result;

judging whether that segmentation is consistent with the word in the vocabulary;

if it is inconsistent, modifying the segmentation of the word to be corrected according to the vocabulary; if it is consistent, retaining the segmentation; and

applying the vocabulary correction to the first correction result in turn to obtain the final segmentation result.

The step of correcting the first correction result using the vocabulary to obtain the final segmentation result further comprises:

when a word is both a word in the vocabulary and a sub-word of another word in the vocabulary, performing long-word/short-word disambiguation.
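One possible reading of the vocabulary correction above is sketched below. The text does not say how long-word/short-word disambiguation is decided, so this sketch assumes a greedy leftmost-longest match; that choice, the function names, and the Chinese strings in the usage note are illustrative assumptions rather than the method of the patent.

```python
def find_terms(text, vocab):
    """Return non-overlapping (start, end) spans of vocabulary terms, longest first."""
    spans = []
    max_len = max((len(term) for term in vocab), default=0)
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in vocab:
                spans.append((i, i + n))
                i += n
                break
        else:
            i += 1
    return spans


def apply_vocab(tokens, vocab):
    """Re-segment tokens so that every vocabulary term found becomes exactly one word."""
    text = "".join(tokens)
    # Collect the existing word boundaries as character offsets.
    cuts, pos = set(), 0
    for token in tokens:
        pos += len(token)
        cuts.add(pos)
    for start, end in find_terms(text, vocab):
        cuts -= set(range(start + 1, end))  # remove boundaries inside the term
        cuts.update({start, end})           # force boundaries at its edges
    cuts.discard(0)
    words, prev = [], 0
    for cut in sorted(cuts):
        words.append(text[prev:cut])
        prev = cut
    return words
```

With a hypothetical vocabulary `{"六味汤", "加减六味汤"}`, the tokens `["加减", "六味", "汤", "治", "遗尿"]` would be merged so that the longer prescription name becomes a single word, while segmentations that already agree with the vocabulary are left unchanged.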

A word segmentation device for ancient TCM literature, characterized by comprising:

a preprocessing module, which preprocesses ancient literature from the TCM field to generate a corpus for training a language model;

a training module, which trains on the corpus to generate the language model;

a segmentation module, which performs unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result;

a rule establishing module, which summarizes the preliminary segmentation result according to part-of-speech relationships, regular collocations of sentence patterns, and linguistic knowledge, so as to compile segmentation rules and form a rule file; and

a first correction module, which applies a first correction to the preliminary segmentation result according to the rules in the rule file to generate a first correction result.

The device further comprises:

an obtaining module, which obtains TCM terms compiled from the ancient literature as a vocabulary; and

a second correction module, which corrects the first correction result using the vocabulary to obtain a final segmentation result.

The above embodiments are designed specifically for the TCM field and improve the correctness of word segmentation in that field.

Brief description of the drawings

In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow diagram of the word segmentation method for ancient TCM literature according to an embodiment of the present invention;

Fig. 2 is a connection diagram of the word segmentation device for ancient TCM literature of the present invention;

Fig. 3 is a flow chart of the ancient TCM literature word segmentation method of the present invention;

Fig. 4 shows the preprocessing result of the training corpus of the present invention;

Fig. 5 shows the rule file of the present invention;

Fig. 6 shows the segmentation result of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, a word segmentation method for ancient TCM literature according to the present invention comprises:

Step 101: preprocess ancient literature from the TCM field to generate a corpus for training a language model. The step of preprocessing the ancient literature comprises: obtaining the original text of the ancient literature, deleting the table of contents from the original text, and deleting sentences that contain characters that cannot be represented in UTF-8, to generate a cleaned text; and adding a space after each character of the cleaned text, the result serving as the corpus for training the language model.

Step 102: train on the corpus to generate the language model;

Step 103: perform unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result;

The non-zero state transition probabilities used in this step are p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d);

where p(a|a) is the conditional probability that, given that the previous character is a single-character word, the next character is a single-character word or the first character of a multi-character word; p(b|a) is the conditional probability that, given that the previous character is the first character of a multi-character word, the next character is the second character of a multi-character word; p(c|b) is the conditional probability that, given that the previous character is the second character of a multi-character word, the next character is the third character of a multi-character word; p(a|b) is the conditional probability that, given that the previous character is the second character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(a|c) is the conditional probability that, given that the previous character is the third character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(d|c) is the conditional probability that, given that the previous character is the third character of a multi-character word, the next character belongs to the remaining part of a multi-character word; p(a|d) is the conditional probability that, given that the previous character belongs to the remaining part of a multi-character word, the next character is a single-character word or the first character of a multi-character word; and p(d|d) is the conditional probability that, given that the previous character belongs to the remaining part of a multi-character word, the next character also belongs to the remaining part of a multi-character word.

The transition probabilities are set to p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001.

Conditional probabilities are computed using the language model:

$$p(w) = p(z_1)\,p(z_2 \mid z_1)\,p(z_3 \mid z_2 z_1) \cdots p(z_k \mid z_{k-1} \cdots z_2 z_1) \tag{1}$$

where the word w consists of k characters and z_1, z_2, ..., z_k are the 1st, 2nd, ..., k-th characters of w. The likelihood that w is a word, i.e. the probability p(w) that w exists, can thus be converted into the probabilities of its component characters. The probability of each character is calculated as the number of times that character occurs in the text used to train the language model divided by the total number of characters.

Step 104: summarize the preliminary segmentation result according to part-of-speech relationships, regular collocations of sentence patterns, and linguistic knowledge, so as to compile segmentation rules and form a rule file;

Where applicable, the rule file is established according to the word.

Step 105: apply a first correction to the preliminary segmentation result according to the rules in the rule file to generate a first correction result.

The method further comprises:

Step 106: obtain TCM terms compiled from the ancient literature as a vocabulary;

Step 107: correct the first correction result using the vocabulary to obtain a final segmentation result. This step comprises:

finding the words of the vocabulary in the original text, as words to be corrected;

recording the positions of the words to be corrected in the original text;

The step can also include:

As shown in Fig. 2, a word segmentation device for ancient TCM literature according to the present invention comprises:

a preprocessing module 21, which preprocesses ancient literature from the TCM field to generate a corpus for training a language model;

a training module 22, which trains on the corpus to generate the language model;

a segmentation module 23, which performs unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result;

a rule establishing module 24, which summarizes the preliminary segmentation result according to part-of-speech relationships, regular collocations of sentence patterns, and linguistic knowledge, so as to compile segmentation rules and form a rule file; and

a first correction module 25, which applies a first correction to the preliminary segmentation result according to the rules in the rule file to generate a first correction result.

The device further comprises:

an obtaining module 26, which obtains TCM terms compiled from the ancient literature as a vocabulary; and

a second correction module 27, which corrects the first correction result using the vocabulary to obtain a final segmentation result.

An application scenario of the present invention is described below. The present invention proposes a word segmentation method based on ancient TCM literature in order to solve the problem of segmenting ancient literature in the TCM field. The specific implementation steps are as follows:

Step 1: obtain ancient literature related to the TCM field as the corpus for training the language model, and compile the terms peculiar to the TCM field as a vocabulary. The vocabulary has one word per line and is stored in TXT format.

Step 2: preprocess the literature and train a language model with the kenlm tool.

The preprocessing comprises:

(1) removing the table of contents and deleting sentences that contain special characters that cannot be represented in UTF-8 (sentences here are delimited by full stops, exclamation marks, and question marks);

(2) adding a space after each character of the cleaned text, the result serving as the corpus for training the character-level language model.

Step 3: perform preliminary unsupervised segmentation of the ancient TCM literature using the language model.

In Chinese, words longer than four characters are relatively rare, so this patent divides the state of a character into four kinds: a single-character word or the first character of a multi-character word, labelled a; the second character of a multi-character word, labelled b; the third character of a multi-character word, labelled c; and the remaining part of a multi-character word, labelled d. A single-character word can only be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of a multi-character word; the second character of a multi-character word can only be followed by the third character of a multi-character word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by the remaining part of a multi-character word, or by a single-character word or the first character of a multi-character word; the remaining part of a multi-character word can be followed by a single-character word, the first character of a multi-character word, or the remaining part of a multi-character word. All other transitions have probability zero. In summary, there are 8 non-zero transition probabilities, namely:

p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d).

Because long words are rare in classical Chinese and single characters often carry a complete meaning, different transition probabilities were set and compared experimentally, yielding the following transition probabilities:

p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001

Conditional probabilities are computed using the language model obtained in step 2, as shown in formula (1), where the word w consists of k characters and the probability of w can be converted into the probabilities of its component characters. The optimal path, i.e. the path with the maximum probability, is found by dynamic programming and taken as the segmentation, giving the preliminary segmentation result.

$$p(w) = p(z_1)\,p(z_2 \mid z_1)\,p(z_3 \mid z_2 z_1) \cdots p(z_k \mid z_{k-1} \cdots z_2 z_1) \tag{1}$$
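As a small worked illustration of formula (1), consider a word w made of three characters z_1, z_2, z_3. Taking logarithms turns the product into a sum, which is the form in which language model toolkits usually report scores; the use of logarithms here is an assumption for illustration and is not stated in the text.

```latex
\begin{aligned}
p(w) &= p(z_1)\, p(z_2 \mid z_1)\, p(z_3 \mid z_2 z_1), \\
\log p(w) &= \log p(z_1) + \log p(z_2 \mid z_1) + \log p(z_3 \mid z_2 z_1).
\end{aligned}
```

A candidate segmentation can then be scored by adding the log probabilities of its words, so that maximizing the product of probabilities and maximizing the sum of log probabilities select the same path.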

Step 4: summarize the preliminary segmentation result of step 3 according to part-of-speech relationships, regular collocations of sentence patterns, and linguistic knowledge, compile segmentation rules, and form a rule file. The rule file is compiled as follows:

(1) count all words in the segmentation result obtained in step 3 and the number of times each word occurs;

(2) sort the statistics in pinyin order;

(3) for results in which a Chinese character and a punctuation mark have been segmented into one word, write the punctuation mark into the rule file as a rule;

(4) for results in which Chinese characters have been segmented together into one word, first collect the words with the same first character and then judge them according to part of speech and linguistic knowledge; for example, a verb + noun form should generally be split into two words, the verb and the noun, and the verb is then added to the rule file;

(5) collect the words with the same first two characters, judge them according to part of speech and linguistic knowledge, and add the corresponding words to the rule file;

(6) collect the words with the same ending character, judge them according to part of speech and linguistic knowledge, and add the corresponding words to the rule file;

(7) sort again by frequency from high to low and judge the high-frequency words.

The rule file is stored in TXT format, with one rule per line, as shown in Table 1:

Table 1: Rule file

Step 5: apply a first correction to the preliminary segmentation result of step 3 according to the rules in the rule file;

Step 6: correct the first correction result of step 5 using the vocabulary. The detailed correction method is as follows:

(1) find the words of the vocabulary that occur in the original text and record the positions where they occur;

(2) if a word is both a word in the vocabulary and a sub-word of another word in the vocabulary, perform long-word/short-word disambiguation to determine which vocabulary word should be used at that position;

(3) according to the positions recorded in (1), locate the corresponding segmentation in the first correction result of step 5;

(4) if the segmentation is inconsistent with the vocabulary, i.e. a word in the vocabulary has been split into several words or a segmentation boundary is incorrect, merge and modify the segmentation according to the vocabulary; if it is consistent, retain the result;

(5) the final segmentation result is obtained after the vocabulary correction.

The advantages of the present invention are as follows:

Among existing segmentation tools, there is no segmenter that takes the features of classical Chinese into account, and there is also no segmenter designed specifically for the TCM field. In terms of expression, classical Chinese differs considerably from modern Chinese. For example, the two generally use different modal particles, those of classical Chinese typically being 之, 乎, 者, and 也; and compared with modern Chinese, classical Chinese has many more single characters that carry a complete meaning on their own. As a result, existing segmenters repeatedly produce errors when segmenting classical Chinese. Ancient TCM literature, moreover, contains a large number of descriptions of drug names, prescriptions, and symptoms. These expressions peculiar to the TCM field are rarely seen in general-domain text, which is another reason why current general-purpose segmenters cannot solve the segmentation task for ancient TCM books. The present invention uses an unsupervised method to train a segmentation model dedicated to ancient TCM literature, which handles the segmentation of such literature well. Because the method is unsupervised, it saves the labour and time cost of manual annotation. The method is also easy to extend to classical Chinese literature in other fields: by changing the data set used to train the model and the vocabulary of the relevant field, the segmenter of the present invention can be applied to that field.

The segmentation method is illustrated below using a piece of ancient TCM literature as an example, as shown in Fig. 3.

First, obtain ancient TCM literature as the corpus for training the language model. Compile the terms peculiar to the TCM field as a vocabulary, as shown in Table 1.

Table 1: Vocabulary of terms peculiar to the TCM field

Second, preprocess the training corpus; the preprocessing result is shown in Fig. 4. Train the language model with the kenlm tool, with the n-gram order set to 4.

Third, perform unsupervised segmentation of the ancient TCM literature using the language model trained in the second step. Words are separated by spaces.
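The training call for this step might be assembled as in the sketch below. It assumes that the KenLM estimation binary `lmplz` is installed and that its `-o`, `--text` and `--arpa` options are used to set the n-gram order, the input corpus, and the output ARPA file; the file paths are hypothetical. The function only builds the command, so that it can be inspected before being run.

```python
def build_lmplz_command(corpus_path, arpa_path, order=4):
    """Build an lmplz command line for training an n-gram model (order 4 as in the text)."""
    return [
        "lmplz",
        "-o", str(order),       # n-gram order
        "--text", corpus_path,  # space-separated character corpus
        "--arpa", arpa_path,    # output language model in ARPA format
    ]


# To actually train, the list could be passed to subprocess.run(cmd, check=True).
```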

Fourth, review and summarize the preliminary segmentation result. First sort it in pinyin order and split the results in which a symbol and a Chinese character have been segmented into one word. For example, if ", Poria cocos" appears in the segmentation result, the rule ", | *" is added to the rule file, where "*" stands for any character; that is, whenever "," is joined to any character, the two are split. Next, collect all words with the same first character. According to linguistic common sense, combinations such as "verb + noun" should generally be split into two words. For example, for words prefixed with "eat", such as "eat hot wine", "eat salt", and "eat hot food", the rule "eat | *" is added to the rule file. Similarly, words with the same first two characters and words with the same ending character are collected, and corresponding rules are added to the rule file. Finally, sort the preliminary segmentation result by word frequency from high to low, check the high-frequency words, and add rules to the rule file for those that need to be split.

Fifth, correct the preliminary segmentation result using the rules in the rule file. The rule file is shown in Fig. 5.

Sixth, correct the result of the fifth step using the vocabulary.

First, find the words of the vocabulary in the original, unsegmented TCM document and record their positions.

Then, locate those vocabulary words in the rule-corrected segmentation result.
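Applying rules of the form described in the fourth step could look like the sketch below. The rule syntax is inferred from the two examples in the text (a fixed string, a vertical bar, and "*" for any characters); support for the mirrored form "*|X" for shared endings is an assumption, as are the function names and the Chinese strings used to exercise it. Each token is split at most once by the first matching rule.

```python
def load_rules(lines):
    """Parse rule lines of the form 'left|right', where '*' matches any characters."""
    rules = []
    for line in lines:
        if "|" not in line:
            continue
        left, right = (part.strip() for part in line.split("|", 1))
        rules.append((left, right))
    return rules


def apply_rules(tokens, rules):
    """Split each token with the first rule that matches it; keep other tokens unchanged."""
    result = []
    for token in tokens:
        for left, right in rules:
            # 'X|*': split after the prefix X when something follows it.
            if right == "*" and left not in ("", "*") and token.startswith(left) \
                    and len(token) > len(left):
                result.extend([left, token[len(left):]])
                break
            # '*|X' (assumed form): split before the ending X.
            if left == "*" and right not in ("", "*") and token.endswith(right) \
                    and len(token) > len(right):
                result.extend([token[:-len(right)], right])
                break
        else:
            result.append(token)
    return result
```

One rule per line matches the TXT storage format described for the rule file, so `load_rules` can be fed the lines of that file directly.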

If word segmentation result herein is that the word in vocabulary is inconsistent, by result herein according to being modified in vocabulary. For example have in original TCM Document in short for " plus-minus decoction of Six Ingredients controls the enuresis, and empty people has this card more.If " using rule amendment The word segmentation result crossed is that " plus-minus decoction of Six Ingredients controls the enuresis, and empty people has this card more." but plus-minus decoction of Six Ingredients is the word in vocabulary, is one A TCM Recipe name, such case, it will merging " plus-minus decoction of Six Ingredients " in final word segmentation result becomes a word.Final Word segmentation result is that " plus-minus decoction of Six Ingredients controls the enuresis, and empty people has this card more." word segmentation result form it is as shown in Figure 6.

For convenience of description, description apparatus above is to be divided into various units/modules with function to describe respectively.Certainly, exist Implement to realize each unit/module function in the same or multiple software and or hardware when the present invention.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims

1. A word segmentation method for ancient Chinese medicine literature, characterized by comprising:

Preprocessing ancient literature in the field of traditional Chinese medicine to generate a corpus for training a language model;

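Training on the corpus to generate a language model;

Performing unsupervised word segmentation on the ancient literature using the language model to generate a preliminary word segmentation result;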

Summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations, and linguistic knowledge, sorting out segmentation rules, and forming a rule file;

Performing a first correction on the preliminary word segmentation result according to the rules in the rule file to generate a first correction result.

2. The method according to claim 1, characterized in that the method further comprises:

3. The method according to claim 1, characterized in that the step of preprocessing the ancient literature comprises:

Obtaining the original text of the ancient literature, deleting the table of contents of the ancient literature from the original text, and deleting sentences containing characters that cannot be represented in UTF-8, to generate the cleaned text;

4. The method according to claim 2, characterized in that the step of performing unsupervised word segmentation on the ancient literature using the language model comprises:

Dividing the transition states of a character into four kinds: the first, a single-character word or the first character of a multi-character word, labeled a; the second, the second character of a multi-character word, labeled b; the third, the third character of a multi-character word, labeled c; the fourth, the remaining part of a multi-character word, labeled d;

Wherein, a single-character word can only be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of a multi-character word; the second character of a multi-character word can only be followed by the third character of a multi-character word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by the remaining part of a multi-character word, or by a single-character word or the first character of a multi-character word; the remaining part of a multi-character word can be followed by a single-character word or the first character of a multi-character word, or by the remaining part of a multi-character word; apart from the above transitions, all other transition probabilities are zero;

Wherein, p(a|a) denotes the conditional probability that the next character is a single-character word or the first character of a multi-character word, given that the previous character is a single-character word; p(b|a) denotes the conditional probability that the next character is the second character of a multi-character word, given that the previous character is the first character of a multi-character word; p(c|b) denotes the conditional probability that the next character is the third character of a multi-character word, given that the previous character is the second character of a multi-character word; p(a|b) denotes the conditional probability that the next character is a single-character word or the first character of a multi-character word, given that the previous character is the second character of a multi-character word; p(a|c) denotes the conditional probability that the next character is a single-character word or the first character of a multi-character word, given that the previous character is the third character of a multi-character word; p(d|c) denotes the conditional probability that the next character is the remaining part of a multi-character word, given that the previous character is the third character of a multi-character word; p(a|d) denotes the conditional probability that the next character is a single-character word or the first character of a multi-character word, given that the previous character is the remaining part of a multi-character word; p(d|d) denotes the conditional probability that the next character is the remaining part of a multi-character word, given that the previous character is the remaining part of a multi-character word;

p(a|d) = 1, p(d|d) = 0.0001;

Calculating the conditional probabilities using the language model;

Finding the optimal path, i.e., the path with the maximum conditional probability, by dynamic programming, and taking it as the segmentation result, thereby obtaining the preliminary word segmentation result.
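
The dynamic-programming search over the four states can be illustrated with a Viterbi-style sketch. It is an illustration only: of the transition values, only p(a|d) = 1 and p(d|d) = 0.0001 appear in the text, and every other value in `TRANS` is a placeholder. The per-character scores would come from the language model; here they are a toy table, and the names `viterbi`, `cut`, and `emit_logp` are hypothetical.

```python
import math

STATES = "abcd"
# TRANS[(prev, nxt)] = p(nxt | prev). Only p(a|d) and p(d|d) come from the
# text; all other values are placeholders chosen for illustration.
TRANS = {
    ("a", "a"): 1.0, ("a", "b"): 0.15,
    ("b", "a"): 1.0, ("b", "c"): 0.01,
    ("c", "a"): 1.0, ("c", "d"): 0.001,
    ("d", "a"): 1.0, ("d", "d"): 0.0001,
}


def viterbi(n, emit_logp):
    """Return the highest-scoring state sequence for n characters.

    emit_logp(i, s) is the log-score of character i being in state s.
    """
    if n == 0:
        return []
    # A sentence can only begin with state a (start of a word).
    score = {s: (emit_logp(0, s) if s == "a" else -math.inf) for s in STATES}
    back = []
    for i in range(1, n):
        new_score, ptr = {}, {}
        for s in STATES:
            best_prev, best = None, -math.inf
            for p in STATES:
                t = TRANS.get((p, s))  # missing pair means probability zero
                if t is None or score[p] == -math.inf:
                    continue
                cand = score[p] + math.log(t)
                if cand > best:
                    best_prev, best = p, cand
            ptr[s] = best_prev
            new_score[s] = best + emit_logp(i, s) if best_prev else -math.inf
        back.append(ptr)
        score = new_score
    path = [max(STATES, key=score.get)]
    for ptr in reversed(back):  # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]


def cut(text, path):
    """Start a new word at every character labeled a."""
    words = []
    for ch, s in zip(text, path):
        if s == "a" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


# Toy log-scores standing in for language-model output (assumed values).
text = "白术茯苓"
toy = [{"a": -1.0}, {"a": -5.0, "b": -0.5}, {"a": -1.0, "b": -6.0}, {"a": -5.0, "b": -0.5}]
path = viterbi(len(text), lambda i, s: toy[i].get(s, -20.0))
words = cut(text, path)
```

Working in log space avoids numerical underflow on long sentences, and leaving a state pair out of `TRANS` encodes a transition probability of zero.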

5. The method according to claim 4, characterized in that the step of calculating the conditional probabilities using the language model comprises:

$$p(w) = p(z_1)\,p(z_2 \mid z_1)\,p(z_3 \mid z_2 z_1) \cdots p(z_k \mid z_{k-1} \cdots z_2 z_1) \tag{1}$$

Wherein, word w consists of k characters, and z₁, z₂, …, z_k are respectively the 1st, 2nd, …, k-th characters of word w; the likelihood that w is a word, i.e., the existence probability p(w) of word w, can be converted into the existence probabilities of the characters composing it, where the existence probability of each character is calculated by dividing the number of times the character occurs in the text used to train the language model by the total number of characters.
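
Formula (1) can be illustrated with plain n-gram counts. The sketch below assumes unsmoothed relative-frequency estimates and a context truncated to at most three preceding characters (`max_n=4`); neither choice is stated in the text, and the names `count_ngrams` and `word_prob` and the toy corpus are hypothetical.

```python
from collections import Counter


def count_ngrams(text, max_n=4):
    """Count every character n-gram of length 1..max_n in the training text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_prob(word, counts, total, max_n=4):
    """Chain-rule probability of `word` as in formula (1).

    The first character uses count / total, as described in the text.
    Later characters use count(context + char) / count(context), an
    unsmoothed estimate assumed here for illustration.
    """
    p = 1.0
    for i, ch in enumerate(word):
        ctx = word[max(0, i - (max_n - 1)):i]
        if not ctx:
            p *= counts[ch] / total
        elif counts[ctx] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        else:
            p *= counts[ctx + ch] / counts[ctx]
    return p


# Toy training text (assumed example data).
corpus = "茯苓白术茯苓"
counts = count_ngrams(corpus)
total = len(corpus)
p_fuling = word_prob("茯苓", counts, total)  # (2/6) * (2/2)
```

A practical language model would normally add smoothing so that unseen character sequences do not receive a probability of exactly zero.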

6. The method according to claim 1, characterized in that the step of summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations, and linguistic knowledge, sorting out segmentation rules, and forming a rule file comprises:

For results in which a Chinese character and a punctuation mark are segmented into one word, writing the rule file according to the punctuation mark;

For results in which Chinese characters are segmented into one word, judging words with the same first character according to part of speech and linguistic knowledge: when a word has the form verb + noun, splitting it into two words, a verb and a noun; when a word has the form adjective + noun, splitting it into two words, an adjective and a noun; and writing the rule file according to the same first character;

For results in which Chinese characters are segmented into one word, judging words with the same first two characters according to part of speech and linguistic knowledge: when a word has the form verb + noun, splitting it into two words, a verb and a noun; when a word has the form adjective + noun, splitting it into two words, an adjective and a noun; and writing the rule file according to the same first two characters;

For results in which Chinese characters are segmented into one word, collecting words with the same final character and judging them according to part of speech and linguistic knowledge: when a word has the form verb + noun, splitting it into two words, a verb and a noun; when a word has the form adjective + noun, splitting it into two words, an adjective and a noun; and writing the rule file according to the final character.

7. The method according to claim 6, characterized in that the step of summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations, and linguistic knowledge, sorting out segmentation rules, and forming a rule file further comprises:

If according to the word establishment rules file.

8. The method according to claim 2, characterized in that the step of correcting the first correction result using the vocabulary to obtain the final word segmentation result comprises:

Finding the words of the vocabulary in the original text as words to be corrected;

Recording the positions of the words to be corrected in the original text;

Performing vocabulary correction on the first correction result in sequence to obtain the final word segmentation result; or

The step of correcting the first correction result using the vocabulary to obtain the final word segmentation result further comprises:

When a first word is both a word in the vocabulary and a sub-word of another word in the vocabulary, performing long-short word disambiguation.

9. A word segmentation device for ancient Chinese medicine literature, characterized by comprising:

A training module, configured to train on the corpus and generate a language model;

A word segmentation module, configured to perform unsupervised word segmentation on the ancient literature using the language model and generate a preliminary word segmentation result;

A rule establishing module, configured to summarize the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations, and linguistic knowledge, sort out segmentation rules, and form a rule file;

A first correction module, configured to perform a first correction on the preliminary word segmentation result according to the rules in the rule file and generate a first correction result.

10. The device according to claim 9, characterized by further comprising:

A second correction module, configured to correct the first correction result using the vocabulary to obtain the final word segmentation result.