CN110134766A - Word segmentation method and device for ancient traditional Chinese medicine literature - Google Patents
Word segmentation method and device for ancient traditional Chinese medicine literature
- Publication number
- CN110134766A (application number CN201910384880.XA)
- Authority
- CN
- China
- Prior art keywords
- word
- character
- words
- segmentation result
- character words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Data Mining & Analysis (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
An embodiment of the present invention discloses a word segmentation method and device for ancient traditional Chinese medicine (TCM) literature. The method comprises: preprocessing ancient literature in the TCM domain to generate a corpus for training a language model; training on the corpus to generate the language model; performing unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result; summarizing the preliminary segmentation result according to part-of-speech relations, fixed collocations of sentence patterns, and linguistic knowledge, so as to compile segmentation rules into a rule file; and applying the rules in the rule file to the preliminary segmentation result as a first correction, generating a first correction result.
Description
Technical field
The present invention relates to word segmentation of medical literature in the field of natural language processing, and in particular to a word segmentation method and device for ancient traditional Chinese medicine (TCM) literature.
Background art
Chinese word segmentation is a basic step of Chinese text processing. Unlike English and similar languages, Chinese sentences do not use spaces to separate words. For Chinese information processing tasks such as text classification, information retrieval, information filtering, automatic document indexing, and automatic summarization, word segmentation is therefore a critical first step, and the correctness of the segmentation result directly affects the correctness of all subsequent work.
In the TCM domain, traditional Chinese medicine, which originated in primitive society and has developed continuously ever since, has accumulated a large body of ancient medical works. These works are numerous, complex in content, and wide-ranging, covering essence-qi theory, the theory of yin-yang and the five elements (metal, wood, water, fire, and earth), qi, blood, and body fluids, visceral manifestation, meridians and collaterals, constitution, etiology, onset of disease, pathogenesis, therapeutic principles, health preservation, and so on. Most of them are recorded in classical Chinese, in the colloquial language of the ancients, or in mnemonic verses; their writing styles and dates of compilation all differ, and they differ considerably from Modern Chinese. They also contain many proper nouns and technical terms of the TCM domain. Reasonable word segmentation of ancient TCM literature is the basis for structuring TCM knowledge, but at present there is no segmenter designed specifically for the TCM domain, and general-purpose segmenters do not handle the segmentation of ancient TCM literature well.
Summary of the invention
In view of this, embodiments of the present invention provide a word segmentation method and device for ancient TCM literature that can improve the accuracy of word segmentation for TCM-domain literature.
A word segmentation method for ancient TCM literature, comprising:
preprocessing ancient literature in the TCM domain to generate a corpus for training a language model;
training on the corpus to generate the language model;
performing unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result;
summarizing the preliminary segmentation result according to part-of-speech relations, fixed collocations of sentence patterns, and linguistic knowledge, compiling segmentation rules, and forming a rule file;
performing a first correction on the preliminary segmentation result according to the rules in the rule file, generating a first correction result.
The method further comprises:
obtaining TCM-domain terms compiled from the ancient literature, as a vocabulary;
correcting the first correction result using the vocabulary to obtain a final segmentation result.
The step of preprocessing the ancient literature comprises:
obtaining the original text of the ancient literature, deleting the table of contents of the ancient literature from the original text, and deleting sentences containing characters that cannot be represented in UTF-8, to generate a cleaned text;
adding a space after each character of the cleaned text, and using the result as the corpus for training the language model.
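The cleaning and spacing steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the function names are hypothetical, the table of contents is assumed to have been removed already (its layout depends on the individual book), and "cannot be represented in UTF-8" is interpreted here as a sentence containing the replacement character U+FFFD or a code point that fails to encode.

```python
import re

# A sentence is a run of characters optionally closed by a full stop,
# exclamation mark, or question mark (assumed sentence delimiters).
SENTENCE = re.compile(r"[^。！？]+[。！？]?|[。！？]")

def is_utf8_clean(sentence: str) -> bool:
    """Assumption: a sentence is 'bad' if it holds U+FFFD or fails to encode."""
    if "\ufffd" in sentence:
        return False
    try:
        sentence.encode("utf-8")
    except UnicodeEncodeError:  # e.g. lone surrogate code points
        return False
    return True

def clean_text(raw: str) -> str:
    """Drop every sentence that contains an unrepresentable character."""
    return "".join(s for s in SENTENCE.findall(raw) if is_utf8_clean(s))

def to_training_corpus(cleaned: str) -> str:
    """Add one space after each non-whitespace character."""
    return "".join(ch + " " for ch in cleaned if not ch.isspace())
```

Spacing every character makes each character a separate token, so that a word-level n-gram toolkit effectively trains a character-level model.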
The step of performing unsupervised word segmentation on the ancient literature using the language model comprises:
dividing the states of a character into four kinds: first, a single-character word or the first character of a multi-character word, labeled a; second, the second character of a multi-character word, labeled b; third, the third character of a multi-character word, labeled c; fourth, the remainder of a multi-character word, labeled d;
wherein a single-character word can only be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of a multi-character word; the second character of a multi-character word can only be followed by the third character of a multi-character word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by the remainder of a multi-character word, or by a single-character word or the first character of a multi-character word; the remainder of a multi-character word can be followed by a single-character word or the first character of a multi-character word, or by the remainder of a multi-character word; all other transition probabilities are zero;
in summary, there are 8 non-zero state transition probabilities, namely:
p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d);
wherein p(a|a) denotes the conditional probability that, given the previous character is a single-character word, the next character is a single-character word or the first character of a multi-character word; p(b|a) denotes the conditional probability that, given the previous character is the first character of a multi-character word, the next character is the second character of a multi-character word; p(c|b) denotes the conditional probability that, given the previous character is the second character of a multi-character word, the next character is the third character of a multi-character word; p(a|b) denotes the conditional probability that, given the previous character is the second character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(a|c) denotes the conditional probability that, given the previous character is the third character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(d|c) denotes the conditional probability that, given the previous character is the third character of a multi-character word, the next character belongs to the remainder of a multi-character word; p(a|d) denotes the conditional probability that, given the previous character belongs to the remainder of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(d|d) denotes the conditional probability that, given the previous character belongs to the remainder of a multi-character word, the next character also belongs to the remainder of a multi-character word;
setting different transition probabilities and comparing them experimentally, which yields the transition probabilities:
p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001;
computing conditional probabilities using the language model;
finding the optimal path, i.e. the path with the maximum conditional probability, by dynamic programming, and taking it as the segmentation, which gives the preliminary segmentation result.
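The dynamic-programming search described above can be illustrated with a small Viterbi sketch in Python. The transition weights are the eight values listed in the text, used as given (note that they are tuned weights and do not sum to 1 for each previous state). The per-character emission scores are a hypothetical input here: in the described method they would come from the trained language model, and how they are derived is not shown. All function names are illustrative assumptions.

```python
import math

# Non-zero transition weights p(next | prev) from the text; keys are (prev, next).
TRANS = {
    ("a", "a"): 0.96, ("a", "b"): 0.2,
    ("b", "c"): 0.009, ("b", "a"): 0.9,
    ("c", "a"): 1.0, ("c", "d"): 0.005,
    ("d", "a"): 1.0, ("d", "d"): 0.0001,
}
STATES = "abcd"
NEG_INF = float("-inf")

def viterbi(emissions):
    """emissions[i][s] is a log score for character i being in state s
    (hypothetical input). Returns the best tag string; the first tag is 'a'."""
    n = len(emissions)
    if n == 0:
        return ""
    score = [{s: NEG_INF for s in STATES} for _ in range(n)]
    back = [{} for _ in range(n)]
    score[0]["a"] = emissions[0].get("a", NEG_INF)  # a sentence starts a word
    for i in range(1, n):
        for s in STATES:
            for prev in STATES:
                weight = TRANS.get((prev, s))
                if weight is None or score[i - 1][prev] == NEG_INF:
                    continue  # forbidden transition or unreachable state
                cand = (score[i - 1][prev] + math.log(weight)
                        + emissions[i].get(s, NEG_INF))
                if cand > score[i][s]:
                    score[i][s] = cand
                    back[i][s] = prev
    tags = [max(STATES, key=lambda s: score[-1][s])]
    for i in range(n - 1, 0, -1):  # follow back-pointers to the start
        tags.append(back[i][tags[-1]])
    return "".join(reversed(tags))

def tags_to_words(text, tags):
    """Start a new word at every 'a'; attach b, c, d to the current word."""
    words = []
    for ch, tag in zip(text, tags):
        if tag == "a" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words
```

With uniform emission scores the high weight of p(a|a) makes single-character words the default, which matches the observation that single characters often carry complete meaning in classical Chinese.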
The step of computing conditional probabilities using the language model comprises:
$$p(w) = p(z_1)\,p(z_2 \mid z_1)\,p(z_3 \mid z_2 z_1) \cdots p(z_k \mid z_{k-1} \cdots z_2 z_1) \qquad (1)$$
where w is a word consisting of k characters and z_1, z_2, ..., z_k are the 1st, 2nd, ..., k-th characters of w. The likelihood that w is a word, i.e. the probability p(w) that the word w exists, can thus be converted into the probabilities of its constituent characters. The probability of each character is computed as the number of times that character occurs in the text used to train the language model, divided by the total number of characters.
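As a small illustration of the count-based character probability and of the chain-rule product in formula (1), the Python sketch below may help. The function names are hypothetical, and the conditional scorer passed to `word_log_prob` is a stand-in: the example supplies a unigram scorer that ignores the history, whereas a trained n-gram model would use it.

```python
import math
from collections import Counter

def char_probabilities(corpus: str) -> dict:
    """p(z) = occurrences of z / total characters (whitespace ignored)."""
    counts = Counter(ch for ch in corpus if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def unigram_scorer(probs: dict):
    """Return a conditional log-probability function that ignores history."""
    def cond_log_prob(ch: str, history: str) -> float:
        return math.log(probs[ch])
    return cond_log_prob

def word_log_prob(word: str, cond_log_prob) -> float:
    """log p(w) as the sum of log p(z_i | z_{i-1} ... z_1), as in formula (1)."""
    return sum(cond_log_prob(word[i], word[:i]) for i in range(len(word)))
```

Working in log space turns the product in formula (1) into a sum, which avoids numerical underflow for longer words.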
The step of summarizing the preliminary segmentation result according to part-of-speech relations, fixed collocations of sentence patterns, and linguistic knowledge, compiling segmentation rules, and forming a rule file comprises:
sorting the preliminary segmentation result in pinyin order;
processing the sorted preliminary segmentation result item by item, the processing being specifically:
for results in which a Chinese character and a punctuation mark have been segmented into one word, writing the rule file according to the punctuation mark;
for results in which Chinese characters have been segmented together into one word, taking the words that share the same first character and judging them according to part of speech and linguistic knowledge: a verb + noun form is split into two words, the verb and the noun; an adjective + noun form is split into two words, the adjective and the noun. The rule file is written according to the verb;
for results in which Chinese characters have been segmented together into one word, taking the words that share the same first two characters and judging them according to part of speech and linguistic knowledge: for example, a verb + noun form is split into two words, the verb and the noun; an adjective + noun form is split into two words, the adjective and the noun. The rule file is written according to the shared first two characters;
for results in which Chinese characters have been segmented together into one word, collecting the words that share the same final character and judging them according to part of speech and linguistic knowledge: for example, a verb + noun form is split into two words, the verb and the noun; an adjective + noun form is split into two words, the adjective and the noun. The rule file is written according to the final character.
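The grouping that precedes the manual judgment can be sketched as below. This is only an illustrative helper with hypothetical names: it collects multi-character words that share a first character, the first two characters, or a final character, so that a reviewer can inspect each group. The part-of-speech judgment itself is left to the reviewer, and Python's default code-point ordering is used as a stand-in for pinyin order, which would need an external library.

```python
from collections import defaultdict

def group_for_review(words):
    """Group distinct multi-character words by shared prefix and suffix."""
    by_first = defaultdict(list)      # same first character
    by_first_two = defaultdict(list)  # same first two characters
    by_last = defaultdict(list)       # same final character
    for w in sorted(set(words)):      # code-point order, not pinyin order
        if len(w) < 2:
            continue                  # single characters need no splitting
        by_first[w[0]].append(w)
        by_last[w[-1]].append(w)
        if len(w) >= 3:
            by_first_two[w[:2]].append(w)
    return by_first, by_first_two, by_last
```

For example, grouping by first character puts all candidate words beginning with the same verb side by side, which makes a recurring verb + noun pattern easy to spot.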
The step of summarizing the preliminary segmentation result according to part-of-speech relations, fixed collocations of sentence patterns, and linguistic knowledge, compiling segmentation rules, and forming a rule file further comprises:
counting all the words in the preliminary segmentation result and the number of times each word occurs;
sorting by count from high to low and obtaining a predetermined number of top-ranked words;
judging whether the predetermined number of top-ranked words are in a preset vocabulary;
depending on the result of this judgment, establishing the rule file according to the words.
The step of correcting the first correction result using the vocabulary to obtain the final segmentation result comprises:
finding the words of the vocabulary in the original text, as words to be corrected;
recording the positions of the words to be corrected in the original text;
finding, according to the recorded positions, the segmentation of the words to be corrected after the first correction;
judging whether that segmentation is consistent with the words in the vocabulary;
if it is inconsistent, modifying the segmentation of the word to be corrected according to the vocabulary; if it is consistent, retaining the segmentation;
applying the vocabulary correction to the first correction result in turn to obtain the final segmentation result.
The step of correcting the first correction result using the vocabulary to obtain the final segmentation result further comprises:
when a first word is both a word in the vocabulary and a sub-word of another word in the vocabulary, performing long/short word disambiguation.
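The vocabulary correction can be sketched as boundary editing over character offsets. The sketch below is an assumption-laden illustration, not the patent's procedure: the function names are hypothetical, and the long/short word disambiguation, which the text does not specify, is replaced here by a simple longest-match-first rule.

```python
def find_vocab_spans(text, vocab):
    """Scan left to right, preferring the longest vocabulary word at each
    position (a stand-in for long/short word disambiguation).
    Returns non-overlapping (start, end) character spans."""
    by_length = sorted(vocab, key=len, reverse=True)
    spans, i = [], 0
    while i < len(text):
        match = next((w for w in by_length if w and text.startswith(w, i)), None)
        if match:
            spans.append((i, i + len(match)))
            i += len(match)
        else:
            i += 1
    return spans

def correct_with_vocab(tokens, vocab):
    """Merge or re-cut tokens so that every vocabulary span is one word."""
    text = "".join(tokens)
    cuts, pos = {0}, 0
    for tok in tokens:               # record existing word boundaries
        pos += len(tok)
        cuts.add(pos)
    for start, end in find_vocab_spans(text, vocab):
        cuts -= set(range(start + 1, end))  # drop boundaries inside the term
        cuts.update((start, end))           # force boundaries around the term
    ordered = sorted(cuts)
    return [text[a:b] for a, b in zip(ordered, ordered[1:])]
```

If the existing boundaries already agree with a vocabulary term, the boundary set is unchanged and the segmentation is retained, which corresponds to the "consistent" branch above.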
A word segmentation device for ancient TCM literature, characterized by comprising:
a preprocessing module, which preprocesses ancient literature in the TCM domain to generate a corpus for training a language model;
a training module, which trains on the corpus to generate the language model;
a segmentation module, which performs unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result;
a rule building module, which summarizes the preliminary segmentation result according to part-of-speech relations, fixed collocations of sentence patterns, and linguistic knowledge, compiles segmentation rules, and forms a rule file;
a first correction module, which performs a first correction on the preliminary segmentation result according to the rules in the rule file, generating a first correction result.
The device further comprises:
an obtaining module, which obtains TCM-domain terms compiled from the ancient literature, as a vocabulary;
a second correction module, which corrects the first correction result using the vocabulary to obtain a final segmentation result.
The above embodiments are designed specifically for the TCM domain and improve the accuracy of word segmentation in that domain.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the word segmentation method for ancient TCM literature according to an embodiment of the present invention;
Fig. 2 is a connection diagram of the word segmentation device for ancient TCM literature of the present invention;
Fig. 3 is a flow chart of the ancient TCM literature segmentation method of the present invention;
Fig. 4 shows the preprocessing result of the training corpus of the present invention;
Fig. 5 shows the rule file of the present invention;
Fig. 6 shows the segmentation result of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the drawings.
As shown in Fig. 1, a word segmentation method for ancient TCM literature according to the present invention comprises:
Step 101: preprocess ancient literature in the TCM domain to generate a corpus for training a language model. The step of preprocessing the ancient literature comprises: obtaining the original text of the ancient literature, deleting the table of contents of the ancient literature from the original text, and deleting sentences containing characters that cannot be represented in UTF-8, to generate a cleaned text; adding a space after each character of the cleaned text, and using the result as the corpus for training the language model.
Step 102: train on the corpus to generate the language model;
Step 103: perform unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result;
The step of performing unsupervised word segmentation on the ancient literature using the language model comprises:
dividing the states of a character into four kinds: first, a single-character word or the first character of a multi-character word, labeled a; second, the second character of a multi-character word, labeled b; third, the third character of a multi-character word, labeled c; fourth, the remainder of a multi-character word, labeled d;
wherein a single-character word can only be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of a multi-character word; the second character of a multi-character word can only be followed by the third character of a multi-character word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by the remainder of a multi-character word, or by a single-character word or the first character of a multi-character word; the remainder of a multi-character word can be followed by a single-character word or the first character of a multi-character word, or by the remainder of a multi-character word; all other transition probabilities are zero;
in summary, there are 8 non-zero state transition probabilities, namely:
p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d);
wherein p(a|a) denotes the conditional probability that, given the previous character is a single-character word, the next character is a single-character word or the first character of a multi-character word; p(b|a) denotes the conditional probability that, given the previous character is the first character of a multi-character word, the next character is the second character of a multi-character word; p(c|b) denotes the conditional probability that, given the previous character is the second character of a multi-character word, the next character is the third character of a multi-character word; p(a|b) denotes the conditional probability that, given the previous character is the second character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(a|c) denotes the conditional probability that, given the previous character is the third character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(d|c) denotes the conditional probability that, given the previous character is the third character of a multi-character word, the next character belongs to the remainder of a multi-character word; p(a|d) denotes the conditional probability that, given the previous character belongs to the remainder of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(d|d) denotes the conditional probability that, given the previous character belongs to the remainder of a multi-character word, the next character also belongs to the remainder of a multi-character word.
Different transition probabilities are set and compared experimentally, which yields the transition probabilities:
p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001;
conditional probabilities are computed using the language model;
the optimal path, i.e. the path with the maximum conditional probability, is found by dynamic programming and taken as the segmentation, which gives the preliminary segmentation result.
The step of computing conditional probabilities using the language model comprises:
$$p(w) = p(z_1)\,p(z_2 \mid z_1)\,p(z_3 \mid z_2 z_1) \cdots p(z_k \mid z_{k-1} \cdots z_2 z_1) \qquad (1)$$
where w is a word consisting of k characters and z_1, z_2, ..., z_k are the 1st, 2nd, ..., k-th characters of w. The likelihood that w is a word, i.e. the probability p(w) that the word w exists, can thus be converted into the probabilities of its constituent characters. The probability of each character is computed as the number of times that character occurs in the text used to train the language model, divided by the total number of characters.
Step 104: summarize the preliminary segmentation result according to part-of-speech relations, fixed collocations of sentence patterns, and linguistic knowledge, compile segmentation rules, and form a rule file;
The step of summarizing the preliminary segmentation result according to part-of-speech relations, fixed collocations of sentence patterns, and linguistic knowledge, compiling segmentation rules, and forming a rule file comprises:
sorting the preliminary segmentation result in pinyin order;
processing the sorted preliminary segmentation result item by item, the processing being specifically:
for results in which a Chinese character and a punctuation mark have been segmented into one word, writing the rule file according to the punctuation mark;
for results in which Chinese characters have been segmented together into one word, taking the words that share the same first character and judging them according to part of speech and linguistic knowledge: a verb + noun form is split into two words, the verb and the noun; an adjective + noun form is split into two words, the adjective and the noun. The rule file is written according to the verb;
for results in which Chinese characters have been segmented together into one word, taking the words that share the same first two characters and judging them according to part of speech and linguistic knowledge: for example, a verb + noun form is split into two words, the verb and the noun; an adjective + noun form is split into two words, the adjective and the noun. The rule file is written according to the shared first two characters;
for results in which Chinese characters have been segmented together into one word, collecting the words that share the same final character and judging them according to part of speech and linguistic knowledge: for example, a verb + noun form is split into two words, the verb and the noun; an adjective + noun form is split into two words, the adjective and the noun. The rule file is written according to the final character.
The step of summarizing the preliminary segmentation result according to part-of-speech relations, fixed collocations of sentence patterns, and linguistic knowledge, compiling segmentation rules, and forming a rule file further comprises:
counting all the words in the preliminary segmentation result and the number of times each word occurs;
sorting by count from high to low and obtaining a predetermined number of top-ranked words;
judging whether the predetermined number of top-ranked words are in a preset vocabulary;
depending on the result of this judgment, establishing the rule file according to the words.
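The counting and vocabulary check can be sketched as follows. This is a minimal sketch with a hypothetical function name; because the text does not make clear which outcome of the vocabulary check leads to a rule being written, the sketch simply returns both groups and leaves that decision to the reviewer.

```python
from collections import Counter

def split_top_words(tokens, vocab, top_n):
    """Return (in_vocab, not_in_vocab) for the top_n most frequent words."""
    counts = Counter(tokens)
    top = [word for word, _ in counts.most_common(top_n)]
    in_vocab = [w for w in top if w in vocab]
    not_in_vocab = [w for w in top if w not in vocab]
    return in_vocab, not_in_vocab
```

High-frequency words are worth checking first because a single wrong segmentation of such a word recurs throughout the corpus.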
Step 105: perform a first correction on the preliminary segmentation result according to the rules in the rule file, generating a first correction result.
The method further comprises:
Step 106: obtain TCM-domain terms compiled from the ancient literature, as a vocabulary;
Step 107: correct the first correction result using the vocabulary to obtain a final segmentation result.
This step comprises:
finding the words of the vocabulary in the original text, as words to be corrected;
recording the positions of the words to be corrected in the original text;
finding, according to the recorded positions, the segmentation of the words to be corrected after the first correction;
judging whether that segmentation is consistent with the words in the vocabulary;
if it is inconsistent, modifying the segmentation of the word to be corrected according to the vocabulary; if it is consistent, retaining the segmentation;
applying the vocabulary correction to the first correction result in turn to obtain the final segmentation result.
This step may further comprise:
when a first word is both a word in the vocabulary and a sub-word of another word in the vocabulary, performing long/short word disambiguation.
As shown in Fig. 2, a word segmentation device for ancient TCM literature according to the present invention comprises:
a preprocessing module 21, which preprocesses ancient literature in the TCM domain to generate a corpus for training a language model;
a training module 22, which trains on the corpus to generate the language model;
a segmentation module 23, which performs unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result;
a rule building module 24, which summarizes the preliminary segmentation result according to part-of-speech relations, fixed collocations of sentence patterns, and linguistic knowledge, compiles segmentation rules, and forms a rule file;
a first correction module 25, which performs a first correction on the preliminary segmentation result according to the rules in the rule file, generating a first correction result.
The device further comprises:
an obtaining module 26, which obtains TCM-domain terms compiled from the ancient literature, as a vocabulary;
a second correction module 27, which corrects the first correction result using the vocabulary to obtain a final segmentation result.
An application scenario of the present invention is described below. The present invention proposes a word segmentation method for ancient TCM literature in order to solve the problem of segmenting ancient literature in the TCM domain. The specific implementation steps are as follows:
Step 1: obtain ancient literature related to the TCM domain, to serve as the corpus for training the language model, and compile the terms peculiar to the TCM domain into a vocabulary. The vocabulary has one word per line and is stored in TXT format.
Step 2: preprocess the literature and train the language model using the kenlm tool.
The preprocessing includes:
(1) removing the table of contents and deleting sentences containing special characters that cannot be represented in UTF-8 (sentences here are delimited by full stops, exclamation marks, and question marks);
(2) adding a space after each character of the cleaned text, and using the result as the corpus for training the character-level language model.
Step 3: perform preliminary unsupervised word segmentation on the ancient TCM literature using the language model.
In Chinese, words longer than four characters are relatively rare, so this patent divides the states of a character into four kinds: a single-character word or the first character of a multi-character word, labeled a; the second character of a multi-character word, labeled b; the third character of a multi-character word, labeled c; and the remainder of a multi-character word, labeled d. A single-character word can only be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of a multi-character word; the second character of a multi-character word can only be followed by the third character of a multi-character word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by the remainder of a multi-character word, or by a single-character word or the first character of a multi-character word; the remainder of a multi-character word can be followed by a single-character word or the first character of a multi-character word, or by the remainder of a multi-character word. All other transition probabilities are zero. In summary, there are 8 non-zero transition probabilities, namely:
p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d).
Because long words are rare in classical Chinese and single characters frequently carry complete meaning, different transition probabilities were set and compared experimentally, which yielded the transition probabilities:
p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001
Conditional probabilities are computed using the language model obtained in Step 2, as shown in formula (1), where w is a word consisting of k characters, so that the probability of the word w can be converted into the probabilities of its constituent characters. The optimal path, i.e. the path with the maximum probability, is found by dynamic programming and taken as the segmentation, which gives the preliminary segmentation result.
$$p(w) = p(z_1)\,p(z_2 \mid z_1)\,p(z_3 \mid z_2 z_1) \cdots p(z_k \mid z_{k-1} \cdots z_2 z_1) \qquad (1)$$
Step 4: Summarize the preliminary word segmentation result of step 3 according to part-of-speech relationships, fixed sentence patterns and linguistic knowledge, compile segmentation rules, and form a rule file. The rule file is compiled as follows:
(1) count all words in the segmentation result obtained in step 3 and the number of times each word occurs;
(2) sort the statistics in pinyin order;
(3) for results in which a Chinese character and a punctuation mark have been grouped into one word, write the punctuation mark into the rule file as a rule;
(4) for results in which Chinese characters have been grouped together into one word, first collect the words that share the same first character and then judge them according to part of speech and linguistic knowledge; for example, a verb + noun form should generally be split into two words, the verb and the noun, and the corresponding rule is then added to the rule file;
(5) collect the words whose first two characters are identical, judge them according to part of speech and linguistic knowledge, and add the corresponding words to the rule file;
(6) collect the words with the same final character, judge them according to part of speech and linguistic knowledge, and add the corresponding words to the rule file;
(7) sort the words again by frequency from high to low and judge the high-frequency words.
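The counting and grouping parts of the procedure above can be sketched as follows. This is a minimal illustration with hypothetical names (`summarize_segmentation` and the keys of the returned dictionary); the sample tokens in the usage note are illustrative, pinyin sorting is omitted because it needs a pronunciation dictionary, and the part-of-speech judgment itself remains a manual step.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def summarize_segmentation(tokens: List[str]) -> Dict[str, object]:
    """Group the words of a segmentation result for manual rule review."""
    counts = Counter(tokens)            # word -> number of occurrences
    by_first = defaultdict(list)        # first character -> words
    by_first_two = defaultdict(list)    # first two characters -> words
    by_last = defaultdict(list)         # final character -> words
    for word in counts:                 # each distinct word once
        if len(word) < 2:
            continue                    # single characters need no splitting
        by_first[word[0]].append(word)
        by_last[word[-1]].append(word)
        if len(word) >= 3:
            by_first_two[word[:2]].append(word)
    return {
        "counts": counts,
        "by_first": dict(by_first),
        "by_first_two": dict(by_first_two),
        "by_last": dict(by_last),
        "by_frequency": counts.most_common(),  # high to low
    }
```

For instance, given the tokens `["食热酒", "食盐", "食热物", "茯苓", "食盐"]`, the words beginning with "食" are gathered in one group, which a reviewer can inspect before deciding whether a splitting rule for that character is warranted.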
The rule file is stored in TXT format with one rule per line, as shown in Table 1:
Table 1: Rule file
Step 5: Apply a first correction to the preliminary word segmentation result of step 3 according to the rules in the rule file;
Step 6: Correct the first correction result of step 5 using the vocabulary. The detailed correction method is as follows:
(1) find the words of the vocabulary in the original text and record the positions at which they occur;
(2) if a word is both a word in the vocabulary and a sub-word of another word in the vocabulary, perform long/short word disambiguation to determine which vocabulary word should be used at that position;
(3) according to the positions recorded in (1), locate the corresponding places in the segmentation result obtained from the first correction in step 5;
(4) if the segmentation result is inconsistent with the vocabulary, i.e. a vocabulary word has been split into several words or a segmentation boundary is incorrect, merge and modify the segmentation result according to the vocabulary; if it is consistent, keep the result;
(5) the final word segmentation result is obtained after the vocabulary correction.
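The vocabulary correction above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: `find_vocab_spans` and `correct_with_vocab` are hypothetical names, the long/short word disambiguation is assumed here to be a leftmost-longest match (the text does not specify the method), and the tokens are assumed to concatenate back to the original text.

```python
from typing import List, Set, Tuple


def find_vocab_spans(text: str, vocab: Set[str]) -> List[Tuple[int, int]]:
    """Return non-overlapping (start, end) positions of vocabulary words.

    Assumption: when several vocabulary words start at the same position,
    the longest one wins.
    """
    max_len = max((len(w) for w in vocab), default=0)
    spans = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in vocab:
                spans.append((i, i + length))
                i += length
                break
        else:
            i += 1  # no vocabulary word starts here
    return spans


def correct_with_vocab(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Re-cut a segmentation so that every vocabulary word is one token."""
    text = "".join(tokens)
    if not text:
        return []
    # Character offsets at which the current segmentation cuts the text.
    cuts = set()
    pos = 0
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)
    for start, end in find_vocab_spans(text, vocab):
        cuts -= set(range(start + 1, end))  # remove cuts inside the word
        cuts.update((start, end))           # force cuts at both boundaries
    cuts.discard(0)
    result, prev = [], 0
    for cut in sorted(cuts):
        result.append(text[prev:cut])
        prev = cut
    return result
```

Segmentations that already agree with the vocabulary pass through unchanged, since only the cut positions inside or at the edges of a matched vocabulary word are touched.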
The advantages of the present invention are as follows:
Among existing word segmentation tools there is no segmenter designed for the characteristics of classical Chinese, and no segmenter dedicated to the field of traditional Chinese medicine (TCM). In terms of expression, classical Chinese differs considerably from Modern Chinese. For example, the modal particles generally used in Modern Chinese differ from those generally used in classical Chinese, such as "it" and "person"; and compared with Modern Chinese, classical Chinese has more single characters that express a complete meaning. As a result, existing segmenters frequently produce errors on classical Chinese segmentation tasks. In addition, ancient TCM books contain a large number of descriptions of medicine names, prescriptions and symptoms. These expressions, which are specific to the TCM field, are rarely seen in the general domain, so current general-domain segmenters cannot solve the segmentation task for ancient TCM books. The present invention uses an unsupervised method to train a segmentation model dedicated to ancient TCM documents, which handles the segmentation task for such documents well. Because the method is unsupervised, it saves the labor and time cost of manual annotation. The method is also easy to extend to classical Chinese documents in other fields: by changing the data set used to train the model and the vocabulary of the relevant field, the segmenter of this invention can be applied to other fields.
Taking an ancient TCM document as an example, the segmentation method for ancient TCM documents is described below, as shown in Figure 3.
First, obtain the ancient TCM documents as the corpus for training the language model, and compile the terms specific to the TCM field into a vocabulary, as shown in Table 1.
Table 1: Vocabulary of terms specific to the TCM field
Second, pre-process the training corpus; the pre-processing result is shown in Figure 4. Train the language model with the KenLM tool, with the n-gram order set to 4.
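The pre-processing that prepares the corpus for language model training (dropping sentences with characters that cannot be represented in UTF-8 and separating the characters with spaces, as set out in the claims) can be sketched as below. This is a minimal illustration with a hypothetical function name, `preprocess_lines`; removing the table of contents is left out because it depends on the layout of each book, and joining with spaces leaves no trailing space after the last character, which is a small simplification.

```python
from typing import Iterable, List


def preprocess_lines(lines: Iterable[str]) -> List[str]:
    """Turn raw text lines into space-separated character sequences."""
    corpus = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        try:
            line.encode("utf-8")
        except UnicodeEncodeError:
            continue  # drop lines that cannot be represented in UTF-8
        # Treat every non-whitespace character as one token.
        chars = [ch for ch in line if not ch.isspace()]
        corpus.append(" ".join(chars))
    return corpus
```

With KenLM installed, a 4-gram model could then be estimated from the resulting file with a command along the lines of `lmplz -o 4 < corpus.txt > model.arpa`; the file names here are placeholders.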
Third, use the language model trained in the second step to perform unsupervised word segmentation on the ancient TCM documents. Words are separated from one another by spaces.
Fourth, review and summarize the preliminary segmentation result. First, sort the result by pinyin and split the results in which a symbol and Chinese characters have been segmented into one word. For example, if ", Poria cocos" appears in the segmentation result, the rule ", | *" is added to the rule file. "*" stands for any character, so a split is made wherever "," is joined to any character. Next, find all words that share the same first character. According to linguistic common sense, a combination such as "verb + noun" should generally be split into two words. For example, for words such as "eat hot wine", "eat salt" and "eat hot things", which begin with the character "eat", a rule for "eat" is added to the rule file: "eat | *". Similarly, collect the words whose first two characters are identical and the words whose final character is identical, and add the corresponding rules to the rule file. Finally, sort the preliminary segmentation result by word frequency from high to low, check the high-frequency words, and add the corresponding rules to the rule file for those that need to be split.
Fifth, correct the preliminary segmentation result using the rules in the rule file. The rule file is shown in Figure 5.
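Applying rules of the form "X | *" can be sketched as follows. This is an illustration only, with hypothetical names `load_rules` and `apply_rules`; it assumes that a rule "X | *" means "if a token starts with X and continues with further characters, split it right after X", which is one reading of the description above, and it ignores any other rule formats the actual rule file may contain.

```python
from typing import Iterable, List


def load_rules(lines: Iterable[str]) -> List[str]:
    """Parse lines such as '食 | *' into a list of prefixes to split off."""
    prefixes = []
    for line in lines:
        left, sep, right = line.strip().partition("|")
        # Keep only well-formed 'X | *' rules with a non-empty X.
        if sep and right.strip() == "*" and left.strip():
            prefixes.append(left.strip())
    return prefixes


def apply_rules(tokens: List[str], prefixes: List[str]) -> List[str]:
    """Split every token that starts with a rule prefix and continues after it."""
    corrected = []
    for token in tokens:
        for prefix in prefixes:
            if token.startswith(prefix) and len(token) > len(prefix):
                corrected.extend([prefix, token[len(prefix):]])
                break
        else:
            corrected.append(token)  # no rule matched; keep the token
    return corrected
```

A token that consists of the prefix alone is left untouched, so applying the same rules a second time to already corrected output does not split it further unless the remainder itself starts with a prefix.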
Sixth, correct the result of the fifth step using the vocabulary.
First, find the vocabulary words in the original, unsegmented TCM document and record their positions.
Then, locate the vocabulary words in the segmentation result that has been corrected with the rules.
If the segmentation result at that position is inconsistent with the word in the vocabulary, modify it according to the vocabulary.
For example, the original TCM document contains the sentence "Modified Six Ingredients Decoction treats enuresis; patients with deficiency often have this syndrome." Suppose the rule-corrected segmentation result is "Modified Six Ingredients Decoction treats enuresis; patients with deficiency often have this syndrome." Because "Modified Six Ingredients Decoction" is a word in the vocabulary, namely the name of a TCM prescription, it is merged into a single word in the final segmentation result. The final segmentation result is "Modified Six Ingredients Decoction treats enuresis; patients with deficiency often have this syndrome." The form of the segmentation result is shown in Figure 6.
For convenience of description, the above apparatus is described as divided by function into various units/modules. Of course, when implementing the present invention, the functions of the units/modules may be realized in one or more pieces of software and/or hardware.
Those of ordinary skill in the art will appreciate that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.
Claims (10)
1. A word segmentation method for ancient traditional Chinese medicine (TCM) documents, characterized by comprising:
pre-processing ancient documents of the TCM field to generate a corpus for training a language model;
training on the corpus to generate the language model;
performing unsupervised word segmentation on the ancient documents using the language model to generate a preliminary word segmentation result;
summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence patterns and linguistic knowledge, compiling segmentation rules, and forming a rule file;
applying a first correction to the preliminary word segmentation result according to the rules in the rule file to generate a first correction result.
2. The method according to claim 1, characterized in that the method further comprises:
obtaining TCM field terms compiled from the ancient documents as a vocabulary;
correcting the first correction result using the vocabulary to obtain a final word segmentation result.
3. The method according to claim 1, characterized in that the step of pre-processing the ancient documents comprises:
obtaining the original text of the ancient documents, deleting the table of contents of the ancient documents from the original text, and deleting sentences containing characters that cannot be represented in UTF-8, to generate a cleaned text;
adding a space after each character of the cleaned text, the result serving as the corpus for training the language model.
4. The method according to claim 2, characterized in that the step of performing unsupervised word segmentation on the ancient documents using the language model comprises:
dividing the transition states of a character into four kinds: the first, a single-character word or the first character of a multi-character word, labeled a; the second, the second character of a multi-character word, labeled b; the third, the third character of a multi-character word, labeled c; the fourth, the remaining part of a multi-character word, labeled d;
wherein a single-character word can only be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of a multi-character word; the second character of a multi-character word can only be followed by the third character of a multi-character word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by the remaining part of a multi-character word, or by a single-character word or the first character of a multi-character word; the remaining part of a multi-character word can be followed by a single-character word or the first character of a multi-character word, or by the remaining part of a multi-character word; apart from the above transitions, all other transition probabilities are zero;
in summary, there are 8 non-zero state transition probabilities, namely:
p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d);
wherein p(a|a) denotes the conditional probability that, given that the previous character is a single-character word, the next character is a single-character word or the first character of a multi-character word; p(b|a) denotes the conditional probability that, given that the previous character is the first character of a multi-character word, the next character is the second character of a multi-character word; p(c|b) denotes the conditional probability that, given that the previous character is the second character of a multi-character word, the next character is the third character of a multi-character word; p(a|b) denotes the conditional probability that, given that the previous character is the second character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(a|c) denotes the conditional probability that, given that the previous character is the third character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(d|c) denotes the conditional probability that, given that the previous character is the third character of a multi-character word, the next character is the remaining part of a multi-character word; p(a|d) denotes the conditional probability that, given that the previous character is the remaining part of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(d|d) denotes the conditional probability that, given that the previous character is the remaining part of a multi-character word, the next character is the remaining part of a multi-character word;
setting different transition probabilities and comparing them experimentally to obtain the transition probabilities:
p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001;
computing conditional probabilities using the language model;
finding the optimal path, i.e. the path with the maximum conditional probability, by dynamic programming, and taking it as the segmentation to obtain the preliminary word segmentation result.
5. The method according to claim 4, characterized in that the step of computing conditional probabilities using the language model comprises:
$$p(w) = p(z_1)\,p(z_2 \mid z_1)\,p(z_3 \mid z_2 z_1)\cdots p(z_k \mid z_{k-1}\ldots z_2 z_1) \qquad (1)$$
wherein the word w consists of k characters; z_1, z_2, …, z_k are respectively the 1st, 2nd, …, k-th characters of w; the likelihood that w is a word, i.e. the existence probability p(w) of w, can be converted into the existence probabilities of its constituent characters, where the existence probability of each character is calculated as the number of times the character occurs in the text used to train the language model divided by the total number of characters.
6. The method according to claim 1, characterized in that the step of summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence patterns and linguistic knowledge, compiling segmentation rules, and forming a rule file comprises:
sorting the preliminary word segmentation result in pinyin order;
processing the sorted preliminary word segmentation result in turn, the processing being specifically:
for results in which a Chinese character and a punctuation mark have been grouped into one word, writing a rule to the rule file according to the punctuation mark;
for results in which Chinese characters have been grouped together into one word, judging the words with the same first character according to part of speech and linguistic knowledge: a verb + noun form is split into two words, the verb and the noun; an adjective + noun form is split into two words, the adjective and the noun; and writing a rule to the rule file according to the shared first character;
for results in which Chinese characters have been grouped together into one word, judging the words with the same first two characters according to part of speech and linguistic knowledge: a verb + noun form is split into two words, the verb and the noun; an adjective + noun form is split into two words, the adjective and the noun; and writing a rule to the rule file according to the shared first two characters;
for results in which Chinese characters have been grouped together into one word, collecting the words with the same final character and judging them according to part of speech and linguistic knowledge: a verb + noun form is split into two words, the verb and the noun; an adjective + noun form is split into two words, the adjective and the noun; and writing a rule to the rule file according to the final character.
7. The method according to claim 6, characterized in that the step of summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence patterns and linguistic knowledge, compiling segmentation rules, and forming a rule file further comprises:
counting all words in the preliminary word segmentation result and the number of times each word occurs;
sorting by the number of occurrences from high to low and obtaining a predetermined number of top-ranked words;
judging whether the predetermined number of top-ranked words are in a preset vocabulary;
depending on the result of the judgment, establishing the rule file according to the words.
8. The method according to claim 2, characterized in that the step of correcting the first correction result using the vocabulary to obtain a final word segmentation result comprises:
finding the words of the vocabulary in the original text as words to be corrected;
recording the positions of the words to be corrected in the original text;
locating, according to the recorded positions, the segmentation result of the words to be corrected after the first correction;
judging whether the segmentation result is consistent with the word in the vocabulary;
if inconsistent, modifying the segmentation result of the word to be corrected according to the vocabulary; if consistent, keeping the segmentation result;
applying the vocabulary correction to the first correction result in turn to obtain the final word segmentation result; or
the step of correcting the first correction result using the vocabulary to obtain a final word segmentation result further comprises:
when a first word is both a word in the vocabulary and a sub-word of another word in the vocabulary, performing long/short word disambiguation.
9. A word segmentation device for ancient TCM documents, characterized by comprising:
a pre-processing module, which pre-processes ancient documents of the TCM field to generate a corpus for training a language model;
a training module, which trains on the corpus to generate the language model;
a word segmentation module, which performs unsupervised word segmentation on the ancient documents using the language model to generate a preliminary word segmentation result;
a rule establishing module, which summarizes the preliminary word segmentation result according to part-of-speech relationships, fixed sentence patterns and linguistic knowledge, compiles segmentation rules, and forms a rule file;
a first correction module, which applies a first correction to the preliminary word segmentation result according to the rules in the rule file to generate a first correction result.
10. The device according to claim 9, characterized by further comprising:
an obtaining module, which obtains TCM field terms compiled from the ancient documents as a vocabulary;
a second correction module, which corrects the first correction result using the vocabulary to obtain a final word segmentation result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910384880.XA CN110134766B (en) | 2019-05-09 | 2019-05-09 | Word segmentation method and device for traditional Chinese medical ancient book documents |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910384880.XA CN110134766B (en) | 2019-05-09 | 2019-05-09 | Word segmentation method and device for traditional Chinese medical ancient book documents |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110134766A true CN110134766A (en) | 2019-08-16 |
CN110134766B CN110134766B (en) | 2021-06-25 |
Family
ID=67576958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910384880.XA Active CN110134766B (en) | 2019-05-09 | 2019-05-09 | Word segmentation method and device for traditional Chinese medical ancient book documents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110134766B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111241833A (en) * | 2020-01-16 | 2020-06-05 | 支付宝(杭州)信息技术有限公司 | Word segmentation method and device for text data and electronic equipment |
CN111259667A (en) * | 2020-01-16 | 2020-06-09 | 上海国民集团健康科技有限公司 | Chinese medicine word segmentation algorithm |
CN112735556A (en) * | 2019-10-28 | 2021-04-30 | 北京中医药大学 | Traditional Chinese medicine ancient book data processing method for diagnosing and treating insomnia |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110144992A1 (en) * | 2009-12-15 | 2011-06-16 | Microsoft Corporation | Unsupervised learning using global features, including for log-linear model word segmentation |
CN103020034A (en) * | 2011-09-26 | 2013-04-03 | 北京大学 | Chinese words segmentation method and device |
CN107168946A (en) * | 2017-04-14 | 2017-09-15 | 北京化工大学 | A kind of name entity recognition method of medical text data |
CN107291692A (en) * | 2017-06-14 | 2017-10-24 | 北京百度网讯科技有限公司 | Method for customizing, device, equipment and the medium of participle model based on artificial intelligence |
CN108509419A (en) * | 2018-03-21 | 2018-09-07 | 山东中医药大学 | Ancient TCM books document participle and part of speech indexing method and system |
Non-Patent Citations (1)
Title |
---|
付璐: "以清代医籍为例探讨中医古籍分词规范标准", 《中华中医药杂志》 * |
Also Published As
Publication number | Publication date |
---|---|
CN110134766B (en) | 2021-06-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||