CN110134766A - Word segmentation method and device for ancient traditional Chinese medicine literature - Google Patents

Publication number
CN110134766A
Authority
CN
China
Prior art keywords
word
character
words
segmentation result
character words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910384880.XA
Other languages
Chinese (zh)
Other versions
CN110134766B (en)
Inventor
谢永红
周越
张德政
阿孜古丽
栗辉
贾麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB
Priority to CN201910384880.XA
Publication of CN110134766A
Application granted
Publication of CN110134766B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

An embodiment of the present invention discloses a word segmentation method and device for ancient traditional Chinese medicine (TCM) literature. The method comprises: preprocessing ancient literature in the TCM field to generate a corpus for training a language model; training on the corpus to generate the language model; performing unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result; summarizing the preliminary segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge to derive segmentation rules and form a rule file; and performing a first correction on the preliminary segmentation result according to the rules in the rule file to generate a first correction result.

Description

Word segmentation method and device for ancient traditional Chinese medicine literature
Technical field
The present invention relates to word segmentation methods for medical literature in the field of natural language processing, and in particular to a word segmentation method and device for ancient traditional Chinese medicine (TCM) literature.
Background
Chinese word segmentation is a basic step in Chinese text processing. Unlike English and similar languages, Chinese sentences do not use spaces to separate words, so segmentation is a key preliminary step for Chinese information processing tasks such as text classification, information retrieval, information filtering, automatic document indexing and automatic summarization. The correctness of the segmentation result directly affects the correctness of all subsequent work.
Traditional Chinese medicine, which originated in primitive society and has developed continuously since, has accumulated a large body of ancient medical works. These works are vast in number, complex in content and wide in variety, covering the theory of essence and qi, the theory of yin-yang and the five elements (metal, wood, water, fire and earth), qi, blood and body fluids, visceral manifestation, meridians and collaterals, constitution, etiology, onset of disease, pathogenesis, treatment principles, health preservation, and so on. Most are written in classical Chinese, in the spoken language of the ancients, or as rhymed formulas; their writing styles and dates of compilation all differ, and they differ considerably from modern Chinese. They also contain many proper nouns and technical terms of the TCM field. Proper segmentation of ancient TCM literature is the basis for structuring TCM knowledge, but there is currently no segmenter designed specifically for the TCM field, and general-purpose segmenters do not handle the segmentation of ancient TCM texts well.
Summary of the invention
In view of this, embodiments of the present invention provide a word segmentation method and device for ancient TCM literature that can improve the accuracy of word segmentation for documents in the TCM field.
A word segmentation method for ancient TCM literature, comprising:
preprocessing ancient literature in the TCM field to generate a corpus for training a language model;
training on the corpus to generate a language model;
performing unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result;
summarizing the preliminary segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, to derive segmentation rules and form a rule file;
performing a first correction on the preliminary segmentation result according to the rules in the rule file, to generate a first correction result.
The method further comprises:
obtaining TCM terms compiled from the ancient literature, as a vocabulary;
correcting the first correction result using the vocabulary, to obtain a final segmentation result.
The step of preprocessing the ancient literature comprises:
obtaining the original text of the ancient literature, deleting the table of contents of the ancient literature from the original text, and deleting sentences containing characters that cannot be represented in UTF-8, to generate a cleaned text;
adding a space after each character of the cleaned text, to serve as the corpus for training the language model.
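The preprocessing step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the way the table of contents is supplied (as a literal string to remove), and the treatment of the Unicode replacement character U+FFFD as "not representable" are all assumptions made for the example.

```python
import re

# Assumption: a sentence ends at a Chinese full stop, exclamation mark or question mark.
SENTENCE = re.compile(r"[^。！？]+[。！？]?")

def is_representable(sentence: str) -> bool:
    """True if the sentence encodes to UTF-8 and holds no U+FFFD placeholder."""
    if "\ufffd" in sentence:  # assumption: decoding debris counts as unrepresentable
        return False
    try:
        sentence.encode("utf-8")
        return True
    except UnicodeEncodeError:  # e.g. lone surrogate code points
        return False

def clean_text(raw: str, catalogue: str = "") -> str:
    """Remove the table of contents (hypothetically passed in verbatim),
    then drop every sentence that fails is_representable."""
    if catalogue:
        raw = raw.replace(catalogue, "", 1)
    return "".join(s for s in SENTENCE.findall(raw) if is_representable(s))

def to_training_corpus(cleaned: str) -> str:
    """Add one space after every non-whitespace character, line by line."""
    lines = []
    for line in cleaned.splitlines():
        chars = [ch for ch in line if not ch.isspace()]
        if chars:
            lines.append("".join(ch + " " for ch in chars))
    return "\n".join(lines)

raw = "目录\n卷一\n气血不和。邪\ufffd入里。阴阳！"
cleaned = clean_text(raw, catalogue="目录\n卷一\n")
print(cleaned)                      # 气血不和。阴阳！
print(to_training_corpus(cleaned))  # 气 血 不 和 。 阴 阳 ！
```

Separating every character with a space makes each character a token, so a standard n-gram toolkit trained on this corpus yields a character-level model.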
The step of performing unsupervised word segmentation on the ancient literature using the language model comprises:
dividing the states of a character into four kinds: first, a single-character word or the first character of a multi-character word, labeled a; second, the second character of a multi-character word, labeled b; third, the third character of a multi-character word, labeled c; fourth, the remainder of a multi-character word, labeled d;
wherein a single-character word can only be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of a multi-character word; the second character of a multi-character word can only be followed by the third character of a multi-character word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by the remainder of a multi-character word, or by a single-character word or the first character of a multi-character word; the remainder of a multi-character word can be followed by a single-character word, the first character of a multi-character word, or the remainder of a multi-character word. All transition probabilities other than these are zero;
in summary, there are 8 non-zero state transition probabilities, namely:
p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d);
where:
p(a|a) is the conditional probability that, given the previous character is a single-character word, the next character is a single-character word or the first character of a multi-character word;
p(b|a) is the conditional probability that, given the previous character is the first character of a multi-character word, the next character is the second character of a multi-character word;
p(c|b) is the conditional probability that, given the previous character is the second character of a multi-character word, the next character is the third character of a multi-character word;
p(a|b) is the conditional probability that, given the previous character is the second character of a multi-character word, the next character is a single-character word or the first character of a multi-character word;
p(a|c) is the conditional probability that, given the previous character is the third character of a multi-character word, the next character is a single-character word or the first character of a multi-character word;
p(d|c) is the conditional probability that, given the previous character is the third character of a multi-character word, the next character is part of the remainder of a multi-character word;
p(a|d) is the conditional probability that, given the previous character is part of the remainder of a multi-character word, the next character is a single-character word or the first character of a multi-character word;
p(d|d) is the conditional probability that, given the previous character is part of the remainder of a multi-character word, the next character is also part of the remainder of a multi-character word;
setting different transition probabilities and comparing them experimentally, to obtain the transition probabilities:
p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001;
computing conditional probabilities using the language model;
finding the optimal path, i.e. the path with the highest conditional probability, by dynamic programming, and taking it as the segmentation, to obtain the preliminary segmentation result.
The step of computing conditional probabilities using the language model comprises:
p(w) = p(z_1) p(z_2 | z_1) p(z_3 | z_2 z_1) ... p(z_k | z_{k-1} ... z_2 z_1)    (1)
where the word w consists of k characters and z_1, z_2, ..., z_k are the 1st, 2nd, ..., k-th characters of w. The likelihood that w is a word, i.e. the probability p(w) that the word w exists, can thus be converted into the probabilities of its constituent characters. The probability of each character is calculated by dividing the number of times that character occurs in the text used to train the language model by the total number of characters.
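Formula (1) can be illustrated with a small count-based estimate. The sketch below is an assumption-laden stand-in for a trained language model: it uses unsmoothed maximum-likelihood n-gram counts (a production toolkit would normally apply smoothing), it truncates the history to `max_order - 1` characters, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(text: str, max_order: int) -> Counter:
    """Count every character n-gram of length 1..max_order in the text."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def word_log_prob(word: str, counts: Counter, total_chars: int, max_order: int) -> float:
    """log p(w) by the chain rule of formula (1).

    The first character uses count / total characters, as described above;
    later characters use count(history + char) / count(history).
    Unseen events give -inf because no smoothing is applied (assumption).
    """
    logp = 0.0
    for i, ch in enumerate(word):
        history = word[max(0, i - max_order + 1):i]
        if history:
            num, den = counts[history + ch], counts[history]
        else:
            num, den = counts[ch], total_chars
        if num == 0 or den == 0:
            return float("-inf")
        logp += math.log(num / den)
    return logp

text = "气血气血气虚"
counts = ngram_counts(text, 2)
# p(气血) = p(气) * p(血|气) = (3/6) * (2/3) = 1/3
print(math.exp(word_log_prob("气血", counts, len(text), 2)))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.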
The step of summarizing the preliminary segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, to derive segmentation rules and form a rule file, comprises:
sorting the preliminary segmentation result in pinyin order;
processing the sorted preliminary segmentation result in turn, the processing being specifically:
for results in which a Chinese character and a punctuation mark have been grouped into one word, writing a rule to the rule file according to the punctuation mark;
for results in which Chinese characters have been grouped into one word, judging the words that share the same first character according to part of speech and linguistic knowledge: a word of the form verb + noun is split into two words, the verb and the noun; a word of the form adjective + noun is split into two words, the adjective and the noun. A rule is written to the rule file according to the verb;
for results in which Chinese characters have been grouped into one word, judging the words that share the same first two characters according to part of speech and linguistic knowledge: for example, a word of the form verb + noun is split into two words, the verb and the noun; a word of the form adjective + noun is split into two words, the adjective and the noun. A rule is written to the rule file according to the shared first two characters;
for results in which Chinese characters have been grouped into one word, collecting the words that share the same ending character and judging them according to part of speech and linguistic knowledge: for example, a word of the form verb + noun is split into two words, the verb and the noun; a word of the form adjective + noun is split into two words, the adjective and the noun. A rule is written to the rule file according to the ending character.
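The grouping that precedes the linguistic judgment can be sketched as below. The part-of-speech decision itself is a human judgment in the description and is not automated here; the code only collects candidate groups for review. The punctuation set, the function name, and the use of code-point order in place of pinyin order (the standard library has no pinyin collation) are assumptions of this example.

```python
from collections import defaultdict

# Assumption: a small illustrative set of Chinese punctuation marks.
PUNCTUATION = set("，。、；：？！“”‘’（）《》")

def group_candidates(words):
    """Group multi-character segmentation results for manual rule writing."""
    groups = {
        "punctuation": [],                # words that contain a punctuation mark
        "first_char": defaultdict(list),  # words sharing the same first character
        "first_two": defaultdict(list),   # words sharing the same first two characters
        "last_char": defaultdict(list),   # words sharing the same ending character
    }
    # Code-point order stands in for pinyin order (assumption).
    for w in sorted(set(words)):
        if len(w) < 2:
            continue
        if any(ch in PUNCTUATION for ch in w):
            groups["punctuation"].append(w)
            continue
        groups["first_char"][w[0]].append(w)
        if len(w) >= 3:
            groups["first_two"][w[:2]].append(w)
        groups["last_char"][w[-1]].append(w)
    return groups

groups = group_candidates(["治病", "治法", "气血。", "补气", "治"])
print(groups["punctuation"])            # ['气血。']
print(sorted(groups["first_char"]["治"]))
```

A reviewer would then inspect each group, for example all words beginning with the same verb, and decide which entries belong in the rule file.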
The step of summarizing the preliminary segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, to derive segmentation rules and form a rule file, further comprises:
counting the total number of words in the preliminary segmentation result and the number of occurrences of each word;
sorting by number of occurrences from high to low, and taking the top predetermined number of words;
judging whether each of these top words is in a preset vocabulary;
if so, writing a rule to the rule file according to the word.
The step of correcting the first correction result using the vocabulary, to obtain the final segmentation result, comprises:
finding the words of the vocabulary in the original text, as words to be corrected;
recording the positions of the words to be corrected in the original text;
according to the recorded positions, finding how each word to be corrected is segmented in the first correction result;
judging whether that segmentation is consistent with the word in the vocabulary;
if it is inconsistent, modifying the segmentation of the word to be corrected according to the vocabulary; if it is consistent, retaining the segmentation;
applying the vocabulary correction to the first correction result in turn, to obtain the final segmentation result.
The step of correcting the first correction result using the vocabulary, to obtain the final segmentation result, further comprises:
when a first word is both a word in the vocabulary and a sub-word of another word in the vocabulary, performing long-short word disambiguation.
A word segmentation device for ancient TCM literature, characterized by comprising:
a preprocessing module, which preprocesses ancient literature in the TCM field to generate a corpus for training a language model;
a training module, which trains on the corpus to generate a language model;
a word segmentation module, which performs unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result;
a rule establishing module, which summarizes the preliminary segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, to derive segmentation rules and form a rule file;
a first correction module, which performs a first correction on the preliminary segmentation result according to the rules in the rule file, to generate a first correction result.
The device further comprises:
an obtaining module, which obtains TCM terms compiled from the ancient literature, as a vocabulary;
a second correction module, which corrects the first correction result using the vocabulary, to obtain a final segmentation result.
The above embodiments are designed specifically for the TCM field and improve the correctness of word segmentation in that field.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the word segmentation method for ancient TCM literature according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the connections of the word segmentation device for ancient TCM literature of the present invention;
Fig. 3 is a flow chart of the ancient TCM literature word segmentation method of the present invention;
Fig. 4 shows the training corpus preprocessing result of the present invention;
Fig. 5 shows the rule file of the present invention;
Fig. 6 shows the word segmentation result of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, a word segmentation method for ancient TCM literature according to the present invention comprises:
Step 101: preprocess ancient literature in the TCM field to generate a corpus for training a language model. The step of preprocessing the ancient literature comprises: obtaining the original text of the ancient literature, deleting the table of contents of the ancient literature from the original text, and deleting sentences containing characters that cannot be represented in UTF-8, to generate a cleaned text; and adding a space after each character of the cleaned text, to serve as the corpus for training the language model.
Step 102: train on the corpus to generate a language model;
Step 103: perform unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result;
The step of performing unsupervised word segmentation on the ancient literature using the language model comprises:
dividing the states of a character into four kinds: first, a single-character word or the first character of a multi-character word, labeled a; second, the second character of a multi-character word, labeled b; third, the third character of a multi-character word, labeled c; fourth, the remainder of a multi-character word, labeled d;
wherein a single-character word can only be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of a multi-character word; the second character of a multi-character word can only be followed by the third character of a multi-character word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by the remainder of a multi-character word, or by a single-character word or the first character of a multi-character word; the remainder of a multi-character word can be followed by a single-character word, the first character of a multi-character word, or the remainder of a multi-character word. All transition probabilities other than these are zero;
in summary, there are 8 non-zero state transition probabilities, namely:
p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d);
where:
p(a|a) is the conditional probability that, given the previous character is a single-character word, the next character is a single-character word or the first character of a multi-character word;
p(b|a) is the conditional probability that, given the previous character is the first character of a multi-character word, the next character is the second character of a multi-character word;
p(c|b) is the conditional probability that, given the previous character is the second character of a multi-character word, the next character is the third character of a multi-character word;
p(a|b) is the conditional probability that, given the previous character is the second character of a multi-character word, the next character is a single-character word or the first character of a multi-character word;
p(a|c) is the conditional probability that, given the previous character is the third character of a multi-character word, the next character is a single-character word or the first character of a multi-character word;
p(d|c) is the conditional probability that, given the previous character is the third character of a multi-character word, the next character is part of the remainder of a multi-character word;
p(a|d) is the conditional probability that, given the previous character is part of the remainder of a multi-character word, the next character is a single-character word or the first character of a multi-character word;
p(d|d) is the conditional probability that, given the previous character is part of the remainder of a multi-character word, the next character is also part of the remainder of a multi-character word.
Different transition probabilities are set and compared experimentally, to obtain the transition probabilities:
p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001;
conditional probabilities are computed using the language model;
the optimal path, i.e. the path with the highest conditional probability, is found by dynamic programming and taken as the segmentation, to obtain the preliminary segmentation result.
The step of computing conditional probabilities using the language model comprises:
p(w) = p(z_1) p(z_2 | z_1) p(z_3 | z_2 z_1) ... p(z_k | z_{k-1} ... z_2 z_1)    (1)
where the word w consists of k characters and z_1, z_2, ..., z_k are the 1st, 2nd, ..., k-th characters of w. The likelihood that w is a word, i.e. the probability p(w) that the word w exists, can thus be converted into the probabilities of its constituent characters. The probability of each character is calculated by dividing the number of times that character occurs in the text used to train the language model by the total number of characters.
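The dynamic-programming search of Step 103 can be sketched as a Viterbi pass over the four states, using the eight transition probabilities listed above. Several points are assumptions of this example and are not stated in the text: a character in state a, b, c or d is scored with a 1-, 2-, 3- or 4-gram conditional probability respectively; the first character of the text is forced into state a; and `cond_logp(char, history)` is a hypothetical callback standing in for the trained language model.

```python
import math

STATES = "abcd"
# Assumption: n-gram order used to score a character in each state.
ORDER = {"a": 1, "b": 2, "c": 3, "d": 4}
# (previous state, next state) -> transition probability, values from the text.
TRANS = {
    ("a", "a"): 0.96, ("a", "b"): 0.2,
    ("b", "c"): 0.009, ("b", "a"): 0.9,
    ("c", "a"): 1.0, ("c", "d"): 0.005,
    ("d", "a"): 1.0, ("d", "d"): 0.0001,
}
LOG_TRANS = {k: math.log(v) for k, v in TRANS.items()}

def viterbi_segment(text, cond_logp):
    """Return the word list for the highest-scoring state path."""
    if not text:
        return []
    # best[state] = (log score, state path) of the best path ending in that state
    best = {"a": (cond_logp(text[0], ""), "a")}
    for i in range(1, len(text)):
        new_best = {}
        for s in STATES:
            history = text[max(0, i - ORDER[s] + 1):i]
            emit = cond_logp(text[i], history)
            candidates = [
                (score + LOG_TRANS[(prev, s)] + emit, path + s)
                for prev, (score, path) in best.items()
                if (prev, s) in LOG_TRANS  # all other transitions have probability zero
            ]
            if candidates:
                new_best[s] = max(candidates)
        best = new_best
    _, path = max(best.values())
    words = []
    for ch, state in zip(text, path):
        if state == "a":
            words.append(ch)   # state a starts a new word
        else:
            words[-1] += ch    # states b, c, d extend the current word
    return words

def toy_logp(ch, history):
    """Hand-made scores for illustration only, not a trained model."""
    if history == "":
        return math.log(0.01)
    if history + ch in ("气血", "不和"):
        return math.log(0.9)
    return math.log(1e-6)

print(viterbi_segment("气血不和", toy_logp))  # ['气血', '不和']
```

Because state a is reachable from every state, at least one valid path always survives each step, so the search cannot dead-end.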
Step 104: summarize the preliminary segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, to derive segmentation rules and form a rule file;
The step of summarizing the preliminary segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, to derive segmentation rules and form a rule file, comprises:
sorting the preliminary segmentation result in pinyin order;
processing the sorted preliminary segmentation result in turn, the processing being specifically:
for results in which a Chinese character and a punctuation mark have been grouped into one word, writing a rule to the rule file according to the punctuation mark;
for results in which Chinese characters have been grouped into one word, judging the words that share the same first character according to part of speech and linguistic knowledge: a word of the form verb + noun is split into two words, the verb and the noun; a word of the form adjective + noun is split into two words, the adjective and the noun. A rule is written to the rule file according to the verb;
for results in which Chinese characters have been grouped into one word, judging the words that share the same first two characters according to part of speech and linguistic knowledge: for example, a word of the form verb + noun is split into two words, the verb and the noun; a word of the form adjective + noun is split into two words, the adjective and the noun. A rule is written to the rule file according to the shared first two characters;
for results in which Chinese characters have been grouped into one word, collecting the words that share the same ending character and judging them according to part of speech and linguistic knowledge: for example, a word of the form verb + noun is split into two words, the verb and the noun; a word of the form adjective + noun is split into two words, the adjective and the noun. A rule is written to the rule file according to the ending character.
The step of summarizing the preliminary segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, to derive segmentation rules and form a rule file, further comprises:
counting the total number of words in the preliminary segmentation result and the number of occurrences of each word;
sorting by number of occurrences from high to low, and taking the top predetermined number of words;
judging whether each of these top words is in a preset vocabulary;
if so, writing a rule to the rule file according to the word.
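The frequency statistics can be sketched as follows. The function name and the choice to return both the in-vocabulary and out-of-vocabulary high-frequency words are assumptions of this example; which of the two lists feeds the rule file is left to the reviewer.

```python
from collections import Counter

def high_frequency_words(segmented, vocabulary, top_n):
    """Count words, take the top_n most frequent, and split them by
    whether they appear in the preset vocabulary."""
    counts = Counter(segmented)
    total_words = sum(counts.values())
    top = counts.most_common(top_n)  # sorted by count, high to low
    in_vocab = [w for w, _ in top if w in vocabulary]
    out_of_vocab = [w for w, _ in top if w not in vocabulary]
    return total_words, top, in_vocab, out_of_vocab

segmented = ["气", "血", "气", "阴阳", "气", "血"]
total, top, known, unknown = high_frequency_words(segmented, {"阴阳", "气"}, top_n=2)
print(total)    # 6
print(top)      # [('气', 3), ('血', 2)]
print(known)    # ['气']
print(unknown)  # ['血']
```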
Step 105: perform a first correction on the preliminary segmentation result according to the rules in the rule file, to generate a first correction result.
The method further comprises:
Step 106: obtain TCM terms compiled from the ancient literature, as a vocabulary;
Step 107: correct the first correction result using the vocabulary, to obtain a final segmentation result. This step comprises:
finding the words of the vocabulary in the original text, as words to be corrected;
recording the positions of the words to be corrected in the original text;
according to the recorded positions, finding how each word to be corrected is segmented in the first correction result;
judging whether that segmentation is consistent with the word in the vocabulary;
if it is inconsistent, modifying the segmentation of the word to be corrected according to the vocabulary; if it is consistent, retaining the segmentation;
applying the vocabulary correction to the first correction result in turn, to obtain the final segmentation result.
This step may further comprise:
when a first word is both a word in the vocabulary and a sub-word of another word in the vocabulary, performing long-short word disambiguation.
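The vocabulary correction of Step 107 can be sketched with word boundaries. The text does not say how long-short word disambiguation decides between candidates; this example assumes a leftmost-longest match, which is one common choice and not necessarily the one used in the invention. The function names are hypothetical.

```python
def find_vocab_spans(text, vocabulary):
    """Locate vocabulary words in the original text as (start, end) spans.
    Assumption: overlapping candidates are resolved leftmost-longest."""
    spans = []
    max_len = max(map(len, vocabulary), default=0)
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in vocabulary:
                spans.append((i, i + length))
                i += length
                break
        else:
            i += 1  # no vocabulary word starts here
    return spans

def correct_with_vocabulary(text, words, vocabulary):
    """Force each vocabulary span to be exactly one word in the result."""
    if "".join(words) != text:
        raise ValueError("segmentation does not match the original text")
    cuts = {0}
    position = 0
    for w in words:
        position += len(w)
        cuts.add(position)  # boundary after each word of the first correction result
    for start, end in find_vocab_spans(text, vocabulary):
        cuts -= set(range(start + 1, end))  # remove boundaries inside the term
        cuts.update((start, end))           # add boundaries around the term
    bounds = sorted(cuts)
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

text = "气血不和则病"
first_result = ["气", "血不", "和", "则", "病"]
vocab = {"气血", "气血不和", "不和"}
print(correct_with_vocabulary(text, first_result, vocab))  # ['气血不和', '则', '病']
```

Spans whose segmentation already matches the vocabulary are left unchanged, because removing and re-adding the same boundaries has no effect.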
As shown in Fig. 2, a word segmentation device for ancient TCM literature according to the present invention comprises:
a preprocessing module 21, which preprocesses ancient literature in the TCM field to generate a corpus for training a language model;
a training module 22, which trains on the corpus to generate a language model;
a word segmentation module 23, which performs unsupervised word segmentation on the ancient literature using the language model to generate a preliminary segmentation result;
a rule establishing module 24, which summarizes the preliminary segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, to derive segmentation rules and form a rule file;
a first correction module 25, which performs a first correction on the preliminary segmentation result according to the rules in the rule file, to generate a first correction result.
The device further comprises:
an obtaining module 26, which obtains TCM terms compiled from the ancient literature, as a vocabulary;
a second correction module 27, which corrects the first correction result using the vocabulary, to obtain a final segmentation result.
An application scenario of the present invention is described below. The present invention proposes a word segmentation method for ancient TCM literature in order to solve the problem of segmenting ancient literature in the TCM field. The specific implementation steps are as follows:
Step 1: Obtain ancient literature related to the TCM field as the corpus for training the language model, and compile the terms specific to the TCM field into a vocabulary. The vocabulary has one word per line and is stored in TXT format.
Step 2: Preprocess the literature and train the language model using the kenlm tool.
The preprocessing includes:
(1) deleting the table of contents, and deleting sentences containing special characters that cannot be represented in UTF-8 (sentences here are delimited by full stops, exclamation marks and question marks);
(2) adding a space after each character of the cleaned text, to serve as the corpus for training the character-level language model.
Step 3: Perform preliminary unsupervised word segmentation on the ancient literature of the TCM field using the language model.
In Chinese, words longer than four characters are relatively rare, so this patent divides the states of a character into four kinds: a single-character word or the first character of a multi-character word, labeled a; the second character of a multi-character word, labeled b; the third character of a multi-character word, labeled c; and the remainder of a multi-character word, labeled d. A single-character word can only be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of a multi-character word; the second character of a multi-character word can only be followed by the third character of a multi-character word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by the remainder of a multi-character word, or by a single-character word or the first character of a multi-character word; the remainder of a multi-character word can be followed by a single-character word, the first character of a multi-character word, or the remainder of a multi-character word. All transition probabilities other than these are zero. In summary, there are 8 non-zero transition probabilities, namely:
p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d).
Because long words are rare in classical Chinese and single characters often carry a complete meaning, different transition probabilities were set and compared experimentally, yielding the transition probabilities:
p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001.
Conditional probabilities are computed using the language model obtained in Step 2, as shown in formula (1), where the word w consists of k characters and the probability of w can be converted into the probabilities of its constituent characters. The optimal path, i.e. the path with the highest probability, is found by dynamic programming and taken as the segmentation, giving the preliminary segmentation result.
p(w) = p(z_1) p(z_2 | z_1) p(z_3 | z_2 z_1) ... p(z_k | z_{k-1} ... z_2 z_1)    (1)
Step 4: Based on part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, the preliminary word segmentation result of Step 3 is summarized, segmentation rules are sorted out, and a rule file is formed. The rule file is compiled as follows:
(1) Count all words in the word segmentation result obtained in Step 3 and the number of times each word occurs;
(2) Sort the statistics in pinyin order;
(3) Where a Chinese character and a punctuation mark have been segmented into one word, write that punctuation mark into the rule file as a rule;
(4) Where Chinese characters have been joined into one word, first group the words that share the same first character, then judge them by part of speech and linguistic knowledge; for example, a "verb + noun" combination should in general be split into two words, a verb and a noun, and the corresponding word is then added to the rule file;
(5) Group the words whose first two characters are identical, judge them by part of speech and linguistic knowledge, and add the corresponding words to the rule file;
(6) Group the words that share the same final character, judge them by part of speech and linguistic knowledge, and add the corresponding words to the rule file;
(7) Finally, sort the words by frequency from high to low and examine the high-frequency words.
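The bookkeeping in items (1) to (7) can be sketched as follows. This only prepares the groups that a person would then review; the linguistic judgment itself is manual. The function names are hypothetical, code-point order stands in for pinyin order (true pinyin sorting would need an external library), and the sample words are assumed Chinese counterparts of the translated examples.

```python
import unicodedata
from collections import Counter, defaultdict
from typing import Dict, Iterable


def has_punctuation(word: str) -> bool:
    """True if a multi-character token contains a punctuation mark."""
    return len(word) > 1 and any(
        unicodedata.category(c).startswith("P") for c in word
    )


def summarize(words: Iterable[str]) -> Dict[str, object]:
    """Group segmented words for manual rule review."""
    counts = Counter(words)            # item (1): occurrences of each word
    vocab = sorted(counts)             # item (2): code-point order, not pinyin
    by_first = defaultdict(list)
    by_first_two = defaultdict(list)
    by_last = defaultdict(list)
    for w in vocab:
        if len(w) < 2 or has_punctuation(w):
            continue
        by_first[w[0]].append(w)       # item (4): same first character
        by_last[w[-1]].append(w)       # item (6): same final character
        if len(w) > 2:
            by_first_two[w[:2]].append(w)  # item (5): same first two characters
    return {
        "counts": counts,
        "punctuation": [w for w in vocab if has_punctuation(w)],  # item (3)
        "by_first": dict(by_first),
        "by_first_two": dict(by_first_two),
        "by_last": dict(by_last),
        "by_frequency": [w for w, _ in counts.most_common()],     # item (7)
    }


report = summarize(["食盐", "食热酒", "食盐", ",茯苓", "茯苓", "食热物"])
print(report["punctuation"])    # tokens that mix text with punctuation
print(report["by_first"]["食"])  # candidates sharing the first character
```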
The rule file is stored in TXT format, with one rule per line, as shown in Table 1:
Table 1: Rule file
Step 5: Apply a first correction to the preliminary word segmentation result of Step 3 according to the rules in the rule file;
Step 6: Correct the first correction result of Step 5 using the vocabulary. The detailed correction method is as follows:
(1) Find the words of the vocabulary in the original text and record the positions where they occur;
(2) If a word in the vocabulary is also a sub-word of another word in the vocabulary, perform long/short word disambiguation to determine which vocabulary word should be used at that position;
(3) Using the positions recorded in (1), locate the corresponding segmentation in the result of the first correction from Step 5;
(4) If the segmentation is inconsistent with the vocabulary, i.e. a vocabulary word has been split into several words or its boundaries are wrong, merge and modify the segmentation according to the vocabulary; if it is consistent, keep the result;
(5) After the vocabulary correction, the final word segmentation result is obtained.
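A minimal sketch of the vocabulary correction follows. It assumes greedy left-to-right longest matching as the long/short disambiguation strategy, which the text does not specify, and the function names and sample strings are illustrative stand-ins rather than the patent's data.

```python
from typing import List, Set, Tuple


def find_spans(text: str, vocab: Set[str]) -> List[Tuple[int, int]]:
    """Locate vocabulary words in the raw text, preferring the longest match."""
    spans = []
    max_len = max((len(w) for w in vocab), default=0)
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in vocab:
                spans.append((i, i + n))
                i += n
                break
        else:
            i += 1  # no vocabulary word starts here
    return spans


def apply_vocab(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Re-cut the token sequence so every vocabulary span is exactly one token."""
    text = "".join(tokens)
    cuts = set()
    pos = 0
    for tok in tokens:          # existing word boundaries
        pos += len(tok)
        cuts.add(pos)
    for start, end in find_spans(text, vocab):
        cuts -= set(range(start + 1, end))  # drop boundaries inside the word
        cuts.add(start)                     # enforce boundaries at both ends
        cuts.add(end)
    cuts.discard(0)
    out, prev = [], 0
    for c in sorted(cuts):
        out.append(text[prev:c])
        prev = c
    return out


vocab = {"加减六味汤", "六味"}
print(apply_vocab(["加减", "六味", "汤治", "遗尿"], vocab))
# ['加减六味汤', '治', '遗尿']
```

Tracking boundaries as a set of cut positions handles both failure cases named in item (4): a vocabulary word split into several tokens, and a token that straddles the edge of a vocabulary word.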
The advantages of the present invention are as follows:
Among existing word segmentation tools there is no segmenter designed for the characteristics of classical Chinese, nor one aimed specifically at the TCM field. Classical Chinese and Modern Chinese differ considerably in expression. For example, the two generally use different modal particles (classical Chinese uses particles such as "之" and "者"), and, compared with Modern Chinese, classical Chinese has more single characters that express a complete meaning on their own. As a result, existing segmenters make frequent errors on classical Chinese segmentation tasks. In addition, ancient TCM documents contain many descriptions of medicine names, prescriptions and symptoms; these field-specific expressions are rarely seen in general-domain text, so current general-domain segmenters cannot handle the segmentation of ancient TCM books. The present invention uses an unsupervised method to train a segmentation model specifically for ancient TCM documents, which handles this segmentation task well. Because the method is unsupervised, it saves the labor and time cost of manual annotation. The method is also easy to extend to classical Chinese documents in other fields: by changing the data set used to train the model and the vocabulary of the relevant field, the segmenter of this invention can be applied to that field.
Taking an ancient TCM document as an example, the segmentation method for ancient TCM documents is described below, as shown in Figure 3.
First, ancient TCM documents are obtained as the corpus for training the language model. Terms specific to the TCM field are compiled into a vocabulary, as shown in Table 1.
Table 1: Vocabulary of terms specific to the TCM field
Second, the training corpus is preprocessed; the preprocessing result is shown in Figure 4. The language model is trained with the KenLM tool, with the n-gram order set to 4.
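The cleaning and character-spacing part of the preprocessing can be sketched as below. The helper names are hypothetical, and treating the Unicode replacement character and lone surrogates as "characters that cannot be represented in UTF-8" is an assumption; removing the table of contents depends on the individual book and is left out.

```python
from typing import Iterable, List


def is_clean(line: str) -> bool:
    """Reject lines with decoding debris or characters that cannot be UTF-8 encoded."""
    if "\ufffd" in line:  # replacement character left by a failed decode
        return False
    try:
        line.encode("utf-8")
    except UnicodeEncodeError:  # e.g. a lone surrogate
        return False
    return True


def to_corpus_lines(raw_lines: Iterable[str]) -> List[str]:
    """Put a space between characters so each character is one language-model token."""
    corpus = []
    for line in raw_lines:
        if not is_clean(line):
            continue
        line = line.strip()
        if not line:
            continue
        corpus.append(" ".join(ch for ch in line if not ch.isspace()))
    return corpus


print(to_corpus_lines(["茯苓 甘草", "坏\ufffd行", ""]))  # ['茯 苓 甘 草']
```

The resulting lines can be written to a text file and passed to KenLM's `lmplz` estimator, whose `-o` option sets the n-gram order (4 in this embodiment).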
Third, unsupervised word segmentation is performed on the ancient TCM documents using the language model trained in the second step. Words are separated from one another by spaces.
Fourth, the preliminary word segmentation result is reviewed and summarized. The result is first sorted by pinyin, and cases where a symbol and Chinese characters have been segmented into one word are split. For example, if ", Poria cocos" appears in the segmentation result, the rule ",|*" is added to the rule file. Here "*" stands for any character; that is, wherever "," is attached to any character, the two are split. Next, all words with the same first character are collected. According to linguistic common sense, a combination such as "verb + noun" should in general be divided into two words. For example, for words such as "eating hot wine", "eating salt" and "eating hot things", which all begin with the character "eat", a rule for that character, "eat|*", is added to the rule file. Similarly, words with the same first two characters and words with the same final character are grouped, and the corresponding rules are added to the rule file. Finally, the preliminary word segmentation result is sorted by word frequency from high to low, the high-frequency words are checked, and corresponding rules are added to the rule file for those that need to be split.
Fifth, the preliminary word segmentation result is corrected using the rules in the rule file. The rule file is shown in Figure 5.
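Applying such rules can be sketched as follows. The interpretation is an assumption, since the actual rule file (Figure 5) is not reproduced here: a rule "X|*" is read as "split a token that starts with X after X", and "*|X" as the mirror case for a token ending in X. The function names and sample tokens are hypothetical.

```python
from typing import List, Tuple

Rule = Tuple[str, str]


def parse_rule(line: str) -> Rule:
    """Split one rule line such as '食|*' into its left and right parts."""
    left, right = line.strip().split("|", 1)
    return left, right


def apply_rules(tokens: List[str], rules: List[Rule]) -> List[str]:
    """Apply every rule, in order, to every token of the segmentation result."""
    out: List[str] = []
    for tok in tokens:
        pieces = [tok]
        for left, right in rules:
            nxt: List[str] = []
            for p in pieces:
                if right == "*" and left not in ("", "*") \
                        and p.startswith(left) and len(p) > len(left):
                    nxt += [left, p[len(left):]]      # cut after the prefix
                elif left == "*" and right not in ("", "*") \
                        and p.endswith(right) and len(p) > len(right):
                    nxt += [p[:-len(right)], right]   # cut before the suffix
                else:
                    nxt.append(p)
            pieces = nxt
        out.extend(pieces)
    return out


rules = [parse_rule(",|*"), parse_rule("食|*")]
print(apply_rules([",茯苓", "食热酒", "茯苓"], rules))
# [',', '茯苓', '食', '热酒', '茯苓']
```

A line without a "|" separator raises `ValueError` in `parse_rule`; a production reader would likely skip or report such lines instead.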
Sixth, the result obtained in the fifth step is corrected using the vocabulary.
First, the words of the vocabulary are found in the original, unsegmented TCM document, and their positions are recorded.
Then, the vocabulary words are located in the segmentation result that has been corrected by the rules.
If the segmentation at such a position is inconsistent with the word in the vocabulary, the result at that position is modified according to the vocabulary. For example, the original TCM document contains the sentence "Modified Six-Ingredient Decoction treats enuresis; people with deficiency often present this pattern." Suppose the rule-corrected segmentation of this sentence splits "Modified Six-Ingredient Decoction" into several words. Because it is a word in the vocabulary, namely the name of a TCM prescription, it is merged into a single word in the final word segmentation result. The form of the segmentation result is shown in Figure 6.
For convenience of description, the above apparatus is described in terms of units/modules divided by function. Of course, when implementing the present invention, the functions of the units/modules may be realized in one or more pieces of software and/or hardware.
Those of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims (10)

1. A word segmentation method for ancient traditional Chinese medicine (TCM) documents, characterized by comprising:
Preprocessing ancient documents in the TCM field to generate a corpus for training a language model;
Training on the corpus to generate a language model;
Performing unsupervised word segmentation on the ancient documents using the language model to generate a preliminary word segmentation result;
Summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, sorting out segmentation rules, and forming a rule file;
Performing a first correction on the preliminary word segmentation result according to the rules in the rule file to generate a first correction result.
2. The method according to claim 1, characterized in that the method further comprises:
Obtaining TCM field terms sorted out from the ancient documents as a vocabulary;
Correcting the first correction result using the vocabulary to obtain a final word segmentation result.
3. The method according to claim 1, characterized in that the step of preprocessing the ancient documents comprises:
Obtaining the original text of the ancient documents, deleting the table of contents of the ancient documents from the original text, and deleting sentences containing characters that cannot be represented in UTF-8, to generate a cleaned text;
Adding a space after each character of the cleaned text to form the corpus for training the language model.
4. The method according to claim 2, characterized in that the step of performing unsupervised word segmentation on the ancient documents using the language model comprises:
Dividing the transition states of a character into four kinds: first, a single-character word or the first character of a multi-character word, labeled a; second, the second character of a multi-character word, labeled b; third, the third character of a multi-character word, labeled c; fourth, the remaining characters of a multi-character word, labeled d;
Wherein a single-character word can only be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of a multi-character word; the second character of a multi-character word can only be followed by the third character of a multi-character word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by a remaining character of a multi-character word, or by a single-character word or the first character of a multi-character word; a remaining character of a multi-character word can be followed by a single-character word or the first character of a multi-character word, or by a remaining character of a multi-character word; apart from the above transitions, all transition probabilities are zero;
In summary, there are 8 non-zero state transition probabilities, namely:
p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d);
Wherein p(a|a) denotes the conditional probability that, given that the previous character is a single-character word, the next character is a single-character word or the first character of a multi-character word; p(b|a) denotes the conditional probability that, given that the previous character is the first character of a multi-character word, the next character is the second character of a multi-character word; p(c|b) denotes the conditional probability that, given that the previous character is the second character of a multi-character word, the next character is the third character of a multi-character word; p(a|b) denotes the conditional probability that, given that the previous character is the second character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(a|c) denotes the conditional probability that, given that the previous character is the third character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(d|c) denotes the conditional probability that, given that the previous character is the third character of a multi-character word, the next character is a remaining character of a multi-character word; p(a|d) denotes the conditional probability that, given that the previous character is a remaining character of a multi-character word, the next character is a single-character word or the first character of a multi-character word; p(d|d) denotes the conditional probability that, given that the previous character is a remaining character of a multi-character word, the next character is a remaining character of a multi-character word;
Setting different transition probabilities and comparing them experimentally to obtain the transition probabilities:
p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001;
Computing conditional probabilities using the language model;
Finding the optimal path, i.e. the path with the maximum conditional probability, by dynamic programming and taking it as the segmentation, to obtain the preliminary word segmentation result.
5. The method according to claim 4, characterized in that the step of computing conditional probabilities using the language model comprises:
$$P(w) = p(z_1)\,p(z_2 \mid z_1)\,p(z_3 \mid z_2 z_1) \cdots p(z_k \mid z_{k-1} \cdots z_2 z_1) \qquad (1)$$
Wherein the word w consists of k characters, and z_1, z_2, ..., z_k are respectively the 1st, 2nd, ..., k-th characters of w; the likelihood that w is a word, i.e. the existence probability P(w) of w, is converted into the existence probabilities of its constituent characters, and the existence probability of each character is calculated by dividing the number of times the character occurs in the text used to train the language model by the total number of characters.
6. The method according to claim 1, characterized in that the step of summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, sorting out segmentation rules, and forming a rule file comprises:
Sorting the preliminary word segmentation result in pinyin order;
Processing the sorted preliminary word segmentation result in turn, the processing being specifically:
For results in which a Chinese character and a punctuation mark have been segmented into one word, writing the rule file according to the punctuation mark;
For results in which Chinese characters have been segmented into one word, judging the words with the same first character according to part of speech and linguistic knowledge; when a word has the form verb + noun, splitting it into two words, a verb and a noun; when a word has the form adjective + noun, splitting it into two words, an adjective and a noun; and writing the rule file according to the shared first character;
For results in which Chinese characters have been segmented into one word, judging the words with the same first two characters according to part of speech and linguistic knowledge; when a word has the form verb + noun, splitting it into two words, a verb and a noun; when a word has the form adjective + noun, splitting it into two words, an adjective and a noun; and writing the rule file according to the shared first two characters;
For results in which Chinese characters have been segmented into one word, grouping the words with the same final character and judging them according to part of speech and linguistic knowledge; when a word has the form verb + noun, splitting it into two words, a verb and a noun; when a word has the form adjective + noun, splitting it into two words, an adjective and a noun; and writing the rule file according to the final character.
7. The method according to claim 6, characterized in that the step of summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, sorting out segmentation rules, and forming a rule file further comprises:
Counting all words in the preliminary word segmentation result and the number of times each word occurs;
Sorting by the number of occurrences from high to low and obtaining the top predetermined number of words;
Judging whether the top predetermined number of words are in a preset vocabulary;
Establishing the rule file from those words according to the result of that judgment.
8. The method according to claim 2, characterized in that the step of correcting the first correction result using the vocabulary to obtain the final word segmentation result comprises:
Finding the words of the vocabulary in the original text as words to be corrected;
Recording the positions of the words to be corrected in the original text;
According to the recorded positions, locating the segmentation of the words to be corrected in the first correction result;
Judging whether the segmentation is consistent with the word in the vocabulary;
If it is inconsistent, modifying the segmentation of the word to be corrected according to the vocabulary; if it is consistent, keeping the segmentation;
Applying the vocabulary correction to the first correction result in turn to obtain the final word segmentation result; or
The step of correcting the first correction result using the vocabulary to obtain the final word segmentation result further comprises:
When a first word is both a word in the vocabulary and a sub-word of another word in the vocabulary, performing long/short word disambiguation.
9. A word segmentation device for ancient TCM documents, characterized by comprising:
A preprocessing module, which preprocesses ancient documents in the TCM field to generate a corpus for training a language model;
A training module, which trains on the corpus to generate a language model;
A word segmentation module, which performs unsupervised word segmentation on the ancient documents using the language model to generate a preliminary word segmentation result;
A rule establishing module, which summarizes the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, sorts out segmentation rules, and forms a rule file;
A first correction module, which performs a first correction on the preliminary word segmentation result according to the rules in the rule file to generate a first correction result.
10. The device according to claim 9, characterized by further comprising:
An obtaining module, which obtains TCM field terms sorted out from the ancient documents as a vocabulary;
A second correction module, which corrects the first correction result using the vocabulary to obtain a final word segmentation result.
CN201910384880.XA 2019-05-09 2019-05-09 Word segmentation method and device for traditional Chinese medical ancient book documents Active CN110134766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910384880.XA CN110134766B (en) 2019-05-09 2019-05-09 Word segmentation method and device for traditional Chinese medical ancient book documents

Publications (2)

Publication Number Publication Date
CN110134766A true CN110134766A (en) 2019-08-16
CN110134766B CN110134766B (en) 2021-06-25

Family

ID=67576958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910384880.XA Active CN110134766B (en) 2019-05-09 2019-05-09 Word segmentation method and device for traditional Chinese medical ancient book documents

Country Status (1)

Country Link
CN (1) CN110134766B (en)

Also Published As

Publication number Publication date
CN110134766B (en) 2021-06-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant