CN108509419B - Chinese medicine ancient book document word segmentation and part of speech indexing method and system

Info

Publication number: CN108509419B
Application number: CN201810233868.4A
Authority: CN
Inventors: 付先军; 李学博; 王振国; 陈晓康; 桑晓明; 鞠芳凝; 周扬; 陈聪; 邵欣欣
Original assignee: Shandong University of Traditional Chinese Medicine
Current assignee: Shandong University of Traditional Chinese Medicine
Priority date: 2018-03-21
Filing date: 2018-03-21
Publication date: 2022-02-22
Anticipated expiration: 2038-03-21
Also published as: CN108509419A

Abstract

The invention discloses a method and a system for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine (TCM) books. The method comprises the following steps: step (1): constructing a TCM word segmentation dictionary; step (2): performing word segmentation and part-of-speech tagging on the text to be segmented using the TCM word segmentation dictionary; step (3): judging whether all of the text to be segmented has been segmented successfully, and directly outputting the results for text that was segmented successfully; step (4): re-segmenting the text whose segmentation failed using the Ansj dictionary, and obtaining the final word segmentation result.

Description

Chinese medicine ancient book document word segmentation and part of speech indexing method and system

Technical Field

The invention relates to a method and a system for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine (TCM) books.

Background

Literature is important to human civilization and social progress and is the basis of all scientific research. Traditional Chinese medicine (TCM) literature is an important component of ancient Chinese literature and an important basis for studying the clinical medication experience of ancient physicians. It integrates TCM knowledge of theory, methods, formulas, and medicinals, and embodies the academic thought and clinical experience accumulated over thousands of years of TCM development. Mining this cultural heritage is an important premise and basis for the inheritance and innovation of TCM scholarship. Modern interpretation of TCM theory and modern research on TCM syndromes, treatment methods, and prescriptions cannot be separated from the classical texts; for example, the discovery of artemisinin drew on inspiration from classical TCM documents such as Zhou Hou Bei Ji Fang (Handbook of Prescriptions for Emergencies).

The sorting and analysis of TCM documents are based on word segmentation and part-of-speech tagging. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain specifications. At present, most research at home and abroad on Chinese word segmentation theory, methods, and technologies is still at the theoretical or experimental stage and is oriented toward natural language processing and information retrieval, and little mature Chinese word segmentation software is available. No software or method aimed specifically at TCM word segmentation and part-of-speech tagging has been reported. Because of the particularity of TCM professional terms, general Chinese word segmentation software achieves low accuracy and recall on TCM documents: the highest reported accuracy for segmenting ancient TCM documents is 0.735, with a recall of only 0.663, and the accuracy, recall, and F1 score of other Chinese word segmentation systems are even below 0.5 (for example, PHP Analysis reaches an accuracy of only 0.312 and a recall of only 0.369). Nor can such software perform part-of-speech tagging tailored to the professional characteristics of TCM. This greatly restricts the utilization and exploitation of TCM literature. In addition, most of this software requires environment configuration, has specific system requirements, has poor portability, and is difficult to operate.

Therefore, a word segmentation and part-of-speech tagging system and method that suits the characteristics of TCM literature, achieves high accuracy and recall, and tags words according to the characteristics of TCM professional terms would break through the main technical bottleneck currently restricting TCM literature mining and knowledge discovery, and is of great significance for the inheritance and innovation of TCM and for bringing its original advantages into play.

Disclosure of Invention

The invention aims to provide a method and a system for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine (TCM) books that improve the accuracy and recall of word segmentation for such books and perform part-of-speech tagging that conforms to the characteristics of TCM professional terms. It addresses the problems that existing Chinese word segmentation systems have low accuracy and recall on TCM documents and cannot perform TCM-specific part-of-speech tagging. In segmentation and part-of-speech tagging of the text of Shang Han Lun (Treatise on Cold Damage), the system was found to achieve higher accuracy and recall than general Chinese word segmentation systems and to come very close to the level of professionals.

A first aspect of the invention provides a method for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine books;

the method for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine books comprises the following steps:

step (1): constructing a Chinese medicine word segmentation dictionary;

step (2): performing word segmentation processing and part-of-speech tagging on a text to be segmented by adopting a traditional Chinese medicine word segmentation dictionary;

step (3): judging whether all of the text to be segmented has been segmented successfully; directly outputting the word segmentation results of the successfully segmented text;

step (4): re-segmenting the text whose segmentation failed using the Ansj dictionary, and obtaining the final word segmentation result.

Further, the step (1) of constructing the Chinese medicine word segmentation dictionary comprises the following steps:

step (101): constructing a lexicon of traditional Chinese medicine professional terms;

step (102): performing part-of-speech classification and labeling of the words in the lexicon of traditional Chinese medicine professional terms;

step (103): constructing the traditional Chinese medicine word segmentation dictionary using a three-column dictionary construction method.

Further, constructing the lexicon of traditional Chinese medicine professional terms in step (101) comprises the following steps:

extracting traditional Chinese medicine professional terms from ancient traditional Chinese medicine books and traditional Chinese medicine dictionaries;

the traditional Chinese medicine professional terms include: names of Chinese medicinals, prescription names, titles of ancient traditional Chinese medicine books, physician names, names of traditional Chinese medicine diseases and symptoms, names of traditional Chinese medicine effects, acupuncture point names, dosage names of Chinese medicinals, ancient Chinese vocabulary, and professional vocabulary from modern medicine.

Further, performing part-of-speech classification of the words in the lexicon of traditional Chinese medicine professional terms in step (102) comprises the following steps:

according to the disease part, syndrome part, or therapeutic part of the national standard of the People's Republic of China on clinical diagnosis and treatment terminology of traditional Chinese medicine, and in combination with the characteristics of traditional Chinese medicine noun terms, traditional Chinese medicine nouns are divided into a number of parts of speech and a part-of-speech table with 14 categories is constructed. The 14 categories are: 1. basic theory of traditional Chinese medicine, 2. traditional Chinese medicine diagnostic methods, 3. Chinese medicinal nouns, 4. prescription nouns, 5. cold damage (Shanghan) and warm disease, 6. traditional Chinese medicine treatment principles, 7. traditional Chinese medicine treatment methods, 8. traditional Chinese medicine and related disciplines, 9. ancient traditional Chinese medicine books, 10. traditional Chinese medicine institutions, equipment, or medical and health personnel, 11. personal names, 12. geographical names, 13. season and time words, 14. other words; each category is divided into several levels of subclasses, and the traditional Chinese medicine nouns in the lexicon are classified and labeled according to part-of-speech level, in order from low to high.

Each category of words is divided into several levels of subclasses. For example, TCM diagnostic methods include the four examinations; the four examinations include inspection, listening and smelling, inquiry, and palpation; inspection includes tongue diagnosis; tongue diagnosis includes tongue manifestation; tongue manifestation includes tongue coating and tongue body; and tongue coating includes coating color and coating texture. There are at most 7 levels of subclasses.

Further, the traditional Chinese medicine word segmentation dictionary constructed in step (103) using the three-column dictionary construction method has three columns:

column 1 is the traditional Chinese medicine professional term, such as qiong equisetum, cinnabar tranquilizing pills, and the like;

column 2 is the part-of-speech classification letters; for example, cinnabar tranquilizing pills belongs to the prescription category, and its part-of-speech classification letters are FCzzasj;

column 3 is the part-of-speech level identifier; for example, within the prescription category the tranquilizing formulas belong to level 4 of the classification, so the entry is labeled 4.
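The three-column layout can be illustrated with a short sketch. The source does not state the file format, so the tab separator, the `DictEntry` class, and the `load_dictionary` function are assumptions made for illustration; only the example entry (cinnabar tranquilizing pills, FCzzasj, level 4) comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictEntry:
    term: str   # column 1: professional term
    pos: str    # column 2: part-of-speech classification letters
    level: int  # column 3: level of the part of speech in the hierarchy


def load_dictionary(lines):
    """Parse three-column lines (assumed tab-separated) into term -> DictEntry."""
    entries = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        term, pos, level = line.split("\t")
        entries[term] = DictEntry(term, pos, int(level))
    return entries


# The single entry below is the example given in the text.
sample_lines = ["cinnabar tranquilizing pills\tFCzzasj\t4"]
dictionary = load_dictionary(sample_lines)
print(dictionary["cinnabar tranquilizing pills"])
```

Keeping the level in its own column lets a consumer filter or group entries by depth in the classification without parsing the letter code.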

Further, the step (2) comprises the following steps:

step (201): extracting keywords from the text to be segmented using a bag-of-words model;

step (202): training a conditional random field (CRF) model with the existing words in the traditional Chinese medicine word segmentation dictionary, discovering new words with the CRF model, and adding the new words to the traditional Chinese medicine word segmentation dictionary;

step (203): constructing a double-array trie from all existing words in the word segmentation dictionary;

step (204): performing single-string pattern matching between the keywords extracted from the text to be segmented and the double-array trie, and segmenting the currently extracted keywords with the double-array trie to obtain the word segmentation result;

step (205): training a hidden Markov model: taking the existing words in the word segmentation dictionary as the observation state sequence and the part of speech of each word as the hidden state sequence, and training to obtain a trained hidden Markov model;

step (206): performing part-of-speech tagging with the trained hidden Markov model: inputting the word sequence of the segmentation result obtained in step (204) into the trained hidden Markov model as the observation state sequence, and generating the hidden state sequence for the current observation state sequence with the Viterbi algorithm; the hidden states obtained are the parts of speech of the words in the text to be segmented, which completes the part-of-speech tagging.
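Step (206) can be illustrated with a minimal first-order Viterbi decoder. This is a sketch under stated assumptions, not the patented implementation: the function name `viterbi`, the probability floor for unseen events, and every probability value in the toy model are hypothetical. Only the tag names FCbm (disease name) and FCzz (symptom name) and the example terms are taken from the description.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-9):
    """Return the most probable hidden-state (tag) sequence for the observed words."""
    if not observations:
        return []

    def lp(p):
        # Log-probability with a small floor so unseen events are not log(0).
        return math.log(max(p, floor))

    # best[t][s] = (log-probability of the best path ending in s at time t, previous state)
    best = [{s: (lp(start_p.get(s, 0)) + lp(emit_p[s].get(observations[0], 0)), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans_p[p].get(s, 0))) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + lp(emit_p[s].get(obs, 0)), prev)
        best.append(column)

    # Backtrack from the best final state to recover the full path.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model: all numbers are invented for illustration only.
states = ["FCbm", "FCzz"]
start_p = {"FCbm": 0.6, "FCzz": 0.4}
trans_p = {"FCbm": {"FCbm": 0.2, "FCzz": 0.8},
           "FCzz": {"FCbm": 0.3, "FCzz": 0.7}}
emit_p = {"FCbm": {"solar disease": 0.5, "apoplexy": 0.5},
          "FCzz": {"fever": 0.4, "sweating": 0.3, "slow pulse": 0.3}}

words = ["solar disease", "fever", "sweating"]
tags = viterbi(words, states, start_p, trans_p, emit_p)
print(list(zip(words, tags)))
```

Working in log space avoids numeric underflow on long word sequences, which matters when the tag set is as large as the one described here.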

Further, in step (3), whether all of the text to be segmented has been segmented successfully is judged by the following criterion:

if every word in the segmentation result carries part-of-speech tagging letters, the word segmentation is successful; otherwise, the word segmentation has failed.
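The success criterion and the second pass of steps (3) and (4) can be sketched as follows. The source does not say whether the fallback is applied to a whole text or only to the unmatched spans, so this sketch assumes whole-text re-segmentation. The functions `primary` and `fallback` are hypothetical stand-ins (a whitespace split with a tiny lookup table, and a tagger that labels everything "n"); they do not call the real Ansj library.

```python
def is_successful(tokens):
    """Criterion from the text: every segmented word carries part-of-speech letters."""
    return bool(tokens) and all(tag for _, tag in tokens)


def segment_with_fallback(texts, primary, fallback):
    """Run the primary segmenter; re-segment any text that fails the criterion."""
    results = []
    for text in texts:
        tokens = primary(text)
        if not is_successful(tokens):
            tokens = fallback(text)  # second pass with the general dictionary
        results.append(tokens)
    return results


# Hypothetical stand-ins for the two dictionaries.
TCM_TAGS = {"fever": "FCzz", "sweating": "FCzz"}


def primary(text):
    return [(word, TCM_TAGS.get(word)) for word in text.split()]


def fallback(text):
    return [(word, "n") for word in text.split()]


results = segment_with_fallback(["fever sweating", "fever today"], primary, fallback)
print(results)
```

Passing the two segmenters in as functions keeps the success check independent of which dictionary backs each pass.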

A second aspect of the invention provides a system for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine books;

the system for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine books comprises: a memory, a processor, and computer instructions stored in the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of any of the above methods.

In a third aspect of the invention, a computer-readable storage medium is provided;

a computer readable storage medium having computer instructions embodied thereon, which, when executed by a processor, perform the steps of any of the above methods.

Compared with the prior art, the invention has the beneficial effects that:

the recall rate and the accuracy rate of Chinese medical ancient book document segmentation are far higher than those of the prior art.

The invention realizes the professional part-of-speech tagging of the traditional Chinese medicine for the first time and provides a foundation for the literature mining and knowledge discovery of the traditional Chinese medicine.

The invention ensures the integrity and accuracy of the word segmentation result by performing the word segmentation processing twice.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

As shown in fig. 1, the method for indexing Chinese traditional medicine ancient book documents by word segmentation and part of speech includes:

step (1): constructing a Chinese medicine word segmentation dictionary;

1. Construction of Chinese medicine word segmentation dictionary

1.1 Construction of the lexicon of TCM professional terms

One of the main reasons why current general-purpose Chinese word segmentation software segments TCM text with poor accuracy is its poor ability to recognize terms such as TCM syndromes, meridians and collaterals, and acupuncture points, so the system first constructs a comprehensive lexicon of TCM terms. Using web crawlers, artificial neural networks, and manual correction, extraction, and standardization, a specialized lexicon of TCM professional terms, such as names of Chinese medicinals and prescription names, was extracted and constructed from ancient TCM books and various TCM dictionaries. It contains 155,343 TCM-related words and is currently the TCM professional term lexicon with the largest number of entries.

TABLE 1 Composition of the TCM word segmentation lexicon

1.2 Part-of-speech tagging method specific to traditional Chinese medicine

Part-of-speech tagging (POS tagging) is the procedure of assigning the correct part of speech to each word in a word segmentation result; in general use, it determines whether each word is a noun, a verb, an adjective, or another part of speech. Such general part-of-speech tagging is of little significance for text mining and analysis of TCM documents. For this reason, combining the professional characteristics of TCM, the system divides TCM nouns into 818 parts of speech in 14 categories according to the classification of the TCM theoretical system: basic theory of TCM, TCM diagnosis and treatment, nouns related to Chinese medicinals and prescriptions, cold damage and warm disease, treatment principles, treatment methods, TCM-related disciplines, ancient TCM books, TCM institutions, TCM equipment, names of medical and health personnel, geographical names, and others.

A first-order hidden Markov model is adopted. In the hidden Markov model, the hidden states are the 818 parts of speech and the explicit states are the 818 letter abbreviations; the prefix FC is added to distinguish these tags from general part-of-speech tags.

In addition, labels are assigned according to part-of-speech level, in priority order from low to high wherever possible.

TABLE 2 Chinese medicine professional parts-of-speech composition table (part)

1.3 Construction and expansion of the TCM word segmentation dictionary

The word segmentation dictionary is the core of the system and strongly influences the accuracy and speed of word segmentation. Based on the lexicon of TCM professional terms and the part-of-speech tagging method described above, the system adopts a 3-column dictionary construction method: column 1 is the TCM professional term, column 2 is the part-of-speech tagging letters, and column 3 is the level identifier.

1.4 Trie (dictionary tree) construction process

(1) Establish the root node root and set base[root] = 1

(2) Finding out a root child node set (i ═ 1.. n), so that check [ root

(3) For each element in root, children:

1) find { element

2) Set base [ element

3) Children i: if a traversed element has no children, i.e. it is a leaf node, then base[element] is set to a negative value
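The construction steps above concern a double-array trie, which stores the tree in base and check arrays. As a simpler illustration of how dictionary words held in a trie drive segmentation, the sketch below uses an ordinary nested-dictionary trie with forward longest matching; it is a stand-in, not the double-array layout, and the names `END`, `build_trie`, and `longest_match_segment` are hypothetical. The Chinese strings are assumed originals of the terms the description renders as solar disease, fever, and sweating.

```python
END = "$end"  # hypothetical marker key holding the tag of a complete word


def build_trie(entries):
    """Build a nested-dict trie from a word -> tag mapping."""
    root = {}
    for word, tag in entries.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = tag
    return root


def longest_match_segment(text, root):
    """Greedy forward segmentation: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        node, j, match = root, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                match = (j, node[END])  # remember the longest word seen so far
        if match:
            end, tag = match
            tokens.append((text[i:end], tag))
            i = end
        else:
            tokens.append((text[i], None))  # unmatched character, no tag
            i += 1
    return tokens


trie = build_trie({"太阳病": "FCbm", "发热": "FCzz", "汗出": "FCzz"})
tokens = longest_match_segment("太阳病发热汗出", trie)
print(tokens)
```

Characters that match no dictionary word come back with a tag of `None`, which is the situation the second segmentation pass with the Ansj dictionary is meant to handle.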

2. Chinese medicine document word segmentation algorithm and part of speech tagging

The core algorithm of the word segmentation system is the open-source code of Ansj, a Java Chinese word segmentation tool based on the ICTCLAS Chinese word segmentation algorithm of the Chinese Academy of Sciences, which achieves higher word segmentation accuracy than other common open-source word segmentation tools (such as mmseg4j).

On this basis, the self-built TCM professional dictionary replaces the default dictionary, the Ansj dictionary serves as a supplement, and part-of-speech tagging is performed with an HMM.

3. Construction and use of Chinese medicine document word segmentation and part of speech tagging service system

The ancient TCM book word segmentation system is developed in Java and comprises a word segmentation architecture and a user interface. The user interface is presented as a web page through which users register and log in; users who are not logged in can only browse the website and cannot use the word segmentation function. A logged-in user can submit the text to be segmented either by copying and pasting it or by uploading a txt file, and can retrieve the segmentation result in two ways: by copying it or by downloading it as a txt file.

4. Effects of the implementation

4.1 improving word segmentation accuracy and recall rate

Word segmentation tests were carried out on the full text of the ancient book Shang Han Lun (Treatise on Cold Damage), with the original Ansj program as the comparison. The results show that the recall and accuracy of the ancient TCM book word segmentation system are far higher than those of the original Ansj program with its built-in lexicon. TCM-specific terms in the test text, such as "solar disease" (taiyang disease), sweating, aversion to wind, and slow pulse, cannot be recognized or correctly segmented by the original Ansj program with its built-in lexicon, whereas the ancient TCM book word segmentation system recognizes and segments them accurately.

TABLE 3 word segmentation effect comparison
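The source does not state how its accuracy and recall figures were computed. A common convention for scoring word segmentation, sketched below, compares the character spans of predicted words with those of a reference segmentation; it assumes that "accuracy" in this document corresponds to precision, since it is reported alongside recall and F1. The function names and the example segmentations are hypothetical.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f1(gold_words, predicted_words):
    """Score a predicted segmentation against a reference segmentation."""
    gold, pred = to_spans(gold_words), to_spans(predicted_words)
    correct = len(gold & pred)  # words whose boundaries match exactly
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative data only: a reference segmentation and an over-split prediction.
gold = ["太阳病", "发热", "汗出"]
predicted = ["太阳", "病", "发热", "汗出"]
p, r, f1 = precision_recall_f1(gold, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))
```

In this example the prediction splits one reference word in two, so two of its four words are correct (precision 0.5) and two of the three reference words are recovered (recall about 0.667).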

4.2 Realizing TCM-specific part-of-speech tagging

On the basis of accurate word segmentation, accurate TCM-specific part-of-speech tagging is achieved. As shown in Table 3, "solar disease" (taiyang disease) and "apoplexy" (zhongfeng, wind strike) are correctly tagged FCbm, meaning "TCM disease name"; "fever", "sweating", and "slow pulse" are tagged FCzz, meaning "TCM symptom name". This is of great significance for statistical analysis and knowledge discovery in later text mining.

4.3 The system is simple to operate and highly portable

The ancient TCM book word segmentation system is developed in Java; its code is highly readable and easy to extend and modify. The system includes user login, registration, and permission control; users who are not logged in can only browse the website and cannot use the word segmentation function. The interface is friendly and easy to use, with helpful prompts.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A method for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine books, characterized by comprising the following steps:

step (1): constructing a Chinese medicine word segmentation dictionary;

the step (1) of constructing the Chinese medicine word segmentation dictionary comprises the following steps:

step (103): constructing a Chinese medicine word segmentation dictionary by adopting a three-column dictionary construction method;

the step (2) comprises the following steps:

step (206): performing part-of-speech tagging with a trained hidden Markov model: inputting the word sequence of the segmentation result obtained in step (204) into the trained hidden Markov model as the observation state sequence, and generating the hidden state sequence for the current observation state sequence with the Viterbi algorithm; the hidden states obtained are the parts of speech of the words in the text to be segmented, which completes the part-of-speech tagging;

2. The method for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine books according to claim 1, wherein constructing the lexicon of traditional Chinese medicine professional terms in step (101) comprises the steps of:

extracting traditional Chinese medicine professional terms from ancient traditional Chinese medicine books and traditional Chinese medicine dictionaries.

3. The method according to claim 2, wherein the traditional Chinese medicine professional terms include: names of Chinese medicinals, prescription names, titles of ancient traditional Chinese medicine books, physician names, names of traditional Chinese medicine diseases and symptoms, names of traditional Chinese medicine effects, acupuncture point names, dosage names of Chinese medicinals, ancient Chinese vocabulary, and professional vocabulary from modern medicine.

4. The method for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine books according to claim 1, wherein performing part-of-speech classification of the words in the lexicon of traditional Chinese medicine professional terms in step (102) comprises the steps of:

according to the disease part, syndrome part, or therapeutic part of the national standard of the People's Republic of China on clinical diagnosis and treatment terminology of traditional Chinese medicine, and in combination with the characteristics of traditional Chinese medicine noun terms, traditional Chinese medicine nouns are divided into a number of parts of speech and a part-of-speech table with 14 categories is constructed, the 14 categories being: 1. basic theory of traditional Chinese medicine, 2. traditional Chinese medicine diagnostic methods, 3. Chinese medicinal nouns, 4. prescription nouns, 5. cold damage (Shanghan) and warm disease, 6. traditional Chinese medicine treatment principles, 7. traditional Chinese medicine treatment methods, 8. traditional Chinese medicine and related disciplines, 9. ancient traditional Chinese medicine books, 10. traditional Chinese medicine institutions, equipment, or medical and health personnel, 11. personal names, 12. geographical names, 13. season and time words, 14. other words;

each category is divided into several levels of subclasses, and the traditional Chinese medicine nouns in the lexicon are classified and labeled according to part-of-speech level, in order from low to high.

5. The method for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine books according to claim 1, wherein

the traditional Chinese medicine word segmentation dictionary constructed in step (103) using the three-column dictionary construction method has three columns: column 1 is the traditional Chinese medicine professional term; column 2 is the part-of-speech classification letters; column 3 is the part-of-speech level identifier.

6. The method for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine books according to claim 1, wherein

in step (3), whether all of the text to be segmented has been segmented successfully is judged by the following criterion:

7. A system for word segmentation and part-of-speech indexing of ancient traditional Chinese medicine books, characterized by comprising: a memory, a processor, and computer instructions stored in the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of any of claims 1-6.

8. A computer readable storage medium having computer instructions embodied thereon, which when executed by a processor, perform the steps of any of claims 1-6.