CN108509419B - Chinese medicine ancient book document word segmentation and part of speech indexing method and system


Info

Publication number
CN108509419B
Authority
CN
China
Prior art keywords
chinese medicine
word segmentation
speech
word
traditional chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810233868.4A
Other languages
Chinese (zh)
Other versions
CN108509419A (en)
Inventor
付先军
李学博
王振国
陈晓康
桑晓明
鞠芳凝
周扬
陈聪
邵欣欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Traditional Chinese Medicine
Original Assignee
Shandong University of Traditional Chinese Medicine
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Traditional Chinese Medicine
Priority to CN201810233868.4A
Publication of CN108509419A
Application granted
Publication of CN108509419B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/40 ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Primary Health Care (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Biomedical Technology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Toxicology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a system for word segmentation and part-of-speech indexing of ancient Chinese medicine documents. The method comprises the following steps: step (1): constructing a Chinese medicine word segmentation dictionary; step (2): performing word segmentation and part-of-speech tagging on the text to be segmented using the Chinese medicine word segmentation dictionary; step (3): judging whether all of the text to be segmented has been segmented successfully, and directly outputting the results for text that was segmented successfully; step (4): re-segmenting the text whose segmentation failed using the Ansj dictionary, thereby obtaining the final word segmentation result.

Description

Chinese medicine ancient book document word segmentation and part of speech indexing method and system
Technical Field
The invention relates to a method and a system for word segmentation and part-of-speech indexing of ancient Chinese medicine documents.
Background
Literature is important to human civilization and social progress, and is the basis of all scientific research. Traditional Chinese medicine (TCM) literature is an important component of ancient Chinese literature and an important basis for studying the clinical medication experience of ancient physicians. It integrates TCM knowledge of theory, method, formula and medicine, and records the academic thought and clinical medication experience accumulated over thousands of years of TCM development. Mining this precious cultural heritage is an important premise and basis for the inheritance and innovation of TCM scholarship. Modern interpretation of TCM theory and modern research on TCM symptoms, treatment methods and prescriptions cannot be separated from the classical texts; for example, the discovery of artemisinin was inspired by classical TCM documents such as the Zhouhou Beiji Fang (Handbook of Prescriptions for Emergencies).
The sorting and analysis of TCM documents are based on word segmentation and part-of-speech tagging. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain specifications. At present, most research at home and abroad on Chinese word segmentation theory, methods and technologies is still at the theoretical or experimental stage and is oriented toward natural language processing and information retrieval; little mature, usable Chinese word segmentation software exists. No software or method specifically aimed at TCM word segmentation and part-of-speech tagging has been reported. Because of the particularity of TCM professional terms, general Chinese word segmentation software achieves low accuracy and recall on TCM documents: the highest reported accuracy for segmenting ancient TCM documents is 0.735, with a recall of only 0.663, and the accuracy, recall and F1 score of other Chinese word segmentation systems are even below 0.5 (for example, PHP Analysis reaches an accuracy of only 0.312 and a recall of only 0.369). Nor can such software perform specialized part-of-speech tagging that reflects the professional characteristics of TCM. This greatly restricts the utilization and mining of TCM literature. In addition, most software requires environment configuration, has specific system requirements, is poorly portable and is difficult to operate.
Therefore, a word segmentation and part-of-speech tagging system and method that suits the characteristics of TCM literature, achieves high accuracy and recall, and performs part-of-speech tagging consistent with the characteristics of TCM professional terms would break through the main technical bottleneck currently restricting TCM literature mining and knowledge discovery, and is of great significance for the inheritance and innovation of TCM and for exploiting its original advantages.
Disclosure of Invention
The invention aims to provide a method and a system for word segmentation and part-of-speech indexing of ancient Chinese medicine documents that improve the accuracy and recall of word segmentation on such documents and perform part-of-speech tagging consistent with the characteristics of Chinese medicine professional terms. This solves the problems that existing Chinese word segmentation systems have low accuracy and recall on Chinese medicine documents and cannot perform professional Chinese medicine part-of-speech tagging. Word segmentation and part-of-speech tagging of the text of the Treatise on Cold Damage (Shanghan Lun) showed that the system achieves higher accuracy and recall than general Chinese word segmentation systems and comes very close to the level of human professionals.
In a first aspect, the invention provides a method for word segmentation and part-of-speech indexing of ancient Chinese medicine documents;
the Chinese medicine ancient book document word segmentation and part of speech indexing method comprises the following steps:
step (1): constructing a Chinese medicine word segmentation dictionary;
step (2): performing word segmentation and part-of-speech tagging on the text to be segmented using the Chinese medicine word segmentation dictionary;
step (3): judging whether all of the text to be segmented has been segmented successfully, and directly outputting the results for text that was segmented successfully;
step (4): re-segmenting the text whose segmentation failed using the Ansj dictionary, thereby obtaining the final word segmentation result.
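The control flow of steps (2) to (4) can be sketched as follows. This is only a minimal illustration under stated assumptions: `segment_with_fallback`, `toy_primary`, `toy_fallback` and the tag `"n"` are hypothetical names introduced here, the two toy segmenters merely stand in for the Chinese medicine dictionary segmenter and for Ansj, and each input text is treated as a single candidate term to keep the example short.

```python
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (word, part-of-speech tag or None)
Segmenter = Callable[[str], List[Token]]


def is_fully_tagged(tokens: List[Token]) -> bool:
    """Step (3) criterion: success only if every token carries a tag."""
    return bool(tokens) and all(tag is not None for _, tag in tokens)


def segment_with_fallback(
    texts: List[str], primary: Segmenter, fallback: Segmenter
) -> List[List[Token]]:
    """Run the primary segmenter; re-segment failed texts with the fallback."""
    results = []
    for text in texts:
        tokens = primary(text)            # step (2): domain dictionary pass
        if not is_fully_tagged(tokens):   # step (3): success check
            tokens = fallback(text)       # step (4): general-purpose pass
        results.append(tokens)
    return results


# Hypothetical stand-ins: a tiny domain lexicon and a trivial fallback.
TOY_TCM_TAGS = {"fever": "FCzz", "slow pulse": "FCzz"}


def toy_primary(text: str) -> List[Token]:
    # Treats the whole text as one candidate term (illustration only).
    return [(text, TOY_TCM_TAGS.get(text))]


def toy_fallback(text: str) -> List[Token]:
    # Placeholder for a general segmenter; "n" is an assumed generic tag.
    return [(text, "n")]


if __name__ == "__main__":
    print(segment_with_fallback(["fever", "today"], toy_primary, toy_fallback))
```

Passing the segmenters in as callables keeps the success check independent of how either pass is implemented, which matches the way the text describes the second pass only as a supplement to the first.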
Further, step (1) of constructing the Chinese medicine word segmentation dictionary comprises the following steps:
step (101): constructing a lexicon of traditional Chinese medicine professional terms;
step (102): classifying the words in the lexicon by part of speech and tagging them;
step (103): constructing the Chinese medicine word segmentation dictionary using a three-column dictionary construction method.
Further, step (101) of constructing the lexicon of traditional Chinese medicine professional terms comprises:
extracting traditional Chinese medicine professional terms from ancient Chinese medicine books and Chinese medicine dictionaries;
the traditional Chinese medicine professional terms comprise: names of Chinese medicinals, prescription names, titles of ancient Chinese medicine books, physician names, names of Chinese medicine diseases and symptoms, names of Chinese medicine efficacies, acupuncture point names, Chinese medicine dosage names, ancient Chinese vocabulary, and professional vocabulary from modern medicine.
Further, step (102) of classifying the words in the lexicon of traditional Chinese medicine professional terms by part of speech comprises:
according to the disease part, syndrome part or treatment-method part of the national standard of the People's Republic of China on clinical diagnosis and treatment terminology of traditional Chinese medicine, and in combination with the characteristics of Chinese medicine noun terms, Chinese medicine nouns are divided into a number of parts of speech and a 14-class part-of-speech table is constructed. The 14 classes are: 1. basic theory of traditional Chinese medicine; 2. traditional Chinese medicine diagnostic methods; 3. traditional Chinese medicine nouns; 4. prescription nouns; 5. cold damage and warm diseases; 6. treatment principles of traditional Chinese medicine; 7. treatment methods of traditional Chinese medicine; 8. traditional Chinese medicine and related disciplines; 9. ancient books of traditional Chinese medicine; 10. traditional Chinese medicine institutions, equipment or medical and health personnel; 11. personal names; 12. geographical names; 13. season and time words; 14. other words. Each class is divided into several levels of subclasses, and the Chinese medicine nouns in the lexicon are classified and tagged by part-of-speech level, in order from low to high.
Each class of words is divided into several levels of subclasses. For example, traditional Chinese medicine diagnostics includes the four diagnostic methods; the four diagnostic methods comprise inspection, auscultation, inquiry and palpation; inspection includes tongue diagnosis; tongue diagnosis includes tongue manifestation, which comprises tongue coating and tongue proper; and tongue coating comprises coating color and coating texture. There are at most 7 levels of subclasses.
Further, step (103) uses a three-column dictionary construction method to construct the Chinese medicine word segmentation dictionary, whose three columns are:
column 1 is the traditional Chinese medicine professional term, such as qiong equisetum or cinnabar tranquilizing pills;
column 2 is the part-of-speech classification letters; for example, cinnabar tranquilizing pills belongs to the prescription class, and its part-of-speech classification letters are FCzzasj;
column 3 is the part-of-speech level identifier; cinnabar tranquilizing pills belongs to the 4th level of the prescription class and is therefore labeled 4.
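As a rough illustration of the three-column layout, the sketch below loads such a dictionary into memory. The tab-separated file format, the function name `load_three_column_dictionary`, the `DictEntry` type and the second sample row (in particular its level value) are assumptions made for this example; the text only specifies what the three columns contain.

```python
from typing import Dict, Iterable, NamedTuple


class DictEntry(NamedTuple):
    term: str   # column 1: professional term
    tag: str    # column 2: part-of-speech classification letters
    level: int  # column 3: part-of-speech level identifier


def load_three_column_dictionary(lines: Iterable[str]) -> Dict[str, DictEntry]:
    """Parse tab-separated 'term<TAB>tag<TAB>level' rows (assumed format)."""
    entries: Dict[str, DictEntry] = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank rows
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"row {number}: expected 3 columns, got {len(parts)}")
        term, tag, level = parts
        entries[term] = DictEntry(term, tag, int(level))
    return entries


SAMPLE_LINES = [
    # Row taken from the example in the text.
    "cinnabar tranquilizing pills\tFCzzasj\t4",
    # Hypothetical row: the tag FCzz appears in the text, the level is invented.
    "slow pulse\tFCzz\t2",
]

if __name__ == "__main__":
    dictionary = load_three_column_dictionary(SAMPLE_LINES)
    print(dictionary["cinnabar tranquilizing pills"])
```

Keying the entries by term gives constant-time lookup of both the tag and the level once a word has been matched.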
Further, step (2) comprises the following steps:
step (201): extracting keywords from the text to be segmented using a bag-of-words model;
step (202): training a conditional random field (CRF) model on the existing words in the Chinese medicine word segmentation dictionary, discovering new words with the CRF model, and adding the new words to the Chinese medicine word segmentation dictionary;
step (203): constructing a double-array trie from all existing words in the word segmentation dictionary;
step (204): performing single-pattern string matching between the keywords extracted from the text to be segmented and the double-array trie, and segmenting the currently extracted keywords with the double-array trie to obtain a word segmentation result;
step (205): training a hidden Markov model: taking each existing word in the word segmentation dictionary as the observation state sequence and the part of speech of each word as the hidden state sequence, and training on them to obtain a trained hidden Markov model;
step (206): performing part-of-speech tagging with the trained hidden Markov model: inputting the word sequence of the word segmentation result obtained in step (204) into the trained hidden Markov model as the observation state sequence, and generating the hidden state sequence of the current observation state sequence with the Viterbi algorithm; the resulting hidden states are the parts of speech of the text to be segmented, which completes the part-of-speech tagging.
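The decoding in step (206) can be illustrated with a small Viterbi implementation. This is a sketch under stated assumptions: the function signature, the probability floor for unseen events and all probability values are invented for illustration; only the tags FCbm and FCzz and the example terms come from the text.

```python
import math
from typing import Dict, List


def viterbi(
    observations: List[str],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
    floor: float = 1e-12,
) -> List[str]:
    """Return the most likely hidden tag sequence for the observed words."""
    if not observations:
        return []

    def lg(p: float) -> float:
        # Log-probability with a small floor so unseen events stay finite.
        return math.log(max(p, floor))

    def emit(state: str, word: str) -> float:
        return lg(emit_p.get(state, {}).get(word, 0.0))

    def trans(prev: str, state: str) -> float:
        return lg(trans_p.get(prev, {}).get(state, 0.0))

    # best[t][s]: best log-probability of any path ending in state s at time t
    best = [{s: lg(start_p.get(s, 0.0)) + emit(s, observations[0]) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + trans(p, s))
            best[t][s] = best[t - 1][prev] + trans(prev, s) + emit(s, observations[t])
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy parameters (all numbers are invented for illustration).
STATES = ["FCbm", "FCzz"]
START = {"FCbm": 0.6, "FCzz": 0.4}
TRANS = {"FCbm": {"FCbm": 0.2, "FCzz": 0.8}, "FCzz": {"FCbm": 0.3, "FCzz": 0.7}}
EMIT = {
    "FCbm": {"solar disease": 0.9, "fever": 0.1},
    "FCzz": {"solar disease": 0.05, "fever": 0.5, "sweating": 0.45},
}

if __name__ == "__main__":
    words = ["solar disease", "fever", "sweating"]
    print(list(zip(words, viterbi(words, STATES, START, TRANS, EMIT))))
```

Working in log space avoids numerical underflow, which matters once the state set grows to hundreds of tags.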
Further, step (3) judges whether all of the text to be segmented has been segmented successfully according to the following criterion:
if every word in the segmentation result carries part-of-speech tagging letters, the word segmentation is successful; otherwise, the word segmentation has failed.
In a second aspect, the invention provides a Chinese medicine ancient book document word segmentation and part of speech indexing system;
the system comprises: a memory, a processor, and computer instructions stored on the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of any of the above methods.
In a third aspect, the invention provides a computer-readable storage medium;
a computer-readable storage medium having computer instructions stored thereon which, when executed by a processor, perform the steps of any of the above methods.
Compared with the prior art, the invention has the beneficial effects that:
the recall rate and the accuracy rate of Chinese medical ancient book document segmentation are far higher than those of the prior art.
The invention realizes the professional part-of-speech tagging of the traditional Chinese medicine for the first time and provides a foundation for the literature mining and knowledge discovery of the traditional Chinese medicine.
The invention ensures the integrity and accuracy of the word segmentation result by performing the word segmentation processing twice.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
As shown in FIG. 1, the method for word segmentation and part-of-speech indexing of ancient Chinese medicine documents includes:
step (1): constructing a Chinese medicine word segmentation dictionary;
step (2): performing word segmentation and part-of-speech tagging on the text to be segmented using the Chinese medicine word segmentation dictionary;
step (3): judging whether all of the text to be segmented has been segmented successfully, and directly outputting the results for text that was segmented successfully;
step (4): re-segmenting the text whose segmentation failed using the Ansj dictionary, thereby obtaining the final word segmentation result.
Further, step (1) of constructing the Chinese medicine word segmentation dictionary comprises the following steps:
step (101): constructing a lexicon of traditional Chinese medicine professional terms;
step (102): classifying the words in the lexicon by part of speech and tagging them;
step (103): constructing the Chinese medicine word segmentation dictionary using a three-column dictionary construction method.
1. Construction of Chinese medicine word segmentation dictionary
1.1 construction of term lexicon of professional Chinese medicine
One of the main reasons for the poor accuracy of current general Chinese word segmentation software on Chinese medicine text is its poor ability to recognize terms such as Chinese medicine syndromes, meridians and collaterals, and acupuncture points, so the system first constructs a comprehensive lexicon of Chinese medicine terms. A dedicated lexicon of traditional Chinese medicine professional terms, such as names of Chinese medicinals and prescription names, is extracted from ancient Chinese medicine documents and various Chinese medicine dictionaries using a web crawler, an artificial neural network, and manual correction, extraction and standardization. It covers 155,343 words related to traditional Chinese medicine and is currently the traditional Chinese medicine professional term lexicon with the largest vocabulary.
TABLE 1 Composition of the Chinese medicine word segmentation lexicon (table image omitted)
1.2 part-of-speech tagging method special for traditional Chinese medicine
Part-of-speech tagging (POS tagging) refers to the procedure of labeling each word in a segmentation result with its correct part of speech; in general, it determines whether each word is a noun, a verb, an adjective or some other part of speech. Such general part-of-speech tagging is of little significance for text mining and analysis of traditional Chinese medicine documents. Therefore, combining the professional characteristics of traditional Chinese medicine and following the classification of the Chinese medicine theoretical system, Chinese medicine nouns are divided into 818 parts of speech in 14 classes: basic theory of traditional Chinese medicine; diagnosis and treatment; nouns related to Chinese medicinals and prescriptions; cold damage and warm diseases; treatment principles; treatment methods; subjects related to traditional Chinese medicine; ancient books of traditional Chinese medicine; traditional Chinese medicine institutions; traditional Chinese medicine equipment; names of medical and health personnel; geographical names; and others.
A first-order hidden Markov model is adopted. In this model the hidden states are the 818 parts of speech, which are written explicitly as 818 letter abbreviations, each prefixed with FC to distinguish it from general part-of-speech tags.
Meanwhile, tagging follows the part-of-speech levels, in priority order from low to high wherever possible.
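One common way to estimate the parameters of a first-order hidden Markov model is relative-frequency counting over tagged sequences. The sketch below assumes that approach; the text does not say how training is carried out in detail, so the function `train_hmm` and the tiny tagged corpus are hypothetical. Only the tags FCbm and FCzz and the example terms are taken from the text.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # [(word, tag), ...]


def _normalize(counts: Counter) -> Dict[str, float]:
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def train_hmm(corpus: List[TaggedSentence]):
    """Estimate start, transition and emission probabilities by counting."""
    start: Counter = Counter()
    trans: Dict[str, Counter] = defaultdict(Counter)
    emit: Dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        previous = None
        for word, tag in sentence:
            emit[tag][word] += 1          # tag -> word (emission)
            if previous is None:
                start[tag] += 1           # first tag of the sequence
            else:
                trans[previous][tag] += 1  # tag -> next tag (first order)
            previous = tag
    return (
        _normalize(start),
        {tag: _normalize(c) for tag, c in trans.items()},
        {tag: _normalize(c) for tag, c in emit.items()},
    )


# Hypothetical tagged sequences built from terms mentioned in the text.
TOY_CORPUS = [
    [("solar disease", "FCbm"), ("fever", "FCzz"), ("sweating", "FCzz")],
    [("apoplexy", "FCbm"), ("slow pulse", "FCzz")],
]

if __name__ == "__main__":
    start_p, trans_p, emit_p = train_hmm(TOY_CORPUS)
    print(start_p, trans_p, emit_p, sep="\n")
```

With 818 tags many transitions will never be observed, so a real system would likely need smoothing; that is left out here to keep the counting logic visible.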
TABLE 2 Composition of Chinese medicine professional parts of speech (partial; table image omitted)
1.3 construction and expansion of Chinese medicine word segmentation dictionary
The word segmentation dictionary is the core of the system and strongly influences the accuracy and speed of word segmentation. Based on the lexicon of traditional Chinese medicine professional terms and the part-of-speech tagging method described above, the system builds the dictionary with a three-column construction method: column 1 is the traditional Chinese medicine professional noun term, column 2 is the part-of-speech tagging letters, and column 3 is the level mark.
1.4 Trie (dictionary tree) construction process
(1) Establish the root node root and set base[root] = 1
(2) Find the set of child nodes of root (i = 1..n), so that check [ root
(3) For each element in root.children:
1) find { element
2) Set base [ element
3) children i: if an element is traversed and has no children, i.e. it is a leaf node, then base[element] is set to a negative value
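The idea of dictionary matching over a trie can be shown with a much simpler structure than the double-array layout. The sketch below uses an ordinary pointer-based trie with Python dicts and forward longest matching; it is not the base/check array encoding described above, and the class and function names are hypothetical. The Chinese strings are assumed back-translations of terms that the text renders as "solar disease", "fever" and "sweating".

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    __slots__ = ("children", "tag")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.tag: Optional[str] = None  # set only where a dictionary word ends


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str, tag: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.tag = tag

    def longest_match(self, text: str, start: int) -> Tuple[int, Optional[str]]:
        """Return (end, tag) of the longest dictionary word starting at start."""
        node = self.root
        best_end, best_tag = start, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.tag is not None:
                best_end, best_tag = i + 1, node.tag
        return best_end, best_tag


def segment(text: str, trie: Trie) -> List[Tuple[str, Optional[str]]]:
    """Forward longest matching; unmatched characters become untagged tokens."""
    tokens = []
    i = 0
    while i < len(text):
        end, tag = trie.longest_match(text, i)
        if tag is None:
            end = i + 1  # no dictionary word here: emit a single character
        tokens.append((text[i:end], tag))
        i = end
    return tokens


if __name__ == "__main__":
    trie = Trie()
    # Assumed Chinese originals of terms mentioned in the text.
    for word, tag in [("太阳病", "FCbm"), ("发热", "FCzz"), ("汗出", "FCzz")]:
        trie.insert(word, tag)
    print(segment("太阳病，发热", trie))
```

A double-array trie stores the same transitions in two flat integer arrays, which trades construction complexity for compact memory and fast lookups; the matching logic on top of it stays the same as in `longest_match`.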
2. Chinese medicine document word segmentation algorithm and part of speech tagging
The core algorithm of the word segmentation system is the open-source code of Ansj, a Java Chinese word segmentation tool based on the ICTCLAS Chinese word segmentation algorithm of the Chinese Academy of Sciences, whose segmentation accuracy is higher than that of other common open-source word segmentation tools (such as mmseg4j).
On this basis, the self-built Chinese medicine professional dictionary replaces the default dictionary, the Ansj dictionary serves as a supplement, and part-of-speech tagging is performed with an HMM.
3. Construction and use of Chinese medicine document word segmentation and part of speech tagging service system
The ancient Chinese medicine document word segmentation system is developed in Java and comprises a word segmentation architecture and a user interface. The user interface is presented as a web page through which users register and log in; users who are not logged in can only browse the website and cannot use the word segmentation function. A logged-in user can submit the text to be segmented either by copying and pasting it or by uploading a txt file, and can obtain the segmentation result in two ways: by copying it or by downloading it as a txt file.
4. Effects of the implementation
4.1 improving word segmentation accuracy and recall rate
Word segmentation tests were carried out on the full text of the ancient book Treatise on Cold Damage (Shanghan Lun), with the original Ansj program as the comparison. The results show that the recall and accuracy of the ancient Chinese medicine document word segmentation system are far higher than those of the Ansj source program with its system lexicon. Chinese medicine terms in the test text, such as solar disease, sweating, aversion to wind and slow pulse, cannot be recognized or correctly segmented by the Ansj source program with its system lexicon, whereas the ancient Chinese medicine document word segmentation system recognizes and segments them accurately.
TABLE 3 Word segmentation effect comparison (table image omitted)
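Segmentation quality of the kind compared here is usually scored by matching word spans between a reference segmentation and the system output. The sketch below assumes that convention and assumes that "accuracy" in the text corresponds to precision; the function names and the example segmentations are hypothetical, and the Chinese strings are assumed back-translations of terms mentioned in the text.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into (start, end) character spans."""
    spans = set()
    position = 0
    for word in words:
        spans.add((position, position + len(word)))
        position += len(word)
    return spans


def segmentation_prf(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exactly matching word spans."""
    gold_spans, predicted_spans = to_spans(gold), to_spans(predicted)
    correct = len(gold_spans & predicted_spans)
    precision = correct / len(predicted_spans) if predicted_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["太阳病", "发热", "汗出"]             # reference segmentation
    predicted = ["太阳", "病", "发热", "汗出"]    # a general segmenter splitting a term
    print(segmentation_prf(gold, predicted))
```

Comparing spans rather than word strings means a word counts as correct only when both of its boundaries are right, so a term split into two pieces lowers precision and recall at the same time.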
4.2 realizes the part-of-speech tagging of the traditional Chinese medicine specialty
On the basis of accurate word segmentation, accurate specialized part-of-speech tagging is realized. As shown in Table 3, "solar disease" and "apoplexy" are correctly tagged FCbm, which denotes a traditional Chinese medicine disease name; "fever", "sweating" and "slow pulse" are tagged FCzz, which denotes a traditional Chinese medicine symptom name. This is of great significance for statistical analysis and knowledge discovery in later text mining.
4.3 the system is simple to operate and has strong portability
The Chinese traditional ancient book literature word segmentation system is developed by adopting Java language, has strong readability, and is easy to expand and modify. The system comprises user login, registration and user authority control, and users who do not log in can only access the website and cannot use the word segmentation function. The system has friendly interface, easy use and humanized prompt.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. The Chinese medicine ancient book document word segmentation and part of speech indexing method is characterized by comprising the following steps:
step (1): constructing a Chinese medicine word segmentation dictionary;
the step (1) of constructing the Chinese medicine word segmentation dictionary comprises the following steps:
a step (101): constructing a word bank of professional terms of traditional Chinese medicine;
a step (102): performing part-of-speech classification and marking on words in a Chinese medical professional term word bank;
step (103): constructing a Chinese medicine word segmentation dictionary by adopting a three-column dictionary construction method;
step (2): performing word segmentation processing and part-of-speech tagging on a text to be segmented by adopting a traditional Chinese medicine word segmentation dictionary;
the step (2) comprises the following steps:
step (201): extracting keywords from the text to be segmented using a bag-of-words model;
step (202): training a conditional random field (CRF) model on the existing words in the Chinese medicine word segmentation dictionary, discovering new words with the CRF model, and adding the new words to the Chinese medicine word segmentation dictionary;
step (203): constructing a double-array trie from all existing words in the word segmentation dictionary;
step (204): performing single-pattern string matching between the keywords extracted from the text to be segmented and the double-array trie, and segmenting the currently extracted keywords with the double-array trie to obtain a word segmentation result;
step (205): training a hidden Markov model: taking each existing word in the word segmentation dictionary as the observation state sequence and the part of speech of each word as the hidden state sequence, and training on them to obtain a trained hidden Markov model;
step (206): performing part-of-speech tagging with the trained hidden Markov model: inputting the word sequence of the word segmentation result obtained in step (204) into the trained hidden Markov model as the observation state sequence, and generating the hidden state sequence of the current observation state sequence with the Viterbi algorithm; the resulting hidden states are the parts of speech of the text to be segmented, which completes the part-of-speech tagging;
step (3): judging whether all of the text to be segmented has been segmented successfully, and directly outputting the results for text that was segmented successfully;
step (4): re-segmenting the text whose segmentation failed using the Ansj dictionary, thereby obtaining the final word segmentation result.
2. The method for indexing Chinese medicine ancient book literature participles and parts of speech according to claim 1, wherein the step (101) of constructing the thesaurus of Chinese medicine professional terms comprises the steps of:
the Chinese medicine technical terms are extracted from Chinese medicine ancient books documents and Chinese medicine dictionaries.
3. The method as claimed in claim 2, wherein the term of the ancient Chinese medical book includes: the Chinese medicine name, the prescription name, the ancient Chinese medical book name, the doctor name, the symptom name of the Chinese medical illness, the efficacy name of the Chinese medical, the acupuncture point name, the dosage name of the Chinese medicine, the ancient Chinese vocabulary and the professional vocabulary in the modern medicine.
4. The method for word segmentation and part-of-speech indexing of ancient Chinese medicine literature according to claim 1, wherein said step (102) of classifying the words in the lexicon of Chinese medicine professional terms by part of speech comprises:
dividing Chinese medicine terms into a plurality of parts of speech according to the disease part, syndrome part, or treatment-method part of the national standard of the People's Republic of China on Chinese medicine clinical diagnosis and treatment terms, in combination with the characteristics of Chinese medicine terminology, and constructing a part-of-speech table of 14 classes, the 14 classes being: 1. basic theory of traditional Chinese medicine; 2. diagnostic methods of traditional Chinese medicine; 3. Chinese materia medica terms; 4. prescription terms; 5. cold damage (shanghan) and warm disease (wenbing); 6. treatment principles of traditional Chinese medicine; 7. treatment methods of traditional Chinese medicine; 8. traditional Chinese medicine and related disciplines; 9. ancient books of traditional Chinese medicine; 10. traditional Chinese medicine institutions, equipment, or medical and health personnel; 11. personal names; 12. geographical names; 13. season and time words; 14. other words;
dividing each part-of-speech class into multiple levels of subclasses, and classifying and labeling the Chinese medicine terms in the lexicon according to the part-of-speech hierarchy, in order from the lowest level to the highest.
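The hierarchical labeling in claim 4 can be pictured as a tree whose 14 top-level nodes are the part-of-speech classes. The following is a minimal sketch only: the English class names are paraphrases of the claim, and the `PosNode` type, the `path_low_to_high` helper, and both subclasses are hypothetical illustrations that do not appear in the patent.

```python
from dataclasses import dataclass
from typing import List, Optional

# The 14 top-level classes, paraphrased from claim 4.
TOP_LEVEL_CLASSES = [
    "basic theory of traditional Chinese medicine",
    "diagnostic methods of traditional Chinese medicine",
    "Chinese materia medica terms",
    "prescription terms",
    "cold damage and warm disease",
    "treatment principles of traditional Chinese medicine",
    "treatment methods of traditional Chinese medicine",
    "traditional Chinese medicine and related disciplines",
    "ancient books of traditional Chinese medicine",
    "institutions, equipment, or medical and health personnel",
    "personal names",
    "geographical names",
    "season and time words",
    "other words",
]


@dataclass(frozen=True)
class PosNode:
    """One node of the part-of-speech hierarchy (hypothetical structure)."""
    name: str
    parent: Optional["PosNode"] = None

    def path_low_to_high(self) -> List[str]:
        """Return class names from this (lowest) level up to the top level."""
        path = []
        node: Optional[PosNode] = self
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path


# Hypothetical subclasses under class 3; the patent text here does not list them.
materia_medica = PosNode(TOP_LEVEL_CLASSES[2])
plant_medicinal = PosNode("plant-derived medicinal (hypothetical)", materia_medica)
root_medicinal = PosNode("root and rhizome (hypothetical)", plant_medicinal)

labels = root_medicinal.path_low_to_high()
print(labels)
```

Walking the parent links yields the labels in low-to-high order, which is one possible reading of the ordering the claim describes.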
5. The method for word segmentation and part-of-speech indexing of ancient Chinese medicine literature according to claim 1, wherein
the step (103) constructs the Chinese medicine word segmentation dictionary using a three-column dictionary construction method, the dictionary being divided into three columns: column 1 is the Chinese medicine professional term; column 2 is the part-of-speech classification letter; and column 3 is the part-of-speech hierarchy identifier.
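The three-column layout of claim 5 can be sketched as a small loader. This is an illustration under stated assumptions, not the patented implementation: the whitespace-separated file format, the `DictEntry` type, the `parse_dictionary` function, and the sample letters and hierarchy identifiers (`c`, `c.1`, and so on) are all hypothetical.

```python
from typing import Dict, NamedTuple


class DictEntry(NamedTuple):
    term: str        # column 1: Chinese medicine professional term
    pos_letter: str  # column 2: part-of-speech classification letter
    pos_level: str   # column 3: part-of-speech hierarchy identifier


def parse_dictionary(text: str) -> Dict[str, DictEntry]:
    """Parse whitespace-separated three-column lines into a term lookup."""
    entries: Dict[str, DictEntry] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        columns = line.split()
        if len(columns) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 columns, got {len(columns)}"
            )
        entry = DictEntry(*columns)
        entries[entry.term] = entry
    return entries


# Hypothetical sample rows; the letters and identifiers are invented.
SAMPLE = """\
人参 c c.1
桂枝汤 d d.1
足三里 g g.2
"""

dictionary = parse_dictionary(SAMPLE)
print(dictionary["桂枝汤"])
```

Keying the lookup on the term gives a segmenter direct access to both the classification letter and the hierarchy identifier once a term has been matched.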
6. The method for word segmentation and part-of-speech indexing of ancient Chinese medicine literature according to claim 1, wherein
whether all of the text to be segmented has been successfully segmented is judged by the following criterion:
if every word segmentation result carries a part-of-speech tagging letter, the word segmentation is successful; otherwise, the word segmentation has failed.
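The success criterion of claim 6 reduces to a check over the tagged output. The sketch below assumes each segmentation result is a `(word, tag)` pair in which a missing tag is `None` or an empty string; the function names, the sample tags, and the choice to treat empty output as a failure are assumptions, not details from the patent.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (segmented word, part-of-speech letter or None)


def has_tag(tag: Optional[str]) -> bool:
    """A tag counts only if it is a non-empty, non-blank string."""
    return tag is not None and tag.strip() != ""


def segmentation_succeeded(tokens: List[Token]) -> bool:
    """Succeed only if every segmentation result carries a tag letter.

    Assumption: an empty result list is treated as a failure.
    """
    return bool(tokens) and all(has_tag(tag) for _, tag in tokens)


def untagged_words(tokens: List[Token]) -> List[str]:
    """List the words that caused the failure."""
    return [word for word, tag in tokens if not has_tag(tag)]


# Hypothetical tags for illustration.
complete = [("桂枝", "c"), ("汤", "d")]
partial = [("桂枝", "c"), ("汤", None)]

print(segmentation_succeeded(complete), segmentation_succeeded(partial))
print(untagged_words(partial))
```

Reporting the untagged words alongside the pass/fail result makes it easier to see which terms are missing from the dictionary.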
7. A system for word segmentation and part-of-speech indexing of ancient Chinese medicine literature, characterized by comprising: a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of any one of claims 1-6.
8. A computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, perform the steps of the method of any one of claims 1-6.
CN201810233868.4A 2018-03-21 2018-03-21 Chinese medicine ancient book document word segmentation and part of speech indexing method and system Active CN108509419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810233868.4A CN108509419B (en) 2018-03-21 2018-03-21 Chinese medicine ancient book document word segmentation and part of speech indexing method and system

Publications (2)

Publication Number Publication Date
CN108509419A (en) 2018-09-07
CN108509419B (en) 2022-02-22

Family

ID=63377776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810233868.4A Active CN108509419B (en) 2018-03-21 2018-03-21 Chinese medicine ancient book document word segmentation and part of speech indexing method and system

Country Status (1)

Country Link
CN (1) CN108509419B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488497B (en) * 2019-01-25 2023-05-12 北京沃东天骏信息技术有限公司 Similarity determination method and device for character string set, terminal and readable medium
CN109829159B (en) * 2019-01-29 2020-02-18 南京师范大学 Integrated automatic lexical analysis method and system for ancient Chinese text
CN110134766B (en) * 2019-05-09 2021-06-25 北京科技大学 Word segmentation method and device for traditional Chinese medical ancient book documents
CN110675962A (en) * 2019-09-10 2020-01-10 电子科技大学 Traditional Chinese medicine pharmacological action identification method and system based on machine learning and text rules
CN111104801B (en) * 2019-12-26 2023-09-26 济南大学 Text word segmentation method, system, equipment and medium based on website domain name
CN111814464A (en) * 2020-05-25 2020-10-23 清华大学 Part-of-speech tagging method based on hidden Markov model
CN113033193B (en) * 2021-01-20 2024-04-16 山谷网安科技股份有限公司 Mixed Chinese text word segmentation method based on C++ language
CN113377965B (en) * 2021-06-30 2024-02-23 中国农业银行股份有限公司 Method and related device for sensing text keywords

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101950309A (en) * 2010-10-08 2011-01-19 华中师范大学 Subject area-oriented method for recognizing new specialized vocabulary
CN102541865A (en) * 2010-12-15 2012-07-04 盛乐信息技术(上海)有限公司 Method for improving word segmentation property by using new words identified in word segmentation process
US9053089B2 (en) * 2007-10-02 2015-06-09 Apple Inc. Part-of-speech tagging using latent analogy
CN104933152A (en) * 2015-06-24 2015-09-23 北京京东尚科信息技术有限公司 Named entity recognition method and device
CN105426358A (en) * 2015-11-09 2016-03-23 中国农业大学 Automatic disease noun identification method
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning
CN107092674A (en) * 2017-04-14 2017-08-25 福建工程学院 The automatic abstracting method and system of a kind of Chinese medicine acupuncture field event trigger word
CN107169078A (en) * 2017-05-10 2017-09-15 京东方科技集团股份有限公司 Knowledge of TCM collection of illustrative plates and its method for building up and computer system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1731395A (en) * 2005-08-18 2006-02-08 山东中医药大学 Chinese medicine ancient document database
CN101539907B (en) * 2008-03-19 2013-01-23 日电(中国)有限公司 Part-of-speech tagging model training device and part-of-speech tagging system and method thereof
CN101295295A (en) * 2008-06-13 2008-10-29 中国科学院计算技术研究所 Chinese language lexical analysis method based on linear model
CN102314507B (en) * 2011-09-08 2013-07-03 北京航空航天大学 Recognition ambiguity resolution method of Chinese named entity
CN103365992B (en) * 2013-07-03 2017-02-15 深圳市华傲数据技术有限公司 Method for realizing dictionary search of Trie tree based on one-dimensional linear space
CN107179085A (en) * 2016-03-10 2017-09-19 中国科学院地理科学与资源研究所 A kind of condition random field map-matching method towards sparse floating car data
CN107562834A (en) * 2017-08-23 2018-01-09 四川长虹电器股份有限公司 The method of geographic location criteriaization extraction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
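Applicability of commonly used Chinese word segmentation software in the field of TCM text and literature research; Yang Haifeng et al.; World Science and Technology - Modernization of Traditional Chinese Medicine (《世界科学技术-中医药现代化》); 20171231; Vol. 19; 536-541 *
Research on a named entity extraction method for TCM medical records based on conditional random fields; Liu Kai (刘凯); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》); 20131215 (No. S2); main text pp. 5-6, 8-10, 16-24, 29-32, 49, sections 1.3, 2.2, 3.1-3.2, 3.4, 5.1 *
Discussion of the current state of research on the informatization of TCM literature from a bibliometric perspective; Han Yali (韩雅丽) et al.; World Science and Technology - Modernization of Traditional Chinese Medicine (《世界科学技术-中医药现代化》); 20150320; Vol. 17 (No. 3); 427-433 *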

Also Published As

Publication number Publication date
CN108509419A (en) 2018-09-07

Similar Documents

Publication Publication Date Title
CN108509419B (en) Chinese medicine ancient book document word segmentation and part of speech indexing method and system
Matci et al. Address standardization using the natural language process for improving geocoding results
CN112487202B (en) Chinese medical named entity recognition method and device fusing knowledge map and BERT
Friedman et al. Natural language processing and its future in medicine
Masarie Jr et al. An interlingua for electronic interchange of medical information: using frames to map between clinical vocabularies
CN112597774B (en) Chinese medical named entity recognition method, system, storage medium and equipment
CN107368547A (en) A kind of intelligent medical automatic question-answering method based on deep learning
Deleger et al. Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research
Bowern Chirila: Contemporary and historical resources for the indigenous languages of Australia
CN109033080A (en) Medical terms standardized method and system based on probability transfer matrix
CN109815341B (en) Text extraction model training method, text extraction method and device
Friedman et al. Natural language and text processing in biomedicine
CN112949308A (en) Method and system for identifying named entities of Chinese electronic medical record based on functional structure
Varshney et al. Knowledge grounded medical dialogue generation using augmented graphs
Adduru et al. Towards Dataset Creation And Establishing Baselines for Sentence-level Neural Clinical Paraphrase Generation and Simplification.
Ahamed et al. Spell corrector for Bangla language using Norvig’s algorithm and Jaro-Winkler distance
Guo et al. Identifying COVID-19 cases and extracting patient reported symptoms from Reddit using natural language processing
Friedman Semantic text parsing for patient records
Maas Computer Aided Stemmatics—the Case of Fifty-Two Text Versions of Carakasaṃhitā Vimānasthāna 8.67-157
CN110543630B (en) Method and device for generating text structured representation and computer storage medium
Foufi et al. De-identification of medical narrative data
Champion et al. Tactical clinical text mining for improved patient characterization
Fu et al. Research on the method and system of word segmentation and POS tagging for ancient Chinese medicine literature
JP2011503730A5 (en)
Sager et al. Natural language processing of asthma discharge summaries for the monitoring of patient care.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant