CN114139610A - Traditional Chinese medicine clinical literature data structuring method and device based on deep learning

Info

Publication number: CN114139610A
Application number: CN202111349067.2A
Authority: CN
Inventors: ***; 李海燕; 杨乐; 刘华云; 李小阳; 王晰
Original assignee: Institute Of Information On Traditional Chinese Medicine Cacms
Current assignee: Institute Of Information On Traditional Chinese Medicine Cacms
Priority date: 2021-11-15
Filing date: 2021-11-15
Publication date: 2022-03-04
Anticipated expiration: 2041-11-15
Also published as: CN114139610B

Abstract

The invention discloses a deep learning-based method and device for structuring traditional Chinese medicine clinical literature data, and relates to the technical field of data processing. The method comprises the following steps: acquiring a document to be processed; inputting the document to be processed into a pre-constructed document data structured model; and obtaining a structured text based on the document to be processed and the document data structured model. The invention addresses problems in the prior art that arise because extraction rules are preset manually: inaccurate extraction results, a heavy correction workload, a complicated upgrade process, and the inability to use corrected content for self-learning so that accuracy improves with use.

Description

Traditional Chinese medicine clinical literature data structuring method and device based on deep learning

Technical Field

The invention relates to the technical field of data processing, in particular to a traditional Chinese medicine clinical literature data structuring method and device based on deep learning.

Background

The clinical literature of traditional Chinese medicine contains abundant textual and numerical information. A great deal of effective clinical practice experience remains to be mined from it, and the personalized diagnosis and treatment experience of renowned veteran traditional Chinese medicine practitioners needs to be inherited and summarized. As the wave of traditional Chinese medicine informatization rises, several questions arise. How can this experience be organically combined with the direct evidence obtained from rigorous randomized controlled clinical trials? How can the "soft indexes" of traditional Chinese medicine symptoms and signs be combined with the "hard indexes" obtained from the physical and chemical examinations of modern medicine? How can the best evidence for evidence-based medicine be obtained from the large volume of traditional Chinese medicine clinical research data? Structured traditional Chinese medicine clinical literature data therefore brings great convenience to the archiving of traditional Chinese medicine clinical literature, knowledge base construction, analysis of diagnosis and treatment experience, promotion of new drug research and development, research on informatics methodology, and the building of a traditional Chinese medicine data talent team. However, because natural language processing and traditional Chinese medicine are not yet closely combined, the prior art has certain defects and shortcomings.
First, although some traditional Chinese medicine clinical literature data has been simply structured by manual extraction, or by rule-based extraction plus manual proofreading, the data is large in volume and varies in content composition, writing conventions, dependency syntax, alternative names and other factors. Even at great labor cost, accurate and efficient extraction and judgment still cannot be achieved, which hinders further research in the era of big data. Second, natural language processing and deep learning techniques for traditional Chinese medicine clinical literature are currently scarce, so they cannot yet help researchers in the field study the relationship between patterns of disease occurrence and factors such as medicines and dosages.

The existing traditional Chinese medicine literature data structuring system mainly comprises the following parts: traditional Chinese medicine literature word extraction, PDF parsing and recognition, client identity verification, user-defined word lists, and knowledge graph construction. On the one hand, it extracts words by means of a traditional Chinese medicine word list, so only words that appear in the word list can be recognized and unknown words cannot; improving extraction accuracy requires adding new words to the word list, which takes a great deal of time. On the other hand, it requires extraction rules to be written manually, and the process of adding a new rule is complex.

Disclosure of Invention

The invention is proposed to address the following problems in the prior art, which arise because extraction rules are preset manually: extraction results are inaccurate, the correction workload is heavy, the updating process is complex, and corrected content cannot be used for self-learning, so accuracy cannot improve with use.

In order to solve the technical problems, the invention provides the following technical scheme:

In one aspect, the present invention provides a deep learning-based method for structuring traditional Chinese medicine clinical literature data, where the method is implemented by an electronic device and includes:

S1, acquiring the document to be processed.

S2, inputting the document to be processed into a document data structured model which is constructed in advance.

S3, obtaining a structured text based on the document to be processed and the document data structured model.

Optionally, the building process of the document data structured model in S2 includes:

S21, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set.

S22, carrying out data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set according to the obtained annotation data, and dividing the annotation set into a training set, a verification set and a test set.

S23, constructing a neural network model based on the self-attention mechanism Transformer, and carrying out named entity recognition training on the neural network model according to the training set and the verification set to obtain a document data structured model.

S24, inputting the test set into the document data structured model to obtain a predicted target point, and extracting one or more sentences in which the predicted target point is located according to the regular pool to obtain a predicted structured text.

S25, manually proofreading the predicted structured text; if the manual proofreading results are inconsistent, executing S21; if the manual proofreading results are consistent, outputting the document data structured model.

Optionally, the preprocessing the sample data set in S21 includes:

Splitting the sample data in the sample data set, and deleting the keyword, date, header and number information from the split sample data.

Optionally, the data tagging of the preprocessed sample data set in S22 includes:

Setting and ordering labels according to the content of the preprocessed sample data set, wherein the content of each label is a description of the structured content.

Marking the content of the preprocessed sample data set according to the labels, and associating the marked content with the corresponding label.

Optionally, the obtaining the regular pool according to the obtained labeling data in S22 includes:

Extracting the sentence in which the labeled data is located, removing the labeled data from the sentence, dynamically generating a regular extraction sentence pattern, and storing the regular extraction sentence pattern in the regular pool.

Optionally, the deriving an annotation set according to the obtained annotation data in S22 includes:

Carrying out sequence annotation on the annotated data by adopting the BIO annotation method to obtain the annotation set.

In another aspect, the present invention provides a deep learning-based device for structuring traditional Chinese medicine clinical literature data, which is applied to implement the deep learning-based method for structuring traditional Chinese medicine clinical literature data, the device comprising:

The acquisition module, used for acquiring the document to be processed.

The input module, used for inputting the document to be processed into the document data structured model which is constructed in advance.

The output module, used for obtaining the structured text based on the document to be processed and the document data structured model.

Optionally, the input module is further configured to:

In one aspect, an electronic device is provided, which includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the above deep learning-based method for structuring data of clinical literature in traditional Chinese medicine.

In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the above deep learning-based data structuring method for clinical literature of traditional Chinese medicine.

The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:

In the above scheme, the machine learning method is convenient, fast, accurate and efficient. Results can also be output for manual proofreading, so that the system can complete multiple rounds of automatic learning, which saves time and progressively optimizes the system. Using the concept of a dynamic regular pool (a pool of regular expressions), the data annotation results are reused: they are extracted into expressions and stored to form an expression set (which can assist dependency syntactic analysis), and the expression to apply is determined according to its number of occurrences and its weight in the expression set. In the document data structuring process, model prediction plays two roles: (1) directly obtaining a result, and (2) locating the sentence in which the result lies. The result is scored to decide which role applies; if the second applies, the dynamic regular pool is called to obtain accurate structured data.

In addition, when data structuring is carried out, the sentence in which the target entity lies is located first, and the target entity is then extracted from that sentence using the content of the regular pool. The text segment to be extracted can thus be extracted effectively, entities at other positions in the document are not mistakenly identified, and identification accuracy is improved.

The invention addresses the problems in the prior art that natural language processing and traditional Chinese medicine are not closely combined and that existing traditional Chinese medicine literature data structuring systems have defects and shortcomings. In the method, data annotators annotate traditional Chinese medicine clinical literature using the BIO labeling method, and a neural network is built from the annotation results by deep learning to train an NLP (Natural Language Processing) data model. Based on the ontology-related attribute concepts of the traditional Chinese medicine literature field, dynamic regular matching is applied to the output of the NLP model to locate the coordinates of the target content and extract the relevant literature data, and the extraction results are proofread manually. The proofread results are reorganized into a standard knowledge base, which is used to improve the training source data for multiple rounds of training, making the data mining results more intelligent and accurate.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a flow chart of a method for structuring data of clinical literature of traditional Chinese medicine based on deep learning according to the present invention;

FIG. 2 is a flow chart of a method for constructing a document data structured model according to the present invention;

FIG. 3 is a sample schematic of the clinical literature of medicine in the present invention;

FIG. 4 is a schematic view of the tag contents of the present invention;

FIG. 5 is a schematic diagram of a regularized sentence extraction of the present invention;

FIG. 6 is a schematic representation of the data tagging results of the present invention;

FIG. 7 is a schematic structural diagram of a data structuring method of a clinical literature of traditional Chinese medicine based on deep learning according to the present invention;

FIG. 8 is a block diagram of a data structuring device for clinical documents of traditional Chinese medicine based on deep learning according to the present invention;

FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

As shown in FIG. 1, an embodiment of the present invention provides a deep learning-based method for structuring traditional Chinese medicine clinical literature data. The method is implemented by an electronic device, and its processing flow may include the following steps:

S11, acquiring the document to be processed.

S12, inputting the document to be processed into a document data structured model which is constructed in advance.

S13, obtaining a structured text based on the document to be processed and the document data structured model.

Optionally, the building process of the document data structured model in S12 includes:

S121, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set.

S122, carrying out data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set according to the obtained annotation data, and dividing the annotation set into a training set, a verification set and a test set.

S123, constructing a neural network model based on the self-attention mechanism Transformer, and performing named entity recognition training on the neural network model according to the training set and the verification set to obtain a document data structured model.

S124, inputting the test set into the document data structured model to obtain a predicted target point, and extracting one or more sentences in which the predicted target point is located according to the regular pool to obtain a predicted structured text.

S125, manually proofreading the predicted structured text; if the manual proofreading results are inconsistent, executing S121; if the manual proofreading results are consistent, outputting the document data structured model.

Optionally, the preprocessing the sample data set in S121 includes:

Optionally, the data tagging performed on the preprocessed sample data set in S122 includes:

Optionally, the obtaining the regular pool according to the obtained labeling data in S122 includes:

Optionally, the obtaining of the annotation set according to the obtained annotation data in S122 includes:

In the embodiment of the invention, a machine learning method is adopted, which is convenient, fast, accurate and efficient. Results can also be output for manual proofreading, so that the system can complete multiple rounds of automatic learning, which saves time and progressively optimizes the system. Using the concept of a dynamic regular pool (a pool of regular expressions), the data annotation results are reused: they are extracted into expressions and stored to form an expression set (which can assist dependency syntactic analysis), and the expression to apply is determined according to its number of occurrences and its weight in the expression set. In the document data structuring process, model prediction plays two roles: (1) directly obtaining a result, and (2) locating the sentence in which the result lies. The result is scored to decide which role applies; if the second applies, the dynamic regular pool is called to obtain accurate structured data.
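The choice between the two roles of model prediction can be sketched as follows. This is a minimal illustration only: the function name `structure_entity`, the `threshold` value of 0.9 and the `pool_extract` callback are hypothetical, since the description does not specify how the score is computed or which cut-off is used.

```python
from typing import Callable, Optional


def structure_entity(
    predicted_text: str,
    score: float,
    sentence: str,
    pool_extract: Callable[[str], Optional[str]],
    threshold: float = 0.9,  # hypothetical cut-off, not taken from the source
) -> str:
    """Return the structured value for one model prediction."""
    if score >= threshold:
        # Role 1: the prediction is trusted and used directly as the result.
        return predicted_text
    # Role 2: the prediction only locates the sentence; the regular pool
    # is then asked to extract the value from that sentence.
    extracted = pool_extract(sentence)
    # Fall back to the raw prediction when no pool expression matches.
    return extracted if extracted is not None else predicted_text
```

With a high score the prediction is returned unchanged; with a low score the callback decides the value, and the raw prediction is used only as a fallback.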

As shown in FIG. 2, an embodiment of the present invention provides a method for constructing a document data structured model, where the method is applied to an electronic device, and the method includes:

In a possible implementation, as shown in FIG. 3, each sample of traditional Chinese medicine clinical literature can be divided into three parts: abstract, data and methods, and results. During splitting, information such as keywords, dates, headers and numbers is filtered out of these three parts, so that the traditional Chinese medicine clinical literature data takes formats such as Map<abstract, content>, Map<data and methods, content> and Map<results, content>.
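The splitting and filtering step might look like the sketch below. The section headings and the noise patterns are assumptions chosen for illustration; real documents would need headings and filters matched to their actual layout.

```python
import re
from typing import Dict, List

# Hypothetical section headings; real documents may use different ones.
SECTION_HEADINGS = ("Abstract", "Data and methods", "Results")

# Hypothetical filters for lines to drop: keyword lines, dates, document numbers.
NOISE_PATTERNS = [
    re.compile(r"^\s*Keywords?\s*[:：]", re.IGNORECASE),
    re.compile(r"^\s*\d{4}-\d{2}-\d{2}\s*$"),
    re.compile(r"^\s*(No\.|DOI)\s*[:：]?\s*\S+\s*$", re.IGNORECASE),
]


def split_document(text: str) -> Dict[str, str]:
    """Split a document into a {section name: content} map, dropping noise lines."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in SECTION_HEADINGS:
            current = stripped
            sections[current] = []
            continue
        if current is None or not stripped:
            continue  # skip header material before the first section, and blank lines
        if any(p.search(stripped) for p in NOISE_PATTERNS):
            continue  # drop keyword, date and number lines
        sections[current].append(stripped)
    return {name: " ".join(lines) for name, lines in sections.items()}
```

Each returned entry corresponds to one Map<part, content> pair in the description.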

And S22, carrying out data annotation on the preprocessed sample data set.

In one possible implementation, labels are first defined and ordered; the content of each label is a description of the content to be structured, as shown in FIG. 4. The content to be structured is then selected and marked in each of the three parts of the preprocessed sample data set, and the marked content is associated with the corresponding label.

And S23, obtaining a regular pool according to the obtained labeling data.

In one possible embodiment, the labeled results are extracted as regular sentence patterns, as shown in FIG. 5. For example, in the sentence "the clinical chief complaint is dysphagia and dysvocalization", where "dysphagia and dysvocalization" is the target content, the target content is removed and the remaining text, "the clinical chief complaint is", forms the regular sentence pattern.
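One way to generate such a pattern and keep a count of how often it occurs is sketched below. The helper names `make_pattern` and `add_to_pool`, and the use of a `Counter` as the pool, are assumptions for illustration; the source states only that occurrence counts and weights are used, not how the pool is stored.

```python
import re
from collections import Counter


def make_pattern(sentence: str, target: str) -> str:
    """Replace the annotated target in the sentence with a capture group."""
    start = sentence.index(target)
    prefix = re.escape(sentence[:start])
    suffix = re.escape(sentence[start + len(target):])
    # The text around the target is kept literally; the target becomes (.+?).
    return f"{prefix}(.+?){suffix}$"


def add_to_pool(pool: Counter, sentence: str, target: str) -> str:
    """Generate a pattern from one annotation and record it in the pool."""
    pattern = make_pattern(sentence, target)
    pool[pattern] += 1  # occurrence count, usable later as a weight
    return pattern


pool: Counter = Counter()
pattern = add_to_pool(
    pool,
    "The clinical chief complaint is dysphagia and dysvocalization",
    "dysphagia and dysvocalization",
)
match = re.search(pattern, "The clinical chief complaint is dizziness")
print(match.group(1) if match else None)
```

Annotations that share the same surrounding text produce the same pattern, so their counts accumulate under a single pool entry.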

And S24, obtaining an annotation set according to the obtained annotation data.

And carrying out sequence annotation on the annotated data by adopting a BIO annotation method to obtain an annotated set. And dividing the label set into a training set, a verification set and a test set.

In a possible embodiment, as shown in FIG. 6, in the preprocessed sample data of traditional Chinese medicine clinical literature, one sequence represents one sentence after splitting, and a structured entity represents an element in a sentence.

The BIO labeling includes: labeling each element as "B-X", "I-X", or "O"; wherein "B-X" indicates that the fragment in which the element is located belongs to the X type and that the element is at the beginning of the fragment; "I-X" indicates that the fragment in which this element is located belongs to the X type and that this element is in the middle position of this fragment; "O" means not of any type.

The structured entities need to be extracted and put into a word list. The document data is converted into a single line, occurrences of the word-list entries are searched for and replaced with B-label, I-label and O tags, and the text is then split into individual tokens to construct the label set.
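The word-list lookup and BIO replacement described above can be sketched as follows, assuming character-level tokens, which is a common choice for Chinese text. The function name `bio_tag` and the label name `SYMPTOM` are hypothetical.

```python
from typing import Dict, List, Tuple


def bio_tag(text: str, word_list: Dict[str, str]) -> List[Tuple[str, str]]:
    """Tag each character of text as B-X, I-X or O using an entity word list."""
    tags = ["O"] * len(text)
    # Longer entities first, so a short entity cannot claim part of a longer one.
    for entity in sorted(word_list, key=len, reverse=True):
        if not entity:
            continue
        label = word_list[entity]
        start = text.find(entity)
        while start != -1:
            end = start + len(entity)
            if all(t == "O" for t in tags[start:end]):
                tags[start] = f"B-{label}"       # first character of the fragment
                for i in range(start + 1, end):
                    tags[i] = f"I-{label}"       # remaining characters of the fragment
            start = text.find(entity, end)
    return list(zip(text, tags))


# "主诉吞咽困难" means "chief complaint: dysphagia"; "吞咽困难" is the entity.
for char, tag in bio_tag("主诉吞咽困难", {"吞咽困难": "SYMPTOM"}):
    print(char, tag)
```

Characters outside any word-list entry keep the tag O, matching the definition of the BIO scheme given above.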

When performing NER (Named Entity Recognition) training of the neural network model, the BIO label set needs to be divided into a training set, a verification set and a test set, for example in a ratio of 7:2:1, so that metrics such as accuracy, precision, recall and F1 can be obtained to evaluate the quality of the model.
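A 7:2:1 division can be sketched as below. Shuffling with a fixed seed is an assumption added for reproducibility; the source specifies only the ratio.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    samples: Sequence[T],
    ratios: Tuple[int, int, int] = (7, 2, 1),
    seed: int = 42,  # hypothetical seed, used only to make the split repeatable
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle samples and divide them into training, verification and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

For 100 annotated sentences this yields 70 training, 20 verification and 10 test samples.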

And S25, constructing a neural network model based on a self-attention mechanism Transformer, and carrying out named entity recognition training on the neural network model according to the training set and the verification set to obtain a document data structured model.

In a feasible implementation, a Transformer-based NLP-NER model is constructed. NLP is an important direction in computer science and artificial intelligence; it studies theories and methods that enable effective communication between people and computers using natural language. It mainly comprises two parts, NLU (Natural Language Understanding) and NLG (Natural Language Generation).

The Transformer adopts an Encoder-Decoder architecture. Encoder-Decoder is a model framework, a general term for a class of algorithms rather than one specific algorithm: in the encoding stage the encoder converts the input sequence into a dense vector of fixed dimension, and in the decoding stage the target output is generated from that activation state. The Transformer has great advantages in parallelism and long-range dependence, but analysis of its attention mechanism shows disadvantages in directionality, relative position and sparsity. Based on the characteristics of structured traditional Chinese medicine clinical literature data, a simple improvement to the Transformer's attention scoring function greatly improves the performance of the Transformer structure on the clinical literature NER task. The attention scoring function itself is prior art and is not described here; only the improvement is described: after softmax(Q dot K) is calculated, each point is weighted once. Part of the PyTorch code is as follows:

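self.time_weighting = nn.Parameter(torch.ones(self.n_head, config.window_len, config.window_len))  # third dimension is truncated in the source; config.window_len is assumed

...

att = F.softmax(att, dim=-1)  # original code

att = att * self.time_weighting[:, :T, :T]  # the only line that needs to be added

att = self.attn_drop(att)  # original code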

The improvement has two advantages. First, tokens at different distances should contribute differently to the current position. Second, for tokens near the beginning of the sequence, the overall weight of self-attention should be reduced, because the observation window is small and carries relatively little information. To address the shortcomings in directionality, relative position and sparsity on traditional Chinese medicine literature data, each point is weighted once after softmax(Q dot K) is calculated, which makes the prediction result more accurate.
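The effect of the extra weighting step can be illustrated for a single attention head with the dependency-free sketch below. It is written in plain Python rather than PyTorch so that it runs without extra packages, it omits any attention mask, and the weight values in the example are hypothetical stand-ins for learned parameters (which the PyTorch fragment above initializes to ones).

```python
import math
from typing import List

Matrix = List[List[float]]


def time_weighted_attention(scores: Matrix, time_weighting: Matrix) -> Matrix:
    """Row-wise softmax over raw attention scores, then per-position weighting."""
    weighted = []
    for i, row in enumerate(scores):
        # Standard softmax over the scores of query position i.
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Added step: scale each attention probability by its own weight.
        weighted.append([p * time_weighting[i][j] for j, p in enumerate(probs)])
    return weighted


scores = [[0.0, 0.0], [0.0, 0.0]]
weights = [[1.0, 0.5], [2.0, 1.0]]  # hypothetical values standing in for learned weights
print(time_weighted_attention(scores, weights))
```

Because the weighting is applied after the softmax, the rows no longer need to sum to one, which is what allows the overall attention weight of a position to be raised or lowered.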

Training is terminated, and the model is obtained, when the accuracy and F1 value of round n+1 are less than or equal to those of round n. The specific training parameters are as follows:

S26, inputting the test set into the document data structured model to obtain a predicted target point, and extracting one or more sentences in which the predicted target point is located according to the regular pool to obtain a predicted structured text.

In a feasible implementation, the test set sample data is first initialized, and short-sentence splitting and length verification are performed. The trained model is then used to predict on the test set, the coordinates of the predicted content are located, the sentence containing them is identified, and the coordinates are expanded to the one or more sentences in which the content lies, giving an intermediate result. The content of the regular pool is then called to extract from this result, and what is extracted is the content that needs to be structured.
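The coordinate expansion and pool-based extraction can be sketched as follows. The delimiter set, the helper names and the example pattern are assumptions; in particular, this simple delimiter scan does not handle decimal points or abbreviations.

```python
import re
from typing import List, Optional, Tuple

SENTENCE_DELIMITERS = ".!?。！？"  # hypothetical delimiter set


def locate_sentences(text: str, start: int, end: int) -> Tuple[int, int]:
    """Expand the character span [start, end) to the sentence(s) containing it."""
    left = start
    while left > 0 and text[left - 1] not in SENTENCE_DELIMITERS:
        left -= 1
    right = end
    while right < len(text) and text[right - 1] not in SENTENCE_DELIMITERS:
        right += 1
    return left, right


def structure_prediction(
    text: str, start: int, end: int, patterns: List[str]
) -> Tuple[str, Optional[str]]:
    """Return the located sentence and the value extracted by the first matching pattern."""
    left, right = locate_sentences(text, start, end)
    sentence = text[left:right].strip()
    for pattern in patterns:
        match = re.search(pattern, sentence)
        if match:
            return sentence, match.group(1)
    return sentence, None


text = (
    "The patient is 56 years old. "
    "The clinical chief complaint is dysphagia and dysvocalization. "
    "Treatment lasted four weeks."
)
start = text.index("dysphagia")  # stands in for a predicted coordinate from the model
pool = [r"clinical chief complaint is (.+?)[.。]$"]
print(structure_prediction(text, start, start + len("dysphagia"), pool))
```

Here the model is assumed to have predicted only "dysphagia"; expanding to the sentence and applying the pool pattern recovers the full target content.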

S27, manually proofreading the predicted structured text; if the manual proofreading results are inconsistent, executing S21; if the manual proofreading results are consistent, outputting the document data structured model.

In a possible implementation, as shown in FIG. 7, the structured content obtained through S21-S26 is proofread manually a second time. After proofreading, S21 is repeated until the obtained content is consistent with the proofreading result, and the model is updated, so that it becomes progressively more accurate through this self-learning approach.

As shown in FIG. 8, an embodiment of the present invention provides a deep learning-based device 800 for structuring traditional Chinese medicine clinical literature data, where the device 800 is applied to implement the deep learning-based method for structuring traditional Chinese medicine clinical literature data, and the device 800 includes:

An obtaining module 810, configured to obtain a document to be processed.

An input module 820, configured to input the document to be processed into the document data structured model which is constructed in advance.

An output module 830, configured to obtain a structured text based on the document to be processed and the document data structured model.

Optionally, the input module 820 is further configured to:

FIG. 9 is a schematic structural diagram of an electronic device 900 according to an embodiment of the present invention. The electronic device 900 may vary considerably with configuration or performance, and may include one or more processors (CPUs) 901 and one or more memories 902, where the memory 902 stores at least one instruction that is loaded and executed by the processor 901 to implement the following steps of the deep learning-based method for structuring traditional Chinese medicine clinical literature data:

S1, acquiring the document to be processed.

In an exemplary embodiment, there is also provided a computer readable storage medium, such as a memory, comprising instructions executable by a processor in a terminal to perform the above deep learning based method of structuring data of clinical literature of traditional Chinese medicine. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A traditional Chinese medicine clinical literature data structuring method based on deep learning is characterized by comprising the following steps:

S1, acquiring a document to be processed;

S2, inputting the document to be processed into a document data structured model which is constructed in advance;

2. The method according to claim 1, wherein the building process of the literature data structured model in S2 comprises:

S21, acquiring a sample data set of clinical literature of traditional Chinese medicine, and preprocessing the sample data set;

S22, carrying out data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set according to the obtained annotation data, and dividing the annotation set into a training set, a verification set and a test set;

S23, constructing a neural network model based on a self-attention mechanism Transformer, and carrying out named entity recognition training on the neural network model according to the training set and the verification set to obtain a document data structured model;

S24, inputting the test set into the document data structured model to obtain a predicted target point, and extracting one or more sentences where the predicted target point is located according to the regular pool to obtain a predicted structured text;

S25, manually proofreading the predicted structured text, and if the results of manual proofreading are inconsistent, executing S21; and if the manual proofreading results are consistent, outputting the document data structured model.

3. The method according to claim 2, wherein the preprocessing of the sample data set in S21 comprises:

4. The method according to claim 2, wherein the data tagging of the preprocessed sample data set in S22 includes:

setting a label and sequencing according to the content of the preprocessed sample data set, wherein the content of the label is the description of structured content;

5. The method according to claim 2, wherein the deriving the regular pool according to the derived labeling data in S22 comprises:

extracting the sentence in which the annotation data is located, removing the annotation data from the sentence, dynamically generating a regular extraction sentence pattern, and storing the regular extraction sentence pattern in the regular pool.

6. The method of claim 2, wherein the deriving an annotation set according to the derived annotation data in S22 comprises:

carrying out sequence annotation on the annotation data by adopting a BIO annotation method to obtain an annotation set.

7. A traditional Chinese medicine clinical literature data structuring device based on deep learning is characterized in that the device comprises:

the acquisition module is used for acquiring documents to be processed;

the input module is used for inputting the document to be processed into a document data structured model which is constructed in advance;

and the output module is used for obtaining a structured text based on the document to be processed and the document data structured model.

8. The apparatus of claim 7, wherein the building process of the document data structured model comprises:

9. The apparatus according to claim 7, wherein the preprocessing of the sample data set in the S21 comprises:

10. The apparatus of claim 7, wherein data tagging the pre-processed sample data set comprises:

setting a label and sequencing according to the content of the preprocessed sample data set, wherein the content of the label is the description of structured content; and marking the content of the preprocessed sample data set according to the label, and associating the marked content with the corresponding label.