CN114139610A - Traditional Chinese medicine clinical literature data structuring method and device based on deep learning
- Publication number
- CN114139610A (CN202111349067.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- document
- structured
- sample data
- annotation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a deep learning-based method and device for structuring traditional Chinese medicine clinical literature data, and relates to the technical field of data processing. The method comprises the following steps: acquiring a document to be processed; inputting the document to be processed into a pre-constructed document data structured model; and obtaining a structured text based on the document to be processed and the document data structured model. The invention addresses problems that arise in the prior art because extraction rules are preset manually: inaccurate extraction results, a heavy proofreading workload, a cumbersome upgrade process, and the inability to learn from corrected content so that accuracy improves with use.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a traditional Chinese medicine clinical literature data structuring method and device based on deep learning.
Background
Traditional Chinese medicine (TCM) clinical literature contains abundant textual and numerical information. A great deal of effective clinical practice experience remains to be mined from it, and the personalized diagnosis and treatment experience of renowned veteran TCM practitioners needs to be inherited and summarized. As the wave of TCM informatization rises, several questions arise. How can this experience be organically combined with the direct evidence obtained from rigorous randomized controlled clinical trials? How can the "soft indexes" of TCM symptoms and signs be combined with the "hard indexes" obtained from the physical and chemical examinations of modern medicine? How can the best evidence for evidence-based medicine be obtained from the large volume of TCM clinical research data? Structured TCM clinical literature data therefore brings great convenience to the archiving of TCM clinical literature, knowledge base construction, analysis of diagnosis and treatment experience, promotion of new drug research and development, research on information methodology, and the building of TCM data talent teams. However, because natural language processing is not yet closely integrated with traditional Chinese medicine, the prior art has certain defects and shortcomings.
First, although some TCM clinical literature data has been given a simple structure through manual extraction, or through rule-based extraction followed by manual proofreading, the volume of such literature is large and it varies in content composition, writing conventions, dependency syntax, alternative names and other factors. Even at great labor cost, accurate and efficient extraction and judgment still cannot be achieved, which hinders further research in the era of big data. Second, there are currently few natural language processing and deep learning techniques for TCM clinical literature, so researchers in the TCM field receive little support when studying the relationship between patterns of disease occurrence and factors such as drugs and dosage.
Existing TCM literature data structuring systems mainly comprise TCM literature word extraction, PDF parsing and recognition, client identity verification, user-defined word lists and knowledge graph construction. On one hand, such a system extracts words by means of a TCM word list, so only words that appear in the list can be recognized and out-of-vocabulary words cannot. Improving extraction accuracy requires adding new words to the list, which consumes a great deal of time. On the other hand, extraction rules must be written manually, and the process of adding a new rule is complex.
Disclosure of Invention
The invention is provided to address the following problems in the prior art, which arise because extraction rules are preset manually: the extraction results are inaccurate, the proofreading workload is heavy, the update process is complex, and corrected content cannot be used for self-learning, so that accuracy cannot improve with use.
In order to solve the technical problems, the invention provides the following technical scheme:
In one aspect, the present invention provides a deep learning-based method for structuring traditional Chinese medicine clinical literature data. The method is implemented by an electronic device and comprises:
S1, acquiring a document to be processed.
S2, inputting the document to be processed into a pre-constructed document data structured model.
S3, obtaining a structured text based on the document to be processed and the document data structured model.
Optionally, the construction process of the document data structured model in S2 comprises:
S21, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set.
S22, performing data annotation on the preprocessed sample data set, obtaining a regular pool (a pool of regular-expression extraction patterns) and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a verification set and a test set.
S23, constructing a neural network model based on the self-attention Transformer, and performing named entity recognition training on the neural network model with the training set and the verification set to obtain the document data structured model.
S24, inputting the test set into the document data structured model to obtain a predicted target point, and extracting, according to the regular pool, from the one or more sentences in which the predicted target point is located, to obtain a predicted structured text.
S25, manually proofreading the predicted structured text; if the manual proofreading results are inconsistent, executing S21 again; if the manual proofreading results are consistent, outputting the document data structured model.
Optionally, preprocessing the sample data set in S21 comprises:
splitting the sample data in the sample data set, and deleting the keyword, date, header and number information from the split sample data.
Optionally, performing data annotation on the preprocessed sample data set in S22 comprises:
defining and ordering labels according to the content of the preprocessed sample data set, wherein the content of each label is a description of the content to be structured; and
annotating the content of the preprocessed sample data set according to the labels, and associating the annotated content with the corresponding labels.
Optionally, obtaining the regular pool from the annotation data in S22 comprises:
extracting the sentence in which the annotated data is located, removing the annotated data from the sentence, dynamically generating a regular extraction sentence pattern, and storing the pattern in the regular pool.
Optionally, obtaining the annotation set from the annotation data in S22 comprises:
performing sequence annotation on the annotated data with the BIO annotation method to obtain the annotation set.
In another aspect, the present invention provides a deep learning-based device for structuring traditional Chinese medicine clinical literature data. The device is used to implement the deep learning-based method for structuring traditional Chinese medicine clinical literature data, and comprises:
an acquisition module, used for acquiring a document to be processed;
an input module, used for inputting the document to be processed into a pre-constructed document data structured model; and
an output module, used for obtaining a structured text based on the document to be processed and the document data structured model.
Optionally, the input module is further configured for:
S21, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set.
S22, performing data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a verification set and a test set.
S23, constructing a neural network model based on the self-attention Transformer, and performing named entity recognition training on the neural network model with the training set and the verification set to obtain the document data structured model.
S24, inputting the test set into the document data structured model to obtain a predicted target point, and extracting, according to the regular pool, from the one or more sentences in which the predicted target point is located, to obtain a predicted structured text.
S25, manually proofreading the predicted structured text; if the manual proofreading results are inconsistent, executing S21 again; if the manual proofreading results are consistent, outputting the document data structured model.
Optionally, the input module is further configured for:
splitting the sample data in the sample data set, and deleting the keyword, date, header and number information from the split sample data.
Optionally, the input module is further configured for:
defining and ordering labels according to the content of the preprocessed sample data set, wherein the content of each label is a description of the content to be structured; and
annotating the content of the preprocessed sample data set according to the labels, and associating the annotated content with the corresponding labels.
Optionally, the input module is further configured for:
extracting the sentence in which the annotated data is located, removing the annotated data from the sentence, dynamically generating a regular extraction sentence pattern, and storing the pattern in the regular pool.
Optionally, the input module is further configured for:
performing sequence annotation on the annotated data with the BIO annotation method to obtain the annotation set.
In another aspect, an electronic device is provided, which includes a processor and a memory. The memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the above deep learning-based method for structuring traditional Chinese medicine clinical literature data.
In another aspect, a computer-readable storage medium is provided, in which at least one instruction is stored. The at least one instruction is loaded and executed by a processor to implement the above deep learning-based method for structuring traditional Chinese medicine clinical literature data.
The technical scheme provided by the embodiments of the invention has at least the following beneficial effects:
In the above scheme, the machine learning method is convenient, fast, accurate and efficient. At the same time, the output results can be manually proofread, so that the system can complete multiple rounds of automatic learning, which saves time and optimizes the system. Using the concept of a dynamic regular pool, the data annotation results are reused: they are extracted into expressions and stored to form an expression set (which can assist dependency syntactic analysis), and the expression to apply is determined according to its number of occurrences and its weight in the expression set. In the document data structuring process, model prediction plays two roles: (1) directly obtaining a result, and (2) locating the sentence in which the result lies. The result is scored to decide which role is executed; if role (2) is executed, the dynamic regular pool is called to obtain accurate structured data.
In addition, when data structuring is carried out, the sentence in which the target entity lies is located first, and the target entity in that sentence is then extracted using the content of the regular pool. The text segment to be extracted can thus be extracted effectively, entities at other positions in the document are not mistakenly identified, and identification accuracy is improved.
The invention addresses the problems in the prior art that natural language processing is not closely integrated with traditional Chinese medicine and that existing traditional Chinese medicine literature data structuring systems have defects and shortcomings. In the method, data annotators annotate traditional Chinese medicine clinical documents with the BIO annotation method, and a neural network is constructed from the annotation results by deep learning to train an NLP (Natural Language Processing) data model. According to the ontology-related attribute concepts of the traditional Chinese medicine literature field, dynamic regular matching is applied to the results of the NLP model to locate the coordinates of the target content and extract the relevant document data, and the extraction results are manually proofread. The proofreading results are then organized into a standard knowledge base, which is used to improve the training source data for multiple rounds of training, so that the data mining results become more intelligent and more accurate.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a deep learning-based method for structuring traditional Chinese medicine clinical literature data according to the present invention;
FIG. 2 is a flow chart of a method for constructing a document data structured model according to the present invention;
FIG. 3 is a schematic diagram of a traditional Chinese medicine clinical literature sample in the present invention;
FIG. 4 is a schematic diagram of the label contents of the present invention;
FIG. 5 is a schematic diagram of a regular extraction sentence pattern of the present invention;
FIG. 6 is a schematic diagram of the data annotation results of the present invention;
FIG. 7 is a schematic structural diagram of a deep learning-based method for structuring traditional Chinese medicine clinical literature data according to the present invention;
FIG. 8 is a block diagram of a deep learning-based device for structuring traditional Chinese medicine clinical literature data according to the present invention;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, an embodiment of the present invention provides a deep learning-based method for structuring traditional Chinese medicine clinical literature data. The method is implemented by an electronic device, and its processing flow may include the following steps:
S11, acquiring a document to be processed.
S12, inputting the document to be processed into a pre-constructed document data structured model.
S13, obtaining a structured text based on the document to be processed and the document data structured model.
Optionally, the construction process of the document data structured model in S12 comprises:
S121, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set.
S122, performing data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a verification set and a test set.
S123, constructing a neural network model based on the self-attention Transformer, and performing named entity recognition training on the neural network model with the training set and the verification set to obtain the document data structured model.
S124, inputting the test set into the document data structured model to obtain a predicted target point, and extracting, according to the regular pool, from the one or more sentences in which the predicted target point is located, to obtain a predicted structured text.
S125, manually proofreading the predicted structured text; if the manual proofreading results are inconsistent, executing S121 again; if the manual proofreading results are consistent, outputting the document data structured model.
Optionally, preprocessing the sample data set in S121 comprises:
splitting the sample data in the sample data set, and deleting the keyword, date, header and number information from the split sample data.
Optionally, performing data annotation on the preprocessed sample data set in S122 comprises:
defining and ordering labels according to the content of the preprocessed sample data set, wherein the content of each label is a description of the content to be structured; and
annotating the content of the preprocessed sample data set according to the labels, and associating the annotated content with the corresponding labels.
Optionally, obtaining the regular pool from the annotation data in S122 comprises:
extracting the sentence in which the annotated data is located, removing the annotated data from the sentence, dynamically generating a regular extraction sentence pattern, and storing the pattern in the regular pool.
Optionally, obtaining the annotation set from the annotation data in S122 comprises:
performing sequence annotation on the annotated data with the BIO annotation method to obtain the annotation set.
In the embodiment of the invention, the machine learning method is convenient, fast, accurate and efficient. At the same time, the output results can be manually proofread, so that the system can complete multiple rounds of automatic learning, which saves time and optimizes the system. Using the concept of a dynamic regular pool, the data annotation results are reused: they are extracted into expressions and stored to form an expression set (which can assist dependency syntactic analysis), and the expression to apply is determined according to its number of occurrences and its weight in the expression set. In the document data structuring process, model prediction plays two roles: (1) directly obtaining a result, and (2) locating the sentence in which the result lies. The result is scored to decide which role is executed; if role (2) is executed, the dynamic regular pool is called to obtain accurate structured data.
In addition, when data structuring is carried out, the sentence in which the target entity lies is located first, and the target entity in that sentence is then extracted using the content of the regular pool. The text segment to be extracted can thus be extracted effectively, entities at other positions in the document are not mistakenly identified, and identification accuracy is improved.
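The choice between the two roles of model prediction can be sketched roughly as follows. The source does not say how the result is scored or which score triggers which role, so the function name `choose_result`, the `threshold` of 0.9 and the rule that a high score means the prediction is used directly are all assumptions made only for illustration.

```python
import re

def choose_result(predicted_text, score, sentence, patterns, threshold=0.9):
    """Pick between the two roles of model prediction (hypothetical rule).

    Role 1: the score is high enough, so the predicted text is used directly.
    Role 2: the prediction only locates the sentence, and the patterns from
    the regular pool extract the structured content from that sentence.
    """
    if score >= threshold:
        return predicted_text
    for pattern in patterns:
        match = re.search(pattern, sentence)
        if match:
            return match.group(1)
    # No pattern matched: fall back to the raw prediction.
    return predicted_text

sentence = "The clinical chief complaint is dysphagia."
patterns = [r"chief complaint is (.+?)\."]
direct = choose_result("dysphagia", 0.95, sentence, patterns)
refined = choose_result("dysphag", 0.40, sentence, patterns)
```

Here the low-scoring partial prediction is replaced by the span that the pattern captures from the located sentence.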
The invention addresses the problems in the prior art that natural language processing is not closely integrated with traditional Chinese medicine and that existing traditional Chinese medicine literature data structuring systems have defects and shortcomings. In the method, data annotators annotate traditional Chinese medicine clinical documents with the BIO annotation method, and a neural network is constructed from the annotation results by deep learning to train an NLP (Natural Language Processing) data model. According to the ontology-related attribute concepts of the traditional Chinese medicine literature field, dynamic regular matching is applied to the results of the NLP model to locate the coordinates of the target content and extract the relevant document data, and the extraction results are manually proofread. The proofreading results are then organized into a standard knowledge base, which is used to improve the training source data for multiple rounds of training, so that the data mining results become more intelligent and more accurate.
As shown in FIG. 2, an embodiment of the present invention provides a method for constructing a document data structured model. The method is applied to an electronic device and includes:
S21, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set.
The sample data in the sample data set is split, and the keyword, date, header and number information in the split sample data is deleted.
In a possible implementation, as shown in FIG. 3, each traditional Chinese medicine clinical literature sample can be divided into three parts: abstract, data and method, and result. When these three parts are split, information such as keywords, dates, headers and numbers is filtered out, so that the traditional Chinese medicine clinical literature data takes the form Map<abstract, content>, Map<data and method, content>, Map<result, content>.
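The splitting and filtering described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the section titles, the `NOISE_PATTERNS` filters and the rule that everything before the first section title is a header are assumptions, since the source does not give the actual document layout.

```python
import re

# Hypothetical section headings; real documents may use other titles.
SECTION_TITLES = ("abstract", "data and method", "result")

# Illustrative noise filters: keyword/date lines and bracketed numbering.
NOISE_PATTERNS = [
    re.compile(r"^\s*(?:keywords?|date)\s*[:：].*$", re.IGNORECASE | re.MULTILINE),
    re.compile(r"\[\d+\]"),
]

def clean(text):
    """Remove noise and collapse whitespace."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("", text)
    return " ".join(text.split())

def split_document(raw):
    """Return a dict mapping each section title to its cleaned content."""
    sections = {}
    current = None
    for line in raw.splitlines():
        title = line.strip().lower().rstrip(":：")
        if title in SECTION_TITLES:
            current = title
            sections[current] = []
        elif current is not None:  # lines before the first title are dropped
            sections[current].append(line)
    return {title: clean("\n".join(lines)) for title, lines in sections.items()}

sample = """Journal of TCM, page header
Abstract:
Objective: observe the effect [1] of acupuncture.
Keywords: acupuncture; dysphagia
Date: 2021-11-15
Data and method:
60 patients were enrolled.
Result:
The total effective rate was 90%.
"""
doc = split_document(sample)
```

The resulting dictionary plays the role of the Map<section, content> structure, with one entry per part.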
S22, performing data annotation on the preprocessed sample data set.
In one possible implementation, labels are first defined and ordered; the content of each label is a description of the content to be structured, as shown in FIG. 4. The content to be structured is then selected and annotated in each of the three parts of the preprocessed sample data, and the annotated content is associated with the corresponding label.
S23, obtaining a regular pool from the annotation data.
The sentence in which the annotated data is located is extracted, the annotated data is removed from the sentence, a regular extraction sentence pattern is dynamically generated, and the pattern is stored in the regular pool.
In one possible embodiment, the annotation results are extracted as regular sentence patterns, as shown in FIG. 5. For example, in the sentence "the clinical chief complaint is dysphagia and dysvocalization", dysphagia and dysvocalization are the target content, and the regular sentence pattern is extracted from the remaining text, "the clinical chief complaint is".
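One way to read the step above is that the annotated target is swapped for a capture group and the rest of the sentence becomes a regular expression. The sketch below assumes exactly that; the class name `RegularPool`, the lazy `(.+?)` group and the rule of trying the most frequent pattern first are illustrative choices, not details taken from the source.

```python
import re
from collections import Counter

class RegularPool:
    """Stores extraction patterns and how often each one was generated."""

    def __init__(self):
        self.counts = Counter()

    def add(self, sentence, target):
        """Replace the annotated target in the sentence with a capture group."""
        start = sentence.index(target)  # raises ValueError if target is absent
        prefix = re.escape(sentence[:start])
        suffix = re.escape(sentence[start + len(target):])
        pattern = prefix + "(.+?)" + (suffix if suffix else "$")
        self.counts[pattern] += 1
        return pattern

    def extract(self, sentence):
        """Try patterns from most to least frequent; return the first capture."""
        for pattern, _ in self.counts.most_common():
            match = re.search(pattern, sentence)
            if match:
                return match.group(1)
        return None

pool = RegularPool()
pattern = pool.add(
    "The clinical chief complaint is dysphagia and dysvocalization.",
    "dysphagia and dysvocalization",
)
found = pool.extract("The clinical chief complaint is cough and fever.")
```

A pattern learned from one annotated sentence can then pull the target content out of a new sentence with the same wording, which is the reuse of annotation results that the regular pool is meant to provide.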
S24, obtaining an annotation set from the annotation data.
Sequence annotation is performed on the annotated data with the BIO annotation method to obtain the annotation set, which is then divided into a training set, a verification set and a test set.
In a possible embodiment, as shown in FIG. 6, in the preprocessed traditional Chinese medicine clinical literature sample data one sequence represents one sentence after splitting, and a structured entity represents an element within a sentence.
BIO annotation labels each element as "B-X", "I-X" or "O", where "B-X" indicates that the fragment containing the element is of type X and the element is at the beginning of the fragment; "I-X" indicates that the fragment containing the element is of type X and the element is inside the fragment rather than at its beginning; and "O" indicates that the element does not belong to any type.
The structured entities are extracted and placed in a word list, the document data is converted into a single line, occurrences of the word-list entries are found and replaced with B-label, I-label and O tags, and the text is then broken into individual elements to construct the annotation set.
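The word-list lookup and BIO tagging described above can be sketched as follows. This is a simplified illustration that works on whitespace tokens and a hypothetical `entities` mapping; Chinese text would more likely be tagged per character, and the label name `SYMPTOM` is invented for the example.

```python
def bio_tag(tokens, entities):
    """Tag tokens with B-X / I-X / O.

    entities maps a tuple of tokens (one structured entity) to its label X.
    Longer entities are tried first so that they are not split.
    """
    tags = ["O"] * len(tokens)
    ordered = sorted(
        ((phrase, label) for phrase, label in entities.items() if phrase),
        key=lambda item: -len(item[0]),
    )
    i = 0
    while i < len(tokens):
        for phrase, label in ordered:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return tags

tokens = "the chief complaint is dysphagia and dysvocalization".split()
entities = {("dysphagia", "and", "dysvocalization"): "SYMPTOM"}
tags = bio_tag(tokens, entities)
```

Each token ends up paired with exactly one tag, which is the form a sequence labeling model is usually trained on.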
When training the NER (Named Entity Recognition) neural network model, the BIO annotation set needs to be divided into a training set, a verification set and a test set, for example in a ratio of 7:2:1, so that metrics such as accuracy, precision, recall and F1 can be obtained to evaluate the quality of the model.
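The 7:2:1 split and the evaluation metrics can be sketched as follows. The shuffling seed, the function names and the choice to score at the entity level with (start, end, label) triples are assumptions for illustration; the source does not specify how the metrics are computed.

```python
import random

def split_dataset(samples, ratios=(7, 2, 1), seed=0):
    """Shuffle and split samples into training, verification and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )

def precision_recall_f1(gold, predicted):
    """Entity-level scores; gold and predicted are sets of (start, end, label)."""
    true_positive = len(gold & predicted)
    precision = true_positive / len(predicted) if predicted else 0.0
    recall = true_positive / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

train, val, test = split_dataset(range(100))
scores = precision_recall_f1(
    {(0, 2, "SYMPTOM"), (5, 6, "DRUG")},
    {(0, 2, "SYMPTOM"), (7, 8, "DRUG")},
)
```

With 100 samples the split yields 70, 20 and 10 items, and the toy prediction above has one correct entity out of two on both sides.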
S25, constructing a neural network model based on the self-attention Transformer, and performing named entity recognition training on the neural network model with the training set and the verification set to obtain the document data structured model.
In a feasible implementation, a Transformer-based NLP-NER model is constructed. NLP is an important direction in computer science and artificial intelligence. It studies theories and methods that enable effective communication between people and computers in natural language, and mainly comprises two parts, NLU (Natural Language Understanding) and NLG (Natural Language Generation).
The Transformer adopts an Encoder-Decoder architecture. Encoder-Decoder is a model framework, a general term for a class of algorithms rather than one specific algorithm: in the encoding stage the encoder converts the input sequence into a dense vector of fixed dimension, and in the decoding stage the target output is generated from that state. The architecture has great advantages in parallelism and long-range dependence, but analysis of the Transformer attention mechanism shows weaknesses in directionality, relative position and sparsity. Given the characteristics of structured traditional Chinese medicine clinical literature data, a simple improvement to the attention scoring function greatly improves the performance of the Transformer structure on the clinical literature NER task. The attention scoring function itself is prior art and is not described here; only the improvement is described: after softmax(Q dot K) is calculated, each point is weighted once. Part of the PyTorch code is as follows:
self.time_weighting = nn.Parameter(torch.ones(self.n_head, config.window_len, config.
...
att = F.softmax(att, dim=-1)  # original code
att = att * self.time_weighting[:, :T, :T]  # the only line that needs to be added
att = self.attn_drop(att)  # original code
This modification has two advantages. First, tokens at different distances should contribute differently to the current position. Second, for tokens near the beginning of the sequence, the overall self-attention weight should be reduced, because the observation window there is small and carries relatively little information. To address the weaknesses in directionality, relative position, and sparsity when handling traditional Chinese medicine literature data, each point is weighted once more after softmax(Q·K) is computed, which makes the prediction more accurate.
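The effect of the added weighting can be illustrated without PyTorch. The following is a minimal pure-Python sketch for a single attention head, not the patent's implementation: the function names, the list-of-lists layout, and the absence of any attention mask are assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def time_weighted_attention(scores, time_weighting):
    """Apply softmax to each row of raw q.k scores, then scale every entry.

    scores: T x T raw attention scores for one head.
    time_weighting: W x W learned weights with W >= T; only the top-left
    T x T block is used, mirroring the [:, :T, :T] slice in the fragment above.
    """
    T = len(scores)
    weighted = []
    for i in range(T):
        probs = softmax(scores[i])
        weighted.append([probs[j] * time_weighting[i][j] for j in range(T)])
    return weighted
```

With all weights equal to 1 (the `torch.ones` initialization in the fragment), the result is ordinary softmax attention. Once the weights are trained away from 1, rows no longer sum to 1, which is how the overall attention weight of a position can be reduced.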
Training is terminated, and the model obtained, when the accuracy and the F1 value of round n+1 are less than or equal to those of round n. The specific training parameters are as follows:
S26, inputting the test set into the document data structured model to obtain predicted target points, and extracting, according to the regular pool (the pool of regular-expression patterns), the sentence or sentences in which each predicted target point is located to obtain a predicted structured text.
In a feasible implementation, the test set sample data is first initialized, with sentence segmentation and length verification. The trained model then predicts on the test set. Each predicted item is located by its coordinates in the text, the sentence containing it is identified, and the coordinates are expanded to the sentence or sentences in which it lies, giving an intermediate result. The patterns in the regular pool are then applied to this result, and what they extract is the content to be structured.
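The two stages described above, expanding a predicted coordinate to its sentence and then applying patterns from the regular pool, can be sketched as follows. This is a hedged illustration only: the terminator set, the function names, the sample text, and the sample pattern are all hypothetical and do not come from the patent.

```python
import re

# Sentence terminators (Chinese and Western). Treating "." as a terminator
# also splits decimal numbers, which is a known limitation of this sketch.
SENTENCE_END = re.compile(r"[。！？；.!?;]")


def locate_sentence(text, start, end):
    """Expand a predicted character span [start, end) to its enclosing sentence(s)."""
    left = 0
    for m in SENTENCE_END.finditer(text, 0, start):
        left = m.end()  # position just after the last terminator before the span
    m = SENTENCE_END.search(text, max(end - 1, 0))
    right = m.end() if m else len(text)
    return text[left:right].strip()


def extract_with_pool(sentence, pool):
    """Return the first capture group of the first pool pattern that matches."""
    for pattern in pool:
        m = re.search(pattern, sentence)
        if m:
            return m.group(1)
    return None


text = ("The patient had a cough. Treatment used Guizhi decoction "
        "for 7 days. Symptoms improved.")
start = text.index("Guizhi")  # stands in for a coordinate predicted by the model
sentence = locate_sentence(text, start, start + len("Guizhi"))
pool = [r"Treatment used (.+?) for \d+ days"]
print(sentence)
print(extract_with_pool(sentence, pool))
```

Restricting the regular-expression match to the located sentence is what keeps similar-looking entities elsewhere in the document from being extracted.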
S27, manually proofreading the predicted structured text; if the prediction is inconsistent with the manual proofreading result, executing S21 again; and if they are consistent, outputting the document data structured model.
In a possible implementation, as shown in fig. 7, the structured content obtained through S21-S26 undergoes a second, manual proofreading. After proofreading, the process returns to S21 and repeats until the obtained content is consistent with the proofreading result, with the model updated in each round, so that the model keeps learning and becomes more accurate through this self-learning loop.
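The proofreading loop can be sketched as a small driver function. This is a simplified illustration, not the patent's procedure: `train`, `predict`, and `proofread` are hypothetical callables supplied by the caller, the round limit is an assumption, and for brevity the sketch checks predictions on the same samples it trains on rather than on a separate test set.

```python
def train_with_proofreading(samples, train, predict, proofread, max_rounds=5):
    """Retrain until predictions agree with manual proofreading.

    samples: list of (text, structured) pairs.
    Returns (model, rounds_used).
    """
    model = None
    for round_no in range(1, max_rounds + 1):
        model = train(samples)
        corrections = {}
        for text, _ in samples:
            predicted = predict(model, text)
            checked = proofread(text, predicted)  # stands in for the human step
            if checked != predicted:
                corrections[text] = checked
        if not corrections:
            return model, round_no  # consistent: output the model
        # Feed the proofread results back into the training data and repeat.
        samples = [(t, corrections.get(t, s)) for t, s in samples]
    return model, max_rounds
```

Capping the number of rounds avoids an endless loop when the model cannot reach full agreement with the proofreader; the source itself states no such limit.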
In the embodiment of the invention, a machine learning method is adopted, which makes the method convenient, fast, accurate, and efficient. At the same time, manually produced results can be fed back into the system, so that the system completes multiple rounds of automatic learning, saving time and optimizing the system. Using the concept of a dynamic regular pool, the data labeling results are reused: they are abstracted into expressions and stored to form an expression set (which can also assist dependency syntactic analysis), and the expression to apply is chosen according to its number of occurrences and its weight in the expression set. In the document data structuring process, model prediction plays two roles: (1) directly obtaining the result, and (2) locating the sentence in which the result lies. Which role applies is decided by scoring the result; if role (2) applies, the dynamic regular pool is called to obtain accurate structured data.
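One way to picture such a dynamic regular pool is sketched below. It is an assumption-laden illustration rather than the patent's design: the class name, the choice to generalize digit runs to `\d+`, and the definition of weight as an expression's share of all recorded observations are all hypothetical.

```python
import re
from collections import Counter


def _generalize(fragment):
    """Escape literal text, then let any run of digits match any number."""
    return re.sub(r"\d+", r"\\d+", re.escape(fragment))


class RegularPool:
    """Stores extraction patterns derived from annotated sentences."""

    def __init__(self):
        self.counts = Counter()

    def add(self, sentence, annotated):
        """Replace the annotated text with a capture group and record the pattern."""
        if annotated not in sentence:
            return None
        before, after = sentence.split(annotated, 1)
        pattern = _generalize(before) + "(.+?)" + _generalize(after)
        self.counts[pattern] += 1
        return pattern

    def ranked(self):
        """Patterns with their weights, most frequently observed first."""
        total = sum(self.counts.values())
        return [(p, c / total) for p, c in self.counts.most_common()]

    def extract(self, sentence):
        """Try patterns in rank order; return the first captured text."""
        for pattern, _ in self.ranked():
            m = re.fullmatch(pattern, sentence)
            if m:
                return m.group(1)
        return None
```

For example, after adding the annotated sentence "Treatment used Guizhi decoction for 7 days." with annotation "Guizhi decoction", the pool can extract the formula name from a new sentence with the same shape but a different formula and duration.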
In addition, during data structuring the sentence containing the target entity is located first, and the target entity is then extracted from that sentence using the patterns in the regular pool. The text segment to be extracted is thus obtained effectively, entities at other positions in the document are not mistakenly identified, and recognition accuracy improves.
The invention addresses the problems in the prior art that natural language processing is not closely integrated with traditional Chinese medicine and that systems for structuring traditional Chinese medicine literature data are lacking or deficient. In the method, data labeling personnel annotate traditional Chinese medicine clinical documents using the BIO labeling method, and the labeling results are used to construct a neural network by deep learning and train an NLP (Natural Language Processing) data model. Based on the ontology-related attribute concepts of the traditional Chinese medicine literature domain, dynamic regular matching is applied to the output of the NLP model to locate the coordinates of the target content and extract the relevant document data, and the extraction result is manually proofread. The proofreading results are then organized into a standard knowledge base, which is used to improve the training source data for multiple rounds of training, making the data mining results more intelligent and accurate.
As shown in fig. 8, an embodiment of the present invention provides an apparatus 800 for structuring data of clinical literature of traditional Chinese medicine based on deep learning, where the apparatus 800 is applied to implement a method for structuring data of clinical literature of traditional Chinese medicine based on deep learning, and the apparatus 800 includes:
An obtaining module 810, configured to obtain a document to be processed.
An input module 820, configured to input the document to be processed into a document data structured model constructed in advance.
An output module 830, configured to obtain a structured text based on the document to be processed and the document data structured model.
Optionally, the input module 820 is further configured to:
S21, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set.
S22, performing data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a verification set, and a test set.
S23, constructing a neural network model based on the self-attention Transformer, and performing named entity recognition training on the neural network model with the training set and the verification set to obtain a document data structured model.
S24, inputting the test set into the document data structured model to obtain predicted target points, and extracting, according to the regular pool, the sentence or sentences in which each predicted target point is located to obtain a predicted structured text.
S25, manually proofreading the predicted structured text; if the prediction is inconsistent with the manual proofreading result, executing S21 again; and if they are consistent, outputting the document data structured model.
Optionally, the input module 820 is further configured to:
Split the sample data in the sample data set, and delete the keyword, date, header, and numbering information from the split sample data.
Optionally, the input module 820 is further configured to:
Set labels and order them according to the content of the preprocessed sample data set, where the content of each label is a description of the structured content.
Annotate the content of the preprocessed sample data set according to the labels, and associate the annotated content with the corresponding label.
Optionally, the input module 820 is further configured to:
Extract the sentence in which the annotated data is located, remove the annotated data from the sentence, dynamically generate a regular extraction sentence pattern, and store the pattern in the regular pool.
Optionally, the input module 820 is further configured to:
Perform sequence annotation on the annotated data using the BIO annotation method to obtain an annotation set.
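The BIO sequence annotation named in the last item can be sketched as a conversion from annotated character spans to per-character tags. This is a minimal illustration under assumptions: character-level tagging, the `(start, end, label)` span format with an exclusive end, and the label name `FORMULA` are hypothetical and not specified in the source.

```python
def bio_tags(text, annotations):
    """Convert annotated spans into per-character BIO tags.

    annotations: list of (start, end, label) character spans, end exclusive.
    Returns a list of (character, tag) pairs.
    """
    tags = ["O"] * len(text)  # O: outside any entity
    for start, end, label in annotations:
        if start >= end:
            continue  # skip empty spans
        tags[start] = "B-" + label  # B: first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label  # I: inside the entity
    return list(zip(text, tags))


# "患者服用桂枝汤" ("the patient took Guizhi decoction"); the formula name
# occupies characters 4 to 6.
tagged = bio_tags("患者服用桂枝汤", [(4, 7, "FORMULA")])
```

Tagging character by character is a common choice for Chinese text because it avoids a separate word segmentation step.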
Fig. 9 is a schematic structural diagram of an electronic device 900 according to an embodiment of the present invention. The electronic device 900 may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPUs) 901 and one or more memories 902, where the memory 902 stores at least one instruction that is loaded and executed by the processor 901 to implement the following steps of the deep learning-based method for structuring traditional Chinese medicine clinical literature data:
S1, acquiring a document to be processed.
S2, inputting the document to be processed into a document data structured model constructed in advance.
S3, obtaining a structured text based on the document to be processed and the document data structured model.
In an exemplary embodiment, there is also provided a computer readable storage medium, such as a memory, comprising instructions executable by a processor in a terminal to perform the above deep learning based method of structuring data of clinical literature of traditional Chinese medicine. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (10)
1. A deep learning-based method for structuring traditional Chinese medicine clinical literature data, characterized by comprising the following steps:
S1, acquiring a document to be processed;
S2, inputting the document to be processed into a document data structured model constructed in advance;
and S3, obtaining a structured text based on the document to be processed and the document data structured model.
2. The method according to claim 1, wherein the construction process of the document data structured model in S2 comprises:
S21, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set;
S22, performing data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a verification set, and a test set;
S23, constructing a neural network model based on the self-attention Transformer, and performing named entity recognition training on the neural network model with the training set and the verification set to obtain a document data structured model;
S24, inputting the test set into the document data structured model to obtain predicted target points, and extracting, according to the regular pool, the sentence or sentences in which each predicted target point is located to obtain a predicted structured text;
S25, manually proofreading the predicted structured text; if the prediction is inconsistent with the manual proofreading result, executing S21; and if they are consistent, outputting the document data structured model.
3. The method according to claim 2, wherein the preprocessing of the sample data set in S21 comprises:
splitting the sample data in the sample data set, and deleting the keyword, date, header, and numbering information from the split sample data.
4. The method according to claim 2, wherein the data annotation of the preprocessed sample data set in S22 comprises:
setting labels and ordering them according to the content of the preprocessed sample data set, wherein the content of each label is a description of structured content;
and annotating the content of the preprocessed sample data set according to the labels, and associating the annotated content with the corresponding label.
5. The method according to claim 2, wherein obtaining the regular pool from the resulting annotation data in S22 comprises:
extracting the sentence in which the annotation data is located, removing the annotation data from the sentence, dynamically generating a regular extraction sentence pattern, and storing the regular extraction sentence pattern in the regular pool.
6. The method according to claim 2, wherein obtaining the annotation set from the resulting annotation data in S22 comprises:
performing sequence annotation on the annotation data using the BIO annotation method to obtain the annotation set.
7. A deep learning-based device for structuring traditional Chinese medicine clinical literature data, characterized in that the device comprises:
an acquisition module, configured to acquire a document to be processed;
an input module, configured to input the document to be processed into a document data structured model constructed in advance;
and an output module, configured to obtain a structured text based on the document to be processed and the document data structured model.
8. The apparatus of claim 7, wherein the construction process of the document data structured model comprises:
S21, acquiring a sample data set of traditional Chinese medicine clinical literature, and preprocessing the sample data set;
S22, performing data annotation on the preprocessed sample data set, obtaining a regular pool and an annotation set from the resulting annotation data, and dividing the annotation set into a training set, a verification set, and a test set;
S23, constructing a neural network model based on the self-attention Transformer, and performing named entity recognition training on the neural network model with the training set and the verification set to obtain a document data structured model;
S24, inputting the test set into the document data structured model to obtain predicted target points, and extracting, according to the regular pool, the sentence or sentences in which each predicted target point is located to obtain a predicted structured text;
S25, manually proofreading the predicted structured text; if the prediction is inconsistent with the manual proofreading result, executing S21; and if they are consistent, outputting the document data structured model.
9. The apparatus according to claim 7, wherein the preprocessing of the sample data set in S21 comprises:
splitting the sample data in the sample data set, and deleting the keyword, date, header, and numbering information from the split sample data.
10. The apparatus of claim 7, wherein the data annotation of the preprocessed sample data set comprises:
setting labels and ordering them according to the content of the preprocessed sample data set, wherein the content of each label is a description of structured content; and annotating the content of the preprocessed sample data set according to the labels, and associating the annotated content with the corresponding label.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111349067.2A CN114139610B (en) | 2021-11-15 | 2021-11-15 | Deep learning-based traditional Chinese medicine clinical literature data structuring method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114139610A (en) | 2022-03-04 |
CN114139610B (en) | 2024-04-26 |
Family
ID=80394333
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111349067.2A Active CN114139610B (en) | 2021-11-15 | 2021-11-15 | Deep learning-based traditional Chinese medicine clinical literature data structuring method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114139610B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116644719A (en) * | 2023-05-29 | 2023-08-25 | 南通大学 | Element coding method for clinical research literature and application of element coding method in diabetic retinopathy |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107153664A (en) * | 2016-03-04 | 2017-09-12 | 同方知网(北京)技术有限公司 | A kind of method flow that research conclusion is simplified based on the scientific and technical literature mark that assemblage characteristic is weighted |
CN107193798A (en) * | 2017-05-17 | 2017-09-22 | 南京大学 | A kind of examination question understanding method in rule-based examination question class automatically request-answering system |
US20170315984A1 (en) * | 2016-04-29 | 2017-11-02 | Cavium, Inc. | Systems and methods for text analytics processor |
CN108491383A (en) * | 2018-03-14 | 2018-09-04 | 昆明理工大学 | A kind of Thai sentence cutting method based on maximum entropy disaggregated model and the correction of Thai syntax rule |
CN110032649A (en) * | 2019-04-12 | 2019-07-19 | 北京科技大学 | Relation extraction method and device between a kind of entity of TCM Document |
CN110866113A (en) * | 2019-09-30 | 2020-03-06 | 浙江大学 | Text classification method based on sparse self-attention mechanism fine-tuning Bert model |
CN111382575A (en) * | 2020-03-19 | 2020-07-07 | 电子科技大学 | Event extraction method based on joint labeling and entity semantic information |
CN111428036A (en) * | 2020-03-23 | 2020-07-17 | 浙江大学 | Entity relationship mining method based on biomedical literature |
CN111834012A (en) * | 2020-07-14 | 2020-10-27 | 中国中医科学院中医药信息研究所 | Traditional Chinese medicine syndrome diagnosis method and device based on deep learning and attention mechanism |
CN112487134A (en) * | 2020-12-08 | 2021-03-12 | 武汉大学 | Scientific and technological text problem extraction method based on extremely simple abstract strategy |
CN112685513A (en) * | 2021-01-07 | 2021-04-20 | 昆明理工大学 | Al-Si alloy material entity relation extraction method based on text mining |
CN113220768A (en) * | 2021-06-04 | 2021-08-06 | 杭州投知信息技术有限公司 | Resume information structuring method and system based on deep learning |
CN113420126A (en) * | 2021-06-30 | 2021-09-21 | 北京法意科技有限公司 | Legal rule map construction method and system based on legal text |
CN113505244A (en) * | 2021-09-10 | 2021-10-15 | 中国人民解放军总医院 | Knowledge graph construction method, system, equipment and medium based on deep learning |
Non-Patent Citations (3)
Title |
---|
刘华云: "针刺临床基础研究文献数据库人机协同构建方法研究", 《中国优秀硕士学位论文全文数据库医药卫生科技辑》, no. 02, 28 February 2023 (2023-02-28), pages 056 - 1083 * |
李欣 等: "基于正则抽取的竹种数据结构化方法研究", 《计算机技术与发展》, vol. 28, no. 06, 8 February 2018 (2018-02-08), pages 147 - 150 * |
肖瑞 等: "基于 BiLSTM-CRF 的中医文本命名实体识别", 《世界科学技术-中医药现代化》, vol. 22, no. 7, pages 2504 - 2510 * |
Also Published As
Publication number | Publication date |
---|---|
CN114139610B (en) | 2024-04-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||