CN113808752A - Medical document identification method, device and equipment - Google Patents

Info

Publication number
CN113808752A
Authority
CN
China
Prior art keywords
medical
data
document
information
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011401403.9A
Other languages
Chinese (zh)
Inventor
徐滔伶
闾磊
樊淼淼
陈吟秋
钟应佳
熊亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Medical Science And Technology Co ltd
Original Assignee
Sichuan Medical Science And Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Medical Science And Technology Co ltd
Priority to CN202011401403.9A
Publication of CN113808752A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a medical document identification method, which comprises: acquiring document information to be identified; and inputting the document information to be identified into a pre-trained medical document identification model to obtain document identification information, wherein the medical document identification model is trained on labeled data obtained by reverse labeling document data to be processed through a medical structured database. The invention uses the medical structured database to reverse label a large amount of unlabeled medical data, so that a large number of labeled medical samples can be obtained without manual intervention. The invention also provides a medical document identification apparatus, device and computer-readable storage medium with the same beneficial effects.

Description

Medical document identification method, device and equipment
Technical Field
The present invention relates to the field of medical assistance, and in particular, to a medical document identification method, apparatus, device, and computer-readable storage medium.
Background
In recent years, as hospitals have continued to deepen their digital construction, information such as electronic medical records, in-hospital and out-of-hospital diagnosis and treatment data, health management, online diagnosis and treatment, biomedical literature, educational materials, news reports and industry data has grown year by year and now amounts to a considerable volume of text data. Besides the medical record data produced during a patient's hospital visit (admission records, disease course records, discharge records, communication records, operation records, and so on), this text data also includes online diagnosis and treatment interaction data, health management and consultation data, medical encyclopedias, medical materials, medical literature, medical news and other life and health information. How to extract valuable content from unstructured medical document text has become a research hotspot in the medical field in recent years.
The Named Entity Recognition (NER) task was first introduced at the Message Understanding Conference (MUC) evaluations and is regarded as one of the basic tasks of information extraction; it provides a theoretical basis for constructing knowledge bases and knowledge graphs. In the medical field, Clinical Named Entity Recognition (CNER) is an important task that recognizes and classifies clinical terms in medical documents and thereby provides technical support for intelligent medical treatment. However, existing medical document identification remains entirely within supervised machine learning, and therefore needs high-quality, large-scale labeled clinical data. Because document data is subject to privacy constraints, such labeled data is difficult to obtain; the data that is available is unlabeled, and at present unlabeled data can only be turned into labeled data by manual labeling, which is too costly.
Therefore, how to solve the lack of labeled medical samples and the high cost of manual labeling in the prior art is a problem to be solved urgently by those skilled in the art.
Disclosure of Invention
The invention aims to provide a medical document identification method, apparatus, device and computer-readable storage medium, so as to solve the lack of labeled medical samples and the high cost of manual labeling in the prior art.
In order to solve the above technical problem, the present invention provides a medical document identification method, including:
acquiring document information to be identified;
inputting the document information to be identified into a pre-trained medical document identification model to obtain document identification information; the medical document identification model is trained on labeled data obtained by reverse labeling document data to be processed through a medical structured database.
Optionally, in the medical document identification method, obtaining the labeled data by reverse labeling the document data to be processed through the medical structured database includes:
acquiring a label library and a rule library of the medical structured database;
and performing word extraction and feature data labeling on the document data to be processed through a structuring unit of the medical structured database to obtain the labeled data.
Optionally, in the medical document identification method, performing word extraction and feature data labeling on the document data to be processed through the structuring unit of the medical structured database to obtain the labeled data includes:
performing word extraction and feature data labeling on the document data to be processed through the structuring unit of the medical structured database to obtain rough labeling data;
determining, from the rough labeling data, character attribute information of the Chinese characters corresponding to the rough labeling data; the character attribute information includes at least one of radical feature information, pinyin feature information, part-of-speech feature information, or word boundary feature information;
and determining the labeled data according to the rough labeling data and the character attribute information.
Optionally, in the medical document identification method, performing word extraction and feature data labeling on the document data to be processed through the structuring unit of the medical structured database to obtain the rough labeling data includes:
performing word extraction and feature data labeling on the document data to be processed through the structuring unit of the medical structured database to obtain number-containing labeled data;
and setting the digits of the detection value class data in the number-containing labeled data to zero to obtain the rough labeling data.
Optionally, in the medical document identification method, after obtaining the document identification information, the method further includes:
and updating the medical structured database according to the document identification information.
Optionally, in the medical document identification method, inputting the document information to be identified into the pre-trained medical document identification model to obtain the document identification information includes:
reverse labeling the document information to be identified through the medical structured database to obtain labeled data to be identified;
inputting the document information to be identified into a pre-trained language model to obtain a context semantic feature vector;
and obtaining the document identification information through a downstream language model according to the context semantic feature vector.
Optionally, in the medical document identification method, the CNN layer in the downstream language model is an ID-CNN layer.
A medical document identification apparatus comprising:
the acquisition module is used for acquiring the information of the document to be identified;
the identification module is used for inputting the document information to be identified into a pre-trained medical document identification model to obtain document identification information; the medical document identification model is trained on labeled data obtained by reverse labeling document data to be processed through a medical structured database.
A medical document identification device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the medical document identification method as described in any one of the above when said computer program is executed.
A computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the medical document identification method as claimed in any one of the preceding claims.
The medical document identification method provided by the invention comprises: acquiring document information to be identified; and inputting the document information to be identified into a pre-trained medical document identification model to obtain document identification information, wherein the medical document identification model is trained on labeled data obtained by reverse labeling document data to be processed through a medical structured database. The invention uses the medical structured database to reverse label a large amount of unlabeled medical data, so that a large number of labeled medical samples can be obtained without manual intervention. The invention also provides a medical document identification apparatus, device and computer-readable storage medium with the same beneficial effects.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart illustrating an embodiment of a medical document identification method according to the present invention;
FIG. 2 is a flow chart illustrating another embodiment of a medical document identification method according to the present invention;
FIG. 3 is a flow chart illustrating another embodiment of the medical document identification method according to the present invention;
FIG. 4 is a flowchart illustrating a medical document identification method according to another embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of a medical document identification apparatus according to the present invention;
FIG. 6 is an example of a BERT model input for one embodiment of the medical document identification method provided by the present invention;
FIG. 7 is a schematic structural diagram of a BERT model according to an embodiment of the medical document identification method provided by the present invention;
FIG. 8 is a schematic structural diagram of a medical document identification model according to an embodiment of the medical document identification method provided by the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The core of the present invention is to provide a medical document identification method, a flow diagram of an embodiment of which is shown in FIG. 1, including:
S101: Acquire the document information to be identified.
The document information to be identified can be input manually or sent by an upper computer.
S102: Input the document information to be identified into a pre-trained medical document identification model to obtain document identification information; the medical document identification model is trained on labeled data obtained by reverse labeling document data to be processed through a medical structured database.
The labeled data obtained by reverse labeling in the invention is produced by labeling the document data to be processed with the medical structured database; of course, other labeling methods can be selected according to the actual situation.
As a preferred embodiment, the document information to be identified may be decoded by the Viterbi algorithm to obtain decoded document information, and the decoded document information is input into the pre-trained medical document identification model to obtain the document identification information. Decoding with the Viterbi algorithm during prediction yields an optimal labeling sequence, which greatly reduces the difficulty of subsequent screening.
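The Viterbi step can be illustrated with a minimal sketch. The source does not say where the scores come from, so the per-position emission scores, the transition scores, the tag set and the function name `viterbi_decode` are all assumptions made for illustration. The sketch only shows how dynamic programming picks the best whole sequence rather than the best tag at each position.

```python
from typing import Dict, List

def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for per-position emission scores."""
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical scores: "O" followed by "I" is heavily penalised (invalid in BIO).
tags = ["B", "I", "O"]
transitions = {p: {c: 0.0 for c in tags} for p in tags}
transitions["O"]["I"] = -100.0
emissions = [
    {"B": 1.0, "I": 0.0, "O": 1.2},
    {"B": 0.0, "I": 2.0, "O": 0.5},
    {"B": 0.0, "I": 0.1, "O": 1.0},
]
decoded = viterbi_decode(emissions, transitions, tags)  # ['B', 'I', 'O']
```

Picking the best tag per position would give "O, I, O" here, which is not a valid BIO sequence; the decoded path avoids it because the transition penalty is taken into account.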
In addition, after obtaining the document identification information, the method further includes:
and updating the medical structured database according to the document identification information.
The document identification information may include some vocabularies which do not exist in the medical structured database, and after the document identification information is obtained, the new vocabularies can be added into the medical structured database to update the database, so that the database is enriched, and a data basis is laid for subsequent relation extraction, knowledge graph construction, auxiliary medical decision making and intelligent medical question and answer.
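A minimal sketch of this update step follows, assuming the database can be viewed as a mapping from entity type to a set of known terms. That layout, the function name `update_database` and the sample terms are hypothetical; the source does not describe the storage format.

```python
from typing import Dict, List, Set, Tuple

def update_database(
    database: Dict[str, Set[str]],
    recognized: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Add (entity_text, entity_type) pairs missing from the database; return the additions."""
    added = []
    for text, entity_type in recognized:
        known = database.setdefault(entity_type, set())
        if text not in known:
            known.add(text)
            added.append((text, entity_type))
    return added

# Hypothetical database content and recognition output.
db = {"Disease": {"糖尿病"}}
new_terms = update_database(
    db, [("糖尿病", "Disease"), ("胆囊炎", "Disease"), ("腹腔镜", "Operation")]
)
```

Returning the list of additions makes it easy to route new vocabulary through a review step before it is used for further reverse labeling.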
The medical document identification method provided by the invention comprises: acquiring document information to be identified; and inputting the document information to be identified into a pre-trained medical document identification model to obtain document identification information, wherein the medical document identification model is trained on labeled data obtained by reverse labeling document data to be processed through a medical structured database. The invention uses the medical structured database to reverse label a large amount of unlabeled medical data, so that a large number of labeled medical samples can be obtained without manual intervention.
On the basis of the first specific embodiment, the method for obtaining the labeled data is further improved to obtain a second specific embodiment, a flow diagram of which is shown in FIG. 2, including:
S201: Acquire a label library and a rule library of the medical structured database.
The label library and the rule library may be built on a pre-established structured database that organizes medicines, adverse reactions, treatment schemes, treatment paths, patient signs, detection and diagnosis, screening and prognosis, and the like in a tree structure for a specific disease. Upper-layer structured unit information is used to tag the part of speech of lower-layer structured unit information, so that each piece of unit information carries one or more part-of-speech tags.
Furthermore, the label library and the rule library may use the BIO labeling rule and define the entity types as "level-Level (lev)", "detection value-Test_value (tsv)", "test class-Test (tes)", "anatomy class-Anatomy (ant)", "degree-Amount (amo)", "disease class-Disease (dis)", "drug class-Drug (dru)", "treatment method-Treatment (tre)", "cause-Reason (rea)", "method class-Method (met)", "duration-Duration (dur)", "surgery class-Operation (ope)", "frequency-Frequency (fre)", "symptom class-Symptom (sym)" and "side effect-SideEff"; other entity types may be labeled according to the actual situation. For an example sentence such as "the heart border is not enlarged, the heart rate is 76 times/min, regular and strong, with no pathological murmur.", BIO labeling yields a character-level sequence of the form "heart B-SYM, border I-SYM, not I-SYM, en- I-SYM, -larged I-SYM, ',' O, …" (each gloss stands for one Chinese character of the original sentence). B marks the first character of an entity, I marks a subsequent character of the entity, and O marks a character that does not belong to any entity, such as the punctuation marks in the sentence above. The annotation data generated after BIO labeling is stored in five columns: the first column is the annotation ID, the second the entity category, the third the start position of the entity word, the fourth the end position of the entity word, and the fifth the entity word itself, for example "T4 Disease 1845 1850 type 2 diabetes mellitus".
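The five-column storage can be sketched as a conversion from character-level BIO tags to annotation rows. This is an illustration under assumptions: the function name `bio_to_records`, the sequential "T1, T2, …" IDs, the exclusive end offset and the sample sentence are not taken from the source.

```python
from typing import List, Optional, Tuple

Record = Tuple[str, str, int, int, str]

def bio_to_records(chars: List[str], tags: List[str]) -> List[Record]:
    """Collapse character-level BIO tags into (ID, category, start, end, text) rows."""
    records: List[Record] = []
    start: Optional[int] = None
    category: Optional[str] = None

    def close(end: int) -> None:
        # Emit the entity that is currently open, if any (end offset is exclusive).
        nonlocal start, category
        if start is not None:
            text = "".join(chars[start:end])
            records.append((f"T{len(records) + 1}", category, start, end, text))
        start, category = None, None

    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            close(i)
            start, category = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == category:
            continue  # still inside the current entity
        else:
            close(i)
    close(len(tags))
    return records

# Hypothetical sample: "患2型糖尿病。" with one disease entity.
chars = list("患2型糖尿病。")
tags = ["O", "B-DIS", "I-DIS", "I-DIS", "I-DIS", "I-DIS", "O"]
records = bio_to_records(chars, tags)
```

An "I-" tag that does not continue an open entity of the same category is treated like "O" here; a stricter variant could raise an error instead.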
S202: Perform word extraction and feature data labeling on the document data to be processed through the structuring unit of the medical structured database to obtain the labeled data.
After the labeled data is obtained, a verification step can be added: the labeled data is screened and verified a second time, by machine or by a person, to improve labeling accuracy.
The characteristic data comprises at least one of a part of speech/attribute/hierarchy/weight/relationship/source of a word/phrase/paragraph.
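One simple way to picture reverse labeling is dictionary matching: terms already stored in the structured database are located in raw text and turned into BIO tags. The sketch below assumes a flat term-to-type lexicon and a longest-match strategy; neither is specified in the source, and `reverse_label` is a hypothetical name. The patent's label and rule libraries may be considerably richer.

```python
from typing import Dict, List

def reverse_label(text: str, lexicon: Dict[str, str]) -> List[str]:
    """Tag each character with BIO labels using longest-match lookup in a term lexicon."""
    tags = ["O"] * len(text)
    max_len = max(map(len, lexicon), default=0)
    i = 0
    while i < len(text):
        # Try the longest candidate first so longer terms win over their substrings.
        for length in range(min(max_len, len(text) - i), 0, -1):
            entity_type = lexicon.get(text[i:i + length])
            if entity_type:
                tags[i] = f"B-{entity_type}"
                tags[i + 1:i + length] = [f"I-{entity_type}"] * (length - 1)
                i += length
                break
        else:
            i += 1  # no term starts at this character
    return tags

# Hypothetical lexicon drawn from a structured database.
lexicon = {"糖尿病": "DIS", "2型糖尿病": "DIS", "胰岛素": "DRU"}
labels = reverse_label("2型糖尿病用胰岛素", lexicon)
```

Because the labels come from existing structured data rather than from annotators, the output can be produced for any amount of unlabeled text, which is the point of the reverse labeling idea.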
As a preferred embodiment, S202 may include: performing word extraction and feature data labeling on the document data to be processed through the structuring unit of the medical structured database to obtain rough labeling data;
determining, from the rough labeling data, character attribute information of the Chinese characters corresponding to the rough labeling data; the character attribute information includes at least one of radical feature information, pinyin feature information, part-of-speech feature information, or word boundary feature information;
and determining the labeled data according to the rough labeling data and the character attribute information. In this embodiment, character attribute information is additionally extracted from the rough labeling data produced by word extraction and feature data labeling, which makes the model more robust. The radical and pinyin features of the Chinese characters are extracted with cnradical, and the part-of-speech and word boundary features are extracted with jieba word segmentation. Taking "diabetes" (糖尿病) as an example, the samples after extraction are:
"糖, B, n, 米, táng; 尿, M, n, 尸, niào; 病, E, n, 疒, bìng". The features are, in order, the entity character, the word boundary, the part of speech, the radical feature and the pinyin feature.
Still further, performing word extraction and feature data labeling on the document data to be processed through the structuring unit of the medical structured database to obtain the rough labeling data includes:
performing word extraction and feature data labeling on the document data to be processed through the structuring unit of the medical structured database to obtain number-containing labeled data;
and setting the digits of the detection value class data in the number-containing labeled data to zero to obtain the rough labeling data.
In this embodiment, all digits in entities labeled as detection value-Test_value (tsv) and the like are converted to 0, so that the model learns only the positional features of the numbers without being influenced by their values; for example, "76 times/min" is converted to "00 times/min". This increases the generalization ability of the model and improves its accuracy.
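The two preprocessing steps of this preferred embodiment can be sketched together: zeroing the digits inside detection-value spans, and expanding segmented words into per-character attribute rows. The source names cnradical and jieba for the second step; to stay self-contained, the sketch replaces them with small hand-written lookup tables and a pre-segmented input, which are assumptions, as are the function names and the "S" tag for single-character words.

```python
import re
from typing import Dict, List, Tuple

def zero_out_values(text: str, spans: List[Tuple[int, int]]) -> str:
    """Replace every digit inside the given [start, end) spans with '0'."""
    chars = list(text)
    for start, end in spans:
        # One digit maps to one '0', so the text length is unchanged.
        chars[start:end] = re.sub(r"\d", "0", text[start:end])
    return "".join(chars)

# Hypothetical lookup tables standing in for a radical/pinyin library.
RADICALS: Dict[str, str] = {"糖": "米", "尿": "尸", "病": "疒"}
PINYIN: Dict[str, str] = {"糖": "táng", "尿": "niào", "病": "bìng"}

def boundary_tags(word: str) -> List[str]:
    """B/M/E for the first/middle/last character; S (assumed) for a one-character word."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]

def char_features(segmented: List[Tuple[str, str]]) -> List[Tuple[str, str, str, str, str]]:
    """Expand (word, part_of_speech) pairs into (char, boundary, pos, radical, pinyin) rows."""
    rows = []
    for word, pos in segmented:
        for ch, boundary in zip(word, boundary_tags(word)):
            rows.append((ch, boundary, pos, RADICALS.get(ch, ""), PINYIN.get(ch, "")))
    return rows

coarse = zero_out_values("心率76次/分", [(2, 7)])  # span of the detection value
rows = char_features([("糖尿病", "n")])
```

Keeping the text length unchanged when zeroing digits means the start and end offsets stored in the annotation rows stay valid.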
FIG. 3 shows a flow diagram of the method for obtaining the labeled data as improved by the two preferred embodiments above, where S301, S302, S303, S304 and S305 are the steps of obtaining the labeled data in the preferred embodiments; S301 to S305 are not marked in the description above. When the labeled data obtained by reverse labeling in this embodiment is used to train the medical document identification model, it may be split at a ratio of 20% to 80%, with 80% of the labeled data used to train the model and 20% used to evaluate it.
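The 80%/20% split can be sketched as follows. Shuffling before the split and the fixed seed are assumptions added for reproducibility; the source only gives the ratio. `split_samples` is a hypothetical helper name.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def split_samples(
    samples: Sequence[T], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and split them into training and evaluation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train, evaluation = split_samples(range(10))
```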
On the basis of the first specific embodiment, the identification process applied by the medical document identification model to the document information to be identified is further limited to obtain a third specific embodiment, a flow diagram of which is shown in FIG. 4, including:
S401: Acquire the document information to be identified.
S402: Reverse label the document information to be identified through the medical structured database to obtain labeled data to be identified.
The labeled data to be identified can be obtained in the same way as the labeled data in the model training process of the second embodiment.
S403: Input the document information to be identified into a pre-trained language model to obtain a context semantic feature vector.
The pre-trained language model includes at least one of the BERT, RNN (including LSTM and GRU), attention-based RNN, TextCNN, GPT-2, and XLNet language models.
The following describes a process of obtaining the context semantic feature vector through the BERT language model, taking a process of pre-training the BERT language model with unlabeled medical document data as an example:
1) For example, the text "5 years ago, the patient underwent laparoscopic surgery at a local hospital for cholelithiasis with cholecystitis. Denies a history of infectious diseases such as hepatitis B and tuberculosis." needs to be processed as "[CLS] 5 years ago, the patient underwent laparoscopic surgery at a local hospital for cholelithiasis with cholecystitis. [SEP] Denies a history of infectious diseases such as hepatitis B and tuberculosis. [SEP]", where [CLS] and [SEP] represent the beginning identifier of a sentence and the sentence delimiter (at the end of the text, the terminator), respectively.
2) Since BERT performs a cloze task on masked words during pre-training, the example sentences above need mask (MASK) processing; for example, they are processed as "[CLS] 5 years ago, the patient underwent [MASK] surgery at a local hospital for cholelithiasis with cholecystitis. [SEP] Denies a history of infectious diseases such as hepatitis B and [MASK]. [SEP]". The mask operation follows this procedure:
2a. With a ratio of 70%, the mask operation is performed on the word.
2b. With a ratio of 15%, the word is replaced by a random word instead of the mask mark; for example, "cholecystitis" is randomly replaced by "tuberculosis".
2c. With a ratio of 15%, no mask operation is performed.
Of course, the ratios of steps 2a, 2b and 2c can be adjusted according to actual conditions.
3) Truncate and pad the preprocessed sentences: define a specified input length, truncate sentences that exceed it, and pad sentences that fall short of it. The input to BERT in this method is not necessarily one sentence; it may be a combination of several sentences or an incomplete sentence, and is therefore collectively referred to as a token.
4) Perform the following steps on the text data preprocessed in step 3) to form the input of the BERT model.
4a. First, initialize a word vector for the text data; the word vector can be a static word vector generated with word2vec or can be randomly initialized. Since BERT generates dynamic context semantic feature vectors during fine-tuning, the choice of initialization has no influence, so the method uses random initialization to obtain the word vectors, expressed as $EV = \{ev_0, ev_1, ev_2, \ldots, ev_i, \ldots, ev_n\}$, where $ev_i$ represents the word vector corresponding to the i-th word in the text.
4b. Then, since the input of BERT is a token composed of N sentences, the sentence to which each word belongs must be embedded to obtain a segment vector, expressed as $ES = \{0, 0, \ldots, 1, 1, \ldots, N, N\}$, where 0 indicates that the word belongs to the first sentence, and so on, with N indicating that the word belongs to the N-th sentence.
4c. Finally, position embedding is performed on the sentence to obtain a position vector $EP = \{ep_0, ep_1, ep_2, \ldots, ep_i, \ldots, ep_n\}$, where $ep_i$ represents the position vector corresponding to the i-th word in the text.
Steps 4a, 4b and 4c together form the input of the BERT model; FIG. 6 shows an input example of BERT.
5) Join the three embedding vectors from step 4) to obtain the final token input vector
$V = EV \oplus ES \oplus EP$,
where $\oplus$ is the vector join operation. V is then input into the BERT model for pre-training and fine-tuning. FIG. 7 is a schematic structural diagram of the BERT model, where Trm is the Transformer, the core component of BERT. Each token represented by the V vector is input into the BERT model; the Transformer at each layer outputs a corresponding hidden vector and passes it to the next layer, and finally the hidden state at each word position represents the predicted word.
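The preprocessing in steps 1) to 4) can be sketched without any deep learning library. The helper names, the "[PAD]" token, the segment id given to padding and the way candidate positions are passed to the masking function are assumptions; the source does not state which share of words is selected for the 2a/2b/2c choice, so the sketch takes the positions as an argument instead of guessing.

```python
import random
from typing import Dict, List

CLS, SEP, MASK, PAD = "[CLS]", "[SEP]", "[MASK]", "[PAD]"

def mask_tokens(tokens: List[str], positions: List[int], vocab: List[str],
                rng: random.Random, p_mask: float = 0.70,
                p_random: float = 0.15) -> List[str]:
    """Apply the 2a/2b/2c choice to the tokens at the given positions."""
    out = list(tokens)
    for i in positions:
        r = rng.random()
        if r < p_mask:
            out[i] = MASK               # 2a: replace with the mask mark
        elif r < p_mask + p_random:
            out[i] = rng.choice(vocab)  # 2b: replace with a random word
        # 2c: otherwise leave the token unchanged
    return out

def build_input(sentences: List[List[str]], max_len: int) -> Dict[str, list]:
    """Add [CLS]/[SEP], truncate or pad to max_len (>= 1), derive segment and position ids."""
    tokens, segments = [CLS], [0]
    for idx, sent in enumerate(sentences):
        tokens += sent + [SEP]
        segments += [idx] * (len(sent) + 1)
    tokens, segments = tokens[:max_len], segments[:max_len]
    pad = max_len - len(tokens)
    tokens += [PAD] * pad
    segments += [segments[-1]] * pad  # assumption: padding reuses the last segment id
    return {"tokens": tokens, "segments": segments, "positions": list(range(max_len))}

example = build_input([["a", "b"], ["c"]], max_len=8)
masked = mask_tokens(example["tokens"], [1, 4], ["x", "y"], random.Random(0),
                     p_mask=1.0, p_random=0.0)
```

The token, segment and position lists correspond to the three inputs EV, ES and EP before embedding; in a real model each would be looked up in an embedding table and the resulting vectors joined.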
S404: Obtain the document identification information through a downstream language model according to the context semantic feature vector.
The context semantic feature vector comprises at least one feature of radical feature/pinyin feature/part-of-speech feature/word boundary/part-of-speech/attribute/hierarchy/weight/relation/source.
The specific structural schematic diagram of the hybrid neural network of the medical document identification model can be shown in fig. 8, where the to-be-identified labeled data is Input into the Batch generator at the Input layer (Input) to obtain Batch data (the Size of the Batch data may be set by itself, for example, the Size of the Batch Size is 16);
then mapping the context semantic feature vector and the above-mentioned radical, pinyin, word boundary and part-of-speech features corresponding to the word to vectors (50-dimensional, 20-dimensional and 50-dimensional) of corresponding dimensions respectively in a word Embedding Layer (Embedding Layer), and obtaining a final vector through connection operation; it should be noted that, when the labeled data is the rough labeled data and the text attribute information, the rough labeled data is firstly input into the BERT model to obtain an excessive vector, and then the context semantic feature vector is obtained according to the excessive vector and the text attribute information; in the specific implementation mode, a large amount of document label-free data is used for pre-training and fine-tuning the BERT language model, so that the word vectors can be dynamically generated according to the context. Then, the BERT is used as an upstream language model for an embedded layer of a downstream specific task, so that the downstream language model can achieve good effect even under a small amount of labeled data.
The final vector is then input into the bidirectional long short-term memory layer (BiLSTM Layer). This layer uses the classical LSTM with its three-gate structure to extract the sequential features of the words in a sentence, and uses a bidirectional LSTM (BiLSTM) to strengthen the feature extraction capability of the LSTM. The embedded vector is processed by the following formula after passing through the layer, and the parameters of the layer are then updated through back propagation.
The vector output by the BiLSTM layer then enters a convolutional neural network layer (CNN Layer). The CNN layer used in the present embodiment differs from a conventional CNN layer: an iterated dilated convolutional layer (ID-CNN) is used here. A conventional CNN layer cannot effectively extract the global information of the text and loses further information after the pooling operation, so the ID-CNN is introduced to enlarge the receptive field of the CNN while keeping the information loss rate low.
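How dilation enlarges the receptive field without pooling can be illustrated with a small pure-Python sketch. The kernel size of 3 and the dilation schedule (1, 1, 2) are assumptions chosen for illustration; the text does not state the values used in the embodiment.

```python
def dilated_conv1d(seq, kernel, dilation):
    """1-D convolution with zero padding so the output keeps the input length.

    Assumes an odd kernel length so the kernel is centred on each position.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(seq)):
        total = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation  # dilation spaces out the taps
            if 0 <= j < len(seq):
                total += w * seq[j]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input positions that can influence one output position."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


signal = [0.0] * 15
signal[7] = 1.0  # single impulse in the middle of the sequence
x = signal
for d in (1, 1, 2):  # assumed dilation schedule for one block
    x = dilated_conv1d(x, [1.0, 1.0, 1.0], d)

touched = [i for i, v in enumerate(x) if v != 0.0]
print(receptive_field(3, (1, 1, 2)))  # 9
print(receptive_field(3, (1, 1, 1)))  # 7
print(touched[0], touched[-1])        # 3 11
```

With the same three layers, the dilated stack covers 9 positions instead of 7, and no pooling step discards positions along the way.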
The medical document identification apparatus provided by the embodiment of the present invention is described below; the apparatus described below and the medical document identification method described above may be cross-referenced.
Fig. 5 is a block diagram of a medical document identification apparatus according to an embodiment of the present invention, and with reference to fig. 5, the medical document identification apparatus may include:
an obtaining module 100, configured to obtain information of a document to be identified;
the recognition module 200 is configured to input the document information to be recognized into a pre-trained medical document recognition model to obtain document recognition information; the medical document recognition model is obtained by training with, as input, labeled data obtained by reversely labeling the document data to be processed through a medical structured database.
As a preferred embodiment, the identification module 200 includes:
the database acquisition unit is used for acquiring a label library and a rule library of the medical structured database;
and the labeling unit is used for extracting words from the document data to be processed and labeling feature data through the structuring unit of the medical structured database to obtain the labeled data.
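One way to picture reverse labeling is dictionary matching: terms already held in the structured database are located in raw text and their stored categories are written back as character-level tags. The sketch below is only an illustration under assumptions. The label library is modelled as a hypothetical term-to-category dictionary, the BIO tagging scheme and the longest-match-first strategy are choices made here, and the rule library is not modelled at all.

```python
def reverse_label(text, label_library):
    """Tag each character with B-/I-<category> or O using longest-match lookup."""
    tags = ["O"] * len(text)
    # Prefer longer terms so "白细胞计数" wins over its prefix "白细胞".
    terms = sorted((t for t in label_library if t), key=len, reverse=True)
    i = 0
    while i < len(text):
        for term in terms:
            if text.startswith(term, i):
                category = label_library[term]
                tags[i] = "B-" + category
                for j in range(i + 1, i + len(term)):
                    tags[j] = "I-" + category
                i += len(term)
                break
        else:
            i += 1  # no known term starts at this position
    return list(zip(text, tags))


# Hypothetical label library exported from a medical structured database.
label_library = {"白细胞": "TEST_ITEM", "白细胞计数": "TEST_ITEM", "肺炎": "DISEASE"}
labeled = reverse_label("白细胞计数偏高，诊断肺炎", label_library)
for char, tag in labeled:
    print(char, tag)
```

Because the tags come from existing database entries rather than from annotators, any amount of unlabeled text can be tagged this way, at the cost of missing terms the database does not yet contain.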
As a preferred embodiment, the identification module 200 includes:
the rough labeling unit is used for extracting words from the document data to be processed and labeling feature data through the structuring unit of the medical structured database to obtain rough labeled data;
the character attribute extraction unit is used for determining, from the rough labeled data, character attribute information of the Chinese characters corresponding to the rough labeled data; the character attribute information includes at least one of radical feature information, pinyin feature information, part-of-speech feature information or word boundary feature information;
and the merging unit is used for determining the labeled data according to the rough labeled data and the character attribute information.
As a preferred embodiment, the identification module 200 includes:
the number labeling unit is used for extracting words from the document data to be processed and labeling feature data through the structuring unit of the medical structured database to obtain number-containing labeled data;
and the zero setting unit is used for setting the corresponding detection-value data in the number-containing labeled data to zero to obtain the rough labeled data.
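The zero-setting step can be sketched with a regular expression. Replacing each digit with 0 (rather than collapsing a whole number to a single 0) is an assumption made here so that the text length, and therefore the alignment with character-level labels, is preserved; the function name `zero_out_values` is hypothetical.

```python
import re

DIGIT = re.compile(r"\d")


def zero_out_values(text):
    """Replace every digit with 0 so concrete test values no longer vary."""
    return DIGIT.sub("0", text)


masked = zero_out_values("WBC 12.5, HGB 131")
print(masked)  # WBC 00.0, HGB 000
```

This simple version also zeroes digits that belong to item names or units, so a real pipeline would likely restrict the substitution to spans already labeled as detection values.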
As a preferred embodiment, the identification module 200 further includes:
and the updating unit is used for updating the medical structured database according to the document identification information.
As a preferred embodiment, the identification module 200 includes:
the identification labeling unit is used for reversely labeling the document information to be identified through a medical structured database to obtain labeled data to be identified;
the pre-trained language unit is used for inputting the document information to be identified into a pre-trained language model to obtain a context semantic feature vector;
and the identification unit is used for obtaining the document identification information through a downstream language model according to the context semantic feature vector.
The medical document recognition apparatus of this embodiment is used to implement the foregoing medical document recognition method, so its specific implementation can be found in the method embodiments above. For example, the obtaining module 100 and the recognition module 200 implement steps S101 and S102 of the method, respectively; refer to the descriptions of the corresponding embodiments, which are not repeated here.
In the medical document identification device provided by the invention, the obtaining module 100 obtains the information of a document to be identified, and the recognition module 200 inputs the document information to be recognized into a pre-trained medical document recognition model to obtain document recognition information; the medical document recognition model is obtained by training with, as input, labeled data obtained by reversely labeling the document data to be processed through a medical structured database. By using a medical structured database to reversely label a large amount of unlabeled medical data, the invention can obtain a large number of labeled medical samples without manual intervention.
A medical document identification device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the medical document identification method described in any one of the above when executing the computer program. The medical document identification method provided by the invention comprises obtaining information of a document to be identified, and inputting the document information to be recognized into a pre-trained medical document recognition model to obtain document recognition information; the medical document recognition model is obtained by training with, as input, labeled data obtained by reversely labeling the document data to be processed through a medical structured database. By using a medical structured database to reversely label a large amount of unlabeled medical data, the invention can obtain a large number of labeled medical samples without manual intervention.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the medical document identification method described in any one of the above. The medical document identification method provided by the invention comprises obtaining information of a document to be identified, and inputting the document information to be recognized into a pre-trained medical document recognition model to obtain document recognition information; the medical document recognition model is obtained by training with, as input, labeled data obtained by reversely labeling the document data to be processed through a medical structured database. By using a medical structured database to reversely label a large amount of unlabeled medical data, the invention can obtain a large number of labeled medical samples without manual intervention.
The embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Because the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is brief; for the relevant details, refer to the description of the method.
It is to be noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The medical document identification method, the medical document identification device, the medical document identification equipment and the computer readable storage medium provided by the invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (10)

1. A medical document identification method, comprising:
acquiring document information to be identified;
inputting the document information to be recognized into a pre-trained medical document recognition model as an input quantity to obtain document recognition information; the medical document identification model is obtained by training labeled data obtained by reversely labeling the data of the document to be processed through a medical structured database as input quantity.
2. The medical document identification method according to claim 1, wherein the method of reversely labeling the document data to be processed through the medical structured database to obtain the labeled data comprises:
acquiring a label library and a rule library of the medical structured database;
and acquiring words of the document data to be processed and labeling feature data through a structural unit of the medical structured database to obtain the labeled data.
3. The method for recognizing medical documents according to claim 2, wherein said obtaining labeled data by extracting words from said document data to be processed and labeling feature data through a structuring unit of said medical structured database comprises:
through a structural unit of the medical structured database, word extraction is carried out on the document data to be processed, and characteristic data are labeled to obtain rough labeling data;
determining character attribute information of the Chinese characters corresponding to the rough marking data through the rough marking data; the character attribute characteristics comprise at least one of component characteristic information, pinyin characteristic information, part of speech characteristic information or word boundary characteristic information;
and determining the labeling data according to the rough labeling data and the character attribute information.
4. The method according to claim 3, wherein the obtaining of the rough labeled data by the structural unit of the medical structured database by retrieving words from the document data to be processed and labeling feature data comprises:
through a structural unit of the medical structured database, word extraction is carried out on the document data to be processed, and characteristic data are labeled to obtain data containing number labels;
and setting the corresponding detection value class data in the data-containing marking data to zero to obtain the rough marking data.
5. The medical document identification method of claim 1, further comprising, after obtaining the document identification information:
and updating the medical structured database according to the document identification information.
6. The medical document recognition method of claim 1, wherein inputting the document information to be recognized as an input into a pre-trained medical document recognition model, and obtaining document recognition information comprises:
reversely labeling the document information to be identified through a medical structured database to obtain labeled data to be identified;
inputting the document information to be identified into a pre-training language model to obtain a context semantic feature vector;
and obtaining the document identification information through a downstream language model according to the context semantic feature vector.
7. The medical document identification method of claim 6, wherein the CNN layer in the downstream language model is an ID-CNN layer.
8. A medical document identification device, comprising:
the acquisition module is used for acquiring the information of the document to be identified;
the identification module is used for inputting the document information to be identified as input quantity into a pre-trained medical document identification model to obtain document identification information; the medical document identification model is obtained by training labeled data obtained by reversely labeling the data of the document to be processed through a medical structured database as input quantity.
9. A medical document identification device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the medical document identification method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the medical document identification method according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011401403.9A CN113808752A (en) 2020-12-04 2020-12-04 Medical document identification method, device and equipment

Publications (1)

Publication Number Publication Date
CN113808752A (en) 2021-12-17

Family

ID=78943541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011401403.9A Pending CN113808752A (en) 2020-12-04 2020-12-04 Medical document identification method, device and equipment

Country Status (1)

Country Link
CN (1) CN113808752A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination