CN114387602B - Medical OCR data optimization model training method, optimization method and equipment - Google Patents

Medical OCR data optimization model training method, optimization method and equipment

Info

Publication number
CN114387602B
Authority
CN
China
Prior art keywords
medical
training
ocr
character
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210294556.0A
Other languages
Chinese (zh)
Other versions
CN114387602A (en)
Inventor
安波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhiyuan Artificial Intelligence Research Institute
Original Assignee
Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhiyuan Artificial Intelligence Research Institute filed Critical Beijing Zhiyuan Artificial Intelligence Research Institute
Priority to CN202210294556.0A priority Critical patent/CN114387602B/en
Publication of CN114387602A publication Critical patent/CN114387602A/en
Application granted granted Critical
Publication of CN114387602B publication Critical patent/CN114387602B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a medical OCR data optimization model training method, an optimization method and equipment. The training method comprises the following steps: acquiring large-scale unlabeled medical text data, and identifying the medical terms and characters in the text data to form a training set; pre-training the training set to obtain a pre-training data set for training the medical OCR optimization model, and training the medical OCR optimization model with the pre-training data set. The pre-training process comprises: performing data augmentation on low-frequency terms and low-frequency characters in the training set; randomly replacing a first target character in the training set with an error character; occluding a second target character in the training set; and segmenting the training set into a plurality of text paragraphs to obtain the pre-training data set for training the medical OCR optimization model. The invention uses a pre-trained language model in the medical field to perform structured extraction, error recognition and optimization on medical OCR results, thereby improving the accuracy of medical OCR.

Description

Medical OCR data optimization model training method, optimization method and equipment
Technical Field
The invention relates to the technical field of intelligent medical data processing, in particular to a medical OCR data optimization model training method, an optimization method, related electronic equipment and a computer-readable storage medium.
Background
With the rapid development of machine learning, Optical Character Recognition (OCR) has made great progress in character recognition, and various commercial applications such as Baidu OCR have appeared. In the medical field, paper data needs to be structured for clinical medical research, case structuring, underwriting and claims settlement, and the like. How to convert paper medical data into computer-processable structured data has become key to the development of intelligent medicine. Structuring medical image data likewise requires optical recognition, and the recognition result determines the subsequent processing. However, the accuracy of optical text recognition in the medical field remains problematic. Unlike image text recognition in the general domain, medical image text recognition involves a large number of medical terms, such as disease names and field names in medical records; the term base is large, and the number of commonly used medical terms exceeds one million. Moreover, the medical field contains many rare and uncommon characters whose frequency in general text is very low; at the same time, even non-rare medical terms, such as rare-disease names like Kawasaki disease, appear very infrequently in the corpus, so the recognition accuracy for such low-frequency terms is poor (for example, "eyelid" is often recognized as "face"). In addition, single characters are commonly used in the medical field to refer to various diseases and conditions, but their frequency of occurrence is low, so an OCR character recognition model has difficulty recognizing them accurately. In terms of structure, unlike common text materials, medical data usually has a specific structure; for example, a medical record report contains multiple fields, and the data types contained in different fields differ. However, current text recognition systems make little use of this structural information. The text style in the medical field is also very concise, and medical staff often omit a large number of non-medical words when writing documents. Each of the above aspects presents new challenges to both OCR and existing post-processing models.
Disclosure of Invention
In order to solve the problem that OCR in the prior art cannot accurately recognize abnormal or erroneous data, the invention provides the following technical scheme.
The invention provides a medical OCR optimization model training method in a first aspect, which comprises the following steps:
acquiring large-scale unlabeled medical text data, and identifying medical terms and characters in the large-scale unlabeled medical text data to form a training set;
pre-training the training set to obtain a pre-training data set for training the medical OCR optimization model, and training the medical OCR optimization model by using the pre-training data set;
wherein the pre-training process comprises:
performing data augmentation processing on the low-frequency terms and the low-frequency characters in the training set;
randomly replacing a first target character in the training set with an error character;
occluding a second target character in the training set; and
segmenting the training set into a plurality of text paragraphs to obtain a pre-training data set for training the medical OCR optimization model.
Preferably, before the data augmentation processing is performed on the low-frequency terms and low-frequency characters in the training set, the method further includes:
counting the frequency of each identified medical term and character in the training set, and determining the low-frequency terms and low-frequency characters in the training set according to corresponding low-frequency thresholds.
Preferably, after the forming of the training set, the method further includes:
performing representation learning of medical terms on the training set by using a medical knowledge graph, and mapping them in a representation space.
Preferably, the randomly replacing the first target character in the training set with an error character further includes:
screening first target characters from the medical terms and characters in the training set, wherein the first target characters comprise characters contained in a glyph-similarity dictionary and/or commonly used medical characters.
Preferably, the training the medical OCR optimization model with the pre-training data set further comprises:
after the first target character has been randomly replaced with an erroneous character, using a current training set as a first data set, iteratively extracting the erroneous character in the first data set according to a current context, and predicting the first target character corresponding to the erroneous character to train a character error correction capability of the medical OCR optimization model.
Preferably, the training the medical OCR optimization model with the pre-training data set further comprises:
after the second target character has been occluded, iteratively predicting, using a current training set as a second data set, the second target character corresponding to an occluded position in the second data set according to a current context to train the ability of the medical OCR optimization model to identify occluded content.
Preferably, the training the medical OCR optimization model with the pre-training data set further comprises:
iteratively predicting paragraph ending sentences in the pre-training dataset according to a current context to train an ability of the medical OCR optimization model to automatically segment.
The present invention provides in a second aspect a medical OCR data optimization method comprising:
acquiring a target medical image, and performing OCR recognition on the target medical image to obtain text data to be optimized;
inputting the text data to be optimized into a medical OCR optimization model so that the medical OCR optimization model outputs medical terms and character recognition results corresponding to the text data to be optimized;
the medical OCR optimization model is obtained in advance based on the medical OCR optimization model training method of the first aspect.
A third aspect of the present invention provides an electronic device, comprising a processor and a memory, wherein the memory stores a plurality of instructions, and the processor is configured to read the instructions and execute the medical OCR data optimization model training method according to the first aspect or the medical OCR data optimization method according to the second aspect.
Yet another aspect of the present invention provides a computer-readable storage medium storing a plurality of instructions readable by a processor for executing the medical OCR data optimization model training method according to the first aspect or executing the medical OCR data optimization method according to the second aspect.
The beneficial effects of the invention are as follows: on the basis of data augmentation, the technical scheme uses a pre-trained language model in the medical field to perform structured extraction, error recognition and optimization on medical OCR results, which improves the accuracy of medical image character recognition, in particular the recognition accuracy of key words such as medical terms and medical record keywords, and at the same time assists the segmentation of text paragraphs to support subsequent medical knowledge extraction and event extraction.
Drawings
Fig. 1 is a flowchart of a medical OCR optimization model training method according to the present invention.
FIG. 2 is a schematic diagram of a process for forming a pre-training data set for model training according to the present invention.
FIG. 3 is a schematic diagram of a training process of a pre-training language model for post-processing of character recognition according to the present invention.
Fig. 4 is a flowchart of the medical OCR data optimizing method according to the present invention.
Fig. 5 is a detailed flowchart of the medical image text recognition method according to the present invention.
Detailed Description
In order to better understand the technical solution, it is described in detail below with reference to the drawings and specific embodiments.
The method provided by the invention can be implemented in the following terminal environment, and the terminal can comprise one or more of the following components: a processor, a memory, and a display screen. Wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the methods described in the embodiments described below.
A processor may include one or more processing cores. The processor connects various parts within the terminal using various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory and by calling data stored in the memory.
The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory may be used to store instructions, programs, code sets, or instruction sets.
The display screen is used for displaying user interfaces of all the application programs.
In addition, those skilled in the art will appreciate that the above-described terminal configurations are not intended to be limiting, and that the terminal may include more or fewer components, or some components may be combined, or a different arrangement of components. For example, the terminal further includes a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and other components, which are not described herein again.
In view of the above problems, in order to optimize medical image OCR results, the invention provides a medical OCR optimization model training method and a medical image character recognition (OCR) optimization method based on a pre-trained language model. Specifically, a pre-trained model is used to recognize erroneous characters in the text obtained by OCR and to predict the correct characters, so as to obtain a correct text recognition result. On the basis of data augmentation, the method uses a pre-trained language model in the medical field to perform structured extraction, error recognition and optimization on the medical OCR result, with the aim of improving the accuracy of medical OCR.
Example one
As shown in fig. 1, an embodiment of the present invention provides a medical OCR optimization model training method, including:
s101, acquiring large-scale label-free medical text data, and identifying medical terms and characters in the large-scale label-free medical text data to form a training set;
the original training data of the pre-training language model is large-scale medical text data, and the data are from text data in clinical guidelines, medical textbooks, medical encyclopedias, medical forums and accumulated medical records. Based on the large-scale label-free medical text data, the existing medical named entity recognition model is utilized to carry out recognition processing to obtain preliminary medical term and character recognition results, such as diseases, diagnoses, operations, medicines and medical record keywords (chief complaints, ultrasonic diagnoses and the like), and the specific term recognition model can adopt entity recognition based on deep learning, entity recognition based on statistical learning or a dictionary-based method.
S102, pre-training the training set to obtain a pre-training data set used for training the medical OCR optimization model, and training the medical OCR optimization model by using the pre-training data set.
To improve the accuracy of medical image text recognition, the method focuses on a pre-trained language model for post-processing character recognition results, optimizing those results to obtain a model with stronger processing capability for medical image character recognition. In a preferred embodiment, this stronger capability is reflected in the model's ability to correctly recognize and correct character recognition errors, supplement missing characters, and reasonably divide long texts into paragraphs. Therefore, in step S102, the training set obtained in step S101 is pre-trained to obtain a new training set in which the recognition processing capability is enhanced, and the original medical OCR optimization model is trained on this new training set.
In a further preferred embodiment, the training set may be subjected to representation learning of medical terms using a medical knowledge graph and mapped into a representation space. Specifically, a dictionary and a model can be used to label the large-scale unlabeled medical text data, and the medical knowledge graph and the labeled text are then used for representation learning of terms: the knowledge graph side may adopt a Graph Neural Network (GNN), the text side may adopt a Transformer, and both are mapped into the same representation space, so that data learned from different modalities share the same vector representation.
Wherein the pre-training process may include one or more of the following aspects, as shown in fig. 2:
and S1021, performing data amplification processing on the low-frequency terms and the low-frequency characters in the training set.
Aiming at the identification of rarely-used words, the invention utilizes the repeat generation to realize data augmentation, namely, the word frequency of low-frequency words is increased so as to better meet the training requirement. The specific mode can be a sentence level repeat generation mode, and a seq2seq model and the like can be combined, namely, sentences with low-frequency terms and low-frequency characters are directly copied and pasted. In a specific operation, statistical analysis may be performed on the medical text and terms to obtain the frequency of each medical term and character in the training set. Low frequency terms and low frequency characters in the training set are determined according to corresponding low frequency thresholds, for example, terms with the occurrence frequency lower than 20 times, or characters with the occurrence frequency lower than 5 times, etc. The part of data is augmented to enhance the text data, and more balanced data is obtained.
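A minimal sketch of this augmentation step follows, assuming the sample structure from the previous sketch; the duplication factor `copies` and the use of Python's Counter are illustrative choices, while the thresholds of 20 and 5 follow the example above.

```python
# Minimal sketch of the low-frequency augmentation step: count term and character
# frequencies, then duplicate sentences containing low-frequency items.
# The {"text", "terms"} sample structure and copies=3 are assumptions.
from collections import Counter
from typing import Dict, List

TERM_LOW_FREQ = 20   # terms occurring fewer than 20 times count as low frequency
CHAR_LOW_FREQ = 5    # characters occurring fewer than 5 times count as low frequency

def augment_low_frequency(samples: List[Dict], copies: int = 3) -> List[Dict]:
    term_freq = Counter(t for s in samples for t in s["terms"])
    char_freq = Counter(c for s in samples for c in s["text"])
    augmented = list(samples)
    for s in samples:
        rare_term = any(term_freq[t] < TERM_LOW_FREQ for t in s["terms"])
        rare_char = any(char_freq[c] < CHAR_LOW_FREQ for c in s["text"])
        if rare_term or rare_char:
            augmented.extend(dict(s) for _ in range(copies))  # sentence-level duplication
    return augmented
```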
S1022, randomly replacing a first target character in the training set with an error character.
Specifically, first target characters may be screened from the recognized medical terms and characters in the training set, where the first target characters may include characters contained in a glyph-similarity dictionary and/or commonly used medical characters. Since OCR is image-based recognition, characters with similar glyphs are easily misrecognized. Therefore, for the recognition of abnormal characters, the invention uses a glyph-similarity dictionary to randomly replace characters in the medical text with look-alike characters, that is, correct characters in the training data are replaced with erroneous look-alike characters from the dictionary. The glyph-similarity dictionary contains both general common characters (such as "person") and commonly used medical characters (such as "face" and "eyelid"). For example, from the correct medical term "meibomian gland" in the original training set, the first target character "eyelid" may be determined and replaced with its look-alike character "face", yielding the erroneous term "facial gland". During language model training, the training set containing such errors is input into the pre-trained model, and the model is encouraged to detect the character "face" that does not fit the current context and to predict the correct target character "eyelid", thereby improving the text error correction capability of the model.
In a further preferred embodiment, when the target characters to be replaced are randomly selected, commonly used medical characters may be replaced more often than general common characters, so that the model learns to predict the correct medical characters. Since there can be more than one first target character, i.e. a plurality of error-prone characters may be randomly replaced, the current training set is then used as the first data set. Accordingly, during training of the medical OCR optimization model, each erroneous character in the first data set is iteratively detected according to the current context, and the correct first target character corresponding to each erroneous character is predicted, so as to train the character error correction capability of the model.
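A minimal sketch of this corruption step is given below. The glyph-confusion pairs (e.g. 睑/脸, corresponding to "eyelid"/"face" above) and the replacement rates are illustrative assumptions; only the idea of replacing characters with look-alikes, at a higher rate for medical characters, follows the text.

```python
# Minimal sketch of glyph-confusion corruption: characters listed in a
# glyph-similarity dictionary are randomly replaced by a look-alike, with a
# higher rate for commonly used medical characters. Dictionary contents and
# the rates p_common / p_medical are assumptions.
import random
from typing import List, Tuple

GLYPH_CONFUSION = {"睑": ["脸"], "脸": ["睑"], "人": ["入"]}  # hypothetical look-alike pairs
MEDICAL_CHARS = {"睑"}                                        # hypothetical medical characters

def corrupt_sentence(text: str, p_common: float = 0.05,
                     p_medical: float = 0.15) -> Tuple[str, List[int]]:
    """Return the corrupted sentence and the positions that were replaced."""
    chars, replaced = list(text), []
    for i, c in enumerate(chars):
        if c not in GLYPH_CONFUSION:
            continue
        rate = p_medical if c in MEDICAL_CHARS else p_common
        if random.random() < rate:
            chars[i] = random.choice(GLYPH_CONFUSION[c])
            replaced.append(i)  # ground-truth positions for error-correction training
    return "".join(chars), replaced
```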
S1023, occluding the second target character in the training set.
To address the problem of missing characters, which often occurs after medical image character recognition, the language model optimized by the invention needs the ability to recover missing words and characters. Second target characters to be occluded can be screened from the recognized medical terms and characters in the training set according to a preset probability, and the content of the occluded positions is then predicted during language model training. That is, after the occlusion of one or more characters is completed, the current training set is taken as the second data set. Accordingly, during training of the medical OCR optimization model, the correct second target character corresponding to each occluded position in the second data set is iteratively predicted from the current context, to train the ability of the model to recognize missing content.
Unlike the occlusion operation of a common language model, the invention preferably increases the probability that medical term vocabulary is occluded. In a preferred embodiment, the occlusion probability of medical terms may be three times that of other words, and characters within medical words may also be partially occluded. The occlusion frequency of a character may be inversely proportional to the number of times the character appears in the corpus, so that lower-frequency characters have a higher probability of being occluded.
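The following sketch illustrates such an occlusion scheme under stated assumptions: the [MASK] token, the base rate and the exact inverse-frequency formula are placeholders; only the boosted occlusion of medical terms (here 3x) and the inverse-frequency character occlusion follow the description.

```python
# Minimal sketch of the occlusion (masking) step: medical terms are occluded with
# roughly three times the base probability, remaining characters with a probability
# inversely proportional to their corpus frequency. MASK, p_base and the exact
# formula are assumptions.
import random
from collections import Counter
from typing import Dict, List

MASK = "[MASK]"

def occlude_sample(sample: Dict, char_freq: Counter,
                   p_base: float = 0.05, term_factor: float = 3.0) -> str:
    text = sample["text"]
    term_starts = {text.find(t): t for t in sample["terms"] if t in text}
    out: List[str] = []
    i = 0
    while i < len(text):
        term = term_starts.get(i)
        if term and random.random() < p_base * term_factor:
            out.extend([MASK] * len(term))     # occlude the whole medical term
            i += len(term)
            continue
        c = text[i]
        # character-level occlusion: rarer characters are occluded more often
        if random.random() < min(1.0, p_base / max(char_freq[c], 1)):
            out.append(MASK)
        else:
            out.append(c)
        i += 1
    return "".join(out)
```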
S1024, segmenting the training set into a plurality of text paragraphs to obtain a pre-training data set used for training the medical OCR optimization model.
For the structured nature of medical records, in the pre-training stage the invention uses keywords and the language model to divide different text blocks, automatically segmenting the medical data into a plurality of independent text blocks and using this as an independent training task for the language model, so that the model gains the ability to divide fields. Correctly split paragraphs play an important role in subsequent medical knowledge extraction and event extraction. The segmentation method may use pre-obtained keywords and the like. The content described by different paragraphs of a medical text usually differs greatly, and the language model of the invention performs paragraph segmentation by predicting whether the current sentence is the last sentence of the current paragraph. The prediction can be based on the current text and the next sentence, and a paragraph break is inserted when there is no obvious semantic association between the two. The prediction of whether the current sentence is the last sentence of the current paragraph may refer to formula (1), which predicts the probability that the current sentence is correct according to the current context.
During training of a medical OCR optimization model, one or more paragraph ending sentences in the pre-training data set are iteratively predicted according to a current context to train an ability of the medical OCR optimization model to automatically segment.
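A minimal sketch of this segmentation logic is shown below. It assumes a hypothetical scorer relatedness(sent_a, sent_b) standing in for the language model's next-sentence judgment; the scorer and the threshold are not part of the original disclosure.

```python
# Minimal sketch of LM-driven paragraph segmentation: a paragraph break is
# inserted when the next sentence is weakly related to the current one.
# The relatedness scorer and threshold are hypothetical placeholders.
from typing import Callable, List

def segment_paragraphs(sentences: List[str],
                       relatedness: Callable[[str, str], float],
                       threshold: float = 0.5) -> List[List[str]]:
    paragraphs: List[List[str]] = [[]]
    for i, sent in enumerate(sentences):
        paragraphs[-1].append(sent)
        is_last = i == len(sentences) - 1
        # treat this sentence as a paragraph-ending sentence if the next one is unrelated
        if not is_last and relatedness(sent, sentences[i + 1]) < threshold:
            paragraphs.append([])
    return [p for p in paragraphs if p]
```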
By performing the pre-training processes of S1022, S1023, and S1024 to obtain a pre-training data set, and then iteratively training and updating the medical OCR optimization model with this pre-training data set, a model with stronger processing capability for medical image character recognition can be obtained, including the capabilities of correctly recognizing and correcting character recognition errors, supplementing missing characters, and optimizing the division of text paragraphs.
FIG. 3 illustrates the complete training process of the pre-trained language model in the offline stage. It should be noted that, in the above flow, character error correction, character supplementation, and text segmentation are relatively independent model optimization processes. Therefore, in the pre-training process of a practical application, at least one of these pre-training modes may be selected to enhance the training data of the language model. Moreover, the order of steps S1022, S1023, and S1024 may be adjusted arbitrarily and is not limited to the order described in the above embodiment. For example, the training set may first be segmented into a plurality of text paragraphs, and the preset target characters in the training set may then be subjected to error replacement and/or occlusion to obtain the pre-training data set, and so on.
In a further embodiment, to address the recognition problems arising from the medical language style, the pre-training process of the invention may further comprise:
S1025, extracting a large number of diagnosis result texts to fine-tune the language model.
Model fine-tuning is performed by learning from a large number of diagnosis result texts, which enhances the model's understanding of the grammatical habits and language style of medical staff and improves the recognition accuracy of diagnosis results.
In addition, as a specific way to implement the sentence error detection, in the model training stage after completing the random replacement of the error characters in step S1022, the probability that the current sentence is the correct sentence may be estimated first, then the error characters are identified, finally the correct characters are predicted according to the context, and the probability that the corrected sentence is the correct sentence is calculated. Specifically, the calculation method of sentence error detection can be expressed as formula (1):
P(s) = P(w_1, w_2, w_3, ..., w_n)    Formula (1)
where s is an OCR sentence consisting of the character sequence w_1, w_2, w_3, ..., w_n, and P(s) is the probability that the sentence is a correct sentence.
When the value of P(s) is less than a given threshold, the sentence is determined to contain an error. That is, after an erroneous sentence is recognized, the erroneous character is predicted; the way the model predicts the erroneous character is shown in formula (2):
P_error(w_i) = min P(w_i | w_1, ..., w_{i-1}, w_{i+1}, ..., w_n)    Formula (2)
where w_i is the character suspected of being recognized incorrectly, and P(w_i | w_1, ..., w_{i-1}, w_{i+1}, ..., w_n) is the probability of w_i occurring in the given context; the character with the lowest probability in the sentence is taken as the erroneous character P_error(w_i). The way the model gives the correct character is shown in formula (3):
w' = max P(w' | w_1, w_2, w_3, ...)    Formula (3)
where w' is the predicted correct character, and P(w' | w_1, w_2, w_3, ...) is the probability that, given the context, the character w' forms a reasonable sentence.
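The following sketch illustrates how formulas (1)-(3) could be applied in code, under explicit assumptions: char_prob is a hypothetical callable standing in for the pre-trained language model's per-character contextual probability, the candidate set and threshold are placeholders, and the sentence score is length-normalised purely so that a single threshold can be used.

```python
# Minimal sketch of formulas (1)-(3). char_prob(prefix, suffix, ch) is a
# hypothetical stand-in for the language model's contextual character probability;
# candidates and threshold are illustrative. Not the patented implementation.
import math
from typing import Callable, Iterable, Tuple

def sentence_prob(s: str, char_prob: Callable[[str, str, str], float]) -> float:
    """Formula (1): score the sentence from per-character contextual probabilities
    (geometric mean, so one threshold works for sentences of different lengths)."""
    if not s:
        return 1.0
    logp = sum(math.log(max(char_prob(s[:i], s[i + 1:], c), 1e-12))
               for i, c in enumerate(s))
    return math.exp(logp / len(s))

def correct_sentence(s: str, char_prob: Callable[[str, str, str], float],
                     candidates: Iterable[str],
                     threshold: float = 0.3) -> Tuple[str, bool]:
    if not s or sentence_prob(s, char_prob) >= threshold:
        return s, False  # sentence judged correct, no change
    # Formula (2): the character with the lowest contextual probability is the error.
    i = min(range(len(s)), key=lambda k: char_prob(s[:k], s[k + 1:], s[k]))
    # Formula (3): replace it with the candidate that best fits the context.
    best = max(candidates, key=lambda w: char_prob(s[:i], s[i + 1:], w))
    return s[:i] + best + s[i + 1:], True
```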
Compared with the prior art, the method can further improve the accuracy of medical image character recognition, in particular the recognition accuracy of key words such as medical terms and medical record keywords (e.g. chief complaints), and can assist in segmenting text paragraphs, which benefits subsequent medical knowledge extraction and event extraction. Experimental data show that, with the medical OCR optimization model training method, the error detection rate of text recognition can reach 78% and the prediction accuracy of correct characters can reach 85%, so the accuracy of medical image OCR can be significantly improved. The medical OCR optimization model training method is therefore of significant value for applications such as medical record structuring, clinical medical statistics, and underwriting and claims settlement.
Example two
As shown in fig. 4, the present invention provides in a second aspect a medical OCR data optimization method comprising:
s201, obtaining a target medical image, and performing OCR recognition on the target medical image to obtain text data to be optimized;
the target medical image to be identified may include image files such as medical records, diagnostic reports, etc. that are scanned or taken. After the target medical image is acquired, image text data is extracted as initial text data according to an existing OCR recognition algorithm.
S202, inputting the text data to be optimized into a medical OCR optimization model, so that the medical OCR optimization model outputs medical terms and character recognition results corresponding to the text data to be optimized.
As described above, before online OCR recognition, the final medical OCR optimization model is obtained in advance in the offline stage based on the medical OCR optimization model training method of Embodiment One. The complete online-stage medical image OCR recognition method is shown in FIG. 5. The initial text data to be optimized is input into the medical OCR optimization model so that the model outputs the corresponding optimized text data. Because the model has been trained to correct character recognition errors, recognize missing characters, and optimize text paragraph segmentation, the optimized text data at least includes the paragraph segmentation of the initial text data; if an incorrect medical term or character exists in the initial text data, it is identified and replaced with the corresponding correct term; and if a term is missing from the initial text data, the missing medical term or character is predicted.
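As a sketch of this online flow, the snippet below wires the two steps together; run_ocr and optimize are hypothetical callables standing in for an existing OCR engine and the trained medical OCR optimization model, and neither name comes from the patent.

```python
# Minimal sketch of the online stage (S201/S202), with hypothetical callables for
# the OCR engine and the trained optimization model.
from typing import Callable, List

def optimize_medical_ocr(image_path: str,
                         run_ocr: Callable[[str], str],
                         optimize: Callable[[str], List[str]]) -> List[str]:
    raw_text = run_ocr(image_path)   # S201: OCR the target medical image
    return optimize(raw_text)        # S202: correct errors, fill missing characters, segment paragraphs
```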
EXAMPLE III
Another aspect of the present invention further includes a functional module architecture completely corresponding to and consistent with the foregoing method flow, that is, an embodiment of the present invention further provides a medical OCR data optimization model training apparatus, including:
the acquisition module 301 is configured to acquire large-scale label-free medical text data and identify medical terms and characters in the large-scale label-free medical text data to form a training set;
a pre-training module 302, configured to perform pre-training processing on the training set to obtain a pre-training data set used for training the medical OCR optimization model, and train the medical OCR optimization model by using the pre-training data set;
wherein the pre-training module comprises:
an augmentation module 3021, configured to perform data augmentation on low-frequency terms and low-frequency characters in the training set;
a replacing module 3022, configured to randomly replace a first target character in the training set with an error character;
an occlusion module 3023, configured to occlude the second target character in the training set; and
a segmentation module 3024, configured to segment the training set into a plurality of text paragraphs to obtain a pre-training data set used for training the medical OCR optimization model.
The device can be implemented by the medical OCR data optimization model training method provided in the first embodiment, and specific implementation methods can be referred to the description in the first embodiment and are not described herein again.
Example four
Correspondingly, the embodiment of the invention also provides a medical OCR data optimization device, which comprises:
the recognition module 401 is configured to obtain a target medical image, and perform OCR recognition on the target medical image to obtain text data to be optimized;
the optimizing module 402 is configured to input the text data to be optimized into a medical OCR optimizing model, so that the medical OCR optimizing model outputs a medical term and a character recognition result corresponding to the text data to be optimized;
the medical OCR optimization model is obtained in advance based on the medical OCR optimization model training method in the first embodiment.
EXAMPLE five
Another aspect of the present invention provides an electronic device, including a processor and a memory, where the memory stores a plurality of instructions, and the processor is configured to read the instructions and execute the medical OCR data optimization model training method according to the first embodiment or the medical OCR data optimization method according to the second embodiment. The processor and the memory may be connected by a bus or in other ways; a bus connection is taken as an example. The processor may be a Central Processing Unit (CPU), or another general-purpose processor such as a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof.
The memory, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the medical OCR data optimization model training method and the optimization method in the embodiments of the present application. The processor executes various functional applications and performs data processing by running the non-transitory software programs, instructions and modules stored in the memory, thereby implementing the methods in the above method embodiments.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor, and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be coupled to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
EXAMPLE six
Yet another aspect of the present invention provides a computer-readable storage medium storing a plurality of instructions readable by a processor for performing the medical OCR data optimization model training method according to the first embodiment or performing the medical OCR data optimization method according to the second embodiment. The computer readable storage medium may be a tangible storage medium such as Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, floppy disks, hard disks, removable storage disks, CD-ROMs, or any other form of storage medium known in the art.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (7)

1. A medical OCR optimization model training method is characterized by comprising the following steps:
acquiring large-scale unlabeled medical text data, and identifying medical terms and characters in the large-scale unlabeled medical text data to form a training set;
pre-training the training set to obtain a pre-training data set for training the medical OCR optimization model, and training the medical OCR optimization model by using the pre-training data set;
wherein the pre-training process comprises:
performing data augmentation processing on the low-frequency terms and the low-frequency characters in the training set;
randomly replacing a first target character in the training set with an error character;
screening a second target character to be occluded from the training set according to a preset probability, and occluding the second target character in the training set, wherein the occlusion probability of a medical term is higher than that of other words in the training set; and
segmenting the training set into a plurality of text paragraphs to obtain a pre-training data set for training the medical OCR optimization model;
the randomly replacing the first target character in the training set with the error character further comprises:
screening first target characters from the medical terms and characters in the training set, wherein the first target characters comprise characters contained in a glyph-similarity dictionary and/or commonly used medical characters;
the training the medical OCR optimization model with the pre-training dataset further comprises:
after the first target character has been randomly replaced with an erroneous character, using a current training set as a first data set, iteratively extracting the erroneous character in the first data set according to a current context, and predicting the first target character corresponding to the erroneous character to train a character error correction capability of the medical OCR optimization model;
after the second target character has been occluded, iteratively predicting, according to a current context, the second target character corresponding to an occluded position in the second data set, using a current training set as a second data set, to train the ability of the medical OCR optimization model to identify occluded content;
in the model training stage after the random replacement of the error characters is finished, the probability that the current sentence is a correct sentence is estimated first, the error characters are then identified, and finally the correct characters are predicted according to the context and the probability that the corrected sentence is a correct sentence is calculated; the calculation method of sentence error detection is represented by formula (1):
P(s) = P(w_1, w_2, w_3, ..., w_n)    Formula (1);
where s is an OCR sentence consisting of the character sequence w_1, w_2, w_3, ..., w_n, and P(s) is the probability that the sentence is a correct sentence;
when the value of P(s) is less than a given threshold, the sentence is determined to contain an error; that is, after an erroneous sentence is recognized, the erroneous character is predicted, and the way the model predicts the erroneous character is shown in formula (2):
P_error(w_i) = min P(w_i | w_1, ..., w_{i-1}, w_{i+1}, ..., w_n)    Formula (2);
where w_i is the character suspected of being recognized incorrectly, and P(w_i | w_1, ..., w_{i-1}, w_{i+1}, ..., w_n) is the probability of w_i occurring in the given context; the character with the lowest probability in the sentence is taken as the erroneous character P_error(w_i); the way the model gives the correct character is shown in formula (3):
w' = max P(w' | w_1, w_2, w_3, ...)    Formula (3);
where w' is the predicted correct character, and P(w' | w_1, w_2, w_3, ...) is the probability that, given the context, the character w' forms a reasonable sentence.
2. The method of claim 1, further comprising, prior to the data augmentation processing of the low frequency terms and low frequency characters in the training set:
counting the frequency of each identified medical term and character in the training set, and determining low-frequency terms and low-frequency characters in the training set according to corresponding low-frequency thresholds.
3. The method of claim 1, further comprising, after said forming a training set:
performing representation learning of medical terms on the training set by using a medical knowledge graph, and mapping them in a representation space.
4. The method of claim 1, wherein the training the medical OCR optimization model with the pre-training dataset further comprises:
iteratively predicting paragraph ending sentences in the pre-training dataset according to a current context to train an ability of the medical OCR optimization model to automatically segment.
5. A medical OCR data optimization method, comprising:
acquiring a target medical image, and performing OCR recognition on the target medical image to obtain text data to be optimized;
inputting the text data to be optimized into a medical OCR optimization model so that the medical OCR optimization model outputs medical terms and character recognition results corresponding to the text data to be optimized;
wherein the medical OCR optimization model is obtained in advance based on the medical OCR optimization model training method as claimed in any one of claims 1 to 4.
6. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions, the processor being configured to read the instructions and execute the medical OCR optimization model training method according to any one of claims 1 to 4 or the medical OCR data optimization method according to claim 5.
7. A computer-readable storage medium storing a plurality of instructions readable by a processor and performing the medical OCR optimization model training method according to any one of claims 1 to 4 or the medical OCR data optimization method according to claim 5.
CN202210294556.0A 2022-03-24 2022-03-24 Medical OCR data optimization model training method, optimization method and equipment Active CN114387602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210294556.0A CN114387602B (en) 2022-03-24 2022-03-24 Medical OCR data optimization model training method, optimization method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210294556.0A CN114387602B (en) 2022-03-24 2022-03-24 Medical OCR data optimization model training method, optimization method and equipment

Publications (2)

Publication Number Publication Date
CN114387602A CN114387602A (en) 2022-04-22
CN114387602B true CN114387602B (en) 2022-07-08

Family

ID=81205628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210294556.0A Active CN114387602B (en) 2022-03-24 2022-03-24 Medical OCR data optimization model training method, optimization method and equipment

Country Status (1)

Country Link
CN (1) CN114387602B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861637B (en) * 2022-05-18 2023-06-16 北京百度网讯科技有限公司 Spelling error correction model generation method and device, and spelling error correction method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427609A (en) * 2019-06-25 2019-11-08 首都师范大学 One kind writing people's composition structure of an article reasonability method for automatically evaluating
CN110503100A (en) * 2019-08-16 2019-11-26 湖南星汉数智科技有限公司 A kind of medical document recognition methods, device, computer installation and computer readable storage medium
CN111079447A (en) * 2020-03-23 2020-04-28 深圳智能思创科技有限公司 Chinese-oriented pre-training method and system
CN111178049A (en) * 2019-12-09 2020-05-19 天津幸福生命科技有限公司 Text correction method and device, readable medium and electronic equipment
CN111191456A (en) * 2018-11-15 2020-05-22 零氪科技(天津)有限公司 Method for identifying text segmentation by using sequence label
CN111797908A (en) * 2020-06-18 2020-10-20 浪潮金融信息技术有限公司 Training set generation method of deep learning model for print character recognition
CN111984845A (en) * 2020-08-17 2020-11-24 江苏百达智慧网络科技有限公司 Website wrongly-written character recognition method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10817509B2 (en) * 2017-03-16 2020-10-27 Massachusetts Institute Of Technology System and method for semantic mapping of natural language input to database entries via convolutional neural networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191456A (en) * 2018-11-15 2020-05-22 零氪科技(天津)有限公司 Method for identifying text segmentation by using sequence label
CN110427609A (en) * 2019-06-25 2019-11-08 首都师范大学 One kind writing people's composition structure of an article reasonability method for automatically evaluating
CN110503100A (en) * 2019-08-16 2019-11-26 湖南星汉数智科技有限公司 A kind of medical document recognition methods, device, computer installation and computer readable storage medium
CN111178049A (en) * 2019-12-09 2020-05-19 天津幸福生命科技有限公司 Text correction method and device, readable medium and electronic equipment
CN111079447A (en) * 2020-03-23 2020-04-28 深圳智能思创科技有限公司 Chinese-oriented pre-training method and system
CN111797908A (en) * 2020-06-18 2020-10-20 浪潮金融信息技术有限公司 Training set generation method of deep learning model for print character recognition
CN111984845A (en) * 2020-08-17 2020-11-24 江苏百达智慧网络科技有限公司 Website wrongly-written character recognition method and system

Also Published As

Publication number Publication date
CN114387602A (en) 2022-04-22

Similar Documents

Publication Publication Date Title
US10853638B2 (en) System and method for extracting structured information from image documents
CN111476284B (en) Image recognition model training and image recognition method and device and electronic equipment
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
US11106879B2 (en) Multilingual translation device and method
CN110472675B (en) Image classification method, image classification device, storage medium and electronic equipment
CN111444723B (en) Information extraction method, computer device, and storage medium
CN108804423B (en) Medical text feature extraction and automatic matching method and system
CN109086654B (en) Handwriting model training method, text recognition method, device, equipment and medium
US11031009B2 (en) Method for creating a knowledge base of components and their problems from short text utterances
WO2017161899A1 (en) Text processing method, device, and computing apparatus
AU2010311067A1 (en) System and method for increasing the accuracy of optical character recognition (OCR)
CN112101031B (en) Entity identification method, terminal equipment and storage medium
CN111177375B (en) Electronic document classification method and device
CN112287680A (en) Entity extraction method, device, equipment and storage medium of inquiry information
CN111046659A (en) Context information generating method, context information generating device, and computer-readable recording medium
CN110543637A (en) Chinese word segmentation method and device
CN114358001A (en) Method for standardizing diagnosis result, and related device, equipment and storage medium thereof
CN111563380A (en) Named entity identification method and device
CN114387602B (en) Medical OCR data optimization model training method, optimization method and equipment
CN116663536B (en) Matching method and device for clinical diagnosis standard words
CN117594183A (en) Radiological report generation method based on inverse fact data enhancement
CN112632956A (en) Text matching method, device, terminal and storage medium
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN111144345A (en) Character recognition method, device, equipment and storage medium
CN115858776A (en) Variant text classification recognition method, system, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant