CN113836901A

CN113836901A - Chinese and English medicine synonym data cleaning method and system

Info

Publication number: CN113836901A
Application number: CN202111074910.0A
Authority: CN
Inventors: 王则远; 刘鹏
Original assignee: Lingxi Quantum Beijing Medical Technology Co ltd
Current assignee: Lingxi Quantum Beijing Medical Technology Co ltd
Priority date: 2021-09-14
Filing date: 2021-09-14
Publication date: 2021-12-24
Anticipated expiration: 2041-09-14
Also published as: CN113836901B

Abstract

The invention provides a method and a system for cleaning Chinese and English medical synonym data, wherein the method comprises the following steps: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is obtained by training based on standard synonym data. The invention uses AI technology to clean irregular medical synonym data, handles the complexity and long processing time of data processing work more accurately and quickly, and at the same time intelligently completes missing medical vocabulary, remedying the shortcomings of medical synonym data cleaning.

Description

Chinese and English medicine synonym data cleaning method and system

Technical Field

The invention relates to the technical field of medical data processing, in particular to a method and a system for cleaning Chinese and English medical synonym data.

Background

In recent years, with the continued development of internet technology, the volume of data that enterprises generate, mine and use has grown greatly. In particular, as the internet and medical industries have grown in scale, the quality requirements for medical data have risen steadily. At present, acquired medical data may contain a large amount of redundant, missing, spam and useless data. To meet business requirements and improve product quality, a large amount of irregular data must be cleaned to obtain high-quality data that meets product requirements. Among medical data, however, medical synonym data is more complex and harder to clean.

For cleaning medical synonym data, the traditional method relies on mature medical lexicons, such as ICD data, the MeSH thesaurus and the WHO adverse reaction set, and corrects, cleans and filters vocabulary through text string matching. This method can match medical synonyms exactly, but it easily misses synonym data and is time-consuming. With the development of deep learning techniques, the text classification task in Natural Language Processing (NLP) has been widely studied. Although accuracy is improved, the model cannot learn complex semantics outside the rules, and the quality of the training set strongly affects model performance. NLP techniques based on convolutional neural networks and recurrent neural networks can represent more abstract and complex text semantics by learning word vectors, further improving model performance and accuracy. However, a convolutional neural network tends to capture only local semantic information, while a recurrent neural network does not consider contextual semantics at the same time and suffers from the long-term dependence problem.

Disclosure of Invention

The embodiment of the invention provides a method and a system for cleaning Chinese and English medical synonym data, which are used to solve some or all of the problems in conventional methods for cleaning medical synonym data.

In a first aspect, an embodiment of the present invention provides a method for cleaning Chinese and English medical synonym data, including:

determining Chinese and English medical synonym data to be cleaned;

inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;

the data cleaning model is obtained by training based on standard synonym data.

Preferably, the data cleaning model comprises a data loading and Chinese-English judgment model, a filter and a Chinese-English synonym training model;

inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model comprises:

inputting the Chinese and English medical synonym data to be cleaned into the data loading and Chinese and English judging model, and outputting a judging result of the Chinese and English data;

inputting the judgment result of the Chinese and English data into the filter, and outputting filtered data meeting a preset rule and data not meeting the preset rule;

inputting the data which do not accord with the preset rule into the Chinese and English synonym training model, and outputting the data cleaning result;

and fusing the filtered data which accords with the preset rule into the data cleaning result.
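The four steps above can be read as a small pipeline. The following is a minimal sketch under stated assumptions: the callables `detect_language`, `passes_rules` and `model_predict` are hypothetical stand-ins for the Chinese-English judgment model, the filter and the synonym training model, and representing each entry as a (term, candidate synonym) pair is likewise an assumption. None of these names come from the invention.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed representation: (term, candidate synonym).
Pair = Tuple[str, str]


def clean_synonyms(
    pairs: Iterable[Pair],
    detect_language: Callable[[Pair], str],      # hypothetical judgment model
    passes_rules: Callable[[Pair, str], bool],   # hypothetical filter
    model_predict: Callable[[Pair, str], bool],  # hypothetical synonym model
) -> List[Pair]:
    """Return the pairs kept after rule filtering and model-based cleaning."""
    accepted_by_filter: List[Pair] = []
    accepted_by_model: List[Pair] = []
    for pair in pairs:
        lang = detect_language(pair)        # step 1: Chinese/English judgment
        if passes_rules(pair, lang):        # step 2: filter with preset rules
            accepted_by_filter.append(pair)
        elif model_predict(pair, lang):     # step 3: model judges the rest
            accepted_by_model.append(pair)
    # step 4: fuse the rule-conforming data into the cleaning result
    return accepted_by_filter + accepted_by_model
```

Passing the three components as callables keeps the sketch independent of any particular model or rule set.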

Preferably, the standard synonym data includes a Chinese synonym standard set and an English synonym standard set;

the Chinese and English synonym training model comprises a Chinese synonym training model and an English synonym training model;

the Chinese synonym training model is obtained by fine-tuning on the Chinese synonym standard set;

the English synonym training model is obtained by fine-tuning on the English synonym standard set.

Preferably, inputting the data which does not conform to the preset rule into the Chinese-English synonym training model and outputting the data cleaning result includes:

inputting the data which does not conform to the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, outputting predicted positive-class data and predicted negative-class data, and judging whether the proportion of predicted positive-class data reaches a preset proportion:

if the preset proportion is reached, outputting the data cleaning result; otherwise, adding the predicted positive-class data and the predicted negative-class data to the Chinese synonym standard set or the English synonym standard set, retraining the corresponding Chinese synonym training model or English synonym training model, and iteratively updating the data cleaning model.

In a second aspect, an embodiment of the present invention provides a system for cleaning Chinese and English medical synonym data, including:

the data determining unit is used for determining Chinese and English medical synonym data to be cleaned;

the data cleaning unit is used for inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;

the data cleaning model is obtained by training based on standard synonym data.

Preferably, the data cleaning unit comprises a data loading and Chinese and English judging module, a filter and a Chinese and English synonym training module;

the data loading and Chinese-English judging module is used for inputting the Chinese-English medical synonym data to be cleaned and outputting a judging result of the Chinese-English data;

the filter is used for inputting the judgment result of the Chinese and English data and outputting the filtered data meeting the preset rule and the data not meeting the preset rule;

and the Chinese and English synonym training module is used for inputting the data which do not accord with the preset rule, outputting the data cleaning result and fusing the filtered data which accord with the preset rule into the data cleaning result;

the Chinese and English synonym training module comprises a Chinese synonym training model and an English synonym training model.

Preferably, the Chinese-English synonym training module is specifically configured to input the data that does not meet the preset rule into the Chinese synonym training model or the English synonym training model based on the determination result of the Chinese and English data, output predicted positive data and predicted negative data, and determine whether the predicted positive data proportion reaches a preset ratio:

In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the program to implement the steps of the Chinese and English medical synonym data cleaning method provided in the first aspect.

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the Chinese and English medical synonym data cleaning method provided in the first aspect.

According to the method and the system for cleaning Chinese and English medical synonym data provided by the embodiment of the invention, the Chinese and English medical synonym data to be cleaned are input into the data cleaning model, and a data cleaning result output by the data cleaning model is obtained; the data cleaning model is obtained by training based on standard synonym data. By using a biomedical pre-trained model to clean medical synonym data, the invention can remove redundant, spam and useless data and accurately complete missing data, thereby efficiently solving the tedious problem of medical data cleaning. High-quality medical data in turn helps internet-based services support clinical diagnosis and helps doctors make more comprehensive and accurate clinical decisions.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flow chart of a method for cleaning Chinese and English synonym data according to the present invention;

FIG. 2 is a schematic structural diagram of a Chinese and English medical synonym data cleaning system provided by the present invention;

FIG. 3 is a schematic structural diagram of an electronic device provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention applies cutting-edge NLP technology to the cleaning of biomedical synonym data, which is carried out by combining specific judgment rules with a pre-trained model. The key points of the method are as follows:

1) standard synonym data is used as the training set to fine-tune the model;

2) the exit condition for cleaning is determined by evaluating model metrics such as accuracy and F value;

3) data cleaning is completed by cyclically expanding the training data to continuously improve the quality of the model;

4) the approach offers a cleaning strategy for large-scale data, makes the process more automatic, and reduces the influence of labor cost and human factors on data cleaning.
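Point 2) above names accuracy and F value without defining them. A minimal sketch of the standard binary-classification definitions follows. The function name `evaluate` is hypothetical, and the choice of F1 (the balanced F value) is an assumption, since the description does not state which F value is used.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = len(y_true) - tp - fp - fn
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```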

The following describes a method and a system for cleaning Chinese and English medical synonym data according to the present invention with reference to FIG. 1 to FIG. 3.

The embodiment of the invention provides a method for cleaning Chinese and English medical synonym data. FIG. 1 is a schematic flow chart of a method for cleaning Chinese and English medical synonym data according to an embodiment of the present invention. As shown in FIG. 1, the method includes:

step 110, determining Chinese and English medical synonym data to be cleaned;

step 120, inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;

the data cleaning model is obtained by training based on standard synonym data.

According to the method provided by the embodiment of the invention, AI technology is used to clean irregular medical synonym data, the complexity and long processing time of data processing work are handled more accurately and quickly, and at the same time missing medical vocabulary is completed intelligently, remedying the shortcomings of medical synonym data cleaning.

Based on any one of the above embodiments, the data cleaning model comprises a data loading and Chinese-English judgment model, a filter and a Chinese-English synonym training model;

Specifically, the data to be screened is sent to the data loading and Chinese and English judging module, where the Chinese data and the English data are separated and then sent to the filter. Each is judged against the filter rules: data that conforms to the rules is fused into the final cleaning result, and data that does not conform to the filter rules is sent to the corresponding Chinese or English SynonymBert model.
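The description does not state how the Chinese and English judgment is made. One simple heuristic, shown here purely as an assumption, treats a term as Chinese if it contains any character in the CJK Unified Ideographs block (U+4E00 to U+9FFF) and as English otherwise. The function names are hypothetical.

```python
from typing import Dict, Iterable, List


def is_chinese(text: str) -> bool:
    """Heuristic (assumed): True if any character is a CJK Unified Ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def split_by_language(terms: Iterable[str]) -> Dict[str, List[str]]:
    """Partition terms into 'zh' and 'en' buckets using the heuristic above."""
    buckets: Dict[str, List[str]] = {"zh": [], "en": []}
    for term in terms:
        buckets["zh" if is_chinese(term) else "en"].append(term)
    return buckets
```

Mixed terms such as "2型糖尿病" fall into the Chinese bucket under this heuristic; a production system might need a finer policy.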

It should be noted that the preset rules of the filter are a set of rules, summarized from the data characteristics of medical lexicons and from the characteristics clinicians rely on when screening negative data, that can filter out part of the negative data. That is, the data to be cleaned is sent to the filter for rule-based judgment, and the data that the rules cannot judge is sent to the Chinese or English SynonymBert model.
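The concrete filter rules are not disclosed. The sketch below only illustrates the mechanism of splitting data into a rule-conforming part and a remainder that goes to the model. The single rule shown, `same_after_normalization`, is a hypothetical example and not one of the rules of the invention.

```python
import re
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]
Rule = Callable[[str, str], bool]


def _normalize(text: str) -> str:
    """Lowercase and drop whitespace, hyphens and underscores."""
    return re.sub(r"[\s\-_]+", "", text.lower())


def same_after_normalization(a: str, b: str) -> bool:
    """Hypothetical rule: the two terms differ only in case or separators."""
    return _normalize(a) == _normalize(b)


def apply_filter(
    pairs: Iterable[Pair], rules: List[Rule]
) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (conforming, non_conforming) under the given rules."""
    conforming: List[Pair] = []
    non_conforming: List[Pair] = []
    for a, b in pairs:
        # A pair conforms as soon as any one rule accepts it.
        target = conforming if any(rule(a, b) for rule in rules) else non_conforming
        target.append((a, b))
    return conforming, non_conforming
```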

Based on any of the above embodiments, the standard synonym data includes a Chinese synonym standard set and an English synonym standard set;

Specifically, based on ICD-10, the MeSH thesaurus, the WHO adverse reaction set, a medical synonym dictionary self-built by company A, and other medical term libraries, professional clinicians screen out medical synonyms that meet clinical standards as the positive data of the standard set and screen out medical words that do not meet clinical standards as the negative data of the standard set; the positive data and the negative data together form the standard set of training data.
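As an illustration of how such a standard set might be represented, the sketch below merges clinician-reviewed positive and negative pairs into one labeled list. The `SynonymExample` record, the pair representation and the de-duplication policy are all assumptions and are not taken from the description.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SynonymExample:
    term: str
    candidate: str
    label: int  # 1 = accepted as a synonym, 0 = rejected


def build_standard_set(
    positives: Iterable[Tuple[str, str]],
    negatives: Iterable[Tuple[str, str]],
) -> List[SynonymExample]:
    """Merge reviewed pairs into one labeled, de-duplicated training set."""
    seen = set()
    examples: List[SynonymExample] = []
    for label, pairs in ((1, positives), (0, negatives)):
        for term, candidate in pairs:
            key = (term, candidate)
            if key in seen:
                continue  # assumed policy: keep the first label seen for a pair
            seen.add(key)
            examples.append(SynonymExample(term, candidate, label))
    return examples
```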

Specifically, the Chinese-English synonym training model, i.e., the SynonymBert model, is divided into ChSynonymBert and EnSynonymBert according to the language of the data to be cleaned; currently, only Chinese and English data cleaning is supported. ChSynonymBert is a Chinese synonym pre-trained model obtained, on the basis of Chinese BERT, by continued training on more than 800 thousand medical synonym data entries and then fine-tuning with the Chinese synonym standard set. EnSynonymBert is an English synonym pre-trained model obtained, on the basis of BioBERT, by continued training on more than 1000 thousand medical synonym data entries and then fine-tuning with the English synonym standard set.

It should be noted that after the release of the pre-trained model BERT (Bidirectional Encoder Representations from Transformers), a number of BERT variants pre-trained for the biomedical field (such as BioBERT and PubMedBERT) appeared, providing a new approach to cleaning medical data with NLP algorithms. A biomedical pre-trained model first obtains a task-independent model through self-supervised learning on a large-scale unlabeled biomedical corpus and is then fine-tuned on a specific task. Such models can better understand context information and represent text semantics, and they continue to set new accuracy records on medical text tasks.

Based on any one of the above embodiments, inputting the data that does not conform to the preset rule into the Chinese-English synonym training model, and outputting the data cleaning result, including:

Specifically, the Chinese or English SynonymBert model runs inference on the received data to obtain the data predicted as the positive class and the data predicted as the negative class. Whether to output the cleaning result is then decided by whether the proportion of positive-class data among the data received from the filter reaches 80% (this value can be adjusted freely according to the current data quality). If the proportion is not reached, the data predicted as positive and the data predicted as negative are added to the original standard data set and the model continues training; the best model obtained from training automatically replaces the original SynonymBert model. Synonym data cleaning is completed by executing this procedure cyclically.
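The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: `model` and `retrain` are hypothetical callables standing in for SynonymBert inference and training, and the `max_rounds` guard is an added safety limit that the description does not mention.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
Labeled = Tuple[Pair, int]     # (pair, 1 for positive class / 0 for negative class)
Model = Callable[[Pair], int]  # hypothetical stand-in for model inference


def iterative_clean(
    data: List[Pair],
    model: Model,
    standard_set: List[Labeled],
    retrain: Callable[[List[Labeled]], Model],  # hypothetical training routine
    threshold: float = 0.8,   # the 80% proportion; adjustable per the description
    max_rounds: int = 10,     # assumption: guard against endless looping
) -> List[Pair]:
    """Repeat predict -> check proportion -> expand standard set -> retrain."""
    positives: List[Pair] = []
    for _ in range(max_rounds):
        labeled = [(pair, model(pair)) for pair in data]
        positives = [pair for pair, label in labeled if label == 1]
        if not data or len(positives) / len(data) >= threshold:
            break  # proportion reached: output the cleaning result
        standard_set = standard_set + labeled  # expand with both classes
        model = retrain(standard_set)          # new model replaces the old one
    return positives
```

If the threshold is never reached within `max_rounds`, this sketch returns the positives from the last round; how the invention handles that case is not stated.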

In the following, a Chinese and English medical synonym data cleaning system according to the present invention is described; the system described below and the method described above may be referred to in correspondence with each other.

FIG. 2 is a schematic structural diagram of a Chinese and English medical synonym data cleaning system according to an embodiment of the present invention. As shown in FIG. 2, the system includes a data determining unit 210 and a data cleaning unit 220;

the data determining unit 210 is configured to determine Chinese and English medical synonym data to be cleaned;

the data cleaning unit 220 is configured to input the Chinese and English medical synonym data to be cleaned to a data cleaning model, and obtain a data cleaning result output by the data cleaning model;

the data cleaning model is obtained by training based on standard synonym data.

According to the system provided by the embodiment of the invention, AI technology is used to clean irregular medical synonym data, the complexity and long processing time of data processing work are handled more accurately and quickly, and at the same time missing medical vocabulary is completed intelligently, remedying the shortcomings of medical synonym data cleaning.

Based on any one of the above embodiments, the data cleaning unit includes a data loading and Chinese-English judging module, a filter, and a Chinese-English synonym training module;

Based on any of the above embodiments, the Chinese-English synonym training module is specifically configured to input the data that does not meet the preset rule into the Chinese synonym training model or the English synonym training model based on the determination result of the Chinese and English data, output predicted positive data and predicted negative data, and determine whether the predicted positive data proportion reaches a preset ratio:

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in FIG. 3, the electronic device may include a processor 310, a communication interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may invoke logic instructions in the memory 330 to perform a Chinese and English medical synonym data cleaning method, the method comprising: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is obtained by training based on standard synonym data.

In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and, when sold or used as independent products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

In another aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is capable of executing the Chinese and English medical synonym data cleaning method provided by the above method embodiments, where the method includes: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is obtained by training based on standard synonym data.

In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the Chinese and English medical synonym data cleaning method provided in the foregoing embodiments, and the method includes: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is obtained by training based on standard synonym data.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A Chinese and English medicine synonym data cleaning method is characterized by comprising the following steps:

determining Chinese and English medical synonym data to be cleaned;

the data cleaning model is obtained by training based on standard synonym data.

2. The method according to claim 1, wherein the data cleaning model comprises a data loading and Chinese-English judgment model, a filter, a Chinese-English synonym training model;

3. The method according to claim 2, wherein the standard synonym data includes a Chinese synonym standard set and an English synonym standard set;

4. The method for cleaning Chinese and English medical synonym data according to claim 2, wherein the step of inputting the data which does not conform to the preset rule into the Chinese and English synonym training model and outputting the data cleaning result comprises the steps of:

5. A Chinese and English medicine synonym data cleaning system is characterized by comprising:

the data cleaning model is obtained by training based on standard synonym data.

6. The system according to claim 5, wherein the data cleaning unit comprises a data loading and Chinese-English judging module, a filter, a Chinese-English synonym training module;

7. The system according to claim 6, wherein the standard synonym data includes a Chinese synonym standard set and an English synonym standard set;

8. The system according to claim 6, wherein the Chinese-English synonym training module is specifically configured to input the data that does not meet the preset rule into the Chinese synonym training model or the English synonym training model based on the determination result of the Chinese and English data, output predicted positive data and predicted negative data, and determine whether the predicted positive data proportion reaches a preset proportion:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for cleaning Chinese and English medical synonym data according to any one of claims 1 to 4 when executing the program.

10. A non-transitory computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the Chinese and English medical synonym data cleaning method according to any one of claims 1 to 4.