CN113836901B - Method and system for cleaning Chinese and English medical synonym data

Info

Publication number: CN113836901B
Application number: CN202111074910.0A
Authority: CN
Inventors: 王则远; 刘鹏
Original assignee: Lingxi Quantum Beijing Medical Technology Co ltd
Current assignee: Lingxi Quantum Beijing Medical Technology Co ltd
Priority date: 2021-09-14
Filing date: 2021-09-14
Publication date: 2023-11-14
Anticipated expiration: 2041-09-14
Also published as: CN113836901A

Abstract

The invention provides a method and a system for cleaning Chinese and English medical synonym data. The method comprises: determining the Chinese and English medical synonym data to be cleaned; and inputting that data into a data cleaning model to obtain the data cleaning result output by the model, wherein the data cleaning model is trained based on standard synonym data. The invention uses AI technology to clean complex, disordered, irregular medical synonym data, handling tedious and time-consuming data processing work more accurately and quickly, while also intelligently supplementing missing medical vocabulary and making up for shortcomings in medical synonym data cleaning.

Description

Method and system for cleaning Chinese and English medical synonym data

Technical Field

The invention relates to the technical field of medical data processing, in particular to a method and a system for cleaning Chinese and English medical synonym data.

Background

In recent years, as internet technology has continued to develop, the volume of data that enterprises generate, mine and use has grown sharply. This growth is especially pronounced in the internet and medical industries, so the quality requirements for medical data keep rising. Acquired medical data may currently contain a large amount of redundant, missing, junk and useless data. To meet business requirements and improve product quality, large amounts of irregular data must be cleaned to obtain high-quality data that meets product requirements. Among medical data, however, medical synonym data is more complex and harder to clean.

The traditional way to clean medical synonym data relies on mature medical vocabularies, such as ICD data, the MeSH thesaurus and the WHO adverse reaction set, and corrects, cleans and filters terms by text string matching. This approach matches medical synonyms accurately, but it easily misses synonym data and is time-consuming. With the development of deep learning, text classification tasks in natural language processing (NLP) have been widely studied. NLP techniques based on machine learning improve accuracy, but they cannot learn complex semantics beyond the rules, and the quality of the training set strongly affects model performance. NLP techniques based on convolutional neural networks and recurrent neural networks can represent more abstract and complex text semantics by learning word vectors, which further improves model performance and accuracy. However, convolutional neural networks tend to capture only local semantic information, while recurrent neural networks do not take both the preceding and the following context into account and suffer from long-term dependency problems.

Disclosure of Invention

The embodiments of the invention provide a method and a system for cleaning Chinese and English medical synonym data, which solve some or all of the problems of existing methods for cleaning medical synonym data.

In a first aspect, an embodiment of the present invention provides a method for cleaning data of Chinese and English medical synonyms, including:

determining Chinese and English medical synonym data to be cleaned;

inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;

the data cleaning model is trained based on standard synonym data.

Preferably, the data cleaning model comprises a data loading and Chinese and English judging model, a filter and a Chinese and English synonym training model;

inputting the Chinese and English medical synonym data to be cleaned into the data cleaning model to obtain the data cleaning result output by the data cleaning model comprises the following steps:

inputting the Chinese-English medical synonym data to be cleaned into the data loading and Chinese-English judging model, and outputting a judging result of the Chinese-English data;

inputting the judgment result of the Chinese and English data into the filter, and outputting filtered data conforming to a preset rule and data not conforming to the preset rule;

inputting the data which do not accord with the preset rule into the Chinese-English synonym training model, and outputting the data cleaning result;

and fusing the filtered data meeting the preset rule into the data cleaning result.
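The four steps above can be read as a routing pipeline. The following Python sketch is only an illustration under stated assumptions: `clean`, `judge_language`, `rule_filter` and `models` are hypothetical names that do not appear in the source, the components are passed in as plain callables, each record is assumed to be a pair of terms, and labels are encoded as 1 (positive) and 0 (negative). It also assumes that "conforming to a preset rule" means a rule was able to decide the record.

```python
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # assumed shape of one candidate synonym record


def clean(
    pairs: List[Pair],
    judge_language: Callable[[Pair], str],
    rule_filter: Callable[[Pair, str], Optional[int]],
    models: Dict[str, Callable[[Pair], int]],
) -> List[dict]:
    """Route each record through language judging, the filter, then a model."""
    results = []
    for pair in pairs:
        lang = judge_language(pair)        # step 1: Chinese or English
        label = rule_filter(pair, lang)    # step 2: None if no rule decides
        source = "filter"
        if label is None:
            label = models[lang](pair)     # step 3: language-specific model
            source = "model"
        # step 4: filter-decided and model-decided records share one result
        results.append(
            {"pair": pair, "lang": lang, "label": label, "source": source}
        )
    return results
```

Passing the components in as callables keeps the sketch independent of any particular language detector, rule set or model.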

Preferably, the standard synonym data comprises a Chinese synonym standard set and an English synonym standard set;

the Chinese and English synonym training model comprises a Chinese synonym training model and an English synonym training model;

the Chinese synonym training model is obtained by fine-tuning on the Chinese synonym standard set;

the English synonym training model is obtained by fine-tuning on the English synonym standard set.

Preferably, inputting the data which does not meet the preset rule into the Chinese-English synonym training model, and outputting the data cleaning result includes:

inputting the data which does not conform to the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, outputting predicted positive-class data and negative-class data, and judging whether the proportion of predicted positive-class data reaches a preset proportion:

if the preset proportion is reached, outputting the data cleaning result; otherwise, adding the predicted positive-class data and the predicted negative-class data to the Chinese synonym standard set or the English synonym standard set, retraining the Chinese synonym training model or the English synonym training model accordingly, and then iteratively updating the data cleaning model.

In a second aspect, an embodiment of the present invention provides a system for cleaning Chinese and English medical synonym data, including:

the data determining unit is used for determining Chinese and English medical synonym data to be cleaned;

the data cleaning unit is used for inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;

the data cleaning model is trained based on standard synonym data.

Preferably, the data cleaning unit comprises a data loading and Chinese and English judging module, a filter and a Chinese and English synonym training module;

the data loading and Chinese and English judging module is used for inputting the Chinese and English medical synonym data to be cleaned and outputting a judging result of the Chinese and English data;

the filter is used for inputting the judging result of the Chinese and English data and outputting filtered data conforming to a preset rule and data not conforming to the preset rule;

the Chinese and English synonym training module is used for inputting the data which does not conform to the preset rule, outputting the data cleaning result, and fusing the filtered data which conforms to the preset rule into the data cleaning result;

the Chinese and English synonym training module comprises a Chinese synonym training model and an English synonym training model.

Preferably, the Chinese and English synonym training module is specifically configured to input the data that does not conform to the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, output predicted positive-class data and negative-class data, and determine whether the proportion of predicted positive-class data reaches a preset proportion:

In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the steps of the method for cleaning Chinese and English medical synonym data according to the first aspect.

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for cleaning Chinese and English medical synonym data according to the first aspect.

According to the method and the system for cleaning Chinese and English medical synonym data provided by the invention, the Chinese and English medical synonym data to be cleaned is input into the data cleaning model to obtain the data cleaning result output by the model, and the data cleaning model is trained based on standard synonym data. By cleaning medical synonym data with a biomedical pre-training model, the invention can remove redundant, junk and useless data and accurately fill in missing data. It thereby handles the tedious work of medical data cleaning efficiently, and the resulting high-quality medical data can support internet-assisted clinical diagnosis and help doctors make more comprehensive and accurate clinical decisions.

Drawings

To illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flow chart of the method for cleaning Chinese and English medical synonym data provided by the invention;

Fig. 2 is a schematic structural diagram of the system for cleaning Chinese and English medical synonym data provided by the invention;

Fig. 3 is a schematic structural diagram of the electronic device provided by the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The invention applies cutting-edge NLP technology to biomedical synonym data cleaning, cleaning medical synonym data by combining specific judgment rules with a pre-training model. Its key points are as follows:

1) Standard synonym data is used as the training set to fine-tune the model;

2) An exit mechanism for the cleaning loop is established by evaluating model metrics such as accuracy and F-score;

3) Model quality is improved continuously by cyclically expanding the training data until data cleaning is complete;

4) The method offers an approach to cleaning large-scale data, and the more automated process reduces the influence of labor cost and human factors on data cleaning.
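Key point 2 names accuracy and F-score as the metrics behind the exit mechanism. As a minimal sketch of how such metrics could be computed for binary labels, the Python below uses the standard definitions of accuracy, precision, recall and F1. The function names `evaluate` and `should_exit` and the 0.9 thresholds are assumptions chosen for illustration and are not taken from the source.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy and F1 for binary labels (1 = positive, 0 = negative)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(y_true), "f1": f1}


def should_exit(metrics: Dict[str, float],
                min_accuracy: float = 0.9,   # placeholder threshold
                min_f1: float = 0.9) -> bool:  # placeholder threshold
    """Return True when both metrics reach their (assumed) thresholds."""
    return metrics["accuracy"] >= min_accuracy and metrics["f1"] >= min_f1
```

For example, `evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])` gives an accuracy of 0.6 and an F1 of 2/3, so `should_exit` would return False with the placeholder thresholds.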

The invention provides a method and a system for cleaning Chinese and English medical synonym data, which are described below with reference to fig. 1-3.

An embodiment of the invention provides a method for cleaning Chinese and English medical synonym data. Fig. 1 is a flow chart of this method. As shown in Fig. 1, the method comprises the following steps:

step 110, determining Chinese and English medical synonym data to be cleaned;

step 120, inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;

the data cleaning model is trained based on standard synonym data.

The method provided by the embodiment of the invention uses AI technology to clean complex, disordered, irregular medical synonym data, handling tedious and time-consuming data processing work more accurately and quickly, while also intelligently supplementing missing medical vocabulary and making up for shortcomings in medical synonym data cleaning.

Based on any one of the above embodiments, the data cleaning model includes a data loading and Chinese-English judging model, a filter, and a Chinese-English synonym training model.

Specifically, the data to be cleaned is sent to the data loading and Chinese-English judging module. After being split by language, the Chinese data and the English data are sent to the filter and judged against the filter's rules for each language. Data that conforms to the rules is merged into the final cleaning result, and data that does not conform to the filter's rules is sent to the corresponding Chinese or English SynonymBert model.
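The source does not say how Chinese and English data are told apart. One simple possibility, shown here purely as an assumption, is to treat a term as Chinese when it contains at least one character in the CJK Unified Ideographs range. The names `judge_language` and `split_by_language` and the labels "zh" and "en" are hypothetical, and each record is assumed to be a pair of terms.

```python
import re
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # assumed shape of one candidate synonym record

# Assumption: any CJK Unified Ideograph marks the term as Chinese.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def judge_language(term: str) -> str:
    """Return "zh" if the term contains a CJK ideograph, otherwise "en"."""
    return "zh" if _CJK.search(term) else "en"


def split_by_language(pairs: List[Pair]) -> Dict[str, List[Pair]]:
    """Group records by the language of their first term."""
    buckets: Dict[str, List[Pair]] = {"zh": [], "en": []}
    for a, b in pairs:
        buckets[judge_language(a)].append((a, b))
    return buckets
```

Under this rule a mixed term such as "B型肝炎" is routed to the Chinese branch; a real system might need a finer policy for mixed-script terms.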

It should be noted that the preset rules of the filter are a set of rules, summarized from data characteristics observed while negative data was being screened using medical vocabularies and by clinicians, that can filter out part of the negative data. That is, the data to be cleaned is sent to the filter for rule-based judgment, and data that the rules cannot judge is sent to the Chinese or English SynonymBert model.
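The filter described above can be sketched as an ordered list of rule functions, each of which either labels a record as negative or declines to judge it. The two rules below (`rule_empty_term`, `rule_no_letters`) are invented placeholders, since the source does not list the actual rules; `apply_filter` and the pair-of-terms record shape are likewise assumptions.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]
Rule = Callable[[Pair], Optional[str]]  # returns "negative" or None (no verdict)


def rule_empty_term(pair: Pair) -> Optional[str]:
    """Hypothetical rule: flag records where either term is blank."""
    a, b = pair
    return "negative" if not a.strip() or not b.strip() else None


def rule_no_letters(pair: Pair) -> Optional[str]:
    """Hypothetical rule: flag records where a term has no letter or CJK character."""
    for term in pair:
        if not any(ch.isalpha() for ch in term):
            return "negative"
    return None


RULES: List[Rule] = [rule_empty_term, rule_no_letters]


def apply_filter(pairs: List[Pair], rules: List[Rule] = RULES):
    """Split records into rule-decided ones and ones left for the model."""
    decided, undecided = [], []
    for pair in pairs:
        for rule in rules:
            label = rule(pair)
            if label is not None:
                decided.append((pair, label))  # first matching rule wins
                break
        else:
            undecided.append(pair)  # no rule could judge this record
    return decided, undecided
```

Records in `undecided` correspond to the data that "cannot be judged" and would be passed on to the language-specific model.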

Based on any of the above embodiments, the standard synonym data includes a chinese synonym standard set and an english synonym standard set;

Specifically, based on medical terminology resources such as ICD10, the MeSH thesaurus, the WHO adverse reaction set and the medical synonym dictionary self-built by company A, professional clinicians select medical synonyms that meet clinical standards as the positive-class data of the standard set and select medical terms that do not meet clinical standards as the negative-class data. The positive-class and negative-class data together form the standard set of training data.
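As an illustration of what such a standard set might look like as a data structure, the sketch below assumes each example is a pair of terms with a binary label and removes duplicates that differ only in case, surrounding whitespace or term order. `LabeledPair` and `build_standard_set` are hypothetical names, and the pair-based representation is an assumption rather than something the source states.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class LabeledPair:
    term_a: str
    term_b: str
    label: int  # 1 = positive class, 0 = negative class


def build_standard_set(
    positive_pairs: Iterable[Tuple[str, str]],
    negative_pairs: Iterable[Tuple[str, str]],
) -> List[LabeledPair]:
    """Merge clinician-reviewed pairs into one labeled, de-duplicated list."""
    seen = set()
    standard_set: List[LabeledPair] = []
    for label, pairs in ((1, positive_pairs), (0, negative_pairs)):
        for a, b in pairs:
            # Order-insensitive, case-insensitive key for duplicate detection.
            key = frozenset((a.strip().lower(), b.strip().lower()))
            if key in seen:
                continue  # keep the first occurrence; positives are read first
            seen.add(key)
            standard_set.append(LabeledPair(a.strip(), b.strip(), label))
    return standard_set
```

In this sketch a pair that appears in both inputs keeps its positive label, simply because positives are processed first; a real curation workflow would more likely send such conflicts back for review.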

Specifically, the Chinese and English synonym training model, namely the SynonymBert model, is divided into ChSynonymBert and EnSynonymBert according to the language of the data to be cleaned; only Chinese and English data cleaning is supported at present. ChSynonymBert is a Chinese synonym pre-training model obtained by continued training on more than 800 thousand medical synonym vocabulary entries followed by fine-tuning on the Chinese synonym standard set. EnSynonymBert is an English synonym pre-training model obtained, on the basis of BioBert, by continued training on more than 1000 thousand medical synonym vocabulary entries followed by fine-tuning on the English synonym standard set.

It should be noted that, after the pre-training model Bert (Bidirectional Encoder Representations from Transformers) was released, a number of Bert variants for the biomedical field (such as BioBert and PubMedBert) appeared, offering a new approach to cleaning medical data with NLP algorithms. A biomedical pre-training model first obtains a task-independent model through self-supervised learning on a large-scale unlabeled biomedical corpus and is then fine-tuned on a specific task. Such a model can better understand context and represent text semantics, and it has repeatedly set new accuracy records on medical text tasks.

Based on any one of the above embodiments, inputting the data that does not conform to the preset rule into the training model of the Chinese-English synonym, and outputting the data cleaning result includes:

Specifically, the Chinese or English SynonymBert model runs inference on the data it receives and obtains the data predicted as positive class and as negative class. It then decides whether to output the cleaning result according to whether the proportion of positive-class data among the data received from the filter reaches 80% (a threshold that can be adjusted freely according to the current data quality). If the proportion is not reached, the data predicted as positive and the data predicted as negative are added to the original standard data set to continue training the model, the best trained model automatically replaces the original SynonymBert model, and the above process is executed in a loop until synonym data cleaning is complete.
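The loop described above can be sketched as follows. This is a simplified illustration, not the patented implementation: `clean_iteratively` is a hypothetical name, the model is abstracted behind `train` and `predict` callables, `max_rounds` is a safety guard added here that the source does not mention, and selection of the "best" trained model is not modeled (the retrained model simply replaces the previous one). The `toy_train` and `toy_predict` functions are stand-ins that only exist to make the sketch runnable.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (record, label) with 1 = positive, 0 = negative


def clean_iteratively(
    unresolved: List[str],
    standard_set: List[Example],
    train: Callable[[List[Example]], object],
    predict: Callable[[object, str], int],
    threshold: float = 0.8,   # the 80% proportion given in the text
    max_rounds: int = 5,      # assumption: guard against endless looping
) -> Tuple[List[Example], int]:
    """Predict, check the positive proportion, and retrain until it is reached."""
    model = train(standard_set)
    for round_no in range(1, max_rounds + 1):
        predictions = [(item, predict(model, item)) for item in unresolved]
        positives = sum(label for _, label in predictions)
        ratio = positives / len(unresolved) if unresolved else 1.0
        if ratio >= threshold or round_no == max_rounds:
            return predictions, round_no
        # Expand the training data with this round's predictions and retrain.
        standard_set = standard_set + predictions
        model = train(standard_set)
    raise ValueError("max_rounds must be at least 1")


def toy_train(examples: List[Example]) -> set:
    """Toy stand-in: remember every token seen in a positive example."""
    return {tok for text, label in examples if label == 1 for tok in text.split()}


def toy_predict(model: set, text: str) -> int:
    """Toy stand-in: positive if any token was seen in a positive example."""
    return int(any(tok in model for tok in text.split()))
```

With the toy stand-ins, a standard set of `[("heart attack", 1)]` and the four unresolved records "heart failure", "cardiac failure", "renal failure" and "cardiac arrest", the positive proportion rises from 25% to 75% to 100% and the loop exits in the third round. The toy model has no medical meaning; it only shows the control flow.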

The system for cleaning Chinese and English medical synonym data provided by the invention is described below. The system described below and the method described above may be referred to in correspondence with each other.

Fig. 2 is a schematic structural diagram of a system for cleaning Chinese and English medical synonym data according to an embodiment of the invention. As shown in Fig. 2, the system includes a data determining unit 210 and a data cleaning unit 220;

the data determining unit 210 is configured to determine data of Chinese and English medical synonyms to be cleaned;

the data cleaning unit 220 is configured to input the data of the Chinese and English medical synonyms to be cleaned to a data cleaning model, so as to obtain a data cleaning result output by the data cleaning model;

the data cleaning model is trained based on standard synonym data.

The system provided by the embodiment of the invention uses AI technology to clean complex, disordered, irregular medical synonym data, handling tedious and time-consuming data processing work more accurately and quickly, while also intelligently supplementing missing medical vocabulary and making up for shortcomings in medical synonym data cleaning.

Based on any one of the above embodiments, the data cleaning unit includes a data loading and Chinese-English judging module, a filter, and a Chinese-English synonym training module;

Based on any of the foregoing embodiments, the Chinese-English synonym training module is specifically configured to input the data that does not conform to the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, output predicted positive-class data and negative-class data, and determine whether the proportion of predicted positive-class data reaches a preset proportion:

Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 3, the electronic device may include: processor 310, communication interface (Communications Interface) 320, memory 330 and communication bus 340, wherein processor 310, communication interface 320, memory 330 accomplish communication with each other through communication bus 340. The processor 310 may invoke logic instructions in the memory 330 to perform a Chinese and English medical synonym data cleansing method comprising: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is trained based on standard synonym data.

Further, the logic instructions in the memory 330 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art or in part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

In another aspect, an embodiment of the present invention further provides a computer program product including a computer program stored on a non-transitory computer readable storage medium. The computer program includes program instructions which, when executed by a computer, enable the computer to perform the method for cleaning Chinese and English medical synonym data provided above, the method including: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is trained based on standard synonym data.

In still another aspect, an embodiment of the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method for cleaning Chinese and English medical synonym data provided above, the method including: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is trained based on standard synonym data.

The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for cleaning Chinese and English medical synonym data is characterized by comprising the following steps:

determining Chinese and English medical synonym data to be cleaned;

the data cleaning model is trained based on standard synonym data;

the data cleaning model comprises a data loading and Chinese and English judging model, a filter and a Chinese and English synonym training model;

inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model, and obtaining a data cleaning result output by the data cleaning model comprises the following steps:

fusing the filtered data meeting the preset rule into the data cleaning result; the standard synonym data comprises a Chinese synonym standard set and an English synonym standard set;

the Chinese synonym training model is obtained by fine-tuning a Chinese Bert model on the Chinese synonym standard set;

the English synonym training model is obtained by fine-tuning a BioBert model on the English synonym standard set.

2. The method for cleaning data of Chinese and English synonyms according to claim 1, wherein inputting the data which does not accord with the preset rule into the Chinese and English synonym training model and outputting the data cleaning result comprises the following steps:

3. A system for cleaning data of Chinese and English medical synonyms, comprising:

the data cleaning model is trained based on standard synonym data;

the data cleaning unit comprises a data loading and Chinese and English judging module, a filter and a Chinese and English synonym training module;

the filter is used for inputting the judging result of the Chinese and English data and outputting filtered data conforming to a preset rule and data not conforming to the preset rule; the Chinese-English synonym training module is used for inputting the data which does not accord with the preset rule, outputting the data cleaning result, and fusing the filtered data which accords with the preset rule into the data cleaning result; the standard synonym data comprises a Chinese synonym standard set and an English synonym standard set;

4. The system for cleaning Chinese and English medical synonym data according to claim 3, wherein the Chinese and English synonym training module is configured to input the data that does not conform to the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, output predicted positive-class data and negative-class data, and determine whether the proportion of predicted positive-class data reaches a preset proportion:

5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the method for cleaning Chinese and English medical synonym data as claimed in any one of claims 1 to 2.

6. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method for cleaning Chinese and English medical synonym data according to any one of claims 1 to 2.