CN113836901B - Method and system for cleaning Chinese and English medical synonym data - Google Patents

Publication number
CN113836901B
Authority
CN
China
Prior art keywords
data
synonym
chinese
english
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111074910.0A
Other languages
Chinese (zh)
Other versions
CN113836901A (en
Inventor
王则远
刘鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lingxi Quantum Beijing Medical Technology Co ltd
Original Assignee
Lingxi Quantum Beijing Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lingxi Quantum Beijing Medical Technology Co ltd
Priority to CN202111074910.0A
Publication of CN113836901A
Application granted
Publication of CN113836901B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/247: Thesauruses; Synonyms
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a method and a system for cleaning Chinese and English medical synonym data. The method comprises: determining the Chinese and English medical synonym data to be cleaned; and inputting that data into a data cleaning model to obtain a data cleaning result output by the data cleaning model, wherein the data cleaning model is trained on standard synonym data. By means of AI technology, the invention cleans complex, disordered, irregular medical synonym data, handles tedious and time-consuming data processing work more accurately and quickly, intelligently supplements missing medical vocabulary, and fills a gap in medical synonym data cleaning.

Description

Method and system for cleaning Chinese and English medical synonym data
Technical Field
The invention relates to the technical field of medical data processing, in particular to a method and a system for cleaning Chinese and English medical synonym data.
Background
In recent years, with the continued development of Internet technology, the volume of enterprise data has grown greatly as data is generated, mined and used. In particular, Internet-based medical services have grown rapidly, so the quality requirements on medical data keep rising. At present, acquired medical data may contain a large amount of redundant, missing, junk and useless data. To meet business requirements and improve product quality, large amounts of irregular data must be cleaned to obtain high-quality data that meets product requirements. Among medical data, however, medical synonym data is more complex and harder to clean.
To clean medical synonym data, the traditional method relies on mature medical vocabularies, such as ICD data, the MeSH thesaurus and the WHO adverse reaction terminology, and corrects, cleans and filters words through text string matching. This method matches medical synonyms accurately, but it easily misses synonym data and is time-consuming. With the development of deep learning, text classification tasks in natural language processing (NLP) have been widely studied. NLP techniques based on traditional machine learning improve accuracy, but they cannot learn complex semantics beyond the rules, and the quality of the training set strongly affects model performance. NLP techniques based on convolutional neural networks and recurrent neural networks can represent more abstract and complex text semantics by learning word vectors, which further improves model performance and accuracy. However, convolutional neural networks tend to capture only local semantic information, whereas recurrent neural networks do not consider the preceding and following context at the same time and suffer from long-term dependency problems.
Disclosure of Invention
The embodiments of the invention provide a method and a system for cleaning Chinese and English medical synonym data, which solve some or all of the problems of existing methods for cleaning medical synonym data.
In a first aspect, an embodiment of the present invention provides a method for cleaning data of Chinese and English medical synonyms, including:
determining Chinese and English medical synonym data to be cleaned;
inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;
the data cleaning model is trained based on standard synonym data.
Preferably, the data cleaning model comprises a data loading and Chinese and English judging model, a filter and a Chinese and English synonym training model;
inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model, wherein the data cleaning result comprises the following steps:
inputting the Chinese-English medical synonym data to be cleaned into the data loading and Chinese-English judging model, and outputting a judging result of the Chinese-English data;
inputting the judgment result of the Chinese and English data into the filter, and outputting filtered data conforming to a preset rule and data not conforming to the preset rule;
inputting the data which do not accord with the preset rule into the Chinese-English synonym training model, and outputting the data cleaning result;
and fusing the filtered data meeting the preset rule into the data cleaning result.
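The four steps above can be sketched as a small routing function. This is only an illustration under stated assumptions, not the patented implementation: the `(term, candidate synonym)` record format, the CJK-range language check, the names `detect_language` and `clean`, and the toy rule and toy models are all hypothetical, and the sketch assumes the cleaning result keeps only the pairs the model predicts as positive.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (term, candidate synonym); the record format is an assumption


def detect_language(pair: Pair) -> str:
    """Return "zh" if the pair contains any CJK ideograph, otherwise "en"."""
    text = pair[0] + pair[1]
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"


def clean(
    pairs: List[Pair],
    passes_rules: Callable[[Pair], bool],
    models: Dict[str, Callable[[Pair], bool]],
) -> List[Pair]:
    """Route each pair through language judgment, the rule filter, and a model."""
    result: List[Pair] = []
    for pair in pairs:
        lang = detect_language(pair)
        if passes_rules(pair):
            # Data conforming to the preset rules is fused directly into the result.
            result.append(pair)
        elif models[lang](pair):
            # Other data is kept only if the language-specific model predicts positive.
            result.append(pair)
    return result


def toy_rules(pair: Pair) -> bool:
    """Hypothetical rule: the two strings differ only in letter case."""
    return pair[0].lower() == pair[1].lower()


# Hypothetical stand-ins for the fine-tuned Chinese and English models.
toy_models: Dict[str, Callable[[Pair], bool]] = {
    "zh": lambda p: len(p[0]) == len(p[1]),
    "en": lambda p: p[0][0] == p[1][0],
}

cleaned = clean(
    [
        ("Fever", "fever"),
        ("myocardial infarction", "hypertension"),
        ("发热", "发烧"),
        ("头痛", "高血压病"),
    ],
    toy_rules,
    toy_models,
)
print(cleaned)
```

Passing the filter and the models in as callables keeps the routing logic separate from the components, so each one can be replaced without touching the control flow.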
Preferably, the standard synonym data comprises a Chinese synonym standard set and an English synonym standard set;
the Chinese and English synonym training model comprises a Chinese synonym training model and an English synonym training model;
the Chinese synonym training model is obtained by fine-tuning on the Chinese synonym standard set;
the English synonym training model is obtained by fine-tuning on the English synonym standard set.
Preferably, inputting the data which does not meet the preset rule into the Chinese-English synonym training model, and outputting the data cleaning result includes:
inputting the data which does not conform to the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, outputting predicted positive class data and predicted negative class data, and judging whether the proportion of the predicted positive class data reaches a preset proportion:
if the preset proportion is reached, outputting the data cleaning result; otherwise, adding the predicted positive class data and the predicted negative class data to the Chinese synonym standard set or the English synonym standard set, correspondingly continuing to train the Chinese synonym training model or the English synonym training model, and then iteratively updating the data cleaning model.
In a second aspect, an embodiment of the present invention provides a system for cleaning Chinese and English medical synonym data, including:
the data determining unit is used for determining Chinese and English medical synonym data to be cleaned;
the data cleaning unit is used for inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;
the data cleaning model is trained based on standard synonym data.
Preferably, the data cleaning unit comprises a data loading and Chinese and English judging module, a filter and a Chinese and English synonym training module;
the data loading and Chinese and English judging module is used for inputting the Chinese and English medical synonym data to be cleaned and outputting a judging result of the Chinese and English data;
the filter is used for inputting the judging result of the Chinese and English data and outputting filtered data conforming to a preset rule and data not conforming to the preset rule;
the Chinese and English synonym training module is used for inputting the data which does not accord with the preset rule, outputting the data cleaning result, and fusing the filtered data which accords with the preset rule into the data cleaning result.
Preferably, the standard synonym data comprises a Chinese synonym standard set and an English synonym standard set;
the Chinese and English synonym training module comprises a Chinese synonym training model and an English synonym training model;
the Chinese synonym training model is obtained by fine-tuning on the Chinese synonym standard set;
the English synonym training model is obtained by fine-tuning on the English synonym standard set.
Preferably, the Chinese and English synonym training module is specifically configured to input the data which does not conform to the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, output predicted positive class data and predicted negative class data, and judge whether the proportion of the predicted positive class data reaches a preset proportion:
if the preset proportion is reached, outputting the data cleaning result; otherwise, adding the predicted positive class data and the predicted negative class data to the Chinese synonym standard set or the English synonym standard set, correspondingly continuing to train the Chinese synonym training model or the English synonym training model, and then iteratively updating the data cleaning model.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor, when executing the program, implements the steps of the method for cleaning Chinese and English medical synonym data according to any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for cleaning Chinese and English medical synonym data according to any one of the first aspect.
According to the method and the system for cleaning Chinese and English medical synonym data provided by the invention, the Chinese and English medical synonym data to be cleaned are input into the data cleaning model, and the data cleaning result output by the data cleaning model is obtained; the data cleaning model is trained on standard synonym data. The invention cleans medical synonym data by means of a biomedical pre-training model. It can remove redundant, junk and useless data and accurately fill in missing data, which makes the tedious work of cleaning medical data efficient. The resulting high-quality medical data can support clinical diagnosis over the Internet and help doctors make more comprehensive and accurate clinical decisions.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a method for cleaning Chinese and English medical synonym data;
Fig. 2 is a schematic structural diagram of a system for cleaning Chinese and English medical synonym data provided by the invention;
Fig. 3 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention applies cutting-edge NLP technology to biomedical synonym data cleaning and cleans medical synonym data by combining specific judgment rules with a pre-training model. Its key points are as follows:
1) Standard synonym data is used as the training set to fine-tune the model;
2) An exit mechanism for the cleaning loop is established by evaluating model metrics such as accuracy and F-score;
3) Model quality is improved continuously by cyclically expanding the training data until data cleaning is complete;
4) The method offers an approach to cleaning large-scale data; the process is more automated, which reduces labor cost and the influence of human factors on data cleaning.
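Key point 2 refers to accuracy and F-score as evaluation metrics. As a minimal sketch, the two metrics and an exit check could be computed as follows for binary labels (1 = synonym, 0 = not a synonym). The function names and the 0.9 thresholds are assumptions made for illustration; the source does not state threshold values for these metrics.

```python
from typing import List, Tuple


def accuracy_and_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Compute accuracy and the F1 score for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


def should_exit(accuracy: float, f1: float,
                min_accuracy: float = 0.9, min_f1: float = 0.9) -> bool:
    """Exit the cleaning loop once both metrics reach their assumed thresholds."""
    return accuracy >= min_accuracy and f1 >= min_f1


# Five held-out examples: four correct predictions and one false positive.
acc, f1 = accuracy_and_f1([1, 1, 0, 0, 1], [1, 1, 0, 1, 1])
print(acc, round(f1, 3), should_exit(acc, f1))
```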
The invention provides a method and a system for cleaning Chinese and English medical synonym data, which are described below with reference to fig. 1-3.
An embodiment of the invention provides a method for cleaning Chinese and English medical synonym data. Fig. 1 is a flow chart of the method provided by this embodiment. As shown in Fig. 1, the method comprises the following steps:
step 110, determining Chinese and English medical synonym data to be cleaned;
step 120, inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;
the data cleaning model is trained based on standard synonym data.
The method provided by the embodiment of the invention uses AI technology to clean complex, disordered, irregular medical synonym data. It handles tedious and time-consuming data processing work more accurately and quickly, intelligently supplements missing medical vocabulary, and fills a gap in medical synonym data cleaning.
Based on any one of the above embodiments, the data cleaning model includes a data loading and Chinese-English judging model, a filter, and a Chinese-English synonym training model;
inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model, wherein the data cleaning result comprises the following steps:
inputting the Chinese-English medical synonym data to be cleaned into the data loading and Chinese-English judging model, and outputting a judging result of the Chinese-English data;
specifically, the data to be cleaned is sent to the data loading and Chinese and English judging module, which separates the Chinese data from the English data and sends each to the filter. Each is judged against the corresponding filter rules: data that conforms to the rules is merged into the final cleaning result, and data that does not conform to the filter rules is sent to the corresponding Chinese or English SynonymBert model.
Inputting the judgment result of the Chinese and English data into the filter, and outputting filtered data conforming to a preset rule and data not conforming to the preset rule;
it should be noted that the preset rules of the filter are a set of rules, summarized from the characteristics of the data observed while screening negative class data against medical vocabularies and with clinicians, that can filter out part of the negative class data. That is, the data to be cleaned is sent to the filter for rule-based judgment, and the data that the rules cannot judge is sent to the Chinese or English SynonymBert model.
Inputting the data which do not accord with the preset rule into the Chinese-English synonym training model, and outputting the data cleaning result;
and fusing the filtered data meeting the preset rule into the data cleaning result.
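The filter step described above splits the data into a part that conforms to the preset rules and a part that is passed on to the model. The source does not list the actual rules, so the single rule below, the `(term, candidate synonym)` record format and all names are hypothetical; the sketch only shows how a rule list can produce the two output streams.

```python
import re
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (term, candidate synonym); an assumed record format
Rule = Callable[[Pair], bool]


def _normalize(text: str) -> str:
    """Lower-case the text and drop whitespace and hyphens."""
    return re.sub(r"[\s\-]+", "", text).lower()


def differs_only_in_form(pair: Pair) -> bool:
    """Hypothetical rule: both strings are equal after normalization."""
    return _normalize(pair[0]) == _normalize(pair[1])


RULES: List[Rule] = [differs_only_in_form]


def split_by_rules(pairs: List[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Return (pairs conforming to at least one rule, pairs left for the model)."""
    conforming: List[Pair] = []
    remaining: List[Pair] = []
    for pair in pairs:
        if any(rule(pair) for rule in RULES):
            conforming.append(pair)
        else:
            remaining.append(pair)
    return conforming, remaining


conforming, remaining = split_by_rules(
    [("X-ray", "x ray"), ("fever", "pyrexia"), ("维生素C", "维生素 C")]
)
print(conforming)
print(remaining)
```

Because the description says the Chinese and English data are judged against the filter rules separately, a fuller version might keep one rule list per language.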
Based on any of the above embodiments, the standard synonym data includes a Chinese synonym standard set and an English synonym standard set;
specifically, based on medical terminology resources such as ICD10, the MeSH thesaurus, the WHO adverse reaction terminology and a medical synonym dictionary built in-house by Company A, professional clinicians select medical synonyms that meet clinical standards as the positive class data of the standard set and medical words that do not meet clinical standards as the negative class data of the standard set; the positive class data and the negative class data together form the standard set used as training data.
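As an illustration of how clinician-reviewed positive and negative data might be combined into one labeled standard set, consider the sketch below. The `Example` record, the label encoding (1 = positive, 0 = negative), the de-duplication step and the sample pairs are assumptions for illustration and are not taken from the source.

```python
from typing import List, NamedTuple, Tuple


class Example(NamedTuple):
    term: str
    candidate: str
    label: int  # 1 = synonym pair meeting clinical standards, 0 = not a synonym pair


def build_standard_set(
    positives: List[Tuple[str, str]], negatives: List[Tuple[str, str]]
) -> List[Example]:
    """Merge reviewed positive and negative pairs into one labeled training set."""
    examples = [Example(a, b, 1) for a, b in positives]
    examples += [Example(a, b, 0) for a, b in negatives]
    # Drop exact duplicates while preserving the original order.
    return list(dict.fromkeys(examples))


# Made-up sample pairs; a real standard set would come from clinician review.
english_standard_set = build_standard_set(
    positives=[
        ("myocardial infarction", "heart attack"),
        ("pyrexia", "fever"),
        ("pyrexia", "fever"),
    ],
    negatives=[("myocardial infarction", "angina pectoris")],
)
print(english_standard_set)
```

A Chinese standard set would be built the same way from Chinese pairs, giving one labeled set per language model.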
The Chinese and English synonym training model comprises a Chinese synonym training model and an English synonym training model;
the Chinese synonym training model is obtained by fine-tuning on the Chinese synonym standard set;
the English synonym training model is obtained by fine-tuning on the English synonym standard set.
Specifically, the Chinese and English synonym training model, namely the SynonymBert model, is divided into ChSynonymBert and EnSynonymBert according to the language of the data to be cleaned; at present only Chinese and English data cleaning is supported. ChSynonymBert is a Chinese synonym pre-training model obtained, on the basis of a Chinese Bert model, by continued training on more than 800 thousand medical synonym vocabulary entries followed by fine-tuning on the Chinese synonym standard set. EnSynonymBert is an English synonym pre-training model obtained, on the basis of BioBert, by continued training on more than 1,000 thousand medical synonym vocabulary entries followed by fine-tuning on the English synonym standard set.
It should be noted that, after the release of the pre-training model Bert (Bidirectional Encoder Representations from Transformers), a number of biomedical-domain variants of Bert (such as BioBert and PubMedBert) appeared, offering a new approach to cleaning medical data with NLP algorithms. A biomedical pre-training model first obtains a task-independent model through self-supervised learning on a large-scale unlabeled biomedical corpus and is then fine-tuned on a specific task. Such models better understand contextual information and represent text semantics, and they have repeatedly set new accuracy records on medical text tasks.
Based on any one of the above embodiments, inputting the data that does not conform to the preset rule into the training model of the Chinese-English synonym, and outputting the data cleaning result includes:
inputting the data which does not conform to the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, outputting predicted positive class data and predicted negative class data, and judging whether the proportion of the predicted positive class data reaches a preset proportion:
if the preset proportion is reached, outputting the data cleaning result; otherwise, adding the predicted positive class data and the predicted negative class data to the Chinese synonym standard set or the English synonym standard set, correspondingly continuing to train the Chinese synonym training model or the English synonym training model, and then iteratively updating the data cleaning model.
Specifically, the Chinese or English SynonymBert model runs inference on the received data to obtain the data predicted as positive class and the data predicted as negative class, and then decides whether to output the cleaning result according to whether the proportion of positive class data among the data received from the filter reaches 80% (this value can be adjusted freely according to the current data quality). If the proportion is not reached, the data predicted as positive class and as negative class are added to the original standard set to continue training the model, the best trained model automatically replaces the original SynonymBert model, and the above process is repeated until the synonym data cleaning is complete.
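The loop described above (predict, check the positive proportion against 80%, expand the standard set, retrain, replace the model) can be sketched as follows. The real system fine-tunes a SynonymBert model; here `retrain` is a hypothetical callback, the models are toy prefix matchers, and `max_rounds` is a safety cap added for the sketch that the source does not mention.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]      # (term, candidate synonym); an assumed record format
Labeled = Tuple[Pair, int]  # label 1 = synonym, 0 = not a synonym
Model = Callable[[Pair], int]


def clean_until_ratio(
    pairs: List[Pair],
    model: Model,
    standard_set: List[Labeled],
    retrain: Callable[[List[Labeled]], Model],
    threshold: float = 0.8,
    max_rounds: int = 5,
) -> Tuple[List[Pair], int]:
    """Repeat predict -> check proportion -> expand standard set -> retrain."""
    positives: List[Pair] = []
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        predictions = [(pair, model(pair)) for pair in pairs]
        positives = [pair for pair, label in predictions if label == 1]
        ratio = len(positives) / len(pairs) if pairs else 1.0
        if ratio >= threshold:
            break  # proportion reached: output the cleaning result
        # Otherwise add the predictions to the standard set and retrain;
        # the retrained model replaces the current one.
        standard_set = standard_set + predictions
        model = retrain(standard_set)
    return positives, rounds


def prefix_model(n: int) -> Model:
    """Toy model: predict positive when the first n characters match."""
    return lambda pair: int(pair[0][:n] == pair[1][:n])


def toy_retrain(labeled: List[Labeled]) -> Model:
    """Stand-in for fine-tuning: pretend the retrained model is more permissive."""
    return prefix_model(2)


data = [
    ("gastritis", "gastric inflammation"),
    ("hepatitis", "hepatic inflammation"),
    ("dermatitis", "dermal inflammation"),
    ("arthritis", "articular inflammation"),
    ("fever", "hypertension"),
]
cleaned, rounds = clean_until_ratio(data, prefix_model(4), [], toy_retrain)
print(rounds, cleaned)
```

In this toy run the first model accepts three of the five pairs (60%), so one retraining round is triggered. The second model accepts four of five (80%) and the loop stops.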
The system for cleaning Chinese and English medical synonym data provided by the invention is described below; the system described below and the method for cleaning Chinese and English medical synonym data described above may be referred to in correspondence with each other.
Fig. 2 is a schematic structural diagram of a system for cleaning Chinese and English medical synonym data according to an embodiment of the invention. As shown in Fig. 2, the system includes a data determining unit 210 and a data cleaning unit 220;
the data determining unit 210 is configured to determine data of Chinese and English medical synonyms to be cleaned;
the data cleaning unit 220 is configured to input the data of the Chinese and English medical synonyms to be cleaned to a data cleaning model, so as to obtain a data cleaning result output by the data cleaning model;
the data cleaning model is trained based on standard synonym data.
The system provided by the embodiment of the invention uses AI technology to clean complex, disordered, irregular medical synonym data. It handles tedious and time-consuming data processing work more accurately and quickly, intelligently supplements missing medical vocabulary, and fills a gap in medical synonym data cleaning.
Based on any one of the above embodiments, the data cleaning unit includes a data loading and Chinese and English judging module, a filter, and a Chinese and English synonym training module;
the data loading and Chinese and English judging module is used for inputting the Chinese and English medical synonym data to be cleaned and outputting a judging result of the Chinese and English data;
the filter is used for inputting the judging result of the Chinese and English data and outputting filtered data conforming to a preset rule and data not conforming to the preset rule;
the Chinese and English synonym training module is used for inputting the data which does not accord with the preset rule, outputting the data cleaning result, and fusing the filtered data which accords with the preset rule into the data cleaning result.
Based on any of the above embodiments, the standard synonym data includes a Chinese synonym standard set and an English synonym standard set;
the Chinese and English synonym training module comprises a Chinese synonym training model and an English synonym training model;
the Chinese synonym training model is obtained by fine-tuning on the Chinese synonym standard set;
the English synonym training model is obtained by fine-tuning on the English synonym standard set.
Based on any of the foregoing embodiments, the Chinese and English synonym training module is specifically configured to input the data which does not conform to the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, output predicted positive class data and predicted negative class data, and judge whether the proportion of the predicted positive class data reaches a preset proportion:
if the preset proportion is reached, outputting the data cleaning result; otherwise, adding the predicted positive class data and the predicted negative class data to the Chinese synonym standard set or the English synonym standard set, correspondingly continuing to train the Chinese synonym training model or the English synonym training model, and then iteratively updating the data cleaning model.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 3, the electronic device may include: processor 310, communication interface (Communications Interface) 320, memory 330 and communication bus 340, wherein processor 310, communication interface 320, memory 330 accomplish communication with each other through communication bus 340. The processor 310 may invoke logic instructions in the memory 330 to perform a Chinese and English medical synonym data cleansing method comprising: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is trained based on standard synonym data.
Further, the logic instructions in the memory 330 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, an embodiment of the present invention further provides a computer program product, including a computer program stored on a non-transitory computer readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to perform the method for cleaning Chinese and English medical synonym data provided by the above methods, the method including: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is trained based on standard synonym data.
In still another aspect, an embodiment of the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method for cleaning Chinese and English medical synonym data provided above, the method including: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is trained based on standard synonym data.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A method for cleaning Chinese and English medical synonym data is characterized by comprising the following steps:
determining Chinese and English medical synonym data to be cleaned;
inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;
the data cleaning model is obtained based on standard synonym data training;
the data cleaning model comprises a data loading and Chinese and English judging model, a filter and a Chinese and English synonym training model;
inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model, and obtaining a data cleaning result output by the data cleaning model comprises the following steps:
inputting the Chinese-English medical synonym data to be cleaned into the data loading and Chinese-English judging model, and outputting a judging result of the Chinese-English data;
inputting the judgment result of the Chinese and English data into the filter, and outputting filtered data conforming to a preset rule and data not conforming to the preset rule;
inputting the data which do not accord with the preset rule into the Chinese-English synonym training model, and outputting the data cleaning result;
fusing the filtered data conforming to the preset rule into the data cleaning result;
the standard synonym data comprises a Chinese synonym standard set and an English synonym standard set;
the Chinese and English synonym training model comprises a Chinese synonym training model and an English synonym training model;
the Chinese synonym training model is obtained by fine-tuning a Chinese BERT model on the Chinese synonym standard set;
the English synonym training model is obtained by fine-tuning a BioBERT model on the English synonym standard set.
2. The method for cleaning Chinese and English medical synonym data according to claim 1, wherein inputting the data which do not conform to the preset rule into the Chinese and English synonym training model and outputting the data cleaning result comprises the following steps:
inputting the data which do not conform to the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, outputting predicted positive-class data and predicted negative-class data, and determining whether the proportion of the predicted positive-class data reaches a preset proportion;
if the preset proportion is reached, outputting the data cleaning result; otherwise, adding the predicted positive-class data and the predicted negative-class data to the Chinese synonym standard set or the English synonym standard set, retraining the Chinese synonym training model or the English synonym training model accordingly, and then iteratively updating the data cleaning model.
3. A system for cleaning Chinese and English medical synonym data, comprising:
the data determining unit is used for determining Chinese and English medical synonym data to be cleaned;
the data cleaning unit is used for inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;
the data cleaning model is obtained based on standard synonym data training;
the data cleaning unit comprises a data loading and Chinese and English judging module, a filter and a Chinese and English synonym training module;
the data loading and Chinese and English judging module is used for inputting the Chinese and English medical synonym data to be cleaned and outputting a judging result of the Chinese and English data;
the filter is used for receiving the judging result of the Chinese and English data and outputting filtered data conforming to a preset rule and data not conforming to the preset rule;
the Chinese and English synonym training module is used for receiving the data which do not conform to the preset rule, outputting the data cleaning result, and fusing the filtered data conforming to the preset rule into the data cleaning result;
the standard synonym data comprises a Chinese synonym standard set and an English synonym standard set;
the Chinese and English synonym training module comprises a Chinese synonym training model and an English synonym training model;
the Chinese synonym training model is obtained by fine-tuning a Chinese BERT model on the Chinese synonym standard set;
the English synonym training model is obtained by fine-tuning a BioBERT model on the English synonym standard set.
4. The system for cleaning Chinese and English medical synonym data according to claim 3, wherein the Chinese and English synonym training module is configured to input the data which do not conform to the preset rule into the Chinese synonym training model or the English synonym training model based on the judging result of the Chinese and English data, output predicted positive-class data and predicted negative-class data, and determine whether the proportion of the predicted positive-class data reaches a preset proportion;
if the preset proportion is reached, outputting the data cleaning result; otherwise, adding the predicted positive-class data and the predicted negative-class data to the Chinese synonym standard set or the English synonym standard set, retraining the Chinese synonym training model or the English synonym training model accordingly, and then iteratively updating the data cleaning model.
5. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the method for cleaning Chinese and English medical synonym data according to claim 1 or 2.
6. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method for cleaning Chinese and English medical synonym data according to claim 1 or 2.
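
The iteration recited in claims 2 and 4 (check the proportion of predicted positives, otherwise expand the standard set and retrain) can be sketched as follows. This is an illustrative sketch only: the `train` callback, the default `preset_ratio`, the `max_rounds` cap, and the choice to return the predicted positives as the cleaning result are assumptions made for the example, and the claims do not state whether predictions are reviewed before they join the standard set.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
Labeled = Tuple[str, str, int]            # (term_a, term_b, label); 1 = synonym
Classifier = Callable[[Pair], int]
Trainer = Callable[[List[Labeled]], Classifier]


def clean_with_iteration(
    pairs: List[Pair],
    standard_set: List[Labeled],
    train: Trainer,
    preset_ratio: float = 0.9,            # hypothetical threshold
    max_rounds: int = 5,                  # hypothetical safety cap, not in the claims
) -> Tuple[List[Pair], List[Labeled]]:
    """Return (predicted positives, possibly expanded standard set)."""
    standard = list(standard_set)
    positives: List[Pair] = []
    for _ in range(max_rounds):
        model = train(standard)                        # (re)train on the current set
        labeled = [(a, b, model((a, b))) for a, b in pairs]
        positives = [(a, b) for a, b, y in labeled if y == 1]
        ratio = len(positives) / len(pairs) if pairs else 1.0
        if ratio >= preset_ratio:                      # preset proportion reached
            break
        for item in labeled:                           # expand the standard set
            if item not in standard:
                standard.append(item)
    return positives, standard


def toy_train(standard: List[Labeled]) -> Classifier:
    """Toy trainer: memorize known positives, else compare case-insensitively."""
    known = {(a, b) for a, b, y in standard if y == 1}
    return lambda p: 1 if p in known or p[0].lower() == p[1].lower() else 0
```

`toy_train` exists only to make the sketch runnable; in the claimed method the trainer would fine-tune the Chinese BERT or BioBERT model on the corresponding standard set.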

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111074910.0A CN113836901B (en) 2021-09-14 2021-09-14 Method and system for cleaning Chinese and English medical synonym data

Publications (2)

Publication Number Publication Date
CN113836901A (en) 2021-12-24
CN113836901B (en) 2023-11-14

Family

ID=78959327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111074910.0A Active CN113836901B (en) 2021-09-14 2021-09-14 Method and system for cleaning Chinese and English medical synonym data

Country Status (1)

Country Link
CN (1) CN113836901B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001001305A1 (en) * 1999-06-25 2001-01-04 International Diagnostic Technology, Inc. Method and system for accessing medical data
CN109960815A (en) * 2019-03-27 2019-07-02 河南大学 Method and system for creating a neural machine translation (NMT) model
CN110134772A (en) * 2019-04-18 2019-08-16 五邑大学 Medical text relation extraction method based on pre-trained model and fine-tuning
CN110502644A (en) * 2019-08-28 2019-11-26 同方知网(北京)技术有限公司 Active learning method for mining and building domain-level dictionaries
CN111127385A (en) * 2019-06-06 2020-05-08 昆明理工大学 Medical information cross-modal hash coding learning method based on generative adversarial network
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111738001A (en) * 2020-08-06 2020-10-02 腾讯科技(深圳)有限公司 Training method of synonym recognition model, synonym determination method and equipment
CN112232065A (en) * 2020-10-29 2021-01-15 腾讯科技(深圳)有限公司 Method and device for mining synonyms
CN112417206A (en) * 2020-11-24 2021-02-26 杭州一知智能科技有限公司 Weakly supervised video time interval retrieval method and system based on two-branch proposal network
CN112528003A (en) * 2020-12-24 2021-03-19 北京理工大学 Multiple-choice question-answering method based on semantic sorting and knowledge correction
CN112989848A (en) * 2021-03-29 2021-06-18 华南理工大学 Training method for domain-adaptive neural machine translation model for medical literature
CN113111180A (en) * 2021-03-22 2021-07-13 杭州祺鲸科技有限公司 Chinese medical synonym clustering method based on deep pre-training neural network
US11113175B1 (en) * 2018-05-31 2021-09-07 The Ultimate Software Group, Inc. System for discovering semantic relationships in computer programs
CN113361285A (en) * 2021-06-30 2021-09-07 北京百度网讯科技有限公司 Training method of natural language processing model, natural language processing method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Domain-independent data cleaning via analysis of entity-relationship graph; Kalashnikov Dmitri V. et al.; ACM Transactions on Database Systems (TODS); Vol. 31 (No. 2); 716-767 *
Evaluation of dataset selection for pre-training and fine-tuning transformer language models for clinical question answering; Soni Sarvesh et al.; Proceedings of the Twelfth Language Resources and Evaluation Conference; 5532-5538 *
BERT-based entity relation extraction method for cardiovascular medical guidelines; Wu Xiaoping et al.; Journal of Computer Applications; Vol. 41 (No. 1); 145-149 *
Research on several key technologies of biomedical text mining; Luo Ling; China Doctoral Dissertations Full-text Database, Medicine and Health Sciences (No. 06); E080-12 *

Also Published As

Publication number Publication date
CN113836901A (en) 2021-12-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant