CN113836901A - Chinese and English medicine synonym data cleaning method and system - Google Patents

Info

Publication number
CN113836901A
Authority
CN
China
Prior art keywords
data
synonym
chinese
english
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111074910.0A
Other languages
Chinese (zh)
Other versions
CN113836901B (en
Inventor
王则远
刘鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lingxi Quantum Beijing Medical Technology Co ltd
Original Assignee
Lingxi Quantum Beijing Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lingxi Quantum Beijing Medical Technology Co ltd filed Critical Lingxi Quantum Beijing Medical Technology Co ltd
Priority to CN202111074910.0A priority Critical patent/CN113836901B/en
Publication of CN113836901A publication Critical patent/CN113836901A/en
Application granted granted Critical
Publication of CN113836901B publication Critical patent/CN113836901B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a method and a system for cleaning Chinese and English medical synonym data. The method comprises: determining the Chinese and English medical synonym data to be cleaned; and inputting that data into a data cleaning model to obtain the data cleaning result output by the model, wherein the data cleaning model is trained on standard synonym data. The invention uses AI technology to clean irregular medical synonym data, handling complex and time-consuming data processing work more accurately and quickly, while intelligently completing missing medical vocabulary and addressing the shortcomings of medical synonym data cleaning.

Description

Chinese and English medicine synonym data cleaning method and system
Technical Field
The invention relates to the technical field of medical data processing, in particular to a method and a system for cleaning Chinese and English medical synonym data.
Background
In recent years, as internet technology has developed, the volume of data that enterprises generate, mine and use has grown greatly. In particular, the growth of the internet and of the medical industry has steadily raised the quality requirements for medical data. Acquired medical data may currently contain substantial redundancy and omissions, along with a large amount of junk and useless data. To meet business requirements and improve product quality, large amounts of irregular data must be cleaned to obtain high-quality data that meets product requirements. Among medical data, medical synonym data is more complex and harder to clean.
For cleaning medical synonym data, the traditional approach relies on mature medical lexicons, such as ICD data, the MeSH thesaurus and the WHO adverse reaction set, and performs vocabulary correction, cleaning and filtering through text string matching. This approach matches medical synonyms accurately, but it easily misses synonym data and is time-consuming. With the development of deep learning, text classification in natural language processing (NLP) has been widely studied. Although accuracy has improved, such models cannot learn complex semantics beyond the rules, and the quality of the training set strongly affects model performance. NLP techniques based on convolutional neural networks and recurrent neural networks can represent more abstract and complex text semantics by learning word vectors, further improving model performance and accuracy. However, convolutional neural networks tend to capture only local semantic information, while recurrent neural networks do not take the preceding and following context into account at the same time and suffer from the long-term dependency problem.
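The limitation of lexicon-based string matching described above can be illustrated with a minimal Python sketch. The function name `match_against_lexicon` and the sample terms are illustrative assumptions and do not come from the patent; the point is only that exact matching accepts a term present in the lexicon and misses a valid synonym that is not listed.

```python
from typing import Iterable, List, Tuple


def match_against_lexicon(terms: Iterable[str], lexicon: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split terms into those found in the lexicon and those not found.

    Matching is exact after trimming whitespace and lowercasing, which is
    an assumed normalization for this sketch.
    """
    normalized = {entry.strip().lower() for entry in lexicon}
    matched: List[str] = []
    unmatched: List[str] = []
    for term in terms:
        if term.strip().lower() in normalized:
            matched.append(term)
        else:
            unmatched.append(term)
    return matched, unmatched


# Hypothetical one-entry lexicon: the lay synonym is not listed, so it is missed.
found, missed = match_against_lexicon(
    ["myocardial infarction", "heart attack"],
    ["Myocardial Infarction"],
)
```

Here `found` holds the term that appears in the lexicon, while `missed` holds "heart attack", a synonym that string matching alone cannot recognize.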
Disclosure of Invention
The embodiments of the invention provide a method and a system for cleaning Chinese and English medical synonym data, which are used to solve some or all of the problems of existing methods for cleaning medical synonym data.
In a first aspect, an embodiment of the present invention provides a method for cleaning Chinese and English medical synonym data, including:
determining Chinese and English medical synonym data to be cleaned;
inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;
the data cleaning model is obtained by training based on standard synonym data.
Preferably, the data cleaning model comprises a data loading and Chinese-English judgment model, a filter and a Chinese-English synonym training model;
inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model, wherein the data cleaning result comprises the following steps:
inputting the Chinese and English medical synonym data to be cleaned into the data loading and Chinese and English judging model, and outputting a judging result of the Chinese and English data;
inputting the judgment result of the Chinese and English data into the filter, and outputting filtered data meeting a preset rule and data not meeting the preset rule;
inputting the data which do not accord with the preset rule into the Chinese and English synonym training model, and outputting the data cleaning result;
and fusing the filtered data which accords with the preset rule into the data cleaning result.
Preferably, the standard synonym data includes a Chinese synonym standard set and an English synonym standard set;
the Chinese and English synonym training model comprises a Chinese synonym training model and an English synonym training model;
the Chinese synonym training model is obtained by carrying out data set fine tuning on the basis of the Chinese synonym standard set;
the English synonym training model is obtained by carrying out data set fine adjustment on the basis of the English synonym standard set.
Preferably, the inputting the data which does not conform to the preset rule into the training model of the synonyms in chinese and english, and outputting the data cleaning result includes:
inputting the data which does not accord with the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, outputting the predicted positive data and the predicted negative data, and judging whether the predicted positive data proportion reaches a preset proportion or not:
if the preset proportion is reached, outputting the data cleaning result; otherwise, expanding the predicted positive class data and the predicted negative class data to the Chinese synonym standard set or the English synonym standard set, and correspondingly training the Chinese synonym training model or the English synonym training model and then iteratively updating the data cleaning model.
In a second aspect, an embodiment of the present invention provides a system for cleaning Chinese and English medical synonym data, including:
the data determining unit is used for determining Chinese and English medical synonym data to be cleaned;
the data cleaning unit is used for inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;
the data cleaning model is obtained by training based on standard synonym data.
Preferably, the data cleaning unit comprises a data loading and Chinese and English judging module, a filter and a Chinese and English synonym training module;
the data loading and Chinese-English judging module is used for inputting the Chinese-English medical synonym data to be cleaned and outputting a judging result of the Chinese-English data;
the filter is used for inputting the judgment result of the Chinese and English data and outputting the filtered data meeting the preset rule and the data not meeting the preset rule;
and the Chinese and English synonym training module is used for inputting the data which do not accord with the preset rule, outputting the data cleaning result and fusing the filtered data which accord with the preset rule into the data cleaning result.
Preferably, the standard synonym data includes a Chinese synonym standard set and an English synonym standard set;
the Chinese and English synonym training module comprises a Chinese synonym training model and an English synonym training model;
the Chinese synonym training model is obtained by carrying out data set fine tuning on the basis of the Chinese synonym standard set;
the English synonym training model is obtained by carrying out data set fine adjustment on the basis of the English synonym standard set.
Preferably, the Chinese and English synonym training module is specifically configured to input the data that does not meet the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, output predicted positive data and predicted negative data, and determine whether the proportion of predicted positive data reaches a preset proportion:
if the preset proportion is reached, outputting the data cleaning result; otherwise, expanding the predicted positive class data and the predicted negative class data to the Chinese synonym standard set or the English synonym standard set, and correspondingly training the Chinese synonym training model or the English synonym training model and then iteratively updating the data cleaning model.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the program to implement the steps of the Chinese and English medical synonym data cleaning method as provided in any implementation of the first aspect above.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the Chinese and English medical synonym data cleaning method as provided in any implementation of the first aspect above.
According to the method and system for cleaning Chinese and English medical synonym data provided by the embodiments of the invention, the Chinese and English medical synonym data to be cleaned is input into the data cleaning model to obtain the data cleaning result output by the model; the data cleaning model is trained on standard synonym data. By using a biomedical pre-trained model to clean medical synonym data, the invention can remove redundant, junk and useless data and accurately complete missing data, efficiently solving the tedious problem of medical data cleaning. High-quality medical data can in turn support internet-assisted clinical diagnosis and help doctors make more comprehensive and accurate clinical decisions.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for cleaning Chinese and English synonym data according to the present invention;
FIG. 2 is a schematic structural diagram of a Chinese and English medical synonym data cleaning system provided by the present invention;
FIG. 3 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention applies cutting-edge NLP technology to the cleaning of biomedical synonym data, combining specific judgment rules with a pre-trained model. The key points are as follows:
1) standard synonym data is used as the training set to fine-tune the model;
2) the exit mechanism of the cleaning loop is determined by evaluating model metrics such as accuracy and F-score;
3) data cleaning is completed by cyclically expanding the training data to continuously improve the quality of the model;
4) the approach offers a cleaning strategy for large-scale data, makes the process more automated, and reduces the influence of labor cost and human factors on data cleaning.
The following describes a method and a system for cleaning Chinese and English medical synonym data according to the present invention with reference to FIG. 1 to FIG. 3.
The embodiment of the invention provides a method for cleaning Chinese and English medical synonym data. FIG. 1 is a schematic flow chart of a method for cleaning Chinese and English medical synonym data according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
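Key point 2) refers to model metrics such as accuracy and F-score. A minimal Python sketch of how these metrics could be computed for a binary synonym classifier is given below. The patent does not state which F-score is used or what thresholds apply, so the F1 variant and the `min_accuracy` and `min_f1` parameters are assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Compute accuracy and F1 for binary labels (1 = synonym, 0 = not)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label sequences must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


def should_exit(accuracy: float, f1: float, min_accuracy: float = 0.9, min_f1: float = 0.9) -> bool:
    """Assumed exit rule: stop once both metrics reach their thresholds."""
    return accuracy >= min_accuracy and f1 >= min_f1
```

In this sketch the exit rule requires both metrics to pass; a real deployment might weigh them differently or use only one of them.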
step 110, determining Chinese and English medical synonym data to be cleaned;
step 120, inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;
the data cleaning model is obtained by training based on standard synonym data.
The method provided by the embodiment of the invention uses AI technology to clean irregular medical synonym data, handling complex and time-consuming data processing work more accurately and quickly, while intelligently completing missing medical vocabulary and addressing the shortcomings of medical synonym data cleaning.
Based on any one of the above embodiments, the data cleaning model comprises a data loading and Chinese-English judgment model, a filter and a Chinese-English synonym training model;
inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model, wherein the data cleaning result comprises the following steps:
inputting the Chinese and English medical synonym data to be cleaned into the data loading and Chinese and English judging model, and outputting a judging result of the Chinese and English data;
Specifically, the data to be cleaned is sent to the data loading and Chinese-English judging module, where Chinese data and English data are separated and then sent to the filter. Each is judged against the filter rules: data that matches the rules is merged into the final cleaning result, and data that does not match the filter rules is sent to the corresponding Chinese or English SynonymBert model.
Inputting the judgment result of the Chinese and English data into the filter, and outputting filtered data meeting a preset rule and data not meeting the preset rule;
It should be noted that the preset rules of the filter are a set of rules, summarized from the data characteristics of medical lexicons and from the characteristics clinicians observe when screening negative data, that can filter out part of the negative data. That is, the data to be cleaned is sent to the filter for rule-based judgment, and data that the rules cannot judge is sent to the Chinese or English SynonymBert model.
Inputting the data which do not accord with the preset rule into the Chinese and English synonym training model, and outputting the data cleaning result;
and fusing the filtered data which accords with the preset rule into the data cleaning result.
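The flow of language judgment, rule filtering, model prediction and merging described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent does not disclose how language is judged or what the preset rules are, so the CJK character check in `judge_language`, the three rules in `rule_filter`, and the `models` mapping of stand-in classifiers are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (term, candidate synonym)

CJK = re.compile(r"[\u4e00-\u9fff]")            # CJK Unified Ideographs
LETTER = re.compile(r"[A-Za-z\u4e00-\u9fff]")   # any Latin letter or ideograph


def judge_language(pair: Pair) -> str:
    """Assumed rule: 'zh' if either term contains a CJK ideograph, else 'en'."""
    return "zh" if any(CJK.search(term) for term in pair) else "en"


def rule_filter(pair: Pair) -> Optional[bool]:
    """Hypothetical preset rules. Returns False when a rule marks the pair
    as negative, or None when no rule applies and the model must decide."""
    a, b = (term.strip() for term in pair)
    if not a or not b:
        return False  # a term is missing
    if a.lower() == b.lower():
        return False  # identical strings carry no synonym information
    if not (LETTER.search(a) and LETTER.search(b)):
        return False  # a term consists only of digits or punctuation
    return None


def clean_pairs(pairs: List[Pair], models: Dict[str, Callable[[Pair], bool]]):
    """Route each pair through the filter, fall back to the per-language
    model, and merge both kinds of decision into one result list."""
    results = []
    for pair in pairs:
        lang = judge_language(pair)
        decided = rule_filter(pair)
        if decided is None:
            results.append((pair, lang, models[lang](pair), "model"))
        else:
            results.append((pair, lang, decided, "filter"))
    return results
```

Each result records whether the decision came from the filter or from the model, which makes it straightforward to merge the rule-decided data into the final cleaning result.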
Based on any of the above embodiments, the standard synonym data includes a Chinese synonym standard set and an English synonym standard set;
Specifically, based on ICD10, the MeSH thesaurus, the WHO adverse reaction set, Company A's self-built medical synonym dictionary and other medical terminology libraries, professional clinicians screen out medical synonyms that meet clinical standards as the positive data of the standard set, and medical terms that do not meet clinical standards as the negative data of the standard set; the positive data and the negative data together form the standard set used as training data.
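One possible in-memory representation of such a standard set is sketched below in Python. The `LabeledPair` type, the `build_standard_set` helper and the sample terms are assumptions for illustration; the patent does not specify a data format. The conflict check reflects the idea that a clinician-reviewed pair should not be labelled both positive and negative.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class LabeledPair:
    term: str
    candidate: str
    label: int  # 1 = meets clinical standards as a synonym, 0 = does not


def build_standard_set(
    positives: Iterable[Tuple[str, str]],
    negatives: Iterable[Tuple[str, str]],
) -> List[LabeledPair]:
    """Merge clinician-screened positive and negative pairs into one
    training set, dropping exact duplicates and rejecting conflicts."""
    seen: Dict[Tuple[str, str], int] = {}
    for label, pairs in ((1, positives), (0, negatives)):
        for pair in pairs:
            if pair in seen and seen[pair] != label:
                raise ValueError(f"pair labelled both positive and negative: {pair}")
            seen[pair] = label
    return [LabeledPair(term, candidate, label) for (term, candidate), label in seen.items()]


# Hypothetical sample entries for an English standard set.
english_standard_set = build_standard_set(
    positives=[("myocardial infarction", "heart attack")],
    negatives=[("myocardial infarction", "angina")],
)
```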
The Chinese and English synonym training model comprises a Chinese synonym training model and an English synonym training model;
the Chinese synonym training model is obtained by carrying out data set fine tuning on the basis of the Chinese synonym standard set;
the English synonym training model is obtained by carrying out data set fine adjustment on the basis of the English synonym standard set.
Specifically, the Chinese and English synonym training model, i.e. the SynonymBert model, is divided into ChSynonymBert and EnSynonymBert according to the language of the data to be cleaned; currently only Chinese and English data cleaning is supported. ChSynonymBert is a Chinese synonym pre-trained model obtained by taking Chinese Bert as the base, continuing training on more than 800 thousand medical synonym data items, and then fine-tuning on the Chinese synonym standard set. EnSynonymBert is an English synonym pre-trained model obtained by taking BioBert as the base, continuing training on more than 1000 thousand medical synonym data items, and then fine-tuning on the English synonym standard set.
It should be noted that after the release of the pre-trained model Bert (Bidirectional Encoder Representations from Transformers), a number of Bert variants pre-trained for the biomedical domain (such as BioBert and PubMedBert) appeared, offering a new approach to cleaning medical data with NLP algorithms. A biomedical pre-trained model is first trained by self-supervised learning on a large-scale unlabeled biomedical corpus to obtain a task-agnostic model, and is then fine-tuned on a specific task. Such models understand context better, represent text semantics well, and keep setting new accuracy records on medical text tasks.
Based on any one of the above embodiments, inputting the data that does not conform to the preset rule into the chinese-english synonym training model, and outputting the data cleaning result, including:
inputting the data which does not accord with the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, outputting the predicted positive data and the predicted negative data, and judging whether the predicted positive data proportion reaches a preset proportion or not:
if the preset proportion is reached, outputting the data cleaning result; otherwise, expanding the predicted positive class data and the predicted negative class data to the Chinese synonym standard set or the English synonym standard set, and correspondingly training the Chinese synonym training model or the English synonym training model and then iteratively updating the data cleaning model.
Specifically, the Chinese or English SynonymBert model runs inference on the received data to obtain data predicted as the positive class and data predicted as the negative class. Whether to output the cleaning result is then decided by whether the positive-class data accounts for at least 80% of the data received from the filter (the threshold can be adjusted freely according to the current data quality). If the threshold is not reached, the data predicted as positive and the data predicted as negative are added to the original standard data set to continue training the model, and the best model obtained from training automatically replaces the original SynonymBert model. Synonym data cleaning is completed by executing this procedure in a loop.
The Chinese and English medical synonym data cleaning system provided by the present invention is described below; the description below and the Chinese and English medical synonym data cleaning method described above may be read with reference to each other.
FIG. 2 is a schematic structural diagram of a Chinese and English medical synonym data cleaning system according to an embodiment of the present invention. As shown in FIG. 2, the system includes a data determining unit 210 and a data cleaning unit 220;
the data determining unit 210 is configured to determine Chinese and English medical synonym data to be cleaned;
the data cleaning unit 220 is configured to input the Chinese and English medical synonym data to be cleaned to a data cleaning model, and obtain a data cleaning result output by the data cleaning model;
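The loop described above can be sketched in Python as follows. The `train` callable stands in for fine-tuning a SynonymBert model and is an assumption, as is the `max_rounds` cap, which is added here only so that the sketch always terminates; the patent itself specifies only the proportion check (80% by default) and the expansion of the training data with the predicted positive and negative data.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
Labeled = Tuple[Pair, int]          # (pair, label) with 1 = positive, 0 = negative
Model = Callable[[Pair], int]       # stand-in for a fine-tuned SynonymBert model


def iterative_clean(
    pairs: List[Pair],
    standard_set: List[Labeled],
    train: Callable[[List[Labeled]], Model],
    threshold: float = 0.8,
    max_rounds: int = 5,
) -> List[Labeled]:
    """Predict, check the positive-class proportion, and if it is below the
    threshold expand the training data with the predictions and retrain."""
    if not pairs:
        return []
    training_data = list(standard_set)
    model = train(training_data)
    predictions: List[Labeled] = []
    for _ in range(max_rounds):
        predictions = [(pair, model(pair)) for pair in pairs]
        positive_ratio = sum(label for _, label in predictions) / len(pairs)
        if positive_ratio >= threshold:
            break  # proportion reached: output the cleaning result
        training_data.extend(predictions)   # expand the standard set
        model = train(training_data)        # retrained model replaces the old one
    return predictions
```

Because the predictions are fed back as training labels, the sketch amounts to a simple self-training loop; in practice the quality of those pseudo-labels would need to be monitored, for example with the accuracy and F-score mentioned in the key points.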
the data cleaning model is obtained by training based on standard synonym data.
The system provided by the embodiment of the invention uses AI technology to clean irregular medical synonym data, handling complex and time-consuming data processing work more accurately and quickly, while intelligently completing missing medical vocabulary and addressing the shortcomings of medical synonym data cleaning.
Based on any one of the above embodiments, the data cleaning unit includes a data loading and Chinese-English judging module, a filter, and a Chinese-English synonym training module;
the data loading and Chinese-English judging module is used for inputting the Chinese-English medical synonym data to be cleaned and outputting a judging result of the Chinese-English data;
the filter is used for inputting the judgment result of the Chinese and English data and outputting the filtered data meeting the preset rule and the data not meeting the preset rule;
and the Chinese and English synonym training module is used for inputting the data which do not accord with the preset rule, outputting the data cleaning result and fusing the filtered data which accord with the preset rule into the data cleaning result.
Based on any of the above embodiments, the standard synonym data includes a Chinese synonym standard set and an English synonym standard set;
the Chinese and English synonym training module comprises a Chinese synonym training model and an English synonym training model;
the Chinese synonym training model is obtained by carrying out data set fine tuning on the basis of the Chinese synonym standard set;
the English synonym training model is obtained by carrying out data set fine adjustment on the basis of the English synonym standard set.
Based on any of the above embodiments, the Chinese and English synonym training module is specifically configured to input the data that does not meet the preset rule into the Chinese synonym training model or the English synonym training model based on the judgment result of the Chinese and English data, output predicted positive data and predicted negative data, and determine whether the proportion of predicted positive data reaches a preset proportion:
if the preset proportion is reached, outputting the data cleaning result; otherwise, expanding the predicted positive class data and the predicted negative class data to the Chinese synonym standard set or the English synonym standard set, and correspondingly training the Chinese synonym training model or the English synonym training model and then iteratively updating the data cleaning model.
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in FIG. 3, the electronic device may include: a processor 310, a communication interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may invoke logic instructions in the memory 330 to perform a Chinese and English medical synonym data cleaning method, the method comprising: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is obtained by training based on standard synonym data.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is capable of executing the Chinese and English medical synonym data cleaning method provided by the above method embodiments, where the method includes: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is obtained by training based on standard synonym data.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the Chinese and English medical synonym data cleaning method provided in the foregoing aspects, and the method includes: determining Chinese and English medical synonym data to be cleaned; inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model; the data cleaning model is obtained by training based on standard synonym data.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A Chinese and English medicine synonym data cleaning method is characterized by comprising the following steps:
determining Chinese and English medical synonym data to be cleaned;
inputting the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;
the data cleaning model is obtained by training based on standard synonym data.
2. The method according to claim 1, wherein the data cleaning model comprises a data loading and Chinese-English judgment model, a filter, and a Chinese-English synonym training model;
inputting the Chinese and English medical synonym data to be cleaned into the data cleaning model to obtain the data cleaning result output by the data cleaning model comprises:
inputting the Chinese and English medical synonym data to be cleaned into the data loading and Chinese-English judgment model, and outputting a judgment result for the Chinese and English data;
inputting the judgment result for the Chinese and English data into the filter, and outputting filtered data that conforms to a preset rule and data that does not conform to the preset rule;
inputting the data that does not conform to the preset rule into the Chinese-English synonym training model, and outputting the data cleaning result;
and fusing the filtered data that conforms to the preset rule into the data cleaning result.
3. The method according to claim 2, wherein the standard synonym data includes a Chinese synonym standard set and an English synonym standard set;
the Chinese-English synonym training model comprises a Chinese synonym training model and an English synonym training model;
the Chinese synonym training model is obtained by fine-tuning on the Chinese synonym standard set;
the English synonym training model is obtained by fine-tuning on the English synonym standard set.
4. The method for cleaning Chinese and English medical synonym data according to claim 2, wherein inputting the data that does not conform to the preset rule into the Chinese-English synonym training model and outputting the data cleaning result comprises:
inputting the data that does not conform to the preset rule into the Chinese synonym training model or the English synonym training model according to the judgment result for the Chinese and English data, outputting predicted positive-class data and predicted negative-class data, and judging whether the proportion of predicted positive-class data reaches a preset proportion:
if the preset proportion is reached, outputting the data cleaning result; otherwise, adding the predicted positive-class data and the predicted negative-class data to the Chinese synonym standard set or the English synonym standard set, retraining the corresponding Chinese synonym training model or English synonym training model, and iteratively updating the data cleaning model.
5. A system for cleaning Chinese and English medical synonym data, characterized by comprising:
a data determining unit, configured to determine Chinese and English medical synonym data to be cleaned;
a data cleaning unit, configured to input the Chinese and English medical synonym data to be cleaned into a data cleaning model to obtain a data cleaning result output by the data cleaning model;
wherein the data cleaning model is obtained by training on standard synonym data.
6. The system according to claim 5, wherein the data cleaning unit comprises a data loading and Chinese-English judging module, a filter, and a Chinese-English synonym training module;
the data loading and Chinese-English judging module is configured to receive the Chinese and English medical synonym data to be cleaned and to output a judgment result for the Chinese and English data;
the filter is configured to receive the judgment result for the Chinese and English data and to output filtered data that conforms to a preset rule and data that does not conform to the preset rule;
and the Chinese-English synonym training module is configured to receive the data that does not conform to the preset rule, to output the data cleaning result, and to fuse the filtered data that conforms to the preset rule into the data cleaning result.
7. The system according to claim 6, wherein the standard synonym data includes a Chinese synonym standard set and an English synonym standard set;
the Chinese-English synonym training module comprises a Chinese synonym training model and an English synonym training model;
the Chinese synonym training model is obtained by fine-tuning on the Chinese synonym standard set;
the English synonym training model is obtained by fine-tuning on the English synonym standard set.
8. The system according to claim 6, wherein the Chinese-English synonym training module is specifically configured to input the data that does not conform to the preset rule into the Chinese synonym training model or the English synonym training model according to the judgment result for the Chinese and English data, to output predicted positive-class data and predicted negative-class data, and to judge whether the proportion of predicted positive-class data reaches a preset proportion:
if the preset proportion is reached, outputting the data cleaning result; otherwise, adding the predicted positive-class data and the predicted negative-class data to the Chinese synonym standard set or the English synonym standard set, retraining the corresponding Chinese synonym training model or English synonym training model, and iteratively updating the data cleaning model.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for cleaning Chinese and English medical synonym data according to any one of claims 1 to 4 when executing the program.
10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for cleaning Chinese and English medical synonym data according to any one of claims 1 to 4.
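
By way of non-limiting illustration, the flow recited in claims 1 to 4 (language judgment, rule filter, per-language model, proportion check with retraining) might be sketched in Python as follows. Everything named here is an assumption introduced for the sketch and is not taken from the claims: the CJK-character heuristic in `judge_language`, the normalization-based `passes_preset_rule`, the toy `OverlapModel` standing in for a fine-tuned model, the default `preset_ratio`, and the `max_rounds` cap. The sketch also assumes that the data cleaning result consists of the rule-conforming pairs plus the pairs predicted positive.

```python
import re
from typing import Dict, Iterable, List, Protocol, Set, Tuple

Pair = Tuple[str, str]        # (term, candidate synonym)
Labeled = Tuple[Pair, bool]   # pair plus predicted synonym / non-synonym label

_CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs block


class SynonymModel(Protocol):
    """Interface assumed for the per-language synonym training models."""
    def fit(self, standard_set: Iterable[Labeled]) -> None: ...
    def predict(self, pair: Pair) -> bool: ...


class OverlapModel:
    """Toy stand-in for a fine-tuned model: character-overlap threshold."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold

    @staticmethod
    def _score(pair: Pair) -> float:
        a, b = set(pair[0].lower()), set(pair[1].lower())
        return len(a & b) / max(len(a | b), 1)

    def fit(self, standard_set: Iterable[Labeled]) -> None:
        scores = [self._score(p) for p, label in standard_set if label]
        if scores:
            self.threshold = min(scores)

    def predict(self, pair: Pair) -> bool:
        return self._score(pair) >= self.threshold


def judge_language(pair: Pair) -> str:
    """Hypothetical judgment: 'zh' if either term has a CJK ideograph, else 'en'."""
    return "zh" if any(_CJK.search(term) for term in pair) else "en"


def passes_preset_rule(pair: Pair) -> bool:
    """Hypothetical preset rule: terms match after case/whitespace normalization."""
    a, b = ("".join(term.lower().split()) for term in pair)
    return a == b


def clean(pairs: List[Pair],
          models: Dict[str, SynonymModel],
          standard_sets: Dict[str, Set[Labeled]],
          preset_ratio: float = 0.8,
          max_rounds: int = 3) -> List[Pair]:
    # Claim 2, step 1: split the input by judged language.
    by_lang: Dict[str, List[Pair]] = {"zh": [], "en": []}
    for pair in pairs:
        by_lang[judge_language(pair)].append(pair)

    cleaned: List[Pair] = []
    for lang, items in by_lang.items():
        # Claim 2, step 2: the filter separates rule-conforming data from the rest.
        kept = [p for p in items if passes_preset_rule(p)]
        rest = [p for p in items if not passes_preset_rule(p)]
        model = models[lang]
        positives: List[Pair] = []
        for round_no in range(max_rounds + 1):
            preds = [(p, model.predict(p)) for p in rest]
            positives = [p for p, is_synonym in preds if is_synonym]
            reached = not rest or len(positives) / len(rest) >= preset_ratio
            if reached or round_no == max_rounds:
                break
            # Claim 4: expand the standard set with the predictions and retrain.
            standard_sets[lang].update(preds)
            model.fit(standard_sets[lang])
        # Claim 2, last step: fuse rule-conforming data into the result.
        cleaned.extend(kept + positives)
    return cleaned
```

A caller would pass one model per language, for example `clean(pairs, {"zh": OverlapModel(), "en": OverlapModel()}, {"zh": set(), "en": set()})`. The round cap is included because a model retrained only on its own predictions may never reach the preset proportion; the claims do not state a stopping condition for that case.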

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111074910.0A CN113836901B (en) 2021-09-14 2021-09-14 Method and system for cleaning Chinese and English medical synonym data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111074910.0A CN113836901B (en) 2021-09-14 2021-09-14 Method and system for cleaning Chinese and English medical synonym data

Publications (2)

Publication Number Publication Date
CN113836901A (en) 2021-12-24
CN113836901B (en) 2023-11-14

Family

ID=78959327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111074910.0A Active CN113836901B (en) 2021-09-14 2021-09-14 Method and system for cleaning Chinese and English medical synonym data

Country Status (1)

Country Link
CN (1) CN113836901B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001001305A1 (en) * 1999-06-25 2001-01-04 International Diagnostic Technology, Inc. Method and system for accessing medical data
CN109960815A (en) * 2019-03-27 2019-07-02 河南大学 A kind of creation method and system of nerve machine translation NMT model
CN110134772A (en) * 2019-04-18 2019-08-16 五邑大学 Medical text Relation extraction method based on pre-training model and fine tuning technology
CN110502644A (en) * 2019-08-28 2019-11-26 同方知网(北京)技术有限公司 A kind of field level dictionary excavates the Active Learning Method of building
CN111127385A (en) * 2019-06-06 2020-05-08 昆明理工大学 Medical information cross-modal Hash coding learning method based on generative countermeasure network
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN111738001A (en) * 2020-08-06 2020-10-02 腾讯科技(深圳)有限公司 Training method of synonym recognition model, synonym determination method and equipment
CN112232065A (en) * 2020-10-29 2021-01-15 腾讯科技(深圳)有限公司 Method and device for mining synonyms
CN112417206A (en) * 2020-11-24 2021-02-26 杭州一知智能科技有限公司 Weak supervision video time interval retrieval method and system based on two-branch proposed network
CN112528003A (en) * 2020-12-24 2021-03-19 北京理工大学 Multi-item selection question-answering method based on semantic sorting and knowledge correction
CN112989848A (en) * 2021-03-29 2021-06-18 华南理工大学 Training method for neural machine translation model of field adaptive medical literature
CN113111180A (en) * 2021-03-22 2021-07-13 杭州祺鲸科技有限公司 Chinese medical synonym clustering method based on deep pre-training neural network
US11113175B1 (en) * 2018-05-31 2021-09-07 The Ultimate Software Group, Inc. System for discovering semantic relationships in computer programs
CN113361285A (en) * 2021-06-30 2021-09-07 北京百度网讯科技有限公司 Training method of natural language processing model, natural language processing method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KALASHNIKOV DMITRI V. et al.: "Domain-independent data cleaning via analysis of entity-relationship graph", 《ACM TRANSACTIONS ON DATABASE SYSTEMS (TODS)》, vol. 31, no. 2, pages 716 - 767, XP058320034, DOI: 10.1145/1138394.1138401 *
SONI SARVESH et al.: "Evaluation of dataset selection for pre-training and fine-tuning transformer language models for clinical question answering", 《PROCEEDINGS OF THE TWELFTH LANGUAGE RESOURCES AND EVALUATION CONFERENCE》, pages 5532 - 5538 *
Wu Xiaoping et al.: "BERT-based entity relation extraction method for cardiovascular medical guidelines", 《Journal of Computer Applications》, vol. 41, no. 1, pages 145 - 149 *
Luo Ling: "Research on several key technologies for biomedical text mining", 《China Doctoral Dissertations Full-text Database, Medicine and Health Sciences》, no. 06, pages 080 - 12 *

Also Published As

Publication number Publication date
CN113836901B (en) 2023-11-14

Similar Documents

Publication Publication Date Title
CN111199795A (en) System for extracting semantic triples to build a knowledge base
CN110765759B (en) Intention recognition method and device
CN104899190B (en) The generation method and device and participle processing method and device of dictionary for word segmentation
CN113361266B (en) Text error correction method, electronic device and storage medium
CN110427627A (en) Task processing method and device based on semantic expressiveness model
TWI833072B (en) Speech recognition system and speech recognition method
CN106294466A (en) Disaggregated model construction method, disaggregated model build equipment and sorting technique
CN112631436B (en) Method and device for filtering sensitive words of input method
CN112307130B (en) Document-level remote supervision relation extraction method and system
CN110335608A (en) Voice print verification method, apparatus, equipment and storage medium
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN113326696B (en) Text generation method and device
CN113836901B (en) Method and system for cleaning Chinese and English medical synonym data
CN109657244B (en) English long sentence automatic segmentation method and system
CN113033179B (en) Knowledge acquisition method, knowledge acquisition device, electronic equipment and readable storage medium
CN115017876A (en) Method and terminal for automatically generating emotion text
CN111666734B (en) Sequence labeling method and device
CN114300127A (en) Method, device, equipment and storage medium for inquiry processing
CN116186529A (en) Training method and device for semantic understanding model
Galinsky et al. Improving neural models for natural language processing in Russian with synonyms
CN104537461A (en) Method and device for carrying out compliance inspection on enterprise internal control systems
CN115905500B (en) Question-answer pair data generation method and device
Ghosh et al. Homophone ambiguity reduction from word level speech recognition using artificial immune system
CN115048907B (en) Text data quality determining method and device
CN108804627B (en) Information acquisition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant