CN106934038B - Method and system for duplicate checking and association of medical data - Google Patents

Publication number
CN106934038B
Authority
CN
China
Prior art keywords
data
data item
medical
item
medical data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710153199.5A
Other languages
Chinese (zh)
Other versions
CN106934038A (en)
Inventor
刘劲松
王友柱
饶江
李广东
李楠
王东
陈桂太
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Huasheng Gene Data Technology Co Ltd
Original Assignee
Jiangsu Huasheng Gene Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Huasheng Gene Data Technology Co Ltd
Priority to CN201710153199.5A
Publication of CN106934038A
Application granted
Publication of CN106934038B
Expired - Fee Related (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468 Fuzzy queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G06F16/287 Visualization; Browsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3349 Reuse of stored results of previous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/358 Browsing; Visualisation therefor
    • G06F19/32

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The present invention relates to a method and system for duplicate checking and association of medical data. The method comprises: (1) extracting the core data items from the medical data to be processed; (2) classifying the core data items; (3) performing preliminary screening on each data item in the exclusion array and in the fuzzy array; (4) performing deep screening on each of the core data items; (5) setting a similarity threshold M2 for suspected duplicate data and/or a threshold M3 for suspected associated data; (6) after the suspected duplicate and/or associated data has been manually verified and judged, entering the data judged to be non-duplicate into the medical database, and assigning one or more corresponding association tags to the data judged to be associated. Compared with the prior art, the invention has a low missed-detection rate, a low misjudgment rate and high duplicate-checking efficiency, and places low demands on the professional expertise of the people performing manual verification, so the operating cost of duplicate checking and association is significantly reduced.

Description

Method and system for duplicate checking and association of medical data
Technical field
The present invention relates generally to data processing technology and, more specifically, to a method and system for duplicate checking and association of medical data.
Background art
In the practice of collecting and processing medical data, the same data may be collected and entered into the database more than once, and data that has been slightly altered by professionals or laypeople may be collected and entered into the database as different data. To ensure that the data in a medical database is genuine and valid, a scheme is needed that performs duplicate checking on data after it is submitted and before it is formally approved for storage, so that duplicate data is kept out of the database. Because medical data contains a large amount of unstructured data, such as symptom descriptions in medical records and treatment regimens for diseases, in-depth duplicate checking of medical data currently depends largely on manual work by people with relevant medical experience. This is inefficient, consumes considerable manpower and material resources, and is costly.
Moreover, medical research differs from other natural sciences: experiments on the human body are strictly controlled, so theoretical research cannot be verified in real time. Medical research therefore depends heavily on the collection and analysis of historical medical data such as patient medical records. An effective medical data processing method is thus needed that makes it possible to automatically mine related cases from a medical database for further medical research and analysis.
Chinese patent CN101609466B provides a "method and system for duplicate checking of mass data": data keywords are extracted from the mass data, the data keywords being used to distinguish the data they belong to from other data; the data keywords are partitioned according to their first N+M letters, and data keywords whose first N+M letters are identical are placed in the same file, yielding key data files (where the first N letters of the data keywords are identical, the first N+M letters are not all identical, and N and M are non-negative integers); duplicate checking is then performed separately on the data in each key data file to obtain the duplicate-checking result. That invention is better suited to structured data and cannot effectively check for duplicates in medical data that contains a large amount of unstructured data. In addition, it does not address similarity or association between data.
Chinese patent CN101751423A provides "a method and system for article duplicate checking", comprising: after manuscript information in the production database is changed as a result of operations on manuscripts in the page layout, an event trigger obtains the modified manuscript information, which includes the manuscript content; a duplicate-checking server then performs duplicate-content comparison on the obtained manuscript information that has not yet been compared, and determines the duplicate-manuscript information. Because the duplicate-checking server compares only the manuscript information obtained by the event trigger that has not yet undergone duplicate-content comparison, the duplicate-manuscript information is ultimately determined. The technical effect actually achieved by that patent is automatic duplicate checking of manuscripts, a kind of unstructured information, before submission, which reduces the number of duplicate manuscripts that appear in publication. Although its embodiments mention comparing manuscript content using Chinese word segmentation library technology to produce a similarity between manuscripts for duplicate checking, the patent does not disclose how the similarity between manuscripts is actually calculated, nor does it address how to use that similarity to associate manuscripts with one another.
Summary of the invention
In view of the above problems, the present invention provides a method and system for duplicate checking and association of medical data, solving the prior-art problems that medical data containing a large amount of unstructured data cannot be effectively checked for duplicates and that associations between medical data are not established.
To achieve the above object, the present invention adopts the following technical solution.
A method for duplicate checking and association of medical data, characterized in that it comprises the following steps:
(1) extracting the core data items from the medical data to be processed, the core data items being used to distinguish the data they belong to from other data;
(2) classifying the core data items: the core data items are first divided into structured data items and unstructured data items, a group of data items is then selected from the structured data items as the exclusion array, and the remaining structured data items form the fuzzy array;
(3) performing preliminary screening on each data item in the exclusion array and in the fuzzy array, respectively:
(3a) when any data item in the exclusion array differs from the same-type data item already in the medical database, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database,
or (3b) when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array exceeds the set threshold M1, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database;
in all other cases the method proceeds to the next step;
(4) performing deep screening on each of the core data items: a weight a_i is assigned to each data item, and the similarity f_i of each data item is determined and calculated,
and the total similarity F between the medical data and the existing medical data in the medical database is calculated according to the following formula:
where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1;
when F = 1, the medical data is judged to be duplicate data and is deleted;
(5) setting a similarity threshold M2 for suspected duplicate data and/or a threshold M3 for suspected associated data: when M2 < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M3 < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;
(6) after the suspected duplicate and/or associated data has been manually verified and judged, entering the data judged to be non-duplicate into the medical database, and assigning one or more corresponding association tags to the data judged to be associated.
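The preliminary screening in steps (3a) and (3b) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `preliminary_screen`, the dictionary-based record layout, and the one-to-one comparison against a single existing record are all assumptions made for the example.

```python
from typing import Mapping, Sequence


def preliminary_screen(new: Mapping[str, str],
                       existing: Mapping[str, str],
                       exclusion_fields: Sequence[str],
                       fuzzy_fields: Sequence[str],
                       m1: float) -> bool:
    """Return True if `new` still needs deep screening against `existing`.

    False means step (3a) or (3b) already rules this pair out as
    duplicate or associated. Field names and layout are hypothetical.
    """
    # (3a) Any mismatch in the exclusion array rules the pair out.
    if any(new.get(f) != existing.get(f) for f in exclusion_fields):
        return False
    # (3b) Too large a share of mismatches in the fuzzy array also rules it out.
    if fuzzy_fields:
        differing = sum(new.get(f) != existing.get(f) for f in fuzzy_fields)
        if differing / len(fuzzy_fields) > m1:
            return False
    return True
```

Under this reading, a new record would be entered into the database directly only if every existing record of the same type returns `False`; pairs that return `True` go on to step (4).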
Further, the similarity f_i is determined and calculated as follows:
For a structured data item, its similarity f_i takes the value 1 when the item is identical to the same-type data item already in the medical database, and 0 otherwise;
For an unstructured data item, its similarity f_i is the Jaccard similarity SIM(S, T) between the text set T of the unstructured data item and the text set S of the same-type data item already in the medical database,
where SIM(S, T) = |S ∩ T| / |S ∪ T|.
Further, before the Jaccard similarity between the text set T of the unstructured data item and the text set S of the same-type data item already in the medical database is calculated directly, text set T and text set S are also preprocessed as follows:
(i) using a word segmentation library, the text fields in T and S are decomposed into words, each word being treated as the smallest data item to be processed;
(ii) the corresponding segmented data items from T and S are compared one by one according to the K-shingle algorithm.
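One way to read the preprocessing and the Jaccard formula together is sketched below: the segmented words of each text are grouped into k-shingles (runs of k consecutive words), and SIM(S, T) is computed over the two shingle sets. The function names, the default k = 2, and the value returned for two empty sets are assumptions for illustration; the source does not fix them.

```python
from typing import List, Set, Tuple


def shingles(tokens: List[str], k: int = 2) -> Set[Tuple[str, ...]]:
    """Set of all runs of k consecutive tokens (k-shingles)."""
    if len(tokens) < k:
        # Text shorter than k: treat the whole token run as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(s: set, t: set) -> float:
    """SIM(S, T) = |S ∩ T| / |S ∪ T|."""
    union = s | t
    if not union:
        return 1.0  # assumption: two empty texts count as identical
    return len(s & t) / len(union)
```

For example, two chief complaints that differ in a single word share most of their 2-shingles, so their similarity stays well above zero without reaching 1.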
Further, the word segmentation library includes three parts:
a drug name part, which includes drug trade names, generic names, and the English abbreviations of common regimens;
a symptom description part, which includes words commonly used to describe common medical symptoms;
a genetic testing part, which includes the abbreviations of genetic testing sites and descriptions of genetic testing results.
A further object of the present invention is to provide a system for duplicate checking and association of medical data, characterized in that the system comprises:
a core data item unit, for extracting the core data items from the medical data to be processed, the core data items being used to distinguish the data they belong to from other data;
a classification unit, for classifying the core data items: the core data items are first divided into structured data items and unstructured data items, a group of data items is then selected from the structured data items as the exclusion array, and the remaining structured data items form the fuzzy array;
a preliminary screening unit, for performing preliminary screening on each data item in the exclusion array and in the fuzzy array, respectively: when any data item in the exclusion array differs from the same-type data item already in the medical database, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database; or when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array exceeds the set threshold M1, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database;
a deep screening unit, for performing deep screening on each data item processed by the preliminary screening unit: a weight a_i is assigned to each data item, and the similarity f_i of each data item is determined and calculated,
and the total similarity F between the medical data and the existing medical data in the medical database is calculated according to the following formula:
where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1;
when F = 1, the medical data is judged to be duplicate data and is deleted;
a judging unit, for setting a similarity threshold M2 for suspected duplicate data and/or a threshold M3 for suspected associated data: when M2 < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M3 < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;
a manual verification unit, for manually verifying the suspected duplicate and/or associated data: after a judgment has been given manually, the data judged to be non-duplicate is entered into the medical database, and the data judged to be associated is given one or more corresponding association tags.
Further, the deep screening unit includes:
a weight assignment subunit, for assigning a weight a_i to each of the core data items;
a similarity f_i determination and calculation subunit, for determining and calculating the similarity f_i of each data item, including:
a structured data item module, for assigning the similarity f_i of a structured data item: when the structured data item is identical to the same-type data item in the medical database, its similarity f_i takes the value 1, and otherwise 0;
an unstructured data item module, for assigning the similarity f_i of an unstructured data item: the similarity f_i of an unstructured data item is the Jaccard similarity SIM(S, T) between the text set T of the unstructured data item and the text set S of the same-type data item already in the medical database, where SIM(S, T) = |S ∩ T| / |S ∪ T|;
a total similarity F calculation and judgment subunit, for calculating the total similarity F between the medical data and the existing medical data in the medical database:
where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1;
when F = 1, the medical data is judged to be duplicate data and is deleted.
Further, the deep screening unit also includes an unstructured data item preprocessing module, for preprocessing the text set T of the unstructured data item and the text set S of the same-type data item already in the medical database.
The preprocessing is:
(i) using a word segmentation library, the text fields in T and S are decomposed into words, each word being treated as the smallest data item to be processed;
(ii) the corresponding segmented data items from T and S are compared one by one according to the K-shingle algorithm.
Further, the deep screening unit also includes a word segmentation library module, for decomposing the text fields in T and S using the text data items stored in the library,
which includes three submodules:
a drug name submodule, which includes drug trade names, generic names, and the English abbreviations of common regimens;
a symptom description submodule, which includes words commonly used to describe common medical symptoms;
a genetic testing submodule, which includes the abbreviations of genetic testing sites and descriptions of genetic testing results.
Unless otherwise specified, structured data in the present invention refers to row data that is stored in a database and can be logically expressed using a two-dimensional table structure.
Unless otherwise specified, unstructured data in the present invention refers to data that cannot conveniently be represented by the two-dimensional logical tables of a database, including office documents of all formats, text, pictures, XML and HTML (subsets of the Standard Generalized Markup Language), reports of all kinds, images, and audio/video information.
Compared with the prior art, the medical data duplicate-checking and association method and system of the invention have the advantage that a new data classification method and similarity algorithm effectively calculate the similarity between medical data and the data already in the database, solving the problems that medical data containing a large amount of unstructured data cannot be effectively checked for duplicates and that associations between medical data are not established. The invention has a low missed-detection rate, a low misjudgment rate and high duplicate-checking efficiency, and places low demands on the medical expertise of the people performing manual verification, so the operating cost of duplicate checking and association is significantly reduced.
Embodiments
Embodiments of the invention are described in detail below. The embodiments are exemplary, serve only to explain the invention, and are not to be construed as limiting it. To avoid unnecessarily obscuring the embodiments, this section does not describe in detail techniques that are well known in the art, that is, techniques that would be obvious to a person skilled in the art.
A method for duplicate checking and association of medical data, characterized in that it comprises the following steps:
S101: extracting the core data items from the medical data to be processed, the core data items being used to distinguish the data they belong to from other data;
S102: classifying the core data items: the core data items are first divided into structured data items and unstructured data items, a group of data items is then selected from the structured data items as the exclusion array, and the remaining structured data items form the fuzzy array;
S103: performing preliminary screening on each data item in the exclusion array and in the fuzzy array, respectively:
S103a: when any data item in the exclusion array differs from the same-type data item already in the medical database, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database,
or S103b: when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array exceeds the set threshold M1, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database;
in all other cases the method proceeds to the next step;
S104: performing deep screening on each of the core data items: a weight a_i is assigned to each data item, and the similarity f_i of each data item is determined and calculated,
and the total similarity F between the medical data and the existing medical data in the medical database is calculated according to the following formula:
where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1;
when F = 1, the medical data is judged to be duplicate data and is deleted;
S105: setting a similarity threshold M2 for suspected duplicate data and/or a threshold M3 for suspected associated data: when M2 < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M3 < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;
S106: after the suspected duplicate and/or associated data has been manually verified and judged, entering the data judged to be non-duplicate into the medical database, and assigning one or more corresponding association tags to the data judged to be associated.
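Steps S104 and S105 can be illustrated with the sketch below. The formula for F is not reproduced in this text, so the code assumes a normalized weighted sum, F = Σ a_i·f_i / Σ a_i, which is consistent with the stated bounds on f_i and F but is not confirmed by the source. The label strings, the function names, and the fallback to "not_duplicate" when F exceeds neither threshold are likewise assumptions.

```python
import math
from typing import Sequence, Set


def total_similarity(weights: Sequence[float], sims: Sequence[float]) -> float:
    """Normalized weighted sum of per-item similarities (assumed form of F)."""
    if len(weights) != len(sims) or not weights:
        raise ValueError("weights and sims must be non-empty and equally long")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(a * f for a, f in zip(weights, sims)) / total_weight


def classify(f_total: float, m2: float, m3: float) -> Set[str]:
    """Map F to the outcomes described in S104 and S105."""
    # F = 1 means an exact duplicate; isclose absorbs floating-point rounding.
    if math.isclose(f_total, 1.0):
        return {"duplicate"}
    labels = set()
    if f_total > m2:
        labels.add("suspected_duplicate")   # goes to manual verification
    if f_total > m3:
        labels.add("suspected_associated")  # goes to manual verification
    # Assumption: below both thresholds the record is treated as non-duplicate.
    return labels or {"not_duplicate"}
```

Because M2 and M3 are set independently, a record can carry both suspected labels at once; with M3 below M2, a mid-range F yields only the association label.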
Further, the similarity f_i is determined and calculated as follows:
For a structured data item, its similarity f_i takes the value 1 when the item is identical to the same-type data item already in the medical database, and 0 otherwise;
For an unstructured data item, its similarity f_i is the Jaccard similarity SIM(S, T) between the text set T of the unstructured data item and the text set S of the same-type data item already in the medical database,
where SIM(S, T) = |S ∩ T| / |S ∪ T|.
Further, before the Jaccard similarity between the text set T of the unstructured data item and the text set S of the same-type data item already in the medical database is calculated directly, text set T and text set S are also preprocessed as follows:
(i) using a word segmentation library, the text fields in T and S are decomposed into words, each word being treated as the smallest data item to be processed;
(ii) the corresponding segmented data items from T and S are compared one by one according to the K-shingle algorithm.
Further, the word segmentation library includes three parts:
a drug name part, which includes drug trade names, generic names, and the English abbreviations of common regimens;
a symptom description part, which includes words commonly used to describe common medical symptoms;
a genetic testing part, which includes the abbreviations of genetic testing sites and descriptions of genetic testing results.
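The source does not say how the word segmentation library is applied to a text field. A common dictionary-based technique is forward maximum matching, and the sketch below swaps it in purely as an assumption. The function name `segment` and the sample lexicon entries in the usage note are illustrative, not taken from the patent.

```python
from typing import Iterable, List


def segment(text: str, lexicon: Iterable[str]) -> List[str]:
    """Greedy forward maximum matching against a word list.

    At each position the longest lexicon entry is taken; characters that
    match no entry become single-character tokens.
    """
    words = set(lexicon)
    max_len = max((len(w) for w in words), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in words:
                tokens.append(piece)
                i += size
                break
    return tokens
```

A lexicon merged from the three parts, for instance `{"FOLFOX", "咳嗽", "发热", "EGFR"}`, would split `"咳嗽发热EGFR"` into the symptom words and the gene abbreviation, which can then be fed to the K-shingle comparison.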
The classification of medical data is illustrated below using tumor medical records as an example.
The core structured data items in a medical record are listed in Table 1.
Table 1
The unstructured data items are: T1 chief complaint, T2 previous treatment regimens, and T3 current treatment regimen.
Based on the inventors' practical experience, the structured core data above is further classified as follows.
A. Exclusion array, in detail:
Disease name (first-level classification), disease name (second-level classification), abbreviation of patient name, sex, place of birth, date of birth, place of long-term residence, occupation, nationality, hospital name, department, ID number (medical record/case number), admission number, admission time, discharge time, degree of differentiation, and pathology name.
B. Fuzzy array, in detail:
Marital status, tumor stage, TNM stage of the disease, whether metastasis has occurred, metastasis site, time of first admission, symptoms at first admission, interval between onset of the first symptoms and seeking medical care, symptoms at this admission, time of onset of the current symptoms, smoking, years of smoking, drinking, years of drinking, obstetric history, and family history of hereditary tumors.
It should be noted that the above classification for tumor medical records is given only to make the definitions of core data items and of their classification more intuitive, and does not limit the core data items of the invention or the method of classifying them. Different core data items may be set for different types of medical data, and different exclusion arrays and fuzzy arrays may be set for the same core data items. However, authority over these settings is restricted to a small number of specific people and is not open to ordinary medical data entry staff.
It should be noted that in the method and system of present disclosure, medical data there may be following several states:
1- is pending, i.e., the original state that data are handled in method/system;
The doubtful repetitions of 2-, i.e. method/system treat that artificial treatment provides final result according to the automatically derived conclusion of algorithm;
3- is automatically normal, i.e., method/system is according to conclusion of the automatically derived data of algorithm without repetition, and makes data input number According in storehouse;
4- is automatically deleted, i.e., method/system is according to the conclusion of the automatically derived Data duplication of algorithm, and deletes the data;
5- is manually normal, i.e., judges the data to be non-duplicate and add can after manually handling the data of doubtful repetition The final result of the connective marker of energy;
6- is manually deleted, i.e., manually repeats to judge that the data should for duplicate data and deletion after record is handled to doubtful Data.
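The six states can be modeled as a small state machine. In the sketch below the numeric values follow the list above, while the identifier names and the set of allowed transitions are inferred from the state descriptions and are assumptions rather than something the source spells out.

```python
from enum import Enum


class RecordStatus(Enum):
    PENDING = 1              # initial state
    SUSPECTED_DUPLICATE = 2  # awaiting manual processing
    AUTO_NORMAL = 3          # entered into the database automatically
    AUTO_DELETED = 4         # deleted automatically as a duplicate
    MANUAL_NORMAL = 5        # judged non-duplicate by a reviewer
    MANUAL_DELETED = 6       # judged duplicate by a reviewer and deleted


# Transitions inferred from the state descriptions (an assumption).
ALLOWED = {
    RecordStatus.PENDING: {
        RecordStatus.SUSPECTED_DUPLICATE,
        RecordStatus.AUTO_NORMAL,
        RecordStatus.AUTO_DELETED,
    },
    RecordStatus.SUSPECTED_DUPLICATE: {
        RecordStatus.MANUAL_NORMAL,
        RecordStatus.MANUAL_DELETED,
    },
}


def advance(current: RecordStatus, target: RecordStatus) -> RecordStatus:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

States 3 to 6 are terminal in this reading, so only pending and suspected-duplicate records can change state.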
Experimental example
A group of people with medical backgrounds (group A) manually screened out 10,000 valid tumor medical records, and 200 were randomly selected from these valid records. Using the 200 selected records as templates, the members of group A manually edited and modified some of the data other than the patients' basic information to obtain 200 new records A. A group of people without medical backgrounds (group B) used the same 200 records as templates and manually edited and modified some of the data other than the patients' basic information to obtain 200 new records B.
After the original 10,000 valid records, the 200 "duplicate" records A and the 200 "duplicate" records B were mixed, another four groups of people with medical backgrounds (groups C, D, E and F) and another four groups of people without medical backgrounds (groups G, H, I and J) each screened the "duplicate" records among these 10,400 records by machine traversal duplicate checking plus manual comparison, without knowing the number of duplicate records. The same 10,400 records were also screened with the duplicate-checking system provided by the invention, with the manual verification part performed by another group of people with medical backgrounds (group K) and another group of people without medical backgrounds (group L), respectively.
Groups A and B have the same number of members, and groups C to L have the same number of members.
Table 2 shows the results of the different screening approaches and the time taken, with time calculated on the basis of 8 working hours per group member per day.
Table 2
Table 2 shows that the medical data duplicate-checking system provided by the invention effectively reduces the missed-detection rate and misjudgment rate of duplicate checking while shortening data processing time. Moreover, even when the manual verification part of the system is performed by people without medical backgrounds, its missed-detection rate and misjudgment rate are significantly lower than those of machine traversal duplicate checking plus manual comparison performed by people with medical backgrounds.
From the above description of the embodiments, those skilled in the art will clearly understand that the invention can be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware, or by a combination of the two. On this understanding, the technical solution of the invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The software module or computer software product can be stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the invention. The storage medium may be random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Claims (8)

1. A method for medical data duplicate checking and association, characterized by comprising the following steps:
(1) extracting the core data items from the medical data to be processed, the core data items being used to distinguish the data in which they appear from other data;
(2) classifying the core data items: the core data items are first divided into structured data items and unstructured data items; one group of structured data items is then selected as the exclusion array, and the remaining structured data items form the fuzzy array;
(3) performing preliminary screening on each data item in the exclusion array and in the fuzzy array respectively:
(3a) when any data item in the exclusion array differs from the existing data item of the same kind in the medical database, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database,
or (3b) when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array is greater than a set threshold M_1, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database;
in all other cases the method proceeds to the next step;
(4) performing in-depth screening on every data item among the core data items: a weight a_i is assigned to each data item, the similarity f_i of each data item is determined and calculated,
and the total similarity F between the medical data and the existing medical data in the medical database is calculated according to the following formula:
where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1;
when F = 1, the medical data is judged to be duplicate data and is deleted;
(5) setting a similarity threshold M_2 for suspected duplicate data and/or a threshold M_3 for suspected associated data: when M_2 < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M_3 < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;
(6) after the suspected duplicate and/or associated data has been manually verified and a judgment given, entering the data judged to be non-duplicate into the medical database, and assigning one or more corresponding association labels to the data judged to be associated.
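The preliminary screening of step (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: records are modeled as dictionaries compared against one existing record at a time, and the function name `passes_preliminary_screen`, the field names, and the threshold value 0.5 are all hypothetical.

```python
def passes_preliminary_screen(new, existing, exclusion_fields, fuzzy_fields, m1):
    """Return True if `new` can be accepted as non-duplicate / unassociated
    with respect to `existing` without in-depth screening.

    exclusion_fields: structured items forming the exclusion array (3a)
    fuzzy_fields:     remaining structured items forming the fuzzy array (3b)
    m1:               threshold on the fraction of differing fuzzy items
    """
    # (3a) any differing item in the exclusion array settles the question
    if any(new.get(k) != existing.get(k) for k in exclusion_fields):
        return True
    # (3b) fraction of differing items in the fuzzy array exceeds M_1
    if fuzzy_fields:
        differing = sum(1 for k in fuzzy_fields if new.get(k) != existing.get(k))
        if differing / len(fuzzy_fields) > m1:
            return True
    # otherwise the record goes on to in-depth screening
    return False


# Hypothetical field names, for illustration only
EXCLUSION = ["patient_id"]
FUZZY = ["name", "sex", "birth_date", "hospital"]

existing = {"patient_id": "P1", "name": "Li Lei", "sex": "M",
            "birth_date": "1980-01-01", "hospital": "A"}
candidate = dict(existing, name="Li Le")  # one fuzzy item differs
needs_deep_screen = not passes_preliminary_screen(
    candidate, existing, EXCLUSION, FUZZY, m1=0.5)
```

In this sketch a record that matches on every exclusion item and differs on only one of four fuzzy items is not cleared, so it would be passed on to the in-depth screening of step (4).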
2. The method according to claim 1, characterized in that the similarity f_i is determined and calculated as follows: for a structured data item, when it is identical to the existing data item of the same kind in the medical database, its similarity f_i^s takes the value 1, and otherwise takes the value 0;
for an unstructured data item, its similarity f_i^n is the Jaccard similarity SIM(S, T) between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database, where SIM(S, T) = |S ∩ T| / |S ∪ T|, f_i^s ∈ f_i, and f_i^n ∈ f_i.
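The two per-item similarity rules can be illustrated with a short Python sketch. The function names are hypothetical, and returning 1.0 for two empty sets is an assumption of this sketch, since SIM(S, T) is undefined when S ∪ T is empty and the claim does not address that case.

```python
def structured_similarity(new_value, existing_value):
    """f_i^s: 1 if the structured item is identical to the stored one, else 0."""
    return 1.0 if new_value == existing_value else 0.0


def jaccard_similarity(s, t):
    """f_i^n: SIM(S, T) = |S ∩ T| / |S ∪ T| for two collections of text elements."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        # Assumption: treat two empty texts as identical (not stated in the source)
        return 1.0
    return len(s & t) / len(union)


sim = jaccard_similarity({"cough", "fever", "fatigue"}, {"fever", "fatigue", "rash"})
```

Here the two sets share two of four distinct elements, so `sim` evaluates to 0.5.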
3. The method according to claim 2, characterized in that, before the Jaccard similarity between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database is calculated directly, the text sets T and S are also preprocessed as follows:
(i) using a word segmentation library, the text fields in T and S are decomposed into a number of words, and each word is taken as a minimum processing data item,
(ii) the corresponding segmented data items from T and S are compared one by one according to the K-shingle algorithm.
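One possible reading of this preprocessing is sketched below in Python: after segmentation, each text is turned into the set of its contiguous runs of k words (k-shingles), and the Jaccard similarity is taken over those sets. The claim does not spell out how the K-shingle comparison is carried out, so this interpretation, the function names, and the choice k = 2 are assumptions.

```python
def word_shingles(tokens, k):
    """Return the set of contiguous k-word shingles of a token list."""
    if not tokens:
        return set()
    if len(tokens) < k:
        # Assumption: a text shorter than k words forms a single shingle
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def shingle_jaccard(tokens_s, tokens_t, k=2):
    """Jaccard similarity over the k-shingle sets of two segmented texts."""
    s, t = word_shingles(tokens_s, k), word_shingles(tokens_t, k)
    union = s | t
    if not union:
        return 1.0  # assumption: two empty texts count as identical
    return len(s & t) / len(union)


# Already-segmented texts (segmentation itself is step (i))
s_tokens = ["cough", "fever", "three", "days"]
t_tokens = ["cough", "fever", "two", "days"]
score = shingle_jaccard(s_tokens, t_tokens, k=2)
```

With k = 2 the two texts share one shingle out of five distinct shingles, so word order and adjacency lower the score more than a plain bag-of-words comparison would.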
4. The method according to claim 3, characterized in that the word segmentation library comprises three parts:
a drug name part, which includes drug trade names, generic names, and the English-letter abbreviations of common regimens;
a symptom description part, which includes words commonly used to describe symptoms in medical practice;
a genetic testing part, which includes abbreviations of genetic test loci and descriptions of genetic test results.
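How a three-part library might drive segmentation can be sketched as follows. The claims do not name a segmentation algorithm; forward maximum matching is used here purely as a common, simple stand-in, and the lexicon entries, the dictionary layout, and the function name `segment` are hypothetical.

```python
# Hypothetical lexicon with the three parts named in the claim
LEXICON_PARTS = {
    "drug_names": {"TP", "gefitinib"},          # trade/generic names, regimen abbreviations
    "symptom_terms": {"cough", "fever"},        # common symptom words
    "genetic_testing": {"EGFR", "L858R", "T790M"},  # locus abbreviations, result terms
}
LEXICON = set().union(*LEXICON_PARTS.values())


def segment(text, lexicon):
    """Forward maximum matching: at each position take the longest lexicon
    entry that matches; fall back to a single character otherwise."""
    max_len = max((len(w) for w in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


tokens = segment("EGFRL858R+TP", LEXICON)
```

In this sketch the unspaced string is split into the locus abbreviation, the result term, a leftover single character, and the regimen abbreviation; a domain lexicon keeps such terms intact instead of breaking them into arbitrary pieces.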
5. A system for medical data duplicate checking and association, characterized in that the system comprises:
a core data item extraction unit, configured to extract the core data items from the medical data to be processed, the core data items being used to distinguish the data in which they appear from other data;
a classification unit, configured to classify the core data items: the core data items are first divided into structured data items and unstructured data items; one group of structured data items is then selected as the exclusion array, and the remaining structured data items form the fuzzy array;
a preliminary screening unit, configured to perform preliminary screening on each data item in the exclusion array and in the fuzzy array respectively: when any data item in the exclusion array differs from the existing data item of the same kind in the medical database, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database; or when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array is greater than a set threshold M_1, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database;
an in-depth screening unit, configured to perform in-depth screening on every data item processed by the preliminary screening unit: a weight a_i is assigned to each data item, the similarity f_i of each data item is determined and calculated,
and the total similarity F between the medical data and the existing medical data in the medical database is calculated according to the following formula:
where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1;
when F = 1, the medical data is judged to be duplicate data and is deleted;
a judging unit, configured to set a similarity threshold M_2 for suspected duplicate data and/or a threshold M_3 for suspected associated data: when M_2 < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M_3 < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;
a manual verification unit, configured for manual verification of the suspected duplicate and/or associated data; after the manual judgment is given, the data judged to be non-duplicate is entered into the medical database, and one or more corresponding association labels are assigned to the data judged to be associated.
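The decision logic of the judging unit can be sketched in Python as below. The conditions are taken literally from the claim, so a value of F above both thresholds receives both flags; the function name, the flag strings, and the example threshold values are hypothetical, and what happens to a record that receives no flag is not stated in the claim, so the empty list here is only a placeholder.

```python
def judge(total_similarity, m2=None, m3=None):
    """Map a total similarity F to the outcomes described for the judging unit.

    m2: threshold for suspected duplicate data (optional)
    m3: threshold for suspected associated data (optional)
    """
    f = total_similarity
    if f >= 1.0:
        return ["duplicate"]  # F = 1: duplicate data, deleted
    flags = []
    if m2 is not None and m2 < f < 1.0:
        flags.append("suspected_duplicate")   # submitted for manual verification
    if m3 is not None and m3 < f < 1.0:
        flags.append("suspected_associated")  # submitted for manual verification
    return flags  # empty list: no flag raised (handling not specified in the source)


# Example thresholds are illustrative only
outcome = judge(0.6, m2=0.8, m3=0.5)
```

With these illustrative thresholds, a record with F = 0.6 is flagged only as suspected associated data, while F = 0.9 would carry both flags and go to manual verification.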
6. The system according to claim 5, characterized in that the in-depth screening unit comprises:
a weight assignment subunit, configured to assign the weight a_i of each data item among the core data items;
a similarity f_i determination and calculation subunit, configured to determine and calculate the similarity f_i of each data item, comprising:
a structured data item module, configured to assign the structured data item similarity f_i^s: when the structured data item is identical to the data item of the same kind in the medical database, its similarity f_i^s takes the value 1, and otherwise takes the value 0,
an unstructured data item module, configured to assign the unstructured data item similarity f_i^n, which is the Jaccard similarity SIM(S, T) between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database, SIM(S, T) = |S ∩ T| / |S ∪ T|;
a total similarity F calculation and judgment subunit, configured to calculate the total similarity F between the medical data and the existing medical data in the medical database:
where f_i^s ∈ f_i, f_i^n ∈ f_i, 0 ≤ f_i ≤ 1, and 0 < F ≤ 1;
when F = 1, the medical data is judged to be duplicate data and is deleted.
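The formula for the total similarity F is not reproduced in this text. The Python sketch below therefore assumes a weight-normalized sum, F = Σ a_i·f_i / Σ a_i, chosen only because it is consistent with the stated constraints (0 ≤ f_i ≤ 1, F ≤ 1, and F = 1 exactly when every item matches); the actual formula in the patent may differ, and the function name and weights are hypothetical.

```python
def total_similarity(weights, similarities):
    """Assumed form: F = sum(a_i * f_i) / sum(a_i), with positive weights a_i
    and per-item similarities f_i in [0, 1]."""
    if len(weights) != len(similarities) or not weights:
        raise ValueError("weights and similarities must be non-empty and equal in length")
    if any(a <= 0 for a in weights):
        raise ValueError("weights must be positive")
    if any(not 0.0 <= f <= 1.0 for f in similarities):
        raise ValueError("similarities must lie in [0, 1]")
    return sum(a * f for a, f in zip(weights, similarities)) / sum(weights)


# Two structured items that match (f = 1) and one unstructured item with
# Jaccard similarity 0.5; the weights are illustrative
F = total_similarity([2, 1, 1], [1.0, 1.0, 0.5])
```

Under this assumption the example gives F = 0.875, and F reaches 1 only when every weighted item has similarity 1, which matches the rule that F = 1 marks duplicate data.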
7. The system according to claim 6, characterized in that the in-depth screening unit further comprises an unstructured data item preprocessing module, configured to preprocess the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database,
the preprocessing being:
(i) using a word segmentation library, the text fields in T and S are decomposed into a number of words, and each word is taken as a minimum processing data item,
(ii) the corresponding segmented data items from T and S are compared one by one according to the K-shingle algorithm.
8. The system according to claim 6, characterized in that the in-depth screening unit further comprises a word segmentation library module, configured to decompose the text fields in T and S using the text entries present in the library,
and comprising three submodules:
a drug name submodule, which includes drug trade names, generic names, and the English-letter abbreviations of common regimens;
a symptom description submodule, which includes words commonly used to describe symptoms in medical practice;
a genetic testing submodule, which includes abbreviations of genetic test loci and descriptions of genetic test results.
CN201710153199.5A 2017-03-15 2017-03-15 A kind of medical data duplicate checking and the method and system associated Expired - Fee Related CN106934038B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710153199.5A CN106934038B (en) 2017-03-15 2017-03-15 A kind of medical data duplicate checking and the method and system associated

Publications (2)

Publication Number Publication Date
CN106934038A CN106934038A (en) 2017-07-07
CN106934038B true CN106934038B (en) 2018-01-05

Family

ID=59433291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710153199.5A Expired - Fee Related CN106934038B (en) 2017-03-15 2017-03-15 A kind of medical data duplicate checking and the method and system associated

Country Status (1)

Country Link
CN (1) CN106934038B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090185A (en) * 2017-12-16 2018-05-29 河北慧日信息技术有限公司 A kind of customer information duplicate checking method
CN109887562B (en) * 2019-02-20 2021-10-29 广州天鹏计算机科技有限公司 Similarity determination method, device, equipment and storage medium for electronic medical records
CN110390084B (en) * 2019-06-19 2021-01-26 平安国际智慧城市科技股份有限公司 Text duplicate checking method, device, equipment and storage medium
CN110990591A (en) * 2019-12-26 2020-04-10 北京亚信数据有限公司 Method and system for auditing transcoding quality of medical data
CN111814447B (en) * 2020-06-24 2022-05-27 平安科技(深圳)有限公司 Electronic case duplicate checking method and device based on word segmentation text and computer equipment
CN112559506A (en) * 2020-12-22 2021-03-26 卫宁健康科技集团股份有限公司 Health data processing method and device, processing equipment and storage medium
CN112765144B (en) * 2021-01-22 2023-04-25 武汉大学 Method for checking and correcting conflict items after merging big health medical data
CN115809286B (en) * 2023-01-16 2023-04-25 江苏智碘数据科技有限公司 Structured data statistical analysis and report intelligent generation system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252488A (en) * 2013-06-28 2014-12-31 华为技术有限公司 Data processing method and server
CN105718506A (en) * 2016-01-04 2016-06-29 胡新伟 Duplicate-checking comparison method for science and technology projects
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10573415B2 (en) * 2014-04-21 2020-02-25 Medtronic, Inc. System for using patient data combined with database data to predict and report outcomes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高性能重复数据检测与删除技术研究;魏建生;《中国博士学位论文全文数据库 信息科技辑》;20130715;全文 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Medical data duplicate checking and associating method and system

Effective date of registration: 20190709

Granted publication date: 20180105

Pledgee: Chen Guitai

Pledgor: JIANGSU HUASHENG GENE DATA TECHNOLOGY Co.,Ltd.

Registration number: 2019320000317

PP01 Preservation of patent right

Effective date of registration: 20191112

Granted publication date: 20180105

PD01 Discharge of preservation of patent

Date of cancellation: 20221112

Granted publication date: 20180105

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180105
