CN106934038B

CN106934038B - Method and system for medical data duplicate checking and association

Info

Publication number: CN106934038B
Application number: CN201710153199.5A
Authority: CN
Inventors: 刘劲松; 王友柱; 饶江; 李广东; 李楠; 王东; 陈桂太
Original assignee: Jiangsu Huasheng Gene Data Technology Co Ltd
Current assignee: Jiangsu Huasheng Gene Data Technology Co Ltd
Priority date: 2017-03-15
Filing date: 2017-03-15
Publication date: 2018-01-05
Anticipated expiration: 2037-03-15
Also published as: CN106934038A

Abstract

The present invention relates to a method and system for duplicate checking and association of medical data. The method comprises: (1) extracting the core data items from the medical data to be processed; (2) classifying the core data items; (3) performing a preliminary screening of each data item in the exclusion array and the fuzzy array; (4) performing an in-depth screening of every core data item; (5) setting a similarity threshold M₂ for suspected duplicate data and/or a threshold M₃ for suspected associated data; (6) after the suspected duplicate and/or associated data have been manually verified and judged, entering the data judged to be non-duplicate into the medical database and giving one or more corresponding association tags to the data judged to be associated. Compared with the prior art, the invention has a low missed-detection rate, a low misjudgment rate, and high duplicate-checking efficiency, and it places low demands on the professional expertise of the people performing manual verification, so the operating cost of duplicate checking and association is significantly reduced.

Description

Method and system for medical data duplicate checking and association

Technical field

The present invention relates generally to data processing technology and, more specifically, to a method and system for duplicate checking and association of medical data.

Background art

In the practice of collecting and processing medical data, the same data may be collected and entered into the database more than once, and data that has been slightly altered by a professional or a layperson may be collected and entered as if it were different data. To ensure that the data in a medical database is genuine and valid, a scheme is needed that checks data for duplicates after submission and before it passes formal review and verification for storage, so that duplicate data is kept out of the database. Because medical data contains large amounts of unstructured data, such as symptom descriptions in medical records and treatment regimens for diseases, in-depth duplicate checking of medical data currently relies largely on manual work by people with relevant medical experience. This is inefficient, consumes substantial manpower and resources, and is costly.

In addition, unlike other natural sciences, medical research is subject to strict controls on human experiments, so theoretical work cannot be verified in real time. Medical research therefore depends heavily on collecting and analyzing historical medical data such as patient medical records. An effective medical data processing method is thus needed to make it possible to automatically mine related cases from a medical database for further medical research and analysis.

Chinese patent CN101609466B provides a "mass data duplicate checking method and system", which comprises: extracting data keywords from the mass data, the data keywords being used to distinguish the data they belong to from other data; splitting the data keywords according to their first N+M letters and putting data keywords whose first N+M letters are identical into the same file, obtaining key data files, wherein the first N letters of the data keywords are identical and the first N+M letters are not all identical, N and M being non-negative integers; and performing duplicate checking on the data in each key data file separately to obtain the duplicate-checking result. That invention is better suited to structured data and cannot effectively check duplicates in medical data containing large amounts of unstructured data. In addition, it does not address similarity or association between data.

Chinese patent application CN101751423A provides "a method and system for article duplicate checking", comprising: when manuscript information in a production database is changed as a result of operations on the manuscript in the page layout, an event trigger obtains the modified manuscript information, which includes the manuscript content; a duplicate-checking server then compares the manuscript content of any obtained manuscript information that has not yet undergone duplicate-content comparison and determines the duplicate manuscript information. Because the duplicate-checking server performs this comparison on the manuscript information obtained by the event trigger, the duplicate manuscripts are ultimately identified. The technical effect actually achieved by that patent is automatic duplicate checking of manuscripts, a kind of unstructured information, before publication, which reduces the number of duplicate manuscripts that appear in print. Although its embodiments mention using Chinese word-segmentation library technology to compare manuscript content and produce a similarity between manuscripts for duplicate checking, the patent does not disclose specifically how that similarity is calculated, nor does it address how to use the similarity to associate manuscripts with one another.

Summary of the invention

In view of the above problems, the present invention provides a method and system for duplicate checking and association of medical data, which solves the prior-art problems that medical data containing large amounts of unstructured data cannot be effectively checked for duplicates and that associations between medical data are not established.

To achieve the above object, the present invention adopts the following technical solution.

A method for duplicate checking and association of medical data, characterized by comprising the following steps:

(1) extracting the core data items from the medical data to be processed, the core data items being used to distinguish the data to which they belong from other data;

(2) classifying the core data items: the core data items are first divided into structured data items and unstructured data items; a group of the structured data items is then selected as the exclusion array, and the remaining structured data items form the fuzzy array;

(3) performing a preliminary screening of each data item in the exclusion array and the fuzzy array, respectively:

(3a) when any data item in the exclusion array differs from the data item of the same kind already in the medical database, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database,

or (3b) when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array exceeds a set threshold M₁, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database,

in all other cases, the method proceeds to the next step;

(4) performing an in-depth screening of every core data item: a weight a_i is assigned to each data item and a similarity f_i is determined for each data item,

and the total similarity F between the medical data and the existing medical data in the medical database is calculated according to the following formula:

where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1,

when F = 1, the medical data is judged to be duplicate data and is deleted;

(5) setting a similarity threshold M₂ for suspected duplicate data and/or a threshold M₃ for suspected associated data: when M₂ < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M₃ < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;

(6) after the suspected duplicate and/or associated data have been manually verified and judged, the data judged to be non-duplicate are entered into the medical database, and the data judged to be associated are given one or more corresponding association tags.
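The preliminary screening of step (3) can be sketched as below. This is a minimal illustration, not the patented implementation: it assumes that records are plain dictionaries, that a new record is compared with one existing record at a time, and that "differs" means simple inequality. The function name `preliminary_screen` and its parameters are hypothetical.

```python
from typing import Mapping, Sequence


def preliminary_screen(
    new_record: Mapping[str, str],
    existing_record: Mapping[str, str],
    exclusion_fields: Sequence[str],
    fuzzy_fields: Sequence[str],
    m1: float,
) -> bool:
    """Return True if the pair must go on to in-depth screening (step 4).

    False means the new record is treated as non-duplicate and unassociated
    with respect to this existing record (steps 3a and 3b).
    """
    # Step (3a): any mismatch in the exclusion array rules the pair out.
    for field in exclusion_fields:
        if new_record.get(field) != existing_record.get(field):
            return False

    # Step (3b): a share of mismatches in the fuzzy array above M1
    # also rules the pair out.
    if fuzzy_fields:
        differing = sum(
            1
            for field in fuzzy_fields
            if new_record.get(field) != existing_record.get(field)
        )
        if differing / len(fuzzy_fields) > m1:
            return False

    return True
```

Under these assumptions, a new record would be entered into the database directly only if this check returns False against every existing record; otherwise the matching pairs move on to step (4).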

Further, the similarity f_i is determined as follows:

for a structured data item, its similarity f_i^s takes the value 1 when the item is identical to the data item of the same kind already in the medical database, and 0 otherwise;

for an unstructured data item, its similarity f_i^n is the Jaccard similarity SIM(S, T) between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database,

where SIM(S, T) = |S ∩ T| / |S ∪ T|, f_i^s ∈ f_i, and f_i^n ∈ f_i.
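The two per-item similarities can be illustrated with a short sketch. The function names are hypothetical, and the treatment of two empty sets (returned as 1.0) is an assumption, since the text does not define SIM(S, T) when S ∪ T is empty.

```python
from typing import AbstractSet, Hashable


def structured_similarity(new_value: object, existing_value: object) -> float:
    """f_i^s: 1 if the structured items are identical, otherwise 0."""
    return 1.0 if new_value == existing_value else 0.0


def jaccard(s: AbstractSet[Hashable], t: AbstractSet[Hashable]) -> float:
    """f_i^n: SIM(S, T) = |S ∩ T| / |S ∪ T|."""
    union = s | t
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(s & t) / len(union)
```

For example, the sets {a, b, c} and {b, c, d} share two of four distinct elements, so their Jaccard similarity is 0.5.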

Further, before the Jaccard similarity between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database is directly calculated, the text sets T and S are preprocessed as follows:

(i) using a word-segmentation library, the text fields in T and S are split into words, and each word is treated as a minimum processing data item,

(ii) the corresponding word data items from T and S are compared one by one according to the k-shingle algorithm.
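One possible reading of preprocessing step (ii) is sketched below: build k-shingles (runs of k consecutive words) from each segmented text and compare the two shingle sets with the Jaccard measure. The text does not specify k or how texts shorter than k words are handled, so both choices here are assumptions, and the function names are hypothetical.

```python
from typing import Sequence, Set, Tuple


def k_shingles(tokens: Sequence[str], k: int) -> Set[Tuple[str, ...]]:
    """Return the set of all runs of k consecutive tokens."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(tokens) < k:
        # Assumption: a text shorter than k words forms a single shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def shingle_similarity(
    s_tokens: Sequence[str], t_tokens: Sequence[str], k: int = 2
) -> float:
    """Jaccard similarity between the k-shingle sets of two token lists."""
    s = k_shingles(s_tokens, k)
    t = k_shingles(t_tokens, k)
    union = s | t
    return len(s & t) / len(union) if union else 1.0
```

Shingling over words rather than characters keeps a drug name or gene site found by the word-segmentation library as one unit, so a single changed word only affects the shingles that contain it.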

Further, the word-segmentation library comprises three parts:

a drug name part, which includes drug trade names, generic names, and the English-letter abbreviations of common regimens;

a symptom description part, which includes the words commonly used to describe common medical symptoms;

a genetic testing part, which includes the abbreviations of genetic testing sites and the descriptions of genetic testing results.
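How such a three-part library might drive word segmentation can be sketched with forward maximum matching, a common dictionary-based technique for Chinese text. The patent does not name a segmentation algorithm, so the technique is an assumption, and the library entries below are illustrative examples only, not taken from the source.

```python
from typing import Iterable, List, Set


def build_lexicon(*parts: Iterable[str]) -> Set[str]:
    """Merge the parts of the word-segmentation library into one lookup set."""
    lexicon: Set[str] = set()
    for part in parts:
        lexicon.update(part)
    return lexicon


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Forward maximum matching: take the longest library entry at each position."""
    max_len = max((len(term) for term in lexicon), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            # Fall back to a single character when no entry matches.
            if length == 1 or piece in lexicon:
                tokens.append(piece)
                i += length
                break
    return tokens


# Illustrative entries only (hypothetical, not from the patent).
DRUG_NAMES = {"吉非替尼"}        # gefitinib
SYMPTOM_TERMS = {"咳嗽", "胸痛"}  # cough, chest pain
GENE_TERMS = {"EGFR", "突变"}     # EGFR, mutation
LEXICON = build_lexicon(DRUG_NAMES, SYMPTOM_TERMS, GENE_TERMS)
```

With these entries, `segment("咳嗽胸痛EGFR突变", LEXICON)` yields the four library terms in order, and characters that match no entry come out as single-character tokens.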

Another object of the present invention is to provide a system for duplicate checking and association of medical data, characterized in that the system comprises:

a core data item unit, for extracting the core data items from the medical data to be processed, the core data items being used to distinguish the data to which they belong from other data;

a classification unit, for classifying the core data items: the core data items are first divided into structured data items and unstructured data items; a group of the structured data items is then selected as the exclusion array, and the remaining structured data items form the fuzzy array;

a preliminary screening unit, for performing a preliminary screening of each data item in the exclusion array and the fuzzy array, respectively: when any data item in the exclusion array differs from the data item of the same kind already in the medical database, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database; or when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array exceeds a set threshold M₁, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database;

an in-depth screening unit, for performing an in-depth screening of every data item after processing by the preliminary screening unit, assigning a weight a_i to each data item and determining a similarity f_i for each data item,

where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1,

when F = 1, the medical data is judged to be duplicate data and is deleted;

a judging unit, for setting a similarity threshold M₂ for suspected duplicate data and/or a threshold M₃ for suspected associated data: when M₂ < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M₃ < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;

a manual verification unit, for manually verifying the suspected duplicate and/or associated data; after a judgment has been given manually, the data judged to be non-duplicate are entered into the medical database, and the data judged to be associated are given one or more corresponding association tags.

Further, the in-depth screening unit comprises:

a weight assignment subunit, for assigning the weight a_i of each data item among the core data items;

a similarity f_i determination subunit, for determining the similarity f_i of each data item, comprising:

a structured data item module, for assigning the structured data item similarity f_i^s: when the structured data item is identical to the data item of the same kind in the medical database, its similarity f_i^s takes the value 1, and otherwise 0,

an unstructured data item module, for assigning the unstructured data item similarity f_i^n, which is the Jaccard similarity SIM(S, T) between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database, SIM(S, T) = |S ∩ T| / |S ∪ T|;

a total similarity F calculation and judgment subunit, for calculating the total similarity F between the medical data and the existing medical data in the medical database:

where f_i^s ∈ f_i, f_i^n ∈ f_i, 0 ≤ f_i ≤ 1, and 0 < F ≤ 1,

when F = 1, the medical data is judged to be duplicate data and is deleted.

Further, the in-depth screening unit also comprises an unstructured data item preprocessing module, for preprocessing the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database,

the preprocessing being:

(i) using a word-segmentation library, the text fields in T and S are split into words, and each word is treated as a minimum processing data item,

(ii) the corresponding word data items from T and S are compared one by one according to the k-shingle algorithm.

Further, the in-depth screening unit also comprises a word-segmentation library module, for splitting the text fields in T and S using the text data items present in the library,

comprising three submodules:

a drug name submodule, which includes drug trade names, generic names, and the English-letter abbreviations of common regimens;

a symptom description submodule, which includes the words commonly used to describe common medical symptoms;

a genetic testing submodule, which includes the abbreviations of genetic testing sites and the descriptions of genetic testing results.

Unless otherwise specified, structured data in the present invention refers to row data that is stored in a database and can be logically expressed with a two-dimensional table structure.

Unless otherwise specified, unstructured data in the present invention refers to data that cannot conveniently be represented with two-dimensional logical database tables, including office documents in all formats, text, pictures, XML and HTML (subsets of the Standard Generalized Markup Language), all kinds of reports, images, and audio/video information.

Compared with the prior art, the method and system for medical data duplicate checking and association of the present invention have the advantage that a new data classification method and similarity algorithm effectively calculate the similarity between incoming medical data and the data in the existing database, solving the problems that medical data containing large amounts of unstructured data cannot be effectively checked for duplicates and that associations between medical data are not established. The invention has a low missed-detection rate, a low misjudgment rate, and high duplicate-checking efficiency, and it places low demands on the medical expertise of the people performing manual verification, so the operating cost of duplicate checking and association is significantly reduced.

Embodiment

Embodiments of the present invention are described in detail below. The embodiments are exemplary; they serve only to explain the present invention and are not to be construed as limiting it. To avoid unnecessarily obscuring the embodiments, techniques well known in the art, that is, techniques that would be obvious to a person skilled in the art, are not described in detail here.

S101: extract the core data items from the medical data to be processed, the core data items being used to distinguish the data to which they belong from other data;

S102: classify the core data items: the core data items are first divided into structured data items and unstructured data items; a group of the structured data items is then selected as the exclusion array, and the remaining structured data items form the fuzzy array;

S103: perform a preliminary screening of each data item in the exclusion array and the fuzzy array, respectively:

S103a: when any data item in the exclusion array differs from the data item of the same kind already in the medical database, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database,

or S103b: when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array exceeds a set threshold M₁, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database,

in all other cases, proceed to the next step;

S104: perform an in-depth screening of every core data item: a weight a_i is assigned to each data item and a similarity f_i is determined for each data item,

where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1,

when F = 1, the medical data is judged to be duplicate data and is deleted;

S105: set a similarity threshold M₂ for suspected duplicate data and/or a threshold M₃ for suspected associated data: when M₂ < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M₃ < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;

S106: after the suspected duplicate and/or associated data have been manually verified and judged, the data judged to be non-duplicate are entered into the medical database, and the data judged to be associated are given one or more corresponding association tags.
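Steps S104 and S105 can be sketched as follows. The formula for F is not reproduced in this text, so the sketch assumes a weighted average of the per-item similarities, F = Σ a_i·f_i / Σ a_i, which satisfies F = 1 exactly when every f_i = 1. That formula, the function names, and the handling of records below both thresholds are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def total_similarity(weights: Sequence[float], similarities: Sequence[float]) -> float:
    """Assumed form of F: weighted average of the per-item similarities f_i."""
    if not weights or len(weights) != len(similarities):
        raise ValueError("weights and similarities must be non-empty and equal in length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(a * f for a, f in zip(weights, similarities)) / total_weight


def classify(
    f_total: float, m2: Optional[float] = None, m3: Optional[float] = None
) -> List[str]:
    """Map the total similarity F to the outcomes described in S104 and S105."""
    if f_total >= 1.0:
        return ["duplicate"]  # F = 1: judged duplicate and deleted
    labels: List[str] = []
    if m2 is not None and f_total > m2:
        labels.append("suspected_duplicate")    # M2 < F < 1
    if m3 is not None and f_total > m3:
        labels.append("suspected_association")  # M3 < F < 1
    # An empty list means neither threshold was exceeded; the text does not
    # say what happens then, so the caller decides.
    return labels
```

Because M₂ and M₃ are set independently, a record can be flagged both as a suspected duplicate and as a suspected association; the manual verification in S106 then settles the outcome.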

Further, the similarity f_i is determined as follows:

where SIM(S, T) = |S ∩ T| / |S ∪ T|,

Further, the word-segmentation library comprises three parts:

The classification of medical data is illustrated below, taking tumor medical records as an example.

The core structured data items in the medical record are listed in Table 1.

Table 1

The unstructured data items are: T1 chief complaint, T2 previous treatment regimens, and T3 current treatment regimen.

Based on the inventors' practical experience, the above structured core data are further classified:

A. Details of the exclusion array:

Disease name (first-level classification), disease name (second-level classification), abbreviation of the patient's name, sex, place of birth, date of birth, place of long-term residence, occupation, nationality, hospital name, department, ID number (medical record/case number), admission number, admission time, discharge time, degree of differentiation, and pathology name.

B. Details of the fuzzy array:

Marital status, disease stage, TNM stage, whether metastasis is present, metastasis site, time of first admission, symptoms at first admission, interval between onset of the first symptoms and seeking treatment, symptoms at this admission, time of onset of the current symptoms, smoking, years of smoking, drinking, years of drinking, obstetric history, and family history of hereditary tumors.

It should be noted that the above classification for tumor medical records is given only to make the definitions of the core data items and their classification more intuitive; it does not limit the core data items of the present invention or the method of classifying them. For different types of medical data, different core data items can be set, and different exclusion and fuzzy arrays can be configured for the same core data items. The authority to make these settings, however, is restricted to a small number of designated people and is not open to ordinary medical data entry staff.

It should be noted that, in the method and system of the present disclosure, medical data may be in one of the following states:

1- pending, i.e., the initial state of data being processed by the method/system;

2- suspected duplicate, i.e., the method/system has reached this conclusion automatically according to the algorithm, and the data awaits manual processing for a final result;

3- automatically normal, i.e., the method/system has concluded automatically according to the algorithm that the data is not a duplicate, and the data is entered into the database;

4- automatically deleted, i.e., the method/system has concluded automatically according to the algorithm that the data is a duplicate, and the data is deleted;

5- manually normal, i.e., after manual processing of suspected duplicate data, the final result is that the data is non-duplicate, with association tags possibly added;

6- manually deleted, i.e., after manual processing of a suspected duplicate record, the data is judged to be duplicate data and is deleted.
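The six states above can be modeled as a small state machine. The state names follow the list; the allowed transitions are an inference from the descriptions (automatic outcomes follow from "pending", manual outcomes from "suspected duplicate") and are not stated explicitly in the text. The identifiers are hypothetical.

```python
from enum import Enum


class RecordState(Enum):
    PENDING = 1
    SUSPECTED_DUPLICATE = 2
    AUTO_NORMAL = 3
    AUTO_DELETED = 4
    MANUAL_NORMAL = 5
    MANUAL_DELETED = 6


# Assumed transitions, inferred from the state descriptions.
ALLOWED_TRANSITIONS = {
    RecordState.PENDING: {
        RecordState.SUSPECTED_DUPLICATE,
        RecordState.AUTO_NORMAL,
        RecordState.AUTO_DELETED,
    },
    RecordState.SUSPECTED_DUPLICATE: {
        RecordState.MANUAL_NORMAL,
        RecordState.MANUAL_DELETED,
    },
    # The remaining four states are treated as final.
    RecordState.AUTO_NORMAL: set(),
    RecordState.AUTO_DELETED: set(),
    RecordState.MANUAL_NORMAL: set(),
    RecordState.MANUAL_DELETED: set(),
}


def can_transition(current: RecordState, target: RecordState) -> bool:
    """Check whether a record may move from one state to another."""
    return target in ALLOWED_TRANSITIONS[current]
```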

Experimental example

A group of people with a medical background (group A) manually selected 10,000 valid tumor medical records, and 200 of these valid records were randomly selected. Using the 200 selected records as templates, the members of group A manually edited some of the data other than the patients' basic information, producing 200 new medical records A. A group of people without a medical background (group B) used the same 200 records as templates and likewise manually edited some of the data other than the patients' basic information, producing 200 new medical records B.

The original 10,000 valid records, the 200 "duplicate" records A, and the 200 "duplicate" records B were mixed. Four further groups of people with a medical background (groups C, D, E, and F) and four further groups of people without a medical background (groups G, H, I, and J) each screened the "duplicate" records in these 10,400 records by machine traversal duplicate checking plus manual comparison, without knowing the number of duplicate records. The same 10,400 records were also screened with the duplicate-checking system provided by the present invention, with the manual screening part performed by another group of people with a medical background (group K) and another group of people without a medical background (group L), respectively.

Groups A and B have the same number of members, and groups C to L have the same number of members.

Table 2 shows the results and the time taken for the different screening modes, with time calculated on the basis of 8 working hours per member per day.

Table 2

As Table 2 shows, the medical data duplicate-checking system provided by the present invention effectively shortens data processing time while reducing the missed-detection rate and misjudgment rate of duplicate checking. Moreover, even when the manual screening part of the system is performed by people without a medical background, its missed-detection and misjudgment rates are significantly lower than those obtained by people with a medical background using machine traversal duplicate checking plus manual comparison.

From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware or by a combination of the two. On this understanding, the part of the technical solution of the present invention that in essence contributes to the prior art can be embodied in the form of a software product. The software module or computer software product can be stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention. The storage medium can be random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Claims

1. A method for duplicate checking and association of medical data, characterized by comprising the following steps:

(1) extracting the core data items from the medical data to be processed, the core data items being used to distinguish the data to which they belong from other data;

(2) classifying the core data items: the core data items are first divided into structured data items and unstructured data items; a group of the structured data items is then selected as the exclusion array, and the remaining structured data items form the fuzzy array;

(3a) when any data item in the exclusion array differs from the data item of the same kind already in the medical database, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database,

or (3b) when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array exceeds a set threshold M₁, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database,

in all other cases, the method proceeds to the next step;

(4) performing an in-depth screening of every core data item: a weight a_i is assigned to each data item and a similarity f_i is determined for each data item,

where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1,

when F = 1, the medical data is judged to be duplicate data and is deleted;

(5) setting a similarity threshold M₂ for suspected duplicate data and/or a threshold M₃ for suspected associated data: when M₂ < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M₃ < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;

(6) after the suspected duplicate and/or associated data have been manually verified and judged, the data judged to be non-duplicate are entered into the medical database, and the data judged to be associated are given one or more corresponding association tags.

2. The method according to claim 1, characterized in that the similarity f_i is determined as follows: for a structured data item, its similarity f_i^s takes the value 1 when the item is identical to the data item of the same kind already in the medical database, and 0 otherwise;

for an unstructured data item, its similarity f_i^n is the Jaccard similarity SIM(S, T) between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database, where SIM(S, T) = |S ∩ T| / |S ∪ T|, f_i^s ∈ f_i, and f_i^n ∈ f_i.

3. The method according to claim 2, characterized in that, before the Jaccard similarity between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database is directly calculated, the text sets T and S are preprocessed as follows:

(i) using a word-segmentation library, the text fields in T and S are split into words, and each word is treated as a minimum processing data item,

4. The method according to claim 3, characterized in that the word-segmentation library comprises three parts:

5. A system for duplicate checking and association of medical data, characterized in that the system comprises:

a core data item unit, for extracting the core data items from the medical data to be processed, the core data items being used to distinguish the data to which they belong from other data;

a classification unit, for classifying the core data items: the core data items are first divided into structured data items and unstructured data items; a group of the structured data items is then selected as the exclusion array, and the remaining structured data items form the fuzzy array;

a preliminary screening unit, for performing a preliminary screening of each data item in the exclusion array and the fuzzy array, respectively: when any data item in the exclusion array differs from the data item of the same kind already in the medical database, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database; or when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array exceeds a set threshold M₁, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database;

an in-depth screening unit, for performing an in-depth screening of every data item after processing by the preliminary screening unit, assigning a weight a_i to each data item and determining a similarity f_i for each data item,

where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1,

when F = 1, the medical data is judged to be duplicate data and is deleted;

a judging unit, for setting a similarity threshold M₂ for suspected duplicate data and/or a threshold M₃ for suspected associated data: when M₂ < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M₃ < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;

a manual verification unit, for manually verifying the suspected duplicate and/or associated data; after a judgment has been given manually, the data judged to be non-duplicate are entered into the medical database, and the data judged to be associated are given one or more corresponding association tags.

6. The system according to claim 5, characterized in that the in-depth screening unit comprises:

a structured data item module, for assigning the structured data item similarity f_i^s: when the structured data item is identical to the data item of the same kind in the medical database, its similarity f_i^s takes the value 1, and otherwise 0,

an unstructured data item module, for assigning the unstructured data item similarity f_i^n, which is the Jaccard similarity SIM(S, T) between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database, SIM(S, T) = |S ∩ T| / |S ∪ T|;

a total similarity F calculation and judgment subunit, for calculating the total similarity F between the medical data and the existing medical data in the medical database:

where f_i^s ∈ f_i, f_i^n ∈ f_i, 0 ≤ f_i ≤ 1, and 0 < F ≤ 1,

when F = 1, the medical data is judged to be duplicate data and is deleted.

7. The system according to claim 6, characterized in that the in-depth screening unit also comprises an unstructured data item preprocessing module, for preprocessing the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database,

the preprocessing being:

8. The system according to claim 6, characterized in that the in-depth screening unit also comprises a word-segmentation library module, for splitting the text fields in T and S using the text data items present in the library,

comprising three submodules: