Method and system for medical data duplicate checking and association
Technical field
This invention relates generally to data processing techniques, and more specifically to a method and system for duplicate checking and association of medical data.
Background art
In the practice of medical data collection, the same data may be collected and entered into a database more than once, and data that has been slightly altered by professionals or laypersons may be collected and entered as if it were different data. To ensure that the data in a medical database is genuine and valid, a scheme is needed that performs duplicate checking on data after it is submitted and before it passes formal review and is stored, so that duplicate data is kept out of the database. Because medical data contains a large amount of unstructured data, such as symptom descriptions in medical records and treatment plans for diseases, in-depth duplicate checking of medical data currently relies largely on manual work by people with relevant medical experience. This is inefficient, consumes considerable manpower and material resources, and is costly.
In addition, medical research differs from other natural sciences: experiments on the human body are strictly controlled, so theoretical research cannot be verified in real time. Medical research therefore depends heavily on the collection and analysis of historical medical data such as patient medical records. An effective medical data processing method is therefore needed that makes it possible to automatically mine related cases from a medical database for further medical research and analysis.
Chinese patent CN101609466B provides a "method and system for duplicate checking of mass data", comprising: extracting data keywords from the mass data, the data keywords being used to distinguish the data they belong to from other data; splitting the data keywords according to their first N+M letters and placing data keywords whose first N+M letters are identical into the same file to obtain keyword data files, wherein the first N letters of the data keywords are identical, the first N+M letters are not all identical, and N and M are non-negative integers; and performing duplicate checking on the data in each keyword data file to obtain a duplicate checking result. That invention is better suited to structured data and cannot effectively check for duplicates in medical data that contains a large amount of unstructured data. Moreover, it does not address similarity or association between data.
Chinese patent CN101751423A provides "a method and system for article duplicate checking", comprising: after manuscript information in a production database is changed as a result of operations on manuscripts in a page layout, an event trigger obtains the modified manuscript information, the manuscript information including manuscript content; and a duplicate-checking server compares, for duplicate manuscript content, the obtained manuscript information that has not yet been compared, and determines the duplicate-manuscript information. Because the duplicate-checking server performs this comparison on the manuscript information obtained by the event trigger that has not yet been compared, the duplicate manuscripts are ultimately identified. The technical effect actually achieved by that patent is automatic duplicate checking of manuscripts, a kind of unstructured information, before publication, which reduces the number of duplicate manuscripts published. Although the patent mentions in its embodiments the use of a Chinese word segmentation library to compare manuscript content and produce a similarity between manuscripts for duplicate checking, it does not disclose how that similarity is calculated, nor how the similarity could be used to associate manuscripts with one another.
Summary of the invention
In view of the above problems, the present invention provides a method and system for medical data duplicate checking and association, which solve the prior-art problems that medical data containing a large amount of unstructured data cannot be checked for duplicates effectively and that associations between medical data are not established.
To achieve the above object, the present invention adopts the following technical solution.
A method for medical data duplicate checking and association, characterized by comprising the following steps:
(1) extracting the core data items from the medical data to be processed, the core data items being used to distinguish the data they belong to from other data;
(2) classifying the core data items: first dividing them into structured data items and unstructured data items, then selecting one group of the structured data items as the exclusion array, the remaining structured data items forming the fuzzy array;
(3) performing preliminary screening on each data item in the exclusion array and the fuzzy array respectively:
(3a) when any data item in the exclusion array differs from the existing data item of the same kind in the medical database, judging the medical data to be non-duplicate or unassociated and entering it into the medical database,
or (3b) when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array exceeds a set threshold M_1, judging the medical data to be non-duplicate or unassociated and entering it into the medical database,
and in all other cases proceeding to the next step;
(4) performing in-depth screening on every data item in the core data items: assigning a weight a_i to each data item, determining the similarity f_i of each data item,
and calculating the total similarity F between the medical data and the existing medical data in the medical database according to the following formula:
[formula not reproduced in the source text]
where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1;
when F = 1, the medical data is judged to be duplicate data and is deleted;
(5) setting a similarity threshold M_2 for suspected duplicate data and/or a threshold M_3 for suspected associated data: when M_2 < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M_3 < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;
(6) after the suspected duplicate and/or associated data has been manually verified and a judgment given, entering the data judged to be non-duplicate into the medical database, and giving one or more corresponding association tags to the data judged to have an association.
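The preliminary screening of step (3) can be illustrated with a short Python sketch. It compares the incoming record with a single existing record; the function name, the dictionary representation of a record, and the field names are assumptions made for illustration and do not come from the source.

```python
from typing import Mapping, Sequence


def needs_in_depth_screening(
    new_record: Mapping[str, object],
    existing_record: Mapping[str, object],
    exclusion_fields: Sequence[str],
    fuzzy_fields: Sequence[str],
    m1: float,
) -> bool:
    """Return True when the pair survives preliminary screening (step 3)."""
    # (3a) One differing exclusion-array item is enough to rule the pair out.
    for field in exclusion_fields:
        if new_record.get(field) != existing_record.get(field):
            return False

    # (3b) Rule the pair out when the share of differing fuzzy-array
    # items exceeds the threshold M_1.
    if fuzzy_fields:
        differing = sum(
            1 for field in fuzzy_fields
            if new_record.get(field) != existing_record.get(field)
        )
        if differing / len(fuzzy_fields) > m1:
            return False

    # Otherwise the pair goes on to in-depth screening (step 4).
    return True


# Hypothetical records: identical exclusion items, 1 of 2 fuzzy items differs.
new = {"sex": "F", "date_of_birth": "1960-01-02",
       "marital_status": "married", "smoking": "no"}
old = {"sex": "F", "date_of_birth": "1960-01-02",
       "marital_status": "single", "smoking": "no"}
print(needs_in_depth_screening(
    new, old, ["sex", "date_of_birth"], ["marital_status", "smoking"], m1=0.5))
# True: the differing share is 0.5, which does not exceed M_1 = 0.5
```

In this reading, a record would be entered directly only if every candidate record in the database is ruled out; how candidates are retrieved is not specified in the source.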
Further, the similarity f_i is determined as follows:
for a structured data item, f_i takes the value 1 when the item is identical to the existing data item of the same kind in the medical database, and 0 otherwise;
for an unstructured data item, f_i is the Jaccard similarity SIM(S, T) between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database,
where SIM(S, T) = |S ∩ T| / |S ∪ T|.
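As a minimal sketch of the Jaccard similarity defined above, the following Python function treats S and T as sets of tokens. The function name and the handling of two empty sets are assumptions; the source does not say what value to use when both fields are empty.

```python
from typing import Hashable, Iterable


def jaccard_similarity(s: Iterable[Hashable], t: Iterable[Hashable]) -> float:
    """SIM(S, T) = |S ∩ T| / |S ∪ T| for two collections of tokens."""
    set_s, set_t = set(s), set(t)
    union = set_s | set_t
    if not union:
        # Assumption: two empty text fields are treated as identical.
        return 1.0
    return len(set_s & set_t) / len(union)


# Two of the four distinct words are shared, so the similarity is 2 / 4.
print(jaccard_similarity({"cough", "fever", "fatigue"},
                         {"cough", "fever", "headache"}))  # 0.5
```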
Further, before the Jaccard similarity between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database is calculated, the text sets T and S are preprocessed as follows:
(i) using a word segmentation library, the text fields in T and S are broken down into words, each word serving as a minimum processing data item;
(ii) the corresponding segmented data items from T and S are compared one by one according to the K-shingle algorithm.
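One way to read preprocessing step (ii) is sketched below in Python: each segmented text is turned into the set of its k consecutive-word shingles, and the two shingle sets are then compared. The source gives no value for k and no details of the comparison, so the function names, the choice k = 2, and the use of word-level (rather than character-level) shingles are all assumptions.

```python
from typing import List, Sequence, Set, Tuple


def k_shingles(tokens: Sequence[str], k: int) -> Set[Tuple[str, ...]]:
    """Return the set of all runs of k consecutive tokens."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(tokens) < k:
        # A text shorter than k yields one shingle holding all of it.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def shingle_similarity(s_tokens: Sequence[str], t_tokens: Sequence[str], k: int) -> float:
    """Jaccard similarity of the k-shingle sets of two segmented texts."""
    s_set, t_set = k_shingles(s_tokens, k), k_shingles(t_tokens, k)
    union = s_set | t_set
    # Assumption: two empty texts count as identical.
    return len(s_set & t_set) / len(union) if union else 1.0


s: List[str] = ["cough", "for", "two", "weeks"]
t: List[str] = ["cough", "for", "three", "weeks"]
# Only the shingle ("cough", "for") is shared among 5 distinct shingles.
print(shingle_similarity(s, t, k=2))  # 0.2
```

Shingling makes the comparison sensitive to word order, which a plain bag-of-words Jaccard similarity is not.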
Further, the word segmentation library comprises three parts:
a drug name part, which includes drug trade names, generic names, and the English-letter abbreviations of common regimens;
a symptom description part, which includes the words commonly used to describe common medical symptoms;
a genetic testing part, which includes the abbreviations of genetic testing sites and descriptions of genetic testing results.
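The source does not say how the word segmentation library is applied to a text field. The Python sketch below uses forward maximum matching, a common dictionary-based segmentation technique, purely as a stand-in; the function names and the sample entries are illustrative assumptions and are not taken from the source.

```python
from typing import Iterable, List, Set


def build_lexicon(
    drug_names: Iterable[str],
    symptom_terms: Iterable[str],
    genetic_terms: Iterable[str],
) -> Set[str]:
    """Merge the three parts of the segmentation library into one lexicon."""
    return set(drug_names) | set(symptom_terms) | set(genetic_terms)


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Forward maximum matching: take the longest lexicon entry at each position."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Illustrative entries only; a real library would be far larger.
lexicon = build_lexicon(
    drug_names=["FOLFOX"],
    symptom_terms=["cough"],
    genetic_terms=["EGFR"],
)
print(segment("FOLFOXcoughEGFR", lexicon))  # ['FOLFOX', 'cough', 'EGFR']
```

Unbroken text is used in the example because dictionary-based segmentation matters most for languages written without spaces, such as Chinese.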
A further object of the present invention is to provide a system for medical data duplicate checking and association, characterized in that the system comprises:
a core data item unit, for extracting the core data items from the medical data to be processed, the core data items being used to distinguish the data they belong to from other data;
a classification unit, for classifying the core data items: first dividing them into structured data items and unstructured data items, then selecting one group of the structured data items as the exclusion array, the remaining structured data items forming the fuzzy array;
a preliminary screening unit, for performing preliminary screening on each data item in the exclusion array and the fuzzy array respectively: when any data item in the exclusion array differs from the existing data item of the same kind in the medical database, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database; or when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array exceeds a set threshold M_1, the medical data is judged to be non-duplicate or unassociated and is entered into the medical database;
an in-depth screening unit, for performing in-depth screening on every data item after processing by the preliminary screening unit: assigning a weight a_i to each data item, determining the similarity f_i of each data item,
and calculating the total similarity F between the medical data and the existing medical data in the medical database according to the following formula:
[formula not reproduced in the source text]
where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1;
when F = 1, the medical data is judged to be duplicate data and is deleted;
a judging unit, for setting a similarity threshold M_2 for suspected duplicate data and/or a threshold M_3 for suspected associated data: when M_2 < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M_3 < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;
a manual verification unit, for manual verification of suspected duplicate and/or associated data: after a judgment is given manually, the data judged to be non-duplicate is entered into the medical database, and the data judged to have an association is given one or more corresponding association tags.
Further, the in-depth screening unit comprises:
a weight assignment subunit, for assigning a weight a_i to each data item in the core data items;
a similarity f_i determination subunit, for determining the similarity f_i of each data item, comprising:
a structured data item module, for assigning the similarity f_i of a structured data item: f_i takes the value 1 when the structured data item is identical to the existing data item of the same kind in the medical database, and 0 otherwise,
an unstructured data item module, for assigning the similarity f_i of an unstructured data item: f_i is the Jaccard similarity SIM(S, T) between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database, where SIM(S, T) = |S ∩ T| / |S ∪ T|;
a total similarity F calculation and judgment subunit, for calculating the total similarity F between the medical data and the existing medical data in the medical database:
[formula not reproduced in the source text]
where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1;
when F = 1, the medical data is judged to be duplicate data and is deleted.
Further, the in-depth screening unit also comprises an unstructured data item preprocessing module, for preprocessing the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database.
The preprocessing is:
(i) using a word segmentation library, the text fields in T and S are broken down into words, each word serving as a minimum processing data item;
(ii) the corresponding segmented data items from T and S are compared one by one according to the K-shingle algorithm.
Further, the in-depth screening unit also comprises a word segmentation library module, for decomposing the text fields in T and S by means of the text data items present in the library,
comprising three submodules:
a drug name submodule, which includes drug trade names, generic names, and the English-letter abbreviations of common regimens;
a symptom description submodule, which includes the words commonly used to describe common medical symptoms;
a genetic testing submodule, which includes the abbreviations of genetic testing sites and descriptions of genetic testing results.
Unless otherwise stated, structured data in the present invention refers to row data that is stored in a database and can be expressed logically with a two-dimensional table structure.
Unless otherwise stated, unstructured data in the present invention refers to data that is inconvenient to represent with two-dimensional logical database tables, including office documents of all formats, text, pictures, XML and HTML (subsets of the Standard Generalized Markup Language), reports of all kinds, images, and audio/video information.
Compared with the prior art, the method and system for medical data duplicate checking and association of the present invention have the following advantages: a new data classification method and similarity algorithm effectively calculate the similarity between incoming medical data and the data in the existing database, solving the problems that medical data containing a large amount of unstructured data cannot be checked for duplicates effectively and that associations between medical data are not established; the missed-detection rate and misjudgment rate are low and duplicate checking is efficient; and the demands on the medical expertise of the people performing manual verification are modest, so the operating cost of duplicate checking and association is significantly reduced.
Embodiments
Embodiments of the invention are described in detail below. The embodiments are exemplary, serve only to explain the invention, and are not to be regarded as limiting it. To avoid unnecessarily obscuring the embodiments, techniques well known in the art, that is, techniques that would be obvious to a person skilled in the art, are not described in detail here.
A method for medical data duplicate checking and association, characterized by comprising the following steps:
S101: extracting the core data items from the medical data to be processed, the core data items being used to distinguish the data they belong to from other data;
S102: classifying the core data items: first dividing them into structured data items and unstructured data items, then selecting one group of the structured data items as the exclusion array, the remaining structured data items forming the fuzzy array;
S103: performing preliminary screening on each data item in the exclusion array and the fuzzy array respectively:
S103a: when any data item in the exclusion array differs from the existing data item of the same kind in the medical database, judging the medical data to be non-duplicate or unassociated and entering it into the medical database,
or S103b: when the ratio of the number of differing data items in the fuzzy array to the total number of items in the fuzzy array exceeds a set threshold M_1, judging the medical data to be non-duplicate or unassociated and entering it into the medical database,
and in all other cases proceeding to the next step;
S104: performing in-depth screening on every data item in the core data items: assigning a weight a_i to each data item, determining the similarity f_i of each data item,
and calculating the total similarity F between the medical data and the existing medical data in the medical database according to the following formula:
[formula not reproduced in the source text]
where 0 ≤ f_i ≤ 1 and 0 < F ≤ 1;
when F = 1, the medical data is judged to be duplicate data and is deleted;
S105: setting a similarity threshold M_2 for suspected duplicate data and/or a threshold M_3 for suspected associated data: when M_2 < F < 1, the medical data is judged to be suspected duplicate data and is submitted for manual verification; when M_3 < F < 1, the medical data is judged to be suspected associated data and is submitted for manual verification;
S106: after the suspected duplicate and/or associated data has been manually verified and a judgment given, entering the data judged to be non-duplicate into the medical database, and giving one or more corresponding association tags to the data judged to have an association.
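Steps S104 and S105 can be sketched in Python as follows. The formula for F is not reproduced in this text, so the sketch assumes a weight-normalized average, F = Σ a_i f_i / Σ a_i, which keeps F within [0, 1] when every f_i does; this is an illustrative assumption, not the formula of the invention. The sketch also assumes M_3 ≤ M_2 and that data below both thresholds is treated as non-duplicate. All names are hypothetical.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    DUPLICATE = "duplicate"                           # F = 1: deleted automatically
    SUSPECTED_DUPLICATE = "suspected duplicate"       # M_2 < F < 1: manual verification
    SUSPECTED_ASSOCIATION = "suspected association"   # M_3 < F <= M_2: manual verification
    NOT_DUPLICATE = "not duplicate"                   # assumed outcome below both thresholds


def total_similarity(weights: Sequence[float], similarities: Sequence[float]) -> float:
    """ASSUMED form of F: weighted average of the per-item similarities f_i."""
    if len(weights) != len(similarities):
        raise ValueError("one weight is needed per similarity")
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(a * f for a, f in zip(weights, similarities)) / weight_sum


def classify(f_total: float, m2: float, m3: float) -> Verdict:
    """Map the total similarity F to a verdict, assuming m3 <= m2 < 1."""
    if f_total >= 1.0:
        return Verdict.DUPLICATE
    if f_total > m2:
        return Verdict.SUSPECTED_DUPLICATE
    if f_total > m3:
        return Verdict.SUSPECTED_ASSOCIATION
    return Verdict.NOT_DUPLICATE


# Three items with weights 3, 1, 1 and similarities 1.0, 0.5, 0.0.
f_total = total_similarity([3, 1, 1], [1.0, 0.5, 0.0])
print(round(f_total, 3), classify(f_total, m2=0.8, m3=0.5))
```

Because the source allows M_2 and M_3 to be set independently ("and/or"), an implementation that reports both flags at once would be an equally valid reading.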
Further, the similarity f_i is determined as follows:
for a structured data item, f_i takes the value 1 when the item is identical to the existing data item of the same kind in the medical database, and 0 otherwise;
for an unstructured data item, f_i is the Jaccard similarity SIM(S, T) between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database,
where SIM(S, T) = |S ∩ T| / |S ∪ T|.
Further, before the Jaccard similarity between the text set T of the unstructured data item and the text set S of the existing data item of the same kind in the medical database is calculated, the text sets T and S are preprocessed as follows:
(i) using a word segmentation library, the text fields in T and S are broken down into words, each word serving as a minimum processing data item;
(ii) the corresponding segmented data items from T and S are compared one by one according to the K-shingle algorithm.
Further, the word segmentation library comprises three parts:
a drug name part, which includes drug trade names, generic names, and the English-letter abbreviations of common regimens;
a symptom description part, which includes the words commonly used to describe common medical symptoms;
a genetic testing part, which includes the abbreviations of genetic testing sites and descriptions of genetic testing results.
The classification of medical data is illustrated below using tumor medical records as an example.
The core structured data items of the medical record are listed in Table 1.
Table 1
The unstructured data items are: T1 chief complaint, T2 previous treatment plans, and T3 current treatment plan.
Based on the inventors' practical experience, the structured core data items above are further classified as follows:
A. Items of the exclusion array:
disease name first-level classification, disease name second-level classification, patient name abbreviation, sex, place of birth, date of birth, place of long-term residence, occupation, nationality, hospital name, department, ID number (medical record/case number), admission number, admission time, discharge time, degree of differentiation, and pathology name.
B. Items of the fuzzy array:
marital status, tumor stage, TNM stage, whether metastasis has occurred, metastasis site, time of first admission, symptoms at first admission, interval between first symptom onset and seeking medical care, symptoms at the current admission, onset time of the current symptoms, smoking, years of smoking, drinking, years of drinking, obstetric history, and family history of hereditary tumors.
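The classification above can be captured as configuration data. The Python sketch below is one possible representation; the class name `ScreeningProfile` and the snake_case field identifiers are hypothetical renderings of the items listed above, not names used by the source.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ScreeningProfile:
    """Per-record-type assignment of core data items to the two arrays."""
    record_type: str
    exclusion_fields: Tuple[str, ...]
    fuzzy_fields: Tuple[str, ...]
    unstructured_fields: Tuple[str, ...]

    def __post_init__(self) -> None:
        # A structured item belongs to exactly one of the two arrays.
        overlap = set(self.exclusion_fields) & set(self.fuzzy_fields)
        if overlap:
            raise ValueError(f"fields in both arrays: {sorted(overlap)}")


TUMOR_PROFILE = ScreeningProfile(
    record_type="tumor medical record",
    exclusion_fields=(
        "disease_class_level_1", "disease_class_level_2", "patient_name_abbreviation",
        "sex", "place_of_birth", "date_of_birth", "long_term_residence",
        "occupation", "nationality", "hospital_name", "department", "id_number",
        "admission_number", "admission_time", "discharge_time",
        "differentiation_degree", "pathology_name",
    ),
    fuzzy_fields=(
        "marital_status", "tumor_stage", "tnm_stage", "metastasis",
        "metastasis_site", "first_admission_time", "first_admission_symptoms",
        "first_symptom_to_visit_interval", "current_admission_symptoms",
        "current_symptom_onset_time", "smoking", "smoking_years", "drinking",
        "drinking_years", "obstetric_history", "family_hereditary_tumor_history",
    ),
    unstructured_fields=(
        "chief_complaint", "previous_treatment_plans", "current_treatment_plan",
    ),
)

print(len(TUMOR_PROFILE.exclusion_fields), len(TUMOR_PROFILE.fuzzy_fields))  # 17 16
```

Making the profile immutable (`frozen=True`) is a design choice of this sketch that fits a setting changed only by a few authorized people.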
It should be noted that the above classification for tumor medical records is given only to make the definitions of core data items and their classification more intuitive; it does not limit the core data items of the present invention or the method of classifying them. Different core data items may be set for different types of medical data, and different exclusion arrays and fuzzy arrays may be set for the same core data items. Permission to change these settings, however, is limited to a small number of designated people and is not open to ordinary medical data entry staff.
It should be noted that, in the method and system disclosed herein, medical data may be in one of the following states:
1 - pending: the initial state of data being processed by the method/system;
2 - suspected duplicate: the method/system has reached this conclusion automatically by algorithm, and the data awaits manual processing for a final result;
3 - automatically normal: the method/system has concluded automatically by algorithm that the data is not a duplicate and has entered it into the database;
4 - automatically deleted: the method/system has concluded automatically by algorithm that the data is a duplicate and has deleted it;
5 - manually normal: after manual processing of suspected duplicate data, the data is judged to be non-duplicate and is given any applicable association tags as the final result;
6 - manually deleted: after manual processing of a suspected duplicate record, the data is judged to be duplicate data and is deleted.
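The six states can be modeled as a small state machine. In the Python sketch below the numeric codes follow the list above, while the set of allowed transitions is inferred from the state descriptions and is an assumption; the names are hypothetical.

```python
from enum import IntEnum


class RecordState(IntEnum):
    PENDING = 1
    SUSPECTED_DUPLICATE = 2
    AUTO_NORMAL = 3
    AUTO_DELETED = 4
    MANUAL_NORMAL = 5
    MANUAL_DELETED = 6


# Inferred transitions: the algorithm moves pending data to states 2-4,
# and manual verification moves suspected duplicates to state 5 or 6.
ALLOWED_TRANSITIONS = {
    RecordState.PENDING: {
        RecordState.SUSPECTED_DUPLICATE,
        RecordState.AUTO_NORMAL,
        RecordState.AUTO_DELETED,
    },
    RecordState.SUSPECTED_DUPLICATE: {
        RecordState.MANUAL_NORMAL,
        RecordState.MANUAL_DELETED,
    },
}


def advance(current: RecordState, target: RecordState) -> RecordState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


state = advance(RecordState.PENDING, RecordState.SUSPECTED_DUPLICATE)
state = advance(state, RecordState.MANUAL_NORMAL)
print(state.name, int(state))  # MANUAL_NORMAL 5
```

States 3 to 6 have no outgoing transitions in this sketch, reflecting that each is described as a final outcome.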
Experimental example
A group of people with medical backgrounds (group A) manually screened out 10,000 valid tumor medical records, and 200 of these valid records were randomly selected. Using the 200 selected records as templates, the members of group A manually edited some of the data other than the patients' basic information to obtain 200 new medical records A. A group of people without medical backgrounds (group B) used the same 200 records as templates and likewise manually edited some of the data other than the patients' basic information to obtain 200 new medical records B.
The original 10,000 valid records, the 200 "duplicate" records A, and the 200 "duplicate" records B were then mixed. Four further groups of people with medical backgrounds (groups C, D, E and F) and four further groups of people without medical backgrounds (groups G, H, I and J) each screened these 10,400 records for "duplicate" records by machine traversal duplicate checking plus manual comparison, without knowing the number of duplicate records. The same 10,400 records were also screened with the duplicate checking system provided by the invention, with the manual verification part carried out by another group of people with medical backgrounds (group K) and another group of people without medical backgrounds (group L).
Groups A and B have the same number of members, and groups C to L have the same number of members.
Table 2 shows the results of the different screening methods and the time they took, where the time is calculated on the basis of 8 working hours per group member per day.
Table 2
As Table 2 shows, the medical data duplicate checking system provided by the invention reduces the missed-detection rate and misjudgment rate of duplicate checking while effectively shortening the data processing time. Moreover, even when the manual verification part of the system is performed by people without medical backgrounds, its missed-detection and misjudgment rates are significantly lower than those obtained by people with medical backgrounds using machine traversal duplicate checking plus manual comparison.
From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware platform, and of course also by hardware or by a combination of the two. On this understanding, the part of the technical solution that in essence contributes over the prior art can be embodied in the form of a software product. The software module or computer software product can be stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention. The storage medium may be random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.