CN109634949A

CN109634949A - A hybrid data cleaning method based on multi-version data

Info

Publication number: CN109634949A
Application number: CN201811628044.3A
Authority: CN
Inventors: 高云君; 陈刚; 陈纯; 葛丛丛
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2019-04-16
Anticipated expiration: 2038-12-28
Also published as: CN109634949B

Abstract

The invention discloses a hybrid data cleaning method based on multi-version data. Using the Markov logic network probabilistic graphical model and the minimal-repair principle, the invention integrates qualitative and quantitative techniques into an efficient data cleaning method that detects and corrects erroneous structured data. The cleaning result repairs dirty data that violates the constraint rules, keeps the cost of changes to the data set minimal, and conforms to the statistical properties of the data. The invention first divides the entire data set into blocks and groups using the Markov logic index technique and then performs two-stage data cleaning. The first stage introduces the confidence score as an evaluation criterion and cleans the data in each group to obtain multi-version cleaning results. The second stage introduces the fusion score as an evaluation criterion and merges the multi-version results produced by the first stage to generate the final unified cleaning result.

Description

A hybrid data cleaning method based on multi-version data

Technical field

The present invention relates to techniques for cleaning erroneous data in the field of computer databases, and in particular to a hybrid data cleaning method based on multi-version data.

Background technique

The purpose of data cleaning is to find the content in a data set that is most likely to be erroneous and to provide a reliable method for correcting it. Dirty data is data in the data set that contains errors.

Nowadays, with the continuous emergence of new ways of publishing information, represented by social networks and e-commerce, and the rise of computing technologies such as cloud computing and the Internet of Things, data is growing and accumulating at an unprecedented rate. In data analysis, dirty data not only leads to wrong decisions and unreliable analyses but can also cause economic losses to companies. Data cleaning has therefore attracted great interest in both industry and academia. Data cleaning is the process of detecting and repairing erroneous data; its purpose is to delete redundant information, correct existing errors, and keep the data consistent.

Scholars at home and abroad have already done some work on data cleaning methods. Current mainstream methods fall roughly into two classes, qualitative and quantitative: (1) Qualitative methods mainly clean erroneous data that violates integrity constraint rules. Their evaluation criterion is the minimum cost principle, which requires that cleaning change the data set as little as possible. Their disadvantage is that they cannot clean erroneous data whose repair does not satisfy the minimum cost principle, even though that data still violates an integrity constraint. (2) Quantitative methods build a suitable model based on the probability distribution of the data to determine the cleaning strategy. Their disadvantage is a strong dependence on the training set: enough clean, known data must be provided as a training set to build a reliable model, which is unsuitable for today's big data environments. At present, the vast majority of quantitative methods clean data less well than qualitative methods, and existing methods have long running times.

Summary of the invention

In view of the above deficiencies, the present invention provides a hybrid data cleaning method based on multi-version data. By combining qualitative and quantitative methods, the method both ensures that data violating the ICs is cleaned and makes the cleaning result conform to statistical properties. The method is based on the Markov logic network. It first divides the entire data set into blocks and groups using the Markov logic index technique and then performs two-stage data cleaning. In the first stage, data cleaning is performed on each block individually to obtain multi-version cleaning results. In the second stage, conflicts among the multi-version results are eliminated to obtain the final globally unified cleaning result. The Markov logic index technique narrows the detection range of dirty data, so data cleaning can be carried out efficiently.

To achieve the above object, the technical solution adopted by the present invention is as follows: a hybrid data cleaning method based on multi-version data, the steps of which are as follows:

(1) obtain a data set containing dirty data and the relevant integrity constraint rules (ICs);

(2) convert the different types of integrity constraint rules into normalized Markov logic network rules, and instantiate the converted rules with the constants contained in each tuple of the dirty data set; each instantiated rule is called a data slice;

(3) build a Markov logic index structure over the dirty data set: first divide the dirty data set into different data blocks according to the rules, with each rule corresponding to one data block and the data slice being the smallest unit in each data block; then further divide each data block into different data groups;

(4) on the basis of step (3), perform the first-stage cleaning: introduce the confidence score as an evaluation criterion, and clean each data group independently to obtain multiple data versions of preliminary cleaning results;

(5) perform the second-stage cleaning: introduce the fusion score as an evaluation criterion, merge the multiple data versions of preliminary cleaning results generated in the first stage, and eliminate the conflicts between versions to generate the final unified cleaning result;

(6) mark the duplicate entries in the dirty data set, and delete the duplicate data that still remains after the two-stage cleaning;

(7) output the cleaned data set.

Further, step (2) is specifically:

(2.1) normalize the different types of input integrity constraints into Markov logic network rules using the conjunctive normal form conversion rules;

(2.2) replace all variables in the normalized rules with the corresponding constants from the data set.
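The instantiation in step (2.2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a rule is represented only by its reason and result attribute lists, the helper name `instantiate` is hypothetical, and the sample tuples are invented for the example. The rule used is an FD of the form CT ⇒ ST (a city determines its state).

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
DataSlice = Tuple[Tuple[str, ...], Tuple[str, ...]]


def instantiate(reason_attrs: List[str], result_attrs: List[str],
                rows: List[Row]) -> List[DataSlice]:
    """Ground the rule 'reason_attrs => result_attrs' with each tuple's constants.

    Every grounded rule (a "data slice") is a pair
    (values of the reason attributes, values of the result attributes).
    """
    slices = []
    for row in rows:
        reason = tuple(row[a] for a in reason_attrs)
        result = tuple(row[a] for a in result_attrs)
        slices.append((reason, result))
    return slices


# Hypothetical tuples for illustration only.
rows = [
    {"HN": "ALPHA", "CT": "DOTHAN", "ST": "AL", "PN": "111"},
    {"HN": "BETA", "CT": "DOTHAN", "ST": "AK", "PN": "222"},
]
slices = instantiate(["CT"], ["ST"], rows)
print(slices)
```

Each tuple yields one data slice per rule, so a data set with n tuples and k rules produces n × k slices before grouping.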

Further, step (3) is specifically:

(3.1) divide the entire dirty data set into multiple data blocks according to the integrity constraint rules that apply to the dirty data set; each rule corresponds to one data block, and each data block contains several data slices;

(3.2) within each data block, place the entries whose attributes contain the same keyword in the same group; here the keyword is the reason part of the rule, so data slices with the same reason part are placed in one group.
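The two-level structure of steps (3.1) and (3.2) can be pictured as a nested dictionary: one block per rule, and inside each block one group per distinct reason key. The sketch below assumes rules are given as (reason attributes, result attributes) pairs; the function name `build_index`, the rule names, and the sample tuples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, str]
Rule = Tuple[List[str], List[str]]  # (reason attributes, result attributes)


def build_index(rules: Dict[str, Rule], rows: List[Row]):
    """Return {rule name: {reason key: [(tuple id, result values), ...]}}.

    Outer level: one block per rule.
    Inner level: one group per distinct reason key within that block.
    """
    index = {}
    for name, (reason_attrs, result_attrs) in rules.items():
        block = defaultdict(list)
        for tid, row in enumerate(rows):
            key = tuple(row[a] for a in reason_attrs)
            block[key].append((tid, tuple(row[a] for a in result_attrs)))
        index[name] = dict(block)
    return index


# Hypothetical tuples and rules for illustration only.
rows = [
    {"HN": "ALPHA", "CT": "DOTHAN", "ST": "AL", "PN": "111"},
    {"HN": "BETA", "CT": "DOTHAN", "ST": "AK", "PN": "222"},
    {"HN": "GAMMA", "CT": "BOAZ", "ST": "AL", "PN": "333"},
]
rules = {"r1": (["CT"], ["ST"]), "r3": (["HN", "CT"], ["PN"])}
index = build_index(rules, rows)
print(index["r1"])
```

Grouping by reason key means that candidate errors only need to be compared within a group, which is how the index narrows the search for dirty data.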

Further, step (4) is specifically:

(4.1) handle abnormal data: when an error in the data appears in the reason part and causes the corresponding data slice to be assigned to an incorrect group, the phenomenon is called an "abnormality"; such misplaced data slices are reassigned to the corresponding groups;

(4.2) compute the confidence score (reliability score) of the abnormal data in each group using a similarity distance metric and the Markov logic network weight learning method;

(4.3) clean each data group independently: the cleaning unit is each group in a data block; select the data slice γ with the largest confidence score as the replacement benchmark and use it to replace the other suspect data in the same data group, until every data group in the data block has been cleaned, which completes the independent cleaning of that data block;

The same cleaning is performed on the other data blocks. The multiple preliminary cleaning results produced by this stage are regarded as multiple data versions, with each data block being one data version.
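The per-group replacement in step (4.3) can be sketched as below. The confidence scores are passed in as a plain dictionary because this text does not reproduce how they are computed; the names `clean_group` and `clean_block` and all sample values are hypothetical.

```python
from typing import Dict, List, Tuple

Slice = Tuple[str, ...]


def clean_group(group: List[Slice], score: Dict[Slice, float]) -> List[Slice]:
    """Replace every slice in a group with the highest-scoring slice of that group."""
    best = max(group, key=lambda s: score[s])
    return [best for _ in group]


def clean_block(block: Dict[str, List[Slice]],
                score: Dict[Slice, float]) -> Dict[str, List[Slice]]:
    """Clean each group of a block independently; the result is one data version."""
    return {key: clean_group(group, score) for key, group in block.items()}


# Hypothetical block with one group keyed by city, and assumed scores.
block = {"DOTHAN": [("DOTHAN", "AL"), ("DOTHAN", "AL"), ("DOTHAN", "AK")]}
score = {("DOTHAN", "AL"): 0.8, ("DOTHAN", "AK"): 0.2}
cleaned = clean_block(block, score)
print(cleaned)
```

Because each block is cleaned without looking at the others, two blocks can assign different values to the same cell of a tuple, which is the conflict the second stage resolves.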

Further, step (5) is specifically:

(5.1) first, take each of the different data versions at a conflicting position as a benchmark; then, starting from each benchmark, find in every data block other than the one containing the benchmark the data slice that does not conflict with the benchmark and has the largest Markov weight, and merge it with the benchmark;

(5.2) repeat the above merge operation until all data blocks have been traversed; then compute the fusion score of the merged result under that benchmark, f-score(t) = w₁ × … × w_m, where w_i denotes the Markov weight of the data slice merged from the i-th data block;

(5.3) select another benchmark as the starting point, perform the merge operation again, and compute and record its fusion score, until the fusion scores of the merged results under all the different benchmarks have been obtained; then select the merged result with the largest fusion score as the final globally unified cleaning result for the tuple.
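The scoring and selection in steps (5.2) and (5.3) reduce to a product and an argmax. The sketch below assumes the Markov weights of the merged slices are already known for each benchmark; a benchmark whose merge could not be completed is given a score of 0, as in the worked example of the description. The function names and the weights are hypothetical.

```python
from math import prod
from typing import Dict, List, Optional


def f_score(weights: List[float]) -> float:
    """Fusion score: product of the Markov weights of the merged slices."""
    return prod(weights)


def pick_best(candidates: Dict[str, Optional[List[float]]]) -> str:
    """Return the benchmark whose merged result has the largest fusion score.

    `candidates` maps a benchmark name to the weights of its merged slices,
    or to None when the merge could not be completed (score 0).
    """
    scores = {name: (f_score(w) if w is not None else 0.0)
              for name, w in candidates.items()}
    return max(scores, key=lambda name: scores[name])


# Hypothetical weights for two benchmarks of one conflicting tuple.
candidates = {"alpha1": None, "alpha2": [0.5, 0.4, 0.3]}
best = pick_best(candidates)
print(best)
```

Using a product means a single low-weight slice pulls the whole candidate down, so candidates built from uniformly well-supported slices are preferred.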

Further, step (6) is specifically: after the two-stage cleaning is complete, scan the entire data set and build a hash table over its tuples; when a duplicate is encountered during the scan, remove it.
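A hash-based scan of this kind can be sketched with a Python set, which is itself a hash table. The representation of tuples as dictionaries and the function name `drop_duplicates` are assumptions made for the example.

```python
from typing import Dict, List

Row = Dict[str, str]


def drop_duplicates(rows: List[Row]) -> List[Row]:
    """Keep the first occurrence of each tuple; drop later exact duplicates."""
    seen = set()
    unique = []
    for row in rows:
        # Sort the items so the key does not depend on attribute order.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical cleaned tuples; the first two are identical.
rows = [
    {"HN": "ELIZA", "CT": "BOAZ", "ST": "AL", "PN": "2567688400"},
    {"HN": "ELIZA", "CT": "BOAZ", "ST": "AL", "PN": "2567688400"},
    {"HN": "ALPHA", "CT": "DOTHAN", "ST": "AL", "PN": "111"},
]
deduped = drop_duplicates(rows)
print(len(deduped))
```

The scan is a single pass with expected constant-time lookups, so its cost grows linearly with the number of tuples.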

The beneficial effects of the invention are as follows. The invention is a hybrid data cleaning method based on qualitative and quantitative techniques. It combines multiple types of integrity constraints through Markov logic network rules and uses both the Markov logic network weight learning method and a similarity distance metric as the basis for data cleaning, so that the cleaning result satisfies the minimum cost principle that qualitative techniques must follow and also conforms to the statistical properties targeted by quantitative techniques. In addition, the optimization designed in the invention, namely the Markov logic index, narrows the detection range of dirty data and shortens the running time of data cleaning. The invention was tested on real and synthetic data sets, and the results show higher cleaning efficiency and cleaning accuracy than currently popular systems.

Brief description of the drawings

Fig. 1 is a flow chart of the implementation steps of the invention;

Fig. 2(a) is the Markov logic network index structure formed for the hospital data set according to rule (r₁), an FD;

Fig. 2(b) is the Markov logic network index structure formed for the hospital data set according to rule (r₂), a DC;

Fig. 2(c) is the Markov logic network index structure formed for the hospital data set according to rule (r₃) CFD: HN["ELIZA"], CT["BOAZ"] => PN["2567688400"];

Fig. 3(a) is a schematic diagram of the Markov logic network index structure corresponding to rule r₁ after the first-stage cleaning;

Fig. 3(b) is a schematic diagram of the Markov logic network index structure corresponding to rule r₂ after the first-stage cleaning;

Fig. 3(c) is a schematic diagram of the Markov logic network index structure corresponding to rule r₃ after the first-stage cleaning;

Fig. 4 is a schematic diagram of the second-stage cleaning process.

Specific embodiment

The technical solution of the present invention is further described below with reference to the drawings and a specific implementation:

As shown in Fig. 1, the specific implementation process and working principle of the invention are as follows:

Step (1): the integrity constraints (ICs) and the data set containing dirty data are input into the framework. The dirty data set and the integrity constraints are illustrated below using Table 1:

Table 1 shows the records of a hospital information data set with four attributes: hospital name (HN), city (CT), state (ST), and contact number (PN). Cells marked with gray shading in Table 1 are erroneous data. Three integrity constraints are given:

Here D denotes the data set and t₁, t₂ denote two different tuples. The functional dependency (FD) rule r₁ states that a city can belong to only one state. The denial constraint (DC) rule r₂ states that hospitals in different states have different telephone numbers. The conditional functional dependency (CFD) rule r₃ states that the hospital's name together with its city and state determines the hospital's telephone number.

Table 1:

Step (2): convert the different types of integrity constraint rules into normalized Markov logic network rules, and instantiate the converted rules with the constants contained in each tuple of the dirty data set; each instantiated rule is called a data slice.

The specific steps are:

1) normalize the different types of input integrity constraints into Markov logic network rules using the conjunctive normal form conversion rules;

2) replace the variables in the normalized rules with the constants from the data set.

Step (3): build a Markov logic index structure over the dirty data set: first divide the dirty data set into different data blocks according to the rules, with each rule corresponding to one data block and the data slice being the smallest unit in each data block; then further divide each data block into different groups. The specific steps are:

1) divide the entire dirty data set into multiple data blocks according to the integrity constraint rules that apply to the dirty data set; each rule corresponds to one data block, and each data block contains several data slices γ;

2) within each data block, place the entries whose attributes contain the same keyword in the same group, where the keyword is the reason part of the rule; data slices γ with the same reason part are placed in one group.

The construction of the Markov logic network index is illustrated below using Fig. 2(a), Fig. 2(b), and Fig. 2(c) as examples:

Taking the data set of Table 1 as a sample, the given constraint rules involve HN, CT, ST, and PN. According to the three rules, the data set is divided into three blocks B₁, B₂, and B₃, taking care to distinguish the reason attributes from the result attributes in each constraint rule. Next, each of the three blocks is grouped: entries in a block whose reason attribute keywords are identical are placed in one group. For example, the reason keywords of the three entries in G₁₃ of B₁ are all identical, so they form one group. The Markov logic network index structure of B₁ is shown in Fig. 2(a), that of B₂ in Fig. 2(b), and that of B₃ in Fig. 2(c);

Step (4): on the basis of step (3), perform the first-stage cleaning: introduce the confidence score as an evaluation criterion and clean each data group independently to obtain multiple data versions (each data version comes from a different block), specifically as follows:

1) Handle abnormal data. When an error in the data appears in the reason part and causes the corresponding data slice to be assigned to an incorrect group, the phenomenon is called an "abnormality"; such misplaced data slices are reassigned to the corresponding groups;

2) Compute the confidence score (reliability score, r-score) of the abnormal data in each group using a similarity distance metric and the Markov logic network weight learning method. In the r-score formula, d(γ_i, γ*) denotes the distance between data slice γ_i and its candidate replacement γ*, and w(γ_i) is the Markov weight of data slice γ_i.

3) Clean each data block independently. Specifically, the cleaning unit is each group in a data block: the data slice γ with the largest confidence score is selected as the replacement benchmark and used to replace the other suspect data in the same group. Once every group in the data block has been cleaned, the independent cleaning of that data block is complete. The same cleaning is then performed on the other data blocks. The multiple preliminary cleaning results produced by this stage are regarded as multiple data versions, with each data block being one data version. The Markov logic index structures after this stage of cleaning are shown in Fig. 3(a), Fig. 3(b), and Fig. 3(c).
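The reassignment of a misplaced data slice in sub-step 1) can be sketched as a nearest-key lookup. This is only an illustration under loud assumptions: the document does not say how an abnormal group is detected or which similarity measure is used, so the suspect key is supplied by hand and Python's `difflib.SequenceMatcher` ratio stands in for the similarity distance metric. The group contents are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def nearest_group(key: str, group_keys: List[str]) -> str:
    """Return the group key most similar to `key`.

    SequenceMatcher.ratio() is a stand-in similarity, not the patent's metric.
    """
    return max(group_keys, key=lambda k: SequenceMatcher(None, key, k).ratio())


# Hypothetical groups keyed by city; "DOTHNA" is assumed to be a misspelled reason key.
groups = {"DOTHAN": ["s1", "s2", "s3"], "BOAZ": ["s4", "s5"], "DOTHNA": ["s6"]}
suspect = "DOTHNA"
target = nearest_group(suspect, [k for k in groups if k != suspect])

# Move the misplaced slices into the group they most plausibly belong to.
groups[target].extend(groups.pop(suspect))
print(target, groups[target])
```

After reassignment, the moved slice competes with the other slices of its new group when the confidence scores are compared.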

Step (5): because the first-stage cleaning produces multi-version data results, conflicts may arise between data versions; that is, different versions may give different cleaning results for the same position in the data set. Therefore, the fusion score is introduced as an evaluation criterion to eliminate the multi-version data conflicts and obtain the final globally unified cleaning result.

Take tuple t₃ in Table 1 as an example. After the first-stage cleaning, the data slice related to t₃ in B₁ is {CT: DOTHAN, ST: AL} (the first data version), whereas the data slice related to t₃ in B₃ is {HN: ELIZA, CT: BOAZ, PN: 2567688400} (the third data version). Clearly, after the first-stage cleaning, t₃[CT] has two different values ("DOTHAN" and "BOAZ") that come from two different data versions. In other words, t₃ has a conflict on attribute CT, and this conflict must be resolved to obtain a final consistent cleaning result.
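The conflict on t₃ can be detected mechanically by collecting, per attribute, the values proposed by each data version. The sketch below uses the two data slices quoted above; the function name `find_conflicts` and the dictionary layout are assumptions made for the example.

```python
from typing import Dict, Set


def find_conflicts(versions: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Return {attribute: set of values} for attributes on which versions disagree."""
    values: Dict[str, Set[str]] = {}
    for piece in versions.values():
        for attr, val in piece.items():
            values.setdefault(attr, set()).add(val)
    return {attr: vals for attr, vals in values.items() if len(vals) > 1}


# Data slices related to t3 after the first-stage cleaning, as given in the text.
t3_versions = {
    "B1": {"CT": "DOTHAN", "ST": "AL"},
    "B3": {"HN": "ELIZA", "CT": "BOAZ", "PN": "2567688400"},
}
conflicts = find_conflicts(t3_versions)
print(conflicts)
```

Attributes that appear in only one version, such as ST or PN here, are never reported, since there is nothing for them to disagree with.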

The specific steps are as follows:

1) Detect all tuples that contain conflicts, and record the data slice in which each conflict occurs. As shown in Fig. 4, t₃ corresponds to two conflicting data slices, α₁ ∈ B₁ and α₂ ∈ B₂, and these two are used as the benchmarks for generating different candidate schemes.

2) For each benchmark, merge the corresponding data slices across the different data blocks. Two cases must be considered. If there is no conflict between the data slice to be merged and the benchmark, merge them directly. If there is a conflict, find another data slice in the block of the data slice to be merged (one that does not conflict with the benchmark and has the largest Markov weight) and then perform the merge. The newly merged data slice is then used as the benchmark and the above steps are repeated until all data blocks have been merged. Note that if no data slice meeting the requirements can be found during merging, the merge is considered impossible under that benchmark.

3) After step 2), multiple possible candidate schemes have been generated for each tuple that contains a conflict. A fusion score (f-score) is introduced to score each candidate scheme, and the scheme with the highest score is selected as the final result. The fusion score formula is f-score(t) = w₁ × … × w_m. As shown in Fig. 4, for the merge scheme with α₁ ∈ B₁ as the benchmark, no data slice meeting the requirements can be found when merging the corresponding data slice in B₃, so the merge is considered impossible under this benchmark and f-score(t₃) = 0 is recorded. With α₂ ∈ B₂ as the benchmark, the merged result is t₃ = {HN: ELIZA, CT: BOAZ, ST: AL, PN: 2567688400}, with f-score(t₃) = 0.0678. The second merge scheme is therefore taken as the final cleaning result for t₃.
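The merge procedure of sub-steps 2) and 3) can be sketched as a loop over the other blocks that keeps only non-conflicting slices and takes the one with the largest weight. This is a simplified illustration, not the patented implementation: the function names are hypothetical, the benchmark's own weight is assumed to enter the product, and all weights are invented, so the scores do not reproduce the 0.0678 of the worked example.

```python
from typing import Dict, List, Optional, Tuple

Piece = Dict[str, str]


def compatible(a: Piece, b: Piece) -> bool:
    """Two slices are compatible if they agree on every attribute they share."""
    return all(a[k] == b[k] for k in a.keys() & b.keys())


def merge_from(benchmark: Piece, weight: float,
               other_blocks: List[List[Tuple[Piece, float]]]
               ) -> Tuple[Optional[Piece], float]:
    """Merge one slice from each other block into the benchmark.

    Returns (merged slice, fusion score), or (None, 0.0) when some block
    offers no slice compatible with the current merged result.
    """
    merged, score = dict(benchmark), weight
    for block in other_blocks:
        options = [(p, w) for p, w in block if compatible(merged, p)]
        if not options:
            return None, 0.0
        piece, w = max(options, key=lambda pw: pw[1])
        merged.update(piece)  # the merged slice becomes the new benchmark
        score *= w
    return merged, score


# First benchmark: no compatible slice exists in the other block, so the merge fails.
alpha1 = {"CT": "DOTHAN", "ST": "AL"}
block_a = [({"HN": "ELIZA", "CT": "BOAZ", "PN": "2567688400"}, 0.5)]
failed, failed_score = merge_from(alpha1, 0.6, [block_a])

# Second benchmark: a compatible slice exists and is merged (weights hypothetical).
alpha2 = {"HN": "ELIZA", "CT": "BOAZ", "PN": "2567688400"}
block_b = [({"CT": "DOTHAN", "ST": "AL"}, 0.6), ({"CT": "BOAZ", "ST": "AL"}, 0.3)]
merged, score = merge_from(alpha2, 0.5, [block_b])
print(failed, failed_score, merged, score)
```

With these inputs the second benchmark yields the same merged tuple as the text, {HN: ELIZA, CT: BOAZ, ST: AL, PN: 2567688400}, and wins because the first benchmark scores 0.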

Step (6): after the two-stage cleaning is complete, the entire data set is scanned and a hash table is built over its tuples; when a duplicate is encountered during the scan, it is removed.

Step (7): output the processed data set.

Claims

1. A hybrid data cleaning method based on multi-version data, characterized in that the steps of the method are as follows:

(2) convert the different types of integrity constraint rules into normalized Markov logic network rules, and instantiate the converted rules with the constants contained in each tuple of the dirty data set; each instantiated rule is called a data slice;

(3) build a Markov logic index structure over the dirty data set: first divide the dirty data set into different data blocks according to the rules, with each rule corresponding to one data block and the data slice being the smallest unit in each data block; then further divide each data block into different data groups;

(4) on the basis of step (3), perform the first-stage cleaning: introduce the confidence score as an evaluation criterion, and clean each data group independently to obtain multiple data versions of preliminary cleaning results;

(5) perform the second-stage cleaning: introduce the fusion score as an evaluation criterion, merge the multiple data versions of preliminary cleaning results generated in the first stage, and eliminate the conflicts between versions to generate the final unified cleaning result;

(6) mark the duplicate entries in the dirty data set, and delete the duplicate data that still remains after the two-stage cleaning;

(7) output the cleaned data set.

2. The hybrid data cleaning method based on multi-version data according to claim 1, characterized in that step (2) is specifically:

(2.1) normalize the different types of input integrity constraints into Markov logic network rules using the conjunctive normal form conversion rules;

3. The hybrid data cleaning method based on multi-version data according to claim 1, characterized in that step (3) is specifically:

(3.1) divide the entire dirty data set into multiple data blocks according to the integrity constraint rules that apply to the dirty data set; each rule corresponds to one data block, and each data block contains several data slices;

(3.2) within each data block, place the entries whose attributes contain the same keyword in the same group; here the keyword is the reason part of the rule, so data slices with the same reason part are placed in one group.

4. The hybrid data cleaning method based on multi-version data according to claim 1, characterized in that step (4) is specifically:

(4.1) handle abnormal data: when an error in the data appears in the reason part and causes the corresponding data slice to be assigned to an incorrect group, the phenomenon is called an "abnormality"; such misplaced data slices are reassigned to the corresponding groups;

(4.2) compute the confidence score (reliability score) of the abnormal data in each group using a similarity distance metric and the Markov logic network weight learning method;

(4.3) clean each data group independently: the cleaning unit is each group in a data block; select the data slice γ with the largest confidence score as the replacement benchmark and use it to replace the other suspect data in the same data group, until every data group in the data block has been cleaned, which completes the independent cleaning of that data block;

The same cleaning is performed on the other data blocks. The multiple preliminary cleaning results produced by this stage are regarded as multiple data versions, with each data block being one data version.

5. The hybrid data cleaning method based on multi-version data according to claim 1, characterized in that step (5) is specifically:

(5.1) first, take each of the different data versions at a conflicting position as a benchmark; then, starting from each benchmark, find in every data block other than the one containing the benchmark the data slice that does not conflict with the benchmark and has the largest Markov weight, and merge it with the benchmark;

(5.2) repeat the above merge operation until all data blocks have been traversed; then compute the fusion score of the merged result under that benchmark, f-score(t) = w₁ × … × w_m, where w_i denotes the Markov weight of the data slice merged from the i-th data block;

(5.3) select another benchmark as the starting point, perform the merge operation again, and compute and record its fusion score, until the fusion scores of the merged results under all the different benchmarks have been obtained; then select the merged result with the largest fusion score as the final globally unified cleaning result for the tuple.

6. The hybrid data cleaning method based on multi-version data according to claim 1, characterized in that step (6) is specifically: after the two-stage cleaning is complete, scan the entire data set and build a hash table over its tuples; when a duplicate is encountered during the scan, remove it.