CN109491991A

CN109491991A - An unsupervised automatic data cleaning method

Info

Publication number: CN109491991A
Application number: CN201811325335.5A
Authority: CN
Inventors: 李玲; 唐军; 吴纯彬; 于跃; 陈秋宇
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2018-11-08
Filing date: 2018-11-08
Publication date: 2019-03-19
Anticipated expiration: 2038-11-08
Also published as: CN109491991B

Abstract

The invention discloses an unsupervised automatic data cleaning method comprising the following steps: A. data model learning, in which dependencies between attributes are learned from raw data that may contain erroneous data, and implicit, non-absolute or relatively weak dependencies are found to obtain a data model represented as a Bayesian network; B. data cleaning rule generation, which is carried out after the complete data model of the raw data or of a sample of the raw data has been obtained, and which specifically generates predicates and first-order predicate rules; C. generation of a Markov logic network from the predicates and first-order predicate rules generated in step B; D. generation of inference rules and inference based on the Markov logic network generated in step C, followed by cleaning of the data based on the inference results. The method can effectively improve the data quality of a company's business systems without consuming large amounts of manpower and material resources, which helps management make correct decisions.

Description

An unsupervised automatic data cleaning method

Technical field

The present invention relates to the technical field of data governance, and in particular to an unsupervised automatic data cleaning method.

Background art

Real-world data usually needs to be cleaned (data requiring cleaning is referred to below as dirty data); for example, it may contain inconsistent, noisy, incomplete, or duplicate values. In business, erroneous data can cause very large economic losses. For example, incorrect customer information may cause a company to deliver purchased goods to the wrong address, which not only increases the company's delivery costs but also damages its image for a considerable time.

Among existing data cleaning methods, some require heavy manual involvement during cleaning, such as providing cleaning suggestions or confirming repairs; others need no manual involvement during cleaning but require the relevant cleaning rules to be defined in advance. Where the data rules are unknown or the labor cost is unaffordable, existing data cleaning methods are not applicable. In view of this, the present patent addresses data cleaning that requires neither predefined cleaning rules nor manual involvement, so as to improve data quality.

Summary of the invention

The purpose of the present invention is to overcome the above deficiencies in the background art by providing an unsupervised automatic data cleaning method that learns rules from the data based on statistical relational learning and cleans the data based on probabilistic inference. The method can effectively improve the efficiency and effect of data cleaning, improve the data quality of a company's business systems without consuming large amounts of manpower and material resources, increase user satisfaction, and help management make correct decisions based on the improved data quality.

To achieve the above technical effect, the present invention adopts the following technical solution:

An unsupervised automatic data cleaning method, comprising the following steps:

A. Data model learning: learn the dependencies between attributes from raw data that may contain erroneous data, and find implicit, non-absolute or relatively weak dependencies to obtain a data model represented as a Bayesian network;

B. Data cleaning rule generation: after the complete data model of the raw data or of a sample of the raw data has been obtained, generate the data cleaning rules, specifically predicates and first-order predicate rules, that is, first-order predicate logic expressions;

C. Generate a Markov logic network from the predicates and first-order predicate rules generated in step B;

D. Generate inference rules and perform inference based on the Markov logic network generated in step C, and clean the data based on the inference results.

Further, step A specifically comprises:

A1. Assess and sample the data to be repaired, that is, the raw data that may contain erroneous data;

A2. Learn from the raw data set or the sampled data set to obtain the structure of the data model, represented as a Bayesian network; the structure of the Bayesian network reflects the dependencies between data attributes and their strength;

A3. Learn from the raw data set or the sampled data set to obtain the parameters of the data model, whose concrete form is the conditional probability tables of the dependencies;

A4. Merge the structure and the parameters of the data model to obtain the complete data model.

Further, step B specifically comprises:

B1. Define relation constants that express relationships between entities;

B2. Generate the corresponding first-order predicate logic expressions from the complete data model obtained in step A4. Specifically, generate predicates and first-order predicate rules from the learned Bayesian network, formulating separate conversion rules, for the case where a single attribute points to one attribute and the case where multiple attributes point to one attribute, that convert dependencies into first-order predicate logic expressions.

Further, in step B2:

When a single attribute points to one attribute, that is, there is a directed edge between attributes A₁ and A₂ pointing from A₁ to A₂, the dependency between A₁ and A₂ is formalized as the following first-order predicate logic:

where v is the value of tuples id₁ and id₂ on attribute A₁;

When multiple attributes point to one attribute, that is, attributes A₁, A₂, …, A_i all point to A_j, their dependency is formalized as the following first-order predicate logic:

where v₁, v₂, …, v_i are the values of tuples id₁ and id₂ on attributes A₁, A₂, …, A_i.

Further, step C specifically comprises:

C1. Classify the generated first-order predicate rules as absolute rules or non-absolute rules according to whether each rule is a logically valid formula, that is, whether it holds with probability 1 under any interpretation;

C2. Calculate weights for the first-order predicate logic rules, using a different weight calculation strategy for absolute and non-absolute rules: the weight of an absolute rule is set to positive infinity, and the weight of a non-absolute rule is calculated using mutual information;

C3. For the first-order predicate rules generated in step B2, calculate each rule's weight from the mutual information between the attributes the rule involves;

C4. From the weights calculated in step C3, obtain the Markov logic network of the raw data set or the sampled data set.

Further, step C3 specifically comprises:

C3.1 Formulate separate rule weight calculation methods for the case where a first-order predicate logic rule involves two attributes and the case where it involves multiple attributes; wherein,

where a first-order predicate logic rule involves two attributes, the rule weight is calculated from the mutual information of the two attributes on the raw data set or the sampled data set;

The mutual information is a real number between 0 and 1: if the attributes are completely correlated, the mutual information is 1, and if they are completely uncorrelated, it is 0. When a rule involves two attributes, the mutual information is a statistical average over the joint probability distribution of the two attribute variables, and it is used as the weight of the first-order predicate logic rule involving those two attributes; the higher the weight, the stronger the correlation and the explanatory power. Because the attributes involved in first-order predicate logic rules are discrete, the mutual information is defined as:

I(X;Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}

where p(x, y) is the joint probability distribution function and p(x) and p(y) are the marginal probability distribution functions.

C3.2 When calculating the rule weight, an exponential function is introduced to ensure that every weight is a number not less than 0. Because the exponential function is a non-negative real function, it acts as a potential function characterizing the attributes involved, is equivalent to a weighted feature quantity of those attributes' features, and plays a normalizing role. The formula is as follows:

Further, step D specifically comprises:

D1. Perform inference based on the Markov logic network generated in step C4, using the Gibbs sampling method from Markov chain Monte Carlo for rule inference; generate the Gibbs sampling inference rules from the Markov logic network and determine their weights;

D2. Construct the Gibbs sampling inference model, using a factor graph as the model, and determine the variables and factors of the factor graph, where the factors evaluate the relationships between variables;

D3. Construct the possible worlds of the variables from the predicates generated in step B2;

D4. Perform inference over the possible worlds of step D3 using the inference model constructed in step D2;

D5. Clean and repair the raw data set based on the inference results of step D4.

Further, when repairing in step D5, the value with the maximum expectation is selected as the repaired value.

Compared with the prior art, the present invention has the following beneficial effects:

The unsupervised automatic data cleaning method of the present invention is based on statistical relational learning and requires no manual intervention during data cleaning, so it can greatly reduce the labor cost of data cleaning. Because rules are discovered automatically from raw data that contains dirty data, no data quality rules need to be formulated in advance. The method can effectively improve the effect of data cleaning and the accuracy of the data, while also improving the efficiency of data cleaning.

Brief description of the drawings

Fig. 1 is a framework diagram of the unsupervised automatic data cleaning method of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to an embodiment.

Embodiment:

As shown in Fig. 1, an unsupervised automatic data cleaning method can clean data when data quality patterns/rules are missing and without manual intervention, while ensuring the effect and efficiency of data cleaning.

The method specifically comprises the following steps:

S10. Data model learning:

To discover implicit patterns/rules, the dependencies between attributes must be learned from raw data that may contain erroneous data. Because erroneous data may be present, absolute or strong dependencies between the attributes of a data table do not necessarily exist; therefore implicit, non-absolute or relatively weak dependencies are found and represented as a Bayesian network to obtain the data model.

The main sub-steps are as follows:

S101. Assess and sample the data to be repaired;

S102. Learn from the raw data set or the sampled data set to obtain the structure of the data model, whose concrete form is a Bayesian network;

S103. Learn from the raw data set or the sampled data set to obtain the parameters of the data model, whose concrete form is the conditional probability tables of the dependencies;

S104. Merge the structure from step S102 and the parameters from step S103 to obtain the complete data model.
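As a minimal sketch of steps S102 to S104, the Python below assumes the network structure is already known (a single hypothetical edge from `zip` to `city`) and estimates the conditional probability table by counting. The column names, the toy rows, and the function name `learn_cpt` are illustrative assumptions and do not come from the patent; structure learning itself is not shown.

```python
from collections import Counter, defaultdict


def learn_cpt(rows, parents, child):
    """Estimate P(child | parents) by relative frequency over a list of dict rows."""
    counts = defaultdict(Counter)
    for row in rows:
        key = tuple(row[p] for p in parents)  # one CPT row per parent configuration
        counts[key][row[child]] += 1
    cpt = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        cpt[key] = {value: n / total for value, n in counter.items()}
    return cpt


# Toy data containing one dirty value ("Chengdo").
rows = [
    {"zip": "610000", "city": "Chengdu"},
    {"zip": "610000", "city": "Chengdu"},
    {"zip": "610000", "city": "Chengdo"},
    {"zip": "621000", "city": "Mianyang"},
]

# Assumed structure (S102): child attribute -> list of parent attributes.
structure = {"city": ["zip"]}

# Parameters (S103) merged with the structure (S104) into one model object.
model = {child: learn_cpt(rows, parents, child) for child, parents in structure.items()}
```

Because the dirty value only lowers the probability of the dominant value instead of removing the dependency, the learned table still captures the weak, non-absolute relationship described above.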

S20. Data cleaning rule generation:

Once the complete data model of the raw data or of a sample of the raw data has been obtained, the data cleaning rules are generated.

Data cleaning rule generation has the following main steps:

S201. Define relation constants. A relation constant captures a relationship among multiple elements and is mainly used to express relationships between entities; relation constants such as "equivalent" and "matching" are defined in this step.

S202. Generate the corresponding first-order predicate logic expressions from the data model.

A Bayesian network reflects the dependencies between attributes in a relational table: if node N₁ points to N₂, then N₂ depends on N₁ to some degree. On this basis, first-order predicate logic is constructed from the learned Bayesian network.

Suppose there is a directed edge between attributes A₁ and A₂ pointing from A₁ to A₂; the dependency between A₁ and A₂ can then be formalized as the following first-order predicate logic expression:

where v is the value of tuples id₁ and id₂ on attribute A₁.

If multiple attributes point to one attribute, for example attributes A₁, A₂, …, A_i all point to A_j, the dependency between them can likewise be formalized as the following first-order predicate logic:

where v₁, v₂, …, v_i are the values of tuples id₁ and id₂ on attributes A₁, A₂, …, A_i.

In this step, separate conversion rules that convert dependencies into first-order predicate logic expressions are formulated for the case where a single attribute points to one attribute and the case where multiple attributes point to one attribute, and predicates and first-order predicate rules are generated automatically from the complete data model obtained in S104.
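The conversion in S202 can be illustrated with a short Python sketch. The formulas themselves are not reproduced in this text, so the sketch assumes one plausible rule shape: if two tuples agree on every parent attribute, they should agree on the child attribute. The predicate names, the `Equal_` relation constant, and the string syntax are all hypothetical.

```python
def edges_to_rules(structure):
    """Turn a Bayesian network structure (child -> list of parents) into rule strings.

    Assumed rule shape: tuples id1 and id2 that share the same value on every
    parent attribute are expected to be equal on the child attribute.
    """
    rules = []
    for child, parents in structure.items():
        if not parents:
            continue  # root nodes have no incoming edge, so no rule is generated
        body = [
            f"{parent}(id1, v{k}) ^ {parent}(id2, v{k})"
            for k, parent in enumerate(parents, start=1)
        ]
        rules.append(" ^ ".join(body) + f" => Equal_{child}(id1, id2)")
    return rules


# Hypothetical structure: one single-parent edge and one multi-parent case.
structure = {"city": ["zip"], "fee": ["city", "weight"], "zip": []}
rules = edges_to_rules(structure)
for rule in rules:
    print(rule)
```

The single-parent and multi-parent cases share one template here; the patent describes them as separate conversion rules, which this sketch does not attempt to reproduce exactly.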

S30. Generate a Markov logic network from the predicates and first-order predicate rules generated in step S202.

A Markov logic network defines a probability distribution over possible worlds; in the data cleaning scenario, a possible world is a possible repair of the erroneous data. A Markov logic network consists of first-order predicate logic rules and their corresponding weights. A weight reflects the degree to which a first-order predicate logic rule is satisfied: the larger the weight, the more strongly the rule is satisfied.
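For reference, the standard Markov logic network formulation, which this text does not spell out, assigns a possible world x the following probability. The symbols w_i, n_i, and Z are the conventional ones and are not taken from the patent.

```latex
P(X = x) = \frac{1}{Z} \exp\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_{i} w_i \, n_i(x') \Big)
```

Here w_i is the weight of rule i and n_i(x) is the number of groundings of rule i that are true in world x, so a repair that satisfies more high-weight rules receives a higher probability.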

S301. Classify the generated first-order predicate rules as absolute rules or non-absolute rules according to whether each rule is a logically valid formula, that is, whether it holds with probability 1 under any interpretation.

S302. Calculate weights for the first-order predicate logic rules.

Different weight calculation strategies are formulated for absolute and non-absolute rules. The weight of an absolute rule is set to positive infinity. A non-absolute rule is only approximately satisfied, and its weight is calculated using mutual information. Each approximately satisfied first-order predicate logic rule reflects a dependency between attributes in the relational table, and the strength of that dependency is expressed by the mutual information between the attributes.

S303. For the first-order predicate rules generated in step S202, calculate each rule's weight from the mutual information between the attributes the rule involves.

Separate rule weight calculation methods are formulated for the case where a first-order logic rule involves two attributes and the case where it involves multiple attributes.

Where a first-order logic rule involves two attributes, the rule weight is calculated from the mutual information of the two attributes on the raw data set or on the sampled raw data set.

The mutual information is a real number between 0 and 1: if the attributes are completely correlated, the mutual information is 1, and if they are completely uncorrelated, it is 0. When a rule involves two attributes, the mutual information is a statistical average over the joint probability distribution of the two attribute variables, and it is used as the weight of the first-order predicate logic rule involving those two attributes; the higher the weight, the stronger the correlation and the explanatory power. Because the attributes involved in first-order predicate logic rules are discrete, the mutual information is defined as:

I(X;Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}

where p(x, y) is the joint probability distribution function and p(x) and p(y) are the marginal probability distribution functions.
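The discrete mutual information of two attribute columns can be estimated from value counts, as in the following Python sketch. The function name `mutual_information` and the base-2 logarithm are assumptions. With base 2 and two balanced binary attributes the result falls between 0 and 1; for attributes with more values the raw quantity can exceed 1, so the 0 to 1 range described above presumably relies on a normalization that this sketch does not apply.

```python
import math
from collections import Counter


def mutual_information(xs, ys, base=2.0):
    """Empirical mutual information I(X;Y) of two equally long value sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log(c * n / (px[x] * py[y]), base)
    return mi


# One attribute fully determines the other: 1 bit of shared information.
dependent = mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"])

# Every combination equally likely: the attributes share no information.
independent = mutual_information([0, 0, 1, 1], ["a", "b", "a", "b"])
```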

When calculating rule weights, an exponential function is introduced to ensure that every weight is a number ≥ 0, which also lets the resulting weights better reflect the dependencies between attributes. Because the exponential function is a non-negative real function, it acts as a potential function characterizing the attributes involved, is equivalent to a weighted feature quantity of those attributes' features, and plays a normalizing role. The formula is as follows:

At the same time, because the weight grows exponentially as the mutual information increases, high-weight rules play a greater role in the data cleaning process, which improves the cleaning effect.
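The two weighting strategies can be sketched in Python as follows. The weight formula is not reproduced in this text, so the form w = exp(I) used here is an assumption chosen only because it matches the stated properties (non-negative, growing exponentially with mutual information); the function name `rule_weight` is likewise hypothetical.

```python
import math


def rule_weight(is_absolute, mutual_info):
    """Return the weight of a first-order rule.

    Absolute rules get positive infinity. Non-absolute rules get exp(mutual_info),
    which is an assumed form, not the patent's actual formula.
    """
    if is_absolute:
        return math.inf
    return math.exp(mutual_info)


weak = rule_weight(False, 0.1)    # weakly dependent attributes
strong = rule_weight(False, 0.9)  # strongly dependent attributes
hard = rule_weight(True, 0.0)     # logically valid rule
```

Under this assumption a rule over nearly independent attributes still receives a weight close to 1 instead of 0, so how such rules are filtered out would need to be decided separately.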

S304. From the weights calculated in step S303, automatically obtain the Markov logic network of the raw data or of the sample of the raw data.

S40. Generate inference rules and perform inference based on the Markov logic network generated in step S304, and clean the data based on the inference results.

This specifically comprises the following steps:

S401. Perform inference based on the Markov logic network generated in step S304, using the Gibbs sampling method from Markov chain Monte Carlo for rule inference. The Gibbs sampling inference rules are generated from the Markov logic network, and their weights are determined.

S402. Construct the Gibbs sampling inference model.

A factor graph is used as the Gibbs sampling inference model. The variables and factors of the factor graph are determined; the factors evaluate the relationships between variables.

S403. Construct the possible worlds of the variables from the predicates generated in step S202; these possible worlds are the basis of inference.

S404. Perform inference over the possible worlds of step S403 using the inference model constructed in step S402.

S405. Clean and repair the raw data set based on the inference results of step S404. For each value to be repaired, the value with the maximum expectation is selected as the repaired value.
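Steps S401 to S405 can be illustrated with a small Gibbs sampler in Python. This is a sketch under stated assumptions, not the patent's implementation: it uses one hypothetical rule (tuples with the same `zip` should have the same `city`), two assumed factor types (a rule factor with weight `w_rule` and an observation factor with weight `w_obs` that favors the recorded value), and it approximates "maximum expectation" by the most frequently sampled value per cell.

```python
import math
import random
from collections import Counter


def gibbs_repair(zips, cities, w_rule=1.5, w_obs=1.0, sweeps=2000, burn_in=200, seed=0):
    """Repair the city column by Gibbs sampling over a tiny factor graph."""
    rng = random.Random(seed)
    domain = sorted(set(cities))  # candidate values; a possible world assigns one per cell
    state = list(cities)          # start from the observed (possibly dirty) world
    tallies = [Counter() for _ in cities]
    for sweep in range(sweeps):
        for i in range(len(state)):
            scores = []
            for value in domain:
                # Observation factor: reward keeping the recorded value.
                s = w_obs if value == cities[i] else 0.0
                # Rule factors: reward agreeing with every other tuple sharing the zip.
                for j in range(len(state)):
                    if j != i and zips[j] == zips[i] and state[j] == value:
                        s += w_rule
                scores.append(math.exp(s))
            # Resample cell i from its conditional distribution given all other cells.
            state[i] = rng.choices(domain, weights=scores, k=1)[0]
            if sweep >= burn_in:
                tallies[i][state[i]] += 1
    # Pick the most frequently sampled value for each cell as its repair.
    return [t.most_common(1)[0][0] for t in tallies]


zips = ["610000"] * 5 + ["621000"] * 2
cities = ["Chengdu", "Chengdu", "Chengdu", "Chengdu", "Chengdo", "Mianyang", "Mianyang"]
repaired = gibbs_repair(zips, cities)
```

In this toy run the misspelled entry is outvoted by the four tuples that share its zip code, while the clean cells keep their values because both factor types support them.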

In summary, the unsupervised automatic data cleaning method of the present invention is based on statistical relational learning and requires no manual intervention during data cleaning, so it can greatly reduce the labor cost of data cleaning. Because rules are discovered automatically from raw data that contains dirty data, no data quality rules need to be formulated in advance. The method can effectively improve the effect of data cleaning and the accuracy of the data, while also improving the efficiency of data cleaning.

It should be understood that the above embodiment is merely an exemplary implementation used to illustrate the principle of the present invention, and the present invention is not limited to it. A person of ordinary skill in the art can make various changes and modifications without departing from the spirit and essence of the present invention, and such changes and modifications are also regarded as falling within the protection scope of the present invention.

Claims

1. An unsupervised automatic data cleaning method, characterized by comprising the following steps:

A. Data model learning: learn the dependencies between attributes from raw data that may contain erroneous data, and find implicit, non-absolute or relatively weak dependencies to obtain a data model represented as a Bayesian network;

B. Data cleaning rule generation: after the complete data model of the raw data or of a sample of the raw data has been obtained, generate the data cleaning rules, specifically predicates and first-order predicate rules;

C. Generate a Markov logic network from the predicates and first-order predicate rules generated in step B;

D. Generate inference rules and perform inference based on the Markov logic network generated in step C, and clean the data based on the inference results.

2. The unsupervised automatic data cleaning method according to claim 1, characterized in that step A specifically comprises:

A2. Learn from the raw data set or the sampled data set to obtain the structure of the data model, represented as a Bayesian network;

A3. Learn from the raw data set or the sampled data set to obtain the parameters of the data model, whose concrete form is the conditional probability tables of the dependencies;

A4. Merge the structure and the parameters of the data model to obtain the complete data model.

3. The unsupervised automatic data cleaning method according to claim 2, characterized in that step B specifically comprises:

B2. Generate the corresponding first-order predicate logic expressions from the complete data model obtained in step A4. Specifically, generate predicates and first-order predicate rules, that is, first-order predicate logic expressions, from the learned Bayesian network, formulating separate conversion rules, for the case where a single attribute points to one attribute and the case where multiple attributes point to one attribute, that convert dependencies into first-order predicate logic expressions.

4. The unsupervised automatic data cleaning method according to claim 3, characterized in that, in step B2:

When a single attribute points to one attribute, that is, there is a directed edge between attributes A₁ and A₂ pointing from A₁ to A₂, the dependency between A₁ and A₂ is formalized as the following first-order predicate logic:

where v is the value of tuples id₁ and id₂ on attribute A₁;

When multiple attributes point to one attribute, that is, attributes A₁, A₂, …, A_i all point to A_j, their dependency is formalized as the following first-order predicate logic:

where v₁, v₂, …, v_i are the values of tuples id₁ and id₂ on attributes A₁, A₂, …, A_i.

5. The unsupervised automatic data cleaning method according to claim 3, characterized in that step C specifically comprises:

C1. Classify the generated first-order predicate rules as absolute rules or non-absolute rules;

C2. Calculate weights for the first-order predicate logic rules, using a different weight calculation strategy for absolute and non-absolute rules: the weight of an absolute rule is set to positive infinity, and the weight of a non-absolute rule is calculated using mutual information;

C3. For the first-order predicate rules generated in step B2, calculate each rule's weight from the mutual information between the attributes the rule involves;

C4. From the weights calculated in step C3, obtain the Markov logic network of the raw data set or the sampled data set.

6. The unsupervised automatic data cleaning method according to claim 5, characterized in that step C3 specifically comprises:

C3.1 Formulate separate rule weight calculation methods for the case where a first-order predicate logic rule involves two attributes and the case where it involves multiple attributes; wherein,

where a first-order predicate logic rule involves two attributes, the rule weight is calculated from the mutual information of the two attributes on the raw data set or the sampled data set;

The mutual information is a real number between 0 and 1: if the attributes are completely correlated, the mutual information is 1, and if they are completely uncorrelated, it is 0;

C3.2 When calculating the rule weight, an exponential function is introduced to ensure that every weight is a number not less than 0.

7. The unsupervised automatic data cleaning method according to claim 6, characterized in that step D specifically comprises:

D1. Perform inference based on the Markov logic network generated in step C4, using the Gibbs sampling method from Markov chain Monte Carlo for rule inference; generate the Gibbs sampling inference rules from the Markov logic network and determine their weights;

D2. Construct the Gibbs sampling inference model, using a factor graph as the model, and determine the variables and factors of the factor graph, where the factors evaluate the relationships between variables;

D3. Construct the possible worlds of the variables from the predicates generated in step B2;

D4. Perform inference over the possible worlds of step D3 using the inference model constructed in step D2;

D5. Clean and repair the raw data set based on the inference results of step D4.

8. The unsupervised automatic data cleaning method according to claim 7, characterized in that, when repairing in step D5, the value with the maximum expectation is selected as the repaired value.