CN114281809B

CN114281809B - Multi-source heterogeneous data cleaning method and device

Info

Publication number: CN114281809B
Application number: CN202111577423.6A
Authority: CN
Inventors: 刘峰; 张纪林; 陈军相; 袁俊峰; 刘涛; 金峻帆; 钱瑞祥
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2021-12-22
Filing date: 2021-12-22
Publication date: 2023-03-28
Anticipated expiration: 2041-12-22
Also published as: CN114281809A

Abstract

The invention discloses a multi-source heterogeneous data cleaning method and device, which address the invalid and low-quality data repair caused by an improper data cleaning sequence when multiple data quality dimensions are involved. Starting from multiple data quality dimensions in the smart campus context, the method guarantees the effectiveness of overall data cleaning by standardizing the order of data checking and repair. In the data repair process, currently known campus-internal knowledge is used as an external constraint condition to expand the repair rule set and improve the accuracy of data cleaning. In the smart campus construction process, the cleaned campus data can be effectively applied to data management, data opening, data mining and analysis, and other processes in colleges and universities. The consistency problems caused by data repair under multiple data quality dimensions are avoided, and data availability is greatly improved.

Description

Multi-source heterogeneous data cleaning method and device

Technical Field

The invention relates to the technical field of computers, in particular to a multi-source heterogeneous data cleaning method and device, and more particularly relates to a data inspection and repair method for data with integrity, consistency, uniqueness and other data quality problems in the field of data cleaning.

Background

With the rapid development of information technology, data is growing explosively in the background of the big data era. In the process of integrating multi-source heterogeneous data, any improper operation can cause a series of data quality problems. In the field of data mining, the data quality determines whether more valuable knowledge can be mined from massive and complex data, and therefore more reliable and accurate decision support is provided for users.

At present, the industry mainly divides the measurement standards of data quality into six dimensions: completeness, consistency, uniqueness, accuracy, validity, and timeliness. Most traditional research on data quality either addresses only a single dimension or ignores the correlations among dimensions, so the usability of the cleaned data is low. Real-world data tend to be multidimensional, and the data quality dimensions are not completely independent of one another. Therefore, traditional simple single-dimension data cleaning methods and devices are no longer suitable for solving multi-dimensional data quality problems in today's complex scenarios.

Disclosure of Invention

Aiming at the problems, the invention provides a multi-source heterogeneous data cleaning method and device, and aims to solve the problem of multi-dimensional data quality caused by omission, constraint violation, repeated operation and the like when an operator collects and inputs data in real life. By the method and the device, data cleaning of data in three quality dimensions of integrity, consistency and uniqueness can be completed, and the usability of the data is improved.

In order to achieve the purpose, the invention provides a multi-source heterogeneous data cleaning method, which comprises the following specific steps:

step 1: obtaining multi-source heterogeneous data, converting data with the same attribute into a uniform data format, and obtaining a data set consisting of a plurality of tuples, wherein each tuple consists of one group of data covering all attributes;

multi-source means that the data come from diverse sources, and heterogeneous means that the data differ in type, characteristics, and the like;

step 2: constructing the conditional function dependencies that exist among different attributes for the data processed in step 1, and then adding the conditional function dependencies Σ_cfd and the external constraints Σ_fc to the rule set Σ, wherein each rule in the rule set Σ corresponds to a certain conditional function dependency or a certain external constraint;

the external constraints refer to various artificially set constraints on the data, such as hard constraints, quantity constraints, and equivalence constraints;

step 3: carrying out integrity check and integrity repair on all tuples in the data processed in step 1;

3-1 integrity check

Sequentially traversing all tuples in the data processed in step 1 and judging whether the current tuple has missing data; if so, adding the current tuple to the missing tuple set T_L; if not, adding it to the complete tuple set T_C;
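The integrity check above can be sketched as a simple partition. This is a minimal illustration, not the patented implementation: the dictionary-based row representation, the function names, and the rule that `None` or a blank string counts as missing are all assumptions made for the example.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]  # hypothetical tuple representation: attribute name -> value


def is_missing(value: Any) -> bool:
    """Assumption: None and blank strings count as missing data."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def integrity_check(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Split rows into the missing tuple set T_L and the complete tuple set T_C."""
    t_l: List[Row] = []
    t_c: List[Row] = []
    for row in rows:
        if any(is_missing(v) for v in row.values()):
            t_l.append(row)
        else:
            t_c.append(row)
    return t_l, t_c


rows = [
    {"student_id": "S001", "name": "Ann", "age": 20},
    {"student_id": "S002", "name": "", "age": 21},
    {"student_id": "S003", "name": "Bob", "age": None},
]
t_l, t_c = integrity_check(rows)
print(len(t_l), len(t_c))
```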

3-2 integrity repair

Traversing the missing tuple set T_L in sequence and checking whether the missing items of the current missing tuple match some rules in the rule set Σ (that is, conditional function dependencies Σ_cfd and/or external constraints Σ_fc); if so, filling the missing data of the current missing tuple by using those rules; otherwise, filling the missing data of the current missing tuple by using a hybrid filling algorithm based on improved KNN;

the improved KNN-based hybrid filling algorithm comprises the following specific steps:

1) Dividing the non-missing data columns of the current missing tuple by data type into 5 types of missing subclass tuples: numerical (num), binary (dual), ordinal (ordi), categorical (category), and text (text);

2) Dividing the complete tuple set T_C, over the same data columns as each type of subclass tuple in the current missing tuple, into 5 types of complete subclass tuple sets;

3) Respectively calculating the subclass distance between each type of missing subclass tuple and the complete subclass tuple;

for the numerical subclass tuples, calculating subclass distances between missing subclass tuples and complete subclass tuples by using a standardized Euclidean distance formula (1);

where n represents the total number of numerical data items in the subclass tuple, x_Li represents the i-th data item of the missing subclass tuple, x_Ci represents the i-th data item of the complete subclass tuple, and s_i represents the standard deviation of all values of the i-th data column of the subclass tuples;
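Formula (1) itself is not reproduced in this text. As a hedged sketch, the usual standardized Euclidean distance over the symbols defined above (x_Li, x_Ci, s_i) can be computed as follows. The function names are hypothetical, and the use of the population standard deviation and the skipping of zero-variance columns are assumptions.

```python
import math
from statistics import pstdev
from typing import List, Sequence


def column_stds(rows: Sequence[Sequence[float]]) -> List[float]:
    """Standard deviation s_i of each numerical column (population form, an assumption)."""
    return [pstdev(col) for col in zip(*rows)]


def numeric_distance(x_l: Sequence[float], x_c: Sequence[float],
                     stds: Sequence[float]) -> float:
    """Standardized Euclidean distance between two numerical subclass tuples."""
    total = 0.0
    for a, b, s in zip(x_l, x_c, stds):
        if s == 0:
            continue  # assumption: a constant column contributes nothing
        total += ((a - b) / s) ** 2
    return math.sqrt(total)


data = [[1.0, 10.0], [3.0, 30.0]]
stds = column_stds(data)
d_num = numeric_distance(data[0], data[1], stds)
print(stds, d_num)
```

Dividing by s_i keeps a column with a large numeric range, such as a consumption amount, from dominating a column with a small range, such as age.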

for binary subclass tuples, calculating subclass distances between missing subclass tuples and complete subclass tuples by using a formula (2);

if the two values of binary data are regarded as 0 and 1 respectively, p represents the number of positions at which the corresponding data in the missing subclass tuple and the complete subclass tuple are both 1, q represents the number of positions at which the missing subclass tuple data is 0 and the corresponding complete subclass tuple data is 1, r represents the number of positions at which the missing subclass tuple data is 1 and the corresponding complete subclass tuple data is 0, and s represents the number of positions at which the corresponding data in the missing subclass tuple and the complete subclass tuple are both 0;
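Formula (2) is not reproduced in this text, so the following is only a sketch under stated assumptions. It counts p, q, r, and s, taking r as the remaining mismatch case (1 in the missing subclass tuple, 0 in the complete one), and then assumes the simple matching dissimilarity (q + r) / (p + q + r + s). A Jaccard-style variant would drop s from the denominator; the source does not say which form is used.

```python
from typing import Sequence, Tuple


def binary_counts(l: Sequence[int], c: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count p (1,1), q (0,1), r (1,0) and s (0,0) over aligned binary values."""
    p = q = r = s = 0
    for a, b in zip(l, c):
        if a == 1 and b == 1:
            p += 1
        elif a == 0 and b == 1:
            q += 1
        elif a == 1 and b == 0:
            r += 1
        else:
            s += 1
    return p, q, r, s


def binary_distance(l: Sequence[int], c: Sequence[int]) -> float:
    """Assumed form: simple matching dissimilarity (q + r) / (p + q + r + s)."""
    p, q, r, s = binary_counts(l, c)
    total = p + q + r + s
    return (q + r) / total if total else 0.0


counts = binary_counts([1, 0, 1, 0], [1, 1, 0, 0])
d_dual = binary_distance([1, 0, 1, 0], [1, 1, 0, 0])
print(counts, d_dual)
```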

for ordinal type subclass tuples, firstly, converting ordinal data in tuples into numerical data by using a formula (3), and then calculating subclass distances between missing subclass tuples and complete subclass tuples by using a numerical tuple distance formula (4);

D(L,C)_ordi = D(L,C)_num    formula (4)

wherein, if all values of the i-th data column of the ordinal subclass tuple are regarded in order as a sequence from 0 to N, N_i represents the total number N of sequence numbers of the i-th data column, M_i represents the sequence number of the data's value in that sequence, and X_i represents the converted numerical data;

for the categorical subclass tuple, calculating the subclass distance between the missing subclass tuple and the complete subclass tuple using formula (5);

wherein, since the missing subclass tuple and the complete subclass tuple contain the same number of data items, T represents the total number of data items in the missing subclass tuple or the complete subclass tuple, and E represents the number of positions at which the corresponding data in the missing subclass tuple and the complete subclass tuple are the same;
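Formula (5) is not reproduced in this text. A natural reading of T and E is the mismatch ratio (T - E) / T, which the sketch below assumes; the function name and the sample values are illustrative only.

```python
from typing import Any, Sequence


def categorical_distance(l: Sequence[Any], c: Sequence[Any]) -> float:
    """Assumed form: share of positions whose categorical values differ, (T - E) / T."""
    if len(l) != len(c):
        raise ValueError("subclass tuples must have the same number of data items")
    t = len(l)
    if t == 0:
        return 0.0
    e = sum(1 for a, b in zip(l, c) if a == b)  # E: positions with equal values
    return (t - e) / t


d_cat = categorical_distance(["Zhejiang", "CS", "male"], ["Zhejiang", "Math", "male"])
print(d_cat)
```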

for the text type subclass tuples, calculating the distance between character string data by using an edit distance formula (6), and then calculating the subclass distance between the missing subclass tuples and the complete subclass tuples by using a formula (7) and carrying out normalization processing;

wherein D_i(L,C)_text represents the edit distance between the i-th character string data in the missing subclass tuple and in the complete subclass tuple; L_j and C_k respectively represent the first j and first k characters of the i-th character string data in the missing subclass tuple and the complete subclass tuple (0 ≤ j ≤ U_i, 0 ≤ k ≤ V_i); Min represents the minimum function; since the missing subclass tuple and the complete subclass tuple contain the same number of data items, m represents the total number of character string data items in the missing subclass tuple or the complete subclass tuple; U_i and V_i respectively represent the total lengths of the i-th character string data in the missing subclass tuple and the complete subclass tuple; and Max represents the maximum function;
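Formulas (6) and (7) are not reproduced in this text. The sketch below uses the standard Levenshtein recurrence for the edit distance and assumes that normalization means dividing each edit distance by Max(U_i, V_i) and averaging over the m string pairs; that normalization is an inference from the symbol definitions, not a confirmed formula.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for j, ch_a in enumerate(a, start=1):
        curr = [j]
        for k, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[k] + 1,          # deletion
                            curr[k - 1] + 1,      # insertion
                            prev[k - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def text_distance(l: Sequence[str], c: Sequence[str]) -> float:
    """Assumed form: mean of edit distances, each divided by the longer string length."""
    if len(l) != len(c):
        raise ValueError("subclass tuples must have the same number of strings")
    m = len(l)
    if m == 0:
        return 0.0
    total = 0.0
    for s_l, s_c in zip(l, c):
        longest = max(len(s_l), len(s_c))
        if longest:
            total += edit_distance(s_l, s_c) / longest
    return total / m


d_text = text_distance(["kitten"], ["sitting"])
print(edit_distance("kitten", "sitting"), d_text)
```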

4) Computing the tuple distance between the missing tuple t₁ and the complete tuple t₂;

the tuple distance between the missing tuple t₁ and the complete tuple t₂ is obtained by multiplying each of the above 5 types of subclass distances by its corresponding external weight W_i and adding the products, according to formula (8) and formula (9);

where i indexes the 5 types of subclass tuples, W_i represents the weight coefficient of the i-th type of subclass tuple in the current tuple, D_i(t₁, t₂) represents the subclass distance between the i-th type of missing subclass tuple and complete subclass tuple, Y represents the total number of data items in the current tuple, and Y_i represents the number of data items of the i-th type in the current tuple;
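Formulas (8) and (9) are not reproduced in this text. From the description, the tuple distance is a weighted sum of the subclass distances; the sketch below additionally assumes that the weight is W_i = Y_i / Y, which is inferred from the definitions of Y and Y_i rather than stated outright. The type labels and function name are illustrative.

```python
from typing import Dict


def tuple_distance(subclass_distances: Dict[str, float],
                   type_counts: Dict[str, int]) -> float:
    """Weighted sum of subclass distances with assumed weights W_i = Y_i / Y."""
    y = sum(type_counts.values())  # Y: total number of data items in the tuple
    if y == 0:
        return 0.0
    return sum((type_counts.get(kind, 0) / y) * dist
               for kind, dist in subclass_distances.items())


d_tuple = tuple_distance(
    {"num": 0.5, "category": 1.0},  # D_i per type
    {"num": 3, "category": 1},      # Y_i per type
)
print(d_tuple)
```

Weighting by the share of columns of each type means a type that covers most of the tuple's columns has proportionally more influence on the final distance.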

5) Sorting the tuple distances between the missing tuple and the complete tuples in increasing order;

6) Selecting the first k complete tuples with the minimum tuple distance as a target tuple set;

the k value is obtained through training, and the method comprises the following specific steps:

6-1) dividing all complete tuples into a test tuple set and a training tuple set;

6-2) dividing the training tuple set into n sub-tuple sets with the same size;

6-3) taking each sub-tuple set in turn as the complete tuple set T_C and repairing the current missing tuple using each of 1 to 100 as the training k value;

6-4) acquiring the k value with the highest repair accuracy for each sub-tuple set;

6-5) taking the average of the n k values as the repair k value for the test tuple set;

to ensure that the training k value can range from 1 to 100, the size of each sub-tuple set is not less than 100 when the training tuple set is divided;

7) Selecting, in the columns of the target tuple set that correspond to the missing items of the missing tuple, the most frequent value as the fill value for the missing data of the missing tuple;
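Steps 5) to 7) can be sketched as follows. This is an illustrative sketch only: the distance function is passed in as a parameter, the toy `mismatch_distance` stands in for the hybrid tuple distance, `None` marks a missing item, and k is supplied directly rather than obtained through the training procedure of steps 6-1) to 6-5).

```python
from collections import Counter
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def knn_fill(missing_row: Row, complete_rows: List[Row],
             distance: Callable[[Row, Row], float], k: int) -> Row:
    """Fill each missing item with the most frequent value among the k nearest complete rows."""
    missing_cols = [col for col, v in missing_row.items() if v is None]
    # Steps 5 and 6: sort by increasing distance and keep the first k rows.
    neighbours = sorted(complete_rows, key=lambda r: distance(missing_row, r))[:k]
    filled = dict(missing_row)
    if not neighbours:
        return filled  # nothing to learn from
    # Step 7: majority vote per missing column.
    for col in missing_cols:
        filled[col] = Counter(r[col] for r in neighbours).most_common(1)[0][0]
    return filled


def mismatch_distance(a: Row, b: Row) -> float:
    """Toy stand-in distance: share of a's non-missing columns whose values differ."""
    cols = [col for col, v in a.items() if v is not None]
    return sum(a[col] != b[col] for col in cols) / len(cols) if cols else 0.0


complete = [
    {"city": "Hangzhou", "zip": "310018", "province": "Zhejiang"},
    {"city": "Hangzhou", "zip": "310018", "province": "Zhejiang"},
    {"city": "Shanghai", "zip": "200000", "province": "Shanghai"},
]
missing = {"city": "Hangzhou", "zip": "310018", "province": None}
filled = knn_fill(missing, complete, mismatch_distance, k=2)
print(filled)
```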

step 4: carrying out consistency check and consistency repair on all tuples in the data processed in step 3;

4-1 consistency check

Sequentially traversing all tuples and checking whether the current tuple matches all rules in the rule set of step 2; if so, continuing to check the next tuple; otherwise, adding the rules violated by the current tuple (that is, conditional function dependencies Σ_cfd and/or external constraints Σ_fc) to an abnormal rule set Σ';

4-2 consistency repair

The consistency restoration mainly comprises 3 processes of determining a rule restoration sequence, positioning an abnormal tuple and selecting a target tuple;

4-2-1 determining a rule repair order

1) Constructing a rule sequence diagram G(V, E) by taking the conditional function dependencies in the abnormal rule set Σ' as nodes V and the dependency relationships between nodes as edges E, wherein V = Σ'; for any two conditional function dependencies in Σ' (symbols not reproduced in this text), if the stated condition on their left and right parts holds (expression not reproduced in this text), then there is an edge between them pointing from one to the other, that is, a dependency relationship exists between the two, where L and R represent the left and right parts, respectively, of a conditional function dependency;

2) Sequentially selecting the nodes (that is, conditional function dependencies) with an in-degree of 0 in the rule sequence diagram as priority repair rules, adding them to the repair rule set Σ_rep, and then deleting each such node together with the edges connected to it, until no nodes remain in the rule sequence diagram G(V, E); if the rule sequence diagram is not empty and no node with an in-degree of 0 exists, selecting, from all combinations of conditional function dependencies in the rule sequence diagram, the combination with the minimum total repair cost as the repair rule set Σ_rep;

an in-degree of 0 means that no edge in the rule sequence diagram points to the node;

the repair cost refers to the total number of modifications to tuple data incurred when one tuple is used to perform consistency repair on all abnormal tuples violating the current rule;
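Repeatedly removing nodes with an in-degree of 0 is a topological sort (Kahn's algorithm). The sketch below illustrates it under a loudly stated assumption: an edge is drawn from rule a to rule b when the right part R of a shares an attribute with the left part L of b. The exact edge condition is not recoverable from this text, so that condition, the rule names, and the function names are hypothetical. The minimum-repair-cost fallback for cyclic graphs is not implemented; rules left in a cycle are simply returned.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Set, Tuple

CFD = Tuple[FrozenSet[str], FrozenSet[str]]  # (left part L, right part R)


def build_rule_graph(rules: Dict[str, CFD]) -> Dict[str, Set[str]]:
    """Assumed edge condition: a -> b when R(a) and L(b) share an attribute."""
    edges: Dict[str, Set[str]] = {name: set() for name in rules}
    for a, (_, rhs_a) in rules.items():
        for b, (lhs_b, _) in rules.items():
            if a != b and rhs_a & lhs_b:
                edges[a].add(b)
    return edges


def repair_order(edges: Dict[str, Set[str]]) -> Tuple[List[str], List[str]]:
    """Kahn's algorithm; returns (ordered rules, rules stuck in a cycle)."""
    indegree = {node: 0 for node in edges}
    for targets in edges.values():
        for t in targets:
            indegree[t] += 1
    queue = deque(sorted(n for n, d in indegree.items() if d == 0))
    order: List[str] = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for t in sorted(edges[node]):  # removing the node lowers its targets' in-degree
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)
    remaining = [n for n in edges if n not in order]
    return order, remaining


rules = {
    "r1": (frozenset({"zip"}), frozenset({"city"})),
    "r2": (frozenset({"city"}), frozenset({"province"})),
    "r3": (frozenset({"id_number"}), frozenset({"age"})),
}
order, remaining = repair_order(build_rule_graph(rules))
print(order, remaining)
```

Under the assumed edge condition, repairing `r1` before `r2` ensures that the city values which `r2` keys on have already been corrected.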

4-2-2 locating abnormal tuples

Sequentially traversing all rules in the repair rule set Σ_rep and adding all tuples violating the current rule to the abnormal tuple set T_e;

4-2-3 select target tuples

Combining the external constraints Σ_fc in the rule set Σ, selecting from the abnormal tuple set T_e the tuple with the minimum repair cost as the target tuple, and repairing the other abnormal tuples by using the target tuple;
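Target tuple selection can be sketched as picking the candidate whose values would require the fewest cell modifications across the abnormal tuple set. The code is a simplified illustration: the per-row `satisfies_external` predicate is a hypothetical stand-in for the external constraints (which in the source may be dataset-level conditions), and all names are assumptions.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

Row = Dict[str, Any]


def repair_cost(candidate: Row, abnormal: List[Row], attrs: Sequence[str]) -> int:
    """Number of cells that must change if every abnormal row copies the candidate on attrs."""
    return sum(1 for row in abnormal for a in attrs if row[a] != candidate[a])


def select_target(abnormal: List[Row], attrs: Sequence[str],
                  satisfies_external: Callable[[Row], bool] = lambda row: True
                  ) -> Optional[Row]:
    """Pick the admissible abnormal row with the minimum repair cost."""
    candidates = [r for r in abnormal if satisfies_external(r)]
    if not candidates:
        return None
    return min(candidates, key=lambda r: repair_cost(r, abnormal, attrs))


def repair(abnormal: List[Row], attrs: Sequence[str], target: Row) -> List[Row]:
    """Overwrite the rule's attributes in every abnormal row with the target's values."""
    return [{**row, **{a: target[a] for a in attrs}} for row in abnormal]


abnormal = [
    {"zip": "310018", "city": "Hangzhou"},
    {"zip": "310018", "city": "Hangzou"},
    {"zip": "310018", "city": "Hangzhou"},
]
target = select_target(abnormal, ["city"])
repaired = repair(abnormal, ["city"], target)
print(target, repaired)
```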

step 5: performing uniqueness check and repair on all tuples in the data processed in step 4;

Checking whether the tuple distance between the first tuple in the sliding window and other tuples in the window is smaller than a set distance threshold value by using an improved SNM algorithm based on a mixed distance and a dynamic window for the data processed in the step 4; if yes, the two tuple data are considered to be similar and repeated, the repeated tuple in the window is deleted, and if not, the first tuple and other tuples are considered to meet the uniqueness condition; moving out the first tuple in the sliding window and moving in the next tuple of the last tuple in the window, repeating the steps until all tuples finish uniqueness check, and realizing the check and repair of the uniqueness of the data;

the improved SNM algorithm based on the hybrid distance and the dynamic window specifically comprises the following steps:

5-1) selecting one or more data for all tuples, calculating corresponding key values of the data and using the key values as sorting keywords;

5-2) sorting all tuples according to the sorting keywords;

5-3) setting a sliding window with the initial size of N and the step length of 1 on the sorted tuples, calculating the tuple distance between the first tuple in the sliding window and other tuples in the window according to a formula (8), deleting repeated tuples in the sliding window if at least one tuple distance is smaller than a distance threshold, and otherwise moving the sliding window by one step length to move out the first tuple in the sliding window and move in the next tuple of the last tuple in the sliding window;

5-4) calculating the ratio of the tuple distance between head and tail tuples in the sliding window to the tuple number in the sliding window, and taking the ratio as the average density of the window, if the average density of the window is less than a density threshold value, increasing the size of the sliding window, if the average density of the window is equal to the density threshold value, keeping the size of the sliding window unchanged, if the average density of the window is greater than the density threshold value, decreasing the size of the sliding window, and continuously sliding until all tuples are checked;
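Steps 5-1) to 5-4) can be sketched as a sorted neighborhood pass with a window whose size follows the average density. This is a simplified illustration, not the patented algorithm: the distance function is supplied by the caller (a toy field-mismatch ratio is used in place of the hybrid tuple distance), the window grows or shrinks by one tuple at a time, and the `min_window` and `max_window` bounds are assumptions.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def snm_dedupe(rows: List[Row], key: Callable[[Row], Any],
               distance: Callable[[Row, Row], float],
               dist_threshold: float, density_threshold: float,
               window: int = 3, min_window: int = 2, max_window: int = 10) -> List[Row]:
    """Sort by key, then slide a window and drop near-duplicates of the window's first tuple."""
    items = sorted(rows, key=key)
    start = 0
    while start < len(items) - 1:
        end = min(start + window, len(items))
        head = items[start]
        kept = [t for t in items[start + 1:end] if distance(head, t) >= dist_threshold]
        removed = (end - start - 1) - len(kept)
        items[start + 1:end] = kept
        if removed:
            continue  # duplicates deleted: re-check the refilled window with the same head
        # Average density: head-to-tail distance divided by the tuple count in the window.
        density = distance(head, items[end - 1]) / (end - start)
        if density < density_threshold:
            window = min(window + 1, max_window)
        elif density > density_threshold:
            window = max(window - 1, min_window)
        start += 1  # slide by one step
    return items


def field_mismatch(a: Row, b: Row) -> float:
    """Toy stand-in distance: share of fields whose values differ."""
    return sum(a[c] != b[c] for c in a) / len(a)


records = [
    {"id": "3", "name": "Zhang Wei", "city": "Hangzhou"},
    {"id": "1", "name": "Li Na", "city": "Hangzhou"},
    {"id": "2", "name": "Wang Fang", "city": "Ningbo"},
    {"id": "1", "name": "Li Na", "city": "Hangzhou"},
]
unique = snm_dedupe(records, key=lambda r: r["name"], distance=field_mismatch,
                    dist_threshold=0.5, density_threshold=0.2)
print([r["id"] for r in unique])
```

Sorting first places likely duplicates next to each other, so each tuple is compared only with its window neighbors rather than with the whole data set.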

step 6: rechecking the data processed in step 5 to determine whether all tuples match all rules in the rule set; if so, all tuples meet the consistency condition and the cleaning of the data set is complete; if not, returning to step 4-2 to continue execution.

In order to achieve the purpose, the invention also provides a multi-source heterogeneous data cleaning device, which comprises the following specific modules:

the data acquisition and preprocessing module is used for acquiring multi-source heterogeneous data and converting the data with the same attribute into data with a uniform format;

the rule set building module is used for building a rule set containing conditional function dependence and external constraint on the data from the data acquisition and preprocessing module;

the integrity checking and repairing module is used for checking whether the data acquired by the data acquisition and preprocessing module is missing or not and then dividing the data into a missing tuple set and a complete tuple set; sequentially checking whether missing items of all missing tuples in the missing tuple set are matched with certain rules in the rule set, if so, filling missing data of the current missing tuple by using the rules, otherwise, filling the missing data of the current missing tuple by using a mixed filling algorithm based on improved KNN;

the consistency checking and repairing module is used for checking whether the data processed by the integrity checking and repairing module violates the rules in the rule set or not, if so, determining the repairing rules, and taking all tuples violating the repairing rules as abnormal tuples; selecting the tuple with the minimum repairing cost from all abnormal tuples as a target tuple, and repairing other abnormal tuples by using the target tuple;

the uniqueness checking and repairing module is used for checking and deleting repeated tuples on the data by using an improved SNM algorithm based on a mixed distance and a dynamic window for the data processed by the consistency checking and repairing module;

and the consistency secondary checking module is used for checking whether the data processed by the uniqueness checking and repairing module meets the consistency condition, finishing the cleaning of the data if the data meets the consistency condition, and returning to the consistency checking and repairing module to execute again if the data does not meet the consistency condition.

The technical scheme of the invention has the following advantages:

1. Compared with traditional single-dimension data cleaning, the method provided by the invention starts from three data quality dimensions (integrity, consistency, and uniqueness) and designs a cleaning method and steps for the data of each dimension, so that the overall quality of multi-dimensional data is improved.

2. Compared with traditional data cleaning that relies only on conditional function dependencies, the invention uses not only the conditional function dependencies existing among the data but also external constraint conditions, which expands the rule set for data cleaning and improves the effect of data quality detection and repair.

3. Compared with the traditional single type data cleaning, the method can solve the data quality problems of five mixed types such as numerical type, binary type, ordinal type, classification type and text type, and respectively selects a proper distance measurement formula for each type of data, thereby improving the accuracy of data cleaning.

4. Compared with the traditional data cleaning device, the invention avoids the influence of integrity repair on consistency repair and uniqueness repair and the influence of consistency repair on uniqueness repair by designing the standardized data cleaning device, and ensures the effectiveness of data cleaning.

Drawings

FIG. 1 is a schematic diagram illustrating steps of a multi-source heterogeneous data cleaning method according to an embodiment of the present invention;

FIG. 2 is a flow chart of integrity check and repair in an embodiment of the present invention;

FIG. 3 is a flow chart of consistency checking and repair in an embodiment of the present invention;

FIG. 4 is a flow chart of uniqueness checking and repairing in an embodiment of the present invention;

FIG. 5 is a diagram illustrating a dynamic sliding window in an embodiment of the present invention;

FIG. 6 is a block diagram of a multi-source heterogeneous data cleaning apparatus according to an embodiment of the present disclosure;

Detailed Description

In order to fully and clearly communicate the technical solutions of the embodiments of the present invention to those skilled in the art, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

As shown in fig. 1, a multi-source heterogeneous data cleaning method provided in an embodiment of the present invention includes the following steps:

step 1: the method comprises the steps of obtaining multi-source heterogeneous data, converting the data with the same attribute into a uniform data format, and obtaining a data set formed by a plurality of tuples, wherein one tuple is formed by a group of data with all attributes;

in the step, the data source is a database of each business system in the campus, and the database comprises multi-source heterogeneous data such as student basic information data, score data, library access data, campus card consumption data and the like. Firstly, an extraction task is created by using a key (ETL tool), connection configuration information of a source database and a target database is set, then a conversion task is created to convert fields with the same attribute in all tables into a uniform data format, and finally the extraction task and the conversion task are added to a job and executed to obtain an initial data set.

Step 2: constructing the conditional function dependencies that exist among different attributes for the initial data acquired in step 1, and then adding the conditional function dependencies Σ_cfd and the external constraints Σ_fc to the rule set Σ, wherein each rule in the rule set Σ corresponds to a certain conditional function dependency or a certain external constraint;

in this step, first, the corresponding conditional function dependencies are established between associated attribute fields in all data tables (for example, the identification number determines the age and date of birth) and are added to the rule set. Second, external constraint conditions that the business departments can determine manually, such as the number of students from each province at a given college or its male-to-female ratio, are also added to the rule set. The rule set is specifically defined as follows:

given a student basic information data instance I: (school number, name, age, date of birth, identification number, province, city, zip code), a conditional function dependency set Σ_cfd (expression not reproduced in this text), and an external constraint set Σ_fc = ∑ψ_i, the rule set is Σ = Σ_cfd ∪ Σ_fc. For a conditional function dependency of the form X → Y, where X and Y are different attribute fields in a data table, the dependency means that for any two tuples (t₁, t₂), if t₁[X] = t₂[X], then t₁[Y] = t₂[Y]. Conversely, if t₁[X] = t₂[X] but t₁[Y] ≠ t₂[Y], then tuples t₁ and t₂ have a consistency error on that rule.

For example, for a student basic information table, the conditional function dependencies that can be established first are as follows:

Identification number → age, date of birth

Zip code → city, province

School number → name
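A dependency such as zip code → city can be checked by grouping tuples on the left-hand attributes and flagging any group that holds more than one distinct right-hand value. The sketch below is a minimal illustration; the row representation, attribute names, and function name are assumptions.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def cfd_violations(rows: List[Row], lhs: Sequence[str], rhs: Sequence[str]) -> List[List[int]]:
    """Return groups of row indexes that agree on lhs but disagree on rhs."""
    groups: Dict[tuple, List[int]] = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(idx)
    violations = []
    for idxs in groups.values():
        rhs_values = {tuple(rows[i][a] for a in rhs) for i in idxs}
        if len(rhs_values) > 1:  # same X, different Y: consistency error
            violations.append(idxs)
    return violations


students = [
    {"zip": "310018", "city": "Hangzhou"},
    {"zip": "310018", "city": "Hangzou"},
    {"zip": "200000", "city": "Shanghai"},
]
bad_groups = cfd_violations(students, ["zip"], ["city"])
print(bad_groups)
```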

Second, from the enrollment information of the school and of its computer college, the external constraints that can be determined are as follows:

ψ₁: the number of students from Hangzhou at the school is not more than 100

ψ₂: the ratio of male to female students in the computer college is not less than 3:1

And finally, combining the conditional function dependence and the external constraint condition to obtain a required rule set. It should be noted that the rules described above, such as conditional function dependencies and external constraints, are only used to describe the establishment of the rule set, and are not used to limit the rule set.
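External constraints of the kind illustrated by ψ₁ and ψ₂ can be modeled as predicates over the whole data set. The sketch below is one possible encoding under assumptions: the attribute names (`city`, `gender`), the constraint factories, and the simplification that the rows passed in have already been filtered to the relevant college are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Constraint = Callable[[List[Row]], bool]


def count_constraint(attr: str, value: Any, max_count: int) -> Constraint:
    """Quantity constraint: at most max_count rows may have attr == value."""
    def check(rows: List[Row]) -> bool:
        return sum(1 for r in rows if r.get(attr) == value) <= max_count
    return check


def ratio_constraint(attr: str, numerator: Any, denominator: Any,
                     min_ratio: float) -> Constraint:
    """Ratio constraint: count(numerator) / count(denominator) >= min_ratio."""
    def check(rows: List[Row]) -> bool:
        n = sum(1 for r in rows if r.get(attr) == numerator)
        d = sum(1 for r in rows if r.get(attr) == denominator)
        return d == 0 or n / d >= min_ratio
    return check


def violated(rows: List[Row], constraints: Dict[str, Constraint]) -> List[str]:
    """Names of the external constraints that the data set breaks."""
    return [name for name, check in constraints.items() if not check(rows)]


constraints = {
    "psi_1": count_constraint("city", "Hangzhou", 100),
    "psi_2": ratio_constraint("gender", "male", "female", 3.0),
}
ok_rows = [{"city": "Hangzhou", "gender": g} for g in ("male", "male", "male", "female")]
bad_rows = [{"city": "Hangzhou", "gender": g} for g in ("male", "female")]
print(violated(ok_rows, constraints), violated(bad_rows, constraints))
```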

Step 3: sequentially traversing all tuples in the data processed in step 1 and judging whether the current tuple has missing data; if so, adding the current tuple to the missing tuple set T_L; if not, adding it to the complete tuple set T_C. The missing tuples are then repaired using the rule set of step 2, the complete tuple set T_C, and the improved KNN-based hybrid filling algorithm; the integrity repair flow is shown in FIG. 2;

the step specifically includes two processes of integrity check and integrity repair.

Integrity check: sequentially traversing all tuples in the data and judging whether the current tuple has missing data; if so, adding the current tuple to the missing tuple set T_L; if not, adding it to the complete tuple set T_C, so that all tuples containing missing data are detected.

Integrity repair: after the integrity check, traversing the missing tuple set T_L in sequence and checking whether the missing items of the current missing tuple match some rules in the rule set Σ (that is, conditional function dependencies Σ_cfd and/or external constraints Σ_fc); if so, filling the missing data of the current missing tuple by using those rules; otherwise, filling it by using the improved KNN-based hybrid filling algorithm.

Step 4: sequentially traversing all tuples of the data processed in step 3, checking whether each tuple matches all rules in the rule set of step 2, and recording, for tuples with consistency errors, the violated conditional function dependencies and/or external constraints. The erroneous data of the erroneous tuples are then repaired according to the rule repair order and the target tuple, thereby checking and repairing data consistency; the consistency repair flow is shown in FIG. 3;

the step specifically comprises two processes of consistency check and consistency repair.

Consistency check: sequentially traversing all tuples in the data and checking whether the current tuple matches all rules in the rule set Σ of step 2; if so, continuing to check the next tuple; otherwise, adding the rules violated by the current tuple (conditional function dependencies Σ_cfd and/or external constraints Σ_fc) to the abnormal rule set Σ';

Consistency repair: this specifically comprises 3 processes: determining the rule repair order, locating abnormal tuples, and selecting the target tuple.

Determining the rule repair order: because different conditional function dependencies in the abnormal rule set may contain the same attribute field, the order in which the rules are applied for repair must be determined; otherwise, erroneous repairs may result. In a specific implementation, the rule repair order is determined by constructing a rule sequence diagram and then topologically sorting it: nodes (conditional function dependencies) with an in-degree of 0 in the rule sequence diagram are selected in sequence as priority repair rules and added to the repair rule set Σ_rep, and each such node and the edges connected to it are then deleted, until no nodes remain in the diagram. If the rule sequence diagram is not empty and no node with an in-degree of 0 exists, the combination with the minimum total repair cost among all combinations of conditional function dependencies in the rule sequence diagram is selected as the repair rule set;

Locating abnormal tuples: traversing in sequence all rules in the repair rule set Σ_rep and adding all tuples violating the current rule to the abnormal tuple set T_e;

Selecting the target tuple: the selection of the target value for abnormal data is a key problem in consistency repair. Given an abnormal tuple set, different choices of repair target value produce very different repair results and different repair costs. In a specific implementation, the external constraints Σ_fc in the rule set Σ are combined, and the tuple with the smallest repair cost in the abnormal tuple set T_e is selected as the target tuple to repair the other abnormal tuples.

Step 5: for the data processed in step 4, using an improved SNM algorithm based on a mixed distance and a dynamic window to check whether the tuple distance between the first tuple in the sliding window and each other tuple in the window is smaller than a set distance threshold. If so, the two tuples are considered duplicates, and the duplicate tuple in the window is deleted; if not, the first tuple and the other tuples are considered to meet the uniqueness condition, and the first tuple in the sliding window is moved out while the tuple following the last tuple in the window is moved in. These steps are repeated until the uniqueness check of all tuples is completed; the uniqueness repair flow is shown in FIG. 4;

in this step, first, one or a group of data is selected for all tuples in the data set, and a key value of each tuple is calculated and used as a sorting key of the tuple.

Secondly, all the tuples are sorted according to the sorting key, and the tuples with similar and repeated data are adjacent in sequence.

Then, a sliding window with an initial size of N is set on the sorted tuples (as shown in fig. 5), the tuple distance between the first tuple in the window and the other N-1 tuples in the window is calculated, and if the tuple distance between a certain tuple and the first tuple is smaller than the set distance threshold, the similar duplicate tuple is deleted.

And finally, moving a sliding window step length, moving out the first tuple in the sliding window and moving in the next tuple of the last tuple, and repeating the steps until all tuples in the data are checked.

In the process of sliding the window, the ratio of the tuple distance between the head and tail tuples in the sliding window to the number of tuples in the window is calculated and used as the average density of the window. If the average density of the window is higher than a set density threshold, the similarity between the tuples in the sliding window is considered low, and the size of the sliding window can be reduced appropriately to reduce the number of comparisons and improve repair efficiency. Conversely, if the average density of the window is lower than the set density threshold, the similarity between the tuples in the sliding window is considered high, and the size of the sliding window can be increased appropriately to expand the matching range and improve repair accuracy.

To further reduce matching errors among tuples, a new sorting key can be selected and the sorting, checking and repair performed again. This multi-pass sliding window detection mechanism deletes similar duplicate tuples as far as possible and improves the accuracy of uniqueness checking and repair.

Step 6: Recheck whether all tuples in the data processed in step 5 match all rules in the rule set. If so, all tuples satisfy the consistency condition and the data cleaning is finished; if not, return to step 4-2 and continue execution.

As shown in FIG. 6, an embodiment of the present invention also provides a multi-source heterogeneous data cleaning apparatus, comprising: a data acquisition and preprocessing module, a rule set construction module, an integrity checking and repairing module, a consistency checking and repairing module, a uniqueness checking and repairing module and a consistency secondary checking module.

The above are merely specific embodiments of the present invention, and are not intended to limit the present invention. It will be apparent to those skilled in the art that the present application is susceptible to modifications and variations in light of the above teachings. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A multi-source heterogeneous data cleaning method is characterized by comprising the following steps:

step 1: obtaining multi-source heterogeneous data, and converting data with the same attribute into a uniform data format to obtain a data set formed by a plurality of tuples;

3-1 integrity check

3-2 integrity repair

Traversing the missing tuple set T_L in sequence, checking whether the missing items of the current missing tuple match some rules in the rule set Σ; if so, filling the missing data of the current missing tuple by using the rules, otherwise, filling the missing data of the current missing tuple by using a modified KNN-based mixed filling algorithm;

1) Dividing the non-missing data column of the current missing tuple into 5 types of missing tuples, namely numerical type, binary type, ordinal type, classification type and text type;

2) Dividing the complete tuple set T_C likewise into 5 types of complete tuple sets, according to the same data columns that correspond to each type of tuple in the current missing tuple;

3) Respectively calculating the subclass distance between each type of missing tuple and the complete tuple;

7) Selecting the most frequent data in the columns of the target tuple set that correspond to the missing items of the missing tuple as the fill value for the missing data of the missing tuple;

step 4: performing consistency check and consistency repair on all tuples in the data processed in step 3;

4-1 consistency check:

sequentially traversing all tuples, checking whether the current tuple matches all rules in the rule set of step 2; if so, continuing to check the next tuple, otherwise, adding the rule violated by the current tuple to the abnormal rule set Σ';

4-2 consistency repair

step 5: performing uniqueness check and repair on all tuples in the data processed in step 4;

step 6: rechecking the data processed in step 5 to determine whether all tuples match all rules in the rule set; if so, all tuples satisfy the consistency condition and the cleaning of the data is finished, and if not, returning to step 4-2 to continue execution.

2. The method for cleaning multi-source heterogeneous data according to claim 1, wherein the step 3) in the integrity repair of the step 3-2 is specifically:

for the numerical subclass tuple, calculating a subclass distance between the missing subclass tuple and the complete subclass tuple by using a standardized Euclidean distance formula (1);

where n represents the total number of numeric data in the subclass tuple, x_Li represents the i-th data of the missing subclass tuple, x_Ci represents the i-th data of the complete subclass tuple, and s_i represents the standard deviation of all values of the i-th column data of the subclass tuple;

where, if the two values of binary data are regarded as 0 and 1 respectively, p represents the number of corresponding data that are 1 in both the missing subclass tuple and the complete subclass tuple, q represents the number of data that are 0 in the missing subclass tuple and 1 in the complete subclass tuple, r represents the number of data that are 1 in the missing subclass tuple and 0 in the complete subclass tuple, and s represents the number of corresponding data that are 0 in both the missing subclass tuple and the complete subclass tuple;

D(L,C)_ordi = D(L,C)_num    formula (4)

where, if all values of the i-th column data of the subclass tuple are regarded in order as a sequence from 0 to N, then N_i denotes the total number N of sequence numbers of the i-th column data, M_i denotes the sequence number of the data value in that sequence, and X_i denotes the converted numerical data;

for the text type subclass tuples, the distance between character string data is calculated by using the edit distance formula (6), and then the subclass distance between the missing subclass tuple and the complete subclass tuple is calculated by using formula (7) and normalized;

where D_i(L,C)_text denotes the edit distance between the i-th character string data in the missing subclass tuple and the i-th character string data in the complete subclass tuple; L_j and C_k denote the first j and the first k characters of the i-th character string data in the missing subclass tuple and the complete subclass tuple, respectively, with 0 ≤ j ≤ U_i and 0 ≤ k ≤ V_i; Min denotes the minimum function; since the missing subclass tuple and the complete subclass tuple contain the same number of data, m denotes the total number of character string data in the missing subclass tuple or the complete subclass tuple; U_i and V_i denote the total lengths of the i-th character string data in the missing subclass tuple and the complete subclass tuple, respectively; and Max denotes the maximum function.
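For illustration only, the numerical and text subclass distances named in claim 2 can be sketched as below. The standardized Euclidean distance and the edit distance follow their standard definitions. The formulas themselves are not reproduced in this text, so the normalization in `text_subclass_distance` (each edit distance divided by Max(U_i, V_i), then averaged over the m strings) is an assumption inferred from the symbol definitions, and all function names are hypothetical.

```python
import math
from typing import Sequence


def standardized_euclidean(
    x_l: Sequence[float], x_c: Sequence[float], s: Sequence[float]
) -> float:
    """Standardized Euclidean distance; s[i] is the column's standard deviation."""
    return math.sqrt(sum(((a - b) / sd) ** 2 for a, b, sd in zip(x_l, x_c, s)))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for j, ca in enumerate(a, start=1):
        curr = [j]
        for k, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[k] + 1,          # deletion
                            curr[k - 1] + 1,      # insertion
                            prev[k - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def text_subclass_distance(
    strings_l: Sequence[str], strings_c: Sequence[str]
) -> float:
    """Assumed normalization: mean of edit distance / max string length."""
    m = len(strings_l)
    if m == 0:
        return 0.0
    total = 0.0
    for a, b in zip(strings_l, strings_c):
        longest = max(len(a), len(b))
        total += edit_distance(a, b) / longest if longest else 0.0
    return total / m
```

Under this assumed normalization every text subclass distance lies between 0 and 1, which keeps it comparable with the other subclass distances when they are later combined.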

3. The method for cleaning multi-source heterogeneous data according to claim 2, wherein the step 4) in the integrity repair in the step 3-2 is specifically:

the tuple distance between a missing tuple t₁ and a complete tuple t₂ is obtained by multiplying each of the above 5 types of subclass distance by its corresponding weight W_i and adding the resulting products, as given by formula (8) and formula (9);

where i denotes one of the 5 types of subclass tuple, W_i denotes the weight coefficient of the i-th type of subclass tuple in the current tuple, and D_i(t₁, t₂) denotes the subclass distance between the i-th type missing subclass tuple and complete subclass tuple; Y denotes the total number of data in the current tuple, and Y_i denotes the number of i-th type data in the current tuple.
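A minimal sketch of the weighted combination in claim 3 follows. Formulas (8) and (9) are not reproduced in this text, so the weight W_i = Y_i / Y used below is an assumption inferred from the definitions of Y and Y_i; the function name and the dictionary keys are hypothetical.

```python
from typing import Dict


def tuple_distance(
    subclass_distances: Dict[str, float], type_counts: Dict[str, int]
) -> float:
    """Weighted sum of subclass distances, assuming W_i = Y_i / Y."""
    total = sum(type_counts.values())  # Y: total number of data in the tuple
    if total == 0:
        raise ValueError("the tuple contains no data")
    return sum(
        (type_counts[kind] / total) * dist  # W_i * D_i(t1, t2)
        for kind, dist in subclass_distances.items()
    )
```

For a tuple with three numerical items and one text item, a numerical subclass distance of 0.5 and a text subclass distance of 0.2 combine to 0.75 × 0.5 + 0.25 × 0.2 = 0.425.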

4. The method according to claim 3, wherein the k value in step 6) of the integrity repair in step 3-2 is obtained by:

6-2) dividing the training tuple set into n sub-tuple sets with the same size;

6-4) obtaining the k value with the highest repair accuracy in each sub-tuple set;

6-5) the average of these n k values is used as the repair k value for the set of test tuples.
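The k selection of steps 6-2), 6-4) and 6-5) can be sketched as below. Steps 6-1) and 6-3) are not reproduced in this text, so the way repair accuracy is measured is left to a caller-supplied `accuracy` function, which is hypothetical; rounding the average to an integer is also an assumption, since KNN needs an integer k.

```python
from typing import Callable, List, Sequence


def choose_k(
    training: Sequence,
    n: int,
    candidate_ks: Sequence[int],
    accuracy: Callable[[Sequence, int], float],
) -> int:
    """Average the best k of n equally sized sub-tuple sets."""
    fold_size = len(training) // n
    # 6-2) split the training tuple set into n subsets of the same size
    folds: List[Sequence] = [
        training[i * fold_size:(i + 1) * fold_size] for i in range(n)
    ]
    # 6-4) in each subset, keep the k with the highest repair accuracy
    best_ks = [max(candidate_ks, key=lambda k: accuracy(fold, k)) for fold in folds]
    # 6-5) average the n best k values (rounded, as an assumption)
    return round(sum(best_ks) / n)
```

Averaging across subsets makes the chosen k less sensitive to any single subset on which an unusual k happens to score best.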

5. The method according to claim 1, wherein the consistency repair in step 4 mainly comprises determining a rule repair order, locating an abnormal tuple, and selecting a target tuple; the method comprises the following steps:

4-2-1 determining a rule repair order

[Formulas missing from the source text: the condition under which a directed edge pointing from one rule node to another is added to the rule sequence diagram G(V, E).]

2) Sequentially selecting nodes with an in-degree of 0 in the rule sequence diagram as priority repair rules, adding them to the repair rule set Σ_rep, and then deleting each such node and the edges connected to it, until no nodes remain in the rule sequence diagram G(V, E); if the rule sequence diagram is not empty and no node with an in-degree of 0 exists, selecting, from all conditional functional dependency combinations in the rule sequence diagram, the combination with the minimum sum of repair costs and adding it to the repair rule set Σ_rep;

4-2-2 locating abnormal tuples

Traversing all rules in the repair rule set Σ_rep in sequence, and adding all tuples that violate the current rule to the abnormal tuple set T_e;

4-2-3 selecting target tuples

Combining the external constraint rules Σ_fc in the rule set Σ, selecting from the abnormal tuple set T_e the tuple with the minimum repair cost as the target tuple, and repairing the other abnormal tuples by using the target tuple.
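The in-degree based ordering of step 4-2-1 corresponds to a standard topological sort (Kahn's algorithm), sketched below under simplifying assumptions. The rule names and edge list are hypothetical inputs, and the cost-based selection for the case where no node has in-degree 0 is not implemented: the sketch only returns the rules left in such a cycle.

```python
from collections import deque
from typing import Dict, Hashable, List, Sequence, Tuple


def repair_order(
    nodes: Sequence[Hashable], edges: Sequence[Tuple[Hashable, Hashable]]
) -> Tuple[List[Hashable], List[Hashable]]:
    """Return (ordered rules, rules left in cycles) for a rule graph."""
    indegree: Dict[Hashable, int] = {v: 0 for v in nodes}
    successors: Dict[Hashable, List[Hashable]] = {v: [] for v in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Rules with in-degree 0 are repaired first.
    ready = deque(v for v in nodes if indegree[v] == 0)
    order: List[Hashable] = []
    while ready:
        v = ready.popleft()
        order.append(v)
        # Deleting the node removes its outgoing edges.
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)

    done = set(order)
    # Non-empty only when the remaining graph has no node of in-degree 0.
    remaining = [v for v in nodes if v not in done]
    return order, remaining
```

When `remaining` is non-empty, the claim calls for choosing the combination with the minimum sum of repair costs among the rules still in the graph; that choice depends on a cost model not given in this text.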

6. The method for cleaning multi-source heterogeneous data according to claim 1, wherein in step 5, an improved SNM algorithm based on a mixed distance and a dynamic window is used for the data processed in step 4, and whether the tuple distance between the first tuple in the sliding window and other tuples in the window is smaller than a set distance threshold is checked; if yes, the two tuple data are considered to be similar and repeated, the repeated tuple in the window is deleted, and if not, the first tuple and other tuples are considered to meet the uniqueness condition; and moving out the first tuple in the sliding window and moving in the next tuple of the last tuple in the window, and repeating the steps until all tuples finish uniqueness check, so as to realize data uniqueness check and repair.

7. The method for cleaning multi-source heterogeneous data according to claim 6, wherein the improved SNM algorithm based on the hybrid distance and the dynamic window comprises the following specific steps:

5-2) sorting all tuples according to the sorting keywords;

5-3) setting a sliding window with the initial size of N and the step length of 1 on the sorted tuples, calculating the tuple distance between the first tuple in the sliding window and other tuples in the window, deleting repeated tuples in the sliding window if at least one tuple distance is smaller than a distance threshold value, and otherwise moving the sliding window by one step length;

5-4) calculating the ratio of the tuple distance between the head tuple and the tail tuple in the sliding window to the tuple number in the sliding window, taking the ratio as the window average density, increasing the size of the sliding window if the window average density is smaller than a density threshold, keeping the size of the sliding window unchanged if the window average density is equal to the density threshold, decreasing the size of the sliding window if the window average density is larger than the density threshold, and continuing to slide until all tuple inspection is finished.

8. A multi-source heterogeneous data washing apparatus for implementing the method according to any one of claims 1 to 7, characterized in that the apparatus comprises the following modules: