CN114281809B - Multi-source heterogeneous data cleaning method and device - Google Patents

Info

Publication number
CN114281809B
Authority
CN
China
Prior art keywords
tuple
data
tuples
missing
subclass
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111577423.6A
Other languages
Chinese (zh)
Other versions
CN114281809A (en
Inventor
刘峰
张纪林
陈军相
袁俊峰
刘涛
金峻帆
钱瑞祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202111577423.6A priority Critical patent/CN114281809B/en
Publication of CN114281809A publication Critical patent/CN114281809A/en
Application granted granted Critical
Publication of CN114281809B publication Critical patent/CN114281809B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-source heterogeneous data cleaning method and device, which address the invalid and low-quality data repair caused by an improper data cleaning order when multiple data quality dimensions are involved. Starting from multiple data quality dimensions in the smart campus context, the method guarantees the effectiveness of overall data cleaning by standardizing the order of data checking and repair. During data repair, currently known campus-internal knowledge is used as external constraints to expand the repair rule set and improve the accuracy of data cleaning. In smart campus construction, the cleaned campus data can be applied effectively to data management, data opening, data mining and analysis, and other processes in colleges and universities. The consistency problems caused by data repair under multiple data quality dimensions are avoided, and data availability is greatly improved.

Description

Multi-source heterogeneous data cleaning method and device
Technical Field
The invention relates to the technical field of computers, in particular to a multi-source heterogeneous data cleaning method and device, and more particularly relates to a data inspection and repair method for data with integrity, consistency, uniqueness and other data quality problems in the field of data cleaning.
Background
With the rapid development of information technology, data is growing explosively in the background of the big data era. In the process of integrating multi-source heterogeneous data, any improper operation can cause a series of data quality problems. In the field of data mining, the data quality determines whether more valuable knowledge can be mined from massive and complex data, and therefore more reliable and accurate decision support is provided for users.
At present, the industry mainly divides the measurement standards of data quality into six dimensions: completeness (referred to below as integrity), consistency, uniqueness, accuracy, validity and timeliness. Most traditional research on data quality either addresses a single dimension or ignores the correlations among dimensions, so the usability of the cleaned data is low. Real-world data tend to be multidimensional, and the data quality dimensions are not completely independent of one another. Therefore, traditional single-dimension, simple data cleaning methods and devices are no longer suitable for solving multi-dimensional data quality problems in today's complex scenarios.
Disclosure of Invention
Aiming at the problems, the invention provides a multi-source heterogeneous data cleaning method and device, and aims to solve the problem of multi-dimensional data quality caused by omission, constraint violation, repeated operation and the like when an operator collects and inputs data in real life. By the method and the device, data cleaning of data in three quality dimensions of integrity, consistency and uniqueness can be completed, and the usability of the data is improved.
In order to achieve the purpose, the invention provides a multi-source heterogeneous data cleaning method, which comprises the following specific steps:
step 1: the method comprises the steps of obtaining multi-source heterogeneous data, converting the data with the same attribute into a uniform data format, and obtaining a data set consisting of a plurality of tuples; wherein each tuple consists of a set of data of all attributes;
the multi-source means that the sources of the data are diverse, and heterogeneous means that the types, characteristics and the like of the data differ;
step 2: constructing the conditional functional dependencies existing among different attributes of the data processed in step 1, and then adding the conditional functional dependencies Σ_cfd and the external constraints Σ_fc to the rule set Σ, wherein each rule in the rule set Σ corresponds to a conditional functional dependency or an external constraint;
the external constraints refer to manually specified constraints on the data, such as hard constraints, quantity constraints and equivalence constraints;
step 3: performing integrity check and integrity repair on all tuples in the data processed in step 1;
3-1 integrity check
Sequentially traversing all tuples in the data processed in step 1 and judging whether the current tuple has missing values; if so, adding it to the missing tuple set T_L; if not, adding it to the complete tuple set T_C;
3-2 integrity repair
Sequentially traversing the missing tuple set T_L and checking whether the missing items of the current missing tuple match some rules in the rule set Σ (namely, conditional functional dependencies Σ_cfd and/or external constraints Σ_fc); if so, filling the missing data of the current missing tuple using those rules; otherwise, filling the missing data of the current missing tuple using an improved KNN-based hybrid filling algorithm;
the improved KNN-based hybrid filling algorithm comprises the following specific steps:
1) Dividing the non-missing data columns of the current missing tuple into 5 types of missing subclass tuples: numerical (num), binary (dual), ordinal (ordi), categorical (category) and text (text);
2) Dividing the data columns of the complete tuple set T_C that correspond to each type of subclass tuple of the current missing tuple into 5 types of complete subclass tuple sets;
3) Respectively calculating the subclass distance between each type of missing subclass tuple and the complete subclass tuple;
for the numerical subclass tuples, calculating subclass distances between missing subclass tuples and complete subclass tuples by using a standardized Euclidean distance formula (1);
$$D(L,C)_{num} = \sqrt{\sum_{i=1}^{n} \left( \frac{x_{Li} - x_{Ci}}{s_i} \right)^{2}}$$ formula (1)
where n represents the total number of numerical data in the subclass tuple, x_Li represents the i-th data of the missing subclass tuple, x_Ci represents the i-th data of the complete subclass tuple, and s_i represents the standard deviation of all values of the i-th column of the subclass tuples;
for binary subclass tuples, calculating subclass distances between missing subclass tuples and complete subclass tuples by using a formula (2);
Figure BDA0003425739410000022
if the two values of binary data are regarded as 0 and 1 respectively, p represents the number of positions at which the corresponding data in the missing subclass tuple and the complete subclass tuple are both 1, q represents the number of positions at which the missing subclass tuple data is 0 and the corresponding complete subclass tuple data is 1, r represents the number of positions at which the missing subclass tuple data is 1 and the corresponding complete subclass tuple data is 0, and s represents the number of positions at which the corresponding data in the missing subclass tuple and the complete subclass tuple are both 0;
for ordinal type subclass tuples, firstly, converting ordinal data in tuples into numerical data by using a formula (3), and then calculating subclass distances between missing subclass tuples and complete subclass tuples by using a numerical tuple distance formula (4);
Figure BDA0003425739410000031
$$D(L,C)_{ordi} = D(L,C)_{num}$$ formula (4)
wherein, if all values of the i-th column of the ordinal subclass tuples are regarded in order as a sequence numbered from 0 to N, N_i represents the total number N of sequence numbers for the i-th column, M_i represents the sequence number of the data value within that sequence, and X_i represents the converted numerical data;
for the categorical subclass tuple, calculating the subclass distance between the missing subclass tuple and the complete subclass tuple using formula (5);
$$D(L,C)_{category} = \frac{T - E}{T}$$ formula (5)
wherein, since the number of data in the missing subclass tuple and in the complete subclass tuple is the same, T represents the total number of data in either tuple, and E represents the number of positions at which the corresponding data in the missing subclass tuple and the complete subclass tuple are identical;
for the text type subclass tuples, calculating the distance between character string data by using an edit distance formula (6), and then calculating the subclass distance between the missing subclass tuples and the complete subclass tuples by using a formula (7) and carrying out normalization processing;
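$$D_i(L_j, C_k)_{text} = \begin{cases} j, & k = 0 \\ k, & j = 0 \\ \mathrm{Min}\left( D_i(L_{j-1}, C_k)_{text} + 1,\; D_i(L_j, C_{k-1})_{text} + 1,\; D_i(L_{j-1}, C_{k-1})_{text} + \delta_{jk} \right), & j, k > 0 \end{cases}$$ formula (6), with \delta_{jk} = 0 when the j-th character of L equals the k-th character of C and 1 otherwise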
Figure BDA0003425739410000034
wherein D_i(L,C)_text represents the edit distance between the i-th character string data of the missing subclass tuple and of the complete subclass tuple; L_j and C_k respectively represent the first j and the first k characters of the i-th character string data in the missing subclass tuple and the complete subclass tuple (0 ≤ j ≤ U_i, 0 ≤ k ≤ V_i); Min represents the minimum function; since the number of data in the missing subclass tuple and the complete subclass tuple is the same, m represents the total number of character string data in either tuple; U_i and V_i respectively represent the total length of the i-th character string data in the missing subclass tuple and the complete subclass tuple; and Max represents the maximum function;
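As an illustration of the text-type distance, the following minimal Python sketch computes the edit distance by dynamic programming over the prefixes L_j and C_k and then divides each distance by Max(U_i, V_i). Averaging the normalized values over the m strings is an assumption made here for illustration, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit (Levenshtein) distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for j, ca in enumerate(a, start=1):
        curr = [j]                      # distance from a[:j] to the empty prefix of b
        for k, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[k] + 1,          # delete a character of a
                            curr[k - 1] + 1,      # insert a character of b
                            prev[k - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def text_subclass_distance(missing, complete):
    """Mean of per-string edit distances, each divided by the longer length.

    Assumption of this sketch: the m normalized distances are averaged.
    """
    if not missing:
        return 0.0
    total = 0.0
    for l_str, c_str in zip(missing, complete):
        longest = max(len(l_str), len(c_str))
        total += edit_distance(l_str, c_str) / longest if longest else 0.0
    return total / len(missing)
```

With this normalization each string contributes a value between 0 and 1, so long strings do not dominate short ones.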
4) Computing the tuple distance between the missing tuple t_1 and the complete tuple t_2;
the tuple distance between the missing tuple t_1 and the complete tuple t_2 is obtained by multiplying each of the above 5 types of subclass distances by its corresponding weight W_i and summing the products, as given by formula (8) and formula (9);
$$D(t_1, t_2) = \sum_{i=1}^{5} W_i \, D_i(t_1, t_2)$$ formula (8)
$$W_i = \frac{Y_i}{Y}$$ formula (9)
where i indexes the 5 types of subclass tuples, W_i represents the weight coefficient of the i-th type of subclass tuple in the current tuple, and D_i(t_1, t_2) represents the subclass distance between the i-th type of missing subclass tuple and complete subclass tuple; Y represents the total number of data in the current tuple, and Y_i represents the number of data of the i-th type in the current tuple;
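The weighted combination can be sketched in Python as follows. This is a minimal illustration covering only two of the five subclass types. It assumes the weight of each type is its share of the tuple's data (W_i = Y_i / Y, as the definitions of Y and Y_i are read here) and that the categorical distance is the fraction of mismatching values. All function names and sample values are hypothetical.

```python
import math


def numeric_distance(x_l, x_c, stds):
    """Standardized Euclidean distance; stds must be non-zero column deviations."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x_l, x_c, stds)))


def categorical_distance(x_l, x_c):
    """Fraction of positions whose categorical values differ (assumed form)."""
    total = len(x_l)
    equal = sum(1 for a, b in zip(x_l, x_c) if a == b)
    return (total - equal) / total


def tuple_distance(sub_distances, counts):
    """Weighted sum of subclass distances, each weighted by counts[type] / total."""
    total = sum(counts.values())
    return sum((counts[name] / total) * dist for name, dist in sub_distances.items())


# Two numeric columns (age, height) and two categorical columns (province, major).
d_num = numeric_distance([20.0, 170.0], [22.0, 175.0], stds=[2.0, 5.0])
d_cat = categorical_distance(["Zhejiang", "CS"], ["Zhejiang", "Math"])
d = tuple_distance({"num": d_num, "category": d_cat}, {"num": 2, "category": 2})
```

Weighting by column share keeps a subclass with many columns from being treated the same as one with a single column.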
5) Sorting the tuple distances between the missing tuples and the complete tuples in an increasing way;
6) Selecting the first k complete tuples with the minimum tuple distance as a target tuple set;
the k value is obtained through training, and the method comprises the following specific steps:
6-1) dividing all complete tuples into a test tuple set and a training tuple set;
6-2) dividing the training tuple set into n sub-tuple sets with the same size;
6-3) taking each sub-tuple set in turn as the complete tuple set T_C, and repairing the current missing tuple using each value from 1 to 100 as the training k value;
6-4) acquiring the k value with the highest repair accuracy in each sub-tuple set;
6-5) taking the average of the n k values as the repair k value for the test tuple set;
in order to ensure that the training k value can range from 1 to 100, the size of each sub-tuple set is not less than 100 when the training tuple set is divided;
7) Selecting the most frequent value in the columns of the target tuple set that correspond to the missing items of the missing tuple as the filling value for the missing data of the missing tuple;
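Steps 5) to 7) can be sketched in Python as below. This is a minimal illustration under stated assumptions: tuples are dictionaries, the tuple distance is passed in as a function, and k is supplied directly rather than obtained through the training procedure. The name knn_fill and the sample data are hypothetical.

```python
from collections import Counter


def knn_fill(missing, complete_tuples, missing_col, distance, k):
    """Fill missing[missing_col] with the most frequent value among the k
    complete tuples closest to `missing` under `distance`."""
    ranked = sorted(complete_tuples, key=lambda c: distance(missing, c))
    targets = ranked[:k]                       # target tuple set
    votes = Counter(c[missing_col] for c in targets)
    filled = dict(missing)
    filled[missing_col] = votes.most_common(1)[0][0]
    return filled


complete = [
    {"age": 18, "city": "Hangzhou"},
    {"age": 19, "city": "Hangzhou"},
    {"age": 30, "city": "Ningbo"},
]


def age_gap(a, b):
    """Toy distance that uses only the non-missing 'age' column."""
    return abs(a["age"] - b["age"])


result = knn_fill({"age": 18, "city": None}, complete, "city", age_gap, k=2)
```

When several values tie for the highest count, `Counter.most_common` returns the one encountered first, so ties go to the value seen in the nearest of the tied neighbours.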
step 4: performing consistency check and consistency repair on all tuples in the data processed in step 3;
4-1 consistency check
Sequentially traversing all tuples and checking whether the current tuple matches all rules in the rule set of step 2; if so, continuing to check the next tuple; otherwise, adding the rules violated by the current tuple (namely, conditional functional dependencies Σ_cfd and/or external constraints Σ_fc) to an abnormal rule set Σ';
4-2 consistency repair
The consistency restoration mainly comprises 3 processes of determining a rule restoration sequence, positioning an abnormal tuple and selecting a target tuple;
4-2-1 determining a rule repair order
1) Constructing a rule sequence graph G(V, E) by taking the conditional functional dependencies in the abnormal rule set Σ' as nodes V and the dependency relationships between nodes as edges E, where V = Σ'; for any two conditional functional dependencies φ_i, φ_j ∈ Σ', if R(φ_i) ∩ L(φ_j) ≠ ∅, then there is an edge from φ_i pointing to φ_j, that is, a dependency relationship exists between φ_i and φ_j, where L and R represent the left and right parts of a conditional functional dependency, respectively;
2) Sequentially selecting a node (namely, a conditional functional dependency) with an in-degree of 0 in the rule sequence graph as a priority repair rule, adding it to the repair rule set Σ_rep, and then deleting the node and the edges connected to it, until no nodes remain in the rule sequence graph G(V, E); if the rule sequence graph is not empty and no node with an in-degree of 0 exists, selecting, from all combinations of conditional functional dependencies in the rule sequence graph, the combination with the minimum total repair cost as the repair rule set Σ_rep;
an in-degree of 0 means that no edge points to the node in the rule sequence graph;
the repair cost refers to the total number of modifications of tuple data incurred when one tuple is used to perform consistency repair on all abnormal tuples violating the current rule;
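The ordering procedure amounts to a topological sort, which the following Python sketch illustrates. It assumes an edge runs from rule a to rule b whenever the right part of a shares an attribute with the left part of b. The rule names and attribute sets are hypothetical, and the cost-based fallback for cyclic graphs is not implemented: the rules left in a cycle are only reported.

```python
def repair_order(rules):
    """rules maps a rule name to (left_attrs, right_attrs), both sets.

    Returns (order, leftover): rules in repair order, plus any rules that
    could not be ordered because they lie on a cycle.
    """
    names = list(rules)
    # Edge a -> b when repairing a may change an attribute that b conditions on.
    edges = {a: {b for b in names if b != a and rules[a][1] & rules[b][0]}
             for a in names}
    indegree = {n: 0 for n in names}
    for a in names:
        for b in edges[a]:
            indegree[b] += 1
    order = []
    ready = [n for n in names if indegree[n] == 0]
    while ready:
        node = ready.pop(0)
        order.append(node)
        for b in sorted(edges[node]):   # delete the node's outgoing edges
            indegree[b] -= 1
            if indegree[b] == 0:
                ready.append(b)
    leftover = [n for n in names if n not in order]
    return order, leftover


example_rules = {
    "phi1": ({"zip_code"}, {"city"}),
    "phi2": ({"city"}, {"province"}),
    "phi3": ({"student_no"}, {"name"}),
}
order, leftover = repair_order(example_rules)
```

Here phi1 is repaired before phi2 because phi1 may rewrite the city values on which phi2 conditions.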
4-2-2 locating abnormal tuples
Sequentially traversing all rules in the repair rule set Σ_rep and adding all tuples violating the current rule to the abnormal tuple set T_e;
4-2-3 selecting the target tuple
In combination with the external constraints Σ_fc in the rule set Σ, selecting the tuple with the minimum repair cost in the abnormal tuple set T_e as the target tuple, and repairing the other abnormal tuples using the target tuple;
step 5: performing uniqueness check and repair on all tuples in the data processed in step 4;
For the data processed in step 4, using an improved SNM (Sorted Neighborhood Method) algorithm based on a mixed distance and a dynamic window to check whether the tuple distance between the first tuple in the sliding window and each other tuple in the window is smaller than a set distance threshold; if so, the two tuples are considered similar duplicates and the duplicate tuple in the window is deleted; if not, the first tuple and the other tuples are considered to satisfy the uniqueness condition; then moving the first tuple out of the sliding window and moving in the tuple following the last tuple in the window, and repeating these steps until all tuples have completed the uniqueness check, thereby checking and repairing the uniqueness of the data;
the improved SNM algorithm based on the hybrid distance and the dynamic window specifically comprises the following steps:
5-1) selecting one or more data for all tuples, calculating corresponding key values of the data and using the key values as sorting keywords;
5-2) sorting all tuples according to the sorting keywords;
5-3) setting a sliding window with the initial size of N and the step length of 1 on the sorted tuples, calculating the tuple distance between the first tuple in the sliding window and other tuples in the window according to a formula (8), deleting repeated tuples in the sliding window if at least one tuple distance is smaller than a distance threshold, and otherwise moving the sliding window by one step length to move out the first tuple in the sliding window and move in the next tuple of the last tuple in the sliding window;
5-4) calculating the ratio of the tuple distance between head and tail tuples in the sliding window to the tuple number in the sliding window, and taking the ratio as the average density of the window, if the average density of the window is less than a density threshold value, increasing the size of the sliding window, if the average density of the window is equal to the density threshold value, keeping the size of the sliding window unchanged, if the average density of the window is greater than the density threshold value, decreasing the size of the sliding window, and continuously sliding until all tuples are checked;
step 6: rechecking the data processed in step 5 to determine whether all tuples match all rules in the rule set; if so, all tuples satisfy the consistency condition and the cleaning of the data set is complete; if not, returning to step 4-2 to continue execution.
In order to achieve the purpose, the invention also provides a multi-source heterogeneous data cleaning device, which comprises the following specific modules:
the data acquisition and preprocessing module is used for acquiring multi-source heterogeneous data and converting the data with the same attribute into data with a uniform format;
the rule set building module is used for building, on the data from the data acquisition and preprocessing module, a rule set containing conditional functional dependencies and external constraints;
the integrity checking and repairing module is used for checking whether the data acquired by the data acquisition and preprocessing module is missing or not and then dividing the data into a missing tuple set and a complete tuple set; sequentially checking whether missing items of all missing tuples in the missing tuple set are matched with certain rules in the rule set, if so, filling missing data of the current missing tuple by using the rules, otherwise, filling the missing data of the current missing tuple by using a mixed filling algorithm based on improved KNN;
the consistency checking and repairing module is used for checking whether the data processed by the integrity checking and repairing module violates the rules in the rule set or not, if so, determining the repairing rules, and taking all tuples violating the repairing rules as abnormal tuples; selecting the tuple with the minimum repairing cost from all abnormal tuples as a target tuple, and repairing other abnormal tuples by using the target tuple;
the uniqueness checking and repairing module is used for checking and deleting repeated tuples on the data by using an improved SNM algorithm based on a mixed distance and a dynamic window for the data processed by the consistency checking and repairing module;
and the consistency secondary checking module is used for checking whether the data processed by the uniqueness checking and repairing module meets the consistency condition, finishing the cleaning of the data if the data meets the consistency condition, and returning to the consistency checking and repairing module to execute again if the data does not meet the consistency condition.
The technical scheme of the invention has the following advantages:
1. compared with the traditional single-dimension data cleaning, the method provided by the invention starts from three data quality dimensions of integrity, consistency and uniqueness, and designs a cleaning method and steps for data of each dimension respectively, so that the overall quality of multi-dimensional data is improved.
2. Compared with traditional data cleaning, which relies only on conditional functional dependencies, the invention uses both the conditional functional dependencies existing among the data and external constraints, which expands the rule set for data cleaning and improves data quality detection and repair.
3. Compared with the traditional single type data cleaning, the method can solve the data quality problems of five mixed types such as numerical type, binary type, ordinal type, classification type and text type, and respectively selects a proper distance measurement formula for each type of data, thereby improving the accuracy of data cleaning.
4. Compared with the traditional data cleaning device, the invention avoids the influence of integrity repair on consistency repair and uniqueness repair and the influence of consistency repair on uniqueness repair by designing the standardized data cleaning device, and ensures the effectiveness of data cleaning.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a multi-source heterogeneous data cleaning method according to an embodiment of the present invention;
FIG. 2 is a flow chart of integrity check and repair in an embodiment of the present invention;
FIG. 3 is a flow chart of consistency checking and repair in an embodiment of the present invention;
FIG. 4 is a flow chart of uniqueness checking and repairing in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a dynamic sliding window in an embodiment of the present invention;
FIG. 6 is a block diagram of a multi-source heterogeneous data cleaning apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to fully and clearly communicate the technical solutions of the embodiments of the present invention to those skilled in the art, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As shown in fig. 1, a multi-source heterogeneous data cleaning method provided in an embodiment of the present invention includes the following steps:
step 1: the method comprises the steps of obtaining multi-source heterogeneous data, converting the data with the same attribute into a uniform data format, and obtaining a data set formed by a plurality of tuples, wherein one tuple is formed by a group of data with all attributes;
In this step, the data sources are the databases of the business systems in the campus, which contain multi-source heterogeneous data such as student basic information data, score data, library access data and campus card consumption data. First, an extraction task is created using Kettle (an ETL tool) and the connection configuration information of the source database and the target database is set; then a conversion task is created to convert fields with the same attribute in all tables into a uniform data format; finally, the extraction task and the conversion task are added to a job and executed to obtain the initial data set.
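The format unification performed by the conversion task can be illustrated outside any ETL tool. The Python sketch below is only an assumption about what such a conversion might do for a date field. The list of source formats and the function name to_iso_date are hypothetical and not part of the described method.

```python
from datetime import datetime
from typing import Optional

# Hypothetical formats used by different source systems.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d", "%d.%m.%Y")


def to_iso_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Values that match none of the formats come back as None, so they can later be handled as missing data rather than silently kept in an inconsistent form.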
Step 2: constructing the conditional functional dependencies existing among different attributes of the initial data acquired in step 1, and then adding the conditional functional dependencies Σ_cfd and the external constraints Σ_fc to the rule set Σ, wherein each rule in the rule set Σ corresponds to a conditional functional dependency or an external constraint;
In this step, first, conditional functional dependencies are established between associated attribute fields in all data tables (for example, the ID number determines the age and the date of birth) and added to the rule set. Second, external constraints that can be determined manually within the business departments, such as the number of students from each province or the male-to-female ratio of a given college, are also added to the rule set. The rule set is specifically defined as follows:
Given a student basic information data instance I: (student number, name, age, date of birth, ID number, province, city, zip code), the conditional functional dependency set is Σ_cfd = Σφ_i, the external constraint set is Σ_fc = Σψ_i, and the rule set is Σ = Σ_cfd ∪ Σ_fc. For a conditional functional dependency φ: X → Y, where X and Y are different attribute fields in a data table, any two tuples (t_1, t_2) with t_1[X] = t_2[X] must satisfy t_1[Y] = t_2[Y]. Conversely, if t_1[X] = t_2[X] but t_1[Y] ≠ t_2[Y], then tuples t_1 and t_2 have a consistency error on rule φ.
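The pairwise condition above can be checked without comparing every pair of tuples: grouping tuples by their X values and looking for groups with more than one distinct Y value finds the same violations. The following Python sketch illustrates this. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def cfd_violations(tuples, lhs, rhs):
    """Return groups of tuple indices that agree on lhs but disagree on rhs."""
    groups = defaultdict(list)
    for idx, t in enumerate(tuples):
        groups[tuple(t[a] for a in lhs)].append(idx)
    bad = []
    for members in groups.values():
        rhs_values = {tuple(tuples[i][a] for a in rhs) for i in members}
        if len(rhs_values) > 1:        # same X, different Y: consistency error
            bad.append(members)
    return bad


rows = [
    {"zip_code": "310018", "city": "Hangzhou"},
    {"zip_code": "310018", "city": "Hangzou"},   # misspelled city
    {"zip_code": "315000", "city": "Ningbo"},
]
violations = cfd_violations(rows, lhs=["zip_code"], rhs=["city"])
```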
For example, for the student basic information table, the conditional functional dependencies that can be established first are as follows:
φ_1: ID number → age, date of birth
φ_2: zip code → city, province
φ_3: student number → name
Second, from the enrollment information of the school and of the School of Computer Science, the external constraints that can be determined are as follows:
ψ_1: the number of students from Hangzhou in the school is not more than 100
ψ_2: the male-to-female ratio in the School of Computer Science is not less than 3:1
Finally, the conditional functional dependencies and the external constraints are combined to obtain the required rule set. It should be noted that the rules above are only used to illustrate how the rule set is established and are not intended to limit it.
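Unlike a conditional functional dependency, an external constraint of the quantity or ratio kind is evaluated over the whole data set. The Python sketch below shows one way such checks might look. The attribute names "city" and "gender", the function names and the sample rows are assumptions made only for illustration.

```python
def check_max_count(tuples, attr, value, limit):
    """Quantity constraint: at most `limit` tuples have attr == value."""
    return sum(1 for t in tuples if t.get(attr) == value) <= limit


def check_min_ratio(tuples, attr, first, second, ratio):
    """Ratio constraint: count(first) / count(second) is at least `ratio`."""
    a = sum(1 for t in tuples if t.get(attr) == first)
    b = sum(1 for t in tuples if t.get(attr) == second)
    return b == 0 or a / b >= ratio


students = [
    {"city": "Hangzhou", "gender": "M"},
    {"city": "Ningbo", "gender": "M"},
    {"city": "Hangzhou", "gender": "M"},
    {"city": "Wenzhou", "gender": "F"},
]
ok_count = check_max_count(students, "city", "Hangzhou", limit=100)
ok_ratio = check_min_ratio(students, "gender", "M", "F", ratio=3.0)
```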
Step 3: sequentially traversing all tuples in the data processed in step 1 and judging whether the current tuple has missing values; if so, adding it to the missing tuple set T_L; if not, adding it to the complete tuple set T_C. The missing tuples are then repaired using the rule set of step 2, the complete tuple set T_C and the improved KNN-based hybrid filling algorithm; the integrity repair process is shown in FIG. 2;
the step specifically includes two processes of integrity check and integrity repair.
Integrity check: sequentially traversing all tuples in the data and judging whether the current tuple has missing values; if so, adding it to the missing tuple set T_L; if not, adding it to the complete tuple set T_C, so that all tuples containing missing data are detected.
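A minimal Python sketch of this partition is given below. It assumes tuples are dictionaries and that a value counts as missing when it is None or an empty string. Both the representation and the function name are hypothetical.

```python
def split_by_completeness(tuples, is_missing=lambda v: v is None or v == ""):
    """Partition tuples into (missing_set, complete_set)."""
    missing, complete = [], []
    for t in tuples:
        has_gap = any(is_missing(v) for v in t.values())
        (missing if has_gap else complete).append(t)
    return missing, complete


records = [
    {"name": "A", "age": 20},
    {"name": "B", "age": None},
    {"name": "", "age": 19},
]
t_l, t_c = split_by_completeness(records)
```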
Integrity repair: after the integrity check, sequentially traversing the missing tuple set T_L and checking whether the missing items of the current missing tuple match some rules in the rule set Σ (namely, conditional functional dependencies Σ_cfd and/or external constraints Σ_fc); if so, filling the missing data of the current missing tuple using those rules; otherwise, filling it using the improved KNN-based hybrid filling algorithm.
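The rule-first, fallback-second logic can be sketched in Python using the dependency from ID number to date of birth. The sketch relies on the fact that characters 7 to 14 of an 18-character resident ID number encode the birth date as YYYYMMDD. The field names, the all-zero sample ID and the `fallback` callable (standing in for the KNN-based filling) are hypothetical.

```python
from datetime import date


def birth_date_from_id(id_number: str):
    """Extract the birth date from an 18-character ID number, or None."""
    if len(id_number) != 18 or not id_number[6:14].isdigit():
        return None
    year, month, day = int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14])
    try:
        return date(year, month, day)
    except ValueError:          # digits present but not a real calendar date
        return None


def fill_birth_date(t, fallback):
    """Apply the rule if it matches; otherwise call fallback(t)."""
    if t.get("date_of_birth") is not None:
        return t
    born = birth_date_from_id(t.get("id_number") or "")
    filled = dict(t)
    filled["date_of_birth"] = born.isoformat() if born else fallback(t)
    return filled


by_rule = fill_birth_date(
    {"id_number": "000000200305170000", "date_of_birth": None},
    fallback=lambda t: "unknown",
)
by_fallback = fill_birth_date(
    {"id_number": None, "date_of_birth": None},
    fallback=lambda t: "unknown",
)
```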
Step 4: sequentially traversing all tuples of the data processed in step 3, checking whether each tuple matches all rules in the rule set of step 2, and recording the conditional functional dependencies and/or external constraints violated by tuples with consistency errors. The erroneous data of those tuples are then repaired according to the rule repair order and the target tuple, thereby checking and repairing data consistency; the consistency repair flow is shown in FIG. 3;
the step specifically comprises two processes of consistency check and consistency repair.
Consistency check: sequentially traversing all tuples in the data and checking whether the current tuple matches all rules in the rule set Σ of step 2; if so, continuing to check the next tuple; otherwise, adding the rules violated by the current tuple (conditional functional dependencies Σ_cfd and/or external constraints Σ_fc) to an abnormal rule set Σ';
Consistency repair: this specifically comprises 3 processes: determining the rule repair order, locating abnormal tuples and selecting the target tuple.
Determining the rule repair order: because different conditional functional dependencies in the abnormal rule set may contain the same attribute field, the order in which the rules are applied during repair must be determined; otherwise, erroneous repairs may result. In a specific implementation, the rule repair order is determined by constructing a rule sequence graph and topologically sorting it: nodes (conditional functional dependencies) with an in-degree of 0 are selected in turn as priority repair rules and added to the repair rule set Σ_rep, and each selected node and the edges connected to it are then deleted, until no nodes remain in the graph. If the rule sequence graph is not empty and no node with an in-degree of 0 exists, the rule sequence with the minimum total repair cost among all combinations of conditional functional dependencies in the graph is selected as the repair rule set;
Locating abnormal tuples: sequentially traversing all rules in the repair rule set Σ_rep and adding all tuples violating the current rule to the abnormal tuple set T_e;
Selecting the target tuple: the choice of the target value for abnormal data is a key problem in consistency repair. For a given abnormal tuple set, different repair target values lead to very different repair results and different repair costs. In a specific implementation, in combination with the external constraints Σ_fc in the rule set Σ, the tuple with the minimum repair cost in the abnormal tuple set T_e is selected as the target tuple to repair the other abnormal tuples.
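Target selection by minimum repair cost can be sketched in Python as follows. The sketch counts one modification per cell that must change when a candidate's values on the rule's right-part attributes are copied into the other abnormal tuples. The external constraints are represented by an optional predicate. Function names and sample rows are hypothetical.

```python
def repair_cost(candidate, abnormal, attrs):
    """Cells that must change to copy candidate's attrs into every abnormal tuple."""
    return sum(1 for t in abnormal for a in attrs if t[a] != candidate[a])


def select_target(abnormal, attrs, satisfies_constraints=lambda t: True):
    """Pick the admissible abnormal tuple with the smallest repair cost."""
    candidates = [t for t in abnormal if satisfies_constraints(t)]
    return min(candidates, key=lambda c: repair_cost(c, abnormal, attrs))


def repair_with_target(abnormal, attrs):
    """Overwrite attrs of all abnormal tuples with the target's values."""
    target = select_target(abnormal, attrs)
    return [{**t, **{a: target[a] for a in attrs}} for t in abnormal]


abnormal_rows = [
    {"zip_code": "310018", "city": "Hangzhou"},
    {"zip_code": "310018", "city": "Hangzhou"},
    {"zip_code": "310018", "city": "Hangzou"},
]
target = select_target(abnormal_rows, ["city"])
repaired = repair_with_target(abnormal_rows, ["city"])
```

In this example the majority spelling wins because adopting it changes one cell, whereas adopting the misspelled value would change two. If no candidate passes the constraint predicate, `min` raises a ValueError, which a caller would need to handle.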
Step 5: for the data processed in step 4, using an improved SNM (Sorted Neighborhood Method) algorithm based on a mixed distance and a dynamic window to check whether the tuple distance between the first tuple in the sliding window and each other tuple in the window is smaller than a set distance threshold. If so, the two tuples are considered duplicates and the duplicate tuple in the window is deleted; if not, the first tuple and the other tuples are considered to satisfy the uniqueness condition, and the first tuple is moved out of the sliding window while the tuple following the last tuple in the window is moved in. These steps are repeated until the uniqueness check of all tuples is completed; the uniqueness repair flow is shown in FIG. 4;
In this step, first, one attribute or a group of attributes is selected for all tuples in the data set, and a key value is calculated for each tuple and used as its sorting key.
Secondly, all tuples are sorted by the sorting key, so that similar and duplicate tuples become adjacent in the order.
Then, a sliding window with an initial size of N is set on the sorted tuples (as shown in FIG. 5), and the tuple distance between the first tuple in the window and each of the other N-1 tuples in the window is calculated; if the tuple distance between a tuple and the first tuple is smaller than the set distance threshold, that tuple is deleted as a similar duplicate.
Finally, the sliding window is moved by one step: the first tuple is moved out and the tuple following the last tuple is moved in. The above steps are repeated until all tuples in the data have been checked.
While the window slides, the ratio of the tuple distance between the head and tail tuples of the sliding window to the number of tuples in the window is calculated and used as the average density of the window. If the average density is higher than a set density threshold, the similarity between the tuples in the window is considered low, and the window size can be reduced appropriately to cut the number of comparisons and improve repair efficiency. Conversely, if the average density is lower than the set density threshold, the similarity between the tuples in the window is considered high, and the window size can be increased appropriately to widen the matching range and improve repair accuracy.
To further reduce matching errors among tuples, a new sorting key can be selected and the sorting, checking and repairing can be performed again. This multi-pass sliding window detection mechanism deletes as many similar and duplicate tuples as possible and improves the accuracy of uniqueness checking and repair.
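The sliding-window procedure above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the function name `snm_dedupe`, the window bounds `min_window` and `max_window`, and the choice to resize the window by one tuple at a time are all hypothetical, and the caller supplies the sorting key and the tuple distance.

```python
def snm_dedupe(records, key, distance, dist_threshold, density_threshold,
               window=4, min_window=2, max_window=16):
    """Sorted-neighbourhood duplicate removal with an adaptive window size."""
    items = sorted(records, key=key)
    i = 0
    while i < len(items):
        head = items[i]
        end = min(i + window, len(items))
        # Drop every neighbour in the window that is too close to the head.
        items[i + 1:end] = [t for t in items[i + 1:end]
                            if distance(head, t) >= dist_threshold]
        end = min(i + window, len(items))
        if end - i > 1:
            # Average density: head-to-tail distance over the tuple count.
            density = distance(head, items[end - 1]) / (end - i)
            if density > density_threshold:
                window = max(min_window, window - 1)   # dissimilar: shrink
            elif density < density_threshold:
                window = min(max_window, window + 1)   # similar: grow
        i += 1  # slide by one step
    return items
```

For a second pass with a different sorting key, the function can simply be called again on its own output with another `key`.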
Step 6: Recheck whether all tuples in the data processed in step 5 match all rules in the rule set. If so, all tuples satisfy the consistency condition and the data cleaning is complete; if not, return to step 4-2 and continue.
As shown in FIG. 6, an embodiment of the present invention also provides a multi-source heterogeneous data cleaning apparatus, comprising: a data acquisition and preprocessing module, a rule set construction module, an integrity checking and repairing module, a consistency checking and repairing module, a uniqueness checking and repairing module, and a consistency secondary checking module.
The data acquisition and preprocessing module is used for acquiring multi-source heterogeneous data and converting the data with the same attribute into data with a uniform format;
the rule set building module is used for building a rule set containing conditional function dependence and external constraint on the data from the data acquisition and preprocessing module;
the integrity checking and repairing module is used for checking whether the data acquired by the data acquisition and preprocessing module is missing or not and then dividing the data into a missing tuple set and a complete tuple set; sequentially checking whether missing items of all missing tuples in the missing tuple set are matched with certain rules in the rule set, if so, filling missing data of the current missing tuple by using the rules, otherwise, filling the missing data of the current missing tuple by using a mixed filling algorithm based on improved KNN;
the consistency checking and repairing module is used for checking whether the data processed by the integrity checking and repairing module violates the rules in the rule set or not, if so, determining the repairing rules, and taking all tuples violating the repairing rules as abnormal tuples; selecting the tuple with the minimum repairing cost from all abnormal tuples as a target tuple, and repairing other abnormal tuples by using the target tuple;
the uniqueness checking and repairing module is used for checking and deleting repeated tuples on the data by using an improved SNM algorithm based on a mixed distance and a dynamic window for the data processed by the consistency checking and repairing module;
and the consistency secondary checking module is used for checking whether the data processed by the uniqueness checking and repairing module meets the consistency condition, finishing the cleaning of the data if the data meets the consistency condition, and returning to the consistency checking and repairing module to execute again if the data does not meet the consistency condition.
The above are merely specific embodiments of the present invention, and are not intended to limit the present invention. It will be apparent to those skilled in the art that the present application is susceptible to modifications and variations in light of the above teachings. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (8)

1. A multi-source heterogeneous data cleaning method is characterized by comprising the following steps:
Step 1: obtaining multi-source heterogeneous data and converting data with the same attribute into a uniform data format to obtain a data set formed by a plurality of tuples;
Step 2: constructing the conditional functional dependencies that exist among different attributes of the data processed in step 1, and then adding the conditional functional dependencies Σ_cfd and the external constraints Σ_fc to the rule set Σ, wherein each rule in the rule set Σ corresponds to a conditional functional dependency or an external constraint;
Step 3: performing integrity check and integrity repair on all tuples in the data processed in step 1;
3-1 integrity check
Sequentially traversing all tuples in the data processed in step 1 and judging whether the current tuple has missing data; if so, adding the current tuple to the missing tuple set T_L; if not, adding it to the complete tuple set T_C;
3-2 integrity repair
Sequentially traversing the missing tuple set T_L and checking whether the missing items of the current missing tuple match certain rules in the rule set Σ; if so, filling the missing data of the current missing tuple using those rules; otherwise, filling the missing data of the current missing tuple using a hybrid filling algorithm based on improved KNN;
the improved KNN-based hybrid filling algorithm comprises the following specific steps:
1) Dividing the non-missing data columns of the current missing tuple into 5 types of missing subclass tuples, namely numerical, binary, ordinal, categorical and text;
2) Dividing the complete tuple set T_C into 5 types of complete subclass tuple sets according to the same data columns that correspond to each type of subclass tuple in the current missing tuple;
3) Calculating, for each type separately, the subclass distance between the missing tuple and the complete tuple;
4) Calculating the tuple distance between the missing tuple t_1 and the complete tuple t_2;
5) Sorting the tuple distances between the missing tuple and the complete tuples in increasing order;
6) Selecting the first k complete tuples with the smallest tuple distances as the target tuple set;
7) Selecting the most frequent value in the column of the target tuple set that corresponds to the missing item of the missing tuple as the filling value for the missing data;
Step 4: performing consistency check and consistency repair on all tuples in the data processed in step 3;
4-1 consistency check:
Sequentially traversing all tuples and checking whether the current tuple matches all rules in the rule set of step 2; if so, continuing to check the next tuple; otherwise, adding the rules violated by the current tuple to the abnormal rule set Σ';
4-2 consistency repair;
Step 5: performing uniqueness check and repair on all tuples in the data processed in step 4;
Step 6: rechecking whether all tuples in the data processed in step 5 match all rules in the rule set; if so, all tuples satisfy the consistency condition and the data cleaning is complete; if not, returning to step 4-2 to continue execution.
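As an illustration of sub-steps 5) to 7) of the integrity repair, the following minimal Python sketch sorts the complete tuples by distance, keeps the k nearest, and fills each missing item with the most frequent value among them. It assumes that missing items are marked with `None` and that the tuple distance is supplied by the caller and ignores missing columns; the name `knn_fill` is hypothetical.

```python
from collections import Counter

def knn_fill(missing, complete, distance, k):
    """Fill the None entries of `missing` from its k nearest complete tuples."""
    # Sort complete tuples by increasing distance and keep the first k.
    nearest = sorted(complete, key=lambda c: distance(missing, c))[:k]
    filled = list(missing)
    for col, value in enumerate(missing):
        if value is None:
            # Most frequent value of this column among the k neighbours.
            votes = Counter(c[col] for c in nearest)
            filled[col] = votes.most_common(1)[0][0]
    return tuple(filled)
```

Any distance that skips the `None` columns can be passed in, for example a simple count of mismatching non-missing fields.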
2. The method for cleaning multi-source heterogeneous data according to claim 1, wherein the step 3) in the integrity repair of the step 3-2 is specifically:
for the numerical subclass tuple, calculating a subclass distance between the missing subclass tuple and the complete subclass tuple by using a standardized Euclidean distance formula (1);
Figure FDA0003925109720000021
where n represents the total number of numerical data items in the subclass tuple, x_Li represents the i-th data item of the missing subclass tuple, x_Ci represents the i-th data item of the complete subclass tuple, and s_i represents the standard deviation of all values in the i-th column of the subclass tuple;
for binary subclass tuples, calculating subclass distances between missing subclass tuples and complete subclass tuples by using a formula (2);
Figure FDA0003925109720000022
where, if the two values of binary data are taken to be 0 and 1, p represents the number of positions at which the missing subclass tuple and the complete subclass tuple are both 1, q and r represent the numbers of positions at which the two tuples differ (one counting the positions where the missing subclass tuple is 0 and the complete subclass tuple is 1, the other counting the reverse case), and s represents the number of positions at which the missing subclass tuple and the complete subclass tuple are both 0;
for ordinal type subclass tuples, firstly, converting ordinal data in tuples into numerical data by using a formula (3), and then calculating subclass distances between missing subclass tuples and complete subclass tuples by using a numerical tuple distance formula (4);
Figure FDA0003925109720000023
D(L,C)_ordi = D(L,C)_num    formula (4)
wherein, if all values of the i-th column of the subclass tuple are regarded, in order, as a sequence numbered from 0 to N, then N_i denotes the total number N of sequence numbers of the i-th column, M_i denotes the sequence number of the data value in that sequence, and X_i denotes the converted numerical data;
for the categorical subclass tuple, calculating the subclass distance between the missing subclass tuple and the complete subclass tuple using formula (5);
Figure FDA0003925109720000031
where, since the number of data items in the missing subclass tuple and in the complete subclass tuple is the same, T represents the total number of data items in the missing subclass tuple or the complete subclass tuple, and E represents the number of positions at which the missing subclass tuple and the complete subclass tuple hold the same value;
for the text type sub-class tuples, the distance between character string data is calculated by using an editing distance formula (6), and then the sub-class distance between the missing sub-class tuples and the complete sub-class tuples is calculated by using a formula (7) and normalized;
Figure FDA0003925109720000032
Figure FDA0003925109720000033
wherein D_i(L,C)_text represents the edit distance between the i-th string in the missing subclass tuple and in the complete subclass tuple; L_j and C_k represent the first j and the first k characters of the i-th string in the missing subclass tuple and the complete subclass tuple, respectively, with 0 ≤ j ≤ U_i and 0 ≤ k ≤ V_i; Min denotes the minimum function; since the number of data items in the missing subclass tuple and the complete subclass tuple is the same, m represents the total number of strings in the missing subclass tuple or the complete subclass tuple; U_i and V_i represent the total lengths of the i-th string in the missing subclass tuple and the complete subclass tuple, respectively; and Max denotes the maximum function.
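Two of the subclass distances named in this claim can be sketched in Python. The formula images are not reproduced in this text, so the sketch rests on assumptions: the numerical distance uses the textbook standardized Euclidean distance, and the text distance is assumed to be the mean of per-string edit distances, each divided by the length of the longer string. The function names are hypothetical.

```python
import math

def numeric_distance(x_missing, x_complete, std):
    """Standardised Euclidean distance; std[i] is the i-th column's std dev."""
    return math.sqrt(sum(((a - b) / s) ** 2
                         for a, b, s in zip(x_missing, x_complete, std)))

def edit_distance(a, b):
    """Levenshtein distance computed over prefixes by dynamic programming."""
    prev = list(range(len(b) + 1))
    for j, ca in enumerate(a, start=1):
        cur = [j]
        for k, cb in enumerate(b, start=1):
            cur.append(min(prev[k] + 1,                 # delete ca
                           cur[k - 1] + 1,              # insert cb
                           prev[k - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def text_distance(strings_missing, strings_complete):
    """Mean edit distance, each scaled by the longer string (assumed form)."""
    total = 0.0
    for a, b in zip(strings_missing, strings_complete):
        longest = max(len(a), len(b))
        total += edit_distance(a, b) / longest if longest else 0.0
    return total / len(strings_missing)
```

Scaling by the longer string keeps each per-string term within [0, 1], which makes the text distance comparable with the other subclass distances.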
3. The method for cleaning multi-source heterogeneous data according to claim 2, wherein the step 4) in the integrity repair in the step 3-2 is specifically:
the tuple distance between the missing tuple t_1 and the complete tuple t_2 is obtained by multiplying each of the above 5 types of subclass distances by its corresponding external weight W_i and summing the products, as given by formula (8) and formula (9);
Figure FDA0003925109720000034
Figure FDA0003925109720000035
where i indexes the 5 types of subclass tuples, W_i represents the weight coefficient of the i-th type of subclass tuple in the current tuple, D_i(t_1, t_2) represents the subclass distance between the i-th type of missing subclass tuple and complete subclass tuple, Y denotes the total number of data items in the current tuple, and Y_i represents the number of data items of the i-th type in the current tuple.
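The weighted combination can be sketched in Python as below. Because formulas (8) and (9) are not reproduced in this text, the weight of each type is assumed here to be the share of that type among the data items of the tuple (W_i = Y_i / Y); this is a reading of the symbol definitions, not a confirmed formula. The name `tuple_distance` and the dictionary layout are hypothetical.

```python
def tuple_distance(subclass_distances, subclass_counts):
    """Weighted sum of subclass distances with assumed weights Y_i / Y.

    subclass_distances: type name -> subclass distance D_i.
    subclass_counts:    type name -> number of data items Y_i of that type.
    """
    total = sum(subclass_counts.values())  # Y
    return sum(subclass_distances[kind] * count / total
               for kind, count in subclass_counts.items())
```

For example, a tuple with three numerical columns at distance 0.5 and one text column at distance 1.0 would get 0.5 × 3/4 + 1.0 × 1/4 under this assumption.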
4. The method according to claim 3, wherein the k value in step 6) of the integrity repair in step 3-2 is obtained by:
6-1) dividing all complete tuples into a test tuple set and a training tuple set;
6-2) dividing the training tuple set into n sub-tuple sets with the same size;
6-3) taking each sub-tuple set in turn as the complete tuple set T_C and repairing the current missing tuple using each of the values 1 to 100 as the training k value;
6-4) obtaining, for each sub-tuple set, the k value with the highest repair accuracy;
6-5) the average of these n k values is used as the repair k value for the set of test tuples.
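The k-selection procedure of this claim can be sketched in Python as follows. The repair accuracy is left to a caller-supplied callback `accuracy(fold, k)`, which is a hypothetical interface; rounding the averaged k to an integer is also an assumption, since the claim only states that the average is used.

```python
def choose_k(training, n, accuracy, k_values=range(1, 101)):
    """Pick k by averaging the best k found on n equal-sized sub-tuple sets.

    accuracy(fold, k): hypothetical callback returning the repair accuracy
    obtained when `fold` serves as the complete tuple set with k neighbours.
    """
    size = len(training) // n
    folds = [training[i * size:(i + 1) * size] for i in range(n)]
    # Best k for each sub-tuple set (first one wins on ties).
    best = [max(k_values, key=lambda k: accuracy(fold, k)) for fold in folds]
    # Average of the n best values, rounded to an integer (assumption).
    return round(sum(best) / n)
```

Tuples beyond `n * size` are dropped by the slicing so that all sub-tuple sets have the same size.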
5. The method according to claim 1, wherein the consistency repair in step 4 mainly comprises determining a rule repair order, locating an abnormal tuple, and selecting a target tuple; the method comprises the following steps:
4-2-1 determining a rule repair order
1) Constructing a rule sequence graph G(V, E) by taking the conditional functional dependencies in the abnormal rule set Σ' as nodes V and the dependency relationships between nodes as edges E, wherein V = Σ'; for any two conditional functional dependencies (Figure FDA0003925109720000041), if (Figure FDA0003925109720000042), then (Figure FDA0003925109720000043) has an edge (Figure FDA0003925109720000044) pointing to (Figure FDA0003925109720000045), that is, (Figure FDA0003925109720000046) have a dependency relationship, where L and R denote the left-hand side and the right-hand side of a conditional functional dependency, respectively;
2) Sequentially selecting nodes with an in-degree of 0 in the rule sequence graph as priority repair rules and adding them to the repair rule set Σ_rep, then deleting each selected node and the edges connected to it, until no nodes remain in the rule sequence graph G(V, E); if the rule sequence graph is not empty and contains no node with an in-degree of 0, selecting, from all combinations of conditional functional dependencies in the rule sequence graph, the combination with the smallest total repair cost as the repair rule set Σ_rep;
4-2-2 location anomaly tuples
Sequentially traversing all rules in the repair rule set Σ_rep and adding every tuple that violates the current rule to the abnormal tuple set T_e;
4-2-3 select target tuples
Taking into account the external constraint rules Σ_fc in the rule set Σ, selecting the tuple with the smallest repair cost in the abnormal tuple set T_e as the target tuple, and repairing the other abnormal tuples using the target tuple.
6. The method for cleaning multi-source heterogeneous data according to claim 1, wherein in step 5, an improved SNM algorithm based on a mixed distance and a dynamic window is used for the data processed in step 4, and whether the tuple distance between the first tuple in the sliding window and other tuples in the window is smaller than a set distance threshold is checked; if yes, the two tuple data are considered to be similar and repeated, the repeated tuple in the window is deleted, and if not, the first tuple and other tuples are considered to meet the uniqueness condition; and moving out the first tuple in the sliding window and moving in the next tuple of the last tuple in the window, and repeating the steps until all tuples finish uniqueness check, so as to realize data uniqueness check and repair.
7. The method for cleaning multi-source heterogeneous data according to claim 6, wherein the improved SNM algorithm based on the hybrid distance and the dynamic window comprises the following specific steps:
5-1) selecting one or more data for all tuples, calculating corresponding key values of the data and using the key values as sorting keywords;
5-2) sorting all tuples according to the sorting keywords;
5-3) setting a sliding window with the initial size of N and the step length of 1 on the sorted tuples, calculating the tuple distance between the first tuple in the sliding window and other tuples in the window, deleting repeated tuples in the sliding window if at least one tuple distance is smaller than a distance threshold value, and otherwise moving the sliding window by one step length;
5-4) calculating the ratio of the tuple distance between the head tuple and the tail tuple in the sliding window to the tuple number in the sliding window, taking the ratio as the window average density, increasing the size of the sliding window if the window average density is smaller than a density threshold, keeping the size of the sliding window unchanged if the window average density is equal to the density threshold, decreasing the size of the sliding window if the window average density is larger than the density threshold, and continuing to slide until all tuple inspection is finished.
8. A multi-source heterogeneous data cleaning apparatus for implementing the method according to any one of claims 1 to 7, characterized in that the apparatus comprises the following modules:
the data acquisition and preprocessing module is used for acquiring multi-source heterogeneous data and converting the data with the same attribute into data with a uniform format;
the rule set building module is used for building a rule set containing conditional function dependence and external constraint on the data from the data acquisition and preprocessing module;
the integrity checking and repairing module is used for checking whether the data acquired by the data acquisition and preprocessing module is missing or not and then dividing the data into a missing tuple set and a complete tuple set; sequentially checking whether missing items of all missing tuples in the missing tuple set are matched with certain rules in the rule set, if so, filling missing data of the current missing tuple by using the rules, otherwise, filling the missing data of the current missing tuple by using a mixed filling algorithm based on improved KNN;
the consistency checking and repairing module is used for checking whether the data processed by the integrity checking and repairing module violates the rules in the rule set or not, if so, determining the repairing rules, and taking all tuples violating the repairing rules as abnormal tuples; selecting the tuple with the minimum repairing cost from all abnormal tuples as a target tuple, and repairing other abnormal tuples by using the target tuple;
the uniqueness checking and repairing module is used for checking and deleting repeated tuples on the data by using an improved SNM algorithm based on a mixed distance and a dynamic window for the data processed by the consistency checking and repairing module;
and the consistency secondary checking module is used for checking whether the data processed by the uniqueness checking and repairing module meets the consistency condition, finishing the cleaning of the data if the data meets the consistency condition, and returning to the consistency checking and repairing module to execute again if the data does not meet the consistency condition.
CN202111577423.6A 2021-12-22 2021-12-22 Multi-source heterogeneous data cleaning method and device Active CN114281809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111577423.6A CN114281809B (en) 2021-12-22 2021-12-22 Multi-source heterogeneous data cleaning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111577423.6A CN114281809B (en) 2021-12-22 2021-12-22 Multi-source heterogeneous data cleaning method and device

Publications (2)

Publication Number Publication Date
CN114281809A CN114281809A (en) 2022-04-05
CN114281809B true CN114281809B (en) 2023-03-28

Family

ID=80873920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111577423.6A Active CN114281809B (en) 2021-12-22 2021-12-22 Multi-source heterogeneous data cleaning method and device

Country Status (1)

Country Link
CN (1) CN114281809B (en)


Also Published As

Publication number Publication date
CN114281809A (en) 2022-04-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant