CN106599193A - Data cleaning method and system - Google Patents


Info

Publication number
CN106599193A
Authority
CN
China
Prior art keywords
data
cleaning
dirty
cleaning object
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611152151.4A
Other languages
Chinese (zh)
Inventor
曹敏
杨政
黄星
赵薇
杨莉
张林山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of Yunnan Power System Ltd
Original Assignee
Electric Power Research Institute of Yunnan Power System Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of Yunnan Power System Ltd
Priority to CN201611152151.4A
Publication of CN106599193A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data cleaning method and system. The data cleaning method comprises the steps of: extracting data from a data source into an intermediate database, and determining the data extracted into the intermediate database as data objects; extracting the data objects from the intermediate database, and analyzing them to obtain the constraint relationship corresponding to each data object; matching the dirty data types of the data objects and the data features corresponding to the dirty data types, and selecting the data objects that match the dirty data types and the data features as cleaning objects; setting a cleaning order for the cleaning objects according to the constraint relationships, and adding the cleaning objects to a cleaning queue one by one in that cleaning order; matching, in a data cleaning model, the data cleaning method corresponding to the dirty data type of each cleaning object, loading the data cleaning model, and cleaning the cleaning objects in the cleaning queue one by one in the cleaning order; and storing the cleaned cleaning objects in the data source. The technical solution provided by the invention can reduce the time and the error rate of data cleaning.

Description

A data cleaning method and system
Technical field
The present invention relates to the technical field of power systems, and more specifically to a data cleaning method and system.
Background
With the wide application and development of data warehouse and data mining technologies, the industry places higher demands on how decisions are made by analyzing massive amounts of data. In current analysis and decision-making, enterprises pay increasing attention to how to mine more useful hidden information from existing massive data, and how to use that information to guide and predict the enterprise's future development.
When guidance and predictions for an enterprise are built on a data warehouse of historical data, data quality becomes critical. In line with the "garbage in, garbage out" principle, the data in a data warehouse often suffers from quality problems such as missing data, noisy data, inconsistent data and redundant data. Such dirty data often leads to long response times and high operating costs, and affects both the accuracy of rules derived from the data and the correctness of patterns mined from hidden information in the data, which in turn causes decision support systems to mislead decisions.
Enterprise demand for processing dirty data with quality problems grows daily, and the requirements on data cleaning rise accordingly. Traditional dirty data cleaning still relies mainly on manual processing of the data in different databases. This approach is not only very time-consuming; because of the many uncontrollable factors involved, the error rate of data cleaning also increases, so that data quality improves only slightly and reliability is weak.
Summary of the invention
The object of the present invention is to provide a technical solution for data cleaning that solves the problems described in the background, namely that manual data cleaning in the prior art is time-consuming and increases the error rate of data cleaning.
To solve the above technical problems, the present invention provides the following technical solutions:
The present invention provides a data cleaning method, which comprises:
extracting data from a data source into an intermediate database according to the storage addresses, contained in a data source model, of the data in the data source, and determining the data extracted into the intermediate database as data objects;
extracting the data objects from the intermediate database and analyzing the data architecture of the data objects to obtain the constraint relationship corresponding to each data object;
matching, according to a dirty data feature adaptation model, the dirty data type of each data object and the data features corresponding to that dirty data type, and selecting the data objects that match a dirty data type and its data features as cleaning objects, wherein the dirty data feature adaptation model comprises dirty data types and the data features corresponding to each dirty data type;
setting a cleaning order for the cleaning objects according to the constraint relationships, and adding the cleaning objects to a cleaning queue one by one in that cleaning order;
matching the data cleaning method corresponding to the dirty data type of each cleaning object, loading the data cleaning model, and cleaning the cleaning objects in the cleaning queue one by one in the cleaning order, wherein the data cleaning model comprises the cleaning methods corresponding to the various dirty data types;
storing the cleaned cleaning objects in the data source.
Preferably, the data cleaning method further comprises:
judging whether the cleaned cleaning object meets the data cleaning standard in a data quality standard model;
if the cleaning object does not meet the data cleaning standard, adding the cleaning object to the cleaning queue again according to the constraint relationship, and re-executing the step of cleaning the cleaning object in the cleaning order;
if the cleaning object meets the data cleaning standard, storing the cleaning object in the data source.
Preferably, the data cleaning standard comprises a data format standard, a data feature value range standard and/or a data constraint relationship standard; judging whether the cleaned cleaning object meets the data cleaning standard in the data quality standard model comprises:
judging whether the data format of the cleaned cleaning object meets the data format standard;
judging whether the data feature values of the cleaned cleaning object meet the data feature value range standard; and/or
judging whether the data constraint relationship of the cleaned cleaning object meets the data constraint relationship standard.
Preferably, the data cleaning method further comprises:
generating a data backup model according to the data feature values of the data objects and the constraint relationships corresponding to the data objects;
judging whether any data feature value of the cleaned cleaning object is missing;
if a data feature value of the cleaned cleaning object is missing, recovering the data feature values of the cleaning object according to the data feature values and constraint relationships in the data backup model;
storing the recovered cleaning object in the data source.
Preferably, setting the cleaning order of the cleaning objects according to the constraint relationships and adding the cleaning objects to the cleaning queue one by one in that order comprises:
determining, according to the constraint relationships, the constraint level of the data feature values in each cleaning object;
adding the cleaning objects to the cleaning queue one by one in order of constraint level from low to high.
According to a second aspect of the present invention, a data cleaning system is further provided, which comprises:
a data extraction interface module, configured to extract data from a data source into an intermediate database according to the storage addresses, contained in a data source model, of the data in the data source, and to determine the data extracted into the intermediate database as data objects;
a data architecture analysis module, configured to extract the data objects from the intermediate database and analyze the data architecture of the data objects to obtain the constraint relationship corresponding to each data object;
a data type and feature analysis module, configured to match, according to a dirty data feature adaptation model, the dirty data type of each data object and the data features corresponding to that dirty data type, and to select the data objects that match a dirty data type and its data features as cleaning objects, wherein the dirty data feature adaptation model comprises dirty data types and the data features corresponding to each dirty data type;
a cleaning order setting module, configured to set a cleaning order for the cleaning objects according to the constraint relationships, and to add the cleaning objects to a cleaning queue one by one in that cleaning order;
a data cleaning module, configured to match the data cleaning method corresponding to the dirty data type of each cleaning object, load the data cleaning model, and clean the cleaning objects in the cleaning queue one by one in the cleaning order, wherein the data cleaning model comprises the cleaning methods corresponding to the various dirty data types;
a data storage module, configured to store the cleaned cleaning objects in the data source.
Preferably, the data cleaning system further comprises:
a cleaning standard judgment module, configured to judge whether the cleaned cleaning object meets the data cleaning standard in a data quality standard model;
the cleaning order setting module is further configured to add the cleaning object to the cleaning queue again according to the constraint relationship if the cleaning object does not meet the data cleaning standard;
the data storage module is further configured to store the cleaning object in the data source if the cleaning object meets the data cleaning standard.
Preferably, the data cleaning standard comprises a data format standard, a data feature value range standard and/or a data constraint relationship standard; the cleaning standard judgment module comprises:
a first judgment submodule, configured to judge whether the data format of the cleaned cleaning object meets the data format standard;
a second judgment submodule, configured to judge whether the data feature values of the cleaned cleaning object meet the data feature value range standard; and/or
a third judgment submodule, configured to judge whether the data constraint relationship of the cleaned cleaning object meets the data constraint relationship standard.
Preferably, the data cleaning system further comprises:
a data backup model generation module, configured to generate a data backup model according to the data feature values of the data objects and the constraint relationships corresponding to the data objects;
a missing data judgment module, configured to judge whether any data feature value of the cleaned cleaning object is missing;
a data recovery module, configured to recover, if a data feature value of the cleaned cleaning object is missing, the data feature values of the cleaning object according to the data feature values and constraint relationships in the data backup model;
the data storage module is further configured to store the recovered cleaning object in the data source.
Preferably, the cleaning order setting module comprises:
a constraint level determination submodule, configured to determine, according to the constraint relationships, the constraint level of the data feature values in each cleaning object;
a cleaning object adding submodule, configured to add the cleaning objects to the cleaning queue one by one in order of constraint level from low to high.
The data cleaning solution provided by the embodiments of the present invention selects the data objects that match a dirty data type and its data features as cleaning objects, sets the cleaning order of the cleaning objects according to the constraint relationship corresponding to each data object, matches in the data cleaning model the data cleaning method corresponding to the dirty data type of each cleaning object, and cleans the cleaning objects in the cleaning queue one by one in that cleaning order. By cleaning the data in the data source with techniques such as data extraction, feature adaptation and model-driven processing, the cleaning of various kinds of data can be completed quickly and accurately, while the time, labor and material costs required for data cleaning are reduced and the quality of the cleaned data is improved.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. A person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a first data cleaning method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for setting the cleaning order in the embodiment shown in Fig. 1;
Fig. 3 is a flowchart of a second data cleaning method according to an embodiment of the present invention;
Fig. 4 is a flowchart of a third data cleaning method according to an embodiment of the present invention;
Fig. 5 is a structural diagram of a first data cleaning system according to an embodiment of the present invention;
Fig. 6 is a structural diagram of the cleaning order setting module in the embodiment shown in Fig. 5;
Fig. 7 is a structural diagram of a second data cleaning system according to an embodiment of the present invention;
Fig. 8 is a structural diagram of a third data cleaning system according to an embodiment of the present invention;
Fig. 9 is a structural diagram of a fourth data cleaning system according to an embodiment of the present invention.
Detailed description
The data cleaning solution provided by the embodiments of the present invention solves the problems, described in the background, that manual data cleaning is time-consuming and increases the error rate of data cleaning.
To help those skilled in the art better understand the technical solutions in the embodiments of the present invention, and to make the above objects, features and advantages of the embodiments clearer, the technical solutions are described in further detail below with reference to the drawings.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a data cleaning method according to an embodiment of the present invention. The data cleaning method of this embodiment comprises the following steps:
S110: Extracting data from a data source into an intermediate database according to the storage addresses, contained in a data source model, of the data in the data source, and determining the data extracted into the intermediate database as data objects.
The data source is the origin of the data set to be cleaned, and includes general-purpose relational databases such as Oracle, SQL Server, SQLite and MySQL. In the data cleaning method provided by this embodiment, a data source model must be created before data is extracted from the data source, in order to obtain and record information such as the storage addresses of the data in the data source. Data is then extracted from the data source into the intermediate database according to the storage addresses provided by the data source model, and that data serves as the data objects on which cleaning is performed. In this way the data in the data source can be obtained accurately, and hence cleaned accurately. A data object is data suspected of containing dirty values, and data objects are extracted in units of fields or data tables.
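Step S110 can be illustrated with a minimal sketch, assuming SQLite for both the data source and the intermediate database. The `SourceEntry` class, the `extract_to_intermediate` function and the `meter` table are hypothetical names chosen for illustration; the patent does not specify how the data source model is represented.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class SourceEntry:
    """One entry of a hypothetical data source model: where a data object is stored."""
    table: str
    columns: list


def extract_to_intermediate(source, intermediate, model):
    """Copy every table/column set named in the model into the intermediate database."""
    extracted = []
    for entry in model:
        cols = ", ".join(entry.columns)
        rows = source.execute(f"SELECT {cols} FROM {entry.table}").fetchall()
        intermediate.execute(f"CREATE TABLE IF NOT EXISTS {entry.table} ({cols})")
        placeholders = ", ".join("?" for _ in entry.columns)
        intermediate.executemany(
            f"INSERT INTO {entry.table} VALUES ({placeholders})", rows
        )
        extracted.append(entry.table)
    intermediate.commit()
    return extracted


# Illustrative data source with one suspicious reading.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE meter (id INTEGER PRIMARY KEY, voltage REAL, note TEXT)")
source.executemany(
    "INSERT INTO meter VALUES (?, ?, ?)",
    [(1, 220.5, "ok"), (2, 9999.0, "spike")],
)

intermediate = sqlite3.connect(":memory:")
model = [SourceEntry("meter", ["id", "voltage"])]  # only the suspected columns
data_objects = extract_to_intermediate(source, intermediate, model)
```

Working on a copy in the intermediate database keeps the data source untouched until cleaned objects are written back in step S160.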
S120: Extracting the data objects from the intermediate database and analyzing the data architecture of the data objects to obtain the constraint relationship corresponding to each data object.
Analyzing the data architecture of a data object includes analyzing its unique constraints, primary key constraints, foreign key constraints, check constraints, null constraints, field composition and field types, so as to obtain the constraint relationship corresponding to each data object.
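As a hedged illustration of step S120, the sketch below reads part of a table's constraint information from SQLite using the `PRAGMA table_info` and `PRAGMA foreign_key_list` statements. The `analyze_constraints` function and the `station`/`meter` tables are hypothetical; unique and check constraints are left out for brevity, and other database products expose this metadata through different catalogs.

```python
import sqlite3


def analyze_constraints(conn, table):
    """Collect primary key, NOT NULL, field type and foreign key information for one table."""
    # table_info rows: (cid, name, type, notnull, dflt_value, pk)
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
    fks = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    return {
        "primary_key": [row[1] for row in info if row[5] > 0],
        "not_null": [row[1] for row in info if row[3] == 1],
        "field_types": {row[1]: row[2] for row in info},
        # (local column, referenced table, referenced column)
        "foreign_keys": [(fk[3], fk[2], fk[4]) for fk in fks],
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE station (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE meter ("
    "id INTEGER PRIMARY KEY, "
    "station_id INTEGER NOT NULL REFERENCES station(id), "
    "voltage REAL)"
)
constraints = analyze_constraints(conn, "meter")
```

The foreign key entries are the part that later drives the cleaning order, since they record which table depends on which.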
S130: Matching, according to a dirty data feature adaptation model, the dirty data type of each data object and the data features corresponding to that dirty data type, and selecting the data objects that match a dirty data type and its data features as cleaning objects, wherein the dirty data feature adaptation model comprises dirty data types and the data features corresponding to each dirty data type.
The dirty data feature adaptation model comprises dirty data types and their corresponding data features. The dirty data types include suspicious field types such as similar data, noisy data, out-of-limit data, missing data and redundant data. By matching data objects against the dirty data types and their corresponding data features, the data objects that contain a particular type of dirty data can be accurately identified as cleaning objects, which ensures cleaning efficiency and cleaning effect. The data features corresponding to a dirty data type include, for example, the value interval for out-of-limit data. When matching a data object, it suffices to check whether the feature value in a given column of a field lies within that interval; if it does not, the field is identified as a cleaning object. For example, if column A1 of table A is defined as subject to out-of-limit data, then during matching only the reasonable value interval for that column needs to be entered. When the A1 value of a field in table A falls outside the reasonable value interval, the field is screened out and its A1 value is determined to be out-of-limit data.
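The out-of-limit example above can be sketched as a simple interval check. The `RangeFeature` class, the `screen_out_of_limit` function and the interval bounds are assumptions made for illustration; missing values are skipped here because the text treats missing data as a separate dirty data type.

```python
from dataclasses import dataclass


@dataclass
class RangeFeature:
    """Hypothetical data feature for the out-of-limit dirty data type."""
    column: str
    low: float
    high: float


def screen_out_of_limit(rows, feature):
    """Return the rows whose value in feature.column lies outside [low, high]."""
    return [
        row for row in rows
        if row[feature.column] is not None
        and not (feature.low <= row[feature.column] <= feature.high)
    ]


rows = [
    {"id": 1, "A1": 220.0},
    {"id": 2, "A1": 9999.0},  # above the interval
    {"id": 3, "A1": -5.0},    # below the interval
    {"id": 4, "A1": None},    # missing, not out-of-limit
]
feature = RangeFeature(column="A1", low=0.0, high=400.0)
cleaning_objects = screen_out_of_limit(rows, feature)
```

Only rows 2 and 3 are selected as cleaning objects; rows inside the interval never enter the cleaning queue.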
S140: Setting a cleaning order for the cleaning objects according to the constraint relationships, and adding the cleaning objects to a cleaning queue one by one in that cleaning order.
As a preferred embodiment, as shown in Fig. 2, step S140 of setting the cleaning order of the cleaning objects according to the constraint relationships and adding the cleaning objects to the cleaning queue one by one in that order comprises:
S141: Determining, according to the constraint relationships, the constraint level of the data feature values in each cleaning object;
S142: Adding the cleaning objects to the cleaning queue one by one in order of constraint level from low to high.
Across data tables, the data feature values of a field in one table often have a constraint relationship with the data feature values of a field in another table. For example, the data feature values of column A2, corresponding to field AI in table A, are the parent of the data feature values of column B2, corresponding to field BII in table B. In this case, when the cleaning objects are cleaned, the data with a lower constraint level must be cleaned first and the data with a higher constraint level afterwards, which ensures the reliability of data cleaning.
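Steps S141 and S142 can be sketched as a sort followed by a queue. The sketch assumes that each cleaning object has already been assigned an integer constraint level, with a parent such as A2 given a higher level than its child B2; how the levels are derived from the constraint relationships is not specified in the text, and the names below are hypothetical.

```python
from collections import deque


def build_cleaning_queue(levels):
    """Queue cleaning objects in order of constraint level, lowest first.

    levels maps a cleaning object name to an integer constraint level.
    sorted() is stable, so objects with equal levels keep their input order.
    """
    ordered = sorted(levels, key=lambda name: levels[name])
    return deque(ordered)


# A.A2 is the parent of B.B2, so it is assumed to carry the higher level.
levels = {"A.A2": 2, "B.B2": 1, "C.C1": 1}
queue = build_cleaning_queue(levels)
```

Here the queue holds B.B2 and C.C1 before A.A2, so dependent values are cleaned before the values they refer to.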
S150: Matching the data cleaning method corresponding to the dirty data type of each cleaning object, loading the data cleaning model, and using the data cleaning model to clean the cleaning objects in the cleaning queue one by one in the cleaning order, wherein the data cleaning model comprises the cleaning methods corresponding to the various dirty data types.
The data cleaning model comprises the cleaning method corresponding to each dirty data type. When a cleaning object is cleaned, the data cleaning method corresponding to its dirty data type is matched first; the data cleaning model is then loaded, and the dirty data in the cleaning object is cleaned with the method in the model that corresponds to the dirty data type of that cleaning object.
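One way to picture the data cleaning model is as a lookup table from dirty data type to cleaning function. The sketch below is an assumption for illustration only: the two cleaning strategies (clamping and de-duplication) and all names are hypothetical stand-ins, since the text does not describe the concrete cleaning methods.

```python
def clean_out_of_limit(values, low=0.0, high=400.0):
    """Clamp each value into [low, high]; a stand-in strategy for out-of-limit data."""
    return [min(max(value, low), high) for value in values]


def clean_redundant(values):
    """Drop repeated values, keeping the first occurrence of each in order."""
    return list(dict.fromkeys(values))


# Hypothetical data cleaning model: dirty data type -> cleaning method.
CLEANING_MODEL = {
    "out_of_limit": clean_out_of_limit,
    "redundant": clean_redundant,
}


def clean_queue(queue):
    """Clean (name, dirty_type, values) entries in the order they are queued."""
    results = {}
    for name, dirty_type, values in queue:
        method = CLEANING_MODEL[dirty_type]  # match the method to the dirty data type
        results[name] = method(values)
    return results


cleaned = clean_queue([
    ("B.B2", "out_of_limit", [220.0, 9999.0, -5.0]),
    ("A.A2", "redundant", [1, 1, 2, 1]),
])
```

Keeping the methods in a table means a new dirty data type can be supported by registering one more function, without touching the queue-processing loop.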
S160: Storing the cleaning objects in the data source.
Once a cleaning object has been cleaned, the dirty data in it has been turned into clean data. Storing the cleaning object in the data source at this point reduces the data quality problems in the data source and thus provides a reliable basis for data-driven analysis and decision-making.
The data cleaning method provided by this embodiment selects the data objects that match a dirty data type and its data features as cleaning objects, sets the cleaning order of the cleaning objects according to the constraint relationship corresponding to each data object, matches in the data cleaning model the data cleaning method corresponding to the dirty data type of each cleaning object, and cleans the cleaning objects in the cleaning queue one by one in that cleaning order. By cleaning the data in the data source with techniques such as data extraction, feature adaptation and model-driven processing, the cleaning of various kinds of data can be completed quickly and accurately, while the time, labor and material costs required for data cleaning are reduced and the quality of the cleaned data is improved.
A cleaned cleaning object does not necessarily meet the data cleaning standard and may still contain residual data quality problems. To eliminate the data quality problems in the cleaning object, as a preferred embodiment, as shown in Fig. 3, after step S160 shown in Fig. 1 (storing the cleaning objects in the data source), the data cleaning method of this embodiment further comprises the following steps:
S210: Judging whether the cleaned cleaning object meets the data cleaning standard in a data quality standard model.
The data cleaning standard comprises a data format standard, a data feature value range standard and/or a data constraint relationship standard. For example, the data format standard includes character type, integer type and date format standards; the data feature value range standard includes upper and lower limits for the data; and the data constraint relationship standard includes standards such as whether the data may be null and whether it has a constraint relationship with a specific foreign key.
Specifically, judging whether the cleaned cleaning object meets the data cleaning standard in the data quality standard model comprises the following steps: judging whether the data format of the cleaned cleaning object meets the data format standard; judging whether the data feature values of the cleaned cleaning object meet the data feature value range standard; and/or judging whether the data constraint relationship of the cleaned cleaning object meets the data constraint relationship standard.
By judging whether the data format, data feature values and/or data constraint relationship of the cleaned cleaning object meet the respective standards, it can be accurately determined whether the data quality problems of the cleaning object have been eliminated.
If the cleaning object does not meet the data cleaning standard, the cleaning object is added to the cleaning queue again according to the constraint relationship, and the step in S150 of cleaning the cleaning object in the cleaning order is re-executed.
When a cleaning object does not meet the data cleaning standard, it is added to the cleaning queue again and the above cleaning step is repeated until the data cleaning standard is met. This eliminates the data quality problems of the cleaning object as far as possible and reduces the amount of dirty data, providing a basis for subsequent analysis and decision-making.
If the cleaning object meets the data cleaning standard, step S160 is executed: storing the cleaning object in the data source.
When a cleaning object meets the data cleaning standard, storing the cleaning object, now free of data quality problems, in the data source reduces the amount of dirty data in the data source and provides a basis for subsequent analysis and decision-making.
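The check-and-requeue loop described above can be sketched as follows. The format check (a regular expression), the range bounds, the `max_rounds` guard and the one-character-per-pass `strip_one` cleaner are all assumptions added for illustration; in particular, the text repeats cleaning until the standard is met and does not mention a retry limit, and this sketch simply appends failed objects to the end of the queue rather than re-deriving their position from the constraint relationships.

```python
import re
from collections import deque


def meets_standard(value, pattern, low, high):
    """Format standard (regex on the string form) followed by a value range standard."""
    if value is None or re.fullmatch(pattern, str(value)) is None:
        return False
    return low <= float(value) <= high


def clean_until_valid(queue, clean, check, max_rounds=3):
    """Clean each queued object and re-queue it while it fails the standard."""
    stored, rejected, rounds = {}, {}, {}
    while queue:
        name, value = queue.popleft()
        value = clean(value)
        rounds[name] = rounds.get(name, 0) + 1
        if check(value):
            stored[name] = value          # would be written back to the data source
        elif rounds[name] < max_rounds:
            queue.append((name, value))   # add to the cleaning queue again
        else:
            rejected[name] = value        # give up after max_rounds passes
    return stored, rejected


def strip_one(value):
    """Toy cleaning method: remove one leading non-digit character per pass."""
    return value[1:] if value and not value[0].isdigit() else value


def check(value):
    return meets_standard(value, r"\d+(\.\d+)?", 0.0, 400.0)


queue = deque([("m1", "220.5"), ("m2", "##220"), ("m3", "####1")])
stored, rejected = clean_until_valid(queue, strip_one, check)
```

In this run `m1` passes on the first round, `m2` passes on the second, and `m3` is still malformed after three rounds and ends up in `rejected`.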
After a cleaning object has been cleaned, some correct feature value data in it may have been removed by mistake, leaving data feature values missing. To avoid this, as a preferred embodiment, as shown in Fig. 4, the data cleaning method shown in Fig. 4 comprises, in addition to the method steps shown in Fig. 1, the following steps after the cleaning object has been cleaned:
S310: Generating a data backup model according to the data feature values of the data objects and the constraint relationships corresponding to the data objects;
Because the data backup model is generated from the data feature values of the data objects and their corresponding constraint relationships, it contains the constraint relationship and data feature values of every data object. Therefore, when a cleaned cleaning object has missing data, recovering the cleaning object from the data backup model reduces the cases of missing data feature values.
S320: Judging whether any data feature value of the cleaned cleaning object is missing;
S330: If a data feature value of the cleaned cleaning object is missing, recovering the data feature values of the cleaning object according to the data feature values and constraint relationships in the data backup model;
After the data feature values of the cleaning object have been recovered, step S160 in Fig. 1 is executed: storing the cleaning object in the data source.
When data feature values of a cleaned cleaning object are missing, recovering them according to the data feature values and constraint relationships in the data backup model reduces the cases in which cleaned cleaning objects have missing data, so that correct cleaning objects are stored in the data source.
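Steps S310 to S330 can be sketched as a snapshot taken before cleaning and a selective restore afterwards. The sketch assumes the backup is a deep copy of the feature values and that the constraint relationships are expressed as a validity predicate; `make_backup`, `recover_missing`, `is_valid` and the sample values are hypothetical and not taken from the patent.

```python
import copy


def make_backup(data_objects):
    """Snapshot the feature values of every data object before cleaning."""
    return copy.deepcopy(data_objects)


def recover_missing(cleaned, backup, is_valid):
    """Restore values that are missing after cleaning when the backed-up value is valid."""
    recovered = copy.deepcopy(cleaned)
    for name, fields in recovered.items():
        for key, value in fields.items():
            original = backup[name].get(key)
            if value is None and original is not None and is_valid(key, original):
                fields[key] = original
    return recovered


def is_valid(key, value):
    """Stand-in for the constraint relationships kept in the backup model."""
    if key == "voltage":
        return 0.0 <= value <= 400.0
    if key == "station_id":
        return value in {7, 8}  # assumed set of existing parent keys
    return True


backup = make_backup({
    "m1": {"voltage": 220.0, "station_id": 7},
    "m2": {"voltage": 9999.0, "station_id": 8},
})
cleaned = {
    "m1": {"voltage": 220.0, "station_id": None},  # correct value removed by mistake
    "m2": {"voltage": None, "station_id": 8},      # dirty value removed on purpose
}
recovered = recover_missing(cleaned, backup, is_valid)
```

Checking the backed-up value against the constraints before restoring it keeps genuinely dirty values, such as the 9999.0 reading, from being reintroduced.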
Based on the same inventive concept, the embodiments of the present application further provide a data cleaning system. Since the method corresponding to the system is the data cleaning method of the embodiments of the present application, and the system solves the problem on a principle similar to that of the method, the implementation of the system may refer to the implementation of the method, and repeated details are omitted.
Fig. 5 is a schematic structural diagram of a first data cleaning system according to an embodiment of the present invention. As shown in Fig. 5, the data cleaning system comprises:
a data extraction interface module 501, configured to extract data from a data source into an intermediate database according to the storage addresses, contained in a data source model, of the data in the data source, and to determine the data extracted into the intermediate database as data objects;
a data architecture analysis module 502, configured to extract the data objects from the intermediate database and analyze the data architecture of the data objects to obtain the constraint relationship corresponding to each data object;
a data type and feature analysis module 503, configured to match, according to a dirty data feature adaptation model, the dirty data type of each data object and the data features corresponding to that dirty data type, and to select the data objects that match a dirty data type and its data features as cleaning objects, wherein the dirty data feature adaptation model comprises dirty data types and the data features corresponding to each dirty data type;
a cleaning order setting module 504, configured to set a cleaning order for the cleaning objects according to the constraint relationships, and to add the cleaning objects to a cleaning queue one by one in that cleaning order;
As shown in Fig. 6, the cleaning order setting module 504 comprises: a constraint level determination submodule 5041, configured to determine, according to the constraint relationships, the constraint level of the data feature values in each cleaning object; and a cleaning object adding submodule 5042, configured to add the cleaning objects to the cleaning queue one by one in order of constraint level from low to high.
a data cleaning module 505, configured to match the data cleaning method corresponding to the dirty data type of each cleaning object, load the data cleaning model, and clean the cleaning objects in the cleaning queue one by one in the cleaning order, wherein the data cleaning model comprises the cleaning methods corresponding to the various dirty data types;
a data storage module 506, configured to store the cleaned cleaning objects in the data source.
The data cleaning system provided by the embodiments of the present invention screens out, as cleaning objects, the data objects whose dirty data types and data features match; sets the cleaning order of the cleaning objects according to the constraint relationship of each data object; matches, in the data cleaning model, the data cleaning method corresponding to the dirty data type of each cleaning object; and cleans the cleaning objects in the cleaning queue one by one in that order. By cleaning the data in the data source through data extraction, feature adaptation and model-driven techniques, the system can complete cleaning operations on various kinds of data quickly and accurately, while reducing the time, labor and material costs required for data cleaning and improving cleaning quality.
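The ordering step performed by the order setting module can be illustrated with a short Python sketch. The text does not say how a constraint level is computed, so `constraint_level` here is a hypothetical integer score (0 for an unconstrained field, higher for fields involved in key relationships), and `CleaningObject` and `build_cleaning_queue` are illustrative names rather than parts of the described system.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CleaningObject:
    table: str             # table that holds the dirty field
    column: str            # field flagged as dirty
    dirty_type: str        # e.g. "missing", "out_of_limit", "similar"
    constraint_level: int  # hypothetical score: higher = more tightly constrained


def build_cleaning_queue(objects):
    """Return a queue ordered from the lowest to the highest constraint level."""
    # sorted() is stable, so objects with equal levels keep their input order.
    return deque(sorted(objects, key=lambda obj: obj.constraint_level))


objects = [
    CleaningObject("orders", "customer_id", "missing", constraint_level=2),
    CleaningObject("meter_log", "voltage", "out_of_limit", constraint_level=0),
    CleaningObject("customers", "id", "similar", constraint_level=3),
]
queue = build_cleaning_queue(objects)
print([f"{obj.table}.{obj.column}" for obj in queue])
```

Cleaning the least constrained fields first means that changes to loosely coupled data are finished before fields that other tables depend on are touched.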
Fig. 7 is a schematic structural diagram of a second data cleaning system according to an embodiment of the present invention. As shown in Fig. 7, in addition to the modules shown in Fig. 5, the data cleaning system of this embodiment further includes:
Cleaning standard judging module 507, configured to judge whether the cleaned cleaning object meets the data cleaning standard in the data quality standard model. The data cleaning standard includes a data format standard, a data feature value range standard and/or a data constraint relationship standard.
The cleaning standard judging module 507 includes: a first judging submodule, configured to judge whether the data format of the cleaned cleaning object meets the data format standard; a second judging submodule, configured to judge whether the data feature values of the cleaned cleaning object meet the data feature value range standard; and/or a third judging submodule, configured to judge whether the data constraint relationship of the cleaned cleaning object meets the data constraint relationship standard.
The data cleaning order setting module 504 is further configured to, if the cleaning object does not meet the data cleaning standard, add the cleaning object to the cleaning queue again according to the constraint relationship;
The data storage module 506 is further configured to, if the cleaning object meets the data cleaning standard, store the cleaning object in the data source.
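The check-and-requeue behaviour of this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions: the `STANDARD` dictionary, the single-defect cleaning pass `clean_once`, and the `max_rounds` safety limit are all hypothetical and are not specified in the text.

```python
from collections import deque

# Hypothetical cleaning standard for one field: format, value range, nullability.
STANDARD = {"type": float, "lower": 0.0, "upper": 500.0, "nullable": False}


def meets_standard(values, std):
    """Check every value against the null, format and range rules."""
    for v in values:
        if v is None:
            if not std["nullable"]:
                return False
            continue
        if not isinstance(v, std["type"]):
            return False
        if not (std["lower"] <= v <= std["upper"]):
            return False
    return True


def clean_once(values, std):
    """Illustrative pass that fixes only one kind of defect per call."""
    if any(v is None for v in values):
        return [v for v in values if v is not None]  # drop missing values
    return [min(max(v, std["lower"]), std["upper"]) for v in values]  # clamp


def clean_until_standard(queue, std, max_rounds=5):
    """Clean queued objects; requeue any that still fail the standard."""
    stored = {}
    rounds = 0
    while queue and rounds < max_rounds:  # max_rounds is an assumed safety limit
        name, values = queue.popleft()
        values = clean_once(values, std)
        if meets_standard(values, std):
            stored[name] = values         # stands in for storing to the data source
        else:
            queue.append((name, values))  # add the object to the queue again
        rounds += 1
    return stored


stored = clean_until_standard(deque([("voltage", [220.0, None, 900.0])]), STANDARD)
print(stored)
```

In this example the first pass drops the missing value, the object fails the range check and is requeued, and the second pass clamps the out-of-range value so the object can be stored.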
Fig. 8 is a schematic structural diagram of a third data cleaning system according to an embodiment of the present invention. As shown in Fig. 8, the data cleaning system further includes:
Data backup model generation module 508, configured to generate a data backup model according to the data feature values of the data objects and the constraint relationships corresponding to the data objects;
Data missing judging module 509, configured to judge whether data feature values of the cleaned cleaning object are missing;
Data recovery module 510, configured to, if data feature values of the cleaned cleaning object are missing, recover the data feature values of the cleaning object according to the data feature values and constraint relationships in the data backup model;
The data storage module 506 is further configured to store the recovered cleaning object in the data source.
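A rough Python sketch of the backup and recovery modules follows. It assumes, purely for illustration, that the backup model is a snapshot of the rows plus a mapping from field name to constraint, and that only fields with a `"not_null"` constraint are restored; the names `make_backup` and `recover_missing` are hypothetical.

```python
import copy


def make_backup(rows, constraints):
    """Snapshot the feature values together with the constraints they obey."""
    return {"rows": copy.deepcopy(rows), "constraints": dict(constraints)}


def recover_missing(cleaned_rows, backup, key="id"):
    """Restore values that went missing during cleaning from the backup."""
    original = {row[key]: row for row in backup["rows"]}
    recovered = []
    for row in cleaned_rows:
        fixed = dict(row)
        for column, rule in backup["constraints"].items():
            # Only fields that must not be null are restored in this sketch.
            if rule == "not_null" and fixed.get(column) is None:
                fixed[column] = original[row[key]][column]
        recovered.append(fixed)
    return recovered


rows = [
    {"id": 1, "region": "A", "load": 10.5},
    {"id": 2, "region": "B", "load": 7.0},
]
backup = make_backup(rows, {"region": "not_null"})

# Suppose a cleaning pass accidentally blanked a required field.
cleaned = [
    {"id": 1, "region": None, "load": 10.5},
    {"id": 2, "region": "B", "load": 7.0},
]
recovered = recover_missing(cleaned, backup)
print(recovered)
```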
Referring to Fig. 9, Fig. 9 is a schematic structural diagram of a fourth data cleaning system provided by an embodiment of the present invention. As shown in Fig. 9, the data cleaning system includes:
A metadata module 91, a data cleaning module 92, a data statistical analysis module 93 and a data backup module 94. The metadata module 91 includes a data source 911, a data extraction interface 912, an intermediate database 913 and a data source model 914. The data cleaning module 92 includes a data quality standard model 921, a workflow control unit 922, a data cleaning unit 923 and a data cleaning model 924. The data statistical analysis module 93 includes a data architecture analysis unit 931, a dirty data feature analysis unit 932 and a dirty data feature adaptation unit 933. The data backup module 94 includes a data backup unit 941 and a data backup model 942.
The data source 911 is the source of the data set to be cleaned and may be a general-purpose relational database such as Oracle, SqlServer, Sqlite or MySql. The data extraction interface 912 extracts the information in the data source 911 to create the data source model 914, and extracts the data into the intermediate database 913 to form an intermediate data set, which serves as the data set object of the data cleaning process.
The data statistical analysis module 93 takes the intermediate data set formed in the intermediate database 913 as its data object. The data architecture analysis unit 931 analyzes the data architecture of the data set, including unique constraints, primary key constraints, foreign key constraints, check constraints, null constraints, default constraints, field composition and field types, and produces an analysis result. The analysis result includes the constraint relationships of the data set to be cleaned, for example whether a data table has primary key/foreign key associations with other tables in the database.
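As one possible illustration of this kind of architecture analysis, the sketch below reads constraint metadata from a SQLite database (one of the data sources named above) using Python's standard `sqlite3` module and the SQLite `PRAGMA table_info` and `PRAGMA foreign_key_list` statements. The two example tables and the `analyze_table` helper are hypothetical, and the sketch covers only primary key, not-null, default and foreign key constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL DEFAULT 0
    );
""")


def analyze_table(conn, table):
    """Collect primary-key, not-null, default and foreign-key constraints."""
    # The table name is interpolated directly, so it must come from a trusted source.
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
    fks = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    return {
        "primary_key": [c[1] for c in columns if c[5]],
        "not_null": [c[1] for c in columns if c[3]],
        "defaults": {c[1]: c[4] for c in columns if c[4] is not None},
        # (local column, referenced table, referenced column)
        "foreign_keys": [(fk[3], fk[2], fk[4]) for fk in fks],
    }


result = analyze_table(conn, "orders")
print(result)
```

A table that appears in another table's `foreign_keys` list has a primary key/foreign key association of the kind the analysis result is meant to record.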
According to business requirements, the dirty data feature adaptation model 933 can mark each table field of the data set as a suspect field for similar data, noise data, out-of-limit data or missing data. The dirty data feature analysis unit 932 randomly samples the data set in the intermediate database 913 to form sample data and, using the configuration of the dirty data feature adaptation model 933, analyzes the sample data for the adapted fields. The dirty data feature adaptation model 933 defines the dirty data type of a given column in a data table, such as similar data, noise data, out-of-limit data or missing data. Taking out-of-limit analysis as an example, only the valid interval of the column needs to be entered, and data outside that interval is screened out. This sample analysis identifies the fields that contain dirty data, which form the cleaning objects.
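The out-of-limit analysis on sample data can be sketched in Python as below. The field names, the valid intervals in `limits`, and the function `find_out_of_limit_fields` are assumptions made for illustration; a fixed random seed is used only to keep the example reproducible.

```python
import random


def find_out_of_limit_fields(rows, limits, sample_size=100, seed=0):
    """Return the fields whose sampled values fall outside their valid interval."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    dirty = set()
    for row in sample:
        for column, (low, high) in limits.items():
            value = row.get(column)
            if value is not None and not (low <= value <= high):
                dirty.add(column)
    return sorted(dirty)


rows = [{"voltage": 220.0, "current": 5.0} for _ in range(50)]
rows.append({"voltage": 9999.0, "current": 5.0})  # one out-of-limit reading

# Hypothetical valid intervals configured for each column.
limits = {"voltage": (0.0, 500.0), "current": (0.0, 100.0)}

dirty_fields = find_out_of_limit_fields(rows, limits, sample_size=len(rows))
print(dirty_fields)
```

Here the sample covers every row, so the single bad reading is always found. With a smaller sample, a rare out-of-limit value may be missed, which is the usual trade-off of sampling-based screening.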
The data backup unit 941 in the data backup module 94 takes the intermediate data set formed in the intermediate database 913 as its data object, generates a data backup script according to the analysis result of the data architecture analysis unit 931, and creates the data backup model 942, which provides a data recovery mechanism. When data is to be recovered, the original data can be written back directly from the backup information stored in the data backup model 942, completing the recovery.
The workflow control unit 922 in the data cleaning module 92 determines the order in which data is cleaned from the data architecture information provided by the data architecture analysis unit 931, such as data constraints and inter-table relationships, and thereby forms the data cleaning queue. Based on the fields to be cleaned and the dirty data types (similar data, noise data, out-of-limit data and missing data) provided by the dirty data feature analysis unit 932, the workflow control unit 922 determines a corresponding data cleaning preprocessing scheme, which specifies which data tables need cleaning, the dirty data type of each field in those tables, the cleaning order, and the corresponding cleaning methods. The workflow control unit 922 submits the data cleaning queue and the preprocessing scheme item by item to the data cleaning unit 923, which loads the matching dirty data cleaning methods from the data cleaning model 924 to form the corresponding data cleaning workflows. After the data cleaning unit 923 has created the workflows one by one according to the data cleaning queue, data cleaning begins, and the cleaning process is managed centrally by the workflow control unit 922. After each cleaning pass, the system judges whether the cleaned data set meets the data cleaning result standard in the data quality standard model 921. The standard defined by the data quality standard model 921 includes the data format (such as character, integer or date format), the upper data limit, the lower data limit, and whether null values are allowed. If the cleaned data set meets the standard, the cleaned cleaning object is written back to the data source, forming a clean data reflow; otherwise, it is cleaned again.
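The way a cleaning unit might load a method per dirty data type can be sketched as a simple lookup table in Python. Everything here is illustrative: the three cleaning functions, the `CLEANING_MODEL` mapping and the example columns are assumptions, and noise data handling is left out for brevity.

```python
def merge_similar(values):
    """Collapse strings that differ only by case or surrounding whitespace."""
    seen, result = set(), []
    for v in values:
        key = v.strip().lower()
        if key not in seen:
            seen.add(key)
            result.append(v.strip())
    return result


def fill_missing(values, default=0.0):
    """Replace missing values with an assumed default."""
    return [default if v is None else v for v in values]


def clamp_out_of_limit(values, low=0.0, high=500.0):
    """Clamp values into an assumed valid interval."""
    return [min(max(v, low), high) for v in values]


# Hypothetical data cleaning model: dirty data type -> cleaning method.
CLEANING_MODEL = {
    "similar": merge_similar,
    "missing": fill_missing,
    "out_of_limit": clamp_out_of_limit,
}


def run_cleaning(queue, model):
    """Clean each queued (column, dirty_type, values) item in queue order."""
    cleaned = {}
    for column, dirty_type, values in queue:
        method = model[dirty_type]  # load the method adapted to this dirty type
        cleaned[column] = method(values)
    return cleaned


queue = [
    ("voltage", "out_of_limit", [220.0, 900.0, -5.0]),
    ("load", "missing", [1.5, None, 2.0]),
    ("station", "similar", ["Kunming ", "kunming", "Dali"]),
]
cleaned = run_cleaning(queue, CLEANING_MODEL)
print(cleaned)
```

Keeping the methods in a mapping means a new dirty data type can be supported by registering one more function, without changing the loop that drives the workflow.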
The embodiments in this specification are described progressively. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others.
The embodiments of the present invention described above do not limit the scope of protection of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A data cleaning method, characterized by comprising:
extracting data from a data source into an intermediate database according to the storage addresses, recorded in a data source model, of the data in the data source, and determining the data extracted into the intermediate database to be data objects;
extracting the data objects from the intermediate database and analyzing the data architecture of the data objects to obtain the constraint relationship corresponding to each data object;
adapting the dirty data types of the data objects and the data features corresponding to the dirty data types according to a dirty data feature adaptation model, and screening out, as cleaning objects, the data objects whose dirty data types and data features match, wherein the dirty data feature adaptation model includes dirty data types and the data features corresponding to the dirty data types;
setting the cleaning order of the cleaning objects according to the constraint relationships, and adding the cleaning objects to a cleaning queue in the cleaning order;
matching the data cleaning method corresponding to the dirty data type of each cleaning object, loading a data cleaning model, and cleaning the cleaning objects in the cleaning queue one by one in the cleaning order, wherein the data cleaning model includes the cleaning methods corresponding to the various dirty data types;
storing the cleaned cleaning objects in the data source.
2. The data cleaning method according to claim 1, characterized by further comprising:
judging whether the cleaned cleaning object meets the data cleaning standard in a data quality standard model;
if the cleaning object does not meet the data cleaning standard, adding the cleaning object to the cleaning queue again according to the constraint relationship, and re-executing the step of cleaning the cleaning object in the cleaning order;
if the cleaning object meets the data cleaning standard, storing the cleaning object in the data source.
3. The data cleaning method according to claim 2, characterized in that the data cleaning standard includes a data format standard, a data feature value range standard and/or a data constraint relationship standard; and the judging whether the cleaned cleaning object meets the data cleaning standard in the data quality standard model comprises:
judging whether the data format of the cleaned cleaning object meets the data format standard;
judging whether the data feature values of the cleaned cleaning object meet the data feature value range standard; and/or judging whether the data constraint relationship of the cleaned cleaning object meets the data constraint relationship standard.
4. The data cleaning method according to claim 1, characterized by further comprising:
generating a data backup model according to the data feature values of the data objects and the constraint relationships corresponding to the data objects;
judging whether data feature values of the cleaned cleaning object are missing;
if data feature values of the cleaned cleaning object are missing, recovering the data feature values of the cleaning object according to the data feature values and constraint relationships in the data backup model;
storing the recovered cleaning object in the data source.
5. The data cleaning method according to claim 1, characterized in that the setting the cleaning order of the cleaning objects according to the constraint relationships, and adding the cleaning objects to a cleaning queue in the cleaning order, comprises:
determining the constraint level of the data feature values in each cleaning object according to the constraint relationships;
adding the cleaning objects to the cleaning queue in order of constraint level from low to high.
6. A data cleaning system, characterized by comprising:
a data extraction interface module, configured to extract data from a data source into an intermediate database according to the storage addresses, recorded in a data source model, of the data in the data source, and to determine the data extracted into the intermediate database to be data objects;
a data architecture analysis module, configured to extract the data objects from the intermediate database and analyze the data architecture of the data objects to obtain the constraint relationship corresponding to each data object;
a data type and feature analysis module, configured to adapt the dirty data types of the data objects and the data features corresponding to the dirty data types according to a dirty data feature adaptation model, and to screen out, as cleaning objects, the data objects whose dirty data types and data features match, wherein the dirty data feature adaptation model includes dirty data types and the data features corresponding to the dirty data types;
a data cleaning order setting module, configured to set the cleaning order of the cleaning objects according to the constraint relationships and to add the cleaning objects to a cleaning queue in the cleaning order;
a data cleaning module, configured to match the data cleaning method corresponding to the dirty data type of each cleaning object, load a data cleaning model, and clean the cleaning objects in the cleaning queue one by one in the cleaning order, wherein the data cleaning model includes the cleaning methods corresponding to the various dirty data types;
a data storage module, configured to store the cleaned cleaning objects in the data source.
7. The data cleaning system according to claim 6, characterized by further comprising:
a cleaning standard judging module, configured to judge whether the cleaned cleaning object meets the data cleaning standard in a data quality standard model;
wherein the data cleaning order setting module is further configured to, if the cleaning object does not meet the data cleaning standard, add the cleaning object to the cleaning queue again according to the constraint relationship;
and the data storage module is further configured to, if the cleaning object meets the data cleaning standard, store the cleaning object in the data source.
8. The data cleaning system according to claim 7, characterized in that the data cleaning standard includes a data format standard, a data feature value range standard and/or a data constraint relationship standard; and the cleaning standard judging module comprises:
a first judging submodule, configured to judge whether the data format of the cleaned cleaning object meets the data format standard;
a second judging submodule, configured to judge whether the data feature values of the cleaned cleaning object meet the data feature value range standard; and/or
a third judging submodule, configured to judge whether the data constraint relationship of the cleaned cleaning object meets the data constraint relationship standard.
9. The data cleaning system according to claim 6, characterized by further comprising:
a data backup model generation module, configured to generate a data backup model according to the data feature values of the data objects and the constraint relationships corresponding to the data objects;
a data missing judging module, configured to judge whether data feature values of the cleaned cleaning object are missing;
a data recovery module, configured to, if data feature values of the cleaned cleaning object are missing, recover the data feature values of the cleaning object according to the data feature values and constraint relationships in the data backup model;
wherein the data storage module is further configured to store the recovered cleaning object in the data source.
10. The data cleaning system according to claim 6, characterized in that the data cleaning order setting module comprises:
a constraint level judging submodule, configured to determine the constraint level of the data feature values in each cleaning object according to the constraint relationships;
a cleaning object adding submodule, configured to add the cleaning objects to the cleaning queue in order of constraint level from low to high.
CN201611152151.4A 2016-12-14 2016-12-14 Data cleaning method and system Pending CN106599193A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611152151.4A CN106599193A (en) 2016-12-14 2016-12-14 Data cleaning method and system

Publications (1)

Publication Number Publication Date
CN106599193A true CN106599193A (en) 2017-04-26

Family

ID=58801059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611152151.4A Pending CN106599193A (en) 2016-12-14 2016-12-14 Data cleaning method and system

Country Status (1)

Country Link
CN (1) CN106599193A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170426