CA3177209A1 - Data cleaning method - Google Patents

Data cleaning method

Info

Publication number
CA3177209A1
CA3177209A1 CA3177209A CA3177209A CA3177209A1 CA 3177209 A1 CA3177209 A1 CA 3177209A1 CA 3177209 A CA3177209 A CA 3177209A CA 3177209 A CA3177209 A CA 3177209A CA 3177209 A1 CA3177209 A1 CA 3177209A1
Authority
CA
Canada
Prior art keywords
data
cleaned
field
rule
deleting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3177209A
Other languages
French (fr)
Inventor
Licheng Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 10353744 Canada Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Preliminary Treatment Of Fibers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data cleansing method. The method comprises: acquiring data from a first data source, and establishing an independent data stream by using the acquired data (101); filtering the data in the data stream to obtain data to be cleansed (102); deleting or filling a field comprising a missing value in the data to be cleansed, to obtain preliminary cleansed data (103); detecting whether the preliminary cleansed data conforms to a preset determination rule, and deleting the data not conforming to the determination rule to obtain final cleansed data (104); and outputting the final cleansed data to a second data source (105). By using the above-mentioned method, data security can be improved.

Description

DATA CLEANING METHOD
BACKGROUND OF THE INVENTION
Technical Field
[0001] The present application relates to the field of big data processing technology, and more particularly to a data processing method.
Description of Related Art
[0002] With the advent of the network age, large quantities of data flow continuously into the network, and data volume grows by 50% each year. Supported by these vast data sources, enterprise decisions are increasingly based on data analysis rather than on experience and intuition alone, as was traditionally the case. Data cleaning is an indispensable link in the overall data analysis process, and the quality of its results directly affects model performance and the final analytical conclusions. Data cleaning is the process of rechecking and verifying data; it aims to delete duplicate data, correct existing errors, and ensure data consistency. In practice, data cleaning usually occupies 50% to 80% of the time of the entire data analysis process.
[0003] Data cleaning is of two types: offline data cleaning and real-time data cleaning. Offline data cleaning can clean data at a finer granularity by means of complicated processing, at the expense of performance; such cleaning includes missing value processing, abnormal value processing, duplicate value processing, null value filling, unit unification, and deciding whether to standardize, whether to delete unnecessary variables, and whether to sort. Real-time cleaning, because of its real-time requirement, is better suited to missing value filling, filtering, and legitimacy checking of data. However, the currently available data cleaning process is usually integrated with the data analysis process. The two are tightly coupled, the cleaning process is strongly affected by the other data analysis code, data loss occurs easily, and data security suffers.
SUMMARY OF THE INVENTION
[0004] In view of the above technical problems, there is an urgent need to propose a data cleaning method capable of enhancing data security.
[0005] There is provided a data cleaning method that comprises:
[0006] obtaining data from a first data source, and creating an independent data stream by employing the obtained data;
[0007] subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;
[0008] deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;
[0009] detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and
[0010] outputting the finally cleaned data to a second data source.
[0011] In one of the embodiments, the step of deleting or filling in any field containing missing values in the data to be cleaned includes:
[0012] calculating to obtain a missing rate of the field according to a proportion of number of pieces of the missing values of the field in a total number of pieces;
[0013] determining an attribute importance degree of the field according to an index required to be analyzed; and
[0014] deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field.

Date Reçue/Date Received 2022-09-27
[0015] In one of the embodiments, the step of deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field includes:
[0016] filling in the field when the missing rate of the field is lower than a preset missing rate threshold and the attribute importance degree thereof is lower than a preset importance grading threshold;
[0017] deleting the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold; and
[0018] complementing the missing values of the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is greater than the preset importance grading threshold.
[0019] In one of the embodiments, the method further comprises:
[0020] probing metadata that describes data attribute of the data in the first data source, analyzing to obtain any quality problem present in the data according to the metadata, and setting a filtering rule according to the quality problem;
[0021] the step of subjecting data in the data stream to a filtering process, and obtaining data to be cleaned includes: subjecting data in the data stream to a filtering process according to the filtering rule, and obtaining data to be cleaned.
[0022] In one of the embodiments, the step of subjecting data in the data stream to a filtering process includes:
[0023] row-grade filtering, whereby any row not required in the data is removed; and
[0024] column-grade filtering, whereby, when one row has plural columns, fields to which any required column corresponds are merely selected and retained.
[0025] In one of the embodiments, the preset judging rule includes a legitimacy rule and a logic rule, and the step of detecting whether the preliminarily cleaned data conforms to a preset judging rule includes:
[0026] setting the preliminarily cleaned data as a maximum value that conforms to the legitimacy rule, or deleting the data, if the preliminarily cleaned data does not conform to the legitimacy rule; and
[0027] deleting the preliminarily cleaned data and generating an alarming instruction, if the preliminarily cleaned data does not conform to the logic rule.
[0028] In one of the embodiments, the first data source and the second data source are different data types within one and the same distributed messaging system; the distributed messaging system is Kafka, the first data source and the second data source are two different Topics of Kafka, and the data stream is embodied as a data stream based on Spark Streaming.
[0029] There is provided a data cleaning device that comprises:
[0030] a data obtaining module, for obtaining data from a first data source, and creating an independent data stream by employing the obtained data;
[0031] a data filtering module, for subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;
[0032] a preliminarily cleaning module, for deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;
[0033] a finally cleaning module, for detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and
[0034] a data outputting module, for outputting the finally cleaned data to a second data source.
[0035] There is provided a computer equipment that comprises a memory, a processor, and a computer program stored on the memory and operable on the processor, and the following steps are realized when the processor executes the computer program:
[0036] obtaining data from a first data source, and creating an independent data stream by employing the obtained data;
[0037] subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;
[0038] deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;
[0039] detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and
[0040] outputting the finally cleaned data to a second data source.
[0041] There is provided a computer-readable storage medium that stores thereon a computer program, and the following steps are realized when the computer program is executed by a processor:
[0042] obtaining data from a first data source, and creating an independent data stream by employing the obtained data;
[0043] subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;
[0044] deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;
[0045] detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and
[0046] outputting the finally cleaned data to a second data source.
[0047] In comparison with prior-art technology, the present invention achieves the following advantageous effects.
[0048] In the data cleaning method and the corresponding device, computer equipment and storage medium, data cleaning is performed by creating an independent data stream, and data obtained from a first data source is cleaned and thereafter placed in another data source for processing by subsequent businesses, so that the data cleaning process is separated from the data analysis code, coupling among the code is reduced, and data security is effectively enhanced.
[0049] Moreover, data filtering is placed as the first step of data cleaning in the present invention, thereby reducing the quantity of data to be subsequently cleaned and enhancing the efficiency of data cleaning.
BRIEF DESCRIPTION OF THE DRAWINGS
[0050] Fig. 1 is a flowchart schematically illustrating the data cleaning method in an embodiment; and
[0051] Fig. 2 is a block diagram illustrating the structure of the data cleaning device in an embodiment.
DETAILED DESCRIPTION OF THE INVENTION
[0052] In order to make the objectives, technical solutions and advantages of the present application more lucid and clear, the present application is described in greater detail below with reference to accompanying drawings and embodiments. As should be understood, the specific embodiments as described here are merely meant to explain the present application, rather than to restrict the present application.
[0053] In one embodiment, as shown in Fig. 1, the present application provides a data cleaning method that comprises the following steps.
[0054] Step 101 - obtaining data from a first data source, and creating an independent data stream by employing the obtained data.
[0055] The first data source is a source from which data is obtained, and the data stream is an ordered data sequence of nodes with a starting point and an ending point.
[0056] Specifically, by creating an independent data stream for data cleaning, the present invention separates the data cleaning process from data analyzing codes, and reduces coupling among the codes.
[0057] Step 102 - subjecting data in the data stream to a filtering process, and obtaining data to be cleaned.
[0058] Specifically, data filtering is placed as the first step of data cleaning, which can effectively reduce the quantity of data to be subsequently cleaned and greatly enhance the efficiency of data cleaning.
[0059] Step 103 - deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data.
[0060] The missing values mean information deficient in the data, that is to say, one or some attribute(s) of the data is/are incomplete in value(s).
[0061] Step 104 - detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data.
[0062] Step 105 - outputting the finally cleaned data to a second data source.
[0063] The second data source is another data source that is different from the first data source, and it is employed to store data to be used or processed by subsequent businesses.
[0064] Specifically, the data cleaning process of the present invention is independent of the other processing in data analysis and is not affected by other code, so data security is higher.
[0065] In the data cleaning method, data cleaning is performed by creating an independent data stream, and data obtained from a first data source is cleaned and thereafter placed in another data source for processing by subsequent businesses, so that the data cleaning process is separated from data analyzing codes, coupling among the codes is reduced, and data security is effectively enhanced.
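Steps 101 to 105 can be sketched as a small, self-contained pipeline. This is only an illustrative sketch, not the claimed implementation: in-memory lists stand in for the first and second data sources, a Python generator stands in for the independent data stream, and the names clean_stream, keep, handle_missing, conforms and fill_age, the sample records, and the default age of 35 are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, Optional

Record = dict  # one piece of data, keyed by field name


def clean_stream(
    source: Iterable[Record],
    keep: Callable[[Record], bool],
    handle_missing: Callable[[Record], Optional[Record]],
    conforms: Callable[[Record], bool],
) -> Iterator[Record]:
    """Run Steps 101 to 104 as one generator, separate from any analysis code."""
    stream = iter(source)                      # Step 101: independent data stream
    to_clean = (r for r in stream if keep(r))  # Step 102: filtering
    for record in to_clean:
        prelim = handle_missing(record)        # Step 103: delete or fill missing fields
        if prelim is None:                     # None means the record was dropped
            continue
        if conforms(prelim):                   # Step 104: preset judging rule
            yield prelim


def fill_age(record: Record) -> Record:
    """Hypothetical missing-value handler: fill a missing age with 35."""
    return {**record, "age": 35 if record["age"] is None else record["age"]}


first_source = [
    {"channel": "news", "age": 30},
    {"channel": None, "age": 41},        # filtered out: no channel
    {"channel": "sports", "age": None},  # missing age is filled
    {"channel": "news", "age": 200},     # fails the judging rule
]
second_source: list = []

# Step 105: output the finally cleaned data to the second data source.
second_source.extend(
    clean_stream(
        first_source,
        keep=lambda r: r["channel"] is not None,
        handle_missing=fill_age,
        conforms=lambda r: 0 <= r["age"] <= 120,
    )
)
print(second_source)
```

Because the cleaning logic only sees the stream and the three rule callables, it shares no state with downstream analysis code, which is the decoupling the paragraph above describes.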
[0066] As one specific mode of execution, the first data source and the second data source are different data types within one and the same distributed messaging system; for instance, the distributed messaging system is Kafka, the first data source and the second data source are two different Topics of Kafka, and the data stream is embodied as a data stream based on Spark Streaming.
[0067] In one of the embodiments, the step of deleting or filling in any field containing missing values in the data to be cleaned includes:
[0068] calculating to obtain a missing rate of the field according to a proportion of number of pieces of the missing values of the field in a total number of pieces;
[0069] determining an attribute importance degree of the field according to an index required to be analyzed; and
[0070] deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field.
[0071] The missing rate of the field is the proportion of the number of pieces of the missing values of the field in the total number of pieces.
[0072] For instance, if a salary field has 100 records in total and 20 of them are missing values, the missing rate is 20%.
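The salary example can be reproduced with a few lines of code. The sketch below assumes that a missing value is represented by None; the function name missing_rate and the sample salary values are hypothetical.

```python
def missing_rate(values):
    """Proportion of missing (None) entries among all records of one field."""
    if not values:
        raise ValueError("field has no records")
    missing = sum(1 for v in values if v is None)
    return missing / len(values)


# 100 salary records, 20 of which are missing.
salary = [5000] * 80 + [None] * 20
rate = missing_rate(salary)
print(f"{rate:.0%}")  # prints 20%
```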

[0073] The criterion for judging the attribute importance degree of a field is determined by the index required to be analyzed. For example, if users are to be profiled or labeled so as to supply data for subsequent precision marketing, attribute information of the users must be collected, and attribute fields such as the users' ages and genders are important fields.
[0074] In one of the embodiments, the step of deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field includes:
[0075] filling in the field when the missing rate of the field is lower than a preset missing rate threshold and the attribute importance degree thereof is lower than a preset importance grading threshold;
[0076] specifically, if the field attribute is numerical-type data, the field is filled in according to the data distribution: if the data is uniformly distributed, the field is filled in with the mean value; if the data is skewed, the field is filled in with the median;
[0077] deleting the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold; and
[0078] complementing the missing values of the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is greater than the preset importance grading threshold.
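The mean-or-median filling of a numerical field described in paragraph [0076] can be sketched as follows. The text does not say how skew is detected, so this sketch assumes a sample skewness statistic with a hypothetical cut-off skew_limit of 1.0; the function names are likewise illustrative.

```python
import statistics


def skewness(xs):
    """Sample skewness computed from the second and third central moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def fill_numeric(values, skew_limit=1.0):
    """Fill None entries with the median if the data is skewed, else the mean."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to estimate a fill value from
    if abs(skewness(present)) > skew_limit:
        fill = statistics.median(present)
    else:
        fill = statistics.mean(present)
    return [fill if v is None else v for v in values]


even = [10, 20, None, 30, 40]      # symmetric: filled with the mean, 25
skewed = [1, 2, 2, 3, None, 100]   # one large outlier: filled with the median, 2
print(fill_numeric(even))
print(fill_numeric(skewed))
```

The median is preferred for skewed data because a single extreme value (100 in the second list) would otherwise pull the fill value far away from the typical record.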
[0079] Specifically, the step of complementing the missing values of the field includes:
[0080] complementing through other information, such as using an ID card number to reckon gender, native place, date of birth, and age, etc.;
[0081] complementing through preceding and following data: for instance, when data is missing in a time sequence, the mean of the preceding and following values can serve as the complementary value; when there are many missing values, numerical values obtained through a smoothing process can serve as the complementary values;
[0082] where it is impossible to complement, the data must be removed, but it should not be deleted outright, since it may be used subsequently.
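Complementing a gap in a time sequence from its neighbouring values can be sketched as below. This is a minimal illustration that handles only isolated gaps; runs of consecutive missing values are left untouched, since the text assigns those to a smoothing process that it does not specify. The function name and the sample readings are hypothetical.

```python
def complement_from_neighbours(series):
    """Replace an isolated None with the mean of the values before and after it."""
    result = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        if 0 < i < len(series) - 1:
            before, after = series[i - 1], series[i + 1]
            # Only complement when both neighbours are actually present.
            if before is not None and after is not None:
                result[i] = (before + after) / 2
    return result


readings = [10.0, None, 14.0, 15.0, None, None, 18.0]
print(complement_from_neighbours(readings))
# [10.0, 12.0, 14.0, 15.0, None, None, 18.0]
```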
[0083] As one specific mode of execution, the missing rate threshold can be any numerical value between 90% and 95%.
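The three cases of this embodiment can be collected into one decision function. In the sketch below the attribute importance degree is assumed to be a numeric score between 0 and 1 and the importance grading threshold of 0.5 is hypothetical; the missing rate threshold of 0.9 is taken from the 90% to 95% range stated above. Combinations that the text does not address are reported as UNSPECIFIED rather than guessed.

```python
from enum import Enum


class Action(Enum):
    FILL = "fill"
    DELETE = "delete"
    COMPLEMENT = "complement"
    UNSPECIFIED = "unspecified"  # combination not covered by the described cases


def choose_action(missing_rate, importance,
                  rate_threshold=0.9, importance_threshold=0.5):
    """Map a field's missing rate and importance score to a handling action."""
    low_rate = missing_rate < rate_threshold
    if low_rate and importance < importance_threshold:
        return Action.FILL
    if not low_rate and importance < importance_threshold:
        return Action.DELETE
    if not low_rate and importance > importance_threshold:
        return Action.COMPLEMENT
    return Action.UNSPECIFIED


print(choose_action(0.20, 0.3))  # Action.FILL
print(choose_action(0.95, 0.3))  # Action.DELETE
print(choose_action(0.95, 0.8))  # Action.COMPLEMENT
```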
[0084] In one of the embodiments, before data in the data stream is subjected to a filtering process, metadata that describes data attribute of the data in the first data source is firstly probed, any quality problem present in the data is then analyzed and obtained according to the metadata, a filtering rule is set according to the quality problem, the data in the data stream is subjected to a filtering process according to the filtering rule, and the data to be cleaned is obtained in Step 102.
[0085] Metadata, also referred to as intermediary data or relay data, is data that describes data; it mainly describes information about data attributes and supports functions such as indicating storage locations, historical data, resource searching, and document recording.
[0086] Specifically, the data attribute required to be processed is packaged into metadata, thus enabling the program to possess better expandability. At the same time, a corresponding filtering rule is stipulated with respect to any quality problem of the data, thus facilitating enhancement of data filtering efficiency.
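One possible reading of the probing step is sketched below: field metadata is compared against sample records to list quality problems, and a filtering rule is then built from those problems. The metadata layout (a type and a required flag per field), the problem labels, and all function names are assumptions made for illustration only.

```python
def probe_quality_problems(metadata, sample):
    """Compare sample records against field metadata and collect the problems found."""
    problems = set()
    for record in sample:
        for field, spec in metadata.items():
            value = record.get(field)
            if value is None:
                if spec["required"]:
                    problems.add((field, "missing"))
            elif not isinstance(value, spec["type"]):
                problems.add((field, "wrong_type"))
    return problems


def build_filter_rule(metadata, problems):
    """Return a predicate that rejects records showing any observed problem."""
    def rule(record):
        for field, kind in problems:
            value = record.get(field)
            if kind == "missing" and value is None:
                return False
            if (kind == "wrong_type" and value is not None
                    and not isinstance(value, metadata[field]["type"])):
                return False
        return True
    return rule


# Hypothetical metadata describing two fields of the first data source.
metadata = {
    "uid": {"type": str, "required": True},
    "duration": {"type": int, "required": False},
}
sample = [
    {"uid": "a1", "duration": 12},
    {"uid": None, "duration": 7},       # required field missing
    {"uid": "b2", "duration": "long"},  # wrong data type
]
problems = probe_quality_problems(metadata, sample)
rule = build_filter_rule(metadata, problems)
kept = [r for r in sample if rule(r)]
print(kept)
```

Keeping the field descriptions in metadata rather than in the filtering code means a new field or constraint only requires a metadata change, which is one way to read the expandability mentioned above.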
[0087] In one of the embodiments, the step of subjecting data in the data stream to a filtering process includes:
[0088] row-grade filtering, whereby any row not required in the data is removed; and
[0089] column-grade filtering, whereby, when one row has plural columns, only the fields corresponding to the required columns are selected and retained.
[0090] Specifically, the combination of row-grade filtering with column-grade filtering makes it possible to effectively quicken the data filtering speed.
[0091] For instance, consider a process that calculates pv/uv per channel:
[0092] the log data includes approximately 200 fields, such as the IP address, browser information, client terminal equipment information, the specific access time, the specific page accessed, the page previously accessed, and the access duration; the requirement in this embodiment is to count the number of clicks of each channel and the number of accesses by independent IPs.
[0093] By row-grade filtering, only log data relevant to the channels is selected and retained, so that log data not containing the channels is filtered out;
[0094] by column-grade filtering, cid (channel name), uid (equipment identification), and the IP address are selected from the approximately 200 fields contained in the log data relevant to the channels, and unnecessary fields are filtered out; it is then possible to count and obtain the pv/uv of each channel;
[0095] pv is an abbreviation of page view, namely the number of page views: each access by a user to a page of a website is recorded once, and multiple accesses by the same user to the same page are accumulated into the pv total;
[0096] uv is an abbreviation of unique visitor, and means a natural person who accesses and browses the page through the internet.
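The pv/uv example can be sketched with plain dictionaries standing in for log records. The field names cid, uid and ip follow the text; the sample logs, the extra browser field, and the function names are hypothetical, and unique visitors are approximated here by distinct IP addresses per channel, following the "independent IP" requirement above.

```python
from collections import defaultdict

KEEP_COLUMNS = ("cid", "uid", "ip")


def row_filter(logs):
    """Row-grade filtering: keep only log records that carry a channel."""
    return (log for log in logs if log.get("cid"))


def column_filter(logs):
    """Column-grade filtering: keep only the required fields of each record."""
    return ({col: log.get(col) for col in KEEP_COLUMNS} for log in logs)


def pv_uv_by_channel(logs):
    """Return {channel: (pv, uv)} after row and column filtering."""
    pv = defaultdict(int)
    visitors = defaultdict(set)
    for log in column_filter(row_filter(logs)):
        pv[log["cid"]] += 1
        visitors[log["cid"]].add(log["ip"])
    return {cid: (pv[cid], len(visitors[cid])) for cid in pv}


logs = [
    {"cid": "news", "uid": "d1", "ip": "10.0.0.1", "browser": "x"},
    {"cid": "news", "uid": "d1", "ip": "10.0.0.1", "browser": "x"},
    {"cid": "news", "uid": "d2", "ip": "10.0.0.2", "browser": "y"},
    {"cid": None, "uid": "d3", "ip": "10.0.0.3", "browser": "z"},
    {"cid": "sports", "uid": "d3", "ip": "10.0.0.3", "browser": "z"},
]
result = pv_uv_by_channel(logs)
print(result)  # {'news': (3, 2), 'sports': (1, 1)}
```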
[0097] In this embodiment, expandability is also considered: for instance, a retention rate of users might need to be counted in subsequent data processing, so data such as the access time of each IP address can additionally be recorded.
[0098] The retention rate of users is a ratio of old users to the total users.
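As a small worked example of this ratio, the sketch below assumes that "old users" are visitors of the current period who were also seen in an earlier period, and that "total users" are all visitors of the current period. That interpretation, the function name, and the sample identifiers are assumptions.

```python
def retention_rate(previous_visitors, current_visitors):
    """Share of current visitors who were already seen in the previous period."""
    current = set(current_visitors)
    if not current:
        return 0.0
    returning = current & set(previous_visitors)
    return len(returning) / len(current)


rate = retention_rate({"a", "b", "c"}, {"b", "c", "d", "e"})
print(rate)  # 0.5
```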

[0099] In one of the embodiments, the preset judging rule includes a legitimacy rule and a logic rule, and the step of detecting whether the preliminarily cleaned data conforms to a preset judging rule includes:
[0100] setting the preliminarily cleaned data as a maximum value that conforms to the legitimacy rule, or deleting the data, if the preliminarily cleaned data does not conform to the legitimacy rule; and
[0101] deleting the preliminarily cleaned data and generating an alarming instruction, if the preliminarily cleaned data does not conform to the logic rule.
[0102] The legitimacy rule is a format requirement rule covering, for example, numerical values, dates, and field contents.
[0103] Specifically, an example of a field-type legitimacy rule: the format of a date field is "YYYY-MM-DD".
[0104] Examples of field content legitimacy rules: the gender is male, female, or unknown; the date of birth is earlier than or equal to "today".
[0105] The logic rule is a rule of common sense used for judging whether the data conforms to logics, for instance, ages of people usually lie between 0 and 120, and any piece of data is judged as abnormal if the age of 200 appears therein.
[0106] After the data has been cleaned by the legitimacy rule and the logic rule, any data not conforming to format requirements and logics is removed, and valid, finally cleaned data is obtained.
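The legitimacy and logic checks can be sketched as below, using the example rules from the text (date format "YYYY-MM-DD", gender in a fixed set, date of birth not later than today, age between 0 and 120). The record layout, the function names, and the choice to clamp a future date of birth to today while deleting records with other legitimacy violations are assumptions; the alarming instruction is represented simply as a message appended to a list.

```python
import re
from datetime import date, datetime

GENDERS = {"male", "female", "unknown"}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_date(text):
    """Return a date for a strict YYYY-MM-DD string, else None."""
    if not isinstance(text, str) or not DATE_PATTERN.fullmatch(text):
        return None
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None


def apply_legitimacy(record, today):
    """Return the record (possibly clamped), or None if it must be deleted."""
    born = parse_date(record.get("birth_date"))
    if born is None or record.get("gender") not in GENDERS:
        return None
    if born > today:
        # Set the value to the maximum that still conforms to the rule.
        record = {**record, "birth_date": today.isoformat()}
    return record


def final_clean(records, today):
    """Apply the legitimacy rule, then the logic rule; collect alarms."""
    cleaned, alarms = [], []
    for record in records:
        legit = apply_legitimacy(record, today)
        if legit is None:
            continue
        if not 0 <= legit["age"] <= 120:  # logic rule: plausible human age
            alarms.append(f"illogical age {legit['age']}")
            continue
        cleaned.append(legit)
    return cleaned, alarms


today = date(2022, 9, 27)
records = [
    {"gender": "female", "birth_date": "1990-05-17", "age": 32},
    {"gender": "male", "birth_date": "17/05/1990", "age": 32},   # bad format
    {"gender": "unknown", "birth_date": "2030-01-01", "age": 0},  # future date
    {"gender": "male", "birth_date": "1980-01-01", "age": 200},   # illogical
]
cleaned, alarms = final_clean(records, today)
print(cleaned)
print(alarms)
```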
[0107] As should be understood, although the various steps in the flowchart of Fig. 1 are displayed sequentially as indicated by arrows, these steps are not necessarily executed in the sequence indicated by the arrows. Unless otherwise explicitly noted herein, execution of these steps is not restricted to any sequence, and the steps can also be executed in other sequences than those indicated in the drawings. Moreover, at least some of the steps in the flowchart of Fig. 1 may include plural sub-steps or phases; these sub-steps or phases are not necessarily completed at the same time but can be executed at different times, and they are not necessarily performed sequentially but can be performed in turns or alternately with other steps or with at least some of the sub-steps or phases of other steps.
[0108] In one embodiment, as shown in Fig. 2, there is provided a data cleaning device that comprises a data obtaining module, a data filtering module, a preliminarily cleaning module, a finally cleaning module, and a data outputting module, of which:
[0109] the data obtaining module is employed for obtaining data from a first data source, and creating an independent data stream by employing the obtained data;
[0110] the data filtering module is employed for subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;
[0111] the preliminarily cleaning module is employed for deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;
[0112] the finally cleaning module is employed for detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and
[0113] the data outputting module is employed for outputting the finally cleaned data to a second data source.
[0114] During specific implementation, the first data source and the second data source are of different data types of the same and single distributed messaging system.
[0115] In one embodiment, the preliminarily cleaning module includes a missing rate sub-module, an importance degree sub-module, and a missing value processing sub-module, of which:
[0116] the missing rate sub-module is employed for calculating to obtain a missing rate of the field according to a proportion of number of pieces of the missing values of the field in a total number of pieces;
[0117] the importance degree sub-module is employed for determining an attribute importance degree of the field according to an index required to be analyzed; and
[0118] the missing value processing sub-module is employed for deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field.
[0119] Further, the missing value processing sub-module includes a comparing unit and a preliminarily processing unit, of which:
[0120] the comparing unit is employed for comparing the missing rate and the attribute importance degree of the field respectively with a preset missing rate threshold and a preset importance grading threshold, and the preliminarily processing unit is employed for filling in, deleting or complementing the field:
[0121] filling in the field when the missing rate of the field is lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold;
[0122] deleting the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold; and
[0123] complementing the missing values of the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is greater than the preset importance grading threshold.
[0124] In one embodiment, the data cleaning device further comprises a data probing module for firstly probing metadata that describes data attributes of the data in the first data source before the data in the data stream is subjected to a filtering process, then analyzing to obtain any quality problem present in the data according to the metadata, and setting a filtering rule according to the quality problem.
[0125] In one embodiment, the data filtering module includes a row-grade filtering unit and a column-grade filtering unit, of which:
[0126] the row-grade filtering unit is employed for removing any row not required in the data; and the column-grade filtering unit is employed for, when one row has plural columns, selecting and retaining only the fields corresponding to the required columns.
[0127] In one embodiment, the finally cleaning module includes a legitimacy detecting unit, a logics detecting unit, and a finally processing unit, of which:
[0128] the legitimacy detecting unit is employed for detecting whether the preliminarily cleaned data conforms to a preset legitimacy rule;
[0129] the logics detecting unit is employed for detecting whether the preliminarily cleaned data conforms to a preset logic rule; and
[0130] the finally processing unit is employed for setting the preliminarily cleaned data not conforming to the legitimacy rule as a maximum value that conforms to the legitimacy rule, or deleting the data; and deleting the preliminarily cleaned data not conforming to the logic rule, and generating an alarming instruction.
[0131] Specific definitions relevant to the data cleaning device may be inferred from the aforementioned definitions of the data cleaning method, so they are not repeated here. The various modules in the aforementioned data cleaning device can be wholly or partly realized via software, hardware, or a combination of software and hardware. The various modules can be embedded in the form of hardware in a processor in a computer equipment or be independent of any computer equipment, and can also be stored in the form of software in a memory in a computer equipment, so as to facilitate the processor to invoke and perform operations corresponding to the aforementioned various modules.
[0132] In one embodiment, a computer equipment is provided, and the computer equipment can be a terminal. The computer equipment comprises a processor, a memory, a network interface, a display screen and an input means connected to each other via a system bus. The processor of the computer equipment is employed to provide computing and controlling capabilities. The memory of the computer equipment includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores therein an operating system and a computer program. The internal memory provides an environment for the running of the operating system and the computer program in the nonvolatile storage medium. The network interface of the computer equipment is employed to connect to an external terminal via a network for communication. The computer program realizes a data cleaning method when it is executed by a processor. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen; the input means of the computer equipment can be a touch layer covering the display screen, can also be a press button, a track ball or a touch control board disposed on the housing of the computer equipment, and can further be an externally connected keyboard, touch control board or mouse, etc.
[0133] In one embodiment, there is provided a computer equipment that comprises a memory, a processor and a computer program stored on the memory and operable on the processor, and the following steps are realized when the processor executes the computer program:
obtaining data from a first data source, and creating an independent data stream by employing the obtained data; subjecting data in the data stream to a filtering process, and obtaining data to be cleaned; deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data; detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and outputting the finally cleaned data to a second data source.
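To serve as illustration rather than restriction, the five steps above can be sketched in Python as follows. The function name clean_stream, the callables keep, handle_missing and conforms, and the sample records are hypothetical and are not recited in the application; the sketch merely shows the order in which the steps are applied.

```python
from typing import Callable, Iterable, Iterator

Record = dict  # one record: field name -> value, None marks a missing value


def clean_stream(
    source: Iterable[Record],
    keep: Callable[[Record], bool],
    handle_missing: Callable[[Record], Record],
    conforms: Callable[[Record], bool],
) -> Iterator[Record]:
    """Yield finally cleaned records, applying the steps in order."""
    for original in source:
        # Step 1: copy the record so the stream is independent of the source.
        record = dict(original)
        # Step 2: filtering process, yielding the data to be cleaned.
        if not keep(record):
            continue
        # Step 3: delete or fill in fields that contain missing values.
        prelim = handle_missing(record)
        # Step 4: delete data that does not conform to the judging rule.
        if conforms(prelim):
            yield prelim


def fill_qty(record: Record) -> Record:
    # Hypothetical filling strategy: a missing quantity becomes 0.
    return {**record, "qty": 0} if record.get("qty") is None else record


first_source = [
    {"sku": "A", "qty": 3},
    {"sku": "B", "qty": None},
    {"sku": "C", "qty": 500},
    {"sku": None, "qty": 2},
]

# Step 5: output the finally cleaned data to the second data source.
second_source = list(
    clean_stream(
        first_source,
        keep=lambda r: r["sku"] is not None,
        handle_missing=fill_qty,
        conforms=lambda r: 0 <= r["qty"] <= 100,
    )
)
```

In this sketch the first data source is left unmodified, which is one possible reading of an "independent" data stream.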

[0134] In one embodiment, when the processor executes the computer program, the following steps are further realized: calculating a missing rate of the field as the proportion of the number of pieces of data in which the value of the field is missing to the total number of pieces of data; determining an attribute importance degree of the field according to an index required to be analyzed; and deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field.
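As a worked illustration, the missing rate of a field is the number of pieces of data in which the field is missing divided by the total number of pieces. The following Python sketch assumes that a missing value is represented as None or as an absent key; the function name missing_rate and the sample records are hypothetical.

```python
def missing_rate(records, field):
    """Return the share of records in which `field` is missing."""
    if not records:
        return 0.0  # assumption: an empty batch has no missing values
    missing = sum(1 for record in records if record.get(field) is None)
    return missing / len(records)


records = [
    {"price": 9.5, "note": None},
    {"price": None, "note": None},
    {"price": 3.0, "note": "ok"},
    {"price": 4.0},
]

# One missing price out of four records; three missing notes out of four.
rates = {field: missing_rate(records, field) for field in ("price", "note")}
```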
[0135] In one embodiment, when the processor executes the computer program, the following steps are further realized: filling in the field when the missing rate of the field is lower than a preset missing rate threshold and the attribute importance degree thereof is lower than a preset importance grading threshold; deleting the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold; and complementing the missing values of the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is greater than the preset importance grading threshold.
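The three branches above can be expressed as a small decision function. This is a sketch only: the function name, the string labels and the numeric scale of the attribute importance degree are assumptions. Combinations that the paragraph does not address (for example, a low missing rate with a high importance degree) are reported as "unspecified" rather than guessed.

```python
def missing_value_action(rate, importance, rate_threshold, importance_threshold):
    """Map a field's missing rate and importance degree to an action."""
    low_rate = rate < rate_threshold
    if low_rate and importance < importance_threshold:
        return "fill"        # fill in the field
    if not low_rate and importance < importance_threshold:
        return "delete"      # delete the field
    if not low_rate and importance > importance_threshold:
        return "complement"  # complement the missing values of the field
    return "unspecified"     # combination not covered by the three branches


# Hypothetical thresholds, both on a 0..1 scale.
action = missing_value_action(0.8, 0.2, rate_threshold=0.5, importance_threshold=0.5)
```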
[0136] In one embodiment, when the processor executes the computer program, the following steps are further realized: probing metadata that describes data attributes of the data in the first data source, analyzing to obtain any quality problem present in the data according to the metadata, and setting a filtering rule according to the quality problem; and subjecting data in the data stream to a filtering process according to the filtering rule, and obtaining data to be cleaned.
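One possible way to derive a filtering rule from metadata is sketched below. The shape of the metadata (a declared type and a "required" flag per field), the two problem kinds and all function names are assumptions made for illustration; the application does not prescribe them.

```python
def probe_quality_problems(metadata, sample):
    """Compare sampled records against the metadata and collect problems."""
    problems = set()
    for record in sample:
        for field, spec in metadata.items():
            value = record.get(field)
            if value is None:
                if spec["required"]:
                    problems.add(("missing_required", field))
            elif not isinstance(value, spec["type"]):
                problems.add(("wrong_type", field))
    return problems


def build_filter_rule(metadata, problems):
    """Return a per-record check that rejects the detected problems."""
    def rule(record):
        for kind, field in problems:
            value = record.get(field)
            if kind == "missing_required" and value is None:
                return False
            if (kind == "wrong_type" and value is not None
                    and not isinstance(value, metadata[field]["type"])):
                return False
        return True
    return rule


# Hypothetical metadata describing two fields of the first data source.
metadata = {
    "id": {"type": int, "required": True},
    "name": {"type": str, "required": False},
}
stream = [
    {"id": 1, "name": "a"},
    {"id": "2", "name": "b"},   # id has the wrong type
    {"id": None, "name": "c"},  # required id is missing
    {"id": 4, "name": None},    # optional name may be missing
]

problems = probe_quality_problems(metadata, stream)
rule = build_filter_rule(metadata, problems)
to_be_cleaned = [record for record in stream if rule(record)]
```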
[0137] In one embodiment, when the processor executes the computer program, the following steps are further realized: row-grade filtering, whereby any row not required in the data is removed; and column-grade filtering, whereby, when one row has plural columns, only the fields corresponding to the required columns are selected and retained.
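Row-grade and column-grade filtering can be illustrated with the following minimal Python sketch. The function names, the predicate that decides which rows are required, and the sample records are hypothetical.

```python
def filter_rows(records, row_needed):
    """Row-grade filtering: remove any row that is not required."""
    return [record for record in records if row_needed(record)]


def filter_columns(records, required_columns):
    """Column-grade filtering: retain only the required columns of each row."""
    return [
        {column: record[column] for column in required_columns if column in record}
        for record in records
    ]


records = [
    {"type": "order", "user": "u1", "amount": 12.0, "debug": "x"},
    {"type": "heartbeat", "user": "u1", "amount": None, "debug": "y"},
    {"type": "order", "user": "u2", "amount": 7.5, "debug": "z"},
]

# Assumption: only "order" rows and the user/amount columns are required.
rows = filter_rows(records, lambda record: record["type"] == "order")
to_be_cleaned = filter_columns(rows, ["user", "amount"])
```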

[0138] The preset judging rule includes a legitimacy rule and a logic rule. In one embodiment, when the processor executes the computer program, the following steps are further realized: if the preliminarily cleaned data does not conform to the legitimacy rule, setting the preliminarily cleaned data to a maximum value that conforms to the legitimacy rule, or deleting the data; and if the preliminarily cleaned data does not conform to the logic rule, deleting the preliminarily cleaned data and generating an alarming instruction.
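The handling of the two rules can be sketched as below. The sketch assumes that the legitimacy rule is an upper bound on a single numeric field and that the alarming instruction is a message appended to a list; both are illustrative simplifications, and all names are hypothetical.

```python
def apply_judging_rules(record, field, legal_max, is_logical, alarms, clamp=True):
    """Return the finally cleaned record, or None if the record is deleted."""
    if record[field] > legal_max:
        # Legitimacy rule violated: clamp to the largest legal value, or delete.
        if not clamp:
            return None
        record = {**record, field: legal_max}
    if not is_logical(record):
        # Logic rule violated: delete the record and raise an alarm.
        alarms.append(f"logic rule violated: {record}")
        return None
    return record


def in_order(record):
    # Hypothetical logic rule: a start time must not follow the end time.
    return record["start"] <= record["end"]


alarms = []
clamped = apply_judging_rules(
    {"age": 200, "start": 1, "end": 5}, "age", 150, in_order, alarms
)
deleted = apply_judging_rules(
    {"age": 30, "start": 9, "end": 5}, "age", 150, in_order, alarms
)
```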
[0139] In one embodiment, there is provided a computer-readable storage medium storing thereon a computer program, and the following steps are realized when the computer program is executed by a processor: obtaining data from a first data source, and creating an independent data stream by employing the obtained data; subjecting data in the data stream to a filtering process, and obtaining data to be cleaned; deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data; detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and outputting the finally cleaned data to a second data source.
[0140] In one embodiment, when the computer program is executed by a processor, the following steps are further realized: calculating a missing rate of the field as the proportion of the number of pieces of data in which the value of the field is missing to the total number of pieces of data; determining an attribute importance degree of the field according to an index required to be analyzed; and deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field.
[0141] In one embodiment, when the computer program is executed by a processor, the following steps are further realized: filling in the field when the missing rate of the field is lower than a preset missing rate threshold and the attribute importance degree thereof is lower than a preset importance grading threshold; deleting the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold; and complementing the missing values of the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is greater than the preset importance grading threshold.
[0142] In one embodiment, when the computer program is executed by a processor, the following steps are further realized: probing metadata that describes data attributes of the data in the first data source, analyzing to obtain any quality problem present in the data according to the metadata, and setting a filtering rule according to the quality problem; and subjecting data in the data stream to a filtering process according to the filtering rule, and obtaining data to be cleaned.
[0143] In one embodiment, when the computer program is executed by a processor, the following steps are further realized: row-grade filtering, whereby any row not required in the data is removed; and column-grade filtering, whereby, when one row has plural columns, only the fields corresponding to the required columns are selected and retained.
[0144] The preset judging rule includes a legitimacy rule and a logic rule. In one embodiment, when the computer program is executed by a processor, the following steps are further realized: if the preliminarily cleaned data does not conform to the legitimacy rule, setting the preliminarily cleaned data to a maximum value that conforms to the legitimacy rule, or deleting the data; and if the preliminarily cleaned data does not conform to the logic rule, deleting the preliminarily cleaned data and generating an alarming instruction.
[0145] As persons ordinarily skilled in the art will understand, all or part of the flows in the methods according to the aforementioned embodiments can be completed via a computer program instructing relevant hardware, the computer program can be stored in a nonvolatile computer-readable storage medium, and the computer program can include the flows as embodied in the aforementioned various methods when executed. Any reference to the memory, storage, database or other media used in the various embodiments provided by the present application can include nonvolatile and/or volatile memory. The nonvolatile memory can include a read-only memory (ROM), a programmable ROM (PROM), an electrically programmable ROM (EPROM), an electrically erasable and programmable ROM (EEPROM) or a flash memory. The volatile memory can include a random access memory (RAM) or an external cache memory. To serve as explanation rather than restriction, the RAM is obtainable in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
[0146] The technical features of the aforementioned embodiments may be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the aforementioned embodiments are described, but all such combinations should be considered to fall within the scope recorded in the Description as long as they are not mutually contradictory.
[0147] The foregoing embodiments are merely directed to several modes of execution of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be understood as restricting the patent scope of the invention. As should be pointed out, persons with ordinary skill in the art may further make various modifications and improvements without departing from the conception of the present application, and all these should pertain to the protection scope of the present application. Accordingly, the patent protection scope of the present application shall be based on the attached Claims.

Claims (10)

What is claimed is:
1. A data cleaning method, characterized in that the method comprises:
obtaining data from a first data source, and creating an independent data stream by employing the obtained data;
subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;
deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;
detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and outputting the finally cleaned data to a second data source.
2. The method according to Claim 1, characterized in that the step of deleting or filling in any field containing missing values in the data to be cleaned includes:
calculating to obtain a missing rate of the field according to a proportion of number of pieces of the missing values of the field in a total number of pieces;
determining an attribute importance degree of the field according to an index required to be analyzed; and deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field.
3. The method according to Claim 2, characterized in that the step of deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field includes:
filling in the field when the missing rate of the field is lower than a preset missing rate threshold and the attribute importance degree thereof is lower than a preset importance grading threshold;
deleting the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold; and complementing the missing values of the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is greater than the preset importance grading threshold.
4. The method according to Claim 1, characterized in that the method further comprises:
probing metadata that describes data attribute of the data in the first data source, analyzing to obtain any quality problem present in the data according to the metadata, and setting a filtering rule according to the quality problem; and that the step of subjecting data in the data stream to a filtering process, and obtaining data to be cleaned includes: subjecting data in the data stream to a filtering process according to the filtering rule, and obtaining data to be cleaned.
5. The method according to any one of Claims 1 to 4, characterized in that the step of subjecting data in the data stream to a filtering process includes:
row-grade filtering, whereby any row not required in the data is removed; and column-grade filtering, whereby, when one row has plural columns, fields to which any required column corresponds are merely selected and retained.
6. The method according to any one of Claims 1 to 4, characterized in that the preset judging rule includes a legitimacy rule and a logic rule, and that the step of detecting whether the preliminarily cleaned data conforms to a preset judging rule includes:
setting the preliminarily cleaned data as a maximum value that conforms to the legitimacy rule, or deleting the data, if the preliminarily cleaned data does not conform to the legitimacy rule; and deleting the preliminarily cleaned data and generating an alarming instruction, if the preliminarily cleaned data does not conform to the logic rule.
7. The method according to Claim 1, characterized in that the first data source and the second data source are of different data types of the same and single distributed messaging system, that the distributed messaging system is Kafka, that the first data source and the second data source are two different Topics of Kafka, and that the data stream is embodied as a data stream based on Spark Streaming.
8. A data cleaning device, characterized in that the device comprises:
a data obtaining module, for obtaining data from a first data source, and creating an independent data stream by employing the obtained data;
a data filtering module, for subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;
a preliminarily cleaning module, for deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;
a finally cleaning module, for detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and a data outputting module, for outputting the finally cleaned data to a second data source.
9. A computer equipment, comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, characterized in that steps of the method according to any one of Claims 1 to 7 are realized when the processor executes the computer program.
10. A computer-readable storage medium, storing a computer program thereon, characterized in that steps of the method according to any one of Claims 1 to 7 are realized when the computer program is executed by a processor.

CA3177209A 2019-04-17 2019-09-29 Data cleaning method Pending CA3177209A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201910308949.0A CN110162519A (en) 2019-04-17 2019-04-17 Data clearing method
CN201910308949.0 2019-04-17
PCT/CN2019/109121 WO2020211299A1 (en) 2019-04-17 2019-09-29 Data cleansing method

Publications (1)

Publication Number Publication Date
CA3177209A1 (en) 2020-10-22

Family

ID=67639550

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3177209A Pending CA3177209A1 (en) 2019-04-17 2019-09-29 Data cleaning method

Country Status (3)

Country Link
CN (1) CN110162519A (en)
CA (1) CA3177209A1 (en)
WO (1) WO2020211299A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114356902A (en) * 2021-12-14 2022-04-15 中核武汉核电运行技术股份有限公司 Industrial data quality management method and device
CN114385606A (en) * 2021-12-09 2022-04-22 湖北省信产通信服务有限公司数字科技分公司 Big data cleaning method and system, storage medium and electronic equipment
CN115794795A (en) * 2022-12-08 2023-03-14 湖北华中电力科技开发有限责任公司 Power distribution station power consumption data standardized cleaning method, device and system and storage medium

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162519A (en) * 2019-04-17 2019-08-23 苏宁易购集团股份有限公司 Data clearing method
CN110716928A (en) * 2019-09-09 2020-01-21 上海凯京信达科技集团有限公司 Data processing method, device, equipment and storage medium
CN110704410A (en) * 2019-09-27 2020-01-17 中冶赛迪重庆信息技术有限公司 Data cleaning method, system and equipment
CN110781176A (en) * 2019-11-06 2020-02-11 国网山东省电力公司威海供电公司 Power grid data quality improvement method based on data correlation
CN110990447B (en) * 2019-12-19 2023-09-15 北京锐安科技有限公司 Data exploration method, device, equipment and storage medium
CN111563071A (en) * 2020-04-03 2020-08-21 深圳价值在线信息科技股份有限公司 Data cleaning method and device, terminal equipment and computer readable storage medium
CN111966735A (en) * 2020-07-22 2020-11-20 山东高速信息工程有限公司 NIFI-based micro-service data interaction method and system
CN111859814B (en) * 2020-07-30 2023-07-28 中国电建集团昆明勘测设计研究院有限公司 Rock aging deformation prediction method and system based on LSTM deep learning
CN112287562B (en) * 2020-11-18 2023-03-10 国网新疆电力有限公司经济技术研究院 Power equipment retired data completion method and system
CN113268476A (en) * 2021-06-07 2021-08-17 一汽解放汽车有限公司 Data cleaning method and device applied to Internet of vehicles and computer equipment
CN113535697B (en) * 2021-07-07 2024-05-24 广州三叠纪元智能科技有限公司 Climbing frame data cleaning method, climbing frame control device and storage medium
CN113568811A (en) * 2021-07-28 2021-10-29 中国南方电网有限责任公司 Distributed safety monitoring data processing method
CN114549052A (en) * 2022-01-20 2022-05-27 深圳市宝视佳科技有限公司 Data-based accurate marketing method, device, equipment and storage medium
CN116186698A (en) * 2022-12-16 2023-05-30 广东技术师范大学 Machine learning-based secure data processing method, medium and equipment
CN115809406B (en) * 2023-02-03 2023-05-12 佰聆数据股份有限公司 Fine granularity classification method, device, equipment and storage medium for electric power users
CN117290315B (en) * 2023-10-11 2024-06-25 河南师范大学 Data classification cleaning method
CN117540151B (en) * 2023-12-08 2024-06-28 深圳市亲邻科技有限公司 Data preprocessing method of data pushing system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160179599A1 (en) * 2012-10-11 2016-06-23 University Of Southern California Data processing framework for data cleansing
CN105989163A (en) * 2015-03-04 2016-10-05 ***通信集团福建有限公司 Data real-time processing method and system
CN106294745A (en) * 2016-08-10 2017-01-04 东方网力科技股份有限公司 Big data cleaning method and device
CN107025301A (en) * 2017-04-25 2017-08-08 西安理工大学 Flight ensures the method for cleaning of data
CN108596386A (en) * 2018-04-20 2018-09-28 上海市司法局 A kind of prediction convict repeats the method and system of crime probability
CN109063964A (en) * 2018-07-02 2018-12-21 浙江百先得服饰有限公司 A kind of platform data processing system
CN109255523B (en) * 2018-08-16 2021-07-20 北京奥技异科技发展有限公司 Analytical index computing platform based on KKS coding rule and big data architecture
CN109492002B (en) * 2018-10-19 2021-03-23 浙江大学华南工业技术研究院 Smart power grid big data storage and analysis system and processing method
CN110162519A (en) * 2019-04-17 2019-08-23 苏宁易购集团股份有限公司 Data clearing method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114385606A (en) * 2021-12-09 2022-04-22 湖北省信产通信服务有限公司数字科技分公司 Big data cleaning method and system, storage medium and electronic equipment
CN114356902A (en) * 2021-12-14 2022-04-15 中核武汉核电运行技术股份有限公司 Industrial data quality management method and device
CN115794795A (en) * 2022-12-08 2023-03-14 湖北华中电力科技开发有限责任公司 Power distribution station power consumption data standardized cleaning method, device and system and storage medium
CN115794795B (en) * 2022-12-08 2023-09-22 湖北华中电力科技开发有限责任公司 Power distribution station electricity consumption data standardization cleaning method, device, system and storage medium

Also Published As

Publication number Publication date
WO2020211299A1 (en) 2020-10-22
CN110162519A (en) 2019-08-23

Similar Documents

Publication Publication Date Title
CA3177209A1 (en) Data cleaning method
CN109543925B (en) Risk prediction method and device based on machine learning, computer equipment and storage medium
CN110569214A (en) Index construction method and device for log file and electronic equipment
JP5810719B2 (en) Data arrangement changing program, data arrangement changing method, and data arrangement changing apparatus
CN109656779A (en) Internal memory monitoring method, device, terminal and storage medium
CN111400361A (en) Data real-time storage method and device, computer equipment and storage medium
CN112153375B (en) Front-end performance testing method, device, equipment and medium based on video information
CN106445815A (en) Automated testing method and device
CN111858278A (en) Log analysis method and system based on big data processing and readable storage device
CN112948504B (en) Data acquisition method and device, computer equipment and storage medium
CN113190531A (en) Database migration method, device, equipment and storage medium
CN112527786A (en) Data table partition adding method and device, computer equipment and storage medium
CN113691631B (en) Data cleaning method and device and electronic equipment
CN115827691A (en) Batch processing result verification method and device, computer equipment and storage medium
CN115145674A (en) Page jump method, device, equipment and medium based on dynamic anchor point
CN115098503A (en) Null value data processing method and device, computer equipment and storage medium
CN114661686A (en) Message extraction method, device, equipment, medium and program product of log file
CN114238052A (en) Pressure measurement data filtering method and device, storage medium and computer equipment
CN113778996A (en) Large data stream data processing method and device, electronic equipment and storage medium
CN113761443A (en) Website page data acquisition and statistics method, storage medium and equipment
CN114722261A (en) Resource processing method and device, electronic equipment and storage medium
CN112256685A (en) Spreadsheet-based segmentation de-duplication import method and related product
CN112187564A (en) vSAN performance test method, apparatus, computer device and storage medium
CN112800005B (en) Deep inspection method, system, terminal and storage medium for file system
CN117076292A (en) Webpage testing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20220927
