CA3177209A1

CA3177209A1 - Data cleaning method

Info

Publication number: CA3177209A1
Application number: CA3177209A
Authority: CA
Inventors: Licheng Zhang
Original assignee: 10353744 Canada Ltd
Current assignee: 10353744 Canada Ltd
Priority date: 2019-04-17
Filing date: 2019-09-29
Publication date: 2020-10-22
Also published as: WO2020211299A1; CN110162519A

Abstract

A data cleansing method. The method comprises: acquiring data from a first data source, and establishing an independent data stream by using the acquired data (101); filtering the data in the data stream to obtain data to be cleansed (102); deleting or filling a field comprising a missing value in the data to be cleansed, to obtain preliminary cleansed data (103); detecting whether the preliminary cleansed data conforms to a preset determination rule, and deleting the data not conforming to the determination rule to obtain final cleansed data (104); and outputting the final cleansed data to a second data source (105). By using the above-mentioned method, data security can be improved.

Description

DATA CLEANING METHOD
BACKGROUND OF THE INVENTION
Technical Field

[0001] The present application relates to the field of big data processing technology, and more particularly to a data cleaning method.
Description of Related Art

[0002] With the advent of the network age, large quantities of data pour into the network incessantly, and the volume of data grows by 50% each year. Supported by these vast data sources, enterprise decisions are increasingly based on data analysis rather than on experience and intuition alone, as was traditionally the case. Data cleaning is an indispensable link in the overall data analysis process, and its quality directly affects the model effect and the final analytical conclusion. Data cleaning is the process of rechecking and verifying data; it aims to delete duplicate data, correct existing errors, and ensure data consistency. In practice, data cleaning usually occupies 50% to 80% of the time of the entire data analysis process.

[0003] Data cleaning falls into two types: offline data cleaning and real-time data cleaning. Offline data cleaning can clean data at a finer granularity by means of complicated processing, at the expense of performance; it includes missing value processing, abnormal value processing, duplicate value processing, null value filling, unit unification, and decisions on whether to standardize, whether to delete unnecessary variables, and whether to sort. Real-time data cleaning, because of its real-time requirement, is better suited to missing value filling, filtering, and legitimacy checking of data. However, the currently available data cleaning process is usually integrated with the data analyzing process, so coupling between the two is high, the data cleaning process is strongly affected by the data-analyzing code, data loss occurs easily, and data security is poor.

Date Reçue/Date Received 2022-09-27
SUMMARY OF THE INVENTION

[0004] In view of the above technical problems, there is an urgent need to propose a data cleaning method capable of enhancing data security.

[0005] There is provided a data cleaning method that comprises:

[0006] obtaining data from a first data source, and creating an independent data stream by employing the obtained data;

[0007] subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;

[0008] deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;

[0009] detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and

[0010] outputting the finally cleaned data to a second data source.

[0011] In one of the embodiments, the step of deleting or filling in any field containing missing values in the data to be cleaned includes:

[0012] calculating to obtain a missing rate of the field according to a proportion of number of pieces of the missing values of the field in a total number of pieces;

[0013] determining an attribute importance degree of the field according to an index required to be analyzed; and

[0014] deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field.

[0015] In one of the embodiments, the step of deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field includes:

[0016] filling in the field when the missing rate of the field is lower than a preset missing rate threshold and the attribute importance degree thereof is lower than a preset importance grading threshold;

[0017] deleting the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold; and

[0018] complementing the missing values of the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is greater than the preset importance grading threshold.

[0019] In one of the embodiments, the method further comprises:

[0020] probing metadata that describes data attribute of the data in the first data source, analyzing to obtain any quality problem present in the data according to the metadata, and setting a filtering rule according to the quality problem;

[0021] the step of subjecting data in the data stream to a filtering process, and obtaining data to be cleaned includes: subjecting data in the data stream to a filtering process according to the filtering rule, and obtaining data to be cleaned.

[0022] In one of the embodiments, the step of subjecting data in the data stream to a filtering process includes:

[0023] row-grade filtering, whereby any row not required in the data is removed; and

[0024] column-grade filtering, whereby, when one row has plural columns, fields to which any required column corresponds are merely selected and retained.

[0025] In one of the embodiments, the preset judging rule includes a legitimacy rule and a logic rule, and the step of detecting whether the preliminarily cleaned data conforms to a preset judging rule includes:

[0026] setting the preliminarily cleaned data as a maximum value that conforms to the legitimacy rule, or deleting the data, if the preliminarily cleaned data does not conform to the legitimacy rule; and

[0027] deleting the preliminarily cleaned data and generating an alarming instruction, if the preliminarily cleaned data does not conform to the logic rule.

[0028] In one of the embodiments, the first data source and the second data source are of different data types of the same and single distributed messaging system, the distributed messaging system is Kafka, the first data source and the second data source are two different Topics of Kafka, and the data stream is embodied as a data stream based on Spark Streaming.

[0029] There is provided a data cleaning device that comprises:

[0030] a data obtaining module, for obtaining data from a first data source, and creating an independent data stream by employing the obtained data;

[0031] a data filtering module, for subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;

[0032] a preliminarily cleaning module, for deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;

[0033] a finally cleaning module, for detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and

[0034] a data outputting module, for outputting the finally cleaned data to a second data source.

[0035] There is provided a computer equipment that comprises a memory, a processor, and a computer program stored on the memory and operable on the processor, and the following steps are realized when the processor executes the computer program:

[0036] obtaining data from a first data source, and creating an independent data stream by employing the obtained data;

[0037] subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;

[0038] deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;

[0039] detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and

[0040] outputting the finally cleaned data to a second data source.

[0041] There is provided a computer-readable storage medium that stores thereon a computer program, and the following steps are realized when the computer program is executed by a processor:

[0042] obtaining data from a first data source, and creating an independent data stream by employing the obtained data;

[0043] subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;

[0044] deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;

[0045] detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and

[0046] outputting the finally cleaned data to a second data source.

[0047] In comparison with prior-art technology, the present invention achieves the following advantageous effects.

[0048] In the data cleaning method, and corresponding device, computer equipment and storage medium, data cleaning is performed by creating an independent data stream, and data obtained from a first data source is cleaned and thereafter placed in another data source for processing by subsequent businesses, so that the data cleaning process is separated from data analyzing codes, coupling among the codes is reduced, and data security is effectively enhanced.

[0049] Moreover, data filtering is placed as the first step of data cleaning in the present invention, thereby reducing the quantity of data to be subsequently cleaned and enhancing the efficiency of data cleaning.
BRIEF DESCRIPTION OF THE DRAWINGS

[0050] Fig. 1 is a flowchart schematically illustrating the data cleaning method in an embodiment; and

[0051] Fig. 2 is a block diagram illustrating the structure of the data cleaning device in an embodiment.
DETAILED DESCRIPTION OF THE INVENTION

[0052] In order to make the objectives, technical solutions and advantages of the present application more lucid and clear, the present application is described in greater detail below with reference to accompanying drawings and embodiments. As should be understood, the specific embodiments as described here are merely meant to explain the present application, rather than to restrict the present application.

[0053] In one embodiment, as shown in Fig. 1, the present application provides a data cleaning method that comprises the following steps.

[0054] Step 101 - obtaining data from a first data source, and creating an independent data stream by employing the obtained data.

[0055] The first data source is the source from which data is obtained, and the data stream is an ordered sequence of data nodes with a starting point and an ending point.

[0056] Specifically, by creating an independent data stream for data cleaning, the present invention separates the data cleaning process from data analyzing codes, and reduces coupling among the codes.

[0057] Step 102 - subjecting data in the data stream to a filtering process, and obtaining data to be cleaned.

[0058] Specifically, data filtering is placed as the first step of data cleaning, which can effectively reduce the quantity of data to be subsequently cleaned and greatly enhance the efficiency of data cleaning.

[0059] Step 103 - deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data.

[0060] A missing value is information that is absent from the data, that is to say, the value of one or more attributes of the data is incomplete.

[0061] Step 104 - detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data.

[0062] Step 105 - outputting the finally cleaned data to a second data source.

[0063] The second data source is another data source that is different from the first data source, and it is employed to store data to be used or processed by subsequent businesses.

[0064] Specifically, the data cleaning process of the present invention is independent of the other processing steps of data analysis and is not affected by other code, so data security is higher.

[0065] In the data cleaning method, data cleaning is performed by creating an independent data stream, and data obtained from a first data source is cleaned and thereafter placed in another data source for processing by subsequent businesses, so that the data cleaning process is separated from data analyzing codes, coupling among the codes is reduced, and data security is effectively enhanced.

[0066] As one specific mode of execution, the first data source and the second data source are different data types of one and the same distributed messaging system. For instance, the distributed messaging system is Kafka, the first data source and the second data source are two different Topics of Kafka, and the data stream is a data stream based on Spark Streaming.
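
Steps 101 to 105 can be illustrated with a minimal sketch in plain Python. It is not the Kafka and Spark Streaming implementation named above: two in-memory lists stand in for the two Topics, a generator stands in for the independent data stream, and the names cleaning_stream, keep, handle_missing and conforms are hypothetical. Missing values are handled per record here for brevity, whereas the embodiments below decide per field.

```python
from typing import Callable, Iterable, Iterator, Optional

Record = dict


def cleaning_stream(
    first_source: Iterable[Record],
    keep: Callable[[Record], bool],
    handle_missing: Callable[[Record], Optional[Record]],
    conforms: Callable[[Record], bool],
) -> Iterator[Record]:
    """Yield finally cleaned records; every argument name is hypothetical."""
    for record in first_source:                # Step 101: read from the first source
        if not keep(record):                   # Step 102: filtering comes first
            continue
        record = handle_missing(dict(record))  # Step 103: delete or fill
        if record is None:                     # None means the record was deleted
            continue
        if conforms(record):                   # Step 104: preset judging rule
            yield record


# Assumption: plain lists stand in for the first and second data sources.
raw_topic = [
    {"cid": "news", "age": 30},
    {"cid": None, "age": 41},
    {"cid": "sport", "age": None},
    {"cid": "news", "age": 200},
]
clean_topic: list = []

stream = cleaning_stream(
    raw_topic,
    keep=lambda r: r["cid"] is not None,
    handle_missing=lambda r: {**r, "age": 35} if r["age"] is None else r,
    conforms=lambda r: 0 <= r["age"] <= 120,
)
clean_topic.extend(stream)                     # Step 105: write to the second source
print(clean_topic)
# [{'cid': 'news', 'age': 30}, {'cid': 'sport', 'age': 35}]
```

Because the generator only reads from one container and writes to another, the analysis code that later consumes clean_topic never touches the cleaning logic, which is the decoupling the method relies on.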

[0067] In one of the embodiments, the step of deleting or filling in any field containing missing values in the data to be cleaned includes:

[0068] calculating to obtain a missing rate of the field according to a proportion of number of pieces of the missing values of the field in a total number of pieces;

[0069] determining an attribute importance degree of the field according to an index required to be analyzed; and

[0070] deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field.

[0071] The missing rate of the field is the proportion of the number of pieces of the missing values of the field in the total number of pieces.

[0072] For instance, if a salary field contains 100 records in total and 20 of those records are missing values, the missing rate is 20%.
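
The salary example can be reproduced with a short sketch. Representing a missing value as None and the function name missing_rate are assumptions made for illustration only.

```python
def missing_rate(values):
    """Share of entries that are missing (None) among all entries of a field."""
    if not values:
        raise ValueError("the field has no records")
    missing = sum(1 for v in values if v is None)
    return missing / len(values)


# 100 salary records, 20 of which are missing.
salary = [5000] * 80 + [None] * 20
print(missing_rate(salary))  # 0.2
```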

[0073] The criterion for judging the attribute importance degree of a field is decided by the index required to be analyzed. For example, if users are to be profiled or labeled so as to supply data for subsequent precision marketing, attribute information of the users must be collected, and attributes such as the ages and genders of the users are important fields.

[0074] In one of the embodiments, the step of deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field includes:

[0075] filling in the field when the missing rate of the field is lower than a preset missing rate threshold and the attribute importance degree thereof is lower than a preset importance grading threshold.

[0076] Specifically, if the field attribute is numerical-type data, it suffices to fill in the field according to the data distribution; more specifically, if the data is uniformly distributed, the field is filled in with the mean value, and if the data is distributed in a skewed manner, the field is filled in with the median;

[0077] deleting the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold; and

[0078] complementing the missing values of the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is greater than the preset importance grading threshold.
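
The filling branch for numerical fields, as described in [0076], might look like the following sketch. How skewness is detected is not stated in the text, so the caller passes a hypothetical skewed flag; the function name fill_numeric is likewise an assumption.

```python
from statistics import mean, median


def fill_numeric(values, skewed):
    """Replace None with the mean, or with the median when the data is skewed."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no observed values to derive a fill value from")
    fill_value = median(present) if skewed else mean(present)
    return [fill_value if v is None else v for v in values]


print(fill_numeric([10, 12, None, 14], skewed=False))   # [10, 12, 12, 14]
print(fill_numeric([1, 2, None, 3, 100], skewed=True))  # [1, 2, 2.5, 3, 100]
```

In the second call the outlier 100 would pull a mean-based fill value up to 26.5, which is why the median is the safer choice for skewed data.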

[0079] Specifically, the step of complementing the missing values of the field includes:

[0080] complementing through other information, such as using an ID card number to reckon gender, native place, date of birth, and age, etc.;

[0081] complementing through preceding and following data, for instance, when a value is missing in a time sequence, the mean of the preceding and following values can serve as the complementary value, and when there are many missing values, numerical values obtained through a smoothing process can serve as the complementary values;

[0082] where complementing is impossible, the field is to be excluded from the cleaned data, but it should not be permanently deleted, since it may be used subsequently.
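
Complementing from preceding and following data in a time sequence can be sketched as follows. Only an isolated gap with a known value on each side is filled; longer gaps are left untouched, since the smoothing process mentioned above is not specified in enough detail to reproduce. The function name is hypothetical.

```python
def complement_with_neighbours(series):
    """Fill an isolated None with the mean of the values just before and after it."""
    result = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        has_both = 0 < i < len(series) - 1
        if has_both and series[i - 1] is not None and series[i + 1] is not None:
            result[i] = (series[i - 1] + series[i + 1]) / 2
    return result


print(complement_with_neighbours([10, None, 14, None, None, 20]))
# [10, 12.0, 14, None, None, 20]
```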

[0083] As one specific mode of execution, the missing rate threshold can be any numerical value between 90% and 95%.
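
The three branches of this embodiment can be condensed into one decision function. This is a sketch under stated assumptions: the importance degree is treated as a number, the default importance threshold of 0.5 is invented for illustration, and the default missing rate threshold of 0.9 is taken from the lower end of the range given above. Combinations that the text does not enumerate are reported as "unspecified" rather than guessed.

```python
def decide_action(missing_rate, importance,
                  rate_threshold=0.9, importance_threshold=0.5):
    """Return 'fill', 'delete', 'complement', or 'unspecified' for one field."""
    low_rate = missing_rate < rate_threshold
    if low_rate and importance < importance_threshold:
        return "fill"
    if not low_rate and importance < importance_threshold:
        return "delete"
    if not low_rate and importance > importance_threshold:
        return "complement"
    # Combinations the text does not enumerate (assumption: left to the caller).
    return "unspecified"


print(decide_action(0.20, 0.1))  # fill
print(decide_action(0.95, 0.1))  # delete
print(decide_action(0.95, 0.9))  # complement
print(decide_action(0.20, 0.9))  # unspecified
```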

[0084] In one of the embodiments, before data in the data stream is subjected to a filtering process, metadata that describes data attribute of the data in the first data source is firstly probed, any quality problem present in the data is then analyzed and obtained according to the metadata, a filtering rule is set according to the quality problem, the data in the data stream is subjected to a filtering process according to the filtering rule, and the data to be cleaned is obtained in Step 102.

[0085] Metadata, also referred to as intermediary data or relay data, is data that describes data; it mainly describes information on data attributes and supports functions such as indicating storage locations, historical data, resource searching, and document recording.

[0086] Specifically, the data attribute required to be processed is packaged into metadata, thus enabling the program to possess better expandability. At the same time, a corresponding filtering rule is stipulated with respect to any quality problem of the data, thus facilitating enhancement of data filtering efficiency.
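
One way to picture a filtering rule derived from metadata is the sketch below. The metadata layout (a type and a required flag per field) and the two quality problems it checks for are assumptions chosen for illustration; the text does not prescribe a metadata format.

```python
def build_filter_rule(metadata):
    """Turn per-field metadata into a predicate that accepts well-formed records."""
    def rule(record):
        for field, spec in metadata.items():
            value = record.get(field)
            if value is None:
                if spec["required"]:
                    return False      # quality problem: required field absent
                continue
            if not isinstance(value, spec["type"]):
                return False          # quality problem: unexpected type
        return True
    return rule


# Hypothetical metadata describing two attributes of the source data.
metadata = {
    "cid": {"type": str, "required": True},
    "duration": {"type": int, "required": False},
}
rule = build_filter_rule(metadata)
records = [
    {"cid": "news", "duration": 12},
    {"duration": 7},
    {"cid": "sport", "duration": "long"},
    {"cid": "tech"},
]
print([r for r in records if rule(r)])
# [{'cid': 'news', 'duration': 12}, {'cid': 'tech'}]
```

Adding a new attribute then only requires a new metadata entry, not a change to the filtering code, which is one reading of the expandability mentioned above.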

[0087] In one of the embodiments, the step of subjecting data in the data stream to a filtering process includes:

[0088] row-grade filtering, whereby any row not required in the data is removed; and

[0089] column-grade filtering, whereby, when one row has plural columns, fields to which any required column corresponds are merely selected and retained.

[0090] Specifically, the combination of row-grade filtering with column-grade filtering makes it possible to effectively quicken the data filtering speed.

[0091] For instance, a process to calculate pv/uv by divided channels:

[0092] the log data includes approximately 200 fields, such as the IP address, browser information, client terminal equipment information, the specific access time, the specific page accessed, the page previously accessed, and the access duration. The requirement in this embodiment is to count the number of clicks of each channel and the number of visits from independent IPs.

[0093] By row-grade filtering, only log data relevant to the channels is selected and retained, so that log data not containing the channels is filtered away;

[0094] by column-grade filtering, cid (channel name), uid (equipment identification), and ip address are selected from the approximately 200 fields contained in the log data relevant to the channels, unnecessary fields are filtered away, and it is then possible to count and obtain the pv/uv of each channel;

[0095] pv is an abbreviation of Page View, namely the page browsing amount; each access by a user to a page in a website is recorded once, and multiple accesses by the user to the same page accumulate in the total pv;

[0096] uv is an abbreviation of unique visitor, and means a natural person who accesses and browses the page through the internet.
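
The pv/uv example can be sketched in a few lines. The sample log rows are invented, and the sketch assumes that uv is counted as the number of distinct ip values per channel, in line with the "independent IP" requirement stated in [0092]; counting by uid instead would be an equally plausible reading.

```python
from collections import Counter, defaultdict

KEEP = ("cid", "uid", "ip")


def pv_uv_by_channel(logs):
    """Return ({channel: pv}, {channel: uv}) after row- and column-grade filtering."""
    # Row-grade filtering: keep only log rows that carry a channel.
    rows = (row for row in logs if row.get("cid"))
    # Column-grade filtering: keep only the three fields the counts need.
    slim = [{key: row.get(key) for key in KEEP} for row in rows]

    pv = Counter(row["cid"] for row in slim)
    visitors = defaultdict(set)
    for row in slim:
        visitors[row["cid"]].add(row["ip"])
    uv = {cid: len(ips) for cid, ips in visitors.items()}
    return dict(pv), uv


logs = [
    {"cid": "news", "uid": "d1", "ip": "10.0.0.1", "browser": "a"},
    {"cid": "news", "uid": "d1", "ip": "10.0.0.1", "browser": "a"},
    {"cid": "news", "uid": "d2", "ip": "10.0.0.2", "browser": "b"},
    {"cid": None, "uid": "d3", "ip": "10.0.0.3", "browser": "c"},
    {"cid": "sport", "uid": "d2", "ip": "10.0.0.2", "browser": "b"},
]
print(pv_uv_by_channel(logs))
# ({'news': 3, 'sport': 1}, {'news': 2, 'sport': 1})
```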

[0097] In this embodiment, expandability is also taken into consideration: for instance, if the retention rate of users might need to be counted in subsequent data processing, data such as the access time of each ip address can additionally be recorded.

[0098] The retention rate of users is a ratio of old users to the total users.

[0099] In one of the embodiments, the preset judging rule includes a legitimacy rule and a logic rule, and the step of detecting whether the preliminarily cleaned data conforms to a preset judging rule includes:

[0100] setting the preliminarily cleaned data as a maximum value that conforms to the legitimacy rule, or deleting the data, if the preliminarily cleaned data does not conform to the legitimacy rule; and

[0101] deleting the preliminarily cleaned data and generating an alarming instruction, if the preliminarily cleaned data does not conform to the logic rule.

[0102] The legitimacy rule is a format requirement rule covering, for example, numerical values, dates, and field contents.

[0103] Specifically, an example of a field-type legitimacy rule is that a date field has the format "YYYY-MM-DD".

[0104] Examples of a field content legitimacy rule are that the gender is male, female, or unknown, and that the date of birth is earlier than or equal to "today".

[0105] The logic rule is a common-sense rule used for judging whether the data is logically plausible; for instance, people's ages usually lie between 0 and 120, so a piece of data is judged as abnormal if an age of 200 appears in it.

[0106] After the data has been cleaned by the legitimacy rule and the logic rule, any data not conforming to the format requirements or to logic is removed, and valid, finally cleaned data is obtained.
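
The example rules above can be combined into a single check, sketched below under several assumptions: records are dictionaries with hypothetical birth and gender keys, the age is derived from the date of birth, a future date of birth is set to "today" as the maximum legitimate value, and the alarming instruction is represented by a plain message string.

```python
from datetime import date, datetime

GENDERS = {"male", "female", "unknown"}


def judge(record, today):
    """Return (cleaned_record_or_None, alarm_or_None) for one record."""
    # Legitimacy rule: the date field must parse as YYYY-MM-DD.
    try:
        born = datetime.strptime(record["birth"], "%Y-%m-%d").date()
    except ValueError:
        return None, None                      # illegitimate format: delete
    # Legitimacy rule: gender must be one of the allowed contents.
    if record["gender"] not in GENDERS:
        return None, None
    # Legitimacy rule: birth date no later than today; set to the maximum legal value.
    if born > today:
        born = today
    # Logic rule: age should lie between 0 and 120.
    age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    if not 0 <= age <= 120:
        return None, f"implausible age {age}"  # delete and raise an alarm
    return {**record, "birth": born.isoformat()}, None


today = date(2024, 1, 1)
print(judge({"birth": "1990-05-17", "gender": "female"}, today))
# ({'birth': '1990-05-17', 'gender': 'female'}, None)
print(judge({"birth": "1800-01-01", "gender": "male"}, today))
# (None, 'implausible age 224')
```

Keeping the alarm separate from the returned record lets the caller route logic-rule violations to monitoring while legitimacy-rule violations are dropped silently, mirroring the different handling of the two rules.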

[0107] As should be understood, although the various steps in the flowchart of Fig. 1 are displayed sequentially as indicated by arrows, these steps are not necessarily executed in the sequence indicated by the arrows. Unless otherwise explicitly noted herein, execution of these steps is not restricted to any sequence, and these steps can also be executed in other sequences than those indicated in the drawings. Moreover, at least some of the steps in the flowchart of Fig. 1 may include plural sub-steps or phases; these sub-steps or phases are not necessarily completed at the same time, but can be executed at different times, and they are also not necessarily performed sequentially, but can be performed in turns or alternately with other steps or with at least some of the sub-steps or phases of other steps.

[0108] In one embodiment, as shown in Fig. 2, there is provided a data cleaning device that comprises a data obtaining module, a data filtering module, a preliminarily cleaning module, a finally cleaning module, and a data outputting module, of which:

[0109] the data obtaining module is employed for obtaining data from a first data source, and creating an independent data stream by employing the obtained data;

[0110] the data filtering module is employed for subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;

[0111] the preliminarily cleaning module is employed for deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;

[0112] the finally cleaning module is employed for detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and

[0113] the data outputting module is employed for outputting the finally cleaned data to a second data source.

[0114] During specific implementation, the first data source and the second data source are of different data types of the same and single distributed messaging system.

[0115] In one embodiment, the preliminarily cleaning module includes a missing rate sub-module, an importance degree sub-module, and a missing value processing sub-module, of which:

[0116] the missing rate sub-module is employed for calculating to obtain a missing rate of the field according to a proportion of number of pieces of the missing values of the field in a total number of pieces;

[0117] the importance degree sub-module is employed for determining an attribute importance degree of the field according to an index required to be analyzed; and

[0118] the missing value processing sub-module is employed for deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field.

[0119] Further, the missing value processing sub-module includes a comparing unit and a preliminarily processing unit, of which:

[0120] the comparing unit is employed for comparing the missing rate and the attribute importance degree of the field respectively with a preset missing rate threshold and a preset importance grading threshold, and the preliminarily processing unit is employed for filling in, deleting or complementing the field:

[0121] filling in the field when the missing rate of the field is lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold;

[0122] deleting the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold; and

[0123] complementing the missing values of the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is greater than the preset importance grading threshold.

[0124] In one embodiment, the data cleaning device further comprises a data probing module for firstly probing metadata that describes data attribute of the data in the first data source before the data in the data stream is subjected to a filtering process, then analyzing to obtain any quality problem present in the data according to the metadata, and setting a filtering rule according to the quality problem.

[0125] In one embodiment, the data filtering module includes a row-grade filtering unit and a column-grade filtering unit, of which:

[0126] the row-grade filtering unit is employed for removing any row not required in the data; and the column-grade filtering unit is employed for, when one row has plural columns, merely selecting and retaining fields to which any required column corresponds.

[0127] In one embodiment, the finally cleaning module includes a legitimacy detecting unit, a logics detecting unit, and a finally processing unit, of which:

[0128] the legitimacy detecting unit is employed for detecting whether the preliminarily cleaned data conforms to a preset legitimacy rule;

[0129] the logics detecting unit is employed for detecting whether the preliminarily cleaned data conforms to a preset logic rule; and

[0130] the finally processing unit is employed for setting the preliminarily cleaned data not conforming to the legitimacy rule as a maximum value that conforms to the legitimacy rule, or deleting the data; and deleting the preliminarily cleaned data not conforming to the logic rule, and generating an alarming instruction.

[0131] Specific definitions relevant to the data cleaning device may be inferred from the aforementioned definitions of the data cleaning method, so they are not repeated here. The various modules in the aforementioned data cleaning device can be wholly or partly realized via software, hardware, or a combination of the two. The various modules can be embedded in, or independent of, a processor in a computer equipment in the form of hardware, and can also be stored in the form of software in a memory in a computer equipment, so that the processor can invoke them and perform the operations corresponding to the aforementioned various modules.

[0132] In one embodiment, a computer equipment is provided, and the computer equipment can be a terminal. The computer equipment comprises a processor, a memory, a network interface, a display screen and an input means connected to each other via a system bus. The processor of the computer equipment is employed to provide computing and controlling capabilities. The memory of the computer equipment includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores therein an operating system and a computer program. The internal memory provides environment for the running of the operating system and the computer program in the nonvolatile storage medium. The network interface of the computer equipment is employed to connect to an external terminal via network for communication. The computer program realizes a data cleaning method when it is executed by a processor. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, the input means of the computer equipment can be a touch layer covering on the display screen, can also be a press button, a track ball or a touch control board disposed on the housing of the computer equipment, and can further be an externally connected keyboard, touch control board or mouse, etc.

[0133] In one embodiment, there is provided a computer equipment that comprises a memory, a processor and a computer program stored on the memory and operable on the processor, and the following steps are realized when the processor executes the computer program: obtaining data from a first data source, and creating an independent data stream by employing the obtained data; subjecting data in the data stream to a filtering process, and obtaining data to be cleaned; deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data; detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and outputting the finally cleaned data to a second data source.

[0134] In one embodiment, when the processor executes the computer program, the following steps are further realized: calculating to obtain a missing rate of the field according to a proportion of number of pieces of the missing values of the field in a total number of pieces; determining an attribute importance degree of the field according to an index required to be analyzed; and deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field.

[0135] In one embodiment, when the processor executes the computer program, the following steps are further realized: filling in the field when the missing rate of the field is lower than a preset missing rate threshold and the attribute importance degree thereof is lower than a preset importance grading threshold; deleting the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold; and complementing the missing values of the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is greater than the preset importance grading threshold.

[0136] In one embodiment, when the processor executes the computer program, the following steps are further realized: probing metadata that describes data attributes of the data in the first data source, analyzing the metadata to identify any quality problem present in the data, and setting a filtering rule according to the quality problem; and subjecting data in the data stream to a filtering process according to the filtering rule, and obtaining data to be cleaned.
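
One way to picture this step is shown below. The sketch assumes that the metadata is a mapping from field name to declared type and that the only quality problem looked for is a value contradicting its declared type; the names `metadata`, `find_quality_problems` and `make_filter_rule` are hypothetical and are not taken from the present application.

```python
metadata = {"user": str, "age": int}  # hypothetical: field -> declared type


def find_quality_problems(sample, metadata):
    """Return the fields whose sampled values contradict the declared type."""
    problems = set()
    for rec in sample:
        for field, declared in metadata.items():
            value = rec.get(field)
            if value is not None and not isinstance(value, declared):
                problems.add(field)
    return problems


def make_filter_rule(problems, metadata):
    """Build a predicate that rejects records hitting a known problem."""
    def rule(rec):
        return all(
            rec.get(field) is None or isinstance(rec[field], metadata[field])
            for field in problems
        )
    return rule


stream = [
    {"user": "a", "age": 31},
    {"user": "b", "age": "thirty"},  # type problem in `age`
    {"user": "c", "age": None},      # missing, left for the cleaning step
]
problems = find_quality_problems(stream, metadata)
rule = make_filter_rule(problems, metadata)
to_clean = [rec for rec in stream if rule(rec)]
```

Missing values pass the rule on purpose, since they are handled by the later step of deleting or filling in fields.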

[0137] In one embodiment, when the processor executes the computer program, the following steps are further realized: row-grade filtering, whereby any row not required in the data is removed; and column-grade filtering, whereby, when one row has plural columns, only the fields corresponding to the required columns are selected and retained.
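
Row-grade and column-grade filtering can be sketched as two small functions. The sketch assumes rows are dictionaries keyed by column name; the function names, the sample rows and the choice of required rows and columns are hypothetical.

```python
def row_filter(rows, is_required):
    """Row-grade filtering: drop every row that is not required."""
    return [row for row in rows if is_required(row)]


def column_filter(rows, required_columns):
    """Column-grade filtering: keep only the fields of required columns."""
    return [
        {col: row[col] for col in required_columns if col in row}
        for row in rows
    ]


rows = [
    {"event": "order", "user": "a", "amount": 12.5, "debug": "x1"},
    {"event": "heartbeat", "user": "a", "amount": None, "debug": "x2"},
    {"event": "order", "user": "b", "amount": 7.0, "debug": "x3"},
]
orders = row_filter(rows, lambda row: row["event"] == "order")
kept = column_filter(orders, ["user", "amount"])
```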

[0138] The preset judging rule includes a legitimacy rule and a logic rule. In one embodiment, when the processor executes the computer program, the following steps are further realized: setting the preliminarily cleaned data as a maximum value that conforms to the legitimacy rule, or deleting the data, if the preliminarily cleaned data does not conform to the legitimacy rule; and deleting the preliminarily cleaned data and generating an alarming instruction, if the preliminarily cleaned data does not conform to the logic rule.
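
The two kinds of rule can be illustrated with a short sketch. Several points here are assumptions rather than details from the present application: the legitimacy rule is modelled as a numeric range, a value above the range is set to the maximum legitimate value while a value below it is deleted, the logic rule is a caller-supplied predicate, and the alarming instruction is modelled as a message appended to a list. The name `apply_judging_rules` is hypothetical.

```python
def apply_judging_rules(records, field, lo, hi, logic_rule):
    """Return (finally cleaned records, alarm messages)."""
    cleaned, alarms = [], []
    for rec in records:
        rec = dict(rec)  # work on a copy
        value = rec[field]
        # Legitimacy rule: `field` must lie in [lo, hi].
        if value > hi:
            rec[field] = hi      # set to the maximum legitimate value
        elif value < lo:
            continue             # assumed: no sensible repair, so delete
        # Logic rule: delete the record and raise an alarm on violation.
        if not logic_rule(rec):
            alarms.append(f"logic rule violated: {rec}")
            continue
        cleaned.append(rec)
    return cleaned, alarms


records = [
    {"age": 30, "start": 1, "end": 5},
    {"age": 200, "start": 2, "end": 9},   # illegal age, capped at 150
    {"age": -4, "start": 3, "end": 4},    # illegal age, deleted
    {"age": 41, "start": 8, "end": 6},    # ends before it starts: alarm
]
cleaned, alarms = apply_judging_rules(
    records, "age", 0, 150, lambda r: r["end"] >= r["start"]
)
```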

[0139] In one embodiment, there is provided a computer-readable storage medium storing thereon a computer program, and the following steps are realized when the computer program is executed by a processor: obtaining data from a first data source, and creating an independent data stream by employing the obtained data; subjecting data in the data stream to a filtering process, and obtaining data to be cleaned; deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data; detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and outputting the finally cleaned data to a second data source.

[0140] In one embodiment, when the computer program is executed by a processor, the following steps are further realized: calculating a missing rate of the field as the proportion of the number of pieces of data in which the value of the field is missing to the total number of pieces of data; determining an attribute importance degree of the field according to an index required to be analyzed; and deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field.

[0141] In one embodiment, when the computer program is executed by a processor, the following steps are further realized: filling in the field when the missing rate of the field is lower than a preset missing rate threshold and the attribute importance degree thereof is lower than a preset importance grading threshold; deleting the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold; and complementing the missing values of the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is greater than the preset importance grading threshold.

[0142] In one embodiment, when the computer program is executed by a processor, the following steps are further realized: probing metadata that describes data attributes of the data in the first data source, analyzing the metadata to identify any quality problem present in the data, and setting a filtering rule according to the quality problem; and subjecting data in the data stream to a filtering process according to the filtering rule, and obtaining data to be cleaned.

[0143] In one embodiment, when the computer program is executed by a processor, the following steps are further realized: row-grade filtering, whereby any row not required in the data is removed; and column-grade filtering, whereby, when one row has plural columns, only the fields corresponding to the required columns are selected and retained.

[0144] The preset judging rule includes a legitimacy rule and a logic rule. In one embodiment, when the computer program is executed by a processor, the following steps are further realized: setting the preliminarily cleaned data as a maximum value that conforms to the legitimacy rule, or deleting the data, if the preliminarily cleaned data does not conform to the legitimacy rule; and deleting the preliminarily cleaned data and generating an alarming instruction, if the preliminarily cleaned data does not conform to the logic rule.

[0145] As persons of ordinary skill in the art will understand, all or part of the flows in the methods according to the aforementioned embodiments can be completed by a computer program instructing relevant hardware. The computer program can be stored in a nonvolatile computer-readable storage medium and, when executed, can include the flows embodied in the aforementioned methods. Any reference to memory, storage, a database or other media used in the various embodiments provided by the present application can include nonvolatile and/or volatile memory. The nonvolatile memory can include a read-only memory (ROM), a programmable ROM (PROM), an electrically programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM) or a flash memory. The volatile memory can include a random access memory (RAM) or an external cache memory. By way of explanation rather than restriction, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

[0146] Technical features of the aforementioned embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the aforementioned embodiments are described, but all such combinations should be considered to fall within the scope recorded in the Description as long as they are not mutually contradictory.

[0147] The foregoing embodiments merely represent several modes of execution of the present application. Although their descriptions are relatively specific and detailed, they should not be understood as restricting the patent scope of the invention. It should be pointed out that persons of ordinary skill in the art may make various modifications and improvements without departing from the conception of the present application, and all of these fall within the protection scope of the present application. Accordingly, the patent protection scope of the present application shall be based on the attached Claims.

Claims

What is claimed is:

1. A data cleaning method, characterized in that the method comprises:
obtaining data from a first data source, and creating an independent data stream by employing the obtained data;
subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;
deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;
detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and
outputting the finally cleaned data to a second data source.

2. The method according to Claim 1, characterized in that the step of deleting or filling in any field containing missing values in the data to be cleaned includes:
calculating a missing rate of the field as the proportion of the number of pieces of data in which the value of the field is missing to the total number of pieces of data;
determining an attribute importance degree of the field according to an index required to be analyzed; and
deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field.

3. The method according to Claim 2, characterized in that the step of deleting or filling in the field containing missing values according to the missing rate and the attribute importance degree of the field includes:
filling in the field when the missing rate of the field is lower than a preset missing rate threshold and the attribute importance degree thereof is lower than a preset importance grading threshold;
deleting the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is lower than the preset importance grading threshold; and
complementing the missing values of the field when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance degree thereof is greater than the preset importance grading threshold.

4. The method according to Claim 1, characterized in that the method further comprises:
probing metadata that describes data attribute of the data in the first data source, analyzing to obtain any quality problem present in the data according to the metadata, and setting a filtering rule according to the quality problem; and
that the step of subjecting data in the data stream to a filtering process, and obtaining data to be cleaned includes:
subjecting data in the data stream to a filtering process according to the filtering rule, and obtaining data to be cleaned.

5. The method according to any one of Claims 1 to 4, characterized in that the step of subjecting data in the data stream to a filtering process includes:
row-grade filtering, whereby any row not required in the data is removed; and
column-grade filtering, whereby, when one row has plural columns, only the fields corresponding to the required columns are selected and retained.

6. The method according to any one of Claims 1 to 4, characterized in that the preset judging rule includes a legitimacy rule and a logic rule, and that the step of detecting whether the preliminarily cleaned data conforms to a preset judging rule includes:
setting the preliminarily cleaned data as a maximum value that conforms to the legitimacy rule, or deleting the data, if the preliminarily cleaned data does not conform to the legitimacy rule; and
deleting the preliminarily cleaned data and generating an alarming instruction, if the preliminarily cleaned data does not conform to the logic rule.

7. The method according to Claim 1, characterized in that the first data source and the second data source are of different data types of the same and single distributed messaging system, that the distributed messaging system is Kafka, that the first data source and the second data source are two different Topics of Kafka, and that the data stream is embodied as a data stream based on Spark Streaming.

8. A data cleaning device, characterized in that the device comprises:
a data obtaining module, for obtaining data from a first data source, and creating an independent data stream by employing the obtained data;
a data filtering module, for subjecting data in the data stream to a filtering process, and obtaining data to be cleaned;
a preliminarily cleaning module, for deleting or filling in any field containing missing values in the data to be cleaned, and obtaining preliminarily cleaned data;
a finally cleaning module, for detecting whether the preliminarily cleaned data conforms to a preset judging rule, deleting any data that does not conform to the judging rule, and obtaining finally cleaned data; and
a data outputting module, for outputting the finally cleaned data to a second data source.

9. A computer equipment, comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, characterized in that steps of the method according to any one of Claims 1 to 7 are realized when the processor executes the computer program.

10. A computer-readable storage medium, storing a computer program thereon, characterized in that steps of the method according to any one of Claims 1 to 7 are realized when the computer program is executed by a processor.
