CN109947754A - Data cleaning method and device - Google Patents


Info

Publication number
CN109947754A
Authority
CN
China
Prior art keywords
data
data cleansing
file
flow file
workflow application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910080821.3A
Other languages
Chinese (zh)
Inventor
田森
杨柳
安平凯
黄小浦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Hengyun Co Ltd
Original Assignee
Zhongke Hengyun Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Hengyun Co Ltd
Priority to CN201910080821.3A
Publication of CN109947754A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to the technical field of data processing and provides a data cleaning method and device. The method includes: receiving a data cleansing flow file and initial data sent by a client; obtaining the corresponding multiple workflow application models according to the data cleansing flow file; generating a corresponding data cleansing execution file according to the multiple workflow application models; and cleaning the initial data according to the data cleansing execution file. Because the user can freely combine workflow application models with different functions in the data cleansing flow file, the data cleaning method provided by the embodiments of the present invention has high flexibility and scalability. In addition, because each workflow application model can be reused, the reusability of data cleansing is improved.

Description

Data cleaning method and device
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a data cleaning method and device.
Background
Data cleansing is studied mainly in three areas: data warehousing, data mining, and data quality management. At present, domestic research on data cleansing technology is still at an early stage, and most of it appears as fairly brief discussion within research on data warehouses, decision support, and data mining. Many data cleansing schemes and algorithms are designed for one specific application problem and apply only to a narrow range. Traditional data cleansing methods and systems have poor reusability, scalability, and flexibility. They can meet user needs when the data volume is small, but when the data volume is huge and the data comes from multiple sources, these shortcomings become especially prominent.
Summary of the invention
In view of this, the embodiments of the present invention provide a data cleaning method and device to solve the problem that data cleansing methods and systems in the prior art have poor reusability, scalability, and flexibility.
According to a first aspect, an embodiment of the present invention provides a data cleaning method, comprising: receiving a data cleansing flow file and initial data sent by a client; obtaining corresponding multiple workflow application models according to the data cleansing flow file; generating a corresponding data cleansing execution file according to the multiple workflow application models; and cleaning the initial data according to the data cleansing execution file.
With reference to the first aspect, in a first embodiment of the first aspect, generating the corresponding data cleansing execution file according to the multiple workflow application models comprises: obtaining the data cleansing code corresponding to each of the multiple workflow application models; and ordering the data cleansing code according to the sequence in the data cleansing flow file to form the data cleansing execution file.
With reference to the first aspect or the first embodiment of the first aspect, in a second embodiment of the first aspect, after the step of receiving the data cleansing flow file and initial data sent by the client, and before the step of obtaining the corresponding multiple workflow application models according to the data cleansing flow file, the data cleaning method further comprises: judging whether the data cleansing flow file conforms to the specification; and, when the data cleansing flow file conforms to the specification, executing the step of obtaining the corresponding multiple workflow application models according to the data cleansing flow file.
With reference to the first aspect, in a third embodiment of the first aspect, obtaining the corresponding multiple workflow application models according to the data cleansing flow file comprises: parsing the data cleansing flow file; and, when the data cleansing flow file is parsed successfully, extracting the workflow application models corresponding to the data cleansing flow file according to the parsing result.
According to a second aspect, an embodiment of the present invention provides another data cleaning method, comprising: selecting an execution engine; sending a data cleansing flow file and initial data to the execution engine, so that the execution engine cleans the initial data by executing, according to the data cleansing flow file, the data cleaning method described in the first aspect or any embodiment of the first aspect; and obtaining the cleaning task running state information of the execution engine.
According to a third aspect, an embodiment of the present invention provides a data cleansing device, comprising: an input unit, configured to receive a data cleansing flow file and initial data sent by a client, and to obtain corresponding multiple workflow application models according to the data cleansing flow file; a file generating unit, configured to generate a corresponding data cleansing execution file according to the multiple workflow application models; and a cleaning execution unit, configured to clean the initial data according to the data cleansing execution file.
With reference to the third aspect, in a first embodiment of the third aspect, the file generating unit comprises: a code obtaining unit, configured to obtain the data cleansing code corresponding to each of the multiple workflow application models; and a sequencing unit, configured to order the data cleansing code according to the sequence in the data cleansing flow file to form the data cleansing execution file.
According to a fourth aspect, an embodiment of the present invention provides another data cleansing device, comprising: a selecting unit, configured to select an execution engine; a transmission unit, configured to send a data cleansing flow file and initial data to the execution engine, so that the execution engine cleans the initial data by executing, according to the data cleansing flow file, the data cleaning method described in the first aspect or any embodiment of the first aspect; and a monitoring unit, configured to obtain the cleaning task running state information of the execution engine.
According to a fifth aspect, an embodiment of the present invention provides a terminal device, comprising a memory, a processor, and a computer program that is stored in the memory and can run on the processor. When executing the computer program, the processor implements the steps of the method described in the first aspect or any embodiment of the first aspect, or implements the steps of the method described in the second aspect.
According to a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium that stores a computer program. When executed by a processor, the computer program implements the steps of the method described in the first aspect or any embodiment of the first aspect, or implements the steps of the method described in the second aspect.
The data cleaning method and device provided by the embodiments of the present invention determine, from the data cleansing flow file sent by the user, the workflow application models that perform the specific cleaning tasks, so that the execution engine cleans the initial data according to the workflow application models that the user has set up for each specific data cleansing task. Because the user can freely combine workflow application models with different functions in the data cleansing flow file, the data cleaning method provided by the embodiments of the present invention has high flexibility and scalability. In addition, because each workflow application model can be reused, the reusability of data cleansing is improved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a specific example of a data cleaning method provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of another specific example of a data cleaning method provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of a third specific example of a data cleaning method provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a specific example of a data cleansing device provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another specific example of a data cleansing device provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a specific example of a terminal device provided by an embodiment of the present invention.
Detailed description of the embodiments
In the following description, specific details such as particular system structures and technologies are set forth for the purpose of illustration rather than limitation, so that the embodiments of the present invention can be thoroughly understood. However, it will be clear to those skilled in the art that the present invention can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present invention.
To illustrate the technical solutions of the present invention, specific embodiments are described below.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present invention. In Fig. 1, the execution engine 100 receives the data cleansing flow file and initial data sent by the client 200 and, by parsing and processing the data cleansing flow file, generates a corresponding data cleansing execution file. The execution engine 100 then cleans the initial data by executing the data cleansing execution file.
In some embodiments, as shown in Fig. 2, the execution engine 100 can perform data cleansing by executing the following steps:
Step S101: receive the data cleansing flow file and initial data sent by the client.
Step S102: obtain the corresponding multiple workflow application models according to the data cleansing flow file.
In a specific embodiment, as shown in Fig. 3, the execution engine 100 can implement step S102 by executing the following sub-steps:
Step S1021: parse the data cleansing flow file.
Step S1022: judge whether the data cleansing flow file was parsed successfully. When the parsing succeeds, execute step S1023; when the parsing fails, return to step S1021, or send the client 200 a message that parsing of the data cleansing flow file failed.
Step S1023: according to the parsing result, extract the workflow application models corresponding to the data cleansing flow file.
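Sub-steps S1021 to S1023 can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify the format of the data cleansing flow file, so the JSON layout (a "steps" list whose entries name a "model"), the `KNOWN_MODELS` registry, and the `FlowParseError` exception are all hypothetical. The sketch takes the "report failure" branch of step S1022 rather than retrying the parse.

```python
import json

# Hypothetical registry of workflow application models known to the engine.
KNOWN_MODELS = {"drop_nulls", "trim_whitespace", "deduplicate"}


class FlowParseError(Exception):
    """Signals the failure branch of step S1022 (to be reported to the client)."""


def extract_models(flow_text):
    """Parse the flow file (S1021/S1022) and extract the models it names (S1023)."""
    try:
        flow = json.loads(flow_text)                       # S1021: parse
        names = [step["model"] for step in flow["steps"]]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # S1022, failure branch: stop here instead of continuing to S1023.
        raise FlowParseError(f"cannot parse flow file: {exc}") from exc

    unknown = [n for n in names if n not in KNOWN_MODELS]
    if unknown:
        raise FlowParseError(f"unknown workflow application models: {unknown}")
    return names                                           # S1023: models in flow order


models = extract_models(
    '{"steps": [{"model": "trim_whitespace"}, {"model": "drop_nulls"}]}'
)
```

The function returns the model names in the order in which the flow file lists them, which is the order the later generation step relies on.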
Step S103: generate the corresponding data cleansing execution file according to the multiple workflow application models.
In a specific embodiment, as shown in Fig. 3, the execution engine 100 can implement step S103 by executing the following sub-steps:
Step S1031: obtain the data cleansing code corresponding to each of the multiple workflow application models. The data cleansing code may be, for example, SQL statements or cleaning functions that call components.
Step S1032: order the data cleansing code according to the sequence in the data cleansing flow file to form the data cleansing execution file.
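A minimal sketch of sub-steps S1031 and S1032, assuming the data cleansing code is SQL (one of the forms the text mentions) and that the execution file is simply the statements concatenated in flow-file order. The `MODEL_CODE` mapping, the table name `raw`, and the column name `name` are hypothetical and chosen only for illustration.

```python
# Hypothetical mapping from workflow application model to its data cleansing code.
MODEL_CODE = {
    "trim_whitespace": "UPDATE raw SET name = TRIM(name);",
    "drop_nulls": "DELETE FROM raw WHERE name IS NULL OR name = '';",
    "deduplicate": (
        "DELETE FROM raw WHERE rowid NOT IN "
        "(SELECT MIN(rowid) FROM raw GROUP BY name);"
    ),
}


def build_execution_file(flow_order, models):
    """Return the text of a data cleansing execution file.

    flow_order: model names in the order given by the flow file.
    models:     the set of models extracted from the flow file.
    """
    # S1031: obtain the cleansing code for each workflow application model.
    code_by_model = {name: MODEL_CODE[name] for name in models}
    # S1032: arrange the code in the order given by the flow file.
    ordered = [code_by_model[name] for name in flow_order]
    return "\n".join(ordered) + "\n"


script = build_execution_file(
    ["trim_whitespace", "drop_nulls"], {"drop_nulls", "trim_whitespace"}
)
```

Because each model's code is looked up rather than written inline, the same model can appear in many flow files, which is the reuse the description emphasizes.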
Step S104: clean the initial data according to the data cleansing execution file.
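Step S104 might look like the following when the execution file is a SQL script. This is a sketch under assumptions, not the patented implementation: it loads the initial (raw) data into an in-memory SQLite table using Python's standard `sqlite3` module, and the table name `raw`, the single `name` column, and the sample script are hypothetical.

```python
import sqlite3


def clean(rows, execution_script):
    """Load the initial data into a scratch table and run the execution file on it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE raw (name TEXT)")
        conn.executemany("INSERT INTO raw (name) VALUES (?)", [(r,) for r in rows])
        conn.executescript(execution_script)   # run every statement in order
        cursor = conn.execute("SELECT name FROM raw ORDER BY rowid")
        return [row[0] for row in cursor]
    finally:
        conn.close()


# A hypothetical execution file: trim, drop empty values, then deduplicate.
SCRIPT = """
UPDATE raw SET name = TRIM(name);
DELETE FROM raw WHERE name IS NULL OR name = '';
DELETE FROM raw WHERE rowid NOT IN (SELECT MIN(rowid) FROM raw GROUP BY name);
"""

cleaned = clean(["  Ann ", "Bob", None, "Ann", ""], SCRIPT)
print(cleaned)  # ['Ann', 'Bob']
```

The order of the statements matters here: trimming first lets the deduplication step treat "  Ann " and "Ann" as the same value.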
Optionally, as shown in Fig. 3, the following step may be added between step S101 and step S102:
Step S105: judge whether the data cleansing flow file conforms to the specification. In a specific embodiment, this can be judged against a preset format: a conforming data cleansing flow file has the same file format as the preset format, whereas a non-conforming file differs from it. When the data cleansing flow file conforms to the specification, execute step S102. When it does not, send the client 200 a message that the data cleansing flow file is non-conforming, reminding the user to modify the file so that it meets the requirements the execution engine 100 places on files it will subsequently parse.
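The check in step S105 can be sketched as a small validator. The patent only says that the file is compared against a preset format, so the format assumed here (a JSON object with a non-empty "steps" list, each step an object with a string "model" field) and the function name `check_flow_file` are hypothetical.

```python
import json


def check_flow_file(flow_text):
    """Return (True, "") if the flow file matches the assumed preset format,
    otherwise (False, reason), where reason can be sent back to the client."""
    try:
        flow = json.loads(flow_text)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(flow, dict):
        return False, "top level must be an object"
    steps = flow.get("steps")
    if not isinstance(steps, list) or not steps:
        return False, '"steps" must be a non-empty list'
    for i, step in enumerate(steps):
        if not isinstance(step, dict) or not isinstance(step.get("model"), str):
            return False, f'step {i} needs a string "model" field'
    return True, ""


ok, reason = check_flow_file('{"steps": [{"model": "trim_whitespace"}]}')
```

Returning a reason alongside the verdict gives the engine something concrete to put in the non-conformance message, so the user knows what to fix before resubmitting.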
The data cleaning method provided by the embodiments of the present invention determines, from the data cleansing flow file sent by the user, the workflow application models that perform the specific cleaning tasks, so that the execution engine cleans the initial data according to the workflow application models that the user has set up for each specific data cleansing task. Because the user can freely combine workflow application models with different functions in the data cleansing flow file, the data cleaning method provided by the embodiments of the present invention has high flexibility and scalability. In addition, because each workflow application model can be reused, the reusability of data cleansing is improved.
In other embodiments, as shown in Fig. 4, the client 200 can perform data cleansing by executing the following steps:
Step S201: select an execution engine.
Step S202: send the data cleansing flow file and initial data to the execution engine. By sending the data cleansing flow file and initial data to the execution engine 100 it has selected, the client 200 causes the execution engine 100 to clean the initial data by executing, according to the data cleansing flow file, the data cleaning method shown in Fig. 2 or Fig. 3.
Step S203: obtain the cleaning task running state information of the execution engine. In a specific embodiment, while the execution engine 100 is cleaning the initial data according to the data cleaning method shown in Fig. 2 or Fig. 3, the client 200 can obtain the cleaning task running state information of the execution engine 100 in real time, thereby monitoring the data cleansing task in real time.
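The client-side steps S201 to S203 can be illustrated with an in-process stand-in for the engine. Everything named here is an assumption made for the sketch: `StubEngine`, its `submit` and `get_status` methods, and the status strings are hypothetical, and a real engine would be a separate service that the client polls while the task runs.

```python
class StubEngine:
    """Hypothetical stand-in for an execution engine; runs synchronously."""

    def __init__(self, name):
        self.name = name
        self.status = "idle"
        self.result = None

    def submit(self, flow_file, raw_data):
        self.status = "running"
        # Placeholder cleaning: strip whitespace and drop empty values.
        self.result = [v.strip() for v in raw_data if v and v.strip()]
        self.status = "finished"

    def get_status(self):
        return self.status


def run_cleaning(engines, engine_name, flow_file, raw_data):
    engine = engines[engine_name]               # S201: select an execution engine
    engine.submit(flow_file, raw_data)          # S202: send flow file and initial data
    return engine.get_status(), engine.result   # S203: obtain the task running state


engines = {"local": StubEngine("local")}
status, result = run_cleaning(engines, "local", "{}", [" a ", "", None, "b"])
```

Because the stub finishes before `submit` returns, the status read in S203 is always the final one; with an asynchronous engine the client would call `get_status` repeatedly to follow the task as it runs.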
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the numbering should not constitute any limitation on the implementation of the embodiments of the present invention.
An embodiment of the present invention also provides a data cleansing device corresponding to the data cleaning method shown in Fig. 2 or Fig. 3. As shown in Fig. 5, the data cleansing device may include an input unit 501, a file generating unit 502, and a cleaning execution unit 503.
The input unit 501 is configured to receive the data cleansing flow file and initial data sent by the client, and to obtain the corresponding multiple workflow application models according to the data cleansing flow file. For its specific working process, refer to steps S101 to S102 in the above method embodiment.
The file generating unit 502 is configured to generate the corresponding data cleansing execution file according to the multiple workflow application models. For its specific working process, refer to step S103 in the above method embodiment.
In a specific embodiment, the file generating unit 502 may include a code obtaining unit and a sequencing unit. The code obtaining unit is configured to obtain the data cleansing code corresponding to each of the multiple workflow application models. The sequencing unit is configured to order the data cleansing code according to the sequence in the data cleansing flow file to form the data cleansing execution file. For the specific working processes of the code obtaining unit and the sequencing unit, refer to steps S1031 and S1032 in the above method embodiment.
The cleaning execution unit 503 is configured to clean the initial data according to the data cleansing execution file. For its specific working process, refer to step S104 in the above method embodiment.
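The division of labor among units 501, 502, and 503 can be sketched as three small classes wired together. This is one possible reading, with several assumptions: the workflow application models are modeled as plain Python functions (the "cleaning function" form of cleansing code), the execution file is modeled as a composed callable rather than a file on disk, and all class, method, and model names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical library of reusable workflow application models.
MODELS: Dict[str, Callable[[List[str]], List[str]]] = {
    "trim": lambda rows: [r.strip() for r in rows],
    "drop_empty": lambda rows: [r for r in rows if r],
    "dedupe": lambda rows: list(dict.fromkeys(rows)),  # keeps first occurrence
}


class InputUnit:                 # corresponds to input unit 501
    def receive(self, flow_steps, raw_data):
        # Look up the models named by the flow file, preserving its order.
        return [MODELS[name] for name in flow_steps], list(raw_data)


class FileGeneratingUnit:        # corresponds to file generating unit 502
    def generate(self, models):
        def execution_file(rows):
            for model in models:  # apply each model in flow-file order
                rows = model(rows)
            return rows
        return execution_file


class CleaningExecutionUnit:     # corresponds to cleaning execution unit 503
    def execute(self, execution_file, raw_data):
        return execution_file(raw_data)


@dataclass
class DataCleaningDevice:
    input_unit: InputUnit
    file_unit: FileGeneratingUnit
    exec_unit: CleaningExecutionUnit

    def clean(self, flow_steps, raw_data):
        models, data = self.input_unit.receive(flow_steps, raw_data)
        return self.exec_unit.execute(self.file_unit.generate(models), data)


device = DataCleaningDevice(InputUnit(), FileGeneratingUnit(), CleaningExecutionUnit())
cleaned = device.clean(["trim", "drop_empty", "dedupe"], [" a", "a ", "", " b"])
```

Changing only the list of step names recombines the same models into a different cleaning pipeline, which is the flexibility and reusability the description claims for the approach.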
An embodiment of the present invention also provides another data cleansing device, corresponding to the data cleaning method shown in Fig. 4. As shown in Fig. 6, the data cleansing device may include a selecting unit 601, a transmission unit 602, and a monitoring unit 603.
The selecting unit 601 is configured to select an execution engine. For its specific working process, refer to step S201 in the above method embodiment.
The transmission unit 602 is configured to send the data cleansing flow file and initial data to the execution engine, so that the execution engine cleans the initial data by executing, according to the data cleansing flow file, the data cleaning method shown in Fig. 2 or Fig. 3. For its specific working process, refer to step S202 in the above method embodiment.
The monitoring unit 603 is configured to obtain the cleaning task running state information of the execution engine. For its specific working process, refer to step S203 in the above method embodiment.
An embodiment of the present invention also provides a terminal device. As shown in Fig. 7, the terminal device may include a processor 701 and a memory 702, which may be connected by a bus or in other ways; Fig. 7 takes connection by a bus as an example.
The processor 701 may be a central processing unit (Central Processing Unit, CPU). The processor 701 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination of such chips.
The memory 702, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the data cleaning method in the embodiments of the present invention (for example, the input unit 501, file generating unit 502, and cleaning execution unit 503 shown in Fig. 5, and the selecting unit 601, transmission unit 602, and monitoring unit 603 shown in Fig. 6). By running the non-transitory software programs, instructions, and modules stored in the memory 702, the processor 701 executes its various functional applications and data processing, that is, it implements the data cleaning method in the above method embodiments.
The memory 702 may include a program storage area and a data storage area. The program storage area can store an operating system and the application programs required by at least one function; the data storage area can store data created by the processor 701, and the like. In addition, the memory 702 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state memory. In some embodiments, the memory 702 optionally includes memory located remotely from the processor 701, and such remote memory can be connected to the processor 701 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 702 and, when executed by the processor 701, perform the data cleaning method in the embodiments shown in Fig. 2 to Fig. 4.
For specific details of the terminal device, refer to the corresponding descriptions and effects in the embodiments shown in Fig. 2 to Fig. 4; they are not repeated here.
Those skilled in the art will understand that all or part of the processes in the above method embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a flash memory (Flash Memory), a hard disk (Hard Disk Drive, HDD), a solid-state drive (Solid-State Drive, SSD), or the like; the storage medium may also include a combination of the above kinds of memory.
The embodiments described above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and they should all fall within the protection scope of the present invention.

Claims (10)

1. A data cleaning method, characterized by comprising:
receiving a data cleansing flow file and initial data sent by a client;
obtaining corresponding multiple workflow application models according to the data cleansing flow file;
generating a corresponding data cleansing execution file according to the multiple workflow application models;
cleaning the initial data according to the data cleansing execution file.
2. The data cleaning method according to claim 1, characterized in that generating the corresponding data cleansing execution file according to the multiple workflow application models comprises:
obtaining the data cleansing code corresponding to each of the multiple workflow application models;
ordering the data cleansing code according to the sequence in the data cleansing flow file to form the data cleansing execution file.
3. The data cleaning method according to claim 1 or 2, characterized in that, after the step of receiving the data cleansing flow file and initial data sent by the client, and before the step of obtaining the corresponding multiple workflow application models according to the data cleansing flow file, the data cleaning method further comprises:
judging whether the data cleansing flow file conforms to the specification;
when the data cleansing flow file conforms to the specification, executing the step of obtaining the corresponding multiple workflow application models according to the data cleansing flow file.
4. The data cleaning method according to claim 1, characterized in that obtaining the corresponding multiple workflow application models according to the data cleansing flow file comprises:
parsing the data cleansing flow file;
extracting, according to the parsing result, the workflow application models corresponding to the data cleansing flow file.
5. A data cleaning method, characterized by comprising:
selecting an execution engine;
sending a data cleansing flow file and initial data to the execution engine, so that the execution engine cleans the initial data by executing, according to the data cleansing flow file, the data cleaning method according to any one of claims 1 to 4;
obtaining the cleaning task running state information of the execution engine.
6. A data cleansing device, characterized by comprising:
an input unit, configured to receive a data cleansing flow file and initial data sent by a client, and to obtain corresponding multiple workflow application models according to the data cleansing flow file;
a file generating unit, configured to generate a corresponding data cleansing execution file according to the multiple workflow application models;
a cleaning execution unit, configured to clean the initial data according to the data cleansing execution file.
7. The data cleansing device according to claim 6, characterized in that the file generating unit comprises:
a code obtaining unit, configured to obtain the data cleansing code corresponding to each of the multiple workflow application models;
a sequencing unit, configured to order the data cleansing code according to the sequence in the data cleansing flow file to form the data cleansing execution file.
8. A data cleansing device, characterized by comprising:
a selecting unit, configured to select an execution engine;
a transmission unit, configured to send a data cleansing flow file and initial data to the execution engine, so that the execution engine cleans the initial data by executing, according to the data cleansing flow file, the data cleaning method according to any one of claims 1 to 4;
a monitoring unit, configured to obtain the cleaning task running state information of the execution engine.
9. A terminal device, comprising a memory, a processor, and a computer program that is stored in the memory and can run on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 4, or implements the steps of the method according to claim 5.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4, or implements the steps of the method according to claim 5.
CN201910080821.3A 2019-01-28 2019-01-28 Data cleaning method and device Pending CN109947754A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910080821.3A CN109947754A (en) 2019-01-28 2019-01-28 Data cleaning method and device

Publications (1)

Publication Number Publication Date
CN109947754A true CN109947754A (en) 2019-06-28

Family

ID=67006533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910080821.3A Pending CN109947754A (en) 2019-01-28 2019-01-28 Data cleaning method and device

Country Status (1)

Country Link
CN (1) CN109947754A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538720A (en) * 2020-03-12 2020-08-14 嘉陵江亭子口水利水电开发有限公司 Method and system for cleaning basic data in power industry
CN111538720B (en) * 2020-03-12 2023-07-21 嘉陵江亭子口水利水电开发有限公司 Method and system for cleaning basic data of power industry


Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190628
