CN109947754A - Data cleaning method and device - Google Patents
- Publication number
- CN109947754A (application number CN201910080821.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- data cleansing
- file
- flow file
- workflow application
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention relates to the technical field of data processing and provides a data cleaning method and device. The method includes: receiving a data cleansing flow file and raw data sent by a client; obtaining multiple corresponding workflow application models according to the data cleansing flow file; generating a corresponding data cleansing execution file according to the multiple workflow application models; and cleaning the raw data according to the data cleansing execution file. Because the user can freely combine workflow application models with different functions in the data cleansing flow file, the data cleaning method provided by the embodiments of the present invention has high flexibility and scalability. In addition, because each workflow application model can be reused, the reusability of data cleaning is improved.
Description
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a data cleaning method and device.
Background
Data cleaning is studied mainly in three areas: data warehousing, data mining, and data quality management. At present, domestic research on data cleaning technology is still at an early stage. It mostly appears within research on data warehouses, decision support, and data mining, where it is treated only briefly. Many data cleaning schemes and algorithms are designed for a specific application problem and apply only to a narrow range. Traditional data cleaning methods and systems have poor reusability, scalability, and flexibility. They can meet users' needs when the data volume is small, but when the data volume is huge and the data comes from multiple sources, these shortcomings become especially prominent.
Summary of the invention
In view of this, embodiments of the present invention provide a data cleaning method and device to solve the problem of poor reusability, scalability, and flexibility in prior-art data cleaning methods and systems.
According to a first aspect, an embodiment of the present invention provides a data cleaning method, comprising: receiving a data cleansing flow file and raw data sent by a client; obtaining multiple corresponding workflow application models according to the data cleansing flow file; generating a corresponding data cleansing execution file according to the multiple workflow application models; and cleaning the raw data according to the data cleansing execution file.
With reference to the first aspect, in a first embodiment of the first aspect, generating the corresponding data cleansing execution file according to the multiple workflow application models comprises: obtaining the data cleansing code corresponding to each of the multiple workflow application models; and sorting the data cleansing code according to the order given in the data cleansing flow file to form the data cleansing execution file.
With reference to the first aspect or the first embodiment of the first aspect, in a second embodiment of the first aspect, after the step of receiving the data cleansing flow file and raw data sent by the client, and before the step of obtaining the multiple corresponding workflow application models according to the data cleansing flow file, the data cleaning method further comprises: judging whether the data cleansing flow file conforms to the specification; and, when the data cleansing flow file conforms, performing the step of obtaining the multiple corresponding workflow application models according to the data cleansing flow file.
With reference to the first aspect, in a third embodiment of the first aspect, obtaining the multiple corresponding workflow application models according to the data cleansing flow file comprises: parsing the data cleansing flow file; and, when the data cleansing flow file is parsed successfully, extracting the workflow application models corresponding to the data cleansing flow file according to the parsing result.
According to a second aspect, an embodiment of the present invention provides another data cleaning method, comprising: selecting an execution engine; sending a data cleansing flow file and raw data to the execution engine, so that the execution engine, according to the data cleansing flow file, performs the data cleaning method of the first aspect or any embodiment of the first aspect to clean the raw data; and obtaining running state information of the cleaning task on the execution engine.
According to a third aspect, an embodiment of the present invention provides a data cleansing device, comprising: an input unit, configured to receive a data cleansing flow file and raw data sent by a client, and to obtain multiple corresponding workflow application models according to the data cleansing flow file; a file generating unit, configured to generate a corresponding data cleansing execution file according to the multiple workflow application models; and a cleaning execution unit, configured to clean the raw data according to the data cleansing execution file.
With reference to the third aspect, in a first embodiment of the third aspect, the file generating unit comprises: a code obtaining unit, configured to obtain the data cleansing code corresponding to each of the multiple workflow application models; and a sorting unit, configured to sort the data cleansing code according to the order given in the data cleansing flow file to form the data cleansing execution file.
According to a fourth aspect, an embodiment of the present invention provides another data cleansing device, comprising: a selecting unit, configured to select an execution engine; a transmission unit, configured to send a data cleansing flow file and raw data to the execution engine, so that the execution engine, according to the data cleansing flow file, performs the data cleaning method of the first aspect or any embodiment of the first aspect to clean the raw data; and a monitoring unit, configured to obtain running state information of the cleaning task on the execution engine.
According to a fifth aspect, an embodiment of the present invention provides a terminal device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor. When executing the computer program, the processor implements the steps of the method of the first aspect or any embodiment of the first aspect, or implements the steps of the method of the second aspect.
According to a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the steps of the method of the first aspect or any embodiment of the first aspect, or implements the steps of the method of the second aspect.
The data cleaning method and device provided by the embodiments of the present invention determine, from the data cleansing flow file sent by the user, the workflow application models that carry out the specific cleaning tasks, so that the execution engine cleans the raw data using the workflow application models that the user has specified for each particular data cleaning task. Because the user can freely combine workflow application models with different functions in the data cleansing flow file, the data cleaning method provided by the embodiments of the present invention has high flexibility and scalability. In addition, because each workflow application model can be reused, the reusability of data cleaning is improved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present invention;
Fig. 2 is a flowchart of a specific example of a data cleaning method provided by an embodiment of the present invention;
Fig. 3 is a flowchart of another specific example of a data cleaning method provided by an embodiment of the present invention;
Fig. 4 is a flowchart of a third specific example of a data cleaning method provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a specific example of a data cleansing device provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another specific example of a data cleansing device provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a specific example of a terminal device provided by an embodiment of the present invention.
Detailed description of the embodiments
In the following description, specific details such as particular system structures and techniques are set forth for purposes of explanation rather than limitation, to provide a thorough understanding of the embodiments of the present invention. However, it will be clear to those skilled in the art that the present invention may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present invention.
To illustrate the technical solutions of the present invention, specific embodiments are described below.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present invention. In Fig. 1, the execution engine 100 receives the data cleansing flow file and raw data sent by the client 200, and generates the corresponding data cleansing execution file by parsing and processing the data cleansing flow file. The execution engine 100 then cleans the raw data by running the data cleansing execution file.
In some embodiments, as shown in Fig. 2, the execution engine 100 can implement data cleaning by performing the following steps:
Step S101: receive the data cleansing flow file and raw data sent by the client.
Step S102: obtain multiple corresponding workflow application models according to the data cleansing flow file.
In a specific embodiment, as shown in Fig. 3, the execution engine 100 can implement step S102 through the following sub-steps:
Step S1021: parse the data cleansing flow file.
Step S1022: judge whether the data cleansing flow file was parsed successfully. If parsing succeeded, perform step S1023; if it failed, return to step S1021, or send the client 200 a message that parsing of the data cleansing flow file failed.
Step S1023: according to the parsing result, extract the workflow application models corresponding to the data cleansing flow file.
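Sub-steps S1021 to S1023 can be sketched in Python as follows. The patent does not specify the format of the flow file, so this sketch assumes a JSON document with a `steps` list; the model names, the `KNOWN_MODELS` registry, and the `FlowParseError` exception are all hypothetical names introduced for illustration.

```python
import json

# Hypothetical registry of workflow application models known to the engine.
KNOWN_MODELS = {"trim_whitespace", "drop_nulls", "dedupe"}


class FlowParseError(Exception):
    """Raised on the failure branch of S1022, to be reported to the client."""


def parse_flow_file(text):
    """S1021/S1022: parse the flow file; raise FlowParseError if parsing fails."""
    try:
        flow = json.loads(text)
    except json.JSONDecodeError as exc:
        raise FlowParseError(f"flow file is not valid JSON: {exc}") from exc
    if not isinstance(flow, dict) or not isinstance(flow.get("steps"), list):
        raise FlowParseError("flow file must be an object with a 'steps' list")
    return flow


def extract_models(flow):
    """S1023: extract the model names, in flow order, from the parse result."""
    models = []
    for step in flow["steps"]:
        name = step.get("model") if isinstance(step, dict) else None
        if name not in KNOWN_MODELS:
            raise FlowParseError(f"unknown workflow application model: {name!r}")
        models.append(name)
    return models
```

In this sketch a parse failure is reported as an exception rather than retried, which corresponds to the "notify the client" branch of step S1022.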
Step S103: generate a corresponding data cleansing execution file according to the multiple workflow application models.
In a specific embodiment, as shown in Fig. 3, the execution engine 100 can implement step S103 through the following sub-steps:
Step S1031: obtain the data cleansing code corresponding to each of the multiple workflow application models. The data cleansing code may be, for example, an SQL statement or a call to a cleaning function of a component.
Step S1032: sort the data cleansing code according to the order given in the data cleansing flow file to form the data cleansing execution file.
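A minimal Python sketch of sub-steps S1031 and S1032 follows. It assumes that each workflow application model maps to a single SQL statement, that each model appears at most once in the flow, and that the execution file is simply the statements joined in flow order. The `MODEL_CODE` mapping, the table name `raw`, and the column name `name` are hypothetical.

```python
# Hypothetical mapping from model name to its data cleansing code (SQL here).
MODEL_CODE = {
    "trim_whitespace": "UPDATE raw SET name = TRIM(name);",
    "drop_nulls": "DELETE FROM raw WHERE name IS NULL;",
    "dedupe": (
        "DELETE FROM raw WHERE rowid NOT IN "
        "(SELECT MIN(rowid) FROM raw GROUP BY name);"
    ),
}


def build_execution_file(flow_order, models):
    """Return the execution file text for the given models.

    flow_order: model names in the order they appear in the flow file.
    models: the models obtained in S102, in any order.
    """
    # S1031: obtain the cleansing code for each model.
    code_by_model = {m: MODEL_CODE[m] for m in models}
    # S1032: sort the code by each model's position in the flow file.
    position = {name: i for i, name in enumerate(flow_order)}
    ordered = sorted(code_by_model, key=position.__getitem__)
    return "\n".join(code_by_model[m] for m in ordered)
```

Keeping the code lookup separate from the ordering step mirrors the split between the two sub-steps: the same model code can be reused in different flows, and only the ordering changes.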
Step S104: clean the raw data according to the data cleansing execution file.
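Step S104 can be illustrated with a small sketch that assumes the execution file is a script of SQL statements and that the raw data is a list of strings loaded into a single-column table. The use of an in-memory SQLite database, the table name `raw`, and the column name `name` are assumptions made for this example, not details from the patent.

```python
import sqlite3


def clean(raw_rows, execution_script):
    """S104: load the raw data, run the execution file, return the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE raw (name TEXT)")
        conn.executemany(
            "INSERT INTO raw (name) VALUES (?)", [(r,) for r in raw_rows]
        )
        # Run every statement of the execution file in order.
        conn.executescript(execution_script)
        cursor = conn.execute("SELECT name FROM raw ORDER BY rowid")
        return [row[0] for row in cursor]
    finally:
        conn.close()
```

For example, a script that trims whitespace, deletes NULL or empty values, and then removes duplicates would reduce `[" alice ", "bob", None, "alice", ""]` to `["alice", "bob"]`.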
Optionally, as shown in Fig. 3, the following step can be added between step S101 and step S102:
Step S105: judge whether the data cleansing flow file conforms to the specification. In a specific embodiment, this can be judged against a preset format: a conforming data cleansing flow file has the same file format as the preset format, whereas a nonconforming one differs from it. When the data cleansing flow file conforms, step S102 is performed. When it does not, a message that the data cleansing flow file is nonconforming is sent to the client 200, reminding the user to modify the file so that it meets the requirements of the execution engine 100 for subsequent parsing.
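The conformance check of step S105 can be sketched as below. The preset format is not defined in the patent, so this example assumes a JSON flow file whose top level must contain a string `version` and a list `steps`; both field names and the `PRESET_FORMAT` table are hypothetical.

```python
import json

# Assumed preset format: required top-level keys and their expected types.
PRESET_FORMAT = {"version": str, "steps": list}


def check_flow_file(text):
    """S105: return (True, "") if the file conforms, else (False, reason).

    The reason string is what would be sent back to the client so that
    the user can correct the flow file.
    """
    try:
        flow = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(flow, dict):
        return False, "top level must be an object"
    for key, expected in PRESET_FORMAT.items():
        if not isinstance(flow.get(key), expected):
            return False, f"field {key!r} missing or not a {expected.__name__}"
    return True, ""
```

Returning a reason alongside the verdict lets the engine tell the user which part of the file to fix, rather than only rejecting it.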
The data cleaning method provided by the embodiments of the present invention determines, from the data cleansing flow file sent by the user, the workflow application models that carry out the specific cleaning tasks, so that the execution engine cleans the raw data using the workflow application models that the user has specified for each particular data cleaning task. Because the user can freely combine workflow application models with different functions in the data cleansing flow file, the data cleaning method provided by the embodiments of the present invention has high flexibility and scalability. In addition, because each workflow application model can be reused, the reusability of data cleaning is improved.
In other embodiments, as shown in Fig. 4, the client 200 can implement data cleaning by performing the following steps:
Step S201: select an execution engine.
Step S202: send the data cleansing flow file and raw data to the execution engine. By sending the data cleansing flow file and raw data to its selected execution engine 100, the client 200 causes the execution engine 100 to perform, according to the data cleansing flow file, the data cleaning method shown in Fig. 2 or Fig. 3 and thereby clean the raw data.
Step S203: obtain running state information of the cleaning task on the execution engine. In a specific embodiment, while the execution engine 100 cleans the raw data according to the data cleaning method shown in Fig. 2 or Fig. 3, the client 200 can obtain the running state information of the cleaning task on the execution engine 100 in real time, so as to monitor the data cleaning task in real time.
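The client-side steps S201 to S203 can be sketched as follows. The patent does not say how an engine is selected or how state is reported, so this example assumes a least-loaded selection policy and a simple polling loop. `FakeEngine` is a hypothetical in-process stand-in for a remote execution engine, and its scripted state sequence only keeps the example deterministic.

```python
import enum


class TaskState(enum.Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"


class FakeEngine:
    """Hypothetical stand-in for an execution engine; a real one is remote."""

    def __init__(self, name, load):
        self.name = name
        self.load = load
        self._states = {}

    def submit(self, flow_file, raw_data):
        task_id = len(self._states) + 1
        # Scripted sequence of states reported for this task.
        self._states[task_id] = iter(
            [TaskState.PENDING, TaskState.RUNNING, TaskState.DONE]
        )
        return task_id

    def status(self, task_id):
        return next(self._states[task_id])


def run_cleaning_task(engines, flow_file, raw_data):
    """Select an engine, submit the task, and poll until it is done."""
    # S201: select an execution engine (assumed policy: lowest load).
    engine = min(engines, key=lambda e: e.load)
    # S202: send the flow file and raw data to the selected engine.
    task_id = engine.submit(flow_file, raw_data)
    # S203: obtain the running state until the task finishes.
    seen = []
    while True:
        state = engine.status(task_id)
        seen.append(state)
        if state is TaskState.DONE:
            return engine.name, seen
```

A real client would wait between polls or subscribe to pushed status updates; the loop here only shows where the monitoring of step S203 fits.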
It should be understood that the step numbers in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the numbering does not limit the implementation of the embodiments of the present invention in any way.
An embodiment of the present invention also provides a data cleansing device corresponding to the data cleaning method shown in Fig. 2 or Fig. 3. As shown in Fig. 5, the data cleansing device may include an input unit 501, a file generating unit 502, and a cleaning execution unit 503.
The input unit 501 is configured to receive the data cleansing flow file and raw data sent by the client, and to obtain multiple corresponding workflow application models according to the data cleansing flow file. For its specific working process, see the description of steps S101 to S102 in the above method embodiment.
The file generating unit 502 is configured to generate a corresponding data cleansing execution file according to the multiple workflow application models. For its specific working process, see the description of step S103 in the above method embodiment.
In a specific embodiment, the file generating unit 502 may include a code obtaining unit and a sorting unit. The code obtaining unit is configured to obtain the data cleansing code corresponding to each of the multiple workflow application models. The sorting unit is configured to sort the data cleansing code according to the order given in the data cleansing flow file to form the data cleansing execution file. For the specific working processes of the code obtaining unit and the sorting unit, see the descriptions of steps S1031 and S1032 in the above method embodiment.
The cleaning execution unit 503 is configured to clean the raw data according to the data cleansing execution file. For its specific working process, see the description of step S104 in the above method embodiment.
An embodiment of the present invention also provides another data cleansing device, corresponding to the data cleaning method shown in Fig. 4. As shown in Fig. 6, the data cleansing device may include a selecting unit 601, a transmission unit 602, and a monitoring unit 603.
The selecting unit 601 is configured to select an execution engine. For its specific working process, see the description of step S201 in the above method embodiment.
The transmission unit 602 is configured to send the data cleansing flow file and raw data to the execution engine, so that the execution engine, according to the data cleansing flow file, performs the data cleaning method shown in Fig. 2 or Fig. 3 to clean the raw data. For its specific working process, see the description of step S202 in the above method embodiment.
The monitoring unit 603 is configured to obtain running state information of the cleaning task on the execution engine. For its specific working process, see the description of step S203 in the above method embodiment.
An embodiment of the present invention also provides a terminal device. As shown in Fig. 7, the terminal device may include a processor 701 and a memory 702, which may be connected by a bus or in another manner; Fig. 7 takes connection by a bus as an example.
The processor 701 may be a central processing unit (CPU). The processor 701 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or another such chip, or a combination of the above types of chips.
As a non-transitory computer-readable storage medium, the memory 702 can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the data cleaning method in the embodiments of the present invention (for example, the input unit 501, file generating unit 502, and cleaning execution unit 503 shown in Fig. 5, and the selecting unit 601, transmission unit 602, and monitoring unit 603 shown in Fig. 6). By running the non-transitory software programs, instructions, and modules stored in the memory 702, the processor 701 executes its various functional applications and data processing, that is, implements the data cleaning method in the above method embodiments.
The memory 702 may include a program storage area and a data storage area. The program storage area can store an operating system and the application programs required by at least one function; the data storage area can store data created by the processor 701, and so on. In addition, the memory 702 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 702 optionally includes memory located remotely from the processor 701, and such remote memory can be connected to the processor 701 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 702 and, when executed by the processor 701, perform the data cleaning method in the embodiments shown in Figs. 2 to 4.
For specific details of the above terminal device, refer to the corresponding descriptions and effects in the embodiments shown in Figs. 2 to 4; they are not repeated here.
Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also include a combination of the above kinds of memory.
The embodiments described above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.
Claims (10)
1. A data cleaning method, characterized by comprising:
receiving a data cleansing flow file and raw data sent by a client;
obtaining multiple corresponding workflow application models according to the data cleansing flow file;
generating a corresponding data cleansing execution file according to the multiple workflow application models; and
cleaning the raw data according to the data cleansing execution file.
2. The data cleaning method according to claim 1, characterized in that generating the corresponding data cleansing execution file according to the multiple workflow application models comprises:
obtaining the data cleansing code corresponding to each of the multiple workflow application models; and
sorting the data cleansing code according to the order given in the data cleansing flow file to form the data cleansing execution file.
3. The data cleaning method according to claim 1 or 2, characterized in that, after the step of receiving the data cleansing flow file and raw data sent by the client, and before the step of obtaining the multiple corresponding workflow application models according to the data cleansing flow file, the data cleaning method further comprises:
judging whether the data cleansing flow file conforms to the specification; and
when the data cleansing flow file conforms, performing the step of obtaining the multiple corresponding workflow application models according to the data cleansing flow file.
4. The data cleaning method according to claim 1, characterized in that obtaining the multiple corresponding workflow application models according to the data cleansing flow file comprises:
parsing the data cleansing flow file; and
extracting, according to the parsing result, the workflow application models corresponding to the data cleansing flow file.
5. A data cleaning method, characterized by comprising:
selecting an execution engine;
sending a data cleansing flow file and raw data to the execution engine, so that the execution engine, according to the data cleansing flow file, performs the data cleaning method according to any one of claims 1 to 4 to clean the raw data; and
obtaining running state information of the cleaning task on the execution engine.
6. A data cleansing device, characterized by comprising:
an input unit, configured to receive a data cleansing flow file and raw data sent by a client, and to obtain multiple corresponding workflow application models according to the data cleansing flow file;
a file generating unit, configured to generate a corresponding data cleansing execution file according to the multiple workflow application models; and
a cleaning execution unit, configured to clean the raw data according to the data cleansing execution file.
7. The data cleansing device according to claim 6, characterized in that the file generating unit comprises:
a code obtaining unit, configured to obtain the data cleansing code corresponding to each of the multiple workflow application models; and
a sorting unit, configured to sort the data cleansing code according to the order given in the data cleansing flow file to form the data cleansing execution file.
8. A data cleansing device, characterized by comprising:
a selecting unit, configured to select an execution engine;
a transmission unit, configured to send a data cleansing flow file and raw data to the execution engine, so that the execution engine, according to the data cleansing flow file, performs the data cleaning method according to any one of claims 1 to 4 to clean the raw data; and
a monitoring unit, configured to obtain running state information of the cleaning task on the execution engine.
9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 4, or implements the steps of the method according to claim 5.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4, or implements the steps of the method according to claim 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910080821.3A CN109947754A (en) | 2019-01-28 | 2019-01-28 | Data cleaning method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910080821.3A CN109947754A (en) | 2019-01-28 | 2019-01-28 | Data cleaning method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109947754A true CN109947754A (en) | 2019-06-28 |
Family
ID=67006533
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910080821.3A Pending CN109947754A (en) | 2019-01-28 | 2019-01-28 | Data cleaning method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109947754A (en) |
-
2019
- 2019-01-28 CN CN201910080821.3A patent/CN109947754A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111538720A (en) * | 2020-03-12 | 2020-08-14 | 嘉陵江亭子口水利水电开发有限公司 | Method and system for cleaning basic data in power industry |
CN111538720B (en) * | 2020-03-12 | 2023-07-21 | 嘉陵江亭子口水利水电开发有限公司 | Method and system for cleaning basic data of power industry |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107210928B (en) | Distributed and adaptive computer network analysis | |
CN101694626B (en) | Script execution system and method | |
CN107463582B (en) | Distributed Hadoop cluster deployment method and device | |
CN105068864B (en) | Method and system for processing asynchronous message queue | |
CN110908788B (en) | Spark Streaming based data processing method and device, computer equipment and storage medium | |
CN110716744A (en) | Data stream processing method, system and computer readable storage medium | |
US20150347305A1 (en) | Method and apparatus for outputting log information | |
US20090070773A1 (en) | Method for efficient thread usage for hierarchically structured tasks | |
US20140310278A1 (en) | Creating global aggregated namespaces for storage management | |
CN106406905B (en) | Configuration method and system for SETUP option of BIOS of server | |
CN110233802B (en) | Method for constructing block chain structure with one main chain and multiple side chains | |
CN111522786A (en) | Log processing system and method | |
CN104765641A (en) | Job scheduling method and system | |
US9794122B2 (en) | Method for propagating network management data for energy-efficient IoT network management and energy-efficient IoT node apparatus | |
CN110177146A (en) | A kind of non-obstruction Restful communication means, device and equipment based on asynchronous event driven | |
CN109388346A (en) | A kind of data rule method and relevant apparatus | |
CN111367591B (en) | Spark task processing method and device | |
CN111782473A (en) | Distributed log data processing method, device and system | |
CN115269193A (en) | Method and device for realizing distributed load balance in automatic test | |
CN109947754A (en) | Data cleaning method and device | |
US9880923B2 (en) | Model checking device for distributed environment model, model checking method for distributed environment model, and medium | |
US10073938B2 (en) | Integrated circuit design verification | |
CN104104701A (en) | Online service configuration updating method and system | |
CN109412970B (en) | Data transfer system, data transfer method, electronic device, and storage medium | |
CN110109986B (en) | Task processing method, system, server and task scheduling system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20190628 |