CN115687310A - Data cleaning method and device - Google Patents

Publication number
CN115687310A
Authority
CN
China
Prior art keywords
data cleaning
information
data
component
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110824482.2A
Other languages
Chinese (zh)
Inventor
田朋雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Guoshuang Software Co ltd
Original Assignee
Suzhou Guoshuang Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Guoshuang Software Co ltd
Priority to CN202110824482.2A
Publication of CN115687310A
Legal status: Pending

Landscapes

  • Stored Programmes (AREA)

Abstract

The invention discloses a data cleaning method and device. A server responds to a data cleaning instruction and determines configuration information, then acquires preset component information of each data cleaning component. Based on the component information of each data cleaning component, the server determines the target data cleaning components corresponding to the configuration information and the data cleaning logic information among the target data cleaning components, to obtain data cleaning flow information. Finally, the server calls the target data cleaning components in sequence according to the data cleaning flow information, so that each target data cleaning component executes its corresponding data cleaning operation in the data cleaning flow information. According to the invention, the required data cleaning components can be adjusted by a user inputting a data cleaning instruction. Compared with modifying code to adapt to a changed data cleaning component, this is simple to operate and has a low error rate, which in turn improves the accuracy of data cleaning.

Description

Data cleaning method and device
Technical Field
The invention relates to the field of data cleaning, in particular to a data cleaning method and device.
Background
In the data analysis process, data cleaning is an important link, and the quality of data cleaning directly relates to the accuracy of data analysis.
The data cleaning process uses data cleaning components, which comprise an input source, a calculation engine and an output source. The data cleaning program is determined based on the input source, the calculation engine and the output source; once determined, the program can be run to clean data.
In practical application, input sources, calculation engines and output sources are diverse. If the input source, the calculation engine or the output source is adjusted, the program must be modified to fit the changed data cleaning component. Such modification is error-prone, so the accuracy of data cleaning is low.
Disclosure of Invention
In view of the above, the present invention provides a data cleansing method and apparatus that overcomes, or at least partially solves, the above-mentioned problems.
A data cleaning method is applied to a server and comprises the following steps:
under the condition of receiving a data cleaning instruction input by a user, responding to the data cleaning instruction and determining configuration information; the configuration information comprises identification information of the data cleaning components and a logical relationship between the data cleaning components;
acquiring preset component information of each data cleaning component; the component information includes the identification information;
determining target data cleaning components corresponding to the configuration information and data cleaning logic information among the target data cleaning components based on the component information of each data cleaning component to obtain data cleaning flow information;
and sequentially calling the target data cleaning components according to the data cleaning flow information so as to enable the target data cleaning components to execute data cleaning operation corresponding to the target data cleaning components in the data cleaning flow information.
Optionally, responding to the data cleansing instruction, and determining configuration information, including:
responding to the data cleaning instruction, and acquiring and displaying a data cleaning flow configuration template; a preset logic relationship and a target position corresponding to the preset logic relationship are arranged on the data cleaning process configuration template; the target position is used for inputting identification information of the data cleaning component corresponding to the preset logical relationship;
and receiving identification information of the data cleaning component input by a user at the target position of the data cleaning flow configuration template to obtain configuration information.
Optionally, in a case that the data cleansing instruction includes data cleansing setting information, the data cleansing setting information includes attribute information of a target data cleansing component, and a logical connection relationship between the target data cleansing components, responding to the data cleansing instruction, and determining configuration information includes:
acquiring a data cleaning process configuration template; a preset logic relation and a target position corresponding to the preset logic relation are arranged on the data cleaning process configuration template; the target position is used for inputting identification information of the data cleaning component corresponding to the preset logical relationship;
and determining data at the target position of the data cleaning flow configuration template based on the attribute information of the target data cleaning components and the logical connection relation between the target data cleaning components to obtain configuration information.
Optionally, determining, based on the component information of each data cleansing component, a target data cleansing component corresponding to the configuration information and data cleansing logic information between the target data cleansing components, to obtain data cleansing process information, where the step includes:
determining a data cleaning component corresponding to the identification information in the configuration information based on the component information of each data cleaning component, and taking the data cleaning component as a target data cleaning component;
and determining data cleaning logic information among the target data cleaning components based on the component information of the target data cleaning components and the logical relationship among the data cleaning components to obtain data cleaning flow information.
Optionally, determining data cleaning logic information among the target data cleaning components based on the component information of the target data cleaning components and the logical relationship among the data cleaning components to obtain data cleaning flow information includes:
determining data cleaning sub-logic information of the target data cleaning component with the logic relation based on the component type in the component information of the target data cleaning component and the logic relation between the data cleaning components;
and integrating all the data cleaning sub-logic information to obtain data cleaning logic information which is used as data cleaning process information.
A data cleaning device is applied to a server and comprises:
the information determining module is used for responding to the data cleaning instruction and determining configuration information under the condition of receiving the data cleaning instruction input by a user; the configuration information comprises identification information of the data cleaning components and a logical relationship between the data cleaning components;
the information acquisition module is used for acquiring preset component information of each data cleaning component; the component information includes the identification information;
a cleaning determining module, configured to determine, based on the component information of each data cleaning component, a target data cleaning component corresponding to the configuration information and data cleaning logic information between the target data cleaning components, so as to obtain data cleaning flow information;
and the data cleaning module is used for sequentially calling the target data cleaning components according to the data cleaning flow information so as to enable the target data cleaning components to execute data cleaning operation corresponding to the target data cleaning components in the data cleaning flow information.
Optionally, the information determining module includes:
the first template acquisition submodule is used for responding to the data cleaning instruction and acquiring and displaying a data cleaning flow configuration template; a preset logic relation and a target position corresponding to the preset logic relation are arranged on the data cleaning process configuration template; the target position is used for inputting identification information of the data cleaning component corresponding to the preset logical relationship;
and the information receiving submodule is used for receiving the identification information of the data cleaning component input by a user at the target position of the data cleaning process configuration template to obtain configuration information.
Optionally, in a case that the data cleansing instruction includes data cleansing setting information, the data cleansing setting information includes attribute information of a target data cleansing component, and a logical connection relationship between the target data cleansing components, the information determining module includes:
the second template acquisition submodule is used for acquiring a data cleaning process configuration template; a preset logic relationship and a target position corresponding to the preset logic relationship are arranged on the data cleaning process configuration template; the target position is used for inputting identification information of the data cleaning component corresponding to the preset logical relationship;
and the information determining submodule is used for determining data at the target position of the data cleaning flow configuration template based on the attribute information of the target data cleaning components and the logical connection relationship between the target data cleaning components to obtain configuration information.
A storage medium comprising a stored program, wherein the apparatus on which the storage medium is located is controlled to perform the above-mentioned data cleansing method when the program is run.
An electronic device comprising at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is used for calling the program instructions in the memory so as to execute the data cleaning method.
By means of the technical scheme, the invention provides a data cleaning method and a device, a server responds to a data cleaning instruction and determines configuration information under the condition that the server receives the data cleaning instruction, then preset component information of each data cleaning component is obtained, a target data cleaning component corresponding to the configuration information and data cleaning logic information between the target data cleaning components are determined based on the component information of each data cleaning component, data cleaning process information is obtained, and finally the target data cleaning components are sequentially called according to the data cleaning process information so that the target data cleaning components execute data cleaning operation corresponding to the target data cleaning components in the data cleaning process information. In the invention, the server determines the configuration information under the condition of receiving the data cleaning instruction, wherein the configuration information comprises the identification information of the data cleaning components and the logic relation between the data cleaning components, so that the required data cleaning components can be adjusted in a mode of inputting the data cleaning instruction by a user.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a data cleansing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another data cleansing method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a data cleansing process configuration template according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a scenario of component definition according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a scenario of data cleansing setting information according to an embodiment of the present invention;
FIG. 6 is a flow chart of a further data cleansing method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a data cleansing apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Data cleaning is an indispensable link in the whole data analysis process, and the quality of its result directly affects the model effect and the final conclusion. The data cleaning process uses data cleaning components, which comprise an input source, a calculation engine and an output source. In a typical data cleaning process, code is written for a given scenario to transfer data from one storage component to another. For example, data cleaning code is written to transfer data from an oracle database into hive, and further data cleaning code is written to transfer data from a mysql database into hbase. Because there is more than one input source (oracle, mysql) and more than one output source (hive, hbase), traditional data cleaning service code is highly coupled to the data source, so each time a new data source is added, a new set of code must be written for it. For example, if data now needs to be cleaned from oracle into hbase, or from mysql into hive, the code must be written again. More complicated multi-flow data cleaning, such as cleaning data from oracle into hive and then from hive into hbase, likewise requires new code.
In addition, data cleaning code is generally written against a single computing engine, commonly flink or spark. For example, oracle data is extracted, cleaned by code written against the API provided by flink, and written to hive. The intermediate computing engine is not fixed either: when flink needs to be switched to spark, the code must be rewritten.
In summary, data cleaning code is highly coupled to the data source and the computing engine. When either changes, the data cleaning code must be rewritten or modified, and both tasks require professional technicians. If the code is written or modified inaccurately, the data cleaning code is of low accuracy, and running it yields a low final data cleaning accuracy.
To solve this technical problem, the inventor found through research that, when the data cleaning flow needs to be updated, a data cleaning flow configuration template can be determined, and the data cleaning flow can be changed by the user modifying the template. Because the data cleaning flow is modified at the user level rather than in code, less specialist knowledge is required, the modification is simpler and less likely to introduce errors, and the accuracy of both the modification and the resulting data cleaning is improved.
Specifically, the server responds to the data cleaning instruction and determines configuration information under the condition that the server receives the data cleaning instruction, then obtains preset component information of each data cleaning component, determines a target data cleaning component corresponding to the configuration information and data cleaning logic information among the target data cleaning components based on the component information of each data cleaning component to obtain data cleaning flow information, and finally calls the target data cleaning components in sequence according to the data cleaning flow information to enable the target data cleaning components to execute data cleaning operation corresponding to the target data cleaning components in the data cleaning flow information. In the invention, the server determines the configuration information under the condition of receiving the data cleaning instruction, wherein the configuration information comprises the identification information of the data cleaning components and the logic relation between the data cleaning components, so that the required data cleaning components can be adjusted in a mode of inputting the data cleaning instruction by a user.
On the basis of the above content, another embodiment of the present invention provides a data cleansing method, which is applied to a server, where the server in this embodiment may be a Web server, and the Web server has a display interface, so that a user can modify a data cleansing flow configuration template.
Referring to fig. 1, a data cleansing method may include:
S11, under the condition that a data cleaning instruction input by a user is received, responding to the data cleaning instruction, and determining configuration information.
In practical application, the whole data cleaning method is triggered by the user, and the data cleaning instruction input by the user is used for determining configuration information. The configuration information comprises identification information of the data cleaning components and the logical relationship between the data cleaning components.
The data cleaning components are the input source, the calculation engine and the output source. The input source may be mysql, kafka, oracle, postgres, files and the like; the calculation engine may be spark, flink and the like; and the output source may be hbase, hive, postgres and the like.
The user may input the data cleaning instruction in various ways, and there are correspondingly various ways of determining configuration information in response to the instruction. These are described separately below.
1. In practical application, a data cleaning button is arranged on a web server interface, and a user can start a data cleaning process by clicking the data cleaning button.
Referring to fig. 2, responding to the data cleansing instruction and determining configuration information may include:
S21, responding to the data cleaning instruction, and acquiring and displaying a data cleaning flow configuration template.
After the user clicks the data cleaning button, a data cleaning instruction is generated, and after the server receives the data cleaning instruction, a data cleaning flow configuration template is obtained and displayed.
The data cleaning process configuration template is provided with a preset logic relationship and a target position corresponding to the preset logic relationship. And the target position is used for inputting the identification information of the data cleaning component corresponding to the preset logical relationship.
Specifically, the data cleaning flow configuration template may be as shown in fig. 3, where each dashed box is a target position. Right, beside a dashed box, denotes the previous data cleaning component in the data cleaning flow, and Left denotes the next one; that is, the data cleaning flow runs from Right to Left. In this embodiment, this Right-to-Left relationship is referred to as a preset logical relationship, and as can be seen from fig. 3, a data cleaning flow configuration template may contain a plurality of preset logical relationships.
In fig. 3, Right corresponds to 1002, which is the identification information of the data cleaning component for that preset logical relationship. In practical applications, the identification information at the target position for Right may be default identification information, for example 1002. In that case the user confirms whether 1002 should be used: if so, no modification is needed; if not, the user changes it to the identification information of the required data cleaning component.
Alternatively, the target position for Right may be blank, in which case the user fills in the identification information of the required data cleaning component.
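The template behaviour described above (preset Right-to-Left relationships whose target positions hold a default ID or are blank) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the `Relation` class, the `fill_template` helper, and the IDs 2001 and 3001 are hypothetical, and only the default value 1002 comes from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Relation:
    """One preset logical relationship; data flows from Right to Left."""
    right: Optional[int] = None  # default ID at the Right target position, or blank
    left: Optional[int] = None   # default ID at the Left target position, or blank


def fill_template(
    template: List[Relation],
    user_input: Dict[int, Tuple[Optional[int], Optional[int]]],
) -> List[Tuple[int, int]]:
    """Merge user-entered IDs with template defaults into (right, left) pairs."""
    config = []
    for index, relation in enumerate(template):
        entered_right, entered_left = user_input.get(index, (None, None))
        # A value typed by the user overrides the default; otherwise keep the default.
        right = entered_right if entered_right is not None else relation.right
        left = entered_left if entered_left is not None else relation.left
        if right is None or left is None:
            raise ValueError(f"relation {index} has an unfilled target position")
        config.append((right, left))
    return config


# First relation keeps the default 1002 on Right; the second is filled in entirely.
# 2001 and 3001 are made-up IDs used only for this example.
template = [Relation(right=1002), Relation()]
config = fill_template(template, {0: (None, 2001), 1: (2001, 3001)})
print(config)
```

Rejecting a template with an unfilled position is a design choice of this sketch; the text only says that the user is required to fill blank items.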
When determining the identification information to fill in at a target position, the user may refer to the component definition in fig. 4. Each data cleaning component has corresponding component information, which may include identification information, a component name and a component type. Taking the mysql component as an example, the identification information is 1001, the component name is mysql, and the component type is input source. The other data cleaning components are defined similarly, as shown in fig. 4.
It should be noted that fig. 4 is only an example illustrating a data cleansing component, and only gives component information of a part of the data cleansing component, and in practical applications, each input source, computing engine and output source has corresponding component information.
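As a rough illustration of the component definition, the component information (identification information, component name, component type) could be held in a small registry keyed by ID. The sketch below is an assumption about one possible representation: `ComponentType`, `ComponentInfo` and `REGISTRY` are hypothetical names, and only the mysql entry (1001, input source) is taken from the text. The other IDs are invented for the example because fig. 4 is not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum


class ComponentType(Enum):
    INPUT_SOURCE = "input source"
    CALCULATION_ENGINE = "calculation engine"
    OUTPUT_SOURCE = "output source"


@dataclass(frozen=True)
class ComponentInfo:
    component_id: int
    name: str
    component_type: ComponentType


# Preset component information, keyed by identification information.
REGISTRY = {
    info.component_id: info
    for info in [
        ComponentInfo(1001, "mysql", ComponentType.INPUT_SOURCE),        # from the text
        ComponentInfo(2001, "flink", ComponentType.CALCULATION_ENGINE),  # hypothetical ID
        ComponentInfo(3001, "hbase", ComponentType.OUTPUT_SOURCE),       # hypothetical ID
    ]
}

mysql = REGISTRY[1001]
print(mysql.name, "is an", mysql.component_type.value)
```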
S22, receiving identification information of the data cleaning component input by a user at the target position of the data cleaning flow configuration template to obtain configuration information.
Once the server displays the data cleaning flow configuration template on the Web interface, the user can add or change the content of a target position by clicking that position in the template and, after the changes are complete, click to confirm, whereupon the configuration information is obtained. It should be noted that fig. 3 shows only one type of template, for a data cleaning flow of the form input source, computing engine, output source. In practical application there may also be flows of the form input source, computing engine, output source, computing engine, output source, or other data cleaning flows. A corresponding data cleaning flow configuration template may be set for each kind of flow, and the user may select a suitable template as needed when configuring.
In practical application, data from a plurality of input sources can be fed to one calculation engine, and an output source can in turn serve as an input source whose data is cleaned by a calculation engine and written to another output source, which makes the approach quite flexible.
2. In practical application, a user can also send required data cleaning setting information to a server in a graphic form, and the server can automatically adjust the data cleaning flow configuration template through the data cleaning setting information.
In this scenario, the data cleansing instruction includes data cleansing setting information, and the data cleansing setting information includes attribute information of the target data cleansing component and a logical connection relationship between the target data cleansing components.
Specifically, referring to fig. 5, the target data cleansing component is the data cleansing component in the circular frame in fig. 5, in this embodiment, the attribute information of the target data cleansing component may be the above-mentioned component information, and in addition, the attribute information may also include only the above-mentioned component name and component type, where the component name is used to describe which data cleansing component the user selects, and the component type is used to describe what the specific function of the data cleansing component the user selects is, that is, as an input source, a calculation engine, or an output source. Taking mysql as an example, in fig. 5, mysql serves as an input source.
Fig. 5 also shows the logical connection relationship between the target data cleaning components. Specifically, mysql, kafka and oracle serve as input sources whose data is input into the computing engine flink; after flink finishes processing, the data is output to the output source postgres; the data is then input into the computing engine spark; and after spark finishes processing, the data is finally output to the output source hbase.
In this case, responding to the data cleansing instruction and determining configuration information may include:
1) Acquiring a data cleaning process configuration template; a preset logic relationship and a target position corresponding to the preset logic relationship are arranged on the data cleaning process configuration template; and the target position is used for inputting the identification information of the data cleaning component corresponding to the preset logical relationship.
Specifically, please refer to the corresponding descriptions in the above embodiments for a specific explanation of the data cleaning process configuration template, which is not described herein again.
2) Determining data at the target position of the data cleaning flow configuration template based on the attribute information of the target data cleaning components and the logical connection relationship between the target data cleaning components, to obtain configuration information.
Specifically, a preset logical relationship, namely Right to Left, is arranged on the data cleaning flow configuration template. The target data cleaning components corresponding to Right and Left can therefore be selected according to the logical connection relationship between the target data cleaning components, and their identification information is added to the target positions corresponding to Right and Left.
Continuing with the example of fig. 4 and fig. 5, the content of the data cleaning flow configuration template is determined according to the data cleaning setting information in fig. 5: mysql, kafka and oracle are input sources feeding the computing engine flink; after flink finishes processing, the data is output to the output source postgres and then input into the computing engine spark; and after spark finishes processing, the data is finally output to the output source hbase.
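The second approach, in which the server fills the template from graphical setting information, can be sketched as a translation from named connections to (Right, Left) ID pairs. The following is only a sketch under assumptions: `edges_to_config` and `NAME_TO_ID` are hypothetical names, and every ID except 1001 for mysql is invented for the example. The edge list follows the flow described for fig. 5.

```python
from typing import Dict, List, Tuple

# Hypothetical name-to-ID lookup; only mysql = 1001 is stated in the text.
NAME_TO_ID = {
    "mysql": 1001, "kafka": 1101, "oracle": 1102,
    "flink": 2001, "spark": 2002,
    "postgres": 3001, "hbase": 3002,
}

# Logical connections from the setting information: (upstream, downstream).
EDGES = [
    ("mysql", "flink"), ("kafka", "flink"), ("oracle", "flink"),
    ("flink", "postgres"), ("postgres", "spark"), ("spark", "hbase"),
]


def edges_to_config(
    edges: List[Tuple[str, str]], name_to_id: Dict[str, int]
) -> List[Tuple[int, int]]:
    """Map each connection to a (Right ID, Left ID) pair for the template."""
    config = []
    for upstream, downstream in edges:
        for name in (upstream, downstream):
            if name not in name_to_id:
                raise ValueError(f"unknown data cleaning component: {name}")
        # The upstream component goes to Right, the downstream one to Left.
        config.append((name_to_id[upstream], name_to_id[downstream]))
    return config


config = edges_to_config(EDGES, NAME_TO_ID)
for right_id, left_id in config:
    print(f"Right={right_id} -> Left={left_id}")
```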
S12, acquiring preset component information of each data cleaning component; the component information includes the identification information.
Specifically, the component information is preset and includes identification information, a component name, and a component type.
S13, determining target data cleaning components corresponding to the configuration information and data cleaning logic information among the target data cleaning components based on the component information of each data cleaning component to obtain data cleaning flow information.
Specifically, the configuration information includes identification information of the data cleaning components and the logical relationship between them. Each logical relationship links two data cleaning components, that is, one group of Right and Left. In this embodiment, therefore, the multiple groups of Right and Left need to be combined to form the complete data cleaning flow information.
Specifically, referring to fig. 6, step S13 may include:
S31, determining the data cleaning component corresponding to the identification information in the configuration information based on the component information of each data cleaning component, and taking the data cleaning component as a target data cleaning component.
Specifically, as described with reference to fig. 3, both Right and Left correspond to identification information. The identification information in the component information of each data cleaning component is used to determine which data cleaning component each of these identifiers represents.
Specifically, identification information corresponding to Right and Left is acquired, then a data cleaning component corresponding to the identification information is queried according to component information of the data cleaning component in fig. 4, and the queried data cleaning component is used as a target data cleaning component.
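Step S31 amounts to a lookup of each Right and Left identifier against the preset component information. A minimal sketch, assuming the component information is a dictionary from ID to (name, type); the function name `resolve_targets` and all IDs other than 1001 are hypothetical:

```python
from typing import Dict, List, Tuple

# Preset component information: ID -> (component name, component type).
# Only the mysql entry is taken from the text; the rest are example values.
COMPONENT_INFO: Dict[int, Tuple[str, str]] = {
    1001: ("mysql", "input source"),
    2001: ("flink", "calculation engine"),
    3001: ("hbase", "output source"),
    3002: ("hive", "output source"),
}


def resolve_targets(
    config: List[Tuple[int, int]], component_info: Dict[int, Tuple[str, str]]
) -> Dict[int, Tuple[str, str]]:
    """Return the components referenced by any Right or Left identifier."""
    targets = {}
    for right_id, left_id in config:
        for component_id in (right_id, left_id):
            if component_id not in component_info:
                raise KeyError(f"no component with identification {component_id}")
            targets[component_id] = component_info[component_id]
    return targets


targets = resolve_targets([(1001, 2001), (2001, 3001)], COMPONENT_INFO)
print(sorted(name for name, _ in targets.values()))
```

Components that are defined but not referenced by the configuration (hive in this example) are simply not selected as target data cleaning components.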
S32, determining data cleaning logic information among the target data cleaning components based on the component information of the target data cleaning components and the logical relationship between the data cleaning components, to obtain data cleaning flow information.
Specifically, in this embodiment, the data cleaning logical relationship among all the target data cleaning components is determined from the logical relationships between pairs of target data cleaning components, and the resulting logical relationship is used as the data cleaning flow information.
In practical applications, step S32 may include:
1) Determining data cleaning sub-logic information of the target data cleaning components that have the logical relationship, based on the component type in the component information of the target data cleaning components and the logical relationship between the data cleaning components.
Specifically, in this embodiment, the data cleaning component corresponding to each group of Right and Left has been determined as described above, and the component type of each such component is then obtained. Assume that Right corresponds to mysql and Left corresponds to flink. Since the component type of mysql is input source, the component type of flink is computing engine, and the execution order runs from Right to Left, the sub-logic is: acquire the data to be processed from the input source mysql, and input the acquired data to the computing engine flink for data cleaning.
Similarly, the same process is performed for the other groups of Right and Left.
2) Integrating all the data cleaning sub-logic information to obtain data cleaning logic information, which is used as the data cleaning flow information.
After the data cleaning sub-logic information corresponding to each group of Right and Left is obtained, the pieces of sub-logic information are combined according to the logic among them, and the combined result is used as the data cleaning flow information.
For example, one piece of data cleaning sub-logic information is to acquire the data to be processed from the input source mysql and input it to the computing engine flink for data cleaning, and another piece is to output the data to the output source postgres after the computing engine flink finishes cleaning. Integrating the two gives:
Acquire the data to be processed from the input source mysql, input it to the computing engine flink for data cleaning, and output the data to the output source postgres after the computing engine flink finishes cleaning.
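The two sub-steps of S32 can be sketched as follows, assuming each Right/Left group is already resolved to component names. The helper names, the type labels, and the rule that only input-to-engine and engine-to-output pairs are valid are assumptions for illustration; the text only states that execution runs from Right to Left.

```python
def build_sub_logic(pair, types):
    """Turn one Right/Left group into a (source, destination) step.

    Execution order is Right -> Left, as in the embodiment.
    """
    src, dst = pair["right"], pair["left"]
    allowed = {("input", "engine"), ("engine", "output")}  # assumed rule
    if (types[src], types[dst]) not in allowed:
        raise ValueError(f"unsupported relation: {src} -> {dst}")
    return (src, dst)


def integrate(steps):
    """Chain the individual steps into one ordered flow."""
    next_of = dict(steps)
    destinations = set(next_of.values())
    starts = [s for s in next_of if s not in destinations]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting component")
    flow = [starts[0]]
    while flow[-1] in next_of:
        nxt = next_of[flow[-1]]
        if nxt in flow:
            raise ValueError("cycle detected in the flow")
        flow.append(nxt)
    if len(flow) != len(steps) + 1:
        raise ValueError("steps do not form a single chain")
    return flow


types = {"mysql": "input", "flink": "engine", "postgres": "output"}
# The groups may be listed in any order; integration restores the sequence.
pairs = [
    {"right": "flink", "left": "postgres"},
    {"right": "mysql", "left": "flink"},
]
flow = integrate([build_sub_logic(p, types) for p in pairs])
```

With the two groups above, `flow` lists mysql, flink, and postgres in execution order, matching the integrated description in the example.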
S14, sequentially calling the target data cleaning components according to the data cleaning flow information, so that each target data cleaning component executes the data cleaning operation corresponding to it in the data cleaning flow information.
Specifically, in the data cleaning process, the server is used as a controller for data cleaning, and it is assumed that the data cleaning flow information is as follows:
Acquire the data to be processed from the input source mysql, input it to the computing engine flink for data cleaning, and output the data to the output source postgres after the computing engine flink finishes cleaning.
The specific control process of the server is as follows:
The server acquires the data to be processed from the input source mysql and inputs it to the computing engine flink for data cleaning; after the computing engine flink finishes cleaning, the server acquires the cleaned data from flink and outputs it to the output source postgres.
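The controller role of the server in step S14 can be sketched as a loop that calls each component in flow order and hands the previous result to the next one. The three functions below are in-memory stand-ins, not real mysql, flink, or postgres clients, and the "cleaning" rule (drop empty names, trim whitespace) is invented for the example.

```python
def run_flow(flow, registry):
    """Call each component in order, passing the previous output on."""
    data = None
    trace = []
    for name in flow:
        data = registry[name](data)
        trace.append(name)
    return data, trace


def read_mysql(_):
    # Stand-in for reading the data to be processed from the input source.
    return [{"name": " Alice "}, {"name": None}, {"name": "Bob"}]


def clean_flink(rows):
    # Stand-in for the computing engine: drop empty names, trim whitespace.
    return [{"name": r["name"].strip()} for r in rows if r["name"]]


sink = []  # Stand-in for the output source.


def write_postgres(rows):
    sink.extend(rows)
    return rows


registry = {"mysql": read_mysql, "flink": clean_flink, "postgres": write_postgres}
result, trace = run_flow(["mysql", "flink", "postgres"], registry)
```

Because the server only sees a name-to-callable registry, swapping a component changes the flow list and the registry entry, not the control loop.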
In this embodiment, when receiving a data cleaning instruction, a server may respond to the data cleaning instruction, determine configuration information, then obtain preset component information of each data cleaning component, determine a target data cleaning component corresponding to the configuration information and data cleaning logic information between the target data cleaning components based on the component information of each data cleaning component, obtain data cleaning flow information, and finally sequentially invoke the target data cleaning components according to the data cleaning flow information, so that the target data cleaning components execute data cleaning operations corresponding to the target data cleaning components in the data cleaning flow information. In the invention, the server determines the configuration information under the condition of receiving the data cleaning instruction, wherein the configuration information comprises the identification information of the data cleaning components and the logic relation between the data cleaning components, so that the required data cleaning components can be adjusted in a mode of inputting the data cleaning instruction by a user. In addition, compared with a code modification mode, the method saves a large amount of human resources and time, and improves the working efficiency.
It should be noted that the data cleaning flow configuration template and the component definitions can both be stored in a yml file. Specifically, the yml file is divided into two parts: the first part defines the input sources, computing engines, and output sources, i.e., the component definitions; the second part defines the cleaning flow, i.e., the data cleaning flow configuration template.
In practical applications, there are many kinds of input sources (mysql, kafka, postgres, files, etc.), and any middleware capable of storing data may be used as an input source. The input sources are generally read in different ways; for example, kafka needs to be read by a kafka-consumer, while mysql needs to be read through jdbc. A Java interface is therefore defined first, and each type of input source then implements that interface, so that the data passed to the computing engine after reading has a consistent format, which makes processing by the computing engine easier. The computing engine and the output source are implemented in the same manner.
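The common-interface idea can be sketched as follows. The text describes a Java interface; this sketch uses a Python abstract base class as an analogue. The `InputSource` name, the `read` method, and both implementations are assumptions, and the sources are in-memory stand-ins rather than real jdbc or kafka-consumer readers.

```python
import csv
import io
import json
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List

Record = Dict[str, object]


class InputSource(ABC):
    """Common contract: every source yields records in the same format."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class CsvTextSource(InputSource):
    """Stand-in for a file-based input source."""

    def __init__(self, text: str) -> None:
        self.text = text

    def read(self) -> Iterator[Record]:
        for row in csv.DictReader(io.StringIO(self.text)):
            yield dict(row)


class InMemoryQueueSource(InputSource):
    """Stand-in for a message-queue input source such as kafka."""

    def __init__(self, messages: List[str]) -> None:
        self.messages = messages

    def read(self) -> Iterator[Record]:
        for message in self.messages:
            yield json.loads(message)


def collect(source: InputSource) -> List[Record]:
    # The consumer depends only on the interface, not on the source type.
    return list(source.read())
```

Both sources hand the consumer a list of plain dictionaries, so a downstream engine does not need to know how the data was read.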
Since the input source, the computing engine, and the output source already implement the corresponding interfaces, the yml definition only needs to configure the corresponding parameters, such as the database url, user name, and password. The specific data flow direction is defined by the data cleaning flow.
In the embodiment of the invention, the input source, the computing engine, and the output source are defined in advance in a yml file, and a complete data cleaning flow is then formed by parsing that yml file. When any of these links needs to be switched, only the yml file needs to be slightly modified, which greatly reduces the workload.
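The two-part file and the "switch one link" idea can be sketched as follows. The dictionary stands for what a parsed yml file might look like; the key names, ids, parameter names, and placeholder values are all assumptions, since the actual file layout is not given in the text.

```python
import copy

# Hypothetical parsed form of the yml file: part one holds the component
# definitions, part two holds the cleaning flow as Right/Left groups.
PARSED_YML = {
    "components": {
        "1": {"name": "mysql", "type": "input",
              "params": {"url": "jdbc:mysql://localhost:3306/demo",
                         "user": "demo", "password": "change-me"}},
        "2": {"name": "flink", "type": "engine", "params": {}},
        "3": {"name": "postgres", "type": "output",
              "params": {"url": "jdbc:postgresql://localhost:5432/demo",
                         "user": "demo", "password": "change-me"}},
        "4": {"name": "kafka", "type": "output",
              "params": {"bootstrap_servers": "localhost:9092"}},
    },
    "flow": [
        {"right": "1", "left": "2"},
        {"right": "2", "left": "3"},
    ],
}


def flow_names(doc):
    """Render the flow as (right name, left name) pairs."""
    components = doc["components"]
    return [(components[p["right"]]["name"], components[p["left"]]["name"])
            for p in doc["flow"]]


def switch_component(doc, old_id, new_id):
    """Return a copy of the config with one component id replaced in the flow."""
    if new_id not in doc["components"]:
        raise KeyError(f"unknown component id: {new_id!r}")
    updated = copy.deepcopy(doc)
    for pair in updated["flow"]:
        for side in ("right", "left"):
            if pair[side] == old_id:
                pair[side] = new_id
    return updated


# Switching the output source from postgres to kafka touches only the flow part.
switched = switch_component(PARSED_YML, "3", "4")
```

Only the id in the flow part changes; the component definitions and the code that runs the flow stay as they are.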
Optionally, on the basis of the above embodiment of the data cleansing method, another embodiment of the present invention provides a data cleansing apparatus applied to a server, and referring to fig. 7, the data cleansing apparatus includes:
the information determining module 11 is configured to respond to a data cleansing instruction input by a user and determine configuration information when the data cleansing instruction is received; the configuration information comprises identification information of the data cleaning components and a logical relationship between the data cleaning components;
the information acquisition module 12 is used for acquiring preset component information of each data cleaning component; the component information includes the identification information;
a cleaning determining module 13, configured to determine, based on the component information of each data cleaning component, a target data cleaning component corresponding to the configuration information and data cleaning logic information between the target data cleaning components, so as to obtain data cleaning flow information;
and the data cleaning module 14 is configured to sequentially invoke the target data cleaning components according to the data cleaning flow information, so that the target data cleaning components execute data cleaning operations corresponding to the target data cleaning components in the data cleaning flow information.
Further, the information determination module includes:
the first template acquisition submodule is used for responding to the data cleaning instruction and acquiring and displaying a data cleaning flow configuration template; a preset logic relationship and a target position corresponding to the preset logic relationship are arranged on the data cleaning process configuration template; the target position is used for inputting identification information of the data cleaning component corresponding to the preset logical relationship;
and the information receiving submodule is used for receiving the identification information of the data cleaning component input by a user at the target position of the data cleaning flow configuration template, to obtain the configuration information.
Further, in a case that the data cleansing instruction includes data cleansing setting information, the data cleansing setting information includes attribute information of a target data cleansing component, and a logical connection relationship between the target data cleansing components, the information determining module includes:
the second template acquisition submodule is used for acquiring a data cleaning process configuration template; a preset logic relation and a target position corresponding to the preset logic relation are arranged on the data cleaning process configuration template; the target position is used for inputting identification information of the data cleaning component corresponding to the preset logical relationship;
and the information determining submodule is used for determining the data at the target position of the data cleaning flow configuration template based on the attribute information of the target data cleaning components and the logical connection relationship between the target data cleaning components, to obtain the configuration information.
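Filling the template from the data cleaning setting information can be sketched as follows. The template layout (a relation label plus empty Right and Left positions), the attribute fields, and the `fill_template` helper are all hypothetical; the text only states that the template carries preset logical relationships and target positions for identification information.

```python
# Hypothetical template: each preset logical relationship has two empty
# target positions (right, left) waiting for identification information.
TEMPLATE = [
    {"relation": "input->engine", "right": None, "left": None},
    {"relation": "engine->output", "right": None, "left": None},
]


def fill_template(template, attributes, connections):
    """Fill the target positions from attribute and connection information.

    attributes: component name -> {"id": ..., "type": ...}
    connections: list of (from_name, to_name) logical connections
    """
    by_relation = {}
    for src, dst in connections:
        key = f'{attributes[src]["type"]}->{attributes[dst]["type"]}'
        by_relation[key] = (attributes[src]["id"], attributes[dst]["id"])
    filled = []
    for slot in template:
        if slot["relation"] not in by_relation:
            raise ValueError(f'no connection for {slot["relation"]}')
        right, left = by_relation[slot["relation"]]
        filled.append({**slot, "right": right, "left": left})
    return filled


attributes = {
    "mysql": {"id": "1", "type": "input"},
    "flink": {"id": "2", "type": "engine"},
    "postgres": {"id": "3", "type": "output"},
}
connections = [("mysql", "flink"), ("flink", "postgres")]
configuration = fill_template(TEMPLATE, attributes, connections)
```

The template itself is left unchanged, so the same template can be filled again for a different set of target components.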
Further, the cleaning determining module 13 includes:
the component determining submodule is used for determining a data cleaning component corresponding to the identification information in the configuration information based on the component information of each data cleaning component and taking the data cleaning component as a target data cleaning component;
and the logic determining submodule is used for determining data cleaning logic information among the target data cleaning components based on the component information of the target data cleaning components and the logical relationship between the data cleaning components, to obtain data cleaning flow information.
Further, the logic determining submodule is specifically configured to:
determine data cleaning sub-logic information of the target data cleaning components that have the logical relationship, based on the component type in the component information of the target data cleaning components and the logical relationship between the data cleaning components, and integrate all the data cleaning sub-logic information to obtain data cleaning logic information, which is used as the data cleaning flow information.
In this embodiment, when receiving a data cleansing instruction, a server may respond to the data cleansing instruction, determine configuration information, then obtain preset component information of each data cleansing component, determine, based on the component information of each data cleansing component, a target data cleansing component corresponding to the configuration information, and data cleansing logic information between the target data cleansing components, obtain data cleansing flow information, and finally sequentially invoke the target data cleansing components according to the data cleansing flow information, so that the target data cleansing components perform data cleansing operations corresponding to the target data cleansing components in the data cleansing flow information. In the invention, the server determines the configuration information under the condition of receiving the data cleaning instruction, wherein the configuration information comprises the identification information of the data cleaning components and the logic relation between the data cleaning components, so that the required data cleaning components can be adjusted in a mode of inputting the data cleaning instruction by a user.
It should be noted that, for the working processes of each module and sub-module in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.
The data cleaning device comprises a processor and a memory, wherein the information determining module, the information acquiring module, the cleaning determining module, the data cleaning module and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, which calls the corresponding program unit from the memory. One or more kernels may be provided, and the accuracy of data cleaning is improved by adjusting the kernel parameters.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the data cleansing method when executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the data cleaning method is executed when the program runs.
Referring to fig. 8, an embodiment of the present invention provides an electronic device, which includes at least one processor 701, at least one memory 702 connected to the processor, and a bus 703; the processor 701 and the memory 702 complete mutual communication through a bus 703; the processor 701 is configured to call program instructions in the memory 702 to perform the data cleansing method described above. The electronic device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on an electronic device, is adapted to execute a program initialized with the following method steps:
a data cleaning method, applied to a server, comprising the following steps:
under the condition of receiving a data cleaning instruction input by a user, responding to the data cleaning instruction and determining configuration information; the configuration information comprises identification information of the data cleansing components and a logical relationship between the data cleansing components;
acquiring preset component information of each data cleaning component; the component information includes the identification information;
determining target data cleaning components corresponding to the configuration information and data cleaning logic information among the target data cleaning components based on the component information of each data cleaning component to obtain data cleaning flow information;
and sequentially calling the target data cleaning components according to the data cleaning flow information so as to enable the target data cleaning components to execute data cleaning operation corresponding to the target data cleaning components in the data cleaning flow information.
Further, responding to the data cleansing instruction and determining configuration information, comprising:
responding to the data cleaning instruction, and acquiring and displaying a data cleaning flow configuration template; a preset logic relation and a target position corresponding to the preset logic relation are arranged on the data cleaning process configuration template; the target position is used for inputting identification information of the data cleaning component corresponding to the preset logical relationship;
and receiving identification information of the data cleaning component input by a user at the target position of the data cleaning flow configuration template, to obtain configuration information.
Further, in a case where the data cleansing instruction includes data cleansing setting information, the data cleansing setting information includes attribute information of a target data cleansing component, and a logical connection relationship between the target data cleansing components, responding to the data cleansing instruction, and determining configuration information includes:
acquiring a data cleaning process configuration template; a preset logic relationship and a target position corresponding to the preset logic relationship are arranged on the data cleaning process configuration template; the target position is used for inputting identification information of the data cleaning component corresponding to the preset logical relationship;
and determining data at the target position of the data cleaning flow configuration template based on the attribute information of the target data cleaning components and the logical connection relation between the target data cleaning components to obtain configuration information.
Further, based on the component information of each data cleaning component, determining a target data cleaning component corresponding to the configuration information and data cleaning logic information between the target data cleaning components to obtain data cleaning flow information, including:
determining a data cleaning component corresponding to the identification information in the configuration information based on the component information of each data cleaning component, and using the data cleaning component as a target data cleaning component;
and determining data cleaning logic information among the target data cleaning components based on the component information of the target data cleaning components and the logical relationship between the data cleaning components, to obtain data cleaning flow information.
Further, determining data cleaning logic information between the target data cleaning components based on the component information of the target data cleaning components and the logic relationship between the data cleaning components to obtain data cleaning flow information, including:
determining data cleaning sub-logic information of a target data cleaning component with the logic relationship based on the component type in the component information of the target data cleaning component and the logic relationship between the data cleaning components;
and integrating all the data cleaning sub-logic information to obtain data cleaning logic information which is used as data cleaning process information.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, electronic devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, an electronic device includes one or more processors (CPUs), memory, and a bus. The electronic device may also include input/output interfaces, network interfaces, and the like.
The memory may include, among computer-readable media, volatile memory such as random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A data cleaning method, applied to a server, comprising the following steps:
under the condition of receiving a data cleaning instruction input by a user, responding to the data cleaning instruction and determining configuration information; the configuration information comprises identification information of the data cleansing components and a logical relationship between the data cleansing components;
acquiring preset component information of each data cleaning component; the component information includes the identification information;
determining target data cleaning components corresponding to the configuration information and data cleaning logic information among the target data cleaning components based on the component information of each data cleaning component to obtain data cleaning flow information;
and sequentially calling the target data cleaning components according to the data cleaning flow information so as to enable the target data cleaning components to execute data cleaning operation corresponding to the target data cleaning components in the data cleaning flow information.
2. The data cleansing method of claim 1, wherein determining configuration information in response to the data cleansing instruction comprises:
responding to the data cleaning instruction, and acquiring and displaying a data cleaning flow configuration template; a preset logic relationship and a target position corresponding to the preset logic relationship are arranged on the data cleaning process configuration template; the target position is used for inputting identification information of the data cleaning component corresponding to the preset logical relationship;
and receiving identification information of the data cleaning component input by a user at the target position of the data cleaning flow configuration template, to obtain configuration information.
3. The data cleansing method according to claim 1, wherein in a case where the data cleansing instruction includes data cleansing setting information, the data cleansing setting information includes attribute information of a target data cleansing component, and a logical connection relationship between the target data cleansing components, responding to the data cleansing instruction, and determining configuration information includes:
acquiring a data cleaning process configuration template; a preset logic relationship and a target position corresponding to the preset logic relationship are arranged on the data cleaning process configuration template; the target position is used for inputting identification information of the data cleaning component corresponding to the preset logical relationship;
and determining data at the target position of the data cleaning flow configuration template based on the attribute information of the target data cleaning components and the logical connection relation between the target data cleaning components to obtain configuration information.
4. The data cleansing method according to claim 1, wherein determining a target data cleansing component corresponding to the configuration information and data cleansing logic information between the target data cleansing components based on the component information of each data cleansing component to obtain data cleansing flow information comprises:
determining a data cleaning component corresponding to the identification information in the configuration information based on the component information of each data cleaning component, and using the data cleaning component as a target data cleaning component;
and determining data cleaning logic information among the target data cleaning components based on the component information of the target data cleaning components and the logical relationship between the data cleaning components, to obtain data cleaning flow information.
5. The data cleansing method of claim 4, wherein determining data cleansing logic information between the target data cleansing components based on the component information of the target data cleansing components and the logic relationship between the data cleansing components to obtain data cleansing flow information comprises:
determining data cleaning sub-logic information of a target data cleaning component with the logic relationship based on the component type in the component information of the target data cleaning component and the logic relationship between the data cleaning components;
and integrating all the data cleaning sub-logic information to obtain data cleaning logic information which is used as data cleaning process information.
6. A data cleaning device, applied to a server, comprising:
the information determining module is used for responding to the data cleaning instruction and determining configuration information under the condition of receiving the data cleaning instruction input by a user; the configuration information comprises identification information of the data cleaning components and a logical relationship between the data cleaning components;
the information acquisition module is used for acquiring preset component information of each data cleaning component; the component information includes the identification information;
a cleaning determining module, configured to determine, based on the component information of each data cleaning component, a target data cleaning component corresponding to the configuration information and data cleaning logic information between the target data cleaning components, so as to obtain data cleaning flow information;
and the data cleaning module is used for sequentially calling the target data cleaning components according to the data cleaning flow information, so that each target data cleaning component executes the data cleaning operation corresponding to it in the data cleaning flow information.
7. The data cleansing apparatus of claim 6, wherein the information determination module comprises:
the first template acquisition submodule is used for responding to the data cleaning instruction and acquiring and displaying a data cleaning flow configuration template; a preset logic relationship and a target position corresponding to the preset logic relationship are arranged on the data cleaning process configuration template; the target position is used for inputting identification information of the data cleaning component corresponding to the preset logical relationship;
and the information receiving submodule is used for receiving the identification information of the data cleaning component input by a user at the target position of the data cleaning flow configuration template, to obtain configuration information.
8. The data cleansing apparatus according to claim 6, wherein in a case where the data cleansing instruction includes data cleansing setting information, the data cleansing setting information includes attribute information of a target data cleansing component, and a logical connection relationship between the target data cleansing components, the information determining module includes:
the second template acquisition submodule is used for acquiring a data cleaning process configuration template; a preset logic relationship and a target position corresponding to the preset logic relationship are arranged on the data cleaning process configuration template; the target position is used for inputting identification information of the data cleaning component corresponding to the preset logical relationship;
and the information determining submodule is used for determining data at the target position of the data cleaning flow configuration template based on the attribute information of the target data cleaning components and the logical connection relationship between the target data cleaning components, to obtain configuration information.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein the apparatus on which the storage medium is located is controlled to perform the data cleansing method according to any one of claims 1 to 5 when the program is run.
10. An electronic device comprising at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to invoke program instructions in the memory to perform the data cleaning method of any one of claims 1 to 5.
CN202110824482.2A 2021-07-21 2021-07-21 Data cleaning method and device Pending CN115687310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110824482.2A CN115687310A (en) 2021-07-21 2021-07-21 Data cleaning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110824482.2A CN115687310A (en) 2021-07-21 2021-07-21 Data cleaning method and device

Publications (1)

Publication Number Publication Date
CN115687310A true CN115687310A (en) 2023-02-03

Family

ID=85045075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110824482.2A Pending CN115687310A (en) 2021-07-21 2021-07-21 Data cleaning method and device

Country Status (1)

Country Link
CN (1) CN115687310A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117171153A (en) * 2023-09-11 2023-12-05 北京三维天地科技股份有限公司 Visual data cleaning method and system supporting custom cleaning flow

Similar Documents

Publication Publication Date Title
EP3404542A1 (en) Data pipeline architecture for analytics processing stack
CN106874174B (en) Method and device for realizing interface test and function test
CN112162915B (en) Test data generation method, device, equipment and storage medium
CN112364074B (en) Time sequence data visualization method, equipment and medium
CN109814863A (en) A kind of processing method, device, computer equipment and computer storage medium for requesting returned data
CN114168114A (en) Operator registration method, device and equipment
CN112115394A (en) Data display method, server, terminal and medium
CN110941428A (en) Website creation method and device
CN113448678A (en) Application information generation method, deployment method, device, system and storage medium
CN115687310A (en) Data cleaning method and device
CN113900633A (en) Low-code development method and device for scene of Internet of things, storage medium and development platform
CN112579066A (en) Chart display method and device, storage medium and equipment
CN115857929A (en) Resource data processing method and device, computer equipment and storage medium
CN112559576A (en) Data display method, system, device, storage medium and electronic equipment
CN113485746B (en) Method and device for generating application program interface document
CN113055209B (en) Arranging method and device for edge calculation
CN115269050A (en) Multi-map calling method and device, storage medium and computer equipment
CN113704664A (en) Method and device for generating routing address for accessing page
CN112748917B (en) Graph display method and device
CN114047914A (en) Interface configuration method and device, electronic equipment and computer readable storage medium
CN114237871A (en) Arranging method and device of cloud resources, computer equipment and storage medium
CN114281818A (en) Data processing method, device, server and storage medium
CN113326237A (en) Log data processing method and device, terminal device and storage medium
CN111639030A (en) Page testing method, device, equipment and storage medium
CN112749229A (en) Data conversion method, device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination