CN113515500B

CN113515500B - Visual data processing system and processing method

Info

Publication number: CN113515500B
Application number: CN202110563262.9A
Authority: CN
Inventors: 马学中; 胡德斌
Original assignee: Suzhou Weizhong Data Technology Co ltd
Current assignee: Suzhou Weizhong Data Technology Co ltd
Priority date: 2021-05-24
Filing date: 2021-05-24
Publication date: 2023-06-30
Anticipated expiration: 2041-05-24
Also published as: CN113515500A

Abstract

The invention discloses a visualized data processing system and a visualized data processing method. The system consists of a foreground visualized operation part and a background data processing part, and the method comprises the following steps: S1, defining task execution units, and defining an execution sequence of the task execution units according to specific task requirements to form a task execution rule; S2, calling the corresponding task execution units according to the task execution rule, obtaining a task execution result, and storing the task execution result. The invention achieves effective control of the data processing flow in a visual and user-defined manner; the whole operation process is simple and intuitive, which greatly shortens the development period, saves valuable technician resources in enterprises, and improves the production efficiency and actual output of enterprises.

Description

Visual data processing system and processing method

Technical Field

The invention relates to a data processing system and a processing method, in particular to a comprehensive and visual data processing system and a data processing method applying the system, and belongs to the technical field of big data processing.

Background

Big data is a concept that has been widely followed, discussed, and researched in recent years. It mainly refers to data sets whose content cannot be captured, managed, and processed with conventional software tools within a certain time. Big data technology, in turn, refers to technology for quickly obtaining valuable information from various types of big data. Technologies suitable for big data include massively parallel processing (MPP) databases, data mining grids, distributed file systems, distributed databases, cloud computing platforms, the internet, scalable storage systems, and the like.

The effective utilization of big data remains a technical pain point in the industry. How to effectively integrate and utilize the data in a database according to the needs of different enterprises, or of different projects within the same enterprise, so as to obtain the expected effect is a difficult problem for researchers in the industry.

In current practice, big data processing generally requires business personnel in an enterprise to set out specific project requirements, after which developers in the enterprise carry out evaluation and system development according to those requirements. The whole development period ranges from a few hours at the shortest to several days or weeks at the longest. If communication between the business personnel and the developers breaks down and the developers misunderstand the requirements, the system must be scrapped and redeveloped. For enterprises, this workflow clearly causes a huge waste of resources and severely restricts production efficiency and actual output.

In summary, how to provide a comprehensive and visual data processing system and a data processing method using the same based on the prior art to overcome the defects in the prior art is also a problem to be solved by researchers in the industry.

Disclosure of Invention

In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide a comprehensive and visual data processing system and a data processing method using the same, which are specifically as follows.

A visualized data processing system for enabling processing of big data, comprising:

the foreground visual operation part is used for defining task execution units, defining the execution sequence of the task execution units according to specific task requirements, forming task execution rules and sending the task execution rules;

the background data processing part is in signal connection with the foreground visual operation part and is used for receiving the task execution rule, calling the task execution unit according to the task execution rule, obtaining a task execution result and storing the task execution result;

the foreground visualization operation portion specifically includes,

the cleaning task execution units are used for defining specific data cleaning task operations and storing the operations in a modularized mode;

the modeling task execution units are used for defining specific data modeling task operations and storing the operations in a modularized form, the modeling task execution units being mutually independent;

the task input unit is used for defining the execution sequence of the cleaning task execution unit and the modeling task execution unit according to specific task requirements, forming the task execution rule and sending the task execution rule;

the background data processing section specifically includes,

the task receiving unit is in signal connection with the task input unit and is used for receiving the task execution rule;

the task analysis and judgment unit is in signal connection with the task receiving unit and is used for analyzing the task execution rule, judging whether the task execution rule is effective or not and executing subsequent operation according to a judgment result;

the task chain forming and executing unit is in signal connection with the task analyzing and judging unit and is also in signal connection with a plurality of cleaning task executing units and a plurality of modeling task executing units respectively, and when the task analyzing and judging unit judges that the task executing rule is effective, the cleaning task executing units and the modeling task executing units are sequentially called according to the task executing rule to obtain and send the task executing result;

and the task result storage unit is in signal connection with the task chain forming and executing unit and is used for storing and recording the task execution rules and the task execution results.
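The division of labour between the two parts can be illustrated with a minimal sketch. It is not the patented implementation: the names `UNIT_REGISTRY`, `register_unit`, `TaskExecutionRule`, and `build_rule` are hypothetical, task execution units are modeled as plain Python callables, and the two registered units are trivial stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical registry: unit name -> callable that transforms a list of rows.
UNIT_REGISTRY: Dict[str, Callable[[list], list]] = {}


def register_unit(name: str):
    """Store a task execution unit in modular form under a unique name."""
    def decorator(func):
        UNIT_REGISTRY[name] = func
        return func
    return decorator


@register_unit("clean.drop_none")
def drop_none(rows: list) -> list:
    # Stand-in cleaning unit: discard missing rows.
    return [r for r in rows if r is not None]


@register_unit("model.double")
def double(rows: list) -> list:
    # Stand-in modeling unit: apply a trivial "model" to each row.
    return [r * 2 for r in rows]


@dataclass
class TaskExecutionRule:
    """What the foreground part sends: a named, ordered list of unit names."""
    task_name: str
    unit_order: List[str] = field(default_factory=list)


def build_rule(task_name: str, *unit_names: str) -> TaskExecutionRule:
    # Foreground side: the user only chooses the order of existing units.
    return TaskExecutionRule(task_name, list(unit_names))


rule = build_rule("demo", "clean.drop_none", "model.double")
```

Because the rule holds only unit names and their order, the foreground part never needs to know how a unit is implemented, which is one plausible reading of "storing the operations in a modularized form".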

Preferably, a plurality of the cleaning task execution units are mutually independent;

each of the cleaning task performing units includes,

the cleaning object input module is used for defining a data set object which needs to be subjected to data cleaning;

the cleaning process definition module is in signal connection with the cleaning object input module and is used for defining a specific data cleaning process;

and the cleaning result deriving module is in signal connection with the cleaning process defining module and is used for carrying out data cleaning on the data set object according to the data cleaning process to obtain and output a data cleaning result.
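As an illustration only, the three modules of a cleaning task execution unit could map onto one small class. The class name `CleaningTaskUnit`, its methods, and the example step `drop_missing_name` are assumptions made for this sketch and do not appear in the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class CleaningTaskUnit:
    # Cleaning object input module: the data set object to be cleaned.
    dataset: List[Row]
    # Cleaning process definition module: ordered cleaning operations.
    steps: List[Callable[[List[Row]], List[Row]]] = field(default_factory=list)

    def add_step(self, step: Callable[[List[Row]], List[Row]]) -> "CleaningTaskUnit":
        self.steps.append(step)
        return self  # returning self allows chained definitions

    def derive_result(self) -> List[Row]:
        # Cleaning result deriving module: apply the process to the data set.
        rows = [dict(r) for r in self.dataset]  # copy so the input stays untouched
        for step in self.steps:
            rows = step(rows)
        return rows


def drop_missing_name(rows: List[Row]) -> List[Row]:
    # Hypothetical cleaning step: discard rows without a name.
    return [r for r in rows if r.get("name") is not None]


unit = CleaningTaskUnit([{"name": "a"}, {"name": None}]).add_step(drop_missing_name)
cleaned = unit.derive_result()
```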

Preferably, the modeling task execution units are independent from each other, and each modeling task execution unit comprises a model training subunit and a model application subunit;

the model training subunit comprises,

the training set selection module is used for selecting a training data set;

the training set preprocessing module is in signal connection with the training set selection module and is used for carrying out data preprocessing operation on the training data set;

the training model construction module is in signal connection with the training set preprocessing module and is used for forming a data processing model according to the preprocessed training data set and combining an algorithm and parameters;

the model application subunit comprises,

the data set selection module is used for selecting a task training set;

the data set preprocessing module is in signal connection with the data set selection module and is used for carrying out data preprocessing operation on the task training set;

the modeling result deriving unit is in signal connection with the data set preprocessing module and is used for combining the data processing model according to the preprocessed task training set to obtain and output a data modeling processing result.
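The split between the model training subunit and the model application subunit can be sketched with a deliberately tiny model. Everything here is an assumption for illustration: the class names, the shared `preprocess` helper, and the choice of a one-dimensional nearest-mean classifier, which the patent does not prescribe.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def preprocess(values: List[Optional[float]]) -> List[float]:
    # Stand-in preprocessing step: drop missing values.
    return [v for v in values if v is not None]


class ModelTrainingSubunit:
    def train(self, samples: List[Tuple[Optional[float], str]]) -> Dict[str, float]:
        """Build a "data processing model": one mean feature value per label."""
        grouped = defaultdict(list)
        for value, label in samples:
            if value is not None:
                grouped[label].append(value)
        return {label: sum(vals) / len(vals) for label, vals in grouped.items()}


class ModelApplicationSubunit:
    def apply(self, model: Dict[str, float], values: List[Optional[float]]) -> List[str]:
        """Label each value with the class whose mean is closest."""
        return [
            min(model, key=lambda label: abs(model[label] - v))
            for v in preprocess(values)
        ]


model = ModelTrainingSubunit().train([(1.0, "low"), (2.0, "low"), (9.0, "high")])
predictions = ModelApplicationSubunit().apply(model, [1.2, None, 8.0])
```

The trained model is an ordinary value passed from one subunit to the other, so it can be stored and reused by later tasks without retraining.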

A visualized data processing method, based on a visualized data processing system as described above, comprising the steps of:

s1, defining a task execution unit, and defining an execution sequence of the task execution unit according to specific task requirements to form a task execution rule;

s2, calling the corresponding task execution units according to the task execution rules to obtain task execution results and storing the task execution results;

s1 specifically comprises the following steps,

s11, defining specific data cleaning task operation, storing the operation into a cleaning task execution unit in a modularized mode, and ensuring that a plurality of cleaning task execution units are mutually independent;

s12, defining specific data modeling task operation, storing the operation into a modeling task execution unit in a modularized form, and ensuring that a plurality of modeling task execution units are mutually independent;

s13, defining the execution sequence of the cleaning task execution unit and the modeling task execution unit according to specific task requirements, forming the task execution rule and sending the task execution rule;

s2 specifically comprises the following steps,

s21, receiving the task execution rule;

S22, analyzing the task execution rule and judging whether the task execution rule is valid; if the task execution rule is judged valid, executing S23; if the task execution rule is judged invalid, reporting an error and ending the subsequent flow;

s23, calling the cleaning task execution unit and the modeling task execution unit in sequence according to the task execution rule, and obtaining and sending the task execution result after the operation flow is executed in sequence;

and S24, storing and recording the task execution rules and the task execution results, and storing the data processing models together if the data processing models are involved in the task execution process.
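Steps S21 to S24 can be sketched as a single function. This is a minimal illustration under stated assumptions: a rule is taken to be an ordered list of unit names, validity is taken to mean "non-empty and referring only to known units", and `UNITS`, `RESULT_STORE`, `InvalidRuleError`, and `process_rule` are hypothetical names.

```python
from typing import Callable, Dict, List

# Hypothetical units available to the background part.
UNITS: Dict[str, Callable[[List[int]], List[int]]] = {
    "clean.dedupe": lambda rows: list(dict.fromkeys(rows)),
    "model.square": lambda rows: [r * r for r in rows],
}

RESULT_STORE: List[dict] = []  # stand-in for the task result storage unit


class InvalidRuleError(ValueError):
    """Raised in S22 when the rule cannot be executed."""


def process_rule(rule: List[str], data: List[int]) -> List[int]:
    # S21: the rule has been received as an ordered list of unit names.
    # S22: validate; report an error and stop if the rule is invalid.
    unknown = [name for name in rule if name not in UNITS]
    if not rule or unknown:
        raise InvalidRuleError(f"invalid rule, unknown units: {unknown}")
    # S23: call the units in the order the rule prescribes.
    result = data
    for name in rule:
        result = UNITS[name](result)
    # S24: record the rule together with its result.
    RESULT_STORE.append({"rule": list(rule), "result": result})
    return result


output = process_rule(["clean.dedupe", "model.square"], [2, 2, 3])
```

Validation happens before any unit runs, so an invalid rule leaves `RESULT_STORE` untouched.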

Preferably, S11 specifically includes the following steps:

s111, defining a data set object needing data cleaning, wherein the source of the data set object can be a file type database or a relational database or a message queue;

S112, defining a specific data cleaning process, wherein the cleaning process comprises deduplication, mean filling, null filling and data deletion;

S113, cleaning the data of the data set object according to the data cleaning process, wherein the cleaned result can optionally be aggregated or subjected to spatiotemporal collision, to obtain and output a data cleaning result.
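The cleaning operations named in S112 (removing duplicates, filling with the mean, filling nulls, and deleting data) can be sketched in plain Python. The function names and the list-of-dicts row format are assumptions for this example; the patent does not specify how the operations are implemented.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def deduplicate(rows: List[Row], key: str) -> List[Row]:
    """Keep the first row seen for each value of `key`."""
    seen, out = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out


def fill_mean(rows: List[Row], column: str) -> List[Row]:
    """Replace missing values in `column` with the mean of the present ones."""
    present = [r[column] for r in rows if r[column] is not None]
    mean = sum(present) / len(present) if present else None
    return [{**r, column: mean if r[column] is None else r[column]} for r in rows]


def fill_null(rows: List[Row], column: str, default) -> List[Row]:
    """Replace missing values in `column` with a fixed default."""
    return [{**r, column: default if r[column] is None else r[column]} for r in rows]


def delete_column(rows: List[Row], column: str) -> List[Row]:
    """Remove `column` from every row."""
    return [{k: v for k, v in r.items() if k != column} for r in rows]


raw = [
    {"id": 1, "temp": 20.0, "note": None},
    {"id": 1, "temp": 20.0, "note": None},  # duplicate id
    {"id": 2, "temp": None, "note": None},
    {"id": 3, "temp": 30.0, "note": None},
]
cleaned = delete_column(fill_mean(deduplicate(raw, "id"), "temp"), "note")
```

Each operation returns new rows instead of mutating its input, so the same operations can be reordered or reused across cleaning tasks.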

Preferably, S12 includes a model training sub-step and a model application sub-step, which are performed sequentially;

the model training substep comprises in particular,

s121, selecting a training data set, wherein the training data set can be a file or a database table, and the training data set must contain a feature column required by training;

s122, performing data preprocessing operation on the training data set;

S123, selecting an algorithm and setting parameters according to the preprocessed training data set to form a data processing model, and storing the data processing model, wherein the parameters comprise the ratio of training data to test data, the number of iterations, the tree depth, the number of classes, and regularization parameters;

the model application substep comprises in particular,

s124, selecting a task training set;

s125, performing data preprocessing operation on the task training set;

and S126, according to the preprocessed task training set, combining the data processing model, selecting a characteristic column (which is consistent with the characteristic column when training the model) or a column to be processed according to the model, obtaining a data modeling processing result, and outputting the data modeling processing result to a file or a relational database.
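The requirement in S126 that the feature columns match those used in training, and that the result be written to a file, can be sketched as follows. The function `apply_model`, the column names, and the placeholder `toy_model` are hypothetical; an in-memory buffer stands in for the output file.

```python
import csv
import io
from typing import Callable, Dict, List, Sequence, TextIO

Row = Dict[str, float]


def apply_model(
    rows: Sequence[Row],
    feature_columns: Sequence[str],
    model: Callable[[List[float]], float],
    out: TextIO,
) -> None:
    """Check the feature columns, score each row, and write CSV to `out`."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([*feature_columns, "prediction"])
    for i, row in enumerate(rows):
        # The columns must be the same ones the model was trained on.
        missing = [c for c in feature_columns if c not in row]
        if missing:
            raise KeyError(f"row {i} is missing feature columns: {missing}")
        features = [row[c] for c in feature_columns]
        writer.writerow([*features, model(features)])


TRAINED_FEATURES = ["x1", "x2"]  # columns used when the model was trained


def toy_model(features: List[float]) -> float:
    # Placeholder for a stored data processing model.
    return sum(features)


buffer = io.StringIO()  # stands in for an output file
apply_model([{"x1": 1, "x2": 2, "extra": 9}], TRAINED_FEATURES, toy_model, buffer)
```

Columns that were not part of training, such as `extra` here, are ignored, while a missing feature column stops the task with an error.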

Compared with the prior art, the invention has the advantages that:

The visualized data processing system provided by the invention achieves effective control of the data processing flow in a visual and user-defined manner. The whole operation process is simple and intuitive, so that business personnel who are not familiar with the technology can also build the system themselves according to specific project requirements. This greatly shortens the development period, saves valuable technician resources in enterprises, and improves the production efficiency and actual output of enterprises.

Correspondingly, the visualized data processing method provided by the invention efficiently realizes the cleaning and modeling work of big data, and has high automation degree and integration degree in the whole method flow. The method can fully meet the requirements of different enterprises or different projects of the same enterprise, and is wide in application range and strong in adaptability.

In addition, the invention provides reference for other related problems in the same field, can be used for expanding and extending based on the reference, and has very wide application prospect when applied to other technical schemes related to big data processing.

The following detailed description of the embodiments of the present invention is provided with reference to the accompanying drawings, so that the technical scheme of the present invention can be understood and mastered more easily.

Drawings

Fig. 1 is a schematic diagram of a system architecture according to the present invention.

Detailed Description

The invention provides a comprehensive and visual data processing system and a data processing method using the system, and the specific scheme is as follows.

As shown in FIG. 1, the present invention discloses a visualized data processing system for realizing the processing of big data, comprising:

and the background data processing part is in signal connection with the foreground visual operation part and is used for receiving the task execution rule, calling the task execution unit according to the task execution rule, obtaining a task execution result and storing the task execution result.

The foreground visual operation part specifically comprises:

and the task input unit is used for defining the execution sequence of the cleaning task execution unit and the modeling task execution unit according to specific task requirements, forming the task execution rule and sending the task execution rule.

The background data processing part specifically comprises:

It should be emphasized that a plurality of the cleaning task execution units are mutually independent; and each of the cleaning task execution units includes:

Also, the modeling task execution units are independent from each other, and each modeling task execution unit comprises a model training subunit and a model application subunit.

The model training subunit comprises:

the training set selection module is used for selecting a training data set;

the training model construction module is in signal connection with the training set preprocessing module and is used for forming a data processing model according to the preprocessed training data set and combining algorithms and parameters.

The model application subunit includes:

the data set selection module is used for selecting a task training set;

In summary, the visualized data processing system provided by the invention realizes effective control of the data processing flow in a visualized and self-defined mode, the whole operation process is simple and visual, business personnel not familiar with the technology can also finish the construction of the system in a targeted manner according to specific project requirements, the development period is greatly shortened, valuable technician resources in enterprises are saved, and the production efficiency and the actual output of the enterprises are improved.

The invention also discloses a visualized data processing method, which is based on the visualized data processing system, and comprises the following steps:

S1, defining task execution units, and defining the execution sequence of the specific task execution units by a convenient operation such as drag and drop according to specific task requirements, to form a task execution rule;

s2, calling the corresponding task execution units according to the task execution rule, obtaining a task execution result and storing the task execution result.

S1 specifically comprises the following steps:

S13, defining the execution sequence of the cleaning task execution unit and the modeling task execution unit according to specific task requirements, forming the task execution rule and sending the task execution rule; data cleaning or data modeling tasks may be added here, and information such as the task name, classification, and task description may be specified.

S2 specifically comprises the following steps:

s21, receiving the task execution rule;

Further, S11 specifically includes the following steps:

s111, defining a data set object needing data cleaning, wherein the source of the data set object can be a file type database or a relational database or a message queue; each operator in the dataset object carries type and specific parameter information;

S112, defining a specific data cleaning process, wherein the cleaning process can comprise specific cleaning operations such as deduplication by a given column, filling a given field with its mean value, filling null values, and deleting columns, as well as the dependency relationships and execution sequence of the cleaning operators;

S113, cleaning the data of the data set object according to the data cleaning process; the cleaned result can optionally be aggregated or subjected to spatiotemporal collision according to actual needs before the data cleaning result is obtained, or the data cleaning result can be obtained and output directly; the data cleaning result described here can likewise be output to a file, a relational database, or a message queue.
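One way to derive an execution sequence from the dependency relationships between cleaning operators is a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the operator names, the `after` field, and the dictionary layout are assumptions for illustration and are not taken from the patent.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical operators: each carries a type, parameters, and dependencies.
operators = {
    "dedupe": {"type": "deduplicate", "params": {"column": "id"}, "after": []},
    "fill_temp": {"type": "fill_mean", "params": {"column": "temp"}, "after": ["dedupe"]},
    "drop_note": {"type": "delete_column", "params": {"column": "note"}, "after": ["dedupe"]},
    "export": {"type": "output", "params": {"target": "file"}, "after": ["fill_temp", "drop_note"]},
}

# Map each operator to the set of operators that must run before it.
graph = {name: set(op["after"]) for name, op in operators.items()}

# static_order() yields every operator after all of its predecessors
# and raises graphlib.CycleError if the dependencies contain a cycle.
execution_order = list(TopologicalSorter(graph).static_order())
```

Operators without a mutual dependency, such as `fill_temp` and `drop_note` here, may appear in either order, so only the relative order implied by the dependencies should be relied on.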

Further, S12 includes a model training sub-step and a model application sub-step, which are performed sequentially.

The model training substeps specifically include:

s121, selecting a training data set, wherein the training data set can be a file or a database table; it should be emphasized that the training data set must contain feature columns required for training, and optionally labeled tag columns;

S122, performing a data preprocessing operation on the training data set, wherein the data preprocessing operation is optional; when the operation is performed, the feature columns and the tag column to be used for training are selected (some algorithms do not need a tag column), and the modeling algorithm extracts feature values according to the feature columns;

s123, selecting an algorithm and setting parameters according to the preprocessed training data set to form a data processing model and storing the data processing model, wherein the parameters comprise the proportion of training and testing data sets, the iteration times, the depth of the tree, the classification number and regularization parameters.
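The parameters listed in S123 can be gathered into one validated record that is stored together with the model. This is a sketch under assumptions: the field names, default values, and validation bounds are illustrative, and "classification number" is read here as the number of target classes.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingParams:
    train_ratio: float = 0.8      # share of rows used for training; the rest is test data
    iterations: int = 100         # number of training iterations
    max_tree_depth: int = 5       # only meaningful for tree-based algorithms
    num_classes: int = 2          # number of target classes (assumed reading)
    regularization: float = 0.01  # regularization strength

    def __post_init__(self):
        # Reject values that no algorithm could use sensibly.
        if not 0.0 < self.train_ratio < 1.0:
            raise ValueError("train_ratio must be strictly between 0 and 1")
        if self.iterations < 1 or self.max_tree_depth < 1:
            raise ValueError("iterations and max_tree_depth must be positive")
        if self.num_classes < 2:
            raise ValueError("num_classes must be at least 2")
        if self.regularization < 0:
            raise ValueError("regularization must be non-negative")


params = TrainingParams(train_ratio=0.7, max_tree_depth=8)
stored = asdict(params)  # plain dict, convenient to save alongside the model
```

Validating at construction time means a visual front end can surface a bad value immediately, before any training starts.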

The model application substeps specifically include:

s124, selecting a task training set;

s125, performing data preprocessing operation on the task training set;

and S126, according to the preprocessed task training set, combining the data processing model to obtain a data modeling processing result and outputting the data modeling processing result.

Corresponding to the system proposal, the visualized data processing method provided by the invention efficiently realizes the cleaning and modeling work of big data, and has high automation degree and integration degree in the whole method flow. The method can fully meet the requirements of different enterprises or different projects of the same enterprise, and is wide in application range and strong in adaptability.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Finally, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. The specification is written this way for clarity only; those skilled in the art should treat the specification as a whole, and the technical solutions of the embodiments may be suitably combined to form other embodiments understandable to those skilled in the art.

Claims

1. A visualized data processing system for enabling processing of big data, comprising:

the foreground visualization operation portion specifically includes,

a plurality of cleaning task execution units, which are used for defining specific data cleaning task operations and storing the operations in a modularized form, the cleaning task execution units being mutually independent, and each cleaning task execution unit comprising:

the cleaning result deriving module is in signal connection with the cleaning process defining module and is used for carrying out data cleaning on the data set object according to the data cleaning process to obtain and output a data cleaning result;

the modeling task execution units are used for defining specific data modeling task operations and storing the operations in a modularized form, the modeling task execution units being mutually independent, and each modeling task execution unit comprising a model training subunit and a model application subunit;

the model training subunit comprises:

the training set selection module is used for selecting a training data set;

the model application subunit includes:

the data set selection module is used for selecting a task training set;

the modeling result deriving unit is in signal connection with the data set preprocessing module and is used for obtaining and outputting a data modeling processing result according to the preprocessed task training set and the data processing model;

the background data processing section specifically includes,

2. A visualized data processing method based on a visualized data processing system according to claim 1, comprising the steps of:

s1 specifically comprises the following steps,

s11, defining specific data cleaning task operation, storing the operation in a modularized form into a cleaning task execution unit, and ensuring that a plurality of cleaning task execution units are mutually independent, wherein the method comprises the following steps: s111, defining a data set object needing data cleaning, wherein the source of the data set object can be a file type database or a relational database or a message queue;

S113, cleaning the data of the data set object according to the data cleaning process, wherein the cleaned result can optionally be aggregated or subjected to spatiotemporal collision, to obtain and output a data cleaning result;

s12, defining specific data modeling task operation, storing the operation in a modeling task execution unit in a modularized form, ensuring that a plurality of modeling task execution units are mutually independent, comprising a model training sub-step and a model application sub-step which are sequentially carried out,

the model training substeps specifically include: s121, selecting a training data set, wherein the training data set can be a file or a database table, and the training data set must contain a feature column required by training;

s122, performing data preprocessing operation on the training data set;

the model application substeps specifically include: s124, selecting a task training set;

s125, performing data preprocessing operation on the task training set;

s126, according to the preprocessed task training set, combining the data processing model to obtain a data modeling processing result and outputting the data modeling processing result;

s2 specifically comprises the following steps,

s21, receiving the task execution rule;