CN115905292A

CN115905292A - Data management method and system based on big data

Info

Publication number: CN115905292A
Application number: CN202211623253.5A
Authority: CN
Inventors: 刘力铭; 兰海峰; 刘汉胤
Original assignee: Guangdong Donglian Xinchuang Information Technology Co ltd
Current assignee: Guangdong Donglian Xinchuang Information Technology Co ltd
Priority date: 2022-12-16
Filing date: 2022-12-16
Publication date: 2023-04-04

Abstract

The invention discloses a data governance method based on big data, comprising the following steps: receiving a data governance request sent by a client in response to a user trigger, the request corresponding to a data governance process; acquiring a preset data governance configuration interface and the data governance policy corresponding to the user according to the user identifier in the data governance request; displaying the data governance configuration interface through the client, selecting the corresponding data reading component and data processing component on the interface according to the data governance policy, and connecting them to complete the data governance process configuration; asking the user, through the client, whether to govern the data to be governed with the current data governance process configuration; and, upon receiving from the client the user's agreement to do so, governing the data to be governed according to the current data governance process configuration to complete the data governance.

Description

Data management method and system based on big data

Technical Field

The invention relates to the technical field of big data processing, and in particular to a data governance method and system based on big data.

Background

Data is the foundation and core of big data engineering, and the integrity, timeliness and quality of data are preconditions for achieving every objective. Supported by big data, economic and technological development is moving toward intelligence: by integrating data from many sources, the operating conditions of all fields of social production can be monitored, so that safety production management can be improved and optimized.

Through real-time acquisition, storage, analysis and comprehensive querying of big data, all industries can capture, discover and analyze information efficiently and mine valuable information economically from data of many types and large volume, which provides data support for the comprehensive management, scheduling, coordination and command of production and operations. However, because organizations, business systems and data platforms differ, many organizations manage their data in isolation: data is not shared, data is duplicated, information is not interconnected, data is unevenly distributed, and data platforms are utilized unevenly. From the viewpoint of data platform hardware, that is, of data content transparency, limited device capacity causes some data to be overwritten too quickly, so that critical data is discarded prematurely, while other data is retained unconditionally.

How to govern and provide big data effectively, so that the data provided satisfies users' access requests, is therefore a technical problem to be solved.

Disclosure of Invention

The invention aims to provide a data governance method and system based on big data that can effectively solve the above technical problems in the prior art.

In order to achieve the above object, an embodiment of the present invention provides a data governance method based on big data, including the steps of:

S1, receiving a data governance request sent by a client in response to a user trigger, the request corresponding to a data governance process; the data governance request comprises a user identifier and the data to be governed; the data governance process comprises data reading and data processing;

S2, acquiring a preset data governance configuration interface and the data governance policy corresponding to the user according to the user identifier in the data governance request;

S3, displaying the data governance configuration interface through the client, selecting the corresponding data reading component and data processing component on the interface according to the data governance policy, and connecting them to complete the data governance process configuration; the data reading component and the data processing component are displayed on the configuration interface; there are a plurality of data processing components; a data processing component acts as a target node when a data reading component or another data processing component is connected before it, and as a source node when another data processing component is connected after it; different data processing components require different parameters to be configured, and the output of a source node can be used as an input parameter of its target node;

S4, asking the user, through the client, whether to govern the data to be governed with the current data governance process configuration;

S5, upon receiving from the client the user's agreement to govern the data to be governed with the current data governance process configuration, governing the data to be governed according to that configuration to complete the data governance;

S6, upon receiving from the client the user's request to update the current data governance process configuration, displaying an update interface through the client so that the user can perform the configuration update operation, and, upon receiving from the client a notification that the update operation is complete, governing the data to be governed according to the updated data governance process configuration to complete the data governance; the configuration update operation comprises replacing data reading components and data processing components, and changing parameters and connecting lines;

S7, displaying the result of the completed data governance through the client.
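The node-and-connection model of step S3 can be pictured as a small directed graph. The following is a minimal Python sketch under stated assumptions, not the implementation described in the patent: the component functions (`read_rows`, `select_fields`, `top_n`), the `run_flow` helper and the dictionary layout of the configuration are hypothetical names chosen for illustration. Each component receives the outputs of its source nodes together with its own parameters, and nodes are executed so that every source node runs before its target node.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical components: (outputs of source nodes, own parameters) -> output.
def read_rows(inputs, params):
    return list(params["rows"])

def select_fields(inputs, params):
    (rows,) = inputs
    return [{k: r[k] for k in params["fields"]} for r in rows]

def top_n(inputs, params):
    (rows,) = inputs
    ordered = sorted(rows, key=lambda r: r[params["by"]], reverse=True)
    return ordered[: params["n"]]

def run_flow(nodes, connections):
    """nodes: name -> (component, params); connections: (source, target) pairs."""
    sources = {name: [] for name in nodes}
    for src, dst in connections:
        sources[dst].append(src)
    results = {}
    # static_order() yields every source node before the nodes that depend on it.
    for name in TopologicalSorter(sources).static_order():
        component, params = nodes[name]
        results[name] = component([results[s] for s in sources[name]], params)
    return results

nodes = {
    "read": (read_rows, {"rows": [
        {"id": 1, "score": 7, "note": "a"},
        {"id": 2, "score": 9, "note": "b"},
        {"id": 3, "score": 5, "note": "c"},
    ]}),
    "select": (select_fields, {"fields": ["id", "score"]}),
    "top": (top_n, {"by": "score", "n": 2}),
}
connections = [("read", "select"), ("select", "top")]
results = run_flow(nodes, connections)
```

Here `select` is the target node of `read` and the source node of `top`, which mirrors the dual role of a data processing component described in step S3.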

As an improvement of the above scheme, the data governance process further comprises data publishing, and a data publishing component corresponding to the data publishing is also displayed on the configuration interface; the data governance method based on big data further comprises the following step:

S8, receiving a data governance result publishing request sent by the client in response to a user trigger, and publishing the data governance result with the data publishing component selected by the user in that request.

As an improvement of the above scheme, in step S2, when a plurality of preset data governance policies corresponding to the user are obtained according to the user identifier in the data governance request, the policies are displayed through the client for the user to choose from, and the policy selected by the user is used as the final data governance policy.

As an improvement of the above scheme, the data governance request further comprises a tag of the data to be governed; in step S2, when a plurality of preset data governance policies corresponding to the user are obtained according to the user identifier in the data governance request, the best-matching policy is selected from them, based on the tag of the data to be governed, as the final data governance policy.
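The two ways of settling on a final policy can be sketched as follows. This is an illustrative Python sketch only: the text does not define how the "best match" is computed, so counting shared tags is an assumption, and the `select_policy` function, the `ask_user` callback and the policy dictionaries are hypothetical.

```python
def select_policy(policies, data_tags=None, ask_user=None):
    """Pick the final policy from the policies preset for one user."""
    if not policies:
        raise ValueError("no data governance policy is preset for this user")
    if len(policies) == 1:
        return policies[0]
    if data_tags:
        wanted = set(data_tags)
        # Assumed matching rule: the policy sharing the most tags wins;
        # on a tie, max() keeps the policy listed first.
        return max(policies, key=lambda p: len(wanted & set(p["tags"])))
    if ask_user is None:
        raise ValueError("several policies found and no way to choose")
    # Otherwise the candidates are shown to the user, who picks one.
    return ask_user(policies)

policies = [
    {"name": "vehicle_records", "tags": ["license_plate", "traffic"]},
    {"name": "person_records", "tags": ["id_card", "phone"]},
]
by_tag = select_policy(policies, data_tags=["phone", "id_card", "address"])
by_user = select_policy(policies, ask_user=lambda options: options[0])
```

In a real client the `ask_user` callback would display the candidate policies and return the user's choice; here it simply returns the first option.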

As an improvement of the above scheme, the data governance request further comprises data cleaning process information; the data processing component comprises a data cleaning component; the data governance method based on big data further comprises the following steps:

parsing the data cleaning process file to extract the workflow application models corresponding to it;

generating a corresponding data cleaning execution file according to the plurality of workflow application models;

and, while the data to be governed is being governed, cleaning it with the data cleaning component according to the data cleaning execution file.

As an improvement of the above scheme, generating the corresponding data cleaning execution file according to the plurality of workflow application models specifically comprises:

acquiring the data cleaning code corresponding to each of the workflow application models; the data cleaning code comprises SQL statements and the cleaning functions of invoked components;

and ordering the data cleaning codes according to their sequence in the data cleaning process file to form the data cleaning execution file.
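A minimal Python sketch of assembling and running such an execution file follows. The format of the data cleaning process file is not given in the text, so a JSON layout with `steps`, `order` and `model` keys is assumed, and the `CLEANING_CODE` registry, the model names and the `people` table are hypothetical. Only SQL statements are shown; cleaning functions of invoked components are left out for brevity.

```python
import json
import sqlite3

# Hypothetical registry: workflow application model -> data cleaning code (SQL).
CLEANING_CODE = {
    "trim_name": "UPDATE people SET name = TRIM(name)",
    "drop_null_phone": "DELETE FROM people WHERE phone IS NULL",
}

def build_execution_file(process_file_text):
    """Parse the process file and order the cleaning code by its sequence."""
    process = json.loads(process_file_text)
    steps = sorted(process["steps"], key=lambda s: s["order"])
    return [CLEANING_CODE[s["model"]] for s in steps]

def run_cleaning(conn, execution_file):
    """Execute the cleaning statements one after another."""
    for statement in execution_file:
        conn.execute(statement)
    conn.commit()

process_text = json.dumps({"steps": [
    {"order": 2, "model": "drop_null_phone"},
    {"order": 1, "model": "trim_name"},
]})
execution_file = build_execution_file(process_text)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, phone TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("  Ann ", "123"), ("Bob", None)])
run_cleaning(conn, execution_file)
rows = conn.execute("SELECT name, phone FROM people").fetchall()
```

Because each model maps to its own piece of cleaning code, the same model can appear in several process files, which is the reuse the scheme aims for.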

As an improvement of the above scheme, the data processing component comprises a redundant data processing component; the redundant data processing component comprises a redundant data judging unit and a redundant data removing unit; the judging unit judges whether the data to be governed contains redundant information; when redundant information is found, the removing unit removes it, and the resulting data is returned to the judging unit for further judgment, until no redundant information is found.
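The judge-then-remove cycle can be sketched as a simple loop. In this Python sketch the `remove_redundancy` driver is hypothetical, and the two units are stand-ins that treat exact duplicate rows as the redundant information; the text itself does not say what the removing unit deletes. The `max_rounds` guard is an added safety measure, not part of the described method.

```python
def remove_redundancy(data, judge, remove, max_rounds=1000):
    """Alternate judging and removing until the judge finds nothing."""
    for _ in range(max_rounds):
        if not judge(data):
            return data
        data = remove(data)
    raise RuntimeError("redundancy still present after max_rounds")

# Stand-in judging unit: redundancy means a row occurs more than once.
def has_duplicate(rows):
    return len(set(rows)) < len(rows)

# Stand-in removing unit: drop the first row that repeats an earlier one.
def drop_first_duplicate(rows):
    seen = set()
    for i, row in enumerate(rows):
        if row in seen:
            return rows[:i] + rows[i + 1:]
        seen.add(row)
    return rows

cleaned = remove_redundancy(
    [(1, "a"), (2, "b"), (1, "a"), (2, "b")],
    has_duplicate,
    drop_first_duplicate,
)
```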

As an improvement of the above solution, the redundant data judgment unit performs the following steps:

digitizing the data to be governed;

converting the digitized data into a matrix $A$:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

wherein $A$ is multi-dimensional data of size $m \times n$;

attempting to invert the matrix $A$: an identity matrix $E$ of the same order is placed to the right of the feature matrix to form the augmented matrix $A_x$:

$$A_x = (A \mid E)$$

judging whether $A_x$ can be converted, through matrix row and column transformations, into the matrix $A_y$:

$$A_y = (E \mid A^{-1})$$

if yes, judging that no redundant information exists; otherwise, judging that redundant information exists.

As an improvement of the above scheme, the data governance process further comprises data quality verification, and a data quality verification component corresponding to the data quality verification is also displayed on the configuration interface; before step S8, the data governance method based on big data further comprises the following step:

receiving a data quality verification request sent by the client in response to a user trigger, and verifying the quality of the data governance result with the data quality verification component selected by the user in that request.

The embodiment of the invention correspondingly provides a data governance system based on big data, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the data governance method based on the big data according to any one of the embodiments.

Compared with the prior art, the data governance method and system based on big data provided by the embodiments of the invention can govern and provide big data effectively, so that the data provided satisfies users' access requests. A preset data governance policy corresponding to the user is acquired based on the user identifier in the data governance request, and the corresponding data reading component and data processing component are selected on the data governance configuration interface according to that policy and connected to complete the data governance process configuration. The client then asks the user whether to govern the data to be governed with the current configuration. If the user's agreement is received from the client, the data to be governed is governed according to the current configuration. If instead a request from the user to update the current configuration is received from the client, an update interface is displayed through the client so that the user can perform the configuration update operation, and once the update operation is reported complete, the data to be governed is governed according to the updated configuration. The efficiency and effect of data governance can thereby be effectively improved.

Drawings

In order to illustrate the technical solution of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow diagram of a data governance method based on big data according to an embodiment of the present invention.

Fig. 2 is a network architecture diagram of a data governance method based on big data according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a data governance system based on big data according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

Referring to fig. 1, an embodiment of the present invention provides a data governance method based on big data, including the steps of:

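S1, receiving a data governance request sent by a client in response to a user trigger, the request corresponding to a data governance process; the data governance request comprises a user identifier and the data to be governed; the data governance process comprises data reading and data processing;

S2, acquiring a preset data governance configuration interface and the data governance policy corresponding to the user according to the user identifier in the data governance request;

S3, displaying the data governance configuration interface through the client, selecting the corresponding data reading component and data processing component on the interface according to the data governance policy, and connecting them to complete the data governance process configuration; the data reading component and the data processing component are displayed on the configuration interface; there are a plurality of data processing components; a data processing component acts as a target node when a data reading component or another data processing component is connected before it, and as a source node when another data processing component is connected after it; different data processing components require different parameters to be configured, and the output of a source node can be used as an input parameter of its target node;

S4, asking the user, through the client, whether to govern the data to be governed with the current data governance process configuration;

S5, upon receiving from the client the user's agreement to govern the data to be governed with the current data governance process configuration, governing the data to be governed according to that configuration to complete the data governance;

S6, upon receiving from the client the user's request to update the current data governance process configuration, displaying an update interface through the client so that the user can perform the configuration update operation, and, upon receiving from the client a notification that the update operation is complete, governing the data to be governed according to the updated data governance process configuration to complete the data governance; the configuration update operation comprises replacing data reading components and data processing components, and changing parameters and connecting lines;

S7, displaying the result of the completed data governance through the client.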

With reference to fig. 2, which is a network architecture diagram for implementing the data governance method based on big data according to an embodiment of the present invention, the network architecture includes a server 101 and a client 102 that are communicatively connected. The server 101 may be a server device, and the client 102 may be a computer, an intelligent terminal, or the like. The user triggers a data governance request at the client 102 to carry out the entire data governance flow.

The data to be governed is obtained through data source configuration. A data source, i.e. the origin of the data, is the device or original medium that provides the required data; given the correct data source name, the corresponding database connection can be found. A source database and a target database are configured through data sources: the tables of the source database are used for reading data, and the tables of the target database are used for writing data.

It can be understood that the data obtained from the source database is the raw data. Field information can be obtained by collecting the metadata of the raw data, and the fields can be used as parameters of the data processing components throughout the whole data governance process; that is, the data to be governed may include metadata. In addition, data reading provides multiple reading components for different types of data sources, supporting relational databases, HDFS, ES, Kafka, HBase, FTP reading, streaming reading and the like, and different data reading components require different parameters to be configured.
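The relationship between data source names, database connections and field metadata can be sketched with SQLite from the Python standard library. This is an illustration under stated assumptions: the `DATA_SOURCES` registry, the source names and the `person` table are hypothetical, and two in-memory SQLite databases stand in for whatever source and target databases a deployment would use.

```python
import sqlite3

# Hypothetical registry: data source name -> connection target.
DATA_SOURCES = {"source_db": ":memory:", "target_db": ":memory:"}

def open_data_source(name):
    """Look up the database connection by its data source name."""
    return sqlite3.connect(DATA_SOURCES[name])

def collect_fields(conn, table):
    """Collect field metadata; PRAGMA table_info yields
    (cid, name, type, notnull, dflt_value, pk) for each column."""
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]

source = open_data_source("source_db")   # tables here are read from
target = open_data_source("target_db")   # tables here are written to
source.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany("INSERT INTO person VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

# The collected fields drive both the target table layout and the copy.
fields = collect_fields(source, "person")
columns = ", ".join(f"{name} {ftype}" for name, ftype in fields)
target.execute(f"CREATE TABLE person ({columns})")
rows = source.execute("SELECT * FROM person").fetchall()
placeholders = ", ".join("?" for _ in fields)
target.executemany(f"INSERT INTO person VALUES ({placeholders})", rows)
copied = target.execute("SELECT * FROM person ORDER BY id").fetchall()
```

The table name is interpolated directly into the SQL text only to keep the sketch short; code handling untrusted names would need to validate them first.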

Thus, in the embodiment of the invention, a preset data governance policy corresponding to the user is acquired based on the user identifier in the data governance request, and the corresponding data reading component and data processing component are selected on the data governance configuration interface according to that policy and connected to complete the data governance process configuration. The client then asks the user whether to govern the data to be governed with the current configuration. If the user's agreement is received from the client, the data is governed according to the current configuration; if a request from the user to update the current configuration is received, an update interface is displayed through the client so that the user can perform the configuration update operation, and once the update operation is reported complete, the data is governed according to the updated configuration. In other words, the data governance policy preset for the user can be retrieved from the user identifier in the data governance request, and the data governance process configuration can be generated automatically from that policy, so that when the user does not need to change the existing configuration, governance can proceed directly, which effectively improves the efficiency of data governance. In addition, the current configuration can be updated at the user's real-time request, and the data to be governed is then governed according to the updated configuration, which effectively improves the quality and effect of data governance.
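The choice between governing with the current configuration and governing with an updated one can be sketched as below. Everything here is hypothetical and for illustration only: the `handle_decision` and `apply_update` helpers, the `"agree"` and `"update"` decision values, and the dictionary layout of the configuration and of the update are assumptions, and `govern` is a toy stand-in for the actual governance run.

```python
import copy

def apply_update(config, update):
    """Return a new configuration with the user's update applied."""
    new = copy.deepcopy(config)
    # Replace components, change parameters, change connecting lines.
    for name, component in update.get("replace_components", {}).items():
        new["nodes"][name]["component"] = component
    for name, params in update.get("change_params", {}).items():
        new["nodes"][name]["params"].update(params)
    if "connections" in update:
        new["connections"] = list(update["connections"])
    return new

def handle_decision(decision, config, data, govern, update=None):
    """Govern with the current configuration, or with the updated one."""
    if decision == "agree":
        return govern(data, config)
    if decision == "update":
        return govern(data, apply_update(config, update or {}))
    raise ValueError(f"unknown decision: {decision!r}")

# Toy governance run: keep the n largest values.
def govern(data, config):
    n = config["nodes"]["top"]["params"]["n"]
    return sorted(data, reverse=True)[:n]

config = {
    "nodes": {"top": {"component": "top_n", "params": {"n": 2}}},
    "connections": [("read", "top")],
}
agreed = handle_decision("agree", config, [3, 1, 2], govern)
updated = handle_decision("update", config, [3, 1, 2], govern,
                          update={"change_params": {"top": {"n": 1}}})
```

`apply_update` works on a deep copy, so the configuration derived from the preset policy stays intact for later requests.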

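Further, with continued reference to fig. 1, in this embodiment the data governance process further comprises data publishing, and a data publishing component corresponding to the data publishing is also displayed on the configuration interface; the data governance method based on big data further comprises the following step:

S8, receiving a data governance result publishing request sent by the client in response to a user trigger, and publishing the data governance result with the data publishing component selected by the user in that request.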

It can be understood that, after data processing is complete, the data is published and written into a database. Data publishing provides multiple publishing components for different types of data sources, supporting writing to relational databases, HDFS, ES, Kafka, HBase, FTP, Hive and the like, and different data publishing components require different parameters to be configured.

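Further, the data governance process also comprises data quality verification, and a data quality verification component corresponding to the data quality verification is also displayed on the configuration interface; before step S8, the data governance method based on big data further comprises the following step:

receiving a data quality verification request sent by the client in response to a user trigger, and verifying the quality of the data governance result with the data quality verification component selected by the user in that request.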

It can be understood that, before the data is published and written into the database, the data governance result needs to be checked against quality rules to determine whether the data is standardized. The check types supported by the data quality verification component include ID card number format, telephone number format, update timeliness, record integrity, data uniqueness, attribute integrity, primary key uniqueness, value range validity and the like.
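A few of these check types can be sketched in Python as follows. The concrete rules are assumptions, since the text names the check types but not their definitions: the ID card pattern assumes 17 digits followed by a digit or `X` (format only, no checksum), the telephone pattern assumes an 11-digit mobile number starting with 13 to 19, and the field names `id`, `id_card` and `phone` are hypothetical.

```python
import re
from collections import Counter

# Assumed formats, for illustration only.
ID_CARD = re.compile(r"\d{17}[\dXx]")
MOBILE = re.compile(r"1[3-9]\d{9}")

def check_quality(rows, key="id"):
    """Return (row index, issue) pairs for a list of record dictionaries."""
    issues = []
    key_counts = Counter(r.get(key) for r in rows)
    for i, r in enumerate(rows):
        if any(v in (None, "") for v in r.values()):
            issues.append((i, "incomplete record"))
        if not ID_CARD.fullmatch(r.get("id_card") or ""):
            issues.append((i, "bad ID card format"))
        if not MOBILE.fullmatch(r.get("phone") or ""):
            issues.append((i, "bad phone format"))
        if key_counts[r.get(key)] > 1:
            issues.append((i, "duplicate primary key"))
    return issues

rows = [
    {"id": 1, "id_card": "12345678901234567X", "phone": "13800000000"},
    {"id": 2, "id_card": "1234", "phone": "13800000000"},
    {"id": 2, "id_card": "123456789012345678", "phone": None},
]
issues = check_quality(rows)
```

The first record passes every check; the other two share a primary key, and each also fails at least one format or completeness rule.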

As another improvement of the above scheme, the data governance request further comprises a tag of the data to be governed; in step S2, when a plurality of preset data governance policies corresponding to the user are obtained according to the user identifier in the data governance request, the best-matching policy is selected from them, based on the tag of the data to be governed, as the final data governance policy.

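Further, in this embodiment, the data governance request further comprises data cleaning process information; the data processing component comprises a data cleaning component; the data governance method based on big data further comprises the following steps:

parsing the data cleaning process file to extract the workflow application models corresponding to it;

generating a corresponding data cleaning execution file according to the plurality of workflow application models;

and, while the data to be governed is being governed, cleaning it with the data cleaning component according to the data cleaning execution file.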

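As an improvement of the above scheme, generating the corresponding data cleaning execution file according to the plurality of workflow application models specifically comprises:

acquiring the data cleaning code corresponding to each of the workflow application models; the data cleaning code comprises SQL statements and the cleaning functions of invoked components;

and ordering the data cleaning codes according to their sequence in the data cleaning process file to form the data cleaning execution file.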

Thus, in this embodiment, the workflow application models that execute the specific cleaning tasks are determined from the data cleaning process file sent by the user, so that the execution engine can clean the data to be governed according to the workflow application model the user has set for each specific cleaning task. In addition, the user can freely combine workflow application models with different functions in the data cleaning process file, which makes the data cleaning process more flexible and extensible; and because each workflow application model can be reused, the reusability of data cleaning is improved.

Further, in this embodiment, the data processing component comprises a redundant data processing component; the redundant data processing component comprises a redundant data judging unit and a redundant data removing unit; the judging unit judges whether the data to be governed contains redundant information; when redundant information is found, the removing unit removes it, and the resulting data is returned to the judging unit for further judgment, until no redundant information is found.

Specifically, the redundant data determining unit performs the following steps:

(1) Digitizing the data to be governed;

(2) Converting the digitized data into a matrix $A$:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

wherein $A$ is multi-dimensional data of size $m \times n$;

(3) Attempting to invert the matrix $A$: an identity matrix $E$ of the same order is placed to the right of the feature matrix $A$ to form the augmented matrix $A_x$:

$$A_x = (A \mid E)$$

(4) Judging whether $A_x$ can be converted, through matrix row and column transformations, into the matrix $A_y$:

$$A_y = (E \mid A^{-1})$$

(5) If yes, judging that no redundant information exists; otherwise, judging that redundant information exists.

Specifically, assume that rows $x$ and $y$ of the feature matrix $A$ contain redundant data $a_{xj}$ and $a_{yj}$, $j = 1, 2, \dots, n$. Because redundant data carries no additional information, one of the two rows is worthless, that is, the elements $a_{xj}$ or $a_{yj}$ may be reduced to 0. In that case the feature-matrix part of the augmented matrix $A_x$ can never be transformed into an identity matrix, so $A_y$ cannot be obtained by transformation, and the presence of redundant data in the feature matrix is thereby detected. Whether $A$ contains redundant data is judged from $A_y$: if $A_y$ exists, the matrix $A$ contains no redundant data, and redundant data processing does not need to be performed on the data to be governed; if $A_y$ does not exist, the matrix $A$ contains redundant data, and redundant data processing needs to be performed on the data to be governed.

Thus, in this embodiment, the data to be governed is converted into matrix form, and whether it contains redundant information can be decided simply by determining whether the transformed matrix exists; the redundant information removal operation is executed only when redundant information exists. The operation is simple and clear, which effectively simplifies the processing flow and improves the efficiency of data governance.
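The judgment can be illustrated with a short Gauss-Jordan elimination in Python. This is a sketch under stated assumptions rather than the described implementation: it assumes the digitized data forms a square matrix (an inverse is only defined in that case), it uses row operations only, and the function names `try_invert` and `has_redundancy` are hypothetical. Exact fractions avoid floating-point rounding when testing for zero pivots.

```python
from fractions import Fraction

def try_invert(a):
    """Reduce [A | E] to [E | A^-1]; return A^-1, or None if that is impossible."""
    n = len(a)
    if any(len(row) != n for row in a):
        raise ValueError("this sketch assumes a square matrix")
    # Build the augmented matrix [A | E].
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            return None  # left block cannot become the identity matrix
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def has_redundancy(a):
    """Redundant information is judged to exist when no inverse can be found."""
    return try_invert(a) is None

redundant = has_redundancy([[1, 2], [2, 4]])   # second row is twice the first
clean = has_redundancy([[1, 2], [3, 4]])
```

In the first matrix the second row reduces to all zeros during elimination, which is the situation the paragraph above describes for redundant rows.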

It can be understood that, in this embodiment, the data processing component may further include a basic processing component, an extraction component, an association component, a comparison component, an identification component and the like. The basic processing component supports field selection, data distribution, aggregation, union and top-N sorting. The extraction component covers extraction of Chinese characters, mobile phone numbers, license plate numbers, ID card numbers, pictures and the like. The association component includes left association and association; the comparison component includes intersection, union and difference. The identification component includes preference marking, attribute marking, direct marking, attribute mapping marking and the like.
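Two of these component types can be sketched in a few lines of Python. The function names `compare` and `extract_mobile_numbers` are hypothetical, comparing records by a single key field is an assumption, and the mobile number pattern (11 digits starting with 13 to 19, not embedded in a longer digit run) is likewise assumed for illustration.

```python
import re

def compare(left, right, key):
    """Comparison component sketch: set operations on one key field."""
    l = {r[key] for r in left}
    r = {r[key] for r in right}
    return {"intersection": l & r, "union": l | r, "difference": l - r}

# Assumed mobile number format, for illustration only.
MOBILE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")

def extract_mobile_numbers(text):
    """Extraction component sketch: pull mobile numbers out of free text."""
    return MOBILE.findall(text)

left = [{"id": 1}, {"id": 2}, {"id": 3}]
right = [{"id": 2}, {"id": 3}, {"id": 4}]
compared = compare(left, right, "id")
numbers = extract_mobile_numbers("Call 13812345678 or 021-55550000, not 1381234567890.")
```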

As shown in fig. 3, an embodiment of the present invention correspondingly provides a big data-based data governance system, where the big data-based data governance system includes a processor 61, a memory 62, and a computer program stored in the memory 62 and configured to be executed by the processor 61, and when the processor 61 executes the computer program, the big data-based data governance method according to any one of the above embodiments is implemented.

It should be noted that fig. 3 only illustrates an example in which one memory and one processor in the device are connected, and in some specific embodiments, the device may further include a plurality of memories and/or a plurality of processors, and the specific number and the connection mode thereof may be set and adapted according to actual needs.

The invention also provides a computer-readable storage medium, which specifically includes a stored computer program, wherein when the computer program runs, a device where the computer-readable storage medium is located is controlled to execute the data governance method based on big data according to any one of the above embodiments.

It should be noted that all or part of the flow in the methods of the above embodiments of the present invention may also be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when it is executed by a processor, the steps of the above method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier signals, telecommunications signals, software distribution media, and the like. It should be further noted that the content of the computer-readable medium may be expanded or restricted as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions computer-readable media do not include electrical carrier signals or telecommunications signals.

While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A data governance method based on big data is characterized by comprising the following steps:

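S1, receiving a data governance request sent by a client in response to a user trigger, the request corresponding to a data governance process; the data governance request comprises a user identifier and the data to be governed; the data governance process comprises data reading and data processing;

S2, acquiring a preset data governance configuration interface and the data governance policy corresponding to the user according to the user identifier in the data governance request;

S3, displaying the data governance configuration interface through the client, selecting the corresponding data reading component and data processing component on the interface according to the data governance policy, and connecting them to complete the data governance process configuration; wherein the data reading component and the data processing component are displayed on the configuration interface; there are a plurality of data processing components; a data processing component acts as a target node when a data reading component or another data processing component is connected before it, and as a source node when another data processing component is connected after it; different data processing components require different parameters to be configured, and the output of a source node can be used as an input parameter of its target node;

S4, asking the user, through the client, whether to govern the data to be governed with the current data governance process configuration;

S5, upon receiving from the client the user's agreement to govern the data to be governed with the current data governance process configuration, governing the data to be governed according to that configuration to complete the data governance;

S6, upon receiving from the client the user's request to update the current data governance process configuration, displaying an update interface through the client so that the user can perform the configuration update operation, and, upon receiving from the client a notification that the update operation is complete, governing the data to be governed according to the updated data governance process configuration to complete the data governance; the configuration update operation comprises replacing data reading components and data processing components, and changing parameters and connecting lines;

S7, displaying the result of the completed data governance through the client.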

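2. The big data-based data governance method according to claim 1, wherein the data governance process further comprises data publishing, and a data publishing component corresponding to the data publishing is also displayed on the configuration interface; the data governance method based on big data further comprises the following step:

S8, receiving a data governance result publishing request sent by the client in response to a user trigger, and publishing the data governance result with the data publishing component selected by the user in that request.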

3. The big data-based data governance method according to claim 2, wherein in step S2, when a plurality of preset data governance policies corresponding to the user are obtained according to the user identifier in the data governance request, the policies are displayed through the client for the user to choose from, and the policy selected by the user is used as the final data governance policy.

4. The big data-based data governance method according to claim 2, wherein the data governance request further comprises a tag of the data to be governed; in step S2, when a plurality of preset data governance policies corresponding to the user are obtained according to the user identifier in the data governance request, the best-matching policy is selected from them, based on the tag of the data to be governed, as the final data governance policy.

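5. The big data-based data governance method according to claim 1, wherein the data governance request further comprises data cleaning process information; the data processing component comprises a data cleaning component; the data governance method based on big data further comprises the following steps:

parsing the data cleaning process file to extract the workflow application models corresponding to it;

generating a corresponding data cleaning execution file according to the plurality of workflow application models;

and, while the data to be governed is being governed, cleaning it with the data cleaning component according to the data cleaning execution file.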

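6. The big data-based data governance method according to claim 5, wherein generating the corresponding data cleaning execution file according to the plurality of workflow application models specifically comprises:

acquiring the data cleaning code corresponding to each of the workflow application models; the data cleaning code comprises SQL statements and the cleaning functions of invoked components;

and ordering the data cleaning codes according to their sequence in the data cleaning process file to form the data cleaning execution file.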

7. The big data-based data governance method according to claim 1, wherein the data processing component comprises a redundant data processing component; the redundant data processing component comprises a redundant data judging unit and a redundant data removing unit; the judging unit judges whether the data to be governed contains redundant information; when redundant information is found, the removing unit removes it, and the resulting data is returned to the judging unit for further judgment, until no redundant information is found.

8. The big data-based data governance method according to claim 7, wherein the redundant data judging unit performs the following steps:

digitizing the data to be governed;

converting the digitized data into a matrix $A$:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

wherein $A$ is multi-dimensional data of size $m \times n$;

attempting to invert the matrix $A$: an identity matrix $E$ of the same order is placed to the right of the feature matrix to form the augmented matrix $A_x$:

$$A_x = (A \mid E)$$

judging whether $A_x$ can be converted, through matrix row and column transformations, into the matrix $A_y$:

$$A_y = (E \mid A^{-1})$$

if so, judging that no redundant information exists; otherwise, judging that redundant information exists.

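9. The big data-based data governance method according to claim 2, wherein the data governance process further comprises data quality verification, and a data quality verification component corresponding to the data quality verification is also displayed on the configuration interface; before step S8, the data governance method based on big data further comprises the following step:

receiving a data quality verification request sent by the client in response to a user trigger, and verifying the quality of the data governance result with the data quality verification component selected by the user in that request.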

10. A big-data-based data governance system, comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor, when executing the program, implements the big-data-based data governance method of any one of claims 1 to 9.