CN116450908B - Self-service data analysis method and device based on data lake and electronic equipment - Google Patents

Self-service data analysis method and device based on data lake and electronic equipment

Info

Publication number
CN116450908B
Authority
CN
China
Prior art keywords
data
information
metadata
map
lineage
Prior art date
Legal status
Active
Application number
CN202310726336.5A
Other languages
Chinese (zh)
Other versions
CN116450908A (en)
Inventor
杨国利
韩宏伟
秦伟
李翔
刘坤
王强
Current Assignee
Beijing Big Data Advanced Technology Research Institute
Original Assignee
Beijing Big Data Advanced Technology Research Institute
Priority date
Filing date
Publication date
Application filed by Beijing Big Data Advanced Technology Research Institute
Priority to CN202310726336.5A
Publication of CN116450908A
Application granted
Publication of CN116450908B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G06F 16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F 16/26 Visual data mining; Browsing structured data
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 Graphs; Linked lists
    • G06F 16/904 Browsing; Visualisation therefor
    • G06F 16/906 Clustering; Classification
    • G06F 16/907 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a self-service data analysis method and device based on a data lake, and an electronic device. The method, applied in the technical field of data processing, comprises the following steps: managing metadata information of a data lake and creating a metadata graph database, wherein the metadata graph database stores the metadata information in a graph structure; partitioning and classifying the data of the data lake to generate a data asset map, wherein the data asset map graphically displays the distribution of the data assets and the relationships among the data assets; locating the data to be analyzed according to the metadata graph database and the data asset map; performing an ETL job on the data to be analyzed and collecting the SQL statement information generated during the job; generating a lineage map according to the SQL statement information; and generating an analysis result of the data to be analyzed according to the lineage map.

Description

Self-service data analysis method and device based on data lake and electronic equipment
Technical Field
The invention relates to the technical field of data processing, and in particular to a self-service data analysis method and device based on a data lake, and an electronic device.
Background
A data lake is a centralized repository that stores structured, semi-structured and unstructured data of arbitrary size from multiple sources and provides data services for various kinds of digital applications. However, in existing data lake technology the metadata model is simplistic, storage locations are scattered and supporting tools are lacking, so data cannot be retrieved quickly; the associated data analysis work depends heavily on IT involvement, and users can neither quickly grasp the distribution and profile of the data in the lake nor extract the corresponding data value through direct analysis.
It is therefore necessary to develop a self-service data analysis method, device and electronic device based on the data lake, so that data in the data lake can be located and analyzed quickly and accurately.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a self-service data analysis method, apparatus and electronic device based on a data lake, so as to overcome or at least partially solve the above problems.
A first aspect of the embodiments of the invention provides a self-service data analysis method based on a data lake, comprising the following steps:
managing metadata information of a data lake, and creating a metadata graph database, wherein the metadata graph database stores the metadata information in a graph structure;
partitioning and classifying the data of the data lake to generate a data asset map, wherein the data asset map graphically displays the distribution of the stored data assets and the relationships among the data assets;
locating data to be analyzed according to the metadata graph database and the data asset map;
performing an ETL job on the data to be analyzed, and collecting the SQL statement information generated during the ETL job;
generating a lineage map according to the SQL statement information;
and generating an analysis result of the data to be analyzed according to the lineage map.
A second aspect of the embodiments further provides a data analysis device, the device comprising:
a metadata graph database generation module, configured to manage metadata information of the data lake and create a metadata graph database, wherein the metadata graph database stores the metadata information in a graph structure;
a data asset map generation module, configured to partition and classify the data of the data lake to generate a data asset map, wherein the data asset map graphically displays the distribution of the data assets and the relationships among the data assets;
a locating module, configured to locate the data to be analyzed according to the metadata graph database and the data asset map;
a job module, configured to perform an ETL job on the data to be analyzed and obtain the SQL statement information generated during the ETL job;
a lineage map generation module, configured to generate a lineage map according to the SQL statement information;
and an analysis module, configured to generate the analysis result of the data to be analyzed according to the lineage map.
A third aspect of the embodiments further provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the data-lake-based self-service data analysis method according to any one of the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application further provides a computer-readable storage medium on which a computer program/instructions are stored, which, when executed by a processor, implement the steps of the data-lake-based self-service data analysis method according to any one of the first aspect of the embodiments of the present application.
A fifth aspect of the embodiments of the present application provides a computer program product comprising a computer program/instructions which, when executed by a processor, implement the steps of the data-lake-based self-service data analysis method according to any one of the first aspects.
The embodiments of the present application provide a self-service data analysis method and device based on a data lake, and an electronic device, the method comprising: managing metadata information of a data lake and creating a metadata graph database, wherein the metadata graph database stores the metadata information in a graph structure; partitioning and classifying the data of the data lake to generate a data asset map, wherein the data asset map graphically displays the distribution of the data assets and the relationships among the data assets; locating the data to be analyzed according to the metadata graph database and the data asset map; performing an ETL job on the data to be analyzed and collecting the SQL statement information generated during the ETL job; generating a lineage map according to the SQL statement information; and generating the analysis result of the data to be analyzed according to the lineage map. On the one hand, the embodiments of the application manage the metadata information of the data lake uniformly by creating the metadata graph database, and clarify the distribution and relationships of the data assets by generating the data asset map, thereby realizing fast retrieval and location of data based on the metadata graph database and the data asset map. On the other hand, by collecting the SQL statement information a data lineage map is generated, and automatic data analysis is carried out on the lineage map, realizing fast and accurate data analysis.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of the steps of a data-lake-based self-service data analysis method according to an embodiment of the present invention;
FIG. 2 is a flowchart of the steps for creating a metadata graph database according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a data asset map generation process according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a lineage-map-based data analysis process according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a self-service data analysis device according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings in the embodiments of the present invention. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
This embodiment provides a self-service data analysis method based on a data lake. Referring to FIG. 1, which shows a flowchart of the steps of the method, the method comprises:
Step S101, managing metadata information of a data lake and creating a metadata graph database, wherein the metadata graph database stores the metadata information in a graph structure.
Metadata, also known as intermediary data or relay data, is information that describes the attributes of data and supports functions such as indicating storage locations, recording history, locating resources and keeping file records. Metadata also serves as an electronic catalogue: to achieve cataloguing, the content and characteristics of the data must be described and collected, which in turn assists data retrieval. The metadata information in this embodiment denotes the information associated with metadata. It should be noted that the data lakes mentioned in this and all subsequent examples are data lakes built on the Delta Lake technology.
In the related art, metadata information in a Delta Lake data lake is managed mainly through a metadata model. However, the existing metadata model is simplistic and stores little information; metadata information is generally kept in log files, and such file-based storage is scattered, so retrieval requires traversing the files one by one, which is inefficient and rules out fast retrieval.
To solve these problems, the embodiment of the present application manages the metadata information of the data in the Delta Lake data lake: the scattered metadata information of the ingested business data is extracted, fused by a metadata model designed for the corresponding business process, encapsulated into metadata model objects with a consistent format, and stored uniformly in a metadata graph database. The metadata information is thereby stored in a graph structure and managed centrally, and the fast retrieval capability of the graph database is exploited to locate data quickly.
In one possible implementation, step S101 of managing metadata information of the data lake and creating a metadata graph database comprises:
Step S1011, designing a metadata model conforming to the business process.
The metadata model is a model that defines metadata attributes and their relationships. When the metadata model is designed, it must fit the current state of the business data. In particular, the data in a data lake is likely to originate from different data centers, hold multiple data sources and have different data structures. The data produced by different businesses differs considerably and must be processed by different metadata models. Therefore, in this embodiment, metadata models conforming to the corresponding business processes are designed according to the businesses involved in the data lake.
In a specific implementation, the metadata model is a model describing data; it determines the definition, attributes, relationships and constraints of the data. The metadata model is a base class, and the metadata information of a piece of data is an entity object of that model. In particular, metadata can be divided, by attribute, into at least three types: technical metadata, business metadata and operational metadata. Technical metadata describes concepts in the technical domain of the system, including the data structure, the data processing mode, feature descriptions and the data processing links, as well as database table names, column names, field lengths, field types, data storage locations and data lineage relationships. Business metadata describes the business meaning and business rules of the data, including business definitions, terms, business rules and business indicators. Operational metadata describes the operational attributes of the data, including the data owner, the data users, the data access mode, access times and access rights.
Step S1012, traversing the file directories of the business data ingested into the data lake, and parsing the log information in the file directories to obtain basic metadata information; the basic metadata information at least includes: data modification behavior information, schema information, and data storage location information.
In a specific implementation, the file directories of the business data already in the lake are traversed and their delta_log files (i.e., the log information in the file directories) are parsed. Specifically, the log information contains four kinds of records: protocol information (Protocol), metadata information (Metadata), commit information (CommitInfo) and transaction information (SetTransaction). CommitInfo records the modification behavior of the current data, including the creation time, modification time and operation mode; Metadata contains the logical structure information (schema information) of the current data, the schema information being the column information of the data. In this embodiment, the log information is parsed, and the modification behavior information, the schema information and the data storage location information are extracted from the parsed content to form the basic metadata information of the data.
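As an illustrative aid, not part of the patent text, the following Python sketch shows how the delta_log commit files of a Delta table could be traversed and parsed into basic metadata information; the table path, the result field names and the subset of log actions handled are assumptions.

```python
# Sketch of step S1012: walk a Delta table's _delta_log directory and
# collect basic metadata from its JSON commit files (open Delta Lake format).
import json
from pathlib import Path

def parse_delta_log(table_path: str) -> dict:
    basic = {
        "storage_location": table_path,   # data storage location information
        "modifications": [],              # data modification behavior (CommitInfo)
        "schema": None,                   # schema information (Metadata action)
    }
    for commit in sorted(Path(table_path, "_delta_log").glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "commitInfo" in action:    # creation/modification time, operation mode
                info = action["commitInfo"]
                basic["modifications"].append(
                    {"timestamp": info.get("timestamp"),
                     "operation": info.get("operation")})
            elif "metaData" in action:    # column information of the data
                basic["schema"] = json.loads(action["metaData"]["schemaString"])
    return basic

print(parse_delta_log("/data/lake/sales_orders"))  # hypothetical table path
```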
Step S1013, performing data fusion on the basic metadata information by using the designed metadata model to obtain a metadata model object.
In this embodiment, the metadata model obtained in step S1011 is used to fuse the basic metadata information of the data, so as to further complete the metadata information and generate the corresponding metadata model object. During data fusion, corresponding attribute information is added to the data, as is relationship information between data. In this embodiment, a database may be one meta-model object, and a column of data in the database may also be one meta-model object, so there can be various relationships between meta-model objects, including containment, parallelism and so on. These relationships between meta-model objects, i.e., the relationships between data, can likewise be derived by the metadata model from the basic metadata information. For example, a metadata model for relationship analysis can, based on the schema information in the basic metadata information, analyze and generate the metadata objects and the relationships between them. Thus, during data fusion the metadata model adds attribute information, relationship information and the like to the metadata information of the data, yielding a metadata object with more complete information.
In one possible implementation, step S1013 of performing data fusion on the basic metadata information to obtain a meta-model object comprises:
Step S1013a, using the metadata model to map the basic metadata information onto the corresponding attributes of the metadata model.
In a specific implementation, the metadata model is used to attach the relevant attribute tags to the metadata information of the data. The technical metadata, business metadata and operational metadata are each an attribute set; every attribute in each set takes the form of a key-value pair, and the metadata model maps each piece of basic metadata information onto the corresponding key, determines the corresponding value from that key, and thereby determines the attribute to which the basic metadata information corresponds.
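The key-value mapping of step S1013a can be pictured with the minimal sketch below; which key belongs to which attribute set is an assumption made purely for illustration.

```python
# Sketch of step S1013a: map basic metadata keys onto the three attribute
# sets of the metadata model. Set membership of each key is assumed here.
ATTRIBUTE_SETS = {
    "technical":   {"schema", "storage_location", "modifications", "table_name"},
    "business":    {"business_definition", "terms", "business_rules"},
    "operational": {"data_owner", "data_user", "access_rights", "access_time"},
}

def map_to_model(basic_metadata: dict) -> dict:
    model_object = {name: {} for name in ATTRIBUTE_SETS}
    for key, value in basic_metadata.items():        # each attribute is a key-value pair
        for set_name, keys in ATTRIBUTE_SETS.items():
            if key in keys:
                model_object[set_name][key] = value  # mapped onto the corresponding key
    return model_object
```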
Step S1013b, adding other metadata information to the basic metadata information according to information entered by the user.
It should be noted that some metadata cannot be extracted directly from the log information, such as the department that owns the data, the data manager and the data access rights; this metadata must be specified or confirmed manually. Such information, which cannot be extracted automatically, is entered by the user, converted into other metadata information, and added to the basic metadata information. In this way, the application further completes the basic metadata information by manually adding the other metadata information that cannot be extracted directly.
Step S1013c, encapsulating the basic metadata information into meta-model objects with a consistent format.
In a specific implementation, the metadata model adds the corresponding attributes to the extracted basic metadata information by mapping it onto the corresponding attributes of the model, the metadata information that cannot be extracted automatically is added manually, and finally the completed metadata is encapsulated into meta-model objects with a consistent format. A meta-model object contains all the attribute information of the metadata model; that is, one meta-model object represents the set of all attributes of one piece of data. A piece of data has multiple attributes, which can be obtained through metadata model analysis: each piece of basic metadata information contains multiple keys, so the multiple attributes corresponding to it can be found from those key values.
Step S1014, converting the meta-model objects into graph objects and storing them in the metadata graph database.
In this embodiment, each meta-model object is converted into a graph object, and the metadata graph database stores them in a graph structure. Specifically, the name of a meta-model object becomes a node in the graph, each attribute value of the meta-model object becomes a node, and the edge between them is labeled with the key corresponding to the attribute. Metadata information is thus stored and managed centrally in a graph structure, keeping both the attribute information of the metadata and the relationship information between metadata, which effectively solves the problem that multi-hop association queries are inefficient or unsupported. Moreover, since the metadata graph database is a graph database that stores metadata information, it inherits the fast retrieval capability graph databases generally provide, so data retrieval tasks can be run against the uniformly stored metadata graph database to achieve fast retrieval of data.
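A minimal sketch of this conversion follows, using plain tuples in place of a real graph store such as Neo4j; the node and edge encoding mirrors the description above, but its exact layout is an assumption.

```python
# Sketch of step S1014: convert a meta-model object into a graph object.
# The object name becomes one node, each attribute value another node, and
# the attribute key labels the edge between them.
def to_graph_object(object_name: str, model_object: dict):
    nodes, edges = {object_name}, []
    for attr_set in model_object.values():          # technical/business/operational
        for key, value in attr_set.items():
            value_node = f"{key}={value}"
            nodes.add(value_node)
            edges.append((object_name, key, value_node))  # edge labeled by the key
    return nodes, edges
```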
Referring to FIG. 2, which shows a flowchart of the steps for creating the metadata graph database: as shown in FIG. 2, the multi-source heterogeneous data at least includes integrated data, third-party data and system data, which enter the Delta Lake data lake through a data access interface or data access system. In parallel, the metadata model is designed according to step S1011 to obtain the business-process meta-model, i.e., a metadata model conforming to the corresponding business process. For the data already in the lake, the file directories are traversed according to step S1012 and the log information in delta_log is parsed to obtain the basic metadata information. Next, step S1013 is executed to fuse the basic metadata information using the designed metadata model and construct the metadata model objects. Finally, step S1014 is executed to store the meta-model objects in the graph database as graph objects, so that the metadata information is kept in a graph structure and the metadata graph database is obtained. Using the fast retrieval capability of the graph database, data can then be located quickly through the graph objects.
Step S102, partitioning and classifying the data of the data lake to generate a data asset map, wherein the data asset map graphically displays the distribution of the data assets and the relationships among the data assets.
In a specific implementation, when the multi-source heterogeneous data is ingested into the data lake through a data access interface or data access system, a data partition is recommended from the data type label the data carries, the partition granularity being recommended from the combination of data type and storage location. After the partition recommendation is completed, the data is aggregated; text information in unstructured data is classified by a classification model and then aggregated with the already classified text information. According to how the data has been partitioned and classified, the corresponding partition and classification labels are added to the metadata information, and the data asset map is generated from these labels. The classification attributes in the metadata information are the precondition for generating the classification labels in the asset map. Step S101 has already completed the metadata information by adding attribute information to it, so in step S102 the data is partitioned and classified according to the completed metadata information, in particular its attribute-related information, and the data asset map is generated from the resulting partition and classification information. The data asset map graphically displays the distribution of the data assets and the relationships among them, and supports keyword and combined-condition queries, enabling fast retrieval of data assets, helping data users see clearly how the data assets are stored, distributed and related, and making further data analysis easier.
In one possible implementation, step S102 of partitioning and classifying the data of the data lake to generate a data asset map comprises:
Step S1021, performing type recognition on the data of the data lake and determining the data type of each piece of data, the data types including: structured, semi-structured and unstructured.
In this embodiment, newly ingested data, or data about to be ingested, is multi-source heterogeneous data; when it is accessed into the data lake through a data access interface or data access system, its data type must first be recognized, dividing it into structured, semi-structured and unstructured data. The data sources can also be distinguished by database type. Specifically, structured data is data whose information can be obtained through fixed keys; it is typically represented and stored in a relational database, organized in rows, one row representing one entity, with every row having the same attributes, so its storage and arrangement are regular. Semi-structured data is data whose information is obtained through flexibly adjusted keys; its format is not fixed, and the information stored under the same key may be numeric, text, a dictionary or a list. Unstructured data has no fixed format; it covers office documents, texts, pictures, reports, images, audio and so on in any format, and is generally stored directly as a whole.
Step S1022, recommending a landing position in the data lake for each piece of data according to its data type, and adding a partition tag to the metadata information of each piece of data to obtain the landed data.
In a specific implementation, a landing position in the data lake is recommended for each piece of data according to its recognized data type, i.e., the data is partitioned; for example, data of the structured type is landed in a MySQL data area, and data of the unstructured type is landed in a text data area. After the data is landed, the corresponding partition tag is generated in the metadata information of each piece of data according to the landing area, i.e., according to the partition result, as sketched below.
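The landing-zone recommendation of step S1022 can be sketched as a simple lookup; the zone names beyond the MySQL and text areas named above are assumptions.

```python
# Sketch of step S1022: recommend a data lake zone from the identified data
# type and write the partition tag into the metadata. Zone names assumed.
ZONE_BY_TYPE = {
    "structured":      "mysql_zone",     # MySQL data area per the description
    "semi-structured": "document_zone",  # assumed zone for flexible formats
    "unstructured":    "text_zone",      # text data area per the description
}

def land_data(metadata: dict, data_type: str) -> dict:
    zone = ZONE_BY_TYPE.get(data_type, "raw_zone")
    metadata["partition_tag"] = zone     # partition tag added to the metadata
    return metadata
```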
Step S1023, classifying and aggregating the landed data to obtain classified data, and adding a classification label to the metadata information of the classified data.
Data aggregation (ETL) means loading data from different business systems into the data warehouse and storing it uniformly by category. For example, two files that belong to the same category may have the same meaning or content but different file names and fields; through data aggregation they can be placed in the same category and stored uniformly. In this embodiment, structured, semi-structured and unstructured data can each be classified according to the partition results, and the corresponding classification labels are added automatically to the metadata information of the data by the corresponding algorithms according to the classification results. The classification label indicates the category to which the data belongs.
In one possible implementation, in the case where the landed data is structured or semi-structured, step S1023 of classifying and aggregating the landed data to obtain classified data comprises:
Step S1023a, extracting, transforming and loading the structured or semi-structured data, and aggregating it into the structured data already stored by category, to obtain the classified data.
In a practical implementation, data aggregation (ETL) is divided into extraction (Extract), transformation (Transform) and loading (Load) of data: data is extracted from various data sources, transformed and arranged as necessary, and stored in the corresponding data lake position, so that the scattered, fragmented and non-uniform data of different businesses is integrated together and provides the analysis basis for subsequent decisions.
In the case where the landed data is unstructured, step S1023 of classifying and aggregating the landed data to obtain classified data comprises:
Step S1023b, classifying, with a text topic classification model, the data topics of the unstructured data whose text information carries no classification mark.
Step S1023c, classifying, with text topic classification rules, the data topics of the unstructured data whose text information carries a classification mark.
In a specific implementation, the landed unstructured data is mainly text data and is divided into data that carries a classification mark and data that does not. For unstructured data whose text information carries a classification mark, the text topic classification rules can be used to classify the data topic, the category indicated by the mark being determined through preset rules. For unstructured data without a classification mark in its text information, a text topic classification model can be used to classify the data topic; for example, a BERT model (a pre-trained language representation model) performs the classification and summarizes the category information of the text from the text itself. The classification mark is information carried in a fixed format within the text information that indicates the category of the text; for example, some texts have a fixed format containing an "information category" field, which can be extracted from the text information and used as the classification mark of that text, i.e., of that unstructured data.
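The split between rule-based and model-based topic classification can be sketched as follows; the "category:" marker and the keyword fallback are assumptions standing in for the preset rules and the BERT model described above.

```python
# Sketch of steps S1023b/S1023c: use rules when the text carries a
# fixed-format classification mark, otherwise fall back to a model.
import re

MARK_PATTERN = re.compile(r"category:\s*(\w+)", re.IGNORECASE)  # assumed marker format

def classify_text(text: str) -> str:
    mark = MARK_PATTERN.search(text)
    if mark:                            # classification mark present: apply the rule
        return mark.group(1).lower()
    # No mark: a real system would invoke a topic model such as BERT here;
    # a keyword lookup keeps this sketch self-contained.
    for keyword, topic in {"invoice": "finance", "patient": "medical"}.items():
        if keyword in text.lower():
            return topic
    return "uncategorized"

print(classify_text("category: Finance\nQ2 budget report"))  # -> finance
```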
Step S1023d, aggregating the classified unstructured data into the unstructured data already stored by category, to obtain the classified data.
In one possible implementation, before the classified data is aggregated with the data already stored by category, the data to be aggregated is itself deduplicated. Specifically, duplicate data is removed; before the deduplicated data is aggregated, it is checked again against the stored classified data for overlaps, the non-overlapping data is aggregated, and the overlapping data is discarded.
Step S1024, generating the data asset map according to the partition tags and the classification labels.
Referring to FIG. 3, which shows a schematic diagram of the data asset map generation process: as shown in FIG. 3, when multi-source heterogeneous data (integrated data, third-party data, system data, etc.) enters the data lake through a data access interface or system, step S1021 is executed to recognize and partition the data, determining whether each piece of data is structured, semi-structured or unstructured. According to the determined data type, step S1022 is executed to recommend the landing position in the data lake and add a partition tag to the metadata information of each piece of data. After the partitioning is completed, step S1023 is executed to aggregate the data by data type: ETL is executed on the structured and semi-structured data to complete the aggregation of classification information, while the unstructured data (i.e., text data) is classified and the unstructured, classified text information is aggregated according to the classification results. A classification label is then added to the metadata information of each piece of data according to the aggregation result. Finally, step S1024 is executed to generate the data asset map from the partition tags and classification labels. In this embodiment, the data is partitioned and aggregated to achieve classified storage, and the corresponding partition tags and classification labels are added to the metadata information according to the classification results, so that the data asset map generated from them helps data users see clearly how the data assets are distributed and related, and makes further data analysis easier.
Step S103, locating the data to be analyzed according to the metadata graph database and the data asset map.
When a specific piece of data needs to be analyzed, it must first be retrieved and located in the data lake. In this embodiment, data queries and location use the metadata graph database and the data asset map generated in advance. Specifically, the metadata graph database can return all metadata information related to a keyword, while the combined keyword query of the data asset map can return more precise data. In addition, the data asset map provides visualization, so the classified partitions of the queried data are displayed as a map during a query. For example, if the data asset map is divided into class A, class B and class C regions, which represent different data categories, then for the queried keyword "object a" the distribution of the data across regions can be read directly from the visual interface: 2 pieces of related data in the class A region, none in the class B region, and 1 piece in the class C region. By combining the location results from the metadata graph database and the data asset map, data can be located quickly and accurately.
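A minimal sketch of the combined lookup follows; the in-memory edge list and region map are assumptions standing in for the real metadata graph database and data asset map.

```python
# Sketch of step S103: combine a keyword query over metadata graph edges
# (source, key, value) with the asset map's partition/classification regions.
def locate_data(keyword: str, graph_edges: list, asset_map: dict) -> dict:
    # 1) metadata graph: collect every object linked to a value node matching the keyword
    hits = {src for src, _key, value in graph_edges if keyword in value}
    # 2) asset map: group the hits by the region (partition/class) they fall in
    located = {}
    for region, objects in asset_map.items():
        matches = hits.intersection(objects)
        if matches:
            located[region] = sorted(matches)
    return located

edges = [("object_a_sales", "table_name", "name=object_a"),
         ("object_a_log", "table_name", "name=object_a")]
regions = {"class_A_region": {"object_a_sales"}, "class_C_region": {"object_a_log"}}
print(locate_data("object_a", edges, regions))
# -> {'class_A_region': ['object_a_sales'], 'class_C_region': ['object_a_log']}
```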
Step S104, performing an ETL job on the data to be analyzed, and collecting the Structured Query Language (SQL) statement information generated during the ETL job.
In this embodiment, the ETL job denotes the process in which the corresponding server executes the ETL job on the corresponding data; this embodiment does not limit the specific type or method of the data job.
Step S105, generating a lineage map according to the SQL statement information.
The SQL statement information represents the complete record of the ETL job performed on the data, including the operation applied to the data at each step, the location of the data, and so on. From the SQL statement information a lineage map can be generated. The lineage map describes the lineage relationships of the data, presenting the provenance of the data in graph form; it mainly covers the source of the data, the way the data is processed, the mapping relationships and the data outlet, i.e., in which table of which database the data is stored, what the corresponding fields are, the attributes of those fields, the system the data belongs to, and the applications involving the data. Lineage relationship information between data is part of the metadata information, and a clear data lineage is the basis for keeping a data platform stable; it facilitates analyzing the impact of data changes and investigating data problems.
Step S105 of generating a lineage map according to the SQL statement information comprises:
Step S1051, sending the SQL statement information in the ETL job to a message queue through an application programming interface (API), wherein the SQL statement information is the SQL statement information of ETL jobs executed by one or more data processing engines.
In a specific implementation, when the various analysis platforms run ETL SQL jobs with their respective data processing engines, the generated SQL statement information is sent uniformly through an API interface to a message queue. For the data to be analyzed obtained in step S103, a single user may process the data with a single data analysis engine, yielding the SQL statement information of that engine, or multiple users may process the data simultaneously with different data analysis engines, yielding the SQL statement information of several different engines.
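A sketch of step S1051 under the assumption that the message queue is Kafka and the kafka-python client is used; the broker address, topic name and message fields are all assumptions, and a broker must be running for the send to succeed.

```python
# Sketch of step S1051: every data processing engine reports its ETL SQL
# through the same API so the statements land in one queue in one format.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # assumed broker
    value_serializer=lambda msg: json.dumps(msg).encode("utf-8"),
)

def report_etl_sql(engine: str, job_id: str, sql: str) -> None:
    producer.send("etl-sql-statements", {                     # assumed topic
        "engine": engine,   # which data processing engine ran the job
        "job_id": job_id,
        "sql": sql,         # the ETL SQL statement itself
    })

report_etl_sql("spark", "job-0421",
               "INSERT INTO dw.t_user SELECT * FROM ods.t_user")
```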
Step S1052, listening to the SQL statement information in the message queue, consuming it, and constructing a unified SQL grammar format.
Because the SQL statement information can come from different data analysis engines, this embodiment eliminates the differences by constructing a unified SQL grammar format and processing the SQL statement information uniformly, thereby shielding the differences between the data analysis platforms, avoiding the problem of adapting to different data analysis tools, and reducing the adaptation work between the data lineage collection tools and the data analysis engines.
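The unified SQL grammar format can be approximated with the open-source sqlglot transpiler, as sketched below; the choice of Spark as the target dialect is an assumption.

```python
# Sketch of step S1052: statements written for different engines are
# rewritten into one target dialect before lineage parsing.
import sqlglot

def normalize_sql(sql: str, source_dialect: str) -> str:
    # transpile() returns one normalized string per input statement
    return sqlglot.transpile(sql, read=source_dialect, write="spark")[0]

print(normalize_sql("SELECT TOP 3 name FROM users", "tsql"))
# -> SELECT name FROM users LIMIT 3
```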
Step S1053, performing syntax tree analysis on the consumed SQL statement information to obtain the data association relationships, wherein a data association relationship represents the pointing information between a source table and a destination table and between their fields.
In a specific implementation, the syntax tree analysis can be performed with an existing program; this embodiment does not limit the method by which the data association relationships are obtained. Taking a complete data processing flow as an example, each flow is marked with a unique identifier and each link in the flow records its upstream and downstream dependencies; after analyzing the logic of each link, the program can generate the data association relationships of the complete flow from the dependencies and the flow. The source table is the data table of the data source, the destination table is the data table the data flows to, and the data association relationship contains the pointing information between the source and destination tables and between their fields.
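A sketch of step S1053, again assuming sqlglot: parsing the syntax tree of an INSERT ... SELECT yields the destination table and its source tables. Table names are illustrative only.

```python
# Sketch of step S1053: derive source -> destination pointing information
# from the parsed syntax tree of one ETL SQL statement.
import sqlglot
from sqlglot import exp

def extract_association(sql: str) -> dict:
    tree = sqlglot.parse_one(sql)
    insert = tree.find(exp.Insert)
    target = insert.this
    if isinstance(target, exp.Schema):   # INSERT INTO t (col, ...) form
        target = target.this
    sources = sorted({t.name for t in insert.expression.find_all(exp.Table)})
    return {"destination": target.name, "sources": sources}

print(extract_association(
    "INSERT INTO dw.user_wide SELECT u.id, o.amount "
    "FROM ods.users AS u JOIN ods.orders AS o ON u.id = o.uid"))
# -> {'destination': 'user_wide', 'sources': ['orders', 'users']}
```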
Step S1054, converting the data association relationships into data lineage relationships according to the process nodes, outflow nodes and inflow nodes obtained from the SQL statement information; a data lineage relationship represents the pointing relationships between the source table, a process node and the destination table, and between the fields and the process node.
In a specific implementation, the ETL SQL job is encapsulated as a process node object, and the process nodes, outflow nodes and inflow nodes obtained from the SQL statement information are added into the data association relationships to generate and store the data lineage relationships. That is, the original pointing relationship from source table to destination table is converted into pointing relationships from source table to process node and from process node to destination table, and the original field-to-field pointing relationships are converted into field-to-process-node and process-node-to-field relationships, clearly showing how the data flows from the source table to the destination table through a series of ETL SQL jobs. Each process node represents one processing of the data; for example, one extraction of the data generates a corresponding extraction process node.
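Step S1054 can then be sketched as wrapping the job into a process node and rewriting the direct pointers; the node naming convention is an assumption.

```python
# Sketch of step S1054: rewrite source -> destination pointers into
# source -> process -> destination pointers around one ETL SQL job.
def add_process_node(association: dict, job_id: str) -> dict:
    process = f"process:{job_id}"        # one processing of the data
    edges = [(src, process) for src in association["sources"]]
    edges.append((process, association["destination"]))
    return {"process_node": process, "edges": edges}

print(add_process_node({"destination": "user_wide",
                        "sources": ["orders", "users"]}, "job-0421"))
# edges: orders -> process:job-0421, users -> process:job-0421,
#        process:job-0421 -> user_wide
```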
Step S1055, generating the lineage map according to the data lineage relationships.
In this embodiment, when the data in the lake is cleansed and integrated with different data analysis tools, ETL jobs are performed on the data to be analyzed, the SQL statement information generated during the ETL process is sent to the message queue through the unified API interface, the SQL statement information of ETL jobs from different data analysis engines is collected and processed uniformly, the data association relationships between the data tables are obtained through syntax tree analysis and then summarized, the data lineage relationships are generated from the summarized association relationships, and the data lineage map is generated from the data lineage relationships, so that the next step of value analysis can be performed on the lineage map.
In the related art, different data analysis platforms depend on their corresponding data analysis tools when performing data analysis, and for each set of tools a corresponding set of adapters has to be developed to obtain the process information of the ETL jobs. To improve the efficiency of data analysis, the data lineage collection method in this embodiment provides one consistent collection form: information is sent through a unified API interface to a message queue and analyzed uniformly, shielding the differences between the data analysis platforms, avoiding the problem of adapting to different data analysis tools, and reducing the adaptation work between the data lineage collection tools and the data analysis engines.
In one possible implementation, step S1055 of generating the lineage map according to the data lineage relationships comprises:
Step S1055a, taking the data to be analyzed as the main node of the lineage map.
Specifically, the main node is the core of the lineage map; in general a lineage map has only one main node, and the whole map presents the data lineage relationships of that node, so in this embodiment the data to be analyzed is taken as the main node of the lineage map.
Step S1055b, taking the data sources of the data to be analyzed as the data inflow nodes of the lineage map, each data inflow node being a parent node of the main node.
Specifically, the data to be analyzed may have multiple data sources, in which case the main node has multiple parent nodes; and when the data sources form a multi-level structure, the parent nodes of the main node may themselves have parents across multiple levels.
Step S1055c, taking the data destinations of the data to be analyzed as the data outflow nodes of the lineage map, each data outflow node being a child node of the main node; the data outflow nodes include terminal nodes, and once data reaches a terminal node its flow stops.
Specifically, a data outflow node marks where the data of the main node goes. When the data to be analyzed flows out in multiple directions, there are multiple data outflow nodes, and when the outflow has to pass through several nodes, the corresponding outflow nodes can be child nodes across multiple levels, so the direction and path of the outflow are expressed by the hierarchy. Among the data outflow nodes there is a special terminal node: after the data reaches it, the data flows nowhere else.
Step S1055d, marking the data flow path, where the flow path represents the route along which data converges from the data inflow nodes to the main node and then spreads from the main node to the data outflow nodes.
Step S1055e, taking the ETL job steps of the data to be analyzed as the process nodes of the lineage map. Specifically, the processing mode and processing rules of the data (the ETL SQL jobs) are marked as process nodes, so several process nodes sit on the data flow route in the order the jobs actually ran. A code sketch of the assembled structure follows.
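A minimal sketch of the assembled lineage map, under the assumption that a flat dict is an adequate stand-in for the rendered graph; the field names are illustrative.

```python
# Sketch of steps S1055a-S1055e: one main node, inflow (parent) nodes,
# outflow (child) nodes including terminal nodes, process nodes on the
# flow path, and the marked flow path itself.
def build_lineage_map(main, inflows, outflows, process_nodes, terminals=()):
    return {
        "main_node": main,                     # the data to be analyzed
        "inflow_nodes": list(inflows),         # data sources (parents)
        "outflow_nodes": list(outflows),       # data destinations (children)
        "terminal_nodes": list(terminals),     # flow stops here
        "process_nodes": list(process_nodes),  # ETL job steps, in job order
        "flow_path": list(inflows) + [main] + list(outflows),
    }

lineage = build_lineage_map("user_wide",
                            inflows=["users", "orders"],
                            outflows=["report_daily", "bi_dashboard"],
                            process_nodes=["process:job-0421"],
                            terminals=["bi_dashboard"])
```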
Step S106, generating the analysis result of the data to be analyzed according to the lineage map.
In one possible implementation, generating the analysis result of the data to be analyzed according to the lineage map includes:
performing data tracing according to the flow paths in the lineage map: specifically, by following the direction of a flow path, the nodes representing the data sources can be identified, giving the location information of the data sources. Judging the data value of the data to be analyzed according to the number of data outflow nodes in the lineage map: specifically, the more data outflow nodes there are, the greater the data value of the data to be analyzed, and the fewer there are, the smaller the data value. Judging the data magnitude of the data to be analyzed according to the line thickness of a flow path: specifically, the thicker the line, the more data flows along the path and the larger the data magnitude; the thinner the line, the less data flows and the smaller the magnitude. Judging the data update frequency of the data to be analyzed according to the line length of a flow path: specifically, the longer the line, the more process nodes are involved and the higher the update frequency; the shorter the line, the fewer process nodes and the lower the update frequency. These judgments are sketched in code after this paragraph.
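The judgments above can be sketched as simple heuristics over the lineage structure; the numeric thresholds and the line_weight/path_length inputs are assumptions standing in for the visual line thickness and length of the rendered map.

```python
# Sketch of step S106: read tracing, value, magnitude and update-frequency
# judgments directly off the lineage map. Thresholds are assumed.
def analyze_lineage(lineage: dict, line_weight: int, path_length: int) -> dict:
    return {
        "trace_to_sources": lineage["inflow_nodes"],  # data tracing
        "data_value": "high" if len(lineage["outflow_nodes"]) > 3 else "low",
        "data_magnitude": "large" if line_weight > 10 else "small",
        "update_frequency": "high" if path_length > 5 else "low",
    }

lineage = {"inflow_nodes": ["users", "orders"],
           "outflow_nodes": ["report_daily", "bi_dashboard"]}
print(analyze_lineage(lineage, line_weight=12, path_length=4))
```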
Referring to FIG. 4, which shows a schematic diagram of the lineage-map-based data analysis process: as shown in FIG. 4, after the data to be analyzed is located, ETL SQL jobs are performed on it, and the generated SQL statement information, produced by different data processing engines, is sent through the API to the message queue. The SQL statement information in the message queue is then listened to and consumed, and a unified SQL job format is constructed. The consumed SQL statement information is parsed, process nodes are created from the parsed information, and the data association relationships of the data to be analyzed are established. Then, through data summarization and arrangement, the process nodes are added into the association relationships, the data lineage relationships are stored, and the corresponding lineage map is generated from them, realizing the visualization of the data lineage relationships. Finally, data tracing and data value analysis can be carried out on the resulting lineage map. Data analysis work is accompanied by the generation of data lineage: the data to be analyzed can be located quickly from the metadata graph database and the data asset map, the lineage map is generated by collecting the data lineage relationships produced during analysis, and tracing and value analysis are performed by analyzing the lineage map. This solves the problems that the Delta Lake data lake meta-model is simplistic, that the scattered storage of data and metadata information cannot be surveyed, and that fast retrieval is impossible; the created data asset map solves the problem that aggregated data cannot be classified automatically and data asset statistics therefore cannot be produced automatically. By sending the SQL statement information through a unified API interface to the message queue and processing it uniformly before generating the lineage map, this embodiment also solves the problem that multiple data processing engines cannot be handled in a unified form and a complete and accurate data lineage cannot be obtained. Finally, analyzing the data to be analyzed on the generated lineage map yields analysis results quickly and accurately, overcoming the inability to assess data value intuitively for lack of a visualization tool.
In the related art, a data lake is a centralized repository that allows structured, semi-structured and unstructured data of any scale, from many sources, to be stored. The Delta Lake technology can help enterprises construct data lakes quickly, but because its metadata model is simplistic, its storage locations are scattered and supporting tools are lacking, users cannot quickly grasp the distribution and profile of the data in the lake, cannot retrieve data quickly, and cannot quickly assess data value; data analysis work remains highly dependent on IT involvement, so self-service use of the data lake is impossible.
In view of these problems, the embodiments of the present application provide a self-service data analysis method based on a Delta Lake data lake. The method lets a data user quickly learn what data the data lake holds, where it is and who is responsible for it, what the data is worth and means, who uses it and for what business, and so on, so that the data user can find and use the desired data set without depending on an IT department, realizing self-service use of the data.
A second aspect of the present application provides a self-service data analysis device. Referring to FIG. 5, which shows a schematic structural diagram of the device, the device comprises:
a metadata graph database generation module, configured to manage metadata information of the data lake and create a metadata graph database, wherein the metadata graph database stores the metadata information in a graph structure;
a data asset map generation module, configured to partition and classify the data of the data lake to generate a data asset map, wherein the data asset map graphically displays the distribution of the data assets and the relationships among the data assets;
a locating module, configured to locate the data to be analyzed according to the metadata graph database and the data asset map;
a job module, configured to perform ETL jobs on the data to be analyzed and obtain the SQL statement information;
a lineage map generation module, configured to generate a lineage map according to the SQL statement information;
and an analysis module, configured to generate the analysis result of the data to be analyzed according to the lineage map.
In one possible implementation, the metadata graph database generation module comprises:
a metadata model design submodule, configured to design a metadata model conforming to the business process;
a basic metadata information acquisition submodule, configured to traverse the file directories of the business data ingested into the data lake and parse the log information in the file directories to obtain the basic metadata information, the basic metadata information at least including: data modification behavior information, schema information and data storage location information;
a data fusion submodule, configured to perform data fusion on the basic metadata information with the designed metadata model to obtain the metadata model objects;
and a storage submodule, configured to convert the meta-model objects into graph objects and store them in the metadata graph database.
In one possible implementation, the data fusion sub-module includes:
the attribute mapping unit is used for mapping the basic metadata information to corresponding attributes of the metadata model by utilizing the metadata model;
the adding unit is used for adding other metadata information to the basic metadata information according to the information input by the user;
and the packaging unit is used for packaging the basic metadata information into a meta-model object with a consistent format, as illustrated by the sketch following this list.
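A minimal sketch of the fusion and storage steps is given below, assuming a Python dataclass as the meta-model object and the networkx library as a stand-in for a real graph database; the field names (schema, storage_path, modifications, owner) are illustrative assumptions, not the embodiment's meta-model.

```python
from dataclasses import dataclass, field

import networkx as nx  # stand-in here for a real graph database

@dataclass
class TableMetadata:
    """Hypothetical meta-model object with a consistent format."""
    name: str
    schema: dict                # schema information parsed from the logs
    storage_path: str           # data storage location information
    modifications: list = field(default_factory=list)  # data modification behavior
    owner: str = ""             # other metadata information added from user input

def store_as_graph(graph, meta):
    """Convert the meta-model object into graph nodes and edges."""
    graph.add_node(meta.name, kind="table", path=meta.storage_path, owner=meta.owner)
    for column, dtype in meta.schema.items():
        col_id = f"{meta.name}.{column}"
        graph.add_node(col_id, kind="column", dtype=dtype)
        graph.add_edge(meta.name, col_id, relation="has_column")

g = nx.DiGraph()
store_as_graph(g, TableMetadata("orders", {"id": "bigint", "amount": "double"},
                                "s3://lake/ods/orders", owner="data-team"))
print(g.number_of_nodes(), g.number_of_edges())  # 3 2
```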
In one possible implementation, the data asset map generation module includes:
the type identification sub-module is used for performing type identification on the data of the data lake and determining the data type of each piece of data in the data lake, wherein the data types include: structured data types, semi-structured data types, and unstructured data types;
the placement sub-module is used for recommending a landing position for each piece of data in the data lake according to its data type, and adding a partition tag to the metadata information of each piece of data to obtain the landed data;
the classification convergence sub-module is used for classifying and converging the landed data to obtain classified data, and adding a classification tag to the metadata information of the classified data;
and the generation sub-module is used for generating the data asset map according to the partition tags and the classification tags.
In one possible implementation, the classification convergence sub-module comprises:
the first data aggregation unit is used for extracting, transforming, and loading the structured or semi-structured data, and aggregating it into structured data stored by category to obtain the classified data;
a second data aggregation unit comprising:
the model classification subunit, used for classifying the data topics of unstructured data whose text information carries no classification mark, using a text topic classification model;
the rule classification subunit, used for classifying the data topics of unstructured data whose text information carries a classification mark, using text topic classification rules;
and the aggregation subunit, used for aggregating the classified unstructured data into unstructured data stored by category to obtain the classified data, as illustrated by the sketch following this list.
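The rule-versus-model routing can be sketched as follows; the TOPIC_RULES table and the trivial classify_by_model function are assumptions made for illustration, the latter standing in for a trained text topic classification model.

```python
from typing import Optional

# Hypothetical classification marks mapped to data topics.
TOPIC_RULES = {
    "invoice": "financial_report",
    "contract": "legal_document",
}

def classify_by_rule(text: str) -> Optional[str]:
    """Text topic classification rules, for text carrying a classification mark."""
    for mark, topic in TOPIC_RULES.items():
        if mark in text.lower():
            return topic
    return None

def classify_by_model(text: str) -> str:
    """Trivial stand-in for a trained text topic classification model."""
    return "general" if len(text.split()) < 50 else "long_form"

def route(text: str) -> str:
    # Rules handle marked text; the model handles unmarked unstructured data.
    return classify_by_rule(text) or classify_by_model(text)

print(route("Quarterly invoice summary"))  # financial_report
print(route("Meeting notes"))              # general
```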
In one possible embodiment, the lineage map generation module includes:
the information sending sub-module, used for sending the SQL statement information in the ETL jobs to a message queue through an API, wherein the SQL statement information is the SQL statement information in ETL jobs executed by one or more data processing engines;
the monitoring sub-module, used for monitoring the SQL statement information in the message queue, consuming the SQL statement information in the message queue, and constructing a unified SQL job format;
the parsing sub-module, used for performing syntax tree parsing on the consumed SQL statement information to obtain a data association relationship, wherein the data association relationship represents the pointing information between a source table and a destination table and between their fields;
the packaging sub-module, used for converting the data association relationship into a data lineage relationship according to the process nodes, outflow nodes, and inflow nodes obtained from the SQL statement information; the data lineage relationship includes the process nodes and represents the pointing relationship between the source table, the process nodes, and the destination table, and the pointing relationship between the fields and the process nodes;
and the lineage map generation sub-module, used for generating the lineage map according to the data lineage relationship.
In one possible embodiment, the lineage map generation sub-module includes:
the master node determining unit, used for taking the data to be analyzed as the master node of the lineage map;
the data inflow node determining unit, used for taking the data sources of the data to be analyzed as data inflow nodes of the lineage map, wherein a data inflow node is a parent node of the master node;
the data outflow node determining unit, used for taking the destinations to which the data to be analyzed is sent as data outflow nodes of the lineage map, wherein a data outflow node is a child node of the master node; the data outflow nodes include terminal nodes, and after the data reaches a terminal node, the data flow stops;
the circulation path determining unit, used for marking the circulation paths of the data, wherein a circulation path represents a path along which data converges from the data inflow nodes to the master node and then diffuses from the master node to the data outflow nodes;
and the process node determining unit, used for taking the ETL operation steps performed on the data to be analyzed as process nodes of the lineage map, as illustrated by the sketch following this list.
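The node roles described above can be sketched as a directed graph, again using networkx for illustration; the build_lineage_map signature and the role attribute encoding are assumptions of this sketch, not the claimed method.

```python
import networkx as nx

def build_lineage_map(master, sources, sinks, steps):
    """Sketch: sources converge on the master node, which feeds process
    nodes (ETL steps), which diffuse to terminal outflow nodes."""
    g = nx.DiGraph()
    g.add_node(master, role="master")
    for src in sources:                  # data inflow nodes: parents of the master
        g.add_node(src, role="inflow")
        g.add_edge(src, master)
    previous = master
    for step in steps:                   # ETL operation steps become process nodes
        g.add_node(step, role="process")
        g.add_edge(previous, step)
        previous = step
    for sink in sinks:                   # data outflow nodes: children; flow stops here
        g.add_node(sink, role="outflow", terminal=True)
        g.add_edge(previous, sink)
    return g

lineage = build_lineage_map(
    "ods_orders",
    sources=["raw_orders"],
    sinks=["sales_report", "bi_dashboard"],
    steps=["clean", "aggregate"],
)
# The circulation path runs inflow -> master -> process nodes -> outflow.
print(list(nx.topological_sort(lineage)))
```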
In one possible embodiment, the analysis module comprises:
the data tracing sub-module, used for tracing data according to the circulation paths in the lineage map;
the data value judging sub-module, used for judging the data value of the data to be analyzed according to the number of data outflow nodes in the lineage map;
the data magnitude judging sub-module, used for judging the data magnitude of the data to be analyzed according to the line thickness of the circulation paths;
and the update frequency judging sub-module, used for judging the data update frequency of the data to be analyzed according to the line length of the circulation paths, as illustrated by the sketch following this list.
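These analysis signals can be read directly off such a graph. In the sketch below, the edge attributes volume and interval_days are assumed encodings of what the lineage map draws as line thickness and line length respectively; both names are invented for illustration.

```python
import networkx as nx

def analyze(graph: nx.DiGraph, master: str) -> dict:
    """Derive the analysis signals above from a lineage map."""
    out_edges = graph.out_edges(master, data=True)
    return {
        # more data outflow nodes -> the data is consumed more widely -> higher value
        "value_score": graph.out_degree(master),
        # thicker lines -> larger data magnitude flowing along the circulation paths
        "magnitude": sum(attrs.get("volume", 0) for _, _, attrs in out_edges),
        # shorter lines -> more frequent updates
        "update_interval_days": min(
            (attrs.get("interval_days", float("inf")) for _, _, attrs in out_edges),
            default=None,
        ),
    }

g = nx.DiGraph()
g.add_edge("ods_orders", "sales_report", volume=1_200_000, interval_days=1)
g.add_edge("ods_orders", "bi_dashboard", volume=300_000, interval_days=7)
print(analyze(g, "ods_orders"))
# {'value_score': 2, 'magnitude': 1500000, 'update_interval_days': 1}
```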
An embodiment of the invention further provides an electronic device. Referring to fig. 6, fig. 6 is a schematic structural diagram of the electronic device according to an embodiment of the invention. As shown in fig. 6, the electronic device 100 includes a memory 110 and a processor 120 communicatively connected through a bus; the memory 110 stores a computer program executable on the processor 120 which, when executed, implements the steps of the data lake-based self-service data analysis method disclosed in the embodiments of the invention.
An embodiment of the invention further provides a computer-readable storage medium on which a computer program or instructions are stored; when executed by a processor, the program or instructions implement the steps of the data lake-based self-service data analysis method disclosed in the embodiments of the invention.
An embodiment of the invention further provides a computer program product comprising a computer program or instructions; when executed by a processor, the program or instructions implement the steps of the data lake-based self-service data analysis method disclosed in the embodiments of the invention.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts, reference may be made between the embodiments.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, electronic devices, and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or terminal device that comprises the element.
The self-service data analysis method, device, and electronic equipment based on a data lake provided by the invention have been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the invention, and the description of the above embodiments is intended only to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and application scope in accordance with the ideas of the invention; in view of the above, the contents of this description should not be construed as limiting the invention.

Claims (9)

1. A self-service data analysis method based on a data lake, the method comprising:
managing metadata information of a data lake and creating a metadata graph database, wherein the metadata graph database stores the metadata information in a graph structure;
classifying the data of the data lake by partition to generate a data asset map, wherein the data asset map graphically displays the distribution of the data assets and the relations among the data assets;
locating data to be analyzed according to the metadata graph database and the data asset map;
performing ETL operations on the data to be analyzed and collecting the SQL statement information generated during the ETL operations, wherein the SQL statement information is obtained from processing by one or more data processing engines;
generating a lineage map according to the SQL statement information;
generating an analysis result of the data to be analyzed according to the lineage map;
wherein the classifying the data of the data lake by partition to generate a data asset map comprises:
performing type identification on the data of the data lake and determining the data type of each piece of data in the data lake, wherein the data types include: structured data types, semi-structured data types, and unstructured data types;
recommending a landing position for each piece of data in the data lake according to its data type, and adding a partition tag to the metadata information of each piece of data to obtain the landed data;
classifying and converging the landed data to obtain classified data, and adding a classification tag to the metadata information of the classified data;
and generating the data asset map according to the partition tags and the classification tags.
2. The self-service data analysis method based on a data lake according to claim 1, wherein the managing metadata information of a data lake and creating a metadata graph database comprises:
designing a metadata model conforming to the business process;
traversing the file directory of the business data ingested into the data lake, and parsing the log information in the file directory to obtain basic metadata information, the basic metadata information including at least: data modification behavior information, schema information, and data storage location information;
performing data fusion on the basic metadata information by using the designed metadata model to obtain a meta-model object;
and converting the meta-model object into a graph object and storing the graph object in the metadata graph database.
3. The self-service data analysis method based on a data lake according to claim 2, wherein the performing data fusion on the basic metadata information to obtain a meta-model object comprises:
mapping the basic metadata information to the corresponding attributes of the metadata model by using the metadata model;
adding other metadata information to the basic metadata information according to information input by a user;
and packaging the basic metadata information into a meta-model object with a consistent format.
4. The self-service data analysis method based on a data lake according to claim 1, wherein, in the case that the landed data is structured or semi-structured data, the classifying and converging the landed data to obtain classified data comprises:
extracting, transforming, and loading the structured or semi-structured data, and converging it into structured data stored by category to obtain the classified data;
and, in the case that the landed data is unstructured data, the classifying and converging the landed data to obtain classified data comprises:
classifying the data topics of unstructured data whose text information carries no classification mark, using a text topic classification model;
classifying the data topics of unstructured data whose text information carries a classification mark, using text topic classification rules;
and converging the classified unstructured data into unstructured data stored by category to obtain the classified data.
5. The self-service data analysis method based on a data lake according to claim 1, wherein the generating a lineage map according to the SQL statement information comprises:
sending the SQL statement information in the ETL jobs to a message queue through an application programming interface, wherein the SQL statement information is the SQL statement information in ETL jobs executed by one or more data processing engines;
monitoring the SQL statement information in the message queue, consuming the SQL statement information in the message queue, and constructing a unified SQL job format;
performing syntax tree parsing on the consumed SQL statement information to obtain a data association relationship, wherein the data association relationship represents the pointing information between a source table and a destination table and between their fields;
converting the data association relationship into a data lineage relationship according to the process nodes, outflow nodes, and inflow nodes obtained from the SQL statement information, the data lineage relationship representing the pointing relationship between the source table, the process nodes, and the destination table, and the pointing relationship between the fields and the process nodes;
and generating the lineage map according to the data lineage relationship.
6. The self-service data analysis method based on a data lake according to claim 5, wherein the generating the lineage map according to the data lineage relationship comprises:
taking the data to be analyzed as the master node of the lineage map;
taking the data sources of the data to be analyzed as data inflow nodes of the lineage map, wherein a data inflow node is a parent node of the master node;
taking the destinations to which the data to be analyzed is sent as data outflow nodes of the lineage map, wherein a data outflow node is a child node of the master node, the data outflow nodes include terminal nodes, and after the data reaches a terminal node, the data flow stops;
marking the circulation paths of the data, wherein a circulation path represents a path along which data converges from the data inflow nodes to the master node and then diffuses from the master node to the data outflow nodes;
and taking the ETL operation steps performed on the data to be analyzed as process nodes of the lineage map.
7. The self-service data analysis method based on a data lake according to claim 6, wherein the generating an analysis result of the data to be analyzed according to the lineage map comprises:
tracing data according to the circulation paths in the lineage map;
judging the data value of the data to be analyzed according to the number of data outflow nodes in the lineage map;
judging the data magnitude of the data to be analyzed according to the line thickness of the circulation paths;
and judging the data update frequency of the data to be analyzed according to the line length of the circulation paths.
8. A self-service data analysis device, the device comprising:
the metadata graph database generation module, used for managing metadata information of a data lake and creating a metadata graph database, wherein the metadata graph database stores the metadata information in a graph structure;
the data asset map generation module, used for classifying the data of the data lake by partition to generate a data asset map, wherein the data asset map graphically displays the distribution of the data assets and the relations among the data assets;
the locating module, used for locating data to be analyzed according to the metadata graph database and the data asset map;
the operation module, used for performing ETL operations on the data to be analyzed and collecting the SQL statement information generated during the ETL operations, wherein the SQL statement information is obtained from processing by one or more data processing engines;
the lineage map generation module, used for generating a lineage map according to the SQL statement information;
and the analysis module, used for generating an analysis result of the data to be analyzed according to the lineage map;
wherein the data asset map generation module comprises:
the type identification sub-module, used for performing type identification on the data of the data lake and determining the data type of each piece of data in the data lake, wherein the data types include: structured data types, semi-structured data types, and unstructured data types;
the placement sub-module, used for recommending a landing position for each piece of data in the data lake according to its data type, and adding a partition tag to the metadata information of each piece of data to obtain the landed data;
the classification convergence sub-module, used for classifying and converging the landed data to obtain classified data, and adding a classification tag to the metadata information of the classified data;
and the generation sub-module, used for generating the data asset map according to the partition tags and the classification tags.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, performs the steps of the data lake-based self-service data analysis method of any one of claims 1 to 7.
CN202310726336.5A 2023-06-19 2023-06-19 Self-service data analysis method and device based on data lake and electronic equipment Active CN116450908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310726336.5A CN116450908B (en) 2023-06-19 2023-06-19 Self-service data analysis method and device based on data lake and electronic equipment

Publications (2)

Publication Number Publication Date
CN116450908A CN116450908A (en) 2023-07-18
CN116450908B true CN116450908B (en) 2023-10-03





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant