CN111125229B

CN111125229B - Data lineage generation method and apparatus, and electronic device

Info

Publication number: CN111125229B
Application number: CN201911376186.XA
Authority: CN
Inventors: 元庚; 周万; 甘长华
Original assignee: Hangzhou Dt Dream Technology Co Ltd
Current assignee: Hangzhou Dt Dream Technology Co Ltd
Filing date: 2019-12-24
Publication date: 2024-06-28
Anticipated expiration: 2039-12-24

Abstract

A data lineage generation method, apparatus, electronic device, and machine-readable storage medium are disclosed. In the application, source data is acquired from a connected service system and stored locally, where the source data is database table data; target data corresponding to the source data is then generated. The target data includes at least a first lineage identifier that uniquely characterizes the data source of each row of the target data, so that the data lineage relationship is constructed accurately at row level and the efficiency and accuracy of data lineage tracing are improved.

Description

Data lineage generation method and apparatus, and electronic device

Technical Field

One or more embodiments of the present application relate to the field of computer application technology, and more particularly, to a data lineage generation method, apparatus, electronic device, and machine-readable storage medium.

Background

A data warehouse (abbreviated as DW or DWH) is a collection of data that is subject-oriented, integrated, and time-variant, while the information itself remains relatively stable. In practice, for example, data warehouses are commonly used to support enterprise management decisions by providing a collection of all types of data related to those decisions.

A data warehouse has four main characteristics: it is subject-oriented, integrated, time-variant, and non-updatable. "Subject-oriented" means that the data warehouse is built around a specific subject: only data related to that subject is kept, and unrelated detail data is excluded. "Integrated" refers to the process from collecting data from different sources to generating the target data, which requires data processing based on ETL (Extract-Transform-Load) technology. "Time-variant" means that the data changes over time, implicitly or explicitly. "Non-updatable" means that, once the Load step of ETL has been performed on the data, generally only query operations are carried out, and the insert, delete, and update operations of a traditional database are not.

Data in a data warehouse is OLAP (Online Analytical Processing) data: it reflects historical data over a considerable period of time, is a collection of database snapshots taken at different points in time, and is derived by aggregating, consolidating, and reorganizing those snapshots. Data in a conventional database, by contrast, is OLTP (Online Transaction Processing) data.

Disclosure of Invention

The application provides a data lineage generation method, which is applied to a data warehouse management system and includes:

acquiring source data from a connected service system and storing the source data locally, wherein the source data is database table data;

generating target data corresponding to the source data, wherein the target data includes at least a first lineage identifier that uniquely characterizes the data source of each row of the target data.

Optionally, the target data further includes an index identifier that uniquely characterizes each row of the target data;

the generating of the target data corresponding to the source data includes:

generating processing data of the source data, wherein the processing data is intermediate data between the source data and the target data, and the processing data includes at least a second lineage identifier that uniquely characterizes the data source of each row of the processing data;

and generating the target data corresponding to the source data based on the index identifier and the processing data.

Optionally, the generating of the processing data of the source data includes:

generating first processing data of the source data, wherein the first processing data includes at least the source data and a second lineage identifier that uniquely characterizes, for each row of the first processing data, its data source in the source data;

generating second processing data of the first processing data, wherein the second processing data includes at least the first processing data and a second lineage identifier that uniquely characterizes, for each row of the second processing data, its data source in the first processing data;

and iteratively generating processing data of the second processing data until final third processing data is obtained.

Optionally, the generating of the target data corresponding to the source data based on the index identifier and the processing data includes:

taking the index identifier and the processing data as table data of the target data, and generating the target data corresponding to the source data.

Optionally, when data lineage tracing is required for the target data, the method further includes:

constructing a data lineage query instruction for the target data, and querying the target data based on the data lineage query instruction to obtain the data lineage traced from the target data back to the source data.

Optionally, the third processing data further includes an index identifier that uniquely characterizes each row of the third processing data, where the index identifier of the third processing data is obtained by combining a table identifier of the third processing data with a unique identifier generated by a unique identification algorithm; the first lineage identifier points to the index identifier of the third processing data.

Optionally, the source data further includes an index identifier that uniquely characterizes each row of the source data, the first processing data further includes an index identifier that uniquely characterizes each row of the first processing data, and the second processing data further includes an index identifier that uniquely characterizes each row of the second processing data;

the second lineage identifier points to the index identifier of the source data, or the second lineage identifier points to the index identifier of the first processing data, or the second lineage identifier points to the index identifier of the second processing data.

Optionally, the unique identification algorithm is a UUID algorithm or a hash algorithm.

The application also provides a data lineage generation apparatus, which is applied to a data warehouse management system and includes:

an acquisition module that acquires source data from a connected service system and stores the source data locally, wherein the source data is database table data;

a generation module that generates target data corresponding to the source data, wherein the target data includes at least a first lineage identifier that uniquely characterizes the data source of each row of the target data.

The generation module further:

Optionally, the generating module further:

generates first processing data of the source data, wherein the first processing data includes at least the source data and a second lineage identifier that uniquely characterizes, for each row of the first processing data, its data source in the source data;

Optionally, the apparatus further includes:

a tracing module that constructs a data lineage query instruction for the target data, and queries the target data based on the data lineage query instruction to obtain the data lineage traced from the target data back to the source data.

The application also provides an electronic device, which includes a communication interface, a processor, a memory, and a bus, wherein the communication interface, the processor, and the memory are connected to one another through the bus;

the memory stores machine readable instructions and the processor performs the method described above by invoking the machine readable instructions.

Through the above embodiments, source data is acquired from the connected service system and stored locally, and target data that corresponds to the source data and contains a lineage identifier is generated, so that the data lineage relationship is constructed accurately at row level and the efficiency and accuracy of data lineage tracing are improved.

Drawings

FIG. 1 is a schematic diagram of ETL data processing performed by a data warehouse management system according to an exemplary embodiment;

FIG. 2 is a flow chart of a method of data lineage generation according to an exemplary embodiment;

FIG. 3 is a hardware configuration diagram of an electronic device according to an exemplary embodiment;

FIG. 4 is a block diagram of a data lineage generation apparatus according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. The word "if" as used herein may be interpreted as "when ...", "upon ...", or "in response to a determination", depending on the context.

To help those skilled in the art better understand the technical solutions in the embodiments of the present specification, the related art of data lineage generation involved in these embodiments is briefly described below.

Referring to fig. 1, fig. 1 is a schematic diagram illustrating ETL data processing performed by a data warehouse management system according to an embodiment of the present disclosure.

As shown in FIG. 1, the data warehouse management system is deployed in cluster mode and includes a control node and one or more working nodes managed by the control node. The control node schedules tasks on the working nodes it manages, and the working nodes perform ETL data processing on source data to obtain the target data.

On the basis of the networking architecture shown above, the application aims to provide a technical solution in which each row of the target data carries an identifier of its data source, so that the data source in the source data can be determined from that identifier, thereby realizing data lineage generation.

In implementation, the data warehouse management system acquires source data from the connected service system and stores the source data locally, where the source data is database table data. The data warehouse management system then generates target data corresponding to the source data, where the target data includes at least a first lineage identifier that uniquely characterizes the data source of each row of the target data.

In this solution, source data is acquired from the connected service system and stored locally, and target data that corresponds to the source data and contains a lineage identifier is generated, so that the data lineage relationship is constructed accurately at row level and the efficiency and accuracy of data lineage tracing are improved.

The present application is described below by way of specific embodiments and in connection with specific application scenarios.

Referring to FIG. 2, FIG. 2 is a flowchart of a data lineage generation method according to an embodiment of the present application. The method is applied to a data warehouse management system and performs the following steps:

Step 202, acquiring source data from a connected service system and storing the source data locally, wherein the source data is database table data.

Step 204, generating target data corresponding to the source data, wherein the target data includes at least a first lineage identifier that uniquely characterizes the data source of each row of the target data.

In the present specification, the data warehouse management system refers to a machine or a cluster of machines that perform ETL data processing on data.

For example, the data warehouse management system described above may be a machine or cluster of machines that may perform ETL data processing, including a control node and several work nodes, as shown in fig. 1.

For ease of understanding, ETL data processing is briefly described here. ETL, an abbreviation of Extract-Transform-Load, describes the process of extracting (Extract), transforming (Transform), and loading (Load) data from a source to a destination. For example, in practical applications, based on ETL data processing, the data warehouse management system may extract data from source databases (such as the data tables in those source databases) in distributed, heterogeneous data sources into a temporary intermediate layer, then clean, convert, and integrate the data, and finally load it into a target database of the data warehouse, where it becomes the basis for online analytical processing and further data mining.
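The three ETL stages described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`extract`, `transform`, `load`, `run_etl`) and the cleaning rules (dropping rows without an `id_card`, trimming whitespace in `name`) are assumptions made for the example and are not taken from the specification.

```python
from typing import Dict, List

Row = Dict[str, object]


def extract(source_rows: List[Row]) -> List[Row]:
    """Extract: copy rows out of the source into a temporary staging list."""
    return [dict(row) for row in source_rows]


def transform(staged_rows: List[Row]) -> List[Row]:
    """Transform: clean and convert the staged rows (hypothetical rules)."""
    cleaned: List[Row] = []
    for row in staged_rows:
        if not row.get("id_card"):
            continue  # drop rows that have no usable key
        out = dict(row)
        out["name"] = str(out.get("name", "")).strip()  # trim whitespace
        cleaned.append(out)
    return cleaned


def load(target: List[Row], rows: List[Row]) -> None:
    """Load: append the transformed rows to the target table."""
    target.extend(rows)


def run_etl(source_rows: List[Row], target: List[Row]) -> List[Row]:
    """Run extract, transform, and load in sequence and return the target."""
    load(target, transform(extract(source_rows)))
    return target
```

Because `extract` copies each row, the source rows are left unmodified by the later stages.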

In this specification, the service system refers to a service system of any service type that interfaces with the data warehouse management system.

For example, in practical applications, the service system may be a cloud computing service system that interfaces with the data warehouse management system, a big data service system that interfaces with the data warehouse management system, or a security service system that interfaces with the data warehouse management system.

In the present specification, the source data refers to database-based table data in the service system.

For example, in practice, the source data may include table data from one or more data tables of a relational database (e.g., MySQL or PostgreSQL).

In the present specification, the data warehouse management system acquires the source data from the service system and stores the source data locally.

Taking a big data service system as an example of the service system, the data warehouse management system acquires the source data from the big data service system and stores it in a local source database, so that subsequent ETL data processing can be performed on that source database. The source data stored locally in the data warehouse management system includes, for example, two data tables, source table ta and source table tb; source table ta is shown in Table 1 below, and source table tb is shown in Table 2 below.

id_card	name	age
ID00001	A	20
ID00002	B	21
ID00003	C	22
ID00004	D	23
ID00001	A	20

TABLE 1

id_card	degree	graduation
ID00001	Bachelor	2018-06-06
ID00003	Doctor	2018-06-07

TABLE 2

The top row of Table 1 (id_card, name, age) and the top row of Table 2 (id_card, degree, graduation) are the table fields of the respective tables; every row below the field row in Tables 1 and 2 is a row of data of the respective table.
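As a minimal sketch of storing acquired source tables locally, the following uses Python's built-in `sqlite3` module with an in-memory database. The choice of SQLite, the helper name `store_source_table`, and the sample rows (including the placeholder names) are assumptions for illustration only; the specification does not prescribe a storage engine.

```python
import sqlite3
from typing import Iterable, Sequence, Tuple


def store_source_table(conn: sqlite3.Connection, table: str,
                       columns: Sequence[str], rows: Iterable[Tuple]) -> int:
    """Create a local copy of a source table and return its row count."""
    # Table and column names are trusted in this sketch (hypothetical helper).
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', list(rows))
    conn.commit()
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Local source database (in memory for the example).
conn = sqlite3.connect(":memory:")
n_ta = store_source_table(conn, "ta", ["id_card", "name", "age"],
                          [("ID00001", "A", 20), ("ID00002", "B", 21)])
n_tb = store_source_table(conn, "tb", ["id_card", "degree", "graduation"],
                          [("ID00003", "Doctor", "2018-06-07")])
```

No primary key is declared here, so the sketch also accepts source tables that contain repeated rows.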

In the present specification, the target data refers to data that is obtained by the data warehouse management system through ETL data processing of the source data and that has a data lineage relationship with the source data.

For ease of understanding, the data lineage relationship is briefly described here. The data lineage relationship refers to the inheritance relationship, analogous to blood relationships in human society, that forms between the finally obtained data and the source data it derives from over the whole life cycle of the data, from generation through processing and circulation to retirement.

In this specification, the first lineage identifier is an identifier that is included in the target data and uniquely characterizes the data source of each row of the target data.

In implementation, the first lineage identifier may be a table field that is included in the data table of the target data and uniquely characterizes the data source of each row of that table.

Continuing the above example, as shown in FIG. 1 and FIG. 2, the target data obtained through ETL data processing by the data warehouse management system may be a target table ta and a target table tb. The relationships between target table ta and source table ta, and between target table tb and source table tb, are shown in Table 3 below:

Source table	Target table
Source table ta	Target table ta = ETL-processed source table ta + srckey
Source table tb	Target table tb = ETL-processed source table tb + srckey

TABLE 3

As shown in Table 3, target table ta may include the ETL-processed source table ta and the srckey table field, where srckey is the first lineage identifier that uniquely characterizes the data source of each row of target table ta. Similarly, target table tb may include the ETL-processed source table tb and the srckey table field, where srckey is the first lineage identifier that uniquely characterizes the data source of each row of target table tb.

In one illustrated embodiment, the target data includes, in addition to the first lineage identifier, an index identifier that uniquely characterizes each row of the target data.

Continuing with the above example, the contents of target table ta and target table tb are as shown in Table 4 below:

Target table	Content of target table
Target table ta	rowkey + ETL-processed source table ta + srckey
Target table tb	rowkey + ETL-processed source table tb + srckey

TABLE 4

As shown in Table 4, target table ta may include, in addition to the srckey table field (the first lineage identifier), an index identifier rowkey that uniquely characterizes each row of target table ta. Similarly, target table tb may include, in addition to the srckey table field (the first lineage identifier), an index identifier rowkey that uniquely characterizes each row of target table tb.
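The row layout of Table 4 (rowkey, then the processed columns, then srckey) can be illustrated with a short Python sketch. The function name `build_target_rows`, the use of `uuid.uuid4()` for rowkey, and the convention that each upstream row carries its own key in a field named `rowkey` are assumptions made for this example, not details fixed by the specification.

```python
import uuid
from typing import Dict, List

Row = Dict[str, object]


def build_target_rows(upstream_rows: List[Row], upstream_key: str = "rowkey") -> List[Row]:
    """Lay out each target row as: rowkey + processed columns + srckey."""
    target_rows: List[Row] = []
    for row in upstream_rows:
        # Processed columns: everything except the upstream row's own key.
        payload = {k: v for k, v in row.items() if k != upstream_key}
        target_rows.append({
            "rowkey": uuid.uuid4().hex,   # index identifier of the target row
            **payload,
            "srckey": row[upstream_key],  # lineage identifier: copy of the upstream key
        })
    return target_rows
```

Each target row gets a fresh rowkey of its own, while srckey keeps a copy of the key of the row it was derived from.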

In the present specification, the processing data is the intermediate data produced while the data warehouse management system performs ETL data processing on the source data to obtain the target data.

For example, in practical applications, the data warehouse management system may perform ETL data processing on the source data to obtain corresponding processing data; ETL data processing may then be performed on that processing data again, yielding processing data that has gone through multiple rounds of ETL data processing.

In this specification, the processing data includes at least a second lineage identifier, which is an identifier that is included in the processing data and uniquely characterizes the data source of each row of the processing data.

In implementation, the second lineage identifier may be a table field that is included in the data table of the processing data and uniquely characterizes the data source of each row of that table.

For example, the processing data corresponding to source table ta is processing table ta, and the processing data corresponding to source table tb is processing table tb. The relationships between processing table ta and source table ta, and between processing table tb and source table tb, are shown in Table 5 below:

Source table	Processing table
Source table ta	Processing table ta = ETL-processed source table ta + srckey1
Source table tb	Processing table tb = ETL-processed source table tb + srckey1

TABLE 5

As shown in Table 5, processing table ta may include the ETL-processed source table ta and the srckey1 table field, where srckey1 is the second lineage identifier that uniquely characterizes the data source of each row of processing table ta. Similarly, processing table tb may include the ETL-processed source table tb and the srckey1 table field, where srckey1 is the second lineage identifier that uniquely characterizes the data source of each row of processing table tb.

In this specification, the data warehouse management system generates target data that corresponds to the source data and includes the first lineage identifier.

In one embodiment, in the course of generating the target data that corresponds to the source data and includes the first lineage identifier, the data warehouse management system generates the processing data of the source data.

For ease of understanding, the process in which the data warehouse management system performs ETL data processing on the source data to obtain the corresponding processing data, and then performs iterative ETL data processing on the processing data to obtain the corresponding target data, is described in detail below through a specific embodiment.

In one embodiment, in the course of generating the processing data of the source data, the data warehouse management system generates first processing data of the source data. The first processing data includes at least the source data and a second lineage identifier that uniquely characterizes, for each row of the first processing data, its data source in the source data.

Continuing with the above example, the first processing data corresponding to source table ta is first processing table ta, and the first processing data corresponding to source table tb is first processing table tb. The relationships between first processing table ta and source table ta, and between first processing table tb and source table tb, are shown in Table 6 below:

Source table	First processing table
Source table ta	First processing table ta = source table ta + srckey1-A
Source table tb	First processing table tb = source table tb + srckey1-A

TABLE 6

As shown in Table 6, first processing table ta may include the ETL-processed source table ta and the srckey1-A table field, where srckey1-A is the second lineage identifier that uniquely characterizes, for each row of first processing table ta, its data source in source table ta. Similarly, first processing table tb may include the ETL-processed source table tb and the srckey1-A table field, where srckey1-A is the second lineage identifier that uniquely characterizes, for each row of first processing table tb, its data source in source table tb.

In the present specification, the data warehouse management system further generates second processing data of the first processing data. The second processing data includes at least the first processing data and a second lineage identifier that uniquely characterizes, for each row of the second processing data, its data source in the first processing data.

Continuing with the above example, the second processing data corresponding to first processing table ta (first processing data) is second processing table ta, and the second processing data corresponding to first processing table tb (first processing data) is second processing table tb. The relationships between second processing table ta and first processing table ta, and between second processing table tb and first processing table tb, are shown in Table 7 below:

TABLE 7

As shown in Table 7, second processing table ta may include first processing table ta and the srckey1-B table field, where srckey1-B is the second lineage identifier that uniquely characterizes, for each row of second processing table ta, its data source in first processing table ta. Similarly, second processing table tb may include first processing table tb and the srckey1-B table field, where srckey1-B is the second lineage identifier that uniquely characterizes, for each row of second processing table tb, its data source in first processing table tb.

In this specification, the data warehouse management system further iteratively generates processing data of the second processing data until final third processing data is obtained. The third processing data is the processing data immediately preceding the ETL data processing that yields the target data.

Continuing with the example above, the data warehouse management system may perform one or more iterations of ETL data processing on the second processing data until the final third processing data is obtained. For convenience of description and understanding, assume that the data warehouse management system performs ETL data processing on the second processing data once and thereby obtains the third processing data, which is the processing data immediately preceding the ETL data processing that yields the target data.

The third processing data corresponding to second processing table ta (second processing data) is third processing table ta, and the third processing data corresponding to second processing table tb (second processing data) is third processing table tb. The relationships between third processing table ta and second processing table ta, and between third processing table tb and second processing table tb, are shown in Table 8 below:

TABLE 8

As shown in Table 8, third processing table ta may include second processing table ta and the srckey1-C table field, where srckey1-C is the second lineage identifier that uniquely characterizes, for each row of third processing table ta, its data source in second processing table ta. Similarly, third processing table tb may include second processing table tb and the srckey1-C table field, where srckey1-C is the second lineage identifier that uniquely characterizes, for each row of third processing table tb, its data source in second processing table tb.

In one embodiment, the third processing data further includes an index identifier that uniquely characterizes each row of the third processing data. The index identifier of the third processing data is obtained by combining a table identifier of the third processing data with a unique identifier generated by a unique identification algorithm, for example by combining the table name of the third processing data with the unique identifier generated by the unique identification algorithm. The unique identification algorithm is a UUID algorithm or a hash algorithm. The first lineage identifier points to the index identifier of the third processing data.

Continuing with the above example, the third processing data (e.g., third processing table ta and third processing table tb, as shown in Table 8) further includes an index identifier (e.g., the primary key of third processing table ta and the primary key of third processing table tb) that uniquely characterizes each row of the third processing data. The index identifier of the third processing data (primary key rowkey3 of third processing table ta and primary key rowkey3 of third processing table tb) is obtained by combining a table identifier of the third processing data (for example, the table name of third processing table ta or of third processing table tb) with a unique identifier generated by the unique identification algorithm (UIDn, UIDm, where n and m are natural numbers). For example, the primary key corresponding to each row of third processing table ta is "table name of third processing table ta#unique identifier UIDn", and the primary key corresponding to each row of third processing table tb is "table name of third processing table tb#unique identifier UIDm".
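The "table name#unique identifier" form of the index identifier can be sketched in Python as follows. The function names `rowkey_uuid` and `rowkey_hash`, the choice of UUID version 4, and the choice of SHA-256 over the row's values are assumptions for illustration; the specification only states that a UUID algorithm or a hash algorithm is used.

```python
import hashlib
import uuid


def rowkey_uuid(table_name: str) -> str:
    """Index identifier of the form '<table name>#<UUID>' (random UUID, hex form)."""
    return f"{table_name}#{uuid.uuid4().hex}"


def rowkey_hash(table_name: str, row_values: tuple) -> str:
    """Index identifier of the form '<table name>#<hash>' (SHA-256 of the row values)."""
    digest = hashlib.sha256(repr(row_values).encode("utf-8")).hexdigest()
    return f"{table_name}#{digest}"
```

In this sketch the UUID variant yields a different key on every call, whereas the hash variant is deterministic, so two rows with identical values in the same table would receive the same key; which behaviour is wanted depends on whether repeated rows must remain distinguishable.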

The first lineage identifier srckey of the target data (target table ta and target table tb, as shown in Table 4) points to the index identifier of the corresponding third processing data (primary key rowkey3 of third processing table ta and primary key rowkey3 of third processing table tb, as shown in Table 8). That is, srckey in target table ta and in target table tb stores a copy of the index identifier whose value is identical to the index identifier (primary key rowkey3) of third processing table ta and of third processing table tb, respectively.

In practical applications, the data warehouse management system may also combine the table identifier of the third processing data and the unique identifier generated by the unique identification algorithm in other ways to obtain the index identifier of the third processing data; this specification does not specifically limit the manner of combination.

In one embodiment, the source data further includes an index identifier that uniquely identifies each row of data of the source data, the first process data further includes an index identifier that uniquely identifies each row of data of the first process data, and the second process data further includes an index identifier that uniquely identifies each row of data of the second process data.

In implementation, the index identifier of each of the source data, the first processing data, and the second processing data may be a primary key of each of the source data, the first processing data, and the second processing data.

Continuing with the above example, the index identifier of the source data (source table ta as shown in Table 1 and source table tb as shown in Table 2) is "id_card" in Table 1 and "id_card" in Table 2. Similarly, the index identifier of the first processing data (first processing table ta and first processing table tb, as shown in Table 6) is referred to as rowkey1 for short, and the index identifier of the second processing data (second processing table ta and second processing table tb, as shown in Table 7) is referred to as rowkey2 for short. Note that rowkey1 and rowkey2 are generated in a manner similar to rowkey3, that is, as "table name of the current table#unique identifier generated by the unique identification algorithm (for example, a UUID algorithm or a hash algorithm)"; the specific process is not repeated here.

In one embodiment shown, the second blood-edge identifier is an index identifier pointing to the source data.

Continuing with the above example, the second blood-edge identifier srckey1-A of each piece of first processing data (first processing table ta, first processing table tb) shown in table 6 points to the index identifier of the corresponding source data (the id_card shown in table 1 and the id_card shown in table 2). That is, the srckey1-A of each first processing table stores a copy of the index identifier, with the same value as the index identifier id_card of the corresponding source table (source table ta, source table tb).

In another embodiment shown, the second blood-edge identifier is an index identifier pointing to the first processing data.

Continuing with the above example, the second blood-edge identifier srckey1-B of each piece of second processing data (second processing table ta, second processing table tb) shown in table 7 points to the index identifier of the corresponding first processing data (the primary key rowkey1 of the first processing table ta and the primary key rowkey1 of the first processing table tb). That is, the srckey1-B of each second processing table stores a copy of the index identifier, with the same value as the index identifier rowkey1 of the corresponding first processing table.

In yet another embodiment, the second blood-edge identifier is an index identifier pointing to the second processing data.

Continuing with the above example, the second blood-edge identifier srckey1-C of each piece of third processing data (third processing table ta, third processing table tb) shown in table 8 points to the index identifier of the corresponding second processing data (the primary key rowkey2 of the second processing table ta and the primary key rowkey2 of the second processing table tb). That is, the srckey1-C of each third processing table stores a copy of the index identifier, with the same value as the index identifier rowkey2 of the corresponding second processing table.

In the example described above with reference to tables 6 to 8, in which the data warehouse management system generates the processing data of the source data, the data warehouse management system performs ETL data processing on the source data 3 times: source data -> first processing data -> second processing data -> third processing data; the third processing data is the final processing data used to obtain the target data. In practical applications, the number of times the data warehouse management system performs ETL data processing to obtain the processing data is not specifically limited in this specification. For example, the data warehouse management system may perform ETL data processing on the source data only once to obtain first processing data, in which case the first processing data serves as the target data; alternatively, the data warehouse management system may perform ETL data processing on the source data 2 times or more than 3 times.
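The three-stage chain described above can be sketched in a few lines. The sketch below is an assumption-laden illustration: the function process_stage, the short table names, and the sample rows are hypothetical, and the "transformation" is a plain copy standing in for real ETL logic. Only the column names id_card, rowkey1 to rowkey3, and srckey1-A to srckey1-C follow the example in this specification.

```python
import uuid


def process_stage(prev_rows, prev_key, table_name, new_key, src_key):
    """Run one placeholder ETL step and tag every output row.

    Each output row keeps the previous row's columns, gains a fresh index
    identifier under `new_key`, and stores a copy of the previous row's
    index identifier under `src_key` (the blood-edge identifier).
    """
    out = []
    for row in prev_rows:
        new_row = dict(row)  # stand-in for a real transformation
        new_row[new_key] = f"{table_name}#{uuid.uuid4().hex}"
        new_row[src_key] = row[prev_key]
        out.append(new_row)
    return out


# Made-up sample rows for source table ta.
source_ta = [{"id_card": "A001", "name": "x"}, {"id_card": "A002", "name": "y"}]

first = process_stage(source_ta, "id_card", "first_ta", "rowkey1", "srckey1-A")
second = process_stage(first, "rowkey1", "second_ta", "rowkey2", "srckey1-B")
third = process_stage(second, "rowkey2", "third_ta", "rowkey3", "srckey1-C")
```

Because every stage is produced by the same function, running ETL once, twice, or more than three times only changes how many times the function is called.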

In the present specification, after generating the processed data of the source data, the data warehouse management system generates target data corresponding to the source data based on the index identifier and the processed data.

Continuing with the above example, the index identifier is, for example, the rowkey shown in table 4, and the processing data is, for example, the third processing data, including the third processing table ta and the third processing table tb shown in table 8. The data warehouse management system uses rowkey (the index identifier) and the third processing data (third processing table ta, third processing table tb) as table data of the target data to generate the target data (target table ta and target table tb) corresponding to the source data (source table ta, source table tb).

On the basis of table 4, the contents of the target table ta and the target table tb become those shown in table 9 below:

Target table	Content of target table
Target table ta	rowkey + third processing table ta + srckey
Target table tb	rowkey + third processing table tb + srckey

TABLE 9

As shown in table 9, the target table ta includes: rowkey, the third processing table ta, and srckey; wherein rowkey is an index identifier that uniquely identifies each row of data of the target table ta, and srckey is a first blood-edge identifier that uniquely characterizes the data source, in the third processing table ta, of each row of data of the target table ta.

Similarly, the target table tb includes: rowkey, the third processing table tb, and srckey; wherein rowkey is an index identifier that uniquely identifies each row of data of the target table tb, and srckey is a first blood-edge identifier that uniquely characterizes the data source, in the third processing table tb, of each row of data of the target table tb.
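The layout "rowkey + third processing table + srckey" can be illustrated with a small sketch. The function build_target, the table name target_ta, and the sample row are hypothetical; the only thing taken from the text is that srckey holds a copy of rowkey3 and that rowkey is a new index identifier for the target row.

```python
import uuid


def build_target(third_rows, table_name):
    """Build target rows as: rowkey + third processing row + srckey."""
    target = []
    for row in third_rows:
        target_row = dict(row)                   # table data of the third processing table
        target_row["srckey"] = row["rowkey3"]    # first blood-edge identifier
        target_row["rowkey"] = f"{table_name}#{uuid.uuid4().hex}"  # index identifier
        target.append(target_row)
    return target


# Made-up third processing row for illustration.
third_ta = [{"rowkey3": "third_ta#1", "srckey1-C": "second_ta#1", "name": "x"}]
target_ta = build_target(third_ta, "target_ta")
```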

In the example described above with reference to tables 6 to 9, the data warehouse management system generates the target data corresponding to the source data through the following data processing: source data -> first processing data -> second processing data -> third processing data -> target data. Substituting the source data, the first processing data, the second processing data, and the third processing data into the target data in turn gives the target data shown in table 10 below:

Table 10

As shown in table 10, srckey (first blood-edge identifier) points to rowkey3, srckey1-C (second blood-edge identifier) points to rowkey2, srckey1-B (second blood-edge identifier) points to rowkey1, and srckey1-A (second blood-edge identifier) points to the index identifier (the primary key of the source table) of the source table (source table ta, source table tb).

It should be noted that, with the above technical solution, the data warehouse management system iteratively generates multi-level blood-edge identifiers, so that data blood edges can be traced both quickly and accurately (down to row-level data of the source data).
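How multi-level blood-edge identifiers allow a row to be followed back level by level can be sketched as below. This is a minimal in-memory illustration, not the system's implementation: the function trace_back, the dictionary-based tables, and all sample values are assumptions; the column names follow the example above.

```python
def trace_back(row, stages):
    """Follow blood-edge identifiers from `row` back through `stages`.

    `stages` is ordered from the stage nearest the target to the source.
    Each entry is (blood_edge_column, upstream_table), where the column is
    read from the current row and `upstream_table` maps an index identifier
    to the upstream row it identifies.
    """
    path = [row]
    for src_col, upstream in stages:
        row = upstream[row[src_col]]
        path.append(row)
    return path


# Made-up one-row tables keyed by their index identifiers.
source = {"A001": {"id_card": "A001"}}
first = {"first#1": {"rowkey1": "first#1", "srckey1-A": "A001"}}
second = {"second#1": {"rowkey2": "second#1", "srckey1-B": "first#1"}}
third = {"third#1": {"rowkey3": "third#1", "srckey1-C": "second#1"}}
target_row = {"rowkey": "target#1", "srckey": "third#1"}

path = trace_back(
    target_row,
    [("srckey", third), ("srckey1-C", second), ("srckey1-B", first), ("srckey1-A", source)],
)
```

Each step is a single keyed lookup, so the cost of tracing one row grows with the number of processing stages rather than with the size of the tables.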

In one embodiment, when data blood edge tracing is required for the target data after the target data is generated, the data warehouse management system constructs a data blood-edge query instruction for the target data and queries the target data based on the data blood-edge query instruction, so as to obtain the data blood edge traced from the target data back to the source data.

Continuing with the above example, when data blood edge tracing is required for the target data (target table ta, target table tb as shown in table 10), for example, to trace the data source of a certain row of data in the target table ta or the target table tb (for example, which processing tables and processing data were involved across the several rounds of ETL data processing, and which table and row of the source data the row corresponds to), the data warehouse management system can construct a data blood-edge query instruction for the target data based on SQL (Structured Query Language).

The data blood-edge query instruction may include the blood-edge identifiers corresponding to the data in the target data on which data blood edge tracing is to be performed, for example, the multi-level blood-edge identifiers srckey, srckey1-C, srckey1-B, and srckey1-A shown in table 10.

Further, the data warehouse management system can trace the data blood edge layer by layer through the multi-level blood-edge identifiers, from the target data to the processing data and then to the source data, until the target row data of the source data is reached; the sequence of data blood edge tracing is as follows:

srckey -> rowkey3 -> srckey1-C -> rowkey2 -> srckey1-B -> rowkey1 -> srckey1-A -> index identifier of the source table (primary key of the source table).

The type of database on which the data warehouse management system executes the SQL-based data blood-edge query instruction is not specifically limited in this specification.
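As one possible illustration of an SQL-based data blood-edge query, the sketch below uses Python's standard sqlite3 module with an in-memory database. Everything about the schema is an assumption made for the example: the table names, the sample rows, and the column names srckey1_a, srckey1_b, and srckey1_c (underscores replace the hyphens of srckey1-A and so on, since hyphens are not valid in unquoted SQL identifiers). SQLite is chosen only for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_ta (id_card TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE first_ta  (rowkey1 TEXT PRIMARY KEY, srckey1_a TEXT);
    CREATE TABLE second_ta (rowkey2 TEXT PRIMARY KEY, srckey1_b TEXT);
    CREATE TABLE third_ta  (rowkey3 TEXT PRIMARY KEY, srckey1_c TEXT);
    CREATE TABLE target_ta (rowkey  TEXT PRIMARY KEY, srckey TEXT);

    -- Made-up rows forming one lineage chain.
    INSERT INTO source_ta VALUES ('A001', 'x');
    INSERT INTO first_ta  VALUES ('first_ta#1',  'A001');
    INSERT INTO second_ta VALUES ('second_ta#1', 'first_ta#1');
    INSERT INTO third_ta  VALUES ('third_ta#1',  'second_ta#1');
    INSERT INTO target_ta VALUES ('target_ta#1', 'third_ta#1');
    """
)

# Each JOIN follows one blood-edge identifier to the upstream index identifier.
TRACE_SQL = """
SELECT t.rowkey, p3.rowkey3, p2.rowkey2, p1.rowkey1, s.id_card
FROM target_ta AS t
JOIN third_ta  AS p3 ON p3.rowkey3 = t.srckey
JOIN second_ta AS p2 ON p2.rowkey2 = p3.srckey1_c
JOIN first_ta  AS p1 ON p1.rowkey1 = p2.srckey1_b
JOIN source_ta AS s  ON s.id_card  = p1.srckey1_a
WHERE t.rowkey = ?
"""

lineage = conn.execute(TRACE_SQL, ("target_ta#1",)).fetchone()
```

The query returns one row listing the index identifier at every level, from the target row back to the primary key of the source row.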

In the above technical solution, source data is acquired from a connected service system and stored locally, and target data containing blood-edge identifiers corresponding to the source data is generated, thereby realizing accurate construction of the data blood edge relationship based on row-level data and improving the efficiency and accuracy of data blood edge tracing.

Corresponding to the above method embodiments, the present specification also provides an embodiment of a data blood-lineage generation apparatus. Embodiments of the data blood-lineage generation apparatus of the present specification can be applied to an electronic device. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking software implementation as an example, the apparatus in a logical sense is formed by the processor of the electronic device in which it is located reading corresponding computer program instructions from nonvolatile memory into memory and running them. In terms of hardware, fig. 3 shows a hardware structure diagram of the electronic device in which the data blood-lineage generation apparatus of the present specification is located; in addition to the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 3, the electronic device in which the apparatus is located may generally further include other hardware according to its actual function, which is not described here again.

Fig. 4 is a block diagram of a data blood-lineage generation apparatus according to an exemplary embodiment of the present disclosure.

Referring to fig. 4, the data blood-lineage generating apparatus 40 can be applied to the electronic device shown in fig. 3, and the apparatus is applied to a data warehouse management system, and includes:

an acquisition module 401, configured to acquire source data from a connected service system and store the source data locally; wherein the source data is database-based table data;

A generation module 402, configured to generate target data corresponding to the source data; wherein the target data includes at least a first blood-edge identification uniquely characterizing a data source of each row of data of the target data.

In this embodiment, the target data further includes an index identifier that uniquely characterizes each row of data of the target data;

The generating module 402 further:

In this embodiment, the generating module 402 further:

In this embodiment, when data blood edge tracing needs to be performed on the target data, the apparatus further includes:

a tracing module 403, configured to construct a data blood-edge query instruction for the target data and query the target data based on the data blood-edge query instruction, to obtain the data blood edge traced from the target data back to the source data.

In this embodiment, the third processing data further includes an index identifier that uniquely characterizes each row of data of the third processing data, where the index identifier of the third processing data is obtained by combining the table identifier of the third processing data with a unique identifier generated by a unique identifier algorithm; the first blood-edge identifier points to the index identifier of the third processing data.

In this embodiment, the source data further includes an index identifier that uniquely characterizes each line of data of the source data, the first processed data further includes an index identifier that uniquely characterizes each line of data of the first processed data, and the second processed data further includes an index identifier that uniquely characterizes each line of data of the second processed data;

In this embodiment, the unique identification algorithm is a UUID algorithm or a hash algorithm.

The apparatus, device, module, or unit set forth in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.

Corresponding to the method embodiment described above, the present specification also provides an embodiment of an electronic device. The electronic device may be applied to a data warehouse management system; the electronic device includes: a processor and a memory for storing machine executable instructions; wherein the processor and the memory are typically interconnected by an internal bus. In other possible implementations, the device may also include an external interface to enable communication with other devices or components.

In this embodiment, the processor is caused to, by reading and executing machine-executable instructions stored by the memory corresponding to control logic for data lineage generation:

In this embodiment, the target data further includes an index identifier that uniquely characterizes each row of data of the target data; by reading and executing machine-executable instructions stored by the memory corresponding to control logic for data lineage generation, the processor is caused to:

In this embodiment, when data lineage tracing is required for the target data, the processor is caused to:

Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.

It is to be understood that the present description is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.

The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.

Claims

1. A method of data lineage generation, the method being applied to a data warehouse management system, the method comprising:

generating first processing data of the source data; wherein the first processing data comprises at least the source data and a first blood-edge identifier uniquely characterizing the data source, in the source data, of each row of data of the processing data;

iteratively generating processing data of the second processing data until final third processing data is obtained;

and generating target data corresponding to the source data by taking the third processing data as table data of the target data.

2. The method of claim 1, the target data further comprising an index identification of each row of data that uniquely characterizes the target data;

wherein generating target data corresponding to the source data by taking the third processing data as table data of the target data comprises:

taking the index identifier and the third processing data as table data of the target data, and generating the target data corresponding to the source data.

3. The method of claim 1, when data lineage tracing is required for the target data, further comprising:

4. The method of claim 2, the third processing data further comprising an index identifier uniquely characterizing each row of data of the third processing data, the index identifier of the third processing data being obtained by combining a table identifier of the third processing data with a unique identifier generated by a unique identifier algorithm; the first blood-edge identifier points to the index identifier of the third processing data.

5. The method of claim 2, the source data further comprising an index identification uniquely characterizing each row of data of the source data, the first process data further comprising an index identification uniquely characterizing each row of data of the first process data, the second process data further comprising an index identification uniquely characterizing each row of data of the second process data;

6. The method of claim 4 or 5, wherein the unique identification algorithm is a UUID algorithm or a hash algorithm.

7. A data lineage generation apparatus, the apparatus being applied to a data warehouse management system, the apparatus comprising:

a generation module, configured to generate first processing data of the source data; wherein the first processing data comprises at least the source data and a first blood-edge identifier uniquely characterizing the data source, in the source data, of each row of data of the processing data;

8. An electronic device, comprising a communication interface, a processor, a memory, and a bus, wherein the communication interface, the processor, and the memory are connected to each other through the bus;

the memory stores machine readable instructions, the processor executing the method of any of claims 1 to 6 by invoking the machine readable instructions.

9. A machine-readable storage medium storing machine-readable instructions which, when invoked and executed by a processor, implement the method of any one of claims 1 to 6.