CN109614432B

CN109614432B - System and method for acquiring data blood relationship based on syntactic analysis

Info

Publication number: CN109614432B
Application number: CN201811483550.8A
Authority: CN
Inventors: 苏萌; 刘钰; 张凯; 姜楠; 赵群; 赵丹
Original assignee: Beijing Baifendian Information Science & Technology Co ltd
Current assignee: Beijing Percent Technology Group Co ltd
Priority date: 2018-12-05
Filing date: 2018-12-05
Publication date: 2021-01-05
Anticipated expiration: 2038-12-05
Also published as: CN109614432A

Abstract

The invention discloses a system and a method for acquiring data blood relationship based on syntactic analysis, comprising a data blood relationship analysis server. The data blood relationship analysis server mainly comprises an original operation information input module, a frame analysis module, a lexical analysis module, a syntax analysis module, an intermediate result information generation module, a data blood relationship logic analysis module and a query interface. The system also includes a blood relationship agent plug-in. The system and the method are highly extensible and more efficient.

Description

System and method for acquiring data blood relationship based on syntactic analysis

Technical Field

The invention relates to the technical field of big data oriented data management, in particular to a system and a method for acquiring data blood relationship based on syntactic analysis.

Background

Data governance refers to the process of moving from scattered data to unified master data, from little or no organizational and process governance to comprehensive enterprise-wide data governance, and from struggling with disordered master data to keeping master data in good order. The key to successful data governance lies in metadata management, and the blood relationship of data, as a part of metadata management, is a very important link.

In the field of data governance, the analysis of data blood relationship can be examined at two levels. In the field of relational data, traditional enterprises at the data governance stage analyze the blood relationship of data manually and record the results in Excel spreadsheets as an aid. This approach is inefficient and does not scale. Some IT companies have also developed related software: for example, IBM developed InfoSphere DataStage, which includes a data blood relationship function, and Oracle developed a BI software suite that also includes a data blood relationship function. However, these are client applications that must be installed and deployed on a PC, and they are heavyweight.

In the field of relational data, existing blood relationship analysis software is client based, so it must be installed and deployed on a PC, and the blood relationship analysis data is stored locally on that PC, where it is hard to extend and cannot be traced. In the field of big data, data blood relationship analysis is usually added at the cluster end as a plug-in; however, embedding the blood relationship analysis logic in the cluster may affect the normal performance of the cluster, and the analysis logic becomes coupled with the cluster modules, which makes subsequent extension inconvenient.

The biggest problem in blood relationship analysis is how to analyze data blood relationship accurately and how to refine its granularity, so that the analysis is accurate down to the table level and the field level. Existing industry solutions either limit the granularity of the analysis to the table level, or have low accuracy and cannot analyze the blood relationship of data involved in complex SQL relationships. How to achieve field-level granularity, and how to ensure that the analysis completely covers the blood relationship of the data, is the biggest technical problem.

In current industry practice, software such as Apache Atlas achieves fine granularity in data blood relationship analysis, but it depends heavily on the module functions of the cluster, so its coverage of data blood relationship analysis is limited, and its ability to analyze the blood relationship of autonomous nodes is limited.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a system and a method for acquiring data blood relationship based on syntactic analysis, which have strong expandability and are more efficient.

In order to achieve the purpose, the invention adopts the following technical scheme:

A system for acquiring data blood relationship based on syntactic analysis comprises a data blood relationship analysis server;

the data blood relationship analysis server mainly comprises an original operation information input module, a frame analysis module, a lexical analysis module, a syntax analysis module, an intermediate result information generation module, a data blood relationship logic analysis module and a query interface;

the original operation information input module is used for acquiring original operation information when the big data cluster carries out ETL processing on data, and the original operation information is in a data interaction language form;

the frame analysis module is used for carrying out frame analysis on the original operation information, and comprises the steps of carrying out unit segmentation, classification and grouping on the original operation information to generate frame analysis information;

the lexical analysis module is used for carrying out lexical analysis on the frame analysis information to generate information of each lexical unit;

the syntax analysis module is used for carrying out syntax analysis on the lexical unit information and the frame analysis information to generate abstract syntax tree information;

the intermediate result information generating module is used for traversing each information node of the abstract syntax tree, identifying and analyzing the type of each information node, and analyzing key information comprising a database, a data table and data fields so as to generate intermediate result information;

the data blood relationship logic analysis module is used for performing data blood relationship logic analysis separately for each type of information node, combining the lexical unit information, the abstract syntax tree information and the intermediate result information, to obtain data blood relationship information, wherein the data blood relationship information comprises blood relationship information between data tables and blood relationship information between data fields;

the query interface is used for an external terminal to access and query the data blood relationship information.
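As a rough illustration of the frame analysis described above, the following Python sketch splits a script of original operation information into statement units, classifies each unit by its leading keywords, and groups the units by class. The function names (`split_units`, `classify_unit`, `frame_analysis`) and the classification scheme are assumptions made for illustration; the patent does not specify them.

```python
from collections import defaultdict


def split_units(script: str) -> list[str]:
    """Unit segmentation: split on ';' characters outside single-quoted strings."""
    units, buf, in_quote = [], [], False
    for ch in script:
        if ch == "'":
            in_quote = not in_quote
        if ch == ";" and not in_quote:
            unit = "".join(buf).strip()
            if unit:
                units.append(unit)
            buf = []
        else:
            buf.append(ch)
    tail = "".join(buf).strip()
    if tail:
        units.append(tail)
    return units


def classify_unit(unit: str) -> str:
    """Classification by leading keywords (a hypothetical scheme)."""
    words = unit.upper().split()
    if words[:2] == ["CREATE", "TABLE"]:
        return "create_table"
    if words[:1] == ["INSERT"]:
        return "insert"
    if words[:1] == ["SELECT"]:
        return "select"
    return "other"


def frame_analysis(script: str) -> dict[str, list[str]]:
    """Grouping: collect the segmented units under their class."""
    groups = defaultdict(list)
    for unit in split_units(script):
        groups[classify_unit(unit)].append(unit)
    return dict(groups)
```

Calling `frame_analysis` on a script with one `CREATE TABLE ... AS SELECT` statement and one `INSERT` statement yields two groups, each holding one statement unit.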

Further, the data interaction language comprises the standard SQL language, the Oracle SQL dialect, the Spark SQL dialect and the Phoenix SQL dialect.

Further, the lexical unit information is stored in a lexical dictionary structure that is maintained as internal data inside the data blood relationship analysis server.

Further, the intermediate result information is stored inside the data blood relationship analysis server in dictionary form.

Further, the obtained data blood relationship information is stored in the data blood relationship analysis server or stored in an external storage system.

The system further comprises a blood relationship agent plug-in, which is deployed at the big data cluster end and is used for dynamically acquiring the original operation information when the big data cluster performs ETL processing on data, and for sending the original operation information to the original operation information input module of the data blood relationship analysis server in the form of a data interaction language.

Further, the blood relationship agent plug-in sends the original operation information to the original operation information input module in a real-time and asynchronous mode.
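One way to picture the real-time, asynchronous sending described above is a queue drained by a background thread, so that the cluster-side call returns immediately. The sketch below is only an assumption about how such an agent might be structured; the class name `LineageAgent`, its methods, and the `send` callable are hypothetical and do not come from the patent.

```python
import queue
import threading


class LineageAgent:
    """Hypothetical agent: captures statements and forwards them asynchronously."""

    def __init__(self, send):
        self._send = send              # callable that delivers one statement to the server
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def on_statement(self, sql: str) -> None:
        """Called on the cluster side; returns without waiting for delivery."""
        self._queue.put(sql)

    def _drain(self) -> None:
        while True:
            sql = self._queue.get()
            if sql is None:            # sentinel: stop the worker
                break
            try:
                self._send(sql)
            except Exception:
                pass                   # a delivery failure must not disturb the ETL job

    def close(self) -> None:
        """Flush everything queued so far, then stop the worker thread."""
        self._queue.put(None)
        self._worker.join()
```

Because the ETL job only enqueues the statement, delivery cost and delivery failures stay off the cluster's processing path, which matches the decoupling the text describes.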

A method for acquiring data blood relationship using the above system based on syntactic analysis comprises the following steps:

S1, the original operation information input module acquires original operation information when the big data cluster performs ETL processing on data, wherein the original operation information is in the form of a data interaction language;

S2, the frame analysis module performs frame analysis on the original operation information, including unit segmentation, classification and grouping of the original operation information, to generate frame analysis information;

S3, the lexical analysis module performs lexical analysis on the frame analysis information to generate lexical unit information;

S4, the syntax analysis module performs syntax analysis on the lexical unit information and the frame analysis information to generate abstract syntax tree information;

S5, the intermediate result information generation module traverses each information node of the abstract syntax tree, identifies and analyzes the type of each information node, and extracts key information comprising the database, the data table and the data fields, so as to generate intermediate result information;

S6, the data blood relationship logic analysis module performs data blood relationship logic analysis separately for each type of information node, combining the lexical unit information, the abstract syntax tree information and the intermediate result information, to obtain data blood relationship information, wherein the data blood relationship information comprises blood relationship information between data tables and blood relationship information between data fields.
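The data flow between steps S1 to S6 can be sketched as a chain of stage functions. The sketch below only shows which outputs feed which step (S4 receives both the lexical units and the frame information, and S6 receives the lexical units, the syntax tree and the intermediate results). The function `run_pipeline` and its parameter names are hypothetical, and the stages themselves are passed in as callables because the patent does not fix their implementation.

```python
def run_pipeline(raw_operation, frame, lex, parse, collect, analyze):
    """Chain the stages; raw_operation is the input acquired in S1."""
    frame_info = frame(raw_operation)                  # S2: frame analysis
    lexical_units = lex(frame_info)                    # S3: lexical analysis
    ast = parse(lexical_units, frame_info)             # S4: syntax analysis
    intermediate = collect(ast)                        # S5: intermediate results
    return analyze(lexical_units, ast, intermediate)   # S6: blood relationship analysis


# Stand-in stages, only to show how data moves between the steps.
result = run_pipeline(
    "INSERT INTO b SELECT x FROM a",
    frame=lambda raw: [raw],
    lex=lambda frame_info: [unit.split() for unit in frame_info],
    parse=lambda units, frame_info: {"statements": len(frame_info), "tokens": units},
    collect=lambda ast: {"token_count": sum(len(u) for u in ast["tokens"])},
    analyze=lambda units, ast, inter: (ast["statements"], inter["token_count"]),
)
```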

Further, in step S1, the original operation information produced when the big data cluster performs ETL processing on data is dynamically acquired by a blood relationship agent plug-in deployed at the big data cluster end, and the blood relationship agent plug-in sends the original operation information to the original operation information input module of the data blood relationship analysis server in the form of a data interaction language.

The invention has the beneficial effects that:

1. because the system and the method of the invention generate the data blood relationship through frame analysis, lexical processing, syntax processing and abstract syntax tree processing, and are based on the data interaction language, they are highly extensible and support not only the standard SQL language but also various dialects (such as the Oracle SQL, Spark SQL and Phoenix SQL dialects);

2. the system and method of the present invention are completely decoupled from the cluster environment and have no impact on cluster performance, and the analysis is real-time and asynchronous with respect to the cluster service, and therefore more efficient.

Drawings

FIG. 1 is a schematic structural diagram of a system according to embodiment 1 of the present invention;

FIG. 2 is a schematic flow chart of a method according to embodiment 2 of the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings, and it should be noted that the present embodiment is based on the technical solution, and the detailed implementation and the specific operation process are provided, but the protection scope of the present invention is not limited to the present embodiment.

The terms of art referred to in this embodiment will be briefly explained below.

Blood relationship: here refers specifically to the blood relationship of data (data lineage), which indicates the flow direction of data and the relationships among data during the data governance process. The blood relationship includes the blood relationship of table-level granularity and the blood relationship of field-level granularity.

Example 1

The present embodiment provides a system for acquiring data blood relationship based on syntactic analysis, as shown in fig. 1, including a data blood relationship analysis server;

the original operation information input module is used for acquiring original operation information when a big data cluster (such as Hive) performs ETL (extract, transform, load) processing on data, and the original operation information is in the form of a data interaction language;

Specifically, in addition to the standard SQL language, the data interaction language may be any of various dialects, such as the Oracle SQL, Spark SQL and Phoenix SQL dialects.

Specifically, the lexical unit information may be stored in a lexical dictionary structure that is maintained as internal data inside the data blood relationship analysis server, so as to form a correspondence between the original operation information and the lexical unit information.
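A minimal sketch of such a lexical dictionary, assuming a small regular-expression tokenizer and a plain Python dictionary keyed by a statement identifier, might look as follows. The token classes, the keyword set, and the names `tokenize`, `register` and `lexical_dictionary` are all illustrative assumptions; a real SQL dialect needs a far richer lexer.

```python
import re

# One alternative per lexical class; the first alternative that matches wins.
TOKEN_RE = re.compile(r"""
    (?P<string>'[^']*')
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<word>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<symbol><=|>=|<>|!=|[(),.*=<>+\-/;])
""", re.VERBOSE)

KEYWORDS = {"SELECT", "FROM", "WHERE", "INSERT", "INTO", "CREATE",
            "TABLE", "AS", "JOIN", "ON", "AND", "OR"}


def tokenize(sql: str) -> list[tuple[str, str]]:
    """Return (class, text) pairs; whitespace between tokens is skipped."""
    units = []
    for match in TOKEN_RE.finditer(sql):
        kind, text = match.lastgroup, match.group()
        if kind == "word":
            kind = "keyword" if text.upper() in KEYWORDS else "identifier"
        units.append((kind, text))
    return units


# Lexical dictionary: statement id -> original statement and its lexical units.
lexical_dictionary: dict[str, dict] = {}


def register(statement_id: str, sql: str) -> None:
    lexical_dictionary[statement_id] = {"sql": sql, "units": tokenize(sql)}
```

Keeping the original statement next to its lexical units preserves the correspondence between the original operation information and the lexical unit information that the paragraph above refers to.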

The abstract syntax tree information is a tree-structured description of the original operation information. After syntax analysis, the data blood relationship analysis server holds the basic content of the flow direction information of the data; although the data blood relationship result cannot yet be generated directly, the abstract syntax tree information already provides basic data that can be analyzed preliminarily.
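To make the idea of a tree-structured description concrete, the sketch below parses one very restricted statement shape, `INSERT INTO <table> SELECT <col>[, <col>...] FROM <table>`, into a small tree. The `Node` class, the node kinds, and the function `parse_insert_select` are assumptions for illustration only; the patent does not describe its grammar or node layout.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str
    value: str = ""
    children: list["Node"] = field(default_factory=list)


def parse_insert_select(sql: str) -> Node:
    """Parse 'INSERT INTO t SELECT c1, c2 FROM s' into a small syntax tree."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_.]*|,", sql)
    pos = 0

    def expect(word: str) -> None:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos].upper() != word:
            raise SyntaxError(f"expected {word} at token {pos}")
        pos += 1

    def name() -> str:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] == ",":
            raise SyntaxError(f"expected a name at token {pos}")
        token = tokens[pos]
        pos += 1
        return token

    expect("INSERT")
    expect("INTO")
    target = Node("output_table", name())
    expect("SELECT")
    columns = [Node("column", name())]
    while pos < len(tokens) and tokens[pos] == ",":
        pos += 1
        columns.append(Node("column", name()))
    expect("FROM")
    source = Node("input_table", name())
    select = Node("select", children=[Node("columns", children=columns), source])
    return Node("insert", children=[target, select])
```

The tree already shows the direction of data flow (an input table under the `select` node, an output table under the `insert` node), but the table-to-table and field-to-field relationships still have to be derived from it in later steps.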

Specifically, the intermediate result information may be stored inside the data blood relationship analysis server in the form of a dictionary.
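As an illustration of intermediate result information held in dictionary form, the following sketch walks a syntax tree and records the databases, tables and fields it encounters. The tree is represented here as nested dictionaries, and the node kinds (`input_table`, `output_table`, `column`) and the function `collect_intermediate` are hypothetical choices, not details taken from the patent.

```python
def collect_intermediate(node, result=None):
    """Depth-first traversal that records key information by node type."""
    if result is None:
        result = {"databases": set(), "tables": set(), "fields": set()}
    kind, value = node["kind"], node.get("value", "")
    if kind in ("input_table", "output_table"):
        database = value.rpartition(".")[0]   # text before the last '.', if any
        if database:
            result["databases"].add(database)
        result["tables"].add(value)
    elif kind == "column":
        result["fields"].add(value)
    for child in node.get("children", []):
        collect_intermediate(child, result)
    return result


# Hand-built tree for: INSERT INTO dw.orders SELECT id, amount FROM ods.orders
example_tree = {"kind": "insert", "children": [
    {"kind": "output_table", "value": "dw.orders"},
    {"kind": "select", "children": [
        {"kind": "column", "value": "id"},
        {"kind": "column", "value": "amount"},
        {"kind": "input_table", "value": "ods.orders"},
    ]},
]}
intermediate = collect_intermediate(example_tree)
```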

The data blood relationship logic analysis module is used for performing data blood relationship logic analysis separately for each type of information node, combining the lexical unit information, the abstract syntax tree information and the intermediate result information, to obtain data blood relationship information, wherein the data blood relationship information comprises blood relationship information between data tables and blood relationship information between data fields.

Specifically, the obtained data blood relationship information can be stored inside the data blood relationship analysis server or in an external storage system, so that a wide range of output modes is supported.

Furthermore, the system also comprises a blood relationship agent plug-in, which is deployed at the big data cluster end and is used for dynamically acquiring the original operation information (generally, operation statements) produced when the big data cluster performs ETL processing on data, and for sending the original operation information to the original operation information input module of the data blood relationship analysis server in the form of a data interaction language.

Specifically, a blood relationship agent plug-in for the standard SQL language or for other dialects can be developed as needed to extend language support.

It should be noted that the blood relationship agent plug-in is a pluggable service program that works at the big data cluster end and can be extended and adapted for different types of big data clusters, so it is highly flexible. After adaptation to the big data cluster is completed, the blood relationship agent plug-in can be installed and deployed at the big data cluster end. The big data cluster can perform its ETL work normally, and the blood relationship agent plug-in has no influence on the original work of the cluster. When the big data cluster performs ETL tasks, the way the data is processed reflects the blood relationship flow direction of the data, and the blood relationship agent plug-in deployed at the big data cluster end can dynamically acquire the original operation information as the cluster processes the data.

Example 2

The present embodiment provides a method for acquiring data blood relationship based on syntactic analysis using the system of embodiment 1, as shown in fig. 2, including the following steps:

Further, the specific process of step S6 is:

S6.1, taking the abstract syntax tree information as input, traversing level by level from the root node of the syntax tree until the lowest-level leaf nodes are found, and then proceeding to step S6.2;

The whole analysis of the abstract syntax tree is a backtracking process: leaf nodes are found from top to bottom, and the analysis then proceeds from bottom to top.

S6.2, combining the lexical unit information, classifying the information type of each leaf node, wherein different types correspond to different algorithm processing logic; the types comprise a table building type, a condition type, an input table type, an output table type and an input/output field type. For example, a leaf node of the table building type is handled by its own algorithm processing logic, and a leaf node of the condition type is handled by its own algorithm processing logic.

S6.3, according to the information type classification result of the leaf nodes in step S6.2, analyzing the leaf nodes of the table building type or the output table type in combination with the intermediate result information to obtain the downstream information content of the data blood relationship;

S6.4, according to the information type classification result of the leaf nodes in step S6.2, analyzing the leaf nodes of the input table type in combination with the intermediate result information to obtain the upstream information content of the data blood relationship;

S6.5, according to the information type classification result of the leaf nodes in step S6.2, analyzing the leaf nodes of the input/output field type in combination with the intermediate result information to obtain the field information content of the data blood relationship;

S6.6, integrating, by logic processing, the downstream information content, the upstream information content and the field information content of the data blood relationship generated in steps S6.3, S6.4 and S6.5 into the final data blood relationship information.
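Steps S6.1 to S6.6 can be sketched roughly as follows. The sketch assumes a tree of nested dictionaries whose leaves are already tagged with one of the five types, and it assumes that the intermediate result information contains a table alias map used to resolve names. The leaf kinds, the `aliases` key, and the functions `leaves`, `resolve_table`, `resolve_field` and `analyze` are hypothetical; the patent names the leaf types but not their processing logic.

```python
def leaves(node):
    """S6.1: descend from the root and yield the lowest-level leaf nodes."""
    children = node.get("children", [])
    if not children:
        yield node
    for child in children:
        yield from leaves(child)


def resolve_table(name, intermediate):
    """Replace a table alias by its full name when the alias is known."""
    return intermediate["aliases"].get(name, name)


def resolve_field(name, intermediate):
    """Resolve 'alias.column' to 'full.table.column'."""
    table, _, column = name.rpartition(".")
    return f"{resolve_table(table, intermediate)}.{column}"


def analyze(ast, intermediate):
    downstream, upstream, field_edges = [], [], []
    for leaf in leaves(ast):
        kind = leaf["kind"]                                   # S6.2: classify by type
        if kind in ("create_table", "output_table"):          # S6.3: downstream
            downstream.append(resolve_table(leaf["value"], intermediate))
        elif kind == "input_table":                           # S6.4: upstream
            upstream.append(resolve_table(leaf["value"], intermediate))
        elif kind == "io_field":                              # S6.5: field level
            out_field, in_field = leaf["value"]
            field_edges.append((resolve_field(in_field, intermediate),
                                resolve_field(out_field, intermediate)))
        # condition leaves add no edges in this sketch
    # S6.6: integrate into table-level and field-level (source, target) edges
    table_edges = [(src, dst) for dst in downstream for src in upstream]
    return {"tables": table_edges, "fields": field_edges}


# Hand-built tree for: INSERT INTO dw.order_totals t SELECT o.amount ... FROM ods.orders o
ast = {"kind": "insert", "children": [
    {"kind": "output_table", "value": "t"},
    {"kind": "select", "children": [
        {"kind": "io_field", "value": ("t.total", "o.amount")},
        {"kind": "input_table", "value": "o"},
        {"kind": "condition", "value": "o.amount > 0"},
    ]},
]}
intermediate = {"aliases": {"o": "ods.orders", "t": "dw.order_totals"}}
lineage = analyze(ast, intermediate)
```

Each edge is written as a (source, target) pair, so the upstream table or field comes first and the downstream table or field comes second.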

Further, the data blood relationship information obtained in step S6 is stored inside the data blood relationship analysis server or in an external storage system, and is accessed and queried through the query interface.
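A query interface over stored blood relationship information could, for instance, answer "what is upstream of this table or field" and "what is downstream of it". The in-memory store below is a minimal sketch under that assumption; the class `LineageStore` and its methods are hypothetical, and the patent leaves both the storage system and the interface design open.

```python
from collections import defaultdict, deque


class LineageStore:
    """Hypothetical in-memory store of (source, target) blood relationship edges."""

    def __init__(self):
        self._down = defaultdict(set)   # source -> direct targets
        self._up = defaultdict(set)     # target -> direct sources

    def add_edge(self, source: str, target: str) -> None:
        self._down[source].add(target)
        self._up[target].add(source)

    def _reach(self, start: str, graph) -> set:
        """Breadth-first search for every node reachable from start."""
        seen, todo = set(), deque([start])
        while todo:
            for nxt in graph.get(todo.popleft(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return seen

    def upstream(self, name: str) -> set:
        """All direct and indirect sources of a table or field."""
        return self._reach(name, self._up)

    def downstream(self, name: str) -> set:
        """All direct and indirect targets of a table or field."""
        return self._reach(name, self._down)
```

The same store works for table-level and field-level names, since both are just node identifiers in the edge set.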

Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims

1. A system for acquiring data blood relationship based on syntactic analysis, characterized by comprising a data blood relationship analysis server;

the query interface is used for an external terminal to access and query the data blood relationship information;

wherein,

the system further comprises a blood relationship agent plug-in,

the blood relationship agent plug-in is deployed at the big data cluster end and is used for dynamically acquiring the original operation information when the big data cluster performs ETL (extract, transform, load) processing on data, and for sending the original operation information to the original operation information input module of the data blood relationship analysis server in the form of a data interaction language;

the blood relationship agent plug-in sends the original operation information to an original operation information input module in a real-time and asynchronous mode;

wherein,

the lexical unit information is stored in a lexical dictionary structure for maintaining internal data in the data blood relationship analysis server;

the intermediate result information is stored in a data blood relationship analysis server in a dictionary form;

the obtained data blood relationship information is stored in the data blood relationship analysis server or an external storage system.

2. The system for acquiring data blood relationship based on syntactic analysis according to claim 1, wherein the data interaction language comprises the standard SQL language, the Oracle SQL dialect, the Spark SQL dialect and the Phoenix SQL dialect.

3. A method for acquiring data blood relationship using the system based on syntactic analysis according to any one of claims 1-2, comprising the steps of: