CN112579625A - Multi-source heterogeneous data treatment method and device - Google Patents

Publication number
CN112579625A
Authority
CN
China
Prior art keywords
data
class
query
source
metadata
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011044826.XA
Other languages
Chinese (zh)
Inventor
王济平
黎刚
汤克云
周健雄
杨劲业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingxin Data Technology Co ltd
Original Assignee
Jingxin Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingxin Data Technology Co ltd
Priority to CN202011044826.XA
Publication of CN112579625A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/31 Programming languages or programming paradigms
    • G06F8/315 Object-oriented languages

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a multi-source heterogeneous data governance method and device. The method comprises the following steps: step S1, constructing a query engine that connects to different types of databases in a uniform way and performs associated queries using a standardized SQL syntax; step S2, receiving a query instruction, connecting to the query engine service according to the received data source connection information, performing a data query according to the received SQL statement, and returning a query result; and step S3, receiving a processing instruction, processing the query result according to the received processing instruction, and outputting the processing result. The invention meets the requirements of multi-source heterogeneous data governance, greatly simplifies the data governance workflow, and reduces the operation and maintenance cost and time of governance tasks.

Description

Multi-source heterogeneous data treatment method and device
Technical Field
The invention relates to the field of multi-source heterogeneous data processing, in particular to a multi-source heterogeneous data treatment method and device.
Background
At present, in the drag-and-drop, component-based governance platforms offered for big data, the functions required during governance, such as input, output, conversion, field splitting, encryption and standard conversion, are all configured by visual drag and drop, so that governance tasks are configured in a "what you see is what you get" manner. The mainstream big data governance products currently connect to a database only through a single data source, and data governance work such as query, conversion and processing is carried out on the tables of that single configured database. If related data in the tables of several databases must be jointly queried, analysed and governed at the same time, several data governance tasks have to be configured and intermediate database tables have to be used as a transition before the data governance work can be completed.
When multi-source data is governed through the usual single-source database connection component, a multi-step governance task has to be configured according to the logical relationships in the data: data governance engineers must configure the individual governance tasks one after another in that order, which brings extra workload for the implementers. Moreover, because several tasks are executed in parallel, it may happen that some governance tasks run while other configured governance tasks fail to run normally, so the quality of data governance is difficult to guarantee. A further problem is that, because the governance requirement is split into several steps according to the data relationship logic and several governance tasks run at the same time, operation and maintenance personnel who do not understand the data logic often cannot judge whether the task execution is reasonable, which greatly increases the operation and maintenance workload and time cost.
In the drag-and-drop, component-based governance process, when the result of a cross-table associated query over several heterogeneous databases has to be extracted into a single table of another database, a data source connection component must be created separately for each data source. First, several data migration tasks are created: a separate extraction task is configured for each type of database involved, and the data of the tables to be queried is extracted into one temporary intermediate database. A data source connection component is then created for the intermediate database, the associated query and analysis is carried out within that single database, and finally the query and analysis results are extracted into the target database table.
The whole process requires configuring a data source connection component for each database, configuring several extraction tasks according to the database types involved, temporarily storing data in intermediate database tables, and finally writing the intermediate data into the target table. The work can only be completed through many steps and tasks, so overall data governance efficiency is low. It is therefore necessary to develop a governance mode that can connect to several heterogeneous databases, provide a unified standard SQL syntax for cross-table associated queries, and govern the data in a unified manner.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a multi-source heterogeneous data governance method and device that can connect and configure several heterogeneous databases together, provide a unified standard SQL syntax for cross-table associated queries, and govern the data in a unified manner.
In order to achieve this purpose, the invention adopts the following technical solution: a multi-source heterogeneous data governance method, comprising:
step S1, constructing a query engine for uniformly connecting different types of databases and carrying out association query by using a standardized SQL grammar;
step S2, receiving a query instruction, connecting to the service of a query engine according to the received connection information of the data source, performing data query according to the received SQL statement and returning a query result;
and step S3, receiving the processing instruction, correspondingly processing the query result according to the received processing instruction and outputting the processing result.
Step S1 includes:
step S11, analyzing the metadata of the data source and loading the metadata into the memory;
step S12, converting and encapsulating the SQL syntax corresponding to the data source to obtain a directed acyclic graph SqlNode, and verifying the SqlNode against the metadata;
step S13, cutting the SqlNode to obtain the summary program Extract, taking the SqlNode root node to form the SQL statement to be executed, and then selecting the corresponding computing engine;
step S14, loading the Java code required by the computing engine rules, generated from the summary program Extract, into a memory container;
and step S15, traversing and optimizing the Java code generated in the previous step, converting it into class code, executing the class code, encapsulating the data, and loading the resulting data into memory.
Step S11 includes:
specifying a corresponding database and database table factory class for each data source, for identifying the information of the database and its tables;
generating a JSON file from the metadata of the data source;
parsing the metadata in the JSON file through a dynamic data source management framework;
and creating dynamic factory class information from the parsed metadata, and loading the database type, connection information, connection parameters and database table structure information of the data source into memory.
Step S12 includes:
configuring and generating a plan executor Planner with the dynamic data source management framework, based on the factory class information;
converting, through the plan executor Planner, the SQL syntax of the various database types using an abstract syntax tree algorithm, and encapsulating the conversion result in a Java class to obtain an unverified directed acyclic graph class SqlNode;
performing syntax verification on the SqlNode against the metadata of the data source through the plan executor Planner;
and obtaining a preliminary execution plan by semantic analysis of the SqlNode, performing syntactic analysis in combination with the metadata to read out the relevant information, and, through filter pushdown, moving all filter conditions that follow WHERE into the JOIN ON clause.
Step S13 includes:
creating a directed acyclic graph cutting class QProcedure based on the SqlNode obtained in step S12;
constructing an SQL line that traverses the SqlNode tree class, cutting the SqlNode tree class, returning the final cut sub-tree and its alias, obtaining the sub-tree synchronizer subtreeSynlinker after cutting, and decomposing the cut SqlNode into a plurality of processing programs;
converting the returned fields, the corresponding data source connection parameters and the like obtained while decomposing into the plurality of processing programs into summary programs Extract;
converting, by the cut QProcedure, the SqlNode root node into a transmission program to obtain the SQL statement to be executed;
constructing an engine-based syntax analysis algorithm from the obtained SQL statement, and automatically selecting the underlying computing engine according to the minimum-time principle; the computing engines are the Spark and Flink computing engines.
Step S14 includes:
creating, based on the selected computing engine, a dedicated Pipeline that connects all steps of the execution;
assembling the Java class code required by the computing engine with the Pipeline, and returning it to the API calls of the initialization stage;
and generating the Java code required by the computing engine rules through the summary program Extract: placing the obtained data source information, connection information and table field information into the temporary table required by the computing engine, generating Java code for the cut QProcedure in the packaging order, splicing the generated Java code through the API, and loading the spliced Java code into a memory container.
Step S15 includes:
creating a theme class, reading the Java code generated in the previous step from memory, performing traversal optimization, generating the final class, and loading it into the cache;
and executing the class code in the cache, executing the multi-source heterogeneous logic through a method in the executor, encapsulating the data, and finally loading the resulting data into memory.
Step S1 further includes:
and step S16, the query engine provides a TCP access service using the Netty architecture; a Netty client that interfaces with the TCP service of the query engine is built at the bottom layer of the driver, so that a unified connection to the query engine is established using a unified database link URL, a unified database driver class and a method for creating multi-source metadata; finally the driver is packaged into a jar driver package.
Step S2 includes:
step S21, setting a metadata class, and configuring the connection information and SQL statements of the data source set into the metadata class;
step S22, assembling and configuring corresponding connection information, user name and password according to different database types, and packaging the connection information into metadata json;
step S23, configuring address information and port of the query engine for Connection based on the metadata json, and acquiring Connection class of the query engine;
and step S24, based on the Connection class of the query engine, executing the SQL statement to obtain a result set, acquiring metadata information of the result set, and packaging into a returned query result.
Step S2 is packaged into a jar package to form a callable component.
The processing instruction of step S3 includes at least commands of character replacement, data verification, sorting, field selection, conversion, encryption.
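The processing commands named above can be pictured as small functions applied in sequence to the query result. The following is a minimal sketch only: the row layout, the command names and the `run_instructions` helper are assumptions made for illustration, and a one-way SHA-256 digest stands in for the encryption command, which the text does not specify.

```python
import hashlib

# Hypothetical query result as it might arrive from step S2: a list of row dicts.
rows = [
    {"id": 2, "name": "li  gang", "phone": "13800000002"},
    {"id": 1, "name": "wang ping", "phone": "13800000001"},
    {"id": 3, "name": "", "phone": "13800000003"},
]

def replace_chars(rows, field, old, new):
    """Character replacement on one field."""
    return [{**r, field: r[field].replace(old, new)} for r in rows]

def verify_not_empty(rows, field):
    """Data verification: keep only rows whose field is non-empty."""
    return [r for r in rows if r[field]]

def sort_by(rows, field):
    """Sorting by one field, ascending."""
    return sorted(rows, key=lambda r: r[field])

def select_fields(rows, fields):
    """Field selection: keep only the listed fields."""
    return [{f: r[f] for f in fields} for r in rows]

def hash_field(rows, field):
    """Stand-in for the encryption command (assumption): one-way SHA-256 digest."""
    return [{**r, field: hashlib.sha256(r[field].encode()).hexdigest()} for r in rows]

# Assumed command names; each processing instruction is modelled as (command, arguments).
COMMANDS = {
    "replace": replace_chars,
    "verify": verify_not_empty,
    "sort": sort_by,
    "select": select_fields,
    "hash": hash_field,
}

def run_instructions(rows, instructions):
    """Apply the instructions one after another to the query result."""
    for name, args in instructions:
        rows = COMMANDS[name](rows, *args)
    return rows

result = run_instructions(rows, [
    ("replace", ("name", "  ", " ")),
    ("verify", ("name",)),
    ("sort", ("id",)),
    ("select", (["id", "name", "phone"],)),
    ("hash", ("phone",)),
])
```

Because every command takes rows and returns rows, a single task chain can combine them in any order.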
The governance method further includes step S4 of receiving a timed scheduling command, which is used to execute steps S2 and S3 on a schedule.
The invention also discloses an electronic device, comprising:
a processor; and
a memory having computer readable instructions stored thereon which, when executed by the processor, implement the above method.
The invention also discloses a computer-readable storage medium on which a computer program is stored, which computer program, when being executed by a processor, carries out the above method.
Compared with the prior art, the invention has the following beneficial effects: the query engine can connect to and configure several kinds of databases at the same time and supports cross-database, cross-table queries; queries can be issued uniformly through standard SQL statements; and the query result can be governed through a single data governance task chain. The requirements of multi-source heterogeneous data governance are thereby met, the data governance workflow is greatly simplified, and the operation and maintenance cost and time of governance tasks are reduced.
Drawings
FIG. 1 is a flow chart of a multi-source heterogeneous data governance method of the invention.
FIG. 2 is a flowchart of step S1 of the multi-source heterogeneous data governance method of the present invention.
FIG. 3 is a flowchart of step S2 of the multi-source heterogeneous data governance method according to the present invention.
FIGS. 4-8 are screenshots of the multi-source heterogeneous data governance interface of the present invention.
It should be noted that the products shown in the above figures have been reduced or enlarged as appropriate for the size of the drawing and the clarity of the view; the figures do not limit the size of the products shown.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the disclosure can be practiced without one or more of the specific details, or with other methods, components, materials, devices, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in the form of software, or in one or more software-hardened modules, or in different networks and/or processor devices and/or microcontroller devices.
This embodiment is a multi-source heterogeneous data governance method in which the multi-source heterogeneous data to be governed is connected and configured in a unified way, queried in a unified way, and finally processed by a single governance task, thereby meeting the need to connect several tables in several types of databases and to query and govern the data jointly.
As shown in fig. 1, the multi-source heterogeneous data governance method includes: step S1, constructing a query engine for uniformly connecting different types of databases and carrying out association query by using a standardized SQL grammar; step S2, receiving a query instruction, connecting to the service of a query engine according to the received connection information of the data source, performing data query according to the received SQL statement and returning a query result; and step S3, receiving the processing instruction, correspondingly processing the query result according to the received processing instruction and outputting the processing result.
With this multi-source heterogeneous data governance method, the query engine can connect to and configure several kinds of databases at the same time and supports cross-database, cross-table queries; queries can be issued uniformly through standard SQL statements; and the query result can be governed through a single data governance task chain. The requirements of multi-source heterogeneous data governance are thereby met, the data governance workflow is greatly simplified, and the operation and maintenance cost and time of governance tasks are reduced.
The following describes each step of the multi-source heterogeneous data governance method in detail.
Step S1 constructs a multi-source heterogeneous SQL query engine. Based on this query engine, single-source single-table or multi-source multi-table cross-database associated query and analysis can be performed with standard SQL on different types of databases, both relational and non-relational, such as MS-SQL, Oracle, MySQL, Sybase, DB2, Redis, MongoDB, HBase, ES, Impala and HAWQ. Through the uniformly packaged driver package of the multi-source heterogeneous SQL query engine, the end user can flexibly query multi-source data over a unified connection.
As shown in FIG. 2, step S1 includes the following steps: step S11, parsing the metadata of the data source and loading the metadata into memory; step S12, converting and encapsulating the SQL syntax corresponding to the data source to obtain a directed acyclic graph SqlNode, and verifying the SqlNode against the metadata; step S13, cutting the SqlNode to obtain the summary program Extract, taking the SqlNode root node to form the SQL statement to be executed, and then selecting the corresponding computing engine; step S14, loading the Java code required by the computing engine rules, generated from the summary program Extract, into a memory container; step S15, traversing and optimizing the Java code generated in the previous step, converting it into class code, executing the class code, encapsulating the data, and loading the resulting data into memory; and step S16, packaging the unified driver.
Step S1 parses, verifies, converts and packages the SQL syntax of different types of data sources into a unified computing task, so that the multi-source heterogeneous databases to be connected can be configured quickly, and the user can quickly connect to and query the various data sources through a unified connection, without performing any data source configuration at the lower layers. The user can query and analyse freely according to business needs, and the method suits flexible configuration and connection of various databases. The user does not need to operate the underlying software, associated query and analysis of multi-source data is easy to achieve, and the need for fast access to, and query and analysis of, various data sources in big data scenarios can be met.
Before step S11, the different types of databases to be connected are preset, so that unified SQL query and analysis can be applied to them. Specifically, a database may be relational or non-relational, including but not limited to MS-SQL, Oracle, MySQL, Sybase, DB2, Redis, MongoDB, HBase, ES, Impala and HAWQ. The SQL syntax of each database type must be parsed and adapted.
Step S11 initializes the parsing task according to the standard SQL syntax. Specifically, in step S11, a corresponding database and database table factory class is first specified for each data source, to identify the information of the database and its tables. A JSON file is then generated from the metadata of the data source. JSON (JavaScript Object Notation) is a lightweight data exchange format that is easy for people to read and write and easy for machines to parse and generate. The metadata of the data source comprises the database table structure information and the database account, password, type and address. The metadata in the JSON file is then parsed by the dynamic data source management framework; a commonly used framework of this kind is Apache Calcite. Dynamic factory class information is created from the parsed metadata, and finally the database type, connection information, connection parameters and database table structure information of the data source are loaded into memory, to be read and called in the subsequent steps.
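The metadata loading of step S11 can be sketched as follows. This is an illustration under assumptions rather than the framework's actual mechanism: the JSON key names, the `DataSource` classes and the `FACTORIES` registry are all hypothetical, and a plain dictionary stands in for the in-memory store.

```python
import json
from dataclasses import dataclass, field

# Hypothetical metadata document; the key names are assumptions, not the patent's schema.
METADATA_JSON = """
{
  "sources": [
    {"alias": "crm", "type": "mysql", "address": "10.0.0.1:3306",
     "user": "reader", "password": "secret",
     "tables": {"customer": ["id", "name"]}},
    {"alias": "logs", "type": "mongodb", "address": "10.0.0.2:27017",
     "user": "reader", "password": "secret",
     "tables": {"visit": ["customer_id", "ts"]}}
  ]
}
"""

@dataclass
class DataSource:
    alias: str
    db_type: str
    address: str
    user: str
    password: str
    tables: dict = field(default_factory=dict)

class RelationalSource(DataSource):
    kind = "relational"

class DocumentSource(DataSource):
    kind = "non-relational"

# One factory class per database type (hypothetical mapping).
FACTORIES = {"mysql": RelationalSource, "mongodb": DocumentSource}

def load_metadata(text):
    """Parse the metadata JSON and build an in-memory registry keyed by alias."""
    registry = {}
    for entry in json.loads(text)["sources"]:
        factory = FACTORIES[entry["type"]]
        registry[entry["alias"]] = factory(
            alias=entry["alias"],
            db_type=entry["type"],
            address=entry["address"],
            user=entry["user"],
            password=entry["password"],
            tables=entry["tables"],
        )
    return registry

REGISTRY = load_metadata(METADATA_JSON)
```

Later steps would look up `REGISTRY["crm"].tables` to check that a table or field referenced in a query actually exists.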
Step S12 performs standard SQL parsing, verification and optimization. Specifically, in step S12, a plan executor Planner is first configured and generated with the dynamic data source management framework (such as Apache Calcite), based on the dynamic factory class information created in the previous step. The plan executor Planner is a Java class. Through the Planner, the SQL syntax of the various database types is then converted using a general abstract syntax tree (AST) algorithm, and the conversion result is encapsulated in a Java class to obtain an unverified directed acyclic graph class SqlNode (an abstract syntax class carrying the directed acyclic graph (DAG) object information). The Planner then performs syntax verification on the SqlNode against the metadata of the data source. The verification comprises checking whether the database tables, fields, functions and other items referenced in the SqlNode object actually exist; after the SqlNode has been verified successfully, the SqlNode object information is returned again, marking the completion of syntax verification. Next, semantic analysis is performed on the SqlNode to obtain a preliminary execution plan, and syntactic analysis is performed in combination with the metadata obtained in step S11 to read information such as the data source connection mode, the data to be returned by SELECT, the WHERE filter conditions and the tables to be scanned. Through filter pushdown, all filter conditions that follow WHERE are then moved into the JOIN ON clause. This optimization can greatly reduce the amount of data entering the join and improves SQL execution performance.
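The effect of moving WHERE predicates into the JOIN ON clause can be checked on a small example. The sketch below uses Python's built-in sqlite3 as a stand-in database; the tables, the data and the string-building `push_down` helper are assumptions for illustration and do not reproduce how the Planner rewrites a SqlNode. The two forms are equivalent for inner joins such as this one; for outer joins, moving a predicate into ON can change the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'A', 'south'), (2, 'B', 'north'), (3, 'C', 'south');
    INSERT INTO orders VALUES (1, 10.0), (1, 250.0), (2, 300.0), (3, 120.0);
""")

def push_down(select_list, left, right, on, where_predicates):
    """Build an inner-join query whose filter predicates all sit in the ON clause."""
    on_clause = " AND ".join([on] + where_predicates)
    return f"SELECT {select_list} FROM {left} JOIN {right} ON {on_clause}"

predicates = ["c.region = 'south'", "o.amount > 100"]

# Form before the rewrite: filters after WHERE.
original = (
    "SELECT c.name, o.amount FROM customer c JOIN orders o "
    "ON c.id = o.customer_id WHERE " + " AND ".join(predicates)
)

# Form after the rewrite: the same filters inside JOIN ... ON.
rewritten = push_down(
    "c.name, o.amount", "customer c", "orders o",
    "c.id = o.customer_id", predicates,
)

rows_original = sorted(conn.execute(original).fetchall())
rows_rewritten = sorted(conn.execute(rewritten).fetchall())
```

Both queries return the same rows here; the benefit described in the text comes from an engine being able to filter each side before the join is computed.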
Step S13 performs SQL cutting and conversion and selects a computing engine. Specifically, in step S13, a directed acyclic graph cutting class QProcedure is created based on the optimized SqlNode obtained in step S12. An SQL line that traverses the SqlNode tree class is then constructed to cut the SqlNode tree class; the final cut sub-tree and its alias are returned, the sub-tree synchronizer subtreeSynechotor is obtained after cutting, and the cut SqlNode is decomposed into a plurality of processing programs. The fields returned during this decomposition, the corresponding data source connection parameters and the like are converted into a summary program Extract, which temporarily stores this information. If there are several data sources, several summary programs Extract are generated. If the query contains a sub-query, a separate summary program Extract is created for the sub-query even when it uses the same data source. The cut QProcedure converts the SqlNode root node into a transmission program to obtain the SQL statement to be executed. Finally, an adapter channel is created, a syntax analysis algorithm for the computing engines is constructed from the obtained SQL statement and run to analyse time consumption, and the best underlying computing engine is selected automatically according to the minimum-time principle. The underlying computing framework offers two computing engines to choose from, Spark and Flink.
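Two ideas from step S13 can be sketched briefly: one Extract per scanned source or sub-query, and engine selection by smallest estimated time. Everything concrete below is an assumption: the `Extract` fields, the scan list and especially the two cost functions, whose numbers are invented for illustration and say nothing about the real behaviour of Spark or Flink.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Extract:
    """Temporary record of what one processing program needs from one source."""
    source: str
    table: str
    fields: tuple

# Hypothetical result of cutting a query: (source alias, table, fields needed).
# The third scan is a sub-query on a source already used, so it still gets its own Extract.
scans = [
    ("crm", "customer", ("id", "name")),
    ("logs", "visit", ("customer_id", "ts")),
    ("crm", "customer", ("id",)),
]
extracts = [Extract(*scan) for scan in scans]

# Invented cost models (seconds): a start-up cost plus a per-row cost.
def estimate_spark(rows):
    return 2.0 + rows * 0.000001

def estimate_flink(rows):
    return 0.5 + rows * 0.000004

ENGINES = {"spark": estimate_spark, "flink": estimate_flink}

def select_engine(estimated_rows):
    """Minimum-time principle: pick the engine with the smallest estimated time."""
    return min(ENGINES, key=lambda name: ENGINES[name](estimated_rows))

small_job = select_engine(1_000)
large_job = select_engine(10_000_000)
```

With these made-up numbers the small job goes to "flink" and the large one to "spark"; in a real system the estimates would come from the analysis run the text describes.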
Step S14 handles computing task packaging, submission and execution. Specifically, in step S14, a dedicated Pipeline is first created, based on the Spark or Flink computing engine selected in step S13, to connect all steps of the execution. The Pipeline is then used to assemble the Java class code required by the computing engine and return it to the various API calls of the initialization stage. An import package class is encapsulated to import the Java packages that must be called when the Java code is executed. The Java code required by the computing engine rules is generated through the summary program Extract produced in the previous step. The data source information, connection information, table field information and the like obtained in the previous step are placed into the temporary table required by the computing engine; the temporary table contains the elements needed by ordinary Java code: import, class, inner_class, method, sensor. These elements are packaged in that order. Based on the cut QProcedure obtained in the previous step, Java code is generated for the QProcedure in the packaging order, and the generated Java code is spliced through an API and loaded into a memory container.
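The fixed packaging order can be illustrated with a small splicing helper. This sketch only shows ordered assembly: the fragment contents are placeholders, the `splice` function is hypothetical, and the text does not say what the "sensor" element contains.

```python
# Packaging order named in the text; the meaning of "sensor" is not specified there.
PACKAGING_ORDER = ["import", "class", "inner_class", "method", "sensor"]

def splice(temp_table):
    """Join code fragments in the fixed packaging order; absent elements are skipped."""
    unknown = set(temp_table) - set(PACKAGING_ORDER)
    if unknown:
        raise ValueError(f"unknown code elements: {sorted(unknown)}")
    return "\n".join(
        temp_table[name] for name in PACKAGING_ORDER if name in temp_table
    )

# Hypothetical fragments, deliberately supplied out of order; contents are placeholders.
temp_table = {
    "method": "// method fragment",
    "import": "// import fragment",
    "sensor": "// sensor fragment",
    "class": "// class fragment",
    "inner_class": "// inner_class fragment",
}

generated = splice(temp_table)
```

Keeping the order in one list means the generator cannot emit, for example, a method fragment before the imports it depends on.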
Step S15 generates the cached class, encapsulates the data and writes the data into memory. Specifically, in step S15, a topic class is first created to read the Java code generated in step S14 from memory; traversal optimization is then performed to generate the final class, which is loaded into the cache. The class code in the cache is then executed, the multi-source heterogeneous logic is executed through a method in the executor, the data is encapsulated, the task is submitted to the computing component cluster to run, and the resulting data is finally loaded into memory. When a query is made, the query result data is obtained by reading the data in memory. The multi-source heterogeneous logic is executed as follows: the core execute method of the class code in the cache is executed; the data source information is accessed, and if there are several data sources they are accessed one after another (in no fixed order), and on each access the Spark/Flink computing engine initializes the data of that source according to the previously formed summary program Extract and creates a Spark/Flink temporary table; the Spark/Flink computing engine runs the SQL query over the temporary tables formed from the data sources, using the SQL statement finally optimized in step S13; and the query result is encapsulated as a data set, placed in memory and returned.
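The execution flow described above (load each source into a temporary table, then run one SQL statement over all of them) can be sketched with in-memory SQLite databases. This is a stand-in only: SQLite replaces both the heterogeneous sources and the Spark/Flink engine, and the table names, data and `register_temp_table` helper are assumptions.

```python
import sqlite3

def make_source(script):
    """Create one independent in-memory database acting as a data source."""
    db = sqlite3.connect(":memory:")
    db.executescript(script)
    return db

# Two separate databases standing in for heterogeneous sources.
sources = {
    "crm": make_source(
        "CREATE TABLE customer (id INTEGER, name TEXT);"
        "INSERT INTO customer VALUES (1, 'A'), (2, 'B');"
    ),
    "sales": make_source(
        "CREATE TABLE orders (customer_id INTEGER, amount REAL);"
        "INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);"
    ),
}

# Hypothetical Extracts: (source alias, table, fields to load).
extracts = [
    ("crm", "customer", ["id", "name"]),
    ("sales", "orders", ["customer_id", "amount"]),
]

engine = sqlite3.connect(":memory:")  # stand-in for the computing engine session

def register_temp_table(engine, source_db, table, fields):
    """Copy the needed fields of one source table into an engine temporary table."""
    cols = ", ".join(fields)
    engine.execute(f"CREATE TEMP TABLE {table} ({cols})")
    rows = source_db.execute(f"SELECT {cols} FROM {table}").fetchall()
    placeholders = ", ".join("?" for _ in fields)
    engine.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)

for alias, table, fields in extracts:
    register_temp_table(engine, sources[alias], table, fields)

# One SQL statement over the temporary tables built from both sources.
final_sql = (
    "SELECT c.name, SUM(o.amount) FROM customer c "
    "JOIN orders o ON c.id = o.customer_id "
    "GROUP BY c.name ORDER BY c.name"
)
result_set = engine.execute(final_sql).fetchall()
```

The caller never touches the two source connections directly; it only sees the result set produced by the single cross-source statement.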
Step S16 packages the driver. The constructed query engine uses the Netty architecture to provide a TCP access service to the outside, covering operations such as establishing a connection, executing a query, returning data and returning source data. Specifically, in step S16, a Netty client that interfaces with the Netty TCP service of the query engine is built at the bottom layer of the driver, so that a unified connection to the query engine is established using a unified database link URL (jdbc:jxbd://<query engine IP>:<port>), a unified database driver class (com.jxbd.sql.driver) and a method for creating multi-source metadata; finally the driver is packaged into a jar driver package (jxjad-sql-connector.jar). In use, the end user loads the jar driver package locally or remotely and connects to the query engine through the unified service address URL (jdbc:jxbd://<query engine IP>:<port>); the user can then query and analyse the multi-source heterogeneous databases using standard SQL syntax.
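A driver that accepts the unified link URL first has to split it into host and port. The helper below is a hypothetical sketch of that one step using Python's standard URL parser; the function name and error handling are assumptions, and nothing here reflects the internals of the actual driver package.

```python
from urllib.parse import urlsplit

def parse_engine_url(url):
    """Split a URL of the form jdbc:jxbd://<host>:<port> into (host, port)."""
    prefix = "jdbc:"
    if not url.startswith(prefix):
        raise ValueError("URL must start with 'jdbc:'")
    parts = urlsplit(url[len(prefix):])
    if parts.scheme != "jxbd":
        raise ValueError("unsupported sub-protocol: " + parts.scheme)
    if parts.hostname is None or parts.port is None:
        raise ValueError("URL must contain both host and port")
    return parts.hostname, parts.port

# Example address; the IP and port are made up.
host, port = parse_engine_url("jdbc:jxbd://10.0.0.5:9000")
```

Rejecting other sub-protocols early mirrors how a JDBC driver declines URLs that are not meant for it.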
Step S2 is used to receive the query instruction, and to connect, query and return the query result according to the instruction. Specifically, as shown in fig. 3, step S2 includes steps S21 to S24. In step S21, a metadata class (jxbdlnputmeta) is set, and the connection information of the data source set and the SQL statement are initially configured in the metadata class. The connection information of a data source includes an IP, a database type, a port, a database name, a database alias, a user name, and a password. Step S22 encapsulates the input data source information based on the MakeMeta class of the query engine driver package. In step S22, the corresponding connection information, user name and password are assembled and configured according to the different database types, and the connection information is encapsulated into metadata json (string format) by using the filejdbc method of MakeMeta. Step S23 connects to the query engine through the query engine driver package. In step S23, the driver class com.jxbd.sql.jxdriver of the driver package is added, and then, based on the metadata json, the address information and the port of the query engine are configured for connection, and the java.sql.Connection object of the query engine is obtained. In step S24, based on the java.sql.Connection object of the query engine, the SQL statement is executed through the execute-query method reached via that connection to obtain a result set; the metadata information of the result set is then obtained by the getMetaData method, and finally the result set and the metadata information are encapsulated into the returned query result.
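Steps S21, S22 and S24 can be sketched as follows. The class QueryClientSketch, the holder SourceInfo, the JSON field names and the hand-rolled encoder are assumptions made for illustration; they are not the jxbdlnputmeta or MakeMeta classes of the driver package. The readAll helper uses only standard JDBC calls (ResultSet.getMetaData, ResultSetMetaData.getColumnCount and getColumnLabel) to package a result set together with its column names.

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class QueryClientSketch {

    // Hypothetical holder for one data source's connection information (step S21).
    static final class SourceInfo {
        final String ip, dbType, dbName, alias, user, password;
        final int port;
        SourceInfo(String ip, String dbType, int port, String dbName,
                   String alias, String user, String password) {
            this.ip = ip; this.dbType = dbType; this.port = port; this.dbName = dbName;
            this.alias = alias; this.user = user; this.password = password;
        }
    }

    // Escapes backslashes and double quotes so the value is safe inside a JSON string.
    static String esc(String v) {
        return v.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static String field(String key, String value) {
        return "\"" + key + "\":\"" + esc(value) + "\"";
    }

    // Step S22 stand-in: encode the connection information as a JSON string.
    static String toJson(SourceInfo s) {
        return "{" + field("ip", s.ip) + "," + field("dbType", s.dbType) + ","
                + "\"port\":" + s.port + "," + field("dbName", s.dbName) + ","
                + field("alias", s.alias) + "," + field("user", s.user) + ","
                + field("password", s.password) + "}";
    }

    // Step S24 stand-in: read every row, keyed by the column labels from the result set metadata.
    static List<Map<String, Object>> readAll(ResultSet rs) throws SQLException {
        ResultSetMetaData meta = rs.getMetaData();
        int columns = meta.getColumnCount();
        List<Map<String, Object>> rows = new ArrayList<>();
        while (rs.next()) {
            Map<String, Object> row = new LinkedHashMap<>();
            for (int i = 1; i <= columns; i++) {   // JDBC columns are 1-based
                row.put(meta.getColumnLabel(i), rs.getObject(i));
            }
            rows.add(row);
        }
        return rows;
    }
}
```

In standard JDBC the ResultSet passed to readAll would come from a Statement created on the java.sql.Connection obtained in step S23.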
When applied, step S2 (steps S21 to S24 and all output classes) may be packaged into a jar package to be added to the system, which automatically recognizes the jar package and treats it as a callable component. The system finally adds such components to the function module as shown in the left column of fig. 4, and automatically calls the method of step S2 when the function module is selected and run.
Step S3 is used to receive a processing instruction, process the query result returned in step S2 according to the received processing instruction, and output the processing result. The processing instructions include at least commands for character replacement, data verification, sorting, field selection, conversion, and encryption.
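Three of these processing commands (character replacement, field selection, and sorting) can be sketched on rows held as string maps. The class ProcessingSketch and its method names are hypothetical; the sketch assumes that the sort field is present and non-null in every row.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ProcessingSketch {

    // Builds one row from alternating key/value arguments.
    static Map<String, String> row(String... kv) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    // Character replacement: replace every occurrence of 'from' in one field.
    static List<Map<String, String>> replaceChars(List<Map<String, String>> rows,
                                                  String field, String from, String to) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> r : rows) {
            Map<String, String> copy = new LinkedHashMap<>(r);
            copy.computeIfPresent(field, (k, v) -> v.replace(from, to));
            out.add(copy);
        }
        return out;
    }

    // Field selection: keep only the listed fields, in the listed order.
    static List<Map<String, String>> selectFields(List<Map<String, String>> rows,
                                                  List<String> fields) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> r : rows) {
            Map<String, String> copy = new LinkedHashMap<>();
            for (String f : fields) {
                if (r.containsKey(f)) {
                    copy.put(f, r.get(f));
                }
            }
            out.add(copy);
        }
        return out;
    }

    // Sorting: ascending by one field (assumed non-null in every row).
    static List<Map<String, String>> sortBy(List<Map<String, String>> rows, String field) {
        List<Map<String, String>> copy = new ArrayList<>(rows);
        copy.sort(Comparator.comparing((Map<String, String> r) -> r.get(field)));
        return copy;
    }

    // Applies the three commands in sequence to two made-up rows.
    static List<Map<String, String>> demo() {
        List<Map<String, String>> rows = List.of(
                row("name", "Li-Lei", "city", "Guangzhou"),
                row("name", "An-Na", "city", "Shenzhen"));
        return sortBy(selectFields(replaceChars(rows, "name", "-", " "), List.of("name")), "name");
    }
}
```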
The governance method further includes step S4. Step S4 is used to receive a timed scheduling command and configure a scheduled task so that step S2 and step S3 are executed at regular intervals. When the task schedule is configured, a cron expression defines whether steps S2 and S3 run every given number of months, weeks, days, hours, or minutes, thereby implementing the timed data query and governance process.
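How a cron expression decides when steps S2 and S3 should run can be sketched with a deliberately simplified matcher. It assumes the classic five-field layout (minute, hour, day of month, month, day of week) and supports only "*", "*/n" and single numbers; the text does not state which cron dialect the scheduler uses, and the class CronSketch is hypothetical.

```java
import java.time.LocalDateTime;

class CronSketch {

    // True if one field ("*", "*/n", or a single number) matches the given value.
    // Simplification: "*/n" is treated as "value divisible by n" for every field.
    static boolean fieldMatches(String field, int value) {
        if (field.equals("*")) {
            return true;
        }
        if (field.startsWith("*/")) {
            return value % Integer.parseInt(field.substring(2)) == 0;
        }
        return Integer.parseInt(field) == value;
    }

    // True if all five fields match the given time (no OR rule between day fields).
    static boolean matches(String expr, LocalDateTime t) {
        String[] f = expr.trim().split("\\s+");
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 fields: " + expr);
        }
        int dayOfWeek = t.getDayOfWeek().getValue() % 7; // cron convention: 0 = Sunday
        return fieldMatches(f[0], t.getMinute())
                && fieldMatches(f[1], t.getHour())
                && fieldMatches(f[2], t.getDayOfMonth())
                && fieldMatches(f[3], t.getMonthValue())
                && fieldMatches(f[4], dayOfWeek);
    }
}
```

A scheduler built on this idea would evaluate matches once per minute and, when it returns true, run step S2 followed by step S3. A production system would more likely rely on an existing scheduling library than on a hand-written matcher.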
FIG. 4 illustrates an exemplary interface for configuring a newly created data governance task. The interface of FIG. 4 can be used normally provided that the query engine has been built successfully and is running normally. Each functional component in the component menu in the left column of FIG. 4 can be dragged and dropped onto the canvas on the right; free drag-and-drop combination of components is implemented with mxGraph (a JavaScript library for drawing flow diagrams on the web). In other embodiments, the component menu may float, or may be fixed at the right, top, or bottom. Dragging a multi-source heterogeneous input component onto the canvas in FIG. 4 results in the diagram shown in FIG. 5. The multi-source heterogeneous input component contains the jar package packaged in step S2 and can be called directly at run time. Double-clicking the multi-source heterogeneous data governance component on the canvas of FIG. 5 pops up the edit property box shown in FIG. 6. The edit property box is built with the ExtJS tool. It provides a drop-down list for selecting the data sources associated with the query (including MSSQL, Oracle, MySQL, Sybase, DB2, Redis, MongoDB, HBase, ES, Impala, HAWQ, etc.), an input box for the heterogeneous SQL statement, an option for replacing variables in the SQL statement, and a preview button. Clicking the preview button invokes the business process of the component, which accesses the query engine to obtain the query result and pops up a box displaying it. Then, according to the requirements of the data processing task, the required processing components are selected from the component menu on the left, dragged onto the canvas on the right, and connected to one another, as shown in FIG. 7.
The canvas of FIG. 7 shows a complete data processing task chain; after the chain is executed, the final processing result is obtained, completing the joint processing of multi-source heterogeneous data. The "ethnic group conversion", "gender conversion", "field selection" and "standard verification" components in FIG. 7 all receive a processing instruction and process the data accordingly (i.e. they execute step S3 of the governance method of the present invention). Double-clicking the "table input" component in FIG. 7 pops up the edit property box shown in FIG. 8. The edit property box is built with the ExtJS tool. After every item in the "table input" edit property box has been filled in, clicking preview opens the pop-up box on the right, which displays the processing result.
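A task chain of this kind can be modeled as an ordered list of components, each taking rows in and handing rows to the next. The following sketch is only an illustration: the class TaskChainSketch, the Component interface, the gender codes "1" and "2", and the rule that a row needs a non-empty id are all assumptions, not details of the components shown in the figures.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

class TaskChainSketch {

    // One component on the canvas: rows in, rows out.
    interface Component extends UnaryOperator<List<Map<String, String>>> {}

    // "Gender conversion" stand-in: map the hypothetical codes 1 and 2 to labels.
    static Component genderConversion() {
        return rows -> {
            List<Map<String, String>> out = new ArrayList<>();
            for (Map<String, String> r : rows) {
                Map<String, String> copy = new LinkedHashMap<>(r);
                String code = copy.get("gender");
                if ("1".equals(code)) {
                    copy.put("gender", "male");
                } else if ("2".equals(code)) {
                    copy.put("gender", "female");
                }
                out.add(copy);
            }
            return out;
        };
    }

    // "Standard verification" stand-in: drop rows whose id is missing or empty.
    static Component standardVerification() {
        return rows -> {
            List<Map<String, String>> out = new ArrayList<>();
            for (Map<String, String> r : rows) {
                String id = r.get("id");
                if (id != null && !id.isEmpty()) {
                    out.add(r);
                }
            }
            return out;
        };
    }

    // Runs the components in the order in which they are connected.
    static List<Map<String, String>> run(List<Component> chain, List<Map<String, String>> input) {
        List<Map<String, String>> current = input;
        for (Component c : chain) {
            current = c.apply(current);
        }
        return current;
    }

    // Two made-up rows pass through a two-component chain.
    static List<Map<String, String>> demo() {
        Map<String, String> a = new LinkedHashMap<>();
        a.put("id", "1001");
        a.put("gender", "1");
        Map<String, String> b = new LinkedHashMap<>();
        b.put("id", "");
        b.put("gender", "2");
        return run(List.of(genderConversion(), standardVerification()), List.of(a, b));
    }
}
```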
In addition, an embodiment of the invention also provides an electronic device capable of implementing the method.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
The electronic device is in the form of a general purpose computing device. Components of the electronic device may include, but are not limited to: the system comprises at least one processing unit, at least one storage unit, a bus for connecting different system components (comprising the storage unit and the processing unit), and a display unit.
Wherein the storage unit stores program code which is executable by the processing unit to cause the processing unit to perform steps according to various exemplary embodiments of the present invention as described in the above "exemplary methods" section of the present description. For example, the processing unit may perform steps S1 through S4 of the governance method of the present invention.
The memory unit may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) and/or a cache memory unit, and may further include a read only memory unit (ROM).
The storage unit may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The bus may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface. Also, the electronic device may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via a network adapter. As shown, the network adapter communicates with other modules of the electronic device over a bus. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described data governance method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the description, when the program product is run on the terminal device.
The program product for implementing the method may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present invention is not limited in this regard; in the present document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (14)

1. A multi-source heterogeneous data governance method, characterized by comprising the following steps:
step S1, constructing a query engine for uniformly connecting different types of databases and carrying out association query by using a standardized SQL grammar;
step S2, receiving a query instruction, connecting to the service of the query engine according to the received connection information of the data source, performing data query according to the received SQL statement and returning a query result;
and step S3, receiving a processing instruction, correspondingly processing the query result according to the received processing instruction, and outputting a processing result.
2. The multi-source heterogeneous data governance method according to claim 1, wherein said step S1 comprises:
step S11, analyzing the metadata of the data source and loading the metadata into the memory;
step S12, converting and packaging the SQL grammar corresponding to the data source to obtain a directed acyclic graph SqlNode, and checking the SqlNode according to the metadata;
step S13, cutting the SqlNode to obtain the abstract program Extract, taking the SqlNode root node to form the SQL grammar to be executed, and then selecting the corresponding calculation engine;
step S14, loading java codes required by the calculation engine rule generated based on the abstract program Extract into a memory container;
and step S15, traversing and optimizing the java code generated in the previous step, converting the java code into a class code, executing the class code, packaging data, and loading the obtained data result into a memory.
3. The multi-source heterogeneous data governance method according to claim 2, wherein said step S11 comprises:
appointing a corresponding database and a database table factory class for each data source, and identifying the information of the database and the database table;
generating a Json file by using metadata of a data source;
analyzing metadata in the Json file through a dynamic data source management framework;
and creating dynamic factory type information according to the metadata obtained by analysis, and carrying out memory loading on the database type, the connection information, the connection parameters and the database table structure information of the data source.
4. The multi-source heterogeneous data governance method according to claim 2, wherein said step S12 comprises:
configuring and generating a plan executor Planner by utilizing a dynamic data source management framework based on the factory type information;
converting SQL grammars of various database types by using an abstract syntax tree algorithm through the plan executor Planner, and packaging a java class to represent the conversion result, to obtain an unverified directed acyclic graph class SqlNode;
carrying out syntax verification on the SqlNode according to the metadata of the data source through the plan executor Planner;
and obtaining a preliminary execution plan through semantic analysis of the SqlNode, performing syntactic analysis in combination with the metadata, reading out the related information, and performing filter pushdown, in which all filter conditions following the where are moved into the join on statement.
5. The multi-source heterogeneous data governance method according to claim 2, wherein said step S13 comprises:
creating a directed acyclic graph cutting class QProcedure based on the SqlNode obtained in the step S12;
constructing an SQL line for traversing the SqlNode tree class, cutting the SqlNode tree class, returning to a final cut sub-tree and an alias thereof, obtaining a sub-tree synchronizer subtreeSynlinker after cutting, and decomposing the cut SqlNode into a plurality of processing programs;
converting the return fields, the corresponding data source connection parameters and the like, produced while decomposing into the plurality of processing programs, into abstract programs Extract;
the cutting class QProcedure takes the SqlNode root node and converts it into a transmission program, to obtain the SQL grammar to be specifically executed;
constructing an engine-based syntactic analysis algorithm based on the obtained SQL grammar, and automatically selecting a bottom layer computing engine according to a minimum time principle; the computing engines are Spark computing engines and Flink computing engines.
6. The multi-source heterogeneous data governance method according to claim 5, wherein said step S14 comprises:
creating a dedicated Pipeline for connecting all steps in execution based on the selected computing engine;
assembling java class codes required by a computing engine by using Pipeline and returning to api call in an initialization stage;
and generating the java code required by the calculation engine rules from the abstract program Extract: putting the obtained data source information, connection information and table field information into the temporary table required by the computing engine, generating java code for the cut QProcedure in packaging order, splicing the generated java code through the api, and loading the spliced java code into a memory container.
7. The multi-source heterogeneous data governance method according to claim 2, wherein said step S15 comprises:
creating a theme class, reading the java code generated in the last step from the memory, performing traversal optimization, generating a final class, and loading the final class into a cache;
and executing the class code in the cache, executing the multi-source heterogeneous logic by a method in an executor, packaging data, and finally loading the obtained data result into the memory.
8. The multi-source heterogeneous data governance method according to claim 2, wherein said step S1 further comprises:
and step S16, the query engine provides a TCP access service by adopting a Netty architecture; a Netty client is built at the bottom layer of the driver to dock with the TCP service of the query engine, so that a unified connection is established with the query engine by using a unified database link url, a unified database driver class and a method for creating multi-source metadata; and finally the driver is packaged into a jar driver package.
9. The multi-source heterogeneous data governance method according to claim 1, wherein said step S2 comprises:
step S21, setting a metadata class, and configuring the connection information and SQL statements of the data source set into the metadata class;
step S22, assembling and configuring corresponding connection information, user name and password according to different database types, and packaging the connection information into metadata json;
step S23, configuring address information and port of the query engine for Connection based on the metadata json, and acquiring Connection class of the query engine;
and step S24, based on the Connection class of the query engine, executing the SQL statement to obtain a result set, acquiring metadata information of the result set, and packaging into a returned query result.
10. The multi-source heterogeneous data governance method according to claim 9, wherein step S2 is packaged into a jar package to form a callable component.
11. The multi-source heterogeneous data governance method according to claim 1, wherein the processing instructions of step S3 include at least commands for character replacement, data checking, sorting, field selection, conversion, encryption.
12. The multi-source heterogeneous data governance method according to claim 1, further comprising a step S4 of receiving a timed scheduling command for timed execution of steps S2 and S3.
13. An electronic device, comprising:
a processor; and
a memory having computer readable instructions stored thereon which, when executed by the processor, implement the method of any of claims 1 to 12.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 12.
CN202011044826.XA 2020-09-28 2020-09-28 Multi-source heterogeneous data treatment method and device Withdrawn CN112579625A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011044826.XA CN112579625A (en) 2020-09-28 2020-09-28 Multi-source heterogeneous data treatment method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011044826.XA CN112579625A (en) 2020-09-28 2020-09-28 Multi-source heterogeneous data treatment method and device

Publications (1)

Publication Number Publication Date
CN112579625A true CN112579625A (en) 2021-03-30

Family

ID=75119720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011044826.XA Withdrawn CN112579625A (en) 2020-09-28 2020-09-28 Multi-source heterogeneous data treatment method and device

Country Status (1)

Country Link
CN (1) CN112579625A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032423A (en) * 2021-05-31 2021-06-25 北京谷数科技股份有限公司 Query method and system based on dynamic loading of multiple data engines
CN113094387A (en) * 2021-04-08 2021-07-09 杭州数梦工场科技有限公司 Data query method and device, electronic equipment and machine-readable storage medium
CN113722324A (en) * 2021-08-30 2021-11-30 平安国际智慧城市科技股份有限公司 Report generation method and device based on artificial intelligence, electronic equipment and medium
CN113792067A (en) * 2021-11-16 2021-12-14 全景智联(武汉)科技有限公司 System and method for automatically generating SQL (structured query language) based on recursive algorithm
CN113901141A (en) * 2021-10-11 2022-01-07 京信数据科技有限公司 Distributed data synchronization method and system
CN114490842A (en) * 2021-12-28 2022-05-13 航天科工智慧产业发展有限公司 Interface data query method and data query engine for multi-source data
CN114756629A (en) * 2022-06-16 2022-07-15 之江实验室 Multi-source heterogeneous data interaction analysis engine and method based on SQL
CN115729935A (en) * 2022-11-23 2023-03-03 北京水脉科技有限公司 Data interaction processing method and system based on ORM framework
CN117390495A (en) * 2023-12-04 2024-01-12 江苏睿希信息科技有限公司 Multi-source data risk management system and method based on big data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572896A (en) * 2014-12-25 2015-04-29 福建亿榕信息技术有限公司 Method and system for automatically governing data of relational database
CN108154456A (en) * 2016-12-06 2018-06-12 星际空间(天津)科技发展有限公司 Strengthened research system is moved in a kind of urban and rural planning
CN108279879A (en) * 2018-01-25 2018-07-13 北京卓越智软科技有限公司 Applied software development method towards engine
CN110096514A (en) * 2019-04-01 2019-08-06 跬云(上海)信息科技有限公司 Data query method and apparatus
CN110781213A (en) * 2019-09-25 2020-02-11 中国电子进出口有限公司 Multi-source mass data correlation searching method and system with personnel as center
CN111708774A (en) * 2020-04-16 2020-09-25 上海华东电信研究院 Industry analytic system based on big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QIHOO360: "Quicksql", 《HTTPS://GITEE.COM/MIRRORS_QIHOO360/QUICKSQL/TREE/V0.7.0》, 6 January 2020 (2020-01-06) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094387A (en) * 2021-04-08 2021-07-09 杭州数梦工场科技有限公司 Data query method and device, electronic equipment and machine-readable storage medium
CN113032423B (en) * 2021-05-31 2021-08-17 北京谷数科技股份有限公司 Query method and system based on dynamic loading of multiple data engines
CN113032423A (en) * 2021-05-31 2021-06-25 北京谷数科技股份有限公司 Query method and system based on dynamic loading of multiple data engines
CN113722324B (en) * 2021-08-30 2023-08-18 深圳平安智慧医健科技有限公司 Report generation method and device based on artificial intelligence, electronic equipment and medium
CN113722324A (en) * 2021-08-30 2021-11-30 平安国际智慧城市科技股份有限公司 Report generation method and device based on artificial intelligence, electronic equipment and medium
CN113901141A (en) * 2021-10-11 2022-01-07 京信数据科技有限公司 Distributed data synchronization method and system
CN113792067A (en) * 2021-11-16 2021-12-14 全景智联(武汉)科技有限公司 System and method for automatically generating SQL (structured query language) based on recursive algorithm
CN114490842A (en) * 2021-12-28 2022-05-13 航天科工智慧产业发展有限公司 Interface data query method and data query engine for multi-source data
CN114756629B (en) * 2022-06-16 2022-10-21 之江实验室 Multi-source heterogeneous data interaction analysis engine and method based on SQL
CN114756629A (en) * 2022-06-16 2022-07-15 之江实验室 Multi-source heterogeneous data interaction analysis engine and method based on SQL
CN115729935A (en) * 2022-11-23 2023-03-03 北京水脉科技有限公司 Data interaction processing method and system based on ORM framework
CN117390495A (en) * 2023-12-04 2024-01-12 江苏睿希信息科技有限公司 Multi-source data risk management system and method based on big data
CN117390495B (en) * 2023-12-04 2024-02-20 江苏睿希信息科技有限公司 Multi-source data risk management system and method based on big data

Similar Documents

Publication Publication Date Title
CN112579625A (en) Multi-source heterogeneous data treatment method and device
CN112579626A (en) Construction method and device of multi-source heterogeneous SQL query engine
US9317554B2 (en) SQL generation for assert, update and delete relational trees
US20100175049A1 (en) Scope: a structured computations optimized for parallel execution script language
CN112199086B (en) Automatic programming control system, method, device, electronic equipment and storage medium
CN115617327A (en) Low code page building system, method and computer readable storage medium
US9928288B2 (en) Automatic modeling of column and pivot table layout tabular data
US9563650B2 (en) Migrating federated data to multi-source universe database environment
US8881127B2 (en) Systems and methods to automatically generate classes from API source code
US20080140694A1 (en) Data transformation between databases with dissimilar schemes
JP2015072688A (en) Background format optimization for enhanced sql-like queries in hadoop
US10452628B2 (en) Data analysis schema and method of use in parallel processing of check methods
CN108037919A (en) A kind of visualization big data workflow configuration method and system based on WEB
JP5791149B2 (en) Computer-implemented method, computer program, and data processing system for database query optimization
CN112163017B (en) Knowledge mining system and method
CN113962597A (en) Data analysis method and device, electronic equipment and storage medium
CN111008020A (en) Method for analyzing logic expression into general query statement
US11354313B2 (en) Transforming a user-defined table function to a derived table in a database management system
CN113806429A (en) Canvas type log analysis method based on large data stream processing framework
US10691434B2 (en) System and method for converting a first programming language application to a second programming language application
CN116795859A (en) Data analysis method, device, computer equipment and storage medium
CN114356964A (en) Data blood margin construction method and device, storage medium and electronic equipment
CN108874395B (en) Hard compiling method and device in modular stream processing process
CN114048188A (en) Cross-database data migration system and method
CN113204593A (en) ETL job development system and computer equipment based on big data calculation engine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210330