CN111367954A - Data query processing method, device and system and computer readable storage medium - Google Patents


Info

Publication number
CN111367954A
CN111367954A
Authority
CN
China
Prior art keywords
data
global metadata
information
query
query processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811599635.2A
Other languages
Chinese (zh)
Inventor
鲍焱
丁岩
黄峰
叶松青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN201811599635.2A priority Critical patent/CN111367954A/en
Priority to PCT/CN2019/128782 priority patent/WO2020135613A1/en
Publication of CN111367954A publication Critical patent/CN111367954A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a data query processing method, device, and system, and a computer-readable storage medium. The data query processing method comprises the following steps: receiving a query processing request; determining the data components involved in the query processing request according to global metadata information obtained from a global metadata base, and generating a query execution plan according to the global metadata information, wherein the global metadata information of the global metadata base comprises the metadata information of the data components; and, based on the query execution plan, completing data operations by interacting with the data components involved in the query processing request. In the scheme provided by the embodiments, because the involved data components are determined from the global metadata information, the storage details of the underlying data can be hidden from the business application layer.

Description

Data query processing method, device and system and computer readable storage medium
Technical Field
Embodiments of the present invention relate to a data query processing method, a data query processing apparatus, a data query processing system, and a computer-readable storage medium.
Background
For the storage of mass data, the number of data records is extremely large and exceeds the design range of a relational database; in such cases, the solutions commonly used in the industry are Hive, HBase, and the like.
Hive is a data warehouse infrastructure built on Hadoop that stores data as file blocks. Hive provides a simple SQL-like query language (SQL: Structured Query Language) that allows users to query data in a SQL-like manner. Hive converts SQL requests into MapReduce jobs for processing, which incurs substantial overhead in job submission and scheduling. It therefore cannot achieve fast, low-latency queries on large-scale datasets; the best use case for Hive is batch processing of large datasets, such as the analysis of behavior logs.
HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system; a large-scale structured storage cluster can be built on inexpensive PC servers. HBase differs from a typical relational database in that it adopts column-oriented storage and stores data in a key-value format. As a result, once the row key (rowkey) is known, the value of the corresponding field can be fetched quickly; consequently, row-key design is strongly coupled to the business scenario, and extensibility is poor. HBase does not provide a SQL-like interface; developers must call it through an Application Programming Interface (API).
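The rowkey-coupled access pattern described above can be sketched as follows (a toy in-memory model with hypothetical row keys and column names, not the actual HBase API): once the row key is known, field values are fetched directly without scanning, which is exactly why row-key design must mirror the query pattern of the business scenario.

```python
# Toy model of HBase-style key-value storage: rowkey -> {column -> value}.
# Row keys and column families here are hypothetical examples.
table = {
    "user#1001": {"cf:name": "alice", "cf:age": "30"},
    "user#1002": {"cf:name": "bob", "cf:age": "25"},
}

def get(rowkey, column):
    """Direct lookup by (rowkey, column), mimicking an HBase Get."""
    return table.get(rowkey, {}).get(column)

print(get("user#1001", "cf:name"))  # -> alice
```

A query that does not start from the row key (e.g. "all users named bob") would have to scan every row, which is why the scheme is poorly extensible once the business query pattern changes.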
Other components in the field of mass data applications likewise struggle to cover all application scenarios. In practice, therefore, different components must be adopted for specific application scenarios, with customized development against the query interface each component provides. Under this mechanism, multiple data processing components must be deployed to support the complete set of business requirements, which brings the following problems: development must be done against the access interface provided by each data component, increasing engineering complexity; and the data components are physically isolated, so association analysis (join, union, etc.) cannot be performed across their data.
Two types of solutions are currently common:
In the first type of solution, the data in each data component is first analyzed and processed, with the results forming intermediate tables stored in a relational database or a distributed relational database; an association operation is then performed on the intermediate tables; and finally, detailed records are retrieved from each data component according to the resulting association information. This approach has the following significant limitations: 1. each data component must be interfaced with directly, increasing the development and maintenance workload of the business code; 2. the data storage layout is highly coupled to the application scenario, the data processing logic is complex, and rapid deployment of business applications is hampered; 3. the data is read multiple times, heavily consuming cluster resources.
The second type of solution deploys a component that supports association analysis over data from multiple data sources; among these, Presto is widely used. As shown in fig. 1, Presto is a big data distributed SQL query engine developed in Java and introduced by Facebook; it can query data stored in products such as Hive, Redis, Cassandra, Mysql, and Oracle, and perform association analysis on data from multiple sources. However, Presto has a significant limitation: the data source to be used must be specified through the schema; for the business layer, data storage must still be logically partitioned, which increases engineering complexity.
Disclosure of Invention
At least one embodiment of the invention provides a data query processing method, a data query processing device, a data query processing system and a computer-readable storage medium, which can reduce query complexity.
At least one embodiment of the present invention provides a data query processing method, including:
receiving a query processing request, determining a data component related to the query processing request according to global metadata information acquired from a global metadata database, and generating a query execution plan according to the global metadata information; wherein the global metadata information of the global metadata base comprises metadata information of data components;
and, based on the query execution plan, completing data operations by interacting with the data components involved in the query processing request.
At least one embodiment of the present invention provides a data query processing system, including:
the global metadata base is used for storing global metadata information, and the global metadata information comprises metadata information of data components;
the component management node is used for receiving a query processing request, determining a data component related to the query processing request according to global metadata information acquired from the global metadata database, and generating a query execution plan according to the global metadata information; generating an execution task based on the query execution plan, and issuing the execution task to a task execution node;
and the task execution node is used for finishing data operation with the data assembly related to the query processing request after receiving the execution task.
At least one embodiment of the present invention provides a data query processing apparatus, including a memory and a processor, where the memory stores a program, and the program, when read and executed by the processor, implements the data query processing method according to any one of the embodiments.
At least one embodiment of the present invention provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the data query processing method according to any one of the embodiments.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention without limiting it.
FIG. 1 is a diagram of Presto architecture in the related art;
FIG. 2 is a block diagram of a data query processing system according to an embodiment of the present invention;
FIG. 3 is a flowchart of a data query processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of service-information group-table creation in application example 1 of the present invention;
FIG. 5 is a timing diagram of the operation of application example 1 of the present invention;
FIG. 6 is a schematic diagram of the dimension-analysis application of application example 2 of the present invention;
FIG. 7 is a timing diagram of the DML operation of application example 2 of the present invention;
FIG. 8a is a diagram of the behavior-log query implementation of application example 3 of the present invention;
FIG. 8b is a diagram of the syntax tree used in application example 3 of the present invention;
FIG. 9 is a schematic diagram of conditional-expression push-down in application example 3 of the present invention;
FIG. 10 is a diagram of the data-export implementation of application example 4 of the present invention;
FIG. 11 is a diagram of the heterogeneous-data-source query implementation of application example 5 of the present invention;
FIG. 12 is a diagram of the index-and-detail association implementation of application example 6 of the present invention;
FIG. 13 is a block diagram of a data query processing apparatus according to an embodiment of the present invention;
fig. 14 is a block diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
The steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
As shown in fig. 2, an embodiment of the present invention provides a data query processing system, which includes a component management node (Coordinator)201, a global metadata base 202, and at least one task execution node (Worker)203, where:
the component management node 201 is configured to receive a query processing request, determine a data component related to the query processing request according to global metadata information acquired from the global metadata repository 202, and generate a query execution plan; generating an execution task (task) based on the query execution plan, and issuing the execution task to a task execution node;
The global metadata base 202 is used for storing global metadata information; the global metadata information includes the metadata information of each data component, so the component management node does not need to access each data component to obtain it. The global metadata information may be all of a data component's metadata information or only part of it.
The task execution node 203 is configured to, after receiving the execution task, complete data operation by interacting with the data component related to the query processing request. The task execution node 203 is responsible for actually executing the task.
The system may also include databases such as an RDBMS, Redis, Hive, HBase, and Solr.
According to the scheme provided by the embodiment, the related data components are determined according to the global metadata information, a data source or a data alias (Schema) does not need to be specified, namely, the data storage details do not need to be known, and the bottom data storage details can be shielded for the service application layer. In addition, the global metadata information is stored in the global metadata database, and the global metadata information is directly acquired from the global metadata database. The scheme provided by the embodiment realizes metadata lifting operation, isolates the service application from the bottom data storage, and provides powerful support for rapid design and deployment of the application service.
The global metadata base may include all of the metadata in the data components, or only part of it. When the global metadata base does not contain the metadata of some tables in a data component, users cannot access the data in those tables; therefore, the metadata of data that needs protection can be kept out of the global metadata base so that users cannot access it, thereby protecting the data. The metadata in the global metadata base can be imported from the data components, and can also be updated by the task execution nodes.
The component management node 201 is a management node: it parses the query processing request into a query execution plan, divides the plan into query execution stages, further divides the stages into tasks, and issues the tasks to the task execution nodes 203. A task execution node 203 is a worker node, invoked by the component management node 201 to actually complete each task and interact with the data components.
The query processing request may be an SQL statement. SQL statements include Data Manipulation Language (DML) and Data Definition Language (DDL). DML includes SELECT (read data from a table), UPDATE (update data in a table), DELETE (delete data from a table), INSERT INTO (insert data into a table), and so on; DDL includes CREATE TABLE (create a new table), ALTER TABLE (change a table), DROP TABLE (delete a table), and so on. When the query processing request is an SQL statement, the component management node parses the SQL statement to obtain the query processing request.
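As an illustration of the DML/DDL split named above, the component management node's first step can be sketched as classifying the incoming statement by its leading keyword (a minimal sketch; a real parser would build a full syntax tree rather than keyword-match):

```python
# Classify an SQL request as DML or DDL by its leading keyword.
DML_KEYWORDS = {"SELECT", "UPDATE", "DELETE", "INSERT"}
DDL_KEYWORDS = {"CREATE", "ALTER", "DROP"}

def classify(sql: str) -> str:
    """Return 'DML', 'DDL', or 'UNKNOWN' for a single SQL statement."""
    head = sql.strip().split(None, 1)[0].upper()
    if head in DML_KEYWORDS:
        return "DML"
    if head in DDL_KEYWORDS:
        return "DDL"
    return "UNKNOWN"

print(classify("select * from t where id = 1"))  # DML
print(classify("CREATE TABLE t (id int)"))       # DDL
```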
A Java Database Connectivity (JDBC) client package and a command-line window are provided, so that business application developers can conveniently and quickly submit SQL and obtain query results.
Of course, the query processing request may be other types of statements.
In an embodiment, the global metadata information comprises at least one of: table definition information, table extension information, table field definition information, metadata rule information, and table statistics information. This information may be stored in the form of tables, or in other forms. Specifically:
the table definition information includes: data component information, table name information, etc. of the data table;
the table extension information defines the component-specific characteristics of each data table, such as the Region partitioning mode of HBase or the partition fields of Hive;
the table field definition information comprises the fields of the data table, including the field name, data type, whether the field is indexed, whether it is the primary key, and so on;
the metadata rule information includes rules of the form: when a preset condition is met, a preset operation is executed on preset data components, where the preset data components are one or more data components;
the table statistics support cost-based optimization (CBO), which requires statistics on the data tables, such as the number of records in a table, the maximum and minimum values of fields, and histogram distributions.
It should be noted that the above-mentioned global metadata information is only an example, and other information may be included as needed.
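The metadata categories listed above can be sketched as plain data structures (all names and the sample table are hypothetical illustrations, not a schema prescribed by the text):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FieldDef:                    # table field definition information
    name: str
    data_type: str
    indexed: bool = False
    primary_key: bool = False

@dataclass
class TableStats:                  # table statistics, used for CBO
    row_count: int = 0
    min_max: dict = field(default_factory=dict)   # field name -> (min, max)

@dataclass
class TableDef:                    # table definition + extension information
    table_name: str
    component: str                 # owning data component, e.g. "HBASE"
    fields: list = field(default_factory=list)
    extensions: dict = field(default_factory=dict)  # e.g. HBase Regions, Hive partitions
    stats: Optional[TableStats] = None

# A toy global metadata base: table name -> definition.
global_metadata = {
    "xytj_orders": TableDef(
        "xytj_orders", "HBASE",
        fields=[FieldDef("rowkey", "string", primary_key=True)],
        extensions={"region_split": "hash"},
    ),
}
```

Note how the component name lives inside the table definition: this is what lets the planner resolve a bare table name to a data component without the request ever naming a schema.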
In one embodiment, the component management node 201 includes: a query execution plan module 2011, cluster management nodes 2012, a global metadata management module 2013, and a data component interface layer 2014, where:
the Query execution plan module (Query)2011 is configured to receive a Query processing request, determine a data component related to the Query processing request according to the global metadata information, and generate a Query execution plan; generating an execution task based on the query execution plan, and issuing the execution task to a task execution node accessed to a cluster management node;
the cluster management node (Discovery Server)2012 is configured to manage access and exit of a task execution node; after the task execution node is started, the task execution node registers to the cluster management node 2012 to realize access. And the execution task is issued to the accessed task execution node.
The Global MetaData management module (Global MetaData, abbreviated as GMD)2013 is configured to obtain the Global MetaData information from the Global MetaData repository; the global metadata information may be retrieved from the global metadata repository upon initialization after startup. The module is mainly responsible for managing the global metadata information, such as interaction with a global metadata base, acquisition and synchronization of the global metadata information, and the like.
The data component interface layer (Connector) 2014 is used for interfacing with data components so that task execution nodes can access them, for example to collect data from a data component or write data into it. When the query execution plan contains a table scan node, the data collection function is implemented through the data component interface layer 2014.
It should be noted that the cluster management node 2012, the global metadata management module 2013, and the data component interface layer 2014 may be disposed in the component management node 201, or may be independent modules.
It should be noted that fig. 2 only illustrates the data component interface layer 2014 generally; in an actual application, depending on the data components accessed, it may include a Hive Connector, an HBase Connector, a Solr Connector, and the like.
The global metadata management module 2013 is independent of data components and is closer to a business application layer.
In one embodiment, the query execution plan module generates the query execution plan by matching against predefined metadata rules in the global metadata information and generating the plan according to the matched rule. Metadata rules may be defined as desired. For example, a group-table creation rule may be predefined: when a Create Table statement is submitted and the table name starts with a preset prefix, the data table is created in preset data components (one or more). For instance, a rule may state that when a Create Table statement is submitted and the table name starts with xytj, the table is created in both Mysql and HBase; another rule may state that when the table name starts with MT_, the table is created in Hive; and so on.
By defining metadata rules, the implementation complexity of association requirements can be reduced: a series of rules can be defined for common requirements. Common requirements include, for example, data association across different data components, export of data from one data component to another, and export and combination of data from several data components; each can have a predefined rule to simplify its implementation. Take the creation of the same group of tables in two data components as an example. In the related art, two different SQL statements must be submitted, one per data component, to create the same group of tables in both. In the present application, a rule may be defined so that when an SQL statement requests creation of a group table with a preset name prefix, the table is created in both data components at once; a single SQL statement thus creates the same group table in two data components.
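The prefix-based group-table rules above can be sketched as a small rule table (the prefixes xytj and MT_ come from the text; the function name and structure are illustrative):

```python
import re

# Prefix-based metadata rules: table-name pattern -> target data components.
# The xytj and MT_ prefixes follow the examples in the description.
RULES = [
    (re.compile(r"^xytj"), ["MYSQL", "HBASE"]),
    (re.compile(r"^MT_"), ["HIVE"]),
]

def target_components(table_name, default=("HIVE",)):
    """Resolve which data components a CREATE TABLE should run against."""
    for pattern, components in RULES:
        if pattern.match(table_name):
            return list(components)
    return list(default)

print(target_components("xytj_stats"))  # ['MYSQL', 'HBASE']
print(target_components("MT_logs"))     # ['HIVE']
```

With such a rule matched, one submitted statement fans out to every listed component, which is the single-statement group-table creation the paragraph describes.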
In an embodiment, the global metadata management module 2013 is further configured to: and pushing the global metadata information to the task execution node. Since the task execution node may need the global metadata information when executing the task, the global metadata information may be pushed to the task execution node for its use when executing the task.
In an embodiment, the task execution node 203 is further configured to obtain the global metadata information locally, when it is needed, while completing data operations by interacting with the data components involved in the query processing request. Compared with the related art, in which metadata information is requested from the data component during task execution, the scheme of this embodiment improves execution efficiency.
In an embodiment, the global metadata management module 2013 is further configured to update the global metadata database when the task execution node updates the data in the data component, and synchronize local global metadata information and global metadata information of the task execution node according to the updated global metadata database.
Each query execution plan generates an abstract syntax tree (AST), and each node in the tree is an execution node (e.g., a JoinNode, FilterNode, or TableScanNode). In one embodiment, generating the query execution plan includes generating a syntax tree with multiple levels of execution nodes and assigning at least some of the filter conditions to the table scan nodes (TableScanNodes); all of the filter conditions may be assigned to the TableScanNode. In the solution of this embodiment, the filter condition is pushed down to the TableScanNode. Compared with the related-art approach of assigning the filter condition to a FilterNode, fetching the result set from the data component, and then filtering in the FilterNode, filtering before the result set leaves the data component reduces the amount of data in the result set obtained from the data component and improves execution efficiency. The filter conditions to be pushed down to the TableScanNode may be defined in the metadata rules, and the push-down may be full or partial. It should be noted that the table scan node denotes the step in the syntax tree that interacts directly with the data component, and is not limited to the name TableScanNode. Filter conditions include, for example, OR statements and conditional expressions involving format conversion, arithmetic, trigonometric functions, and the like.
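The push-down rewrite can be sketched on two toy plan nodes (the class and attribute names mirror the node names in the text but are otherwise illustrative; real plan nodes carry much more state):

```python
# Minimal plan nodes: a scan of one table, and a filter above it.
class TableScanNode:
    def __init__(self, table, predicate=None):
        self.table = table
        self.predicate = predicate   # condition evaluated inside the data component

class FilterNode:
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate   # condition evaluated after rows are fetched

def push_down(node):
    """Move a FilterNode's predicate into its child TableScanNode.

    The FilterNode itself is kept (with an empty predicate) so the shape
    of the established query execution plan stays consistent."""
    if isinstance(node, FilterNode) and isinstance(node.child, TableScanNode):
        node.child.predicate = node.predicate
        node.predicate = None
    return node

plan = push_down(FilterNode(TableScanNode("t"), "age > 30"))
print(plan.child.predicate)  # age > 30  (now filtered inside the component)
print(plan.predicate)        # None      (FilterNode retained but empty)
```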
In an embodiment, a filter node (FilterNode) whose condition is empty remains at the level above the table scan node in the syntax tree; that is, the filter condition originally assigned to the FilterNode is assigned to the TableScanNode, and the FilterNode is retained in order to keep the established query execution plan consistent. It should be noted that, in other embodiments, the FilterNode may be removed from the syntax tree.
In one embodiment, generating the query execution plan includes generating a cost-optimized query execution plan according to the statistics in the global metadata information. The global metadata information may include statistics such as the discrete distribution of each field in a table; the cost (time and resources consumed) of different query execution plans is estimated from these statistics, and the optimal execution plan is selected.
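A toy cost-based choice under assumed statistics (the row counts, cost formula, and the build-smaller-side heuristic are illustrative assumptions, not prescribed by the text): given table sizes from the metadata base, the planner prices each candidate join order and keeps the cheapest.

```python
# Assumed row counts from the global metadata base's table statistics.
stats = {"orders": 10_000_000, "users": 50_000}

def join_cost(build, probe):
    """Toy hash-join cost: building the hash table is weighted heavier,
    so the smaller table should be the build side."""
    return stats[build] * 2 + stats[probe]

candidates = [("users", "orders"), ("orders", "users")]
best = min(candidates, key=lambda plan: join_cost(*plan))
print(best)  # ('users', 'orders') -- build on the smaller table
```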
As shown in fig. 3, an embodiment of the present invention provides a data query processing method, including:
step 301, receiving a query processing request, determining a data component related to the query processing request according to global metadata information acquired from a global metadata database, and generating a query execution plan according to the global metadata information; wherein the global metadata information of the global metadata base comprises metadata information of data components;
after the component management node is started, the global metadata information is loaded from the global metadata base.
Step 302, based on the query execution plan, data operation is completed by interacting with the data component related to the query processing request.
According to the scheme provided by this embodiment, the involved data components are determined from the global metadata information, so no data source or schema (data alias) needs to be specified; that is, the data storage details need not be known, and the underlying storage details can be hidden from the business application layer. In addition, the global metadata information is stored in, and obtained directly from, the global metadata base. The scheme thus lifts the metadata up, isolating the business application from the underlying data storage and providing strong support for rapid design and deployment of application services. Moreover, the format of the query processing request is independent of the data components, so the business application does not need to develop an interfacing program for each data component; the interfaces are consistent, which significantly reduces engineering complexity.
The global metadata base may include all of a data component's metadata information or only part of it; when it includes only part, the remaining data in the data component can be protected. That is, only data tables whose metadata is in the global metadata base can be accessed; data tables without metadata there cannot.
In an embodiment, the global metadata information comprises at least one of: table definition information, table extension information, table field definition information, metadata rule information, table statistics information. The meaning of each information is described in the foregoing system embodiments, and is not described herein again. It should be noted that the above-mentioned global metadata information is only an example, and other information may be added as needed.
In an embodiment, the determining, in step 301, the data component involved in the query processing request according to the global metadata information obtained from the global metadata base includes at least one of:
when the global metadata information comprises table definition information which comprises component information, determining the related data components according to the component information in the table definition information in the global metadata information;
and when the global metadata information comprises the metadata rule, determining the related data component according to the matched metadata rule in the global metadata information.
In one embodiment, generating the query execution plan in step 301 includes matching the query processing request against predefined metadata rules in the global metadata information and generating the plan according to the matched rule. A metadata rule is, for example: when a preset condition is met, execute a preset operation in preset data components; the operation can be defined as needed, for example according to access requirements. For instance, by defining access rules for multiple data sources and matching one SQL statement against them, the multiple data sources can be accessed directly.
In an embodiment, in the step 302, the performing data operation with the data component involved in the query processing request includes:
dividing the query execution plan into execution tasks, issuing the execution tasks to task execution nodes, and having the task execution nodes complete the data operation through interaction between the data-component docking layer of each data component involved in the query processing request and that data component. Dividing the query execution plan into execution tasks includes: dividing the query execution plan into query execution paragraphs and splitting the query execution paragraphs into execution tasks; the specific ways of doing so are found in the related art and are not described here again. The query execution paragraphs have a hierarchical relationship with each other, and each query execution paragraph represents a portion of the query execution plan, with its own input (upstream) and output (downstream).
Lexical and grammatical analysis is performed according to the global metadata information to generate the query execution plan; before the plan is divided into query execution paragraphs, rule-based optimization (RBO) may also be performed on it.
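The plan-to-paragraph-to-task decomposition above can be sketched as follows; this is an illustrative data model (the `Stage` class, fields, and fixed parallelism are assumptions, not the patent's implementation).

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    """One query execution paragraph: a portion of the query execution plan."""
    stage_id: int
    operations: list
    upstream: list = field(default_factory=list)  # stages feeding data into this one

def split_into_tasks(stage, parallelism=2):
    """Split one paragraph into execution tasks for the task execution nodes."""
    return [{"task_id": f"stage{stage.stage_id}-task{i}", "ops": stage.operations}
            for i in range(parallelism)]

# Hierarchical paragraphs: stage2 is upstream of stage1 (its output feeds the join).
stage2 = Stage(2, ["scan:Mysql"])
stage1 = Stage(1, ["scan:HBase", "join"], upstream=[stage2])
tasks = split_into_tasks(stage1)
```

Each task carries its paragraph's operations and is then issued to a task execution node for execution against the data component's docking layer.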
In an embodiment, before receiving the query processing request, the step 301 further includes: and the component management node acquires the global metadata information from the global metadata database and pushes the global metadata information to the task execution node.
In an embodiment, in step 302, in the process of completing data operation by interacting with the data component related to the query processing request, when the global metadata information needs to be used, the global metadata information is locally obtained from the task execution node.
In an embodiment, the method further comprises: when the data in the data component is updated, the global metadata base is updated, and the global metadata information of the component management node and the global metadata information in the task execution node are updated. For example, a new table is created in the data component, the global metadata base is updated at this time, and the global metadata information at the management node of the component and the global metadata information at the task execution node are updated.
In one embodiment, the query execution plan is a syntax tree comprising multiple levels of execution nodes, and at least some of the filter conditions are assigned to the TableScanNode in the syntax tree. All filter conditions may be assigned to the TableScanNode, or a filter-condition pushdown rule may be defined in the global metadata information so that only the filter conditions satisfying that rule are assigned to the TableScanNode. The TableScanNode performs the operation of collecting data from the data component. In the related-art Presto implementation, OR statements and conditional expressions containing functions are converted into a FilterNode and are not passed down to the data component, so the result set collected from the data component is huge and execution efficiency suffers. In the present scheme, the query execution plan is optimized: the filter conditions are pushed down to the data component for execution, the result set returned from the data component shrinks greatly, and execution efficiency improves.
In one embodiment, the syntax tree includes, at the level above the TableScanNode, a FilterNode whose value is assigned to be empty. That is, the FilterNode is retained in the syntax tree even though its assignment is null.
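A minimal sketch of the pushdown described above: a filter condition that satisfies the push rule is moved from the FilterNode into its child TableScanNode, and the FilterNode is retained with an empty condition so the plan's shape stays consistent. The node classes here are toy stand-ins, not the actual planner types.

```python
class TableScanNode:
    def __init__(self, table):
        self.table = table
        self.pushed_filters = []   # conditions the data component will evaluate

class FilterNode:
    def __init__(self, condition, child):
        self.condition = condition
        self.child = child

def push_down(node, can_push=lambda cond: True):
    """Push a FilterNode's condition into a child TableScanNode when allowed."""
    if isinstance(node, FilterNode) and isinstance(node.child, TableScanNode):
        if node.condition is not None and can_push(node.condition):
            node.child.pushed_filters.append(node.condition)
            node.condition = None   # nulled, but the node itself is retained
    return node

plan = FilterNode("userid like 'a1743%'", TableScanNode("useropra"))
push_down(plan)
```

After pushdown, the nulled FilterNode simply passes the (already filtered) result set through.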
In an embodiment, the query execution plan is optimized based on cost according to the statistical information in the global metadata information. The global metadata information may include statistics such as the discrete distribution of each field in a table; the cost (time and resources consumed) of different query execution plans is estimated from these statistics, so that the query execution plan can be optimized and the optimal plan selected. Optimizing the execution plan greatly improves query efficiency.
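A hedged sketch of the cost-based choice: here the statistics are reduced to row counts and a predicate selectivity estimate, and the cost model is a deliberately simplistic row-pair count. It only illustrates the principle that statistics let the planner pick the cheaper of two equivalent plans.

```python
def cost_join_then_filter(rows_a, rows_b):
    # join everything first, filter afterwards: every row pair is examined
    return rows_a * rows_b

def cost_filter_then_join(rows_a, rows_b, selectivity):
    # shrink one side with the filter first, then join the survivors
    return rows_a * selectivity * rows_b

def choose_plan(rows_a, rows_b, selectivity):
    """Pick the plan with the lower estimated cost."""
    plans = {
        "join_then_filter": cost_join_then_filter(rows_a, rows_b),
        "filter_then_join": cost_filter_then_join(rows_a, rows_b, selectivity),
    }
    return min(plans, key=plans.get)

# A highly selective predicate (0.1% of rows match) favors filtering first.
best = choose_plan(rows_a=1_000_000, rows_b=50_000, selectivity=0.001)
```

This is the same reasoning applied in application example six, where the selective gcwz condition is evaluated before the Join.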
In an embodiment, step 302 further includes returning the execution result or result set to the business application that initiated the query processing request. Specifically, the task execution node returns the execution result or result set to the component management node, and the component management node returns it to the business application that initiated the query processing request, for example in ResultSet form or command-line playback mode per the JDBC (Java Database Connectivity) specification. JDBC is a Java API for executing SQL statements that provides uniform access to multiple relational databases.
In one embodiment, the method further comprises extending the supported data components, supporting components such as HBase, Solr, and Elasticsearch; and introducing a secondary index mechanism to improve HBase query efficiency and application-level scalability. The scheme provided by this embodiment docks with common data sources, including traditional relational databases, in-memory databases, full-text retrieval components, and components such as HIVE and HBase commonly used in the mass-data field, remarkably reducing the workload of business application development.
In this embodiment, automatic selection of data components according to rules is supported, as is modification of metadata definitions through data definition statements. The global metadata information is managed by the component management node and issued/synchronized to the task execution nodes through the cluster management node, so all operations involving global metadata information can obtain it directly from local memory, improving execution efficiency. By contrast, the related-art Presto scheme requests global metadata information directly from each data component; consequently, when docking a business application, one must know which data component stores the required data, and during analysis and execution of the query execution plan, global metadata information must be requested from the corresponding data component in real time, making execution inefficient.
Compared with the related technology, the scheme provided by at least one embodiment of the invention can improve the development and deployment efficiency of the service application aiming at mass data, reduce the implementation complexity and reduce the workload of design and development.
The application is further illustrated by the following specific application examples. In the example figures, solid lines represent message flows and dashed lines represent data flows.
Application example 1
The present example relates to the creation of a service information group table. In the field of mass-data applications, it is common to need to create correlated tables in different data components at the same time to support a business application. In this example, when the business application executes CREATE TABLE xytj_tableA (assuming an established rule that a table name starting with xytj_ denotes a service information group table), xytj_tableA_Mysql needs to be created in Mysql and xytj_tableA_HBase needs to be created in HBase.
The process of creating the service information group table is shown in fig. 4. The components described in this scheme are denoted by GSQL in the flow chart herein, the same applies hereinafter.
The precondition of this example is that global metadata information has been imported from the data components according to the business application. For this example, the relevant rule for creating the group table needs to be defined in advance as follows (the rule applies to submitted Create Table statements: if the table name starts with xytj_, the data table needs to be created in HBase and Mysql simultaneously):
<rule name="MultipleCreateTable">
<operator>CreateTable</operator>
<regex tablename="xytj_*"/>
<functions>
<function>createMysqlTable</function>
<function>createHBaseTable</function>
</functions>
</rule>
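The rule above can be loaded and interpreted along the following lines; the XML layout follows the rule as shown, while the loader itself and the glob-to-regex convention are illustrative assumptions.

```python
import re
import xml.etree.ElementTree as ET

RULE_XML = """
<rule name="MultipleCreateTable">
  <operator>CreateTable</operator>
  <regex tablename="xytj_*"/>
  <functions>
    <function>createMysqlTable</function>
    <function>createHBaseTable</function>
  </functions>
</rule>
"""

def load_rule(xml_text):
    root = ET.fromstring(xml_text)
    # translate the glob-style "xytj_*" pattern into a regular expression
    pattern = root.find("regex").get("tablename").replace("*", ".*")
    return {
        "name": root.get("name"),
        "operator": root.find("operator").text,
        "pattern": re.compile(pattern),
        "functions": [f.text for f in root.findall("functions/function")],
    }

rule = load_rule(RULE_XML)
matches = bool(rule["pattern"].fullmatch("xytj_tableA"))
```

When the submitted Create Table statement matches, the two listed functions would drive the table creation in Mysql and HBase respectively.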
step 401, the business application submits a query SQL statement to the component management node through the JDBC interface. The SQL statements submitted in this embodiment are as follows:
Create Table xytj_tableA(teamcode varchar2(20),contenttype number,contentcode varchar2(64))
This SQL represents a service-side request to create the xytj_tableA group table. Other implementation schemes in the industry require separate submissions for each data source to meet this requirement; with the present scheme, a single submission suffices.
Step 402, after receiving the SQL statement, the component management node matches the create-service-information-group-table rule (MultipleCreateTable) in combination with the global metadata information, and constructs a query execution plan according to the rule.
Therein, step 402.1 is the interaction of the component management node and the GMD.
In step 403, after the query execution plan is optimized according to the rules (Rule-Based Optimization, RBO), the query execution paragraphs are formed.
For specific implementation of RBO, reference may be made to related technologies, which are not described herein in detail.
In this example, two query execution paragraphs are generated: Stage-1 comprises a table-creation operation on HBase; Stage-2 comprises a table-creation operation on Mysql. Stage-1 is executed first.
Step 404, creating xytj _ tableA _ HBase in the HBase, where the HBase statement actually executed in this embodiment is as follows:
create 'jz:xytj_tableA_hbase', {NAME=>'cf', DATA_BLOCK_ENCODING=>'DIFF', COMPRESSION=>'SNAPPY'}, {NAME=>'idx', DATA_BLOCK_ENCODING=>'DIFF', COMPRESSION=>'SNAPPY'}
The above command line creates a table in HBase. In other industry schemes, meeting this requirement means having business personnel log into HBase and write and submit the command by hand; with the present scheme, the command is generated and submitted automatically, without manual execution.
Step 405, adding the xytj _ tableA _ hbase related information in the global metadata database.
Step 406, triggering the execution of Stage-2.
Step 407, create xytj _ tableA _ Mysql in Mysql, and the Mysql SQL statements actually executed in this embodiment are as follows:
Create Table xytj_tableA_mysql(teamcode varchar2(20), contenttype numeric, contentcode varchar2(64))
The above command line creates a table in Mysql. In other industry schemes, meeting this requirement means having business personnel log into Mysql and write and submit the command by hand; with the present scheme, the command is generated and submitted automatically, without manual execution.
In step 408, the relevant information of xytj _ tableA _ mysql is added to the global metadata database.
In step 409, the execution result (here, the indication of successful execution of the data definition statement) is output to the component management node.
In step 410, the component management node returns the execution result to the service application.
In the present embodiment, the timing of the related global metadata operation is as shown in fig. 5.
When the component service is started, the implementation flow of the related operation is as follows:
step 501, a component management node process (a process running on a component management node) triggers a metadata information initialization operation, obtains metadata information from a global metadata base, and stores the metadata information in a local memory.
Step 502, the component management node process pushes the global metadata information to the cluster management node.
Step 503, the task execution node service is started and completed, and registers to the cluster management node.
In step 504, after receiving the registration message of the task execution node, the cluster management node pushes the global metadata information to the task execution node. And the task execution node receives the information and stores the information in a local memory.
When executing a data definition statement (DDL), comprising:
and 505, the component management node process receives the SQL statement and triggers the query execution plan construction operation.
Step 506, according to the transmitted SQL statement, rule matching is carried out, the related data component is determined, whether business application customization operation exists or not is determined, and according to the matched rule, the sorted SQL statement is generated.
And 507, performing semantic analysis according to the metadata information to generate a query execution plan. In the syntax tree of the query execution plan, MetaDataUpdateNode is added. The node is used for updating the metadata information.
And step 508, generating an execution task according to the query execution plan, and completing the DDL operation on the data assembly through a corresponding data assembly docking layer.
In step 509, a global metadata update operation is performed in the MetaDataUpdateNode, and the operation updates the metadata information (located in the memory) of the global metadata base and the component management node. And then informing the component management node process to perform global metadata synchronization operation.
Step 510, the component management node process pushes the metadata information to the cluster management node.
Step 511, the cluster management node pushes the metadata information to each task execution node in a normal state. And after receiving the metadata information, the task execution node stores the metadata information in a local memory.
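The synchronization flow in steps 501–511 can be sketched with in-memory stand-ins for the roles involved: the cluster management node pushes the global metadata to each task execution node on registration, and broadcasts updates after a DDL changes the metadata, so later lookups are purely local. Class and method names are illustrative only.

```python
class TaskExecutionNode:
    def __init__(self, name):
        self.name = name
        self.local_metadata = {}   # the node's local in-memory copy

class ClusterManagementNode:
    def __init__(self):
        self.nodes = []

    def register(self, node, metadata):
        """Step 503/504: on registration, push current metadata to the node."""
        self.nodes.append(node)
        node.local_metadata = dict(metadata)

    def broadcast(self, metadata):
        """Steps 510/511: re-push metadata to every registered node."""
        for node in self.nodes:
            node.local_metadata = dict(metadata)

metadata = {"xytj_tableA": {"components": ["Mysql", "HBase"]}}
cluster = ClusterManagementNode()
worker = TaskExecutionNode("exec-1")
cluster.register(worker, metadata)

metadata["xytj_tableB"] = {"components": ["HBase"]}   # a DDL adds a table
cluster.broadcast(metadata)                           # re-sync all nodes
```

After the broadcast, every task execution node can answer metadata lookups from its own memory without contacting the data components.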
Application example two
This example relates to a multidimensional analysis application. In multidimensional analysis of mass data, associated queries across data stored in different data sources are common. In this example, the table kkjkxx stores collected track information and resides in HIVE; the table kksppp stores the pictures and radio-frequency image information related to the tracks and resides in HBase; the business application needs to Join the data of the two tables on the track number (gjxxbh) field.
The currently common solution is to aggregate by track number from table kkjkxx and table kksppp separately and then merge the statistics, a relatively complex implementation. With the present scheme, the business requirement is met by submitting a single SQL statement.
The precondition of this embodiment is to import global metadata information according to the service application. For this embodiment, metadata information of the table kkjkxx and the table kkppp and used data component information need to be defined in advance.
The multidimensional analysis (Join) flow is shown in FIG. 6 and includes:
step 601, the business application submits the query SQL statement to the component management node through the JDBC interface. The SQL statements submitted in this embodiment are as follows:
Select I01.clxxbh, I01.hphm, I01.dwbh, I02.SPSPLSH, I02.GCBQLSH From kkjkxx I01 join kksppp I02 on I01.gjxxbh=I02.gjxxbh
This SQL statement queries the records of the two data tables kkjkxx and kksppp whose gjxxbh fields are equal.
Step 602, after receiving the SQL statement, the component management node constructs a query execution plan by combining the global metadata information.
Step 603, after the query execution plan is optimized according to the rules, it is divided into multiple query execution paragraphs. In this example, two query execution paragraphs are generated: Stage-1 comprises the data collection operation on HBase and the Join operation; Stage-2 comprises the data collection operation on Hive. Both query execution paragraphs start running simultaneously. In Stage-1, the HBaseConnector is executed first; the JoinNode operation is executed only after Stage-2 completes and its data has been transmitted.
In step 604, the HBase Connector acquires data from the HBase.
Step 605, constructing a Hive Connector; and when the global metadata information is needed, local reading operation is performed as much as possible, namely the global metadata information is obtained from the local.
In step 606, the Hive Connector acquires data from HIVE.
And step 607, transmitting the data acquired from HIVE to Stage-1 to perform Join operation.
Step 608, output Join operation result set to component management node.
In step 609, the component management node returns the query result to the service application.
Steps 602.1, 603.1, and 605.1 in fig. 6 represent interactions with the global metadata management module, and interact with the global metadata management module when metadata needs to be acquired or updated during processing.
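The Stage-1/Stage-2 data flow above can be sketched as a toy hash join: one function stands in for the Hive Connector, another for the HBase Connector, and the join runs once both sides' rows are available. This is an illustration of the flow, not the patented implementation; the connector functions just return canned rows.

```python
def hive_connector():
    # Stage-2: rows collected from table kkjkxx in HIVE (canned sample data)
    return [{"gjxxbh": "T1", "hphm": "A-123"}, {"gjxxbh": "T2", "hphm": "B-456"}]

def hbase_connector():
    # Stage-1: rows collected from table kksppp in HBase (canned sample data)
    return [{"gjxxbh": "T1", "SPSPLSH": "p1"}]

def hash_join(left, right, key):
    """Join two row lists on a key by indexing one side first."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

joined = hash_join(hive_connector(), hbase_connector(), "gjxxbh")
```

Only the track present in both tables (gjxxbh "T1") survives the join, mirroring the single-SQL cross-source association the example describes.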
As shown in fig. 7, when executing a data manipulation statement (DML), the following operations are included:
step 701, triggering and constructing query execution plan operation by the component management node process.
Step 702, according to the transmitted SQL statement, matching with the global metadata information, determining the related data component, and determining whether business application customization operation exists. And generating the sorted SQL statement according to the relevant rule.
And 703, performing semantic analysis according to the global metadata information to generate a query execution plan.
Step 704, generating an execution task according to the query execution plan, and completing data operation on the data component through the corresponding data component docking layer. In the step of performing data operation by the data component docking layer docking the data component, if the global metadata information is needed, the global metadata information is directly acquired from the local memory of the task execution node.
Application example three
This example relates to queries based on behavior logs, a frequent scenario in mass-data applications; such data is often stored in HBase, and queries filter out unnecessary data as much as possible through partitioning, predicate pushdown, and the like. In this example, assume the behavior log table useropra is stored in HBase and partitioned by day; the business application needs to query the behavior-log information of a certain class of agents within a specified date range.
The precondition of this embodiment is to import global metadata information according to the service application. For this embodiment, metadata information and component information used in the table useropra need to be predefined, and the table useropra is partitioned by day according to the service data scene.
The HBase query flow is shown in fig. 8a, and includes:
step 801, the business application submits a query SQL statement to the component management node through the JDBC interface. The SQL statements submitted in this embodiment are as follows:
select userid, opraid, capture_time from useropra where userid like 'a1743%' and (substr(opraid,1,3)='101' or substr(opraid,1,3)='102') and capture_time>=1526400000 and capture_time<=1529078400
the SQL represents the records that query the useropra table for, and meet the following conditions:
a: userid begins with "a 1743";
b: the first three characters of oproid are "101" or "102";
c: capture_time is between 1526400000 and 1529078400 (inclusive).
step 802, after receiving the SQL statement, the component management node constructs a query execution plan by combining the global metadata information.
In step 803, values are assigned to the FilterNode and TableScanNode in the query execution plan. The query execution plan is traversed; when a FilterNode is reached, the filter conditions that satisfy the rules (such as OR statements and conditional expressions containing format conversion, the four arithmetic operations, or trigonometric functions) are pushed into the child TableScanNode, and the FilterNode whose conditions were pushed is left empty. In this example, the conditional expressions "userid like 'a1743%'" and "(substr(opraid,1,3)='101' or substr(opraid,1,3)='102')" are both pushed down to the TableScanNode, converging the result set dramatically.
The syntax tree is shown in fig. 8b; it includes an OutputNode (output node), an ExchangeNode (exchange node), a FilterNode, and a TableScanNode. In the related art, the filter conditions are executed in the FilterNode; in this example, they are executed in the TableScanNode.
After the query execution plan is optimized according to the rules, it is divided into multiple query execution paragraphs. Combined with the partition-type definition of the data table in the global metadata information, an HBase Connector is constructed (under the preconditions of this example, multiple HBase Scans are formed for the specified time range (capture_time>=1526400000 and capture_time<=1529078400) and queried concurrently).
And step 804, the HBase receives the Scan request, and acquires the related behavior log information of the designated agent through the coprocessor and the secondary index.
To better support this business demand, the row keys in HBase need to be indexed: the business data and the secondary-index information are both stored in HBase. The rowkey of a business-data record is the record's unique identifier; the rowkey of a secondary-index entry is generated from the actual values of the business fields, and its corresponding value is the rowkey of the business data. When a query is processed, a coprocessor interface of HBase is implemented so that the rowkey range of the secondary index can be assembled from the query conditions, quickly locating the rowkeys of the business data and thereby improving query efficiency.
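A hedged sketch of that secondary-index layout, using a plain dict as a stand-in for an HBase table: each business row is stored under its unique rowkey, and each indexed field additionally gets an index row whose key encodes the field value and whose value is the business rowkey; a prefix scan over the index keys then locates the business rowkeys. The key format here is an assumption for illustration, not the patent's exact layout.

```python
def index_rowkey(field, value, data_rowkey):
    # assumed index-key format: idx|<field>|<value>|<business rowkey>
    return f"idx|{field}|{value}|{data_rowkey}"

def put(store, data_rowkey, row, indexed_fields):
    """Store the business row plus one index row per indexed field."""
    store[data_rowkey] = row
    for f in indexed_fields:
        store[index_rowkey(f, row[f], data_rowkey)] = data_rowkey

def lookup(store, field, value):
    """Prefix-scan the index keys; return the matching business rowkeys."""
    prefix = f"idx|{field}|{value}|"
    return [store[k] for k in sorted(store) if k.startswith(prefix)]

store = {}
put(store, "r001", {"userid": "a1743x", "opraid": "101"}, ["userid"])
hits = lookup(store, "userid", "a1743x")
```

In HBase itself, the sorted-key prefix scan would be performed by a Scan over the index rows, with the coprocessor assembling the prefix from the query conditions.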
Step 805, the HBase returns the query information to the HBase Connector.
In step 806, the HBase Connector processes the returned result and outputs it to the downstream query execution section (in this embodiment, the component management node).
In step 807, the component management node returns the query result to the business application.
Steps 802.1 and 803.1 in fig. 8a represent interactions with the global metadata management module; the module is contacted whenever metadata needs to be acquired or updated during processing.
The Filter condition depression timing diagram is shown in fig. 9 and includes:
in step 901, the component management node process receives the SQL statement and triggers the construction of the query execution plan.
Step 902, the query execution plan is generated, and values are assigned to the FilterNodes in it.
In step 903, values are assigned to the TableScanNode in the execution plan.
Step 904, optimization is performed for the TableScanNode: its immediate parent FilterNode is traced back, and the filter conditions that satisfy the rules (e.g. OR statements and conditional expressions containing format conversion, the four arithmetic operations, or trigonometric functions) are pushed down into the TableScanNode.
Step 905, a FilterNode whose conditions have been pushed down is nulled; it is not removed from the query execution plan, so as to maintain the plan's consistency.
Step 906, data is collected from the data component according to the information designated by the TableScanNode, through the data-component connection layer.
Step 907, the FilterNode operation is triggered to filter the result set returned by the TableScanNode. A nulled FilterNode does no substantive processing and returns the result set directly.
Step 908, the result set is returned to the component management node process, which returns the result set to the business application.
Application example four
This example relates to a data export operation. In mass-data application scenarios, the query result set is often too large to read record by record; typically an SQL statement like CREATE TABLE ... AS is submitted to write the result set out, and the resulting file is then fetched directly. In this example, assume a data table tamee in HBase is queried and the result is written into HIVE.
The precondition of this example is that global metadata information is imported according to the business application. For this example, the metadata information of table tamee and the used component information need to be defined in advance, and the relevant rule is defined as follows (the rule indicates that if a table name starts with MT_, the table is a data table in HIVE):
<rule name="HiveTablePre">
<operator>*</operator>
<regex tablename="MT_*"/>
<functions>
<function>hiveTable</function>
</functions>
</rule>
the data export process is shown in fig. 10 and includes:
step 1001, the service application submits a query SQL statement to the component management node through the JDBC interface. The SQL statements submitted in this embodiment are as follows:
CREATE TABLE MT_161220102036554_070 AS Select msisdn_1, msimsi_1 From tamee Where dt=20161219
This SQL queries the data table tamee for the record set with dt equal to 20161219 and saves it to the data table MT_161220102036554_070.
In step 1002, after receiving the SQL statement, the component management node constructs a query execution plan in combination with the global metadata information.
Step 1003, after the query execution plan is optimized according to the rules, the query execution paragraphs are formed. In this example, two query execution paragraphs are generated: Stage-1 comprises the data collection operation on HBase; Stage-2 comprises the table-creation operation on Hive. Both query execution paragraphs start running simultaneously; however, having no data yet, Stage-2 is in a suspended state.
At step 1004, the HBase Connector acquires data from HBase.
And step 1005, transmitting the data acquired from the HBase to Stage-2.
In step 1006, a table MT _161220102036554_070 is created in HIVE. Based on submitted SQL, the table MT _161220102036554_070 only comprises two fields of msisdn _1 and msimsi _1, and the field information of the table MT _161220102036554_070 refers to tamee related information defined by global metadata; the generated HIVE tabulation statement is as follows:
CREATE TABLE IF NOT EXISTS MT_161220102036554_070(msisdn_1 STRING, msimsi_1 STRING).
After the table is created, the data is persisted to it.
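The derivation of that HIVE DDL can be sketched as follows: the target table's columns come from the SELECT list, while their types are looked up in the global metadata for the source table tamee. The metadata content below is an assumption for illustration.

```python
# Hypothetical global metadata entry for the source table (types assumed).
GLOBAL_METADATA = {"tamee": {"msisdn_1": "STRING", "msimsi_1": "STRING", "dt": "STRING"}}

def build_ctas_ddl(target, source, selected_columns, metadata=GLOBAL_METADATA):
    """Build a Hive CREATE TABLE statement whose columns mirror the SELECT list."""
    cols = ", ".join(f"{c} {metadata[source][c]}" for c in selected_columns)
    return f"CREATE TABLE IF NOT EXISTS {target}({cols})"

ddl = build_ctas_ddl("MT_161220102036554_070", "tamee", ["msisdn_1", "msimsi_1"])
```

Only the two selected fields appear in the generated table, matching the statement shown above.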
Step 1007, the global metadata is updated; specifically, the global metadata information of the global metadata base and of the component management node is updated in the MetaDataUpdateNode. The detailed process may refer to the description of application example one. The related information of the newly created HIVE table is added to the global metadata base.
Step 1008 outputs the result of the execution (here, the total number of records of the export data) to the component management node.
In step 1009, the component management node returns the execution result to the service application.
Steps 1002.1 and 1003.1 in fig. 10 represent interaction with the global metadata management module, and interact with the global metadata management module when metadata is needed or updated in the processing process.
Application example five
This example relates to heterogeneous data-source queries. In mass-data application scenarios, hot data and historical data are frequently stored separately so that hot data can be queried quickly while the historical full data remains available. In this example, the hot-spot portion of the gprs data (data of the last 3 months) is stored in Elasticsearch, and the historical full data is stored in HBase. At query time, the selection and subsequent processing of the heterogeneous data sources are completed automatically according to the submitted SQL; the business application layer need not be designed or developed around the actual storage location of the data.
The precondition of this example is that global metadata information is imported from the data components into the global metadata repository according to the business application. For this example, the metadata information of the table gprs and the used component information need to be predefined, and the data-table-related rules are predefined as follows (the rules indicate that query operations on the table gprs will be queried from Elasticsearch and HBase according to the date field (dt)):
<rule name="MultipleSelectTable">
<operator>Select</operator>
<regex tablename="gprs"/>
<MultipleSelect>
<function>dateMultipleSelect</function>
<parameter>dt,m3</parameter>
</MultipleSelect>
<functions>
<function>selectEsTable</function>
<function>selectHBaseTable</function>
</functions>
</rule>
In the present example, "<parameter>dt,m3</parameter>" describes a specific time-judgment rule: dt is the date field of the data, and m3 denotes the last 3 months. Other rule-description modes may be used.
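The "dt, m3" routing can be sketched as a date cutoff: rows with dt within the last three months go to Elasticsearch, older rows to HBase, and a query range straddling the cutoff must hit both components. The three-month cutoff below uses a fixed ~90-day approximation; the rule's exact month arithmetic is not specified in the text.

```python
from datetime import date, timedelta

def route_by_date(dt_value, today, months=3):
    """Route a yyyymmdd date string to the hot or historical component."""
    cutoff = today - timedelta(days=30 * months)   # assumed approximation of "last 3 months"
    d = date(int(dt_value[:4]), int(dt_value[4:6]), int(dt_value[6:8]))
    return "Elasticsearch" if d >= cutoff else "HBase"

def components_for_range(dt_lo, dt_hi, today):
    """A range straddling the cutoff needs both components (then a Union)."""
    return sorted({route_by_date(dt_lo, today), route_by_date(dt_hi, today)})
```

For the example query's range ('20171219' to '20180531'), both components are selected, which is why Stage-1 (HBase) and Stage-2 (Elasticsearch) are both generated and then combined with a Union.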
Heterogeneous data source access as shown in fig. 11, includes:
step 1101, the business application submits a query SQL statement to the component management node through the JDBC interface. The SQL statements submitted in this embodiment are as follows:
Select msisdn_1, msimsi_1, sdnnum From gprs Where dt>'20171219' and dt<'20180531'
the SQL statement represents a query for data between '20171219' and '20180531' in the gprs table, fields msisdn _1, msimsi _1, sdnnum.
Step 1102, after receiving the SQL statement, the component management node constructs a query execution plan in combination with the global metadata information.
Wherein, according to the predefined rule, the query to the gprs table needs to acquire data from HBase and Elasticsearch, respectively.
Step 1103, the query execution plan is optimized according to rules and divided into multiple query execution paragraphs. In this example, two query execution paragraphs are generated: Stage-1 comprises the collection operation on the historical data (the gprs table in HBase) and the Union operation; Stage-2 comprises the collection operation on the hot-spot data (the gprs table in Elasticsearch). Both query execution paragraphs start running simultaneously. In Stage-1, the HBase Connector is executed first; the UnionNode operation is executed after Stage-2 completes and its data has been transmitted.
In step 1104, the HBase Connector acquires historical data (excluding hotspot data) from the HBase.
In step 1105, an elastic search Connector is constructed.
In step 1106, the Elasticsearch Connector obtains hotspot data from the Elasticsearch.
Step 1107, the hotspot data obtained from the Elasticsearch is transmitted to Stage-1, and Union operation is performed on the hotspot data and the history data obtained from the HBase.
Step 1108, the result set of the Union operation is output to the component management node.
Step 1109, the component management node returns the result set to the service application.
Steps 1102.1, 1103.1, and 1105.1 in fig. 11 represent interactions with the global metadata management module; these interactions are performed when metadata is needed or updated during processing.
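The flow above can be sketched minimally: the two query execution paragraphs start at the same time, and Stage-1 performs the UnionNode operation only after Stage-2 has transmitted its hot-spot data. The scan functions and row values below are invented in-memory stand-ins for the HBase and Elasticsearch Connectors, not real client calls:

```python
from concurrent.futures import ThreadPoolExecutor

def hbase_scan():
    # Stand-in for the HBase Connector: historical rows (invented values).
    return [("hist-1",), ("hist-2",)]

def es_scan():
    # Stand-in for the Elasticsearch Connector: hot-spot rows (invented values).
    return [("hot-1",)]

def run_plan():
    """Both paragraphs start simultaneously; Stage-1 runs its own scan,
    then waits on Stage-2's result before the UnionNode operation."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        stage2 = pool.submit(es_scan)         # Stage-2 starts immediately
        stage1_rows = hbase_scan()            # Stage-1's HBase Connector
        return stage1_rows + stage2.result()  # UnionNode: merge both sources
```

The design point is that neither store sees the other: the component management node's plan is the only place where the two result sets meet.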
Application example six
The present embodiment relates to index and detail association queries. In a massive data application scenario, an index (or information generated by statistical operations) is often stored separately from the details. In this embodiment, it is assumed that a certain trajectory application service stores trajectory information in Solr (table gclli) and stores trajectory details in HBase (table gclliDesc); the complete trajectory information meeting the conditions can be obtained through an associated SQL (Join) statement. When processing this SQL, the present scheme can perform cost-based optimization (CBO).
The precondition of this embodiment is that global metadata information has been imported from the data components into the global metadata repository according to the service application. For this embodiment, the metadata information of the table gclli and the table gclliDesc and the component information used need to be defined in advance, and the automatic statistics function for the table gclli has been enabled. The index query flow is shown in fig. 12 and includes:
step 1201, the business application submits a query SQL statement to the component management node through the JDBC interface. The SQL statements submitted in this embodiment are as follows:
Select T1.gcbh,T1.gcwz,T2.sxrs,T2.sxfl From gclli T1 join gclliDesc T2 on T1.gcbh=T2.gcbh Where T1.gcwz='N073E011'
This SQL statement queries records where the gcbh fields of the gclli and gclliDesc tables are equal and the gcwz field of the gclli table is 'N073E011'.
Step 1202, after receiving the SQL statement, the component management node constructs a query execution plan by combining the global metadata information. In this embodiment, cost-based optimization is performed: judging from the statistical information in the global metadata information, few records satisfy the conditional expression ("T1.gcwz='N073E011'"), so an execution plan for sequential execution is formed instead of the original Join execution plan. That is, the records satisfying "T1.gcwz='N073E011'" are taken out first, and then the Join (association) is performed. Taking out the filtered records first and then associating them is faster than associating first and then filtering.
Step 1203, the query execution plan is optimized according to rules and divided into query execution paragraphs. In this embodiment, two query execution paragraphs are generated: Stage-1 includes the data acquisition operation on the table gclli (stored in Solr); Stage-2 includes the data acquisition operation on the table gclliDesc (stored in Hive). Stage-1 is executed first.
In step 1204, the Solr Connector acquires qualified gclli data from the Solr.
Step 1205, the acquired gclli data is transmitted to Stage-2.
Step 1206, a query statement for the table gclliDesc is generated according to the gclli data, and data is collected from Hive.
Step 1207, the collected data set is output to the component management node.
In step 1208, the component management node returns the execution result to the service application.
Where step 1202.1 in FIG. 12 represents interaction with the global metadata management module when metadata is needed or updated during processing.
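The cost-based choice in step 1202 — taking out the few filtered records first and only then associating — can be sketched as follows. The row contents, field values, and function name are invented for illustration; plain dictionaries stand in for the Solr index table and the detail table:

```python
def filter_then_join(index_rows, detail_rows, wanted_gcwz):
    """Apply the cheap, selective filter first, then associate by gcbh,
    instead of joining the full tables and filtering afterwards."""
    keep = [r for r in index_rows if r["gcwz"] == wanted_gcwz]  # few survivors
    by_gcbh = {d["gcbh"]: d for d in detail_rows}               # detail lookup
    return [{**r, **by_gcbh[r["gcbh"]]} for r in keep if r["gcbh"] in by_gcbh]

# Invented sample data standing in for the gclli (index) and gclliDesc (detail) tables.
index_rows = [{"gcbh": "G1", "gcwz": "N073E011"},
              {"gcbh": "G2", "gcwz": "N000E000"}]
detail_rows = [{"gcbh": "G1", "sxrs": 5, "sxfl": "A"},
               {"gcbh": "G2", "sxrs": 7, "sxfl": "B"}]
result = filter_then_join(index_rows, detail_rows, "N073E011")
```

With table statistics indicating the filter is highly selective, the association work is proportional to the filtered set rather than to the full index table, which is the speedup the embodiment describes.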
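Step 1206's generation of a detail-table statement from the index results might look like the following sketch. The statement shape, field list, and helper name are assumptions for illustration, not the scheme's actual internals:

```python
def detail_statement(gcbh_values):
    """Build the follow-up query for the detail table from the index hits.

    Values here come from the scheme's own index store; a production
    implementation would bind parameters rather than inline literals."""
    in_list = ",".join("'{}'".format(v) for v in gcbh_values)
    return "Select gcbh,sxrs,sxfl From gclliDesc Where gcbh in ({})".format(in_list)
```

For example, index hits G1 and G2 would yield a statement restricting the Hive scan to exactly those two gcbh keys.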
As shown in fig. 13, an embodiment of the present invention provides a data query processing apparatus 130, which includes a memory 1310 and a processor 1320, where the memory 1310 stores a program, and when the program is read and executed by the processor 1320, the program implements the data query processing method according to any embodiment.
As shown in fig. 14, an embodiment of the present invention provides a computer-readable storage medium 140, which stores one or more programs 141, and the one or more programs 141 are executable by one or more processors to implement the data query processing method according to any embodiment.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Claims (22)

1. A data query processing method, comprising:
receiving a query processing request, determining a data component related to the query processing request according to global metadata information acquired from a global metadata database, and generating a query execution plan according to the global metadata information; wherein the global metadata information of the global metadata base comprises metadata information of data components;
and based on the query execution plan, performing data operation with the data component related to the query processing request in an interactive mode.
2. The data query processing method of claim 1, wherein the global metadata information comprises at least one of: table definition information, table extension information, table field definition information, metadata rule information, table statistics information.
3. The data query processing method according to claim 1, wherein the determining, based on the global metadata information obtained from the global metadata repository, the data component involved in the query processing request includes at least one of:
when the global metadata information comprises table definition information which comprises component information, determining the related data components according to the component information in the table definition information in the global metadata information;
and when the global metadata information comprises the metadata rule, determining the related data component according to the matched metadata rule in the global metadata information.
4. The data query processing method of claim 1, wherein the generating a query execution plan according to the global metadata information comprises: and matching the query processing request with a predefined metadata rule in the global metadata information, and generating the query execution plan according to the matched metadata rule when the metadata rule is matched.
5. The data query processing method of claim 1, wherein the performing data operations with the data component involved in the query processing request comprises:
and dividing the query execution plan into execution tasks, issuing the execution tasks to task execution nodes, and finishing data operation by the task execution nodes through interaction between the data component docking layer of the data component related to the query processing request and the data component.
6. The data query processing method of claim 5, before receiving the query processing request, further comprising: and the component management node acquires the global metadata information from the global metadata database and pushes the global metadata information to the task execution node.
7. The data query processing method of claim 6, wherein in the process of finishing data operations by interacting with the data components involved in the query processing request, when the global metadata information needs to be used, the global metadata information is locally obtained from the task execution node.
8. The data query processing method of claim 6, wherein the method further comprises: when the data in the data component is updated, the global metadata base is updated, and the global metadata information of the component management node and the global metadata information in the task execution node are updated.
9. The data query processing method of any one of claims 1 to 8, wherein the query execution plan is a syntax tree comprising multiple levels of execution nodes, and at least some of the filter conditions are assigned to table scan nodes in the syntax tree.
10. The data query processing method of claim 9, wherein a filtering node whose value is null is located at a level above the table scan node in the syntax tree.
11. The data query processing method according to any one of claims 1 to 8, wherein the query execution plan is a query execution plan optimized based on a cost according to statistical information in the global metadata information.
12. A data query processing system, comprising:
the global metadata base is used for storing global metadata information, and the global metadata information comprises metadata information of data components;
the component management node is used for receiving a query processing request, determining a data component related to the query processing request according to global metadata information acquired from the global metadata database, and generating a query execution plan according to the global metadata information; generating an execution task based on the query execution plan, and issuing the execution task to a task execution node;
and the task execution node is used for finishing data operation with the data assembly related to the query processing request after receiving the execution task.
13. The data query processing system of claim 12, wherein the global metadata information includes at least one of: table definition information, table extension information, table field definition information, metadata rule information, table statistics information.
14. The data query processing system of claim 12, wherein the component management node comprises: the system comprises a query execution plan module, a cluster management node, a global metadata management module and a data component docking layer, wherein:
the global metadata management module is used for acquiring the global metadata information from the global metadata base;
the query execution plan module is used for receiving a query processing request, determining a data component related to the query processing request according to the global metadata information, and generating a query execution plan according to the global metadata information; generating an execution task based on the query execution plan, and issuing the execution task to a task execution node accessed to a cluster management node;
the cluster management node is used for managing the access and exit of the task execution node;
the data component docking layer is used for docking with the data component to realize the access of the task execution node to the data component.
15. The data query processing system of claim 14, wherein the query execution plan module generating a query execution plan based on the global metadata information comprises: and matching the query processing request with a predefined metadata rule in the global metadata information, and generating the query execution plan according to the matched metadata rule.
16. The data query processing system of claim 14, wherein the global metadata management module is further configured to: and pushing the global metadata information to the task execution node.
17. The data query processing system of claim 16, wherein the task execution node is further configured to, during a data operation performed by interacting with the data component involved in the query processing request, locally obtain the global metadata information from the task execution node when the global metadata information is needed.
18. The data query processing system of claim 16, wherein the global metadata management module is further configured to update the global metadata repository when the task execution node updates the data in the data component, and synchronize the local global metadata information with the global metadata information of the task execution node according to the updated global metadata repository.
19. The data query processing system of any of claims 12 to 18, wherein the query execution plan module generating the query execution plan based on the global metadata information comprises: a syntax tree is generated that includes multiple levels of execution nodes, and at least some of the filter conditions are assigned to table scan nodes in the syntax tree.
20. The data query processing system of any of claims 12 to 18, wherein the query execution plan module generating the query execution plan based on the global metadata information comprises: and generating a query execution plan which is optimized based on cost according to the statistical information in the global metadata information.
21. A data query processing apparatus comprising a memory and a processor, the memory storing a program which, when read and executed by the processor, implements a data query processing method as claimed in any one of claims 1 to 11.
22. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the data query processing method of any one of claims 1 to 11.
CN201811599635.2A 2018-12-26 2018-12-26 Data query processing method, device and system and computer readable storage medium Pending CN111367954A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811599635.2A CN111367954A (en) 2018-12-26 2018-12-26 Data query processing method, device and system and computer readable storage medium
PCT/CN2019/128782 WO2020135613A1 (en) 2018-12-26 2019-12-26 Data query processing method, device and system, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811599635.2A CN111367954A (en) 2018-12-26 2018-12-26 Data query processing method, device and system and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111367954A true CN111367954A (en) 2020-07-03

Family

ID=71126821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811599635.2A Pending CN111367954A (en) 2018-12-26 2018-12-26 Data query processing method, device and system and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111367954A (en)
WO (1) WO2020135613A1 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328624A (en) * 2020-11-06 2021-02-05 瀚高基础软件股份有限公司 SQL forwarding method and device and readable storage medium
CN112434056A (en) * 2020-10-12 2021-03-02 南京江北新区生物医药公共服务平台有限公司 Method and device for inquiring detailed data
CN112463837A (en) * 2020-12-17 2021-03-09 四川长虹电器股份有限公司 Relational database data storage query method
CN113076306A (en) * 2021-06-07 2021-07-06 航天神舟智慧***技术有限公司 Data resource automatic collection method and system based on cataloguing rule
CN113094340A (en) * 2021-04-28 2021-07-09 杭州海康威视数字技术股份有限公司 Data query method, device and equipment based on Hudi and storage medium
CN113271307A (en) * 2021-05-18 2021-08-17 中国工商银行股份有限公司 Data assembling method, device, computer system and storage medium
CN113590217A (en) * 2021-07-26 2021-11-02 北京百度网讯科技有限公司 Function management method and device based on engine, electronic equipment and storage medium
CN115563183A (en) * 2022-09-22 2023-01-03 北京百度网讯科技有限公司 Query method, device and program product
CN116595232A (en) * 2023-05-24 2023-08-15 杭州金智塔科技有限公司 Cross-data-source data processing system, method and device

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102982075A (en) * 2012-10-30 2013-03-20 北京京东世纪贸易有限公司 Heterogeneous data source access supporting system and method thereof
CN103605698A (en) * 2013-11-06 2014-02-26 广东电子工业研究院有限公司 Cloud database system used for distributed heterogeneous data resource integration
CN104484472A (en) * 2014-12-31 2015-04-01 天津南大通用数据技术股份有限公司 Database cluster for mixing various heterogeneous data sources and implementation method
CN104636329A (en) * 2013-11-06 2015-05-20 北京航天长峰科技工业集团有限公司 Unified management method for large-scale cross-platform heterogeneous data
US20150143503A1 (en) * 2012-08-31 2015-05-21 Salesforce.Com, Inc. Systems and methods for content management in an on-demand environment
CN107491510A (en) * 2017-08-03 2017-12-19 国网江苏省电力公司信息通信分公司 One kind mixing heterogeneous data source unified query system and distributed enquiring method

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US7974967B2 (en) * 2008-04-15 2011-07-05 Sap Ag Hybrid database system using runtime reconfigurable hardware
US9081826B2 (en) * 2013-01-07 2015-07-14 Facebook, Inc. System and method for distributed database query engines
US9805092B1 (en) * 2013-02-25 2017-10-31 EMC IP Holding Company LLC Parallel processing database system
CN104008135A (en) * 2014-05-07 2014-08-27 南京邮电大学 Multi-source heterogeneous database fusion system and data query method thereof

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
US20150143503A1 (en) * 2012-08-31 2015-05-21 Salesforce.Com, Inc. Systems and methods for content management in an on-demand environment
CN102982075A (en) * 2012-10-30 2013-03-20 北京京东世纪贸易有限公司 Heterogeneous data source access supporting system and method thereof
CN103605698A (en) * 2013-11-06 2014-02-26 广东电子工业研究院有限公司 Cloud database system used for distributed heterogeneous data resource integration
CN104636329A (en) * 2013-11-06 2015-05-20 北京航天长峰科技工业集团有限公司 Unified management method for large-scale cross-platform heterogeneous data
CN104484472A (en) * 2014-12-31 2015-04-01 天津南大通用数据技术股份有限公司 Database cluster for mixing various heterogeneous data sources and implementation method
CN107491510A (en) * 2017-08-03 2017-12-19 国网江苏省电力公司信息通信分公司 One kind mixing heterogeneous data source unified query system and distributed enquiring method

Non-Patent Citations (1)

Title
LU BAOGUO et al.: "A Unified Access Framework for Multi-source Heterogeneous Data Oriented to Data Centers", Command Information System and Technology *

Cited By (12)

Publication number Priority date Publication date Assignee Title
CN112434056A (en) * 2020-10-12 2021-03-02 南京江北新区生物医药公共服务平台有限公司 Method and device for inquiring detailed data
CN112328624A (en) * 2020-11-06 2021-02-05 瀚高基础软件股份有限公司 SQL forwarding method and device and readable storage medium
CN112463837A (en) * 2020-12-17 2021-03-09 四川长虹电器股份有限公司 Relational database data storage query method
CN112463837B (en) * 2020-12-17 2022-08-16 四川长虹电器股份有限公司 Relational database data storage query method
CN113094340A (en) * 2021-04-28 2021-07-09 杭州海康威视数字技术股份有限公司 Data query method, device and equipment based on Hudi and storage medium
CN113271307A (en) * 2021-05-18 2021-08-17 中国工商银行股份有限公司 Data assembling method, device, computer system and storage medium
CN113076306A (en) * 2021-06-07 2021-07-06 航天神舟智慧***技术有限公司 Data resource automatic collection method and system based on cataloguing rule
CN113590217A (en) * 2021-07-26 2021-11-02 北京百度网讯科技有限公司 Function management method and device based on engine, electronic equipment and storage medium
CN113590217B (en) * 2021-07-26 2022-12-02 北京百度网讯科技有限公司 Function management method and device based on engine, electronic equipment and storage medium
CN115563183A (en) * 2022-09-22 2023-01-03 北京百度网讯科技有限公司 Query method, device and program product
CN115563183B (en) * 2022-09-22 2024-04-09 北京百度网讯科技有限公司 Query method, query device and program product
CN116595232A (en) * 2023-05-24 2023-08-15 杭州金智塔科技有限公司 Cross-data-source data processing system, method and device

Also Published As

Publication number Publication date
WO2020135613A1 (en) 2020-07-02

Similar Documents

Publication Publication Date Title
CN111367954A (en) Data query processing method, device and system and computer readable storage medium
US10963428B2 (en) Multi-range and runtime pruning
JP6617117B2 (en) Scalable analysis platform for semi-structured data
US7577637B2 (en) Communication optimization for parallel execution of user-defined table functions
US4769772A (en) Automated query optimization method using both global and parallel local optimizations for materialization access planning for distributed databases
US10007698B2 (en) Table parameterized functions in database
US9298829B2 (en) Performing a function on rows of data determined from transitive relationships between columns
US10534770B2 (en) Parallelizing SQL on distributed file systems
US11494337B2 (en) Data pruning based on metadata
JP4747094B2 (en) DML statement for densifying data in a relational database system
CN109885585B (en) Distributed database system and method supporting stored procedures, triggers and views
CN110399368B (en) Method for customizing data table, data operation method and device
CN110019314B (en) Dynamic data packaging method based on data item analysis, client and server
US20180150544A1 (en) Synchronized updates across multiple database partitions
US11893026B2 (en) Advanced multiprovider optimization
US7958160B2 (en) Executing filter subqueries using a parallel single cursor model
Ding et al. Sagedb: An instance-optimized data analytics system
CN106339432A (en) System and method for balancing load according to content to be inquired
El-Helw et al. Optimization of common table expressions in mpp database systems
CN114416884A (en) Method and device for connecting partition table
Zhu et al. Hydb: Access optimization for data-intensive service
Barhate Shuffle Overhead Analysis for the Layered Data Abstractions
CN116975098A (en) Query plan construction method, device, electronic equipment and storage medium
Pipita Dynamic query optimization in spark
Vissapragada Optimizing SQL Query Execution over Map-Reduce

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200703