CN111444220A

CN111444220A - Cross-platform SQL query optimization method combining rule driving and data driving

Info

Publication number: CN111444220A
Application number: CN202010387095.2A
Authority: CN
Inventors: 顾荣; 张仪; 袁春风; 黄宜华
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2020-05-09
Filing date: 2020-05-09
Publication date: 2020-07-24
Anticipated expiration: 2040-05-09
Also published as: CN111444220B

Abstract

The invention discloses a cross-platform SQL query optimization method combining rule driving and data driving. The method comprises the following steps: first, a cross-platform SQL statement is parsed into a logical query plan inside the system; second, an optimizer scheduling module schedules the most suitable optimizer for query optimization according to the characteristics of the logical query plan; third, the rule-driven optimizer performs a plan search according to rules, selects an execution plan according to a cost model and cardinality estimation to obtain an optimal physical execution plan, and imports the optimization result into a sample acquisition module; fourth, a data adaptation module converts the samples imported by the sample acquisition module into training samples, the data-driven optimizer trains a reinforcement learning model with the training samples, and the query is input into the trained model to obtain the optimal physical execution plan.

Description

Cross-platform SQL query optimization method combining rule driving and data driving

Technical Field

The invention relates to the technical field of cross-platform SQL query optimization, in particular to a cross-platform SQL query optimization method combining rule driving and data driving.

Background

In recent years, with the increasing demand of various industries for big data analysis and processing applications, big data query systems are gradually developing towards diversification. These data query systems have various characteristics in terms of query language, computational model, system architecture, underlying storage technology, and the like, and are suitable for different vertical application scenarios, so that modern enterprises or organizations often construct a variety of different data query systems in order to handle diversified services. However, many comprehensive services need to be able to perform convenient and efficient cross-platform data query, for example, there is a need for a unified data analysis service between different departments of the same organization. Therefore, how to fully utilize the characteristics of different computing platforms to complete efficient and convenient cross-platform data query has become a research hotspot in the academic and industrial fields at present.

Because different data engines target different query scenarios, optimizing a cross-platform query and selecting a suitable execution platform for it can greatly improve query efficiency. However, query optimization is a combinatorially complex problem, and the cross-platform factor expands the types of join operators, making the search space even larger. Heuristic methods are therefore commonly used for query optimization; for example, the classical System R typically restricts the search space to plans of certain shapes (e.g., "left-deep tree" plans). For large join queries, query optimizers sometimes also apply further heuristics such as genetic or randomized algorithms. In boundary cases these heuristics may break down, resulting in inefficient queries.

Existing work on cross-platform query optimization has shortcomings in flexibility, scalability and efficiency. First, existing work relies heavily on fixed heuristic strategies when performing query optimization, so the optimization strategy cannot adapt to changes in the data set and the query load, which leads to poor results in boundary cases. Second, some cross-platform query optimization methods search for execution plans exhaustively, which limits the join size that can be optimized. In addition, the physical execution plans found by existing cross-platform query optimization methods are not efficient: sub-query division and sub-query scheduling are not performed optimally, and the performance of the multiple execution platforms cannot be fully exploited.

Disclosure of Invention

The invention aims to solve the problems and shortcomings of the prior art by providing a cross-platform SQL query optimization method combining rule driving and data driving, which addresses the poor scalability, low flexibility and poor optimization effect of conventional cross-platform query optimization methods.

In order to achieve this purpose, the technical scheme adopted by the invention is a cross-platform SQL query optimization method combining rule driving and data driving, which comprises the following steps:

(1) parsing the cross-platform SQL statement into a logical query plan inside the system;

(2) the optimizer scheduling module schedules the most suitable optimizer to perform query optimization according to the characteristics of the logical query plan and the state of the optimizers; if the first optimizer is selected, step (3) is carried out, and if the second optimizer is selected, step (4) is carried out;

(3) the first optimizer performs a plan search according to optimization rules and selects an execution plan according to a cost model and cardinality estimation to obtain a first physical execution plan;

(4) the second optimizer establishes an optimization model using reinforcement learning, trains the model, performs query optimization with the trained model to obtain an optimal join sequence, and converts the optimal join sequence into a second physical execution plan by means of a data adaptation module;

(5) the first physical execution plan is collected by a sample collection module, which provides samples for training the model of the second optimizer.

Further, in the step (2), the optimizer scheduling module selects the query optimizer using the join size and the optimizer status as the basis for scheduling: once training of the optimization model of the second optimizer has finished, the first optimizer is selected if the join size is smaller than or equal to a set threshold, and the second optimizer is selected if the join size is larger than the set threshold.

Further, in the step (2), when the optimization model of the second optimizer has not yet been trained, the first optimizer is used and the search space of the physical execution plan is appropriately widened to generate a more effective physical execution plan.

Further, in the step (3), the first optimizer is a rule-driven optimizer; the physical execution plan is bound to a specific execution platform, and query data can be migrated between different execution platforms; the first optimizer searches for the execution plan top-down according to the optimization rules, and the execution order of the optimization rules is guided by an importance index, which promotes convergence of the algorithm.

Further, in the step (3), a cost model is established, the cost of data migration and the execution cost of the query on the platform are comprehensively considered, and the physical execution plan with the minimum overall cost is selected as the first physical execution plan according to the cost model.

Further, in the step (4), the second optimizer is a data-driven optimizer: it abstracts the cross-platform query optimization problem into a Markov decision process, establishes a model of the Markov decision process using reinforcement learning, trains the model, and finally performs cross-platform query optimization with the trained model to obtain an optimal join sequence.

Further, in the step (4), the data adaptation module divides the physical execution plan into sub-queries through a join operator, so as to adapt the input of the second optimizer; a second physical execution plan is generated by depth-first traversing the optimal join sequence, thereby adapting the output of the second optimizer.

The advantage of the invention is that cross-platform query optimization can be carried out efficiently in cross-platform SQL query scenarios, effectively addressing the flexibility, scalability and efficiency problems of cross-platform SQL query optimization.

Drawings

FIG. 1 is a schematic flow diagram of the overall process of the present invention;

FIG. 2 is a schematic diagram of a root tree structure inside a rule-driven optimizer in the present invention;

FIG. 3 is a schematic diagram of the optimization process of the data-driven optimizer of the present invention.

Detailed Description

The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalent modifications thereof which may occur to those skilled in the art upon reading the present specification.

The invention provides a cross-platform SQL query optimization method combining rule driving and data driving, which addresses the poor scalability, low flexibility and poor optimization effect of conventional cross-platform query optimization methods. As shown in FIG. 1, the complete process of the invention comprises five parts: an SQL parsing stage, an optimizer scheduling stage, a rule-driven query optimization stage, a data-driven query optimization stage and a training sample collection stage. The specific implementation is described as follows:

The SQL parsing stage corresponds to step (1) of the technical scheme. The specific implementation is as follows: the cross-platform query statement is parsed into an abstract syntax tree; the validity of the tables, columns and data types appearing in the query statement is then verified, and it is determined whether any ambiguity is caused by namespaces (Namespace) or by tables with the same name. After this verification is complete, the SQL is checked for semantic errors and the abstract syntax tree is converted into a logical query plan.

The optimizer scheduling stage corresponds to step (2) of the technical scheme. The specific implementation is as follows: the join size is used as the basis for scheduling. When the number of joined tables is less than or equal to 4, the scale is limited (the number of query trees is at most 120) and enumerating execution plans is inexpensive, so the rule-driven optimizer enumerates the execution plans and finds the optimal one. When the join size is greater than 4, the enumeration overhead of the rule-driven optimizer is too large (the number of query trees is at least 1680), and it could optimize only with fixed, hard-coded heuristics; the data-driven optimizer is therefore used, which avoids the shortcomings of such heuristics. When the system is in cold start and the data-driven optimizer has not yet been trained, the rule-driven optimizer is used and the join-size threshold of its enumeration strategy is relaxed to 6 (30240 query trees), which widens the search space of physical execution plans so that more effective execution plans are collected for training.
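The scheduling rule and the quoted query-tree counts can be illustrated with a minimal Python sketch. It rests on two assumptions that the text does not state: the counts 120, 1680 and 30240 are taken to follow the closed form n! × Catalan(n − 1), which reproduces them for n = 4, 5 and 6, and the function names `join_tree_count` and `choose_optimizer` are hypothetical.

```python
from math import comb, factorial
from typing import Optional, Tuple


def join_tree_count(n_tables: int) -> int:
    """Ordered binary join trees over n_tables relations: n! * Catalan(n - 1).

    Assumption: this closed form is inferred from the figures quoted in the
    description (120, 1680, 30240); the source does not state the formula.
    """
    catalan = comb(2 * (n_tables - 1), n_tables - 1) // n_tables
    return factorial(n_tables) * catalan


def choose_optimizer(n_tables: int, model_trained: bool) -> Tuple[str, Optional[int]]:
    """Hypothetical scheduler: returns (optimizer, enumeration threshold)."""
    if not model_trained:
        # Cold start: only the rule-driven optimizer is usable, and its
        # enumeration threshold is relaxed from 4 to 6 joined tables.
        return ("rule-driven", 6)
    if n_tables <= 4:
        # Small joins: exhaustive enumeration is cheap.
        return ("rule-driven", 4)
    # Large joins: hand over to the trained data-driven optimizer.
    return ("data-driven", None)
```

For example, `choose_optimizer(5, model_trained=True)` selects the data-driven optimizer, while the same query during cold start is still handled by the rule-driven optimizer with the relaxed threshold.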

The rule-driven query optimization stage corresponds to step (3) of the technical scheme. The specific implementation is as follows: the rule-driven optimizer is an improvement of the Cascades optimizer generator; it retains the advantages of the Cascades optimizer while extending it with cross-platform functionality to suit the characteristics of cross-platform queries. After the logical query plan enters the optimizer, it is converted into a root tree, the optimizer's internal data structure. As shown in FIG. 2, a Set represents a set of equivalent sub-queries; one Set includes several Subsets, each of which represents the equivalent sub-queries on the same platform (Convention); each element of a Set is represented as a RelNode; and nodes in the root tree can be reused and converted into one another. Inter-platform conversion rules are generated by recording every Convention in a set as it is created and then checking, for each pairwise combination of Conventions, whether a conversion is possible. Rule registration traverses all nodes of the logical plan using the visitor pattern. Because every leaf node is a TableScan node and therefore carries a Convention in the logical query plan, platform-related rules are registered when a leaf node is reached during the traversal. The optimizer then searches for the physical execution plan top-down according to the optimization rules, whose execution order is guided by an importance index (Importance). A rule's importance can be set very high, so that it is possible to specify which optimization rules are applied first, or set to 0, so that the branch of that rule will hardly be explored further. This provides pruning for the top-down search and promotes convergence of the algorithm.
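Two of the mechanisms described above, the pairwise generation of inter-platform conversion rules and the importance-guided rule order, can be sketched in Python. This is an illustration only and not the optimizer's actual code: the names `conversion_pairs`, `rule_firing_order` and the `can_convert` predicate are hypothetical, and rules with importance 0 are dropped outright, which simplifies the description's "hardly explored further".

```python
import heapq
from itertools import permutations
from typing import Callable, List, Sequence, Tuple


def conversion_pairs(conventions: Sequence[str],
                     can_convert: Callable[[str, str], bool]) -> List[Tuple[str, str]]:
    """Combine the recorded Conventions pairwise and keep the convertible pairs."""
    return [(src, dst) for src, dst in permutations(conventions, 2)
            if can_convert(src, dst)]


def rule_firing_order(rules: Sequence[Tuple[str, float]]) -> List[str]:
    """Return rule names from highest to lowest importance.

    Assumption: importance 0 is treated as a hard prune in this sketch.
    """
    # Negate the importance so that heapq (a min-heap) pops the largest first;
    # the index breaks ties in registration order.
    heap = [(-importance, index, name)
            for index, (name, importance) in enumerate(rules)
            if importance > 0]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order
```

Ordering the rule queue by importance lets a high-priority rule fire before the rest, while a zero-importance rule never reaches the front of the queue.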
Finally, a cross-platform query cost model is established: the sum of the data migration cost and the query execution cost is taken as the overall cost, and the physical execution plan with the minimum overall cost is selected as the optimal physical execution plan (i.e. the first physical execution plan described in the claims).
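The cost-based selection can be written as a short Python sketch. The class `CandidatePlan`, the function `pick_cheapest` and the cost values in the usage example are hypothetical; how the two cost components are estimated is not covered here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CandidatePlan:
    """A candidate physical plan with its two cost components (hypothetical type)."""
    name: str
    migration_cost: float   # cost of moving data between platforms
    execution_cost: float   # cost of running the query on its platform(s)

    @property
    def total_cost(self) -> float:
        # Overall cost = data migration cost + query execution cost.
        return self.migration_cost + self.execution_cost


def pick_cheapest(plans: Iterable[CandidatePlan]) -> CandidatePlan:
    """Select the plan with the minimum overall cost."""
    return min(plans, key=lambda plan: plan.total_cost)
```

With illustrative values, a plan that keeps all data on one platform (migration 0, execution 90) loses to a plan that migrates a small table (migration 10, execution 50), because only the sum is compared.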

The data-driven query optimization stage corresponds to step (4) of the technical scheme. The specific implementation is as follows: the cross-platform query is first featurized. The selected features must be able to express the whole query request, including the left table of the join, the right table of the join and the execution platform of the join. The feature information is designed in three categories:

1) Participating relations. The basic idea behind the relation features is to treat each column as a feature, identifying which columns are used in the query. Specifically, the column features indicate which attributes participate in the query and in a specific join operation. Let A be the set of all attributes on all platforms. Each relation rel (including intermediate join results) has a set of visible attributes A_rel ⊆ A that represents the output of rel. Similarly, each query graph G can be represented by its visible attributes A_G ⊆ A. Each join operation c can be represented by a tuple of two relations (L, R), from which the visible attributes A_L and A_R are obtained. Each of the attribute sets A_G, A_L, A_R can be represented by a binary one-hot encoding: 1 indicates that a particular attribute is present and 0 that it is absent. Using ⊕ to denote vector concatenation, the query graph feature is f_G = A_G and the join decision feature is f_c = A_L ⊕ A_R. Finally, the feature vector of the model's input tuple (G, c) is f_G ⊕ f_c;

2) Predicate influence. A predicate affects the cardinality of the relations on both sides of the join operator. To handle single-table predicates in a query, the feature representation must be adjusted. For each selection operation σ in the query, the selectivity of σ is obtained from the table statistics available in most RDBMSs, which gives an estimate of the number of rows remaining in the relation after the selection is applied. To capture the influence of the predicate, the feature in f_G that corresponds to the relation and attribute named by σ is scaled by this selectivity. For example, if the predicate Emp.id > 200 has an estimated selectivity of 0.2, then the feature value corresponding to Emp.id in f_G becomes 0.2;

3) Platform selection. The platform selection for a join is represented by extending the implementation type of the join operator: a further one-hot vector representing the execution platform is concatenated to the f_c vector, where 1 indicates that the corresponding execution platform is selected and 0 that it is not.
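The three feature categories can be combined into one input vector as in the following Python sketch. It assumes a fixed ordering of all attributes and all platforms; the function names `one_hot` and `featurize`, and the table and platform names used in the example, are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def one_hot(all_items: Sequence[str], present: Sequence[str]) -> List[float]:
    """1.0 where an item is present, 0.0 where it is absent."""
    chosen = set(present)
    return [1.0 if item in chosen else 0.0 for item in all_items]


def featurize(attributes: Sequence[str], platforms: Sequence[str],
              graph_attrs: Sequence[str], left_attrs: Sequence[str],
              right_attrs: Sequence[str], platform: str,
              selectivity: Optional[Dict[str, float]] = None) -> List[float]:
    """Build the feature vector for one (query graph, join) pair."""
    # 1) Participating relations: query graph feature f_G.
    f_g = one_hot(attributes, graph_attrs)
    # 2) Predicate influence: scale the slot of each filtered attribute.
    for attr, fraction in (selectivity or {}).items():
        f_g[attributes.index(attr)] *= fraction
    # Join decision feature f_c: left side, then right side.
    f_c = one_hot(attributes, left_attrs) + one_hot(attributes, right_attrs)
    # 3) Platform selection: append the platform one-hot vector to f_c.
    f_c += one_hot(platforms, [platform])
    return f_g + f_c
```

For a join of Emp and Dept executed on the third of three platforms, with the predicate Emp.id > 200 at selectivity 0.2, the Emp.id slot of f_G holds 0.2 and every other slot holds 0 or 1.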

After the feature abstraction is completed, a Q-learning model corresponding to query optimization is established. As shown in FIG. 3, a multi-layer perceptron (MLP) neural network is used to represent the Q-function. The input of the Q-function model is the tuple (G, c), represented as the concatenation of the feature vectors f_G and f_c.
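A small MLP used as a Q-function approximator can be sketched in plain Python as follows. This is an assumption-laden illustration rather than the model of the invention: the class name `TinyMLP`, the hidden width of 4, the tanh activation, the squared-error loss and the learning rate are all choices made for this sketch, and the training target stands in for an observed join cost.

```python
import math
import random
from typing import List, Sequence


class TinyMLP:
    """One hidden tanh layer plus a linear output: maps a feature vector to a Q value."""

    def __init__(self, n_in: int, n_hidden: int = 4, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x: Sequence[float]) -> List[float]:
        return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x: Sequence[float]) -> float:
        hidden = self._hidden(x)
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2

    def sgd_step(self, x: Sequence[float], target: float, lr: float = 0.01) -> float:
        """One gradient step on the loss 0.5 * (prediction - target)^2."""
        hidden = self._hidden(x)
        err = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2 - target
        for j, h in enumerate(hidden):
            # Gradient w.r.t. the hidden pre-activation, using the old w2[j].
            grad_pre = err * self.w2[j] * (1.0 - h * h)
            self.w2[j] -= lr * err * h
            self.b1[j] -= lr * grad_pre
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_pre * xi
        self.b2 -= lr * err
        return 0.5 * err * err
```

Calling `sgd_step` repeatedly on (feature vector, cost) samples moves `predict` toward the sampled costs; a production system would more likely use an established deep learning library.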

Under a limited training time (less than 10 minutes), a two-layer MLP provides the best performance. The model is trained with the stochastic gradient descent (SGD) algorithm. After training, a fitted model Q_θ(f_G, f_c) of the Q-function is obtained. Cross-platform query optimization with the trained model comprises the following four steps: 1) the data adaptation module divides the physical execution plan into sub-queries at the join operators, so as to adapt the input of the data-driven optimizer; 2) each sub-query to be joined is featurized; 3) the join with the lowest estimated Q value (output by the neural network) is found; 4) the set of sub-queries to be joined is updated, and the above steps are repeated until only one element remains in the set. Finally, the optimal join sequence generated in step 3) is converted into an optimal physical execution plan (i.e. the second physical execution plan described in the claims) by depth-first traversal.
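The greedy loop of steps 2) to 4) and the final depth-first conversion can be sketched in Python. The trained network is replaced here by a hypothetical stand-in `toy_q` so that the example is self-contained; `SIZES`, `toy_size`, `greedy_join_order` and `to_physical_steps` are likewise names invented for this sketch.

```python
from itertools import permutations
from typing import Any, Callable, List, Sequence, Tuple

Candidate = Tuple[Any, Any, str]  # (left sub-query, right sub-query, platform)


def greedy_join_order(subqueries: Sequence[Any], platforms: Sequence[str],
                      q_value: Callable[[List[Any], Candidate], float]) -> Any:
    """Repeatedly apply the join with the lowest estimated Q value."""
    pending = list(subqueries)
    while len(pending) > 1:
        candidates = [(left, right, platform)
                      for left, right in permutations(pending, 2)
                      for platform in platforms]
        left, right, platform = min(candidates, key=lambda c: q_value(pending, c))
        # Replace the two joined sub-queries by their join result.
        pending.remove(left)
        pending.remove(right)
        pending.append((left, right, platform))
    return pending[0]


def to_physical_steps(plan: Any) -> List[str]:
    """Depth-first (post-order) traversal of the join tree."""
    if isinstance(plan, str):
        return [f"scan {plan}"]
    left, right, platform = plan
    return to_physical_steps(left) + to_physical_steps(right) + [f"join on {platform}"]


# Hypothetical stand-in for the trained Q-function: product of input sizes,
# plus a small penalty for every platform other than PostgreSQL.
SIZES = {"A": 10, "B": 1000, "C": 100}


def toy_size(plan: Any) -> int:
    if isinstance(plan, str):
        return SIZES[plan]
    return toy_size(plan[0]) * toy_size(plan[1])


def toy_q(pending: List[Any], candidate: Candidate) -> float:
    left, right, platform = candidate
    return toy_size(left) * toy_size(right) + (0.0 if platform == "PostgreSQL" else 1.0)
```

With the stand-in Q-function, the two smallest tables A and C are joined first and B is joined last; in the method described above, `q_value` would instead evaluate the trained model on the featurized candidate.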

The training sample collection stage corresponds to step (5) of the technical scheme. The specific implementation is as follows: the first physical execution plan is divided at the join operators into a join sequence of different sub-queries, and the cost of each join is extracted. The join sequence and the join costs are persisted to local disk in CSV format for the second optimizer to read when training the model.
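Persisting and reloading the samples can be done with Python's standard `csv` module, as in the minimal sketch below. The column layout (left, right, platform, cost) and the function names are assumptions; the description only states that the join sequence and the join costs are stored as CSV.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple

JoinSample = Tuple[str, str, str, float]  # left, right, platform, cost (assumed layout)


def write_samples(out: TextIO, samples: Iterable[JoinSample]) -> None:
    """Write one header row followed by one row per join."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["left", "right", "platform", "cost"])
    writer.writerows(samples)


def read_samples(src: TextIO) -> List[JoinSample]:
    """Read the samples back, converting the cost column to float."""
    reader = csv.DictReader(src)
    return [(row["left"], row["right"], row["platform"], float(row["cost"]))
            for row in reader]


# Usage with an in-memory buffer; a real run would pass a file opened with
# open(path, "w", newline="") and open(path, newline="") instead.
buffer = io.StringIO()
write_samples(buffer, [("A", "C", "PostgreSQL", 12.5)])
buffer.seek(0)
loaded = read_samples(buffer)
```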

The invention implements a prototype system, Coral, based on existing open-source software; three mainstream database systems, ClickHouse, MemSQL and PostgreSQL, are integrated at the bottom layer of the prototype system Coral.

The 12 cross-platform queries Q1 to Q12 are executed sequentially on a TPC-H 100 GB data set to test the prototype system implemented by the invention. Table 1 compares the query times of the prototype system Coral with those of the existing cross-platform query systems Sloth and MuSQLE when executing the 12 cross-platform queries under the same software and hardware conditions.

Table 1: Performance test of cross-platform queries on the TPC-H 100 GB data set

Claims

1. A cross-platform SQL query optimization method combining rule driving and data driving comprises the following steps:

2. The cross-platform SQL query optimization method combining rule driving and data driving according to claim 1, wherein in the step (2), the optimizer scheduling module selects the query optimizer using the join size and the optimizer status as the scheduling basis: once training of the optimization model of the second optimizer is completed, the first optimizer is selected if the join size is smaller than or equal to a set threshold, and the second optimizer is selected if the join size is larger than the set threshold.

3. The cross-platform SQL query optimization method combining rule driving and data driving according to claim 2, wherein in the step (2), when the optimization model of the second optimizer has not yet been trained, the first optimizer is used and the search space of the physical execution plan is appropriately widened to generate a more effective physical execution plan.

4. The cross-platform SQL query optimization method combining rule driving and data driving according to claim 1, wherein in the step (3), the first optimizer is a rule-driven optimizer, the physical execution plan is bound to a specific execution platform, query data can be migrated between different execution platforms, the first optimizer searches for the execution plan top-down according to the optimization rules, and the execution order of the optimization rules is guided by an importance index, so as to promote convergence of the algorithm.

5. The cross-platform SQL query optimization method combining rule driving and data driving according to claim 1, wherein in the step (3), a cost model is established, the cost of data migration and the execution cost of the query on the platform are comprehensively considered, and the physical execution plan with the minimum overall cost is selected as the first physical execution plan according to the cost model.

6. The cross-platform SQL query optimization method combining rule driving and data driving according to claim 1, wherein in the step (4), the second optimizer is a data-driven optimizer, the cross-platform query optimization problem is abstracted into a Markov decision process, a model of the Markov decision process is established and trained using reinforcement learning, and finally cross-platform query optimization is performed with the trained model to obtain an optimal join sequence.

7. The cross-platform SQL query optimization method combining rule driving and data driving according to claim 1, wherein in the step (4), the data adaptation module divides the physical execution plan into sub-queries by a join operator so as to adapt to the input of the second optimizer, and generates a second physical execution plan by depth-first traversal of the optimal join sequence so as to adapt to the output of the second optimizer.