CN105824957A

CN105824957A - Query engine system and query method for a distributed in-memory column-oriented database

Info

Publication number: CN105824957A
Application number: CN201610193220.XA
Authority: CN
Inventors: 段翰聪; 王瑾; 闵革勇; 聂晓文; 郑松; 张博
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2016-03-30
Filing date: 2016-03-30
Publication date: 2016-08-03
Anticipated expiration: 2036-03-30
Also published as: CN105824957B

Abstract

The invention discloses a query engine system and a query method for a distributed in-memory column-oriented database. The query method comprises the following steps: a resource management module assigns a master query engine to be responsible for the session with a user; the master query engine converts the SQL (Structured Query Language) statement sent by the user into a query plan; the resource management module allocates slave query engines to the master query engine; the master query engine divides the query plan into at least two subtasks and allocates a slave query engine to each subtask; once all predecessor subtasks of the current subtask have finished executing, the current subtask is executed; the intermediate data produced by the current subtask is transmitted to the slave query engines hosting its successor subtasks; the completion state of the current subtask is sent to the master query engine; and the master query engine notifies the client to obtain the final result data from the slave query engines. The query engine system and query method provided by the invention achieve good query efficiency.

Description

Query engine system and query method for a distributed in-memory columnar database

Technical field

The present invention relates to the field of database technology, and in particular to a query engine system and a query method for a distributed in-memory columnar database.

Background art

NewSQL is the umbrella term for a family of new, scalable, high-performance databases that not only offer NoSQL-style storage and processing of massive data but also retain traditional database features such as ACID transactions and SQL support. NewSQL systems fall roughly into three classes. New architectures use entirely new database platforms and design approaches, e.g. Google Spanner, Clustrix, VoltDB and MemSQL. SQL engines are highly optimized SQL storage engines that expose the same interface as MySQL but scale better than its built-in InnoDB engine. Transparent sharding systems provide a sharding middleware layer that automatically partitions a database across multiple nodes. Over time these three types have gradually converged, giving rise to large-scale distributed in-memory columnar databases aimed at online analytical processing (OLAP).

The query engine is the core of a database system and is responsible for scheduling the execution of all of its query computation tasks. A SQL statement entered by the user is first lexically and syntactically parsed into a syntax tree, which the query optimizer then transforms; the result is finally converted into a query plan that the query engine can execute. The query plan tells the query engine how to execute the query: how to extract data from the underlying storage engine, how to transform it, and how to turn it into the result the user wants.

HIVE is a data warehouse tool built on Hadoop. It provides simple SQL query functionality and converts SQL statements into MapReduce jobs for execution. For the SQL statement SELECT c_custkey FROM customer JOIN nation ON customer.C_NATIONKEY = nation.N_NATIONKEY JOIN lineitem ON lineitem.L_PARTKEY = customer.C_CUSTKEY, HIVE's query plan and task execution flow are shown in Fig. 1. What HIVE actually executes are MapReduce jobs, so the query plan is converted into a set of MapReduce jobs; here the original plan becomes two jobs. JOB1 computes Join1, i.e. the join of the lineitem and customer tables; JOB2 computes Join2, i.e. the join of the Join1 result with the nation table, and outputs the final result. After JOB1 finishes, its intermediate result data is written to an external storage system; only then can JOB2 start, reading JOB1's intermediate results back from the external storage system to carry out its computation. The drawback of HIVE is obvious: because it is built on the MapReduce computation model, the only way to share data between two MapReduce jobs is for the first job to write its results to an external storage system (a distributed file system or a local file system) and for the next job to read them back. This causes a large amount of disk I/O, so the latency of the whole query is high.

Spark-SQL is another data warehouse tool with functionality similar to HIVE, but it is built on the Spark computation model rather than on MapReduce. For the same SQL statement as above, Spark-SQL's query plan and task execution flow are shown in Fig. 2. Stage1 mainly processes ScanTable(lineitem) and ScanTable(customer) in the query plan, which correspond to RDD1 and RDD2 respectively. Because an RDD (resilient distributed dataset) is spread over multiple physical nodes, each of which executes its own task, one RDD is produced by several tasks running in parallel; for example, RDD1 is computed by Task1-1 and Task1-2. After the lineitem and customer tables have been read, Stage2 mainly processes the Join1 operation and the ScanTable(nation) operation, producing RDD3 and RDD4 respectively. Finally, Stage3 performs the Join2 operation. Spark-SQL has much lower computation latency than HIVE, but it still has the following shortcomings.

First, Spark-SQL is implemented in Scala and runs entirely on the Java Virtual Machine (JVM), so its memory management depends on the JVM. The JVM's memory management is general-purpose and has no optimizations tailored to a database query engine, so Spark-SQL consumes a large amount of memory during computation.

Second, Spark-SQL executes tasks stage by stage: for example, Stage2 can start only after Stage1 has finished, and Stage3 only after Stage2 has finished. Each stage contains several tasks that can run in parallel, and the execution latency of each stage is determined by its slowest task. This causes a problem: a task that finishes early must wait for the remaining tasks in the same stage, and tasks in the next stage can start only after every task in the current stage has finished. For example, suppose Task1-1, Task1-2, Task2-1 and Task2-2 are in Stage1, Task3-1 and Task3-2 are in Stage2, and Task3-2 depends on the results of Task1-1, Task1-2 and Task2-1. If Task1-1, Task1-2 and Task2-1 have finished but Task2-2 has not, then even though Task3-2 already satisfies its execution condition, the constraints of the Spark computation framework prevent it from starting until Task2-2 finishes. If Task2-2 runs for too long, the latency of the entire computation suffers.
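The effect of the stage barrier can be illustrated with a small timing model. The Python sketch below is a hypothetical illustration, not Spark's scheduler: the task durations are invented, all tasks of a stage are assumed to start in parallel, and Task3-1 depending on Task2-2 is an assumption (the text only gives Task3-2's predecessors). It compares when Task3-2 finishes under stage-by-stage execution and under execution driven only by each task's own predecessors.

```python
from typing import Dict, List

# Hypothetical durations in seconds, chosen only to illustrate the effect.
DURATIONS: Dict[str, float] = {
    "Task1-1": 2.0, "Task1-2": 2.0, "Task2-1": 3.0, "Task2-2": 10.0,
    "Task3-1": 1.0, "Task3-2": 1.0,
}
# Task3-2's predecessors come from the text; Task3-1 -> Task2-2 is an assumption.
DEPS: Dict[str, List[str]] = {
    "Task1-1": [], "Task1-2": [], "Task2-1": [], "Task2-2": [],
    "Task3-1": ["Task2-2"],
    "Task3-2": ["Task1-1", "Task1-2", "Task2-1"],
}
STAGES: List[List[str]] = [
    ["Task1-1", "Task1-2", "Task2-1", "Task2-2"],  # Stage1
    ["Task3-1", "Task3-2"],                        # Stage2
]


def stage_barrier_finish(stages: List[List[str]], durations: Dict[str, float]) -> Dict[str, float]:
    """Finish times when a stage may start only after every task of the previous stage ends."""
    finish: Dict[str, float] = {}
    stage_start = 0.0
    for stage in stages:
        for task in stage:
            finish[task] = stage_start + durations[task]
        # Barrier: the slowest task of this stage gates the whole next stage.
        stage_start = max(finish[task] for task in stage)
    return finish


def dependency_driven_finish(deps: Dict[str, List[str]], durations: Dict[str, float]) -> Dict[str, float]:
    """Finish times when each task starts as soon as its own predecessors have ended."""
    finish: Dict[str, float] = {}

    def visit(task: str) -> float:
        if task not in finish:
            start = max((visit(p) for p in deps[task]), default=0.0)
            finish[task] = start + durations[task]
        return finish[task]

    for task in deps:
        visit(task)
    return finish


barrier = stage_barrier_finish(STAGES, DURATIONS)
eager = dependency_driven_finish(DEPS, DURATIONS)
print(barrier["Task3-2"], eager["Task3-2"])  # 11.0 4.0
```

With these numbers Task3-2 finishes at 11.0 under the stage barrier but at 4.0 when it starts as soon as its own inputs are ready; the overall completion time is still bounded by the slow Task2-2 chain.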

Summary of the invention

The problem to be solved by the present invention is the low computational efficiency of existing database query engines.

The present invention achieves this through the following technical solutions:

A query engine system for a distributed in-memory columnar database comprises a resource management module, at least one master query engine and at least one slave query engine. The master query engine converts SQL into a query plan, divides the query plan into at least two subtasks, and monitors and schedules the execution of the query plan; the slave query engine executes the subtasks allocated by the master query engine; and the resource management module manages and allocates system resources.

Optionally, the system resources include CPU computing resources and memory resources.

Based on the above query engine system, the present invention also provides a query method for a distributed in-memory columnar database, comprising: the resource management module assigns a master query engine to be responsible for the session with the user; the master query engine converts the SQL sent by the user into a query plan; the resource management module allocates slave query engines to the master query engine and establishes communication between the slave query engines and the master query engine; the master query engine divides the query plan into at least two subtasks and allocates a slave query engine to each subtask; the slave query engine adds the subtask to its task queue, executes the current subtask once all of its predecessor subtasks have finished, transmits the intermediate data produced by the current subtask to the slave query engines hosting its successor subtasks, and sends the completion state of the current subtask to the master query engine; and after the whole query plan has been completed, the master query engine notifies the client to obtain the final result data from the slave query engines.

The present invention divides the query plan into subtasks with dependencies between them and distributes the subtasks to the task queues of the corresponding slave query engines, each of which executes the subtasks in its queue in turn. This avoids the shortcoming of Spark-SQL in which a task in a later stage already satisfies its execution condition but cannot start because of the restrictions of Spark-SQL's execution framework. The query method for a distributed in-memory columnar database provided by the present invention therefore achieves good query efficiency.

Optionally, subtasks are represented by physical operators, which include at least one of: a column extraction operation, a join operation, a conditional filter operation, a grouping operation, an aggregate function operation, a sort operation, and an operation that converts the final result data into a row table.

Optionally, the master query engine allocates a slave query engine to each subtask according to a cost model. Using a cost model allows each subtask to be assigned to the slave query engine with the lowest execution cost, which further improves query efficiency.

Optionally, allocating a slave query engine to each subtask according to the cost model comprises: obtaining, from the metadata of the slave query engines, the IP address of the node hosting each slave query engine and the database table and column information stored on that node; assigning the execution node IP of each column extraction operation in the query plan according to the data locality principle; and selecting the execution nodes of the operations other than column extraction by a greedy algorithm.

Optionally, the states of each subtask include a pending state, a computing state, a distributing-data state, a completed state and a failed state.

Optionally, the initial state of the current subtask is the pending state. After the slave query engine hosting the current subtask has received the intermediate data produced by all predecessor subtasks of the current subtask, the state of the current subtask changes to the computing state. After the current subtask finishes computing, its state changes to the distributing-data state, and the intermediate data produced by the computation is sent to the slave query engines hosting the successor subtasks. If the intermediate data is sent successfully, the state of the current subtask changes to the completed state. If a fault occurs between the pending and computing states, between the computing and distributing-data states, or between the distributing-data and completed states, the state of the current subtask changes to the failed state. Whenever the state of the current subtask changes, the master query engine is notified asynchronously.

Optionally, the intermediate data transmitted between query engines is compressed column data. In traditional database execution engines, intermediate data takes the form of tables stored by row. In most analytical business scenarios, however, the user cares about only some of the attributes of a relation table, so row storage loads attribute data the user does not need during computation and wastes memory. Column storage solves this problem well.

Optionally, the compression includes bit compression and dictionary compression. Combining dictionary compression with bit compression further reduces memory overhead and improves memory utilization.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

The query engine system and query method for a distributed in-memory columnar database provided by the present invention improve overall computation efficiency by scheduling the execution of each subtask asynchronously: the query plan is divided into subtasks with dependencies between them, the subtasks are distributed to the task queues of the corresponding slave query engines, and each slave query engine executes the subtasks in its queue in turn. In addition, the data transmitted between query engines is compressed column data, which avoids the memory waste caused by row storage loading attribute data the user does not need during computation.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the embodiments of the present invention and form part of this application; they do not limit the embodiments of the present invention. In the drawings:

Fig. 1 is a schematic flow diagram of a SQL query plan and its task execution in HIVE;

Fig. 2 is a schematic flow diagram of a SQL query plan and its task execution in Spark-SQL;

Fig. 3 is a partial structural schematic diagram of the query engine system of the distributed in-memory columnar database according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a SQL query plan according to an embodiment of the present invention;

Fig. 5 is a schematic flow diagram of task execution according to an embodiment of the present invention;

Fig. 6 is an execution state transition diagram of a subtask according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of the data transmitted between query engines according to an embodiment of the present invention.

Detailed description of the invention

To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to embodiments and the accompanying drawings. The exemplary embodiments of the present invention and their description serve only to explain the present invention and do not limit it.

Embodiment

This embodiment provides a query engine system for a distributed in-memory columnar database, which comprises a resource management module, at least one master query engine and at least one slave query engine.

Specifically, the master query engine parses the SQL to convert it into a query plan, divides the query plan into at least two subtasks and distributes them to the slave query engines for execution, and is responsible for monitoring and scheduling the execution of the query plan and for fault-tolerance handling. As in the prior art, the query plan is represented as a tree. The slave query engines execute the subtasks allocated by the master query engine, and the resource management module manages and allocates system resources, which include CPU computing resources and memory resources. Fig. 3 is a partial structural schematic diagram of the query engine system of this embodiment, in which master query engine 31 has three slave query engines: slave query engines 32, 33 and 34.

This embodiment also provides a query method for a distributed in-memory columnar database based on the above query engine system, comprising:

Step S1: the resource management module assigns a master query engine to be responsible for the session with the user. Specifically, when the user has a query request, the resource management module creates, from the resource pool, a master query engine responsible for the session with the user.

Step S2: the master query engine converts the SQL sent by the user into a query plan. The master query engine performs lexical and syntactic parsing followed by rule-based query optimization to turn the SQL into a query plan. As in the prior art, the query plan is represented as a tree.

Step S3: the resource management module allocates slave query engines to the master query engine and establishes communication between the slave query engines and the master query engine. After the SQL has been converted into a query plan, the master query engine requests computing resources from the resource management module, which allocates slave query engines to the master query engine and establishes network connections between them.

Step S4: the master query engine divides the query plan into at least two subtasks and allocates a slave query engine to each subtask. The query engine of this embodiment targets a distributed in-memory columnar database, in which data tables are stored by column and each column is split into several shards by value range. To exploit this, the embodiment abstracts a set of physical operators that represent the concrete subtasks of the query plan. The physical operators include at least one of: a column extraction operation, a join operation, a conditional filter operation, a grouping operation, an aggregate function operation, a sort operation, and an operation that converts the final result data into a row table.

Column extraction operation: the GetColumn operator, which extracts the data of a given column from the columnar database. A GetColumn operator can carry an additional restriction; for example, GetColumn(Teacher.age, Teacher.age > 1) extracts the age column of the Teacher table, keeping only values greater than 1.

Join operation: the Join operator, which performs join computations, including LeftJoin, RightJoin, FullJoin, etc.

Conditional filter operation: the Filter operator, which performs conditional filtering, mainly logical operations such as AND and OR.

Grouping operation: the GroupBy operator, which performs grouping to implement the GROUP BY keyword in SQL statements.

Aggregate function operation: the AGG operator, covering common database operations such as Max (maximum) and Avg (average).

Sort operation: the Order operator, which sorts the columns that need sorting.

Operation converting the final result data into a row table: the BuildRow operator, which converts the final result data of the columnar database into a row table the user can read, so that the final result is presented to the user as a relation table.
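The operators above can be sketched over a toy in-memory column layout. This is a minimal Python illustration, assuming each column is held as a list of (row_id, value) pairs; the function names and signatures are hypothetical rather than the patent's implementation, and only an inner equi-join is shown for Join.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Column = List[Tuple[int, int]]  # hypothetical layout: (row_id, value) pairs
Predicate = Callable[[int], bool]


def get_column(table: Dict[str, List[int]], name: str,
               restriction: Optional[Predicate] = None) -> Column:
    """GetColumn: extract one column, keeping only values that pass the optional restriction."""
    return [(rid, v) for rid, v in enumerate(table[name])
            if restriction is None or restriction(v)]


def filter_column(column: Column, predicate: Predicate) -> Column:
    """Filter: keep entries whose value satisfies a predicate (which may combine AND/OR)."""
    return [(rid, v) for rid, v in column if predicate(v)]


def join(left: Column, right: Column) -> List[Tuple[int, int]]:
    """Join (inner equi-join only): hash the right input, probe with the left.
    Returns matching (left_row_id, right_row_id) pairs."""
    index: Dict[int, List[int]] = defaultdict(list)
    for rid, v in right:
        index[v].append(rid)
    return [(lrid, rrid) for lrid, v in left for rrid in index.get(v, [])]


def group_agg(keys: Column, values: Column,
              agg: Callable[[List[int]], float]) -> Dict[int, float]:
    """GroupBy + AGG: group each row's value by the key in the same row, then aggregate."""
    value_of = dict(values)
    groups: Dict[int, List[int]] = defaultdict(list)
    for rid, key in keys:
        if rid in value_of:
            groups[key].append(value_of[rid])
    return {key: agg(vals) for key, vals in groups.items()}


def order(column: Column, descending: bool = False) -> Column:
    """Order: sort entries by value."""
    return sorted(column, key=lambda entry: entry[1], reverse=descending)


def build_row(*columns: List[int]) -> List[tuple]:
    """BuildRow: zip equally long result columns into row tuples for presentation."""
    return list(zip(*columns))


teacher = {"age": [0, 5, 1, 7], "dept": [10, 10, 20, 20]}
print(get_column(teacher, "age", lambda a: a > 1))  # [(1, 5), (3, 7)]
print(group_agg(get_column(teacher, "dept"), get_column(teacher, "age"), max))  # {10: 5, 20: 7}
```

Keeping the row id next to each value is what lets operators on different columns (here GroupBy on dept and Max on age) be lined up again without materializing whole rows until BuildRow.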

For example, for the concrete SQL statement SELECT c_custkey FROM customer JOIN nation ON customer.C_NATIONKEY = nation.N_NATIONKEY JOIN lineitem ON lineitem.L_PARTKEY = customer.C_CUSTKEY, the query plan produced by the master query engine's parsing is shown in Fig. 4, and the resulting subtasks are shown in Fig. 5, which involves six slave query engines: Slave-QE1, Slave-QE2, Slave-QE3, Slave-QE4, Slave-QE5 and Slave-QE6.

Assume that each column has two shards; then there is one GetColumn operator per shard of each column. Because each shard covers a value range, Join operators are also generated per shard based on these ranges. Referring to Fig. 5, the Join1 node represents the equi-join of columns L_PARTKEY and C_CUSTKEY. In the actual subtasks, Join1 is split into two concrete physical operators, Join1-1 and Join1-2, which perform the equi-join over the value ranges 1-100 and 101-150 respectively. Likewise, Join2 in the query plan is split into two concrete Join operators.
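The per-shard splitting of Join1 can be sketched as follows. This is a minimal Python illustration under stated assumptions: shards are described only by an inclusive value range, the left input's shards are assumed not to overlap, the C_CUSTKEY Slice2 range is hypothetical, and the helper names are invented. It reproduces the predecessor list the text gives for Join1-1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Shard:
    column: str
    slice_no: int
    lo: int  # inclusive lower bound of the shard's value range
    hi: int  # inclusive upper bound of the shard's value range

    def label(self) -> str:
        return f"GetColumn({self.column} Slice{self.slice_no} [{self.lo}-{self.hi}])"


def overlaps(a: Shard, b: Shard) -> bool:
    return a.lo <= b.hi and b.lo <= a.hi


def split_equi_join(name: str, left: List[Shard], right: List[Shard]
                    ) -> List[Tuple[str, Tuple[int, int], List[str]]]:
    """Split one equi-join into one sub-join per left shard.

    A matching pair has equal values, so it lies in exactly one left shard's range
    (left ranges are assumed disjoint) and can only be found by joining that shard
    with the right shards whose ranges overlap it."""
    subtasks = []
    for i, ls in enumerate(left, start=1):
        preds = [ls.label()] + [rs.label() for rs in right if overlaps(ls, rs)]
        subtasks.append((f"{name}-{i}", (ls.lo, ls.hi), preds))
    return subtasks


l_partkey = [Shard("L_PARTKEY", 1, 1, 100), Shard("L_PARTKEY", 2, 101, 150)]
# C_CUSTKEY Slice2 range is hypothetical; Slice1 [1-150] follows the text.
c_custkey = [Shard("C_CUSTKEY", 1, 1, 150), Shard("C_CUSTKEY", 2, 151, 300)]
for sub in split_equi_join("Join1", l_partkey, c_custkey):
    print(sub)
```

Under these assumptions both Join1-1 and Join1-2 read C_CUSTKEY Slice1, while C_CUSTKEY Slice2 feeds neither, since no L_PARTKEY value can match it.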

Further, in this embodiment the master query engine allocates a slave query engine to each subtask according to a cost model. Specifically, the master query engine obtains, from the metadata of the slave query engines, the IP address of the node hosting each slave query engine and the database table and column information stored on that node. The execution node IP of each column extraction operation in the query plan is then assigned according to the data locality principle. For example, in Fig. 5 the physical node hosting Slave-QE1 stores a shard of the L_PARTKEY column, so the GetColumn operator for that shard is assigned to the physical node hosting Slave-QE1. Likewise, the GetColumn operator for each shard runs on the node where the corresponding data resides. The execution nodes of non-GetColumn operators are chosen by a greedy algorithm: the execution node of a non-GetColumn operator is selected from among the execution nodes of its child operators by computing the execution cost on each child operator's physical node and choosing the node with the lowest cost. The cost is computed as: execution cost = network cost + computation cost = (inter-node network load × amount of data transmitted) + (node task load × amount of data computed). In Fig. 5, Join1-1 could run on either Slave-QE1 or Slave-QE3. Slave-QE1 is chosen because, after the execution cost of Join1-1 is computed on the Slave-QE1 node and on the Slave-QE3 node, the cost on Slave-QE1 turns out to be lower; the final execution physical node is therefore Slave-QE1.
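The placement rules and the cost formula can be sketched as below. This is a hypothetical Python illustration: the network-load, task-load and data-volume figures are invented, the node where C_CUSTKEY Slice1 lives is an assumption, and the cost function is one reading of the formula above in which only inputs stored on a different node incur network cost.

```python
from typing import Dict, Tuple


def place_get_column(shard_location: Dict[str, str], shard: str) -> str:
    """Data locality: a GetColumn operator runs on the node that stores its shard."""
    return shard_location[shard]


def execution_cost(candidate: str, inputs: Dict[str, float],
                   network_load: Dict[Tuple[str, str], float],
                   task_load: Dict[str, float]) -> float:
    """Execution cost = network cost + computation cost.

    inputs maps each child operator's execution node to the data volume it sends to
    this operator (the caller sums volumes of children that share a node).
    Network cost: inter-node network load x volume, for data not already on candidate.
    Computation cost: the candidate's task load x total volume to process."""
    network = sum(network_load[(src, candidate)] * volume
                  for src, volume in inputs.items() if src != candidate)
    compute = task_load[candidate] * sum(inputs.values())
    return network + compute


def choose_node(inputs: Dict[str, float],
                network_load: Dict[Tuple[str, str], float],
                task_load: Dict[str, float]) -> str:
    """Greedy choice: only the children's execution nodes are candidates; keep the cheapest."""
    return min(inputs, key=lambda node: execution_cost(node, inputs, network_load, task_load))


# Hypothetical numbers for the Join1-1 example.
location = {"L_PARTKEY Slice1": "Slave-QE1", "C_CUSTKEY Slice1": "Slave-QE3"}
inputs = {place_get_column(location, "L_PARTKEY Slice1"): 100.0,
          place_get_column(location, "C_CUSTKEY Slice1"): 150.0}
network_load = {("Slave-QE1", "Slave-QE3"): 1.0, ("Slave-QE3", "Slave-QE1"): 1.0}
task_load = {"Slave-QE1": 0.5, "Slave-QE3": 0.75}
print(choose_node(inputs, network_load, task_load))  # Slave-QE1
```

Here the cost on Slave-QE1 is 150 + 125 = 275 and on Slave-QE3 is 100 + 187.5 = 287.5, so the greedy step keeps Slave-QE1. Restricting candidates to the children's nodes keeps each decision cheap, at the price of never considering a third, idle node.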

Step S5: the slave query engine adds the subtask to its task queue, executes the current subtask once all of its predecessor subtasks have finished, transmits the intermediate data produced by the current subtask to the slave query engines hosting its successor subtasks, and sends the completion state of the current subtask to the master query engine. Specifically, each subtask has five states: pending, computing, distributing data, completed and failed. Each subtask maintains a list of its predecessor subtasks and a list of its successor subtasks, and its execution state transitions are shown in Fig. 6.

Take the Join1-1 operator shown in Fig. 4 as an example. Its predecessor operator list is GetColumn(L_PARTKEY Slice1 [1-100]) and GetColumn(C_CUSTKEY Slice1 [1-150]), and its successor operator list is the Join2-1 operator. The initial state of Join1-1 is pending. After the physical node hosting Join1-1 has received the data sent by all of its predecessor operators, its state changes to computing. After Join1-1 finishes computing, its state changes to distributing data, and the result data is sent over the network to the slave query engine hosting the successor operator. Once the data has been sent successfully, the operator's task is complete. If a fault occurs at any step, i.e. between the pending and computing states, between the computing and distributing-data states, or between the distributing-data and completed states, the operator's state is set to failed. Every time the state of Join1-1 changes, the new state is reported to the master query engine in real time. Operators execute independently of one another: whenever an operator's state changes, the master query engine is notified asynchronously, and the result data is pushed to the physical node executing the successor operator. In this way, whether a subtask can run depends only on whether its predecessor subtasks have finished, rather than on executing tasks stage by stage as in Spark or MapReduce.
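The five-state life cycle of a subtask can be sketched as a small state machine. In this hypothetical Python illustration, a synchronous callback stands in for the asynchronous notification to the master query engine and for the network push to successors; the class and method names are invented.

```python
from enum import Enum
from typing import Callable, Iterable, List, Set


class State(Enum):
    PENDING = "pending"
    COMPUTING = "computing"
    DISTRIBUTING = "distributing data"
    COMPLETED = "completed"
    FAILED = "failed"


Notify = Callable[[str, State], None]


class Subtask:
    """One physical operator instance with its predecessor and successor lists."""

    def __init__(self, name: str, predecessors: Iterable[str],
                 successors: Iterable[str], notify_master: Notify):
        self.name = name
        self.predecessors: Set[str] = set(predecessors)
        self.successors: List[str] = list(successors)
        self.received: Set[str] = set()
        self.state = State.PENDING
        self.notify_master = notify_master

    def _move(self, new_state: State) -> None:
        # Every state change is reported to the master; a plain callback stands in
        # for the asynchronous notification described in the text.
        self.state = new_state
        self.notify_master(self.name, new_state)

    def check_ready(self) -> None:
        """PENDING -> COMPUTING once inputs from all predecessors have arrived."""
        if self.state is State.PENDING and self.received >= self.predecessors:
            self._move(State.COMPUTING)

    def on_input(self, predecessor: str) -> None:
        """Record that a predecessor's intermediate data has been received."""
        self.received.add(predecessor)
        self.check_ready()

    def fail(self) -> None:
        """A fault before completion (e.g. while receiving inputs) moves to FAILED."""
        if self.state not in (State.COMPLETED, State.FAILED):
            self._move(State.FAILED)

    def run(self, compute: Callable[[], object],
            send: Callable[[str, object], None]) -> None:
        """COMPUTING -> DISTRIBUTING -> COMPLETED, or FAILED if either step raises."""
        if self.state is not State.COMPUTING:
            raise RuntimeError(f"{self.name} is not ready to run (state: {self.state.name})")
        try:
            result = compute()
            self._move(State.DISTRIBUTING)
            for successor in self.successors:
                send(successor, result)  # push to the successor's slave query engine
            self._move(State.COMPLETED)
        except Exception:
            self.fail()


log = []
join1_1 = Subtask("Join1-1",
                  ["GetColumn(L_PARTKEY Slice1 [1-100])", "GetColumn(C_CUSTKEY Slice1 [1-150])"],
                  ["Join2-1"], lambda name, state: log.append((name, state.name)))
join1_1.on_input("GetColumn(L_PARTKEY Slice1 [1-100])")
join1_1.on_input("GetColumn(C_CUSTKEY Slice1 [1-150])")
join1_1.run(lambda: [(0, 1)], lambda successor, data: None)
print(log)
```

A subtask with no predecessors, such as a GetColumn operator, becomes runnable by calling `check_ready()` directly, since its empty predecessor set is already satisfied.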

In step S5, the intermediate data transmitted between slave query engines is compressed column data, and the compression methods used include bit compression and dictionary compression. Taking the data structure shown in Fig. 7 as an example, the intermediate data contains three vectors: a dictionary vector, an offset vector and a position vector. The dictionary vector is built by sorting the original data and then removing duplicates, so redundant data is discarded and memory is saved. Because the offset vector and the position vector store only integers, a bit compression strategy is applied to them. In a computer, an INT value occupies four bytes, i.e. 32 bits, and can represent values from -2147483648 to 2147483647; for the offset vector and position vector shown in Fig. 7, however, the maximum integer in each vector is known, so in most cases far fewer than 32 bits are needed per number. If the maximum integer in the offset vector or position vector is A, each number can be stored in ⌈log2(A+1)⌉ bits (equivalently ⌊log2 A⌋ + 1 bits for A ≥ 1). Compared with the conventional use of INT or LONG variables to store integers, this saves considerably more memory.
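Dictionary compression and bit compression can be sketched as below. The text does not spell out how the offset and position vectors in Fig. 7 are laid out, so this hypothetical Python illustration applies bit packing to a generic vector of non-negative integers (here, the dictionary codes); the function names are invented. Python's `int.bit_length()` returns ⌊log2 A⌋ + 1, which equals ⌈log2(A+1)⌉ for A ≥ 1.

```python
import bisect
from typing import List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Dictionary compression: sort and de-duplicate the values, then store one
    integer code (the index into the dictionary) per row."""
    dictionary = sorted(set(values))
    codes = [bisect.bisect_left(dictionary, v) for v in values]
    return dictionary, codes


def bit_width(max_value: int) -> int:
    """Bits per number when the largest value is A: ceil(log2(A + 1)), at least 1."""
    return max(1, max_value.bit_length())


def bit_pack(nums: List[int]) -> Tuple[int, int, bytes]:
    """Bit compression: store each non-negative integer in exactly bit_width(max) bits.
    A Python int serves as the bit buffer for brevity."""
    width = bit_width(max(nums, default=0))
    buffer = 0
    for i, n in enumerate(nums):
        buffer |= n << (i * width)
    n_bytes = (len(nums) * width + 7) // 8
    return width, len(nums), buffer.to_bytes(n_bytes, "little")


def bit_unpack(width: int, count: int, packed: bytes) -> List[int]:
    """Inverse of bit_pack: read `count` fields of `width` bits each."""
    buffer = int.from_bytes(packed, "little")
    mask = (1 << width) - 1
    return [(buffer >> (i * width)) & mask for i in range(count)]


dictionary, codes = dictionary_encode(["b", "a", "c", "a", "b"])
width, count, packed = bit_pack(codes)
print(dictionary, codes, width, len(packed))  # ['a', 'b', 'c'] [1, 0, 2, 0, 1] 2 2
```

In this example five codes fit in 2 bytes, whereas five 32-bit INT values would take 20 bytes; the sketch handles only non-negative integers, which matches offsets, positions and dictionary codes.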

Step S6: after the whole query plan has been completed, the master query engine notifies the client to obtain the final result data from the slave query engines. This completes the whole query.

The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the present invention. It should be understood that they are only specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A query engine system for a distributed in-memory columnar database, characterized in that it comprises a resource management module, at least one master query engine and at least one slave query engine;

the master query engine is configured to convert SQL into a query plan, divide the query plan into at least two subtasks, and monitor and schedule the execution of the query plan;

the slave query engine is configured to execute the subtasks allocated by the master query engine;

the resource management module is configured to manage and allocate system resources.

2. The query engine system for a distributed in-memory columnar database according to claim 1, characterized in that the system resources include CPU computing resources and memory resources.

3. A query method for a distributed in-memory columnar database, based on the query engine system for a distributed in-memory columnar database according to claim 1 or 2, characterized in that it comprises:

the resource management module assigns a master query engine to be responsible for the session with a user;

the master query engine converts the SQL sent by the user into a query plan;

the resource management module allocates slave query engines to the master query engine and establishes communication between the slave query engines and the master query engine;

the master query engine divides the query plan into at least two subtasks and allocates a slave query engine to each subtask;

the slave query engine adds the subtask to its task queue, executes the current subtask after all predecessor subtasks of the current subtask have finished, transmits the intermediate data produced by the current subtask to the slave query engines hosting the successor subtasks, and sends the completion state of the current subtask to the master query engine;

after the whole query plan has been completed, the master query engine notifies the client to obtain the final result data from the slave query engines.

4. The query method for a distributed in-memory columnar database according to claim 3, characterized in that the subtasks are represented by physical operators, and the physical operators include at least one of: a column extraction operation, a join operation, a conditional filter operation, a grouping operation, an aggregate function operation, a sort operation, and an operation that converts the final result data into a row table.

5. The query method for a distributed in-memory columnar database according to claim 4, characterized in that the master query engine allocates a slave query engine to each subtask according to a cost model.

6. The query method for a distributed in-memory columnar database according to claim 5, characterized in that allocating, by the master query engine, a slave query engine to each subtask according to the cost model comprises:

obtaining, from the metadata of the slave query engines, the IP address of the node hosting each slave query engine and the database table and column information stored on that node;

assigning the execution node IP of each column extraction operation in the query plan according to the data locality principle;

selecting the execution node of each operation other than column extraction by a greedy algorithm.

7. The query method for a distributed in-memory columnar database according to claim 3, characterized in that the states of each subtask include a pending state, a computing state, a distributing-data state, a completed state and a failed state.

8. The query method for a distributed in-memory columnar database according to claim 7, characterized in that the initial state of the current subtask is the pending state; after the slave query engine hosting the current subtask has received the intermediate data produced by all predecessor subtasks of the current subtask, the state of the current subtask changes to the computing state; after the current subtask finishes computing, its state changes to the distributing-data state, and the intermediate data produced by the computation is sent to the slave query engines hosting the successor subtasks; if the intermediate data is sent successfully, the state of the current subtask changes to the completed state; if a fault occurs between the pending and computing states, between the computing and distributing-data states, or between the distributing-data and completed states, the state of the current subtask changes to the failed state; and whenever the state of the current subtask changes, the master query engine is notified asynchronously.

9. The query method for a distributed in-memory columnar database according to claim 3, characterized in that the intermediate data transmitted between query engines is compressed column data.

10. The query method for a distributed in-memory columnar database according to claim 9, characterized in that the compression includes bit compression and dictionary compression.