WO2018100419A1

WO2018100419A1 - System and method for reducing data distribution with aggregation pushdown

Info

Publication number: WO2018100419A1
Application number: PCT/IB2016/057307
Authority: WO
Inventors: Kalyan Sivakumar; Chunfeng Pei; Liangchun XIONG
Original assignee: Huawei Technologies India Pvt. Ltd.; Huawei Technologies Co., Ltd.
Priority date: 2016-12-02
Filing date: 2016-12-02
Publication date: 2018-06-07

Abstract

The present disclosure discloses a system and method for aggregation pushdown with double hashing. In contrast to the prior-art techniques, the systems and methods of the present disclosure, for queries involving group by, push down the grouping clause to the local nodes for execution and avoid redistribution. This involves rewriting the query received from the user to use the distribution key as a group by clause column.

Description

SYSTEM AND METHOD FOR REDUCING DATA DISTRIBUTION WITH AGGREGATION PUSHDOWN

TECHNICAL FIELD

[001] The present subject matter described herein, in general, relates to database technologies, and more particularly, to systems and methods for optimization of running group by queries with aggregation on distinct column values where the group by is not done on distribution key.

BACKGROUND

[002] As conventionally known, a database system provides a high-level view of data, but ultimately the data have to be stored as bits on one or more storage nodes. A vast majority of databases today store data on magnetic disk (and, increasingly, on flash storage) and fetch data into main memory for processing, or copy data onto tapes and other backup nodes for archival storage. The physical characteristics of storage nodes play a major role in the way data are stored, in particular because access to a random piece of data on disk is much slower than memory access: Disk access takes tens of milliseconds, whereas memory access takes a tenth of a microsecond. A database management system (DBMS) is generally system software for creating and managing databases. The DBMS provides users and programmers with a systematic way to create, retrieve, update and manage data. The DBMS is a collection of programs that enables you to store, modify, and extract information from a database.

[003] The database system can be a distributed database system, wherein the database is distributed over multiple disparate computers or nodes. Distributed databases are formed by placing and distributing the rows of the database in multiple nodes. The rules for distribution of the rows may be broadly classified into a fixed distribution strategy and a random distribution strategy. In the fixed distribution strategy, the mapping of the node to the row is governed by a mapping function that uniquely identifies the node from which to fetch the data, and the distribution function depends on one of the column values of the row. An example of the fixed distribution strategy is Hash ranged distribution. In the random distribution strategy, the mapping of a node to a row is not fixed and can vary based on several factors. An example of the random distribution strategy is Round robin ranged distribution.
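By way of a non-limiting illustration, the difference between the two strategies can be sketched as follows. The cluster size, the use of CRC32 as the mapping function, and the column name dist_f3 are assumptions chosen for the sketch and are not prescribed by the disclosure.

```python
import zlib
from itertools import count

NUM_NODES = 4  # hypothetical cluster size


def hash_node(key: str, num_nodes: int = NUM_NODES) -> int:
    """Fixed strategy: the node is a pure function of the distribution column value."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


class RoundRobin:
    """Random strategy: the node depends on arrival order, not on the row contents."""

    def __init__(self, num_nodes: int = NUM_NODES) -> None:
        self._counter = count()
        self._num_nodes = num_nodes

    def node_for(self, _row: dict) -> int:
        return next(self._counter) % self._num_nodes


rows = [{"dist_f3": k} for k in ["a", "b", "a", "c", "a"]]

# Rows with the same distribution key always land on the same node.
hash_placement = [hash_node(r["dist_f3"]) for r in rows]

# Rows with the same key may land on different nodes.
rr = RoundRobin()
rr_placement = [rr.node_for(r) for r in rows]
```

Under the fixed strategy, all rows carrying a given distribution key value are co-located, which is the property the aggregation pushdown relies on.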

[004] Further, during the execution of a query in distributed databases, the query plans may be adjusted based on the distribution strategy. The adjustments may favor local query processing over redistribution of rows, so as to keep data redistribution low and maximize use of the processing power of the data nodes.

[005] Most of the conventionally known telecom databases are designed as shared-nothing, distributed systems with a fixed distribution strategy, wherein the analytical queries involve grouping and aggregation. However, the group by clause might involve more than one column and might not involve any distribution column at all. The use of distribution columns in group by queries greatly affects the performance of query execution in a distributed setup. As the data in a distributed database is spread across various nodes, the query or part of the query has to be executed at the local nodes. Hence, the decision whether a query is executed completely by the local nodes or requires data to be redistributed for execution is of critical importance in distributed databases. Queries executed locally perform much better than queries that involve redistribution. Hence, executing the query locally as much as possible, or avoiding redistribution as much as possible, is the goal of any optimized query.

[006] However, in spite of growth and development of the technology, most of the known database systems fail to achieve the optimization goals stated above. This results in poorly optimized query execution and long query response times. The primary cause of the slowdown in query processing is the heavy redistribution of data across the data nodes.

[007] The above-described deficiencies of today's query execution techniques are merely intended to provide an overview of some of the problems of conventional systems, and are not intended to be exhaustive. Other problems with conventional systems and corresponding benefits of the various non-limiting embodiments described herein may become further apparent upon review of the following description.

SUMMARY

[008] This summary is provided to introduce concepts related to system and method for aggregation pushdown with double hashing, and the same are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.

[009] A main objective of the present disclosure is to solve the technical problem recited above by providing a system and method for aggregation pushdown with double hashing, thereby avoiding heavy data redistribution both in distributed database query execution models and in a single node instance that uses parallel execution of a single query.

[0010] Accordingly, in one implementation, the present disclosure provides a system for aggregation pushdown. The system comprises a processor, and a memory coupled to the processor for executing a plurality of modules stored in the memory. The plurality of modules comprises a receiving module, a re-ordering module, and an execution module. The receiving module is configured to receive at least one request related to a data set, the request comprising at least a group by clause. The re-ordering module is configured to rearrange the request for receiving, from a plurality of data sources, a respective plurality of responses related to the request received, wherein the request is rearranged such that at least a distribution key is used as the group by clause. The execution module is configured to execute the rearranged request to generate at least a response to the request received. The request is rearranged in the case(s) where the request involves a grouping operation in which the distribution key is used within a DISTINCT condition in an aggregate method and the distribution key is not part of the group by clause.

[0011] In one implementation, a method for aggregation pushdown is disclosed. The method comprises receiving at least one request related to a data set, the request comprising at least a group by clause; rearranging the request for receiving, from a plurality of data sources, a respective plurality of responses related to the request received, wherein the request is rearranged such that at least a distribution key is used as the group by clause; and executing the rearranged request to generate at least a response to the request received. The request is rearranged in the case(s) where the request involves a grouping operation in which the distribution key is used within a DISTINCT condition in an aggregate method and the distribution key is not part of the group by clause.
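By way of a non-limiting illustration, the condition under which the request is rearranged can be sketched as a simple predicate. The classes Aggregate and GroupByQuery and the function should_rearrange are hypothetical names introduced only for this sketch; the disclosure does not specify how a request is represented internally.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Aggregate:
    func: str              # e.g. "count" or "max"
    column: str
    distinct: bool = False


@dataclass(frozen=True)
class GroupByQuery:
    table: str
    group_by: Tuple[str, ...]
    aggregates: Tuple[Aggregate, ...]


def should_rearrange(query: GroupByQuery, distribution_key: str) -> bool:
    """True when a DISTINCT aggregate is taken over the distribution key
    and the distribution key is not already part of the group by clause."""
    distinct_on_key = any(
        agg.distinct and agg.column == distribution_key
        for agg in query.aggregates
    )
    key_in_group_by = distribution_key in query.group_by
    return distinct_on_key and not key_in_group_by


# select max(f1), f2, count(distinct(dist_f3)) from t1 group by f2
example_2 = GroupByQuery(
    table="t1",
    group_by=("f2",),
    aggregates=(Aggregate("max", "f1"), Aggregate("count", "dist_f3", distinct=True)),
)
# The distribution key is already a grouping column: nothing to rearrange.
already_keyed = GroupByQuery(
    "t1", ("f2", "dist_f3"), (Aggregate("count", "dist_f3", distinct=True),)
)
# No DISTINCT over the distribution key: nothing to rearrange.
no_distinct = GroupByQuery("t1", ("f2",), (Aggregate("count", "dist_f3"),))
```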

[0012] In one implementation, a system for aggregation pushdown is disclosed. The system comprises a processor, and a memory coupled to the processor for executing a plurality of modules stored in the memory. The plurality of modules comprises a receiving module, a re-ordering module, and an execution module. The receiving module is configured to receive at least one request related to a data set, the request comprising at least a group by clause on a non-distribution column and/or an aggregate function with DISTINCT over the distribution key. The re-ordering module is configured to rearrange the request into a multi-staged group by plan for receiving, from a plurality of data sources, a respective plurality of responses related to the request received, wherein the request is rearranged such that at least a distribution key is used as the group by clause. The execution module is configured to execute the request, rearranged into the multi-staged group by plan, to generate at least a response to the request received.

[0013] In one implementation, a method for aggregation pushdown is disclosed. The method comprises receiving at least one request related to a data set, the request comprising at least a group by clause on a non-distribution column and/or an aggregate function with DISTINCT over the distribution key; rearranging the request into a multi-staged group by plan for receiving, from a plurality of data sources, a respective plurality of responses related to the request received, wherein the request is rearranged such that at least a distribution key is used as the group by clause; and executing the request, rearranged into the multi-staged group by plan, to generate at least a response to the request received.

[0014] In contrast to the prior-art techniques, the systems and methods of the present disclosure, for queries involving group by, push down the grouping clause to the local nodes for execution and avoid redistribution. The present disclosure involves rewriting the query received from the user to use the distribution key as a group by clause column.

[0015] Further, the present disclosure deals with the usage of a "DISTINCT" clause on a distribution key in aggregate methods with a grouping clause.

[0016] When a given query contains a) a group by clause on a non-distribution column, and b) an aggregate function with DISTINCT over the distribution key, the present disclosure splits the query into a multi-staged group by plan. In a first stage, the grouping is done based on the grouping column and the distribution key. In a next step, grouping is done on the result from the previous grouping by the grouping column alone. Hence, according to the present disclosure, the aggregate with DISTINCT is converted to a simpler form of aggregate, because the DISTINCT entries of the aggregated column are already obtained by the previous grouping phase. Both groupings, along with the aggregation, are done locally. Further, as one part of the aggregation is done locally, the amount of data to be sent over the network is very small. The next group by is done by multiple data nodes with redistribution or by a single data node, as the number of records expected to flow out of each local node is very small.
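By way of a non-limiting illustration, the multi-staged plan can be simulated in memory. The sketch assumes three nodes, rows of the form (f1, f2, dist_f3) placed by a CRC32 hash of dist_f3, a second local stage that groups by the grouping column f2 (as in Example 2), and a final stage that merges the per-node partial results by taking the maximum of the maxima and the sum of the counts. Summing the counts is an assumption of this sketch; it is valid here only because each distribution key value resides on exactly one node.

```python
import zlib
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size


def node_of(dist_key: int) -> int:
    """Hypothetical fixed distribution function on the distribution key."""
    return zlib.crc32(str(dist_key).encode("utf-8")) % NUM_NODES


# Rows are (f1, f2, dist_f3); the data are synthetic.
rows = [(i % 7, "g%d" % (i % 2), i % 10) for i in range(200)]

nodes = defaultdict(list)
for row in rows:
    nodes[node_of(row[2])].append(row)


def local_plan(local_rows):
    # Stage 1 (local): group by (f2, dist_f3), keeping max(f1).
    stage1 = {}
    for f1, f2, dist in local_rows:
        key = (f2, dist)
        stage1[key] = max(f1, stage1.get(key, f1))
    # Stage 2 (local): group by f2; a plain count replaces count(distinct).
    stage2 = {}
    for (f2, _dist), local_max in stage1.items():
        prev_max, prev_count = stage2.get(f2, (local_max, 0))
        stage2[f2] = (max(prev_max, local_max), prev_count + 1)
    return stage2


partials = [local_plan(local_rows) for local_rows in nodes.values()]
shipped = sum(len(p) for p in partials)  # records leaving the local nodes

# Final stage: merge the per-node partial results by f2.
final = {}
for partial in partials:
    for f2, (local_max, local_count) in partial.items():
        prev_max, prev_count = final.get(f2, (local_max, 0))
        final[f2] = (max(prev_max, local_max), prev_count + local_count)

# Reference: max(f1) and count(distinct dist_f3) per f2 over all rows.
expected = {}
for f2 in {r[1] for r in rows}:
    group = [r for r in rows if r[1] == f2]
    expected[f2] = (max(r[0] for r in group), len({r[2] for r in group}))
```

In this sketch each node ships at most one record per grouping value, whereas a plan that redistributes before grouping would ship all 200 rows.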

[0017] The various options and preferred embodiments referred to above in relation to the first implementation are also applicable in relation to the other implementations.

BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS

[0018] The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to refer like features and components.

[0019] Figure 1 illustrates an overall processing of the query for aggregation pushdown according to the available prior-art techniques and in accordance with an embodiment of the present subject matter. Figure 1(a) illustrates a naive implementation of the query. Figure 1(b) illustrates an execution of the query in accordance with the prior art/existing databases. Figure 1(c) illustrates processing of the query for aggregation pushdown in accordance with an embodiment of the present subject matter.

[0020] Figure 2 illustrates a system for aggregation pushdown, in accordance with an embodiment of the present subject matter.

[0021] Figure 3 illustrates a method for aggregation pushdown, in accordance with an embodiment of the present subject matter.

[0022] It is to be understood that the attached drawings are for purposes of illustrating the concepts of the disclosure and may not be to scale.

DETAILED DESCRIPTION OF THE PRESENT DISCLOSURE

[0023] The following clearly describes the technical solutions in the embodiments of the present disclosure with reference to the accompanying drawings in the embodiments of the present disclosure. Apparently, the described embodiments are merely a part rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative efforts shall fall within the protection scope of the present disclosure.

[0024] The disclosure can be implemented in numerous ways, as a process, an apparatus, a system, a composition of matter, a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication links. In this specification, these implementations, or any other form that the disclosure may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the disclosure.

[0025] A detailed description of one or more embodiments of the disclosure is provided below along with accompanying figures that illustrate the principles of the disclosure. The disclosure is described in connection with such embodiments, but the disclosure is not limited to any embodiment. The scope of the disclosure is limited only by the claims and the disclosure encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the disclosure. These details are provided for the purpose of example and the disclosure may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the disclosure has not been described in detail so that the disclosure is not unnecessarily obscured.

[0026] In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the disclosure. However, it will be understood by those skilled in the art that the present disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, and components, modules, units and/or circuits have not been described in detail so as not to obscure the disclosure.

[0027] Although embodiments of the disclosure are not limited in this regard, discussions utilizing terms such as, for example, "processing," "computing," "calculating," "determining," "establishing", "analyzing", "checking", or the like, may refer to operation(s) and/or process(es) of a computer, a computing platform, a computing system, or other electronic computing device, that manipulates and/or transforms data represented as physical (e.g., electronic) quantities within the computer's registers and/or memories into other data similarly represented as physical quantities within the computer's registers and/or memories or other information non-transitory storage medium that may store instructions to perform operations and/or processes.

[0028] Although embodiments of the disclosure are not limited in this regard, the terms "plurality" and "a plurality" as used herein may include, for example, "multiple" or "two or more". The terms "plurality" or "a plurality" may be used throughout the specification to describe two or more components, devices, elements, units, parameters, or the like. Unless explicitly stated, the method embodiments described herein are not constrained to a particular order or sequence. Additionally, some of the described method embodiments or elements thereof can occur or be performed simultaneously, at the same point in time, or concurrently.

[0029] It may be noted by the person skilled in the art that, the primary cause for the slowdown in query processing in distributed database systems is the heavy redistribution of data across the data nodes. This problem is solved by the present disclosure by avoiding heavy data redistribution for grouping queries.

[0030] Systems and methods for aggregation pushdown with double hashing are disclosed.

[0031] While aspects are described for systems and methods for aggregation pushdown with double hashing, the present disclosure may be implemented in any number of different computing systems, environments, and/or configurations; the embodiments are described in the context of the following exemplary systems, apparatus, and methods.

[0032] Referring now to figure 1, an overall processing of the query in a distributed database system, as compared to the prior-art techniques, is disclosed. In one implementation, as shown in figure 1, for queries involving a group by clause, a method is disclosed through which the grouping clause is pushed down to the local nodes for execution, thereby avoiding redistribution. The present disclosure involves rewriting/re-arranging/reordering the query to use the distribution key as a group by clause column. It is to be noted that the horizontal line shown in figure 1 indicates the boundary between local processing, performed before redistribution, and processing performed after redistribution. The items marked below the horizontal line are localized processes, and the ones above are post-redistribution processing.

[0033] As shown in the figure 1, if the distribution key is not part of the group by clause, but is involved in a DISTINCT/specific function, the query may be rewritten so that a group by clause is introduced based on the distribution key.

[0034] In one implementation, as shown in figure 1, the key points of this rewrite of the query are: i) the query is split into a multi-staged grouping query, and ii) the DISTINCT is removed from the query.

[0035] Figure 1(a) illustrates a naive implementation of the query. As shown in figure 1(a), the scan is performed on all the local nodes and the complete data set is redistributed. A person skilled in the art may understand that this may be the worst performing query plan, though it is very simple to implement. The complete data scanned from each data node has to be redistributed to all the data nodes to perform aggregation.

[0036] Figure 1(b) illustrates an execution of the query in accordance with the prior art/existing databases. Figure 1(b) shows a somewhat more intelligent query plan than the one shown in figure 1(a). Here the group by is done in two stages. In the first stage, the grouping is done based on the actual grouping column and the distribution key. However, redistribution is then done. The number of records redistributed here is also high, as there might be many groups that fall under the combination. The second stage of grouping happens only on the actual grouping column; the values of the distribution column do not overlap with one another, but the grouping is still done over many rows.

[0037] Figure 1(c) illustrates processing of the query for aggregation pushdown in accordance with an embodiment of the present subject matter. As shown in figure 1(c), grouping is done in three stages. According to the present disclosure, the first grouping involves the query's actual grouping column and the distribution key, as used by many solutions (described in figure 1(b)). In the second grouping, which achieves the technical advancement over the prior-art techniques, the grouping is again done based on the actual group by column.

Example 1:

Original Query: select count(distinct(dist_f3)) from t1;

If dist_f3 is a distribution column, then the query below produces the same result:

Rewritten Query as per present disclosure: select count(X.dist_f3) from (select dist_f3 from t1 group by dist_f3) X;

Example 2:

Original Query:

select max(f1), f2, count(distinct(dist_f3)) from t1 group by f2;

Rewritten Query as per present disclosure:

select max(X.f1), X.f2, count(X.dist_f3)
from
(select max(f1) as f1, f2, dist_f3 from t1
group by f2, dist_f3) X
group by X.f2;

The count(DISTINCT(dist_f3)) can now be safely converted to a simple count(dist_f3), because the distinct handling is already done by the first group by.
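By way of a non-limiting illustration, the equivalence claimed in the two examples can be checked on a single in-memory SQLite database. The table name t1, the column names, and the sample rows are assumptions of this sketch, and an ORDER BY is added only so that the two result sets can be compared deterministically. The sketch checks only that the nested-grouping form returns the same result as the DISTINCT form; it does not model distribution across nodes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t1 (f1 integer, f2 text, dist_f3 integer not null)")
conn.executemany(
    "insert into t1 values (?, ?, ?)",
    [(1, "a", 10), (5, "a", 10), (3, "a", 11),
     (2, "b", 10), (9, "b", 12), (4, "b", 12)],
)

# Example 1: DISTINCT count versus a count over an inner group by.
original_1 = conn.execute(
    "select count(distinct(dist_f3)) from t1"
).fetchone()[0]
rewritten_1 = conn.execute(
    "select count(X.dist_f3) from "
    "(select dist_f3 from t1 group by dist_f3) X"
).fetchone()[0]

# Example 2: the inner group by (f2, dist_f3) removes duplicates, so the
# outer query can use a plain count.
original_2 = conn.execute(
    "select max(f1), f2, count(distinct(dist_f3)) "
    "from t1 group by f2 order by f2"
).fetchall()
rewritten_2 = conn.execute(
    "select max(X.f1), X.f2, count(X.dist_f3) from "
    "(select max(f1) as f1, f2, dist_f3 from t1 group by f2, dist_f3) X "
    "group by X.f2 order by X.f2"
).fetchall()

conn.close()
```

The column dist_f3 is declared not null in this sketch because count(distinct ...) ignores NULL values, and the sketch does not examine how NULL keys would be handled.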

[0038] Referring now to figure 2, a system for aggregation pushdown is illustrated, in accordance with an embodiment of the present subject matter. In one implementation, a system 200 for aggregation pushdown is disclosed. Although the present subject matter is explained considering that the present disclosure is implemented in the system 200, it may be understood that the present disclosure may also be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, and the like. It will be understood that the system 200 may be accessed by multiple users, or applications residing on the database system. Examples of the system 200 may include, but are not limited to, a portable computer, a personal digital assistant, a handheld node, sensors, routers, gateways and a workstation. Examples of the database system may include, but are not limited to, a portable computer, a personal digital assistant, a handheld node, sensors, routers, gateways and a workstation.

[0039] The system 200 is communicatively coupled to other nodes or apparatuses to form a network (not shown). In one implementation, the network (not shown) may be a wireless network, a wired network or a combination thereof. The network can be implemented as one of the different types of networks, such as GSM, CDMA, LTE, UMTS, intranet, local area network (LAN), wide area network (WAN), the internet, and the like. The network may either be a dedicated network or a shared network. The shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another. Further, the network may include a variety of network nodes, including routers, bridges, servers, computing nodes, storage nodes, and the like.

[0040] The system 200 may include a processor 202, an interface 204, and a memory 206. The processor 202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any nodes that manipulate signals based on operational instructions. Among other capabilities, the at least one processor is configured to fetch and execute computer-readable instructions or modules stored in the memory 206.

[0041] The interface (I/O interface) 204 may include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like. The I/O interface may allow the database system, the first node, the second node, and the third node to interact with a user directly. Further, the I/O interface may enable the system 200 to communicate with other nodes or computing nodes, such as web servers and external data servers (not shown). The I/O interface can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular (for example, GSM or CDMA), or satellite. The I/O interface may include one or more ports for connecting a number of nodes to one another or to another server. The I/O interface may provide interaction between the user and the system 200 via a screen provided for the interface.

[0042] The memory 206 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory 206 may include a plurality of instructions, modules or applications to perform various functionalities. The memory includes routines, programs, objects, components, data structures, etc., which perform particular tasks or implement particular abstract data types.

[0043] In one implementation, a system 200 for aggregation pushdown is disclosed. The system 200 comprises a processor 202, and a memory 206 coupled to the processor 202 for executing a plurality of modules present in the memory 206. The plurality of modules comprises a receiving module 208, a re-ordering module 210, a parser 212, an execution module 214, a designation module 216, a determination module 218, and a generation module 220.

[0044] The receiving module 208 may be configured to receive at least one request related to a data set, the request comprising at least a group by clause. The re-ordering module 210 may be configured to rearrange the request for receiving, from a plurality of data sources, a respective plurality of responses related to the request received, wherein the request is rearranged such that at least a distribution key is used as the group by clause. The execution module 214 may be configured to execute the rearranged request to generate at least a response to the request received.

[0045] In one implementation, the generation module 220 may be further configured to generate a response to the client by grouping information in the plurality of responses based on a grouping clause included in the request.

[0046] In one implementation, the parser 212 may be configured to parse the request to determine the data set from which to receive responses based on the request. The parsing preferably determines a table name. The designation module 216 may be configured to designate at least a parameter included in the request as a distribution key based on the data set determined. The parameter is based on an association of the table name with a field included in the request. The determination module 218 may be configured to determine a plurality of data sources to provide the responses based on a value of the distribution key. The value is an index identifying the data source. The generation module 220 may be configured to generate the response to the request based on the plurality of responses received from the plurality of data sources.

[0047] Referring now to figure 3, a method for aggregation pushdown is illustrated, in accordance with an embodiment of the present subject matter. The method may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, computer executable instructions may be located in both local and remote computer storage media, including memory storage devices.

[0048] The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method or alternate methods. Additionally, individual blocks may be deleted from the method without departing from the protection scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof. However, for ease of explanation, in the embodiments described below, the method may be considered to be implemented in the above described system 200.

[0049] In one implementation, a method for aggregation pushdown is disclosed.

[0050] At block 302, at least one request related to a data set is received. The request may include at least a group by clause.

[0051] At block 304, the request received is rearranged. Based on the rearranged request, a respective plurality of responses related to the request received is obtained from a plurality of data sources. The request is rearranged such that at least a distribution key is used as the group by clause.

[0052] At block 306, the request received is parsed to determine the data set from which to receive responses based on the request. The parsing is preferably performed to determine a table name from the request received.

[0053] At block 308, based on the table name determined the system may start executing the query.

[0054] At block 310, during execution, based on the data set determined, at least a parameter included in the request is designated as a distribution key. The parameter is based on an association of the table name with a field included in the request.

[0055] At block 312, a plurality of data sources to provide the responses is determined based on a value of the distribution key. The value is an index identifying the data source.

[0056] At block 314, the response to the request is generated based on the plurality of responses received from the plurality of data sources.

[0057] In one implementation, the request rearranged is executed to generate at least a response to the request received.

[0058] In one implementation, a response to the client is generated by grouping information in the plurality of responses based on a grouping clause included in the request.

[0059] In one implementation, a system for aggregation pushdown is disclosed. The system comprises a processor, and a memory coupled to the processor for executing a plurality of modules present in the memory. The plurality of modules comprises a receiving module, a re-ordering module, and an execution module. The receiving module is configured to receive at least one request related to a data set, the request comprising at least a group by clause on a non-distribution column and/or an aggregate function with DISTINCT over the distribution key. The re-ordering module is configured to rearrange the request into a multi-staged group by plan for receiving, from a plurality of data sources, a respective plurality of responses related to the request received, wherein the request is rearranged such that at least a distribution key is used as the group by clause. The execution module is configured to execute the request, rearranged into the multi-staged group by plan, to generate at least a response to the request received.

[0060] In one implementation, a method for aggregation pushdown is disclosed.

[0061] The present disclosure deals with a usage of "DISTINCT" clause on a distribution key in aggregate methods with grouping clause.

[0062] When a given query contains a) a group by clause on a non-distribution column, and b) an aggregate function with DISTINCT over the distribution key, the present disclosure splits the query into a multi-staged group by plan. In a first stage, the grouping is done based on the grouping column and the distribution key. In a next step, grouping is done on the result from the previous grouping by the grouping column alone. Hence, according to the present disclosure, the aggregate with DISTINCT is converted to a simpler form of aggregate, because the DISTINCT entries of the aggregated column are already obtained by the previous grouping phase. Both groupings, along with the aggregation, are done locally. Further, as one part of the aggregation is done locally, the amount of data to be sent over the network is very small. The next group by is done by multiple data nodes with redistribution or by a single data node, as the number of records expected to flow out of each local node is very small.

[0063] Apart from what is discussed above, the present disclosure has some additional advantages as provided below:

• The present disclosure improves performance of the group by queries in distributed databases.

• As a result of the optimization, the execution of group by queries is more localized.

• The present disclosure drastically reduces the amount of data redistributed over the network.

• The present disclosure enables query latency to improve as more nodes are added to the cluster.

• The present disclosure provides a mechanism to push down the grouping clause to the local nodes and avoid redistribution.

• The present disclosure provides a mechanism to rewrite/rearrange/reorder the query to use the distribution key as a grouping by clause column.

• The present disclosure enables heavy data redistribution to be avoided for grouping queries.

[0064] A person skilled in the art may understand that any known or new algorithms may be used for the implementation of the present disclosure. However, it is to be noted that the present disclosure provides a method to be used during data redistribution to achieve the above-mentioned benefits and technical advancement, irrespective of whether known or new algorithms are used.

[0065] A person of ordinary skill in the art may be aware that in combination with the examples described in the embodiments disclosed in this specification, units and algorithm steps may be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on the particular applications and design constraint conditions of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of the present disclosure.

[0066] It may be clearly understood by a person skilled in the art that for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, reference may be made to a corresponding process in the foregoing method embodiments, and details are not described herein again.

[0067] In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely exemplary. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.

[0068] When the functions are implemented in a form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present disclosure essentially, or the part contributing to the prior art, or a part of the technical solutions may be implemented in a form of a software product. The computer software product is stored in a storage medium, and includes several instructions for instructing a computer node (which may be a personal computer, a server, or a network node) to perform all or a part of the steps of the methods described in the embodiment of the present disclosure. The foregoing storage medium includes: any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.

[0069] Although implementations for system and method for aggregation pushdown with double hashing have been described in language specific to structural features and/or methods, it is to be understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as examples of implementations of the system and method for aggregation pushdown with double hashing.

Claims

1. A system for aggregation pushdown, the system comprising:

a processor;

a memory coupled to the processor for executing a plurality of modules stored in the memory, the plurality of modules comprising:

a receiving module configured to receive at least a request related to a data set, the request comprises at least a group by clause;

a re-ordering module configured to rearrange the request for receiving, from a plurality of data sources, a respective plurality of responses related to the request received related to the data set, wherein the request is rearranged such that at least a distribution key is used as the group by clause; and

an execution module configured to execute the request rearranged to generate at least a response to the request received.

2. The system as claimed in claim 1 further comprises a generation module configured to generate a response to the client by grouping information in the plurality of responses based on a grouping clause included in the request.

3. The system as claimed in claim 1 further comprises:

a parser configured to parse the request to determine the data set to receive responses based on request, the parsing preferably determines a table name;

a designating module configured to designate, based on the dataset determined, at least a parameter included in the request as a distribution key, the parameter is based on an association of the table name with a field included in the request;

a determination module configured to determine, based on a value of the distribution key, a plurality of data sources to provide the responses, the value is an index identifying the data source; and

a generation module configured to generate the response to the request based on the plurality of responses received from the plurality of data sources.

4. A method for aggregation pushdown, the method comprising:

receiving at least a request related to a data set, the request comprises at least a group by clause;

rearranging the request for receiving, from a plurality of data sources, a respective plurality of responses related to the request received related to the data set, wherein the request is rearranged such that at least a distribution key is used as the group by clause; and

executing the request rearranged to generate at least a response to the request received.

5. The method as claimed in claim 4 further comprises generating a response to the client by grouping information in the plurality of responses based on a grouping clause included in the request.

6. The method as claimed in claim 4, further comprises:

parsing the request to determine the data set to receive responses based on request, the parsing preferably determines a table name;

designating, based on the dataset determined, at least a parameter included in the request as a distribution key, the parameter is based on an association of the table name with a field included in the request;

determining, based on a value of the distribution key, a plurality of data sources to provide the responses, the value is an index identifying the data source; and

generating the response to the request based on the plurality of responses received from the plurality of data sources.

7. A system for aggregation pushdown, the system comprising:

a processor;

a receiving module configured to receive at least a request related to a data set, the request comprises at least a group by clause on a non-distribution column and/or an aggregate function with DISTINCT over the distribution key;

a re-ordering module configured to rearrange the request to multi-staged group by plan for receiving, from a plurality of data sources, a respective plurality of responses related to the request received related to the data set, wherein the request is rearranged such that at least a distribution key is used as the group by clause; and

an execution module configured to execute the request rearranged, to multi-staged group by plan, to generate at least a response to the request received.

8. The system as claimed in claim 7, wherein the request rearranged to multi-staged group by plan comprises:

at least a first stage wherein a grouping is done based on the grouping column and the distribution key; and

at least a second stage wherein a grouping is done on the result received from the grouping by the distribution column of the first stage.

9. The system as claimed in claim 8, wherein the grouping in the first stage is executed locally in the system.

10. The system as claimed in claim 8, wherein the grouping in the second stage is executed by multiple systems with redistribution or by at least another system.

11. A method for aggregation pushdown, the method comprising:

receiving at least a request related to a data set, the request comprises at least a group by clause on a non-distribution column and/or an aggregate function with DISTINCT over the distribution key;

rearranging the request to multi-staged group by plan for receiving, from a plurality of data sources, a respective plurality of responses related to the request received related to the data set, wherein the request is rearranged such that at least a distribution key is used as the group by clause;

executing the request rearranged, to multi-staged group by plan, to generate at least a response to the request received.

12. The method as claimed in claim 11, further comprises:

grouping, in at least a first stage, based on the grouping column and the distribution key; and

grouping, in at least a second stage, on the result received from the grouping by the distribution column of the first stage.

13. The method as claimed in claim 12 further comprises: executing, using at least one system, the grouping in the first stage.

14. The method as claimed in claim 12 further comprises: executing, using multiple systems with redistribution or by at least another system, the grouping in the second stage.