WO2022041143A1 - Smart procedure routing in partitioned database management systems - Google Patents

Smart procedure routing in partitioned database management systems Download PDF

Info

Publication number
WO2022041143A1
Authority
WO
WIPO (PCT)
Prior art keywords
stored procedure
data
input parameters
stored
procedure
Prior art date
Application number
PCT/CN2020/112265
Other languages
French (fr)
Inventor
Chen Fu
Le CAI
Feifei Li
Original Assignee
Alibaba Group Holding Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited filed Critical Alibaba Group Holding Limited
Priority to CN202080102917.6A priority Critical patent/CN115803715A/en
Priority to PCT/CN2020/112265 priority patent/WO2022041143A1/en
Publication of WO2022041143A1 publication Critical patent/WO2022041143A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5033Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering data affinity
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5019Workload prediction

Definitions

  • PDBMS distributed database management systems
  • data tables are partitioned horizontally into multiple parts, commonly known as data shards.
  • Each data shard is a collection of rows, and is independently hosted and replicated.
  • Shards can be moved, split, or merged to improve performance and elasticity.
  • Transactions that involve data hosted on a single node are called local transactions. These transactions are essentially no different from transactions in traditional monolithic database management systems.
  • transactions involving data on multiple nodes, i.e., global transactions, need to go through a complex process called 2 phase commit (2PC) when a commit operation is performed.
  • a database of an e-commerce application may include warehouse and customer tables, etc. If most orders submitted to the e-commerce application can be fulfilled by local warehouses, a database administrator may choose to partition the tables based on geographic locations, so that rows representing warehouses and customers in the same location are in the same shard of their respective tables. The database administrator may also specify data shards of the customer table and data shards of the warehouse table of the same location to be hosted on the same node. In this way, most orders can be served by executing local transactions to achieve better performance.
  • since the transaction has been assigned to the node, the node would need to act as a control node that coordinates execution or processing of the multiple queries included in the transaction by other nodes, sends a commit instruction to those other nodes to complete a 2 phase commit, collects the query results from the other nodes, and sends the query results to a client device of a database administrator for presentation or review. This further incurs communication costs and time due to the transmission of query results and instructions between the control node and the other nodes.
  • one or more computing devices may perform a static program analysis on a stored procedure to determine one or more input parameters.
  • the one or more computing devices may further obtain stored data including a plurality of value sets for the one or more input parameters of the stored procedure and identifiers of a plurality of data shards.
  • the one or more computing devices may then train a classification model using the plurality of value sets as inputs and the corresponding identifiers of the plurality of data shards as outputs, and set the trained classification model as a predictor function used for mapping a new value set for the one or more input parameters of the stored procedure to a corresponding data shard.
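As a concrete (hypothetical) illustration of this training step, the sketch below stands in for the classification model with a trivial majority-vote classifier in plain Python; the value sets, shard identifiers, and the first-parameter heuristic are all invented for the example, and a real system would use a neural network or decision tree as the text notes.

```python
from collections import Counter, defaultdict

# Hypothetical training data: each value set holds the input-parameter
# values observed for one invocation of the stored procedure, paired with
# the ID of the data shard that invocation actually touched.
history = [
    ((94105, "retail"), "shard-west"),
    ((94107, "retail"), "shard-west"),
    ((10001, "retail"), "shard-east"),
    ((10002, "bulk"),   "shard-east"),
]

class MajorityByKeyPredictor:
    """Minimal stand-in for the trained classification model: memorizes the
    majority shard seen for each first-parameter value."""

    def fit(self, samples):
        votes = defaultdict(Counter)
        for value_set, shard_id in samples:
            votes[value_set[0]][shard_id] += 1
        self._table = {k: c.most_common(1)[0][0] for k, c in votes.items()}
        return self

    def predict(self, value_set):
        # Maps a new value set for the input parameters to a shard ID.
        return self._table.get(value_set[0])

predictor = MajorityByKeyPredictor().fit(history)
print(predictor.predict((94105, "bulk")))  # prints shard-west
```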
  • FIG. 1 illustrates an example environment in which a database system may be used.
  • FIG. 2 illustrates an example computing node in more detail.
  • FIG. 3 illustrates an example load balancing node in more detail.
  • FIG. 4 illustrates an example monitoring node in more detail.
  • FIG. 5 illustrates an example method of determining a predictor function.
  • FIG. 6 illustrates an example method of processing a distributed database transaction.
  • existing partitioned and distributed database systems adopt load balancing strategies that fail to deterministically or strategically assign a distributed database transaction to a particular node that hosts one or more data tables involved in the database transaction. As a result, the node that is assigned the database transaction may need to act as a control or coordinator node, coordinating the other nodes that host the involved data tables to process the distributed database transaction and to commit it synchronously.
  • This not only increases communication costs and time due to data and instructions transmitted among the nodes, but also wastes processing resources of the control or coordinator node that would otherwise be unnecessary for the distributed database transaction.
  • the database system may include a load balancing node, a plurality of computing nodes, and a monitoring node.
  • the database system may be associated with a plurality of procedures that are registered and stored in a distributed database associated with the database system in advance.
  • Each stored procedure may include one or more database queries accessing data shards of one or more database tables that are located or distributed in one or more computing nodes of the database system.
  • a stored procedure may include one or more input parameters that may act as placeholders to receive input values from a user for performing a database transaction.
  • the database system may assign a unique identifier to each stored procedure of the plurality of stored procedures.
  • the database system may include or store one or more predictor functions.
  • a predictor function of the one or more predictor functions may be used for one or more stored procedures of the plurality of the stored procedures.
  • each predictor function may be labeled with a respective identifier
  • the database system or the load balancing node may include a first data structure (such as a table or a mapping) that associates or stores each predictor function of the one or more predictor functions with a respective identifier.
  • the database system may further include a second data structure (such as a table or a mapping) that associates or stores identifiers of one or more stored procedures with an identifier of a corresponding predictor function.
  • a predictor function may be configured to predict or attempt to map a stored procedure having specific values for input parameters of the stored procedure to an identifier of a data shard that may include data corresponding to at least one query of the stored procedure.
  • the predictor function may be configured to predict or attempt to map a stored procedure having specific values for input parameters of the stored procedure to an identifier of a computing node including a data shard that may include data corresponding to at least one query of the stored procedure.
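The two data structures described above can be sketched as ordinary lookup tables. Everything in this snippet — the stand-in predictor function, its identifier, and the stored-procedure identifier — is a made-up placeholder for illustration, not part of the disclosed system.

```python
# Hypothetical stand-in predictor: maps a value set to a shard ID.
def predict_orders(values):
    return "shard-%d" % (values[0] % 2)

# First data structure: predictor identifier -> predictor function.
predictor_table = {"pred-1": predict_orders}

# Second data structure: stored-procedure identifier -> predictor identifier.
procedure_to_predictor = {"proc-new-order": "pred-1"}

def predictor_for(procedure_id):
    """Resolve the predictor function registered for a stored procedure."""
    return predictor_table[procedure_to_predictor[procedure_id]]

print(predictor_for("proc-new-order")((7, "x")))  # prints shard-1
```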
  • the database system may receive a stored procedure with input values for input parameters of the stored procedure from a client device.
  • the database system may select or determine a predictor function for the stored procedure from among the one or more predictor functions based on a second data structure relating the plurality of stored procedures to the one or more predictor functions.
  • the database system may determine or extract input values for one or more input parameters of the stored procedure through a static program analysis, and employ the selected or determined predictor function (and also a mapping function if the predictor function outputs an identifier of a data shard, rather than an identifier of a computing node) to determine a computing node to which the stored procedure is forwarded or assigned based on the extracted input values.
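Putting the pieces of this step together, a minimal routing sketch might look as follows. The predictor rule, the shard and node names, and the invocation format are all assumptions for illustration; here the predictor outputs a shard ID, so a placement mapping is applied afterwards to obtain the computing node, as the text describes.

```python
# Illustrative placement mapping: shard ID -> computing node hosting it.
placement = {"shard-east": "node-1", "shard-west": "node-2"}

def predict_shard(values):
    # Stand-in for the trained predictor function (assumed rule).
    return "shard-west" if values[0] >= 50000 else "shard-east"

def route(invocation):
    # Input values would be determined/extracted via static program analysis.
    values = invocation["values"]
    shard_id = predict_shard(values)
    return placement[shard_id]   # node the stored procedure is forwarded to

node = route({"procedure": "proc-new-order", "values": (94105,)})
print(node)  # prints node-2
```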
  • the one or more predictor functions may include one or more classification models that are obtained by the database system through training based on historical data that is collected by the plurality of computing nodes.
  • historical data for a predictor function may include, but is not limited to, a plurality of value sets for respective one or more input parameters of one or more stored procedures associated with the predictor function, identifiers of data shards or computing nodes, etc.
  • the one or more classification models may include, but are not limited to, a neural network model, a deep learning model, a decision tree model, etc.
  • the database system may update or retrain the one or more predictor functions on a periodic basis or when a prediction error reaches a predetermined error threshold.
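The retraining trigger described above could be tracked with a simple error counter, as sketched below; the 0.2 threshold and the miss-counting scheme are illustrative assumptions, not values from the source.

```python
ERROR_THRESHOLD = 0.2  # assumed predetermined error threshold

class PredictorMonitor:
    """Tracks prediction errors for one predictor function."""

    def __init__(self):
        self.total = 0
        self.misses = 0

    def record(self, predicted_shard, actual_shard):
        self.total += 1
        if predicted_shard != actual_shard:
            self.misses += 1

    def needs_retrain(self):
        # Retrain once the observed error rate reaches the threshold.
        return self.total > 0 and self.misses / self.total >= ERROR_THRESHOLD

monitor = PredictorMonitor()
for predicted, actual in [("s1", "s1"), ("s1", "s2"), ("s2", "s2"), ("s1", "s1")]:
    monitor.record(predicted, actual)
print(monitor.needs_retrain())  # 1 miss out of 4 = 0.25 >= 0.2, prints True
```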
  • the example database system may receive a stored procedure with input values for input parameters, and predict or determine a computing node that may likely store a data shard of a data table involved in at least one query of the stored procedure based on a predictor function, thereby potentially avoiding a situation of sending the stored procedure to a computing node that does not store any data shard queried or accessed by the procedure, and thereby reducing resource wastes, communication costs, and processing delays due to the additional need of interactions between the computing node and other computing nodes that store data shards queried or accessed by the stored procedure.
  • functions described herein to be performed by the database system may be performed by multiple separate units or services.
  • while in some examples the database system may be implemented as a combination of software and hardware implemented and distributed in multiple devices, in other examples the database system may be implemented and distributed as services provided in one or more computing devices over a network and/or in a cloud computing architecture.
  • the application describes multiple and varied embodiments and implementations.
  • the following section describes an example framework that is suitable for practicing various implementations.
  • the application describes example systems, devices, and processes for implementing a database system.
  • FIG. 1 illustrates an example environment 100 usable to implement a database system.
  • the environment 100 may include a database system 102.
  • the database system 102 may include a plurality of servers or computing nodes 104-1, 104-2, ..., 104-N (collectively referred to as computing nodes 104).
  • the computing nodes 104 may communicate data with one another via a network 106.
  • the database system 102 may further include at least one load balancing node 108 for allocating workload to the server or computing nodes 104.
  • one or more servers or computing nodes 104 may be designated or used as the at least one load balancing node 108.
  • the database system 102 may further include a monitoring node 110.
  • each of the servers or computing nodes 104 may be implemented as any of a variety of computing devices, including, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc.), a server computer, etc., or a combination thereof.
  • the network 106 may be a wireless or a wired network, or a combination thereof.
  • the network 106 may be a collection of individual networks interconnected with each other and functioning as a single large network (e.g., the Internet or an intranet) . Examples of such individual networks include, but are not limited to, telephone networks, cable networks, Local Area Networks (LANs) , Wide Area Networks (WANs) , and Metropolitan Area Networks (MANs) . Further, the individual networks may be wireless or wired networks, or a combination thereof.
  • Wired networks may include an electrical carrier connection (such as a communication cable, etc.) and/or an optical carrier or connection (such as an optical fiber connection, etc.).
  • Wireless networks may include, for example, a WiFi network, other radio frequency networks (e.g., Zigbee, etc. ) , etc.
  • the environment 100 may further include a client device 112.
  • the client device 112 may be implemented as any of a variety of computing devices, including, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc.), a server computer, etc., or a combination thereof.
  • the database system 102 may receive a request for processing a distributed database transaction from the client device 112.
  • a user 114 such as a database administrator, etc.
  • the database system 102 may determine at least one computing node including a data section of a data table involved in the distributed database transaction based on a routing directive, send the distributed database transaction to the at least one computing node for processing, and return a result of processing the distributed database transaction to the client device 112.
  • the database system 102 may further include one or more databases that are partitioned or distributed among the plurality of computing nodes 104.
  • a partitioned or distributed database may include one or more data tables that are divided or partitioned, with each data table being divided horizontally into multiple parts called data shards or simply shards.
  • each shard may include a collection of rows, and may be independently hosted and replicated in one or more computing nodes 104. Furthermore, shards can be moved, split, or merged to improve performance and elasticity of the database system 102.
  • transactions that involve data hosted on a single computing node 104 or server may be called local transactions. They are essentially no different from the transactions in the traditional monolithic DBMSs.
  • transactions involving data on multiple computing nodes, i.e., global transactions, need to go through a complex process called 2 phase commit (2PC) when committing.
  • a data table may be partitioned horizontally based on a value of a column, or a combination of multiple columns. In the latter case, values from each of the columns may form a tuple, and these columns may be treated as a single logical column for the purpose of partitioning. Without loss of generality and for the sake of simplicity, tables are assumed to be partitioned based on a single partition column.
  • rows having a same value in a partition column may be placed in a same data section or shard.
  • an e-commerce application may include a database having a customer table, which includes columns such as street, city, state, and zip code. These columns may form an address of a customer.
  • a user 114 (such as a database administrator) may partition the customer table using a column associated with the zip code alone, so that all rows having a same zip code in the customer table are located in a same data section or shard.
  • the database system 102 may further include or provide a partition function for each data table, part_table(k) → shard_id, where k is called a partition key and is a value of a partition column.
  • an output of this function may be an identifier (i.e., ID) of a shard including all the rows whose partition-column value is equal to the partition key (i.e., k).
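A toy partition function of this shape might look like the following; the zip-code ranges and shard naming are invented for illustration, not taken from the source.

```python
def part_customer(zip_code):
    """Hypothetical part_table(k) -> shard_id: maps a partition key (a zip
    code) to the ID of the shard holding every row whose partition-column
    value equals that key. The 30000-wide ranges are an arbitrary choice."""
    return "customer-shard-%d" % (zip_code // 30000)

print(part_customer(94105))  # prints customer-shard-3
```

Because the function is deterministic in the partition key, all rows sharing a zip code necessarily land in the same shard, which is the property the text relies on.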
  • the database system 102 may allow the user 114 to specify one or more partition and placement policies for data stored in the partitioned databases.
  • a warehouse table may further be included in addition to the customer table.
  • the user 114 may decide to partition a database based on geographic locations. In this case, zip codes of locations of warehouses and zip codes of customer addresses may be used as corresponding partition columns for the warehouse table and the customer table respectively.
  • the user 114 may instruct the database system 102 to allow rows that represent warehouses and customers in the same zip code to be hosted or stored in the same node, so that most order processing may incur local transactions.
  • both functions part_warehouse(k) and part_customer(k) take a zip code as a partition key, and place(·) is a placement function that maps a shard ID to the computing node hosting that shard.
  • the partition function and the placement function may be implemented by querying metadata that includes a mapping from partition keys to shard IDs, and a mapping from the shard IDs to computing nodes when corresponding shards are hosted or stored. This metadata may be replicated to some or all of the computing nodes 104 and/or the load balancing node 108 in the database system 102.
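That metadata-backed implementation reduces to two lookups, sketched below; all partition keys, shard IDs, and node names are illustrative placeholders.

```python
# Replicated metadata (illustrative): partition key -> shard ID, and
# shard ID -> computing node where the shard is hosted.
key_to_shard = {94105: "wh-3", 94107: "wh-3", 10001: "wh-1"}
shard_to_node = {"wh-3": "node-west", "wh-1": "node-east"}

def part(k):
    # Partition function implemented by querying the metadata.
    return key_to_shard[k]

def place(shard_id):
    # Placement function implemented by querying the metadata.
    return shard_to_node[shard_id]

print(place(part(94105)))  # prints node-west
```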
  • the partition function may not exist or may be difficult to construct due to a potentially enormous amount of data being involved and needing to be stored beforehand.
  • the database system 102 may include or provide a predictor function for a stored procedure, predict_proc(parameters) → shard_id, where parameters represent the values inputted by a user for the input parameters of the stored procedure.
  • an output of this function may be shard_id, i.e., an identifier (i.e., ID) of a shard that likely includes data rows to be queried or accessed by at least one query of the stored procedure.
  • the database system 102 may employ the placement function as described above to obtain an identifier of a computing node to which the stored procedure can be assigned or sent.
  • FIG. 2 illustrates the computing node 104 in more detail.
  • the computing node 104 may include, but is not limited to, one or more processors 202, an input/output (I/O) interface 204, and/or a network interface 206, and memory 208.
  • some of the functions of the computing node 104 may be implemented using hardware, for example, an ASIC (i.e., Application-Specific Integrated Circuit), an FPGA (i.e., Field-Programmable Gate Array), and/or other hardware.
  • the processors 202 may be configured to execute instructions that are stored in the memory 208, and/or received from the I/O interface 204, and/or the network interface 206.
  • the processors 202 may be implemented as one or more hardware processors including, for example, a microprocessor, an application-specific instruction-set processor, a physics processing unit (PPU), a central processing unit (CPU), a graphics processing unit, a digital signal processor, a tensor processing unit, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components such as field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), and complex programmable logic devices (CPLDs).
  • the memory 208 may include computer readable media in a form of volatile memory, such as Random Access Memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM.
  • the computer readable media may include volatile or non-volatile, removable or non-removable media, which may achieve storage of information using any method or technology.
  • the information may include a computer readable instruction, a data structure, a program module or other data.
  • Examples of computer readable media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electronically erasable programmable read-only memory (EEPROM), quick flash memory or other internal storage technology, compact disk read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media, which may be used to store information that may be accessed by a computing device.
  • the computer readable media does not include any transitory media, such as modulated data signals and carrier waves.
  • the computing node 104 may further include other hardware components and/or other software components such as program units to execute instructions stored in the memory 208 for performing various operations.
  • the computing node 104 may further include a local or partitioned database 210 for storing data tables and other program data 212.
  • the computing node 104 may store data sections or shards of one or more data tables in the local or partitioned database 210.
  • the one or more data tables may be divided and distributed according to respective partition keys among different computing nodes 104.
  • FIG. 3 illustrates the load balancing node 108 in more detail.
  • the load balancing node 108 may include, but is not limited to, one or more processors 302, an input/output (I/O) interface 304, and/or a network interface 306, and memory 308.
  • some of the functions of the load balancing node 108 may be implemented using hardware, for example, an ASIC (i.e., Application-Specific Integrated Circuit), an FPGA (i.e., Field-Programmable Gate Array), and/or other hardware.
  • the processors 302 may be configured to execute instructions that are stored in the memory 308, and/or received from the I/O interface 304, and/or the network interface 306.
  • the processors 302 may be implemented as one or more hardware processors including, for example, a microprocessor, an application-specific instruction-set processor, a physics processing unit (PPU), a central processing unit (CPU), a graphics processing unit, a digital signal processor, a tensor processing unit, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components such as field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), and complex programmable logic devices (CPLDs).
  • the memory 308 may include computer readable media in a form of volatile memory, such as Random Access Memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM.
  • the load balancing node 108 may further include other hardware components and/or other software components such as program units to execute instructions stored in the memory 308 for performing various operations.
  • the load balancing node 108 may further include a mapping table 310 and other program data 312.
  • the mapping table 310 may include a mapping from a combination of information of a data table and information of a partition key to information of a computing node that includes a data section or shard of the data table corresponding to the partition key.
  • the load balancing node 108 may determine an identifier or address of a computing node that includes a data section or shard of the data table corresponding to the value of the partition key.
  • the load balancing node 108 may obtain this mapping table 310 in advance, for example, by receiving broadcasting information from the plurality of computing nodes 104.
  • broadcast information of each computing node 104 may include, but is not limited to, information about which data sections or shards of data tables, corresponding to certain values of partition keys, are included or stored in the respective computing node 104.
  • the load balancing node 108 may be associated with a mapping device (which may be a server or computing device provided in the database system 102) .
  • the mapping device may collect, from the plurality of computing nodes 104, information about which data sections or shards of data tables corresponding to certain values of partition keys are included or stored in each computing node 104, for example, via the broadcast information from the plurality of computing nodes 104 as described above.
  • the load balancing node 108 may send information (such as a name) of a data table and information (such as a value) of a partition key obtained from a distributed database transaction to the mapping device, which maps the information of the data table and the information of the partition key to information (e.g., an identifier or address) of a computing node that includes a data section or shard of the data table corresponding to the partition key.
  • the load balancing node 108 may then receive the information of the computing node from the mapping device, thus reducing the workload and complexity of the load balancing node while enabling the load balancing node to continuously process load balancing of incoming requests or data packets.
  • the load balancing node 108 may include one or more predetermined load balancing strategies.
  • the one or more predetermined load balancing strategies may include assigning a distributed database transaction received (e.g., from the client device 112) to a computing node in a random manner, assigning the distributed database transaction to a computing node in a round-robin manner, assigning the distributed database transaction to a computing node that currently has the least workload, assigning the distributed database transaction to a computing node based on a mapping relationship between an IP address of the client device and the computing node, etc.
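Minimal versions of the predetermined strategies listed above might look like the following; the node names, load figures, and transaction format are invented for the example.

```python
import itertools
import random

nodes = ["node-1", "node-2", "node-3"]
loads = {"node-1": 5, "node-2": 2, "node-3": 9}  # current workload per node
rr = itertools.cycle(nodes)                       # round-robin cursor

def assign_random(txn):
    # Assign the transaction to a node in a random manner.
    return random.choice(nodes)

def assign_round_robin(txn):
    # Assign the transaction to the next node in round-robin order.
    return next(rr)

def assign_least_loaded(txn):
    # Assign the transaction to the node with the least current workload.
    return min(nodes, key=loads.get)

def assign_by_client_ip(txn):
    # Stable mapping from the client's IP address to a node.
    return nodes[hash(txn["client_ip"]) % len(nodes)]

print(assign_least_loaded({}))  # prints node-2
```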
  • the load balancing node 108 may further include an identifier database 314.
  • the identifier database 314 may include or store, for example, unique identifiers (such as unique names, etc. ) of stored procedures that are registered and stored in the database system 102, unique identifiers of predictor functions, etc.
  • the load balancing node 108 may further include a relationship database 316.
  • the relationship database 316 may include or store, for example, relationships between unique identifiers of stored procedures and unique identifiers of predictor functions, etc.
  • the load balancing node 108 may further include a predictor database 318.
  • the predictor database 318 may include or store the plurality of predictor functions, which can be queried or invoked through the identifiers of the predictor functions, for example.
  • FIG. 4 illustrates the monitoring node 110 in more detail.
  • the monitoring node 110 may include, but is not limited to, one or more processors 402, an input/output (I/O) interface 404, and/or a network interface 406, and memory 408.
  • some of the functions of the monitoring node 110 may be implemented using hardware, for example, an ASIC (i.e., Application-Specific Integrated Circuit), an FPGA (i.e., Field-Programmable Gate Array), and/or other hardware.
  • the processors 402 may be configured to execute instructions that are stored in the memory 408, and/or received from the I/O interface 404, and/or the network interface 406.
  • the processors 402 may be implemented as one or more hardware processors including, for example, a microprocessor, an application-specific instruction-set processor, a physics processing unit (PPU), a central processing unit (CPU), a graphics processing unit, a digital signal processor, a tensor processing unit, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components such as field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), and complex programmable logic devices (CPLDs).
  • the memory 408 may include computer readable media in a form of volatile memory, such as Random Access Memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM.
  • the monitoring node 110 may further include other hardware components and/or other software components such as program units to execute instructions stored in the memory 408 for performing various operations.
  • the monitoring node 110 may further include an information database 410 and other program data 412.
  • the information database 410 may store data received from the plurality of computing nodes 104 and the load balancing node 108, which may include, but is not limited to, information about the respective numbers of times that stored procedures are separately invoked or used by users from the load balancing node 108, identifiers of missed procedures and respective numbers of misses from each computing node 104, respective input and output data sets that include a plurality of value sets for the one or more input parameters of a stored procedure and identifiers of a plurality of data shards for each predictor function of a plurality of predictor functions, etc.
  • FIG. 5 shows a schematic diagram depicting an example method of determining a predictor function.
  • FIG. 6 shows a schematic diagram depicting an example method of processing a distributed database transaction.
  • the methods of FIG. 5 and FIG. 6 may, but need not, be implemented in the environment of FIG. 1 and using the computing node, the load balancing node, and the monitoring node of FIGS. 2-4.
  • methods 500 and 600 are described with reference to FIGS. 1-3. However, the methods 500 and 600 may alternatively be implemented in other environments and/or using other systems.
  • the methods 500 and 600 are described in the general context of computer-executable instructions.
  • computer-executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types.
  • each of the example methods is illustrated as a collection of blocks in a logical flow graph representing a sequence of operations that can be implemented in hardware, software, firmware, or a combination thereof.
  • the order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or alternate methods. Additionally, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein.
  • the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations.
  • some or all of the blocks may represent application specific integrated circuits (ASICs) or other physical components that perform the recited operations.
  • the monitoring node 110 may select a stored procedure.
  • the database system 102 may include a plurality of procedures that are stored in the database system 102. These procedures are written in a query language (such as SQL) and are registered in the database system 102 for invocation by users.
  • a stored procedure may include one or more queries that access or update data of one or more data shards of one or more database tables located in one or more computing nodes of the database system 102.
  • a query of a stored procedure may include a partition key and an input parameter that receives an input value from a user using the stored procedure.
  • a stored procedure may include an argument list of input parameters that may serve as placeholders to receive input values for the input parameters of the stored procedure from a user.
  • a stored procedure may not include an argument list of input parameters, in which case the input parameters are scattered throughout the stored procedure and need to be identified.
  • the monitoring node 110 may obtain one or more stored procedures that are newly added or registered in the database system 102 and for which no predictor function has yet existed or been trained.
  • the monitoring node 110 may select a stored procedure from the one or more stored procedures that are newly added or registered in the database system 102.
  • the monitoring node 110 may select a stored procedure that is newly added or registered after the stored procedure has been used or invoked by users of the database system 102 a predetermined number of times (e.g., 1,000 times, 10,000 times, etc.) and/or after a predetermined period of time (e.g., after one week, after two weeks, etc.). This allows a sufficient amount of training data to be available for training a predictor function for the stored procedure.
  • the monitoring node 110 may select a stored procedure from among the plurality of stored procedures that are associated with predictor functions for retraining on a periodic basis (such as every one week, every two weeks, every one month, etc. ) .
  • a frequency of selecting a certain stored procedure may depend on a frequency of the stored procedure being invoked or used by users in the database system 102. By way of example and not limitation, the more frequently a stored procedure is invoked or used, the more frequently that stored procedure is selected.
  • the monitoring node 110 may monitor the performance of the plurality of stored procedures in the database system 102, and determine whether to update or retrain a stored procedure based on the performance of the stored procedure.
  • the performance of a stored procedure may include, but is not limited to, an error rate of a predictor function used for mapping the stored procedure, etc.
  • a load balancing node 108 may count a number of times that each stored procedure is used or invoked by users within a preset time interval (e.g., every one week, every two weeks, etc. ) , and store identifiers of each stored procedure and respective numbers of times of uses or invocations in a data structure (e.g., a table, etc. ) .
  • each computing node 104 may keep track of one or more stored procedures that are assigned thereto encountering a miss in the respective computing node 104 (i.e., data requested to be accessed or updated by the one or more stored procedures does not exist in the respective computing node 104) .
  • Each computing node 104 may store identifier(s) of such stored procedure(s) and respective numbers of misses of such stored procedure(s) (called missed procedure(s) for simplicity) in a certain data structure (such as a list, etc.) within the preset time interval.
  • the monitoring node 110 may receive information about respective numbers of times that stored procedures are separately invoked or used by users from the load balancing node 108, and further collect identifier(s) of missed procedure(s) and respective number(s) of misses from each computing node 104. The monitoring node 110 may then determine a respective error rate for each stored procedure (for example, a percentage of uses or invocations of a stored procedure that result in a miss). The monitoring node 110 may compare the respective error rate of each stored procedure with a predetermined error threshold, and select the stored procedure having the highest error rate among those whose error rates exceed the predetermined error threshold.
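The selection logic described above can be sketched as follows. This is an illustrative Python sketch, not code from the patent; the names (invocation_counts, miss_counts, ERROR_THRESHOLD) and the 5% threshold are assumptions.

```python
ERROR_THRESHOLD = 0.05  # assumed predetermined error threshold

def select_procedure_for_retraining(invocation_counts, miss_counts):
    """Return the stored-procedure id with the highest miss rate above the
    threshold, or None if every procedure is within tolerance.

    invocation_counts: {proc_id: times invoked via the load balancing node}
    miss_counts:       {proc_id: times the routed node lacked the data}
    """
    worst_id, worst_rate = None, ERROR_THRESHOLD
    for proc_id, invocations in invocation_counts.items():
        if invocations == 0:
            continue
        rate = miss_counts.get(proc_id, 0) / invocations
        if rate > worst_rate:
            worst_id, worst_rate = proc_id, rate
    return worst_id
```

For example, a procedure missing on 20% of its invocations would be chosen over one missing on 1%, and nothing is selected if all procedures stay under the threshold.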
  • the monitoring node 110 may determine one or more input parameters used for training a predictor function associated with the stored procedure.
  • the monitoring node 110 may determine one or more input parameters that may be used for training a predictor function for the stored procedure.
  • the stored procedure may include an argument list including at least one input parameter, and the monitoring node 110 may use the at least one parameter included in the argument list as an input parameter candidate for training the predictor function of the stored procedure.
  • the monitoring node 110 may perform a static program analysis to extract one or more input parameter candidates from the stored procedure.
  • the static program analysis, which may also be called compile-time analysis, is a class of algorithms that analyze program code before the code is run, in order to predict the runtime behavior of the code.
  • the monitoring node 110 may employ the static program analysis to determine which input parameters decide or affect data shards on which one or more queries in the stored procedure may operate (e.g., access or update) .
  • the monitoring node 110 may identify one or more variables that are used as one or more partition keys in the stored procedure.
  • the monitoring node 110 may find distributed tables in FROM clauses, and identify variables that may be compared against respective partition columns of these tables. A similar approach may also be used for UPDATE, INSERT, and DELETE queries.
  • consider an example stored procedure, namely payment_proc. The first UPDATE is used for updating a target table, namely, warehouse, and a variable, in_w_zip, is compared against a corresponding partition column, warehouse_zip.
  • the monitoring node 110 may compute a transitive closure of all variables that may affect value(s) of partition key(s) in one or more queries of the stored procedure.
  • the monitoring node 110 may employ a data flow analysis to estimate possible value flows between program points that are reachable according to a control flow graph.
  • the monitoring node 110 may treat the partition key of the query as a starting point, and trace backwards against value-flow directions, until all links that are transitively reachable from the starting point are exhausted.
  • the monitoring node 110 may perform such operations using a graph traversal algorithm.
  • the monitoring node 110 may set these variables as input parameter candidates.
  • the monitoring node 110 may repeat the above operations for all partition keys of all queries in the stored procedure, and thus obtain an entire set of input parameter candidates that may affect the partition keys of all the queries of the stored procedure.
  • the monitoring node 110 may set these input parameter candidates that may affect the partition keys of the queries of the stored procedure as one or more input parameters used for training a predictor function associated with the stored procedure.
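The backward traversal over value flows described above can be sketched as a breadth-first search. This is an illustrative sketch: the value-flow graph representation and all variable names are hypothetical, not taken from the patent.

```python
from collections import deque

def input_parameter_candidates(value_flows, partition_keys, declared_params):
    """Compute the transitive closure of variables that can affect any
    partition key, then keep only those that are declared input parameters.

    value_flows:     {variable: [variables whose values flow into it]}
    partition_keys:  variables used as partition keys in the queries
    declared_params: the stored procedure's input parameters
    """
    reachable = set()
    queue = deque(partition_keys)
    while queue:  # plain BFS, traced backwards against value-flow direction
        var = queue.popleft()
        if var in reachable:
            continue
        reachable.add(var)
        queue.extend(value_flows.get(var, ()))
    return reachable & set(declared_params)
```

With a hypothetical flow graph where a temporary variable carries in_w_zip into the warehouse_zip partition key, the traversal would surface in_w_zip (but not, say, in_payment_amount) as a candidate.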
  • the monitoring node 110 may obtain historical data including a plurality of value sets for the one or more input parameters of the stored procedure and identifiers of a plurality of data shards.
  • the monitoring node 110 may obtain historical data from the plurality of computing nodes 104 and the load balancing node 108.
  • the historical data may include, but is not limited to, a plurality of value sets for the one or more input parameters of the stored procedure and identifiers of a plurality of data shards.
  • each data shard of the plurality of data shards may include a corresponding piece of data that is accessed by the stored procedure when the one or more input parameters take a respective value set of the plurality of value sets.
  • the monitoring node 110 may train a classification model using the plurality of value sets as inputs and the corresponding identifiers of the plurality of data shards as outputs.
  • the monitoring node 110 may train a classification model to learn or model the behavior of a predictor function for the stored procedure, based on the historical data. In implementations, the monitoring node 110 may employ supervised training to train and obtain the classification model.
  • the classification model may include, but is not limited to, a neural network model, a deep learning model, a decision tree model, etc.
  • a neural network model is used as an example of the classification model to learn or model the behavior of the predictor function for the stored procedure.
  • the one or more input parameters of the stored procedure that are obtained or determined at block 504 may be used as an input (or input features) to the neural network model, and historical values (or value sets) of the one or more parameters are training values for the input features.
  • each stored procedure may include one or more queries, and may perform operations (e.g., accessing, updating, etc. ) on one or more data shards of one or more database tables.
  • the monitoring node 110 may select a most heavily operated data shard from among the one or more data shards of one or more database tables.
  • a most heavily operated data shard for a stored procedure may include, but is not limited to, a data shard with the largest number of rows that are accessed or affected by queries in the stored procedure, etc.
  • the monitoring node 110 may treat the most heavily operated data shard as a corresponding output (or label) of the neural network model.
  • such feature-label pairs can be obtained by monitoring executions of stored procedures, logging respective input parameters, corresponding data shards that are accessed, and corresponding numbers of rows that are accessed or affected, etc. The monitoring node 110 may then train the neural network model using a conventional training or learning algorithm.
  • the monitoring node 110 may set the trained classification model as a predictor function used for mapping a new value set for the one or more input parameters of the stored procedure to a corresponding data shard.
  • the monitoring node 110 may use the trained classification model as a predictor function for the stored procedure.
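The patent describes training a neural network as the classification model; as a self-contained stand-in, the sketch below uses a 1-nearest-neighbour classifier over logged (value set, shard id) pairs to illustrate the fit/predict behavior of a predictor function. The class name and all data are illustrative assumptions.

```python
class NearestShardPredictor:
    """Toy stand-in for the trained classification model: predicts the
    shard id of the closest historical value set."""

    def __init__(self):
        self._examples = []  # list of (value_tuple, shard_id)

    def fit(self, value_sets, shard_ids):
        # Historical value sets are the inputs; shard ids are the labels.
        self._examples = list(zip(value_sets, shard_ids))

    def predict(self, value_set):
        # Squared Euclidean distance to each logged value set.
        def distance(example):
            values, _ = example
            return sum((a - b) ** 2 for a, b in zip(values, value_set))
        return min(self._examples, key=distance)[1]
```

For example, after fitting on logged invocations whose west-coast ZIP values landed in one shard and east-coast values in another, a new value set near the west-coast cluster is mapped to the west shard.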
  • the monitoring node 110 may send the predictor function to the load balancing node 108, so that the load balancing node 108 may employ the predictor function for mapping the stored procedure with any new input value set for the one or more input parameters to a corresponding data shard (e.g., an identifier of the data shard) that may likely store data to be accessed or updated by at least one query of the stored procedure.
  • the monitoring node 110 may continue to receive, from the plurality of computing nodes 104 and the load balancing node 108 on a periodic basis, new data associated with new value sets for the one or more input parameters of each stored procedure, and identifiers of data shards that separately include corresponding pieces of data accessed by each stored procedure when the one or more input parameters take the new value sets.
  • the monitoring node 110 may select a stored procedure and retrain a predictor function for the stored procedure based on the new data and previously stored data according to the operations described above in blocks 502-510.
  • a predictor function may be trained for and associated with a set of stored procedures.
  • an input (i.e., input parameters) to the predictor function may include a combination of input parameter candidates that are obtained from each stored procedure of the set of stored procedures as described in block 504 above, and an output of the predictor function may include an identifier of a data shard that may likely store or include data to be queried or updated by a respective stored procedure.
  • the monitoring node 110 may automatically group stored procedures having similar input parameter candidates as a set, or group stored procedures as a set according to a request of a user.
  • the load balancing node 108 may receive a distributed database transaction including at least one query procedure.
  • the database system 102 or the load balancing node 108 of the database system 102 may receive a distributed database transaction that includes at least one query procedure from the client device 112.
  • the at least one query procedure may include one or more queries that access or manipulate data of one or more data shards of one or more database tables located in one or more computing nodes (such as computing nodes 104) of the database system 102.
  • the load balancing node 108 may determine whether the at least one query procedure is a stored procedure.
  • a query procedure may be a procedure that is registered and stored in the database system 102 in advance (i.e., a stored procedure) , or a procedure that is not registered in the database system 102 and is written in a certain query language (such as a SQL language) by the user 114 according to his/her need.
  • in a case where the query procedure is a stored procedure, values of one or more input parameters of the stored procedure may be further provided by the user 114 when the stored procedure is sent to the database system 102.
  • the one or more input parameters may include one or more potential partition key values associated with one or more queries included in the stored procedure.
  • a plurality of stored procedures may be registered and stored in the database system 102, and may be invoked by the user 114.
  • each stored procedure may be assigned with a unique identifier that can uniquely represent the respective stored procedure.
  • a unique identifier of a stored procedure may include but is not limited to, a unique name of the stored procedure, a unique index of the stored procedure, etc.
  • a stored procedure may include one or more input parameters or arguments, through which the user 114 may provide input values to the stored procedure.
  • the stored procedure may be associated with a unique name, i.e., payment_proc, that can uniquely identify the stored procedure, and distinguish this stored procedure from other stored procedures in the database system 102.
  • the example stored procedure may further include an argument list that includes a number of input parameters (i.e., in_w_id integer, in_w_zip integer, in_c_id integer, in_c_zip integer, in_payment_amount decimal) that are provided or inputted by the user 114.
  • the load balancing node 108 may determine whether the query procedure is a stored procedure by determining whether the query procedure received by the load balancing node 108 is associated with an identifier (e.g., a name, etc. ) and the identifier is registered or found in a data structure (e.g., a table or a list, etc. ) which stores identifiers of stored procedures registered in the database system 102.
  • the load balancing node 108 may determine whether the received query procedure is associated with a name (e.g., payment_proc) , and if so, further determine whether the name is registered or found in the data structure storing the identifiers of the stored procedures that are registered in the database system 102.
  • the load balancing node 108 may determine that the received query procedure is a stored procedure. Alternatively, if the received query procedure is not associated with any name or a name associated with the received query procedure is not registered or found in the data structure storing the identifiers of the stored procedures that are registered in the database system 102, the load balancing node 108 may determine that the received query procedure is not a stored procedure.
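The name-based check described above might look like the following sketch. The registry contents and the dictionary representation of a received query procedure are assumptions for illustration.

```python
# Hypothetical registry of stored-procedure identifiers registered
# in the database system.
REGISTERED_PROCEDURES = {"payment_proc", "new_order_proc"}

def is_stored_procedure(query_procedure):
    """A query procedure is a stored procedure only if it carries a name
    and that name is found among the registered identifiers."""
    name = query_procedure.get("name")  # None for ad-hoc SQL text
    return name is not None and name in REGISTERED_PROCEDURES
```

An ad-hoc SQL statement, or a named procedure absent from the registry, falls through to the ordinary load balancing path.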
  • the load balancing node 108 may employ a predetermined load balancing strategy to assign the distributed database transaction to a computing node in response to determining that the received query procedure is not a stored procedure.
  • the load balancing node 108 may employ a predetermined load balancing strategy to assign the distributed database transaction to a computing node of the plurality of computing nodes 104 included in the database system 102.
  • the predetermined load balancing strategy may include randomly assigning the distributed database transaction to a computing node, assigning the distributed database transaction to a computing node in a round-robin manner, assigning the distributed database transaction to a computing node that currently has the least workload, assigning the distributed database transaction to a computing node based on an IP address of the client device from which the distributed database transaction is received, etc.
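Two of the fallback strategies named above, round-robin and least-workload assignment, can be sketched as follows. Node identifiers and the workload map are illustrative assumptions.

```python
import itertools

NODES = ["node-1", "node-2", "node-3"]  # hypothetical computing nodes
_round_robin = itertools.cycle(NODES)

def assign_round_robin():
    # Hand out nodes in a fixed rotation.
    return next(_round_robin)

def assign_least_loaded(current_workloads):
    # current_workloads: {node_id: number of in-flight transactions}
    return min(NODES, key=lambda n: current_workloads.get(n, 0))
```

These strategies ignore data placement entirely, which is exactly why a non-stored procedure routed this way may land on a node that does not host the data it touches.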
  • the load balancing node 108 may determine or select a predictor function for the stored procedure from among a plurality of predictor functions.
  • the load balancing node 108 may determine or select a predictor function for the stored procedure.
  • the database system 102 may include or store a plurality of predictor functions, and each predictor function may be configured to perform predictions or mappings for one or more stored procedures.
  • the load balancing node 108 may determine or select a predictor function for the stored procedure based at least in part on the identifier (e.g., the name or index) of the stored procedure.
  • the load balancing node 108 or the database system 102 may include or store a data structure (such as a table, etc. ) that includes a mapping relationship between the plurality of stored procedures and the plurality of predictor functions, such as a mapping relationship between identifiers of the plurality of stored procedures and identifiers of the plurality of predictor functions, or a mapping relationship between the identifiers of the plurality of stored procedures and locations storing the plurality of predictor functions, etc.
  • the load balancing node 108 may determine or select a predictor function for the stored procedure based at least in part on the identifier of the stored procedure from the data structure that includes a mapping relationship between the plurality of stored procedures and the plurality of predictor functions.
  • the load balancing node 108 may determine or obtain the identifier of the stored procedure, obtain an identifier of the predictor function based on the identifier of the stored procedure and a mapping relationship, and select a predictor function from among the plurality of predictor functions based on the identifier of the predictor function.
  • each predictor function may include a classification model that is trained based on a plurality of partition key values and corresponding identifiers of a plurality of shards including respective pieces of data corresponding to the plurality of partition key values in a database table, and the plurality of shards may be separately stored in a plurality of computing nodes.
  • the classification model may include a neural network model.
  • the load balancing node 108 may extract or determine input values for one or more input parameters of the stored procedure.
  • the stored procedure may include an argument list through which the user 114 can enter values for one or more input parameters of the stored procedure.
  • the load balancing node 108 may extract or determine the input values for one or more input parameters of the stored procedure from the argument list.
  • the load balancing node 108 may extract or determine the input values for one or more input parameters of the stored procedure using a static program analysis.
  • the load balancing node 108 may determine a computing node to which the stored procedure is sent or assigned based at least in part on the predictor function and the input values for one or more input parameters of the stored procedure.
  • the load balancing node 108 may determine a computing node to which the stored procedure is sent or assigned. In implementations, the load balancing node 108 may apply the predictor function on the input values of the one or more input parameters of the stored procedure to produce an output of the predictor function.
  • the output of the predictor function may be a label that is related to a location of data potentially or likely accessed or queried by at least one query of the stored procedure.
  • the label may include an identifier of a data shard of a database table that includes or stores data potentially or likely accessed or queried by at least one query of the stored procedure.
  • the load balancing node 108 may then employ a placement function to map the identifier of the data shard to an identifier of a computing node that includes or stores the data shard.
  • each predictor function of the plurality of predictor functions may be combined with the placement function to form a respective combined predictor-placement function, and assigned the same identifier as the respective predictor function.
  • the load balancing node 108 may directly employ the combined predictor-placement function to obtain the identifier of the computing node to which the stored procedure is assigned.
  • the load balancing node may send the stored procedure to the computing node.
  • the load balancing node may send the stored procedure to the computing node, to allow the computing node to process the stored procedure, and/or act as a controlling node to coordinate and manage processing of the stored procedure.
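The routing path described above, predictor function to data shard, then placement function to computing node, can be sketched as a simple composition. All function shapes and identifiers here are assumptions for illustration, not the patent's implementation.

```python
def route_stored_procedure(procedure_name, input_values,
                           predictor_functions, placement_function):
    """Map a stored-procedure invocation to the computing node that
    likely hosts the data it will touch."""
    predictor = predictor_functions[procedure_name]  # select by identifier
    shard_id = predictor(input_values)               # predicted data shard
    node_id = placement_function(shard_id)           # shard -> hosting node
    return node_id  # the real system would send the procedure to this node

# The combined predictor-placement function mentioned above is simply the
# composition of the two:
def combine(predictor, placement_function):
    return lambda input_values: placement_function(predictor(input_values))
```

A miss only occurs when the predictor is wrong, in which case the receiving node falls back to coordinating a distributed execution, so a mispredicted route degrades performance rather than correctness.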
  • Clause 1 A method implemented by a load balancing node, the method comprising: receiving a distributed transaction including a stored procedure; selecting a predictor function for the stored procedure from among a plurality of predictor functions; extracting one or more input parameters from the stored procedure; determining a computing node based at least in part on the one or more input parameters using the predictor function; and forwarding the stored procedure to the computing node for processing the stored procedure.
  • Clause 2 The method of Clause 1, wherein the one or more input parameters comprise one or more partition key values associated with one or more queries included in the stored procedure.
  • Clause 3 The method of Clause 1, wherein selecting the predictor function for the stored procedure from among the plurality of predictor functions comprises: determining an identifier of the stored procedure; obtaining an identifier of the predictor function based on the identifier of the stored procedure and a mapping relationship; and selecting the predictor function from among the plurality of predictor functions based on the identifier of the predictor function.
  • Clause 4 The method of Clause 1, wherein extracting the one or more input parameters from the stored procedure comprises extracting the one or more input parameters from the stored procedure from an argument list of the stored procedure or using a static program analysis.
  • Clause 5 The method of Clause 1, wherein the stored procedure comprises a procedure that is registered and stored in a database system in advance before being invoked.
  • Clause 6 The method of Clause 1, wherein determining the computing node based at least in part on the one or more input parameters using the predictor function comprises: determining an identifier of a data shard associated with at least one query included in the stored procedure based on the one or more input parameters using the predictor function; and determining an identifier of the computing node based on the identifier of the data shard using a placement function, the computing node storing the data shard.
  • Clause 7 The method of Clause 1, wherein the predictor function comprises a classification model that is trained based on a plurality of partition key values and corresponding identifiers of a plurality of shards including respective pieces of data corresponding to the plurality of partition key values in a database table, the plurality of shards being separately stored in a plurality of computing nodes.
  • Clause 8 The method of Clause 7, wherein the classification model comprises a neural network model.
  • Clause 9 One or more computer readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising: performing a static program analysis on a stored procedure to determine one or more input parameters; obtaining stored data including a plurality of value sets for the one or more input parameters of the stored procedure and identifiers of a plurality of data shards; training a classification model using the plurality of value sets as inputs and the corresponding identifiers of the plurality of data shards as outputs; and setting the trained classification model as a predictor function used for mapping a new value set for the one or more input parameters of the stored procedure to a corresponding data shard.
  • Clause 10 The one or more computer readable media of Clause 9, wherein a respective data shard of the plurality of data shards comprises a corresponding piece of data that is accessed by the stored procedure when the one or more input parameters take a respective value set of the plurality of value sets.
  • Clause 11 The one or more computer readable media of Clause 9, wherein performing the static program analysis on the stored procedure to determine the one or more input parameters comprises: identifying a variable that represents a partition key in at least one query included in the stored procedure; and determining the one or more input parameters for the variable using a graph traversal algorithm.
  • Clause 12 The one or more computer readable media of Clause 9, wherein the classification model comprises a neural network model.
  • Clause 13 The one or more computer readable media of Clause 9, further comprising sending the classification model to a load balancing node.
  • Clause 14 The one or more computer readable media of Clause 9, further comprising: receiving, from a plurality of computing nodes, new data associated with new value sets for the one or more input parameters of the stored procedure, and identifiers of data shards that separately include corresponding pieces of data accessed by the stored procedure when the one or more input parameters take the new value sets.
  • Clause 15 The one or more computer readable media of Clause 14, further comprising retraining the classification model based at least in part on the stored data and the new data.
  • Clause 16 A system comprising: one or more processors; and memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising: performing a static program analysis on a stored procedure to determine one or more input parameters; obtaining stored data including a plurality of value sets for the one or more input parameters of the stored procedure and identifiers of a plurality of data shards; training a classification model using the plurality of value sets as inputs and the corresponding identifiers of the plurality of data shards as outputs; and setting the trained classification model as a predictor function used for mapping a new value set for the one or more input parameters of the stored procedure to a corresponding data shard.
  • Clause 17 The system of Clause 16, wherein a respective data shard of the plurality of data shards comprises a corresponding piece of data that is accessed by the stored procedure when the one or more input parameters take a respective value set of the plurality of value sets.
  • Clause 18 The system of Clause 16, wherein performing the static program analysis on the stored procedure to determine the one or more input parameters comprises: identifying a variable that represents a partition key in at least one query included in the stored procedure; and determining the one or more input parameters for the variable using a graph traversal algorithm.
  • Clause 19 The system of Clause 16, wherein the classification model comprises a neural network model.
  • Clause 20 The system of Clause 16, wherein the acts further comprise: receiving, from a plurality of computing nodes, new data associated with new value sets for the one or more input parameters of the stored procedure, and identifiers of data shards that separately include corresponding pieces of data accessed by the stored procedure when the one or more input parameters take the new value sets; and retraining the classification model based at least in part on the stored data and the new data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A database system may include at least one load balancing node and a plurality of computing nodes. The load balancing node may receive a distributed transaction that includes a stored procedure, and select a predictor function for the stored procedure from among a plurality of predictor functions. The load balancing node may extract one or more input parameters from the stored procedure, and determine a computing node from among the plurality of computing nodes based at least in part on the one or more input parameters using the predictor function. The load balancing node may then forward the stored procedure to the computing node for processing the stored procedure.

Description

SMART PROCEDURE ROUTING IN PARTITIONED DATABASE MANAGEMENT SYSTEMS BACKGROUND
With the continuously increasing volume of data that is stored in every walk of life, such as company data, shopping data, personal data, etc., partitioned and distributed database management systems (PDBMS) running on the cloud or a network of servers have become increasingly popular. In a PDBMS, data tables are partitioned horizontally into multiple parts, commonly known as data shards. Each data shard is a collection of rows, and is independently hosted and replicated. Shards can be moved, split, or merged to improve performance and elasticity.
Transactions that involve data hosted on a single node (i.e., a database server) are called local transactions. These transactions are essentially no different from transactions in traditional monolithic database management systems. On the other hand, transactions involving data on multiple nodes, i.e., global transactions, need to go through a complex process called 2 phase commit (2PC) when a commit operation is performed. As a result, global transactions are significantly slower than local transactions.
To achieve better performance, database administrators (DBAs) often specify data partition and placement policies, in such a way so that most transactions can be performed locally. For instance, a database of an e‐commerce application may include warehouse and customer tables, etc. If most orders  submitted to the e‐commerce application can be fulfilled by local warehouses, a database administrator may choose to partition the tables based on geographic locations, so that rows representing warehouses and customers in the same location are in the same shard of their respective tables. The database administrator may also specify data shards of the customer table and data shards of the warehouse table of the same location to be hosted on the same node. In this way, most orders can be served by executing local transactions to achieve better performance.
This effort of database partitioning can achieve the best performance if a PDBMS can bring computations to data, i.e., execute a transaction directly on a node that hosts or stores the data. However, it turns out to be difficult to do so for the following reasons: 1) a transaction is best executed by a single node, and aborting and restarting a transaction on another node is feasible, but often costly; 2) it is difficult to predict a node where relevant data is located before the transaction is started, especially when most transactions involve multiple queries. As a result, a node executing the transaction often does not host the data relevant to the transaction, and has to forward query requests to the actual hosting nodes, thus incurring high communication costs. Worse still, since the transaction has been assigned to the node, the node would need to act as a control node to coordinate execution or processing of the multiple queries included in the transaction by other nodes, send a commit instruction to these other nodes to complete a 2 phase commit, collect query results associated with execution or processing of the multiple queries from the other nodes, and send the query results to a client device of a database administrator for presentation or review, thus further incurring communication costs and time due to transmission of query results and instructions between the control node and the other nodes.
SUMMARY
This summary introduces simplified concepts of smart procedure routing in partitioned database management systems, which will be further described below in the Detailed Description. This summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in limiting the scope of the claimed subject matter.
This disclosure describes example implementations of smart procedure routing in partitioned database management systems. In implementations, one or more computing devices may perform a static program analysis on a stored procedure to determine one or more input parameters. The one or more computing devices may further obtain stored data including a plurality of value sets for the one or more input parameters of the stored procedure and identifiers of a plurality of data shards. In implementations, the one or more computing devices may then train a classification model using the plurality of value sets as inputs and the corresponding identifiers of the plurality of data shards as outputs, and set the trained classification model as a predictor function used for mapping a new value set for the one or more input parameters of the stored procedure to a corresponding data shard.
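The training and prediction steps above can be sketched in a few lines. The snippet below is an illustrative stand-in, not the patent's implementation: a toy one-nearest-neighbour classifier plays the role of the trained classification model, and all names, input data, and shard identifiers are invented.

```python
# Toy sketch of training a predictor function: historical (input-parameter
# values -> shard id) pairs are memorised, and a new value set is classified
# by its nearest stored value set. A real system would use a decision-tree
# or neural-network classifier instead of this stand-in.

def train_predictor(value_sets, shard_ids):
    samples = list(zip(value_sets, shard_ids))

    def predict(new_values):
        def distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        # Return the shard id of the closest historical value set.
        return min(samples, key=lambda s: distance(s[0], new_values))[1]

    return predict

# Hypothetical historical data: (zip code, warehouse id) input values and
# the data shard that actually served each invocation.
value_sets = [(94110, 7), (94112, 9), (10001, 3), (10003, 5)]
shard_ids = ["shard-west", "shard-west", "shard-east", "shard-east"]

predictor = train_predictor(value_sets, shard_ids)
print(predictor((94111, 8)))  # -> shard-west
```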
BRIEF DESCRIPTION OF THE DRAWINGS
The detailed description is set forth with reference to the accompanying figures. In the figures, the left‐most digit (s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
FIG. 1 illustrates an example environment in which a database system may be used.
FIG. 2 illustrates an example computing node in more detail.
FIG. 3 illustrates an example load balancing node in more detail.
FIG. 4 illustrates an example monitoring node in more detail.
FIG. 5 illustrates an example method of determining a predictor function.
FIG. 6 illustrates an example method of processing a distributed database transaction.
DETAILED DESCRIPTION
Overview
As noted above, existing partitioned and distributed database systems adopt load balancing strategies that fail to deterministically or strategically assign a distributed database transaction to a particular node that includes one or more data tables involved in the database transaction. A node that is assigned the database transaction may therefore need to act as a control or coordinate node, coordinating other nodes that include data tables involved in the distributed database transaction to process the distributed database transaction and to commit it synchronously. This not only increases communication costs and time due to data and instructions transmitted among the nodes, but also wastes processing resources of the control or coordinate node that are not otherwise needed for the distributed database transaction.
This disclosure describes an example database system. In implementations, the database system may include a load balancing node, a plurality of computing nodes, and a monitoring node. In implementations, the database system may be associated with a plurality of procedures that are registered and stored in a distributed database associated with the database system in advance. Each stored procedure may include one or more database queries accessing data shards of one or more database tables that are located or distributed in one or more computing nodes of the database system. In implementations, a stored procedure may include one or more input parameters that may act as placeholders to receive input values from a user for performing a database transaction. In implementations, the database system may assign a unique identifier to each stored procedure of the plurality of stored procedures.
In implementations, the database system may include or store one or more predictor functions. In implementations, a predictor function of the one or more predictor functions may be used for one or more stored procedures of the plurality of the stored procedures. In implementations, each predictor function may be labeled with a respective identifier, and the database system or the load balancing node may include a first data structure (such as a table or a mapping) that associates  or stores each predictor function of the one or more predictor functions with a respective identifier. Additionally, the database system may further include a second data structure (such as a table or a mapping) that associates or stores identifiers of one or more stored procedures with an identifier of a corresponding predictor function.
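The two data structures described above can be pictured as a pair of plain dictionaries. The sketch below is hypothetical: the identifiers and the one-line predictor body are invented for illustration.

```python
# First data structure: predictor-function id -> predictor function.
predictors = {
    "pred-1": lambda params: "shard-%d" % (params[0] % 2),
}

# Second data structure: stored-procedure id -> predictor-function id.
procedure_to_predictor = {
    "proc-new-order": "pred-1",
    "proc-payment": "pred-1",  # one predictor may serve several procedures
}

def predictor_for(procedure_id):
    return predictors[procedure_to_predictor[procedure_id]]

print(predictor_for("proc-new-order")((42,)))  # -> shard-0
```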
In implementations, a predictor function may be configured to predict or attempt to map a stored procedure having specific values for input parameters of the stored procedure to an identifier of a data shard that may include data corresponding to at least one query of the stored procedure. Alternatively, the predictor function may be configured to predict or attempt to map a stored procedure having specific values for input parameters of the stored procedure to an identifier of a computing node including a data shard that may include data corresponding to at least one query of the stored procedure.
In implementations, the database system may receive a stored procedure with input values for input parameters of the stored procedure from a client device. The database system may select or determine a predictor function for the stored procedure from among the one or more predictor functions based on a second data structure relating the plurality of stored procedures to the one or more predictor functions. In implementations, the database system may determine or extract input values for one or more input parameters of the stored procedure through a static program analysis, and employ the selected or determined predictor function (and also a mapping function if the predictor function outputs an identifier of a data shard, rather than an identifier of a computing node) to determine a computing node to which the stored procedure is forwarded or assigned based on the extracted input values.
In implementations, the one or more predictor functions may include one or more classification models that are obtained by the database system through training based on historical data that is collected by the plurality of computing nodes. In implementations, historical data for a predictor function may include, but is not limited to, a plurality of value sets for respective one or more input parameters of one or more stored procedures associated with the predictor function, identifiers of data shards or computing nodes, etc. In implementations, the one or more classification models may include, but are not limited to, a neural network model, a deep learning model, a decision tree model, etc. In implementations, the database system may update or retrain the one or more predictor functions on a periodic basis or when a prediction error reaches a predetermined error threshold.
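One way the retraining trigger might look, sketched under the assumption that the system counts routing misses and compares the miss rate against the predetermined error threshold; the class name, counters, and threshold value are all illustrative.

```python
# Hypothetical retraining trigger: retrain once the observed miss rate
# (predictions that routed a stored procedure to the wrong shard) reaches
# the predetermined error threshold.

class RetrainMonitor:
    def __init__(self, error_threshold=0.2):
        self.error_threshold = error_threshold
        self.total = 0
        self.misses = 0

    def record(self, predicted_shard, actual_shard):
        self.total += 1
        if predicted_shard != actual_shard:
            self.misses += 1

    def needs_retraining(self):
        return self.total > 0 and self.misses / self.total >= self.error_threshold

monitor = RetrainMonitor(error_threshold=0.5)
monitor.record("shard-1", "shard-1")  # correct prediction
monitor.record("shard-1", "shard-2")  # miss
print(monitor.needs_retraining())  # -> True
```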
As described above, the example database system may receive a stored procedure with input values for input parameters, and predict or determine a computing node that may likely store a data shard of a data table involved in at least one query of the stored procedure based on a predictor function, thereby potentially avoiding a situation of sending the stored procedure to a computing node that does not store any data shard queried or accessed by the procedure, and thereby reducing resource wastes, communication costs, and processing delays due to the additional need of interactions between the computing node and other computing nodes that store data shards queried or accessed by the stored procedure.
In implementations, functions described herein to be performed by  the database system may be performed by multiple separate units or services. Moreover, although in the examples described herein, the database system may be implemented as a combination of software and hardware implemented and distributed in multiple devices, in other examples, the database system may be implemented and distributed as services provided in one or more computing devices over a network and/or in a cloud computing architecture.
The application describes multiple and varied embodiments and implementations. The following section describes an example framework that is suitable for practicing various implementations. Next, the application describes example systems, devices, and processes for implementing a database system.
Example Environment
FIG. 1 illustrates an example environment 100 usable to implement a database system. The environment 100 may include a database system 102. In implementations, the database system 102 may include a plurality of servers or computing nodes 104‐1, 104‐2, …, 104‐N (which are collectively called as computing nodes 104) . The computing nodes 104 may communicate data with one another via a network 106. In implementations, the database system 102 may further include at least one load balancing node 108 for allocating workload to the server or computing nodes 104. In implementations, one or more servers or computing nodes 104 may be designated or used as the at least one load balancing node 108. In implementations, the database system 102 may further include a monitoring node 110.
In implementations, each of the servers or computing nodes 104 may  be implemented as any of a variety of computing devices, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc. ) , a server computer, etc., or a combination thereof.
The network 106 may be a wireless or a wired network, or a combination thereof. The network 106 may be a collection of individual networks interconnected with each other and functioning as a single large network (e.g., the Internet or an intranet) . Examples of such individual networks include, but are not limited to, telephone networks, cable networks, Local Area Networks (LANs) , Wide Area Networks (WANs) , and Metropolitan Area Networks (MANs) . Further, the individual networks may be wireless or wired networks, or a combination thereof. Wired networks may include an electrical carrier connection (such a communication cable, etc. ) and/or an optical carrier or connection (such as an optical fiber connection, etc. ) . Wireless networks may include, for example, a WiFi network, other radio frequency networks (e.g., Zigbee, etc. ) , etc.
In implementations, the environment 100 may further include a client device 112. The client device 112 may be implemented as any of a variety of computing devices, but not limited to, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet or slate computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc. ) , a server computer, etc., or a combination thereof.
In implementations, the database system 102 may receive a request  for processing a distributed database transaction from the client device 112. For example, a user 114 (such as a database administrator, etc. ) of the client device 112 may submit a plurality of queries involving one or more data tables as a distributed database transaction to the database system 102. In response to receiving the request, the database system 102 may determine at least one computing node including a data section of a data table involved in the distributed database transaction based on a routing directive, send the distributed database transaction to the at least one computing node for processing, and return a result of processing the distributed database transaction to the client device 112.
Example Partitioned Database
In implementations, the database system 102 may further include one or more databases that are partitioned or distributed among the plurality of computing nodes 104. By way of example and not limitation, a partitioned or distributed database may include one or more data tables that are divided or partitioned, with each data table being divided horizontally into multiple parts called data shards or simply shards. In implementations, each shard may include a collection of rows, and may be independently hosted and replicated in one or more computing nodes 104. Furthermore, shards can be moved, split, or merged to improve performance and elasticity of the database system 102.
In implementations, transactions that involve data hosted on a single computing node 104 or server (such as a database server) may be called local transactions. They are essentially no different from the transactions in the traditional  monolithic DBMSs. On the other hand, transactions involving data on multiple computing nodes, i.e. global transactions, need to go through a complex process called 2 phase commit (2PC) when committing. As a result, processing of global transactions is significantly slower than processing of local transactions.
In implementations, a data table may be partitioned horizontally based on a value of a column, or a combination of multiple columns. In the latter case, values from each of the columns may form a tuple, and these columns may be treated as a single logical column for the purpose of partitioning. Without loss of generality and for the sake of simplicity, tables are assumed to be partitioned based on a partition column.
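Treating multiple partition columns as a single logical column, as described above, amounts to forming a tuple of their values; a minimal sketch (the column names are invented):

```python
# A multi-column partition key is simply the tuple of the columns' values,
# so the tuple can be used anywhere a single partition key is expected.

def partition_key(row, columns):
    return tuple(row[c] for c in columns)

row = {"city": "Hangzhou", "zip": "310000", "name": "Ann"}
print(partition_key(row, ["city", "zip"]))  # -> ('Hangzhou', '310000')
```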
In implementations, rows having a same value in a partition column may be placed in a same data section or shard. For example, an e‐commerce application may include a database having a customer table, which includes columns such as street, city, state, zip code. These columns may form an address of a customer. A user 114 (such as a database administrator) may partition the customer table using a column associated with the zip code alone, so that all rows having a same zip code in the customer table are located in a same data section or shard.
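Partitioning the customer table on the zip-code column alone can be illustrated as follows; the rows and zip codes are invented examples:

```python
# Rows that share a value in the partition column (zip) land in the same
# data section or shard.
from collections import defaultdict

customers = [
    {"name": "Ann", "zip": "94110"},
    {"name": "Bob", "zip": "10001"},
    {"name": "Cho", "zip": "94110"},
]

shards = defaultdict(list)
for row in customers:
    shards[row["zip"]].append(row)

print(len(shards["94110"]), len(shards["10001"]))  # -> 2 1
```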
In implementations, the database system 102 may further include or provide a partition function for each data table, part_table (k) → shard_id, where k is called a partition key, which is a value of a partition column. In implementations, an output of this function may be an identifier (i.e., ID) of a shard including all the rows having a value of a partition column equal to the partition key (i.e., k) .
In implementations, the database system 102 may allow the user 114 to specify one or more partition and placement policies for data stored in the partitioned databases. Continuing the above example of the e‐commerce application, a warehouse table may further be included in addition to the customer table. As most orders for this e‐commerce application are probably fulfilled from local warehouses, the user 114 may decide to partition a database based on geographic locations. In this case, zip codes of locations of warehouses and zip codes of customer addresses may be used as corresponding partition columns for the warehouse table and the customer table respectively. Furthermore, the user 114 may instruct the database system 102 to allow rows that represent warehouses and customers in the same zip code to be hosted or stored in the same node, so that most order processing may incur local transactions. In other words, both functions part_warehouse (k) and part_customer (k) take a zip code as a partition key, and
place (part_warehouse (k) ) = place (part_customer (k) ) ,
wherein place (·) is a placement function.
In implementations, the partition function and the placement function may be implemented by querying metadata that includes a mapping from partition keys to shard IDs, and a mapping from the shard IDs to computing nodes where corresponding shards are hosted or stored. This metadata may be replicated to some or all of the computing nodes 104 and/or the load balancing node 108 in the database system 102.
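A minimal sketch of the metadata-backed partition and placement functions, under the assumption that both reduce to simple key-value lookups; the keys, shard IDs, and node names are invented:

```python
# part_table(k) -> shard_id via a (partition key -> shard id) map, and
# place(shard_id) -> node via a (shard id -> hosting node) map.

KEY_TO_SHARD = {"94110": 1, "94112": 1, "10001": 2}
SHARD_TO_NODE = {1: "node-A", 2: "node-B"}

def part_table(k):
    return KEY_TO_SHARD[k]

def place(shard_id):
    return SHARD_TO_NODE[shard_id]

print(place(part_table("94110")))  # -> node-A
```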
In some implementations, the partition function may not exist or may be difficult to construct due to a potentially enormous amount of data being involved and needed to be stored beforehand. In implementations, the database system 102 may include or provide a predictor function for a stored procedure, predict_proc (parameters) → shard_id, where parameters represent values that are inputted by a user for input parameters of the stored procedure. In implementations, an output of this function may be shard_id, i.e., an identifier (i.e., ID) of a shard that likely includes data rows to be queried or accessed by at least one query of the stored procedure. After obtaining the identifier of the shard that likely includes data rows to be queried or accessed by at least one query of the stored procedure, the database system 102 may employ the placement function as described above to obtain an identifier of a computing node to which the stored procedure can be assigned or sent.
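Putting the two steps above together, a hedged end-to-end sketch: predict_proc guesses a shard ID from the stored procedure's input values, and the placement function maps that shard to a hosting node. The predictor body here is a hand-written stand-in for a trained model, and every identifier is illustrative.

```python
# Stand-in predictor: route on the leading digits of the zip-code input
# parameter (a trained classifier would replace this hand-written rule).
def predict_proc(parameters):
    return "shard-west" if parameters["zip"].startswith("94") else "shard-east"

PLACEMENT = {"shard-west": "node-1", "shard-east": "node-2"}

def place(shard_id):
    return PLACEMENT[shard_id]

# The load balancing node would forward the stored procedure to this node.
print(place(predict_proc({"zip": "94110"})))  # -> node-1
```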
Example Computing Node
FIG. 2 illustrates the computing node 104 in more detail. In implementations, the computing node 104 may include, but is not limited to, one or more processors 202, an input/output (I/O) interface 204, and/or a network interface 206, and memory 208. In implementations, some of the functions of the computing node 104 may be implemented using hardware, for example, an ASIC (i.e., Application‐Specific Integrated Circuit) , a FPGA (i.e., Field‐Programmable Gate Array) , and/or other hardware.
In implementations, the processors 202 may be configured to execute instructions that are stored in the memory 208, and/or received from the I/O interface 204, and/or the network interface 206. In implementations, the processors 202 may be implemented as one or more hardware processors including,  for example, a microprocessor, an application‐specific instruction‐set processor, a physics processing unit (PPU) , a central processing unit (CPU) , a graphics processing unit, a digital signal processor, a tensor processing unit, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include field‐programmable gate arrays (FPGAs) , application‐specific integrated circuits (ASICs) , application‐specific standard products (ASSPs) , system‐on‐a‐chip systems (SOCs) , complex programmable logic devices (CPLDs) , etc.
The memory 208 may include computer readable media in a form of volatile memory, such as Random Access Memory (RAM) and/or non‐volatile memory, such as read only memory (ROM) or flash RAM. The memory 208 is an example of computer readable media.
The computer readable media may include a volatile or non‐volatile type, a removable or non‐removable media, which may achieve storage of information using any method or technology. The information may include a computer readable instruction, a data structure, a program module or other data. Examples of computer readable media include, but not limited to, phase‐change memory (PRAM) , static random access memory (SRAM) , dynamic random access memory (DRAM) , other types of random‐access memory (RAM) , read‐only memory (ROM) , electronically erasable programmable read‐only memory (EEPROM) , quick flash memory or other internal storage technology, compact disk read‐only memory (CD‐ROM) , digital versatile disc (DVD) or other optical storage, magnetic cassette  tape, magnetic disk storage or other magnetic storage devices, or any other non‐transmission media, which may be used to store information that may be accessed by a computing device. As defined herein, the computer readable media does not include any transitory media, such as modulated data signals and carrier waves.
Although in this example, only hardware components are described in the computing node 104, in other instances, the computing node 104 may further include other hardware components and/or other software components such as program units to execute instructions stored in the memory 208 for performing various operations. For example, the computing node 104 may further include a local or partitioned database 210 for storing data tables and other program data 212. By way of example and not limitation, the computing node 104 may store data sections or shards of one or more data tables in the local or partitioned database 210. In implementations, the one or more data tables may be divided and distributed according to respective partition keys among different computing nodes 104.
Example Load Balancing Node
FIG. 3 illustrates the load balancing node 108 in more detail. In implementations, the load balancing node 108 may include, but is not limited to, one or more processors 302, an input/output (I/O) interface 304, and/or a network interface 306, and memory 308. In implementations, some of the functions of the load balancing node 108 may be implemented using hardware, for example, an ASIC (i.e., Application‐Specific Integrated Circuit) , a FPGA (i.e., Field‐Programmable Gate Array) , and/or other hardware.
In implementations, the processors 302 may be configured to execute instructions that are stored in the memory 308, and/or received from the I/O interface 304, and/or the network interface 306. In implementations, the processors 302 may be implemented as one or more hardware processors including, for example, a microprocessor, an application‐specific instruction‐set processor, a physics processing unit (PPU) , a central processing unit (CPU) , a graphics processing unit, a digital signal processor, a tensor processing unit, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include field‐programmable gate arrays (FPGAs) , application‐specific integrated circuits (ASICs) , application‐specific standard products (ASSPs) , system‐on‐a‐chip systems (SOCs) , complex programmable logic devices (CPLDs) , etc.
The memory 308 may include computer readable media in a form of volatile memory, such as Random Access Memory (RAM) and/or non‐volatile memory, such as read only memory (ROM) or flash RAM. The memory 308 is an example of computer readable media as described in the foregoing description.
Although in this example, only hardware components are described in the load balancing node 108, in other instances, the load balancing node 108 may further include other hardware components and/or other software components such as program units to execute instructions stored in the memory 308 for performing various operations. For example, the load balancing node 108 may further include a mapping table 310 and other program data 312. In implementations, the mapping table 310 may include a mapping from a combination of information of a data table and information of a partition key to information of a computing node that includes a data section or shard of the data table corresponding to the partition key. By way of example and not limitation, given a name of a data table and a value of a partition key, the load balancing node 108 may determine an identifier or address of a computing node that includes a data section or shard of the data table corresponding to the value of the partition key.
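The mapping table 310 might be shaped roughly as follows; the table names, zip codes, and node addresses are made up for illustration:

```python
# (table name, partition-key value) -> computing-node address.
MAPPING_TABLE = {
    ("customer", "94110"): "10.0.0.1",
    ("warehouse", "94110"): "10.0.0.1",  # co-located per the placement policy
    ("customer", "10001"): "10.0.0.2",
}

def node_for(table, partition_key):
    return MAPPING_TABLE[(table, partition_key)]

print(node_for("customer", "94110"))  # -> 10.0.0.1
```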
In implementations, the load balancing node 108 may obtain this mapping table 310 in advance, for example, by receiving broadcasting information from the plurality of computing nodes 104. In implementations, the broadcasting information of each computing node 104 may include, but is not limited to, information about which data section (s) or shard (s) of data table (s) corresponding to certain values of partition key (s) is/are included or stored in the respective computing node 104.
Additionally or alternatively, in some implementations, the load balancing node 108 may be associated with a mapping device (which may be a server or computing device provided in the database system 102) . The mapping device may collect information about data section (s) or shard (s) of data table (s) that correspond to certain values of partition key (s) is/are included or stored in each computing node 104 from the plurality of computing nodes 104, for example, by broadcasting information from the plurality of computing nodes 104 as described above. In this case, the load balancing node 108 may send information (such as a name) of a data table and information (such as a value) of a partition key obtained from a distributed  database transaction to the mapping device, which maps the information of the data table and the information of the partition key to information (e.g., an identifier or address) of a computing node that includes a data section or shard of the data table corresponding to the partition key. The load balancing node 108 may then receive the information of the computing node from the mapping device, thus reducing the workload and complexity of the load balancing node while enabling the load balancing node to continuously process load balancing of incoming requests or data packets.
In implementations, the load balancing node 108 may include one or more predetermined load balancing strategies. By way of example and not limitation, the one or more predetermined load balancing strategies may include assigning a distributed database transaction received (e.g., from the client device 112) to a computing node in a random manner, assigning the distributed database transaction to a computing node in a round‐robin manner, assigning the distributed database transaction to a computing node that currently has the least workload, assigning the distributed database transaction to a computing node based on a mapping relationship between an IP address of the client device and the computing node, etc.
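Two of the listed fallback strategies, round-robin and least-workload, can be sketched as follows (the random and IP-mapping variants follow the same pattern); node names and workload figures are invented:

```python
# Round-robin: cycle through the nodes; least-workload: pick the node with
# the smallest current load.
import itertools

nodes = ["node-1", "node-2", "node-3"]
rr = itertools.cycle(nodes)

def round_robin():
    return next(rr)

workload = {"node-1": 5, "node-2": 2, "node-3": 7}

def least_loaded():
    return min(workload, key=workload.get)

first, second = round_robin(), round_robin()
print(first, second)   # -> node-1 node-2
print(least_loaded())  # -> node-2
```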
In implementations, the load balancing node 108 may further include an identifier database 314. The identifier database 314 may include or store, for example, unique identifiers (such as unique names, etc. ) of stored procedures that are registered and stored in the database system 102, unique identifiers of predictor functions, etc. In implementations, the load balancing node 108 may further include a relationship database 316. The relationship database 316 may include or store, for  example, relationships between unique identifiers of stored procedures and unique identifiers of predictor functions, etc. In implementations, the load balancing node 108 may further include a predictor database 318. The predictor database 318 may include or store the plurality of predictor functions, which can be queried or invoked through the identifiers of the predictor functions, for example.
Example Monitoring Node
FIG. 4 illustrates the monitoring node 110 in more detail. In implementations, the monitoring node 110 may include, but is not limited to, one or more processors 402, an input/output (I/O) interface 404, and/or a network interface 406, and memory 408. In implementations, some of the functions of the monitoring node 110 may be implemented using hardware, for example, an ASIC (i.e., Application‐Specific Integrated Circuit) , a FPGA (i.e., Field‐Programmable Gate Array) , and/or other hardware.
In implementations, the processors 402 may be configured to execute instructions that are stored in the memory 408, and/or received from the I/O interface 404, and/or the network interface 406. In implementations, the processors 402 may be implemented as one or more hardware processors including, for example, a microprocessor, an application‐specific instruction‐set processor, a physics processing unit (PPU) , a central processing unit (CPU) , a graphics processing unit, a digital signal processor, a tensor processing unit, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include field‐programmable gate arrays (FPGAs) , application‐specific integrated circuits (ASICs) , application‐specific standard products (ASSPs) , system‐on‐a‐chip systems (SOCs) , complex programmable logic devices (CPLDs) , etc.
The memory 408 may include computer readable media in a form of volatile memory, such as Random Access Memory (RAM) and/or non‐volatile memory, such as read only memory (ROM) or flash RAM. The memory 408 is an example of computer readable media as described in the foregoing description.
Although in this example, only hardware components are described in the monitoring node 110, in other instances, the monitoring node 110 may further include other hardware components and/or other software components such as program units to execute instructions stored in the memory 408 for performing various operations. For example, the monitoring node 110 may further include an information database 410 and other program data 412. In implementations, the information database 410 may store data received from the plurality of computing nodes 104 and the load balancing node 108, which may include, but is not limited to, information about respective numbers of times that stored procedures are separately invoked or used by users from the load balancing node 108, identifier (s) of missed procedure (s) and respective number (s) of misses from each computing node 104, respective input and output data sets that include a plurality of value sets for the one or more input parameters of the stored procedure and identifiers of a plurality of data shards for each predictor function of a plurality of predictor functions, etc.
Example Methods
FIG. 5 shows a schematic diagram depicting an example method of determining a predictor function. FIG. 6 shows a schematic diagram depicting an example method of processing a distributed database transaction. The methods of FIG. 5 and FIG. 6 may, but need not, be implemented in the environment of FIG. 1 and using the computing node, the load balancing node, and the monitoring node of FIGS. 2‐4. For ease of explanation,  methods  500 and 600 are described with reference to FIGS. 1‐3. However, the  methods  500 and 600 may alternatively be implemented in other environments and/or using other systems.
The  methods  500 and 600 are described in the general context of computer‐executable instructions. Generally, computer‐executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types. Furthermore, each of the example methods are illustrated as a collection of blocks in a logical flow graph representing a sequence of operations that can be implemented in hardware, software, firmware, or a combination thereof. The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or alternate methods. Additionally, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein. In the context of software, the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations. In the context of hardware, some or all  of the blocks may represent application specific integrated circuits (ASICs) or other physical components that perform the recited operations.
Referring back to FIG. 5, at block 502, the monitoring node 110 may select a stored procedure.
In implementations, the database system 102 may include a plurality of procedures that are stored in the database system 102. These procedures are written in a query language (such as the SQL language), and are registered in the database system 102 for invocation by users. In implementations, a stored procedure may include one or more queries that access or update data of one or more data shards of one or more database tables located in one or more computing nodes of the database system 102. In implementations, a query of a stored procedure may include a partition key and an input parameter that receives an input value from a user using the stored procedure. In implementations, a stored procedure may include an argument list of input parameters that may serve as placeholders to receive input values for the input parameters of the stored procedure from a user. In other implementations, a stored procedure may not include an argument list of input parameters, in which case the input parameters are distributed throughout the body of the stored procedure and need to be identified.
In implementations, the monitoring node 110 may obtain one or more stored procedures that are newly added or registered in the database system 102, and no predictor function (s) has/have existed or been trained for these stored procedures. The monitoring node 110 may select a stored procedure from the one or more stored procedures that are newly added or registered in the database  system 102. In implementations, the monitoring node 110 may select a stored procedure that is newly added or registered after the stored procedure has been used or invoked by users of the database system 102 for a predetermined number of times (e.g., 1,000 times, 10,000 times, etc. ) and/or after a predetermined period of time (e.g., after one week, after two weeks, etc. ) . This allows a sufficient amount of training data to be available for training a predictor function for the stored procedure.
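The eligibility check described above can be sketched as follows. This is a minimal illustration only; the `ProcedureStats` record, the function name, and the specific thresholds are assumptions made for the sketch, not part of the disclosed system.

```python
import time
from dataclasses import dataclass


@dataclass
class ProcedureStats:
    """Hypothetical per-procedure bookkeeping kept by the monitoring node."""
    name: str
    registered_at: float        # epoch seconds at registration time
    invocation_count: int = 0
    has_predictor: bool = False


def ready_for_training(stats, min_invocations=10_000,
                       min_age_seconds=7 * 24 * 3600, now=None):
    """A newly registered procedure becomes eligible for first-time predictor
    training once it has been invoked enough times and/or has existed long
    enough, so that a sufficient amount of training data is available."""
    now = time.time() if now is None else now
    if stats.has_predictor:
        return False  # already has a predictor; covered by retraining instead
    old_enough = (now - stats.registered_at) >= min_age_seconds
    used_enough = stats.invocation_count >= min_invocations
    return old_enough or used_enough
```
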
Additionally or alternatively, the monitoring node 110 may select a stored procedure from among the plurality of stored procedures that are associated with predictor functions for retraining on a periodic basis (such as every week, every two weeks, every month, etc.). In implementations, a frequency of selecting a certain stored procedure may depend on a frequency of the stored procedure being invoked or used by users in the database system 102. By way of example and not limitation, the higher the frequency with which a stored procedure is invoked or used, the higher the frequency with which that stored procedure is selected.
Additionally or alternatively, the monitoring node 110 may monitor the performance of the plurality of stored procedures in the database system 102, and determine whether to update or retrain a stored procedure based on the performance of the stored procedure. In implementations, the performance of a stored procedure may include, but is not limited to, an error rate of a predictor function used for mapping the stored procedure, etc. For example, the load balancing node 108 may count a number of times that each stored procedure is used or invoked by users within a preset time interval (e.g., every week, every two weeks, etc.), and store identifiers of each stored procedure and respective numbers of times of uses or invocations in a data structure (e.g., a table, etc.). Furthermore, each computing node 104 may keep track of occasions on which one or more stored procedures assigned thereto encounter a miss in the respective computing node 104 (i.e., data requested to be accessed or updated by the one or more stored procedures does not exist in the respective computing node 104). Each computing node 104 may store identifier (s) of such stored procedure (s) (also called missed procedure (s) for simplicity) and respective numbers of misses within the preset time interval in a certain data structure (such as a list, etc.).
In implementations, the monitoring node 110 may receive information about respective numbers of times that stored procedures are separately invoked or used by users from the load balancing node 108, and further collect identifier (s) of missed procedure (s) and respective number (s) of misses from each computing node 104. The monitoring node 110 may then determine a respective error rate for each stored procedure (such as determining a percentage of uses or invocations of a stored procedure that result in a miss, for example) . The monitoring node 110 may compare the respective error rate of each stored procedure with a predetermined error threshold, and select a stored procedure having the highest error rate and the error rate being higher than the predetermined error threshold.
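The error-rate-based selection above amounts to computing a miss percentage per procedure and picking the worst offender above a threshold. A minimal sketch, assuming invocation counts from the load balancing node and aggregated miss counts from the computing nodes are available as dictionaries (the function name and threshold value are illustrative):

```python
def select_procedure_for_retraining(invocations, misses, error_threshold=0.05):
    """invocations: {proc_id: number of times invoked}, reported by the
    load balancing node. misses: {proc_id: number of misses}, aggregated
    over all computing nodes. Returns the procedure with the highest miss
    (error) rate that also exceeds the threshold, or None if none does."""
    worst_proc, worst_rate = None, error_threshold
    for proc_id, count in invocations.items():
        if count == 0:
            continue
        rate = misses.get(proc_id, 0) / count
        if rate > worst_rate:
            worst_proc, worst_rate = proc_id, rate
    return worst_proc
```
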
At block 504, the monitoring node 110 may determine one or more input parameters used for training a predictor function associated with the stored procedure.
In implementations, upon selecting the stored procedure, the  monitoring node 110 may determine one or more input parameters that may be used for training a predictor function for the stored procedure. In implementations, the stored procedure may include an argument list including at least one input parameter, and the monitoring node 110 may use the at least one parameter included in the argument list as an input parameter candidate for training the predictor function of the stored procedure.
Additionally or alternatively, the monitoring node 110 may perform a static program analysis to extract one or more input parameter candidates from the stored procedure. Static program analysis, which may also be called compile-time analysis, refers to a class of algorithms that analyze program code before the code is run, in order to predict a runtime behavior of the code. The monitoring node 110 may employ the static program analysis to determine which input parameters decide or affect the data shards on which one or more queries in the stored procedure may operate (e.g., access or update). In implementations, the monitoring node 110 may identify one or more variables that are used as one or more partition keys in the stored procedure. For example, for a SELECT query in a stored procedure, the monitoring node 110 may find distributed tables in FROM clauses, and identify variables that may be used to compare against respective partition columns of these tables. A similar approach may also be used for UPDATE, INSERT, and DELETE queries.
For example, a stored procedure, namely, payment_proc, may be expressed as follows. In this example, the first UPDATE is used for updating a target table, namely, warehouse, and a variable, in_w_zip, is compared against a  corresponding partition column, warehouse_zip.
Figure PCTCN2020112265-appb-000003
In implementations, the monitoring node 110 may compute a transitive closure of all variables that may affect value (s) of partition key (s) in one or more queries of the stored procedure. As mentioned above, the monitoring node 110 may employ a data flow analysis to estimate possible value flows between program points that are reachable according to a control flow graph. To locate all variables whose values may affect a partition key of a query in the stored procedure, the monitoring node 110 may treat the partition key of the query as a starting point, and trace backwards against value flow directions, until all links that are transitively reachable from the starting point are exhausted. In implementations, the monitoring node 110 may perform such operations using a graph traversal algorithm. In implementations, after obtaining the variables whose values may affect the partition key of the query, the monitoring node 110 may set these variables as input parameter candidates. The monitoring node 110 may repeat the above operations for all partition keys of all queries in the stored procedure, and thus obtain an entire set of input parameter candidates that may affect the partition keys of all the queries of the stored procedure. In implementations, the monitoring node 110 may set these input parameter candidates as the one or more input parameters used for training a predictor function associated with the stored procedure.
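The backward traversal described above can be sketched as a breadth-first search over a reversed value-flow graph. This sketch assumes the data flow analysis has already produced, for each variable, the set of variables whose values flow into it; the function and argument names are illustrative.

```python
from collections import deque


def input_parameter_candidates(value_flow, partition_keys, formal_params):
    """value_flow maps each variable to the variables whose values flow INTO
    it (i.e., the reversed edges of the value-flow graph). Starting from each
    partition-key variable, a breadth-first traversal collects every variable
    that can transitively affect it; the formal input parameters among them
    are the candidates used as training features for the predictor."""
    reachable = set()
    queue = deque(partition_keys)
    while queue:
        var = queue.popleft()
        if var in reachable:
            continue  # already visited; also guards against cyclic flows
        reachable.add(var)
        queue.extend(value_flow.get(var, ()))
    return sorted(reachable & set(formal_params))
```

The visited-set check makes the traversal terminate even when the value-flow graph contains cycles (e.g., variables updated in a loop).
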
At block 506, the monitoring node 110 may obtain historical data including a plurality of value sets for the one or more input parameters of the stored procedure and identifiers of a plurality of data shards.
In implementations, the monitoring node 110 may obtain historical data from the plurality of computing nodes 104 and the load balancing node 108. In implementations, the historical data may include, but is not limited to, a plurality of value sets for the one or more input parameters of the stored procedure and identifiers of a plurality of data shards. In implementations, each data shard of the plurality of data shards may include a corresponding piece of data that is accessed by the stored procedure when the one or more input parameters take a respective value set of the plurality of value sets.
At block 508, the monitoring node 110 may train a classification model using the plurality of value sets as inputs and the corresponding identifiers of the plurality of data shards as outputs.
In implementations, the monitoring node 110 may train a classification model to learn or model the behavior of a predictor function for the stored procedure, based on the historical data. In implementations, the monitoring node 110 may employ supervised training to train and obtain the classification model. In implementations, the classification model may include, but is not limited  to, a neural network model, a deep learning model, a decision tree model, etc.
By way of example and not limitation, a neural network model is used as an example of the classification model to learn or model the behavior of the predictor function for the stored procedure. In this example, the one or more input parameters of the stored procedure that are obtained or determined at block 504 may be used as an input (or input features) to the neural network model, and historical values (or value sets) of the one or more input parameters are training values for the input features. In implementations, each stored procedure may include one or more queries, and may perform operations (e.g., accessing, updating, etc.) on one or more data shards of one or more database tables. In this case, the monitoring node 110 may select a most heavily operated data shard from among the one or more data shards of the one or more database tables. In implementations, a most heavily operated data shard for a stored procedure may include, but is not limited to, a data shard with the greatest number of rows that are accessed or affected by queries in the stored procedure, etc. In implementations, the monitoring node 110 may treat the most heavily operated data shard as a corresponding output (or label) of the neural network model. In implementations, such feature-label pairs can be obtained by monitoring executions of stored procedures, logging respective input parameters, corresponding data shards that are accessed, corresponding numbers of rows that are accessed or affected, etc. The monitoring node 110 may then train the neural network model using a conventional training or learning algorithm.
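The labeling step and the supervised training setup can be sketched as follows. Note that a simple nearest-neighbor classifier is substituted here for the neural network named in the text, purely to keep the sketch self-contained; the execution-log format and all names are assumptions of the sketch.

```python
def build_training_set(execution_log):
    """execution_log: iterable of (value_set, {shard_id: rows_touched}) pairs
    collected by monitoring stored procedure executions. The label for each
    value set is the most heavily operated shard, i.e., the shard with the
    greatest number of rows accessed or affected."""
    features, labels = [], []
    for value_set, rows_per_shard in execution_log:
        label = max(rows_per_shard, key=rows_per_shard.get)
        features.append(tuple(value_set))
        labels.append(label)
    return features, labels


class NearestValueSetPredictor:
    """Stand-in for the trained classification model: predicts the shard
    observed for the closest historical value set (1-nearest neighbor on
    numeric input parameters)."""

    def fit(self, features, labels):
        self.samples = list(zip(features, labels))
        return self

    def predict(self, value_set):
        def squared_distance(sample):
            return sum((a - b) ** 2 for a, b in zip(sample[0], value_set))
        return min(self.samples, key=squared_distance)[1]
```
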
At block 510, the monitoring node 110 may set the trained classification model as a predictor function used for mapping a new value set for the  one or more input parameters of the stored procedure to a corresponding data shard.
In implementations, after training of the classification model is completed, the monitoring node 110 may use the trained classification model as a predictor function for the stored procedure. In implementations, the monitoring node 110 may send the predictor function to the load balancing node 108, so that the load balancing node 108 may employ the predictor function for mapping the stored procedure with any new input value set for the one or more input parameters to a corresponding data shard (e.g., an identifier of the data shard) that may likely store data to be accessed or updated by at least one query of the stored procedure.
In implementations, the monitoring node 110 may continue to receive, from the plurality of computing nodes 104 and the load balancing node 108 on a periodic basis, new data associated with new value sets for the one or more input parameters of each stored procedure, and identifiers of data shards that separately include corresponding pieces of data accessed by each stored procedure when the one or more input parameters take the new value sets. The monitoring node 110 may select a stored procedure and retrain a predictor function for the stored procedure based on the new data and previously stored data according to the above operations as described in blocks 502–510.
Although in the above operations, a predictor function is described to be trained for and associated with each stored procedure, in some instances, a predictor function may be trained for and associated with a set of stored procedures. In this case, an input (i.e., input parameters) to the predictor function may include a  combination of input parameter candidates that are obtained from each stored procedure of the set of stored procedures as described in block 504 above, and an output of the predictor function may include an identifier of a data shard that may likely store or include data to be queried or updated by a respective stored procedure. In implementations, the monitoring node 110 may automatically group stored procedures having similar input parameter candidates as a set, or group stored procedures as a set according to a request of a user.
Referring back to FIG. 6, at block 602, the load balancing node 108 may receive a distributed database transaction including at least one query procedure.
In implementations, the database system 102 or the load balancing node 108 of the database system 102 may receive a distributed database transaction that includes at least one query procedure from the client device 112. In implementations, the at least one query procedure may include one or more queries that access or manipulate data of one or more data shards of one or more database tables located in one or more computing nodes (such as computing nodes 104) of the database system 102.
At block 604, the load balancing node 108 may determine whether the at least one query procedure is a stored procedure.
In implementations, a query procedure may be a procedure that is registered and stored in the database system 102 in advance (i.e., a stored procedure), or a procedure that is not registered in the database system 102 and is written in a certain query language (such as the SQL language) by the user 114 according to his/her need. If the query procedure is a stored procedure, values of one or more input parameters of the stored procedure may be further provided by the user 114 when the stored procedure is sent to the database system 102. In implementations, the one or more input parameters may include one or more potential partition key values associated with one or more queries included in the stored procedure.
In implementations, a plurality of stored procedures may be registered and stored in the database system 102, and may be invoked by the user 114. In order to distinguish different stored procedures, each stored procedure may be assigned with a unique identifier that can uniquely represent the respective stored procedure. A unique identifier of a stored procedure may include but is not limited to, a unique name of the stored procedure, a unique index of the stored procedure, etc. In implementations, a stored procedure may include one or more input parameters or arguments, through which the user 114 may provide input values to the stored procedure. For example, in the above example of stored procedure (payment_proc) , the stored procedure may be associated with a unique name, i.e., payment_proc, that can uniquely identify the stored procedure, and distinguish this stored procedure from other stored procedures in the database system 102. The example stored procedure may further include an argument list that includes a number of input parameters (i.e., in_w_id integer, in_w_zip integer, in_c_id integer, in_c_zip integer, in_payment_amount decimal) that are provided or inputted by the user 114.
In implementations, the load balancing node 108 may determine whether the query procedure is a stored procedure by determining whether the  query procedure received by the load balancing node 108 is associated with an identifier (e.g., a name, etc. ) and the identifier is registered or found in a data structure (e.g., a table or a list, etc. ) which stores identifiers of stored procedures registered in the database system 102. Continuing the above example, the load balancing node 108 may determine whether the received query procedure is associated with a name (e.g., payment_proc) , and if so, further determine whether the name is registered or found in the data structure storing the identifiers of the stored procedures that are registered in the database system 102. If affirmative, the load balancing node 108 may determine that the received query procedure is a stored procedure. Alternatively, if the received query procedure is not associated with any name or a name associated with the received query procedure is not registered or found in the data structure storing the identifiers of the stored procedures that are registered in the database system 102, the load balancing node 108 may determine that the received query procedure is not a stored procedure.
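The identifier-lookup test at block 604 can be sketched as a registry check. The `CALL name(...)` invocation syntax, the registry contents, and all names here are assumptions of the sketch, not prescribed by the disclosure:

```python
import re

# Hypothetical registry of stored procedure identifiers known to the system.
REGISTERED_PROCEDURES = {"payment_proc", "new_order_proc"}

CALL_PATTERN = re.compile(r"^\s*CALL\s+(\w+)\s*\(", re.IGNORECASE)


def stored_procedure_name(query_text):
    """Return the registered procedure name if the incoming query procedure
    invokes one; otherwise return None, in which case the transaction is
    treated as an ad-hoc query and handled by the fallback load balancing
    strategy."""
    match = CALL_PATTERN.match(query_text)
    if match and match.group(1) in REGISTERED_PROCEDURES:
        return match.group(1)
    return None
```
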
At block 606, the load balancing node 108 may employ a predetermined load balancing strategy to assign the distributed database transaction to a computing node in response to determining that the received query procedure is not a stored procedure.
In implementations, if determining that the received query procedure is not a stored procedure, the load balancing node 108 may employ a predetermined load balancing strategy to assign the distributed database transaction to a computing node of the plurality of computing nodes 104 included in the database system 102. By way of example and not limitation, the predetermined load balancing strategy may include randomly assigning the distributed database transaction to a computing node, assigning the distributed database transaction to a computing node in a round-robin manner, assigning the distributed database transaction to a computing node that currently has the least workload, assigning the distributed database transaction to a computing node based on an IP address of the client device from which the distributed database transaction is received, etc.
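The fallback strategies listed above can be sketched in a few lines each; the class and method names are illustrative, and the CRC32-based IP hashing is one possible stable hash, not the one the disclosure requires:

```python
import itertools
import zlib


class FallbackBalancer:
    """Sketch of predetermined strategies for transactions that do not
    invoke a stored procedure: round-robin, least-loaded, or client-IP
    hashing."""

    def __init__(self, node_ids):
        self.node_ids = list(node_ids)
        self._round_robin = itertools.cycle(self.node_ids)

    def round_robin(self):
        return next(self._round_robin)

    def least_loaded(self, workloads):
        # workloads: {node_id: current number of in-flight transactions}
        return min(self.node_ids, key=lambda n: workloads.get(n, 0))

    def by_client_ip(self, ip):
        # Stable hash so the same client keeps hitting the same node.
        return self.node_ids[zlib.crc32(ip.encode()) % len(self.node_ids)]
```
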
At block 608, in response to determining that the received query procedure is a stored procedure, the load balancing node 108 may determine or select a predictor function for the stored procedure from among a plurality of predictor functions.
In implementations, after determining that the received query procedure is a stored procedure, the load balancing node 108 may determine or select a predictor function for the stored procedure. In implementations, the database system 102 may include or store a plurality of predictor functions, and each predictor function may be configured to perform predictions or mappings for one or more stored procedures. In implementations, the load balancing node 108 may determine or select a predictor function for the stored procedure based at least in part on the identifier (e.g., the name or index) of the stored procedure.
In implementations, the load balancing node 108 or the database system 102 may include or store a data structure (such as a table, etc. ) that includes a mapping relationship between the plurality of stored procedures and the plurality of predictor functions, such as a mapping relationship between identifiers of the plurality of stored procedures and identifiers of the plurality of predictor functions,  or a mapping relationship between the identifiers of the plurality of stored procedures and locations storing the plurality of predictor functions, etc. The load balancing node 108 may determine or select a predictor function for the stored procedure based at least in part on the identifier of the stored procedure from the data structure that includes a mapping relationship between the plurality of stored procedures and the plurality of predictor functions.
By way of example and not limitation, the load balancing node 108 may determine or obtain the identifier of the stored procedure, obtain an identifier of the predictor function based on the identifier of the stored procedure and a mapping relationship, and select a predictor function from among the plurality of predictor functions based on the identifier of the predictor function.
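This two-step resolution (procedure identifier to predictor identifier, then predictor identifier to predictor function) can be sketched with two lookup tables. The table contents and the toy modulo-based predictor are assumptions of the sketch only:

```python
# Hypothetical mapping tables maintained by the load balancing node
# (cf. the relationship database and predictor database).
PROC_TO_PREDICTOR = {"payment_proc": "pred_7"}
PREDICTORS = {"pred_7": lambda values: "shard_%d" % (values[0] % 4)}


def select_predictor(procedure_name):
    """Resolve procedure identifier -> predictor identifier -> predictor
    function. Returns None when no predictor has been trained for the
    procedure yet."""
    predictor_id = PROC_TO_PREDICTOR.get(procedure_name)
    if predictor_id is None:
        return None
    return PREDICTORS.get(predictor_id)
```
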
In implementations, each predictor function may include a classification model that is trained based on a plurality of partition key values and corresponding identifiers of a plurality of shards including respective pieces of data corresponding to the plurality of partition key values in a database table, and the plurality of shards may be separately stored in a plurality of computing nodes. By way of example and not limitation, the classification model may include a neural network model.
At block 610, the load balancing node 108 may extract or determine input values for one or more input parameters of the stored procedure.
In implementations, the stored procedure may include an argument list through which the user 114 can enter values for one or more input parameters of the stored procedure. The load balancing node 108 may extract or determine the  input values for one or more input parameters of the stored procedure from the argument list. In implementations, the load balancing node 108 may extract or determine the input values for one or more input parameters of the stored procedure using a static program analysis.
At block 612, the load balancing node 108 may determine a computing node to which the stored procedure is sent or assigned based at least in part on the predictor function and the input values for one or more input parameters of the stored procedure.
In implementations, after determining the predictor function and the input values for the one or more input parameters of the stored procedure, the load balancing node 108 may determine a computing node to which the stored procedure is sent or assigned. In implementations, the load balancing node 108 may apply the predictor function on the input values of the one or more input parameters of the stored procedure to produce an output of the predictor function. In implementations, the output of the predictor function may be a label that is related to a location of data potentially or likely accessed or queried by at least one query of the stored procedure. In implementations, the label may include an identifier of a data shard of a database table that includes or stores data potentially or likely accessed or queried by at least one query of the stored procedure. In implementations, the load balancing node 108 may then employ a placement function to map the identifier of the data shard to an identifier of a computing node that includes or stores the data shard.
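The routing step at block 612 is a composition of the two functions described above: the predictor maps input values to a data shard identifier, and the placement function maps that shard identifier to the computing node storing it. A minimal sketch, with all names illustrative:

```python
def route_stored_procedure(predictor, placement, input_values):
    """Apply the predictor to the input values to obtain the identifier of
    the data shard likely holding the queried data, then apply the placement
    function to map that shard to the computing node that stores it."""
    shard_id = predictor(input_values)
    return placement(shard_id)


def make_placement(shard_to_node):
    """A placement function can be as simple as a lookup into shard-location
    metadata."""
    return lambda shard_id: shard_to_node[shard_id]
```
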
In implementations, each predictor function of the plurality of predictor functions may be combined with the placement function to form a respective combined predictor-placement function, which is assigned the same identifier as the respective predictor function. In this case, the load balancing node 108 may directly employ the combined predictor-placement function to obtain the identifier of the computing node to which the stored procedure is assigned.
At block 614, the load balancing node may send the stored procedure to the computing node.
After determining the identifier of the computing node, the load balancing node may send the stored procedure to the computing node, to allow the computing node to process the stored procedure, and/or act as a controlling node to coordinate and manage processing of the stored procedure.
Although the above method blocks are described to be executed in a particular order, in some implementations, some or all of the method blocks can be executed in other orders, or in parallel.
Conclusion
Although implementations have been described in language specific to structural features and/or methodological acts, it is to be understood that the claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed subject matter. Additionally or alternatively, some or all of the operations may be implemented by one or more ASICs, FPGAs, or other hardware.
The present disclosure can be further understood using the following  clauses.
Clause 1: A method implemented by a load balancing node, the method comprising: receiving a distributed transaction including a stored procedure; selecting a predictor function for the stored procedure from among a plurality of predictor functions; extracting one or more input parameters from the stored procedure; determining a computing node based at least in part on the one or more input parameters using the predictor function; and forwarding the stored procedure to the computing node for processing the stored procedure.
Clause 2: The method of Clause 1, wherein the one or more input parameters comprise one or more partition key values associated with one or more queries included in the stored procedure.
Clause 3: The method of Clause 1, wherein selecting the predictor function for the stored procedure from among the plurality of predictor functions comprises: determining an identifier of the stored procedure; obtaining an identifier of the predictor function based on the identifier of the stored procedure and a mapping relationship; and selecting the predictor function from among the plurality of predictor functions based on the identifier of the predictor function.
Clause 4: The method of Clause 1, wherein extracting the one or more input parameters from the stored procedure comprises extracting the one or more input parameters from the stored procedure from an argument list of the stored procedure or using a static program analysis.
Clause 5: The method of Clause 1, wherein the stored procedure comprises a procedure that is registered and stored in a database system in advance before being invoked.
Clause 6: The method of Clause 1, wherein determining the computing node based at least in part on the one or more input parameters using the predictor function comprises: determining an identifier of a data shard associated with at least one query included in the stored procedure based on the one or more input parameters using the predictor function; and determining an identifier of the computing node based on the identifier of the data shard using a placement function, the computing node storing the data shard.
Clause 7: The method of Clause 1, wherein the predictor function comprises a classification model that is trained based on a plurality of partition key values and corresponding identifiers of a plurality of shards including respective pieces of data corresponding to the plurality of partition key values in a database table, the plurality of shards being separately stored in a plurality of computing nodes.
Clause 8: The method of Clause 7, wherein the classification model comprises a neural network model.
Clause 9: One or more computer readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising: performing a static program analysis on a stored procedure to determine one or more input parameters; obtaining stored data including a plurality of value sets for the one or more input parameters of the stored  procedure and identifiers of a plurality of data shards; training a classification model using the plurality of value sets as inputs and the corresponding identifiers of the plurality of data shards as outputs; and setting the trained classification model as a predictor function used for mapping a new value set for the one or more input parameters of the stored procedure to a corresponding data shard.
Clause 10: The one or more computer readable media of Clause 9, wherein a respective data shard of the plurality of data shards comprises a corresponding piece of data that is accessed by the stored procedure when the one or more input parameters take a respective value set of the plurality of value sets.
Clause 11: The one or more computer readable media of Clause 9, wherein performing the static program analysis on the stored procedure to determine the one or more input parameters comprises: identifying a variable that represents a partition key in at least one query included in the stored procedure; and determining the one or more input parameters for the variable using a graph traversal algorithm.
Clause 12: The one or more computer readable media of Clause 9, wherein the classification model comprises a neural network model.
Clause 13: The one or more computer readable media of Clause 9, wherein the acts further comprise sending the classification model to a load balancing node.
Clause 14: The one or more computer readable media of Clause 9, wherein the acts further comprise: receiving, from a plurality of computing nodes, new data associated with new value sets for the one or more input parameters of the stored procedure, and identifiers of data shards that separately include corresponding pieces of data accessed by the stored procedure when the one or more input parameters take the new value sets.
Clause 15: The one or more computer readable media of Clause 14, wherein the acts further comprise retraining the classification model based at least in part on the stored data and the new data.
Clause 16: A system comprising: one or more processors; and memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising: performing a static program analysis on a stored procedure to determine one or more input parameters; obtaining stored data including a plurality of value sets for the one or more input parameters of the stored procedure and identifiers of a plurality of data shards; training a classification model using the plurality of value sets as inputs and the corresponding identifiers of the plurality of data shards as outputs; and setting the trained classification model as a predictor function used for mapping a new value set for the one or more input parameters of the stored procedure to a corresponding data shard.
Clause 17: The system of Clause 16, wherein a respective data shard of the plurality of data shards comprises a corresponding piece of data that is accessed by the stored procedure when the one or more input parameters take a respective value set of the plurality of value sets.
Clause 18: The system of Clause 16, wherein performing the static program analysis on the stored procedure to determine the one or more input parameters comprises: identifying a variable that represents a partition key in at least one query included in the stored procedure; and determining the one or more input parameters for the variable using a graph traversal algorithm.
Clause 19: The system of Clause 16, wherein the classification model comprises a neural network model.
Clause 20: The system of Clause 16, wherein the acts further comprise: receiving, from a plurality of computing nodes, new data associated with new value sets for the one or more input parameters of the stored procedure, and identifiers of data shards that separately include corresponding pieces of data accessed by the stored procedure when the one or more input parameters take the new value sets; and retraining the classification model based at least in part on the stored data and the new data.

Claims (20)

  1. A method implemented by a load balancing node, the method comprising:
    receiving a distributed transaction including a stored procedure;
    selecting a predictor function for the stored procedure from among a plurality of predictor functions;
    extracting one or more input parameters from the stored procedure;
    determining a computing node based at least in part on the one or more input parameters using the predictor function; and
    forwarding the stored procedure to the computing node for processing the stored procedure.
  2. The method of claim 1, wherein the one or more input parameters comprise one or more partition key values associated with one or more queries included in the stored procedure.
  3. The method of claim 1, wherein selecting the predictor function for the stored procedure from among the plurality of predictor functions comprises:
    determining an identifier of the stored procedure;
    obtaining an identifier of the predictor function based on the identifier of the stored procedure and a mapping relationship; and
    selecting the predictor function from among the plurality of predictor functions based on the identifier of the predictor function.
  4. The method of claim 1, wherein extracting the one or more input parameters from the stored procedure comprises extracting the one or more input parameters from an argument list of the stored procedure or using a static program analysis.
  5. The method of claim 1, wherein the stored procedure comprises a procedure that is registered and stored in a database system in advance before being invoked.
  6. The method of claim 1, wherein determining the computing node based at least in part on the one or more input parameters using the predictor function comprises:
    determining an identifier of a data shard associated with at least one query included in the stored procedure based on the one or more input parameters using the predictor function; and
    determining an identifier of the computing node based on the identifier of the data shard using a placement function, the computing node storing the data shard.
  7. The method of claim 1, wherein the predictor function comprises a classification model that is trained based on a plurality of partition key values and corresponding identifiers of a plurality of shards including respective pieces of data corresponding to the plurality of partition key values in a database table, the plurality of shards being separately stored in a plurality of computing nodes.
  8. The method of claim 7, wherein the classification model comprises a neural network model.
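Claims 1, 3, and 6 describe a two-step lookup at the load balancing node: a predictor function selected per stored procedure maps the input parameters to a data-shard identifier, and a placement function maps that shard to the computing node storing it. The following is a minimal, hypothetical sketch of that flow; the procedure name, range-partitioning predictor (a stand-in for the trained classifier of claim 7), and node names are all invented for illustration.

```python
# Sketch of the routing flow in claims 1, 3, and 6. The range-based
# predictor below is a stand-in for the trained classification model.

def make_range_predictor(boundaries):
    """Return a predictor mapping a partition-key value to a shard id
    by comparing it against sorted range boundaries."""
    def predict(params):
        key = params[0]  # assume the first input parameter is the partition key
        for shard_id, upper in enumerate(boundaries):
            if key < upper:
                return shard_id
        return len(boundaries)
    return predict

# One predictor per registered stored procedure (the mapping of claim 3).
PREDICTORS = {"transfer_funds": make_range_predictor([1000, 2000])}

# Placement function: shard identifier -> computing node (claim 6).
PLACEMENT = {0: "node-a", 1: "node-b", 2: "node-c"}

def route(procedure_id, input_params):
    predictor = PREDICTORS[procedure_id]  # select the predictor (claim 3)
    shard_id = predictor(input_params)    # parameters -> shard id (claim 6)
    return PLACEMENT[shard_id]            # shard id -> node (claim 6)

print(route("transfer_funds", (1500,)))  # -> node-b
```

Because the whole stored procedure is forwarded to the node that holds its data, the transaction can often execute as a single-node transaction rather than a distributed one.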
  9. One or more computer readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising:
    performing a static program analysis on a stored procedure to determine one or more input parameters;
    obtaining stored data including a plurality of value sets for the one or more input parameters of the stored procedure and identifiers of a plurality of data shards;
    training a classification model using the plurality of value sets as inputs and the corresponding identifiers of the plurality of data shards as outputs; and
    setting the trained classification model as a predictor function used for mapping a new value set for the one or more input parameters of the stored procedure to a corresponding data shard.
  10. The one or more computer readable media of claim 9, wherein a respective data shard of the plurality of data shards comprises a corresponding piece of data that is accessed by the stored procedure when the one or more input parameters take a respective value set of the plurality of value sets.
  11. The one or more computer readable media of claim 9, wherein performing the static program analysis on the stored procedure to determine the one or more input parameters comprises:
    identifying a variable that represents a partition key in at least one query included in the stored procedure; and
    determining the one or more input parameters for the variable using a graph traversal algorithm.
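Claim 11's static analysis can be pictured as a backward walk over a def-use graph: starting from the variable that appears as the partition key in a query, traverse assignment edges until procedure parameters are reached. The graph, variable names, and parameter names below are hypothetical; a real implementation would build the graph from the parsed procedure body.

```python
# Minimal sketch of claim 11: a graph traversal (BFS) over a reversed
# def-use graph to find which input parameters feed the partition-key
# variable of a query. All names here are invented for illustration.
from collections import deque

# variable -> variables it is assigned from (def-use edges, reversed)
DEF_USE = {
    "query_key": ["acct"],      # e.g. WHERE account_id = :query_key
    "acct": ["p_account"],      # acct := p_account inside the procedure
    "amount": ["p_amount"],
}
PARAMETERS = {"p_account", "p_amount"}  # the procedure's argument list

def params_feeding(var):
    """BFS backwards from `var`, collecting procedure parameters reached."""
    found, seen, queue = set(), {var}, deque([var])
    while queue:
        v = queue.popleft()
        if v in PARAMETERS:
            found.add(v)
        for src in DEF_USE.get(v, []):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return found

print(params_feeding("query_key"))  # -> {'p_account'}
```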
  12. The one or more computer readable media of claim 9, wherein the classification model comprises a neural network model.
  13. The one or more computer readable media of claim 9, wherein the acts further comprise sending the classification model to a load balancing node.
  14. The one or more computer readable media of claim 9, wherein the acts further comprise:
    receiving, from a plurality of computing nodes, new data associated with new value sets for the one or more input parameters of the stored procedure, and identifiers of data shards that separately include corresponding pieces of data accessed by the stored procedure when the one or more input parameters take the new value sets.
  15. The one or more computer readable media of claim 14, wherein the acts further comprise retraining the classification model based at least in part on the stored data and the new data.
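The training step of claims 9 and 12 fits a classifier on logged executions, with parameter value sets as inputs and the identifiers of the shards they touched as labels. As a hedged illustration, the sketch below uses a 1-nearest-neighbour rule in place of the neural network of claim 12; the logged history is invented.

```python
# Sketch of claim 9's training step: fit a classifier on
# (value set -> shard id) pairs and install it as the predictor function.
# 1-NN stands in for the neural network model of claim 12.

def train_predictor(value_sets, shard_ids):
    """Fit a 1-NN classifier: value sets are inputs, shard ids are labels."""
    samples = list(zip(value_sets, shard_ids))
    def predict(new_value_set):
        # Return the shard of the closest logged value set (squared distance).
        def dist(sample):
            vs, _ = sample
            return sum((a - b) ** 2 for a, b in zip(vs, new_value_set))
        return min(samples, key=dist)[1]
    return predict

# Logged executions: (input-parameter value set, shard that was accessed).
history_inputs = [(5,), (15,), (25,), (8,), (22,)]
history_shards = ["shard-0", "shard-1", "shard-2", "shard-0", "shard-2"]

predictor = train_predictor(history_inputs, history_shards)
print(predictor((7,)))   # -> shard-0  (closest logged value set is (8,))
print(predictor((24,)))  # -> shard-2
```

A learned predictor of this kind can approximate the partitioning function even when the key-to-shard mapping is not a simple hash or range rule known to the load balancer.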
  16. A system comprising:
    one or more processors; and
    memory storing executable instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising:
    performing a static program analysis on a stored procedure to determine one or more input parameters;
    obtaining stored data including a plurality of value sets for the one or more input parameters of the stored procedure and identifiers of a plurality of data shards;
    training a classification model using the plurality of value sets as inputs and the corresponding identifiers of the plurality of data shards as outputs; and
    setting the trained classification model as a predictor function used for mapping a new value set for the one or more input parameters of the stored procedure to a corresponding data shard.
  17. The system of claim 16, wherein a respective data shard of the plurality of data shards comprises a corresponding piece of data that is accessed by the stored procedure when the one or more input parameters take a respective value set of the plurality of value sets.
  18. The system of claim 16, wherein performing the static program analysis on the stored procedure to determine the one or more input parameters comprises:
    identifying a variable that represents a partition key in at least one query included in the stored procedure; and
    determining the one or more input parameters for the variable using a graph traversal algorithm.
  19. The system of claim 16, wherein the classification model comprises a neural network model.
  20. The system of claim 16, wherein the acts further comprise:
    receiving, from a plurality of computing nodes, new data associated with new value sets for the one or more input parameters of the stored procedure, and identifiers of data shards that separately include corresponding pieces of data accessed by the stored procedure when the one or more input parameters take the new value sets; and
    retraining the classification model based at least in part on the stored data and the new data.
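Claims 14, 15, and 20 close the loop: computing nodes report fresh (value set, accessed shard) observations, for instance after data is repartitioned, and the model is refit on the stored data plus the new data. In this hypothetical sketch the "model" is a majority-vote lookup standing in for retraining a classifier; the keys and shard names are invented.

```python
# Sketch of the retraining loop in claims 14, 15, and 20. A majority-vote
# lookup table stands in for refitting the classification model.
from collections import Counter, defaultdict

def fit(observations):
    """Map each value set to the shard it most often accessed."""
    votes = defaultdict(Counter)
    for value_set, shard in observations:
        votes[value_set][shard] += 1
    return {vs: c.most_common(1)[0][0] for vs, c in votes.items()}

stored = [((1,), "shard-0"), ((2,), "shard-0")]
model = fit(stored)

# A repartition moved key 2; nodes report the new placement (claim 14).
new_data = [((2,), "shard-1"), ((2,), "shard-1")]
model = fit(stored + new_data)  # retrain on stored + new data (claim 15)
print(model[(2,)])  # -> shard-1
```

Retraining on the combined data lets the predictor track shard movements without the load balancer needing explicit notification of every repartitioning event.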
PCT/CN2020/112265 2020-08-28 2020-08-28 Smart procedure routing in partitioned database management systems WO2022041143A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080102917.6A CN115803715A (en) 2020-08-28 2020-08-28 Intelligent process routing in a partitioned database management system
PCT/CN2020/112265 WO2022041143A1 (en) 2020-08-28 2020-08-28 Smart procedure routing in partitioned database management systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/112265 WO2022041143A1 (en) 2020-08-28 2020-08-28 Smart procedure routing in partitioned database management systems

Publications (1)

Publication Number Publication Date
WO2022041143A1 (en) 2022-03-03

Family

ID=80352509

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112265 WO2022041143A1 (en) 2020-08-28 2020-08-28 Smart procedure routing in partitioned database management systems

Country Status (2)

Country Link
CN (1) CN115803715A (en)
WO (1) WO2022041143A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160210313A1 (en) * 2015-01-16 2016-07-21 Futurewei Technologies, Inc. System for high-throughput handling of transactions in a data-partitioned, distributed, relational database management system
US20190220758A1 (en) * 2017-01-10 2019-07-18 Huawei Technologies Co., Ltd. Systems and methods for fault tolerance recover during training of a model of a classifier using a distributed system
CN110321214A (en) * 2018-03-29 2019-10-11 阿里巴巴集团控股有限公司 A kind of data query method, device and equipment
CN110334036A (en) * 2019-06-28 2019-10-15 京东数字科技控股有限公司 A kind of method and apparatus for realizing data cached scheduling
CN110362611A (en) * 2019-07-12 2019-10-22 拉卡拉支付股份有限公司 A kind of data base query method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115803715A (en) 2023-03-14

Similar Documents

Publication Publication Date Title
US10565201B2 (en) Query processing management in a database management system
US20190005094A1 (en) Method for approximate processing of complex join queries
US11366809B2 (en) Dynamic creation and configuration of partitioned index through analytics based on existing data population
US9229960B2 (en) Database management delete efficiency
US11126641B2 (en) Optimized data distribution system
US20200081903A1 (en) Splitting transaction and analysis queries
US10191947B2 (en) Partitioning advisor for online transaction processing workloads
US8775483B2 (en) In-memory data grid hash scheme optimization
US11249995B2 (en) Techniques and architectures for providing and operating an application-aware database environment with predictive execution of queries and query flows
Bachhav et al. An efficient query optimizer with materialized intermediate views in distributed and cloud environment
US9229969B2 (en) Management of searches in a database system
He et al. GLAD: A Grid and Labeling Framework with Scheduling for Conflict-Aware $ k $ k NN Queries
Valavala et al. Automatic database index tuning using machine learning
WO2022041143A1 (en) Smart procedure routing in partitioned database management systems
Lai et al. {GLogS}: Interactive graph pattern matching query at large scale
CN110737683A (en) Automatic partitioning method and device for extraction-based business intelligent analysis platforms
US11874830B2 (en) Efficient job writing for database member
Wang et al. Turbo: Dynamic and decentralized global analytics via machine learning
Mateen et al. Optimization strategies through big-data migration in distributed cloud databases
Shaoyu et al. Practical throughput estimation for parallel databases
US11874834B2 (en) Determining dimension tables for star schema joins
US11620271B2 (en) Relationship analysis using vector representations of database tables
CN112732704B (en) Data processing method, device and storage medium
Liang et al. RelJoin: Relative-cost-based selection of distributed join methods for query plan optimization
Bachhav et al. QOTUM: The Query Optimizer for Distributed Database in Cloud Environment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20950813

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20950813

Country of ref document: EP

Kind code of ref document: A1