CN116302404B - Serverless computing scheduling method for resource-disaggregated data centers - Google Patents

Serverless computing scheduling method for resource-disaggregated data centers

Info

Publication number
CN116302404B
CN116302404B (application CN202310149359.4A)
Authority
CN
China
Prior art keywords
task
task type
computing
storage node
determining
Prior art date
Legal status
Active
Application number
CN202310149359.4A
Other languages
Chinese (zh)
Other versions
CN116302404A (en)
Inventor
Xin Jin
Xuanzhe Liu
Sheng Qi
Current Assignee
Peking University
Original Assignee
Peking University
Priority date
Filing date
Publication date
Application filed by Peking University
Priority to CN202310149359.4A
Publication of CN116302404A
Application granted
Publication of CN116302404B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

An embodiment of the invention provides a serverless computing scheduling method for resource-disaggregated data centers. Applied to a scheduler, the method comprises: determining the corresponding task type from a received task-request RPC; determining the allocation ratio corresponding to the task type; and dispatching tasks of the task type to a compute node or a storage node for execution with probabilities given by the allocation ratio. The method aims to dispatch each kind of task to the nodes matched to its runtime characteristics, improving task execution efficiency and system resource utilization and thereby increasing system throughput.

Description

Serverless computing scheduling method for resource-disaggregated data centers
Technical Field
The invention relates to the technical field of data center scheduling, and in particular to a serverless computing scheduling method for resource-disaggregated data centers.
Background
Serverless computing is a new-generation cloud computing paradigm that lets developers build applications from serverless functions without concerning themselves with the details of underlying resource management. Serverless computing adopts a resource-disaggregated architecture: compute and storage are decoupled into two independent resource pools, and server nodes in the compute pool fetch data from the storage pool over the network to complete their computation. Because the two pools are independent of each other, the disaggregated architecture offers good scalability and higher resource utilization.
However, the network overhead of accessing data remotely is often non-negligible. For IO-intensive tasks (IO: data input/output), moving data from a storage node to a compute node may incur multiple RTTs (round-trip times) or consume substantial bandwidth. Storage-side computing mitigates this problem by using RPC (remote procedure call) to run stored procedures registered on the storage nodes. However, because the storage resource pool exists for storage, its computing resources are usually limited and cannot satisfy the computing demands of all tasks. In the prior art, the workload is scheduled to compute and storage nodes according to a single allocation ratio so that both resource pools are utilized. The limitation of this prior art is that the scheduling decision is made only at the workload level, i.e., one uniform allocation ratio is computed for the entire workload; this scheme ignores the attributes of individual tasks, so scheduling quality is poor and the system runs inefficiently.
Disclosure of Invention
In view of this, an embodiment of the invention provides a serverless computing scheduling method for resource-disaggregated data centers. The method aims to dispatch each kind of task to the nodes matched to its runtime characteristics, improving task execution efficiency and system resource utilization and thereby increasing system throughput.
A first aspect of the embodiments of the invention provides a serverless computing scheduling method for resource-disaggregated data centers, applied to a scheduler and comprising the following steps:
determining the corresponding task type from a received task-request RPC;
determining the allocation ratio corresponding to the task type;
and dispatching tasks of the task type to a compute node or a storage node for execution with probabilities given by the allocation ratio.
Optionally, the determining, according to the task type, of the allocation ratio corresponding to the task type comprises:
when the task type is an unprocessed first task type, determining that the allocation ratio corresponding to the first task type is a default allocation ratio;
and the dispatching of tasks of the task type to a compute node or a storage node for execution with probabilities given by the allocation ratio comprises:
dispatching tasks of the first task type to a compute node or a storage node for execution according to the default allocation ratio, and aggregating the execution costs of a preset number of first-task-type tasks on the compute nodes and the storage nodes, thereby determining the runtime characteristics of the first task type;
determining the optimal allocation ratio of the first task type by feeding the runtime characteristics into a preset scheduling algorithm;
dispatching tasks of the first task type to a compute node or a storage node for execution with the probabilities given by the optimal allocation ratio of the first task type;
when the task type is a previously processed second task type, obtaining the optimal allocation ratio of the second task type as determined by the preset scheduling algorithm;
and the dispatching of tasks of the task type to a compute node or a storage node for execution with probabilities given by the allocation ratio comprises:
dispatching tasks of the second task type to a compute node or a storage node for execution according to the optimal allocation ratio of the second task type.
Optionally, the runtime characteristics include: the network overhead of data transmission when a compute node executes the task, and the CPU overhead when a storage node executes the task.
Optionally, the determining of the optimal allocation ratio of the first task type by feeding the runtime characteristics into a preset scheduling algorithm comprises:
determining the relative cost of the first task type by feeding the runtime characteristics into the preset scheduling algorithm;
sorting the first task type and the second task types by relative cost;
determining the optimal value of the partition point parameter through a sub-algorithm of the preset scheduling algorithm;
and determining the optimal allocation ratio of the first task type from the sorted order and the optimal partition point parameter.
Optionally, the method further comprises:
adjusting the system throughput through a rate limiter according to the end-to-end latency observed during task execution and the service level objective given by the user;
and adjusting the scheduler's partition point parameter according to the system throughput so as to maximize the system throughput.
Optionally, the method further comprises:
determining, for each task type, the difference between its runtime characteristics in the current time window and the moving average of its historical runtime characteristics;
resetting to the default allocation ratio any third task type whose difference between the current-window runtime characteristics and the historical moving average exceeds a first threshold;
dispatching tasks of the third task type to a compute node or a storage node for execution according to the default allocation ratio, and aggregating the new execution costs of a preset number of third-task-type tasks on the compute nodes and the storage nodes, thereby determining new runtime characteristics;
determining the relative cost of the third task type by feeding the new runtime characteristics into the preset scheduling algorithm;
sorting all task types by relative cost;
determining the optimal value of the partition point parameter through a sub-algorithm of the preset scheduling algorithm;
and determining the optimal allocation ratios of all task types from the sorted order of all task types and the optimal partition point parameter.
Optionally, the method further comprises:
determining whether the rate limiter is dropping requests;
when the rate limiter is dropping requests, increasing the capacity of the compute resource pool stepwise until the rate limiter no longer drops requests;
and when the rate limiter is not dropping requests, decreasing the number of compute nodes stepwise according to the system load.
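A minimal sketch of the stepwise elastic-scaling rule above; the step size, the load threshold for shrinking, and the function name are illustrative assumptions not given by the patent:

```python
def adjust_compute_pool(nodes, dropping, load, min_nodes=1, step=1, shrink_below=0.5):
    """Stepwise elastic scaling of the compute resource pool (sketch).

    Grow the pool while the rate limiter is dropping requests; shrink it
    when no requests are dropped and system load is low. `step` and
    `shrink_below` are assumed values for illustration.
    """
    if dropping:
        return nodes + step          # scale out until drops stop
    if load < shrink_below and nodes > min_nodes:
        return nodes - step          # scale in under light load
    return nodes                     # otherwise hold steady
```

In use, this rule would be evaluated periodically against the limiter's drop counter and a load metric; repeated application converges to the smallest pool that avoids request drops.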
Optionally, the method further comprises:
determining the actual load of each storage node;
when the difference between a storage node's actual load and the system's average load exceeds a second threshold, placing that storage node in a new independent scheduling policy group;
assigning a preset number of compute nodes to the new independent scheduling policy group;
each policy group then operates independently, as a subsystem, for task scheduling and execution.
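The optional policy-group rule above might be sketched as follows; the dictionary-based group representation and all names are assumptions for illustration:

```python
def split_policy_groups(storage_loads, avg_load, threshold, spare_compute):
    """Form new independent scheduling policy groups (sketch).

    Any storage node whose load deviates from the system average by more
    than `threshold` gets its own group, together with a preset set of
    compute nodes (`spare_compute`). Representation is illustrative only.
    """
    groups = []
    for node, load in storage_loads.items():
        if abs(load - avg_load) > threshold:
            groups.append({"storage": [node], "compute": list(spare_compute)})
    return groups
```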
The embodiment of the invention has the following advantages:
the server non-aware computing scheduling method for the resource decoupling data center provided by the embodiment of the invention determines the task type of a task according to a received task request RPC, determines the allocation proportion corresponding to the task type according to the task type (for example, for the computationally intensive task, all or most of the task is allocated to a computing node with sufficient computing resources for execution, the rest is allocated to a storage node for execution, for the IO intensive task, all or most of the task is allocated to the storage node with higher IO efficiency for execution, and the rest is allocated to the computing node for execution), and according to the determined allocation proportion, the task in the task type is allocated to the computing node or the storage node for execution with the probability equal to the allocation proportion. The invention determines the allocation proportion corresponding to the task type according to the task type, and allocates the tasks in the task type to the computing node or the storage node for execution with the probability equal to the allocation proportion, so that the tasks in the task type can be executed at the node matched with the self-running characteristics, thereby improving the task execution efficiency and the resource utilization rate of the system, and further improving the throughput of the system.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a serverless computing scheduling method for resource-disaggregated data centers according to an embodiment of the invention;
FIG. 2 is a block diagram of a system implementing the serverless computing scheduling method for resource-disaggregated data centers according to an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Before explaining the invention, its background is described. In practical application scenarios, task types are not distinguished; scheduling is unified at the level of the overall workload (the set of all tasks), so scheduling accuracy is low and the system runs inefficiently. The invention observes that, in practice, a system executes many kinds of tasks whose runtime characteristics differ. Dispatching tasks to the nodes matched to their runtime characteristics (for example, placing compute-intensive tasks on compute nodes with ample computing capacity, and IO-intensive tasks on storage nodes with higher IO efficiency so as to avoid repeated RTTs and large-scale data transfers) exploits the respective strengths of each node type, improving task execution efficiency and system resource utilization and thereby increasing system task throughput.
FIG. 1 is a flowchart of a serverless computing scheduling method for resource-disaggregated data centers according to an embodiment of the invention. Referring to FIG. 1, the method is applied to a scheduler and comprises the following steps:
Step S11: determine the corresponding task type from the received task-request RPC;
Step S12: determine the allocation ratio corresponding to the task type;
Step S13: dispatch tasks of the task type to a compute node or a storage node for execution with probabilities given by the allocation ratio.
In an embodiment of the invention, a user submits pre-written application code to the system, which compiles it and stores it on the compute nodes and storage nodes; each specific program corresponds to a task type. The user then submits task invocation requests to the scheduler in the form of RPCs. The scheduler is responsible for dispatching tasks to a compute node or a storage node for execution: after receiving a task-request RPC, it identifies the corresponding task type by parsing the request, determines that type's allocation ratio, and schedules the request to a storage node with probability equal to the allocation ratio. Concretely, if a task type's allocation ratio is 60%, its tasks are scheduled to storage nodes with probability 60% and to compute nodes with probability 40%. In this description the allocation ratio denotes the probability of a task being scheduled to a storage node; in practice one may freely choose whether it denotes the probability of being scheduled to a storage node or to a compute node. Macroscopically, by the law of large numbers, the ratio between the numbers of a type's tasks executed on compute and storage nodes converges to that type's allocation ratio. Continuing the example, if the user submits 10000 tasks of the type in total, about 6000 (about 60%) execute on storage nodes and about 4000 (about 40%) on compute nodes, which is why 60% is called the allocation ratio.
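The probabilistic dispatch described above can be sketched as follows; the function name is an assumption for illustration, since the patent specifies only the probabilistic behavior:

```python
import random

def dispatch(storage_ratio):
    """Route one task request: with probability `storage_ratio` the task
    goes to a storage node, otherwise to a compute node (sketch)."""
    return "storage" if random.random() < storage_ratio else "compute"

# As the text notes, by the law of large numbers the empirical split
# over many requests converges to the allocation ratio (here 60%).
random.seed(0)
n = 10_000
to_storage = sum(dispatch(0.6) == "storage" for _ in range(n))
print(to_storage / n)  # close to 0.6
```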
The scheduler computes the allocation ratios of all task types submitted within the current preset time window jointly; this runs asynchronously in the scheduler's background, i.e., computing the allocation ratios does not interfere with the scheduler's handling of current tasks. When an actual task-request RPC arrives, the scheduler identifies the corresponding task type and dispatches the task to a compute or storage node for execution according to that type's allocation ratio.
The allocation ratio of each task type is computed by a preset scheduling algorithm inside the scheduler; a concrete implementation is described in the embodiments below.
The serverless computing scheduling method for resource-disaggregated data centers provided by this embodiment determines a task's type from the received task-request RPC, determines the corresponding allocation ratio from the task type, and dispatches tasks of that type to a compute node or a storage node for execution with the probabilities given by the allocation ratio. Because the invention sets an allocation ratio per task type, each kind of task executes on the nodes matched to its runtime characteristics (for example, compute-intensive tasks are wholly or mostly allocated to compute nodes, which have ample computing resources, with the remainder allocated to storage nodes; IO-intensive tasks are wholly or mostly allocated to storage nodes, which have higher IO efficiency, with the remainder allocated to compute nodes), improving task execution efficiency and system resource utilization and thereby increasing system task throughput.
In the invention, the determining, according to the task type, of the allocation ratio corresponding to the task type comprises: when the task type is an unprocessed first task type, determining that the allocation ratio corresponding to the first task type is a default allocation ratio; dispatching tasks of the first task type to a compute node or a storage node for execution according to the default allocation ratio, and aggregating the execution costs of a preset number of first-task-type tasks on the compute nodes and the storage nodes, thereby determining the runtime characteristics of the first task type; determining the optimal allocation ratio of the first task type by feeding the runtime characteristics into a preset scheduling algorithm; and dispatching tasks of the first task type to a compute node or a storage node for execution with the probabilities given by that optimal allocation ratio. When the task type is a previously processed second task type, the optimal allocation ratio of the second task type, as determined by the preset scheduling algorithm, is obtained, and tasks of the second task type are dispatched to a compute node or a storage node for execution according to it.
In an embodiment of the invention, one implementation of determining the allocation ratio according to the task type is as follows: when the task type of the received request is a first task type the system has not processed before, tasks of the first task type are dispatched to compute or storage nodes for execution according to a default allocation ratio.
In an embodiment of the invention, the default allocation ratio is preferably 50%; it should be understood that other values may be used, and no specific limitation is made here. Dispatching tasks of the unprocessed first task type at the default ratio allows the system to collect the type's runtime characteristics on both compute and storage nodes, from which the optimal allocation ratio of the first task type is then computed.
When the number of first-task-type tasks dispatched to compute nodes reaches a preset number, their execution costs on the compute nodes are aggregated; when the number dispatched to storage nodes reaches the preset number, their execution costs on the storage nodes are aggregated. Together these constitute the runtime characteristics of the first task type. The preset number can be chosen according to the actual application scenario and is not specifically limited here.
In an embodiment of the invention, after both the compute side and the storage side have collected statistics for the first task type, the two resulting runtime-characteristic values are fed into the preset scheduling algorithm to obtain the type's optimal allocation ratio. Tasks belonging to the first task type are then dispatched to compute or storage nodes for execution with the corresponding probabilities.
In an embodiment of the invention, when the task type is a second task type the system has already processed, the scheduler has previously obtained that type's runtime characteristics on compute and storage nodes and computed its optimal allocation ratio via the preset scheduling algorithm; tasks of the second task type are simply dispatched to compute or storage nodes for execution with the probabilities given by the optimal allocation ratio.
In the invention, the runtime characteristics include: the network overhead of data transmission when a compute node executes the task, and the CPU overhead when a storage node executes the task.
In embodiments of the invention, the runtime characteristics of a task type are a tuple (c_i, s_i), where the subscript i is the task type's number, c_i is the execution cost of the i-th task type on a compute node, and s_i is its execution cost on a storage node. The execution cost on compute and storage nodes may represent the cost of a specific resource (such as CPU overhead or network overhead), or an abstract execution cost obtained by weighting several resources according to some heuristic.
In an embodiment of the invention, since the invention supports elastic scaling of the compute resource pool, the CPUs on compute nodes do not become a performance bottleneck; the execution cost of a task on a compute node is therefore preferably the network overhead of data transmission. The number of storage nodes is determined by the amount of data the application must store persistently and usually does not fluctuate significantly in the short term, so storage-node CPU resources are limited and the execution cost of a task on a storage node is preferably its CPU overhead. In this case, c_i denotes the network overhead of data transmission when a task of the i-th type executes on a compute node, and s_i denotes the CPU overhead when it executes on a storage node.
After a task of the i-th type executes on a compute or storage node, the measured c_i or s_i is fed back to the scheduler. The scheduler aggregates and averages the c_i and s_i reported for all tasks of the i-th type to obtain the type's overall runtime characteristics (c_i, s_i): for a task type, c_i and s_i in the runtime-characteristic tuple denote mathematical expectations, while for a single task they denote the specific execution cost. By feeding the aggregated runtime characteristics (c_i, s_i) of the i-th task type into the preset scheduling algorithm, the scheduler can then determine the optimal allocation ratio of the i-th task type.
In an embodiment of the invention, a task type's runtime characteristics change dynamically; to cope with this dynamic change, the scheduler records c_i and s_i as moving averages.
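The moving-average bookkeeping described above can be sketched as follows; the class name, the choice of an exponential moving average, and the smoothing factor alpha are illustrative assumptions (the patent states only that c_i and s_i are maintained as moving averages):

```python
class RuntimeFeatureTracker:
    """Per-task-type moving averages of (c_i, s_i) execution costs (sketch)."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha            # assumed smoothing factor
        self.features = {}            # task_type -> [c_i, s_i]

    def report(self, task_type, cost, on_storage):
        """Fold one measured execution cost into the type's average.
        `on_storage` selects s_i; otherwise the sample updates c_i."""
        pair = self.features.setdefault(task_type, [None, None])
        i = 1 if on_storage else 0
        if pair[i] is None:
            pair[i] = cost            # first sample initializes the average
        else:
            pair[i] = (1 - self.alpha) * pair[i] + self.alpha * cost
```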
In the invention, the determining of the optimal allocation ratio of the first task type by feeding the runtime characteristics into a preset scheduling algorithm comprises: determining the relative cost of the first task type by feeding the runtime characteristics into the preset scheduling algorithm; sorting the first task type and the second task types by relative cost; determining the optimal value of the partition point parameter through a sub-algorithm of the preset scheduling algorithm; and determining the optimal allocation ratio of the first task type from the sorted order and the optimal partition point parameter.
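The sort-then-partition structure described here can be sketched as follows; the use of c_i/s_i as the relative-cost metric, the descending order, and the function name are assumptions for illustration (the patent specifies only that task types are sorted by relative cost and split at an optimal partition point):

```python
def two_part_allocation(c, s, k):
    """Two-part allocation structure (sketch).

    Task types are sorted by assumed relative cost c_i/s_i in descending
    order; the first k types in that order go entirely to storage nodes
    (x_i = 1) and the rest entirely to compute nodes (x_i = 0).
    """
    order = sorted(range(len(c)), key=lambda i: c[i] / s[i], reverse=True)
    x = [0.0] * len(c)
    for i in order[:k]:
        x[i] = 1.0
    return x
```

In the full algorithm, the partition point k (and possibly a fractional split of the boundary type) would be chosen by the sub-algorithm to maximize throughput under the latency constraints.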
In an embodiment of the invention, the shares of the n task types in the system workload are P = {p_0, p_1, ..., p_{n-1}} with Σ_i p_i = 1, where p_i is the share of the i-th task type and numbering starts from 0. The total system throughput is R, so the throughput of the i-th task type is p_i R. The allocation ratios of the n task types are X = {x_0, x_1, ..., x_{n-1}}, where x_i ∈ [0, 1] is the fraction of the i-th task type scheduled to storage nodes (equivalently, the probability that a task of the i-th type is scheduled to a storage node). T_i is the tail latency of the i-th task type (e.g., the 99th-percentile latency of its tasks), and t_i is the latency SLO (service level objective) of the i-th task type, a constraint on T_i requiring that the tail latency T_i not exceed the latency SLO.
In an embodiment of the invention, given the share of the i-th task type among all task types, T_i is a function of R and X: on one hand latency is positively correlated with system throughput, and on the other hand, because of resource sharing, each task is also affected by the other tasks scheduled to the same node. The goal of the invention is to maximize overall system throughput while satisfying every task type's latency SLO. The problem can be formulated as equations (1) and (2):

max R #(1)
s.t. T_i ≤ t_i, for all i ∈ {0, 1, ..., n-1} #(2)
The solution X is an n-dimensional vector, and the search space is too large for the problem to be solved directly. Using mathematical properties from queuing theory, the invention converts the original formulation of equations (1) and (2) into an equivalent formulation and, on that basis, derives that an optimal solution has a two-part structure, which greatly reduces the search space. Specifically, since a task is scheduled either to a compute node or to a storage node, the tail latency T_i defined above can be decomposed into two parts, T_i^C and T_i^S, the tail latencies of the i-th task type when executing on compute nodes and on storage nodes respectively. Given the allocation ratio x_i of the i-th task type, the relationship between the overall tail latency T_i and T_i^C, T_i^S is given by equations (3) to (5):

T_i = T_i^C, x_i = 0 #(3)
T_i = T_i^S, x_i = 1 #(4)
T_i ≤ max{T_i^C, T_i^S}, x_i ∈ (0, 1) #(5)
Therefore, to satisfy the latency-SLO constraint in equation (2) above, T_i^C and T_i^S can equivalently be constrained separately.
Further, according to queuing theory, T_i^C and T_i^S can be tied to the system load, which has the advantage that the load can be written as a closed-form expression. For the runtime characteristics (c_i, s_i) of the i-th task type, in this embodiment the execution cost of a task on a compute node is preferably the network overhead of data transmission and the execution cost on a storage node is preferably the CPU overhead. Let C denote the network bandwidth capacity between the compute and storage nodes, and S the CPU resource capacity of the storage resource pool. The loads on the network bandwidth and on the storage-node CPUs, i.e., the ratios of resource occupancy to resource capacity, can then be defined as

ρ_C = (Σ_i p_i R (1 - x_i) c_i) / C,  ρ_S = (Σ_i p_i R x_i s_i) / S.
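Assuming loads are defined as occupancy over capacity in the form above (the original equations are reconstructed from context), a minimal sketch of computing ρ_C and ρ_S for a candidate allocation vector might look as follows; function and parameter names are assumptions:

```python
def pool_loads(p, R, x, c, s, C, S):
    """Compute (rho_C, rho_S): network load of the compute pool and CPU
    load of the storage pool, as occupancy / capacity (sketch).

    p: per-type workload shares, R: total throughput, x: per-type
    storage allocation ratios, c/s: per-type execution costs on
    compute (network) and storage (CPU) nodes, C/S: capacities.
    """
    n = len(p)
    rho_C = sum(p[i] * R * (1 - x[i]) * c[i] for i in range(n)) / C
    rho_S = sum(p[i] * R * x[i] * s[i] for i in range(n)) / S
    return rho_C, rho_S
```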
In the embodiment of the invention, when the arrival process of task requests (RPCs) is a Poisson process, it can be deduced from queuing theory that T_i^C is a monotonically increasing function of ρ_C, and T_i^S is a monotonically increasing function of ρ_S. Thus, T_i^C and T_i^S can be adjusted by adjusting the load of the system. On this basis, the invention further converts the original problem of equations (1) and (2) into the problem given by equations (6) to (10):
Here the two bounds are the equivalent delay SLOs for T_i^C and T_i^S, respectively, that is, the delay constraints imposed separately on T_i^C and T_i^S. It can be proven mathematically that such equivalent constraints exist, but their actual values often cannot be obtained by splitting the user-given delay SLO. The main significance of introducing these two equivalent delay constraints is that the original problem of equations (1) and (2) can be converted into a problem over ρ_C and ρ_S.
Specifically, first, the allocation proportions x_i are adjusted so that ρ_C and ρ_S are balanced as much as possible; second, since T_i^C and T_i^S are monotonic in ρ_C and ρ_S respectively, adjusting the allocation proportions makes it possible to satisfy the equivalent delay constraints in equations (9) to (10); finally, based on the equivalences given in equations (3) through (5), the satisfiability of the two equivalent delay constraints translates into the satisfiability of the constraints in the original problem, equation (2), thereby ensuring that the tail delay of tasks does not exceed the user-given delay SLO.
In a practical system, the rate limiter in the scheduler only collects statistics on the overall tail delay T_i. The rate limiter steps the overall throughput of the system up (or down) until T_i reaches the user-given delay SLO. Taking increasing throughput as an example, this procedure corresponds to equations (6) through (10) as follows: under the rate limiter's current allocation proportions, increasing the total system throughput R raises ρ_C and ρ_S until whichever of equations (9) and (10) first reaches its equivalent delay constraint does so, at which point the measured statistic T_i reaches the user-given delay SLO.
Based on the above reasoning, the embodiment of the invention solves the original problem of equations (1) and (2) via ρ_C and ρ_S. From equations (6) to (10) and the monotonicity of T_i^C and T_i^S with respect to ρ_C and ρ_S, it can be deduced that the optimal solution should make ρ_C and ρ_S as small as possible. Specifically, only pairs (ρ_C, ρ_S) that are Pareto-optimal need to be considered. In other words, for a fixed ρ_C, the optimal solution should minimize ρ_S; conversely, for a fixed ρ_S, the optimal solution should minimize ρ_C. Consider the case where all tasks are initially scheduled to compute nodes in their entirety, i.e., x_i = 0, so that ρ_S = 0. Tasks are now gradually scheduled to storage nodes to relieve the load on network bandwidth. If the i-th task type is selected to be scheduled to storage nodes, then for every unit removed from ρ_C, ρ_S rises by an amount proportional to s_i/c_i. Thus, to keep ρ_S as small as possible, task types with smaller s_i/c_i should be scheduled to storage nodes first. Accordingly, the invention uses s_i/c_i as a valuation function that converts the runtime characteristics of tasks into relative costs, and the ordering of relative costs serves as the priority with which each task type is scheduled to storage nodes: the smaller the relative cost, the higher the priority of scheduling that task type to storage nodes.
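The valuation function s_i/c_i and the resulting storage-node priority order can be sketched as follows (an illustrative sketch with our own names, not the patent's implementation):

```python
def storage_priority(costs):
    """Order task types by relative cost s_i/c_i, ascending.

    costs: list of runtime characteristics (c_i, s_i) per task type.
    Types with smaller s_i/c_i relieve more network load per unit of
    storage CPU consumed, so they go to storage nodes first.
    Returns task-type indices in scheduling-priority order.
    """
    return sorted(range(len(costs)), key=lambda i: costs[i][1] / costs[i][0])
```

For example, with relative costs 2.0, 0.25, and 1.0, the priority order is type 1, then type 2, then type 0.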
In the embodiment of the invention, after all task types are sorted by relative cost in ascending order, the task types ranked earlier are scheduled to storage nodes first. Consequently, the optimal solution contains a special partition-point task type k: all task types with relative cost smaller than that of k are scheduled to storage nodes (allocation proportion 100%), all task types with relative cost larger than that of k are scheduled to compute nodes (allocation proportion 0%), and only the partition-point task type k has an allocation proportion between 0% and 100%, with its tasks executing on both storage and compute nodes. Therefore, only the partition-point task type k needs to be determined in order to derive the optimal allocation proportions of all task types, which simplifies the search space and facilitates rapid convergence. To facilitate numerical solution, the invention represents this two-part structure by a real number α (the partition-point parameter) in [0, 1]. With k = ⌊αn⌋, the mapping between the allocation proportion x_i of each sorted task type and the partition-point parameter α is given by equation (11):

x_i = 1 for i < k;  x_i = αn − k for i = k;  x_i = 0 for i > k    (11)
the allocation proportion of each task type can be determined through the mapping relation given by the formula (11) and the value of the partition point parameter alpha. The specific meaning expressed by the above formula (11) is: n task types (task types numbered 0 to n-1) are sequenced and mapped to [0,1 ] ]Intervals, wherein the ith task is mapped to subintervals [ (i/n, (i+1)/n)]The method comprises the steps of carrying out a first treatment on the surface of the And the task type corresponding to the subinterval where alpha is located is the partition point task type k. For example, there are 10 task types numbered 0 through 9; if the optimal solution alpha given by the scheduling algorithm is preset * 0.55, thenThe subinterval where the optimal solution is located is [0.5,0.6 ]]The number k=5 corresponds to the 6 th task type being a split point task type. At the same time, the allocation ratio of the task type is αn-i=0.5 (50%). For the first 5 task types, which are relatively lower in cost than the task types of class 6, the allocation proportion is 1 (100%), while the last 4 task types, which are relatively higher in cost than the task types of class 6, are all allocated 0%.
In the embodiment of the invention, the relative cost of the first task type is obtained by inputting its runtime characteristic into the preset scheduling algorithm. Illustratively, when the runtime characteristic of the first task type is (c_1, s_1), its relative cost is s_1/c_1. After the relative cost of the first task type is obtained, the relative cost of each second task type is already available to the system, since second task types are task types the system has processed; the first task type and each second task type are then sorted by their relative costs to obtain the corresponding ordering result. Meanwhile, the optimal solution of the partition-point parameter α is determined by a sub-algorithm within the preset scheduling algorithm.
After the ordering result of the first and second task types and the optimal solution of the partition-point parameter α are obtained, the optimal allocation proportion of the first task type can be determined based on the mapping between allocation proportions and the partition-point parameter, i.e., equation (11).
In an embodiment of the present invention, the detailed flow of the sub-algorithm in the preset scheduling algorithm that determines the optimal solution α* of the partition-point parameter α is as follows. First, the maximum throughput the system can carry while meeting the user-given delay SLO is a function of the partition-point parameter α, denoted R(α). For a given α, the corresponding value of R(α) can be approximated by the rate limiter in the scheduler, and a curve of R(α) can be fitted by sampling α. The scheduler solves for the optimal solution α* under one basic assumption: that R(α) is a unimodal function. The sub-algorithm maintains an upper bound and a lower bound on α* and gradually narrows the search space (i.e., the interval between them). Specifically, the algorithm divides the current search space for α into several cells, samples the cell endpoints one by one, and determines the interval range in which the peak of R(α) (i.e., α*) lies. This process constitutes one round of iteration. The initial search space of each iteration is the new interval range determined by the previous iteration; that is, at the beginning of each iteration, the interval produced by the previous iteration is subdivided. Since the new interval determined by each iteration is strictly contained in its initial search space, the algorithm progressively shrinks the search space until it closes in on the optimal solution α*.
Specifically, in each iteration, the algorithm divides the current interval equally into M cells (M is an algorithm parameter) and traverses the cell endpoints in order to find a local maximum, defined as three consecutive endpoints α_{i−1} < α_i < α_{i+1} such that R(α_i) > R(α_{i−1}) and R(α_i) > R(α_{i+1}). Under the assumption that R(α) is unimodal, it can then be determined that α* lies in [α_{i−1}, α_{i+1}], and the search space is reduced accordingly. Since each cell is 1/M of the original interval, the algorithm guarantees that the search space shrinks by a constant proportion after each iteration, giving the algorithm logarithmic time complexity.
In an embodiment of the invention, the scheduler initializes the search interval to [0, 1] and iterates repeatedly until the interval length is smaller than a threshold δ; at that point, a point within the final interval is taken as an approximation of the optimal solution α*. After the algorithm converges, the scheduler uses this result until a change in workload or cluster state is detected, at which point it reinitializes the preset scheduling algorithm and computes a new α* to respond to the change.
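The interval-shrinking sub-algorithm described above might be sketched as follows, assuming a callable that measures the sustainable throughput R(α) for a sampled α (e.g., via the rate limiter). This is a sketch under the unimodality assumption, with our own parameter names, not the patent's exact implementation:

```python
def find_alpha(throughput, M=8, delta=1e-3):
    """Shrink [lo, hi] around the peak of a unimodal curve R(alpha).

    throughput(alpha) returns the measured sustainable throughput at
    that partition point. Each round samples M + 1 equally spaced
    endpoints and keeps the two cells bracketing the best sample.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > delta:
        pts = [lo + (hi - lo) * j / M for j in range(M + 1)]
        vals = [throughput(p) for p in pts]
        i = max(range(M + 1), key=lambda j: vals[j])   # best sampled endpoint
        lo, hi = pts[max(i - 1, 0)], pts[min(i + 1, M)]
    return (lo + hi) / 2
```

Each round keeps at most 2 of M cells, so the interval shrinks by a constant factor per iteration, matching the logarithmic complexity claimed in the text.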
In the present invention, the method further comprises: adjusting the throughput of the system through a rate limiter according to the end-to-end delay during task execution and the service level objective given by the user; and adjusting the partition-point parameter of the scheduler according to the throughput of the system, so as to maximize the throughput of the system.
In an embodiment of the invention, the scheduler and the rate limiter within it form a dual-loop control system. The rate limiter serves as the inner loop: it collects the end-to-end delay of executing tasks to obtain the tail delay of each task type and adjusts the total system throughput R according to the user-given service level objective (delay SLO), thereby ensuring that the user's service level objective is satisfied. The scheduler constitutes the outer loop: it runs the sub-algorithm of the scheduling algorithm, adjusts the partition-point parameter α according to the total system throughput R reported by the rate limiter, and thereby refines the optimal allocation proportion of each task type so as to maximize the total throughput R. The dual-loop control system operates asynchronously in the background of the scheduler; that is, the scheduler can simultaneously schedule in the foreground using the allocation proportions currently provided by the dual-loop control system.
In the embodiment of the invention, the inner and outer loops of the control system iterate continuously until they converge to the optimal solution. To prevent the coupling between the two loops from causing non-convergence, their loop frequencies must not be too close. In the present invention, the frequency of the inner loop is therefore preferably set to 200 Hz and that of the outer loop to 20 Hz. It should be understood that other frequencies may also be used for the inner and outer loops, provided the difference between them reaches a set threshold; this threshold may be chosen according to the actual application scenario and is not limited here.
In an embodiment of the present invention, if a new task type appears in the system, a task type disappears, and/or the runtime characteristic of a task type changes significantly, this means the workload of the system has changed. In that case, the sub-algorithm in the preset scheduling algorithm determines a new optimal solution of the partition-point parameter; the specific implementation is the same as described above and is not repeated here. If the number of nodes in the system increases or decreases, this means the cluster state of the system has changed; a new optimal solution of the partition-point parameter is likewise determined by the sub-algorithm in the preset scheduling algorithm, again in the same manner as described above.
In the present invention, the method further comprises: determining, for each task type, the difference between its runtime characteristic in the current time window and the sliding average of its historical runtime characteristics; setting the allocation proportion of a third task type, i.e., one for which this difference exceeds a first threshold, to the default allocation proportion; distributing tasks of the third task type to compute or storage nodes for execution according to the default allocation proportion, and collecting the new execution costs of a preset number of its tasks on compute and storage nodes so as to determine a new runtime characteristic; determining the relative cost of the third task type by inputting the new runtime characteristic into the preset scheduling algorithm; sorting all task types by relative cost; determining the optimal solution of the partition-point parameter through the sub-algorithm in the preset scheduling algorithm; and determining the optimal allocation proportions of all task types from the ordering result of all task types and the optimal solution of the partition-point parameter.
In an embodiment of the invention, the difference between each task type's runtime characteristic in the current time window and its historical sliding average is determined. When this difference exceeds a first threshold, the runtime characteristic of that task type has changed significantly; its relative cost and corresponding optimal allocation proportion have therefore also changed, the allocation proportion currently in use is no longer accurate, and both the relative cost and the optimal allocation proportion of the task type need to be corrected.
In that case, the invention sets the allocation proportion of the third task type (the one whose difference between current-window runtime characteristic and historical sliding average exceeds the first threshold) to the default allocation proportion, distributes its tasks to compute or storage nodes accordingly, and collects the new execution costs of a preset number of its tasks on compute and storage nodes to determine a new runtime characteristic. The relative cost of the third task type is then determined by inputting the new runtime characteristic into the preset scheduling algorithm, in the same manner as described above. All task types are re-sorted by relative cost, and the optimal solution of the partition-point parameter is redetermined through the sub-algorithm in the preset scheduling algorithm, again as described above. Finally, from the resulting ordering of all task types, the optimal solution of the partition-point parameter, and the mapping between allocation proportions and the partition-point parameter (i.e., equation (11) above), the optimal allocation proportions of all task types are redetermined.
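The drift check against the sliding historical average might look like the following sketch; the threshold semantics here (relative deviation per cost component) are our assumption, as the patent does not fix a specific metric:

```python
def drifted(window_cost, avg_cost, threshold):
    """Return True when a task type's runtime characteristic has drifted.

    window_cost: (c, s) measured in the current time window
    avg_cost:    (c, s) sliding average over historical windows
    threshold:   maximum tolerated relative deviation per component
    """
    return any(abs(w - h) / h > threshold
               for w, h in zip(window_cost, avg_cost))
```

A type flagged by this check would be reset to the default allocation proportion and re-profiled as described above.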
In an embodiment of the present invention, an alternative implementation is to actively re-measure the runtime characteristics of each task type at preset time intervals, in order to cope with possible changes in runtime characteristics.
In the present invention, the method further comprises: determining whether the rate limiter is dropping requests; when the rate limiter is dropping requests, increasing the capacity of the computing resource pool stepwise until the rate limiter no longer drops requests; and when the rate limiter is not dropping requests, reducing the number of compute nodes stepwise according to the load condition of the system.
In an embodiment of the present invention, with a fixed resource capacity, the rate limiter can limit the number of task requests in the system so as to meet the delay SLO. In real scenarios, however, the volume of task requests may fluctuate significantly over time. To avoid dropping overflow requests when the request volume is high, and to avoid wasting idle resources when it is low, the scheduler of the invention can adjust the capacity of the computing resource pool according to the real-time scale of task requests. The demand for computing resources depends on the number of requests submitted by users and thus varies widely, whereas the demand for storage resources depends on the size of the data volume the application persists, which typically varies little. The invention therefore selects the computing resource pool as the object of elastic resource scaling.
In the embodiment of the invention, to realize elastic resource scaling, a further outer loop is added around the scheduler's existing dual loops. This loop monitors whether the rate limiter is dropping requests: when the rate limiter is dropping overflow requests, the loop increases the capacity of the computing resource pool stepwise until requests are no longer dropped; when no requests are being dropped, it reduces the number of compute nodes stepwise according to the loads ρ_C and ρ_S on the network bandwidth and the storage-node CPUs, on the premise of not affecting throughput or the delay SLO. The embodiment of the invention performs elastic resource scaling at the granularity of CPU cores, and the scaling granularity can be adjusted according to the specific configuration and implementation of the system.
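One illustrative policy for this resource-scaling loop could be the following; the low-watermark heuristic for the shrink condition is our own assumption, not a rule stated in the patent:

```python
def elastic_step(dropping, rho_c, rho_s, n_compute, low_water=0.5):
    """One step of the resource-scaling outer loop (illustrative policy).

    dropping:  whether the rate limiter dropped requests this interval
    rho_c/s:   current loads on network bandwidth and storage CPU
    n_compute: current compute-pool size (e.g. in CPU cores)
    """
    if dropping:
        return n_compute + 1                      # grow until no drops
    if rho_c < low_water and rho_s < low_water and n_compute > 1:
        return n_compute - 1                      # shrink cautiously when idle
    return n_compute
```

Holding the pool steady whenever either load is above the watermark is meant to avoid the oscillation between grow and shrink that a naive policy would exhibit.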
In the present invention, the method further comprises: determining the actual load of each storage node; when the difference between the actual load of a storage node and the average load of the system exceeds a second threshold, partitioning that storage node into a new independent scheduling policy group; partitioning a preset number of compute nodes into the new independent scheduling policy group; and operating each policy group independently as a subsystem for task scheduling and execution.
In a real scenario, the data of an application is typically sharded across multiple storage nodes, and tasks may exhibit a skewed access pattern over the data shards, i.e., a few hot-spot shards carry most of the accesses. In this case, the averages represented by ρ_C and ρ_S diverge substantially from the actual load of the hot-spot shards, which degrades the scheduling effect.
To solve this problem, the invention provides a scheduling policy group mechanism. The mechanism collects the actual load of each storage node. When the access pattern is not skewed, i.e., there are no hot-spot shards, all nodes belong to a default scheduling policy group. When the access pattern is skewed, so that the difference between the actual load of some storage node and the average load of the system exceeds a second threshold, that storage node is partitioned into a new independent scheduling policy group, together with a preset number of compute nodes; each scheduling policy group thus has its own compute and storage nodes. The number of compute nodes in the new scheduling policy group may be initialized according to the computational intensity of the tasks in the group; since the invention supports elastic scaling of computing resources, the choice of initial value does not affect final performance. As the workload changes, storage nodes whose load returns to the average may be merged back into the default scheduling policy group. Each scheduling policy group independently runs the dual-loop control system and the elastic resource scaling of the above embodiments. Because no resources are shared between scheduling policy groups, each policy group operates independently as a subsystem for task scheduling and execution, which means the scheduling algorithms of the policy groups can run in parallel.
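A minimal sketch of the policy-group split is given below; grouping each hot node into its own singleton group is our simplification of the mechanism:

```python
def split_policy_groups(node_loads, threshold):
    """Partition storage nodes into scheduling policy groups.

    A node whose load deviates from the system average by more than
    `threshold` becomes its own group; the rest form the default group.
    Returns (default_group, list_of_hot_groups) as node-index lists.
    """
    avg = sum(node_loads) / len(node_loads)
    default, hot = [], []
    for node, load in enumerate(node_loads):
        (hot if abs(load - avg) > threshold else default).append(node)
    return default, [[n] for n in hot]
```

Each returned group would then run its own dual-loop controller and elastic scaling, as the text describes.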
In the embodiment of the invention, the first task type denotes a task type the system has not yet processed, the second task type denotes a task type the system has processed, and the third task type denotes a task type whose runtime characteristic has changed significantly and needs to be re-measured.
In an embodiment of the present invention, the invention focuses on the hybrid-load scheduling problem (a mixed load of multiple task types) of a resource-decoupled data center, with the goal of maximizing system throughput while satisfying the delay SLO (service level objective) of each task type. A delay SLO typically specifies that the tail delay of a task must not exceed a set constant, and may be set individually for each task type.
The core of the invention is to schedule the various task types separately according to their runtime characteristics, i.e., each task type is scheduled according to its own allocation proportion. Specifically, the invention discovers that the optimal solution of the hybrid-load scheduling problem has a special structure that greatly reduces the complexity of the problem, and on this basis designs an efficient algorithm to solve for the optimal scheduling scheme in real time. In addition, to cope with fluctuations in task request volume over time and make better use of resources, the invention supports elastic scaling of the computing resource pool. Finally, when the data access pattern is skewed, the invention provides a scheduling policy group mechanism to cope with unbalanced load inside the cluster.
In an embodiment of the present invention, the prototype system implemented by the invention consists of three parts, a scheduler, compute nodes, and storage nodes, as shown in Fig. 2. The scheduler schedules tasks to compute or storage nodes for execution, processing the different task types separately and calculating a corresponding optimal allocation proportion for each task type. The scheduler obtains the runtime characteristic of a task by measuring its execution cost on compute and storage nodes, and takes this characteristic as the input of the preset scheduling algorithm. The scheduler also includes a rate limiter that limits the number of task requests based on the end-to-end delay of tasks, so as to meet the delay SLO.
A compute node receives tasks assigned by the scheduler and is responsible for executing them. Compute nodes are equipped with a large number of CPUs (central processing units) and have greater computing power than storage nodes. The CPUs on a compute node are organized into several work units, which access storage nodes remotely to obtain data while performing tasks. A compute node further comprises a monitoring unit for tracking the runtime characteristics of tasks; when a task completes, a feedback message containing these statistics is sent to the scheduler.
A storage node consists of a data store and several work units. The work units on storage nodes are similar to those on compute nodes, but can directly access local data while performing tasks (i.e., computation on the storage side). The number of work units on a storage node is typically smaller than on a compute node. Like a compute node, a storage node also comprises a monitoring unit for tracking the runtime characteristics of tasks; when a task completes, a feedback message containing these statistics is sent to the scheduler.
In embodiments of the present invention, the data accessed by a task may be sharded across multiple storage nodes. By default, the invention assumes a task accesses only one shard, i.e., the data required by each task can be obtained by accessing a single storage node. This default assumption is consistent with the reality of stored procedures in existing distributed databases. When a user submits a task invocation request, the user must indicate the data shard it accesses. If the scheduler chooses to schedule the task to a storage node, that storage node must hold the associated shard; if the scheduler chooses a compute node, there is no such restriction, since the data is obtained remotely over the network.
In embodiments of the invention, scheduler instances may be added or removed independently of the computing and storage resource pools. Because the preset scheduling algorithm only needs to run periodically, the scheduler is integrated as a background process on the load balancers already widely deployed inside data centers. Since the preset scheduling algorithm is not on the critical path of task execution, the scheduler will not typically be a system bottleneck. To meet the throughput requirements of large data centers, multiple load balancer nodes can be added, as is customary in the industry.
To verify the effectiveness of the proposed scheduling method, the invention tests the prototype system with a variety of synthetic and application workloads. Experimental results show that the system achieves sub-second convergence, adapts to various dynamic workloads, improves system throughput by 3 to 21 times compared with current scheduling methods, and exhibits good horizontal scalability.
The server-unaware computing scheduling method for resource-decoupled data centers provided by the invention has the following advantages: application performance is better optimized, as the invention achieves high throughput while guaranteeing the delay SLO; resource utilization is higher, as the invention balances load between the computing and storage resource pools; the invention reacts quickly when the workload or cluster configuration changes, stabilizing the system in an optimal state; and the invention is easy to deploy and use, requiring no code changes in upper-layer applications.
In an embodiment of the invention, the rate limiter adjusts the throughput of the system according to the end-to-end delay of tasks to mitigate the queuing of task requests on compute and storage nodes, thereby ensuring that the delay SLO is satisfied. In the prototype system of the present invention, the rate limiter employs the AIMD algorithm; other congestion-control-style algorithms may also be used. To enable an application to execute both on compute nodes (remote data access) and on storage nodes (local data access), the invention abstracts the remote data store on compute nodes as a local data store, providing the application with an API (application programming interface) consistent with local data access. The monitoring units on compute and storage nodes are responsible for measuring the runtime characteristics of tasks and feeding them back to the scheduler. A monitoring unit only measures the end-to-end execution cost of a task, e.g., the total CPU time occupied, and is thus non-invasive to the application.
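A minimal AIMD-style rate limiter, as a sketch only (parameter values and the per-tick update granularity are our assumptions, not the prototype's configuration):

```python
class AIMDLimiter:
    """Additive-increase / multiplicative-decrease throughput control.

    Each control tick, the admitted throughput grows by `step` while the
    measured tail delay stays within the SLO, and is cut by factor
    `beta` once the SLO is violated.
    """
    def __init__(self, slo, step=1.0, beta=0.5, rate=1.0):
        self.slo, self.step, self.beta, self.rate = slo, step, beta, rate

    def tick(self, tail_delay):
        if tail_delay <= self.slo:
            self.rate += self.step      # additive increase
        else:
            self.rate *= self.beta      # multiplicative decrease
        return self.rate
```

This converges toward the largest throughput at which the tail delay stays under the SLO, which is exactly the quantity R(α) that the outer loop samples.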
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be referred to one another.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," and any variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or terminal device comprising that element.
The server-unaware computing scheduling method for resource-decoupled data centers provided by the invention has been described in detail above. Specific examples have been used herein to illustrate the principle and implementation of the invention, and the description of the above embodiments is intended only to help understand the method of the invention and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and application scope in accordance with the ideas of the invention. In view of the above, the contents of this description should not be construed as limiting the invention.

Claims (7)

1. A server-unaware computing scheduling method for a resource-decoupled data center, applied to a scheduler, the method comprising:
determining a corresponding task type according to a received task-request RPC;
when the task type is a first task type that has not been processed before, setting the allocation proportion corresponding to the first task type to a default allocation proportion;
allocating tasks of the first task type to compute nodes or storage nodes for execution according to the default allocation proportion, and counting the execution costs of a preset number of tasks of the first task type on the compute nodes and the storage nodes, so as to determine the runtime characteristic of the first task type;
determining the optimal allocation proportion of the first task type by inputting the runtime characteristic into a preset scheduling algorithm for calculation;
allocating tasks of the first task type to compute nodes or storage nodes for execution with the corresponding probability, according to the optimal allocation proportion of the first task type;
when the task type is a second task type that has been processed before, obtaining the optimal allocation proportion corresponding to the second task type as determined by the preset scheduling algorithm; and
allocating tasks of the second task type to compute nodes or storage nodes for execution according to the optimal allocation proportion corresponding to the second task type.
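The dispatch logic of claim 1 can be sketched as follows. The default proportion, the warm-up sample count, and all names are illustrative assumptions, not the patented implementation:

```python
import random

# Sketch of the scheduler's dispatch step: each task type keeps an
# allocation proportion p (the probability of running on a compute node).
# A previously unseen type starts at a default proportion until a preset
# number of execution-cost samples has been collected.

DEFAULT_PROPORTION = 0.5   # assumed default allocation proportion
WARMUP_SAMPLES = 100       # assumed "preset number" of profiled tasks

class Scheduler:
    def __init__(self):
        self.proportion = {}   # task type -> probability of compute node
        self.samples = {}      # task type -> observed (node kind, cost) pairs

    def dispatch(self, task_type):
        """Return 'compute' or 'storage' for one task of the given type."""
        p = self.proportion.get(task_type, DEFAULT_PROPORTION)
        return 'compute' if random.random() < p else 'storage'

    def record(self, task_type, node_kind, cost):
        """Accumulate execution costs; returns True once enough samples
        exist to derive the runtime characteristic for the scheduling
        algorithm."""
        self.samples.setdefault(task_type, []).append((node_kind, cost))
        return len(self.samples[task_type]) >= WARMUP_SAMPLES
```

Once `record` reports enough samples, the runtime characteristic would be fed to the preset scheduling algorithm and the entry in `proportion` replaced by the computed optimal value.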
2. The server-unaware computing scheduling method for a resource-decoupled data center of claim 1, wherein the runtime characteristic comprises: the network overhead of data transmission when a compute node executes the task, and the CPU overhead when a storage node executes the task.
3. The server-unaware computing scheduling method for a resource-decoupled data center of claim 1, wherein said determining the optimal allocation proportion of the first task type by inputting the runtime characteristic into a preset scheduling algorithm for calculation comprises:
determining the relative cost of the first task type by inputting the runtime characteristic into the preset scheduling algorithm;
sorting the first task type and the second task type according to relative cost;
determining an optimal solution of a partition point parameter through a sub-algorithm of the preset scheduling algorithm; and
determining the optimal allocation proportion of the first task type according to the sorting result and the optimal solution of the partition point parameter.
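The sort-and-partition idea of claim 3 can be illustrated with a minimal sketch. Whether a low relative cost maps to the compute side, and the fractional handling of the boundary type, are illustrative assumptions; the objective optimized by the patented sub-algorithm is not reproduced here:

```python
# Task types are ordered by relative cost, a partition point splits the
# ordering, and the type sitting exactly at the partition point receives a
# fractional allocation proportion.

def allocation_proportions(relative_costs, partition_point):
    """relative_costs: {task_type: relative cost}.
    partition_point: float index into the sorted order; its integer part
    counts fully assigned types, its fractional part is the proportion for
    the boundary type. Returns {task_type: probability of compute node}."""
    ordered = sorted(relative_costs, key=relative_costs.get)
    proportions = {}
    for i, task_type in enumerate(ordered):
        if i < int(partition_point):
            proportions[task_type] = 1.0   # entirely on compute nodes
        elif i == int(partition_point):
            proportions[task_type] = partition_point - int(partition_point)
        else:
            proportions[task_type] = 0.0   # entirely on storage nodes
    return proportions

allocation_proportions({'a': 0.2, 'b': 0.9, 'c': 0.5}, 1.5)
# {'a': 1.0, 'c': 0.5, 'b': 0.0}
```

The single scalar partition point thus determines every type's allocation proportion at once, which is what makes it a convenient knob for the throughput tuning of claim 4.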
4. The server-unaware computing scheduling method for a resource-decoupled data center of claim 1, further comprising:
adjusting the throughput of the system through a flow limiter according to the end-to-end latency during task execution and the service-level objective given by the user; and
adjusting the partition point parameter of the scheduler according to the throughput of the system so as to maximize the throughput of the system.
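The second step of claim 4 — tuning the partition point against measured throughput — can be sketched as a simple hill climb. The feedback function, step size, and round count are stand-ins assumed for illustration:

```python
# Nudge the scheduler's partition point in both directions and keep any
# move that raises measured system throughput. measure_throughput is a
# placeholder for the real feedback signal from the running system.

def tune_partition_point(measure_throughput, point, step=0.1, rounds=20):
    best = measure_throughput(point)
    for _ in range(rounds):
        for candidate in (point + step, point - step):
            throughput = measure_throughput(candidate)
            if throughput > best:
                best, point = throughput, candidate
    return point, best
```

With a throughput curve that peaks at some interior partition point, repeated small probes converge toward the maximizing value without requiring a model of the workload.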
5. The server-unaware computing scheduling method for a resource-decoupled data center of claim 4, further comprising:
determining the difference between the runtime characteristic of each task type in the current time window and the moving average of its historical runtime characteristic;
resetting to the default allocation proportion the allocation proportion of any third task type whose difference between the runtime characteristic in the current time window and the historical moving average exceeds a first threshold;
allocating tasks of the third task type to compute nodes or storage nodes for execution according to the default allocation proportion, and counting the new execution costs of a preset number of tasks of the third task type on the compute nodes and the storage nodes, so as to determine a new runtime characteristic;
determining the relative cost of the third task type by inputting the new runtime characteristic into the preset scheduling algorithm;
sorting all task types according to the relative cost of each task type;
determining an optimal solution of the partition point parameter through a sub-algorithm of the preset scheduling algorithm; and
determining the optimal allocation proportions of all task types according to the sorting result of all task types and the optimal solution of the partition point parameter.
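The drift-detection step of claim 5 can be sketched with an exponential moving average standing in for the historical moving average. The smoothing factor and threshold are illustrative assumptions:

```python
# Track a moving average of each task type's runtime characteristic; when
# the current window's value deviates beyond a threshold, signal that the
# type should be re-profiled from the default allocation proportion.

class DriftDetector:
    def __init__(self, alpha=0.2, threshold=0.5):
        self.alpha = alpha          # EMA smoothing factor (assumed)
        self.threshold = threshold  # the "first threshold" (assumed)
        self.ema = {}               # task type -> historical moving average

    def update(self, task_type, current_value):
        """Returns True when drift is detected and the type should be
        re-profiled."""
        if task_type not in self.ema:
            self.ema[task_type] = current_value
            return False
        drifted = abs(current_value - self.ema[task_type]) > self.threshold
        self.ema[task_type] = (self.alpha * current_value
                               + (1 - self.alpha) * self.ema[task_type])
        return drifted
```

A drifted type would then be re-profiled and the full sort-and-partition computation of claim 3 re-run over all task types.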
6. The server-unaware computing scheduling method for a resource-decoupled data center of claim 1, further comprising:
determining whether the flow limiter is discarding requests;
when the flow limiter is discarding requests, increasing the capacity of the computing resource pool stepwise until the flow limiter no longer discards requests; and
when the flow limiter is not discarding requests, reducing the number of compute nodes stepwise according to the load condition of the system.
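One autoscaling round under claim 6 can be sketched as a pure decision function. The low-load threshold, step size, and minimum pool size are illustrative assumptions:

```python
# One step of the autoscaling rule: grow the compute pool while the flow
# limiter is dropping requests; shrink it when no requests are dropped and
# per-node load is light. All thresholds are assumed for illustration.

def autoscale(num_nodes, dropping_requests, load_per_node,
              step=1, min_nodes=1, low_load=0.3):
    if dropping_requests:
        return num_nodes + step                  # scale out until no drops
    if load_per_node < low_load and num_nodes > min_nodes:
        return max(min_nodes, num_nodes - step)  # scale in under light load
    return num_nodes                             # otherwise hold steady
```

Run periodically, this converges to the smallest compute pool that keeps the flow limiter from discarding requests, which is the stepwise behavior the claim describes.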
7. The server-unaware computing scheduling method for a resource-decoupled data center of claim 1, further comprising:
determining the actual load of each storage node;
when the difference between the actual load of a storage node and the average load of the system exceeds a second threshold, dividing that storage node into a new independent scheduling policy group;
dividing a preset number of compute nodes into the new independent scheduling policy group; and
each policy group operating independently as a subsystem for task scheduling and execution.
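The hot-spot isolation of claim 7 can be sketched as a grouping function. The threshold value and the number of compute nodes per group are illustrative assumptions:

```python
# Split off any storage node whose load deviates from the system average by
# more than a threshold into its own policy group, together with a preset
# number of compute nodes; the remaining nodes form a shared default group.

def split_policy_groups(storage_loads, compute_nodes,
                        threshold=0.4, computes_per_group=2):
    """storage_loads: {storage_node: load}. Returns a list of policy
    groups, each a (storage_nodes, compute_nodes) pair scheduled
    independently."""
    avg = sum(storage_loads.values()) / len(storage_loads)
    hot = [n for n, load in storage_loads.items() if abs(load - avg) > threshold]
    groups, remaining = [], list(compute_nodes)
    for node in hot:
        dedicated = remaining[:computes_per_group]
        remaining = remaining[computes_per_group:]
        groups.append(([node], dedicated))       # independent subsystem
    normal = [n for n in storage_loads if n not in hot]
    groups.append((normal, remaining))           # default shared group
    return groups
```

Each returned pair would then run its own scheduler instance, so a hot storage node cannot distort the allocation proportions computed for the rest of the system.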
CN202310149359.4A 2023-02-16 2023-02-16 Resource decoupling data center-oriented server non-perception calculation scheduling method Active CN116302404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310149359.4A CN116302404B (en) 2023-02-16 2023-02-16 Resource decoupling data center-oriented server non-perception calculation scheduling method


Publications (2)

Publication Number Publication Date
CN116302404A CN116302404A (en) 2023-06-23
CN116302404B (en) 2023-10-03

Family

ID=86837103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310149359.4A Active CN116302404B (en) 2023-02-16 2023-02-16 Resource decoupling data center-oriented server non-perception calculation scheduling method

Country Status (1)

Country Link
CN (1) CN116302404B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116662290B (en) * 2023-07-24 2023-09-29 北京大学 Read optimization method and device for stateful server non-perceptual function

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105912401A (en) * 2016-04-08 2016-08-31 中国银行股份有限公司 Distributed data batch processing system and method
CN107832153A (en) * 2017-11-14 2018-03-23 北京科技大学 A kind of Hadoop cluster resources self-adapting distribution method
CN112988360A (en) * 2021-05-10 2021-06-18 杭州绿城信息技术有限公司 Task distribution system based on big data analysis
CN113238848A (en) * 2021-05-27 2021-08-10 上海商汤科技开发有限公司 Task scheduling method and device, computer equipment and storage medium
CN113742059A (en) * 2021-07-15 2021-12-03 上海朋熙半导体有限公司 Task allocation method and device, computer equipment and storage medium
WO2022028157A1 (en) * 2020-08-03 2022-02-10 同济大学 Elastic scaling method and system for microservice system in cloud environment, medium and device


Non-Patent Citations (2)

Title
Performance improvement in cloud computing through dynamic task scheduling algorithm; Shital Patil et al.; 2015 1st International Conference on Next Generation Computing Technologies (NGCT); entire document *
A big-data placement strategy for adaptively balanced data storage in heterogeneous Hadoop clusters; Zhang Shaohui et al.; Modern Electronics Technique (现代电子技术); entire document *


Similar Documents

Publication Publication Date Title
CN110928654B (en) Distributed online task unloading scheduling method in edge computing system
CN109617826B (en) Storm dynamic load balancing method based on cuckoo search
CN107911478B (en) Multi-user calculation unloading method and device based on chemical reaction optimization algorithm
CN112039965B (en) Multitask unloading method and system in time-sensitive network
CN110297699B (en) Scheduling method, scheduler, storage medium and system
CN109885397B (en) Delay optimization load task migration algorithm in edge computing environment
CN111722910B (en) Cloud job scheduling and resource allocation method
CN111225050B (en) Cloud computing resource allocation method and device
CN110717300A (en) Edge calculation task allocation method for real-time online monitoring service of power internet of things
Mekala et al. Resource offload consolidation based on deep-reinforcement learning approach in cyber-physical systems
CN116302404B (en) Resource decoupling data center-oriented server non-perception calculation scheduling method
CN112799823A (en) Online dispatching and scheduling method and system for edge computing tasks
CN115629865B (en) Deep learning inference task scheduling method based on edge calculation
Tian et al. User preference-based hierarchical offloading for collaborative cloud-edge computing
CN114938372B (en) Federal learning-based micro-grid group request dynamic migration scheduling method and device
CN107566535B (en) Self-adaptive load balancing method based on concurrent access timing sequence rule of Web map service
CN110996390B (en) Wireless access network computing resource allocation method and network system
CN116302578B (en) QoS (quality of service) constraint stream application delay ensuring method and system
Chatterjee et al. A new clustered load balancing approach for distributed systems
Guo Ant colony optimization computing resource allocation algorithm based on cloud computing environment
CN114567564B (en) Task unloading and computing resource allocation method based on server collaboration
Guo et al. Multi-resource fair allocation for composited services in edge micro-clouds
Yao et al. A Power Multi-Service Transmission Scheduling Method in 5G Edge-Cloud Collaboration Scenario
CN113438743B (en) Self-adaptive multi-server polling access control method and system
CN116627663B (en) Data center operation and maintenance management method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant