CN108139930B - Resource scheduling method and device based on Q learning - Google Patents


Info

Publication number
CN108139930B
Authority
CN
China
Prior art keywords
state
action
value
combination
application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201680056785.1A
Other languages
Chinese (zh)
Other versions
CN108139930A
Inventor
亚伊·阿里安
夏伊·霍罗威茨
郑淼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN108139930A publication Critical patent/CN108139930A/en
Application granted granted Critical
Publication of CN108139930B publication Critical patent/CN108139930B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt

Abstract

A resource scheduling method and device based on Q learning can improve resource scheduling performance. The method comprises the following steps: updating a Q value corresponding to a first state-action combination of a plurality of state-action combinations of an application to a first value according to a return value of the first state-action combination (S210), wherein the first state-action combination represents that a first action is executed when the application is in a first state, and the first state is the state in which the application is during a second feedback period earlier than the first feedback period; updating the Q value corresponding to at least one state-action combination in the plurality of state-action combinations according to the first value; determining the action corresponding to the state-action combination with the maximum Q value among at least two state-action combinations corresponding to the current state, wherein the current state is the state of the application in the first feedback period; and adjusting the amount of resources allocated to the application according to the determined action (S230).

Description

Resource scheduling method and device based on Q learning
Technical Field
The embodiment of the invention relates to the technical field of information, in particular to a resource scheduling method and device based on Q learning.
Background
Reinforcement learning (also called evaluative learning) is an important machine learning method with many applications in fields such as intelligent robot control, analysis and prediction. Reinforcement learning is the learning, by an intelligent system, of a mapping from environment to behavior so as to maximize the value of a reward value function. In reinforcement learning, the reward value provided by the environment evaluates how good or bad an action is, rather than telling the learning system how to produce the correct action. Since the external environment provides very little information, the reinforcement learning system must learn from its own experience. In this way, reinforcement learning gains knowledge through an action-evaluation loop and improves its action scheme to suit the environment. The Q-learning method is one of the most classical algorithms in reinforcement learning and is a model-free learning algorithm.
A data center may perform resource scheduling of an application based on the Q-learning method described above. In the Q-learning-based resource scheduling method, the current state of the application may be determined, the target action having the maximum Q value may be selected from all candidate actions of the current state, and the target action may be executed; then, the next state entered after the target action is executed in the current state may be determined, and the Q value of the target action in the current state may be updated according to the maximum Q value over all candidate actions in the next state. However, in the existing Q-learning-based resource scheduling method, a large number of feedback periods are required before the Q values corresponding to the candidate actions in each state of the application reach a stable convergence state, in which the action corresponding to the largest Q value no longer changes for most states in the Q table; that is, once the Q table has converged, the same action is taken in the same state.
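As an aid to understanding, the following is a minimal sketch of the classic per-entry Q-learning scheduling loop described above. The helper callables observe_state, execute_action and compute_return are hypothetical stand-ins for the data center's monitoring and actuation interfaces, and q_table is assumed to be a dict-like object keyed by (state, action) pairs; none of these names come from the patent.

```python
GAMMA = 0.9  # discount factor (an adjustable parameter)
ALPHA = 0.5  # learning rate (an adjustable parameter)

def q_learning_step(q_table, actions, observe_state, execute_action, compute_return):
    """Run one feedback period of classic Q-learning resource scheduling."""
    state = observe_state()
    # Select the candidate action with the largest Q value in the current state.
    action = max(actions, key=lambda a: q_table[(state, a)])
    execute_action(action)                   # e.g. add or remove machines
    next_state = observe_state()             # state observed one feedback period later
    reward = compute_return(next_state)      # return value of (state, action)
    best_next = max(q_table[(next_state, a)] for a in actions)
    # In the classic scheme only the single entry (state, action) is refreshed.
    q_table[(state, action)] += ALPHA * (reward + GAMMA * best_next
                                         - q_table[(state, action)])
    return state, action
```

Because only one table entry is refreshed per feedback period, convergence of the whole table is slow, which is the problem the embodiments below address.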
Disclosure of Invention
The embodiment of the invention provides a resource scheduling method and device based on Q learning, which can improve the resource scheduling performance.
In a first aspect, a resource scheduling method based on Q learning is provided, including: updating, in a first feedback period, a Q value corresponding to a first state-action combination of a plurality of state-action combinations of an application to a first value according to a return value of the first state-action combination, wherein the first state-action combination represents that a first action is executed when the application is in a first state, the first state is the state in which the application is during a second feedback period earlier than the first feedback period, and the first action is used for adjusting the amount of resources allocated to the application; updating, in the first feedback period, the Q value corresponding to at least one state-action combination different from the first state-action combination in the plurality of state-action combinations according to the first value; determining the action corresponding to the state-action combination with the maximum Q value among at least two state-action combinations corresponding to the current state, wherein the current state is the state of the application in the first feedback period; and adjusting, in the first feedback period, the amount of resources allocated to the application according to the determined action.
Optionally, the application is in the first state in the second feedback cycle, and a first action is taken in the first state, then a return value corresponding to the first state-action combination may be determined according to the current state of the application in the first feedback cycle, and the Q value corresponding to the first state-action combination may be updated according to the return value.
After updating the Q value corresponding to the first state-action combination, the Q value corresponding to each state-action combination in at least one state-action combination different from the first state-action combination of the application can be updated according to the updated Q value corresponding to the first state-action combination, that is, the first numerical value, so that the convergence rate of the Q values corresponding to the state-action combinations of the Q table can be increased, and the resource scheduling performance based on Q learning can be improved.
In a first possible implementation form of the first aspect, the at least one state-action combination comprises a second state-action combination representing that a second action different from the first action is performed when the application is in the first state; the updating the Q value corresponding to at least one state-action combination of the plurality of state-action combinations other than the first state-action combination according to the first value includes: and updating the Q value corresponding to the second state-action combination according to the first value and the adjustment direction of the second action to the quantity of the resources allocated to the application compared with the first action.
Optionally, the Q value of the second state-action combination may be updated according to the updated Q value corresponding to the first state-action combination and the internal logical relationship between the first action and the second action, for example, the magnitude relationship between the adjustment magnitudes of the actions.
With reference to the foregoing possible implementation manners, in a second possible implementation manner of the first aspect, the updating the Q value corresponding to the second state-action combination according to the first value and the adjustment direction of the second action to the amount of resources allocated to the application compared with the first action includes: updating the Q value corresponding to the second state-action combination to a value greater than the first value if the reward value is less than zero and the second action adjusts the amount of resources allocated to the application toward an increased amount compared with the first action; and/or updating the Q value corresponding to the second state-action combination to a value less than the first value if the reward value is less than zero and the second action adjusts the amount of resources allocated to the application toward a reduced amount compared with the first action.
Optionally, if the return value is less than zero and the adjustment amplitude of the second action is smaller than that of the first action, the Q value corresponding to the second state-action combination may be updated to a value smaller than the first value.
Optionally, if the return value is less than zero and the adjustment amplitude of the second action is larger than that of the first action, the Q value corresponding to the second state-action combination may be updated to a value larger than the first value.
With reference to the foregoing possible implementations, in a third possible implementation manner of the first aspect, the at least one state-action combination further includes at least one third state-action combination, and the third state-action combination represents that an action different from the first action is performed when the application is in the first state; the updating the Q value corresponding to at least one state-action combination of the plurality of state-action combinations other than the first state-action combination according to the first value includes: and updating the Q value corresponding to each of the at least one third state-action combination according to the first value, so that the Q value corresponding to the state-action combination corresponding to the first state monotonically decreases from the target action as the starting point in the direction of increasing the number of resources allocated to the application, and/or so that the Q value corresponding to the state-action combination corresponding to the first state monotonically decreases from the target action as the starting point in the direction of decreasing the number of resources allocated to the application, wherein the Q value corresponding to the state-action combination composed of the target action and the first state is the largest in all the state-action combinations corresponding to the first state.
Optionally, the updating of the Q value corresponding to the at least one third state-action combination may be such that the updated Q values of the state-action combinations corresponding to at least one action with an adjustment amplitude greater than that of the target action decrease as the adjustment amplitude of the action increases, and/or such that the updated Q values of the state-action combinations corresponding to at least one action with an adjustment amplitude smaller than that of the target action increase as the adjustment amplitude of the action increases.
Optionally, the Q values corresponding to the state-action combinations of the first state and at least one other action may be updated according to the updated Q values of the first state-action combination, so that the Q values of all the state-action combinations corresponding to the first state satisfy the unimodal-ipsilateral monotonicity.
With reference to the foregoing possible implementations, in a fourth possible implementation of the first aspect, the at least one state-action combination includes a fourth state-action combination, and the fourth state-action combination indicates that the first action is performed when the application is in a second state different from the first state; the updating the Q value corresponding to at least one state-action combination of the plurality of state-action combinations other than the first state-action combination according to the first value includes: and updating the Q value corresponding to the fourth state-action combination according to the first numerical value and the values of the state characteristic parameters of the first state and the second state.
Optionally, the Q value corresponding to the second state and the fourth state-action combination formed by the first action may be updated according to the updated Q value of the first state-action combination and the internal logical relationship between the first state and the second state.
With reference to the foregoing possible implementation manners, in a fifth possible implementation manner of the first aspect, the state feature parameter includes an average resource occupancy rate; the updating the Q value corresponding to the fourth state-action combination according to the first value and the values of the state characteristic parameters of the first state and the second state includes: updating the Q value of the fourth state-action combination according to the first value and the values of the average resource occupancy rates of the first state and the second state.
Optionally, if the return value is less than zero and the average resource occupancy of the second state is higher than the average resource occupancy of the first state, the Q value corresponding to the fourth state-action combination is updated to a value less than the first value.
With reference to the foregoing possible implementation manners, in a sixth possible implementation manner of the first aspect, the status characteristic parameter includes an average resource occupancy change rate, where the average resource occupancy change rate is used to reflect a change trend of the average resource occupancy; the updating the Q value corresponding to the fourth state-action combination according to the first value and the values of the state characteristic parameters of the first state and the second state includes: and updating the Q value corresponding to the fourth state-action combination according to the first value and the values of the average resource occupation change rates of the first state and the second state.
Introducing the average resource occupancy change rate to represent the application state allows the state of the application to be described more accurately, thereby further improving the Q-learning-based resource scheduling performance.
Optionally, the value of the average resource occupancy change rate of the application in the current feedback period may be determined from the values of the average resource occupancy of the application in at least three feedback periods, for example three consecutive feedback periods, such as the current feedback period and the two feedback periods preceding it.
With reference to the foregoing possible implementation manners, in a seventh possible implementation manner of the first aspect, the updating the Q value corresponding to the fourth state-action combination according to the first value and the values of the average resource occupancy change rates of the first state and the second state includes: if the return value is less than zero and the value of the average resource occupancy change rate of the second state is higher than the value of the average resource occupancy change rate of the first state, updating the Q value corresponding to the fourth state-action combination to a value less than the first value.
With reference to the foregoing possible implementations, in an eighth possible implementation manner of the first aspect, the at least one state-action combination further includes at least one fifth state-action combination, where the fifth state-action combination indicates that an action different from the first action is performed when the application is in the second state; the updating the Q value corresponding to at least one state-action combination different from the first state-action combination in the plurality of state-action combinations according to the first value further comprises: when the reported value is greater than zero and the first value is the maximum value of the Q values corresponding to all the state-action combinations corresponding to the first state, if the first state and the second state differ in that the value of the average resource occupancy change rate of the second state is higher than the value of the average resource occupancy change rate of the first state, updating the Q value corresponding to the at least one fifth state-action combination according to the updated Q value corresponding to the fourth state-action combination, so that in all the state-action combinations corresponding to the second state, the Q value corresponding to the fourth state-action combination or the state-action combination composed of the second state and the third action is maximum, wherein the third action adjusts the amount of resources allocated to the application toward an increased amount as compared to the first action.
Alternatively, the second state may be distinguished from the first state only in that the value of the average rate of change of resource occupancy of the second state is higher than the value of the average rate of change of resource occupancy of the first state.
Optionally, when the reward value is greater than zero and the first value is still the maximum value of the Q values of all state-action combinations corresponding to the first state, the Q values of one or more state-action combinations corresponding to the second state may be updated, so that, among all state-action combinations corresponding to the second state, the combination with the largest Q value is the one composed of the second state and either the first action or an action whose adjustment amplitude is greater than that of the first action.
With reference to the foregoing possible implementations, in a ninth possible implementation manner of the first aspect, the at least one state-action combination further includes at least one sixth state-action combination, where the sixth state-action combination indicates that an action different from the first action is performed when the application is in the second state; the updating the Q value corresponding to at least one state-action combination different from the first state-action combination in the plurality of state-action combinations according to the first value further comprises: when the reward value is greater than zero and the first value is the maximum value of the Q values of all state-action combinations corresponding to the first state, if the difference between the first state and the second state is that the value of the average resource occupancy of the second state is greater than the value of the average resource occupancy of the first state, updating the Q value of the at least one sixth state-action combination according to the updated Q value corresponding to the fourth state-action combination, so that the Q value corresponding to the fourth state-action combination or a state-action combination composed of the second state and a fourth action is the maximum in all state-action combinations corresponding to the second state, wherein the fourth action adjusts the number of resources allocated to the application toward an increased number compared to the first action.
Alternatively, the second state may differ from the first state only in that the value of the average resource occupancy of the second state is higher than the value of the average resource occupancy of the first state.
Optionally, when the reward value is greater than zero and the first value is still the maximum value of the Q values of all state-action combinations corresponding to the first state, the Q values of one or more state-action combinations corresponding to the second state may be updated, so that, among all state-action combinations corresponding to the second state, the combination with the largest Q value is the one composed of the second state and either the first action or an action whose adjustment amplitude is greater than that of the first action.
With reference to the foregoing possible implementations, in a tenth possible implementation of the first aspect, the at least one state-action combination further includes a seventh state-action combination, where the seventh state-action combination indicates that the first action is performed when the application is in a third state, and where the first state, the second state, and the third state differ in the amount of resources allocated to the application; the updating the Q value of at least one state-action combination of the plurality of state-action combinations other than the first state-action combination according to the first value further comprises: determining the updated Q value of the seventh state-action combination by interpolation, extrapolation or polynomial fitting according to the first value and the updated Q value corresponding to the fourth state-action combination.
Alternatively, the Q value corresponding to the state-action combination of the first action and the third state may be updated by interpolation, extrapolation or polynomial fitting and the updated Q value of the first state-action combination, wherein the third state may be different from the first state only in the number of resources.
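A sketch of how such an estimate could be computed with linear interpolation/extrapolation follows; linear fitting is only one of the options named above (interpolation, extrapolation or polynomial fitting), and the function and variable names are illustrative rather than taken from the patent.

```python
def estimate_q_by_interpolation(q_by_machines, machines_target):
    """Estimate the Q value of (third state, first action) from states that
    differ only in machine count, by linear interpolation/extrapolation.

    q_by_machines maps a machine count to an already-updated Q value for the
    same action (e.g. the first value and the updated fourth-combination value);
    it must contain at least two entries.
    """
    (m0, q0), (m1, q1) = sorted(q_by_machines.items())[:2]
    if m1 == m0:
        return q0
    slope = (q1 - q0) / (m1 - m0)
    return q0 + slope * (machines_target - m0)  # extrapolates outside [m0, m1]
```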
With reference to the foregoing possible implementation manners, in an eleventh possible implementation manner of the first aspect, the method further includes: if the proportion of the number of the at least one state-action combination in the plurality of state-action combinations exceeds a preset threshold value, updating the Q value corresponding to the state-action combination which is not updated in the plurality of state-action combinations by using an association rule mining method in the first feedback period.
In a second aspect, a resource scheduling apparatus based on Q learning is provided, which is configured to perform the method in the first aspect or any possible implementation manner of the first aspect. In particular, the apparatus comprises means for performing the method of the first aspect described above or any possible implementation manner of the first aspect.
In a third aspect, another apparatus for resource scheduling based on Q learning is provided, including: a non-transitory computer readable storage medium for storing instructions and a processor for executing the instructions stored by the non-transitory computer readable storage medium, and when the processor executes the instructions stored by the non-transitory computer readable storage medium, the execution causes the processor to perform the first aspect or the method in any possible implementation manner of the first aspect.
In a fourth aspect, there is provided a non-transitory computer-readable storage medium for storing a computer program comprising instructions for performing the method of the first aspect or any possible implementation manner of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention or the prior art will be briefly described below.
Fig. 1 is a schematic flow chart of a resource scheduling method based on Q learning in the prior art.
Fig. 2 is a schematic flowchart of a resource scheduling method based on Q learning according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating a variation curve of the Q value of the state-action combination corresponding to the same state with the action in the method according to the embodiment of the present invention.
Fig. 4 is a schematic block diagram of a resource scheduling apparatus based on Q learning according to an embodiment of the present invention.
Fig. 5 is a schematic block diagram of another resource scheduling apparatus based on Q learning according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
It should be understood that the technical solution of the embodiment of the present invention can be applied to various fields, for example, the field of resource adaptive scheduling of a data center. The data center may include a computer cluster, and the data center may adjust, in real time, the number of machines (e.g., virtual machines, containers, etc.) allocated to the application according to information such as a load change condition of the application, for example, increase or decrease the number of resources, or keep the number of machines unchanged, and so on, so as to achieve the purpose of improving the overall resource utilization rate of the data center on the premise of effectively meeting the application requirements.
The basic concepts involved in this document are briefly described below.
The state of the application: describing the current running condition of the application, it may be denoted as S (M, U), where M denotes the number of machines used by the current application, where the machines herein may include Physical Machines (PM), Virtual Machines (VM), and/or containers (Docker), and so on, and U may denote the average resource occupancy rate of all machines in the cluster of machines currently used by the application.
Resources of the data center: may include one or more of computing resources, storage resources, and network resources of a data center. Where computing resources may be represented as processors or processor arrays, storage resources may be represented as memories and memory arrays, network resources may be represented as network ports and a matrix of network ports, and so on. Moreover, scheduling of application resources in a data center can be accomplished by adjusting the number of machines assigned to the application, and so forth.
The actions are as follows: various actions that the Q learning method can take in the data center cluster may be specifically set according to the load condition of the application. For example, when scheduling resources in a data center cluster based on Q learning, an action may be used to adjust the number of resources or the number of machines allocated to an application, for example, to reduce the number of machines, keep the number of machines unchanged, or increase the number of machines, where the specific adjustment amount of the action to the resources allocated to the application may be set according to actual needs, and is not limited in the embodiment of the present invention.
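The concepts above can be illustrated with the following small sketch. The field names, the fixed action set and the representation of an action as a machine-count delta are assumptions made for the example, not identifiers defined by the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AppState:
    machines: int          # M: number of machines (PMs, VMs or containers) in use
    avg_occupancy: float   # U: average resource occupancy of those machines

# Candidate actions expressed as a change in the number of machines allocated
# to the application; the concrete action set is configurable.
ACTIONS = (-2, -1, 0, +1, +2)  # remove 2, remove 1, keep unchanged, add 1, add 2
```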
Reward value function (also called return value function): used to determine a reward value (also referred to as a return value) of a state-action combination (S, a), where (S, a) indicates that action a is performed when the application is in state S. The reward value may be used to evaluate the effect of performing action a when the application is in state S; for example, if the reward value is positive, it indicates that the Service Level Objective (SLO) of the application can be satisfied in time after the action is taken; if the reward value is negative, it indicates that the SLO of the application cannot be satisfied after the action is taken.
By way of example, the reward value function may be represented by the following equation:
[Formula (1), shown as an image in the original document: the reward value expressed as a function of U, p, T_resp and SLO]
where U may represent the average resource occupancy of all machines currently in use by the application, p is a configuration parameter set by default to 2, T_resp represents the 99th-percentile response time of the data center, i.e. a value greater than or equal to the response time of 99% of the application requests of the data center, and SLO may represent the service level objective for the 99th-percentile response time, used to ensure that 99% of application requests are responded to in time.
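For illustration only, the sketch below shows a reward function over the stated inputs. The functional form U**p * (SLO − T_resp) / SLO is an assumption, chosen solely because it matches the described behavior (positive when the 99th-percentile response time meets the SLO, negative otherwise, with p defaulting to 2); the patent's actual formula (1) appears only as an image and may differ.

```python
def reward_value(avg_occupancy: float, t_resp: float, slo: float, p: float = 2.0) -> float:
    """Illustrative reward over U, p, T_resp and SLO (assumed form, not formula (1)).

    Positive when t_resp is within the SLO, negative when it exceeds it,
    weighted by the average resource occupancy raised to p.
    """
    return (avg_occupancy ** p) * (slo - t_resp) / slo
```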
Q value: used to measure the cumulative return of taking an action in a given state. Its update can be represented by the following formula:

Q(s_t, a) ← Q(s_t, a) + c · [ r(s_t, a) + γ · max_{a'} Q(s_{t+T}, a') − Q(s_t, a) ]    (2)

where c and γ are adjustable parameters, r(s_t, a) represents the return value corresponding to the state-action combination (s_t, a), the state s_t may be the state of the application at time t, Q(s_t, a) represents the Q value corresponding to the state-action combination (s_t, a), and max_{a'} Q(s_{t+T}, a') may represent the maximum Q value among all state-action combinations of state s_{t+T}, where s_{t+T} may be the state of the application at time t + T and the state-action combinations of s_{t+T} may be composed of s_{t+T} and all optional actions a'.
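A minimal sketch of the update in formula (2) is shown below, assuming a dict-like Q table keyed by (state, action) pairs; the default values for c and γ are illustrative, since the patent only calls them adjustable parameters.

```python
def update_q_value(q_table, s_t, a, return_value, s_t_plus_T, actions,
                   c=0.5, gamma=0.9):
    """Apply formula (2): move Q(s_t, a) toward r(s_t, a) + gamma * max_a' Q(s_{t+T}, a')."""
    best_next = max(q_table[(s_t_plus_T, a2)] for a2 in actions)
    q_table[(s_t, a)] += c * (return_value + gamma * best_next - q_table[(s_t, a)])
    return q_table[(s_t, a)]
```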
Q table: used to record the Q values of the various possible state-action combinations composed of all possible states and all optional actions of the application. In the example shown in Table 1, column 1 of the Q table indicates the possible application states, and columns 2 through M+1 of the Q table represent M selectable actions, respectively. Q_ij denotes the value in row i and column j of the Q table, specifically the Q value corresponding to the state-action combination composed of the application state in row i and the action in column j, that is, the Q value obtained when the action in column j is taken in the application state in row i.
Table 1. Example of a Q table

State \ Action    Action 1    Action 2    ...    Action M
State 1           Q11         Q12         ...    Q1M
...               ...         ...         ...    ...
State N           QN1         QN2         ...    QNM
Specifically, the Q table of the application may be initialized first, for example, the Q table of the application may be initialized randomly or all values in the Q table are initialized to fixed values, and then the Q table of the application may be utilized to perform resource scheduling on the application.
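One simple way to represent and initialize such a Q table is sketched below; keying by (state, action) pairs and the default initial value of 0.0 are assumptions made for the example.

```python
from collections import defaultdict

def make_q_table(initial_value=0.0):
    """Q table keyed by (state, action) pairs, as in Table 1.

    Entries that have not been written yet fall back to a fixed initial value;
    random initialization, as mentioned above, would work equally well.
    """
    return defaultdict(lambda: initial_value)

q_table = make_q_table()
# q_table[(some_state, some_action)] evaluates to 0.0 until it is updated
```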
Fig. 1 illustrates a prior-art method 100 for resource scheduling of an application. In S110, the state S of the application at time t is determined. In S120, the Q table is consulted to determine the action a corresponding to the maximum Q value in state S. In S130, action a is performed, such as increasing the number of machines allocated to the application, decreasing the number of machines allocated to the application, or keeping the number of machines allocated to the application unchanged. In S140, after action a is performed, the average resource occupancy of the application at time t + T is obtained, where T is the system feedback period. In S150, the return value for taking action a in state S is calculated using parameters such as the average resource occupancy and the SLO of the application. In S160, the Q table is consulted to determine the maximum Q value of the application in the state S' at time t + T, and the Q value corresponding to the state-action combination (S, a) in the Q table may be updated according to the above return value, the Q value of the state-action combination (S, a), and the maximum Q value in state S'.
However, in the prior art, at time t + T only the Q value corresponding to the state-action combination (S, a), i.e. the maximum Q value corresponding to the application's state S at time t, is updated. Thus, a large number of feedback periods are required before the Q table of the application reaches a stable convergence state, and while the Q table is still converging, resource scheduling based on the Q table performs poorly.
It should be understood that, in the embodiment of the present invention, the term "target action in the application state" may specifically refer to an action corresponding to a state-action combination with the largest Q value among all state-action combinations corresponding to the application state.
Fig. 2 shows a resource scheduling method 200 based on Q learning according to an embodiment of the present invention, and for convenience of understanding, the method 200 is described below by taking resource scheduling performed in a certain feedback period (i.e., a first feedback period or a current feedback period) as an example, but the embodiment of the present invention is not limited thereto.
S210, updating a Q value corresponding to a first state-action combination of a plurality of state-action combinations of an application to a first value according to a return value of the first state-action combination, where the first state-action combination indicates that a first action is performed when the application is in a first state, the first state is the state in which the application is during a second feedback period earlier than the first feedback period, and the first action is used to adjust the amount of resources allocated to the application.
Specifically, in a second feedback period, an application may be in the first state, where the second feedback period is earlier than the first feedback period, for example, the second feedback period is a feedback period before the first feedback period, but the embodiment of the invention is not limited thereto.
In the second feedback period, the first action may be performed; for example, the Q value corresponding to the first state-action combination composed of the first state and the first action may be the largest among all state-action combinations corresponding to the first state. The first action may be used to adjust the amount of resources allocated to the application. Optionally, adjusting the amount of resources allocated to the application may be achieved by adjusting the number of machines allocated to the application; for example, when the application is in the first state and the number of machines allocated to the application is N0, the first action may be used to increase the number of machines allocated to the application beyond N0, to reduce the number of machines allocated to the application below N0, or to keep the number of machines allocated to the application at N0. Alternatively, the adjustment of the amount of resources allocated to the application may be implemented in other manners, which is not limited in the embodiment of the present invention.
Optionally, during a first feedback period, a current state of the application may be determined, where the state of the application during the first feedback period may be referred to as a current state, and a reward value corresponding to the first state-action combination may be determined during the first feedback period. The reward value corresponding to the first state-action combination may be used to indicate an effect of performing the first action when the application is in the first state, and optionally, the reward value may be determined by equation (1) or other manners, which is not limited in this embodiment of the present invention.
In the first feedback period, the Q value corresponding to the first state-action combination may be updated according to the return value corresponding to the first state-action combination. For example, the maximum value of the Q values of all state-action combinations corresponding to the current state may be determined, and the updated Q value corresponding to the first state-action combination may be determined according to that maximum value and the return value corresponding to the first state-action combination, where optionally, the updated Q value may be determined by equation (2), but the embodiment of the present invention is not limited thereto.
S220, updating the Q value corresponding to at least one state-action combination different from the first state-action combination in the plurality of state-action combinations according to the first value.
The at least one state-action combination may be embodied as one or more state-action combinations, and each state-action combination of the at least one state-action combination may be different from the first state-action combination. For example, the state corresponding to a state-action combination of the at least one state-action combination may be the first state, and the action corresponding to that state-action combination may be another action different from the first action; or, the action corresponding to a certain state-action combination of the at least one state-action combination may be the first action, and the state in that state-action combination may be another state different from the first state; alternatively, the state in a certain state-action combination of the at least one state-action combination may be another state different from the first state, and the action in that state-action combination may be another action different from the first action, but the embodiment of the present invention is not limited thereto.
Optionally, after updating the Q value corresponding to each state-action combination in the at least one state-action combination, resource scheduling may be performed according to the updated Q values corresponding to the at least one state-action combination and the first state-action combination, that is, the amount of resources allocated to the application is adjusted, but the embodiment of the present invention is not limited thereto.
And S230, adjusting the quantity of the resources allocated to the application according to a target action in the current state of the application, wherein the target action is the action with the maximum corresponding Q value in at least two state-action combinations corresponding to the current state of the application.
Specifically, the action corresponding to the state-action combination with the largest Q value may be determined in at least two state-action combinations corresponding to the current state, where optionally, the at least two state-action combinations corresponding to the current state may be specifically all state-action combinations corresponding to the current state. For example, assuming that the Q table format of the application is as shown in table 1, in the first feedback period, the maximum Q value in the corresponding row of the current state may be looked up in the Q table of the application, and the action corresponding to the maximum Q value (i.e., the action corresponding to the column of the maximum Q value) may be determined as the target action in the current state. Then, the target action in the current state may be performed to perform an adjustment process on the number of resources allocated to the application, where the adjustment process may include increasing the number of resources (e.g., increasing the number of machines), decreasing the number of resources (e.g., decreasing the number of machines), or keeping the number of resources unchanged (e.g., keeping the number of machines unchanged), but the embodiment of the present invention is not limited thereto.
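The per-feedback-period cycle S210-S230 can be summarized in the following sketch. The callables observe_state, compute_return, execute_action and propagate are assumed helpers: propagate stands in for the S220 propagation heuristics detailed in the following sections, and the defaults for c and γ are illustrative.

```python
def run_feedback_period(q_table, actions, prev_state, prev_action,
                        observe_state, compute_return, execute_action,
                        propagate, c=0.5, gamma=0.9):
    """One feedback period of the method of Fig. 2 (sketch, assumed helpers)."""
    current_state = observe_state()
    ret = compute_return(current_state)  # return value of (prev_state, prev_action)
    # S210: refresh the first state-action combination, yielding the first value.
    best_now = max(q_table[(current_state, a)] for a in actions)
    q_table[(prev_state, prev_action)] += c * (ret + gamma * best_now
                                               - q_table[(prev_state, prev_action)])
    first_value = q_table[(prev_state, prev_action)]
    # S220: propagate the first value to other state-action combinations.
    propagate(q_table, prev_state, prev_action, first_value, ret)
    # S230: take the action with the largest Q value in the current state.
    target_action = max(actions, key=lambda a: q_table[(current_state, a)])
    execute_action(target_action)
    return current_state, target_action
```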
As an optional embodiment, the at least one state-action combination may include one or more state-action combinations composed of the current state, and in this case, optionally, the target action in the current state may be determined according to the updated Q value corresponding to the state-action combination composed of the current state, but the embodiment of the present invention is not limited thereto.
Therefore, according to the method of the embodiment of the present invention, in the first feedback period the Q value corresponding to the first state-action combination is updated to the first value according to the reward value corresponding to the first state-action combination, and, in the same feedback period, the Q value corresponding to each state-action combination in at least one state-action combination of the application different from the first state-action combination is updated according to the first value, so that the Q values corresponding to the plurality of state-action combinations of the application can converge faster, thereby improving the Q-learning-based resource scheduling performance.
As an alternative embodiment, in the embodiment of the present invention, the Q values of the state-action combinations corresponding to a given application state may be constrained as follows: the Q values of part or all of the state-action combinations corresponding to the application state may satisfy the unimodal-ipsilateral monotonicity principle, that is, they satisfy the unimodal characteristic and monotonicity on either side of the peak. The unimodal characteristic means that there may be one and only one maximum value among the Q values of all state-action combinations corresponding to the application state. Monotonicity on either side of the peak means that, with the target action of the application state as the central point, the Q value corresponding to the state-action combination composed of an action on either side and the application state changes monotonically with the adjustment amplitude, where the target action of the application state is the action with the largest Q value among all state-action combinations corresponding to the application state, and either side of the target action refers to the side with adjustment amplitude smaller than that of the target action or the side with adjustment amplitude larger than that of the target action. As an alternative example, the adjustment amplitude of an action may be measured in the positive direction of increasing the amount of resources allocated to the application; in this case, the adjustment amplitude of an action refers to how much the action increases the amount of resources allocated to the application. For example, the adjustment amplitude of an action that increases the amount of resources allocated to the application may be greater than that of an action that decreases it, the adjustment amplitude of an action that adds 2 machines may be greater than that of an action that adds 1 machine, and the adjustment amplitude of an action that removes 1 machine may be greater than that of an action that removes 2 machines, but the embodiment of the present invention is not limited thereto.
An example of satisfying the above single peak characteristic and monotonicity on either side of the peak can be seen in fig. 3. Specifically, in the example of the Q value-adjustment amplitude curve corresponding to a certain application state as shown in fig. 3, the Q value of at least one action of which the adjustment amplitude is smaller than the target action of the application state may increase with the increase of the adjustment amplitude, and the Q value of at least one action of which the adjustment amplitude is larger than the target action may decrease with the increase of the adjustment amplitude. Alternatively, the Q value of the state-action combination corresponding to the application state may monotonically increase or decrease as the adjustment amplitude increases, but the embodiment of the present invention is not limited thereto.
Optionally, in S220, the Q value corresponding to at least one third state-action combination corresponding to the first state may be updated according to the first numerical value, so that part or all of the state-action combinations corresponding to the first state satisfy the unimodal-ipsilateral monotonicity condition. Wherein a third state-action combination may correspond to the first state and an action different from the first action, i.e. the third state-action combination may represent that an action different from the first action is performed when the application is in the first state. Optionally, the at least one third state-action combination may be specifically one or more state-action combinations composed of the first state, may include a state-action combination composed of an action with an adjustment amplitude larger than that of the first action and the first state, and/or include a state-action combination composed of an action with an adjustment amplitude smaller than that of the first action and the first state, which is not limited in this embodiment of the present invention.
As an optional example, the updating of the Q values corresponding to the at least one third state-action combination may be such that the Q values corresponding to all the state-action combinations corresponding to the first state monotonically increase or monotonically decrease with the adjustment amplitude of the action, or such that the Q values corresponding to all the state-action combinations corresponding to the first state monotonically increase and then monotonically decrease with the adjustment amplitude of the action, with the target action of the first state as a demarcation point.
Specifically, taking fig. 3 above as an example, the updating of the Q value corresponding to at least one third state-action combination may be performed in the following manner (a code sketch follows this list):
the Q value corresponding to the state-action combination corresponding to the first state may monotonically decrease in a direction toward increasing the number of resources allocated to the application starting from the target action of the first state; and/or
The Q value for the state-action combination for the first state monotonically decreases in a direction toward decreasing the number of resources allocated to the application starting from the target action for the first state.
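One illustrative way to realize the unimodal-ipsilateral monotonicity constraint described above is to clamp the row of Q values for a state so that they rise up to the peak and fall after it; the sketch below assumes the row is ordered by increasing adjustment amplitude and is not a formula from the patent.

```python
def enforce_unimodal(q_row):
    """Clamp a row of Q values (ordered by increasing adjustment amplitude)
    so they are unimodal: non-decreasing up to the peak, non-increasing after it."""
    peak = max(range(len(q_row)), key=lambda i: q_row[i])
    for i in range(peak - 1, -1, -1):      # left of the peak: no value exceeds its right neighbor
        q_row[i] = min(q_row[i], q_row[i + 1])
    for i in range(peak + 1, len(q_row)):  # right of the peak: no value exceeds its left neighbor
        q_row[i] = min(q_row[i], q_row[i - 1])
    return q_row
```

Because only values other than the peak are lowered, the target action of the state keeps the largest Q value, as required above.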
As another alternative, in S220, the Q value corresponding to a second state-action combination may be updated according to the updated Q value (i.e. the first value) of the first state-action combination, where the second state-action combination may indicate that a second action different from the first action is performed when the application is in the first state. At this time, the Q value corresponding to the second state-action combination may be updated according to the magnitude relationship between the adjustment magnitudes of the first action and the second action.
Optionally, the Q value corresponding to the second state-action combination may be updated according to the return value of the first state-action combination, the first value, and the magnitude relationship between the adjustment amplitudes of the first action and the second action; that is, the Q value corresponding to the second state-action combination may be updated according to the first value and the adjustment direction of the second action, relative to the first action, to the amount of resources allocated to the application.
Optionally, the adjustment direction of the number of resources allocated to the application may be specifically an increasing number direction or a decreasing number direction, which is not limited in the embodiment of the present invention.
As an alternative embodiment, in S220, if the reward value is less than zero and the second action adjusts the amount of resources allocated to the application toward an increased amount compared with the first action, the Q value corresponding to the second state-action combination is updated to a value greater than the first value.
In particular, if the reward value for the first state-action combination is less than zero, it indicates that the resources allocated to the application after taking the first action in the first state are insufficient to satisfy the service request of the application. In this way, when the application is in the first state, an action may be taken that adjusts by a magnitude greater than the first action, so that it is possible to satisfy the service request of the application, i.e. when the application is in the first state, an action may be taken that increases the amount of resources allocated to the application compared to the first action. At this time, the Q value corresponding to the state-action combination formed by one or more actions with adjustment amplitude larger than the first action and the first state may be updated to a value larger than the first value, so that when the application is in the first state, an action with adjustment amplitude larger than the first action may be taken, but the embodiment of the present invention is not limited thereto.
As another alternative, if the reward value is less than zero and the second action adjusts the amount of resources allocated to the application toward a reduced amount compared with the first action, the Q value corresponding to the second state-action combination is updated to a value less than the first value.
Similarly, if the first action cannot meet the requirements of the application when the application is in the first state, the action with the adjustment magnitude smaller than the first action is less able to meet the requirements of the application. At this time, the Q value of the state-action combination of one or more actions with an adjustment amplitude smaller than the first action and the first state may be updated to a value smaller than the first value, so that when the application is in the first state, the possibility of taking an action with an adjustment amplitude smaller than the first action is smaller than the possibility of taking the first action, but the embodiment of the present invention is not limited thereto.
As another alternative, after the Q values corresponding to the first state-action combination and the second state-action combination have been updated, the Q value corresponding to the third state-action combination composed of the first state and a third action may also be updated according to the updated Q value (i.e. the first value) corresponding to the first state-action combination, the updated Q value corresponding to the second state-action combination, and the unimodal-ipsilateral monotonicity principle described above, where the adjustment amplitudes of the first action, the second action, and the third action may change monotonically, for example gradually increase or gradually decrease, i.e. they adjust in turn toward increasing, or in turn toward decreasing, the amount of resources allocated to the application. In this case, the Q value corresponding to the state-action combination of the first state and the third action may be updated such that the updated Q values of the state-action combinations of the first state with the first action, the second action, and the third action change monotonically with the adjustment amplitude of the action; for example, these updated Q values may decrease in turn in the direction of increasing the amount of resources allocated to the application, but the embodiment of the present invention is not limited thereto.
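The same-state propagation just described can be sketched as follows for the negative-return case. Representing actions as numeric machine deltas ordered by adjustment amplitude, and the per-step offset, are assumptions made for the example; the patent does not prescribe a concrete formula for the propagated values.

```python
def update_same_state_neighbors(q_table, state, first_action, first_value,
                                return_value, actions, step=0.1):
    """When the return value is negative (resources were insufficient), actions
    that allocate more resources than first_action get a Q value above the
    first value, and actions that allocate fewer resources get one below it."""
    if return_value >= 0:
        return
    for a in actions:
        if a > first_action:    # allocates more resources than the first action
            q_table[(state, a)] = first_value + step * (a - first_action)
        elif a < first_action:  # allocates fewer resources than the first action
            q_table[(state, a)] = first_value - step * (first_action - a)
```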
As another alternative embodiment, the at least one state-action combination may include a fourth state-action combination that may indicate that the first action is performed when the application is in a second state different from the first state. At this time, the Q value corresponding to the fourth state-action combination may be updated according to the logical relationship between the first state and the second state.
As an alternative embodiment, in S220, the Q value corresponding to the fourth state-action combination may be updated according to the first numerical value and the values of the state characteristic parameters of the first state and the second state.
In an embodiment of the present invention, the state characteristic parameter may be used to characterize the state of the application. For example, the status characteristic parameter may include the number of resources (or the number of machines), the average resource occupancy rate, or may further include other parameters, which is not limited in this embodiment of the present invention.
Optionally, the Q value of the fourth state-action combination may be updated according to the return value of the first state-action combination, the first value, and the value of the state characteristic parameter of the second state. For example, the Q value of the fourth state-action combination may be updated according to the return value of the first state-action combination, the first value, and the magnitude relationship between the values of the state characteristic parameters of the first state and the second state, but the embodiment of the invention is not limited thereto.
As another alternative embodiment, the status characteristic parameter includes an average resource occupancy. At this time, optionally, the Q value corresponding to the fourth state-action combination may be updated according to the first value and the value of the average resource occupancy of the second state. For example, if the reward value is less than zero and the average resource occupancy of the second state is higher than the average resource occupancy of the first state, the Q value corresponding to the fourth state-action combination may be updated to a value less than the first value.
Specifically, if the reward value corresponding to the first state-action combination is less than zero, it indicates that the resources allocated to the application after the first action is taken in the first state are insufficient to satisfy the service request of the application. At this time, if the second state differs from the first state in terms of the average resource occupancy, e.g., the second state has the same number of resources as the first state, the higher the risk that the application is not timely served since a higher average resource occupancy means fewer resources available, and therefore the higher the likelihood that the service requirement of the application will not be met if the first action is also taken in the second state where the average resource occupancy is higher than the average resource occupancy of the first state. Accordingly, the Q value corresponding to the state-action combination of the second state and the first action, which has the average resource occupancy higher than the first state, may be updated to a value smaller than the first value, but the embodiment of the present invention is not limited thereto.
As another alternative, a concept of an average resource occupancy change rate may also be introduced to characterize the application state, wherein the average resource occupancy change rate may be used to reflect a trend of the average resource occupancy of the application over time. At this time, optionally, the application state of the system may be represented as S (M, U, R), where M may represent the number of machines, U may represent the average resource occupancy, and R may represent the average resource occupancy change rate, but the embodiment of the present invention is not limited thereto.
Alternatively, the average resource occupancy change rate of the application may be determined from the values of the average resource occupancy of the application in at least three feedback periods. For example, if the average resource occupancy of the application at time t-2T is U_{t-2T}, the average resource occupancy at time t-T is U_{t-T}, and the average resource occupancy at time t is U_t, the average resource occupancy change rate of the application at time t may be determined according to U_{t-2T}, U_{t-T} and U_t. As an alternative example, the value of the average resource occupancy change rate of the application may be set as follows: the more the average resource occupancy of the application tends to increase, the larger the value of the change rate. For example, if the average resource occupancy of the application decreases continuously over the at least three feedback periods, i.e. U_{t-2T} > U_{t-T} and U_{t-T} > U_t, the value of the average resource occupancy change rate of the application at time t may be -1; if the average resource occupancy rises and falls over the at least three feedback periods, i.e. shows a non-monotonic trend such as U_{t-2T} > U_{t-T} and U_{t-T} < U_t, or U_{t-2T} < U_{t-T} and U_{t-T} > U_t, the value of the change rate at time t may be 0; if the average resource occupancy rises continuously over the at least three feedback periods, i.e. U_{t-2T} < U_{t-T} and U_{t-T} < U_t, the value of the average resource occupancy change rate of the application at time t may be +1. Alternatively, the average resource occupancy change rate of the application may be set in other manners, or may be set to other values, which is not limited in the embodiment of the present invention.
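As an illustration, the discretization described above can be sketched in a few lines of Python; the function name, the use of fractional occupancies rather than percentages, and the strict comparisons are assumptions made for this example only.

    def occupancy_change_rate(u_prev2, u_prev1, u_now):
        # Map three consecutive average-occupancy samples
        # (U_{t-2T}, U_{t-T}, U_t) to a change-rate value R in {-1, 0, +1}.
        if u_prev2 > u_prev1 and u_prev1 > u_now:
            return -1   # occupancy falling over both periods
        if u_prev2 < u_prev1 and u_prev1 < u_now:
            return +1   # occupancy rising over both periods
        return 0        # non-monotonic or flat trend

    # With the occupancies 50%, 60% and 55% used in the worked example below,
    # the change rate is 0.
    print(occupancy_change_rate(0.50, 0.60, 0.55))   # -> 0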
Optionally, if the state characteristic parameter includes an average resource occupancy change rate, the Q value corresponding to the fourth state-action combination may be updated according to the first value and the value of the average resource occupancy change rate of the second state. As an alternative embodiment, the Q value corresponding to the fourth state-action combination may be updated according to the first value and the magnitude relationship between the values of the average resource occupancy change rate of the first state and the second state. For example, if the return value is less than zero and the value of the average resource occupancy change rate of the second state is higher than that of the first state, the Q value corresponding to the fourth state-action combination is updated to a value smaller than the first value.
Specifically, assume that a larger value of the average resource occupancy change rate indicates that the average resource occupancy of the application is trending more strongly upward. If the reward value corresponding to the first state-action combination is less than zero, it may indicate that the resources allocated to the application after the first action is taken in the first state are insufficient to meet the service request of the application. If the second state differs from the first state only in that its average resource occupancy change rate is higher, for example the two states have the same number of machines and the same average resource occupancy, then the occupancy of the application in the second state is trending upward more strongly than in the first state, and taking the first action in the second state makes it even more likely that the service requirement of the application will not be met. Therefore, the Q value corresponding to the fourth state-action combination may be updated to a value smaller than the first value, but the embodiment of the invention is not limited thereto.
Therefore, by introducing the average resource occupancy change rate into the representation of the application state, different application states can be described and distinguished more accurately, further improving the resource scheduling performance based on Q learning.
As another alternative, the state characteristic parameter may include a resource quantity, where the resource quantity may be embodied as the number of machines. In this case, optionally, the Q value corresponding to the fourth state-action combination may be updated according to the first numerical value and the resource quantity of the second state.
As an alternative embodiment, the Q value corresponding to the fourth state-action combination may be updated according to the first value and the magnitude relationship between the numbers of machines in the first state and the second state.
Optionally, in this embodiment of the present invention, after the Q value corresponding to the fourth state-action combination is updated, the Q values corresponding to other state-action combinations may also be updated according to the first numerical value and the updated Q value corresponding to the fourth state-action combination.
As an alternative embodiment, the Q value corresponding to a seventh state-action combination composed of a third state and the first action may be updated according to the first value and the updated Q value corresponding to the fourth state-action combination, where the third state is different from the first state and the second state.
Alternatively, if the first state, the second state and the third state differ in the number of resources, for example, the number of machines, the Q value corresponding to the seventh state-action combination may be updated by interpolation, extrapolation or polynomial fitting according to the first value and the updated Q value corresponding to the fourth state-action combination.
As an optional example, assuming that the numbers of machines allocated to the application in the first state, the second state and the third state are N, N+2 and N+1, respectively, the updated Q value Q_{N+1} of the seventh state-action combination may be obtained by interpolation from the first value and the updated Q value corresponding to the fourth state-action combination, e.g. Q_{N+1} = (Q_N + Q_{N+2}) / 2, where Q_N denotes the first value and Q_{N+2} denotes the updated Q value of the fourth state-action combination.
As another alternative example, it is assumed that the Q value y corresponding to a state-action combination and the number of machines x in the application state satisfy the functional relationship y = a_1*x^2 + a_2*x + a_3, where a_1, a_2 and a_3 are constants. The values of a_1, a_2 and a_3 may then be determined from the first value and the updated Q value corresponding to the fourth state-action combination, for example by least squares, and the updated Q value corresponding to the seventh state-action combination may be determined from the fitted polynomial and the number of machines allocated to the application in the third state, but the embodiment of the invention is not limited thereto.
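For illustration, both estimation approaches can be sketched in Python as follows; the function names are assumptions, and NumPy's polyfit/polyval are used here for the least-squares fit, which the embodiment does not prescribe.

    import numpy as np

    def q_by_interpolation(q_n, q_n_plus_2):
        # Linear interpolation between the known Q values for N and N+2 machines,
        # giving the Q value for N+1 machines: Q_{N+1} = (Q_N + Q_{N+2}) / 2.
        return (q_n + q_n_plus_2) / 2.0

    def q_by_polynomial_fit(machine_counts, q_values, target_count):
        # Least-squares fit of y = a1*x^2 + a2*x + a3 to known (machine count, Q)
        # pairs, then evaluation at a machine count whose Q value is still unknown.
        # At least three known points are needed for a degree-2 fit.
        a = np.polyfit(machine_counts, q_values, deg=2)
        return float(np.polyval(a, target_count))

    print(q_by_interpolation(-1.0, 2.0))                          # -> 0.5
    print(q_by_polynomial_fit([18, 20, 22], [-3.0, -1.0, 2.0], 21))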
As another alternative, in S220, the Q value corresponding to at least one fifth state-action combination may be updated according to the updated Q value corresponding to the fourth state-action combination, where the fifth state-action combination may be composed of the second state and an action different from the first action.
As an alternative example, when the reward value is greater than zero and the first value is the maximum value of the Q values corresponding to all state-action combinations corresponding to the first state, if the first state and the second state differ in that the value of the average resource occupancy change rate of the second state is higher than the value of the average resource occupancy change rate of the first state, the Q value for at least one fifth state-action combination may be updated based on the updated Q value for the fourth state-action combination, such that in all state-action combinations for the second state, the Q value corresponding to the fourth state-action combination or the state-action combination composed of the second state and the third action is maximum, wherein the third action adjusts the amount of resources allocated to the application toward an increased amount as compared to the first action.
Specifically, if the reward value corresponding to the first state-action combination is greater than zero, and the updated value (i.e., the first value) of the first state-action combination is still the maximum Q value among all the state-action combinations of the first state, it may indicate that the application is more inclined to increase the number of resources. In that case, in the second state, whose average resource occupancy change rate is higher than that of the first state, the first action, or an action with a larger adjustment amplitude than the first action, may also be taken, but the embodiment of the present invention is not limited thereto.
As another alternative, when the reward value is greater than zero and the first value is the maximum of the Q values of all state-action combinations corresponding to the first state, if the first state and the second state differ in that the value of the average resource occupancy of the second state is greater than the value of the average resource occupancy of the first state, the Q value of at least one fifth state-action combination may be updated according to the updated Q value corresponding to the fourth state-action combination, such that in all state-action combinations corresponding to the second state, the Q value corresponding to the fourth state-action combination or the state-action combination composed of the second state and the fourth action is maximum, wherein the fourth action adjusts the amount of resources allocated to the application toward an increased amount as compared to the first action.
Specifically, if the reward value corresponding to the first state-action combination is greater than zero and the first value is still the maximum Q value among all the state-action combinations of the first state, it may indicate that the application is more inclined to increase the number of resources. In that case, in the second state, whose average resource occupancy is higher than that of the first state, the first action, or an action with a larger adjustment amplitude than the first action, may also be taken, but the embodiment of the present invention is not limited thereto.
The resource scheduling method based on Q learning provided by the embodiment of the present invention is described in more detail below with reference to a specific example. For ease of understanding, it is assumed in this example that the application state can be represented as S(M, U, R), where M represents the number of machines, U represents the average resource occupancy of the application, and R represents the average resource occupancy change rate of the application. The average resource occupancy is discretized in steps of 10%, giving 10 different ranges of average resource occupancy, and the average resource occupancy change rate may take the values +1, 0 and -1. The candidate actions that may be taken include: decreasing 2 machines (denoted -2), decreasing 1 machine (denoted -1), keeping the number of machines unchanged (denoted 0), increasing 1 machine (denoted +1), and increasing 2 machines (denoted +2).
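Before walking through the example, the state and Q-table representation assumed here can be sketched as follows; the dictionary keying, the default Q value of 0 for unvisited combinations, and the bucketing function are illustrative assumptions rather than part of the embodiment.

    from collections import defaultdict

    ACTIONS = (-2, -1, 0, +1, +2)       # change in the number of machines

    def occupancy_bucket(u):
        # Discretize the average resource occupancy (0.0-1.0) into one of the
        # ten 10% ranges, indexed 0..9.
        return min(int(u * 10), 9)

    # A state is keyed as (number of machines M, occupancy bucket, change rate R).
    q_table = defaultdict(float)         # maps (state, action) -> Q value

    def target_action(state):
        # The target action of a state is the candidate action whose
        # state-action combination has the largest Q value.
        return max(ACTIONS, key=lambda a: q_table[(state, a)])

    state_at_t = (20, occupancy_bucket(0.55), 0)   # the state used in the example
    q_table[(state_at_t, +1)] = 2.5                # example Q values
    q_table[(state_at_t, 0)] = 1.0
    print(target_action(state_at_t))               # -> 1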
Further, it is assumed that the second feedback period is specifically time t, and the first feedback period is time t+T. Specifically, at time t, the average resource occupancy of the application is 55% and the number of machines is 20, and the average resource occupancy of the application at time t-2T and time t-T is 50% and 60%, respectively.
1. The first state in which the application is at time t may be determined.
Specifically, the value of the average resource occupancy change rate at time t may be determined from the values of the average resource occupancy of the application at time t, time t-T and time t-2T. As noted above, the average resource occupancy of the application at time t-2T, time t-T and time t is 50%, 60% and 55%, respectively, i.e., the value at time t-2T is lower than the value at time t-T, and the value at time t-T is higher than the value at time t. In this case, optionally, the value of the average resource occupancy change rate of the application at time t may be determined to be 0.
The Q table may be queried according to the number of machines, the value of the average resource occupancy rate, and the value of the average resource occupancy change rate corresponding to the application at time t to determine the first state of the application at time t. It is assumed here that the first state is specifically state 1 of the application, and the current Q value of the state-action combination of state 1 and each candidate action may be as shown in table 2.
TABLE 2 example Q values for State-action combinations of State 1 and respective candidate actions
(Table 2 is provided as an image in the original publication; its specific Q values are not reproduced here.)
2. As can be seen from Table 2, among the candidate actions, the Q value corresponding to the state-action combination composed of state 1 and action +1 is the largest. Accordingly, action +1 can be determined as the target action of state 1, and the target action is executed.
3. The return value corresponding to taking action +1 in state 1 (i.e., the state-action combination (state 1, +1)) may be determined based on the value of the average resource occupancy of the application at time t+T and/or other parameters. Furthermore, the value of the average resource occupancy change rate at time t+T, and thus the current state at time t+T, may be determined based on the values of the average resource occupancy at time t+T, time t and time t-T. The Q value corresponding to the state-action combination (state 1, +1) may then be updated according to the current state of the application at time t+T and the return value corresponding to the state-action combination (state 1, +1). It is assumed here that the updated Q value of the state-action combination (state 1, +1) is -1, but the embodiment of the present invention is not limited thereto.
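The embodiment does not fix a particular formula for updating the Q value of (state 1, +1) from the return value and the new state; the sketch below assumes the standard one-step Q-learning update with a learning rate eta and a discount factor gamma, which is one common choice.

    def q_update(q, state, action, reward, next_state,
                 actions=(-2, -1, 0, 1, 2), eta=0.1, gamma=0.9):
        # One-step Q-learning update; q is a dict mapping (state, action) -> Q.
        # Unvisited combinations are treated as having Q = 0.
        best_next = max(q.get((next_state, a), 0.0) for a in actions)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + eta * (reward + gamma * best_next - old)
        return q[(state, action)]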
4. The Q values of other state-action combinations of the application may be updated based on the updated Q value of the state-action combination (state 1, +1).
Optionally, it can be determined by looking up Table 2 whether the Q values of the state-action combinations corresponding to state 1 satisfy the unimodal same-side monotonicity principle. Specifically, assuming that the updated Q value of (state 1, +1) is -1, action +1 is still the target action of state 1, i.e., among the candidate actions, the Q value corresponding to the state-action combination composed of state 1 and action +1 is the largest. It can then be determined whether both sides of the row in which action +1 is located satisfy the same-side monotonicity principle. As can be seen from Table 2, if the Q values corresponding to action -2, action -1 and action 0 do not satisfy the same-side monotonicity principle, the Q value corresponding to action -1 or action -2 can be updated; for example, as shown in Table 3, the Q value corresponding to action -2 can be updated to -20-α_1, where α_1 > 0 (a sketch of one possible enforcement appears after Table 3), but the embodiments of the invention are not limited thereto.
TABLE 3 example updated Q values for each state-action combination corresponding to State 1
(Table 3 is provided as an image in the original publication; its specific Q values are not reproduced here.)
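One possible way to enforce the same-side monotonicity principle on a single state's row of Q values is sketched below; pushing each violating entry just below its inner neighbour by a small constant alpha is an assumption that mirrors the -20-α_1 adjustment above, not the only admissible rule.

    def enforce_unimodal(q_row, actions=(-2, -1, 0, 1, 2), alpha=0.1):
        # q_row maps each candidate action to its Q value for one state.
        # Starting from the target (argmax) action, force the Q values to
        # decrease monotonically toward both ends of the row.
        target_idx = max(range(len(actions)), key=lambda i: q_row[actions[i]])
        for i in range(target_idx - 1, -1, -1):           # walk toward action -2
            if q_row[actions[i]] >= q_row[actions[i + 1]]:
                q_row[actions[i]] = q_row[actions[i + 1]] - alpha
        for i in range(target_idx + 1, len(actions)):     # walk toward action +2
            if q_row[actions[i]] >= q_row[actions[i - 1]]:
                q_row[actions[i]] = q_row[actions[i - 1]] - alpha
        return q_row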
Alternatively, the Q value corresponding to the state-action combination of state 1 and another action may be updated based on the return value corresponding to the state-action combination (state 1, +1) and its updated Q value. Assuming that the return value corresponding to the state-action combination (state 1, +1) is negative, the Q values of the state-action combinations of state 1 and the other actions may optionally be updated such that the Q value of a state-action combination corresponding to the first state increases as the adjustment amplitude of the action increases. Specifically, the Q value of the state-action combination composed of state 1 and the action whose adjustment amplitude is larger than that of action +1 (i.e., action +2) may be updated to a value larger than -1, and the Q values of the state-action combinations composed of state 1 and the actions whose adjustment amplitude is smaller than that of action +1 may be updated to values smaller than -1. For example, as shown in Table 4, the Q value of action +2 may be updated to -1+α_2, where α_2 > 0; since the current Q values of actions 0, -1 and -2 are already less than -1, they may not need to be further updated.
Table 4 example updated Q values for each state-action combination corresponding to state 1
(Table 4 is provided as an image in the original publication; its specific Q values are not reproduced here.)
Alternatively, the Q value of the state-action combination corresponding to the other state may be updated according to the updated Q value of the state-action combination (state 1, + 1).
Specifically, assuming that the reward value of the state-action combination (state 1, +1) is negative, the Q value corresponding to the state-action combination of action +1 and the other states can be updated.
As an alternative example, assuming that the reward value of the state-action combination (state 1, +1) is negative, the Q values corresponding to the state-action combinations composed of action +1 and other states whose average resource occupancy or average resource occupancy change rate is higher than that of state 1 may be updated, so that the updated Q values of those state-action combinations are lower than -1.
For example, the Q values corresponding to the state-action combinations of action +1 and at least one state whose number of machines is 20, whose average resource occupancy change rate is 0, and whose average resource occupancy is higher than 50%-60% may be updated, such that the updated Q value of the state-action combination of that state and action +1 decreases as the average resource occupancy of the state increases (see the sketch after Table 5). In the example shown in Table 5, the Q values of the state-action combinations of states 2 to 5 with action +1 can be updated to -1-α_3, -1-2α_3, -1-3α_3 and -1-4α_3, respectively, where α_3 > 0, but the embodiments of the invention are not limited thereto.
TABLE 5 updated Q value example of State-action combinations of states with different average resource occupancy and action +1
State (M, U, R)                 Q value for action +1
State 1 (20, 50%-60%, 0)        -1
State 2 (20, 60%-70%, 0)        -1-α_3
State 3 (20, 70%-80%, 0)        -1-2α_3
State 4 (20, 80%-90%, 0)        -1-3α_3
State 5 (20, 90%-100%, 0)       -1-4α_3
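The propagation illustrated in Table 5 can be sketched as follows; the (machines, occupancy bucket, change rate) state key and the fixed step of α_3 per additional occupancy bucket are assumptions taken from the example above.

    def propagate_to_higher_occupancy(q, base_state, action, base_q, alpha_3=0.1):
        # After a negative return for (base_state, action), lower the Q value of
        # the same action in states that differ only by a higher occupancy bucket,
        # by one extra alpha_3 per additional bucket (as in Table 5).
        machines, bucket, rate = base_state
        for k, higher_bucket in enumerate(range(bucket + 1, 10), start=1):
            q[((machines, higher_bucket, rate), action)] = base_q - k * alpha_3
        return q

    q = {}
    propagate_to_higher_occupancy(q, (20, 5, 0), +1, -1.0)   # buckets 6..9 get -1-α_3, -1-2α_3, ...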
For another example, the Q values corresponding to the state-action combinations of action +1 and at least one state whose number of machines is 20, whose average resource occupancy is 50%-60%, and whose average resource occupancy change rate is higher than 0 may be updated, such that the updated Q value of the state-action combination of that state and action +1 decreases as the average resource occupancy change rate of the state increases. In the example shown in Table 6, the Q value corresponding to the state-action combination of state 6 and action +1 may keep its current value of 2.05 unchanged, and the Q value corresponding to the state-action combination of state 7 and action +1 may be updated to -1-α_4, where α_4 > 0, but the embodiments of the invention are not limited thereto.
TABLE 6 updated Q value examples for state-action combinations of states with different average rates of change of resource occupancy and action +1
State (M, U, R)                 Q value for action +1
State 6 (20, 50%-60%, -1)       2.05
State 1 (20, 50%-60%, 0)        -1
State 7 (20, 50%-60%, +1)       -1-α_4
As another alternative example, after the Q values of the state-action combinations composed of action +1 and state 1 and at least one other state have been updated, the Q values corresponding to the state-action combinations of action +1 and other states having different numbers of machines may be updated using interpolation. For example, as shown in Table 7, assuming that the updated Q value of the state-action combination (state 9, action +1) is known to be 2, the updated Q value of the state-action combination (state 8, action +1) can be determined by interpolation as (-1+2)/2 = 0.5, but this is not limited in the embodiments of the present invention.
TABLE 7 Example updated Q values for state-action combinations of action +1 with states having different numbers of machines
State (M, U, R)                 Q value for action +1
State 1 (20, 50%-60%, 0)        -1
State 8 (21, 50%-60%, 0)        0.5
State 9 (22, 50%-60%, 0)        2
For another example, after the Q values of the state-action combinations composed of action +1 and state 1 and at least one other state have been updated, a polynomial function may be fitted to the currently updated Q values in the column of action +1, and the Q values not yet updated in that column may then be updated using the fitted polynomial function.
Alternatively, assuming that the reward value of the state-action combination (state 1, +1) is positive, and that -1 is still greater than the Q values of the state-action combinations formed by state 1 and the other actions, the Q values in the rows of other states whose average resource occupancy or average resource occupancy change rate is higher than that of state 1 may be updated so that action +1, or an action with a larger adjustment amplitude, becomes the target action of those states, i.e., in those states, the Q value corresponding to action +1 or to an action with an adjustment amplitude greater than +1 is the largest, but the embodiment of the present invention is not limited thereto.
Optionally, when the number of updated Q values in the Q table of the application is large and a trigger condition for association rule mining is reached, for example, when the proportion of updated Q values among all Q values in the Q table exceeds a preset threshold, the Q values not yet updated in the Q table may be inferred by using an association rule mining method. An association rule quantifies how much the presence of item A affects the presence of item B, and can be used to describe patterns of items that occur together in a transaction. For example, the Q value of a certain action in a certain state may be inferred by using association rule mining to calculate the support degree of each action, but the embodiment of the present invention is not limited thereto.
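The embodiment does not detail the association rule mining procedure; the rough Python sketch below only illustrates the support idea, by grouping already-updated states by their occupancy bucket, finding the action that most often has the largest Q value in that group (its support), and copying that action's average Q value into same-bucket states that were never updated. All names and the grouping criterion are assumptions.

    from collections import defaultdict

    def infer_missing_q(q, updated_states, missing_states, actions=(-2, -1, 0, 1, 2)):
        # q maps (state, action) -> Q value; a state is (machines, bucket, rate).
        winners = defaultdict(list)                  # bucket -> [(best action, its Q)]
        for s in updated_states:
            best = max(actions, key=lambda a: q.get((s, a), 0.0))
            winners[s[1]].append((best, q.get((s, best), 0.0)))
        for s in missing_states:
            hits = winners[s[1]]
            if not hits:
                continue
            acts = [a for a, _ in hits]
            supported = max(set(acts), key=acts.count)   # action with highest support
            vals = [v for a, v in hits if a == supported]
            q[(s, supported)] = sum(vals) / len(vals)
        return q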
It should be noted that the examples of tables 2 to 7 are for helping those skilled in the art to better understand the embodiments of the present invention, and are not intended to limit the scope of the embodiments of the present invention. It will be apparent to those skilled in the art that various equivalent modifications or variations are possible in light of the examples given in tables 2 through 7, and such modifications or variations are intended to be included within the scope of the embodiments of the present invention.
It should be understood that the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiment of the present invention.
The resource scheduling method based on Q learning according to the embodiment of the present invention is described in detail above with reference to fig. 1 to 3, and the resource scheduling apparatus based on Q learning according to the embodiment of the present invention is described in detail below with reference to fig. 4 to 5.
Fig. 4 shows a resource scheduling apparatus 300 based on Q learning according to an embodiment of the present invention, where the apparatus 300 includes:
a first processing unit 310, configured to update a Q value corresponding to a first state-action combination of multiple state-action combinations of an application to a first value according to a reported value of the first state-action combination, where the first state-action combination indicates that a first action is performed when the application is in a first state, where the first state is a state in which the application is in a second feedback period earlier than the first feedback period, the first action is used to adjust the number of resources allocated to the application, and update a Q value corresponding to at least one state-action combination of the multiple state-action combinations different from the first state-action combination according to the first value;
a second processing unit 320, configured to determine, in at least two state-action combinations corresponding to a current state, an action corresponding to a state-action combination with a maximum Q value, where the current state is a state in which the application is located in the first feedback cycle;
the third processing unit 330 is configured to perform adjustment processing on the amount of resources allocated to the application according to the action determined by the second processing unit 320 in the first feedback period.
Optionally, the at least one state-action combination comprises a second state-action combination, the second state-action combination representing that a second action different from the first action is performed when the application is in the first state; at this time, the first processing unit 310 may be specifically configured to update the Q value corresponding to the second state-action combination according to the first value and the adjustment direction of the second action with respect to the amount of the resource allocated to the application compared to the first action.
Optionally, the first processing unit 310 is specifically configured to: if the reward value is less than zero and the second action is adjusted toward an increased amount of the resource amount allocated to the application compared to the first action, updating the Q value corresponding to the second state-action combination to a value less than the first value.
Optionally, the first processing unit 310 is specifically configured to: if the reward value is less than zero and the second action is adjusted toward a reduced amount of resources allocated to the application compared to the first action, updating the Q value corresponding to the second state-action combination to a value greater than the first value.
Optionally, the at least one state-action combination further comprises at least one third state-action combination, the third state-action combination representing that an action different from the first action is performed when the application is in the first state. At this time, the first processing unit 310 may specifically be configured to:
and updating the Q value corresponding to each of the at least one third state-action combination according to the first value, so that the Q value corresponding to the state-action combination corresponding to the first state monotonically decreases from the target action as the starting point in the direction of increasing the number of resources allocated to the application, and/or so that the Q value corresponding to the state-action combination corresponding to the first state monotonically decreases from the target action as the starting point in the direction of decreasing the number of resources allocated to the application, wherein the Q value corresponding to the state-action combination composed of the target action and the first state is the largest in all the state-action combinations corresponding to the first state.
Optionally, the at least one state-action combination comprises a fourth state-action combination, the fourth state-action combination representing that the first action is performed when the application is in a second state different from the first state. At this time, the first processing unit 310 may specifically be configured to: and updating the Q value corresponding to the fourth state-action combination according to the first numerical value and the values of the state characteristic parameters of the first state and the second state.
Optionally, the status characteristic parameter includes an average resource occupancy rate; at this time, the Q value corresponding to the fourth state-action combination may be updated according to the first value and the values of the average resource occupancy rates of the first state and the second state.
Optionally, the first processing unit 310 may be specifically configured to: if the reported value is less than zero and the average resource occupancy of the second state is higher than the average resource occupancy of the first state, updating the Q value corresponding to the fourth state-action combination to a value less than the first value.
Optionally, the status characteristic parameter includes an average resource occupancy change rate, where the average resource occupancy change rate is used to reflect a change trend of the average resource occupancy. At this time, the first processing unit 310 may specifically be configured to: and updating the Q value corresponding to the fourth state-action combination according to the first value and the values of the average resource occupation change rates of the first state and the second state.
As an optional embodiment, the first processing unit 310 is specifically configured to: if the reported value is less than zero and the value of the average resource usage change rate of the second state is higher than the value of the average resource usage change rate of the first state, updating the Q value corresponding to the fourth state-action combination to a value less than the first value.
Optionally, the at least one state-action combination further comprises at least one fifth state-action combination, the fifth state-action combination representing that an action different from the first action is performed when the application is in the second state. At this time, the first processing unit 310 may specifically be configured to:
when the reported value is greater than zero and the first value is the maximum value of the Q values corresponding to all the state-action combinations corresponding to the first state, if the first state and the second state differ in that the value of the average resource occupancy change rate of the second state is higher than the value of the average resource occupancy change rate of the first state, updating the Q value corresponding to the at least one fifth state-action combination according to the updated Q value corresponding to the fourth state-action combination, so that in all the state-action combinations corresponding to the second state, the Q value corresponding to the fourth state-action combination or the state-action combination composed of the second state and the third action is maximum, wherein the third action adjusts the amount of resources allocated to the application toward an increased amount as compared to the first action.
Optionally, the at least one state-action combination further comprises at least one sixth state-action combination, the sixth state-action combination representing that an action different from the first action is performed when the application is in the second state. At this time, the first processing unit 310 may specifically be configured to:
when the reward value is greater than zero and the first value is the maximum value of the Q values of all state-action combinations corresponding to the first state, if the difference between the first state and the second state is that the value of the average resource occupancy of the second state is greater than the value of the average resource occupancy of the first state, updating the Q value of the at least one sixth state-action combination according to the updated Q value corresponding to the fourth state-action combination, so that the Q value corresponding to the fourth state-action combination or a state-action combination composed of the second state and a fourth action is the maximum in all state-action combinations corresponding to the second state, wherein the fourth action adjusts the number of resources allocated to the application toward an increased number compared to the first action.
Optionally, the at least one state-action combination further comprises a seventh state-action combination, the seventh state-action combination representing that the first action is performed when the application is in a third state, wherein the first state, the second state and the third state differ by a difference in the amount of resources allocated to the application. At this time, the first processing unit 310 may specifically be configured to: and determining the updated Q value of the seventh state-action combination by interpolation, extrapolation or polynomial fitting according to the updated Q value corresponding to the first numerical value and the fourth state-action combination.
Optionally, the first processing unit 310 is further configured to: if the proportion of the number of the at least one state-action combination in the plurality of state-action combinations exceeds a preset threshold value, updating the Q value corresponding to the state-action combination which is not updated in the plurality of state-action combinations by using an association rule mining method in the first feedback period.
It should be understood that the apparatus 300 herein is embodied in the form of a functional unit. The term "unit" herein may refer to an Application Specific Integrated Circuit (ASIC), an electronic Circuit, a processor (e.g., a shared, dedicated, or group processor) and memory that execute one or more software or firmware programs, a combinational logic Circuit, and/or other suitable components that support the described functionality. In an alternative example, those skilled in the art may understand that the apparatus 300 may be configured to perform various processes and/or steps in the above method embodiments, and details are not described herein again to avoid repetition.
Fig. 5 shows another apparatus 400 for resource scheduling based on Q learning according to an embodiment of the present invention, where the apparatus 400 includes: a processor 410 and a non-volatile computer-readable storage medium 420. The non-volatile computer-readable storage medium 420 is coupled to the processor 410 and configured to store instructions that, when executed by the processor 410, cause the processor 410 to perform operations comprising:
updating a Q value corresponding to a first state-action combination to a first value according to a return value of the first state-action combination in a plurality of state-action combinations of an application, wherein the first state-action combination represents that a first action is executed when the application is in a first state, the first state is a state in which a second feedback period earlier than the first feedback period is positioned by the application, and the first action is used for adjusting the quantity of resources allocated to the application;
updating the Q value corresponding to at least one state-action combination different from the first state-action combination in the plurality of state-action combinations according to the first numerical value;
determining the action corresponding to the state-action combination with the maximum Q value in at least two state-action combinations corresponding to the current state, wherein the current state is the state of the application in the first feedback period;
and adjusting the quantity of the resources allocated to the application according to the determined action.
The memory may optionally include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information. The processor may be configured to execute instructions stored in the memory and, when the processor executes the instructions stored in the memory, to perform the various steps and/or procedures of the above-described method embodiments.
It should be understood that in the embodiments of the present invention, the processor may be a Central Processing Unit (CPU), and the processor may also be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of a method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory, and the processor executes the instructions in the memory and, in combination with its hardware, performs the steps of the above-described method. To avoid repetition, this is not described in detail here.
It should be understood that the apparatus 400 may be configured to perform each step and/or flow of the above method embodiments, and therefore, in order to avoid repetition, the detailed description is omitted here.
It should be understood that the terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
Those of ordinary skill in the art will appreciate that the various method steps and elements described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both, and that the steps and elements of the various embodiments have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (24)

1. A resource scheduling method based on Q learning is applied to a data center and is characterized by comprising the following steps:
in a first feedback period, updating a Q value corresponding to a first state-action combination to a first numerical value according to a return value of the first state-action combination in a plurality of state-action combinations of an application, where the first state-action combination indicates that a first action is executed when the application is in a first state, the first state is a state in which a second feedback period earlier than the first feedback period is located, and the first action is used for adjusting the number of resources allocated to the application;
updating, in the first feedback period, a Q value corresponding to at least one state-action combination of the plurality of state-action combinations other than the first state-action combination according to the first numerical value;
determining an action corresponding to the state-action combination with the maximum Q value in at least two state-action combinations corresponding to the current state, wherein the current state is a state where the application is in the first feedback period;
and in the first feedback period, adjusting the quantity of the resources allocated to the application according to the action corresponding to the state-action combination with the maximum Q value.
2. The method of claim 1, wherein the at least one state-action combination comprises a second state-action combination that represents a second action that is different from the first action being performed when the application is in the first state;
the updating the Q value corresponding to at least one state-action combination of the plurality of state-action combinations other than the first state-action combination according to the first numerical value includes:
and updating the Q value corresponding to the second state-action combination according to the first value and the adjustment direction of the second action to the quantity of the resources allocated to the application compared with the first action.
3. The method of claim 2, wherein updating the Q value corresponding to the second state-action combination according to the first value and the adjustment direction of the second action to the amount of resources allocated to the application compared to the first action comprises:
if the reward value is less than zero and the second action adjusts the amount of the resource allocated to the application towards an increased amount compared with the first action, updating the Q value corresponding to the second state-action combination to a value less than the first value; and/or
If the reward value is less than zero and the second action is adjusted toward a reduced amount of the resource amount allocated to the application compared with the first action, updating the Q value corresponding to the second state-action combination to a value greater than the first value.
4. The method of any of claims 1-3, wherein the at least one state-action combination further comprises at least one third state-action combination, the third state-action combination representing an action to be performed when the application is in the first state that is different from the first action;
the updating the Q value corresponding to at least one state-action combination of the plurality of state-action combinations other than the first state-action combination according to the first numerical value includes:
updating the Q value corresponding to each of the at least one third state-action combination according to the first value, so that the Q value corresponding to the state-action combination corresponding to the first state monotonically decreases starting from the target action and in the direction of increasing the number of resources allocated to the application, and/or so that the Q value corresponding to the state-action combination corresponding to the first state monotonically decreases starting from the target action and in the direction of decreasing the number of resources allocated to the application, wherein the Q value corresponding to the state-action combination composed of the target action and the first state is the largest in all the state-action combinations corresponding to the first state.
5. The method of any of claims 1-3, wherein the at least one state-action combination comprises a fourth state-action combination, the fourth state-action combination representing the first action being performed when the application is in a second state different from the first state;
the updating the Q value corresponding to at least one state-action combination of the plurality of state-action combinations other than the first state-action combination according to the first numerical value includes:
and updating the Q value corresponding to the fourth state-action combination according to the first numerical value and the values of the state characteristic parameters of the first state and the second state.
6. The method of claim 5, wherein the status characterizing parameters include an average resource occupancy;
the updating the Q value corresponding to the fourth state-action combination according to the first numerical value and the values of the state characteristic parameters of the first state and the second state includes:
if the return value is less than zero and the average resource occupancy rate of the second state is higher than the average resource occupancy rate of the first state, updating the Q value corresponding to the fourth state-action combination to a value less than the first value.
7. The method according to claim 5, wherein the status characteristic parameter comprises an average resource occupancy change rate, wherein the average resource occupancy change rate is used for reflecting a change trend of the average resource occupancy;
the updating the Q value corresponding to the fourth state-action combination according to the first numerical value and the values of the state characteristic parameters of the first state and the second state includes:
and updating the Q value corresponding to the fourth state-action combination according to the first numerical value and the values of the average resource occupation change rates of the first state and the second state.
8. The method of claim 7, wherein updating the Q value corresponding to the fourth state-action combination according to the first value and the value of the average resource occupancy change rate of the first state and the second state comprises:
if the reward value is less than zero and the value of the average resource occupation change rate of the second state is higher than the value of the average resource occupation change rate of the first state, updating the Q value corresponding to the fourth state-action combination to a value smaller than the first value.
9. The method of claim 5, wherein the at least one state-action combination further comprises at least one fifth state-action combination, the fifth state-action combination representing an action to be performed when the application is in the second state that is different from the first action;
the updating the Q value corresponding to at least one state-action combination different from the first state-action combination in the plurality of state-action combinations according to the first numerical value further comprises:
when the reported value is greater than zero and the first value is the maximum value of the Q values corresponding to all state-action combinations corresponding to the first state, if the first state and the second state differ in that the value of the average rate of change of resource occupancy of the second state is higher than the value of the average rate of change of resource occupancy of the first state, updating the Q value corresponding to the at least one fifth state-action combination according to the updated Q value corresponding to the fourth state-action combination, so that the Q value corresponding to the fourth state-action combination or the state-action combination composed of the second state and the third action is the largest in all the state-action combinations corresponding to the second state, wherein the third action adjusts the amount of resources allocated to the application towards an increased amount as compared to the first action.
10. The method of claim 5, wherein the at least one state-action combination further comprises at least one sixth state-action combination, the sixth state-action combination representing an action to be performed when the application is in the second state that is different from the first action;
the updating the Q value corresponding to at least one state-action combination different from the first state-action combination in the plurality of state-action combinations according to the first numerical value further comprises:
when the reward value is greater than zero and the first value is the largest value among the Q values of all state-action combinations corresponding to the first state, if the first state and the second state differ in that the value of the average resource occupancy of the second state is greater than the value of the average resource occupancy of the first state, updating the Q value of the at least one sixth state-action combination according to the updated Q value corresponding to the fourth state-action combination, so that the Q value corresponding to the fourth state-action combination or the state-action combination composed of the second state and the fourth action is the largest in all the state-action combinations corresponding to the second state, wherein the fourth action adjusts the amount of resources allocated to the application towards an increased amount as compared to the first action.
11. The method of claim 5, wherein the at least one state-action combination further comprises a seventh state-action combination, the seventh state-action combination representing the first action being performed when the application is in a third state, wherein the first state, the second state, and the third state differ by a different amount of resources allocated to the application;
said updating the Q value of at least one of said plurality of state-action combinations other than said first state-action combination based on said first value further comprises:
and determining the updated Q value of the seventh state-action combination by interpolation, extrapolation or polynomial fitting according to the updated Q value corresponding to the first numerical value and the fourth state-action combination.
12. The method according to any one of claims 1 to 3, further comprising:
and if the proportion of the number of the at least one state-action combination in the plurality of state-action combinations exceeds a preset threshold value, updating the Q value corresponding to the state-action combination which is not updated in the plurality of state-action combinations by using an association rule mining method in the first feedback period.
13. A resource scheduling device based on Q learning is applied to a data center and is characterized by comprising:
a first processing unit, configured to update, according to a return value of a first state-action combination in a plurality of state-action combinations of an application, a Q value corresponding to the first state-action combination to a first numerical value, where the first state-action combination indicates that a first action is performed when the application is in a first state, the first state is a state in which the application is in a second feedback cycle that is earlier than the first feedback cycle, the first action is used to perform adjustment processing on the number of resources allocated to the application, and update, according to the first numerical value, a Q value corresponding to at least one state-action combination, which is different from the first state-action combination, in the plurality of state-action combinations;
the second processing unit is configured to determine, in at least two state-action combinations corresponding to a current state, an action corresponding to a state-action combination with a maximum Q value, where the current state is a state in which the application is located in the first feedback period;
and the third processing unit is used for adjusting the quantity of the resources allocated to the application according to the action corresponding to the state-action combination with the maximum Q value determined by the second processing unit in the first feedback period.
14. The apparatus of claim 13, wherein the at least one state-action combination comprises a second state-action combination that represents a second action that is different from the first action performed when the application is in the first state;
the first processing unit is specifically configured to update the Q value corresponding to the second state-action combination according to the first numerical value and an adjustment direction of the second action to the number of resources allocated to the application compared to the first action.
15. The apparatus according to claim 14, wherein the first processing unit is specifically configured to:
if the reward value is less than zero and the second action adjusts the amount of the resource allocated to the application towards an increased amount compared with the first action, updating the Q value corresponding to the second state-action combination to a value less than the first value; and/or
If the reward value is less than zero and the second action is adjusted toward a reduced amount of the resource amount allocated to the application compared with the first action, updating the Q value corresponding to the second state-action combination to a value greater than the first value.
16. The apparatus according to any one of claims 13 to 15, wherein the at least one state-action combination further comprises at least one third state-action combination, and the third state-action combination indicates that an action different from the first action is performed when the application is in the first state;
the first processing unit is specifically configured to:
update, according to the first value, the Q value corresponding to each of the at least one third state-action combination, so that the Q values corresponding to the state-action combinations corresponding to the first state decrease monotonically, starting from a target action, in the direction of increasing the amount of resources allocated to the application, and/or decrease monotonically, starting from the target action, in the direction of decreasing the amount of resources allocated to the application, wherein the Q value corresponding to the state-action combination consisting of the target action and the first state is the largest among all the state-action combinations corresponding to the first state.
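One possible way to realize the monotonicity described in claim 16, sketched with hypothetical names and a small separation constant, is to walk outward from the target action in both directions and cap each Q value below its inner neighbour:

```python
def enforce_unimodal(q, state, actions, eps=0.01):
    """Adjust the Q values of the actions in `state` so that, starting from
    the target action (the one with the largest Q value), the Q values
    decrease monotonically toward larger and toward smaller resource
    allocations; actions are assumed to be numeric resource deltas."""
    ordered = sorted(actions)                    # ordered by resource adjustment
    target = max(ordered, key=lambda a: q[(state, a)])
    i = ordered.index(target)
    for k in range(i + 1, len(ordered)):         # direction of more resources
        q[(state, ordered[k])] = min(q[(state, ordered[k])],
                                     q[(state, ordered[k - 1])] - eps)
    for k in range(i - 1, -1, -1):               # direction of fewer resources
        q[(state, ordered[k])] = min(q[(state, ordered[k])],
                                     q[(state, ordered[k + 1])] - eps)
```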
17. The apparatus according to any one of claims 13 to 15, wherein the at least one state-action combination comprises a fourth state-action combination, and the fourth state-action combination indicates that the first action is performed when the application is in a second state different from the first state;
the first processing unit is specifically configured to:
update the Q value corresponding to the fourth state-action combination according to the first value and the values of a state characteristic parameter in the first state and in the second state.
18. The apparatus according to claim 17, wherein the state characteristic parameter comprises an average resource occupancy;
the first processing unit is specifically configured to:
if the reward value is less than zero and the average resource occupancy of the second state is higher than the average resource occupancy of the first state, update the Q value corresponding to the fourth state-action combination to a value less than the first value.
19. The apparatus according to claim 17, wherein the state characteristic parameter comprises an average resource occupancy change rate, and the average resource occupancy change rate reflects the change trend of the average resource occupancy;
the first processing unit is specifically configured to:
update the Q value corresponding to the fourth state-action combination according to the first value and the values of the average resource occupancy change rate in the first state and in the second state.
20. The apparatus according to claim 19, wherein the first processing unit is specifically configured to:
if the reward value is less than zero and the value of the average resource occupancy change rate in the second state is higher than the value of the average resource occupancy change rate in the first state, update the Q value corresponding to the fourth state-action combination to a value less than the first value.
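Claims 17 to 20 propagate a negative outcome to a second state that shares the first action but has a higher average resource occupancy or occupancy change rate. A compact sketch that merges the conditions of claims 18 and 20, assuming each state is represented as an (occupancy, change rate) pair and using a hypothetical margin, could be:

```python
def propagate_across_states(q, first_action, first_value, reward,
                            first_state, second_state, margin=0.05):
    """States are (avg_occupancy, occupancy_change_rate) tuples.  When the
    reward is negative and the second state looks 'worse' than the first
    (higher occupancy or faster-growing occupancy), performing the same
    action there is assumed to be at least as bad, so its Q value is set
    below the first value."""
    occ1, rate1 = first_state
    occ2, rate2 = second_state
    if reward < 0 and (occ2 > occ1 or rate2 > rate1):
        q[(second_state, first_action)] = first_value - margin
```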
21. The apparatus according to claim 17, wherein the at least one state-action combination further comprises at least one fifth state-action combination, and the fifth state-action combination indicates that an action different from the first action is performed when the application is in the second state;
the first processing unit is specifically configured to:
when the reward value is greater than zero and the first value is the largest among the Q values corresponding to all the state-action combinations corresponding to the first state, if the first state and the second state differ in that the value of the average resource occupancy change rate in the second state is higher than the value of the average resource occupancy change rate in the first state, update the Q value corresponding to the at least one fifth state-action combination according to the updated Q value corresponding to the fourth state-action combination, so that the Q value corresponding to the fourth state-action combination, or to the state-action combination consisting of the second state and a third action, is the largest among all the state-action combinations corresponding to the second state, wherein the third action, compared with the first action, adjusts the amount of resources allocated to the application toward a larger amount.
22. The apparatus according to claim 17, wherein the at least one state-action combination further comprises at least one sixth state-action combination, and the sixth state-action combination indicates that an action different from the first action is performed when the application is in the second state;
the first processing unit is specifically configured to:
when the reward value is greater than zero and the first value is the largest among the Q values corresponding to all the state-action combinations corresponding to the first state, if the first state and the second state differ in that the value of the average resource occupancy in the second state is higher than the value of the average resource occupancy in the first state, update the Q value corresponding to the at least one sixth state-action combination according to the updated Q value corresponding to the fourth state-action combination, so that the Q value corresponding to the fourth state-action combination, or to the state-action combination consisting of the second state and a fourth action, is the largest among all the state-action combinations corresponding to the second state, wherein the fourth action, compared with the first action, adjusts the amount of resources allocated to the application toward a larger amount.
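Claims 21 and 22 handle the positive case: if the first action is already the best choice in the first state and the second state is "busier", the second state should end up preferring the same action or one that allocates even more resources. A sketch under the same integer-delta assumption (hypothetical names and slack constant):

```python
def propagate_positive(q, actions, first_action, first_value,
                       first_state, second_state, reward, slack=0.02):
    """When the reward is positive and first_action is already the best action
    in the first state, cap the Q values of the second state's actions that
    allocate fewer resources than first_action, so that the maximum in the
    second state lies at first_action or at an action that allocates more."""
    if reward <= 0 or first_value < max(q[(first_state, a)] for a in actions):
        return
    q_fourth = q[(second_state, first_action)]       # already updated (see claim 17)
    for a in actions:
        if a < first_action:                          # allocates fewer resources
            q[(second_state, a)] = min(q[(second_state, a)], q_fourth - slack)
```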
23. The apparatus according to claim 17, wherein the at least one state-action combination further comprises a seventh state-action combination, the seventh state-action combination indicates that the first action is performed when the application is in a third state, and the first state, the second state, and the third state differ in the amount of resources allocated to the application;
the first processing unit is specifically configured to:
determine the updated Q value of the seventh state-action combination by interpolation, extrapolation, or polynomial fitting according to the first value and the updated Q value corresponding to the fourth state-action combination.
24. The apparatus according to any one of claims 13 to 15, wherein the first processing unit is further configured to:
if the proportion of the at least one state-action combination in the plurality of state-action combinations exceeds a preset threshold, update, in the first feedback period, the Q values of the state-action combinations that have not been updated among the plurality of state-action combinations by using an association rule mining method.
CN201680056785.1A 2016-05-24 2016-05-24 Resource scheduling method and device based on Q learning Active CN108139930B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/083082 WO2017201662A1 (en) 2016-05-24 2016-05-24 Q-learning based resource scheduling method and device

Publications (2)

Publication Number Publication Date
CN108139930A CN108139930A (en) 2018-06-08
CN108139930B true CN108139930B (en) 2021-08-20

Family

ID=60410985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201680056785.1A Active CN108139930B (en) 2016-05-24 2016-05-24 Resource scheduling method and device based on Q learning

Country Status (2)

Country Link
CN (1) CN108139930B (en)
WO (1) WO2017201662A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW202018545A (en) * 2018-11-13 2020-05-16 Institute for Information Industry Production control system, method and nontransistory computer readable medium thereof
CN110515735A (en) * 2019-08-29 2019-11-29 Harbin University of Science and Technology Multi-objective cloud resource scheduling method based on an improved Q-learning algorithm
CN112422651A (en) * 2020-11-06 2021-02-26 University of Electronic Science and Technology of China Cloud resource scheduling performance bottleneck prediction method based on reinforcement learning
CN113163447B (en) * 2021-03-12 2022-05-20 Central South University Communication network task resource scheduling method based on Q learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7286484B2 (en) * 2003-01-10 2007-10-23 Chunghwa Telecom Co., Ltd. Q-learning-based multi-rate transmission control (MRTC) scheme for RRC in WCDMA systems
DE102008020380B4 (en) * 2008-04-23 2010-04-08 Siemens Aktiengesellschaft Method for computer-aided learning of a control and / or regulation of a technical system
CN101833479B (en) * 2010-04-16 2014-05-21 National University of Defense Technology MPI information scheduling method based on reinforcement learning under a multi-network environment
CN103248693A (en) * 2013-05-03 2013-08-14 Southeast University Large-scale self-adaptive composite service optimization method based on multi-agent reinforcement learning
CN103220751B (en) * 2013-05-08 2016-03-30 Harbin Institute of Technology Heterogeneous network admission control method based on Q-learning resource allocation strategy

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390195A (en) * 2013-05-28 2013-11-13 Chongqing University Machine workshop task scheduling energy-saving optimization system based on reinforcement learning
CN104635772A (en) * 2014-12-08 2015-05-20 Nanjing University of Information Science and Technology Method for adaptively and dynamically scheduling manufacturing systems
CN104767833A (en) * 2015-05-04 2015-07-08 Xiamen University Cloud transfer method for computing tasks of a mobile terminal
CN105578486A (en) * 2016-02-29 2016-05-11 Chongqing University of Posts and Telecommunications Joint capacity and coverage optimization method in heterogeneous dense networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A cooperative Q-learning approach for distributed resource allocation in multi-user femtocell networks";Hussein Saad et al;《 2014 IEEE Wireless Communications and Networking Conference (WCNC)》;20141130;第1490-1495页 *
"基于 Q 学习的 Macro-Pico 异构网络干扰协调算法研究";汪朝晖;《中国优秀硕士学位论文全文数据库 信息科技辑》;20160215;第I136-918页 *

Also Published As

Publication number Publication date
WO2017201662A1 (en) 2017-11-30
CN108139930A (en) 2018-06-08

Similar Documents

Publication Publication Date Title
CN108139930B (en) Resource scheduling method and device based on Q learning
CN109002358B (en) Mobile terminal software self-adaptive optimization scheduling method based on deep reinforcement learning
CN108173698B (en) Network service management method, device, server and storage medium
CN113434253B (en) Cluster resource scheduling method, device, equipment and storage medium
US20210099517A1 (en) Using reinforcement learning to scale queue-based services
CN113989561A (en) Parameter aggregation updating method, equipment and system based on asynchronous federal learning
CN112187670B (en) Networked software shared resource allocation method and device based on group intelligence
US20190138354A1 (en) Method for scheduling jobs with idle resources
CN109587072A (en) Distributed system overall situation speed limiting system and method
CN104679444A (en) Dynamic adjustment method and device for virtualized storage resources
US10908675B2 (en) Electronic device for signal processing with embedded optimization of electrical energy consumption and a corresponding method
KR102270239B1 (en) Method and apparatus for executing software in a electronic device
CN109379747A (en) The deployment of wireless network multi-controller and resource allocation methods and device
CN107329881B (en) Application system performance test method and device, computer equipment and storage medium
CN112700003A (en) Network structure search method, device, equipment, storage medium and program product
CN110233763B (en) Virtual network embedding algorithm based on time sequence difference learning
CN104537224B (en) Multi-state System Reliability analysis method and system based on adaptive learning algorithm
CN112001570B (en) Data processing method and device, electronic equipment and readable storage medium
CN107517273B (en) Data migration method, system, computer readable storage medium and server
CN110941489A (en) Method and device for scaling stream processing engine
CN110069340B (en) Thread number evaluation method and device
CN108476084B (en) Method and device for adjusting state space boundary in Q learning
JP2021105772A (en) Prediction management system of resource usage amount, and prediction management method of resource usage amount
US20170122843A1 (en) Stable manufacturing efficiency generating method and system and non-transitory computer readable storage medium
WO2016082867A1 (en) Orchestrator and method for virtual network embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant