CN113240003A - Training method and system of scheduling model, scheduling method and system, and storage medium - Google Patents


Info

Publication number
CN113240003A
CN113240003A
Authority
CN
China
Prior art keywords
scheduling
training
scheduling model
model
resource allocation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110512654.2A
Other languages
Chinese (zh)
Inventor
黄隆波
胡丕河
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110512654.2A
Publication of CN113240003A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The application discloses a training method and system for a scheduling model, a scheduling method and system, and a storage medium. The training method of the scheduling model comprises the following steps: acquiring a training sample set covering at least one preset duration, wherein the training sample set comprises the training samples of each preset duration, each training sample comprises the scheduling data of each time slot within that duration, and the scheduling data comprise environment information, the resource allocation amount of a scheduling object obtained by running the scheduling model, and the throughput obtained after a scheduling operation is performed based on that resource allocation amount; and training the scheduling model based on the training sample set to update the parameters of the scheduling model. By training the scheduling model on a training sample set spanning a preset duration to update its parameters, the method achieves the aim of maximizing throughput while meeting the resource consumption constraint when the scheduling model is used for scheduling operations.

Description

Training method and system of scheduling model, scheduling method and system, and storage medium
Technical Field
The present application relates to the field of resource allocation technologies, and in particular, to a method and a system for training a scheduling model, a method and a system for scheduling, and a storage medium.
Background
In applications involving resource allocation, how to schedule so as to achieve optimal resource allocation has long been a significant concern. Owing to the distribution of the arrival flows of scheduling objects, the randomness of the state changes of the service channels serving those objects, the resource competition among different users sharing a common system resource pool, and the limits on resource consumption, how to maximize throughput while meeting the resource consumption limits is an urgent problem for scheduling operations.
Disclosure of Invention
In view of the above, an object of the present application is to provide a training method and system for a scheduling model, a scheduling method and system, and a storage medium, which address the technical problem of maximizing throughput while satisfying the resource consumption constraint during scheduling operations.
To achieve the above and other related objects, a first aspect of the disclosure provides a training method of a scheduling model, including the steps of: acquiring a training sample set in at least one preset time length, wherein the training sample set comprises training samples in each preset time length, the training samples comprise scheduling data in each time slot in the preset time length, and the scheduling data comprise environment information, resource allocation of a scheduling object obtained by operating the scheduling model and throughput obtained after scheduling operation is executed based on the resource allocation; and training the scheduling model based on the training sample set to update parameters of the scheduling model.
A second aspect of the present disclosure provides a training system of a scheduling model, including: an obtaining module, configured to obtain a training sample set within at least one preset duration, where the training sample set includes training samples within each preset duration, the training samples include scheduling data in each time slot within the preset duration, and the scheduling data includes environment information, resource allocation of a scheduling object obtained by operating the scheduling model, and throughput obtained after performing scheduling operation based on the resource allocation; and a training module for training the scheduling model based on the training sample set to update parameters of the scheduling model.
A third aspect of the present disclosure provides a scheduling method, including the following steps: receiving a scheduling object; obtaining the resource allocation amount of the scheduling object by adopting the scheduling model obtained by training through the training method; and carrying out scheduling operation on the scheduling object according to the resource allocation amount.
A fourth aspect of the present disclosure provides a scheduling system, including: an input unit for receiving a scheduling object; and the scheduling model obtained by training according to the training method is used for obtaining the resource allocation amount of the scheduling object and performing scheduling operation on the scheduling object according to the resource allocation amount.
A fifth aspect of the present disclosure provides an electronic device, comprising: at least one memory for storing at least one program; and at least one processor, connected to the at least one memory, for implementing the above training method of the scheduling model or the above scheduling method when running the at least one program.
A sixth aspect of the present disclosure provides a cloud server system, comprising: at least one storage device for storing at least one program; and a processing device, connected to the storage device, for implementing the above training method of the scheduling model or the above scheduling method when running the at least one program.
A seventh aspect of the present disclosure provides a computer-readable storage medium storing at least one program which, when executed by a processor, implements the above training method of a scheduling model or the above scheduling method.
In summary, the present application provides a training method and system for a scheduling model, a scheduling method and system, and a storage medium, in which the scheduling model is trained on a training sample set spanning a preset duration to update its parameters, so that, when the scheduling model is used for scheduling operations, throughput is maximized while the resource consumption constraint is met.
Other aspects and advantages of the present application will be readily apparent to those skilled in the art from the following detailed description. Only exemplary embodiments of the present application are shown and described therein. As those skilled in the art will recognize, the present disclosure enables changes to the specific embodiments disclosed without departing from the spirit and scope of the invention to which this application relates. Accordingly, the drawings and descriptions of the present application are illustrative only and not limiting.
Drawings
The specific features of the invention to which this application relates are set forth in the appended claims. The features and advantages of the invention will be better understood by reference to the exemplary embodiments described in detail below and the accompanying drawings, which are briefly described as follows:
fig. 1 is a flowchart illustrating a training method of a scheduling model according to an embodiment of the present invention.
Fig. 2 is a schematic flowchart illustrating the step S110 in the training method of the scheduling model according to an embodiment of the present application.
Fig. 3 is a schematic diagram of the critic network structure adopted by the scheduling model of the present application.
Fig. 4 is a schematic diagram of the actor network structure adopted by the scheduling model of the present application.
Fig. 5 is a flowchart illustrating a training method of the scheduling model according to another embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a training system of the scheduling model of the present application in an embodiment.
Fig. 7 is a flowchart illustrating a scheduling method according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of a scheduling system according to an embodiment of the present invention.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of a cloud server system according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application is provided for illustrative purposes, and other advantages and capabilities of the present application will become apparent to those skilled in the art from the present disclosure.
In the following description, reference is made to the accompanying drawings, which describe several embodiments of the application. It is to be understood that other embodiments may be utilized, and that structural, electrical, and operational changes to modules or units may be made without departing from the spirit and scope of the present disclosure. The following detailed description is not to be taken in a limiting sense, and the scope of the embodiments of the present application is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
Although the terms first, second, etc. may be used herein to describe various elements, information, or parameters in some instances, these elements or parameters should not be limited by these terms. These terms are only used to distinguish one element or parameter from another. For example, a first element could be termed a second element and, similarly, a second element could be termed a first element, without departing from the scope of the various described embodiments; the first element and the second element are both elements, but they are not the same element unless the context clearly dictates otherwise. Depending on context, the word "if" as used herein may be interpreted as "when" or "upon".
Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises", "comprising", "includes" and/or "including", when used in this specification, specify the presence of stated features, steps, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition occurs only when a combination of elements, functions, steps or operations is inherently mutually exclusive in some way.
In applications involving resource allocation, how to schedule resources reasonably is a matter of concern. The scheduling problem is difficult to solve because it is constrained by various factors, such as the distribution of the arrival streams of the scheduling objects, the randomness of the state changes of the service channels serving them, the competition of different users for resources due to the multi-user shared system resource pool in the scheduling system, the limits on resource consumption, and the constraints on time delay. For example, in the field of wireless communication, a mobile user receives data packets relayed over a wireless channel, with a transmission antenna responsible for transmitting the data packets to the user.
With the continuous development of machine learning, reinforcement learning, unlike the supervised and unsupervised learning of conventional machine learning, has been widely applied because it can learn through interaction between an agent and the environment. Specifically, the agent continuously learns the dynamics of the environment while interacting with it, according to the different benefits obtained from different actions, in an attempt to adapt to the environment and maximize the cumulative benefit. Introducing a deep neural network on this basis yields deep reinforcement learning; owing to its capacity for adaptive learning in a specific environment, deep reinforcement learning can be used to solve the user scheduling problem, so that throughput can be maximized while the resource consumption constraint is met during scheduling operations.
In view of this, the present application provides a method for training a scheduling model, where the scheduling model functions as a scheduling device or scheduler that allocates resources to a scheduling object and then performs a scheduling operation according to the allocated amount. In some embodiments, the scheduling model is a scheduler or scheduling device that performs resource allocation on scheduling objects by running a scheduling algorithm based on deep reinforcement learning. In other embodiments, the neural network in the scheduling model is a network with memory rather than a simple fully connected network. For example, the neural network of the scheduling model is based on a Recurrent Neural Network (RNN) structure, two typical variants of which are the Long Short-Term Memory (LSTM) structure and the Gated Recurrent Unit (GRU) structure.
Generally, reinforcement learning is expressed as a Markov Decision Process (MDP). In particular, a Markov decision process may be represented by a quadruple $(\mathcal{A}, \mathcal{S}, \mathcal{R}, \mathcal{P})$, where $\mathcal{A}$ represents the set of all actions of the agent; $\mathcal{S}$ represents the set of states of the environment; $\mathcal{R}$ represents the revenue function, i.e., $r_t = \mathcal{R}(s_t, a_t)$, which may be a constant or a random number; and $\mathcal{P}$ represents the state transition matrix of the environment, i.e., $\mathcal{P}(s_{t+1} \mid s_t, a_t)$. The pair $(\mathcal{R}, \mathcal{P})$ is also known as the Model of this Markov decision process, and $\gamma$ denotes a discount factor. At each time $t$, the agent observes an environment state $s_t$ and takes an action $a_t$; the environment then transitions to the next state $s_{t+1}$, and the agent obtains an instantaneous profit $r_t$.
However, for scheduling problems the Markov property of the environment is often difficult to satisfy: the agent cannot fully observe the environment state $s_t$ at time $t$ and can only observe a subset $o_t$ of it. In view of this, for the scheduling problem the present application introduces the Partially Observable Markov Decision Process (POMDP) into reinforcement learning. In particular, the Markov decision process used in the present application may be represented by a quintuple $(\mathcal{O}, \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P})$, where $\mathcal{O}$ represents the set of observed variables; $\mathcal{S}$ represents the set of states of the environment; $\mathcal{A}$ represents the set of all actions of the agent; $\mathcal{R}$ represents the revenue function, i.e., $r_t = \mathcal{R}(s_t, a_t)$, which may be a constant or a random number; and $\mathcal{P}$ represents the state transition matrix of the environment, i.e., $\mathcal{P}(s_{t+1} \mid s_t, a_t)$. The pair $(\mathcal{R}, \mathcal{P})$ is also referred to as the Model of this partially observable Markov decision process, and $\gamma$ denotes a discount factor. In the partially observable Markov decision process used in the present application for scheduling problems, the goal of the agent is to learn a decision strategy $\pi(s)$ that maximizes the long-term discounted revenue $\mathbb{E}\big[\sum_{t=0}^{T} \gamma^{t} r_t\big]$, where $T$ represents the preset duration.
In other words, the problem to be solved by the scheduling model of the present application can be characterized as a partially observable Markov decision process and solved with an algorithm based on deep reinforcement learning. Here, the action $a_t$ of the agent represents the resource allocation amount for the scheduling object obtained by running the scheduling model; the environment variable $o_t$ observed by the agent represents the environment information observed by the scheduling model, where environment information refers to the information directly observed by the scheduling model (under full observability the environment information can be characterized by the environment state $s_t$; under partial observability it is characterized by the environment variable $o_t$); and the benefit $r_t$ of the agent represents the throughput obtained by the scheduling model after performing the scheduling operation.
Partial observability of the environment must be considered in scheduling operations. If the scheduling model were trained only on the environment information at a single time, the resource allocation amount obtained by running the model on that information, and the throughput achieved after performing the scheduling operation with that allocation, the environment information at each individual time could not sufficiently reflect the current environment, and a scheduling model trained on such information would struggle to schedule efficiently. Therefore, to enable a scheduling model based on deep reinforcement learning to schedule efficiently and to maximize throughput while satisfying resource consumption under partial observability, the present application provides a training method whose main idea is to train the scheduling model on the environment information observed over a continuous time period, the resource allocation amounts obtained by running the scheduling model on that information, and the throughput achieved after performing the corresponding scheduling operations. A scheduling model trained by this method can maximize throughput when performing scheduling operations.
Referring to fig. 1, a flowchart of the training method of the scheduling model of the present application in an embodiment, the training method of the scheduling model includes steps S100 and S110.
In step S100, a training sample set within at least one preset duration is obtained. The training sample set comprises the training samples of each preset duration; each training sample comprises the scheduling data of each time slot within that duration; and the scheduling data comprise the environment information, the resource allocation amount of the scheduling object obtained by running the scheduling model, and the throughput obtained after the scheduling operation is performed based on that allocation.
The training process of the scheduling model can be divided into a plurality of preset durations according to the interaction between the scheduling model and the environment; the training method is executed for each preset duration in turn, looping until the whole training process is complete. The preset duration may be set according to the unit amount of the scheduling decisions made by the scheduling model, or set empirically by a person skilled in the art applying the scheduling model. For a given scheduling problem, all preset durations must have the same length throughout the training process. In some embodiments, the preset durations are also referred to as segments: the training process is divided into a plurality of segments, and the length of each segment corresponds to the preset duration.
The time slot is a subdivision of the preset duration and corresponds to the unit amount of scheduling decisions made by the scheduling model within that duration. For example, if the scheduling model makes a scheduling decision at every time instant, the length of the preset duration may be set to T, i.e., it contains T instants, and the time slot is each instant. As another example, if the scheduling model makes a scheduling decision every 24 hours (one day), the preset duration may be set to N days and the time slot is one day (24 hours).
For convenience of description, take the case where the scheduling model makes a scheduling decision at every instant t, with the preset duration of length T, i.e., containing T instants, each instant being one time slot. Accordingly, a training sample is the set of scheduling data of all time slots within one preset duration. The training sample set is the set of training samples over a plurality of preset durations: it comprises several consecutive stretches of scheduling data of length T, each containing the scheduling data at T instants.
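For illustration only, the per-slot scheduling data and a length-T training sample can be modeled as in the following Python sketch; the class and field names are illustrative and not taken from the patent:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SlotRecord:
    """Scheduling data recorded for one time slot."""
    obs: list      # environment information observed in the slot (o_t)
    action: list   # resource allocation produced by the scheduling model (a_t)
    reward: float  # instantaneous weighted throughput after scheduling (r_t)

@dataclass
class TrainingSample:
    """One training sample: the scheduling data of all T slots of one preset duration."""
    slots: List[SlotRecord] = field(default_factory=list)

    def append(self, obs, action, reward):
        self.slots.append(SlotRecord(obs, action, reward))
```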
The scheduling data include the environment information, the resource allocation amount of the scheduling object obtained by running the scheduling model, and the throughput obtained after the scheduling model performs the scheduling operation based on that allocation. The environment information refers to the information observed when the scheduling model interacts with the environment. Generally, it includes the status of the scheduling objects and the status of the service channels, where a service channel is a channel through which resources are allocated to a scheduling object. In some embodiments, the status of a scheduling object includes information distinguishing it from other scheduling objects, for example, information indicating the queue in which it resides, which user it is to be scheduled to, and its own constraint information such as latency status information. The status of a service channel includes information about the channel that allocates resources for the scheduling object. For example, in the field of wireless communication scheduling, the state of a service channel includes information indicating the state of the wireless channel between the transmitting antenna and the destination user; in the field of video stream scheduling, it includes information indicating the state of the communication link between the video buffer and the user terminal equipment; in the field of order delivery scheduling, it includes information indicating the state of the physical road between the delivery station and the delivery destination.
The resource allocation amount of the scheduling object is generated by the scheduling model from the observed environment information, and the throughput is the instant benefit obtained after the scheduling model performs the scheduling operation with the obtained allocation; in the present invention, the instant benefit refers to the instantaneous weighted throughput. The throughput is related to the status of the scheduling objects and to the number of successfully scheduled objects.
In some embodiments, step S100 of obtaining the training sample set within at least one preset duration may include: according to the environment information of each time slot, obtaining the resource allocation amount and the instantaneous weighted throughput of the scheduling object in the corresponding time slot by running the scheduling model, and caching the training sample. In a specific example, the scheduling model is first randomly initialized and the playback buffer is initialized. The playback buffer may be configured in the scheduling model in advance, or may communicate with the scheduling model to provide it with the stored data sets. Then, for a preset duration of T, the scheduling model obtains the resource allocation amount and instantaneous weighted throughput of the scheduling object at the first instant by running on the environment information at that instant, does the same at the second instant, and so on until the resource allocation amount and instantaneous weighted throughput at instant T are obtained; the scheduling data at each instant are then stored in the playback buffer in the form of a sequence as one training sample.
When the training process of the scheduling model is divided into a plurality of preset durations, the scheduling data further include an indicator $d_t$ representing whether the current preset duration has ended. In some embodiments, this is recorded in the scheduling data in 1/0 form: if the current duration ends after the resource allocation amount of the scheduling object is obtained from the environment information at the current instant, the scheduling data record $d_t = 1$; otherwise they record $d_t = 0$. In one embodiment, with a preset duration of length T, whether the current duration has ended can be determined by comparing the current instant t with T: if $t > T$, then $d_t = 1$, otherwise $d_t = 0$.
In actual operation, after the scheduling data at each instant within a preset duration are obtained, they are cached in the playback buffer in the form of a sequence as one training sample, and this step is repeated to obtain a training sample set formed of a plurality of training samples. One or more of these training samples are then used as input for scheduling model training. Here, each cached training sample is a data sequence comprising the environment information, the resource allocation amount, the weighted throughput, and the indicator of whether the current preset duration has ended.
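A minimal sketch of such a playback buffer follows, assuming an in-memory store with a fixed capacity (the capacity and the sampling interface are assumptions, not specified by the patent):

```python
import random
from collections import deque

class PlaybackBuffer:
    """Caches training samples as whole sequences of per-slot scheduling data."""

    def __init__(self, capacity: int = 10_000):
        self.storage = deque(maxlen=capacity)  # oldest sequences evicted when full

    def add_sequence(self, sequence):
        """sequence: list of (o_t, a_t, r_t, d_t) tuples covering one preset duration."""
        self.storage.append(sequence)

    def sample(self, batch_size: int):
        """Randomly sample whole sequences; slot order inside each sequence is kept
        so that recurrent (LSTM) layers can be unrolled over the duration."""
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))
```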
In step S110, the scheduling model is trained based on the training sample set to update parameters of the scheduling model.
The training samples are used as input to the scheduling model, and the scheduling model is trained through deep reinforcement learning based on the partially observable Markov decision process to update its parameters. For example, the parameters of the scheduling model are the weights of its neural network structure. In some embodiments, the parameters may be updated by computing the optimal parameters analytically, which is better suited to scheduling problems in simple environments. In other embodiments, the parameters may be updated by gradient descent, enabling the scheduling model to schedule efficiently.
The training method trains the scheduling model on the training sample set. With the partial observability of the environment fully considered, and using the information from the continuous interaction between the agent and the environment over a period of time, hidden information that is unobservable in the environment can be mined during training, achieving the aim of maximizing throughput while meeting the resource consumption constraint when the scheduling model is used for scheduling operations.
When training the scheduling model based on the training sample set, in some embodiments the scheduling model may be trained on all of the training samples currently stored in the playback buffer. In other embodiments, part of the training samples may be sampled from the playback buffer to train the scheduling model, reducing computational complexity.
Because reinforcement learning learns continuously during the interaction between the agent and the environment, in order to improve training accuracy, the step of training the scheduling model on the training sample set may be started only after the scheduling model has first been run to collect training samples over several preset durations. In some embodiments, the scheduling model and the playback buffer are first initialized, and a number of training samples are then obtained and deposited in the playback buffer as described above. After several training samples have been stored, the training process proper begins: on the one hand the scheduling model interacts with the environment to obtain new training samples, and on the other hand parameters are updated cyclically using the cached training samples. Based on this, please refer to fig. 2, a flowchart of step S110 of the training method of the scheduling model in an embodiment; as shown, step S110 includes steps S111 to S113.
In step S111, a training sample set is sampled.
The training sample data stored in the playback buffer of the scheduling model grow continuously during training. In order to introduce up-to-date training data, reduce computational complexity, and improve training accuracy, the stored training sample set is randomly sampled to obtain the training samples used to train the scheduling model.
In step S112, the parameters of the scheduling model are updated by gradient descent based on the sampled data.
In order to learn, from the environment variables, implicit information that is not observable in the environment, and considering the partial observability of the environment in the scheduling problem, in some embodiments the parameters of the scheduling model are updated by gradient descent. The gradient can be calculated by means of a softmax operator, a max operator, or the like, thereby realizing the gradient descent.
In step S113, step S111 and step S112 are repeated until the preset number of updates is satisfied.
For the training operation within one preset duration, the number of updates of the scheduling model can be set; the process of sampling from the training sample set and updating the parameters of the scheduling model by gradient descent is then repeated until the set number of updates is reached. The number of updates may be set empirically by a person skilled in the art applying the scheduling model. In some embodiments, the number of updates is set to two, and after the two updates are completed, the next preset duration begins. Note that the parameters of the scheduling model are thus updated before the training samples of the next preset duration are collected, so that the updated scheduling model obtains the resource allocation amounts and throughput of the scheduling objects from the environment information, and so on.
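The inner loop of steps S111 to S113 can be sketched as follows, where update_model stands for one gradient-descent step on the scheduling model (the function signatures are illustrative):

```python
def train_for_one_duration(buffer, update_model, num_updates=2, batch_size=64):
    """Repeat {sample, gradient-descent update} until the preset update count."""
    for _ in range(num_updates):           # S113: loop until the count is reached
        batch = buffer.sample(batch_size)  # S111: sample the training sample set
        update_model(batch)                # S112: update parameters by gradient descent
```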
Where the scheduling model is a deep neural network, in order to improve the performance of the scheduling algorithm, the scheduling model comprises two types of network structure, a critic network structure and an actor network structure, which are combined to obtain neural networks of different structures. Hereinafter, for convenience of description, an instance of the critic network is denoted $Q$, with all of its weight parameters denoted $\theta$, and an instance of the actor network is denoted $\pi$, with all of its weight parameters denoted $\phi$.
In certain embodiments, the critic network structure comprises a dual-branch structure of fully connected and long short-term memory layers, and the actor network structure likewise comprises a fully connected and long short-term memory dual-branch structure.
Referring to fig. 3, a schematic diagram of the critic network structure adopted by the scheduling model of the present application, the inputs of the critic network include: the environment information at the current instant t (in some examples, the environment variable $o_t$); the resource allocation amount $a_t$ of the scheduling model for the scheduling object at the current instant t; and the resource allocation amount $a_{t-1}$ of the scheduling model for the scheduling object at the previous instant t-1. In some embodiments, the concatenated vectors $(o_t, a_t)$ and $(o_t, a_{t-1})$ are fed into two branches, a first fully connected (FC) branch and a first long short-term memory (LSTM) branch, respectively. In the first fully connected branch, the input $(o_t, a_t)$ passes through a first fully connected layer and then a first rectified linear unit (ReLU) activation before being output. In the first long short-term memory branch, the input $(o_t, a_{t-1})$ passes sequentially through a second fully connected layer, a second ReLU activation, a first LSTM layer, and a second LSTM layer. The outputs of the two branches are concatenated at a first concatenation layer; the concatenated output passes through a third fully connected layer, a third ReLU activation, and a fourth fully connected layer, which outputs the reward $Q(o_t, a_t)$ earned by taking action $a_t$ in the current state.
Referring to fig. 4, a schematic diagram of the actor network structure adopted by the scheduling model of the present application, the inputs of the actor network include: the environment information at the current instant t (in some examples, the environment variable $o_t$); and the resource allocation amount $a_{t-1}$ of the scheduling model for the scheduling object at the previous instant t-1. In some embodiments, the concatenated vector $(o_t, a_{t-1})$ is fed into two branches, a second fully connected branch and a second long short-term memory branch. In the second fully connected branch, the input passes through a fifth fully connected layer and then a fourth ReLU activation before being output. In the second long short-term memory branch, the input passes sequentially through a sixth fully connected layer, a fifth ReLU activation, and a third LSTM layer. The outputs of the two branches are concatenated at a second concatenation layer, pass through a seventh fully connected layer, and then through a hyperbolic tangent (tanh) activation, which outputs the resource allocation amount $a_t$ of the scheduling model for the scheduling object at the current instant t.
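The two network templates of figs. 3 and 4 can be sketched in PyTorch as follows. The branch layout follows the description above, while the hidden width, the number of LSTM layers per branch, and the tanh output range are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Dual-branch critic Q(o_t, a_t) in the layout of Fig. 3."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        # Fully connected branch over (o_t, a_t)
        self.fc_branch = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU())
        # Long short-term memory branch over (o_t, a_{t-1})
        self.lstm_in = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        # Head applied after concatenating the two branches
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, obs, act, prev_act):
        # obs/act/prev_act: (batch, T, dim) sequences over one preset duration
        fc_out = self.fc_branch(torch.cat([obs, act], dim=-1))
        mem, _ = self.lstm(self.lstm_in(torch.cat([obs, prev_act], dim=-1)))
        return self.head(torch.cat([fc_out, mem], dim=-1))  # (batch, T, 1)

class Actor(nn.Module):
    """Dual-branch actor producing a_t from (o_t, a_{t-1}) in the layout of Fig. 4."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.fc_branch = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU())
        self.lstm_in = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, act_dim), nn.Tanh())

    def forward(self, obs, prev_act):
        x = torch.cat([obs, prev_act], dim=-1)
        fc_out = self.fc_branch(x)
        mem, _ = self.lstm(self.lstm_in(x))
        return self.head(torch.cat([fc_out, mem], dim=-1))  # (batch, T, act_dim)
```

The LSTM branch is what gives both networks memory over the time slots of a preset duration, allowing implicit, unobserved environment information to be accumulated across slots.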
Where the scheduling model comprises a critic network structure and an actor network structure, training the scheduling model based on the training sample set comprises training both structures on that set; that is, the weight parameters $\theta$ of the critic network structure and the weight parameters $\phi$ of the actor network structure are trained and updated based on the training sample set. For example, the critic network target value is calculated by means of a softmax operator, which improves the performance of the algorithm. Alternatively, the critic target value may be calculated by means of a max operator.
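A sketch of such a softmax-operator target value, in the style of softmax value estimation for deterministic policies, is given below. It reuses the Critic/Actor sketches above and assumes tanh-bounded actions; the hyperparameters (number of sampled noisy actions J, sampling noise variance, inverse temperature beta) mirror the algorithm inputs listed later in the patent, but the exact estimator used there is not spelled out, so this is an assumption:

```python
import torch

def softmax_q_target(q_net, next_obs, next_prev_act, target_actor,
                     beta: float = 1.0, num_samples: int = 10, sigma: float = 0.2):
    """Softmax-weighted value estimate around the target actor's action."""
    with torch.no_grad():
        mu = target_actor(next_obs, next_prev_act)             # deterministic action
        noise = sigma * torch.randn((num_samples, *mu.shape))
        actions = (mu.unsqueeze(0) + noise).clamp(-1.0, 1.0)   # J perturbed actions
        q = torch.stack([q_net(next_obs, a, next_prev_act) for a in actions])
        w = torch.softmax(beta * q, dim=0)                     # inverse temperature beta
        return (w * q).sum(dim=0)   # softmax value; taking a max over the samples
                                    # instead would recover the max-operator variant
```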
Since the parameter training of the actor network is affected by the output of the critic network, in order to update the actor network parameters more stably, in some embodiments the step of training the scheduling model based on the training sample set comprises: training the critic network structure based on the training sample set; and training the actor network structure using the critic network structure after it has been trained at least once. That is, while the scheduling model is trained on the training sample set, the parameters of the critic network are updated continuously, whereas the parameters of the actor network are updated at intervals. The interval is determined empirically; a generally preferred interval is 2. For example, where the training process is divided into M durations according to the interaction of the scheduling model with the environment, it may be arranged that the critic network parameters and the actor network parameters are updated together only in even-numbered durations. If the training process is divided into six durations, the weight parameters of the critic and actor networks are updated together only in the second, fourth, and sixth durations.
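In code, this interval update can be as simple as the following sketch (q = 2; critic_step and actor_step stand for one gradient step each and are illustrative callbacks):

```python
def update_networks(batch, critic_step, actor_step, segment_index: int, q: int = 2):
    critic_step(batch)            # critic parameters: updated continuously
    if segment_index % q == 0:    # actor parameters: updated only every q-th segment
        actor_step(batch)
```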
In this training method, sampling the training data set during training and training the scheduling model by gradient descent allows the optimal scheduling strategy to be approached gradually while the implicit information in the environment is mined, improving the efficiency of the scheduling operations performed by the scheduling model and maximizing throughput. In addition, updating the actor network at intervals improves the stability of scheduling model training.
In practical scheduling problems, the scheduling operation is limited by resource consumption. In order to ensure that the scheduling model provides a hard guarantee of the resource consumption constraint when performing scheduling operations, the training method further comprises the step of determining whether the obtained resource allocation amount satisfies a preset constraint condition: if the constraint is satisfied, the scheduling operation is performed based on the obtained resource allocation amount; if not, the scheduling operation is not performed.
The constraint condition can be set by a person skilled in the art applying the scheduling model according to the actual environment. For example, the constraints may include an instantaneous resource consumption constraint, an average resource consumption constraint, a service channel capacity constraint, and the like. In some embodiments, the constraint is an instantaneous resource consumption constraint. In one example, the constraint is that the instantaneous resource allocation amount be less than or equal to a fixed value: when the instantaneous resource consumption of the current scheduling object obtained by the scheduling model is less than or equal to that value, the scheduling model performs the scheduling operation with the obtained allocation; when it is greater, the scheduling operation is not performed. In other embodiments, the constraint is an average resource consumption constraint. In one example, the constraint is that the average of the resource allocation amount obtained at the current instant and the accumulated resource allocations already used be less than or equal to a preset value, where the accumulated allocations are the sum of the allocations at all instants before the current one. The preset value is set according to the limits of the practical scheduling problem; for example, in the field of wireless communication scheduling it is set according to the average power of the transmitting antennas. In other words, the constraint is that the average resource consumed be no greater than a fixed value: when the average of the resources cumulatively consumed so far (the sum of the allocation at the current instant and the allocations at all earlier instants, divided by the number of instants) is less than or equal to the fixed value, the scheduling model performs the scheduling operation; otherwise it does not. In a specific example, after the scheduling model obtains the resource allocation amount $a_t$ of the scheduling object from the environment information at instant t, it outputs, through an internal function $f$, an action $f(a_t)$ satisfying the average resource consumption constraint, to be executed in the learning environment. In other words, after obtaining $a_t$, the scheduling model performs the scheduling operation if the average resource consumption constraint is satisfied; otherwise the action is zeroed. Denoting by $\bar{P}(t)$ the average resource consumed by the scheduling model and by $P_{avg}$ the average resource consumption limit, the constraint is as shown in equation (1):

$\bar{P}(t) = \frac{1}{t+1}\sum_{s=0}^{t} \lVert a_s \rVert_1 \le P_{avg}$,   (1)

where $\lVert a_s \rVert_1$ denotes the total resource allocated at instant s.
Accordingly, the data constituting the training sample are the data output after optimization by the internal function $f$; that is, when the average resource consumption constraint is not satisfied, the resource allocation amount of the scheduling object and the throughput obtained after performing the scheduling operation based on it are set to zero in the scheduling data that enter the learning environment and are cached as training samples. This guides the scheduling model to consciously perceive and comply with the constraint during training.
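A sketch of such an internal function f for the average resource consumption constraint follows; the bookkeeping (one scalar allocation total per slot) is an assumption for illustration:

```python
class AverageResourceGuard:
    """Zeroes any action whose execution would push the running average of
    consumed resources above P_avg; otherwise passes the action through."""

    def __init__(self, p_avg: float):
        self.p_avg = p_avg
        self.consumed = 0.0  # total resource allocated over all past slots
        self.slots = 0       # number of slots seen so far

    def __call__(self, allocation: float) -> float:
        self.slots += 1
        if (self.consumed + allocation) / self.slots <= self.p_avg:
            self.consumed += allocation
            return allocation  # constraint satisfied: execute the scheduling operation
        return 0.0             # constraint violated: zero the action (the cached
                               # sample then also records zero throughput)
```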
In addition, a delay limit may be imposed on the scheduling objects in view of the delay problem in scheduling operations. The delay limit can be set by a person skilled in the art applying the scheduling model according to the actual system requirements. In this case, the training method of the present application further comprises the steps of caching the scheduling objects and processing the cached objects according to the set delay limit. In some embodiments, processing the cached scheduling objects according to the delay limit comprises serving the objects that satisfy the limit and discarding those that do not. In some embodiments, the same delay limit is set for all scheduling objects of a given user: an object of that user must be served within the set delay after reaching the buffer, otherwise it is discarded. In this way, the scheduling model can maximize throughput while satisfying both the resource consumption and the delay limits when performing scheduling operations.
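A simplified sketch of a per-user queue with such a delay limit (the aging scheme is an assumption):

```python
from collections import deque

class DelayLimitedQueue:
    """Queue q_i whose scheduling objects are discarded once their remaining
    delay runs out, tau_i slots after arrival."""

    def __init__(self, tau_i: int):
        self.tau_i = tau_i
        self.remaining = deque()  # remaining delay of each buffered object

    def arrive(self, count: int):
        self.remaining.extend([self.tau_i] * count)

    def tick(self) -> int:
        """Advance one slot: age every object and discard timed-out ones."""
        aged = [d - 1 for d in self.remaining]
        dropped = sum(1 for d in aged if d <= 0)
        self.remaining = deque(d for d in aged if d > 0)
        return dropped  # number of objects discarded in this slot
```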
In a specific example, the scheduling model makes a scheduling decision at every instant. On this basis, the training process is set to comprise M preset durations, also called M segments, each containing T instants, i.e., each segment has length T. In addition, the number of updates of the scheduling model is set to two per segment, and for the whole training process the actor network parameters are updated when the segment number is a multiple of q, with q = 2 in this example. Please refer to fig. 5, a flowchart of the training method of the scheduling model in another embodiment. As shown, the training method comprises steps S200 to S250.
Training of the scheduling model is started based on the above settings, and in step S200, the scheduling model is randomly initialized and the playback buffer is initialized.
In step S210, it is determined whether the current segment number is greater than the total number of segments M; if so, the training process ends, otherwise step S220 is executed. In this example, the segment number starts from 1.
In step S220, it is determined whether the current instant t is greater than the segment length T; if so, step S230 is executed, otherwise steps S221 and S222 are executed, starting in this example from t = 1.
In step S221, the scheduling model is run to obtain scheduling data from the observed environment information and to perform a scheduling operation. In some embodiments, to ensure that the scheduling operation satisfies the constraint condition, the output of the scheduling model is first passed through the constraint adaptation module before the scheduling operation is performed; that is, whether to perform the scheduling operation is determined according to the constraint condition, and the scheduling data processed by the constraint condition are obtained.
In step S222, the next instant is obtained from the current one, i.e., t = t + 1, and the process returns to step S220.
In step S230, the schedule data obtained in step S221 is stored in a playback buffer in the form of a sequence as a training sample.
In step S240, it is determined whether the current update count i is greater than the preset update count 2; if so, step S250 is executed, otherwise steps S241 to S246 are executed. In this example, i starts from i = 1.
In step S241, a part of the training samples from the training sample set buffered in step S230 is sampled as a training sample input.
In step S242, the critic network target value is calculated using the softmax operator.
In step S243, the critic network parameters are updated.
In step S244, it is determined whether the current segment number is a multiple of 2; if so, steps S245 and S246 are executed, otherwise step S246 is executed directly.
In step S245, the actor network parameters are updated.
In step S246, the next update count is obtained from the current one, i.e., i = i + 1, and the process returns to step S240.
In step S250, the next segment number is obtained from the current one, i.e., the segment number is incremented by 1, and the process returns to step S210.
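Putting steps S200 to S250 together, the overall procedure of fig. 5 can be outlined as below; env, model, and buffer expose illustrative methods, and constraint handling is assumed to be folded into model.act:

```python
def train(env, model, buffer, M: int, T: int, num_updates: int = 2, q: int = 2):
    model.initialize_randomly()                    # S200: init model and buffer
    for segment in range(1, M + 1):                # S210: loop over M segments
        episode = []
        for t in range(1, T + 1):                  # S220/S222: one segment of length T
            obs = env.observe()
            action, reward, done = model.act(obs)  # S221: schedule under constraints
            episode.append((obs, action, reward, done))
        buffer.add_sequence(episode)               # S230: cache the sequence
        for _ in range(num_updates):               # S240/S246: two updates per segment
            batch = buffer.sample(64)              # S241: sample training samples
            model.update_critic(batch)             # S242/S243: softmax target, update
            if segment % q == 0:                   # S244: even-numbered segments only
                model.update_actor(batch)          # S245: interval actor update
```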
In a specific example, the scheduling model of the present application employs a neural network in a scheduling algorithm based on reinforcement learning. In the training phase, the scheduling algorithm guides the agent (corresponding to the scheduling model in this application) to interact with the environment. The agent continuously explores the environment in every interaction time slot of every interaction segment (corresponding to the preset duration in this application) and exploits the characteristics of the environment to maximize the cumulative benefit (corresponding to the throughput in this application). After a large number of interaction segments, the training phase ends, yielding a deep neural network that can actually be deployed.
For example, taking the case where the scheduling model makes a scheduling decision at every instant t, the scheduling model of the present application may perform scheduling operations for N users. In some embodiments, the scheduling model further comprises a buffer module $\mathcal{B}$ for caching the scheduling objects. In the scheduling operation, time is divided into discrete instants $t \in \{0, 1, 2, \ldots\}$, and scheduling objects are served in units of discrete packets. After arriving at the scheduling model, the scheduling objects are first cached in the buffer module $\mathcal{B}$; the scheduling algorithm then chooses an appropriate time to serve them on the service channel, thereby sending them to their destination users. The buffer module $\mathcal{B}$ contains N unordered queues, one per user. At instant t, the number of scheduling objects arriving at queue $q_i$ of user i is $A_i(t)$, with weight $\beta_i$, and the corresponding service channel state is $c_i(t)$. Every scheduling object belonging to user i has the same delay bound $\tau_i$; that is, a scheduling object of that user must be served within $\tau_i$ of reaching the buffer module, otherwise it times out and is discarded. Therefore, the state of each scheduling object in the buffer module $\mathcal{B}$ can be uniquely determined by the pair $(i, \tau)$ of user index and remaining delay. In particular, the state of queue $q_i$ of user i at instant t is $u_i(t) = \big(u_i^{(1)}(t), \ldots, u_i^{(\tau_i)}(t)\big)$, where $u_i^{(\tau)}(t)$ denotes the number of scheduling objects in queue $q_i$ whose remaining delay is $\tau$.
At each instant t, the scheduling model makes a scheduling decision. For queue $q_i$ of user i, the scheduling decision at instant t is $a_i(t) = \big(a_i^{(1)}(t), \ldots, a_i^{(\tau_i)}(t)\big)$, where $a_i^{(\tau)}(t)$ denotes the service resource allocated to the scheduling objects of queue $q_i$ whose remaining delay is $\tau$. Whether a scheduling object is ultimately transmitted successfully depends on its service resource and its channel state: the more service resource, the higher the probability that the object is successfully delivered to the destination user; objects that fail transmission are abandoned and not retransmitted. At instant t, the number of scheduling objects successfully sent for user i is denoted $D_i(t)$. Then, from instant 0 to instant t, the weighted throughput of the scheduling model is $R(t) = \sum_{s=0}^{t} \sum_{i=1}^{N} \beta_i D_i(s)$, where $\beta_i$ is the throughput weight of user i, and the average resource consumed by the scheduling model is $\bar{P}(t) = \frac{1}{t+1} \sum_{s=0}^{t} \sum_{i=1}^{N} \sum_{\tau=1}^{\tau_i} a_i^{(\tau)}(s)$.
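These two quantities can be computed directly from the bookkeeping above, as in this sketch (D[s][i] holds D_i(s), and allocations[s] holds all allocation entries of slot s; both layouts are assumptions):

```python
def weighted_throughput(D, beta):
    """R(t) = sum over slots s and users i of beta_i * D_i(s)."""
    return sum(beta[i] * d for D_s in D for i, d in enumerate(D_s))

def average_resource(allocations):
    """Mean over slots of the total resource allocated in each slot."""
    totals = [sum(a_s) for a_s in allocations]
    return sum(totals) / len(totals) if totals else 0.0
```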
On this basis, for the training process of the scheduling model, given a segment length $T$, for each time $t \in \{1, \ldots, T\}$ in a segment: the action of the agent is $a_t = \{e_i^{(\tau)}(t)\}_{i,\tau}$, representing the resource allocation amount for each scheduling-object state in the cache; the environment variable observed by the agent is $o_t = \big(s_1(t), \ldots, s_N(t), c_1(t), \ldots, c_N(t)\big)$, representing the states of the scheduling objects of each queue in the cache and the states of the service channels observed by the agent, where the observed environment variable is a partial observation of the complete environment variable; $d_t$ indicates whether the current segment is finished: if the segment ends after the agent takes action $a_t$, then $d_t = 1$, otherwise $d_t = 0$; the instantaneous benefit of the agent is $r_t = \sum_{i=1}^{N}\beta_i D_i(t)$, representing the instantaneous weighted throughput. In addition, the state transition probability matrix of the environment is entirely unknown at every time instant. The discount factor $\gamma$ is set to 1, i.e., the optimization objective of the agent is

$$\max\; \mathbb{E}\Big[\sum_{t=1}^{T} r_t\Big].$$
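As an illustration of this partially observed interface, the tuple handed to the agent at each slot could be assembled as follows (a sketch reusing the DelayQueue above; the function names are hypothetical):

```python
import numpy as np

def make_observation(queues, channel_states):
    """o_t: per-queue remaining-delay counts plus service channel states.
    Arrival statistics and channel dynamics remain hidden, so o_t is only
    a partial observation of the complete environment state."""
    return np.concatenate([q.x for q in queues] + [np.asarray(channel_states)])

def instant_reward(beta, successes):
    """r_t = sum_i beta_i * D_i(t): the instantaneous weighted throughput."""
    return float(np.dot(beta, successes))
```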
In addition, based on the critic network structure and the actor network structure described above, in the present example the number of neurons of the fully connected layers and the long short-term memory layers in both network templates is set to $L$. Finally, 8 deep neural network instances are generated as follows: two actor network instances, actor network $\pi_1$ and actor network $\pi_2$, are generated from the actor network template; two critic network instances, critic network $Q_1$ and critic network $Q_2$, are generated from the critic network template; and target network copies of the two actor network instances and the two critic network instances are generated: target actor networks $\pi'_1$ and $\pi'_2$, and target critic networks $Q'_1$ and $Q'_2$.
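One way to realize the two templates (matching the dual-branch structures of claims 8 and 9) is the following PyTorch sketch; apart from the shared width L, the input handling and class names are assumptions rather than the patent's exact design:

```python
import torch
import torch.nn as nn

class DualBranch(nn.Module):
    """Fully-connected + LSTM dual-branch trunk of width L (per claims 8-9)."""
    def __init__(self, in_dim, L):
        super().__init__()
        self.fc_branch = nn.Sequential(nn.Linear(in_dim, L), nn.ReLU())
        self.pre_lstm = nn.Sequential(nn.Linear(in_dim, L), nn.ReLU())
        self.lstm = nn.LSTM(L, L, batch_first=True)

    def forward(self, h):                     # h: (batch, seq, in_dim) history
        fc = self.fc_branch(h[:, -1])         # FC branch sees the latest step
        mem, _ = self.lstm(self.pre_lstm(h))  # LSTM branch summarizes the history
        return torch.cat([fc, mem[:, -1]], dim=-1)  # concatenation layer

class Actor(nn.Module):
    """Actor template: dual branch, then FC + tanh head."""
    def __init__(self, obs_dim, act_dim, L):
        super().__init__()
        self.trunk = DualBranch(obs_dim, L)
        self.head = nn.Linear(2 * L, act_dim)

    def forward(self, h):
        return torch.tanh(self.head(self.trunk(h)))  # bounded allocations

class Critic(nn.Module):
    """Critic template: dual branch, then FC + ReLU + FC head."""
    def __init__(self, obs_dim, act_dim, L):
        super().__init__()
        self.trunk = DualBranch(obs_dim + act_dim, L)
        self.head = nn.Sequential(nn.Linear(2 * L, L), nn.ReLU(), nn.Linear(L, 1))

    def forward(self, h, a):                  # a: (batch, seq, act_dim)
        return self.head(self.trunk(torch.cat([h, a], dim=-1)))
```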
During the training of the scheduling model, the inputs to the scheduling algorithm include: the number of training segments $M$, the segment length $T$, the network width $L$, the action noise variance $\sigma$, the number of training samples $K$, the number of sampling noise samples $J$, the sampling noise variance $\sigma'$, the noise margin $c$, the inverse temperature $\beta$, the sampling distribution $p$, the policy delay $q$, and the target network update rate $\tau$. The outputs obtained from the scheduling algorithm include the trained actor networks $\pi_1$ and $\pi_2$ and critic networks $Q_1$ and $Q_2$. The specific process of the scheduling algorithm is now shown in connection with fig. 5.
First, the actor networks, the critic networks, and the corresponding target actor networks and target critic networks are initialized. Specifically, actor networks $\pi_1$ and $\pi_2$ are initialized with random weight parameters $\phi_1$ and $\phi_2$; critic networks $Q_1$ and $Q_2$ are initialized with random weight parameters $\theta_1$ and $\theta_2$; target actor networks $\pi'_1$ and $\pi'_2$ are initialized with weight parameters $\phi'_1$ and $\phi'_2$; and target critic networks $Q'_1$ and $Q'_2$ are initialized with weight parameters $\theta'_1$ and $\theta'_2$ (in practice, the target weights are typically initialized as copies of the corresponding main network weights).
Second, the playback buffer is initialized to empty.
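Initialization can then follow the standard pattern of building the four main networks, cloning them into target copies, and starting from an empty playback buffer; a sketch reusing the Actor and Critic classes above, with illustrative sizes:

```python
import copy
from collections import deque

obs_dim, act_dim, L = 64, 16, 128                          # illustrative sizes
actors  = [Actor(obs_dim, act_dim, L) for _ in range(2)]   # pi_1, pi_2
critics = [Critic(obs_dim, act_dim, L) for _ in range(2)]  # Q_1, Q_2
target_actors  = [copy.deepcopy(a) for a in actors]        # pi'_1, pi'_2
target_critics = [copy.deepcopy(c) for c in critics]       # Q'_1, Q'_2
replay_buffer = deque(maxlen=100_000)                      # playback buffer of segments
```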
Then, the following loop is repeated for the segment number from 1 to M:

a) The following loop is repeated for time t from 1 to T:

i) the agent observes the environment variable $o_t$;

ii) the historical observation at time t is set as $h_t \leftarrow (h_{t-1}, a_{t-1}, o_t)$;

iii) the action is set as $a_t \leftarrow g(\pi_1, \pi_2) + \epsilon$, where the exploration noise $\epsilon$ obeys a normal distribution with parameters $(0, \sigma)$;

iv) the agent executes the action $f(a_t)$, optimized by the constraint adaptation module, in the learning environment;

b) The sequence $(o_1, a_1, r_1, d_1, \ldots, o_T, a_T, r_T, d_T)$ is recorded into the playback buffer;
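Steps a) and b) amount to rolling out one segment with exploration noise, passing each action through the constraint adaptation map f before it enters the environment, and storing the whole sequence; a hedged sketch (env, g and f are assumed interfaces, not defined in the patent text):

```python
import numpy as np

def collect_segment(env, g, f, T, sigma):
    """Roll out one segment of length T and return the stored sequence."""
    o = env.reset()
    h, seq = [o], []
    for t in range(1, T + 1):
        a = g(h) + np.random.normal(0.0, sigma, size=env.act_dim)  # exploration noise
        o2, r, done = env.step(f(a))         # f projects a onto the feasible set
        seq.extend([o, a, r, float(done)])   # (o_t, a_t, r_t, d_t)
        h.append(a); h.append(o2)            # h_t <- (h_{t-1}, a_{t-1}, o_t)
        o = o2
    return seq

# usage: replay_buffer.append(collect_segment(env, g, f, T=100, sigma=0.1))
```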
c) The following loop is repeated for i from 1 to 2:

i) K sequence samples $\{(o_1, a_1, r_1, \ldots, o_T, a_T, r_T)\}$ are randomly sampled from the playback buffer;

ii) the following loop is repeated for time t from 1 to T:

1) J noise samples $\epsilon'$ are randomly sampled, where $\epsilon'$ obeys a normal distribution with parameters $(0, \sigma')$;

2) for each noise sample $\epsilon'_j$, a perturbed target action is formed as $\tilde{a}_{t+1,j} \leftarrow \pi'_i(h_{t+1}) + \mathrm{clip}(\epsilon'_j, -c, c)$, where the noise is truncated to the margin $c$;

3) the smaller of the two target critic values is taken: $\tilde{Q}(\tilde{a}_{t+1,j}) \leftarrow \min_{k=1,2} Q'_k(h_{t+1}, \tilde{a}_{t+1,j})$;

4) the softmax value of the target critic is defined and calculated according to

$$\mathrm{softmax}_{\beta}\big(\tilde{Q}\big) = \frac{\sum_{j=1}^{J} \exp\big(\beta \tilde{Q}(\tilde{a}_{t+1,j})\big)\,\tilde{Q}(\tilde{a}_{t+1,j})\,/\,p(\tilde{a}_{t+1,j})}{\sum_{j=1}^{J} \exp\big(\beta \tilde{Q}(\tilde{a}_{t+1,j})\big)\,/\,p(\tilde{a}_{t+1,j})},$$

where $\beta$ is the inverse temperature and $p$ is the sampling distribution, to obtain the target value $y_t \leftarrow r_t + \gamma\,(1 - d_t)\,\mathrm{softmax}_{\beta}\big(\tilde{Q}\big)$;
iii) according to the Bellman loss

$$L(\theta_i) = \frac{1}{KT}\sum_{k=1}^{K}\sum_{t=1}^{T}\big(y_t - Q_i(h_t, a_t)\big)^2,$$

the critic network weight parameter $\theta_i$ is updated by gradient descent;

iv) if the segment number is a multiple of the policy delay q, then:

1) according to the recurrent policy gradient

$$\nabla_{\phi_i} J(\phi_i) = \frac{1}{KT}\sum_{k=1}^{K}\sum_{t=1}^{T} \nabla_{a} Q_i(h_t, a)\Big|_{a=\pi_i(h_t)}\, \nabla_{\phi_i}\, \pi_i(h_t),$$

the actor network weight parameter $\phi_i$ is updated along this gradient (gradient ascent on the objective, i.e., gradient descent on its negative);

2) the target network weight parameters are updated according to

$$\theta'_i \leftarrow \tau\,\theta_i + (1-\tau)\,\theta'_i, \qquad \phi'_i \leftarrow \tau\,\phi_i + (1-\tau)\,\phi'_i.$$
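The distinctive part of the critic update is the softmax target value, estimated by self-normalized importance sampling over the J noise samples, followed by the soft target-network update; a minimal PyTorch sketch under the reconstruction above (function names are illustrative):

```python
import torch

def softmax_target(tq, logp, beta):
    """softmax_beta(Q~) via self-normalized importance sampling.
    tq: (J,) min-over-critics values at the J perturbed target actions;
    logp: (J,) log-density of the sampling distribution p at those actions."""
    w = torch.softmax(beta * tq - logp, dim=0)  # weights proportional to exp(beta*Q)/p
    return (w * tq).sum()

def critic_target(r, d, gamma, tq, logp, beta):
    """y_t = r_t + gamma * (1 - d_t) * softmax_beta(Q~)."""
    return r + gamma * (1.0 - d) * softmax_target(tq, logp, beta)

def soft_update(target, source, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied per parameter."""
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.mul_(1.0 - tau).add_(tau * sp)
```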
The scheduling model trained with this training method can perform efficient scheduling while strictly satisfying the average resource consumption constraint

$$\limsup_{t\to\infty} P(t) \le P_{avg},$$

where $P_{avg}$ is the resource limit, as well as the delay limits, and achieves a weighted average throughput $\limsup_{t\to\infty} R(t)$ that is close to the maximum.
Fig. 6 is a schematic structural diagram of a training system for a scheduling model according to an embodiment of the present disclosure. As shown, the training system of the scheduling model includes an obtaining module 10 and a training module 11.
The obtaining module 10 is configured to obtain a training sample set in at least one preset duration, where the training sample set includes training samples in each preset duration, the training samples include scheduling data in each time slot in the preset duration, and the scheduling data includes environment information, resource allocation of a scheduling object obtained by running the scheduling model, and throughput obtained after performing scheduling operation based on the resource allocation.
The interaction between the scheduling model and the environment during training can be divided into a plurality of preset durations; the training method of the scheduling model is executed for each preset duration, and this cycle repeats until the whole training process is completed. The preset duration may be set according to the unit quantity of the scheduling decisions made by the scheduling model, or set empirically by a person skilled in the field to which the scheduling model is applied. Every preset duration has the same length throughout the training process of the scheduling model. In some embodiments, the plurality of preset durations may also be referred to as a plurality of segments: the training process of the scheduling model is divided into a plurality of segments, and the length of each segment corresponds to the length of the preset duration.
A time slot is a subdivision of the preset duration and corresponds to the unit quantity of scheduling decisions made by the scheduling model within the preset duration. For example, if the scheduling model makes a scheduling decision at each time instant, the preset duration may be set to length T, i.e., it contains T time instants, and each time instant is a time slot. As another example, if the scheduling model makes a scheduling decision every 24 hours (one day), the preset duration may be set to N days, and each day (24 hours) is a time slot.
The scheduling data includes the environment information, the resource allocation amount of the scheduling object obtained by running the scheduling model, and the throughput obtained after the scheduling model performs a scheduling operation based on that resource allocation amount. The environment information refers to the information observed when the scheduling model interacts with the environment. Generally, the environment information includes the state of the scheduling object and the state of the service channel, where the service channel is the channel through which resources are allocated to the scheduling object. In some embodiments, the state of a scheduling object includes information distinguishing one scheduling object from another; for example, it includes information indicating the queue in which the scheduling object is located, information on which user the scheduling object is to be scheduled to, and constraint information of the scheduling object itself, such as its delay status information. The state of the service channel includes information about the service channel that allocates resources for the scheduling object. For example, in the field of wireless communication scheduling, the state of a service channel includes information indicating the state of the wireless channel between a transmitting antenna and a destination user; in the field of video stream scheduling, it includes information indicating the state of the communication link between a video buffer and the user terminal equipment; in the field of order delivery scheduling, it includes information indicating the state of the physical road between a delivery station and a delivery destination.
The resource allocation amount of the scheduling object is generated by the scheduling model according to the observed environment information; the throughput is the instantaneous benefit obtained after the scheduling model performs the scheduling operation with the obtained resource allocation amount, where in the present application the instantaneous benefit refers to the instantaneous weighted throughput. The throughput is related to the states of the scheduling objects and to the number of successfully scheduled scheduling objects.
In certain embodiments, the obtaining module comprises: the acquiring unit is used for acquiring the resource allocation amount and the instantaneous weighted throughput of the scheduling object in the corresponding time slot by operating the scheduling model according to the environmental information in each time slot; and a buffer unit for buffering the training samples.
In a specific example, first, the scheduling model may be randomly initialized and the playback buffer initialized. The playback buffer may be configured in the scheduling model in advance, or it may communicate with the scheduling model to provide the scheduling model with the data sets it stores. Then, for a preset duration of length T, the scheduling model is run on the environment information at the first time instant to obtain the resource allocation amount and instantaneous weighted throughput of the scheduling object at that instant, then run on the environment information at the second time instant to obtain the corresponding resource allocation amount and instantaneous weighted throughput, and so on until the resource allocation amount and instantaneous weighted throughput at the T-th instant are obtained; the scheduling data at each instant are then stored in the playback buffer in sequence form as a training sample.
In the case where the training process of the scheduling model is divided into a plurality of preset durations, the scheduling data further comprises an index $d_t$ indicating whether the current preset duration has ended. In some embodiments, whether the current preset duration is over may be indicated in the scheduling data in 1/0 form. For example, if the current duration ends after the resource allocation amount of the scheduling object is obtained through the scheduling model from the environment information at the current time, then $d_t = 1$ is recorded in the scheduling data; otherwise $d_t = 0$. In an embodiment, assuming the preset duration has length T, whether the current duration has ended can be represented by the comparison of the current time t with the preset duration T: if $t > T$, then $d_t = 1$; otherwise $d_t = 0$.
In actual operation, after the scheduling data at each time within the preset duration are obtained, they are cached in the playback buffer in sequence form as one training sample, and this step is repeated to obtain a training sample set consisting of a plurality of training samples. These training samples serve as the sample input for training the scheduling model. Here, each buffered training sample is a data sequence comprising the environment information, the resource allocation amount, the weighted throughput, and the index indicating whether the current preset duration has ended.
The training module 11 is configured to train the scheduling model based on the training sample set to update parameters of the scheduling model.
The training samples are used as the input of the scheduling model, and the scheduling model is trained through deep reinforcement learning based on a partially observable Markov decision process, so as to update the parameters of the scheduling model. For example, the parameters of the scheduling model are the weights of its neural network structure. In some embodiments, the parameters of the scheduling model may be updated by computing the optimal parameters analytically, which is better suited to scheduling problems in simple environments. In other embodiments, the parameters of the scheduling model may be updated by gradient descent, enabling the scheduling model to achieve efficient scheduling.
The training system trains the scheduling model with the training sample set. By fully taking the partial observability of the environment into account and using multiple pieces of environment information observed over a period of time, hidden information that is not observable in the environment can be mined during training, so that when the scheduling model is used for scheduling operations it maximizes throughput while satisfying the resource consumption constraints.
In training the scheduling model based on the set of training samples, in some embodiments, the scheduling model may be trained from all current training samples stored in the playback cache. In other embodiments, portions of the training samples may be sampled from the playback buffer to train the scheduling model to reduce computational complexity.
Because reinforcement learning learns continuously during the interaction between the agent and the environment, in order to improve the training accuracy of the scheduling model, the step of training the scheduling model on the training sample set may begin only after the scheduling model has initially been run to obtain training samples over several preset durations. In some embodiments, during training of the scheduling model, the scheduling model and the playback buffer are first initialized, and then a plurality of training samples are obtained and deposited into the playback buffer according to the steps described above. After a plurality of training samples have been stored in the playback buffer, the training and learning process of the scheduling model begins: on one hand, the scheduling model interacts with the environment to obtain new training samples; on the other hand, its parameters are updated using the cached training samples, and this cycle repeats. Based on this, the training module comprises: a sampling unit for sampling the training sample set; and an updating unit for updating the parameters of the scheduling model by gradient descent based on the sampled data. The training steps performed by the training module are shown in fig. 2 and its corresponding description, and are not repeated here.
In the case where the scheduling model is a deep neural network structure, in order to improve the performance of the scheduling algorithm, the scheduling model comprises a critic network structure and an actor network structure, and the training module is used to train the critic network structure and the actor network structure based on the training sample set. The critic network structure and the actor network structure are shown in figs. 3 and 4 and their corresponding descriptions, and are not repeated here.
Since the parameter training of the actor network is affected by the output of the critic network, in order to update the parameters of the actor network more stably, in some embodiments the training module comprises: a first training unit for training the critic network structure based on the training sample set; and a second training unit for training the actor network structure using the critic network structure after at least one round of training.
In practical scheduling problems, the scheduling operation is limited by resource consumption. In order to ensure that the scheduling model provides a hard guarantee on the resource consumption constraint when performing scheduling operations, the training system of the scheduling model further comprises a constraint adaptation module for determining whether the obtained resource allocation amount satisfies a preset constraint condition; if the constraint condition is satisfied, the scheduling operation is performed based on the resource allocation amount; and if the constraint condition is not satisfied, the resource allocation amount is adjusted and the scheduling operation is performed based on the adjusted resource allocation amount. The constraint condition can be set by a person skilled in the field to which the scheduling model is applied according to the actual environment constraints. The constraints may include an instantaneous resource consumption constraint, an average resource consumption constraint, a service channel capacity constraint, and the like. In some embodiments, the constraint condition is that the average of the sum of the resource allocation amount obtained at the current time and the accumulated used resource allocation amount is less than or equal to a preset value. The resource allocation amount obtained at the current time refers to the amount produced by the scheduling model at the current time, and the accumulated used resource allocation amount refers to the sum of the resource allocation amounts at all times before the current time. The preset value is set according to the limits of the practical scheduling problem; for example, in the field of wireless communication scheduling, it is set according to the average power of the transmitting antennas.
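For the average-of-sum constraint just described, one simple realization of the constraint adaptation map f scales the proposed allocation down whenever the running average would exceed the preset value; a hypothetical Python sketch (the projection rule is an assumption, not the patent's stated design):

```python
import numpy as np

def constraint_adapt(a, used_so_far, t, p_avg):
    """Project action a so that (used_so_far + sum(a)) / t <= p_avg."""
    budget = p_avg * t - used_so_far       # resource still available at time t
    total = float(np.sum(a))
    if total <= budget:
        return a                           # constraint already satisfied
    return a * max(budget, 0.0) / total    # scale down to the remaining budget
```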
In addition, in consideration of the delay problem in the scheduling operation, a delay limiting condition may be set for the scheduling object, wherein the delay limiting condition may be set by a person skilled in the art to which the scheduling model is applied according to experience or according to actual system requirements. In this case, the training system of the scheduling model of the present application further includes: the cache module is used for caching the scheduling object; and the processing module is used for processing the cached scheduling object according to the time delay limiting condition.
Here, the working modes of the modules in the training system of the scheduling model of the present application are the same as or similar to the corresponding steps in the training method of the scheduling model, and are not described herein again.
Referring to fig. 7, a flowchart of an embodiment of the scheduling method of the present application is shown, and as shown in the drawing, the scheduling method includes steps S300 to S320.
In step S300, a scheduling object is received.
The scheduling object is an object to be sent to a user and is related to the application scenario of the scheduling operation. In the scheduling operation, time is divided into discrete time slots, and scheduling objects are served in units of discrete packets. For example, in the field of wireless communication scheduling, the scheduling object is a wireless communication data packet; in the field of video stream scheduling, it is a video stream data packet; in the field of order delivery scheduling, it is a delivery order.
In step S310, the resource allocation amount of the scheduling object is obtained by using the scheduling model obtained by the training method.
In some embodiments, in the case where a delay limit is set, the scheduling object is first cached in the caching module of the scheduling model after being received, and the scheduling model then performs the scheduling operation at an appropriate time. When a scheduling object cached in the caching module has not been served within the set delay limit, the scheduling object is discarded.
After receiving the scheduling object, the scheduling model obtains the resource allocation amount of the scheduling object according to the observed environment information. Specifically, the scheduling model is used as follows: with the actor networks $\pi_1$ and $\pi_2$ and the critic networks $Q_1$ and $Q_2$ trained in the training phase, at each time t the following steps are carried out:

a) the agent observes the environment variable $o_t$;

b) the historical observation at time t is set as $h_t \leftarrow (h_{t-1}, a_{t-1}, o_t)$;

c) the action is set as $a_t \leftarrow g(\pi_1, \pi_2)$, where $g$ combines the outputs of the two actor networks (for example, by selecting the action with the higher critic value);

d) the agent executes the action $f(a_t)$, optimized by the constraint adaptation module, in the environment;

e) the process advances to the next time instant and repeats from step a).
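Putting steps a) through e) together, deployment is a plain loop without exploration noise; a compact sketch reusing the assumed env, g and f interfaces from the training sketches above:

```python
def deploy(env, g, f, num_steps):
    """Run the trained policy: observe, extend history, act through f, repeat."""
    o = env.reset()
    h = [o]
    for _ in range(num_steps):
        a = g(h)                      # a_t <- g(pi_1, pi_2), no exploration noise
        o, r, done = env.step(f(a))   # constraint-adapted action enters the system
        h.append(a); h.append(o)      # h_t <- (h_{t-1}, a_{t-1}, o_t)
        if done:
            o = env.reset(); h = [o]
```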
The scheduling model described above may be applied in a variety of scenarios involving scheduling problems, to maximize throughput while satisfying resource consumption constraints and/or delay constraints. For example, in the field of wireless communication scheduling, mobile users receive data packets from a relay via wireless channels, and a transmitting antenna is responsible for sending the data packets to the users. The scheduling object is a wireless communication data packet; each data packet has a delay limit and is discarded if it has not been sent before timing out after reaching the relay. In addition, the transmitting antenna is limited by its average transmit power. Therefore, performing the scheduling operation with the trained scheduling model can maximize the number of successfully scheduled data packets while satisfying the average transmit power limit and the delay limit.
As another example, in the field of video stream scheduling, a mobile user requests an online video stream via a wireless network, and a router is responsible for transmitting a video data stream to the user. The scheduling object is a video stream data packet. Video stream packets have a delay limitation, since low-latency video-on-demand services are critical to the user experience. In addition, transmission resources such as bandwidth are limited. Therefore, the video traffic successfully scheduled can be maximized under the condition that the bandwidth limitation and the time delay limitation are met by adopting the trained scheduling model for scheduling operation.
For another example, in the field of order delivery scheduling, a user orders an item on an online platform, and the platform is responsible for delivery. Here, the scheduling object is a delivery order. Ensuring timely delivery is very important for user experience, so order delivery has a delay limit; however, delivery capacity is often limited. Therefore, performing the scheduling operation with the trained scheduling model can maximize the number of successfully scheduled orders while satisfying the delivery capacity limit and the delay limit.
In step S320, the scheduling target is scheduled according to the obtained resource allocation amount.
Referring to fig. 8, a schematic structural diagram of the scheduling system of the present application in an embodiment is shown, and as shown in the drawing, the scheduling system includes an input unit 20 and a scheduling model 21. The input unit 20 is used for receiving a scheduling object. The scheduling model 21 is configured to obtain a resource allocation amount of the scheduling object, and perform a scheduling operation on the scheduling object according to the resource allocation amount.
Here, the working modes of the modules in the scheduling system of the present application are the same as or similar to the corresponding steps in the scheduling method, and are not described herein again.
The application also provides an electronic device. Referring to fig. 9, which is a schematic structural diagram of an embodiment of an electronic device according to the present application, as shown, the electronic device includes at least one memory 30 and at least one processor 31.
In an embodiment, the electronic device is, for example, an electronic device loaded with an APP application or having web/website access capabilities, and includes components such as a memory, a memory controller, one or more processing units (CPUs), a peripheral interface, RF circuitry, audio circuitry, a speaker, a microphone, an input/output (I/O) subsystem, a display screen, other output or control devices, and an external port, which communicate via one or more communication buses or signal lines. The electronic device includes, but is not limited to, personal computers such as desktop computers, notebook computers, tablet computers, smart phones, smart televisions, and the like. The electronic device can also be an electronic device consisting of a host with a plurality of virtual machines and a human-computer interaction device (such as a touch display screen, a keyboard and a mouse) corresponding to each virtual machine.
The at least one memory is for storing at least one program; in embodiments, the memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid state storage devices. In certain embodiments, the memory may also include memory that is remote from the one or more processors, such as network attached memory that is accessed via RF circuitry or external ports and a communications network, which may be the internet, one or more intranets, local area networks, wide area networks, storage area networks, and the like, or suitable combinations thereof. The memory controller may control access to the memory by other components of the device, such as the CPU and peripheral interfaces.
In an embodiment, the at least one processor is connected to the at least one memory and configured to execute and implement at least one embodiment described above for the training method of the scheduling model, such as the embodiments described in fig. 1,2, and 5, when the at least one program is executed; alternatively, at least one embodiment described above for the scheduling method, such as the embodiment described in fig. 7, is performed and implemented. In an embodiment, the processor is operatively coupled with a memory and/or a non-volatile storage device. More specifically, the processor may execute instructions stored in the memory and/or the non-volatile storage device to perform operations in the computing device, such as generating image data and/or transmitting image data to an electronic display. As such, the processor may include one or more general purpose microprocessors, one or more special purpose processors, one or more field programmable logic arrays, or any combination thereof.
The application also provides a cloud server system. Referring to fig. 10, a schematic structural diagram of a cloud server system according to an embodiment of the present application is shown, where the cloud server system includes at least one storage device 40 and at least one processing device 41.
In some embodiments of the present application, the cloud server system may be deployed on one or more physical servers according to factors such as function and load. When distributed over a plurality of physical servers, it may consist of servers based on a cloud architecture. For example, a cloud-based server includes a public cloud (Public Cloud) server and a private cloud (Private Cloud) server, where the public or private cloud server provides Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS), Infrastructure-as-a-Service (IaaS), or the like; examples of such cloud computing service platforms include Meituan Cloud, Alibaba Cloud, Amazon Web Services, Baidu Cloud, Tencent Cloud, and the like. The server may also be formed by a distributed or centralized server cluster. For example, the server cluster consists of at least one physical server, each physical server hosting a plurality of virtual servers; each virtual server runs at least one functional module of the server, and the virtual servers communicate with each other through a network.
In an embodiment, the at least one storage device is configured to store at least one program, and the at least one processing device is connected to the at least one storage device and configured to execute and implement at least one embodiment described above for the training method of the scheduling model, such as the embodiments described in fig. 1, fig. 2, and fig. 5, when the at least one program is executed; alternatively, at least one embodiment described above for the scheduling method, such as the embodiment described in fig. 7, is performed and implemented.
The present application also provides a computer-readable and writable storage medium storing at least one program which, when executed by a processor, performs and implements at least one embodiment as described above for the training method of the scheduling model, such as the embodiments described in fig. 1,2 and 5; or to perform and implement at least one of the embodiments described above for the scheduling method, such as the embodiment described in fig. 7.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application.
In the embodiments provided herein, the computer-readable and writable storage medium may include read-only memory, random-access memory, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, a USB flash drive, a removable hard disk, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable-writable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are intended to be non-transitory, tangible storage media. Disk and disc, as used in this application, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
In one or more exemplary aspects, the functions described in the computer program of the methods described herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. The steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module, which may be located on a tangible, non-transitory computer-readable and/or writable storage medium. Tangible, non-transitory computer readable and writable storage media may be any available media that can be accessed by a computer.
The flowcharts and block diagrams in the figures described above of the present application illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the application. Any person skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical concepts disclosed in the present application shall be covered by the claims of the present application.

Claims (29)

1. A training method of a scheduling model is characterized by comprising the following steps:
acquiring a training sample set in at least one preset time length, wherein the training sample set comprises training samples in each preset time length, the training samples comprise scheduling data in each time slot in the preset time length, and the scheduling data comprise environment information, resource allocation of a scheduling object obtained by operating the scheduling model and throughput obtained after scheduling operation is executed based on the resource allocation; and
training the scheduling model based on the training sample set to update parameters of the scheduling model.
2. The training method of the scheduling model according to claim 1, wherein the step of obtaining the training sample set within a preset time duration comprises:
according to the environmental information of each time slot, the resource allocation amount and the instantaneous weighted throughput of the scheduling object in the corresponding time slot are obtained by operating the scheduling model; and
the training samples are buffered.
3. A training method for a scheduling model according to claim 1 or 2, characterized in that the environment information comprises: the state of the scheduling object and the state of a service channel for allocating resources to the scheduling object.
4. The method for training a scheduling model according to claim 1, wherein the step of training the scheduling model based on the training sample set comprises:
sampling the training sample set;
updating parameters of the scheduling model in a gradient descent mode based on the sampled data; and
and repeating the above steps until a preset number of updates is reached.
5. The training method of the scheduling model according to claim 1 or 4, wherein the scheduling model comprises a critic network structure and an actor network structure, and the step of training the scheduling model based on the training sample set comprises: training the critic network structure and the actor network structure based on the training sample set.
6. The training method of the scheduling model according to claim 5, wherein the step of training the scheduling model based on the training sample set further comprises:
training the critic network structure based on the training sample set; and
training the actor network structure using the critic network structure after at least one round of training.
7. The training method of the scheduling model according to claim 5, wherein the critic network structure comprises a fully-connected and long short-term memory dual-branch structure, and the actor network structure comprises a fully-connected and long short-term memory dual-branch structure.
8. The training method of the scheduling model according to claim 7, wherein the critic network structure comprises a first fully-connected branch and a first long short-term memory branch, and the first fully-connected branch and the first long short-term memory branch, after passing through a first concatenation layer, are sequentially connected to a third fully-connected layer, a third rectified linear unit activation function and a fourth fully-connected layer; the first fully-connected branch sequentially comprises a first fully-connected layer and a first rectified linear unit activation function, and the first long short-term memory branch sequentially comprises a second fully-connected layer, a second rectified linear unit activation function and a first long short-term memory layer.
9. The training method of the scheduling model according to claim 7, wherein the actor network structure comprises a second fully-connected branch and a second long short-term memory branch, and the second fully-connected branch and the second long short-term memory branch, after passing through a second concatenation layer, are sequentially connected to a seventh fully-connected layer and a hyperbolic tangent activation function; the second fully-connected branch sequentially comprises a fifth fully-connected layer and a fifth rectified linear unit activation function, and the second long short-term memory branch sequentially comprises a sixth fully-connected layer, a sixth rectified linear unit activation function and a second long short-term memory layer.
10. The training method of the scheduling model according to claim 1, further comprising the step of determining whether the obtained resource allocation amount satisfies a preset constraint condition; if the constraint condition is satisfied, performing the scheduling operation based on the resource allocation amount; and if the constraint condition is not satisfied, adjusting the resource allocation amount and performing the scheduling operation based on the adjusted resource allocation amount.
11. The method of claim 10, wherein the constraint condition comprises that an average of a sum of the obtained resource allocation and the accumulated used resource allocation at a current time is less than or equal to a predetermined value.
12. The training method of the scheduling model according to claim 1, wherein a delay constraint condition is set for the scheduling object, and the training method of the scheduling model further comprises:
caching the scheduling object; and
and processing the cached scheduling object according to the time delay limiting condition.
13. A training system for a scheduling model, comprising:
an obtaining module, configured to obtain a training sample set within at least one preset duration, where the training sample set includes training samples within each preset duration, the training samples include scheduling data in each time slot within the preset duration, and the scheduling data includes environment information, resource allocation of a scheduling object obtained by operating the scheduling model, and throughput obtained after performing scheduling operation based on the resource allocation; and
a training module to train the scheduling model based on the training sample set to update parameters of the scheduling model.
14. The training system of a scheduling model of claim 13 wherein the obtaining module comprises:
the acquiring unit is used for acquiring the resource allocation amount and the instantaneous weighted throughput of the scheduling object in the corresponding time slot by operating the scheduling model according to the environmental information in each time slot; and
and the buffer unit is used for buffering the training samples.
15. Training system of a scheduling model according to claim 13 or 14, characterized in that the context information comprises: the state of the scheduling object and the state of a service channel for allocating resources to the scheduling object.
16. The training system of the dispatch model of claim 13, wherein the training module comprises:
the sampling unit is used for sampling the training sample set;
and the updating unit is used for updating the parameters of the scheduling model in a gradient descending mode based on the sampled data.
17. The training system of the scheduling model according to claim 13 or 16, wherein the scheduling model comprises a critic network structure and an actor network structure, and the training module is configured to train the critic network structure and the actor network structure based on the training sample set.
18. The training system of the scheduling model according to claim 17, wherein the training module comprises:
a first training unit for training the critic network structure based on the training sample set; and
a second training unit for training the actor network structure using the critic network structure after at least one round of training.
19. The training system of the scheduling model according to claim 13, further comprising a constraint adaptation module for determining whether the obtained resource allocation amount satisfies a preset constraint condition; if the constraint condition is satisfied, the scheduling operation is performed based on the resource allocation amount; and if the constraint condition is not satisfied, the resource allocation amount is adjusted and the scheduling operation is performed based on the adjusted resource allocation amount.
20. The system for training a scheduling model of claim 19 wherein the constraint condition includes that the average of the sum of the obtained resource allocation and the accumulated used resource allocation at the current time is less than or equal to a predetermined value.
21. The training system of scheduling model of claim 13, wherein a delay constraint is set for the scheduling object, the training system of scheduling model further comprising:
the cache module is used for caching the scheduling object; and
and the processing module is used for processing the cached scheduling object according to the time delay limiting condition.
22. A scheduling method, comprising the steps of:
receiving a scheduling object;
obtaining the resource allocation amount of the scheduling object by using the scheduling model obtained by training the training method of any one of claims 1 to 12; and
and carrying out scheduling operation on the scheduling object according to the resource allocation amount.
23. The method of claim 22, wherein the scheduling object is a wireless communication packet.
24. The method of claim 22, wherein the scheduling object is a video stream packet.
25. The method of scheduling of claim 22 wherein the scheduling object is a delivery order.
26. A scheduling system, comprising:
an input unit for receiving a scheduling object;
a scheduling model trained by the training method according to any one of claims 1 to 12, configured to obtain a resource allocation amount of the scheduling object and to perform a scheduling operation on the scheduling object according to the resource allocation amount.
27. An electronic device, comprising:
at least one memory for storing at least one program;
at least one processor, coupled to the at least one memory, configured to execute and implement the training method of the scheduling model according to any one of claims 1 to 12, or the scheduling method according to any one of claims 22 to 25 when running the at least one program.
28. A cloud server system, comprising:
at least one storage device for storing at least one program;
at least one processing device, coupled to the storage device, configured to execute and implement the training method of the scheduling model according to any one of claims 1 to 12, or the scheduling method according to any one of claims 22 to 25 when running the at least one program.
29. A computer-readable storage medium, characterized in that at least one program is stored which, when being executed by a processor, carries out and implements a training method of a scheduling model according to any one of claims 1 to 12, or carries out and implements a scheduling method according to any one of claims 22 to 25.
CN202110512654.2A 2021-05-11 2021-05-11 Training method and system of scheduling model, scheduling method and system, and storage medium Pending CN113240003A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110512654.2A CN113240003A (en) 2021-05-11 2021-05-11 Training method and system of scheduling model, scheduling method and system, and storage medium

Publications (1)

Publication Number Publication Date
CN113240003A true CN113240003A (en) 2021-08-10

Family

ID=77133513

Country Status (1)

Country Link
CN (1) CN113240003A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination