CN117545085A - Multi-user downlink scheduling method, device, equipment and storage medium

Multi-user downlink scheduling method, device, equipment and storage medium

Info

Publication number: CN117545085A
Application number: CN202311526532.4A
Authority: CN (China)
Prior art keywords: priority, target, user downlink, logical channel, downlink scheduling
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 戴静, 陆宇涛, 鞠震宇, 郑康, 漆雨菂, 盛锋, 王坚
Assignees (current and original): China Mobile Zijin Jiangsu Innovation Research Institute Co ltd; China Mobile Communications Group Co Ltd; China Mobile Group Jiangsu Co Ltd
Filing date / priority date: 2023-11-15
Publication date: 2024-02-09
Application filed by China Mobile Zijin Jiangsu Innovation Research Institute Co ltd, China Mobile Communications Group Co Ltd, China Mobile Group Jiangsu Co Ltd

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W72/00: Local resource management
    • H04W72/12: Wireless traffic scheduling
    • H04W72/1263: Mapping of traffic onto schedule, e.g. scheduled allocation or multiplexing of flows
    • H04W72/1273: Mapping of traffic onto schedule of downlink data flows
    • H04W72/50: Allocation or scheduling criteria for wireless resources
    • H04W72/535: Allocation or scheduling criteria based on resource usage policies
    • H04W72/56: Allocation or scheduling criteria based on priority criteria


Abstract

The invention discloses a multi-user downlink scheduling method, device, equipment and storage medium, belonging to the technical field of wireless transmission. The invention acquires the priority factor of a logical channel; determines a weight coefficient of the priority factor based on the priority factor; adjusts the weight coefficient of the priority factor by using a preset reinforcement learning strategy to obtain an evaluation result characterizing the priority of the logical channel; and determines a target logical channel according to the evaluation result and performs multi-user downlink scheduling according to the target logical channel. Because the resource allocation behind each user's scheduling priority can be adjusted dynamically, user latency is reduced as much as possible and both the performance of the wireless network and the user experience are improved.

Description

Multi-user downlink scheduling method, device, equipment and storage medium
Technical Field
The present invention relates to the field of wireless transmission technologies, and in particular to a multi-user downlink scheduling method, device, equipment, and storage medium.
Background
In 5G, the MAC (Medium Access Control) layer plays a vital role in managing and allocating radio channel resources, optimizing resource utilization, and meeting the needs of different users and service types so as to provide high-rate, low-latency wireless data transmission. The scheduling policy of the MAC layer is formulated to optimize the use of radio channel resources and satisfy the requirements of different users and service types, thereby improving system capacity, bandwidth utilization and user experience while ensuring fairness and quality of service.
The 5G MAC layer employs various scheduling policies to optimize the allocation and management of radio resources. A scheduling policy is the general principle governing resource allocation across the whole network; it must consider the requirements of the different users and service types in the network and improve system capacity, coverage and user experience through reasonable resource allocation.
In existing multi-user downlink scheduling, once the scheduling strategy is determined, scheduling proceeds in a fixed pattern and cannot be adjusted flexibly and dynamically according to user requirements and network congestion conditions.
Disclosure of Invention
The main purpose of the present invention is to provide a multi-user downlink scheduling method, device, equipment and storage medium, aiming to solve the technical problem in the prior art that insufficient flexibility of multi-user downlink scheduling leads to poor scheduling performance.
In order to achieve the above object, the present invention provides a multi-user downlink scheduling method, which includes the following steps:
acquiring a priority factor of a logical channel;
determining a weight coefficient of the priority factor based on the priority factor;
adjusting the weight coefficient of the priority factor by using a preset reinforcement learning strategy to obtain an evaluation result characterizing the priority of the logical channel;
and determining a target logical channel according to the evaluation result, and performing multi-user downlink scheduling according to the target logical channel.
Optionally, the acquiring the priority factor of the logical channel includes:
acquiring influencing factors of multi-user downlink scheduling;
determining the scheduling priority of the logical channel according to the influencing factors;
and setting the priority factor of the logical channel through the scheduling priority.
Optionally, the adjusting the weight coefficient of the priority factor by using a preset reinforcement learning strategy to obtain an evaluation result characterizing the priority of the logical channel includes:
setting an upper-limit threshold on the number of learning iterations based on a preset reinforcement learning strategy, and initializing a reward value, an initial time, and a reward value table for storing the reward value;
taking the weight coefficient of the priority factor as an agent based on the preset reinforcement learning strategy;
selecting actions based on the agent using a preset greedy strategy and calculating an immediate return;
calculating a target reward value based on the immediate return;
and adjusting the weight coefficient of the priority factor through the target reward value to obtain an evaluation result characterizing the priority of the logical channel.
Optionally, the selecting actions based on the agent using a preset greedy strategy and calculating an immediate return includes:
selecting actions using a preset greedy strategy based on the agent, and counting the average delay of each logical channel over a preset time period;
and calculating an immediate return according to the average delay.
Optionally, the calculating a target reward value based on the immediate return includes:
acquiring a learning rate and a discount factor set by the preset reinforcement learning strategy;
and calculating a target reward value based on the relationship among the reward value, the immediate return, the learning rate, and the discount factor.
Optionally, the adjusting the weight coefficient of the priority factor through the target reward value to obtain an evaluation result characterizing the priority of the logical channel includes:
obtaining a target weight coefficient of the priority factor through the target reward value;
adjusting the weight coefficient through the target weight coefficient, and calculating the target scheduling priority of the logical channel;
calculating the target priority of the logical channel according to the target scheduling priority;
and obtaining an evaluation result characterizing the priority of the logical channel based on the target priority.
Optionally, the obtaining an evaluation result characterizing the priority of the logical channel based on the target priority includes:
sorting the target priorities to obtain a sorting result;
and selecting a corresponding logical channel according to the sorting result to obtain an evaluation result characterizing the priority of the logical channel.
In addition, in order to achieve the above object, the present invention further provides a multi-user downlink scheduling apparatus, where the multi-user downlink scheduling apparatus includes:
an acquisition module, configured to acquire the priority factor of a logical channel;
a determining module, configured to determine a weight coefficient of the priority factor based on the priority factor;
an adjustment module, configured to adjust the weight coefficient of the priority factor by using a preset reinforcement learning strategy to obtain an evaluation result characterizing the priority of the logical channel;
the determining module is further configured to determine a target logical channel according to the evaluation result, and perform multi-user downlink scheduling according to the target logical channel.
In addition, in order to achieve the above object, the present invention further provides a multi-user downlink scheduling device, where the multi-user downlink scheduling device includes: a memory, a processor, and a multi-user downlink scheduling program stored on the memory and operable on the processor, the multi-user downlink scheduling program being configured to implement the steps of the multi-user downlink scheduling method described above.
In addition, in order to achieve the above object, the present invention further provides a storage medium having a multi-user downlink scheduling program stored thereon; the multi-user downlink scheduling program, when executed by a processor, implements the steps of the multi-user downlink scheduling method described above.
The invention acquires the priority factor of a logical channel; determines a weight coefficient of the priority factor based on the priority factor; adjusts the weight coefficient of the priority factor by using a preset reinforcement learning strategy to obtain an evaluation result characterizing the priority of the logical channel; and determines a target logical channel according to the evaluation result and performs multi-user downlink scheduling according to the target logical channel. Because the resource allocation behind each user's scheduling priority can be adjusted dynamically, user latency is reduced as much as possible and both the performance of the wireless network and the user experience are improved.
Drawings
Fig. 1 is a schematic structural diagram of a multi-user downlink scheduling device in a hardware operation environment according to an embodiment of the present invention;
fig. 2 is a flow chart of a first embodiment of the multi-user downlink scheduling method of the present invention;
fig. 3 is a flow chart of a second embodiment of the multi-user downlink scheduling method of the present invention;
fig. 4 is a flow chart of a third embodiment of the multi-user downlink scheduling method of the present invention;
fig. 5 is a flow chart of a fourth embodiment of the multi-user downlink scheduling method of the present invention;
fig. 6 is a flowchart of optimizing the weight coefficients using a preset reinforcement learning strategy in an embodiment of the multi-user downlink scheduling method of the present invention;
fig. 7 is a block diagram of a first embodiment of a multi-user downlink scheduling apparatus according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a multi-user downlink scheduling device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the multi-user downlink scheduling device may include: a processor 1001 such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed Random Access Memory (RAM) or a stable Non-Volatile Memory (NVM) such as disk storage. The memory 1005 may optionally also be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the structure shown in fig. 1 does not constitute a limitation of the multi-user downlink scheduling apparatus, and may include more or fewer components than shown, or may combine certain components, or may be a different arrangement of components.
As shown in fig. 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module, and a multi-user downlink scheduling program.
In the multi-user downlink scheduling device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The multi-user downlink scheduling device invokes the multi-user downlink scheduling program stored in the memory 1005 through the processor 1001 and executes the multi-user downlink scheduling method provided by the embodiments of the present invention.
An embodiment of the present invention provides a multi-user downlink scheduling method, and referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the multi-user downlink scheduling method of the present invention.
In this embodiment, the multi-user downlink scheduling method includes the following steps:
step S10: the priority factor of the logical channel is obtained.
It should be noted that the execution body of this embodiment may be a multi-user downlink scheduling device, or any other device that implements the same or similar functions; this embodiment is not limited in this respect and is described by taking the multi-user downlink scheduling device as an example.
Commonly used scheduling strategies include static scheduling, collaboration-based scheduling, queue-based scheduling, power-control-based scheduling, and feedback-based scheduling. Typical implementations of these strategies include Round Robin (RR) scheduling, Max C/I scheduling, and Proportional Fair (PF) scheduling.
RR scheduling first determines a polling order for the users, which may be sorted by user priority, queue length or other requirements; then, at the beginning of each scheduling period, the base station selects the next user in the predetermined order and allocates resources to it. The allocation may depend on the amount of resources waiting in the user's queue and on the available resources.
The Max C/I strategy first performs channel measurement, evaluating each user's channel condition from indicators such as received power or signal quality; it then selects, among all current users, the user with the maximum carrier-to-interference ratio (C/I) as the object of the next resource allocation; finally, the base station allocates resources to the selected user so as to maximize its transmission rate and the system capacity.
The PF policy first calculates a scheduling index for each user, typically a comprehensive assessment of the user's historical transmission rate and current channel quality, such as the product of the average transmission rate and a channel quality parameter; it then selects the user with the maximum scheduling index among all users as the object of the next resource allocation; finally, resources are allocated to the selected user so as to balance system capacity against user experience.
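As a concrete illustration, the following minimal Python sketch shows one common textbook form of each selection rule. The data fields and the PF metric used here (the ratio of instantaneous to average rate, one of several composite indices in use) are illustrative assumptions rather than definitions taken from this patent.

```python
from dataclasses import dataclass

@dataclass
class User:
    uid: int
    inst_rate: float   # achievable instantaneous rate derived from current CQI
    avg_rate: float    # long-term average served rate

def rr_pick(users, last_idx):
    """RR: serve users in a fixed circular order, ignoring channel state."""
    return users[(last_idx + 1) % len(users)]

def max_ci_pick(users):
    """Max C/I: serve the user with the best instantaneous channel."""
    return max(users, key=lambda u: u.inst_rate)

def pf_pick(users):
    """PF: weigh instantaneous rate against the average already served."""
    return max(users, key=lambda u: u.inst_rate / max(u.avg_rate, 1e-9))
```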
These policies may be selected and optimized in the 5G MAC layer according to different scenarios and objectives. They aim to improve resource utilization, coverage and user experience, and to strike a balance between fairness and system throughput. The polling policy determines the order of resource allocation, while the scheduling priority determines the precedence of each user or device within that order. By setting scheduling priorities reasonably, the differing demands of users can be served flexibly, improving system performance and user experience.
However, the RR policy provides an undifferentiated service: users with different priorities or channel qualities are not explicitly distinguished, and their differing requirements and channel conditions are ignored, so resources may be wasted because a user receives the same share even when its channel conditions are poor. The Max C/I policy is biased towards users with high-quality channels, degrading the experience of the remaining users; it considers only channel quality, not the differing traffic demands of users, and so may fail applications with stringent delay or bandwidth requirements. The PF strategy has higher scheduling complexity, requiring index calculation and comparison for every user, and once its indices are determined it cannot be adjusted flexibly and dynamically to user requirements and network congestion; because it aggregates the needs of all users, its response to high-priority users may be relatively slow. Existing scheduling policies may therefore suffer from incomplete information, imbalance and lack of flexibility. For example, scheduling priority may be determined from limited information that represents only part of the network state and cannot fully account for factors such as user type, service requirements and network congestion, so the actual demands of users and the network conditions are not reflected accurately. Likewise, preset weights on the factors affecting scheduling priority may cause some factors to be over-emphasized while others are ignored, so some users or services always dominate while others never receive a fair allocation of resources. And once the scheduling strategy is determined, scheduling follows a fixed pattern that cannot be adjusted flexibly and dynamically to user requirements and network congestion.
Therefore, aiming mainly at the uRLLC scenario and based on Q-Learning theory, this embodiment proposes a resource allocation scheme that dynamically adjusts each user's scheduling priority so as to reduce user latency as much as possible. Compared with traditional schemes, it has advantages in adaptability, learning ability, system efficiency, personalized service and the other characteristics of reinforcement learning. These advantages enable the Q-Learning based scheme to better meet users' QoS requirements and to improve the performance and user experience of the wireless network.
In a specific implementation, when multi-user downlink scheduling is performed, five factors are considered together: 5QI type, delay, rate, channel quality and scheduling interval. From these, the scheduling priority of each logical channel in the list of user terminals with non-zero buffers can be designed and calculated, yielding the priority factor of the logical channel; the priority factor characterizes the scheduling priority of each logical channel.
Optionally, the step of obtaining the priority factor of the logical channel includes: acquiring the influencing factors of multi-user downlink scheduling; determining the scheduling priority of the logical channel according to the influencing factors; and setting the priority factor of the logical channel through the scheduling priority.
It should be noted that, since the five factors of 5QI type, delay, rate, channel quality and scheduling interval are considered together, the influencing factors of multi-user downlink scheduling can be obtained. After the influencing factors are determined, the scheduling priority of a logical channel is calculated by the following equation (1):
P_DL(t) = P_1 + P_2 + P_3 + P_4 + P_5    (1)
In equation (1), P_DL is the scheduling priority of the logical channel, P_1 is the 5QI priority of the logical channel, P_2 its delay priority, P_3 its rate priority, P_4 its channel quality priority, and P_5 its scheduling interval priority. The 5QI priority of the logical channel is calculated by equation (2):
P_1 = (100 - DefaultPriorityLevel_1) * f_1    (2)
In equation (2), DefaultPriorityLevel_1 is the default priority corresponding to the 5QI of the logical channel, which can be obtained from the protocol table; the higher the priority of the 5QI service type, the smaller this value, and it does not exceed 100. The difference between 100 and this value is used as the 5QI priority factor, so that a larger calculated value corresponds to a higher priority. f_1 is the weight coefficient of the 5QI priority factor.
The delay priority of the logical channel is calculated by equation (3):
P_2 = (τ / PDB) * f_2    (3)
In equation (3), τ is the head-of-line packet delay of the logical channel in the RLC queue; whenever a new packet enters the RLC queue, its arrival time must be maintained, and the head-of-line delay is updated at sorting time as the current time minus the arrival time of the head packet of the RLC queue. PDB is the packet delay budget corresponding to the 5QI of the logical channel, which can be obtained from the protocol table. The greater the ratio of τ to PDB, the longer the head-of-line delay of the LC and the higher the priority. f_2 is the weight coefficient of the delay priority factor.
The rate priority of the logical channel is calculated by equation (4):
P_3 = (GBR / R) * f_3    (4)
In equation (4), GBR is the guaranteed bit rate if the logical channel carries GBR traffic; if it carries Non-GBR traffic, a configured reference value is used instead. R is the average rate of the logical channel up to the current moment. Since GBR is a configured fixed value, the larger the ratio of GBR to R, the lower the current average rate of the LC and the higher the priority. f_3 is the weight coefficient of the rate priority factor.
The channel quality priority of the logical channel is calculated by equation (5):
P_4 = CQI * f_4    (5)
In equation (5), CQI is the channel quality indicator most recently reported by the UE. The larger the value, the better the channel quality and the higher the priority. f_4 is the weight coefficient of the channel quality priority factor.
The scheduling interval priority of the logical channel is calculated by equation (6):
P_5 = SchT * f_5    (6)
In equation (6), SchT is the interval from the last scheduling to the current time. The larger the value, the longer the scheduling interval and the higher the priority. f_5 is the weight coefficient of the scheduling interval priority factor.
Step S20: determining a weight coefficient of the priority factor based on the priority factor.
In a specific implementation, after the priority factors are obtained, the weight coefficients of the priority factors can be identified from equations (1) to (6) above.
Step S30: adjusting the weight coefficient of the priority factor by using a preset reinforcement learning strategy to obtain an evaluation result characterizing the priority of the logical channel.
It should be noted that the preset reinforcement learning strategy is the Q-Learning algorithm. Based on Q-Learning theory, in order to reduce the average delay of each user, the weight coefficients of the priority factors can be optimized by the preset reinforcement learning strategy so that the magnitudes of the individual factors reach the same level without losing their influence on the result, yielding an evaluation result characterizing the priority of the logical channels.
It should be noted that the evaluation result of logical channel priority is the final priority of each logical channel.
Step S40: determining a target logical channel according to the evaluation result, and performing multi-user downlink scheduling according to the target logical channel.
In a specific implementation, after the evaluation result characterizing logical channel priority is obtained, the priority of each logical channel is known, so the logical channel with the highest priority is selected as the target logical channel, and multi-user downlink scheduling is performed on that channel; in this way, personalized service can be provided for different users and service types. By learning each user's specific needs and behavior patterns, Q-Learning can allocate resources dynamically according to QoS requirements and priorities to meet the specific needs of different users, and an optimal allocation decision can be made from a combination of factors such as channel quality, congestion conditions and user requirements.
This embodiment acquires the priority factor of a logical channel; determines a weight coefficient of the priority factor based on the priority factor; adjusts the weight coefficient of the priority factor by using a preset reinforcement learning strategy to obtain an evaluation result characterizing the priority of the logical channel; and determines a target logical channel according to the evaluation result and performs multi-user downlink scheduling according to the target logical channel. Because the resource allocation behind each user's scheduling priority can be adjusted dynamically, user latency is reduced as much as possible and both the performance of the wireless network and the user experience are improved.
Referring to fig. 3, fig. 3 is a flow chart of a second embodiment of the multi-user downlink scheduling method according to the present invention.
Based on the above first embodiment, the step S30 of the multi-user downlink scheduling method of this embodiment includes:
step S301: and setting a learning frequency upper threshold based on a preset reinforcement learning strategy, and initializing a reward value, an initialization time and a reward value table for storing the reward value.
It should be noted that, when training the Q-Learning algorithm, the Learning frequency upper limit threshold Tmax may be set, and the logic channel priority factor may be initialized to 1, and the prize value may be initialized, where the initialization time t=0, and the prize value is the Q value, and the Q table for storing the Q value of the intelligent agent for each state and action combination may be initialized.
Step S302: and taking the weight coefficient of the priority factor as an agent based on the preset reinforcement learning strategy.
It should be noted that, when learning is started, an agent, actions and immediate returns can be designed, and the weighting coefficient of the priority factor is set as the agent and usedRepresenting an optional action, setting the step size to 1,/for ensuring the stability and reliability of the final calculated data>Wherein f is any one of the above agents.
In the Q-Learning algorithm, an agent (or proxy) is an entity that performs a Learning task. An agent is a learner that learns from experience and formulates optimal strategies to maximize his long-term rewards by interacting with the environment. The main tasks of the agent include: selecting action: at each time step, the agent needs to select an action to perform.
Step S303: and selecting actions based on the agent by using a preset greedy strategy, and calculating immediate returns.
Selecting actions according to a preset greedy strategy by updating time t=t+1Thereby calculating an immediate return, and if the immediate return increases, maintaining the action for the next learning.
It should be noted that the learning sequence of the above f is not limited, and in the Q-learning algorithm, the agent interacts with the environment to learn the optimal strategy, and updates the Q-value function to guide the action selection of the agent. In Q-learning, an agent is not able to directly acquire the state of an environment, but rather obtains information about the environment by selecting an action and observing an immediate return. If we can determine the adjustment order of the priority factor f in advance, then a fixed adjustment order may result in a locally optimal solution: the goal of Q-learning is to find a globally optimal strategy, but due to the complexity and uncertainty of the environment, we cannot determine the correct tuning order in advance. If we fix the adjustment order, the agent may get into a locally optimal solution and not find the optimal strategy.
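The greedy selection described here is commonly realized as an ε-greedy rule. The sketch below is a minimal Python illustration; the ε value and the (state, action) Q-table keys are assumptions, while the ±1 action set mirrors the step size of 1 above.

```python
import random

def epsilon_greedy_action(q_table, state, actions=(+1, -1), epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best Q value."""
    if random.random() < epsilon:
        return random.choice(actions)           # exploration
    # exploitation: action with the highest stored Q value (default 0.0)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```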
Optionally, the process of calculating the immediate return specifically includes: selecting actions using a preset greedy strategy based on the agent, and counting the average delay of each logical channel over a preset time period; and calculating an immediate return according to the average delay.
In a specific implementation, at time t, when agent f selects an action, the average delay of each logical channel over a preset time period is counted; the delay of a logical channel is taken as the mean of its packet delays PD_i over the period, where PD_i is the delay of each data packet from the UPF to the UE. The preset time period may be 100 ms, 120 ms, etc., which is not limited in this embodiment.
If the average delay of all logical channels in the set T changes within the time period, an immediate return function is designed as in equation (7), where w_t is the immediate return; it is designed so that a decrease in the average delay yields a larger return.
Step S304: calculating a target reward value based on the immediate return.
In implementations, the target reward value may be calculated from the immediate return; the target reward value is the optimal Q value.
For example, the target reward value may be calculated from parameters such as the discount factor, the learning rate and the immediate return.
Step S305: adjusting the weight coefficient of the priority factor through the target reward value to obtain an evaluation result characterizing the priority of the logical channel.
It will be appreciated that the coefficient assigned to each priority can be determined through the target reward value, so the weight coefficients of the priority factors can be adjusted to obtain an evaluation result characterizing the priority of the logical channels.
The Q value is continuously updated and stored in the Q table, and it is judged whether the maximum learning upper limit T_max has been reached; if T_max has been reached, learning ends, and if not, the time is updated to t = t + 1 and the immediate return continues to be calculated.
This embodiment sets an upper-limit threshold on the number of learning iterations based on a preset reinforcement learning strategy and initializes a reward value, an initial time and a reward value table for storing the reward value; takes the weight coefficient of the priority factor as an agent based on the preset reinforcement learning strategy; selects actions based on the agent using a preset greedy strategy and calculates an immediate return; calculates a target reward value based on the immediate return; and adjusts the weight coefficient of the priority factor through the target reward value to obtain an evaluation result characterizing the priority of the logical channel, thereby providing personalized service for different users and service types and optimizing the sub-channel allocation strategy.
Referring to fig. 4, fig. 4 is a flow chart of a third embodiment of the multi-user downlink scheduling method according to the present invention.
Based on the first and second embodiments, step S304 of the multi-user downlink scheduling method of this embodiment includes:
step S3041: and acquiring a learning rate and a discount factor set by a preset reinforcement learning strategy.
It will be appreciated that, since the definition of the Q function Q (a, s) is: starting from state s, the maximum reduced cumulative return for the first action a is performed, i.e. the Q value is the sum of the immediate return obtained after performing action a in state s and the subsequent reduced value obtained following the optimal strategy.
In state s according to the definition of Q value n The following Q values are the sum of the long-term cumulative rewards, and therefore the definition of Q values is as follows in equation 8:
in formula 8 above, p is represented by state s t Transition to the next state s t+1 γ is a discount factor, and in the case where the state transition probability is unknown, a specific iterative update formula of Q value is as follows:
in the above formula 9, a is the learning rate. In order to ensure the convergence of the Q-Learning algorithm, the Learning rate a needs to satisfy:
when t→∞,Will converge with probability 1 to an optimal Q value +.>
Thus, the learning rate a and the discount factor γ set by the preset reinforcement learning strategy can be obtained.
Step S3042: calculating a target reward value based on the relationship among the reward value, the immediate return, the learning rate, and the discount factor.
In this embodiment, the relationship between the reward value and the immediate return, the learning rate and the discount factor is given by equations (8) and (9) above, so the target reward value is calculated from that relationship together with the immediate return, the learning rate and the discount factor.
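In code, equation (9) is a one-line tabular update. The sketch below is a minimal assumed implementation: q_table maps (state, action) pairs to Q values, alpha is the learning rate a, and gamma is the discount factor γ.

```python
def q_update(q_table, state, action, reward, next_state,
             actions=(+1, -1), alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + a * (w_t + gamma * max_a' Q(s',a') - Q(s,a)), eq. (9)."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```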
This embodiment obtains the learning rate and the discount factor set by the preset reinforcement learning strategy and calculates the target reward value from the relationship among the reward value, the immediate return, the learning rate and the discount factor, so the optimal reward value can be obtained quickly from the learning rate, the discount factor and the immediate return.
Referring to fig. 5, fig. 5 is a flowchart illustrating a fourth embodiment of a multi-user downlink scheduling method according to the present invention.
Based on the first and second embodiments, step S305 of the multi-user downlink scheduling method of the present embodiment includes:
step S3051: and a target weight coefficient of a priority factor is obtained through the target reward value.
It should be noted that, after the target reward value Q is obtained by calculation, the target weight coefficient of the priority factor may be determined, and when the optimal Q value is obtained, the optimal solution of the weights corresponding to the priority factors of the logic channel may be obtained, so as to obtain the target weight coefficient of the priority factor.
Step S3052: and adjusting the weight coefficient through the target weight coefficient, and calculating the target scheduling priority of the logic channel.
In a specific implementation, the target weight coefficient may be brought into the above equations 2 to 6, so as to calculate each priority P of the logical channel, that is, the target scheduling priority.
Step S3053: and calculating the target priority of the logic channel through the target scheduling priority.
In implementations, the target priority of the logical channel may be calculated by the target scheduling priority and equation 1 above.
Step S3054: and obtaining an evaluation result of the characterization logic channel priority based on the target priority.
It will be appreciated that the target priorities of the logical channels may be ordered, so as to screen out the final logical channel priorities, thereby obtaining an evaluation result that characterizes the logical channel priorities.
Optionally, the step of obtaining an evaluation result characterizing the priority of the logical channel based on the target priority specifically includes: sorting the target priority to obtain a sorting result; and selecting a corresponding logic channel according to the sorting result to obtain an evaluation result for representing the priority of the logic channel.
After the target priority of each logical channel is obtained, the target priority may be ordered in a descending order, so as to obtain an ordering result, so that a corresponding logical channel is selected according to the ordering result, and a logical channel corresponding to the largest target priority in the ordering result is used as a final evaluation result.
As shown in fig. 6, fig. 6 is a flowchart of optimizing the weight coefficients with the preset reinforcement learning strategy. First, the logical channel set is determined, the upper limit on learning iterations T_max is set, the Q value is initialized, and the learning time is initialized to t = 0. The learning time is then updated to t = t + 1, an action is selected according to the greedy algorithm, the immediate return is calculated, the Q value is updated, and it is judged whether the maximum number of learning iterations has been reached. If it has, the coefficients assigned to each priority are determined and the calculation result is obtained; if not, the learning time is updated to t = t + 1 and action selection according to the greedy algorithm continues. Finally it is judged whether the target has been reached; if so, the optimization is complete, and if not, the flow returns to the step of determining the logical channel set.
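Tying the pieces together, the following sketch walks through the fig. 6 loop using the epsilon_greedy_action, immediate_return and q_update sketches above. The observe_mean_delay callback stands in for the per-window delay measurement and is hypothetical, as are T_max, the lower bound of 1 on each coefficient, and the state encoding.

```python
def optimize_weights(observe_mean_delay, t_max=1000):
    """Fig. 6 loop (sketch): tune f1..f5 to reduce average logical-channel delay."""
    f = {k: 1 for k in ('f1', 'f2', 'f3', 'f4', 'f5')}  # init priority factors to 1
    q_table = {}
    prev_delay = observe_mean_delay(f)                  # baseline measurement
    for t in range(1, t_max + 1):                       # t = t + 1 each iteration
        for name in f:                                  # each coefficient is an agent
            state = (name, f[name])
            action = epsilon_greedy_action(q_table, state)
            f[name] = max(1, f[name] + action)          # apply +/-1 step (assumed floor of 1)
            curr_delay = observe_mean_delay(f)          # measure over the preset window
            w_t = immediate_return(prev_delay, curr_delay)
            q_update(q_table, state, action, w_t, (name, f[name]))
            prev_delay = curr_delay
    return f                                            # coefficients at the learned Q values
```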
This embodiment obtains the target weight coefficient of the priority factor through the target reward value; adjusts the weight coefficient through the target weight coefficient and calculates the target scheduling priority of the logical channel; calculates the target priority of the logical channel according to the target scheduling priority; and obtains an evaluation result characterizing the priority of the logical channel based on the target priority, optimizing allocation through intelligent decision-making and thereby improving system efficiency and resource utilization.
Referring to fig. 7, fig. 7 is a block diagram illustrating a first embodiment of a multi-user downlink scheduling apparatus according to the present invention.
As shown in fig. 7, the multi-user downlink scheduling apparatus provided by the embodiment of the present invention includes:
an acquisition module 10, configured to acquire a priority factor of a logical channel.
A determining module 20, configured to determine a weight coefficient of the priority factor based on the priority factor.
The adjusting module 30 is configured to adjust the weight coefficient of the priority factor by using a preset reinforcement learning strategy, so as to obtain an evaluation result characterizing the priority of the logical channel.
The determining module 20 is further configured to determine a target logical channel according to the evaluation result, and perform multi-user downlink scheduling according to the target logical channel.
This embodiment acquires the priority factor of a logical channel; determines a weight coefficient of the priority factor based on the priority factor; adjusts the weight coefficient of the priority factor by using a preset reinforcement learning strategy to obtain an evaluation result characterizing the priority of the logical channel; and determines a target logical channel according to the evaluation result and performs multi-user downlink scheduling according to the target logical channel. Because the resource allocation behind each user's scheduling priority can be adjusted dynamically, user latency is reduced as much as possible and both the performance of the wireless network and the user experience are improved.
In an embodiment, the obtaining module 10 is further configured to acquire the influencing factors of multi-user downlink scheduling; determine the scheduling priority of the logical channel according to the influencing factors; and set the priority factor of the logical channel through the scheduling priority.
In an embodiment, the adjusting module 30 is further configured to set an upper-limit threshold on the number of learning iterations based on a preset reinforcement learning strategy, and initialize a reward value, an initial time, and a reward value table storing the reward value; take the weight coefficient of the priority factor as an agent based on the preset reinforcement learning strategy; select actions based on the agent using a preset greedy strategy and calculate an immediate return; calculate a target reward value based on the immediate return; and adjust the weight coefficient of the priority factor through the target reward value to obtain an evaluation result characterizing the priority of the logical channel.
In an embodiment, the adjusting module 30 is further configured to select actions using a preset greedy strategy based on the agent, and count the average delay of each logical channel over a preset time period; and calculate an immediate return according to the average delay.
In an embodiment, the adjusting module 30 is further configured to acquire a learning rate and a discount factor set by the preset reinforcement learning strategy; and calculate a target reward value based on the relationship among the reward value, the immediate return, the learning rate, and the discount factor.
In an embodiment, the adjusting module 30 is further configured to obtain a target weight coefficient of the priority factor through the target reward value; adjust the weight coefficient through the target weight coefficient and calculate the target scheduling priority of the logical channel; calculate the target priority of the logical channel according to the target scheduling priority; and obtain an evaluation result characterizing the priority of the logical channel based on the target priority.
In an embodiment, the adjusting module 30 is further configured to sort the target priorities to obtain a sorting result; and select a corresponding logical channel according to the sorting result to obtain an evaluation result characterizing the priority of the logical channel.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium stores a multi-user downlink scheduling program, and the multi-user downlink scheduling program realizes the steps of the multi-user downlink scheduling method when being executed by a processor.
Because the storage medium adopts all the technical schemes of all the embodiments, the storage medium has at least all the beneficial effects brought by the technical schemes of the embodiments, and the description is omitted here.
It should be understood that the foregoing is illustrative only and is not limiting, and that in specific applications, those skilled in the art may set the invention as desired, and the invention is not limited thereto.
It should be noted that the above-described working procedure is merely illustrative, and does not limit the scope of the present invention, and in practical application, a person skilled in the art may select part or all of them according to actual needs to achieve the purpose of the embodiment, which is not limited herein.
In addition, technical details not described in detail in this embodiment may refer to the multi-user downlink scheduling method provided in any embodiment of the present invention, which is not described herein again.
Furthermore, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element preceded by "comprising a(n) …" does not exclude the presence of other like elements in the process, method, article, or system that comprises that element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., a Read-Only Memory (ROM)/RAM, a magnetic disk, or an optical disk) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods of the embodiments of the present invention.
The foregoing description covers only the preferred embodiments of the present invention and does not limit its scope; any equivalent structure or equivalent process transformation made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, falls equally within the scope of patent protection of the present invention.

Claims (10)

1. A multi-user downlink scheduling method, characterized by comprising the following steps:
acquiring a priority factor of a logical channel;
determining a weight coefficient of the priority factor based on the priority factor;
adjusting the weight coefficient of the priority factor by using a preset reinforcement learning strategy to obtain an evaluation result characterizing the priority of the logical channel;
and determining a target logical channel according to the evaluation result, and performing multi-user downlink scheduling according to the target logical channel.
2. The multi-user downlink scheduling method of claim 1, wherein the obtaining the priority factor of the logical channel comprises:
acquiring influencing factors of multi-user downlink scheduling;
determining the scheduling priority of the logical channel according to the influencing factors;
and setting the priority factor of the logical channel through the scheduling priority.
3. The multi-user downlink scheduling method of claim 1, wherein the adjusting the weight coefficient of the priority factor using a preset reinforcement learning strategy to obtain the evaluation result characterizing the priority of the logical channel comprises:
setting an upper-limit threshold on the number of learning iterations based on a preset reinforcement learning strategy, and initializing a reward value, an initial time, and a reward value table for storing the reward value;
taking the weight coefficient of the priority factor as an agent based on the preset reinforcement learning strategy;
selecting actions based on the agent using a preset greedy strategy and calculating an immediate return;
calculating a target reward value based on the immediate return;
and adjusting the weight coefficient of the priority factor through the target reward value to obtain an evaluation result characterizing the priority of the logical channel.
4. The multi-user downlink scheduling method of claim 3, wherein the selecting actions based on the agent using a preset greedy strategy and calculating an immediate return comprises:
selecting actions using a preset greedy strategy based on the agent, and counting the average delay of each logical channel over a preset time period;
and calculating an immediate return according to the average delay.
5. The multi-user downlink scheduling method of claim 3, wherein said calculating a target reward value based on said immediate return comprises:
acquiring a learning rate and a discount factor set by the preset reinforcement learning strategy;
and calculating a target reward value based on the relationship among the reward value, the immediate return, the learning rate, and the discount factor.
6. The multi-user downlink scheduling method of claim 3, wherein the adjusting the weight coefficient of the priority factor through the target reward value to obtain the evaluation result characterizing the priority of the logical channel comprises:
obtaining a target weight coefficient of the priority factor through the target reward value;
adjusting the weight coefficient through the target weight coefficient, and calculating the target scheduling priority of the logical channel;
calculating the target priority of the logical channel according to the target scheduling priority;
and obtaining an evaluation result characterizing the priority of the logical channel based on the target priority.
7. The multi-user downlink scheduling method of claim 6, wherein the obtaining the evaluation result characterizing the priority of the logical channel based on the target priority comprises:
sorting the target priorities to obtain a sorting result;
and selecting a corresponding logical channel according to the sorting result to obtain an evaluation result characterizing the priority of the logical channel.
8. A multi-user downlink scheduling apparatus, wherein the multi-user downlink scheduling apparatus comprises:
an acquisition module, configured to acquire the priority factor of a logical channel;
a determining module, configured to determine a weight coefficient of the priority factor based on the priority factor;
an adjustment module, configured to adjust the weight coefficient of the priority factor by using a preset reinforcement learning strategy to obtain an evaluation result characterizing the priority of the logical channel;
wherein the determining module is further configured to determine a target logical channel according to the evaluation result, and perform multi-user downlink scheduling according to the target logical channel.
9. A multi-user downlink scheduling device, characterized in that the multi-user downlink scheduling device comprises: a memory, a processor, and a multi-user downlink scheduling program stored on the memory and operable on the processor, the multi-user downlink scheduling program being configured to implement the multi-user downlink scheduling method of any one of claims 1 to 7.
10. A storage medium, characterized in that a multi-user downlink scheduling program is stored on the storage medium, and the multi-user downlink scheduling program, when executed by a processor, implements the multi-user downlink scheduling method according to any one of claims 1 to 7.


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination