CN112101729A - Mobile edge computing system energy distribution method based on deep double-Q learning - Google Patents

Mobile edge computing system energy distribution method based on deep double-Q learning

Info

Publication number
CN112101729A
Authority
CN
China
Prior art keywords
network
value
action
state
energy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010829544.4A
Other languages
Chinese (zh)
Other versions
CN112101729B (en)
Inventor
林伟伟
黄天晟
许银海
黄文俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010829544.4A priority Critical patent/CN112101729B/en
Publication of CN112101729A publication Critical patent/CN112101729A/en
Application granted granted Critical
Publication of CN112101729B publication Critical patent/CN112101729B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312 Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02E REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E40/00 Technologies for an efficient electrical power generation, transmission or distribution
    • Y02E40/70 Smart grids as climate change mitigation technology in the energy generation sector
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a mobile edge computing system energy distribution method based on deep double-Q learning, which comprises the following steps: converting the energy distribution process of a mobile edge computing system into a Markov decision process, wherein the Markov decision process comprises three elements, a system state s, a system action a, and an action value function Q(s, a); and predicting the accurate value of the action value function through an energy distribution algorithm based on deep double-Q learning, selecting the action corresponding to the maximum action value function to obtain an optimal energy distribution strategy, and completing the energy distribution of the mobile edge computing system. The method applies deep double-Q learning (DDQN) to the energy distribution of the mobile edge computing system and solves for the optimal energy distribution through the DDQN algorithm, thereby maximizing the benefit of long-term sustainable computing of the edge computing system server.

Description

Mobile edge computing system energy distribution method based on deep double-Q learning
Technical Field
The invention belongs to the technical field of energy distribution of a mobile edge computing system, and particularly relates to a mobile edge computing system energy distribution method based on deep double-Q learning.
Background
ETSI sets forth the concept of mobile edge computing as a "new platform that can provide an IT service environment and cloud computing capabilities at the edge of a Radio Access Network (RAN) near a mobile user". MEC sinks the remote cloud data center to the edge of the wireless network, breaks through the traditional three-layer architecture in which the radio access network, the core backbone network, and the application network are interconnected, and realizes the fusion of the wireless side and the application side.
Because mobile edge computing (MEC) localizes computing/storage services, processes task requests with low latency, and provides wireless information/content awareness, MEC has a rich set of application scenarios, such as (1) computation-intensive task assistance, (2) video/file caching, and (3) the Internet of Vehicles.
Since mobile edge computing requires the deployment of millions of small servers in a city, renewable-energy-driven mobile edge systems are becoming a new research direction for reducing power costs. How to distribute system energy and maximize the benefit of sustainable computing becomes a new challenge.
Disclosure of Invention
The invention mainly aims to overcome the defects and shortcomings of the prior art and provides a mobile edge computing system energy allocation method based on deep double-Q learning.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a mobile edge computing system energy distribution method based on deep double-Q learning, which comprises the following steps:
converting an energy distribution process of a mobile edge computing MEC system into a Markov decision process, wherein the Markov decision process comprises three elements of a system state s, a system action a and an action value function Q (s, a); the change of the system state is triggered by an arrival event, the arrival event is divided into a task arrival event, an energy arrival event and a task completion event, and when the task arrival event arrives, the MEC system can take corresponding system action;
predicting the accurate value of the action value function through an energy distribution algorithm based on deep double-Q learning, selecting the system action corresponding to the maximum action value function to obtain an optimal energy distribution strategy, and completing energy distribution of the mobile edge computing system;
the energy allocation algorithm comprises the following steps:
initializing a Q network and parameters thereof;
inputting the feature vector φ(s) of the current system state s into the Q network to obtain the Q value outputs corresponding to all system actions, and selecting the corresponding system action from the current Q value outputs by using an ε-greedy method;
executing the current system action a in the current system state s to reach the next state s', and obtaining the feature vector φ(s') and the reward r corresponding to the new system state s';
storing the (s, a, r, s') 4-tuple in an experience replay set D;
randomly drawing m samples (s_j, a_j, r_j, s'_j), j = 1, 2, ..., m, from the experience replay set D to train the Q network and calculate the current target Q value y_j;
calculating the loss function for training the Q network and updating the Q network parameters;
updating the target Q network parameters and updating ε;
and judging whether the preset training times are reached, if so, ending, otherwise, repeatedly executing the steps after initializing the Q network and the parameters thereof.
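For illustration, the training loop formed by these steps can be sketched in Python as follows. This is a minimal sketch under assumed names and values: the environment stub env_step, the network sizes, and the hyperparameter values (ζ, γ, the ε schedule, N_t, m, M) are placeholders introduced here for illustration, not the claimed implementation.

```python
# Minimal sketch of the deep double-Q learning (DDQN) energy-allocation training
# loop described above.  The environment stub env_step, the network sizes, and
# all hyperparameter values are illustrative assumptions, not the patented system.
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 5              # assumed sizes of phi(s) and the action set
GAMMA_STEP, ZETA = 1e-3, 0.95            # update step gamma, discount factor zeta
EPS_0, EPS_MIN, EPS_DECAY = 1.0, 0.05, 0.995
CAPACITY_M, BATCH_M, N_T, N_TRAIN = 10_000, 32, 100, 1_000

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())        # initialize theta_minus = theta
optimizer = torch.optim.SGD(q_net.parameters(), lr=GAMMA_STEP)
replay = deque(maxlen=CAPACITY_M)                     # experience replay set D
eps = EPS_0

def env_step(state, action):
    """Placeholder MEC environment: a real implementation would apply the
    task/energy arrival dynamics described in the text."""
    return torch.rand(STATE_DIM), random.random()     # (next state phi(s'), reward r)

state = torch.rand(STATE_DIM)                         # random initial state s0
for i in range(1, N_TRAIN + 1):
    # epsilon-greedy selection of the system action a
    if random.random() < eps:
        action = random.randrange(N_ACTIONS)
    else:
        action = int(q_net(state).argmax())
    next_state, reward = env_step(state, action)
    replay.append((state, action, reward, next_state))          # store (s, a, r, s') in D

    if len(replay) >= BATCH_M:
        batch = random.sample(replay, BATCH_M)                  # m random samples
        s = torch.stack([b[0] for b in batch])
        a = torch.tensor([b[1] for b in batch])
        r = torch.tensor([b[2] for b in batch])
        s2 = torch.stack([b[3] for b in batch])
        with torch.no_grad():
            best_a = q_net(s2).argmax(dim=1, keepdim=True)      # action chosen by Q net
            y = r + ZETA * target_net(s2).gather(1, best_a).squeeze(1)  # target y_j
        q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s_j, a_j; theta_i)
        loss = ((y - q_pred) ** 2).mean()                       # L_i(theta_i)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                        # gradient-descent update

    if i % N_T == 0:                                            # periodic target sync
        target_net.load_state_dict(q_net.state_dict())
    eps = max(EPS_DECAY * eps, EPS_MIN)                         # decay epsilon
    state = next_state
```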
Further, the system state s is specifically represented as follows:
[equation image: definition of the system state s]
wherein b represents the remaining energy of the MEC system in the current state, the remaining components represent the numbers of running virtual machines allocated k_n units of energy, and k_n represents the amount of energy units allocated to a virtual machine.
Further, when a task arrival event occurs, if the system action a = 0, the controller rejects the arriving task; if the system action a = k_n, the system allocates a virtual machine with k_n (k_n < b) units of energy to the arriving task, and the remaining system energy becomes b = b - k_n; the more energy is allocated, the faster the task request is completed;
when the arrival event is an energy arrival event, one unit of energy is brought to the system, namely the remaining system energy b = min(b + 1, b_m), where b_m is the upper limit of the system energy;
when other events arrive, the MEC system takes no substantial system action;
the set of actions A_S available in a given system state s is represented as follows:
[equation image: definition of the action set A_S]
wherein V_m represents the maximum number of virtual machines that can run.
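As a concrete illustration of the state and action definitions above, the sketch below encodes a state as the remaining energy plus the running virtual machines and enumerates the admissible actions on a task arrival. The energy levels, the capacity limits V_m and b_m, and the feasibility rule are assumptions introduced here for illustration; the patent specifies the exact state vector and action set only in the equation images.

```python
# Illustrative encoding of the system state s and the admissible action set A_S.
# The field names, energy levels, and feasibility rule below are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List

ENERGY_LEVELS = [1, 2, 3]   # assumed candidate k_n values (energy units per VM)
V_M = 4                     # maximum number of virtual machines that can run
B_M = 10                    # upper limit b_m of stored system energy

@dataclass
class SystemState:
    b: int                                                      # remaining energy
    running_vms: Dict[int, int] = field(default_factory=dict)   # k_n -> number of VMs

    def total_vms(self) -> int:
        return sum(self.running_vms.values())

def action_set(state: SystemState) -> List[int]:
    """Actions available when a task arrives: 0 rejects the task, k_n allocates a
    VM with k_n energy units, allowed only if k_n < b and a VM slot is free."""
    actions = [0]
    if state.total_vms() < V_M:
        actions += [k for k in ENERGY_LEVELS if k < state.b]
    return actions

def on_energy_arrival(state: SystemState) -> SystemState:
    """An energy arrival adds one unit of energy, capped at b_m."""
    return SystemState(b=min(state.b + 1, B_M), running_vms=dict(state.running_vms))

# Example: with 3 energy units left and one running VM, the actions are [0, 1, 2].
print(action_set(SystemState(b=3, running_vms={1: 1})))
```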
Further, the action value function Q (s, a) is expressed as follows:
Q(s, a) = E[ r(s, a) + ζ max_{a'} Q(s', a') ]
wherein s' represents the next state of the system, namely the remaining battery energy and the VM running states when the next arrival event occurs; r(s, a) is the system reward obtained on leaving state s; max_{a'} Q(s', a') represents the maximum Q value over all actions of the next state s'; and ζ is the discount factor;
the system rewards are specifically expressed as follows:
r(s,a)=g(s,a)-c(s,a)τ(s,a)
wherein g(s, a) represents the direct reward, and c(s, a) and τ(s, a) represent the cost rate and the dwell time between the current task arrival event and the next task arrival event, respectively;
the direct reward g(s, a) is specifically expressed as follows:
[equation image: definition of the direct reward g(s, a)]
wherein U represents the local computation time of the task;
the cost rate c(s, a) is specifically expressed as follows:
[equation image: definition of the cost rate c(s, a)]
wherein the leading term represents the number of running virtual machines in the MEC system (the number of virtual machines does not change between event arrivals), and the indicator 1_{a>0} equals 1 when the system action a > 0 and 0 otherwise.
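To make the reward structure concrete, the sketch below evaluates r(s, a) = g(s, a) - c(s, a)·τ(s, a). Because the patent gives g(s, a) and c(s, a) only as equation images, the concrete forms used for them here are hypothetical stand-ins; only the overall structure (direct reward minus cost rate times dwell time) follows the text.

```python
# Sketch of the semi-Markov reward r(s, a) = g(s, a) - c(s, a) * tau(s, a).
# The concrete forms of g and c below are hypothetical; the patent defines them
# only in equation images.
def direct_reward(action: int, local_compute_time: float) -> float:
    # Assumed form: accepting a task (a > 0) earns a reward that grows with the
    # local computation time U and the energy allocated to the task.
    return local_compute_time * action if action > 0 else 0.0

def cost_rate(num_running_vms: int, action: int, unit_cost: float = 0.1) -> float:
    # Assumed form: running VMs accrue cost per unit time; the indicator 1_{a>0}
    # adds the cost of the newly allocated VM.
    return unit_cost * (num_running_vms + (1 if action > 0 else 0))

def reward(action: int, local_compute_time: float, num_running_vms: int,
           dwell_time: float) -> float:
    g = direct_reward(action, local_compute_time)
    c = cost_rate(num_running_vms, action)
    return g - c * dwell_time

# Example: allocate 2 energy units to a task with U = 1.5 while one VM is already
# running and 0.8 time units elapse before the next task arrival.
print(reward(action=2, local_compute_time=1.5, num_running_vms=1, dwell_time=0.8))
```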
Further, the initializing Q network and its parameters are specifically:
randomly initializing s_0 as the first state of the current state sequence, initializing i = 1, randomly initializing all parameters θ_i of the current Q network, initializing the target Q network parameters θ⁻ = θ_i, initializing an experience replay set D with capacity M, and initializing ε = ε_0.
Further, selecting the corresponding system action from the current Q value outputs by using the ε-greedy method is specifically:
setting a value of ε; with probability 1 - ε, greedily selecting the action currently considered to have the highest action value, namely the system action corresponding to the maximum Q network output value; and with probability ε, selecting a system action at random from all selectable system actions; the formula is as follows:
a = argmax_{a ∈ A_S} Q(φ(s), a; θ_i) with probability 1 - ε, and a random action from A_S with probability ε.
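A minimal sketch of this ε-greedy rule, assuming the Q values for the admissible actions have already been computed by the Q network:

```python
# epsilon-greedy action selection: exploit with probability 1 - eps, explore with
# probability eps.  The Q values and action list passed in are illustrative.
import random
from typing import Sequence

def epsilon_greedy(q_values: Sequence[float], actions: Sequence[int], eps: float) -> int:
    if random.random() < eps:
        return random.choice(list(actions))                      # explore
    best = max(range(len(actions)), key=lambda i: q_values[i])   # index of max Q value
    return actions[best]                                         # exploit

# Example: with eps = 0.1, action 2 is usually returned.
print(epsilon_greedy([0.2, 0.5, 0.9], [0, 1, 2], eps=0.1))
```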
further, the current target Q value yjThe calculation formula is as follows:
Figure BDA0002637444950000046
wherein, thetaiIs the Q-network parameter and,
Figure BDA0002637444950000051
is the target Q network parameter(s),
Figure BDA0002637444950000052
a function of the predicted Q value for the target Q network,
Figure BDA0002637444950000053
is represented at current s'jAnd predicting the system action corresponding to the maximum Q value in the state Q network.
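The double-Q target above can be sketched as follows: the online Q network (parameters θ_i) selects the best next action and the target network (parameters θ_i⁻) evaluates it. The small linear networks and the batch used here are placeholders for illustration.

```python
# Double-Q target: the online network selects a', the target network evaluates it.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, ZETA = 4, 5, 0.95
q_net = nn.Linear(STATE_DIM, N_ACTIONS)        # stands in for Q(.; theta_i)
target_net = nn.Linear(STATE_DIM, N_ACTIONS)   # stands in for Q(.; theta_i minus)

def ddqn_target(r: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
    """y_j = r_j + zeta * Q(s'_j, argmax_a' Q(s'_j, a'; theta_i); theta_i minus)."""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)                 # selection
        return r + ZETA * target_net(s_next).gather(1, best_a).squeeze(1)  # evaluation

# Example with a batch of m = 3 transitions.
print(ddqn_target(torch.tensor([1.0, 0.5, -0.2]), torch.rand(3, STATE_DIM)))
```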
Further, the loss function of the Q network is as follows:
L_i(θ_i) = (1/m) Σ_{j=1}^{m} (y_j - Q(s_j, a_j; θ_i))²
wherein Q(s_j, a_j; θ_i) is the predicted Q value function of the Q network;
the Q network parameters θ_i are updated by gradient descent, and the update formula is as follows:
θ_i ← θ_i - γ ∇_{θ_i} L_i(θ_i)
where γ is the update step size.
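A sketch of one training step, computing the squared-error loss over the sampled batch and applying a gradient-descent update with step size γ; the network, batch, and γ value are illustrative assumptions.

```python
# One gradient-descent step on L_i(theta_i) = mean_j (y_j - Q(s_j, a_j; theta_i))^2.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA_STEP = 4, 5, 1e-3
q_net = nn.Linear(STATE_DIM, N_ACTIONS)
optimizer = torch.optim.SGD(q_net.parameters(), lr=GAMMA_STEP)   # step size gamma

def train_step(s: torch.Tensor, a: torch.Tensor, y: torch.Tensor) -> float:
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s_j, a_j; theta_i)
    loss = ((y - q_pred) ** 2).mean()                            # L_i(theta_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                             # theta_i <- theta_i - gamma * grad
    return float(loss)

# Example with a batch of m = 3 samples and fixed targets y_j.
print(train_step(torch.rand(3, STATE_DIM), torch.tensor([0, 2, 4]),
                 torch.tensor([1.0, 0.3, -0.5])))
```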
Further, the target Q network parameters are updated as follows:
if i % N_t == 0, then θ⁻ = θ_i;
otherwise, θ⁻ remains unchanged;
wherein N_t represents the update frequency of the target Q network, i.e., the target Q network is updated once every N_t training iterations;
the update of ε is specifically as follows:
ε = max(ζε, ε_min), i = i + 1, s = s'.
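A small sketch of the periodic target-network synchronization and the ε decay rule above; the values of N_t, the decay factor, and ε_min are assumptions.

```python
# Target-network sync every N_t iterations and per-iteration epsilon decay.
import torch.nn as nn

N_T, DECAY, EPS_MIN = 100, 0.995, 0.05   # assumed values

def maybe_sync_target(i: int, q_net: nn.Module, target_net: nn.Module) -> None:
    """Copy theta_i into the target network when i % N_t == 0; otherwise leave it."""
    if i % N_T == 0:
        target_net.load_state_dict(q_net.state_dict())

def decay_epsilon(eps: float) -> float:
    """eps = max(decay * eps, eps_min), applied once per training iteration."""
    return max(DECAY * eps, EPS_MIN)

print(decay_epsilon(1.0))   # 0.995
```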
further, the obtaining of the optimal energy allocation strategy specifically includes:
in any system state, when a task arrival event arrives, the MEC system selects a system action corresponding to the maximum action value function, and the optimal energy distribution strategy of the mobile edge computing system is expressed as follows:
π* = argmax_a Q*(s, a).
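Once training has converged, extracting the energy-allocation policy amounts to picking, on each task arrival, the admissible action with the largest predicted Q value. The sketch below assumes a trained network and the illustrative state encoding from the earlier sketches.

```python
# Greedy policy extraction: pi*(s) = argmax over a in A_S of Q*(s, a).
import torch
import torch.nn as nn
from typing import List

STATE_DIM, N_ACTIONS = 4, 5
q_net = nn.Linear(STATE_DIM, N_ACTIONS)   # stands in for the trained Q*(s, a)

def optimal_action(phi_s: torch.Tensor, admissible: List[int]) -> int:
    with torch.no_grad():
        q = q_net(phi_s)                   # Q values for all actions in state s
    return max(admissible, key=lambda a: float(q[a]))

# Example: choose among the admissible actions {0, 1, 3} for one feature vector phi(s).
print(optimal_action(torch.rand(STATE_DIM), [0, 1, 3]))
```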
compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method applies deep double-Q learning (DDQN) to the energy distribution of the mobile edge computing system and solves for the optimal energy distribution through the DDQN algorithm, thereby maximizing the benefit of long-term sustainable computing of the edge computing system server. DDQN eliminates the overestimation problem of Q learning by decoupling the selection of the target Q value action from the calculation of the target Q value; meanwhile, experience replay breaks the similarity between consecutive training samples, enables accurate estimation of the Q value, and benefits the subsequent effective distribution of system energy.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
fig. 2 is an energy allocation algorithm of the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
The invention discloses a mobile edge computing system energy allocation method based on deep double-Q learning, which converts the energy allocation process of a mobile edge computing system into a Markov decision process comprising three elements, a system state s, a system action a, and an action value function Q(s, a), solves for the optimal energy allocation through the deep double-Q learning (DDQN) algorithm without knowing the transition probabilities, and maximizes the benefit of long-term sustainable computing of the edge computing system server. DDQN eliminates the overestimation problem of Q learning by decoupling the selection of the target Q value action from the calculation of the target Q value; at the same time, experience replay breaks the similarity between consecutive training samples. Compared with Q learning, DDQN estimates the Q value more accurately and yields a more profitable energy allocation strategy.
Examples
As shown in FIG. 1, the invention relates to a method for allocating energy of a mobile edge computing system based on deep double-Q learning, which comprises the following steps:
s1, converting the energy distribution process of the mobile edge computing system into a Markov decision process, specifically:
converting the energy distribution process of the mobile edge computing system into a Markov decision process, wherein the Markov decision process comprises three elements of a system state s, a system action a and an action value function Q (s, a); a change in system state is caused by an arrival event; the arrival events are divided into task arrival events, energy arrival events and task completion events, and when the task arrival events arrive, the MEC system can take corresponding system actions.
The system state s is represented as follows:
[equation image: definition of the system state s]
wherein b represents the remaining energy of the MEC system in the current state, the remaining components represent the numbers of running virtual machines allocated k_n units of energy, and k_n represents the amount of energy units allocated to a virtual machine.
When a task arrival event occurs, if the system action a = 0, the controller rejects the arriving task; if the system action a = k_n, the system allocates a virtual machine with k_n (k_n < b) units of energy to the arriving task, and the remaining system energy becomes b = b - k_n; the more energy is allocated, the faster the task request can be completed.
When the arrival event is an energy arrival event, one unit of energy is brought to the system, that is, the remaining system energy b = min(b + 1, b_m), where b_m is the upper limit of the system energy. When other events arrive, the system takes no substantial action. The set of actions A_S available in a given system state s is represented as follows:
[equation image: definition of the action set A_S]
wherein V_m represents the maximum number of virtual machines that can run.
If too much energy is allocated, the mobile edge computing system may have to reject the next few task requests, or may allocate less energy to subsequent task requests because of low battery energy, resulting in slow computation.
The action value function is expressed as follows:
Q(s, a) = E[ r(s, a) + ζ max_{a'} Q(s', a') ]
wherein s' represents the next state of the system, i.e., the remaining battery energy and the VM running states when the next task arrival event occurs; r(s, a) is the system reward obtained on leaving state s; max_{a'} Q(s', a') represents the maximum Q value over all actions of the next state s', which can be understood as the state value function V(s'); and ζ is the discount factor.
The system rewards are specifically expressed as follows:
r(s,a)=g(s,a)-c(s,a)τ(s,a)
wherein g(s, a) represents the direct reward, and c(s, a) and τ(s, a) represent the cost rate and the dwell time between the current task arrival event and the next task arrival event, respectively;
the direct reward g(s, a) is specifically expressed as follows:
[equation image: definition of the direct reward g(s, a)]
wherein U represents the local computation time of the task;
the cost rate c(s, a) is specifically expressed as follows:
[equation image: definition of the cost rate c(s, a)]
wherein the leading term represents the number of running virtual machines in the MEC system (the number of virtual machines does not change between event arrivals), and the indicator 1_{a>0} equals 1 when the system action a > 0 and 0 otherwise.
S2, predicting the accurate value of the action value function through an energy distribution algorithm, and selecting the action corresponding to the maximum action value function by the system server to complete energy distribution, wherein the method specifically comprises the following steps:
through an energy distribution algorithm based on deep double-Q learning, the accurate value of the action value function Q (s, a) is predicted, so that in any state, when a task arrival event arrives, the MEC system selects the action corresponding to the maximum action value function, and therefore the optimal energy distribution strategy of the mobile edge computing system is represented as follows:
π* = argmax_a Q*(s, a).
in this embodiment, as shown in fig. 2, the energy allocation algorithm specifically includes the following steps:
s21, initialization, random initialization S0Initializing i to 1 for the first state of the current state sequence, and randomly initializing all parameters theta of the current Q networkiInitializing parameters of the target Q network
Figure BDA0002637444950000091
Initializing an experience playback set D with the capacity of M, and initializing the E to the E0
And S22, inputting the feature vector φ(s) of the current system state s into the Q network to obtain the Q value outputs corresponding to all system actions, and selecting the corresponding system action from the current Q value outputs by using the ε-greedy method, which specifically comprises the following steps:
adopting the ε-greedy method to select the system action: setting a small value of ε; with probability 1 - ε, greedily selecting the action currently considered to have the maximum action value, namely the system action corresponding to the maximum Q network output value; and with probability ε, randomly selecting an action from all selectable actions, which is expressed as:
a = argmax_{a ∈ A_S} Q(φ(s), a; θ_i) with probability 1 - ε, and a random action from A_S with probability ε.
and S23, executing the current system action a in the current system state S to reach the next state S ', and obtaining the feature vector phi (S ') and the reward r corresponding to the new system state S '.
S24, storing the 4-tuple (s, a, r, s') into the experience replay set D.
S25, randomly drawing m samples (s_j, a_j, r_j, s'_j), j = 1, 2, ..., m, from the experience replay set D to train the Q network, and calculating the current target Q value y_j according to:
y_j = r_j + ζ Q(s'_j, argmax_{a'} Q(s'_j, a'; θ_i); θ_i⁻)
wherein θ_i is the Q network parameter, θ_i⁻ is the target Q network parameter, Q(·, ·; θ_i⁻) is the predicted Q value function of the target Q network, and argmax_{a'} Q(s'_j, a'; θ_i) denotes the system action with the maximum Q value predicted by the Q network in the current state s'_j.
S26, calculating the loss function L_i(θ_i) and updating the Q network parameters θ_i by gradient descent, the loss function L_i(θ_i) being as follows:
L_i(θ_i) = (1/m) Σ_{j=1}^{m} (y_j - Q(s_j, a_j; θ_i))²
wherein Q(s_j, a_j; θ_i) is the predicted Q value function of the Q network;
the update formula is:
θ_i ← θ_i - γ ∇_{θ_i} L_i(θ_i)
where γ is the update step size.
S27, updating the target Q network parameters: if i % N_t == 0, then θ⁻ = θ_i; otherwise, θ⁻ remains unchanged; wherein N_t represents the update frequency of the target Q network, i.e., the target Q network is updated once every N_t training iterations.
S28, updating parameters: ε = max(ζε, ε_min), i = i + 1, s = s'. To make the algorithm converge, ε is generally reduced gradually along the iterative process of the algorithm, so ε is updated once per iteration.
S29, if i < N_train, jumping to S22; otherwise, ending; wherein N_train represents the total number of training iterations.
It should also be noted that in this specification, terms such as "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A mobile edge computing system energy distribution method based on deep double-Q learning is characterized by comprising the following steps:
converting an energy distribution process of a mobile edge computing MEC system into a Markov decision process, wherein the Markov decision process comprises three elements of a system state s, a system action a and an action value function Q (s, a); the change of the system state is triggered by an arrival event, the arrival event is divided into a task arrival event, an energy arrival event and a task completion event, and when the task arrival event arrives, the MEC system can take corresponding system action;
predicting the accurate value of the action value function through an energy distribution algorithm based on deep double-Q learning, selecting the system action corresponding to the maximum action value function to obtain an optimal energy distribution strategy, and completing energy distribution of the mobile edge computing system;
the energy allocation algorithm comprises the following steps:
initializing a Q network and parameters thereof;
inputting the feature vector φ(s) of the current system state s into the Q network to obtain the Q value outputs corresponding to all system actions, and selecting the corresponding system action from the current Q value outputs by using an ε-greedy method;
executing the current system action a in the current system state s to reach the next state s', and obtaining the feature vector φ(s') and the reward r corresponding to the new system state s';
storing the (s, a, r, s') 4-tuple in an experience replay set D;
randomly drawing m samples (s_j, a_j, r_j, s'_j), j = 1, 2, ..., m, from the experience replay set D to train the Q network and calculate the current target Q value y_j;
calculating the loss function for training the Q network and updating the Q network parameters;
updating the target Q network parameters and updating ε;
and judging whether the preset training times are reached, if so, ending, otherwise, repeatedly executing the steps after initializing the Q network and the parameters thereof.
2. The method according to claim 1, wherein the system state s is specifically expressed as follows:
[equation image: definition of the system state s]
wherein b represents the remaining energy of the MEC system in the current state, the remaining components represent the numbers of running virtual machines allocated k_n units of energy, and k_n represents the amount of energy units allocated to a virtual machine.
3. The method for allocating the energy of the mobile edge computing system based on the deep double-Q learning of claim 2, wherein when a task arrival event occurs, if the system action a = 0, the controller rejects the arriving task; if the system action a = k_n, the system allocates a virtual machine with k_n (k_n < b) units of energy to the arriving task, and the remaining system energy becomes b = b - k_n; the more energy is allocated, the faster the task request is completed;
when the arrival event is an energy arrival event, one unit of energy is brought to the system, namely the remaining system energy b = min(b + 1, b_m), where b_m is the upper limit of the system energy;
when other events arrive, the MEC system takes no substantial system action;
the set of actions A_S available in a given system state s is represented as follows:
[equation image: definition of the action set A_S]
wherein V_m represents the maximum number of virtual machines that can run.
4. The method according to claim 1, wherein the action value function Q (s, a) is expressed as follows:
Q(s, a) = E[ r(s, a) + ζ max_{a'} Q(s', a') ]
wherein s' represents the next state of the system, namely the remaining battery energy and the VM running states when the next arrival event occurs; r(s, a) is the system reward obtained on leaving state s; max_{a'} Q(s', a') represents the maximum Q value over all actions of the next state s'; and ζ is the discount factor;
the system reward is specifically expressed as follows:
r(s, a) = g(s, a) - c(s, a)τ(s, a)
wherein g(s, a) represents the direct reward, and c(s, a) and τ(s, a) represent the cost rate and the dwell time between the current task arrival event and the next task arrival event, respectively;
the direct reward g(s, a) is specifically expressed as follows:
[equation image: definition of the direct reward g(s, a)]
wherein U represents the local computation time of the task;
the cost rate c(s, a) is specifically expressed as follows:
[equation image: definition of the cost rate c(s, a)]
wherein the leading term represents the number of running virtual machines in the MEC system (the number of virtual machines does not change between event arrivals), and the indicator 1_{a>0} equals 1 when the system action a > 0 and 0 otherwise.
5. The method according to claim 1, wherein the initialized Q network and its parameters are specifically:
randomly initializing s_0 as the first state of the current state sequence, initializing i = 1, randomly initializing all parameters θ_i of the current Q network, initializing the target Q network parameters θ⁻ = θ_i, initializing an experience replay set D with capacity M, and initializing ε = ε_0.
6. The method for allocating the energy of the mobile edge computing system based on the deep double-Q learning as claimed in claim 5, wherein selecting the corresponding system action from the current Q value outputs by using the ε-greedy method is specifically:
setting a value of ε; with probability 1 - ε, greedily selecting the action currently considered to have the highest action value, namely the system action corresponding to the maximum Q network output value; and with probability ε, selecting a system action at random from all selectable system actions; the formula is as follows:
a = argmax_{a ∈ A_S} Q(φ(s), a; θ_i) with probability 1 - ε, and a random action from A_S with probability ε.
7. The method as claimed in claim 6, wherein the current target Q value y_j is calculated as follows:
y_j = r_j + ζ Q(s'_j, argmax_{a'} Q(s'_j, a'; θ_i); θ_i⁻)
wherein θ_i is the Q network parameter, θ_i⁻ is the target Q network parameter, Q(·, ·; θ_i⁻) is the predicted Q value function of the target Q network, and argmax_{a'} Q(s'_j, a'; θ_i) denotes the system action with the maximum Q value predicted by the Q network in the current state s'_j.
8. The method of claim 7, wherein the Q network loss function is as follows:
L_i(θ_i) = (1/m) Σ_{j=1}^{m} (y_j - Q(s_j, a_j; θ_i))²
wherein Q(s_j, a_j; θ_i) is the predicted Q value function of the Q network;
the Q network parameters θ_i are updated by gradient descent, and the update formula is as follows:
θ_i ← θ_i - γ ∇_{θ_i} L_i(θ_i)
wherein γ is the update step size.
9. The method according to claim 8, wherein the target Q network parameters are updated as follows:
if i % N_t == 0, then θ⁻ = θ_i;
otherwise, θ⁻ remains unchanged;
wherein N_t represents the update frequency of the target Q network, i.e., the target Q network is updated once every N_t training iterations;
the update of ε is specifically as follows:
ε = max(ζε, ε_min), i = i + 1, s = s'.
10. The method for energy allocation of a mobile edge computing system based on deep double-Q learning according to claim 1, wherein the obtaining of the optimal energy allocation strategy specifically comprises:
in any system state, when a task arrival event arrives, the MEC system selects the system action corresponding to the maximum action value function, and the optimal energy distribution strategy of the mobile edge computing system is expressed as follows:
π* = argmax_a Q*(s, a).
CN202010829544.4A 2020-08-18 2020-08-18 Mobile edge computing system energy distribution method based on deep double Q learning Active CN112101729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010829544.4A CN112101729B (en) 2020-08-18 2020-08-18 Mobile edge computing system energy distribution method based on deep double Q learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010829544.4A CN112101729B (en) 2020-08-18 2020-08-18 Mobile edge computing system energy distribution method based on deep double Q learning

Publications (2)

Publication Number Publication Date
CN112101729A true CN112101729A (en) 2020-12-18
CN112101729B CN112101729B (en) 2023-07-21

Family

ID=73754563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010829544.4A Active CN112101729B (en) 2020-08-18 2020-08-18 Mobile edge computing system energy distribution method based on deep double Q learning

Country Status (1)

Country Link
CN (1) CN112101729B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115550236A (en) * 2022-08-31 2022-12-30 国网江西省电力有限公司信息通信分公司 Data protection method for routing optimization of security middlebox resource pool

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150100530A1 (en) * 2013-10-08 2015-04-09 Google Inc. Methods and apparatus for reinforcement learning
CN109976909A (en) * 2019-03-18 2019-07-05 中南大学 Low delay method for scheduling task in edge calculations network based on study
CN110365568A (en) * 2019-06-18 2019-10-22 西安交通大学 A kind of mapping method of virtual network based on deeply study

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150100530A1 (en) * 2013-10-08 2015-04-09 Google Inc. Methods and apparatus for reinforcement learning
CN109976909A (en) * 2019-03-18 2019-07-05 中南大学 Low delay method for scheduling task in edge calculations network based on study
CN110365568A (en) * 2019-06-18 2019-10-22 西安交通大学 A kind of mapping method of virtual network based on deeply study

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115550236A (en) * 2022-08-31 2022-12-30 国网江西省电力有限公司信息通信分公司 Data protection method for routing optimization of security middlebox resource pool
CN115550236B (en) * 2022-08-31 2024-04-30 国网江西省电力有限公司信息通信分公司 Data protection method oriented to security middle station resource pool route optimization

Also Published As

Publication number Publication date
CN112101729B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN111835827B (en) Internet of things edge computing task unloading method and system
Yu et al. Mobility-aware proactive edge caching for connected vehicles using federated learning
CN109067842A (en) Calculating task discharging method towards car networking
CN108763495B (en) Interactive method, system, electronic equipment and storage medium
CN111242748B (en) Method, apparatus, and storage medium for recommending items to a user
CN111523939B (en) Popularization content delivery method and device, storage medium and electronic equipment
CN109522531A (en) Official documents and correspondence generation method and device, storage medium and electronic device
CN111292001B (en) Combined decision method and device based on reinforcement learning
CN107135411A (en) A kind of method and electronic equipment for adjusting video code rate
CN112101729A (en) Mobile edge computing system energy distribution method based on deep double-Q learning
CN104754063B (en) Local cloud computing resource scheduling method
CN115033359A (en) Internet of things agent multi-task scheduling method and system based on time delay control
CN108648142A (en) Image processing method and device
JP2022525880A (en) Server load prediction and advanced performance measurement
Lorido-Botran et al. ImpalaE: Towards an optimal policy for efficient resource management at the edge
Tao et al. DRL-Driven Digital Twin Function Virtualization for Adaptive Service Response in 6G Networks
CN109495565A (en) High concurrent service request processing method and equipment based on distributed ubiquitous computation
CN110390406A (en) Reserve the distribution method and device of order
CN112101728A (en) Energy optimization distribution method for mobile edge computing system
Yu et al. A situation enabled framework for energy-efficient workload offloading in 5G vehicular edge computing
CN110191362B (en) Data transmission method and device, storage medium and electronic equipment
CN104978029B (en) A kind of screen control method and device
CN111353093B (en) Problem recommendation method, device, server and readable storage medium
CN111612286B (en) Order distribution method and device, electronic equipment and storage medium
CN116367190A (en) Digital twin function virtualization method for 6G mobile network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant