CN115640131A

CN115640131A - Unmanned aerial vehicle auxiliary computing migration method based on depth certainty strategy gradient

Info

Publication number: CN115640131A
Application number: CN202211341446.1A
Authority: CN
Inventors: 陈志江; 雷磊; 宋晓勤; 蒋泽星; 唐胜; 王执屹
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2022-10-28
Filing date: 2022-10-28
Publication date: 2023-01-24

Abstract

The invention provides a calculation task unloading algorithm based on deep reinforcement learning aiming at the requirements of calculation intensive and delay sensitive mobile services. And (4) considering constraint conditions such as the flight ranges, the flight speeds, the system fairness benefits and the like of the multiple unmanned aerial vehicles, and minimizing the weighted sum of the network average calculation delay and the energy consumption of the unmanned aerial vehicles. The non-convex and NP difficult problems are converted into a partial observation Markov decision process, and the multi-agent depth certainty strategy gradient algorithm is used for unloading decision of the mobile user and optimizing the flight track of the unmanned aerial vehicle. Simulation results show that the performance of the algorithm is superior to that of a baseline algorithm in the aspects of fairness of the mobile service terminal, average system time delay, total energy consumption of multiple unmanned aerial vehicles and the like.

Description

Unmanned aerial vehicle auxiliary computing migration method based on depth certainty strategy gradient

Technical Field

The invention belongs to the field of Mobile Edge Computing (MEC), relates to a Multi-unmanned aerial vehicle assisted Mobile Edge Computing method, and more particularly relates to a Multi-Agent Deep determination Policy Gradient (MADDPG) based computational migration method.

Background

With the development of 5G technology, computing-intensive applications such as network gaming, VR/AR, telemedicine, etc. running on user devices will become more prosperous and popular. These mobile applications typically require a large amount of computing resources, consume a large amount of energy, and may interrupt the connection with the server while the user is mobile due to the limited coverage of the server. The server which originally requests to unload cannot send the calculation result at the next position of the user in time, which causes the waste of the calculation resources of the server and increases the time delay and energy consumption for the user to upload and unload the calculation task again. For user off-loadable tasks, many studies will adopt a mode of unloading all tasks into the MEC server for execution, but when the number of users is large or the number of unloaded tasks is large, limited server computing resources can cause task queuing and the time delay of the unloading computation is increased. Due to high mobility and flexibility, unmanned Aerial Vehicles (UAVs) can assist mobile edge computing in military and civilian areas, especially in remote or natural disaster areas, without relying on infrastructure. When natural disasters cause network infrastructure to be unavailable or mobile devices suddenly increase to exceed network service capacity, the unmanned aerial vehicle can be used as a temporary communication relay station or an edge computing platform to enhance wireless coverage in areas with communication interruption or traffic hot spots, and computing support is provided. However, the computational resources and power of the drone are limited, and there are many key issues to be solved to improve the performance of the MEC system, including security ^[8] Task offloading, energy consumption, resource allocation, user delay performance in various channel scenarios, etc.

In an unmanned aerial vehicle MEC network, various types of variables (such as unmanned aerial vehicle tracks, task unloading strategies and computing resource allocation) can be optimized to achieve a desired scheduling target, and a traditional optimization method needs a large amount of iteration and priori knowledge to obtain an approximately optimal solution, so that the method is not suitable for real-time MEC application in a dynamic environment. With the wide application of machine learning in research, many researchers are also exploring learning-based MEC scheduling algorithms, and deep reinforcement learning has become a research focus in view of the recent progress of machine learning. With the increase of network scale, multi-agent deep reinforcement learning provides a distributed view for resource management of a multi-unmanned-aerial-vehicle MEC network.

The invention provides an unmanned aerial vehicle assisted mobile edge computing system which utilizes computing resources provided by an unmanned aerial vehicle to provide unloading services for nearby user equipment. The unmanned aerial vehicle trajectory and unloading optimization problem is solved through a multi-agent deep reinforcement learning method, so that an extensible and effective scheduling strategy is obtained, a terminal unloads a part of calculation tasks to a UAV, other tasks are executed locally at the terminal, and system processing delay and unmanned aerial vehicle energy consumption are minimized through joint optimization of user scheduling, task unloading ratio, unmanned aerial vehicle flight angle and flight speed.

Disclosure of Invention

The purpose of the invention is as follows: in consideration of the non-convexity, high-dimensional state space and continuous action space of the problem, the deep reinforcement learning algorithm based on the MADDPG is provided, and the algorithm can obtain an optimal calculation unloading strategy in a dynamic environment, so that the combined optimization of the lowest system delay and the energy consumption of the unmanned aerial vehicle is realized.

The technical scheme is as follows: in consideration of the scene of multi-user task calculation unloading at the same time, the purposes of jointly optimizing system time delay and unmanned aerial vehicle energy consumption are achieved through reasonable and efficient unmanned aerial vehicle path planning and unloading decisions. And regarding each unmanned aerial vehicle as an intelligent agent, and selecting the associated user based on locally observed state information and task information obtained in each time slot by adopting a distributed execution and centralized training mode. And optimizing the deep reinforcement learning model by using the MADDPG algorithm through establishing the deep reinforcement learning model. And obtaining the optimal flight track and unloading strategy according to the optimized MADDPG model. The invention is realized by the following technical scheme: an unmanned aerial vehicle assisted computing migration method based on MADDPG comprises the following steps:

(1) The traditional MEC server is deployed in a base station or other fixed facilities, and the mobile MEC server is adopted at this time to combine the unmanned aerial vehicle technology with edge calculation;

(2) The user equipment unloads the calculation task to the unmanned aerial vehicle end through wireless communication so as to reduce the calculation delay;

(3) An unmanned aerial vehicle auxiliary user unloading system model, a moving model, a communication model and a calculation model are constructed, and an optimization objective function is given;

(4) The unmanned aerial vehicle acquires a user position set, a task set, service times and channel parameter information in an observation range;

(5) The method comprises the steps of modeling by adopting a Partially Observable Markov Decision Process (POMDP), jointly optimizing flight tracks of multiple unmanned aerial vehicles and calculating unloading strategies based on the positions and task information of users under the condition of considering the flight range and the safety distance of the unmanned aerial vehicles, and constructing a deep reinforcement learning model by taking the aim of minimizing system time delay and unmanned aerial vehicle energy consumption and simultaneously ensuring service fairness of the users as the target;

(6) Considering a continuous state space and a continuous action space, and performing model training of computational migration by using a multi-agent deep reinforcement learning algorithm based on MADDPG;

(7) In the execution stage, the unmanned aerial vehicle obtains an optimal user unloading scheme and a flight track by using a trained model based on the state s (tau) of the current environment;

further, the step (3) comprises the following specific steps:

(3a) Establishing a Mobile edge computing system model for assisting user unloading by an unmanned aerial vehicle, wherein the system comprises M Mobile user equipment (MD) and an unmanned aerial vehicle with a U-frame carrying an MEC server, and the M Mobile user equipment (MD) and the U-frame carry the MEC server respectively by sets

And

and (4) showing. Unmanned aerial vehicle with fixed height H _u And (3) flying, wherein the total time length of the unmanned aerial vehicle for executing one flight task is T, the total time length can be divided into N time slots with equal length, and the set of the time slots is recorded as

Each MD has one in each time slot tauA computing intensive task, denoted as S _m (τ)＝{D _m (τ)，F _m (τ) }, in which D _m (τ) represents the amount of data bits, F _m (τ) represents the required CPU cycles per bit;

(3b) Each unmanned aerial vehicle only provides calculation unloading service for one terminal device in each time slot tau, a user only needs to calculate a small part of tasks locally, the rest tasks are unloaded to the unmanned aerial vehicle for auxiliary calculation, so that the calculation delay and energy consumption are reduced, and the proportion of unloading calculation amount is recorded as delta _m，u (τ)∈[0，1]. The offload decision variable between the drone and the user equipment may be expressed as:

D＝{α _m，u (τ) | U ∈ U, M ∈ M, τ ∈ T } expression 1

Wherein alpha is _m，u (τ) ∈ {0,1}, when α is _m，u (τ) =1 indicates that the device MD _m The calculation task at the time slot tau is carried out by an Unmanned Aerial Vehicle (UAV) _u Auxiliary calculation, Δ _m，u (τ) > 0; when alpha is _m，u (τ) =0, which means that the calculation task is only executed locally, Δ _m，u (τ) =0. Decision variables need to be satisfied:

(3c) And establishing a movement model, wherein the mobile device moves to a new position randomly in each time slot, and the movement of each device is related to the current speed and angle of the device. Suppose MD _m The coordinate at time slot τ is denoted as c _m (τ)＝[x _m (τ)，y _m (τ)]Then the coordinates of its next slot τ +1 can be expressed as:

wherein d is _max Representing the maximum distance of the standby movement, the movement direction and the distance probability are subject to uniform distribution, rho _1，m ，ρ _2，m U (0, 1), the drone serves the terminal considering only its starting position in the slot.

(3d) Each unmanned plane is at height H _u Can also use the discrete position c of the unmanned plane in each time slot _u (τ) assuming a UAV _u Selecting fly-to-service MD at time slot tau _m Then its flight direction is recorded as β _u (τ)∈[0，2π]The flying speed is v _u (τ)∈[0，V _max ]Time of flight t _fly The energy consumed by the flight of the unmanned aerial vehicle is as follows:

wherein μ =0.5M _u t _fly ，M _u Is the total mass of the unmanned aerial vehicle.

(3e) And the computation offload adopts a partial offload strategy, then MD _m The local computation delay at slot τ can be expressed as:

wherein f is _m Denotes MD _m Local computing power (number of CPU cycles per second).

(3f) The actual unmanned aerial vehicle-to-ground communication is simulated by adopting a line-of-sight link model, and the channel gain h between the unmanned aerial vehicle and a user _m，u (τ) follows a free space path loss model, which can be expressed as:

wherein g is ₀ Is the channel power gain per meter.

(3g) Instantaneous transmission rate r between unmanned aerial vehicle and ground equipment _m，u (τ) is defined as:

wherein B represents a channelThe bandwidth of the communication channel is controlled,

transmit power, σ, for the mobile device uplink ² Representing white gaussian noise at the unmanned end.

Associated user MD _m The transmission data delay is as follows:

after the computation task is transmitted, the unmanned aerial vehicle executes an unloading computation task, wherein the time delay and the energy consumption of the unloading computation are respectively as follows:

wherein f is _u Representing the computational power of the drone,

denotes the CPU power, κ, of the drone when performing the calculations _u ＝10 ^-27 Is a chip constant

(3h) Since the resulting output data volume of various compute-intensive tasks is much smaller than the input, the delay spent on downlink transmission can be ignored, user MD _m Completing task S in time slot tau _m Time delay T of (tau) _m (τ) can be expressed as:

unmanned Aerial Vehicle (UAV) _u The total energy consumption of the auxiliary computing offload at slot τ is:

(3i) User MD _m The average delay of (d) may be expressed as:

the system average calculated delay can be calculated as:

(3j) In order to ensure user fairness, a fairness index xi is defined _τ To measure the fairness of service:

(3k) In summary, the following objective functions and constraints can be established:

wherein P = { beta = _u (τ)，v _u (τ)}，Z＝{α _m，u (τ)，Δ _m，u (τ)}，φ _t And phi _e For weighting parameters, C1 limits that each time slot of the unmanned aerial vehicle only serves one user, C2 and C6 limit the flight range of the unmanned aerial vehicle, C3 and C4 limit the flight speed and angle of the unmanned aerial vehicle, C6 represents that the calculation task can be partially unloaded, C7 ensures the fair benefit of the system, d _safe And xi _min The minimum safety distance and the minimum fairness index between the preset unmanned planes are adopted.

Further, the step (5) comprises the following specific steps:

(5a) The problem of multi-UAV assisted computational offloading is considered as a partial observation Markov decision process, and is composed of tuples { S, A, O, pr, R }. Usually has a plurality ofAgents interact with the environment, each agent being based on a current state s _τ Get self observation o _τ Belongs to O and make action a _τ E.g. A, the environment generates an instant reward r for the action _τ E.g. R to evaluate the current action and with probability Pr (S) _τ+1 |S _τ ，A _τ ) The next state is entered and the new state depends only on the current state and the actions of the respective agent. Actions of agents are based on a policy π (a) _τ |o _τ ) Enforcement, with the goal of learning to an optimal strategy to maximize long-term jackpot, can be expressed as:

where gamma is the reward discount.

(5b) Specifically defining observation space, each unmanned aerial vehicle only has a limited observation range, and the radius of the observation range is set as r _obs Therefore, only partial state information can be observed, and global state information and actions of other drones are unknown. UAV (unmanned aerial vehicle) _u The information observable in the time slot τ has its own position information c _u (tau) and current location information, task information and number of services for K mobile users in observation range

The action space a is the transmit power and the selected channel, and is represented as:

o _u (τ)＝{c _u (τ)，k _u (τ) } expression 18

(5c) Specifically defining an action space, and based on observed information, determining which user is served by the drone at the current time slot tau and the load shedding ratio delta _m，u (tau) and determining the flight angle beta thereof _u (τ) and flight velocity v _u (τ), which can be written as:

a _u (τ)＝{m(τ)，Δ _m，u (τ)，β _u (τ)，v _u (τ) } expression 19

(5d) Defining a state space, the state of the system can be regarded as a set of all unmanned aerial vehicle observation results:

s(τ)＝{o _u (τ) | U ∈ U } expression 20

(5e) Specifically, the reward is defined, and the feedback obtained after the agent executes the action is called the reward and is used for judging whether the action is good or bad and guiding the agent to update the strategy. Generally speaking, the reward functions all correspond to optimization objectives, the objective of this optimization is to minimize the energy consumption of the drone and the average system computation delay, and the maximum reward return is just negative, so the reward after the drone executes actions is defined as:

r _u (τ)＝D _m (τ)·(-T ^mean (τ)-ψE _u (τ)-P _u (τ)) expression 21

Wherein D _m (τ)∈[0，1]For the attenuation coefficient, the benefit obtained after the unmanned aerial vehicle processes the mobile terminal unloading task is defined, and the specific calculation is as follows:

wherein eta and beta are correlation constants, the function image is of a sigmoid type, the input is the accumulated service times of the current user, the more times are, the larger the value is, the smaller the reward is, and the lower the benefit is. Psi is used to numerically align the drone energy consumption and the user average delay. P _u (τ) is an additional penalty, which is added if the drone flies out of the field after performing the action or has a distance to the remaining drones that is less than a safe distance.

(5f) And establishing a deep reinforcement learning model on the basis of the MADDPG according to the established S, A, O and R, and adopting an operator-critic framework, wherein each agent has an operator network, a critic network and a respective target network. The Actor network is responsible for formulating a policy pi (o) for an agent _u (τ)|θ _u )，θ _u Representing its network parameters; the estimate of the critic's network output to the optimal state-action cost function is denoted as Q (s (τ), a ₁ (τ)，...，a _U (τ)|w _u )，w _u Representing its network parameters. The inputs to the critic network include observations and actions of all agents within a time slot, but the inputs to the operator network only require their own observations at the time of distributed execution.

The algorithm learns the Q function and the optimal strategy at the same time, when the criticc network is updated, H groups of records need to be extracted from the experience pool of each intelligent agent, and each group at the same time is spliced to obtain H new records, which are recorded as: { s _t，i ，a _1，i ，...，a _U，i ，r _1，i ，...，r _U，i ，s _t+1，i I =1, 2., H }, training the criticic network of each agent using a set of timing differences, the penalty function of the training Q-value function being defined as:

wherein y is _u，i Obtained from formula (24):

wherein the content of the first and second substances,

and

respectively represent Unmanned Aerial Vehicle (UAV) _u The critic target network and the operator target network, wherein the target network has network parameters updated later, so that the training becomes more stable.

Critic networks need to minimize losses to approximate true Q ^* The operator network updates the network parameters by gradient ascending of the determination strategy gradient of the Q value so as to maximize the action value:

finally at a fixed interval at an update rate

Updating the target network:

further, the step (6) comprises the following specific steps:

(6a) Starting environment simulation, and initializing each agent operator network, a critic network and respective target network parameters thereof;

(6b) Initializing the number of training rounds;

(6c) Updating the position set, the task set and the service times of the user, and the position set and the channel parameters of the unmanned aerial vehicle;

(6d) For each agent, an actor network is distributed and operated according to the observation o _u (τ), output action a _u (τ) and obtaining an instant prize r _u (τ) while moving to the next state s _τ+1 To thereby obtain training data o _u (τ)，a _u (τ)，r _u (τ)，o _u (τ+1)}；

(6e) Storing the training data into respective experience playback pools;

(6f) Each agent randomly samples H training data from the experience playback pool to form a training data set;

(6g) Each agent calculates a loss value L (w) through the critic network and the target network _u ) Updating w _u And adopting a determined strategy gradient to perform gradient rise, and updating the parameter theta of the operator network through back propagation of the neural network _u ；

(6h) When the training times reach the target network updating interval, updating the target network parameters;

(6i) Judging whether convergence is met, if yes, finishing optimization to obtain an optimized deep reinforcement learning model, and otherwise, entering the step (6 c);

further, the step (7) comprises the following specific steps:

(7a) The method comprises the steps of training a deep reinforcement learning model by using an MADDPG algorithm, and inputting state information at a certain moment;

(7b) Outputting the optimal action strategy

And obtaining the optimal migration strategy and flight path.

Has the advantages that: according to the large-scale multi-unmanned aerial vehicle assisted MEC network computing unloading method based on the MADDPG algorithm, under the condition that constraint conditions are met, the energy consumption of the unmanned aerial vehicle and the average computing time delay of the system are reduced to the greatest extent by jointly optimizing the unloading decision and the flight path of the unmanned aerial vehicle, and the method can be stably represented in the optimization of a series of continuous state spaces and continuous action spaces. Under the condition of meeting the similar scene, the performance of the MADDPG-based deep reinforcement learning algorithm in the aspects of reducing energy consumption and average task delay is superior.

Drawings

Fig. 1 is a schematic structural diagram of an unmanned aerial vehicle-assisted computation offloading model provided in an embodiment of the present invention;

fig. 2 is a schematic diagram of a POMDP decision process of a multi-drone computation migration algorithm according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of algorithm training based on MADDPG according to an embodiment of the present invention;

fig. 4 is a simulation result diagram of a relationship between energy consumption and computation performance of the unmanned aerial vehicle under the maddppg algorithm provided by the embodiment of the present invention.

Detailed Description

The core idea of the invention is that: a distributed reinforcement learning method is adopted, each unmanned aerial vehicle is regarded as an intelligent agent, and a deep reinforcement learning model is optimized by means of an MADDPG algorithm through establishment of the deep reinforcement learning model. And obtaining the optimal migration strategy and flight path according to the optimized model.

The present invention is described in further detail below.

(3) An unmanned aerial vehicle auxiliary user unloading system model, a mobile model, a communication model and a calculation model are constructed, and an optimization objective function is given;

the method comprises the following specific steps:

And

Each MD has a compute intensive task at each time slot τ, denoted as S _m (τ)＝{D _m (τ)，F _m (τ) }, in which D _m (τ) represents the amount of data bits, F _m (τ) represents the required CPU cycles per bit;

(3b) Each unmanned aerial vehicle only provides calculation unloading service for one terminal device in each time slot tau, a user only needs to calculate a small part of tasks locally, the rest tasks are unloaded to the unmanned aerial vehicle for auxiliary calculation, so that the calculation delay and energy consumption are reduced, and the proportion of unloading calculation amount is recorded as delta _m，u (τ)∈[0，1]. The offloading decision variables between the drone and the user equipment may be expressed as:

D＝{α _m，u (tau) | U ∈ U, M ∈ M, tau ∈ T } expression 1

Wherein alpha is _m，u (τ) ∈ {0,1}, when α is _m，u (τ) =1 indicates the device MD _m The calculation task at time slot τ is carried out by an unmanned aerial vehicle UAV _u Auxiliary calculation, Δ _m，u (τ) > 0; when alpha is _m，u (τ) =0 then this means that the computation task is only performed locally, Δ _m，u (τ) =0. Decision variables need to be satisfied:

wherein d is _max The maximum distance representing the movement of the device, the movement direction and the distance probability are uniformly distributed, and rho _1，m ，ρ _2，m U (0, 1), the drone serves the terminal considering only its starting position in the slot.

(3d) Each unmanned plane is at height H _u Can also use the discrete position c of the unmanned plane in each time slot _u (τ) assuming a UAV _u Selecting a fly-to-service MD at a time slot τ _m Then its flight direction is recorded as β _u (τ)∈[0，2π]The flying speed is v _u (τ)∈[0，V _max ]Time of flight t _fly The energy consumed by the flight of the unmanned aerial vehicle is as follows:

(3e) Computing offload adoption of optional partsOffloading strategy, then MD _m The local computation delay at slot τ can be expressed as:

wherein g is ₀ Is the channel power gain per meter.

where B represents the bandwidth of the channel and,

transmitting power, σ, for the uplink of a mobile device ² Representing white gaussian noise at the unmanned end.

Associated user MD _m The transmission data delay is as follows:

wherein f is _u Representing the computational power of the drone,

denotes the CPU power, κ, at which the drone performs the calculations _u ＝10 ^-27 Is a chip constant

(3h) Since the resulting output data volume of various compute-intensive tasks is much smaller than the input, the delay spent on downlink transmission can be neglected, user MD _m Completing task S in time slot tau _m Time delay T of (tau) _m (τ) can be expressed as:

(3i) User MD _m The average delay of (d) may be expressed as:

the system average calculated delay may be calculated as:

(3j) In order to ensure user fairness, a fairness index ξ is defined _τ To measure fairness of service:

(5) The method adopts a Partially Observable Markov Decision Process (POMDP) for modeling, and jointly optimizes flight tracks of multiple unmanned aerial vehicles and calculates unloading strategies based on the position and task information of a user under the condition of considering the flight range and the safety distance of the unmanned aerial vehicles, so as to construct a deep reinforcement learning model by taking the aim of minimizing system delay and unmanned aerial vehicle energy consumption and simultaneously ensuring service fairness of the user as a target, and comprises the following specific steps:

(5a) The problem of multi-UAV assisted computational offloading is considered as a partial observation Markov decision process, and is composed of tuples { S, A, O, pr, R }. There are typically multiple agents interacting with the environment, each agent based on the current state s _τ Get self observation o _τ Belongs to O and make action a _τ E.g. A, the environment generates an instant reward r for the action _τ E.g. R to evaluate the current action and with probability Pr (S) _τ+1 |S _τ ，A _τ ) The next state is entered and the new state depends only on the current state and the actions of the respective agent. Actions of agents are based on a policy π (a) _τ |o _τ ) Enforcement, with the goal of learning an optimal strategy to maximize long-term jackpot, can be expressed as:

where gamma is the reward discount.

(5b) Specifically defining observation space, each unmanned aerial vehicle only has a limited observation range, and the radius of the observation range is set as r _obs Therefore, only partial state information can be observed, and global state information and the actions of other drones are unknown. Single unmanned aerial vehicle UAV _u The information observable in the time slot τ has its own position information c _u (tau) and current location information, task information and number of services for K mobile users in observation range

o _u (τ)＝{c _u (τ)，k _u (τ) } expression 18

(5c) Specifically defining an action space, based on the observed information, the drone needs to determine which user is served at the current time slot τ and the offloading ratio Δ _m，u (τ) and determining the flight angle β thereof _u (τ) and flight velocity v _u (τ), can be written as:

a _u (τ)＝{m(τ)，Δ _m，u (τ)，β _u (τ)，v _u (τ) } expression 19

s(τ)＝{o _u expression 20 of (tau) | U ∈ U }

(5e) Specifically, the reward is defined, and the feedback obtained after the agent executes the action is called the reward and is used for judging the quality of the action and guiding the agent to update the strategy. Generally speaking, the reward functions all correspond to optimization objectives, the objective of this optimization is to minimize the energy consumption of the drone and the average system computation delay, and the maximum reward return is just negatively correlated, so the reward after the drone executes actions is defined as:

r _u (τ)＝D _m (τ)·(-T ^mean (τ)-ψE _u (τ)-P _u (τ)) expression 21

wherein eta and beta are correlation constants, the function image is of a sigmoid type, the input is the accumulated service times of the current user, the more times are, the larger the value is, the smaller the reward is, and the lower the benefit is. Psi is used to numerically align the drone energy consumption and the user average delay. P _u (τ) is an additional penalty term, which is added if the drone flies out of the field or is less than a safe distance away from the rest of the drones after performing the action.

(5f) According to the established S, A, O and R, a deep reinforcement learning model is established on the basis of MADDPG, an operator-critic framework is adopted, and each agent has an own operator network, a critic network and a respective target network. The Actor network is responsible for formulating a policy pi (o) for an agent _u (τ)|θ _u )，θ _u Representing its network parameters; the estimate of the critic's network output to the optimal state-action cost function is denoted as Q (s (τ), a ₁ (τ)，..，a _U (τ)|w _u )，w _u Representing its network parameters. The inputs to the critic network contain observations and actions of all agents within a time slot, but the inputs to the actor network only require their own observations when the distribution is performed.

The algorithm carries out the Q function and the optimal strategy simultaneouslyLearning, when updating the critic network, H groups of records need to be extracted from the experience pool of each agent, and H new records are obtained by splicing each group at the same time and are recorded as: { s _t，i ，a _1，i ，...，a _U，i ，r _1，i ，...，r _U，i ，s _t+1，i I =1, 2., H }, training the criticic network of each agent using a set of timing differences, the penalty function of the training Q-value function being defined as:

wherein y is _u，i Obtained from formula (24):

wherein, the first and the second end of the pipe are connected with each other,

and

respectively representing Unmanned Aerial Vehicles (UAVs) _u And the critic target network and the actor target network, wherein the target network has network parameters updated later, so that the training becomes more stable.

Critic networks need to minimize losses to approximate true Q ^* The operator network updates the network parameters by using the gradient of the determination strategy of the Q value to perform gradient rise so as to maximize the action value:

finally at a fixed interval at an update rate

Updating the target network:

(6) Considering a continuous state space and a continuous action space, and performing model training of computation migration by using a MADDPG-based multi-agent deep reinforcement learning algorithm, the method comprises the following specific steps:

(6a) Starting environment simulation, and initializing each agent operator network, critic network and respective target network parameters thereof;

(6b) Initializing the number of training rounds;

(6e) Storing the training data into respective experience playback pools;

(7b) Outputting the optimal action strategy

And obtaining the optimal migration strategy and flight path.

In fig. 1, a model of a mobile edge computing system for drone-assisted user offloading is depicted, where the user offloads computing tasks to drone-assisted computing to reduce latency and energy consumption of the computing.

In fig. 2, a deep reinforcement learning model of the drone assisted MEC network is described, and it can be seen that multiple drones as agents select the current optimal policy according to the policy based on the current state, and obtain rewards from the environment.

In fig. 3, a training model of an operator-critic framework is described, and through centralized training and distributed execution, the critic network can refer to the behaviors of other agents in the training process, so that the performance of the operator network is better evaluated, and the stability of a strategy is improved.

In fig. 4, simulation results of the calculation performance and the energy consumption of the unmanned aerial vehicle with different algorithms are described, optimal power consumption control under different calculation performances can be obtained based on the maddppg algorithm, and when the CPU frequency is 12.5GHz, the energy consumption is reduced by 29.16% compared with a baseline, and is reduced by 8.67% compared with a random policy gradient algorithm. .

Those matters not described in detail in the present application are well within the knowledge of those skilled in the art.

Claims

1. An unmanned aerial vehicle assisted calculation migration method based on multi-agent depth determination strategy gradient is characterized by comprising the following steps:

(1) The traditional MEC servers are deployed in a base station or other fixed facilities, a movable MEC server is adopted at this time, the unmanned aerial vehicle technology is combined with edge calculation, and user equipment unloads calculation tasks to an unmanned aerial vehicle end through wireless communication so as to reduce calculation delay;

(2) Constructing an unmanned aerial vehicle auxiliary user unloading system model, a mobile model, a communication model and a calculation model, and giving an optimization objective function;

(3) The method is characterized in that a Partially Observable Markov Decision Process (POMDP) is adopted for modeling, under the condition of considering the flight range and the safety distance of the unmanned aerial vehicle, the flight tracks and the calculation unloading strategies of the multiple unmanned aerial vehicles are jointly optimized on the basis of the position and the task information of a user, a deep reinforcement learning model is constructed by taking the aim of minimizing the system delay and the energy consumption of the unmanned aerial vehicle and simultaneously ensuring the service fairness of the user as the target, and the method comprises the following specific steps:

(3a) The problem of the multi-unmanned aerial vehicle auxiliary computation unloading is regarded as a partial observation Markov decision process and is composed of tuples { S, A, O, pr and R }; there are typically multiple agents interacting with the environment, each agent based on the current state s _τ Get self observation o _τ E.g. O and make action a _τ E.g. A, the environment generates an instant reward r for the action _τ E.g. R to evaluate the current action and with probability Pr (S) _τ+1 |S _τ ，A _τ ) Entering a next state, the new state depending only on the current state and the actions of the respective agent; actions of Agents are based on policy π (a) _τ |o _τ ) Enforcement, with the goal of learning to an optimal strategy to maximize long-term jackpot, can be expressed as:

wherein γ is a reward discount;

(3b) Specifically defining observation space, each unmanned aerial vehicle only has limited observation range, and the radius of the observation range is set as r _obs Therefore, only partial state information can be observed, and the global state information and the actions of other unmanned planes are unknown; single unmanned aerial vehicle UAV _u The information observable in the time slot τ has its own position information c _u (tau) and current location information, task information and number of services for K mobile users in observation range

o _u (τ)＝{c _u (τ)，k _u (τ)}

(3c) Specifically defining an action space, based on the observed information, the drone needs to determine which user is served at the current time slot τ and the offloading ratio Δ _m，u (tau) and determining the flight angle beta thereof _u (τ) and flight velocity v _u (τ), which can be written as:

a _u (τ)＝{m(τ)，Δ _m，u (τ)，β _u (τ)，v _u (τ)}

(3d) Defining a state space, the state of the system can be regarded as a set of all unmanned aerial vehicle observations:

s(τ)＝{o _u (τ)|u∈U}

(3e) Specifically defining rewards, wherein feedback obtained after the intelligent agent executes actions is called the rewards and is used for judging the quality of the actions and guiding the intelligent agent to update the strategy; generally speaking, the reward functions all correspond to optimization objectives, the objective of this optimization is to minimize the energy consumption of the drone and the average system computation delay, and the maximum reward return is just negatively correlated, so the reward after the drone executes actions is defined as:

r _u (τ)＝D _m (τ)·(-T ^mean (τ)-ψE _u (τ)-P _u (τ))

wherein eta and beta are correlation constants, the function image is of a sigmoid type, the number of times of accumulated service of the current user is input, the more times are, the larger the value is, the smaller the reward is, and the lower the benefit is; psi is used to average the drone energy consumption and usersCarrying out numerical value alignment; p _u (tau) is an additional punishment item, and if the unmanned aerial vehicle flies out of the field after executing actions or the distance between the unmanned aerial vehicle and the rest unmanned aerial vehicles is less than the safe distance, the punishment item needs to be added;

(3f) Establishing a deep reinforcement learning model on the basis of MADDPG according to the established S, A, O and R, adopting an operator-critic framework, wherein each agent has an own operator network, critic network and respective target network; the Actor network is responsible for formulating a policy pi (o) for an agent _u (τ)|θ _u )，θ _u Representing its network parameters; the critic network outputs an estimate of the optimum state-action cost function denoted as Q (s (τ), a) ₁ (τ)，...，a _U (τ)|w _u )，w _u Representing its network parameters; the input of the critic network comprises the observed values and actions of all agents in a time slot, but the input of the actor network only needs the observed values of the actor network when the distribution is executed;

wherein y is _u，i Obtained from formula (24):

wherein the content of the first and second substances,

and

respectively representing Unmanned Aerial Vehicles (UAVs) _u The critic target network and the actor target network, the target network has network parameters updated later, so that the training becomes more stable;

finally at a fixed interval at an update rate

Updating the target network:

(4) Considering a continuous state space and a continuous action space, and performing model training of computational migration by using a multi-agent deep reinforcement learning algorithm based on MADDPG;

(5) In the execution stage, the unmanned aerial vehicle obtains an optimal user unloading scheme and a flight track by using a trained model based on the state s (tau) of the current environment.