CN116774731A - Unmanned aerial vehicle formation path planning method based on reinforcement learning
- Publication number: CN116774731A (application CN202310918688.0A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Landscapes
- Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
Abstract
The invention discloses an unmanned aerial vehicle formation path planning method based on reinforcement learning, which comprises the following steps. Step S1: establishing a kinematic model of the unmanned aerial vehicle according to its kinematic equation and state transition equation, and updating the motion state of the unmanned aerial vehicle. Step S2: substituting the state parameters of the state space and the motion parameters of the action space obtained in step S1 into an Actor-Critic network model, and updating the parameters of the Actor network and the Critic network according to a multi-agent twin-delayed deep deterministic policy gradient algorithm to obtain the Actor-Critic network parameters. Step S3: substituting the Actor-Critic network parameters obtained in step S2 into the reward function to obtain the reward value. Step S4: iterating the Actor network and Critic network calculations until the reward value converges, and obtaining the action parameters to execute according to the state of the unmanned aerial vehicle. The invention enhances the system's ability to resist sudden threats, and realizes both a stable formation structure and autonomous formation transformation by setting a dynamic formation reward function.
Description
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle formation path planning, and particularly relates to an unmanned aerial vehicle formation path planning method based on reinforcement learning.
Background
Unmanned aerial vehicles have been widely applied across many industries in recent years. A single unmanned aerial vehicle offers high operability and convenience, but suffers from prominent shortcomings such as limited payload capacity and weak anti-interference capability. A single vehicle is therefore unsuitable for complex tasks, and if it fails during task execution, the whole task fails. For this reason, entirely new unmanned aerial vehicle formation architectures are gradually being applied to task execution. By exploiting the cluster advantages of a formation, the unmanned aerial vehicles complement each other's functions and combine their capabilities during task execution, which can remarkably improve the task success rate.
In task execution by unmanned aerial vehicle formations, path planning and formation control are key research topics. The existing formation cooperative control methods are mainly layered control schemes: the problem is divided into an upper, middle and lower layer, where the upper layer makes decisions, the middle layer transmits instructions, and the lower layer executes tasks, thereby reducing the dimensionality of the problem and simplifying the solution space. Most scholars solve formation problems with bio-inspired methods such as genetic algorithms, simulated annealing and ant colony optimization. Although these algorithms solve quickly and effectively, they rely on acquiring environmental information in advance and cannot be used in a continuously changing dynamic environment. In recent years, multi-agent reinforcement learning (Multi-Agent Reinforcement Learning, MARL) has provided a new approach to formation cooperative control: the unmanned aerial vehicle formation problem is modeled as a Markov decision process (Markov Decision Process, MDP), the unmanned aerial vehicles learn autonomously from interaction experience with the environment, and adjust their behavior through the rewards and penalties given by the environment, thereby achieving cooperative control.
Liu et al. introduced the leader-wingman scheme into formation control and trained an offline reinforcement learning method, realizing formation control and path planning for five unmanned aerial vehicles, but considered only obstacle-free environments. Pan et al. combined a distributed formation control method with model-based reinforcement learning to solve the obstacle avoidance problem of in-line formations in complex environments, but did not consider other, more complex formation structures.
The existing formation control methods are found to have the following problems:
1. Existing methods generally decompose the formation control problem into formation maintenance and formation adjustment, handled by different methods; this makes the algorithm overly complex and hinders rapid convergence.
2. Existing methods depend heavily on prior knowledge of the environment, model the environment and formation relatively simply, cannot avoid sudden threats in the environment in time, and have limited generalization ability.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide an unmanned aerial vehicle formation path planning method based on reinforcement learning, in which the system's ability to resist sudden threats is enhanced by setting dynamic obstacles, and a stable formation structure and autonomous formation transformation are realized by setting a dynamic formation reward function.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
An unmanned aerial vehicle formation path planning method based on reinforcement learning comprises the following steps:
Step S1: establishing a kinematic model of the unmanned aerial vehicle according to its kinematic equation and state transition equation, and updating the motion state of the unmanned aerial vehicle, wherein the motion state of the unmanned aerial vehicle comprises a state space and an action space;
Step S2: substituting the state parameters of the state space and the motion parameters of the action space obtained in step S1 into an Actor-Critic network model, and updating the parameters of the Actor network and the Critic network according to a multi-agent twin-delayed deep deterministic policy gradient algorithm to obtain the Actor-Critic network parameters;
Step S3: substituting the Actor-Critic network parameters obtained in step S2 into the reward function to obtain the reward value;
Step S4: iterating the Actor network and Critic network calculations until the reward value converges, and obtaining the action parameters to execute according to the state of the unmanned aerial vehicle.
In the step S1, the state parameters of the state space of the unmanned aerial vehicle include an abscissa, an ordinate, a flight angle and a speed of the unmanned aerial vehicle;
the motion parameters of the motion space of the unmanned aerial vehicle include angular velocity and acceleration of the unmanned aerial vehicle.
In step S1, the motion state space of the ith unmanned aerial vehicle at time t+1 is obtained from the kinematic equation and the state transition equation of the unmanned aerial vehicle; the kinematic equation of the unmanned aerial vehicle in step S1 is as follows:

The state transition equation of the unmanned aerial vehicle is as follows:

wherein x_i denotes the abscissa of each unmanned aerial vehicle, y_i the ordinate, ψ_i the flight angle, v_i the speed, ω_i the angular velocity, and a_i the acceleration.
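The printed kinematic and state transition equations are not reproduced in this copy of the text, but the listed state variables (x_i, y_i, ψ_i, v_i) and action variables (ω_i, a_i) match a standard discrete-time unicycle model. The sketch below shows one plausible update of that form; the step size dt is an assumed parameter, not a value from the patent:

```python
import math

def step_kinematics(x, y, psi, v, omega, acc, dt=0.1):
    # Position advances along the current heading, the heading integrates
    # the angular velocity, and the speed integrates the acceleration.
    # (Standard unicycle model, assumed; the patent's printed equations
    # are omitted in this copy.)
    x_next = x + v * math.cos(psi) * dt
    y_next = y + v * math.sin(psi) * dt
    psi_next = psi + omega * dt
    v_next = v + acc * dt
    return x_next, y_next, psi_next, v_next
```

Applied once per control step, this yields the motion state of each unmanned aerial vehicle at time t+1 from its state and action at time t.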
In step S2, parameters are updated with a multi-agent twin-delayed deep deterministic policy gradient algorithm. The Actor network takes the unmanned aerial vehicle's own state information as input and outputs the action it executes; the Critic network takes the states and actions of all unmanned aerial vehicles as input and outputs an evaluation Q value. The Critic network has 1 input layer, 3 hidden layers and 1 output layer.
The input layer receives the state information and currently executed action of the unmanned aerial vehicle; the first hidden layer is a fully connected layer with 32 neurons and a ReLU activation function; the second hidden layer is a fully connected layer with 32 neurons and a ReLU activation function; the third hidden layer is a fully connected layer with 1 neuron; the output layer is a fully connected layer with 1 neuron that outputs the corresponding Q(s, a) as the action evaluation. The data of each state transition of every unmanned aerial vehicle are stored in a replay buffer; a batch of data is sampled and fed to the network for each parameter update, and after network training converges, the strategy set of each unmanned aerial vehicle is output.
In step S2, the multi-agent twin-delayed deep deterministic policy gradient algorithm adopts a centralized-training, distributed-execution architecture;
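The Critic layout described above can be sketched as a small NumPy forward pass. The weights here are random placeholders rather than trained parameters, and the 1-neuron third hidden layer and 1-neuron output layer read like a duplication in the translation, so this sketch folds them into a single output layer (an assumption):

```python
import numpy as np

def make_critic(input_dim, rng):
    # Placeholder weights for the described layout: input -> 32 (ReLU)
    # -> 32 (ReLU) -> 1 scalar Q value.
    return {
        "W1": rng.standard_normal((input_dim, 32)) * 0.1, "b1": np.zeros(32),
        "W2": rng.standard_normal((32, 32)) * 0.1, "b2": np.zeros(32),
        "W3": rng.standard_normal((32, 1)) * 0.1, "b3": np.zeros(1),
    }

def critic_q(params, states, actions):
    # Q(s, a): states and actions of all UAVs, concatenated, in; scalar out.
    z = np.concatenate([states, actions])
    h = np.maximum(z @ params["W1"] + params["b1"], 0.0)  # hidden layer 1, ReLU
    h = np.maximum(h @ params["W2"] + params["b2"], 0.0)  # hidden layer 2, ReLU
    return float(h @ params["W3"] + params["b3"])         # single-neuron output
```

The input dimension (all UAVs' states plus actions) follows the centralized-critic design described above; the per-UAV dimensions used below are illustrative.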
The Actor network takes as input the unmanned aerial vehicle's own state information, which comprises its current position coordinates and state parameters, and outputs the action the unmanned aerial vehicle executes; its parameters are updated by gradient ascent, with the gradient calculated as follows:

wherein θ_i denotes the network parameters, L the loss function, N the data size, Q the Q value of the unmanned aerial vehicle, o the state of each unmanned aerial vehicle, a the action of the unmanned aerial vehicle, w_{i,j} the Q network parameters, and μ the action policy;
The Critic network takes as input the states and actions of all unmanned aerial vehicles, which comprise their current position coordinates, state parameters and executed action parameters, and outputs the evaluation Q value; its parameters are updated by gradient descent, with the loss function calculated as follows:

wherein w_{i,j} denotes the network parameters, L the loss function, N the data size, Q the Q value of the unmanned aerial vehicle, o the states of the unmanned aerial vehicles, and a the actions of the unmanned aerial vehicle.
The reward function of each unmanned aerial vehicle in step S3 is as follows:

R_i = α_1·r_1 + α_2·r_2 + α_3·r_3 + α_4·r_4 + α_5·r_5 + α_6·r_6 + α_7·r_7

wherein α_1, α_2, α_3, α_4, α_5, α_6 and α_7 are weighting coefficients; r_1, r_2 and r_6 are sparse rewards, and the unmanned aerial vehicle ends the current round of training when a sparse reward is triggered; r_3, r_4, r_5 and r_7 are guiding rewards, and a guiding reward is given at each state transition;
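The weighted composition above can be sketched directly; the weight values used in the usage below are illustrative, as the patent does not fix them here:

```python
def composite_reward(sub_rewards, weights):
    # R_i = sum over k of alpha_k * r_k for the seven sub-rewards
    # r_1 ... r_7 with weighting coefficients alpha_1 ... alpha_7.
    assert len(sub_rewards) == len(weights) == 7
    return sum(a * r for a, r in zip(weights, sub_rewards))
```

For example, with unit weights and sub-rewards (10, 0, -1, 0.5, 0, 0, 0.5), the composite reward is 10.0.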
If the unmanned aerial vehicle arrives at the destination, it obtains the reward value r_1 given by the environment, defined as follows:

wherein Δd denotes the Euclidean distance between the unmanned aerial vehicle and the destination, d_1 is a distance threshold, and Δθ is the difference between the unmanned aerial vehicle's flight angle and the angle from the unmanned aerial vehicle to the destination;
If the unmanned aerial vehicle collides with an obstacle or a boundary during its movement, it obtains the negative reward r_2, defined as follows:

r_2 = -10
Each step the unmanned aerial vehicle takes, it obtains the negative reward r_3, simulating the energy consumed while the unmanned aerial vehicle is running:

r_3 = -1
In a composite obstacle environment, the movement trajectory of the unmanned aerial vehicle is generally not a straight line, because obstacles must be avoided: when an obstacle lies on the line between the unmanned aerial vehicle and the destination, a certain included angle is needed between the flight direction of the unmanned aerial vehicle and that line. By a case analysis of the relative positions of the unmanned aerial vehicle, the obstacle and the destination, the optimal flight angle of the unmanned aerial vehicle is determined as follows:

θ_best = θ_L ± θ_ε

wherein θ_best is the optimal flight angle of the unmanned aerial vehicle, θ_L is the angle of the unmanned aerial vehicle along the tangential direction of the obstacle, and θ_ε is the angle value deviating from the tangent;
Accordingly, the reward r_4 is determined from the current flight angle and the optimal flight angle of the unmanned aerial vehicle, defined as follows:
During obstacle avoidance, the unmanned aerial vehicle may temporarily move away from the destination. To accelerate the algorithm's convergence, this situation is tolerated, and a reward function based on the degree of distance from the destination is proposed:
The negative reward r_6 generated by a collision between unmanned aerial vehicles is defined as follows:
A stable formation structure means that the distance between every two unmanned aerial vehicles is kept stable; deeper analysis shows this means each pairwise distance is stabilized around a reasonable value. A reward function is therefore designed from the optimal and current distance between every two unmanned aerial vehicles, so that the formation stays stable when no obstacle is present and is finely adjusted when an obstacle is encountered. For this purpose, a reward function of the actual distance and the optimal distance between unmanned aerial vehicle i and unmanned aerial vehicle j is set as follows:

wherein d_i(j) is the actual distance between unmanned aerial vehicle i and unmanned aerial vehicle j, and d_opt,ij is the optimal distance between them. The reward value is a quadratic function of d_i(j)/d_opt,ij: when d_i(j)/d_opt,ij equals 1, i.e. when the actual distance between unmanned aerial vehicle i and unmanned aerial vehicle j equals the optimal distance, the quadratic function takes its maximum value of 1; the farther d_i(j)/d_opt,ij deviates from 1, i.e. the farther the actual distance deviates from the optimal distance, the smaller the reward value, which matches the original intent of this reward function.

In step S4, each unmanned aerial vehicle outputs its action a_t according to:
a_t = μ(o_t; θ_i)
wherein a_t is the action executed at time t, o_t is the state at time t, μ is the policy function, and θ_i are the parameters of the policy network;

The θ parameters of the Critic network are updated along their gradient by gradient descent, and the φ parameters of the Actor network are updated by gradient ascent; the Q value is calculated from the reward value function and compared with the Q value of the previous round; when the model converges within the convergence range, the calculation ends, otherwise the loop continues until the Q value converges.
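The step S4 loop can be sketched as follows. The convergence test is simplified here to "two consecutive rounds differ by less than a tolerance", which is an assumption; the patent only states that the calculation stops once the value converges within the convergence range:

```python
def iterate_until_converged(run_round, tol=1e-3, max_rounds=10000):
    # Repeat Actor/Critic updates until the value converges, as in step
    # S4. run_round is assumed to perform one round of network updates
    # and return that round's value (e.g. cumulative reward or Q value).
    prev = None
    for n in range(max_rounds):
        value = run_round()
        if prev is not None and abs(value - prev) < tol:
            return n, value
        prev = value
    return max_rounds, prev
```

After convergence, each unmanned aerial vehicle reads off the action to execute from its trained policy given its current state.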
The invention has the beneficial effects that:
the invention has the capability of coping with sudden threats in the environment. Meanwhile, the dynamic formation reward function is designed aiming at the problems of formation structural stability and formation transformation autonomy, so that the cooperative capacity of formation is improved.
The invention adopts a dynamic obstacle environment, which enhances the system's ability to resist sudden threats.
The invention extends the sparse reward function, which solves the problem of collision avoidance within the formation.
By adopting the dynamic formation reward function, a stable formation structure and autonomous formation transformation are realized simultaneously.
Drawings
Fig. 1 is a flow diagram of a method for unmanned aerial vehicle formation path planning based on reinforcement learning in one embodiment.
FIG. 2 is a block diagram of the multi-agent twin-delayed deep deterministic policy gradient algorithm in one embodiment.
FIG. 3 is a schematic diagram of a dynamic formation rewards function.
Fig. 4 is a graph of success rates for three schemes according to the present invention.
FIG. 5 is a per-round cumulative reward graph for the three schemes according to the present invention.
Fig. 6 is a schematic diagram of formation deformation ratios for three schemes.
Fig. 7 is a schematic diagram of an embodiment of unmanned aerial vehicle formation path planning.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The invention provides a dynamic formation reward function and obtains a better optimization effect. As shown in figs. 1-5, a specific embodiment of the invention discloses an unmanned aerial vehicle formation path planning method based on reinforcement learning, comprising the following steps:
Step S1: a kinematic model of the unmanned aerial vehicle is established according to its kinematic equation and state transition equation, and the motion state of the unmanned aerial vehicle is updated, wherein the state space of the unmanned aerial vehicle comprises its abscissa, ordinate, flight angle and speed, and the action space of the unmanned aerial vehicle comprises its angular velocity and acceleration.
The kinematic equation is as follows:
the state transfer equation of the unmanned aerial vehicle is as follows:
wherein x is i 、y i 、ψ i 、v i 、ω i 、a i The unmanned aerial vehicle comprises an abscissa, an ordinate, a flight angle, a speed, an angular speed and an acceleration of each unmanned aerial vehicle.
Step S2: the unmanned aerial vehicle state parameters and motion parameters obtained from the kinematic model in step S1 are substituted into the Actor-Critic network model, and the parameters of the Actor network and the Critic network are updated according to the multi-agent twin-delayed deep deterministic policy gradient algorithm. As shown in fig. 2, the algorithm comprises 1 current Actor network, 1 target Actor network, 2 current Critic networks and 2 target Critic networks; to avoid the problem of Q-value overestimation, the smaller of the two Critic networks' values is selected as the target Q value. The Actor network takes the unmanned aerial vehicle's state information as input, outputs the action the unmanned aerial vehicle executes, and updates its parameters by gradient ascent; the gradient calculation formula for the Actor network is as follows:
wherein θ_i denotes the network parameters, L the loss function, N the data size, Q the Q value of the unmanned aerial vehicle, o the state of each unmanned aerial vehicle, a the action of the unmanned aerial vehicle, w_{i,j} the Q network parameters, and μ the action policy.
The Critic network inputs the states and actions of all unmanned aerial vehicles, outputs an evaluation Q value, updates parameters according to a gradient descent method, and a loss function calculation formula is as follows:
wherein w_{i,j} denotes the network parameters, L the loss function, N the data size, Q the Q value of the unmanned aerial vehicle, o the states of the unmanned aerial vehicles, and a the actions of the unmanned aerial vehicle.
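The twin-critic target mentioned above (take the smaller of the two target-Critic values to curb Q overestimation) can be sketched as a clipped double-Q TD target. The discount factor gamma is an assumed hyperparameter, not a value from the patent:

```python
def td_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    # Clipped double-Q target of a TD3-style update: the smaller of the
    # two target-Critic estimates for the next state-action is used.
    if done:
        return reward  # terminal transition: no bootstrapped value
    return reward + gamma * min(q1_next, q2_next)
```

The Critic loss above is then the mean squared error between the current Critic's Q value and this target over a sampled batch.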
Step S3: the Actor-Critic network parameters obtained in step S2 are substituted into the reward function to obtain the reward value, wherein the reward function of each unmanned aerial vehicle is as follows:

R_i = α_1·r_1 + α_2·r_2 + α_3·r_3 + α_4·r_4 + α_5·r_5 + α_6·r_6 + α_7·r_7

wherein α_1, α_2, α_3, α_4, α_5, α_6 and α_7 are weighting coefficients; r_1, r_2 and r_6 are sparse rewards, and the unmanned aerial vehicle ends the current round of training when a sparse reward is triggered; r_3, r_4, r_5 and r_7 are guiding rewards, and a guiding reward is given at each state transition.
If the unmanned aerial vehicle arrives at the destination, it obtains the reward value r_1 given by the environment, defined as follows:

wherein Δd denotes the Euclidean distance between the unmanned aerial vehicle and the destination, d_1 is a distance threshold, and Δθ is the difference between the unmanned aerial vehicle's flight angle and the angle from the unmanned aerial vehicle to the destination.
If the unmanned aerial vehicle collides with an obstacle or a boundary during its movement, it obtains the negative reward r_2, defined as follows:

r_2 = -10
each time the unmanned plane walks a step, a negative prize r is obtained 3 And simulating the energy consumed in the running process of the unmanned aerial vehicle.
r 3 =-1
In a composite obstacle environment, the movement trajectory of the unmanned aerial vehicle is generally not a straight line, because obstacles must be avoided: when an obstacle lies on the line between the unmanned aerial vehicle and the destination, a certain included angle is needed between the flight direction of the unmanned aerial vehicle and that line. By a case analysis of the relative positions of the unmanned aerial vehicle, the obstacle and the destination, the optimal flight angle of the unmanned aerial vehicle is determined as follows:

θ_best = θ_L ± θ_ε

wherein θ_best is the optimal flight angle of the unmanned aerial vehicle, θ_L is the angle of the unmanned aerial vehicle along the tangential direction of the obstacle, and θ_ε is the angle value deviating from the tangent.
Accordingly, the reward r_4 is determined from the current flight angle and the optimal flight angle of the unmanned aerial vehicle, defined as follows:
in the unmanned plane obstacle avoidance process, the unmanned plane may be temporarily far away from the end point, so that in order to accelerate the algorithm convergence speed, the situation can be tolerated, and a reward function for the distance degree from the destination is proposed:
negative rewards r generated by collision between unmanned aerial vehicles 6 The definition is as follows:
in order to ensure a stable formation structure, the deep analysis of the stable formation structure means that the distance between every two unmanned aerial vehicles is kept stable, and for this purpose, a reward function is designed according to the optimal distance between every two unmanned aerial vehicles and the current distance, so that formation can be ensured to be kept stable when no obstacle is present, and formation fine adjustment is performed when the obstacle is encountered. Deep analysis means that the distance between every two unmanned aerial vehicles is stabilized around a reasonable value. For this purpose, a bonus function is set for the actual distance between unmanned plane i and unmanned plane j and the optimal distance as follows:
wherein d i (j) Is the actual distance d between unmanned aerial vehicle i and unmanned aerial vehicle j opt,ij Is the optimal distance between unmanned aerial vehicle i and unmanned aerial vehicle j.
As shown in FIG. 3, it can be seen that the prize value is related to d i (j)/d opt,ij In a quadratic function relationship, when d i (j)/d opt,ij When the value is 1, the quadratic function obtains the maximum value 1, namely when the actual distance between the unmanned aerial vehicle i and the unmanned aerial vehicle j is equal to the optimal distance, the maximum value of the rewards is 1; d, d i (j)/d opt,ij The farther from 1, i.e., the farther the actual distance of unmanned aerial vehicle i from unmanned aerial vehicle j deviates from the optimal distance, the smaller the prize value, which meets the original purpose of designing the prize function herein.
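The printed formula for this pairwise reward is omitted in this copy, so the sketch below uses one quadratic consistent with the stated properties (maximum of 1 at ratio 1, falling off as the ratio deviates); the exact quadratic in the patent may differ:

```python
def distance_reward(d_actual, d_opt):
    # Quadratic in the ratio d_i(j)/d_opt,ij: peaks at 1 when the
    # actual pairwise distance equals the optimal one, and decreases
    # as the ratio deviates from 1 in either direction.
    ratio = d_actual / d_opt
    return 1.0 - (ratio - 1.0) ** 2
```

For example, at the optimal distance the reward is 1.0, and a pair at twice the optimal distance scores 0.0 under this assumed form.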
The dynamic formation reward of unmanned aerial vehicle i considers the pairwise distance rewards r_d,ij between unmanned aerial vehicle i and every other unmanned aerial vehicle in the formation, computed as follows:

wherein i and j are unmanned aerial vehicle numbers and r_d,ij is the distance reward function between unmanned aerial vehicle i and unmanned aerial vehicle j.
Step S4: after the training of the Actor network and the Critic network converges, each unmanned aerial vehicle outputs its action a_t according to:

a_t = μ(o_t; θ_i)

wherein a_t is the action executed at time t, o_t is the state at time t, μ is the policy function, and θ_i are the parameters of the policy network.
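The deterministic action output a_t = μ(o_t; θ_i) can be sketched as below. A single linear layer with tanh squashing stands in for the trained Actor network, so the parameter structure theta is purely illustrative; the outputs correspond to the (angular velocity, acceleration) action pair defined earlier:

```python
import math

def policy_action(obs, theta):
    # Deterministic policy a_t = mu(o_t; theta): map the observation to
    # a bounded angular velocity and acceleration via tanh squashing.
    omega = math.tanh(sum(w * o for w, o in zip(theta["w_omega"], obs)))
    acc = math.tanh(sum(w * o for w, o in zip(theta["w_acc"], obs)))
    return omega, acc
```

Because the policy is deterministic, execution after training needs no exploration noise: each vehicle simply evaluates its own policy on its own observation, matching the distributed-execution side of the architecture.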
Fig. 4 shows the success rate curves of the three schemes (the multi-agent deep deterministic policy gradient algorithm and the multi-agent twin-delayed deep deterministic policy gradient algorithm, each with the reward function before improvement, and the multi-agent twin-delayed deep deterministic policy gradient algorithm with the dynamic formation reward function); it can be clearly seen that the dynamic formation reward function proposed by the invention has a great advantage in improving the success rate.
Fig. 5 shows the cumulative reward change per round for the three schemes; it can be clearly seen that the dynamic formation reward function proposed by the invention has a great advantage in improving the convergence speed of the algorithm.
FIG. 5a is the per-round cumulative reward curve of the MADDPG algorithm;
FIG. 5b is the per-round cumulative reward curve of the MATD3 algorithm;
FIG. 5c is the per-round cumulative reward curve of the MATD3-IDFRF algorithm.
Fig. 6 shows the formation deformation rate curves of the three schemes. It is evident that the dynamic formation reward helps the unmanned aerial vehicles form a stable formation structure and transform the formation in a timely manner, maintaining a lower formation deformation rate throughout the whole process and showing a strong advantage.
As shown in fig. 7, the unmanned aerial vehicle formation path planning problem is described as follows: several unmanned aerial vehicles start from their respective starting points, form a specific formation within a limited area, avoid multiple obstacles, and finally reach the end point. During the formation's movement, for obstacles that the original formation cannot avoid, the original formation is changed as little as possible. In the figure, the three blue solid lines represent the flight routes of the three unmanned aerial vehicles, and the triangle formed by the red solid points and red dotted lines represents the formation formed by the unmanned aerial vehicles.
Claims (10)
1. An unmanned aerial vehicle formation path planning method based on reinforcement learning, characterized by comprising the following steps:
step S1: establishing a kinematic model of the unmanned aerial vehicle according to a kinematic equation and a state transfer equation of the unmanned aerial vehicle, and updating the motion state of the unmanned aerial vehicle, wherein the motion state of the unmanned aerial vehicle comprises a state space and an action space;
step S2: substituting the state parameters of the state space and the motion parameters of the action space obtained in the step S1 into an Actor-Critic network model, and carrying out parameter updating on the Actor network and the Critic network according to a multi-agent double-delay depth deterministic strategy gradient algorithm to obtain Actor-Critic network parameters;
step S3: substituting the Actor-Critic network parameters obtained in the step S2 into the reward function to obtain a reward value;
step S4: cyclically updating the Actor network and the Critic network until the reward value converges, and obtaining the action parameters to be executed according to the state of the unmanned aerial vehicle.
2. The reinforcement learning-based unmanned aerial vehicle formation path planning method according to claim 1, wherein in S1, the state parameters of the unmanned aerial vehicle state space include an abscissa, an ordinate, a flight angle and a speed of the unmanned aerial vehicle;
the action parameters of the action space of the unmanned aerial vehicle include the angular velocity and acceleration of the unmanned aerial vehicle.
3. The unmanned aerial vehicle formation path planning method based on reinforcement learning according to claim 1, wherein in the step S1, the motion state at time t+1 of the i-th unmanned aerial vehicle is obtained according to the kinematic equation and the state transition equation of the unmanned aerial vehicle; the kinematic equation of the unmanned aerial vehicle in the step S1 is as follows:
dx_i/dt = v_i cos ψ_i, dy_i/dt = v_i sin ψ_i, dψ_i/dt = ω_i, dv_i/dt = a_i
the state transfer equation of the unmanned aerial vehicle is as follows:
x_i(t+1) = x_i(t) + v_i(t) cos ψ_i(t) Δt, y_i(t+1) = y_i(t) + v_i(t) sin ψ_i(t) Δt, ψ_i(t+1) = ψ_i(t) + ω_i(t) Δt, v_i(t+1) = v_i(t) + a_i(t) Δt
wherein x_i represents the abscissa of each unmanned aerial vehicle, y_i represents the ordinate of each unmanned aerial vehicle, ψ_i represents the flight angle of each unmanned aerial vehicle, v_i represents the speed of each unmanned aerial vehicle, ω_i represents the angular velocity of each unmanned aerial vehicle, and a_i represents the acceleration of each unmanned aerial vehicle.
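The state update in claims 1 and 3 can be sketched as a discretized unicycle-style model consistent with the symbols defined above; the time step `dt` is a hypothetical value not specified in the claims:

```python
import math

def step_uav(x, y, psi, v, omega, a, dt=0.1):
    """One discrete kinematic update for a single UAV.

    State: position (x, y), flight angle psi, speed v.
    Action: angular velocity omega, acceleration a.
    """
    x_next = x + v * math.cos(psi) * dt
    y_next = y + v * math.sin(psi) * dt
    psi_next = psi + omega * dt
    v_next = v + a * dt
    return x_next, y_next, psi_next, v_next
```

With zero angular velocity and zero acceleration, the UAV simply advances along its current heading; nonzero actions rotate and accelerate it.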
4. The unmanned aerial vehicle formation path planning method based on reinforcement learning according to claim 1, wherein in the step S2, the multi-agent twin delayed deep deterministic policy gradient algorithm is adopted to update the parameters; the Actor network takes the unmanned aerial vehicle's own state information as input and outputs the action to be executed by the unmanned aerial vehicle; the Critic network takes the states and actions of all unmanned aerial vehicles as input and outputs an evaluation Q value; the Critic network has 1 input layer, 3 hidden layers and 1 output layer.
5. The reinforcement learning-based unmanned aerial vehicle formation path planning method according to claim 4, wherein the input layer takes as input the state information of the unmanned aerial vehicle and the currently executed action; the first hidden layer is a fully connected layer with 32 neurons and a ReLU activation function; the second hidden layer is a fully connected layer with 32 neurons and a ReLU activation function; the third hidden layer is a fully connected layer with 1 neuron; the output layer is a fully connected layer with 1 neuron that outputs the corresponding Q(s, a) as the action evaluation; the data of each state transition of each unmanned aerial vehicle is stored in a replay buffer, a batch of data is sampled and input into the network for each parameter update, and after the network training converges, the strategy set of each unmanned aerial vehicle is output.
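A minimal NumPy sketch of the Critic layer structure described in claims 4-5 (two 32-unit ReLU hidden layers, a single-unit third hidden layer, and a single-unit linear output producing Q(s, a)); the weight initialisation scheme is an assumption not stated in the claims:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def make_critic(in_dim):
    """Initialise weights for the critic in the claim: layer sizes
    in_dim -> 32 -> 32 -> 1 -> 1, biases initialised to zero."""
    sizes = [in_dim, 32, 32, 1, 1]
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def critic_q(params, states, actions):
    """Q(s, a): concatenate all UAV states and actions, forward pass."""
    h = np.concatenate([states, actions])
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < 2:  # ReLU only on the two 32-unit hidden layers
            h = relu(h)
    return float(h[0])
```

A randomly initialised critic then maps the concatenated joint state and joint action to a single scalar action evaluation.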
6. The reinforcement learning-based unmanned aerial vehicle formation path planning method according to claim 4, wherein in the step S2, the multi-agent twin delayed deep deterministic policy gradient algorithm adopts an architecture of centralized training and distributed execution;
inputting the self state information of the unmanned aerial vehicle into the Actor network, wherein the self state information of the unmanned aerial vehicle comprises the current position coordinates and state parameters of the unmanned aerial vehicle, outputting the action executed by the unmanned aerial vehicle, and updating the parameters according to a gradient ascent method, wherein the gradient calculation formula is as follows:
wherein θ_i represents the Actor network parameters, L represents the loss function, N represents the data size, Q represents the Q value of the unmanned aerial vehicle, o represents the state of each unmanned aerial vehicle, a represents the action of the unmanned aerial vehicle, w_{i,j} represents the Q network parameters, and μ represents the action policy;
inputting the states and actions of all unmanned aerial vehicles into a Critic network, wherein the states and actions of all unmanned aerial vehicles comprise current position coordinates, state parameters and executed action parameters of all unmanned aerial vehicles, outputting an evaluation Q value, and updating parameters according to a gradient descent method, wherein a loss function calculation formula is as follows:
wherein w_{i,j} represents the Critic network parameters, L represents the loss function, N represents the data size, Q represents the Q value of the unmanned aerial vehicle, o represents the state of each unmanned aerial vehicle, and a represents the action of the unmanned aerial vehicle.
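The exact loss formula in claim 6 is elided in the text; the following is a sketch of the standard twin-target mean-squared TD error used by TD3-style algorithms, where the target form y = r + γ·min(Q1', Q2') is an assumption about the elided formula (the "twin delayed" minimum is what distinguishes MATD3 from MADDPG):

```python
import numpy as np

def td3_critic_loss(q_pred, rewards, q1_target_next, q2_target_next,
                    dones, gamma=0.99):
    """Mean-squared TD error with the TD3 twin-target minimum:
    y = r + gamma * (1 - done) * min(Q1', Q2'), loss = mean((y - Q)^2)."""
    y = rewards + gamma * (1.0 - dones) * np.minimum(q1_target_next,
                                                     q2_target_next)
    return float(np.mean((y - q_pred) ** 2))
```

Taking the minimum of the two target critics counteracts the Q-value overestimation that a single critic tends to produce.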
7. The reinforcement learning-based unmanned aerial vehicle formation path planning method according to claim 1, wherein the reward function of each unmanned aerial vehicle in step S3 is as follows:
R_i = α_1 r_1 + α_2 r_2 + α_3 r_3 + α_4 r_4 + α_5 r_5 + α_6 r_6 + α_7 r_7
wherein α_1, α_2, α_3, α_4, α_5, α_6, α_7 are the weighting coefficients; r_1, r_2, r_6 are sparse rewards, and when a sparse reward is triggered, the unmanned aerial vehicle ends the current round of training; r_3, r_4, r_5, r_7 are guiding rewards, and a guiding reward is given at each state transition.
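The weighted sum in claim 7 can be sketched directly; the example weights and reward values below are hypothetical, not taken from the patent:

```python
def composite_reward(r, alpha):
    """R_i = sum_k alpha_k * r_k over the seven reward terms r_1..r_7."""
    assert len(r) == len(alpha) == 7
    return sum(a_k * r_k for a_k, r_k in zip(alpha, r))

# Hypothetical example: goal reached (r_1 = 10) plus one step penalty
# (r_3 = -1), all other terms zero, all weights set to 1.
```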
8. The reinforcement learning-based unmanned aerial vehicle formation path planning method according to claim 7, wherein if the unmanned aerial vehicle arrives at the destination, the unmanned aerial vehicle obtains an environmental reward value r_1, which is defined as follows:
wherein Δd represents the Euclidean distance between the unmanned aerial vehicle and the destination, d_1 is a distance threshold, and Δθ is the difference between the flight angle of the unmanned aerial vehicle and the angle formed by the unmanned aerial vehicle and the destination;
if the unmanned aerial vehicle collides with an obstacle or a boundary during movement, the unmanned aerial vehicle obtains a negative reward r_2, which is defined as follows:
r 2 =-10
each time the unmanned aerial vehicle takes a step, it obtains a negative reward r_3, simulating the energy consumed during the operation of the unmanned aerial vehicle:
r 3 =-1
the optimal flight angle of the unmanned aerial vehicle is as follows:
θ_best = θ_L ± θ_ε
wherein θ_best is the optimal flight angle of the unmanned aerial vehicle, θ_L is the angle of the unmanned aerial vehicle along the tangential direction of the obstacle, and θ_ε is the angle value deviating from the tangent;
for this purpose, a reward r_4 is determined from the current flight angle and the optimal flight angle of the unmanned aerial vehicle, and is defined as follows:
during the obstacle avoidance process of the unmanned aerial vehicle, the reward function r_5 for the degree of proximity to the destination is as follows:
the negative reward r_6 generated by a collision between unmanned aerial vehicles is defined as follows:
the reward function r_7 relating the actual distance between unmanned aerial vehicle i and unmanned aerial vehicle j to the optimal distance is as follows:
wherein d_i(j) is the actual distance between unmanned aerial vehicle i and unmanned aerial vehicle j, and d_opt,ij is the optimal distance between unmanned aerial vehicle i and unmanned aerial vehicle j.
9. The reinforcement learning-based unmanned aerial vehicle formation path planning method of claim 8, wherein the reward value is a quadratic function of d_i(j)/d_opt,ij; when d_i(j)/d_opt,ij equals 1, the quadratic function attains its maximum value of 1, that is, when the actual distance between unmanned aerial vehicle i and unmanned aerial vehicle j equals the optimal distance, the maximum reward value is 1; the further d_i(j)/d_opt,ij is from 1, that is, the further the actual distance between unmanned aerial vehicle i and unmanned aerial vehicle j is from the optimal distance, the smaller the reward value.
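Claim 9 fixes only the shape of this reward (quadratic in the distance ratio, maximum value 1 attained when the ratio is 1); one quadratic satisfying those constraints can be sketched as follows, where the exact coefficients are an assumption since the patent's formula is elided:

```python
def formation_distance_reward(d_actual, d_opt):
    """Quadratic reward in the ratio d_i(j)/d_opt,ij: peaks at 1 when the
    actual inter-UAV distance equals the optimal distance and decreases
    as the ratio moves away from 1 (coefficients are assumed; the claim
    only pins down the maximum and its location)."""
    ratio = d_actual / d_opt
    return 1.0 - (ratio - 1.0) ** 2
```

Any quadratic of the form c·(ratio − 1)² subtracted from 1 with c > 0 would satisfy the stated properties; c = 1 is chosen here for simplicity.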
10. The reinforcement learning-based unmanned aerial vehicle formation path planning method according to claim 1, wherein in the step S4, each unmanned aerial vehicle outputs the action a_t according to the following formula:
a_t = μ(o_t; θ_i)
wherein a_t is the action performed at time t, o_t is the state at time t, μ is the policy function, and θ_i is a parameter of the policy network;
the θ parameters of the Critic network are updated by a gradient descent method, and the φ parameters of the Actor network are updated by a gradient ascent method; the Q value is calculated according to the reward value function and compared with the Q value of the previous round; if the error is within the convergence range, the model has converged and the calculation ends; otherwise, the cyclic calculation continues until the Q value converges.
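The convergence test in step S4 can be sketched as a generic fixed-point loop; `update_step` here stands in for one full round of actor/critic updates and returns the new Q estimate, and the tolerance and iteration cap are hypothetical values:

```python
def train_until_converged(update_step, q0=0.0, tol=1e-3, max_iters=10_000):
    """Repeat updates until successive Q estimates differ by less than
    tol (the 'error within the convergence range' test of step S4).
    Returns the final Q estimate and the number of iterations used."""
    q_prev = q0
    for i in range(max_iters):
        q_new = update_step(q_prev)
        if abs(q_new - q_prev) < tol:
            return q_new, i + 1
        q_prev = q_new
    return q_prev, max_iters
```

For example, a contractive update such as q → 0.5q + 1 converges toward its fixed point 2, and the loop halts once consecutive estimates agree to within the tolerance.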
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310918688.0A CN116774731A (en) | 2023-07-25 | 2023-07-25 | Unmanned aerial vehicle formation path planning method based on reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116774731A true CN116774731A (en) | 2023-09-19 |
Family
ID=87986015
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310918688.0A Pending CN116774731A (en) | 2023-07-25 | 2023-07-25 | Unmanned aerial vehicle formation path planning method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116774731A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117970935A (en) * | 2024-04-02 | 2024-05-03 | 博创联动科技股份有限公司 | Automatic obstacle avoidance method and system for agricultural machinery based on digital village |
CN117970935B (en) * | 2024-04-02 | 2024-06-11 | 博创联动科技股份有限公司 | Automatic obstacle avoidance method and system for agricultural machinery based on digital village |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110333739B (en) | AUV (autonomous Underwater vehicle) behavior planning and action control method based on reinforcement learning | |
Ma et al. | Multi-robot target encirclement control with collision avoidance via deep reinforcement learning | |
CN113900445A (en) | Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning | |
CN111077909B (en) | Novel unmanned aerial vehicle self-group self-consistent optimization control method based on visual information | |
CN113848974B (en) | Aircraft trajectory planning method and system based on deep reinforcement learning | |
CN114330115B (en) | Neural network air combat maneuver decision-making method based on particle swarm search | |
CN116774731A (en) | Unmanned aerial vehicle formation path planning method based on reinforcement learning | |
CN114089776B (en) | Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning | |
CN114003059B (en) | UAV path planning method based on deep reinforcement learning under kinematic constraint condition | |
CN115167447A (en) | Unmanned ship intelligent obstacle avoidance method based on radar image end-to-end depth reinforcement learning | |
CN115688268A (en) | Aircraft near-distance air combat situation assessment adaptive weight design method | |
CN115903865A (en) | Aircraft near-distance air combat maneuver decision implementation method | |
CN114138022B (en) | Unmanned aerial vehicle cluster distributed formation control method based on elite pigeon crowd intelligence | |
CN116796843A (en) | Unmanned aerial vehicle many-to-many chase game method based on PSO-M3DDPG | |
Basile et al. | Ddpg based end-to-end driving enhanced with safe anomaly detection functionality for autonomous vehicles | |
CN116448119A (en) | Unmanned swarm collaborative flight path planning method for sudden threat | |
CN116385909A (en) | Unmanned aerial vehicle target tracking method based on deep reinforcement learning | |
CN113050420B (en) | AUV path tracking method and system based on S-plane control and TD3 | |
CN115718497A (en) | Multi-unmanned-boat collision avoidance decision method | |
CN115373415A (en) | Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning | |
CN116700353A (en) | Unmanned aerial vehicle path planning method based on reinforcement learning | |
CN113093803B (en) | Unmanned aerial vehicle air combat motion control method based on E-SAC algorithm | |
CN116796505B (en) | Air combat maneuver strategy generation method based on example strategy constraint | |
Ma et al. | Trajectory tracking of an underwater glider in current based on deep reinforcement learning | |
CN115097853B (en) | Unmanned aerial vehicle maneuvering flight control method based on fine granularity repetition strategy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||