CN116774731A - Unmanned aerial vehicle formation path planning method based on reinforcement learning - Google Patents

Unmanned aerial vehicle formation path planning method based on reinforcement learning

Info

Publication number
CN116774731A
CN116774731A CN202310918688.0A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
parameters
state
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310918688.0A
Other languages
Chinese (zh)
Inventor
孙伟
易乃欣
唐恒
孙田野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202310918688.0A
Publication of CN116774731A


Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses an unmanned aerial vehicle formation path planning method based on reinforcement learning, which comprises the following steps. Step S1: establishing a kinematic model of the unmanned aerial vehicle according to the kinematic equation and state transition equation of the unmanned aerial vehicle, and updating the motion state of the unmanned aerial vehicle. Step S2: substituting the state parameters of the state space and the motion parameters of the action space obtained in step S1 into an Actor-Critic network model, and updating the parameters of the Actor network and the Critic network according to a multi-agent twin-delayed deep deterministic policy gradient algorithm to obtain the Actor-Critic network parameters. Step S3: substituting the Actor-Critic network parameters obtained in step S2 into the reward function to obtain a reward value. Step S4: iteratively updating the Actor network and the Critic network until the reward value converges, and obtaining the action parameters to be executed according to the state of the unmanned aerial vehicle. The invention can enhance the system's ability to withstand sudden threats, and realizes a stable formation structure and autonomous formation transformation by designing a dynamic formation reward function.

Description

Unmanned aerial vehicle formation path planning method based on reinforcement learning
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle formation path planning, and particularly relates to an unmanned aerial vehicle formation path planning method based on reinforcement learning.
Background
Unmanned aerial vehicles have been widely applied in various industries in recent years. A single unmanned aerial vehicle offers good maneuverability and convenience, but suffers from prominent shortcomings such as limited payload capacity and weak anti-interference capability. It is therefore not suited to complex tasks, and if it fails during task execution, the task is declared a failure. For this reason, entirely new unmanned aerial vehicle formation architectures are gradually being applied to task execution. By exploiting the cluster advantages of unmanned aerial vehicle formation, the formation achieves complementary functions and superimposed capabilities during task execution, which can remarkably improve the task success rate.
In the task execution of unmanned aerial vehicle formation, path planning and formation control are important research topics. Existing formation cooperative control methods mainly adopt a hierarchical control scheme: the problem is divided into an upper, a middle and a lower layer, where the upper layer makes decisions, the middle layer transmits instructions and the lower layer executes tasks, so that the problem is reduced in dimension and the solution space is simplified. Many researchers use bio-inspired methods such as genetic algorithms, simulated annealing and ant colony optimization to solve formation problems. Although these algorithms solve quickly and effectively, they rely on acquiring environmental information in advance and cannot be used in continuously changing dynamic environments. In recent years, multi-agent reinforcement learning (Multi-Agent Reinforcement Learning, MARL) has provided a new idea for formation cooperative control: the unmanned aerial vehicle formation problem is modeled as a Markov decision process (Markov Decision Process, MDP), the unmanned aerial vehicles autonomously learn from experience gained by interacting with the environment, and they adjust their respective behaviors through the rewards and punishments given by the environment, thereby achieving cooperative control.
Liu et al. introduced the leader-follower (lead aircraft) method into formation control and trained an offline reinforcement learning method, thereby realizing formation control and path planning for five unmanned aerial vehicles, but only considered an obstacle-free environment. Pan et al. combined a distributed formation control method with a model-based reinforcement learning method to solve the obstacle avoidance problem of line formations in complex environments, but did not consider other more complex formation structures.
Existing formation control methods are found to have the following problems:
1. Existing methods generally decompose the formation control problem into formation keeping and formation adjustment and then handle the two with different methods, which makes the algorithm overly complex and hinders fast convergence.
2. Existing methods depend heavily on prior knowledge of the environment, consider the environment and the formation in a relatively simplistic way, cannot avoid sudden threats in the environment in time, and have limited generalization capability.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a reinforcement learning-based unmanned aerial vehicle formation path planning method, in which the system's ability to withstand sudden threats is enhanced by introducing dynamic obstacles, and the stability of the formation structure and autonomous formation transformation are realized by designing a dynamic formation reward function.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
An unmanned aerial vehicle formation path planning method based on reinforcement learning comprises the following steps:
step S1: establishing a kinematic model of the unmanned aerial vehicle according to a kinematic equation and a state transition equation of the unmanned aerial vehicle, and updating the motion state of the unmanned aerial vehicle, wherein the motion state of the unmanned aerial vehicle comprises a state space and an action space;
step S2: substituting the state parameters of the state space and the motion parameters of the action space obtained in step S1 into an Actor-Critic network model, and updating the parameters of the Actor network and the Critic network according to a multi-agent twin-delayed deep deterministic policy gradient (MATD3) algorithm to obtain the Actor-Critic network parameters;
step S3: substituting the Actor-Critic network parameters obtained in step S2 into the reward function to obtain a reward value;
step S4: iteratively updating the Actor network and the Critic network until the reward value converges, and obtaining the action parameters to be executed according to the state of the unmanned aerial vehicle.
In the step S1, the state parameters of the state space of the unmanned aerial vehicle include an abscissa, an ordinate, a flight angle and a speed of the unmanned aerial vehicle;
the motion parameters of the motion space of the unmanned aerial vehicle include angular velocity and acceleration of the unmanned aerial vehicle.
In step S1, the motion state of the i-th unmanned aerial vehicle at time t+1 is obtained according to the kinematic equation and state transition equation of the unmanned aerial vehicle; the kinematic equation of the unmanned aerial vehicle in step S1 is as follows:
the state transfer equation of the unmanned aerial vehicle is as follows:
where x_i denotes the abscissa of each unmanned aerial vehicle, y_i its ordinate, ψ_i its flight angle, v_i its speed, ω_i its angular velocity, and a_i its acceleration.
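For illustration, a standard planar kinematic model consistent with the variables listed above is the following; this is an assumed reconstruction, not necessarily the exact form used by the invention:

```latex
% Continuous-time kinematics (assumed standard planar model)
\dot{x}_i = v_i\cos\psi_i, \quad \dot{y}_i = v_i\sin\psi_i, \quad
\dot{\psi}_i = \omega_i, \quad \dot{v}_i = a_i
% Discrete-time state transition over a time step \Delta t
x_i^{t+1} = x_i^{t} + v_i^{t}\cos\psi_i^{t}\,\Delta t, \quad
y_i^{t+1} = y_i^{t} + v_i^{t}\sin\psi_i^{t}\,\Delta t, \quad
\psi_i^{t+1} = \psi_i^{t} + \omega_i^{t}\,\Delta t, \quad
v_i^{t+1} = v_i^{t} + a_i^{t}\,\Delta t
```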
In step S2, the multi-agent twin-delayed deep deterministic policy gradient algorithm is adopted to update the parameters: the Actor network takes the unmanned aerial vehicle's own state information as input and outputs the action executed by the unmanned aerial vehicle, while the Critic network takes the states and actions of all unmanned aerial vehicles as input and outputs an evaluation Q value; the Critic network has 1 input layer, 3 hidden layers and 1 output layer.
The input layer takes the state information of the unmanned aerial vehicles and the currently executed actions as input; the first hidden layer is a fully connected layer with 32 neurons and a ReLU activation function; the second hidden layer is a fully connected layer with 32 neurons and a ReLU activation function; the third hidden layer is a fully connected layer with 1 neuron; the output layer is a fully connected layer with 1 neuron that outputs the corresponding Q(s, a) as the action evaluation (a sketch of this architecture is given after the update formulas below). The data of every state transition of each unmanned aerial vehicle is stored in a replay buffer; at each update, a batch of sampled data is fed into the network to update the parameters, and after the network training converges, the policy set of each unmanned aerial vehicle is output.

In step S2, the multi-agent twin-delayed deep deterministic policy gradient algorithm adopts a centralized-training, distributed-execution architecture;
the Actor network takes the unmanned aerial vehicle's own state information as input, which comprises the current position coordinates and state parameters of the unmanned aerial vehicle, outputs the action executed by the unmanned aerial vehicle, and updates its parameters by gradient ascent, where the gradient calculation formula is as follows:
where θ_i represents the network parameters, L represents the loss function, N represents the data size, Q represents the Q value of the unmanned aerial vehicle, o represents the state of each unmanned aerial vehicle, a represents the action of the unmanned aerial vehicle, w_i,j represents the Q network parameters, and μ represents the action policy;
the Critic network takes the states and actions of all unmanned aerial vehicles as input, which comprise the current position coordinates, state parameters and executed action parameters of all unmanned aerial vehicles, outputs an evaluation Q value, and updates its parameters by gradient descent, where the loss function calculation formula is as follows:
where w_i,j represents the network parameters, L represents the loss function, N represents the data size, Q represents the Q value of the unmanned aerial vehicle, o represents the states of the unmanned aerial vehicles, and a represents the actions of the unmanned aerial vehicle.
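As a concrete illustration, a minimal sketch of the Critic described above (1 input layer, hidden layers of 32, 32 and 1 neurons, and a 1-neuron output producing Q(s, a)) is given below. The per-UAV state dimension of 4 and action dimension of 2 follow the state and action spaces of step S1; the class and argument names are illustrative assumptions.

```python
# Minimal sketch of the Critic network described above; not the patented implementation.
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, n_uav: int, state_dim: int = 4, action_dim: int = 2):
        super().__init__()
        in_dim = n_uav * (state_dim + action_dim)  # states and actions of all UAVs
        self.net = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(),   # first hidden layer, 32 neurons, ReLU
            nn.Linear(32, 32), nn.ReLU(),       # second hidden layer, 32 neurons, ReLU
            nn.Linear(32, 1),                   # third hidden layer, 1 neuron
            nn.Linear(1, 1),                    # output layer, 1 neuron: Q(s, a)
        )

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # joint states and actions are concatenated, as in centralized training
        return self.net(torch.cat([states, actions], dim=-1))
```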
The reward function of each unmanned aerial vehicle in step S3 is as follows:
R_i = α_1 r_1 + α_2 r_2 + α_3 r_3 + α_4 r_4 + α_5 r_5 + α_6 r_6 + α_7 r_7
where α_1, α_2, α_3, α_4, α_5, α_6, α_7 are weighting coefficients; r_1, r_2 and r_6 are sparse rewards, and the unmanned aerial vehicle ends the current training round when a sparse reward is triggered; r_3, r_4, r_5 and r_7 are guiding rewards, and a guiding reward is given at every state transition;
if the unmanned aerial vehicle arrives at the destination, obtaining the awarded value r given by the environment 1 It is defined as follows:
where Δd represents the Euclidean distance between the unmanned aerial vehicle and the destination, d_1 is a distance threshold, and Δθ is the difference between the flight angle of the unmanned aerial vehicle and the bearing angle from the unmanned aerial vehicle to the destination;
if the unmanned aerial vehicle collides with an obstacle or a boundary in the movement process, the unmanned aerial vehicle obtains negative rewards r 2 It is defined as follows:
r_2 = -10
each time the unmanned plane walks a step, a negative prize r is obtained 3 Simulating energy consumed in the running process of the unmanned aerial vehicle:
r_3 = -1
for a composite obstacle environment, because the obstacle needs to be avoided, the movement track of the unmanned aerial vehicle is generally not a straight line, namely, when the obstacle exists between the unmanned aerial vehicle and the destination connecting line, a certain included angle is needed between the flight direction of the unmanned aerial vehicle and the connecting line of the unmanned aerial vehicle and the destination, and the optimal flight angle of the unmanned aerial vehicle is determined by classifying and discussing the relative position relation of the unmanned aerial vehicle, the obstacle and the destination as follows:
θ_best = θ_L ± θ_ε
where θ_best is the optimal flight angle of the unmanned aerial vehicle, θ_L is the angle of the unmanned aerial vehicle along the tangential direction of the obstacle, and θ_ε is an angular offset from the tangent;
for this purpose,determining a reward r according to the current flight angle and the optimal flight angle of the unmanned aerial vehicle 4 Is defined as follows:
in the unmanned plane obstacle avoidance process, the unmanned plane may be temporarily far away from the end point, so that in order to accelerate the algorithm convergence speed, the situation can be tolerated, and a reward function for the distance degree from the destination is proposed:
negative rewards r generated by collision between unmanned aerial vehicles 6 The definition is as follows:
the stable formation structure means that the distance between every two unmanned aerial vehicles is kept stable, and a reward function is designed according to the optimal distance between every two unmanned aerial vehicles and the current distance, so that formation can be kept stable when no obstacle exists, and formation fine adjustment is performed when the obstacle is encountered; deep analysis means that the distance between every two unmanned aerial vehicles is stabilized around a reasonable value; for this purpose, a bonus function is set for the actual distance between unmanned plane i and unmanned plane j and the optimal distance as follows:
where d_i(j) is the actual distance between unmanned aerial vehicle i and unmanned aerial vehicle j, and d_opt,ij is the optimal distance between unmanned aerial vehicle i and unmanned aerial vehicle j. The reward value is a quadratic function of d_i(j)/d_opt,ij: when d_i(j)/d_opt,ij equals 1, the quadratic function attains its maximum value of 1, that is, the reward value is maximal (equal to 1) when the actual distance between unmanned aerial vehicle i and unmanned aerial vehicle j equals the optimal distance; the farther d_i(j)/d_opt,ij deviates from 1, that is, the farther the actual distance deviates from the optimal distance, the smaller the reward value, which matches the original purpose of this reward function.

In step S4, each unmanned aerial vehicle outputs its action a_t according to:
a_t = μ(o_t; θ_i)
where a_t is the action performed at time t, o_t is the state at time t, μ is the policy function, and θ_i are the parameters of the policy network;
The gradient of the Critic network parameters θ is used to update θ by gradient descent, and the parameters φ of the Actor network are updated by gradient ascent; the Q value is computed from the reward value function and compared with the Q value of the previous round, and if the difference lies within the convergence range the model has converged and the computation ends, otherwise the cyclic computation continues until the Q value converges.
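A minimal sketch of how steps S1-S4 fit together as a training loop is given below; the environment interface, agent objects and hyperparameters (number of episodes, batch size, convergence threshold) are illustrative assumptions rather than values specified by the invention.

```python
# Sketch of the outer training loop: networks are updated repeatedly and training
# stops once the per-episode reward has converged. All names are placeholders.
def train(env, agents, buffer, episodes=5000, batch_size=256, eps=1e-3):
    prev_reward = None
    for ep in range(episodes):
        obs = env.reset()
        ep_reward, done = 0.0, False
        while not done:
            # each UAV's Actor selects an action from its own observation (step S4)
            actions = [ag.act(o) for ag, o in zip(agents, obs)]
            next_obs, rewards, done, _ = env.step(actions)        # step S1 update
            buffer.add(obs, actions, rewards, next_obs, done)
            for ag in agents:                                      # step S2 update
                ag.update(buffer.sample(batch_size), agents)
            obs, ep_reward = next_obs, ep_reward + sum(rewards)    # step S3 reward
        # step S4: stop once the episode reward no longer changes significantly
        if prev_reward is not None and abs(ep_reward - prev_reward) < eps:
            break
        prev_reward = ep_reward
```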
The invention has the beneficial effects that:
the invention has the capability of coping with sudden threats in the environment. Meanwhile, the dynamic formation reward function is designed aiming at the problems of formation structural stability and formation transformation autonomy, so that the cooperative capacity of formation is improved.
The invention adopts dynamic barrier environment, thus enhancing the capability of the system for resisting sudden threat.
The invention expands the sparsity rewarding function, thereby solving the problem of internal collision prevention of formation.
The dynamic formation reward function is adopted, so that the stability of the formation structure and the formation transformation autonomy are realized at the same time.
Drawings
Fig. 1 is a flow diagram of a method for unmanned aerial vehicle formation path planning based on reinforcement learning in one embodiment.
FIG. 2 is a block diagram of the multi-agent twin-delayed deep deterministic policy gradient (MATD3) algorithm in one embodiment.
FIG. 3 is a schematic diagram of a dynamic formation rewards function.
Fig. 4 is a graph of success rates for three schemes according to the present invention.
FIG. 5 is a per round jackpot graph for three scenarios in accordance with the present invention.
Fig. 6 is a schematic diagram of formation deformation ratios for three schemes.
Fig. 7 is a schematic diagram of an embodiment of unmanned aerial vehicle formation path planning.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
The invention proposes a dynamic formation reward function and obtains a better optimization effect. As shown in FIGS. 1-5, in a specific embodiment, the invention discloses an unmanned aerial vehicle formation path planning method based on reinforcement learning, which comprises the following steps:
Step S1: a kinematic model of the unmanned aerial vehicle is established according to the kinematic equation and state transition equation of the unmanned aerial vehicle, and the motion state of the unmanned aerial vehicle is updated, wherein the state space of the unmanned aerial vehicle comprises the abscissa, ordinate, flight angle and speed of the unmanned aerial vehicle, and the action space of the unmanned aerial vehicle comprises the angular velocity and acceleration of the unmanned aerial vehicle.
The kinematic equation is as follows:
the state transfer equation of the unmanned aerial vehicle is as follows:
where x_i, y_i, ψ_i, v_i, ω_i and a_i are respectively the abscissa, ordinate, flight angle, speed, angular velocity and acceleration of each unmanned aerial vehicle.
Step S2: the unmanned aerial vehicle state parameters and motion parameters obtained from the kinematic model in step S1 are substituted into an Actor-Critic network model, and the parameters of the Actor network and the Critic network are updated according to the multi-agent twin-delayed deep deterministic policy gradient (MATD3) algorithm. As shown in FIG. 2, the algorithm comprises 1 current Actor network, 1 target Actor network, 2 current Critic networks and 2 target Critic networks; to avoid overestimation of the Q value, the smaller of the two Critic networks' values is selected as the target Q value. The Actor network takes the unmanned aerial vehicle's own state information as input, outputs the action executed by the unmanned aerial vehicle, and updates its parameters by gradient ascent; the gradient calculation formula of the Actor network is as follows:
where θ_i represents the network parameters, L represents the loss function, N represents the data size, Q represents the Q value of the unmanned aerial vehicle, o represents the state of each unmanned aerial vehicle, a represents the action of the unmanned aerial vehicle, w_i,j represents the Q network parameters, and μ represents the action policy.
The Critic network takes the states and actions of all unmanned aerial vehicles as input, outputs an evaluation Q value, and updates its parameters by gradient descent; the loss function calculation formula is as follows:
where w_i,j represents the network parameters, L represents the loss function, N represents the data size, Q represents the Q value of the unmanned aerial vehicle, o represents the states of the unmanned aerial vehicles, and a represents the actions of the unmanned aerial vehicle.
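A minimal sketch of the target-Q computation using the smaller of the two target Critic outputs is given below; the discount factor gamma and the use of the target Actor networks to produce the next joint action follow common TD3 practice and are assumptions, since the exact target formula is not reproduced here.

```python
# Sketch of the clipped double-Q target: the minimum of the two target Critic
# values curbs Q-value overestimation. All names and gamma are illustrative.
import torch

def target_q(rewards, next_states, next_actions, done,
             critic_target_1, critic_target_2, gamma=0.99):
    """next_states / next_actions: joint states and actions of all UAVs,
    produced by the target Actor networks (centralized training)."""
    with torch.no_grad():
        q_min = torch.min(critic_target_1(next_states, next_actions),
                          critic_target_2(next_states, next_actions))
        return rewards + gamma * (1.0 - done) * q_min
```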
Step S3: the Actor-Critic network parameters obtained in step S2 are substituted into the reward function to obtain a reward value, where the reward function of each unmanned aerial vehicle is as follows:
R_i = α_1 r_1 + α_2 r_2 + α_3 r_3 + α_4 r_4 + α_5 r_5 + α_6 r_6 + α_7 r_7
where α_1, α_2, α_3, α_4, α_5, α_6, α_7 are weighting coefficients; r_1, r_2 and r_6 are sparse rewards, and the unmanned aerial vehicle ends the current training round when a sparse reward is triggered; r_3, r_4, r_5 and r_7 are guiding rewards, and a guiding reward is given at every state transition.
If the unmanned aerial vehicle reaches the destination, it obtains the reward value r_1 given by the environment, which is defined as follows:
where Δd represents the Euclidean distance between the unmanned aerial vehicle and the destination, d_1 is a distance threshold, and Δθ is the difference between the flight angle of the unmanned aerial vehicle and the bearing angle from the unmanned aerial vehicle to the destination.
If the unmanned aerial vehicle collides with an obstacle or a boundary during its motion, it obtains a negative reward r_2, defined as follows:
r_2 = -10
each time the unmanned plane walks a step, a negative prize r is obtained 3 And simulating the energy consumed in the running process of the unmanned aerial vehicle.
r_3 = -1
In a complex obstacle environment, obstacles must be avoided, so the trajectory of the unmanned aerial vehicle is generally not a straight line; that is, when an obstacle lies on the line between the unmanned aerial vehicle and the destination, the flight direction of the unmanned aerial vehicle must form a certain angle with that line. Through a case analysis of the relative positions of the unmanned aerial vehicle, the obstacle and the destination, the optimal flight angle of the unmanned aerial vehicle is determined as follows:
θ_best = θ_L ± θ_ε
where θ_best is the optimal flight angle of the unmanned aerial vehicle, θ_L is the angle of the unmanned aerial vehicle along the tangential direction of the obstacle, and θ_ε is an angular offset from the tangent.
Accordingly, the reward r_4 determined from the current flight angle and the optimal flight angle of the unmanned aerial vehicle is defined as follows:
in the unmanned plane obstacle avoidance process, the unmanned plane may be temporarily far away from the end point, so that in order to accelerate the algorithm convergence speed, the situation can be tolerated, and a reward function for the distance degree from the destination is proposed:
negative rewards r generated by collision between unmanned aerial vehicles 6 The definition is as follows:
in order to ensure a stable formation structure, the deep analysis of the stable formation structure means that the distance between every two unmanned aerial vehicles is kept stable, and for this purpose, a reward function is designed according to the optimal distance between every two unmanned aerial vehicles and the current distance, so that formation can be ensured to be kept stable when no obstacle is present, and formation fine adjustment is performed when the obstacle is encountered. Deep analysis means that the distance between every two unmanned aerial vehicles is stabilized around a reasonable value. For this purpose, a bonus function is set for the actual distance between unmanned plane i and unmanned plane j and the optimal distance as follows:
where d_i(j) is the actual distance between unmanned aerial vehicle i and unmanned aerial vehicle j, and d_opt,ij is the optimal distance between unmanned aerial vehicle i and unmanned aerial vehicle j.
As shown in FIG. 3, the reward value is a quadratic function of d_i(j)/d_opt,ij: when d_i(j)/d_opt,ij equals 1, the quadratic function attains its maximum value of 1, that is, the reward value is maximal (equal to 1) when the actual distance between unmanned aerial vehicle i and unmanned aerial vehicle j equals the optimal distance; the farther d_i(j)/d_opt,ij deviates from 1, that is, the farther the actual distance of unmanned aerial vehicle i from unmanned aerial vehicle j deviates from the optimal distance, the smaller the reward value, which matches the original purpose of designing this reward function.
The dynamic formation reward of unmanned aerial vehicle i considers the r_d,ij between unmanned aerial vehicle i and the other unmanned aerial vehicles in the formation, and is composed as follows:
where i and j are unmanned aerial vehicle indices, and r_d,ij is the distance reward function between unmanned aerial vehicle i and unmanned aerial vehicle j.
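For illustration, the following sketch assembles the composite reward R_i from the components above. Only r_2 = -10 and r_3 = -1 are stated numerically; the quadratic pairwise-distance term is one function with the described shape (maximum value 1 when the actual distance equals the optimal distance) and not necessarily the exact expression of the invention, and the weights α_k and the remaining components r_1, r_4, r_5 are placeholders.

```python
import math

# Sketch of the composite reward R_i = sum_k alpha_k * r_k; all names are illustrative.
def pair_distance_reward(d_ij: float, d_opt_ij: float) -> float:
    # quadratic in d_ij / d_opt_ij with maximum value 1 at ratio 1 (assumed form)
    ratio = d_ij / d_opt_ij
    return 1.0 - (ratio - 1.0) ** 2

def formation_reward(i: int, positions: list, d_opt: dict) -> float:
    # r_7: aggregates the pairwise distance rewards between UAV i and every other UAV;
    # d_opt[(i, j)] is the optimal distance for that pair (averaging is an assumption)
    others = [j for j in range(len(positions)) if j != i]
    terms = [pair_distance_reward(math.dist(positions[i], positions[j]), d_opt[(i, j)])
             for j in others]
    return sum(terms) / len(terms)

def total_reward(r: dict, alpha: dict) -> float:
    # R_i = alpha_1*r_1 + alpha_2*r_2 + ... + alpha_7*r_7
    return sum(alpha[k] * r[k] for k in range(1, 8))
```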
Step S4: after the training of the Actor network and the Critic network converges, each unmanned aerial vehicle outputs its action a_t according to the following formula:
a_t = μ(o_t; θ_i)
where a_t is the action performed at time t, o_t is the state at time t, μ is the policy function, and θ_i are the parameters of the policy network.
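Since execution is decentralized, each unmanned aerial vehicle only needs its own observation and its trained Actor network at run time; a minimal sketch, assuming a list of trained Actor networks, is:

```python
# Sketch of decentralized execution: a_t = mu(o_t; theta_i) from each UAV's own observation.
import torch

def select_actions(actors, observations):
    actions = []
    with torch.no_grad():
        for actor, obs in zip(actors, observations):
            a_t = actor(torch.as_tensor(obs, dtype=torch.float32))
            actions.append(a_t.numpy())      # assumed order: [angular velocity, acceleration]
    return actions
```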
FIG. 4 shows the success rate curves of the three schemes (the MADDPG algorithm and the MATD3 algorithm with the reward function before improvement, and the MATD3 algorithm with the proposed dynamic formation reward function); it can clearly be seen that the dynamic formation reward function proposed by the invention has a great advantage in improving the success rate.
FIG. 5 shows the cumulative reward per round for the three schemes; it can clearly be seen that the dynamic formation reward function proposed by the invention has a great advantage in improving the convergence speed of the algorithm.
FIG. 5a is a plot of the cumulative reward per round of the MADDPG algorithm;
FIG. 5b is a plot of the cumulative reward per round of the MATD3 algorithm;
FIG. 5c is a plot of the cumulative reward per round of the MATD3-IDFRF algorithm.
as fig. 6 shows the formation deformation rate curves of the three schemes, it is obvious that the dynamic formation rewards help the unmanned aerial vehicle form a stable formation structure and perform formation transformation timely, so that the lower formation deformation rate is maintained in the whole process, and the strong advantage is shown.
As shown in FIG. 7, the unmanned aerial vehicle formation path planning problem is described as follows: within a bounded area, several unmanned aerial vehicles start from their respective starting points, form a specific formation, avoid a number of obstacles, and finally reach the end point. Meanwhile, during the formation's motion, the original formation should be changed as little as possible when an obstacle cannot be avoided with the original formation. In the figure, the three blue solid lines represent the flight routes of the three unmanned aerial vehicles, and the triangle formed by the red solid points and red dotted lines represents the formation formed by the unmanned aerial vehicles.
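For illustration, the scenario of FIG. 7 can be captured by a small configuration object such as the following; every concrete number (area size, start points, obstacle positions and radii, desired spacing) is a placeholder and is not taken from the patent:

```python
from dataclasses import dataclass, field

# Illustrative configuration: three UAVs form a triangle formation in a bounded
# area, avoid circular obstacles, and reach a common goal. Values are placeholders.
@dataclass
class FormationScenario:
    area: tuple = (100.0, 100.0)                         # bounded planar area
    starts: list = field(default_factory=lambda: [(5.0, 5.0), (5.0, 15.0), (15.0, 5.0)])
    goal: tuple = (90.0, 90.0)
    obstacles: list = field(default_factory=lambda: [((40.0, 50.0), 8.0),   # (center, radius)
                                                     ((65.0, 30.0), 6.0)])
    formation_edge: float = 10.0                         # desired pairwise distance d_opt
```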

Claims (10)

1. An unmanned aerial vehicle formation path planning method based on reinforcement learning, characterized by comprising the following steps:
step S1: establishing a kinematic model of the unmanned aerial vehicle according to a kinematic equation and a state transition equation of the unmanned aerial vehicle, and updating the motion state of the unmanned aerial vehicle, wherein the motion state of the unmanned aerial vehicle comprises a state space and an action space;
step S2: substituting the state parameters of the state space and the motion parameters of the action space obtained in step S1 into an Actor-Critic network model, and updating the parameters of the Actor network and the Critic network according to a multi-agent twin-delayed deep deterministic policy gradient algorithm to obtain the Actor-Critic network parameters;
step S3: substituting the Actor-Critic network parameters obtained in step S2 into the reward function to obtain a reward value;
step S4: iteratively updating the Actor network and the Critic network until the reward value converges, and obtaining the action parameters to be executed according to the state of the unmanned aerial vehicle.
2. The reinforcement learning-based unmanned aerial vehicle formation path planning method according to claim 1, wherein in S1, the state parameters of the unmanned aerial vehicle state space include an abscissa, an ordinate, a flight angle and a speed of the unmanned aerial vehicle;
the motion parameters of the motion space of the unmanned aerial vehicle include angular velocity and acceleration of the unmanned aerial vehicle.
3. The unmanned aerial vehicle formation path planning method based on reinforcement learning according to claim 1, wherein in step S1, the motion state of the i-th unmanned aerial vehicle at time t+1 is obtained according to the kinematic equation and state transition equation of the unmanned aerial vehicle; the kinematic equation of the unmanned aerial vehicle in step S1 is as follows:
the state transfer equation of the unmanned aerial vehicle is as follows:
where x_i denotes the abscissa of each unmanned aerial vehicle, y_i its ordinate, ψ_i its flight angle, v_i its speed, ω_i its angular velocity, and a_i its acceleration.
4. The unmanned aerial vehicle formation path planning method based on reinforcement learning according to claim 1, wherein in step S2, the multi-agent twin-delayed deep deterministic policy gradient algorithm is adopted to update the parameters, the Actor network takes the unmanned aerial vehicle's own state information as input and outputs the action executed by the unmanned aerial vehicle, the Critic network takes the states and actions of all unmanned aerial vehicles as input and outputs an evaluation Q value, and the Critic network has 1 input layer, 3 hidden layers and 1 output layer.
5. The reinforcement learning-based unmanned aerial vehicle formation path planning method according to claim 4, wherein the input layer takes the state information of the unmanned aerial vehicles and the currently executed actions as input; the first hidden layer is a fully connected layer with 32 neurons and a ReLU activation function; the second hidden layer is a fully connected layer with 32 neurons and a ReLU activation function; the third hidden layer is a fully connected layer with 1 neuron; the output layer is a fully connected layer with 1 neuron that outputs the corresponding Q(s, a) as the action evaluation; the data of every state transition of each unmanned aerial vehicle is stored in a replay buffer, a batch of sampled data is fed into the network at each parameter update, and the policy set of each unmanned aerial vehicle is output after the network training converges.
6. The reinforcement learning-based unmanned aerial vehicle formation path planning method according to claim 4, wherein in step S2, the multi-agent twin-delayed deep deterministic policy gradient algorithm adopts a centralized-training, distributed-execution architecture;
the Actor network takes the unmanned aerial vehicle's own state information as input, which comprises the current position coordinates and state parameters of the unmanned aerial vehicle, outputs the action executed by the unmanned aerial vehicle, and updates its parameters by gradient ascent, where the gradient calculation formula is as follows:
where θ_i represents the network parameters, L represents the loss function, N represents the data size, Q represents the Q value of the unmanned aerial vehicle, o represents the state of each unmanned aerial vehicle, a represents the action of the unmanned aerial vehicle, w_i,j represents the Q network parameters, and μ represents the action policy;
the Critic network takes the states and actions of all unmanned aerial vehicles as input, which comprise the current position coordinates, state parameters and executed action parameters of all unmanned aerial vehicles, outputs an evaluation Q value, and updates its parameters by gradient descent, where the loss function calculation formula is as follows:
where w_i,j represents the network parameters, L represents the loss function, N represents the data size, Q represents the Q value of the unmanned aerial vehicle, o represents the states of the unmanned aerial vehicles, and a represents the actions of the unmanned aerial vehicle.
7. The reinforcement learning-based unmanned aerial vehicle formation path planning method according to claim 1, wherein the reward function of each unmanned aerial vehicle in step S3 is as follows:
R_i = α_1 r_1 + α_2 r_2 + α_3 r_3 + α_4 r_4 + α_5 r_5 + α_6 r_6 + α_7 r_7
where α_1, α_2, α_3, α_4, α_5, α_6, α_7 are weighting coefficients; r_1, r_2 and r_6 are sparse rewards, and the unmanned aerial vehicle ends the current training round when a sparse reward is triggered; r_3, r_4, r_5 and r_7 are guiding rewards, and a guiding reward is given at every state transition.
8. The reinforcement learning-based unmanned aerial vehicle formation path planning method according to claim 7, wherein if the unmanned aerial vehicle reaches the destination, it obtains the reward value r_1 given by the environment, which is defined as follows:
where Δd represents the Euclidean distance between the unmanned aerial vehicle and the destination, d_1 is a distance threshold, and Δθ is the difference between the flight angle of the unmanned aerial vehicle and the bearing angle from the unmanned aerial vehicle to the destination;
if the unmanned aerial vehicle collides with an obstacle or a boundary during its motion, it obtains a negative reward r_2, defined as follows:
r_2 = -10
each time the unmanned aerial vehicle takes a step, it obtains a negative reward r_3, which models the energy consumed during flight:
r_3 = -1
the optimal flight angle of the unmanned aerial vehicle is as follows:
θ_best = θ_L ± θ_ε
where θ_best is the optimal flight angle of the unmanned aerial vehicle, θ_L is the angle of the unmanned aerial vehicle along the tangential direction of the obstacle, and θ_ε is an angular offset from the tangent;
accordingly, the reward r_4 determined from the current flight angle and the optimal flight angle of the unmanned aerial vehicle is defined as follows:
during obstacle avoidance of the unmanned aerial vehicle, the reward function r_5 for the degree of distance from the destination is as follows:
the negative reward r_6 generated by a collision between unmanned aerial vehicles is defined as follows:
the reward function for the actual distance between drone i and drone j versus the optimal distance is as follows:
wherein d i (j) Is the actual distance d between unmanned aerial vehicle i and unmanned aerial vehicle j opt,ij Is the optimal distance between unmanned aerial vehicle i and unmanned aerial vehicle j.
9. The reinforcement learning-based unmanned aerial vehicle formation path planning method of claim 8, wherein the reward value is a quadratic function of d_i(j)/d_opt,ij: when d_i(j)/d_opt,ij equals 1, the quadratic function attains its maximum value of 1, that is, the reward value is maximal (equal to 1) when the actual distance between unmanned aerial vehicle i and unmanned aerial vehicle j equals the optimal distance; the farther d_i(j)/d_opt,ij deviates from 1, that is, the farther the actual distance of unmanned aerial vehicle i from unmanned aerial vehicle j deviates from the optimal distance, the smaller the reward value.
10. The reinforcement learning-based unmanned aerial vehicle formation path planning method according to claim 1, wherein in step S4, each unmanned aerial vehicle outputs its action a_t according to the following formula:
a_t = μ(o_t; θ_i)
where a_t is the action performed at time t, o_t is the state at time t, μ is the policy function, and θ_i are the parameters of the policy network;
the gradient of the Critic network parameters θ is used to update θ by gradient descent, and the parameters φ of the Actor network are updated by gradient ascent; the Q value is computed from the reward value function and compared with the Q value of the previous round, and if the error lies within the convergence range the model has converged and the computation ends, otherwise the cyclic computation continues until the Q value converges.
CN202310918688.0A 2023-07-25 2023-07-25 Unmanned aerial vehicle formation path planning method based on reinforcement learning Pending CN116774731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310918688.0A CN116774731A (en) 2023-07-25 2023-07-25 Unmanned aerial vehicle formation path planning method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310918688.0A CN116774731A (en) 2023-07-25 2023-07-25 Unmanned aerial vehicle formation path planning method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN116774731A true CN116774731A (en) 2023-09-19

Family

ID=87986015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310918688.0A Pending CN116774731A (en) 2023-07-25 2023-07-25 Unmanned aerial vehicle formation path planning method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN116774731A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117970935A (en) * 2024-04-02 2024-05-03 博创联动科技股份有限公司 Automatic obstacle avoidance method and system for agricultural machinery based on digital village
CN117970935B (en) * 2024-04-02 2024-06-11 博创联动科技股份有限公司 Automatic obstacle avoidance method and system for agricultural machinery based on digital village

Similar Documents

Publication Publication Date Title
CN110333739B (en) AUV (autonomous Underwater vehicle) behavior planning and action control method based on reinforcement learning
Ma et al. Multi-robot target encirclement control with collision avoidance via deep reinforcement learning
CN113900445A (en) Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning
CN111077909B (en) Novel unmanned aerial vehicle self-group self-consistent optimization control method based on visual information
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN114330115B (en) Neural network air combat maneuver decision-making method based on particle swarm search
CN116774731A (en) Unmanned aerial vehicle formation path planning method based on reinforcement learning
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN114003059B (en) UAV path planning method based on deep reinforcement learning under kinematic constraint condition
CN115167447A (en) Unmanned ship intelligent obstacle avoidance method based on radar image end-to-end depth reinforcement learning
CN115688268A (en) Aircraft near-distance air combat situation assessment adaptive weight design method
CN115903865A (en) Aircraft near-distance air combat maneuver decision implementation method
CN114138022B (en) Unmanned aerial vehicle cluster distributed formation control method based on elite pigeon crowd intelligence
CN116796843A (en) Unmanned aerial vehicle many-to-many chase game method based on PSO-M3DDPG
Basile et al. Ddpg based end-to-end driving enhanced with safe anomaly detection functionality for autonomous vehicles
CN116448119A (en) Unmanned swarm collaborative flight path planning method for sudden threat
CN116385909A (en) Unmanned aerial vehicle target tracking method based on deep reinforcement learning
CN113050420B (en) AUV path tracking method and system based on S-plane control and TD3
CN115718497A (en) Multi-unmanned-boat collision avoidance decision method
CN115373415A (en) Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN116700353A (en) Unmanned aerial vehicle path planning method based on reinforcement learning
CN113093803B (en) Unmanned aerial vehicle air combat motion control method based on E-SAC algorithm
CN116796505B (en) Air combat maneuver strategy generation method based on example strategy constraint
Ma et al. Trajectory tracking of an underwater glider in current based on deep reinforcement learning
CN115097853B (en) Unmanned aerial vehicle maneuvering flight control method based on fine granularity repetition strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination