CN115185288B - Unmanned aerial vehicle layered flight decision method based on SAC algorithm - Google Patents

Unmanned aerial vehicle layered flight decision method based on SAC algorithm

Info

Publication number
CN115185288B
CN115185288B (application CN202210594910.1A)
Authority
CN
China
Prior art keywords
aerial vehicle
unmanned aerial
flight
decision
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210594910.1A
Other languages
Chinese (zh)
Other versions
CN115185288A (en
Inventor
李波
白双霞
甘志刚
康培棋
杨慧林
万开方
高晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202210594910.1A priority Critical patent/CN115185288B/en
Publication of CN115185288A publication Critical patent/CN115185288A/en
Application granted granted Critical
Publication of CN115185288B publication Critical patent/CN115185288B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08 Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0808 Control of attitude, i.e. control of roll, pitch, or yaw specially adapted for aircraft
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention provides an unmanned aerial vehicle layered flight decision method based on the SAC algorithm. The method first constructs an unmanned aerial vehicle flight control model, then constructs a state space, a layered decision action space and a reward function according to a Markov decision process; next, it constructs an unmanned aerial vehicle layered flight decision model structure based on the SAC algorithm; finally, it defines the model parameters, initializes the unmanned aerial vehicle state and trains the model, then initializes the unmanned aerial vehicle state again, tests the layered flight decision model and evaluates the flight decision performance. The invention adopts a layered decision model, which reduces the difficulty of algorithm training and improves the decision performance of the model; it enables the unmanned aerial vehicle to make decisions autonomously and to explore the optimal flight strategy efficiently.

Description

Unmanned aerial vehicle layered flight decision method based on SAC algorithm
Technical Field
The invention relates to the technical field of unmanned aerial vehicle autonomous decision making, in particular to an unmanned aerial vehicle layered flight decision making method based on a SAC algorithm.
Background
Unmanned aerial vehicles, with their high maneuverability and many degrees of freedom, are becoming an important component of the future artificial intelligence field. Flight decision making in complex environments is a key focus of future unmanned aerial vehicle research: the unmanned aerial vehicle is required to achieve accurate reconnaissance and perception through autonomous control technology and to complete relatively complex autonomous decision making and planning in various scenarios. The unmanned aerial vehicle needs to make flight decisions using the image information, position information, attitude information and other data acquired by its sensors. When the surrounding environment changes, the unmanned aerial vehicle needs to identify obstacles, avoid external risks and continue to complete the flight task.
Existing unmanned aerial vehicle flight decision methods are mainly divided into flight decision methods based on traditional algorithms and flight decision methods based on intelligent algorithms. Model-based methods rely heavily on modeling the unmanned aerial vehicle flight process, often require a large amount of measurement and accurate modeling, and their modeling errors are difficult to compensate. If the unmanned aerial vehicle enters an unfamiliar environment, the modeling work needs to be redone, which makes this type of algorithm poorly adaptive to the environment. Current research on unmanned aerial vehicle flight decision making based on intelligent algorithms mostly adopts methods such as genetic algorithms and deep reinforcement learning. These algorithms depend little on modeling of the flight process, and the unmanned aerial vehicle can accomplish the flight task through continuous interaction with the environment.
However, most existing unmanned aerial vehicle decision methods based on deep reinforcement learning adopt deterministic policy training, which easily causes the decision policy to fall into a local optimum and fail to acquire the optimal strategy. Meanwhile, existing methods realize flight decisions by directly controlling the rotor speeds of the unmanned aerial vehicle, which greatly increases the difficulty of training and decision making.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an unmanned aerial vehicle layered flight decision method based on the SAC algorithm. First, an unmanned aerial vehicle flight control model is constructed so as to acquire the position and attitude information of the unmanned aerial vehicle in real time; then a state space, a layered decision action space and a reward function are constructed according to the Markov decision process; next, an unmanned aerial vehicle layered flight decision model structure based on the SAC algorithm is constructed; finally, the model parameters are defined, the unmanned aerial vehicle state is initialized and the model is trained, after which the unmanned aerial vehicle state is initialized again, the layered flight decision model is tested and the flight decision performance is evaluated. The invention adopts a layered decision model, which reduces the difficulty of algorithm training, improves the decision performance of the model, effectively enables the unmanned aerial vehicle to make autonomous decisions, and can explore the optimal strategy efficiently.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
Step S1: constructing unmanned aerial vehicle flight control model
In order to solve for the position and attitude information of the unmanned aerial vehicle in real time, an unmanned aerial vehicle flight control rigid body model is constructed, which comprises an unmanned aerial vehicle kinematics model and an unmanned aerial vehicle dynamics model;
step S2: constructing a state space, a layered decision action space and a reward function of unmanned aerial vehicle flight decisions according to a Markov decision process;
(1) State space design
The state space consists of two parts, namely the environment information acquired by the sensors in real time and the unmanned aerial vehicle flight state information; the environment information comprises the image information acquired by the front-facing camera of the unmanned aerial vehicle, and the unmanned aerial vehicle flight state information is expressed in vector form as [P_e, v_e, q, ω_b], where:
P_e represents the position coordinates of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, with components along the x_e, y_e, z_e coordinate axes; v_e represents the linear velocity of the unmanned aerial vehicle in the earth coordinate system, with components along the x_e, y_e, z_e coordinate axes; q is the quaternion representing the attitude of the unmanned aerial vehicle; ω_b represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, with components about the x_b, y_b, z_b coordinate axes;
(2) Action space design and layered decision model
Combining the reinforcement learning model with the traditional PID control model, a layered control decision model of the unmanned aerial vehicle is proposed. The reinforcement learning strategy is responsible for the top-level decision: during the flight decision process the reinforcement learning model outputs the flight linear velocity command of the unmanned aerial vehicle. The PID controller is responsible for the bottom-level control: it maps the linear velocity command into motor commands so as to realize commands such as pitching, rolling, yawing, accelerating and decelerating of the unmanned aerial vehicle;
(3) Reward function design
The reward function consists of sparse rewards and continuous rewards, including position rewards, collision rewards and velocity rewards;
Step S3: constructing an unmanned aerial vehicle layered flight decision model structure based on a SAC algorithm;
Constructing an unmanned aerial vehicle layered flight decision model based on a deep reinforcement learning framework Actor-Critic, wherein the unmanned aerial vehicle layered flight decision model consists of an Actor network, a Critic network and an experience pool D;
The Actor network input is the current-time state s_t of the unmanned aerial vehicle, which comprises the gray image acquired by the onboard camera carried by the unmanned aerial vehicle and the flight state information of the unmanned aerial vehicle, and its output is the unmanned aerial vehicle action a_t; the Critic neural network input is the current-time state s_t of the unmanned aerial vehicle and the action a_t executed by the unmanned aerial vehicle, and its output Q(s_t, a_t) evaluates the quality of the decision action; the unmanned aerial vehicle executes the action a_t in the current-time state s_t and obtains the reward r_t and the new state s_{t+1}; the experience samples (s_t, a_t, r_t, s_{t+1}) containing the states, actions and rewards obtained during the interaction between the unmanned aerial vehicle and the environment are stored in the experience pool D, and batch experience samples are randomly drawn from the experience pool D for updating the Actor network and Critic network parameters;
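A minimal sketch of this interaction-and-storage cycle is given below; the `env`, `actor` and `replay_buffer` objects and their method names are hypothetical stand-ins, not part of the patent:

```python
# Sketch of one interaction step: the Actor samples an action, the environment
# returns a reward and the next state, and the transition is stored in the
# experience pool D for later off-policy SAC updates.
def collect_step(env, actor, replay_buffer, s_t):
    a_t = actor.sample_action(s_t)            # a_t ~ pi_phi(a_t | s_t)
    s_next, r_t, done = env.step(a_t)         # execute the action, observe reward and new state
    replay_buffer.add((s_t, a_t, r_t, s_next, done))
    return s_next, done
```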
step S4: defining parameters of an unmanned aerial vehicle layered flight decision model based on a SAC algorithm, initializing an unmanned aerial vehicle state, and training the unmanned aerial vehicle layered flight decision model through interaction with the environment;
step S5: initializing the state of the unmanned aerial vehicle, testing a flight decision model of the unmanned aerial vehicle, and evaluating the flight decision performance;
S51: initializing the flight state of the unmanned aerial vehicle and obtaining the initial decision model state s_t;
S52: inputting the state s_t into the trained Actor network to obtain the unmanned aerial vehicle decision action a_t, and executing the action to obtain the new state s_{t+1};
S53: judging whether the flight decision task is finished; if the flight decision task is finished, ending; if not, letting s_t = s_{t+1} and repeating steps S52 to S53;
S54: and recording a decision state in the decision process and analyzing the flight decision performance of the unmanned aerial vehicle.
The step of constructing the unmanned aerial vehicle flight control rigid body model comprises the following steps:
(1) Unmanned aerial vehicle kinematics model
The unmanned aerial vehicle kinematics model is independent of the mass and forces of the unmanned aerial vehicle and only studies the relations among the velocity, angular velocity, position and attitude of the unmanned aerial vehicle; the inputs of the kinematics model are the velocity and angular velocity, its outputs are the position and attitude, and it comprises a position kinematics model and an attitude kinematics model;
The position of the unmanned aerial vehicle is defined in the earth coordinate system o_e x_e y_e z_e; the earth coordinate system ignores the earth curvature, i.e. the earth surface is assumed to be a plane; the take-off position of the unmanned aerial vehicle is taken as the origin o_e of the earth coordinate system, the o_e x_e axis points to a certain direction in the horizontal plane, the o_e z_e axis is perpendicular to the ground and points downward, and finally the o_e y_e axis is determined by the right-hand rule;
The attitude of the unmanned aerial vehicle in space describes the rotation relation between the body coordinate system and the earth coordinate system; the body coordinate system o_b x_b y_b z_b is fixed to the unmanned aerial vehicle body, with the center of gravity of the unmanned aerial vehicle taken as the coordinate origin o_b; the o_b x_b axis points toward the nose direction within the plane of symmetry of the unmanned aerial vehicle; the o_b z_b axis is in the plane of symmetry of the unmanned aerial vehicle and perpendicular to the o_b x_b axis, and the o_b y_b axis is determined according to the right-hand rule;
The position kinematics model is defined as follows:

$$\dot{P}_e = v_e$$

where P_e represents the position coordinates of the center of gravity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, \dot{P}_e represents the rate of change of the unmanned aerial vehicle position, and v_e represents the velocity of the unmanned aerial vehicle in the earth coordinate system;
The unmanned aerial vehicle attitude is represented by a quaternion, expressed as follows:

$$q = \begin{bmatrix} q_0 & q_v^T \end{bmatrix}^T, \qquad q_0 \in \mathbb{R},\; q_v = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix}^T \in \mathbb{R}^3$$

where q_0 is the scalar part of q and q_v is the vector part; for a real number s the corresponding quaternion is written q = [s 0_{1×3}]^T, and for a pure vector v the corresponding quaternion is written q = [0 v^T]^T;
The attitude angles of the unmanned aerial vehicle are solved inversely from the quaternion:

$$\phi = \arctan\frac{2(q_0 q_1 + q_2 q_3)}{1 - 2(q_1^2 + q_2^2)},\qquad \theta = \arcsin\bigl(2(q_0 q_2 - q_1 q_3)\bigr),\qquad \psi = \arctan\frac{2(q_0 q_3 + q_1 q_2)}{1 - 2(q_2^2 + q_3^2)}$$

where φ ∈ [-π, π] is the roll angle of the unmanned aerial vehicle, ψ ∈ [-π, π] is the yaw angle of the unmanned aerial vehicle, and θ is the pitch angle of the unmanned aerial vehicle;
The attitude kinematics model is defined as follows:

$$\dot{q} = \frac{1}{2}\begin{bmatrix} -q_v^T \\ q_0 I_3 + [q_v]_{\times} \end{bmatrix}\omega_b$$

where ω_b represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, q_0 is the scalar part of the quaternion, q_v is the vector part of the quaternion, q_v^T denotes the transpose of q_v, [q_v]_× denotes the skew-symmetric matrix of q_v, \dot{q} represents the rate of change of the unmanned aerial vehicle attitude, and I_3 denotes the third-order identity matrix;
(2) Unmanned aerial vehicle dynamics model
The inputs of the unmanned aerial vehicle dynamics model are the thrust and the moments, the moments comprising the pitch moment, the roll moment and the yaw moment, and its outputs are the unmanned aerial vehicle velocity and angular velocity; the unmanned aerial vehicle dynamics model comprises a position dynamics model and an attitude dynamics model;
The position dynamics model is defined as follows:

$$\dot{v}_e = g e_3 - \frac{f}{m} R_b^e e_3$$

where \dot{v}_e represents the rate of change of the unmanned aerial vehicle velocity in the earth coordinate system o_e x_e y_e z_e, m represents the mass of the unmanned aerial vehicle, f represents the total thrust of the propellers, g represents the gravitational acceleration, e_3 = [0, 0, 1]^T is a unit vector, R_b^e represents the rotation matrix from the body coordinate system to the earth coordinate system, φ represents the roll angle of the unmanned aerial vehicle, θ represents the pitch angle of the unmanned aerial vehicle, and ψ represents the yaw angle of the unmanned aerial vehicle;
The attitude dynamics model is established in the body coordinate system as follows:

$$J \dot{\omega}_b = -\omega_b \times (J \omega_b) + G_a + \tau$$

where τ represents the moments generated by the rotation of the propellers about the body axes of the unmanned aerial vehicle, J represents the moment of inertia of the unmanned aerial vehicle, and G_a represents the gyroscopic moment;
Combining the above models gives:

$$\begin{cases} \dot{P}_e = v_e \\ \dot{v}_e = g e_3 - \dfrac{f}{m} R_b^e e_3 \\ \dot{q} = \dfrac{1}{2}\begin{bmatrix} -q_v^T \\ q_0 I_3 + [q_v]_{\times} \end{bmatrix}\omega_b \\ J \dot{\omega}_b = -\omega_b \times (J \omega_b) + G_a + \tau \end{cases}$$

which is the rigid body model for unmanned aerial vehicle flight control.
The reward function consists of sparse rewards and continuous rewards, including position rewards, collision rewards and speed rewards;
The position rewards include a position sparse reward and a position continuous reward;
The position continuous reward r_1 is calculated as a function of the y_e-axis coordinate values of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e at time t and at time t-1 and of y_goal, the y_e-axis coordinate value of the flight mission destination of the unmanned aerial vehicle;
The position sparse reward r_2 is defined in terms of N_barrier, the total number of obstacles in the environment, and level, the number of obstacles the unmanned aerial vehicle has already passed;
The collision reward is a sparse reward used to evaluate whether the unmanned aerial vehicle has collided; the unmanned aerial vehicle obtains the collision reward r_3 during the flight process;
The velocity reward r_4 is defined as r_4 = r' + r'', where v represents the current speed of the unmanned aerial vehicle, v_limit represents the set minimum speed of the unmanned aerial vehicle, and the velocity component of the unmanned aerial vehicle along the y_e axis of the earth coordinate system o_e x_e y_e z_e is also used;
In summary, the overall reward function comprises the position rewards r_1 and r_2, the collision reward r_3 and the velocity reward r_4, i.e. R = r_1 + r_2 + r_3 + r_4.
The SAC algorithm hierarchical decision model training specifically comprises the following steps:
S41: setting the entropy regularization coefficient α, the learning rate lr, the experience pool size, the batch training sample number batch_size and the number of training rounds; initializing the unmanned aerial vehicle and acquiring the environment state information, namely the gray image information acquired by the camera and the flight state of the unmanned aerial vehicle itself, as the decision initial state s_t;
S42: initializing the experience pool D; randomly generating the Actor network weights φ and the Critic network weights θ_1, θ_2; initializing the Actor network π_φ and the Critic networks Q_{θ_1}, Q_{θ_2}; letting the target Critic network weights be θ_1' = θ_1 and θ_2' = θ_2 and initializing the target Critic networks Q_{θ_1'} and Q_{θ_2'};
S43: inputting the state information s_t into the Actor network to obtain a Gaussian policy distribution with mean μ and variance σ; obtaining the unmanned aerial vehicle decision action a_t ~ π_φ(a_t|s_t) by random sampling from this policy distribution; after the unmanned aerial vehicle executes the action a_t, obtaining the next-time state s_{t+1}, computing the reward r_t = r(s_t, a_t) with the reward function designed above, and storing the decision data (s_t, a_t, r_t, s_{t+1}) in the experience pool D;
S44: when the number of experiences in the experience pool exceeds batch_size, randomly extracting a batch of batch_size experience samples M as the training data of the SAC algorithm; during training, performing gradient descent with learning rate lr on the Actor network loss function J_π(φ) and the Critic network loss functions J_Q(θ_i), i = 1, 2, so as to update the weights of the Actor network and the Critic networks;
S45: judging whether the model has converged, the convergence condition being that the reward value obtained by the unmanned aerial vehicle in each round has become stable or that the set number of training rounds has been reached; if converged, finishing the training and obtaining the trained unmanned aerial vehicle flight decision model; if not, repeating steps S41 to S45.
The beneficial effects of the invention are as follows. Because the state space of the unmanned aerial vehicle is huge, deep reinforcement learning algorithms applied to unmanned aerial vehicle decision making face the problem of difficult strategy exploration; the invention adopts a non-deterministic reinforcement learning SAC model, which has strong exploration capability and can efficiently explore the optimal flight strategy. Considering the nonlinear characteristics of the unmanned aerial vehicle model, it is difficult to realize end-to-end control directly through deep reinforcement learning training; the invention therefore proposes an unmanned aerial vehicle layered decision model based on the SAC algorithm in which the top-level flight decision in a complex environment is made by the SAC policy and the bottom-level control is realized by a PID controller, which reduces the difficulty of algorithm training and improves the decision performance of the model.
Drawings
FIG. 1 is a schematic diagram of the SAC hierarchical decision model architecture of the present invention.
Fig. 2 is a schematic diagram of an Actor network structure according to the present invention.
FIG. 3 is a schematic diagram of the Critic network of the present invention.
FIG. 4 is a schematic diagram of a hierarchical decision model training process based on SAC algorithm according to the present invention.
Fig. 5 is a graph of the SAC-based algorithm training procedure reward function of the present invention.
Fig. 6 is a flight trajectory diagram of an unmanned aerial vehicle according to an embodiment of the present invention, fig. 6 (a) is a diagram of a coordinate change of a position of the unmanned aerial vehicle on each coordinate axis during the flight of the unmanned aerial vehicle in order to complete a flight decision task, and fig. 6 (b).
Detailed Description
The invention will be further described with reference to the drawings and examples.
According to the design scheme provided by the invention, the unmanned aerial vehicle layered flight decision method based on the SAC algorithm comprises the following steps:
S1, constructing an unmanned aerial vehicle flight control model
In order to describe the attitude and position of the unmanned aerial vehicle, it is crucial to establish appropriate coordinate systems. Suitable coordinate systems help clarify the relationships between variables and facilitate representation and calculation. The position of the unmanned aerial vehicle is defined in the earth coordinate system, and its attitude in space mainly describes the rotation relation between the body coordinate system and the earth coordinate system.
The earth coordinate system o_e x_e y_e z_e ignores the earth curvature, i.e. the earth surface is assumed to be a plane; it is used to study the motion state of the aircraft relative to the ground and to determine the three-dimensional position of the airframe. The origin of coordinates o_e is usually taken at the take-off position of the unmanned aerial vehicle or at the center of the earth; the o_e x_e axis points in a certain direction within the horizontal plane, the o_e z_e axis points perpendicular to the ground, and the o_e y_e axis is then determined by the right-hand rule.
The body coordinate system o_b x_b y_b z_b is fixed to the airframe of the aircraft, and the origin o_b of the body coordinate system is defined at the center of gravity of the aircraft; the o_b x_b axis is defined as pointing toward the nose direction within the plane of symmetry of the aircraft; the o_b z_b axis is defined in the plane of symmetry of the aircraft and perpendicular to the o_b x_b axis, and the o_b y_b axis is determined according to the right-hand rule.
The unmanned aerial vehicle attitude is represented by a quaternion, which is generally expressed as follows:

$$q = \begin{bmatrix} q_0 & q_v^T \end{bmatrix}^T, \qquad q_0 \in \mathbb{R},\; q_v = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix}^T \in \mathbb{R}^3$$

where q_0 is the scalar part of q and q_v is the vector part. For a real number s the corresponding quaternion is written q = [s 0_{1×3}]^T, and for a pure vector v the corresponding quaternion is written q = [0 v^T]^T.
The unmanned aerial vehicle attitude angles can be solved inversely from the quaternion:

$$\phi = \arctan\frac{2(q_0 q_1 + q_2 q_3)}{1 - 2(q_1^2 + q_2^2)},\qquad \theta = \arcsin\bigl(2(q_0 q_2 - q_1 q_3)\bigr),\qquad \psi = \arctan\frac{2(q_0 q_3 + q_1 q_2)}{1 - 2(q_2^2 + q_3^2)}$$

where φ ∈ [-π, π] is the roll angle of the unmanned aerial vehicle, ψ ∈ [-π, π] is the yaw angle of the unmanned aerial vehicle, and θ is the pitch angle of the unmanned aerial vehicle.
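A small sketch of this quaternion-to-attitude-angle conversion, using the scalar-first convention q = [q_0, q_1, q_2, q_3]; the clamping of the arcsine argument is an implementation detail of the sketch, not something prescribed by the patent:

```python
import math

def quaternion_to_euler(q):
    """Convert a scalar-first unit quaternion [q0, q1, q2, q3] to (roll, pitch, yaw) in radians."""
    q0, q1, q2, q3 = q
    roll = math.atan2(2.0 * (q0 * q1 + q2 * q3), 1.0 - 2.0 * (q1 * q1 + q2 * q2))
    # clamp guards against small numerical excursions outside [-1, 1]
    pitch = math.asin(max(-1.0, min(1.0, 2.0 * (q0 * q2 - q1 * q3))))
    yaw = math.atan2(2.0 * (q0 * q3 + q1 * q2), 1.0 - 2.0 * (q2 * q2 + q3 * q3))
    return roll, pitch, yaw
```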
In order to solve for the position and attitude information of the unmanned aerial vehicle in real time, an unmanned aerial vehicle flight control rigid body model is adopted, which comprises the unmanned aerial vehicle kinematics and dynamics models.
(1) Unmanned aerial vehicle kinematics model
The inputs of the unmanned aerial vehicle kinematics model are the velocity and angular velocity of the unmanned aerial vehicle, from which the corresponding position and attitude of the unmanned aerial vehicle are obtained. The unmanned aerial vehicle kinematics model comprises a position kinematics model and an attitude kinematics model.
The position kinematics model is defined as follows:

$$\dot{P}_e = v_e$$

where P_e represents the position coordinates of the center of gravity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, \dot{P}_e represents the rate of change of the unmanned aerial vehicle position, and v_e represents the velocity of the unmanned aerial vehicle in the earth coordinate system.
The attitude kinematics model is defined as follows:

$$\dot{q} = \frac{1}{2}\begin{bmatrix} -q_v^T \\ q_0 I_3 + [q_v]_{\times} \end{bmatrix}\omega_b$$

where ω_b is the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, q_0 is the scalar part of the quaternion, q_v is the vector part of the quaternion, q_v^T denotes the transpose of q_v, [q_v]_× denotes the skew-symmetric matrix of q_v, \dot{q} represents the rate of change of the unmanned aerial vehicle attitude, and I_3 denotes the third-order identity matrix.
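A short sketch of propagating the attitude quaternion with this kinematic model; the explicit Euler step and the renormalization are implementation choices of this sketch, not prescribed by the patent:

```python
import numpy as np

def quaternion_derivative(q, omega_b):
    """q_dot = 0.5 * [[-q_v^T], [q_0*I_3 + [q_v]_x]] @ omega_b, with q = [q0, qv] scalar-first."""
    q0, qv = q[0], q[1:]
    qv_cross = np.array([[0.0, -qv[2], qv[1]],
                         [qv[2], 0.0, -qv[0]],
                         [-qv[1], qv[0], 0.0]])          # skew-symmetric matrix of q_v
    q0_dot = -0.5 * qv @ omega_b
    qv_dot = 0.5 * (q0 * np.eye(3) + qv_cross) @ omega_b
    return np.concatenate(([q0_dot], qv_dot))

def integrate_attitude(q, omega_b, dt):
    """One explicit Euler step of the attitude kinematics, followed by renormalization."""
    q_new = q + quaternion_derivative(q, omega_b) * dt
    return q_new / np.linalg.norm(q_new)                 # keep a unit quaternion
```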
(2) Unmanned aerial vehicle dynamics model
The inputs of the unmanned aerial vehicle dynamics model are the thrust and the moments (pitch moment, roll moment and yaw moment), and its outputs are the unmanned aerial vehicle velocity and angular velocity; the unmanned aerial vehicle dynamics model comprises a position dynamics model and an attitude dynamics model.
The position dynamics model is defined as follows:

$$\dot{v}_e = g e_3 - \frac{f}{m} R_b^e e_3$$

where \dot{v}_e represents the rate of change of the unmanned aerial vehicle velocity in the earth coordinate system o_e x_e y_e z_e, m represents the mass of the unmanned aerial vehicle, f represents the total thrust of the propellers, g represents the gravitational acceleration, e_3 = [0, 0, 1]^T is a unit vector, R_b^e represents the rotation matrix from the body coordinate system to the earth coordinate system, φ represents the roll angle of the unmanned aerial vehicle, θ represents the pitch angle of the unmanned aerial vehicle, and ψ represents the yaw angle of the unmanned aerial vehicle.
The attitude dynamics model is established in the body coordinate system as follows:

$$J \dot{\omega}_b = -\omega_b \times (J \omega_b) + G_a + \tau$$

where τ represents the moments generated by the rotation of the propellers about the body axes of the unmanned aerial vehicle, J represents the moment of inertia of the unmanned aerial vehicle, and G_a represents the gyroscopic moment.
Combining the above, the rigid body model for unmanned aerial vehicle flight control is:

$$\begin{cases} \dot{P}_e = v_e \\ \dot{v}_e = g e_3 - \dfrac{f}{m} R_b^e e_3 \\ \dot{q} = \dfrac{1}{2}\begin{bmatrix} -q_v^T \\ q_0 I_3 + [q_v]_{\times} \end{bmatrix}\omega_b \\ J \dot{\omega}_b = -\omega_b \times (J \omega_b) + G_a + \tau \end{cases}$$
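A small numerical sketch of these rigid-body equations; the function names and NumPy-based interface are illustrative, and the thrust f, rotation matrix R_b^e, moments τ and gyroscopic moment G_a are assumed to be supplied by the rotor and attitude models, which the patent does not detail here:

```python
import numpy as np

def position_dynamics(f, m, R_be, g=9.81):
    """dv_e/dt = g*e3 - (f/m) * R_be @ e3  (the o_e z_e axis points downward)."""
    e3 = np.array([0.0, 0.0, 1.0])
    return g * e3 - (f / m) * (R_be @ e3)

def attitude_dynamics(omega_b, tau, J, G_a):
    """Solve J * domega_b/dt = -omega_b x (J @ omega_b) + G_a + tau in the body frame."""
    return np.linalg.solve(J, -np.cross(omega_b, J @ omega_b) + G_a + tau)
```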
Step S2: constructing a state space, a layered decision action space and a reward function of the unmanned aerial vehicle flight decision according to the Markov decision process.
(1) State space design
The state space designed by the invention consists of two parts: the unmanned aerial vehicle flight state information and the environment information acquired by the sensors in real time. The environment state comprises the image information obtained by the front-facing camera of the unmanned aerial vehicle, and the unmanned aerial vehicle flight state information is expressed in vector form as [P_e, v_e, q, ω_b], where:
P_e represents the position coordinates of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, with components along the x_e, y_e, z_e coordinate axes; v_e represents the linear velocity of the unmanned aerial vehicle in the earth coordinate system, with components along the x_e, y_e, z_e coordinate axes; q is the quaternion representing the attitude of the unmanned aerial vehicle; ω_b represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, with components about the x_b, y_b, z_b coordinate axes;
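A minimal sketch of assembling this flight-state vector; the 3 + 3 + 4 + 3 = 13 dimensions follow from the component definitions above, the function name is only illustrative, and the gray image is handled separately as the other part of the state:

```python
import numpy as np

def build_flight_state(P_e, v_e, q, omega_b):
    """Concatenate position (3), linear velocity (3), attitude quaternion (4) and
    body angular velocity (3) into the flight-state part of the decision state."""
    return np.concatenate([P_e, v_e, q, omega_b]).astype(np.float32)
```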
(2) Action space design and layered decision model
The invention combines a reinforcement learning model with a traditional PID control model and proposes a layered control decision model of the unmanned aerial vehicle, whose structure is shown in Fig. 1. The reinforcement learning strategy is responsible for the top-level decision: during the flight decision process the reinforcement learning model outputs the flight linear velocity command of the unmanned aerial vehicle. For the bottom-level control, a PID controller maps the linear velocity command into motor commands so as to realize commands such as pitching, rolling, yawing, accelerating and decelerating of the unmanned aerial vehicle.
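The patent does not disclose the structure or gains of its PID controller; the following is only an illustrative sketch of a bottom-level velocity-tracking PID loop of the kind described, with placeholder gains:

```python
import numpy as np

class VelocityPID:
    """Illustrative bottom-level controller: tracks the linear-velocity command output by
    the RL policy and produces a demand that a standard quadrotor mixer would turn into
    motor commands. Gains and sample time are placeholders, not values from the patent."""
    def __init__(self, kp=2.0, ki=0.1, kd=0.05, dt=0.02):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = np.zeros(3)
        self.prev_error = np.zeros(3)

    def update(self, v_cmd, v_meas):
        """PID on the velocity error between the commanded and measured velocity."""
        error = v_cmd - v_meas
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```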
(3) Reward function design
The reward function designed by the invention consists of sparse rewards and continuous rewards, and comprises position rewards, collision rewards and speed rewards;
The position rewards include a position sparse reward and a position continuous reward. The position sparse reward is set as a reward for the unmanned aerial vehicle to successfully pass a certain obstacle to evaluate the obstacle avoidance performance of the flight decision strategy.
The position continuous reward r_1 is calculated as a function of the y_e-axis coordinate values of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e at time t and at time t-1 and of y_goal, the y_e-axis coordinate value of the flight mission destination of the unmanned aerial vehicle.
The position sparse reward r_2 is defined in terms of N_barrier, the total number of obstacles in the environment, and level, the number of obstacles the unmanned aerial vehicle has already passed.
The collision reward is a sparse reward used to evaluate whether the unmanned aerial vehicle has collided; the unmanned aerial vehicle obtains the collision reward r_3 during the flight process.
The velocity reward r_4 is defined as r_4 = r' + r'', where v represents the current speed of the unmanned aerial vehicle, v_limit represents the set minimum speed of the unmanned aerial vehicle, and the velocity component of the unmanned aerial vehicle along the y_e axis of the earth coordinate system o_e x_e y_e z_e is also used.
In summary, the overall reward function comprises the position rewards r_1 and r_2, the collision reward r_3 and the velocity reward r_4, and is defined as follows:
R = r_1 + r_2 + r_3 + r_4
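As a rough illustration of how the combined reward above could be assembled in code: the individual terms r_1 to r_4 are given by explicit formulas in the original publication that are not reproduced here, so the position-progress term below is only a hypothetical stand-in, not the patent's definition:

```python
def position_continuous_reward(y_t, y_t_prev, y_goal):
    # Hypothetical stand-in for r1: reward forward progress along y_e toward the goal.
    return (y_t - y_t_prev) / abs(y_goal)

def total_reward(r1, r2, r3, r4):
    # The combined reward exactly as defined above: R = r1 + r2 + r3 + r4.
    return r1 + r2 + r3 + r4
```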
Step S3: constructing an unmanned aerial vehicle layered flight decision model structure based on the SAC algorithm.
The invention discloses an unmanned aerial vehicle flight decision model based on a deep reinforcement learning framework, which comprises an Actor network, a Critic network and an experience pool D.
The Actor network input is the current-time state s_t of the unmanned aerial vehicle, which comprises the gray image acquired by the onboard camera carried by the unmanned aerial vehicle and the flight state information of the unmanned aerial vehicle. The Actor neural network is designed as a network structure comprising 6 convolutional layers, 4 pooling layers and 4 fully connected layers; its structure is shown in Fig. 2. The gray image and the unmanned aerial vehicle flight state information are input into the Actor neural network to obtain the unmanned aerial vehicle flight decision action output, namely the mean and variance of the unmanned aerial vehicle velocity components along the x_e, y_e, z_e axes; the decision linear velocity is then obtained by sampling this Gaussian stochastic policy.
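A rough sketch of such an Actor network; PyTorch is used here only as an assumed framework, the 6 convolutional / 4 pooling / 4 fully connected layout follows the description above, but all channel counts, kernel sizes and hidden widths are illustrative guesses, and the 13-dimensional flight-state input follows from the state vector defined in step S2:

```python
import torch
import torch.nn as nn

class ActorNetwork(nn.Module):
    """Gray image + flight state -> mean and standard deviation of the velocity command."""
    def __init__(self, state_dim=13, action_dim=3):
        super().__init__()
        self.cnn = nn.Sequential(                      # 6 convolutional + 4 pooling layers
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc1 = nn.LazyLinear(256)                  # 4 fully connected layers in total
        self.fc2 = nn.Linear(256, 128)
        self.mu_head = nn.Linear(128, action_dim)
        self.log_std_head = nn.Linear(128, action_dim)

    def forward(self, image, flight_state):
        feat = torch.cat([self.cnn(image), flight_state], dim=-1)
        h = torch.relu(self.fc1(feat))
        h = torch.relu(self.fc2(h))
        mu = self.mu_head(h)
        log_std = self.log_std_head(h).clamp(-20, 2)
        return mu, log_std.exp()                       # parameters of the Gaussian policy
```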
The Critic network is designed as a network structure consisting of 6 convolutional layers, 4 pooling layers and 4 fully connected layers; the Critic neural network structure is shown in Fig. 3. The state information s_t, composed of the gray image and the unmanned aerial vehicle flight state information, and the unmanned aerial vehicle action a_t are input into the Critic neural network to obtain the Q value Q(s_t, a_t) used to evaluate the quality of the decision action.
The experience pool D is used for storing the experience data (s_t, a_t, r_t, s_{t+1}) containing the states, actions and rewards obtained from the interaction between the unmanned aerial vehicle and the environment; the implementation process of the layered decision model based on the SAC algorithm is shown in Fig. 4.
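A simple sketch of the experience pool D; the capacity of 100000 matches the embodiment described below, while the class and method names are illustrative:

```python
import random
from collections import deque

class ExperiencePool:
    """Stores (s_t, a_t, r_t, s_{t+1}) transitions and returns random mini-batches
    for the off-policy SAC updates."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```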
The unmanned aerial vehicle obtains the environment image information through its front-facing camera; this information, together with the unmanned aerial vehicle flight state information, is input into the Actor neural network of the unmanned aerial vehicle flight decision model to decide the unmanned aerial vehicle flight velocity. The states, actions and rewards of the unmanned aerial vehicle are stored in the experience pool as training data for the unmanned aerial vehicle flight decision model; during training, experience samples are randomly drawn from the experience pool to train the unmanned aerial vehicle flight decision model.
Step S4: initializing the state of the unmanned aerial vehicle, defining the experimental parameters, and training the SAC-algorithm layered decision model of the unmanned aerial vehicle through interaction with the environment.
The SAC algorithm hierarchical decision model training specifically comprises the following steps:
S41: setting the entropy regularization coefficient α, the learning rate lr, the experience pool size, the batch training sample number batch_size and the number of training rounds; initializing the unmanned aerial vehicle and acquiring the environment state information, namely the gray image information acquired by the camera and the flight state of the unmanned aerial vehicle itself, as the decision initial state s_t.
S42: initializing the experience pool D; randomly generating the Actor network weights φ and the Critic network weights θ_1, θ_2; initializing the Actor network π_φ and the Critic networks Q_{θ_1}, Q_{θ_2}; letting the target Critic network weights be θ_1' = θ_1 and θ_2' = θ_2 and initializing the target Critic networks Q_{θ_1'} and Q_{θ_2'}.
S43: inputting the state information s_t into the Actor network to obtain a Gaussian policy distribution with mean μ and variance σ. The unmanned aerial vehicle decision action a_t ~ π_φ(a_t|s_t) is obtained by random sampling from this policy distribution; after the unmanned aerial vehicle executes the action a_t it obtains the next-time state s_{t+1}, the reward r_t = r(s_t, a_t) is computed with the reward function designed above, and the data are stored in the experience pool: D ← D ∪ {(s_t, a_t, r_t, s_{t+1})}.
S44: when the number of experiences in the experience pool exceeds batch_size, a batch of batch_size experience samples M is randomly drawn as the training data of the SAC algorithm; during training, gradient descent with learning rate lr is performed on the Actor network loss function J_π(φ) and the Critic network loss functions J_Q(θ_i), i = 1, 2, so as to update the weights of the Actor network and the Critic networks. The specific neural network loss functions and the network update process are as follows:
The double Soft-Q function is defined as the minimum of the outputs of the two target Critic networks, so the target Q value is

$$y = r_t + \gamma \Big( \min_{i=1,2} Q_{\theta_i'}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \Big), \qquad a_{t+1} \sim \pi_\phi(\cdot \mid s_{t+1})$$

where Q_{θ_1'} and Q_{θ_2'} are the target Critic networks and γ is the discount factor.
The Actor network loss function J_π(φ) is defined as follows:

$$J_\pi(\phi) = \mathbb{E}_{s_t \sim D,\; a_t \sim \pi_\phi} \Big[ \alpha \log \pi_\phi(a_t \mid s_t) - \min_{i=1,2} Q_{\theta_i}(s_t, a_t) \Big]$$

The Critic network loss functions J_Q(θ_i), i = 1, 2, are defined as follows:

$$J_Q(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D} \Big[ \tfrac{1}{2}\big( Q_{\theta_i}(s_t, a_t) - y \big)^2 \Big]$$

where α is the regularization coefficient of the policy entropy.
The target Critic network weights θ_1', θ_2' are updated by:
θ_i' ← τ θ_i + (1 − τ) θ_i',  i ∈ {1, 2}
where τ is the soft update parameter of the target Critic networks.
S45: judging whether the model has converged or the set number of training rounds has been reached; if so, finishing the training and obtaining the trained unmanned aerial vehicle flight decision model; otherwise, repeating steps S41 to S45.
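A condensed sketch of one such update step; the `actor`, `critic1/2` and `target1/2` modules, the `actor.sample` method returning an action with its log-probability, the single optimizer over both Critics, and the discount factor γ are all assumptions of this sketch rather than values disclosed in the patent (α = 0.2 matches the embodiment below):

```python
import torch
import torch.nn.functional as F

def sac_update(batch, actor, critic1, critic2, target1, target2,
               actor_opt, critic_opt, alpha=0.2, gamma=0.99, tau=0.005):
    """One SAC gradient step matching the losses above: double Soft-Q target,
    Critic losses J_Q(theta_i), Actor loss J_pi(phi), and soft target update."""
    s, a, r, s_next, done = batch   # tensors sampled from the experience pool D

    # Critic update: y = r + gamma * (min_i Q_target_i(s', a') - alpha * log pi(a'|s'))
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)
        q_target = torch.min(target1(s_next, a_next), target2(s_next, a_next))
        y = r + gamma * (1.0 - done) * (q_target - alpha * logp_next)
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: J_pi(phi) = E[alpha * log pi(a|s) - min_i Q_i(s, a)]
    a_new, logp = actor.sample(s)
    q_new = torch.min(critic1(s, a_new), critic2(s, a_new))
    actor_loss = (alpha * logp - q_new).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target Critic networks: theta' <- tau*theta + (1 - tau)*theta'
    for net, target in ((critic1, target1), (critic2, target2)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```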
Step S5: initializing the state of the unmanned aerial vehicle, testing the unmanned aerial vehicle flight decision model, and evaluating the flight decision performance.
S51: initializing the flight state of the unmanned aerial vehicle and obtaining the initial decision model state s_t.
S52: inputting the state s_t into the trained Actor network to obtain the unmanned aerial vehicle decision action a_t; after the action is executed, the new state s_{t+1} is obtained.
S53: judging whether the flight decision task is completed; if so, ending; otherwise letting s_t = s_{t+1} and repeating steps S52 to S53.
S54: and recording a decision state in the decision process and analyzing the flight decision performance of the unmanned aerial vehicle.
Examples of applications of the present invention are as follows:
In the example environment, the y-axis coordinate of the end point is 57; the environment contains 4 obstacles whose y-axis coordinates are 7, 17, 27.5 and 45, respectively.
The initial state of the unmanned aerial vehicle is [P_e, v_e, q, ω_b] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
Initializing experimental parameters: the entropy regularization coefficient alpha is 0.2 and automatically decays, the learning rate lr is 0.0006, the empirical pool size is 100000, the batch training sample number batch_size is 256, and the training round number is 1000.
The layered flight decision model of the unmanned aerial vehicle is trained and the change of the reward value during training is recorded. The reward value curve during training of the SAC algorithm is shown in Fig. 5. The SAC algorithm obtains a maximum reward of 51.8 during training. Over the whole training process, the SAC algorithm reward curve converges at round 805 and finally remains at about 48.3.
After training, the unmanned aerial vehicle state is initialized to [P_e, v_e, q, ω_b] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], the maneuvering decisions are made with the trained model, and the unmanned aerial vehicle flight trajectory is drawn from the recorded states, as shown in Fig. 6. In the figure, the unmanned aerial vehicle flight trajectory decided by the SAC-algorithm-based layered flight decision method successfully avoids the obstacles, finally reaches the end point whose y-axis coordinate is 57, and smoothly completes the flight task.
The unmanned aerial vehicle layered flight decision method based on the SAC algorithm exhibits good convergence performance and achieves fast and safe flight when carrying out the flight task.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.

Claims (4)

1. The unmanned aerial vehicle layered flight decision method based on the SAC algorithm is characterized by comprising the following steps of:
Step S1: constructing unmanned aerial vehicle flight control model
In order to solve for the position and attitude information of the unmanned aerial vehicle in real time, an unmanned aerial vehicle flight control rigid body model is constructed, which comprises an unmanned aerial vehicle kinematics model and an unmanned aerial vehicle dynamics model;
step S2: constructing a state space, a layered decision action space and a reward function of unmanned aerial vehicle flight decisions according to a Markov decision process;
(1) State space design
The state space consists of two parts, namely the environment information acquired by the sensors in real time and the unmanned aerial vehicle flight state information; the environment information comprises the image information acquired by the front-facing camera of the unmanned aerial vehicle, and the unmanned aerial vehicle flight state information is expressed in vector form as [P_e, v_e, q, ω_b], where:
P_e represents the position coordinates of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, with components along the x_e, y_e, z_e coordinate axes; v_e represents the linear velocity of the unmanned aerial vehicle in the earth coordinate system, with components along the x_e, y_e, z_e coordinate axes; q is the quaternion representing the attitude of the unmanned aerial vehicle; ω_b represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, with components about the x_b, y_b, z_b coordinate axes;
(2) Action space design and layered decision model
Combining the reinforcement learning model with the traditional PID control model, a layered control decision model of the unmanned aerial vehicle is proposed. The reinforcement learning strategy is responsible for the top-level decision: during the flight decision process the reinforcement learning model outputs the flight linear velocity command of the unmanned aerial vehicle. The PID controller is responsible for the bottom-level control: it maps the linear velocity command into motor commands so as to realize commands such as pitching, rolling, yawing, accelerating and decelerating of the unmanned aerial vehicle;
(3) Reward function design
The reward function consists of sparse rewards and continuous rewards, including position rewards, collision rewards and velocity rewards;
Step S3: constructing an unmanned aerial vehicle layered flight decision model structure based on a SAC algorithm;
Constructing an unmanned aerial vehicle layered flight decision model based on a deep reinforcement learning framework Actor-Critic, wherein the unmanned aerial vehicle layered flight decision model consists of an Actor network, a Critic network and an experience pool D;
The Actor network input is the current-time state s_t of the unmanned aerial vehicle, which comprises the gray image acquired by the onboard camera carried by the unmanned aerial vehicle and the flight state information of the unmanned aerial vehicle, and its output is the unmanned aerial vehicle action a_t; the Critic neural network input is the current-time state s_t of the unmanned aerial vehicle and the action a_t executed by the unmanned aerial vehicle, and its output Q(s_t, a_t) evaluates the quality of the decision action; the unmanned aerial vehicle executes the action a_t in the current-time state s_t and obtains the reward r_t and the new state s_{t+1}; the experience samples (s_t, a_t, r_t, s_{t+1}) containing the states, actions and rewards obtained during the interaction between the unmanned aerial vehicle and the environment are stored in the experience pool D, and batch experience samples are randomly drawn from the experience pool D for updating the Actor network and Critic network parameters;
step S4: defining parameters of an unmanned aerial vehicle layered flight decision model based on a SAC algorithm, initializing an unmanned aerial vehicle state, and training the unmanned aerial vehicle layered flight decision model through interaction with the environment;
step S5: initializing the state of the unmanned aerial vehicle, testing a flight decision model of the unmanned aerial vehicle, and evaluating the flight decision performance;
S51: initializing the flight state of the unmanned aerial vehicle and obtaining the initial decision model state s_t;
S52: inputting the state s_t into the trained Actor network to obtain the unmanned aerial vehicle decision action a_t, and executing the action to obtain the new state s_{t+1};
S53: judging whether the flight decision task is finished; if the flight decision task is finished, ending; if not, letting s_t = s_{t+1} and repeating steps S52 to S53;
S54: and recording a decision state in the decision process and analyzing the flight decision performance of the unmanned aerial vehicle.
2. The SAC algorithm-based unmanned aerial vehicle layered flight decision method according to claim 1, wherein:
The step of constructing the unmanned aerial vehicle flight control rigid body model comprises the following steps:
(1) Unmanned aerial vehicle kinematics model
The unmanned aerial vehicle kinematics model is independent of the mass and forces of the unmanned aerial vehicle and only studies the relations among the velocity, angular velocity, position and attitude of the unmanned aerial vehicle; the inputs of the kinematics model are the velocity and angular velocity, its outputs are the position and attitude, and it comprises a position kinematics model and an attitude kinematics model;
The position of the unmanned aerial vehicle is defined in the earth coordinate system o_e x_e y_e z_e; the earth coordinate system ignores the earth curvature, i.e. the earth surface is assumed to be a plane; the take-off position of the unmanned aerial vehicle is taken as the origin o_e of the earth coordinate system, the o_e x_e axis points to a certain direction in the horizontal plane, the o_e z_e axis is perpendicular to the ground and points downward, and finally the o_e y_e axis is determined by the right-hand rule;
The attitude of the unmanned aerial vehicle in space describes the rotation relation between the body coordinate system and the earth coordinate system; the body coordinate system o_b x_b y_b z_b is fixed to the unmanned aerial vehicle body, with the center of gravity of the unmanned aerial vehicle taken as the coordinate origin o_b; the o_b x_b axis points toward the nose direction within the plane of symmetry of the unmanned aerial vehicle; the o_b z_b axis is in the plane of symmetry of the unmanned aerial vehicle and perpendicular to the o_b x_b axis, and the o_b y_b axis is determined according to the right-hand rule;
The position kinematics model is defined as follows:

$$\dot{P}_e = v_e$$

where P_e represents the position coordinates of the center of gravity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, \dot{P}_e represents the rate of change of the unmanned aerial vehicle position, and v_e represents the velocity of the unmanned aerial vehicle in the earth coordinate system;
The unmanned aerial vehicle attitude is represented by a quaternion, expressed as follows:

$$q = \begin{bmatrix} q_0 & q_v^T \end{bmatrix}^T, \qquad q_0 \in \mathbb{R},\; q_v = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix}^T \in \mathbb{R}^3$$

where q_0 is the scalar part of q and q_v is the vector part; for a real number s the corresponding quaternion is written q = [s 0_{1×3}]^T, and for a pure vector v the corresponding quaternion is written q = [0 v^T]^T;
The attitude angles of the unmanned aerial vehicle are solved inversely from the quaternion:

$$\phi = \arctan\frac{2(q_0 q_1 + q_2 q_3)}{1 - 2(q_1^2 + q_2^2)},\qquad \theta = \arcsin\bigl(2(q_0 q_2 - q_1 q_3)\bigr),\qquad \psi = \arctan\frac{2(q_0 q_3 + q_1 q_2)}{1 - 2(q_2^2 + q_3^2)}$$

where φ ∈ [-π, π] is the roll angle of the unmanned aerial vehicle, ψ ∈ [-π, π] is the yaw angle of the unmanned aerial vehicle, and θ is the pitch angle of the unmanned aerial vehicle;
The attitude kinematics model is defined as follows:

$$\dot{q} = \frac{1}{2}\begin{bmatrix} -q_v^T \\ q_0 I_3 + [q_v]_{\times} \end{bmatrix}\omega_b$$

where ω_b represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, q_0 is the scalar part of the quaternion, q_v is the vector part of the quaternion, q_v^T denotes the transpose of q_v, [q_v]_× denotes the skew-symmetric matrix of q_v, \dot{q} represents the rate of change of the unmanned aerial vehicle attitude, and I_3 denotes the third-order identity matrix;
(2) Unmanned aerial vehicle dynamics model
The inputs of the unmanned aerial vehicle dynamics model are the thrust and the moments, the moments comprising the pitch moment, the roll moment and the yaw moment, and its outputs are the unmanned aerial vehicle velocity and angular velocity; the unmanned aerial vehicle dynamics model comprises a position dynamics model and an attitude dynamics model;
The position dynamics model is defined as follows:

$$\dot{v}_e = g e_3 - \frac{f}{m} R_b^e e_3$$

where \dot{v}_e represents the rate of change of the unmanned aerial vehicle velocity in the earth coordinate system o_e x_e y_e z_e, m represents the mass of the unmanned aerial vehicle, f represents the total thrust of the propellers, g represents the gravitational acceleration, e_3 = [0, 0, 1]^T is a unit vector, R_b^e represents the rotation matrix from the body coordinate system to the earth coordinate system, φ represents the roll angle of the unmanned aerial vehicle, θ represents the pitch angle of the unmanned aerial vehicle, and ψ represents the yaw angle of the unmanned aerial vehicle;
The attitude dynamics model is established in the body coordinate system as follows:

$$J \dot{\omega}_b = -\omega_b \times (J \omega_b) + G_a + \tau$$

where τ represents the moments generated by the rotation of the propellers about the body axes of the unmanned aerial vehicle, J represents the moment of inertia of the unmanned aerial vehicle, and G_a represents the gyroscopic moment;
Combining the above models gives:

$$\begin{cases} \dot{P}_e = v_e \\ \dot{v}_e = g e_3 - \dfrac{f}{m} R_b^e e_3 \\ \dot{q} = \dfrac{1}{2}\begin{bmatrix} -q_v^T \\ q_0 I_3 + [q_v]_{\times} \end{bmatrix}\omega_b \\ J \dot{\omega}_b = -\omega_b \times (J \omega_b) + G_a + \tau \end{cases}$$

which is the rigid body model for unmanned aerial vehicle flight control.
3. The SAC algorithm-based unmanned aerial vehicle layered flight decision method according to claim 1, wherein:
The reward function consists of sparse rewards and continuous rewards, including position rewards, collision rewards and speed rewards;
The position rewards include a position sparse reward and a position continuous reward;
The position continuous reward r_1 is calculated as a function of the y_e-axis coordinate values of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e at time t and at time t-1 and of y_goal, the y_e-axis coordinate value of the flight mission destination of the unmanned aerial vehicle;
The position sparse reward r_2 is defined in terms of N_barrier, the total number of obstacles in the environment, and level, the number of obstacles the unmanned aerial vehicle has already passed;
The collision reward is a sparse reward used to evaluate whether the unmanned aerial vehicle has collided; the unmanned aerial vehicle obtains the collision reward r_3 during the flight process;
The velocity reward r_4 is defined as r_4 = r' + r'', where v represents the current speed of the unmanned aerial vehicle, v_limit represents the set minimum speed of the unmanned aerial vehicle, and the velocity component of the unmanned aerial vehicle along the y_e axis of the earth coordinate system o_e x_e y_e z_e is also used;
In summary, the overall reward function comprises the position rewards r_1 and r_2, the collision reward r_3 and the velocity reward r_4, i.e. R = r_1 + r_2 + r_3 + r_4.
4. The SAC algorithm-based unmanned aerial vehicle layered flight decision method according to claim 1, wherein:
The SAC algorithm hierarchical decision model training specifically comprises the following steps:
S41: setting the entropy regularization coefficient α, the learning rate lr, the experience pool size, the batch training sample number batch_size and the number of training rounds; initializing the unmanned aerial vehicle and acquiring the environment state information, namely the gray image information acquired by the camera and the flight state of the unmanned aerial vehicle itself, as the decision initial state s_t;
S42: initializing the experience pool D; randomly generating the Actor network weights φ and the Critic network weights θ_1, θ_2; initializing the Actor network π_φ and the Critic networks Q_{θ_1}, Q_{θ_2}; letting the target Critic network weights be θ_1' = θ_1 and θ_2' = θ_2 and initializing the target Critic networks Q_{θ_1'} and Q_{θ_2'};
S43: inputting the state information s_t into the Actor network to obtain a Gaussian policy distribution with mean μ and variance σ; obtaining the unmanned aerial vehicle decision action a_t ~ π_φ(a_t|s_t) by random sampling from this policy distribution; after the unmanned aerial vehicle executes the action a_t, obtaining the next-time state s_{t+1}, computing the reward r_t = r(s_t, a_t) with the reward function designed above, and storing the decision data (s_t, a_t, r_t, s_{t+1}) in the experience pool D;
S44: when the number of experiences in the experience pool exceeds batch_size, randomly extracting a batch of batch_size experience samples M as the training data of the SAC algorithm; during training, performing gradient descent with learning rate lr on the Actor network loss function J_π(φ) and the Critic network loss functions J_Q(θ_i), i = 1, 2, so as to update the weights of the Actor network and the Critic networks;
S45: judging whether the model has converged, the convergence condition being that the reward value obtained by the unmanned aerial vehicle in each round has become stable or that the set number of training rounds has been reached; if converged, finishing the training and obtaining the trained unmanned aerial vehicle flight decision model; if not, repeating steps S41 to S45.
CN202210594910.1A 2022-05-27 2022-05-27 Unmanned aerial vehicle layered flight decision method based on SAC algorithm Active CN115185288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210594910.1A CN115185288B (en) 2022-05-27 2022-05-27 Unmanned aerial vehicle layered flight decision method based on SAC algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210594910.1A CN115185288B (en) 2022-05-27 2022-05-27 Unmanned aerial vehicle layered flight decision method based on SAC algorithm

Publications (2)

Publication Number Publication Date
CN115185288A CN115185288A (en) 2022-10-14
CN115185288B true CN115185288B (en) 2024-05-03

Family

ID=83513772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210594910.1A Active CN115185288B (en) 2022-05-27 2022-05-27 Unmanned aerial vehicle layered flight decision method based on SAC algorithm

Country Status (1)

Country Link
CN (1) CN115185288B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113093802A (en) * 2021-04-03 2021-07-09 西北工业大学 Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
CN113095481A (en) * 2021-04-03 2021-07-09 西北工业大学 Air combat maneuver method based on parallel self-game
CN113190039A (en) * 2021-04-27 2021-07-30 大连理工大学 Unmanned aerial vehicle acquisition path planning method based on hierarchical deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10931687B2 (en) * 2018-02-20 2021-02-23 General Electric Company Cyber-attack detection, localization, and neutralization for unmanned aerial vehicles

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113093802A (en) * 2021-04-03 2021-07-09 西北工业大学 Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
CN113095481A (en) * 2021-04-03 2021-07-09 西北工业大学 Air combat maneuver method based on parallel self-game
CN113190039A (en) * 2021-04-27 2021-07-30 大连理工大学 Unmanned aerial vehicle acquisition path planning method based on hierarchical deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
UAV route autonomous guidance and maneuver control decision algorithm based on deep reinforcement learning; 张堃; 李珂; 时昊天; 张振冲; 刘泽坤; Systems Engineering and Electronics; 2020-06-24 (No. 07); full text *
End-to-end autonomous driving decision based on deep reinforcement learning; 黄志清; 曲志伟; 张吉; 张严心; 田锐; Acta Electronica Sinica; 2020-09-15 (No. 09); full text *

Also Published As

Publication number Publication date
CN115185288A (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN114895697B (en) Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm
CN110673620B (en) Four-rotor unmanned aerial vehicle air line following control method based on deep reinforcement learning
CN110806756B (en) Unmanned aerial vehicle autonomous guidance control method based on DDPG
CN109343341B (en) Carrier rocket vertical recovery intelligent control method based on deep reinforcement learning
CN109625333B (en) Spatial non-cooperative target capturing method based on deep reinforcement learning
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
Imanberdiyev et al. Autonomous navigation of UAV by using real-time model-based reinforcement learning
CN111880567B (en) Fixed-wing unmanned aerial vehicle formation coordination control method and device based on deep reinforcement learning
CN111240356B (en) Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning
Polvara et al. Autonomous vehicular landings on the deck of an unmanned surface vehicle using deep reinforcement learning
CN112925319B (en) Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning
CN114253296B (en) Hypersonic aircraft airborne track planning method and device, aircraft and medium
CN113534668B (en) Maximum entropy based AUV (autonomous Underwater vehicle) motion planning method for actor-critic framework
CN115509251A (en) Multi-unmanned aerial vehicle multi-target cooperative tracking control method based on MAPPO algorithm
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN116242364A (en) Multi-unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN115755956B (en) Knowledge and data collaborative driving unmanned aerial vehicle maneuvering decision method and system
CN115033022A (en) DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform
CN115373415A (en) Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN114355980B (en) Four-rotor unmanned aerial vehicle autonomous navigation method and system based on deep reinforcement learning
CN114237267A (en) Flight maneuver decision auxiliary method based on reinforcement learning
Deshpande et al. Developmental reinforcement learning of control policy of a quadcopter UAV with thrust vectoring rotors
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
CN117215197B (en) Four-rotor aircraft online track planning method, four-rotor aircraft online track planning system, electronic equipment and medium
CN115185288B (en) Unmanned aerial vehicle layered flight decision method based on SAC algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant