CN115826601A

CN115826601A - Unmanned aerial vehicle path planning method based on reverse reinforcement learning

Info

Publication number: CN115826601A
Application number: CN202211437557.2A
Authority: CN
Inventors: 杨秀霞; 张毅; 王晨蕾; 杨林; 李文强; 姜子劼; 于浩
Original assignee: Naval Aeronautical University
Current assignee: Naval Aeronautical University
Priority date: 2022-11-17
Filing date: 2022-11-17
Publication date: 2023-03-21

Abstract

The invention provides an unmanned aerial vehicle path planning method based on reverse reinforcement learning, and aims to solve the problems that a depth certainty strategy gradient algorithm is low in convergence speed, difficult in reward function setting and the like when an unmanned aerial vehicle safety collision avoidance path is planned. Firstly, acquiring a demonstration track data set of an expert-controlled UAV obstacle avoidance based on simulator software; secondly, a mixed sampling mechanism is adopted, high-quality expert demonstration track data are fused in self-exploration data, and network parameters are updated, so that the algorithm exploration cost is reduced; and finally, solving the optimal reward function implicit in the expert experience according to the maximum entropy reverse reinforcement learning algorithm, thereby solving the problem of difficult setting of the reward function in complex tasks. The comparison experiment result shows that the method can effectively improve the algorithm training efficiency and has better obstacle avoidance performance.

Description

Unmanned aerial vehicle path planning method based on reverse reinforcement learning

Technical Field

The invention belongs to the technical field of unmanned aerial vehicle path planning, and particularly relates to an unmanned aerial vehicle path planning method based on reverse reinforcement learning.

Background

With the further opening of the UAV (Unmanned Aerial Vehicle) field, the flight safety of the UAV is greatly threatened by the dense dynamic obstacles in complex environments such as cities, mountains and the like. Traditional path planning algorithms, such as heuristic algorithms like a, D, etc., and a graph theory-based through-view method, voronoi graph method, etc., can only deal with simple environments in which obstacle information is known in advance. However, the terrain of cities and mountains is complex and changeable, and the specific parameters of obstacles are difficult to obtain, so that the application range of the traditional obstacle avoidance algorithm is limited.

Different from the traditional path planning method, the navigation method based on reinforcement learning uses a learning mode of biological acquired perception development for reference, and continuously optimizes the obstacle avoidance strategy through interaction with the environment, so that the dependence on obstacle modeling and supervised learning is avoided, and the navigation method has strong generalization capability and robustness. Particularly, in recent years, the problem of 'exponential explosion' of a high-dimensional environment state space and a decision space is effectively solved by utilizing the strong perception and function fitting capability of deep reinforcement learning, and a new thought is provided for the path planning problem of the UAV in the dense dynamic obstacle environment. Deep reinforcement learning algorithms such as DDPG (Deep deterministic Policy gradient) algorithm, asynchronous advantageous AC (advanced AC) algorithm, TRPO (reliable Policy Optimization) algorithm and near-end Policy Optimization (PPO) algorithm are proposed in succession by Sliver, *** Deep Mind team, john Schulman of Berkeley university and OpenAI.

Although these methods have significant advantages in UAV path planning, it often requires exploration of large random obstacle environment samples to try new strategies, which are prone to fall into local optimality. Therefore, the invention provides a DDPG algorithm fusing expert experience loss, which introduces an expert demonstration track sample with a high reward value on the basis of a self-exploration sample so as to save an exploration space; meanwhile, an expert experience loss gradient function is introduced to optimize network parameters and obtain an optimal strategy.

The DDPG algorithm fused with the expert experience loss solves the problem of strategy iterative optimization, but the design of a reward function still has strong subjectivity, and the reward obtained through interaction with the environment is usually sparse, so that the algorithm is extremely difficult to converge during training, and the path planning effect is poor. When the expert finishes the obstacle avoidance task, the strategy of the expert is often optimal, so that the expert experience is learned from the expert demonstration track to construct the reward function, which is more suitable for the practical requirements than the artificially designed reward function form to a great extent, and under the condition of the given expert track, the algorithm for reversely deducing the reward function hidden in the expert experience is Inverse Reinforcement Learning (IRL). IRL can be divided into two broad categories, maximum margin and maximum entropy, but the method based on maximum margin tends to be ambiguous in that different reward functions with random preferences can be derived from the same expert strategy. The maximum entropy model is completely constructed based on known data (namely expert tracks), and unknown information is not distributed by any subjective assumption, so that the ambiguity problem is effectively avoided.

Disclosure of Invention

In order to overcome the problems in the prior art, the invention provides an unmanned aerial vehicle path planning method based on reverse reinforcement learning.

The technical scheme for solving the technical problems is as follows:

an unmanned aerial vehicle path planning method based on reverse reinforcement learning comprises the following steps:

step 1, collecting an expert demonstration track data set and a self-exploration track data set of an expert operating UAV obstacle avoidance;

step 2, constructing an experience pool, wherein the experience pool is composed of an expert demonstration track data set and a self-exploration track data set, and a mixed sampling mechanism is adopted to sample the two data sets respectively to form a final training sample;

step 3, based on the DDPG, introducing an expert experience loss function to guide iterative updating of DDPG parameters, and accelerating solving of an optimal strategy;

step 4, constructing a reward function, and solving the reward function based on a maximum entropy reverse reinforcement learning algorithm, namely solving a hidden probability model generating a track under the condition that the track is demonstrated by known experts;

and 5, training the DDPG until the DDPG completes the flight task with the optimal strategy under the optimal reward function hidden by the expert track.

Further, the establishing of the experience pool in the step 2 specifically includes the following steps:

experience pool demonstration of trajectory data set T by expert _expert And self-exploring trajectory dataset T _discover And a mixed sampling mechanism is adopted to sample from the two data sets respectively to form a final training sample T:

T＝α·T _expert +β·T _discover (1)

where α is the slave training set T _expert Specific gravity of middle sampling, beta is from training set T _discover Specific gravity of medium sample.

Further, the DDPG algorithm guided by the expert experience loss function introduced in the step 3 comprises an online policy network mu (s | theta) ^μ ) On-line value function network Q (s, a | theta [ ]) ^Q ) Target policy network μ' (s | θ) ^μ' ) And a network of target value functions Q' (s, a | theta) ^Q' )。

Further, an online value function network Q (s, a | θ) ^Q ) The optimization of the parameters specifically comprises the following steps:

according to the Bellman equation, at the ith training time step, the action target value y of the online value function network _i Comprises the following steps:

y _t ＝r _t +γQ'(s _t+1 ,μ'(s _t+1 |θ ^μ' )|θ ^Q' ) (2)

the action target value and the actual output Q(s) of the online value function network _i ,a _i |θ ^Q ) Error delta between _i Comprises the following steps:

δ _i ＝y _i -Q(s _i ,a _i |θ ^Q ) (3)

substituting equation (3) into equation (2) yields the loss function of the online value function network:

minimizing the loss function J (theta) by gradient descent method ^Q ) Function network parameter theta to online values ^Q Performing optimization updating to make J (theta) ^Q ) For network parameter theta ^Q Derivation to know the gradient value

Comprises the following steps:

the updating of the network parameters of the online value function is performed according to equation (5).

Further, the optimization of the online policy network parameters specifically comprises the following steps:

optimizing the network parameters of the online strategy by two parts, namely an expert demonstration track sample and a self-exploration sample;

for expert demonstration trajectory data, an online policy network is based on the current expert state

Predicted just-in-time strategy a _i With real expert strategy

Mean square error of (J) _exp (θ ^μ ) The prediction output strategy of the network is continuously driven to the expert strategy by introducing the expert experience loss:

in the formula (I), the compound is shown in the specification,

basing online policy networks on current expert state

A predicted just-in-time policy;

make expert experience lose J _exp (θ ^μ ) The policy network parameter θ ^μ Obtaining the gradient value by derivation

Is composed of

On-line strategy gradient value according to original DDPG algorithm

Updating the parameter theta ^μ ：

Using a fusion gradient

Updating parameters of the online policy network:

wherein, lambda is a fusion gradient regulatory factor.

Further, the updating of the target network parameters is based on the online network parameters and adopts a soft updating mode:

wherein τ <1.

Further, the constructing the reward function in step 4 includes the following steps:

the known experts manipulate UAV obstacle avoidance generated trajectory ζ:

ζ＝{(s ₁ ,a ₁ ),(s ₂ ,a ₂ ),…(s _n ,a _n )} (11)

the reward value r (ζ) for that track is then:

fitting the reward function with a linear combination of a limited number of significant feature functions f (-) then

In the formula (f) _i Is the i-th characteristic component of the reward function, θ _i Is the ith component of the reward function weight vector; n is the number of the feature vectors in the reward function;

if the Euclidean distance d of the UAV relative to the obstacle, the relative distance heading angle psi _d Angle of climbing with relative distance

The moving speed v and the heading angle psi of the UAV relative to the obstacle _v Angle of climb of relative movement speed

The information belongs to important characteristics in the UAV obstacle avoidance process, so

Definition F (ζ) is the sum of the characteristic components of the respective states in equation (14), in the specific form:

substituting equation (15) into equation (13), the prize value for each track is expressed as:

r(ζ)＝θ ^T F(ζ) (16)。

further, the step 4 of solving the reward function based on the maximum entropy inverse reinforcement learning algorithm specifically includes the following steps:

given m expert tracks, the characteristic expectation of an expert is:

with the expert trajectories known, assume a potential probability distribution of p (ζ) _i | θ), then the expert trajectory is characterized by:

the maximum entropy model is constructed based on the expert track in the formula, and the problem of solving the maximum entropy is converted into an optimization problem:

wherein p = p (ζ) _i |θ)；

Converting the optimization problem into a dual form:

in the formula, λ _j 、λ ₀ Is a lagrange multiplier;

let the loss function L (p) derive the expert demonstration trajectory distribution probability p, and obtain:

and (3) making the above expression equal to 0, obtaining a maximum entropy probability model of the expert demonstration track:

in the formula of _j A weight vector theta corresponding to a feature function in the reward function;

in the formula, Z (theta) is a distribution item, namely the sum of all possible expert track probabilities;

in the probability model shown in the above formula, the higher the probability of the occurrence of the expert trajectory, i.e. the higher the Z (θ), the closer the reward function setting is to the optimal strategy implied in the expert example; the method can convert the optimal reward function solving into the entropy of the maximized expert track distribution for optimization:

and (3) converting the above equation into a minimized negative log-likelihood function solution loss quantity of the characteristic component weight theta of the reward function:

and (3) calculating an expert track prediction distribution function Z (theta) under the current strategy:

in the formula, T _samp To representThe expert tracks under the current strategy, and n represents the number of the expert tracks under the current strategy;

for continuous expert state in sampled expert track

And corresponding real expert strategy

Discretization is performed, and random batch sampling is performed from the discretization, and the formula (25) is converted into:

in the above equation, the loss function J (θ) is:

the loss function J (θ) is differentiated from the weight θ of the reward function, and the optimal reward function is solved by a gradient descent method, so that:

in conclusion, the global optimal solution r of the reward function is finally learned through the equation (29) ^* (s _i ,a _i )。

Further, the training of the DDPG in the step 5 until the DDPG completes the flight mission with an optimal strategy under an optimal reward function implied by the expert trajectory includes the following steps:

randomly initializing online policy network Q (s, a | θ) ^Q ) And an online value function network mu (s | theta) ^μ ) Network parameter θ of ^μ And theta ^Q Initializing the target networks mu 'and Q' and their weights, initializing the reward function weight theta _i (ii) a Initializing an experience pool, and demonstrating a track data set T by the acquired experts _expert Storing the data into an experience pool;

a) The online strategy network obtains an action a = pi theta(s) + eta based on the current state s _t Wherein η _t Is random noise, and the action selection strategy pi depends on the design of a reward function;

b) Performing an action a with the environment in an interactive manner to obtain a new state s' and an instant reward value r;

c) Self-exploring sample data (s, a, r, s') generated by interacting with environment, namely T _discover Storing the data into an experience pool;

d)s＝s′；

e) Randomly sampling N sample data from experience pool, training, estimating distribution function Z (theta) according to formula (26), minimizing target value shown in formula (28), and weighting reward function theta _i Optimizing to obtain an optimal reward function;

f) Updating the value function network parameter theta according to equation (5) ^Q If the training data T is equal to T _expert Then updating the strategy network parameters according to the formula (7), if the training data T is equal to T _discover Then updating the policy network parameter θ according to equation (8) ^μ ；

g) Updating the target network parameter θ according to equation (10) ^Q′ And theta ^μ′ ；

h) When s' is in the termination state, the current iteration is ended, otherwise go to step a).

Compared with the prior art, the invention has the following technical effects:

the invention provides an unmanned aerial vehicle path planning method based on reverse reinforcement learning, aiming at the problem of UAV path planning. Firstly, acquiring a demonstration track data set of an expert-controlled UAV obstacle avoidance based on simulator software; secondly, a mixed sampling mechanism is adopted, high-quality expert demonstration track data are fused in self-exploration data, and network parameters are updated, so that the exploration cost is reduced; and finally, solving the optimal reward function according to the maximum entropy reverse reinforcement learning algorithm, and solving the problems that the reward function is difficult to design and the experience of the invisible expert cannot be fully mined.

Drawings

FIG. 1 is a schematic diagram of a simulator assembly and data display according to the present invention;

FIG. 2 is a block diagram of simulator training of the present invention;

FIG. 3 is a diagram of an expert demonstration trace data set of the present invention;

FIG. 4 is a schematic diagram of a testing environment of the present invention;

FIG. 5 is a block diagram of a hybrid sampling mechanism of the present invention;

FIG. 6 is a diagram of the improved DDPG algorithm training framework of the present invention;

FIG. 7 is a graph of reward values for fused expert experience loss in accordance with the present invention;

FIG. 8 is a graph of reward values for the present invention based on maximum entropy inverse reinforcement learning;

figure 9 is a graph of prize values for the integrated improved DDPG algorithm of the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.

In order to solve the problems of low convergence speed, difficult reward function setting and the like when a depth certainty strategy gradient algorithm is used for planning a safe collision avoidance path of an unmanned aerial vehicle, the invention provides an unmanned aerial vehicle path planning method based on reverse reinforcement learning, which comprises the following steps: step 1, collecting an expert demonstration track data set and a self-exploration track data set of an expert operating UAV obstacle avoidance; step 2, establishing an experience pool, wherein the experience pool is composed of an expert demonstration track data set and a self-exploration track data set, and a mixed sampling mechanism is adopted to sample the two data sets respectively to form a final training sample; step 3, constructing the DDPG, introducing an expert experience loss function to guide iterative updating of DDPG parameters, and accelerating solving of an optimal strategy; step 4, constructing a reward function, and solving the reward function based on a maximum entropy reverse reinforcement learning algorithm, namely solving a hidden probability model generating a track under the condition that the track is demonstrated by known experts; and 5, training the DDPG until the DDPG completes the flight task with the optimal strategy under the optimal reward function hidden by the expert track.

The following is a description of the above steps:

acquiring an expert demonstration track based on a complex obstacle scene carried by a simulator Phoenix RC software, and constructing a simple obstacle scene based on Python to generate a self-exploration data sample by interaction of the UAV and the environment.

The acquisition of the expert demonstration track is based on professional radio control flight simulation software Phoenix RC developed by running game companies. Expert demonstration traces were collected in a Phoenix RC simulator, using mainly the following three components: 1) A remote controller module: matching each channel of the remote controller with the functions of the UAV model, including the rod stroke of a rudder, an elevator, an accelerator and an aileron, so as to control the movement of the UAV; 2) A UAV model module: the device comprises a plurality of types such as fixed wings and rotary wings; 3) A scene module: hundreds of three-dimensional simulation obstacle environments are provided, and variable simulation reality such as wind power illumination can be customized. In a simulation environment, data such as UAV flight speed, course angle, GPS positioning, gyroscope, barometer and the like can be acquired through an interface function and displayed in real time. The components and data display are shown in FIG. 1.

The framework of the obstacle avoidance training of the manually manipulated UAV model in the Phoenix RC simulator is shown in fig. 2. After the environment state information is obtained in the three-dimensional simulated obstacle environment, the environment state information comprises obstacle positions, the distance between the UAV and the obstacles and the like, and experts manually operate the rudder, the elevator, the accelerator and the aileron stick stroke of the remote controller, and continuously adjust the course angle, the pitch angle, the flying speed and the like of the UAV model to avoid the obstacles.

Part of the expert demonstration traces collected from the Phoenix RC simulator are shown in figure 3. The benefits of collecting the obstacle environment data set in the simulator are: 1) Obstacle setting and scene types in the simulator are complex and changeable, and the degree of closeness to the real world is high; 2) Training is completely carried out in a simulation scene, and the unmanned aerial vehicle can be manually operated to simulate various different maneuvering actions so as to determine the optimal flight strategy; 3) The simulator visually displays the monocular RGB image at each moment in the obstacle avoidance process and parameters such as the course angle, the flight speed and the like of the UAV without using a complex sensor for sensing and measuring; 4) Without concern for collision damage and safety issues.

And constructing a three-dimensional obstacle environment Q shown in FIG. 4 for testing the performance of the algorithm, so that the UAV generates a self-exploration data set in interaction with the environment. The obstacle environment is L in length, W in width and H in height, dynamic and static obstacles with different threat degrees exist in the environment, and the space position, the movement speed and the influence range of the obstacles are unknown.

And 2, constructing an experience pool, wherein the experience pool is formed by an expert demonstration track data set and a self-exploration track data set, and a mixed sampling mechanism is adopted to sample the two data sets respectively to form a final training sample.

In the UAV obstacle avoidance training, in order to avoid resource waste caused by random inefficient exploration in the UAV initial training phase, diversification of samples is realized as much as possible, and further, the upper limit implied by the expert strategy is broken through, as shown in fig. 5, in the experience pool of the present embodiment, an expert demonstrates a track data set T _expert And self-exploring trajectory dataset T _discover And a mixed sampling mechanism is adopted to sample from the two data sets respectively to form a final training sample T:

T＝α·T _expert +β·T _discover (1)

And 3, introducing an expert experience loss function to guide iterative updating of DDPG parameters based on DDPG, and accelerating solving of an optimal strategy.

Aiming at the defects of large exploration space and low sample reward value in the initial stage of the original DDPG algorithm, the improved DDPG algorithm optimization strategy iteration fusing expert experience loss is provided. Training samples of an original DDPG algorithm only contain a data set generated by interacting with the environment and self-exploring, and the embodiment adopts a mixed sampling mechanism and introduces part of expert demonstration track samples on the basis of self-exploring samples. For an expert track data set, introducing an expert experience loss function to guide iterative updating of DDPG parameters, and accelerating to solve an optimal strategy; the self-exploring data samples are still updated according to the original DDPG algorithm.

The DDPG algorithm introduced with the guidance of the expert experience loss function comprisesOnline policy network mu (s | theta) ^μ ) An online value function network Q (s, a | theta |) ^Q ) Target policy network mu' (s | theta) ^μ' ) And a target value function network Q' (s, a | theta) ^Q' ) And fourthly, the method comprises the following steps.

y _t ＝r _t +γQ'(s _t+1 ,μ'(s _t+1 |θ ^μ ')|θ ^Q' ) (2)

δ _i ＝y _i -Q(s _i ,a _i |θ ^Q ) (3)

substituting equation (3) into equation (2) can yield the loss function of the online value function network:

Comprises the following steps:

The optimization of the online strategy network parameters is divided into two parts, namely an expert demonstration track sample and a self-exploration sample. For expert demonstration trajectory data, the online policy network may be based on the current expert state

Predicted just-in-time strategy a _i With real expert strategy

Mean square error of (J) _exp (θ ^μ ) The strategy network is introduced as expert experience loss, so that the prediction output strategy of the network continuously tends to the expert strategy:

in the formula (I), the compound is shown in the specification,

basing online policy networks on current expert state

Predicted immediate policy.

Is composed of

Due to the fact that the track of the expert strategy is limited, the whole state and action space cannot be covered, the UAV can explore a larger space in the interaction with the environment, the upper limit implied by the expert strategy is broken through, and the stability of the algorithm is improved. Therefore, when the expert experience loss gradient optimization online strategy iterative process is introduced, the online strategy gradient value of the self-exploration track data set and the original DDPG algorithm is kept

Updating the parameter θ ^μ ：

Adopting an online strategy gradient containing expert experience loss

And original online policy gradient

The expert experience loss function method introduces high-quality expert strategies to save exploration space in the initial stage and improve algorithm convergence efficiency on one hand, and continuously learns in self exploration on the other hand so as to try to acquire better strategies which are not involved in expert tracks. Finally, the fusion gradient was used according to the following formula

Updating parameters of the online policy network:

wherein, lambda is a fusion gradient regulatory factor.

The updating of the target network parameters is based on the online network parameters and adopts a soft updating mode:

wherein τ <1.

And 4, constructing a reward function, and solving the reward function based on a maximum entropy reverse reinforcement learning algorithm, namely solving an implicit probability model for generating the track under the condition that the track is demonstrated by known experts.

The known expert controls a track ζ generated by UAV obstacle avoidance:

ζ＝{(s ₁ ,a ₁ ),(s ₂ ,a ₂ ),…(s _n ,a _n )} (11)

the reward value r (ζ) for that track is then:

In the formula (f) _i Is the i-th characteristic component of the reward function, θ _i Is the ith component of the reward function weight vector; n is the number of feature vectors in the reward function.

In the process of operating the UAV obstacle avoidance by the expert, the expert operator often makes a decision according to the current UAV flight speed, the azimuth distance between the UAV and the obstacle, and the like. Thus, the Euclidean distance d of the UAV from the obstacle, the relative distance heading angle ψ _d Angle of climbing with relative distance

substituting equation (15) into equation (13), the prize value for each track can be expressed as:

r(ζ)＝θ ^T F(ζ) (16)

given m expert trajectories, the characteristic expectations of the experts are:

the maximum entropy model is constructed completely based on known data (namely expert tracks) in the formula, and no subjective assumption is made on unknown conditions, so that the ambiguity problem existing in the user-defined reward function can be effectively avoided. Converting the solution of the maximum entropy problem into an optimization problem:

wherein p = p (ζ) _i |θ)。

Converting the optimization problem into a dual form:

in the formula, λ _j 、λ ₀ Is a lagrange multiplier.

in the formula of _j A weight vector theta corresponding to a feature function in the reward function.

In the formula, Z (θ) is a distribution term, i.e., the sum of all possible expert trajectory probabilities.

In the probability model shown in the above formula, the greater the probability of occurrence of the expert trajectory, i.e., the greater the Z (θ), the closer the reward function setting is to the optimal strategy implied in the expert example. The method can convert the optimal reward function solving into the entropy of the maximized expert track distribution for optimization:

in the formula, T _samp Representing the expert tracks under the current strategy and n representing the number of expert tracks under the current strategy.

Since there is a certain difference in expert cognition, to reduce the fitting variance of the weight theta, the continuous expert states in the sampled expert trajectory are subjected to

And corresponding real expert strategy

in the above equation, the loss function J (θ) is:

in conclusion, the global optimal solution r of the reward function can be finally learned through the equation (29) ^* (s _i ,a _i )。

The obstacle avoidance problem using the modified DDPG algorithm can be described as: at a series of successive decision moments, the network makes a decision based on the UAV current state s; after the decision is implemented, the network acquires an instant reward value according to a reward function designed by reverse reinforcement learning, the reward corresponds to a network decision and an environment state, and then the network enters a state at the next moment corresponding to the decision and forwards updates network parameters through a DDPG algorithm for fusing expert supervision loss; and in the new training time, the network executes a new decision according to the current new state and obtains a new reward value, and the operation is repeated in a circulating way until the network completes the flight mission with the optimal strategy under the optimal reward function hidden in the expert track.

The specific training comprises the following steps:

1. randomly initializing an online policy network mu (s | theta) ^μ ) On-line value function network Q (s, a | theta [ ]) ^Q ) Network parameter θ of ^μ And theta ^Q Initializing the target networks mu 'and Q' and the weights thereof;

2. constructing a reward function and initializing the weight of the reward function;

3. initializing an experience pool, and demonstrating a track data set T by the acquired experts _expert Storing the data into an experience pool;

4. performing network training with iteration number M;

c) Self-exploring sample data (s, a, r, s') generated by interaction with environment, namely T _discover Storing the data into an experience pool;

d)s＝s′；

e) Randomly sampling N sample data from experience pool, training, estimating distribution function Z (theta), minimizing target value shown in formula (28), and weighting reward function theta _i Optimizing to obtain an optimal reward function;

h) When s' is in the termination state, the current iteration is ended, otherwise, the step a) is carried out.

And testing the obstacle avoidance performance of the algorithm in the simulation environment. The simulation experiment environment is Python 3.7, intercore i5 processor with 2.42GHz main frequency and Windows 10 operating system. The test scene is a dense dynamic and static obstacle environment as shown in fig. 4, a three-dimensional area with a simulation environment of 200 × 250 × 400 is set, and a plurality of dynamic obstacles moving at different movement speeds along different heading angles and climbing angle directions exist in the area.

In order to test the strategy learning performance of the DDPG algorithm fusing the expert experience loss, the invention compares and tests the influence of the improved algorithm and the original algorithm on the network convergence speed under the condition of keeping the consistency of the reward function. The design reward function is shown in equation (30), where R is the reward function and d _min Is d in formula (14) _i The minimum of (d), i.e., the distance between the UAV and the nearest obstacle, thereby urging the UAV away from the obstacle; d _tar The distance between the unmanned aerial vehicle and the terminal point is obtained, so that the UAV is promoted to fly towards the terminal point direction; k is a radical of ₁ ＝2，k ₂ ＝0.5。

The reward values during UAV training are shown in figure 7. As can be seen from the figure, the strategy learning speed of the UAV adopting the original DDPG algorithm rises approximately linearly in the iteration of 100 to 200 rounds; but in 200 to 600 iterations, the model falls into a local optimal solution, resulting in difficulty in network convergence; at 600 to 1600 rounds, the strategy learning speed continues to increase, but the speed increase is slowed down; the prize values converge gradually after 1600 iterations, roughly settling at 110, but with large fluctuations. By adopting a DDPG algorithm fused with expert experience loss, the UAV continuously tells the increase of the strategy learning speed in 100 to 300 rounds; after 300 iterations, the speed increase is slowed down to some extent, but no stagnation condition occurs; after 1300 iterations the network converged and the reward value was approximately stable at 125, higher than the original DDPG algorithm, and the fluctuation amplitude was small. In conclusion, the DDPG algorithm fused with experience loss has the advantages of higher convergence speed, better stability and better obstacle avoidance effect.

For the analysis of the influence of the maximum entropy inverse reinforcement learning algorithm on the obstacle avoidance effect, the reward value curve of the optimal reward function solved by the UAV based on the maximum entropy inverse reinforcement learning is shown in fig. 8. The UAV trained by the optimal reward function keeps increasing at a high speed in the strategy learning speed within 100 to 500 rounds; after 500 iterations, the acceleration rate is slightly slowed down; the prize value converges to approximately 170 after 1300 iterations. Compared with the original DDPG algorithm, the UAV reward value trained by the optimal reward function is obviously higher, the strategy learning speed is higher, and the stability is better.

And (3) integrating the expert experience loss and the reverse reinforcement learning to improve the DDPG algorithm. Figure 9 shows a prize value curve for the modified DDPG algorithm that combines expert experience loss and reverse reinforcement learning. It can be known from the figure that, combining the two improved DDPG algorithms, the two improved DDPG algorithms combine their respective advantages, so that not only the strategy learning speed is faster in the initial stage, but also the obstacle avoidance effect is better, and the reward value can converge to 175 in about 1300 iterations.

The invention aims at the UAV path planning problem and improves the DDPG algorithm. By introducing an expert experience loss function optimization strategy network iteration process, the initial exploration cost of the original DDPG algorithm is saved, and the network convergence speed is accelerated; meanwhile, the optimal reward function hidden in the expert demonstration track is solved based on the maximum entropy reverse reinforcement learning algorithm, and the problem that the reward function is difficult to set manually in a complex task is solved. The contrast experiment shows that the improved DDPG algorithm has higher convergence speed and better obstacle avoidance effect.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. An unmanned aerial vehicle path planning method based on reverse reinforcement learning is characterized by comprising the following steps:

step 1, collecting an expert demonstration track data set and a self-exploration track data set of an UAV obstacle avoidance operated by an expert;

2. The method for unmanned aerial vehicle path planning based on reverse reinforcement learning according to claim 1, wherein the step 2 of constructing the experience pool specifically comprises the following steps:

T＝α·T _expert +β·T _discover (1)

3. The unmanned aerial vehicle path planning method based on reverse reinforcement learning of claim 1, wherein the DDPG algorithm guided by the expert experience loss function introduced in the step 3 comprises an online policy network μ (s | θ |) ^μ ) On-line value function network Q (s, a | theta [ ]) ^Q ) Target policy network mu' (s | theta) ^μ' ) And a target value function network Q' (s, a | theta) ^Q' )。

4. The unmanned aerial vehicle path planning method based on reverse reinforcement learning of claim 3, wherein the online value function network Q (s, a | θ) ^Q ) The optimization of the parameters specifically comprises the following steps:

training at the ith according to Bellman's equationAction target value y of time step, on-line value function network _i Comprises the following steps:

y _t ＝r _t +γQ'(s _t+1 ,μ'(s _t+1 |θ ^μ' )|θ ^Q' ) (2)

δ _i ＝y _i -Q(s _i ,a _i |θ ^Q ) (3)

Comprises the following steps:

5. The unmanned aerial vehicle path planning method based on reverse reinforcement learning according to claim 3, wherein the optimization of the online policy network parameters specifically comprises the following steps:

Predicted just-in-time strategy a _i With real expert strategy

in the formula (I), the compound is shown in the specification,

basing online policy networks on current expert state

A predicted just-in-time policy;

Is composed of

On-line strategy gradient value according to original DDPG algorithm

Updating the parameter θ ^μ ：

Using a fusion gradient

Updating parameters of the online policy network:

wherein, lambda is a fusion gradient regulatory factor.

6. The unmanned aerial vehicle path planning method based on reverse reinforcement learning of claim 3, wherein the update of the target network parameters is based on online network parameters in a soft update mode:

wherein τ <1.

7. The method of claim 6, wherein the constructing the reward function in step 4 comprises the following steps:

the known experts manipulate UAV obstacle avoidance generated trajectory ζ:

ζ＝{(s ₁ ,a ₁ ),(s ₂ ,a ₂ ),…(s _n ,a _n )} (11)

the reward value r (ζ) for that track is then:

Velocity v of UAV relative to an obstacle, relative velocity heading angle psi _v Angle of climb of relative movement speed

r(ζ)＝θ ^T F(ζ) (16)。

8. the method for unmanned aerial vehicle path planning based on inverse reinforcement learning of claim 7, wherein the step 4 of solving the reward function based on the maximum entropy inverse reinforcement learning algorithm specifically comprises the following steps:

given m expert tracks, the characteristic expectation of an expert is:

wherein p = p (ζ) _i |θ)；

Converting the optimization problem into a dual form:

in the formula, λ _j 、λ ₀ Is a lagrange multiplier;

in the formula, T _samp Representing expert tracks under the current strategy, wherein n represents the number of the expert tracks under the current strategy;

for continuous expert state in sampled expert track

And corresponding real expert strategy

in the above equation, the loss function J (θ) is:

9. The unmanned aerial vehicle path planning method based on reverse reinforcement learning of claim 8, wherein training the DDPG in the step 5 until the DDPG completes the flight mission with an optimal strategy under an optimal reward function implied by the expert trajectory comprises the following steps:

randomly initializing online policy network Q (s, a | θ) ^Q ) And an online value function network mu (s | theta) ^μ ) Network parameter θ of ^μ And theta ^Q Initializing the target networks mu 'and Q' and their weights, initializing the reward function weights theta _i (ii) a Initializing an experience pool, and demonstrating a track data set T by the acquired experts _expert Storing the data into an experience pool;

c) Number of self-exploring samples to be generated by interaction with environmentAccording to (s, a, r, s'), i.e. T _discover Storing the data into an experience pool;

d)s＝s′；