CN115268494B - Unmanned aerial vehicle path planning method based on layered reinforcement learning - Google Patents

Unmanned aerial vehicle path planning method based on layered reinforcement learning Download PDF

Info

Publication number
CN115268494B
CN115268494B CN202210883240.5A
Authority
CN
China
Prior art keywords
algorithm
aerial vehicle
unmanned aerial
path
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210883240.5A
Other languages
Chinese (zh)
Other versions
CN115268494A (en)
Inventor
王琦
潘德民
王栋
高尚
于化龙
崔弘杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN202210883240.5A priority Critical patent/CN115268494B/en
Publication of CN115268494A publication Critical patent/CN115268494A/en
Application granted granted Critical
Publication of CN115268494B publication Critical patent/CN115268494B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses an unmanned aerial vehicle path planning method based on hierarchical reinforcement learning, which comprises the following steps. Step 1: initializing a deep Q network algorithm and a Q learning algorithm. Step 2: driving the unmanned aerial vehicle to move from a starting point to a target point, and training the deep Q network algorithm and the Q learning algorithm; when the unmanned aerial vehicle does not detect a dynamic obstacle during movement, the deep Q network algorithm plans the path; when the unmanned aerial vehicle detects a dynamic obstacle during movement, the Q learning algorithm plans the path. Step 3: repeating step 2 until training of the deep Q network algorithm and the Q learning algorithm is completed, then setting the actual coordinates, starting point coordinates and target point coordinates of the unmanned aerial vehicle, and planning a path with the trained deep Q network algorithm and Q learning algorithm. The method solves the problem that network fitting is easily disturbed by dynamic obstacles when a single algorithm is applied to a dynamic environment, and improves path planning performance.

Description

Unmanned aerial vehicle path planning method based on layered reinforcement learning
Technical Field
The invention relates to the technical field of unmanned aerial vehicle path planning, in particular to an unmanned aerial vehicle path planning method based on hierarchical reinforcement learning.
Background
In recent years, unmanned aerial vehicles have been widely applied in many military and civil fields, so the demand for autonomy has grown, and autonomous path planning for unmanned aerial vehicles has become a key research topic. At present, most research on unmanned aerial vehicle path planning focuses on static environments, and research on dynamic environments is relatively scarce. Among current path planning methods, reinforcement learning is a hot topic because of its reward-and-punishment mechanism and its ability to learn an optimal strategy autonomously through interaction with the environment. Q learning (Q-learning), the most classical reinforcement learning algorithm, is widely applied to the unmanned aerial vehicle path planning problem. However, because it learns from a table, Q learning cannot be applied to scenes with a complex environment or a large state space. Deep reinforcement learning, which combines reinforcement learning with deep learning, has therefore been proposed and applied to various complex unmanned aerial vehicle path planning problems; the most widely used method is the deep Q network (DQN) algorithm.
However, the inventor finds that, in the unmanned aerial vehicle dynamic path planning problem based on the deep Q network algorithm, the reinforcement learning algorithm adopts an exploration strategy that selects actions at random, so efficiency in the early stage of training is low, the number of iterations is excessive, and the planned path is not optimal. This situation is more severe in complex environments where dynamic and static obstacles coexist. In addition, when a single deep Q network algorithm faces a dynamic environment, the position of a dynamic obstacle is not fixed, so the network fits poorly during training and the finally trained network performs poorly.
It can be seen that the prior art has the technical problems of low training efficiency and network fitting that is easily disturbed.
Disclosure of Invention
The invention provides an unmanned aerial vehicle path planning method based on hierarchical reinforcement learning, which aims to solve the prior-art problems of low training efficiency and network fitting that is easily disturbed.
The invention provides an unmanned aerial vehicle path planning method based on hierarchical reinforcement learning, which comprises the following steps:
Step 1: initializing a deep Q network algorithm and a Q learning algorithm;
Step 2: driving the unmanned aerial vehicle to move from a starting point to a target point, and training a deep Q network algorithm and a Q learning algorithm;
When the unmanned aerial vehicle does not detect a dynamic obstacle in the moving process, planning a path by using the deep Q network algorithm;
When the unmanned aerial vehicle detects a dynamic obstacle in the moving process, planning a path by using the Q learning algorithm;
Step 3: repeating step 2 until training of the deep Q network algorithm and the Q learning algorithm is completed, setting the actual coordinates, the starting point coordinates and the target point coordinates of the unmanned aerial vehicle, and planning a path through the trained deep Q network algorithm and Q learning algorithm.
Further, when the unmanned aerial vehicle does not detect a dynamic obstacle and the deep Q network algorithm plans the path, the method further comprises updating the Q learning algorithm with the experience tuple generated by the deep Q network algorithm for the currently planned path; in this case, the reward function used to update the deep Q network algorithm remains the same as in its normal update;
when the unmanned aerial vehicle detects a dynamic obstacle and the Q learning algorithm plans the path, the method further comprises updating the deep Q network algorithm with the experience tuple generated by the Q learning algorithm for the currently planned path.
Further, when the Q learning algorithm is updated with the experience tuple generated by the deep Q network algorithm for the currently planned path, the reward function used by the Q learning algorithm is:
reward = η(d_{s-1} − d_s)
where η is a constant, d_{s-1} is the distance from the unmanned aerial vehicle to the target point at the previous moment, and d_s is the distance from the unmanned aerial vehicle to the target point at the current moment.
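As an illustration only, a minimal Python sketch of this cross-update reward is given below, assuming 2-D coordinates and an illustrative value of η; the function and variable names are not part of the invention:

```python
import math

# Minimal sketch, assuming 2-D positions and an illustrative eta; not the patent's code.
def cross_update_reward(prev_pos, curr_pos, target, eta=1.0):
    """reward = eta * (d_{s-1} - d_s): positive when the UAV has moved closer to the target."""
    d_prev = math.dist(prev_pos, target)   # d_{s-1}: distance to the target at the previous moment
    d_curr = math.dist(curr_pos, target)   # d_s: distance to the target at the current moment
    return eta * (d_prev - d_curr)
```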
Further, in step 2, before the path is planned by the deep Q network algorithm and the Q learning algorithm, the method further includes using a heuristic fish algorithm as action guidance for the deep Q network algorithm and the Q learning algorithm in path planning. The heuristic fish algorithm comprises a traveling behavior process and a foraging behavior process: the traveling behavior process obtains the directions in which the unmanned aerial vehicle would collide with surrounding obstacles; the foraging behavior process obtains several high-priority directions in which the unmanned aerial vehicle advances toward the target point; the heuristic fish algorithm removes the collision directions from the high-priority directions and uses the result as action guidance.
Further, when the directions in which the unmanned aerial vehicle would collide with surrounding obstacles are obtained and an obstacle is dynamic, whether the unmanned aerial vehicle will collide with the obstacle is judged from the obstacle's movement direction and movement speed.
The invention has the beneficial effects that:
the invention adds the action guidance strategy of the heuristic fish algorithm to the action selection strategies of the basic deep Q network algorithm and Q learning algorithm. The guidance acts on two aspects, reaching the target point quickly and avoiding dynamic and static obstacles, and it greatly reduces unnecessary exploration in the early stage of training, reducing the blindness of the original algorithms' exploration.
The invention uses hierarchical reinforcement learning: when facing a dynamic, complex environment, two algorithms are used to handle static and dynamic obstacles, respectively. This design overcomes the problem that network fitting is easily disturbed by dynamic obstacles when a single algorithm is applied to a dynamic environment, and improves path planning performance.
These two effects respectively address the prior-art problems of low algorithm training efficiency and planned paths that lack safety consideration.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and should not be construed as limiting the invention in any way, in which:
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a schematic view of detection of a drone sensor in an environment according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a heuristic fish algorithm according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a foraging behavior of a heuristic fish algorithm according to an embodiment of the present invention;
fig. 5 is a schematic diagram illustrating a traveling behavior of a heuristic fish algorithm according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The embodiment of the invention provides an unmanned aerial vehicle path planning method based on hierarchical reinforcement learning, which has a flow structure shown in figure 1 and comprises the following steps:
Step 1: initialize the network parameters θ of the deep Q network algorithm, the experience replay buffer, and the Q table of the Q learning algorithm; initialize the number of training rounds N_episode, and set the starting point P_O and target point P_T of the unmanned aerial vehicle flight task;
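A minimal initialization sketch corresponding to this step is given below; the network architecture, state encoding and hyper-parameters are assumptions for illustration (Example 1 later uses a replay buffer of 1,000,000 and 500 training rounds), not values fixed by the invention:

```python
from collections import deque, defaultdict

import numpy as np
import torch.nn as nn

# Illustrative sketch of Step 1; architecture and state encoding are assumptions.
ACTIONS = 8                          # eight grid-direction actions
STATE_DIM = 4                        # assumed encoding: UAV (x, y) plus target (x, y)

q_network = nn.Sequential(           # deep Q network with parameters theta
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, ACTIONS),
)
replay_buffer = deque(maxlen=1_000_000)            # experience replay buffer
q_table = defaultdict(lambda: np.zeros(ACTIONS))   # Q table of the Q learning layer

N_episode = 500                      # maximum number of training rounds
P_O, P_T = (0, 0), (29, 29)          # start and target points of the flight task
```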
Step 2: while the number of completed training rounds is smaller than the set maximum number of rounds, reset the state and the environment and start training for this round. The sensor detects the environment and judges whether a dynamic obstacle exists within its detection range; the detection range of the sensor is shown in figure 2.
When the unmanned aerial vehicle does not detect a dynamic obstacle in the moving process, the deep Q network algorithm plans the path.
Using the heuristic fish algorithm as action guidance, the deep Q network algorithm selects and executes an action according to the current position of the unmanned aerial vehicle and the position information of the static obstacles, and then reaches the next state. The reward for the current action is obtained from a reward function; the embodiment of the invention defines the reward function of the static path planning part as a weighted combination of a goal-approach term and a static-obstacle term.
In this reward function, α and β are constants that determine the weights of the two reward terms; based on experimental tuning, this example sets α and β to 1.1 and 2, respectively. d_{s-1} denotes the distance between the unmanned aerial vehicle and the target point in the previous state, d_s the distance in the current state, and the remaining quantity is the distance from the unmanned aerial vehicle to each static obstacle.
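As an illustration only, one possible shape of such a reward, consistent with the description above but explicitly not the patent's exact formula, is sketched below; the penalty form for obstacle proximity is an assumption:

```python
import math

# Assumed illustrative shape of the static-planning reward; NOT the patent's exact formula.
def static_reward(prev_pos, curr_pos, target, static_obstacles,
                  alpha=1.1, beta=2.0, eps=1e-6):
    progress = math.dist(prev_pos, target) - math.dist(curr_pos, target)  # alpha-weighted goal-approach term
    nearest = min(math.dist(curr_pos, o) for o in static_obstacles)       # distance to the closest static obstacle
    return alpha * progress - beta / (nearest + eps)                      # assumed obstacle-penalty shape
```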
The experience tuple [S, A, R, S'] consisting of the current state, action, reward, and next state obtained in this interaction is stored in the experience replay buffer. The algorithm then samples data from the experience replay buffer according to the set batch size m to update the Q network of the deep Q network algorithm.
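A sketch of this store-and-sample update is shown below, reusing the `q_network` and `replay_buffer` names from the earlier initialization sketch; the optimiser, discount factor, and the absence of a separate target network are simplifying assumptions, not details stated in the text:

```python
import random

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)   # assumed optimiser
GAMMA = 0.9                                                      # assumed discount factor

def dqn_update(batch_size=16):
    """Sample a batch of [S, A, R, S'] tuples and take one gradient step on the Q network."""
    if len(replay_buffer) < batch_size:
        return
    idx = random.sample(range(len(replay_buffer)), batch_size)
    s, a, r, s_next = zip(*(replay_buffer[i] for i in idx))
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)

    q_sa = q_network(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(S, A)
    with torch.no_grad():
        target = r + GAMMA * q_network(s_next).max(dim=1).values  # bootstrapped target
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```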
Meanwhile, because the deep Q network algorithm and the Q learning algorithm are used in alternation, if each one simply stopped working while the other was active, Q values for some state-action pairs would be missing after training of the two algorithms is completed. To avoid this problem, while the deep Q network algorithm is working, the Q table of the Q learning algorithm is also updated with the experience tuples generated in the interaction of the previous step; since no dynamic obstacle is within the range of the unmanned aerial vehicle's sensor while the Q learning algorithm is not working, its reward function is defined as:
reward = η(d_{s-1} − d_s)
Finally, if the action taken by the unmanned aerial vehicle causes a collision, the current round ends and a new training round begins; otherwise, the current round of training continues.
When the unmanned aerial vehicle detects a dynamic obstacle in the moving process, the Q learning algorithm plans the path.
Using the heuristic fish algorithm as action guidance, the Q learning algorithm selects and executes an action according to the current position of the unmanned aerial vehicle and the information of the detected dynamic obstacle, and then reaches the next state. The embodiment of the invention defines the reward function of the dynamic path planning part as a weighted combination of a goal-approach term and a dynamic-obstacle-avoidance term.
In this reward function, γ and δ are weight constants, set to 1.1 and 1, respectively, based on experimental tuning; d'_{u→t} and d_{u→t} denote the distance between the unmanned aerial vehicle and the target point at the previous moment and the current moment, respectively; d'_{u→o} and d_{u→o} denote the distance between the unmanned aerial vehicle and the dynamic obstacle being avoided at the previous moment and the current moment, respectively.
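As with the static reward, only an assumed illustrative shape is sketched below, consistent with the description of the γ- and δ-weighted terms but not the patent's exact formula:

```python
# Assumed illustrative shape of the dynamic-planning reward; NOT the patent's exact formula.
def dynamic_reward(d_ut_prev, d_ut, d_uo_prev, d_uo, gamma=1.1, delta=1.0):
    approach_target = d_ut_prev - d_ut    # d'_{u->t} - d_{u->t}: progress toward the target
    escape_obstacle = d_uo - d_uo_prev    # d_{u->o} - d'_{u->o}: gain in distance from the obstacle
    return gamma * approach_target + delta * escape_obstacle
```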
The Q table of the Q learning algorithm is then updated with the experience tuple [S, A, R, S'] obtained from this interaction.
The network of the deep Q network algorithm is also updated with the experience tuple obtained in the interaction of the previous step; the reward function used here is the same as the one the deep Q network algorithm uses when it actually performs static path planning.
Finally, if the action taken by the unmanned aerial vehicle causes a collision, the current round ends and a new training round begins; otherwise, the current round of training continues.
Step 3: repeat step 2; the current round ends when the unmanned aerial vehicle reaches the target point. When the number of completed training rounds reaches the set maximum N_episode, training of the deep Q network algorithm and the Q learning algorithm is complete. The actual coordinates, starting point coordinates and target point coordinates of the unmanned aerial vehicle are then set, and a path is planned with the trained deep Q network algorithm and Q learning algorithm.
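The overall switching logic of steps 2 and 3 can be summarised in the following sketch; the callables passed in (environment, action selectors, update routines, heuristic fish guidance) are assumed interfaces standing in for the components described above, and only the detect-switch-update structure follows the text:

```python
def train(env, dqn_select, q_learning_select, heuristic_fish,
          dqn_update, q_table_update, n_rounds=500):
    """Hierarchical training loop sketch; the callables are assumed interfaces."""
    for episode in range(n_rounds):
        state, done = env.reset(), False                  # reset state and environment
        while not done:
            candidates = heuristic_fish(state, env)       # action guidance (Steps 21-24)
            if env.dynamic_obstacle_in_range(state):      # sensor check (assumed env API)
                action = q_learning_select(state, candidates)   # Q learning handles dynamic obstacles
            else:
                action = dqn_select(state, candidates)          # deep Q network handles static planning
            next_state, reward, done = env.step(action)
            experience = (state, action, reward, next_state)
            dqn_update(experience)       # both layers are always updated so that neither
            q_table_update(experience)   # loses Q values for states it did not act in
            state = next_state
```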
In step 2, before the path is planned by the deep Q network algorithm and the Q learning algorithm, the method further includes using the heuristic fish algorithm as action guidance for both algorithms in path planning. The heuristic fish algorithm is inspired by the way fish in nature can forage in dark environments using their lateral-line organs, and it comprises a traveling behavior process and a foraging behavior process: the traveling behavior process obtains the directions in which the unmanned aerial vehicle would collide with surrounding obstacles; the foraging behavior process obtains several high-priority directions in which the unmanned aerial vehicle advances toward the target point; the heuristic fish algorithm removes the collision directions from the high-priority directions and uses the result as action guidance. The algorithm flow is shown in fig. 3 and comprises the following steps:
Step 21: when the deep Q network algorithm or the Q learning algorithm calls the heuristic fish algorithm to select an action, the current state, the target point position, and the information on the dynamic and static obstacles are input into the heuristic fish algorithm. The experimental environment adopted by the invention is a grid environment in which the unmanned aerial vehicle can take actions in eight directions, and the heuristic fish algorithm is responsible for selecting the best actions for the current state.
Step 22: the foraging behavior calculates the set of selectable actions based on the current state and the target point location, as shown in fig. 4. Let L_{u→t} be the direction vector from the unmanned aerial vehicle's current position to the target point, and L_{horizontal} a unit vector along the unmanned aerial vehicle's forward direction; the angle θ_t between these two vectors is computed first.
Next, for each action ∈ A, let L_{action} be the unit direction vector of that action in the action space; the angle θ_action between each action direction and L_{horizontal} is computed in the same way.
The difference between θ_t and each θ_action is then calculated.
Finally, each action is given a priority, from high to low, in order of increasing difference, and the set of the five highest-priority actions is returned.
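A small sketch of this foraging behavior is given below, assuming a grid world with eight unit direction vectors and a fixed horizontal reference axis; it measures each direction's angle to the horizontal axis rather than writing out an explicit dot-product formula, and the action names are illustrative:

```python
import math

# Sketch of the foraging behaviour (Step 22); grid directions and names are assumptions.
ACTION_VECTORS = {
    "right": (1, 0), "right-front": (1, 1), "front": (0, 1), "left-front": (-1, 1),
    "left": (-1, 0), "left-rear": (-1, -1), "rear": (0, -1), "right-rear": (1, -1),
}

def angle_to_horizontal(vec):
    """Angle between vec and the horizontal unit vector L_horizontal = (1, 0)."""
    return math.atan2(vec[1], vec[0]) % (2 * math.pi)

def foraging_priorities(uav_pos, target_pos, top_k=5):
    """Rank actions by how closely their direction matches the UAV-to-target direction."""
    l_u_t = (target_pos[0] - uav_pos[0], target_pos[1] - uav_pos[1])
    theta_t = angle_to_horizontal(l_u_t)                 # angle of L_{u->t}
    diffs = []
    for name, vec in ACTION_VECTORS.items():
        theta_action = angle_to_horizontal(vec)
        diff = abs(theta_t - theta_action)
        diff = min(diff, 2 * math.pi - diff)             # wrap-around angular difference
        diffs.append((diff, name))
    diffs.sort()                                         # smaller difference => higher priority
    return [name for _, name in diffs[:top_k]]
```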
Step 23: the traveling behavior calculates the set of selectable actions that do not cause a collision according to the current state and the dynamic and static obstacle information, as shown in fig. 5, where gray squares represent static obstacles and hatched squares represent dynamic obstacles.
For static obstacle avoidance, the position information of the static obstacles is used: when executing an action would take the unmanned aerial vehicle into the area of a static obstacle, that action is marked as forbidden for the current state, and the available actions are returned.
For dynamic obstacle avoidance, the threat area of a dynamic obstacle at the next moment is predicted from the dynamic obstacle information set (speed, direction, position) detected by the sensor; when executing an action would take the unmanned aerial vehicle into that threat area, the action is marked as forbidden for the current state, and the available actions are returned.
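A sketch of this traveling behavior follows, assuming one-cell-per-step grid motion and a (speed, direction, position) encoding of each dynamic obstacle; the conservative threat prediction (all cells along the obstacle's motion up to its speed) and the reduced direction set are assumptions rather than details fixed by the text:

```python
# Sketch of the travelling behaviour (Step 23); motion model and encodings are assumptions.
DIRECTION_STEPS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}

def predict_threat_cells(dyn_obstacle):
    """Predict the cells a dynamic obstacle may occupy at the next moment."""
    speed, direction, (ox, oy) = dyn_obstacle
    dx, dy = DIRECTION_STEPS[direction]
    # conservatively occupy every cell along its motion up to `speed` cells away
    return {(ox + dx * k, oy + dy * k) for k in range(speed + 1)}

def traveling_allowed(uav_pos, static_obstacles, dyn_obstacles, action_vectors):
    """Return the actions that do not lead into a static cell or a predicted threat cell."""
    threat = set(map(tuple, static_obstacles))
    for obs in dyn_obstacles:
        threat |= predict_threat_cells(obs)
    allowed = []
    for name, (dx, dy) in action_vectors.items():
        nxt = (uav_pos[0] + dx, uav_pos[1] + dy)
        if nxt not in threat:
            allowed.append(name)
    return allowed
```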
Step 24: combining the actions returned in step 22 and step 23, several high-priority, collision-free actions are returned to the deep Q network algorithm or the Q learning algorithm, and the call ends.
The process of the specific embodiments is illustrated below by simulation:
example 1: hierarchical reinforcement learning
Step 1: initialize the network parameters of the deep Q network algorithm and an experience replay buffer of size 1,000,000; initialize the Q table of the Q learning algorithm. Set the total number of training rounds to 500, with the starting point P_O = [0, 0] and target point P_T = [29, 29] of the unmanned aerial vehicle flight task.
Step 2: the sensor detection range is set to 3 as shown in fig. 2.
If no dynamic obstacle exists within the unmanned aerial vehicle's current detection range, the deep Q network algorithm is called for static path planning, and the heuristic fish algorithm is then called for action selection. The unmanned aerial vehicle performs the selected action, enters the next state, and obtains a reward for the action. The algorithm stores the experience tuple in the experience replay buffer; meanwhile, a batch of m = 16 experiences is sampled from the replay buffer to update the network parameters, and the Q table of the Q learning algorithm is updated with the experience tuple.
If a dynamic obstacle exists within the detection range, as in the case of fig. 2, the Q learning algorithm is called for dynamic path planning. The heuristic fish algorithm is again called to select an action; the unmanned aerial vehicle then executes the selected action, enters the next state, and obtains the reward for the action. Finally, the Q table is updated with the experience tuple, and the network of the deep Q network algorithm is updated with the same tuple.
Step 3: while interacting with the environment, the unmanned aerial vehicle continuously cycles through detecting dynamic obstacles, switching algorithms, selecting an action, executing the action, calculating the reward, and updating the Q network or Q table, until it collides with an obstacle or reaches the target point, at which point the current round ends. When the total number of training rounds reaches the set N_episode, the whole training process ends.
Example 2: heuristic fish algorithm
Step 1: the heuristic fish algorithm is invoked by the deep Q network algorithm or the Q learning algorithm and receives as input the current state, the target point position, and the dynamic and static obstacle information. The heuristic fish algorithm then performs the foraging behavior and the traveling behavior to select the set of available actions.
Step 2: the foraging behavior calculates θ_t and each θ_action from the current state and the target point position, computes the difference between θ_t and each θ_action, assigns priorities to the eight actions according to these differences, and returns the five highest-priority actions. Referring to FIG. 4, the set of priority actions returned in this case is [front left, front right, rear left].
Step 3: the traveling behavior returns the actions that do not lead to a collision according to the static and dynamic obstacle information. For a static obstacle, whose position is fixed, any action that would enter its area is forbidden; for a dynamic obstacle, its position at the next moment is predicted from the set [speed, direction, position], and any action that would enter that area is forbidden. As shown in fig. 5, the gray box is a static obstacle and the hatched box is a dynamic obstacle; the dynamic obstacle's information is [1, left, current position], so its threat area at the next moment is the marked area in the figure. Finally, the collision-causing actions [left, rear right] are removed, and the remaining 6 actions are the selectable actions.
Step 4: combining the actions returned in Step 2 and Step 3, the returned set of selectable actions is [front left, front right, rear left], and the call ends.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations are within the scope of the invention as defined by the appended claims.

Claims (4)

1. The unmanned aerial vehicle path planning method based on hierarchical reinforcement learning is characterized by comprising the following steps of:
Step 1: initializing a deep Q network algorithm and a Q learning algorithm;
Step 2: driving the unmanned aerial vehicle to move from a starting point to a target point, and training a deep Q network algorithm and a Q learning algorithm;
when the unmanned aerial vehicle does not detect a dynamic obstacle in the moving process, planning a path by using the deep Q network algorithm, and updating the Q learning algorithm with the experience tuple generated by the deep Q network algorithm for the currently planned path;
when the unmanned aerial vehicle detects a dynamic obstacle in the moving process, planning a path by using the Q learning algorithm, and updating the deep Q network algorithm with the experience tuple generated by the Q learning algorithm for the currently planned path;
Step 3: repeating step 2 until training of the deep Q network algorithm and the Q learning algorithm is completed, setting the actual coordinates, the starting point coordinates and the target point coordinates of the unmanned aerial vehicle, and planning a path through the trained deep Q network algorithm and Q learning algorithm.
2. The hierarchical reinforcement learning-based unmanned aerial vehicle path planning method of claim 1, wherein, when the Q learning algorithm is updated with the experience tuple generated by the deep Q network algorithm for the currently planned path, the reward function used by the Q learning algorithm is:
reward = η(d_{s-1} − d_s)
wherein η is a constant; d_{s-1} is the distance from the unmanned aerial vehicle to the target point at the previous moment; and d_s is the distance from the unmanned aerial vehicle to the target point at the current moment.
3. The unmanned aerial vehicle path planning method based on hierarchical reinforcement learning according to claim 1, wherein in the step 2, before the path is planned by the deep Q network algorithm and the Q learning algorithm, the method further comprises: using a heuristic fish algorithm as action guidance for the deep Q network algorithm and the Q learning algorithm in path planning; wherein the heuristic fish algorithm comprises a traveling behavior process and a foraging behavior process, the traveling behavior process obtaining the directions in which the unmanned aerial vehicle would collide with surrounding obstacles, the foraging behavior process obtaining a plurality of high-priority directions in which the unmanned aerial vehicle advances toward the target point, and the heuristic fish algorithm removing the collision directions from the plurality of high-priority directions and using the result as action guidance.
4. The unmanned aerial vehicle path planning method based on hierarchical reinforcement learning according to claim 3, wherein, when the directions in which the unmanned aerial vehicle would collide with surrounding obstacles are obtained and an obstacle is dynamic, whether the unmanned aerial vehicle will collide with the obstacle is judged from the movement direction and movement speed of the obstacle.
CN202210883240.5A 2022-07-26 2022-07-26 Unmanned aerial vehicle path planning method based on layered reinforcement learning Active CN115268494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210883240.5A CN115268494B (en) 2022-07-26 2022-07-26 Unmanned aerial vehicle path planning method based on layered reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210883240.5A CN115268494B (en) 2022-07-26 2022-07-26 Unmanned aerial vehicle path planning method based on layered reinforcement learning

Publications (2)

Publication Number Publication Date
CN115268494A CN115268494A (en) 2022-11-01
CN115268494B true CN115268494B (en) 2024-05-28

Family

ID=83769868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210883240.5A Active CN115268494B (en) 2022-07-26 2022-07-26 Unmanned aerial vehicle path planning method based on layered reinforcement learning

Country Status (1)

Country Link
CN (1) CN115268494B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019147235A1 (en) * 2018-01-24 2019-08-01 Ford Global Technologies, Llc Path planning for autonomous moving devices
CN109992000A (en) * 2019-04-04 2019-07-09 北京航空航天大学 A kind of multiple no-manned plane path collaborative planning method and device based on Hierarchical reinforcement learning
CN113821041A (en) * 2021-10-09 2021-12-21 中山大学 Multi-robot collaborative navigation and obstacle avoidance method
CN114003059A (en) * 2021-11-01 2022-02-01 河海大学常州校区 UAV path planning method based on deep reinforcement learning under kinematic constraint condition
CN114529061A (en) * 2022-01-26 2022-05-24 江苏科技大学 Method for automatically predicting garbage output distribution and planning optimal transportation route
CN114527759A (en) * 2022-02-25 2022-05-24 重庆大学 End-to-end driving method based on layered reinforcement learning
CN114518770A (en) * 2022-03-01 2022-05-20 西安交通大学 Unmanned aerial vehicle path planning method integrating potential field and deep reinforcement learning
CN114625151A (en) * 2022-03-10 2022-06-14 大连理工大学 Underwater robot obstacle avoidance path planning method based on reinforcement learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
D3QHF: A Hybrid Double-deck Heuristic Reinforcement Learning Approach for UAV Path Planning; Demin Pan, et al.; IEEE; 2022-12-31; 1221-1226 *
Study on interface temperature control of laser direct joining of CFRTP and aluminum alloy based on staged laser path planning; Qi Wang, et al.; Optics and Laser Technology; 2022-06-09; Vol. 154; 1-13 *
Research on manned/unmanned aerial vehicle cooperative path planning based on MAXQ hierarchical reinforcement learning (基于MAXQ分层强化学习的有人机/无人机协同路径规划研究); 程先峰, 严勇杰; Informatization Research (信息化研究); 2020-02-29; Vol. 46, No. 1; 13-19 *
Research on event-driven reinforcement learning obstacle avoidance for unmanned aerial vehicles (基于事件驱动的无人机强化学习避障研究); 唐博文, et al.; Journal of Guangxi University of Science and Technology (广西科技大学学报); 2019-03-31; No. 1; 96-102 *
Quadrotor attitude control based on fractional-order MRAC (基于分数阶MRAC的四旋翼姿态控制); 陈开元, et al.; Electronics Optics & Control (电光与控制); 2021-12-31; Vol. 28, No. 12; 1-5 *
Path planning method for industrial robotic arms based on deep Q-learning (基于深度Q学习的工业机械臂路径规划方法); 王曌, 胡立生; Control and Instruments in Chemical Industry (化工自动化及仪表); No. 2; 141-145 *

Also Published As

Publication number Publication date
CN115268494A (en) 2022-11-01

Similar Documents

Publication Publication Date Title
CN108776483B (en) AGV path planning method and system based on ant colony algorithm and multi-agent Q learning
CN109765893B (en) Mobile robot path planning method based on whale optimization algorithm
Kurzer et al. Decentralized cooperative planning for automated vehicles with hierarchical monte carlo tree search
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN107229287A (en) A kind of unmanned plane global path planning method based on Genetic Ant algorithm
CN113741525B (en) Policy set-based MADDPG multi-unmanned aerial vehicle cooperative attack and defense countermeasure method
CN112269382B (en) Robot multi-target path planning method
CN113561986A (en) Decision-making method and device for automatically driving automobile
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN111723931B (en) Multi-agent confrontation action prediction method and device
CN114967721B (en) Unmanned aerial vehicle self-service path planning and obstacle avoidance strategy method based on DQ-CapsNet
CN112469050A (en) WSN three-dimensional coverage enhancement method based on improved wolf optimizer
CN115268494B (en) Unmanned aerial vehicle path planning method based on layered reinforcement learning
CN113467481B (en) Path planning method based on improved Sarsa algorithm
CN117705113A (en) Unmanned aerial vehicle vision obstacle avoidance and autonomous navigation method for improving PPO
Han et al. Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c
CN117109574A (en) Agricultural transportation machinery coverage path planning method
CN111562740B (en) Automatic control method based on multi-target reinforcement learning algorithm utilizing gradient
Xiao et al. Design of reward functions based on The DDQN Algorithm
CN112947421B (en) AUV autonomous obstacle avoidance method based on reinforcement learning
CN118051063B (en) Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle
CN110955239B (en) Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning
Miyashita et al. Flexible Exploration Strategies in Multi-Agent Reinforcement Learning for Instability by Mutual Learning
CN117782106A (en) Improved depth path planning method and system based on impulse neural network
US20230126696A1 (en) Lane change method and system, storage medium, and vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant