CN118051063B - Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle - Google Patents

Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle

Info

Publication number
CN118051063B
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
flight
training
obstacle avoidance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410452069.1A
Other languages
Chinese (zh)
Other versions
CN118051063A (en)
Inventor
程擎
曾嘉诚
王德超
何汶键
华翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Civil Aviation Flight University of China
Original Assignee
Civil Aviation Flight University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Civil Aviation Flight University of China filed Critical Civil Aviation Flight University of China
Priority to CN202410452069.1A priority Critical patent/CN118051063B/en
Publication of CN118051063A publication Critical patent/CN118051063A/en
Application granted granted Critical
Publication of CN118051063B publication Critical patent/CN118051063B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a training method for obstacle avoidance flight of a low-altitude unmanned aerial vehicle, which belongs to the technical field of unmanned aerial vehicle obstacle avoidance and comprises the following steps: constructing a three-dimensional obstacle avoidance scene; arranging the unmanned aerial vehicle in the three-dimensional obstacle avoidance scene and performing low-altitude flight training with a multi-experience-pool deep deterministic policy gradient model, thereby completing the training of low-altitude unmanned aerial vehicle obstacle avoidance flight. By storing different flight experience data in different experience pools according to their class, the invention improves the utilization efficiency of the flight experience data, so that the unmanned aerial vehicle better learns successful obstacle-avoiding flight strategies that reach the target and avoids failed strategies as far as possible, solving the problems of slow obstacle avoidance flight training and insufficient training effect; the invention also sets a reward function based on flight areas of different levels, which accelerates algorithm convergence and thereby solves the problem of slow training caused by slow convergence.

Description

Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle obstacle avoidance, and particularly relates to a training method for low-altitude unmanned aerial vehicle obstacle avoidance flight.
Background
Along with the rapid development of unmanned aerial vehicle technology, unmanned aerial vehicles have been widely used in fields such as agriculture, aerial photography, emergency rescue, environmental monitoring, etc. However, unmanned aerial vehicles still have insufficient autonomous flight and obstacle avoidance capabilities in complex environments. The current unmanned aerial vehicle obstacle avoidance method can be roughly divided into: mathematical optimization methods, traditional potential field and navigation function combined obstacle avoidance methods, and obstacle avoidance methods based on machine learning.
At present, research on unmanned aerial vehicle obstacle avoidance algorithms focuses on machine learning methods: obstacles are detected visually with an onboard camera, the distance between the obstacle and the camera is estimated, and the obstacle is then avoided. Reinforcement learning and deep reinforcement learning, as extensions of machine learning, are widely applied to the obstacle avoidance problem of unmanned aerial vehicles, usually in combination with monocular vision and a neural network model; however, in scenes with continuous state and action spaces, these methods suffer from low offline training efficiency, slow convergence and insufficient obstacle avoidance performance.
Disclosure of Invention
Aiming at the above defects in the prior art, the training method for obstacle avoidance flight of a low-altitude unmanned aerial vehicle provided by the invention stores different flight experience data in different experience pools according to their class, which improves the utilization efficiency of the flight experience data, so that the unmanned aerial vehicle better learns successful obstacle-avoiding strategies that reach the target and avoids failed strategies as far as possible.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
The invention provides a training method for obstacle avoidance flight of a low-altitude unmanned aerial vehicle, which comprises the following steps:
S1, constructing a three-dimensional obstacle avoidance scene;
S2, arranging the unmanned aerial vehicle in a three-dimensional obstacle avoidance scene, and performing low-altitude unmanned aerial vehicle flight training by utilizing a multi-experience-pool deep deterministic policy gradient model to complete the training of the low-altitude unmanned aerial vehicle obstacle avoidance flight.
The beneficial effects of the invention are as follows: the method for obstacle avoidance flight of a low-altitude unmanned aerial vehicle sets up a customizable virtual environment that provides a three-dimensional obstacle avoidance scene with obstacles for the obstacle avoidance flight training of the unmanned aerial vehicle; the unmanned aerial vehicle arranged in the three-dimensional obstacle avoidance scene is trained based on the multi-experience-pool deep deterministic policy gradient model; the arrangement of the arrived flight experience pool and the unreached flight experience pool improves the utilization efficiency of the flight experience data; and the setting of the reward function further improves the training speed and training effect of the unmanned aerial vehicle obstacle avoidance flight training.
Further, the three-dimensional obstacle avoidance scene comprises a plurality of levels of flight areas, wherein the flight areas of each level are respectively and correspondingly provided with an obstacle of a corresponding level except for the initial level of flight area, and the flight areas of the corresponding level are accessed through the obstacle of the corresponding level.
The beneficial effects of adopting the further scheme are as follows: by providing obstacles in the three-dimensional obstacle avoidance scene, an environmental basis is provided for the obstacle avoidance flight training of the unmanned aerial vehicle; and by dividing the obstacles into different levels and dividing the flight areas into different levels, a basis is provided for constructing the reward function, optimizing the flight strategy of the unmanned aerial vehicle and accelerating its training.
Further, the three-dimensional obstacle avoidance scene is a cuboid space with an overall width of 20 m, a height of 20 m and a length of 70 m, in which 5 levels of flight areas are arranged in total; the obstacles corresponding to each level of flight area are uniformly spaced along the length direction of the three-dimensional obstacle avoidance scene, wherein the first-level obstacle is two cross beams arranged along the width direction, the second-level obstacle is a wall with two groups of openings (four openings in total) arranged along the width direction, the third-level obstacle is two cylinders arranged along the width direction, and the fourth-level obstacle is one inclined cylinder.
The beneficial effects of adopting the further scheme are as follows: the invention provides a cuboid three-dimensional obstacle avoidance scene, which is divided into five levels of flight areas based on obstacles uniformly arranged at intervals along the length direction, and the obstacle avoidance conditions of an unmanned aerial vehicle are enriched based on the arrangement of different levels of obstacles, so that the obstacle avoidance flight training of the unmanned aerial vehicle can be realized, and the obstacle avoidance flight capability of the unmanned aerial vehicle facing different obstacles is fully improved.
Further, the step S2 includes the following steps:
S21, setting a flight training experience buffer to store flight experience data, by classification, into an arrived flight experience pool and an unreached flight experience pool;
S22, constructing a multi-experience-pool deep deterministic policy gradient model for the unmanned aerial vehicle obstacle avoidance flight, and initializing the actor network parameters, critic network parameters, actor target network parameters and critic target network parameters of the multi-experience-pool deep deterministic policy gradient model;
S23, repeatedly enabling the unmanned aerial vehicle to autonomously explore the flight from the initial position within the duration of single flight training to obtain flight experience data, training target Q value and rewarding results corresponding to each unmanned aerial vehicle low-altitude obstacle avoidance training;
S24, optimizing the actions of the unmanned aerial vehicle during training by using the reward results, so that the training target Q value keeps increasing until it stabilizes near the expected training target Q value, thereby completing the obstacle avoidance flight training of the low-altitude unmanned aerial vehicle.
The beneficial effects of adopting the further scheme are as follows: the invention provides a method for performing obstacle avoidance flight training on an unmanned aerial vehicle arranged in a three-dimensional obstacle avoidance scene by using the multi-experience-pool deep deterministic policy gradient model; by setting an arrived flight experience pool and an unreached flight experience pool, the unmanned aerial vehicle is trained offline for obstacle avoidance flight, the experience of successful obstacle-avoiding flights that reach the target position is learned more fully, and a basis is provided for avoiding failed flight experience.
Further, the step S23 includes the steps of:
S231, setting a threshold for the number of obstacle avoidance flight trainings of the unmanned aerial vehicle and the duration of a single flight training;
S232, setting the unmanned aerial vehicle at any position in an initial-level flight area, taking the position as an initial position, and initializing unmanned aerial vehicle state data;
S233, selecting the action of the unmanned aerial vehicle at the moment t according to the state of the unmanned aerial vehicle at the moment t and the network parameters of the actor, and executing the action of the unmanned aerial vehicle at the moment t to obtain the rewarding and punishing result of the unmanned aerial vehicle at the moment t and the state of the unmanned aerial vehicle at the moment t+1 until the unmanned aerial vehicle reaches the target position of the highest-grade flight area or reaches the single flight training time of the unmanned aerial vehicle, so as to complete the single unmanned aerial vehicle obstacle avoidance flight training;
s234, calculating and obtaining a rewarding result of the obstacle avoidance flight training of the single unmanned aerial vehicle according to the rewarding function;
S235, acquiring the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training, and updating the multi-experience-pool deep deterministic policy gradient model;
s236, judging whether the unmanned aerial vehicle reaches the target position of the highest-level flight area in the obstacle avoidance flight training of the single unmanned aerial vehicle, if so, entering S237, otherwise, entering S238;
S237, randomly acquiring unmanned aerial vehicle state data at a plurality of moments in the single unmanned aerial vehicle obstacle avoidance flight training, storing the unmanned aerial vehicle state data as flight experience data in the arrived flight experience pool, and entering S239;
S238, randomly acquiring unmanned aerial vehicle state data at a plurality of moments in the single unmanned aerial vehicle obstacle avoidance flight training, storing the unmanned aerial vehicle state data as flight experience data in the unreached flight experience pool, and entering S239;
S239, judging whether the total number of the unmanned aerial vehicle obstacle avoidance flight training reaches a threshold value of the number of the unmanned aerial vehicle obstacle avoidance flight training, if yes, entering S24, otherwise returning to S231.
The beneficial effects of adopting the further scheme are as follows: the invention provides a specific method for carrying out obstacle avoidance flight training of the unmanned aerial vehicle many times in the three-dimensional obstacle avoidance scene. Through the reward result of each obstacle avoidance flight training, the action strategy selection of the unmanned aerial vehicle for avoiding obstacles and flying to the target position can be optimized; and by obtaining the training target Q value of each obstacle avoidance flight training, the network parameters of the actor network, the critic network, the actor target network and the critic target network in the multi-experience-pool deep deterministic policy gradient model can be updated, so that the training speed, training efficiency and training effect of the obstacle avoidance flight training of the unmanned aerial vehicle are improved.
Further, the unmanned aerial vehicle state data is (s_t, a_t, r_t, s_{t+1}), wherein s_t represents the state of the unmanned aerial vehicle at time t, a_t represents the action of the unmanned aerial vehicle at time t, r_t represents the rewarding and punishing result of the unmanned aerial vehicle at time t, and s_{t+1} represents the state of the unmanned aerial vehicle at time t+1; the unmanned aerial vehicle selects its action at time t according to its state at time t and the policy network, and after the action at time t is executed, the rewarding and punishing result of the unmanned aerial vehicle at time t and the state of the unmanned aerial vehicle at time t+1 are obtained.
The beneficial effects of adopting the further scheme are as follows: the invention provides the unmanned aerial vehicle state data, which describe the state, the executed action, the reward obtained and the state at the next moment of the unmanned aerial vehicle during obstacle avoidance flight training. The state data continuously describe the situation of the unmanned aerial vehicle performing obstacle avoidance flight training in the three-dimensional obstacle avoidance scene, provide a data basis for storing flight experience data in the arrived flight experience pool and the unreached flight experience pool, and provide a basis for improving the training speed and training effect of the unmanned aerial vehicle through offline learning.
Further, the calculation expression of the reward result is as follows:
Wherein R_e represents the reward result corresponding to the low-altitude obstacle avoidance training of the unmanned aerial vehicle, R represents the obstacle-avoidance pass reward and punishment function, a represents the level of the flight area to which the starting position of the unmanned aerial vehicle belongs, i represents the level number of a flight area, R_i represents the flight area reward function, r represents the forward reward value for the unmanned aerial vehicle successfully reaching the target position, s represents the training state of the unmanned aerial vehicle, ar indicates that the unmanned aerial vehicle has reached the target position, nca indicates that the unmanned aerial vehicle has not collided and has not reached the target position, los indicates that the travelling speed of the unmanned aerial vehicle is smaller than a preset speed threshold, co indicates that the unmanned aerial vehicle has collided, sign(·) represents the sign function, x represents the level number of the flight area in which the unmanned aerial vehicle is located during training, and L represents the total number of flight area levels.
The beneficial effects of adopting the further scheme are as follows: the invention provides a calculation method for the reward result. The reward result can be regarded as the reward and punishment of the state of the unmanned aerial vehicle at the last moment of the single flight training, plus the sum of the rewards and punishments obtained by the unmanned aerial vehicle for entering flight areas of different levels through the obstacles. Whether the unmanned aerial vehicle reaches the target position, whether it collides, its travelling speed and its passage through obstacles are considered comprehensively, so that scoring based on the reward result is realized and the action strategy of the unmanned aerial vehicle during obstacle avoidance flight training is optimized.
Further, the step S235 includes the steps of:
S2351, calculating a training target Q value of unmanned aerial vehicle obstacle avoidance flight training according to unmanned aerial vehicle state data;
The calculation expression of the training target Q value is as follows:
y_t = r_t + γQ′(s_{t+1}, μ′(s_{t+1}|θ^{μ′})|θ^{Q′})
Wherein y_t represents the training target Q value, r_t represents the reward (excitation) value at time t, γ represents the discount rate, Q′(·) represents the critic target network, μ′(s_{t+1}|θ^{μ′}) represents the action output by the actor target network, whose parameters are θ^{μ′}, for the state of the unmanned aerial vehicle at time t+1, and θ^{Q′} represents the critic target network parameters;
S2352, updating the actor network parameters and the critic network parameters according to the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training to obtain updated actor network parameters and critic network parameters;
S2353, updating the actor target network parameters based on the updated actor network parameters to obtain updated actor target network parameters;
S2354, updating the critic target network parameters based on the updated actor target network parameters and the critic network parameters to obtain updated critic target network parameters, and completing the update of the multi-experience-pool deep deterministic policy gradient model.
The beneficial effects of adopting the further scheme are as follows: the invention provides a method for acquiring the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training and updating the multi-experience-pool deep deterministic policy gradient model; based on the calculated training target Q value, the network parameters are updated, so that the speed and effect of unmanned aerial vehicle obstacle avoidance flight can be improved.
Further, the arrived flight experience pool and the unreached flight experience pool in the flight training experience buffer can provide flight experience data for offline unmanned aerial vehicle obstacle avoidance flight training, so as to optimize the actor target network parameters and the critic target network parameters.
The beneficial effects of adopting the further scheme are as follows: according to the invention, the two flight experience pools provide classified storage of successful (arrived) flight experience data and failed (unreached) flight experience data. The flight experience data are drawn by random sampling, which reduces the correlation between samples during training, thereby improving the stability of unmanned aerial vehicle obstacle avoidance flight and helping to improve the generalization capability of the actor-critic network model. In addition, experience replay and offline training reduce the amount of computation during training, since not every training step requires interaction with the environment of the three-dimensional obstacle avoidance scene.
Other advantages of the present invention will be described in more detail in the following embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a step flowchart of a training method for obstacle avoidance flight of a low-altitude unmanned aerial vehicle in an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
As shown in fig. 1, in one embodiment of the present invention, the present invention provides a training method for obstacle avoidance flight of a low-altitude unmanned aerial vehicle, including the following steps:
S1, constructing a three-dimensional obstacle avoidance scene;
the three-dimensional obstacle avoidance scene comprises a plurality of grades of flight areas, wherein the flight areas of each grade are respectively and correspondingly provided with an obstacle of a corresponding grade except for the initial grade flight area, and the flight areas of the grade are accessed through the obstacle of the corresponding grade.
The three-dimensional obstacle avoidance scene is a cuboid space with an overall width of 20 m, a height of 20 m and a length of 70 m, in which 5 levels of flight areas are arranged in total; the obstacles corresponding to each level of flight area are uniformly spaced along the length direction of the three-dimensional obstacle avoidance scene, wherein the first-level obstacle is two cross beams arranged along the width direction, the second-level obstacle is a wall with two groups of openings (four openings in total) arranged along the width direction, the third-level obstacle is two cylinders arranged along the width direction, and the fourth-level obstacle is one inclined cylinder.
In this embodiment, flight areas of levels 0-5 are set, and the unmanned aerial vehicle is trained by taking off from the initial-level flight area; the level value of the initial-level flight area is 0, the level value of the flight area in which the unmanned aerial vehicle is located changes from 0 to 1 when it flies across the first-level obstacle, and so on, and the level value of the corresponding flight area is highest when the unmanned aerial vehicle successfully arrives at the target position. The whole space of the three-dimensional obstacle avoidance scene is 20 m wide and 20 m high, the overall length of the flight area is 70 m, and the obstacles are arranged at equal intervals. The space coordinates take the position of the center of gravity of the unmanned aerial vehicle as the coordinate origin, with the vertical axis pointing downward; the initial flight height is 3 m, so the initial coordinates are (0, 0, −3). The first-level obstacle is two cross beams arranged along the width direction, with coordinates (0, 14, −8) and (0, 14, −14); the second-level obstacle is a wall containing two groups of square openings that can be passed through (four openings in total), each with a side length of 4 m and lowest heights of 6 m and 14 m respectively, with coordinates (8, 28, −6), (16, 28, −6), (8, 28, −14) and (16, 28, −14); the third-level obstacle is two cylinders with a height of 20 m, with coordinates (7, 42, −10) and (14, 42, −10); the fourth-level obstacle is an inclined cylinder; and the target position is set at coordinates (0, 70, −3).
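For illustration only, the following Python sketch reproduces the scene geometry of this embodiment and maps an unmanned aerial vehicle position to the level of the flight area that contains it; the even 14 m spacing of the obstacle planes along the length direction and all function and constant names are assumptions of the sketch, not limitations of the method.

# Illustrative sketch of the 3D obstacle avoidance scene of this embodiment.
# Assumption: obstacle planes are evenly spaced every 14 m along the length
# (y) axis, and the vertical axis points downward (altitude 3 m -> z = -3).
SCENE_WIDTH = 20.0    # m
SCENE_HEIGHT = 20.0   # m
SCENE_LENGTH = 70.0   # m
NUM_LEVELS = 5        # flight areas, level 0 (start) .. level 4 (target area)
OBSTACLE_SPACING = SCENE_LENGTH / NUM_LEVELS  # 14 m between obstacle planes

def flight_area_level(y: float) -> int:
    """Return the level of the flight area that contains the y coordinate."""
    if not 0.0 <= y <= SCENE_LENGTH:
        raise ValueError("position is outside the obstacle avoidance scene")
    return min(int(y // OBSTACLE_SPACING), NUM_LEVELS - 1)

# Example: the start position (0, 0, -3) lies in level 0, the wall with four
# openings at y = 28 marks the entry to level 2, and the target (0, 70, -3)
# lies in the highest-level flight area.
assert flight_area_level(0.0) == 0
assert flight_area_level(30.0) == 2
assert flight_area_level(70.0) == NUM_LEVELS - 1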
S2, arranging the unmanned aerial vehicle in a three-dimensional obstacle avoidance scene, and performing low-altitude unmanned aerial vehicle flight training by utilizing a multi-experience-pool deep deterministic policy gradient model to complete the training of the low-altitude unmanned aerial vehicle obstacle avoidance flight.
The multi-experience-pool deep deterministic policy gradient model classifies and stores experience data in separate pools on the basis of the deep deterministic policy gradient (DDPG) model.
The step S2 comprises the following steps:
S21, setting a flight training experience buffer to store flight experience data, by classification, into an arrived flight experience pool and an unreached flight experience pool;
S22, constructing a multi-experience-pool deep deterministic policy gradient model for the unmanned aerial vehicle obstacle avoidance flight, and initializing the actor network parameters, critic network parameters, actor target network parameters and critic target network parameters of the multi-experience-pool deep deterministic policy gradient model;
S23, repeatedly enabling the unmanned aerial vehicle to autonomously explore the flight from the initial position within the duration of single flight training to obtain flight experience data, training target Q value and rewarding results corresponding to each unmanned aerial vehicle low-altitude obstacle avoidance training;
the step S23 includes the steps of:
S231, setting a threshold for the number of obstacle avoidance flight trainings of the unmanned aerial vehicle and the duration of a single flight training; in this embodiment, the duration of a single flight training of the unmanned aerial vehicle is 6 minutes;
S232, setting the unmanned aerial vehicle at any position in an initial-level flight area, taking the position as an initial position, and initializing unmanned aerial vehicle state data;
The unmanned aerial vehicle state data is (s_t, a_t, r_t, s_{t+1}), wherein s_t represents the state of the unmanned aerial vehicle at time t, a_t represents the action of the unmanned aerial vehicle at time t, r_t represents the rewarding and punishing result of the unmanned aerial vehicle at time t, and s_{t+1} represents the state of the unmanned aerial vehicle at time t+1; the unmanned aerial vehicle selects its action at time t according to its state at time t and the policy network, and the rewarding and punishing result of the unmanned aerial vehicle at time t and the state of the unmanned aerial vehicle at time t+1 are obtained after the action at time t is executed; the time t and the time t+1 are moments during the unmanned aerial vehicle obstacle avoidance flight training.
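For concreteness, one such record can be represented, for example, by the following Python data structure (a non-limiting sketch; the class and field names are illustrative only):

from dataclasses import dataclass
from typing import Sequence

@dataclass
class Transition:
    """One item of unmanned aerial vehicle state data (s_t, a_t, r_t, s_{t+1})."""
    state: Sequence[float]       # s_t: state of the UAV at time t
    action: Sequence[float]      # a_t: action executed at time t
    reward: float                # r_t: reward/punishment result at time t
    next_state: Sequence[float]  # s_{t+1}: state of the UAV at time t+1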
S233, selecting the action of the unmanned aerial vehicle at the moment t according to the state of the unmanned aerial vehicle at the moment t and the network parameters of the actor, and executing the action of the unmanned aerial vehicle at the moment t to obtain the rewarding and punishing result of the unmanned aerial vehicle at the moment t and the state of the unmanned aerial vehicle at the moment t+1 until the unmanned aerial vehicle reaches the target position of the highest-grade flight area or reaches the single flight training time of the unmanned aerial vehicle, so as to complete the single unmanned aerial vehicle obstacle avoidance flight training;
s234, calculating and obtaining a rewarding result of the obstacle avoidance flight training of the single unmanned aerial vehicle according to the rewarding function;
The calculation expression of the rewarding result is as follows:
Wherein R_e represents the reward result corresponding to the low-altitude obstacle avoidance training of the unmanned aerial vehicle, R represents the obstacle-avoidance pass reward and punishment function, a represents the level of the flight area to which the starting position of the unmanned aerial vehicle belongs, i represents the level number of a flight area, R_i represents the flight area reward function, r represents the forward reward value for the unmanned aerial vehicle successfully reaching the target position, s represents the training state of the unmanned aerial vehicle, ar indicates that the unmanned aerial vehicle has reached the target position, nca indicates that the unmanned aerial vehicle has not collided and has not reached the target position, los indicates that the travelling speed of the unmanned aerial vehicle is smaller than a preset speed threshold, co indicates that the unmanned aerial vehicle has collided, sign(·) represents the sign function, x represents the level number of the flight area in which the unmanned aerial vehicle is located during training, and L represents the total number of flight area levels. In this embodiment, the level number of the initial-level flight area is 0, and the level numbers of the other flight areas are the corresponding positive integers: the level number of the first-level flight area is 1, the level number of the second-level flight area is 2, …, and the level number of the Nth-level flight area is N; the total number of flight area levels is the maximum level number of the flight areas.
In this embodiment, the input of the obstacle-avoidance pass reward and punishment function is set as {position information, collision information, speed information, arrival information}. The higher the level number of the flight area, the higher the reward value obtained when calculating the reward result. The collision information is obtained from the three-dimensional obstacle avoidance scene: if an obstacle is hit, obstacle avoidance fails and a penalty is deducted; otherwise the unmanned aerial vehicle is in a normal state, and when the unmanned aerial vehicle has not reached the target position, the collision information and the arrival information are analysed together to calculate the result of the obstacle-avoidance pass reward and punishment function. If the flight speed in the speed information is smaller than the preset speed threshold, a penalty is also deducted, which prevents the actor-critic network model from outputting no action strategy merely to avoid collisions, a behaviour that would harm training speed and training effect. When the arrival information indicates that the unmanned aerial vehicle has successfully reached the target position, the corresponding reward is obtained, and the rewards and penalties applied under the pass reward and punishment conditions are scaled with reference to the reward obtained for reaching the target position. In short, the obstacle-avoidance pass reward and punishment function gives a forward reward r to an unmanned aerial vehicle that successfully arrives at the target point, and gives penalties to an unmanned aerial vehicle that does not arrive within the single-training time limit, flies too slowly, or collides during flight.
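As a non-limiting illustration, the following Python sketch shows one reward structure that is consistent with the description above (forward reward r = 2 for arrival, penalties for collision and for flying below the speed threshold, and a level-dependent flight-area bonus). The exact expressions of R and R_i are defined by the formula of the invention and are not reproduced here, so the penalty constants and the linear level bonus used below are assumptions of the sketch only.

# Hypothetical reward sketch; constants and functional forms are assumptions.
R_ARRIVAL = 2.0      # forward reward r for successfully reaching the target
TOTAL_LEVELS = 5     # L, total number of flight-area levels

def pass_reward(state: str) -> float:
    """Obstacle-avoidance pass reward/punishment R for the training state s."""
    if state == "ar":    # target position reached
        return R_ARRIVAL
    if state == "nca":   # no collision, target not reached
        return 0.0
    if state == "los":   # forward speed below the preset speed threshold
        return -0.5 * R_ARRIVAL
    if state == "co":    # collision with an obstacle
        return -R_ARRIVAL
    raise ValueError(f"unknown training state: {state}")

def area_reward(level: int) -> float:
    """Flight-area reward R_i: higher-level areas give a larger bonus."""
    return R_ARRIVAL * level / TOTAL_LEVELS

def episode_reward(start_level: int, reached_level: int, state: str) -> float:
    """Reward result R_e: pass reward plus the bonus of every area entered."""
    entered = range(start_level + 1, reached_level + 1)
    return pass_reward(state) + sum(area_reward(i) for i in entered)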
S235, acquiring the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training, and updating the multi-experience-pool deep deterministic policy gradient model;
The step S235 includes the steps of:
S2351, calculating a training target Q value of unmanned aerial vehicle obstacle avoidance flight training according to unmanned aerial vehicle state data;
The calculation expression of the training target Q value is as follows:
y_t = r_t + γQ′(s_{t+1}, μ′(s_{t+1}|θ^{μ′})|θ^{Q′})
Wherein y_t represents the training target Q value, r_t represents the reward (excitation) value at time t, γ represents the discount rate, Q′(·) represents the critic target network, μ′(s_{t+1}|θ^{μ′}) represents the action output by the actor target network, whose parameters are θ^{μ′}, for the state of the unmanned aerial vehicle at time t+1, and θ^{Q′} represents the critic target network parameters;
S2352, updating the actor network parameters and the critic network parameters according to the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training to obtain updated actor network parameters and critic network parameters;
The critic network interacts with the three-dimensional obstacle avoidance scene, and its parameters are updated by minimising the difference between the current critic network's predicted value and the target value; the actor network updates its parameters by gradient ascent, using the critic network's value estimate to guide the improvement of the unmanned aerial vehicle's action strategy at each moment, so as to obtain the updated actor network parameters.
S2353, updating the actor target network parameters based on the updated actor network parameters to obtain updated actor target network parameters;
The updated actor target network parameters are calculated as follows:
θ^{μ′} ← τθ^{μ} + (1 − τ)θ^{μ′}
Wherein θ^{μ′} represents the actor target network parameters, τ represents the softness parameter of the parameter synchronization, and θ^{μ} represents the actor network parameters; τ = 1 is a hard update, τ = 0 is no update, and 0 < τ < 1 is a soft update. The update of the actor target network parameters in this embodiment is a soft update.
S2354, updating the critic target network parameters based on the updated actor target network parameters and the critic network parameters to obtain updated critic target network parameters, and completing the update of the multi-experience-pool deep deterministic policy gradient model.
The updated critic target network parameters are calculated as follows:
θ^{Q′} ← τθ^{Q} + (1 − τ)θ^{Q′}
Wherein θ^{Q′} represents the critic target network parameters and θ^{Q} represents the critic network parameters.
In this embodiment, Adam optimizers are employed to update the network parameters; the learning rate of the actor network and the learning rate of the critic network are each set to preset values, the training batch size is set to 64, and the exploration rate is initially set to 1 so that the unmanned aerial vehicle keeps exploring. The exploration rate decreases gradually as training progresses, and its minimum value is set to 0.01 so that the strategy remains stable while retaining some exploration capability in the later stage of training. In the reward function, the forward reward value r for the unmanned aerial vehicle successfully reaching the target position is set to 2, the discount rate γ is set to 0.9, and the storage capacity for flight experience data in both the arrived flight experience pool and the unreached flight experience pool is set to 50000.
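As a non-limiting illustration, the following PyTorch-style Python sketch performs one parameter update of the model as described in S2351-S2354: it computes the training target Q value, updates the critic by minimising the gap to that target, updates the actor by gradient ascent on the critic's estimate, and softly updates both target networks. The network sizes, state and action dimensions, learning rates and the value of the softness parameter τ are placeholder assumptions.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM = 12, 3   # illustrative dimensions, not from the patent
GAMMA, TAU = 0.9, 0.005         # discount rate per this embodiment; tau assumed

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor = mlp(STATE_DIM, ACTION_DIM)        # policy network mu
critic = mlp(STATE_DIM + ACTION_DIM, 1)   # value network Q
actor_target = copy.deepcopy(actor)       # mu'
critic_target = copy.deepcopy(critic)     # Q'
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # placeholder lr
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # placeholder lr

def q(net, s, a):
    return net(torch.cat([s, a], dim=-1))

def update(s, a, r, s_next):
    # Training target Q value: y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1}))
    with torch.no_grad():
        y = r + GAMMA * q(critic_target, s_next, actor_target(s_next))

    # S2352: critic update, minimising the gap between prediction and target
    critic_loss = F.mse_loss(q(critic, s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # S2352: actor update by gradient ascent on the critic's value estimate
    actor_loss = -q(critic, s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # S2353 and S2354: soft target updates, theta' <- tau*theta + (1-tau)*theta'
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - TAU).add_(TAU * p.data)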
S236, judging whether the unmanned aerial vehicle reaches the target position of the highest-level flight area in the obstacle avoidance flight training of the single unmanned aerial vehicle, if so, entering S237, otherwise, entering S238;
Step S236 may alternatively be replaced by judging whether the reward result of the single unmanned aerial vehicle obstacle avoidance flight training is greater than a preset reward result threshold and no collision has occurred; if so, the process proceeds to S237, otherwise to S238. In this embodiment, the preset reward result threshold is 20, i.e. when the reward result of a single obstacle avoidance flight training is greater than 20 and no collision has occurred, the flight experience data of that training are considered valuable and are added to the arrived flight experience pool.
S237, randomly acquiring unmanned aerial vehicle state data at a plurality of moments in the single unmanned aerial vehicle obstacle avoidance flight training, storing the unmanned aerial vehicle state data as flight experience data in the arrived flight experience pool, and entering S239;
S238, randomly acquiring unmanned aerial vehicle state data at a plurality of moments in the single unmanned aerial vehicle obstacle avoidance flight training, storing the unmanned aerial vehicle state data as flight experience data in the unreached flight experience pool, and entering S239;
S239, judging whether the total number of the unmanned aerial vehicle obstacle avoidance flight training reaches a threshold value of the number of the unmanned aerial vehicle obstacle avoidance flight training, if yes, entering S24, otherwise returning to S231.
S24, optimizing the actions of the unmanned aerial vehicle during training by using the reward results, so that the training target Q value keeps increasing until it stabilizes near the expected training target Q value, thereby completing the obstacle avoidance flight training of the low-altitude unmanned aerial vehicle.
The arrived flight experience pool and the unreached flight experience pool in the flight training experience buffer can provide flight experience data for offline unmanned aerial vehicle obstacle avoidance flight training, so as to optimize the actor target network parameters and the critic target network parameters.
In the invention, two different experience pools are created according to whether the unmanned aerial vehicle arrives at the target position: an arrived flight experience pool and an unreached flight experience pool. The data of a single flight training is the unmanned aerial vehicle flight state data produced by that obstacle avoidance flight training; if the unmanned aerial vehicle successfully reaches the target position within the time limit of the experiment, the data are stored in the arrived flight experience pool. If the unmanned aerial vehicle is still exploring, or a collision occurs, within the time limit, the data are stored in the unreached flight experience pool. The arrived experience pool thus records flight state data from trainings in which the target position was successfully reached, or which obtained a relatively high reward even though the target point was not reached within the time limit; the unreached experience pool records flight state data from trainings in which a collision occurred within the time limit, or in which the unmanned aerial vehicle was still exploring. During training, samples are drawn from the two experience pools in proportion: each time the algorithm parameters need to be updated, a batch of data is taken from the two pools for training. Through the arrangement of the two flight experience pools, the invention reduces the correlation between samples in the training process and also improves the stability of the unmanned aerial vehicle during obstacle avoidance flight.
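As a non-limiting illustration, a minimal Python sketch of the two experience pools and the proportional sampling described above follows; the pool capacity of 50000 and the reward-result threshold of 20 match this embodiment, while the 50/50 sampling proportion and all class and field names are assumptions of the sketch.

import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state")

class DualExperiencePool:
    """Arrived / unreached flight experience pools with proportional sampling."""

    def __init__(self, capacity=50000, reward_threshold=20.0):
        self.arrived = deque(maxlen=capacity)    # reached target, or high reward without collision
        self.unreached = deque(maxlen=capacity)  # collided or still exploring at the time limit
        self.reward_threshold = reward_threshold

    def store_episode(self, transitions, episode_reward, reached_target, collided):
        # Classify one training's flight experience data into one of the pools.
        valuable = reached_target or (episode_reward > self.reward_threshold and not collided)
        pool = self.arrived if valuable else self.unreached
        pool.extend(transitions)

    def sample(self, batch_size, arrived_fraction=0.5):
        # Draw a fixed proportion of the batch from each pool for offline training.
        n_arr = min(int(batch_size * arrived_fraction), len(self.arrived))
        n_unr = min(batch_size - n_arr, len(self.unreached))
        batch = random.sample(self.arrived, n_arr) + random.sample(self.unreached, n_unr)
        random.shuffle(batch)
        return batch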
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention.

Claims (7)

1. The training method for the obstacle avoidance flight of the low-altitude unmanned aerial vehicle is characterized by comprising the following steps of:
S1, constructing a three-dimensional obstacle avoidance scene;
S2, arranging the unmanned aerial vehicle in a three-dimensional obstacle avoidance scene, and performing low-altitude unmanned aerial vehicle flight training by utilizing a multi-experience-pool deep deterministic policy gradient model to complete the training of the low-altitude unmanned aerial vehicle obstacle avoidance flight;
The step S2 comprises the following steps:
S21, setting a flight training experience buffer to store flight experience data, by classification, into an arrived flight experience pool and an unreached flight experience pool;
S22, constructing a multi-experience-pool deep deterministic policy gradient model for the unmanned aerial vehicle obstacle avoidance flight, and initializing the actor network parameters, critic network parameters, actor target network parameters and critic target network parameters of the multi-experience-pool deep deterministic policy gradient model;
S23, repeatedly enabling the unmanned aerial vehicle to autonomously explore the flight from the initial position within the duration of single flight training to obtain flight experience data, training target Q value and rewarding results corresponding to each unmanned aerial vehicle low-altitude obstacle avoidance training;
the step S23 includes the steps of:
S231, setting a threshold for the number of obstacle avoidance flight trainings of the unmanned aerial vehicle and the duration of a single flight training;
S232, setting the unmanned aerial vehicle at any position in an initial-level flight area, taking the position as an initial position, and initializing unmanned aerial vehicle state data;
S233, selecting the action of the unmanned aerial vehicle at the moment t according to the state of the unmanned aerial vehicle at the moment t and the network parameters of the actor, and executing the action of the unmanned aerial vehicle at the moment t to obtain the rewarding and punishing result of the unmanned aerial vehicle at the moment t and the state of the unmanned aerial vehicle at the moment t+1 until the unmanned aerial vehicle reaches the target position of the highest-grade flight area or reaches the single flight training time of the unmanned aerial vehicle, so as to complete the single unmanned aerial vehicle obstacle avoidance flight training;
s234, calculating and obtaining a rewarding result of the obstacle avoidance flight training of the single unmanned aerial vehicle according to the rewarding function;
S235, acquiring the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training, and updating the multi-experience-pool deep deterministic policy gradient model;
s236, judging whether the unmanned aerial vehicle reaches the target position of the highest-level flight area in the obstacle avoidance flight training of the single unmanned aerial vehicle, if so, entering S237, otherwise, entering S238;
S237, randomly acquiring unmanned aerial vehicle state data at a plurality of moments in the single unmanned aerial vehicle obstacle avoidance flight training, storing the unmanned aerial vehicle state data as flight experience data in the arrived flight experience pool, and entering S239;
S238, randomly acquiring unmanned aerial vehicle state data at a plurality of moments in the single unmanned aerial vehicle obstacle avoidance flight training, storing the unmanned aerial vehicle state data as flight experience data in the unreached flight experience pool, and entering S239;
S239, judging whether the total number of the unmanned aerial vehicle obstacle avoidance flight training reaches a threshold value of the number of the unmanned aerial vehicle obstacle avoidance flight training, if yes, entering S24, otherwise returning to S231;
S24, optimizing the actions of the unmanned aerial vehicle during training by using the reward results, so that the training target Q value keeps increasing until it stabilizes near the expected training target Q value, thereby completing the obstacle avoidance flight training of the low-altitude unmanned aerial vehicle.
2. The training method for the obstacle avoidance flight of the low-altitude unmanned aerial vehicle according to claim 1, wherein the three-dimensional obstacle avoidance scene comprises a plurality of levels of flight areas, and each level of flight area is respectively and correspondingly provided with a corresponding level of obstacle except for an initial level of flight area, and the flight area of the level is entered through the corresponding level of obstacle.
3. The training method for the obstacle avoidance flight of the low-altitude unmanned aerial vehicle according to claim 2, wherein the three-dimensional obstacle avoidance scene is a cuboid space with an overall width of 20m, a height of 20m and a length of 70m, in which 5 levels of flight areas are arranged in total; the obstacles corresponding to each level of flight area are uniformly spaced along the length direction of the three-dimensional obstacle avoidance scene, wherein the first-level obstacle is two cross beams arranged along the width direction, the second-level obstacle is a wall with two groups of openings (four openings in total) arranged along the width direction, the third-level obstacle is two cylinders arranged along the width direction, and the fourth-level obstacle is one inclined cylinder.
4. The training method for the obstacle avoidance flight of the low-altitude unmanned aerial vehicle according to claim 1, wherein the unmanned aerial vehicle state data is (s_t, a_t, r_t, s_{t+1}), s_t represents the state of the unmanned aerial vehicle at time t, a_t represents the action of the unmanned aerial vehicle at time t, r_t represents the rewarding and punishing result of the unmanned aerial vehicle at time t, and s_{t+1} represents the state of the unmanned aerial vehicle at time t+1; the unmanned aerial vehicle selects its action at time t according to its state at time t and the policy network, and the rewarding and punishing result of the unmanned aerial vehicle at time t and the state of the unmanned aerial vehicle at time t+1 are obtained after the action at time t is executed.
5. The training method for obstacle avoidance flight of a low altitude unmanned aerial vehicle of claim 4, wherein the calculation expression of the reward result is as follows:
Wherein R_e represents the reward result corresponding to the low-altitude obstacle avoidance training of the unmanned aerial vehicle, R represents the obstacle-avoidance pass reward and punishment function, a represents the level of the flight area to which the starting position of the unmanned aerial vehicle belongs, i represents the level number of a flight area, R_i represents the flight area reward function, r represents the forward reward value for the unmanned aerial vehicle successfully reaching the target position, s represents the training state of the unmanned aerial vehicle, ar indicates that the unmanned aerial vehicle has reached the target position, nca indicates that the unmanned aerial vehicle has not collided and has not reached the target position, los indicates that the travelling speed of the unmanned aerial vehicle is smaller than a preset speed threshold, co indicates that the unmanned aerial vehicle has collided, sign(·) represents the sign function, x represents the level number of the flight area in which the unmanned aerial vehicle is located during training, and L represents the total number of flight area levels.
6. The method for training the obstacle avoidance flight of the low altitude unmanned aerial vehicle of claim 4, wherein S235 comprises the steps of:
S2351, calculating a training target Q value of unmanned aerial vehicle obstacle avoidance flight training according to unmanned aerial vehicle state data;
The calculation expression of the training target Q value is as follows:
y_t = r_t + γQ′(s_{t+1}, μ′(s_{t+1}|θ^{μ′})|θ^{Q′})
Wherein y_t represents the training target Q value, r_t represents the reward (excitation) value at time t, γ represents the discount rate, Q′(·) represents the critic target network, μ′(s_{t+1}|θ^{μ′}) represents the action output by the actor target network, whose parameters are θ^{μ′}, for the state of the unmanned aerial vehicle at time t+1, and θ^{Q′} represents the critic target network parameters;
S2352, updating the actor network parameters and the critic network parameters according to the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training to obtain updated actor network parameters and critic network parameters;
S2353, updating the actor target network parameters based on the updated actor network parameters to obtain updated actor target network parameters;
S2354, updating the critic target network parameters based on the updated actor target network parameters and the critic network parameters to obtain updated critic target network parameters, and completing the update of the multi-experience-pool deep deterministic policy gradient model.
7. The method for training the obstacle avoidance flight of the low-altitude unmanned aerial vehicle according to claim 1, wherein the arrived flight experience pool and the unreached flight experience pool in the flight training experience buffer can provide flight experience data for offline unmanned aerial vehicle obstacle avoidance flight training, so as to optimize the actor target network parameters and the critic target network parameters.
CN202410452069.1A 2024-04-16 2024-04-16 Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle Active CN118051063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410452069.1A CN118051063B (en) 2024-04-16 2024-04-16 Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410452069.1A CN118051063B (en) 2024-04-16 2024-04-16 Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle

Publications (2)

Publication Number Publication Date
CN118051063A CN118051063A (en) 2024-05-17
CN118051063B (en) 2024-06-14

Family

ID=91054085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410452069.1A Active CN118051063B (en) 2024-04-16 2024-04-16 Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle

Country Status (1)

Country Link
CN (1) CN118051063B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934332A (en) * 2018-12-31 2019-06-25 中国科学院软件研究所 Deep deterministic policy gradient learning method based on a critic and a double-ended experience pool
CN114089776A (en) * 2021-11-09 2022-02-25 南京航空航天大学 Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6721785B2 (en) * 2016-09-15 2020-07-15 グーグル エルエルシー Deep reinforcement learning for robot operation
CN110471444B (en) * 2019-08-19 2022-07-12 西安微电子技术研究所 Unmanned aerial vehicle intelligent obstacle avoidance method based on autonomous learning
CN114510078B (en) * 2022-02-16 2022-12-09 南通大学 Unmanned aerial vehicle maneuver avoidance decision-making method based on deep reinforcement learning
US20230306084A1 (en) * 2022-03-28 2023-09-28 Hitachi, Ltd. K-nearest multi-agent reinforcement learning for collaborative tasks with variable number of agents
CN115033022A (en) * 2022-06-28 2022-09-09 华南理工大学 DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform
CN115857530A (en) * 2022-12-05 2023-03-28 北京航天自动控制研究所 Decoupling-free attitude control method of aircraft based on TD3 multi-experience pool reinforcement learning
CN116227767A (en) * 2023-01-07 2023-06-06 南京工业大学 Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning
CN116974299A (en) * 2023-08-10 2023-10-31 北京理工大学 Reinforced learning unmanned aerial vehicle track planning method based on delayed experience priority playback mechanism
CN117715219A (en) * 2023-11-23 2024-03-15 重庆邮电大学 Space-time domain resource allocation method based on deep reinforcement learning
CN117875375A (en) * 2023-12-27 2024-04-12 南京航空航天大学 Multi-agent deep reinforcement learning method based on safety exploration

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934332A (en) * 2018-12-31 2019-06-25 中国科学院软件研究所 Deep deterministic policy gradient learning method based on a critic and a double-ended experience pool
CN114089776A (en) * 2021-11-09 2022-02-25 南京航空航天大学 Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN118051063A (en) 2024-05-17

Similar Documents

Publication Publication Date Title
CN111061277B (en) Unmanned vehicle global path planning method and device
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN112099496B (en) Automatic driving training method, device, equipment and medium
CN112327923A (en) Multi-unmanned aerial vehicle collaborative path planning method
CN113561986A (en) Decision-making method and device for automatically driving automobile
CN115357031B (en) Ship path planning method and system based on improved ant colony algorithm
CN112550314A (en) Embedded optimization type control method suitable for unmanned driving, driving control module and automatic driving control system thereof
CN114089776A (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
CN109190787B (en) Dual particle swarm multi-monitoring point access path planning method for underwater vehicle
CN116476863A (en) Automatic driving transverse and longitudinal integrated decision-making method based on deep reinforcement learning
CN118051063B (en) Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle
US20230162539A1 (en) Driving decision-making method and apparatus and chip
CN115826591B (en) Multi-target point path planning method based on neural network estimation path cost
CN117420824A (en) Path planning method based on intelligent ant colony algorithm with learning capability
CN116551703A (en) Motion planning method based on machine learning in complex environment
CN115344046A (en) Mobile robot path planning based on improved deep Q network algorithm
Lesch et al. Towards situation-aware meta-optimization of adaptation planning strategies
CN115981302A (en) Vehicle following lane change behavior decision-making method and device and electronic equipment
CN114527759A (en) End-to-end driving method based on layered reinforcement learning
CN115031753A (en) Driving condition local path planning method based on safety potential field and DQN algorithm
CN116882607B (en) Key node identification method based on path planning task
CN113589810B (en) Dynamic autonomous obstacle avoidance movement method and device for intelligent body, server and storage medium
CN116341662A (en) Automatic driving longitudinal decision-making method based on Monte Carlo tree search
CN117151577B (en) Combined path optimization method and system considering group heterogeneity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant