CN118051063B - Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle - Google Patents

Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle

Info

Publication number
CN118051063B
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
flight
training
obstacle avoidance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410452069.1A
Other languages
Chinese (zh)
Other versions
CN118051063A (en)
Inventor
程擎
曾嘉诚
王德超
何汶键
华翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Civil Aviation Flight University of China
Original Assignee
Civil Aviation Flight University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Civil Aviation Flight University of China filed Critical Civil Aviation Flight University of China
Priority to CN202410452069.1A priority Critical patent/CN118051063B/en
Publication of CN118051063A publication Critical patent/CN118051063A/en
Application granted granted Critical
Publication of CN118051063B publication Critical patent/CN118051063B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a training method for obstacle avoidance flight of a low-altitude unmanned aerial vehicle, which belongs to the technical field of unmanned aerial vehicle obstacle avoidance and comprises the following steps: constructing a three-dimensional obstacle avoidance scene; arranging the unmanned aerial vehicle in the three-dimensional obstacle avoidance scene and performing low-altitude flight training with a multi-experience-pool deep deterministic policy gradient model, thereby completing the training of low-altitude unmanned aerial vehicle obstacle avoidance flight. By storing different flight experience data in different experience pools according to their class, the invention improves the utilization efficiency of the flight experience data, so that the unmanned aerial vehicle better learns successful obstacle-avoiding flight strategies that reach the target and avoids failed strategies as far as possible, solving the problems of slow obstacle avoidance flight training and insufficient training effect; the invention also sets a reward function based on flight areas of different levels, which accelerates algorithm convergence and thereby solves the problem of slow training caused by slow convergence.

Description

Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle obstacle avoidance, and particularly relates to a training method for low-altitude unmanned aerial vehicle obstacle avoidance flight.
Background
Along with the rapid development of unmanned aerial vehicle technology, unmanned aerial vehicles have been widely used in fields such as agriculture, aerial photography, emergency rescue, environmental monitoring, etc. However, unmanned aerial vehicles still have insufficient autonomous flight and obstacle avoidance capabilities in complex environments. The current unmanned aerial vehicle obstacle avoidance method can be roughly divided into: mathematical optimization methods, traditional potential field and navigation function combined obstacle avoidance methods, and obstacle avoidance methods based on machine learning.
At present, research on unmanned aerial vehicle obstacle avoidance algorithms focuses on machine learning methods: obstacles are detected visually with an onboard camera, the distance between the obstacle and the camera is estimated, and the obstacle is then avoided. Reinforcement learning and deep reinforcement learning, as extensions of machine learning, are widely applied to the obstacle avoidance problem of unmanned aerial vehicles, usually in combination with monocular vision and a neural network model; however, in scenes with continuous state and action spaces, these methods suffer from low offline training efficiency, slow convergence and insufficient obstacle avoidance performance.
Disclosure of Invention
Aiming at the above defects in the prior art, the training method for obstacle avoidance flight of a low-altitude unmanned aerial vehicle provided by the invention stores different flight experience data in different experience pools according to their class, which improves the utilization efficiency of the flight experience data, so that the unmanned aerial vehicle better learns successful obstacle-avoiding strategies that reach the target and avoids failed strategies as far as possible.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
The invention provides a training method for obstacle avoidance flight of a low-altitude unmanned aerial vehicle, which comprises the following steps:
S1, constructing a three-dimensional obstacle avoidance scene;
S2, arranging the unmanned aerial vehicle in a three-dimensional obstacle avoidance scene, and performing low-altitude unmanned aerial vehicle flight training by utilizing a multi-experience-pool deep deterministic policy gradient model to complete the training of the low-altitude unmanned aerial vehicle obstacle avoidance flight.
The beneficial effects of the invention are as follows: the method for obstacle avoidance flight of a low-altitude unmanned aerial vehicle sets up a customizable virtual environment that provides a three-dimensional obstacle avoidance scene with obstacles for the obstacle avoidance flight training of the unmanned aerial vehicle; the unmanned aerial vehicle arranged in the three-dimensional obstacle avoidance scene is trained based on the multi-experience-pool deep deterministic policy gradient model; the arrangement of the arrived flight experience pool and the unreached flight experience pool improves the utilization efficiency of the flight experience data; and the setting of the reward function further improves the training speed and training effect of the unmanned aerial vehicle obstacle avoidance flight training.
Further, the three-dimensional obstacle avoidance scene comprises a plurality of levels of flight areas, wherein the flight areas of each level are respectively and correspondingly provided with an obstacle of a corresponding level except for the initial level of flight area, and the flight areas of the corresponding level are accessed through the obstacle of the corresponding level.
The beneficial effects of adopting the further scheme are as follows: by providing obstacles in the three-dimensional obstacle avoidance scene, an environmental basis is provided for the obstacle avoidance flight training of the unmanned aerial vehicle; and by dividing the obstacles into different levels and dividing the flight areas into different levels, a basis is provided for constructing the reward function, optimizing the flight strategy of the unmanned aerial vehicle and accelerating its training.
Further, the three-dimensional obstacle avoidance scene is a cuboid space with an overall width of 20 m, a height of 20 m and a length of 70 m, in which 5 levels of flight areas are arranged in total; the obstacles corresponding to each level of flight area are uniformly spaced along the length direction of the three-dimensional obstacle avoidance scene, wherein the first-level obstacle is two cross beams arranged along the width direction, the second-level obstacle is a wall with two groups of openings (four openings in total) arranged along the width direction, the third-level obstacle is two cylinders arranged along the width direction, and the fourth-level obstacle is one inclined cylinder.
The beneficial effects of adopting the further scheme are as follows: the invention provides a cuboid three-dimensional obstacle avoidance scene, which is divided into five levels of flight areas based on obstacles uniformly arranged at intervals along the length direction, and the obstacle avoidance conditions of an unmanned aerial vehicle are enriched based on the arrangement of different levels of obstacles, so that the obstacle avoidance flight training of the unmanned aerial vehicle can be realized, and the obstacle avoidance flight capability of the unmanned aerial vehicle facing different obstacles is fully improved.
Further, the step S2 includes the following steps:
S21, setting a flight training experience buffer to store flight experience data, by classification, into an arrived flight experience pool and an unreached flight experience pool;
S22, constructing a multi-experience-pool deep deterministic policy gradient model for the unmanned aerial vehicle obstacle avoidance flight, and initializing the actor network parameters, critic network parameters, actor target network parameters and critic target network parameters of the multi-experience-pool deep deterministic policy gradient model;
S23, repeatedly enabling the unmanned aerial vehicle to autonomously explore the flight from the initial position within the duration of single flight training to obtain flight experience data, training target Q value and rewarding results corresponding to each unmanned aerial vehicle low-altitude obstacle avoidance training;
S24, optimizing the actions of the unmanned aerial vehicle during training by using the reward results, so that the training target Q value keeps increasing until it stabilizes near the expected training target Q value, thereby completing the obstacle avoidance flight training of the low-altitude unmanned aerial vehicle.
The beneficial effects of adopting the further scheme are as follows: the invention provides a method for performing obstacle avoidance flight training on an unmanned aerial vehicle arranged in a three-dimensional obstacle avoidance scene by using the multi-experience-pool deep deterministic policy gradient model; by setting an arrived flight experience pool and an unreached flight experience pool, the unmanned aerial vehicle is trained offline for obstacle avoidance flight, the experience of successful obstacle-avoiding flights that reach the target position is learned more fully, and a basis is provided for avoiding failed flight experience.
Further, the step S23 includes the steps of:
S231, setting a threshold for the number of obstacle avoidance flight trainings of the unmanned aerial vehicle and the duration of a single flight training;
S232, setting the unmanned aerial vehicle at any position in an initial-level flight area, taking the position as an initial position, and initializing unmanned aerial vehicle state data;
S233, selecting the action of the unmanned aerial vehicle at the moment t according to the state of the unmanned aerial vehicle at the moment t and the network parameters of the actor, and executing the action of the unmanned aerial vehicle at the moment t to obtain the rewarding and punishing result of the unmanned aerial vehicle at the moment t and the state of the unmanned aerial vehicle at the moment t+1 until the unmanned aerial vehicle reaches the target position of the highest-grade flight area or reaches the single flight training time of the unmanned aerial vehicle, so as to complete the single unmanned aerial vehicle obstacle avoidance flight training;
s234, calculating and obtaining a rewarding result of the obstacle avoidance flight training of the single unmanned aerial vehicle according to the rewarding function;
S235, acquiring the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training, and updating the multi-experience-pool deep deterministic policy gradient model;
s236, judging whether the unmanned aerial vehicle reaches the target position of the highest-level flight area in the obstacle avoidance flight training of the single unmanned aerial vehicle, if so, entering S237, otherwise, entering S238;
S237, randomly acquiring unmanned aerial vehicle state data at a plurality of moments in the single unmanned aerial vehicle obstacle avoidance flight training, storing the unmanned aerial vehicle state data as flight experience data in the arrived flight experience pool, and entering S239;
S238, randomly acquiring unmanned aerial vehicle state data at a plurality of moments in the single unmanned aerial vehicle obstacle avoidance flight training, storing the unmanned aerial vehicle state data as flight experience data in the unreached flight experience pool, and entering S239;
S239, judging whether the total number of the unmanned aerial vehicle obstacle avoidance flight training reaches a threshold value of the number of the unmanned aerial vehicle obstacle avoidance flight training, if yes, entering S24, otherwise returning to S231.
The beneficial effects of adopting the further scheme are as follows: the invention provides a specific method for carrying out obstacle avoidance flight training of the unmanned aerial vehicle many times in the three-dimensional obstacle avoidance scene. Through the reward result of each obstacle avoidance flight training, the action strategy selection of the unmanned aerial vehicle for avoiding obstacles and flying to the target position can be optimized; and by obtaining the training target Q value of each obstacle avoidance flight training, the network parameters of the actor network, the critic network, the actor target network and the critic target network in the multi-experience-pool deep deterministic policy gradient model can be updated, so that the training speed, training efficiency and training effect of the obstacle avoidance flight training of the unmanned aerial vehicle are improved.
Further, the unmanned aerial vehicle state data is (s_t, a_t, r_t, s_{t+1}), wherein s_t represents the state of the unmanned aerial vehicle at time t, a_t represents the action of the unmanned aerial vehicle at time t, r_t represents the rewarding and punishing result of the unmanned aerial vehicle at time t, and s_{t+1} represents the state of the unmanned aerial vehicle at time t+1; the unmanned aerial vehicle selects its action at time t according to its state at time t and the policy network, and after the action at time t is executed, the rewarding and punishing result of the unmanned aerial vehicle at time t and the state of the unmanned aerial vehicle at time t+1 are obtained.
The beneficial effects of adopting the further scheme are as follows: the invention provides the unmanned aerial vehicle state data, which describe the state, the executed action, the reward obtained and the state at the next moment of the unmanned aerial vehicle during obstacle avoidance flight training. The state data continuously describe the situation of the unmanned aerial vehicle performing obstacle avoidance flight training in the three-dimensional obstacle avoidance scene, provide a data basis for storing flight experience data in the arrived flight experience pool and the unreached flight experience pool, and provide a basis for improving the training speed and training effect of the unmanned aerial vehicle through offline learning.
Further, the calculation expression of the reward result is as follows:
Wherein R_e represents the reward result corresponding to the low-altitude obstacle avoidance training of the unmanned aerial vehicle, R represents the obstacle-avoidance pass reward and punishment function, a represents the level of the flight area to which the starting position of the unmanned aerial vehicle belongs, i represents the level number of a flight area, R_i represents the flight area reward function, r represents the forward reward value for the unmanned aerial vehicle successfully reaching the target position, s represents the training state of the unmanned aerial vehicle, ar indicates that the unmanned aerial vehicle has reached the target position, nca indicates that the unmanned aerial vehicle has not collided and has not reached the target position, los indicates that the travelling speed of the unmanned aerial vehicle is smaller than a preset speed threshold, co indicates that the unmanned aerial vehicle has collided, sign(·) represents the sign function, x represents the level number of the flight area in which the unmanned aerial vehicle is located during training, and L represents the total number of flight area levels.
The beneficial effects of adopting the further scheme are as follows: the invention provides a calculation method for the reward result. The reward result can be regarded as the reward and punishment of the state of the unmanned aerial vehicle at the last moment of the single flight training, plus the sum of the rewards and punishments obtained by the unmanned aerial vehicle for entering flight areas of different levels through the obstacles. Whether the unmanned aerial vehicle reaches the target position, whether it collides, its travelling speed and its passage through obstacles are considered comprehensively, so that scoring based on the reward result is realized and the action strategy of the unmanned aerial vehicle during obstacle avoidance flight training is optimized.
Further, the step S235 includes the steps of:
S2351, calculating a training target Q value of unmanned aerial vehicle obstacle avoidance flight training according to unmanned aerial vehicle state data;
The calculation expression of the training target Q value is as follows:
y_t = r_t + γQ′(s_{t+1}, μ′(s_{t+1}|θ^{μ′})|θ^{Q′})
Wherein y_t represents the training target Q value, r_t represents the reward (excitation) value at time t, γ represents the discount rate, Q′(·) represents the critic target network, μ′(s_{t+1}|θ^{μ′}) represents the action output by the actor target network, whose parameters are θ^{μ′}, for the state of the unmanned aerial vehicle at time t+1, and θ^{Q′} represents the critic target network parameters;
S2352, updating the actor network parameters and the critic network parameters according to the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training to obtain updated actor network parameters and critic network parameters;
S2353, updating the actor target network parameters based on the updated actor network parameters to obtain updated actor target network parameters;
S2354, updating the critic target network parameters based on the updated actor target network parameters and the critic network parameters to obtain updated critic target network parameters, and completing the update of the multi-experience-pool deep deterministic policy gradient model.
The beneficial effects of adopting the further scheme are as follows: the invention provides a method for acquiring the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training and updating the multi-experience-pool deep deterministic policy gradient model; based on the calculated training target Q value, the network parameters are updated, so that the speed and effect of unmanned aerial vehicle obstacle avoidance flight can be improved.
Further, the arrived flight experience pool and the unreached flight experience pool in the flight training experience buffer can provide flight experience data for offline unmanned aerial vehicle obstacle avoidance flight training, so as to optimize the actor target network parameters and the critic target network parameters.
The beneficial effects of adopting the further scheme are as follows: according to the invention, the two flight experience pools provide classified storage of successful (arrived) flight experience data and failed (unreached) flight experience data. The flight experience data are drawn by random sampling, which reduces the correlation between samples during training, thereby improving the stability of unmanned aerial vehicle obstacle avoidance flight and helping to improve the generalization capability of the actor-critic network model. In addition, experience replay and offline training reduce the amount of computation during training, since not every training step requires interaction with the environment of the three-dimensional obstacle avoidance scene.
Other advantages of the present invention will be described in more detail in the following embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a step flowchart of a training method for obstacle avoidance flight of a low-altitude unmanned aerial vehicle in an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
As shown in fig. 1, in one embodiment of the present invention, the present invention provides a training method for obstacle avoidance flight of a low-altitude unmanned aerial vehicle, including the following steps:
S1, constructing a three-dimensional obstacle avoidance scene;
the three-dimensional obstacle avoidance scene comprises a plurality of grades of flight areas, wherein the flight areas of each grade are respectively and correspondingly provided with an obstacle of a corresponding grade except for the initial grade flight area, and the flight areas of the grade are accessed through the obstacle of the corresponding grade.
The three-dimensional obstacle avoidance scene is a cuboid space with an overall width of 20 m, a height of 20 m and a length of 70 m, in which 5 levels of flight areas are arranged in total; the obstacles corresponding to each level of flight area are uniformly spaced along the length direction of the three-dimensional obstacle avoidance scene, wherein the first-level obstacle is two cross beams arranged along the width direction, the second-level obstacle is a wall with two groups of openings (four openings in total) arranged along the width direction, the third-level obstacle is two cylinders arranged along the width direction, and the fourth-level obstacle is one inclined cylinder.
In this embodiment, flight areas of levels 0-5 are set, and the unmanned aerial vehicle is trained by taking off from the initial-level flight area; the level value of the initial-level flight area is 0, the level value of the flight area in which the unmanned aerial vehicle is located changes from 0 to 1 when it flies across the first-level obstacle, and so on, and the level value of the corresponding flight area is highest when the unmanned aerial vehicle successfully arrives at the target position. The whole space of the three-dimensional obstacle avoidance scene is 20 m wide and 20 m high, the overall length of the flight area is 70 m, and the obstacles are arranged at equal intervals. The space coordinates take the position of the center of gravity of the unmanned aerial vehicle as the coordinate origin, with the vertical axis pointing downward; the initial flight height is 3 m, so the initial coordinates are (0, 0, −3). The first-level obstacle is two cross beams arranged along the width direction, with coordinates (0, 14, −8) and (0, 14, −14); the second-level obstacle is a wall containing two groups of square openings that can be passed through (four openings in total), each with a side length of 4 m and lowest heights of 6 m and 14 m respectively, with coordinates (8, 28, −6), (16, 28, −6), (8, 28, −14) and (16, 28, −14); the third-level obstacle is two cylinders with a height of 20 m, with coordinates (7, 42, −10) and (14, 42, −10); the fourth-level obstacle is an inclined cylinder; and the target position is set at coordinates (0, 70, −3).
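For illustration only, the following Python sketch reproduces the scene geometry of this embodiment and maps an unmanned aerial vehicle position to the level of the flight area that contains it; the even 14 m spacing of the obstacle planes along the length direction and all function and constant names are assumptions of the sketch, not limitations of the method.

# Illustrative sketch of the 3D obstacle avoidance scene of this embodiment.
# Assumption: obstacle planes are evenly spaced every 14 m along the length
# (y) axis, and the vertical axis points downward (altitude 3 m -> z = -3).
SCENE_WIDTH = 20.0    # m
SCENE_HEIGHT = 20.0   # m
SCENE_LENGTH = 70.0   # m
NUM_LEVELS = 5        # flight areas, level 0 (start) .. level 4 (target area)
OBSTACLE_SPACING = SCENE_LENGTH / NUM_LEVELS  # 14 m between obstacle planes

def flight_area_level(y: float) -> int:
    """Return the level of the flight area that contains the y coordinate."""
    if not 0.0 <= y <= SCENE_LENGTH:
        raise ValueError("position is outside the obstacle avoidance scene")
    return min(int(y // OBSTACLE_SPACING), NUM_LEVELS - 1)

# Example: the start position (0, 0, -3) lies in level 0, the wall with four
# openings at y = 28 marks the entry to level 2, and the target (0, 70, -3)
# lies in the highest-level flight area.
assert flight_area_level(0.0) == 0
assert flight_area_level(30.0) == 2
assert flight_area_level(70.0) == NUM_LEVELS - 1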
S2, arranging the unmanned aerial vehicle in a three-dimensional obstacle avoidance scene, and performing low-altitude unmanned aerial vehicle flight training by utilizing a multi-experience-pool deep deterministic policy gradient model to complete the training of the low-altitude unmanned aerial vehicle obstacle avoidance flight.
The multi-experience-pool deep deterministic policy gradient model classifies and stores experience data in separate pools on the basis of the deep deterministic policy gradient (DDPG) model.
The step S2 comprises the following steps:
S21, setting a flight training experience buffer to store flight experience data, by classification, into an arrived flight experience pool and an unreached flight experience pool;
S22, constructing a multi-experience-pool deep deterministic policy gradient model for the unmanned aerial vehicle obstacle avoidance flight, and initializing the actor network parameters, critic network parameters, actor target network parameters and critic target network parameters of the multi-experience-pool deep deterministic policy gradient model;
S23, repeatedly enabling the unmanned aerial vehicle to autonomously explore the flight from the initial position within the duration of single flight training to obtain flight experience data, training target Q value and rewarding results corresponding to each unmanned aerial vehicle low-altitude obstacle avoidance training;
the step S23 includes the steps of:
S231, setting a threshold for the number of obstacle avoidance flight trainings of the unmanned aerial vehicle and the duration of a single flight training; in this embodiment, the duration of a single flight training of the unmanned aerial vehicle is 6 minutes;
S232, setting the unmanned aerial vehicle at any position in an initial-level flight area, taking the position as an initial position, and initializing unmanned aerial vehicle state data;
The unmanned aerial vehicle state data is (s_t, a_t, r_t, s_{t+1}), wherein s_t represents the state of the unmanned aerial vehicle at time t, a_t represents the action of the unmanned aerial vehicle at time t, r_t represents the rewarding and punishing result of the unmanned aerial vehicle at time t, and s_{t+1} represents the state of the unmanned aerial vehicle at time t+1; the unmanned aerial vehicle selects its action at time t according to its state at time t and the policy network, and the rewarding and punishing result of the unmanned aerial vehicle at time t and the state of the unmanned aerial vehicle at time t+1 are obtained after the action at time t is executed; the time t and the time t+1 are moments during the unmanned aerial vehicle obstacle avoidance flight training.
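For concreteness, one such record can be represented, for example, by the following Python data structure (a non-limiting sketch; the class and field names are illustrative only):

from dataclasses import dataclass
from typing import Sequence

@dataclass
class Transition:
    """One item of unmanned aerial vehicle state data (s_t, a_t, r_t, s_{t+1})."""
    state: Sequence[float]       # s_t: state of the UAV at time t
    action: Sequence[float]      # a_t: action executed at time t
    reward: float                # r_t: reward/punishment result at time t
    next_state: Sequence[float]  # s_{t+1}: state of the UAV at time t+1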
S233, selecting the action of the unmanned aerial vehicle at the moment t according to the state of the unmanned aerial vehicle at the moment t and the network parameters of the actor, and executing the action of the unmanned aerial vehicle at the moment t to obtain the rewarding and punishing result of the unmanned aerial vehicle at the moment t and the state of the unmanned aerial vehicle at the moment t+1 until the unmanned aerial vehicle reaches the target position of the highest-grade flight area or reaches the single flight training time of the unmanned aerial vehicle, so as to complete the single unmanned aerial vehicle obstacle avoidance flight training;
s234, calculating and obtaining a rewarding result of the obstacle avoidance flight training of the single unmanned aerial vehicle according to the rewarding function;
The calculation expression of the rewarding result is as follows:
Wherein R_e represents the reward result corresponding to the low-altitude obstacle avoidance training of the unmanned aerial vehicle, R represents the obstacle-avoidance pass reward and punishment function, a represents the level of the flight area to which the starting position of the unmanned aerial vehicle belongs, i represents the level number of a flight area, R_i represents the flight area reward function, r represents the forward reward value for the unmanned aerial vehicle successfully reaching the target position, s represents the training state of the unmanned aerial vehicle, ar indicates that the unmanned aerial vehicle has reached the target position, nca indicates that the unmanned aerial vehicle has not collided and has not reached the target position, los indicates that the travelling speed of the unmanned aerial vehicle is smaller than a preset speed threshold, co indicates that the unmanned aerial vehicle has collided, sign(·) represents the sign function, x represents the level number of the flight area in which the unmanned aerial vehicle is located during training, and L represents the total number of flight area levels. In this embodiment, the level number of the initial-level flight area is 0, and the level numbers of the other flight areas are the corresponding positive integers: the level number of the first-level flight area is 1, the level number of the second-level flight area is 2, …, and the level number of the Nth-level flight area is N; the total number of flight area levels is the maximum level number of the flight areas.
In this embodiment, the input of the obstacle-avoidance pass reward and punishment function is set as {position information, collision information, speed information, arrival information}. The higher the level number of the flight area, the higher the reward value obtained when calculating the reward result. The collision information is obtained from the three-dimensional obstacle avoidance scene: if an obstacle is hit, obstacle avoidance fails and a penalty is deducted; otherwise the unmanned aerial vehicle is in a normal state, and when the unmanned aerial vehicle has not reached the target position, the collision information and the arrival information are analysed together to calculate the result of the obstacle-avoidance pass reward and punishment function. If the flight speed in the speed information is smaller than the preset speed threshold, a penalty is also deducted, which prevents the actor-critic network model from outputting no action strategy merely to avoid collisions, a behaviour that would harm training speed and training effect. When the arrival information indicates that the unmanned aerial vehicle has successfully reached the target position, the corresponding reward is obtained, and the rewards and penalties applied under the pass reward and punishment conditions are scaled with reference to the reward obtained for reaching the target position. In short, the obstacle-avoidance pass reward and punishment function gives a forward reward r to an unmanned aerial vehicle that successfully arrives at the target point, and gives penalties to an unmanned aerial vehicle that does not arrive within the single-training time limit, flies too slowly, or collides during flight.
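As a non-limiting illustration, the following Python sketch shows one reward structure that is consistent with the description above (forward reward r = 2 for arrival, penalties for collision and for flying below the speed threshold, and a level-dependent flight-area bonus). The exact expressions of R and R_i are defined by the formula of the invention and are not reproduced here, so the penalty constants and the linear level bonus used below are assumptions of the sketch only.

# Hypothetical reward sketch; constants and functional forms are assumptions.
R_ARRIVAL = 2.0      # forward reward r for successfully reaching the target
TOTAL_LEVELS = 5     # L, total number of flight-area levels

def pass_reward(state: str) -> float:
    """Obstacle-avoidance pass reward/punishment R for the training state s."""
    if state == "ar":    # target position reached
        return R_ARRIVAL
    if state == "nca":   # no collision, target not reached
        return 0.0
    if state == "los":   # forward speed below the preset speed threshold
        return -0.5 * R_ARRIVAL
    if state == "co":    # collision with an obstacle
        return -R_ARRIVAL
    raise ValueError(f"unknown training state: {state}")

def area_reward(level: int) -> float:
    """Flight-area reward R_i: higher-level areas give a larger bonus."""
    return R_ARRIVAL * level / TOTAL_LEVELS

def episode_reward(start_level: int, reached_level: int, state: str) -> float:
    """Reward result R_e: pass reward plus the bonus of every area entered."""
    entered = range(start_level + 1, reached_level + 1)
    return pass_reward(state) + sum(area_reward(i) for i in entered)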
S235, acquiring the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training, and updating the multi-experience-pool deep deterministic policy gradient model;
The step S235 includes the steps of:
S2351, calculating a training target Q value of unmanned aerial vehicle obstacle avoidance flight training according to unmanned aerial vehicle state data;
The calculation expression of the training target Q value is as follows:
y_t = r_t + γQ′(s_{t+1}, μ′(s_{t+1}|θ^{μ′})|θ^{Q′})
Wherein y_t represents the training target Q value, r_t represents the reward (excitation) value at time t, γ represents the discount rate, Q′(·) represents the critic target network, μ′(s_{t+1}|θ^{μ′}) represents the action output by the actor target network, whose parameters are θ^{μ′}, for the state of the unmanned aerial vehicle at time t+1, and θ^{Q′} represents the critic target network parameters;
S2352, updating the actor network parameters and the critic network parameters according to the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training to obtain updated actor network parameters and critic network parameters;
The critic network interacts with the three-dimensional obstacle avoidance scene, and its parameters are updated by minimising the difference between the current critic network's predicted value and the target value; the actor network updates its parameters by gradient ascent, using the critic network's value estimate to guide the improvement of the unmanned aerial vehicle's action strategy at each moment, so as to obtain the updated actor network parameters.
S2353, updating the actor target network parameters based on the updated actor network parameters to obtain updated actor target network parameters;
The updated actor target network parameters are calculated as follows:
θ^{μ′} ← τθ^{μ} + (1 − τ)θ^{μ′}
Wherein θ^{μ′} represents the actor target network parameters, τ represents the softness parameter of the parameter synchronization, and θ^{μ} represents the actor network parameters; τ = 1 is a hard update, τ = 0 is no update, and 0 < τ < 1 is a soft update. The update of the actor target network parameters in this embodiment is a soft update.
S2354, updating the critic target network parameters based on the updated actor target network parameters and the critic network parameters to obtain updated critic target network parameters, and completing the update of the multi-experience-pool deep deterministic policy gradient model.
The updated critic target network parameters are calculated as follows:
θ^{Q′} ← τθ^{Q} + (1 − τ)θ^{Q′}
Wherein θ^{Q′} represents the critic target network parameters and θ^{Q} represents the critic network parameters.
In this embodiment, Adam optimizers are employed to update the network parameters; the learning rate of the actor network and the learning rate of the critic network are each set to preset values, the training batch size is set to 64, and the exploration rate is initially set to 1 so that the unmanned aerial vehicle keeps exploring. The exploration rate decreases gradually as training progresses, and its minimum value is set to 0.01 so that the strategy remains stable while retaining some exploration capability in the later stage of training. In the reward function, the forward reward value r for the unmanned aerial vehicle successfully reaching the target position is set to 2, the discount rate γ is set to 0.9, and the storage capacity for flight experience data in both the arrived flight experience pool and the unreached flight experience pool is set to 50000.
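As a non-limiting illustration, the following PyTorch-style Python sketch performs one parameter update of the model as described in S2351-S2354: it computes the training target Q value, updates the critic by minimising the gap to that target, updates the actor by gradient ascent on the critic's estimate, and softly updates both target networks. The network sizes, state and action dimensions, learning rates and the value of the softness parameter τ are placeholder assumptions.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM = 12, 3   # illustrative dimensions, not from the patent
GAMMA, TAU = 0.9, 0.005         # discount rate per this embodiment; tau assumed

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor = mlp(STATE_DIM, ACTION_DIM)        # policy network mu
critic = mlp(STATE_DIM + ACTION_DIM, 1)   # value network Q
actor_target = copy.deepcopy(actor)       # mu'
critic_target = copy.deepcopy(critic)     # Q'
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # placeholder lr
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # placeholder lr

def q(net, s, a):
    return net(torch.cat([s, a], dim=-1))

def update(s, a, r, s_next):
    # Training target Q value: y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1}))
    with torch.no_grad():
        y = r + GAMMA * q(critic_target, s_next, actor_target(s_next))

    # S2352: critic update, minimising the gap between prediction and target
    critic_loss = F.mse_loss(q(critic, s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # S2352: actor update by gradient ascent on the critic's value estimate
    actor_loss = -q(critic, s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # S2353 and S2354: soft target updates, theta' <- tau*theta + (1-tau)*theta'
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - TAU).add_(TAU * p.data)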
S236, judging whether the unmanned aerial vehicle reaches the target position of the highest-level flight area in the obstacle avoidance flight training of the single unmanned aerial vehicle, if so, entering S237, otherwise, entering S238;
Step S236 may alternatively be replaced by judging whether the reward result of the single unmanned aerial vehicle obstacle avoidance flight training is greater than a preset reward result threshold and no collision has occurred; if so, the process proceeds to S237, otherwise to S238. In this embodiment, the preset reward result threshold is 20, i.e. when the reward result of a single obstacle avoidance flight training is greater than 20 and no collision has occurred, the flight experience data of that training are considered valuable and are added to the arrived flight experience pool.
S237, randomly acquiring unmanned aerial vehicle state data at a plurality of moments in the single unmanned aerial vehicle obstacle avoidance flight training, storing the unmanned aerial vehicle state data as flight experience data in the arrived flight experience pool, and entering S239;
S238, randomly acquiring unmanned aerial vehicle state data at a plurality of moments in the single unmanned aerial vehicle obstacle avoidance flight training, storing the unmanned aerial vehicle state data as flight experience data in the unreached flight experience pool, and entering S239;
S239, judging whether the total number of the unmanned aerial vehicle obstacle avoidance flight training reaches a threshold value of the number of the unmanned aerial vehicle obstacle avoidance flight training, if yes, entering S24, otherwise returning to S231.
S24, optimizing the actions of the unmanned aerial vehicle during training by using the reward results, so that the training target Q value keeps increasing until it stabilizes near the expected training target Q value, thereby completing the obstacle avoidance flight training of the low-altitude unmanned aerial vehicle.
The arrived flight experience pool and the unreached flight experience pool in the flight training experience buffer can provide flight experience data for offline unmanned aerial vehicle obstacle avoidance flight training, so as to optimize the actor target network parameters and the critic target network parameters.
In the invention, two different experience pools are created according to whether the unmanned aerial vehicle arrives at the target position: an arrived flight experience pool and an unreached flight experience pool. The data of a single flight training is the unmanned aerial vehicle flight state data produced by that obstacle avoidance flight training; if the unmanned aerial vehicle successfully reaches the target position within the time limit of the experiment, the data are stored in the arrived flight experience pool. If the unmanned aerial vehicle is still exploring, or a collision occurs, within the time limit, the data are stored in the unreached flight experience pool. The arrived experience pool thus records flight state data from trainings in which the target position was successfully reached, or which obtained a relatively high reward even though the target point was not reached within the time limit; the unreached experience pool records flight state data from trainings in which a collision occurred within the time limit, or in which the unmanned aerial vehicle was still exploring. During training, samples are drawn from the two experience pools in proportion: each time the algorithm parameters need to be updated, a batch of data is taken from the two pools for training. Through the arrangement of the two flight experience pools, the invention reduces the correlation between samples in the training process and also improves the stability of the unmanned aerial vehicle during obstacle avoidance flight.
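As a non-limiting illustration, a minimal Python sketch of the two experience pools and the proportional sampling described above follows; the pool capacity of 50000 and the reward-result threshold of 20 match this embodiment, while the 50/50 sampling proportion and all class and field names are assumptions of the sketch.

import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state")

class DualExperiencePool:
    """Arrived / unreached flight experience pools with proportional sampling."""

    def __init__(self, capacity=50000, reward_threshold=20.0):
        self.arrived = deque(maxlen=capacity)    # reached target, or high reward without collision
        self.unreached = deque(maxlen=capacity)  # collided or still exploring at the time limit
        self.reward_threshold = reward_threshold

    def store_episode(self, transitions, episode_reward, reached_target, collided):
        # Classify one training's flight experience data into one of the pools.
        valuable = reached_target or (episode_reward > self.reward_threshold and not collided)
        pool = self.arrived if valuable else self.unreached
        pool.extend(transitions)

    def sample(self, batch_size, arrived_fraction=0.5):
        # Draw a fixed proportion of the batch from each pool for offline training.
        n_arr = min(int(batch_size * arrived_fraction), len(self.arrived))
        n_unr = min(batch_size - n_arr, len(self.unreached))
        batch = random.sample(self.arrived, n_arr) + random.sample(self.unreached, n_unr)
        random.shuffle(batch)
        return batch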
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention.

Claims (7)

1. The training method for the obstacle avoidance flight of the low-altitude unmanned aerial vehicle is characterized by comprising the following steps of:
S1, constructing a three-dimensional obstacle avoidance scene;
S2, arranging the unmanned aerial vehicle in a three-dimensional obstacle avoidance scene, and performing low-altitude unmanned aerial vehicle flight training by utilizing a multi-experience-pool deep deterministic policy gradient model to complete the training of the low-altitude unmanned aerial vehicle obstacle avoidance flight;
The step S2 comprises the following steps:
S21, setting a flight training experience buffer to store flight experience data, by classification, into an arrived flight experience pool and an unreached flight experience pool;
S22, constructing a multi-experience-pool deep deterministic policy gradient model for the unmanned aerial vehicle obstacle avoidance flight, and initializing the actor network parameters, critic network parameters, actor target network parameters and critic target network parameters of the multi-experience-pool deep deterministic policy gradient model;
S23, repeatedly enabling the unmanned aerial vehicle to autonomously explore the flight from the initial position within the duration of single flight training to obtain flight experience data, training target Q value and rewarding results corresponding to each unmanned aerial vehicle low-altitude obstacle avoidance training;
the step S23 includes the steps of:
S231, setting a threshold for the number of obstacle avoidance flight trainings of the unmanned aerial vehicle and the duration of a single flight training;
S232, setting the unmanned aerial vehicle at any position in an initial-level flight area, taking the position as an initial position, and initializing unmanned aerial vehicle state data;
S233, selecting the action of the unmanned aerial vehicle at the moment t according to the state of the unmanned aerial vehicle at the moment t and the network parameters of the actor, and executing the action of the unmanned aerial vehicle at the moment t to obtain the rewarding and punishing result of the unmanned aerial vehicle at the moment t and the state of the unmanned aerial vehicle at the moment t+1 until the unmanned aerial vehicle reaches the target position of the highest-grade flight area or reaches the single flight training time of the unmanned aerial vehicle, so as to complete the single unmanned aerial vehicle obstacle avoidance flight training;
s234, calculating and obtaining a rewarding result of the obstacle avoidance flight training of the single unmanned aerial vehicle according to the rewarding function;
S235, acquiring the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training, and updating the multi-experience-pool deep deterministic policy gradient model;
s236, judging whether the unmanned aerial vehicle reaches the target position of the highest-level flight area in the obstacle avoidance flight training of the single unmanned aerial vehicle, if so, entering S237, otherwise, entering S238;
S237, randomly acquiring unmanned aerial vehicle state data at a plurality of moments in the single unmanned aerial vehicle obstacle avoidance flight training, storing the unmanned aerial vehicle state data as flight experience data in the arrived flight experience pool, and entering S239;
S238, randomly acquiring unmanned aerial vehicle state data at a plurality of moments in the single unmanned aerial vehicle obstacle avoidance flight training, storing the unmanned aerial vehicle state data as flight experience data in the unreached flight experience pool, and entering S239;
S239, judging whether the total number of the unmanned aerial vehicle obstacle avoidance flight training reaches a threshold value of the number of the unmanned aerial vehicle obstacle avoidance flight training, if yes, entering S24, otherwise returning to S231;
S24, optimizing the actions of the unmanned aerial vehicle during training by using the reward results, so that the training target Q value keeps increasing until it stabilizes near the expected training target Q value, thereby completing the obstacle avoidance flight training of the low-altitude unmanned aerial vehicle.
2. The training method for the obstacle avoidance flight of the low-altitude unmanned aerial vehicle according to claim 1, wherein the three-dimensional obstacle avoidance scene comprises a plurality of levels of flight areas, and each level of flight area is respectively and correspondingly provided with a corresponding level of obstacle except for an initial level of flight area, and the flight area of the level is entered through the corresponding level of obstacle.
3. The training method for the obstacle avoidance flight of the low-altitude unmanned aerial vehicle according to claim 2, wherein the three-dimensional obstacle avoidance scene is a cuboid space with an overall width of 20m, a height of 20m and a length of 70m, in which 5 levels of flight areas are arranged in total; the obstacles corresponding to each level of flight area are uniformly spaced along the length direction of the three-dimensional obstacle avoidance scene, wherein the first-level obstacle is two cross beams arranged along the width direction, the second-level obstacle is a wall with two groups of openings (four openings in total) arranged along the width direction, the third-level obstacle is two cylinders arranged along the width direction, and the fourth-level obstacle is one inclined cylinder.
4. The training method for the obstacle avoidance flight of the low-altitude unmanned aerial vehicle according to claim 1, wherein the unmanned aerial vehicle state data is (s_t, a_t, r_t, s_{t+1}), s_t represents the state of the unmanned aerial vehicle at time t, a_t represents the action of the unmanned aerial vehicle at time t, r_t represents the rewarding and punishing result of the unmanned aerial vehicle at time t, and s_{t+1} represents the state of the unmanned aerial vehicle at time t+1; the unmanned aerial vehicle selects its action at time t according to its state at time t and the policy network, and the rewarding and punishing result of the unmanned aerial vehicle at time t and the state of the unmanned aerial vehicle at time t+1 are obtained after the action at time t is executed.
5. The training method for obstacle avoidance flight of a low altitude unmanned aerial vehicle of claim 4, wherein the calculation expression of the reward result is as follows:
Wherein R_e represents the reward result corresponding to the low-altitude obstacle avoidance training of the unmanned aerial vehicle, R represents the obstacle-avoidance pass reward and punishment function, a represents the level of the flight area to which the starting position of the unmanned aerial vehicle belongs, i represents the level number of a flight area, R_i represents the flight area reward function, r represents the forward reward value for the unmanned aerial vehicle successfully reaching the target position, s represents the training state of the unmanned aerial vehicle, ar indicates that the unmanned aerial vehicle has reached the target position, nca indicates that the unmanned aerial vehicle has not collided and has not reached the target position, los indicates that the travelling speed of the unmanned aerial vehicle is smaller than a preset speed threshold, co indicates that the unmanned aerial vehicle has collided, sign(·) represents the sign function, x represents the level number of the flight area in which the unmanned aerial vehicle is located during training, and L represents the total number of flight area levels.
6. The method for training the obstacle avoidance flight of the low altitude unmanned aerial vehicle of claim 4, wherein S235 comprises the steps of:
S2351, calculating a training target Q value of unmanned aerial vehicle obstacle avoidance flight training according to unmanned aerial vehicle state data;
The calculation expression of the training target Q value is as follows:
y_t = r_t + γQ′(s_{t+1}, μ′(s_{t+1}|θ^{μ′})|θ^{Q′})
Wherein y_t represents the training target Q value, r_t represents the reward (excitation) value at time t, γ represents the discount rate, Q′(·) represents the critic target network, μ′(s_{t+1}|θ^{μ′}) represents the action output by the actor target network, whose parameters are θ^{μ′}, for the state of the unmanned aerial vehicle at time t+1, and θ^{Q′} represents the critic target network parameters;
S2352, updating the actor network parameters and the critic network parameters according to the training target Q value of the unmanned aerial vehicle obstacle avoidance flight training to obtain updated actor network parameters and critic network parameters;
S2353, updating the actor target network parameters based on the updated actor network parameters to obtain updated actor target network parameters;
S2354, updating the critic target network parameters based on the updated actor target network parameters and the critic network parameters to obtain updated critic target network parameters, and completing the update of the multi-experience-pool deep deterministic policy gradient model.
7. The method for training the obstacle avoidance flight of the low-altitude unmanned aerial vehicle according to claim 1, wherein the arrived flight experience pool and the unreached flight experience pool in the flight training experience buffer can provide flight experience data for offline unmanned aerial vehicle obstacle avoidance flight training, so as to optimize the actor target network parameters and the critic target network parameters.
CN202410452069.1A 2024-04-16 2024-04-16 Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle Active CN118051063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410452069.1A CN118051063B (en) 2024-04-16 2024-04-16 Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410452069.1A CN118051063B (en) 2024-04-16 2024-04-16 Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle

Publications (2)

Publication Number Publication Date
CN118051063A CN118051063A (en) 2024-05-17
CN118051063B (en) 2024-06-14

Family

ID=91054085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410452069.1A Active CN118051063B (en) 2024-04-16 2024-04-16 Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle

Country Status (1)

Country Link
CN (1) CN118051063B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934332A (en) * 2018-12-31 2019-06-25 中国科学院软件研究所 Deep deterministic policy gradient learning method based on a critic and a double-ended experience pool
CN114089776A (en) * 2021-11-09 2022-02-25 南京航空航天大学 Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6721785B2 (en) * 2016-09-15 2020-07-15 グーグル エルエルシー Deep reinforcement learning for robot operation
CN110471444B (en) * 2019-08-19 2022-07-12 西安微电子技术研究所 Unmanned aerial vehicle intelligent obstacle avoidance method based on autonomous learning
CN114510078B (en) * 2022-02-16 2022-12-09 南通大学 Unmanned aerial vehicle maneuver avoidance decision-making method based on deep reinforcement learning
US20230306084A1 (en) * 2022-03-28 2023-09-28 Hitachi, Ltd. K-nearest multi-agent reinforcement learning for collaborative tasks with variable number of agents
CN115033022A (en) * 2022-06-28 2022-09-09 华南理工大学 DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform
CN115857530A (en) * 2022-12-05 2023-03-28 北京航天自动控制研究所 Decoupling-free attitude control method of aircraft based on TD3 multi-experience pool reinforcement learning
CN116227767A (en) * 2023-01-07 2023-06-06 南京工业大学 Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning
CN116974299A (en) * 2023-08-10 2023-10-31 北京理工大学 Reinforced learning unmanned aerial vehicle track planning method based on delayed experience priority playback mechanism
CN117715219A (en) * 2023-11-23 2024-03-15 重庆邮电大学 Space-time domain resource allocation method based on deep reinforcement learning
CN117875375A (en) * 2023-12-27 2024-04-12 南京航空航天大学 Multi-agent deep reinforcement learning method based on safety exploration

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934332A (en) * 2018-12-31 2019-06-25 中国科学院软件研究所 Deep deterministic policy gradient learning method based on a critic and a double-ended experience pool
CN114089776A (en) * 2021-11-09 2022-02-25 南京航空航天大学 Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN118051063A (en) 2024-05-17

Similar Documents

Publication Publication Date Title
CN111061277B (en) Unmanned vehicle global path planning method and device
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN112099496B (en) Automatic driving training method, device, equipment and medium
CN112327923A (en) Multi-unmanned aerial vehicle collaborative path planning method
CN113561986A (en) Decision-making method and device for automatically driving automobile
CN115357031B (en) Ship path planning method and system based on improved ant colony algorithm
CN112550314A (en) Embedded optimization type control method suitable for unmanned driving, driving control module and automatic driving control system thereof
CN114089776A (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
CN109190787B (en) Dual particle swarm multi-monitoring point access path planning method for underwater vehicle
CN116476863A (en) Automatic driving transverse and longitudinal integrated decision-making method based on deep reinforcement learning
CN118051063B (en) Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle
US20230162539A1 (en) Driving decision-making method and apparatus and chip
CN115826591B (en) Multi-target point path planning method based on neural network estimation path cost
CN117420824A (en) Path planning method based on intelligent ant colony algorithm with learning capability
CN116551703A (en) Motion planning method based on machine learning in complex environment
CN115344046A (en) Mobile robot path planning based on improved deep Q network algorithm
Lesch et al. Towards situation-aware meta-optimization of adaptation planning strategies
CN115981302A (en) Vehicle following lane change behavior decision-making method and device and electronic equipment
CN114527759A (en) End-to-end driving method based on layered reinforcement learning
CN115031753A (en) Driving condition local path planning method based on safety potential field and DQN algorithm
CN116882607B (en) Key node identification method based on path planning task
CN113589810B (en) Dynamic autonomous obstacle avoidance movement method and device for intelligent body, server and storage medium
CN116341662A (en) Automatic driving longitudinal decision-making method based on Monte Carlo tree search
CN117151577B (en) Combined path optimization method and system considering group heterogeneity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant