CN116679719A - Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy - Google Patents

Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy

Info

Publication number
CN116679719A
CN116679719A (application CN202310792088.4A)
Authority
CN
China
Prior art keywords
unmanned vehicle
rewards
network
model
ppo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310792088.4A
Other languages
Chinese (zh)
Inventor
张卫波
王单坤
黄赐坤
林景胜
丘英浩
陈虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202310792088.4A priority Critical patent/CN116679719A/en
Publication of CN116679719A publication Critical patent/CN116679719A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to an unmanned vehicle self-adaptive path planning method based on a dynamic window method and a near-end (proximal) policy. First, an agent-environment interaction model oriented to the unmanned vehicle is constructed; a proximal policy optimization (PPO) learning model based on the actor-critic framework is established; a reward function is defined according to the principle and main evaluation factors of the dynamic window approach (DWA); model parameters such as the input layer, the output layer, the number of hidden layers and the number of neurons are determined; and a DWA-PPO deep reinforcement learning model is built. The established DWA-PPO deep reinforcement learning model is then trained through continuous iteration until it converges to a network model that captures the latent relation between the surrounding environment information and the evaluation-function weight parameters, completing the construction of the self-adaptive PPO-ADWA algorithm. Finally, the feasibility and effectiveness of the PPO-ADWA-based unmanned vehicle self-adaptive path planning strategy are verified through simulation comparison experiments.

Description

Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy
Technical Field
The invention relates to the technical field of unmanned path planning and autonomous navigation, in particular to an unmanned vehicle self-adaptive path planning method based on a dynamic window method and a near-end strategy.
Background
In recent years, with the rapid development of science and technology, a new technological revolution represented by the internet, artificial intelligence and big data is redefining every industry, and the traditional automobile industry is facing a profound transformation. Traditional automobiles are developing towards intelligence and driverless operation, and intelligent connected vehicles and autonomous vehicles have become the strategic direction of the global automobile industry. Intelligent driving technology mainly comprises environment sensing, navigation and positioning, path planning, and control and decision-making. Path planning is an important link in intelligent driving and is of great significance to the development of intelligent driving technology.
Path planning can be summarized as follows: under a known environment, a safe, feasible and collision-free path is computed according to given rules, and an optimal obstacle-avoidance path from the start point to the end point is selected; in essence this is the selection of an optimal solution under multiple constraints, which makes path planning a key part of intelligent-vehicle unmanned navigation technology. Path planning algorithms can be further divided into global planning, based on knowledge of the complete area, and local planning, based on knowledge of the local area. The dynamic window approach (Dynamic Window Approach, DWA), as a local path planning method that accounts for the motion performance of the intelligent vehicle, is widely applied to intelligent-vehicle path navigation. The evaluation function of the DWA algorithm consists of three parts: an orientation-angle subfunction, an obstacle subfunction and a velocity subfunction; the evaluation function is the weighted sum of these three subfunctions. In the classical DWA algorithm the weights of the three subfunctions are fixed values; however, the obstacle environment around the intelligent vehicle is complex and changeable while it explores towards the end point, and different obstacle distributions require different weights, so the fixed-weight approach of the classical DWA algorithm easily traps the intelligent vehicle in a local optimum or prevents it from reaching the target. Therefore, the classical DWA algorithm is improved by means of a proximal policy optimization (near-end strategy optimization) algorithm from deep reinforcement learning.
Disclosure of Invention
The invention aims to solve the problem that, facing different obstacle environments, an agent cannot dynamically adjust the weight coefficients in the evaluation function and therefore may fail to find the end point or to compute an optimal path, and provides an unmanned vehicle self-adaptive path planning method based on a dynamic window method and a near-end strategy.
In order to achieve the above purpose, the technical scheme of the invention is as follows: an unmanned vehicle self-adaptive path planning method based on a dynamic window method and a near-end strategy comprises the following steps:
firstly, constructing an intelligent body-environment interaction model, wherein an unmanned vehicle is used as an intelligent body in deep reinforcement learning, and an obstacle map is used as an environment;
step two, establishing a DWA algorithm model: determining, according to the Ackermann intelligent vehicle, the speed range, angular speed range, acceleration range and angular acceleration range parameters, as well as the main elements and evaluation function of the DWA algorithm;
step three, establishing a proximal policy optimization learning PPO model based on an actor-critic framework, simulating and establishing the actual application scenario of the unmanned vehicle as the learning environment of the model, and determining the states and actions in the model according to the application scenario;
step four, constructing a DWA-PPO deep reinforcement learning model, and defining a reward function comprising main line rewards and sub-target rewards; determining parameters of a DWA-PPO deep reinforcement learning model including parameters of an input layer, an output layer, the number of hidden layers and the number of neurons, and completing instantiation of the DWA-PPO deep reinforcement learning model;
step five, constructing the self-adaptive PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, the navigation planning of the unmanned vehicle in randomly generated complex static obstacle environments is simulated to collect a training set for training the DWA-PPO deep reinforcement learning model; through repeated iteration until convergence, a model that outputs the corresponding weight parameters according to the changing distribution of surrounding obstacles is obtained, completing the construction of the self-adaptive PPO-ADWA algorithm;
step six, the self-adaptive adjustment capability of unmanned vehicle path planning based on the self-adaptive PPO-ADWA algorithm is demonstrated through a simulation comparison experiment.
Compared with the prior art, the invention has the following beneficial effects: aiming at the problem that the weight coefficients in the evaluation function of the traditional DWA algorithm cannot be dynamically adjusted with the environment and motion state of the intelligent vehicle, a proximal policy optimization algorithm from deep reinforcement learning is used to construct a DWA-PPO deep reinforcement learning model; a network model is obtained through continuous iterative training, so that the corresponding weight parameters are output and the construction of the self-adaptive PPO-ADWA algorithm is completed. The method solves the problem that, because of the fixed-weight evaluation function, the agent facing different obstacle environments cannot dynamically adjust the weight coefficients and therefore may fail to find the end point or to compute an optimal path.
Drawings
FIG. 1 is a schematic illustration of an agent-environment interaction model.
Fig. 2 is a schematic diagram of the principle of DWA algorithm.
Fig. 3 is a velocity angular velocity window.
FIG. 4 is a schematic of the course angle and the angle δ.
Fig. 5 is a schematic diagram of an actor commentator framework.
Fig. 6 is state s.
Fig. 7 is a policy network architecture.
Fig. 8 is a value network architecture.
FIG. 9 is a DWA-PPO model.
Fig. 10 shows the score and arrival-rate curves.
Fig. 11 is a simulation environment.
Fig. 12 is a classical DWA.
FIG. 13 is PPO-ADWA.
Fig. 14 is a weight parameter variation curve.
FIG. 15 is a flow chart of the method of the present invention.
Detailed Description
The technical scheme of the invention is specifically described below with reference to fig. 1 to 15.
As shown in fig. 15, the invention provides an unmanned vehicle self-adaptive path planning method based on a dynamic window method and a near-end strategy, which comprises the following steps:
firstly, constructing an intelligent body-environment interaction model, wherein an unmanned vehicle is used as an intelligent body in deep reinforcement learning, and an obstacle map is used as an environment;
step two, establishing a DWA algorithm model: determining, according to the Ackermann intelligent vehicle, the speed range, angular speed range, acceleration range and angular acceleration range parameters, as well as the main elements and evaluation function of the DWA algorithm;
step three, establishing a proximal policy optimization learning PPO model based on an actor-critic framework, simulating and establishing the actual application scenario of the unmanned vehicle as the learning environment of the model, and determining the states and actions in the model according to the application scenario;
step four, constructing a DWA-PPO deep reinforcement learning model, and defining a reward function comprising main line rewards and sub-target rewards; determining parameters of a DWA-PPO deep reinforcement learning model including parameters of an input layer, an output layer, the number of hidden layers and the number of neurons, and completing instantiation of the DWA-PPO deep reinforcement learning model;
step five, constructing the self-adaptive PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, the navigation planning of the unmanned vehicle in randomly generated complex static obstacle environments is simulated to collect a training set for training the DWA-PPO deep reinforcement learning model; through repeated iteration until convergence, a model that outputs the corresponding weight parameters according to the changing distribution of surrounding obstacles is obtained, completing the construction of the self-adaptive PPO-ADWA algorithm;
step six, the self-adaptive adjustment capability of unmanned vehicle path planning based on the self-adaptive PPO-ADWA algorithm is demonstrated through a simulation comparison experiment.
The method comprises the following steps:
step one, as shown in fig. 1, constructing an intelligent body-environment interaction model, wherein an unmanned vehicle is used as an intelligent body in deep reinforcement learning, and an obstacle map is used as an environment;
the intelligent agent plays roles of decision making and learning in the deep reinforcement learning system, is mainly responsible for outputting action information and receiving rewards and states, and the environment is an interactive object of the intelligent agent, and the interactive process comprises the following three steps:
(1) The agent observes information O_t ∈ O from the environment state S_t ∈ S, where S, the state space, is the value set of environmental states, and O, the observation space, is the value set of the agent's observations.
(2) From the known O_t, the agent makes a corresponding decision and determines the action A_t ∈ A to apply to the environment, where A is the value set of actions.
(3) Affected by A_t, the environment transitions from its own state S_t to S_{t+1} and gives the agent a reward R_t ∈ R, where R is the value set of rewards. The discretized agent-environment interaction model can thus be represented by the following sequence:
S_0, O_0, A_0, R_0, S_1, O_1, A_1, R_1, S_2, O_2, A_2, R_2, …, S_T = S_terminal
When the state of the environment can be completely observed by the agent, S_t = O_t holds, and the sequence simplifies to:
S_0, A_0, R_0, S_1, A_1, R_1, S_2, A_2, R_2, …, S_T = S_terminal
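This interaction sequence can be made concrete with a minimal rollout loop. The sketch below is illustrative only: ObstacleMapEnv and RandomAgent are hypothetical placeholders standing in for the obstacle-map environment and the unmanned-vehicle agent, and the reward and termination logic are arbitrary.

import random

class ObstacleMapEnv:
    """Hypothetical fully observed environment, so S_t = O_t."""
    def reset(self):
        return [0.0, 0.0]                                   # initial observation O_0
    def step(self, action):
        s_next = [random.random(), random.random()]         # next state S_{t+1}
        reward = -0.01                                       # reward R_t
        done = random.random() < 0.05                        # whether S_{t+1} is the termination state
        return s_next, reward, done

class RandomAgent:
    """Hypothetical agent; here the action is a random triple, e.g. the three weights."""
    def act(self, obs):
        return [random.uniform(-1.0, 1.0) for _ in range(3)]

env, agent = ObstacleMapEnv(), RandomAgent()
s, done, trajectory = env.reset(), False, []
while not done:                                              # one episode: S_0, A_0, R_0, ..., S_T
    a = agent.act(s)
    s_next, r, done = env.step(a)
    trajectory.append((s, a, r))
    s = s_next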
Step two, establishing a DWA algorithm model: determining, according to the Ackermann intelligent vehicle, the speed range, angular speed range, acceleration range and angular acceleration range parameters, as well as the main elements and evaluation function of the DWA algorithm;
the DWA algorithm is a local path planning method for intuitively understanding the map environment where the unmanned vehicle is located from the perspective of speed space, and the working flow is as follows: taking into consideration the constraint of each condition at the moment t on the speed and angular velocity, obtaining a speed and angular velocity window V which can be reached by the unmanned vehicle at the moment t win The method comprises the steps of carrying out a first treatment on the surface of the Discretizing the speed and angular velocity, and combining the discretized speed and angular velocity; the drone traverses all combinations and simulates the forward m delta according to a given motion model t Obtaining a simulation track set tau, namely a series of point sets, according to the duration; the evaluation function gives the scores of all the simulated tracks in the simulated track set tau, and the track tau with the highest score is selected b Corresponding combinations; driving the unmanned aerial vehicle with the combination for a forward time delta t Reaching the time t+1; cycling through this to the endpoint. m is the sampling step number, delta t Is the sampling interval, as shown in fig. 2.
At time t, the velocity-angular-velocity window V_win of the unmanned vehicle is constrained by its hardware conditions and the surrounding environment; the following three constraints are considered:
(1) Limit velocity and angular velocity constraint:
V_lim = {(v, w) | v ∈ [v_min, v_max] ∧ w ∈ [w_min, w_max]}
(2) Velocity and angular velocity constraint limited by acceleration:
V_acc = {(v, w) | v ∈ [v_cu - a_v·Δ_t, v_cu + a_v·Δ_t] ∧ w ∈ [w_cu - a_w·Δ_t, w_cu + a_w·Δ_t]}
(3) Velocity and angular velocity constraint limited by the braking distance:
V_dis = {(v, w) | v ≤ √(2·dist(v, w)·a_v) ∧ w ≤ √(2·dist(v, w)·a_w)}
In the above, v_min, v_max are the limit linear velocities and w_min, w_max are the limit angular velocities; v_cu, w_cu are the current linear velocity and angular velocity; a_v is the limit linear acceleration and a_w is the limit angular acceleration; dist(v, w) is the closest distance to an obstacle of the simulated trajectory corresponding to the velocity-angular-velocity combination (v, w). The final window at time t is expressed as:
V_win = V_lim ∩ V_acc ∩ V_dis, as shown in fig. 3.
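As a concrete illustration of V_win = V_lim ∩ V_acc ∩ V_dis, the following sketch intersects the three constraints for one time step. The limit values used as defaults are illustrative assumptions, and passing a single nearest-obstacle distance is a simplification of dist(v, w).

import math

def dynamic_window(v_cu, w_cu, dist_obs, dt,
                   v_min=0.0, v_max=1.0, w_min=-1.0, w_max=1.0,
                   a_v=0.5, a_w=1.0):
    """Return (v_lo, v_hi, w_lo, w_hi) approximating V_lim ∩ V_acc ∩ V_dis at time t."""
    # V_lim: limit velocity and angular velocity
    v_lo, v_hi, w_lo, w_hi = v_min, v_max, w_min, w_max
    # V_acc: velocities reachable within one step given the limit accelerations
    v_lo, v_hi = max(v_lo, v_cu - a_v * dt), min(v_hi, v_cu + a_v * dt)
    w_lo, w_hi = max(w_lo, w_cu - a_w * dt), min(w_hi, w_cu + a_w * dt)
    # V_dis: admissible velocities that still allow stopping before the nearest obstacle
    v_hi = min(v_hi, math.sqrt(2.0 * dist_obs * a_v))
    w_hi = min(w_hi, math.sqrt(2.0 * dist_obs * a_w))
    return v_lo, v_hi, w_lo, w_hi

# example: current (v, w) = (0.4, 0.1), nearest obstacle 0.8 m away, dt = 0.1 s
print(dynamic_window(0.4, 0.1, 0.8, 0.1))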
The evaluation function comprises three subfunctions and comprehensively considers three factors: the running speed of the unmanned vehicle, the risk of collision with obstacles, and the heading of the unmanned vehicle. Specifically:
G(v, w) = σ(α·heading(v, w) + η·dist(v, w) + γ·vel(v, w))
where heading(v, w) evaluates the deviation between the course angle φ of the unmanned vehicle and δ, the angle between the line from the unmanned vehicle to the target point and the positive x-axis direction, as shown in fig. 4; dist(v, w) is the Euclidean distance from the simulated trajectory to the nearest obstacle; vel(v, w) represents the linear velocity of the unmanned vehicle; and α, η, γ are the three weight coefficients. The evaluation function is thus composed of subfunctions with different dimensions. The normalization function σ(·) in the formula amounts to removing the dimensions, so that data of different dimensions can be unified under the same reference frame for combination or comparison, avoiding the evaluation bias caused by differing dimensions. Concretely, each subfunction value is divided by the sum of that subfunction over all candidate trajectories, e.g.
normal_heading(v_i, w_j) = heading(v_i, w_j) / Σ_{i,j} heading(v_i, w_j)
and dist(v_i, w_j) and vel(v_i, w_j) are normalized in the same way.
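The normalization and weighted scoring described above might be sketched as follows; the candidate subfunction values and the fixed weights in the example are illustrative, and in the adaptive algorithm the weights (α, η, γ) come from the policy network instead.

def normalize(values):
    """Divide each subfunction value by its sum over all candidate trajectories (the role of σ(·))."""
    total = sum(values)
    return [v / total if total > 0 else 0.0 for v in values]

def best_trajectory(headings, dists, vels, alpha, eta, gamma):
    """Score G = α·heading + η·dist + γ·vel after per-subfunction normalization; return the best index."""
    h, d, s = normalize(headings), normalize(dists), normalize(vels)
    scores = [alpha * hi + eta * di + gamma * si for hi, di, si in zip(h, d, s)]
    return max(range(len(scores)), key=scores.__getitem__)

# example with three candidate trajectories and illustrative weights
print(best_trajectory(headings=[120.0, 150.0, 90.0],
                      dists=[0.8, 0.5, 1.2],
                      vels=[0.6, 0.9, 0.4],
                      alpha=0.1, eta=0.2, gamma=0.1))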
The unmanned vehicle obtains the simulated trajectory according to a uniform motion model. Under the assumptions of this motion model, the linear velocity and angular velocity of the unmanned vehicle remain unchanged and the change of the linear-velocity direction is linear in time; to simplify the model and speed up computation, the velocity direction is treated as constant within each tiny time interval, so the uniform motion model can be discretized. Let x_t, y_t denote the coordinates of the intelligent vehicle at time t, φ_t the course angle at time t, and v_t, w_t the velocity and angular velocity at time t; then
x_{t+1} = x_t + v_t·Δ_t·cos(φ_t)
y_{t+1} = y_t + v_t·Δ_t·sin(φ_t)
φ_{t+1} = φ_t + w_t·Δ_t
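A forward simulation under this discretized uniform motion model could look like the sketch below; the step count, interval and initial pose are illustrative values.

import math

def simulate_trajectory(x, y, phi, v, w, dt, m):
    """Roll the discretized uniform motion model forward for m steps of length dt."""
    points = [(x, y, phi)]
    for _ in range(m):
        x = x + v * dt * math.cos(phi)
        y = y + v * dt * math.sin(phi)
        phi = phi + w * dt
        points.append((x, y, phi))
    return points

# example: simulate 3 s ahead (m = 30 steps of dt = 0.1 s) for one (v, w) combination
traj = simulate_trajectory(x=0.0, y=0.0, phi=0.0, v=0.5, w=0.2, dt=0.1, m=30)
print(traj[-1])   # pose reached at the end of the simulated trajectory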
Step three, establishing a proximal policy optimization learning (PPO) model based on an actor-critic framework (shown in fig. 5), simulating and establishing the actual application scenario of the unmanned vehicle as the learning environment of the model, and determining the states and actions in the model according to the application scenario;
the approach of the near-end policy optimization algorithm (Proximal Policy Optimization, PPO) is then to perform the function at the objectiveIncrease D KL The (p||q) penalty term is specifically as follows:
in the middle ofTo the parameter theta 1 The integration of (2) to obtain the target function of strategy learning based on importance sampling, theta is the strategy pi parameter, when the strategy is better, the target function +.>The larger, γ is the parameter introduced for the Monte Carlo approximation, U t Is the parameter in the policy gradient, pi (a t |s t ;θ 1 ) For target policy, pi (a t |s t ;θ 2 ) For behavioural policy->For mathematical expectation of policy network, beta is super parameter, and D is larger as the difference between the distribution q and p is larger KL The larger the (p||q) term is, ++>The greater the penalty received, the conversely D KL The smaller the (p||q) term is, ++>The smaller the penalty is, the goal of reinforcement learning is to maximize +.>Thus have penalty term->The behavior can be controlled to be within a certain similarity range with the target strategy.
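A numerical sketch of this KL-penalized objective, assuming diagonal Gaussian policies, is given below; the batch shapes, the return term U_t and the value of β are illustrative assumptions, and the discount factor γ^t is folded into U_t for brevity.

import torch
from torch.distributions import Normal, kl_divergence

def ppo_kl_objective(mu_new, std_new, mu_old, std_old, actions, U_t, beta=0.01):
    """Importance-weighted return minus beta * D_KL between behavior and target policies."""
    pi_new, pi_old = Normal(mu_new, std_new), Normal(mu_old, std_old)
    # importance sampling ratio pi(a_t|s_t; theta_1) / pi(a_t|s_t; theta_2)
    ratio = torch.exp(pi_new.log_prob(actions).sum(-1) - pi_old.log_prob(actions).sum(-1))
    surrogate = (ratio * U_t).mean()
    kl = kl_divergence(pi_old, pi_new).sum(-1).mean()
    return surrogate - beta * kl                     # maximize this objective

# example with a batch of 4 one-dimensional actions
mu_new = torch.zeros(4, 1, requires_grad=True)
obj = ppo_kl_objective(mu_new, torch.ones(4, 1),
                       torch.zeros(4, 1), torch.ones(4, 1),
                       actions=torch.randn(4, 1), U_t=torch.randn(4))
(-obj).backward()                                    # gradient ascent on J via descent on -J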
The unmanned vehicle searches for an optimal path connecting the start point and the end point in the obstacle environment, so the learning environment modelling the practical application scenario of the unmanned vehicle is the obstacle map.
The state s in the model is the environmental information perceived by the unmanned vehicle with its sensors, and may also include its own position and motion-state information. The state s is the only information source for the unmanned vehicle's action decisions and an important basis for maximizing the return, so its quality directly affects whether the algorithm converges, the convergence speed and the final performance. The state s can be understood as a high-dimensional vector of surrounding environment information; the final goal of the unmanned vehicle is to reach the end point along an optimal path, so the position and state of the vehicle, the distribution of surrounding obstacles and the position of the target point are the core basis of its action decisions. To be closer to practical application, the information reflected by one full scan of the lidar with a scanning interval of 2° is taken as the main part of the state s, which additionally contains the velocity v_t, angular velocity w_t and course angle φ_t of the unmanned vehicle and the current target point position (x_t^g, y_t^g), as shown in fig. 6. The specific approach is to use the policy network output to replace the fixed weights of the evaluation function, constructing an adaptive evaluation function; clearly, the action a corresponds to the weights (α, η, γ) in the evaluation function, so the action a is defined as follows:
a = [μ_1, σ_1, μ_2, σ_2, μ_3, σ_3]
where [μ_1, σ_1] are the mean and variance of the Gaussian probability density function used to describe the weight α; likewise, [μ_2, σ_2] are the mean and variance of the probability density function describing the weight η, and [μ_3, σ_3] are the mean and variance of the probability density function describing the weight γ. Then (α, η, γ) are determined by random sampling from their respective probability density functions, and the actions are mapped into the interval [-1, 1] by the Tanh function.
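Sampling the three weights from the network outputs and squashing them with Tanh might be sketched as follows; treating the σ outputs as standard deviations (rather than variances) is a simplifying assumption of this sketch.

import numpy as np

def sample_weights(action_params, rng=None):
    """action_params = [mu1, sigma1, mu2, sigma2, mu3, sigma3] -> (alpha, eta, gamma) in [-1, 1]."""
    rng = rng or np.random.default_rng()
    mus = np.asarray(action_params[0::2], dtype=float)
    sigmas = np.abs(np.asarray(action_params[1::2], dtype=float))
    raw = rng.normal(mus, sigmas)        # random sample from each Gaussian
    return np.tanh(raw)                  # map into [-1, 1] with the Tanh function

alpha, eta, gamma = sample_weights([0.2, 0.1, 0.5, 0.1, 0.3, 0.1])
print(alpha, eta, gamma)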
After the state s and the action a are determined, the numbers of neurons in the input and output layers of the policy network π(a|s; θ) and the value network q(s, a; w) are also determined. Schematic diagrams of the policy network and value network structures are shown in fig. 7 and fig. 8.
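A shell for the two networks, under an assumed state layout (180 lidar returns at a 2° interval plus five scalar quantities) and assumed hidden sizes, might look like the following torch sketch; these dimensions are illustrative, not values taken from figs. 7 and 8.

import torch
import torch.nn as nn

STATE_DIM = 180 + 5     # assumed: 180 lidar beams + v, w, course angle, target x, target y
ACTION_DIM = 6          # [mu1, sigma1, mu2, sigma2, mu3, sigma3]

class PolicyNet(nn.Module):
    """pi(a|s; theta): outputs the Gaussian parameters of the three weights."""
    def __init__(self, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE_DIM, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, ACTION_DIM))
    def forward(self, s):
        return self.body(s)

class ValueNet(nn.Module):
    """q(s, a; w): scores a state-action pair."""
    def __init__(self, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))
    def forward(self, s, a):
        return self.body(torch.cat([s, a], dim=-1))

print(PolicyNet()(torch.zeros(1, STATE_DIM)).shape)   # torch.Size([1, 6])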
step four, constructing a DWA-PPO deep reinforcement learning model, and defining a reward function comprising main line rewards and sub-target rewards; model parameters including parameters such as the size of an input layer, the size of an output layer, the number of hidden layers, the number of neurons and the like are determined, and instantiation of the DWA-PPO deep reinforcement learning model is completed;
the rewards obtained by the core content unmanned vehicle in the learning model can be divided into main line rewards and sub-target rewards according to whether the rewards obtained by triggering the main line event are:
main line rewards: the main line rewards are understood as settlement rewards for the arrival of the agent at the end state, i.e. rewards R obtained by navigation of the vehicle to the end point mian goal Punishment rewards R when the maximum number of iterative steps is exceeded mian out And punishment rewards R when unmanned vehicle collides with obstacle mian coll
Sub-target rewards: the rewards outside the main rewards are called auxiliary rewards, and the main form is sub-target rewards. And analyzing the influence of factors such as local key points, environment states, movement states of the unmanned aerial vehicle, relative relation between the unmanned aerial vehicle and target points and the like on a main line task of the unmanned aerial vehicle for finding an optimal path by combining with the actual application scene of navigation planning of the unmanned aerial vehicle in an obstacle environment, and giving the following sub-target rewards:
(1) Energy penalty prize R sub step :R sub step Is stored in (a)On one hand, the energy consumption of the unmanned vehicle can be limited, and meanwhile, the unmanned vehicle can be promoted to find an optimal path; e (E) t At speed v for the t step unmanned vehicle t Running delta t The energy consumed by the process, normalized, defines R sub step The method comprises the following steps:
(2) Distance change reward R sub dis : in the process, the unmanned vehicle can be away from the end point locally due to avoidance obstacles, but is always close to the end point globally. Thus, a prize relating to the distance between the position of the vehicle and the target point, R, can be defined sub dis Should be a positive prize, and R is the greater the distance moved in the direction of the end point sub dis The larger.
(3) Obstacle distance rewards R sub obs :r t obs Defining that when no obstacle exists in the safety distance of the unmanned vehicle and the unmanned vehicle brakes at the maximum deceleration, the unmanned vehicle does not collide in the planning process, which is the primary premise of ensuring the driving safety, and defining R after normalization sub obs The method comprises the following steps:
(4) Azimuth prize R sub head : the target of the unmanned vehicle reaches the destination, so that the more the unmanned vehicle is considered to face the destination in navigation, the better the heading angle of the unmanned vehicle is; r is (r) head The method is defined as that forward rewards are obtained when the heading of the unmanned vehicle is very close to the optimal azimuth angle, and R is defined after normalization sub head The method comprises the following steps:
to sum up, it can be known that unmanned vehicleRewards R at step t t Is a compound of the formula (I),the adjustment factor is awarded for the sub-target.
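One hedged way to compose the main-line and sub-target rewards is sketched below; the reward magnitudes and the adjustment factors kappa are illustrative placeholders rather than values from the patent.

def step_reward(reached_goal, collided, out_of_steps,
                sub_rewards, kappa=(0.1, 0.1, 0.1, 0.1)):
    """R_t = main-line reward + sum of adjustment-factor-weighted sub-target rewards."""
    if reached_goal:
        main = 100.0       # R_main^goal  (illustrative magnitude)
    elif collided:
        main = -100.0      # R_main^coll
    elif out_of_steps:
        main = -50.0       # R_main^out
    else:
        main = 0.0
    # sub_rewards = (R_sub^step, R_sub^dis, R_sub^obs, R_sub^head), each already normalized
    return main + sum(k * r for k, r in zip(kappa, sub_rewards))

print(step_reward(False, False, False, sub_rewards=(-0.2, 0.5, 0.3, 0.4)))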
The AC framework constructs a value network to approximate the action value in the policy gradient, so the network architecture includes at least the value network and the policy network. The value network loss function is
L(w) = (1/2)·[q(s_t, a_t; w) - y_t]²
and the learning target of the value network is
y_t = r_t + γ·q(s_{t+1}, a_{t+1}; w)
It can be seen that the learning target includes a part of the value network's own prediction q(s_{t+1}, a_{t+1}; w). Given that the value network overestimates the action value Q(s, a), this kind of self-learning of the value network continuously amplifies the overestimation, and the overestimation is non-uniform, which severely affects network training; this phenomenon is known as bootstrapping. To prevent bootstrapping of the value network, parameters w⁻ are used to construct a target value network q_T(s, a; w⁻), whose parameter structure is consistent with the value network but whose specific values differ; it is used to calculate the TD target and TD error:
y_t = r_t + γ·q_T(s_{t+1}, a_{t+1}; w⁻),  δ_t = q(s_t, a_t; w) - y_t
The initial parameters of the target value network are consistent with the value network. With μ as a parameter ensuring the coefficients sum to 1, the target value network is updated as:
w⁻_new ← μ·w_new + (1 - μ)·w⁻_now
Taken together, the network architecture under the DWA-PPO reinforcement learning model includes three major parts: the policy network π(a|s; θ), the value network q(s, a; w) and the target value network q_T(s, a; w⁻). The DWA-PPO reinforcement learning model is shown in FIG. 9.
In summary, the constructed model includes the agent, the environment, a critic module and an actor module. The critic module comprises the value network error function L(w), the value network q(s, a; w) and the target value network q_T(s, a; w⁻). The actor module comprises the target network π(a|s; θ_1), the behavior network π(a|s; θ_2) and the policy network objective function J(θ_1). The beginning of training is the collection of the training set, shown as the black line segments in fig. 9: at the initial moment of round 0, the unmanned vehicle observes state s_0 from the environment using its sensing and positioning system; the behavior network π(a|s; θ_2) receives s_0 and outputs a Gaussian distribution π(A_0|s_0; θ_1) over the action A_0; the action a_0 is then determined by random sampling from this probability distribution and transmitted to the intelligent vehicle, yielding the evaluation function G_0(v, w) of the DWA algorithm at the initial moment, which completes the evaluation of the simulated trajectory set of the DWA algorithm at the initial moment; the velocity and angular-velocity command of the optimal trajectory is transmitted to the unmanned vehicle motion control module to drive the unmanned vehicle. At this point the position, orientation angle, surrounding obstacle distribution and other information of the unmanned vehicle have changed, the environment transitions to state s_1, and the reward function feeds the reward r_0 back to the critic module according to the changed information. When s_1 is not the termination state s_n, the round proceeds to the next moment; otherwise, the map and the unmanned vehicle state are reset and the trajectory of the next round is collected, until i full rounds have been collected, finally obtaining the training set:
χ = [χ_0, χ_1, …, χ_i]
χ_0 = [s_0^0, a_0^0, r_0^0, …, s_{n-1}^0, a_{n-1}^0, r_{n-1}^0, s_n^0]
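Collecting i rounds of trajectories into the training set χ could be organized as in the following sketch; env, behavior_network and dwa_step are hypothetical stand-ins for the components described above, and the returned structure mirrors χ_0 above.

def collect_training_set(env, behavior_network, dwa_step, num_rounds, max_steps):
    """Return chi = [chi_0, ..., chi_i]; each chi_k holds (s_t, a_t, r_t) tuples plus the final state."""
    chi = []
    for _ in range(num_rounds):
        s, rollout, done, t = env.reset(), [], False, 0
        while not done and t < max_steps:
            a = behavior_network.sample_action(s)     # weights (alpha, eta, gamma) sampled from pi(a|s)
            s_next, r, done = dwa_step(env, s, a)     # DWA evaluates trajectories and drives the vehicle
            rollout.append((s, a, r))
            s, t = s_next, t + 1
        rollout.append(s)                             # terminal (or truncation) state s_n
        chi.append(rollout)
    return chi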
Step five, constructing the PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, the navigation planning of the unmanned vehicle in randomly generated complex static obstacle environments is simulated to collect a training set for training the network model; through repeated iteration until convergence, model parameters that output the corresponding weight parameters according to the changing distribution of surrounding obstacles are obtained, completing the construction of the self-adaptive PPO-ADWA algorithm;
After obtaining the training set, the value network q(s, a; w) is updated by back-propagating the error function L(w), and the policy network π(a|s; θ_1) is updated by back-propagating the objective function J(θ_1) of the PPO algorithm. Let the current parameters of the networks q(s, a; w), q_T(s, a; w⁻) and π(a|s; θ_1) be w_now, w⁻_now and θ_now respectively. The following steps are repeated Z times to complete one generation of updating:
(1) Randomly extract a mini-batch of M_I states s_N^I from the shuffled training set χ;
(2) Using q_T(s, a; w⁻), calculate the k-step TD error MTD_N^I with the state s_N^I as the starting point;
(3) Using the value network q(s, a; w), calculate the action-value estimate at the state s_N^I:
q_N^I = Q(s_N^I, a_N^I; w_now)
(4) Calculate L(w):
(5) Calculate the gradient of the objective function J(θ_1):
(6) Update value network, policy network, target value network:
Assume the parameter before updating is θ_now; after the importance-sampling update the parameter θ_new is obtained. Assume the parameter before updating is w_now; after the learning update the parameter w_new is obtained. Assume w⁻_now is the parameter introduced before updating to prevent bootstrapping of the value network; with μ as the parameter ensuring the coefficients sum to 1, w⁻_new is obtained after updating. After Z updates, the parameters of the target network π(a|s; θ_1) are assigned to the behavior network π(a|s; θ_2), which is recorded as one generation of updating; the training set is then emptied and the next generation of updating begins, until the model converges.
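One generation of updating could be sketched as follows; the mini-batch layout, the network interfaces (a dist(s) method on the policy, value networks taking (s, a)), the simple one-step TD target in place of the k-step error, and the hyperparameters are all simplifying assumptions for illustration.

import random
import torch
from torch.distributions import kl_divergence

def update_one_generation(batches, policy, policy_old, value, value_target,
                          opt_pi, opt_q, Z=10, gamma=0.99, mu=0.05, beta=0.01):
    """Repeat Z mini-batch updates of q(s,a;w), pi(a|s;theta_1) and q_T(s,a;w-), then sync theta_2 <- theta_1."""
    for p in policy_old.parameters():                 # the behavior policy is held fixed within a generation
        p.requires_grad_(False)
    for _ in range(Z):
        s, a, r, s2, a2 = random.choice(batches)      # one shuffled mini-batch of tensors
        with torch.no_grad():
            y = r + gamma * value_target(s2, a2).squeeze(-1)     # TD target from the target value network
        q = value(s, a).squeeze(-1)
        loss_q = ((q - y) ** 2).mean()                # L(w): mean squared TD error
        opt_q.zero_grad(); loss_q.backward(); opt_q.step()
        dist_new, dist_old = policy.dist(s), policy_old.dist(s)
        ratio = torch.exp(dist_new.log_prob(a).sum(-1) - dist_old.log_prob(a).sum(-1))
        adv = (y - q).detach()                        # simple stand-in for the return term U_t
        loss_pi = -(ratio * adv).mean() + beta * kl_divergence(dist_old, dist_new).sum(-1).mean()
        opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()
        with torch.no_grad():                         # soft update of the target value network
            for w_t, w in zip(value_target.parameters(), value.parameters()):
                w_t.mul_(1 - mu).add_(mu * w)
    policy_old.load_state_dict(policy.state_dict())   # after Z updates, theta_2 takes the new parameters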
Fig. 10 shows the curves of the average score and arrival rate per generation of the unmanned vehicle in the deep reinforcement learning environment during network training. As the model iterates and converges, a parameter network that can correctly guide unmanned vehicle path planning is gradually learned, completing the construction of the self-adaptive PPO-ADWA algorithm.
Step six, the self-adaptive adjustment capability of the unmanned vehicle path planning based on PPO-ADWA is demonstrated through a simulation comparison experiment;
to verify the self-regulating capability of unmanned vehicle path planning based on PPO-ADWA algorithm, this section will verify its robustness in a randomly generated complex static obstacle environment. As shown in FIG. 11, the simulation environment is characterized in that the map size is 60m multiplied by 60m, green dots are used as starting points, blue pentagram is used as an end point, black geometric figures represent barriers, the shapes of the barriers comprise regular polygons and circles, and the sizes and the number of the barriers are randomly generated within a certain range. The simulation results of the performance under the map with different 100 obstacle positions are shown in table 1.
Table 1 comparison of simulation results
The arrival rate of unmanned vehicle path planning under PPO-ADWA is 84%, an improvement of 6 percentage points over classical DWA; the average path length is 93.04 m, improving path efficiency by 5.00%; the average number of steps is 251.95, reducing the average step cost by 4.85%. The classical DWA planning results are shown in FIG. 12 and the PPO-ADWA planning results in FIG. 13. The variation curves of the weight parameters during unmanned vehicle path planning with the PPO-ADWA fusion strategy are shown in fig. 14; it can be seen that the weight parameters generally maintain the numerical relation η > γ > α.
The above is a preferred embodiment of the present invention, and all changes made according to the technical solution of the present invention belong to the protection scope of the present invention when the generated functional effects do not exceed the scope of the technical solution of the present invention.

Claims (6)

1. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy is characterized by comprising the following steps:
firstly, constructing an intelligent body-environment interaction model, wherein an unmanned vehicle is used as an intelligent body in deep reinforcement learning, and an obstacle map is used as an environment;
step two, establishing a DWA algorithm model: determining, according to the Ackermann intelligent vehicle, the speed range, angular speed range, acceleration range and angular acceleration range parameters, as well as the main elements and evaluation function of the DWA algorithm;
step three, establishing a proximal policy optimization learning PPO model based on an actor-critic framework, simulating and establishing the actual application scenario of the unmanned vehicle as the learning environment of the model, and determining the states and actions in the model according to the application scenario;
step four, constructing a DWA-PPO deep reinforcement learning model, and defining a reward function comprising main line rewards and sub-target rewards; determining parameters of a DWA-PPO deep reinforcement learning model including parameters of an input layer, an output layer, the number of hidden layers and the number of neurons, and completing instantiation of the DWA-PPO deep reinforcement learning model;
step five, constructing the self-adaptive PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, the navigation planning of the unmanned vehicle in randomly generated complex static obstacle environments is simulated to collect a training set for training the DWA-PPO deep reinforcement learning model; through repeated iteration until convergence, a model that outputs the corresponding weight parameters according to the changing distribution of surrounding obstacles is obtained, completing the construction of the self-adaptive PPO-ADWA algorithm;
step six, the self-adaptive adjustment capability of unmanned vehicle path planning based on the self-adaptive PPO-ADWA algorithm is demonstrated through a simulation comparison experiment.
2. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy according to claim 1, wherein step one is specifically implemented as follows:
the agent is responsible for outputting action information and receiving rewards and states; the environment is the object the agent interacts with; the interaction process comprises the following three steps:
(1) The agent observes information O_t ∈ O from the environment state S_t ∈ S, where S, the state space, is the value set of environmental states, and O, the observation space, is the value set of the agent's observations;
(2) From the known O_t, the agent makes a corresponding decision and determines the action A_t ∈ A to apply to the environment, where A is the value set of actions;
(3) Affected by A_t, the environment transitions from its own state S_t to S_{t+1} and gives the agent a reward R_t ∈ R, where R is the value set of rewards; the discretized agent-environment interaction model is thus represented by the following sequence:
S_0, O_0, A_0, R_0, S_1, O_1, A_1, R_1, S_2, O_2, A_2, R_2, …, S_T = S_terminal
When the state of the environment can be completely observed by the agent, S_t = O_t holds, and the sequence simplifies to:
S_0, A_0, R_0, S_1, A_1, R_1, S_2, A_2, R_2, …, S_T = S_terminal
3. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy according to claim 2, wherein the step two is specifically implemented as follows:
the DWA algorithm is a local path planning method that understands the map environment of the unmanned vehicle intuitively from the perspective of the velocity space; its workflow is as follows: taking into account the constraints imposed on the velocity and angular velocity by each condition at time t, obtain the velocity-angular-velocity window V_win reachable by the unmanned vehicle at time t; discretize the velocity and angular velocity and combine the discretized values; the unmanned vehicle traverses all combinations and, according to a given motion model, simulates forward for a duration of m·Δ_t, obtaining a simulated trajectory set τ, i.e. a series of point sets; the evaluation function scores all simulated trajectories in τ, and the combination corresponding to the highest-scoring trajectory τ_b is selected; the unmanned vehicle is driven with this combination for a forward time Δ_t, reaching time t+1; this cycle repeats until the end point is reached, where m is the number of sampling steps and Δ_t is the sampling interval;
at time t, the velocity-angular-velocity window V_win of the unmanned vehicle is constrained by its hardware conditions and the surrounding environment; the following three constraints are considered:
(1) Limit velocity and angular velocity constraint:
V_lim = {(v, w) | v ∈ [v_min, v_max] ∧ w ∈ [w_min, w_max]}
(2) Velocity and angular velocity constraint limited by acceleration:
V_acc = {(v, w) | v ∈ [v_cu - a_v·Δ_t, v_cu + a_v·Δ_t] ∧ w ∈ [w_cu - a_w·Δ_t, w_cu + a_w·Δ_t]}
(3) Velocity and angular velocity constraint limited by the braking distance:
V_dis = {(v, w) | v ≤ √(2·dist(v, w)·a_v) ∧ w ≤ √(2·dist(v, w)·a_w)}
In the above, v_min, v_max are the limit linear velocities and w_min, w_max are the limit angular velocities; v_cu, w_cu are the current linear velocity and angular velocity; a_v is the limit linear acceleration and a_w is the limit angular acceleration; dist(v, w) is the closest distance to an obstacle of the simulated trajectory corresponding to the velocity-angular-velocity combination (v, w); finally, the velocity-angular-velocity window of the unmanned vehicle at time t is expressed as:
V_win = V_lim ∩ V_acc ∩ V_dis
the evaluation function comprises three subfunctions and comprehensively considers three factors: the running speed of the unmanned vehicle, the risk of collision with obstacles, and the heading of the unmanned vehicle; specifically:
G(v, w) = σ(α·heading(v, w) + η·dist(v, w) + γ·vel(v, w))
where heading(v, w) evaluates the deviation between the course angle φ of the unmanned vehicle and δ, the angle between the line from the unmanned vehicle to the target point and the positive x-axis direction; dist(v, w) is the Euclidean distance from the simulated trajectory to the nearest obstacle; vel(v, w) represents the linear velocity of the unmanned vehicle; and α, η, γ are the three weight coefficients; the evaluation function is thus composed of subfunctions with different dimensions; the normalization function σ(·) in the formula amounts to removing the dimensions, so that data of different dimensions can be unified under the same reference frame for combination or comparison, avoiding the evaluation bias caused by differing dimensions; concretely, each subfunction value is divided by the sum of that subfunction over all candidate trajectories, e.g.
normal_heading(v_i, w_j) = heading(v_i, w_j) / Σ_{i,j} heading(v_i, w_j)
and dist(v_i, w_j) and vel(v_i, w_j) are normalized in the same way;
the unmanned vehicle obtains the simulated trajectory according to a uniform motion model; under the assumptions of the uniform motion model, the linear velocity and angular velocity of the unmanned vehicle remain unchanged and the change of the linear-velocity direction is linear in time; to simplify the model and speed up computation, the velocity direction is treated as constant within each tiny time interval, so the uniform motion model is discretized; letting x_t, y_t denote the coordinates of the intelligent vehicle at time t, φ_t the course angle at time t, and v_t, w_t the velocity and angular velocity at time t, the discretized model is:
x_{t+1} = x_t + v_t·Δ_t·cos(φ_t)
y_{t+1} = y_t + v_t·Δ_t·sin(φ_t)
φ_{t+1} = φ_t + w_t·Δ_t
4. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy according to claim 3, wherein step three is specifically implemented as follows:
the approach of the proximal policy optimization algorithm is to add a D_KL(p‖q) penalty term to the objective function J(θ_1), specifically:
J_β(θ_1) = E_{a_t∼π(a_t|s_t;θ_2)}[ (π(a_t|s_t;θ_1) / π(a_t|s_t;θ_2)) · γ^t · U_t ] - β·D_KL(p‖q)
where J(θ_1) is the objective function of policy learning based on importance sampling; θ is the parameter of the policy π, and the better the policy, the larger the objective function J(θ_1); γ is the parameter introduced for the Monte Carlo approximation; U_t is the return term appearing in the policy gradient; π(a_t|s_t;θ_1) is the target policy and π(a_t|s_t;θ_2) is the behavior policy; E denotes the mathematical expectation over the policy network; and β is a hyperparameter; the larger the difference between the distributions q and p, the larger the D_KL(p‖q) term and the greater the penalty imposed on J_β(θ_1); conversely, the smaller the D_KL(p‖q) term, the smaller the penalty; since the goal of reinforcement learning is to maximize J_β(θ_1), the penalty term β·D_KL(p‖q) keeps the behavior policy within a preset similarity range of the target policy;
the unmanned vehicle searches an optimal path capable of connecting the starting point and the end point in the obstacle environment, so that the learning environment taking the actual application scene of the unmanned vehicle as a model is an obstacle map;
the state s in the model is the environmental information perceived by the unmanned vehicle with its sensors and comprises its own position and motion-state information; the information reflected by one full scan of the lidar with a scanning interval of 2° forms the main part of the state s, which additionally comprises the velocity v_t, angular velocity w_t and course angle φ_t of the unmanned vehicle and the current target point position (x_t^g, y_t^g); the specific approach is to use the policy network output to replace the fixed weights of the evaluation function, constructing an adaptive evaluation function; clearly, the action a corresponds to the weights (α, η, γ) in the evaluation function, so the action a is defined as follows:
a = [μ_1, σ_1, μ_2, σ_2, μ_3, σ_3]
where [μ_1, σ_1] are the mean and variance of the Gaussian probability density function used to describe the weight α; likewise, [μ_2, σ_2] are the mean and variance of the probability density function describing the weight η, and [μ_3, σ_3] are the mean and variance of the probability density function describing the weight γ; then (α, η, γ) are determined by random sampling from their respective probability density functions, and the actions are mapped into the interval [-1, 1] by the Tanh function;
after the state s and the action a are determined, the numbers of neurons in the input and output layers of the policy network and the value network are also determined.
5. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy according to claim 4, wherein step four is specifically implemented as follows:
the reward function is the core content of learning in the DWA-PPO deep reinforcement learning model; the rewards obtained by the unmanned vehicle are divided into main-line rewards and sub-target rewards according to whether they are obtained by triggering a main-line event:
main-line rewards: the main-line rewards are settlement rewards for the agent reaching an end state, namely the reward R_main^goal obtained when the unmanned vehicle navigates to the end point, the penalty reward R_main^out when the maximum number of iteration steps is exceeded, and the penalty reward R_main^coll when the unmanned vehicle collides with an obstacle;
sub-target rewards: rewards other than the main-line rewards are called auxiliary rewards, whose main form is sub-target rewards; combining the practical application scenario of navigation planning for the unmanned vehicle in an obstacle environment, the influence of factors such as local key points, the environment state, the motion state of the unmanned vehicle and the relative relation between the unmanned vehicle and the target point on the main-line task of finding an optimal path is analyzed, and the following sub-target rewards are given:
(1) Energy consumption penalty reward R_sub^step: R_sub^step on the one hand limits the energy consumption of the unmanned vehicle and at the same time pushes it to find an optimal path; E_t is the energy consumed when the unmanned vehicle runs at velocity v_t for Δ_t at step t; after normalization, R_sub^step is defined as:
(2) Distance change reward R_sub^dis: a reward related to the distance between the position of the unmanned vehicle and the target point is defined; R_sub^dis should be a positive reward, and the larger the distance moved towards the end point, the larger R_sub^dis;
(3) Obstacle distance reward R_sub^obs: r_t^obs is defined such that, when there is no obstacle within the safety distance of the unmanned vehicle and the vehicle brakes at maximum deceleration, no collision occurs during planning, which is the primary premise for ensuring driving safety; after normalization, R_sub^obs is defined as:
(4) Azimuth reward R_sub^head: the goal of the unmanned vehicle is to reach the end point, so during navigation the more the unmanned vehicle heads towards the end point, the better its course angle is considered to be; r^head is defined such that a positive reward is obtained when the heading of the unmanned vehicle is very close to the optimal azimuth; after normalization, R_sub^head is defined as:
the reward R_t of the unmanned vehicle at step t is a composite of the above rewards, in which the sub-target rewards are scaled by the sub-target reward adjustment factor;
the network architecture under the DWA-PPO reinforcement learning model comprises at least the value network and the policy network; the value network loss function is
L(w) = (1/2)·[q(s_t, a_t; w) - y_t]²
and the learning target of the value network is
y_t = r_t + γ·q(s_{t+1}, a_{t+1}; w)
the learning target of the value network therefore includes a part of its own prediction q(s_{t+1}, a_{t+1}; w); to prevent bootstrapping of the value network, parameters w⁻ are used to construct a target value network q_T(s, a; w⁻), whose parameter structure is consistent with the value network but whose specific values differ, and which is used to calculate the TD target and TD error:
y_t = r_t + γ·q_T(s_{t+1}, a_{t+1}; w⁻),  δ_t = q(s_t, a_t; w) - y_t
the initial parameters of the target value network are consistent with the value network; with μ as a parameter ensuring the coefficients sum to 1, the target value network is updated as:
w⁻_new ← μ·w_new + (1 - μ)·w⁻_now
the network architecture under the DWA-PPO reinforcement learning model comprises three major parts: the policy network π(a|s; θ), the value network q(s, a; w) and the target value network q_T(s, a; w⁻);
The complete DWA-PPO reinforcement learning model comprises the agent, the environment, a critic module and an actor module; the critic module comprises the value network error function L(w), the value network q(s, a; w) and the target value network q_T(s, a; w⁻); the actor module comprises the target network π(a|s; θ_1), the behavior network π(a|s; θ_2) and the policy network objective function J(θ_1); the beginning of training is the collection of the training set: at the initial moment of round 0, the unmanned vehicle observes state s_0 from the environment using its sensing and positioning system; the behavior network π(a|s; θ_2) receives s_0 and outputs a Gaussian distribution π(A_0|s_0; θ_1) over the action A_0; the action a_0 is then determined by random sampling from this probability distribution and transmitted to the intelligent vehicle, yielding the evaluation function G_0(v, w) of the DWA algorithm at the initial moment, which completes the evaluation of the simulated trajectory set of the DWA algorithm at the initial moment; the velocity and angular-velocity command of the optimal trajectory is transmitted to the unmanned vehicle motion control module to drive the unmanned vehicle; at this point the information including the position, orientation angle and surrounding obstacle distribution of the unmanned vehicle has changed, the environment transitions to state s_1, and the reward function feeds the reward r_0 back to the critic module according to the changed information; when s_1 is not the termination state s_n, the round proceeds to the next moment, otherwise the map and the unmanned vehicle state are reset and the trajectory of the next round is collected, until i full rounds have been collected, finally obtaining the training set:
χ = [χ_0, χ_1, …, χ_i]
χ_0 = [s_0^0, a_0^0, r_0^0, …, s_{n-1}^0, a_{n-1}^0, r_{n-1}^0, s_n^0].
6. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy according to claim 5, wherein step five is specifically implemented as follows:
after obtaining the training set, the value network q(s, a; w) is updated by back-propagating the error function L(w), and the policy network π(a|s; θ_1) is updated by back-propagating the objective function J(θ_1) of the proximal policy optimization algorithm; let the current parameters of the networks q(s, a; w), q_T(s, a; w⁻) and π(a|s; θ_1) be w_now, w⁻_now and θ_now respectively; the following steps are repeated Z times to complete one generation of updating:
(1) Randomly extract a mini-batch of M_I states s_N^I from the shuffled training set χ;
(2) Using q_T(s, a; w⁻), calculate the k-step TD error MTD_N^I with the state s_N^I as the starting point;
(3) Using the value network q(s, a; w), calculate the action-value estimate at the state s_N^I:
q_N^I = Q(s_N^I, a_N^I; w_now)
(4) Calculate L(w):
(5) Calculate the gradient of the objective function J(θ_1):
(6) Update the value network, policy network and target value network:
w⁻_new ← (1 - μ)·w_new + μ·w⁻_now
assume the parameter before updating is θ_now; after the importance-sampling update the parameter θ_new is obtained; assume the parameter before updating is w_now; after the learning update the parameter w_new is obtained; assume w⁻_now is the parameter introduced before updating to prevent bootstrapping of the value network; with μ as the parameter ensuring the coefficients sum to 1, w⁻_new is obtained after updating; after Z updates, the parameters of the target network π(a|s; θ_1) are assigned to the behavior network π(a|s; θ_2), which is recorded as one generation of updating; the training set is then emptied and the next generation of updating begins, until the model converges;
as shown by the curves of the average score and arrival rate of each generation of the unmanned vehicle in the deep reinforcement learning environment, with the iterative convergence of the model a parameter network that can correctly guide unmanned vehicle path planning is gradually learned, completing the construction of the self-adaptive PPO-ADWA algorithm.
CN202310792088.4A 2023-06-30 2023-06-30 Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy Pending CN116679719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310792088.4A CN116679719A (en) 2023-06-30 2023-06-30 Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310792088.4A CN116679719A (en) 2023-06-30 2023-06-30 Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy

Publications (1)

Publication Number Publication Date
CN116679719A true CN116679719A (en) 2023-09-01

Family

ID=87782071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310792088.4A Pending CN116679719A (en) 2023-06-30 2023-06-30 Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy

Country Status (1)

Country Link
CN (1) CN116679719A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117130263A (en) * 2023-10-26 2023-11-28 博创联动科技股份有限公司 Intelligent control method and system for whole vehicle based on big data of Internet of vehicles
CN117130263B (en) * 2023-10-26 2024-01-16 博创联动科技股份有限公司 Intelligent control method and system for whole vehicle based on big data of Internet of vehicles
CN117724478A (en) * 2023-11-27 2024-03-19 上海海事大学 Automatic container terminal AGV path planning method
CN117553800A (en) * 2024-01-04 2024-02-13 深圳市乐骑智能科技有限公司 AGV positioning and path planning method and device
CN117553800B (en) * 2024-01-04 2024-03-19 深圳市乐骑智能科技有限公司 AGV positioning and path planning method and device
CN117682429A (en) * 2024-02-01 2024-03-12 华芯(嘉兴)智能装备有限公司 Crown block carrying instruction scheduling method and device of material control system
CN117682429B (en) * 2024-02-01 2024-04-05 华芯(嘉兴)智能装备有限公司 Crown block carrying instruction scheduling method and device of material control system


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination