CN116679719A - Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy - Google Patents

Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy

Info

Publication number
CN116679719A
CN116679719A (application CN202310792088.4A)
Authority
CN
China
Prior art keywords
unmanned vehicle
rewards
network
model
ppo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310792088.4A
Other languages
Chinese (zh)
Inventor
张卫波
王单坤
黄赐坤
林景胜
丘英浩
陈虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202310792088.4A priority Critical patent/CN116679719A/en
Publication of CN116679719A publication Critical patent/CN116679719A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to an unmanned vehicle self-adaptive path planning method based on a dynamic window method and a near-end (proximal) policy. First, an agent-environment interaction model oriented to the unmanned vehicle is constructed; a proximal policy optimization (PPO) learning model based on the actor-critic framework is established; a reward function is defined according to the principle and main evaluation factors of the dynamic window approach (DWA); model parameters such as the input layer, the output layer, the number of hidden layers and the number of neurons are determined; and a DWA-PPO deep reinforcement learning model is built. The established DWA-PPO deep reinforcement learning model is then trained through continuous iteration until it converges to a network model that captures the latent relation between the surrounding environment information and the evaluation-function weight parameters, completing the construction of the self-adaptive PPO-ADWA algorithm. Finally, the feasibility and effectiveness of the PPO-ADWA-based unmanned vehicle self-adaptive path planning strategy are verified through simulation comparison experiments.

Description

Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy
Technical Field
The invention relates to the technical field of unmanned path planning and autonomous navigation, in particular to an unmanned vehicle self-adaptive path planning method based on a dynamic window method and a near-end strategy.
Background
In recent years, with the rapid development of science and technology, a new technological revolution represented by the internet, artificial intelligence and big data is redefining every industry, and the traditional automobile industry is facing a profound transformation. Traditional automobiles are developing towards intelligence and driverless operation, and intelligent connected vehicles and autonomous vehicles have become the strategic direction of the global automobile industry. Intelligent driving technology mainly comprises environment sensing, navigation and positioning, path planning, and control and decision-making. Path planning is an important link in intelligent driving and is of great significance to the development of intelligent driving technology.
Path planning can be summarized as follows: under a known environment, a safe, feasible and collision-free path is computed according to given rules, and an optimal obstacle-avoidance path from the start point to the end point is selected; in essence this is the selection of an optimal solution under multiple constraints, which makes path planning a key part of intelligent-vehicle unmanned navigation technology. Path planning algorithms can be further divided into global planning, based on knowledge of the complete area, and local planning, based on knowledge of the local area. The dynamic window approach (Dynamic Window Approach, DWA), as a local path planning method that accounts for the motion performance of the intelligent vehicle, is widely applied to intelligent-vehicle path navigation. The evaluation function of the DWA algorithm consists of three parts: an orientation-angle subfunction, an obstacle subfunction and a velocity subfunction; the evaluation function is the weighted sum of these three subfunctions. In the classical DWA algorithm the weights of the three subfunctions are fixed values; however, the obstacle environment around the intelligent vehicle is complex and changeable while it explores towards the end point, and different obstacle distributions require different weights, so the fixed-weight approach of the classical DWA algorithm easily traps the intelligent vehicle in a local optimum or prevents it from reaching the target. Therefore, the classical DWA algorithm is improved by means of a proximal policy optimization (near-end strategy optimization) algorithm from deep reinforcement learning.
Disclosure of Invention
The invention aims to solve the problem that, facing different obstacle environments, an agent cannot dynamically adjust the weight coefficients in the evaluation function and therefore may fail to find the end point or to compute an optimal path, and provides an unmanned vehicle self-adaptive path planning method based on a dynamic window method and a near-end strategy.
In order to achieve the above purpose, the technical scheme of the invention is as follows: an unmanned vehicle self-adaptive path planning method based on a dynamic window method and a near-end strategy comprises the following steps:
firstly, constructing an intelligent body-environment interaction model, wherein an unmanned vehicle is used as an intelligent body in deep reinforcement learning, and an obstacle map is used as an environment;
step two, establishing a DWA algorithm model: determining, according to the Ackermann intelligent vehicle, the speed range, angular speed range, acceleration range and angular acceleration range parameters, as well as the main elements and evaluation function of the DWA algorithm;
step three, establishing a proximal policy optimization learning PPO model based on an actor-critic framework, simulating and establishing the actual application scenario of the unmanned vehicle as the learning environment of the model, and determining the states and actions in the model according to the application scenario;
step four, constructing a DWA-PPO deep reinforcement learning model, and defining a reward function comprising main line rewards and sub-target rewards; determining parameters of a DWA-PPO deep reinforcement learning model including parameters of an input layer, an output layer, the number of hidden layers and the number of neurons, and completing instantiation of the DWA-PPO deep reinforcement learning model;
step five, constructing the self-adaptive PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, the navigation planning of the unmanned vehicle in randomly generated complex static obstacle environments is simulated to collect a training set for training the DWA-PPO deep reinforcement learning model; through repeated iteration until convergence, a model that outputs the corresponding weight parameters according to the changing distribution of surrounding obstacles is obtained, completing the construction of the self-adaptive PPO-ADWA algorithm;
step six, the self-adaptive adjustment capability of unmanned vehicle path planning based on the self-adaptive PPO-ADWA algorithm is demonstrated through a simulation comparison experiment.
Compared with the prior art, the invention has the following beneficial effects: aiming at the problem that the weight coefficients in the evaluation function of the traditional DWA algorithm cannot be dynamically adjusted with the environment and motion state of the intelligent vehicle, a proximal policy optimization algorithm from deep reinforcement learning is used to construct a DWA-PPO deep reinforcement learning model; a network model is obtained through continuous iterative training, so that the corresponding weight parameters are output and the construction of the self-adaptive PPO-ADWA algorithm is completed. The method solves the problem that, because of the fixed-weight evaluation function, the agent facing different obstacle environments cannot dynamically adjust the weight coefficients and therefore may fail to find the end point or to compute an optimal path.
Drawings
FIG. 1 is a schematic illustration of an agent-environment interaction model.
Fig. 2 is a schematic diagram of the principle of DWA algorithm.
Fig. 3 is a velocity angular velocity window.
FIG. 4 is a schematic of the course angle and the angle δ.
Fig. 5 is a schematic diagram of an actor commentator framework.
Fig. 6 is state s.
Fig. 7 is a policy network architecture.
Fig. 8 is a value network architecture.
FIG. 9 is a DWA-PPO model.
Fig. 10 shows the score and arrival-rate curves.
Fig. 11 is a simulation environment.
Fig. 12 is a classical DWA.
FIG. 13 is PPO-ADWA.
Fig. 14 is a weight parameter variation curve.
FIG. 15 is a flow chart of the method of the present invention.
Detailed Description
The technical scheme of the invention is specifically described below with reference to fig. 1 to 15.
As shown in fig. 15, the invention provides an unmanned vehicle self-adaptive path planning method based on a dynamic window method and a near-end strategy, which comprises the following steps:
firstly, constructing an intelligent body-environment interaction model, wherein an unmanned vehicle is used as an intelligent body in deep reinforcement learning, and an obstacle map is used as an environment;
step two, establishing a DWA algorithm model: determining, according to the Ackermann intelligent vehicle, the speed range, angular speed range, acceleration range and angular acceleration range parameters, as well as the main elements and evaluation function of the DWA algorithm;
step three, establishing a proximal policy optimization learning PPO model based on an actor-critic framework, simulating and establishing the actual application scenario of the unmanned vehicle as the learning environment of the model, and determining the states and actions in the model according to the application scenario;
step four, constructing a DWA-PPO deep reinforcement learning model, and defining a reward function comprising main line rewards and sub-target rewards; determining parameters of a DWA-PPO deep reinforcement learning model including parameters of an input layer, an output layer, the number of hidden layers and the number of neurons, and completing instantiation of the DWA-PPO deep reinforcement learning model;
step five, constructing the self-adaptive PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, the navigation planning of the unmanned vehicle in randomly generated complex static obstacle environments is simulated to collect a training set for training the DWA-PPO deep reinforcement learning model; through repeated iteration until convergence, a model that outputs the corresponding weight parameters according to the changing distribution of surrounding obstacles is obtained, completing the construction of the self-adaptive PPO-ADWA algorithm;
step six, the self-adaptive adjustment capability of unmanned vehicle path planning based on the self-adaptive PPO-ADWA algorithm is demonstrated through a simulation comparison experiment.
The method comprises the following steps:
step one, as shown in fig. 1, constructing an intelligent body-environment interaction model, wherein an unmanned vehicle is used as an intelligent body in deep reinforcement learning, and an obstacle map is used as an environment;
the intelligent agent plays roles of decision making and learning in the deep reinforcement learning system, is mainly responsible for outputting action information and receiving rewards and states, and the environment is an interactive object of the intelligent agent, and the interactive process comprises the following three steps:
(1) The agent observes information O_t ∈ O from the environment state S_t ∈ S, where S, the state space, is the value set of environmental states, and O, the observation space, is the value set of the agent's observations.
(2) From the known O_t, the agent makes a corresponding decision and determines the action A_t ∈ A to apply to the environment, where A is the value set of actions.
(3) Affected by A_t, the environment transitions from its own state S_t to S_{t+1} and gives the agent a reward R_t ∈ R, where R is the value set of rewards. The discretized agent-environment interaction model can thus be represented by the following sequence:
S_0, O_0, A_0, R_0, S_1, O_1, A_1, R_1, S_2, O_2, A_2, R_2, …, S_T = S_terminal
When the state of the environment can be completely observed by the agent, S_t = O_t holds, and the sequence simplifies to:
S_0, A_0, R_0, S_1, A_1, R_1, S_2, A_2, R_2, …, S_T = S_terminal
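This interaction sequence can be made concrete with a minimal rollout loop. The sketch below is illustrative only: ObstacleMapEnv and RandomAgent are hypothetical placeholders standing in for the obstacle-map environment and the unmanned-vehicle agent, and the reward and termination logic are arbitrary.

import random

class ObstacleMapEnv:
    """Hypothetical fully observed environment, so S_t = O_t."""
    def reset(self):
        return [0.0, 0.0]                                   # initial observation O_0
    def step(self, action):
        s_next = [random.random(), random.random()]         # next state S_{t+1}
        reward = -0.01                                       # reward R_t
        done = random.random() < 0.05                        # whether S_{t+1} is the termination state
        return s_next, reward, done

class RandomAgent:
    """Hypothetical agent; here the action is a random triple, e.g. the three weights."""
    def act(self, obs):
        return [random.uniform(-1.0, 1.0) for _ in range(3)]

env, agent = ObstacleMapEnv(), RandomAgent()
s, done, trajectory = env.reset(), False, []
while not done:                                              # one episode: S_0, A_0, R_0, ..., S_T
    a = agent.act(s)
    s_next, r, done = env.step(a)
    trajectory.append((s, a, r))
    s = s_next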
Step two, establishing a DWA algorithm model: determining, according to the Ackermann intelligent vehicle, the speed range, angular speed range, acceleration range and angular acceleration range parameters, as well as the main elements and evaluation function of the DWA algorithm;
the DWA algorithm is a local path planning method for intuitively understanding the map environment where the unmanned vehicle is located from the perspective of speed space, and the working flow is as follows: taking into consideration the constraint of each condition at the moment t on the speed and angular velocity, obtaining a speed and angular velocity window V which can be reached by the unmanned vehicle at the moment t win The method comprises the steps of carrying out a first treatment on the surface of the Discretizing the speed and angular velocity, and combining the discretized speed and angular velocity; the drone traverses all combinations and simulates the forward m delta according to a given motion model t Obtaining a simulation track set tau, namely a series of point sets, according to the duration; the evaluation function gives the scores of all the simulated tracks in the simulated track set tau, and the track tau with the highest score is selected b Corresponding combinations; driving the unmanned aerial vehicle with the combination for a forward time delta t Reaching the time t+1; cycling through this to the endpoint. m is the sampling step number, delta t Is the sampling interval, as shown in fig. 2.
At time t, the velocity-angular-velocity window V_win of the unmanned vehicle is constrained by its hardware conditions and the surrounding environment; the following three constraints are considered:
(1) Limit velocity and angular velocity constraint:
V_lim = {(v, w) | v ∈ [v_min, v_max] ∧ w ∈ [w_min, w_max]}
(2) Velocity and angular velocity constraint limited by acceleration:
V_acc = {(v, w) | v ∈ [v_cu - a_v·Δ_t, v_cu + a_v·Δ_t] ∧ w ∈ [w_cu - a_w·Δ_t, w_cu + a_w·Δ_t]}
(3) Velocity and angular velocity constraint limited by the braking distance:
V_dis = {(v, w) | v ≤ √(2·dist(v, w)·a_v) ∧ w ≤ √(2·dist(v, w)·a_w)}
In the above, v_min, v_max are the limit linear velocities and w_min, w_max are the limit angular velocities; v_cu, w_cu are the current linear velocity and angular velocity; a_v is the limit linear acceleration and a_w is the limit angular acceleration; dist(v, w) is the closest distance to an obstacle of the simulated trajectory corresponding to the velocity-angular-velocity combination (v, w). The final window at time t is expressed as:
V_win = V_lim ∩ V_acc ∩ V_dis, as shown in fig. 3.
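As a concrete illustration of V_win = V_lim ∩ V_acc ∩ V_dis, the following sketch intersects the three constraints for one time step. The limit values used as defaults are illustrative assumptions, and passing a single nearest-obstacle distance is a simplification of dist(v, w).

import math

def dynamic_window(v_cu, w_cu, dist_obs, dt,
                   v_min=0.0, v_max=1.0, w_min=-1.0, w_max=1.0,
                   a_v=0.5, a_w=1.0):
    """Return (v_lo, v_hi, w_lo, w_hi) approximating V_lim ∩ V_acc ∩ V_dis at time t."""
    # V_lim: limit velocity and angular velocity
    v_lo, v_hi, w_lo, w_hi = v_min, v_max, w_min, w_max
    # V_acc: velocities reachable within one step given the limit accelerations
    v_lo, v_hi = max(v_lo, v_cu - a_v * dt), min(v_hi, v_cu + a_v * dt)
    w_lo, w_hi = max(w_lo, w_cu - a_w * dt), min(w_hi, w_cu + a_w * dt)
    # V_dis: admissible velocities that still allow stopping before the nearest obstacle
    v_hi = min(v_hi, math.sqrt(2.0 * dist_obs * a_v))
    w_hi = min(w_hi, math.sqrt(2.0 * dist_obs * a_w))
    return v_lo, v_hi, w_lo, w_hi

# example: current (v, w) = (0.4, 0.1), nearest obstacle 0.8 m away, dt = 0.1 s
print(dynamic_window(0.4, 0.1, 0.8, 0.1))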
The evaluation function comprises three subfunctions and comprehensively considers three factors: the running speed of the unmanned vehicle, the risk of collision with obstacles, and the heading of the unmanned vehicle. Specifically:
G(v, w) = σ(α·heading(v, w) + η·dist(v, w) + γ·vel(v, w))
where heading(v, w) evaluates the deviation between the course angle φ of the unmanned vehicle and δ, the angle between the line from the unmanned vehicle to the target point and the positive x-axis direction, as shown in fig. 4; dist(v, w) is the Euclidean distance from the simulated trajectory to the nearest obstacle; vel(v, w) represents the linear velocity of the unmanned vehicle; and α, η, γ are the three weight coefficients. The evaluation function is thus composed of subfunctions with different dimensions. The normalization function σ(·) in the formula amounts to removing the dimensions, so that data of different dimensions can be unified under the same reference frame for combination or comparison, avoiding the evaluation bias caused by differing dimensions. Concretely, each subfunction value is divided by the sum of that subfunction over all candidate trajectories, e.g.
normal_heading(v_i, w_j) = heading(v_i, w_j) / Σ_{i,j} heading(v_i, w_j)
and dist(v_i, w_j) and vel(v_i, w_j) are normalized in the same way.
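The normalization and weighted scoring described above might be sketched as follows; the candidate subfunction values and the fixed weights in the example are illustrative, and in the adaptive algorithm the weights (α, η, γ) come from the policy network instead.

def normalize(values):
    """Divide each subfunction value by its sum over all candidate trajectories (the role of σ(·))."""
    total = sum(values)
    return [v / total if total > 0 else 0.0 for v in values]

def best_trajectory(headings, dists, vels, alpha, eta, gamma):
    """Score G = α·heading + η·dist + γ·vel after per-subfunction normalization; return the best index."""
    h, d, s = normalize(headings), normalize(dists), normalize(vels)
    scores = [alpha * hi + eta * di + gamma * si for hi, di, si in zip(h, d, s)]
    return max(range(len(scores)), key=scores.__getitem__)

# example with three candidate trajectories and illustrative weights
print(best_trajectory(headings=[120.0, 150.0, 90.0],
                      dists=[0.8, 0.5, 1.2],
                      vels=[0.6, 0.9, 0.4],
                      alpha=0.1, eta=0.2, gamma=0.1))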
The unmanned vehicle obtains the simulated trajectory according to a uniform motion model. Under the assumptions of this motion model, the linear velocity and angular velocity of the unmanned vehicle remain unchanged and the change of the linear-velocity direction is linear in time; to simplify the model and speed up computation, the velocity direction is treated as constant within each tiny time interval, so the uniform motion model can be discretized. Let x_t, y_t denote the coordinates of the intelligent vehicle at time t, φ_t the course angle at time t, and v_t, w_t the velocity and angular velocity at time t; then
x_{t+1} = x_t + v_t·Δ_t·cos(φ_t)
y_{t+1} = y_t + v_t·Δ_t·sin(φ_t)
φ_{t+1} = φ_t + w_t·Δ_t
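A forward simulation under this discretized uniform motion model could look like the sketch below; the step count, interval and initial pose are illustrative values.

import math

def simulate_trajectory(x, y, phi, v, w, dt, m):
    """Roll the discretized uniform motion model forward for m steps of length dt."""
    points = [(x, y, phi)]
    for _ in range(m):
        x = x + v * dt * math.cos(phi)
        y = y + v * dt * math.sin(phi)
        phi = phi + w * dt
        points.append((x, y, phi))
    return points

# example: simulate 3 s ahead (m = 30 steps of dt = 0.1 s) for one (v, w) combination
traj = simulate_trajectory(x=0.0, y=0.0, phi=0.0, v=0.5, w=0.2, dt=0.1, m=30)
print(traj[-1])   # pose reached at the end of the simulated trajectory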
Step three, establishing a proximal policy optimization learning (PPO) model based on an actor-critic framework (shown in fig. 5), simulating and establishing the actual application scenario of the unmanned vehicle as the learning environment of the model, and determining the states and actions in the model according to the application scenario;
the approach of the near-end policy optimization algorithm (Proximal Policy Optimization, PPO) is then to perform the function at the objectiveIncrease D KL The (p||q) penalty term is specifically as follows:
in the middle ofTo the parameter theta 1 The integration of (2) to obtain the target function of strategy learning based on importance sampling, theta is the strategy pi parameter, when the strategy is better, the target function +.>The larger, γ is the parameter introduced for the Monte Carlo approximation, U t Is the parameter in the policy gradient, pi (a t |s t ;θ 1 ) For target policy, pi (a t |s t ;θ 2 ) For behavioural policy->For mathematical expectation of policy network, beta is super parameter, and D is larger as the difference between the distribution q and p is larger KL The larger the (p||q) term is, ++>The greater the penalty received, the conversely D KL The smaller the (p||q) term is, ++>The smaller the penalty is, the goal of reinforcement learning is to maximize +.>Thus have penalty term->The behavior can be controlled to be within a certain similarity range with the target strategy.
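A numerical sketch of this KL-penalized objective, assuming diagonal Gaussian policies, is given below; the batch shapes, the return term U_t and the value of β are illustrative assumptions, and the discount factor γ^t is folded into U_t for brevity.

import torch
from torch.distributions import Normal, kl_divergence

def ppo_kl_objective(mu_new, std_new, mu_old, std_old, actions, U_t, beta=0.01):
    """Importance-weighted return minus beta * D_KL between behavior and target policies."""
    pi_new, pi_old = Normal(mu_new, std_new), Normal(mu_old, std_old)
    # importance sampling ratio pi(a_t|s_t; theta_1) / pi(a_t|s_t; theta_2)
    ratio = torch.exp(pi_new.log_prob(actions).sum(-1) - pi_old.log_prob(actions).sum(-1))
    surrogate = (ratio * U_t).mean()
    kl = kl_divergence(pi_old, pi_new).sum(-1).mean()
    return surrogate - beta * kl                     # maximize this objective

# example with a batch of 4 one-dimensional actions
mu_new = torch.zeros(4, 1, requires_grad=True)
obj = ppo_kl_objective(mu_new, torch.ones(4, 1),
                       torch.zeros(4, 1), torch.ones(4, 1),
                       actions=torch.randn(4, 1), U_t=torch.randn(4))
(-obj).backward()                                    # gradient ascent on J via descent on -J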
The unmanned vehicle searches for an optimal path connecting the start point and the end point in the obstacle environment, so the learning environment modelling the practical application scenario of the unmanned vehicle is the obstacle map.
The state s in the model is the environmental information perceived by the unmanned vehicle with its sensors, and may also include its own position and motion-state information. The state s is the only information source for the unmanned vehicle's action decisions and an important basis for maximizing the return, so its quality directly affects whether the algorithm converges, the convergence speed and the final performance. The state s can be understood as a high-dimensional vector of surrounding environment information; the final goal of the unmanned vehicle is to reach the end point along an optimal path, so the position and state of the vehicle, the distribution of surrounding obstacles and the position of the target point are the core basis of its action decisions. To be closer to practical application, the information reflected by one full scan of the lidar with a scanning interval of 2° is taken as the main part of the state s, which additionally contains the velocity v_t, angular velocity w_t and course angle φ_t of the unmanned vehicle and the current target point position (x_t^g, y_t^g), as shown in fig. 6. The specific approach is to use the policy network output to replace the fixed weights of the evaluation function, constructing an adaptive evaluation function; clearly, the action a corresponds to the weights (α, η, γ) in the evaluation function, so the action a is defined as follows:
a = [μ_1, σ_1, μ_2, σ_2, μ_3, σ_3]
where [μ_1, σ_1] are the mean and variance of the Gaussian probability density function used to describe the weight α; likewise, [μ_2, σ_2] are the mean and variance of the probability density function describing the weight η, and [μ_3, σ_3] are the mean and variance of the probability density function describing the weight γ. Then (α, η, γ) are determined by random sampling from their respective probability density functions, and the actions are mapped into the interval [-1, 1] by the Tanh function.
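Sampling the three weights from the network outputs and squashing them with Tanh might be sketched as follows; treating the σ outputs as standard deviations (rather than variances) is a simplifying assumption of this sketch.

import numpy as np

def sample_weights(action_params, rng=None):
    """action_params = [mu1, sigma1, mu2, sigma2, mu3, sigma3] -> (alpha, eta, gamma) in [-1, 1]."""
    rng = rng or np.random.default_rng()
    mus = np.asarray(action_params[0::2], dtype=float)
    sigmas = np.abs(np.asarray(action_params[1::2], dtype=float))
    raw = rng.normal(mus, sigmas)        # random sample from each Gaussian
    return np.tanh(raw)                  # map into [-1, 1] with the Tanh function

alpha, eta, gamma = sample_weights([0.2, 0.1, 0.5, 0.1, 0.3, 0.1])
print(alpha, eta, gamma)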
After the state s and the action a are determined, the numbers of neurons in the input and output layers of the policy network π(a|s; θ) and the value network q(s, a; w) are also determined. Schematic diagrams of the policy network and value network structures are shown in fig. 7 and fig. 8.
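A shell for the two networks, under an assumed state layout (180 lidar returns at a 2° interval plus five scalar quantities) and assumed hidden sizes, might look like the following torch sketch; these dimensions are illustrative, not values taken from figs. 7 and 8.

import torch
import torch.nn as nn

STATE_DIM = 180 + 5     # assumed: 180 lidar beams + v, w, course angle, target x, target y
ACTION_DIM = 6          # [mu1, sigma1, mu2, sigma2, mu3, sigma3]

class PolicyNet(nn.Module):
    """pi(a|s; theta): outputs the Gaussian parameters of the three weights."""
    def __init__(self, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE_DIM, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, ACTION_DIM))
    def forward(self, s):
        return self.body(s)

class ValueNet(nn.Module):
    """q(s, a; w): scores a state-action pair."""
    def __init__(self, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))
    def forward(self, s, a):
        return self.body(torch.cat([s, a], dim=-1))

print(PolicyNet()(torch.zeros(1, STATE_DIM)).shape)   # torch.Size([1, 6])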
step four, constructing a DWA-PPO deep reinforcement learning model, and defining a reward function comprising main line rewards and sub-target rewards; model parameters including parameters such as the size of an input layer, the size of an output layer, the number of hidden layers, the number of neurons and the like are determined, and instantiation of the DWA-PPO deep reinforcement learning model is completed;
the rewards obtained by the core content unmanned vehicle in the learning model can be divided into main line rewards and sub-target rewards according to whether the rewards obtained by triggering the main line event are:
main line rewards: the main line rewards are understood as settlement rewards for the arrival of the agent at the end state, i.e. rewards R obtained by navigation of the vehicle to the end point mian goal Punishment rewards R when the maximum number of iterative steps is exceeded mian out And punishment rewards R when unmanned vehicle collides with obstacle mian coll
Sub-target rewards: the rewards outside the main rewards are called auxiliary rewards, and the main form is sub-target rewards. And analyzing the influence of factors such as local key points, environment states, movement states of the unmanned aerial vehicle, relative relation between the unmanned aerial vehicle and target points and the like on a main line task of the unmanned aerial vehicle for finding an optimal path by combining with the actual application scene of navigation planning of the unmanned aerial vehicle in an obstacle environment, and giving the following sub-target rewards:
(1) Energy penalty prize R sub step :R sub step Is stored in (a)On one hand, the energy consumption of the unmanned vehicle can be limited, and meanwhile, the unmanned vehicle can be promoted to find an optimal path; e (E) t At speed v for the t step unmanned vehicle t Running delta t The energy consumed by the process, normalized, defines R sub step The method comprises the following steps:
(2) Distance change reward R sub dis : in the process, the unmanned vehicle can be away from the end point locally due to avoidance obstacles, but is always close to the end point globally. Thus, a prize relating to the distance between the position of the vehicle and the target point, R, can be defined sub dis Should be a positive prize, and R is the greater the distance moved in the direction of the end point sub dis The larger.
(3) Obstacle distance rewards R sub obs :r t obs Defining that when no obstacle exists in the safety distance of the unmanned vehicle and the unmanned vehicle brakes at the maximum deceleration, the unmanned vehicle does not collide in the planning process, which is the primary premise of ensuring the driving safety, and defining R after normalization sub obs The method comprises the following steps:
(4) Azimuth prize R sub head : the target of the unmanned vehicle reaches the destination, so that the more the unmanned vehicle is considered to face the destination in navigation, the better the heading angle of the unmanned vehicle is; r is (r) head The method is defined as that forward rewards are obtained when the heading of the unmanned vehicle is very close to the optimal azimuth angle, and R is defined after normalization sub head The method comprises the following steps:
to sum up, it can be known that unmanned vehicleRewards R at step t t Is a compound of the formula (I),the adjustment factor is awarded for the sub-target.
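One hedged way to compose the main-line and sub-target rewards is sketched below; the reward magnitudes and the adjustment factors kappa are illustrative placeholders rather than values from the patent.

def step_reward(reached_goal, collided, out_of_steps,
                sub_rewards, kappa=(0.1, 0.1, 0.1, 0.1)):
    """R_t = main-line reward + sum of adjustment-factor-weighted sub-target rewards."""
    if reached_goal:
        main = 100.0       # R_main^goal  (illustrative magnitude)
    elif collided:
        main = -100.0      # R_main^coll
    elif out_of_steps:
        main = -50.0       # R_main^out
    else:
        main = 0.0
    # sub_rewards = (R_sub^step, R_sub^dis, R_sub^obs, R_sub^head), each already normalized
    return main + sum(k * r for k, r in zip(kappa, sub_rewards))

print(step_reward(False, False, False, sub_rewards=(-0.2, 0.5, 0.3, 0.4)))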
The AC framework constructs a value network to approximate the action value in the policy gradient, so the network architecture includes at least the value network and the policy network. The value network loss function is
L(w) = (1/2)·[q(s_t, a_t; w) - y_t]²
and the learning target of the value network is
y_t = r_t + γ·q(s_{t+1}, a_{t+1}; w)
It can be seen that the learning target includes a part of the value network's own prediction q(s_{t+1}, a_{t+1}; w). Given that the value network overestimates the action value Q(s, a), this kind of self-learning of the value network continuously amplifies the overestimation, and the overestimation is non-uniform, which severely affects network training; this phenomenon is known as bootstrapping. To prevent bootstrapping of the value network, parameters w⁻ are used to construct a target value network q_T(s, a; w⁻), whose parameter structure is consistent with the value network but whose specific values differ; it is used to calculate the TD target and TD error:
y_t = r_t + γ·q_T(s_{t+1}, a_{t+1}; w⁻),  δ_t = q(s_t, a_t; w) - y_t
The initial parameters of the target value network are consistent with the value network. With μ as a parameter ensuring the coefficients sum to 1, the target value network is updated as:
w⁻_new ← μ·w_new + (1 - μ)·w⁻_now
Taken together, the network architecture under the DWA-PPO reinforcement learning model includes three major parts: the policy network π(a|s; θ), the value network q(s, a; w) and the target value network q_T(s, a; w⁻). The DWA-PPO reinforcement learning model is shown in FIG. 9.
In summary, the constructed model includes the agent, the environment, a critic module and an actor module. The critic module comprises the value network error function L(w), the value network q(s, a; w) and the target value network q_T(s, a; w⁻). The actor module comprises the target network π(a|s; θ_1), the behavior network π(a|s; θ_2) and the policy network objective function J(θ_1). The beginning of training is the collection of the training set, shown as the black line segments in fig. 9: at the initial moment of round 0, the unmanned vehicle observes state s_0 from the environment using its sensing and positioning system; the behavior network π(a|s; θ_2) receives s_0 and outputs a Gaussian distribution π(A_0|s_0; θ_1) over the action A_0; the action a_0 is then determined by random sampling from this probability distribution and transmitted to the intelligent vehicle, yielding the evaluation function G_0(v, w) of the DWA algorithm at the initial moment, which completes the evaluation of the simulated trajectory set of the DWA algorithm at the initial moment; the velocity and angular-velocity command of the optimal trajectory is transmitted to the unmanned vehicle motion control module to drive the unmanned vehicle. At this point the position, orientation angle, surrounding obstacle distribution and other information of the unmanned vehicle have changed, the environment transitions to state s_1, and the reward function feeds the reward r_0 back to the critic module according to the changed information. When s_1 is not the termination state s_n, the round proceeds to the next moment; otherwise, the map and the unmanned vehicle state are reset and the trajectory of the next round is collected, until i full rounds have been collected, finally obtaining the training set:
χ = [χ_0, χ_1, …, χ_i]
χ_0 = [s_0^0, a_0^0, r_0^0, …, s_{n-1}^0, a_{n-1}^0, r_{n-1}^0, s_n^0]
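Collecting i rounds of trajectories into the training set χ could be organized as in the following sketch; env, behavior_network and dwa_step are hypothetical stand-ins for the components described above, and the returned structure mirrors χ_0 above.

def collect_training_set(env, behavior_network, dwa_step, num_rounds, max_steps):
    """Return chi = [chi_0, ..., chi_i]; each chi_k holds (s_t, a_t, r_t) tuples plus the final state."""
    chi = []
    for _ in range(num_rounds):
        s, rollout, done, t = env.reset(), [], False, 0
        while not done and t < max_steps:
            a = behavior_network.sample_action(s)     # weights (alpha, eta, gamma) sampled from pi(a|s)
            s_next, r, done = dwa_step(env, s, a)     # DWA evaluates trajectories and drives the vehicle
            rollout.append((s, a, r))
            s, t = s_next, t + 1
        rollout.append(s)                             # terminal (or truncation) state s_n
        chi.append(rollout)
    return chi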
Step five, constructing the PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, the navigation planning of the unmanned vehicle in randomly generated complex static obstacle environments is simulated to collect a training set for training the network model; through repeated iteration until convergence, model parameters that output the corresponding weight parameters according to the changing distribution of surrounding obstacles are obtained, completing the construction of the self-adaptive PPO-ADWA algorithm;
After obtaining the training set, the value network q(s, a; w) is updated by back-propagating the error function L(w), and the policy network π(a|s; θ_1) is updated by back-propagating the objective function J(θ_1) of the PPO algorithm. Let the current parameters of the networks q(s, a; w), q_T(s, a; w⁻) and π(a|s; θ_1) be w_now, w⁻_now and θ_now respectively. The following steps are repeated Z times to complete one generation of updating:
(1) Randomly extract a mini-batch of M_I states s_N^I from the shuffled training set χ;
(2) Using q_T(s, a; w⁻), calculate the k-step TD error MTD_N^I with the state s_N^I as the starting point;
(3) Using the value network q(s, a; w), calculate the action-value estimate at the state s_N^I:
q_N^I = Q(s_N^I, a_N^I; w_now)
(4) Calculate L(w):
(5) Calculate the gradient of the objective function J(θ_1):
(6) Update value network, policy network, target value network:
Assume the parameter before updating is θ_now; after the importance-sampling update the parameter θ_new is obtained. Assume the parameter before updating is w_now; after the learning update the parameter w_new is obtained. Assume w⁻_now is the parameter introduced before updating to prevent bootstrapping of the value network; with μ as the parameter ensuring the coefficients sum to 1, w⁻_new is obtained after updating. After Z updates, the parameters of the target network π(a|s; θ_1) are assigned to the behavior network π(a|s; θ_2), which is recorded as one generation of updating; the training set is then emptied and the next generation of updating begins, until the model converges.
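One generation of updating could be sketched as follows; the mini-batch layout, the network interfaces (a dist(s) method on the policy, value networks taking (s, a)), the simple one-step TD target in place of the k-step error, and the hyperparameters are all simplifying assumptions for illustration.

import random
import torch
from torch.distributions import kl_divergence

def update_one_generation(batches, policy, policy_old, value, value_target,
                          opt_pi, opt_q, Z=10, gamma=0.99, mu=0.05, beta=0.01):
    """Repeat Z mini-batch updates of q(s,a;w), pi(a|s;theta_1) and q_T(s,a;w-), then sync theta_2 <- theta_1."""
    for p in policy_old.parameters():                 # the behavior policy is held fixed within a generation
        p.requires_grad_(False)
    for _ in range(Z):
        s, a, r, s2, a2 = random.choice(batches)      # one shuffled mini-batch of tensors
        with torch.no_grad():
            y = r + gamma * value_target(s2, a2).squeeze(-1)     # TD target from the target value network
        q = value(s, a).squeeze(-1)
        loss_q = ((q - y) ** 2).mean()                # L(w): mean squared TD error
        opt_q.zero_grad(); loss_q.backward(); opt_q.step()
        dist_new, dist_old = policy.dist(s), policy_old.dist(s)
        ratio = torch.exp(dist_new.log_prob(a).sum(-1) - dist_old.log_prob(a).sum(-1))
        adv = (y - q).detach()                        # simple stand-in for the return term U_t
        loss_pi = -(ratio * adv).mean() + beta * kl_divergence(dist_old, dist_new).sum(-1).mean()
        opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()
        with torch.no_grad():                         # soft update of the target value network
            for w_t, w in zip(value_target.parameters(), value.parameters()):
                w_t.mul_(1 - mu).add_(mu * w)
    policy_old.load_state_dict(policy.state_dict())   # after Z updates, theta_2 takes the new parameters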
Fig. 10 shows the curves of the average score and arrival rate per generation of the unmanned vehicle in the deep reinforcement learning environment during network training. As the model iterates and converges, a parameter network that can correctly guide unmanned vehicle path planning is gradually learned, completing the construction of the self-adaptive PPO-ADWA algorithm.
Step six, the self-adaptive adjustment capability of the unmanned vehicle path planning based on PPO-ADWA is demonstrated through a simulation comparison experiment;
to verify the self-regulating capability of unmanned vehicle path planning based on PPO-ADWA algorithm, this section will verify its robustness in a randomly generated complex static obstacle environment. As shown in FIG. 11, the simulation environment is characterized in that the map size is 60m multiplied by 60m, green dots are used as starting points, blue pentagram is used as an end point, black geometric figures represent barriers, the shapes of the barriers comprise regular polygons and circles, and the sizes and the number of the barriers are randomly generated within a certain range. The simulation results of the performance under the map with different 100 obstacle positions are shown in table 1.
Table 1 comparison of simulation results
The arrival rate of unmanned vehicle path planning under PPO-ADWA is 84%, an improvement of 6 percentage points over classical DWA; the average path length is 93.04 m, improving path efficiency by 5.00%; the average number of steps is 251.95, reducing the average step cost by 4.85%. The classical DWA planning results are shown in FIG. 12 and the PPO-ADWA planning results in FIG. 13. The variation curves of the weight parameters during unmanned vehicle path planning with the PPO-ADWA fusion strategy are shown in fig. 14; it can be seen that the weight parameters generally maintain the numerical relation η > γ > α.
The above is a preferred embodiment of the present invention, and all changes made according to the technical solution of the present invention belong to the protection scope of the present invention when the generated functional effects do not exceed the scope of the technical solution of the present invention.

Claims (6)

1. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy is characterized by comprising the following steps:
firstly, constructing an intelligent body-environment interaction model, wherein an unmanned vehicle is used as an intelligent body in deep reinforcement learning, and an obstacle map is used as an environment;
step two, establishing a DWA algorithm model: determining, according to the Ackermann intelligent vehicle, the speed range, angular speed range, acceleration range and angular acceleration range parameters, as well as the main elements and evaluation function of the DWA algorithm;
step three, establishing a proximal policy optimization learning PPO model based on an actor-critic framework, simulating and establishing the actual application scenario of the unmanned vehicle as the learning environment of the model, and determining the states and actions in the model according to the application scenario;
step four, constructing a DWA-PPO deep reinforcement learning model, and defining a reward function comprising main line rewards and sub-target rewards; determining parameters of a DWA-PPO deep reinforcement learning model including parameters of an input layer, an output layer, the number of hidden layers and the number of neurons, and completing instantiation of the DWA-PPO deep reinforcement learning model;
step five, constructing the self-adaptive PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, the navigation planning of the unmanned vehicle in randomly generated complex static obstacle environments is simulated to collect a training set for training the DWA-PPO deep reinforcement learning model; through repeated iteration until convergence, a model that outputs the corresponding weight parameters according to the changing distribution of surrounding obstacles is obtained, completing the construction of the self-adaptive PPO-ADWA algorithm;
step six, the self-adaptive adjustment capability of unmanned vehicle path planning based on the self-adaptive PPO-ADWA algorithm is demonstrated through a simulation comparison experiment.
2. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy according to claim 1, wherein step one is specifically implemented as follows:
the agent is responsible for outputting action information and receiving rewards and states; the environment is the object the agent interacts with; the interaction process comprises the following three steps:
(1) The agent observes information O_t ∈ O from the environment state S_t ∈ S, where S, the state space, is the value set of environmental states, and O, the observation space, is the value set of the agent's observations;
(2) From the known O_t, the agent makes a corresponding decision and determines the action A_t ∈ A to apply to the environment, where A is the value set of actions;
(3) Affected by A_t, the environment transitions from its own state S_t to S_{t+1} and gives the agent a reward R_t ∈ R, where R is the value set of rewards; the discretized agent-environment interaction model is thus represented by the following sequence:
S_0, O_0, A_0, R_0, S_1, O_1, A_1, R_1, S_2, O_2, A_2, R_2, …, S_T = S_terminal
When the state of the environment can be completely observed by the agent, S_t = O_t holds, and the sequence simplifies to:
S_0, A_0, R_0, S_1, A_1, R_1, S_2, A_2, R_2, …, S_T = S_terminal
3. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy according to claim 2, wherein the step two is specifically implemented as follows:
the DWA algorithm is a local path planning method that understands the map environment of the unmanned vehicle intuitively from the perspective of the velocity space; its workflow is as follows: taking into account the constraints imposed on the velocity and angular velocity by each condition at time t, obtain the velocity-angular-velocity window V_win reachable by the unmanned vehicle at time t; discretize the velocity and angular velocity and combine the discretized values; the unmanned vehicle traverses all combinations and, according to a given motion model, simulates forward for a duration of m·Δ_t, obtaining a simulated trajectory set τ, i.e. a series of point sets; the evaluation function scores all simulated trajectories in τ, and the combination corresponding to the highest-scoring trajectory τ_b is selected; the unmanned vehicle is driven with this combination for a forward time Δ_t, reaching time t+1; this cycle repeats until the end point is reached, where m is the number of sampling steps and Δ_t is the sampling interval;
at time t, the velocity-angular-velocity window V_win of the unmanned vehicle is constrained by its hardware conditions and the surrounding environment; the following three constraints are considered:
(1) Limit velocity and angular velocity constraint:
V_lim = {(v, w) | v ∈ [v_min, v_max] ∧ w ∈ [w_min, w_max]}
(2) Velocity and angular velocity constraint limited by acceleration:
V_acc = {(v, w) | v ∈ [v_cu - a_v·Δ_t, v_cu + a_v·Δ_t] ∧ w ∈ [w_cu - a_w·Δ_t, w_cu + a_w·Δ_t]}
(3) Velocity and angular velocity constraint limited by the braking distance:
V_dis = {(v, w) | v ≤ √(2·dist(v, w)·a_v) ∧ w ≤ √(2·dist(v, w)·a_w)}
In the above, v_min, v_max are the limit linear velocities and w_min, w_max are the limit angular velocities; v_cu, w_cu are the current linear velocity and angular velocity; a_v is the limit linear acceleration and a_w is the limit angular acceleration; dist(v, w) is the closest distance to an obstacle of the simulated trajectory corresponding to the velocity-angular-velocity combination (v, w); finally, the velocity-angular-velocity window of the unmanned vehicle at time t is expressed as:
V_win = V_lim ∩ V_acc ∩ V_dis
the evaluation function comprises three subfunctions and comprehensively considers three factors: the running speed of the unmanned vehicle, the risk of collision with obstacles, and the heading of the unmanned vehicle; specifically:
G(v, w) = σ(α·heading(v, w) + η·dist(v, w) + γ·vel(v, w))
where heading(v, w) evaluates the deviation between the course angle φ of the unmanned vehicle and δ, the angle between the line from the unmanned vehicle to the target point and the positive x-axis direction; dist(v, w) is the Euclidean distance from the simulated trajectory to the nearest obstacle; vel(v, w) represents the linear velocity of the unmanned vehicle; and α, η, γ are the three weight coefficients; the evaluation function is thus composed of subfunctions with different dimensions; the normalization function σ(·) in the formula amounts to removing the dimensions, so that data of different dimensions can be unified under the same reference frame for combination or comparison, avoiding the evaluation bias caused by differing dimensions; concretely, each subfunction value is divided by the sum of that subfunction over all candidate trajectories, e.g.
normal_heading(v_i, w_j) = heading(v_i, w_j) / Σ_{i,j} heading(v_i, w_j)
and dist(v_i, w_j) and vel(v_i, w_j) are normalized in the same way;
the unmanned vehicle obtains the simulated trajectory according to a uniform motion model; under the assumptions of the uniform motion model, the linear velocity and angular velocity of the unmanned vehicle remain unchanged and the change of the linear-velocity direction is linear in time; to simplify the model and speed up computation, the velocity direction is treated as constant within each tiny time interval, so the uniform motion model is discretized; letting x_t, y_t denote the coordinates of the intelligent vehicle at time t, φ_t the course angle at time t, and v_t, w_t the velocity and angular velocity at time t, the discretized model is:
x_{t+1} = x_t + v_t·Δ_t·cos(φ_t)
y_{t+1} = y_t + v_t·Δ_t·sin(φ_t)
φ_{t+1} = φ_t + w_t·Δ_t
4. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy according to claim 3, wherein step three is specifically implemented as follows:
the approach of the proximal policy optimization algorithm is to add a D_KL(p‖q) penalty term to the objective function J(θ_1), specifically:
J_β(θ_1) = E_{a_t∼π(a_t|s_t;θ_2)}[ (π(a_t|s_t;θ_1) / π(a_t|s_t;θ_2)) · γ^t · U_t ] - β·D_KL(p‖q)
where J(θ_1) is the objective function of policy learning based on importance sampling; θ is the parameter of the policy π, and the better the policy, the larger the objective function J(θ_1); γ is the parameter introduced for the Monte Carlo approximation; U_t is the return term appearing in the policy gradient; π(a_t|s_t;θ_1) is the target policy and π(a_t|s_t;θ_2) is the behavior policy; E denotes the mathematical expectation over the policy network; and β is a hyperparameter; the larger the difference between the distributions q and p, the larger the D_KL(p‖q) term and the greater the penalty imposed on J_β(θ_1); conversely, the smaller the D_KL(p‖q) term, the smaller the penalty; since the goal of reinforcement learning is to maximize J_β(θ_1), the penalty term β·D_KL(p‖q) keeps the behavior policy within a preset similarity range of the target policy;
the unmanned vehicle searches an optimal path capable of connecting the starting point and the end point in the obstacle environment, so that the learning environment taking the actual application scene of the unmanned vehicle as a model is an obstacle map;
the state s in the model is the environmental information perceived by the unmanned vehicle with its sensors and comprises its own position and motion-state information; the information reflected by one full scan of the lidar with a scanning interval of 2° forms the main part of the state s, which additionally comprises the velocity v_t, angular velocity w_t and course angle φ_t of the unmanned vehicle and the current target point position (x_t^g, y_t^g); the specific approach is to use the policy network output to replace the fixed weights of the evaluation function, constructing an adaptive evaluation function; clearly, the action a corresponds to the weights (α, η, γ) in the evaluation function, so the action a is defined as follows:
a = [μ_1, σ_1, μ_2, σ_2, μ_3, σ_3]
where [μ_1, σ_1] are the mean and variance of the Gaussian probability density function used to describe the weight α; likewise, [μ_2, σ_2] are the mean and variance of the probability density function describing the weight η, and [μ_3, σ_3] are the mean and variance of the probability density function describing the weight γ; then (α, η, γ) are determined by random sampling from their respective probability density functions, and the actions are mapped into the interval [-1, 1] by the Tanh function;
after the state s and the action a are determined, the numbers of neurons in the input and output layers of the policy network and the value network are also determined.
5. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy according to claim 4, wherein step four is specifically implemented as follows:
the reward function is the core content of learning in the DWA-PPO deep reinforcement learning model; the rewards obtained by the unmanned vehicle are divided into main-line rewards and sub-target rewards according to whether they are obtained by triggering a main-line event:
main-line rewards: the main-line rewards are settlement rewards for the agent reaching an end state, namely the reward R_main^goal obtained when the unmanned vehicle navigates to the end point, the penalty reward R_main^out when the maximum number of iteration steps is exceeded, and the penalty reward R_main^coll when the unmanned vehicle collides with an obstacle;
sub-target rewards: rewards other than the main-line rewards are called auxiliary rewards, whose main form is sub-target rewards; combining the practical application scenario of navigation planning for the unmanned vehicle in an obstacle environment, the influence of factors such as local key points, the environment state, the motion state of the unmanned vehicle and the relative relation between the unmanned vehicle and the target point on the main-line task of finding an optimal path is analyzed, and the following sub-target rewards are given:
(1) Energy consumption penalty reward R_sub^step: R_sub^step on the one hand limits the energy consumption of the unmanned vehicle and at the same time pushes it to find an optimal path; E_t is the energy consumed when the unmanned vehicle runs at velocity v_t for Δ_t at step t; after normalization, R_sub^step is defined as:
(2) Distance change reward R_sub^dis: a reward related to the distance between the position of the unmanned vehicle and the target point is defined; R_sub^dis should be a positive reward, and the larger the distance moved towards the end point, the larger R_sub^dis;
(3) Obstacle distance reward R_sub^obs: r_t^obs is defined such that, when there is no obstacle within the safety distance of the unmanned vehicle and the vehicle brakes at maximum deceleration, no collision occurs during planning, which is the primary premise for ensuring driving safety; after normalization, R_sub^obs is defined as:
(4) Azimuth reward R_sub^head: the goal of the unmanned vehicle is to reach the end point, so during navigation the more the unmanned vehicle heads towards the end point, the better its course angle is considered to be; r^head is defined such that a positive reward is obtained when the heading of the unmanned vehicle is very close to the optimal azimuth; after normalization, R_sub^head is defined as:
the reward R_t of the unmanned vehicle at step t is a composite of the above rewards, in which the sub-target rewards are scaled by the sub-target reward adjustment factor;
the network architecture under the DWA-PPO reinforcement learning model comprises at least the value network and the policy network; the value network loss function is
L(w) = (1/2)·[q(s_t, a_t; w) - y_t]²
and the learning target of the value network is
y_t = r_t + γ·q(s_{t+1}, a_{t+1}; w)
the learning target of the value network therefore includes a part of its own prediction q(s_{t+1}, a_{t+1}; w); to prevent bootstrapping of the value network, parameters w⁻ are used to construct a target value network q_T(s, a; w⁻), whose parameter structure is consistent with the value network but whose specific values differ, and which is used to calculate the TD target and TD error:
y_t = r_t + γ·q_T(s_{t+1}, a_{t+1}; w⁻),  δ_t = q(s_t, a_t; w) - y_t
the initial parameters of the target value network are consistent with the value network; with μ as a parameter ensuring the coefficients sum to 1, the target value network is updated as:
w⁻_new ← μ·w_new + (1 - μ)·w⁻_now
the network architecture under the DWA-PPO reinforcement learning model comprises three major parts: the policy network π(a|s; θ), the value network q(s, a; w) and the target value network q_T(s, a; w⁻);
The complete DWA-PPO reinforcement learning model comprises the agent, the environment, a critic module and an actor module; the critic module comprises the value network error function L(w), the value network q(s, a; w) and the target value network q_T(s, a; w⁻); the actor module comprises the target network π(a|s; θ_1), the behavior network π(a|s; θ_2) and the policy network objective function J(θ_1); the beginning of training is the collection of the training set: at the initial moment of round 0, the unmanned vehicle observes state s_0 from the environment using its sensing and positioning system; the behavior network π(a|s; θ_2) receives s_0 and outputs a Gaussian distribution π(A_0|s_0; θ_1) over the action A_0; the action a_0 is then determined by random sampling from this probability distribution and transmitted to the intelligent vehicle, yielding the evaluation function G_0(v, w) of the DWA algorithm at the initial moment, which completes the evaluation of the simulated trajectory set of the DWA algorithm at the initial moment; the velocity and angular-velocity command of the optimal trajectory is transmitted to the unmanned vehicle motion control module to drive the unmanned vehicle; at this point the information including the position, orientation angle and surrounding obstacle distribution of the unmanned vehicle has changed, the environment transitions to state s_1, and the reward function feeds the reward r_0 back to the critic module according to the changed information; when s_1 is not the termination state s_n, the round proceeds to the next moment, otherwise the map and the unmanned vehicle state are reset and the trajectory of the next round is collected, until i full rounds have been collected, finally obtaining the training set:
χ = [χ_0, χ_1, …, χ_i]
χ_0 = [s_0^0, a_0^0, r_0^0, …, s_{n-1}^0, a_{n-1}^0, r_{n-1}^0, s_n^0].
6. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy according to claim 5, wherein step five is specifically implemented as follows:
after obtaining the training set, the value network q(s, a; w) is updated by back-propagating the error function L(w), and the policy network π(a|s; θ_1) is updated by back-propagating the objective function J(θ_1) of the proximal policy optimization algorithm; let the current parameters of the networks q(s, a; w), q_T(s, a; w⁻) and π(a|s; θ_1) be w_now, w⁻_now and θ_now respectively; the following steps are repeated Z times to complete one generation of updating:
(1) Randomly extract a mini-batch of M_I states s_N^I from the shuffled training set χ;
(2) Using q_T(s, a; w⁻), calculate the k-step TD error MTD_N^I with the state s_N^I as the starting point;
(3) Using the value network q(s, a; w), calculate the action-value estimate at the state s_N^I:
q_N^I = Q(s_N^I, a_N^I; w_now)
(4) Calculate L(w):
(5) Calculate the gradient of the objective function J(θ_1):
(6) Update the value network, policy network and target value network:
w⁻_new ← (1 - μ)·w_new + μ·w⁻_now
assume the parameter before updating is θ_now; after the importance-sampling update the parameter θ_new is obtained; assume the parameter before updating is w_now; after the learning update the parameter w_new is obtained; assume w⁻_now is the parameter introduced before updating to prevent bootstrapping of the value network; with μ as the parameter ensuring the coefficients sum to 1, w⁻_new is obtained after updating; after Z updates, the parameters of the target network π(a|s; θ_1) are assigned to the behavior network π(a|s; θ_2), which is recorded as one generation of updating; the training set is then emptied and the next generation of updating begins, until the model converges;
as shown by the curves of the average score and arrival rate of each generation of the unmanned vehicle in the deep reinforcement learning environment, with the iterative convergence of the model a parameter network that can correctly guide unmanned vehicle path planning is gradually learned, completing the construction of the self-adaptive PPO-ADWA algorithm.
CN202310792088.4A 2023-06-30 2023-06-30 Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy Pending CN116679719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310792088.4A CN116679719A (en) 2023-06-30 2023-06-30 Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310792088.4A CN116679719A (en) 2023-06-30 2023-06-30 Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy

Publications (1)

Publication Number Publication Date
CN116679719A true CN116679719A (en) 2023-09-01

Family

ID=87782071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310792088.4A Pending CN116679719A (en) 2023-06-30 2023-06-30 Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy

Country Status (1)

Country Link
CN (1) CN116679719A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117130263A (en) * 2023-10-26 2023-11-28 博创联动科技股份有限公司 Intelligent control method and system for whole vehicle based on big data of Internet of vehicles
CN117130263B (en) * 2023-10-26 2024-01-16 博创联动科技股份有限公司 Intelligent control method and system for whole vehicle based on big data of Internet of vehicles
CN117724478A (en) * 2023-11-27 2024-03-19 上海海事大学 Automatic container terminal AGV path planning method
CN117553800A (en) * 2024-01-04 2024-02-13 深圳市乐骑智能科技有限公司 AGV positioning and path planning method and device
CN117553800B (en) * 2024-01-04 2024-03-19 深圳市乐骑智能科技有限公司 AGV positioning and path planning method and device
CN117682429A (en) * 2024-02-01 2024-03-12 华芯(嘉兴)智能装备有限公司 Crown block carrying instruction scheduling method and device of material control system
CN117682429B (en) * 2024-02-01 2024-04-05 华芯(嘉兴)智能装备有限公司 Crown block carrying instruction scheduling method and device of material control system


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination