CN115047878A - DM-DQN-based mobile robot path planning method - Google Patents

DM-DQN-based mobile robot path planning method Download PDF

Info

Publication number
CN115047878A
CN115047878A (application CN202210673628.2A)
Authority
CN
China
Prior art keywords
dqn
function
reward function
path planning
mobile robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210673628.2A
Other languages
Chinese (zh)
Inventor
顾玉宛
朱智涛
吕继东
石林
徐守坤
刘铭雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou University
Original Assignee
Changzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou University filed Critical Changzhou University
Priority to CN202210673628.2A priority Critical patent/CN115047878A/en
Publication of CN115047878A publication Critical patent/CN115047878A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0238Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors
    • G05D1/024Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors in combination with a laser
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0223Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0257Control of position or course in two dimensions specially adapted to land vehicles using a radar
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0276Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Optics & Photonics (AREA)
  • Electromagnetism (AREA)
  • Feedback Control In General (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to the technical field of DQN algorithms, and in particular to a DM-DQN-based mobile robot path planning method, which comprises the steps of: establishing a DM-DQN-based mobile robot path planning model; designing the state space, action space, DM-DQN network model and reward function of the DM-DQN algorithm; and training the DM-DQN algorithm, obtaining empirical reward values, and completing collision-free path planning for the robot. The invention introduces a competitive (dueling) network structure that decomposes the network into a value function and an advantage function and decouples action selection from action evaluation, so that judging a state no longer depends entirely on the values of its actions and an independent value prediction can be made, which solves the problem of slow convergence; and by designing a reward function based on the artificial potential field, the problem of the robot moving too close to the edges of obstacles is solved.

Description

DM-DQN-based mobile robot path planning method
Technical Field
The invention relates to the technical field of DQN (Deep Q-Network) algorithms, and in particular to a DM-DQN-based mobile robot path planning method.
Background
With the development of artificial intelligence, the robot industry is moving toward autonomous learning and autonomous exploration. Path planning is a core problem in mobile robot motion: its goal is to find an optimal or near-optimal collision-free path from a start point to an end point. As technology advances, the environments faced by robots become increasingly complex, and in an unknown environment the complete environment information cannot be obtained, so traditional path planning algorithms, such as the artificial potential field algorithm, ant colony algorithm, genetic algorithm and particle swarm algorithm, can no longer meet requirements. Deep reinforcement learning addresses this situation by combining deep learning with reinforcement learning: deep learning extracts features from the unknown input environment state through a neural network and fits the mapping from the environment state to an action-value function, while reinforcement learning makes decisions according to the output of the deep neural network and an exploration strategy, realizing the mapping from states to actions. This combination overcomes the curse of dimensionality in the state-to-action mapping and better meets the motion requirements of robots in complex environments.
Disclosure of Invention
Aiming at the shortcomings of existing algorithms, the invention introduces a competitive (dueling) network structure that decomposes the network into a value function and an advantage function, thereby decoupling action selection from action evaluation; judging a state no longer depends entirely on the values of its actions, an independent value prediction can be made, and the problem of slow convergence is solved. In addition, by designing a reward function based on the artificial potential field, the problem of the robot moving too close to the edges of obstacles is solved.
The technical scheme adopted by the invention is as follows: a DM-DQN-based mobile robot path planning method comprises the following steps:
step one, establishing a mobile robot path planning model based on DM-DQN;
step two, designing a state space, an action space, a DM-DQN network model and a reward function of the DM-DQN algorithm;
further, the structure of the DM-DQN network model is divided into a cost function V (s, ω, α) and a merit function a (s, a, ω, β), and the output of the DM-DQN network model is represented as:
Q(s,a,ω,α,β)=V(s,ω,α)+A(s,a,ω,β) (4)
where s represents the state, a represents the motion, ω is a parameter common to V and a, α and β are parameters of V and a, respectively, the value of V can be regarded as the average of the Q values in the state of s, the value of a is a limit with the average being 0, and the sum of the value of V and the value of a is the original Q value.
Further, the merit function is centralized, and the output of the DM-DQN network model is represented as:
Figure BDA0003690531640000021
where s denotes the state, a denotes the action, a' denotes the next action, a is an alternative action, ω is a common parameter for V and A, and α and β are parameters for V and A, respectively.
Further, the reward function is divided into a position reward function and a direction reward function, and a total reward function is calculated according to the position reward function and the direction reward function.
Further, in the position reward function, a target-guided reward function is first constructed using the attractive (gravitational) potential field function:
[Formula: target-guided reward, a function of ζ and d_goal]
where ζ denotes the attractive reward function constant and d_goal denotes the distance between the current position and the target point;
secondly, an obstacle avoidance reward function is constructed using the repulsive potential field function; this reward is negative and becomes more negative as the distance between the robot and the obstacle decreases:
[Formula: obstacle avoidance reward, a function of η, d_obs and d_max]
where η denotes the repulsive reward function constant, d_obs denotes the distance between the current position and the obstacle, and d_max denotes the maximum influence distance of the obstacle.
Further, the direction reward function is expressed in terms of the angle difference between the robot's expected direction and actual direction, the angle difference being given by:
θ = arccos((F_q · F_a)/(|F_q|·|F_a|))
where F_q denotes the expected direction, F_a denotes the actual direction, and θ denotes the angle between the expected direction and the actual direction;
the direction reward function may then be expressed as:
[Formula: direction reward, a function of the angle θ]
further, the overall reward function of the mobile robot is expressed as:
[Formula: total reward function, combining the position and direction rewards, with arrival and collision cases defined by r_goal and r_obs]
where r_goal denotes the radius of the target area centered on the target point and r_obs denotes the radius of the collision area centered on the obstacle;
and step three, training the DM-DQN algorithm, obtaining an experience reward value, and completing the collision-free path planning of the robot.
The invention has the beneficial effects that:
1. By introducing a competitive network structure, the network is decomposed into a value function and an advantage function, so that action selection and action evaluation are decoupled; judging a state no longer depends entirely on the values of its actions, independent value prediction can be performed, the problem of slow convergence is solved, and the network has better generalization performance.
2. By designing the reward function based on the artificial potential field, the problem of the robot moving too close to the edges of obstacles is solved; learning efficiency in a dynamic unknown environment is higher, convergence is faster, and a collision-free path that keeps away from obstacles can be planned.
Drawings
Fig. 1 is a diagram of a DM-DQN network architecture of the present invention;
FIGS. 2(a) and (b) are a static environment diagram and a dynamic and static environment diagram, respectively, according to the present invention;
FIGS. 3(a), (b) are plots of reward values for the static and dynamic environments of the DM-DQN algorithm of the present invention;
fig. 4(a) and (b) are a static environment generation path diagram and a dynamic and static environment generation path diagram according to the present invention.
Detailed Description
The invention will be further described below with reference to the accompanying drawings and embodiments; the drawings are simplified schematic diagrams that illustrate only the basic structure of the invention, so only the structures related to the invention are shown.
Aiming at the problem of the slow convergence speed of M-DQN, the method is improved by introducing a competitive network structure that decomposes the network into a value function and an advantage function; and aiming at the problem that the motion trajectory of the robot passes too close to the edges of obstacles, a reward function based on the artificial potential field method is designed so that the motion trajectory of the robot keeps away from the surroundings of obstacles.
As shown in fig. 1, a DM-DQN-based mobile robot path planning method includes the following steps:
step one, establishing a DM-DQN-based mobile robot path planning model, and describing a mobile robot path planning problem as a Markov decision process;
first, the Q values are estimated by an online Q-network with weights θ, and every C steps the weights θ are copied to the target network with weights θ⁻;
secondly, the robot interacts with the environment using an ε-greedy strategy and obtains a reward and the next state according to the designed reward function based on the artificial potential field; each transition (s_t, a_t, r_t, s_{t+1}) is stored in a fixed-size first-in-first-out replay buffer D, and every F steps DM-DQN randomly samples a mini-batch B_t from the replay buffer D and regresses toward the following target, minimizing the loss:
q̂ = r_t + ατ·ln π_θ⁻(a_t|s_t) + γ·∑_{a′∈𝒜} π_θ⁻(a′|s_{t+1})·[Q_θ⁻(s_{t+1},a′) − τ·ln π_θ⁻(a′|s_{t+1})]
where s denotes the state, a denotes the action, r denotes the reward value, and γ denotes the discount factor; the policy π_θ⁻ satisfies π_θ⁻(·|s) = softmax(Q_θ⁻(s,·)/τ), τ is a hyperparameter controlling the weight of the entropy term, a′ denotes an alternative action at time t+1, and α is a hyperparameter set to 1.
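A minimal sketch of computing this regression target is shown below, assuming the standard Munchausen-DQN form with a softmax policy derived from the target network; the clipping bound l0, the default value of τ and the tensor layout are illustrative assumptions, not values from the patent.

import torch
import torch.nn.functional as F

def munchausen_target(q_target_net, batch, gamma=0.99, tau=0.03, alpha=1.0, l0=-1.0):
    """Compute the regression target q_hat for a mini-batch of transitions."""
    s, a, r, s_next, done = batch  # states, action indices, rewards, next states, done flags
    with torch.no_grad():
        # softmax policy pi(.|s) = softmax(Q(s,.) / tau) from the target network
        q_s = q_target_net(s)
        log_pi_s = F.log_softmax(q_s / tau, dim=1)
        # Munchausen term: alpha * tau * ln pi(a_t | s_t), clipped from below at l0
        m_term = alpha * torch.clamp(
            tau * log_pi_s.gather(1, a.unsqueeze(1)).squeeze(1), min=l0)

        q_next = q_target_net(s_next)
        pi_next = F.softmax(q_next / tau, dim=1)
        log_pi_next = F.log_softmax(q_next / tau, dim=1)
        # soft value of the next state: sum_a' pi(a'|s') * [Q(s',a') - tau * ln pi(a'|s')]
        soft_v_next = (pi_next * (q_next - tau * log_pi_next)).sum(dim=1)

        return r + m_term + gamma * (1.0 - done) * soft_v_next

The online network is then trained to minimize the difference between Q_θ(s_t, a_t) and this target, for example with a mean-squared or Huber loss.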
step two, designing the state space, action space, DM-DQN network model and reward function of the DM-DQN algorithm;
the state space includes: the laser radar data, the current control command of the mobile robot, the control command of the mobile robot at the previous moment, and the direction and distance of the target point;
the action space includes: the angular velocity and linear velocity of the mobile robot;
the action space of the robot is discretized into 5 actions with a fixed linear velocity v of 0.15 m/s, and each action is assigned an angular velocity value; outputting an angular velocity as the control quantity, rather than directly giving a steering angle, better matches the kinematic characteristics of the mobile robot. The angular velocity is given according to the following formula:
[Formula (2): angular velocity as a function of the action index and the maximum angular velocity]
where action_size indicates that the action space is discretized into 5 actions, action takes the values 0 to 4, and max_angular_vel, the maximum steering angular velocity of the robot, is 1.5 rad/s; the 5 actions calculated according to formula (2) are listed in formula (3), where the linear velocity v is in m/s and the angular velocity ω is in rad/s; a sketch of this action mapping follows formula (3).
[Formula (3): the table of the 5 discrete (v, ω) actions]
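A minimal sketch of the action-index-to-velocity mapping described above; since formulas (2) and (3) are reproduced as images in the original, the uniform symmetric spacing of the angular velocities used here is an assumption consistent with the stated 5 actions, the fixed linear velocity of 0.15 m/s and the maximum angular velocity of 1.5 rad/s.

ACTION_SIZE = 5            # the action space is discretized into 5 actions
LINEAR_VEL = 0.15          # fixed linear velocity v in m/s
MAX_ANGULAR_VEL = 1.5      # maximum steering angular velocity in rad/s

def action_to_velocity(action: int):
    """Map an action index in {0, ..., 4} to a (v, omega) command."""
    assert 0 <= action < ACTION_SIZE
    # spread the angular velocities symmetrically over [-max, +max] (assumed spacing)
    half = (ACTION_SIZE - 1) / 2
    omega = MAX_ANGULAR_VEL * (action - half) / half
    return LINEAR_VEL, omega

# Under this assumption, actions 0..4 map to omega = -1.5, -0.75, 0.0, 0.75, 1.5 rad/s.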
Further, the DM-DQN network model divides the network structure into two parts, as shown in fig. 1: the first part is related only to the state s and is called the value function, denoted V(s, ω, α); the other part is related to both the state s and the action a and is called the advantage function, denoted A(s, a, ω, β). The output of the network can therefore be expressed as:
Q(s,a,ω,α,β)=V(s,ω,α)+A(s,a,ω,β) (4)
wherein s denotes the state, a denotes the action, ω is the parameter shared by V and A, and α and β are the parameters of V and A respectively; the value of V can be regarded as the average of the Q values in state s, the value of A is constrained to have zero mean, and the sum of the value of V and the value of A is the original Q value;
since the values of A are constrained to sum to 0, the network preferentially updates the value of V; because V is the average of the Q values, adjusting it is equivalent to updating all the Q values in that state at once, so the network does not merely update the Q value of a single action but adjusts the Q values of all actions in the state at one time.
Further, in robot path planning, the value function mainly learns the situation in which the robot detects no obstacle, while the advantage function mainly learns the situation in which the robot detects an obstacle; to solve the identifiability problem, the advantage function is centralized:
Q(s,a,ω,α,β)=V(s,ω,α)+(A(s,a,ω,β)−(1/|𝒜|)·∑_{a′∈𝒜}A(s,a′,ω,β))
where s denotes the state, a denotes the action, a′ denotes an alternative action in the action set 𝒜, ω is the parameter shared by V and A, and α and β are the parameters of V and A respectively.
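A minimal sketch of this dueling (competition) network structure, assuming a PyTorch implementation; the hidden-layer sizes and the state dimension are illustrative assumptions, while the zero-mean (centralized) combination of V and A follows the formula above.

import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q-network split into a value stream V(s) and an advantage stream A(s, a)."""

    def __init__(self, state_dim: int = 28, action_dim: int = 5, hidden: int = 64):
        super().__init__()
        # shared feature layers (the parameters omega common to V and A)
        self.feature = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)               # V(s; omega, alpha)
        self.advantage = nn.Linear(hidden, action_dim)  # A(s, a; omega, beta)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.feature(state)
        v = self.value(h)                               # shape (batch, 1)
        a = self.advantage(h)                           # shape (batch, action_dim)
        # centralized combination: Q = V + (A - mean over actions of A)
        return v + a - a.mean(dim=1, keepdim=True)

Subtracting the mean of the advantage stream enforces the zero-mean constraint on A, so the value stream carries the shared part of the Q values as described above.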
Further, the reward function is designed according to the artificial potential field method and is decomposed into two parts: the first part is the position reward function, which comprises a target-guided reward function and an obstacle avoidance reward function; the target-guided reward function guides the robot to reach the target point quickly, and the obstacle avoidance reward function keeps the robot at a certain distance from obstacles;
the second part is the direction reward function: the current heading of the robot plays a key role in reasonable navigation, and since the direction of the resultant force acting on the robot in the artificial potential field fits the robot's direction of motion well, a direction reward function is designed to guide the robot to move toward the target point.
Further, in the position reward function, a target-guided reward function is first constructed using the attractive (gravitational) potential field function:
[Formula: target-guided reward, a function of ζ and d_goal]
where ζ denotes the attractive reward function constant and d_goal denotes the distance between the current position and the target point;
secondly, an obstacle avoidance reward function is constructed using the repulsive potential field function; this reward is negative and becomes more negative as the distance between the robot and the obstacle decreases:
[Formula: obstacle avoidance reward, a function of η, d_obs and d_max]
where η denotes the repulsive reward function constant, d_obs denotes the distance between the current position and the obstacle, and d_max denotes the maximum influence distance of the obstacle.
Further, in the direction reward function, the angle difference between the expected direction and the actual direction of the robot is expressed as:
θ = arccos((F_q · F_a)/(|F_q|·|F_a|))
where F_q denotes the expected direction, F_a denotes the actual direction, and θ denotes the angle between the expected direction and the actual direction;
thus, the direction reward function may be expressed as:
[Formula: direction reward, a function of the angle θ]
further, the overall reward function may be expressed as:
Figure BDA0003690531640000084
the overall reward function for a mobile robot is expressed as:
Figure BDA0003690531640000085
wherein r is goal Representing the radius of the target area, r, centered on the target point obs Indicating the radius of the impact zone centered on the obstacle.
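A hedged sketch of an artificial-potential-field-style reward of this kind; because the formulas above are reproduced as images in the original, the attractive, repulsive and direction terms below use standard potential-field forms, and all constants (ζ, η, d_max, the radii and the arrival/collision rewards) are illustrative assumptions rather than values from the patent.

import math

ZETA, ETA = 1.0, 1.0          # attractive / repulsive reward constants (assumed)
D_MAX = 1.0                   # maximum influence distance of an obstacle in m (assumed)
R_GOAL, R_OBS = 0.2, 0.2      # target-area and collision-area radii in m (assumed)

def total_reward(d_goal, d_obs, f_expected, f_actual):
    """Combine position (attraction + repulsion) and direction rewards."""
    if d_goal < R_GOAL:
        return 100.0                      # reached the target area (assumed bonus)
    if d_obs < R_OBS:
        return -100.0                     # entered the collision area (assumed penalty)
    # target-guided term: more negative the farther the robot is from the goal
    r_att = -ZETA * d_goal
    # obstacle-avoidance term: negative, growing in magnitude as the obstacle gets closer
    r_rep = -0.5 * ETA * (1.0 / d_obs - 1.0 / D_MAX) ** 2 if d_obs <= D_MAX else 0.0
    # direction term: penalize the angle between the expected and actual headings
    dot = f_expected[0] * f_actual[0] + f_expected[1] * f_actual[1]
    norm = math.hypot(*f_expected) * math.hypot(*f_actual)
    theta = math.acos(max(-1.0, min(1.0, dot / norm)))
    return r_att + r_rep - theta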
Designing a simulation environment, letting the mobile robot interact with the environment to acquire training data, and sampling the training data to perform simulation training of the mobile robot and complete collision-free path planning;
and step three, training the DM-DQN algorithm to obtain an experience reward value, and completing the collision-free path planning of the robot.
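A minimal sketch of the training loop in step three, assuming a Gym-style interface (reset / step / sample_action) wrapping the simulation environment and reusing the network and target sketches above; the interface names, buffer size and hyperparameters are assumptions for illustration.

import random
from collections import deque

import torch
import torch.nn.functional as F

def train(env, online_net, target_net, optimizer, compute_target,
          episodes=320, buffer_size=100_000, batch_size=64,
          copy_every=2000, epsilon=0.1):
    replay = deque(maxlen=buffer_size)        # fixed-size FIFO replay buffer
    step = 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection from the online network
            if random.random() < epsilon:
                action = env.sample_action()
            else:
                with torch.no_grad():
                    q = online_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                    action = int(q.argmax(dim=1))
            next_state, reward, done = env.step(action)   # reward from the APF-based reward function
            replay.append((state, action, reward, next_state, float(done)))
            state = next_state
            step += 1
            if len(replay) >= batch_size:
                s, a, r, s_next, d = map(
                    lambda x: torch.as_tensor(x, dtype=torch.float32),
                    zip(*random.sample(replay, batch_size)))
                target = compute_target(target_net, (s, a.long(), r, s_next, d))
                q_sa = online_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
                loss = F.smooth_l1_loss(q_sa, target)     # regress toward the target
                optimizer.zero_grad(); loss.backward(); optimizer.step()
            if step % copy_every == 0:
                target_net.load_state_dict(online_net.state_dict())

Here compute_target could be the munchausen_target sketch given earlier, online_net and target_net instances of the dueling network sketch, and ε can be annealed over training.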
The specific experimental steps are as follows:
a virtual simulation environment is created through a Gazebo simulator, and a robot model is established through the Gazebo to realize a path planning task, wherein the simulation environment comprises a static environment and a dynamic and static environment as shown in FIG. 2, FIG. 2(a) is the static environment, and FIG. 2(b) is the dynamic environment;
and implementing a path planning algorithm by adopting a python language and calling a built-in Gazebo simulator to control the motion of the robot and acquire the perception information of the robot.
The DM-DQN algorithm obtains empirical reward values through 320 rounds of simulation training, as shown in FIG. 3: FIGS. 3(a) and (b) record, for the static environment and the dynamic-static environment respectively, the cumulative reward of each round and the average reward of the agent, where each point represents one round and the black curve represents the average reward. The results indicate that DM-DQN, by adopting a competitive network structure that decouples action selection from action evaluation, has a faster learning rate, so the experience gathered during early exploration of the environment can be used more fully and a larger reward is obtained.
Seven waypoints are designated for the robot to navigate: starting from position No. 1 in the unknown environment, the robot autonomously reaches positions No. 2 to No. 7 in sequence without collision and then returns to position No. 1, realizing collision-free path planning, as shown in FIG. 4.
As shown in Table 1, the DM-DQN algorithm of the present invention is compared with existing algorithms under the same training conditions in terms of the average number of movement steps to the target point and the number of successful arrivals at the target point over 300 rounds. The table shows that DM-DQN requires the fewest movement steps on average, and its number of successes is 50% higher than that of the DQN algorithm, 23.6% higher than that of Dueling DQN, and 19.3% higher than that of M-DQN.
TABLE 1
[Table 1: average number of movement steps and number of successful arrivals over 300 rounds for DQN, Dueling DQN, M-DQN and DM-DQN]
In light of the foregoing description of the preferred embodiment of the present invention, many modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification, and must be determined according to the scope of the claims.

Claims (7)

1. A DM-DQN-based mobile robot path planning method is characterized by comprising the following steps:
step one, establishing a mobile robot path planning model based on DM-DQN;
designing a state space, an action space, a DM-DQN network model and a reward function of the DM-DQN algorithm;
and step three, training the DM-DQN algorithm to obtain an experience reward value, and completing the collision-free path planning of the robot.
2. The DM-DQN-based mobile robot path planning method according to claim 1, wherein the structure of the DM-DQN network model is divided into a value function V(s, ω, α) and an advantage function A(s, a, ω, β), and the output of the DM-DQN network model is expressed as:
Q(s,a,ω,α,β)=V(s,ω,α)+A(s,a,ω,β) (4)
where s denotes the state, a denotes the action, ω is the parameter shared by V and A, α and β are the parameters of V and A respectively, and the value of V is the average of the Q values in state s.
3. The DM-DQN-based mobile robot path planning method according to claim 2, wherein the advantage function is centralized and the output of the DM-DQN network model is expressed as:
Q(s,a,ω,α,β)=V(s,ω,α)+(A(s,a,ω,β)−(1/|𝒜|)·∑_{a′∈𝒜}A(s,a′,ω,β))
where s denotes the state, a denotes the action, a′ denotes an alternative action in the action set 𝒜, ω is the parameter shared by V and A, and α and β are the parameters of V and A respectively.
4. The DM-DQN based mobile robot path planning method of claim 1, wherein: the reward function is divided into a position reward function and a direction reward function, and a total reward function is calculated according to the position reward function and the direction reward function.
5. The DM-DQN-based mobile robot path planning method according to claim 4, wherein in the position reward function, a target-guided reward function is first constructed using the attractive potential field function:
[Formula: target-guided reward, a function of ζ and d_goal]
where ζ denotes the attractive reward function constant and d_goal denotes the distance between the current position and the target point;
secondly, an obstacle avoidance reward function is constructed using the repulsive potential field function:
[Formula: obstacle avoidance reward, a function of η, d_obs and d_max]
where η denotes the repulsive reward function constant, d_obs denotes the distance between the current position and the obstacle, and d_max denotes the maximum influence distance of the obstacle.
6. The DM-DQN-based mobile robot path planning method according to claim 4, wherein the direction reward function is expressed in terms of the angle difference between the robot's expected direction and actual direction, the angle difference being given by:
θ = arccos((F_q · F_a)/(|F_q|·|F_a|))
where F_q denotes the expected direction, F_a denotes the actual direction, and θ denotes the angle between the expected direction and the actual direction;
the direction reward function is expressed as:
[Formula: direction reward, a function of the angle θ]
7. The DM-DQN-based mobile robot path planning method according to claim 4, wherein the total reward function is expressed as:
[Formula: total reward function, with arrival and collision cases defined by r_goal and r_obs]
where r_goal denotes the radius of the target area centered on the target point and r_obs denotes the radius of the collision area centered on the obstacle.
CN202210673628.2A 2022-06-13 2022-06-13 DM-DQN-based mobile robot path planning method Pending CN115047878A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210673628.2A CN115047878A (en) 2022-06-13 2022-06-13 DM-DQN-based mobile robot path planning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210673628.2A CN115047878A (en) 2022-06-13 2022-06-13 DM-DQN-based mobile robot path planning method

Publications (1)

Publication Number Publication Date
CN115047878A true CN115047878A (en) 2022-09-13

Family

ID=83161444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210673628.2A Pending CN115047878A (en) 2022-06-13 2022-06-13 DM-DQN-based mobile robot path planning method

Country Status (1)

Country Link
CN (1) CN115047878A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116382304A (en) * 2023-05-26 2023-07-04 国网江苏省电力有限公司南京供电分公司 DQN model-based multi-inspection robot collaborative path planning method and system
CN116382304B (en) * 2023-05-26 2023-09-15 国网江苏省电力有限公司南京供电分公司 DQN model-based multi-inspection robot collaborative path planning method and system
CN116527567A (en) * 2023-06-30 2023-08-01 南京信息工程大学 Intelligent network path optimization method and system based on deep reinforcement learning
CN116527567B (en) * 2023-06-30 2023-09-12 南京信息工程大学 Intelligent network path optimization method and system based on deep reinforcement learning
CN117474295A (en) * 2023-12-26 2024-01-30 长春工业大学 Multi-AGV load balancing and task scheduling method based on Dueling DQN algorithm
CN117474295B (en) * 2023-12-26 2024-04-26 长春工业大学 Dueling DQN algorithm-based multi-AGV load balancing and task scheduling method

Similar Documents

Publication Publication Date Title
CN115047878A (en) DM-DQN-based mobile robot path planning method
Jiang et al. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge
Qiu et al. A multi-objective pigeon-inspired optimization approach to UAV distributed flocking among obstacles
CN112677995B (en) Vehicle track planning method and device, storage medium and equipment
CN109144102B (en) Unmanned aerial vehicle route planning method based on improved bat algorithm
US7765029B2 (en) Hybrid control device
CN111930121B (en) Mixed path planning method for indoor mobile robot
CN112433525A (en) Mobile robot navigation method based on simulation learning and deep reinforcement learning
CN112731916A (en) Global dynamic path planning method integrating skip point search method and dynamic window method
CN114489059A (en) Mobile robot path planning method based on D3QN-PER
CN111506063B (en) Mobile robot map-free navigation method based on layered reinforcement learning framework
Cai et al. A PSO-based approach with fuzzy obstacle avoidance for cooperative multi-robots in unknown environments
Wan et al. ME‐MADDPG: An efficient learning‐based motion planning method for multiple agents in complex environments
CN113759901A (en) Mobile robot autonomous obstacle avoidance method based on deep reinforcement learning
Chang et al. Interpretable fuzzy logic control for multirobot coordination in a cluttered environment
CN116360457A (en) Path planning method based on self-adaptive grid and improved A-DWA fusion algorithm
Sundarraj et al. Route planning for an autonomous robotic vehicle employing a weight-controlled particle swarm-optimized Dijkstra algorithm
CN117434950A (en) Mobile robot dynamic path planning method based on Harris eagle heuristic hybrid algorithm
Raiesdana A hybrid method for industrial robot navigation
Smit et al. Informed sampling-based trajectory planner for automated driving in dynamic urban environments
Feng et al. A hybrid motion planning algorithm for multi-robot formation in a dynamic environment
CN115542921A (en) Autonomous path planning method for multiple robots
Yung et al. Avoidance of moving obstacles through behavior fusion and motion prediction
CN114740873A (en) Path planning method of autonomous underwater robot based on multi-target improved particle swarm algorithm
CN114545971A (en) Multi-agent distributed flyable path planning method, system, computer equipment and medium under communication constraint

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination