CN113159432A - Multi-agent path planning method based on deep reinforcement learning - Google Patents

Multi-agent path planning method based on deep reinforcement learning

Info

Publication number
CN113159432A
Authority
CN
China
Prior art keywords
agent
map
reinforcement learning
point
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110468095.XA
Other languages
Chinese (zh)
Inventor
范钰捷
林志赟
王博
程自帅
韩志敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110468095.XA priority Critical patent/CN113159432A/en
Publication of CN113159432A publication Critical patent/CN113159432A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q 10/047 Optimisation of routes or paths, e.g. travelling salesman problem
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-agent path planning method based on deep reinforcement learning. The method is a distributed path planning method: the local observation information of a single agent is input into a neural network, information is exchanged between agents through the neural network, and a neural network approximating the policy function is trained so as to output a movement policy. The network parameters are trained with a method combining deep reinforcement learning and imitation learning, so that the return function converges faster. After training, a high success rate of group path planning in a four-neighborhood 2D grid map can be achieved at the scale of thousands of agents, i.e., a collision-free route from start point to end point is successfully planned for each agent within the time limit, with strong adaptability to changes in map size and obstacle density.

Description

Multi-agent path planning method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to multi-agent path planning based on deep reinforcement learning.
Background
Multi-agent path planning is the class of problems that finds a set of conflict-free paths for multiple agents from their start locations to their target locations while optimizing objectives such as minimizing the sum of path lengths or total action cost of all agents, or maximizing throughput. Research on this problem has numerous application scenarios in logistics, unmanned vehicles, military, security, games, and other fields.
Many traditional algorithms exist for single-agent path planning, such as the A* algorithm, particle swarm optimization, genetic algorithms, ant colony algorithms, and simulated annealing. As the demands of industry and living standards rise, a single agent can no longer meet the needs of practical applications, and multi-agent path planning techniques capable of group coordination have emerged. Traditional algorithms include M*, CBS, WHCA* and their variants, which can plan paths for fewer than about 300 agents. Deep reinforcement learning methods such as DQN, Q-learning and MADDPG have also achieved some results.
However, multi-agent path planning based on deep reinforcement learning still faces several specific problems: adaptability is poor across varied map sizes and under high-density obstacles; the lack of communication among agents blocks planning information and causes congestion; as the number of agents increases, the state-action space of most path planning methods suffers dimension explosion, requiring a large amount of computation and limiting the planning success rate (a run is successful when a collision-free route from start point to end point is planned for each agent within the time limit); and training efficiency is low, with long training times.
Disclosure of Invention
In view of the above shortcomings, the technical problems to be solved by the present invention are: the lack of communication between agents in the prior art; poor adaptability to variable maps; the dimension explosion that easily arises as the amount of agent information grows; and the slow return convergence and training caused by the design of the reinforcement learning framework. Therefore, the present application provides a multi-agent path planning method based on deep reinforcement learning. The method is a distributed path planning method: the local observation information of a single agent is input into a convolutional neural network for processing, information is exchanged between agents through a graph neural network, and a neural network approximating the policy function is trained so as to output a movement policy. The network parameters are trained with a method combining deep reinforcement learning and imitation learning, so that the return function converges faster. After training, a high success rate of group path planning in a four-neighborhood 2D grid map can be achieved at the scale of thousands of agents, i.e., a collision-free route from start point to end point is successfully planned for each agent within the time limit, with strong adaptability to changes in map size and obstacle density.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a multi-agent path planning method based on deep reinforcement learning, comprising the following steps:
S1: generating a diverse data set in which the start point and target point of each agent, different 2D square grid map sizes, obstacle densities, and numbers of agents are randomly generated and combined.
S2: inputting the local map information tensor into a convolutional neural network for preprocessing, where the local map information is the map information within a square of side length r_local grid cells centered on the individual agent.
S3: exchanging the local information preprocessed in S2 between agents using a graph neural network.
S4: training the network parameters of the algorithm by a method combining imitation learning and reinforcement learning. Each agent holds a copy of the algorithm network, which outputs a policy, and over the time sequence each agent selects one of five actions: up, down, left, right, or no movement.
Further, in the step S1:
A global grid map, obstacles, and binary maps of the agent start points and target points are generated with python or designed manually. The grid map is a square with side length 10, 50 or 100; the obstacle density, i.e., the percentage of obstacle cells among all map cells, can be 10%, 30% or 50%; the number of agents can be 4, 8, 32, 512 or 1024, and each agent must be able to reach its target point, i.e., the start and target points are connected. All combinations of these settings are traversed when generating maps.
Further, in the step S2:
The local map information tensor includes:
(1) obstacles, with the map boundary treated as an obstacle;
(2) the position coordinates of the other agents;
(3) the agent's own target point coordinates; if the target coordinate lies outside the local range, the agent is connected to its target point by a straight line and the projection of this line on the boundary is taken as the target coordinate point;
(4) the target point coordinates of the other agents.
Further, in the step S3:
S31: within one time step t, a graph is constructed, specifically: each agent is abstracted as a node, the local information observed by the agent is the node feature X_t, the agents within r_local are its neighbors, and an edge is placed between the agent and each neighbor.
S32: an adjacency matrix S_t is constructed to record the neighbor information of all agents. In the adjacency matrix S_t, the first row is the index of the current node and the other rows are the neighbors of the current node.
S33: the graph convolution is calculated as

$$\mathcal{H}(X_t) = \sigma\Big(\sum_{k=0}^{K-1} S_t^{k} X_t A_k\Big)$$

where $S_t^{k} X_t$ denotes the fusion of information with the k-hop neighbors and $A_k$ is a convolution filter that must be trained for this fusion. The graph convolution thus represents the information fusion of a node with its K-hop neighbors, where 1 hop refers to the node itself, 2 hops to its neighbors, 3 hops to the neighbors of its neighbors, and so on. A ReLU activation is applied to the graph convolution to form the graph neural network.
Further, in the step S4:
The specific training process is as follows: when a training episode starts, the local map information processed by the graph neural network is fed, at random with a given probability, into either the imitation learning module or the reinforcement learning module; imitation learning provides an expert policy to accelerate the trial-and-error exploration of reinforcement learning and help it converge to the optimal policy. Both modules optimize the same policy network parameters.
The reinforcement learning part uses the asynchronous advantage actor-critic (A3C) algorithm for exploration training: the actor network computes a movement action policy π, the critic network computes the value V of the movement action, and the network is optimized by gradient descent through the loss function of V.
Imitation learning imitates the observation-action trajectories generated by an expert algorithm, where the expert algorithm is the multi-agent path planning algorithm Greedy Conflict-Based Search (GCBS); the cross entropy between the current policy π and the expert policy is computed and gradient descent is applied to update the policy network, bringing the policy closer to the expert algorithm.
The invention has the following beneficial effects: the distributed path planning algorithm is realized through the exchange of local information and a per-agent neural network design, i.e., each agent plans its own path autonomously and online using only the local environment that a real robot can perceive; compared with centralized planning, this effectively reduces the computational cost caused by dimension explosion and allows planned paths to be computed quickly. Local map information is passed between agents by the graph neural network, so that each agent knows the intentions of the other agents, which effectively improves the planning success rate. The training method combining reinforcement learning and imitation learning improves the efficiency of reinforcement learning's trial-and-error exploration, speeds up training and convergence, and imitates the expert algorithm to reduce collisions, thereby embodying group coordination.
Drawings
FIG. 1 is a flow chart of a method for multi-agent path planning based on deep reinforcement learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a specific neural network structure of a deep reinforcement learning-based multi-agent path planning method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating the map information in step S1 of the multi-agent path planning method based on deep reinforcement learning according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the 4-layer local observation tensors in step S2 of the deep reinforcement learning-based multi-agent path planning method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating the conversion among the neighborhood map, the neighbor graph, and the adjacency matrix in step S3 of the multi-agent path planning method based on deep reinforcement learning according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the combination of reinforcement learning and simulation learning method in step S4 of the deep reinforcement learning-based multi-agent path planning method according to an embodiment of the present invention;
Detailed Description
As shown in FIG. 1 and FIG. 2, which give the method flow and the specific network structure of the algorithm, the multi-agent path planning method based on deep reinforcement learning proposed by the present invention includes the following steps:
S1: generating a diverse data set in which the start point and target point of each agent, different 2D square grid map sizes, obstacle densities, and numbers of agents are randomly generated and combined.
A global grid map, obstacles, and binary maps of the agent start points and target points are generated with python or designed manually. The grid map is a square with side length 10, 50 or 100; the obstacle density, i.e., the percentage of obstacle cells among all map cells, is chosen from 10%, 30% and 50%; the number of agents is 4, 8, 32, 512 or 1024, and each agent must be able to reach its target point, i.e., the start and target points are connected. All combinations of these settings are traversed when generating maps, and each item is represented by a binary matrix. As shown in FIG. 3, a map with side length 10, obstacle density 10%, and 4 agents is generated.
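To make the embodiment concrete, a minimal python sketch of such a map generator is given below. The function name, the rejection sampling of start and goal cells, and the flood-fill connectivity check are illustrative assumptions; the patent only fixes the map sizes, obstacle densities, agent counts, and the requirement that start and target be reachable.

```python
import numpy as np
from collections import deque

def generate_instance(size=10, obstacle_density=0.10, n_agents=4, seed=None):
    """Illustrative generator: binary obstacle map plus connected start/goal pairs per agent."""
    rng = np.random.default_rng(seed)
    grid = (rng.random((size, size)) < obstacle_density).astype(np.int8)  # 1 = obstacle cell
    free = [tuple(c) for c in np.argwhere(grid == 0)]

    def connected(a, b):
        # 4-neighborhood flood fill: is there an obstacle-free path from a to b?
        seen, queue = {a}, deque([a])
        while queue:
            x, y = queue.popleft()
            if (x, y) == b:
                return True
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < size and 0 <= ny < size and grid[nx, ny] == 0 and (nx, ny) not in seen:
                    seen.add((nx, ny))
                    queue.append((nx, ny))
        return False

    starts, goals = [], []
    while len(starts) < n_agents:
        i, j = rng.choice(len(free), size=2, replace=False)
        s, g = free[i], free[j]
        if s not in starts and g not in goals and connected(s, g):  # reject unreachable pairs
            starts.append(s)
            goals.append(g)
    return grid, starts, goals

grid, starts, goals = generate_instance(size=10, obstacle_density=0.10, n_agents=4)
```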
S2: the local map tensor is input into the convolutional neural network for preprocessing, where the local information o_t is the map information within a square of side length r_local grid cells centered on the individual agent. The network architecture is: 3 convolutional layers, 1 max-pooling layer, 2 fully-connected layers (a sketch follows the list below). As shown in FIG. 4, when r_local = 7 the local map tensor comprises:
(1) a': obstacles; the boundary is treated as an obstacle when the local view extends beyond the global map;
(2) b': the agent's own target point coordinates; if the target coordinate lies outside the local range, the agent is connected to its target point by a straight line and the projection of this line on the boundary is taken as the target coordinate point;
(3) c': the position coordinates of the other agents (agents 2 and 3);
(4) d': the target point coordinates of the other agents.
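Below is a minimal PyTorch sketch of the convolutional preprocessing stage applied to the 4-channel r_local × r_local observation tensor. The patent fixes only the layer counts (3 convolutional layers, 1 max-pooling layer, 2 fully-connected layers); the kernel sizes, channel widths, and output feature dimension here are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LocalObsEncoder(nn.Module):
    """Maps a (4, r_local, r_local) local observation to a node feature vector X_t."""
    def __init__(self, r_local=7, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(            # 3 convolutional layers + 1 max-pooling layer (assumed widths)
            nn.Conv2d(4, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        flat = 64 * (r_local // 2) ** 2       # spatial size after 2x2 pooling of the r_local x r_local view
        self.fc = nn.Sequential(              # 2 fully-connected layers
            nn.Linear(flat, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, obs):                   # obs: (n_agents, 4, r_local, r_local)
        return self.fc(self.conv(obs).flatten(1))   # -> (n_agents, feat_dim)

encoder = LocalObsEncoder(r_local=7)
x_t = encoder(torch.zeros(4, 4, 7, 7))        # 4 agents, 4 channels, 7x7 local view
```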
S3: as shown in fig. 5, the local information between agents preprocessed in S2 is transferred using a graph neural network.
S31: during one time step t, as in a' of FIG. 5, the agents lying within the square of side length r_local = 7 grid cells are neighbors; as in b', a graph is constructed: each agent is abstracted as a node, the local information observed by the agent after the preprocessing of S2 becomes the node feature X_t, the agents within r_local are its neighbors, and an edge is placed between the agent and each neighbor.
S32: as in c' of FIG. 5, an adjacency matrix S_t is constructed to record the neighbor information of all agents. The first row is the index of the current node and the other rows are the neighbors of the current node; a small sketch of this construction is given below.
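The sketch below builds S_t from the agent positions at time step t. Treating two agents as neighbors when their Chebyshev distance keeps them inside each other's r_local × r_local view, and row-normalizing S_t, are illustrative assumptions; the patent describes S_t row-wise (index of the current node followed by its neighbors) without fixing a numeric layout.

```python
import numpy as np

def build_adjacency(positions, r_local=7):
    """positions: (n_agents, 2) integer grid coordinates at time step t."""
    pos = np.asarray(positions)
    # Chebyshev distance <= r_local // 2 means the other agent lies inside the local square view
    dist = np.abs(pos[:, None, :] - pos[None, :, :]).max(axis=-1)
    S = (dist <= r_local // 2).astype(np.float32)
    np.fill_diagonal(S, 0.0)                  # no self-loop; the k = 0 term of the graph convolution covers the node itself
    deg = S.sum(axis=1, keepdims=True)
    return np.divide(S, deg, out=np.zeros_like(S), where=deg > 0)   # row-normalize for stable matrix powers

S_t = build_adjacency([(0, 0), (1, 2), (9, 9), (2, 1)], r_local=7)
```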
S33: the graph convolution is calculated as

$$\mathcal{H}(X_t) = \sigma\Big(\sum_{k=0}^{K-1} S_t^{k} X_t A_k\Big)$$

where $S_t^{k} X_t$ denotes the fusion of information with the k-hop neighbors and $A_k$ is a convolution filter that must be trained for this fusion. Since the neighbors also perform the graph convolution operation, the graph convolution represents information fusion between a node and its K-hop neighbors, where 1 hop refers to the node itself, 2 hops to its neighbors, 3 hops to the neighbors of its neighbors, and so on. In FIG. 5 the information of 2-hop neighbors is fused. A ReLU activation is applied to the graph convolution to form the graph neural network.
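The graph convolution above can be sketched in PyTorch as follows. Representing S_t as a dense, row-normalized adjacency matrix and the filters A_k as bias-free linear layers is an implementation assumption; only the K-hop aggregation form and the ReLU activation are stated here.

```python
import torch
import torch.nn as nn

class GraphConvK(nn.Module):
    """K-hop graph convolution: H(X) = relu( sum_{k=0}^{K-1} S^k X A_k )."""
    def __init__(self, feat_dim=128, K=3):
        super().__init__()
        self.K = K
        self.filters = nn.ModuleList([nn.Linear(feat_dim, feat_dim, bias=False) for _ in range(K)])

    def forward(self, X, S):
        # X: (n_agents, feat_dim) node features; S: (n_agents, n_agents) adjacency matrix
        out, hop = torch.zeros_like(X), X          # k = 0 term: the node itself (S^0 = I)
        for k in range(self.K):
            out = out + self.filters[k](hop)       # fuse information from the k-hop neighborhood
            hop = S @ hop                          # propagate one hop further: S^{k+1} X
        return torch.relu(out)

gnn = GraphConvK(feat_dim=128, K=3)
H = gnn(torch.zeros(4, 128), torch.eye(4))         # 4 agents with identity adjacency, for shape checking
```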
S4: the network parameters of the algorithm are trained by a method combining imitation learning and reinforcement learning. Each agent holds a copy of the algorithm network, which outputs a policy vector, and at each time step the action corresponding to the maximum value in the policy vector is selected: up, down, left, right, or no movement.
As shown in FIG. 6, the specific method of combining imitation learning and reinforcement learning is as follows: at the beginning of a training episode, the local map information processed by the graph neural network in S3 is fed, with a given probability, into either the imitation learning module or the reinforcement learning module; imitation learning provides an expert policy to accelerate the trial-and-error exploration of reinforcement learning and help it converge to the optimal policy. Both modules optimize the same policy network parameters.
The reinforcement learning part uses the asynchronous advantage actor-critic (A3C) algorithm for exploration training. The actor network computes the movement action policy π and optimizes the policy network by gradient descent through the policy loss

$$L_{\pi} = -\sum_{t=0}^{T} \log P(a_t \mid \pi, o_t; \theta)\, A(o_t, a_t)$$

with the advantage function

$$A(o_t, a_t) = \sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} V(o_{t+k}; \theta') - V(o_t; \theta')$$

where T is the number of steps within the given time limit or until the target is reached, θ is the policy network parameter, γ is the discount factor, r_t is the reward function, k is the number of look-ahead steps, and P(a_t | π, o; θ) is the probability of selecting action a_t. The critic network computes the value V of the movement action, and the value network is optimized by gradient descent through the loss function of V,

$$L_{V} = \big(R_t - V(o_t; \theta')\big)^{2}$$

where θ' is the value network parameter and R_t is the cumulative reward computed from the reward function.
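A minimal sketch of the actor and critic losses described above, for a single agent over a k-step rollout, is given below. The backwards computation of R_t bootstrapped from V(o_{t+k}) and the absence of an entropy bonus are assumptions; only the advantage form, the selection probability P(a_t | π, o; θ), and the squared value loss are stated here.

```python
import torch

def a3c_losses(log_probs, values, rewards, bootstrap_value, gamma=0.99):
    """log_probs: log P(a_t | pi, o_t; theta); values: V(o_t; theta'); rewards: r_t -- all of length k."""
    returns, R = [], bootstrap_value                 # R starts from V(o_{t+k}; theta')
    for r in reversed(rewards):
        R = r + gamma * R                            # discounted cumulative reward R_t
        returns.append(R)
    returns = torch.stack(returns[::-1])

    advantages = returns - values                    # A(o_t, a_t) = R_t - V(o_t; theta')
    policy_loss = -(log_probs * advantages.detach()).sum()   # actor: descend on -log P * A
    value_loss = advantages.pow(2).sum()             # critic: (R_t - V(o_t; theta'))^2
    return policy_loss, value_loss

p_loss, v_loss = a3c_losses(
    log_probs=torch.log(torch.tensor([0.4, 0.3, 0.5])),
    values=torch.tensor([0.1, 0.2, 0.3]),
    rewards=[torch.tensor(-0.1), torch.tensor(-0.1), torch.tensor(1.0)],
    bootstrap_value=torch.tensor(0.0),
)
```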
The imitation learning part imitates the observation-action trajectories generated by an expert algorithm; the expert algorithm is the traditional multi-agent path planning algorithm Greedy Conflict-Based Search (GCBS). The cross entropy between the current policy π and the expert actions is computed,

$$L_{imitation} = -\sum_{t} \log P\big(a_t^{expert} \mid \pi, o_t; \theta\big),$$

and gradient descent is applied to update the policy network, so that the policy moves closer to the expert algorithm.
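In practice this reduces to a behavior-cloning cross entropy between the policy output and the actions chosen by the GCBS expert on the same observations; using `F.cross_entropy` over the five discrete actions, as sketched below, is an assumed implementation detail.

```python
import torch
import torch.nn.functional as F

def imitation_loss(policy_logits, expert_actions):
    """policy_logits: (T, 5) unnormalized scores over {up, down, left, right, stay};
    expert_actions: (T,) integer actions from the GCBS expert trajectories."""
    return F.cross_entropy(policy_logits, expert_actions)   # mean of -log P(a_t^expert | o_t)

loss = imitation_loss(torch.randn(8, 5), torch.randint(0, 5, (8,)))
```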

Claims (5)

1. A multi-agent path planning method based on deep reinforcement learning, characterized by comprising the following steps:
S1: generating a diverse data set in which the start point and target point of each agent, different 2D square grid map sizes, obstacle densities, and numbers of agents are randomly generated and combined.
S2: inputting the local map information tensor into a convolutional neural network for preprocessing, where the local map information is the map information within a square of side length r_local grid cells centered on the individual agent.
S3: exchanging the local information preprocessed in S2 between agents using a graph neural network.
S4: training the network parameters of the algorithm by a method combining imitation learning and reinforcement learning. Each agent holds a copy of the algorithm network, which outputs a policy, and over the time sequence each agent selects one of five actions: up, down, left, right, or no movement.
2. The method for multi-agent path planning based on deep reinforcement learning as claimed in claim 1, wherein in step S1:
A global grid map, obstacles, and binary maps of the agent start points and target points are generated with python or designed manually. The grid map is a square with side length 10, 50 or 100; the obstacle density, i.e., the percentage of obstacle cells among all map cells, can be 10%, 30% or 50%; the number of agents can be 4, 8, 32, 512 or 1024, and each agent must be able to reach its target point, i.e., the start and target points are connected. All combinations of these settings are traversed when generating maps.
3. The method for multi-agent path planning based on deep reinforcement learning as claimed in claim 1, wherein in step S2:
The local map information tensor includes:
(1) obstacles, with the map boundary treated as an obstacle;
(2) the position coordinates of the other agents;
(3) the agent's own target point coordinates; if the target coordinate lies outside the local range, the agent is connected to its target point by a straight line and the projection of this line on the boundary is taken as the target coordinate point;
(4) the target point coordinates of the other agents.
4. The method for multi-agent path planning based on deep reinforcement learning as claimed in claim 1, wherein in step S3:
S31: within one time step t, a graph is constructed, specifically: each agent is abstracted as a node, the local information observed by the agent is the node feature X_t, the agents within r_local are its neighbors, and an edge is placed between the agent and each neighbor.
S32: an adjacency matrix S_t is constructed to record the neighbor information of all agents. In the adjacency matrix S_t, the first row is the index of the current node and the other rows are the neighbors of the current node.
S33: the graph convolution is calculated as

$$\mathcal{H}(X_t) = \sigma\Big(\sum_{k=0}^{K-1} S_t^{k} X_t A_k\Big)$$

where $S_t^{k} X_t$ denotes the fusion of information with the k-hop neighbors and $A_k$ is a convolution filter that must be trained for this fusion. The graph convolution thus represents the information fusion of a node with its K-hop neighbors, where 1 hop refers to the node itself, 2 hops to its neighbors, 3 hops to the neighbors of its neighbors, and so on. A ReLU activation is applied to the graph convolution to form the graph neural network.
5. The method for multi-agent path planning based on deep reinforcement learning as claimed in claim 1, wherein in step S4:
The specific training process is as follows: when a training episode starts, the local map information processed by the graph neural network is fed, at random with a given probability, into either the imitation learning module or the reinforcement learning module; imitation learning provides an expert policy to accelerate the trial-and-error exploration of reinforcement learning and help it converge to the optimal policy. Both modules optimize the same policy network parameters.
The reinforcement learning part uses the asynchronous advantage actor-critic (A3C) algorithm for exploration training: the actor network computes a movement action policy π, the critic network computes the value V of the movement action, and the network is optimized by gradient descent through the loss function of V.
Imitation learning imitates the observation-action trajectories generated by an expert algorithm, where the expert algorithm is the multi-agent path planning algorithm Greedy Conflict-Based Search (GCBS); the cross entropy between the current policy π and the expert policy is computed and gradient descent is applied to update the policy network, bringing the policy closer to the expert algorithm.
CN202110468095.XA 2021-04-28 2021-04-28 Multi-agent path planning method based on deep reinforcement learning Pending CN113159432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110468095.XA CN113159432A (en) 2021-04-28 2021-04-28 Multi-agent path planning method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110468095.XA CN113159432A (en) 2021-04-28 2021-04-28 Multi-agent path planning method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN113159432A true CN113159432A (en) 2021-07-23

Family

ID=76872031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110468095.XA Pending CN113159432A (en) 2021-04-28 2021-04-28 Multi-agent path planning method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113159432A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113612692A (en) * 2021-08-11 2021-11-05 西安电子科技大学 Centralized optical on-chip network self-adaptive route planning method based on DQN algorithm
CN113850414A (en) * 2021-08-20 2021-12-28 天津大学 Logistics scheduling planning method based on graph neural network and reinforcement learning
CN114415663A (en) * 2021-12-15 2022-04-29 北京工业大学 Path planning method and system based on deep reinforcement learning
CN114489065A (en) * 2022-01-20 2022-05-13 华中科技大学同济医学院附属同济医院 Operating room medical material distribution multi-robot collaborative path planning method and application thereof
CN114629798A (en) * 2022-01-27 2022-06-14 清华大学 Multi-agent collaborative planning method and device, electronic equipment and storage medium
CN114676909A (en) * 2022-03-25 2022-06-28 东南大学 Unmanned vehicle charging path planning method based on deep reinforcement learning
CN115493595A (en) * 2022-09-28 2022-12-20 天津大学 AUV path planning method based on local perception and near-end optimization strategy
CN115907248A (en) * 2022-10-26 2023-04-04 山东大学 Multi-robot unknown environment path planning method based on geometric neural network
CN115993831A (en) * 2023-03-23 2023-04-21 安徽大学 Method for planning path of robot non-target network based on deep reinforcement learning
CN116187611A (en) * 2023-04-25 2023-05-30 南方科技大学 Multi-agent path planning method and terminal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112286203A (en) * 2020-11-11 2021-01-29 大连理工大学 Multi-agent reinforcement learning path planning method based on ant colony algorithm
CN112297005A (en) * 2020-10-10 2021-02-02 杭州电子科技大学 Robot autonomous control method based on graph neural network reinforcement learning
CN112362066A (en) * 2020-11-20 2021-02-12 西北工业大学 Path planning method based on improved deep reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112297005A (en) * 2020-10-10 2021-02-02 杭州电子科技大学 Robot autonomous control method based on graph neural network reinforcement learning
CN112286203A (en) * 2020-11-11 2021-01-29 大连理工大学 Multi-agent reinforcement learning path planning method based on ant colony algorithm
CN112362066A (en) * 2020-11-20 2021-02-12 西北工业大学 Path planning method based on improved deep reinforcement learning

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113612692A (en) * 2021-08-11 2021-11-05 西安电子科技大学 Centralized optical on-chip network self-adaptive route planning method based on DQN algorithm
CN113850414B (en) * 2021-08-20 2023-08-04 天津大学 Logistics scheduling planning method based on graph neural network and reinforcement learning
CN113850414A (en) * 2021-08-20 2021-12-28 天津大学 Logistics scheduling planning method based on graph neural network and reinforcement learning
CN114415663A (en) * 2021-12-15 2022-04-29 北京工业大学 Path planning method and system based on deep reinforcement learning
CN114489065A (en) * 2022-01-20 2022-05-13 华中科技大学同济医学院附属同济医院 Operating room medical material distribution multi-robot collaborative path planning method and application thereof
CN114489065B (en) * 2022-01-20 2023-08-25 华中科技大学同济医学院附属同济医院 Operating room medical material distribution multi-robot collaborative path planning method and application thereof
CN114629798A (en) * 2022-01-27 2022-06-14 清华大学 Multi-agent collaborative planning method and device, electronic equipment and storage medium
CN114629798B (en) * 2022-01-27 2023-08-18 清华大学 Multi-agent collaborative planning method and device, electronic equipment and storage medium
CN114676909A (en) * 2022-03-25 2022-06-28 东南大学 Unmanned vehicle charging path planning method based on deep reinforcement learning
CN114676909B (en) * 2022-03-25 2024-04-09 东南大学 Unmanned vehicle charging path planning method based on deep reinforcement learning
CN115493595A (en) * 2022-09-28 2022-12-20 天津大学 AUV path planning method based on local perception and near-end optimization strategy
CN115907248A (en) * 2022-10-26 2023-04-04 山东大学 Multi-robot unknown environment path planning method based on geometric neural network
CN115993831B (en) * 2023-03-23 2023-06-09 安徽大学 Method for planning path of robot non-target network based on deep reinforcement learning
CN115993831A (en) * 2023-03-23 2023-04-21 安徽大学 Method for planning path of robot non-target network based on deep reinforcement learning
CN116187611A (en) * 2023-04-25 2023-05-30 南方科技大学 Multi-agent path planning method and terminal

Similar Documents

Publication Publication Date Title
CN113159432A (en) Multi-agent path planning method based on deep reinforcement learning
CN110488859B (en) Unmanned aerial vehicle route planning method based on improved Q-learning algorithm
Liu et al. Multi-UAV path planning based on fusion of sparrow search algorithm and improved bioinspired neural network
Tang et al. A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation
CN113495578A (en) Digital twin training-based cluster track planning reinforcement learning method
CN113900445A (en) Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning
CN112327923A (en) Multi-unmanned aerial vehicle collaborative path planning method
CN112947562A (en) Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
CN111609864A (en) Multi-policeman cooperative trapping task allocation and path planning method under road network constraint
Ding et al. Hierarchical reinforcement learning framework towards multi-agent navigation
CN108413963A (en) Bar-type machine people's paths planning method based on self study ant group algorithm
CN110181508A (en) Underwater robot three-dimensional Route planner and system
Zhang et al. A self-heuristic ant-based method for path planning of unmanned aerial vehicle in complex 3-D space with dense U-type obstacles
Jin et al. Inverse reinforcement learning via deep gaussian process
Chen et al. Transformer-based imitative reinforcement learning for multi-robot path planning
Sui et al. Path planning of multiagent constrained formation through deep reinforcement learning
CN116841317A (en) Unmanned aerial vehicle cluster collaborative countermeasure method based on graph attention reinforcement learning
Tian et al. The application of path planning algorithm based on deep reinforcement learning for mobile robots
Jin et al. WOA-AGA algorithm design for robot path planning
Li et al. Improving fast adaptation for newcomers in multi-robot reinforcement learning system
Li et al. Improved genetic algorithm for multi-agent task allocation with time windows
Kermani et al. Flight path planning using GA and fuzzy logic considering communication constraints
Chai et al. Mobile robot path planning in 2d space: A survey
CN110598835B (en) Automatic path-finding method for trolley based on Gaussian variation genetic algorithm optimization neural network
Araújo et al. Cooperative observation of malicious targets in a 3d urban traffic environment using uavs

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination