CN114815840A - Multi-agent path planning method based on deep reinforcement learning - Google Patents


Info

Publication number
CN114815840A
CN114815840A (Application CN202210490010.2A / CN202210490010A)
Authority
CN
China
Prior art keywords
agent
joint
network
space
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210490010.2A
Other languages
Chinese (zh)
Other versions
CN114815840B (en)
Inventor
郑煜明
陈松
鲁华祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210490010.2A priority Critical patent/CN114815840B/en
Priority claimed from CN202210490010.2A external-priority patent/CN114815840B/en
Publication of CN114815840A publication Critical patent/CN114815840A/en
Application granted granted Critical
Publication of CN114815840B publication Critical patent/CN114815840B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure provides a multi-agent path planning method based on deep reinforcement learning, which includes: setting a joint state space and a joint action space of the multi-agent system; designing a reward function for the path planning problem to generate a joint reward; initializing the structure and parameters of the network model in the path planning method, and combining the current joint action space, the current joint state space, the current joint reward, and the joint state space at the next moment, obtained through information exchange among different agents, into a tuple that is stored in an experience buffer pool as a sample; updating the state space of each agent at the current moment; comparing the number of samples in the experience buffer pool with a preset capacity, and if the number of samples is smaller than the preset capacity, continuing to sample, otherwise entering the reinforcement learning training stage and updating the parameters of the network model by a multi-step return method; and realizing multi-agent path planning by using the network model obtained by training.

Description

Multi-agent path planning method based on deep reinforcement learning
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to a deep reinforcement learning-based multi-agent path planning method, an electronic device, a non-transitory computer-readable storage medium storing computer instructions, and a computer program product.
Background
With the continuous development and improvement of multi-agent technology, path planning has become a key research topic as an effective means of improving the survivability and application value of multi-agent systems. The purpose of path planning is to plan an optimal path from a current position to a target position in an environment with obstacles, under the constraints of hardware conditions. Path planning methods mainly comprise traditional planning methods and intelligent planning methods. Traditional planning algorithms include dynamic-search-based and sampling-based algorithms, such as the A* algorithm, the artificial potential field method, the Dijkstra algorithm, and Particle Swarm Optimization (PSO); intelligent planning algorithms include reinforcement learning algorithms such as the Q-learning algorithm and the State-Action-Reward-State-Action (Sarsa) algorithm. Reinforcement learning is currently one of the important learning methods in machine learning. Its basic idea is that an agent obtains corresponding rewards through continuous interaction with the environment, feeds this information back to the environment, and repeats the cycle; after accumulating a large amount of experience, the agent completes the path planning process through self-learning.
As the environment scale increases, these algorithms suffer from a large amount of computation, high cost, long run time, poor adaptability to different environments, and a tendency to become trapped in local optima in complex environments.
Disclosure of Invention
In view of the above, the present disclosure provides a deep reinforcement learning based multi-agent path planning method, an electronic device, a non-transitory computer readable storage medium storing computer instructions, and a computer program product.
According to a first aspect of the present disclosure, there is provided a multi-agent path planning method based on deep reinforcement learning, comprising:
setting a joint state space and a joint action space of the multi-agent system; designing a reward function aiming at the path planning problem to generate a combined reward;
initializing the structure and parameters of a network model in a path planning method, wherein the network model comprises a strategy network, an evaluation network, a target strategy network and a target evaluation network, the target strategy network and the strategy network have the same structure but different parameters, and the target evaluation network and the evaluation network have the same structure but different parameters;
inputting the state space of each agent into the policy network, and outputting the action to be executed by each agent at the current moment after the state space is processed by a deterministic policy function;
responding to the corresponding action executed by the intelligent agent, and respectively obtaining the state space and the corresponding reward value of each intelligent agent at the next moment; combining the current joint action space, the current joint state space, the current joint reward and the joint state space at the next moment obtained through information exchange among different agents into a tuple as a sample to be stored in an experience buffer pool;
updating the state space of each agent at the current moment; comparing the number of samples in the experience buffer pool with a preset capacity, if the number of samples is smaller than the preset capacity, continuing to acquire sample information through the policy network, otherwise, entering a reinforcement learning training stage, and updating parameters of the evaluation network through a multi-step learning method; randomly extracting K samples with track length of a multi-step return value L from the experience buffer pool, training the evaluation network, inputting a joint state space, a joint action space and rewards at a future moment, and outputting a corresponding target joint state action value; updating the parameters of the evaluation network by using a minimum loss function; updating the parameters of the strategy network by using a reverse gradient descent method; updating the parameters of the target strategy network and the target evaluation network by using a soft updating method;
and planning the path of the intelligent agent by using the trained network model in the path planning.
According to an embodiment of the present disclosure, the setting of the joint state space, the joint action space, and the joint reward of the agent includes:
constructing a Markov game model for multi-agent path planning, wherein the Markov model is described by a quintuple <N, S, A, P, R>, wherein N = {1, …, n} represents a set of a plurality of agents; S represents a joint state space; A represents a joint action space; R represents a joint reward; and P represents the probability of transitioning to the next state when all of the agents take a joint action in the current state.
According to an embodiment of the present disclosure, before constructing the markov game model for multi-agent path planning, the method further comprises:
acquiring initial environment information of a multi-agent system, wherein the initial environment information comprises the number of agents in the multi-agent system, the initial position coordinate of each agent, the position coordinate of a corresponding target point and the position coordinate of an obstacle;
converting an agent in the multi-agent system into a particle model, wherein the particle model comprises a plurality of particles corresponding to the plurality of agents, current coordinates of the particles correspond to current position coordinates of the agent, and endpoint coordinates of the particles correspond to position coordinates of a target point corresponding to the agent.
According to an embodiment of the present disclosure, the policy network is a fully connected neural network, and is configured to output an action that the agent needs to perform at a current time based on a state space corresponding to a current state input to the agent.
According to an embodiment of the present disclosure, the evaluation network is a fully connected neural network, and is configured to output a joint state action value of the agent at the current time based on the input joint state space vectors of all the agents and a joint action space obtained according to the policy networks of all the agents.
According to an embodiment of the present disclosure, the inputting the joint state space, the joint action space and the reward at the future time, and the outputting the corresponding target joint state action value includes:
and updating the target joint state action value of the intelligent agent at the current moment by utilizing a multi-step learning method according to the joint state space, the joint action space and the reward at the future moment.
According to the embodiment of the disclosure, updating the target joint state action value of the agent at the current moment by using a multi-step learning method according to the joint state space, the joint action space and the reward at the future moment comprises:
and inputting the reward value of the agent at each time from the time j to the time j + L, the joint state space at the time j + L, the joint action space and the discount coefficient, and outputting a target joint state action value at the current time, wherein the target joint state action value represents the sum of the accumulated reward in the L step length and the target joint state action value at the time j + L multiplied by the discount coefficient.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor.
Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the deep reinforcement learning based multi-agent path planning methods.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform any one of the deep reinforcement learning based multi-agent path planning methods described above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above deep reinforcement learning based multi-agent path planning method.
The invention provides a multi-agent path planning method based on deep reinforcement learning that combines a multi-agent deep deterministic policy with the idea of multi-step return. By using the multi-step return, the influence of future data on the current value is taken into account, so that the output value of the target evaluation network approaches the true value function; this reduces the deviation of the loss function, improves accuracy, accelerates convergence, shortens learning time, and quickly and efficiently plans an optimal path for the multi-agent system, thereby solving the path planning problem.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates a block diagram of a deep reinforcement learning based multi-agent system model MS-MADDPG according to an embodiment of the present invention;
FIG. 2 schematically illustrates a flow chart of a deep reinforcement learning based multi-agent path planning method of an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart for setting joint state space, joint action space and joint rewards for a multi-agent system according to an embodiment of the disclosure;
fig. 4 schematically shows a block diagram of an electronic device adapted to implement a deep reinforcement learning based multi-agent path planning method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.). Where a convention analogous to "A, B or at least one of C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).
With the continuous development and improvement of multi-agent technology, the information that a single agent can observe in a multi-agent system is limited, so traditional single-agent reinforcement learning algorithms lack generality. Multi-agent reinforcement learning algorithms increase the number of agents on the basis of single-agent reinforcement learning, introduce a joint state and a joint action, and adopt a centralized-training, distributed-execution strategy so that each agent can independently complete its objective; the DDPG algorithm, as a policy-based reinforcement learning algorithm, is used to guide an agent to search for an optimal path in an unknown environment.
In implementing the disclosed concept, the inventors found that there are at least the following problems in the related art: with the increase of the number of the agents and the expansion of the environment scale, the problems of low convergence speed, long training time and low path planning efficiency of a multi-agent system can occur.
To at least partially solve the technical problems in the related art, the present disclosure provides a deep reinforcement learning-based multi-agent path planning method, an electronic device, and a non-transitory computer-readable storage medium storing computer instructions, which can be applied to the technical field of artificial intelligence.
According to an embodiment of the present disclosure, the multi-agent path planning method based on deep reinforcement learning includes:
setting a joint state space and a joint action space of the multi-agent system; and designing a reward function aiming at the path planning problem to generate combined rewards.
Initializing a MS-MADDPG algorithm neural network, and initializing the structure and parameters of a network model in a path planning method, wherein the network model comprises a strategy network, an evaluation network, a target strategy network and a target evaluation network, the target strategy network and the strategy network have the same structure but different parameters, and the target evaluation network and the evaluation network have the same structure but different parameters.
And inputting the state space of each agent into the policy network, and outputting the action to be executed by each agent at the current moment after the state space of each agent is processed by a deterministic policy function.
Responding to the corresponding action executed by the intelligent agent, and respectively obtaining the state space and the corresponding reward value of each intelligent agent at the next moment; and combining the current joint action space, the current joint state space, the current joint reward and the joint state space at the next moment obtained through information exchange among different agents into a tuple as a sample to be stored in an experience buffer pool.
Updating the state space of each agent at the current moment; comparing the number of samples in the experience buffer pool with a preset capacity, if the number of samples is smaller than the preset capacity, continuing to acquire sample information through the policy network, otherwise, entering a reinforcement learning training stage, and updating parameters of the target network; randomly extracting K samples with track length of a multi-step return value L from the experience buffer pool, training the target evaluation network, inputting a joint state space, a joint action space and rewards at a future moment, and outputting a corresponding target joint state action value; updating parameters of the evaluation network by using a minimum loss function; updating parameters of the policy network by using a reverse gradient descent method; and updating the parameters of the target strategy network and the target evaluation network by using a soft updating method.
And planning the path of the intelligent agent by using the trained network model in the path planning.
FIG. 1 schematically shows a block diagram of a deep reinforcement learning based multi-agent system model MS-MADDPG according to an embodiment of the present invention.
As shown in fig. 1, the deep reinforcement learning-based multi-agent system model of this embodiment includes: an environment system, agents 1 to n, and an experience buffer pool.
Acquiring initial state information of the multi-agent system, and storing the initial state information into the environment system.
The agent 1 to the agent n may be respectively provided with a policy network, an evaluation network, a target policy network, and a target evaluation network, where the target policy network and the corresponding policy network have the same structure but different parameters, and the target evaluation network and the corresponding evaluation network have the same structure but different parameters.
Agents 1 to n obtain the corresponding state spaces s_1 to s_n from the environment system and input the corresponding state spaces into the corresponding policy networks; after processing by the deterministic policy functions, the actions a_1 to a_n to be executed by agents 1 to n at the current time are output.
In response to agents 1 to n executing the corresponding actions, the state spaces s'_1 to s'_n and the corresponding reward values r_1 to r_n of agents 1 to n at the next time are obtained; the current joint action space a, the current joint state space x, the current joint reward r, and the next-time joint state space x', obtained through information exchange among agents 1 to n, are combined into a tuple (x, a, r, x') that is stored in the experience buffer pool as a sample.
Agents 1 to n interact with the environment system and, according to the corresponding actions a_1 to a_n to be executed at the current time, the state spaces of agents 1 to n at the next time are updated.
And comparing the number of samples in the experience buffer pool with the preset capacity, if the number of samples is less than the preset capacity, continuously acquiring sample information through the strategy networks corresponding to the agents from the agent 1 to the agent n, and otherwise, entering a reinforcement learning training stage.
Updating parameters of the evaluation networks corresponding to agents 1 to n when reinforcement learning training is carried out; randomly extracting K batches of samples (x_{j:j+L}, a_{j:j+L}, r_{j:j+L}, x_{j:j+L}) with the track length of the multi-step return value L from the experience buffer pool, training the evaluation network, inputting the joint state space, the joint action space and the rewards at the future time, and outputting the corresponding target joint state action value; updating the parameters of the evaluation network by using a minimum loss function; updating the parameters of the strategy network by using a reverse gradient descent method; and updating the parameters of the target strategy network and the target evaluation network by using a soft updating method.
And planning the path of the intelligent agent by using the network model in the path planning obtained by training. Fig. 2 schematically illustrates a flowchart of a deep reinforcement learning-based multi-agent path planning method according to an embodiment of the present disclosure.
As shown in FIG. 2, the deep reinforcement learning-based multi-agent path planning method includes operations S210-S260.
In operation S210, a joint state space, a joint action space, and a joint bonus of the multi-agent system are set.
According to an embodiment of the present disclosure, setting a joint state space, a joint action space, and a joint reward of an agent includes:
constructing a Markov game model for multi-agent path planning, wherein the Markov model is described by a quintuple <N, S, A, P, R>, wherein N = {1, …, n} represents a set of a plurality of agents; S represents a joint state space; A represents a joint action space; R represents a joint reward; and P represents the probability of transitioning to the next state when all of the agents take a joint action in the current state.
Setting the joint state space, joint action space and joint rewards of a multi-agent system according to embodiments of the present disclosure also includes a pre-operation of constructing a markov game model for multi-agent path planning.
FIG. 3 schematically illustrates a flow chart for setting joint state space, joint action space and joint rewards for a multi-agent system according to an embodiment of the disclosure.
As shown in FIG. 3, setting the joint state space, joint action space and joint rewards of the multi-agent system includes operations S211-S213.
In operation S211, initial environment information of the multi-agent system is acquired.
According to an embodiment of the present disclosure, the initial environment information includes the number of agents in the multi-agent system, the start position coordinates of each agent and the position coordinates of the corresponding target point, the position coordinates of the obstacle;
More specifically, the environment-related information is initialized and a physical model is set for each agent and obstacle. Each agent i is set as a square area whose coordinates are denoted P_i(x, y); the target corresponding to each agent is set as a square area whose position coordinates are P_ig(x, y); each obstacle corresponds to a square area with position P_o(x, y); all obstacles are stationary, and the overall environment scale is a 10 × 10 square area.
In operation S212, the agents in the multi-agent system are converted into particle models.
More specifically, the particle model comprises a plurality of particles corresponding to the plurality of agents; the current coordinates of a particle correspond to the current position coordinates of the corresponding agent, and the endpoint coordinates of the particle correspond to the position coordinates of the target point corresponding to that agent, where the position coordinates of the target point are preset.
The experience buffer pool capacity M_o (for example, 50000), the number of batch samples K (for example, 256), and the multi-step return value L (for example, 5) are initialized. The maximum number of training iterations E (for example, 20000) and the maximum step length T of each training round (for example, 100) are set, and the current training time step t = 0 and the current training round e = 0 are initialized.
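For illustration, the example values mentioned above can be collected in a single configuration object. The following Python sketch is not part of the original disclosure; the field names and the number of agents are assumptions, and only the numeric values follow the examples given in this description (and the discount and soft-update coefficients given further below).

```python
# A minimal configuration sketch collecting the example hyperparameter values
# given in this description; field names and the number of agents are
# illustrative assumptions, not taken from the original disclosure.
from dataclasses import dataclass

@dataclass
class MSMADDPGConfig:
    n_agents: int = 3             # number of agents (assumed example value)
    buffer_capacity: int = 50000  # experience buffer pool capacity M_o
    batch_size: int = 256         # number of batch samples K
    multi_step: int = 5           # multi-step return value L
    max_episodes: int = 20000     # maximum number of training rounds E
    max_steps: int = 100          # maximum step length T of each round
    gamma: float = 0.99           # discount coefficient (given below)
    tau: float = 0.9              # soft-update coefficient (given below)

cfg = MSMADDPGConfig()
```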
In operation S213, a markov game model for multi-agent path planning is constructed.
According to the embodiment of the disclosure, the state space of each agent is vector information, and includes the position of the current agent, the relative positions of other agents to the current agent, and the relative positions of obstacles to the agent.
According to the embodiment of the disclosure, the action space of each agent represents the action that the agent can take with respect to the state space of the agent, and comprises 4 actions.
According to an embodiment of the present disclosure, the reward function of each agent represents a reward punishment value obtained after the corresponding action space is selected in the current state space, and because all agents cooperatively reach the target position while avoiding obstacles, the reward function of each agent is the same. The reward function of an agent may be represented by the following equation (1):
(Equation (1), the reward function of each agent, is rendered as an image in the original publication.)
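Since equation (1) itself is only available as an image, the following Python sketch shows one plausible shared reward of this kind (a dense distance term toward each agent's target plus a collision penalty near obstacles). It is an illustrative assumption, not the patent's equation (1); all thresholds and weights are hypothetical.

```python
# A hypothetical shared reward consistent with the stated goals (all agents
# cooperatively reach their targets while avoiding obstacles). This is NOT the
# patent's equation (1); thresholds and weights are illustrative.
import numpy as np

def shared_reward(agent_pos, target_pos, obstacle_pos, collide_dist=0.5):
    """Return one scalar reward shared by all agents.
    agent_pos, target_pos: arrays of shape (n, 2); obstacle_pos: shape (m, 2), m >= 1."""
    agent_pos = np.asarray(agent_pos, dtype=float)
    target_pos = np.asarray(target_pos, dtype=float)
    obstacle_pos = np.asarray(obstacle_pos, dtype=float)

    # Dense term: negative sum of distances from each agent to its own target.
    reward = -np.linalg.norm(agent_pos - target_pos, axis=1).sum()

    # Penalty term: fixed penalty whenever an agent comes too close to an obstacle.
    for p in agent_pos:
        if np.linalg.norm(obstacle_pos - p, axis=1).min() < collide_dist:
            reward -= 10.0
    return reward
```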
more specifically, the state space of each agent is input into the policy network, and after being processed by a deterministic policy function, the action to be executed by each agent at the current moment is output; responding to the corresponding action executed by the agents, and respectively obtaining the state space and the corresponding reward value of each agent at the next moment; and obtaining a current joint action space, a current joint state space, a current joint reward and a joint state space at the next moment through information exchange among different agents.
In operation S220, the structure and parameters of the network model in the path planning method are initialized; the network model includes a policy network, an evaluation network, a target policy network, and a target evaluation network.
According to the embodiment of the disclosure, the target policy network and the policy network have the same structure but different parameters, and the target evaluation network and the evaluation network have the same structure but different parameters.
More specifically, a policy network μ_i and its network parameters θ_i^μ are initialized for agent i. The policy network is a fully connected neural network comprising two hidden layers whose activation functions are ReLU functions; the first hidden layer has 64 nodes and the second hidden layer has 32 nodes; the output layer comprises one node and, using a Gumbel-Softmax activation function, outputs the action that agent i needs to execute at the current time.
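As an illustration, a PyTorch-style sketch of such a policy network is given below. Note that the text above states a single output node, whereas the sketch uses one logit per discrete action (four, matching the action space described earlier), which is the usual arrangement for a Gumbel-Softmax output; the class name and this choice are assumptions rather than the exact implementation of the disclosure.

```python
# A sketch of the described policy (actor) network: two hidden ReLU layers of
# 64 and 32 nodes, with a Gumbel-Softmax applied to the output logits. The
# class name and the one-logit-per-action output are illustrative assumptions.
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    def __init__(self, state_dim, action_dim=4):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 64)
        self.fc2 = nn.Linear(64, 32)
        self.out = nn.Linear(32, action_dim)

    def forward(self, state, hard=True):
        h = F.relu(self.fc1(state))
        h = F.relu(self.fc2(h))
        logits = self.out(h)
        # Gumbel-Softmax yields a differentiable (near-)one-hot action vector.
        return F.gumbel_softmax(logits, tau=1.0, hard=hard)
```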
An evaluation network Q_i of agent i and its network parameters θ_i^Q are likewise initialized. The evaluation network is a fully connected neural network comprising two hidden layers with 64 nodes each; the activation function of the first hidden layer is a sigmoid function and that of the second hidden layer is a ReLU function; the output layer comprises one node and, using a linear activation function, outputs the state action value at the current time.
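A corresponding sketch of the evaluation (critic) network is given below; it takes the concatenated joint state and joint action of all agents and returns a single joint state action value. Class and argument names, and the concatenation scheme, are illustrative assumptions.

```python
# A sketch of the described evaluation (critic) network: two hidden layers of
# 64 nodes (sigmoid, then ReLU) and a single linear output node giving the
# joint state-action value. Names and the concatenation scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CriticNet(nn.Module):
    def __init__(self, joint_state_dim, joint_action_dim):
        super().__init__()
        self.fc1 = nn.Linear(joint_state_dim + joint_action_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.out = nn.Linear(64, 1)  # linear activation on the output layer

    def forward(self, joint_state, joint_action):
        x = torch.cat([joint_state, joint_action], dim=-1)
        h = torch.sigmoid(self.fc1(x))
        h = F.relu(self.fc2(h))
        return self.out(h)
```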
The parameters of the policy network are copied into the corresponding target policy network; the copying process can be expressed by the following formula (2):

$\theta^{\mu} \rightarrow \theta^{\mu'};$  (2)

where θ^{μ'} denotes the parameters of the target policy network.
The parameters of the evaluation network are copied into the corresponding target evaluation network; the copying process can be expressed by the following formula (3):

$\theta^{Q} \rightarrow \theta^{Q'};$  (3)

where θ^{Q'} denotes the parameters of the target evaluation network.
In operation S230, the state space of each agent is input into the policy network, and after being processed by the deterministic policy function, the action that needs to be executed by each agent at the current time is output.
According to the embodiment of the present disclosure, outputting the action that each agent needs to perform at the current moment may be represented by the following formula (4):
$a_i = \mu_i\left(s_i \mid \theta_i^{\mu}\right) + \varepsilon;$  (4)

where a_i is the action that agent i needs to execute at the current time, μ_i(·) is the deterministic policy function, s_i is the state space of agent i, θ_i^μ denotes the parameters of the policy network corresponding to agent i, and ε is the noise of the policy network.
In operation S240, in response to the agents performing the corresponding actions, respectively obtaining a state space and a corresponding reward value of each agent at a next time; and combining the current joint action space, the current joint state space, the current joint reward and the joint state space at the next moment obtained through information exchange among different agents into a tuple as a sample to be stored in an experience buffer pool.
According to the embodiment of the disclosure, through information exchange among different agents: according to the action a_i to be executed at the current time, agent i interacts with the environment to obtain its state space s'_i at the next time and a feedback reward value r_i; according to the current-time state spaces s_i of all agents, the current-time joint state space x = (s_1, …, s_n) is obtained; according to the next-time state spaces s'_i of all agents, the next-time joint state space x' = (s'_1, …, s'_n) is obtained; according to the actions a_i that all agents need to execute at the current time, the current-time joint action set a = (a_1, …, a_n) is obtained, together with the current-time joint reward set r = (r_1, …, r_n); these are combined into a tuple (x, a, r, x') that is stored in the experience buffer pool D as a sample.
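For illustration, the experience buffer pool D can be realized as a simple bounded queue of (x, a, r, x') tuples, as in the sketch below; the class and method names are assumptions.

```python
# A minimal experience buffer pool sketch: each stored sample is the tuple
# (x, a, r, x') of joint state, joint action, joint reward and next joint
# state described above. Names and the deque-based storage are assumptions.
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)  # drops the oldest sample when full

    def push(self, x, a, r, x_next):
        self.buffer.append((x, a, r, x_next))

    def __len__(self):
        return len(self.buffer)
```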
In operation S250, updating a state space of each agent at the current time; and comparing the number of samples in the experience buffer pool with a preset capacity, if the number of samples is smaller than the preset capacity, continuing sampling, and acquiring sample information through a policy network, otherwise, entering a reinforcement learning training stage, and updating parameters of the evaluation network through a multi-step learning method.
According to an embodiment of the present disclosure, updating the state space of each agent at the current time may be represented by the following formula (5):
x' → x;  (5)

where x is the current-time joint state space of the agents and x' is the next-time joint state space of the agents; after each agent interacts with the environment system, the state space at the next time is assigned to that agent as its updated current-time state space.
According to the embodiment of the disclosure, the current number of samples M in the experience buffer pool is compared with the preset capacity M_o: if M < M_o, operations S230-S250 continue to be executed; if M ≥ M_o, the reinforcement learning training stage is entered to update the parameters of the target network.
More specifically, according to an embodiment of the present disclosure, the preset capacity of the experience buffer pool may be set to 50000.
In accordance with an embodiment of the present disclosure, K batches of samples (x_{j:j+L}, a_{j:j+L}, r_{j:j+L}, x_{j:j+L}) are randomly drawn from the experience buffer pool D, where each batch has the track length of the multi-step return value L, and the evaluation network is trained.
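One way to realize this sampling, assuming the ReplayBuffer sketch above and ignoring episode boundaries for brevity, is shown below; it draws K random starting indices j and returns contiguous slices of length L.

```python
# A sketch of the multi-step sampling: K random start indices j are drawn and
# each yields a contiguous trajectory of L consecutive (x, a, r, x') samples,
# i.e. (x_{j:j+L}, a_{j:j+L}, r_{j:j+L}, x_{j:j+L}). Episode boundaries are
# ignored here for brevity; this is an illustrative simplification.
import random

def sample_trajectories(buffer, k=256, l_steps=5):
    max_start = len(buffer) - l_steps          # requires len(buffer) >= l_steps
    starts = [random.randint(0, max_start) for _ in range(k)]
    return [[buffer.buffer[j + t] for t in range(l_steps)] for j in starts]
```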
According to an embodiment of the present disclosure, inputting a joint state space, a joint action space, and rewards in all sample times at a future time, outputting a corresponding target joint state action value comprises:
and updating the target joint state action value of the intelligent agent at the current moment by using a multi-step learning method according to the joint state space and the joint action space at the future moment and the rewards in all sample moments.
More specifically, the reward value of the agent at each time from time j to time j + L, the state space and action space at time j + L, and the discount coefficient are input, and the target joint state action value at the current time is output, where the target joint state action value represents the sum of the cumulative reward over the L steps and the target joint state action value at time j + L multiplied by the discount coefficient; the multi-step return value L may be set to 5, for example.
More specifically, K batches of samples with a track length of a multi-step return value L are randomly extracted from the experience buffer pool D, the number of batch sampling samples K can be set to 256, the evaluation network is trained, the joint state space and joint action space at the future time and rewards in all sample times are input, and the corresponding target joint state action value is output.
The output target joint state action value may be represented by the following equation (6):

$y = \sum_{k=0}^{L-1} \gamma^{k} r_{j+k} + \gamma^{L}\, Q_i'\left(x_{j+L}, a'_{1}, \ldots, a'_{n}\right)\Big|_{a'_{i}=\mu'_{i}\left(s_{i,\,j+L}\right)};$  (6)

where y is the target joint state action value output by the target evaluation network at the current time, equal to the sum of the cumulative reward over the L steps and the target joint state action value at time j + L multiplied by the discount coefficient; γ is the discount coefficient, set to 0.99; r_{j+k} is the agent reward value at time j + k; μ' = (μ'_1, …, μ'_n) is the set of target policy networks; Q_i'(·) is the target evaluation network, whose inputs are the joint state space at time j + L and the joint action obtained from the target policy set μ' at time j + L, and whose output is the corresponding state action value; and a'_i is the action of agent i given by the target policy set.
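A sketch of this target computation, assuming the PolicyNet/CriticNet sketches above, is given below; tensor shapes, batching and helper names are assumptions, and the computation is shown for a single trajectory for clarity.

```python
# Equation (6) as a sketch: the L-step discounted reward sum plus gamma^L times
# the target critic's value at time j+L, with the joint action at j+L taken
# from the target policy networks. Shapes and names are illustrative.
import torch

def multi_step_target(rewards, next_states, target_actors, target_critic, gamma=0.99):
    """rewards: list of L joint-reward scalars r_j ... r_{j+L-1};
    next_states: list of per-agent 1-D state tensors s_{i, j+L}."""
    y = sum((gamma ** k) * r for k, r in enumerate(rewards))
    with torch.no_grad():
        actions = [mu(s.unsqueeze(0)) for mu, s in zip(target_actors, next_states)]
        joint_state = torch.cat(next_states).unsqueeze(0)   # (1, sum of state dims)
        joint_action = torch.cat(actions, dim=-1)           # (1, sum of action dims)
        q_next = target_critic(joint_state, joint_action).item()
    return y + (gamma ** len(rewards)) * q_next
```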
According to an embodiment of the present disclosure, updating the parameters of the evaluation network using the minimum loss function may be represented by the following formula (7):

$L\left(\theta_{i}^{Q}\right) = \frac{1}{K} \sum_{j}\left(y^{j} - Q_{i}\left(x^{j}, a_{1}^{j}, \ldots, a_{n}^{j}\right)\right)^{2};$  (7)

where L is the loss function; the summation runs over all of the processed sampled samples; Q_i(x^j, a_1^j, …, a_n^j) is the joint state action value output by the evaluation network; K is the number of batch sampling samples; and y is the target joint state action value at the current time.
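The corresponding critic update can be sketched as follows, assuming the targets y have already been computed with the multi-step function above and stacked into a tensor; optimizer setup and batching details are assumptions.

```python
# Equation (7) as a sketch: mean-squared error between the precomputed targets
# y and the critic's outputs over the K sampled batches, minimized by one
# gradient step. Argument shapes and names are illustrative.
import torch.nn.functional as F

def update_critic(critic, critic_optim, joint_states, joint_actions, targets):
    """joint_states: (K, joint_state_dim); joint_actions: (K, joint_action_dim);
    targets: (K, 1) tensor of multi-step targets y."""
    q = critic(joint_states, joint_actions)
    loss = F.mse_loss(q, targets)   # L = (1/K) * sum_j (y_j - Q_j)^2
    critic_optim.zero_grad()
    loss.backward()
    critic_optim.step()
    return loss.item()
```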
According to the embodiment of the disclosure, the evaluation network is a fully-connected neural network and is used for outputting the joint state action value of the agent at the current moment based on the input joint state space vectors of all agents and the joint action space obtained according to the respective strategy networks of all agents.
More specifically, the evaluation network is composed of an input layer, hidden layers and an output layer; it takes as input the joint state space x = (s_1, …, s_n) of all agents and the joint action space a = (a_1, …, a_n) that all agents need to execute at the current time, obtained from the corresponding policy networks, and outputs the joint state action value Q_i(x, a_1, …, a_n) of the agents.
According to an embodiment of the present disclosure, updating parameters of a policy network using an inverse gradient descent method may be represented by the following equation (8):
$\nabla_{\theta_{i}^{\mu}} J \approx \frac{1}{K} \sum_{j} \nabla_{\theta_{i}^{\mu}} \mu_{i}\left(s_{i}^{j}\right) \nabla_{a_{i}} Q_{i}\left(x^{j}, a_{1}^{j}, \ldots, a_{i}, \ldots, a_{n}^{j}\right)\Big|_{a_{i}=\mu_{i}\left(s_{i}^{j}\right)};$  (8)

where ∇_{θ_i^μ} J is the gradient of the loss function; ∇_{θ_i^μ} μ_i and ∇_{a_i} Q_i are respectively the gradient of the policy function and the gradient of the state action value function; x is the current-time joint state space of all agents; and Q_i(x, a_1, …, a_n) is the joint state action value of agent i output by the evaluation network when the actions and joint state information of all agents are input.
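A sketch of this policy update is given below: agent i's entry in the sampled joint action is replaced by the differentiable output of its current policy, and the critic's value is ascended by minimizing its negative. The index bookkeeping and names are assumptions.

```python
# Equation (8) as a sketch: the sampled joint action has agent i's slot
# replaced by mu_i(s_i), and -Q is used as the loss so that a gradient step
# ascends the joint state-action value. Layout assumption: actions of the n
# agents are concatenated in fixed order, each of width action_dim.
import torch

def update_actor(i, actor, actor_optim, critic, states_i, joint_states,
                 joint_actions, action_dim=4):
    """states_i: (K, state_dim) observations of agent i;
    joint_actions: (K, n_agents * action_dim) sampled joint actions."""
    parts = list(torch.split(joint_actions, action_dim, dim=1))
    parts[i] = actor(states_i)                  # differentiable action of agent i
    a = torch.cat(parts, dim=1)
    loss = -critic(joint_states, a).mean()      # ascend Q by descending -Q
    actor_optim.zero_grad()
    loss.backward()
    actor_optim.step()
    return loss.item()
```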
According to an embodiment of the present disclosure, updating parameters of the target policy network and the target evaluation network using the soft update method may be represented by the following formula (9):
$\theta_{i}^{\mu'} \leftarrow \tau\, \theta_{i}^{\mu} + (1-\tau)\, \theta_{i}^{\mu'}, \qquad \theta_{i}^{Q'} \leftarrow \tau\, \theta_{i}^{Q} + (1-\tau)\, \theta_{i}^{Q'};$  (9)

where τ is the soft update coefficient, which may be set to 0.9, for example; the target policy network parameter θ_i^{μ'} on the right-hand side of each assignment denotes the target policy network parameter before the soft update, and the θ_i^{μ'} on the left-hand side denotes the target policy network parameter after the soft update; likewise, the target evaluation network parameter θ_i^{Q'} on the right-hand side denotes the target evaluation network parameter before the soft update, and the θ_i^{Q'} on the left-hand side denotes the target evaluation network parameter after the soft update.
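This soft update can be sketched in a few lines, applied to both the target policy networks and the target evaluation networks; the function name is an assumption and τ follows the example value given above.

```python
# Equation (9) as a sketch: theta' <- tau * theta + (1 - tau) * theta' for
# every parameter tensor of a target network and its online counterpart.
import torch

@torch.no_grad()
def soft_update(target_net, net, tau=0.9):
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)
```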
Whether the training is finished is then judged: the current training time step is increased by 1 and recorded as t; if t < T, operation S230 and the subsequent training operations continue to be executed to collect data for the multi-agent system; if t ≥ T, the current training round e is increased by 1; if e < E, the environment information of the multi-agent system is re-initialized and operation S230 and the subsequent training operations continue to be executed; otherwise, the training is finished and the trained network parameters are saved.
In operation S260, a path of the multi-agent system is planned using the network model in the trained path plan.
According to the embodiment of the disclosure, the multi-agent path planning method based on deep reinforcement learning combines the multi-agent deep deterministic policy with the idea of multi-step return and uses the future multi-step state action function to update the state action function at the current time, so that the evaluation network output value approaches the target evaluation network output value. This reduces the deviation of the loss function, takes into account the influence of future data on the current state action value, improves the accuracy of the trained neural network model, increases the training speed, reduces the training time, and quickly solves the path planning problem in a multi-agent system.
According to the embodiment of the disclosure, the intelligent agent is subjected to path planning by the trained network model, and the intelligent agent selects a proper action according to the trained network model to complete a path planning task.
According to the embodiment of the disclosure, each intelligent agent in the multi-intelligent-agent system can autonomously avoid the obstacle and smoothly reach the respective target position in the complex environment, and the multi-intelligent-agent path planning method based on deep reinforcement learning is utilized, so that the time for training path planning is shortened, and the flexibility and the accuracy of path planning are ensured.
According to the embodiment of the disclosure, the path planning problem in the multi-agent system can be quickly realized, and a foundation is laid for the large-scale multi-agent system to execute tasks. Compared with the original MADDPG algorithm, the method has a better network parameter updating mode, updates the state action function at the current moment by using the state action function at the future multi-step moment, solves the problem that the difference value between the evaluation network output and the target evaluation network output is large, namely the loss function error is large, and can further improve the accuracy of the trained neural network model, thereby further improving the practical application value. The reinforcement learning method based on multi-step learning shortens the training time and has better training effect.
Fig. 4 schematically shows a block diagram of an electronic device adapted to implement the deep reinforcement learning based multi-agent path planning method according to an embodiment of the present disclosure. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 4, the computer electronic device 400 according to the embodiment of the present disclosure includes a processor 401 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. Processor 401 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 401 may also include onboard memory for caching purposes. Processor 401 may include a single processing unit or multiple processing units for performing the different actions of the method flows in accordance with embodiments of the present disclosure.
In the RAM 403, various programs and data necessary for the operation of the electronic apparatus 400 are stored. The processor 401, ROM 402 and RAM 403 are connected to each other by a bus 404. The processor 401 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 402 and/or the RAM 403. Note that the programs may also be stored in one or more memories other than the ROM 402 and RAM 403. The processor 401 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, electronic device 400 may also include an input/output (I/O) interface 405, input/output (I/O) interface 405 also being connected to bus 404. Electronic device 400 may also include one or more of the following components connected to I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 410 as necessary, so that a computer program read out therefrom is installed into the storage section 408 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. The computer program, when executed by the processor 401, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, a computer-readable storage medium may include ROM402 and/or RAM403 and/or one or more memories other than ROM402 and RAM403 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method provided by the embodiments of the present disclosure; when the computer program product is run on an electronic device, the program code is adapted to cause the electronic device to carry out the deep reinforcement learning based multi-agent path planning method provided by the embodiments of the present disclosure.
The computer program, when executed by the processor 401, performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal on a network medium, downloaded and installed through the communication section 409, and/or installed from the removable medium 411. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, these computer programs may be implemented using high level procedural and/or object oriented programming languages, and/or assembly/machine languages. The programming language includes, but is not limited to, programming languages such as Java, C++, Python, the "C" language, or the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. A multi-agent path planning method based on deep reinforcement learning comprises the following steps:
setting a joint state space and a joint action space of the multi-agent system; designing a reward function aiming at the path planning problem to generate a combined reward;
initializing the structure and parameters of a network model in a path planning method, wherein the network model comprises a strategy network, an evaluation network, a target strategy network and a target evaluation network, the target strategy network and the strategy network have the same structure but different parameters, and the target evaluation network and the evaluation network have the same structure but different parameters;
inputting the state space of each agent into the policy network, and outputting the action to be executed by each agent at the current moment after the state space is processed by a deterministic policy function;
responding to the corresponding action executed by the intelligent agent, and respectively obtaining the state space and the corresponding reward value of each intelligent agent at the next moment; combining the current joint action space, the current joint state space, the current joint reward and the joint state space at the next moment obtained through information exchange among different agents into a tuple as a sample to be stored in an experience buffer pool;
updating the state space of each agent at the current moment; comparing the number of samples in the experience buffer pool with a preset capacity, if the number of samples is smaller than the preset capacity, continuing to sample, and acquiring sample information through the strategy network, otherwise, entering a reinforcement learning training stage, and updating parameters of the evaluation network through a multi-step return method; randomly extracting K batches of samples with the track length of a multi-step return value L from the experience buffer pool, training the evaluation network, inputting a joint state space, a joint action space and rewards at a future moment, and outputting a corresponding target joint state action value; updating parameters of the evaluation network by using a minimum loss function; updating parameters of the policy network by using a reverse gradient descent method; updating parameters of the target strategy network and the target evaluation network by using a soft updating method;
and planning the path of the intelligent agent by using the trained network model in the path planning.
2. The method of claim 1, wherein setting the joint state space, the joint action space and the joint reward of the multi-agent system comprises:
constructing a Markov game model for multi-agent path planning, wherein the Markov game model is described by a five-tuple <N, S, A, P, R>, in which N = {1, ..., n} represents the set of the n agents; S represents the joint state space; A represents the joint action space; R represents the joint reward; and P represents the probability of transitioning from the current state to the next state when all of the agents take a joint action.
3. The method of claim 1, wherein prior to constructing a Markov game model for multi-agent path planning, the method further comprises:
acquiring initial environment information of a multi-agent system, wherein the initial environment information comprises the number of agents in the multi-agent system, the initial position coordinate of each agent, the position coordinate of a corresponding target point and the position coordinate of an obstacle;
converting the agents in the multi-agent system into a particle model, wherein the particle model comprises a plurality of particles corresponding to the plurality of agents, the current coordinates of each particle correspond to the current position coordinates of the corresponding agent, and the endpoint coordinates of each particle correspond to the position coordinates of the target point corresponding to that agent.
4. The method of claim 1, wherein the policy network is a fully-connected neural network for outputting the action that the agent needs to perform at the current time based on the input state space of the agent at the current time.
5. The method according to claim 1, wherein the evaluation network is a fully-connected neural network configured to output the joint state-action value of the agents at the current time based on the input joint state space vectors of all agents and the joint action space obtained from their respective policy networks.
6. The method of claim 1, wherein inputting the joint state space, the joint action space and the rewards at future moments, and outputting the corresponding target joint state-action value comprises:
updating the target joint state-action value of the agents at the current moment by using a multi-step learning method according to the joint state space, the joint action space and the rewards at future moments.
7. The method of claim 6, wherein updating the target joint state-action value of the agents at the current moment by using a multi-step learning method according to the joint state space, the joint action space and the rewards at future moments comprises:
inputting the reward value of the agents at each time from time j to time j + L, the joint state space at time j + L, the joint action space, and the discount coefficient, and outputting the target joint state-action value at the current time, wherein the target joint state-action value represents the sum of the reward accumulated over the L steps and the target joint state-action value at time j + L multiplied by the discount coefficient.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
9. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
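The Markov game five-tuple of claim 2 and the particle model of claim 3 can be pictured with the following minimal Python sketch. It is an editorial illustration only: the names Particle, MarkovGame and make_particles, the 2-D coordinates, and the dimension arithmetic are assumptions and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Particle:
    """Point-mass stand-in for one agent in the particle model (claim 3)."""
    current: Tuple[float, float]  # current position coordinates of the agent
    goal: Tuple[float, float]     # position coordinates of its target point


@dataclass
class MarkovGame:
    """Five-tuple <N, S, A, P, R> of the multi-agent path-planning game (claim 2)."""
    agents: List[int]                      # N = {1, ..., n}
    joint_state_dim: int                   # dimensionality of the joint state space S
    joint_action_dim: int                  # dimensionality of the joint action space A
    transition: Optional[Callable] = None  # P: joint transition probability (often implicit in a simulator)
    reward: Optional[Callable] = None      # R: joint reward function


def make_particles(starts: List[Tuple[float, float]],
                   goals: List[Tuple[float, float]]) -> List[Particle]:
    """Convert the agents of the multi-agent system into the particle model."""
    return [Particle(current=s, goal=g) for s, g in zip(starts, goals)]


if __name__ == "__main__":
    particles = make_particles([(0.0, 0.0), (1.0, 0.0)], [(5.0, 5.0), (4.0, 6.0)])
    game = MarkovGame(agents=list(range(len(particles))),
                      joint_state_dim=4 * len(particles),
                      joint_action_dim=2 * len(particles))
    print(game.agents, particles[0])
```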
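For the fully-connected policy network of claim 4 and the evaluation network of claim 5, a minimal sketch in Python with PyTorch might look as follows. The class names, layer widths and activations are assumptions chosen for illustration; the patent does not specify them.

```python
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Maps one agent's own state to the action it should perform at the current time."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # deterministic policy output
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class EvaluationNet(nn.Module):
    """Maps the joint state and joint action of all agents to a joint state-action value."""
    def __init__(self, joint_state_dim: int, joint_action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar joint state-action value
        )

    def forward(self, joint_state: torch.Tensor, joint_action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([joint_state, joint_action], dim=-1))


if __name__ == "__main__":
    n_agents, state_dim, action_dim = 3, 4, 2
    actor = PolicyNet(state_dim, action_dim)
    critic = EvaluationNet(n_agents * state_dim, n_agents * action_dim)
    s = torch.randn(8, state_dim)                    # batch of one agent's states
    joint_s = torch.randn(8, n_agents * state_dim)   # batch of joint states
    joint_a = torch.randn(8, n_agents * action_dim)  # batch of joint actions
    print(actor(s).shape, critic(joint_s, joint_a).shape)
```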
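The multi-step target of claims 6 and 7, the reward accumulated over L steps plus the discounted target joint state-action value at time j + L, is commonly written as y_j = sum_{k=0}^{L-1} gamma^k * r_{j+k} + gamma^L * Q'(s_{j+L}, a_{j+L}). The Python sketch below computes that quantity; whether the rewards inside the L-step window are themselves discounted is an assumption here, since the claim only states that they are accumulated.

```python
from typing import Sequence


def multi_step_target(rewards: Sequence[float], q_target_at_jL: float, gamma: float) -> float:
    """L-step return target.

    rewards:        r_j, ..., r_{j+L-1}, the reward at each time from j to j+L-1
    q_target_at_jL: target joint state-action value at time j+L
    gamma:          discount coefficient
    """
    L = len(rewards)
    accumulated = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return accumulated + (gamma ** L) * q_target_at_jL


if __name__ == "__main__":
    # Three-step trajectory segment with a bootstrap value of 10.0 at time j+L.
    print(multi_step_target([1.0, 0.5, -0.2], q_target_at_jL=10.0, gamma=0.95))
```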
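One training iteration of the kind summarized in claim 1, minimizing the evaluation-network loss against a multi-step target, updating the policy network by gradient descent on the negated joint state-action value, and soft-updating the target networks, could be sketched in Python with PyTorch as below. The network sizes, batch, optimizer, tau, and the random stand-in data are placeholders; the targets y are assumed to have been precomputed by the multi-step return method of claims 6-7.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, action_dim, n_agents = 4, 2, 3
joint_state_dim, joint_action_dim = n_agents * state_dim, n_agents * action_dim

# Tiny stand-ins for the policy and evaluation networks and their target copies.
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(joint_state_dim + joint_action_dim, 32), nn.ReLU(), nn.Linear(32, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Fabricated mini-batch standing in for the K batches of L-step samples drawn from
# the experience buffer pool; y holds the precomputed multi-step targets.
batch = 16
joint_s = torch.randn(batch, joint_state_dim)
joint_a = torch.randn(batch, joint_action_dim)
y = torch.randn(batch, 1)

# 1) Evaluation-network update: minimize the squared error against the target value.
critic_loss = nn.functional.mse_loss(critic(torch.cat([joint_s, joint_a], dim=-1)), y)
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# 2) Policy-network update: gradient descent on the negated joint state-action value,
#    replacing only the first agent's slice of the joint action with its policy output.
own_s = joint_s[:, :state_dim]
new_joint_a = torch.cat([actor(own_s), joint_a[:, action_dim:]], dim=-1)
actor_loss = -critic(torch.cat([joint_s, new_joint_a], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# 3) Soft update of the target policy and target evaluation networks.
tau = 0.01
for net, target in ((actor, target_actor), (critic, target_critic)):
    for p, tp in zip(net.parameters(), target.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)

print(float(critic_loss), float(actor_loss))
```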
CN202210490010.2A 2022-04-29 Multi-agent path planning method based on deep reinforcement learning Active CN114815840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210490010.2A CN114815840B (en) 2022-04-29 Multi-agent path planning method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210490010.2A CN114815840B (en) 2022-04-29 Multi-agent path planning method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114815840A (en) 2022-07-29
CN114815840B (en) 2024-06-28

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210089910A1 (en) * 2019-09-25 2021-03-25 Deepmind Technologies Limited Reinforcement learning using meta-learned intrinsic rewards
WO2022007179A1 (en) * 2020-07-10 2022-01-13 歌尔股份有限公司 Multi-agv motion planning method, apparatus, and system
CN113341958A (en) * 2021-05-21 2021-09-03 西北工业大学 Multi-agent reinforcement learning movement planning method with mixed experience
CN113485380A (en) * 2021-08-20 2021-10-08 广东工业大学 AGV path planning method and system based on reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
许诺; 杨振伟: "Multi-Agent Cooperation Based on the MADDPG Algorithm under Sparse Rewards", 现代计算机 (Modern Computer), no. 15, 25 May 2020 (2020-05-25), pages 48-52 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496208B (en) * 2022-11-15 2023-04-18 清华大学 Cooperative mode diversified and guided unsupervised multi-agent reinforcement learning method
CN117406706A (en) * 2023-08-11 2024-01-16 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning
CN117406706B (en) * 2023-08-11 2024-04-09 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning
CN117114937A (en) * 2023-09-07 2023-11-24 深圳市真实智元科技有限公司 Method and device for generating exercise song based on artificial intelligence

Similar Documents

Publication Publication Date Title
Ashraf et al. Optimizing hyperparameters of deep reinforcement learning for autonomous driving based on whale optimization algorithm
KR102242516B1 (en) Train machine learning models on multiple machine learning tasks
JP6728496B2 (en) Environmental navigation using reinforcement learning
CN110114784B (en) Recursive environment predictor and method therefor
US11521056B2 (en) System and methods for intrinsic reward reinforcement learning
US10860927B2 (en) Stacked convolutional long short-term memory for model-free reinforcement learning
JP2021185492A (en) Enhanced learning including auxiliary task
KR20200110400A (en) Learning data augmentation policy
CN110692066A (en) Selecting actions using multimodal input
CN110646009A (en) DQN-based vehicle automatic driving path planning method and device
JP7284277B2 (en) Action selection using the interaction history graph
US20210158162A1 (en) Training reinforcement learning agents to learn farsighted behaviors by predicting in latent space
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
JP2020508524A (en) Action Selection for Reinforcement Learning Using Neural Networks
US11454978B2 (en) Systems and methods for improving generalization in visual navigation
JP7139524B2 (en) Control agents over long timescales using time value transfer
US20220036186A1 (en) Accelerated deep reinforcement learning of agent control policies
US12005580B2 (en) Method and device for controlling a robot
JP7354460B2 (en) Learning environment representation for agent control using bootstrapped latency predictions
CN114519433A (en) Multi-agent reinforcement learning and strategy execution method and computer equipment
CN116300977B (en) Articulated vehicle track tracking control method and device based on reinforcement learning
CN114815840B (en) Multi-agent path planning method based on deep reinforcement learning
CN114815840A (en) Multi-agent path planning method based on deep reinforcement learning
JP2022189799A (en) Demonstration-conditioned reinforcement learning for few-shot imitation
CN113158539A (en) Method for long-term trajectory prediction of traffic participants

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant