CN114815840A - Multi-agent path planning method based on deep reinforcement learning - Google Patents


Info

Publication number
CN114815840A
CN114815840A (Application CN202210490010.2A / CN202210490010A)
Authority
CN
China
Prior art keywords
agent
joint
network
space
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210490010.2A
Other languages
Chinese (zh)
Other versions
CN114815840B (en)
Inventor
郑煜明
陈松
鲁华祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210490010.2A priority Critical patent/CN114815840B/en
Priority claimed from CN202210490010.2A external-priority patent/CN114815840B/en
Publication of CN114815840A publication Critical patent/CN114815840A/en
Application granted granted Critical
Publication of CN114815840B publication Critical patent/CN114815840B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure provides a multi-agent path planning method based on deep reinforcement learning, which includes: setting a joint state space and a joint action space of the multi-agent system; designing a reward function for the path planning problem to generate a joint reward; initializing the structure and parameters of the network model in the path planning method, and combining the current joint action space, the current joint state space, the current joint reward, and the joint state space at the next moment, obtained through information exchange among different agents, into a tuple that is stored in an experience buffer pool as a sample; updating the state space of each agent at the current moment; comparing the number of samples in the experience buffer pool with a preset capacity, and if the number of samples is smaller than the preset capacity, continuing to sample, otherwise entering the reinforcement learning training stage and updating the parameters of the network model by a multi-step return method; and realizing multi-agent path planning by using the network model obtained by training.

Description

Multi-agent path planning method based on deep reinforcement learning
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to a deep reinforcement learning-based multi-agent path planning method, an electronic device, a non-transitory computer-readable storage medium storing computer instructions, and a computer program product.
Background
With the continuous development and improvement of multi-agent technology, path planning has become a key research topic as an effective means of improving the survivability and application value of multi-agent systems. The purpose of path planning is to plan an optimal path from a current position to a target position in an environment with obstacles, under the constraints of hardware conditions. Path planning methods mainly comprise traditional planning methods and intelligent planning methods. Traditional planning algorithms include dynamic-search-based and sampling-based algorithms, such as the A* algorithm, the artificial potential field method, the Dijkstra algorithm, and Particle Swarm Optimization (PSO); intelligent planning algorithms include reinforcement learning algorithms such as the Q-learning algorithm and the State-Action-Reward-State-Action (Sarsa) algorithm. Reinforcement learning is currently one of the important learning methods in machine learning. Its basic idea is that an agent obtains corresponding rewards through continuous interaction with the environment, feeds this information back to the environment, and repeats the cycle; after accumulating a large amount of experience, the agent completes the path planning process through self-learning.
As the environment scale increases, these algorithms suffer from a large amount of computation, high cost, long run time, poor adaptability to different environments, and a tendency to become trapped in local optima in complex environments.
Disclosure of Invention
In view of the above, the present disclosure provides a deep reinforcement learning based multi-agent path planning method, an electronic device, a non-transitory computer readable storage medium storing computer instructions, and a computer program product.
According to a first aspect of the present disclosure, there is provided a multi-agent path planning method based on deep reinforcement learning, comprising:
setting a joint state space and a joint action space of the multi-agent system; designing a reward function aiming at the path planning problem to generate a combined reward;
initializing the structure and parameters of a network model in a path planning method, wherein the network model comprises a strategy network, an evaluation network, a target strategy network and a target evaluation network, the target strategy network and the strategy network have the same structure but different parameters, and the target evaluation network and the evaluation network have the same structure but different parameters;
inputting the state space of each agent into the policy network, and outputting the action to be executed by each agent at the current moment after the state space is processed by a deterministic policy function;
responding to the corresponding action executed by the intelligent agent, and respectively obtaining the state space and the corresponding reward value of each intelligent agent at the next moment; combining the current joint action space, the current joint state space, the current joint reward and the joint state space at the next moment obtained through information exchange among different agents into a tuple as a sample to be stored in an experience buffer pool;
updating the state space of each agent at the current moment; comparing the number of samples in the experience buffer pool with a preset capacity, if the number of samples is smaller than the preset capacity, continuing to acquire sample information through the policy network, otherwise, entering a reinforcement learning training stage, and updating parameters of the evaluation network through a multi-step learning method; randomly extracting K samples with track length of a multi-step return value L from the experience buffer pool, training the evaluation network, inputting a joint state space, a joint action space and rewards at a future moment, and outputting a corresponding target joint state action value; updating the parameters of the evaluation network by using a minimum loss function; updating the parameters of the strategy network by using a reverse gradient descent method; updating the parameters of the target strategy network and the target evaluation network by using a soft updating method;
and planning the path of the intelligent agent by using the trained network model in the path planning.
According to an embodiment of the present disclosure, the setting of the joint state space, the joint action space, and the joint reward of the agent includes:
constructing a Markov game model for multi-agent path planning, wherein the Markov model is described by a quintuple <N, S, A, P, R>, wherein N = {1, …, n} represents a set of a plurality of agents; S represents a joint state space; A represents a joint action space; R represents a joint reward; and P represents the probability of transitioning to the next state when all of the agents take a joint action in the current state.
According to an embodiment of the present disclosure, before constructing the markov game model for multi-agent path planning, the method further comprises:
acquiring initial environment information of a multi-agent system, wherein the initial environment information comprises the number of agents in the multi-agent system, the initial position coordinate of each agent, the position coordinate of a corresponding target point and the position coordinate of an obstacle;
converting an agent in the multi-agent system into a particle model, wherein the particle model comprises a plurality of particles corresponding to the plurality of agents, current coordinates of the particles correspond to current position coordinates of the agent, and endpoint coordinates of the particles correspond to position coordinates of a target point corresponding to the agent.
According to an embodiment of the present disclosure, the policy network is a fully connected neural network, and is configured to output an action that the agent needs to perform at a current time based on a state space corresponding to a current state input to the agent.
According to an embodiment of the present disclosure, the evaluation network is a fully connected neural network, and is configured to output a joint state action value of the agent at the current time based on the input joint state space vectors of all the agents and a joint action space obtained according to the policy networks of all the agents.
According to an embodiment of the present disclosure, the inputting the joint state space, the joint action space and the reward at the future time, and the outputting the corresponding target joint state action value includes:
and updating the target joint state action value of the intelligent agent at the current moment by utilizing a multi-step learning method according to the joint state space, the joint action space and the reward at the future moment.
According to the embodiment of the disclosure, updating the target joint state action value of the agent at the current moment by using a multi-step learning method according to the joint state space, the joint action space and the reward at the future moment comprises:
and inputting the reward value of the agent at each time from the time j to the time j + L, the joint state space at the time j + L, the joint action space and the discount coefficient, and outputting a target joint state action value at the current time, wherein the target joint state action value represents the sum of the accumulated reward in the L step length and the target joint state action value at the time j + L multiplied by the discount coefficient.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor.
Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the deep reinforcement learning based multi-agent path planning methods.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform any one of the deep reinforcement learning based multi-agent path planning methods described above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above deep reinforcement learning based multi-agent path planning method.
The invention provides a multi-agent path planning method based on deep reinforcement learning that combines a multi-agent deep deterministic policy with the idea of multi-step return. By using the multi-step return, the influence of future data on the current value is taken into account, so that the output value of the target evaluation network approaches the true value function; this reduces the deviation of the loss function, improves accuracy, accelerates convergence, shortens learning time, and quickly and efficiently plans an optimal path for the multi-agent system, thereby solving the path planning problem.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates a block diagram of a deep reinforcement learning based multi-agent system model MS-MADDPG according to an embodiment of the present invention;
FIG. 2 schematically illustrates a flow chart of a deep reinforcement learning based multi-agent path planning method of an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart for setting joint state space, joint action space and joint rewards for a multi-agent system according to an embodiment of the disclosure;
fig. 4 schematically shows a block diagram of an electronic device adapted to implement a deep reinforcement learning based multi-agent path planning method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.). Where a convention analogous to "A, B or at least one of C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).
With the continuous development and improvement of multi-agent technology, the information that a single agent can observe in a multi-agent system is limited, so traditional single-agent reinforcement learning algorithms lack generality. Multi-agent reinforcement learning algorithms increase the number of agents on the basis of single-agent reinforcement learning, introduce a joint state and a joint action, and adopt a centralized-training, distributed-execution strategy so that each agent can independently complete its objective; the DDPG algorithm, as a policy-based reinforcement learning algorithm, is used to guide an agent to search for an optimal path in an unknown environment.
In implementing the disclosed concept, the inventors found that there are at least the following problems in the related art: with the increase of the number of the agents and the expansion of the environment scale, the problems of low convergence speed, long training time and low path planning efficiency of a multi-agent system can occur.
To at least partially solve the technical problems in the related art, the present disclosure provides a deep reinforcement learning-based multi-agent path planning method, an electronic device, and a non-transitory computer-readable storage medium storing computer instructions, which can be applied to the technical field of artificial intelligence.
According to an embodiment of the present disclosure, the multi-agent path planning method based on deep reinforcement learning includes:
setting a joint state space and a joint action space of the multi-agent system; and designing a reward function aiming at the path planning problem to generate combined rewards.
Initializing a MS-MADDPG algorithm neural network, and initializing the structure and parameters of a network model in a path planning method, wherein the network model comprises a strategy network, an evaluation network, a target strategy network and a target evaluation network, the target strategy network and the strategy network have the same structure but different parameters, and the target evaluation network and the evaluation network have the same structure but different parameters.
And inputting the state space of each agent into the policy network, and outputting the action to be executed by each agent at the current moment after the state space of each agent is processed by a deterministic policy function.
Responding to the corresponding action executed by the intelligent agent, and respectively obtaining the state space and the corresponding reward value of each intelligent agent at the next moment; and combining the current joint action space, the current joint state space, the current joint reward and the joint state space at the next moment obtained through information exchange among different agents into a tuple as a sample to be stored in an experience buffer pool.
Updating the state space of each agent at the current moment; comparing the number of samples in the experience buffer pool with a preset capacity, if the number of samples is smaller than the preset capacity, continuing to acquire sample information through the policy network, otherwise, entering a reinforcement learning training stage, and updating parameters of the target network; randomly extracting K samples with track length of a multi-step return value L from the experience buffer pool, training the target evaluation network, inputting a joint state space, a joint action space and rewards at a future moment, and outputting a corresponding target joint state action value; updating parameters of the evaluation network by using a minimum loss function; updating parameters of the policy network by using a reverse gradient descent method; and updating the parameters of the target strategy network and the target evaluation network by using a soft updating method.
And planning the path of the intelligent agent by using the trained network model in the path planning.
FIG. 1 schematically shows a block diagram of a deep reinforcement learning based multi-agent system model MS-MADDPG according to an embodiment of the present invention.
As shown in fig. 1, the deep reinforcement learning-based multi-agent system model of this embodiment includes: an environment system, agents 1 to n, and an experience buffer pool.
Acquiring initial state information of the multi-agent system, and storing the initial state information into the environment system.
The agent 1 to the agent n may be respectively provided with a policy network, an evaluation network, a target policy network, and a target evaluation network, where the target policy network and the corresponding policy network have the same structure but different parameters, and the target evaluation network and the corresponding evaluation network have the same structure but different parameters.
Agents 1 to n obtain the corresponding state spaces s_1 to s_n from the environment system and input the corresponding state spaces into the corresponding policy networks; after processing by the deterministic policy functions, the actions a_1 to a_n to be executed by agents 1 to n at the current time are output.
In response to agents 1 to n executing the corresponding actions, the state spaces s'_1 to s'_n and the corresponding reward values r_1 to r_n of agents 1 to n at the next time are obtained; the current joint action space a, the current joint state space x, the current joint reward r, and the next-time joint state space x', obtained through information exchange among agents 1 to n, are combined into a tuple (x, a, r, x') that is stored in the experience buffer pool as a sample.
Agents 1 to n interact with the environment system and, according to the corresponding actions a_1 to a_n to be executed at the current time, the state spaces of agents 1 to n at the next time are updated.
And comparing the number of samples in the experience buffer pool with the preset capacity, if the number of samples is less than the preset capacity, continuously acquiring sample information through the strategy networks corresponding to the agents from the agent 1 to the agent n, and otherwise, entering a reinforcement learning training stage.
Updating parameters of the evaluation networks corresponding to agents 1 to n when reinforcement learning training is carried out; randomly extracting K batches of samples (x_{j:j+L}, a_{j:j+L}, r_{j:j+L}, x_{j:j+L}) with the track length of the multi-step return value L from the experience buffer pool, training the evaluation network, inputting the joint state space, the joint action space and the rewards at the future time, and outputting the corresponding target joint state action value; updating the parameters of the evaluation network by using a minimum loss function; updating the parameters of the strategy network by using a reverse gradient descent method; and updating the parameters of the target strategy network and the target evaluation network by using a soft updating method.
And planning the path of the intelligent agent by using the network model in the path planning obtained by training. Fig. 2 schematically illustrates a flowchart of a deep reinforcement learning-based multi-agent path planning method according to an embodiment of the present disclosure.
As shown in FIG. 2, the deep reinforcement learning-based multi-agent path planning method includes operations S210-S260.
In operation S210, a joint state space, a joint action space, and a joint bonus of the multi-agent system are set.
According to an embodiment of the present disclosure, setting a joint state space, a joint action space, and a joint reward of an agent includes:
constructing a Markov game model for multi-agent path planning, wherein the Markov model is described by a quintuple <N, S, A, P, R>, wherein N = {1, …, n} represents a set of a plurality of agents; S represents a joint state space; A represents a joint action space; R represents a joint reward; and P represents the probability of transitioning to the next state when all of the agents take a joint action in the current state.
Setting the joint state space, joint action space and joint rewards of a multi-agent system according to embodiments of the present disclosure also includes a pre-operation of constructing a markov game model for multi-agent path planning.
FIG. 3 schematically illustrates a flow chart for setting joint state space, joint action space and joint rewards for a multi-agent system according to an embodiment of the disclosure.
As shown in FIG. 3, setting the joint state space, joint action space and joint rewards of the multi-agent system includes operations S211-S213.
In operation S211, initial environment information of the multi-agent system is acquired.
According to an embodiment of the present disclosure, the initial environment information includes the number of agents in the multi-agent system, the start position coordinates of each agent and the position coordinates of the corresponding target point, the position coordinates of the obstacle;
More specifically, the environment-related information is initialized and a physical model is set for each agent and obstacle. Each agent i is set as a square area whose coordinates are denoted P_i(x, y); the target corresponding to each agent is set as a square area whose position coordinates are P_ig(x, y); each obstacle corresponds to a square area with position P_o(x, y); all obstacles are stationary, and the overall environment scale is a 10 × 10 square area.
In operation S212, the agents in the multi-agent system are converted into particle models.
More specifically, the particle model comprises a plurality of particles corresponding to the plurality of agents; the current coordinates of a particle correspond to the current position coordinates of the corresponding agent, and the endpoint coordinates of the particle correspond to the position coordinates of the target point corresponding to that agent, where the position coordinates of the target point are preset.
The experience buffer pool capacity M_o (for example, 50000), the number of batch samples K (for example, 256), and the multi-step return value L (for example, 5) are initialized. The maximum number of training iterations E (for example, 20000) and the maximum step length T of each training round (for example, 100) are set, and the current training time step t = 0 and the current training round e = 0 are initialized.
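For illustration, the example values mentioned above can be collected in a single configuration object. The following Python sketch is not part of the original disclosure; the field names and the number of agents are assumptions, and only the numeric values follow the examples given in this description (and the discount and soft-update coefficients given further below).

```python
# A minimal configuration sketch collecting the example hyperparameter values
# given in this description; field names and the number of agents are
# illustrative assumptions, not taken from the original disclosure.
from dataclasses import dataclass

@dataclass
class MSMADDPGConfig:
    n_agents: int = 3             # number of agents (assumed example value)
    buffer_capacity: int = 50000  # experience buffer pool capacity M_o
    batch_size: int = 256         # number of batch samples K
    multi_step: int = 5           # multi-step return value L
    max_episodes: int = 20000     # maximum number of training rounds E
    max_steps: int = 100          # maximum step length T of each round
    gamma: float = 0.99           # discount coefficient (given below)
    tau: float = 0.9              # soft-update coefficient (given below)

cfg = MSMADDPGConfig()
```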
In operation S213, a markov game model for multi-agent path planning is constructed.
According to the embodiment of the disclosure, the state space of each agent is vector information, and includes the position of the current agent, the relative positions of other agents to the current agent, and the relative positions of obstacles to the agent.
According to the embodiment of the disclosure, the action space of each agent represents the action that the agent can take with respect to the state space of the agent, and comprises 4 actions.
According to an embodiment of the present disclosure, the reward function of each agent represents a reward punishment value obtained after the corresponding action space is selected in the current state space, and because all agents cooperatively reach the target position while avoiding obstacles, the reward function of each agent is the same. The reward function of an agent may be represented by the following equation (1):
(Equation (1), the reward function of each agent, is rendered as an image in the original publication.)
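Since equation (1) itself is only available as an image, the following Python sketch shows one plausible shared reward of this kind (a dense distance term toward each agent's target plus a collision penalty near obstacles). It is an illustrative assumption, not the patent's equation (1); all thresholds and weights are hypothetical.

```python
# A hypothetical shared reward consistent with the stated goals (all agents
# cooperatively reach their targets while avoiding obstacles). This is NOT the
# patent's equation (1); thresholds and weights are illustrative.
import numpy as np

def shared_reward(agent_pos, target_pos, obstacle_pos, collide_dist=0.5):
    """Return one scalar reward shared by all agents.
    agent_pos, target_pos: arrays of shape (n, 2); obstacle_pos: shape (m, 2), m >= 1."""
    agent_pos = np.asarray(agent_pos, dtype=float)
    target_pos = np.asarray(target_pos, dtype=float)
    obstacle_pos = np.asarray(obstacle_pos, dtype=float)

    # Dense term: negative sum of distances from each agent to its own target.
    reward = -np.linalg.norm(agent_pos - target_pos, axis=1).sum()

    # Penalty term: fixed penalty whenever an agent comes too close to an obstacle.
    for p in agent_pos:
        if np.linalg.norm(obstacle_pos - p, axis=1).min() < collide_dist:
            reward -= 10.0
    return reward
```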
more specifically, the state space of each agent is input into the policy network, and after being processed by a deterministic policy function, the action to be executed by each agent at the current moment is output; responding to the corresponding action executed by the agents, and respectively obtaining the state space and the corresponding reward value of each agent at the next moment; and obtaining a current joint action space, a current joint state space, a current joint reward and a joint state space at the next moment through information exchange among different agents.
In operation S220, the structure and parameters of the network model in the path planning method are initialized; the network model includes a policy network, an evaluation network, a target policy network, and a target evaluation network.
According to the embodiment of the disclosure, the target policy network and the policy network have the same structure but different parameters, and the target evaluation network and the evaluation network have the same structure but different parameters.
More specifically, a policy network μ_i and its network parameters θ_i^μ are initialized for agent i. The policy network is a fully connected neural network comprising two hidden layers whose activation functions are ReLU functions; the first hidden layer has 64 nodes and the second hidden layer has 32 nodes; the output layer comprises one node and, using a Gumbel-Softmax activation function, outputs the action that agent i needs to execute at the current time.
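As an illustration, a PyTorch-style sketch of such a policy network is given below. Note that the text above states a single output node, whereas the sketch uses one logit per discrete action (four, matching the action space described earlier), which is the usual arrangement for a Gumbel-Softmax output; the class name and this choice are assumptions rather than the exact implementation of the disclosure.

```python
# A sketch of the described policy (actor) network: two hidden ReLU layers of
# 64 and 32 nodes, with a Gumbel-Softmax applied to the output logits. The
# class name and the one-logit-per-action output are illustrative assumptions.
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    def __init__(self, state_dim, action_dim=4):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 64)
        self.fc2 = nn.Linear(64, 32)
        self.out = nn.Linear(32, action_dim)

    def forward(self, state, hard=True):
        h = F.relu(self.fc1(state))
        h = F.relu(self.fc2(h))
        logits = self.out(h)
        # Gumbel-Softmax yields a differentiable (near-)one-hot action vector.
        return F.gumbel_softmax(logits, tau=1.0, hard=hard)
```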
An evaluation network Q_i of agent i and its network parameters θ_i^Q are likewise initialized. The evaluation network is a fully connected neural network comprising two hidden layers with 64 nodes each; the activation function of the first hidden layer is a sigmoid function and that of the second hidden layer is a ReLU function; the output layer comprises one node and, using a linear activation function, outputs the state action value at the current time.
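A corresponding sketch of the evaluation (critic) network is given below; it takes the concatenated joint state and joint action of all agents and returns a single joint state action value. Class and argument names, and the concatenation scheme, are illustrative assumptions.

```python
# A sketch of the described evaluation (critic) network: two hidden layers of
# 64 nodes (sigmoid, then ReLU) and a single linear output node giving the
# joint state-action value. Names and the concatenation scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CriticNet(nn.Module):
    def __init__(self, joint_state_dim, joint_action_dim):
        super().__init__()
        self.fc1 = nn.Linear(joint_state_dim + joint_action_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.out = nn.Linear(64, 1)  # linear activation on the output layer

    def forward(self, joint_state, joint_action):
        x = torch.cat([joint_state, joint_action], dim=-1)
        h = torch.sigmoid(self.fc1(x))
        h = F.relu(self.fc2(h))
        return self.out(h)
```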
The parameters of the policy network are copied into the corresponding target policy network; the copying process can be expressed by the following formula (2):

$\theta^{\mu} \rightarrow \theta^{\mu'};$  (2)

where θ^{μ'} denotes the parameters of the target policy network.
The parameters of the evaluation network are copied into the corresponding target evaluation network; the copying process can be expressed by the following formula (3):

$\theta^{Q} \rightarrow \theta^{Q'};$  (3)

where θ^{Q'} denotes the parameters of the target evaluation network.
In operation S230, the state space of each agent is input into the policy network, and after being processed by the deterministic policy function, the action that needs to be executed by each agent at the current time is output.
According to the embodiment of the present disclosure, outputting the action that each agent needs to perform at the current moment may be represented by the following formula (4):
$a_i = \mu_i\left(s_i \mid \theta_i^{\mu}\right) + \varepsilon;$  (4)

where a_i is the action that agent i needs to execute at the current time, μ_i(·) is the deterministic policy function, s_i is the state space of agent i, θ_i^μ denotes the parameters of the policy network corresponding to agent i, and ε is the noise of the policy network.
In operation S240, in response to the agents performing the corresponding actions, respectively obtaining a state space and a corresponding reward value of each agent at a next time; and combining the current joint action space, the current joint state space, the current joint reward and the joint state space at the next moment obtained through information exchange among different agents into a tuple as a sample to be stored in an experience buffer pool.
According to the embodiment of the disclosure, through information exchange among different agents: according to the action a_i to be executed at the current time, agent i interacts with the environment to obtain its state space s'_i at the next time and a feedback reward value r_i; according to the current-time state spaces s_i of all agents, the current-time joint state space x = (s_1, …, s_n) is obtained; according to the next-time state spaces s'_i of all agents, the next-time joint state space x' = (s'_1, …, s'_n) is obtained; according to the actions a_i that all agents need to execute at the current time, the current-time joint action set a = (a_1, …, a_n) is obtained, together with the current-time joint reward set r = (r_1, …, r_n); these are combined into a tuple (x, a, r, x') that is stored in the experience buffer pool D as a sample.
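For illustration, the experience buffer pool D can be realized as a simple bounded queue of (x, a, r, x') tuples, as in the sketch below; the class and method names are assumptions.

```python
# A minimal experience buffer pool sketch: each stored sample is the tuple
# (x, a, r, x') of joint state, joint action, joint reward and next joint
# state described above. Names and the deque-based storage are assumptions.
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)  # drops the oldest sample when full

    def push(self, x, a, r, x_next):
        self.buffer.append((x, a, r, x_next))

    def __len__(self):
        return len(self.buffer)
```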
In operation S250, updating a state space of each agent at the current time; and comparing the number of samples in the experience buffer pool with a preset capacity, if the number of samples is smaller than the preset capacity, continuing sampling, and acquiring sample information through a policy network, otherwise, entering a reinforcement learning training stage, and updating parameters of the evaluation network through a multi-step learning method.
According to an embodiment of the present disclosure, updating the state space of each agent at the current time may be represented by the following formula (5):
x' → x;  (5)

where x is the current-time joint state space of the agents and x' is the next-time joint state space of the agents; after each agent interacts with the environment system, the state space at the next time is assigned to that agent as its updated current-time state space.
According to the embodiment of the disclosure, the current number of samples M in the experience buffer pool is compared with the preset capacity M_o: if M < M_o, operations S230-S250 continue to be executed; if M ≥ M_o, the reinforcement learning training stage is entered to update the parameters of the target network.
More specifically, according to an embodiment of the present disclosure, the preset capacity of the experience buffer pool may be set to 50000.
In accordance with an embodiment of the present disclosure, K batches of samples (x_{j:j+L}, a_{j:j+L}, r_{j:j+L}, x_{j:j+L}) are randomly drawn from the experience buffer pool D, where each batch has the track length of the multi-step return value L, and the evaluation network is trained.
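One way to realize this sampling, assuming the ReplayBuffer sketch above and ignoring episode boundaries for brevity, is shown below; it draws K random starting indices j and returns contiguous slices of length L.

```python
# A sketch of the multi-step sampling: K random start indices j are drawn and
# each yields a contiguous trajectory of L consecutive (x, a, r, x') samples,
# i.e. (x_{j:j+L}, a_{j:j+L}, r_{j:j+L}, x_{j:j+L}). Episode boundaries are
# ignored here for brevity; this is an illustrative simplification.
import random

def sample_trajectories(buffer, k=256, l_steps=5):
    max_start = len(buffer) - l_steps          # requires len(buffer) >= l_steps
    starts = [random.randint(0, max_start) for _ in range(k)]
    return [[buffer.buffer[j + t] for t in range(l_steps)] for j in starts]
```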
According to an embodiment of the present disclosure, inputting a joint state space, a joint action space, and rewards in all sample times at a future time, outputting a corresponding target joint state action value comprises:
and updating the target joint state action value of the intelligent agent at the current moment by using a multi-step learning method according to the joint state space and the joint action space at the future moment and the rewards in all sample moments.
More specifically, the reward value of the agent at each time from time j to time j + L, the state space and action space at time j + L, and the discount coefficient are input, and the target joint state action value at the current time is output, where the target joint state action value represents the sum of the cumulative reward over the L steps and the target joint state action value at time j + L multiplied by the discount coefficient; the multi-step return value L may be set to 5, for example.
More specifically, K batches of samples with a track length of a multi-step return value L are randomly extracted from the experience buffer pool D, the number of batch sampling samples K can be set to 256, the evaluation network is trained, the joint state space and joint action space at the future time and rewards in all sample times are input, and the corresponding target joint state action value is output.
The output target joint state action value may be represented by the following equation (6):

$y = \sum_{k=0}^{L-1} \gamma^{k} r_{j+k} + \gamma^{L}\, Q_i'\left(x_{j+L}, a'_{1}, \ldots, a'_{n}\right)\Big|_{a'_{i}=\mu'_{i}\left(s_{i,\,j+L}\right)};$  (6)

where y is the target joint state action value output by the target evaluation network at the current time, equal to the sum of the cumulative reward over the L steps and the target joint state action value at time j + L multiplied by the discount coefficient; γ is the discount coefficient, set to 0.99; r_{j+k} is the agent reward value at time j + k; μ' = (μ'_1, …, μ'_n) is the set of target policy networks; Q_i'(·) is the target evaluation network, whose inputs are the joint state space at time j + L and the joint action obtained from the target policy set μ' at time j + L, and whose output is the corresponding state action value; and a'_i is the action of agent i given by the target policy set.
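A sketch of this target computation, assuming the PolicyNet/CriticNet sketches above, is given below; tensor shapes, batching and helper names are assumptions, and the computation is shown for a single trajectory for clarity.

```python
# Equation (6) as a sketch: the L-step discounted reward sum plus gamma^L times
# the target critic's value at time j+L, with the joint action at j+L taken
# from the target policy networks. Shapes and names are illustrative.
import torch

def multi_step_target(rewards, next_states, target_actors, target_critic, gamma=0.99):
    """rewards: list of L joint-reward scalars r_j ... r_{j+L-1};
    next_states: list of per-agent 1-D state tensors s_{i, j+L}."""
    y = sum((gamma ** k) * r for k, r in enumerate(rewards))
    with torch.no_grad():
        actions = [mu(s.unsqueeze(0)) for mu, s in zip(target_actors, next_states)]
        joint_state = torch.cat(next_states).unsqueeze(0)   # (1, sum of state dims)
        joint_action = torch.cat(actions, dim=-1)           # (1, sum of action dims)
        q_next = target_critic(joint_state, joint_action).item()
    return y + (gamma ** len(rewards)) * q_next
```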
According to an embodiment of the present disclosure, updating the parameters of the evaluation network using the minimum loss function may be represented by the following formula (7):

$L\left(\theta_{i}^{Q}\right) = \frac{1}{K} \sum_{j}\left(y^{j} - Q_{i}\left(x^{j}, a_{1}^{j}, \ldots, a_{n}^{j}\right)\right)^{2};$  (7)

where L is the loss function; the summation runs over all of the processed sampled samples; Q_i(x^j, a_1^j, …, a_n^j) is the joint state action value output by the evaluation network; K is the number of batch sampling samples; and y is the target joint state action value at the current time.
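The corresponding critic update can be sketched as follows, assuming the targets y have already been computed with the multi-step function above and stacked into a tensor; optimizer setup and batching details are assumptions.

```python
# Equation (7) as a sketch: mean-squared error between the precomputed targets
# y and the critic's outputs over the K sampled batches, minimized by one
# gradient step. Argument shapes and names are illustrative.
import torch.nn.functional as F

def update_critic(critic, critic_optim, joint_states, joint_actions, targets):
    """joint_states: (K, joint_state_dim); joint_actions: (K, joint_action_dim);
    targets: (K, 1) tensor of multi-step targets y."""
    q = critic(joint_states, joint_actions)
    loss = F.mse_loss(q, targets)   # L = (1/K) * sum_j (y_j - Q_j)^2
    critic_optim.zero_grad()
    loss.backward()
    critic_optim.step()
    return loss.item()
```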
According to the embodiment of the disclosure, the evaluation network is a fully-connected neural network and is used for outputting the joint state action value of the agent at the current moment based on the input joint state space vectors of all agents and the joint action space obtained according to the respective strategy networks of all agents.
More specifically, the evaluation network is composed of an input layer, hidden layers and an output layer; it takes as input the joint state space x = (s_1, …, s_n) of all agents and the joint action space a = (a_1, …, a_n) that all agents need to execute at the current time, obtained from the corresponding policy networks, and outputs the joint state action value Q_i(x, a_1, …, a_n) of the agents.
According to an embodiment of the present disclosure, updating parameters of a policy network using an inverse gradient descent method may be represented by the following equation (8):
$\nabla_{\theta_{i}^{\mu}} J \approx \frac{1}{K} \sum_{j} \nabla_{\theta_{i}^{\mu}} \mu_{i}\left(s_{i}^{j}\right) \nabla_{a_{i}} Q_{i}\left(x^{j}, a_{1}^{j}, \ldots, a_{i}, \ldots, a_{n}^{j}\right)\Big|_{a_{i}=\mu_{i}\left(s_{i}^{j}\right)};$  (8)

where ∇_{θ_i^μ} J is the gradient of the loss function; ∇_{θ_i^μ} μ_i and ∇_{a_i} Q_i are respectively the gradient of the policy function and the gradient of the state action value function; x is the current-time joint state space of all agents; and Q_i(x, a_1, …, a_n) is the joint state action value of agent i output by the evaluation network when the actions and joint state information of all agents are input.
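A sketch of this policy update is given below: agent i's entry in the sampled joint action is replaced by the differentiable output of its current policy, and the critic's value is ascended by minimizing its negative. The index bookkeeping and names are assumptions.

```python
# Equation (8) as a sketch: the sampled joint action has agent i's slot
# replaced by mu_i(s_i), and -Q is used as the loss so that a gradient step
# ascends the joint state-action value. Layout assumption: actions of the n
# agents are concatenated in fixed order, each of width action_dim.
import torch

def update_actor(i, actor, actor_optim, critic, states_i, joint_states,
                 joint_actions, action_dim=4):
    """states_i: (K, state_dim) observations of agent i;
    joint_actions: (K, n_agents * action_dim) sampled joint actions."""
    parts = list(torch.split(joint_actions, action_dim, dim=1))
    parts[i] = actor(states_i)                  # differentiable action of agent i
    a = torch.cat(parts, dim=1)
    loss = -critic(joint_states, a).mean()      # ascend Q by descending -Q
    actor_optim.zero_grad()
    loss.backward()
    actor_optim.step()
    return loss.item()
```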
According to an embodiment of the present disclosure, updating parameters of the target policy network and the target evaluation network using the soft update method may be represented by the following formula (9):
$\theta_{i}^{\mu'} \leftarrow \tau\, \theta_{i}^{\mu} + (1-\tau)\, \theta_{i}^{\mu'}, \qquad \theta_{i}^{Q'} \leftarrow \tau\, \theta_{i}^{Q} + (1-\tau)\, \theta_{i}^{Q'};$  (9)

where τ is the soft update coefficient, which may be set to 0.9, for example; the target policy network parameter θ_i^{μ'} on the right-hand side of each assignment denotes the target policy network parameter before the soft update, and the θ_i^{μ'} on the left-hand side denotes the target policy network parameter after the soft update; likewise, the target evaluation network parameter θ_i^{Q'} on the right-hand side denotes the target evaluation network parameter before the soft update, and the θ_i^{Q'} on the left-hand side denotes the target evaluation network parameter after the soft update.
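This soft update can be sketched in a few lines, applied to both the target policy networks and the target evaluation networks; the function name is an assumption and τ follows the example value given above.

```python
# Equation (9) as a sketch: theta' <- tau * theta + (1 - tau) * theta' for
# every parameter tensor of a target network and its online counterpart.
import torch

@torch.no_grad()
def soft_update(target_net, net, tau=0.9):
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)
```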
Whether the training is finished is then judged: the current training time step is increased by 1 and recorded as t; if t < T, operation S230 and the subsequent training operations continue to be executed to collect data for the multi-agent system; if t ≥ T, the current training round e is increased by 1; if e < E, the environment information of the multi-agent system is re-initialized and operation S230 and the subsequent training operations continue to be executed; otherwise, the training is finished and the trained network parameters are saved.
In operation S260, a path of the multi-agent system is planned using the network model in the trained path plan.
According to the embodiment of the disclosure, the multi-agent path planning method based on deep reinforcement learning combines the multi-agent deep deterministic policy with the idea of multi-step return and uses the future multi-step state action function to update the state action function at the current time, so that the evaluation network output value approaches the target evaluation network output value. This reduces the deviation of the loss function, takes into account the influence of future data on the current state action value, improves the accuracy of the trained neural network model, increases the training speed, reduces the training time, and quickly solves the path planning problem in a multi-agent system.
According to the embodiment of the disclosure, the intelligent agent is subjected to path planning by the trained network model, and the intelligent agent selects a proper action according to the trained network model to complete a path planning task.
According to the embodiment of the disclosure, each intelligent agent in the multi-intelligent-agent system can autonomously avoid the obstacle and smoothly reach the respective target position in the complex environment, and the multi-intelligent-agent path planning method based on deep reinforcement learning is utilized, so that the time for training path planning is shortened, and the flexibility and the accuracy of path planning are ensured.
According to the embodiment of the disclosure, the path planning problem in the multi-agent system can be quickly realized, and a foundation is laid for the large-scale multi-agent system to execute tasks. Compared with the original MADDPG algorithm, the method has a better network parameter updating mode, updates the state action function at the current moment by using the state action function at the future multi-step moment, solves the problem that the difference value between the evaluation network output and the target evaluation network output is large, namely the loss function error is large, and can further improve the accuracy of the trained neural network model, thereby further improving the practical application value. The reinforcement learning method based on multi-step learning shortens the training time and has better training effect.
Fig. 4 schematically shows a block diagram of an electronic device adapted to implement the deep reinforcement learning based multi-agent path planning method according to an embodiment of the present disclosure. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 4, the computer electronic device 400 according to the embodiment of the present disclosure includes a processor 401 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. Processor 401 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 401 may also include onboard memory for caching purposes. Processor 401 may include a single processing unit or multiple processing units for performing the different actions of the method flows in accordance with embodiments of the present disclosure.
In the RAM 403, various programs and data necessary for the operation of the electronic apparatus 400 are stored. The processor 401, ROM 402 and RAM 403 are connected to each other by a bus 404. The processor 401 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 402 and/or the RAM 403. Note that the programs may also be stored in one or more memories other than the ROM 402 and RAM 403. The processor 401 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, electronic device 400 may also include an input/output (I/O) interface 405, input/output (I/O) interface 405 also being connected to bus 404. Electronic device 400 may also include one or more of the following components connected to I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 410 as necessary, so that a computer program read out therefrom is installed into the storage section 408 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. The computer program, when executed by the processor 401, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, a computer-readable storage medium may include ROM402 and/or RAM403 and/or one or more memories other than ROM402 and RAM403 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method provided by the embodiments of the present disclosure; when the computer program product is run on an electronic device, the program code is adapted to cause the electronic device to carry out the deep reinforcement learning based multi-agent path planning method provided by the embodiments of the present disclosure.
The computer program, when executed by the processor 401, performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal on a network medium, downloaded and installed through the communication section 409, and/or installed from the removable medium 411. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, these computer programs may be implemented using high level procedural and/or object oriented programming languages, and/or assembly/machine languages. The programming language includes, but is not limited to, programming languages such as Java, C++, Python, the "C" language, or the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. A multi-agent path planning method based on deep reinforcement learning comprises the following steps:
setting a joint state space and a joint action space of the multi-agent system; designing a reward function aiming at the path planning problem to generate a combined reward;
initializing the structure and parameters of a network model in a path planning method, wherein the network model comprises a strategy network, an evaluation network, a target strategy network and a target evaluation network, the target strategy network and the strategy network have the same structure but different parameters, and the target evaluation network and the evaluation network have the same structure but different parameters;
inputting the state space of each agent into the policy network, and outputting the action to be executed by each agent at the current moment after the state space is processed by a deterministic policy function;
responding to the corresponding action executed by the intelligent agent, and respectively obtaining the state space and the corresponding reward value of each intelligent agent at the next moment; combining the current joint action space, the current joint state space, the current joint reward and the joint state space at the next moment obtained through information exchange among different agents into a tuple as a sample to be stored in an experience buffer pool;
updating the state space of each agent at the current moment; comparing the number of samples in the experience buffer pool with a preset capacity, if the number of samples is smaller than the preset capacity, continuing to sample, and acquiring sample information through the strategy network, otherwise, entering a reinforcement learning training stage, and updating parameters of the evaluation network through a multi-step return method; randomly extracting K batches of samples with the track length of a multi-step return value L from the experience buffer pool, training the evaluation network, inputting a joint state space, a joint action space and rewards at a future moment, and outputting a corresponding target joint state action value; updating parameters of the evaluation network by using a minimum loss function; updating parameters of the policy network by using a reverse gradient descent method; updating parameters of the target strategy network and the target evaluation network by using a soft updating method;
and planning the path of the intelligent agent by using the trained network model in the path planning.
2. The method of claim 1, wherein setting the joint state space, the joint action space and the joint reward of the multi-agent system comprises:
constructing a Markov game model for multi-agent path planning, wherein the Markov game model is described by a five-tuple <N, S, A, P, R>, in which N = {1, ..., n} represents the set of the n agents; S represents the joint state space; A represents the joint action space; R represents the joint reward; and P represents the probability of transitioning from the current state to the next state when all of the agents take a joint action.
3. The method of claim 1, wherein prior to constructing a Markov game model for multi-agent path planning, the method further comprises:
acquiring initial environment information of a multi-agent system, wherein the initial environment information comprises the number of agents in the multi-agent system, the initial position coordinate of each agent, the position coordinate of a corresponding target point and the position coordinate of an obstacle;
converting the agents in the multi-agent system into a particle model, wherein the particle model comprises a plurality of particles corresponding to the plurality of agents, the current coordinates of each particle correspond to the current position coordinates of the corresponding agent, and the endpoint coordinates of each particle correspond to the position coordinates of the target point corresponding to that agent.
4. The method of claim 1, wherein the policy network is a fully-connected neural network for outputting the action that the agent needs to perform at the current time based on the input state space of the agent at the current time.
5. The method according to claim 1, wherein the evaluation network is a fully-connected neural network configured to output the joint state-action value of the agents at the current time based on the input joint state space vectors of all agents and the joint action space obtained from their respective policy networks.
6. The method of claim 1, wherein inputting the joint state space, the joint action space and the rewards at future moments, and outputting the corresponding target joint state-action value comprises:
updating the target joint state-action value of the agents at the current moment by using a multi-step learning method according to the joint state space, the joint action space and the rewards at future moments.
7. The method of claim 6, wherein updating the target joint state-action value of the agents at the current moment by using a multi-step learning method according to the joint state space, the joint action space and the rewards at future moments comprises:
inputting the reward value of the agents at each time from time j to time j + L, the joint state space at time j + L, the joint action space, and the discount coefficient, and outputting the target joint state-action value at the current time, wherein the target joint state-action value represents the sum of the reward accumulated over the L steps and the target joint state-action value at time j + L multiplied by the discount coefficient.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
9. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
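The Markov game five-tuple of claim 2 and the particle model of claim 3 can be pictured with the following minimal Python sketch. It is an editorial illustration only: the names Particle, MarkovGame and make_particles, the 2-D coordinates, and the dimension arithmetic are assumptions and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Particle:
    """Point-mass stand-in for one agent in the particle model (claim 3)."""
    current: Tuple[float, float]  # current position coordinates of the agent
    goal: Tuple[float, float]     # position coordinates of its target point


@dataclass
class MarkovGame:
    """Five-tuple <N, S, A, P, R> of the multi-agent path-planning game (claim 2)."""
    agents: List[int]                      # N = {1, ..., n}
    joint_state_dim: int                   # dimensionality of the joint state space S
    joint_action_dim: int                  # dimensionality of the joint action space A
    transition: Optional[Callable] = None  # P: joint transition probability (often implicit in a simulator)
    reward: Optional[Callable] = None      # R: joint reward function


def make_particles(starts: List[Tuple[float, float]],
                   goals: List[Tuple[float, float]]) -> List[Particle]:
    """Convert the agents of the multi-agent system into the particle model."""
    return [Particle(current=s, goal=g) for s, g in zip(starts, goals)]


if __name__ == "__main__":
    particles = make_particles([(0.0, 0.0), (1.0, 0.0)], [(5.0, 5.0), (4.0, 6.0)])
    game = MarkovGame(agents=list(range(len(particles))),
                      joint_state_dim=4 * len(particles),
                      joint_action_dim=2 * len(particles))
    print(game.agents, particles[0])
```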
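For the fully-connected policy network of claim 4 and the evaluation network of claim 5, a minimal sketch in Python with PyTorch might look as follows. The class names, layer widths and activations are assumptions chosen for illustration; the patent does not specify them.

```python
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Maps one agent's own state to the action it should perform at the current time."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # deterministic policy output
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class EvaluationNet(nn.Module):
    """Maps the joint state and joint action of all agents to a joint state-action value."""
    def __init__(self, joint_state_dim: int, joint_action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar joint state-action value
        )

    def forward(self, joint_state: torch.Tensor, joint_action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([joint_state, joint_action], dim=-1))


if __name__ == "__main__":
    n_agents, state_dim, action_dim = 3, 4, 2
    actor = PolicyNet(state_dim, action_dim)
    critic = EvaluationNet(n_agents * state_dim, n_agents * action_dim)
    s = torch.randn(8, state_dim)                    # batch of one agent's states
    joint_s = torch.randn(8, n_agents * state_dim)   # batch of joint states
    joint_a = torch.randn(8, n_agents * action_dim)  # batch of joint actions
    print(actor(s).shape, critic(joint_s, joint_a).shape)
```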
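The multi-step target of claims 6 and 7, the reward accumulated over L steps plus the discounted target joint state-action value at time j + L, is commonly written as y_j = sum_{k=0}^{L-1} gamma^k * r_{j+k} + gamma^L * Q'(s_{j+L}, a_{j+L}). The Python sketch below computes that quantity; whether the rewards inside the L-step window are themselves discounted is an assumption here, since the claim only states that they are accumulated.

```python
from typing import Sequence


def multi_step_target(rewards: Sequence[float], q_target_at_jL: float, gamma: float) -> float:
    """L-step return target.

    rewards:        r_j, ..., r_{j+L-1}, the reward at each time from j to j+L-1
    q_target_at_jL: target joint state-action value at time j+L
    gamma:          discount coefficient
    """
    L = len(rewards)
    accumulated = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return accumulated + (gamma ** L) * q_target_at_jL


if __name__ == "__main__":
    # Three-step trajectory segment with a bootstrap value of 10.0 at time j+L.
    print(multi_step_target([1.0, 0.5, -0.2], q_target_at_jL=10.0, gamma=0.95))
```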
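One training iteration of the kind summarized in claim 1, minimizing the evaluation-network loss against a multi-step target, updating the policy network by gradient descent on the negated joint state-action value, and soft-updating the target networks, could be sketched in Python with PyTorch as below. The network sizes, batch, optimizer, tau, and the random stand-in data are placeholders; the targets y are assumed to have been precomputed by the multi-step return method of claims 6-7.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, action_dim, n_agents = 4, 2, 3
joint_state_dim, joint_action_dim = n_agents * state_dim, n_agents * action_dim

# Tiny stand-ins for the policy and evaluation networks and their target copies.
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(joint_state_dim + joint_action_dim, 32), nn.ReLU(), nn.Linear(32, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Fabricated mini-batch standing in for the K batches of L-step samples drawn from
# the experience buffer pool; y holds the precomputed multi-step targets.
batch = 16
joint_s = torch.randn(batch, joint_state_dim)
joint_a = torch.randn(batch, joint_action_dim)
y = torch.randn(batch, 1)

# 1) Evaluation-network update: minimize the squared error against the target value.
critic_loss = nn.functional.mse_loss(critic(torch.cat([joint_s, joint_a], dim=-1)), y)
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# 2) Policy-network update: gradient descent on the negated joint state-action value,
#    replacing only the first agent's slice of the joint action with its policy output.
own_s = joint_s[:, :state_dim]
new_joint_a = torch.cat([actor(own_s), joint_a[:, action_dim:]], dim=-1)
actor_loss = -critic(torch.cat([joint_s, new_joint_a], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# 3) Soft update of the target policy and target evaluation networks.
tau = 0.01
for net, target in ((actor, target_actor), (critic, target_critic)):
    for p, tp in zip(net.parameters(), target.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)

print(float(critic_loss), float(actor_loss))
```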
CN202210490010.2A 2022-04-29 Multi-agent path planning method based on deep reinforcement learning Active CN114815840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210490010.2A CN114815840B (en) 2022-04-29 Multi-agent path planning method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210490010.2A CN114815840B (en) 2022-04-29 Multi-agent path planning method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114815840A (en) 2022-07-29
CN114815840B (en) 2024-06-28

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210089910A1 (en) * 2019-09-25 2021-03-25 Deepmind Technologies Limited Reinforcement learning using meta-learned intrinsic rewards
WO2022007179A1 (en) * 2020-07-10 2022-01-13 歌尔股份有限公司 Multi-agv motion planning method, apparatus, and system
CN113341958A (en) * 2021-05-21 2021-09-03 西北工业大学 Multi-agent reinforcement learning movement planning method with mixed experience
CN113485380A (en) * 2021-08-20 2021-10-08 广东工业大学 AGV path planning method and system based on reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
许诺; 杨振伟: "Multi-Agent Cooperation Based on the MADDPG Algorithm under Sparse Rewards", 现代计算机 (Modern Computer), no. 15, 25 May 2020 (2020-05-25), pages 48-52 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496208B (en) * 2022-11-15 2023-04-18 清华大学 Cooperative mode diversified and guided unsupervised multi-agent reinforcement learning method
CN117406706A (en) * 2023-08-11 2024-01-16 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning
CN117406706B (en) * 2023-08-11 2024-04-09 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning
CN117114937A (en) * 2023-09-07 2023-11-24 深圳市真实智元科技有限公司 Method and device for generating exercise song based on artificial intelligence

Similar Documents

Publication Publication Date Title
Ashraf et al. Optimizing hyperparameters of deep reinforcement learning for autonomous driving based on whale optimization algorithm
KR102242516B1 (en) Train machine learning models on multiple machine learning tasks
JP6728496B2 (en) Environmental navigation using reinforcement learning
CN110114784B (en) Recursive environment predictor and method therefor
US11521056B2 (en) System and methods for intrinsic reward reinforcement learning
US10860927B2 (en) Stacked convolutional long short-term memory for model-free reinforcement learning
JP2021185492A (en) Enhanced learning including auxiliary task
KR20200110400A (en) Learning data augmentation policy
CN110692066A (en) Selecting actions using multimodal input
CN110646009A (en) DQN-based vehicle automatic driving path planning method and device
JP7284277B2 (en) Action selection using the interaction history graph
US20210158162A1 (en) Training reinforcement learning agents to learn farsighted behaviors by predicting in latent space
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
JP2020508524A (en) Action Selection for Reinforcement Learning Using Neural Networks
US11454978B2 (en) Systems and methods for improving generalization in visual navigation
JP7139524B2 (en) Control agents over long timescales using time value transfer
US20220036186A1 (en) Accelerated deep reinforcement learning of agent control policies
US12005580B2 (en) Method and device for controlling a robot
JP7354460B2 (en) Learning environment representation for agent control using bootstrapped latency predictions
CN114519433A (en) Multi-agent reinforcement learning and strategy execution method and computer equipment
CN116300977B (en) Articulated vehicle track tracking control method and device based on reinforcement learning
CN114815840B (en) Multi-agent path planning method based on deep reinforcement learning
CN114815840A (en) Multi-agent path planning method based on deep reinforcement learning
JP2022189799A (en) Demonstration-conditioned reinforcement learning for few-shot imitation
CN113158539A (en) Method for long-term trajectory prediction of traffic participants

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant