CN114815840B - Multi-agent path planning method based on deep reinforcement learning - Google Patents

Multi-agent path planning method based on deep reinforcement learning

Info

Publication number
CN114815840B
CN114815840B
Authority
CN
China
Prior art keywords
joint
agent
network
target
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210490010.2A
Other languages
Chinese (zh)
Other versions
CN114815840A (en)
Inventor
郑煜明
陈松
鲁华祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210490010.2A
Publication of CN114815840A
Application granted
Publication of CN114815840B
Legal status: Active (current)
Anticipated expiration


Abstract

The present disclosure provides a multi-agent path planning method based on deep reinforcement learning, comprising: setting a joint state space and a joint action space of the multi-agent system; designing a reward function for the path planning problem to generate a joint reward; initializing the structure and parameters of the network model used in the path planning method; combining the current joint action space, the current joint state space, the current joint reward, and the joint state space at the next moment, obtained through information exchange among different agents, into a tuple and storing it as a sample in an experience buffer pool; updating the state space of each agent at the current moment; comparing the number of samples in the experience buffer pool with a preset capacity and, if the number of samples is smaller than the preset capacity, continuing to sample, otherwise entering the reinforcement learning training stage and updating the parameters of the network model by a multi-step return method; and performing multi-agent path planning with the trained network model.

Description

Multi-agent path planning method based on deep reinforcement learning
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to a multi-agent path planning method based on deep reinforcement learning, an electronic device, a non-transitory computer-readable storage medium storing computer instructions, and a computer program product.
Background
With the continuous development and improvement of multi-agent technology, path planning has become an important research topic as an effective means of improving the survivability and application value of multi-agent systems. The purpose of path planning is to plan an optimal path from the current location to the target location, under hardware constraints, in an environment where obstacles exist. Path planning methods mainly comprise traditional planning methods and intelligent planning methods. Traditional planning algorithms include dynamic search-based algorithms, sampling-based algorithms, and the like, such as the A* algorithm, the artificial potential field method, Dijkstra's algorithm, and particle swarm optimization (PSO); intelligent planning algorithms include reinforcement learning algorithms, such as Q-learning and the State-Action-Reward-State-Action (Sarsa) algorithm. Reinforcement learning is one of the important methods in current machine learning: its basic idea is that an agent continuously interacts with the environment to obtain corresponding rewards, feeds its own information back to the environment, and repeats this cycle, completing the path planning process through self-learning after accumulating a large amount of experience.
As the environment scale grows, these algorithms suffer from heavy computation, high cost, long running time, and poor adaptability to different environments, and easily fall into local optima in complex environments.
Disclosure of Invention
In view of this, the present disclosure provides a multi-agent path planning method based on deep reinforcement learning, an electronic device, a non-transitory computer-readable storage medium storing computer instructions, and a computer program product.
According to a first aspect of the present disclosure, there is provided a multi-agent path planning method based on deep reinforcement learning, including:
Setting a joint state space and a joint action space of the multi-agent system; designing a reward function aiming at the path planning problem to generate a combined reward;
initializing the structure and parameters of a network model in the path planning method, wherein the network model comprises a strategy network, an evaluation network, a target strategy network, and a target evaluation network; the target strategy network has the same structure as the strategy network but different parameters, and the target evaluation network has the same structure as the evaluation network but different parameters;
Inputting the state space of each agent into the strategy network and, after processing by a deterministic strategy function, outputting the action that each agent needs to execute at the current moment;
In response to the agents executing the corresponding actions, respectively obtaining the state space and the corresponding reward value of each agent at the next moment; combining the current joint action space, the current joint state space, the current joint reward, and the joint state space at the next moment, obtained through information exchange among different agents, into a tuple and storing it as a sample in an experience buffer pool;
Updating the state space of each agent at the current moment; comparing the number of samples in the experience buffer pool with a preset capacity; if the number of samples is smaller than the preset capacity, continuing to acquire sample information through the strategy network, otherwise entering the reinforcement learning training stage and updating the parameters of the evaluation network by the multi-step return method; randomly extracting from the experience buffer pool K batches of samples whose trajectory length equals the multi-step return value L, and training the evaluation network by inputting the joint state space, the joint action space, and the rewards at future moments and outputting the corresponding target joint state-action values; updating the parameters of the evaluation network by minimizing a loss function; updating the parameters of the strategy network by a reverse gradient descent method; updating the parameters of the target strategy network and the target evaluation network by a soft update method;
And planning paths for the agents using the trained network model.
According to an embodiment of the present disclosure, setting the joint state space, the joint action space, and the joint reward of the multi-agent system includes:
Constructing a Markov game model for multi-agent path planning, wherein the Markov game model is described by a five-tuple <N, S, A, P, R>, where N = {1, ..., n} represents the set of agents; S represents the joint state space; A represents the joint action space; R represents the joint reward; and P represents the probability that all the agents, taking a joint action in the current state, reach the next state.
According to an embodiment of the present disclosure, before constructing the Markov game model for multi-agent path planning, the above method further comprises:
Acquiring initial environment information of a multi-agent system, wherein the initial environment information comprises the number of agents in the multi-agent system, the initial position coordinate of each agent, the position coordinate of a corresponding target point and the position coordinate of an obstacle;
and converting the intelligent agent in the multi-intelligent agent system into a particle model, wherein the particle model comprises a plurality of particles corresponding to a plurality of intelligent agents, the current coordinates of the particles correspond to the current position coordinates of the intelligent agents, and the end coordinates of the particles correspond to the position coordinates of target points corresponding to the intelligent agents.
According to an embodiment of the disclosure, the policy network is a fully-connected neural network, and is configured to output an action that the agent needs to execute at a current time based on a state space corresponding to a current state of the agent.
According to an embodiment of the disclosure, the evaluation network is a fully-connected neural network, and is configured to output a joint state action value of the agent at a current moment based on the input joint state space vectors of all the agents and a joint action space obtained according to the policy network of all the agents.
According to an embodiment of the present disclosure, the inputting the joint state space, the joint action space and the rewards at the future time, and outputting the corresponding target joint state action value includes:
And updating the target joint state action value of the intelligent agent at the current moment by utilizing a multi-step learning method according to the joint state space, the joint action space and the rewards at the future moment.
According to an embodiment of the present disclosure, updating a target joint state action value of the agent at a current time using a multi-step learning method according to a joint state space, a joint action space, and a reward at a future time includes:
Inputting the reward value of the agent at each moment from moment j to moment j+L, the joint state space and joint action space at moment j+L, and the discount coefficient, and outputting the target joint state-action value at the current moment, wherein the target joint state-action value represents the sum of the accumulated reward over the L steps and the target joint state-action value at moment j+L multiplied by the discount coefficient.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor.
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the deep reinforcement learning-based multi-agent path planning methods.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform any one of the above-described deep reinforcement learning-based multi-agent path planning methods.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the multi-agent path planning method based on deep reinforcement learning described above.
The multi-agent path planning method based on deep reinforcement learning of the present disclosure combines the multi-agent deep deterministic policy gradient with the idea of multi-step return. By using multi-step returns, the influence of future data on the current value is taken into account, so that the output of the target evaluation network approaches the true value function and the deviation of the loss function is reduced; this improves accuracy and convergence speed, shortens learning time, and plans an optimal path for the multi-agent system quickly and efficiently, thereby solving the path planning problem.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments thereof with reference to the accompanying drawings in which:
FIG. 1 schematically illustrates a block diagram of a multi-agent system model MS-MADDPG based on deep reinforcement learning in accordance with an embodiment of the present invention;
FIG. 2 schematically illustrates a flow chart of a multi-agent path planning method based on deep reinforcement learning of an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart for setting a joint state space, a joint action space, and a joint reward for a multi-agent system in accordance with an embodiment of the present disclosure;
fig. 4 schematically illustrates a block diagram of an electronic device suitable for implementing a multi-agent path planning method based on deep reinforcement learning, in accordance with an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a convention should be interpreted in accordance with the meaning of one of skill in the art having generally understood the convention (e.g., "a system having at least one of A, B and C" would include, but not be limited to, systems having a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.). Where a formulation similar to at least one of "A, B or C, etc." is used, in general such a formulation should be interpreted in accordance with the ordinary understanding of one skilled in the art (e.g. "a system with at least one of A, B or C" would include but not be limited to systems with a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).
With the continuous development and improvement of multi-agent technology, the information that a single agent can observe in a multi-agent system is limited, so traditional single-agent reinforcement learning algorithms are not generally applicable. Multi-agent reinforcement learning algorithms increase the number of agents on the basis of single-agent reinforcement learning, introduce joint states and joint actions, and adopt a centralized-training, distributed-execution strategy so that each agent can independently accomplish its goal; the DDPG algorithm is used as a policy-based reinforcement learning algorithm to guide agents to explore an optimal path in an unknown environment.
In the process of implementing the disclosed concept, the inventors found that the related art has at least the following problems: as the number of agents increases and the environment scale expands, the multi-agent system suffers from slow convergence, long training time, and low path planning efficiency.
In order to at least partially solve the technical problems in the related art, the present disclosure provides a multi-agent path planning method based on deep reinforcement learning, an electronic device, and a non-transitory computer readable storage medium storing computer instructions, which can be applied to the technical field of artificial intelligence.
According to an embodiment of the present disclosure, the multi-agent path planning method based on deep reinforcement learning includes:
Setting a joint state space and a joint action space of the multi-agent system; and designing a reward function aiming at the path planning problem to generate the combined reward.
Initializing an MS-MADDPG algorithm neural network, and initializing the structure and parameters of a network model in a path planning method, wherein the network model comprises a strategy network, an evaluation network, a target strategy network and a target evaluation network, wherein the structure of the target strategy network and the structure of the strategy network are the same but the parameters are different, and the structure of the target evaluation network and the structure of the evaluation network are the same but the parameters are different.
Inputting the state space of each agent into the strategy network and, after processing by a deterministic strategy function, outputting the action that each agent needs to execute at the current moment.
In response to the agents executing the corresponding actions, respectively obtaining the state space and the corresponding reward value of each agent at the next moment; and combining the current joint action space, the current joint state space, the current joint reward, and the joint state space at the next moment, obtained through information exchange among different agents, into a tuple and storing it as a sample in an experience buffer pool.
Updating the state space of each agent at the current moment; comparing the number of samples in the experience buffer pool with a preset capacity; if the number of samples is smaller than the preset capacity, continuing to acquire sample information through the strategy network, otherwise entering the reinforcement learning training stage and updating the parameters of the evaluation network; randomly extracting from the experience buffer pool K batches of samples whose trajectory length equals the multi-step return value L, and training the evaluation network by inputting the joint state space, the joint action space, and the rewards at future moments and outputting the corresponding target joint state-action values; updating the parameters of the evaluation network by minimizing a loss function; updating the parameters of the strategy network by a reverse gradient descent method; and updating the parameters of the target strategy network and the target evaluation network by a soft update method.
And planning paths for the agents using the trained network model.
FIG. 1 schematically illustrates a block diagram of a multi-agent system model MS-MADDPG based on deep reinforcement learning in accordance with an embodiment of the present invention.
As shown in fig. 1, the multi-agent system model based on deep reinforcement learning of this embodiment includes: an environment system, agents 1 through n, and an experience buffer pool.
Initial state information of the multi-agent system is acquired, and the initial state information is stored in the environment system.
A policy network, an evaluation network, a target policy network, and a target evaluation network are respectively arranged in each of agents 1 through n, wherein each target policy network has the same structure as the corresponding policy network but different parameters, and each target evaluation network has the same structure as the corresponding evaluation network but different parameters.
Agents 1 through n obtain their corresponding state spaces s_1 through s_n from the environment system, input them into the corresponding policy networks, and, after processing by the deterministic policy function, output the actions a_1 through a_n that agents 1 through n need to execute at the current moment.
In response to agents 1 through n executing the corresponding actions, the state spaces s′_1 through s′_n and the corresponding reward values r_1 through r_n of agents 1 through n at the next moment are respectively obtained; the current joint action space a, the current joint state space x, the current joint reward r, and the joint state space x′ at the next moment are obtained through information exchange among agents 1 through n, combined into a tuple (x, a, r, x′), and stored as a sample in the experience buffer pool.
Agents 1 through n interact with the environment system and update their state spaces at the next moment according to the corresponding actions a_1 through a_n executed at the current moment.
The number of samples in the experience buffer pool is compared with the preset capacity; if it is smaller than the preset capacity, sample information continues to be acquired through the policy networks corresponding to agents 1 through n, otherwise the reinforcement learning training stage is entered.
During reinforcement learning training, the parameters of the evaluation networks corresponding to agents 1 through n are updated; K batches of samples (x_{j:j+L}, a_{j:j+L}, r_{j:j+L}, x′_{j:j+L}) whose trajectory length equals the multi-step return value L are randomly extracted from the experience buffer pool, and the evaluation network is trained by inputting the joint state space, the joint action space, and the rewards at future moments and outputting the corresponding target joint state-action values; the parameters of the evaluation network are updated by minimizing a loss function; the parameters of the policy network are updated by a reverse gradient descent method; and the parameters of the target policy network and the target evaluation network are updated by a soft update method.
The network model obtained through training is then used to plan paths for the agents.
FIG. 2 schematically illustrates a flow chart of a multi-agent path planning method based on deep reinforcement learning in accordance with an embodiment of the present disclosure.
As shown in fig. 2, the multi-agent path planning method based on deep reinforcement learning includes operations S210 to S260.
In operation S210, the joint state space, the joint action space, and the joint reward of the multi-agent system are set.
According to an embodiment of the present disclosure, setting the joint state space, the joint action space, and the joint reward of the multi-agent system includes:
Constructing a Markov game model for multi-agent path planning, wherein the Markov game model is described by a five-tuple <N, S, A, P, R>, where N = {1, ..., n} represents the set of agents; S represents the joint state space; A represents the joint action space; R represents the joint reward; and P represents the probability that all the agents, taking a joint action in the current state, reach the next state.
According to an embodiment of the present disclosure, setting the joint state space, joint action space, and joint reward of the multi-agent system further includes operations performed before constructing the Markov game model for multi-agent path planning.
FIG. 3 schematically illustrates a flow chart for setting a joint state space, a joint action space, and a joint reward for a multi-agent system in accordance with an embodiment of the disclosure.
As shown in FIG. 3, setting the joint state space, joint action space, and joint reward of the multi-agent system includes operations S211 to S213.
In operation S211, initial environmental information of the multi-agent system is acquired.
According to an embodiment of the present disclosure, the initial environment information includes the number of agents in the multi-agent system, the start position coordinates of each agent, the position coordinates of the corresponding target points, and the position coordinates of the obstacles.
More specifically, the environment-related information is initialized and a physical model is set for each agent and each obstacle. Each agent i is set as a square area with coordinates P_i(x, y); the target corresponding to each agent is set as a square area with position coordinates P_ig(x, y); each obstacle is set as a square area at position P_o(x, y), and all obstacles are stationary. The whole environment scale is a 10 × 10 square area.
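For illustration only, the square-area environment described above might be represented as follows; the 10 × 10 scale and the three kinds of square areas come from the text, while the class and field names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Coord = Tuple[float, float]

@dataclass
class GridWorld:
    """Square-area environment of the embodiment: agents, targets, and static obstacles."""
    size: int = 10                                          # whole environment scale: 10 x 10
    agents: List[Coord] = field(default_factory=list)       # P_i(x, y): current agent squares
    targets: List[Coord] = field(default_factory=list)      # P_ig(x, y): target squares
    obstacles: List[Coord] = field(default_factory=list)    # P_o(x, y): static obstacle squares
```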
In operation S212, agents in the multi-agent system are converted into particle models.
More specifically, the particle model includes a plurality of particles corresponding to a plurality of the agents, current coordinates of the plurality of particles correspond to current position coordinates of each agent, end coordinates of the plurality of particles correspond to position coordinates of target points corresponding to the agents, and the position coordinates of the target points are preset.
The experience buffer pool capacity M_o (e.g., 50000), the number of batch-sampled samples K (e.g., 256), and the multi-step return value L (e.g., 5) are initialized. The maximum number of training rounds E (e.g., 20000) and the maximum number of training steps per round T (e.g., 100) are set, and the current training step t and the current training round e are initialized to 0.
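These quantities can be gathered into a single configuration object; the following is a minimal sketch, with the example values taken from the text and the class and field names being illustrative only:

```python
from dataclasses import dataclass

@dataclass
class MSMADDPGConfig:
    buffer_capacity: int = 50_000   # experience buffer pool capacity M_o
    batch_size: int = 256           # number of batch-sampled samples K
    multi_step: int = 5             # multi-step return value L
    max_episodes: int = 20_000      # maximum number of training rounds E
    max_steps: int = 100            # maximum training steps per round T
    gamma: float = 0.99             # discount coefficient (see formula (6))
    tau: float = 0.9                # soft update coefficient (see formula (9))

cfg = MSMADDPGConfig()
```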
In operation S213, a Markov game model for multi-agent path planning is constructed.
According to an embodiment of the present disclosure, the state space of each agent is a vector containing the position of the current agent, the positions of the other agents relative to the current agent, and the positions of the obstacles relative to the agent.
According to an embodiment of the present disclosure, the action space of each agent represents the actions that the agent can take given its own state space, comprising four actions: up, down, left, and right.
According to an embodiment of the present disclosure, the reward function of each agent represents the reward or penalty value obtained after selecting a corresponding action in the current state space; since all agents cooperate to reach their target positions while avoiding obstacles, the reward function of each agent is the same. The reward function of the agent may be represented by the following formula (1):
More specifically, the state space of each agent is input into the policy network and, after processing by the policy function, the action that each agent needs to execute at the current moment is output; in response to the agents executing the corresponding actions, the state space and the corresponding reward value of each agent at the next moment are respectively obtained; and the current joint action space, the current joint state space, the current joint reward, and the joint state space at the next moment are obtained through information exchange among different agents.
In operation S220, the structure and parameters of the network model in the path planning method are initialized, the network model including a policy network, an evaluation network, a target policy network, and a target evaluation network.
According to embodiments of the present disclosure, the target policy network and the policy network have the same structure but different parameters, and the target evaluation network and the evaluation network have the same structure but different parameters.
More specifically, the policy network μ_i and its network parameters θ_i^μ are initialized for agent i. The policy network is a fully connected neural network comprising two hidden layers whose activation functions are both ReLU; the first hidden layer has 64 nodes and the second hidden layer has 32 nodes. The output layer contains one node and adopts the Gumbel-Softmax activation function to output the action that agent i needs to execute at the current moment.
The evaluation network Q_i and its network parameters θ_i^Q of agent i are initialized. The evaluation network is a fully connected neural network comprising two hidden layers, each with 64 nodes; the activation function of the first hidden layer is sigmoid and that of the second hidden layer is ReLU. The output layer contains one node and adopts a linear activation function to output the state-action value at the current moment.
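The two architectures might be sketched in PyTorch as follows, with the layer sizes and activations taken from the description above; where the text speaks of a single output node with Gumbel-Softmax, this sketch assumes one logit per discrete action so that the Gumbel-Softmax has something to sample over, and all class names and dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Policy network mu_i: state s_i -> action (two ReLU hidden layers, Gumbel-Softmax output)."""
    def __init__(self, state_dim, action_dim=4):             # 4 discrete actions: up, down, left, right
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 64)                  # first hidden layer, 64 nodes
        self.fc2 = nn.Linear(64, 32)                         # second hidden layer, 32 nodes
        self.out = nn.Linear(32, action_dim)                 # logits over the discrete actions

    def forward(self, s, hard=True):
        h = F.relu(self.fc1(s))
        h = F.relu(self.fc2(h))
        return F.gumbel_softmax(self.out(h), hard=hard)      # differentiable (near) one-hot action

class CriticNet(nn.Module):
    """Evaluation network Q_i: joint state x and joint action a -> scalar state-action value."""
    def __init__(self, joint_state_dim, joint_action_dim):
        super().__init__()
        self.fc1 = nn.Linear(joint_state_dim + joint_action_dim, 64)  # first hidden layer (sigmoid)
        self.fc2 = nn.Linear(64, 64)                                  # second hidden layer (ReLU)
        self.out = nn.Linear(64, 1)                                   # linear output: Q value

    def forward(self, x, a):
        h = torch.sigmoid(self.fc1(torch.cat([x, a], dim=-1)))
        h = F.relu(self.fc2(h))
        return self.out(h)
```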
Copying the parameters of the policy network into the corresponding target policy network, the copying process can be represented by the following formula (2):
θ^μ → θ^μ′; (2)
where θ^μ′ is the parameter of the target policy network.
Copying the parameters of the evaluation network into the corresponding target evaluation network, the copying process may be represented by the following formula (3):
θ^Q → θ^Q′; (3)
where θ^Q′ is the parameter of the target evaluation network.
In operation S230, the state space of each agent is input into the policy network, and after the deterministic policy function processing, the action that needs to be executed by each agent at the current moment is output.
According to an embodiment of the present disclosure, outputting the action that each agent needs to execute at the current moment may be represented by the following formula (4):
a_i = μ_i(s_i; θ_i^μ) + ε; (4)
where a_i is the action that agent i needs to execute at the current moment, μ_i(·) is the deterministic policy function, s_i is the state space of agent i, θ_i^μ is the policy network parameter corresponding to agent i, and ε is the policy network noise.
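Under the same assumptions as the network sketch above, formula (4) might be realized as follows; `noise_scale` is a hypothetical parameter introduced here for the exploration noise ε, not a value given in the patent:

```python
import torch

def select_action(policy_net, s_i, noise_scale=0.1):
    """One reading of formula (4): a_i = mu_i(s_i; theta_i^mu) + epsilon."""
    with torch.no_grad():
        a_i = policy_net(s_i)                         # deterministic policy output
    epsilon = noise_scale * torch.randn_like(a_i)     # policy network (exploration) noise
    return a_i + epsilon
```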
In operation S240, a state space and a corresponding prize value at a next time of each agent are obtained respectively in response to the agent performing the corresponding action; and combining the current joint action space, the current joint state space, the current joint rewards and the joint state space at the next moment which are obtained through information exchange among different agents into a tuple as a sample and storing the tuple into an experience buffer pool.
According to an embodiment of the present disclosure, through information exchange among different agents, agent i interacts with the environment according to the action a_i to be executed at the current moment, obtaining its state space s′_i at the next moment and the feedback reward value r_i; the joint state space at the current moment x = (s_1, …, s_n) is obtained from the current state spaces s_i of all agents, the joint state space at the next moment x′ = (s′_1, …, s′_n) is obtained from the next-moment state spaces s′_i of all agents, the joint action set at the current moment a = (a_1, …, a_n) is obtained from the actions a_i to be executed by all agents at the current moment, and the joint reward set at the current moment is r = (r_1, …, r_n); these are combined into a tuple (x, a, r, x′) and stored as a sample in the experience buffer pool D.
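A minimal experience buffer that supports both this single-step storage and the L-step trajectory sampling used later might look as follows; the deque storage, the random start indices, and the omission of episode-boundary handling are all assumptions of this sketch:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience buffer pool D storing tuples (x, a, r, x')."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)            # oldest samples are discarded when full

    def push(self, x, a, r, x_next):
        self.data.append((x, a, r, x_next))

    def __len__(self):
        return len(self.data)

    def sample_trajectories(self, k, length):
        """Draw K batches of consecutive transitions, each with trajectory length L."""
        starts = [random.randrange(len(self.data) - length) for _ in range(k)]
        return [[self.data[j + i] for i in range(length)] for j in starts]
```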
In operation S250, the state space of each agent at the current moment is updated; the number of samples in the experience buffer pool is compared with the preset capacity; if the number of samples is smaller than the preset capacity, sampling continues and sample information is acquired through the policy network, otherwise the reinforcement learning training stage is entered and the parameters of the evaluation network are updated by the multi-step learning method.
According to an embodiment of the present disclosure, updating the state space of each agent at the current time may be represented by the following formula (5):
x′→x; (5)
wherein x is the joint state space of the agents at the current moment and x′ is the joint state space of the agents at the next moment; each agent interacts with the environment system, and the state space at the next moment corresponding to the agent is assigned to it as its updated state space at the current moment.
According to an embodiment of the present disclosure, the current number of samples M in the experience buffer pool is compared with the preset capacity M_o: if M < M_o, operations S230 to S250 continue to be executed; if M ≥ M_o, the reinforcement learning training stage is entered and the network parameters are updated.
According to an embodiment of the present disclosure, more specifically, the experience buffer pool preset capacity may be set to 50000.
According to an embodiment of the present disclosure, K batches of samples (x_{j:j+L}, a_{j:j+L}, r_{j:j+L}, x′_{j:j+L}) are randomly drawn from the experience buffer pool D, where each batch has a trajectory length equal to the multi-step return value L, and the evaluation network is trained.
According to an embodiment of the present disclosure, inputting the joint state space, the joint action space, and the rewards at future moments within all sampled moments, and outputting the corresponding target joint state-action values includes:
updating the target joint state-action value of the agent at the current moment by the multi-step learning method according to the joint state space, the joint action space, and the rewards at future moments within all sampled moments.
More specifically, the reward value of the agent at each moment from moment j to moment j+L, the joint state space and joint action space at moment j+L, and the discount coefficient are input, and the target joint state-action value at the current moment is output, where the target joint state-action value represents the sum of the accumulated reward over the L steps and the target joint state-action value at moment j+L multiplied by the discount coefficient; for example, the return step length L may be set to 5.
More specifically, K batches of samples whose trajectory length equals the multi-step return value L are randomly extracted from the experience buffer pool D (the number K of batch-sampled samples may be set to 256), the evaluation network is trained, the joint state space, the joint action space, and the rewards at future moments within all sampled moments are input, and the corresponding target joint state-action values are output.
Outputting the corresponding target joint state-action value may be represented by the following formula (6):
y = Σ_{k=0}^{L-1} γ^k · r_{j+k} + γ^L · Q′(x_{j+L}, a′_1, …, a′_n; θ^{Q′}), with a′_i = μ′_i(s_i) at moment j+L; (6)
where y is the target joint state-action value at the current moment, output after the reward value at each moment from moment j to moment j+L and the target joint state-action value at moment j+L are input into the target evaluation network; it equals the sum of the accumulated reward over the L steps and the target joint state-action value at moment j+L multiplied by the discount coefficient; γ is the discount coefficient, set to 0.99; r_{j+k} is the reward value of the agent at moment j+k; μ′ = (μ′_1, …, μ′_n) is the set of target policy networks; Q′(x_{j+L}, a′_1, …, a′_n; θ^{Q′}) is the state-action value output by the target evaluation network after the joint state space at moment j+L and the joint action obtained according to the target policy set μ′ are input; and a′_i is the action given by the corresponding policy in the target policy set.
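A sketch of this L-step target, assuming each sampled trajectory is a list of (x, a, r, x′) tuples as stored above, that x′ holds the per-agent states, that r is the shared reward of the agent being updated (the text states all agents use the same reward function), and that `target_actors` and `target_critic` follow the earlier network sketch:

```python
import torch

def multi_step_target(trajectory, target_actors, target_critic, gamma=0.99):
    """y = sum_{k=0}^{L-1} gamma^k * r_{j+k} + gamma^L * Q'(x_{j+L}, a'_1, ..., a'_n)  (formula (6))."""
    L = len(trajectory)
    y = 0.0
    for k, (_, _, r, _) in enumerate(trajectory):
        y = y + (gamma ** k) * r                        # accumulated reward over the L steps
    _, _, _, x_next = trajectory[-1]                    # per-agent states at moment j + L
    with torch.no_grad():
        a_next = torch.cat([mu(s) for mu, s in zip(target_actors, x_next)], dim=-1)
        x_cat = torch.cat(list(x_next), dim=-1)
        y = y + (gamma ** L) * target_critic(x_cat, a_next).squeeze(-1)
    return y
```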
According to an embodiment of the present disclosure, updating the parameters of the evaluation network by minimizing the loss function may be represented by the following formula (7):
L(θ^Q) = (1/K) Σ_j (y_j − Q(x_j, a_j; θ^Q))²; (7)
where L is the loss function; Σ denotes the sum over all sampled samples; Q(x_j, a_j; θ^Q) is the joint state-action value output by the evaluation network; K is the number of batch-sampled samples; and y_j is the target joint state-action value at the current moment.
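Continuing the same sketch, the loss of formula (7) could be minimized with an assumed Adam (or similar) optimizer as follows; `batch_x` and `batch_a` stand for the stacked joint states and joint actions of the sampled transitions and `batch_y` for the targets computed above:

```python
import torch.nn.functional as F

def update_critic(critic, critic_optimizer, batch_x, batch_a, batch_y):
    """Minimize L = (1/K) * sum_j (y_j - Q(x_j, a_j; theta^Q))^2  (formula (7))."""
    q = critic(batch_x, batch_a).squeeze(-1)      # Q values for the K sampled transitions
    loss = F.mse_loss(q, batch_y)                 # mean squared error over the batch
    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()
    return loss.item()
```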
According to the embodiment of the disclosure, the evaluation network is a fully-connected neural network, and is used for outputting the joint state action value of the intelligent agent at the current moment based on the input joint state space vector of all the intelligent agents and the joint action space obtained according to the respective strategy network of all the intelligent agents.
More specifically, the evaluation network is composed of an input layer, hidden layers, and an output layer; it takes as input the joint state space x = (s_1, …, s_n) of all agents and the joint action space a = (a_1, …, a_n) that all agents need to execute at the current moment, obtained according to the corresponding policy networks, and outputs the joint state-action value Q(x, a_1, …, a_n) of all agents.
According to an embodiment of the present disclosure, updating the parameters of the policy network using the reverse gradient descent method may be represented by the following formula (8):
∇_{θ_i^μ} J ≈ (1/K) Σ_j ∇_{θ_i^μ} μ_i(s_i) · ∇_{a_i} Q_i(x, a_1, …, a_n) |_{a_i = μ_i(s_i)}; (8)
where ∇_{θ_i^μ} J is the gradient of the loss function with respect to the policy network parameters; ∇_{θ_i^μ} μ_i(s_i) and ∇_{a_i} Q_i(x, a_1, …, a_n) are the policy function gradient and the state-action value function gradient, respectively; x is the joint state space of all agents at the current moment; and Q_i(x, a_1, …, a_n) is the joint state-action value of agent i output by the evaluation network when the joint state information and actions of all agents are input.
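In practice this gradient is obtained by maximizing the critic's value with respect to agent i's own action; a sketch under the same assumptions as above, where `batch_states` and `batch_actions` are hypothetical per-agent lists of stacked tensors from the sampled transitions:

```python
import torch

def update_actor(agent_idx, actor, critic, actor_optimizer, batch_states, batch_actions):
    """Apply the sampled policy gradient of formula (8) for agent i."""
    actions = list(batch_actions)
    actions[agent_idx] = actor(batch_states[agent_idx])   # re-evaluate own action with the current policy
    x = torch.cat(batch_states, dim=-1)                   # joint state x
    a = torch.cat(actions, dim=-1)                        # joint action with a_i = mu_i(s_i)
    actor_loss = -critic(x, a).mean()                     # descending -Q ascends Q_i
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()
```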
According to an embodiment of the present disclosure, updating the parameters of the target policy network and the target evaluation network using the soft update method may be represented by the following formula (9):
θ_i^{μ′} = τ·θ_i^μ + (1 − τ)·θ_i^{μ′}, θ_i^{Q′} = τ·θ_i^Q + (1 − τ)·θ_i^{Q′}; (9)
where τ is the soft update coefficient, which may be set to 0.9, for example; the target policy network parameter θ_i^{μ′} on the right side of the equation denotes the value before the soft update, and the θ_i^{μ′} on the left side denotes the value after the soft update; likewise, the target evaluation network parameter θ_i^{Q′} on the right side denotes the value before the soft update, and the θ_i^{Q′} on the left side denotes the value after the soft update.
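Formula (9) corresponds to the usual Polyak-style parameter blend; a short sketch, applicable to both the target policy network and the target evaluation network:

```python
def soft_update(target_net, source_net, tau=0.9):
    """theta' <- tau * theta + (1 - tau) * theta'  (formula (9))."""
    for target_param, source_param in zip(target_net.parameters(), source_net.parameters()):
        target_param.data.copy_(tau * source_param.data + (1.0 - tau) * target_param.data)
```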
Whether training is finished is judged as follows: the current training step t is increased by 1 and recorded; if t < T, operation S230 and the subsequent training operations continue to be executed, so that data collection by the multiple agents is realized; if t ≥ T, the current training round e is increased by 1; if e < E, the environment information of the multi-agent system is re-initialized and operation S230 and the subsequent training operations continue to be executed; otherwise, training ends and the trained network parameters are saved.
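Putting the pieces together, the outer loop of operations S230 to S250 and the episode/step counters described above might be organized as follows; `env.reset`, `env.step`, the `agents` objects, and the helper functions are the hypothetical pieces sketched earlier, not APIs defined by the patent:

```python
def train(env, agents, buffer, cfg):
    """Outer training loop: at most E rounds of at most T steps each (operations S230 to S250)."""
    for episode in range(cfg.max_episodes):                      # current training round e < E
        states = env.reset()                                     # re-initialize environment information
        for step in range(cfg.max_steps):                        # current training step t < T
            actions = [select_action(agent.actor, s) for agent, s in zip(agents, states)]
            next_states, rewards, done = env.step(actions)       # hypothetical environment API
            buffer.push(states, actions, rewards, next_states)   # store the tuple (x, a, r, x')
            states = next_states                                 # x' -> x, formula (5)
            if len(buffer) >= cfg.buffer_capacity:               # M >= M_o: reinforcement learning stage
                trajectories = buffer.sample_trajectories(cfg.batch_size, cfg.multi_step)
                for i, agent in enumerate(agents):
                    # per-agent update step goes here: build multi-step targets with
                    # multi_step_target(...), minimize the critic loss with update_critic(...),
                    # apply the policy gradient with update_actor(...), and finally
                    # soft_update(...) the target networks -- see the sketches above
                    pass
            if done:
                break
```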
In operation S260, path planning is performed for the multi-agent system using the network model obtained through training.
According to the multi-agent path planning method based on deep reinforcement learning of the embodiments of the present disclosure, the multi-agent deep deterministic policy gradient is combined with the idea of multi-step return, and the state-action function at the current moment is updated using the state-action functions of future multi-step moments, so that the output of the evaluation network approaches that of the target evaluation network and the deviation of the loss function is reduced, while the influence of future data on the current state-action value is taken into account; this improves the accuracy of the trained neural network model, increases the training speed, shortens the training time, and quickly solves the path planning problem in a multi-agent system.
According to the embodiment of the disclosure, the trained network model performs path planning on the intelligent agent, and the intelligent agent selects proper actions according to the trained network model to complete the path planning task.
According to the embodiments of the present disclosure, each agent in the multi-agent system can automatically avoid obstacles in a complex environment and smoothly reach its respective target position; the multi-agent path planning method based on deep reinforcement learning shortens the time required to train for path planning while ensuring the flexibility and accuracy of path planning.
According to the embodiment of the disclosure, the path planning problem in the multi-agent system can be rapidly realized, and a foundation is laid for executing tasks of the large-scale multi-agent system. Compared with the original MADDPG algorithm, the method disclosed by the embodiment of the disclosure updates the state action function at the current moment by using the state action function at the future multi-step moment, solves the problem that the difference between the output of the evaluation network and the output of the target evaluation network is large, namely the error of the loss function is large, and can further improve the accuracy of the trained neural network model, thereby further improving the practical application value. The reinforcement learning method based on multi-step learning shortens the training time and has better training effect.
Fig. 4 schematically illustrates a block diagram of an electronic device suitable for implementing the multi-agent path planning method based on deep reinforcement learning according to an embodiment of the disclosure. The electronic device shown in fig. 4 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 4, a computer electronic device 400 according to an embodiment of the present disclosure includes a processor 401 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage section 408 into a random access memory (RAM) 403. The processor 401 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset, and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), or the like. The processor 401 may also include on-board memory for caching purposes. The processor 401 may include a single processing unit or multiple processing units for performing different actions of the method flows according to embodiments of the disclosure.
In the RAM 403, various programs and data necessary for the operation of the electronic device 400 are stored. The processor 401, the ROM 402, and the RAM 403 are connected to each other by a bus 404. The processor 401 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 402 and/or the RAM 403. Note that the programs may also be stored in one or more memories other than the ROM 402 and the RAM 403. The processor 401 may also perform various operations of the method flow according to the embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 400 may also include an input/output (I/O) interface 405, which is also connected to the bus 404. The electronic device 400 may also include one or more of the following components connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, and a speaker and the like; a storage section 408 including a hard disk or the like; and a communication section 409 including a network interface card such as a LAN card or a modem. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 410 as needed, so that a computer program read therefrom is installed into the storage section 408 as needed.
According to embodiments of the present disclosure, the method flow according to embodiments of the present disclosure may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 409, and/or installed from the removable medium 411. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 401. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 402 and/or RAM 403 and/or one or more memories other than ROM 402 and RAM 403 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program, the computer program comprising program code for performing the methods provided by the embodiments of the present disclosure; when the computer program product is run on an electronic device, the program code causes the electronic device to implement the multi-agent path planning method based on deep reinforcement learning provided by the embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 401. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed over a network medium in the form of a signal, and downloaded and installed via the communication section 409, and/or installed from the removable medium 411. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code for carrying out the computer programs provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined in various ways, even if such combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined without departing from the spirit and teachings of the present disclosure. All such combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. These examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (10)

1. A multi-agent path planning method based on deep reinforcement learning comprises the following steps:
Setting a joint state space and a joint action space of the multi-agent system; designing a reward function aiming at the path planning problem to generate a combined reward;
Initializing the structure and parameters of a network model in the path planning method, wherein the network model comprises a strategy network, an evaluation network, a target strategy network, and a target evaluation network; the target strategy network has the same structure as the strategy network but different parameters, and the target evaluation network has the same structure as the evaluation network but different parameters;
Inputting the state space of each agent into the strategy network and, after processing by a deterministic strategy function, outputting the action that each agent needs to execute at the current moment;
In response to the agents executing the corresponding actions, respectively obtaining the state space and the corresponding reward value of each agent at the next moment; combining the current joint action space, the current joint state space, the current joint reward, and the joint state space at the next moment, obtained through information exchange among different agents, into a tuple and storing it as a sample in an experience buffer pool;
Updating the state space of each agent at the current moment; comparing the number of samples in the experience buffer pool with a preset capacity; if the number of samples is smaller than the preset capacity, continuing to sample and acquiring sample information through the strategy network, otherwise entering a reinforcement learning training stage and updating the parameters of the evaluation network by a multi-step return method; randomly extracting from the experience buffer pool K batches of samples whose trajectory length equals the multi-step return value L, and training the evaluation network by inputting the joint state space, the joint action space, and the rewards at future moments and outputting the corresponding target joint state-action values; updating the parameters of the evaluation network by minimizing a loss function; updating the parameters of the strategy network by a reverse gradient descent method; updating the parameters of the target strategy network and the target evaluation network by a soft update method;
And planning paths for the agents using the network model obtained through training.
2. The method of claim 1, wherein the setting of the joint state space, the joint action space, and the joint rewards for multiple agents comprises:
Constructing a Markov game model for multi-agent path planning, wherein the Markov game model is described by a five-tuple <N, S, A, P, R>, where N = {1, ..., n} represents the set of agents; S represents the joint state space; A represents the joint action space; R represents the joint reward; and P represents the probability that all of the agents, taking a joint action in the current state, reach the next state.
3. The method of claim 1, wherein prior to constructing a markov gaming model for multi-agent path planning, the method further comprises:
acquiring initial environment information of a multi-agent system, wherein the initial environment information comprises the number of agents in the multi-agent system, the initial position coordinate of each agent, the position coordinate of a corresponding target point and the position coordinate of an obstacle;
And converting the intelligent agent in the multi-intelligent agent system into a particle model, wherein the particle model comprises a plurality of particles corresponding to a plurality of intelligent agents, the current coordinates of the particles correspond to the current position coordinates of the intelligent agents, and the end coordinates of the particles correspond to the position coordinates of target points corresponding to the intelligent agents.
4. The method of claim 1, wherein the policy network is a fully-connected neural network for outputting actions that the agent needs to perform at a current time based on an input state space of the agent corresponding to a current state.
5. The method according to claim 1, wherein the evaluation network is a fully-connected neural network, and is configured to output a joint state action value of the agent at a current moment based on the input joint state space vectors of all the agents and the joint action space obtained according to the respective policy networks of all the agents.
6. The method of claim 1, wherein the inputting the joint state space, joint action space, and rewards for future time instants, outputting the respective target joint state action values comprises:
and updating the target joint state action value of the intelligent agent at the current moment by utilizing a multi-step learning method according to the joint state space, the joint action space and the rewards at the future moment.
7. The method of claim 6, wherein updating the target joint state action value of the agent at the current time using a multi-step learning method based on the joint state space, the joint action space, and the rewards at the future time comprises:
Inputting the reward value of the agent at each moment from moment j to moment j+L, the joint state space and joint action space at moment j+L, and the discount coefficient, and outputting the target joint state-action value at the current moment, wherein the target joint state-action value represents the sum of the accumulated reward over the L steps and the target joint state-action value at moment j+L multiplied by the discount coefficient.
8. An electronic device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
9. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-7.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202210490010.2A 2022-04-29 Multi-agent path planning method based on deep reinforcement learning Active CN114815840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210490010.2A CN114815840B (en) 2022-04-29 Multi-agent path planning method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210490010.2A CN114815840B (en) 2022-04-29 Multi-agent path planning method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114815840A CN114815840A (en) 2022-07-29
CN114815840B true CN114815840B (en) 2024-06-28


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113341958A (en) * 2021-05-21 2021-09-03 西北工业大学 Multi-agent reinforcement learning movement planning method with mixed experience

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113341958A (en) * 2021-05-21 2021-09-03 西北工业大学 Multi-agent reinforcement learning movement planning method with mixed experience

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-agent cooperation based on the MADDPG algorithm under sparse rewards; Xu Nuo, Yang Zhenwei; Modern Computer; 2020-05-25 (No. 15); 48-52 *

Similar Documents

Publication Publication Date Title
KR102422729B1 (en) Learning Data Augmentation Policy
KR102242516B1 (en) Train machine learning models on multiple machine learning tasks
WO2021238303A1 (en) Motion planning method and apparatus
US11521056B2 (en) System and methods for intrinsic reward reinforcement learning
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
US11454978B2 (en) Systems and methods for improving generalization in visual navigation
Coşkun et al. Deep reinforcement learning for traffic light optimization
CN112313672A (en) Stacked convolutional long-short term memory for model-free reinforcement learning
US20210158162A1 (en) Training reinforcement learning agents to learn farsighted behaviors by predicting in latent space
CN114162146B (en) Driving strategy model training method and automatic driving control method
CN115185271B (en) Navigation path generation method, device, electronic equipment and computer readable medium
JP7139524B2 (en) Control agents over long timescales using time value transfer
US12005580B2 (en) Method and device for controlling a robot
CN114021330A (en) Simulated traffic scene building method and system and intelligent vehicle control method
CN114519433A (en) Multi-agent reinforcement learning and strategy execution method and computer equipment
CN114815840B (en) Multi-agent path planning method based on deep reinforcement learning
CN113052253A (en) Hyper-parameter determination method, device, deep reinforcement learning framework, medium and equipment
US20220036179A1 (en) Online task inference for compositional tasks with context adaptation
CN116300977B (en) Articulated vehicle track tracking control method and device based on reinforcement learning
CN117540203A (en) Multi-directional course learning training method and device for cooperative navigation of clustered robots
CN116968721A (en) Predictive energy management method, system and storage medium for hybrid electric vehicle
KR20210149393A (en) Apparatus and method for training reinforcement learning model in use of combinational optimization
CN116704808A (en) Intelligent dynamic allocation and guiding method, device, equipment and medium for parking spaces
CN113790729B (en) Unmanned overhead traveling crane path planning method and device based on reinforcement learning algorithm
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant