CN113759929B - Multi-agent path planning method based on reinforcement learning and model predictive control - Google Patents

Multi-agent path planning method based on reinforcement learning and model predictive control

Info

Publication number
CN113759929B
CN113759929B (application CN202111107563.7A)
Authority
CN
China
Prior art keywords
agent
time
path planning
follower
particle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111107563.7A
Other languages
Chinese (zh)
Other versions
CN113759929A (en)
Inventor
杜飞平
王春民
李嘉豪
于登秀
王震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Xian Aerospace Propulsion Institute
Original Assignee
Northwestern Polytechnical University
Xian Aerospace Propulsion Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University and Xian Aerospace Propulsion Institute
Priority to CN202111107563.7A priority Critical patent/CN113759929B/en
Publication of CN113759929A publication Critical patent/CN113759929A/en
Application granted granted Critical
Publication of CN113759929B publication Critical patent/CN113759929B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-agent path planning method based on reinforcement learning and model predictive control. For the multi-agent path planning problem, a path planning and tracking method combining the ESB-MADDPG and MPC algorithms is used, with the following basic steps: first, the multi-agent system model is simplified into a particle model; then the ESB-MADDPG algorithm is used for path planning; finally, all planned paths are tracked through model predictive control. In this way, path planning for the multi-agent system can be realized quickly, laying a foundation for large-scale multi-agent systems to execute tasks.

Description

Multi-agent path planning method based on reinforcement learning and model predictive control
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a multi-agent path planning method based on reinforcement learning and model predictive control.
Background
With the development and maturation of artificial intelligence theory and related research technologies, multi-agent systems are being studied and applied ever more widely. A multi-agent system is an autonomous intelligent system that uses interaction behaviors such as information exchange and feedback, excitation and response to coordinate its behavior, adapt to a dynamic environment, and ultimately complete specific tasks together.
Multi-agent systems are an important application field of swarm intelligence and one of the important directions for the future development of intelligent systems. Path planning is a research focus for multi-agent systems; it concentrates on the globally optimal path of the whole system, for example the shortest total path length or the smallest total energy consumption of the system's paths. Only by planning the most effective paths for the whole system can the efficiency and success rate of the multi-agent system in executing tasks be improved.
At present, path planning for a multi-agent system typically optimizes only a single objective. For multi-agent path planning with multi-objective optimization requirements, existing path planning methods have difficulty realizing path optimization between multiple agents and multiple targets.
Disclosure of Invention
Aiming at the problem that existing path planning methods cannot realize path optimization between multiple agents and multiple targets, the invention provides a multi-agent path planning method based on reinforcement learning and model predictive control.
The basic design idea of the invention is as follows:
On the basis of the multi-agent deep deterministic policy gradient (MADDPG), the idea of an expert system (ESB) and model predictive control (MPC) are added. First, the agents in the multi-agent system are simplified into a particle model; then an expert system is introduced to smooth the paths produced by the MADDPG algorithm and to shorten the convergence time; finally, all paths obtained by the ESB-MADDPG algorithm are collected and tracked through model predictive control, so that the multi-agent system can realize path planning that meets multi-objective optimization requirements.
The specific technical scheme of the invention is as follows:
A multi-agent path planning method based on reinforcement learning and model predictive control comprises the following steps:
Step 1: establish a multi-agent system model and acquire its initial state information. The initial state information includes the number of agents in the multi-agent system model, n; the number of target points, n; the current position coordinate p i of any agent i in the global coordinate frame; and the position coordinate p j of each target point j. The target-point coordinates are given manually according to the requirements of the multi-agent path planning task; (i, j) ∈ n.
Step 2: convert the multi-agent system model into a particle model.
The particle model contains n particles corresponding to the n agents; the start-position coordinate of each particle is the current position coordinate of its corresponding agent, and the end-position coordinate of each particle is the position coordinate of its corresponding target point.
An observation range is assigned to the start coordinate of each particle, and an observable range is assigned to the end coordinate of each particle.
Step 3: perform path planning with the ESB-MADDPG algorithm.
Step 3.1: compute the reward value r at each time instant according to the reward formula, which is a function of the distance between agent i and target point j.
Step 3.2: from the start-position and end-position coordinates of any particle i obtained in step 2, obtain the current-time state o i of particle i through the ESB-MADDPG algorithm. The current-time state o i consists of the current-time coordinate of particle i and the relative positions of the other particles with respect to the current-time coordinate of particle i.
Step 3.3: obtain the current-time action a i of particle i in the current-time state o i from the action-estimation network; a i consists of particle i's speeds along the x and y axes and is produced by the action-selection policy under the action-estimation network parameters θ plus a disturbance term, which represents the interference suffered by particle i when selecting the action at the current time.
Step 3.4: after particle i selects and executes the current-time action a i, it reaches a new state o i'.
Step 3.5: repeat steps 3.3-3.4 for a total of m time steps, m ≤ 50, to obtain the state results of particle i's path planning at all times; connect the positions of particle i in all the time-step state results obtained in this training round to obtain the path set of particle i.
Step 3.6: repeat steps 3.1-3.5 to obtain the path sets of the n particles.
Step 3.7: judge the path sets obtained by training. The criterion is that, for every particle, the observation range of its final-time state and the observable range of its end coordinate are in contact; if so, the initial path planning is considered finished and step 3.9 is executed.
If not, repeat steps 3.1-3.6 a total of M times, M ≥ 100, and fill the experience pool D with the tuples (o, a, r, o'), then execute step 3.8; here o is the set of the particles' states o i, a is the set of their actions a i, and o' is the set of their new states o i'.
Step 3.8: randomly sample a small batch of samples (o, a, r, o') from the experience pool D and compute the Q value with the state-estimation network; the Q value evaluates the quality of the actions output by the action-estimation network.
At the same time, feed the samples into the action-estimation network and update the action-estimation network parameters θ through the policy-gradient formula; load the updated parameters θ into the action-estimation network and return to step 3.1.
Step 3.9: smooth the initial paths that meet the requirement and output them.
Step 4: track the paths with a model predictive control algorithm.
Step 4.1: establish the agent tracking model.
Particle i is set as the virtual leader; its initial-time position is the start position of particle i, and its reference trajectory is the smoothed path of particle i. The agent corresponding to particle i is set as the follower; its initial-time position is the position of agent i in step 1. The ideal control relation between the follower and the virtual leader is specified by the distance l 1 between the virtual leader and the follower, their relative orientation, and their orientation deviation, whose initial values are all 0.
Step 4.2: from the speeds and angular velocities of the virtual leader and the follower in the global coordinate frame and the distance between them, obtain the expression of the control relation between the virtual leader and the follower, and combine it with the kinematic formula of the agent to establish the tracking control model.
Step 4.3: from the follower's position and initial velocity at the initial time and the tracking control model, predict the control relation between the follower and the virtual leader at time t.
Step 4.4: compare the control relation predicted in step 4.3 with the ideal control relation set in step 4.1, compute the error e t between them, and correct it.
Step 4.5: optimize and correct the error e t with the particle swarm algorithm, and from the speed input at time t+1 compute the control relation between the follower and the virtual leader at time t+1.
Step 4.6: judge whether the control termination time has been reached; if so, output the tracking path, otherwise return to step 4.4. The control termination time is the duration of the reference trajectory in step 4.1.
Step 4.7: track the initial paths of the remaining particles according to steps 4.1-4.6, and finally complete the path planning of all agents.
Further, the tracking control model in step 4.2 is expressed by combining the expression of the relation between the virtual leader and the follower with the kinematic formula of the agent, where v F is the speed of the follower, ω F is the angular velocity of the follower, v L is the speed of the virtual leader, ω L is the angular velocity of the virtual leader, and u denotes the follower's speed input.
Further, in step 3.2 the state o i is composed of the current-time coordinate of particle i together with the relative positions p ij of the other particles, and p ij is obtained from its solving formula in terms of the positions of particles i and j.
further, in the above step 3.8QThe expression of (a) is:
Figure 653408DEST_PATH_IMAGE040
wherein:
Figure 129388DEST_PATH_IMAGE041
in order to be able to use the attenuation factor,
Figure 685135DEST_PATH_IMAGE042
a value is awarded for a new time instant obtained at a new time instant.
Further, in step 3.8 the policy-gradient formula used to update the action-estimation network parameters θ involves S, the number of sampled samples, and the gradient operator of the policy-gradient method applied to the parameters being updated.
The invention has the following beneficial effects:
1. Aiming at the multi-agent path planning problem, the invention uses a path planning and tracking algorithm that combines the ESB-MADDPG and MPC algorithms, quickly realizing path planning for the multi-agent system and laying a foundation for large-scale multi-agent systems to execute tasks.
2. By designing the reward value in the ESB-MADDPG algorithm and the neural networks within the algorithm, the invention avoids mutual interference between the paths of the individual particles and makes the distance of each path to its target point the shortest; by introducing the kinematic model of the agent through the MPC algorithm, the speed along the agent's tracked path can be optimized, yielding optimized multi-agent paths.
Drawings
FIG. 1 is a flow chart of a basic implementation of the present invention;
FIG. 2 is a flow chart of path planning using ESB-MADDPG;
FIG. 3 is a flow chart of path tracking based on model predictive control with PSO;
FIG. 4 is a schematic diagram of a particle model;
FIG. 5 is a smoothed multi-agent trajectory graph;
FIG. 6 is a schematic diagram of an agent tracking model;
FIG. 7 is a graph of agent path tracking error, where (a)-(f) show the path tracking error of the agent acting as the follower of particle A, B, C, D, E, F, respectively.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
This embodiment provides a multi-agent path planning method based on reinforcement learning and model predictive control. In this embodiment the agents are robots, and the implementation flow is shown in FIG. 1; the method specifically includes the following steps:
Step 1: establish a multi-agent system model and acquire its initial state information. The initial state information includes the number of agents in the multi-agent system model, n; the number of target points, n; the current position coordinate p i of any agent i in the global coordinate frame; and the position coordinate p j of each target point j. The target-point coordinates are given manually according to the requirements of the multi-agent path planning task; (i, j) ∈ n.
Step 2: convert the multi-agent system model into a particle model. The particle model contains n particles corresponding to the n agents; the start-position coordinate of each particle is the current position coordinate of its corresponding agent, and the end-position coordinate of each particle is the position coordinate of its corresponding target point.
An observation range is assigned to the start coordinate of each particle, and an observable range is assigned to the end coordinate of each particle.
As shown in FIG. 4, this embodiment has six agents, i.e. six particles (A, B, C, D, E, F); the white areas are the observation ranges at the six particles' start coordinates, and the black areas are the observable ranges at the six particles' end coordinates.
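As a concrete illustration of this particle abstraction, a minimal Python sketch is given below; the coordinates, the circle-shaped ranges and their radii are assumptions made for illustration and are not taken from the embodiment.

```python
import numpy as np

# Hypothetical particle-model setup for six agents A-F: each particle keeps its
# start coordinate (current agent position), end coordinate (target position),
# and the radii of the observation range at the start and the observable range
# at the goal. All numbers are illustrative only.
names = ["A", "B", "C", "D", "E", "F"]
starts = np.array([[0., 0.], [1., 0.], [2., 0.], [0., 1.], [1., 1.], [2., 1.]])
goals  = np.array([[8., 8.], [9., 8.], [10., 8.], [8., 9.], [9., 9.], [10., 9.]])
r_obs, r_goal = 0.5, 0.5   # radii of the start observation range and goal observable range

def ranges_touch(p, q, r1=r_obs, r2=r_goal):
    """True when the two circular ranges centred at p and q overlap or touch."""
    return np.linalg.norm(p - q) <= r1 + r2
```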
Step 3: perform path planning with the ESB-MADDPG algorithm; the basic flow is shown in FIG. 2.
Step 3.1: compute the reward value r at each time instant according to the reward formula, which is a function of the distance between agent i and target point j.
Step 3.2: from the start-position and end-position coordinates of particle i obtained in step 2, obtain the current-time state o i of particle i through the ESB-MADDPG algorithm. The current-time state o i consists of the current-time coordinate of particle i and the relative positions p ij of the other particles with respect to the current-time coordinate of particle i, where p ij is obtained from its solving formula in terms of the positions of particles i and j.
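A minimal sketch of how such a state and a distance-based reward could be assembled is given below; the concrete forms (p ij taken as the coordinate difference, and the reward taken as the negative distance to the target) are assumptions for illustration, since the published formulas appear only as images.

```python
import numpy as np

def build_state(i, positions):
    """Sketch of o_i: particle i's own coordinate followed by the relative
    positions of the other particles (assumed here to be p_j - p_i)."""
    p_i = positions[i]
    rel = [positions[j] - p_i for j in range(len(positions)) if j != i]
    return np.concatenate([p_i] + rel)

def reward(i, positions, goals):
    """Illustrative distance-based reward: the closer particle i is to its
    target point, the larger the reward (assumed form, not the patented one)."""
    return -np.linalg.norm(positions[i] - goals[i])
```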
Step 3.3: obtain the current-time action a i of particle i in the current-time state o i from the action-estimation network; a i consists of particle i's speeds along the x and y axes and is produced by the action-selection policy under the action-estimation network parameters θ plus a disturbance term, which represents the interference suffered by particle i when selecting the action at the current time.
Step 3.4: after particle i selects and executes the current-time action a i, it reaches a new state o i'.
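One plausible realization of the action-estimation network and of this noisy action selection is sketched below with PyTorch; the network width, the Gaussian form of the disturbance and the velocity bound are illustrative assumptions rather than details taken from the embodiment.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Action-estimation network: maps the particle's observation o_i to a
    2-D velocity command (x- and y-axis speeds), bounded by v_max via tanh."""
    def __init__(self, obs_dim, v_max=1.0, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Tanh(),
        )
        self.v_max = v_max

    def forward(self, obs):
        return self.v_max * self.net(obs)

def select_action(actor, obs, noise_std=0.1):
    """a_i = policy(o_i) + disturbance (Gaussian exploration noise assumed)."""
    with torch.no_grad():
        a = actor(torch.as_tensor(obs, dtype=torch.float32))
    return (a + noise_std * torch.randn_like(a)).numpy()
```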
Step 3.5: repeatedly executing steps 3.3-3.4 for a totalmAt each time, 30 times in this embodiment, the particles are obtainediPlanning the state results of all the time of a path, and obtaining the state of all the time of the training
Figure 467254DEST_PATH_IMAGE014
The positions of the medium particles i are connected to obtain particlesiOf (2) a
Figure 743514DEST_PATH_IMAGE015
Step 3.6: repeating the steps 3.1-3.5 to obtainnPath set of particles
Figure 180312DEST_PATH_IMAGE016
Step 3.7: for the set obtained by training
Figure 869919DEST_PATH_IMAGE016
Judging; the judgment standard is that all particle observation ranges in the final time state
Figure 519206DEST_PATH_IMAGE001
All have observable ranges therein
Figure 243449DEST_PATH_IMAGE002
Exist, i.e. are
Figure 851148DEST_PATH_IMAGE001
And
Figure 529516DEST_PATH_IMAGE002
all contact, if yes, the initial path planning is considered to be finished at the moment, and the step 3.9 is executed;
if not, repeating the steps 3.1-3.6 for a totalMNext, the number of times in this embodiment is 100, so that
Figure 44811DEST_PATH_IMAGE017
Filling the experience pool D, and executing the step 3.8;
here o is the set of the particles' states o i, a is the set of their actions a i, and o' is the set of their new states o i'.
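A minimal sketch of such an experience pool, holding the joint transition tuples (o, a, r, o') before the sampling in step 3.8, might look as follows; the capacity and the interface are illustrative assumptions.

```python
import random
from collections import deque

class ExperiencePool:
    """Sketch of the experience pool D that stores joint transitions (o, a, r, o')."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, o, a, r, o_next):
        self.buffer.append((o, a, r, o_next))

    def sample(self, batch_size):
        # random small batch used by the state-estimation (critic) update in step 3.8
        return random.sample(self.buffer, batch_size)
```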
Step 3.8: randomly sample a small batch of samples (o, a, r, o') from the experience pool D and compute the Q value with the state-estimation network; the Q value evaluates the quality of the actions output by the action-estimation network. In the Q-value expression, γ is the attenuation factor and r' is the reward value obtained at the new time instant.
At the same time, feed the samples into the action-estimation network and update the network parameters θ through the policy-gradient formula; load the updated parameters θ into the action-estimation network and execute step 3.1 again. In the policy-gradient formula, S denotes the number of sampled samples and the gradient operator of the policy-gradient method is applied to the parameters being updated.
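The patented Q-value and policy-gradient formulas are published only as images; as a hedged stand-in, the sketch below follows the standard MADDPG/DDPG update, in which the state-estimation (critic) network regresses towards r' + γ·Q of the new state and the action-estimation (actor) parameters θ take a sampled policy-gradient step. The network interfaces and hyper-parameters are assumptions, and the pool is the ExperiencePool sketched in step 3.7.

```python
import numpy as np
import torch
import torch.nn.functional as F

def maddpg_update(actor, critic, target_critic, pool,
                  actor_opt, critic_opt, batch_size=64, gamma=0.95):
    """Standard MADDPG-style update used as an assumed stand-in for the patented
    formulas; gamma plays the role of the attenuation factor."""
    batch = pool.sample(batch_size)
    o, a, r, o_next = (torch.as_tensor(np.array(x), dtype=torch.float32)
                       for x in zip(*batch))
    # Critic (state-estimation network): regress Q(o, a) towards the reward at
    # the new time plus the discounted value of the new state.
    with torch.no_grad():
        y = r + gamma * target_critic(o_next, actor(o_next)).squeeze(-1)
    critic_loss = F.mse_loss(critic(o, a).squeeze(-1), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor (action-estimation network): policy-gradient step that increases the
    # Q value of the actions proposed by the current policy.
    actor_loss = -critic(o, actor(o)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```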
Step 3.9: smooth the initial paths that meet the requirement and output them. Because the obtained initial paths are non-smooth trajectories formed by connecting a series of straight-line segments, they must be smoothed so that the agents can track them in real time; B-spline curves are used for smoothing, and FIG. 5 shows the smoothed multi-agent paths.
Step 4: track the paths with a model predictive control algorithm; the flow of the algorithm is shown in FIG. 3.
Step 4.1: establish the agent tracking model as shown in FIG. 6. Particle i is set as the virtual leader; its initial-time position is the start position of particle i, and its reference trajectory is the smoothed path of particle i. The agent corresponding to particle i is set as the follower; its initial-time position is the position of agent i in step 1. The ideal control relation between the follower and the virtual leader is specified by the distance l 1 between the virtual leader and the follower, their relative orientation, and their orientation deviation, whose initial values are all 0.
Step 4.2: from the speeds and angular velocities of the virtual leader and the follower in the global coordinate frame and the distance between them, obtain the expression of the control relation between the virtual leader and the follower, and combine it with the kinematic formula of the agent to establish the tracking control model. In the expressions of the leader-follower relation, the agent kinematics and the tracking control model, v F is the speed of the follower, ω F is the angular velocity of the follower, v L is the speed of the leader, ω L is the angular velocity of the leader, and u denotes the follower's speed input.
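The governing expressions themselves are published only as images; the sketch below therefore follows the usual leader-follower convention as an assumed stand-in, computing the separation distance, relative orientation and orientation deviation from the two poses and integrating unicycle kinematics for the agent.

```python
import numpy as np

def leader_follower_state(leader_pose, follower_pose):
    """Assumed form of the control relation (distance, relative orientation,
    orientation deviation) between the virtual leader and the follower."""
    xl, yl, thl = leader_pose
    xf, yf, thf = follower_pose
    dx, dy = xl - xf, yl - yf
    l1 = np.hypot(dx, dy)                     # distance between leader and follower
    phi1 = np.arctan2(dy, dx) - thf           # relative orientation
    psi = thl - thf                           # orientation deviation
    return l1, phi1, psi

def unicycle_step(pose, v, omega, dt=0.1):
    """Unicycle kinematics assumed for the agent, integrated with an Euler step:
    x' = v cos(theta), y' = v sin(theta), theta' = omega."""
    x, y, th = pose
    return (x + v * np.cos(th) * dt, y + v * np.sin(th) * dt, th + omega * dt)
```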
Step 4.3: according to following person
Figure 699302DEST_PATH_IMAGE022
Position and initial velocity at an initial time, and a tracking control model to predict follower output at a next time
Figure 281593DEST_PATH_IMAGE027
;;
Step 4.4: predicted by step 4.3
Figure 800299DEST_PATH_IMAGE027
With setting in step 4.1
Figure 227869DEST_PATH_IMAGE023
Comparing and calculating the error e of the two t And correcting; as shown in fig. 7 (a), the path tracking error of the agent corresponding to the particle a as the follower (b) to(f) Path tracking errors for agents corresponding to the remaining five particles (i.e., B, C, D, E, F) as followers;
Step 4.5: optimize and correct the error e t with the particle swarm optimization (PSO) algorithm, compute the speed input at time t+1, and from it compute the control relation between the follower and the virtual leader at time t+1.
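A plain particle swarm optimizer that could serve this error-correction step is sketched below; the cost function maps a candidate speed input (v, ω) to the predicted tracking error, and all hyper-parameters are illustrative assumptions.

```python
import numpy as np

def pso_minimize(cost, dim=2, n_particles=30, iters=50,
                 bounds=(-1.0, 1.0), w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `cost` (e.g. the predicted tracking error e_t as a function of
    the candidate speed input) with a basic particle swarm."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))      # candidate speed inputs
    v = np.zeros_like(x)
    pbest, pbest_val = x.copy(), np.array([cost(p) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([cost(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest
```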
Step 4.6: judging whether the control termination time is reached, wherein the control termination time is the reference track duration in the step 4.1; if yes, outputting a tracking path, otherwise, executing the step 4.4 again;
step 4.7: and tracking the initial paths of the other 5 particles according to steps 4.1-4.6, and finally completing path planning of all 6 intelligent agents.

Claims (5)

1. A multi-agent path planning method based on reinforcement learning and model predictive control, characterized by comprising the following steps:
Step 1: establishing a multi-agent system model and acquiring its initial state information, the initial state information including the number of agents in the multi-agent system model, n; the number of target points, n; the current position coordinate p i of any agent i in the global coordinate frame; and the position coordinate p j of each target point j, the target-point coordinates being given manually according to the requirements of the multi-agent path planning task; (i, j) ∈ n;
Step 2: converting the multi-agent system model into a particle model;
the particle model contains n particles corresponding to the n agents; the start-position coordinate of each particle is the current position coordinate of its corresponding agent, and the end-position coordinate of each particle is the position coordinate of its corresponding target point;
an observation range is assigned to the start coordinate of each particle, and an observable range is assigned to the end coordinate of each particle;
Step 3: performing path planning with the ESB-MADDPG algorithm;
Step 3.1: computing the reward value r at each time instant according to the reward formula, which is a function of the distance between agent i and target point j;
Step 3.2: from the start-position and end-position coordinates of any particle i obtained in step 2, obtaining the current-time state o i of particle i through the ESB-MADDPG algorithm, the current-time state o i consisting of the current-time coordinate of particle i and the relative positions of the other particles with respect to the current-time coordinate of particle i;
Step 3.3: obtaining the current-time action a i of particle i in the current-time state o i from the action-estimation network, a i consisting of particle i's speeds along the x and y axes and being produced by the action-selection policy under the action-estimation network parameters θ plus a disturbance term representing the interference suffered by particle i when selecting the action at the current time;
Step 3.4: after particle i selects and executes the current-time action a i, particle i reaches a new state o i';
Step 3.5: repeating steps 3.3-3.4 for a total of m time steps, m ≤ 50, to obtain the state results of particle i's path planning at all times, and connecting the positions of particle i in all the time-step state results obtained in this training round to obtain the path set of particle i;
Step 3.6: repeating steps 3.1-3.5 to obtain the path sets of the n particles;
Step 3.7: judging the path sets obtained by training, the criterion being that, for every particle, the observation range of its final-time state and the observable range of its end coordinate are in contact; if so, the initial path planning is considered finished and step 3.9 is executed;
if not, repeating steps 3.1-3.6 a total of M times, M ≥ 100, filling the experience pool D with the tuples (o, a, r, o'), and executing step 3.8, where o is the set of the particles' states o i, a is the set of their actions a i, and o' is the set of their new states o i';
Step 3.8: randomly sampling a small batch of samples (o, a, r, o') from the experience pool D and computing the Q value with the state-estimation network, the Q value being used to evaluate the quality of the actions output by the action-estimation network;
at the same time, feeding the samples into the action-estimation network and updating the action-estimation network parameters θ through the policy-gradient formula, loading the updated parameters θ into the action-estimation network, and returning to step 3.1;
Step 3.9: smoothing the initial paths that meet the requirement and outputting them;
Step 4: tracking the paths with a model predictive control algorithm;
Step 4.1: establishing the agent tracking model;
particle i is set as the virtual leader, whose initial-time position is the start position of particle i and whose reference trajectory is the smoothed path of particle i; the agent corresponding to particle i is set as the follower, whose initial-time position is the position of agent i in step 1; the ideal control relation between the follower and the virtual leader is specified by the distance l 1 between the virtual leader and the follower, their relative orientation, and their orientation deviation, whose initial values are all 0;
Step 4.2: from the speeds and angular velocities of the virtual leader and the follower in the global coordinate frame and the distance between them, obtaining the expression of the control relation between the virtual leader and the follower, and combining it with the kinematic formula of the agent to establish the tracking control model;
Step 4.3: from the follower's position and initial velocity at the initial time and the tracking control model, predicting the control relation between the follower and the virtual leader at time t;
Step 4.4: comparing the control relation predicted in step 4.3 with the ideal control relation set in step 4.1, computing the error e t between them, and correcting it;
Step 4.5: optimizing and correcting the error e t with the particle swarm algorithm, and computing, through the speed input at time t+1, the control relation between the follower and the virtual leader at time t+1;
Step 4.6: judging whether the control termination time has been reached, and if so, outputting the tracking path, otherwise returning to step 4.4, the control termination time being the duration of the reference trajectory in step 4.1;
Step 4.7: tracking the initial paths of the remaining particles according to steps 4.1-4.6, and finally completing the path planning of all agents.
2. The reinforcement learning and model predictive control-based multi-agent path planning method of claim 1, wherein: the tracking control model in step 4.2 is expressed by combining the expression of the relation between the virtual leader and the follower with the kinematic formula of the agent, where v F is the speed of the follower, ω F is the angular velocity of the follower, v L is the speed of the virtual leader, ω L is the angular velocity of the virtual leader, and u denotes the follower's speed input.
3. The reinforcement learning and model predictive control-based multi-agent path planning method of claim 1, wherein: in step 3.2 the state o i is composed of the current-time coordinate of particle i together with the relative positions p ij of the other particles, and p ij is obtained from its solving formula in terms of the positions of particles i and j.
4. The reinforcement learning and model predictive control-based multi-agent path planning method of claim 1, wherein: in step 3.8 the Q value is computed from its expression, in which γ is the attenuation factor and r' is the reward value obtained at the new time instant.
5. The reinforcement learning and model predictive control-based multi-agent path planning method of claim 1, wherein: in step 3.8 the policy-gradient formula used to update the action-estimation network parameters θ involves S, the number of sampled samples, and the gradient operator of the policy-gradient method applied to the parameters being updated.
CN202111107563.7A 2021-09-22 2021-09-22 Multi-agent path planning method based on reinforcement learning and model predictive control Active CN113759929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111107563.7A CN113759929B (en) 2021-09-22 2021-09-22 Multi-agent path planning method based on reinforcement learning and model predictive control

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111107563.7A CN113759929B (en) 2021-09-22 2021-09-22 Multi-agent path planning method based on reinforcement learning and model predictive control

Publications (2)

Publication Number Publication Date
CN113759929A CN113759929A (en) 2021-12-07
CN113759929B true CN113759929B (en) 2022-08-23

Family

ID=78796675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111107563.7A Active CN113759929B (en) 2021-09-22 2021-09-22 Multi-agent path planning method based on reinforcement learning and model predictive control

Country Status (1)

Country Link
CN (1) CN113759929B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114857991B (en) * 2022-05-26 2023-06-13 西安航天动力研究所 Control method and system for automatically tracking shooting direction of target plane in tactical training

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992000A (en) * 2019-04-04 2019-07-09 北京航空航天大学 A kind of multiple no-manned plane path collaborative planning method and device based on Hierarchical reinforcement learning
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN112488310A (en) * 2020-11-11 2021-03-12 厦门渊亭信息科技有限公司 Multi-agent group cooperation strategy automatic generation method
CN112488359A (en) * 2020-11-02 2021-03-12 杭州电子科技大学 Multi-agent static multi-target enclosure method based on RRT and OSPA distances
CN113110509A (en) * 2021-05-17 2021-07-13 哈尔滨工业大学(深圳) Warehousing system multi-robot path planning method based on deep reinforcement learning
CN113341958A (en) * 2021-05-21 2021-09-03 西北工业大学 Multi-agent reinforcement learning movement planning method with mixed experience
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3060900A1 (en) * 2018-11-05 2020-05-05 Royal Bank Of Canada System and method for deep reinforcement learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992000A (en) * 2019-04-04 2019-07-09 北京航空航天大学 A kind of multiple no-manned plane path collaborative planning method and device based on Hierarchical reinforcement learning
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN112488359A (en) * 2020-11-02 2021-03-12 杭州电子科技大学 Multi-agent static multi-target enclosure method based on RRT and OSPA distances
CN112488310A (en) * 2020-11-11 2021-03-12 厦门渊亭信息科技有限公司 Multi-agent group cooperation strategy automatic generation method
CN113110509A (en) * 2021-05-17 2021-07-13 哈尔滨工业大学(深圳) Warehousing system multi-robot path planning method based on deep reinforcement learning
CN113341958A (en) * 2021-05-21 2021-09-03 西北工业大学 Multi-agent reinforcement learning movement planning method with mixed experience
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Prediction model for cooperative attitude control of multiple satellites based on neural networks; 宁宇 et al.; Aerospace Control and Application (空间控制技术与应用); 2020-04-15 (No. 02); full text *

Also Published As

Publication number Publication date
CN113759929A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
Jiang et al. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge
CN112685165A (en) Multi-target cloud workflow scheduling method based on joint reinforcement learning strategy
CN114741886B (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
Wang et al. Model-based reinforcement learning for decentralized multiagent rendezvous
CN110716575A (en) UUV real-time collision avoidance planning method based on deep double-Q network reinforcement learning
Balakrishna et al. On-policy robot imitation learning from a converging supervisor
CN113759929B (en) Multi-agent path planning method based on reinforcement learning and model predictive control
Huang et al. To imitate or not to imitate: Boosting reinforcement learning-based construction robotic control for long-horizon tasks using virtual demonstrations
Wu et al. Torch: Strategy evolution in swarm robots using heterogeneous–homogeneous coevolution method
CN110716574A (en) UUV real-time collision avoidance planning method based on deep Q network
CN118201742A (en) Multi-robot coordination using a graph neural network
Hafez et al. Improving robot dual-system motor learning with intrinsically motivated meta-control and latent-space experience imagination
Sumiea et al. Enhanced deep deterministic policy gradient algorithm using grey wolf optimizer for continuous control tasks
Chen et al. Survey of multi-agent strategy based on reinforcement learning
CN115366099B (en) Mechanical arm depth deterministic strategy gradient training method based on forward kinematics
CN116578080A (en) Local path planning method based on deep reinforcement learning
Zhang et al. Auto-conditioned recurrent mixture density networks for learning generalizable robot skills
CN116340737A (en) Heterogeneous cluster zero communication target distribution method based on multi-agent reinforcement learning
CN116149179A (en) Non-uniform track length differential evolution iterative learning control method for robot fish
CN115273502A (en) Traffic signal cooperative control method
Riccio et al. LoOP: Iterative learning for optimistic planning on robots
CN114118371A (en) Intelligent agent deep reinforcement learning method and computer readable medium
Xu et al. Reinforcement learning with construction robots: A preliminary review of research areas, challenges and opportunities
Yu et al. A novel automated guided vehicle (AGV) remote path planning based on RLACA algorithm in 5G environment
Tang et al. Hierarchical reinforcement learning based on multi-agent cooperation game theory

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant