CN113759929B - Multi-agent path planning method based on reinforcement learning and model predictive control - Google Patents

Multi-agent path planning method based on reinforcement learning and model predictive control

Info

Publication number
CN113759929B
CN113759929B (application CN202111107563.7A)
Authority
CN
China
Prior art keywords
agent
time
path planning
follower
particle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111107563.7A
Other languages
Chinese (zh)
Other versions
CN113759929A (en)
Inventor
杜飞平
王春民
李嘉豪
于登秀
王震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Xian Aerospace Propulsion Institute
Original Assignee
Northwestern Polytechnical University
Xian Aerospace Propulsion Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University and Xian Aerospace Propulsion Institute
Priority to CN202111107563.7A priority Critical patent/CN113759929B/en
Publication of CN113759929A publication Critical patent/CN113759929A/en
Application granted granted Critical
Publication of CN113759929B publication Critical patent/CN113759929B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-agent path planning method based on reinforcement learning and model predictive control. For the multi-agent path planning problem, a path planning and tracking method combining the ESB-MADDPG and MPC algorithms is used, with the following basic steps: first, the multi-agent system model is simplified into a particle model; then the ESB-MADDPG algorithm is used for path planning; finally, all planned paths are tracked through model predictive control. In this way, path planning for the multi-agent system can be realized quickly, laying a foundation for large-scale multi-agent systems to execute tasks.

Description

Multi-agent path planning method based on reinforcement learning and model predictive control
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a multi-agent path planning method based on reinforcement learning and model predictive control.
Background
With the development and maturation of artificial intelligence theory and related research technologies, multi-agent systems are being studied and applied ever more widely. A multi-agent system is an autonomous intelligent system that uses interaction behaviors such as information exchange and feedback, excitation and response to coordinate its behavior, adapt to a dynamic environment, and ultimately complete specific tasks together.
Multi-agent systems are an important application field of swarm intelligence and one of the important directions for the future development of intelligent systems. Path planning is a research focus for multi-agent systems; it concentrates on the globally optimal path of the whole system, for example the shortest total path length or the smallest total energy consumption of the system's paths. Only by planning the most effective paths for the whole system can the efficiency and success rate of the multi-agent system in executing tasks be improved.
At present, path planning for a multi-agent system typically optimizes only a single objective. For multi-agent path planning with multi-objective optimization requirements, existing path planning methods have difficulty realizing path optimization between multiple agents and multiple targets.
Disclosure of Invention
Aiming at the problem that existing path planning methods cannot realize path optimization between multiple agents and multiple targets, the invention provides a multi-agent path planning method based on reinforcement learning and model predictive control.
The basic design idea of the invention is as follows:
On the basis of the multi-agent deep deterministic policy gradient (MADDPG), the idea of an expert system (ESB) and model predictive control (MPC) are added. First, the agents in the multi-agent system are simplified into a particle model; then an expert system is introduced to smooth the paths produced by the MADDPG algorithm and to shorten the convergence time; finally, all paths obtained by the ESB-MADDPG algorithm are collected and tracked through model predictive control, so that the multi-agent system can realize path planning that meets multi-objective optimization requirements.
The specific technical scheme of the invention is as follows:
A multi-agent path planning method based on reinforcement learning and model predictive control comprises the following steps:
Step 1: establish a multi-agent system model and acquire its initial state information. The initial state information includes the number of agents in the multi-agent system model, n; the number of target points, n; the current position coordinate p i of any agent i in the global coordinate frame; and the position coordinate p j of each target point j. The target-point coordinates are given manually according to the requirements of the multi-agent path planning task; (i, j) ∈ n.
Step 2: convert the multi-agent system model into a particle model.
The particle model contains n particles corresponding to the n agents; the start-position coordinate of each particle is the current position coordinate of its corresponding agent, and the end-position coordinate of each particle is the position coordinate of its corresponding target point.
An observation range is assigned to the start coordinate of each particle, and an observable range is assigned to the end coordinate of each particle.
Step 3: perform path planning with the ESB-MADDPG algorithm.
Step 3.1: compute the reward value r at each time instant according to the reward formula, which is a function of the distance between agent i and target point j.
Step 3.2: from the start-position and end-position coordinates of any particle i obtained in step 2, obtain the current-time state o i of particle i through the ESB-MADDPG algorithm. The current-time state o i consists of the current-time coordinate of particle i and the relative positions of the other particles with respect to the current-time coordinate of particle i.
Step 3.3: obtain the current-time action a i of particle i in the current-time state o i from the action-estimation network; a i consists of particle i's speeds along the x and y axes and is produced by the action-selection policy under the action-estimation network parameters θ plus a disturbance term, which represents the interference suffered by particle i when selecting the action at the current time.
Step 3.4: after particle i selects and executes the current-time action a i, it reaches a new state o i'.
Step 3.5: repeat steps 3.3-3.4 for a total of m time steps, m ≤ 50, to obtain the state results of particle i's path planning at all times; connect the positions of particle i in all the time-step state results obtained in this training round to obtain the path set of particle i.
Step 3.6: repeat steps 3.1-3.5 to obtain the path sets of the n particles.
Step 3.7: judge the path sets obtained by training. The criterion is that, for every particle, the observation range of its final-time state and the observable range of its end coordinate are in contact; if so, the initial path planning is considered finished and step 3.9 is executed.
If not, repeat steps 3.1-3.6 a total of M times, M ≥ 100, and fill the experience pool D with the tuples (o, a, r, o'), then execute step 3.8; here o is the set of the particles' states o i, a is the set of their actions a i, and o' is the set of their new states o i'.
Step 3.8: randomly sample a small batch of samples (o, a, r, o') from the experience pool D and compute the Q value with the state-estimation network; the Q value evaluates the quality of the actions output by the action-estimation network.
At the same time, feed the samples into the action-estimation network and update the action-estimation network parameters θ through the policy-gradient formula; load the updated parameters θ into the action-estimation network and return to step 3.1.
Step 3.9: smooth the initial paths that meet the requirement and output them.
Step 4: track the paths with a model predictive control algorithm.
Step 4.1: establish the agent tracking model.
Particle i is set as the virtual leader; its initial-time position is the start position of particle i, and its reference trajectory is the smoothed path of particle i. The agent corresponding to particle i is set as the follower; its initial-time position is the position of agent i in step 1. The ideal control relation between the follower and the virtual leader is specified by the distance l 1 between the virtual leader and the follower, their relative orientation, and their orientation deviation, whose initial values are all 0.
Step 4.2: from the speeds and angular velocities of the virtual leader and the follower in the global coordinate frame and the distance between them, obtain the expression of the control relation between the virtual leader and the follower, and combine it with the kinematic formula of the agent to establish the tracking control model.
Step 4.3: from the follower's position and initial velocity at the initial time and the tracking control model, predict the control relation between the follower and the virtual leader at time t.
Step 4.4: compare the control relation predicted in step 4.3 with the ideal control relation set in step 4.1, compute the error e t between them, and correct it.
Step 4.5: optimize and correct the error e t with the particle swarm algorithm, and from the speed input at time t+1 compute the control relation between the follower and the virtual leader at time t+1.
Step 4.6: judge whether the control termination time has been reached; if so, output the tracking path, otherwise return to step 4.4. The control termination time is the duration of the reference trajectory in step 4.1.
Step 4.7: track the initial paths of the remaining particles according to steps 4.1-4.6, and finally complete the path planning of all agents.
Further, the tracking control model in step 4.2 is expressed by combining the expression of the relation between the virtual leader and the follower with the kinematic formula of the agent, where v F is the speed of the follower, ω F is the angular velocity of the follower, v L is the speed of the virtual leader, ω L is the angular velocity of the virtual leader, and u denotes the follower's speed input.
Further, in step 3.2 the state o i is composed of the current-time coordinate of particle i together with the relative positions p ij of the other particles, and p ij is obtained from its solving formula in terms of the positions of particles i and j.
further, in the above step 3.8QThe expression of (a) is:
Figure 653408DEST_PATH_IMAGE040
wherein:
Figure 129388DEST_PATH_IMAGE041
in order to be able to use the attenuation factor,
Figure 685135DEST_PATH_IMAGE042
a value is awarded for a new time instant obtained at a new time instant.
Further, in step 3.8 the policy-gradient formula used to update the action-estimation network parameters θ involves S, the number of sampled samples, and the gradient operator of the policy-gradient method applied to the parameters being updated.
The invention has the following beneficial effects:
1. Aiming at the multi-agent path planning problem, the invention uses a path planning and tracking algorithm that combines the ESB-MADDPG and MPC algorithms, quickly realizing path planning for the multi-agent system and laying a foundation for large-scale multi-agent systems to execute tasks.
2. By designing the reward value in the ESB-MADDPG algorithm and the neural networks within the algorithm, the invention avoids mutual interference between the paths of the individual particles and makes the distance of each path to its target point the shortest; by introducing the kinematic model of the agent through the MPC algorithm, the speed along the agent's tracked path can be optimized, yielding optimized multi-agent paths.
Drawings
FIG. 1 is a flow chart of a basic implementation of the present invention;
FIG. 2 is a flow chart of path planning using ESB-MADDPG;
FIG. 3 is a flow chart of path tracking based on model predictive control with PSO;
FIG. 4 is a schematic diagram of a particle model;
FIG. 5 is a smoothed multi-agent trajectory graph;
FIG. 6 is a schematic diagram of an agent tracking model;
FIG. 7 is a graph of agent path tracking error, where (a)-(f) show the path tracking error of the agent acting as the follower of particle A, B, C, D, E, F, respectively.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
This embodiment provides a multi-agent path planning method based on reinforcement learning and model predictive control. In this embodiment the agents are robots, and the implementation flow is shown in FIG. 1; the method specifically includes the following steps:
Step 1: establish a multi-agent system model and acquire its initial state information. The initial state information includes the number of agents in the multi-agent system model, n; the number of target points, n; the current position coordinate p i of any agent i in the global coordinate frame; and the position coordinate p j of each target point j. The target-point coordinates are given manually according to the requirements of the multi-agent path planning task; (i, j) ∈ n.
Step 2: convert the multi-agent system model into a particle model. The particle model contains n particles corresponding to the n agents; the start-position coordinate of each particle is the current position coordinate of its corresponding agent, and the end-position coordinate of each particle is the position coordinate of its corresponding target point.
An observation range is assigned to the start coordinate of each particle, and an observable range is assigned to the end coordinate of each particle.
As shown in FIG. 4, this embodiment has six agents, i.e. six particles (A, B, C, D, E, F); the white areas are the observation ranges at the six particles' start coordinates, and the black areas are the observable ranges at the six particles' end coordinates.
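As a concrete illustration of this particle abstraction, a minimal Python sketch is given below; the coordinates, the circle-shaped ranges and their radii are assumptions made for illustration and are not taken from the embodiment.

```python
import numpy as np

# Hypothetical particle-model setup for six agents A-F: each particle keeps its
# start coordinate (current agent position), end coordinate (target position),
# and the radii of the observation range at the start and the observable range
# at the goal. All numbers are illustrative only.
names = ["A", "B", "C", "D", "E", "F"]
starts = np.array([[0., 0.], [1., 0.], [2., 0.], [0., 1.], [1., 1.], [2., 1.]])
goals  = np.array([[8., 8.], [9., 8.], [10., 8.], [8., 9.], [9., 9.], [10., 9.]])
r_obs, r_goal = 0.5, 0.5   # radii of the start observation range and goal observable range

def ranges_touch(p, q, r1=r_obs, r2=r_goal):
    """True when the two circular ranges centred at p and q overlap or touch."""
    return np.linalg.norm(p - q) <= r1 + r2
```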
Step 3: perform path planning with the ESB-MADDPG algorithm; the basic flow is shown in FIG. 2.
Step 3.1: compute the reward value r at each time instant according to the reward formula, which is a function of the distance between agent i and target point j.
Step 3.2: from the start-position and end-position coordinates of particle i obtained in step 2, obtain the current-time state o i of particle i through the ESB-MADDPG algorithm. The current-time state o i consists of the current-time coordinate of particle i and the relative positions p ij of the other particles with respect to the current-time coordinate of particle i, where p ij is obtained from its solving formula in terms of the positions of particles i and j.
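A minimal sketch of how such a state and a distance-based reward could be assembled is given below; the concrete forms (p ij taken as the coordinate difference, and the reward taken as the negative distance to the target) are assumptions for illustration, since the published formulas appear only as images.

```python
import numpy as np

def build_state(i, positions):
    """Sketch of o_i: particle i's own coordinate followed by the relative
    positions of the other particles (assumed here to be p_j - p_i)."""
    p_i = positions[i]
    rel = [positions[j] - p_i for j in range(len(positions)) if j != i]
    return np.concatenate([p_i] + rel)

def reward(i, positions, goals):
    """Illustrative distance-based reward: the closer particle i is to its
    target point, the larger the reward (assumed form, not the patented one)."""
    return -np.linalg.norm(positions[i] - goals[i])
```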
Step 3.3: obtain the current-time action a i of particle i in the current-time state o i from the action-estimation network; a i consists of particle i's speeds along the x and y axes and is produced by the action-selection policy under the action-estimation network parameters θ plus a disturbance term, which represents the interference suffered by particle i when selecting the action at the current time.
Step 3.4: after particle i selects and executes the current-time action a i, it reaches a new state o i'.
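One plausible realization of the action-estimation network and of this noisy action selection is sketched below with PyTorch; the network width, the Gaussian form of the disturbance and the velocity bound are illustrative assumptions rather than details taken from the embodiment.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Action-estimation network: maps the particle's observation o_i to a
    2-D velocity command (x- and y-axis speeds), bounded by v_max via tanh."""
    def __init__(self, obs_dim, v_max=1.0, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Tanh(),
        )
        self.v_max = v_max

    def forward(self, obs):
        return self.v_max * self.net(obs)

def select_action(actor, obs, noise_std=0.1):
    """a_i = policy(o_i) + disturbance (Gaussian exploration noise assumed)."""
    with torch.no_grad():
        a = actor(torch.as_tensor(obs, dtype=torch.float32))
    return (a + noise_std * torch.randn_like(a)).numpy()
```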
Step 3.5: repeatedly executing steps 3.3-3.4 for a totalmAt each time, 30 times in this embodiment, the particles are obtainediPlanning the state results of all the time of a path, and obtaining the state of all the time of the training
Figure 467254DEST_PATH_IMAGE014
The positions of the medium particles i are connected to obtain particlesiOf (2) a
Figure 743514DEST_PATH_IMAGE015
Step 3.6: repeating the steps 3.1-3.5 to obtainnPath set of particles
Figure 180312DEST_PATH_IMAGE016
Step 3.7: for the set obtained by training
Figure 869919DEST_PATH_IMAGE016
Judging; the judgment standard is that all particle observation ranges in the final time state
Figure 519206DEST_PATH_IMAGE001
All have observable ranges therein
Figure 243449DEST_PATH_IMAGE002
Exist, i.e. are
Figure 851148DEST_PATH_IMAGE001
And
Figure 529516DEST_PATH_IMAGE002
all contact, if yes, the initial path planning is considered to be finished at the moment, and the step 3.9 is executed;
if not, repeating the steps 3.1-3.6 for a totalMNext, the number of times in this embodiment is 100, so that
Figure 44811DEST_PATH_IMAGE017
Filling the experience pool D, and executing the step 3.8;
here o is the set of the particles' states o i, a is the set of their actions a i, and o' is the set of their new states o i'.
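A minimal sketch of such an experience pool, holding the joint transition tuples (o, a, r, o') before the sampling in step 3.8, might look as follows; the capacity and the interface are illustrative assumptions.

```python
import random
from collections import deque

class ExperiencePool:
    """Sketch of the experience pool D that stores joint transitions (o, a, r, o')."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, o, a, r, o_next):
        self.buffer.append((o, a, r, o_next))

    def sample(self, batch_size):
        # random small batch used by the state-estimation (critic) update in step 3.8
        return random.sample(self.buffer, batch_size)
```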
Step 3.8: randomly sample a small batch of samples (o, a, r, o') from the experience pool D and compute the Q value with the state-estimation network; the Q value evaluates the quality of the actions output by the action-estimation network. In the Q-value expression, γ is the attenuation factor and r' is the reward value obtained at the new time instant.
At the same time, feed the samples into the action-estimation network and update the network parameters θ through the policy-gradient formula; load the updated parameters θ into the action-estimation network and execute step 3.1 again. In the policy-gradient formula, S denotes the number of sampled samples and the gradient operator of the policy-gradient method is applied to the parameters being updated.
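The patented Q-value and policy-gradient formulas are published only as images; as a hedged stand-in, the sketch below follows the standard MADDPG/DDPG update, in which the state-estimation (critic) network regresses towards r' + γ·Q of the new state and the action-estimation (actor) parameters θ take a sampled policy-gradient step. The network interfaces and hyper-parameters are assumptions, and the pool is the ExperiencePool sketched in step 3.7.

```python
import numpy as np
import torch
import torch.nn.functional as F

def maddpg_update(actor, critic, target_critic, pool,
                  actor_opt, critic_opt, batch_size=64, gamma=0.95):
    """Standard MADDPG-style update used as an assumed stand-in for the patented
    formulas; gamma plays the role of the attenuation factor."""
    batch = pool.sample(batch_size)
    o, a, r, o_next = (torch.as_tensor(np.array(x), dtype=torch.float32)
                       for x in zip(*batch))
    # Critic (state-estimation network): regress Q(o, a) towards the reward at
    # the new time plus the discounted value of the new state.
    with torch.no_grad():
        y = r + gamma * target_critic(o_next, actor(o_next)).squeeze(-1)
    critic_loss = F.mse_loss(critic(o, a).squeeze(-1), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor (action-estimation network): policy-gradient step that increases the
    # Q value of the actions proposed by the current policy.
    actor_loss = -critic(o, actor(o)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```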
Step 3.9: smooth the initial paths that meet the requirement and output them. Because the obtained initial paths are non-smooth trajectories formed by connecting a series of straight-line segments, they must be smoothed so that the agents can track them in real time; B-spline curves are used for smoothing, and FIG. 5 shows the smoothed multi-agent paths.
Step 4: track the paths with a model predictive control algorithm; the flow of the algorithm is shown in FIG. 3.
Step 4.1: establish the agent tracking model as shown in FIG. 6. Particle i is set as the virtual leader; its initial-time position is the start position of particle i, and its reference trajectory is the smoothed path of particle i. The agent corresponding to particle i is set as the follower; its initial-time position is the position of agent i in step 1. The ideal control relation between the follower and the virtual leader is specified by the distance l 1 between the virtual leader and the follower, their relative orientation, and their orientation deviation, whose initial values are all 0.
Step 4.2: from the speeds and angular velocities of the virtual leader and the follower in the global coordinate frame and the distance between them, obtain the expression of the control relation between the virtual leader and the follower, and combine it with the kinematic formula of the agent to establish the tracking control model. In the expressions of the leader-follower relation, the agent kinematics and the tracking control model, v F is the speed of the follower, ω F is the angular velocity of the follower, v L is the speed of the leader, ω L is the angular velocity of the leader, and u denotes the follower's speed input.
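The governing expressions themselves are published only as images; the sketch below therefore follows the usual leader-follower convention as an assumed stand-in, computing the separation distance, relative orientation and orientation deviation from the two poses and integrating unicycle kinematics for the agent.

```python
import numpy as np

def leader_follower_state(leader_pose, follower_pose):
    """Assumed form of the control relation (distance, relative orientation,
    orientation deviation) between the virtual leader and the follower."""
    xl, yl, thl = leader_pose
    xf, yf, thf = follower_pose
    dx, dy = xl - xf, yl - yf
    l1 = np.hypot(dx, dy)                     # distance between leader and follower
    phi1 = np.arctan2(dy, dx) - thf           # relative orientation
    psi = thl - thf                           # orientation deviation
    return l1, phi1, psi

def unicycle_step(pose, v, omega, dt=0.1):
    """Unicycle kinematics assumed for the agent, integrated with an Euler step:
    x' = v cos(theta), y' = v sin(theta), theta' = omega."""
    x, y, th = pose
    return (x + v * np.cos(th) * dt, y + v * np.sin(th) * dt, th + omega * dt)
```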
Step 4.3: according to following person
Figure 699302DEST_PATH_IMAGE022
Position and initial velocity at an initial time, and a tracking control model to predict follower output at a next time
Figure 281593DEST_PATH_IMAGE027
;;
Step 4.4: predicted by step 4.3
Figure 800299DEST_PATH_IMAGE027
With setting in step 4.1
Figure 227869DEST_PATH_IMAGE023
Comparing and calculating the error e of the two t And correcting; as shown in fig. 7 (a), the path tracking error of the agent corresponding to the particle a as the follower (b) to(f) Path tracking errors for agents corresponding to the remaining five particles (i.e., B, C, D, E, F) as followers;
Step 4.5: optimize and correct the error e t with the particle swarm optimization (PSO) algorithm, compute the speed input at time t+1, and from it compute the control relation between the follower and the virtual leader at time t+1.
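A plain particle swarm optimizer that could serve this error-correction step is sketched below; the cost function maps a candidate speed input (v, ω) to the predicted tracking error, and all hyper-parameters are illustrative assumptions.

```python
import numpy as np

def pso_minimize(cost, dim=2, n_particles=30, iters=50,
                 bounds=(-1.0, 1.0), w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `cost` (e.g. the predicted tracking error e_t as a function of
    the candidate speed input) with a basic particle swarm."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))      # candidate speed inputs
    v = np.zeros_like(x)
    pbest, pbest_val = x.copy(), np.array([cost(p) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([cost(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest
```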
Step 4.6: judging whether the control termination time is reached, wherein the control termination time is the reference track duration in the step 4.1; if yes, outputting a tracking path, otherwise, executing the step 4.4 again;
step 4.7: and tracking the initial paths of the other 5 particles according to steps 4.1-4.6, and finally completing path planning of all 6 intelligent agents.

Claims (5)

1. A multi-agent path planning method based on reinforcement learning and model predictive control, characterized by comprising the following steps:
Step 1: establishing a multi-agent system model and acquiring its initial state information, the initial state information including the number of agents in the multi-agent system model, n; the number of target points, n; the current position coordinate p i of any agent i in the global coordinate frame; and the position coordinate p j of each target point j, the target-point coordinates being given manually according to the requirements of the multi-agent path planning task; (i, j) ∈ n;
Step 2: converting the multi-agent system model into a particle model;
the particle model contains n particles corresponding to the n agents; the start-position coordinate of each particle is the current position coordinate of its corresponding agent, and the end-position coordinate of each particle is the position coordinate of its corresponding target point;
an observation range is assigned to the start coordinate of each particle, and an observable range is assigned to the end coordinate of each particle;
Step 3: performing path planning with the ESB-MADDPG algorithm;
Step 3.1: computing the reward value r at each time instant according to the reward formula, which is a function of the distance between agent i and target point j;
Step 3.2: from the start-position and end-position coordinates of any particle i obtained in step 2, obtaining the current-time state o i of particle i through the ESB-MADDPG algorithm, the current-time state o i consisting of the current-time coordinate of particle i and the relative positions of the other particles with respect to the current-time coordinate of particle i;
Step 3.3: obtaining the current-time action a i of particle i in the current-time state o i from the action-estimation network, a i consisting of particle i's speeds along the x and y axes and being produced by the action-selection policy under the action-estimation network parameters θ plus a disturbance term representing the interference suffered by particle i when selecting the action at the current time;
Step 3.4: after particle i selects and executes the current-time action a i, particle i reaches a new state o i';
Step 3.5: repeating steps 3.3-3.4 for a total of m time steps, m ≤ 50, to obtain the state results of particle i's path planning at all times, and connecting the positions of particle i in all the time-step state results obtained in this training round to obtain the path set of particle i;
Step 3.6: repeating steps 3.1-3.5 to obtain the path sets of the n particles;
Step 3.7: judging the path sets obtained by training, the criterion being that, for every particle, the observation range of its final-time state and the observable range of its end coordinate are in contact; if so, the initial path planning is considered finished and step 3.9 is executed;
if not, repeating steps 3.1-3.6 a total of M times, M ≥ 100, filling the experience pool D with the tuples (o, a, r, o'), and executing step 3.8, where o is the set of the particles' states o i, a is the set of their actions a i, and o' is the set of their new states o i';
Step 3.8: randomly sampling a small batch of samples (o, a, r, o') from the experience pool D and computing the Q value with the state-estimation network, the Q value being used to evaluate the quality of the actions output by the action-estimation network;
at the same time, feeding the samples into the action-estimation network and updating the action-estimation network parameters θ through the policy-gradient formula, loading the updated parameters θ into the action-estimation network, and returning to step 3.1;
Step 3.9: smoothing the initial paths that meet the requirement and outputting them;
Step 4: tracking the paths with a model predictive control algorithm;
Step 4.1: establishing the agent tracking model;
particle i is set as the virtual leader, whose initial-time position is the start position of particle i and whose reference trajectory is the smoothed path of particle i; the agent corresponding to particle i is set as the follower, whose initial-time position is the position of agent i in step 1; the ideal control relation between the follower and the virtual leader is specified by the distance l 1 between the virtual leader and the follower, their relative orientation, and their orientation deviation, whose initial values are all 0;
Step 4.2: from the speeds and angular velocities of the virtual leader and the follower in the global coordinate frame and the distance between them, obtaining the expression of the control relation between the virtual leader and the follower, and combining it with the kinematic formula of the agent to establish the tracking control model;
Step 4.3: from the follower's position and initial velocity at the initial time and the tracking control model, predicting the control relation between the follower and the virtual leader at time t;
Step 4.4: comparing the control relation predicted in step 4.3 with the ideal control relation set in step 4.1, computing the error e t between them, and correcting it;
Step 4.5: optimizing and correcting the error e t with the particle swarm algorithm, and computing, through the speed input at time t+1, the control relation between the follower and the virtual leader at time t+1;
Step 4.6: judging whether the control termination time has been reached, and if so, outputting the tracking path, otherwise returning to step 4.4, the control termination time being the duration of the reference trajectory in step 4.1;
Step 4.7: tracking the initial paths of the remaining particles according to steps 4.1-4.6, and finally completing the path planning of all agents.
2. The reinforcement learning and model predictive control-based multi-agent path planning method of claim 1, wherein: the tracking control model in step 4.2 is expressed by combining the expression of the relation between the virtual leader and the follower with the kinematic formula of the agent, where v F is the speed of the follower, ω F is the angular velocity of the follower, v L is the speed of the virtual leader, ω L is the angular velocity of the virtual leader, and u denotes the follower's speed input.
3. The reinforcement learning and model predictive control-based multi-agent path planning method of claim 1, wherein: in step 3.2 the state o i is composed of the current-time coordinate of particle i together with the relative positions p ij of the other particles, and p ij is obtained from its solving formula in terms of the positions of particles i and j.
4. The reinforcement learning and model predictive control-based multi-agent path planning method of claim 1, wherein: in step 3.8 the Q value is computed from its expression, in which γ is the attenuation factor and r' is the reward value obtained at the new time instant.
5. The reinforcement learning and model predictive control-based multi-agent path planning method of claim 1, wherein: in step 3.8 the policy-gradient formula used to update the action-estimation network parameters θ involves S, the number of sampled samples, and the gradient operator of the policy-gradient method applied to the parameters being updated.
CN202111107563.7A 2021-09-22 2021-09-22 Multi-agent path planning method based on reinforcement learning and model predictive control Active CN113759929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111107563.7A CN113759929B (en) 2021-09-22 2021-09-22 Multi-agent path planning method based on reinforcement learning and model predictive control

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111107563.7A CN113759929B (en) 2021-09-22 2021-09-22 Multi-agent path planning method based on reinforcement learning and model predictive control

Publications (2)

Publication Number Publication Date
CN113759929A CN113759929A (en) 2021-12-07
CN113759929B true CN113759929B (en) 2022-08-23

Family

ID=78796675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111107563.7A Active CN113759929B (en) 2021-09-22 2021-09-22 Multi-agent path planning method based on reinforcement learning and model predictive control

Country Status (1)

Country Link
CN (1) CN113759929B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114857991B (en) * 2022-05-26 2023-06-13 西安航天动力研究所 Control method and system for automatically tracking shooting direction of target plane in tactical training

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992000A (en) * 2019-04-04 2019-07-09 北京航空航天大学 A kind of multiple no-manned plane path collaborative planning method and device based on Hierarchical reinforcement learning
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN112488310A (en) * 2020-11-11 2021-03-12 厦门渊亭信息科技有限公司 Multi-agent group cooperation strategy automatic generation method
CN112488359A (en) * 2020-11-02 2021-03-12 杭州电子科技大学 Multi-agent static multi-target enclosure method based on RRT and OSPA distances
CN113110509A (en) * 2021-05-17 2021-07-13 哈尔滨工业大学(深圳) Warehousing system multi-robot path planning method based on deep reinforcement learning
CN113341958A (en) * 2021-05-21 2021-09-03 西北工业大学 Multi-agent reinforcement learning movement planning method with mixed experience
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3060900A1 (en) * 2018-11-05 2020-05-05 Royal Bank Of Canada System and method for deep reinforcement learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992000A (en) * 2019-04-04 2019-07-09 北京航空航天大学 A kind of multiple no-manned plane path collaborative planning method and device based on Hierarchical reinforcement learning
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN112488359A (en) * 2020-11-02 2021-03-12 杭州电子科技大学 Multi-agent static multi-target enclosure method based on RRT and OSPA distances
CN112488310A (en) * 2020-11-11 2021-03-12 厦门渊亭信息科技有限公司 Multi-agent group cooperation strategy automatic generation method
CN113110509A (en) * 2021-05-17 2021-07-13 哈尔滨工业大学(深圳) Warehousing system multi-robot path planning method based on deep reinforcement learning
CN113341958A (en) * 2021-05-21 2021-09-03 西北工业大学 Multi-agent reinforcement learning movement planning method with mixed experience
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Prediction model for cooperative attitude control of multiple satellites based on neural networks; 宁宇 et al.; Aerospace Control and Application (空间控制技术与应用); 2020-04-15 (No. 02); full text *

Also Published As

Publication number Publication date
CN113759929A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
Jiang et al. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge
CN112685165A (en) Multi-target cloud workflow scheduling method based on joint reinforcement learning strategy
CN114741886B (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
Wang et al. Model-based reinforcement learning for decentralized multiagent rendezvous
CN110716575A (en) UUV real-time collision avoidance planning method based on deep double-Q network reinforcement learning
Balakrishna et al. On-policy robot imitation learning from a converging supervisor
CN113759929B (en) Multi-agent path planning method based on reinforcement learning and model predictive control
Huang et al. To imitate or not to imitate: Boosting reinforcement learning-based construction robotic control for long-horizon tasks using virtual demonstrations
Wu et al. Torch: Strategy evolution in swarm robots using heterogeneous–homogeneous coevolution method
CN110716574A (en) UUV real-time collision avoidance planning method based on deep Q network
CN118201742A (en) Multi-robot coordination using a graph neural network
Hafez et al. Improving robot dual-system motor learning with intrinsically motivated meta-control and latent-space experience imagination
Sumiea et al. Enhanced deep deterministic policy gradient algorithm using grey wolf optimizer for continuous control tasks
Chen et al. Survey of multi-agent strategy based on reinforcement learning
CN115366099B (en) Mechanical arm depth deterministic strategy gradient training method based on forward kinematics
CN116578080A (en) Local path planning method based on deep reinforcement learning
Zhang et al. Auto-conditioned recurrent mixture density networks for learning generalizable robot skills
CN116340737A (en) Heterogeneous cluster zero communication target distribution method based on multi-agent reinforcement learning
CN116149179A (en) Non-uniform track length differential evolution iterative learning control method for robot fish
CN115273502A (en) Traffic signal cooperative control method
Riccio et al. LoOP: Iterative learning for optimistic planning on robots
CN114118371A (en) Intelligent agent deep reinforcement learning method and computer readable medium
Xu et al. Reinforcement learning with construction robots: A preliminary review of research areas, challenges and opportunities
Yu et al. A novel automated guided vehicle (AGV) remote path planning based on RLACA algorithm in 5G environment
Tang et al. Hierarchical reinforcement learning based on multi-agent cooperation game theory

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant