CN115797394B - Multi-agent coverage method based on reinforcement learning - Google Patents

Multi-agent coverage method based on reinforcement learning Download PDF

Info

Publication number
CN115797394B
CN115797394B (application CN202211432494.1A)
Authority
CN
China
Prior art keywords
agent
coverage
mobile
area
agents
Prior art date
Legal status
Active
Application number
CN202211432494.1A
Other languages
Chinese (zh)
Other versions
CN115797394A (en)
Inventor
孙新苗
任明里
丁大伟
任莹莹
王恒
Current Assignee
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN202211432494.1A priority Critical patent/CN115797394B/en
Publication of CN115797394A publication Critical patent/CN115797394A/en
Application granted granted Critical
Publication of CN115797394B publication Critical patent/CN115797394B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 — Reducing energy consumption in communication networks
    • Y02D30/70 — Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Feedback Control In General (AREA)

Abstract

The invention discloses a multi-agent coverage method based on reinforcement learning, which comprises the following steps: determining the positions of a plurality of stationary agents in an area with the goal of maximizing coverage performance, and dividing the area into an effective coverage area and an ineffective coverage area according to the positions of the stationary agents; calculating the maximum coverage performance obtainable by the mobile agents; setting the observations and actions of the mobile agents, and setting the rewards of the mobile agents based on the maximum coverage performance they can obtain; each mobile agent aims to maximize its own reward function and, based on a reinforcement learning algorithm, the mobile agents interact with the environment simultaneously and are trained in a distributed manner, yielding a motion plan for each mobile agent and achieving coverage of the ineffective coverage area. The technical scheme of the invention enables multiple agents to cooperatively achieve effective coverage of an area and improves the coverage performance of the area.

Description

Multi-agent coverage method based on reinforcement learning
Technical Field
The invention relates to the technical field of multi-agent system coverage optimization, in particular to a multi-agent coverage method based on reinforcement learning.
Background
With the rapid development of computing, MEMS, robotics and communication technologies, multi-agent systems are receiving increasing attention and are being applied to many fields, including area coverage. Multi-agent area coverage means that a team of agents effectively covers an entire area through a cooperative strategy. Having multiple agents cooperatively perform the coverage task completes the target task more efficiently, overcomes the limits on the number and angle of sensors carried by a single agent, and provides redundancy. At present, although existing schemes can solve the problem of full coverage of an area by multiple agents, they cannot improve coverage performance while achieving effective coverage.
Disclosure of Invention
The invention provides a multi-agent coverage method based on reinforcement learning, which is used to rapidly achieve effective coverage of an area and to improve the coverage performance of the area.
In order to solve the technical problems, the invention provides the following technical scheme:
In one aspect, the present invention provides a reinforcement-learning-based multi-agent coverage method, wherein the multi-agent system comprises a plurality of stationary agents and a plurality of mobile agents, the multi-agent coverage method comprising:
determining the positions of a plurality of stationary agents in the area with the goal of maximizing coverage performance, and dividing the area into an effective coverage area and an ineffective coverage area according to the positions of the stationary agents;
calculating the maximum coverage performance obtainable by the mobile agents;
setting the observations and actions of each mobile agent with respect to the environment, and setting the rewards of the mobile agents based on the maximum coverage performance they can obtain; each mobile agent aims to maximize its own reward function, and, based on a reinforcement learning algorithm, the mobile agents interact with the environment simultaneously and are trained in a distributed manner, yielding a motion plan for each mobile agent and achieving coverage of the ineffective coverage area.
Further, determining the locations of the plurality of stationary agents in the area with the goal of maximizing coverage performance includes:
adjusting the positions of the plurality of stationary agents in the area so that the coverage performance is as large as possible.
Further, the calculation function H (S) of the coverage performance is as follows:
H(S)=∫R(x)P(x,S)dx
wherein P(x, S) is the joint detection probability of the multi-agent system at point x in the area, p_i(x, s_i) is the detection probability of the i-th agent, N is the number of agents, and R(x) is the event density function.
Further, when the area is divided into an effective coverage area and an ineffective coverage area, whether a point x in the area is effectively covered is judged by whether the joint detection probability P(x, S) of the multi-agent system at x exceeds a preset threshold: when P(x, S) is larger than the preset threshold, x is effectively covered; otherwise, x is not effectively covered.
Further, the mobile agent's observation of the environment consists of three binary images, wherein:
the first binary image shows the area which is not effectively covered at present;
the second binary image shows the position of the current mobile intelligent agent;
the third binary image shows the location of other mobile agents than the current mobile agent.
Further, the action set of the mobile agent is {0, 1, 2, 3, 4}, indicating respectively that the mobile agent stays still, moves up, moves down, moves left, or moves right.
Further, the reward given by the environment to the mobile agent, Reward, is:
Reward = (H_current - H_max)/10 + incres*30
wherein H_current is the coverage performance with the mobile agent at its current position; H_max is the maximum coverage performance obtainable by the mobile agent; incres is the area of effective coverage newly added relative to the previous time step. The first part of the reward represents the gap between the coverage performance of the mobile agent at the current position and the maximum value, and the second part is the effective coverage area newly added since the previous time step.
Further, when the mobile agents interact with the environment simultaneously and are trained in a distributed manner based on the reinforcement learning algorithm, the actor network and critic network of each mobile agent are set to two convolutional layers followed by three fully connected layers: the first convolutional layer has 16 convolution kernels of size 20 x 20, the second convolutional layer has 8 convolution kernels of size 10 x 10, and the three fully connected layers have 256, 128 and 64 channels, respectively.
In yet another aspect, the present invention also provides an electronic device including a processor and a memory; wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the above-described method.
In yet another aspect, the present invention also provides a computer readable storage medium having at least one instruction stored therein, the instruction being loaded and executed by a processor to implement the above method.
The technical scheme provided by the invention has at least the following beneficial effects:
1. The invention enables multiple agents to cooperatively achieve effective coverage of an area.
2. The invention exploits the decision-optimization capability of reinforcement learning and improves the coverage performance of the area while achieving effective coverage. The method is efficient and robust.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present invention; other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a reinforcement learning-based multi-agent coverage method provided by an embodiment of the present invention;
FIG. 2 is a schematic illustration of stationary agent location deployment provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of mobile agent and environment interactions provided by an embodiment of the present invention;
FIG. 4 is a graph of the effective coverage ratio as a function of the time step, provided by an embodiment of the present invention;
FIG. 5 is a graph of the coverage performance as a function of the time step, provided by an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
First embodiment
This embodiment provides a multi-agent coverage method based on reinforcement learning. It should first be noted that the multi-agent system in this embodiment includes two types of agents, stationary agents and mobile agents; by controlling the motion of the mobile agents, effective coverage of the area is achieved and the coverage performance of the area is improved.
Based on the above, the execution flow of the method of this embodiment is shown in fig. 1, and includes the following steps:
s1, determining the positions of a plurality of static intelligent agents in an area with the aim of maximizing coverage performance, and dividing the area into an effective coverage area and an ineffective coverage area according to the positions of the static intelligent agents;
It should be noted that, when determining the positions of the stationary agents, the objective is to maximize coverage performance, i.e., to adjust the multi-agent positions S = (s_1, …, s_N) so that the coverage performance function H(S) is as large as possible. The coverage performance of the multi-agent system over the area is the integral over the area of the product of the event density and the detection probability, namely H(S) = ∫R(x)P(x, S)dx, where P(x, S) is the joint detection probability of the multi-agent system S at point x, p_i(x, s_i) is the detection probability of the i-th agent, which is typically a monotonically decreasing function of the distance between x and s_i, N is the number of agents, and R(x) is the event density function.
Specifically, in this embodiment, the positional deployment of the stationary agents is as shown in fig. 2, where the gray region is the area that has already been effectively covered. The criterion for judging whether a point x in the area is effectively covered is whether the joint detection probability P(x, S) at x is larger than a threshold ρ: when P(x, S) > ρ, point x is effectively covered; otherwise, it is not. After the ineffective coverage area is obtained, the goal of the mobile agents is to cover it, i.e., to reach P(x, S) ≥ ρ at some time for every such point, while improving the coverage performance H(S) as much as possible during the motion.
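To make the quantities above concrete, the following Python sketch discretizes the area into grid points and evaluates the coverage performance H(S) and the effective-coverage mask. The Gaussian-decay detection model, the complementary-product form assumed for the joint detection probability P(x, S), and all function names and constants are illustrative assumptions, not the patent's exact definitions.

```python
import numpy as np

def detection_prob(points, agent_pos, decay=0.05):
    # Illustrative p_i(x, s_i): decays monotonically with squared distance (assumed form).
    d2 = np.sum((points - np.asarray(agent_pos)) ** 2, axis=-1)
    return np.exp(-decay * d2)

def joint_detection_prob(points, agent_positions):
    # Assumed joint model: P(x, S) = 1 - prod_i (1 - p_i(x, s_i)).
    miss = np.ones(points.shape[0])
    for s in agent_positions:
        miss *= 1.0 - detection_prob(points, s)
    return 1.0 - miss

def coverage_performance(points, density, agent_positions, cell_area=1.0):
    # H(S) = integral of R(x) * P(x, S) dx, approximated as a sum over grid cells.
    return float(np.sum(density * joint_detection_prob(points, agent_positions)) * cell_area)

def effective_mask(points, agent_positions, rho=0.9):
    # A grid point is effectively covered when P(x, S) > rho.
    return joint_detection_prob(points, agent_positions) > rho
```

For example, `points` can be the centers of the grid cells in fig. 2 and `density` the event density R(x) sampled at those centers.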
S2, calculating the maximum coverage performance obtainable by the mobile agents;
In this embodiment, the maximum coverage performance H_max obtainable by the mobile agents is calculated, i.e., the maximum of the coverage performance H(S) = ∫R(x)P(x, S)dx over the positions of the mobile agents. This value is used in the calculation of the mobile agents' reward function in a later step. When the number of mobile agents is small, H_max can usually be computed with a greedy algorithm: mobile agents are added one at a time, on top of the stationary agents, each at the position that increases coverage performance the most.
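A minimal sketch of that greedy computation, reusing the helper functions from the previous sketch; the finite candidate set and the rule of placing each new agent at the grid point with the largest marginal gain are illustrative choices rather than the patent's exact procedure.

```python
import numpy as np

def greedy_h_max(points, density, static_positions, candidates, n_mobile):
    # Add mobile agents one at a time; each placement maximizes the marginal
    # increase of H(S) over the stationary agents and the agents placed so far.
    placed = []
    for _ in range(n_mobile):
        base = coverage_performance(points, density, static_positions + placed)
        best_gain, best_pos = -np.inf, None
        for c in candidates:
            gain = coverage_performance(points, density, static_positions + placed + [c]) - base
            if gain > best_gain:
                best_gain, best_pos = gain, c
        placed.append(best_pos)
    h_max = coverage_performance(points, density, static_positions + placed)
    return h_max, placed
```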
S3, setting the observations and actions of each mobile agent with respect to the environment, and setting the rewards of the mobile agents based on the maximum coverage performance they can obtain; each mobile agent aims to maximize its own reward function, and, based on a reinforcement learning algorithm, the mobile agents interact with the environment simultaneously and are trained in a distributed manner, yielding a motion plan for each mobile agent and achieving coverage of the ineffective coverage area.
It should be noted that these settings are the preparation for training the mobile agents by reinforcement learning; fig. 3 illustrates, as an example, the interaction between three mobile agents and the environment. Before training, the action set of the mobile agents, the agents' observation of the environment, and the reward function given by the environment to the agents must be defined. The environment is a grid world in which the stationary agents are already deployed. In it, a mobile agent can choose among 5 actions: stay still, move up, move down, move left, or move right; the action set is therefore {0, 1, 2, 3, 4}, corresponding to these actions, and each move spans one grid cell. To cooperatively achieve effective coverage of the area, each agent's observation of the environment is set to three binary images: the first binary image encodes the current effective-coverage state of the area, with effectively covered cells marked 1 and not-yet-covered cells marked 0, so the agent can see which part of the area remains uncovered; the second binary image shows the current position of the observing mobile agent, with its cell marked 1; the third binary image shows the current positions of the other mobile agents, with their cells marked 1. A sketch of this observation encoding is given below.
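A minimal sketch of the three-channel observation and the action set, under the same grid assumptions as above; the marking convention, the index-to-offset action encoding, and the helper names are illustrative.

```python
import numpy as np

# Assumed action encoding: index -> (row offset, column offset), one grid cell per move.
ACTIONS = {0: (0, 0),   # stay still
           1: (-1, 0),  # move up
           2: (1, 0),   # move down
           3: (0, -1),  # move left
           4: (0, 1)}   # move right

def build_observation(grid_shape, effective, own_pos, other_positions):
    # Channel 0: effectively covered cells marked 1, not-yet-covered cells 0.
    # Channel 1: the observing agent's own cell marked 1.
    # Channel 2: cells occupied by the other mobile agents marked 1.
    obs = np.zeros((3,) + grid_shape, dtype=np.float32)
    obs[0] = effective.reshape(grid_shape).astype(np.float32)
    obs[1][own_pos] = 1.0
    for p in other_positions:
        obs[2][p] = 1.0
    return obs
```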
The reward given by the environment to the agent is set as two parts, serving respectively the goals of quickly achieving effective coverage and improving coverage performance. The reward of the environment to the mobile agent is:
Reward = (H_current - H_max)/10 + incres*30
wherein H_current is the coverage performance with the mobile agent at its current position; H_max is the maximum coverage performance obtainable by the mobile agent; incres is the area of effective coverage newly added relative to the previous time step. The first part of the reward represents the gap between the coverage performance at the current position and the maximum value; the second part is the effective coverage area newly added since the previous time step. Using this function as the reward improves the coverage performance of the area while quickly achieving effective coverage.
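A small sketch of that reward computation, again with illustrative names; `cell_area` and the way `incres` is derived from successive effective-coverage masks are assumptions consistent with the grid discretization above.

```python
import numpy as np

def newly_covered_area(prev_mask, cur_mask, cell_area=1.0):
    # incres: area of cells that became effectively covered since the previous step.
    return float(np.sum(cur_mask & ~prev_mask)) * cell_area

def step_reward(h_current, h_max, prev_mask, cur_mask, cell_area=1.0):
    # Reward = (H_current - H_max)/10 + incres*30
    incres = newly_covered_area(prev_mask, cur_mask, cell_area)
    return (h_current - h_max) / 10.0 + incres * 30.0
```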
Further, when the mobile agents interact with the environment simultaneously and are trained in a distributed manner based on the reinforcement learning algorithm, the actor network and critic network of each mobile agent are set to two convolutional layers followed by three fully connected layers: the first convolutional layer has 16 convolution kernels of size 20 x 20, the second convolutional layer has 8 convolution kernels of size 10 x 10, and the three fully connected layers have 256, 128 and 64 channels, respectively.
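The following PyTorch sketch shows one way to realize that architecture. Strides, padding, activation functions, the output heads (5 action logits for the actor, a scalar value for the critic), and the three-channel input are not specified in the text above and are assumptions.

```python
import torch
import torch.nn as nn

class CoverageNet(nn.Module):
    # Two convolutional layers (16 kernels of 20x20, then 8 kernels of 10x10)
    # followed by fully connected layers of 256, 128 and 64 channels, plus an
    # assumed output head.
    def __init__(self, out_dim):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=20, padding=10), nn.ReLU(),
            nn.Conv2d(16, 8, kernel_size=10, padding=5), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.LazyLinear(256), nn.ReLU(),   # flattened size depends on the grid resolution
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, x):                    # x: (batch, 3, H, W) binary observation
        return self.head(self.features(x))

actor = CoverageNet(out_dim=5)               # logits over {stay, up, down, left, right}
critic = CoverageNet(out_dim=1)              # state value V(S; phi)
```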
Further, when training the mobile agents simultaneously, this embodiment uses the proximal policy optimization (PPO) algorithm, a model-free, online, on-policy policy-gradient reinforcement learning method. The specific procedure is as follows, and a simplified code sketch is given after step e:
a. Initialize the actor π(A|S; θ) with random parameters θ and the critic V(S; φ) with random parameters φ.
b. Following the current policy, generate N steps of experience; the experience sequence is:
S_ts, A_ts, R_(ts+1), …, S_(ts+N-1), A_(ts+N-1), R_(ts+N), S_(ts+N)
where A_t is the action taken in state S_t, S_(t+1) is the next state, and R_(t+1) is the reward obtained for the transition from S_t to S_(t+1). At S_t, the agent uses π(A|S; θ) to compute the probability of taking each action and randomly selects action A_t according to that probability distribution.
c. For each step t = ts+1, ts+2, …, ts+N, compute the return G_t and the advantage function D_t with the generalized advantage estimator:
δ_k = R_k + b·γ·V(S_k; φ) - V(S_(k-1); φ),
D_t = Σ_(k=t)^(ts+N) (γλ)^(k-t)·δ_k,
G_t = D_t + V(S_(t-1); φ),
where b is 0 when S_(ts+N) is a terminal state and 1 otherwise, λ is the smoothing coefficient, and γ is the discount coefficient.
d. Randomly draw a mini-batch of size M from the current experience set and learn from it: update the critic parameters φ by minimizing the critic loss L_critic(φ) = (1/M)·Σ_(i=1)^M (G_i - V(S_i; φ))², and update the actor parameters θ by minimizing the clipped surrogate actor loss L_actor(θ) = -(1/M)·Σ_(i=1)^M min(r_i(θ)·D_i, c_i(θ)·D_i), where r_i(θ) = π(A_i|S_i; θ)/π(A_i|S_i; θ_old) and c_i(θ) = max(min(r_i(θ), 1+ε), 1-ε). An entropy loss term is added to the actor loss to encourage exploration by the agent.
e. Repeat steps b through d until the training termination condition is reached.
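A compact sketch of steps b–d under the assumptions already noted; the discount γ, smoothing λ, clip ε and entropy weight are illustrative hyperparameters, and `values` is assumed to hold V(S; φ) for the N+1 states of one experience segment.

```python
import numpy as np
import torch

def gae_returns(rewards, values, terminal, gamma=0.99, lam=0.95):
    # rewards: R_(ts+1)..R_(ts+N); values: V(S_ts)..V(S_(ts+N)); terminal: whether S_(ts+N) ends the episode.
    N = len(rewards)
    adv = np.zeros(N, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(N)):
        b = 0.0 if (terminal and t == N - 1) else 1.0
        delta = rewards[t] + b * gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * lam * b * gae                          # discounted sum of TD errors
        adv[t] = gae
    returns = adv + np.asarray(values[:N], dtype=np.float32)         # G = D + V
    return returns, adv

def ppo_losses(new_logp, old_logp, adv, returns, value_pred, entropy, eps=0.2, w_ent=0.01):
    # Clipped surrogate actor loss with entropy bonus, and mean-squared critic loss.
    ratio = torch.exp(new_logp - old_logp)                 # r_i(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)     # c_i(theta)
    actor_loss = -(torch.min(ratio * adv, clipped * adv)).mean() - w_ent * entropy.mean()
    critic_loss = ((returns - value_pred) ** 2).mean()
    return actor_loss, critic_loss
```

In the distributed setting of this embodiment, each mobile agent would run its own copy of this update on the experience it collects while all agents act in the shared environment.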
By executing the above steps, the variation of the effective coverage ratio with the time step after training is shown in fig. 4; this embodiment reaches a coverage rate of 97%. The variation of the coverage performance with the time step is shown in fig. 5; the coverage performance is improved both during the process of achieving effective coverage and after effective coverage has been achieved.
In summary, this embodiment provides a multi-agent coverage method based on reinforcement learning that enables multiple agents to cooperatively achieve effective coverage of an area. The method exploits the decision-optimization capability of reinforcement learning and can improve the coverage performance of the area while achieving effective coverage. It is efficient and robust.
Second embodiment
The embodiment provides an electronic device, which comprises a processor and a memory; wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the method of the first embodiment.
The electronic device may vary considerably in configuration or performance and may include one or more processors (central processing units, CPU) and one or more memories having at least one instruction stored therein that is loaded by the processors and performs the methods described above.
Third embodiment
The present embodiment provides a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the method of the first embodiment described above. The computer readable storage medium may be, among other things, ROM, random access memory, CD-ROM, magnetic tape, floppy disk, optical data storage device, etc. The instructions stored therein may be loaded by a processor in the terminal and perform the methods described above.
Furthermore, it should be noted that the present invention can be provided as a method, an apparatus, or a computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or terminal device comprising the element.
Finally, the above describes preferred embodiments of the invention. Although preferred embodiments have been described, once the basic inventive concepts are known, those skilled in the art can make several modifications and adaptations without departing from the principles of the invention, and such modifications and adaptations are intended to fall within the scope of the invention. It is therefore intended that the appended claims be interpreted as covering the preferred embodiments and all alterations and modifications that fall within the scope of the embodiments of the invention.

Claims (1)

1. A reinforcement-learning-based multi-agent coverage method, wherein the multi-agent system comprises a plurality of stationary agents and a plurality of mobile agents, the multi-agent coverage method comprising:
determining the positions of a plurality of stationary agents in an area with the goal of maximizing coverage performance, and dividing the area into an effective coverage area and an ineffective coverage area according to the positions of the stationary agents;
calculating the maximum coverage performance obtainable by the mobile agents;
setting the observations and actions of each mobile agent with respect to the environment, and setting the rewards of the mobile agents based on the maximum coverage performance they can obtain; each mobile agent aims to maximize its own reward function, and, based on a reinforcement learning algorithm, the mobile agents interact with the environment simultaneously, are trained in a distributed manner, obtain a motion plan for each mobile agent, and achieve coverage of the ineffective coverage area;
the determining the location of the plurality of stationary agents in the area with the goal of maximizing coverage performance includes:
adjusting the positions of the plurality of stationary agents in the area so that the coverage performance is as large as possible;
the calculation function H (S) of the coverage performance is as follows:
H(S)=∫R(x)P(x,S)dx
wherein P(x, S) is the joint detection probability of the multiple agents at point x in the area, p_i(x, s_i) is the detection probability of the i-th agent, N is the number of agents, and R(x) is the event density function;
when the area is divided into an effective coverage area and an ineffective coverage area, whether a point x in the area is effectively covered is judged by whether the joint detection probability P(x, S) of the multiple agents at x is larger than a preset threshold; when P(x, S) is larger than the preset threshold, x is effectively covered, otherwise x is not effectively covered;
the mobile agent's observation of the environment consists of three binary images, wherein:
the first binary image shows the area which is not effectively covered at present;
the second binary image shows the position of the current mobile intelligent agent;
the third binary image shows the positions of other mobile intelligent agents except the current mobile intelligent agent;
the action set of the mobile agent is {0, 1, 2, 3, 4}, indicating respectively that the mobile agent stays still, moves up, moves down, moves left, or moves right;
the reward given by the environment to the mobile agent is:
Reward = (H_current - H_max)/10 + incres*30
wherein H_current is the coverage performance with the mobile agent at its current position; H_max is the maximum coverage performance obtainable by the mobile agent; incres is the area of effective coverage newly added relative to the previous time step; the first part of the reward represents the gap between the coverage performance of the mobile agent at the current position and the maximum value, and the second part of the reward is the effective coverage area newly added since the previous time step;
when the mobile agents interact with the environment simultaneously and are trained in a distributed manner based on the reinforcement learning algorithm, the actor network and the critic network of each mobile agent are set to two convolutional layers followed by three fully connected layers; the first convolutional layer has 16 convolution kernels of size 20 x 20, the second convolutional layer has 8 convolution kernels of size 10 x 10, and the three fully connected layers have 256, 128 and 64 channels, respectively.
CN202211432494.1A 2022-11-15 2022-11-15 Multi-agent coverage method based on reinforcement learning Active CN115797394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211432494.1A CN115797394B (en) 2022-11-15 2022-11-15 Multi-agent coverage method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211432494.1A CN115797394B (en) 2022-11-15 2022-11-15 Multi-agent coverage method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN115797394A CN115797394A (en) 2023-03-14
CN115797394B true CN115797394B (en) 2023-09-05

Family

ID=85438088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211432494.1A Active CN115797394B (en) 2022-11-15 2022-11-15 Multi-agent coverage method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115797394B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022083029A1 (en) * 2020-10-19 2022-04-28 深圳大学 Decision-making method based on deep reinforcement learning
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN114879742A (en) * 2022-06-17 2022-08-09 电子科技大学 Unmanned aerial vehicle cluster dynamic coverage method based on multi-agent deep reinforcement learning
CN115327926A (en) * 2022-09-15 2022-11-11 中国科学技术大学 Multi-agent dynamic coverage control method and system based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Rapid Coverage Control with Multi-agent Systems Based on K-Means Algorithm; YuZe Feng et al.; 2020 7th International Conference on Information, Cybernetics, and Computational Social Systems (ICCSS); pp. 870-873 *

Also Published As

Publication number Publication date
CN115797394A (en) 2023-03-14

Similar Documents

Publication Publication Date Title
CN111563188B (en) Mobile multi-agent cooperative target searching method
US10175662B2 (en) Method of constructing navigation map by robot using mouse hippocampal place cell model
CN113110509B (en) Warehousing system multi-robot path planning method based on deep reinforcement learning
Oftadeh et al. A novel meta-heuristic optimization algorithm inspired by group hunting of animals: Hunting search
CN110327624B (en) Game following method and system based on curriculum reinforcement learning
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN104462856B (en) Ship Conflict Early Warning Method
CN111226234B (en) Method, apparatus and computer program for creating deep neural networks
JP4028384B2 (en) Agent learning apparatus, method, and program
CN111914878B (en) Feature point tracking training method and device, electronic equipment and storage medium
CN113344972B (en) Fish track prediction method based on intensive culture
CN115797394B (en) Multi-agent coverage method based on reinforcement learning
CN110930429A (en) Target tracking processing method, device and equipment and readable medium
CN115239760B (en) Target tracking method, system, equipment and storage medium
Das et al. Chemo-inspired genetic algorithm for function optimization
CN116341605A (en) Grey wolf algorithm hybrid optimization method based on reverse learning strategy
CN112651486A (en) Method for improving convergence rate of MADDPG algorithm and application thereof
CN104504935B (en) Navigation traffic control method
CN110009611B (en) Visual target dynamic counting method and system for image sequence
WO2003075221A1 (en) Mechanism for unsupervised clustering
CN106803361A (en) Navigation control method based on rolling planning strategy
CN113821270A (en) Task unloading sequence prediction method, decision-making method, electronic device and storage medium
CN114330933B (en) Execution method of meta-heuristic algorithm based on GPU parallel computation and electronic equipment
Chen et al. The research and application of improved ant colony algorithm with multi-thresholds in edge detection
US20220198225A1 (en) Method and system for determining action of device for given state using model trained based on risk-measure parameter

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant