WO2023109699A1 - A multi-agent communication learning method - Google Patents

A multi-agent communication learning method

Info

Publication number
WO2023109699A1
Authority
WO
WIPO (PCT)
Prior art keywords
agent
agents
communication
message
actornet
Prior art date
Application number
PCT/CN2022/138140
Other languages
English (en)
French (fr)
Inventor
代浩
吴嘉澍
王洋
叶可江
张锦霞
须成忠
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院 filed Critical 深圳先进技术研究院
Publication of WO2023109699A1 publication Critical patent/WO2023109699A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the invention relates to a communication learning method, in particular to a multi-agent communication learning method.
  • DRL Deep Reinforcement Learning
  • the current mainstream multi-agent algorithms adopt the architecture of centralized training and distributed execution (CTDE), which has global information during training and only the observation of the agent itself during execution.
  • CTDE centralized training and distributed execution
  • This architecture has a critic network during training, which updates the critic and actor networks according to the state-action combination of all agents.
  • each agent has only an independent actor network, and makes decisions based on local observations.
  • Typical architectures of this type such as IQL, QMIX, etc., use global information during training, and each agent can only make decisions based on local information during execution.
  • the current mainstream method is CommNet, which places a mean unit between the policy networks of the agents to receive the local observations of all agents and broadcasts a generated message to all agents (a star communication framework); TarMAC, by contrast, is a fully connected architecture in which messages are broadcast among all agents.
  • the star and fully connected network architectures are designed to ensure that the messages generated by all agents are not missed, and that local observation information can be disseminated to all agents, so that they can have global information for decision-making.
  • the present invention analyzes the influence of other agents' messages on the current agent, proposes an index that describes message importance, groups the agents accordingly, reduces network traffic through the idea of hierarchical transmission, and realizes a communication learning method for deep reinforcement learning in edge networks.
  • An advantage of the present invention is to provide a multi-agent communication learning method that introduces message passing among the agents to transmit local observations, so that the agents can fully take the global situation into account when making decisions.
  • An advantage of the present invention is to provide a multi-agent communication learning method in which an importance ranking index and an efficient grouping algorithm are designed to reduce the amount of transmitted messages, realizing efficient communication learning and effectively reducing the communication bandwidth consumed by unnecessary messages.
  • An advantage of the present invention is to provide a multi-agent communication learning method that can be used for all multi-agent reinforcement learning applications in edge networks, such as multi-agent autonomous driving, robot navigation and logistics scheduling.
  • An advantage of the present invention is to provide a multi-agent communication learning method that is suitable for scenarios requiring multi-view fusion perception, such as multi-camera fusion.
  • the invention provides a multi-agent communication learning method, comprising:
  • CriticNet wherein the CriticNet is used to calculate the importance of communication during the training phase, and is used to train the corresponding three networks on the end device, that is, the ActorNet, the PriorNet and the EncoderNet;
  • ActorNet wherein the ActorNet is used to select the corresponding action on the agent side, acts on the agent side, and works in both the training phase and the execution phase.
  • the ActorNet learns the agent's policy π in the training phase and then, according to its local observation and the received messages, generates the corresponding action a_t^i = π(o_t^i, c_t^i), where c_t^i is the message received by agent i at time t;
  • PriorNet wherein the PriorNet is used by the agent to select its communication targets; the PriorNet evaluates the agents observed in the local observation and outputs an importance value θ_j^i, i.e., the importance of agent j's message to agent i; and
  • EncoderNet wherein the EncoderNet is used for the agent to encode its own information to reduce the size of the message body.
  • the CriticNet runs on the cloud and only works during the training phase.
  • the CriticNet will calculate the network loss and pass the gradient back to the rest of the network, and update the rest of the network parameters.
  • when the importance value exceeds a certain threshold, it means that the current agent i needs to obtain agent j's message in order to make a decision.
  • the agent encodes its own previous actions and observations together for reference by other agents to improve the stability of cooperation.
  • the calculation method of the importance is further included, and the steps are as follows:
  • Step A: remove agent j's message and observe whether doing so changes the action output by the ActorNet;
  • Step B Since the action output by ActorNet is the distribution of an action set, KL divergence is used to calculate the difference between the action distributions output by the agent's ActorNet.
  • the specific formula is as follows: θ_j^i = D_KL( π(a_t^i | o_{i}) ‖ π(a_t^i | o_{{i}\j}) );
  • Step C: here o_{i} denotes the set of messages from all the other agents observed by agent i, and o_{{i}\j} denotes the set of messages from the observed agents excluding agent j; the difference computed by the formula indicates whether the decision distribution lacking agent j's message is consistent with the decision distribution that includes agent j's message;
  • Step D: if the difference is large, agent j's message is very important to agent i, so its communication confidence is relatively high;
  • Step E After calculating the confidence of all agents, a confidence matrix M between agents is obtained, and the agents are grouped through this confidence matrix.
  • the PriorNet network outputs two values, namely query and signature:
  • the signature vector is the information fingerprint of the agent itself, encoding the agent's own position and label;
  • the query vector is the query information, encoding the set of agents with which the agent needs to communicate.
  • a communication mechanism is further included, and the communication mechanism includes a handshake phase, an election phase, a communication phase and a decision-making phase, wherein in the handshake phase, all agents broadcast the query and signature to the agents within their observation, and after receiving the queries and signatures, the agents restore the communication confidence matrix by multiplying the vectors.
  • in the election phase, after calculating the confidence matrix, all agents compute the adjacency graph and select the agent with the highest degree.
  • this agent is the preset agent, i.e., the agent whose messages most of the other agents want in order to make decisions, and the preset agent acts as the leader node.
  • in the communication phase, all non-leader nodes send their messages to the leader node; the leader encodes the received messages through the encoder network and then communicates with the other leaders. After the leaders have exchanged messages, in the decision-making phase each leader makes decisions based on the messages received from the other leaders and sends its decisions and messages to the other non-leader agents in the same group, and those agents make their next decisions accordingly.
  • the present invention proposes to use KL divergence to measure the importance of messages, which ensures that only effective information is transmitted, avoids redundant message transmission, and improves the convergence rate.
  • the present invention uses grouping and electing a leader for communication, which greatly reduces communication links and reduces communication bandwidth consumption.
  • Fig. 1 is a network schematic diagram of the multi-agent communication learning method provided by the present invention.
  • Fig. 2 is a schematic diagram of spectral clustering of the multi-agent communication learning method provided by the present invention.
  • Fig. 3 illustrates grouped agent communication in the multi-agent communication learning method provided by the present invention.
  • Fig. 4 shows the improvement of the global reward of the cooperating agents under the multi-agent communication learning method provided by the present invention.
  • Fig. 5 shows the communication traffic among the multi-agents in the multi-agent communication learning method provided by the present invention.
  • a typical distributed edge computing architecture is composed of multiple edge devices (indicated by "Device"). Assuming that there are N edge devices, each device i can be regarded as an agent, and the agents can be connected through networks such as WIFI and 5G. They are interconnected and have limited computing power and bandwidth resources.
  • the goal of a cooperative multi-agent system is to maximize the cumulative value of the global reward r, so all agents need to master the global information they care about through message passing to achieve collaborative decision-making.
  • the present invention follows the framework of CTDE, maintains comprehensive information intercommunication in the training stage, and performs information encoding and communication object selection according to the trained communication network in the execution stage.
  • the multi-agent communication learning method of the present invention includes a CriticNet, an ActorNet, a PriorNet and an EncoderNet, wherein the CriticNet is used to calculate the importance of communication in the training phase and to train the corresponding three networks, that is, the ActorNet, the PriorNet and the EncoderNet. Further, the CriticNet runs on the cloud and only works during the training phase.
  • the CriticNet calculates the network loss, passes the gradients back to the other networks, and updates their parameters. The ActorNet is used to select the corresponding action on the agent side; it runs on the agent and works in both the training and execution phases. During training the ActorNet learns the agent's policy π and then, from the local observation and the received messages, generates the corresponding action a_t^i = π(o_t^i, c_t^i), where c_t^i is the message received by agent i at time t. The PriorNet is used by the agent to select its communication targets: it evaluates the agents observed in the local observation and outputs an importance value θ_j^i, i.e., the importance of agent j's message to agent i; when this value exceeds a certain threshold, the current agent i needs to obtain agent j's message in order to make a decision. The EncoderNet is used by the agent to encode its own information: since the agent's observation of the environment is low-dimensional and sparse, it is converted by an encoding network into a high-dimensional representation to reduce the size of the message body.
  • the multi-agent communication learning method of the present invention uses the cross-entropy loss function as the error and gradient descent as the means of parameter updating.
  • the key of the multi-agent communication learning method of the present invention lies in how an agent selects its communication targets and how the communication interaction is organized.
  • the importance calculation of the present invention determines how to weight the other agents observed by agent i and how to assign communication priorities.
  • the steps are as follows:
  • Step A: remove agent j's message and observe whether doing so changes the action output by the ActorNet;
  • Step B: since the action output by the ActorNet is a distribution over an action set, the KL divergence is used to compute the difference between the action distributions output by the agent's ActorNet.
  • the specific formula is as follows: θ_j^i = D_KL( π(a_t^i | o_{i}) ‖ π(a_t^i | o_{{i}\j}) );
  • Step C: here o_{i} denotes the set of messages from all the other agents observed by agent i, and o_{{i}\j} denotes the set of messages from the observed agents excluding agent j; the difference computed by the formula indicates whether the decision distribution lacking agent j's message is consistent with the decision distribution that includes agent j's message;
  • Step D: if the difference is large, agent j's message is very important to agent i, so its communication confidence is relatively high;
  • it is worth noting that this calculation requires evaluating the ActorNet output multiple times, so it can only be performed during the training phase; the result is also used as the supervision signal for training the PriorNet, so that during the execution phase the communication confidence can be obtained directly from the PriorNet without repeating the calculation.
  • Step E After calculating the confidence of all agents, a confidence matrix M between agents is obtained, and the agents are grouped through this confidence matrix.
  • Spectral clustering is an algorithm that evolved from graph theory and later came into wide use for clustering. Its main idea is to regard all data as points in a space, connected by edges; the edge weight between two points that are far apart is low, while the edge weight between two points that are close is high. By cutting the graph formed by all data points so that the sum of the edge weights between the resulting subgraphs is as low as possible while the sum of the edge weights within each subgraph is as high as possible, the goal of clustering is achieved.
  • the communication within each group can be denser, while the communication between groups is relatively sparse.
  • the present invention proposes a distributed grouping method.
  • the invention lets the PriorNet output two values, a query and a signature: the signature vector is the information fingerprint of the agent itself, encoding the agent's own position and label; the query vector is the query information, encoding the set of agents with which the agent needs to communicate.
  • the communication mechanism of the present invention includes a handshake stage, an election stage, a communication stage and a decision-making stage, wherein in the handshake stage, all agents broadcast the query and signature to the agents within their observation, and after receiving the queries and signatures, the agents can restore the communication confidence matrix by multiplying the vectors.
  • in the election stage, after computing the confidence matrix, all agents compute the adjacency graph and select the agent with the highest degree, i.e., the agent whose messages most of the other agents want in order to make decisions, so it can serve as the leader node; in the communication stage, all non-leader nodes send their messages to the leader node, and the leader encodes the received messages through the encoder network and then communicates with the other leaders; after the leaders have exchanged messages, in the decision-making stage each leader makes decisions based on the messages received from the other leaders and sends its decisions and messages to the other non-leader agents in the same group, which then make their next decisions accordingly.
  • through the above grouped communication pattern, the present invention effectively reduces the cost of communication, reduces communication links, and realizes efficient communication learning for multi-agent reinforcement learning: KL divergence is used to compute and measure message importance among the agents; spectral clustering on the confidence matrix groups the agents, thereby reducing communication links; and an in-group election based on node degree selects leader nodes to carry out inter-group communication, reducing traffic.
  • the present invention has been shown to be feasible through extensive experiments and has been verified in OpenAI's open-source multi-agent reinforcement learning environment; the results show that the present invention helps improve cooperation among the agents and maximize the global reward.
  • the communication volume of the present invention gradually decreases as training stabilizes.
  • at first the agents learn to improve cooperation through communication, so the communication volume rises quickly; as training continues, the grouping method starts to work and gradually reduces the traffic.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Geometry (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present invention discloses a multi-agent communication learning method. The method comprises a CriticNet, an ActorNet, a PriorNet and an EncoderNet, wherein the CriticNet is used to compute communication importance during the training phase and to train the three corresponding networks on the end devices, namely the ActorNet, the PriorNet and the EncoderNet; the ActorNet is used to select the corresponding action on the agent side, runs on the agent, and works in both the training and execution phases; during training the ActorNet learns the agent's policy π and then, from the local observation and the received messages, generates the corresponding action a_t^i, that is, a_t^i = π(o_t^i, c_t^i), where c_t^i is the message received by agent i at time t; the PriorNet is used by the agent to select its communication targets, evaluates the agents observed in the local observation, and outputs an importance value θ_j^i, i.e., the importance of agent j's message to agent i; and the EncoderNet is used by the agent to encode its own information so as to reduce the size of the message body.

Description

A multi-agent communication learning method
Technical Field
The present invention relates to communication learning methods, and in particular to a multi-agent communication learning method.
Background Art
In a cooperative multi-agent system, all cooperating agents share a single global reward function, yet each agent's observation range is limited. The agents therefore lack global information for perception and decision-making when cooperating, which leads to mutually conflicting decisions among agents and makes it difficult to reach a global optimum.
As an advanced artificial intelligence technique, deep reinforcement learning (DRL) has achieved great success on many challenging real-world problems. It is widely deployed on different devices, such as smart vehicles, smartphones, wearables, smart cameras and other intelligent objects in edge networks. Among DRL paradigms, cooperative multi-agent reinforcement learning is both more difficult and of greater practical value: because each agent has only a local observation and lacks global information, the joint action space is enormous and the computation is complex; at the same time, because there is only a single global reward, it is hard to assign the corresponding credit to individual agents, so training and convergence are difficult to guarantee.
To address this difficulty, the current mainstream multi-agent algorithms adopt the centralized training and distributed execution (CTDE) architecture: global information is available during training, while during execution each agent has only its own observation. In this architecture a critic network exists during training and updates the critic and actor networks according to the state-action combinations of all agents; during execution each agent has only an independent actor network and makes decisions from local observations. Typical architectures of this kind, such as IQL and QMIX, use global information during training, while during execution each agent can only decide based on local information. These methods model the other agents as part of the environment and solve only a single-agent problem for each agent, so convergence cannot be guaranteed and an agent easily falls into endless exploration.
Therefore, much research has turned to communication-based multi-agent reinforcement learning, which alleviates the non-stationarity problem and promotes cooperation among agents. The current mainstream methods include CommNet, which places a mean unit between the policy networks of the agents to receive the local observations of all agents and, after generating a message, broadcasts it to all agents (a star communication framework); TarMAC, by contrast, is a fully connected architecture in which messages are broadcast among all agents. Both the star and the fully connected architectures are designed to ensure that no message generated by any agent is missed and that local observation information can be propagated to all agents, so that every agent has global information for decision-making.
Although existing communication learning methods guarantee that every agent can obtain the messages of all other agents, they also introduce a large amount of redundant information. Because the correlation between agents varies, passing information between unrelated agents is not only useless but may even negatively affect the agents' decisions.
At the same time, redundant message passing places a heavy burden on edge networks: edge network structures are complex and communication bandwidth resources are limited, so traditional communication learning methods are often hard to apply in edge environments. Since the main application scenario of multi-agent reinforcement learning is precisely the edge network environment, in order to resolve the mismatch between network bandwidth and the resources required by communication learning, the present invention analyzes the influence of other agents' messages on the current agent, proposes an index that characterizes message importance, groups the agents accordingly, reduces network traffic through the idea of hierarchical transmission, and realizes a communication learning method for deep reinforcement learning oriented to edge networks.
Summary of the Invention
An advantage of the present invention is to provide a multi-agent communication learning method that introduces message passing among the agents to transmit local observations, so that the agents can fully take the global situation into account when making decisions.
An advantage of the present invention is to provide a multi-agent communication learning method that designs an importance ranking index and an efficient grouping algorithm to reduce the amount of transmitted messages, realizing efficient communication learning and effectively reducing the communication bandwidth consumed by unnecessary messages.
An advantage of the present invention is to provide a multi-agent communication learning method that can be used for all multi-agent reinforcement learning applications in edge networks, such as multi-agent autonomous driving, robot navigation and logistics scheduling.
An advantage of the present invention is to provide a multi-agent communication learning method that is suitable for scenarios requiring multi-view fusion perception, such as multi-camera fusion.
The technical solution proposed by the present invention for the above technical problems is as follows:
The present invention provides a multi-agent communication learning method, comprising:
a CriticNet, wherein the CriticNet is used to compute communication importance during the training phase and to train the three corresponding networks on the end devices, namely the ActorNet, the PriorNet and the EncoderNet;
an ActorNet, wherein the ActorNet is used to select the corresponding action on the agent side, runs on the agent, and works in both the training and execution phases; during training the ActorNet learns the agent's policy π and then, from the local observation and the received messages, generates the corresponding action a_t^i, that is, a_t^i = π(o_t^i, c_t^i), where c_t^i is the message received by agent i at time t;
a PriorNet, wherein the PriorNet is used by the agent to select its communication targets; the PriorNet evaluates the agents observed in the local observation and outputs an importance value θ_j^i, i.e., the importance of agent j's message to agent i; and
an EncoderNet, wherein the EncoderNet is used by the agent to encode its own information so as to reduce the size of the message body.
Preferably, the CriticNet runs in the cloud and works only during the training phase; by computing the global reward and the communication priorities, the CriticNet computes the network loss, passes the gradients back to the other networks, and updates their parameters.
Preferably, when the importance value exceeds a certain threshold, it indicates that the current agent i needs to obtain agent j's message in order to make a decision.
Preferably, the agent encodes its own previous action together with its observation, for reference by other agents, to improve the stability of cooperation.
Preferably, the method further comprises a calculation method for the importance, with the following steps:
Step A: remove agent j's message and observe whether doing so changes the action output by the ActorNet;
Step B: since the action output by the ActorNet is a distribution over an action set, the KL divergence is used to compute the difference between the action distributions output by the agent's ActorNet; the specific formula is as follows:
θ_j^i = D_KL( π(a_t^i | o_{i}) ‖ π(a_t^i | o_{{i}\j}) )
Step C: here o_{i} denotes the set of messages from all the other agents observed by agent i, and o_{{i}\j} denotes the set of messages from the observed agents excluding agent j; the difference computed by the formula indicates whether the decision distribution lacking agent j's message is consistent with the decision distribution that includes agent j's message;
Step D: if the difference is large, agent j's message is very important to agent i, so its communication confidence is relatively high;
Step E: after the confidences of all agents have been computed, a confidence matrix M between the agents is obtained, and the agents are grouped according to this confidence matrix.
Preferably, the method further comprises a distributed grouping method in which the PriorNet outputs two values, a query and a signature: the signature vector is the information fingerprint of the agent itself and encodes the agent's own position and label; the query vector is the query information and encodes the set of agents with which the agent needs to communicate.
Preferably, the method further comprises a communication mechanism comprising a handshake phase, an election phase, a communication phase and a decision-making phase. In the handshake phase, every agent broadcasts its query and signature to the agents within its observation, and after receiving the queries and signatures, the agents restore the communication confidence matrix by multiplying the vectors. In the election phase, after the confidence matrix has been computed, all agents compute the adjacency graph and select the agent with the highest degree as the preset agent, i.e., the agent whose messages most of the other agents need for their decisions; the preset agent acts as the leader node. In the communication phase, all non-leader nodes send their messages to the leader node; the leader encodes the received messages through the encoder network and then communicates with the other leaders. After the leaders have exchanged messages, in the decision-making phase each leader makes decisions based on the messages received from the other leaders and sends its decisions and messages to the other non-leader agents in its group, which then make their next decisions accordingly.
Compared with the current mainstream methods, such as star and fully connected communication, the beneficial effects of the present invention are as follows:
1. Both fully connected and star communication networks ignore the influence of the messages themselves on the agents' decisions; receiving inappropriate messages may hinder an agent's convergence and thus the maximization of the global reward. The present invention proposes using KL divergence to measure message importance, which ensures that only effective information is transmitted, avoids redundant message passing, and improves the convergence rate.
2. Both fully connected and star communication networks require a large number of end-to-end connections, whereas the present invention communicates by grouping agents and electing leaders, which greatly reduces communication links and communication bandwidth consumption.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a network schematic diagram of the multi-agent communication learning method provided by the present invention.
Fig. 2 is a schematic diagram of spectral clustering in the multi-agent communication learning method provided by the present invention.
Fig. 3 illustrates grouped agent communication in the multi-agent communication learning method provided by the present invention.
Fig. 4 shows the improvement of the global reward of the cooperating agents under the multi-agent communication learning method provided by the present invention.
Fig. 5 shows the communication traffic among the agents in the multi-agent communication learning method provided by the present invention.
Detailed Description of the Embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the drawings.
A typical distributed edge computing architecture consists of multiple edge devices (denoted "Device"). Suppose there are N edge devices; each device i can be regarded as an agent, the agents can be interconnected through networks such as WiFi and 5G, and they have limited computing power and bandwidth resources. Each agent has an action set A, and at every time step t agent i has its own local observation o_t^i; according to its observation o_t^i and its action policy, the agent selects and executes its next action, that is, a_t^i = π_i(o_t^i).
Meanwhile, once all agents have taken their corresponding actions, every agent receives a global reward r = env(a_0, a_1, ..., a_n).
The goal of a cooperative multi-agent system is to maximize the cumulative value of this global reward r, so all agents need to acquire, through message passing, the global information they care about in order to make collaborative decisions.
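The interaction loop just described can be summarized, purely as an illustration, by the following Python sketch; the env and policy interfaces used here are assumptions introduced for the example and are not part of the original disclosure.

```python
# Illustrative sketch only: cooperative agents acting on local observations and
# sharing a single global reward, as described above. `env` is a hypothetical object.
def run_episode(env, policies, horizon=100):
    """Each agent i picks a_t^i = pi_i(o_t^i); all agents receive one global reward r."""
    observations = env.reset()                     # list of local observations o_t^i
    total_reward = 0.0
    for _ in range(horizon):
        actions = [pi(o) for pi, o in zip(policies, observations)]
        observations, r, done = env.step(actions)  # r = env(a_0, ..., a_n), shared by all
        total_reward += r
        if done:
            break
    return total_reward
```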
The present invention follows the CTDE architecture: full information exchange is maintained during the training phase, while in the execution phase information encoding and communication-target selection are performed according to the trained communication networks.
As shown in Fig. 1, the multi-agent communication learning method of the present invention comprises a CriticNet, an ActorNet, a PriorNet and an EncoderNet. The CriticNet is used to compute communication importance during the training phase and to train the three corresponding networks on the end devices, namely the ActorNet, the PriorNet and the EncoderNet. Further, the CriticNet runs in the cloud and works only during the training phase; by computing the global reward and the communication priorities, the CriticNet computes the network loss, passes the gradients back to the other networks, and updates their parameters. The ActorNet is used to select the corresponding action on the agent side; it runs on the agent and works in both the training and execution phases. During training the ActorNet learns the agent's policy π and then, from the local observation and the received messages, generates the corresponding action a_t^i = π(o_t^i, c_t^i), where c_t^i is the message received by agent i at time t. The PriorNet is used by the agent to select its communication targets: it evaluates the agents observed in the local observation and outputs an importance value θ_j^i, i.e., the importance of agent j's message to agent i; when this value exceeds a certain threshold, the current agent i needs to obtain agent j's message in order to make a decision. The EncoderNet is used by the agent to encode its own information: since the agent's observation of the environment is low-dimensional and sparse, it is converted by an encoding network into a high-dimensional representation to reduce the size of the message body; in addition, besides the observation, the agent also encodes its own previous action together with the observation, for reference by other agents, which improves the stability of cooperation.
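For illustration only, the following PyTorch-style sketch shows one possible way to organize the four networks described above; the layer sizes, input interfaces and everything beyond the names ActorNet, PriorNet, EncoderNet and CriticNet are assumptions, not the patented architecture.

```python
# A PyTorch sketch (assumed hyperparameters) of the four networks: EncoderNet,
# ActorNet and PriorNet run on each agent, CriticNet runs in the cloud during training.
import torch
import torch.nn as nn

class EncoderNet(nn.Module):
    """Encodes the sparse local observation plus the previous action into a compact message."""
    def __init__(self, obs_dim, act_dim, msg_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, msg_dim))

    def forward(self, obs, prev_action_onehot):
        return self.net(torch.cat([obs, prev_action_onehot], dim=-1))

class ActorNet(nn.Module):
    """Outputs an action distribution pi(a | o_t^i, c_t^i) from the observation and received message."""
    def __init__(self, obs_dim, msg_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + msg_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, obs, msg):
        return torch.softmax(self.net(torch.cat([obs, msg], dim=-1)), dim=-1)

class PriorNet(nn.Module):
    """Scores observed agents with an importance value theta_j^i and emits query/signature vectors."""
    def __init__(self, obs_dim, key_dim):
        super().__init__()
        self.score = nn.Linear(obs_dim, 1)
        self.query = nn.Linear(obs_dim, key_dim)
        self.signature = nn.Linear(obs_dim, key_dim)

    def forward(self, obs):
        return torch.sigmoid(self.score(obs)), self.query(obs), self.signature(obs)

class CriticNet(nn.Module):
    """Cloud-side critic used only during training; consumes the joint state-action of all agents."""
    def __init__(self, joint_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, joint_state_action):
        return self.net(joint_state_action)
```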
For the policy network, i.e. the ActorNet, and the corresponding reward loss, the multi-agent communication learning method of the present invention uses the cross-entropy loss function as the error and gradient descent as the means of parameter updating.
Further, the key of the multi-agent communication learning method of the present invention lies in how an agent selects its communication targets and how the communication interaction is organized.
The importance calculation of the present invention, i.e., how to weight the other agents observed by agent i and assign communication priorities, proceeds as follows:
Step A: remove agent j's message and observe whether doing so changes the action output by the ActorNet;
Step B: since the action output by the ActorNet is a distribution over an action set, the KL divergence is used to compute the difference between the action distributions output by the agent's ActorNet; the specific formula is as follows:
θ_j^i = D_KL( π(a_t^i | o_{i}) ‖ π(a_t^i | o_{{i}\j}) )
Step C: here o_{i} denotes the set of messages from all the other agents observed by agent i, and o_{{i}\j} denotes the set of messages from the observed agents excluding agent j; the difference computed by the formula indicates whether the decision distribution lacking agent j's message is consistent with the decision distribution that includes agent j's message;
Step D: if the difference is large, agent j's message is very important to agent i, so its communication confidence is relatively high;
It is worth noting that this calculation requires evaluating the ActorNet output multiple times, so it can only be performed during the training phase; the result is also used as the supervision signal for training the PriorNet, so that during the execution phase the communication confidence can be obtained directly from the PriorNet without repeating the calculation.
Step E: after the confidences of all agents have been computed, a confidence matrix M between the agents is obtained, and the agents are grouped according to this confidence matrix.
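A minimal Python sketch of Steps A to E follows; the actor callable, the message dictionaries and all function names are illustrative assumptions, and the KL ordering simply mirrors the formula given above.

```python
# Illustrative computation of Steps A-E. `actor` is assumed to map a dict of
# observed messages to an action-probability vector; all names here are examples.
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete action distributions."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def importance(actor, messages_i, j):
    """theta_j^i = KL( pi(.|o_{i}) || pi(.|o_{i}\\j) ): drop agent j's message and compare."""
    with_all = actor(messages_i)
    without_j = actor({k: m for k, m in messages_i.items() if k != j})
    return kl_divergence(with_all, without_j)

def confidence_matrix(actors, observed_messages, n_agents):
    """Step E: M[i, j] is how much agent i's decision depends on agent j's message.
    Computed only in the training phase; the result also supervises the PriorNet."""
    M = np.zeros((n_agents, n_agents))
    for i in range(n_agents):
        for j in observed_messages[i]:
            M[i, j] = importance(actors[i], observed_messages[i], j)
    return M
```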
An example of the matrix M is shown in the figure below. The matrix M is rather sparse, which indicates that most agents do not need to communicate with one another, so the agents can be grouped by a spectral clustering algorithm. Spectral clustering is an algorithm that evolved from graph theory and later came into wide use for clustering. Its main idea is to regard all data as points in a space, connected by edges; the edge weight between two points that are far apart is low, while the edge weight between two points that are close is high. By cutting the graph formed by all data points so that the sum of the edge weights between the resulting subgraphs is as low as possible while the sum of the edge weights within each subgraph is as high as possible, the goal of clustering is achieved.
[Example confidence matrix M: figure not reproduced here.]
As shown in Fig. 3, this clustering algorithm makes the communication within each group relatively dense while the communication between groups remains relatively sparse.
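As a non-authoritative example, spectral clustering of the confidence matrix M could be carried out with scikit-learn as sketched below; the group count and the symmetrization step are assumptions made for the example.

```python
# Grouping the agents by spectral clustering of the confidence matrix M with
# scikit-learn; the group count and the symmetrization step are assumptions.
import numpy as np
from sklearn.cluster import SpectralClustering

def group_agents(M, n_groups=2):
    """Treat M as edge weights between agents and cut the graph so that intra-group
    edges stay heavy and inter-group edges stay light."""
    affinity = (M + M.T) / 2.0          # symmetric affinity required by the clusterer
    labels = SpectralClustering(n_clusters=n_groups,
                                affinity="precomputed").fit_predict(affinity)
    return labels                        # labels[i] = group index of agent i

# Example with a sparse 4-agent confidence matrix split into 2 groups
M = np.array([[0.0, 0.9, 0.0, 0.0],
              [0.8, 0.0, 0.1, 0.0],
              [0.0, 0.0, 0.0, 0.7],
              [0.0, 0.1, 0.9, 0.0]])
print(group_agents(M, n_groups=2))       # e.g. [0 0 1 1]
```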
Because the agents communicate in a distributed manner during the execution phase and there is no central node to assist with grouping, the present invention proposes a distributed grouping method. The PriorNet is made to output two values, a query and a signature: the signature vector is the information fingerprint of the agent itself and encodes the agent's own position and label; the query vector is the query information and encodes the set of agents with which the agent needs to communicate.
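A minimal sketch of the handshake-phase reconstruction is given below, assuming that each confidence entry is recovered as the inner product of the receiver's query vector and the sender's signature vector; this reading of "multiplying the vectors" is an assumption.

```python
# Handshake phase: every agent broadcasts (query, signature); a receiver rebuilds
# its row of the confidence matrix from the vectors it has received.
import numpy as np

def restore_confidence_row(my_query, signatures):
    """signatures: {agent_id: signature vector} received in the handshake.
    Returns {agent_id: restored confidence that this agent's message is needed}."""
    return {j: float(np.dot(my_query, sig)) for j, sig in signatures.items()}
```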
Further, the communication mechanism of the present invention comprises a handshake phase, an election phase, a communication phase and a decision-making phase. In the handshake phase, every agent broadcasts its query and signature to the agents within its observation; after receiving the queries and signatures, the agents can restore the communication confidence matrix by multiplying the vectors. In the election phase, after the confidence matrix has been computed, all agents compute the adjacency graph and select the agent with the highest degree, i.e., the agent whose messages most of the other agents need for their decisions, so it serves as the leader node. In the communication phase, all non-leader nodes send their messages to the leader node; the leader encodes the received messages through the encoder network and then communicates with the other leaders. After the leaders have exchanged messages, in the decision-making phase each leader makes decisions based on the messages received from the other leaders and sends its decisions and messages to the other non-leader agents in its group, which then make their next decisions accordingly.
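The election phase could, for example, be realized as in the following sketch; the adjacency threshold and the use of column sums as the node degree are assumptions introduced for illustration.

```python
# Election phase: build an adjacency graph from the restored confidence matrix and
# pick, in each group, the agent with the highest degree as the leader node.
import numpy as np

def elect_leaders(M, groups, threshold=0.5):
    """M[i, j]: confidence that agent i needs agent j's message; groups[i]: group of agent i."""
    adjacency = (M > threshold).astype(int)
    degree = adjacency.sum(axis=0)        # how many agents want each agent's messages
    leaders = {}
    for g in set(groups):
        members = [i for i, gid in enumerate(groups) if gid == g]
        leaders[g] = max(members, key=lambda i: degree[i])
    return leaders                         # {group id: leader agent id}
```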
Through the above grouped communication pattern, the present invention effectively reduces the cost of communication, reduces the number of communication links, and realizes efficient communication learning for multi-agent reinforcement learning: KL divergence is used to compute and measure message importance among the agents; spectral clustering on the confidence matrix groups the agents, thereby reducing communication links; and an in-group election based on node degree selects leader nodes to carry out inter-group communication, reducing traffic.
As shown in Fig. 4, the present invention has been shown to be feasible through extensive experiments and has been verified in OpenAI's open-source multi-agent reinforcement learning environment; the results show that the present invention helps improve cooperation among the agents and maximize the global reward.
As shown in Fig. 5, the communication volume of the present invention gradually decreases as training stabilizes: at first the agents learn to improve cooperation through communication, so the communication volume rises quickly; as training continues, the grouping method starts to work and gradually reduces the traffic.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (7)

  1. A multi-agent communication learning method, characterized by comprising:
    a CriticNet, wherein the CriticNet is used to compute communication importance during the training phase and to train the three corresponding networks on the end devices, namely the ActorNet, the PriorNet and the EncoderNet;
    an ActorNet, wherein the ActorNet is used to select the corresponding action on the agent side, runs on the agent, and works in both the training and execution phases; during training the ActorNet learns the agent's policy π and then, from the local observation and the received messages, generates the corresponding action a_t^i, that is, a_t^i = π(o_t^i, c_t^i), where c_t^i is the message received by agent i at time t;
    a PriorNet, wherein the PriorNet is used by the agent to select its communication targets; the PriorNet evaluates the agents observed in the local observation and outputs an importance value θ_j^i, i.e., the importance of agent j's message to agent i; and
    an EncoderNet, wherein the EncoderNet is used by the agent to encode its own information so as to reduce the size of the message body.
  2. The method according to claim 1, characterized in that the CriticNet runs in the cloud and works only during the training phase; by computing the global reward and the communication priorities, the CriticNet computes the network loss, passes the gradients back to the other networks, and updates their parameters.
  3. The method according to claim 1, characterized in that when the importance value exceeds a certain threshold, it indicates that the current agent i needs to obtain agent j's message in order to make a decision.
  4. The method according to claim 1, characterized in that the agent encodes its own previous action together with its observation, for reference by other agents, to improve the stability of cooperation.
  5. The method according to claim 1, characterized by further comprising a calculation method for the importance, with the following steps:
    Step A: remove agent j's message and observe whether doing so changes the action output by the ActorNet;
    Step B: since the action output by the ActorNet is a distribution over an action set, the KL divergence is used to compute the difference between the action distributions output by the agent's ActorNet; the specific formula is as follows:
    θ_j^i = D_KL( π(a_t^i | o_{i}) ‖ π(a_t^i | o_{{i}\j}) )
    Step C: here o_{i} denotes the set of messages from all the other agents observed by agent i, and o_{{i}\j} denotes the set of messages from the observed agents excluding agent j; the difference computed by the formula indicates whether the decision distribution lacking agent j's message is consistent with the decision distribution that includes agent j's message;
    Step D: if the difference is large, agent j's message is very important to agent i, so its communication confidence is relatively high;
    Step E: after the confidences of all agents have been computed, a confidence matrix M between the agents is obtained, and the agents are grouped according to this confidence matrix.
  6. The method according to claim 1, characterized by further comprising a distributed grouping method in which the PriorNet outputs two values, a query and a signature: the signature vector is the information fingerprint of the agent itself and encodes the agent's own position and label; the query vector is the query information and encodes the set of agents with which the agent needs to communicate.
  7. The method according to claim 6, characterized by further comprising a communication mechanism that comprises a handshake phase, an election phase, a communication phase and a decision-making phase, wherein in the handshake phase every agent broadcasts its query and signature to the agents within its observation, and after receiving the queries and signatures, the agents restore the communication confidence matrix by multiplying the vectors; in the election phase, after the confidence matrix has been computed, all agents compute the adjacency graph and select the agent with the highest degree as the preset agent, i.e., the agent whose messages most of the other agents need for their decisions, and the preset agent acts as the leader node; in the communication phase, all non-leader nodes send their messages to the leader node, and the leader encodes the received messages through the encoder network and then communicates with the other leaders; after the leaders have exchanged messages, in the decision-making phase each leader makes decisions based on the messages received from the other leaders and sends its decisions and messages to the other non-leader agents in its group, which then make their next decisions accordingly.
PCT/CN2022/138140 2021-12-17 2022-12-09 一种多智能体的通信学习方法 WO2023109699A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111549398.0 2021-12-17
CN202111549398.0A CN114298178A (zh) 2021-12-17 2021-12-17 一种多智能体的通信学习方法

Publications (1)

Publication Number Publication Date
WO2023109699A1 true WO2023109699A1 (zh) 2023-06-22

Family

ID=80967633

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/138140 WO2023109699A1 (zh) 2021-12-17 2022-12-09 一种多智能体的通信学习方法

Country Status (2)

Country Link
CN (1) CN114298178A (zh)
WO (1) WO2023109699A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114123178A (zh) * 2021-11-17 2022-03-01 哈尔滨工程大学 一种基于多智能体强化学习的智能电网分区网络重构方法
CN117031399A (zh) * 2023-10-10 2023-11-10 浙江华创视讯科技有限公司 多智能体协同的声源定位方法、设备及存储介质
CN117575220A (zh) * 2023-11-15 2024-02-20 杭州智元研究院有限公司 一种面向异构多智能体的多任务策略博弈方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114298178A (zh) * 2021-12-17 2022-04-08 深圳先进技术研究院 一种多智能体的通信学习方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113286275A (zh) * 2021-04-23 2021-08-20 南京大学 一种基于多智能体强化学习的无人机集群高效通信方法
CN113592079A (zh) * 2021-08-13 2021-11-02 大连大学 一种面向大规模任务空间的协同多智能体通信方法
CN113642233A (zh) * 2021-07-29 2021-11-12 太原理工大学 一种通信机制优化的群体智能协同方法
CN114298178A (zh) * 2021-12-17 2022-04-08 深圳先进技术研究院 一种多智能体的通信学习方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113286275A (zh) * 2021-04-23 2021-08-20 南京大学 一种基于多智能体强化学习的无人机集群高效通信方法
CN113642233A (zh) * 2021-07-29 2021-11-12 太原理工大学 一种通信机制优化的群体智能协同方法
CN113592079A (zh) * 2021-08-13 2021-11-02 大连大学 一种面向大规模任务空间的协同多智能体通信方法
CN114298178A (zh) * 2021-12-17 2022-04-08 深圳先进技术研究院 一种多智能体的通信学习方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HANGYU MAO; ZHIBO GONG; ZHENGCHAO ZHANG; ZHEN XIAO; YAN NI: "Learning Multi-agent Communication under Limited-bandwidth Restriction for Internet Packet Routing", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 26 February 2019 (2019-02-26), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081154030 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114123178A (zh) * 2021-11-17 2022-03-01 哈尔滨工程大学 一种基于多智能体强化学习的智能电网分区网络重构方法
CN114123178B (zh) * 2021-11-17 2023-12-19 哈尔滨工程大学 一种基于多智能体强化学习的智能电网分区网络重构方法
CN117031399A (zh) * 2023-10-10 2023-11-10 浙江华创视讯科技有限公司 多智能体协同的声源定位方法、设备及存储介质
CN117031399B (zh) * 2023-10-10 2024-02-20 浙江华创视讯科技有限公司 多智能体协同的声源定位方法、设备及存储介质
CN117575220A (zh) * 2023-11-15 2024-02-20 杭州智元研究院有限公司 一种面向异构多智能体的多任务策略博弈方法

Also Published As

Publication number Publication date
CN114298178A (zh) 2022-04-08

Similar Documents

Publication Publication Date Title
WO2023109699A1 (zh) 一种多智能体的通信学习方法
US20220114475A1 (en) Methods and systems for decentralized federated learning
CN109039942B (zh) 一种基于深度强化学习的网络负载均衡***及均衡方法
CN113010305B (zh) 部署在边缘计算网络中的联邦学习***及其学习方法
Wang et al. A novel reputation-aware client selection scheme for federated learning within mobile environments
CN111245903B (zh) 一种基于边缘计算的联合学习方法及***
CN111310932A (zh) 横向联邦学习***优化方法、装置、设备及可读存储介质
Shi et al. Machine learning for large-scale optimization in 6g wireless networks
CN111629380A (zh) 面向高并发多业务工业5g网络的动态资源分配方法
WO2024032121A1 (zh) 一种基于云边端协同的深度学习模型推理加速方法
CN114417417A (zh) 一种基于联邦学习的工业物联网隐私保护***及方法
Zou et al. Wireless multi-agent generative ai: From connected intelligence to collective intelligence
WO2022111398A1 (zh) 数据模型训练方法及装置
CN114357676A (zh) 一种针对层次化模型训练框架的聚合频率控制方法
Lv et al. Edge computing task offloading for environmental perception of autonomous vehicles in 6G networks
Chen et al. Profit-aware cooperative offloading in uav-enabled mec systems using lightweight deep reinforcement learning
Sun et al. Zero-shot multi-level feature transmission policy powered by semantic knowledge base
Le et al. Applications of distributed machine learning for the Internet-of-Things: A comprehensive survey
Hu et al. Clustered data sharing for Non-IID federated learning over wireless networks
Zhou et al. Digital Twin-Based 3D Map Management for Edge-Assisted Device Pose Tracking in Mobile AR
Wu et al. Agglomerative federated learning: Empowering larger model training via end-edge-cloud collaboration
Si et al. UAV-assisted Semantic Communication with Hybrid Action Reinforcement Learning
CN116133082A (zh) 一种提高航空自组网拓扑持续时间的多跳分簇方法
Zhu et al. Deep reinforced energy efficient traffic grooming in fog-cloud elastic optical networks
CN115499365A (zh) 路由优化方法、装置、设备及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22906446

Country of ref document: EP

Kind code of ref document: A1