CN112232478B - Multi-agent reinforcement learning method and system based on layered attention mechanism - Google Patents

Multi-agent reinforcement learning method and system based on layered attention mechanism

Info

Publication number
CN112232478B
CN112232478B
Authority
CN
China
Prior art keywords
value
action
agents
agent
estimated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010913132.9A
Other languages
Chinese (zh)
Other versions
CN112232478A (en)
Inventor
史殿习
王雅洁
张拥军
薛超
郝锋
姜浩
王功举
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
Original Assignee
Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center filed Critical Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
Priority to CN202010913132.9A priority Critical patent/CN112232478B/en
Publication of CN112232478A publication Critical patent/CN112232478A/en
Application granted granted Critical
Publication of CN112232478B publication Critical patent/CN112232478B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a multi-agent reinforcement learning method and system based on a hierarchical attention mechanism, comprising the following steps: constructing a learning environment comprising a plurality of agents; a Critic network calculates an estimated value based on the observed values and action values of the other agents, which each agent acquires through a hierarchical attention mechanism, and optimizes it through a minimized joint loss function until the loss function converges; an action-value function is calculated based on the observed values, the action values and the trained estimated value in combination with an estimated Actor network, and is optimized by maximizing an advantage function until the optimal action-value function is obtained; and a deterministic action is executed based on the optimal action-value function. By combining the hierarchical attention mechanism with the Actor-Critic network framework, the invention realizes scalable reinforcement learning of agents in the environment to which they belong.

Description

Multi-agent reinforcement learning method and system based on layered attention mechanism
Technical Field
The invention relates to multi-agent systems, and in particular to a multi-agent reinforcement learning method and system based on a hierarchical attention mechanism.
Background
Deep reinforcement learning has made significant progress in many areas, such as Atari games, the game of Go, and complex continuous control tasks related to locomotion. We generally refer to the entity that learns and makes decisions as the agent, and to everything outside the agent that it interacts with as the environment. The agent selects actions, the environment gives corresponding feedback to those actions, and a new state is presented to the agent. The environment also generates a payoff (i.e., a reward) that the agent seeks to maximize over its choice of actions. This sequential decision process can be modeled as a Markov decision process (Markov Decision Process, MDP), which is a mathematically idealized form of the reinforcement learning problem and a theoretical framework for achieving goals through interactive learning.
In the currently existing deep reinforcement learning paradigms, most attention is directed to a single agent in a static environment or to special dynamic environments in which zero-sum games exist (e.g., AlphaGo). In a multi-agent environment, multiple agents share an environment and need to cooperate or compete with other agents to complete an objective. Based on the relationships between agents in a multi-agent environment, multi-agent tasks can be categorized into three classes: fully cooperative, fully competitive, and mixed tasks (competition and cooperation at the same time). For an agent to learn effectively in a multi-agent environment, it must learn not only the dynamics of the environment but also the dynamics of the other agents. In existing multi-agent reinforcement learning algorithms, the simplest idea is to train each agent independently, treat the other agents as part of the environment, and maximize the agent's own reward feedback. But this idea violates the basic assumption of reinforcement learning: because the other agents' policies change, the environment is non-stationary from the viewpoint of any single agent and the Markov property no longer holds, so the agent cannot use a reinforcement learning algorithm based on a stationary Markov decision process. Another extreme idea is to model the multiple agents as one centrally controlled single agent whose action space is the joint action space of all agents. This idea is not scalable, because the sizes of the state space and the action space grow exponentially with the number of agents. Moreover, since the central controller must collect information from each agent and distribute decisions to each agent during the decision process, it places very high communication requirements on the agents executing the decisions, which are difficult to satisfy in the real world.
Disclosure of Invention
Aiming at the problem that agents in the prior art cannot simultaneously satisfy the Markov property and scalable learning, the invention provides a multi-agent reinforcement learning method based on a hierarchical attention mechanism, which comprises the following steps:
constructing a learning environment, wherein the learning environment comprises a plurality of agents;
a Critic network calculates an estimated value based on the observed values and action values of the other agents among the plurality of agents, which each agent acquires through its hierarchical attention mechanism, and optimizes the estimated value through a minimized joint loss function until the joint loss function converges;
calculating an action-value function based on the observed values, the action values and the trained estimated value in combination with an estimated Actor network, and optimizing by maximizing an advantage function until the optimal action-value function is obtained;
based on the optimal action-value function, a deterministic action is performed.
Preferably, the building a learning environment, the learning environment including a plurality of agents, includes:
a cooperative, competitive or hybrid learning environment is constructed, and all agents in the learning environment are numbered.
Preferably, the Critic network calculating an estimated value based on the observed values and action values of the other agents among the plurality of agents, which each agent acquires through its hierarchical attention mechanism, and optimizing the estimated value through a minimized joint loss function until the joint loss function converges, includes:
constructing an estimated Critic network by combining a hierarchical attention mechanism with a multi-head attention mechanism;
based on the observed values and action values, classifying the plurality of agents through the hierarchical attention mechanism and configuring weights, and calculating the contribution value of each agent to the other agents among the plurality of agents through the weight configuration;
calculating an estimated value based on the observed values and action values obtained by the estimated Critic network while the plurality of agents interact with the environment, together with the weighted contribution value of each agent to the other agents among the plurality of agents;
and optimizing the estimated value through the minimized joint loss function formula until the joint loss function converges, so as to obtain the optimized estimated value.
Preferably, constructing the estimated Critic network by combining the hierarchical attention mechanism with a multi-head attention mechanism comprises:
dividing the hierarchical attention mechanism into a plurality of expression subspaces and establishing a multi-head attention mechanism;
based on the multi-head attention mechanism, setting all attention heads to the same network structure, and calculating the individual-level weights and group-level weights of enemies and friends in each expression subspace;
and constructing the estimated Critic network based on the individual-level and group-level weights of enemies and friends in all expression subspaces.
Preferably, the classifying of the plurality of agents through the hierarchical attention mechanism and the weight configuration based on the observed values and action values, and the calculation of the contribution value of each agent to the other agents among the plurality of agents through the weight configuration, include:
inputting the observed values and action values into a recurrent neural network for encoding, and calculating the individual-level weight of each agent through the hierarchical attention mechanism and the encoding;
and calculating the enemy group-level weight and the friend group-level weight based on the category and individual-level weight of each agent, and calculating the weighted contribution value of each agent to the other agents among the plurality of agents.
Preferably, calculating the action-value function based on the observed values, the action values and the trained estimated value in combination with the estimated Actor network, and optimizing by maximizing the advantage function until the optimal action-value function is obtained, includes:
receiving the observed values, action values and the estimated values of all agents through the Actor network, and calculating the action-value function of each agent;
and optimizing based on the action-value function of each agent and the maximized advantage function formula to obtain the optimal action-value function.
Preferably, a target Actor network and a target Critic network are further included;
historical information generated while the estimated Actor network interacts with the environment is copied to the target Actor network and stored in a memory pool;
the target Critic network calculates the next estimated value from the next observed value and action value;
the target Actor network predicts the next executed action from the observed values and action values in the memory pool and the next estimated value calculated by the target Critic network;
and the predicted action of the target Actor network is optimized based on the predicted next executed action and the action executed by the estimated Actor network when interacting with the environment.
Preferably, the contribution value of the other agents among the plurality of agents to each agent is calculated according to the following formula:
x_i = Σ_m α_m · ReLU(V_g · g_i^m), where α_m ∝ exp((W_k^g · g_i^m)^T · (W_q^g · e_i))
wherein x_i is the contribution value of all weighted agents to the selected agent; g_i^m is the overall value of the m-th group (friend group or enemy group); α_m is the degree of attention of agent i to the friend group (or the enemy group) as a whole; ReLU is a nonlinear activation function; V_g is the group-level linear value projection matrix; W_q^g converts the original value into a query value and W_k^g converts the original value into a key value; e_i is the encoding of the observation o_i and action value a_i of agent i.
Preferably, the minimized joint loss function is calculated as follows:
L_Q(μ) = Σ_{i=1}^{N} E_{(o,a,r,o')~D} [ (Q_i^μ(o, a) − y_i)^2 ], with y_i = r + γ E_{a'~π̄(o')} [ Q_i^μ̄(o', a') − ω·log π̄_i(a'_i | o'_i) ]
wherein L_Q(μ) is the minimized joint loss function; E is the mathematical expectation; Q_i^μ is the estimate of the action-value function of agent i; o is the observed value of the selected agent i; a is the action value of the selected agent i; r is the reward; γ is the discount factor; o' is the predicted observed value of the selected agent; a' is the predicted action value of the selected agent i; o'_i is the predicted observed value of agent i; a'_i is the predicted action value of agent i; Q_i^μ̄ is the target Critic network; π̄ is the target Policy network; ω is the temperature parameter of the maximum entropy; y_i is the real feedback value given by the environment when agent i executes action value a under observation o.
Preferably, the maximized advantage function is calculated as follows:
∇_{θ_i} J(π_θ) = E_{o~D, a~π} [ ∇_{θ_i} log π_{θ_i}(a_i | o_i) · ( −ω·log π_{θ_i}(a_i | o_i) + Q_i^μ(o, a) − b(o, a_\i) ) ]
the evaluation value output for each action is calculated according to the following formula:
Q_i(o, (·, a_\i)) = f_i(g_i^o(o_i), x_i)
wherein J is the objective function; D is the sampling distribution; g_i^o is the encoder of the observations; f_i is the embedding function.
Based on the same inventive concept, the invention also provides a multi-agent reinforcement learning system based on a layered attention mechanism, which is characterized by comprising a learning environment module, a Critic module, an Actor network module and an execution action module;
the learning environment module: constructing a learning environment, wherein the learning environment comprises a plurality of agents;
The Critic module: a Critic network calculates an estimated value based on the observed values and action values of the other agents among the plurality of agents, which each agent acquires through its hierarchical attention mechanism, and optimizes the estimated value through a minimized joint loss function until the joint loss function converges;
the Actor network module: calculating an action-value function based on the observed values, the action values and the trained estimated value in combination with an estimated Actor network, and optimizing by maximizing an advantage function until the optimal action-value function is obtained;
the execution action module: based on the optimal action-value function, a deterministic action is performed.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides a multi-agent reinforcement learning method based on a hierarchical attention mechanism, comprising the following steps: constructing a learning environment comprising a plurality of agents; a Critic network calculates an estimated value based on the observed values and action values of the other agents, which each agent acquires through a hierarchical attention mechanism, and optimizes it through a minimized joint loss function until the loss function converges; an action-value function is calculated based on the observed values, the action values and the trained estimated value in combination with an estimated Actor network, and is optimized by maximizing an advantage function until the optimal action-value function is obtained; and a deterministic action is executed based on the optimal action-value function. By combining the hierarchical attention mechanism with the Actor-Critic network framework, the invention realizes scalable reinforcement learning of agents in the environment to which they belong.
2. The invention provides a multi-agent reinforcement learning algorithm that uses an RNN in the Critic network to encode information, which solves the problem of effectively preprocessing the variable-length sequence information received by the agents when heterogeneous agents are present.
3. The invention provides a multi-agent reinforcement learning algorithm that compresses the received information in the Critic network using an attention mechanism, which effectively addresses the sensitivity of the Critic input space to the number of agents and improves the scalability of the algorithm in multi-agent tasks.
Drawings
FIG. 1 is a schematic diagram of a multi-agent reinforcement learning method based on a hierarchical attention mechanism according to the present invention;
FIG. 2 is a diagram of a network framework of the present invention;
FIG. 3 is an environmental profile of the present invention;
FIG. 4 is a flowchart of an algorithm of the present invention;
FIG. 5 is a schematic diagram showing the performance of the present invention in four scenarios compared to other reinforcement learning algorithms;
FIG. 6 is a diagram comparing the present invention with other reinforcement learning algorithms in pursuit scenarios with varying numbers of agents;
fig. 7 is a schematic diagram of attention weight distribution in a pursuit scene according to the present invention.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings.
Example 1
With reference to fig. 1, the present invention provides a multi-agent reinforcement learning method based on a hierarchical attention mechanism, including:
constructing a learning environment, wherein the learning environment comprises a plurality of agents;
a Critic network calculates an estimated value based on the observed values and action values of the other agents among the plurality of agents, which each agent acquires through its hierarchical attention mechanism, and optimizes the estimated value through a minimized joint loss function until the joint loss function converges;
calculating an action-value function based on the observed values, the action values and the trained estimated value in combination with an estimated Actor network, and optimizing by maximizing an advantage function until the optimal action-value function is obtained;
based on the optimal action-value function, a deterministic action is performed.
Constructing a learning environment, the learning environment including a plurality of agents, includes:
constructing a cooperative, competitive or hybrid learning environment, and numbering all agents in the learning environment.
The Critic network calculating an estimated value based on the observed values and action values of the other agents among the plurality of agents acquired by the hierarchical attention mechanism of each agent, and optimizing the estimated value through a minimized joint loss function until the joint loss function converges, comprises the following steps:
constructing an estimated Critic network by combining a hierarchical attention mechanism with a multi-head attention mechanism;
based on the observed values and action values, classifying the plurality of agents through the hierarchical attention mechanism and configuring weights, and calculating the contribution value of each agent to the other agents among the plurality of agents through the weight configuration;
calculating an estimated value based on the observed values and action values obtained by the estimated Critic network while the plurality of agents interact with the environment, together with the weighted contribution value of each agent to the other agents among the plurality of agents;
and optimizing the estimated value through the minimized joint loss function formula until the joint loss function converges, so as to obtain the optimized estimated value.
Constructing the estimated Critic network using a combination of hierarchical and multi-head attention mechanisms comprises:
dividing the hierarchical attention mechanism into a plurality of expression subspaces and establishing a multi-head attention mechanism;
based on the multi-head attention mechanism, setting all attention heads to the same network structure, and calculating the individual-level weights and group-level weights of enemies and friends in each expression subspace;
constructing the estimated Critic network based on the individual-level and group-level weights of enemies and friends in all expression subspaces.
Classifying the plurality of agents through the hierarchical attention mechanism and configuring weights based on the observed values and action values, and calculating the contribution value of each agent to the other agents among the plurality of agents through the weight configuration, comprises the following steps:
inputting the observed values and action values into a recurrent neural network for encoding, and calculating the individual-level weight of each agent through the hierarchical attention mechanism and the encoding;
and calculating the enemy group-level weight and the friend group-level weight based on the category and individual-level weight of each agent, and calculating the weighted contribution value of each agent to the other agents among the plurality of agents.
Calculating the action-value function based on the observed values, the action values and the trained estimated value in combination with the estimated Actor network, and optimizing by maximizing the advantage function until the optimal action-value function is obtained, comprises the following steps:
receiving the observed values, action values and estimated values of all agents through the Actor network, and calculating the action-value function of each agent;
and optimizing based on the action-value function of each agent and the maximized advantage function formula to obtain the optimal action-value function.
A target Actor network and a target Critic network are also included;
historical information generated while the estimated Actor network interacts with the environment is copied to the target Actor network and stored in a memory pool;
the target Critic network calculates the next estimated value from the next observed value and action value;
the target Actor network predicts the next executed action from the observed values and action values in the memory pool and the next estimated value calculated by the target Critic network;
the predicted action of the target Actor network is optimized based on the predicted next executed action and the action executed by the estimated Actor network when interacting with the environment.
The contribution value of the other agents among the plurality of agents to each agent is calculated as follows:
x_i = Σ_m α_m · ReLU(V_g · g_i^m), where α_m ∝ exp((W_k^g · g_i^m)^T · (W_q^g · e_i))
wherein x_i is the contribution value of all weighted agents to the selected agent; g_i^m is the overall value of the m-th group (friend group or enemy group); α_m is the degree of attention of agent i to the friend group (or the enemy group) as a whole; ReLU is a nonlinear activation function; V_g is the group-level linear value projection matrix; W_q^g converts the original value into a query value and W_k^g converts the original value into a key value; e_i is the encoding of the observation o_i and action value a_i of agent i.
The minimized joint loss function is calculated as:
L_Q(μ) = Σ_{i=1}^{N} E_{(o,a,r,o')~D} [ (Q_i^μ(o, a) − y_i)^2 ], with y_i = r + γ E_{a'~π̄(o')} [ Q_i^μ̄(o', a') − ω·log π̄_i(a'_i | o'_i) ]
wherein L_Q(μ) is the minimized joint loss function; E is the mathematical expectation; Q_i^μ is the estimate of the action-value function of agent i; o is the observed value of the selected agent i; a is the action value of the selected agent i; r is the reward; γ is the discount factor; o' is the predicted observed value of the selected agent; a' is the predicted action value of the selected agent i; o'_i is the predicted observed value of agent i; a'_i is the predicted action value of agent i; Q_i^μ̄ is the target Critic network; π̄ is the target Policy network; ω is the temperature parameter of the maximum entropy; y_i is the real feedback value given by the environment when agent i executes action value a under observation o.
The maximized advantage function is calculated as:
∇_{θ_i} J(π_θ) = E_{o~D, a~π} [ ∇_{θ_i} log π_{θ_i}(a_i | o_i) · ( −ω·log π_{θ_i}(a_i | o_i) + Q_i^μ(o, a) − b(o, a_\i) ) ]
The evaluation value output for each action is calculated as follows:
Q_i(o, (·, a_\i)) = f_i(g_i^o(o_i), x_i)
wherein J is the objective function; D is the sampling distribution; g_i^o is the encoder of the observations; f_i is the embedding function.
Example 2
Referring to fig. 2, the present invention includes the steps of:
First, a reinforcement learning environment is constructed in conjunction with fig. 3. It mainly comprises three cooperative environments and one pursuit environment (a mixed environment). The environments simulate a real physical-world environment in which elastic forces, resistance, and the like exist. The specific scenarios are as follows:
1.1 Cooperative navigation scenario: in this scenario, X agents cover as many of the L preset target points as possible (L = X) through cooperation. All agents have only physical actions, but can observe their positions relative to the other agents and the target points. The reward feedback of the agents is related to the distance from each target to any agent, so the agents must cover every target to obtain the maximum reward. Moreover, when an agent collides with another agent, a penalty is incurred. We want the agents to learn to cover all targets while avoiding collisions during movement. This is a simple cooperation scenario.
1.2 Treasure collection scenario: the scenario includes 8 agents, 6 of which are treasure hunters and 2 of which are treasure banks. Each bank corresponds to treasures with a different serial number. Hunters can collect treasures of any serial number and deposit them in the bank with the corresponding serial number. Each bank is responsible for collecting as many treasures as possible from the hunters. The agents can observe the relative positions of the other agents. A hunter obtains a global reward for successfully collecting a treasure, and all agents obtain a reward for depositing a treasure. Hunters are additionally penalized for collisions with each other. This scenario is a cooperation scenario involving both global and individual rewards.
1.3 Director-actor scenario: the scenario includes 8 agents, 4 of which are actors and 4 of which are directors. At initialization, directors and actors are randomly paired one to one. In this environment, an actor cannot see its target directly in its field of view and must rely on communication with its director to reach the target. A director can locate its actor and the target that the actor needs to reach, and can send discrete communications to the corresponding actor. The reward feedback of a director and an actor depends on the distance from the actor to the target: the closer the distance, the higher the reward. Notably, in this scenario the directors have only communication actions and the actors have only physical actions. In our environment setup, the communication is part of the actor's observation space rather than part of the learned model. This scenario is a cooperation scenario with heterogeneous agents.
1.4 Pursuit scenario: in this scenario, there are X slower, cooperating chasers, Y faster evaders (Y < X), and L randomly generated obstacle landmarks (L < X). Each agent can observe its relative distance to the other agents and the obstacles. At each time step, if a chaser collides with an evader, the chaser obtains a large reward and the evader is penalized. At the same time, the environment has boundaries, and when an evader collides with a boundary, an additional penalty is incurred. This scenario is a mixed scenario in which both cooperation and competition exist among the agents.
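For illustration only, such scenarios can be instantiated on top of the open-source multi-agent particle environment released by OpenAI. The sketch below assumes that package's make_env helper and its built-in scenario names (e.g. simple_spread for cooperative navigation, simple_tag for pursuit); these identifiers come from that repository, not from the present disclosure, and may differ between versions.

```python
# Minimal sketch: instantiating an MPE scenario comparable to sections 1.1 and 1.4.
# Assumes the open-source multiagent-particle-envs package is installed; the
# helper `make_env` and the scenario names are taken from that repository.
from make_env import make_env

env = make_env('simple_tag')     # pursuit (mixed) scenario: chasers vs. evaders
obs_n = env.reset()              # list of per-agent observations o_1, ..., o_N
print(env.n)                     # number of learning agents N in the scenario
print(env.observation_space)     # list of per-agent observation spaces
print(env.action_space)          # list of per-agent action spaces
```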
Second, a hierarchical attention mechanism is constructed to process the received information. In a multi-agent task scenario, an algorithm using the Actor-Critic framework requires the Critic network to receive the information (i.e., observed values and action values) of the other agents; the Critic network gives feedback to the Actor network according to the information of the other agents together with the agent's own observed value and action value, so that the Actor network can select better actions based on its own observations. Therefore, how accurately the Critic network processes this information affects the quality of the agent's policy. We use a hierarchical attention mechanism and recurrent neural networks (Recurrent Neural Networks, RNN) to make the centralized Critic network learn from the received information more accurately. In real-world environments, especially in mixed tasks where both cooperation and competition exist, an agent may face not only enemies that require attention but also powerful teammates. Considering both the friendly and enemy sides, the agent needs to work out which currently has the greater influence on itself: the threat from the enemy or the assistance from its friends. The algorithm constructs the hierarchical attention mechanism as follows:
2.1 According to prior knowledge, an agent i is arbitrarily selected from the environment, all agents other than agent i are grouped, and indices are set. The specific steps are as follows:
2.1.1 Number all agents; the environment contains N agents in total, i ∈ {1, ..., N}. The symbol \i is used in the algorithm to denote the set of agents other than agent i.
2.1.2 Based on prior knowledge, i.e., the type of each agent, the agents in \i are classified as friends (i.e., the same type as agent i) or enemies (i.e., a different type from agent i). The index of the set formed by all friends of agent i is j_1; the index of the set formed by all enemies of agent i is j_2.
2.1.3 A fully cooperative task can be seen as the special case in which the number of enemies is 0, where j_2 is an empty set.
2.2 The hierarchical attention mechanism is introduced into the Actor-Critic framework, which performs centralized training and distributed execution. The specific steps are as follows:
2.2.1 The centralized Critic network of agent i receives the observations o = (o_1, ..., o_N) and action values a = (a_1, ..., a_N) of all agents.
2.2.2 The observation o_i' and action value a_i' of each agent i' in \i are input into the RNN and encoded, as shown in the following formula. Because all agents follow the Markov property, an agent's action value is closely related to its current observation. When processing correlated sequence information, the RNN better preserves the sequence information implicit in the original data.
e_i' = h_i'(o_i', a_i')
wherein e_i' denotes the embedded "state-action" pair of agent i', and h_i' is an embedding function that encodes the observation o_i' and action value a_i' of agent i'.
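As a rough illustration of this encoding step, the following PyTorch sketch uses a GRU as the recurrent encoder; the choice of GRU, the hidden size, and the chunking of the concatenated observation-action vector into a sequence are assumptions, since the description above only specifies that a recurrent neural network is used.

```python
import torch
import torch.nn as nn

class RNNEncoder(nn.Module):
    """Encodes a (possibly variable-length) observation-action sequence
    into a fixed-length embedding e_i = h_i(o_i, a_i)."""
    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(input_size=in_dim, hidden_size=embed_dim, batch_first=True)

    def forward(self, obs_act_seq: torch.Tensor) -> torch.Tensor:
        # obs_act_seq: (batch, seq_len, in_dim) -- the concatenated (o_i, a_i)
        # features, chunked into a sequence so that heterogeneous agents with
        # different observation lengths can share the same encoder structure.
        _, h_n = self.gru(obs_act_seq)   # h_n: (1, batch, embed_dim)
        return h_n.squeeze(0)            # e_i: (batch, embed_dim)
```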
2.2.3 Calculate the individual-level attention. First, the importance distribution over the information of the agents in the friend group j_1 is calculated, and the weighted overall value of the friends g_i^1 is obtained according to the influence of each agent in the friend group on agent i, as shown in the following formula with m = 1. Next, the importance distribution over the information of the agents in the enemy group j_2 is calculated, and the weighted overall value of the enemies g_i^2 is obtained according to the influence of each agent in the enemy group on agent i, as shown in the same formula with m = 2.
g_i^m = Σ_{j ∈ j_m} α_j · ReLU(V_a · e_j), where α_j ∝ exp((W_k^a · e_j)^T · (W_q^a · e_i) / √d_k)
wherein α_j denotes the individual-level attention weight within the same group; ReLU denotes a nonlinear activation function, for which the algorithm uses a LeakyReLU modified from the ReLU; V_a is the individual-level linear value projection matrix. In the individual-level attention weight calculation, W_q^a converts the original value into a query value and W_k^a converts the original value into a key value, and the weights are computed and normalized according to the correlation between the query value and the key value. Meanwhile, to prevent the gradient from vanishing, the correlation result is scaled by √d_k, where d_k is the dimension of the key vector.
2.2.4 Calculate the group-level attention. According to the influence of the friend overall value and the enemy overall value on agent i, the importance distribution over the different groups of information is calculated, and the contribution x_i of all other weighted agents to agent i is obtained, as shown in the following formula.
x_i = Σ_m α_m · ReLU(V_g · g_i^m), where α_m ∝ exp((W_k^g · g_i^m)^T · (W_q^g · e_i) / √d_k)
wherein α_m reflects the degree of attention of agent i to the friend group (or the enemy group) as a whole, i.e., the group-level attention weight; ReLU denotes a nonlinear activation function, for which the algorithm uses a LeakyReLU modified from the ReLU; V_g is the group-level linear value projection matrix. As in the individual-level attention weight calculation, W_q^g converts the original value into a query value and W_k^g converts the original value into a key value, and the weights are computed, normalized and scaled according to the correlation between the query value and the key value.
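Sections 2.2.3 and 2.2.4 can be sketched, for illustration only, as the single-head PyTorch module below. It assumes a scaled dot-product softmax at both levels, LeakyReLU as the value nonlinearity, shared embedding dimensions, and that the individual-level projections are shared across the friend and enemy groups; these are readings of the formulas above rather than the exact parameterization of the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttention(nn.Module):
    """Computes the contribution x_i of the other agents to agent i via
    individual-level attention within each group (friends / enemies),
    followed by group-level attention over the group summaries."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        d = embed_dim
        # individual level: query/key/value projections (W_q^a, W_k^a, V_a)
        self.q_ind = nn.Linear(d, d, bias=False)
        self.k_ind = nn.Linear(d, d, bias=False)
        self.v_ind = nn.Linear(d, d, bias=False)
        # group level: query/key/value projections (W_q^g, W_k^g, V_g)
        self.q_grp = nn.Linear(d, d, bias=False)
        self.k_grp = nn.Linear(d, d, bias=False)
        self.v_grp = nn.Linear(d, d, bias=False)
        self.scale = d ** 0.5

    def _attend(self, e_i, e_others, q_proj, k_proj, v_proj):
        # e_i: (batch, d), e_others: (batch, n, d)
        q = q_proj(e_i).unsqueeze(1)                  # query from agent i
        k = k_proj(e_others)                          # keys from the other agents
        v = F.leaky_relu(v_proj(e_others))            # LeakyReLU value transform
        alpha = F.softmax((q * k).sum(-1) / self.scale, dim=-1)   # (batch, n)
        return (alpha.unsqueeze(-1) * v).sum(1)       # weighted sum, (batch, d)

    def forward(self, e_i, e_friends, e_enemies=None):
        # Individual level: summarize each group, weighted by relevance to agent i.
        groups = [self._attend(e_i, e_friends, self.q_ind, self.k_ind, self.v_ind)]
        if e_enemies is not None and e_enemies.size(1) > 0:
            groups.append(self._attend(e_i, e_enemies, self.q_ind, self.k_ind, self.v_ind))
        g = torch.stack(groups, dim=1)                # (batch, n_groups, d)
        # Group level: weight the friend/enemy summaries against each other.
        return self._attend(e_i, g, self.q_grp, self.k_grp, self.v_grp)   # x_i
```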
2.3 combining hierarchical attention mechanism with multi-head attention mechanism, the specific steps are as follows:
2.3.1 A classical attention mechanism is used to capture the correlation between keys and query values, allowing the model to dynamically focus on valid information based on key-value pairs. The Transformer model proposed by Google in the 2017 article "Attention is all you need" uses a multi-head attention mechanism, which can integrate the attention distributions in different expression subspaces, as shown in the following formula. In the multi-head attention mechanism, the network structure of each attention head is the same.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O
2.3.2 Referring to the Transformer, a multi-head attention structure is introduced. Here, each attention head contains two separate sets of parameters, namely the individual-level parameters (W_q^a, W_k^a, V_a) and the group-level parameters (W_q^g, W_k^g, V_g), which are shared among all agents, producing a common representation of the influence of the other agents on agent i. Different attention heads can focus on different weighted effects. After the group-level attention calculation of each attention head, the results of all attention heads and the information of agent i are concatenated into one vector as the input of the Critic network.
2.3.3 In each level of each attention head, the weights used to extract candidate values, keys and values are shared among all agents. Even in an adversarial environment, these key parameters can be shared between agents, because the value-function approximation of a multi-agent system is a multi-task regression problem. Agents with the same goal use a set of centralized Critics based on the enhanced attention mechanism. Using parameter sharing allows our approach to learn efficiently in environments where the rewards of individual agents differ but have common characteristics. On this basis, a centralized Critic network is built using the hierarchical attention mechanism; because of the parameter sharing, the Critic networks of agents within the same group are updated together.
Third step, constructing an Actor network
3.1 Calculate the Q-value function Q_i^μ(o, a) of agent i, which represents the estimate of the action-value function of agent i and requires receiving the observations o = (o_1, ..., o_N) and action values a = (a_1, ..., a_N) of all agents, as shown in the following formula.
Q_i^μ(o, a) = f_i(e_i, x_i)
where f_i is a two-layer multi-layer perceptron (Multi-Layer Perceptron, MLP), e_i = h_i(o_i, a_i) is the encoding of agent i's own observation and action, and x_i is the weighted contribution of the other agents obtained from the hierarchical attention mechanism.
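Putting the encoder and the attention output together, this action-value estimate can be sketched as a small head network. The two-layer width and the simple concatenation of e_i with x_i are assumptions consistent with the formula above, not a definitive implementation.

```python
import torch
import torch.nn as nn

class CriticHead(nn.Module):
    """Q_i^mu(o, a) = f_i(e_i, x_i): a two-layer MLP over the agent's own
    encoding e_i and the attention-weighted contribution x_i of the others."""
    def __init__(self, embed_dim: int = 128, hidden_dim: int = 128, out_dim: int = 1):
        super().__init__()
        self.f_i = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, out_dim),  # out_dim = |A_i| when one value per action is needed
        )

    def forward(self, e_i: torch.Tensor, x_i: torch.Tensor) -> torch.Tensor:
        return self.f_i(torch.cat([e_i, x_i], dim=-1))
```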
3.2 Design the minimized joint loss function, as shown in the following formula.
L_Q(μ) = Σ_{i=1}^{N} E_{(o,a,r,o')~D} [ (Q_i^μ(o, a) − y_i)^2 ], with y_i = r + γ E_{a'~π̄(o')} [ Q_i^μ̄(o', a') − ω·log π̄_i(a'_i | o'_i) ]
wherein Q_i^μ̄ and π̄ are the target Critic network and the target Policy network respectively; γ is the discount factor; ω is the temperature parameter of the maximum entropy, used to determine the balance between the maximum entropy and the reward value; y_i is the real feedback value given by the environment when agent i executes action value a under observation o.
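A rough PyTorch rendering of this TD-style joint loss follows; the function name, tensor shapes, and the treatment of the entropy-regularized target as a fixed (detached) quantity are assumptions made for illustration.

```python
import torch.nn.functional as F

def critic_loss(q_estimates, rewards, next_q_targets, next_log_pis, gamma=0.99, omega=0.01):
    """Minimized joint loss L_Q(mu), summed over all agents.

    q_estimates    : list of (batch, 1) tensors Q_i^mu(o, a)
    rewards        : list of (batch, 1) per-agent rewards r
    next_q_targets : list of (batch, 1) tensors from the target Critic, Q_i^mu_bar(o', a')
    next_log_pis   : list of (batch, 1) log-probabilities of a'_i under the target policy
    """
    loss = 0.0
    for q_i, r_i, q_bar_i, logp_i in zip(q_estimates, rewards, next_q_targets, next_log_pis):
        # y_i: soft TD target; the target-network quantities are held fixed, hence detach().
        y_i = r_i + gamma * (q_bar_i - omega * logp_i).detach()
        loss = loss + F.mse_loss(q_i, y_i)
    return loss
```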
The variance of the objective function is reduced by using an advantage function that marginalizes only the action of the specified agent i.
3.3 By comparing the value of a specific action with the value of the average action of the specified agent, with all other agents' actions held fixed, we can tell whether a specific action of agent i leads to an increase in the expected return, or whether the increase in reward is due to the actions of the other agents.
3.4 In a continuous action space, we can estimate the required expected return by sampling from the policy of agent i, or by learning a separate value head that takes only the actions of the other agents as input.
3.5 In a discrete action space, we can use the neural network to output, for every possible action a'_i ∈ A_i that agent i can take, an estimate of the corresponding expected return Q_i(o, (a'_i, a_\i)), and use these estimates to calculate a baseline, as shown in the following equations.
A_i(o, a) = Q_i^μ(o, a) − b(o, a_\i)
b(o, a_\i) = Σ_{a'_i ∈ A_i} π_i(a'_i | o_i) · Q_i(o, (a'_i, a_\i))
wherein A_i(o, a) is the advantage function and b(o, a_\i) is the counterfactual baseline.
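The baseline and advantage for the discrete case can be sketched as below. The function name and tensor layout are assumptions; the per-action Q values are the outputs of the Critic head with a_i removed from its input, as described in step 3.6.

```python
import torch

def counterfactual_advantage(q_all_actions, action_index, pi_probs):
    """Advantage A_i(o, a) with a counterfactual baseline, discrete action space.

    q_all_actions : (batch, |A_i|) -- one evaluation value per possible action a'_i
    action_index  : (batch,) long  -- the action a_i actually taken by agent i
    pi_probs      : (batch, |A_i|) -- current policy probabilities pi_i(a'_i | o_i)
    """
    q_taken = q_all_actions.gather(1, action_index.unsqueeze(1))    # Q_i(o, a)
    baseline = (pi_probs * q_all_actions).sum(dim=1, keepdim=True)  # b(o, a_\i)
    return q_taken - baseline                                       # A_i(o, a)
```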
3.6 To calculate the counterfactual baseline b(o, a_\i), we need to remove a_i from the input of Q_i^μ and output an evaluation value for each possible action a'_i. For this purpose, we add an observation encoder e_i^o = g_i^o(o_i), as shown in the following formula; e_i^o is substituted for the encoding e_i = h_i(o_i, a_i) required by f_i, and the output dimension of f_i is modified to the dimension of agent i's action space, so that an evaluation value is output for each possible action.
Q_i(o, (·, a_\i)) = f_i(g_i^o(o_i), x_i)
3.7 The estimated Policy network of agent i is updated using stochastic gradient methods to maximize the advantage-based objective, as shown in the following formula.
∇_{θ_i} J(π_θ) = E_{o~D, a~π} [ ∇_{θ_i} log π_{θ_i}(a_i | o_i) · ( −ω·log π_{θ_i}(a_i | o_i) + Q_i^μ(o, a) − b(o, a_\i) ) ]
3.8 The actions a used for the gradient estimate of agent i are sampled from the current policies of all agents, and the estimated Policy network of agent i is then updated; this avoids the generalization problem that arises when agent i is not evaluated under the current policies.
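For a discrete action space, steps 3.5 to 3.8 combine into an actor update of roughly the following form. This is a sketch under stated assumptions (categorical policy, actions resampled from the current policy, the critic signal treated as fixed), not the reference implementation.

```python
import torch
from torch.distributions import Categorical

def actor_loss(logits_i, q_all_actions, omega=0.01):
    """Negated advantage objective for agent i (a quantity to be minimized).

    logits_i      : (batch, |A_i|) -- outputs of agent i's estimated Actor network
    q_all_actions : (batch, |A_i|) -- Critic evaluation values for every possible a'_i
    """
    dist = Categorical(logits=logits_i)
    a_i = dist.sample()                                   # sample from the current policy (step 3.8)
    log_pi = dist.log_prob(a_i).unsqueeze(1)              # log pi_i(a_i | o_i)
    baseline = (dist.probs * q_all_actions).sum(dim=1, keepdim=True)   # b(o, a_\i)
    q_taken = q_all_actions.gather(1, a_i.unsqueeze(1))                # Q_i(o, (a_i, a_\i))
    advantage = (q_taken - baseline).detach()                          # A_i(o, a), treated as fixed
    # maximize E[log pi * (-omega * log pi + A_i)]  ->  minimize the negative
    return -(log_pi * (advantage - omega * log_pi.detach())).mean()
```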
Fourth, construct the network framework. The network structure of the algorithm follows the centralized-training, distributed-execution framework of the classical multi-agent Actor-Critic approach. The framework includes four networks: two Actor neural networks (referred to as the target Actor network and the estimated Actor network) and two Critic neural networks (referred to as the target Critic network and the estimated Critic network) used to guide the updating of the Actor networks. The update frequencies of the estimated networks and the target networks differ: the target networks are updated slowly while the estimated networks are updated quickly. During training, only the parameters of the estimated Actor network and the estimated Critic network need to be trained; the parameters of the target Actor network and the target Critic network are copied from those of the estimated networks every preset fixed number of time steps.
4.1 To increase the utilization of the sampled data, online learning is changed into offline learning. A memory replay pool is set up, and the historical information generated while each agent interacts with the environment is stored in the replay pool. To keep the policy fixed, a more slowly updated target network is used to save the parameters of the estimated network at the current time. Therefore, the structure of the target network is identical to that of the estimated network, and its parameters are copied from the estimated network every k time steps. In this way, the network can always train with the currently sampled data within the k time steps without having to resample data continually.
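The replay pool and the target-network tracking described here and in step 4.2.8 can be sketched as follows; the class and function names, the pool capacity, and the soft-update coefficient tau are illustrative assumptions.

```python
import random
from collections import deque

import torch

class ReplayPool:
    """Fixed-size memory pool storing (o, a, r, o') transitions for offline learning."""
    def __init__(self, capacity: int = 100_000):
        self.pool = deque(maxlen=capacity)

    def push(self, obs_n, act_n, rew_n, next_obs_n):
        self.pool.append((obs_n, act_n, rew_n, next_obs_n))

    def sample(self, n: int):
        return random.sample(list(self.pool), n)

def soft_update(target_net: torch.nn.Module, source_net: torch.nn.Module, tau: float = 0.01):
    """Slowly track the estimated network: theta_bar <- tau*theta + (1 - tau)*theta_bar."""
    for tgt, src in zip(target_net.parameters(), source_net.parameters()):
        tgt.data.copy_(tau * src.data + (1.0 - tau) * tgt.data)
```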
4.2 Construct the algorithm framework with centralized training and distributed execution. Each agent has an estimated Actor network whose input is the agent's own observation o_i of the environment and whose output is a deterministic action a_i; the action is performed in the environment and receives feedback from the environment. During training, the estimated Actor network only uses the data generated by the agent's own interaction with the environment. Each agent also corresponds to an estimated Critic network. Unlike the estimated Actor network, the estimated Critic network takes as input the data generated by all agents, i.e., the observed values and action values of all agents, forming a centralized Critic network. During training, the centralized Critic network is used to assist the Actor network; during execution, only the Actor network is used. Referring to fig. 4, the steps of the framework are as follows:
4.2.1 initializing parameters of an Actor network and a Critic network;
4.2.2 Random action exploration is performed in the environment, and the information (o, a, r, o') sampled from the environment is stored in the replay pool, where o = (o_1, ..., o_N) is the joint observation of the environment by all agents at time t, a = (a_1, ..., a_N) is the set of actions executed by the agents at time t, r is the reward feedback obtained from the environment after the agents execute their actions, and o' is the joint observation of the environment by all agents at time t+1 after the actions have been executed;
4.2.3 Extract a batch of sample data (o^n, a^n, r^n, o'^n) from the replay pool, where n is the number of samples in the batch.
4.2.4 Input the actions a and environment observations o from the sample pool into the agent's estimated Critic network: the information is first preprocessed by the RNN, then passed to the hierarchical attention network for representation learning and compressed into a fixed-length vector, and finally the estimated Q_i^μ value is calculated.
4.2.5 Use the minimized joint loss function L_Q(μ), with the TD target y_i defined in step 3.2, to update the estimated Critic network.
4.2.6 Use the advantage-based objective function defined in step 3.7 to update the estimated Actor network.
4.2.7 Compute the Q_i^μ value of agent i using the estimated Critic network, which is used to guide the estimated Actor network in calculating the executed action a_i.
4.2.8 Copy the parameters of the estimated networks to the target networks using soft updates, as shown in the following formula, where Q̄ and π̄ are the target Critic network and the target Policy network respectively, μ and θ are the parameters of the estimated Critic network and the estimated Actor network respectively, and τ is the soft-update coefficient.
μ̄ ← τ·μ + (1 − τ)·μ̄,  θ̄ ← τ·θ + (1 − τ)·θ̄
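For orientation only, the following Python sketch strings steps 4.2.1 through 4.2.8 together. The helper names (make_env, ReplayPool, soft_update) refer to the illustrative sketches earlier in this description, and the per-agent methods act, update_critic and update_actor are hypothetical placeholders for the Critic and Actor updates of steps 4.2.4 to 4.2.7.

```python
def train(env, agents, pool, n_episodes=10000, batch_size=1024, update_every=100):
    step = 0
    for _ in range(n_episodes):
        obs_n = env.reset()
        done = False
        while not done:
            # 4.2.2: act in the environment and store the transition
            act_n = [ag.act(o) for ag, o in zip(agents, obs_n)]
            next_obs_n, rew_n, done_n, _ = env.step(act_n)
            pool.push(obs_n, act_n, rew_n, next_obs_n)
            obs_n, done, step = next_obs_n, all(done_n), step + 1

            if len(pool.pool) >= batch_size and step % update_every == 0:
                batch = pool.sample(batch_size)              # 4.2.3
                for ag in agents:
                    ag.update_critic(batch)                  # 4.2.4 - 4.2.5
                    ag.update_actor(batch)                   # 4.2.6 - 4.2.7
                for ag in agents:
                    soft_update(ag.target_critic, ag.critic) # 4.2.8
                    soft_update(ag.target_actor, ag.actor)
    return agents
```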
The invention, Actor Hierarchical Attention Critic (AHAC for short), was compared with other reinforcement learning algorithms. The experimental equipment was a desktop computer with an Intel i7-8700 processor running at 3.20 GHz, 32 GB of RAM, and an Nvidia GTX 1050 graphics card with 4 GB of memory; the operating system was Ubuntu 16.04. First, tests were carried out on the Multi-agent Particle Environment (MPE) open-sourced by OpenAI, with the environment parameters left at their default settings. This environment is a test environment commonly used by multi-agent reinforcement learning algorithms. It effectively reflects several cooperation and confrontation scenarios in the real world: the environment uses forces and time to compute the velocities of and distances between the agents, and several real-world problems can be abstracted into it.
The generalization performance of the algorithms in different scenarios is compared. We tested the performance of three algorithms and their variants (i.e., MADDPG, MAAC and AHAC) in the four scenarios, where AHAC-RNN and AHAC-HAN are both simplified versions of AHAC: AHAC-RNN uses only the RNN-based preprocessing module, and AHAC-HAN uses only the hierarchical attention mechanism. The performance of the algorithms is compared through the return values (i.e., rewards) of the agents; the comparison results are shown in fig. 5, where a higher reward value indicates better algorithm performance. The performance of the invention is superior to the other algorithms, and the performance and generalization capability of the model are higher.
The scalability of the algorithms is compared in the pursuit scenario, i.e., the influence of different numbers of agents on algorithm performance. In the pursuit scenario, the numbers of chasers (friends) and evaders (enemies) are set to 3 vs. 1, 3 vs. 2, 3 vs. 3, 6 vs. 2, 6 vs. 6 and 9 vs. 3 respectively. As the number of agents increases, the performance of the method is far superior to that of the other algorithms; the comparison results are shown in fig. 6. Because the Critic network based on the hierarchical attention mechanism can extract the important information from large-scale information, the chasers can select a better strategy.
The role of the hierarchical attention mechanism in the algorithm is verified. In the pursuit scenario, rewards are shared among the chasers, so the weight distribution of the information exchanged between the chasers is related only to the distances between the agents. The attention weight distribution in the pursuit scenario is shown in fig. 7. When all the evaders are far away and the other chasers are close together, a chaser preferentially avoids collisions with the other chasers. As an evader gets closer to the chaser, the chaser strives to get closer to the evader. Therefore, the hierarchical attention mechanism can indeed capture the important information of other agents and help the agents obtain higher rewards.
Example 3
A learning environment module: constructing a learning environment, wherein the learning environment comprises a plurality of agents;
a Critic module: a Critic network calculates an estimated value based on the observed values and action values of the other agents among the plurality of agents, which each agent acquires through its hierarchical attention mechanism, and optimizes the estimated value through a minimized joint loss function until the joint loss function converges;
an Actor network module: calculating an action-value function based on the observed values, the action values and the trained estimated value in combination with an estimated Actor network, and optimizing by maximizing an advantage function until the optimal action-value function is obtained;
an execution action module: based on the optimal action-value function, a deterministic action is performed.
The acquisition module comprises an estimated value sub-module, a minimum joint loss function sub-module and an optimization sub-module;
the Critic module comprises a construction sub-module, a contribution value calculation sub-module, an estimated value calculation sub-module and an estimated value optimization sub-module;
and (3) constructing a sub-module: constructing an estimated Critic network by combining a layered attention mechanism and a multi-head attention mechanism;
a calculate contribution value sub-module: based on the observed value and the action value, grading the plurality of agents through a hierarchical attention mechanism, carrying out weight configuration, and calculating the contribution value of each agent to other agents in the plurality of agents through the weight configuration;
The calculate estimation submodule: calculating an estimated value based on an observed value and an action value obtained in the process that the estimated Critic network receives interaction between a plurality of agents and the environment and the contribution value of each weighted agent to other agents in the plurality of agents;
the optimal estimation value submodule: and optimizing the estimated value through a minimum joint loss function formula until the minimum joint loss function converges, and obtaining the optimized estimated value.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof; all modifications, equivalents, improvements and the like made within the spirit and principle of the present invention are intended to be included within the scope of the present invention as defined by the appended claims.

Claims (3)

1. A multi-agent reinforcement learning method based on a layered attention mechanism is characterized by comprising the following steps:
Constructing a learning environment, wherein the learning environment comprises a plurality of agents;
the Critic network calculates an estimated value based on the observed values and action values of the other agents among the plurality of agents, which each agent acquires through its hierarchical attention mechanism, and optimizes the estimated value through a minimized joint loss function until the joint loss function converges;
calculating an action-value function based on the observed values, the action values and the trained estimated value in combination with an estimated Actor network, and optimizing by maximizing an advantage function until the optimal action-value function is obtained;
performing a deterministic action based on the optimal action-value function;
the learning environment is a collaborative navigation scene simulating a real physical world, in the scene, a plurality of agents reach a preset target point through collaboration, the quantity of the agents is the same as that of the preset target point, all the agents only have physical actions, and the relative positions of the agents, other agents and the target point can be observed; the rewards feedback of the agents is related to the distance of each target to any agent, so the agents are required to cover each target to obtain the maximum rewards; moreover, when the intelligent agent collides with other intelligent agents, punishment is obtained;
the Critic network calculating an estimated value based on the observed values and action values of the other agents among the plurality of agents acquired by the hierarchical attention mechanism of each agent, and optimizing the estimated value through the minimized joint loss function until the joint loss function converges, comprises the following steps:
constructing an estimated Critic network by combining a hierarchical attention mechanism with a multi-head attention mechanism;
based on the observed values and action values, classifying the plurality of agents through the hierarchical attention mechanism and configuring weights, and calculating the contribution value of each agent to the other agents among the plurality of agents through the weight configuration;
calculating an estimated value based on the observed values and action values obtained by the estimated Critic network while the plurality of agents interact with the environment, together with the weighted contribution value of each agent to the other agents among the plurality of agents;
optimizing the estimated value through the minimized joint loss function formula until the joint loss function converges, so as to obtain an optimized estimated value;
the method for constructing the estimated Critic network by combining the hierarchical attention mechanism and the multi-head attention mechanism comprises the following steps:
dividing the hierarchical attention mechanism into a plurality of expression subspaces, and establishing a multi-head attention mechanism;
Setting all multi-head attention to the same network structure based on a multi-head attention mechanism, and calculating individual-level weights and group-level weights of enemies and friends in each expression subspace;
constructing an estimated Critic network based on individual-level weights and group-level weights of enemies and friends in all expression subspaces;
the step of grouping the plurality of agents into levels through the hierarchical attention mechanism based on the observed values and action values, assigning weights, and calculating the contribution value of each agent to the other agents from these weights comprises:
feeding the observed values and action values into a recurrent neural network for encoding, and calculating the individual-level weight of each agent from the encoding through the hierarchical attention mechanism;
calculating the enemy group-level weight and the friend group-level weight from each agent's category and individual-level weight, and then calculating the weighted contribution value of each agent to the other agents;
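As one reading of these two steps, the sketch below first encodes each agent's observation–action sequence with a recurrent (GRU) network, then computes individual-level attention weights inside the friend group and inside the enemy group, and finally computes group-level weights over the two group summaries; the GRU encoder, the dot-product scoring and all dimensions are illustrative assumptions rather than the claimed parameterization.

```python
import torch
import torch.nn as nn

class HierarchicalAttention(nn.Module):
    """Individual-level attention within friend/enemy groups, then group-level attention."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(obs_dim + act_dim, hidden, batch_first=True)  # recurrent encoder
        self.W_q = nn.Linear(hidden, hidden, bias=False)
        self.W_k = nn.Linear(hidden, hidden, bias=False)
        self.V_g = nn.Linear(hidden, hidden, bias=False)

    def _attend(self, query, keys_values):
        # keys_values: (B, M, H); returns the weighted sum and the attention weights
        scores = (self.W_q(query).unsqueeze(1) * self.W_k(keys_values)).sum(-1)
        alpha = torch.softmax(scores / keys_values.shape[-1] ** 0.5, dim=-1)
        value = torch.relu(self.V_g(keys_values))
        return (alpha.unsqueeze(-1) * value).sum(1), alpha

    def forward(self, seq_i, friend_seqs, enemy_seqs):
        # seq_i: (B, T, obs_dim + act_dim); friend_seqs / enemy_seqs: lists of such tensors
        e_i = self.encoder(seq_i)[1].squeeze(0)                               # (B, H)
        e_f = torch.stack([self.encoder(s)[1].squeeze(0) for s in friend_seqs], dim=1)
        e_e = torch.stack([self.encoder(s)[1].squeeze(0) for s in enemy_seqs], dim=1)

        # Individual level: weights over the members of each group.
        v_friend, a_friend = self._attend(e_i, e_f)
        v_enemy, a_enemy = self._attend(e_i, e_e)

        # Group level: weights over the friend and enemy group summaries.
        groups = torch.stack([v_friend, v_enemy], dim=1)                      # (B, 2, H)
        x_i, a_group = self._attend(e_i, groups)
        return x_i, (a_friend, a_enemy, a_group)
```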
the step of calculating the action-value function from the observed values, the action values and the trained estimated value with the estimated Actor network, and optimizing it by maximizing the advantage function until the optimal action-value function is obtained, comprises:
receiving the observed values, action values and estimated values of all agents through the estimated Actor network, and calculating an action-value function for each agent;
optimizing each agent's action-value function with the advantage-maximization formula to obtain the optimal action-value function;
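To make the Actor side concrete, the sketch below assumes a discrete action space and a small feed-forward policy network; the name EstimatedActor, the layer sizes and the greedy rule for the final deterministic action are illustrative assumptions. During training an exploratory action is sampled, while at execution time the highest-probability action of the optimized policy is taken, which corresponds to the "performing a deterministic action" step.

```python
import torch
import torch.nn as nn

class EstimatedActor(nn.Module):
    """Per-agent policy network; layer sizes are illustrative."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, obs):
        # Return a categorical distribution over the discrete physical actions.
        return torch.distributions.Categorical(logits=self.net(obs))

# During training the action is sampled from the policy; at execution time
# the deterministic (greedy) action of the optimized policy is taken.
actor = EstimatedActor(obs_dim=8, n_actions=5)
obs = torch.randn(1, 8)
dist = actor(obs)
exploratory_action = dist.sample()            # used while learning
deterministic_action = dist.probs.argmax(-1)  # "performing a deterministic action"
```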
the method further comprises a target Actor network and a target Critic network;
the historical information generated while the estimated Actor network interacts with the environment is copied to the target Actor network and stored in a memory pool;
the target Critic network calculates the next estimated value from the next observed value and action value;
the target Actor network predicts the next action to execute from the observed values and action values in the memory pool and the next estimated value calculated by the target Critic network;
the predicted actions of the target Actor network are optimized based on the predicted next actions and the actions actually executed by the estimated Actor network when interacting with the environment;
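The cooperation between the estimated networks, the target networks and the memory pool described above is commonly realized with a replay buffer and slowly updated target copies; the sketch below assumes that pattern, and the buffer capacity and soft-update rate tau are illustrative values.

```python
import copy
import random
from collections import deque

import torch

class MemoryPool:
    """Stores (observation, action, reward, next observation) history tuples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, obs, act, rew, next_obs):
        self.buffer.append((obs, act, rew, next_obs))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def make_target(network):
    """Target Actor / target Critic start as copies of the estimated networks."""
    target = copy.deepcopy(network)
    for p in target.parameters():
        p.requires_grad_(False)
    return target

def soft_update(target, source, tau=0.01):
    """Move the target parameters slowly toward the estimated network's parameters."""
    with torch.no_grad():
        for t, s in zip(target.parameters(), source.parameters()):
            t.mul_(1.0 - tau).add_(tau * s)
```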
each agent receives the contribution value of other agents in the plurality of agents, and the contribution value is calculated according to the following formula:
x_i = \sum_{m} \alpha_m \,\mathrm{ReLU}\!\left(V_g v_m\right), \qquad \alpha_m \propto \exp\!\left((W_k v_m)^{\top} W_q e_i\right)
wherein x_i is the contribution value of all weighted agents to the selected agent i; v_m is the overall value of the agents in the same set (the friend group or the enemy group); α_m is the degree of attention that agent i pays to the friend group (or the enemy group) as a whole; ReLU is a nonlinear activation function; V_g is the group-level linear value projection matrix; W_q converts the original encoding into a query value and W_k converts it into a key value; e_i is the encoding of the observed value o_i and the action value a_i of agent i.
minimizing the joint loss function, calculated as:
L_Q(\mu) = \sum_i \mathbb{E}_{(o,a,r,o') \sim D}\!\left[\left(Q_i^{\mu}(o,a) - y_i\right)^2\right], \qquad y_i = r + \gamma\, \mathbb{E}_{a' \sim \bar{\pi}(o')}\!\left[\bar{Q}_i(o',a') - \omega \log \bar{\pi}(a'_i \mid o'_i)\right]
wherein L_Q(μ) is the joint loss function to be minimized; E is the mathematical expectation; Q_i^μ is the estimate of the action-value function of agent i; o is the observed value of the selected agent i; a is the action value of the selected agent i; r is the reward; γ is the discount factor; o' is the predicted observed value of the selected agent; a' is the predicted action value of the selected agent i; o'_i is the predicted observed value of agent i; a'_i is the predicted action value of agent i; Q̄ is the target Critic network; π̄ is the target Policy network; ω is the temperature parameter of the maximum entropy; y_i is the real feedback value given by the environment when agent i executes action value a under observed value o.
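A sketch of how this loss could be evaluated on a sampled batch follows, assuming discrete actions and the placeholder components from the earlier sketches (a critic that maps an agent's encoding and the other agents' encodings to one Q value per action, and a target actor that returns a categorical policy); gamma and omega are illustrative hyper-parameter values.

```python
import torch
import torch.nn.functional as F

def joint_critic_loss(critic, target_critic, target_actor,
                      e_i, e_others, actions, rewards,
                      next_obs, next_e_i, next_e_others,
                      gamma=0.95, omega=0.01):
    """Squared error between Q_i(o, a) and the entropy-regularized target y_i."""
    # Q value of the action actually executed (the critic outputs one value per action).
    q_taken = critic(e_i, e_others).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_pi = target_actor(next_obs)                      # categorical target policy
        next_q = target_critic(next_e_i, next_e_others)       # (B, n_actions)
        # Expectation over the target policy minus the maximum-entropy temperature term.
        next_v = (next_pi.probs *
                  (next_q - omega * torch.log(next_pi.probs + 1e-8))).sum(-1)
        y = rewards + gamma * next_v                          # target value y_i

    return F.mse_loss(q_taken, y)
```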
2. The method of claim 1, wherein maximizing the advantage function is calculated as:
\nabla_{\theta_i} J(\pi_\theta) = \mathbb{E}_{o \sim D,\, a \sim \pi}\!\left[\nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid o_i)\,\big(-\omega \log \pi_{\theta_i}(a_i \mid o_i) + A_i(o,a)\big)\right]
the evaluation value output for each action is calculated according to the following formula:
Q_i(o,a) = f_i\!\big(g_i(o_i, a_i),\, x_i\big)
wherein J is the objective function; D is the distribution from which the training samples are drawn; g_i is the encoder of the observed values; f_i is the embedding function.
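Under the same assumptions (discrete actions and the placeholder actor and critic sketched earlier), the advantage can be read as the Q value of the executed action minus the expectation of Q over the agent's own policy, with the actor updated along the resulting entropy-regularized gradient; the sketch below is one plausible reading of claim 2, not the patented formula itself.

```python
import torch

def actor_loss(actor, critic, obs, e_i, e_others, omega=0.01):
    """Loss whose minimization maximizes the (entropy-regularized) advantage."""
    pi = actor(obs)                                  # categorical policy pi(a_i | o_i)
    a_i = pi.sample()
    log_pi = pi.log_prob(a_i)

    with torch.no_grad():
        q_all = critic(e_i, e_others)                            # (B, n_actions)
        q_taken = q_all.gather(1, a_i.unsqueeze(1)).squeeze(1)
        baseline = (pi.probs * q_all).sum(-1)                    # expectation of Q over own policy
        advantage = q_taken - baseline                           # A_i(o, a)

    # Score-function update: gradient of log pi weighted by (advantage - omega * log pi).
    return (log_pi * (omega * log_pi.detach() - advantage)).mean()
```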
3. A multi-agent reinforcement learning system based on a hierarchical attention mechanism, the system implementing the steps of the method of claim 1, comprising a learning environment module, a Critic network module, an Actor network module, and an execution action module;
the learning environment module: builds the learning environment, which comprises a plurality of agents;
the Critic network module: the Critic network calculates an estimated value based on the observed values and action values of the other agents, acquired through each agent's hierarchical attention mechanism, and optimizes the estimated value by minimizing the joint loss function until it converges;
the Actor network module: calculates an action-value function from the observed values, the action values and the trained estimated value with the estimated Actor network, and optimizes it by maximizing the advantage function until the optimal action-value function is obtained;
the execution action module: performs a deterministic action based on the optimal action-value function.
CN202010913132.9A 2020-09-03 2020-09-03 Multi-agent reinforcement learning method and system based on layered attention mechanism Active CN112232478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010913132.9A CN112232478B (en) 2020-09-03 2020-09-03 Multi-agent reinforcement learning method and system based on layered attention mechanism

Publications (2)

Publication Number Publication Date
CN112232478A CN112232478A (en) 2021-01-15
CN112232478B true CN112232478B (en) 2023-11-17

Family

ID=74115432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010913132.9A Active CN112232478B (en) 2020-09-03 2020-09-03 Multi-agent reinforcement learning method and system based on layered attention mechanism

Country Status (1)

Country Link
CN (1) CN112232478B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949856A (en) * 2021-03-09 2021-06-11 华东师范大学 Multi-agent reinforcement learning method and system based on sparse attention mechanism
CN113264043A (en) * 2021-05-17 2021-08-17 北京工业大学 Unmanned driving layered motion decision control method based on deep reinforcement learning
CN113128657B (en) * 2021-06-17 2021-09-14 中国科学院自动化研究所 Multi-agent behavior decision method and device, electronic equipment and storage medium
CN113378468B (en) * 2021-06-18 2024-03-29 中国科学院地理科学与资源研究所 Weight optimization method and system for multidimensional geoscience parameters
CN113313267B (en) * 2021-06-28 2023-12-08 浙江大学 Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN113589695B (en) * 2021-08-02 2023-11-10 郑州大学 Robot behavior decision method and equipment based on memory sequence playback mechanism
CN113743468B (en) * 2021-08-03 2023-10-10 武汉理工大学 Collaborative driving information propagation method and system based on multi-agent reinforcement learning
CN113625757B (en) * 2021-08-12 2023-10-24 中国电子科技集团公司第二十八研究所 Unmanned aerial vehicle group scheduling method based on reinforcement learning and attention mechanism
CN113743784A (en) * 2021-09-06 2021-12-03 山东大学 Production time sequence table intelligent generation method based on deep reinforcement learning
CN113902125B (en) * 2021-09-24 2024-06-14 浙江大学 Intra-group cooperation agent control method based on deep hierarchical reinforcement learning
CN114037521A (en) * 2021-11-25 2022-02-11 工银科技有限公司 Financing pre-credit granting method, device, equipment and medium
CN117667360B (en) * 2024-01-31 2024-04-16 湘江实验室 Intelligent computing network scheduling method for computing and communication fusion of large model task

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN109635917B (en) * 2018-10-17 2020-08-25 北京大学 Multi-agent cooperation decision and training method

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN110991972A (en) * 2019-12-14 2020-04-10 中国科学院深圳先进技术研究院 Cargo transportation system based on multi-agent reinforcement learning
CN110969872A (en) * 2019-12-18 2020-04-07 上海天壤智能科技有限公司 Traffic signal control method and system based on reinforcement learning and graph attention network
CN111144793A (en) * 2020-01-03 2020-05-12 南京邮电大学 Commercial building HVAC control method based on multi-agent deep reinforcement learning
CN111241952A (en) * 2020-01-03 2020-06-05 广东工业大学 Reinforced learning reward self-learning method in discrete manufacturing scene

Non-Patent Citations (3)

Title
Research on group confrontation strategies based on deep reinforcement learning; Liu Qiang; Jiang Feng; Intelligent Computer and Applications (Issue 05); full text *
Multi-agent cooperation based on the MADDPG algorithm under sparse rewards; Xu Nuo; Yang Zhenwei; Modern Computer (Issue 15); full text *
Research on deep reinforcement learning for intelligent obstacle-avoidance scenarios; Liu Qingjie; Lin Youyong; Li Shaoli; Intelligent Internet-of-Things Technology (Issue 02); full text *

Also Published As

Publication number Publication date
CN112232478A (en) 2021-01-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant