CN116560239B - Multi-agent reinforcement learning method, device and medium - Google Patents

Multi-agent reinforcement learning method, device and medium

Info

Publication number
CN116560239B
Authority
CN
China
Prior art keywords
agent
intelligent
agents
observation
action
Prior art date
Legal status
Active
Application number
CN202310824569.9A
Other languages
Chinese (zh)
Other versions
CN116560239A (en)
Inventor
谭明奎
林坤阳
王宇丰
陈沛豪
杜卿
胡灏
李利
Current Assignee
Guangdong Guangwu Internet Technology Co ltd
South China University of Technology SCUT
Original Assignee
Guangdong Guangwu Internet Technology Co ltd
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by Guangdong Guangwu Internet Technology Co ltd, South China University of Technology SCUT filed Critical Guangdong Guangwu Internet Technology Co ltd
Priority to CN202310824569.9A
Publication of CN116560239A
Application granted
Publication of CN116560239B
Legal status: Active

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems as above, electric
    • G05B13/04 - Adaptive control systems as above, electric, involving the use of models or simulators
    • G05B13/042 - Adaptive control systems as above, in which a parameter or coefficient is automatically adjusted to optimise the performance
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning


Abstract

The application discloses a multi-agent reinforcement learning method, device and medium, belonging to the technical field of autonomous behavior control for multiple agents. The method comprises the following steps: obtaining observations, where each agent derives an action probability distribution from its own observation and infers the action probability distributions that its teammate agents would produce for the same observation; computing the behavior consistency between each agent and its teammate agents from the obtained action probability distributions; obtaining dynamic adjustment factors through a dynamic scaling network and computing a behavior-consistency intrinsic reward from them; optimizing the parameters of the dynamic scaling network via the chain rule with the objective of maximizing the external return; and using the optimized policies to accomplish multi-agent cooperative tasks. The application proposes an intrinsic reward based on behavior consistency, solves the problem of sub-optimal policies that arise when multi-agent cooperation algorithms ignore the coordination of behavioral intentions among agents, and can effectively improve cooperation performance among multiple agents.

Description

Multi-agent reinforcement learning method, device and medium
Technical Field
The application relates to the technical field of autonomous behavior control for multiple agents, and in particular to a multi-agent reinforcement learning method, device and medium.
Background
In reality, many automation scenarios can be regarded as multi-agent scenarios, such as multi-robot control, autonomous driving, and video games. In a multi-agent environment, each agent performs an action based on its own observation, and the agents jointly complete a task. The decision model of each agent is trained with multi-agent reinforcement learning techniques.
In reinforcement learning, the way the environment distributes rewards to an agent directly affects the agent's training outcome. In a multi-agent environment, the environment usually provides only a single external reward shared by all agents, which leads to a sparse-reward problem: it is difficult for each agent to accurately assess the influence of its own actions on the overall outcome, and therefore difficult to learn an optimal decision model. Designing an intrinsic reward for each agent is thus an effective way to alleviate reward sparsity. To design a reasonable intrinsic-reward distribution scheme, some existing approaches encourage agents to behave in diverse ways through a learnable intrinsic reward, while other studies use intrinsic rewards to promote similar behavior among agents and make the overall system more predictable. However, existing intrinsic-reward schemes only encourage globally consistent or inconsistent behavior over the whole course of interaction, and do not dynamically determine when agents should behave consistently. Dynamically encouraging consistent behavior is critical for facilitating inter-agent collaboration in complex multi-agent environments.
Disclosure of Invention
In order to solve, at least to some extent, at least one of the technical problems existing in the prior art, the application aims to provide a multi-agent reinforcement learning method, device and medium based on a behavior-consistency intrinsic reward.
The technical solution adopted by the application is as follows:
a multi-agent reinforcement learning method comprises the following steps:
obtaining observation, wherein the intelligent agent obtains action probability distribution according to the observation, and reasoning the action probability distribution of teammate intelligent agent based on the observation;
according to the obtained action probability distribution, calculating the behavior consistency of each agent and the teammate agent;
acquiring dynamic adjustment factors through a Dynamic Scaling (DSN) network based on the behavior consistency, and calculating internal rewards of the behavior consistency according to the dynamic adjustment factors;
according to a chained algorithm, optimizing parameters of a dynamic scaling network with the aim of maximizing external return;
and realizing the collaboration task of multiple agents by using the optimally completed strategy.
Further, obtaining observations, each agent deriving an action probability distribution from its observation and inferring the action probability distributions of teammate agents based on the observation, includes:
for each agent, inputting the agent's observation at the current time into its own actor to infer its next-action probability distribution;
the agent also inputs its own observation into the actors of the other agents to infer the action probability distribution that a teammate agent's actor would output when facing the same observation.
Further, computing the behavior consistency between each agent and its teammate agents includes:
computing, as the behavior consistency, the similarity between the next-action probability distribution inferred by each agent and the action probability distribution output by a teammate agent when facing the same observation.
Further, the similarity is calculated using the KL divergence, with the following formula:

$$d^{t}_{i,j} = \sum_{k=1}^{|A|} \pi_i(a_k\mid o^t_i)\,\log\frac{\pi_i(a_k\mid o^t_i)}{\pi_j(a_k\mid o^t_i)}$$

where $d^{t}_{i,j}$ denotes the behavior consistency between agent $i$ and agent $j$ at time $t$, $\pi_j(a_k\mid o^t_i)$ denotes the probability that agent $j$, given the same observation as agent $i$ at time $t$, outputs action $a_k$, $\pi_i(a_k\mid o^t_i)$ denotes the probability that agent $i$ outputs action $a_k$ at time $t$, and $|A|$ denotes the length of the agents' action space.
Further, obtaining the dynamic adjustment factors through the dynamic scaling network and computing the behavior-consistency intrinsic reward from the dynamic adjustment factors includes:
constructing a dynamic scaling network, whose input is the global observation and whose output is a dynamic adjustment factor for the behavior consistency with each of the other agents, the global observation being the observations of all agents;
computing the behavior-consistency intrinsic reward based on the dynamic adjustment factors and the behavior consistency.
Further, the behavior-consistency intrinsic reward obtained by agent $i$ at time $t$ is:

$$r^{c,t}_i = \lambda \sum_{j\in\mathcal{N}_i} \alpha^t_{i,j}\, d^t_{i,j}$$

where $\alpha^t_{i,j}$ is the dynamic adjustment factor output by agent $i$'s own dynamic scaling network, $\lambda$ is a scaling hyper-parameter, $d^t_{i,j}$ denotes the behavior consistency between agent $i$ and agent $j$ at time $t$, and $\mathcal{N}_i$ is agent $i$'s set of teammate agents. When $\alpha^t_{i,j}$ is negative, consistent behavior between agent $i$ and agent $j$ is encouraged; when $\alpha^t_{i,j}$ is positive, consistent behavior between agent $i$ and agent $j$ is penalized.
Further, optimizing the parameters of the dynamic scaling network via the chain rule with the objective of maximizing the external return includes:
constructing an external-value objective function and relating its derivative to the parameters of the dynamic scaling network through the chain rule, so that the update direction of the dynamic adjustment factors is consistent with the direction in which the external value increases.
Further, optimizing the parameters of the dynamic scaling network via the chain rule with the objective of maximizing the external return includes:
Let the parameters of the dynamic scaling network corresponding to the $i$-th agent be $\phi_i$. The gradient $\nabla_{\phi_i} J^{ext}$ of the external return $J^{ext}$ with respect to $\phi_i$ is expressed as:

$$\nabla_{\phi_i} J^{ext} = \nabla_{\theta'_i} J^{ext}\cdot\nabla_{\phi_i}\theta'_i \qquad (1)$$

where $\theta'_i$ denotes the network parameters of the $i$-th agent's actor policy after its update.
In formula (1), $\nabla_{\theta'_i} J^{ext}$ is further expressed as:

$$\nabla_{\theta'_i} J^{ext} = \mathbb{E}_{(o_i,a_i)\sim\mathcal{D}}\Big[A^{ext}\,\nabla_{\theta'_i}\log\pi_{\theta'_i}(a_i\mid o_i)\Big] \qquad (2)$$

where $A^{ext}$ denotes the external advantage, given by an external critic; $\pi_{\theta'_i}(a_i\mid o_i)$ denotes the probability that the updated actor policy of the $i$-th agent outputs action $a_i$; and $\mathcal{D}$ is the buffer recording the $i$-th agent's historical observations and actions.
According to the Soft Actor-Critic algorithm, $\theta'_i$ is computed as:

$$\theta'_i = \theta_i - \eta\,\nabla_{\theta_i}\,\mathbb{E}_{o_i\sim\mathcal{D},\,a_i\sim\pi_{\theta_i}}\Big[\tau\log\pi_{\theta_i}(a_i\mid o_i) - Q_{\psi'_i}(o_i,a_i)\Big] \qquad (3)$$

where $\eta$ is the learning rate, $\tau$ is the entropy temperature coefficient in the Soft Actor-Critic algorithm, and $Q_{\psi'_i}(o_i,a_i)$ is the value output by the $i$-th agent's updated auxiliary critic for observation $o_i$ and action $a_i$. In the Soft Actor-Critic algorithm, two identical auxiliary critics $Q_{\psi_{i,1}}$ and $Q_{\psi_{i,2}}$ are defined for each agent; the parameters of each auxiliary critic are updated independently, and the reward used in their updates is the sum of the intrinsic reward and the external reward, i.e. $r_i = r^{c,t}_i + r^{ext}$, where $r^{ext}$ is the external reward; then $Q_{\psi'_i} = \min\big(Q_{\psi'_{i,1}}, Q_{\psi'_{i,2}}\big)$.
Further deriving, $\nabla_{\phi_i}\theta'_i$ in formula (1) equals:

$$\nabla_{\phi_i}\theta'_i = \eta\,\nabla_{\theta_i}\,\mathbb{E}_{o_i\sim\mathcal{D},\,a_i\sim\pi_{\theta_i}}\Big[\nabla_{\phi_i} Q_{\psi'_i}(o_i,a_i)\Big] \qquad (4)$$

where $\psi_i$ and $\psi'_i$ are respectively the parameters of the $i$-th agent's auxiliary critic before and after its update, and the dependence of $Q_{\psi'_i}$ on $\phi_i$ arises because the auxiliary critic is updated with a reward whose intrinsic component is produced by the dynamic scaling network; $Q_{\psi_i}(o_i,a_i)$ is the action value output by the $i$-th agent's auxiliary critic before the update for observation $o_i$ and action $a_i$; $\pi_{\theta_i}$ is the $i$-th agent's actor policy before the update. Substituting formula (2) and formula (4) into formula (1) yields the gradient $\nabla_{\phi_i} J^{ext}$ of the external return with respect to $\phi_i$, and the dynamic scaling network parameters are optimized by gradient ascent.
Another technical solution adopted by the application is as follows:
a multi-agent reinforcement learning device, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method as described above.
Another technical solution adopted by the application is as follows:
A computer-readable storage medium in which a processor-executable program is stored, the processor-executable program, when executed by a processor, being used to carry out the method as described above.
The beneficial effects of the application are as follows: the application proposes an intrinsic reward based on behavior consistency, solves the problem of sub-optimal policies that arise when multi-agent cooperation algorithms ignore the coordination of behavioral intentions among agents, and can effectively improve cooperation performance among multiple agents.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following description refers to the accompanying drawings of the embodiments of the present application or of the related prior art. It should be understood that the drawings described below show only some embodiments of the technical solutions of the present application, and other drawings can be obtained from them by those skilled in the art without inventive labor.
FIG. 1 is a schematic diagram of a multi-agent reinforcement learning method based on behavior consistency internal rewards in an embodiment of the application;
FIG. 2 is a schematic diagram of a Dynamic Scaling Network (DSN) in accordance with an embodiment of the present application;
FIG. 3 is a task schematic diagram of N agents cooperatively navigating to N target points in an embodiment of the application.
Fig. 4 is a flowchart illustrating steps of a multi-agent reinforcement learning method according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present application, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present application and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present application.
In the description of the present application, "a number of" means one or more, "a plurality of" means two or more, and terms such as greater than, less than, and exceeding are understood to exclude the stated number, while terms such as above, below, and within are understood to include the stated number. The terms "first" and "second" are used only to distinguish technical features and should not be construed as indicating or implying relative importance, implicitly indicating the number of technical features indicated, or implicitly indicating the precedence of the technical features indicated.
Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship of associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, both A and B, or B alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
In the description of the present application, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present application can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
In order to overcome the defects of the prior art, the application proposes an intrinsic reward based on behavior consistency, thereby encouraging an agent to dynamically adjust the consistency between its own behavior and its teammates' behavior so as to maximize the task reward.
As shown in fig. 4, this embodiment provides a multi-agent reinforcement learning method, which includes the following steps:
S1, obtaining observations: each agent derives an action probability distribution from its own observation and infers the action probability distributions of its teammate agents based on the same observation.
For each agent, the observation fed back by the environment at the current time is input into the agent's own actor to infer its next-action probability distribution; the agent also inputs its own observation into the actors of the other agents to infer the action probability distribution that a teammate agent's actor would output when facing the same observation.
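The following PyTorch sketch illustrates how each agent could query its own actor and its teammates' actors with the same observation; the Actor architecture, hidden size, and discrete action space are assumptions introduced here for illustration, not the embodiment's exact networks.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Illustrative discrete-action actor: maps an observation to an action probability distribution."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(obs), dim=-1)

def infer_action_distributions(actors, observations):
    """For each agent i, compute pi_i(.|o_i) with its own actor and, for every teammate j,
    the distribution pi_j(.|o_i) obtained by feeding o_i into teammate j's actor."""
    own, teammates = [], []
    for i, actor_i in enumerate(actors):
        o_i = observations[i]
        own.append(actor_i(o_i))
        teammates.append({j: actors[j](o_i) for j in range(len(actors)) if j != i})
    return own, teammates
```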
S2, computing the behavior consistency between each agent and its teammate agents from the obtained action probability distributions.
In this embodiment, the KL divergence is used to compute the behavior consistency between each agent and its teammate agents, i.e. the similarity between the next-action probability distribution inferred in step S1 and the action probability distribution output by a teammate agent when facing the same observation is computed and taken as the behavior consistency.
S3, based on the behavior consistency, obtaining dynamic adjustment factors through a dynamic scaling network, and computing the behavior-consistency intrinsic reward from the dynamic adjustment factors.
In this embodiment, a Dynamic Scaling Network (DSN) is constructed whose input is the global observation and whose output is a dynamic adjustment factor for the behavior consistency with each teammate agent; the factor is multiplied by the corresponding behavior consistency to compute the behavior-consistency intrinsic reward.
S4, optimizing the parameters of the dynamic scaling network via the chain rule with the objective of maximizing the external return.
In this embodiment, an external-value objective function is constructed and its derivative is related to the dynamic scaling network parameters through the chain rule, so that the update direction of the dynamic adjustment factors is consistent with the direction in which the external value increases.
S5, using the optimized policies to accomplish the multi-agent cooperative task.
After the agents' policies have been trained through the above steps, the actor of each agent serves as the agent's decision body and is deployed to the multi-agent cooperative task.
The above method is explained in detail below with reference to the drawings and specific examples.
As shown in fig. 1, this embodiment provides a multi-agent reinforcement learning method based on a behavior-consistency intrinsic reward, which specifically includes the following steps:
Step 1: obtain the agents' observations and actions, and infer the behavior of teammate agents.
The multi-agent reinforcement learning problem considered in this embodiment is a decentralized partially observable Markov decision process. Define $n$ as the number of agents, $o$ as an agent's observation, $a$ as an agent's action, and $\pi_\theta$ as an agent's actor policy. For agent $i$, its observation at time $t$ is $o^t_i$, and the action probability distribution it outputs is denoted $\pi_i(\cdot\mid o^t_i)$, a vector whose length $|A|$ is the length of the agent's action space; let $\mathcal{N}_i$ denote its set of teammate agents. For a teammate agent $j$ of agent $i$, the behavior that agent $j$ outputs when given the same observation as agent $i$ is defined as $\pi_j(\cdot\mid o^t_i)$.
Step 2: and calculating the behavior consistency of each agent and teammate agents.
And (3) calculating the consistency of the self-agent behaviors and the teammate agent behaviors obtained in the step (1). In particular, for the agentAccording to its own behaviour in step 1 +.>Team mate agent->Is combined with the action of the agent +.>The behavior consistency of (2) is expressed as the KL divergence of the probability distribution between behaviors, calculated as follows:
(1)
wherein ,representing->In other words, it is +.>At->Behavior consistency at time. The larger the KL divergence (i.e. the larger +.>) The larger the difference of the behaviors adopted by the two agents is, the more inconsistent the behaviors of the two agents are. />Representing intelligent agent->At->Time acquisition and agent->In the same observation, the output action is +.>Probability of class action,/->Representing intelligent agent->At the position ofThe moment output action is +.>Probability of class action,/->Representing the length of the agent's action space.
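A minimal sketch of the behavior-consistency measure of formula (1), assuming the two action probability distributions are given as tensors; the small epsilon clamp is an implementation detail added here for numerical stability.

```python
import torch

def behavior_consistency(p_i: torch.Tensor, p_j: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between agent i's distribution pi_i(.|o_i) and the distribution
    pi_j(.|o_i) that teammate j outputs for the same observation.
    Larger values indicate more inconsistent behavior."""
    p_i = p_i.clamp_min(eps)
    p_j = p_j.clamp_min(eps)
    return (p_i * (p_i.log() - p_j.log())).sum(dim=-1)

# Example with a four-action space: a peaked distribution versus a uniform one.
d = behavior_consistency(torch.tensor([0.7, 0.1, 0.1, 0.1]),
                         torch.tensor([0.25, 0.25, 0.25, 0.25]))
```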
Step 3: the DSN network is designed to acquire dynamic adjustment factors and calculate behavior consistency internal rewards.
Based on the behavior consistency defined in step 2, the present embodiment expects to dynamically encourage or punish the behavior consistency of each agent with different teammate agents based on global status. Thus, the present embodiment designs a Dynamic Scaling Network (DSN) for each agent to output dynamic adjustment factors for behavior consistency with different agents, defined asLength of. Specifically, the DSN is composed of a multi-layer perceptron (MLP) with three full connection layers and ReLU and SoftMax activation functions, and the network structure is shown in FIG. 2, wherein ∈ ->Is the input state vector length.
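A possible realization of the DSN described above, as a three-layer MLP with ReLU activations and a SoftMax output; the hidden width is an assumption, and any mapping of the SoftMax output to signed adjustment factors (the reward definition below allows negative factors) is left open here.

```python
import torch
import torch.nn as nn

class DynamicScalingNetwork(nn.Module):
    """DSN sketch: input is the global observation (all agents' observations concatenated),
    output is one dynamic adjustment factor per teammate agent."""
    def __init__(self, state_dim: int, n_teammates: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_teammates),
        )

    def forward(self, global_obs: torch.Tensor) -> torch.Tensor:
        # SoftMax over the teammate dimension, as in the described architecture.
        return torch.softmax(self.net(global_obs), dim=-1)
```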
For agent $i$, the behavior-consistency intrinsic reward obtained at time $t$ is defined as:

$$r^{c,t}_i = \lambda \sum_{j\in\mathcal{N}_i} \alpha^t_{i,j}\, d^t_{i,j} \qquad (2)$$

where $\alpha^t_{i,j}$ is the dynamic adjustment factor output by agent $i$'s own DSN and $\lambda$ is a scaling hyper-parameter. When $\alpha^t_{i,j}$ is negative, consistent behavior between agent $i$ and agent $j$ is encouraged; when $\alpha^t_{i,j}$ is positive, consistent behavior between agent $i$ and agent $j$ is penalized. The introduced adjustment factor thus provides a reliable basis for deciding when to reward or penalize consistent behavior between agents during collaboration.
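A short sketch of formula (2); the value of the scaling hyper-parameter and the dictionary-based bookkeeping of teammates are assumptions for illustration.

```python
def consistency_intrinsic_reward(alpha_i: dict, d_i: dict, lam: float = 0.1):
    """Behavior-consistency intrinsic reward r = lam * sum_j alpha[i,j] * d[i,j].
    A negative factor makes small KL (consistent behavior with teammate j) more rewarding;
    a positive factor rewards large KL, i.e. penalizes consistency."""
    return lam * sum(alpha_i[j] * d_i[j] for j in alpha_i)
```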
Step 4: and optimizing the DSN network parameters according to a chained algorithm and taking the maximization of external return as a target.
Set the firstThe DSN network parameter corresponding to each intelligent agent is +.>Then, using the chain derivative rule to externally report +.>For->Gradient of->Can be expressed as:
(3)
wherein Indicate->And the network parameters after the corresponding actuator strategies of the intelligent agents are updated. For the left item aboveCan be further expressed as:
(4)
wherein Representing the external advantage, given by an external evaluator; />Indicate->The actuator strategy output action after the update of the individual agent is +.>Is a probability of (2).
According to the Soft Actor-Critic algorithm,the calculation of (2) can be expressed as:
(5)
wherein For learning rate->Is the entropy temperature coefficient in the Soft Actor-Critic algorithm, +.>Is->The auxiliary judgment device after updating corresponding to the intelligent agent inputs the observation +.>Action value at the time; in the Soft Actor-Critic algorithm, two identical auxiliary judgments are defined for each agent to output two action values +.> and />The parameters of each auxiliary evaluator are updated independently and the participating prize inputs are the sum of the internal and external prizes, i.e,/>For external rewards, then->
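The reward entering the auxiliary critics is the sum of the external and intrinsic rewards; the following sketch shows a standard Soft Actor-Critic style target of that form, with the minimum over the two target critics. The discount factor and entropy temperature values are assumptions, and discrete- versus continuous-action details are omitted.

```python
import torch

def sac_critic_target(r_ext, r_int, next_q1, next_q2, next_logp, gamma=0.99, tau=0.2):
    """Target value for the twin auxiliary critics: combined reward plus the discounted
    soft value, which uses the minimum of the two target critics and the entropy term."""
    reward = r_ext + r_int
    soft_value = torch.min(next_q1, next_q2) - tau * next_logp
    return reward + gamma * soft_value
```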
It can be further deduced that the right term $\nabla_{\phi_i}\theta'_i$ of formula (3) equals:

$$\nabla_{\phi_i}\theta'_i = \eta\,\nabla_{\theta_i}\,\mathbb{E}_{o_i\sim\mathcal{D},\,a_i\sim\pi_{\theta_i}}\Big[\nabla_{\phi_i} Q_{\psi'_i}(o_i,a_i)\Big] \qquad (6)$$

where $\psi_i$ and $\psi'_i$ are respectively the parameters of the $i$-th agent's auxiliary critic before and after its update, and the dependence of $Q_{\psi'_i}$ on $\phi_i$ arises because the auxiliary critic is updated with a reward whose intrinsic component is produced by the dynamic scaling network; $Q_{\psi_i}(o_i,a_i)$ is the action value output by the $i$-th agent's auxiliary critic before the update for observation $o_i$ and action $a_i$; $\pi_{\theta_i}$ is the $i$-th agent's actor policy before the update. Substituting formula (4) and formula (6) into formula (3) yields the gradient $\nabla_{\phi_i} J^{ext}$ of the external return with respect to $\phi_i$, and the dynamic scaling network parameters are optimized by gradient ascent.
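The chain of formulas (3)-(6) can be realized in an automatic-differentiation framework by keeping the inner update differentiable with respect to the DSN parameters. The sketch below is a simplified illustration under strong assumptions: the critic chain is collapsed into a single policy-gradient-style inner step that uses the intrinsic reward directly, all tensors are random toy data, and the dimensions, learning rates, and network sizes are invented for the example.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

obs_dim, act_dim, n_teammates, batch = 8, 4, 2, 16
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim))
dsn = nn.Sequential(nn.Linear(obs_dim * (n_teammates + 1), 32), nn.ReLU(),
                    nn.Linear(32, n_teammates))
dsn_opt = torch.optim.Adam(dsn.parameters(), lr=1e-3)
lam, eta = 0.1, 0.01  # scaling hyper-parameter and inner learning rate (assumed values)

# Toy batch standing in for the buffer D.
obs = torch.randn(batch, obs_dim)
global_obs = torch.randn(batch, obs_dim * (n_teammates + 1))
actions = torch.randint(0, act_dim, (batch,))
ext_advantage = torch.randn(batch)          # A^ext from an external critic
d = torch.rand(batch, n_teammates)          # behavior consistency with each teammate

# 1) Intrinsic reward produced by the DSN (keeps the graph to the DSN parameters phi).
alpha = torch.softmax(dsn(global_obs), dim=-1)
r_int = lam * (alpha * d).sum(dim=-1)

# 2) Differentiable inner actor update (simplified surrogate in place of the SAC critic chain).
logp = torch.log_softmax(actor(obs), dim=-1).gather(1, actions[:, None]).squeeze(1)
inner_loss = -(r_int * logp).mean()
names = [n for n, _ in actor.named_parameters()]
params = [p for _, p in actor.named_parameters()]
grads = torch.autograd.grad(inner_loss, params, create_graph=True)
updated = {n: p - eta * g for n, p, g in zip(names, params, grads)}

# 3) Outer objective: external return of the updated actor; its gradient flows back to phi.
new_logp = torch.log_softmax(functional_call(actor, updated, (obs,)), dim=-1) \
               .gather(1, actions[:, None]).squeeze(1)
outer_loss = -(ext_advantage * new_logp).mean()   # descending this ascends the external return
dsn_opt.zero_grad()
outer_loss.backward()
dsn_opt.step()
```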
Step 5: and finishing the multi-agent cooperative task by using the optimized strategy.
For each agent that has been optimized to complete, they are deployed into the environment to perform collaborative tasks. At each instant, each agent will take its own observations and input these observations into its corresponding actuator strategy. Based on the observations, each policy will generate a next action, and the environment will enter the next state based on the actions taken by all agents. All agents then make a next decision based on the new environmental conditions, and so forth, until the multi-agent cooperative task is completed.
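An execution-time sketch of this deployment loop, assuming a gym-style multi-agent environment whose reset and step methods operate on per-agent lists; greedy action selection is an assumption, since the embodiment does not fix how actions are drawn at execution time.

```python
import torch

def run_cooperative_episode(env, actors, max_steps: int = 200):
    """At every time step, each agent feeds its own observation into its actor, and the
    joint action drives the environment to the next state, until the task ends."""
    observations = env.reset()
    for _ in range(max_steps):
        with torch.no_grad():
            actions = [int(torch.argmax(actor(torch.as_tensor(obs, dtype=torch.float32))))
                       for actor, obs in zip(actors, observations)]
        observations, rewards, done, info = env.step(actions)
        if done:
            break
```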
Referring to fig. 3, fig. 3 shows an example task in which N agents cooperatively navigate to N target points. In this embodiment, no two of the N agents are allowed to navigate to the same target point. In this task, the agents need to take each other's behavioral intentions into account. As shown in fig. 3 (a), where four agents cooperate to reach four targets, when agents 1 and 2 observe the same target, agent 1 selects the aggressive behavior, i.e. navigates to the target, and agent 2 selects the yielding behavior, i.e. explores other targets. At the same time, the same situation occurs for agents 3 and 4, and agent 3 selects the aggressive behavior. As shown in fig. 3 (b), if agent 4 then chooses to behave consistently with agent 2 (yielding), the task reward will be high; conversely, if agent 4 remains consistent with agents 1 and 3 (aggressive), agent 4 will head to the same target as agent 3, and this behavior should be penalized.
In other embodiments, the multi-agent reinforcement learning method described above can be applied to existing tasks, for example to the *** football task: in this task, three agents represent our players, whose goal is to cooperate to get past a defender and a goalkeeper, advance the ball, and shoot it into the goal.
In summary, compared with the prior art, the method of the application has at least the following advantages and beneficial effects:
(1) The application proposes, for the first time, the concept of dynamically maintaining behavior consistency with teammate agents in a multi-agent system, which plays an important role in efficiently completing cooperative tasks.
(2) The application defines a quantitative measure of behavior consistency, namely the KL divergence between the action probability distributions output by the actors of two agents given the same observation.
(3) The application proposes a dynamic behavior-consistency reward and learning paradigm, which encourages an agent, guided by the maximization of the task reward, to dynamically select different degrees of behavior consistency with its teammate agents.
The embodiment also provides a multi-agent reinforcement learning device, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method as shown in fig. 4.
The multi-agent reinforcement learning device of this embodiment can execute the multi-agent reinforcement learning method provided by the method embodiment of the application, can execute any combination of the implementation steps of the method embodiment, and has the corresponding functions and beneficial effects of the method.
Embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 4.
This embodiment also provides a storage medium storing instructions or a program capable of executing the multi-agent reinforcement learning method provided by the method embodiment of the application; when the instructions or the program are run, any combination of the implementation steps of the method embodiment can be executed, with the corresponding functions and beneficial effects of the method.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present application are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the application is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present application. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the application, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the foregoing description of the present specification, references to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the application, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (6)

1. A multi-agent reinforcement learning method, characterized by comprising the following steps:
obtaining observations, where each agent derives an action probability distribution from its own observation and infers the action probability distributions of its teammate agents based on the same observation;
computing the behavior consistency between each agent and its teammate agents from the obtained action probability distributions;
based on the behavior consistency, obtaining dynamic adjustment factors through a dynamic scaling network, and computing a behavior-consistency intrinsic reward from the dynamic adjustment factors;
optimizing the parameters of the dynamic scaling network via the chain rule with the objective of maximizing the external return;
using the optimized policies to accomplish multi-agent cooperative tasks;
wherein obtaining the dynamic adjustment factors through the dynamic scaling network and computing the behavior-consistency intrinsic reward from the dynamic adjustment factors comprises:
constructing a dynamic scaling network, whose input is the global observation and whose output is a dynamic adjustment factor for the behavior consistency with each of the other agents;
computing the behavior-consistency intrinsic reward according to the dynamic adjustment factors and the behavior consistency;
for agent $i$, the behavior-consistency intrinsic reward obtained at time $t$ being:

$$r^{c,t}_i = \lambda \sum_{j\in\mathcal{N}_i} \alpha^t_{i,j}\, d^t_{i,j}$$

where $\alpha^t_{i,j}$ is the dynamic adjustment factor output by agent $i$'s own dynamic scaling network, $\lambda$ is a scaling hyper-parameter, $d^t_{i,j}$ denotes the behavior consistency between agent $i$ and agent $j$ at time $t$, and $\mathcal{N}_i$ is agent $i$'s set of teammate agents; when $\alpha^t_{i,j}$ is negative, consistent behavior between agent $i$ and agent $j$ is encouraged; when $\alpha^t_{i,j}$ is positive, consistent behavior between agent $i$ and agent $j$ is penalized;
the optimizing the parameters of the dynamic scaling network according to the chained algorithm with the aim of maximizing the external return comprises the following steps:
constructing an external value objective function, and correlating the derivative of the external value objective function with parameters of a dynamic scaling network through a chain rule, so that the updating direction of the dynamic adjustment factor is consistent with the increasing direction of the external value;
the optimizing the parameters of the dynamic scaling network according to the chained algorithm with the aim of maximizing the external return comprises the following steps:
set the firstThe parameter of the dynamic scaling network corresponding to the intelligent agent is +.>External return +.>For->Gradient of->Expressed as:
(1),
wherein Indicate->Network parameters updated by the corresponding executor strategies of the intelligent agents;
in formula (1)Further expressed as:
(2),
wherein Representing the external advantage, given by an external evaluator; />Indicate->The actuator strategy output action after the update of the individual agent is +.>Probability of (2); />To record +.>A buffer for historical observation and action of the intelligent agent;
according to the Soft Actor-Critic algorithm,the calculation of (2) is expressed as:
(3),
wherein For learning rate->Is the entropy temperature coefficient in the Soft Actor-Critic algorithm, +.>Is->The auxiliary judgment device after updating corresponding to the intelligent agent inputs the observation +.>Action value at the time; in the Soft Actor-Critic algorithm, two identical auxiliary judgments are defined for each agent to output two action values +.> and />The parameters of each auxiliary evaluator are updated independently and the participating prize inputs are the sum of the internal and external prizes, i.e,/>For external rewards, then->
further deriving that $\nabla_{\phi_i}\theta'_i$ in formula (1) equals:

$$\nabla_{\phi_i}\theta'_i = \eta\,\nabla_{\theta_i}\,\mathbb{E}_{o_i\sim\mathcal{D},\,a_i\sim\pi_{\theta_i}}\Big[\nabla_{\phi_i} Q_{\psi'_i}(o_i,a_i)\Big] \qquad (4),$$

where $\psi_i$ and $\psi'_i$ are respectively the parameters of the $i$-th agent's auxiliary critic before and after its update, and the dependence of $Q_{\psi'_i}$ on $\phi_i$ arises because the auxiliary critic is updated with a reward whose intrinsic component is produced by the dynamic scaling network; $Q_{\psi_i}(o_i,a_i)$ is the action value output by the $i$-th agent's auxiliary critic before the update for observation $o_i$ and action $a_i$; $\pi_{\theta_i}$ is the $i$-th agent's actor policy before the update; substituting formula (2) and formula (4) into formula (1) to calculate the gradient $\nabla_{\phi_i} J^{ext}$ of the external return with respect to $\phi_i$; and optimizing the dynamic scaling network parameters by gradient ascent.
2. The multi-agent reinforcement learning method of claim 1, wherein obtaining the observations, each agent deriving an action probability distribution from its observation and inferring the action probability distributions of teammate agents based on the observation, comprises:
for each agent, inputting the agent's observation at the current time into its own actor to infer its next-action probability distribution;
the agent inputting its own observation into the actors of the other agents to infer the action probability distribution that a teammate agent's actor would output when facing the same observation.
3. The multi-agent reinforcement learning method of claim 1, wherein computing the behavior consistency between each agent and its teammate agents comprises:
computing, as the behavior consistency, the similarity between the next-action probability distribution inferred by each agent and the action probability distribution output by a teammate agent when facing the same observation.
4. The multi-agent reinforcement learning method of claim 3, wherein the similarity is calculated using the KL divergence, with the following formula:

$$d^{t}_{i,j} = \sum_{k=1}^{|A|} \pi_i(a_k\mid o^t_i)\,\log\frac{\pi_i(a_k\mid o^t_i)}{\pi_j(a_k\mid o^t_i)}$$

where $d^{t}_{i,j}$ denotes the behavior consistency between agent $i$ and agent $j$ at time $t$, $\pi_j(a_k\mid o^t_i)$ denotes the probability that agent $j$, given the same observation as agent $i$ at time $t$, outputs action $a_k$, $\pi_i(a_k\mid o^t_i)$ denotes the probability that agent $i$ outputs action $a_k$ at time $t$, and $|A|$ denotes the length of the agents' action space.
5. A multi-agent reinforcement learning device, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-4.
6. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program is for performing the method according to any of claims 1-4 when being executed by a processor.
CN202310824569.9A 2023-07-06 2023-07-06 Multi-agent reinforcement learning method, device and medium Active CN116560239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310824569.9A CN116560239B (en) 2023-07-06 2023-07-06 Multi-agent reinforcement learning method, device and medium

Publications (2)

Publication Number Publication Date
CN116560239A CN116560239A (en) 2023-08-08
CN116560239B true CN116560239B (en) 2023-09-12

Family

ID=87502189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310824569.9A Active CN116560239B (en) 2023-07-06 2023-07-06 Multi-agent reinforcement learning method, device and medium

Country Status (1)

Country Link
CN (1) CN116560239B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007102581A (en) * 2005-10-05 2007-04-19 Advanced Telecommunication Research Institute International Multi-agent type controller and multi-agent type control program
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN116047904A (en) * 2022-12-30 2023-05-02 西北工业大学 Personnel simulation reality mixed training method for robot operation skill learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-agent deep reinforcement learning algorithm based on reward-filtering credit assignment; Xu Cheng et al.; Chinese Journal of Computers; Vol. 45, No. 11; pp. 2306-2320 *

Also Published As

Publication number Publication date
CN116560239A (en) 2023-08-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant