CN113095498A - Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium - Google Patents


Info

Publication number
CN113095498A
Authority
CN
China
Prior art keywords
network
divergence
strategy
target
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110315995.0A
Other languages
Chinese (zh)
Other versions
CN113095498B (en)
Inventor
卢宗青
苏可凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202110315995.0A priority Critical patent/CN113095498B/en
Publication of CN113095498A publication Critical patent/CN113095498A/en
Application granted granted Critical
Publication of CN113095498B publication Critical patent/CN113095498B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G06F16/24552: Database cache management
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a divergence-based multi-agent cooperative learning method, apparatus, device and storage medium, wherein the method comprises the following steps: initializing a value network, a policy network and a target policy network; changing the update modes of the value network, the policy network and the target policy network according to a preset divergence-based regularization term to obtain the latest update mode; training the plurality of agents according to the value network, the policy network and the target policy network to obtain experience data, and updating the value network, the policy network and the target policy network according to the experience data and the latest update mode; and the plurality of agents acquiring observation data from the environment and making decisions by combining the experience data with the updated policy network to obtain action data. The multi-agent cooperative learning method provided by the embodiment of the disclosure uses the divergence-based regularization term to enhance the exploration capability of the agents and solves the cooperation problem of multiple agents.

Description

Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium
Technical Field
The invention relates to the technical field of machine learning, in particular to a divergence-based multi-agent cooperative learning method, a divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning media.
Background
Reinforcement learning agents can learn behavioral policies autonomously by interacting with the environment, and have therefore been successfully applied to tasks in the single-agent domain, such as robotic arm control, board games, and video games. However, many tasks in real life require multiple agents to cooperate, such as logistics robots, unmanned driving, and large-scale real-time strategy games. Therefore, multi-agent cooperative learning has attracted increasing attention in recent years.
In a cooperative multi-agent task, each agent typically perceives only local information within its field of view due to communication limitations. It is difficult for agents to form effective cooperation if each agent learns only from its own local information. In the prior art, the exploration capability of an agent is improved by adding an entropy regularization term, but this term modifies the original Markov decision process, so the converged policy obtained by entropy-regularized reinforcement learning is not the optimal policy of the original problem, and bias may be introduced into the converged policy.
Disclosure of Invention
The disclosed embodiments provide a divergence-based multi-agent cooperative learning method, apparatus, device and medium. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, the disclosed embodiments provide a divergence-based multi-agent cooperative learning method, including:
initializing a value network, a policy network and a target policy network;
changing the updating modes of the value network, the strategy network and the target strategy network according to a preset regular term based on divergence to obtain a latest updating mode;
training a plurality of agents according to the value network, the strategy network and the target strategy network to obtain experience data, and updating the value network, the strategy network and the target strategy network according to the experience data and the latest updating mode;
and a plurality of agents acquire observation data from the environment, and make a decision by combining the experience data and the updated strategy network to obtain action data.
In an optional embodiment, before changing the updating manner of the value network, the policy network, and the target policy network according to a preset regular term based on divergence, the method further includes:
and constructing a maximized objective function of the multi-agent according to the divergence-based regularization item.
In one optional embodiment, the divergence-based regularization term is:
$$\log\frac{\pi(a_t\mid s_t)}{\rho(a_t\mid s_t)}$$
wherein π represents the policy network, a_t represents the action, s_t represents the state, and ρ represents the target policy network.
In an optional embodiment, changing the update mode of the value network according to a preset regular term based on divergence includes:
the value network is updated according to the following loss function:
$$\mathcal{L}(\phi)=\mathbb{E}\!\left[\left(Q_\phi(s,a)-y\right)^2\right]$$
$$y=r+\gamma\,\mathbb{E}_{a'\sim\pi}\!\left[Q_{\bar\phi}(s',a')-\lambda\log\frac{\pi(a'\mid s')}{\rho(a'\mid s')}\right]$$
$$\bar\phi\leftarrow\tau\phi+(1-\tau)\bar\phi$$
where λ is the regularization coefficient, π represents the policy network, ρ represents the target policy network, φ is the parameter of the value network, $\bar\phi$ is the parameter of the target value network, s represents all information in the environment, a represents the action, y represents the target value to be fitted, r represents the global reward, E represents the mathematical expectation, γ represents the discount factor, s' represents the new state to which the environment transitions after the agents make a decision, a' represents the action taken in the new state, $\mathcal{L}(\phi)$ represents the loss function of the value network, $Q_{\bar\phi}$ represents the target value network, $Q_\phi$ represents the value network, and τ is the coefficient of the moving-average update of the target value network.
In an optional embodiment, changing the update mode of the policy network according to a preset regular term based on divergence includes:
the policy network is updated according to the following minimization objective function:
$$\mathcal{L}(\theta_i)=\mathbb{E}_{s}\!\left[\lambda\,D_{\mathrm{KL}}\!\left(\pi_i(\cdot\mid s)\,\|\,\rho_i(\cdot\mid s)\right)-\mathbb{E}_{a_i\sim\pi_i(\cdot\mid s)}\!\left[Q^{\pi,\rho}(s,a_i)\right]\right]$$
where θ_i represents the parameter of the policy network π_i, $\mathcal{L}(\theta_i)$ represents the loss function of the policy network, E represents the mathematical expectation, a_i represents the action, s represents the state, $Q^{\pi,\rho}$ represents the value function given the policy π and the target policy ρ, λ represents the regularization coefficient, ρ_i represents the target policy network, and $D_{\mathrm{KL}}$ denotes the KL divergence.
In an optional embodiment, changing the update mode of the target policy network according to a preset regular term based on divergence includes:
the target policy network updates according to the running average of the policy network:
$$\bar\theta_i\leftarrow\tau\theta_i+(1-\tau)\bar\theta_i$$
where $\bar\theta_i$ represents the parameter of the target policy network of agent i, τ is the coefficient of the moving-average update of the target policy network, and θ_i represents the parameter of the policy network.
In an optional embodiment, the training of the plurality of agents according to the value network, the policy network, and the target policy network to obtain the experience data, and the updating of the value network, the policy network, and the target policy network according to the experience data and the latest updating manner include:
the intelligent agent obtains observation data from the environment, and makes a decision according to a policy network by using experience data to obtain action data;
the environment gives a reward according to the current state and the joint action, and moves to the next state;
storing a multi-element group as experience data into a cache database, wherein the multi-element group comprises a current environment state, a current action, a next environment state, observation data and global rewards;
and after a period of experience is finished, acquiring a plurality of pieces of experience data from the cache database for training, and updating the value network, the strategy network and the target strategy network.
In a second aspect, embodiments of the present disclosure provide a divergence-based multi-agent cooperative learning apparatus, comprising:
the initialization module is used for initializing a value network, a strategy network and a target strategy network;
the change module is used for changing the update modes of the value network, the strategy network and the target strategy network according to a preset regular term based on divergence to obtain a latest update mode;
the training module is used for training the plurality of agents according to the value network, the strategy network and the target strategy network to obtain experience data, and updating the value network, the strategy network and the target strategy network according to the experience data and the latest updating mode;
and the execution module is used for acquiring observation data from the environment by a plurality of agents, and making a decision by combining the experience data and the updated strategy network to obtain action data.
In a third aspect, the disclosed embodiments provide a divergence-based multi-agent cooperative learning device, comprising a processor and a memory storing program instructions, the processor being configured to execute the divergence-based multi-agent cooperative learning method provided by the above embodiments when executing the program instructions.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a processor to implement a divergence-based multi-agent cooperative learning method provided by the above-mentioned embodiments.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the disclosed embodiment provides a cooperative learning method of multiple agents, which is used for solving the cooperative problem of the multiple agents. The method adds a target strategy to the intelligent agent, adds a regular term based on divergence in a reward function of a general Markov decision process by using the target strategy, and provides a new strategy updating mode aiming at the Markov decision process based on divergence. And after the regular item is added, off-strategy training can be realized, and the problem of sampling efficiency in a multi-agent environment is solved. Meanwhile, the regular term also controls the step length of strategy updating, and the stability of strategy improvement is enhanced. The method provided by the embodiment of the disclosure has certain flexibility and has important application value in the fields of intelligent transportation, robot control and the like.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic flow diagram illustrating a divergence-based multi-agent cooperative learning method in accordance with an exemplary embodiment;
FIG. 2 is a schematic flow diagram illustrating a divergence-based multi-agent training method in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating the architecture of a divergence-based multi-agent cooperative learning apparatus in accordance with an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a configuration of a divergence-based multi-agent cooperative learning device, according to an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a computer storage medium in accordance with an exemplary embodiment.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of systems and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
In a multi-agent cooperation problem, the state of the environment, i.e., all information in the environment, is s. Each agent i obtains partial information o_i from the environment and, based on its history information τ_i = {(o_i, a_i)} and its own policy π_i, selects and executes its own action a_i. According to the current state s and the joint action a = (a_1, a_2, ..., a_n) of all agents, every agent receives the same global reward r(s, a), and the environment transitions to the next state s' ~ P(·|s, a). Denoting the joint policy of all agents as π = π_1 × π_2 × ... × π_n, the common objective function that each agent maximizes is:
$$\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\right]$$
where E represents the mathematical expectation, π represents the joint policy, a_t represents the joint action, s_t represents the state, and γ represents the discount factor.
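For concreteness, the following Python sketch (an illustration only; the class name, field names, and the discounted-return helper are assumptions, not part of the patent) shows one way to represent a transition of such a cooperative Markov game and to estimate the common objective over a finished episode:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Transition:
    """One step of experience in the cooperative Markov game (field names are illustrative)."""
    state: Any                # global state s (all information in the environment)
    observations: List[Any]   # per-agent partial observations o_i
    actions: List[Any]        # joint action a = (a_1, ..., a_n)
    reward: float             # shared global reward r(s, a)
    next_state: Any           # successor state s' ~ P(. | s, a)

def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo estimate of the common objective E[sum_t gamma^t * r_t] for one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```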
Entropy-regularized reinforcement learning mainly subtracts log π(a_t|s_t), weighted by a coefficient λ, from the reward function, i.e., it modifies the optimization objective of the agent to:
$$\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_t,a_t)-\lambda\log\pi(a_t\mid s_t)\right)\right]$$
where E represents the mathematical expectation, π represents the policy, a_t represents the action, s_t represents the state, γ represents the discount factor, and λ represents the coefficient of the regularization term. Adding the entropy regularization term can improve the agent's exploration, but it modifies the original Markov decision process, so the converged policy obtained by entropy-regularized reinforcement learning is not the optimal policy of the original problem, which introduces bias into the converged policy.
The embodiment of the disclosure provides a new divergence-based regularization term and a new policy update scheme for the divergence-regularized Markov decision process. The divergence-based regularization term enhances the exploration capability of the agent; meanwhile, when the policy of the agent converges, the regularization term vanishes, which avoids the bias that adding a regularization term would otherwise introduce into the converged policy. Moreover, after the regularization term is added, off-policy training can be realized, which alleviates the sampling-efficiency problem in the multi-agent environment. Meanwhile, the regularization term also controls the step size of the policy update, which enhances the stability of policy improvement.
The divergence-based multi-agent cooperative learning method provided by the embodiment of the present application will be described in detail below with reference to fig. 1 to 2.
The method is suitable for situations in which multiple cooperative agents exist and each agent interacts with the environment during training. In real life, many tasks are completed through the cooperation of multiple agents, such as logistics robots, unmanned driving, and large-scale real-time strategy games; a system in which multiple agents cooperate to complete tasks is called a multi-agent system. For example, a warehouse logistics system is a multi-agent system, in which each logistics robot is an agent.
Referring to fig. 1, the method specifically includes the following steps.
S101, initialize the value network Q, the policy networks π_i, and the target policy networks ρ_i.
S102, changing the updating modes of the value network, the strategy network and the target strategy network according to a preset regular term based on divergence to obtain a latest updating mode.
In a possible implementation, before executing step S102, constructing a maximization objective function of the multi-agent according to the divergence-based regularization term is further included.
Specifically, a target policy ρ_i is added for each agent, and the newly added target policy is used to subtract the term
$$\log\frac{\pi(a_t\mid s_t)}{\rho(a_t\mid s_t)}$$
(weighted by a coefficient λ) from the reward function; this term is called the divergence-based regularization term. Here π represents the policy network, a_t represents the action, s_t represents the state, and ρ represents the target policy network.
Further, the maximization objective function of the multi-agent system obtained with the divergence-based regularization term is:
$$\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_t,a_t)-\lambda\log\frac{\pi(a_t\mid s_t)}{\rho(a_t\mid s_t)}\right)\right]$$
where E represents the mathematical expectation, π represents the policy, a_t represents the action, s_t represents the state, γ represents the discount factor, λ represents the coefficient of the regularization term, and ρ represents the target policy network.
The embodiment of the disclosure replaces the entropy regularization term with the divergence-based regularization term and uses it to obtain a new optimization objective. When the target policy is taken as a past policy, the regularization term lowers the reward for an action whose probability has increased and raises the reward otherwise; that is, it encourages the agent to take actions whose probability has decreased and discourages actions whose probability has increased, so the agent explores more fully during training. Meanwhile, the regularization term effectively controls the step size of the policy update, i.e., the new policy cannot deviate too far from the past policy, which benefits the stability of policy improvement. Moreover, after the divergence-based regularization term is added, the latest update mode derived below enables off-policy training, which alleviates the sampling-efficiency problem faced by policy-based reinforcement learning methods in the multi-agent setting.
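As a minimal sketch of the objective above (assuming the per-step log-probabilities under the policy π and the target policy ρ are available as PyTorch tensors; the function names and the default coefficient values are assumptions, not taken from the patent), the divergence-regularized return of a sampled trajectory can be estimated as follows:

```python
import torch

def divergence_regularized_rewards(rewards, logp_pi, logp_rho, lam=0.01):
    """Per-step shaped reward r(s_t, a_t) - lambda * log(pi(a_t|s_t) / rho(a_t|s_t)).

    rewards, logp_pi, logp_rho: 1-D tensors of length T for one trajectory.
    lam: regularization coefficient lambda (the default value is a placeholder).
    """
    return rewards - lam * (logp_pi - logp_rho)

def regularized_objective(rewards, logp_pi, logp_rho, gamma=0.99, lam=0.01):
    """Monte-Carlo estimate of E_pi[ sum_t gamma^t * (r_t - lambda * log(pi/rho)) ]."""
    shaped = divergence_regularized_rewards(rewards, logp_pi, logp_rho, lam)
    discounts = gamma ** torch.arange(shaped.shape[0], dtype=shaped.dtype)
    return (discounts * shaped).sum()
```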
Specifically, changing the update mode of the value network according to a preset regular term based on divergence includes:
in general reinforcement learning, a value function is generally defined as:
$$Q^{\pi}(s,a)=\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\,\middle|\,s_0=s,\,a_0=a\right]$$
where Q represents the value function, π represents the policy, a_t represents the action, s_t represents the state, γ represents the discount factor, E represents the mathematical expectation, and r represents the global reward.
During training, the value function is usually updated according to the Bellman optimality equation, that is:
$$Q^{*}(s,a)=\mathbb{E}_{s'}\!\left[r(s,a)+\gamma\max_{a'}Q^{*}(s',a')\right]$$
where Q* represents the optimal value function, a represents the action, s represents the state, γ represents the discount factor, E represents the mathematical expectation, r represents the global reward, s' represents the state at the next time step, and a' represents the action at the next time step.
After adding the divergence-based regularization term, a new iterative formula for the optimal value function is obtained, namely:
$$Q^{*}_{\rho}(s,a)=\mathbb{E}_{s'}\!\left[r(s,a)+\gamma\,\mathbb{E}_{a'\sim\pi^{*}}\!\left[Q^{*}_{\rho}(s',a')-\lambda\log\frac{\pi^{*}(a'\mid s')}{\rho(a'\mid s')}\right]\right]$$
where $Q^{*}_{\rho}$ represents the optimal value function under the divergence-based regularization, a represents the action, s represents the state, γ represents the discount factor, E represents the mathematical expectation, r represents the global reward, s' represents the state at the next time step, a' represents the action at the next time step, λ represents the coefficient of the regularization term, ρ represents the target policy network, and π* represents the optimal policy. Apart from adding the divergence-based regularization term, the difference between this formula and the Bellman optimality equation is that it no longer restricts a' to be the optimal action; a' can be sampled rather than maximized over, which increases flexibility in training and, at the same time, allows the TD(λ) objective design to be adopted in off-policy training.
According to the above iterative formula, the value network is updated by minimizing the following loss function:
$$\mathcal{L}(\phi)=\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\!\left[\left(Q_\phi(s,a)-y\right)^2\right]$$
$$y=r+\gamma\,\mathbb{E}_{a'\sim\pi}\!\left[Q_{\bar\phi}(s',a')-\lambda\log\frac{\pi(a'\mid s')}{\rho(a'\mid s')}\right]$$
$$\bar\phi\leftarrow\tau\phi+(1-\tau)\bar\phi$$
where λ is the regularization coefficient, π represents the policy network, ρ represents the target policy network, φ is the parameter of the value network, $\bar\phi$ is the parameter of the target value network, $\mathcal{D}$ is the cache database, s represents all information in the environment, a represents the action, y represents the target value to be fitted, r represents the global reward, E represents the mathematical expectation, γ represents the discount factor, s' represents the new state to which the environment transitions after the agents make a decision, a' represents the action taken in the new state, $\mathcal{L}(\phi)$ represents the loss function of the value network, $Q_{\bar\phi}$ represents the target value network, $Q_\phi$ represents the value network, and τ is the coefficient of the moving-average update of the target value network.
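The following PyTorch sketch illustrates one possible implementation of this value-network update (a sketch under assumptions: the policy and target-policy objects are assumed to expose `sample_with_log_prob` and `log_prob` methods, the batch is assumed to be a dict of tensors, and the default hyperparameter values are placeholders; none of these names come from the patent):

```python
import torch
import torch.nn.functional as F

def value_update(q_net, target_q_net, pi, rho, batch, q_optimizer,
                 gamma=0.99, lam=0.01, tau=0.005):
    """One value-network step with the divergence-regularized target
    y = r + gamma * E_{a'~pi}[ Q_target(s', a') - lam * log(pi(a'|s') / rho(a'|s')) ],
    followed by a moving-average update of the target value network."""
    s, a, r, s_prime = batch["s"], batch["a"], batch["r"], batch["s_prime"]

    with torch.no_grad():
        a_prime, logp_pi = pi.sample_with_log_prob(s_prime)    # a' ~ pi(.|s')
        logp_rho = rho.log_prob(s_prime, a_prime)               # log rho(a'|s')
        y = r + gamma * (target_q_net(s_prime, a_prime)
                         - lam * (logp_pi - logp_rho))

    loss = F.mse_loss(q_net(s, a), y)                           # L(phi) = E[(Q_phi(s, a) - y)^2]
    q_optimizer.zero_grad()
    loss.backward()
    q_optimizer.step()

    # target value network: phi_bar <- tau * phi + (1 - tau) * phi_bar
    with torch.no_grad():
        for p, p_bar in zip(q_net.parameters(), target_q_net.parameters()):
            p_bar.mul_(1 - tau).add_(tau * p)
    return loss.item()
```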
Further, changing the updating mode of the policy network according to a preset regular term based on divergence, comprising:
policy-based reinforcement learning methods are usually rootThe policy network update mode is obtained according to the policy gradient theorem, but the policy gradient theorem limits that the update must be in policy (on-policy). After adding the regular term based on divergence, a new strategy lifting theorem is obtained, namely an old strategy pi is givenoldOnly new strategy is needednewSatisfy the requirement of
Figure BDA0002991208470000093
Wherein
Figure BDA0002991208470000094
DKLRepresents the KL divergence, then pinewIs just to pioldMore preferably. According to the above theorem, the gradient of the strategy can be calculated by deducing according to the following formula:
$$\nabla_{\theta_i}\mathcal{L}(\theta_i)=\mathbb{E}_{s\sim\mathcal{D},\,a_i\sim\pi_i(\cdot\mid s)}\!\left[\left(\lambda\log\frac{\pi_i(a_i\mid s)}{\rho_i(a_i\mid s)}-Q^{\pi,\rho}(s,a_i)\right)\nabla_{\theta_i}\log\pi_i(a_i\mid s)\right]$$
where θ_i represents the parameter of the policy network π_i, $\mathcal{D}$ represents the cache database used to store the training data, $\mathcal{L}(\theta_i)$ represents the loss function of the policy network, E represents the mathematical expectation, a_i represents the action, s represents the state, $Q^{\pi,\rho}$ represents the value function given the policy π and the target policy ρ, λ represents the regularization coefficient, and ρ represents the target policy network.
Apart from adding the regularization term, the biggest difference between the above gradient formula and the policy gradient theorem is that it no longer restricts the distribution of the state s, so the above update rule can be used for off-policy updates. In the multi-agent setting, a counterfactual baseline function can additionally be subtracted from the gradient to address the credit assignment problem and to reduce the variance of the gradient update. Taking the baseline function as
$$b(s,a^{-i})=\mathbb{E}_{a_i\sim\pi_i(\cdot\mid s)}\!\left[Q^{\pi,\rho}\!\left(s,(a_i,a^{-i})\right)\right]$$
where $a^{-i}$ denotes the actions of all agents other than agent i, the gradient formula can be further rewritten, and the policy network is accordingly updated according to the following minimization objective function:
$$\mathcal{L}(\theta_i)=\mathbb{E}_{s\sim\mathcal{D}}\!\left[\lambda\,D_{\mathrm{KL}}\!\left(\pi_i(\cdot\mid s)\,\|\,\rho_i(\cdot\mid s)\right)-\mathbb{E}_{a_i\sim\pi_i(\cdot\mid s)}\!\left[Q^{\pi,\rho}(s,a_i)-b(s,a^{-i})\right]\right]$$
where θ_i represents the parameter of the policy network π_i, $\mathcal{L}(\theta_i)$ represents the loss function of the policy network, E represents the mathematical expectation, a_i represents the action of agent i, s represents the state, $Q^{\pi,\rho}$ represents the value function given the policy π and the target policy ρ, b represents the counterfactual baseline function defined above, λ represents the regularization coefficient, ρ_i represents the target policy network of agent i, and $D_{\mathrm{KL}}$ denotes the KL divergence.
Further, changing the updating mode of the target policy network according to a preset regular term based on divergence, comprising:
for the objective strategy, in the most ideal case, the following iterative approach should be used:
Figure BDA0002991208470000104
i.e. taking the constant rho ═ pitUnder the condition of obtaining the optimal strategy pit+1When such iterative process converges, pit+1=πtAnd then have
Figure BDA0002991208470000105
Therefore, the convergence strategy is the optimal strategy of the original Markov decision process, and the deviation brought to the convergence strategy by adding the regular term is avoided. However, such an ideal case requires training through reinforcement learning until convergence every iteration, and is too costly, so the following approximation method can be adopted:
$$\theta_i\leftarrow\theta_i-\alpha\,\nabla_{\theta_i}\mathcal{L}(\theta_i)$$
$$\bar\theta_i\leftarrow\tau\theta_i+(1-\tau)\bar\theta_i$$
where $\bar\theta_i$ represents the parameter of the target policy network of agent i, τ is the coefficient of the moving-average update of the target policy network, θ_i represents the parameter of the policy network, and α is the learning rate of the one-step gradient update. That is, a single gradient step replaces the full maximization, while the target policy is taken to be the moving average of the policy.
Accordingly, the target policy network is updated as the moving average of the policy network:
$$\bar\theta_i\leftarrow\tau\theta_i+(1-\tau)\bar\theta_i$$
where $\bar\theta_i$ represents the parameter of the target policy network of agent i, τ is the coefficient of the moving-average update of the target policy network, and θ_i represents the parameter of the policy network.
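A minimal sketch of this moving-average (Polyak) update, assuming both networks are torch.nn.Module instances with matching parameter shapes (the function name and default τ are assumptions):

```python
import torch

def update_target_policy(pi_i, rho_i, tau=0.005):
    """Moving-average update theta_bar_i <- tau * theta_i + (1 - tau) * theta_bar_i."""
    with torch.no_grad():
        for p, p_bar in zip(pi_i.parameters(), rho_i.parameters()):
            p_bar.mul_(1 - tau).add_(tau * p)
```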
According to the above steps, the latest update modes of the value network, the policy network, and the target policy network are obtained from the divergence-based regularization term. With these update modes, when the policy of the agent converges, the regularization term vanishes, which avoids the bias that adding a regularization term would bring to the converged policy; off-policy training can be realized, which alleviates the sampling-efficiency problem in the multi-agent environment; and the step size of the policy update is controlled, which enhances the stability of policy improvement.
S103, training the plurality of agents according to the value network, the strategy network and the target strategy network to obtain experience data, and updating the value network, the strategy network and the target strategy network according to the experience data and the latest updating mode.
Fig. 2 is a flow diagram illustrating a divergence-based multi-agent training method according to an exemplary embodiment, as shown in fig. 2, the multi-agent training process includes: s201, an intelligent agent obtains observation data from the environment, and makes a decision according to a policy network by using experience data to obtain action data; s202, the environment gives rewards according to the current state and the joint action, and moves to the next state; s203, storing a multi-element group as experience data into a cache database, wherein the multi-element group comprises a current environment state, a current action, a next environment state, observation data and global rewards; s204, after a period of experience is finished, a plurality of pieces of experience data are obtained from the buffer database for training, and the value network, the strategy network and the target strategy network are updated.
Specifically, the training data is obtained by the agents interacting with the environment: agent i obtains observation data o_i from the environment and, using its experience data τ_i, makes a decision according to its policy π_i to obtain an action a_i. The environment then gives the reward r according to the current state s and the joint action a and moves to the next state s'. The tuple (s, a, s', o, r) is stored as one piece of experience data in the cache database D. After an episode of experience is finished, a number of experiences are sampled from D for training, and the value network Q, the policy networks π_i, and the target policy networks ρ_i are updated according to the latest update modes.
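The experience-collection loop described above might look like the following sketch (the environment and agent interfaces, the field order of the tuple, and the buffer capacity are assumptions for illustration):

```python
import random
from collections import deque

class ReplayBuffer:
    """Cache database D holding (s, a, s', o, r) experience tuples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, s_prime, o, r):
        self.buffer.append((s, a, s_prime, o, r))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def collect_episode(env, agents, buffer):
    """One episode of interaction between the agents and the environment."""
    s, obs = env.reset()
    done = False
    while not done:
        actions = [agent.act(o) for agent, o in zip(agents, obs)]   # a_i ~ pi_i(.|tau_i)
        s_prime, next_obs, reward, done = env.step(actions)
        buffer.store(s, actions, s_prime, obs, reward)
        s, obs = s_prime, next_obs
```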
In one exemplary scenario, in the game StarCraft II, a player controls several units to fight against the enemy, with the goal of eliminating all enemy units. In training, each unit is regarded as an agent. Each agent has a field of view, and its observation data consists of the relevant attributes of the units within that field of view (such as health, shield value, and unit type). The actions an agent can take include moving, attacking, releasing skills, and so on. The agent's policy is generated by the policy network and is a probability distribution over the actions the agent can take in the current state. An agent obtains a reward when it damages or destroys an enemy unit. At each step, the state (data corresponding to all information of the current game, available only during training), the observations of all agents, the actions of all agents, the reward obtained after taking the actions, and the new state transitioned to are stored as experience. After a number of steps are executed or a game is finished, the previously obtained experience data are taken out for training and the agents' policies are updated; new experience is then obtained with the new policies, and this process is repeated until the agents achieve good performance.
S104, a plurality of agents acquire observation data from the environment, and make a decision by combining the experience data and the updated strategy network to obtain action data.
Further, during execution, for agent i: each time observation data o_i is acquired from the environment, the agent combines it with its experience data τ_i and makes a decision according to the updated policy network π_i to obtain an action a_i.
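At execution time the decision step reduces to the following sketch (the `act` interface and the `explore` flag are assumptions):

```python
def execute_step(agents, observations):
    """Decentralized execution: each agent i maps its observation o_i (and any history it
    keeps internally) to an action a_i with its updated policy network pi_i."""
    return [agent.act(obs, explore=False) for agent, obs in zip(agents, observations)]
```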
The multi-agent cooperative learning method provided by the embodiment of the disclosure solves the cooperation problem of multiple agents. The divergence-based regularization term enhances the exploration capability of the agents, and when the policy of an agent converges, the regularization term vanishes, which avoids the bias that adding a regularization term would bring to the converged policy. After the regularization term is added, off-policy training can be realized, which alleviates the sampling-efficiency problem in the multi-agent environment. Meanwhile, the regularization term also controls the step size of the policy update, which enhances the stability of policy improvement. The method provided by the embodiment of the disclosure has a certain flexibility: for example, in the field of robot control it can solve the cooperative control problem of multiple robots, and in the field of games it can handle the cooperative control of multiple game characters and serve as a multi-agent artificial intelligence system in games.
The disclosed embodiment also provides a divergence-based multi-agent cooperative learning apparatus, configured to perform the divergence-based multi-agent cooperative learning method of the foregoing embodiment, as shown in fig. 3, the apparatus includes:
an initialization module 301, configured to initialize a value network, a policy network, and a target policy network;
a changing module 302, configured to change the update modes of the value network, the policy network, and the target policy network according to a preset regular term based on divergence, to obtain a latest update mode;
the training module 303 is configured to train the multiple agents according to the value network, the policy network, and the target policy network to obtain experience data, and update the value network, the policy network, and the target policy network according to the experience data and the latest update mode;
and the execution module 304 is configured to obtain observation data from the environment by the multiple agents, and make a decision by combining the experience data and the updated policy network to obtain action data.
It should be noted that, when the divergence-based multi-agent cooperative learning apparatus provided in the foregoing embodiment executes the divergence-based multi-agent cooperative learning method, only the division of the function modules is illustrated, and in practical applications, the function distribution may be completed by different function modules according to needs, that is, the internal structure of the device may be divided into different function modules, so as to complete all or part of the functions described above. In addition, the divergence-based multi-agent cooperative learning device provided by the above embodiment and the divergence-based multi-agent cooperative learning method embodiment belong to the same concept, and the detailed implementation process thereof is referred to as the method embodiment, which is not described herein again.
The embodiment of the present disclosure further provides an electronic device corresponding to the divergence-based multi-agent cooperative learning method provided in the foregoing embodiment, so as to execute the divergence-based multi-agent cooperative learning method.
Referring to fig. 4, a schematic diagram of an electronic device provided in some embodiments of the present application is shown. As shown in fig. 4, the electronic apparatus includes: a processor 400, a memory 401, a bus 402 and a communication interface 403, wherein the processor 400, the communication interface 403 and the memory 401 are connected through the bus 402; the memory 401 has stored therein a computer program executable on the processor 400, the processor 400 executing the divergence-based multi-agent cooperative learning method provided by any of the foregoing embodiments of the present application when executing the computer program.
The memory 401 may include a high-speed Random Access Memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between a network element of the system and at least one other network element is realized through at least one communication interface 403 (which may be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, and the like can be used.
Bus 402 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. Wherein the memory 401 is used for storing a program, and the processor 400 executes the program after receiving an execution instruction, and the divergence-based multi-agent cooperative learning method disclosed in any of the embodiments of the present application can be applied to the processor 400, or implemented by the processor 400.
Processor 400 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 400. The processor 400 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 401, and the processor 400 reads the information in the memory 401 and completes the steps of the method in combination with the hardware.
The electronic device provided by the embodiment of the application and the divergence-based multi-agent cooperative learning method provided by the embodiment of the application have the same beneficial effects as the method adopted, operated or realized by the electronic device.
Referring to fig. 5, the computer readable storage medium is shown as an optical disc 500, on which a computer program (i.e., a program product) is stored, and when the computer program is executed by a processor, the computer program performs the divergence-based multi-agent cooperative learning method provided by any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above-mentioned embodiment of the present application and the divergence-based multi-agent cooperative learning method provided by the embodiment of the present application have the same beneficial effects as the method adopted, operated or implemented by the application program stored in the computer-readable storage medium.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples only show some embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A divergence-based multi-agent cooperative learning method is characterized by comprising the following steps:
initializing a value network, a policy network and a target policy network;
changing the updating modes of the value network, the strategy network and the target strategy network according to a preset regular term based on divergence to obtain a latest updating mode;
training the plurality of agents according to the value network, the strategy network and the target strategy network to obtain experience data, and updating the value network, the strategy network and the target strategy network according to the experience data and the latest updating mode;
and a plurality of agents acquire observation data from the environment, and make a decision by combining the experience data and the updated strategy network to obtain action data.
2. The method of claim 1, wherein before changing the updating modes of the value network, the policy network and the target policy network according to a preset regular term based on divergence, the method further comprises:
and constructing a maximized objective function of the multi-agent according to the divergence-based regularization item.
3. The method of claim 2, wherein the divergence-based regularization term is:
$$\log\frac{\pi(a_t\mid s_t)}{\rho(a_t\mid s_t)}$$
wherein π represents the policy network, a_t represents the action, s_t represents the state, and ρ represents the target policy network.
4. The method of claim 1, wherein changing the update mode of the value network according to a preset regularization term based on divergence comprises:
the value network is updated according to the following loss function:
$$\mathcal{L}(\phi)=\mathbb{E}\!\left[\left(Q_\phi(s,a)-y\right)^2\right]$$
$$y=r+\gamma\,\mathbb{E}_{a'\sim\pi}\!\left[Q_{\bar\phi}(s',a')-\lambda\log\frac{\pi(a'\mid s')}{\rho(a'\mid s')}\right]$$
$$\bar\phi\leftarrow\tau\phi+(1-\tau)\bar\phi$$
where λ is the regularization coefficient, π represents the policy network, ρ represents the target policy network, φ is the parameter of the value network, $\bar\phi$ is the parameter of the target value network, s represents all information in the environment, a represents the action, y represents the target value to be fitted, r represents the global reward, E represents the mathematical expectation, γ represents the discount factor, s' represents the new state to which the environment transitions after the agents make a decision, a' represents the action taken in the new state, $\mathcal{L}(\phi)$ represents the loss function of the value network, $Q_{\bar\phi}$ represents the target value network, $Q_\phi$ represents the value network, and τ is the coefficient of the moving-average update of the target value network.
5. The method of claim 1, wherein changing the update mode of the policy network according to a preset divergence-based regularization term comprises:
the policy network is updated according to the following minimization objective function:
$$\mathcal{L}(\theta_i)=\mathbb{E}_{s}\!\left[\lambda\,D_{\mathrm{KL}}\!\left(\pi_i(\cdot\mid s)\,\|\,\rho_i(\cdot\mid s)\right)-\mathbb{E}_{a_i\sim\pi_i(\cdot\mid s)}\!\left[Q^{\pi,\rho}(s,a_i)\right]\right]$$
where θ_i represents the parameter of the policy network π_i, $\mathcal{L}(\theta_i)$ represents the loss function of the policy network, E represents the mathematical expectation, a_i represents the action, s represents the state, $Q^{\pi,\rho}$ represents the value function given the policy π and the target policy ρ, λ represents the regularization coefficient, ρ_i represents the target policy network, and $D_{\mathrm{KL}}$ denotes the KL divergence.
6. The method of claim 1, wherein changing the update mode of the target policy network according to a preset divergence-based regularization term comprises:
the target policy network updates according to the running average of the policy network:
$$\bar\theta_i\leftarrow\tau\theta_i+(1-\tau)\bar\theta_i$$
where $\bar\theta_i$ represents the parameter of the target policy network of agent i, τ is the coefficient of the moving-average update of the target policy network, and θ_i represents the parameter of the policy network.
7. The method of claim 1, wherein training agents according to the value network, the policy network, and the target policy network to obtain experience data, and updating the value network, the policy network, and the target policy network according to the experience data and the latest update method comprises:
the intelligent agent obtains observation data from the environment, and makes a decision according to a policy network by using experience data to obtain action data;
the environment gives a reward according to the current state and the joint action, and moves to the next state;
storing a multi-element group as experience data into a cache database, wherein the multi-element group comprises a current environment state, a current action, a next environment state, observation data and global rewards;
and after a period of experience is finished, acquiring a plurality of pieces of experience data from the cache database for training, and updating the value network, the strategy network and the target strategy network.
8. A divergence-based multi-agent cooperative learning apparatus, comprising:
the initialization module is used for initializing a value network, a strategy network and a target strategy network;
the change module is used for changing the update modes of the value network, the strategy network and the target strategy network according to a preset regular term based on divergence to obtain a latest update mode;
the training module is used for training the plurality of agents according to the value network, the strategy network and the target strategy network to obtain experience data, and updating the value network, the strategy network and the target strategy network according to the experience data and the latest updating mode;
and the execution module is used for acquiring observation data from the environment by a plurality of agents, and making a decision by combining the experience data and the updated strategy network to obtain action data.
9. A divergence-based multi-agent cooperative learning apparatus, comprising a processor and a memory storing program instructions, the processor being configured to, upon execution of the program instructions, perform a divergence-based multi-agent cooperative learning method as claimed in any one of claims 1 to 7.
10. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement a divergence-based multi-agent cooperative learning method as claimed in any one of claims 1 to 7.
CN202110315995.0A 2021-03-24 2021-03-24 Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium Active CN113095498B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110315995.0A CN113095498B (en) 2021-03-24 2021-03-24 Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110315995.0A CN113095498B (en) 2021-03-24 2021-03-24 Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium

Publications (2)

Publication Number Publication Date
CN113095498A true CN113095498A (en) 2021-07-09
CN113095498B CN113095498B (en) 2022-11-18

Family

ID=76669465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110315995.0A Active CN113095498B (en) 2021-03-24 2021-03-24 Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium

Country Status (1)

Country Link
CN (1) CN113095498B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113391556A (en) * 2021-08-12 2021-09-14 中国科学院自动化研究所 Group distributed control method and device based on role distribution
CN113780577A (en) * 2021-09-07 2021-12-10 中国船舶重工集团公司第七0九研究所 Layered decision-making complete cooperation multi-agent reinforcement learning method and system
CN114418128A (en) * 2022-03-25 2022-04-29 新华三人工智能科技有限公司 Model deployment method and device
CN115494844A (en) * 2022-09-26 2022-12-20 成都朴为科技有限公司 Multi-robot searching method and system
CN115660110A (en) * 2022-12-26 2023-01-31 中国科学院自动化研究所 Multi-agent credit allocation method, device, readable storage medium and agent

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032863A1 (en) * 2016-07-27 2018-02-01 Google Inc. Training a policy neural network and a value neural network
US20200125957A1 (en) * 2018-10-17 2020-04-23 Peking University Multi-agent cooperation decision-making and training method
CN111514585A (en) * 2020-03-17 2020-08-11 清华大学 Method and system for controlling agent, computer device, and storage medium
CN111612126A (en) * 2020-04-18 2020-09-01 华为技术有限公司 Method and device for reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032863A1 (en) * 2016-07-27 2018-02-01 Google Inc. Training a policy neural network and a value neural network
US20200125957A1 (en) * 2018-10-17 2020-04-23 Peking University Multi-agent cooperation decision-making and training method
CN111514585A (en) * 2020-03-17 2020-08-11 清华大学 Method and system for controlling agent, computer device, and storage medium
CN111612126A (en) * 2020-04-18 2020-09-01 华为技术有限公司 Method and device for reinforcement learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DEMING YUAN et al.: "Distributed Mirror Descent for Online Composite Optimization", IEEE Transactions on Automatic Control *
JIECHUAN JIANG and ZONGQING LU: "Learning Fairness in Multi-Agent Systems", arXiv.org/abs/1910.14472 *
ZHANG-WEI HONG et al.: "A Deep Policy Inference Q-Network for Multi-Agent Systems", Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems *
SUN Changyin et al.: "Some Key Scientific Problems of Multi-Agent Deep Reinforcement Learning" (多智能体深度强化学习的若干关键科学问题), Acta Automatica Sinica (自动化学报) *
QU Zhaowei et al.: "Distributed Signal Control Based on Multi-Agent Reinforcement Learning Considering Game Behavior" (考虑博弈的多智能体强化学习分布式信号控制), Journal of Transportation Systems Engineering and Information Technology (交通运输系统工程与信息) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113391556A (en) * 2021-08-12 2021-09-14 中国科学院自动化研究所 Group distributed control method and device based on role distribution
CN113391556B (en) * 2021-08-12 2021-12-07 中国科学院自动化研究所 Group distributed control method and device based on role distribution
CN113780577A (en) * 2021-09-07 2021-12-10 中国船舶重工集团公司第七0九研究所 Layered decision-making complete cooperation multi-agent reinforcement learning method and system
CN113780577B (en) * 2021-09-07 2023-09-05 中国船舶重工集团公司第七0九研究所 Hierarchical decision complete cooperation multi-agent reinforcement learning method and system
CN114418128A (en) * 2022-03-25 2022-04-29 新华三人工智能科技有限公司 Model deployment method and device
CN114418128B (en) * 2022-03-25 2022-07-29 新华三人工智能科技有限公司 Model deployment method and device
CN115494844A (en) * 2022-09-26 2022-12-20 成都朴为科技有限公司 Multi-robot searching method and system
CN115660110A (en) * 2022-12-26 2023-01-31 中国科学院自动化研究所 Multi-agent credit allocation method, device, readable storage medium and agent

Also Published As

Publication number Publication date
CN113095498B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
CN113095498B (en) Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium
Subramanian et al. Multi type mean field reinforcement learning
Hou et al. An evolutionary transfer reinforcement learning framework for multiagent systems
US20190220750A1 (en) Solution search processing apparatus and solution search processing method
CN113093727A (en) Robot map-free navigation method based on deep security reinforcement learning
CN111898770B (en) Multi-agent reinforcement learning method, electronic equipment and storage medium
Sinapov et al. Learning inter-task transferability in the absence of target task samples
WO2016107426A1 (en) Systems and methods to adaptively select execution modes
US20190354100A1 (en) Bayesian control methodology for the solution of graphical games with incomplete information
CN116560239B (en) Multi-agent reinforcement learning method, device and medium
CN116136945A (en) Unmanned aerial vehicle cluster countermeasure game simulation method based on anti-facts base line
CN116776929A (en) Multi-agent task decision method based on PF-MADDPG
Park et al. Predictable mdp abstraction for unsupervised model-based rl
Shen Bionic communication network and binary pigeon-inspired optimization for multiagent cooperative task allocation
US20150150011A1 (en) Self-splitting of workload in parallel computation
CN116088586A (en) Method for planning on-line tasks in unmanned aerial vehicle combat process
Voss et al. Playing a strategy game with knowledge-based reinforcement learning
CN113599832A (en) Adversary modeling method, apparatus, device and storage medium based on environment model
Kim et al. Disentangling successor features for coordination in multi-agent reinforcement learning
Xu et al. Cascade attribute learning network
CN116489193B (en) Combat network self-adaptive combination method, device, equipment and medium
Lu et al. Optimal Cost Constrained Adversarial Attacks for Multiple Agent Systems
Liu et al. Soft-actor-attention-critic based on unknown agent action prediction for multi-agent collaborative confrontation
Perepu et al. DSDF: Coordinated look-ahead strategy in multi-agent reinforcement learning with noisy agents
Jensen et al. Industrial policy for advanced ai: Compute pricing and the safety tax

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant