CN113095498A - Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium - Google Patents


Info

Publication number
CN113095498A
Authority
CN
China
Prior art keywords
network
divergence
strategy
target
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110315995.0A
Other languages
Chinese (zh)
Other versions
CN113095498B (en)
Inventor
卢宗青
苏可凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202110315995.0A priority Critical patent/CN113095498B/en
Publication of CN113095498A publication Critical patent/CN113095498A/en
Application granted granted Critical
Publication of CN113095498B publication Critical patent/CN113095498B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G06F16/24552: Database cache management
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a divergence-based multi-agent cooperative learning method, apparatus, device and storage medium, wherein the method comprises the following steps: initializing a value network, a policy network and a target policy network; changing the update modes of the value network, the policy network and the target policy network according to a preset divergence-based regularization term to obtain the latest update mode; training the plurality of agents according to the value network, the policy network and the target policy network to obtain experience data, and updating the value network, the policy network and the target policy network according to the experience data and the latest update mode; and the plurality of agents acquiring observation data from the environment and making decisions by combining the experience data with the updated policy network to obtain action data. The multi-agent cooperative learning method provided by the embodiment of the disclosure uses the divergence-based regularization term to enhance the exploration capability of the agents and solves the cooperation problem of multiple agents.

Description

Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium
Technical Field
The invention relates to the technical field of machine learning, in particular to a divergence-based multi-agent cooperative learning method, a divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning media.
Background
Reinforcement learning agents can learn behavioral policies autonomously by interacting with the environment, and have therefore been successfully applied to tasks in the single-agent domain, such as robotic arm control, board games, and video games. However, many tasks in real life require multiple agents to cooperate, such as logistics robots, unmanned driving, and large-scale real-time strategy games. Therefore, multi-agent cooperative learning has attracted increasing attention in recent years.
In a cooperative multi-agent task, each agent typically perceives only local information within its field of view due to communication limitations. It is difficult for agents to form effective cooperation if each agent learns only from its own local information. In the prior art, the exploration capability of an agent is improved by adding an entropy regularization term, but this term modifies the original Markov decision process, so the converged policy obtained by entropy-regularized reinforcement learning is not the optimal policy of the original problem, and bias may be introduced into the converged policy.
Disclosure of Invention
The disclosed embodiments provide a divergence-based multi-agent cooperative learning method, apparatus, device and medium. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, the disclosed embodiments provide a divergence-based multi-agent cooperative learning method, including:
initializing a value network, a policy network and a target policy network;
changing the updating modes of the value network, the strategy network and the target strategy network according to a preset regular term based on divergence to obtain a latest updating mode;
training a plurality of agents according to the value network, the strategy network and the target strategy network to obtain experience data, and updating the value network, the strategy network and the target strategy network according to the experience data and the latest updating mode;
and a plurality of agents acquire observation data from the environment, and make a decision by combining the experience data and the updated strategy network to obtain action data.
In an optional embodiment, before changing the updating manner of the value network, the policy network, and the target policy network according to a preset regular term based on divergence, the method further includes:
and constructing a maximized objective function of the multi-agent according to the divergence-based regularization item.
In one optional embodiment, the divergence-based regularization term is:
$$\log\frac{\pi(a_t\mid s_t)}{\rho(a_t\mid s_t)}$$
wherein π represents the policy network, a_t represents the action, s_t represents the state, and ρ represents the target policy network.
In an optional embodiment, changing the update mode of the value network according to a preset regular term based on divergence includes:
the value network is updated according to the following loss function:
$$\mathcal{L}(\phi)=\mathbb{E}\!\left[\left(Q_\phi(s,a)-y\right)^2\right]$$
$$y=r+\gamma\,\mathbb{E}_{a'\sim\pi}\!\left[Q_{\bar\phi}(s',a')-\lambda\log\frac{\pi(a'\mid s')}{\rho(a'\mid s')}\right]$$
$$\bar\phi\leftarrow\tau\phi+(1-\tau)\bar\phi$$
where λ is the regularization coefficient, π represents the policy network, ρ represents the target policy network, φ is the parameter of the value network, $\bar\phi$ is the parameter of the target value network, s represents all information in the environment, a represents the action, y represents the target value to be fitted, r represents the global reward, E represents the mathematical expectation, γ represents the discount factor, s' represents the new state to which the environment transitions after the agents make a decision, a' represents the action taken in the new state, $\mathcal{L}(\phi)$ represents the loss function of the value network, $Q_{\bar\phi}$ represents the target value network, $Q_\phi$ represents the value network, and τ is the coefficient of the moving-average update of the target value network.
In an optional embodiment, changing the update mode of the policy network according to a preset regular term based on divergence includes:
the policy network is updated according to the following minimization objective function:
$$\mathcal{L}(\theta_i)=\mathbb{E}_{s}\!\left[\lambda\,D_{\mathrm{KL}}\!\left(\pi_i(\cdot\mid s)\,\|\,\rho_i(\cdot\mid s)\right)-\mathbb{E}_{a_i\sim\pi_i(\cdot\mid s)}\!\left[Q^{\pi,\rho}(s,a_i)\right]\right]$$
where θ_i represents the parameter of the policy network π_i, $\mathcal{L}(\theta_i)$ represents the loss function of the policy network, E represents the mathematical expectation, a_i represents the action, s represents the state, $Q^{\pi,\rho}$ represents the value function given the policy π and the target policy ρ, λ represents the regularization coefficient, ρ_i represents the target policy network, and $D_{\mathrm{KL}}$ denotes the KL divergence.
In an optional embodiment, changing the update mode of the target policy network according to a preset regular term based on divergence includes:
the target policy network updates according to the running average of the policy network:
$$\bar\theta_i\leftarrow\tau\theta_i+(1-\tau)\bar\theta_i$$
where $\bar\theta_i$ represents the parameter of the target policy network of agent i, τ is the coefficient of the moving-average update of the target policy network, and θ_i represents the parameter of the policy network.
In an optional embodiment, the training of the plurality of agents according to the value network, the policy network, and the target policy network to obtain the experience data, and the updating of the value network, the policy network, and the target policy network according to the experience data and the latest updating manner include:
the intelligent agent obtains observation data from the environment, and makes a decision according to a policy network by using experience data to obtain action data;
the environment gives a reward according to the current state and the joint action, and moves to the next state;
storing a multi-element group as experience data into a cache database, wherein the multi-element group comprises a current environment state, a current action, a next environment state, observation data and global rewards;
and after a period of experience is finished, acquiring a plurality of pieces of experience data from the cache database for training, and updating the value network, the strategy network and the target strategy network.
In a second aspect, embodiments of the present disclosure provide a divergence-based multi-agent cooperative learning apparatus, comprising:
the initialization module is used for initializing a value network, a strategy network and a target strategy network;
the change module is used for changing the update modes of the value network, the strategy network and the target strategy network according to a preset regular term based on divergence to obtain a latest update mode;
the training module is used for training the plurality of agents according to the value network, the strategy network and the target strategy network to obtain experience data, and updating the value network, the strategy network and the target strategy network according to the experience data and the latest updating mode;
and the execution module is used for acquiring observation data from the environment by a plurality of agents, and making a decision by combining the experience data and the updated strategy network to obtain action data.
In a third aspect, the disclosed embodiments provide a divergence-based multi-agent cooperative learning device, comprising a processor and a memory storing program instructions, the processor being configured to execute the divergence-based multi-agent cooperative learning method provided by the above embodiments when executing the program instructions.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium having computer-readable instructions stored thereon, the computer-readable instructions being executable by a processor to implement a divergence-based multi-agent cooperative learning method provided by the above-mentioned embodiments.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the disclosed embodiment provides a cooperative learning method of multiple agents, which is used for solving the cooperative problem of the multiple agents. The method adds a target strategy to the intelligent agent, adds a regular term based on divergence in a reward function of a general Markov decision process by using the target strategy, and provides a new strategy updating mode aiming at the Markov decision process based on divergence. And after the regular item is added, off-strategy training can be realized, and the problem of sampling efficiency in a multi-agent environment is solved. Meanwhile, the regular term also controls the step length of strategy updating, and the stability of strategy improvement is enhanced. The method provided by the embodiment of the disclosure has certain flexibility and has important application value in the fields of intelligent transportation, robot control and the like.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic flow diagram illustrating a divergence-based multi-agent cooperative learning method in accordance with an exemplary embodiment;
FIG. 2 is a schematic flow diagram illustrating a divergence-based multi-agent training method in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating the architecture of a divergence-based multi-agent cooperative learning apparatus in accordance with an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a configuration of a divergence-based multi-agent cooperative learning device, according to an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a computer storage medium in accordance with an exemplary embodiment.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of systems and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
In a multi-agent cooperation problem, the state of the environment, i.e., all information in the environment, is s. Each agent i obtains partial information o_i from the environment and, based on its history information τ_i = {(o_i, a_i)} and its own policy π_i, selects and executes its own action a_i. According to the current state s and the joint action a = (a_1, a_2, ..., a_n) of all agents, every agent receives the same global reward r(s, a), and the environment transitions to the next state s' ~ P(·|s, a). Denoting the joint policy of all agents as π = π_1 × π_2 × ... × π_n, the common objective function that each agent maximizes is:
$$\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\right]$$
where E represents the mathematical expectation, π represents the joint policy, a_t represents the joint action, s_t represents the state, and γ represents the discount factor.
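For concreteness, the following Python sketch (an illustration only; the class name, field names, and the discounted-return helper are assumptions, not part of the patent) shows one way to represent a transition of such a cooperative Markov game and to estimate the common objective over a finished episode:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Transition:
    """One step of experience in the cooperative Markov game (field names are illustrative)."""
    state: Any                # global state s (all information in the environment)
    observations: List[Any]   # per-agent partial observations o_i
    actions: List[Any]        # joint action a = (a_1, ..., a_n)
    reward: float             # shared global reward r(s, a)
    next_state: Any           # successor state s' ~ P(. | s, a)

def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo estimate of the common objective E[sum_t gamma^t * r_t] for one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```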
Entropy-regularized reinforcement learning mainly subtracts log π(a_t|s_t), weighted by a coefficient λ, from the reward function, i.e., it modifies the optimization objective of the agent to:
$$\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_t,a_t)-\lambda\log\pi(a_t\mid s_t)\right)\right]$$
where E represents the mathematical expectation, π represents the policy, a_t represents the action, s_t represents the state, γ represents the discount factor, and λ represents the coefficient of the regularization term. Adding the entropy regularization term can improve the agent's exploration, but it modifies the original Markov decision process, so the converged policy obtained by entropy-regularized reinforcement learning is not the optimal policy of the original problem, which introduces bias into the converged policy.
The embodiment of the disclosure provides a new divergence-based regularization term and a new policy update scheme for the divergence-regularized Markov decision process. The divergence-based regularization term enhances the exploration capability of the agent; meanwhile, when the policy of the agent converges, the regularization term vanishes, which avoids the bias that adding a regularization term would otherwise introduce into the converged policy. Moreover, after the regularization term is added, off-policy training can be realized, which alleviates the sampling-efficiency problem in the multi-agent environment. Meanwhile, the regularization term also controls the step size of the policy update, which enhances the stability of policy improvement.
The divergence-based multi-agent cooperative learning method provided by the embodiment of the present application will be described in detail below with reference to fig. 1 to 2.
The method is suitable for situations in which multiple cooperative agents exist and each agent interacts with the environment during training. In real life, many tasks are completed through the cooperation of multiple agents, such as logistics robots, unmanned driving, and large-scale real-time strategy games; a system in which multiple agents cooperate to complete tasks is called a multi-agent system. For example, a warehouse logistics system is a multi-agent system, in which each logistics robot is an agent.
Referring to fig. 1, the method specifically includes the following steps.
S101, initialize the value network Q, the policy networks π_i, and the target policy networks ρ_i.
S102, changing the updating modes of the value network, the strategy network and the target strategy network according to a preset regular term based on divergence to obtain a latest updating mode.
In a possible implementation, before executing step S102, constructing a maximization objective function of the multi-agent according to the divergence-based regularization term is further included.
Specifically, a target policy ρ_i is added for each agent, and the newly added target policy is used to subtract the term
$$\log\frac{\pi(a_t\mid s_t)}{\rho(a_t\mid s_t)}$$
(weighted by a coefficient λ) from the reward function; this term is called the divergence-based regularization term. Here π represents the policy network, a_t represents the action, s_t represents the state, and ρ represents the target policy network.
Further, the maximization objective function of the multi-agent system obtained with the divergence-based regularization term is:
$$\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_t,a_t)-\lambda\log\frac{\pi(a_t\mid s_t)}{\rho(a_t\mid s_t)}\right)\right]$$
where E represents the mathematical expectation, π represents the policy, a_t represents the action, s_t represents the state, γ represents the discount factor, λ represents the coefficient of the regularization term, and ρ represents the target policy network.
The embodiment of the disclosure replaces the entropy regularization term with the divergence-based regularization term and uses it to obtain a new optimization objective. When the target policy is taken as a past policy, the regularization term lowers the reward for an action whose probability has increased and raises the reward otherwise; that is, it encourages the agent to take actions whose probability has decreased and discourages actions whose probability has increased, so the agent explores more fully during training. Meanwhile, the regularization term effectively controls the step size of the policy update, i.e., the new policy cannot deviate too far from the past policy, which benefits the stability of policy improvement. Moreover, after the divergence-based regularization term is added, the latest update mode derived below enables off-policy training, which alleviates the sampling-efficiency problem faced by policy-based reinforcement learning methods in the multi-agent setting.
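As a minimal sketch of the objective above (assuming the per-step log-probabilities under the policy π and the target policy ρ are available as PyTorch tensors; the function names and the default coefficient values are assumptions, not taken from the patent), the divergence-regularized return of a sampled trajectory can be estimated as follows:

```python
import torch

def divergence_regularized_rewards(rewards, logp_pi, logp_rho, lam=0.01):
    """Per-step shaped reward r(s_t, a_t) - lambda * log(pi(a_t|s_t) / rho(a_t|s_t)).

    rewards, logp_pi, logp_rho: 1-D tensors of length T for one trajectory.
    lam: regularization coefficient lambda (the default value is a placeholder).
    """
    return rewards - lam * (logp_pi - logp_rho)

def regularized_objective(rewards, logp_pi, logp_rho, gamma=0.99, lam=0.01):
    """Monte-Carlo estimate of E_pi[ sum_t gamma^t * (r_t - lambda * log(pi/rho)) ]."""
    shaped = divergence_regularized_rewards(rewards, logp_pi, logp_rho, lam)
    discounts = gamma ** torch.arange(shaped.shape[0], dtype=shaped.dtype)
    return (discounts * shaped).sum()
```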
Specifically, changing the update mode of the value network according to a preset regular term based on divergence includes:
in general reinforcement learning, a value function is generally defined as:
$$Q^{\pi}(s,a)=\mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\,\middle|\,s_0=s,\,a_0=a\right]$$
where Q represents the value function, π represents the policy, a_t represents the action, s_t represents the state, γ represents the discount factor, E represents the mathematical expectation, and r represents the global reward.
During training, the value function is usually updated according to the Bellman optimality equation, that is:
$$Q^{*}(s,a)=\mathbb{E}_{s'}\!\left[r(s,a)+\gamma\max_{a'}Q^{*}(s',a')\right]$$
where Q* represents the optimal value function, a represents the action, s represents the state, γ represents the discount factor, E represents the mathematical expectation, r represents the global reward, s' represents the state at the next time step, and a' represents the action at the next time step.
After adding the divergence-based regularization term, a new iterative formula for the optimal value function is obtained, namely:
$$Q^{*}_{\rho}(s,a)=\mathbb{E}_{s'}\!\left[r(s,a)+\gamma\,\mathbb{E}_{a'\sim\pi^{*}}\!\left[Q^{*}_{\rho}(s',a')-\lambda\log\frac{\pi^{*}(a'\mid s')}{\rho(a'\mid s')}\right]\right]$$
where $Q^{*}_{\rho}$ represents the optimal value function under the divergence-based regularization, a represents the action, s represents the state, γ represents the discount factor, E represents the mathematical expectation, r represents the global reward, s' represents the state at the next time step, a' represents the action at the next time step, λ represents the coefficient of the regularization term, ρ represents the target policy network, and π* represents the optimal policy. Apart from adding the divergence-based regularization term, the difference between this formula and the Bellman optimality equation is that it no longer restricts a' to be the optimal action; a' can be sampled rather than maximized over, which increases flexibility in training and, at the same time, allows the TD(λ) objective design to be adopted in off-policy training.
According to the above iterative formula, the value network is updated by minimizing the following loss function:
$$\mathcal{L}(\phi)=\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\!\left[\left(Q_\phi(s,a)-y\right)^2\right]$$
$$y=r+\gamma\,\mathbb{E}_{a'\sim\pi}\!\left[Q_{\bar\phi}(s',a')-\lambda\log\frac{\pi(a'\mid s')}{\rho(a'\mid s')}\right]$$
$$\bar\phi\leftarrow\tau\phi+(1-\tau)\bar\phi$$
where λ is the regularization coefficient, π represents the policy network, ρ represents the target policy network, φ is the parameter of the value network, $\bar\phi$ is the parameter of the target value network, $\mathcal{D}$ is the cache database, s represents all information in the environment, a represents the action, y represents the target value to be fitted, r represents the global reward, E represents the mathematical expectation, γ represents the discount factor, s' represents the new state to which the environment transitions after the agents make a decision, a' represents the action taken in the new state, $\mathcal{L}(\phi)$ represents the loss function of the value network, $Q_{\bar\phi}$ represents the target value network, $Q_\phi$ represents the value network, and τ is the coefficient of the moving-average update of the target value network.
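The following PyTorch sketch illustrates one possible implementation of this value-network update (a sketch under assumptions: the policy and target-policy objects are assumed to expose `sample_with_log_prob` and `log_prob` methods, the batch is assumed to be a dict of tensors, and the default hyperparameter values are placeholders; none of these names come from the patent):

```python
import torch
import torch.nn.functional as F

def value_update(q_net, target_q_net, pi, rho, batch, q_optimizer,
                 gamma=0.99, lam=0.01, tau=0.005):
    """One value-network step with the divergence-regularized target
    y = r + gamma * E_{a'~pi}[ Q_target(s', a') - lam * log(pi(a'|s') / rho(a'|s')) ],
    followed by a moving-average update of the target value network."""
    s, a, r, s_prime = batch["s"], batch["a"], batch["r"], batch["s_prime"]

    with torch.no_grad():
        a_prime, logp_pi = pi.sample_with_log_prob(s_prime)    # a' ~ pi(.|s')
        logp_rho = rho.log_prob(s_prime, a_prime)               # log rho(a'|s')
        y = r + gamma * (target_q_net(s_prime, a_prime)
                         - lam * (logp_pi - logp_rho))

    loss = F.mse_loss(q_net(s, a), y)                           # L(phi) = E[(Q_phi(s, a) - y)^2]
    q_optimizer.zero_grad()
    loss.backward()
    q_optimizer.step()

    # target value network: phi_bar <- tau * phi + (1 - tau) * phi_bar
    with torch.no_grad():
        for p, p_bar in zip(q_net.parameters(), target_q_net.parameters()):
            p_bar.mul_(1 - tau).add_(tau * p)
    return loss.item()
```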
Further, changing the updating mode of the policy network according to a preset regular term based on divergence, comprising:
policy-based reinforcement learning methods are usually rootThe policy network update mode is obtained according to the policy gradient theorem, but the policy gradient theorem limits that the update must be in policy (on-policy). After adding the regular term based on divergence, a new strategy lifting theorem is obtained, namely an old strategy pi is givenoldOnly new strategy is needednewSatisfy the requirement of
Figure BDA0002991208470000093
Wherein
Figure BDA0002991208470000094
DKLRepresents the KL divergence, then pinewIs just to pioldMore preferably. According to the above theorem, the gradient of the strategy can be calculated by deducing according to the following formula:
$$\nabla_{\theta_i}\mathcal{L}(\theta_i)=\mathbb{E}_{s\sim\mathcal{D},\,a_i\sim\pi_i(\cdot\mid s)}\!\left[\left(\lambda\log\frac{\pi_i(a_i\mid s)}{\rho_i(a_i\mid s)}-Q^{\pi,\rho}(s,a_i)\right)\nabla_{\theta_i}\log\pi_i(a_i\mid s)\right]$$
where θ_i represents the parameter of the policy network π_i, $\mathcal{D}$ represents the cache database used to store the training data, $\mathcal{L}(\theta_i)$ represents the loss function of the policy network, E represents the mathematical expectation, a_i represents the action, s represents the state, $Q^{\pi,\rho}$ represents the value function given the policy π and the target policy ρ, λ represents the regularization coefficient, and ρ represents the target policy network.
Apart from adding the regularization term, the biggest difference between the above gradient formula and the policy gradient theorem is that it no longer restricts the distribution of the state s, so the above update rule can be used for off-policy updates. In the multi-agent setting, a counterfactual baseline function can additionally be subtracted from the gradient to address the credit assignment problem and to reduce the variance of the gradient update. Taking the baseline function as
$$b(s,a^{-i})=\mathbb{E}_{a_i\sim\pi_i(\cdot\mid s)}\!\left[Q^{\pi,\rho}\!\left(s,(a_i,a^{-i})\right)\right]$$
where $a^{-i}$ denotes the actions of all agents other than agent i, the gradient formula can be further rewritten, and the policy network is accordingly updated according to the following minimization objective function:
$$\mathcal{L}(\theta_i)=\mathbb{E}_{s\sim\mathcal{D}}\!\left[\lambda\,D_{\mathrm{KL}}\!\left(\pi_i(\cdot\mid s)\,\|\,\rho_i(\cdot\mid s)\right)-\mathbb{E}_{a_i\sim\pi_i(\cdot\mid s)}\!\left[Q^{\pi,\rho}(s,a_i)-b(s,a^{-i})\right]\right]$$
where θ_i represents the parameter of the policy network π_i, $\mathcal{L}(\theta_i)$ represents the loss function of the policy network, E represents the mathematical expectation, a_i represents the action of agent i, s represents the state, $Q^{\pi,\rho}$ represents the value function given the policy π and the target policy ρ, b represents the counterfactual baseline function defined above, λ represents the regularization coefficient, ρ_i represents the target policy network of agent i, and $D_{\mathrm{KL}}$ denotes the KL divergence.
Further, changing the updating mode of the target policy network according to a preset regular term based on divergence, comprising:
for the objective strategy, in the most ideal case, the following iterative approach should be used:
Figure BDA0002991208470000104
i.e. taking the constant rho ═ pitUnder the condition of obtaining the optimal strategy pit+1When such iterative process converges, pit+1=πtAnd then have
Figure BDA0002991208470000105
Therefore, the convergence strategy is the optimal strategy of the original Markov decision process, and the deviation brought to the convergence strategy by adding the regular term is avoided. However, such an ideal case requires training through reinforcement learning until convergence every iteration, and is too costly, so the following approximation method can be adopted:
$$\theta_i\leftarrow\theta_i-\alpha\,\nabla_{\theta_i}\mathcal{L}(\theta_i)$$
$$\bar\theta_i\leftarrow\tau\theta_i+(1-\tau)\bar\theta_i$$
where $\bar\theta_i$ represents the parameter of the target policy network of agent i, τ is the coefficient of the moving-average update of the target policy network, θ_i represents the parameter of the policy network, and α is the learning rate of the one-step gradient update. That is, a single gradient step replaces the full maximization, while the target policy is taken to be the moving average of the policy.
Accordingly, the target policy network is updated as the moving average of the policy network:
$$\bar\theta_i\leftarrow\tau\theta_i+(1-\tau)\bar\theta_i$$
where $\bar\theta_i$ represents the parameter of the target policy network of agent i, τ is the coefficient of the moving-average update of the target policy network, and θ_i represents the parameter of the policy network.
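A minimal sketch of this moving-average (Polyak) update, assuming both networks are torch.nn.Module instances with matching parameter shapes (the function name and default τ are assumptions):

```python
import torch

def update_target_policy(pi_i, rho_i, tau=0.005):
    """Moving-average update theta_bar_i <- tau * theta_i + (1 - tau) * theta_bar_i."""
    with torch.no_grad():
        for p, p_bar in zip(pi_i.parameters(), rho_i.parameters()):
            p_bar.mul_(1 - tau).add_(tau * p)
```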
According to the above steps, the latest update modes of the value network, the policy network, and the target policy network are obtained from the divergence-based regularization term. With these update modes, when the policy of the agent converges, the regularization term vanishes, which avoids the bias that adding a regularization term would bring to the converged policy; off-policy training can be realized, which alleviates the sampling-efficiency problem in the multi-agent environment; and the step size of the policy update is controlled, which enhances the stability of policy improvement.
S103, training the plurality of agents according to the value network, the strategy network and the target strategy network to obtain experience data, and updating the value network, the strategy network and the target strategy network according to the experience data and the latest updating mode.
Fig. 2 is a flow diagram illustrating a divergence-based multi-agent training method according to an exemplary embodiment, as shown in fig. 2, the multi-agent training process includes: s201, an intelligent agent obtains observation data from the environment, and makes a decision according to a policy network by using experience data to obtain action data; s202, the environment gives rewards according to the current state and the joint action, and moves to the next state; s203, storing a multi-element group as experience data into a cache database, wherein the multi-element group comprises a current environment state, a current action, a next environment state, observation data and global rewards; s204, after a period of experience is finished, a plurality of pieces of experience data are obtained from the buffer database for training, and the value network, the strategy network and the target strategy network are updated.
Specifically, the training data is obtained by the agents interacting with the environment: agent i obtains observation data o_i from the environment and, using its experience data τ_i, makes a decision according to its policy π_i to obtain an action a_i. The environment then gives the reward r according to the current state s and the joint action a and moves to the next state s'. The tuple (s, a, s', o, r) is stored as one piece of experience data in the cache database D. After an episode of experience is finished, a number of experiences are sampled from D for training, and the value network Q, the policy networks π_i, and the target policy networks ρ_i are updated according to the latest update modes.
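The experience-collection loop described above might look like the following sketch (the environment and agent interfaces, the field order of the tuple, and the buffer capacity are assumptions for illustration):

```python
import random
from collections import deque

class ReplayBuffer:
    """Cache database D holding (s, a, s', o, r) experience tuples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, s_prime, o, r):
        self.buffer.append((s, a, s_prime, o, r))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def collect_episode(env, agents, buffer):
    """One episode of interaction between the agents and the environment."""
    s, obs = env.reset()
    done = False
    while not done:
        actions = [agent.act(o) for agent, o in zip(agents, obs)]   # a_i ~ pi_i(.|tau_i)
        s_prime, next_obs, reward, done = env.step(actions)
        buffer.store(s, actions, s_prime, obs, reward)
        s, obs = s_prime, next_obs
```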
In one exemplary scenario, in the game StarCraft II, a player controls several units to fight against the enemy, with the goal of eliminating all enemy units. In training, each unit is regarded as an agent. Each agent has a field of view, and its observation data consists of the relevant attributes of the units within that field of view (such as health, shield value, and unit type). The actions an agent can take include moving, attacking, releasing skills, and so on. The agent's policy is generated by the policy network and is a probability distribution over the actions the agent can take in the current state. An agent obtains a reward when it damages or destroys an enemy unit. At each step, the state (data corresponding to all information of the current game, available only during training), the observations of all agents, the actions of all agents, the reward obtained after taking the actions, and the new state transitioned to are stored as experience. After a number of steps are executed or a game is finished, the previously obtained experience data are taken out for training and the agents' policies are updated; new experience is then obtained with the new policies, and this process is repeated until the agents achieve good performance.
S104, a plurality of agents acquire observation data from the environment, and make a decision by combining the experience data and the updated strategy network to obtain action data.
Further, during execution, for agent i: each time observation data o_i is acquired from the environment, the agent combines it with its experience data τ_i and makes a decision according to the updated policy network π_i to obtain an action a_i.
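At execution time the decision step reduces to the following sketch (the `act` interface and the `explore` flag are assumptions):

```python
def execute_step(agents, observations):
    """Decentralized execution: each agent i maps its observation o_i (and any history it
    keeps internally) to an action a_i with its updated policy network pi_i."""
    return [agent.act(obs, explore=False) for agent, obs in zip(agents, observations)]
```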
The multi-agent cooperative learning method provided by the embodiment of the disclosure solves the cooperation problem of multiple agents. The divergence-based regularization term enhances the exploration capability of the agents, and when the policy of an agent converges, the regularization term vanishes, which avoids the bias that adding a regularization term would bring to the converged policy. After the regularization term is added, off-policy training can be realized, which alleviates the sampling-efficiency problem in the multi-agent environment. Meanwhile, the regularization term also controls the step size of the policy update, which enhances the stability of policy improvement. The method provided by the embodiment of the disclosure has a certain flexibility: for example, in the field of robot control it can solve the cooperative control problem of multiple robots, and in the field of games it can handle the cooperative control of multiple game characters and serve as a multi-agent artificial intelligence system in games.
The disclosed embodiment also provides a divergence-based multi-agent cooperative learning apparatus, configured to perform the divergence-based multi-agent cooperative learning method of the foregoing embodiment, as shown in fig. 3, the apparatus includes:
an initialization module 301, configured to initialize a value network, a policy network, and a target policy network;
a changing module 302, configured to change the update modes of the value network, the policy network, and the target policy network according to a preset regular term based on divergence, to obtain a latest update mode;
the training module 303 is configured to train the multiple agents according to the value network, the policy network, and the target policy network to obtain experience data, and update the value network, the policy network, and the target policy network according to the experience data and the latest update mode;
and the execution module 304 is configured to obtain observation data from the environment by the multiple agents, and make a decision by combining the experience data and the updated policy network to obtain action data.
It should be noted that, when the divergence-based multi-agent cooperative learning apparatus provided in the foregoing embodiment executes the divergence-based multi-agent cooperative learning method, only the division of the function modules is illustrated, and in practical applications, the function distribution may be completed by different function modules according to needs, that is, the internal structure of the device may be divided into different function modules, so as to complete all or part of the functions described above. In addition, the divergence-based multi-agent cooperative learning device provided by the above embodiment and the divergence-based multi-agent cooperative learning method embodiment belong to the same concept, and the detailed implementation process thereof is referred to as the method embodiment, which is not described herein again.
The embodiment of the present disclosure further provides an electronic device corresponding to the divergence-based multi-agent cooperative learning method provided in the foregoing embodiment, so as to execute the divergence-based multi-agent cooperative learning method.
Referring to fig. 4, a schematic diagram of an electronic device provided in some embodiments of the present application is shown. As shown in fig. 4, the electronic apparatus includes: a processor 400, a memory 401, a bus 402 and a communication interface 403, wherein the processor 400, the communication interface 403 and the memory 401 are connected through the bus 402; the memory 401 has stored therein a computer program executable on the processor 400, the processor 400 executing the divergence-based multi-agent cooperative learning method provided by any of the foregoing embodiments of the present application when executing the computer program.
The memory 401 may include a high-speed Random Access Memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between a network element of the system and at least one other network element is realized through at least one communication interface 403 (which may be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, and the like can be used.
Bus 402 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. Wherein the memory 401 is used for storing a program, and the processor 400 executes the program after receiving an execution instruction, and the divergence-based multi-agent cooperative learning method disclosed in any of the embodiments of the present application can be applied to the processor 400, or implemented by the processor 400.
Processor 400 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 400. The processor 400 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 401, and the processor 400 reads the information in the memory 401 and completes the steps of the method in combination with the hardware.
The electronic device provided by the embodiment of the application and the divergence-based multi-agent cooperative learning method provided by the embodiment of the application have the same beneficial effects as the method adopted, operated or realized by the electronic device.
Referring to fig. 5, the computer readable storage medium is shown as an optical disc 500, on which a computer program (i.e., a program product) is stored, and when the computer program is executed by a processor, the computer program performs the divergence-based multi-agent cooperative learning method provided by any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above-mentioned embodiment of the present application and the divergence-based multi-agent cooperative learning method provided by the embodiment of the present application have the same beneficial effects as the method adopted, operated or implemented by the application program stored in the computer-readable storage medium.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples only show some embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A divergence-based multi-agent cooperative learning method is characterized by comprising the following steps:
initializing a value network, a policy network and a target policy network;
changing the updating modes of the value network, the strategy network and the target strategy network according to a preset regular term based on divergence to obtain a latest updating mode;
training the plurality of agents according to the value network, the strategy network and the target strategy network to obtain experience data, and updating the value network, the strategy network and the target strategy network according to the experience data and the latest updating mode;
and a plurality of agents acquire observation data from the environment, and make a decision by combining the experience data and the updated strategy network to obtain action data.
2. The method of claim 1, wherein before changing the updating modes of the value network, the policy network and the target policy network according to a preset regular term based on divergence, the method further comprises:
and constructing a maximized objective function of the multi-agent according to the divergence-based regularization item.
3. The method of claim 2, wherein the divergence-based regularization term is:
$$\log\frac{\pi(a_t\mid s_t)}{\rho(a_t\mid s_t)}$$
wherein π represents the policy network, a_t represents the action, s_t represents the state, and ρ represents the target policy network.
4. The method of claim 1, wherein changing the update mode of the value network according to a preset regularization term based on divergence comprises:
the value network is updated according to the following loss function:
$$\mathcal{L}(\phi)=\mathbb{E}\!\left[\left(Q_\phi(s,a)-y\right)^2\right]$$
$$y=r+\gamma\,\mathbb{E}_{a'\sim\pi}\!\left[Q_{\bar\phi}(s',a')-\lambda\log\frac{\pi(a'\mid s')}{\rho(a'\mid s')}\right]$$
$$\bar\phi\leftarrow\tau\phi+(1-\tau)\bar\phi$$
where λ is the regularization coefficient, π represents the policy network, ρ represents the target policy network, φ is the parameter of the value network, $\bar\phi$ is the parameter of the target value network, s represents all information in the environment, a represents the action, y represents the target value to be fitted, r represents the global reward, E represents the mathematical expectation, γ represents the discount factor, s' represents the new state to which the environment transitions after the agents make a decision, a' represents the action taken in the new state, $\mathcal{L}(\phi)$ represents the loss function of the value network, $Q_{\bar\phi}$ represents the target value network, $Q_\phi$ represents the value network, and τ is the coefficient of the moving-average update of the target value network.
5. The method of claim 1, wherein changing the update mode of the policy network according to a preset divergence-based regularization term comprises:
the policy network is updated according to the following minimization objective function:
$$\mathcal{L}(\theta_i)=\mathbb{E}_{s}\!\left[\lambda\,D_{\mathrm{KL}}\!\left(\pi_i(\cdot\mid s)\,\|\,\rho_i(\cdot\mid s)\right)-\mathbb{E}_{a_i\sim\pi_i(\cdot\mid s)}\!\left[Q^{\pi,\rho}(s,a_i)\right]\right]$$
where θ_i represents the parameter of the policy network π_i, $\mathcal{L}(\theta_i)$ represents the loss function of the policy network, E represents the mathematical expectation, a_i represents the action, s represents the state, $Q^{\pi,\rho}$ represents the value function given the policy π and the target policy ρ, λ represents the regularization coefficient, ρ_i represents the target policy network, and $D_{\mathrm{KL}}$ denotes the KL divergence.
6. The method of claim 1, wherein changing the update mode of the target policy network according to a preset divergence-based regularization term comprises:
the target policy network updates according to the running average of the policy network:
$$\bar\theta_i\leftarrow\tau\theta_i+(1-\tau)\bar\theta_i$$
where $\bar\theta_i$ represents the parameter of the target policy network of agent i, τ is the coefficient of the moving-average update of the target policy network, and θ_i represents the parameter of the policy network.
7. The method of claim 1, wherein training agents according to the value network, the policy network, and the target policy network to obtain experience data, and updating the value network, the policy network, and the target policy network according to the experience data and the latest update method comprises:
the intelligent agent obtains observation data from the environment, and makes a decision according to a policy network by using experience data to obtain action data;
the environment gives a reward according to the current state and the joint action, and moves to the next state;
storing a multi-element group as experience data into a cache database, wherein the multi-element group comprises a current environment state, a current action, a next environment state, observation data and global rewards;
and after a period of experience is finished, acquiring a plurality of pieces of experience data from the cache database for training, and updating the value network, the strategy network and the target strategy network.
8. A divergence-based multi-agent cooperative learning apparatus, comprising:
the initialization module is used for initializing a value network, a strategy network and a target strategy network;
the change module is used for changing the update modes of the value network, the strategy network and the target strategy network according to a preset regular term based on divergence to obtain a latest update mode;
the training module is used for training the plurality of agents according to the value network, the strategy network and the target strategy network to obtain experience data, and updating the value network, the strategy network and the target strategy network according to the experience data and the latest updating mode;
and the execution module is used for acquiring observation data from the environment by a plurality of agents, and making a decision by combining the experience data and the updated strategy network to obtain action data.
9. A divergence-based multi-agent cooperative learning apparatus, comprising a processor and a memory storing program instructions, the processor being configured to, upon execution of the program instructions, perform a divergence-based multi-agent cooperative learning method as claimed in any one of claims 1 to 7.
10. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement a divergence-based multi-agent cooperative learning method as claimed in any one of claims 1 to 7.
CN202110315995.0A 2021-03-24 2021-03-24 Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium Active CN113095498B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110315995.0A CN113095498B (en) 2021-03-24 2021-03-24 Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110315995.0A CN113095498B (en) 2021-03-24 2021-03-24 Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium

Publications (2)

Publication Number Publication Date
CN113095498A true CN113095498A (en) 2021-07-09
CN113095498B CN113095498B (en) 2022-11-18

Family

ID=76669465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110315995.0A Active CN113095498B (en) 2021-03-24 2021-03-24 Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium

Country Status (1)

Country Link
CN (1) CN113095498B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113391556A (en) * 2021-08-12 2021-09-14 中国科学院自动化研究所 Group distributed control method and device based on role distribution
CN113780577A (en) * 2021-09-07 2021-12-10 中国船舶重工集团公司第七0九研究所 Layered decision-making complete cooperation multi-agent reinforcement learning method and system
CN114418128A (en) * 2022-03-25 2022-04-29 新华三人工智能科技有限公司 Model deployment method and device
CN115494844A (en) * 2022-09-26 2022-12-20 成都朴为科技有限公司 Multi-robot searching method and system
CN115660110A (en) * 2022-12-26 2023-01-31 中国科学院自动化研究所 Multi-agent credit allocation method, device, readable storage medium and agent

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032863A1 (en) * 2016-07-27 2018-02-01 Google Inc. Training a policy neural network and a value neural network
US20200125957A1 (en) * 2018-10-17 2020-04-23 Peking University Multi-agent cooperation decision-making and training method
CN111514585A (en) * 2020-03-17 2020-08-11 清华大学 Method and system for controlling agent, computer device, and storage medium
CN111612126A (en) * 2020-04-18 2020-09-01 华为技术有限公司 Method and device for reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032863A1 (en) * 2016-07-27 2018-02-01 Google Inc. Training a policy neural network and a value neural network
US20200125957A1 (en) * 2018-10-17 2020-04-23 Peking University Multi-agent cooperation decision-making and training method
CN111514585A (en) * 2020-03-17 2020-08-11 清华大学 Method and system for controlling agent, computer device, and storage medium
CN111612126A (en) * 2020-04-18 2020-09-01 华为技术有限公司 Method and device for reinforcement learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DEMING YUAN et al.: "Distributed Mirror Descent for Online Composite Optimization", IEEE Transactions on Automatic Control *
JIECHUAN JIANG and ZONGQING LU: "Learning Fairness in Multi-Agent Systems", arXiv.org/abs/1910.14472 *
ZHANG-WEI HONG et al.: "A Deep Policy Inference Q-Network for Multi-Agent Systems", Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems *
SUN Changyin et al.: "Some Key Scientific Problems of Multi-Agent Deep Reinforcement Learning" (多智能体深度强化学习的若干关键科学问题), Acta Automatica Sinica (自动化学报) *
QU Zhaowei et al.: "Distributed Signal Control Based on Multi-Agent Reinforcement Learning Considering Game Behavior" (考虑博弈的多智能体强化学习分布式信号控制), Journal of Transportation Systems Engineering and Information Technology (交通运输系统工程与信息) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113391556A (en) * 2021-08-12 2021-09-14 中国科学院自动化研究所 Group distributed control method and device based on role distribution
CN113391556B (en) * 2021-08-12 2021-12-07 中国科学院自动化研究所 Group distributed control method and device based on role distribution
CN113780577A (en) * 2021-09-07 2021-12-10 中国船舶重工集团公司第七0九研究所 Layered decision-making complete cooperation multi-agent reinforcement learning method and system
CN113780577B (en) * 2021-09-07 2023-09-05 中国船舶重工集团公司第七0九研究所 Hierarchical decision complete cooperation multi-agent reinforcement learning method and system
CN114418128A (en) * 2022-03-25 2022-04-29 新华三人工智能科技有限公司 Model deployment method and device
CN114418128B (en) * 2022-03-25 2022-07-29 新华三人工智能科技有限公司 Model deployment method and device
CN115494844A (en) * 2022-09-26 2022-12-20 成都朴为科技有限公司 Multi-robot searching method and system
CN115660110A (en) * 2022-12-26 2023-01-31 中国科学院自动化研究所 Multi-agent credit allocation method, device, readable storage medium and agent

Also Published As

Publication number Publication date
CN113095498B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
CN113095498B (en) Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium
Subramanian et al. Multi type mean field reinforcement learning
Hou et al. An evolutionary transfer reinforcement learning framework for multiagent systems
US20190220750A1 (en) Solution search processing apparatus and solution search processing method
CN113093727A (en) Robot map-free navigation method based on deep security reinforcement learning
CN111898770B (en) Multi-agent reinforcement learning method, electronic equipment and storage medium
Sinapov et al. Learning inter-task transferability in the absence of target task samples
WO2016107426A1 (en) Systems and methods to adaptively select execution modes
US20190354100A1 (en) Bayesian control methodology for the solution of graphical games with incomplete information
CN116560239B (en) Multi-agent reinforcement learning method, device and medium
CN116136945A (en) Unmanned aerial vehicle cluster countermeasure game simulation method based on anti-facts base line
CN116776929A (en) Multi-agent task decision method based on PF-MADDPG
Park et al. Predictable mdp abstraction for unsupervised model-based rl
Shen Bionic communication network and binary pigeon-inspired optimization for multiagent cooperative task allocation
US20150150011A1 (en) Self-splitting of workload in parallel computation
CN116088586A (en) Method for planning on-line tasks in unmanned aerial vehicle combat process
Voss et al. Playing a strategy game with knowledge-based reinforcement learning
CN113599832A (en) Adversary modeling method, apparatus, device and storage medium based on environment model
Kim et al. Disentangling successor features for coordination in multi-agent reinforcement learning
Xu et al. Cascade attribute learning network
CN116489193B (en) Combat network self-adaptive combination method, device, equipment and medium
Lu et al. Optimal Cost Constrained Adversarial Attacks for Multiple Agent Systems
Liu et al. Soft-actor-attention-critic based on unknown agent action prediction for multi-agent collaborative confrontation
Perepu et al. DSDF: Coordinated look-ahead strategy in multi-agent reinforcement learning with noisy agents
Jensen et al. Industrial policy for advanced ai: Compute pricing and the safety tax

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant