CN114037049A - Multi-agent reinforcement learning method based on value function reliability and related device - Google Patents

Multi-agent reinforcement learning method based on value function reliability and related device

Info

Publication number
CN114037049A
Authority
CN
China
Prior art keywords
agent
value function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111222876.7A
Other languages
Chinese (zh)
Inventor
李帅斌
崔金强
宋伟伟
孙涛
丁玉隆
尉越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202111222876.7A
Publication of CN114037049A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a multi-agent reinforcement learning method based on value function credibility and a related device. The method comprises: determining the potential energy function of each agent based on the observation of each agent; determining local value functions and a candidate global value function based on the potential energy functions; determining a plurality of credibility values based on the global environment information, the local value functions and the candidate global value function; determining a global value function based on the local value functions, the candidate global value function and the credibility values; and training the multiple agents with a centralized-training, distributed-execution mechanism based on the global value function and the potential energy functions of the agents. By determining a plurality of local value functions, the local environment information of each agent can be learned, and combining the local value functions into the global value function improves its accuracy, thereby accelerating the convergence of multi-agent reinforcement learning. Meanwhile, a credibility value is configured for each local value function when the global value function is calculated, which improves the learning efficiency and robustness of multi-agent reinforcement learning.

Description

Multi-agent reinforcement learning method based on value function reliability and related device
Technical Field
The application relates to the technical field of multi-agent reinforcement learning, in particular to a multi-agent reinforcement learning method based on value function credibility and a related device.
Background
The value-decomposition-based method is a multi-agent reinforcement learning approach in which a neural network is used to fit a global joint value function Q_tot, which is then decomposed into local potential energy functions Q_i, and a centralized-training, distributed-execution mechanism is used. In the training stage, all agents, together with the data each collects by interacting with the environment independently, are treated as one "agent" for training, i.e., centralized training is performed using Q_tot; in the execution stage, the agents do not influence one another and each takes actions according to its own potential energy function Q_i. However, the multi-agent reinforcement learning method based on value decomposition only combines the potential energy functions of individual local agents in a certain way to obtain the global joint value function, and ignores the importance of the local environment, i.e., the local value function, which affects the convergence speed of multi-agent reinforcement learning.
Thus, the prior art has yet to be improved and enhanced.
Disclosure of Invention
The technical problem to be solved by the present application is to provide a multi-agent reinforcement learning method and related apparatus based on the value function reliability, aiming at the deficiencies of the prior art.
In order to solve the above technical problem, a first aspect of the embodiments of the present application provides a multi-agent reinforcement learning method based on value function credibility, where the method includes:
acquiring an observed value of each intelligent agent, and determining a potential energy function corresponding to each intelligent agent based on the observed value corresponding to each intelligent agent;
determining a plurality of local value functions and a candidate global value function based on potential energy functions corresponding to all the intelligent agents, wherein each local value function in the plurality of local value functions is determined and obtained based on a part of potential energy functions;
acquiring global environment information corresponding to multiple intelligent agents, and determining credit values corresponding to local value functions and candidate global value functions respectively based on the global environment information, the local value functions and the candidate global value functions;
determining a global value function corresponding to the multi-agent based on each local value function, the candidate global value function and the credit values corresponding to each local value function and the candidate global value function;
and training the multiple agents by utilizing a centralized-training, distributed-execution mechanism based on the global value function and the potential energy functions corresponding to the respective agents.
The multi-agent reinforcement learning method based on the value function credibility is characterized in that potential energy functions corresponding to the local value functions are different.
The multi-agent reinforcement learning method based on value function credibility, wherein the determining of the plurality of local value functions based on the potential energy functions corresponding to the agents specifically comprises:
dividing the multi-agent into a plurality of agent groups, wherein at least one agent group in the agent groups comprises more than 1 agent;
for each intelligent agent group in a plurality of intelligent agent groups, determining a local value function corresponding to each intelligent agent based on the potential energy function corresponding to each intelligent agent in the intelligent agent group to obtain a plurality of local value functions.
The multi-agent reinforcement learning method based on the value function credibility, wherein the dividing the multi-agent into a plurality of agent groups specifically comprises:
acquiring the classes of agents corresponding to the agents in the multi-agent system, and detecting whether the classes of the agents corresponding to the agents are the same;
when the agent categories are all the same, randomly assigning the multiple agents to a plurality of agent groups;
when the agent categories are not all the same, dividing the multiple agents into a plurality of agent groups based on the agent categories.
The multi-agent reinforcement learning method based on the value function credibility is characterized in that the global environment information comprises global state information and training step length.
The multi-agent reinforcement learning method based on the value function credibility, wherein the determining the respective corresponding credit values of each local value function and the candidate global value function based on the global environment information, the plurality of local value functions and the candidate global value function specifically comprises:
inputting the global environment information, the plurality of local value functions and the candidate global value functions into a multilayer perceptron, and outputting candidate credit values corresponding to the local value functions and the candidate global value functions through the multilayer perceptron;
and inputting each candidate credit value into a softmax layer, and outputting the credit values corresponding to each local value function and each candidate global value function through the softmax layer.
The multi-agent reinforcement learning method based on the value function credibility is characterized in that the determining of the global value function corresponding to the multi-agent based on each local value function, the candidate global value function and the credit values corresponding to each local value function and the candidate global value function is specifically as follows:
using the credit value corresponding to each local value function and to the candidate global value function as the discount factor of that local value function or candidate global value function;
and summing the local value functions and the candidate global value function, weighted by the discount factors, to obtain the global value function corresponding to the multiple agents.
A second aspect of the embodiments of the present application provides a multi-agent reinforcement learning apparatus based on value function reliability, where the apparatus includes:
the acquisition module is used for acquiring the observed value of each intelligent agent and determining the potential energy function corresponding to each intelligent agent based on the observed value corresponding to each intelligent agent;
the system comprises a first determining module, a second determining module and a control module, wherein the first determining module is used for determining a plurality of local value functions and a candidate global value function based on potential energy functions corresponding to all agents respectively, and each local value function in the plurality of local value functions is determined and obtained based on a partial potential energy function;
the second determining module is used for acquiring global environment information corresponding to the multiple intelligent agents and determining credit values corresponding to the local value functions and the candidate global value functions based on the global environment information, the local value functions and the candidate global value functions;
the third determining module is used for determining the global value function corresponding to the multi-agent based on each local value function, the candidate global value function and the credit values corresponding to each local value function and each candidate global value function;
and the training module is used for training the multiple agents by utilizing a centralized-training, distributed-execution mechanism based on the global value function and the potential energy function corresponding to each agent.
The multi-agent reinforcement learning device based on the value function credibility is characterized in that potential energy functions corresponding to the local value functions in the plurality of local value functions are different from each other.
A third aspect of embodiments of the present application provides a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps in the multi-agent reinforcement learning method based on value function confidence as described in any of the above.
A fourth aspect of the embodiments of the present application provides a terminal device, including: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps of any of the above-described multi-agent reinforcement learning method based on value function confidence.
Advantageous effects: compared with the prior art, the present application acquires the observation of each agent and determines the potential energy function corresponding to each agent based on the observation corresponding to each agent; determines a plurality of local value functions and a candidate global value function based on the potential energy functions corresponding to the respective agents; acquires global environment information corresponding to the multiple agents, and determines the credit values corresponding to each local value function and the candidate global value function based on the global environment information, the local value functions and the candidate global value function; determines the global value function corresponding to the multiple agents based on each local value function, the candidate global value function and their corresponding credit values; and trains the multiple agents by utilizing a centralized-training, distributed-execution mechanism based on the global value function and the potential energy functions corresponding to the respective agents. By determining a plurality of local value functions, the local environment information of each agent can be learned, and combining the plurality of local value functions into the global value function improves the accuracy of the global value function, thereby accelerating the convergence of multi-agent reinforcement learning. Meanwhile, a credibility value is configured for each local value function when the global value function is calculated, and this credibility improves the learning efficiency and robustness of multi-agent reinforcement learning.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without any inventive work.
FIG. 1 is a flow chart of a multi-agent reinforcement learning method based on value function confidence level provided by the present application.
FIG. 2 is a schematic flow chart of a multi-agent reinforcement learning method based on value function confidence level provided by the present application.
Fig. 3 is a comparison graph of the QMIX algorithm and the method provided in this embodiment, in which the Q-value mixing network of the provided method employs the QMIX framework.
Fig. 4 is a comparison graph of the VDN algorithm and the method provided in this embodiment, in which the Q-value mixing network of the provided method employs the VDN framework.
FIG. 5 is a graph showing Q value analysis.
Fig. 6 is a schematic structural diagram of a multi-agent reinforcement learning device based on the confidence level of a value function according to the present application.
Fig. 7 is a schematic structural diagram of a terminal device provided in the present application.
Detailed Description
The present application provides a multi-agent reinforcement learning method based on value function reliability and a related device, and in order to make the purpose, technical scheme and effect of the present application clearer and clearer, the present application is further described in detail below by referring to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be understood that the sequence numbers of the steps in this embodiment do not imply an execution order; the execution order of each process is determined by its function and internal logic, and should not constitute any limitation on the implementation process of this embodiment.
The inventor has found that the value-decomposition-based method is a multi-agent reinforcement learning approach in which a neural network is used to fit a global joint value function Q_tot, which is then decomposed into local potential energy functions Q_i, and a centralized-training, distributed-execution mechanism is used. In the training stage, all agents, together with the data each collects by interacting with the environment independently, are treated as one "agent" for training, i.e., centralized training is performed using Q_tot; in the execution stage, the agents do not influence one another and each takes actions according to its own potential energy function Q_i. Consider a partially observable Markov decision process G = <n; S, P, R; Ω, A>, where n is the number of agents, S is a finite set of states, P is the transition probability matrix, R is the reward function, Ω is the joint observation matrix, and A is the joint action matrix. Under such a setting, the global state is s; each agent takes its own action a_i according to its Q_i based on its own observation, the joint action A brings the global state to the next state s', and the environment returns the reward r for the joint action. The goal is to learn a joint policy that maximizes the long-term return, i.e., the expected discounted cumulative reward E[Σ_t γ^t · r_t].
However, the multi-agent reinforcement learning method based on value decomposition only combines the potential energy functions of individual local agents in a certain way to obtain the global joint value function, and ignores the importance of the local environment, i.e., the local value function, which affects the convergence speed of multi-agent reinforcement learning.
In order to solve the above problem, in the embodiments of the present application, the observation of each agent is acquired, and the potential energy function corresponding to each agent is determined based on the observation corresponding to each agent; a plurality of local value functions and a candidate global value function are determined based on the potential energy functions corresponding to the respective agents; global environment information corresponding to the multiple agents is acquired, and the credit values corresponding to each local value function and the candidate global value function are determined based on the global environment information, the local value functions and the candidate global value function; the global value function corresponding to the multiple agents is determined based on each local value function, the candidate global value function and their corresponding credit values; and the multiple agents are trained by utilizing a centralized-training, distributed-execution mechanism based on the global value function and the potential energy functions corresponding to the respective agents. By determining a plurality of local value functions, the local environment information of each agent can be learned, and combining the plurality of local value functions into the global value function improves the accuracy of the global value function, thereby accelerating the convergence of multi-agent reinforcement learning. Meanwhile, a credibility value is configured for each local value function when the global value function is calculated, and this credibility improves the learning efficiency and robustness of multi-agent reinforcement learning.
The following further describes the content of the application by describing the embodiments with reference to the attached drawings.
The embodiment provides a multi-agent reinforcement learning method based on value function credibility, as shown in fig. 1, the method includes:
and S10, acquiring the observed value of each intelligent agent, and determining the potential energy function corresponding to each intelligent agent based on the observed value corresponding to each intelligent agent.
Specifically, each agent is an agent in a multi-agent system, and the agents in the multi-agent system cooperate to complete a task; for example, the multi-agent system may be applied to the StarCraft game. Each agent in the multi-agent system can observe local state information, so the observation of each agent can be obtained, where the observation is the local observation made by that agent. For example, when the multi-agent system is applied to the StarCraft game, a game scenario can be started through the API, and the observation of each agent at the beginning of the game can be obtained.
After the observations of the agents are obtained, the potential energy function of each agent can be learned from its observation. The potential energy function can be learned with a single-agent reinforcement learning algorithm, for example the DRQN algorithm. In a specific implementation, the potential energy function can be determined based on a single-agent learning network: after the observation is obtained, the observation is input into the single-agent learning network, and the potential energy function Q_i corresponding to the agent is output by the single-agent learning network, where the potential energy function Q_i can be expressed as:
Q_i(τ_i, a_i) = f(o_i)
where f denotes the network function of the single-agent learning network, o_i denotes the observation of agent i, and Q_i(τ_i, a_i) denotes the potential energy function of agent i for its current trajectory τ_i and action a_i, computed from the observation of agent i using the network function of the single-agent learning network.
In one specific implementation, as shown in Fig. 2(b), the single-agent learning network Agent_i may include a first multi-layer perceptron (MLP), a GRU network, and a second multi-layer perceptron (MLP). The observation o_i^t of agent i is passed through the first multi-layer perceptron and input into the GRU network. The GRU network stores the hidden-layer information h_i^{t-1} of the previous time step and splices it with the observation encoded by the first multi-layer perceptron to generate the hidden-layer information h_i^t of the next time step, which serves both as the recurrent input of the GRU network and as the input of the second multi-layer perceptron; the second multi-layer perceptron then outputs the potential energy function Q_i(τ_i, ·).
In addition, in practical applications, each agent may correspond to its own single-agent learning network, with each agent determining its potential energy function through its own network; alternatively, a single shared single-agent learning network may be preset, the observation corresponding to each agent is input into it in turn, and the potential energy function corresponding to each agent is determined through this shared network.
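The following is a minimal, non-limiting sketch of the MLP-GRU-MLP agent network described above, written in PyTorch. The class name AgentNetwork and the dimension parameters are illustrative assumptions, not names taken from the patent.

```python
import torch
import torch.nn as nn

class AgentNetwork(nn.Module):
    """Per-agent network: first MLP -> GRU cell -> second MLP, producing Q_i(tau_i, a) per action."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)       # first multi-layer perceptron
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)   # carries the hidden state h_i across time steps
        self.fc2 = nn.Linear(hidden_dim, n_actions)     # second multi-layer perceptron

    def init_hidden(self, batch_size: int = 1) -> torch.Tensor:
        # Zero hidden state for the first time step.
        return self.fc1.weight.new_zeros(batch_size, self.rnn.hidden_size)

    def forward(self, obs: torch.Tensor, h_prev: torch.Tensor):
        # obs: (batch, obs_dim); h_prev: (batch, hidden_dim) hidden state from the previous step.
        x = torch.relu(self.fc1(obs))
        h = self.rnn(x, h_prev)       # splice encoded observation with the previous hidden state
        q_values = self.fc2(h)        # potential energy values Q_i(tau_i, a) for every action
        return q_values, h
```

During distributed execution each agent would select its action from q_values (for example greedily) while carrying h forward between time steps.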
And S20, determining a plurality of local value functions and candidate global value functions based on the potential energy functions corresponding to the agents respectively.
Specifically, each of the plurality of local value functions is determined based on part of the potential energy functions, and the candidate global value function is determined based on all of the potential energy functions. It can be understood that after the potential energy functions corresponding to the respective agents are obtained, a candidate global value function may be calculated based on all of the potential energy functions, for example by directly adding the potential energy functions, or by a weighted summation whose weights are produced, for example, by a preset hypernetwork or an attention mechanism. In this embodiment, by determining the local value functions, the advantages of the local potential energy functions can be learned first, so that better local value functions are learned before the global value function; as a result, a more accurate global value function can be learned, the learning process is easier, and convergence is faster.
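As a concrete illustration of the simplest combination mentioned above (direct addition of the potential energy functions, as in VDN), a candidate global value could be formed as follows; the function name and tensor shapes are assumptions made for this sketch.

```python
import torch

def vdn_candidate_global(chosen_agent_qs: torch.Tensor) -> torch.Tensor:
    """Candidate global value by direct addition (VDN-style).

    chosen_agent_qs: (batch, n_agents) values Q_i(tau_i, a_i) of the actions the agents took.
    Returns a (batch, 1) candidate global value.
    """
    return chosen_agent_qs.sum(dim=1, keepdim=True)
```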
In an implementation manner of this embodiment, the determining the plurality of local value functions based on the potential energy functions corresponding to the respective agents specifically includes:
dividing a plurality of agents into a plurality of agent groups;
for each intelligent agent group in a plurality of intelligent agent groups, determining a local value function corresponding to each intelligent agent based on the potential energy function corresponding to each intelligent agent in the intelligent agent group to obtain a plurality of local value functions.
Specifically, each agent group in the plurality of agent groups comprises at least one agent, and the number of agents comprising at least one agent group is greater than 1. It is understood that the existence of at least one local value function among the plurality of local value functions is determined based on a plurality of potential energy functions, wherein the plurality of agents means greater than or equal to 2 agents. In one specific implementation, each of the plurality of agent groups includes agents that are different from each other, that is, each agent can only be divided into one agent group when the plurality of agents are divided into the plurality of agent groups. For example, a multi-agent includes agent a, agent B, agent c, and agent d, and may be divided into agent group a and agent group B, where agent group a includes agent a and agent B, and agent group B includes agent c and agent d.
After the agent groups are determined, as shown in Fig. 2, the local value function corresponding to each agent group may be determined through a Q-value mixing network (Q_value Mixing Network). The Q-value mixing network employs a network framework based on a value decomposition algorithm, for example VDN or QMIX. In one specific implementation, as shown in Fig. 2(c), the Q-value mixing network may include a hypernetwork and a restricted multi-layer perceptron: the potential energy functions corresponding to the agents in the group and the global state information are input into the hypernetwork, the hypernetwork outputs the input items (weights) of the restricted multi-layer perceptron, these input items are input into the restricted multi-layer perceptron, and the restricted multi-layer perceptron outputs the local value function corresponding to the agent group. Furthermore, it should be noted that the candidate global value function can also be determined through a Q-value mixing network; when the candidate global value function is determined through the Q-value mixing network, the potential energy functions corresponding to all agents and the global state information are input into it.
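A minimal sketch of such a Q-value mixing network, following the QMIX-style design in which a hypernetwork conditioned on the global state produces non-negative weights for a restricted (monotonic) multi-layer perceptron; the class name and dimension parameters are assumptions for illustration, not the patent's own implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QValueMixingNetwork(nn.Module):
    """Mixes potential energy values into a local value (for one agent group) or a
    candidate global value (for all agents), conditioned on the global state."""

    def __init__(self, n_inputs: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.embed_dim = embed_dim
        # Hypernetworks: map the global state to the weights and biases of the restricted MLP.
        self.hyper_w1 = nn.Linear(state_dim, n_inputs * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_inputs) potential energy values; state: (batch, state_dim).
        batch = agent_qs.size(0)
        qs = agent_qs.view(batch, 1, -1)
        w1 = torch.abs(self.hyper_w1(state)).view(batch, -1, self.embed_dim)  # non-negative weights
        b1 = self.hyper_b1(state).view(batch, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(qs, w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(batch, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(batch, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(batch, 1)  # mixed value for the group
```

One such network per agent group would produce the local value functions, and a further instance over all agents would produce the candidate global value function, matching the description of Fig. 2(c).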
In one implementation of this embodiment, the dividing the multi-agent into agent groups includes:
acquiring the classes of agents corresponding to the agents in the multi-agent system, and detecting whether the classes of the agents corresponding to the agents are the same;
when the agent categories are all the same, randomly assigning the multiple agents to a plurality of agent groups;
when the agent categories are not all the same, dividing the multiple agents into a plurality of agent groups based on the agent categories.
Specifically, the multi-agent system may be applied in different environments, and the categories of the agents in the multi-agent system may differ: for example, in environment A the agent categories are partly the same and partly different, while in environment B the agent categories are all the same. When determining the local value functions, at least one agent group must include at least two agents, so when the categories of all agents in the multi-agent system are the same, the agent groups cannot be divided based on agent categories. Therefore, after the categories of the agents are obtained, it can first be detected whether the categories are completely the same; when they are completely the same, the agent groups can be divided by random assignment, with at least one of the resulting agent groups including at least two agents; of course, in practical applications every agent group may be made to include at least two agents, so as to increase the local environment information learned by each local value function. Furthermore, when the agent categories are not all the same, the multiple agents may be divided into agent groups directly based on the agent categories. Of course, when the agent categories are all different from one another, the agents cannot be divided according to category; in this case they may be divided according to a preset division mode, for example numbering the agents and then assigning every preset number of agents to one group according to the numbers.
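A hedged sketch of the grouping rule just described: split by agent category when the categories differ, otherwise assign agents to groups at random. The function name and the group_size parameter are assumptions; the patent does not fix a particular group size.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence

def divide_into_groups(agent_ids: Sequence[Hashable],
                       categories: Dict[Hashable, Hashable],
                       group_size: int = 2) -> List[List[Hashable]]:
    """Divide agents into groups: by category when the categories differ, otherwise randomly."""
    if len(set(categories[a] for a in agent_ids)) > 1:
        # Categories are not all the same: one group per agent category.
        by_category = defaultdict(list)
        for a in agent_ids:
            by_category[categories[a]].append(a)
        return list(by_category.values())
    # Categories are all the same: random assignment into groups of group_size agents
    # (the last group may be smaller if the number of agents is not divisible).
    shuffled = list(agent_ids)
    random.shuffle(shuffled)
    return [shuffled[i:i + group_size] for i in range(0, len(shuffled), group_size)]
```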
S30, obtaining global environment information corresponding to the multiple agents, and determining credit values corresponding to the local value functions and the candidate global value functions based on the global environment information, the local value functions and the candidate global value functions.
Specifically, the global environment information includes global state information and the training step. The global state information and the training step are influencing factors of the global value function; when determining a reputation value, combining the global state information and the training step with the local value functions makes the credibility assigned to each local value function more accurate, which in turn improves the accuracy of the global value function calculated based on the reputation values, the local value functions and the candidate global value function. In addition, the reputation value of a local value function reflects how much that local value function is trusted when the global value function is formed, and the reputation value of the candidate global value function reflects how much the candidate global value function is trusted when the global value function is formed: the larger the reputation value, the higher the credibility; conversely, the smaller the reputation value, the lower the credibility.
In an implementation manner of this embodiment, as shown in fig. 2 (b), after a plurality of local value functions and candidate global value functions are obtained, reputation values corresponding to the local value functions and the candidate global value functions may be determined first through a reputation value Network (CoQ Network), and then the global value functions are determined based on the determined reputation values, the local value functions, and the candidate global value functions. The reputation value network comprises a multilayer perceptron, a softmax layer and a fusion layer in sequence, wherein the multilayer perceptron and the softmax layer are used for determining reputation values corresponding to each local value function and each candidate global value function, and the fusion layer is used for determining the global value function based on the reputation values, the local value functions and the candidate global value functions obtained through determination.
Based on this, the determining the reputation value corresponding to each of the local value function and the candidate global value function based on the global environment information, the plurality of local value functions, and the candidate global value function specifically includes:
inputting the global environment information, the plurality of local value functions and the candidate global value functions into a multilayer perceptron, and outputting candidate credit values corresponding to the local value functions and the candidate global value functions through the multilayer perceptron;
and inputting each candidate credit value into a softmax layer, and outputting the credit values corresponding to each local value function and each candidate global value function through the softmax layer.
In particular, each reputation value is a numerical value between 0-1, and the sum of all reputation values is equal to 1. That is, after the candidate reputation values corresponding to the local value functions and the candidate global value functions are output through the multi-layer perceptron, the candidate reputation values are subjected to activation processing through the softmax layer, so that the reputation values output through the softmax layer are all between 0 and 1, and the sum of all reputation values is equal to 1. In this embodiment, the calculation formula of the reputation value can be expressed as:
CoQ_i = Softmax(f(Q_i, s, n_steps))
where CoQ_i denotes the reputation value, Q_i denotes a value function in the set of value functions consisting of the local value functions and the candidate global value function, s denotes the global state information, n_steps denotes the training step, f denotes the multi-layer perceptron, and Softmax denotes the softmax function configured in the softmax layer.
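A minimal sketch of the reputation-value (CoQ) computation described above: the local values, the candidate global value, the global state and the training step are fed through a multi-layer perceptron, and a softmax layer normalizes the outputs to reputation values in (0, 1) that sum to 1. The class name CoQNetwork and the layer dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CoQNetwork(nn.Module):
    """Produces one reputation value per value function (local values plus the candidate global value)."""

    def __init__(self, n_value_functions: int, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Inputs: the value functions themselves, the global state, and the training step.
        self.mlp = nn.Sequential(
            nn.Linear(n_value_functions + state_dim + 1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_value_functions),
        )

    def forward(self, value_functions: torch.Tensor, state: torch.Tensor,
                n_steps: torch.Tensor) -> torch.Tensor:
        # value_functions: (batch, n_value_functions) = [Q_local_1, ..., Q_local_m, Q_global]
        # state: (batch, state_dim); n_steps: (batch, 1) current training step.
        x = torch.cat([value_functions, state, n_steps], dim=-1)
        candidate = self.mlp(x)                  # candidate reputation values
        return torch.softmax(candidate, dim=-1)  # reputation values in (0, 1), summing to 1
```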
And S40, determining the global value function corresponding to the multi-agent based on the local value functions, the candidate global value function and the reputation value corresponding to each of the local value functions and the candidate global value function.
Specifically, after the reputation values corresponding to each local value function and the candidate global value function are obtained, each reputation value may be used as a discount factor, and the local value functions and the candidate global value function may then be summed, weighted by these factors, to obtain the global value function. Based on this, the determining the global value function corresponding to the multiple agents based on each local value function, the candidate global value function and their corresponding reputation values specifically includes:
using the reputation value corresponding to each local value function and to the candidate global value function as the discount factor of that local value function or candidate global value function;
and summing the local value functions and the candidate global value function, weighted by the discount factors, to obtain the global value function corresponding to the multiple agents.
Specifically, the reputation value of each local value function is used as a discount factor and multiplied by that local value function to obtain an addition term, the reputation value of the candidate global value function is used as a discount factor and multiplied by the candidate global value function to obtain another addition term, and the addition terms are then added to obtain the global value function. Thus, the calculation formula of the global value function may be:
Q_tot = Σ_i CoQ_i · Q_i^local + CoQ · Q_global
where Q_tot denotes the global value function, CoQ_i denotes the reputation value of the i-th local value function, Q_i^local denotes the i-th local value function, CoQ denotes the reputation value of the candidate global value function, Q_global denotes the candidate global value function, and n is the number of agents.
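To make the fusion step concrete, the following small worked example (with made-up numbers) weights each value function by its reputation value and sums the terms, as in the formula above.

```python
import torch

# Two local values and one candidate global value (illustrative numbers only).
values = torch.tensor([[2.0, 1.5, 3.0]])        # [Q_local_1, Q_local_2, Q_global]
reputations = torch.tensor([[0.3, 0.2, 0.5]])   # CoQ values from the softmax layer, summing to 1

q_tot = (reputations * values).sum(dim=-1, keepdim=True)
# q_tot = 0.3*2.0 + 0.2*1.5 + 0.5*3.0 = 2.4
```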
And S50, training the multiple agents by utilizing a centralized-training, distributed-execution mechanism based on the global value function and the potential energy functions corresponding to the respective agents.
Specifically, after the global value function and the potential energy functions corresponding to the respective agents are obtained, centralized training is performed using the global value function, distributed execution is performed using the potential energy functions, and steps S10-S50 are repeated, so that the task is completed in the multi-agent cooperative environment.
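The following is a hedged sketch of how this centralized-training, distributed-execution loop could look: each agent selects actions from its own potential energy function, while the TD loss is computed on the fused global value. The helpers compute_q_tot and target_q_tot (which would internally run the agent networks, the Q-value mixing networks and the CoQ fusion described above), the batch dictionary keys, and the epsilon-greedy policy are all assumptions standing in for details the patent does not spell out.

```python
import torch
import torch.nn.functional as F

def select_action(agent_net, obs, h_prev, epsilon: float):
    """Distributed execution: each agent acts only on its own potential energy function."""
    q_values, h = agent_net(obs, h_prev)
    if torch.rand(1).item() < epsilon:
        action = torch.randint(q_values.size(-1), (1,)).item()   # explore
    else:
        action = q_values.argmax(dim=-1).item()                   # greedy on Q_i
    return action, h

def training_step(batch, compute_q_tot, target_q_tot, optimizer, gamma: float = 0.99):
    """Centralized training: one TD update on the fused global value Q_tot (assumed helpers)."""
    q_tot = compute_q_tot(batch["obs"], batch["state"], batch["actions"])
    with torch.no_grad():
        # target_q_tot is assumed to evaluate the target networks on the next step.
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * target_q_tot(
            batch["next_obs"], batch["next_state"])
    loss = F.mse_loss(q_tot, target)   # TD error on the global value function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```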
In addition, to further illustrate the effect of the multi-agent reinforcement learning method based on value function credibility provided by this embodiment, experiments were performed in six application scenarios of the StarCraft game, giving the experimental results shown in Fig. 3 and Fig. 4. Fig. 3 is a comparison of the QMIX algorithm and the method provided by this embodiment, where the Q-value mixing network in the provided method adopts the QMIX framework; Fig. 4 is a comparison of the VDN algorithm and the method provided by this embodiment, where the Q-value mixing network in the provided method adopts the VDN framework. As can be seen from Fig. 3 and Fig. 4, the convergence speed of the method provided by this embodiment is faster than that of the existing training methods in each application scenario. In addition, Fig. 5 is a schematic diagram of the Q-value analysis; as can be seen from Fig. 5, the introduction of the explicit local values in the method provided by this embodiment makes the global Q value approach the true Q value.
In summary, the present embodiment provides a multi-agent reinforcement learning method based on value function reliability. The method comprises acquiring the observation of each agent, and determining the potential energy function corresponding to each agent based on the observation corresponding to each agent; determining a plurality of local value functions and a candidate global value function based on the potential energy functions corresponding to the respective agents; acquiring global environment information corresponding to the multiple agents, and determining the credit values corresponding to each local value function and the candidate global value function based on the global environment information, the local value functions and the candidate global value function; determining the global value function corresponding to the multiple agents based on each local value function, the candidate global value function and their corresponding credit values; and training the multiple agents by utilizing a centralized-training, distributed-execution mechanism based on the global value function and the potential energy functions corresponding to the respective agents. By determining a plurality of local value functions, the local environment information of each agent can be learned, and combining the plurality of local value functions into the global value function improves the accuracy of the global value function, thereby accelerating the convergence of multi-agent reinforcement learning. Meanwhile, a credibility value is configured for each local value function when the global value function is calculated, and this credibility improves the learning efficiency and robustness of multi-agent reinforcement learning.
Based on the above multi-agent reinforcement learning method based on value function reliability, this embodiment provides a multi-agent reinforcement learning apparatus based on value function reliability, as shown in fig. 6, the apparatus includes:
an obtaining module 100, configured to obtain an observed value of each intelligent agent, and determine a potential energy function corresponding to each intelligent agent based on the observed value corresponding to each intelligent agent;
a first determining module 200, configured to determine a plurality of local value functions and a candidate global value function based on potential energy functions corresponding to the respective agents, where each of the plurality of local value functions is determined based on a partial potential energy function;
a second determining module 300, configured to obtain global environment information corresponding to multiple agents, and determine reputation values corresponding to each local value function and each candidate global value function based on the global environment information, the multiple local value functions, and the candidate global value function;
a third determining module 400, configured to determine a global value function corresponding to the multi-agent based on each local value function, the candidate global value function, and the reputation values corresponding to each local value function and each candidate global value function;
a training module 500, configured to train the multiple agents by using a centralized-training, distributed-execution mechanism based on the global value function and the potential energy functions corresponding to the agents.
Based on the above-mentioned multi-agent reinforcement learning method based on value function reliability, the present embodiment provides a computer-readable storage medium storing one or more programs, which can be executed by one or more processors, to implement the steps in the multi-agent reinforcement learning method based on value function reliability as described in the above-mentioned embodiment.
Based on the above multi-agent reinforcement learning method based on value function credibility, the present application further provides a terminal device, as shown in fig. 7, which includes at least one processor (processor) 20; a display screen 21; and a memory (memory)22, and may further include a communication Interface (Communications Interface)23 and a bus 24. The processor 20, the display 21, the memory 22 and the communication interface 23 can communicate with each other through the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may call logic instructions in the memory 22 to perform the methods in the embodiments described above.
Furthermore, the logic instructions in the memory 22 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 22, which is a computer-readable storage medium, may be configured to store a software program, a computer-executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes the functional application and data processing, i.e. implements the method in the above-described embodiments, by executing the software program, instructions or modules stored in the memory 22.
The memory 22 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the terminal device, and the like. Further, the memory 22 may include a high-speed random access memory and may also include a non-volatile memory. For example, the memory may be any of a variety of media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, or it may be a transitory storage medium.
In addition, the specific processes loaded and executed by the storage medium and by the instruction processors in the terminal device have been described in detail in the method above and will not be repeated here.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (11)

1. A multi-agent reinforcement learning method based on value function confidence, the method comprising:
acquiring an observed value of each intelligent agent, and determining a potential energy function corresponding to each intelligent agent based on the observed value corresponding to each intelligent agent;
determining a plurality of local value functions and a candidate global value function based on potential energy functions corresponding to all the intelligent agents, wherein each local value function in the plurality of local value functions is determined and obtained based on a part of potential energy functions;
acquiring global environment information corresponding to multiple intelligent agents, and determining credit values corresponding to local value functions and candidate global value functions respectively based on the global environment information, the local value functions and the candidate global value functions;
determining a global value function corresponding to the multi-agent based on each local value function, the candidate global value function and the credit values corresponding to each local value function and the candidate global value function;
and training the multiple agents by utilizing a centralized-training, distributed-execution mechanism based on the global value function and the potential energy functions corresponding to the respective agents.
2. A value function reliability-based multi-agent reinforcement learning method as claimed in claim 1 wherein each local value function of said plurality of local value functions has a different potential energy function.
3. A multi-agent reinforcement learning method based on value function credibility as claimed in claim 1 or 2, wherein the determining several local value functions based on the potential energy function corresponding to each agent specifically comprises:
dividing the multi-agent into a plurality of agent groups, wherein at least one agent group in the agent groups comprises more than 1 agent;
for each intelligent agent group in a plurality of intelligent agent groups, determining a local value function corresponding to each intelligent agent based on the potential energy function corresponding to each intelligent agent in the intelligent agent group to obtain a plurality of local value functions.
4. A method for multi-agent reinforcement learning based on value function confidence as claimed in claim 3, wherein the dividing of multi-agents into agent groups includes:
acquiring the classes of agents corresponding to the agents in the multi-agent system, and detecting whether the classes of the agents corresponding to the agents are the same;
when all the agents are the same, randomly distributing a plurality of agents into a plurality of agent groups;
when not all are the same, the multiple agents are divided into agent groups based on agent categories.
5. A value function reliability-based multi-agent reinforcement learning method as claimed in claim 1 wherein the global environment information comprises global state information and training step size.
6. A multi-agent reinforcement learning method based on value function credibility as claimed in claim 1 or 5, wherein the determining reputation value corresponding to each local value function and candidate global value function based on the global environment information, the plurality of local value functions and candidate global value functions specifically comprises:
inputting the global environment information, the plurality of local value functions and the candidate global value functions into a multilayer perceptron, and outputting candidate credit values corresponding to the local value functions and the candidate global value functions through the multilayer perceptron;
and inputting each candidate credit value into a softmax layer, and outputting the credit values corresponding to each local value function and each candidate global value function through the softmax layer.
7. A multi-agent reinforcement learning method based on value function reliability as claimed in claim 1, wherein the determining of the global value function corresponding to the multi-agent based on the respective reputation values corresponding to each of the local value functions, the candidate global value function and the local value functions and the candidate global value function is specifically:
using the credit value corresponding to each local value function and to the candidate global value function as the discount factor of that local value function or candidate global value function;
and summing the local value functions and the candidate global value function, weighted by the discount factors, to obtain the global value function corresponding to the multiple agents.
8. A multi-agent reinforcement learning apparatus based on value function confidence, said apparatus comprising:
the acquisition module is used for acquiring the observed value of each intelligent agent and determining the potential energy function corresponding to each intelligent agent based on the observed value corresponding to each intelligent agent;
the system comprises a first determining module, a second determining module and a control module, wherein the first determining module is used for determining a plurality of local value functions and a candidate global value function based on potential energy functions corresponding to all agents respectively, and each local value function in the plurality of local value functions is determined and obtained based on a partial potential energy function;
the second determining module is used for acquiring global environment information corresponding to the multiple intelligent agents and determining credit values corresponding to the local value functions and the candidate global value functions based on the global environment information, the local value functions and the candidate global value functions;
the third determining module is used for determining the global value function corresponding to the multi-agent based on each local value function, the candidate global value function and the credit values corresponding to each local value function and each candidate global value function;
and the training module is used for training the multiple agents by utilizing a centralized-training, distributed-execution mechanism based on the global value function and the potential energy function corresponding to each agent.
9. A value function reliability-based multi-agent reinforcement learning device as claimed in claim 8 wherein each local value function of said plurality of local value functions has a different potential energy function.
10. A computer-readable storage medium, storing one or more programs, which are executable by one or more processors, for performing the steps in the method for multi-agent reinforcement learning based on trustworthiness of value functions as claimed in any one of claims 1-7.
11. A terminal device, comprising: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the method for multi-agent reinforcement learning based on value function confidence of any of claims 1-7.
CN202111222876.7A 2021-10-20 2021-10-20 Multi-agent reinforcement learning method based on value function reliability and related device Pending CN114037049A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111222876.7A CN114037049A (en) 2021-10-20 2021-10-20 Multi-agent reinforcement learning method based on value function reliability and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111222876.7A CN114037049A (en) 2021-10-20 2021-10-20 Multi-agent reinforcement learning method based on value function reliability and related device

Publications (1)

Publication Number Publication Date
CN114037049A true CN114037049A (en) 2022-02-11

Family

ID=80141717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111222876.7A Pending CN114037049A (en) 2021-10-20 2021-10-20 Multi-agent reinforcement learning method based on value function reliability and related device

Country Status (1)

Country Link
CN (1) CN114037049A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114918919A (en) * 2022-05-25 2022-08-19 北京理工大学 Robot motor skill learning method and system
CN114918919B (en) * 2022-05-25 2023-11-28 北京理工大学 Robot motor skill learning method and system

Similar Documents

Publication Publication Date Title
US10891544B2 (en) Event-driven universal neural network circuit
CN110443284B (en) Artificial intelligence AI model training method, calling method, server and readable storage medium
CN112119409A (en) Neural network with relational memory
Song et al. Small universal spiking neural P systems with anti-spikes
CN106953862A (en) The cognitive method and device and sensor model training method and device of network safety situation
Zeng et al. Small universal simple spiking neural P systems with weights
CN112329948A (en) Multi-agent strategy prediction method and device
CN108171328B (en) Neural network processor and convolution operation method executed by same
CN113077052A (en) Reinforced learning method, device, equipment and medium for sparse reward environment
CN114139637A (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
CN111950633A (en) Neural network training method, neural network target detection method, neural network training device, neural network target detection device and storage medium
Li et al. Learning to interrupt: A hierarchical deep reinforcement learning framework for efficient exploration
CN114037049A (en) Multi-agent reinforcement learning method based on value function reliability and related device
Memmert et al. Complex problem solving through human-AI collaboration: literature review on research contexts
CN116702872A (en) Reinforced learning method and device based on offline pre-training state transition transducer model
CN109731338B (en) Artificial intelligence training method and device in game, storage medium and electronic device
CN113705402A (en) Video behavior prediction method, system, electronic device and storage medium
CN109697511B (en) Data reasoning method and device and computer equipment
CN116047902A (en) Method, device, equipment and storage medium for navigating robots in crowd
WO2022127603A1 (en) Model processing method and related device
CN113704496B (en) Medical knowledge graph restoration method and device, computer equipment and storage medium
Pan et al. Learning navigation policies for mobile robots in deep reinforcement learning with random network distillation
CN116361138A (en) Test method and test equipment
Smorodin et al. Software-Technological Complex For Adaptive Control Of A Production Cycle Of Robotic Manufacturing
Hamrick et al. Imagination-based decision making with physical models in deep neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination