CN113392935B - Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism - Google Patents

Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Info

Publication number
CN113392935B
CN113392935B (application CN202110777110.9A)
Authority
CN
China
Prior art keywords
agent
probability
reinforcement learning
rewarding
updating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110777110.9A
Other languages
Chinese (zh)
Other versions
CN113392935A (en)
Inventor
陈晋音
胡书隆
王雪柯
章燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110777110.9A priority Critical patent/CN113392935B/en
Publication of CN113392935A publication Critical patent/CN113392935A/en
Application granted granted Critical
Publication of CN113392935B publication Critical patent/CN113392935B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-agent deep reinforcement learning strategy optimization method based on an attention mechanism, which comprises the following steps: building a multi-agent reinforcement learning collaborative simulation scene, and training the multiple agents with a deep deterministic policy gradient algorithm; a personality generator predicts the probability distribution over the pictures observed by each agent with a probability classifier and trains the classifier so that it distinguishes the agents more accurately, so that the personality of each agent is gradually revealed; acquiring the feature information of the pictures observed by each agent at every time step, regularizing the reward discount factor, and updating the obtained reward discount factor into the reward function of the personality generator to obtain a newly set reward function; and updating the newly set reward function into the multi-agent reinforcement learning framework of the deep deterministic policy gradient algorithm to train the multiple agents until they converge.

Description

Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
Technical Field
The invention relates to the field of deep reinforcement learning defense, and in particular to a multi-agent deep reinforcement learning strategy optimization method based on an attention mechanism.
Background
Deep reinforcement learning has been one of the most closely watched directions in artificial intelligence in recent years; with its rapid development and application, reinforcement learning has been widely used in fields such as robot control, game playing, computer vision, and autonomous driving.
Deep reinforcement learning algorithms are mostly applied to single-agent scenes. In single-agent reinforcement learning the environment of the agent is stationary, but in multi-agent reinforcement learning (MARL) the environment is complex and dynamic: the action of each agent affects the action selection of the other agents, and MARL suffers from dimensional explosion, difficulty in determining the reward function, and a non-stationary environment, which brings great difficulty to the learning process. Meanwhile, objectives involving relationships such as cooperation and competition are tied to the difficulty of determining the reward; because the tasks of the agents in a multi-agent system may differ yet are mutually coupled and influence one another, the quality of the reward design directly affects the quality of the learned strategy.
Multi-agent reinforcement learning is widely applied to multi-agent cooperation scenes, but it is commonly observed that when agents are equivalent and share a global reward, they learn similar behaviors during joint training; however, learning similar behavior easily traps the learned strategy in a local optimum. Some studies deliberately pursue differences between agent strategies through diversity, yet such induced differences are not directly linked to task success. In contrast, the presence of personality and learned cooperation can automatically drive agents to take different actions and play different roles as needed to successfully complete a task.
In existing multi-agent deep reinforcement learning algorithms, agents may bias their choices towards actions that easily complete the target task, i.e., the agents tend to complete the easy tasks, so that few or no agents complete the complex tasks; eventually the whole multi-agent scene falls into a locally optimal situation and the overall global reward decreases. In practice, since reinforcement learning agents are mostly equivalent, agents should be allowed to develop personalities through the policy learning process of interacting with the environment; thus, in a multi-agent environment, agents should develop personalities from their own experience by exploring and interacting with the environment separately.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a multi-agent deep reinforcement learning strategy optimization method based on an attention mechanism, which enables each agent to fully express its individuality and better complete its own task, so that the global reward is optimized.
A multi-agent deep reinforcement learning strategy optimization method based on an attention mechanism, the method comprising the steps of:
building a multi-agent reinforcement learning collaborative simulation scene, and training the multiple agents with a deep deterministic policy gradient algorithm;
the personality generator predicts the probability distribution over the pictures observed by the agents with a probability classifier, and trains the classifier with a reward function carrying a reward discount factor, so that the classifier distinguishes the agents more accurately and the personality of each agent is gradually revealed;
acquiring the feature information of the pictures observed by the agents at each time step with an image attention mechanism and regularizing the reward discount factor: when the feature information is concentrated near the task corresponding to the agent, the agent is given a positive reward discount factor, and when the feature information is not near the task corresponding to the agent, the agent is given a negative reward discount factor;
updating the obtained reward discount factor into the reward function of the personality generator to obtain a newly set reward function; and updating the newly set reward function into the multi-agent reinforcement learning framework of the deep deterministic policy gradient algorithm to train the multiple agents until they converge; the overall training loop is sketched below.
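To make the interaction of these steps concrete, the following Python sketch outlines the overall loop under stated assumptions: the callables classify_prob, grad_cam_gamma and maddpg_update, and the agent objects with act and buffer members, are illustrative placeholders rather than components named by the invention.

```python
# High-level sketch of the optimization loop described in the steps above.
# All helper names are illustrative assumptions, not names used by the patent.

def optimize(env, agents, classify_prob, grad_cam_gamma, maddpg_update, episodes=1000):
    """agents: list of agent objects exposing .act(obs) and .buffer.add(...)."""
    for _ in range(episodes):
        obs_n = env.reset()
        done = False
        while not done:
            # 1. Every agent acts with its current policy plus exploration noise.
            act_n = [ag.act(o) for ag, o in zip(agents, obs_n)]
            next_obs_n, global_r, done, _ = env.step(act_n)

            for i, ag in enumerate(agents):
                # 2. Personality generator: intrinsic signal C(i | O_i).
                p_i = classify_prob(obs_n[i], i)
                # 3. Attention regularization: Grad-CAM fixes the sign and
                #    magnitude of the reward discount factor gamma = +/-(1 - lambda) * r.
                gamma = grad_cam_gamma(obs_n[i], i, global_r)
                # 4. Newly set reward R_i + gamma * C(i | O_i) goes back to MADDPG.
                ag.buffer.add(obs_n[i], act_n[i], global_r + gamma * p_i, next_obs_n[i])

            maddpg_update(agents)   # deep deterministic policy gradient update
            obs_n = next_obs_n
```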
The steps of training the multiple agents with the deep deterministic policy gradient algorithm are as follows:
initializing a random process N of action exploration to obtain an initial state x;
for each agent i, select an action according to the current policy and the exploration noise:
a_i = μ_{θ_i}(o_i) + N_t
where o_i denotes the observation of agent i at time t, N_t denotes the exploration noise at time t, θ_i denotes the parameters of the Actor network, and μ_{θ_i} denotes the mapping from the state space to the action space;
execute the joint action a = (a_1, a_2, ..., a_N) and observe the reward r and the next state x';
each Actor collects the transition (x, a, r, x') and stores it in the experience replay pool;
randomly sample a mini-batch of samples S = (x_j, a_j, r_j, x'_j) from the experience replay pool, where j denotes a time step;
set the reward function (the TD target):
y_j = r_i^j + γ Q_i^{μ'}(x'_j, a'_1, ..., a'_N) |_{a'_k = μ'_k(o_k)}
where Q_i^{μ'} denotes the Q-value function of agent i at the time step following j, and the subscript a'_k denotes the action obtained from the observation at the following time step k; r_i^j denotes the reward value of agent i at time j, γ is the reward discount factor, x'_j is the state following time j, and (a'_1, ..., a'_N) are the actions following a = (a_1, a_2, ..., a_N);
update the Critic network by minimizing the loss function:
L(θ_i) = E_{x,a,r,x'}[(Q_i^μ(x_j, a_1^j, ..., a_N^j) - y_j)^2]
where x_j is the state at time j, (a_1^j, ..., a_N^j) are the actions at time j, and Q_i^μ denotes the Q-value function of agent i at time j;
update the Actor network with the policy gradient computed from the sampled data:
∇_{θ_i} J ≈ E[∇_{θ_i} μ_i(o_i^j) ∇_{a_i} Q_i^μ(x_j, a_1^j, ..., a_N^j) |_{a_i = μ_i(o_i^j)}]
where o_i^j is the observation of agent i at time j, x_j is the state at time j, and a_i^j is the action at time j;
let θ'_i be the target network parameters of each agent i, updated as a soft update with parameter τ ∈ (0, 1):
θ'_i ← τ θ_i + (1 - τ) θ'_i
Meanwhile, during training, each of the multiple agents interacts with the environment to obtain experience data, from which the policy of each agent is obtained; a code sketch of these update steps follows.
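As an illustration of the update equations above, the following PyTorch sketch performs one MADDPG-style update for agent i; the network containers, the per-agent batch layout and the function name maddpg_update are assumptions made for the example, not the exact implementation of the invention.

```python
# One MADDPG-style update for agent i under the equations above (sketch only).
import torch
import torch.nn.functional as F

def maddpg_update(i, actors, critics, target_actors, target_critics,
                  batch, optim_actor, optim_critic, gamma=0.95, tau=0.01):
    obs, acts, rews, next_obs = batch          # per-agent lists of batched tensors
    # Target actions a'_k = mu'_k(o_k) for every agent k.
    next_acts = [ta(next_obs[k]) for k, ta in enumerate(target_actors)]

    # y_j = r_i^j + gamma * Q_i^{mu'}(x'_j, a'_1, ..., a'_N)
    with torch.no_grad():
        y = rews[i] + gamma * target_critics[i](torch.cat(next_obs, dim=1),
                                                torch.cat(next_acts, dim=1))

    # Critic update: minimize (Q_i^{mu}(x_j, a_1^j, ..., a_N^j) - y_j)^2.
    q = critics[i](torch.cat(obs, dim=1), torch.cat(acts, dim=1))
    critic_loss = F.mse_loss(q, y)
    optim_critic.zero_grad(); critic_loss.backward(); optim_critic.step()

    # Actor update: follow the gradient of Q_i w.r.t. its own action a_i = mu_i(o_i).
    acts_pred = list(acts)
    acts_pred[i] = actors[i](obs[i])
    actor_loss = -critics[i](torch.cat(obs, dim=1), torch.cat(acts_pred, dim=1)).mean()
    optim_actor.zero_grad(); actor_loss.backward(); optim_actor.step()

    # Soft update of the target networks: theta'_i <- tau*theta_i + (1 - tau)*theta'_i.
    for net, target in ((actors[i], target_actors[i]), (critics[i], target_critics[i])):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)
```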
The deep deterministic policy gradient algorithm is trained as follows:
each Actor network collects data and stores it in a buffer pool, and learning starts when the amount of data in the buffer pool exceeds a preset threshold, as illustrated by the buffer sketch below;
and updating the strategy parameters by using the Actor network, updating the action value parameters by using the Critic network, and updating the Critic network.
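A minimal sketch of the buffer-pool gate, assuming a simple deque-backed pool; the class name ReplayPool and its methods are illustrative only.

```python
# Minimal replay pool with a learning threshold; names are illustrative.
import random
from collections import deque

class ReplayPool:
    def __init__(self, capacity=100_000, learn_threshold=1_000):
        self.storage = deque(maxlen=capacity)
        self.learn_threshold = learn_threshold

    def add(self, obs, action, reward, next_obs):
        self.storage.append((obs, action, reward, next_obs))

    def ready(self):
        # Learning only starts once the pool holds more data than the preset threshold.
        return len(self.storage) > self.learn_threshold

    def sample(self, batch_size=64):
        return random.sample(self.storage, batch_size)
```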
Preferably, the training process of the personality generator is as follows:
perform small-batch random sampling from the buffer pool, and calculate the cross entropy with the probability classifier;
update the classifier neural network parameters by minimizing the cross entropy;
set a new reward function with the updated neural network parameters.
The probability classifier is expressed as:
P(i) = C(i | O_i)
where C(i | O_i) is the task classification probability obtained from the observation O_i of each agent i, and P(i) represents the prediction probability;
the calculation formula of the cross entropy is expressed as follows:
CE=-∑Z(i)log P(i)
wherein Z (i) is the true classification probability;
the update formula of the classified neural network parameters is expressed as follows:
Figure BDA0003155972800000041
wherein ,
Figure BDA0003155972800000042
classifying neural network parameters;
the new bonus function is expressed as follows:
Figure BDA0003155972800000043
wherein ,Ri And represents the prize value of i for the agent, and gamma is the prize discount factor.
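A sketch of this training procedure is given below: a small softmax classifier is fitted on (observation, agent index) pairs by minimizing the cross entropy, and the newly set reward R_i + γ·C(i | O_i) is assembled from its output. The flat-vector observation, the network sizes and the helper names are assumptions made for illustration; the invention itself classifies observed pictures.

```python
# Sketch of the personality generator's probability classifier and new reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbClassifier(nn.Module):
    """P(i) = C(i | O_i): predicts which agent an observation belongs to."""
    def __init__(self, obs_dim, n_agents):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_agents))

    def forward(self, obs):
        return F.softmax(self.net(obs), dim=-1)

def train_classifier(clf, optimizer, batch_obs, batch_agent_ids):
    # CE = -sum Z(i) log P(i), with Z the one-hot encoding of the true agent index.
    loss = F.cross_entropy(clf.net(batch_obs), batch_agent_ids)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def new_reward(global_reward, clf, obs_i, agent_id, gamma):
    # Newly set reward: R_i + gamma * C(i | O_i).
    with torch.no_grad():
        p = clf(obs_i.unsqueeze(0))[0, agent_id]
    return global_reward + gamma * p.item()
```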
Preferably, the steps of regularizing the reward discount factor are as follows:
use Grad-CAM to obtain the feature information of the observation O_i^t of agent i at time t:
calculating partial derivatives of probability p output by the last layer softmax of the probability classifier network on all pixels of the final layer feature map:
Figure BDA0003155972800000045
where i is the index of the agent, A is the feature map output by the last convolutional layer, k is the channel index of the feature map, and h and w are the indices along the height and width dimensions, respectively;
after the partial derivative with respect to each pixel of the feature map is obtained, take the global average over the width and height dimensions:
α_k^i = (1/Z) ∑_h ∑_w ∂p^i / ∂A_{hw}^k
where Z is the number of pixels in the feature map;
taking the sensitivity α_k^i of class i with respect to the k-th channel of the feature map output by the last convolutional layer, obtained in the previous step, as a weight, weight the last-layer feature maps, combine them linearly, and pass the result through a ReLU activation function:
L_{Grad-CAM}^i = ReLU(∑_k α_k^i A^k)
analyze the feature information of the Grad-CAM heat map; if the heat-map feature information observed by agent i_1 is concentrated near its corresponding task t_i, a positive reward is given to the agent:
γ = (1 - λ) r
where λ is the normalized distance between the agent and its corresponding task, r represents the immediate reward obtained by agent i for completing the task, and the reward discount factor γ is positive;
if the feature information observed by agent i_1 is not near its corresponding task t_i, or the observed feature information relates to subtasks corresponding to other agents, a negative reward is given to the agent:
γ = -(1 - λ) r
that is, the closer the agent is to a task that is not its own, the larger the negative reward, and the reward discount factor γ is negative. A sketch of this Grad-CAM computation is given below.
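The Grad-CAM weighting and the sign rule for γ can be sketched as follows, assuming the last convolutional feature map and the class probability are available from the classifier's forward pass; the mask-based test of whether the heat map falls on the agent's own task region, and its 0.5 threshold, are illustrative choices rather than details fixed by the invention.

```python
# Sketch of Grad-CAM-based regularization of the reward discount factor gamma.
import torch
import torch.nn.functional as F

def grad_cam_map(conv_features, class_prob):
    """conv_features: last-conv feature map A with shape (C, H, W), part of the
    autograd graph; class_prob: scalar softmax probability p^i for agent i."""
    grads = torch.autograd.grad(class_prob, conv_features, retain_graph=True)[0]
    alpha = grads.mean(dim=(1, 2))                       # alpha_k^i: average over H, W
    cam = F.relu((alpha[:, None, None] * conv_features).sum(dim=0))
    return cam                                           # L^i_Grad-CAM, shape (H, W)

def reward_discount_factor(cam, own_task_mask, lam, r, threshold=0.5):
    """gamma = (1 - lam) * r if the heat-map mass lies on the agent's own task
    region, else -(1 - lam) * r; the mass threshold is an illustrative choice."""
    on_task = (cam * own_task_mask).sum() / (cam.sum() + 1e-8)
    return (1 - lam) * r if on_task > threshold else -(1 - lam) * r
```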
Compared with the prior art, the invention has the following advantages:
(1) With the personality generator, the personality of each agent can be reflected during training; the classification probability in the personality generator is regularized with a reward function whose weights are obtained from the attention mechanism; the set reward function is modified according to the different tasks of the agents, so that each agent corresponds to a different reward function; and the newly set reward function fits the deep deterministic policy gradient algorithm, so that the training strategy is optimized and the global reward reaches the optimum faster.
Drawings
FIG. 1 is a schematic flow chart of a multi-agent deep reinforcement learning strategy optimization method based on an attention mechanism;
FIG. 2 is a schematic diagram of training multiple agents with the deep deterministic policy gradient algorithm provided by the present invention;
FIG. 3 is a flow chart of the reward function setup provided by the invention.
Detailed Description
The invention will be further described with reference to the drawings and specific examples.
In the multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism, a personality generator first learns a probability classifier; the classifier predicts a probability distribution over the agents given an observation, and each agent receives the probability of being correctly predicted by the classifier as an intrinsic reward. Encouraged by the intrinsic reward, agents tend to visit observations they are familiar with; learning the probability classifier from such observations makes the intrinsic reward signal stronger, which in turn makes the agents more identifiable. Since observations visited by different agents cannot be easily distinguished by the classifier in the early learning stage, the intrinsic reward signal alone is insufficient to induce agent characteristics; therefore, a regularization based on an attention mechanism is adopted when learning the classifier to increase discrimination and strengthen the feedback, thereby promoting the emergence of personality.
Fig. 1 is a flow chart of the multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism of this embodiment; the method can be used in a game scene to train the scene to reach a globally optimal state.
As shown in fig. 1-3, the multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism comprises the following steps:
(1) Multi-agent centralized training process
(1.1) building a multi-agent reinforcement learning collaborative simulation scene;
(1.2) training the multiple agents based on a multi-agent reinforcement learning framework of the deep deterministic policy gradient algorithm;
(1.2.1) initializing a random process N of action exploration to obtain an initial state x;
(1.2.2) for each agent i, select an action according to the current policy and the exploration noise:
a_i = μ_{θ_i}(o_i) + N_t
where o_i denotes the observation of agent i at time t, N_t denotes the exploration noise at time t, θ_i denotes the parameters of the Actor network, and μ_{θ_i} denotes the mapping from the state space to the action space;
(1.2.3) execute the joint action a = (a_1, a_2, ..., a_N) and observe the reward r and the next state x';
(1.2.4) each Actor collects the transition (x, a, r, x') and stores it in the experience replay pool;
(1.2.5) randomly sample a mini-batch of samples S = (x_j, a_j, r_j, x'_j) from the experience replay pool, where j denotes a time step;
(1.2.6) set the reward function (the TD target):
y_j = r_i^j + γ Q_i^{μ'}(x'_j, a'_1, ..., a'_N) |_{a'_k = μ'_k(o_k)}
where Q_i^{μ'} denotes the Q-value function of agent i at the time step following j, and the subscript a'_k denotes the action obtained from the observation at the following time step k; r_i^j denotes the reward value of agent i at time j, γ is the reward discount factor, x'_j is the state following time j, and (a'_1, ..., a'_N) are the actions following a = (a_1, a_2, ..., a_N);
(1.2.7) update the Critic network by minimizing the loss function:
L(θ_i) = E_{x,a,r,x'}[(Q_i^μ(x_j, a_1^j, ..., a_N^j) - y_j)^2]
where x_j is the state at time j, (a_1^j, ..., a_N^j) are the actions at time j, and Q_i^μ denotes the Q-value function of agent i at time j;
(1.2.8) update the Actor network with the policy gradient computed from the sampled data:
∇_{θ_i} J ≈ E[∇_{θ_i} μ_i(o_i^j) ∇_{a_i} Q_i^μ(x_j, a_1^j, ..., a_N^j) |_{a_i = μ_i(o_i^j)}]
where o_i^j is the observation of agent i at time j, x_j is the state at time j, and a_i^j is the action at time j;
(1.2.9) let θ'_i be the target network parameters of each agent i, updated as a soft update with parameter τ ∈ (0, 1):
θ'_i ← τ θ_i + (1 - τ) θ'_i
(1.3) in the training process, each agent interacts with the environment to obtain empirical data, thereby obtaining a policy for each agent;
(1.4) each Actor network collects data and stores it in a buffer pool, and learning starts when the amount of data in the buffer pool exceeds a preset threshold;
(1.5) updating the policy parameters by using the Actor network, updating the action value parameters by using the Critic network, and updating the Critic network.
(2) Personality generator training process
(2.1) The personality generator uses a probability classifier C(i | O_i) to predict the probability distribution over the agents given an observation, and each agent receives the probability of being correctly predicted as an intrinsic reward at each time step;
(2.2) The reward function of each agent is set to R_i + γ C(i | O_i), where R_i is the global reward obtained by each agent i, C(i | O_i) is the task classification probability obtained from the observation O_i of each agent i, and γ is the adjustment parameter weighting the intrinsic reward;
(2.3) The initial differences existing between agent policies are captured by C(i | O_i) and fed back to each agent as an intrinsic reward.
(2.4) The classifier C(i | O_i) is parameterized by a neural network and learned in a supervised manner. At each time step t, the observation O_i of each agent i is taken as input, the index i of the agent is used as its label, and the pair (i, O_i) is stored in a new buffer B;
(2.5) The classifier network parameters are updated by minimizing the cross-entropy loss (CE), which is computed on batches uniformly sampled from the observation buffer B;
(2.6) As the expected return of each agent is maximized, the differences between agent policies grow along with the optimization of the environmental reward;
(2.7) as the behavior of the agent becomes more and more identifiable, the classifier can more accurately distinguish the agent, so that the personality becomes apparent.
(3) Regularizing the reward discount factor γ with an image attention mechanism
(3.1) Since the observation O_i of each agent i at every time step is frame-by-frame image data, Grad-CAM is used to obtain the feature information of the observation O_i at each time step;
(3.2) If the feature information observed by agent i_1 is concentrated near its corresponding task t_i, a positive reward γ = (1 - λ) r is given to the agent, where λ is the normalized distance between the agent and its corresponding task and r represents the immediate reward obtained by agent i for completing the task; that is, the closer the agent is to the task it is supposed to complete, the higher the positive reward, and the reward discount factor γ is positive;
(3.3) If the feature information observed by agent i_1 is not near its corresponding task t_i, or the observed feature information relates to subtasks corresponding to other agents, a negative reward γ = -(1 - λ) r is given to the agent; that is, the closer the agent is to a task that is not its own, the larger the negative reward, and the reward discount factor γ is negative.
(3.4) Update the parameter λ into the reward function set in the personality generator;
(3.5) Update the newly set reward function into the deep deterministic policy gradient algorithm for training until the algorithm converges and the individuality of the agents is embodied.

Claims (6)

1. The multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism is characterized by comprising the following steps of:
building a multi-agent reinforcement learning collaborative simulation scene, and training the multiple agents with a deep deterministic policy gradient algorithm;
the personality generator predicts the probability distribution over the pictures observed by the agents with a probability classifier, and trains the classifier with a reward function carrying a reward discount factor, so that the classifier distinguishes the agents more accurately and the personality of each agent is gradually revealed;
the feature information of the pictures observed by the agents at each time step is acquired and the reward discount factor is regularized with an image attention mechanism, namely: Grad-CAM is used to obtain the feature information observed by the agent at each moment; the partial derivative of the probability output by the last softmax layer of the probability classifier network with respect to every pixel of the last-layer feature map is calculated; the global average over the width and height dimensions is taken; the sensitivity of class i with respect to the k-th channel of the feature map output by the last convolutional layer, obtained in the previous step, is used as a weight to weight the last-layer feature maps, which are combined linearly and passed through a ReLU activation function; when the feature information is concentrated near the task corresponding to the agent, the agent is given a positive reward discount factor, and when the feature information is not near the task corresponding to the agent, the agent is given a negative reward discount factor; the obtained reward discount factor is updated into the reward function of the personality generator to obtain a newly set reward function; the newly set reward function is updated into the multi-agent reinforcement learning framework of the deep deterministic policy gradient algorithm to train the multiple agents until they converge;
the multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism is used in a game scene to train the game scene to reach a globally optimal state.
2. The multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism according to claim 1, wherein the steps of training the multiple agents with the deep deterministic policy gradient algorithm are as follows:
initializing a random process N of action exploration to obtain an initial state x;
for each agent i, select an action according to the current policy and the exploration noise:
a_i = μ_{θ_i}(o_i) + N_t
where o_i denotes the observation of agent i at time t, N_t denotes the exploration noise at time t, θ_i denotes the parameters of the Actor network, and μ_{θ_i} denotes the mapping from the state space to the action space;
execute the joint action a = (a_1, a_2, ..., a_N) and observe the reward r and the next state x';
each Actor collects the transition (x, a, r, x') and stores it in the experience replay pool;
randomly sample a mini-batch of samples S = (x_j, a_j, r_j, x'_j) from the experience replay pool, where j denotes a time step;
set the reward function (the TD target):
y_j = r_i^j + γ Q_i^{μ'}(x'_j, a'_1, ..., a'_N) |_{a'_k = μ'_k(o_k)}
where Q_i^{μ'} denotes the Q-value function of agent i at the time step following j, and the subscript a'_k denotes the action obtained from the observation at the following time step k; r_i^j denotes the reward value of agent i at time j, γ is the reward discount factor, x'_j is the state following time j, and (a'_1, ..., a'_N) are the actions following a = (a_1, a_2, ..., a_N);
update the Critic network by minimizing the loss function:
L(θ_i) = E_{x,a,r,x'}[(Q_i^μ(x_j, a_1^j, ..., a_N^j) - y_j)^2]
where x_j is the state at time j, (a_1^j, ..., a_N^j) are the actions at time j, and Q_i^μ denotes the Q-value function of agent i at time j;
update the Actor network with the policy gradient computed from the sampled data:
∇_{θ_i} J ≈ E[∇_{θ_i} μ_i(o_i^j) ∇_{a_i} Q_i^μ(x_j, a_1^j, ..., a_N^j) |_{a_i = μ_i(o_i^j)}]
where o_i^j is the observation of agent i at time j, x_j is the state at time j, and a_i^j is the action at time j;
let θ'_i be the target network parameters of each agent i, updated as a soft update with parameter τ ∈ (0, 1):
θ'_i ← τ θ_i + (1 - τ) θ'_i
meanwhile, during training, each of the multiple agents interacts with the environment to obtain experience data, from which the policy of each agent is obtained.
3. The multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism according to claim 2, wherein the deep deterministic policy gradient algorithm is trained as follows:
each Actor network collects data and stores it in a buffer pool, and learning starts when the amount of data in the buffer pool exceeds a preset threshold;
and updating the strategy parameters by using the Actor network, updating the action value parameters by using the Critic network, and updating the Critic network.
4. The multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism of claim 2, wherein the training process of the personality generator is as follows:
small-batch random sampling is performed from the buffer pool, and the cross entropy is calculated with the probability classifier;
the classifier neural network parameters are updated by minimizing the cross entropy;
a new reward function is set with the updated neural network parameters.
5. The attention mechanism based multi-agent deep reinforcement learning strategy optimization method of claim 4, wherein the probability classifier is expressed as:
P(i) = C(i | O_i)
where C(i | O_i) is the task classification probability obtained from the observation O_i of each agent i, and P(i) represents the prediction probability;
the calculation formula of the cross entropy is expressed as follows:
CE=-ΣZ(i)logP(i)
wherein Z (i) is the true classification probability;
the update formula for the classifier neural network parameters is expressed as follows:
θ_i ← min(-∑ Z(i) log P(i))
where θ_i here denotes the classifier neural network parameters;
the new reward function is expressed as follows:
R_i + γ C(i | O_i)
where R_i represents the reward value of agent i, and γ is the reward discount factor.
6. The multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism according to claim 5, wherein the steps of regularizing the reward discount factor are as follows:
Grad-CAM is used to obtain the feature information of the observation O_i^t of agent i at time t:
the partial derivative of the probability p output by the last softmax layer of the probability classifier network with respect to every pixel of the last-layer feature map is calculated:
∂p^i / ∂A_{hw}^k
where i is the index of the agent, A is the feature map output by the last convolutional layer, k is the channel index of the feature map, and h and w are the indices along the height and width dimensions, respectively;
after the partial derivative with respect to each pixel of the feature map is obtained, the global average over the width and height dimensions is taken:
α_k^i = (1/Z) ∑_h ∑_w ∂p^i / ∂A_{hw}^k
where Z is the number of pixels in the feature map;
taking the sensitivity α_k^i of class i with respect to the k-th channel of the feature map output by the last convolutional layer, obtained in the previous step, as a weight, the last-layer feature maps are weighted, combined linearly, and passed through a ReLU activation function:
L_{Grad-CAM}^i = ReLU(∑_k α_k^i A^k)
the feature information of the Grad-CAM heat map is analyzed; if the heat-map feature information observed by agent i_1 is concentrated near its corresponding task t_i, a positive reward is given to the agent:
γ = (1 - λ) r
where λ is the normalized distance between the agent and its corresponding task, r represents the immediate reward obtained by agent i for completing the task, and the reward discount factor γ is positive;
if the feature information observed by agent i_1 is not near its corresponding task t_i, or the observed feature information relates to subtasks corresponding to other agents, a negative reward is given to the agent:
γ = -(1 - λ) r
that is, the closer the agent is to a task that is not its own, the larger the negative reward, and the reward discount factor γ is negative.
CN202110777110.9A 2021-07-09 2021-07-09 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism Active CN113392935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110777110.9A CN113392935B (en) 2021-07-09 2021-07-09 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110777110.9A CN113392935B (en) 2021-07-09 2021-07-09 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Publications (2)

Publication Number Publication Date
CN113392935A CN113392935A (en) 2021-09-14
CN113392935B true CN113392935B (en) 2023-05-30

Family

ID=77625608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110777110.9A Active CN113392935B (en) 2021-07-09 2021-07-09 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN113392935B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792861B (en) * 2021-09-16 2024-02-27 中国科学技术大学 Multi-agent reinforcement learning method and system based on value distribution
CN113759929B (en) * 2021-09-22 2022-08-23 西安航天动力研究所 Multi-agent path planning method based on reinforcement learning and model predictive control
CN113919485B (en) * 2021-10-19 2024-03-15 西安交通大学 Multi-agent reinforcement learning method and system based on dynamic hierarchical communication network
CN114123178B (en) * 2021-11-17 2023-12-19 哈尔滨工程大学 Multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method
CN114130034B (en) * 2021-11-19 2023-08-18 天津大学 Multi-agent game AI design method based on attention mechanism and reinforcement learning
CN114187978A (en) * 2021-11-24 2022-03-15 中山大学 Compound optimization method based on deep learning connection fragment
CN113962390B (en) * 2021-12-21 2022-04-01 中国科学院自动化研究所 Method for constructing diversified search strategy model based on deep reinforcement learning network
CN114454160B (en) * 2021-12-31 2024-04-16 中国人民解放军国防科技大学 Mechanical arm grabbing control method and system based on kernel least square soft Belman residual error reinforcement learning
CN114489107B (en) * 2022-01-29 2022-10-25 哈尔滨逐宇航天科技有限责任公司 Aircraft double-delay depth certainty strategy gradient attitude control method
CN114627054A (en) * 2022-02-11 2022-06-14 北京理工大学 CT-X image registration method and device based on multi-scale reinforcement learning
CN114527666B (en) * 2022-03-09 2023-08-11 西北工业大学 CPS system reinforcement learning control method based on attention mechanism
CN114625089B (en) * 2022-03-15 2024-05-03 大连东软信息学院 Job shop scheduling method based on improved near-end strategy optimization algorithm
CN114625091A (en) * 2022-03-21 2022-06-14 京东城市(北京)数字科技有限公司 Optimization control method and device, storage medium and electronic equipment
CN114841872A (en) * 2022-04-12 2022-08-02 浙江大学 Digital halftone processing method based on multi-agent deep reinforcement learning
CN114900619B (en) * 2022-05-06 2023-05-05 北京航空航天大学 Self-adaptive exposure driving camera shooting underwater image processing system
CN114925850B (en) * 2022-05-11 2024-02-20 华东师范大学 Deep reinforcement learning countermeasure defense method for disturbance rewards
CN114815904B (en) * 2022-06-29 2022-09-27 中国科学院自动化研究所 Attention network-based unmanned cluster countermeasure method and device and unmanned equipment
CN115333961B (en) * 2022-06-30 2023-10-13 北京邮电大学 Wireless communication network management and control method based on deep reinforcement learning and related equipment
CN115167136B (en) * 2022-07-21 2023-04-07 中国人民解放军国防科技大学 Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck
CN115062871B (en) * 2022-08-11 2022-11-29 山西虚拟现实产业技术研究院有限公司 Intelligent electric meter state evaluation method based on multi-agent reinforcement learning
CN115333152A (en) * 2022-08-22 2022-11-11 电子科技大学 Distributed real-time control method for voltage of power distribution network
CN115648204A (en) * 2022-09-26 2023-01-31 吉林大学 Training method, device, equipment and storage medium of intelligent decision model
CN115797394B (en) * 2022-11-15 2023-09-05 北京科技大学 Multi-agent coverage method based on reinforcement learning
CN115826013B (en) * 2023-02-15 2023-04-21 广东工业大学 Beidou satellite positioning method based on light reinforcement learning under urban multipath environment
CN116629128B (en) * 2023-05-30 2024-03-29 哈尔滨工业大学 Method for controlling arc additive forming based on deep reinforcement learning
CN116560239B (en) * 2023-07-06 2023-09-12 华南理工大学 Multi-agent reinforcement learning method, device and medium
CN117151928A (en) * 2023-09-05 2023-12-01 广州大学 Power saving calculation method and device combined with reinforcement learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method
CN112329948A (en) * 2020-11-04 2021-02-05 腾讯科技(深圳)有限公司 Multi-agent strategy prediction method and device
CN112801290A (en) * 2021-02-26 2021-05-14 中国人民解放军陆军工程大学 Multi-agent deep reinforcement learning method, system and application
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200162535A1 (en) * 2018-11-19 2020-05-21 Zhan Ma Methods and Apparatus for Learning Based Adaptive Real-time Streaming
EP3899797A1 (en) * 2019-01-24 2021-10-27 DeepMind Technologies Limited Multi-agent reinforcement learning with matchmaking policies
US20210089910A1 (en) * 2019-09-25 2021-03-25 Deepmind Technologies Limited Reinforcement learning using meta-learned intrinsic rewards

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method
CN112329948A (en) * 2020-11-04 2021-02-05 腾讯科技(深圳)有限公司 Multi-agent strategy prediction method and device
CN112801290A (en) * 2021-02-26 2021-05-14 中国人民解放军陆军工程大学 Multi-agent deep reinforcement learning method, system and application
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Counterfactual Multi-Agent (COMA) Policy Gradients; Shimon Whiteson; The Thirty-Second AAAI Conference on Artificial Intelligence; 1-9 *
Design and Implementation of a Multi-Agent Cooperative Simulation Environment; 陈晋音; Journal of Computer Applications; Vol. 25; 308-310 *

Also Published As

Publication number Publication date
CN113392935A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN113392935B (en) Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN107403426B (en) Target object detection method and device
US11867599B2 (en) Apparatus and methods for controlling attention of a robot
CN113537106B (en) Fish ingestion behavior identification method based on YOLOv5
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
CN110874578B (en) Unmanned aerial vehicle visual angle vehicle recognition tracking method based on reinforcement learning
CN108510194A (en) Air control model training method, Risk Identification Method, device, equipment and medium
CN111079561A (en) Robot intelligent grabbing method based on virtual training
CN111507501A (en) Method and device for executing personalized path planning through reinforcement learning
CN111246091A (en) Dynamic automatic exposure control method and device and electronic equipment
JP7059695B2 (en) Learning method and learning device
CN114842343A (en) ViT-based aerial image identification method
CN113065379B (en) Image detection method and device integrating image quality and electronic equipment
CN113870304A (en) Abnormal behavior detection and tracking method and device, readable storage medium and equipment
CN113393495B (en) High-altitude parabolic track identification method based on reinforcement learning
US11080837B2 (en) Architecture for improved machine learning operation
CN113378638B (en) Method for identifying abnormal behavior of turbine operator based on human body joint point detection and D-GRU network
CN116630751B (en) Trusted target detection method integrating information bottleneck and uncertainty perception
CN113561995A (en) Automatic driving decision method based on multi-dimensional reward architecture deep Q learning
CN110826563A (en) Finger vein segmentation method and device based on neural network and probability map model
CN115909027A (en) Situation estimation method and device
CN115630361A (en) Attention distillation-based federal learning backdoor defense method
JP2022514886A (en) How to Train Neural Networks
CN117709602B (en) Urban intelligent vehicle personification decision-making method based on social value orientation
Liu et al. Hybrid-Input Convolutional Neural Network-Based Underwater Image Quality Assessment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant