CN113392935B - Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism - Google Patents

Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Info

Publication number
CN113392935B
CN113392935B (application CN202110777110.9A)
Authority
CN
China
Prior art keywords
agent
probability
reinforcement learning
rewarding
updating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110777110.9A
Other languages
Chinese (zh)
Other versions
CN113392935A (en)
Inventor
陈晋音
胡书隆
王雪柯
章燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110777110.9A priority Critical patent/CN113392935B/en
Publication of CN113392935A publication Critical patent/CN113392935A/en
Application granted granted Critical
Publication of CN113392935B publication Critical patent/CN113392935B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-agent deep reinforcement learning strategy optimization method based on an attention mechanism, which comprises the following steps: building a multi-agent reinforcement learning collaborative simulation scene, and training the multiple agents with a deep deterministic policy gradient algorithm; a personality generator predicts the probability distribution over the pictures observed by each agent with a probability classifier and trains the classifier so that it distinguishes the agents more accurately, so that the personality of each agent is gradually revealed; acquiring the feature information of the pictures observed by each agent at every time step, regularizing the reward discount factor, and updating the obtained reward discount factor into the reward function of the personality generator to obtain a newly set reward function; and updating the newly set reward function into the multi-agent reinforcement learning framework of the deep deterministic policy gradient algorithm to train the multiple agents until they converge.

Description

Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
Technical Field
The invention relates to the field of deep reinforcement learning defense, and in particular to a multi-agent deep reinforcement learning strategy optimization method based on an attention mechanism.
Background
Deep reinforcement learning has been one of the most closely watched directions in artificial intelligence in recent years; with its rapid development and application, reinforcement learning has been widely used in fields such as robot control, game playing, computer vision, and autonomous driving.
Deep reinforcement learning algorithms are mostly applied to single-agent scenes. In single-agent reinforcement learning the environment of the agent is stationary, but in multi-agent reinforcement learning (MARL) the environment is complex and dynamic: the action of each agent affects the action selection of the other agents, and MARL suffers from dimensional explosion, difficulty in determining the reward function, and a non-stationary environment, which brings great difficulty to the learning process. Meanwhile, objectives involving relationships such as cooperation and competition are tied to the difficulty of determining the reward; because the tasks of the agents in a multi-agent system may differ yet are mutually coupled and influence one another, the quality of the reward design directly affects the quality of the learned strategy.
Multi-agent reinforcement learning is widely applied to multi-agent cooperation scenes, but it is commonly observed that when agents are equivalent and share a global reward, they learn similar behaviors during joint training; however, learning similar behavior easily traps the learned strategy in a local optimum. Some studies deliberately pursue differences between agent strategies through diversity, yet such induced differences are not directly linked to task success. In contrast, the presence of personality and learned cooperation can automatically drive agents to take different actions and play different roles as needed to successfully complete a task.
In existing multi-agent deep reinforcement learning algorithms, agents may bias their choices towards actions that easily complete the target task, i.e., the agents tend to complete the easy tasks, so that few or no agents complete the complex tasks; eventually the whole multi-agent scene falls into a locally optimal situation and the overall global reward decreases. In practice, since reinforcement learning agents are mostly equivalent, agents should be allowed to develop personalities through the policy learning process of interacting with the environment; thus, in a multi-agent environment, agents should develop personalities from their own experience by exploring and interacting with the environment separately.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a multi-agent deep reinforcement learning strategy optimization method based on an attention mechanism, which enables each agent to fully express its individuality and better complete its own task, so that the global reward is optimized.
A multi-agent deep reinforcement learning strategy optimization method based on an attention mechanism, the method comprising the steps of:
building a multi-agent reinforcement learning collaborative simulation scene, and training the multiple agents with a deep deterministic policy gradient algorithm;
the personality generator predicts the probability distribution over the pictures observed by the agents with a probability classifier, and trains the classifier with a reward function carrying a reward discount factor, so that the classifier distinguishes the agents more accurately and the personality of each agent is gradually revealed;
acquiring the feature information of the pictures observed by the agents at each time step with an image attention mechanism and regularizing the reward discount factor: when the feature information is concentrated near the task corresponding to the agent, the agent is given a positive reward discount factor, and when the feature information is not near the task corresponding to the agent, the agent is given a negative reward discount factor;
updating the obtained reward discount factor into the reward function of the personality generator to obtain a newly set reward function; and updating the newly set reward function into the multi-agent reinforcement learning framework of the deep deterministic policy gradient algorithm to train the multiple agents until they converge; the overall training loop is sketched below.
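To make the interaction of these steps concrete, the following Python sketch outlines the overall loop under stated assumptions: the callables classify_prob, grad_cam_gamma and maddpg_update, and the agent objects with act and buffer members, are illustrative placeholders rather than components named by the invention.

```python
# High-level sketch of the optimization loop described in the steps above.
# All helper names are illustrative assumptions, not names used by the patent.

def optimize(env, agents, classify_prob, grad_cam_gamma, maddpg_update, episodes=1000):
    """agents: list of agent objects exposing .act(obs) and .buffer.add(...)."""
    for _ in range(episodes):
        obs_n = env.reset()
        done = False
        while not done:
            # 1. Every agent acts with its current policy plus exploration noise.
            act_n = [ag.act(o) for ag, o in zip(agents, obs_n)]
            next_obs_n, global_r, done, _ = env.step(act_n)

            for i, ag in enumerate(agents):
                # 2. Personality generator: intrinsic signal C(i | O_i).
                p_i = classify_prob(obs_n[i], i)
                # 3. Attention regularization: Grad-CAM fixes the sign and
                #    magnitude of the reward discount factor gamma = +/-(1 - lambda) * r.
                gamma = grad_cam_gamma(obs_n[i], i, global_r)
                # 4. Newly set reward R_i + gamma * C(i | O_i) goes back to MADDPG.
                ag.buffer.add(obs_n[i], act_n[i], global_r + gamma * p_i, next_obs_n[i])

            maddpg_update(agents)   # deep deterministic policy gradient update
            obs_n = next_obs_n
```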
The steps of training the multiple agents with the deep deterministic policy gradient algorithm are as follows:
initializing a random process N of action exploration to obtain an initial state x;
for each agent i, select an action according to the current policy and the exploration noise:
a_i = μ_{θ_i}(o_i) + N_t
where o_i denotes the observation of agent i at time t, N_t denotes the exploration noise at time t, θ_i denotes the parameters of the Actor network, and μ_{θ_i} denotes the mapping from the state space to the action space;
execute the joint action a = (a_1, a_2, ..., a_N) and observe the reward r and the next state x';
each Actor collects the transition (x, a, r, x') and stores it in the experience replay pool;
randomly sample a mini-batch of samples S = (x_j, a_j, r_j, x'_j) from the experience replay pool, where j denotes a time step;
set the reward function (the TD target):
y_j = r_i^j + γ Q_i^{μ'}(x'_j, a'_1, ..., a'_N) |_{a'_k = μ'_k(o_k)}
where Q_i^{μ'} denotes the Q-value function of agent i at the time step following j, and the subscript a'_k denotes the action obtained from the observation at the following time step k; r_i^j denotes the reward value of agent i at time j, γ is the reward discount factor, x'_j is the state following time j, and (a'_1, ..., a'_N) are the actions following a = (a_1, a_2, ..., a_N);
update the Critic network by minimizing the loss function:
L(θ_i) = E_{x,a,r,x'}[(Q_i^μ(x_j, a_1^j, ..., a_N^j) - y_j)^2]
where x_j is the state at time j, (a_1^j, ..., a_N^j) are the actions at time j, and Q_i^μ denotes the Q-value function of agent i at time j;
update the Actor network with the policy gradient computed from the sampled data:
∇_{θ_i} J ≈ E[∇_{θ_i} μ_i(o_i^j) ∇_{a_i} Q_i^μ(x_j, a_1^j, ..., a_N^j) |_{a_i = μ_i(o_i^j)}]
where o_i^j is the observation of agent i at time j, x_j is the state at time j, and a_i^j is the action at time j;
let θ'_i be the target network parameters of each agent i, updated as a soft update with parameter τ ∈ (0, 1):
θ'_i ← τ θ_i + (1 - τ) θ'_i
Meanwhile, during training, each of the multiple agents interacts with the environment to obtain experience data, from which the policy of each agent is obtained; a code sketch of these update steps follows.
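As an illustration of the update equations above, the following PyTorch sketch performs one MADDPG-style update for agent i; the network containers, the per-agent batch layout and the function name maddpg_update are assumptions made for the example, not the exact implementation of the invention.

```python
# One MADDPG-style update for agent i under the equations above (sketch only).
import torch
import torch.nn.functional as F

def maddpg_update(i, actors, critics, target_actors, target_critics,
                  batch, optim_actor, optim_critic, gamma=0.95, tau=0.01):
    obs, acts, rews, next_obs = batch          # per-agent lists of batched tensors
    # Target actions a'_k = mu'_k(o_k) for every agent k.
    next_acts = [ta(next_obs[k]) for k, ta in enumerate(target_actors)]

    # y_j = r_i^j + gamma * Q_i^{mu'}(x'_j, a'_1, ..., a'_N)
    with torch.no_grad():
        y = rews[i] + gamma * target_critics[i](torch.cat(next_obs, dim=1),
                                                torch.cat(next_acts, dim=1))

    # Critic update: minimize (Q_i^{mu}(x_j, a_1^j, ..., a_N^j) - y_j)^2.
    q = critics[i](torch.cat(obs, dim=1), torch.cat(acts, dim=1))
    critic_loss = F.mse_loss(q, y)
    optim_critic.zero_grad(); critic_loss.backward(); optim_critic.step()

    # Actor update: follow the gradient of Q_i w.r.t. its own action a_i = mu_i(o_i).
    acts_pred = list(acts)
    acts_pred[i] = actors[i](obs[i])
    actor_loss = -critics[i](torch.cat(obs, dim=1), torch.cat(acts_pred, dim=1)).mean()
    optim_actor.zero_grad(); actor_loss.backward(); optim_actor.step()

    # Soft update of the target networks: theta'_i <- tau*theta_i + (1 - tau)*theta'_i.
    for net, target in ((actors[i], target_actors[i]), (critics[i], target_critics[i])):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)
```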
The deep deterministic policy gradient algorithm is trained as follows:
each Actor network collects data and stores it in a buffer pool, and learning starts when the amount of data in the buffer pool exceeds a preset threshold, as illustrated by the buffer sketch below;
and updating the strategy parameters by using the Actor network, updating the action value parameters by using the Critic network, and updating the Critic network.
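A minimal sketch of the buffer-pool gate, assuming a simple deque-backed pool; the class name ReplayPool and its methods are illustrative only.

```python
# Minimal replay pool with a learning threshold; names are illustrative.
import random
from collections import deque

class ReplayPool:
    def __init__(self, capacity=100_000, learn_threshold=1_000):
        self.storage = deque(maxlen=capacity)
        self.learn_threshold = learn_threshold

    def add(self, obs, action, reward, next_obs):
        self.storage.append((obs, action, reward, next_obs))

    def ready(self):
        # Learning only starts once the pool holds more data than the preset threshold.
        return len(self.storage) > self.learn_threshold

    def sample(self, batch_size=64):
        return random.sample(self.storage, batch_size)
```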
Preferably, the training process of the personality generator is as follows:
perform small-batch random sampling from the buffer pool, and calculate the cross entropy with the probability classifier;
update the classifier neural network parameters by minimizing the cross entropy;
set a new reward function with the updated neural network parameters.
The probability classifier is expressed as:
P(i) = C(i | O_i)
where C(i | O_i) is the task classification probability obtained from the observation O_i of each agent i, and P(i) represents the prediction probability;
the calculation formula of the cross entropy is expressed as follows:
CE=-∑Z(i)log P(i)
wherein Z (i) is the true classification probability;
the update formula of the classified neural network parameters is expressed as follows:
Figure BDA0003155972800000041
wherein ,
Figure BDA0003155972800000042
classifying neural network parameters;
the new bonus function is expressed as follows:
Figure BDA0003155972800000043
wherein ,Ri And represents the prize value of i for the agent, and gamma is the prize discount factor.
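A sketch of this training procedure is given below: a small softmax classifier is fitted on (observation, agent index) pairs by minimizing the cross entropy, and the newly set reward R_i + γ·C(i | O_i) is assembled from its output. The flat-vector observation, the network sizes and the helper names are assumptions made for illustration; the invention itself classifies observed pictures.

```python
# Sketch of the personality generator's probability classifier and new reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbClassifier(nn.Module):
    """P(i) = C(i | O_i): predicts which agent an observation belongs to."""
    def __init__(self, obs_dim, n_agents):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_agents))

    def forward(self, obs):
        return F.softmax(self.net(obs), dim=-1)

def train_classifier(clf, optimizer, batch_obs, batch_agent_ids):
    # CE = -sum Z(i) log P(i), with Z the one-hot encoding of the true agent index.
    loss = F.cross_entropy(clf.net(batch_obs), batch_agent_ids)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def new_reward(global_reward, clf, obs_i, agent_id, gamma):
    # Newly set reward: R_i + gamma * C(i | O_i).
    with torch.no_grad():
        p = clf(obs_i.unsqueeze(0))[0, agent_id]
    return global_reward + gamma * p.item()
```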
Preferably, the steps of regularizing the reward discount factor are as follows:
use Grad-CAM to obtain the feature information of the observation O_i^t of agent i at time t:
calculating partial derivatives of probability p output by the last layer softmax of the probability classifier network on all pixels of the final layer feature map:
Figure BDA0003155972800000045
where i is the index of the agent, A is the feature map output by the last convolutional layer, k is the channel index of the feature map, and h and w are the indices along the height and width dimensions, respectively;
after the partial derivative with respect to each pixel of the feature map is obtained, take the global average over the width and height dimensions:
α_k^i = (1/Z) ∑_h ∑_w ∂p^i / ∂A_{hw}^k
where Z is the number of pixels in the feature map;
taking the sensitivity α_k^i of class i with respect to the k-th channel of the feature map output by the last convolutional layer, obtained in the previous step, as a weight, weight the last-layer feature maps, combine them linearly, and pass the result through a ReLU activation function:
L_{Grad-CAM}^i = ReLU(∑_k α_k^i A^k)
analyze the feature information of the Grad-CAM heat map; if the heat-map feature information observed by agent i_1 is concentrated near its corresponding task t_i, a positive reward is given to the agent:
γ = (1 - λ) r
where λ is the normalized distance between the agent and its corresponding task, r represents the immediate reward obtained by agent i for completing the task, and the reward discount factor γ is positive;
if the feature information observed by agent i_1 is not near its corresponding task t_i, or the observed feature information relates to subtasks corresponding to other agents, a negative reward is given to the agent:
γ = -(1 - λ) r
that is, the closer the agent is to a task that is not its own, the larger the negative reward, and the reward discount factor γ is negative. A sketch of this Grad-CAM computation is given below.
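The Grad-CAM weighting and the sign rule for γ can be sketched as follows, assuming the last convolutional feature map and the class probability are available from the classifier's forward pass; the mask-based test of whether the heat map falls on the agent's own task region, and its 0.5 threshold, are illustrative choices rather than details fixed by the invention.

```python
# Sketch of Grad-CAM-based regularization of the reward discount factor gamma.
import torch
import torch.nn.functional as F

def grad_cam_map(conv_features, class_prob):
    """conv_features: last-conv feature map A with shape (C, H, W), part of the
    autograd graph; class_prob: scalar softmax probability p^i for agent i."""
    grads = torch.autograd.grad(class_prob, conv_features, retain_graph=True)[0]
    alpha = grads.mean(dim=(1, 2))                       # alpha_k^i: average over H, W
    cam = F.relu((alpha[:, None, None] * conv_features).sum(dim=0))
    return cam                                           # L^i_Grad-CAM, shape (H, W)

def reward_discount_factor(cam, own_task_mask, lam, r, threshold=0.5):
    """gamma = (1 - lam) * r if the heat-map mass lies on the agent's own task
    region, else -(1 - lam) * r; the mass threshold is an illustrative choice."""
    on_task = (cam * own_task_mask).sum() / (cam.sum() + 1e-8)
    return (1 - lam) * r if on_task > threshold else -(1 - lam) * r
```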
Compared with the prior art, the invention has the following advantages:
(1) With the personality generator, the personality of each agent can be reflected during training; the classification probability in the personality generator is regularized with a reward function whose weights are obtained from the attention mechanism; the set reward function is modified according to the different tasks of the agents, so that each agent corresponds to a different reward function; and the newly set reward function fits the deep deterministic policy gradient algorithm, so that the training strategy is optimized and the global reward reaches the optimum faster.
Drawings
FIG. 1 is a schematic flow chart of a multi-agent deep reinforcement learning strategy optimization method based on an attention mechanism;
FIG. 2 is a schematic diagram of training multiple agents with the deep deterministic policy gradient algorithm provided by the present invention;
FIG. 3 is a flow chart of the reward function setup provided by the invention.
Detailed Description
The invention will be further described with reference to the drawings and specific examples.
In the multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism, a personality generator first learns a probability classifier; the classifier predicts a probability distribution over the agents given an observation, and each agent receives the probability of being correctly predicted by the classifier as an intrinsic reward. Encouraged by the intrinsic reward, agents tend to visit observations they are familiar with; learning the probability classifier from such observations makes the intrinsic reward signal stronger, which in turn makes the agents more identifiable. Since observations visited by different agents cannot be easily distinguished by the classifier in the early learning stage, the intrinsic reward signal alone is insufficient to induce agent characteristics; therefore, a regularization based on an attention mechanism is adopted when learning the classifier to increase discrimination and strengthen the feedback, thereby promoting the emergence of personality.
Fig. 1 is a flow chart of the multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism of this embodiment; the method can be used in a game scene to train the scene to reach a globally optimal state.
As shown in fig. 1-3, the multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism comprises the following steps:
(1) Multi-agent centralized training process
(1.1) building a multi-agent reinforcement learning collaborative simulation scene;
(1.2) training the multiple agents based on a multi-agent reinforcement learning framework of the deep deterministic policy gradient algorithm;
(1.2.1) initializing a random process N of action exploration to obtain an initial state x;
(1.2.2) for each agent i, select an action according to the current policy and the exploration noise:
a_i = μ_{θ_i}(o_i) + N_t
where o_i denotes the observation of agent i at time t, N_t denotes the exploration noise at time t, θ_i denotes the parameters of the Actor network, and μ_{θ_i} denotes the mapping from the state space to the action space;
(1.2.3) execute the joint action a = (a_1, a_2, ..., a_N) and observe the reward r and the next state x';
(1.2.4) each Actor collects the transition (x, a, r, x') and stores it in the experience replay pool;
(1.2.5) randomly sample a mini-batch of samples S = (x_j, a_j, r_j, x'_j) from the experience replay pool, where j denotes a time step;
(1.2.6) set the reward function (the TD target):
y_j = r_i^j + γ Q_i^{μ'}(x'_j, a'_1, ..., a'_N) |_{a'_k = μ'_k(o_k)}
where Q_i^{μ'} denotes the Q-value function of agent i at the time step following j, and the subscript a'_k denotes the action obtained from the observation at the following time step k; r_i^j denotes the reward value of agent i at time j, γ is the reward discount factor, x'_j is the state following time j, and (a'_1, ..., a'_N) are the actions following a = (a_1, a_2, ..., a_N);
(1.2.7) update the Critic network by minimizing the loss function:
L(θ_i) = E_{x,a,r,x'}[(Q_i^μ(x_j, a_1^j, ..., a_N^j) - y_j)^2]
where x_j is the state at time j, (a_1^j, ..., a_N^j) are the actions at time j, and Q_i^μ denotes the Q-value function of agent i at time j;
(1.2.8) update the Actor network with the policy gradient computed from the sampled data:
∇_{θ_i} J ≈ E[∇_{θ_i} μ_i(o_i^j) ∇_{a_i} Q_i^μ(x_j, a_1^j, ..., a_N^j) |_{a_i = μ_i(o_i^j)}]
where o_i^j is the observation of agent i at time j, x_j is the state at time j, and a_i^j is the action at time j;
(1.2.9) let θ'_i be the target network parameters of each agent i, updated as a soft update with parameter τ ∈ (0, 1):
θ'_i ← τ θ_i + (1 - τ) θ'_i
(1.3) in the training process, each agent interacts with the environment to obtain empirical data, thereby obtaining a policy for each agent;
(1.4) each Actor network collects data and stores it in a buffer pool, and learning starts when the amount of data in the buffer pool exceeds a preset threshold;
(1.5) updating the policy parameters by using the Actor network, updating the action value parameters by using the Critic network, and updating the Critic network.
(2) Personality generator training process
(2.1) The personality generator uses a probability classifier C(i | O_i) to predict the probability distribution over the agents given an observation, and each agent receives the probability of being correctly predicted as an intrinsic reward at each time step;
(2.2) The reward function of each agent is set to R_i + γ C(i | O_i), where R_i is the global reward obtained by each agent i, C(i | O_i) is the task classification probability obtained from the observation O_i of each agent i, and γ is the adjustment parameter weighting the intrinsic reward;
(2.3) The initial differences existing between agent policies are captured by C(i | O_i) and fed back to each agent as an intrinsic reward.
(2.4) The classifier C(i | O_i) is parameterized by a neural network and learned in a supervised manner. At each time step t, the observation O_i of each agent i is taken as input, the index i of the agent is used as its label, and the pair (i, O_i) is stored in a new buffer B;
(2.5) The classifier network parameters are updated by minimizing the cross-entropy loss (CE), which is computed on batches uniformly sampled from the observation buffer B;
(2.6) As the expected return of each agent is maximized, the differences between agent policies grow along with the optimization of the environmental reward;
(2.7) as the behavior of the agent becomes more and more identifiable, the classifier can more accurately distinguish the agent, so that the personality becomes apparent.
(3) Regularizing the reward discount factor γ with an image attention mechanism
(3.1) Since the observation O_i of each agent i at every time step is frame-by-frame image data, Grad-CAM is used to obtain the feature information of the observation O_i at each time step;
(3.2) If the feature information observed by agent i_1 is concentrated near its corresponding task t_i, a positive reward γ = (1 - λ) r is given to the agent, where λ is the normalized distance between the agent and its corresponding task and r represents the immediate reward obtained by agent i for completing the task; that is, the closer the agent is to the task it is supposed to complete, the higher the positive reward, and the reward discount factor γ is positive;
(3.3) If the feature information observed by agent i_1 is not near its corresponding task t_i, or the observed feature information relates to subtasks corresponding to other agents, a negative reward γ = -(1 - λ) r is given to the agent; that is, the closer the agent is to a task that is not its own, the larger the negative reward, and the reward discount factor γ is negative.
(3.4) Update the parameter λ into the reward function set in the personality generator;
(3.5) Update the newly set reward function into the deep deterministic policy gradient algorithm for training until the algorithm converges and the individuality of the agents is embodied.

Claims (6)

1. The multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism is characterized by comprising the following steps of:
building a multi-agent reinforcement learning collaborative simulation scene, and training the multiple agents with a deep deterministic policy gradient algorithm;
the personality generator predicts the probability distribution over the pictures observed by the agents with a probability classifier, and trains the classifier with a reward function carrying a reward discount factor, so that the classifier distinguishes the agents more accurately and the personality of each agent is gradually revealed;
the feature information of the pictures observed by the agents at each time step is acquired and the reward discount factor is regularized with an image attention mechanism, namely: Grad-CAM is used to obtain the feature information observed by the agent at each moment; the partial derivative of the probability output by the last softmax layer of the probability classifier network with respect to every pixel of the last-layer feature map is calculated; the global average over the width and height dimensions is taken; the sensitivity of class i with respect to the k-th channel of the feature map output by the last convolutional layer, obtained in the previous step, is used as a weight to weight the last-layer feature maps, which are combined linearly and passed through a ReLU activation function; when the feature information is concentrated near the task corresponding to the agent, the agent is given a positive reward discount factor, and when the feature information is not near the task corresponding to the agent, the agent is given a negative reward discount factor; the obtained reward discount factor is updated into the reward function of the personality generator to obtain a newly set reward function; the newly set reward function is updated into the multi-agent reinforcement learning framework of the deep deterministic policy gradient algorithm to train the multiple agents until they converge;
the multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism is used in a game scene to train the game scene to reach a globally optimal state.
2. The multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism according to claim 1, wherein the steps of training the multiple agents with the deep deterministic policy gradient algorithm are as follows:
initializing a random process N of action exploration to obtain an initial state x;
for each agent i, select an action according to the current policy and the exploration noise:
a_i = μ_{θ_i}(o_i) + N_t
where o_i denotes the observation of agent i at time t, N_t denotes the exploration noise at time t, θ_i denotes the parameters of the Actor network, and μ_{θ_i} denotes the mapping from the state space to the action space;
execute the joint action a = (a_1, a_2, ..., a_N) and observe the reward r and the next state x';
each Actor collects the transition (x, a, r, x') and stores it in the experience replay pool;
randomly sample a mini-batch of samples S = (x_j, a_j, r_j, x'_j) from the experience replay pool, where j denotes a time step;
set the reward function (the TD target):
y_j = r_i^j + γ Q_i^{μ'}(x'_j, a'_1, ..., a'_N) |_{a'_k = μ'_k(o_k)}
where Q_i^{μ'} denotes the Q-value function of agent i at the time step following j, and the subscript a'_k denotes the action obtained from the observation at the following time step k; r_i^j denotes the reward value of agent i at time j, γ is the reward discount factor, x'_j is the state following time j, and (a'_1, ..., a'_N) are the actions following a = (a_1, a_2, ..., a_N);
update the Critic network by minimizing the loss function:
L(θ_i) = E_{x,a,r,x'}[(Q_i^μ(x_j, a_1^j, ..., a_N^j) - y_j)^2]
where x_j is the state at time j, (a_1^j, ..., a_N^j) are the actions at time j, and Q_i^μ denotes the Q-value function of agent i at time j;
update the Actor network with the policy gradient computed from the sampled data:
∇_{θ_i} J ≈ E[∇_{θ_i} μ_i(o_i^j) ∇_{a_i} Q_i^μ(x_j, a_1^j, ..., a_N^j) |_{a_i = μ_i(o_i^j)}]
where o_i^j is the observation of agent i at time j, x_j is the state at time j, and a_i^j is the action at time j;
let θ'_i be the target network parameters of each agent i, updated as a soft update with parameter τ ∈ (0, 1):
θ'_i ← τ θ_i + (1 - τ) θ'_i
meanwhile, during training, each of the multiple agents interacts with the environment to obtain experience data, from which the policy of each agent is obtained.
3. The multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism according to claim 2, wherein the deep deterministic policy gradient algorithm is trained as follows:
each Actor network collects data and stores it in a buffer pool, and learning starts when the amount of data in the buffer pool exceeds a preset threshold;
and updating the strategy parameters by using the Actor network, updating the action value parameters by using the Critic network, and updating the Critic network.
4. The multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism of claim 2, wherein the training process of the personality generator is as follows:
small-batch random sampling is performed from the buffer pool, and the cross entropy is calculated with the probability classifier;
the classifier neural network parameters are updated by minimizing the cross entropy;
a new reward function is set with the updated neural network parameters.
5. The attention mechanism based multi-agent deep reinforcement learning strategy optimization method of claim 4, wherein the probability classifier is expressed as:
P(i) = C(i | O_i)
where C(i | O_i) is the task classification probability obtained from the observation O_i of each agent i, and P(i) represents the prediction probability;
the calculation formula of the cross entropy is expressed as follows:
CE=-ΣZ(i)logP(i)
wherein Z (i) is the true classification probability;
the update formula for the classifier neural network parameters is expressed as follows:
θ_i ← min(-∑ Z(i) log P(i))
where θ_i here denotes the classifier neural network parameters;
the new reward function is expressed as follows:
R_i + γ C(i | O_i)
where R_i represents the reward value of agent i, and γ is the reward discount factor.
6. The multi-agent deep reinforcement learning strategy optimization method based on the attention mechanism according to claim 5, wherein the steps of regularizing the reward discount factor are as follows:
Grad-CAM is used to obtain the feature information of the observation O_i^t of agent i at time t:
the partial derivative of the probability p output by the last softmax layer of the probability classifier network with respect to every pixel of the last-layer feature map is calculated:
∂p^i / ∂A_{hw}^k
where i is the index of the agent, A is the feature map output by the last convolutional layer, k is the channel index of the feature map, and h and w are the indices along the height and width dimensions, respectively;
after the partial derivative with respect to each pixel of the feature map is obtained, the global average over the width and height dimensions is taken:
α_k^i = (1/Z) ∑_h ∑_w ∂p^i / ∂A_{hw}^k
where Z is the number of pixels in the feature map;
taking the sensitivity α_k^i of class i with respect to the k-th channel of the feature map output by the last convolutional layer, obtained in the previous step, as a weight, the last-layer feature maps are weighted, combined linearly, and passed through a ReLU activation function:
L_{Grad-CAM}^i = ReLU(∑_k α_k^i A^k)
the feature information of the Grad-CAM heat map is analyzed; if the heat-map feature information observed by agent i_1 is concentrated near its corresponding task t_i, a positive reward is given to the agent:
γ = (1 - λ) r
where λ is the normalized distance between the agent and its corresponding task, r represents the immediate reward obtained by agent i for completing the task, and the reward discount factor γ is positive;
if the feature information observed by agent i_1 is not near its corresponding task t_i, or the observed feature information relates to subtasks corresponding to other agents, a negative reward is given to the agent:
γ = -(1 - λ) r
that is, the closer the agent is to a task that is not its own, the larger the negative reward, and the reward discount factor γ is negative.
CN202110777110.9A 2021-07-09 2021-07-09 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism Active CN113392935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110777110.9A CN113392935B (en) 2021-07-09 2021-07-09 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110777110.9A CN113392935B (en) 2021-07-09 2021-07-09 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Publications (2)

Publication Number Publication Date
CN113392935A CN113392935A (en) 2021-09-14
CN113392935B true CN113392935B (en) 2023-05-30

Family

ID=77625608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110777110.9A Active CN113392935B (en) 2021-07-09 2021-07-09 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN113392935B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792861B (en) * 2021-09-16 2024-02-27 中国科学技术大学 Multi-agent reinforcement learning method and system based on value distribution
CN113759929B (en) * 2021-09-22 2022-08-23 西安航天动力研究所 Multi-agent path planning method based on reinforcement learning and model predictive control
CN113919485B (en) * 2021-10-19 2024-03-15 西安交通大学 Multi-agent reinforcement learning method and system based on dynamic hierarchical communication network
CN114123178B (en) * 2021-11-17 2023-12-19 哈尔滨工程大学 Multi-agent reinforcement learning-based intelligent power grid partition network reconstruction method
CN114130034B (en) * 2021-11-19 2023-08-18 天津大学 Multi-agent game AI design method based on attention mechanism and reinforcement learning
CN114187978A (en) * 2021-11-24 2022-03-15 中山大学 Compound optimization method based on deep learning connection fragment
CN113962390B (en) * 2021-12-21 2022-04-01 中国科学院自动化研究所 Method for constructing diversified search strategy model based on deep reinforcement learning network
CN114454160B (en) * 2021-12-31 2024-04-16 中国人民解放军国防科技大学 Mechanical arm grabbing control method and system based on kernel least square soft Belman residual error reinforcement learning
CN114489107B (en) * 2022-01-29 2022-10-25 哈尔滨逐宇航天科技有限责任公司 Aircraft double-delay depth certainty strategy gradient attitude control method
CN114627054A (en) * 2022-02-11 2022-06-14 北京理工大学 CT-X image registration method and device based on multi-scale reinforcement learning
CN114527666B (en) * 2022-03-09 2023-08-11 西北工业大学 CPS system reinforcement learning control method based on attention mechanism
CN114625089B (en) * 2022-03-15 2024-05-03 大连东软信息学院 Job shop scheduling method based on improved near-end strategy optimization algorithm
CN114625091A (en) * 2022-03-21 2022-06-14 京东城市(北京)数字科技有限公司 Optimization control method and device, storage medium and electronic equipment
CN114841872A (en) * 2022-04-12 2022-08-02 浙江大学 Digital halftone processing method based on multi-agent deep reinforcement learning
CN114900619B (en) * 2022-05-06 2023-05-05 北京航空航天大学 Self-adaptive exposure driving camera shooting underwater image processing system
CN114925850B (en) * 2022-05-11 2024-02-20 华东师范大学 Deep reinforcement learning countermeasure defense method for disturbance rewards
CN114815904B (en) * 2022-06-29 2022-09-27 中国科学院自动化研究所 Attention network-based unmanned cluster countermeasure method and device and unmanned equipment
CN115333961B (en) * 2022-06-30 2023-10-13 北京邮电大学 Wireless communication network management and control method based on deep reinforcement learning and related equipment
CN115167136B (en) * 2022-07-21 2023-04-07 中国人民解放军国防科技大学 Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck
CN115062871B (en) * 2022-08-11 2022-11-29 山西虚拟现实产业技术研究院有限公司 Intelligent electric meter state evaluation method based on multi-agent reinforcement learning
CN115333152A (en) * 2022-08-22 2022-11-11 电子科技大学 Distributed real-time control method for voltage of power distribution network
CN115648204A (en) * 2022-09-26 2023-01-31 吉林大学 Training method, device, equipment and storage medium of intelligent decision model
CN115797394B (en) * 2022-11-15 2023-09-05 北京科技大学 Multi-agent coverage method based on reinforcement learning
CN115826013B (en) * 2023-02-15 2023-04-21 广东工业大学 Beidou satellite positioning method based on light reinforcement learning under urban multipath environment
CN116629128B (en) * 2023-05-30 2024-03-29 哈尔滨工业大学 Method for controlling arc additive forming based on deep reinforcement learning
CN116560239B (en) * 2023-07-06 2023-09-12 华南理工大学 Multi-agent reinforcement learning method, device and medium
CN117151928A (en) * 2023-09-05 2023-12-01 广州大学 Power saving calculation method and device combined with reinforcement learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method
CN112329948A (en) * 2020-11-04 2021-02-05 腾讯科技(深圳)有限公司 Multi-agent strategy prediction method and device
CN112801290A (en) * 2021-02-26 2021-05-14 中国人民解放军陆军工程大学 Multi-agent deep reinforcement learning method, system and application
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200162535A1 (en) * 2018-11-19 2020-05-21 Zhan Ma Methods and Apparatus for Learning Based Adaptive Real-time Streaming
EP3899797A1 (en) * 2019-01-24 2021-10-27 DeepMind Technologies Limited Multi-agent reinforcement learning with matchmaking policies
US20210089910A1 (en) * 2019-09-25 2021-03-25 Deepmind Technologies Limited Reinforcement learning using meta-learned intrinsic rewards

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method
CN112329948A (en) * 2020-11-04 2021-02-05 腾讯科技(深圳)有限公司 Multi-agent strategy prediction method and device
CN112801290A (en) * 2021-02-26 2021-05-14 中国人民解放军陆军工程大学 Multi-agent deep reinforcement learning method, system and application
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Counterfactual Multi-Agent (COMA) Policy Gradients; Shimon Whiteson; The Thirty-Second AAAI Conference on Artificial Intelligence; 1-9 *
Design and Implementation of a Multi-Agent Cooperative Simulation Environment; 陈晋音; Journal of Computer Applications; Vol. 25; 308-310 *

Also Published As

Publication number Publication date
CN113392935A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN113392935B (en) Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN107403426B (en) Target object detection method and device
US11867599B2 (en) Apparatus and methods for controlling attention of a robot
CN113537106B (en) Fish ingestion behavior identification method based on YOLOv5
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
CN110874578B (en) Unmanned aerial vehicle visual angle vehicle recognition tracking method based on reinforcement learning
CN108510194A (en) Air control model training method, Risk Identification Method, device, equipment and medium
CN111079561A (en) Robot intelligent grabbing method based on virtual training
CN111507501A (en) Method and device for executing personalized path planning through reinforcement learning
CN111246091A (en) Dynamic automatic exposure control method and device and electronic equipment
JP7059695B2 (en) Learning method and learning device
CN114842343A (en) ViT-based aerial image identification method
CN113065379B (en) Image detection method and device integrating image quality and electronic equipment
CN113870304A (en) Abnormal behavior detection and tracking method and device, readable storage medium and equipment
CN113393495B (en) High-altitude parabolic track identification method based on reinforcement learning
US11080837B2 (en) Architecture for improved machine learning operation
CN113378638B (en) Method for identifying abnormal behavior of turbine operator based on human body joint point detection and D-GRU network
CN116630751B (en) Trusted target detection method integrating information bottleneck and uncertainty perception
CN113561995A (en) Automatic driving decision method based on multi-dimensional reward architecture deep Q learning
CN110826563A (en) Finger vein segmentation method and device based on neural network and probability map model
CN115909027A (en) Situation estimation method and device
CN115630361A (en) Attention distillation-based federal learning backdoor defense method
JP2022514886A (en) How to Train Neural Networks
CN117709602B (en) Urban intelligent vehicle personification decision-making method based on social value orientation
Liu et al. Hybrid-Input Convolutional Neural Network-Based Underwater Image Quality Assessment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant