CN117875375A - Multi-agent deep reinforcement learning method based on safety exploration - Google Patents

Multi-agent deep reinforcement learning method based on safety exploration

Info

Publication number
CN117875375A
CN117875375A (application CN202311811726.9A)
Authority
CN
China
Prior art keywords
agent
network
value
parameters
risk
Prior art date
2023-12-27
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311811726.9A
Other languages
Chinese (zh)
Inventor
卢晓珍
林龙河
刘智博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-12-27 Application filed by Nanjing University of Aeronautics and Astronautics
2023-12-27 Priority to CN202311811726.9A
2024-04-12 Publication of CN117875375A
Status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a multi-agent deep reinforcement learning method based on safety exploration that avoids dangerous actions without degrading the agents' task performance. The method adds a G network based on long-term risk values to a conventional multi-agent deep reinforcement learning network, estimates the risk value of an agent's actions through deep reinforcement learning, and uses the estimated risk value to correct the reward obtained by the action, so that the actions the agents select during exploration are safe. During training, the method uses dual experience pools based on instant early-warning values to ensure that the agents explore sufficiently. At the same time, the method uses a deep neural network to perform credit assignment among the agents and adopts a centralized-training, distributed-execution framework so that the agents can cooperate.

Description

Multi-agent deep reinforcement learning method based on safety exploration
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a multi-agent deep reinforcement learning method based on safety exploration.
Background
Multi-agent deep reinforcement learning is an important research direction in machine learning and can be applied to important scenarios such as autonomous driving, energy allocation, and trajectory planning. As social informatization deepens, the security of algorithms is becoming increasingly important. At present, many tasks place high demands on the safety of each agent, and a dangerous policy can cause irreversible, serious damage, so how to guarantee safety without excessively sacrificing system performance has become an important research problem.
Research on multi-agent deep reinforcement learning algorithms is abundant both in China and abroad. In 2015, Ardi Tampuu et al. proposed the Independent Q-Learning (IQL) algorithm, a fully decentralized method in which each agent trains independently on its own observation data. The drawback of this approach is that it treats each agent as completely independent of the others and ignores the interactions between agents. The centralized-training, distributed-execution framework addresses this problem; for example, Peter Sunehag et al. proposed the Value-Decomposition Network (VDN) algorithm, which combines the rewards obtained by each agent into a total reward through a linear combination and performs centralized training to update each agent's network. At execution time, each agent selects its policy according to its own network, so the agents can cooperate. However, in complex cooperative tasks the contribution of each agent to the overall system is unlikely to be a simple linear relationship, which degrades the algorithm's performance. In 2018, Tabish Rashid et al. proposed the QMIX algorithm as an improvement on VDN; it uses a deep neural network to perform credit assignment so that each agent's contribution to the system can be allocated reasonably in complex tasks. Chinese patent ZL202111495010.3 provides a cooperative combat method and device for agents that improves on the QMIX framework and establishes a multi-UAV cooperative air-combat decision network model, demonstrating the good application potential of the algorithm.
However, the above methods have difficulty addressing safety during exploration. Existing approaches to making an agent's exploratory actions safer mainly set negative rewards or restrict the action space, which reduces the agent's ability to explore and lowers the final performance of the multi-agent system. The present invention adds a neural network based on long-term risk and dual experience replay pools based on instant early-warning values to multi-agent deep reinforcement learning, and uses the long-term risk to correct the values of the agents' candidate actions, so that the agents' safety requirements are met while their ability to explore is preserved.
Disclosure of Invention
The purpose of the invention is to balance the safety and the system performance of a multi-agent system: a Q network based on discounted rewards is corrected by a G network based on discounted risk values, and policy selection is performed with the corrected network, so that the multi-agent system is kept safe during exploration, can explore sufficiently during learning, and obtains a safe policy with higher return.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a multi-agent deep reinforcement learning method based on safety exploration, comprising the following steps:
Step 1: initialize N agents, set the number of learning rounds, and initialize the environment at the beginning of each round;
Step 2: set the number of time steps K per round, initialize a safety experience pool and a dangerous experience pool, each with maximum capacity N, initialize the learning rate α and the discount factor γ, and set the current time step k = 1;
Step 3: obtain each agent's local observation o_i^(k) and current state-action sequence τ_i^(k); input o_i^(k) and τ_i^(k) into the Q network based on discounted rewards to obtain each agent's expected reward value Q_i, and at the same time input them into the G network based on discounted risk values to obtain each agent's estimated risk value G_i;
Step 4: correct the Q network based on discounted rewards with the G network based on discounted risk values to obtain each agent's corrected expected reward value;
Step 5: input the corrected expected reward values of all agents into the first hybrid network to obtain the estimated value of the joint action, and at the same time input the estimated risk values G_i into the second hybrid network to obtain the estimated total risk;
Step 6: update the parameters of the first hybrid network based on each agent's corrected expected reward value and the estimated value of the joint action, and at the same time update the parameters of the second hybrid network based on each agent's expected reward value Q_i and the estimated total risk;
Step 7: repeat steps 3-6, increasing k by 1 on each repetition, until k equals K;
Step 8: repeat steps 2-6 for the set number of learning rounds.
Further, in step 4 the G network based on discounted risk values corrects the Q network based on discounted rewards according to a correction formula in which a_i^(k) denotes the action selected by each agent.
Further, updating the parameters of the first hybrid network in step 6 comprises:
Step A1: obtain the greedy joint action, execute each action a_i^(k), and obtain the instant reward r_i^(k);
Step A2: according to the instant reward r_i^(k) and the estimated value of the joint action, update the parameters of the first hybrid network by stochastic gradient descent so as to minimize the loss function; while the parameters of the first hybrid network are being updated, each agent's Q network based on discounted rewards automatically updates its own parameters through gradient back-propagation.
Further, the loss function to be minimized is constructed from the instant rewards and the estimated value of the joint action.
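The loss formula itself is not reproduced in this text, so the sketch below assumes a conventional temporal-difference loss of the form (r + γ·max Q_tot' − Q_tot)², minimized by stochastic gradient descent; the function names, tensor shapes, and the use of PyTorch are illustrative assumptions rather than details of the disclosure.

```python
# Hedged sketch of the first-hybrid-network update (step A2), assuming a
# standard TD loss; not the patented formula itself.
import torch

def first_mixer_loss(q_tot: torch.Tensor, reward: torch.Tensor,
                     q_tot_next_max: torch.Tensor, gamma: float = 0.8) -> torch.Tensor:
    """q_tot: joint value of the executed joint action, shape (batch, 1).
    q_tot_next_max: joint value of the greedy joint action at the next step."""
    target = reward + gamma * q_tot_next_max.detach()   # bootstrap target
    return ((target - q_tot) ** 2).mean()

# A typical update step: loss.backward() propagates gradients through the first
# hybrid network into every agent's Q network, which is how each agent
# "automatically updates its own parameters" during this update.
# optimizer = torch.optim.SGD(all_parameters, lr=alpha)
# loss = first_mixer_loss(q_tot, reward, q_tot_next_max)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```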
further, the parameters of the second hybrid network in the step 6The step of updating includes:
step B1: desired prize value Q by each agent i Selecting the actions of each agent by using an E-greedy method to obtain the action a selected by each agent i (k)
Step B2: acquiring global state s of current environment (k) Wherein the global state s (k) Including local observation o i (k) And an index reflecting the current task execution phase;
step B3: action a selected according to each agent i (k) And global state s of the current environment (k) Obtaining the positions s of each intelligent agent (k) Lower execution a i (k) The instant risk value omega is obtained i (k) If omega i (k) 0, will { s } i (k) ,a i (k)i (k) For storing a set of data in a secure experience poolOtherwise store it in the dangerous experience pool +.>
Step B4: among the agents, for each agent, from a secure experience poolRandom extraction of->Group data and risk experience pool->Random extraction of->Group data, composition size +.>Is>
Step B5: based on the sample pool B and the estimated total riskUpdating the parameters of the second hybrid network +.>And parameters in the second hybrid network +.>In the updating process, each agent automatically updates own parameters according to gradient back propagation based on the G network of the discount risk value.
Further, the sample cellAnd predicting total risk->Updating the parameters of the second hybrid network +.>The updated formula of (2) is:
wherein,μ measures the impact of future long-term risk, the greater its value, the more significant the impact of future long-term risk received during the update.
Further, the initialization parameters of the first hybrid network are as followsThe second hybrid network initialization parameters are as followsAnd the first hybrid network and the second hybrid network are both neural networks.
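As an illustration of the dual experience pools in steps B3 and B4 above, the following is a minimal sketch assuming bounded FIFO buffers and uniform random sampling; the per-pool sample counts are left as parameters because their exact values appear only symbolically in the original filing.

```python
# Sketch of the safety / dangerous experience pools and of forming the sample
# pool B. Buffer type and sampling scheme are assumptions.
import random
from collections import deque

class DualReplay:
    def __init__(self, capacity: int = 32):
        self.safe = deque(maxlen=capacity)    # groups with instant risk omega == 0
        self.risky = deque(maxlen=capacity)   # groups with instant risk omega != 0

    def store(self, state, action, risk):
        (self.safe if risk == 0 else self.risky).append((state, action, risk))

    def sample(self, n_safe: int, n_risk: int):
        """Mix data from both pools into the sample pool B used to update the
        G network and the second hybrid network."""
        batch = random.sample(list(self.safe), min(n_safe, len(self.safe))) + \
                random.sample(list(self.risky), min(n_risk, len(self.risky)))
        random.shuffle(batch)
        return batch
```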
Beneficial effects: the invention corrects the Q network based on discounted rewards with the G network based on discounted risk values, comprehensively considering long-term risk and instant early warning, which gives better safety performance; at the same time, the first hybrid network and the second hybrid network are used to assign credit to each agent reasonably, so that a better-performing model can be trained.
Drawings
Fig. 1 is a schematic diagram of a first hybrid network according to the present invention.
Fig. 2 is a schematic diagram of a second hybrid network according to the present invention.
Fig. 3 is a simulation diagram of an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings.
As shown in figs. 1-2, the present invention provides a multi-agent deep reinforcement learning method based on safety exploration, comprising:
Step 1: initialize N agents, set the number of learning rounds, and reset the environment at the beginning of each round.
Step 2: initialize the environment, set the number of time steps K per round, initialize a safety experience pool and a dangerous experience pool, each with maximum capacity N, initialize the learning rate α and the discount factor γ, and set the current time step k = 1.
Step 3: obtain each agent's local observation o_i^(k) and current state-action sequence τ_i^(k); input o_i^(k) and τ_i^(k) into the Q network based on discounted rewards to obtain each agent's expected reward value Q_i, and at the same time input them into the G network based on discounted risk values to obtain each agent's estimated risk value G_i.
In step 3, the Q network based on discounted rewards and the G network based on discounted risk values are both deep neural networks. The Q network measures the agent's performance: the better the agent performs, the higher the reward value it assigns, the reward being an evaluation of the agent's current action. The G network measures how dangerous the agent's actions are: the more dangerous the action, the higher the corresponding risk value, the risk value being an evaluation of the danger of the agent's current action.
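The following is a minimal PyTorch sketch of the kind of per-agent value network described above; the layer sizes, the use of a GRU cell to summarize the state-action sequence τ_i^(k), and all class and argument names are illustrative assumptions rather than details taken from this disclosure.

```python
# Per-agent network sketch: maps (local observation, state-action history) to
# one value per action. The same structure is assumed for the Q network based
# on discounted rewards and for the G network based on discounted risk values;
# only their training targets differ (rewards vs. risk values).
import torch
import torch.nn as nn

class AgentValueNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim + n_actions, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)   # summarizes tau_i^(k)
        self.head = nn.Linear(hidden_dim, n_actions)    # Q_i(., a) or G_i(., a)

    def forward(self, obs, last_action_onehot, hidden):
        x = torch.relu(self.encoder(torch.cat([obs, last_action_onehot], dim=-1)))
        h = self.rnn(x, hidden)
        return self.head(h), h

# Illustrative instantiation (an 8-dimensional observation and 6 actions are assumed):
q_net = AgentValueNet(obs_dim=8, n_actions=6)   # expected reward values Q_i
g_net = AgentValueNet(obs_dim=8, n_actions=6)   # estimated risk values G_i
```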
Step 4: use the G network based on discounted risk values to correct the Q network based on discounted rewards, obtaining each agent's corrected expected reward value.
In step 4, the G network based on discounted risk values corrects the Q network based on discounted rewards, and policy selection is performed with the corrected network; this keeps the multi-agent system safe during exploration while allowing it to explore sufficiently during learning, so that a safe policy with higher return is obtained, where a_i^(k) denotes the action selected by each agent.
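The correction formula (1.1) is not reproduced in this text, so the sketch below assumes a simple risk-penalized form, corrected Q = Q − ρ·G, with ρ a weighting coefficient (the embodiment below uses ρ = 0.1); it illustrates the idea of correcting the discounted-reward values with the discounted-risk values and is not the patented formula itself.

```python
# Hedged sketch of the step-4 correction; the linear penalty form is an assumption.
import torch

def corrected_q(q_values: torch.Tensor, g_values: torch.Tensor, rho: float = 0.1) -> torch.Tensor:
    """q_values, g_values: per-action tensors for one agent, shape (n_actions,).
    Returns the corrected expected reward values used for policy selection."""
    return q_values - rho * g_values
```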
Step 5: input the corrected expected reward values of all agents into the first hybrid network to obtain the estimated value of the joint action, and at the same time input each agent's estimated risk value G_i into the second hybrid network to obtain the estimated total risk.
In step 5, the first hybrid network and the second hybrid network are both neural networks; they perform credit assignment for the agents, predicting the total benefit of the system from each agent's benefit, so that a better-performing model can be trained.
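A hedged sketch of a hybrid (mixing) network follows, assuming a QMIX-style monotonic mixer whose weights are generated by hypernetworks conditioned on the global state s^(k); the disclosure only states that both hybrid networks are neural networks used for credit assignment, so the architecture and sizes here are assumptions.

```python
# Mixing-network sketch: the first hybrid network would mix the corrected Q
# values into the joint-action estimate, the second would mix the per-agent
# risk values G_i into the estimated total risk.
import torch
import torch.nn as nn

class MixingNet(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_values, state):
        # agent_values: (batch, n_agents); state: (batch, state_dim)
        b = agent_values.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_values.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)  # joint value or total risk
```

The absolute value applied to the hypernetwork weights keeps the mixer monotonic in each agent's input, a common way to make per-agent credit assignment consistent with the joint value.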
Step 6: update the parameters of the first hybrid network based on each agent's corrected expected reward value and the estimated value of the joint action, and at the same time update the parameters of the second hybrid network based on each agent's expected reward value Q_i and the estimated total risk.
In step 6, updating the parameters of the first hybrid network comprises:
Step A1: obtain the greedy joint action, execute each action a_i^(k), and obtain the instant reward r_i^(k); the symbol used here merely refers to the argmax that takes the maximum Q value over all possible actions and has no further meaning;
Step A2: according to the instant reward r_i^(k) and the estimated value of the joint action, update the parameters of the first hybrid network by stochastic gradient descent so as to minimize the loss function; while the parameters of the first hybrid network are being updated, each agent's Q network based on discounted rewards automatically updates its own parameters through gradient back-propagation.
Updating the parameters of the second hybrid network comprises:
Step B1: using each agent's expected reward value Q_i, select each agent's action by the ε-greedy method to obtain the selected action a_i^(k);
Step B2: acquire the global state s^(k) of the current environment, where the global state s^(k) includes the local observations o_i^(k) and an index reflecting the current task-execution phase;
Step B3: according to each agent's selected action a_i^(k) and the global state s^(k) of the current environment, obtain the instant risk value ω_i^(k) incurred when the agent executes a_i^(k) in state s^(k); if ω_i^(k) is 0, store the data group {s_i^(k), a_i^(k), ω_i^(k)} in the safety experience pool, otherwise store it in the dangerous experience pool;
Step B4: for each agent, randomly extract a number of data groups from the safety experience pool and a number of data groups from the dangerous experience pool to form a sample pool B of fixed size;
Step B5: based on the sample pool B and the estimated total risk, update the parameters of the second hybrid network according to the update formula of this process; while the parameters of the second hybrid network are being updated, each agent's G network based on discounted risk values automatically updates its own parameters through gradient back-propagation,
wherein a* merely refers to the argmin that takes the minimum G value over all actions and has no further meaning, the sample pool is a set, and F_{j,z} denotes the j-th element of the z-th data group in the sample pool.
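Because the update formula (1.3) is not reproduced in this text, the sketch below assumes a temporal-difference-style target ω_i^(k) + μ·min_a G_i(τ_i^(k+1), a), consistent with the statements that a* denotes the argmin over actions and that μ weights the influence of future long-term risk; the form of the target and the names used are assumptions, not the patented formula.

```python
# Hedged sketch of a G-network target built from a batch drawn from the sample
# pool B (instant risk plus discounted minimum future risk).
import torch

def g_target(instant_risk: torch.Tensor, g_next: torch.Tensor, mu: float) -> torch.Tensor:
    """instant_risk: omega_i^(k), shape (batch,).
    g_next: G values over all next-step actions, shape (batch, n_actions)."""
    return instant_risk + mu * g_next.min(dim=-1).values.detach()

def g_loss(g_taken: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Squared error between the G value of the executed action and the target;
    # gradients flow through the second hybrid network into each agent's G network.
    return ((target - g_taken) ** 2).mean()
```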
Step 7: repeat steps 3-6, increasing k by 1 on each repetition, until k equals K.
Step 8: repeat steps 2-6 for the set number of learning rounds.
In a specific implementation, the agent is specifically an unmanned aerial vehicle.
The specific implementation of the invention is as follows:
as shown in fig. 3, the environment is a three-dimensional space that simulates a drone flight. The figure includes an initial position 1 of the unmanned aerial vehicle, namely a starting point, a target position 2 of the unmanned aerial vehicle, namely an ending point, and a columnar barrier 3 in space. The unmanned aerial vehicle target 2 reaches the end point as soon as possible on the premise of collision to the columnar obstacle 3 as few as possible, the unmanned aerial vehicle position is represented by three-dimensional coordinates, and the unmanned aerial vehicle actions comprise 6 actions which are respectively upward, downward, left, right, front and back movement by one unit.
Step 1: deploy 2 unmanned aerial vehicles in the unmanned aerial vehicle system and set the number of learning rounds.
Step 2: initialize the environment and reset the unmanned aerial vehicles to the starting point; set the number of time steps per round K = 300; initialize a safety experience pool and a dangerous experience pool, each with maximum capacity N = 32; initialize the learning rate α = 0.001 and the discount factor γ = 0.8; and set the current time step k = 1.
Step 3: obtain each unmanned aerial vehicle's local observation o_i^(k) and current state-action sequence τ_i^(k); input o_i^(k) and τ_i^(k) into the Q network based on discounted rewards to obtain the predicted expected reward value Q_i for each action of each unmanned aerial vehicle, and at the same time input them into the G network based on discounted risk values to obtain each unmanned aerial vehicle's estimated risk value G_i, where the local observation o_i^(k) includes part of the map information, the unmanned aerial vehicle's current position, and its damage-level information.
Step 4: based on formula (1.1), correct the Q network based on discounted rewards with the G network based on discounted risk values to obtain each unmanned aerial vehicle's corrected expected reward value, where ρ = 0.1 and τ_i^(k) at this time includes the states and actions of the unmanned aerial vehicle at the current moment and the previous 4 moments.
Step 5: input the corrected expected reward values of the unmanned aerial vehicles into the first hybrid network to obtain the estimated value of the joint action, and at the same time input each unmanned aerial vehicle's estimated risk value G_i into the second hybrid network to obtain the estimated total risk.
Step 6: obtain the greedy joint action, execute each action a_i^(k), and obtain the instant reward r_i^(k); then, according to the instant reward r_i^(k) and the estimated value of the joint action, define a loss function and update the parameters of the first hybrid network by stochastic gradient descent, as in equation (1.2), so as to minimize the loss function; while the parameters of the first hybrid network are being updated, each unmanned aerial vehicle's Q network based on discounted rewards automatically updates its own parameters through gradient back-propagation. At the same time, with probability 0.8 each unmanned aerial vehicle selects the action with the largest Q_i value and with probability 0.2 selects another action at random, giving each unmanned aerial vehicle's selected action a_i^(k); the global state s^(k) of the current environment is acquired, where the global state s^(k) includes the local observations o_i^(k) and an index reflecting the current task-execution phase (this index contains the information of the whole map and the positions of the obstacles). Then, according to each agent's selected action a_i^(k) and the global state s^(k), the instant risk value ω_i^(k) incurred when each unmanned aerial vehicle executes a_i^(k) in state s^(k) is obtained; if ω_i^(k) is 0, the data group {s_i^(k), a_i^(k), ω_i^(k)} is stored in the safety experience pool, otherwise it is stored in the dangerous experience pool. A number of data groups are then randomly extracted from the safety experience pool and from the dangerous experience pool to form a sample pool B of fixed size, and the parameters of the second hybrid network are updated by equation (1.3); while the parameters of the second hybrid network are being updated, each unmanned aerial vehicle's G network based on discounted risk values automatically updates its own parameters through gradient back-propagation.
The invention corrects the Q network based on discounted rewards with the G network based on discounted risk values and performs policy selection with the corrected network, which keeps the agents safe during exploration while allowing sufficient exploration during learning and thus yields a safe policy with higher return; in addition, the first hybrid network and the second hybrid network are used to assign credit to each agent reasonably, so that a better-performing model can be trained.
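As a concrete illustration of the action selection described in the embodiment above (probability 0.8 of taking the action with the largest value, probability 0.2 of a random action), the following is a minimal ε-greedy sketch with ε = 0.2; tensor shapes and the function name are assumptions.

```python
# Epsilon-greedy selection over one drone's (corrected) per-action values.
import torch

def epsilon_greedy(values: torch.Tensor, epsilon: float = 0.2) -> int:
    """values: per-action values for one agent, shape (n_actions,)."""
    if torch.rand(1).item() < epsilon:
        return int(torch.randint(values.numel(), (1,)).item())   # explore
    return int(values.argmax().item())                           # exploit
```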
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.

Claims (7)

1. A multi-agent deep reinforcement learning method based on safety exploration, characterized by comprising the following steps:
Step 1: initializing N agents, setting the number of learning rounds, and initializing the environment at the beginning of each round;
Step 2: setting the number of time steps K per round, initializing a safety experience pool and a dangerous experience pool, each with maximum capacity N, initializing the learning rate α and the discount factor γ, and setting the current time step k = 1;
Step 3: obtaining each agent's local observation o_i^(k) and current state-action sequence τ_i^(k), inputting o_i^(k) and τ_i^(k) into the Q network based on discounted rewards to obtain each agent's expected reward value Q_i, and at the same time inputting them into the G network based on discounted risk values to obtain each agent's estimated risk value G_i;
Step 4: correcting the Q network based on discounted rewards with the G network based on discounted risk values to obtain each agent's corrected expected reward value;
Step 5: inputting the corrected expected reward values of all agents into the first hybrid network to obtain the estimated value of the joint action, and at the same time inputting the estimated risk values G_i into the second hybrid network to obtain the estimated total risk;
Step 6: updating the parameters of the first hybrid network based on each agent's corrected expected reward value and the estimated value of the joint action, and at the same time updating the parameters of the second hybrid network based on each agent's expected reward value Q_i and the estimated total risk;
Step 7: repeating steps 3-6, increasing k by 1 on each repetition, until k equals K;
Step 8: repeating steps 2-6 for the set number of learning rounds.
2. The multi-agent deep reinforcement learning method based on safety exploration according to claim 1, wherein in step 4 the G network based on discounted risk values corrects the Q network based on discounted rewards according to a correction formula in which a_i^(k) denotes the action selected by each agent.
3. The multi-agent deep reinforcement learning method based on safety exploration according to claim 1, wherein updating the parameters of the first hybrid network in step 6 comprises:
Step A1: obtaining the greedy joint action, executing each action a_i^(k), and obtaining the instant reward r_i^(k);
Step A2: according to the instant reward r_i^(k) and the estimated value of the joint action, updating the parameters of the first hybrid network by stochastic gradient descent so as to minimize the loss function, wherein, while the parameters of the first hybrid network are being updated, each agent's Q network based on discounted rewards automatically updates its own parameters through gradient back-propagation.
4. The multi-agent deep reinforcement learning method based on safety exploration according to claim 3, wherein the loss function to be minimized is constructed from the instant rewards and the estimated value of the joint action.
5. The multi-agent deep reinforcement learning method based on safety exploration according to claim 1, wherein updating the parameters of the second hybrid network in step 6 comprises:
Step B1: using each agent's expected reward value Q_i, selecting each agent's action by the ε-greedy method to obtain the selected action a_i^(k);
Step B2: acquiring the global state s^(k) of the current environment, wherein the global state s^(k) includes the local observations o_i^(k) and an index reflecting the current task-execution phase;
Step B3: according to each agent's selected action a_i^(k) and the global state s^(k) of the current environment, obtaining the instant risk value ω_i^(k) incurred when the agent executes a_i^(k) in state s^(k); if ω_i^(k) is 0, storing the data group {s_i^(k), a_i^(k), ω_i^(k)} in the safety experience pool, otherwise storing it in the dangerous experience pool;
Step B4: for each agent, randomly extracting a number of data groups from the safety experience pool and a number of data groups from the dangerous experience pool to form a sample pool B of fixed size;
Step B5: based on the sample pool B and the estimated total risk, updating the parameters of the second hybrid network, wherein, while the parameters of the second hybrid network are being updated, each agent's G network based on discounted risk values automatically updates its own parameters through gradient back-propagation.
6. The multi-agent deep reinforcement learning method based on safety exploration according to claim 5, wherein, in the formula for updating the parameters of the second hybrid network based on the sample pool B and the estimated total risk, μ measures the impact of future long-term risk: the greater its value, the more significant the influence of future long-term risk during the update.
7. The multi-agent deep reinforcement learning method based on safety exploration according to claim 1, wherein the first hybrid network and the second hybrid network are each initialized with their own parameters, and both are neural networks.
CN202311811726.9A 2023-12-27 2023-12-27 Multi-agent deep reinforcement learning method based on safety exploration Pending CN117875375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311811726.9A CN117875375A (en) 2023-12-27 2023-12-27 Multi-agent deep reinforcement learning method based on safety exploration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311811726.9A CN117875375A (en) 2023-12-27 2023-12-27 Multi-agent deep reinforcement learning method based on safety exploration

Publications (1)

Publication Number Publication Date
CN117875375A true CN117875375A (en) 2024-04-12

Family

ID=90585765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311811726.9A Pending CN117875375A (en) 2023-12-27 2023-12-27 Multi-agent deep reinforcement learning method based on safety exploration

Country Status (1)

Country Link
CN (1) CN117875375A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118051063A (en) * 2024-04-16 2024-05-17 中国民用航空飞行学院 Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle


Similar Documents

Publication Publication Date Title
CN111884213B (en) Power distribution network voltage adjusting method based on deep reinforcement learning algorithm
CN110852448A (en) Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN117875375A (en) Multi-agent deep reinforcement learning method based on safety exploration
CN110442135A (en) A kind of unmanned boat paths planning method and system based on improved adaptive GA-IAGA
CN107886201B (en) Multi-objective optimization method and device for multi-unmanned aerial vehicle task allocation
CN110471444A (en) UAV Intelligent barrier-avoiding method based on autonomous learning
CN112937564A (en) Lane change decision model generation method and unmanned vehicle lane change decision method and device
CN111950873B (en) Satellite real-time guiding task planning method and system based on deep reinforcement learning
He et al. Deep reinforcement learning based local planner for UAV obstacle avoidance using demonstration data
CN110991972A (en) Cargo transportation system based on multi-agent reinforcement learning
CN107016464A (en) Threat estimating method based on dynamic bayesian network
CN113885555A (en) Multi-machine task allocation method and system for power transmission line dense channel routing inspection
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN113641192A (en) Route planning method for unmanned aerial vehicle crowd sensing task based on reinforcement learning
CN112198892A (en) Multi-unmanned aerial vehicle intelligent cooperative penetration countermeasure method
CN112183288B (en) Multi-agent reinforcement learning method based on model
CN113561986A (en) Decision-making method and device for automatically driving automobile
CN107016212A (en) Intention analysis method based on dynamic Bayesian network
CN113507717A (en) Unmanned aerial vehicle track optimization method and system based on vehicle track prediction
CN114815891A (en) PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method
CN113283827B (en) Two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning
CN113283013B (en) Multi-unmanned aerial vehicle charging and task scheduling method based on deep reinforcement learning
CN113299079B (en) Regional intersection signal control method based on PPO and graph convolution neural network
CN111767991B (en) Measurement and control resource scheduling method based on deep Q learning
CN113110101A (en) Production line mobile robot gathering type recovery and warehousing simulation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination