CN117875375A - Multi-agent deep reinforcement learning method based on safety exploration - Google Patents

Multi-agent deep reinforcement learning method based on safety exploration

Info

Publication number
CN117875375A
CN117875375A (application CN202311811726.9A)
Authority
CN
China
Prior art keywords
agent
network
value
parameters
risk
Prior art date
2023-12-27
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311811726.9A
Other languages
Chinese (zh)
Inventor
卢晓珍
林龙河
刘智博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-12-27 Application filed by Nanjing University of Aeronautics and Astronautics
2023-12-27 Priority to CN202311811726.9A
2024-04-12 Publication of CN117875375A
Status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a multi-agent deep reinforcement learning method based on safety exploration that avoids dangerous actions without degrading the agents' task performance. The method adds a G network based on long-term risk values to a conventional multi-agent deep reinforcement learning network, estimates the risk value of an agent's actions through deep reinforcement learning, and uses the estimated risk value to correct the reward obtained by the action, so that the actions the agents select during exploration are safe. During training, the method uses dual experience pools based on instant early-warning values to ensure that the agents explore sufficiently. At the same time, the method uses a deep neural network to perform credit assignment among the agents and adopts a centralized-training, distributed-execution framework so that the agents can cooperate.

Description

Multi-agent deep reinforcement learning method based on safety exploration
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a multi-agent deep reinforcement learning method based on safety exploration.
Background
Multi-agent deep reinforcement learning is an important research direction in machine learning and can be applied to important scenarios such as autonomous driving, energy allocation, and trajectory planning. As social informatization deepens, the security of algorithms is becoming increasingly important. At present, many tasks place high demands on the safety of each agent, and a dangerous policy can cause irreversible, serious damage, so how to guarantee safety without excessively sacrificing system performance has become an important research problem.
Research on multi-agent deep reinforcement learning algorithms is abundant both in China and abroad. In 2015, Ardi Tampuu et al. proposed the Independent Q-Learning (IQL) algorithm, a fully decentralized method in which each agent trains independently on its own observation data. The drawback of this approach is that it treats each agent as completely independent of the others and ignores the interactions between agents. The centralized-training, distributed-execution framework addresses this problem; for example, Peter Sunehag et al. proposed the Value-Decomposition Network (VDN) algorithm, which combines the rewards obtained by each agent into a total reward through a linear combination and performs centralized training to update each agent's network. At execution time, each agent selects its policy according to its own network, so the agents can cooperate. However, in complex cooperative tasks the contribution of each agent to the overall system is unlikely to be a simple linear relationship, which degrades the algorithm's performance. In 2018, Tabish Rashid et al. proposed the QMIX algorithm as an improvement on VDN; it uses a deep neural network to perform credit assignment so that each agent's contribution to the system can be allocated reasonably in complex tasks. Chinese patent ZL202111495010.3 provides a cooperative combat method and device for agents that improves on the QMIX framework and establishes a multi-UAV cooperative air-combat decision network model, demonstrating the good application potential of the algorithm.
However, the above methods have difficulty addressing safety during exploration. Existing approaches to making an agent's exploratory actions safer mainly set negative rewards or restrict the action space, which reduces the agent's ability to explore and lowers the final performance of the multi-agent system. The present invention adds a neural network based on long-term risk and dual experience replay pools based on instant early-warning values to multi-agent deep reinforcement learning, and uses the long-term risk to correct the values of the agents' candidate actions, so that the agents' safety requirements are met while their ability to explore is preserved.
Disclosure of Invention
The purpose of the invention is to balance the safety and the system performance of a multi-agent system: a Q network based on discounted rewards is corrected by a G network based on discounted risk values, and policy selection is performed with the corrected network, so that the multi-agent system is kept safe during exploration, can explore sufficiently during learning, and obtains a safe policy with higher return.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a multi-agent deep reinforcement learning method based on safety exploration, comprising the following steps:
Step 1: initialize N agents, set the number of learning rounds, and initialize the environment at the beginning of each round;
Step 2: set the number of time steps K per round, initialize a safety experience pool and a dangerous experience pool, each with maximum capacity N, initialize the learning rate α and the discount factor γ, and set the current time step k = 1;
Step 3: obtain each agent's local observation o_i^(k) and current state-action sequence τ_i^(k); input o_i^(k) and τ_i^(k) into the Q network based on discounted rewards to obtain each agent's expected reward value Q_i, and at the same time input them into the G network based on discounted risk values to obtain each agent's estimated risk value G_i;
Step 4: correct the Q network based on discounted rewards with the G network based on discounted risk values to obtain each agent's corrected expected reward value;
Step 5: input the corrected expected reward values of all agents into the first hybrid network to obtain the estimated value of the joint action, and at the same time input the estimated risk values G_i into the second hybrid network to obtain the estimated total risk;
Step 6: update the parameters of the first hybrid network based on each agent's corrected expected reward value and the estimated value of the joint action, and at the same time update the parameters of the second hybrid network based on each agent's expected reward value Q_i and the estimated total risk;
Step 7: repeat steps 3-6, increasing k by 1 on each repetition, until k equals K;
Step 8: repeat steps 2-6 for the set number of learning rounds.
Further, in step 4 the G network based on discounted risk values corrects the Q network based on discounted rewards according to a correction formula in which a_i^(k) denotes the action selected by each agent.
Further, updating the parameters of the first hybrid network in step 6 comprises:
Step A1: obtain the greedy joint action, execute each action a_i^(k), and obtain the instant reward r_i^(k);
Step A2: according to the instant reward r_i^(k) and the estimated value of the joint action, update the parameters of the first hybrid network by stochastic gradient descent so as to minimize the loss function; while the parameters of the first hybrid network are being updated, each agent's Q network based on discounted rewards automatically updates its own parameters through gradient back-propagation.
Further, the loss function to be minimized is constructed from the instant rewards and the estimated value of the joint action.
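The loss formula itself is not reproduced in this text, so the sketch below assumes a conventional temporal-difference loss of the form (r + γ·max Q_tot' − Q_tot)², minimized by stochastic gradient descent; the function names, tensor shapes, and the use of PyTorch are illustrative assumptions rather than details of the disclosure.

```python
# Hedged sketch of the first-hybrid-network update (step A2), assuming a
# standard TD loss; not the patented formula itself.
import torch

def first_mixer_loss(q_tot: torch.Tensor, reward: torch.Tensor,
                     q_tot_next_max: torch.Tensor, gamma: float = 0.8) -> torch.Tensor:
    """q_tot: joint value of the executed joint action, shape (batch, 1).
    q_tot_next_max: joint value of the greedy joint action at the next step."""
    target = reward + gamma * q_tot_next_max.detach()   # bootstrap target
    return ((target - q_tot) ** 2).mean()

# A typical update step: loss.backward() propagates gradients through the first
# hybrid network into every agent's Q network, which is how each agent
# "automatically updates its own parameters" during this update.
# optimizer = torch.optim.SGD(all_parameters, lr=alpha)
# loss = first_mixer_loss(q_tot, reward, q_tot_next_max)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```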
further, the parameters of the second hybrid network in the step 6The step of updating includes:
step B1: desired prize value Q by each agent i Selecting the actions of each agent by using an E-greedy method to obtain the action a selected by each agent i (k)
Step B2: acquiring global state s of current environment (k) Wherein the global state s (k) Including local observation o i (k) And an index reflecting the current task execution phase;
step B3: action a selected according to each agent i (k) And global state s of the current environment (k) Obtaining the positions s of each intelligent agent (k) Lower execution a i (k) The instant risk value omega is obtained i (k) If omega i (k) 0, will { s } i (k) ,a i (k)i (k) For storing a set of data in a secure experience poolOtherwise store it in the dangerous experience pool +.>
Step B4: among the agents, for each agent, from a secure experience poolRandom extraction of->Group data and risk experience pool->Random extraction of->Group data, composition size +.>Is>
Step B5: based on the sample pool B and the estimated total riskUpdating the parameters of the second hybrid network +.>And parameters in the second hybrid network +.>In the updating process, each agent automatically updates own parameters according to gradient back propagation based on the G network of the discount risk value.
Further, the sample cellAnd predicting total risk->Updating the parameters of the second hybrid network +.>The updated formula of (2) is:
wherein,μ measures the impact of future long-term risk, the greater its value, the more significant the impact of future long-term risk received during the update.
Further, the initialization parameters of the first hybrid network are as followsThe second hybrid network initialization parameters are as followsAnd the first hybrid network and the second hybrid network are both neural networks.
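As an illustration of the dual experience pools in steps B3 and B4 above, the following is a minimal sketch assuming bounded FIFO buffers and uniform random sampling; the per-pool sample counts are left as parameters because their exact values appear only symbolically in the original filing.

```python
# Sketch of the safety / dangerous experience pools and of forming the sample
# pool B. Buffer type and sampling scheme are assumptions.
import random
from collections import deque

class DualReplay:
    def __init__(self, capacity: int = 32):
        self.safe = deque(maxlen=capacity)    # groups with instant risk omega == 0
        self.risky = deque(maxlen=capacity)   # groups with instant risk omega != 0

    def store(self, state, action, risk):
        (self.safe if risk == 0 else self.risky).append((state, action, risk))

    def sample(self, n_safe: int, n_risk: int):
        """Mix data from both pools into the sample pool B used to update the
        G network and the second hybrid network."""
        batch = random.sample(list(self.safe), min(n_safe, len(self.safe))) + \
                random.sample(list(self.risky), min(n_risk, len(self.risky)))
        random.shuffle(batch)
        return batch
```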
Beneficial effects: the invention corrects the Q network based on discounted rewards with the G network based on discounted risk values, comprehensively considering long-term risk and instant early warning, which gives better safety performance; at the same time, the first hybrid network and the second hybrid network are used to assign credit to each agent reasonably, so that a better-performing model can be trained.
Drawings
Fig. 1 is a schematic diagram of a first hybrid network according to the present invention.
Fig. 2 is a schematic diagram of a second hybrid network according to the present invention.
Fig. 3 is a simulation diagram of an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings.
As shown in figs. 1-2, the present invention provides a multi-agent deep reinforcement learning method based on safety exploration, comprising:
Step 1: initialize N agents, set the number of learning rounds, and reset the environment at the beginning of each round.
Step 2: initialize the environment, set the number of time steps K per round, initialize a safety experience pool and a dangerous experience pool, each with maximum capacity N, initialize the learning rate α and the discount factor γ, and set the current time step k = 1.
Step 3: obtain each agent's local observation o_i^(k) and current state-action sequence τ_i^(k); input o_i^(k) and τ_i^(k) into the Q network based on discounted rewards to obtain each agent's expected reward value Q_i, and at the same time input them into the G network based on discounted risk values to obtain each agent's estimated risk value G_i.
In step 3, the Q network based on discounted rewards and the G network based on discounted risk values are both deep neural networks. The Q network measures the agent's performance: the better the agent performs, the higher the reward value it assigns, the reward being an evaluation of the agent's current action. The G network measures how dangerous the agent's actions are: the more dangerous the action, the higher the corresponding risk value, the risk value being an evaluation of the danger of the agent's current action.
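The following is a minimal PyTorch sketch of the kind of per-agent value network described above; the layer sizes, the use of a GRU cell to summarize the state-action sequence τ_i^(k), and all class and argument names are illustrative assumptions rather than details taken from this disclosure.

```python
# Per-agent network sketch: maps (local observation, state-action history) to
# one value per action. The same structure is assumed for the Q network based
# on discounted rewards and for the G network based on discounted risk values;
# only their training targets differ (rewards vs. risk values).
import torch
import torch.nn as nn

class AgentValueNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim + n_actions, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)   # summarizes tau_i^(k)
        self.head = nn.Linear(hidden_dim, n_actions)    # Q_i(., a) or G_i(., a)

    def forward(self, obs, last_action_onehot, hidden):
        x = torch.relu(self.encoder(torch.cat([obs, last_action_onehot], dim=-1)))
        h = self.rnn(x, hidden)
        return self.head(h), h

# Illustrative instantiation (an 8-dimensional observation and 6 actions are assumed):
q_net = AgentValueNet(obs_dim=8, n_actions=6)   # expected reward values Q_i
g_net = AgentValueNet(obs_dim=8, n_actions=6)   # estimated risk values G_i
```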
Step 4: use the G network based on discounted risk values to correct the Q network based on discounted rewards, obtaining each agent's corrected expected reward value.
In step 4, the G network based on discounted risk values corrects the Q network based on discounted rewards, and policy selection is performed with the corrected network; this keeps the multi-agent system safe during exploration while allowing it to explore sufficiently during learning, so that a safe policy with higher return is obtained, where a_i^(k) denotes the action selected by each agent.
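The correction formula (1.1) is not reproduced in this text, so the sketch below assumes a simple risk-penalized form, corrected Q = Q − ρ·G, with ρ a weighting coefficient (the embodiment below uses ρ = 0.1); it illustrates the idea of correcting the discounted-reward values with the discounted-risk values and is not the patented formula itself.

```python
# Hedged sketch of the step-4 correction; the linear penalty form is an assumption.
import torch

def corrected_q(q_values: torch.Tensor, g_values: torch.Tensor, rho: float = 0.1) -> torch.Tensor:
    """q_values, g_values: per-action tensors for one agent, shape (n_actions,).
    Returns the corrected expected reward values used for policy selection."""
    return q_values - rho * g_values
```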
Step 5: input the corrected expected reward values of all agents into the first hybrid network to obtain the estimated value of the joint action, and at the same time input each agent's estimated risk value G_i into the second hybrid network to obtain the estimated total risk.
In step 5, the first hybrid network and the second hybrid network are both neural networks; they perform credit assignment for the agents, predicting the total benefit of the system from each agent's benefit, so that a better-performing model can be trained.
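A hedged sketch of a hybrid (mixing) network follows, assuming a QMIX-style monotonic mixer whose weights are generated by hypernetworks conditioned on the global state s^(k); the disclosure only states that both hybrid networks are neural networks used for credit assignment, so the architecture and sizes here are assumptions.

```python
# Mixing-network sketch: the first hybrid network would mix the corrected Q
# values into the joint-action estimate, the second would mix the per-agent
# risk values G_i into the estimated total risk.
import torch
import torch.nn as nn

class MixingNet(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_values, state):
        # agent_values: (batch, n_agents); state: (batch, state_dim)
        b = agent_values.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_values.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)  # joint value or total risk
```

The absolute value applied to the hypernetwork weights keeps the mixer monotonic in each agent's input, a common way to make per-agent credit assignment consistent with the joint value.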
Step 6: update the parameters of the first hybrid network based on each agent's corrected expected reward value and the estimated value of the joint action, and at the same time update the parameters of the second hybrid network based on each agent's expected reward value Q_i and the estimated total risk.
In step 6, updating the parameters of the first hybrid network comprises:
Step A1: obtain the greedy joint action, execute each action a_i^(k), and obtain the instant reward r_i^(k); the symbol used here merely refers to the argmax that takes the maximum Q value over all possible actions and has no further meaning;
Step A2: according to the instant reward r_i^(k) and the estimated value of the joint action, update the parameters of the first hybrid network by stochastic gradient descent so as to minimize the loss function; while the parameters of the first hybrid network are being updated, each agent's Q network based on discounted rewards automatically updates its own parameters through gradient back-propagation.
Updating the parameters of the second hybrid network comprises:
Step B1: using each agent's expected reward value Q_i, select each agent's action by the ε-greedy method to obtain the selected action a_i^(k);
Step B2: acquire the global state s^(k) of the current environment, where the global state s^(k) includes the local observations o_i^(k) and an index reflecting the current task-execution phase;
Step B3: according to each agent's selected action a_i^(k) and the global state s^(k) of the current environment, obtain the instant risk value ω_i^(k) incurred when the agent executes a_i^(k) in state s^(k); if ω_i^(k) is 0, store the data group {s_i^(k), a_i^(k), ω_i^(k)} in the safety experience pool, otherwise store it in the dangerous experience pool;
Step B4: for each agent, randomly extract a number of data groups from the safety experience pool and a number of data groups from the dangerous experience pool to form a sample pool B of fixed size;
Step B5: based on the sample pool B and the estimated total risk, update the parameters of the second hybrid network according to the update formula of this process; while the parameters of the second hybrid network are being updated, each agent's G network based on discounted risk values automatically updates its own parameters through gradient back-propagation,
wherein a* merely refers to the argmin that takes the minimum G value over all actions and has no further meaning, the sample pool is a set, and F_{j,z} denotes the j-th element of the z-th data group in the sample pool.
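Because the update formula (1.3) is not reproduced in this text, the sketch below assumes a temporal-difference-style target ω_i^(k) + μ·min_a G_i(τ_i^(k+1), a), consistent with the statements that a* denotes the argmin over actions and that μ weights the influence of future long-term risk; the form of the target and the names used are assumptions, not the patented formula.

```python
# Hedged sketch of a G-network target built from a batch drawn from the sample
# pool B (instant risk plus discounted minimum future risk).
import torch

def g_target(instant_risk: torch.Tensor, g_next: torch.Tensor, mu: float) -> torch.Tensor:
    """instant_risk: omega_i^(k), shape (batch,).
    g_next: G values over all next-step actions, shape (batch, n_actions)."""
    return instant_risk + mu * g_next.min(dim=-1).values.detach()

def g_loss(g_taken: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Squared error between the G value of the executed action and the target;
    # gradients flow through the second hybrid network into each agent's G network.
    return ((target - g_taken) ** 2).mean()
```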
Step 7: repeat steps 3-6, increasing k by 1 on each repetition, until k equals K.
Step 8: repeat steps 2-6 for the set number of learning rounds.
In a specific implementation, the agent is specifically an unmanned aerial vehicle.
The specific implementation of the invention is as follows:
as shown in fig. 3, the environment is a three-dimensional space that simulates a drone flight. The figure includes an initial position 1 of the unmanned aerial vehicle, namely a starting point, a target position 2 of the unmanned aerial vehicle, namely an ending point, and a columnar barrier 3 in space. The unmanned aerial vehicle target 2 reaches the end point as soon as possible on the premise of collision to the columnar obstacle 3 as few as possible, the unmanned aerial vehicle position is represented by three-dimensional coordinates, and the unmanned aerial vehicle actions comprise 6 actions which are respectively upward, downward, left, right, front and back movement by one unit.
Step 1: deploy 2 unmanned aerial vehicles in the unmanned aerial vehicle system and set the number of learning rounds.
Step 2: initialize the environment and reset the unmanned aerial vehicles to the starting point; set the number of time steps per round K = 300; initialize a safety experience pool and a dangerous experience pool, each with maximum capacity N = 32; initialize the learning rate α = 0.001 and the discount factor γ = 0.8; and set the current time step k = 1.
Step 3: obtain each unmanned aerial vehicle's local observation o_i^(k) and current state-action sequence τ_i^(k); input o_i^(k) and τ_i^(k) into the Q network based on discounted rewards to obtain the predicted expected reward value Q_i for each action of each unmanned aerial vehicle, and at the same time input them into the G network based on discounted risk values to obtain each unmanned aerial vehicle's estimated risk value G_i, where the local observation o_i^(k) includes part of the map information, the unmanned aerial vehicle's current position, and its damage-level information.
Step 4: based on formula (1.1), correct the Q network based on discounted rewards with the G network based on discounted risk values to obtain each unmanned aerial vehicle's corrected expected reward value, where ρ = 0.1 and τ_i^(k) at this time includes the states and actions of the unmanned aerial vehicle at the current moment and the previous 4 moments.
Step 5: input the corrected expected reward values of the unmanned aerial vehicles into the first hybrid network to obtain the estimated value of the joint action, and at the same time input each unmanned aerial vehicle's estimated risk value G_i into the second hybrid network to obtain the estimated total risk.
Step 6: obtain the greedy joint action, execute each action a_i^(k), and obtain the instant reward r_i^(k); then, according to the instant reward r_i^(k) and the estimated value of the joint action, define a loss function and update the parameters of the first hybrid network by stochastic gradient descent, as in equation (1.2), so as to minimize the loss function; while the parameters of the first hybrid network are being updated, each unmanned aerial vehicle's Q network based on discounted rewards automatically updates its own parameters through gradient back-propagation. At the same time, with probability 0.8 each unmanned aerial vehicle selects the action with the largest Q_i value and with probability 0.2 selects another action at random, giving each unmanned aerial vehicle's selected action a_i^(k); the global state s^(k) of the current environment is acquired, where the global state s^(k) includes the local observations o_i^(k) and an index reflecting the current task-execution phase (this index contains the information of the whole map and the positions of the obstacles). Then, according to each agent's selected action a_i^(k) and the global state s^(k), the instant risk value ω_i^(k) incurred when each unmanned aerial vehicle executes a_i^(k) in state s^(k) is obtained; if ω_i^(k) is 0, the data group {s_i^(k), a_i^(k), ω_i^(k)} is stored in the safety experience pool, otherwise it is stored in the dangerous experience pool. A number of data groups are then randomly extracted from the safety experience pool and from the dangerous experience pool to form a sample pool B of fixed size, and the parameters of the second hybrid network are updated by equation (1.3); while the parameters of the second hybrid network are being updated, each unmanned aerial vehicle's G network based on discounted risk values automatically updates its own parameters through gradient back-propagation.
The invention corrects the Q network based on discounted rewards with the G network based on discounted risk values and performs policy selection with the corrected network, which keeps the agents safe during exploration while allowing sufficient exploration during learning and thus yields a safe policy with higher return; in addition, the first hybrid network and the second hybrid network are used to assign credit to each agent reasonably, so that a better-performing model can be trained.
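As a concrete illustration of the action selection described in the embodiment above (probability 0.8 of taking the action with the largest value, probability 0.2 of a random action), the following is a minimal ε-greedy sketch with ε = 0.2; tensor shapes and the function name are assumptions.

```python
# Epsilon-greedy selection over one drone's (corrected) per-action values.
import torch

def epsilon_greedy(values: torch.Tensor, epsilon: float = 0.2) -> int:
    """values: per-action values for one agent, shape (n_actions,)."""
    if torch.rand(1).item() < epsilon:
        return int(torch.randint(values.numel(), (1,)).item())   # explore
    return int(values.argmax().item())                           # exploit
```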
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.

Claims (7)

1. A multi-agent deep reinforcement learning method based on safety exploration, characterized by comprising the following steps:
Step 1: initializing N agents, setting the number of learning rounds, and initializing the environment at the beginning of each round;
Step 2: setting the number of time steps K per round, initializing a safety experience pool and a dangerous experience pool, each with maximum capacity N, initializing the learning rate α and the discount factor γ, and setting the current time step k = 1;
Step 3: obtaining each agent's local observation o_i^(k) and current state-action sequence τ_i^(k), inputting o_i^(k) and τ_i^(k) into the Q network based on discounted rewards to obtain each agent's expected reward value Q_i, and at the same time inputting them into the G network based on discounted risk values to obtain each agent's estimated risk value G_i;
Step 4: correcting the Q network based on discounted rewards with the G network based on discounted risk values to obtain each agent's corrected expected reward value;
Step 5: inputting the corrected expected reward values of all agents into the first hybrid network to obtain the estimated value of the joint action, and at the same time inputting the estimated risk values G_i into the second hybrid network to obtain the estimated total risk;
Step 6: updating the parameters of the first hybrid network based on each agent's corrected expected reward value and the estimated value of the joint action, and at the same time updating the parameters of the second hybrid network based on each agent's expected reward value Q_i and the estimated total risk;
Step 7: repeating steps 3-6, increasing k by 1 on each repetition, until k equals K;
Step 8: repeating steps 2-6 for the set number of learning rounds.
2. The multi-agent deep reinforcement learning method based on safety exploration according to claim 1, wherein in step 4 the G network based on discounted risk values corrects the Q network based on discounted rewards according to a correction formula in which a_i^(k) denotes the action selected by each agent.
3. The multi-agent deep reinforcement learning method based on safety exploration according to claim 1, wherein updating the parameters of the first hybrid network in step 6 comprises:
Step A1: obtaining the greedy joint action, executing each action a_i^(k), and obtaining the instant reward r_i^(k);
Step A2: according to the instant reward r_i^(k) and the estimated value of the joint action, updating the parameters of the first hybrid network by stochastic gradient descent so as to minimize the loss function, wherein, while the parameters of the first hybrid network are being updated, each agent's Q network based on discounted rewards automatically updates its own parameters through gradient back-propagation.
4. The multi-agent deep reinforcement learning method based on safety exploration according to claim 3, wherein the loss function to be minimized is constructed from the instant rewards and the estimated value of the joint action.
5. The multi-agent deep reinforcement learning method based on safety exploration according to claim 1, wherein updating the parameters of the second hybrid network in step 6 comprises:
Step B1: using each agent's expected reward value Q_i, selecting each agent's action by the ε-greedy method to obtain the selected action a_i^(k);
Step B2: acquiring the global state s^(k) of the current environment, wherein the global state s^(k) includes the local observations o_i^(k) and an index reflecting the current task-execution phase;
Step B3: according to each agent's selected action a_i^(k) and the global state s^(k) of the current environment, obtaining the instant risk value ω_i^(k) incurred when the agent executes a_i^(k) in state s^(k); if ω_i^(k) is 0, storing the data group {s_i^(k), a_i^(k), ω_i^(k)} in the safety experience pool, otherwise storing it in the dangerous experience pool;
Step B4: for each agent, randomly extracting a number of data groups from the safety experience pool and a number of data groups from the dangerous experience pool to form a sample pool B of fixed size;
Step B5: based on the sample pool B and the estimated total risk, updating the parameters of the second hybrid network, wherein, while the parameters of the second hybrid network are being updated, each agent's G network based on discounted risk values automatically updates its own parameters through gradient back-propagation.
6. The multi-agent deep reinforcement learning method based on safety exploration according to claim 5, wherein, in the formula for updating the parameters of the second hybrid network based on the sample pool B and the estimated total risk, μ measures the impact of future long-term risk: the greater its value, the more significant the influence of future long-term risk during the update.
7. The multi-agent deep reinforcement learning method based on safety exploration according to claim 1, wherein the first hybrid network and the second hybrid network are each initialized with their own parameters, and both are neural networks.
CN202311811726.9A 2023-12-27 2023-12-27 Multi-agent deep reinforcement learning method based on safety exploration Pending CN117875375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311811726.9A CN117875375A (en) 2023-12-27 2023-12-27 Multi-agent deep reinforcement learning method based on safety exploration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311811726.9A CN117875375A (en) 2023-12-27 2023-12-27 Multi-agent deep reinforcement learning method based on safety exploration

Publications (1)

Publication Number Publication Date
CN117875375A true CN117875375A (en) 2024-04-12

Family

ID=90585765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311811726.9A Pending CN117875375A (en) 2023-12-27 2023-12-27 Multi-agent deep reinforcement learning method based on safety exploration

Country Status (1)

Country Link
CN (1) CN117875375A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118051063A (en) * 2024-04-16 2024-05-17 中国民用航空飞行学院 Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle


Similar Documents

Publication Publication Date Title
CN111884213B (en) Power distribution network voltage adjusting method based on deep reinforcement learning algorithm
CN110852448A (en) Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN117875375A (en) Multi-agent deep reinforcement learning method based on safety exploration
CN110442135A (en) A kind of unmanned boat paths planning method and system based on improved adaptive GA-IAGA
CN107886201B (en) Multi-objective optimization method and device for multi-unmanned aerial vehicle task allocation
CN110471444A (en) UAV Intelligent barrier-avoiding method based on autonomous learning
CN112937564A (en) Lane change decision model generation method and unmanned vehicle lane change decision method and device
CN111950873B (en) Satellite real-time guiding task planning method and system based on deep reinforcement learning
He et al. Deep reinforcement learning based local planner for UAV obstacle avoidance using demonstration data
CN110991972A (en) Cargo transportation system based on multi-agent reinforcement learning
CN107016464A (en) Threat estimating method based on dynamic bayesian network
CN113885555A (en) Multi-machine task allocation method and system for power transmission line dense channel routing inspection
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN113641192A (en) Route planning method for unmanned aerial vehicle crowd sensing task based on reinforcement learning
CN112198892A (en) Multi-unmanned aerial vehicle intelligent cooperative penetration countermeasure method
CN112183288B (en) Multi-agent reinforcement learning method based on model
CN113561986A (en) Decision-making method and device for automatically driving automobile
CN107016212A (en) Intention analysis method based on dynamic Bayesian network
CN113507717A (en) Unmanned aerial vehicle track optimization method and system based on vehicle track prediction
CN114815891A (en) PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method
CN113283827B (en) Two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning
CN113283013B (en) Multi-unmanned aerial vehicle charging and task scheduling method based on deep reinforcement learning
CN113299079B (en) Regional intersection signal control method based on PPO and graph convolution neural network
CN111767991B (en) Measurement and control resource scheduling method based on deep Q learning
CN113110101A (en) Production line mobile robot gathering type recovery and warehousing simulation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination