CN117875375A - Multi-agent deep reinforcement learning method based on safety exploration - Google Patents
- Publication number
- CN117875375A (application number CN202311811726.9A)
- Authority
- CN
- China
- Prior art keywords
- agent
- network
- value
- parameters
- risk
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a multi-agent deep reinforcement learning method based on safe exploration, which avoids dangerous actions without degrading the task performance of the agents. The method adds a G network based on long-term risk values to a conventional multi-agent deep reinforcement learning network, estimates the risk value of each agent's action through deep reinforcement learning, and uses the estimated risk value to correct the reward obtained by that action, thereby ensuring the safety of the actions the agents select during exploration. During training, the method uses dual experience pools based on instant early-warning values to ensure that the agents explore sufficiently. Meanwhile, the method uses a deep neural network to perform credit assignment among the agents and adopts a centralized-training, distributed-execution framework so that the agents can cooperate.
Description
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a multi-agent deep reinforcement learning method based on safe exploration.
Background
Multi-agent deep reinforcement learning is an important research direction in the field of machine learning and can be applied to important scenarios such as autonomous driving, energy distribution, and trajectory planning. As social informatization deepens, the security of algorithms is becoming more and more important. At present, many tasks place high demands on the safety of each agent, and dangerous strategies can cause irreversible, serious damage, so how to guarantee safety without excessively sacrificing system performance has become an important research problem.
Research on multi-agent deep reinforcement learning algorithms at home and abroad is quite abundant. In 2015, Ardi Tampuu et al. proposed the Independent Q-Learning (IQL) algorithm, a fully decentralized method in which each agent trains independently on its own observation data. The disadvantage of this approach is that it treats each agent as completely independent of the others, ignoring the influence agents have on one another. The centralized-training, distributed-execution framework solves this problem: for example, Peter Sunehag et al. proposed the Value-Decomposition Network (VDN) algorithm, which combines the rewards obtained by each agent into a total reward through a linear combination and performs centralized training to update each agent's network. During execution, each agent selects its policy according to its own network, so the agents can cooperate with one another. However, in complex cooperative tasks the contribution of each agent to the overall system is likely not a simple linear relationship, which can lead to poor algorithm performance. In 2018, Tabish Rashid et al. proposed the QMIX algorithm to improve on VDN, using a deep neural network to perform credit assignment among the agents so that each agent's contribution to the system can be allocated reasonably in complex tasks. The patent with Chinese patent number ZL202111495010.3 provides a cooperative combat method and device for agents; it improves on the QMIX framework, establishes a multi-UAV cooperative air-combat decision network model, and demonstrates the good application potential of the algorithm.
However, the above methods have difficulty solving the safety problem during exploration. To make the agents' exploratory actions safer, existing methods mainly set negative rewards or restrict the action space, which reduces the agents' ability to explore and thus the final performance of the multi-agent system. The present method adds a neural network based on long-term risk and dual experience replay pools based on instant early-warning values to multi-agent deep reinforcement learning, and uses the long-term risk to correct the values of the agents' optional actions, so that the safety requirements of the agents are met while their exploratory capability is preserved.
Disclosure of Invention
The invention aims to balance the safety and the system performance of a multi-agent system. It revises a Q network based on discounted rewards with a G network based on discounted risk values and performs policy selection with the revised network, so that the safety of the multi-agent system during exploration is guaranteed, the system can explore sufficiently during learning, and a safe policy with higher benefit is obtained.
in order to achieve the above purpose, the present invention adopts the following technical scheme: a multi-agent deep reinforcement learning method based on safety exploration comprises the following steps:
Step 1: initialize N agents, set the number of learning rounds, and initialize the environment at the beginning of each round of learning;
Step 2: set the number of time steps K in each round of learning, initialize a safe experience pool and a dangerous experience pool, each with a maximum capacity of N, initialize the learning rate α and the discount factor γ, and set the current time k = 1;
Step 3: obtain each agent's local observation o_i(k) and current state-action sequence τ_i(k); input o_i(k) and τ_i(k) into the Q network based on discounted rewards to obtain each agent's expected reward value Q_i, and at the same time input o_i(k) and τ_i(k) into the G network based on discounted risk values to obtain each agent's estimated risk value G_i;
Step 4: correct the Q network based on discounted rewards with the G network based on discounted risk values to obtain each agent's corrected expected reward value;
Step 5: input each agent's corrected expected reward value into the first hybrid network to obtain the estimated value of the joint action, and at the same time input each agent's estimated risk value G_i into the second hybrid network to obtain the estimated total risk;
Step 6: update the parameters of the first hybrid network based on each agent's corrected expected reward value and the estimated value of the joint action, and at the same time update the parameters of the second hybrid network based on each agent's expected reward value Q_i and the estimated total risk;
Step 7: repeat steps 3-6, incrementing k by 1 each time, until k equals K;
Step 8: repeat steps 2-6 for the set number of learning rounds.
Further, in step 4 the G network based on discounted risk values corrects the Q network based on discounted rewards according to formula (1.1), where a_i(k) is the action selected by each agent.
Further, the step of updating the parameters of the first hybrid network in step 6 includes:
Step A1: obtain the greedy action by taking the argmax of the corrected expected reward values, execute action a_i(k), and obtain the instant reward r_i(k);
Step A2: according to the instant reward r_i(k) and the estimated value of the joint action, update the parameters of the first hybrid network by stochastic gradient descent so as to minimize the loss function; during the update of the first hybrid network's parameters, each agent's Q network based on discounted rewards automatically updates its own parameters through gradient backpropagation.
Further, the loss function to be minimized is given by formula (1.2).
further, the parameters of the second hybrid network in the step 6The step of updating includes:
step B1: desired prize value Q by each agent i Selecting the actions of each agent by using an E-greedy method to obtain the action a selected by each agent i (k) ;
Step B2: acquiring global state s of current environment (k) Wherein the global state s (k) Including local observation o i (k) And an index reflecting the current task execution phase;
step B3: action a selected according to each agent i (k) And global state s of the current environment (k) Obtaining the positions s of each intelligent agent (k) Lower execution a i (k) The instant risk value omega is obtained i (k) If omega i (k) 0, will { s } i (k) ,a i (k) ,ω i (k) For storing a set of data in a secure experience poolOtherwise store it in the dangerous experience pool +.>
Step B4: among the agents, for each agent, from a secure experience poolRandom extraction of->Group data and risk experience pool->Random extraction of->Group data, composition size +.>Is>
Step B5: based on the sample pool B and the estimated total riskUpdating the parameters of the second hybrid network +.>And parameters in the second hybrid network +.>In the updating process, each agent automatically updates own parameters according to gradient back propagation based on the G network of the discount risk value.
Further, the formula for updating the parameters of the second hybrid network based on the sample pool B and the estimated total risk is formula (1.3), where μ measures the impact of future long-term risk: the greater its value, the more significant the influence of future long-term risk during the update.
Further, the first hybrid network and the second hybrid network are both neural networks, each with its own initialization parameters.
The beneficial effects are as follows: the invention corrects the Q network based on discounted rewards with the G network based on discounted risk values, comprehensively considering long-term risk and instant early warning, and therefore has better safety performance; at the same time, it uses the first hybrid network and the second hybrid network to assign credit reasonably to each agent, so that a better-performing model can be trained.
Drawings
Fig. 1 is a schematic diagram of a first hybrid network according to the present invention.
Fig. 2 is a schematic diagram of a second hybrid network according to the present invention.
Fig. 3 is a simulation diagram of an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings.
As shown in fig. 1-2, the present invention provides a multi-agent deep reinforcement learning method based on safe exploration, characterized by comprising:
Step 1: initialize N agents, set the number of learning rounds, and reset the environment at the beginning of each round of learning.
Step 2: initialize the environment, set the number of time steps K in each round of learning, initialize a safe experience pool and a dangerous experience pool, each with a maximum capacity of N, initialize the learning rate α and the discount factor γ, and set the current time k = 1.
Step 3: obtain each agent's local observation o_i(k) and current state-action sequence τ_i(k); input o_i(k) and τ_i(k) into the Q network based on discounted rewards to obtain each agent's expected reward value Q_i, and at the same time input o_i(k) and τ_i(k) into the G network based on discounted risk values to obtain each agent's estimated risk value G_i.
In step 3, the Q network based on discounted rewards and the G network based on discounted risk values are both deep neural networks. The Q network based on discounted rewards measures the performance of an agent: the better the agent performs, the higher the reward value it is given, the reward being an evaluation of the agent's current action. The G network based on discounted risk values measures the dangerousness of an agent's actions: the more dangerous the action, the higher the corresponding risk value, the risk value being an evaluation of the dangerousness of the agent's current action.
Step 4: correct the Q network based on discounted rewards with the G network based on discounted risk values to obtain each agent's corrected expected reward value.
In step 4, the G network based on discounted risk values corrects the Q network based on discounted rewards according to formula (1.1), where a_i(k) is the action selected by each agent, and policy selection is performed with the corrected network; this guarantees safety during multi-agent exploration while still allowing sufficient exploration during learning, so that a safe policy with higher benefit is obtained.
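The published text does not reproduce formula (1.1), so the exact correction is not shown here. As a hedged illustration only, the sketch below assumes a simple linear penalty of the form Q~_i = Q_i - ρ·G_i (the coefficient ρ = 0.1 appears later in the embodiment) together with ε-greedy selection over the corrected values; all function names are illustrative, not from the patent.

```python
import random

def corrected_q(q_values, g_values, rho=0.1):
    # Hedged stand-in for formula (1.1): penalize each action's expected
    # reward by rho times its estimated long-term risk value.
    return [q - rho * g for q, g in zip(q_values, g_values)]

def epsilon_greedy(q_corrected, epsilon=0.2, rng=random):
    # With probability epsilon pick a random action, otherwise the action
    # with the largest corrected expected reward value.
    if rng.random() < epsilon:
        return rng.randrange(len(q_corrected))
    return max(range(len(q_corrected)), key=lambda a: q_corrected[a])
```

A risky action with a large G value thus loses its value advantage before selection, which is the mechanism the description relies on.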
Step 5: corrected expected rewarding value of each agentInputting the first mixed network to obtain the estimated value of the joint action +.>At the same time, the estimated risk value G of each agent i Inputting a second hybrid network to obtain estimated total risk +.>
In step 5, the first hybrid network and the second hybrid network both belong to a neural network, and credit allocation is performed on each agent by the first hybrid network and the second hybrid network, so that a model with better performance can be trained better through the total benefits of the benefit prediction system of each agent.
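The internal structure of the two hybrid networks is not disclosed in this text. Since the description builds on QMIX-style credit assignment, one plausible sketch is a monotonic mixer that combines per-agent values with non-negative weights; the weights and bias below stand in for state-conditioned hypernetwork outputs and are assumptions, not the patent's architecture.

```python
def monotonic_mix(agent_values, weights, bias=0.0):
    # Combine per-agent values into a total that is monotone in each
    # agent's value: weights are forced non-negative via abs(), the
    # trick used by QMIX-style mixing networks.
    if len(agent_values) != len(weights):
        raise ValueError("one weight per agent")
    return sum(abs(w) * v for w, v in zip(weights, agent_values)) + bias
```

Monotonicity matters because it lets each agent greedily maximize its own value while still maximizing the joint estimate.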
Step 6: modified expected prize value based on each agentAnd the estimated value of the combined action ∈>Parameter for the first hybrid network->Update while based on expected prize value Q of each agent i And predicting total risk->Parameters for the second hybrid network->And updating.
In step 6, parameters for the first hybrid networkThe step of updating includes: step A1: acquisition ofExecuting action a i (k) Obtaining instant rewards r i (k) Where a is actually a reference to the use of argmax, and has no actual meaning for deriving the maximum Q value for all actions possible;
step A2: according to instant rewards r i (k) And the predictive value of the joint actionAnd updating the parameters of the first hybrid network by means of a random gradient descent method>To minimize the loss function->And parameters in the first hybrid network +.>UpdatingIn the process, each agent automatically updates its own parameters according to gradient back propagation based on the Q network of the discount rewards, wherein the loss function is minimized +.>The method comprises the following steps:
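Formula (1.2) is not reproduced in the published text. A standard stand-in consistent with the description (instant reward plus discounted joint-action estimate, fitted by stochastic gradient descent) is the squared temporal-difference error sketched below; the function name and the `done` handling are assumptions.

```python
def td_loss(reward, q_tot, q_tot_next_max, gamma=0.8, done=False):
    # Squared TD error: the target is the instant reward plus the
    # discounted best joint-action value at the next step (omitted at
    # episode end). gamma = 0.8 matches the embodiment's discount factor.
    target = reward if done else reward + gamma * q_tot_next_max
    return (target - q_tot) ** 2
```

Minimizing this quantity over sampled transitions drives the joint estimate toward the discounted-reward target.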
parameters for a second hybrid networkThe updating step comprises the following steps: step B1: desired prize value Q by each agent i Selecting the actions of each agent by using an E-greedy method to obtain the action a selected by each agent i (k) ;
Step B2: acquiring global state s of current environment (k) Wherein the global state s (k) Including local observation o i (k) And an index reflecting the current task execution phase;
step B3: action a selected according to each agent i (k) And global state s of the current environment (k) Obtaining the positions s of each intelligent agent (k) Lower execution a i (k) The instant risk value omega is obtained i (k) If omega i (k) 0, will { s } i (k) ,a i (k) ,ω i (k) For storing a set of data in a secure experience poolOtherwise store it in the dangerous experience pool +.>
Step B4: among the agents, for each agent, from a secure experience poolRandom extraction of->Group data and risk experience pool->Random extraction of->Group data, composition size +.>Is>
Step B5: according to the sample cellAnd predicting total risk->Updating the parameters of the second hybrid network +.>And parameters in the second hybrid network +.>In the updating process, the G network of each agent based on the discount risk value automatically updates own parameters according to gradient back propagation, and the updating formula of the process is as follows:
wherein,a * using argmin onlyThe term "a" means, without actual meaning, that the minimum G value for all actions is obtained, ">The sample pool is a set, F j,z I.e., the jth element in the z-th set of data in the sample pool.
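Formula (1.3) itself is not reproduced in the text, but the surrounding description (a* as the argmin over actions, μ weighing future long-term risk) suggests a risk target of the hedged form below; the additive structure and the value of μ are assumptions made for illustration.

```python
def g_target(instant_risk, next_g_values, mu=0.5):
    # Long-term risk target: the instant early-warning value plus mu times
    # the minimum estimated risk over next actions (a* = argmin_a G).
    # Larger mu makes future long-term risk weigh more in the update,
    # as the description states.
    return instant_risk + mu * min(next_g_values)
```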
Step 7: repeating steps 3-6, increasing 1 every time K is repeated until it is equal to K.
Step 8: repeatingSub-steps 2-6.
In a specific implementation, the agent is specifically an unmanned aerial vehicle.
The specific implementation of the invention is as follows:
As shown in fig. 3, the environment is a three-dimensional space simulating unmanned aerial vehicle flight. The figure includes the initial position 1 of the unmanned aerial vehicle (the starting point), the target position 2 of the unmanned aerial vehicle (the end point), and columnar obstacles 3 in the space. The objective is for the unmanned aerial vehicle to reach the end point 2 as quickly as possible while colliding with the columnar obstacles 3 as rarely as possible. The position of the unmanned aerial vehicle is represented by three-dimensional coordinates, and its action space comprises 6 actions: moving one unit up, down, left, right, forward, or backward.
Step 1: deploying 2 unmanned aerial vehicles in unmanned aerial vehicle system, and setting the number of wheels to learn
Step 2: initializing environment, resetting the position of the unmanned aerial vehicle to a starting point, setting the time number K=300 of each round of learning, and initializing a safety experience pool with the maximum capacity of N=32And dangerous experience pool->Initializing learning rate alpha=0.001, discount factor gamma=0.8, setting current timeScore k=1.
Step 3: obtaining local observations o through unmanned aerial vehicle i (k) And a state-action sequence tau at the current moment i (k) Local observation o of each agent i (k) And a state-action sequence tau at the current moment i (k) Inputting a Q network based on discount rewards to obtain a predicted expected rewards value Q of each action of each unmanned aerial vehicle i At the same time, the local observation o of each agent i (k) And a state-action sequence tau at the current moment i (k) G network based on discount risk value is input to obtain estimated risk value G of each unmanned aerial vehicle i Wherein the local observation o i (k) Including a portion of the map information, the current location of the drone, and damage level information thereof.
Step 4: based on formula (1.1), correcting the Q network based on the discount rewards by using the G network based on the discount risk values to obtain corrected expected rewards values of the unmanned aerial vehicleWhere ρ=0.1, τ at this time i (k) The state and the action of the unmanned aerial vehicle at the current moment and the first 4 moments are included.
Step 5: corrected expected reward value of each unmanned aerial vehicleInputting the first mixed network to obtain the estimated value of the joint action +.>Simultaneously, the estimated risk value G of each unmanned aerial vehicle i Inputting a second hybrid network to obtain estimated total risk +.>
Step 6: acquisition ofExecuting action a i (k) Obtaining instant rewards r i (k) Then according to the instant rewards r i (k) And the estimated value of the combined action ∈>And define a loss function->The parameters of the first hybrid network are updated by random gradient descent as in equation (1.2)>To minimize the loss function->And parameters in the first hybrid network +.>In the updating process, the Q network of each unmanned aerial vehicle based on discount rewards automatically updates own parameters according to gradient back propagation, and meanwhile, selects Q with the probability of 0.8 i The unmanned aerial vehicle with the largest value acts, other acts are randomly selected with the probability of 0.2, and the act a selected by each unmanned aerial vehicle is obtained i (k) And obtains global state s of the current environment (k) Wherein the global state s (k) Including local observation o i (k) And an index reflecting the current task execution stage, and then according to the action a selected by each agent i (k) And global state s of the current environment (k) Obtaining the positions s of each unmanned aerial vehicle (k) Lower execution a i (k) The instant risk value omega is obtained i (k) If omega i (k) 0, will { s } i (k) ,a i (k) ,ω i (k) Storage of a set of data in a secure experience pool ∈>Otherwise store it in dangerous experiencePool->Then from the secure experience pool->Random extraction of->Group data, in dangerous experience pool->Random extraction of->Group data, composition size +.>Is>Updating the parameters of the second hybrid network by equation (1.3)>And parameters in a second hybrid networkIn the updating process, the G network of each unmanned aerial vehicle based on the discount risk value automatically updates own parameters according to gradient back propagation, wherein indexes reflecting the current task execution stage comprise information of the whole map and positions of various obstacles.
The invention corrects the Q network based on discounted rewards with the G network based on discounted risk values and performs policy selection with the corrected network, thereby guaranteeing the safety of the agents during exploration while allowing sufficient exploration during learning, so that a safe policy with higher benefit is obtained; at the same time, it uses the first hybrid network and the second hybrid network to assign credit reasonably to each agent, so that a better-performing model can be trained.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.
Claims (7)
1. A multi-agent deep reinforcement learning method based on safe exploration, characterized by comprising the following steps:
Step 1: initialize N agents, set the number of learning rounds, and initialize the environment at the beginning of each round of learning;
Step 2: set the number of time steps K in each round of learning, initialize a safe experience pool and a dangerous experience pool, each with a maximum capacity of N, initialize the learning rate α and the discount factor γ, and set the current time k = 1;
Step 3: obtain each agent's local observation o_i(k) and current state-action sequence τ_i(k); input o_i(k) and τ_i(k) into the Q network based on discounted rewards to obtain each agent's expected reward value Q_i, and at the same time input o_i(k) and τ_i(k) into the G network based on discounted risk values to obtain each agent's estimated risk value G_i;
Step 4: correct the Q network based on discounted rewards with the G network based on discounted risk values to obtain each agent's corrected expected reward value;
Step 5: input each agent's corrected expected reward value into the first hybrid network to obtain the estimated value of the joint action, and at the same time input each agent's estimated risk value G_i into the second hybrid network to obtain the estimated total risk;
Step 6: update the parameters of the first hybrid network based on each agent's corrected expected reward value and the estimated value of the joint action, and at the same time update the parameters of the second hybrid network based on each agent's expected reward value Q_i and the estimated total risk;
Step 7: repeat steps 3-6, incrementing k by 1 each time, until k equals K;
Step 8: repeat steps 2-6 for the set number of learning rounds.
2. The multi-agent deep reinforcement learning method based on safe exploration according to claim 1, characterized in that in step 4 the G network based on discounted risk values corrects the Q network based on discounted rewards according to formula (1.1), where a_i(k) is the action selected by each agent.
3. The multi-agent deep reinforcement learning method based on safe exploration according to claim 1, characterized in that the step of updating the parameters of the first hybrid network in step 6 includes:
Step A1: obtain the greedy action by taking the argmax of the corrected expected reward values, execute action a_i(k), and obtain the instant reward r_i(k);
Step A2: according to the instant reward r_i(k) and the estimated value of the joint action, update the parameters of the first hybrid network by stochastic gradient descent so as to minimize the loss function; during the update of the first hybrid network's parameters, each agent's Q network based on discounted rewards automatically updates its own parameters through gradient backpropagation.
4. The multi-agent deep reinforcement learning method based on safe exploration according to claim 3, characterized in that the loss function to be minimized is given by formula (1.2).
5. the multi-agent deep reinforcement learning method based on security exploration according to claim 1, wherein the parameters of the second hybrid network in the step 6The step of updating includes:
step B1: desired prize value Q by each agent i Selecting the actions of each agent by using an E-greedy method to obtain the action a selected by each agent i (k) ;
Step B2: acquiring global state s of current environment (k) Wherein the global state s (k) Including local observation o i (k) And an index reflecting the current task execution phase;
step B3: action a selected according to each agent i (k) And global state s of the current environment (k) Obtaining the positions s of each intelligent agent (k) Lower execution a i (k) The obtained isTime risk value omega i (k) If omega i (k) 0, will { s } i (k) ,a i (k) ,ω i (k) For storing a set of data in a secure experience poolOtherwise store it in the dangerous experience pool +.>
Step B4: among the agents, for each agent, from a secure experience poolRandom extraction of->Group data and risk experience pool->Random extraction of->Group data, composition size +.>Is>
Step B5: according to the sample cellAnd predicting total risk->Updating the parameters of the second hybrid network +.>And parameters in the second hybrid network +.>In the updating process, each agent automatically updates own parameters according to gradient back propagation based on the G network of the discount risk value.
6. The multi-agent deep reinforcement learning method based on safe exploration according to claim 5, characterized in that the formula for updating the parameters of the second hybrid network based on the sample pool B and the estimated total risk is formula (1.3), where μ measures the impact of future long-term risk: the greater its value, the more significant the influence of future long-term risk during the update.
7. The multi-agent deep reinforcement learning method based on safe exploration according to claim 1, characterized in that the first hybrid network and the second hybrid network are both neural networks, each with its own initialization parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311811726.9A CN117875375A (en) | 2023-12-27 | 2023-12-27 | Multi-agent deep reinforcement learning method based on safety exploration |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117875375A true CN117875375A (en) | 2024-04-12 |
Family
ID=90585765
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311811726.9A Pending CN117875375A (en) | 2023-12-27 | 2023-12-27 | Multi-agent deep reinforcement learning method based on safety exploration |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117875375A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118051063A (en) * | 2024-04-16 | 2024-05-17 | 中国民用航空飞行学院 | Training method for obstacle avoidance flight of low-altitude unmanned aerial vehicle |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |