CN114840024A - Unmanned aerial vehicle control decision method based on context memory - Google Patents

Unmanned aerial vehicle control decision method based on context memory

Info

Publication number
CN114840024A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
network
target
action
Prior art date
Legal status
Pending
Application number
CN202210577604.7A
Other languages
Chinese (zh)
Inventor
罗杰豪
方敏
谢佳晨
史令安
王鹏
王宏博
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210577604.7A
Publication of CN114840024A
Legal status: Pending

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention provides an unmanned aerial vehicle control decision method based on episodic memory, which comprises the following steps: constructing an unmanned aerial vehicle control scene; acquiring the observation information of each unmanned aerial vehicle; constructing and initializing a deep reinforcement learning model H; constructing an episodic memory exploration table; iteratively training the deep reinforcement learning model H; and autonomously controlling the behavior of the unmanned aerial vehicles with the trained deep reinforcement learning model H. The invention uses a multi-agent distributed episodic memory exploration table to store the best returns of the agents' past similar experiences, and during actual exploration each agent selects its action as a combination of a random action and the best explored action in the episodic memory exploration table, thereby improving the completion rate of the unmanned aerial vehicle control task and shortening the convergence time of the deep reinforcement learning algorithm.

Description

Unmanned aerial vehicle control decision method based on context memory
Technical Field
The invention belongs to the technical field of unmanned aerial vehicles, and particularly relates to a multi-unmanned-aerial-vehicle control decision method which can be used for monitoring the environment in multi-unmanned-aerial-vehicle scenes.
Background
In the field of multi-unmanned-aerial-vehicle control, multi-agent deep reinforcement learning is currently a very active research direction and has achieved very good results. Multi-agent reinforcement learning is an indispensable branch of reinforcement learning research. It can be applied to a series of tasks that must be completed cooperatively, such as multi-unmanned-aerial-vehicle reconnaissance and unmanned aerial vehicle swarm confrontation. Compared with a single agent, the tasks handled by multiple agents are more complex: during learning each agent must consider not only environmental factors but also the behavior of the other agents, which greatly increases the difficulty of learning. Meanwhile, in some special settings such as sparse rewards and complex coordination tasks, existing research on multi-agent reinforcement learning still has many shortcomings. While overcoming these defects of multi-agent reinforcement learning algorithms, the cooperative relationship among multiple unmanned aerial vehicles also needs to be analyzed in combination with the actual, complex unmanned aerial vehicle environment, so that the multiple unmanned aerial vehicles can monitor the environment information.
Multi-agent Q learning is the most typical reinforcement learning algorithm for solving the multi-unmanned-aerial-vehicle cooperation problem. Q learning can evaluate the value of each unmanned aerial vehicle's states and actions in an off-line manner: each unmanned aerial vehicle first keeps exploring the environment and accumulates enough trajectory experience, and Q learning then continuously corrects the value estimates of the joint states and actions of the unmanned aerial vehicles according to this trajectory experience. By repeatedly trying random actions in any state of the environment, Q learning can determine, from sufficient historical experience data, the action of highest value for an unmanned aerial vehicle in any state. Q learning is the most primitive form of reinforcement learning and the basis of other, more complex approaches.
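As an illustration of the tabular update described above, a minimal sketch of single-agent Q learning is given below; the state encoding, action set and hyper-parameters are assumptions for illustration only and are not part of the invention.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch: Q[(state, action)] holds the current value estimate.
Q = defaultdict(float)
actions = ["up", "down", "left", "right"]   # assumed discrete action set
alpha, gamma, epsilon = 0.1, 0.95, 0.1      # assumed hyper-parameters

def choose_action(state):
    """Epsilon-greedy action selection over the Q table."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """Standard Q-learning update from one observed transition."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```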
Traditional multi-agent Q learning uses tables to store data. As the number of unmanned aerial vehicles increases, storing data in a table becomes difficult to apply: when the numbers of states and actions of the unmanned aerial vehicles are very large, the storage space of the Q table becomes huge and table lookups become very slow, so that Q learning loses practical value. To solve this problem, DeepMind proposed the DQN algorithm in 2013. DQN combines a neural network with Q learning, so that no table is needed to record the Q values: the state of the unmanned aerial vehicle is used directly as the input of a neural network, and all action values are computed by the network. However, DQN can only handle discrete actions and cannot deal with continuous actions; moreover, its sample utilization during training is low and the trained Q values are not stable.
OpenAI, a well-known artificial intelligence company, proposed the multi-agent deep deterministic policy gradient algorithm MADDPG, which provides a framework of centralized training and decentralized execution, so that the algorithm can handle complex multi-agent tasks that traditional reinforcement learning cannot solve. When the workload of one task is too large and the task can be split into different subtasks, multiple agents can execute the subtasks in parallel and speed up task processing. When one agent in the system fails, the remaining agents can take over its work, which raises the robustness of the system. However, for the exploratory action selection of the agents, MADDPG mostly adopts an epsilon-greedy strategy that combines policy actions and random actions, so reinforcement learning has difficulty exploring efficiently at the early stage of training, the convergence of the algorithm is slow, and the monitoring performance in complex multi-unmanned-aerial-vehicle dynamic monitoring scenes is poor.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle control decision method based on episodic memory to overcome the defects of the prior art, so that in the action selection stage the action selection of the unmanned aerial vehicle is guided by combining an episodic memory exploration table with random action selection, which promotes the exploration of the state-action space by the unmanned aerial vehicle, accelerates convergence, and improves the monitoring performance of the unmanned aerial vehicles in dynamic scenes.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) setting an unmanned aerial vehicle control scene, acquiring the state set of each unmanned aerial vehicle at each moment, and computing the action reward value of each unmanned aerial vehicle at each moment: r_i = u_i · r, where u_i denotes the number of pedestrians observed by the i-th unmanned aerial vehicle and r is the reward value obtained for each observed pedestrian.
(2) Constructing a deep reinforcement learning model H and initializing it:
2a) selecting, for each unmanned aerial vehicle, a deep reinforcement learning model H formed by the bidirectional connection of a real Critic network with a real Actor network and the bidirectional connection of a target Critic network with a target Actor network;
2b) initializing the real Critic network parameters θ_1^Q, ..., θ_n^Q, the real Actor network parameters θ_1^μ, ..., θ_n^μ, the target Critic network parameters θ_1^Q′, ..., θ_n^Q′ and the target Actor network parameters θ_1^μ′, ..., θ_n^μ′ of the n unmanned aerial vehicles; initializing the network learning rate α, the future-return discount rate γ, the training batch size batch, the experience replay buffer size N and the soft update rate τ of the target networks; initializing the training count k = 0 and setting the maximum number of training episodes K;
(3) constructing a multi-agent episodic memory exploration table for each unmanned aerial vehicle, and initializing the maximum capacity c of the episodic memory and the key dimension dim of the episodic memory;
(4) iteratively training the deep reinforcement learning model H:
4a) initializing the training count: set the maximum number of training episodes Q = 5000 and k = 1;
4b) initializing the iteration count: set the maximum number of iterations T = 1000 and t = 1;
4c) determining whether the current state-action pair (s_t^i, a_t^i) exists in the episodic memory exploration table:
if (s_t^i, a_t^i) exists in the episodic memory exploration table, select a random action with greedy probability ε and select the optimal action a_t^i from the episodic memory exploration table with probability 1 − ε, execute the unmanned aerial vehicle action a_t^i, obtain the reward r_t^i and read the output value Q_t^i from the episodic memory exploration table, then enter the next state s_{t+1}^i and set t = t + 1;
if (s_t^i, a_t^i) does not exist in the episodic memory exploration table, input (s_t^i, a_t^i) into the real Critic network to obtain the output value Q_t^i, then enter the next state s_{t+1}^i and set t = t + 1;
4d) forming the action set a_t = {a_t^1, ..., a_t^n} from the current actions of all unmanned aerial vehicles and interacting with the environment to obtain the reward value set r_t = {r_t^1, ..., r_t^n}; grouping the output values of the episodic memory exploration table or of the real Critic network into the set Q_t = {Q_t^1, ..., Q_t^n}; forming the state set s_{t+1} = {s_{t+1}^1, ..., s_{t+1}^n} from the states of all unmanned aerial vehicles at the next moment; storing the four sets of unmanned aerial vehicle information (s_t, a_t, s_{t+1}, r_t, Q_t) into the experience replay buffer;
4e) checking the number of experience vectors in the replay buffer:
if the number of experience vectors in the experience replay buffer is greater than N/2, take M samples from the experience replay buffer, update the real Critic network by minimizing its loss function, and update the real Actor network by gradient descent;
if the number of experience vectors in the experience replay buffer is less than or equal to N/2, return to 4c);
if the number of experience vectors is greater than N, remove the oldest experience vector;
4f) updating the target networks by soft update;
4g) comparing the current iteration count t with the iteration upper limit T:
if the current episode terminates or t > T, compute the Q_EM value of the current state-action pair (s_t^i, a_t^i), store the state-action pair (s_t^i, a_t^i) together with the computed Q_EM value into the episodic memory exploration table of the i-th unmanned aerial vehicle, set k = k + 1 and execute 4h);
otherwise, return to 4c);
4h) comparing the training count k with the training upper limit Q and judging whether training stops:
if k > Q, the training of the deep reinforcement learning model H is finished; execute (5);
otherwise, return to 4b);
(5) autonomously controlling the behavior of the unmanned aerial vehicles with the trained deep reinforcement learning model H:
5a) inputting the current state-action pair (s_t^i, a_t^i) of the i-th unmanned aerial vehicle into the target Critic network of the trained deep reinforcement learning model H to obtain the output value Q_i′(s_t^i, a_t^i);
5b) inputting the action a_t^i of the i-th unmanned aerial vehicle and the output value Q_i′ of the target Critic network into the target Actor network of the trained deep reinforcement learning model H to obtain the target Actor network output a_{t+1}^i, i.e. the action to be taken by the unmanned aerial vehicle at the next moment.
Compared with the prior art, the invention has the following advantages:
1. The invention uses a multi-agent distributed episodic memory exploration table to store the best returns of the unmanned aerial vehicle's past similar experiences, so that the unmanned aerial vehicle can replay actions that have already produced high returns in the episodic memory exploration table and obtain good results again without waiting for a lengthy network gradient update process, which accelerates the convergence of the deep reinforcement learning model and finds the optimal actions faster.
2. During the actual exploration of the unmanned aerial vehicle, the invention selects actions as a combination of random actions and the best explored action in the episodic memory exploration table: the unmanned aerial vehicle selects a random action with probability ε and selects the best explored action in the episodic memory exploration table with probability 1 − ε, so that the unmanned aerial vehicle keeps some randomness during exploration and the problem of insufficient exploration is alleviated.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a view of a ground area scene monitored by the unmanned aerial vehicle;
FIG. 3 is a simulation comparison of the average reward value of the present invention and the prior art in the 50 × 50 scene;
FIG. 4 is a simulation comparison of the average reward value of the present invention and the prior art in the 100 × 100 scene.
Detailed Description
Embodiments and effects of the present invention will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of this example are as follows:
step 1, setting an unmanned aerial vehicle control scene.
Referring to fig. 2, the present example constructs an unmanned aerial vehicle control scene, including a spatial region with a size of X × Y, an unmanned aerial vehicle set U, and a pedestrian set H;
Initialize the unmanned aerial vehicle set in the spatial region as U = {U_0, U_1, ..., U_i, ..., U_n}, where the action space of the i-th unmanned aerial vehicle U_i is A_i = {a_1, a_2, a_3, a_4}, n denotes the total number of unmanned aerial vehicles, i ranges from 0 to n and n ≥ 1; in this example X = 50 meters, Y = 50 meters and n = 4;
initialize the pedestrian set in the scene as H = {h_1, h_2, ..., h_k, ..., h_m}, and initialize the position of the k-th pedestrian h_k as (x_k, y_k), where m denotes the total number of pedestrians and m ≥ 1; m = 8 in this example;
initialize the reward r_i of the i-th unmanned aerial vehicle and the state set s_i = {x_i, y_i, u_i} of the i-th unmanned aerial vehicle U_i, where u_i denotes the number of pedestrians observed by the i-th unmanned aerial vehicle, u_i ≥ 0; x_i and x_k denote the abscissas of the unmanned aerial vehicle and the pedestrian respectively, with x_i, x_k < X; y_i and y_k denote the ordinates of the unmanned aerial vehicle and the pedestrian respectively, with y_i, y_k < Y.
Step 2, acquiring the observation information of each unmanned aerial vehicle.
Acquire the observation information of each unmanned aerial vehicle from the ground-monitoring scene established in step 1, i.e. compute the action reward value of each unmanned aerial vehicle at each moment from its state set: r_i = u_i · r, where r is the reward value obtained for each observed pedestrian; r = 20 in this example.
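As a concrete illustration of this reward, the following sketch counts the pedestrians inside an unmanned aerial vehicle's field of view; the field-of-view radius and the data layout are assumptions.

```python
# Sketch of the reward r_i = u_i * r: count pedestrians within the UAV's field of view.
R_PER_PEDESTRIAN = 20      # r, as in this example
FOV_RADIUS = 5             # assumed field-of-view radius in grid cells

def uav_reward(uav_pos, pedestrians, r=R_PER_PEDESTRIAN, fov=FOV_RADIUS):
    """uav_pos: (x_i, y_i); pedestrians: list of (x_k, y_k). Returns (u_i, r_i)."""
    x_i, y_i = uav_pos
    u_i = sum(1 for (x_k, y_k) in pedestrians
              if abs(x_k - x_i) <= fov and abs(y_k - y_i) <= fov)
    return u_i, u_i * r

# Example: a UAV at (10, 10) with two pedestrians in view earns r_i = 40.
u, ri = uav_reward((10, 10), [(8, 12), (11, 9), (40, 40)])
```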
Step 3, constructing and initializing the deep reinforcement learning model H.
3.1) Construct a deep reinforcement learning model H consisting of a real Critic network, a real Actor network, a target Critic network and a target Actor network: the real Critic network is bidirectionally connected with the real Actor network, and the target Critic network is bidirectionally connected with the target Actor network, forming two independent network branches; the real Critic network, the real Actor network, the target Critic network and the target Actor network are each formed by cascading 4 convolutional layers, 1 pooling layer and 3 fully connected layers;
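A minimal PyTorch sketch of one such network (4 convolutional layers, 1 pooling layer and 3 fully connected layers) is given below; the channel counts, kernel sizes and the image-like observation shape are assumptions, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class CriticNet(nn.Module):
    """4 conv layers -> 1 pooling layer -> 3 fully connected layers (assumed sizes)."""
    def __init__(self, in_channels=3, out_dim=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # the single pooling layer
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, x):
        return self.head(self.features(x))

# Example: a batch of 8 observations of shape (3, 50, 50) -> 8 Q values.
q = CriticNet()(torch.randn(8, 3, 50, 50))
```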
3.2) Initialize the real Critic network parameters θ_1^Q, ..., θ_n^Q, the real Actor network parameters θ_1^μ, ..., θ_n^μ, the target Critic network parameters θ_1^Q′, ..., θ_n^Q′ and the target Actor network parameters θ_1^μ′, ..., θ_n^μ′ of the n unmanned aerial vehicles; initialize the network learning rate α, the future-return discount rate γ, the training batch size batch, the experience replay buffer size N, the soft update rate τ of the target networks, the training count k and the maximum number of training episodes Q; in this example α = 0.01, γ = 0.95, batch = 64, N = 10000, τ = 0.1, k = 0 and Q = 5000.
Step 4, constructing the episodic memory exploration table.
Construct a multi-agent episodic memory exploration table for each unmanned aerial vehicle. The table consists of two columns: one column stores the index value Key of the state-action pair (s, a) of the unmanned aerial vehicle; the other column stores the Q_EM value of the unmanned aerial vehicle, where Q_EM is the optimal Q value for taking action a in the current state s, as shown below:
Multi-agent episodic memory exploration table
Key: index of the state-action pair (s, a) | Value: Q_EM
Initialize the maximum capacity c of the episodic memory exploration table and set the Key dimension dim of the episodic memory; c = 100000 and dim = 64 in this example.
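A minimal sketch of such a table as a fixed-capacity dictionary is given below; the way a state-action pair is encoded into a dim-dimensional key (here a fixed random projection followed by rounding) is an assumption.

```python
import numpy as np
from collections import OrderedDict

class EpisodicMemoryTable:
    """Key: encoding of (s, a); Value: best return Q_EM observed for that pair."""
    def __init__(self, capacity=100000, dim=64, obs_dim=3, n_actions=4, seed=0):
        self.capacity = capacity
        rng = np.random.default_rng(seed)
        # assumed fixed random projection of the concatenated (s, a) vector to `dim` values
        self.projection = rng.normal(size=(obs_dim + n_actions, dim))
        self.table = OrderedDict()

    def key(self, state, action_onehot):
        x = np.concatenate([state, action_onehot]) @ self.projection
        return tuple(np.round(x, 2))          # discretise so similar pairs share a key

    def get(self, state, action_onehot):
        return self.table.get(self.key(state, action_onehot))

    def update(self, state, action_onehot, ret):
        k = self.key(state, action_onehot)
        self.table[k] = max(self.table.get(k, -np.inf), ret)   # keep the best return
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)                     # evict the oldest entry
```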
Step 5, iteratively training the deep reinforcement learning model H.
5.1) Initialize the training count: set the maximum number of training episodes Q = 5000 and k = 1;
5.2) Initialize the iteration count of each training episode: set the maximum number of iterations T = 1000 and t = 1;
5.3) Determine whether the current state-action pair (s_t^i, a_t^i) exists in the episodic memory exploration table:
if (s_t^i, a_t^i) exists in the episodic memory exploration table, select a random action with greedy probability ε and select the optimal action a_t^i from the episodic memory exploration table with probability 1 − ε, execute the unmanned aerial vehicle action a_t^i, obtain the reward r_t^i, then read the output value Q_t^i from the episodic memory exploration table, enter the next state s_{t+1}^i and set t = t + 1;
if (s_t^i, a_t^i) does not exist in the episodic memory exploration table, input (s_t^i, a_t^i) into the real Critic network to obtain the output value Q_t^i, then enter the next state s_{t+1}^i and set t = t + 1;
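The following sketch illustrates this action-selection rule, assuming an em_table object like the one sketched in step 4 and one-hot action encodings; the helper names are assumptions.

```python
import random
import numpy as np

def select_action(state, em_table, n_actions=4, epsilon=0.1):
    """With probability epsilon act randomly; otherwise replay the best action
    recorded in the episodic memory exploration table for this state."""
    candidates = []
    for a in range(n_actions):
        onehot = np.eye(n_actions)[a]
        q = em_table.get(state, onehot)
        if q is not None:
            candidates.append((q, a))
    if not candidates or random.random() < epsilon:
        return random.randrange(n_actions)     # random exploratory action
    return max(candidates)[1]                  # best explored action in the table
```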
5.4) Form the action set a_t = {a_t^1, ..., a_t^n} from the current actions of all unmanned aerial vehicles and interact with the environment to obtain the reward value set r_t = {r_t^1, ..., r_t^n}; group the output values of the episodic memory exploration table or of the real Critic network into the set Q_t = {Q_t^1, ..., Q_t^n}; form the state set s_{t+1} = {s_{t+1}^1, ..., s_{t+1}^n} from the states of all unmanned aerial vehicles at the next moment; store the four sets of unmanned aerial vehicle information (s_t, a_t, s_{t+1}, r_t, Q_t) into the experience replay buffer;
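A sketch of the experience replay buffer into which these tuples are stored is given below; the deque-based oldest-first eviction and the half-full training threshold anticipate the checks of step 5.5, and the field names are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s_t, a_t, s_t1, r_t, Q_t) tuples; a deque with maxlen N evicts the oldest."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)

    def push(self, s_t, a_t, s_t1, r_t, q_t):
        self.buffer.append((s_t, a_t, s_t1, r_t, q_t))

    def ready(self):
        # training only starts once the buffer is more than half full
        return len(self.buffer) > self.capacity // 2

    def sample(self, m=64):
        return random.sample(self.buffer, m)
```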
5.5) Check the number of experience vectors in the replay buffer:
if the number of experience vectors in the experience replay buffer is greater than N/2, take M samples from the experience replay buffer, update the real Critic network by minimizing its loss function, and update the real Actor network by gradient descent;
the loss function for a real Critic network is:
Figure BDA0003660931350000074
Figure BDA0003660931350000075
where M represents the number of samples taken from the experience buffer pool, Q i Represents the output value, Q ', in the jth sample of drone i' i An output value representing a target criticic network of drone i;
the loss function of a real Actor network is:
Figure BDA0003660931350000076
wherein
Figure BDA0003660931350000077
Is in the target Actor networkUnmanned aerial vehicle i observes information as
Figure BDA0003660931350000078
The target network policy under (a) is,
Figure BDA0003660931350000079
network policy representing unmanned aerial vehicle i at target Actor
Figure BDA00036609313500000710
The output value of the lower target criticic network.
If the number of experience vectors in the experience replay buffer is less than or equal to N/2, return to 5.3);
if the number of experience vectors is greater than N, remove the oldest experience vector;
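The following PyTorch sketch illustrates the Critic and Actor updates described in step 5.5 for a single unmanned aerial vehicle, using the standard DDPG-style actor update that maximizes the real Critic's value; the network sizes, optimizers and the single-agent simplification are assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 3, 4, 0.95
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
target_critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic_opt = torch.optim.Adam(critic.parameters(), lr=0.01)
actor_opt = torch.optim.Adam(actor.parameters(), lr=0.01)

def update(s, a, r, s_next):
    """s, s_next: (M, obs_dim); a: (M, act_dim); r: (M, 1)."""
    # Critic: minimise (y_j - Q(s_j, a_j))^2 with y_j = r_j + gamma * Q'(s'_j, mu'(s'_j))
    with torch.no_grad():
        y = r + gamma * target_critic(torch.cat([s_next, target_actor(s_next)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: gradient descent on -Q(s_j, mu(s_j)), i.e. increase the Critic's value
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example with a random mini-batch of M = 64 transitions
update(torch.randn(64, obs_dim), torch.randn(64, act_dim),
       torch.randn(64, 1), torch.randn(64, obs_dim))
```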
5.6) updating the target network in a soft updating mode.
Use the parameters θ_t^Q of the current real Critic network and the parameters θ_t^μ of the real Actor network to update, at update rate τ, the parameters θ_t^Q′ of the target Critic network and the parameters θ_t^μ′ of the target Actor network, obtaining the target Critic network parameters θ_{t+1}^Q′ and the target Actor network parameters θ_{t+1}^μ′ at the next moment. The update formulas are:
θ_{t+1}^Q′ = τ θ_t^Q + (1 − τ) θ_t^Q′
θ_{t+1}^μ′ = τ θ_t^μ + (1 − τ) θ_t^μ′
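A sketch of this soft update, applied parameter by parameter, is given below; it can be applied to any pair of real and target networks, e.g. soft_update(critic, target_critic).

```python
import torch

def soft_update(real_net, target_net, tau=0.1):
    """theta_target <- tau * theta_real + (1 - tau) * theta_target, parameter-wise."""
    with torch.no_grad():
        for p, tp in zip(real_net.parameters(), target_net.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```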
5.7) Compare the current iteration count t with the iteration upper limit T:
if the current episode terminates or t > T, compute the Q_EM value of the current state-action pair (s_t^i, a_t^i) as in 5.8), store the state-action pair (s_t^i, a_t^i) together with the computed Q_EM value into the episodic memory exploration table of the i-th unmanned aerial vehicle, set k = k + 1 and execute 5.9);
otherwise, return to 5.3).
5.8) Compute the Q_EM value:
5.8.1) take all records (s_t^i, a_t^i, r_t^i) of the current training episode from the experience replay buffer of the i-th unmanned aerial vehicle, where s_t^i denotes the state of the i-th unmanned aerial vehicle at time t, a_t^i denotes the action of the i-th unmanned aerial vehicle at time t, and r_t^i is the reward of the i-th unmanned aerial vehicle at time t;
5.8.2) count the total number of records n and, in reverse order of time t, compute the total return obtained after taking the corresponding action in the current observation state:
G_t = Σ_{j=t}^{n} γ^(j−t) r_j^i, i.e. G_t = r_t^i + γ G_{t+1} with G_{n+1} = 0,
where γ denotes the attenuation (discount) factor;
5.8.3) use the total return G_t to compute the Q_EM value of the current state-action pair (s_t^i, a_t^i):
Q_EM(s_t^i, a_t^i) = G_t if (s_t^i, a_t^i) is not yet in EM, and Q_EM(s_t^i, a_t^i) = max(Q_EM(s_t^i, a_t^i), G_t) otherwise,
where EM is the episodic memory exploration table of the current unmanned aerial vehicle;
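The following sketch computes the discounted returns in reverse time order and writes them into the episodic memory exploration table, assuming the EpisodicMemoryTable sketched in step 4 and integer action indices.

```python
import numpy as np

def write_episode_to_memory(episode, em_table, gamma=0.95, n_actions=4):
    """episode: list of (state, action_index, reward) for one UAV, in time order."""
    G = 0.0
    for state, action, reward in reversed(episode):       # G_t = r_t + gamma * G_{t+1}
        G = reward + gamma * G
        onehot = np.eye(n_actions)[action]
        em_table.update(state, onehot, G)                  # keeps max(Q_EM, G_t)
```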
5.9) comparing the training times k with the upper limit Q of the training times, and judging whether the training is stopped:
if k is larger than Q, finishing the training of the deep reinforcement learning model H, and executing the step 6;
otherwise, return to 5.2).
Step 6, using the trained deep reinforcement learning model H to autonomously control the behavior of the unmanned aerial vehicle:
6.1) Input the current state-action pair (s_t^i, a_t^i) of the i-th unmanned aerial vehicle into the target Critic network of the trained deep reinforcement learning model H to obtain the output value Q_i′(s_t^i, a_t^i);
6.2) input the action a_t^i of the i-th unmanned aerial vehicle and the output value Q_i′ of the target Critic network into the target Actor network of the trained deep reinforcement learning model H to obtain the target Actor network output a_{t+1}^i, i.e. the action to be taken by the unmanned aerial vehicle at the next moment.
The technical effects of the present invention will be further described with reference to simulation experiments.
1. Simulation conditions are as follows:
simulation experiments were simulated using Python language on a Microsoft windows 10 system with a CPU of Intel i7-9700 and memory 16G.
The ground area scene monitored by the unmanned aerial vehicles in the experiment is shown in Fig. 2, where Fig. 2a shows the global observation of the environment: red dots represent pedestrians on the ground, black dots represent obstacles, and green dots represent forest. Fig. 2b shows the monitoring view of an unmanned aerial vehicle. Each unmanned aerial vehicle and each pedestrian has four actions: move up, move left, move right and move down. The goal of the unmanned aerial vehicles is to monitor pedestrians on the ground; an unmanned aerial vehicle is rewarded when a pedestrian appears in its field of view.
2. experimental contents and results:
The proposed method is compared, by simulation, with existing unmanned aerial vehicle control decision methods based on the MADDPG and IQL algorithms in terms of task completion rate, maximum reward value and the number of steps needed to reach the maximum reward, in the two scenes of size 50 × 50 and 100 × 100 shown in Fig. 2. The simulation results for task completion rate and maximum reward value are listed in Table 2; the numbers of steps needed to reach the maximum reward in the 50 × 50 and 100 × 100 scenes are shown in Fig. 3 and Fig. 4 respectively.
TABLE 2 Comparison of the present invention with existing unmanned aerial vehicle control decision methods based on the MADDPG and IQL algorithms

Scene | Method | Task completion rate | Maximum reward value
50 × 50 | Present invention | 99.2% | 158
50 × 50 | MADDPG | 97.4% | 155
50 × 50 | IQL | 62.5% | 108
100 × 100 | Present invention | 97.7% | 155
100 × 100 | MADDPG | 75.1% | 115
100 × 100 | IQL | 52.6% | 75
As can be seen from Table 2, in the 50 × 50 scene the task completion rate of the present method is 99.2%, the task completion rate of the unmanned aerial vehicle control decision method based on the MADDPG algorithm is 97.4%, and the task completion rate of the unmanned aerial vehicle control decision method based on the IQL algorithm is 62.5%; the maximum reward value of the present method is 158, the maximum reward value of the MADDPG-based method is 155, and the maximum reward value of the IQL-based method is 108.
In the 100 × 100 scene, the task completion rate of the present method is 97.7%, the task completion rate of the MADDPG-based method is 75.1%, and the task completion rate of the IQL-based method is 52.6%; the maximum reward value of the present method is 155, the maximum reward value of the MADDPG-based method is 115, and the maximum reward value of the IQL-based method is 75.
As can be seen from Fig. 3, in the 50 × 50 scene the present method reaches the maximum reward in 1200 steps, while the unmanned aerial vehicle control decision method based on the MADDPG algorithm needs 2500 steps and the method based on the IQL algorithm needs 2500 steps.
As can be seen from Fig. 4, in the 100 × 100 scene the present method reaches the maximum reward in 2100 steps, while the MADDPG-based method needs 3200 steps and the IQL-based method needs 5000 steps.
The comparison results show that, compared with the prior art, the method has clear advantages in task completion rate, maximum reward and convergence speed, and can effectively improve the performance of unmanned aerial vehicle control decisions.

Claims (7)

1. An unmanned aerial vehicle control decision method based on episodic memory, characterized by comprising the following steps:
(1) setting an unmanned aerial vehicle control scene, acquiring the state set of each unmanned aerial vehicle at each moment, and computing the action reward value of each unmanned aerial vehicle at each moment: r_i = u_i · r, where u_i denotes the number of pedestrians observed by the i-th unmanned aerial vehicle and r is the reward value obtained for each observed pedestrian;
(2) constructing a deep reinforcement learning model H and initializing it:
2a) selecting, for each unmanned aerial vehicle, a deep reinforcement learning model H formed by the bidirectional connection of a real Critic network with a real Actor network and the bidirectional connection of a target Critic network with a target Actor network;
2b) initializing the real Critic network parameters θ_1^Q, ..., θ_n^Q, the real Actor network parameters θ_1^μ, ..., θ_n^μ, the target Critic network parameters θ_1^Q′, ..., θ_n^Q′ and the target Actor network parameters θ_1^μ′, ..., θ_n^μ′ of the n unmanned aerial vehicles; initializing the network learning rate α, the future-return discount rate γ, the training batch size batch, the experience replay buffer size N and the soft update rate τ of the target networks; initializing the training count k = 0 and setting the maximum number of training episodes K;
(3) constructing a multi-agent episodic memory exploration table for each unmanned aerial vehicle, and initializing the maximum capacity c of the episodic memory and the key dimension dim of the episodic memory;
(4) iteratively training the deep reinforcement learning model H:
4a) initializing the training count: set the maximum number of training episodes Q = 5000 and k = 1;
4b) initializing the iteration count: set the maximum number of iterations T = 1000 and t = 1;
4c) determining whether the current state-action pair (s_t^i, a_t^i) exists in the episodic memory exploration table:
if (s_t^i, a_t^i) exists in the episodic memory exploration table, selecting a random action with greedy probability ε and selecting the optimal action a_t^i from the episodic memory exploration table with probability 1 − ε, executing the unmanned aerial vehicle action a_t^i, obtaining the reward r_t^i and the output value Q_t^i from the episodic memory exploration table, and then entering the next state s_{t+1}^i;
if (s_t^i, a_t^i) does not exist in the episodic memory exploration table, inputting (s_t^i, a_t^i) into the real Critic network to obtain the output value Q_t^i, and then entering the next state s_{t+1}^i;
4d) forming the action set a_t = {a_t^1, ..., a_t^n} from the current actions of all unmanned aerial vehicles and interacting with the environment to obtain the reward value set r_t = {r_t^1, ..., r_t^n}; grouping the output values of the episodic memory exploration table or of the real Critic network into the set Q_t = {Q_t^1, ..., Q_t^n}; forming the state set s_{t+1} = {s_{t+1}^1, ..., s_{t+1}^n} from the states of all unmanned aerial vehicles at the next moment; storing the four sets of unmanned aerial vehicle information (s_t, a_t, s_{t+1}, r_t, Q_t) into the experience replay buffer;
4e) checking the number of experience vectors in the replay buffer:
if the number of experience vectors in the experience replay buffer is greater than N/2, taking M samples from the experience replay buffer, updating the real Critic network by minimizing its loss function, and updating the real Actor network by gradient descent;
if the number of experience vectors in the experience replay buffer is less than or equal to N/2, returning to 4c);
if the number of experience vectors is greater than N, removing the oldest experience vector;
4f) updating the target networks by soft update;
4g) comparing the current iteration count t with the iteration upper limit T:
if the current episode terminates or t > T, computing the Q_EM value of the current state-action pair (s_t^i, a_t^i), storing the state-action pair (s_t^i, a_t^i) together with the computed Q_EM value into the episodic memory exploration table of the i-th unmanned aerial vehicle, setting k = k + 1 and executing 4h);
otherwise, returning to 4c);
4h) comparing the training count k with the training upper limit Q and judging whether training stops:
if k > Q, the training of the deep reinforcement learning model H is finished, and (5) is executed;
otherwise, returning to 4b);
(5) autonomously controlling the behavior of the unmanned aerial vehicles with the trained deep reinforcement learning model H:
5a) inputting the current state-action pair (s_t^i, a_t^i) of the i-th unmanned aerial vehicle into the target Critic network of the trained deep reinforcement learning model H to obtain the output value Q_i′(s_t^i, a_t^i);
5b) inputting the action a_t^i of the i-th unmanned aerial vehicle and the output value Q_i′ of the target Critic network into the target Actor network of the trained deep reinforcement learning model H to obtain the target Actor network output a_{t+1}^i, i.e. the action to be taken by the unmanned aerial vehicle at the next moment.
2. The method of claim 1, wherein: the real Critic network and the real Actor network in 2a) are each formed by cascading 4 convolutional layers, 1 pooling layer and 3 fully connected layers; the output value Q_i of the real Critic network serves as an input of the real Actor network, and the output a_i of the real Actor network serves as an input of the real Critic network:
the loss function of the real Critic network is:
L(θ_i^Q) = (1/M) Σ_j (y_j − Q_i(s_j, a_1^j, ..., a_n^j))², with y_j = r_i^j + γ Q_i′(s′_j, a′_1, ..., a′_n),
where M denotes the number of samples taken from the experience replay buffer, Q_i denotes the output function of the real Critic network of unmanned aerial vehicle i for the j-th sample, and Q_i′ denotes the output function of the target Critic network of unmanned aerial vehicle i;
the loss function of the real Actor network is:
L(θ_i^μ) = −(1/M) Σ_j Q_i^μ(s_j, a_1^j, ..., a_n^j)|_{a_i^j = μ_i′(o_i^j)},
where μ_i′(o_i) is the target-network policy of unmanned aerial vehicle i in the target Actor network for observation o_i, and Q_i^μ denotes the output function of the target Critic network of unmanned aerial vehicle i under the target Actor network policy μ_i′.
3. The method of claim 1, wherein: the target Critic network and the target Actor network in 2a) are each formed by cascading 4 convolutional layers, 1 pooling layer and 3 fully connected layers; the output value Q_i′ of the target Critic network serves as an input of the target Actor network, and the output a_i′ of the target Actor network serves as an input of the target Critic network.
4. The method of claim 1, wherein: updating the target networks by soft update in step 4f) means using the parameters θ_t^Q of the current real Critic network and the parameters θ_t^μ of the real Actor network to update, at update rate τ, the parameters θ_t^Q′ of the target Critic network and the parameters θ_t^μ′ of the target Actor network, obtaining the target Critic network parameters θ_{t+1}^Q′ and the target Actor network parameters θ_{t+1}^μ′ at the next moment, expressed as:
θ_{t+1}^Q′ = τ θ_t^Q + (1 − τ) θ_t^Q′
θ_{t+1}^μ′ = τ θ_t^μ + (1 − τ) θ_t^μ′
5. The method of claim 1, wherein: the multi-agent episodic memory exploration table constructed in step (3) consists of two columns, one column storing the state-action pair (s, a) of the unmanned aerial vehicle as the index Key, and the other column storing the Q_EM value of the unmanned aerial vehicle as the Value.
6. The method of claim 1, wherein: computing the Q_EM value of the current state-action pair (s_t^i, a_t^i) in 4g) is achieved as follows:
4g1) taking all records (s_t^i, a_t^i, r_t^i) of the current training episode from the experience replay buffer of the i-th unmanned aerial vehicle, where s_t^i denotes the state of the i-th unmanned aerial vehicle at time t, a_t^i denotes the action of the i-th unmanned aerial vehicle at time t, and r_t^i is the reward of the i-th unmanned aerial vehicle at time t;
4g2) counting the total number of records n and, in reverse order of time t, computing the total return obtained after taking the corresponding action in the current observation state:
G_t = Σ_{j=t}^{n} γ^(j−t) r_j^i, i.e. G_t = r_t^i + γ G_{t+1} with G_{n+1} = 0,
where γ denotes the attenuation (discount) factor;
4g3) using the total return G_t to compute the Q_EM value of the current state-action pair (s_t^i, a_t^i):
Q_EM(s_t^i, a_t^i) = G_t if (s_t^i, a_t^i) is not yet in EM, and Q_EM(s_t^i, a_t^i) = max(Q_EM(s_t^i, a_t^i), G_t) otherwise,
where EM is the episodic memory exploration table of the current agent.
7. The method of claim 1, wherein: the unmanned aerial vehicle control scene in step (1) is realized as follows:
constructing an unmanned aerial vehicle control scene and initializing a spatial region of size X × Y; setting the unmanned aerial vehicle set in the spatial region as U = {U_0, U_1, ..., U_i, ..., U_n}, where the action space of the i-th unmanned aerial vehicle U_i is A_i = {a_1, a_2, a_3, a_4}, n denotes the total number of unmanned aerial vehicles, i ranges from 0 to n and n ≥ 1;
initializing the pedestrian set in the scene as H = {h_1, h_2, ..., h_k, ..., h_m}, and initializing the position of the k-th pedestrian h_k as (x_k, y_k), where m denotes the total number of pedestrians and m ≥ 1;
initializing the reward r_i of the i-th unmanned aerial vehicle and the state set s_i = {x_i, y_i, u_i} of the i-th unmanned aerial vehicle U_i, where u_i denotes the number of pedestrians observed by the i-th unmanned aerial vehicle, u_i ≥ 0; x_i and x_k denote the abscissas of the unmanned aerial vehicle and the pedestrian respectively, with x_i, x_k < X; y_i and y_k denote the ordinates of the unmanned aerial vehicle and the pedestrian respectively, with y_i, y_k < Y.
CN202210577604.7A 2022-05-25 2022-05-25 Unmanned aerial vehicle control decision method based on context memory Pending CN114840024A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210577604.7A CN114840024A (en) 2022-05-25 2022-05-25 Unmanned aerial vehicle control decision method based on context memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210577604.7A CN114840024A (en) 2022-05-25 2022-05-25 Unmanned aerial vehicle control decision method based on context memory

Publications (1)

Publication Number Publication Date
CN114840024A true CN114840024A (en) 2022-08-02

Family

ID=82571406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210577604.7A Pending CN114840024A (en) 2022-05-25 2022-05-25 Unmanned aerial vehicle control decision method based on context memory

Country Status (1)

Country Link
CN (1) CN114840024A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117707219A (en) * 2024-02-05 2024-03-15 西安羚控电子科技有限公司 Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning
CN117707219B (en) * 2024-02-05 2024-05-17 西安羚控电子科技有限公司 Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination