CN114840024A - Unmanned aerial vehicle control decision method based on context memory - Google Patents

Unmanned aerial vehicle control decision method based on context memory

Info

Publication number
CN114840024A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
network
target
action
Prior art date
Legal status
Pending
Application number
CN202210577604.7A
Other languages
Chinese (zh)
Inventor
罗杰豪
方敏
谢佳晨
史令安
王鹏
王宏博
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210577604.7A
Publication of CN114840024A
Legal status: Pending

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention provides an unmanned aerial vehicle control decision method based on episodic memory, which comprises the following steps: constructing an unmanned aerial vehicle control scene; acquiring the observation information of each unmanned aerial vehicle; constructing and initializing a deep reinforcement learning model H; constructing an episodic memory exploration table; iteratively training the deep reinforcement learning model H; and autonomously controlling the behavior of the unmanned aerial vehicles with the trained deep reinforcement learning model H. The invention uses a multi-agent distributed episodic memory exploration table to store the best returns of the agents' past similar experiences, and during actual exploration each agent selects its action as a combination of a random action and the best explored action in the episodic memory exploration table, thereby improving the completion rate of the unmanned aerial vehicle control task and shortening the convergence time of the deep reinforcement learning algorithm.

Description

Unmanned aerial vehicle control decision method based on context memory
Technical Field
The invention belongs to the technical field of unmanned aerial vehicles, and particularly relates to a multi-unmanned-aerial-vehicle control decision method which can be used for monitoring the environment in multi-unmanned-aerial-vehicle scenes.
Background
In the field of multi-unmanned-aerial-vehicle control, multi-agent deep reinforcement learning is currently a very active research direction and has achieved very good results. Multi-agent reinforcement learning is an indispensable branch of reinforcement learning research. It can be applied to a series of tasks that must be completed cooperatively, such as multi-unmanned-aerial-vehicle reconnaissance and unmanned aerial vehicle swarm confrontation. Compared with a single agent, the tasks handled by multiple agents are more complex: during learning each agent must consider not only environmental factors but also the behavior of the other agents, which greatly increases the difficulty of learning. Meanwhile, in some special settings such as sparse rewards and complex coordination tasks, existing research on multi-agent reinforcement learning still has many shortcomings. While overcoming these defects of multi-agent reinforcement learning algorithms, the cooperative relationship among multiple unmanned aerial vehicles also needs to be analyzed in combination with the actual, complex unmanned aerial vehicle environment, so that the multiple unmanned aerial vehicles can monitor the environment information.
Multi-agent Q learning is the most typical reinforcement learning algorithm for solving the multi-unmanned-aerial-vehicle cooperation problem. Q learning can evaluate the value of each unmanned aerial vehicle's states and actions in an off-line manner: each unmanned aerial vehicle first keeps exploring the environment and accumulates enough trajectory experience, and Q learning then continuously corrects the value estimates of the joint states and actions of the unmanned aerial vehicles according to this trajectory experience. By repeatedly trying random actions in any state of the environment, Q learning can determine, from sufficient historical experience data, the action of highest value for an unmanned aerial vehicle in any state. Q learning is the most primitive form of reinforcement learning and the basis of other, more complex approaches.
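As an illustration of the tabular update described above, a minimal sketch of single-agent Q learning is given below; the state encoding, action set and hyper-parameters are assumptions for illustration only and are not part of the invention.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch: Q[(state, action)] holds the current value estimate.
Q = defaultdict(float)
actions = ["up", "down", "left", "right"]   # assumed discrete action set
alpha, gamma, epsilon = 0.1, 0.95, 0.1      # assumed hyper-parameters

def choose_action(state):
    """Epsilon-greedy action selection over the Q table."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """Standard Q-learning update from one observed transition."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```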
Traditional multi-agent Q learning uses tables to store data. As the number of unmanned aerial vehicles increases, storing data in a table becomes difficult to apply: when the numbers of states and actions of the unmanned aerial vehicles are very large, the storage space of the Q table becomes huge and table lookups become very slow, so that Q learning loses practical value. To solve this problem, DeepMind proposed the DQN algorithm in 2013. DQN combines a neural network with Q learning, so that no table is needed to record the Q values: the state of the unmanned aerial vehicle is used directly as the input of a neural network, and all action values are computed by the network. However, DQN can only handle discrete actions and cannot deal with continuous actions; moreover, its sample utilization during training is low and the trained Q values are not stable.
OpenAI, a well-known artificial intelligence company, proposed the multi-agent deep deterministic policy gradient algorithm MADDPG, which provides a framework of centralized training and decentralized execution, so that the algorithm can handle complex multi-agent tasks that traditional reinforcement learning cannot solve. When the workload of one task is too large and the task can be split into different subtasks, multiple agents can execute the subtasks in parallel and speed up task processing. When one agent in the system fails, the remaining agents can take over its work, which raises the robustness of the system. However, for the exploratory action selection of the agents, MADDPG mostly adopts an epsilon-greedy strategy that combines policy actions and random actions, so reinforcement learning has difficulty exploring efficiently at the early stage of training, the convergence of the algorithm is slow, and the monitoring performance in complex multi-unmanned-aerial-vehicle dynamic monitoring scenes is poor.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle control decision method based on episodic memory to overcome the defects of the prior art, so that in the action selection stage the action selection of the unmanned aerial vehicle is guided by combining an episodic memory exploration table with random action selection, which promotes the exploration of the state-action space by the unmanned aerial vehicle, accelerates convergence, and improves the monitoring performance of the unmanned aerial vehicles in dynamic scenes.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) setting an unmanned aerial vehicle control scene, acquiring the state set of each unmanned aerial vehicle at each moment, and computing the action reward value of each unmanned aerial vehicle at each moment: r_i = u_i · r, where u_i denotes the number of pedestrians observed by the i-th unmanned aerial vehicle and r is the reward value obtained for each observed pedestrian.
(2) Constructing a deep reinforcement learning model H and initializing it:
2a) selecting, for each unmanned aerial vehicle, a deep reinforcement learning model H formed by the bidirectional connection of a real Critic network with a real Actor network and the bidirectional connection of a target Critic network with a target Actor network;
2b) initializing the real Critic network parameters θ_1^Q, ..., θ_n^Q, the real Actor network parameters θ_1^μ, ..., θ_n^μ, the target Critic network parameters θ_1^Q′, ..., θ_n^Q′ and the target Actor network parameters θ_1^μ′, ..., θ_n^μ′ of the n unmanned aerial vehicles; initializing the network learning rate α, the future-return discount rate γ, the training batch size batch, the experience replay buffer size N and the soft update rate τ of the target networks; initializing the training count k = 0 and setting the maximum number of training episodes K;
(3) constructing a multi-agent episodic memory exploration table for each unmanned aerial vehicle, and initializing the maximum capacity c of the episodic memory and the key dimension dim of the episodic memory;
(4) iteratively training the deep reinforcement learning model H:
4a) initializing the training count: set the maximum number of training episodes Q = 5000 and k = 1;
4b) initializing the iteration count: set the maximum number of iterations T = 1000 and t = 1;
4c) determining whether the current state-action pair (s_t^i, a_t^i) exists in the episodic memory exploration table:
if (s_t^i, a_t^i) exists in the episodic memory exploration table, select a random action with greedy probability ε and select the optimal action a_t^i from the episodic memory exploration table with probability 1 − ε, execute the unmanned aerial vehicle action a_t^i, obtain the reward r_t^i and read the output value Q_t^i from the episodic memory exploration table, then enter the next state s_{t+1}^i and set t = t + 1;
if (s_t^i, a_t^i) does not exist in the episodic memory exploration table, input (s_t^i, a_t^i) into the real Critic network to obtain the output value Q_t^i, then enter the next state s_{t+1}^i and set t = t + 1;
4d) forming the action set a_t = {a_t^1, ..., a_t^n} from the current actions of all unmanned aerial vehicles and interacting with the environment to obtain the reward value set r_t = {r_t^1, ..., r_t^n}; grouping the output values of the episodic memory exploration table or of the real Critic network into the set Q_t = {Q_t^1, ..., Q_t^n}; forming the state set s_{t+1} = {s_{t+1}^1, ..., s_{t+1}^n} from the states of all unmanned aerial vehicles at the next moment; storing the four sets of unmanned aerial vehicle information (s_t, a_t, s_{t+1}, r_t, Q_t) into the experience replay buffer;
4e) checking the number of experience vectors in the replay buffer:
if the number of experience vectors in the experience replay buffer is greater than N/2, take M samples from the experience replay buffer, update the real Critic network by minimizing its loss function, and update the real Actor network by gradient descent;
if the number of experience vectors in the experience replay buffer is less than or equal to N/2, return to 4c);
if the number of experience vectors is greater than N, remove the oldest experience vector;
4f) updating the target networks by soft update;
4g) comparing the current iteration count t with the iteration upper limit T:
if the current episode terminates or t > T, compute the Q_EM value of the current state-action pair (s_t^i, a_t^i), store the state-action pair (s_t^i, a_t^i) together with the computed Q_EM value into the episodic memory exploration table of the i-th unmanned aerial vehicle, set k = k + 1 and execute 4h);
otherwise, return to 4c);
4h) comparing the training count k with the training upper limit Q and judging whether training stops:
if k > Q, the training of the deep reinforcement learning model H is finished; execute (5);
otherwise, return to 4b);
(5) autonomously controlling the behavior of the unmanned aerial vehicles with the trained deep reinforcement learning model H:
5a) inputting the current state-action pair (s_t^i, a_t^i) of the i-th unmanned aerial vehicle into the target Critic network of the trained deep reinforcement learning model H to obtain the output value Q_i′(s_t^i, a_t^i);
5b) inputting the action a_t^i of the i-th unmanned aerial vehicle and the output value Q_i′ of the target Critic network into the target Actor network of the trained deep reinforcement learning model H to obtain the target Actor network output a_{t+1}^i, i.e. the action to be taken by the unmanned aerial vehicle at the next moment.
Compared with the prior art, the invention has the following advantages:
1. The invention uses a multi-agent distributed episodic memory exploration table to store the best returns of the unmanned aerial vehicle's past similar experiences, so that the unmanned aerial vehicle can replay actions that have already produced high returns in the episodic memory exploration table and obtain good results again without waiting for a lengthy network gradient update process, which accelerates the convergence of the deep reinforcement learning model and finds the optimal actions faster.
2. During the actual exploration of the unmanned aerial vehicle, the invention selects actions as a combination of random actions and the best explored action in the episodic memory exploration table: the unmanned aerial vehicle selects a random action with probability ε and selects the best explored action in the episodic memory exploration table with probability 1 − ε, so that the unmanned aerial vehicle keeps some randomness during exploration and the problem of insufficient exploration is alleviated.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a view of a ground area scene monitored by the unmanned aerial vehicle;
FIG. 3 is a simulation comparison of the average reward value of the present invention and the prior art in the 50 × 50 scene;
FIG. 4 is a simulation comparison of the average reward value of the present invention and the prior art in the 100 × 100 scene.
Detailed Description
Embodiments and effects of the present invention will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the implementation steps of this example are as follows:
step 1, setting an unmanned aerial vehicle control scene.
Referring to fig. 2, the present example constructs an unmanned aerial vehicle control scene, including a spatial region with a size of X × Y, an unmanned aerial vehicle set U, and a pedestrian set H;
Initialize the unmanned aerial vehicle set in the spatial region as U = {U_0, U_1, ..., U_i, ..., U_n}, where the action space of the i-th unmanned aerial vehicle U_i is A_i = {a_1, a_2, a_3, a_4}, n denotes the total number of unmanned aerial vehicles, i ranges from 0 to n and n ≥ 1; in this example X = 50 meters, Y = 50 meters and n = 4;
initialize the pedestrian set in the scene as H = {h_1, h_2, ..., h_k, ..., h_m}, and initialize the position of the k-th pedestrian h_k as (x_k, y_k), where m denotes the total number of pedestrians and m ≥ 1; m = 8 in this example;
initialize the reward r_i of the i-th unmanned aerial vehicle and the state set s_i = {x_i, y_i, u_i} of the i-th unmanned aerial vehicle U_i, where u_i denotes the number of pedestrians observed by the i-th unmanned aerial vehicle, u_i ≥ 0; x_i and x_k denote the abscissas of the unmanned aerial vehicle and the pedestrian respectively, with x_i, x_k < X; y_i and y_k denote the ordinates of the unmanned aerial vehicle and the pedestrian respectively, with y_i, y_k < Y.
Step 2, acquiring the observation information of each unmanned aerial vehicle.
Acquire the observation information of each unmanned aerial vehicle from the ground-monitoring scene established in step 1, i.e. compute the action reward value of each unmanned aerial vehicle at each moment from its state set: r_i = u_i · r, where r is the reward value obtained for each observed pedestrian; r = 20 in this example.
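As a concrete illustration of this reward, the following sketch counts the pedestrians inside an unmanned aerial vehicle's field of view; the field-of-view radius and the data layout are assumptions.

```python
# Sketch of the reward r_i = u_i * r: count pedestrians within the UAV's field of view.
R_PER_PEDESTRIAN = 20      # r, as in this example
FOV_RADIUS = 5             # assumed field-of-view radius in grid cells

def uav_reward(uav_pos, pedestrians, r=R_PER_PEDESTRIAN, fov=FOV_RADIUS):
    """uav_pos: (x_i, y_i); pedestrians: list of (x_k, y_k). Returns (u_i, r_i)."""
    x_i, y_i = uav_pos
    u_i = sum(1 for (x_k, y_k) in pedestrians
              if abs(x_k - x_i) <= fov and abs(y_k - y_i) <= fov)
    return u_i, u_i * r

# Example: a UAV at (10, 10) with two pedestrians in view earns r_i = 40.
u, ri = uav_reward((10, 10), [(8, 12), (11, 9), (40, 40)])
```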
Step 3, constructing and initializing the deep reinforcement learning model H.
3.1) Construct a deep reinforcement learning model H consisting of a real Critic network, a real Actor network, a target Critic network and a target Actor network: the real Critic network is bidirectionally connected with the real Actor network, and the target Critic network is bidirectionally connected with the target Actor network, forming two independent network branches; the real Critic network, the real Actor network, the target Critic network and the target Actor network are each formed by cascading 4 convolutional layers, 1 pooling layer and 3 fully connected layers;
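A minimal PyTorch sketch of one such network (4 convolutional layers, 1 pooling layer and 3 fully connected layers) is given below; the channel counts, kernel sizes and the image-like observation shape are assumptions, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class CriticNet(nn.Module):
    """4 conv layers -> 1 pooling layer -> 3 fully connected layers (assumed sizes)."""
    def __init__(self, in_channels=3, out_dim=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # the single pooling layer
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, x):
        return self.head(self.features(x))

# Example: a batch of 8 observations of shape (3, 50, 50) -> 8 Q values.
q = CriticNet()(torch.randn(8, 3, 50, 50))
```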
3.2) Initialize the real Critic network parameters θ_1^Q, ..., θ_n^Q, the real Actor network parameters θ_1^μ, ..., θ_n^μ, the target Critic network parameters θ_1^Q′, ..., θ_n^Q′ and the target Actor network parameters θ_1^μ′, ..., θ_n^μ′ of the n unmanned aerial vehicles; initialize the network learning rate α, the future-return discount rate γ, the training batch size batch, the experience replay buffer size N, the soft update rate τ of the target networks, the training count k and the maximum number of training episodes Q; in this example α = 0.01, γ = 0.95, batch = 64, N = 10000, τ = 0.1, k = 0 and Q = 5000.
Step 4, constructing the episodic memory exploration table.
Construct a multi-agent episodic memory exploration table for each unmanned aerial vehicle. The table consists of two columns: one column stores the index value Key of the state-action pair (s, a) of the unmanned aerial vehicle; the other column stores the Q_EM value of the unmanned aerial vehicle, where Q_EM is the optimal Q value for taking action a in the current state s, as shown below:
Multi-agent episodic memory exploration table
Key: index of the state-action pair (s, a) | Value: Q_EM
Initialize the maximum capacity c of the episodic memory exploration table and set the Key dimension dim of the episodic memory; c = 100000 and dim = 64 in this example.
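A minimal sketch of such a table as a fixed-capacity dictionary is given below; the way a state-action pair is encoded into a dim-dimensional key (here a fixed random projection followed by rounding) is an assumption.

```python
import numpy as np
from collections import OrderedDict

class EpisodicMemoryTable:
    """Key: encoding of (s, a); Value: best return Q_EM observed for that pair."""
    def __init__(self, capacity=100000, dim=64, obs_dim=3, n_actions=4, seed=0):
        self.capacity = capacity
        rng = np.random.default_rng(seed)
        # assumed fixed random projection of the concatenated (s, a) vector to `dim` values
        self.projection = rng.normal(size=(obs_dim + n_actions, dim))
        self.table = OrderedDict()

    def key(self, state, action_onehot):
        x = np.concatenate([state, action_onehot]) @ self.projection
        return tuple(np.round(x, 2))          # discretise so similar pairs share a key

    def get(self, state, action_onehot):
        return self.table.get(self.key(state, action_onehot))

    def update(self, state, action_onehot, ret):
        k = self.key(state, action_onehot)
        self.table[k] = max(self.table.get(k, -np.inf), ret)   # keep the best return
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)                     # evict the oldest entry
```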
Step 5, iteratively training the deep reinforcement learning model H.
5.1) Initialize the training count: set the maximum number of training episodes Q = 5000 and k = 1;
5.2) Initialize the iteration count of each training episode: set the maximum number of iterations T = 1000 and t = 1;
5.3) Determine whether the current state-action pair (s_t^i, a_t^i) exists in the episodic memory exploration table:
if (s_t^i, a_t^i) exists in the episodic memory exploration table, select a random action with greedy probability ε and select the optimal action a_t^i from the episodic memory exploration table with probability 1 − ε, execute the unmanned aerial vehicle action a_t^i, obtain the reward r_t^i, then read the output value Q_t^i from the episodic memory exploration table, enter the next state s_{t+1}^i and set t = t + 1;
if (s_t^i, a_t^i) does not exist in the episodic memory exploration table, input (s_t^i, a_t^i) into the real Critic network to obtain the output value Q_t^i, then enter the next state s_{t+1}^i and set t = t + 1;
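The following sketch illustrates this action-selection rule, assuming an em_table object like the one sketched in step 4 and one-hot action encodings; the helper names are assumptions.

```python
import random
import numpy as np

def select_action(state, em_table, n_actions=4, epsilon=0.1):
    """With probability epsilon act randomly; otherwise replay the best action
    recorded in the episodic memory exploration table for this state."""
    candidates = []
    for a in range(n_actions):
        onehot = np.eye(n_actions)[a]
        q = em_table.get(state, onehot)
        if q is not None:
            candidates.append((q, a))
    if not candidates or random.random() < epsilon:
        return random.randrange(n_actions)     # random exploratory action
    return max(candidates)[1]                  # best explored action in the table
```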
5.4) Form the action set a_t = {a_t^1, ..., a_t^n} from the current actions of all unmanned aerial vehicles and interact with the environment to obtain the reward value set r_t = {r_t^1, ..., r_t^n}; group the output values of the episodic memory exploration table or of the real Critic network into the set Q_t = {Q_t^1, ..., Q_t^n}; form the state set s_{t+1} = {s_{t+1}^1, ..., s_{t+1}^n} from the states of all unmanned aerial vehicles at the next moment; store the four sets of unmanned aerial vehicle information (s_t, a_t, s_{t+1}, r_t, Q_t) into the experience replay buffer;
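A sketch of the experience replay buffer into which these tuples are stored is given below; the deque-based oldest-first eviction and the half-full training threshold anticipate the checks of step 5.5, and the field names are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s_t, a_t, s_t1, r_t, Q_t) tuples; a deque with maxlen N evicts the oldest."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)

    def push(self, s_t, a_t, s_t1, r_t, q_t):
        self.buffer.append((s_t, a_t, s_t1, r_t, q_t))

    def ready(self):
        # training only starts once the buffer is more than half full
        return len(self.buffer) > self.capacity // 2

    def sample(self, m=64):
        return random.sample(self.buffer, m)
```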
5.5) Check the number of experience vectors in the replay buffer:
if the number of experience vectors in the experience replay buffer is greater than N/2, take M samples from the experience replay buffer, update the real Critic network by minimizing its loss function, and update the real Actor network by gradient descent;
the loss function for a real Critic network is:
Figure BDA0003660931350000074
Figure BDA0003660931350000075
where M represents the number of samples taken from the experience buffer pool, Q i Represents the output value, Q ', in the jth sample of drone i' i An output value representing a target criticic network of drone i;
the loss function of a real Actor network is:
Figure BDA0003660931350000076
wherein
Figure BDA0003660931350000077
Is in the target Actor networkUnmanned aerial vehicle i observes information as
Figure BDA0003660931350000078
The target network policy under (a) is,
Figure BDA0003660931350000079
network policy representing unmanned aerial vehicle i at target Actor
Figure BDA00036609313500000710
The output value of the lower target criticic network.
If the number of experience vectors in the experience replay buffer is less than or equal to N/2, return to 5.3);
if the number of experience vectors is greater than N, remove the oldest experience vector;
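The following PyTorch sketch illustrates the Critic and Actor updates described in step 5.5 for a single unmanned aerial vehicle, using the standard DDPG-style actor update that maximizes the real Critic's value; the network sizes, optimizers and the single-agent simplification are assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 3, 4, 0.95
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
target_critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic_opt = torch.optim.Adam(critic.parameters(), lr=0.01)
actor_opt = torch.optim.Adam(actor.parameters(), lr=0.01)

def update(s, a, r, s_next):
    """s, s_next: (M, obs_dim); a: (M, act_dim); r: (M, 1)."""
    # Critic: minimise (y_j - Q(s_j, a_j))^2 with y_j = r_j + gamma * Q'(s'_j, mu'(s'_j))
    with torch.no_grad():
        y = r + gamma * target_critic(torch.cat([s_next, target_actor(s_next)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: gradient descent on -Q(s_j, mu(s_j)), i.e. increase the Critic's value
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example with a random mini-batch of M = 64 transitions
update(torch.randn(64, obs_dim), torch.randn(64, act_dim),
       torch.randn(64, 1), torch.randn(64, obs_dim))
```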
5.6) updating the target network in a soft updating mode.
Use the parameters θ_t^Q of the current real Critic network and the parameters θ_t^μ of the real Actor network to update, at update rate τ, the parameters θ_t^Q′ of the target Critic network and the parameters θ_t^μ′ of the target Actor network, obtaining the target Critic network parameters θ_{t+1}^Q′ and the target Actor network parameters θ_{t+1}^μ′ at the next moment. The update formulas are:
θ_{t+1}^Q′ = τ θ_t^Q + (1 − τ) θ_t^Q′
θ_{t+1}^μ′ = τ θ_t^μ + (1 − τ) θ_t^μ′
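A sketch of this soft update, applied parameter by parameter, is given below; it can be applied to any pair of real and target networks, e.g. soft_update(critic, target_critic).

```python
import torch

def soft_update(real_net, target_net, tau=0.1):
    """theta_target <- tau * theta_real + (1 - tau) * theta_target, parameter-wise."""
    with torch.no_grad():
        for p, tp in zip(real_net.parameters(), target_net.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```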
5.7) Compare the current iteration count t with the iteration upper limit T:
if the current episode terminates or t > T, compute the Q_EM value of the current state-action pair (s_t^i, a_t^i) as in 5.8), store the state-action pair (s_t^i, a_t^i) together with the computed Q_EM value into the episodic memory exploration table of the i-th unmanned aerial vehicle, set k = k + 1 and execute 5.9);
otherwise, return to 5.3).
5.8) Compute the Q_EM value:
5.8.1) take all records (s_t^i, a_t^i, r_t^i) of the current training episode from the experience replay buffer of the i-th unmanned aerial vehicle, where s_t^i denotes the state of the i-th unmanned aerial vehicle at time t, a_t^i denotes the action of the i-th unmanned aerial vehicle at time t, and r_t^i is the reward of the i-th unmanned aerial vehicle at time t;
5.8.2) count the total number of records n and, in reverse order of time t, compute the total return obtained after taking the corresponding action in the current observation state:
G_t = Σ_{j=t}^{n} γ^(j−t) r_j^i, i.e. G_t = r_t^i + γ G_{t+1} with G_{n+1} = 0,
where γ denotes the attenuation (discount) factor;
5.8.3) use the total return G_t to compute the Q_EM value of the current state-action pair (s_t^i, a_t^i):
Q_EM(s_t^i, a_t^i) = G_t if (s_t^i, a_t^i) is not yet in EM, and Q_EM(s_t^i, a_t^i) = max(Q_EM(s_t^i, a_t^i), G_t) otherwise,
where EM is the episodic memory exploration table of the current unmanned aerial vehicle;
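The following sketch computes the discounted returns in reverse time order and writes them into the episodic memory exploration table, assuming the EpisodicMemoryTable sketched in step 4 and integer action indices.

```python
import numpy as np

def write_episode_to_memory(episode, em_table, gamma=0.95, n_actions=4):
    """episode: list of (state, action_index, reward) for one UAV, in time order."""
    G = 0.0
    for state, action, reward in reversed(episode):       # G_t = r_t + gamma * G_{t+1}
        G = reward + gamma * G
        onehot = np.eye(n_actions)[action]
        em_table.update(state, onehot, G)                  # keeps max(Q_EM, G_t)
```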
5.9) comparing the training times k with the upper limit Q of the training times, and judging whether the training is stopped:
if k is larger than Q, finishing the training of the deep reinforcement learning model H, and executing the step 6;
otherwise, return to 5.2).
Step 6, using the trained deep reinforcement learning model H to autonomously control the behavior of the unmanned aerial vehicle:
6.1) Input the current state-action pair (s_t^i, a_t^i) of the i-th unmanned aerial vehicle into the target Critic network of the trained deep reinforcement learning model H to obtain the output value Q_i′(s_t^i, a_t^i);
6.2) input the action a_t^i of the i-th unmanned aerial vehicle and the output value Q_i′ of the target Critic network into the target Actor network of the trained deep reinforcement learning model H to obtain the target Actor network output a_{t+1}^i, i.e. the action to be taken by the unmanned aerial vehicle at the next moment.
The technical effects of the present invention will be further described with reference to simulation experiments.
1. Simulation conditions are as follows:
simulation experiments were simulated using Python language on a Microsoft windows 10 system with a CPU of Intel i7-9700 and memory 16G.
The ground area scene monitored by the unmanned aerial vehicles in the experiment is shown in Fig. 2, where Fig. 2a shows the global observation of the environment: red dots represent pedestrians on the ground, black dots represent obstacles, and green dots represent forest. Fig. 2b shows the monitoring view of an unmanned aerial vehicle. Each unmanned aerial vehicle and each pedestrian has four actions: move up, move left, move right and move down. The goal of the unmanned aerial vehicles is to monitor pedestrians on the ground; an unmanned aerial vehicle is rewarded when a pedestrian appears in its field of view.
2. experimental contents and results:
The proposed method is compared, by simulation, with existing unmanned aerial vehicle control decision methods based on the MADDPG and IQL algorithms in terms of task completion rate, maximum reward value and the number of steps needed to reach the maximum reward, in the two scenes of size 50 × 50 and 100 × 100 shown in Fig. 2. The simulation results for task completion rate and maximum reward value are listed in Table 2; the numbers of steps needed to reach the maximum reward in the 50 × 50 and 100 × 100 scenes are shown in Fig. 3 and Fig. 4 respectively.
TABLE 2 Comparison of the present invention with existing unmanned aerial vehicle control decision methods based on the MADDPG and IQL algorithms

Scene | Method | Task completion rate | Maximum reward value
50 × 50 | Present invention | 99.2% | 158
50 × 50 | MADDPG | 97.4% | 155
50 × 50 | IQL | 62.5% | 108
100 × 100 | Present invention | 97.7% | 155
100 × 100 | MADDPG | 75.1% | 115
100 × 100 | IQL | 52.6% | 75
As can be seen from Table 2, in the 50 × 50 scene the task completion rate of the present method is 99.2%, the task completion rate of the unmanned aerial vehicle control decision method based on the MADDPG algorithm is 97.4%, and the task completion rate of the unmanned aerial vehicle control decision method based on the IQL algorithm is 62.5%; the maximum reward value of the present method is 158, the maximum reward value of the MADDPG-based method is 155, and the maximum reward value of the IQL-based method is 108.
In the 100 × 100 scene, the task completion rate of the present method is 97.7%, the task completion rate of the MADDPG-based method is 75.1%, and the task completion rate of the IQL-based method is 52.6%; the maximum reward value of the present method is 155, the maximum reward value of the MADDPG-based method is 115, and the maximum reward value of the IQL-based method is 75.
As can be seen from Fig. 3, in the 50 × 50 scene the present method reaches the maximum reward in 1200 steps, while the unmanned aerial vehicle control decision method based on the MADDPG algorithm needs 2500 steps and the method based on the IQL algorithm needs 2500 steps.
As can be seen from Fig. 4, in the 100 × 100 scene the present method reaches the maximum reward in 2100 steps, while the MADDPG-based method needs 3200 steps and the IQL-based method needs 5000 steps.
The comparison results show that, compared with the prior art, the method has clear advantages in task completion rate, maximum reward and convergence speed, and can effectively improve the performance of unmanned aerial vehicle control decisions.

Claims (7)

1. An unmanned aerial vehicle control decision method based on episodic memory, characterized by comprising the following steps:
(1) setting an unmanned aerial vehicle control scene, acquiring the state set of each unmanned aerial vehicle at each moment, and computing the action reward value of each unmanned aerial vehicle at each moment: r_i = u_i · r, where u_i denotes the number of pedestrians observed by the i-th unmanned aerial vehicle and r is the reward value obtained for each observed pedestrian;
(2) constructing a deep reinforcement learning model H and initializing it:
2a) selecting, for each unmanned aerial vehicle, a deep reinforcement learning model H formed by the bidirectional connection of a real Critic network with a real Actor network and the bidirectional connection of a target Critic network with a target Actor network;
2b) initializing the real Critic network parameters θ_1^Q, ..., θ_n^Q, the real Actor network parameters θ_1^μ, ..., θ_n^μ, the target Critic network parameters θ_1^Q′, ..., θ_n^Q′ and the target Actor network parameters θ_1^μ′, ..., θ_n^μ′ of the n unmanned aerial vehicles; initializing the network learning rate α, the future-return discount rate γ, the training batch size batch, the experience replay buffer size N and the soft update rate τ of the target networks; initializing the training count k = 0 and setting the maximum number of training episodes K;
(3) constructing a multi-agent episodic memory exploration table for each unmanned aerial vehicle, and initializing the maximum capacity c of the episodic memory and the key dimension dim of the episodic memory;
(4) iteratively training the deep reinforcement learning model H:
4a) initializing the training count: set the maximum number of training episodes Q = 5000 and k = 1;
4b) initializing the iteration count: set the maximum number of iterations T = 1000 and t = 1;
4c) determining whether the current state-action pair (s_t^i, a_t^i) exists in the episodic memory exploration table:
if (s_t^i, a_t^i) exists in the episodic memory exploration table, selecting a random action with greedy probability ε and selecting the optimal action a_t^i from the episodic memory exploration table with probability 1 − ε, executing the unmanned aerial vehicle action a_t^i, obtaining the reward r_t^i and the output value Q_t^i from the episodic memory exploration table, and then entering the next state s_{t+1}^i;
if (s_t^i, a_t^i) does not exist in the episodic memory exploration table, inputting (s_t^i, a_t^i) into the real Critic network to obtain the output value Q_t^i, and then entering the next state s_{t+1}^i;
4d) forming the action set a_t = {a_t^1, ..., a_t^n} from the current actions of all unmanned aerial vehicles and interacting with the environment to obtain the reward value set r_t = {r_t^1, ..., r_t^n}; grouping the output values of the episodic memory exploration table or of the real Critic network into the set Q_t = {Q_t^1, ..., Q_t^n}; forming the state set s_{t+1} = {s_{t+1}^1, ..., s_{t+1}^n} from the states of all unmanned aerial vehicles at the next moment; storing the four sets of unmanned aerial vehicle information (s_t, a_t, s_{t+1}, r_t, Q_t) into the experience replay buffer;
4e) checking the number of experience vectors in the replay buffer:
if the number of experience vectors in the experience replay buffer is greater than N/2, taking M samples from the experience replay buffer, updating the real Critic network by minimizing its loss function, and updating the real Actor network by gradient descent;
if the number of experience vectors in the experience replay buffer is less than or equal to N/2, returning to 4c);
if the number of experience vectors is greater than N, removing the oldest experience vector;
4f) updating the target networks by soft update;
4g) comparing the current iteration count t with the iteration upper limit T:
if the current episode terminates or t > T, computing the Q_EM value of the current state-action pair (s_t^i, a_t^i), storing the state-action pair (s_t^i, a_t^i) together with the computed Q_EM value into the episodic memory exploration table of the i-th unmanned aerial vehicle, setting k = k + 1 and executing 4h);
otherwise, returning to 4c);
4h) comparing the training count k with the training upper limit Q and judging whether training stops:
if k > Q, the training of the deep reinforcement learning model H is finished, and (5) is executed;
otherwise, returning to 4b);
(5) autonomously controlling the behavior of the unmanned aerial vehicles with the trained deep reinforcement learning model H:
5a) inputting the current state-action pair (s_t^i, a_t^i) of the i-th unmanned aerial vehicle into the target Critic network of the trained deep reinforcement learning model H to obtain the output value Q_i′(s_t^i, a_t^i);
5b) inputting the action a_t^i of the i-th unmanned aerial vehicle and the output value Q_i′ of the target Critic network into the target Actor network of the trained deep reinforcement learning model H to obtain the target Actor network output a_{t+1}^i, i.e. the action to be taken by the unmanned aerial vehicle at the next moment.
2. The method of claim 1, wherein: the real Critic network and the real Actor network in 2a) are each formed by cascading 4 convolutional layers, 1 pooling layer and 3 fully connected layers; the output value Q_i of the real Critic network serves as an input of the real Actor network, and the output a_i of the real Actor network serves as an input of the real Critic network:
the loss function of the real Critic network is:
L(θ_i^Q) = (1/M) Σ_j (y_j − Q_i(s_j, a_1^j, ..., a_n^j))², with y_j = r_i^j + γ Q_i′(s′_j, a′_1, ..., a′_n),
where M denotes the number of samples taken from the experience replay buffer, Q_i denotes the output function of the real Critic network of unmanned aerial vehicle i for the j-th sample, and Q_i′ denotes the output function of the target Critic network of unmanned aerial vehicle i;
the loss function of the real Actor network is:
L(θ_i^μ) = −(1/M) Σ_j Q_i^μ(s_j, a_1^j, ..., a_n^j)|_{a_i^j = μ_i′(o_i^j)},
where μ_i′(o_i) is the target-network policy of unmanned aerial vehicle i in the target Actor network for observation o_i, and Q_i^μ denotes the output function of the target Critic network of unmanned aerial vehicle i under the target Actor network policy μ_i′.
3. The method of claim 1, wherein: the target Critic network and the target Actor network in 2a) are each formed by cascading 4 convolutional layers, 1 pooling layer and 3 fully connected layers; the output value Q_i′ of the target Critic network serves as an input of the target Actor network, and the output a_i′ of the target Actor network serves as an input of the target Critic network.
4. The method of claim 1, wherein: updating the target networks by soft update in step 4f) means using the parameters θ_t^Q of the current real Critic network and the parameters θ_t^μ of the real Actor network to update, at update rate τ, the parameters θ_t^Q′ of the target Critic network and the parameters θ_t^μ′ of the target Actor network, obtaining the target Critic network parameters θ_{t+1}^Q′ and the target Actor network parameters θ_{t+1}^μ′ at the next moment, expressed as:
θ_{t+1}^Q′ = τ θ_t^Q + (1 − τ) θ_t^Q′
θ_{t+1}^μ′ = τ θ_t^μ + (1 − τ) θ_t^μ′
5. The method of claim 1, wherein: the multi-agent episodic memory exploration table constructed in step (3) consists of two columns, one column storing the state-action pair (s, a) of the unmanned aerial vehicle as the index Key, and the other column storing the Q_EM value of the unmanned aerial vehicle as the Value.
6. The method of claim 1, wherein: computing the Q_EM value of the current state-action pair (s_t^i, a_t^i) in 4g) is achieved as follows:
4g1) taking all records (s_t^i, a_t^i, r_t^i) of the current training episode from the experience replay buffer of the i-th unmanned aerial vehicle, where s_t^i denotes the state of the i-th unmanned aerial vehicle at time t, a_t^i denotes the action of the i-th unmanned aerial vehicle at time t, and r_t^i is the reward of the i-th unmanned aerial vehicle at time t;
4g2) counting the total number of records n and, in reverse order of time t, computing the total return obtained after taking the corresponding action in the current observation state:
G_t = Σ_{j=t}^{n} γ^(j−t) r_j^i, i.e. G_t = r_t^i + γ G_{t+1} with G_{n+1} = 0,
where γ denotes the attenuation (discount) factor;
4g3) using the total return G_t to compute the Q_EM value of the current state-action pair (s_t^i, a_t^i):
Q_EM(s_t^i, a_t^i) = G_t if (s_t^i, a_t^i) is not yet in EM, and Q_EM(s_t^i, a_t^i) = max(Q_EM(s_t^i, a_t^i), G_t) otherwise,
where EM is the episodic memory exploration table of the current agent.
7. The method of claim 1, wherein: the unmanned aerial vehicle control scene in step (1) is realized as follows:
constructing an unmanned aerial vehicle control scene and initializing a spatial region of size X × Y; setting the unmanned aerial vehicle set in the spatial region as U = {U_0, U_1, ..., U_i, ..., U_n}, where the action space of the i-th unmanned aerial vehicle U_i is A_i = {a_1, a_2, a_3, a_4}, n denotes the total number of unmanned aerial vehicles, i ranges from 0 to n and n ≥ 1;
initializing the pedestrian set in the scene as H = {h_1, h_2, ..., h_k, ..., h_m}, and initializing the position of the k-th pedestrian h_k as (x_k, y_k), where m denotes the total number of pedestrians and m ≥ 1;
initializing the reward r_i of the i-th unmanned aerial vehicle and the state set s_i = {x_i, y_i, u_i} of the i-th unmanned aerial vehicle U_i, where u_i denotes the number of pedestrians observed by the i-th unmanned aerial vehicle, u_i ≥ 0; x_i and x_k denote the abscissas of the unmanned aerial vehicle and the pedestrian respectively, with x_i, x_k < X; y_i and y_k denote the ordinates of the unmanned aerial vehicle and the pedestrian respectively, with y_i, y_k < Y.
CN202210577604.7A 2022-05-25 2022-05-25 Unmanned aerial vehicle control decision method based on context memory Pending CN114840024A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210577604.7A CN114840024A (en) 2022-05-25 2022-05-25 Unmanned aerial vehicle control decision method based on context memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210577604.7A CN114840024A (en) 2022-05-25 2022-05-25 Unmanned aerial vehicle control decision method based on context memory

Publications (1)

Publication Number Publication Date
CN114840024A true CN114840024A (en) 2022-08-02

Family

ID=82571406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210577604.7A Pending CN114840024A (en) 2022-05-25 2022-05-25 Unmanned aerial vehicle control decision method based on context memory

Country Status (1)

Country Link
CN (1) CN114840024A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117707219A (en) * 2024-02-05 2024-03-15 西安羚控电子科技有限公司 Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning
CN117707219B (en) * 2024-02-05 2024-05-17 西安羚控电子科技有限公司 Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination