CN114371634B - Unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback - Google Patents

Unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback Download PDF

Info

Publication number
CN114371634B
CN114371634B CN202111584665.8A CN202111584665A CN114371634B CN 114371634 B CN114371634 B CN 114371634B CN 202111584665 A CN202111584665 A CN 202111584665A CN 114371634 B CN114371634 B CN 114371634B
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
target
network
pool
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111584665.8A
Other languages
Chinese (zh)
Other versions
CN114371634A (en)
Inventor
林旺群
***
王伟
王锐华
李妍
张世杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Original Assignee
Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences filed Critical Strategic Evaluation And Consultation Center Of Pla Academy Of Military Sciences
Priority to CN202111584665.8A priority Critical patent/CN114371634B/en
Publication of CN114371634A publication Critical patent/CN114371634A/en
Application granted granted Critical
Publication of CN114371634B publication Critical patent/CN114371634B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B17/00 Systems involving the use of models or simulators of said systems
    • G05B17/02 Systems involving the use of models or simulators of said systems electric

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback comprises an unmanned aerial vehicle simulation preparation information setting step, an unmanned aerial vehicle training network construction step based on multi-target agent training, an unmanned aerial vehicle training network training step based on multi-target agent training, and a repeated simulation and finishing step. The invention sets up a multi-stage after-the-fact experience playback pool, stores and uses plot samples of three different priorities, and samples randomly from the experience playback pools with probabilities determined by their priorities: the higher the priority, the higher the sampling probability, so that useful sample information is exploited. This greatly improves training efficiency, raises the overall utilization rate of samples and accelerates the learning of the multi-target agent model.

Description

Unmanned aerial vehicle combat analog simulation method based on multi-stage after experience playback
Technical Field
The invention relates to the field of virtual simulation, and in particular to an unmanned aerial vehicle combat simulation method based on multi-level after-the-fact experience playback, which introduces a multi-level after-the-fact experience playback mechanism into multi-target agent deep reinforcement learning, quickly improves the learning performance of the agent, and quickly increases the speed at which the agent completes multi-target tasks in unmanned aerial vehicle flight simulation.
Background
With the development of unmanned and intelligent technologies, the use of unmanned aerial vehicles has become an important topic in both civil and military fields. The original unmanned aerial vehicles were mainly operated manually; with the development of intelligence and simulation, simulation control methods based on various agents have been applied to the simulated flight operation of unmanned aerial vehicles.
For example, in the prior art, a behavior-tree-based method can be used for simulated reinforcement learning of unmanned aerial vehicle flight. This method can model the unmanned aerial vehicle flight strategy clearly and deploy it rapidly, but it is constrained by the perceptual and intuitive cognitive limitations of the manual approach and by the granularity and index precision of the rule concepts, so it is difficult to build models that are objective, fast, accurate and robust. Moreover, this unmanned aerial vehicle simulation reinforcement learning method is costly in complex scenarios and difficult to realize.
Another unmanned aerial vehicle control method in the prior art is mainly realized through deep reinforcement learning. It uses the idea of a universal value function approximator, takes the navigation target and state information of the unmanned aerial vehicle jointly as the input of a multi-target agent, and at the same time uses after-the-fact experience playback to generate new virtual plots with virtual targets on the basis of existing plots, thereby expanding the number of flight samples of the unmanned aerial vehicle. Although this deep reinforcement learning method based on after-the-fact experience playback learns automatically and improves the training efficiency of unmanned aerial vehicle navigation in the simulation environment, it collects the unmanned aerial vehicle flight samples by random sampling for agent learning, so the utilization rate of the samples is very low, which even affects the final performance of the unmanned aerial vehicle agent.
Therefore, how to solve the insufficient utilization of the unmanned aerial vehicle's transfer samples during multi-target agent reinforcement learning training and improve the learning efficiency of the unmanned aerial vehicle agent has become a technical problem to be solved urgently.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle combat simulation method based on multi-stage after-the-fact experience playback, which introduces a multi-stage after-the-fact experience playback mechanism into multi-target agent deep reinforcement learning, quickly improves the learning performance of the agent, and quickly increases the speed at which the agent learns to complete multi-target tasks in unmanned aerial vehicle flight simulation.
In order to achieve the purpose, the invention adopts the following technical scheme:
an unmanned aerial vehicle combat analog simulation method based on multi-level after-the-fact experience playback comprises the following steps:
unmanned aerial vehicle simulation preparation information setting step S110:
setting preparation information of unmanned aerial vehicle simulated flight, and setting a reward function according to task requirements, wherein the preparation information comprises state information, target information and legal action of the unmanned aerial vehicle simulated flight;
an unmanned aerial vehicle training network construction step S120 based on multi-target agent training:
constructing an unmanned aerial vehicle training network based on multi-target intelligent body training, wherein the unmanned aerial vehicle training network comprises an actor-critic architecture, an unmanned aerial vehicle battlefield simulation environment, a temporary buffer pool and a multi-stage experience playback pool;
the actor-critic architecture comprises an actor network, an actor target network, a critic network and a critic target network, wherein the actor network is used for inputting the state and target information of the unmanned aerial vehicle simulated flight in the step S110, outputting the action required to be executed by the unmanned aerial vehicle intelligent agent,
the unmanned aerial vehicle battlefield simulation environment is used for simulating the flight combat situation of the unmanned aerial vehicle, calculating the flight trajectory of the unmanned aerial vehicle agent from the action instructions produced by the actor network, feeding back the three pieces of information required for training (the state, the reward and whether the target is completed) to the actor network serving as the agent network model, and combining the current environment state, the next-moment state, the reward, the action and the current target into a transfer tuple;
the temporary buffer pool is used for storing transfer samples, namely transfer tuples, generated by interaction of the unmanned aerial vehicle agent and the environment, wherein the transfer samples comprise transfer samples under a certain plot and virtual transfer samples aiming at the virtual plot;
the multi-stage experience playback pool comprises a main experience pool (hereinafter denoted R_main), a secondary experience pool (R_sub) and a failure experience pool (R_fail); the three experience pools are used for classifying and storing, after judgment, the plots or virtual plots in the temporary buffer pool, a plurality of the transfer tuples stored in them being used as samples to update the actor network and the critic network, after which the actor target network and the critic target network are updated by soft update; the sampling proportions of the samples provided by the main experience pool R_main, the secondary experience pool R_sub and the failure experience pool R_fail correspond to their priorities and decrease in turn;
training the unmanned aerial vehicle training network of the multi-target intelligent training step S130:
setting preparation information and reward functions of the simulated flight of the unmanned aerial vehicle by using the step S110, performing plot and virtual plot training by using the unmanned aerial vehicle training network trained on the basis of the multi-target intelligent agent in the step S120 to expand transfer samples, and combining all the transfer samples with a main experience pool
R_main, that is, after the targets of all the plots are compared with those of the plots in the main experience pool R_main, the plots are respectively classified into the main experience pool R_main, the secondary experience pool R_sub and the failure experience pool R_fail, and samples are obtained in proportions determined by high and low priority to train and update the agent network model; training is performed with a plurality of navigation plots, the agent network model with the highest target completion rate is selected as the optimal agent network model, and this step is performed multiple times to finish the training of the agent network model.
Repeating the simulation and ending the step S140:
and repeatedly executing the step S120 and the step S130, randomly initializing an intelligent agent model each time the step S120 is executed, training the intelligent agent model initialized in the step S120 when the step S130 is executed, and finishing the training when the target completion rate of the unmanned aerial vehicle intelligent agent expected by the user reaches a threshold value.
A storage medium for storing computer-executable instructions which, when executed by a processor, perform the above-described unmanned aerial vehicle combat simulation method based on multi-level post experience playback.
In conclusion, the invention provides an unmanned aerial vehicle combat analog simulation method based on multi-level after-the-fact experience playback: three levels of after-the-fact experience playback pools are introduced to store samples hierarchically, and samples with high priority are given a higher sampling probability; since high-priority samples contain more information that helps the agent model learn to complete the target, the invention accelerates the learning of the multi-target agent model and improves the efficiency of unmanned aerial vehicle combat analog simulation.
Drawings
Fig. 1 is a flowchart of a method for simulating unmanned aerial vehicle combat based on multi-level post experience playback according to an embodiment of the present invention;
FIG. 2 is a block diagram of a UAV combat simulation network based on multi-level post experience playback in accordance with a specific embodiment of the present invention;
fig. 3 is a diagram illustrating the specific steps of training a training network for drones according to a specific embodiment of the present invention;
FIG. 4 is a schematic diagram of a combat scenario illustrating simulated simulation of unmanned aerial vehicle combat in accordance with an embodiment of the present invention;
fig. 5 is a schematic view of an action space of a drone combat simulation according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
The following describes terms related to the specific algorithms used in the unmanned aerial vehicle combat simulation method based on multi-stage after-the-fact experience playback of the invention.
1. Deep deterministic policy gradient algorithm
A deep reinforcement learning algorithm based on the 'actor-critic' framework that integrates the advantages of deep Q-learning and can train an agent model for continuous action space scenarios. In the invention the continuous action space is the flight action instruction of the unmanned aerial vehicle, i.e. each velocity vector of the unmanned aerial vehicle changes continuously; the deep deterministic policy gradient algorithm is combined with the multi-stage after-the-fact experience playback mechanism to train the unmanned aerial vehicle agent.
2. Intelligent agent network model
A software model based on a neural network, i.e. a complex network system formed by a large number of simple, widely interconnected neurons. In the invention the actor network is used as the agent network model; its input is the state information of the unmanned aerial vehicle simulation environment and the target to be completed, its output is an instruction or action acting on the simulation environment, and the simulation environment is the unmanned aerial vehicle battlefield combat environment.
3. Unmanned aerial vehicle agent
The unmanned aerial vehicle intelligent body refers to an entity interacting with a battlefield environment, and the interaction with the battlefield environment refers to the unmanned aerial vehicle intelligent body acting according to the current battlefield environment state information, so that the battlefield environment state information is changed and a reward signal is fed back to the unmanned aerial vehicle intelligent body. The actions that the drone agent needs to make are generated by the agent network model.
4. Virtual target
A virtual target refers to one of the k unmanned aerial vehicle position states randomly extracted from a plot after the plot ends, which is used as a new target to be reached.
5. Plot and virtual plot
A plot refers to the sequence of states, actions and rewards experienced by the agent model, from the beginning until the goal is completed or is not completed within a limited time, while it interacts with the environment; the sequence is expressed as a set of transfer tuples formed by this experience. Here it specifically refers to all the states of the unmanned aerial vehicle from the starting point to the end point. A virtual plot is a new plot obtained by replacing the original target position of the plot with a virtual target position.
6. Transferring samples
A transfer sample is the basic unit that makes up a plot. Each interaction between the unmanned aerial vehicle agent network model and the simulation environment produces a state s, a next-moment state s', a reward r and an action (instruction) a, and the transfer sample records the data generated by each interaction as a quadruple. The multi-stage post experience playback method further introduces a target g, which represents the target the unmanned aerial vehicle is to reach. In summary, a transfer sample can be represented as (s||g, a, r, s'||g).
7. Target
The target refers to a state that the unmanned aerial vehicle intelligent body is expected to reach, for example, in the invention, the target is any coordinate position of an enemy area which the unmanned aerial vehicle hopes to reach. Multiple targets mean that what the agent needs to learn is not a fixed target, for example, in this example, what the drone needs to learn is any possible coordinate position to reach an enemy area, rather than a fixed coordinate position.
8. Experience pool
A buffer area in memory or on the hard disk used for storing transfer samples. The stored transfer samples can be used repeatedly for training the agent network model. The experience pool is the basic unit of the multi-level post experience playback pool; in the present invention it includes three experience pools: the main experience pool R_main, the secondary experience pool R_sub and the failure experience pool R_fail.
9. Multi-stage post experience playback pool
The multi-stage post experience playback pool is used for storing plot samples and virtual plot samples. Plot samples are generated by the interaction process between the unmanned aerial vehicle agent and the environment; virtual plot samples are generated afterwards, using virtual targets, from that interaction. Three playback pools with different priorities are set up in the multi-level playback pool, namely the main experience pool R_main, the secondary experience pool R_sub and the failure experience pool R_fail. The three playback pools store the plot samples and virtual plot samples of different priorities respectively, and uniform random sampling is used within a single experience pool.
The main points of the invention are as follows: a multi-level post experience playback pool is set up, which stores and uses plot samples of three different priorities (the main experience pool R_main, the secondary experience pool R_sub and the failure experience pool R_fail) and samples them randomly with probabilities that depend on the priority of each experience playback pool; the higher the priority, the higher the sampling probability, so that useful sample information is exploited and training efficiency is greatly improved.
Referring to fig. 1 and fig. 2, a flow chart of the unmanned aerial vehicle combat simulation method based on multi-level post experience playback and a block diagram of the unmanned aerial vehicle combat simulation network based on multi-level post experience playback are respectively shown.
The multi-level after-the-fact experience playback multi-target intelligent agent deep reinforcement learning method comprises the following steps:
unmanned aerial vehicle simulation preparation information setting step S110:
setting preparation information of unmanned aerial vehicle simulated flight, and setting a reward function according to task requirements, wherein the preparation information comprises state information, target information and legal action of the unmanned aerial vehicle simulated flight;
specifically, refer to the operation scene of the unmanned aerial vehicle operation simulation degree shown in fig. 4.
The state information includes: our unmanned aerial vehicle coordinates (x_f, y_f, z_f), the enemy unmanned aerial vehicle coordinates (x_e, y_e, z_e), our unmanned aerial vehicle track direction angle ψ_f, the enemy unmanned aerial vehicle track direction angle ψ_e, our unmanned aerial vehicle roll angle φ_f, the enemy unmanned aerial vehicle roll angle φ_e, and the velocity vector included angle β between our unmanned aerial vehicle and the enemy unmanned aerial vehicle;
The target information is the designated target location (x_d, y_d) of the air war zone;
Referring to fig. 5, the direction of the two wings of the unmanned aerial vehicle is defined as the y-axis, the fuselage direction as the x-axis and the normal direction of the fuselage as the z-axis; the legal actions are the velocity vector v_x along the x-axis, the angular velocity vector w_x about the x-axis, the angular velocity vector w_z about the z-axis and the angular velocity vector w_y about the y-axis.
The reward function is: whether the designated target location of the air war zone is reached is used as the criterion; the reward is set to a value of -1 or 0, a reward of 0 being fed back when the designated target location is reached and a reward of -1 when it is not.
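As an illustration of this sparse reward design, a minimal sketch in Python follows (the function name, the reach radius parameter and the 2-D distance test are assumptions made for illustration and are not specified in the patent):

```python
import math

def sparse_reward(drone_xy, target_xy, reach_radius=0.1):
    """Return 0 if the drone is within reach_radius of the designated target
    location (x_d, y_d) of the air war zone, otherwise return -1."""
    distance = math.hypot(drone_xy[0] - target_xy[0], drone_xy[1] - target_xy[1])
    return 0.0 if distance <= reach_radius else -1.0
```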
In the invention, the unmanned aerial vehicle intelligent body hopes to go to a plurality of different investigation or combat places, the place is any coordinate position of an enemy area as shown in figure 4, and an enemy unmanned aerial vehicle flies in a fixed track in the enemy area and carries out air combat in a fixed strategy. The unmanned aerial vehicle intelligent body is controlled by the intelligent body network model to execute a high-quality maneuvering strategy when an enemy is detected or the unmanned aerial vehicle fights in the air with the enemy, so that the unmanned aerial vehicle occupies a favorable situation position.
An unmanned aerial vehicle training network construction step S120 based on multi-target intelligent training:
referring to fig. 2, an overall architecture diagram of a depth-deterministic policy gradient algorithm based on multi-level post-hoc playback of experience is shown, mainly employing an "actor-critic" architecture,
the system comprises an actor-critic architecture, an unmanned aerial vehicle battlefield simulation environment, a temporary buffer pool and a multi-level post experience playback pool;
the "actor-critic" architecture includes an actor network, an actor target network, a critic network, and a critic target network,
the actor network is used for inputting the state and target information of the unmanned aerial vehicle simulated flight in the step S110 and outputting the actions to be executed by the unmanned aerial vehicle intelligent agent, and is an intelligent agent network model for main training;
the critic network is used for evaluating the output action of the actor network so as to assist the training of the actor network; the input of the critic network is the state and target of the unmanned aerial vehicle simulated flight together with the action to be executed by the unmanned aerial vehicle agent, and its output is an evaluation scalar value. The evaluation scalar value evaluates the input action, i.e. how good the action executed by our unmanned aerial vehicle is under given state information of the unmanned aerial vehicle battlefield environment and the target to be completed; the actor network, i.e. the agent network model, is trained by maximizing this evaluation scalar value.
The actor target network and the critic target network are used to assist in training the actor network and the critic network respectively; their inputs and outputs are the same as those of the actor network and the critic network respectively, and their weight parameters are copied from the actor network and the critic network by soft update.
Regarding the soft update: the critic network is trained with the temporal-difference method of reinforcement learning, i.e. the training label of a sample at time t needs to be calculated from the sample at time t+1 through the critic network, so the label keeps changing as the critic network parameters are updated, which makes the updates of the critic network and the actor network unstable; therefore the parameters of the actor target network and the critic target network are only updated once at intervals.
The soft update of the actor target network and the critic target network is specifically:
θ_μ′ ← τθ_μ + (1 - τ)θ_μ′
θ_Q′ ← τθ_Q + (1 - τ)θ_Q′
where θ_μ′, θ_Q′, θ_μ and θ_Q respectively denote the parameters of the actor target network, the critic target network, the actor network and the critic network, and τ ∈ [0, 1).
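A minimal sketch of this soft update for PyTorch modules follows (the helper name and the example value of τ are assumptions; only the update rule itself comes from the formulas above):

```python
import torch

@torch.no_grad()
def soft_update(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float = 0.005) -> None:
    # theta_target <- tau * theta_online + (1 - tau) * theta_target
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau)
        p_target.add_(tau * p_online)
```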
The unmanned aerial vehicle battlefield simulation environment is used for simulating the flight combat situation of the unmanned aerial vehicle: it calculates the flight trajectory of the unmanned aerial vehicle agent from the action instructions produced by the actor network, feeds back the three pieces of information required for training (the state, the reward and whether the target is completed) to the actor network serving as the agent network model, the state also being the information the actor network needs in order to output the next action, and combines the current environment state s, the next-moment state s', the reward r, the action (instruction) a and the current target g into a transfer tuple (s||g, a, r, s'||g);
the temporary buffer pool is used for storing transfer samples, namely transfer tuples, generated by interaction of the unmanned aerial vehicle agent and the environment, wherein the transfer samples comprise transfer samples under a certain scenario and virtual transfer samples aiming at the virtual scenario.
Specifically, the transfer tuples in the temporary buffer are:
(s||g, a, r, s'||g)    formula (1)
s = (x_f, y_f, z_f, x_e, y_e, z_e, ψ_f, ψ_e, φ_f, φ_e, β)    formula (2)
g = (x_d, y_d)    formula (3)
a = (v_x, w_x, w_z, w_y)    formula (4)
r = 0 if the target position is reached at time t, otherwise r = -1    formula (5)
Formula (1) represents a transfer sample, in which s denotes the state, g the target position of the unmanned aerial vehicle navigation, a the action (instruction), r the reward, and s' the state at the next moment of the interaction between the unmanned aerial vehicle agent network model and the simulation environment after the action is executed. In formula (2) the state s comprises our unmanned aerial vehicle coordinates (x_f, y_f, z_f), the enemy unmanned aerial vehicle coordinates (x_e, y_e, z_e), our unmanned aerial vehicle track direction angle ψ_f, the enemy unmanned aerial vehicle track direction angle ψ_e, our unmanned aerial vehicle roll angle φ_f, the enemy unmanned aerial vehicle roll angle φ_e and the velocity vector included angle β between our unmanned aerial vehicle and the enemy unmanned aerial vehicle. In formula (3) the target g is the coordinate location (x_d, y_d) of the air combat zone taken as the target. In formula (4) the action a is composed of the velocity vector v_x along the x-axis, the angular velocity vector w_x about the x-axis, the angular velocity vector w_z about the z-axis and the angular velocity vector w_y about the y-axis. Formula (5) determines the reward r from the position of the unmanned aerial vehicle in state s at time t: if the target position is reached the reward value is 0, and if not the reward value is -1.
The transfer samples and virtual transfer samples are obtained as follows: when a plot ends, i.e. the unmanned aerial vehicle completes the target or fails to complete it within the limited time steps, all the transfer samples in the temporary buffer pool form a plot. At this point k transfer tuples are randomly extracted from the plot, the position coordinates of the unmanned aerial vehicle in the previous-moment state information of these transfer tuples are extracted as virtual targets, and k virtual plots are generated from the virtual targets and stored in the temporary buffer.
The multi-stage experience playback pool comprises the main experience pool R_main, the secondary experience pool R_sub and the failure experience pool R_fail. The three experience pools are used to classify and store, after judgment, the plots or virtual plots in the temporary buffer pool; a plurality of the transfer tuples stored in them are used as samples to update the actor network and the critic network with the deep deterministic policy gradient algorithm, after which the actor target network and the critic target network are updated by soft update. The sampling proportions of the samples provided by the main experience pool R_main, the secondary experience pool R_sub and the failure experience pool R_fail correspond to their priorities and decrease in turn.
Further, classifying and storing the plots or virtual plots in the temporary buffer pool after judgment is specifically: for each plot or virtual plot in the temporary buffer, if its target or virtual target was not completed, the plot or virtual plot is stored directly into the failure experience pool R_fail; if the target or virtual target was completed, the corresponding target or virtual target is compared with the targets of all the plots stored in the primary experience pool. If no similar target exists in the primary experience pool, the plot is stored directly into the primary experience pool; if a similar target exists, the reward sums of the two plots are compared, the plot with the larger reward sum is stored into the primary experience pool R_main and the plot with the smaller reward sum is deposited into the secondary experience pool R_sub. A similar target means that the Euclidean distance between the corresponding target or virtual target and the target of a plot in the primary experience pool is within a threshold range.
Specifically, the primary experience pool stores episodes with different targets, the target corresponding to each episode is unique in the experience pool, and the episode is the episode with the largest accumulated reward sum under the target, namely the optimal trajectory of the unmanned aerial vehicle for completing the target. The main experience pool has the greatest priority.
The secondary experience pool stores plots with different targets, and multiple plots of the same target can be stored in the experience pool, but the track of the unmanned aerial vehicle completing the target corresponding to each plot is not optimal. The secondary experience pool has a secondary priority.
The failure experience pool stores episodes with different targets, the same target can have a plurality of episodes stored in the experience pool, but the unmanned aerial vehicle track corresponding to each episode does not complete the target. The failed experience pool has the smallest priority.
During training, a certain number of samples are extracted from the primary experience pool, the secondary experience pool and the failure experience pool. Samples in an experience pool of higher priority have a higher probability of being extracted, which is embodied by extracting from the three experience pools in different proportions. In an alternative embodiment, the primary experience pool R_main provides more than twice as many training samples as the secondary experience pool R_sub, and the secondary experience pool R_sub provides more than three times as many as the failure experience pool R_fail; for example, the sampling ratio is 6:3:1.
Because the state space and the action space of the unmanned aerial vehicle combat environment are large, the instruction sequence for successfully completing a target is long and there are many failed plots, the secondary experience pool and the failure experience pool are given larger capacities; illustratively, the capacity of the primary experience pool is set to 10,000 transfer tuples, the capacity of the secondary experience pool to 200,000 and the capacity of the failure experience pool to 100,000.
The "updating the actor network and the critic network using the deep deterministic policy gradient algorithm with the plurality of transfer tuples stored therein as samples" may specifically be: the sample size is set to B = 120 and the sampling ratio to 6:3:1, 72 samples (transfer tuples) are randomly collected from the primary experience pool R_main, 36 samples from the secondary experience pool R_sub and 12 samples from the failure experience pool R_fail, and the 120 samples from the three experience pools are used to update the agent model.
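A sketch of drawing one such training batch in the 6:3:1 proportion follows (it assumes each pool is simply a Python list of transfer tuples large enough to be sampled from; the helper name is an assumption):

```python
import random

def sample_batch(r_main, r_sub, r_fail, batch_size=120):
    """Draw 72 / 36 / 12 transfer tuples from the main, secondary and failure
    experience pools (sampling ratio 6:3:1), uniformly within each pool."""
    n_main = batch_size * 6 // 10                 # 72
    n_sub = batch_size * 3 // 10                  # 36
    n_fail = batch_size - n_main - n_sub          # 12
    batch = (random.sample(r_main, n_main)
             + random.sample(r_sub, n_sub)
             + random.sample(r_fail, n_fail))
    random.shuffle(batch)                         # mix the three priority levels
    return batch
```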
In the invention, the actor network is a 5-layer fully connected neural network; the dimensions of the three hidden layers are all 128 and the activation function is ReLU. The input of the network is the environment state and the target to be completed, so the input dimension of the neural network is 13, i.e. (x_f, y_f, z_f, x_e, y_e, z_e, ψ_f, ψ_e, φ_f, φ_e, β, x_d, y_d); the output is the action (instruction) to be executed, with an output dimension of 4, i.e. (v_x, w_x, w_z, w_y); the output activation function is tanh and the output value range is [-1, 1]. The actor target network copies the structure and parameters of the actor network. The critic network is a 5-layer fully connected neural network; the dimensions of the three hidden layers are all 128, ReLU is used as the activation function, the input dimension is 17 (comprising the environment state, the target and the action) and the output dimension is 1. The critic target network copies the structure and parameters of the critic network.
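A sketch of these two networks in PyTorch follows (it assumes the '5-layer' count includes the input and output layers, so there are three hidden layers of width 128; all other coding details are illustrative):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Input s||g (dimension 13); output action (v_x, w_x, w_z, w_y) in [-1, 1]."""
    def __init__(self, state_goal_dim=13, action_dim=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, sg):
        return self.net(sg)

class Critic(nn.Module):
    """Input s||g||a (dimension 17); output a single evaluation scalar."""
    def __init__(self, state_goal_dim=13, action_dim=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_goal_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, sg, a):
        return self.net(torch.cat([sg, a], dim=-1))
```

The target networks can then be created as copies of these two networks, e.g. with copy.deepcopy, and kept close to them by the soft update sketched earlier.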
Here s and s' have the same form but different specific values. When the collected samples form a plot, i.e. the unmanned aerial vehicle completes the goal within the specified time steps or reaches the maximum time step without completing it, a plurality of virtual plots are generated from this plot, and the plot and the generated virtual plots are stored into the primary experience pool, the secondary experience pool and the failure experience pool according to the accumulated reward sum of the plot (i.e. the sum of the rewards of all its samples) and whether the goal was completed. All the samples in a plot can be regarded as one flight trajectory of the unmanned aerial vehicle, and the size of the reward sum assesses how good the trajectory is, i.e. the number of time steps the unmanned aerial vehicle took to reach the target.
When the invention is trained concretely, a suitable hardware configuration can be selected for this network setting, for example the number of machines, the memory size, the number of CPU servers, the number of GPU servers and the disk capacity.
Training the unmanned aerial vehicle training network of the multi-target intelligent training step S130:
setting preparation information and reward function of the simulated flight of the unmanned aerial vehicle by using the step S110, performing plot and virtual plot training by using the unmanned aerial vehicle training network trained on the basis of the multi-target agent in the step S120 to expand transfer samples, and combining all the transfer samples with a main experience pool
R_main, that is, after the targets of all the plots are compared with those of the plots in the main experience pool R_main, the plots are respectively classified into the main experience pool R_main, the secondary experience pool R_sub and the failure experience pool R_fail. Samples are obtained in proportions determined by high and low priority to train and update the agent network model; training is performed with a plurality of navigation plots, the agent network model with the highest target completion rate is selected as the optimal agent network model, and this step is performed multiple times, e.g. E rounds of step S130, to finish the training of the agent network model.
Specifically, step S130 may include the following sub-steps; fig. 3 shows the specific steps of training the unmanned aerial vehicle training network, and fig. 2 correspondingly shows where these steps sit in the unmanned aerial vehicle combat simulation network.
Simulation plot execution step S131: the initial state s_0 of the unmanned aerial vehicle and the initial target g, which is any coordinate position of the enemy area, are input to the actor network for calculation, and the action a_0 to be executed by the unmanned aerial vehicle is output; interacting with the environment yields a new state s_1 and a reward r_0; then the state s_1 and the target g are input to the actor network again, which outputs the action a_1 to be executed by the unmanned aerial vehicle, and this is repeated until the plot ends.
Specifically, there are two conditions under which a plot ends: the unmanned aerial vehicle agent completes the target, i.e. reaches the designated reconnaissance coordinates, or the number of interaction steps between the unmanned aerial vehicle agent and the simulation environment reaches the maximum set number of steps T. This step is the main loop of the program.
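A sketch of this plot loop follows (the environment interface reset/step, the absence of exploration noise and the helper name are assumptions; the termination conditions are the two described above):

```python
import torch

def run_episode(env, actor, goal, max_steps):
    """Roll out one plot with the current actor network and return its transfer samples."""
    transitions = []
    s = env.reset()                         # initial state s_0
    for _ in range(max_steps):              # at most T interaction steps
        sg = torch.as_tensor(list(s) + list(goal), dtype=torch.float32)
        with torch.no_grad():
            a = actor(sg).numpy()           # action from the agent network model
        s_next, r, done = env.step(a)       # interact with the battlefield simulation
        transitions.append((s, goal, a, r, s_next))
        s = s_next
        if done:                            # designated reconnaissance coordinates reached
            break
    return transitions
```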
Virtual plot generation step S132: k = 4 transfer samples are randomly sampled from the plot generated in step S131, and the position coordinates of the unmanned aerial vehicle in the previous-moment state information of each sampled transfer are taken as virtual targets; these are position coordinates the unmanned aerial vehicle has already reached, i.e. (x_f, y_f) extracted from a visited state is regarded as a virtual target that has been reached. The interaction process between the unmanned aerial vehicle agent and the simulation environment generates the following time sequence:
(s_0||g, a_0, r_0, s_1||g), (s_1||g, a_1, r_1, s_2||g), ..., (s_{T-1}||g, a_{T-1}, r_{T-1}, s_T||g)
For any virtual target g' obtained from the state s_t at an arbitrary time t, trace back from the state s_t to the initial state s_0, replace the target in every transfer sample passed through with the virtual target g', and replace the reward according to
r_t = r_g'(s_t, a_t)
thereby generating one virtual plot; the k virtual targets generate k unmanned aerial vehicle navigation virtual plots, which are stored in the temporary buffer pool. Here the mapping from a state to a target denotes the extraction of the unmanned aerial vehicle position coordinates from that state, and r_g'(s, a) denotes the reward function of the corresponding sample under the virtual target g'.
The main functions of the substep are: the number of drone navigation episodes (samples) is expanded.
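A sketch of this virtual plot generation follows (k = 4 and the sparse 0/-1 reward come from the description above; the transition layout (s, g, a, r, s_next), the position-extraction rule and the reach test are assumptions):

```python
import math
import random

def make_virtual_episodes(episode, k=4, reach_radius=0.1):
    """episode: list of (s, g, a, r, s_next) transfer samples of one real plot.
    Returns k virtual plots whose targets are drone positions already reached."""
    virtual_episodes = []
    picked = random.sample(range(len(episode)), min(k, len(episode)))
    for t in picked:
        s_t = episode[t][0]
        g_virtual = (s_t[0], s_t[1])                       # (x_f, y_f) of a visited state
        relabeled = []
        for (s, _g, a, _r, s_next) in episode[: t + 1]:    # trace back s_0 .. s_t
            dist = math.hypot(s_next[0] - g_virtual[0], s_next[1] - g_virtual[1])
            r_new = 0.0 if dist <= reach_radius else -1.0  # reward replaced per r_g'(s, a)
            relabeled.append((s, g_virtual, a, r_new, s_next))
        virtual_episodes.append(relabeled)
    return virtual_episodes
```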
Buffer pool update step S133: the transfer samples generated in step S131 and the transfer samples of the virtual plots generated in step S132 are stored in the temporary buffer pool. The target of the different samples in the temporary buffer is abstractly denoted g', and all the samples in the temporary buffer are classified into the multi-level post experience playback pool, specifically:
If the unmanned aerial vehicle agent did not complete the target g', i.e. the unmanned aerial vehicle did not reach the designated reconnaissance site within the limited time steps, the plot containing the target g' in the temporary buffer is stored into the failure experience pool R_fail.
If the unmanned aerial vehicle reached the designated reconnaissance site within the limited time steps, the reward sum R_g' of the plot with target g' in the temporary buffer is calculated and compared with the targets g* of all the plots in the main experience pool R_main:
If there is no target g* in R_main such that ||g* - g'|| < ε, i.e. no stored unmanned aerial vehicle flight trajectory has an end-point coordinate close to g', the plot corresponding to g' is transferred into the main experience pool R_main, where ε = 0.01 is the similar-target decision interval threshold.
If there is a target g* such that ||g* - g'|| < ε and the reward sum R_g' is greater than the reward sum of the plot corresponding to g* in the main experience pool R_main, the plot of g* is moved from the main experience pool R_main to the secondary experience pool R_sub and the plot corresponding to g' is saved into the main experience pool R_main; if R_g' is less than the reward sum of the corresponding plot, the plot of g' is saved into the secondary experience pool R_sub. In other words, when the end-point coordinate of a stored flight trajectory is close to g', the sample corresponding to the better trajectory is kept in R_main and the remaining sample is stored in R_sub.
Therefore, through this sub-step the invention classifies the various transfer samples by priority, so that the multi-level post experience playback stores all the unmanned aerial vehicle navigation plot samples from the temporary buffer in a graded way instead of mixing them in a single experience pool; high-priority transfer samples can then be obtained in a targeted way for the training of the agent network model in the next step, which improves the sampling efficiency of high-quality samples.
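A sketch of this classification rule follows (here each pool is a list of (target, reward_sum, plot) entries; that layout, the helper name and the completion flag are assumptions, while ε = 0.01 and the comparison of reward sums come from the description above):

```python
import math

EPSILON = 0.01  # similar-target decision interval threshold

def classify_episode(plot, goal, completed, r_main, r_sub, r_fail):
    """Store one plot (real or virtual) into the main, secondary or failure pool."""
    reward_sum = sum(tr[3] for tr in plot)           # accumulated reward of the plot
    if not completed:                                # target not reached in time
        r_fail.append((goal, reward_sum, plot))
        return
    for i, (g_stored, rsum_stored, _plot_stored) in enumerate(r_main):
        if math.dist(goal, g_stored) < EPSILON:      # a similar target already in R_main
            if reward_sum > rsum_stored:             # the new plot is the better trajectory
                r_sub.append(r_main[i])
                r_main[i] = (goal, reward_sum, plot)
            else:
                r_sub.append((goal, reward_sum, plot))
            return
    r_main.append((goal, reward_sum, plot))          # no similar target: new entry in R_main
```

For training, the plots in each pool would be flattened into transfer tuples before batches are drawn in the 6:3:1 proportion.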
Sample update step S134: the sampling proportions are set according to the priorities, a certain number of samples are drawn from the multi-level post experience playback pool, the actor network and the critic network are updated with the deep deterministic policy gradient algorithm using all the collected samples, and then the actor target network and the critic target network are updated with the soft update technique.
For example, with the sampling ratio 6:3:1, 120 samples are drawn (72 from the main experience pool R_main, 36 from the secondary experience pool R_sub and 12 from the failure experience pool R_fail); all the collected samples are used to update the actor network and the critic network with the deep deterministic policy gradient algorithm, and then the actor target network and the critic target network are updated with the soft update technique.
The main function of this sub-step is to increase the sampling probability of high-quality samples and thereby improve the training efficiency of the unmanned aerial vehicle agent.
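A sketch of one such update on a sampled batch follows, reusing the Actor/Critic and soft_update sketches above (the discount factor, τ and the done mask are assumptions; the critic target, critic loss and actor objective are the standard deep deterministic policy gradient update):

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, gamma=0.98, tau=0.005):
    """batch: tensors sg (B,13), a (B,4), r (B,1), sg_next (B,13), done (B,1)."""
    sg, a, r, sg_next, done = batch

    # Critic update: temporal-difference target built from the two target networks.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_t(sg_next, actor_t(sg_next))
    critic_loss = F.mse_loss(critic(sg, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize the critic's evaluation scalar of the actor's action.
    actor_loss = -critic(sg, actor(sg)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of both target networks.
    soft_update(actor_t, actor, tau)
    soft_update(critic_t, critic, tau)
```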
Model saving step S135: the unmanned aerial vehicle battlefield environment and random targets are initialized multiple times, the unmanned aerial vehicle controlled by the current agent model interacts with the environment to generate a number of unmanned aerial vehicle navigation plots, and the target completion rate is calculated; if the target completion rate of the current agent model is greater than that of the historically best agent model, the current model is saved as the optimal agent model.
The main function of this sub-step is as follows: because the deep reinforcement learning process is extremely unstable, the model obtained at final convergence is not necessarily the best one produced during training; therefore the invention initializes the unmanned aerial vehicle environment multiple times, trains, and saves the optimal model as the final agent network model.
Repeating the simulation and ending the step S140:
and repeatedly executing the step S120 and the step S130, randomly initializing one agent model each time the step S120 is executed, training the agent model initialized in the step S120 when the step S130 is executed, and ending the training when the target completion rate of the unmanned aerial vehicle agent expected by the user reaches a threshold value, for example, 90%.
When the average plot reward and/or the target completion rate of the test generally become stable, the algorithm has converged, i.e. the flight trajectory with which the agent model controls the unmanned aerial vehicle agent to reach any random target is fixed and no longer changes as the agent model is trained. Therefore, the present invention repeatedly performs steps S120 and S130, generating an optimal model under different initialization conditions at each repetition, so that the optimal agent model saved over the whole training process can be taken as the final model.
The specific embodiment is as follows:
in a specific embodiment, the multi-target agent training request may be sent via a remote terminal, or may be sent via a pre-programmed script.
In the multi-target agent training request, the hardware resources are the hardware configuration selected by the user according to the scale of the confrontation training;
the initialization is defined as the application environment in which the agent model is executed; this application environment must be able to express multiple targets through the environment state, for example, in the unmanned aerial vehicle combat environment of agent confrontation the maneuvering target position of the unmanned aerial vehicle is expressed by the coordinates of the unmanned aerial vehicle;
the evaluation index is set according to the actual application scenario and may be the average reward sum of a plurality of plots or the target completion rate over a plurality of plots.
The invention can set the maximum execution step number T, the batch sampling number B and the training round number E.
Specifically, hardware resources are configured according to a multi-target agent training request, wherein an agent model runs on a GPU server, and an unmanned aerial vehicle battlefield combat simulation engine runs on a CPU server. The maximum number of execution steps T =10000, the number of batch samples B =120, and the training round E =1000 are set.
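The hyperparameters named in this embodiment can be gathered into one configuration, sketched below (the dictionary layout is an assumption; the pool capacities repeat the illustrative values given earlier):

```python
TRAINING_CONFIG = {
    "max_steps_per_episode": 10000,    # T
    "batch_size": 120,                 # B
    "training_rounds": 1000,           # E
    "virtual_targets_per_plot": 4,     # k
    "similar_target_threshold": 0.01,  # epsilon
    "sampling_ratio": (6, 3, 1),       # R_main : R_sub : R_fail
    "pool_capacity": {"main": 10_000, "sub": 200_000, "fail": 100_000},
}
```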
Specifically, as shown in fig. 4, the black cross is a randomly generated reconnaissance target spot which our unmanned aerial vehicle needs to reach as fast as possible, and the two dotted lines are assumed movement trajectories of our unmanned aerial vehicle. Although both trajectories reach the target spot, assumed trajectory 1 is the better one, because trajectory 1 avoids being blocked by the enemy unmanned aerial vehicle, whereas trajectory 2 requires our unmanned aerial vehicle to destroy or shake off the enemy unmanned aerial vehicle before reaching the target spot. Therefore the plot generated by trajectory 1 is of greater importance, and the agent model should preferentially select this plot for training so that it preferentially learns the more important information.
Therefore, preferentially selecting more important samples for training of the agent model can speed up its learning and improve its performance compared to randomly drawing samples for ordinary post-experience playback of intelligent model training.
The invention further discloses a storage medium for storing computer-executable instructions, and the computer-executable instructions, when executed by a processor, execute the unmanned aerial vehicle combat simulation method based on multi-stage experience playback after the fact.
The invention uses more high-quality samples for agent model training through the hierarchical storage of samples and the proportional, hierarchical collection of samples. The multi-stage after-the-fact experience playback of the invention stores samples of different priorities by setting up three experience pools with different priorities; samples are stored in units of plots, and the plots are divided into three priorities, from high to low: target completed at low cost, target completed at high cost, and target not completed. The sampling probability with which high-priority samples are used for agent model training is raised: if trajectory 1 and trajectory 2 were both stored as samples in an experience pool, the method would consistently select trajectory 1 for agent model training, whereas ordinary after-the-fact experience playback would randomly select one of trajectory 1 and trajectory 2 each time. The method therefore improves the sample utilization rate and greatly accelerates the training of the agent model, and the high-quality samples also improve the performance of the final agent model.
In conclusion, the invention has the following advantages:
by introducing a multi-level after-the-fact experience playback mechanism into the deep reinforcement learning algorithm, the overall sample utilization rate is improved and the learning of the multi-target intelligent body model is accelerated when the deep reinforcement learning algorithm is applied to the training of the multi-target intelligent body model. Aiming at the training of a multi-target intelligent agent, compared with a rule-based method, the deep reinforcement learning method is not limited by expert experience, and the method has better universality; compared with a deep reinforcement learning algorithm using common post experience playback, the multi-stage post experience playback improves the sample utilization rate and improves the training effect through a multi-priority mechanism.
It will be apparent to those skilled in the art that the various elements or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device, or alternatively, they may be implemented using program code that is executable by a computing device, such that they may be stored in a memory device and executed by a computing device, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
While the invention has been described in further detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. An unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback is characterized by comprising the following steps:
unmanned aerial vehicle simulation preparation information setting step S110:
setting preparation information of unmanned aerial vehicle simulated flight, and setting a reward function according to task requirements, wherein the preparation information comprises state information, target information and legal action of the unmanned aerial vehicle simulated flight;
an unmanned aerial vehicle training network construction step S120 based on multi-target agent training:
constructing an unmanned aerial vehicle training network based on multi-target intelligent body training, wherein the unmanned aerial vehicle training network comprises an actor-critic architecture, an unmanned aerial vehicle battlefield simulation environment, a temporary buffer pool and a multi-stage experience playback pool;
the actor-critic architecture comprises an actor network, an actor target network, a critic network and a critic target network, wherein the actor network is used for inputting the state and target information of the unmanned aerial vehicle simulated flight in the step S110, outputting the action required to be executed by the unmanned aerial vehicle intelligent agent,
the unmanned aerial vehicle battlefield simulation environment is used for simulating the flight battle condition of the unmanned aerial vehicle, calculating the flight track of an unmanned aerial vehicle intelligent body by using action instructions obtained by an actor network, feeding back three information of states, rewards and whether targets are finished required by the actor network for training to the actor network serving as an intelligent body network model for training, and integrating the current environment state, the next moment state, the rewards, the actions and the current targets to form a transfer tuple;
the temporary buffer pool is used for storing transfer samples, namely transfer tuples, generated by interaction of the unmanned aerial vehicle agent and the environment, wherein the transfer samples comprise transfer samples under a certain plot and virtual transfer samples aiming at the virtual plot;
the multi-stage after-the-fact experience playback pool comprises a main experience pool B_main, a secondary experience pool B_sub and a failed experience pool B_fail; the three experience pools are used for classifying and storing, after judgment, the episodes or virtual episodes held in the temporary buffer pool, and for using a plurality of stored transfer tuples as samples to update the actor network and the critic network, the actor target network and the critic target network being updated by soft update; the main experience pool B_main, the secondary experience pool B_sub and the failed experience pool B_fail provide samples at sampling proportions corresponding to their priorities, and the sampling proportions decrease in that order;
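As an illustrative, non-limiting sketch of how such a three-level playback pool could be organized (the class name MultiLevelReplayBuffer, the attribute names main, sub and fail, and the default proportions are assumptions chosen for illustration, not part of the claimed method):

```python
from collections import deque
import random

class MultiLevelReplayBuffer:
    """Sketch of a three-level after-the-fact experience playback pool.

    Episodes (lists of transfer tuples) are stored in one of three pools:
    main (highest priority), sub (medium) and fail (lowest). Sampling draws
    from the pools in proportions that decrease with priority.
    """

    def __init__(self, capacity=10000, proportions=(0.6, 0.3, 0.1)):
        self.main = deque(maxlen=capacity)   # optimal episode per target
        self.sub = deque(maxlen=capacity)    # successful but sub-optimal episodes
        self.fail = deque(maxlen=capacity)   # episodes that missed their target
        self.proportions = proportions       # sampling share per pool, highest first

    def sample(self, batch_size):
        """Draw a batch of transfer tuples, respecting the pool proportions."""
        batch = []
        for pool, share in zip((self.main, self.sub, self.fail), self.proportions):
            transitions = [t for episode in pool for t in episode]
            k = min(int(batch_size * share), len(transitions))
            if k > 0:
                batch.extend(random.sample(transitions, k))
        return batch
```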
an unmanned aerial vehicle training network training step S130 based on multi-target agent training:
setting the preparation information and the reward function of the unmanned aerial vehicle simulated flight by using step S110, performing episode and virtual-episode training with the unmanned aerial vehicle training network constructed in step S120 on the basis of multi-target agent training so as to expand the transfer samples, comparing the targets of all episodes with those of the episodes already stored in the main experience pool B_main, classifying the episodes into the main experience pool B_main, the secondary experience pool B_sub and the failed experience pool B_fail accordingly, collecting samples at proportions set according to the level of priority to train and update the agent network model, training with a plurality of navigation episodes, selecting the agent network model with the highest target completion rate as the optimal agent network model, and performing this step multiple times to complete the training of the agent network model;
a repeated simulation and ending step S140:
repeatedly executing the step S120 and the step S130, randomly initializing an agent model each time the step S120 is executed, training the agent model initialized in the step S120 when the step S130 is executed, and ending the training when the target completion rate of the unmanned aerial vehicle agent reaches the threshold expected by the user;
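A minimal sketch of this outer loop, assuming hypothetical build_networks and train_agent helpers standing in for steps S120 and S130 (the 90% threshold is taken from claim 4; the other names and the round limit are illustrative assumptions):

```python
def run_simulation(threshold=0.9, max_rounds=100):
    """Repeat network construction (S120) and training (S130) until the
    user-expected target completion rate is reached (S140)."""
    best_model, best_rate = None, 0.0
    for _ in range(max_rounds):
        model = build_networks()            # S120: randomly initialised agent model
        model, rate = train_agent(model)    # S130: multi-target agent training
        if rate > best_rate:
            best_model, best_rate = model, rate
        if best_rate >= threshold:          # expected completion rate reached
            break
    return best_model
```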
in the unmanned aerial vehicle simulation preparation information setting step S110,
the state information includes: my unmanned aerial vehicle coordinates (x_f, y_f, z_f), enemy unmanned aerial vehicle coordinates (x_e, y_e, z_e), my unmanned aerial vehicle track direction angle ψ_f, enemy unmanned aerial vehicle track direction angle ψ_e, my unmanned aerial vehicle roll angle φ_f, enemy unmanned aerial vehicle roll angle φ_e, and the velocity vector included angle β between my unmanned aerial vehicle and the enemy unmanned aerial vehicle;
the target information specifies the target location (x_d, y_d) of the air war zone;
defining the direction of the two wings of the unmanned aerial vehicle as the y axis, the fuselage direction as the x axis and the fuselage normal direction as the z axis, the legal actions are the velocity vector v_x along the x axis, the angular velocity vector w_x about the x axis, the angular velocity vector w_z about the z axis and the angular velocity vector w_y about the y axis;
the reward function is: whether the designated target location of the air war zone is reached is taken as the criterion; the reward takes the value -1 or 0, a reward of value 0 is fed back when the designated target location is reached, and a reward of value -1 is fed back when the designated target location is not reached;
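A minimal sketch of this sparse reward, assuming a position already extracted from the state and a Euclidean distance test against the designated target location (the helper name and the reach tolerance are illustrative assumptions):

```python
import math

def reward(state_xy, target_xy, reach_radius=0.01):
    """Return 0 if the drone has reached the designated target location, else -1.

    state_xy     -- (x_f, y_f) position extracted from the current state
    target_xy    -- (x_d, y_d) designated target location of the air war zone
    reach_radius -- tolerance for counting the target as reached (assumed value)
    """
    dist = math.dist(state_xy, target_xy)
    return 0.0 if dist <= reach_radius else -1.0
```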
in the step S120,
the critic network is used for evaluating the action output by the actor network so as to assist in training the actor network; the input of the critic network is the state and target of the unmanned aerial vehicle simulated flight together with the action to be executed by the unmanned aerial vehicle agent, and its output is an evaluation scalar value, namely an evaluation of the input action, that is, an evaluation of the action executed by the unmanned aerial vehicle in the unmanned aerial vehicle battlefield environment given certain state information and the target to be completed; the actor network is trained by maximizing this evaluation scalar value;
the actor target network and the critic target network are respectively used for assisting in training the actor network and the critic network; their inputs and outputs are respectively the same as those of the actor network and the critic network, and their weight parameters are respectively copied from those two networks by means of soft update;
the soft update of the actor target network and the critic target network is specifically:
θ^μ′ ← τθ^μ + (1-τ)θ^μ′
θ^Q′ ← τθ^Q + (1-τ)θ^Q′
wherein θ^μ′, θ^Q′, θ^μ and θ^Q respectively denote the parameters of the actor target network, the critic target network, the actor network and the critic network, and τ ∈ [0,1];
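The two soft-update assignments above can be realized, for example, as an in-place parameter blend; the sketch below assumes PyTorch-style modules and a small τ such as 0.005 (both assumptions, not values fixed by the claim):

```python
import torch

def soft_update(target_net: torch.nn.Module, source_net: torch.nn.Module, tau: float = 0.005):
    """theta_target <- tau * theta_source + (1 - tau) * theta_target."""
    with torch.no_grad():
        for t_param, s_param in zip(target_net.parameters(), source_net.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * s_param)
```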
in the step S120,
the transfer tuple in the temporary buffer pool is specifically:
(s||g, a, r, s′||g)   formula (1)
s = (x_f, y_f, z_f, x_e, y_e, z_e, ψ_f, ψ_e, φ_f, φ_e, β)   formula (2)
g = (x_d, y_d)   formula (3)
a = (v_x, w_x, w_z, w_y)   formula (4)
r(s, g) = 0 if the target location is reached, and -1 otherwise   formula (5)
formula (1) represents a transfer sample, wherein s denotes the state, g denotes the target position of the unmanned aerial vehicle navigation, a denotes the action, r denotes the reward, and s′ denotes the state of the unmanned aerial vehicle agent network model at the next moment, generated through interaction with the simulation environment after the action is executed; in formula (2) the state s comprises my unmanned aerial vehicle coordinates (x_f, y_f, z_f), enemy unmanned aerial vehicle coordinates (x_e, y_e, z_e), my unmanned aerial vehicle track direction angle ψ_f, enemy unmanned aerial vehicle track direction angle ψ_e, my unmanned aerial vehicle roll angle φ_f, enemy unmanned aerial vehicle roll angle φ_e, and the velocity vector included angle β between my unmanned aerial vehicle and the enemy unmanned aerial vehicle; in formula (3) the target g represents the coordinate location (x_d, y_d) of the air combat zone taken as the target; in formula (4) the action a comprises the velocity vector v_x along the x axis, the angular velocity vector w_x about the x axis, the angular velocity vector w_z about the z axis and the angular velocity vector w_y about the y axis; the reward r is obtained by judging the position of the unmanned aerial vehicle in the state s at time t: if the target position is reached the reward value is 0, otherwise the reward value is -1;
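One possible in-memory representation of the transfer tuple (s||g, a, r, s′||g); the class and field names are chosen here for illustration only:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Transition:
    """Transfer tuple (s||g, a, r, s'||g) of formula (1)."""
    state: Tuple[float, ...]       # s  = (x_f, y_f, z_f, x_e, y_e, z_e, psi_f, psi_e, phi_f, phi_e, beta)
    goal: Tuple[float, float]      # g  = (x_d, y_d)
    action: Tuple[float, ...]      # a  = (v_x, w_x, w_z, w_y)
    reward: float                  # r  in {0, -1}
    next_state: Tuple[float, ...]  # s' at the next moment
```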
in the step S120 of the present invention,
the classified storage, after judgment, of the episodes or virtual episodes in the temporary buffer pool is specifically as follows:
for each episode or virtual episode in the temporary buffer pool, if the target or virtual target is not completed, the episode or virtual episode is directly stored into the failed experience pool B_fail; if the target or virtual target is completed, the corresponding target or virtual target is compared with the targets of all episodes stored in the main experience pool: if no similar target exists in the main experience pool, the episode is directly stored into the main experience pool; if a similar target does exist, the reward sums of the two episodes are compared, the episode with the larger reward sum is stored into the main experience pool B_main, and the remaining episode with the smaller reward sum is deposited into the secondary experience pool B_sub;
a similar target means that the Euclidean distance between the corresponding target or virtual target and the target of an episode in the main experience pool is within a threshold range;
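A small sketch of the "similar target" test just described, using the ε = 0.01 threshold named later in step S133 (the function name is an illustrative assumption):

```python
import math

def is_similar_target(goal_a, goal_b, epsilon=0.01):
    """Targets are 'similar' when their Euclidean distance is within the threshold epsilon."""
    return math.dist(goal_a, goal_b) <= epsilon
```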
the unmanned aerial vehicle training network training step S130 based on multi-target agent training specifically includes:
simulation episode execution step S131: the initial state s_0 of the unmanned aerial vehicle and the initial target g, which is any coordinate position of the enemy area, are input into the actor network for calculation, and the action a_0 to be executed by the unmanned aerial vehicle is output; interaction with the environment yields a new state s_1 and a reward r_0; the state s_1 and the target g are then input into the actor network again, which outputs the action a_1 to be executed by the unmanned aerial vehicle; this is repeated until the episode ends;
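A sketch of one such episode rollout, assuming hypothetical env and actor objects with reset/step and act methods respectively (these interfaces are assumptions, not defined by the claim):

```python
def run_episode(env, actor, goal, max_steps):
    """Roll out one navigation episode and return its list of transfer tuples."""
    transitions = []
    state = env.reset()                      # initial state s_0
    for _ in range(max_steps):               # at most T interaction steps
        action = actor.act(state, goal)      # a_t from the actor network
        next_state, reward, reached = env.step(action)
        transitions.append((state, goal, action, reward, next_state))
        state = next_state
        if reached:                          # target completed -> episode ends
            break
    return transitions
```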
virtual scenario generation step S132: the position coordinates of the unmanned aerial vehicle in the previous-state information of k randomly sampled transfer samples of the episode generated in step S131 are set as virtual targets; these are position coordinates actually reached by the unmanned aerial vehicle, i.e. the coordinates (x_f, y_f) extracted from an experienced state are treated as a virtual target that has already been reached; the interaction process between the unmanned aerial vehicle agent and the simulation environment generates the following time sequence:
(s_0||g, a_0, r_0, s_1||g), (s_1||g, a_1, r_1, s_2||g), ..., (s_{T-1}||g, a_{T-1}, r_{T-1}, s_T||g)
for any virtual target g′ taken from a transfer sample with state s_t at an arbitrary time t, tracing back from the state s_t to the initial state s_0, the target of every experienced transfer sample is replaced with the virtual target g′ and the reward is replaced according to:
r_{g′}(s, a) = 0 if m(s) reaches the virtual target g′, and -1 otherwise,
thereby generating one virtual episode; the k virtual targets generate k unmanned aerial vehicle navigation virtual episodes, which are stored into the temporary buffer pool; m(·) denotes the mapping from a state to its corresponding target, and r_{g′}(s, a) represents the reward function of the corresponding sample;
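A sketch of this virtual-episode (hindsight) relabelling under stated assumptions: extract_position stands in for the mapping m(·) from a state to the target it has reached, the transfer tuples follow the (s, g, a, r, s′) ordering used above, and the reach tolerance is an assumed value:

```python
import math
import random

def make_virtual_episodes(episode, k, extract_position, reach_radius=0.01):
    """Relabel one real episode into up to k virtual episodes with reached positions as goals."""
    if len(episode) < 2:
        return []
    virtual_episodes = []
    indices = random.sample(range(1, len(episode)), min(k, len(episode) - 1))
    for idx in indices:
        g_virtual = extract_position(episode[idx][0])    # (x_f, y_f) reached in s_t
        relabelled = []
        for (s, _, a, _, s_next) in episode[:idx]:       # trace back from s_t to s_0
            reached = math.dist(extract_position(s_next), g_virtual) <= reach_radius
            r_new = 0.0 if reached else -1.0             # replaced reward r_{g'}(s, a)
            relabelled.append((s, g_virtual, a, r_new, s_next))
        virtual_episodes.append(relabelled)
    return virtual_episodes
```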
buffer pool update step S133: the transfer samples generated in step S131 and the transfer samples of the virtual episodes generated in step S132 are stored into the temporary buffer pool; the target of the different samples in the temporary buffer pool is abstractly denoted g′, and all samples in the temporary buffer pool are classified into the multi-stage after-the-fact experience playback pool, specifically: if the unmanned aerial vehicle agent does not complete the target g′, the episode in the temporary buffer pool containing the target g′ is stored into the failed experience pool B_fail; if the unmanned aerial vehicle arrives at the designated reconnaissance site within the defined number of time steps, then for the samples in the temporary buffer pool the reward sum R_g′ of the episode with target g′ is calculated and compared against the targets ĝ of all episodes in the main experience pool B_main; if there is no target ĝ in B_main such that ||g′ - ĝ|| ≤ ε, the episode corresponding to g′ is transferred into the main experience pool B_main, wherein ε = 0.01 is the similar-target decision interval threshold; if there is a target ĝ in B_main such that ||g′ - ĝ|| ≤ ε and the reward sum R_g′ is greater than the reward sum of the episode corresponding to ĝ in B_main, then the episode corresponding to ĝ is transferred from the main experience pool B_main into the secondary experience pool B_sub and the episode corresponding to g′ is saved into the main experience pool B_main; if R_g′ is less than the reward sum of the corresponding episode in B_main, the episode corresponding to g′ is saved into the secondary experience pool B_sub; at this point the samples corresponding to the optimal flight paths are stored in B_main and the remaining samples are stored in B_sub;
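A sketch of this pool-update rule; the buffer attributes (main, sub, fail, main_index) and the structure of the temporary buffer entries are illustrative assumptions layered on the earlier MultiLevelReplayBuffer sketch:

```python
import math

EPSILON = 0.01   # similar-target decision interval threshold from step S133

def update_pools(temp_buffer, buffer):
    """Move every episode in the temporary buffer into the multi-stage playback pool.

    temp_buffer -- list of (episode, goal, reached) triples
    buffer      -- object with pools buffer.main, buffer.sub, buffer.fail, plus a
                   dict buffer.main_index mapping goal -> (episode, reward_sum)
    """
    for episode, g_prime, reached in temp_buffer:
        if not reached:                                   # target g' not completed
            buffer.fail.append(episode)
            continue
        r_sum = sum(t[3] for t in episode)                # reward sum R_{g'}
        similar = next((g for g in buffer.main_index
                        if math.dist(g, g_prime) <= EPSILON), None)
        if similar is None:                               # no similar target in B_main
            buffer.main.append(episode)
            buffer.main_index[g_prime] = (episode, r_sum)
        else:
            old_episode, old_sum = buffer.main_index[similar]
            if r_sum > old_sum:                           # new episode is the better path
                buffer.main.remove(old_episode)           # demote old best to B_sub
                buffer.sub.append(old_episode)
                buffer.main.append(episode)
                del buffer.main_index[similar]
                buffer.main_index[g_prime] = (episode, r_sum)
            else:                                         # keep old best, new one to B_sub
                buffer.sub.append(episode)
    temp_buffer.clear()
```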
sample update step S134: setting the sampling proportions according to the level of priority, with higher priority corresponding to a higher proportion, sampling a certain number of samples from the multi-stage after-the-fact experience playback pool, updating the actor network and the critic network with all of the collected samples by using the deep deterministic policy gradient algorithm, and then updating the actor target network and the critic target network by using the soft update technique;
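A sketch of one deep deterministic policy gradient update from such a batch; the PyTorch usage, the two-argument critic signature, and the gamma and tau values are assumptions, not parameters fixed by the claim:

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.98, tau=0.005):
    """One DDPG update from a batch of (s, g, a, r, s_next) tuples drawn from the
    multi-stage pool at priority-dependent proportions."""
    s = torch.tensor([list(t[0]) + list(t[1]) for t in batch], dtype=torch.float32)   # s||g
    a = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    r = torch.tensor([t[3] for t in batch], dtype=torch.float32).unsqueeze(1)
    s2 = torch.tensor([list(t[4]) + list(t[1]) for t in batch], dtype=torch.float32)  # s'||g

    # Critic update: regress Q(s||g, a) toward r + gamma * Q'(s'||g, mu'(s'||g)).
    with torch.no_grad():
        target_q = r + gamma * critic_target(s2, actor_target(s2))
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: maximise the critic's evaluation of the actor's action.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of both target networks.
    for tgt, src in ((actor_target, actor), (critic_target, critic)):
        with torch.no_grad():
            for tp, sp in zip(tgt.parameters(), src.parameters()):
                tp.mul_(1.0 - tau).add_(tau * sp)
```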
model saving step S135: initializing the unmanned aerial vehicle battlefield environment and random targets multiple times, using the unmanned aerial vehicle controlled by the current agent model to interact with the environment so as to generate a plurality of unmanned aerial vehicle navigation episodes and calculating the target completion rate; if the target completion rate of the current agent model is greater than that of the historically optimal agent model, the current model is saved as the optimal agent model.
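A sketch of this evaluation-and-save logic; it reuses the run_episode sketch above, and sample_goal and save_fn are assumed callbacks for drawing a random enemy-area target and persisting the model:

```python
def evaluate_and_save(actor, env, sample_goal, n_episodes, max_steps, best_rate, save_fn):
    """Estimate the target completion rate and keep the best agent model (step S135)."""
    completed = 0
    for _ in range(n_episodes):
        goal = sample_goal()                           # random target in the enemy area
        transitions = run_episode(env, actor, goal, max_steps)
        if transitions and transitions[-1][3] == 0.0:  # final reward 0 => target reached
            completed += 1
    rate = completed / n_episodes
    if rate > best_rate:                               # better than historical optimum
        save_fn(actor)                                 # persist current model as the best
        best_rate = rate
    return best_rate
```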
2. The unmanned aerial vehicle combat analog simulation method of claim 1, wherein,
for the multi-stage after-the-fact experience playback pool,
the main experience pool B_main provides more than twice as many samples for training as the secondary experience pool B_sub, and the secondary experience pool B_sub provides more than three times as many samples for training as the failed experience pool B_fail.
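As a worked illustration of these ratios (the concrete batch size and split are only one example consistent with the claim, not values specified by it):

```python
# Example batch split consistent with claim 2:
# main pool > 2x secondary pool, secondary pool > 3x failed pool.
batch_size = 128
n_main, n_sub, n_fail = 84, 36, 8
assert n_main + n_sub + n_fail == batch_size
assert n_main > 2 * n_sub          # 84 > 72
assert n_sub > 3 * n_fail          # 36 > 24
```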
3. The unmanned aerial vehicle combat analog simulation method of claim 2, wherein,
in step S131, the condition for the episode to end is: the unmanned aerial vehicle agent completes the target, or the number of interaction steps between the unmanned aerial vehicle agent and the simulation environment reaches the maximum set step number T.
4. The unmanned aerial vehicle combat analog simulation method of claim 3, wherein,
in step S140, the target completion rate threshold for the unmanned aerial vehicle agent is 90%.
5. A storage medium for storing computer-executable instructions, wherein,
the computer-executable instructions, when executed by a processor, perform the unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback as claimed in any one of claims 1 to 4.
CN202111584665.8A 2021-12-22 2021-12-22 Unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback Active CN114371634B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111584665.8A CN114371634B (en) 2021-12-22 2021-12-22 Unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback

Publications (2)

Publication Number Publication Date
CN114371634A CN114371634A (en) 2022-04-19
CN114371634B true CN114371634B (en) 2022-10-25

Family

ID=81140907

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114952828B (en) * 2022-05-09 2024-06-14 华中科技大学 Mechanical arm motion planning method and system based on deep reinforcement learning

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523029B (en) * 2018-09-28 2020-11-03 清华大学深圳研究生院 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method
US11994862B2 (en) * 2019-07-06 2024-05-28 Huawei Technologies Co., Ltd. Method and system for training reinforcement learning agent using adversarial sampling
CN110919659A (en) * 2019-12-24 2020-03-27 哈尔滨工程大学 Robot control method based on DDGPES
CN111880563B (en) * 2020-07-17 2022-07-15 西北工业大学 Multi-unmanned aerial vehicle task decision method based on MADDPG
CN112861442B (en) * 2021-03-10 2021-12-03 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113341958B (en) * 2021-05-21 2022-02-25 西北工业大学 Multi-agent reinforcement learning movement planning method with mixed experience
CN113660681B (en) * 2021-05-31 2023-06-06 西北工业大学 Multi-agent resource optimization method applied to unmanned aerial vehicle cluster auxiliary transmission

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant