CN112034888B - Autonomous control cooperation strategy training method for fixed wing unmanned aerial vehicle - Google Patents
- Publication number
- CN112034888B (application CN202010944803.8A)
- Authority
- CN
- China
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
- G05D1/104—Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
Abstract
The invention discloses a method for training an autonomous control cooperation strategy of a fixed-wing unmanned aerial vehicle, which comprises the following steps: (1) in a dynamics-based fixed-wing unmanned aerial vehicle control simulation environment E_s, acquiring real track data of a pilot controlling the unmanned aerial vehicle, and learning a flight control strategy for the unmanned aerial vehicle by supervised learning; (2) constructing a simplified abstract environment E_a stripped of flight control, creating two opposing groups of unmanned aerial vehicles, and learning a cooperation strategy with the APEX_QMIX algorithm; (3) combining the flight control strategy and the cooperation strategy by hierarchical reinforcement learning, and further training the fused strategy in the simulation environment E_s; (4) migrating the result to the real environment. The method is meaningful in real scenes and has the characteristics of good generalization, low cost and strong robustness.
Description
Technical Field
The invention relates to a fixed wing unmanned aerial vehicle autonomous control cooperation strategy training method based on hierarchical reinforcement learning and multi-agent reinforcement learning, and belongs to the technical field of unmanned aerial vehicle autonomous control cooperation strategies.
Background
The traditional autonomous control cooperation strategy for fixed-wing unmanned aerial vehicles mainly adopts automatic control methods, with manual modeling and hand-crafted strategies. Flight rules are formulated by experts in the relevant field. This is costly, and because the environment is complex and changes frequently, many situations are not covered by the flight rules. Flight rules therefore generally cannot handle complex, changing environments, and their capability is limited.
Recently, with the vigorous development of machine learning, reinforcement learning has brought a new solution for the autonomous control strategy of unmanned aerial vehicles. Reinforcement learning is a branch of machine learning; compared with the classic supervised and unsupervised learning problems, its main characteristic is learning from interaction. The agent continuously learns from the rewards or punishments obtained in its interaction with the environment, and thereby adapts to the environment. This learning paradigm is very similar to the way humans acquire knowledge, and for this reason reinforcement learning is considered an important approach to general AI. By constructing a dynamics simulation environment and designing a reasonable reward function, an autonomous control strategy for the unmanned aerial vehicle can be trained in the simulator, which is efficient and low-cost. Because training samples are abundant, a flight control strategy learned by reinforcement learning can face various complex and changing conditions, and compared with rule-based control it is more robust and flexible. However, plain reinforcement learning also has limitations: the exploration space is too large, the effect depends heavily on parameter-tuning tricks, and training is difficult.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems and defects in the prior art, the invention provides a fixed-wing unmanned aerial vehicle autonomous control cooperation strategy training method based on hierarchical reinforcement learning and multi-agent reinforcement learning, which abstractly divides the unmanned aerial vehicle autonomous control cooperation strategy into two layers: the high-level strategy is responsible for the cooperation strategy, and the bottom-level strategy is responsible for flight control. The strategies are thereby decoupled, the exploration space is reduced, and the learning difficulty is lowered. By constructing a simplified cooperation environment stripped of flight control, the APEX_QMIX algorithm's framework of centralized learning and distributed execution is applied, and self-play is conducted to explore the various possible cooperation strategies from zero. Meanwhile, flight control is learned in a dynamics simulation environment. Finally, the two strategies are combined to obtain the final unmanned aerial vehicle autonomous control strategy, which is migrated to the real environment. The method has the characteristics of good generalization, low cost and strong robustness.
The technical scheme is as follows: a method for training an autonomous control cooperation strategy of a fixed-wing unmanned aerial vehicle, in which the unmanned aerial vehicle cooperation strategy is divided into a high-level strategy and a bottom-level strategy by hierarchical reinforcement learning. The high-level strategy is used for the cooperation strategy; the bottom-level strategy is used for flight control. A dynamics-based fixed-wing unmanned aerial vehicle control simulation environment E_s is used to train the unmanned aerial vehicle to achieve the flight control and cooperation targets. In addition, to decouple the strategies, reduce the exploration space and lower the learning difficulty, a simplified abstract environment E_a stripped of flight control is constructed for pre-training the cooperation strategy and accelerating its learning. The bottom-level strategy is obtained by supervised learning. The high-level and bottom-level strategies are fused, and the trained autonomous control cooperation strategy is finally applied to the real environment. Using the APEX_QMIX algorithm, the cooperation strategy is pre-trained from the observation information provided by the abstract environment E_a, and the fusion strategy is trained from the observation information provided by the fixed-wing unmanned aerial vehicle control simulation environment E_s.
The high-level strategy receives observation, gives a flying target point and controls the cooperation of the unmanned aerial vehicle; and the bottom layer strategy receives the target point of the high layer strategy, selects the optimal flight mode and flies to the target point in the fastest optimal mode.
The dynamics-based fixed-wing unmanned aerial vehicle control simulation environment E_s is the simulator Em_s, and the simplified abstract environment E_a stripped of flight control is the simulator Em_a. The simulator Em_s approximately simulates the Markov process <S, A, P, R, γ>, providing observation information consistent with the unmanned aerial vehicle in the real scene and control instructions consistent with the real unmanned aerial vehicle. The control instructions comprise three basic angle-change commands for controlling flight and a throttle command controlling the flight speed of the unmanned aerial vehicle; the three basic angle-change commands are a pitch command, a roll command and a yaw command. The control instruction has the form A = <Δv, Δα, Δβ, Δγ>, with spatial dimension R^4.
The simulator Em_a does not involve the unmanned aerial vehicle control information of the real scene, but abstracts and simplifies the autonomous control process of the unmanned aerial vehicle into a particle game in a three-dimensional environment: the unmanned aerial vehicle is regarded as a particle, and a fixed-step-length flight segment is abstracted as directly reaching a target point. The simulator generates red and blue unmanned aerial vehicle groups and performs adversarial cooperative training.
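As a sketch of the fixed-step particle abstraction described above (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def particle_step(pos, target, step_len):
    """Move a point-mass drone one fixed-length step toward a target.

    Abstraction used by the simplified environment: the drone covers
    step_len along the line from its current position to the target,
    or stops at the target if it is closer than one step.
    """
    pos = np.asarray(pos, dtype=float)
    target = np.asarray(target, dtype=float)
    delta = target - pos
    dist = np.linalg.norm(delta)
    if dist <= step_len:
        return target
    return pos + delta / dist * step_len
```

For example, a particle at the origin stepping toward (3, 0, 0) with step length 1 advances to (1, 0, 0) in one step.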
The bottom-layer strategy is obtained by supervised learning: single flight action tasks are constructed, such as constant-speed level flight, level-flight acceleration and deceleration, steepest climb and half-roll reversal. Real track data of the unmanned aerial vehicle's actions when the pilot controls it are collected; all 'state-action' pairs corresponding to the tracks are extracted to construct a new set D = {(s_1, a_1), (s_2, a_2), ...}. Taking the state as the feature and the unmanned aerial vehicle joystick instruction as the label, learning is performed with the proximal policy optimization reinforcement learning method to obtain an optimal strategy model. The supervised learning objective function of the bottom-layer strategy is as follows:
J(θ) = E_{(s_t, a_t)~π_{θ'}}[ (p_θ(a_t|s_t) / p_{θ'}(a_t|s_t)) A^{θ'}(s_t, a_t) ]

where a_t is the action of the unmanned aerial vehicle agent at time t, s_t is the unmanned aerial vehicle state information at time t, θ' is the strategy model parameter used to interact with the environment for sampling, θ is the strategy model parameter being updated, p_{θ'} and p_θ are the probabilities of selecting action a_t in state s_t under θ' and θ respectively, A^{θ'} is the advantage function under θ', and E is the expectation.
The steering column instruction specifically comprises a rolling instruction, a pitching instruction, a yawing instruction and a power instruction of the unmanned aerial vehicle.
In the simplified abstract environment E_a stripped of flight control, i.e. the simulator Em_a, adversarial cooperative training of the red and blue unmanned aerial vehicles is carried out. For one unmanned aerial vehicle group, the APEX_QMIX algorithm's framework of centralized learning and distributed execution is adopted: the distributed strategy of each unmanned aerial vehicle is obtained through centralized information learning, and global state information is leveraged to improve the algorithm's effect. A neural network is used to integrate the local value functions of the agents into a joint action-value function for evaluating each drone's actions.
Beneficial effects: compared with the prior art, the autonomous control cooperative strategy training method for the fixed-wing unmanned aerial vehicle of the invention has the following advantages:
(1) A simulator is built, and the autonomous control cooperation strategy of the fixed-wing unmanned aerial vehicle is trained with a reinforcement learning algorithm. Flight rules no longer need to be hand-crafted by experts in the relevant field, and the method is efficient with zero trial-and-error cost.
(2) Hierarchical reinforcement learning is adopted to decouple the strategies, reducing the exploration space and the learning difficulty.
(3) To accelerate learning of the cooperation strategy and further reduce the search space, a simplified abstract environment E_a stripped of flight control is constructed. Without considering specific flight actions, the cooperation strategy model is pre-trained considering only simplified cooperative behaviors, greatly reducing the overall training time cost.
Drawings
FIG. 1 is a training framework diagram of the fusion strategy of the invention in the fixed-wing unmanned aerial vehicle control simulation environment E_s;
FIG. 2 is a training framework diagram of pre-training the cooperation strategy in the abstract environment E_a of the invention; after action_1, …, action_n, the agents agent_1, …, agent_n directly reach the single-step farthest point along the line connecting the target point output by the corresponding strategy and the current position point;
FIG. 3 is a schematic diagram of the Ape-X structure of the multi-agent reinforcement learning algorithm APEX_QMIX used in the invention;
FIG. 4 is a schematic diagram of the mixing network structure of the multi-agent reinforcement learning algorithm APEX_QMIX used in the invention;
FIG. 5 is a training flowchart of the fusion strategy in the fixed-wing unmanned aerial vehicle control simulation environment E_s.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
The method for training the autonomous control cooperation strategy of the fixed wing unmanned aerial vehicle comprises the following steps:
step 1: a simulator Em _ s controlled by a fixed-wing unmanned aerial vehicle is constructed based on dynamics, and the visualization part of the simulator Em _ s is realized based on a unity3D engine. Unmanned aerial vehicle simulation environment EsThe training process in (1) is defined as a tuple form of a Markov Decision Process (MDP)<S,A,P,R>And S is unmanned aerial vehicle state information, A is unmanned aerial vehicle action, P is an environment state transfer function, and R is environment reward. Wherein S ═<V,α,β,γ>The three angles respectively correspond to the current speed of the unmanned aerial vehicle and are relative to a north-east coordinate system. Control command (motion space) a ═<Δv,Δα,Δβ,Δγ>In order to simulate the control operation of the flying stick of the real flight, three basic angle change commands for controlling the flight are included: pitch commands, roll commands, yaw commands, and throttle control commands that control the aircraft's flight speed. The specific detailed steps for realizing the simulator Em _ s based on unity3D are as follows:
step 11, creating an environment for containing the agent.
Step 12, implement an Academy subclass and add this subclass to the game objects (GameObject) in the Unity scene containing the environment. This game object will serve as a parent to any Brain object in the scene. And implements an optional method of Academy class to update the scene independently of any agent, e.g., adding, moving, or deleting agents and other entities in the environment.
And step 13, adding one or more Brain objects into the scene as the sublevels of Academy.
Step 14, implement the Agent subclass. The Agent subclass defines the code necessary for an agent to observe its environment, perform specified actions, and calculate the rewards used for reinforcement learning training. Optional methods are implemented to reset the agent when it completes its task or when the task fails.
And step 15, adding the Agent subclasses to corresponding unmanned aerial vehicle objects, and distributing a Brain object for each Agent object.
And step 16, implementing the aerodynamically based simulated unmanned aerial vehicle flying state transition code.
And step 17, realizing the code for the state transition of the unmanned aerial vehicle group fight of the red and blue parties, and adding a win and loss judgment code.
And step 18, customizing the environment rule, so that the environment with different scenes, different time lengths and different difficulties can be generated.
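The state and action spaces of the MDP defined in step 1 can be sketched as follows; the dataclass names, the angle-to-axis mapping, and the toy transition are illustrative assumptions, not the patent's dynamics model:

```python
from dataclasses import dataclass

@dataclass
class DroneState:
    """S = <V, alpha, beta, gamma>: current speed and three attitude
    angles relative to the north-east-down frame (names assumed)."""
    v: float
    alpha: float
    beta: float
    gamma: float

@dataclass
class DroneAction:
    """A = <dv, dalpha, dbeta, dgamma>: throttle increment plus the
    three basic angle-change commands; action space dimension R^4."""
    dv: float
    dalpha: float
    dbeta: float
    dgamma: float

def apply(state: DroneState, action: DroneAction) -> DroneState:
    """One abstract MDP transition (toy additive kinematics, standing
    in for the aerodynamics-based transition code of step 16)."""
    return DroneState(state.v + action.dv,
                      state.alpha + action.dalpha,
                      state.beta + action.dbeta,
                      state.gamma + action.dgamma)
```

A drone at 100 m/s receiving <+5, +1, 0, -2> transitions to speed 105 with the corresponding angle increments.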
Step 2: and carrying out strategy layering based on the idea of layered reinforcement learning. The autonomous control cooperation strategy of the unmanned aerial vehicle is abstractly divided into two layers: a high-level policy and a bottom-level policy; the high-level strategy is responsible for the cooperation strategy, and the bottom-level strategy is responsible for the flight control. And obtaining a bottom flight control model in the simulator Em _ s through supervised learning. The detailed implementation steps of the process are as follows:
step 21, extracting all the 'state-action' pairs corresponding to the acquired real track data of various tactical actions of the pilot controlling the unmanned aerial vehicle to construct a training set D {(s)1,a1),(s2,a2),(s2,a2)...}。
Step 22, design a suitable neural network structure, select suitable hyper-parameters, and build the network; for example, a 5-layer fully connected neural network with the ReLU function as the activation of each layer.
Step 23, the aircraft state s_i is taken as the feature and the joystick command a_i as the label for regression learning. Using the BP algorithm, the cumulative error on the training set is minimized:
E = (1/m) Σ_{k=1}^{m} E_k,  with  E_k = (1/2) ||ŷ_k − y_k||²

where y_k is the true label, ŷ_k is the predicted label, E_k is the error on the k-th training sample, and E is the cumulative error over the training set.
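A minimal sketch of the cumulative-error computation above, assuming the common half-squared-error form (the toy linear model and one-sample dataset are illustrative only):

```python
import numpy as np

def cumulative_error(model, dataset):
    """Mean cumulative error E = (1/m) * sum_k E_k over the training
    set, where E_k = 0.5 * ||y_hat_k - y_k||^2."""
    m = len(dataset)
    total = 0.0
    for s, a in dataset:          # (state, joystick-command) pairs
        y_hat = model(s)
        total += 0.5 * float(np.sum((y_hat - a) ** 2))
    return total / m

# Toy linear "policy" and one-sample dataset for illustration.
W = np.zeros((4, 4))
model = lambda s: W @ s
D = [(np.ones(4), np.ones(4))]
```

With the zero-initialized weights above, the error on the single sample is 0.5 * 4 = 2.0; a BP training loop would then descend this quantity's gradient with respect to W.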
And step 3: constructing a simplified abstract environment E stripped of flight control by the gym open source libraryaEm _ a. The detailed steps are as follows:
step 31, the state transition function code of Em _ a is realized through the unified environment interface of the gym, and the environment interface of the gym is as follows:
(1) reset (): resetting the state of the environment and returning to observation;
(2) step (action): and the physical engine advances a time step forward and returns updated, done and info. The action is the action of the intelligent agent, the assertion is the information observed by the intelligent agent from the environment, the reward is the reward received by the intelligent agent from the environment, done is a termination signal, and info is related information;
(3) render (): an image engine redraws a frame of an environment.
RLlib's MultiAgentEnv is used to wrap Em_a so that it can be trained in a distributed fashion with Ray. This provides support for centralized learning with the APEX_QMIX algorithm (the QMIX algorithm adopting the Ape-X structure) under a framework of distributed execution, and for self-play (adversarial cooperative training of the red and blue unmanned aerial vehicle groups). Ape-X uses a single GPU learner while many CPU workers collect experience; trajectories are stored in the replay buffer with prioritized sampling, so experience collection can be scaled out to hundreds of parallel CPU workers, greatly accelerating the training process. The Ape-X structure is shown in FIG. 3.
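The core idea of Ape-X's prioritized experience replay can be sketched as follows (a toy proportional sampler; the real implementation uses sum-trees and importance-sampling weights, and this helper is not part of Ray or RLlib):

```python
import random

def prioritized_sample(buffer, priorities, k):
    """Sample k buffer indices with probability proportional to
    priority, so high-TD-error transitions are replayed more often."""
    total = sum(priorities)
    weights = [p / total for p in priorities]
    return random.choices(range(len(buffer)), weights=weights, k=k)
```

In Ape-X the CPU workers attach initial priorities when they push trajectories, and the GPU learner re-prioritizes transitions after each update.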
Step 4: in Em_a, adversarial cooperative training of the red and blue unmanned aerial vehicles is carried out with the APEX_QMIX algorithm. The detailed steps are as follows:
and step 41, selecting input parameters (initialization position, win-lose judgment condition, duration and the like), initializing the environment of the simulator Em _ a, and generating the unmanned aerial vehicle clusters of the red and blue parties.
Step 42, taking the red side as an example: a neural network model (the base structure is a 3-layer fully connected neural network, each layer using the ReLU function as activation) is used to initialize the strategy model (the local action-value function of a single agent) of each unmanned aerial vehicle. For all agents on the red side, a mixing network is adopted to combine the local value functions of the individual agents, and global state information is added during training and learning to assist, improving algorithm performance. The mixing network structure is shown schematically in FIG. 4. The cost function finally used is:
the update uses the conventional concept of DQN, where b denotes the number of samples sampled from the empirical memory, representing the target network. Wherein, tau ', a ', s ' is the sample track sampled in the experience memory, the action of agent and the environment state, r is the reward given by the environment,is a target network parameter.
Step 43, the red and blue sides conduct adversarial training by self-play without any artificial prior knowledge, exploring from zero to enrich the diversity and robustness of the strategies.
Step 5: the two strategy models obtained in step 2 and step 4 are fused based on hierarchical reinforcement learning. The detailed steps are as follows:
Step 51, a two-layer network structure is established. The first layer, called the meta-controller, is responsible for choosing a reachable sub-goal; the second layer is the low-level controller, which outputs actions according to the goal given by the meta-controller; when the sub-goal is reached or the set time elapses, a new goal is chosen. The meta-controller receives the external reward and gives the low-level controller an intrinsic reward. The low-level controller builds a Q function to estimate the reward of each action under the current goal. The Q function is as follows:
Q₁*(s, a; g) = E[ Σ_{t'≥t} γ^{t'−t} r_{t'} | s_t = s, a_t = a, g_t = g, π_ag ]

where E is the expectation, γ is the reward discount coefficient, a is the action, s is the state, g is the goal, π_ag is the strategy selecting action a given goal g, and the subscript t is the time step;
The meta-controller establishes a Q function to estimate the reward of goal g under different situations:

Q₂*(s, g) = E[ Σ_{t'=t}^{t+N} f_{t'} + γ max_{g'} Q₂*(s_{t+N}, g') | s_t = s, g_t = g, π_g ]

where E is the expectation, γ is the reward discount coefficient, s is the state, g is the goal, π_g is the strategy selecting goal g, t and N are time steps, and f is the accumulated external reward.
Both controllers use a similar update method: a single-step error is established, and the update is performed by gradient descent.
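The shared single-step update of both controllers can be sketched in tabular form (illustrative; the patent uses neural-network function approximation, and this update is the tabular limit of gradient descent on the squared single-step error):

```python
def td_update(q, key, reward, next_value, gamma=0.99, lr=0.1):
    """One-step TD update Q <- Q + lr * (target - Q), with
    target = reward + gamma * next_value. For the meta-controller
    the reward is the accumulated external reward f; for the
    low-level controller it is the intrinsic reward."""
    target = reward + gamma * next_value
    q[key] = q.get(key, 0.0) + lr * (target - q.get(key, 0.0))
    return q[key]
```

Starting from Q = 0, one update with reward 1.0, next value 0.0 and learning rate 0.5 moves the entry halfway to the target, i.e. to 0.5.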
Step 52, the meta-controller and the low-level controller are initialized with the two strategy models obtained in step 2 and step 4 respectively, and trained in Em_s. The training process is shown in FIG. 5.
Step 6: and finally migrate to the real environment.
Claims (7)
1. A fixed wing unmanned aerial vehicle autonomous control cooperation strategy training method is characterized by comprising the following steps: the unmanned aerial vehicle cooperation strategy is divided into a high-level strategy and a bottom-level strategy by adopting a layered reinforcement learning method; the high-level strategy is used for the cooperation strategy; the bottom-level strategy is used for flight control; a dynamics-based fixed-wing unmanned aerial vehicle control simulation environment E_s is used for training the unmanned aerial vehicle to achieve flight control and cooperation targets; a simplified abstract environment E_a stripped of flight control is constructed for pre-training the cooperation strategy; the bottom-level strategy is obtained by supervised learning; strategy fusion is performed on the high-level strategy and the bottom-level strategy, and finally the trained autonomous control cooperation strategy is applied to the real environment;
Using the APEX_QMIX algorithm, the cooperation strategy is pre-trained from the observation information provided by the abstract environment E_a, and the fusion strategy is trained from the observation information provided by the fixed-wing unmanned aerial vehicle control simulation environment E_s; the APEX_QMIX algorithm is a QMIX algorithm that employs the Ape-X structure.
2. The fixed-wing drone autonomous control collaborative strategy training method of claim 1, characterized in that: the high-level strategy receives observation information, gives a flying target point and controls the cooperation of the unmanned aerial vehicle; and the bottom layer strategy receives the target point of the high layer strategy, selects the optimal flight mode and flies to the target point in the fastest optimal mode.
3. The fixed-wing drone autonomous control collaborative strategy training method of claim 1, characterized in that: the dynamics-based fixed-wing unmanned aerial vehicle control simulation environment E_s is a simulator Em_s; a simplified abstract environment E_a stripped of flight control is constructed as a simulator Em_a; the simulator Em_s is used to approximately simulate the Markov process <S, A, P, R>, providing observation information consistent with the unmanned aerial vehicle in a real scene and control instructions consistent with the real unmanned aerial vehicle, the control instructions comprising three basic angle-change instructions for controlling flight and a throttle control instruction for controlling the flight speed of the unmanned aerial vehicle; the three basic angle-change instructions comprise a pitch instruction, a roll instruction and a yaw instruction; the control instruction has the form A = <Δv, Δα, Δβ, Δγ>, with spatial dimension R^4; S is the unmanned aerial vehicle state information, A is the unmanned aerial vehicle action, P is the environment state transition function, and R is the environment reward; S = <V, α, β, γ>, where V is the current speed of the unmanned aerial vehicle and the three angles are relative to the north-east-down coordinate system.
4. The fixed-wing drone autonomous control collaborative strategy training method of claim 3, characterized in that: the simulator Em_a does not involve unmanned aerial vehicle control information of the real scene, but abstracts and simplifies the autonomous control process of the unmanned aerial vehicle into a particle game in a three-dimensional environment; the unmanned aerial vehicle is regarded as a particle, and a fixed-step-length flight segment is abstracted as directly reaching a target point; the simulator generates red and blue unmanned aerial vehicle groups and performs adversarial cooperative training.
5. The fixed-wing drone autonomous control collaborative strategy training method of claim 1, characterized in that: the bottom-layer strategy is obtained by supervised learning, and a single flight action task is constructed; real track data of the unmanned aerial vehicle's actions when the pilot controls it are collected; all 'state-action' pairs corresponding to the tracks are extracted to construct a new set D = {(s_1, a_1), (s_2, a_2), ...}; taking the state as the feature and the unmanned aerial vehicle joystick instruction as the label, learning is performed with the proximal policy optimization reinforcement learning method to obtain an optimal strategy model; the supervised learning objective function of the bottom-layer strategy is as follows:
J(θ) = E_{(s_t, a_t)~π_{θ'}}[ (p_θ(a_t|s_t) / p_{θ'}(a_t|s_t)) A^{θ'}(s_t, a_t) ]

where a_t is the action of the unmanned aerial vehicle agent at time t, s_t is the unmanned aerial vehicle state information at time t, θ' is the strategy model parameter used to interact with the environment for sampling, θ is the strategy model parameter being updated, p_{θ'} and p_θ are the probabilities of selecting action a_t in state s_t under θ' and θ respectively, A^{θ'} is the advantage function under θ', and E is the expectation.
6. The fixed-wing drone autonomous control collaborative strategy training method of claim 5, characterized in that: the steering column instruction specifically comprises a rolling instruction, a pitching instruction, a yawing instruction and a power instruction of the unmanned aerial vehicle.
7. The fixed-wing drone autonomous control collaborative strategy training method of claim 3, characterized in that: in the simulator Em_a of the simplified abstracted environment E_a, from which flight control is stripped, adversarial cooperative training of the red and blue unmanned aerial vehicle sides is carried out; for one unmanned aerial vehicle group, the APEX_QMIX algorithm is adopted, with centralized learning and a distributed execution framework; the distributed strategy of each unmanned aerial vehicle is obtained through centralized information learning; global state information is used to improve the effect of the algorithm; a neural network integrates the local value functions of the individual agents into a joint action value function used to evaluate the actions of each unmanned aerial vehicle.
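The monotonic mixing at the heart of claim 7 (a neural network integrating local value functions into one joint action value) can be sketched in miniature. Real QMIX produces the non-negative mixing weights from the global state via hypernetworks; this illustration uses fixed weights, and all names and numbers are assumptions.

```python
# Sketch of the QMIX mixing idea: combine per-drone local Q-values into a
# joint Q_tot with weights forced non-negative (via abs), so Q_tot is
# monotone in every local Q. Monotonicity is what lets each drone act on
# the argmax of its own local Q while staying consistent with the joint
# argmax learned centrally.

def mix(local_qs, weights, bias=0.0):
    """Monotonic mix: Q_tot = sum_i |w_i| * Q_i + b."""
    return sum(abs(w) * q for w, q in zip(weights, local_qs)) + bias

# Raising any single agent's local Q can only raise the joint value.
q_tot_a = mix([1.0, 2.0], weights=[0.5, -0.25])
q_tot_b = mix([3.0, 2.0], weights=[0.5, -0.25])  # agent 0's local Q increased
```

In full QMIX the weights and bias are the outputs of hypernetworks conditioned on the global state, which is how the "borrowed" global information of claim 7 enters training without being needed at execution time.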
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010944803.8A CN112034888B (en) | 2020-09-10 | 2020-09-10 | Autonomous control cooperation strategy training method for fixed wing unmanned aerial vehicle |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112034888A CN112034888A (en) | 2020-12-04 |
CN112034888B true CN112034888B (en) | 2021-07-30 |
Family
ID=73584525
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112906888B (en) * | 2021-03-02 | 2023-05-09 | 中国人民解放军军事科学院国防科技创新研究院 | Task execution method and device, electronic equipment and storage medium |
CN113110546B (en) * | 2021-04-20 | 2022-09-23 | 南京大学 | Unmanned aerial vehicle autonomous flight control method based on offline reinforcement learning |
CN113435598B (en) * | 2021-07-08 | 2022-06-21 | 中国人民解放军国防科技大学 | Knowledge-driven intelligent strategy deduction decision method |
CN113721645A (en) * | 2021-08-07 | 2021-11-30 | 中国航空工业集团公司沈阳飞机设计研究所 | Unmanned aerial vehicle continuous maneuvering control method based on distributed reinforcement learning |
CN113886953B (en) * | 2021-09-27 | 2022-07-19 | 中国人民解放军军事科学院国防科技创新研究院 | Unmanned aerial vehicle intelligent simulation training method and device based on distributed reinforcement learning |
CN113867178B (en) * | 2021-10-26 | 2022-05-31 | 哈尔滨工业大学 | Virtual and real migration training system for multi-robot confrontation |
CN114141028B (en) * | 2021-11-19 | 2023-05-12 | 哈尔滨工业大学(深圳) | Intelligent traffic light traffic flow regulating and controlling system |
CN114167756B (en) * | 2021-12-08 | 2023-06-02 | 北京航空航天大学 | Multi-unmanned aerial vehicle collaborative air combat decision autonomous learning and semi-physical simulation verification method |
CN114444716A (en) * | 2022-01-06 | 2022-05-06 | 中国电子科技集团公司电子科学研究院 | Multi-agent game training method and system in virtual environment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101964019A (en) * | 2010-09-10 | 2011-02-02 | 北京航空航天大学 | Adversarial behavior modeling and simulation platform and method based on Agent technology |
US9622133B1 (en) * | 2015-10-23 | 2017-04-11 | The Florida International University Board Of Trustees | Interference and mobility management in UAV-assisted wireless networks |
CN108255059A (en) * | 2018-01-19 | 2018-07-06 | 南京大学 | Robot control method based on simulator training |
CN109116854A (en) * | 2018-09-16 | 2019-01-01 | 南京大学 | Multi-group robot cooperative control method and control system based on reinforcement learning |
CN109992000A (en) * | 2019-04-04 | 2019-07-09 | 北京航空航天大学 | Multi-UAV path collaborative planning method and device based on hierarchical reinforcement learning |
CN110231814A (en) * | 2019-07-03 | 2019-09-13 | 中国人民解放军国防科技大学 | Layered distributed control system and control method for fixed-wing unmanned aerial vehicle cluster |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11093829B2 (en) * | 2017-10-12 | 2021-08-17 | Honda Motor Co., Ltd. | Interaction-aware decision making |
CN110502033B (en) * | 2019-09-04 | 2022-08-09 | 中国人民解放军国防科技大学 | Fixed-wing unmanned aerial vehicle cluster control method based on reinforcement learning |
CN110991545B (en) * | 2019-12-10 | 2021-02-02 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-agent confrontation oriented reinforcement learning training optimization method and device |
CN111144580B (en) * | 2019-12-31 | 2024-04-12 | 中国电子科技集团公司信息科学研究院 | Hierarchical reinforcement learning training method and device based on imitation learning |
CN111552301B (en) * | 2020-06-21 | 2022-05-20 | 南开大学 | Hierarchical control method for salamander robot path tracking based on reinforcement learning |
Non-Patent Citations (4)
Title |
---|
Automatic Successive Reinforcement Learning with Multiple Auxiliary Rewards; Fu Zhaoyang, et al.; Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence; 2019-12-31; full text * |
Towards Sample Efficient Reinforcement Learning; Yu Yang; Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence; 2018-12-31; full text * |
Research Progress in Hierarchical Reinforcement Learning (分层式强化学习研究进展); 陈春林; 《***仿真技术及其应用》; 2008-12-31; vol. 10; full text * |
A Survey of Hierarchical Reinforcement Learning (分层强化学习综述); *** et al.; 《周志华》; 2017-10-31; vol. 12, no. 5; full text * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112034888B (en) | Autonomous control cooperation strategy training method for fixed wing unmanned aerial vehicle | |
CN112162564B (en) | Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm | |
CN110488859B (en) | Unmanned aerial vehicle route planning method based on improved Q-learning algorithm | |
CN110531786B (en) | Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN | |
CN112215350B (en) | Method and device for controlling agent based on reinforcement learning | |
Li et al. | Oil: Observational imitation learning | |
CN105700526A (en) | Online sequential extreme learning machine method with autonomous learning capability | |
CN114089776B (en) | Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning | |
CN114815882B (en) | Unmanned aerial vehicle autonomous formation intelligent control method based on reinforcement learning | |
CN113741533A (en) | Unmanned aerial vehicle intelligent decision-making system based on simulation learning and reinforcement learning | |
CN114741886A (en) | Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation | |
CN109740741A (en) | Reinforcement learning method combining knowledge transfer and its application to autonomous skill learning of unmanned vehicles | |
CN115509251A (en) | Multi-unmanned aerial vehicle multi-target cooperative tracking control method based on MAPPO algorithm | |
CN113821045A (en) | Leg and foot robot reinforcement learning action generation system | |
CN116225055A (en) | Unmanned aerial vehicle autonomous flight path planning algorithm based on state decomposition in complex environment | |
Zijian et al. | Imaginary filtered hindsight experience replay for UAV tracking dynamic targets in large-scale unknown environments | |
CN114355897B (en) | Vehicle path tracking control method based on model and reinforcement learning hybrid switching | |
Jiang et al. | A deep reinforcement learning strategy for UAV autonomous landing on a platform | |
CN116796843A (en) | Unmanned aerial vehicle many-to-many chase game method based on PSO-M3DDPG | |
CN116796844A (en) | M2 GPI-based unmanned aerial vehicle one-to-one chase game method | |
Nguyen et al. | Apprenticeship bootstrapping | |
CN116227622A (en) | Multi-agent landmark coverage method and system based on deep reinforcement learning | |
CN115933712A (en) | Bionic fish leader-follower formation control method based on deep reinforcement learning | |
CN114371634B (en) | Unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback | |
Wang et al. | Autonomous obstacle avoidance algorithm of UAVs for automatic terrain following application |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||