CN112034888B - Autonomous control cooperation strategy training method for fixed wing unmanned aerial vehicle - Google Patents
- Publication number
- CN112034888B (application CN202010944803.8A)
- Authority
- CN
- China
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
- G05D1/104—Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
Abstract
The invention discloses a method for training an autonomous control cooperation strategy of a fixed-wing unmanned aerial vehicle, which comprises the following steps: (1) in a dynamics-based fixed-wing unmanned aerial vehicle control simulation environment E_s, acquiring real track data of a pilot controlling the unmanned aerial vehicle, and learning a flight control strategy for the unmanned aerial vehicle by supervised learning; (2) constructing a simplified abstract environment E_a stripped of flight control, creating two opposing groups of unmanned aerial vehicles, and learning a cooperation strategy with the APEX_QMIX algorithm; (3) combining the flight control strategy and the cooperation strategy by hierarchical reinforcement learning, and further training the fused strategy in the simulation environment E_s; (4) migrating the result to the real environment. The method is meaningful in real scenes and has the characteristics of good generalization, low cost and strong robustness.
Description
Technical Field
The invention relates to a fixed wing unmanned aerial vehicle autonomous control cooperation strategy training method based on hierarchical reinforcement learning and multi-agent reinforcement learning, and belongs to the technical field of unmanned aerial vehicle autonomous control cooperation strategies.
Background
The traditional autonomous control cooperation strategy for fixed-wing unmanned aerial vehicles mainly adopts automatic control methods, with manual modeling and hand-crafted strategies. Flight rules are formulated by experts in the relevant field. This is costly, and because the environment is complex and changes frequently, many situations are not covered by the flight rules. Flight rules therefore generally cannot handle complex, changing environments, and their capability is limited.
Recently, with the vigorous development of machine learning, reinforcement learning has brought a new solution for the autonomous control strategy of unmanned aerial vehicles. Reinforcement learning is a branch of machine learning; compared with the classic supervised and unsupervised learning problems, its main characteristic is learning from interaction. The agent continuously learns from the rewards or punishments obtained in its interaction with the environment, and thereby adapts to the environment. This learning paradigm is very similar to the way humans acquire knowledge, and for this reason reinforcement learning is considered an important approach to general AI. By constructing a dynamics simulation environment and designing a reasonable reward function, an autonomous control strategy for the unmanned aerial vehicle can be trained in the simulator, which is efficient and low-cost. Because training samples are abundant, a flight control strategy learned by reinforcement learning can face various complex and changing conditions, and compared with rule-based control it is more robust and flexible. However, plain reinforcement learning also has limitations: the exploration space is too large, the effect depends heavily on parameter-tuning tricks, and training is difficult.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems and defects in the prior art, the invention provides a fixed-wing unmanned aerial vehicle autonomous control cooperation strategy training method based on hierarchical reinforcement learning and multi-agent reinforcement learning, which abstractly divides the unmanned aerial vehicle autonomous control cooperation strategy into two layers: the high-level strategy is responsible for the cooperation strategy, and the bottom-level strategy is responsible for flight control. The strategies are thereby decoupled, the exploration space is reduced, and the learning difficulty is lowered. By constructing a simplified cooperation environment stripped of flight control, the APEX_QMIX algorithm's framework of centralized learning and distributed execution is applied, and self-play is conducted to explore the various possible cooperation strategies from zero. Meanwhile, flight control is learned in a dynamics simulation environment. Finally, the two strategies are combined to obtain the final unmanned aerial vehicle autonomous control strategy, which is migrated to the real environment. The method has the characteristics of good generalization, low cost and strong robustness.
The technical scheme is as follows: a method for training an autonomous control cooperation strategy of a fixed-wing unmanned aerial vehicle, in which the unmanned aerial vehicle cooperation strategy is divided into a high-level strategy and a bottom-level strategy by hierarchical reinforcement learning. The high-level strategy is used for the cooperation strategy; the bottom-level strategy is used for flight control. A dynamics-based fixed-wing unmanned aerial vehicle control simulation environment E_s is used to train the unmanned aerial vehicle to achieve the flight control and cooperation targets. In addition, to decouple the strategies, reduce the exploration space and lower the learning difficulty, a simplified abstract environment E_a stripped of flight control is constructed for pre-training the cooperation strategy and accelerating its learning. The bottom-level strategy is obtained by supervised learning. The high-level and bottom-level strategies are fused, and the trained autonomous control cooperation strategy is finally applied to the real environment. Using the APEX_QMIX algorithm, the cooperation strategy is pre-trained from the observation information provided by the abstract environment E_a, and the fusion strategy is trained from the observation information provided by the fixed-wing unmanned aerial vehicle control simulation environment E_s.
The high-level strategy receives observation, gives a flying target point and controls the cooperation of the unmanned aerial vehicle; and the bottom layer strategy receives the target point of the high layer strategy, selects the optimal flight mode and flies to the target point in the fastest optimal mode.
The dynamics-based fixed-wing unmanned aerial vehicle control simulation environment E_s is the simulator Em_s, and the simplified abstract environment E_a stripped of flight control is the simulator Em_a. The simulator Em_s approximately simulates the Markov process <S, A, P, R, γ>, providing observation information consistent with the unmanned aerial vehicle in the real scene and control instructions consistent with the real unmanned aerial vehicle. The control instructions comprise three basic angle-change commands for controlling flight and a throttle command controlling the flight speed of the unmanned aerial vehicle; the three basic angle-change commands are a pitch command, a roll command and a yaw command. The control instruction has the form A = <Δv, Δα, Δβ, Δγ>, with spatial dimension R^4.
The simulator Em_a does not involve the unmanned aerial vehicle control information of the real scene, but abstracts and simplifies the autonomous control process of the unmanned aerial vehicle into a particle game in a three-dimensional environment: the unmanned aerial vehicle is regarded as a particle, and a fixed-step-length flight segment is abstracted as directly reaching a target point. The simulator generates red and blue unmanned aerial vehicle groups and performs adversarial cooperative training.
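As a sketch of the fixed-step particle abstraction described above (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def particle_step(pos, target, step_len):
    """Move a point-mass drone one fixed-length step toward a target.

    Abstraction used by the simplified environment: the drone covers
    step_len along the line from its current position to the target,
    or stops at the target if it is closer than one step.
    """
    pos = np.asarray(pos, dtype=float)
    target = np.asarray(target, dtype=float)
    delta = target - pos
    dist = np.linalg.norm(delta)
    if dist <= step_len:
        return target
    return pos + delta / dist * step_len
```

For example, a particle at the origin stepping toward (3, 0, 0) with step length 1 advances to (1, 0, 0) in one step.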
The bottom-layer strategy is obtained by supervised learning: single flight action tasks are constructed, such as constant-speed level flight, level-flight acceleration and deceleration, steepest climb and half-roll reversal. Real track data of the unmanned aerial vehicle's actions when the pilot controls it are collected; all 'state-action' pairs corresponding to the tracks are extracted to construct a new set D = {(s_1, a_1), (s_2, a_2), ...}. Taking the state as the feature and the unmanned aerial vehicle joystick instruction as the label, learning is performed with the proximal policy optimization reinforcement learning method to obtain an optimal strategy model. The supervised learning objective function of the bottom-layer strategy is as follows:
J(θ) = E_{(s_t, a_t)~π_{θ'}}[ (p_θ(a_t|s_t) / p_{θ'}(a_t|s_t)) A^{θ'}(s_t, a_t) ]

where a_t is the action of the unmanned aerial vehicle agent at time t, s_t is the unmanned aerial vehicle state information at time t, θ' is the strategy model parameter used to interact with the environment for sampling, θ is the strategy model parameter being updated, p_{θ'} and p_θ are the probabilities of selecting action a_t in state s_t under θ' and θ respectively, A^{θ'} is the advantage function under θ', and E is the expectation.
The steering column instruction specifically comprises a rolling instruction, a pitching instruction, a yawing instruction and a power instruction of the unmanned aerial vehicle.
In the simplified abstract environment E_a stripped of flight control, i.e. the simulator Em_a, adversarial cooperative training of the red and blue unmanned aerial vehicles is carried out. For one unmanned aerial vehicle group, the APEX_QMIX algorithm's framework of centralized learning and distributed execution is adopted: the distributed strategy of each unmanned aerial vehicle is obtained through centralized information learning, and global state information is leveraged to improve the algorithm's effect. A neural network is used to integrate the local value functions of the agents into a joint action-value function for evaluating each drone's actions.
Beneficial effects: compared with the prior art, the autonomous control cooperative strategy training method for the fixed-wing unmanned aerial vehicle of the invention has the following advantages:
(1) A simulator is built, and the autonomous control cooperation strategy of the fixed-wing unmanned aerial vehicle is trained with a reinforcement learning algorithm. Flight rules no longer need to be hand-crafted by experts in the relevant field, and the method is efficient with zero trial-and-error cost.
(2) Hierarchical reinforcement learning is adopted to decouple the strategies, reducing the exploration space and the learning difficulty.
(3) To accelerate learning of the cooperation strategy and further reduce the search space, a simplified abstract environment E_a stripped of flight control is constructed. Without considering specific flight actions, the cooperation strategy model is pre-trained considering only simplified cooperative behaviors, greatly reducing the overall training time cost.
Drawings
FIG. 1 is a training framework diagram of the fusion strategy of the invention in the fixed-wing unmanned aerial vehicle control simulation environment E_s;
FIG. 2 is a training framework diagram of pre-training the cooperation strategy in the abstract environment E_a of the invention; after action_1, …, action_n, the agents agent_1, …, agent_n directly reach the single-step farthest point along the line connecting the target point output by the corresponding strategy and the current position point;
FIG. 3 is a schematic diagram of the Ape-X structure of the multi-agent reinforcement learning algorithm APEX_QMIX used in the invention;
FIG. 4 is a schematic diagram of the mixing network structure of the multi-agent reinforcement learning algorithm APEX_QMIX used in the invention;
FIG. 5 is a training flowchart of the fusion strategy in the fixed-wing unmanned aerial vehicle control simulation environment E_s.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
The method for training the autonomous control cooperation strategy of the fixed wing unmanned aerial vehicle comprises the following steps:
step 1: a simulator Em _ s controlled by a fixed-wing unmanned aerial vehicle is constructed based on dynamics, and the visualization part of the simulator Em _ s is realized based on a unity3D engine. Unmanned aerial vehicle simulation environment EsThe training process in (1) is defined as a tuple form of a Markov Decision Process (MDP)<S,A,P,R>And S is unmanned aerial vehicle state information, A is unmanned aerial vehicle action, P is an environment state transfer function, and R is environment reward. Wherein S ═<V,α,β,γ>The three angles respectively correspond to the current speed of the unmanned aerial vehicle and are relative to a north-east coordinate system. Control command (motion space) a ═<Δv,Δα,Δβ,Δγ>In order to simulate the control operation of the flying stick of the real flight, three basic angle change commands for controlling the flight are included: pitch commands, roll commands, yaw commands, and throttle control commands that control the aircraft's flight speed. The specific detailed steps for realizing the simulator Em _ s based on unity3D are as follows:
step 11, creating an environment for containing the agent.
Step 12, implement an Academy subclass and add this subclass to the game objects (GameObject) in the Unity scene containing the environment. This game object will serve as a parent to any Brain object in the scene. And implements an optional method of Academy class to update the scene independently of any agent, e.g., adding, moving, or deleting agents and other entities in the environment.
And step 13, adding one or more Brain objects into the scene as the sublevels of Academy.
Step 14, implement the Agent subclass. The Agent subclass defines the code necessary for an agent to observe its environment, perform specified actions, and calculate the rewards used for reinforcement learning training. Optional methods are implemented to reset the agent when it completes its task or when the task fails.
And step 15, adding the Agent subclasses to corresponding unmanned aerial vehicle objects, and distributing a Brain object for each Agent object.
And step 16, implementing the aerodynamically based simulated unmanned aerial vehicle flying state transition code.
And step 17, realizing the code for the state transition of the unmanned aerial vehicle group fight of the red and blue parties, and adding a win and loss judgment code.
And step 18, customizing the environment rule, so that the environment with different scenes, different time lengths and different difficulties can be generated.
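The state and action spaces of the MDP defined in step 1 can be sketched as follows; the dataclass names, the angle-to-axis mapping, and the toy transition are illustrative assumptions, not the patent's dynamics model:

```python
from dataclasses import dataclass

@dataclass
class DroneState:
    """S = <V, alpha, beta, gamma>: current speed and three attitude
    angles relative to the north-east-down frame (names assumed)."""
    v: float
    alpha: float
    beta: float
    gamma: float

@dataclass
class DroneAction:
    """A = <dv, dalpha, dbeta, dgamma>: throttle increment plus the
    three basic angle-change commands; action space dimension R^4."""
    dv: float
    dalpha: float
    dbeta: float
    dgamma: float

def apply(state: DroneState, action: DroneAction) -> DroneState:
    """One abstract MDP transition (toy additive kinematics, standing
    in for the aerodynamics-based transition code of step 16)."""
    return DroneState(state.v + action.dv,
                      state.alpha + action.dalpha,
                      state.beta + action.dbeta,
                      state.gamma + action.dgamma)
```

A drone at 100 m/s receiving <+5, +1, 0, -2> transitions to speed 105 with the corresponding angle increments.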
Step 2: and carrying out strategy layering based on the idea of layered reinforcement learning. The autonomous control cooperation strategy of the unmanned aerial vehicle is abstractly divided into two layers: a high-level policy and a bottom-level policy; the high-level strategy is responsible for the cooperation strategy, and the bottom-level strategy is responsible for the flight control. And obtaining a bottom flight control model in the simulator Em _ s through supervised learning. The detailed implementation steps of the process are as follows:
step 21, extracting all the 'state-action' pairs corresponding to the acquired real track data of various tactical actions of the pilot controlling the unmanned aerial vehicle to construct a training set D {(s)1,a1),(s2,a2),(s2,a2)...}。
Step 22, design a suitable neural network structure, select suitable hyper-parameters, and build the network; for example, a 5-layer fully connected neural network with the ReLU function as the activation of each layer.
Step 23, the aircraft state s_i is taken as the feature and the joystick command a_i as the label for regression learning. Using the BP algorithm, the cumulative error on the training set is minimized:
E = (1/m) Σ_{k=1}^{m} E_k,  with  E_k = (1/2) ||ŷ_k − y_k||²

where y_k is the true label, ŷ_k is the predicted label, E_k is the error on the k-th training sample, and E is the cumulative error over the training set.
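A minimal sketch of the cumulative-error computation above, assuming the common half-squared-error form (the toy linear model and one-sample dataset are illustrative only):

```python
import numpy as np

def cumulative_error(model, dataset):
    """Mean cumulative error E = (1/m) * sum_k E_k over the training
    set, where E_k = 0.5 * ||y_hat_k - y_k||^2."""
    m = len(dataset)
    total = 0.0
    for s, a in dataset:          # (state, joystick-command) pairs
        y_hat = model(s)
        total += 0.5 * float(np.sum((y_hat - a) ** 2))
    return total / m

# Toy linear "policy" and one-sample dataset for illustration.
W = np.zeros((4, 4))
model = lambda s: W @ s
D = [(np.ones(4), np.ones(4))]
```

With the zero-initialized weights above, the error on the single sample is 0.5 * 4 = 2.0; a BP training loop would then descend this quantity's gradient with respect to W.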
And step 3: constructing a simplified abstract environment E stripped of flight control by the gym open source libraryaEm _ a. The detailed steps are as follows:
step 31, the state transition function code of Em _ a is realized through the unified environment interface of the gym, and the environment interface of the gym is as follows:
(1) reset (): resetting the state of the environment and returning to observation;
(2) step (action): and the physical engine advances a time step forward and returns updated, done and info. The action is the action of the intelligent agent, the assertion is the information observed by the intelligent agent from the environment, the reward is the reward received by the intelligent agent from the environment, done is a termination signal, and info is related information;
(3) render (): an image engine redraws a frame of an environment.
RLlib's MultiAgentEnv is used to wrap Em_a so that it can be trained in a distributed fashion with Ray. This provides support for centralized learning with the APEX_QMIX algorithm (the QMIX algorithm adopting the Ape-X structure) under a framework of distributed execution, and for self-play (adversarial cooperative training of the red and blue unmanned aerial vehicle groups). Ape-X uses a single GPU learner while many CPU workers collect experience; trajectories are stored in the replay buffer with prioritized sampling, so experience collection can be scaled out to hundreds of parallel CPU workers, greatly accelerating the training process. The Ape-X structure is shown in FIG. 3.
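The core idea of Ape-X's prioritized experience replay can be sketched as follows (a toy proportional sampler; the real implementation uses sum-trees and importance-sampling weights, and this helper is not part of Ray or RLlib):

```python
import random

def prioritized_sample(buffer, priorities, k):
    """Sample k buffer indices with probability proportional to
    priority, so high-TD-error transitions are replayed more often."""
    total = sum(priorities)
    weights = [p / total for p in priorities]
    return random.choices(range(len(buffer)), weights=weights, k=k)
```

In Ape-X the CPU workers attach initial priorities when they push trajectories, and the GPU learner re-prioritizes transitions after each update.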
Step 4: in Em_a, adversarial cooperative training of the red and blue unmanned aerial vehicles is carried out with the APEX_QMIX algorithm. The detailed steps are as follows:
and step 41, selecting input parameters (initialization position, win-lose judgment condition, duration and the like), initializing the environment of the simulator Em _ a, and generating the unmanned aerial vehicle clusters of the red and blue parties.
Step 42, taking the red side as an example: a neural network model (the base structure is a 3-layer fully connected neural network, each layer using the ReLU function as activation) is used to initialize the strategy model (the local action-value function of a single agent) of each unmanned aerial vehicle. For all agents on the red side, a mixing network is adopted to combine the local value functions of the individual agents, and global state information is added during training and learning to assist, improving algorithm performance. The mixing network structure is shown schematically in FIG. 4. The cost function finally used is:
the update uses the conventional concept of DQN, where b denotes the number of samples sampled from the empirical memory, representing the target network. Wherein, tau ', a ', s ' is the sample track sampled in the experience memory, the action of agent and the environment state, r is the reward given by the environment,is a target network parameter.
Step 43, the red and blue sides conduct adversarial training by self-play without any artificial prior knowledge, exploring from zero to enrich the diversity and robustness of the strategies.
Step 5: the two strategy models obtained in step 2 and step 4 are fused based on hierarchical reinforcement learning. The detailed steps are as follows:
Step 51, a two-layer network structure is established. The first layer, called the meta-controller, is responsible for choosing a reachable sub-goal; the second layer is the low-level controller, which outputs actions according to the goal given by the meta-controller; when the sub-goal is reached or the set time elapses, a new goal is chosen. The meta-controller receives the external reward and gives the low-level controller an intrinsic reward. The low-level controller builds a Q function to estimate the reward of each action under the current goal. The Q function is as follows:
Q₁*(s, a; g) = E[ Σ_{t'≥t} γ^{t'−t} r_{t'} | s_t = s, a_t = a, g_t = g, π_ag ]

where E is the expectation, γ is the reward discount coefficient, a is the action, s is the state, g is the goal, π_ag is the strategy selecting action a given goal g, and the subscript t is the time step;
The meta-controller establishes a Q function to estimate the reward of goal g under different situations:

Q₂*(s, g) = E[ Σ_{t'=t}^{t+N} f_{t'} + γ max_{g'} Q₂*(s_{t+N}, g') | s_t = s, g_t = g, π_g ]

where E is the expectation, γ is the reward discount coefficient, s is the state, g is the goal, π_g is the strategy selecting goal g, t and N are time steps, and f is the accumulated external reward.
Both controllers use a similar update method: a single-step error is established, and the update is performed by gradient descent.
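The shared single-step update of both controllers can be sketched in tabular form (illustrative; the patent uses neural-network function approximation, and this update is the tabular limit of gradient descent on the squared single-step error):

```python
def td_update(q, key, reward, next_value, gamma=0.99, lr=0.1):
    """One-step TD update Q <- Q + lr * (target - Q), with
    target = reward + gamma * next_value. For the meta-controller
    the reward is the accumulated external reward f; for the
    low-level controller it is the intrinsic reward."""
    target = reward + gamma * next_value
    q[key] = q.get(key, 0.0) + lr * (target - q.get(key, 0.0))
    return q[key]
```

Starting from Q = 0, one update with reward 1.0, next value 0.0 and learning rate 0.5 moves the entry halfway to the target, i.e. to 0.5.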
Step 52, the meta-controller and the low-level controller are initialized with the two strategy models obtained in step 2 and step 4 respectively, and trained in Em_s. The training process is shown in FIG. 5.
Step 6: and finally migrate to the real environment.
Claims (7)
1. A fixed wing unmanned aerial vehicle autonomous control cooperation strategy training method is characterized by comprising the following steps: the unmanned aerial vehicle cooperation strategy is divided into a high-level strategy and a bottom-level strategy by adopting a layered reinforcement learning method; the high-level strategy is used for the cooperation strategy; the bottom-level strategy is used for flight control; a dynamics-based fixed-wing unmanned aerial vehicle control simulation environment E_s is used for training the unmanned aerial vehicle to achieve flight control and cooperation targets; a simplified abstract environment E_a stripped of flight control is constructed for pre-training the cooperation strategy; the bottom-level strategy is obtained by supervised learning; strategy fusion is performed on the high-level strategy and the bottom-level strategy, and finally the trained autonomous control cooperation strategy is applied to the real environment;
Using the APEX_QMIX algorithm, the cooperation strategy is pre-trained from the observation information provided by the abstract environment E_a, and the fusion strategy is trained from the observation information provided by the fixed-wing unmanned aerial vehicle control simulation environment E_s; the APEX_QMIX algorithm is a QMIX algorithm that employs the Ape-X structure.
2. The fixed-wing drone autonomous control collaborative strategy training method of claim 1, characterized in that: the high-level strategy receives observation information, gives a flying target point and controls the cooperation of the unmanned aerial vehicle; and the bottom layer strategy receives the target point of the high layer strategy, selects the optimal flight mode and flies to the target point in the fastest optimal mode.
3. The fixed-wing drone autonomous control collaborative strategy training method of claim 1, characterized in that: the dynamics-based fixed-wing unmanned aerial vehicle control simulation environment E_s is a simulator Em_s; a simplified abstract environment E_a stripped of flight control is constructed as a simulator Em_a; the simulator Em_s is used to approximately simulate the Markov process <S, A, P, R>, providing observation information consistent with the unmanned aerial vehicle in a real scene and control instructions consistent with the real unmanned aerial vehicle, the control instructions comprising three basic angle-change instructions for controlling flight and a throttle control instruction for controlling the flight speed of the unmanned aerial vehicle; the three basic angle-change instructions comprise a pitch instruction, a roll instruction and a yaw instruction; the control instruction has the form A = <Δv, Δα, Δβ, Δγ>, with spatial dimension R^4; S is the unmanned aerial vehicle state information, A is the unmanned aerial vehicle action, P is the environment state transition function, and R is the environment reward; S = <V, α, β, γ>, where V is the current speed of the unmanned aerial vehicle and the three angles are relative to the north-east-down coordinate system.
4. The fixed-wing drone autonomous control collaborative strategy training method of claim 3, characterized in that: the simulator Em_a does not involve unmanned aerial vehicle control information of the real scene, but abstracts and simplifies the autonomous control process of the unmanned aerial vehicle into a particle game in a three-dimensional environment; the unmanned aerial vehicle is regarded as a particle, and a fixed-step-length flight segment is abstracted as directly reaching a target point; the simulator generates red and blue unmanned aerial vehicle groups and performs adversarial cooperative training.
5. The fixed-wing drone autonomous control collaborative strategy training method of claim 1, characterized in that: the bottom-layer strategy is obtained by supervised learning, and a single flight action task is constructed; real track data of the unmanned aerial vehicle's actions when the pilot controls it are collected; all 'state-action' pairs corresponding to the tracks are extracted to construct a new set D = {(s_1, a_1), (s_2, a_2), ...}; taking the state as the feature and the unmanned aerial vehicle joystick instruction as the label, learning is performed with the proximal policy optimization reinforcement learning method to obtain an optimal strategy model; the supervised learning objective function of the bottom-layer strategy is as follows:
J(θ) = E_{(s_t, a_t)~π_{θ'}}[ (p_θ(a_t|s_t) / p_{θ'}(a_t|s_t)) A^{θ'}(s_t, a_t) ]

where a_t is the action of the unmanned aerial vehicle agent at time t, s_t is the unmanned aerial vehicle state information at time t, θ' is the strategy model parameter used to interact with the environment for sampling, θ is the strategy model parameter being updated, p_{θ'} and p_θ are the probabilities of selecting action a_t in state s_t under θ' and θ respectively, A^{θ'} is the advantage function under θ', and E is the expectation.
6. The fixed-wing drone autonomous control collaborative strategy training method of claim 5, characterized in that: the steering column instruction specifically comprises a rolling instruction, a pitching instruction, a yawing instruction and a power instruction of the unmanned aerial vehicle.
7. The fixed-wing drone autonomous control collaborative strategy training method of claim 3, characterized in that: in the simulator Em_a of the simplified abstracted environment E_a, from which flight control is stripped, adversarial cooperative training of the red and blue unmanned aerial vehicle sides is carried out; for one unmanned aerial vehicle group, the APEX_QMIX algorithm is adopted, with centralized learning and a distributed execution framework; the distributed strategy of each unmanned aerial vehicle is obtained through centralized information learning; global state information is used to improve the effect of the algorithm; a neural network integrates the local value functions of the individual agents into a joint action value function used to evaluate the actions of each unmanned aerial vehicle.
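The monotonic mixing at the heart of claim 7 (a neural network integrating local value functions into one joint action value) can be sketched in miniature. Real QMIX produces the non-negative mixing weights from the global state via hypernetworks; this illustration uses fixed weights, and all names and numbers are assumptions.

```python
# Sketch of the QMIX mixing idea: combine per-drone local Q-values into a
# joint Q_tot with weights forced non-negative (via abs), so Q_tot is
# monotone in every local Q. Monotonicity is what lets each drone act on
# the argmax of its own local Q while staying consistent with the joint
# argmax learned centrally.

def mix(local_qs, weights, bias=0.0):
    """Monotonic mix: Q_tot = sum_i |w_i| * Q_i + b."""
    return sum(abs(w) * q for w, q in zip(weights, local_qs)) + bias

# Raising any single agent's local Q can only raise the joint value.
q_tot_a = mix([1.0, 2.0], weights=[0.5, -0.25])
q_tot_b = mix([3.0, 2.0], weights=[0.5, -0.25])  # agent 0's local Q increased
```

In full QMIX the weights and bias are the outputs of hypernetworks conditioned on the global state, which is how the "borrowed" global information of claim 7 enters training without being needed at execution time.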
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010944803.8A CN112034888B (en) | 2020-09-10 | 2020-09-10 | Autonomous control cooperation strategy training method for fixed wing unmanned aerial vehicle |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112034888A CN112034888A (en) | 2020-12-04 |
CN112034888B true CN112034888B (en) | 2021-07-30 |
Family
ID=73584525
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112906888B (en) * | 2021-03-02 | 2023-05-09 | 中国人民解放军军事科学院国防科技创新研究院 | Task execution method and device, electronic equipment and storage medium |
CN113110546B (en) * | 2021-04-20 | 2022-09-23 | 南京大学 | Unmanned aerial vehicle autonomous flight control method based on offline reinforcement learning |
CN113435598B (en) * | 2021-07-08 | 2022-06-21 | 中国人民解放军国防科技大学 | Knowledge-driven intelligent strategy deduction decision method |
CN113721645A (en) * | 2021-08-07 | 2021-11-30 | 中国航空工业集团公司沈阳飞机设计研究所 | Unmanned aerial vehicle continuous maneuvering control method based on distributed reinforcement learning |
CN113886953B (en) * | 2021-09-27 | 2022-07-19 | 中国人民解放军军事科学院国防科技创新研究院 | Unmanned aerial vehicle intelligent simulation training method and device based on distributed reinforcement learning |
CN113867178B (en) * | 2021-10-26 | 2022-05-31 | 哈尔滨工业大学 | Virtual and real migration training system for multi-robot confrontation |
CN114141028B (en) * | 2021-11-19 | 2023-05-12 | 哈尔滨工业大学(深圳) | Intelligent traffic light traffic flow regulating and controlling system |
CN114167756B (en) * | 2021-12-08 | 2023-06-02 | 北京航空航天大学 | Multi-unmanned aerial vehicle collaborative air combat decision autonomous learning and semi-physical simulation verification method |
CN114444716A (en) * | 2022-01-06 | 2022-05-06 | 中国电子科技集团公司电子科学研究院 | Multi-agent game training method and system in virtual environment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101964019A (en) * | 2010-09-10 | 2011-02-02 | 北京航空航天大学 | Adversarial behavior modeling and simulation platform and method based on Agent technology |
US9622133B1 (en) * | 2015-10-23 | 2017-04-11 | The Florida International University Board Of Trustees | Interference and mobility management in UAV-assisted wireless networks |
CN108255059A (en) * | 2018-01-19 | 2018-07-06 | 南京大学 | Robot control method based on simulator training |
CN109116854A (en) * | 2018-09-16 | 2019-01-01 | 南京大学 | Multi-group robot cooperative control method and control system based on reinforcement learning |
CN109992000A (en) * | 2019-04-04 | 2019-07-09 | 北京航空航天大学 | Multi-UAV path collaborative planning method and device based on hierarchical reinforcement learning |
CN110231814A (en) * | 2019-07-03 | 2019-09-13 | 中国人民解放军国防科技大学 | Layered distributed control system and control method for fixed-wing unmanned aerial vehicle cluster |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11093829B2 (en) * | 2017-10-12 | 2021-08-17 | Honda Motor Co., Ltd. | Interaction-aware decision making |
CN110502033B (en) * | 2019-09-04 | 2022-08-09 | 中国人民解放军国防科技大学 | Fixed-wing unmanned aerial vehicle cluster control method based on reinforcement learning |
CN110991545B (en) * | 2019-12-10 | 2021-02-02 | 中国人民解放军军事科学院国防科技创新研究院 | Multi-agent confrontation oriented reinforcement learning training optimization method and device |
CN111144580B (en) * | 2019-12-31 | 2024-04-12 | 中国电子科技集团公司信息科学研究院 | Hierarchical reinforcement learning training method and device based on imitation learning |
CN111552301B (en) * | 2020-06-21 | 2022-05-20 | 南开大学 | Hierarchical control method for salamander robot path tracking based on reinforcement learning |
Non-Patent Citations (4)
Title |
---|
Automatic Successive Reinforcement Learning with Multiple Auxiliary Rewards; Fu Zhaoyang, et al.; Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence; 2019-12-31; full text * |
Towards Sample Efficient Reinforcement Learning; Yu Yang; Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence; 2018-12-31; full text * |
Research Progress in Hierarchical Reinforcement Learning (分层式强化学习研究进展); 陈春林; 《***仿真技术及其应用》; 2008-12-31; vol. 10; full text * |
A Survey of Hierarchical Reinforcement Learning (分层强化学习综述); *** et al.; 《周志华》; 2017-10-31; vol. 12, no. 5; full text * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112034888B (en) | Autonomous control cooperation strategy training method for fixed wing unmanned aerial vehicle | |
CN112162564B (en) | Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm | |
CN110488859B (en) | Unmanned aerial vehicle route planning method based on improved Q-learning algorithm | |
CN110531786B (en) | Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN | |
CN112215350B (en) | Method and device for controlling agent based on reinforcement learning | |
Li et al. | Oil: Observational imitation learning | |
CN105700526A (en) | Online sequential extreme learning machine method with autonomous learning capability | |
CN114089776B (en) | Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning | |
CN114815882B (en) | Unmanned aerial vehicle autonomous formation intelligent control method based on reinforcement learning | |
CN113741533A (en) | Unmanned aerial vehicle intelligent decision-making system based on simulation learning and reinforcement learning | |
CN114741886A (en) | Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation | |
CN109740741A (en) | Reinforcement learning method combining knowledge transfer and its application to autonomous skill learning of unmanned vehicles | |
CN115509251A (en) | Multi-unmanned aerial vehicle multi-target cooperative tracking control method based on MAPPO algorithm | |
CN113821045A (en) | Leg and foot robot reinforcement learning action generation system | |
CN116225055A (en) | Unmanned aerial vehicle autonomous flight path planning algorithm based on state decomposition in complex environment | |
Zijian et al. | Imaginary filtered hindsight experience replay for UAV tracking dynamic targets in large-scale unknown environments | |
CN114355897B (en) | Vehicle path tracking control method based on model and reinforcement learning hybrid switching | |
Jiang et al. | A deep reinforcement learning strategy for UAV autonomous landing on a platform | |
CN116796843A (en) | Unmanned aerial vehicle many-to-many chase game method based on PSO-M3DDPG | |
CN116796844A (en) | M2 GPI-based unmanned aerial vehicle one-to-one chase game method | |
Nguyen et al. | Apprenticeship bootstrapping | |
CN116227622A (en) | Multi-agent landmark coverage method and system based on deep reinforcement learning | |
CN115933712A (en) | Bionic fish leader-follower formation control method based on deep reinforcement learning | |
CN114371634B (en) | Unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback | |
Wang et al. | Autonomous obstacle avoidance algorithm of UAVs for automatic terrain following application |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||