CN116796505B - Air combat maneuver strategy generation method based on example strategy constraint - Google Patents

Air combat maneuver strategy generation method based on example strategy constraint

Info

Publication number
CN116796505B
CN116796505B (application number CN202310529870.7A)
Authority
CN
China
Prior art keywords
strategy
air combat
generation method
method based
stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310529870.7A
Other languages
Chinese (zh)
Other versions
CN116796505A (en)
Inventor
付宇鹏
张立民
邓向阳
朱子强
闫文君
于柯远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Naval Aeronautical University
Original Assignee
Naval Aeronautical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Naval Aeronautical University filed Critical Naval Aeronautical University
Priority to CN202310529870.7A priority Critical patent/CN116796505B/en
Publication of CN116796505A publication Critical patent/CN116796505A/en
Application granted granted Critical
Publication of CN116796505B publication Critical patent/CN116796505B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to an air combat maneuver strategy generation method based on example strategy constraint, belonging to the technical field of air combat agent decision and control modeling. To address the low utilization of experience data and the difficulty of algorithm convergence faced by short-range air combat maneuver decision modeling, the method comprises three stages: example data acquisition, agent model pre-training, and online fine-tuning of the agent model parameters. It improves the utilization of round data and of effective example data, combines the characteristics of imitation learning and reinforcement learning, improves algorithm convergence efficiency, and avoids the problem of non-optimal example data.

Description

Air combat maneuver strategy generation method based on example strategy constraint
Technical Field
The invention relates to an air combat maneuver strategy generation method, in particular to an air combat maneuver strategy generation method based on example strategy constraint, and belongs to the technical field of air combat agent decision control modeling.
Background
In close-range air combat scenarios, how the two opposing sides select accurate and effective maneuver decisions according to the current combat situation is an important research direction. With the development of algorithms in recent years, reinforcement learning and imitation learning are increasingly used to realize maneuver decision control. The patent application with publication number CN112162564A discloses an unmanned aerial vehicle flight control method based on imitation learning and reinforcement learning: a basic instruction set for the aircraft model is established and mapped to maneuver actions to realize maneuver control. The precision of that scheme is limited by the number of instructions in the basic instruction set; as the number of basic instructions grows, the output dimension of the controller grows, and the model strategy relies on reinforcement learning training and lacks strategy constraints. When a reinforcement learning algorithm generates a strategy on its own, expert experience is not fully used, strategy constraints are lacking, and the agent finds it difficult to obtain positive rewards in complex or sparse-reward environments. When a strategy is generated purely by an imitation learning algorithm, the behavior depends on the quality of the example data, so a balance must be struck between constraining the agent's strategy and exploring the state space. With an end-to-end scheme that learns maneuver actions directly by reinforcement learning, the state space is complex, the algorithm does not converge easily, and implementation is difficult.
Disclosure of Invention
The invention aims to provide an air combat maneuver strategy generation method based on example strategy constraint, addressing the problems of low experience-data utilization and difficult algorithm convergence faced by short-range air combat maneuver decision modeling.
In order to solve the problems, the air combat maneuver strategy generation method based on the example strategy constraint is realized through the following technical scheme:
The air combat maneuver strategy generation method based on example strategy constraint comprises three stages:
stage one: example data acquisition
A human expert flies against a simple PID-controlled agent to generate flight trajectory data; that is, the expert strategy π_E interacts with the environment to produce "state-action-reward-next state" quadruples (s_t, a_t, s_{t+1}, r_t). An example data set D_E = {τ_1, τ_2, ..., τ_n} is built from these flight trajectories, where τ_n denotes the n-th flight trajectory.
The data set is used to constrain the behavior of the agent model during training.
In the quadruple, a_t is the stick and throttle control command that realizes control of the aircraft's attitude and position, and r_t is the reward function.
Furthermore, the simple agent can control basic aircraft behaviors such as level flight, level turns, climbing and descending.
Further, the aerodynamic model of the aircraft controlled by the simple agent is a six-degree-of-freedom fixed-wing aircraft model that includes a PID-controlled stability augmentation system.
Further, the state s_t in the quadruple consists of the aircraft's own state and the relative situation of the two opposing sides. The own state comprises the following quantities:
φ, θ and ψ, which respectively denote the heading angle, pitch angle and roll angle, together with the pitch-angle rate and the current roll rate; h denotes the normalized altitude, and V denotes the normalized velocity vector in the NED coordinate system.
The relative situation comprises:
ΔV, the velocity-difference vector in the NED coordinate system; ΔX, the relative-position vector in the NED coordinate system; ATA, the azimuth angle; and AA, the target entry angle. The azimuth angle and the target entry angle are used to measure the angular advantage or disadvantage of the two sides.
Further, the reward function r_t of the invention is as follows:
r_t = η_A·r_t^A + η_R·r_t^R + η_E·r_t^E
The reward function is an important factor guiding algorithm convergence, and its design considers key air combat factors such as angular advantage, energy advantage and the aircraft's own stability. The relative Euclidean distance R is used to guide our fighter into launch conditions for short-range combat missiles or the aerial cannon.
The reward function r_t given above mainly considers attack positioning, which is composed of the angle advantage r_t^A, the relative-distance advantage r_t^R and the energy advantage r_t^E, where η_A, η_R and η_E respectively denote the weights of r_t^A, r_t^R and r_t^E.
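For illustration only, a minimal sketch of how such a weighted reward could be evaluated; the weight values and the individual advantage terms are placeholders, since the exact forms of r_t^A, r_t^R and r_t^E are not given in the text above:

```python
def total_reward(r_angle: float, r_range: float, r_energy: float,
                 eta_a: float = 0.5, eta_r: float = 0.3, eta_e: float = 0.2) -> float:
    """Weighted sum r_t = eta_A*r_t^A + eta_R*r_t^R + eta_E*r_t^E.

    The weights here are illustrative placeholders; the patent does not
    fix their numerical values.
    """
    return eta_a * r_angle + eta_r * r_range + eta_e * r_energy
```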
stage two: agent model pre-training
The agent model is a fully connected neural network. Its parameters are initialized by behavior cloning: the example actions recorded in D_E are used as labels, and the agent policy π_θ is trained by supervised learning. The gradient is computed with a loss function L_bc(θ), and the network parameters are updated to obtain the pre-trained agent model parameters θ_0.
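A minimal sketch of how this behavior-cloning pre-training could look, assuming a PyTorch implementation, illustrative layer widths and state/action sizes, and a mean-squared-error form for L_bc(θ); none of these details are specified in the text above:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 17, 4   # illustrative sizes: own state + relative situation; stick/rudder/throttle

# Hypothetical fully connected policy network pi_theta (layer widths are assumptions).
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.Tanh(),
    nn.Linear(256, 256), nn.Tanh(),
    nn.Linear(256, ACTION_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_pretrain(states: torch.Tensor, expert_actions: torch.Tensor, epochs: int = 50):
    """Behavior-cloning pre-training: expert actions from D_E serve as labels.

    L_bc(theta) is assumed here to be a mean-squared error between the policy
    output and the expert action; the patent does not give its exact form.
    """
    for _ in range(epochs):
        loss = nn.functional.mse_loss(policy(states), expert_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # The resulting parameters play the role of theta_0 for stage three.
    return policy.state_dict()
```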
Stage three: online fine adjustment of intelligent body model parameters
The agent interacts with the environment to perform online reinforcement-learning fine-tuning. A replay experience pool is set up, denoted D_off = {(s_t, a_t, s_{t+1}, R_t)}. After each round ends, samples are drawn from D_off and the agent model is trained with a policy-gradient algorithm.
Further, the online reinforcement-learning fine-tuning adopts an Actor-Critic framework. The Actor network is the policy network π_θ(s_t), which outputs the action a_t from the current state s_t, with θ denoting the policy-network parameters; the Critic network is the value network, which outputs the estimated value V from the current state s_t, with its own set of value-network parameters.
Further, to reinforce advantageous strategy actions in the experience, an advantage function is computed, where (·)_+ = max(·, 0) and T is the end time of the round; that is, the gradient is computed only on the advantageous state-action samples.
Further, only example trajectories with high returns are selected for the constraint, and a filter is designed accordingly.
The algorithm loss function is then obtained, where H_π denotes the entropy of the policy π_θ, used to improve the exploration capability of the strategy, and β and α are the coefficients of the value loss function and of the entropy term, respectively.
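A hedged sketch of how the filtered example-strategy constraint, the value loss and the entropy term could be combined into a single loss; the indicator-style filter on high-return example data, the hyper-parameter names and the exact functional form are assumptions, since the full formula appears only in the figures:

```python
import torch

def constrained_loss(policy_loss, value_loss, entropy,
                     log_prob_expert_actions, expert_returns,
                     return_threshold, alpha=0.01, beta=0.5, lam_bc=1.0):
    """Total loss = policy loss + beta * value loss - alpha * entropy
    + a behavior-cloning style constraint applied only to example
    state-action pairs whose episode return exceeds a threshold.

    The gating by `expert_returns > return_threshold` plays the role of the
    filter described above; lam_bc and the threshold are illustrative
    hyper-parameters, and all inputs are assumed to be PyTorch tensors.
    """
    keep = (expert_returns > return_threshold).float()          # filter over example data
    bc_term = -(keep * log_prob_expert_actions).sum() / keep.sum().clamp(min=1.0)
    # Subtracting the entropy term encourages exploration.
    return policy_loss + beta * value_loss - alpha * entropy + lam_bc * bc_term
```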
The method combines the advantages of on-policy and off-policy algorithms. The maneuver strategy generation method based on example strategy constraint improves the utilization of round data and of effective example data, combines the characteristics of imitation learning and reinforcement learning, improves algorithm convergence efficiency, and avoids the problem of non-optimal example data.
With the proposed air combat maneuver strategy generation method, maneuver-decision agent models with a degree of autonomy and intelligence can be generated efficiently from example data, avoiding the large amounts of time and computing resources consumed by conventional reinforcement learning algorithms. The proposed algorithm can be combined with any on-policy algorithm, is flexible to use, and improves the data utilization of the on-policy algorithm.
Drawings
Fig. 1: system training flow chart;
Fig. 2: algorithm flow chart of the invention;
Fig. 3: maneuver decision situation diagrams.
Detailed Description
The invention is described below with reference to the accompanying drawings, which further explain its constitution.
Example 1. An air combat maneuver strategy generation method based on example strategy constraints as shown in fig. 1 includes three stages:
stage one: example data acquisition
A human expert flies against a simple PID-controlled agent to generate flight trajectory data; that is, the expert strategy π_E interacts with the environment to produce "state-action-reward-next state" quadruples (s_t, a_t, s_{t+1}, r_t). An example data set D_E = {τ_1, τ_2, ..., τ_n} is built from these flight trajectories, where τ_n denotes the n-th flight trajectory.
The data set is used to constrain the behavior of the agent model during training.
In the quadruple, a_t is the stick and throttle control command that realizes control of the aircraft's attitude and position, and r_t is the reward function.
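As an illustration of the data layout, a sketch of how the quadruples and trajectories of D_E might be stored; the type and field names are hypothetical:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Transition:
    """One 'state-action-reward-next state' quadruple (s_t, a_t, s_{t+1}, r_t)."""
    state: np.ndarray       # s_t: own state plus relative situation
    action: np.ndarray      # a_t: stick and throttle commands
    next_state: np.ndarray  # s_{t+1}
    reward: float           # r_t

Trajectory = List[Transition]          # tau_n: one flight trajectory
ExampleDataset = List[Trajectory]      # D_E = {tau_1, ..., tau_n}
```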
Furthermore, the simple agent can control basic aircraft behaviors such as level flight, level turns, climbing and descending.
Further, the aerodynamic model of the aircraft controlled by the simple agent is a six-degree-of-freedom fixed-wing aircraft model that includes a PID-controlled stability augmentation system. The motion of the aircraft is controlled mainly by engine thrust, the elevator, the ailerons and the rudder; when a control surface deflects, the model changes the resultant force and moment on the aircraft according to the corresponding aerodynamic parameters. The maneuver-decision control network therefore outputs elevator, aileron, rudder and throttle commands to control the attitude and position of the aircraft.
Furthermore, the human expert and the simple agent are assumed to fight within visual range with airborne early-warning support, so the situation is transparent to both sides. The state s_t in the quadruple consists of the aircraft's own state and the relative situation of the two opposing sides. The own state comprises the following quantities:
φ, θ and ψ, which respectively denote the heading angle, pitch angle and roll angle, together with the pitch-angle rate and the current roll rate; h denotes the normalized altitude, and V denotes the normalized velocity vector in the NED coordinate system.
The relative situation comprises:
ΔV, the velocity-difference vector in the NED coordinate system; ΔX, the relative-position vector in the NED coordinate system; ATA, the azimuth angle; and AA, the target entry angle. The azimuth angle and the target entry angle are used to measure the angular advantage or disadvantage of the two sides.
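A sketch of how these relative-situation quantities could be computed from NED positions and velocities, assuming the common convention that ATA is the angle between our velocity vector and the line of sight to the target, and AA the angle between the target's velocity vector and the same line of sight; the patent text itself does not spell out these formulas:

```python
import numpy as np

def relative_situation(p_own, v_own, p_tgt, v_tgt):
    """Return (dV, dX, ATA, AA) in the NED frame.

    ATA: angle between own velocity and the line of sight to the target.
    AA : angle between target velocity and the same line of sight
         (a common 'target entry angle' convention, assumed here).
    """
    d_v = np.asarray(v_tgt) - np.asarray(v_own)   # velocity-difference vector
    d_x = np.asarray(p_tgt) - np.asarray(p_own)   # relative-position vector
    los = d_x / (np.linalg.norm(d_x) + 1e-8)      # unit line-of-sight vector

    def angle(u, w):
        c = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w) + 1e-8)
        return float(np.arccos(np.clip(c, -1.0, 1.0)))

    ata = angle(v_own, los)   # azimuth angle, radians
    aa = angle(v_tgt, los)    # target entry angle, radians
    return d_v, d_x, ata, aa
```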
Further, the reward function r_t of the invention is as follows:
r_t = η_A·r_t^A + η_R·r_t^R + η_E·r_t^E
The reward function is an important factor guiding algorithm convergence, and its design considers key air combat factors such as angular advantage, energy advantage and the aircraft's own stability. The relative Euclidean distance R is used to guide our fighter into launch conditions for short-range combat missiles or the aerial cannon.
The reward function r_t given above mainly considers attack positioning, which is composed of the angle advantage r_t^A, the relative-distance advantage r_t^R and the energy advantage r_t^E, where η_A, η_R and η_E respectively denote the weights of r_t^A, r_t^R and r_t^E.
In addition, penalty terms are introduced when the aircraft's flight altitude or speed falls below or above threshold values, preventing the maneuver decision from settling into local optima such as "dying quickly" and other erroneous choices.
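A minimal sketch of such a boundary penalty; the altitude and speed limits and the penalty magnitude are illustrative placeholders, as the patent only states that thresholds are used:

```python
def boundary_penalty(altitude_m: float, speed_mps: float,
                     alt_min: float = 1000.0, alt_max: float = 12000.0,
                     spd_min: float = 100.0, spd_max: float = 500.0,
                     penalty: float = -1.0) -> float:
    """Penalty added to r_t when altitude or speed leaves its allowed band.

    The band limits and the penalty magnitude are illustrative placeholders.
    """
    out_of_band = (altitude_m < alt_min or altitude_m > alt_max or
                   speed_mps < spd_min or speed_mps > spd_max)
    return penalty if out_of_band else 0.0
```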
stage two: agent model pre-training
The agent model is a fully connected neural network. Its parameters are initialized by behavior cloning: the example actions recorded in D_E are used as labels, and the agent policy π_θ is trained by supervised learning. The policy gradient is computed with a loss function L_bc(θ), and the network parameters are updated to obtain the pre-trained fully connected network parameters θ_0.
Stage three: online fine adjustment of intelligent body model parameters
The agent interacts with the environment to perform online reinforcement-learning fine-tuning. A replay experience pool is set up, denoted D_off = {(s_t, a_t, s_{t+1}, R_t)}. After each round ends, samples are drawn from D_off and the policy network is trained with a policy-gradient algorithm.
Further, the online reinforcement-learning fine-tuning adopts an Actor-Critic framework. The Actor network is the policy network π_θ(s_t), which outputs the action a_t from the current state s_t, with θ denoting the policy-network parameters; the Critic network is the value network, which outputs the estimated value V from the current state s_t, with its own set of value-network parameters.
Further, to reinforce advantageous strategy actions in the experience, an advantage function is computed, where (·)_+ = max(·, 0) and T is the end time of the round; that is, the gradient is computed only on the advantageous state-action samples.
Further, only example trajectories with high returns are selected for the constraint, and a filter is designed accordingly.
The algorithm loss function is then obtained, where H_π denotes the entropy regularization of the policy π_θ, used to improve the exploration capability of the strategy, and β and α are the coefficients of the value loss function and of the entropy term, respectively. The algorithm flow of the invention is shown in Fig. 2.
Because an agent driven by the pre-trained strategy accumulates compounding errors while interacting with the environment, updating the strategy purely by reinforcement learning can cause strategy drift, and in severe cases even continuous rolling, crashes and similar aircraft behaviors. Under such conditions, inappropriate settings of hyper-parameters such as the learning rate and the number of updates can also prevent the algorithm from converging. The example data should therefore be fully exploited to constrain the direction of agent strategy updates during training. In imitation learning, the generated strategy depends on the quality of the database and the agent explores the environment insufficiently; the method therefore combines the example strategy constraint with the reinforcement learning algorithm, improving both convergence and exploration.
Example 2. One application of the present method, in combination with the proximal policy optimization (PPO) algorithm, defines the advantage function according to the generalized advantage estimation (GAE) method:
δ_t = r_t + γ·V(s_{t+1}) - V(s_t),
γ and λ are two important parameters of the GAE function, where γ determines the upper bound of the value function and λ is used to balance variance and bias.
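For reference, a short sketch of the standard GAE recursion consistent with the δ_t definition above, computing A_t = Σ_l (γλ)^l · δ_{t+l} backwards over a round (the usual GAE convention, assumed here):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T) (one extra bootstrap value).
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t = sum_l (gamma * lam)^l * delta_{t+l}, accumulated backwards.
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    returns = adv + np.asarray(values[:T])   # regression targets for the value network
    return adv, returns
```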
The PPO algorithm limits the magnitude of the strategy update by clipping the probability ratio, thereby reducing fluctuation of the objective function. The policy loss and value loss functions are as follows, where R_t is the return and
c_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t)
denotes the probability ratio between the current and the old policy.
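A sketch of the standard clipped PPO surrogate and value losses consistent with the ratio c_t(θ) above; the clipping range ε = 0.2 is a common default rather than a value stated in the patent:

```python
import torch

def ppo_losses(log_prob_new, log_prob_old, advantages, value_pred, returns,
               clip_eps: float = 0.2):
    """Clipped PPO policy loss and value loss.

    c_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), evaluated here
    from log-probabilities; all inputs are assumed to be PyTorch tensors.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)                 # c_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    value_loss = torch.nn.functional.mse_loss(value_pred, returns)
    return policy_loss, value_loss
```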
To improve resource utilization, several distributed data-collection agents (workers) and a central learning agent (learner) are run in parallel simulation. Each worker interacts with the environment and stores its quadruple trajectory data in its own round experience pool. After a round ends, the data of each round experience pool are stored in the replay experience pool; the replay experience pool is sampled, the data are split into mini-batches, gradients are computed with respect to the objective function L_ppo and returned, and the learner accumulates the gradients and updates the policy-network and value-network parameters. Before the next round starts, the learner distributes the updated network parameters to each worker, and the workers then sample with the new policy.
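A schematic sketch of the worker/learner data flow just described; `workers`, `learner` and `replay_pool` are assumed objects with hypothetical methods, shown only to mirror the cycle above:

```python
import random

def training_iteration(workers, learner, replay_pool, minibatch_size=256):
    """One round of the distributed collection/update cycle described above.

    The method names (collect_round, accumulate_gradients, apply_gradients,
    set_policy_params) are placeholders, not an API defined by the patent.
    """
    # 1. Each worker rolls out one round and its round experience pool is
    #    emptied into the shared replay experience pool.
    for w in workers:
        replay_pool.extend(w.collect_round())

    # 2. Sample the replay pool, split into mini-batches, accumulate gradients of L_ppo.
    random.shuffle(replay_pool)
    for i in range(0, len(replay_pool), minibatch_size):
        batch = replay_pool[i:i + minibatch_size]
        learner.accumulate_gradients(batch)

    # 3. Update policy/value parameters and broadcast them to the workers.
    new_params = learner.apply_gradients()
    for w in workers:
        w.set_policy_params(new_params)
```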
The round experience pool holds the data stored by each agent in the distributed simulation; after a round ends, these data are moved into the replay experience pool. The end of one engagement is recorded as a round.
The simulation environment is built on the OpenAI Gym platform, aircraft dynamics and kinematics are computed on the open-source JSBSim platform, and the aircraft aerodynamic model is the published F-16 aerodynamic model. The simulation step of the aircraft state, i.e. the agent's decision interval, is 20 milliseconds, and each round lasts at most 5 minutes.
Fig. 3 (a) shows simulated situations in which the blue-side agent attacks a simple red-side target, starting from head-on and from same-direction geometries. In the head-on case, the agent banks into a turn, leads the pursuit, cuts inside the opponent's turn circle and attacks; in the same-direction case, the agent starts at a disadvantage, uses a half-loop maneuver to enter the opponent's rear hemisphere and then tracks it.
Fig. 3 (b) shows self-play of the agent with the same strategy on both sides. In the left diagram the two sides start from identical initial states with opposite headings and enter a one-circle fight; in the right diagram the blue side initially holds a certain altitude advantage, and once the two sides enter the one-circle fight the red side judges itself to be at a disadvantage, quickly banks to disengage, while the blue side remains in the red side's rear hemisphere and pursues.

Claims (8)

1. An air combat maneuver strategy generation method based on example strategy constraint, characterized by comprising three stages:
stage one: example data acquisition
A human expert flies against a simple PID-controlled agent to generate flight trajectory data; that is, the expert strategy π_E interacts with the environment to produce "state-action-reward-next state" quadruples (s_t, a_t, s_{t+1}, r_t). An example data set D_E = {τ_1, τ_2, ..., τ_n} is built from these flight trajectories, where τ_n denotes the n-th flight trajectory.
The data set is used to constrain the behavior of the agent model during training.
In the quadruple, a_t is the stick and throttle control command that realizes control of the aircraft's attitude and position, and r_t is the reward function;
stage two: agent model pre-training
The agent model is a fully connected neural network. Its parameters are initialized by behavior cloning: the example actions recorded in D_E are used as labels, and the agent policy π_θ is trained by supervised learning. The policy gradient is computed with a loss function L_bc(θ), and the network parameters are updated to obtain the pre-trained fully connected network parameters θ_0;
Stage three: online fine adjustment of intelligent body model parameters
The agent strategy interacts with the environment to perform online reinforcement-learning fine-tuning. A replay experience pool is set up, denoted D_off = {(s_t, a_t, s_{t+1}, R_t)}. After each round ends, samples are drawn from D_off and the policy network is trained with a policy-gradient algorithm.
2. The air combat maneuver strategy generation method based on example strategy constraint as claimed in claim 1, wherein: in stage one, the simple agent can control basic aircraft behaviors such as level flight, level turns, climbing and descending.
3. The air combat maneuver strategy generation method based on example strategy constraint according to claim 1 or 2, characterized in that: in stage one, the aerodynamic model of the aircraft controlled by the simple agent is a six-degree-of-freedom fixed-wing aircraft model that includes a PID-controlled stability augmentation system.
4. The air combat maneuver strategy generation method based on example strategy constraint as claimed in claim 3, wherein: in stage one, the state s_t in the quadruple consists of the aircraft's own state and the relative situation of the two opposing sides. The own state comprises the following quantities:
φ, θ and ψ, which respectively denote the heading angle, pitch angle and roll angle, together with the pitch-angle rate and the current roll rate; h denotes the normalized altitude, and V denotes the normalized velocity vector in the NED coordinate system.
The relative situation comprises:
ΔV, the velocity-difference vector in the NED coordinate system; ΔX, the relative-position vector in the NED coordinate system; ATA, the azimuth angle; and AA, the target entry angle. The azimuth angle and the target entry angle are used to measure the angular advantage or disadvantage of the two sides.
5. The air combat maneuver strategy generation method based on example strategy constraint according to claim 4, characterized in that: in stage one, the reward function r_t of the invention is as follows:
r_t = η_A·r_t^A + η_R·r_t^R + η_E·r_t^E
where the reward function is composed of the angle advantage r_t^A, the relative-distance advantage r_t^R and the energy advantage r_t^E, and η_A, η_R and η_E respectively denote the weights of r_t^A, r_t^R and r_t^E.
6. The air combat maneuver strategy generation method based on example strategy constraint according to claim 5, wherein: the online reinforcement-learning fine-tuning in stage three adopts an Actor-Critic framework, in which the Actor network is the policy network π_θ(s_t) that outputs the action a_t from the current state s_t, with θ denoting the policy-network parameters; the Critic network is the value network, which outputs the estimated value V from the current state s_t, with its own set of value-network parameters.
7. The air combat maneuver strategy generation method based on example strategy constraint according to claim 6, wherein: in stage three, an advantage function is computed, where (·)_+ = max(·, 0) and T is the end time of the round; that is, the gradient is computed only on the advantageous state-action samples.
8. The air combat maneuver strategy generation method based on example strategy constraint according to claim 7, wherein: in stage three, only example trajectories with high returns are selected for the constraint, and a filter is designed accordingly; the algorithm loss function is then obtained, where H_π denotes the entropy of the policy π_θ, used to improve the exploration capability of the strategy, and β and α are the coefficients of the value loss function and of the entropy term, respectively.
CN202310529870.7A 2023-05-11 2023-05-11 Air combat maneuver strategy generation method based on example strategy constraint Active CN116796505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310529870.7A CN116796505B (en) 2023-05-11 2023-05-11 Air combat maneuver strategy generation method based on example strategy constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310529870.7A CN116796505B (en) 2023-05-11 2023-05-11 Air combat maneuver strategy generation method based on example strategy constraint

Publications (2)

Publication Number Publication Date
CN116796505A CN116796505A (en) 2023-09-22
CN116796505B true CN116796505B (en) 2024-02-20

Family

ID=88040603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310529870.7A Active CN116796505B (en) 2023-05-11 2023-05-11 Air combat maneuver strategy generation method based on example strategy constraint

Country Status (1)

Country Link
CN (1) CN116796505B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115291625A (en) * 2022-07-15 2022-11-04 同济大学 Multi-unmanned aerial vehicle air combat decision method based on multi-agent layered reinforcement learning
CN115373411A (en) * 2022-05-31 2022-11-22 中国航空工业集团公司沈阳飞机设计研究所 Decision-making method and system for airplane autopilot control strategy
CN115755956A (en) * 2022-11-03 2023-03-07 南京航空航天大学 Unmanned aerial vehicle maneuver decision method and system driven by knowledge and data in cooperation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095481B (en) * 2021-04-03 2024-02-02 西北工业大学 Air combat maneuver method based on parallel self-game

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115373411A (en) * 2022-05-31 2022-11-22 中国航空工业集团公司沈阳飞机设计研究所 Decision-making method and system for airplane autopilot control strategy
CN115291625A (en) * 2022-07-15 2022-11-04 同济大学 Multi-unmanned aerial vehicle air combat decision method based on multi-agent layered reinforcement learning
CN115755956A (en) * 2022-11-03 2023-03-07 南京航空航天大学 Unmanned aerial vehicle maneuver decision method and system driven by knowledge and data in cooperation

Also Published As

Publication number Publication date
CN116796505A (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN108319286B (en) Unmanned aerial vehicle air combat maneuver decision method based on reinforcement learning
CN110806756B (en) Unmanned aerial vehicle autonomous guidance control method based on DDPG
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
Waldock et al. Learning to perform a perched landing on the ground using deep reinforcement learning
CN114330115B (en) Neural network air combat maneuver decision-making method based on particle swarm search
CN113095481A (en) Air combat maneuver method based on parallel self-game
CN113282061A (en) Unmanned aerial vehicle air game countermeasure solving method based on course learning
Ruan et al. Autonomous maneuver decisions via transfer learning pigeon-inspired optimization for UCAVs in dogfight engagements
CN113671825B (en) Maneuvering intelligent decision-avoiding missile method based on reinforcement learning
CN113962012A (en) Unmanned aerial vehicle countermeasure strategy optimization method and device
CN116820134A (en) Unmanned aerial vehicle formation maintaining control method based on deep reinforcement learning
CN114637312B (en) Unmanned aerial vehicle energy-saving flight control method and system based on intelligent deformation decision
Xianyong et al. Research on maneuvering decision algorithm based on improved deep deterministic policy gradient
Zhu et al. Multi-constrained intelligent gliding guidance via optimal control and DQN
CN114237268A (en) Unmanned aerial vehicle strong robust attitude control method based on deep reinforcement learning
CN116796505B (en) Air combat maneuver strategy generation method based on example strategy constraint
CN116796843A (en) Unmanned aerial vehicle many-to-many chase game method based on PSO-M3DDPG
CN116774731A (en) Unmanned aerial vehicle formation path planning method based on reinforcement learning
CN116661493A (en) Deep reinforcement learning-based aerial tanker control strategy method
CN116697829A (en) Rocket landing guidance method and system based on deep reinforcement learning
Guo et al. Maneuver decision of UAV in air combat based on deterministic policy gradient
CN113377122B (en) Adaptive control method for switching of motor-driven variant aircraft capable of perching
CN116011315A (en) Missile escape area fast calculation method based on K-sparse self-coding SVM
Ma et al. Strategy generation based on reinforcement learning with deep deterministic policy gradient for UCAV
CN117970952B (en) Unmanned aerial vehicle maneuver strategy offline modeling method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant