CN116796505B - Air combat maneuver strategy generation method based on example strategy constraint - Google Patents

Air combat maneuver strategy generation method based on example strategy constraint

Info

Publication number
CN116796505B
CN116796505B (application number CN202310529870.7A)
Authority
CN
China
Prior art keywords
strategy
air combat
generation method
method based
stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310529870.7A
Other languages
Chinese (zh)
Other versions
CN116796505A (en)
Inventor
付宇鹏
张立民
邓向阳
朱子强
闫文君
于柯远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Naval Aeronautical University
Original Assignee
Naval Aeronautical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Naval Aeronautical University filed Critical Naval Aeronautical University
Priority to CN202310529870.7A priority Critical patent/CN116796505B/en
Publication of CN116796505A publication Critical patent/CN116796505A/en
Application granted granted Critical
Publication of CN116796505B publication Critical patent/CN116796505B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to an air combat maneuver strategy generation method based on example strategy constraint, belonging to the technical field of air combat agent decision and control modeling. To address the low utilization of experience data and the difficulty of algorithm convergence faced by short-range air combat maneuver decision modeling, the method comprises three stages: example data acquisition, agent model pre-training, and online fine-tuning of the agent model parameters. It improves the utilization of round data and of effective example data, combines the characteristics of imitation learning and reinforcement learning, improves algorithm convergence efficiency, and avoids the problem of non-optimal example data.

Description

Air combat maneuver strategy generation method based on example strategy constraint
Technical Field
The invention relates to an air combat maneuver strategy generation method, in particular to an air combat maneuver strategy generation method based on example strategy constraint, and belongs to the technical field of air combat agent decision control modeling.
Background
In close-range air combat scenarios, how the two opposing sides select accurate and effective maneuver decisions according to the current combat situation is an important research direction. With the development of algorithms in recent years, reinforcement learning and imitation learning are increasingly used to realize maneuver decision control. The patent application with publication number CN112162564A discloses an unmanned aerial vehicle flight control method based on imitation learning and reinforcement learning: a basic instruction set for the aircraft model is established and mapped to maneuver actions to realize maneuver control. The precision of that scheme is limited by the number of instructions in the basic instruction set; as the number of basic instructions grows, the output dimension of the controller grows, and the model strategy relies on reinforcement learning training and lacks strategy constraints. When a reinforcement learning algorithm generates a strategy on its own, expert experience is not fully used, strategy constraints are lacking, and the agent finds it difficult to obtain positive rewards in complex or sparse-reward environments. When a strategy is generated purely by an imitation learning algorithm, the behavior depends on the quality of the example data, so a balance must be struck between constraining the agent's strategy and exploring the state space. With an end-to-end scheme that learns maneuver actions directly by reinforcement learning, the state space is complex, the algorithm does not converge easily, and implementation is difficult.
Disclosure of Invention
The invention aims to provide an air combat maneuver strategy generation method based on example strategy constraint, addressing the problems of low experience-data utilization and difficult algorithm convergence faced by short-range air combat maneuver decision modeling.
In order to solve the problems, the air combat maneuver strategy generation method based on the example strategy constraint is realized through the following technical scheme:
The air combat maneuver strategy generation method based on example strategy constraint comprises three stages:
stage one: example data acquisition
A human expert flies against a simple PID-controlled agent to generate flight trajectory data; that is, the expert strategy π_E interacts with the environment to produce "state-action-reward-next state" quadruples (s_t, a_t, s_{t+1}, r_t). An example data set D_E = {τ_1, τ_2, ..., τ_n} is built from these flight trajectories, where τ_n denotes the n-th flight trajectory.
The data set is used to constrain the behavior of the agent model during training.
In the quadruple, a_t is the stick and throttle control command that realizes control of the aircraft's attitude and position, and r_t is the reward function.
Furthermore, the simple agent can control basic aircraft behaviors such as level flight, level turns, climbing and descending.
Further, the aerodynamic model of the aircraft controlled by the simple agent is a six-degree-of-freedom fixed-wing aircraft model that includes a PID-controlled stability augmentation system.
Further, the state s_t in the quadruple consists of the aircraft's own state and the relative situation of the two opposing sides. The own state comprises the following quantities:
φ, θ and ψ, which respectively denote the heading angle, pitch angle and roll angle, together with the pitch-angle rate and the current roll rate; h denotes the normalized altitude, and V denotes the normalized velocity vector in the NED coordinate system.
The relative situation comprises:
ΔV, the velocity-difference vector in the NED coordinate system; ΔX, the relative-position vector in the NED coordinate system; ATA, the azimuth angle; and AA, the target entry angle. The azimuth angle and the target entry angle are used to measure the angular advantage or disadvantage of the two sides.
Further, the reward function r_t of the invention is as follows:
r_t = η_A·r_t^A + η_R·r_t^R + η_E·r_t^E
The reward function is an important factor guiding algorithm convergence, and its design considers key air combat factors such as angular advantage, energy advantage and the aircraft's own stability. The relative Euclidean distance R is used to guide our fighter into launch conditions for short-range combat missiles or the aerial cannon.
The reward function r_t given above mainly considers attack positioning, which is composed of the angle advantage r_t^A, the relative-distance advantage r_t^R and the energy advantage r_t^E, where η_A, η_R and η_E respectively denote the weights of r_t^A, r_t^R and r_t^E.
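For illustration only, a minimal sketch of how such a weighted reward could be evaluated; the weight values and the individual advantage terms are placeholders, since the exact forms of r_t^A, r_t^R and r_t^E are not given in the text above:

```python
def total_reward(r_angle: float, r_range: float, r_energy: float,
                 eta_a: float = 0.5, eta_r: float = 0.3, eta_e: float = 0.2) -> float:
    """Weighted sum r_t = eta_A*r_t^A + eta_R*r_t^R + eta_E*r_t^E.

    The weights here are illustrative placeholders; the patent does not
    fix their numerical values.
    """
    return eta_a * r_angle + eta_r * r_range + eta_e * r_energy
```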
stage two: agent model pre-training
The agent model is a fully connected neural network. Its parameters are initialized by behavior cloning: the example actions recorded in D_E are used as labels, and the agent policy π_θ is trained by supervised learning. The gradient is computed with a loss function L_bc(θ), and the network parameters are updated to obtain the pre-trained agent model parameters θ_0.
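A minimal sketch of how this behavior-cloning pre-training could look, assuming a PyTorch implementation, illustrative layer widths and state/action sizes, and a mean-squared-error form for L_bc(θ); none of these details are specified in the text above:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 17, 4   # illustrative sizes: own state + relative situation; stick/rudder/throttle

# Hypothetical fully connected policy network pi_theta (layer widths are assumptions).
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.Tanh(),
    nn.Linear(256, 256), nn.Tanh(),
    nn.Linear(256, ACTION_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_pretrain(states: torch.Tensor, expert_actions: torch.Tensor, epochs: int = 50):
    """Behavior-cloning pre-training: expert actions from D_E serve as labels.

    L_bc(theta) is assumed here to be a mean-squared error between the policy
    output and the expert action; the patent does not give its exact form.
    """
    for _ in range(epochs):
        loss = nn.functional.mse_loss(policy(states), expert_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # The resulting parameters play the role of theta_0 for stage three.
    return policy.state_dict()
```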
Stage three: online fine adjustment of intelligent body model parameters
The agent interacts with the environment to perform online reinforcement-learning fine-tuning. A replay experience pool is set up, denoted D_off = {(s_t, a_t, s_{t+1}, R_t)}. After each round ends, samples are drawn from D_off and the agent model is trained with a policy-gradient algorithm.
Further, the online reinforcement-learning fine-tuning adopts an Actor-Critic framework. The Actor network is the policy network π_θ(s_t), which outputs the action a_t from the current state s_t, with θ denoting the policy-network parameters; the Critic network is the value network, which outputs the estimated value V from the current state s_t, with its own set of value-network parameters.
Further, to reinforce advantageous strategy actions in the experience, an advantage function is computed, where (·)_+ = max(·, 0) and T is the end time of the round; that is, the gradient is computed only on the advantageous state-action samples.
Further, only example trajectories with high returns are selected for the constraint, and a filter is designed accordingly.
The algorithm loss function is then obtained, where H_π denotes the entropy of the policy π_θ, used to improve the exploration capability of the strategy, and β and α are the coefficients of the value loss function and of the entropy term, respectively.
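A hedged sketch of how the filtered example-strategy constraint, the value loss and the entropy term could be combined into a single loss; the indicator-style filter on high-return example data, the hyper-parameter names and the exact functional form are assumptions, since the full formula appears only in the figures:

```python
import torch

def constrained_loss(policy_loss, value_loss, entropy,
                     log_prob_expert_actions, expert_returns,
                     return_threshold, alpha=0.01, beta=0.5, lam_bc=1.0):
    """Total loss = policy loss + beta * value loss - alpha * entropy
    + a behavior-cloning style constraint applied only to example
    state-action pairs whose episode return exceeds a threshold.

    The gating by `expert_returns > return_threshold` plays the role of the
    filter described above; lam_bc and the threshold are illustrative
    hyper-parameters, and all inputs are assumed to be PyTorch tensors.
    """
    keep = (expert_returns > return_threshold).float()          # filter over example data
    bc_term = -(keep * log_prob_expert_actions).sum() / keep.sum().clamp(min=1.0)
    # Subtracting the entropy term encourages exploration.
    return policy_loss + beta * value_loss - alpha * entropy + lam_bc * bc_term
```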
The method combines the advantages of on-policy and off-policy algorithms. The maneuver strategy generation method based on example strategy constraint improves the utilization of round data and of effective example data, combines the characteristics of imitation learning and reinforcement learning, improves algorithm convergence efficiency, and avoids the problem of non-optimal example data.
With the proposed air combat maneuver strategy generation method, maneuver-decision agent models with a degree of autonomy and intelligence can be generated efficiently from example data, avoiding the large amounts of time and computing resources consumed by conventional reinforcement learning algorithms. The proposed algorithm can be combined with any on-policy algorithm, is flexible to use, and improves the data utilization of the on-policy algorithm.
Drawings
Fig. 1: system training flow chart;
Fig. 2: algorithm flow chart of the invention;
Fig. 3: maneuver decision situation diagrams.
Detailed Description
The invention is described below with reference to the accompanying drawings, which further explain its constitution.
Example 1. An air combat maneuver strategy generation method based on example strategy constraints as shown in fig. 1 includes three stages:
stage one: example data acquisition
A human expert flies against a simple PID-controlled agent to generate flight trajectory data; that is, the expert strategy π_E interacts with the environment to produce "state-action-reward-next state" quadruples (s_t, a_t, s_{t+1}, r_t). An example data set D_E = {τ_1, τ_2, ..., τ_n} is built from these flight trajectories, where τ_n denotes the n-th flight trajectory.
The data set is used to constrain the behavior of the agent model during training.
In the quadruple, a_t is the stick and throttle control command that realizes control of the aircraft's attitude and position, and r_t is the reward function.
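As an illustration of the data layout, a sketch of how the quadruples and trajectories of D_E might be stored; the type and field names are hypothetical:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Transition:
    """One 'state-action-reward-next state' quadruple (s_t, a_t, s_{t+1}, r_t)."""
    state: np.ndarray       # s_t: own state plus relative situation
    action: np.ndarray      # a_t: stick and throttle commands
    next_state: np.ndarray  # s_{t+1}
    reward: float           # r_t

Trajectory = List[Transition]          # tau_n: one flight trajectory
ExampleDataset = List[Trajectory]      # D_E = {tau_1, ..., tau_n}
```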
Furthermore, the simple agent can control basic aircraft behaviors such as level flight, level turns, climbing and descending.
Further, the aerodynamic model of the aircraft controlled by the simple agent is a six-degree-of-freedom fixed-wing aircraft model that includes a PID-controlled stability augmentation system. The motion of the aircraft is controlled mainly by engine thrust, the elevator, the ailerons and the rudder; when a control surface deflects, the model changes the resultant force and moment on the aircraft according to the corresponding aerodynamic parameters. The maneuver-decision control network therefore outputs elevator, aileron, rudder and throttle commands to control the attitude and position of the aircraft.
Furthermore, the human expert and the simple agent are assumed to fight within visual range with airborne early-warning support, so the situation is transparent to both sides. The state s_t in the quadruple consists of the aircraft's own state and the relative situation of the two opposing sides. The own state comprises the following quantities:
φ, θ and ψ, which respectively denote the heading angle, pitch angle and roll angle, together with the pitch-angle rate and the current roll rate; h denotes the normalized altitude, and V denotes the normalized velocity vector in the NED coordinate system.
The relative situation comprises:
ΔV, the velocity-difference vector in the NED coordinate system; ΔX, the relative-position vector in the NED coordinate system; ATA, the azimuth angle; and AA, the target entry angle. The azimuth angle and the target entry angle are used to measure the angular advantage or disadvantage of the two sides.
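A sketch of how these relative-situation quantities could be computed from NED positions and velocities, assuming the common convention that ATA is the angle between our velocity vector and the line of sight to the target, and AA the angle between the target's velocity vector and the same line of sight; the patent text itself does not spell out these formulas:

```python
import numpy as np

def relative_situation(p_own, v_own, p_tgt, v_tgt):
    """Return (dV, dX, ATA, AA) in the NED frame.

    ATA: angle between own velocity and the line of sight to the target.
    AA : angle between target velocity and the same line of sight
         (a common 'target entry angle' convention, assumed here).
    """
    d_v = np.asarray(v_tgt) - np.asarray(v_own)   # velocity-difference vector
    d_x = np.asarray(p_tgt) - np.asarray(p_own)   # relative-position vector
    los = d_x / (np.linalg.norm(d_x) + 1e-8)      # unit line-of-sight vector

    def angle(u, w):
        c = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w) + 1e-8)
        return float(np.arccos(np.clip(c, -1.0, 1.0)))

    ata = angle(v_own, los)   # azimuth angle, radians
    aa = angle(v_tgt, los)    # target entry angle, radians
    return d_v, d_x, ata, aa
```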
Further, the reward function r_t of the invention is as follows:
r_t = η_A·r_t^A + η_R·r_t^R + η_E·r_t^E
The reward function is an important factor guiding algorithm convergence, and its design considers key air combat factors such as angular advantage, energy advantage and the aircraft's own stability. The relative Euclidean distance R is used to guide our fighter into launch conditions for short-range combat missiles or the aerial cannon.
The reward function r_t given above mainly considers attack positioning, which is composed of the angle advantage r_t^A, the relative-distance advantage r_t^R and the energy advantage r_t^E, where η_A, η_R and η_E respectively denote the weights of r_t^A, r_t^R and r_t^E.
In addition, penalty terms are introduced when the aircraft's flight altitude or speed falls below or above threshold values, preventing the maneuver decision from settling into local optima such as "dying quickly" and other erroneous choices.
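A minimal sketch of such a boundary penalty; the altitude and speed limits and the penalty magnitude are illustrative placeholders, as the patent only states that thresholds are used:

```python
def boundary_penalty(altitude_m: float, speed_mps: float,
                     alt_min: float = 1000.0, alt_max: float = 12000.0,
                     spd_min: float = 100.0, spd_max: float = 500.0,
                     penalty: float = -1.0) -> float:
    """Penalty added to r_t when altitude or speed leaves its allowed band.

    The band limits and the penalty magnitude are illustrative placeholders.
    """
    out_of_band = (altitude_m < alt_min or altitude_m > alt_max or
                   speed_mps < spd_min or speed_mps > spd_max)
    return penalty if out_of_band else 0.0
```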
stage two: agent model pre-training
The agent model is a fully connected neural network. Its parameters are initialized by behavior cloning: the example actions recorded in D_E are used as labels, and the agent policy π_θ is trained by supervised learning. The policy gradient is computed with a loss function L_bc(θ), and the network parameters are updated to obtain the pre-trained fully connected network parameters θ_0.
Stage three: online fine adjustment of intelligent body model parameters
The agent interacts with the environment to perform online reinforcement-learning fine-tuning. A replay experience pool is set up, denoted D_off = {(s_t, a_t, s_{t+1}, R_t)}. After each round ends, samples are drawn from D_off and the policy network is trained with a policy-gradient algorithm.
Further, the online reinforcement-learning fine-tuning adopts an Actor-Critic framework. The Actor network is the policy network π_θ(s_t), which outputs the action a_t from the current state s_t, with θ denoting the policy-network parameters; the Critic network is the value network, which outputs the estimated value V from the current state s_t, with its own set of value-network parameters.
Further, to reinforce advantageous strategy actions in the experience, an advantage function is computed, where (·)_+ = max(·, 0) and T is the end time of the round; that is, the gradient is computed only on the advantageous state-action samples.
Further, only example trajectories with high returns are selected for the constraint, and a filter is designed accordingly.
The algorithm loss function is then obtained, where H_π denotes the entropy regularization of the policy π_θ, used to improve the exploration capability of the strategy, and β and α are the coefficients of the value loss function and of the entropy term, respectively. The algorithm flow of the invention is shown in Fig. 2.
Because an agent driven by the pre-trained strategy accumulates compounding errors while interacting with the environment, updating the strategy purely by reinforcement learning can cause strategy drift, and in severe cases even continuous rolling, crashes and similar aircraft behaviors. Under such conditions, inappropriate settings of hyper-parameters such as the learning rate and the number of updates can also prevent the algorithm from converging. The example data should therefore be fully exploited to constrain the direction of agent strategy updates during training. In imitation learning, the generated strategy depends on the quality of the database and the agent explores the environment insufficiently; the method therefore combines the example strategy constraint with the reinforcement learning algorithm, improving both convergence and exploration.
Example 2. One application of the present method, in combination with the proximal policy optimization (PPO) algorithm, defines the advantage function according to the generalized advantage estimation (GAE) method:
δ_t = r_t + γ·V(s_{t+1}) - V(s_t),
γ and λ are two important parameters of the GAE function, where γ determines the upper bound of the value function and λ is used to balance variance and bias.
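For reference, a short sketch of the standard GAE recursion consistent with the δ_t definition above, computing A_t = Σ_l (γλ)^l · δ_{t+l} backwards over a round (the usual GAE convention, assumed here):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T) (one extra bootstrap value).
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t = sum_l (gamma * lam)^l * delta_{t+l}, accumulated backwards.
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    returns = adv + np.asarray(values[:T])   # regression targets for the value network
    return adv, returns
```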
The PPO algorithm limits the magnitude of the strategy update by clipping the probability ratio, thereby reducing fluctuation of the objective function. The policy loss and value loss functions are as follows, where R_t is the return and
c_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t)
denotes the probability ratio between the current and the old policy.
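A sketch of the standard clipped PPO surrogate and value losses consistent with the ratio c_t(θ) above; the clipping range ε = 0.2 is a common default rather than a value stated in the patent:

```python
import torch

def ppo_losses(log_prob_new, log_prob_old, advantages, value_pred, returns,
               clip_eps: float = 0.2):
    """Clipped PPO policy loss and value loss.

    c_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), evaluated here
    from log-probabilities; all inputs are assumed to be PyTorch tensors.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)                 # c_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    value_loss = torch.nn.functional.mse_loss(value_pred, returns)
    return policy_loss, value_loss
```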
To improve resource utilization, several distributed data-collection agents (workers) and a central learning agent (learner) are run in parallel simulation. Each worker interacts with the environment and stores its quadruple trajectory data in its own round experience pool. After a round ends, the data of each round experience pool are stored in the replay experience pool; the replay experience pool is sampled, the data are split into mini-batches, gradients are computed with respect to the objective function L_ppo and returned, and the learner accumulates the gradients and updates the policy-network and value-network parameters. Before the next round starts, the learner distributes the updated network parameters to each worker, and the workers then sample with the new policy.
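A schematic sketch of the worker/learner data flow just described; `workers`, `learner` and `replay_pool` are assumed objects with hypothetical methods, shown only to mirror the cycle above:

```python
import random

def training_iteration(workers, learner, replay_pool, minibatch_size=256):
    """One round of the distributed collection/update cycle described above.

    The method names (collect_round, accumulate_gradients, apply_gradients,
    set_policy_params) are placeholders, not an API defined by the patent.
    """
    # 1. Each worker rolls out one round and its round experience pool is
    #    emptied into the shared replay experience pool.
    for w in workers:
        replay_pool.extend(w.collect_round())

    # 2. Sample the replay pool, split into mini-batches, accumulate gradients of L_ppo.
    random.shuffle(replay_pool)
    for i in range(0, len(replay_pool), minibatch_size):
        batch = replay_pool[i:i + minibatch_size]
        learner.accumulate_gradients(batch)

    # 3. Update policy/value parameters and broadcast them to the workers.
    new_params = learner.apply_gradients()
    for w in workers:
        w.set_policy_params(new_params)
```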
The round experience pool holds the data stored by each agent in the distributed simulation; after a round ends, these data are moved into the replay experience pool. The end of one engagement is recorded as a round.
The simulation environment is built on the OpenAI Gym platform, aircraft dynamics and kinematics are computed on the open-source JSBSim platform, and the aircraft aerodynamic model is the published F-16 aerodynamic model. The simulation step of the aircraft state, i.e. the agent's decision interval, is 20 milliseconds, and each round lasts at most 5 minutes.
Fig. 3 (a) shows simulated situations in which the blue-side agent attacks a simple red-side target, starting from head-on and from same-direction geometries. In the head-on case, the agent banks into a turn, leads the pursuit, cuts inside the opponent's turn circle and attacks; in the same-direction case, the agent starts at a disadvantage, uses a half-loop maneuver to enter the opponent's rear hemisphere and then tracks it.
Fig. 3 (b) shows self-play of the agent with the same strategy on both sides. In the left diagram the two sides start from identical initial states with opposite headings and enter a one-circle fight; in the right diagram the blue side initially holds a certain altitude advantage, and once the two sides enter the one-circle fight the red side judges itself to be at a disadvantage, quickly banks to disengage, while the blue side remains in the red side's rear hemisphere and pursues.

Claims (8)

1. An air combat maneuver strategy generation method based on example strategy constraint, characterized by comprising three stages:
stage one: example data acquisition
A human expert flies against a simple PID-controlled agent to generate flight trajectory data; that is, the expert strategy π_E interacts with the environment to produce "state-action-reward-next state" quadruples (s_t, a_t, s_{t+1}, r_t). An example data set D_E = {τ_1, τ_2, ..., τ_n} is built from these flight trajectories, where τ_n denotes the n-th flight trajectory.
The data set is used to constrain the behavior of the agent model during training.
In the quadruple, a_t is the stick and throttle control command that realizes control of the aircraft's attitude and position, and r_t is the reward function;
stage two: agent model pre-training
The agent model is a fully connected neural network. Its parameters are initialized by behavior cloning: the example actions recorded in D_E are used as labels, and the agent policy π_θ is trained by supervised learning. The policy gradient is computed with a loss function L_bc(θ), and the network parameters are updated to obtain the pre-trained fully connected network parameters θ_0;
Stage three: online fine adjustment of intelligent body model parameters
The agent strategy interacts with the environment to perform online reinforcement-learning fine-tuning. A replay experience pool is set up, denoted D_off = {(s_t, a_t, s_{t+1}, R_t)}. After each round ends, samples are drawn from D_off and the policy network is trained with a policy-gradient algorithm.
2. The air combat maneuver strategy generation method based on example strategy constraint as claimed in claim 1, wherein: in stage one, the simple agent can control basic aircraft behaviors such as level flight, level turns, climbing and descending.
3. The air combat maneuver strategy generation method based on example strategy constraint according to claim 1 or 2, characterized in that: in stage one, the aerodynamic model of the aircraft controlled by the simple agent is a six-degree-of-freedom fixed-wing aircraft model that includes a PID-controlled stability augmentation system.
4. The air combat maneuver strategy generation method based on example strategy constraint as claimed in claim 3, wherein: in stage one, the state s_t in the quadruple consists of the aircraft's own state and the relative situation of the two opposing sides. The own state comprises the following quantities:
φ, θ and ψ, which respectively denote the heading angle, pitch angle and roll angle, together with the pitch-angle rate and the current roll rate; h denotes the normalized altitude, and V denotes the normalized velocity vector in the NED coordinate system.
The relative situation comprises:
ΔV, the velocity-difference vector in the NED coordinate system; ΔX, the relative-position vector in the NED coordinate system; ATA, the azimuth angle; and AA, the target entry angle. The azimuth angle and the target entry angle are used to measure the angular advantage or disadvantage of the two sides.
5. The air combat maneuver strategy generation method based on example strategy constraint according to claim 4, characterized in that: in stage one, the reward function r_t of the invention is as follows:
r_t = η_A·r_t^A + η_R·r_t^R + η_E·r_t^E
where the reward function is composed of the angle advantage r_t^A, the relative-distance advantage r_t^R and the energy advantage r_t^E, and η_A, η_R and η_E respectively denote the weights of r_t^A, r_t^R and r_t^E.
6. The air combat maneuver strategy generation method based on example strategy constraint according to claim 5, wherein: the online reinforcement-learning fine-tuning in stage three adopts an Actor-Critic framework, in which the Actor network is the policy network π_θ(s_t) that outputs the action a_t from the current state s_t, with θ denoting the policy-network parameters; the Critic network is the value network, which outputs the estimated value V from the current state s_t, with its own set of value-network parameters.
7. The air combat maneuver strategy generation method based on example strategy constraint according to claim 6, wherein: in stage three, an advantage function is computed, where (·)_+ = max(·, 0) and T is the end time of the round; that is, the gradient is computed only on the advantageous state-action samples.
8. The air combat maneuver strategy generation method based on example strategy constraint according to claim 7, wherein: in stage three, only example trajectories with high returns are selected for the constraint, and a filter is designed accordingly; the algorithm loss function is then obtained, where H_π denotes the entropy of the policy π_θ, used to improve the exploration capability of the strategy, and β and α are the coefficients of the value loss function and of the entropy term, respectively.
CN202310529870.7A 2023-05-11 2023-05-11 Air combat maneuver strategy generation method based on example strategy constraint Active CN116796505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310529870.7A CN116796505B (en) 2023-05-11 2023-05-11 Air combat maneuver strategy generation method based on example strategy constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310529870.7A CN116796505B (en) 2023-05-11 2023-05-11 Air combat maneuver strategy generation method based on example strategy constraint

Publications (2)

Publication Number Publication Date
CN116796505A CN116796505A (en) 2023-09-22
CN116796505B true CN116796505B (en) 2024-02-20

Family

ID=88040603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310529870.7A Active CN116796505B (en) 2023-05-11 2023-05-11 Air combat maneuver strategy generation method based on example strategy constraint

Country Status (1)

Country Link
CN (1) CN116796505B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115291625A (en) * 2022-07-15 2022-11-04 同济大学 Multi-unmanned aerial vehicle air combat decision method based on multi-agent layered reinforcement learning
CN115373411A (en) * 2022-05-31 2022-11-22 中国航空工业集团公司沈阳飞机设计研究所 Decision-making method and system for airplane autopilot control strategy
CN115755956A (en) * 2022-11-03 2023-03-07 南京航空航天大学 Unmanned aerial vehicle maneuver decision method and system driven by knowledge and data in cooperation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095481B (en) * 2021-04-03 2024-02-02 西北工业大学 Air combat maneuver method based on parallel self-game

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115373411A (en) * 2022-05-31 2022-11-22 中国航空工业集团公司沈阳飞机设计研究所 Decision-making method and system for airplane autopilot control strategy
CN115291625A (en) * 2022-07-15 2022-11-04 同济大学 Multi-unmanned aerial vehicle air combat decision method based on multi-agent layered reinforcement learning
CN115755956A (en) * 2022-11-03 2023-03-07 南京航空航天大学 Unmanned aerial vehicle maneuver decision method and system driven by knowledge and data in cooperation

Also Published As

Publication number Publication date
CN116796505A (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN108319286B (en) Unmanned aerial vehicle air combat maneuver decision method based on reinforcement learning
CN110806756B (en) Unmanned aerial vehicle autonomous guidance control method based on DDPG
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
Waldock et al. Learning to perform a perched landing on the ground using deep reinforcement learning
CN114330115B (en) Neural network air combat maneuver decision-making method based on particle swarm search
CN113095481A (en) Air combat maneuver method based on parallel self-game
CN113282061A (en) Unmanned aerial vehicle air game countermeasure solving method based on course learning
Ruan et al. Autonomous maneuver decisions via transfer learning pigeon-inspired optimization for UCAVs in dogfight engagements
CN113671825B (en) Maneuvering intelligent decision-avoiding missile method based on reinforcement learning
CN113962012A (en) Unmanned aerial vehicle countermeasure strategy optimization method and device
CN116820134A (en) Unmanned aerial vehicle formation maintaining control method based on deep reinforcement learning
CN114637312B (en) Unmanned aerial vehicle energy-saving flight control method and system based on intelligent deformation decision
Xianyong et al. Research on maneuvering decision algorithm based on improved deep deterministic policy gradient
Zhu et al. Multi-constrained intelligent gliding guidance via optimal control and DQN
CN114237268A (en) Unmanned aerial vehicle strong robust attitude control method based on deep reinforcement learning
CN116796505B (en) Air combat maneuver strategy generation method based on example strategy constraint
CN116796843A (en) Unmanned aerial vehicle many-to-many chase game method based on PSO-M3DDPG
CN116774731A (en) Unmanned aerial vehicle formation path planning method based on reinforcement learning
CN116661493A (en) Deep reinforcement learning-based aerial tanker control strategy method
CN116697829A (en) Rocket landing guidance method and system based on deep reinforcement learning
Guo et al. Maneuver decision of UAV in air combat based on deterministic policy gradient
CN113377122B (en) Adaptive control method for switching of motor-driven variant aircraft capable of perching
CN116011315A (en) Missile escape area fast calculation method based on K-sparse self-coding SVM
Ma et al. Strategy generation based on reinforcement learning with deep deterministic policy gradient for UCAV
CN117970952B (en) Unmanned aerial vehicle maneuver strategy offline modeling method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant