CN116661493A - Deep reinforcement learning-based aerial tanker control strategy method - Google Patents

Deep reinforcement learning-based aerial tanker control strategy method

Info

Publication number
CN116661493A
CN116661493A CN202310604940.0A CN202310604940A CN116661493A CN 116661493 A CN116661493 A CN 116661493A CN 202310604940 A CN202310604940 A CN 202310604940A CN 116661493 A CN116661493 A CN 116661493A
Authority
CN
China
Prior art keywords
reward
strategy
data
aircraft
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310604940.0A
Other languages
Chinese (zh)
Inventor
马先龙
呼卫军
孟中杰
霍星宇
全家乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202310604940.0A priority Critical patent/CN116661493A/en
Publication of CN116661493A publication Critical patent/CN116661493A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/106Change initiated in response to external conditions, e.g. avoidance of elevated terrain or of no-fly zones
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to the field of aerial refueling, in particular to an aerial tanker control strategy method based on deep reinforcement learning. The initial state of the aircraft is defined as the current state; the aircraft state information and an initial control strategy are input into an aircraft model and the next state information is calculated; a reward value is computed from the next state information, and if the reward is larger than a set reward threshold the training is complete, otherwise the method proceeds to step four. A conditional distribution model is defined through a Bayesian neural network model, so that environmental uncertainty is introduced into the control decision sequence and the control performance of the decision model under uncertainty is improved. Multiple agents are replicated to form multiple collection workers that gather diverse data simultaneously from different initial states, and the action strategy of the aircraft is then comprehensively updated using the multiple groups of data. The designed reward and penalty mechanism better matches the way humans explore and solve problems.

Description

Deep reinforcement learning-based aerial tanker control strategy method
Technical Field
The invention relates to the technical field of aerial refueling, in particular to an aerial tanker control strategy method based on deep reinforcement learning.
Background
Modern aerial tankers and aerial refueling technology have become an important means of enhancing the reach and strike capability of air forces. As a force multiplier in new combat modes, aerial refueling can significantly extend the endurance of fighter aircraft and plays an extremely important supporting role in both strategic and tactical aviation. Unmanned aerial refueling, with its low-cost, high cost-effectiveness mode of operation, is becoming an important component of intelligent unmanned warfare, enabling combat units to carry out long-range missions in complex scenarios. Advanced research institutions such as the NASA Dryden Flight Research Center (DFRC) and the U.S. Air Force Institute of Technology, together with several aerospace companies, have launched unmanned aerial refueling technology and system development projects.
China's unmanned aerial refueling projects started relatively late. Although domestically developed UAV series such as the Xiaolong and Rainbow have driven research on the related technologies, stable tracking of the tanker's altitude and speed during aerial refueling has not yet been achieved.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an aerial tanker control strategy method based on deep reinforcement learning.
The technical scheme adopted to solve the technical problem is as follows: an aerial tanker control strategy method based on deep reinforcement learning, comprising the following steps. Step one: defining an initial state of the aircraft as the current state;
step two: inputting the aircraft state information and an initial control strategy into an aircraft model, and calculating the next state information;
step three: calculating a reward value from the next state information; if the reward is larger than the set reward threshold, training is finished; otherwise, the method proceeds to step four;
step four: collecting state information, strategy information and reward information and storing them in an experience pool;
step five: running the distributed proximal policy optimization to output a policy gradient;
step six: calculating the gradient descent of the strategy;
step seven: updating the neural network using gradient descent;
step eight: outputting the action strategy from the neural network and returning to step two (a toy sketch of this loop is given below).
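As a concrete illustration of the eight steps above, the following Python sketch runs the loop on a toy two-variable aircraft model. The dynamics, reward function and parameter-update rule are simplified placeholders assumed for illustration only; they are not the six-degree-of-freedom model or the proximal policy optimization update of the invention.

```python
# Toy sketch of the eight-step training loop (steps labelled in comments).
# The dynamics, reward and update rule are placeholders, not the patent's model or PPO.
import numpy as np

rng = np.random.default_rng(0)

def aircraft_step(state, action):
    """Toy longitudinal model: state = [altitude error, speed error]."""
    return state + 0.1 * action + 0.01 * rng.standard_normal(2)

def reward_fn(next_state):
    """Higher reward as the altitude/speed errors shrink."""
    return 1.0 - np.abs(next_state).sum()

def train(theta, reward_threshold=0.9, iters=5000, lr=0.01):
    state = np.array([5.0, 5.0])                  # step one: initial state = current state
    experience_pool = []                          # step four: experience pool
    for _ in range(iters):
        action = -theta * state                   # step eight: policy outputs the action
        next_state = aircraft_step(state, action) # step two: aircraft model rollout
        r = reward_fn(next_state)                 # step three: reward from next state
        if r > reward_threshold:                  # step three: threshold reached, stop
            break
        experience_pool.append((state, action, r, next_state))  # step four: store data
        grad = np.abs(state)                      # steps five/six: crude reward-ascent gradient
        theta = np.clip(theta + lr * grad, 0.0, 8.0)  # step seven: update parameters
        state = next_state                        # step eight: return to step two
    return theta

print(train(theta=np.array([0.1, 0.1])))
```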
Specifically, the specific flow of step two is as follows: once a large deviation arises between the environment model and the real environment, the training data generated by the interaction between the decision model and the environment model become erroneous, and the decision model in turn becomes erroneous; environmental uncertainty is therefore introduced into the control decision sequence through a Bayesian neural network, improving the control performance of the decision model under uncertainty.
Specifically, the specific flow of step three is as follows: designing a constant-altitude cruise reward function, whose specific components are a position reward-penalty function and an attitude reward-penalty function.
Specifically, the specific flow of step four is as follows: interacting with the environment; sampling and collecting training samples; calculating the gradient of the model and storing it; and transmitting the current gradient to the global update gradient.
Specifically, when running the distributed proximal policy optimization to output the policy gradient, a classical reinforcement learning algorithm uses only a single agent interacting with the environment to generate training data.
Specifically, in step seven, after all sampling workers complete gradient transmission, the global gradient is used to update the model parameters; this process repeatedly and fully exploits the currently collected data, and the model fully understands the data by exploiting the generalization of the neural network.
Specifically, in realizing cruise at a specified altitude and a specified speed, the position reward-penalty function is designed according to the error allowed by the actual task; when the altitude and speed errors are small, the requirements of constant altitude and constant speed are met. In designing constant-altitude, constant-speed cruise, the attitude reward-penalty function does not design a desired trajectory but constrains the trajectory through the attitude angles by giving desired attitude angles; a certain penalty is applied to the angles, the effect of the angular acceleration is also considered, and the angular acceleration is kept at zero when the angles are unchanged.
A method of using the deep reinforcement learning-based aerial tanker control strategy:
the first step: first defining the initial state of the aircraft as the current state, inputting the aircraft state information and the initial control strategy into the aircraft model and calculating the next state information, then calculating a reward value from the next state information; if the reward is larger than the set reward threshold, training is finished; if not, step four begins collecting data and accumulating experience;
the second step: running the distributed proximal policy optimization to output the policy gradient, calculating the gradient descent of the strategy, and updating the neural network with gradient descent;
the third step: outputting the action strategy through the neural network, returning to the second step, and repeating this mechanism until the reward function reaches the standard, thereby completing training.
The invention has the beneficial effects that: when the deep reinforcement learning-based aerial tanker control strategy method is used, a conditional distribution model is defined through a Bayesian neural network model, environmental uncertainty is introduced into the control decision sequence, and the control performance of the decision model under uncertainty is improved; multiple agents are replicated to form multiple collection workers that gather diverse data simultaneously from different initial states, after which the action strategy of the aircraft is comprehensively updated using the multiple groups of data; the designed reward and penalty mechanism better matches the way humans explore and solve problems, guidance information can be given in real time during agent learning, and training efficiency can be improved.
Drawings
The invention will be further described with reference to the drawings and examples.
FIG. 1 is a schematic diagram of the overall system operation of the deep reinforcement learning-based aerial tanker control strategy method provided by the invention;
FIG. 2 is a schematic diagram of the distributed framework of the deep reinforcement learning-based aerial tanker control strategy method provided by the invention;
FIG. 3 is a schematic diagram of data acquisition of the deep reinforcement learning-based aerial tanker control strategy method provided by the invention.
Detailed Description
The invention is further described below in connection with specific embodiments, so that the technical means, creative features, objectives and effects of the invention are easy to understand.
As shown in figs. 1 to 3, the deep reinforcement learning-based aerial tanker control strategy method according to the present invention includes: step one: defining an initial state of the aircraft as the current state;
step two: inputting the aircraft state information and an initial control strategy into an aircraft model, and calculating the next state information;
step three: calculating a reward value from the next state information; if the reward is larger than the set reward threshold, training is finished; otherwise, the method proceeds to step four;
step four: collecting state information, strategy information and reward information and storing them in an experience pool;
step five: running the distributed proximal policy optimization to output a policy gradient;
step six: calculating the gradient descent of the strategy;
step seven: updating the neural network using gradient descent;
step eight: outputting the action strategy from the neural network and returning to step two.
The specific flow of step two is as follows: once a large deviation arises between the environment model and the real environment, the training data generated by the interaction between the decision model and the environment model become erroneous, and the decision model in turn becomes erroneous; environmental uncertainty is therefore introduced into the control decision sequence through a Bayesian neural network, improving the control performance of the decision model under uncertainty. The Bayesian neural network model can be defined as a conditional distribution model P(y|x, w), where x is the model input data and w is the neural network weight vector; the network may introduce bias terms to expand the decision plane and fit the data better. In the fitting and regression problem, each weight parameter of the Bayesian neural network is no longer a fixed value but a Gaussian distribution described by two trainable parameters σ and μ; unlike a conventional deterministic-weight network, Bayesian estimation finds the posterior distribution P(w|D) of w rather than only its argmax value. This converts the data characteristics into a probability model, and the network layers introduce uncertainty into the predictions of the neural network. The invention uses a cross-entropy loss function to improve the convergence speed of the DNN algorithm. For each batch of sample data, the following quantity is minimized:
J(W, b) = -(1/m) Σ_x [ y · ln a^L + (1 - y) · ln(1 - a^L) ]
where a^L is the output-layer activation and y is the label, both vectors whose dimension equals the sample batch size m.
The cross-entropy loss function for each individual sample has the form:
J(W, b, a, y) = -y · ln a - (1 - y) · ln(1 - a)
where · is the vector inner product; this form is generally applicable to logistic regression problems. When the cross-entropy loss function is used, the gradient of the network output layer δ^L is:
δ^L = a^L - y
Here the gradient term contains no derivative of the activation function, which avoids slow gradient updates. Moreover, a^L - y represents the error between the actual value and the predicted value: the larger the error, the faster the gradient update and the faster the network converges.
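The Gaussian-weight construction described above can be illustrated with the following minimal Bayesian linear layer, written here with PyTorch as an assumed framework; the layer sizes, the softplus parameterisation of σ, and the two-layer network are illustrative choices, not details specified by the invention.

```python
# Minimal sketch of a Bayesian linear layer with Gaussian weights w = mu + sigma * eps
# (reparameterisation). PyTorch is assumed; sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # Trainable mean and (pre-softplus) standard deviation for every weight and bias.
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -3.0))

    def forward(self, x):
        w_sigma = F.softplus(self.w_rho)          # keep sigma positive
        b_sigma = F.softplus(self.b_rho)
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)   # sample weights each pass
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        return F.linear(x, w, b)

# Usage: a small network whose outputs reflect weight uncertainty across forward passes.
net = nn.Sequential(BayesianLinear(11, 64), nn.Tanh(), BayesianLinear(64, 1))
x = torch.randn(32, 11)                            # batch of 11-dimensional states
y = torch.sigmoid(net(x))                          # predictions vary between passes
loss = F.binary_cross_entropy(y, torch.rand(32, 1))  # cross-entropy loss, as in the text
loss.backward()
```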
The specific flow of step three is as follows. Step 3.1: designing the constant-altitude cruise reward function. The specific flow of step 3.1 is: 1. Position reward-penalty function: in realizing cruise at the specified altitude and speed, the altitude error and speed error are expected to approach 0. The design follows the error allowed by the actual task: when the altitude and speed errors are no larger than 1, the requirements of constant altitude and constant speed are met. Therefore, when designing the reward value, a positive reward is given when the altitude error lies between -1 m and 1 m and when the speed error lies between -1 m/s and 1 m/s, and a negative reward is given outside these ranges; combining this with the logarithmic function above, the reward function is designed as follows:
where h_err and v_err are the altitude error and the speed error, respectively.
In realizing constant-altitude, constant-speed endurance, the climb rate and the longitudinal acceleration of the UAV should be kept at 0 to guarantee constant speed and constant altitude; on this basis, a reward function for the altitude rate and a reward function for the speed rate are designed as follows:
where h_dot and v_dot are the derivative of the altitude and the derivative of the speed, respectively.
The corresponding position reward-penalty function is:
r_pos=r_h+r_v+r_hv+r_vv
2. Attitude reward-penalty function: in designing constant-altitude, constant-speed cruise, no desired trajectory is designed; instead, the trajectory is constrained through the attitude angles by giving desired attitude angles. A certain penalty is applied to the angles, the effect of the angular acceleration is also considered, and the angular acceleration is kept at 0 when the angles are unchanged. The reward function is therefore designed as follows:
the corresponding pose reward and punishment functions are as follows:
r_ang=r_phi+r_psi+r_P+r_R
For constant-altitude, constant-speed cruise a total reward function is set by fusing the position reward-penalty function and the attitude reward-penalty function; during fusion each individual reward term is amplitude-limited to control the overall magnitude of the reward function. The total reward function is:
R=r_pos+r_ang
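A minimal sketch of this hierarchical reward fusion is given below. The ±1 m and ±1 m/s bands follow the text above, while the penalty shapes and clipping bounds are illustrative assumptions; the invention's exact logarithmic reward terms are not reproduced.

```python
# Sketch of the hierarchical reward: position terms (altitude/speed errors and their rates)
# plus attitude terms, each amplitude-limited before fusion. Bounds are illustrative.
import numpy as np

def position_reward(h_err, v_err, h_dot, v_dot):
    r_h = 1.0 if abs(h_err) <= 1.0 else -min(abs(h_err), 5.0)    # +/-1 m band -> positive reward
    r_v = 1.0 if abs(v_err) <= 1.0 else -min(abs(v_err), 5.0)    # +/-1 m/s band -> positive reward
    r_hv = -min(abs(h_dot), 2.0)                                 # climb rate should stay near 0
    r_vv = -min(abs(v_dot), 2.0)                                 # acceleration should stay near 0
    return r_h + r_v + r_hv + r_vv                               # r_pos

def attitude_reward(phi_err, psi_err, p_rate, r_rate):
    r_phi = -min(abs(phi_err), 2.0)       # penalty on roll-angle error
    r_psi = -min(abs(psi_err), 2.0)       # penalty on yaw-angle error
    r_P = -min(abs(p_rate), 1.0)          # angular rates should stay near 0
    r_R = -min(abs(r_rate), 1.0)
    return r_phi + r_psi + r_P + r_R      # r_ang

def total_reward(state):
    r_pos = position_reward(state["h_err"], state["v_err"], state["h_dot"], state["v_dot"])
    r_ang = attitude_reward(state["phi_err"], state["psi_err"], state["p"], state["r"])
    return np.clip(r_pos + r_ang, -20.0, 10.0)   # R = r_pos + r_ang, amplitude-limited
```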
the specific flow of the fourth step is as follows: 1. interacting with the environment; 2. sampling and collecting training samples; 3. calculating the gradient of the model and storing the gradient; 4. the current gradient is transmitted to the global updating gradient, which is different from a distributed training neural network, and when the reinforcement learning algorithm is trained by using a distributed technology, different sampling workers acquire different data time, which causes time loss and slows down the whole updating time. Therefore, the invention introduces the preemption threshold parameter P, (Preemption Threshold), namely when P% in all the works are completed by 1-2 steps, the part of sampling workers participate in the iterative updating, and the rest of sampling workers can forcedly complete the 1 st step and wait for the next gradient transfer.
The proximal policy optimization algorithm is one of the mainstream algorithms currently accepted in the industry. It has few adjustable parameters, and its dual-network design guarantees the stability of policy updates. An ordinary policy gradient algorithm is an on-policy algorithm: the same network is used both for sampling the agent's actions and for the optimization update. Once the agent selects poor data during the iterative interaction, the updated policy oscillates strongly, which easily causes the network to diverge. To address this, the proximal policy optimization algorithm provides an off-policy method: two networks of identical form are constructed as policy output networks, one used to sample actions and the other to update the weight parameters; after each training step, the two networks copy the weights. When the weights are updated, proximal policy optimization adopts importance sampling to limit the speed of the algorithm update and to avoid too large a difference between the policies of two successive updates. The traditional policy gradient update formula is:
∇R(θ) = E_{(s_t,a_t)~π_θ}[ A^θ(s_t,a_t) · ∇ log π_θ(a_t|s_t) ]
the near-end policy optimization offline policy update formula is:
since the probability of a state action pair is the product of the probability of the agent selecting the current state and the probability of the action to be performed in the current state, the probability of a state action pair can be expressed as:
the carrying-in can be obtained:
in general, the same state s t Probability p selected under behavioural policy and target policy θ′ (s t ) And p θ (s t ) Is equal, so the above formula can be simplified as:
as a result of:
the objective function is then available as:
to limit the too large distribution difference between the learning strategy and the sampling strategy, a regularization term needs to be added to the loss function. The near-end policy optimization algorithm typically has two schemes, a KL divergence scheme and a CLIP clipping scheme. The CLIP shearing method used in the algorithm has good display effect in the learning process.
The KL dispersion scheme is simple and clear: the variance of the KL distribution of the state actions of the two strategies is calculated and weighted, this penalty is added to the loss function, while the CLIP clipping scheme is more complex: and cutting the importance sampling coefficient by using the e-hyper parameter, and limiting over-training under the condition of large distribution difference.
Wherein the method comprises the steps ofAs a dominant function.
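The CLIP scheme corresponds to the standard clipped surrogate loss, sketched below with PyTorch as an assumed framework; the tensor shapes and the default ε = 0.2 are illustrative.

```python
# Sketch of the clipped surrogate objective: the importance ratio is clipped with epsilon.
import torch

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """logp_new / logp_old: log pi_theta(a|s) under the learning and sampling policies."""
    ratio = torch.exp(logp_new - logp_old)                    # pi_theta / pi_theta'
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()              # maximise surrogate -> minimise loss

# Usage with dummy data:
loss = ppo_clip_loss(torch.randn(64), torch.randn(64), torch.randn(64))
```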
Within the proximal policy optimization framework there are two ways for the neural network to output the action parameters: a SoftMax activation function can discretize the policy output, which suits control problems with discrete actions; for the continuous parameter space of the UAV problem, proximal policy optimization instead constructs a Gaussian distribution to output a continuous policy. A general Gaussian policy can be written as:
π_θ = μ_θ + ε,  ε ~ N(0, σ²)
where μ_θ is the deterministic part and ε is zero-mean Gaussian random noise.
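A Gaussian policy head of this form can be sketched as follows, again assuming PyTorch; the 11-dimensional state and 4 control outputs anticipate the state and action spaces defined later, and the network sizes are illustrative.

```python
# Sketch of a Gaussian policy head: the network outputs the mean mu_theta, and actions
# are sampled as mu_theta + sigma * eps with eps ~ N(0, 1).
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim=11, action_dim=4):
        super().__init__()
        self.mu_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                    nn.Linear(64, action_dim))
        self.log_sigma = nn.Parameter(torch.zeros(action_dim))   # trainable standard deviation

    def forward(self, state):
        mu = self.mu_net(state)                                   # deterministic part mu_theta
        dist = torch.distributions.Normal(mu, self.log_sigma.exp())
        action = dist.rsample()                                   # mu_theta + sigma * eps
        return action, dist.log_prob(action).sum(-1)              # log pi_theta(a|s)

policy = GaussianPolicy()
a, logp = policy(torch.randn(11))
```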
In running the distributed proximal policy optimization to output the policy gradient, a classical reinforcement learning algorithm generates training data through the interaction of a single agent with the environment; here, multiple agents are replicated to form multiple collection workers, diverse data are collected simultaneously from different initial states, and the action strategy of the aircraft is then comprehensively updated using the multiple groups of data.
In step seven, after all sampling workers complete gradient transmission, the global gradient is used to update the model parameters; this process repeatedly and fully exploits the currently collected data, and the model fully understands the data by exploiting the generalization of the neural network.
The position reward-penalty function is designed according to the error allowed by the actual task in realizing the cruise process at the specified altitude and speed; the altitude error and speed error are expected to approach zero, and the requirements of constant altitude and constant speed are met when the altitude and speed errors are small. In designing constant-altitude, constant-speed cruise, the attitude reward-penalty function does not design a desired trajectory but constrains the trajectory through the attitude angles by giving desired attitude angles; a certain penalty is applied to the angles, the effect of the angular acceleration is also considered, and the angular acceleration is kept at zero when the angles are unchanged.
A method of using the deep reinforcement learning-based aerial tanker control strategy:
the first step: first defining the initial state of the aircraft as the current state, inputting the aircraft state information and the initial control strategy into the aircraft model and calculating the next state information, then calculating a reward value from the next state information; if the reward is larger than the set reward threshold, training is finished; if not, step four begins collecting data and accumulating experience;
the second step: running the distributed proximal policy optimization to output the policy gradient, calculating the gradient descent of the strategy, and updating the neural network with gradient descent;
the third step: outputting the action strategy through the neural network, returning to the second step, and repeating this mechanism until the reward function reaches the standard, thereby completing training.
Specific examples: solving the guidance control integrated problem by using the reinforcement learning idea requires converting the original problem into the reinforcement learning framework. First set up to{S t ,A t ,R t ,S t+1 Markov decision process of tuples: environmental model S at time t t The process state can be characterized as:
S t ={h_err t ,v_errr t ,angle t ,phi t ,theta t ,psi t ,alpha t ,beta t ,wx t ,wy t ,wz t }
(3-1)
including altitude error h_err, speed error v_err, azimuth, roll, pitch, yaw, angle of attack, yaw angle, and current three angular velocities.
Introducing the aircraft control problem into the Markov decision chain, the behaviour A of the agent is defined as the throttle δ_th, elevator δ_e, aileron δ_a and rudder δ_r.
In state S_t, the control surface commands are applied and the agent interacts with the environment; the state at the next time, S_{t+1}, is output by the environment model and characterized as:
S_{t+1} = {h_err_{t+1}, v_err_{t+1}, angle_{t+1}, phi_{t+1}, theta_{t+1}, psi_{t+1}, alpha_{t+1}, beta_{t+1}, wx_{t+1}, wy_{t+1}, wz_{t+1}}
The instantaneous reward value R_t is constructed from the state information and the expected information at that time. The {S_t, A_t, R_t, S_{t+1}} tuple above converts the integrated aircraft guidance and control problem into a Markov decision process.
Applying the deep reinforcement learning framework to a specific problem requires defining the parameter spaces of that problem, mainly the state parameter space and the action parameter space. The neural network model outputs action parameters from the input state information to realize data interaction with the flight environment; the state parameter space and the action parameter space are the input and output quantities of the neural network, respectively.
State space design: for the control problem of the fixed-wing UAV, the altitude error, speed error, azimuth angle error, angle of attack, sideslip angle, pitch angle, roll angle, the three angular velocities and the longitudinal velocity components are selected as the state parameters passed into the neural network for fitting. If raw parameters were selected as the network inputs, no command information could be passed into the network; therefore the expected altitude and expected speed are set, the altitude and speed at the current moment are differenced with them to obtain the altitude error and speed error, and these errors are passed into the neural network. This improves generalization: changing the expected altitude and speed does not affect the choice of the neural network parameters, and suitable control surface actions for the fixed-wing UAV can be fitted from the errors. Using the error quantities as data inputs introduces feedback into the neural network, turning the open-loop neural network control into closed-loop control.
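A sketch of this error-based state construction is given below; the dictionary keys and reference values are illustrative names, the essential point being that the expected altitude and speed are differenced with the current values before the vector is fed to the network.

```python
# Sketch of error-based state construction: the network sees errors, not raw commands.
import numpy as np

def build_state(obs, h_ref, v_ref):
    """obs: dict of raw aircraft quantities; h_ref, v_ref: expected altitude/speed."""
    return np.array([
        obs["h"] - h_ref,          # altitude error (feedback term)
        obs["v"] - v_ref,          # speed error (feedback term)
        obs["angle"],              # azimuth angle error
        obs["phi"], obs["theta"], obs["psi"],   # roll, pitch, yaw
        obs["alpha"], obs["beta"],              # angle of attack, sideslip angle
        obs["wx"], obs["wy"], obs["wz"],        # three angular velocities
    ], dtype=np.float32)
```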
Action space design: the action parameter space is the set of possible values of each actuator within its value range. For the fixed-wing UAV designed here, the action parameter space is defined as the output of the neural network, including the throttle, elevator, aileron and rudder. According to their actual physical meaning, the three control surfaces are amplitude-limited, each with a minimum and a maximum deflection angle, as sketched below;
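The amplitude limiting can be sketched as a simple clipping step; the numeric deflection limits below are illustrative assumptions, not values specified in the patent.

```python
# Sketch of actuator amplitude limits: each network output is clipped to its deflection range.
import numpy as np

ACTION_LOW = np.array([0.0, -25.0, -21.5, -30.0])    # throttle, elevator, aileron, rudder
ACTION_HIGH = np.array([1.0, 25.0, 21.5, 30.0])

def limit_action(raw_action):
    return np.clip(raw_action, ACTION_LOW, ACTION_HIGH)
```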
it should be noted that, there is a pneumatic rudder in the fixed wing unmanned aerial vehicle actuating mechanism, which is calculated by the steering mechanism parameters, to stabilize the supermaneuver. The invention takes the executive rudder as an intrinsic model parameter of the aircraft, and does not participate in the training process.
Parameter settings in the proximal policy optimization framework are given in the reinforcement learning framework parameter settings table.
The feasibility of the flight controller designed with the deep reinforcement learning algorithm, and the value of exploring autonomous flight, are verified through simulations of constant-altitude, constant-speed cruise flight, the high-angle-of-attack Cobra maneuver and the Herbst maneuver.
Because the invention introduces a data normalization operation and a hierarchical reward mechanism, the training process of the flight controller is stable: the policy cognition of the neural network improves quickly in the early stage, and after the reinforcement learning framework has gradually explored the whole flight parameter space, the reward function value converges and stabilizes at a high level.
The loss function of the neural network controller drops quickly toward zero early in training, at which point the difference between the policy network and the evaluation network is small; as the number of training iterations increases, the two networks gradually fit the optimal policy output and the state-action value.
To verify the benefit of the designed hierarchical reward function for training time and reward value, the invention compares its reward convergence with that of a purely negative reward composed of all the error terms; the comparison of the reward function curves is shown in the corresponding figure.
Because positive rewards are introduced in the hierarchical reward, the controller is more sensitive to advantageous actions and the selection probability of advantageous action values is continuously adjusted upward. Compared with the traditional all-negative reward curve, which still oscillates unstably late in training, the training process using the hierarchical reward changes more smoothly and converges faster.
Initial parameters and expected parameters are set to train the cruise-control flight. This arrangement lets the controller explore fully in the parameter space on both sides of the expected value and avoids the aircraft approaching the expected value from only one end of the data and falling into a local optimum. In addition, with this simple setting the invention trains only for a single expected altitude, speed and expected angle, and after training uses different expected values to verify the advantage that the generalization of the neural network model brings to the control capability in an uncertain environment.
Parameter setting
After about 30 minutes of training, the neural network controller based on the deep reinforcement learning framework can find an approximately globally optimal solution in the high-dimensional input-output parameter space. In the figure, the altitude, the speed and the azimuth angle all converge to the desired values; the attitude angles become stable, the roll angle tends to zero, and the pitch angle settles at about 2 degrees, so that the aerodynamic forces and the gravity of the aircraft remain balanced and the altitude stays stable.
After the altitude and speed remain stable, the angles and angular rates of the aircraft are constant or vary only slightly. In addition, the method needs no trim control; the algorithm automatically finds the trim angles that keep the altitude and speed constant. Constant-altitude, constant-speed cruise control is realized, the expected altitude and speed are maintained throughout the flight, and the control effect actually achieved by the controller is very good, with a very small variation range of altitude and speed: as shown in the figure above, the altitude fluctuation error is less than ±0.01 m and the speed fluctuation error is less than ±0.1 m/s; the lateral offset is smaller than 0.1 m. In terms of airflow angles, the angle of attack deviates slightly around the trim angle with a fluctuation range smaller than 0.5 degrees, and the sideslip angle shows an error of about 0.1 degrees at the beginning and is almost stabilized at 0 later. In terms of attitude angles, the pitch angle is kept at 1.85 degrees with small fluctuations of less than 0.2 degrees, while the roll angle and yaw angle are almost unchanged throughout, with small jitter of less than 0.1 degrees. The changes of the remaining angular velocities correspond to the changes of the angles and reflect the feedback of the angle control. The controller achieves centimeter-level altitude control in the longitudinal channel and a decimeter-level speed error, meeting the control requirements. Three-channel composite control is realized on a six-degree-of-freedom aircraft model without decoupling, which greatly weakens the control difficulty brought by the strong coupling of the unmanned fighter model.
After training, the neural network weight parameters are saved for controller testing. In the test stage, to verify the robustness of the controller and the generalization of the neural network, a random wind field is added to the acceleration term of the basic aircraft model to simulate the unknown disturbances of the real environment. At the flight altitude, the randomness of the wind speed can be expressed as
V_m = V_max2 · Rand(-1, 1) · sin(2π + ω_n)
Wherein: v (V) max2 Peak values for random wind speeds; rand (-1, 1) is a random number uniformly distributed between-1 and 1; omega n And (3) changing random wind interference of three channels by changing the amplitude value and the phase value of the interference speed to pi/6, wherein the wind interference is different in each flight track. The maximum wind speed amplitude is 10m/s, and the phase is pi/8-pi/3.
Incorporating the disturbance term changes the angular velocity solution equation of the aircraft accordingly.
state setting
The initial altitude is randomly selected within the parameter range for cruise simulation. The simulation results show that the controller can still hold the altitude stably with random disturbance added; over 100 random simulation tests the average altitude error is 0.97 m. The random wind-field disturbance added in the test environment interferes more strongly with the speed term, so the speed control accuracy is degraded, with the average speed error within ±5 m/s.
In lateral control, the controller controls the azimuth angle with high accuracy, with an average error of 0.16 degrees; in 100 random flights the maximum error does not exceed 0.25 degrees. When the expected azimuth angle is 0, i.e., when the airframe is intended to move only along the x-axis of the ground coordinate system, the yaw angle stays near 0, varying around 0 with both positive and negative offsets, so the azimuth-angle requirement of 0 can be realized.
As can be seen from the figure, the attitude angles change noticeably: a rising or falling altitude has an obvious influence on the pitch angle; the influence on the yaw angle is small, and the yaw-angle error stays within ±2 deg, meeting the index requirement; the change in altitude also indirectly affects the roll angle, so the roll angle varies within a certain range.
In the training stage the controller is trained only for a single initial state and a single expected value, but through the data interaction between the controller and the flight environment its control effect can be extended to a range of parameters. Even with a random disturbance term added in the test environment, the flight controller based on the deep reinforcement learning framework can, under different initial altitude, initial speed and initial azimuth conditions, autonomously drive the aircraft to converge quickly to the desired values, or immediately avoid the risk of the aircraft losing control when facing uncertain factors.
In 100 random simulations the neural network controller controls the aircraft well; the test indices are shown in Table 5-3 below:
control of performance errors
During constant-altitude, constant-speed cruise, besides keeping the altitude constant, the speed must also be kept constant, with an allowable speed error of ±1 m/s. The initial speed given at the initial moment of flight is 181.68 m/s. As the simulation step count increases, the speed changes slightly, but the speed error is concentrated within ±0.5 m/s, within the allowed range, and the maximum error of a single trajectory is smaller than 1 m/s, meeting the index requirement and realizing constant-speed cruise well.
This section mainly carries out control simulation of the cruise flight of the aircraft; tests verify that the controller can provide high-precision flight control in a disturbed environment and in an environment with uncertain parameters.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the invention is not limited to the foregoing embodiments; the foregoing merely illustrates the principles of the invention, and various changes and improvements may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (8)

1. An aerial tanker control strategy method based on deep reinforcement learning, comprising the following steps: step one: defining an initial state of the aircraft as the current state;
step two: inputting the aircraft state information and an initial control strategy into an aircraft model, and calculating the next state information;
step three: calculating a reward value from the next state information; if the reward is larger than the set reward threshold, training is finished; otherwise, the method proceeds to step four;
step four: collecting state information, strategy information and reward information and storing them in an experience pool;
step five: running the distributed proximal policy optimization to output a policy gradient;
step six: calculating the gradient descent of the strategy;
step seven: updating the neural network using gradient descent;
step eight: outputting the action strategy from the neural network and returning to step two.
2. The aerial tanker control strategy method based on deep reinforcement learning according to claim 1, wherein the specific flow of step two is as follows: once a large deviation arises between the environment model and the real environment, the training data generated by the interaction between the decision model and the environment model become erroneous, and the decision model in turn becomes erroneous; environmental uncertainty is therefore introduced into the control decision sequence through a Bayesian neural network, improving the control performance of the decision model under uncertainty.
3. The aerial tanker control strategy method based on deep reinforcement learning according to claim 1, wherein the specific flow of step three is as follows: step 3.1: designing a constant-altitude cruise reward function, the specific flow of step 3.1 being: (1) a position reward-penalty function; (2) an attitude reward-penalty function.
4. The aerial tanker control strategy method based on deep reinforcement learning according to claim 1, wherein the specific flow of step four is as follows: (1) interacting with the environment; (2) sampling and collecting training samples; (3) calculating the gradient of the model and storing it; (4) transmitting the current gradient to the global update gradient.
5. The aerial tanker control strategy method based on deep reinforcement learning according to claim 1, wherein, in running the distributed proximal policy optimization to output the policy gradient, whereas a classical reinforcement learning algorithm generates training data through the interaction of a single agent with the environment, multiple agents are replicated to form multiple collection workers, diverse data are collected simultaneously from different initial states, and the action strategy of the aircraft is then comprehensively updated using the multiple groups of data.
6. The aerial tanker control strategy method based on deep reinforcement learning according to claim 1, wherein, in step seven, after all sampling workers complete gradient transmission, the global gradient is used to update the model parameters; this process repeatedly and fully exploits the currently collected data, and the model fully understands the data by exploiting the generalization of the neural network.
7. The aerial tanker control strategy method based on deep reinforcement learning according to claim 3, wherein the position reward-penalty function is designed according to the error allowed by the actual task: in realizing cruise at the specified altitude and speed, the altitude error and speed error are expected to approach zero, and the requirements of constant altitude and constant speed are met when the altitude and speed errors are small; in designing constant-altitude, constant-speed cruise, the attitude reward-penalty function does not design a desired trajectory but constrains the trajectory through the attitude angles by giving desired attitude angles, a certain penalty is also applied to the angles, the effect of the angular acceleration is considered, and the angular acceleration is kept at zero when the angles are unchanged.
8. The aerial tanker control strategy method based on deep reinforcement learning according to claim 3, wherein the specific usage method is as follows: the first step: first defining the initial state of the aircraft as the current state, inputting the aircraft state information and the initial control strategy into the aircraft model and calculating the next state information, then calculating a reward value from the next state information; if the reward is larger than the set reward threshold, training is finished; if not, step four begins collecting data and accumulating experience;
the second step: running the distributed proximal policy optimization to output the policy gradient, calculating the gradient descent of the strategy, and updating the neural network with gradient descent;
the third step: outputting the action strategy through the neural network, returning to the second step, and repeating this mechanism until the reward function reaches the standard, thereby completing training.
CN202310604940.0A 2023-05-26 2023-05-26 Deep reinforcement learning-based aerial tanker control strategy method Pending CN116661493A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310604940.0A CN116661493A (en) 2023-05-26 2023-05-26 Deep reinforcement learning-based aerial tanker control strategy method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310604940.0A CN116661493A (en) 2023-05-26 2023-05-26 Deep reinforcement learning-based aerial tanker control strategy method

Publications (1)

Publication Number Publication Date
CN116661493A true CN116661493A (en) 2023-08-29

Family

ID=87709048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310604940.0A Pending CN116661493A (en) 2023-05-26 2023-05-26 Deep reinforcement learning-based aerial tanker control strategy method

Country Status (1)

Country Link
CN (1) CN116661493A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117930632A (en) * 2024-03-19 2024-04-26 西北工业大学 High-reliability safe flight control method for enhancing stable reserve of system


Similar Documents

Publication Publication Date Title
Yu et al. Safe control of trailing UAV in close formation flight against actuator fault and wake vortex effect
Hu et al. Fuzzy controller design of micro-unmanned helicopter relying on improved genetic optimization algorithm
CN110908281A (en) Finite-time convergence reinforcement learning control method for attitude motion of unmanned helicopter
Liu et al. Novel docking controller for autonomous aerial refueling with probe direct control and learning-based preview method
Li et al. Finite-time control for quadrotor based on composite barrier Lyapunov function with system state constraints and actuator faults
Bøhn et al. Data-efficient deep reinforcement learning for attitude control of fixed-wing UAVs: Field experiments
CN116661493A (en) Deep reinforcement learning-based aerial tanker control strategy method
Wang et al. Practical control implementation of tri-tiltRotor flying wing unmanned aerial vehicles based upon active disturbance rejection control
CN115576353A (en) Aircraft formation control method based on deep reinforcement learning
Kose et al. Simultaneous design of morphing hexarotor and autopilot system by using deep neural network and SPSA
De Marco et al. A deep reinforcement learning control approach for high-performance aircraft
Hashemi et al. Lyapunov-based robust adaptive configuration of the UAS-S4 flight dynamics fuzzy controller
Michailidis et al. Robust nonlinear control of the longitudinal flight dynamics of a circulation control fixed wing UAV
Zhu et al. Multi-constrained intelligent gliding guidance via optimal control and DQN
Shen et al. Adaptive neural network extended state observer-based finite-time convergent sliding mode control for a quad tiltrotor uav
Wu et al. Improved reinforcement learning using stability augmentation with application to quadrotor attitude control
McIntosh et al. A Switching-Free Control Architecture for Transition Maneuvers of a Quadrotor Biplane Tailsitter
Tang et al. Trajectory Tracking Control for Fixed-Wing UAV Based on DDPG
Wang et al. Attitude controller design based on deep reinforcement learning for low-cost aircraft
Michailidis et al. A controller design framework for a NextGen circulation control based UAV
Xian et al. An Intelligent Attitude Control Method for UAV Based on DDPG Algorithm
Kamalasadan A new generation of adaptive control: An intelligent supervisory loop approach
Abouheaf et al. An online reinforcement learning wing-tracking mechanism for flexible wing aircraft
Tingting et al. Formation control of multiple UAVs via pigeon inspired optimisation
Jang et al. Robust attitude control for pavs using dnn with exponentially stabilizing control lyapunov functions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination