Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking.
The technical scheme is as follows: the state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking disclosed by the invention comprises the following steps:
S1, establishing a Markov decision process tuple <S, A, P, R, γ> for dynamic tracking of the unmanned aerial vehicle, where S is a finite state set, A is a finite control quantity set, P is the state transition probability matrix conditioned on the control quantity, R is the reward function based on state and control quantity, and γ is the discount factor;
S2, establishing a motion state equation for dynamic tracking of the unmanned aerial vehicle;
S3, constructing a first control network, a first evaluation network, and a first compensation network, together with a corresponding second control network, second evaluation network, and second compensation network having the same network structure;
S4, executing each network to obtain state transition processes, and storing all the state transition processes in an experience replay pool;
The current state s_t of the unmanned aerial vehicle is input respectively to the first control network and the first compensation network, and the output of the first control network is also input to the first compensation network; the output of the first control network and the output of the first compensation network are added, random noise is added to the sum, and the result is used as a first control quantity a to interact with the environment, yielding a reward R_{t+1} and driving the unmanned aerial vehicle into the next state s_{t+1}, thereby guiding it to track the dynamic target. The first control quantity is also input to the first evaluation network, which evaluates its value given the current state s_t and outputs a first evaluation value. Similarly, the next state s_{t+1} of the unmanned aerial vehicle is input to the second control network, whose output is input to the second compensation network; the output of the second control network and the output of the second compensation network are added to form a second control quantity input to the second evaluation network, which evaluates its value given the next state s_{t+1} and the second control quantity and outputs a second evaluation value; the second evaluation value is combined with the reward to obtain the target value Q_Target. The system repeats this process, taking the next state s_{t+1} as the current state s_t of the following step. Each pass of s_t, a, R_{t+1}, s_{t+1} is described as a transition process; multiple environment interactions generate multiple transition processes (s_t, a, R_{t+1}, s_{t+1}), and all of them are stored in the experience replay pool, where s_{t+1} is the next state obtained after executing the first control quantity in the current state s_t, and R_{t+1} is the reward obtained in the process;
S5, randomly sampling a batch of X transition processes (s_i, a_i, R_{i+1}, s_{i+1}) from the experience replay pool, calculating the temporal-difference error of each first evaluation value against the target value, updating each network parameter through the target gradient function until the training target is met, and saving each network model, where i = 1, 2, …, X is the sample index;
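The experience replay pool used in steps S4 and S5 can be sketched as follows. This is an illustrative Python sketch rather than the patent's implementation; the pool capacity of 100000 and batch size X = 32 are assumptions not stated in the text.

```python
import random
from collections import deque

class ReplayPool:
    """Experience replay pool holding transitions (s_t, a, R_{t+1}, s_{t+1})."""
    def __init__(self, capacity=100000):
        self.pool = deque(maxlen=capacity)  # oldest transitions drop out first

    def store(self, s_t, a, r_next, s_next):
        self.pool.append((s_t, a, r_next, s_next))

    def sample(self, batch_size):
        # random batch sampling (step S5) breaks correlation between samples
        return random.sample(list(self.pool), batch_size)

pool = ReplayPool()
for step in range(500):
    pool.store((step,), (0.0,), -1.0, (step + 1,))
batch = pool.sample(32)  # X = 32 transition processes
```

Storing every interaction and updating from a uniformly sampled batch is what lets the networks be trained on decorrelated transitions.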
S6, when the unmanned aerial vehicle performs a trajectory tracking task, setting the target tracking position, inputting the current state of the unmanned aerial vehicle into the trained first control network model, executing the control quantity output by the network model, and updating the state of the unmanned aerial vehicle; these steps are repeated until the target tracking task is completed.
Further, in step S1, the Markov decision process tuple <S, A, P, R, γ> for dynamic tracking of the unmanned aerial vehicle is specifically:

s_t = (p_x, p_y, p_z, v_x, v_y, v_z, t_x, t_y, t_z),  a = (a_x, a_y, a_z)

where S is the finite state set and s_t is the current state, a member of the state set S; p_x, p_y, p_z are the position coordinates of the unmanned aerial vehicle along the three coordinate axes of a Cartesian coordinate system, v_x, v_y, v_z are its velocities along the three coordinate axes, and t_x, t_y, t_z are the target position coordinates of the unmanned aerial vehicle along the three coordinate axes. A is the finite control quantity set and a is a control quantity representation in the control quantity set A, in which a_x, a_y, a_z are the accelerations of the unmanned aerial vehicle along the three coordinate axes. P is the state transition probability matrix conditioned on the control quantity; P^a_{s_t s_{t+1}} represents a set of transition probabilities in the state transition probability matrix P, i.e., the probability of transitioning from the current state s_t to the next state s_{t+1} under the control quantity a, with the reward R_{t+1} obtained in the process; the subscripts t and t+1 in the above variables denote the current time step and the next time step, respectively. R in the tuple <S, A, P, R, γ> is the reward function based on state and control quantity; R^a_{s_t} represents a set of reward values in the reward function R, i.e., the reward obtained after executing the control quantity a from the current state s_t. γ is the discount factor; S_t is the finite state set at the current time step t, and A_t is the finite control quantity set at the current time step t.
Further, the motion state equation for dynamic tracking of the unmanned aerial vehicle in step S2 is:

s_{t+1} = f(S_t = s_t, A_t = a | π_θ(s_t, a)) + o_t

where π_θ(s_t, a) is the unmanned aerial vehicle trajectory tracking control strategy, θ is the strategy parameter, π_θ(s_t, a) = P[A_t = a | S_t = s_t, θ] is the probability distribution matrix of executing the control quantity a in a given state s_t ∈ S under the strategy parameter θ, f(S_t = s_t, A_t = a | π_θ(s_t, a)) is the dynamic equation obtained by continuously updating s_t, a, s_{t+1} through the control method, and o_t is a random noise function that enables exploration in a small range around the control quantity.
Furthermore, the unmanned aerial vehicle adopts the trajectory tracking control strategy π_θ(s_t, a) to execute the control quantity a and then updates s_t, a, s_{t+1}, where (p′_x, p′_y, p′_z) is the position of the unmanned aerial vehicle in state s_{t+1}, (v′_x, v′_y, v′_z) is its velocity in state s_{t+1}, and (t_x, t_y, t_z) is the tracked target position; the update formula advances the position and velocity by discrete-time integration of the velocity and acceleration over one time step.
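The state update described above can be sketched as follows; the patent's own update formula is not reproduced in the text, so this is a minimal sketch assuming semi-implicit Euler integration with an assumed step `dt` (both assumptions, not the patent's stated scheme).

```python
def update_state(s_t, a, dt=0.1):
    """Advance the 9-dim state one step; dt and the Euler scheme are assumptions."""
    p, v, target = s_t[0:3], s_t[3:6], s_t[6:9]
    v_new = tuple(vi + ai * dt for vi, ai in zip(v, a))      # v' = v + a*dt
    p_new = tuple(pi + vi * dt for pi, vi in zip(p, v_new))  # p' = p + v'*dt
    return p_new + v_new + target  # the target position is not moved by the drone

s_next = update_state((0.0,) * 6 + (1.0, 1.0, 1.0), (1.0, 0.0, 0.0), dt=1.0)
```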
further, in step S3, the first control network and the second control network have the same structure and each include an input layer, three hidden layers, and an output layer, where the input layer is configured to receive the state data of the unmanned aerial vehicle, output the state data sequentially through the first hidden layer, the second hidden layer, the third hidden layer, and then output the state data through the output layer, and all the layers are fully connected;
the first evaluation network and the second evaluation network have the same structure and respectively comprise two input layers, four hidden layers and an output layer, wherein one input layer receives state data of the unmanned aerial vehicle, then outputs the state data sequentially pass through the first hidden layer and the second hidden layer and then input into the fourth hidden layer, the other input layer receives control quantity, then outputs the state data pass through the third hidden layer and then input into the fourth hidden layer, the output of the fourth hidden layer outputs evaluation value through the output layer, and all the layers adopt a full connection mode;
the first compensation network and the second compensation network are identical in structure and respectively comprise an input layer, three hidden layers and an output layer, wherein the input layer receives state data of the unmanned aerial vehicle on the one hand, receives output of the first control network or the second control network on the other hand, then the output sequentially passes through the first hidden layer, the second hidden layer and the third hidden layer, and then the output is carried out through the output layer, and all the layers adopt a full connection mode.
Further, the temporal-difference error value Loss in step S5 is calculated as:

Loss = (1/X) Σ_{i=1}^{X} (Q_Target − Q(s_i, a_i | θ_Q, ω_C))²

where Q_Target is the target value, Q_Target = R_{i+1} + γQ′(s_{i+1}, A′(s_{i+1} | θ_{A′}, ω_{C′})), γ is the discount factor, Q′(s_{i+1}, A′(s_{i+1} | θ_{A′}, ω_{C′})) is the value of the second evaluation value after sampling, and Q(s_i, a_i | θ_Q, ω_C) is the value of the first evaluation value after sampling; θ_Q, θ_A, and ω_C are respectively the parameters of the cost function Q(s_t, a | θ_Q) of the first evaluation network, the control quantity function A(s_t | θ_A) of the first control network, and the function C(s_t | ω_C) of the first compensation network; θ_{Q′}, θ_{A′}, and ω_{C′} are respectively the parameters of the cost function Q′(s_t, a | θ_{Q′}) of the second evaluation network, the control quantity function A′(s_t | θ_{A′}) of the second control network, and the function C′(s_t | ω_{C′}) of the second compensation network.
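The Loss and Q_Target computation can be sketched as below, with `q_first` and `q_second_target` standing in for the first and second evaluation networks (dummy lambdas here, not the real networks); γ = 0.999 follows the value used in the embodiment.

```python
def td_loss(batch, q_first, q_second_target, gamma=0.999):
    """Mean squared TD error between Q_Target and the first evaluation value."""
    total = 0.0
    for s_i, a_i, r_next, s_next in batch:
        q_target = r_next + gamma * q_second_target(s_next)  # Q_Target
        total += (q_target - q_first(s_i, a_i)) ** 2
    return total / len(batch)

# dummy critics standing in for the two evaluation networks
batch = [((0.0,), (0.0,), 1.0, (1.0,)), ((1.0,), (0.0,), 0.0, (2.0,))]
loss = td_loss(batch, lambda s, a: 0.0, lambda s: 0.0)
```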
Further, the target gradient function ∇_{θ_A}J in step S5 is given by:

∇_{θ_A}J ≈ (1/X) Σ_{i=1}^{X} ∇_a Q(s_i, a | θ_Q)|_{a=A(s_i | θ_A)} ∇_{θ_A} A(s_i | θ_A)

where ∇_a is the gradient operator with a as the argument, ∇_{θ_A} is the gradient operator with θ_A as the argument, and ∇_{ω_C} is the gradient operator with ω_C as the argument.

The network parameters are updated as follows: using the temporal-difference error value Loss and the target gradient function ∇_{θ_A}J, the first evaluation network parameter θ_Q and the first control network parameter θ_A are respectively updated by the gradient ascent method with the Adam optimizer in a Python machine learning library; the first evaluation network parameter θ_Q and the first control network parameter θ_A are then used to update the second evaluation network parameter and the second control network parameter in small amplitude, with the formula:

θ_{Q′} ← ηθ_Q + (1 − η)θ_{Q′},  θ_{A′} ← ηθ_A + (1 − η)θ_{A′}

where η ∈ [0,1] is the update parameter; the compensation network parameters are updated by the gradient descent method, with the parameter update formula:

ω_C ← ω_C − εΔω

where Δω is the parameter update amount and ε ∈ [0,1] is the learning rate of the compensation network.
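The small-amplitude target update and the compensation network's gradient-descent step can be sketched as plain operations on parameter lists; the parameter vectors and gradients used below are placeholders.

```python
def soft_update(online, target, eta=0.001):
    """Small-amplitude target update: theta' <- eta*theta + (1 - eta)*theta'."""
    return [eta * p + (1.0 - eta) * tp for p, tp in zip(online, target)]

def compensation_step(omega, grad, eps=0.001):
    """Gradient-descent step on the compensation network parameters."""
    return [w - eps * g for w, g in zip(omega, grad)]

# exaggerated eta/eps make the effect visible in one step
new_target = soft_update([1.0, -1.0], [0.0, 0.0], eta=0.5)
new_omega = compensation_step([1.0], [2.0], eps=0.5)
```

With the small η and ε of the embodiment (0.001), the target networks trail the online networks slowly, which stabilizes the bootstrapped target value Q_Target.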
The invention relates to a state-compensated deep deterministic policy gradient control system for unmanned aerial vehicle trajectory tracking, which comprises:
the tuple construction module, used for establishing a Markov decision process tuple <S, A, P, R, γ> for dynamic tracking of the unmanned aerial vehicle, where S is a finite state set, A is a finite control quantity set, P is the state transition probability matrix conditioned on the control quantity, R is the reward function based on state and control quantity, and γ is the discount factor;
the motion state equation establishing module is used for establishing a motion state equation dynamically tracked by the unmanned aerial vehicle, and the motion state equation comprises the state of the next time step and an unmanned aerial vehicle trajectory tracking control strategy;
the network construction module is used for constructing a first control network, a first evaluation network, a first compensation network, and a second control network, a second evaluation network and a second compensation network which have the same network structure;
the execution control module is used for executing each network and obtaining a plurality of state transition processes;
and the network parameter updating module is used for updating each network parameter through the target gradient function until the network parameters accord with the training target.
An apparatus of the present invention includes a memory and a processor, wherein:
a memory for storing a computer program capable of running on the processor;
and the processor, used for executing the steps of the above state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking when running the computer program.
A storage medium of the present invention has stored thereon a computer program which, when executed by at least one processor, performs the steps of the state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking described above.
Beneficial effects: compared with the prior art, the invention has the following technical effects:
(1) in the method, a compensation network is added, and the action output of the control network is combined with the compensation output of the compensation network as the output action that interacts with the environment, so that the unmanned aerial vehicle tracks the dynamic target quickly in an accurate, continuous, and smooth manner; in addition, random noise is added to the generated control quantity, realizing exploration within a certain range and making the estimate of the control quantity more accurate;
(2) the method is verified with the OpenAI Gym toolkit, and the simulation results show that it significantly improves training efficiency and effectively improves precision and convergence stability; under the same computer configuration, the computational cost of the method is essentially the same as that of other reinforcement learning algorithms, so a better tracking effect is achieved without increasing computational cost;
(3) during training of the algorithm designed by the method, the target point randomly changes its spatial position, so that arbitrary target positions appear within the specified spatial range and the unmanned aerial vehicle learns to approach the target point quickly and smoothly, realizing trajectory tracking training; for the same training time, the settled tracking error is half that of a common policy-based reinforcement learning algorithm, and for the same tracking precision, the algorithm is markedly higher in learning efficiency than other similar reinforcement learning algorithms and has higher convergence stability; this indicates a degree of generalization, so the algorithm can be trained for different environments and applied to trajectory tracking tasks in fields such as unmanned vehicles, unmanned aerial vehicles, mobile robots, and multi-legged robots;
(4) for the unmanned aerial vehicle trajectory tracking control problem of an unknown-model system, reinforcement learning requires a series of iterative processes during learning, the training period of an initial strategy is long, and controller development demands a large amount of training time; by improving learning efficiency, the method shortens this development time.
Detailed Description
The embodiments of the present invention are described in detail below; examples thereof are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting its scope.
Aiming at the shortcomings of deep-reinforcement-learning-based unmanned aerial vehicle trajectory tracking control algorithms in unknown environments, such as low training efficiency and unstable convergence, and taking the improvement of learning efficiency without reducing training precision as the breakthrough point, the invention provides an unmanned aerial vehicle trajectory tracking control method: by adopting deep reinforcement learning and a state-compensated deep deterministic policy gradient algorithm that fuses networks over different state spaces, training efficiency and convergence stability can be significantly improved.
As shown in fig. 1, the state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking of the present invention comprises the following steps:
step 1, establishing a Markov decision process tuple < S, A, P, R, gamma > of unmanned aerial vehicle dynamic tracking:
therein, the tuple<S,A,P,R,γ>S in (a) is a finite state set, S in formula (a)
t Is a set of state representations in a state set S, where p
x 、p
y 、p
z Respectively are the position coordinates of the unmanned aerial vehicle in the directions of three coordinate axes under a Cartesian coordinate system,
respectively the speed of the unmanned aerial vehicle in the direction of three coordinate axes under a Cartesian coordinate system, t
x 、t
y 、t
z Respectively are the target position coordinates of the unmanned aerial vehicle in three coordinate axis directions under a Cartesian coordinate system,
describing the current position (p) by the drone in a Cartesian coordinate system
x ,p
y ,p
z ) Speed of the motor
And target position (t)
x ,t
y ,t
z ) Formed unmanned aerial vehicle state, tuple<S,A,P,R,γ>A in the formula (a) is a group of control quantity representations in the control quantity set A, wherein
Respectively represents the acceleration of the unmanned plane in the direction of three coordinate axes under a Cartesian coordinate system,
control components, tuples, which are performed in three cartesian coordinate axis directions, respectively, under system control are described<S,A,P,R,γ>P in (a) is a state transition probability matrix based on the control quantity
Representing a set of transition probabilities in the state transition probability matrix P,
indicating the current state s
t Passing through the control quantity a, and shifting to the next state s
t+1 The probability of transition in the process is obtained as R
t+1 The subscripts t and t +1 in the above variables denote the current time step and the next time step, tuple, respectively<S,A,P,R,γ>R in (a) is a reward function based on the state and the control quantity
Representing a set of prize values in a prize function R, i.e. from the current state s
t Awards, tuples, obtained after a control quantity a<S,A,P,R,γ>Gamma in (1) is an attenuation factor, S
t Is a finite state set of the current time step t, A
t Is a finite set of control quantities for the current time step t.
Step 2, establishing the motion state equation for dynamic tracking of the unmanned aerial vehicle:

s_{t+1} = f(S_t = s_t, A_t = a | π_θ(s_t, a)) + o_t    (b)

where π_θ(s_t, a) in formula (b) is the unmanned aerial vehicle trajectory tracking control strategy, θ is the strategy parameter, π_θ(s_t, a) = P[A_t = a | S_t = s_t, θ] is the probability distribution matrix of executing the control quantity a in a given state s_t ∈ S under the strategy parameter θ, f(S_t = s_t, A_t = a | π_θ(s_t, a)) is the dynamic equation obtained by continuously updating s_t, a, s_{t+1} through the control method of the invention, and o_t is a random noise function that enables exploration in a small range around the control quantity.
The unmanned aerial vehicle adopts the trajectory tracking control strategy π_θ(s_t, a) to execute the control quantity a and then updates s_t, a, s_{t+1}, where (p′_x, p′_y, p′_z) is the position of the unmanned aerial vehicle in state s_{t+1}, (v′_x, v′_y, v′_z) is its velocity in state s_{t+1}, and (t_x, t_y, t_z) is the tracked target position; the update mode is formula (c), which advances the position and velocity by discrete-time integration over one time step.
Step 3, respectively constructing a first control network, a first evaluation network, a first compensation network and a corresponding second control network, a second evaluation network and a second compensation network with the same network structure;
As shown in fig. 2, the first control network and the second control network have the same structure, each comprising an input layer, three hidden layers, and an output layer; the input layer receives the state data of the unmanned aerial vehicle, the output passes sequentially through the first, second, and third hidden layers and then through the output layer, and all layers are fully connected.
The first evaluation network and the second evaluation network have the same structure, each comprising two input layers, four hidden layers, and an output layer; one input layer receives the unmanned aerial vehicle state data, whose output passes through the first and second hidden layers before entering the fourth hidden layer, while the other input layer receives the control quantity, whose output passes through the third hidden layer before entering the fourth hidden layer; the output of the fourth hidden layer produces the evaluation value through the output layer, and all layers are fully connected.
The first compensation network and the second compensation network have the same structure, each comprising an input layer, three hidden layers, and an output layer; the input layer receives both the state data of the unmanned aerial vehicle and the output of the first or second control network, the output passes sequentially through the first, second, and third hidden layers and then through the output layer, and all layers are fully connected.
The input received by the evaluation network is the state s detected by the unmanned aerial vehicle and the control quantity a. The hidden layers processing the state and the control quantity are first kept separate: the input state passes through two hidden layers of 256 and 128 neurons, the input control quantity passes through one hidden layer of 128 neurons, and the two branches are fully connected through a final hidden layer of 128 neurons, after which the evaluation value is output. The input received by the control network is the state s detected by the unmanned aerial vehicle, and it outputs a specific value for each control quantity. The input of the compensation network is the output of the control network together with the state detected by the unmanned aerial vehicle; the output of the compensation network is added to the noise function and the output of the control network to form the control quantity a executed by the unmanned aerial vehicle. The control network and the compensation network are both designed with 3 hidden layers (with 256, 128, and 64 neurons, respectively); all layers are fully connected, and the activation functions used are relu and tanh (fig. 2).
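The two-branch evaluation network described above (state branch 256 → 128 neurons, control branch 128 neurons, a merged final 128-unit hidden layer, scalar output) can be sketched as follows; the weights are random placeholders illustrating the data flow only, not a trained critic.

```python
import random

random.seed(1)

def dense(x, cols):
    """Fully connected layer with fresh small random weights (illustration only)."""
    w = [[random.uniform(-0.05, 0.05) for _ in range(cols)] for _ in x]
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(cols)]

def relu(v):
    return [max(0.0, x) for x in v]

def evaluation_network(state, action):
    """Critic sketch: separate state (256 -> 128) and action (128) branches,
    fully connected through a final 128-unit hidden layer, then a scalar value."""
    hs = relu(dense(relu(dense(state, 256)), 128))  # state branch
    ha = relu(dense(action, 128))                   # control quantity branch
    merged = relu(dense(hs + ha, 128))              # list '+' concatenates branches
    return dense(merged, 1)[0]                      # scalar evaluation value

q = evaluation_network([0.1] * 9, [0.0, 0.2, -0.3])
```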
Step 4, executing each network to obtain a state transition process, and storing all the state transition processes into an experience playback pool;
As shown in fig. 3, the current state s_t of the unmanned aerial vehicle is input respectively to the first evaluation network, the first control network, and the first compensation network; the output of the first control network and the output of the first compensation network are added, random noise is added to the sum, and the result is used as the first control quantity a to interact with the environment, yielding a reward R_{t+1} and driving the unmanned aerial vehicle into the next state s_{t+1}, thereby guiding it to track the dynamic target. The first control quantity is also input to the first evaluation network, which evaluates its value given the current state s_t and outputs a first evaluation value. In the same way, the next state s_{t+1} of the unmanned aerial vehicle is input respectively to the second evaluation network, the second control network, and the second compensation network; the output of the second control network is input to the second compensation network, the two outputs are added to form a second control quantity input to the second evaluation network, which evaluates its value given the next state s_{t+1} and the second control quantity and outputs a second evaluation value; the second evaluation value is combined with the reward to obtain the target value Q_Target. The system then repeats the above process, taking the next state s_{t+1} as the current state s_t of the following step, and carries out the subsequent process. Each pass of s_t, a, R_{t+1}, s_{t+1} in the above process is described as a transition process; multiple environment interactions generate multiple transition processes (s_t, a, R_{t+1}, s_{t+1}), and all of them are stored in the experience replay pool, where s_{t+1} is the next state obtained after executing the control quantity a in state s_t, and R_{t+1} is the reward obtained in the process.
Step 5, randomly sampling a batch of X transition processes (s_i, a_i, R_{i+1}, s_{i+1}) from the experience replay pool, calculating the temporal-difference (TD) error of each first evaluation value against the target value, updating each network parameter through the target gradient function, updating the second evaluation network parameter and the second control network parameter in small amplitude until the training target is met, and saving each network model, where i = 1, 2, …, X is the sample index.
The compensation network assisting exploration during training is a neural network whose input is the unmanned aerial vehicle state and whose output is a compensation control quantity. Let θ_Q, θ_A, and ω_C be the parameters of the cost function Q(s_t, a | θ_Q) of the first evaluation network, the control quantity function A(s_t | θ_A) of the first control network, and the function C(s_t | ω_C) of the first compensation network, respectively, and let θ_{Q′}, θ_{A′}, and ω_{C′} be the parameters of the cost function Q′(s_t, a | θ_{Q′}) of the second evaluation network, the control quantity function A′(s_t | θ_{A′}) of the second control network, and the function C′(s_t | ω_{C′}) of the second compensation network, respectively. With the sampled first evaluation value Q(s_i, a_i | θ_Q, ω_C) and the sampled second evaluation value Q′(s_{i+1}, A′(s_{i+1} | θ_{A′}, ω_{C′})), the temporal-difference error value Loss is formula (d):

Loss = (1/X) Σ_{i=1}^{X} (Q_Target − Q(s_i, a_i | θ_Q, ω_C))²    (d)
where the target value in formula (d) is Q_Target = R_{i+1} + γQ′(s_{i+1}, A′(s_{i+1} | θ_{A′}, ω_{C′})), with γ the discount factor. The target gradient function ∇_{θ_A}J is formula (e):

∇_{θ_A}J ≈ (1/X) Σ_{i=1}^{X} ∇_a Q(s_i, a | θ_Q)|_{a=A(s_i | θ_A)} ∇_{θ_A} A(s_i | θ_A)    (e)

where ∇_a is the gradient operator with a as the argument, ∇_{θ_A} is the gradient operator with θ_A as the argument, and ∇_{ω_C} is the gradient operator with ω_C as the argument.
The network parameters are updated as follows:
using the temporal-difference error value Loss and the target gradient function ∇_{θ_A}J, the first evaluation network parameter θ_Q and the first control network parameter θ_A are respectively updated by the gradient ascent method with the Adam optimizer in a Python machine learning library;
the first evaluation network parameter θ_Q and the first control network parameter θ_A are used to update the second evaluation network parameter and the second control network parameter in small amplitude, with formula (f):

θ_{Q′} ← ηθ_Q + (1 − η)θ_{Q′},  θ_{A′} ← ηθ_A + (1 − η)θ_{A′}    (f)

where η ∈ [0,1] is the update parameter;
the first compensation network parameter is updated by the gradient descent method of formula (g):

ω_C ← ω_C − εΔω    (g)

where Δω is the parameter update amount given by the target gradient function above, and ε ∈ [0,1] is the learning rate of the compensation network.
The execution flow diagram of step 4 and step 5 is shown in fig. 4.
(1) Initialization, comprising the following steps:
initializing the first control network, first evaluation network, first compensation network, second control network, second evaluation network, and second compensation network;
initializing the experience replay pool;
randomly initializing the noise function and the unmanned aerial vehicle state s_t.
(2) Network training, i.e., training the network model according to step 4 and step 5:
first, the unmanned aerial vehicle executes a control quantity a;
the next state s_{t+1} of the unmanned aerial vehicle and the reward R_{t+1} are obtained;
the unmanned aerial vehicle state transition process (s_t, a, R_{t+1}, s_{t+1}) is stored;
the target value is calculated, and the parameters of the control network, evaluation network, and compensation network are updated through the temporal-difference error value and the target gradient function.
(3) It is judged whether the training target is met, i.e., whether the set number of training rounds or the set cumulative reward value has been reached; if so, the trained model is saved and the simulation experiment is carried out; if not, the flow returns to step (2).
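The flow of steps (1)-(3) can be condensed into a skeleton loop; `env_step`, `select_action`, and `update_networks` are placeholder stand-ins for the simulation environment, the control-plus-compensation-plus-noise action, and the batch parameter update, and the episode/horizon counts are illustrative only.

```python
def train(env_step, select_action, update_networks, episodes=2, horizon=5):
    """Skeleton of steps (1)-(3): interact, store transitions, update, repeat."""
    pool, returns = [], []
    for _ in range(episodes):
        s_t, total = (0.0, 0.0, 0.0), 0.0
        for _ in range(horizon):
            a = select_action(s_t)                 # control + compensation + noise
            s_next, r_next = env_step(s_t, a)
            pool.append((s_t, a, r_next, s_next))  # experience replay pool
            if len(pool) >= 4:
                update_networks(pool[-4:])         # batch sample and update stub
            s_t, total = s_next, total + r_next
        returns.append(total)                      # cumulative reward per episode
    return returns

# trivially stubbed environment and networks, just to exercise the flow
rets = train(lambda s, a: (s, -1.0), lambda s: (0.0, 0.0, 0.0), lambda b: None)
```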
Step 6, when the unmanned aerial vehicle performs a trajectory tracking task, the target tracking position is set, the current state of the unmanned aerial vehicle is input into the trained first control network model, the control quantity output by the network model is executed, and the state of the unmanned aerial vehicle is updated; these steps are repeated until the target tracking task is completed.
The invention relates to a state-compensated deep deterministic policy gradient control system for unmanned aerial vehicle trajectory tracking, which comprises:
the tuple construction module is used for establishing a Markov decision process tuple < S, A, P, R, gamma > dynamically tracked by the unmanned aerial vehicle;
the motion state equation establishing module is used for establishing a motion state equation dynamically tracked by the unmanned aerial vehicle, and the motion state equation comprises the state of the next time step and an unmanned aerial vehicle trajectory tracking control strategy;
the network construction module is used for constructing a first control network, a first evaluation network, a first compensation network, and a second control network, a second evaluation network and a second compensation network which have the same network structure;
the execution control module is used for executing each network and obtaining a plurality of state transition processes;
and the network parameter updating module is used for updating each network parameter through the target gradient function until the network parameters accord with the training target.
An apparatus of the present invention includes a memory and a processor, wherein:
a memory for storing a computer program capable of running on the processor;
and the processor, used for executing the steps of the above state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking when running the computer program, achieving technical effects consistent with the method.
The storage medium of the present invention stores a computer program which, when executed by at least one processor, implements the steps of the above state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking, achieving technical effects consistent with the method.
Example (b): in order to verify that the method disclosed by the invention adopts a state compensation depth certainty strategy gradient method fusing different state space networks, the training efficiency and the convergence stability can be obviously improved, and the method disclosed by the invention is simulated through a simulation experiment. The experimental operating platform is configured as follows: intel (R) core (TM) i5-7300HQ, simulation toolkit: OpenAI Gym. Two sets of comparative simulation tests were performed in this example: 1) the method adopts a depth deterministic strategy gradient (DDPG) algorithm as a comparison reference, and is a state-compensated depth deterministic strategy gradient control method, so the method is abbreviated as CDDPG algorithm; 2) since the DDPG is developed by an Actor-critic (QAC) algorithm based on value, the QAC algorithm is also used as a comparison reference in the present embodiment, and is improved according to the algorithm of the present invention to become a state compensated QAC algorithm, which is referred to as CQAC for short. The four algorithms use the absolute root mean square error of the absolute track error as a quantitative evaluation index.
In the simulation environment, the flight range of the unmanned aerial vehicle is set to a 10 × 10 × 10 (m) three-dimensional space, and the tracking training precision r is set to 0.3 m: during training, the unmanned aerial vehicle tracking to within 0.3 m of the target point is regarded as a successful follow, and each successful follow is regarded as one complete experience. During this process, the target point randomly changes its spatial position to realize trajectory tracking training of the unmanned aerial vehicle. The target point is tracked using the four algorithms QAC, DDPG, CQAC and CDDPG. A complete experience comprises a number of iteration steps; the reward after each iteration step is the negative of the straight-line distance between the tracking point and the target point, and the reward of a complete experience is defined as the sum of the rewards over all its iterations, so the reward is at most 0. The attenuation factor γ is set to 0.999 and the learning rates η = ε = 0.001, and the simulation is performed.
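The reward scheme described above can be sketched as follows: the per-iteration reward is the negative straight-line distance to the target, and a complete experience's reward is the sum over its iteration steps, hence at most 0. The function names are illustrative.

```python
import numpy as np

def step_reward(position, target):
    """Per-iteration reward: negative straight-line distance to the target."""
    return -np.linalg.norm(np.asarray(position) - np.asarray(target))

def episode_reward(trajectory, target):
    """Reward of a complete experience: sum over all iteration steps (<= 0)."""
    return sum(step_reward(p, target) for p in trajectory)
```

A perfect follow (zero distance at every step) would yield the maximum reward of 0, which is why the curves in the training figures approach zero from below.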
After 1000 successful follows are completed, all training models are saved for subsequent effect verification. Fig. 5 shows the rewards over the 1000 experiences. It can be seen that from about the 11th experience onward the reward is close to zero, but from the earlier rewards the CQAC and CDDPG algorithms improve significantly faster during training than QAC and DDPG. As can be seen from the total number of iterations of each algorithm in Fig. 6, in the early stage of training the QAC and DDPG algorithms require tens of thousands of iterations per successful follow, whereas CQAC and CDDPG require only about 30% of the iteration steps of the original algorithms. As can be seen in Fig. 7, the per-step computation costs of the four algorithms are substantially the same (about 0.998 ms) under the same computer configuration, which fully illustrates that, at the same training accuracy, the training time is reduced by about 70%.
In order to verify the dynamic tracking effect of the trained models, a simulation experiment is designed in this embodiment: the target point completes a spiral trajectory in a Cartesian coordinate system and is to be tracked as accurately as possible. In the simulation, the target point completes the predetermined trajectory at uniform speed. With an angular speed of π/15 rad/s in the X-Y plane and a climb rate of π/30 m/s along the z axis, the target completes two circles of spiral motion with radius 3 m and pitch π m in space, and the trajectory takes 60 s to complete. Fig. 8 shows the trajectory tracking effect of the four algorithm models. The CQAC and CDDPG trajectories are visibly closer to the reference trajectory than those of QAC and DDPG. To further clarify the superiority of the proposed algorithm, Fig. 9 shows the tracking errors of the four algorithms; clearly, owing to the added compensation network, the tracking accuracy and convergence stability of the method of the invention are effectively improved, and the steady-state tracking error is reduced by about 50% compared with the original algorithms. The results show that the compensation network provides stronger active control capability during UAV flight, reducing position error and improving the tracking effect.
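Using the parameters stated for this experiment (radius 3 m, angular speed π/15 rad/s in the X-Y plane, climb rate π/30 m/s along z), the reference spiral can be generated as follows; the function name is illustrative.

```python
import numpy as np

def spiral_reference(t):
    """Reference spiral from the experiment: radius 3 m, angular speed
    pi/15 rad/s in the X-Y plane, climb rate pi/30 m/s along the z axis."""
    omega, vz, r = np.pi / 15, np.pi / 30, 3.0
    return np.array([r * np.cos(omega * t), r * np.sin(omega * t), vz * t])
```

Over the stated 60 s the phase advances by 4π (two full circles) while the target climbs 2π m, i.e. two pitches of π m each, consistent with the figures quoted in the text.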
The embodiment shows that the state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking can remarkably improve training efficiency and convergence stability; under the same computer configuration, its computation cost is essentially the same as that of other reinforcement learning algorithms, so a better tracking effect is achieved without increasing computation cost; at the same tracking accuracy, the learning efficiency during training is higher; and for the same training time, the steady-state tracking error is reduced by half compared with common policy-based reinforcement learning algorithms.