Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking.
The technical scheme is as follows: the state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking disclosed by the invention comprises the following steps:
S1, establishing a Markov decision process tuple <S, A, P, R, γ> for dynamic tracking of the unmanned aerial vehicle, where S is a finite state set, A is a finite control quantity set, P is the state transition probability matrix conditioned on the control quantity, R is the reward function based on state and control quantity, and γ is the discount factor;
S2, establishing a motion state equation for dynamic tracking of the unmanned aerial vehicle;
S3, constructing a first control network, a first evaluation network, and a first compensation network, together with a corresponding second control network, second evaluation network, and second compensation network having the same network structure;
S4, executing each network to obtain state transition processes, and storing all the state transition processes in an experience replay pool;
The current state s_t of the unmanned aerial vehicle is input respectively to the first control network and the first compensation network, and the output of the first control network is also input to the first compensation network; the output of the first control network and the output of the first compensation network are added, random noise is added to the sum, and the result is used as a first control quantity a to interact with the environment, yielding a reward R_{t+1} and driving the unmanned aerial vehicle into the next state s_{t+1}, thereby guiding it to track the dynamic target. The first control quantity is also input to the first evaluation network, which evaluates its value given the current state s_t and outputs a first evaluation value. Similarly, the next state s_{t+1} of the unmanned aerial vehicle is input to the second control network, whose output is input to the second compensation network; the output of the second control network and the output of the second compensation network are added to form a second control quantity input to the second evaluation network, which evaluates its value given the next state s_{t+1} and the second control quantity and outputs a second evaluation value; the second evaluation value is combined with the reward to obtain the target value Q_Target. The system repeats this process, taking the next state s_{t+1} as the current state s_t of the following step. Each pass of s_t, a, R_{t+1}, s_{t+1} is described as a transition process; multiple environment interactions generate multiple transition processes (s_t, a, R_{t+1}, s_{t+1}), and all of them are stored in the experience replay pool, where s_{t+1} is the next state obtained after executing the first control quantity in the current state s_t, and R_{t+1} is the reward obtained in the process;
S5, randomly sampling a batch of X transition processes (s_i, a_i, R_{i+1}, s_{i+1}) from the experience replay pool, calculating the temporal-difference error of each first evaluation value against the target value, updating each network parameter through the target gradient function until the training target is met, and saving each network model, where i = 1, 2, …, X is the sample index;
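The experience replay pool used in steps S4 and S5 can be sketched as follows. This is an illustrative Python sketch rather than the patent's implementation; the pool capacity of 100000 and batch size X = 32 are assumptions not stated in the text.

```python
import random
from collections import deque

class ReplayPool:
    """Experience replay pool holding transitions (s_t, a, R_{t+1}, s_{t+1})."""
    def __init__(self, capacity=100000):
        self.pool = deque(maxlen=capacity)  # oldest transitions drop out first

    def store(self, s_t, a, r_next, s_next):
        self.pool.append((s_t, a, r_next, s_next))

    def sample(self, batch_size):
        # random batch sampling (step S5) breaks correlation between samples
        return random.sample(list(self.pool), batch_size)

pool = ReplayPool()
for step in range(500):
    pool.store((step,), (0.0,), -1.0, (step + 1,))
batch = pool.sample(32)  # X = 32 transition processes
```

Storing every interaction and updating from a uniformly sampled batch is what lets the networks be trained on decorrelated transitions.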
S6, when the unmanned aerial vehicle performs a trajectory tracking task, setting the target tracking position, inputting the current state of the unmanned aerial vehicle into the trained first control network model, executing the control quantity output by the network model, and updating the state of the unmanned aerial vehicle; these steps are repeated until the target tracking task is completed.
Further, in step S1, the Markov decision process tuple <S, A, P, R, γ> for dynamic tracking of the unmanned aerial vehicle is specifically:

s_t = (p_x, p_y, p_z, v_x, v_y, v_z, t_x, t_y, t_z),  a = (a_x, a_y, a_z)

where S is the finite state set and s_t is the current state, a member of the state set S; p_x, p_y, p_z are the position coordinates of the unmanned aerial vehicle along the three coordinate axes of a Cartesian coordinate system, v_x, v_y, v_z are its velocities along the three coordinate axes, and t_x, t_y, t_z are the target position coordinates of the unmanned aerial vehicle along the three coordinate axes. A is the finite control quantity set and a is a control quantity representation in the control quantity set A, in which a_x, a_y, a_z are the accelerations of the unmanned aerial vehicle along the three coordinate axes. P is the state transition probability matrix conditioned on the control quantity; P^a_{s_t s_{t+1}} represents a set of transition probabilities in the state transition probability matrix P, i.e., the probability of transitioning from the current state s_t to the next state s_{t+1} under the control quantity a, with the reward R_{t+1} obtained in the process; the subscripts t and t+1 in the above variables denote the current time step and the next time step, respectively. R in the tuple <S, A, P, R, γ> is the reward function based on state and control quantity; R^a_{s_t} represents a set of reward values in the reward function R, i.e., the reward obtained after executing the control quantity a from the current state s_t. γ is the discount factor; S_t is the finite state set at the current time step t, and A_t is the finite control quantity set at the current time step t.
Further, the motion state equation for dynamic tracking of the unmanned aerial vehicle in step S2 is:

s_{t+1} = f(S_t = s_t, A_t = a | π_θ(s_t, a)) + o_t

where π_θ(s_t, a) is the unmanned aerial vehicle trajectory tracking control strategy, θ is the strategy parameter, π_θ(s_t, a) = P[A_t = a | S_t = s_t, θ] is the probability distribution matrix of executing the control quantity a in a given state s_t ∈ S under the strategy parameter θ, f(S_t = s_t, A_t = a | π_θ(s_t, a)) is the dynamic equation obtained by continuously updating s_t, a, s_{t+1} through the control method, and o_t is a random noise function that enables exploration in a small range around the control quantity.
Furthermore, the unmanned aerial vehicle adopts the trajectory tracking control strategy π_θ(s_t, a) to execute the control quantity a and then updates s_t, a, s_{t+1}, where (p′_x, p′_y, p′_z) is the position of the unmanned aerial vehicle in state s_{t+1}, (v′_x, v′_y, v′_z) is its velocity in state s_{t+1}, and (t_x, t_y, t_z) is the tracked target position; the update formula advances the position and velocity by discrete-time integration of the velocity and acceleration over one time step.
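The state update described above can be sketched as follows; the patent's own update formula is not reproduced in the text, so this is a minimal sketch assuming semi-implicit Euler integration with an assumed step `dt` (both assumptions, not the patent's stated scheme).

```python
def update_state(s_t, a, dt=0.1):
    """Advance the 9-dim state one step; dt and the Euler scheme are assumptions."""
    p, v, target = s_t[0:3], s_t[3:6], s_t[6:9]
    v_new = tuple(vi + ai * dt for vi, ai in zip(v, a))      # v' = v + a*dt
    p_new = tuple(pi + vi * dt for pi, vi in zip(p, v_new))  # p' = p + v'*dt
    return p_new + v_new + target  # the target position is not moved by the drone

s_next = update_state((0.0,) * 6 + (1.0, 1.0, 1.0), (1.0, 0.0, 0.0), dt=1.0)
```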
further, in step S3, the first control network and the second control network have the same structure and each include an input layer, three hidden layers, and an output layer, where the input layer is configured to receive the state data of the unmanned aerial vehicle, output the state data sequentially through the first hidden layer, the second hidden layer, the third hidden layer, and then output the state data through the output layer, and all the layers are fully connected;
the first evaluation network and the second evaluation network have the same structure and respectively comprise two input layers, four hidden layers and an output layer, wherein one input layer receives state data of the unmanned aerial vehicle, then outputs the state data sequentially pass through the first hidden layer and the second hidden layer and then input into the fourth hidden layer, the other input layer receives control quantity, then outputs the state data pass through the third hidden layer and then input into the fourth hidden layer, the output of the fourth hidden layer outputs evaluation value through the output layer, and all the layers adopt a full connection mode;
the first compensation network and the second compensation network are identical in structure and respectively comprise an input layer, three hidden layers and an output layer, wherein the input layer receives state data of the unmanned aerial vehicle on the one hand, receives output of the first control network or the second control network on the other hand, then the output sequentially passes through the first hidden layer, the second hidden layer and the third hidden layer, and then the output is carried out through the output layer, and all the layers adopt a full connection mode.
Further, the temporal-difference error value Loss in step S5 is calculated as:

Loss = (1/X) Σ_{i=1}^{X} (Q_Target − Q(s_i, a_i | θ_Q, ω_C))²

where Q_Target is the target value, Q_Target = R_{i+1} + γQ′(s_{i+1}, A′(s_{i+1} | θ_{A′}, ω_{C′})), γ is the discount factor, Q′(s_{i+1}, A′(s_{i+1} | θ_{A′}, ω_{C′})) is the value of the second evaluation value after sampling, and Q(s_i, a_i | θ_Q, ω_C) is the value of the first evaluation value after sampling; θ_Q, θ_A, and ω_C are respectively the parameters of the cost function Q(s_t, a | θ_Q) of the first evaluation network, the control quantity function A(s_t | θ_A) of the first control network, and the function C(s_t | ω_C) of the first compensation network; θ_{Q′}, θ_{A′}, and ω_{C′} are respectively the parameters of the cost function Q′(s_t, a | θ_{Q′}) of the second evaluation network, the control quantity function A′(s_t | θ_{A′}) of the second control network, and the function C′(s_t | ω_{C′}) of the second compensation network.
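The Loss and Q_Target computation can be sketched as below, with `q_first` and `q_second_target` standing in for the first and second evaluation networks (dummy lambdas here, not the real networks); γ = 0.999 follows the value used in the embodiment.

```python
def td_loss(batch, q_first, q_second_target, gamma=0.999):
    """Mean squared TD error between Q_Target and the first evaluation value."""
    total = 0.0
    for s_i, a_i, r_next, s_next in batch:
        q_target = r_next + gamma * q_second_target(s_next)  # Q_Target
        total += (q_target - q_first(s_i, a_i)) ** 2
    return total / len(batch)

# dummy critics standing in for the two evaluation networks
batch = [((0.0,), (0.0,), 1.0, (1.0,)), ((1.0,), (0.0,), 0.0, (2.0,))]
loss = td_loss(batch, lambda s, a: 0.0, lambda s: 0.0)
```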
Further, the target gradient function ∇_{θ_A}J in step S5 is given by:

∇_{θ_A}J ≈ (1/X) Σ_{i=1}^{X} ∇_a Q(s_i, a | θ_Q)|_{a=A(s_i | θ_A)} ∇_{θ_A} A(s_i | θ_A)

where ∇_a is the gradient operator with a as the argument, ∇_{θ_A} is the gradient operator with θ_A as the argument, and ∇_{ω_C} is the gradient operator with ω_C as the argument.

The network parameters are updated as follows: using the temporal-difference error value Loss and the target gradient function ∇_{θ_A}J, the first evaluation network parameter θ_Q and the first control network parameter θ_A are respectively updated by the gradient ascent method with the Adam optimizer in a Python machine learning library; the first evaluation network parameter θ_Q and the first control network parameter θ_A are then used to update the second evaluation network parameter and the second control network parameter in small amplitude, with the formula:

θ_{Q′} ← ηθ_Q + (1 − η)θ_{Q′},  θ_{A′} ← ηθ_A + (1 − η)θ_{A′}

where η ∈ [0,1] is the update parameter; the compensation network parameters are updated by the gradient descent method, with the parameter update formula:

ω_C ← ω_C − εΔω

where Δω is the parameter update amount and ε ∈ [0,1] is the learning rate of the compensation network.
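The small-amplitude target update and the compensation network's gradient-descent step can be sketched as plain operations on parameter lists; the parameter vectors and gradients used below are placeholders.

```python
def soft_update(online, target, eta=0.001):
    """Small-amplitude target update: theta' <- eta*theta + (1 - eta)*theta'."""
    return [eta * p + (1.0 - eta) * tp for p, tp in zip(online, target)]

def compensation_step(omega, grad, eps=0.001):
    """Gradient-descent step on the compensation network parameters."""
    return [w - eps * g for w, g in zip(omega, grad)]

# exaggerated eta/eps make the effect visible in one step
new_target = soft_update([1.0, -1.0], [0.0, 0.0], eta=0.5)
new_omega = compensation_step([1.0], [2.0], eps=0.5)
```

With the small η and ε of the embodiment (0.001), the target networks trail the online networks slowly, which stabilizes the bootstrapped target value Q_Target.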
The invention relates to a state-compensated deep deterministic policy gradient control system for unmanned aerial vehicle trajectory tracking, which comprises:
the tuple construction module, used for establishing a Markov decision process tuple <S, A, P, R, γ> for dynamic tracking of the unmanned aerial vehicle, where S is a finite state set, A is a finite control quantity set, P is the state transition probability matrix conditioned on the control quantity, R is the reward function based on state and control quantity, and γ is the discount factor;
the motion state equation establishing module is used for establishing a motion state equation dynamically tracked by the unmanned aerial vehicle, and the motion state equation comprises the state of the next time step and an unmanned aerial vehicle trajectory tracking control strategy;
the network construction module is used for constructing a first control network, a first evaluation network, a first compensation network, and a second control network, a second evaluation network and a second compensation network which have the same network structure;
the execution control module is used for executing each network and obtaining a plurality of state transition processes;
and the network parameter updating module is used for updating each network parameter through the target gradient function until the network parameters accord with the training target.
An apparatus of the present invention includes a memory and a processor, wherein:
a memory for storing a computer program capable of running on the processor;
and the processor, used for executing the steps of the above state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking when running the computer program.
A storage medium of the present invention has stored thereon a computer program which, when executed by at least one processor, performs the steps of the state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking described above.
Beneficial effects: compared with the prior art, the invention has the following technical effects:
(1) in the method, a compensation network is added, and the action output of the control network is combined with the compensation output of the compensation network as the output action that interacts with the environment, so that the unmanned aerial vehicle tracks the dynamic target quickly in an accurate, continuous, and smooth manner; in addition, random noise is added to the generated control quantity, realizing exploration within a certain range and making the estimate of the control quantity more accurate;
(2) the method is verified with the OpenAI Gym toolkit, and the simulation results show that it significantly improves training efficiency and effectively improves precision and convergence stability; under the same computer configuration, the computational cost of the method is essentially the same as that of other reinforcement learning algorithms, so a better tracking effect is achieved without increasing computational cost;
(3) during training of the algorithm designed by the method, the target point randomly changes its spatial position, so that arbitrary target positions appear within the specified spatial range and the unmanned aerial vehicle learns to approach the target point quickly and smoothly, realizing trajectory tracking training; for the same training time, the settled tracking error is half that of a common policy-based reinforcement learning algorithm, and for the same tracking precision, the algorithm is markedly higher in learning efficiency than other similar reinforcement learning algorithms and has higher convergence stability; this indicates a degree of generalization, so the algorithm can be trained for different environments and applied to trajectory tracking tasks in fields such as unmanned vehicles, unmanned aerial vehicles, mobile robots, and multi-legged robots;
(4) for the unmanned aerial vehicle trajectory tracking control problem of an unknown-model system, reinforcement learning requires a series of iterative processes during learning, the training period of an initial strategy is long, and controller development demands a large amount of training time; by improving learning efficiency, the method shortens this development time.
Detailed Description
The embodiments of the present invention are described in detail below; examples thereof are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting its scope.
Aiming at the shortcomings of deep-reinforcement-learning-based unmanned aerial vehicle trajectory tracking control algorithms in unknown environments, such as low training efficiency and unstable convergence, and taking the improvement of learning efficiency without reducing training precision as the breakthrough point, the invention provides an unmanned aerial vehicle trajectory tracking control method: by adopting deep reinforcement learning and a state-compensated deep deterministic policy gradient algorithm that fuses networks over different state spaces, training efficiency and convergence stability can be significantly improved.
As shown in fig. 1, the state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking of the present invention comprises the following steps:
step 1, establishing a Markov decision process tuple < S, A, P, R, gamma > of unmanned aerial vehicle dynamic tracking:
therein, the tuple<S,A,P,R,γ>S in (a) is a finite state set, S in formula (a)
t Is a set of state representations in a state set S, where p
x 、p
y 、p
z Respectively are the position coordinates of the unmanned aerial vehicle in the directions of three coordinate axes under a Cartesian coordinate system,
respectively the speed of the unmanned aerial vehicle in the direction of three coordinate axes under a Cartesian coordinate system, t
x 、t
y 、t
z Respectively are the target position coordinates of the unmanned aerial vehicle in three coordinate axis directions under a Cartesian coordinate system,
describing the current position (p) by the drone in a Cartesian coordinate system
x ,p
y ,p
z ) Speed of the motor
And target position (t)
x ,t
y ,t
z ) Formed unmanned aerial vehicle state, tuple<S,A,P,R,γ>A in the formula (a) is a group of control quantity representations in the control quantity set A, wherein
Respectively represents the acceleration of the unmanned plane in the direction of three coordinate axes under a Cartesian coordinate system,
control components, tuples, which are performed in three cartesian coordinate axis directions, respectively, under system control are described<S,A,P,R,γ>P in (a) is a state transition probability matrix based on the control quantity
Representing a set of transition probabilities in the state transition probability matrix P,
indicating the current state s
t Passing through the control quantity a, and shifting to the next state s
t+1 The probability of transition in the process is obtained as R
t+1 The subscripts t and t +1 in the above variables denote the current time step and the next time step, tuple, respectively<S,A,P,R,γ>R in (a) is a reward function based on the state and the control quantity
Representing a set of prize values in a prize function R, i.e. from the current state s
t Awards, tuples, obtained after a control quantity a<S,A,P,R,γ>Gamma in (1) is an attenuation factor, S
t Is a finite state set of the current time step t, A
t Is a finite set of control quantities for the current time step t.
Step 2, establishing the motion state equation for dynamic tracking of the unmanned aerial vehicle:

s_{t+1} = f(S_t = s_t, A_t = a | π_θ(s_t, a)) + o_t    (b)

where π_θ(s_t, a) in formula (b) is the unmanned aerial vehicle trajectory tracking control strategy, θ is the strategy parameter, π_θ(s_t, a) = P[A_t = a | S_t = s_t, θ] is the probability distribution matrix of executing the control quantity a in a given state s_t ∈ S under the strategy parameter θ, f(S_t = s_t, A_t = a | π_θ(s_t, a)) is the dynamic equation obtained by continuously updating s_t, a, s_{t+1} through the control method of the invention, and o_t is a random noise function that enables exploration in a small range around the control quantity.
The unmanned aerial vehicle adopts the trajectory tracking control strategy π_θ(s_t, a) to execute the control quantity a and then updates s_t, a, s_{t+1}, where (p′_x, p′_y, p′_z) is the position of the unmanned aerial vehicle in state s_{t+1}, (v′_x, v′_y, v′_z) is its velocity in state s_{t+1}, and (t_x, t_y, t_z) is the tracked target position; the update mode is formula (c), which advances the position and velocity by discrete-time integration over one time step.
Step 3, respectively constructing a first control network, a first evaluation network, a first compensation network and a corresponding second control network, a second evaluation network and a second compensation network with the same network structure;
As shown in fig. 2, the first control network and the second control network have the same structure, each comprising an input layer, three hidden layers, and an output layer; the input layer receives the state data of the unmanned aerial vehicle, the output passes sequentially through the first, second, and third hidden layers and then through the output layer, and all layers are fully connected.
The first evaluation network and the second evaluation network have the same structure, each comprising two input layers, four hidden layers, and an output layer; one input layer receives the unmanned aerial vehicle state data, whose output passes through the first and second hidden layers before entering the fourth hidden layer, while the other input layer receives the control quantity, whose output passes through the third hidden layer before entering the fourth hidden layer; the output of the fourth hidden layer produces the evaluation value through the output layer, and all layers are fully connected.
The first compensation network and the second compensation network have the same structure, each comprising an input layer, three hidden layers, and an output layer; the input layer receives both the state data of the unmanned aerial vehicle and the output of the first or second control network, the output passes sequentially through the first, second, and third hidden layers and then through the output layer, and all layers are fully connected.
The input received by the evaluation network is the state s detected by the unmanned aerial vehicle and the control quantity a. The hidden layers processing the state and the control quantity are first kept separate: the input state passes through two hidden layers of 256 and 128 neurons, the input control quantity passes through one hidden layer of 128 neurons, and the two branches are fully connected through a final hidden layer of 128 neurons, after which the evaluation value is output. The input received by the control network is the state s detected by the unmanned aerial vehicle, and it outputs a specific value for each control quantity. The input of the compensation network is the output of the control network together with the state detected by the unmanned aerial vehicle; the output of the compensation network is added to the noise function and the output of the control network to form the control quantity a executed by the unmanned aerial vehicle. The control network and the compensation network are both designed with 3 hidden layers (with 256, 128, and 64 neurons, respectively); all layers are fully connected, and the activation functions used are relu and tanh (fig. 2).
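The two-branch evaluation network described above (state branch 256 → 128 neurons, control branch 128 neurons, a merged final 128-unit hidden layer, scalar output) can be sketched as follows; the weights are random placeholders illustrating the data flow only, not a trained critic.

```python
import random

random.seed(1)

def dense(x, cols):
    """Fully connected layer with fresh small random weights (illustration only)."""
    w = [[random.uniform(-0.05, 0.05) for _ in range(cols)] for _ in x]
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(cols)]

def relu(v):
    return [max(0.0, x) for x in v]

def evaluation_network(state, action):
    """Critic sketch: separate state (256 -> 128) and action (128) branches,
    fully connected through a final 128-unit hidden layer, then a scalar value."""
    hs = relu(dense(relu(dense(state, 256)), 128))  # state branch
    ha = relu(dense(action, 128))                   # control quantity branch
    merged = relu(dense(hs + ha, 128))              # list '+' concatenates branches
    return dense(merged, 1)[0]                      # scalar evaluation value

q = evaluation_network([0.1] * 9, [0.0, 0.2, -0.3])
```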
Step 4, executing each network to obtain a state transition process, and storing all the state transition processes into an experience playback pool;
As shown in fig. 3, the current state s_t of the unmanned aerial vehicle is input respectively to the first evaluation network, the first control network, and the first compensation network; the output of the first control network and the output of the first compensation network are added, random noise is added to the sum, and the result is used as the first control quantity a to interact with the environment, yielding a reward R_{t+1} and driving the unmanned aerial vehicle into the next state s_{t+1}, thereby guiding it to track the dynamic target. The first control quantity is also input to the first evaluation network, which evaluates its value given the current state s_t and outputs a first evaluation value. In the same way, the next state s_{t+1} of the unmanned aerial vehicle is input respectively to the second evaluation network, the second control network, and the second compensation network; the output of the second control network is input to the second compensation network, the two outputs are added to form a second control quantity input to the second evaluation network, which evaluates its value given the next state s_{t+1} and the second control quantity and outputs a second evaluation value; the second evaluation value is combined with the reward to obtain the target value Q_Target. The system then repeats the above process, taking the next state s_{t+1} as the current state s_t of the following step, and carries out the subsequent process. Each pass of s_t, a, R_{t+1}, s_{t+1} in the above process is described as a transition process; multiple environment interactions generate multiple transition processes (s_t, a, R_{t+1}, s_{t+1}), and all of them are stored in the experience replay pool, where s_{t+1} is the next state obtained after executing the control quantity a in state s_t, and R_{t+1} is the reward obtained in the process.
Step 5, randomly sampling a batch of X transition processes (s_i, a_i, R_{i+1}, s_{i+1}) from the experience replay pool, calculating the temporal-difference (TD) error of each first evaluation value against the target value, updating each network parameter through the target gradient function, updating the second evaluation network parameter and the second control network parameter in small amplitude until the training target is met, and saving each network model, where i = 1, 2, …, X is the sample index.
The compensation network assisting exploration during training is a neural network whose input is the unmanned aerial vehicle state and whose output is a compensation control quantity. Let θ_Q, θ_A, and ω_C be the parameters of the cost function Q(s_t, a | θ_Q) of the first evaluation network, the control quantity function A(s_t | θ_A) of the first control network, and the function C(s_t | ω_C) of the first compensation network, respectively, and let θ_{Q′}, θ_{A′}, and ω_{C′} be the parameters of the cost function Q′(s_t, a | θ_{Q′}) of the second evaluation network, the control quantity function A′(s_t | θ_{A′}) of the second control network, and the function C′(s_t | ω_{C′}) of the second compensation network, respectively. With the sampled first evaluation value Q(s_i, a_i | θ_Q, ω_C) and the sampled second evaluation value Q′(s_{i+1}, A′(s_{i+1} | θ_{A′}, ω_{C′})), the temporal-difference error value Loss is formula (d):

Loss = (1/X) Σ_{i=1}^{X} (Q_Target − Q(s_i, a_i | θ_Q, ω_C))²    (d)
where the target value in formula (d) is Q_Target = R_{i+1} + γQ′(s_{i+1}, A′(s_{i+1} | θ_{A′}, ω_{C′})), with γ the discount factor. The target gradient function ∇_{θ_A}J is formula (e):

∇_{θ_A}J ≈ (1/X) Σ_{i=1}^{X} ∇_a Q(s_i, a | θ_Q)|_{a=A(s_i | θ_A)} ∇_{θ_A} A(s_i | θ_A)    (e)

where ∇_a is the gradient operator with a as the argument, ∇_{θ_A} is the gradient operator with θ_A as the argument, and ∇_{ω_C} is the gradient operator with ω_C as the argument.
The network parameters are updated as follows:
using the temporal-difference error value Loss and the target gradient function ∇_{θ_A}J, the first evaluation network parameter θ_Q and the first control network parameter θ_A are respectively updated by the gradient ascent method with the Adam optimizer in a Python machine learning library;
the first evaluation network parameter θ_Q and the first control network parameter θ_A are used to update the second evaluation network parameter and the second control network parameter in small amplitude, with formula (f):

θ_{Q′} ← ηθ_Q + (1 − η)θ_{Q′},  θ_{A′} ← ηθ_A + (1 − η)θ_{A′}    (f)

where η ∈ [0,1] is the update parameter;
the first compensation network parameter is updated by the gradient descent method of formula (g):

ω_C ← ω_C − εΔω    (g)

where Δω is the parameter update amount given by the target gradient function above, and ε ∈ [0,1] is the learning rate of the compensation network.
The execution flow diagram of step 4 and step 5 is shown in fig. 4.
(1) Initialization, comprising the following steps:
initializing the first control network, first evaluation network, first compensation network, second control network, second evaluation network, and second compensation network;
initializing the experience replay pool;
randomly initializing the noise function and the unmanned aerial vehicle state s_t.
(2) Network training, i.e., training the network model according to step 4 and step 5:
first, the unmanned aerial vehicle executes a control quantity a;
the next state s_{t+1} of the unmanned aerial vehicle and the reward R_{t+1} are obtained;
the unmanned aerial vehicle state transition process (s_t, a, R_{t+1}, s_{t+1}) is stored;
the target value is calculated, and the parameters of the control network, evaluation network, and compensation network are updated through the temporal-difference error value and the target gradient function.
(3) It is judged whether the training target is met, i.e., whether the set number of training rounds or the set cumulative reward value has been reached; if so, the trained model is saved and the simulation experiment is carried out; if not, the flow returns to step (2).
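The flow of steps (1)-(3) can be condensed into a skeleton loop; `env_step`, `select_action`, and `update_networks` are placeholder stand-ins for the simulation environment, the control-plus-compensation-plus-noise action, and the batch parameter update, and the episode/horizon counts are illustrative only.

```python
def train(env_step, select_action, update_networks, episodes=2, horizon=5):
    """Skeleton of steps (1)-(3): interact, store transitions, update, repeat."""
    pool, returns = [], []
    for _ in range(episodes):
        s_t, total = (0.0, 0.0, 0.0), 0.0
        for _ in range(horizon):
            a = select_action(s_t)                 # control + compensation + noise
            s_next, r_next = env_step(s_t, a)
            pool.append((s_t, a, r_next, s_next))  # experience replay pool
            if len(pool) >= 4:
                update_networks(pool[-4:])         # batch sample and update stub
            s_t, total = s_next, total + r_next
        returns.append(total)                      # cumulative reward per episode
    return returns

# trivially stubbed environment and networks, just to exercise the flow
rets = train(lambda s, a: (s, -1.0), lambda s: (0.0, 0.0, 0.0), lambda b: None)
```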
Step 6, when the unmanned aerial vehicle performs a trajectory tracking task, the target tracking position is set, the current state of the unmanned aerial vehicle is input into the trained first control network model, the control quantity output by the network model is executed, and the state of the unmanned aerial vehicle is updated; these steps are repeated until the target tracking task is completed.
The invention relates to a state-compensated deep deterministic policy gradient control system for unmanned aerial vehicle trajectory tracking, which comprises:
the tuple construction module is used for establishing a Markov decision process tuple < S, A, P, R, gamma > dynamically tracked by the unmanned aerial vehicle;
the motion state equation establishing module is used for establishing a motion state equation dynamically tracked by the unmanned aerial vehicle, and the motion state equation comprises the state of the next time step and an unmanned aerial vehicle trajectory tracking control strategy;
the network construction module is used for constructing a first control network, a first evaluation network, a first compensation network, and a second control network, a second evaluation network and a second compensation network which have the same network structure;
the execution control module is used for executing each network and obtaining a plurality of state transition processes;
and the network parameter updating module is used for updating each network parameter through the target gradient function until the network parameters accord with the training target.
An apparatus of the present invention includes a memory and a processor, wherein:
a memory for storing a computer program capable of running on the processor;
and the processor, used for executing the steps of the above state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking when running the computer program, achieving technical effects consistent with the method.
The storage medium of the present invention stores a computer program which, when executed by at least one processor, implements the steps of the above state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking, achieving technical effects consistent with the method.
Example (b): in order to verify that the method disclosed by the invention adopts a state compensation depth certainty strategy gradient method fusing different state space networks, the training efficiency and the convergence stability can be obviously improved, and the method disclosed by the invention is simulated through a simulation experiment. The experimental operating platform is configured as follows: intel (R) core (TM) i5-7300HQ, simulation toolkit: OpenAI Gym. Two sets of comparative simulation tests were performed in this example: 1) the method adopts a depth deterministic strategy gradient (DDPG) algorithm as a comparison reference, and is a state-compensated depth deterministic strategy gradient control method, so the method is abbreviated as CDDPG algorithm; 2) since the DDPG is developed by an Actor-critic (QAC) algorithm based on value, the QAC algorithm is also used as a comparison reference in the present embodiment, and is improved according to the algorithm of the present invention to become a state compensated QAC algorithm, which is referred to as CQAC for short. The four algorithms use the absolute root mean square error of the absolute track error as a quantitative evaluation index.
In the simulation environment, the flight range of the unmanned aerial vehicle is set to a 10 × 10 × 10 (m) three-dimensional space, and the tracking training precision r is set to 0.3 m: during training, the unmanned aerial vehicle tracking to within 0.3 m of the target point is regarded as a successful follow, and each successful follow is regarded as one complete experience. During this process, the target point randomly changes its spatial position to realize trajectory tracking training of the unmanned aerial vehicle. The target point is tracked using the four algorithms QAC, DDPG, CQAC and CDDPG. A complete experience comprises a number of iteration steps; the reward after each iteration step is the negative of the straight-line distance between the tracking point and the target point, and the reward of a complete experience is defined as the sum of the rewards over all its iterations, so the reward is at most 0. The attenuation factor γ is set to 0.999 and the learning rates η = ε = 0.001, and the simulation is performed.
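The reward scheme described above can be sketched as follows: the per-iteration reward is the negative straight-line distance to the target, and a complete experience's reward is the sum over its iteration steps, hence at most 0. The function names are illustrative.

```python
import numpy as np

def step_reward(position, target):
    """Per-iteration reward: negative straight-line distance to the target."""
    return -np.linalg.norm(np.asarray(position) - np.asarray(target))

def episode_reward(trajectory, target):
    """Reward of a complete experience: sum over all iteration steps (<= 0)."""
    return sum(step_reward(p, target) for p in trajectory)
```

A perfect follow (zero distance at every step) would yield the maximum reward of 0, which is why the curves in the training figures approach zero from below.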
After 1000 successful follows are completed, all training models are saved for subsequent effect verification. Fig. 5 shows the rewards over the 1000 experiences. It can be seen that from about the 11th experience onward the reward is close to zero, but from the earlier rewards the CQAC and CDDPG algorithms improve significantly faster during training than QAC and DDPG. As can be seen from the total number of iterations of each algorithm in Fig. 6, in the early stage of training the QAC and DDPG algorithms require tens of thousands of iterations per successful follow, whereas CQAC and CDDPG require only about 30% of the iteration steps of the original algorithms. As can be seen in Fig. 7, the per-step computation costs of the four algorithms are substantially the same (about 0.998 ms) under the same computer configuration, which fully illustrates that, at the same training accuracy, the training time is reduced by about 70%.
In order to verify the dynamic tracking effect of the trained models, a simulation experiment is designed in this embodiment: the target point completes a spiral trajectory in a Cartesian coordinate system and is to be tracked as accurately as possible. In the simulation, the target point completes the predetermined trajectory at uniform speed. With an angular speed of π/15 rad/s in the X-Y plane and a climb rate of π/30 m/s along the z axis, the target completes two circles of spiral motion with radius 3 m and pitch π m in space, and the trajectory takes 60 s to complete. Fig. 8 shows the trajectory tracking effect of the four algorithm models. The CQAC and CDDPG trajectories are visibly closer to the reference trajectory than those of QAC and DDPG. To further clarify the superiority of the proposed algorithm, Fig. 9 shows the tracking errors of the four algorithms; clearly, owing to the added compensation network, the tracking accuracy and convergence stability of the method of the invention are effectively improved, and the steady-state tracking error is reduced by about 50% compared with the original algorithms. The results show that the compensation network provides stronger active control capability during UAV flight, reducing position error and improving the tracking effect.
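Using the parameters stated for this experiment (radius 3 m, angular speed π/15 rad/s in the X-Y plane, climb rate π/30 m/s along z), the reference spiral can be generated as follows; the function name is illustrative.

```python
import numpy as np

def spiral_reference(t):
    """Reference spiral from the experiment: radius 3 m, angular speed
    pi/15 rad/s in the X-Y plane, climb rate pi/30 m/s along the z axis."""
    omega, vz, r = np.pi / 15, np.pi / 30, 3.0
    return np.array([r * np.cos(omega * t), r * np.sin(omega * t), vz * t])
```

Over the stated 60 s the phase advances by 4π (two full circles) while the target climbs 2π m, i.e. two pitches of π m each, consistent with the figures quoted in the text.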
The embodiment shows that the state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking can remarkably improve training efficiency and convergence stability; under the same computer configuration, its computation cost is essentially the same as that of other reinforcement learning algorithms, so a better tracking effect is achieved without increasing computation cost; at the same tracking accuracy, the learning efficiency during training is higher; and for the same training time, the steady-state tracking error is reduced by half compared with common policy-based reinforcement learning algorithms.