CN114967472A - State-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking - Google Patents

State-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking

Info

Publication number
CN114967472A
Authority
CN
China
Prior art keywords
network
control
unmanned aerial vehicle
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210687553.3A
Other languages
Chinese (zh)
Inventor
吴吉莹
杨忠
廖禄伟
何乃峰
王志永
王灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Ninglinghao Technology Co ltd
Nanjing Taiside Intelligent Technology Co ltd
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202210687553.3A
Publication of CN114967472A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04 - Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042 - Adaptive control systems, electric, involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking. The method adopts the deep deterministic policy gradient algorithm with a network structure comprising a control network, an evaluation network and a compensation network: the control quantity output by the control network and the compensation output of the compensation network are combined into the output control quantity that interacts with the environment, and the evaluation network performs value evaluation on that output control quantity, so that the unmanned aerial vehicle can quickly track a dynamic target in an accurate, continuous and smooth manner. Random noise is added to the generated control quantity, realizing exploration within a certain range and making the value estimation of the control quantity more accurate. Aiming at the low training efficiency and unstable convergence of deep-reinforcement-learning trajectory tracking control algorithms for unmanned aerial vehicles in unknown environments, the invention provides a state-compensated deep deterministic policy gradient algorithm combined with a state compensation network, which can significantly improve the training efficiency and convergence stability of the unmanned aerial vehicle in trajectory tracking tasks.

Description

State-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking
Technical Field
The invention relates to robot trajectory tracking technology, and in particular to a state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking.
Background
Trajectory tracking for a mobile robot is a time-sequential task: the trajectory tracking system must drive the robot to the prescribed positions within prescribed times, so the robot requires a controller with high-performance trajectory tracking capability. Designing such a controller is challenging: uncertainty of the dynamic model, strong environmental disturbances, under-actuation of the model and nonholonomic kinematic constraints must all be handled when designing a mobile robot trajectory tracking controller, while complex sensor suites and multi-layer software introduce noise and delay into information transmission. Classical control theory is not always sufficient to solve these problems effectively, and the specialized control methods developed for them often require a lengthy design process and laborious parameter tuning. Learning-based robot control is a recent research hotspot in the control field: it sets aside the dynamic model of the robot, learns the control law from a large amount of motion data, and has attracted wide attention in automatic control.
When the environment dynamics model is unknown, model-free reinforcement learning can directly evaluate the quality of a policy, or find an optimal value function and an optimal policy, through actual interaction between the agent and the environment. This problem has been studied extensively, from theory to experiment, and substantial results have been obtained in theoretical analysis, numerical computation and experimental verification. Value-based deep reinforcement learning algorithms such as Q-learning, Sarsa and the deep Q-network (DQN) can only handle a discrete control-quantity space, and therefore can only realize discrete directional control of the robot. When the control-quantity space is large or continuous, value-based reinforcement learning struggles to obtain good results. For trajectory tracking, a value-based control method with a discrete control-quantity space can hardly achieve accurate tracking; in this case the policy can be learned directly.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking.
The technical scheme is as follows: the state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking of the invention comprises the following steps:
S1, establishing a Markov decision process tuple < S, A, P, R, γ > of the unmanned aerial vehicle dynamic tracking, wherein S is a finite state set, A is a finite control quantity set, P is a state transition probability matrix based on the control quantity, R is a reward function based on state and control quantity, and γ is an attenuation factor;
S2, establishing a motion state equation of the unmanned aerial vehicle dynamic tracking;
S3, respectively constructing a first control network, a first evaluation network and a first compensation network, and a corresponding second control network, second evaluation network and second compensation network with the same network structure;
S4, executing each network, obtaining state transition processes, and storing all the state transition processes in an experience replay pool:
inputting the current state s_t of the unmanned aerial vehicle into the first control network and, together with the output of the first control network, into the first compensation network; adding the output of the first control network and the output of the first compensation network, adding random noise, and using the result as a first control quantity a to interact with the environment, so as to obtain a reward R_{t+1} and bring the unmanned aerial vehicle into the next state s_{t+1}, guiding the unmanned aerial vehicle to track the dynamic target; meanwhile taking the first control quantity as an input of the first evaluation network, which evaluates the value of the first control quantity in the current state s_t of the unmanned aerial vehicle and outputs a first evaluation value; inputting the next state s_{t+1} of the unmanned aerial vehicle into the second evaluation network, the second control network and the second compensation network respectively; the output of the second control network is input into the second compensation network, the output of the second control network and the output of the second compensation network are added to form a second control quantity, which is input into the second evaluation network; the second evaluation network evaluates the value according to the next state s_{t+1} of the unmanned aerial vehicle and the second control quantity and outputs a second evaluation value, and the second evaluation value is combined with the reward to obtain the target value Q_Target; the system repeats the above process, taking the next state s_{t+1} of the unmanned aerial vehicle as the current state s_t of the next round; each (s_t, a, R_{t+1}, s_{t+1}) in this process is described as a transition process, multiple transition processes (s_t, a, R_{t+1}, s_{t+1}) are generated after multiple environment interactions, and all transition processes (s_t, a, R_{t+1}, s_{t+1}) are stored in the experience replay pool, where s_{t+1} is the next state obtained after the first control quantity is executed in the current state s_t and R_{t+1} is the reward obtained in the process;
S5, randomly sampling a batch of X transition processes (s_i, a_i, R_{i+1}, s_{i+1}) from the experience replay pool, calculating the temporal-difference error between each first evaluation value and the corresponding target value, updating each network parameter through the target gradient function until the training target is met, and saving each network model, where i = 1, 2, …, X is the sampling index;
S6, when the unmanned aerial vehicle carries out the trajectory tracking task, setting the target tracking position, inputting the current state of the unmanned aerial vehicle into the trained first control network model, executing the control quantity output by the network model, and updating the state of the unmanned aerial vehicle; this is repeated until the target tracking task is completed.
Further, in step S1, the Markov decision process tuple < S, A, P, R, γ > of the unmanned aerial vehicle dynamic tracking is specifically:

$$\begin{cases} s_t = \left(p_x, p_y, p_z, \dot{p}_x, \dot{p}_y, \dot{p}_z, t_x, t_y, t_z\right) \in S \\ a = \left(\ddot{p}_x, \ddot{p}_y, \ddot{p}_z\right) \in A \\ P_{s_t s_{t+1}}^{a} = P\left[S_{t+1} = s_{t+1} \mid S_t = s_t, A_t = a\right] \\ R_{s_t}^{a} = \mathbb{E}\left[R_{t+1} \mid S_t = s_t, A_t = a\right] \end{cases}$$

where S is the finite state set and s_t is the current state, a group of states belonging to the state set S; p_x, p_y, p_z are the position coordinates of the unmanned aerial vehicle along the three coordinate axes of a Cartesian coordinate system, \dot{p}_x, \dot{p}_y, \dot{p}_z are the velocities of the unmanned aerial vehicle along the three coordinate axes, and t_x, t_y, t_z are the target position coordinates of the unmanned aerial vehicle along the three coordinate axes; A is the finite control quantity set and a is a group of control quantities in the control quantity set A, with \ddot{p}_x, \ddot{p}_y, \ddot{p}_z being the accelerations of the unmanned aerial vehicle along the three coordinate axes; P is the state transition probability matrix based on the control quantity, and P^a_{s_t s_{t+1}} represents a set of transition probabilities in the state transition probability matrix P, i.e. the probability of transitioning from the current state s_t to the next state s_{t+1} through the control quantity a while obtaining the reward R_{t+1}; the subscripts t and t+1 in the above variables denote the current time step and the next time step respectively; R in the tuple < S, A, P, R, γ > is the reward function based on state and control quantity, and R^a_{s_t} represents a set of reward values in the reward function R, i.e. the reward obtained from the current state s_t after executing the control quantity a; γ is the attenuation factor; S_t is the finite state set of the current time step t and A_t is the finite control quantity set of the current time step t.
Further, the motion state equation of the unmanned aerial vehicle dynamic tracking in step S2 is:

$$\begin{cases} \pi_\theta(s_t, a) = P\left[A_t = a \mid S_t = s_t, \theta\right] \\ s_{t+1} = f\left(S_t = s_t, A_t = a \mid \pi_\theta(s_t, a)\right) + o_t \end{cases}$$

where \pi_\theta(s_t, a) is the unmanned aerial vehicle trajectory tracking control strategy, θ is the strategy parameter, P[A_t = a | S_t = s_t, θ] is the probability distribution matrix of executing the control quantity a in a given state s_t ∈ S under the strategy parameter θ, f(S_t = s_t, A_t = a | \pi_\theta(s_t, a)) is the dynamic equation obtained by continuously updating s_t, a, s_{t+1} through the control method, and o_t is a random noise function that realizes exploration within a small range around the control quantity.
Furthermore, the unmanned aerial vehicle adopts the trajectory tracking control strategy \pi_\theta(s_t, a), executes the control quantity a = (\ddot{p}_x, \ddot{p}_y, \ddot{p}_z) and then updates s_t, a, s_{t+1}, with

$$s_{t+1} = \left(p'_x, p'_y, p'_z, \dot{p}'_x, \dot{p}'_y, \dot{p}'_z, t_x, t_y, t_z\right)$$

where (p'_x, p'_y, p'_z) is the position of the unmanned aerial vehicle in state s_{t+1}, (\dot{p}'_x, \dot{p}'_y, \dot{p}'_z) is the velocity in state s_{t+1}, and (t_x, t_y, t_z) is the tracked target position; the position and velocity are updated according to the corresponding update formula.
further, in step S3, the first control network and the second control network have the same structure and each include an input layer, three hidden layers, and an output layer, where the input layer is configured to receive the state data of the unmanned aerial vehicle, output the state data sequentially through the first hidden layer, the second hidden layer, the third hidden layer, and then output the state data through the output layer, and all the layers are fully connected;
the first evaluation network and the second evaluation network have the same structure and respectively comprise two input layers, four hidden layers and an output layer, wherein one input layer receives state data of the unmanned aerial vehicle, then outputs the state data sequentially pass through the first hidden layer and the second hidden layer and then input into the fourth hidden layer, the other input layer receives control quantity, then outputs the state data pass through the third hidden layer and then input into the fourth hidden layer, the output of the fourth hidden layer outputs evaluation value through the output layer, and all the layers adopt a full connection mode;
the first compensation network and the second compensation network are identical in structure and respectively comprise an input layer, three hidden layers and an output layer, wherein the input layer receives state data of the unmanned aerial vehicle on the one hand, receives output of the first control network or the second control network on the other hand, then the output sequentially passes through the first hidden layer, the second hidden layer and the third hidden layer, and then the output is carried out through the output layer, and all the layers adopt a full connection mode.
Further, the temporal-difference error value Loss in step S5 is calculated as:

$$\mathrm{Loss} = \frac{1}{X}\sum_{i=1}^{X}\left(Q_{\mathrm{Target}} - Q\left(s_i, a_i \mid \theta^Q, \omega^C\right)\right)^2$$

where Q_Target is the target value, Q_Target = R_{i+1} + γQ′(s_{i+1}, A′(s_{i+1} | θ^{A′}, ω^{C′})), γ is the attenuation factor, Q′(s_{i+1}, A′(s_{i+1} | θ^{A′}, ω^{C′})) is the value of the second evaluation value after sampling, Q(s_i, a_i | θ^Q, ω^C) is the value of the first evaluation value after sampling, θ^Q, θ^A and ω^C are respectively the parameters of the cost function Q(s_t, a | θ^Q) of the first evaluation network, the control quantity function A(s_t | θ^A) of the first control network and the function C(s_t | ω^C) of the first compensation network, and θ^{Q′}, θ^{A′} and ω^{C′} are respectively the parameters of the cost function Q′(s_t, a | θ^{Q′}) of the second evaluation network, the control quantity function A′(s_t | θ^{A′}) of the second control network and the function C′(s_t | ω^{C′}) of the second compensation network.
Further, the target gradient function ∇J in step S5 is given by:

$$\nabla J \approx \frac{1}{X}\sum_{i=1}^{X}\nabla_a Q\left(s_i, a \mid \theta^Q, \omega^C\right)\Big|_{a = A(s_i \mid \theta^A) + C(s_i \mid \omega^C)}\left[\nabla_{\theta^A} A\left(s_i \mid \theta^A\right) + \nabla_{\omega^C} C\left(s_i \mid \omega^C\right)\right]$$

where ∇_a denotes the gradient with respect to the control quantity a, ∇_{θ^A} denotes the gradient with respect to θ^A, and ∇_{ω^C} denotes the gradient with respect to ω^C;
the updating mode of each network parameter is as follows:
using timing difference error value Loss, target gradient function
Figure BDA0003700224420000055
Respectively updating the first evaluation network parameter theta by using an Adam optimizer in a Python machine learning library by adopting a gradient ascent method Q And a first control network parameter theta A (ii) a Using the first evaluation network parameter θ Q And a first control network parameter theta A And updating the second evaluation network parameter and the second control network parameter in a small amplitude, wherein the formula is as follows:
Figure BDA0003700224420000056
wherein eta belongs to [0,1] as an updating parameter;
updating the compensation network parameters by adopting a gradient descent method, wherein the parameter updating formula is as follows:
Figure BDA0003700224420000057
wherein the parameter updating amount is delta omega, and epsilon 0 and 1 is the learning rate of the compensation network.
The state-compensated deep deterministic policy gradient control system for unmanned aerial vehicle trajectory tracking of the invention comprises:
the tuple construction module, used for establishing the Markov decision process tuple < S, A, P, R, γ > of the unmanned aerial vehicle dynamic tracking, wherein S is a finite state set, A is a finite control quantity set, P is a state transition probability matrix based on the control quantity, R is a reward function based on state and control quantity, and γ is an attenuation factor;
the motion state equation establishing module is used for establishing a motion state equation dynamically tracked by the unmanned aerial vehicle, and the motion state equation comprises the state of the next time step and an unmanned aerial vehicle trajectory tracking control strategy;
the network construction module is used for constructing a first control network, a first evaluation network, a first compensation network, and a second control network, a second evaluation network and a second compensation network which have the same network structure;
the execution control module is used for executing each network and obtaining a plurality of state transition processes;
and the network parameter updating module is used for updating each network parameter through the target gradient function until the network parameters accord with the training target.
An apparatus of the present invention includes a memory and a processor, wherein:
a memory for storing a computer program capable of running on the processor;
and the processor, used for executing the steps of the above state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking when running the computer program.
A storage medium of the present invention has stored thereon a computer program which, when executed by at least one processor, performs the steps of the state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking described above.
Beneficial effects: compared with the prior art, the invention has the following technical effects:
(1) by adding the compensation network, the action output of the control network is combined with the compensation output of the compensation network as the output action interacting with the environment, so that the unmanned aerial vehicle can quickly track the dynamic target in an accurate, continuous and smooth manner; in addition, random noise is added to the generated control quantity, realizing exploration within a certain range and making the value estimation of the control quantity more accurate;
(2) the method has been verified with the OpenAI Gym toolkit, and the simulation results show that it significantly improves training efficiency and effectively improves accuracy and convergence stability; under the same computer configuration, the computational cost of the method is essentially the same as that of other reinforcement learning algorithms, so a better tracking effect is achieved without increasing the computational cost;
(3) during training of the algorithm designed by the invention, the target point randomly changes its spatial position, so that arbitrary target positions appear within a specified spatial range and the unmanned aerial vehicle learns to approach the target point quickly and smoothly, realizing the trajectory tracking training of the unmanned aerial vehicle; for the same training time, the settled tracking error is reduced by half compared with a common policy-learning-based reinforcement learning algorithm, and for the same tracking accuracy, the learning efficiency of the designed algorithm is significantly higher than that of other similar reinforcement learning algorithms, with higher convergence stability; this indicates that the algorithm has a certain generality, can be trained for different environments, and can be applied to trajectory tracking tasks in many fields such as unmanned vehicles, unmanned aerial vehicles, mobile robots and multi-legged robots;
(4) for the trajectory tracking control of an unmanned aerial vehicle whose model is unknown, reinforcement learning requires a series of iterative processes during learning, the training period of an initial strategy is long, and controller development needs a large amount of training time; the state compensation network of the invention shortens this training process without reducing the training accuracy.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a network structure designed by the present invention;
FIG. 3 is a schematic diagram of the method of the present invention;
FIG. 4 is a schematic flow chart illustrating the execution of step 4 and step 5 in the method of the present invention;
FIG. 5 is a comparison graph of rewards of the method of the present invention and a general strategy learning algorithm during training;
FIG. 6 is a comparison graph of the iteration time of the method of the present invention and the ordinary strategy learning algorithm in the training process;
FIG. 7 is a comparison graph of the operation time of the method of the present invention and the normal strategy learning algorithm in the training process;
FIG. 8 is a comparison graph of tracking trajectories of different algorithms of the unmanned aerial vehicle in the three-dimensional space in the embodiment;
FIG. 9 is a comparison graph of tracking errors generated by different algorithms in the three-dimensional trajectory tracking process of the unmanned aerial vehicle in the embodiment.
Detailed Description
The embodiments of the present invention will be described in detail below, examples of which are illustrated in the accompanying drawings, and the embodiments described below by referring to the drawings are exemplary only for the purpose of explaining the present invention and are not to be construed as limiting the scope of the present invention.
Aiming at the defects of deep-reinforcement-learning-based unmanned aerial vehicle trajectory tracking control algorithms in unknown environments, such as low training efficiency and unstable convergence, and taking the improvement of learning efficiency without reducing training accuracy as the breakthrough point, the invention provides an unmanned aerial vehicle trajectory tracking control method that adopts deep reinforcement learning and a state-compensated deep deterministic policy gradient algorithm fusing different state space networks, which can significantly improve training efficiency and convergence stability.
As shown in fig. 1, the state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking of the invention comprises the following steps:
Step 1, establishing the Markov decision process tuple < S, A, P, R, γ > of the unmanned aerial vehicle dynamic tracking:

$$\begin{cases} s_t = \left(p_x, p_y, p_z, \dot{p}_x, \dot{p}_y, \dot{p}_z, t_x, t_y, t_z\right) \in S \\ a = \left(\ddot{p}_x, \ddot{p}_y, \ddot{p}_z\right) \in A \\ P_{s_t s_{t+1}}^{a} = P\left[S_{t+1} = s_{t+1} \mid S_t = s_t, A_t = a\right] \\ R_{s_t}^{a} = \mathbb{E}\left[R_{t+1} \mid S_t = s_t, A_t = a\right] \end{cases} \tag{a}$$

In the tuple < S, A, P, R, γ >, S is the finite state set and s_t in formula (a) is a group of state representations in the state set S, where p_x, p_y, p_z are the position coordinates of the unmanned aerial vehicle along the three coordinate axes of a Cartesian coordinate system, \dot{p}_x, \dot{p}_y, \dot{p}_z are the velocities of the unmanned aerial vehicle along the three coordinate axes, and t_x, t_y, t_z are the target position coordinates of the unmanned aerial vehicle along the three coordinate axes; the state of the unmanned aerial vehicle is thus described by its current position (p_x, p_y, p_z), velocity (\dot{p}_x, \dot{p}_y, \dot{p}_z) and target position (t_x, t_y, t_z) in the Cartesian coordinate system. In the tuple < S, A, P, R, γ >, a in formula (a) is a group of control quantity representations in the control quantity set A, where \ddot{p}_x, \ddot{p}_y, \ddot{p}_z are the accelerations of the unmanned aerial vehicle along the three coordinate axes, describing the control components executed along the three Cartesian coordinate axes under system control. P in the tuple is the state transition probability matrix based on the control quantity; P^a_{s_t s_{t+1}} represents a set of transition probabilities in the state transition probability matrix P, i.e. the probability of transitioning from the current state s_t to the next state s_{t+1} through the control quantity a while obtaining the reward R_{t+1}; the subscripts t and t+1 in the above variables denote the current time step and the next time step respectively. R in the tuple is the reward function based on state and control quantity; R^a_{s_t} represents a set of reward values in the reward function R, i.e. the reward obtained from the current state s_t after executing the control quantity a. γ in the tuple is the attenuation factor, S_t is the finite state set of the current time step t, and A_t is the finite control quantity set of the current time step t.
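For illustration only, the state and control quantity of tuple (a) can be laid out as plain vectors. The 9-dimensional state and 3-dimensional control quantity follow the definitions above, while the array layout, names and numerical values in the following Python sketch are assumptions made for the example.

```python
import numpy as np

def make_state(position, velocity, target):
    """Assemble the state s_t = (p_x, p_y, p_z, p_x_dot, p_y_dot, p_z_dot, t_x, t_y, t_z)."""
    return np.concatenate([position, velocity, target]).astype(np.float32)

# Example: UAV at the origin and at rest, tracking a target at (3, 0, 1) m.
s_t = make_state(np.zeros(3), np.zeros(3), np.array([3.0, 0.0, 1.0]))

# The control quantity a is the commanded acceleration along the three coordinate axes.
a = np.array([0.5, 0.0, 0.1], dtype=np.float32)
```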
Step 2, establishing a motion state equation of the dynamic tracking of the unmanned aerial vehicle:
Figure BDA0003700224420000089
wherein, pi in the formula (b) θ (s t A) is an unmanned aerial vehicle trajectory tracking control strategy, theta is a strategy parameter, PA t =a|S t =s t ,θ]Is in a given state s t E.s and the probability distribution matrix of the execution control quantity a under the strategy parameter theta, f (S) t =s t ,A t =a|π θ (s t A)) is a continuous update of s by the control method of the invention t ,a,s t+1 The dynamic equation obtained, o t To achieve a random noise function that is explored in a small range around the control quantity.
Unmanned aerial vehicle adopts trajectory tracking control strategy pi θ (s t A) executing the control quantity
Figure BDA0003700224420000091
Post-update s t ,a,s t+1 Said
Figure BDA0003700224420000092
Wherein (p' x ,p′ y ,p′ z ) For unmanned aerial vehicle at state s t+1 Position of time and
Figure BDA0003700224420000093
is a state s t+1 Velocity of time, (t) x ,t y ,t z ) To track the target location, the update mode is formula (c).
Figure BDA0003700224420000094
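Update formula (c) appears only as an image in the original text. The sketch below therefore assumes a standard double-integrator (point-mass) update over a fixed time step dt, which is one common way to realize an acceleration-controlled tracking environment; the actual formula (c) of the patent may differ.

```python
import numpy as np

def step_state(s_t, a, dt=0.05):
    """Assumed stand-in for update formula (c): point-mass kinematics with acceleration a."""
    p, v, target = s_t[0:3], s_t[3:6], s_t[6:9]
    p_next = p + v * dt + 0.5 * a * dt ** 2   # assumed position update
    v_next = v + a * dt                       # assumed velocity update
    return np.concatenate([p_next, v_next, target]).astype(np.float32)
```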
Step 3, respectively constructing a first control network, a first evaluation network, a first compensation network and a corresponding second control network, a second evaluation network and a second compensation network with the same network structure;
As shown in fig. 2, the first control network and the second control network have the same structure, each comprising an input layer, three hidden layers and an output layer; the input layer receives the state data of the unmanned aerial vehicle, its output passes through the first, second and third hidden layers in turn and is then output through the output layer, and all layers are fully connected.
The first evaluation network and the second evaluation network have the same structure, each comprising two input layers, four hidden layers and an output layer; one input layer receives the state data of the unmanned aerial vehicle and its output passes through the first and second hidden layers in turn before entering the fourth hidden layer, the other input layer receives the control quantity and its output passes through the third hidden layer before entering the fourth hidden layer, the output of the fourth hidden layer produces the evaluation value through the output layer, and all layers are fully connected.
The first compensation network and the second compensation network have the same structure, each comprising an input layer, three hidden layers and an output layer; the input layer receives the state data of the unmanned aerial vehicle on the one hand and the output of the first control network or the second control network on the other hand, its output passes through the first, second and third hidden layers in turn and is then output through the output layer, and all layers are fully connected.
The input received by the evaluation network is the state s detected by the unmanned aerial vehicle and the control quantity a. The hidden layers processing the state and the control quantity are first separated: the input state passes through two hidden layers with 256 and 128 neurons in turn, the input control quantity passes through one hidden layer with 128 neurons, the two branches are fully connected through a last hidden layer with 128 neurons, and the evaluation value is then output jointly. The input received by the control network is the state s detected by the unmanned aerial vehicle, and it outputs a specific value for each control quantity. The input of the compensation network is the output of the control network together with the state detected by the unmanned aerial vehicle, and its output is added to the noise function and the output of the control network to form the control quantity a executed by the unmanned aerial vehicle. The control network and the compensation network are both designed with 3 hidden layers (with 256, 128 and 64 neurons respectively); all layers are fully connected, and the activation function used is relu or tanh (as shown in fig. 2).
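The layer sizes (256/128/64 for the control and compensation networks, a 256/128 state branch, a 128 control-quantity branch and a joint 128 layer for the evaluation network), the full connections and the relu activations are stated above. The Python sketch below renders them with PyTorch, which is only one possible choice of machine learning library; the tanh on the control and compensation outputs is an assumption about how the bounded control quantity is produced.

```python
import torch
import torch.nn as nn

class ControlNet(nn.Module):           # "control network": state -> control quantity
    def __init__(self, state_dim=9, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),   # bounded acceleration command (assumption)
        )

    def forward(self, s):
        return self.net(s)

class CompensationNet(nn.Module):      # "compensation network": (state, control output) -> compensation
    def __init__(self, state_dim=9, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )

    def forward(self, s, a_ctrl):
        return self.net(torch.cat([s, a_ctrl], dim=-1))

class EvaluationNet(nn.Module):        # "evaluation network": separate state and control-quantity branches
    def __init__(self, state_dim=9, action_dim=3):
        super().__init__()
        self.state_branch = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.action_branch = nn.Sequential(nn.Linear(action_dim, 128), nn.ReLU())
        self.joint = nn.Sequential(
            nn.Linear(128 + 128, 128), nn.ReLU(),
            nn.Linear(128, 1),         # scalar evaluation value
        )

    def forward(self, s, a):
        return self.joint(torch.cat([self.state_branch(s), self.action_branch(a)], dim=-1))
```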
Step 4, executing each network to obtain a state transition process, and storing all the state transition processes into an experience playback pool;
As shown in fig. 3, the current state s_t of the unmanned aerial vehicle is input into the first evaluation network, the first control network and the first compensation network; the output of the first control network and the output of the first compensation network are added, random noise is added, and the result is used as the first control quantity a to interact with the environment, so as to obtain the reward R_{t+1} and bring the unmanned aerial vehicle into the next state s_{t+1}, guiding the unmanned aerial vehicle to track the dynamic target; meanwhile the first control quantity is taken as an input of the first evaluation network, which evaluates the value of the first control quantity in the current state s_t of the unmanned aerial vehicle and outputs a first evaluation value. In the same way, the next state s_{t+1} of the unmanned aerial vehicle is input into the second evaluation network, the second control network and the second compensation network; the output of the second control network is input into the second compensation network, the output of the second control network and the output of the second compensation network are added to form the second control quantity, which is input into the second evaluation network; the second evaluation network evaluates the value according to the next state s_{t+1} of the unmanned aerial vehicle and the second control quantity and outputs a second evaluation value, and the second evaluation value is combined with the reward to obtain the target value Q_Target. The system then repeats the above process, taking the next state s_{t+1} of the unmanned aerial vehicle as the current state s_t of the next round. Each (s_t, a, R_{t+1}, s_{t+1}) in this process is described as a transition process; after multiple environment interactions, multiple transition processes (s_t, a, R_{t+1}, s_{t+1}) are generated, and all transition processes (s_t, a, R_{t+1}, s_{t+1}) are stored in the experience replay pool, where s_{t+1} is the next state obtained after executing the control quantity a in state s_t, and R_{t+1} is the reward obtained in this process.
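A minimal sketch of the interaction step just described: the control-network output and the compensation-network output are summed, random exploration noise is added, and the transition (s_t, a, R_{t+1}, s_{t+1}) is stored in the experience replay pool. The Gaussian form of the noise, the buffer capacity and the batch size are assumptions; the network classes are those of the previous sketch.

```python
import random
from collections import deque

import numpy as np
import torch

replay_pool = deque(maxlen=100_000)     # experience replay pool (capacity is an assumption)

def select_control_quantity(control_net, comp_net, s_t, noise_std=0.1):
    s = torch.as_tensor(s_t).unsqueeze(0)
    with torch.no_grad():
        a_ctrl = control_net(s)                       # output of the first control network
        a_comp = comp_net(s, a_ctrl)                  # output of the first compensation network
    a = (a_ctrl + a_comp).squeeze(0).numpy()
    a = a + np.random.normal(0.0, noise_std, size=a.shape)   # random exploration noise o_t (assumed Gaussian)
    return a.astype(np.float32)

def store_transition(s_t, a, r_next, s_next):
    replay_pool.append((s_t, a, r_next, s_next))      # (s_t, a, R_{t+1}, s_{t+1})

def sample_batch(batch_size=64):
    batch = random.sample(replay_pool, batch_size)    # X randomly sampled transition processes
    s, a, r, s2 = map(np.array, zip(*batch))
    return (torch.as_tensor(s), torch.as_tensor(a),
            torch.as_tensor(r, dtype=torch.float32), torch.as_tensor(s2))
```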
Step 5, randomly sampling X transfer processes(s) in batch from the experience playback pool i ,a i ,R i+1 ,s i+1 ) Calculating time sequence difference (TD) errors of each first evaluation value and each target value, updating each network parameter through a target gradient function, updating a second evaluation network parameter and a second control network parameter in a small amplitude until the network parameters accord with a training target, and storing each network model, wherein i is 1,2 and …, and X is a sampling sequence;
The compensation network assisting the exploration training is a neural network whose input is the unmanned aerial vehicle state and whose output is the compensation control quantity. Let θ^Q, θ^A and ω^C be the parameters of the cost function Q(s_t, a | θ^Q) of the first evaluation network, the control quantity function A(s_t | θ^A) of the first control network and the function C(s_t | ω^C) of the first compensation network, and let θ^{Q′}, θ^{A′} and ω^{C′} be the parameters of the cost function Q′(s_t, a | θ^{Q′}) of the second evaluation network, the control quantity function A′(s_t | θ^{A′}) of the second control network and the function C′(s_t | ω^{C′}) of the second compensation network. The value of the first evaluation value after sampling is Q(s_i, a_i | θ^Q, ω^C) and the value of the second evaluation value is Q′(s_{i+1}, A′(s_{i+1} | θ^{A′}, ω^{C′})); the temporal-difference error value Loss is:

$$\mathrm{Loss} = \frac{1}{X}\sum_{i=1}^{X}\left(Q_{\mathrm{Target}} - Q\left(s_i, a_i \mid \theta^Q, \omega^C\right)\right)^2 \tag{d}$$

where the target value in formula (d) is Q_Target = R_{i+1} + γQ′(s_{i+1}, A′(s_{i+1} | θ^{A′}, ω^{C′})) and γ is the attenuation factor. The target gradient function ∇J is formula (e):

$$\nabla J \approx \frac{1}{X}\sum_{i=1}^{X}\nabla_a Q\left(s_i, a \mid \theta^Q, \omega^C\right)\Big|_{a = A(s_i \mid \theta^A) + C(s_i \mid \omega^C)}\left[\nabla_{\theta^A} A\left(s_i \mid \theta^A\right) + \nabla_{\omega^C} C\left(s_i \mid \omega^C\right)\right] \tag{e}$$

where ∇_a denotes the gradient with respect to the control quantity a, ∇_{θ^A} denotes the gradient with respect to θ^A, and ∇_{ω^C} denotes the gradient with respect to ω^C.
The network parameters are updated as follows:
using the temporal-difference error value Loss and the target gradient function ∇J, the first evaluation network parameters θ^Q and the first control network parameters θ^A are updated by gradient ascent with the Adam optimizer of a Python machine learning library;
the first evaluation network parameters θ^Q and the first control network parameters θ^A are then used to update the second evaluation network parameters and the second control network parameters with a small amplitude, according to:

$$\theta^{Q'} \leftarrow \eta\,\theta^{Q} + (1-\eta)\,\theta^{Q'}, \qquad \theta^{A'} \leftarrow \eta\,\theta^{A} + (1-\eta)\,\theta^{A'} \tag{f}$$

where η ∈ [0, 1] is the update parameter;
the first compensation network parameters are updated by the gradient descent method of formula (g):

$$\Delta\omega = -\varepsilon\,\nabla_{\omega^C} J, \qquad \omega^{C} \leftarrow \omega^{C} + \Delta\omega \tag{g}$$

where Δω is the parameter update amount, ∇_{ω^C} J is the target gradient function above, and ε ∈ [0, 1] is the learning rate of the compensation network.
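A hedged sketch of one parameter update of step 5 under the reconstructions above: a mean-squared temporal-difference loss for the first evaluation network, gradient ascent on the evaluated value for the first control network with the Adam optimizer named in the text, a small-amplitude update of the second (target) networks with η, and a separate gradient-descent step for the compensation network. Since formulas (d) to (g) are given only as images, the exact objectives, in particular the one minimized by the compensation network, are assumptions.

```python
import torch
import torch.nn.functional as F

def update_once(nets, target_nets, optimizers, batch, gamma=0.999, eta=0.001):
    """nets = (control, evaluation, compensation) first networks; target_nets = the second networks.
    optimizers = (opt_eval, opt_ctrl, opt_comp), e.g. torch.optim.Adam instances; the learning
    rate of opt_comp plays the role of the compensation-network learning rate epsilon."""
    control, evaluation, comp = nets
    control_t, evaluation_t, comp_t = target_nets
    opt_eval, opt_ctrl, opt_comp = optimizers
    s, a, r, s2 = batch

    # Target value Q_Target = R_{i+1} + gamma * Q'(s_{i+1}, A'(s_{i+1}) + C'(s_{i+1}, A'(s_{i+1})))
    with torch.no_grad():
        a2 = control_t(s2)
        q_target = r.unsqueeze(-1) + gamma * evaluation_t(s2, a2 + comp_t(s2, a2))

    # Temporal-difference error (assumed mean-squared form): update the first evaluation network
    td_loss = F.mse_loss(evaluation(s, a), q_target)
    opt_eval.zero_grad(); td_loss.backward(); opt_eval.step()

    # Gradient ascent on the evaluated value: update the first control network
    a_ctrl = control(s)
    actor_objective = -evaluation(s, a_ctrl + comp(s, a_ctrl).detach()).mean()
    opt_ctrl.zero_grad(); actor_objective.backward(); opt_ctrl.step()

    # Gradient descent for the compensation network (assumed objective: the same TD loss)
    a_ctrl = control(s).detach()
    comp_loss = F.mse_loss(evaluation(s, a_ctrl + comp(s, a_ctrl)), q_target)
    opt_comp.zero_grad(); comp_loss.backward(); opt_comp.step()

    # Small-amplitude update of the second networks with eta in [0, 1]
    for net, tgt in ((evaluation, evaluation_t), (control, control_t)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1.0 - eta).add_(eta * p.data)
```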
The execution flow diagram of step 4 and step 5 is shown in fig. 4.
(1) Initialization, comprising:
initializing the first control network, first evaluation network, first compensation network, second control network, second evaluation network and second compensation network;
initializing the experience replay pool;
randomly initializing the noise function and the unmanned aerial vehicle state s_t.
(2) Network training, i.e. training the network models according to step 4 and step 5:
first, the unmanned aerial vehicle executes a control quantity a;
the next state s_{t+1} of the unmanned aerial vehicle and the reward R_{t+1} are obtained;
the state transition process (s_t, a, R_{t+1}, s_{t+1}) of the unmanned aerial vehicle is stored;
the target value is calculated, and the parameters of the control network, evaluation network and compensation network are updated through the temporal-difference error value and the target gradient function.
(3) Judging whether the training target is met, i.e. whether the set number of training runs or the set cumulative reward value has been reached: if so, the trained models are saved and the simulation experiment is carried out; if not, the procedure returns to step (2). A minimal sketch of this overall loop is given below.
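The skeleton below follows the training flow of fig. 4 and reuses the helper functions of the previous sketches. The environment object env is a placeholder whose reset() returns an initial state and whose step(a) returns the next state, the reward and a done flag; the episode count, warm-up length and batch size are assumptions.

```python
def train(env, nets, target_nets, optimizers, episodes=1000, batch_size=64, warmup=1000):
    """Steps (1)-(3): interact, store transitions, sample, update, check the training target."""
    for episode in range(episodes):
        s_t = env.reset()                      # random initial UAV state and target position
        done, episode_reward = False, 0.0
        while not done:
            a = select_control_quantity(nets[0], nets[2], s_t)   # control + compensation + noise
            s_next, r_next, done = env.step(a)                   # reward R_{t+1}, next state s_{t+1}
            store_transition(s_t, a, r_next, s_next)
            if len(replay_pool) >= warmup:
                update_once(nets, target_nets, optimizers, sample_batch(batch_size))
            s_t, episode_reward = s_next, episode_reward + r_next
        # Training target: a set number of successful tracking experiences or a cumulative reward value.
        print(f"experience {episode}: total reward {episode_reward:.2f}")
```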
Step 6, when the unmanned aerial vehicle carries out the trajectory tracking task, the target tracking position is set, the current state of the unmanned aerial vehicle is input into the trained first control network model, the control quantity output by the network model is executed, and the state of the unmanned aerial vehicle is updated; this is repeated until the target tracking task is completed.
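At deployment time step 6 uses only the trained first control network. A minimal sketch, assuming the model parameters were saved with torch.save(control_net.state_dict(), ...) and that the ControlNet class and environment placeholder of the previous sketches are available; the file name and step count are illustrative.

```python
import torch

def track_target(env, model_path="control_net.pt", steps=1200):
    control_net = ControlNet()
    control_net.load_state_dict(torch.load(model_path))   # trained first control network model
    control_net.eval()
    s_t = env.reset()                                      # set the target tracking position
    for _ in range(steps):
        with torch.no_grad():
            a = control_net(torch.as_tensor(s_t).unsqueeze(0)).squeeze(0).numpy()
        s_t, _, _ = env.step(a)                            # execute the control quantity, update the state
```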
The state-compensated deep deterministic policy gradient control system for unmanned aerial vehicle trajectory tracking of the invention comprises:
the tuple construction module is used for establishing a Markov decision process tuple < S, A, P, R, gamma > dynamically tracked by the unmanned aerial vehicle;
the motion state equation establishing module is used for establishing a motion state equation dynamically tracked by the unmanned aerial vehicle, and the motion state equation comprises the state of the next time step and an unmanned aerial vehicle trajectory tracking control strategy;
the network construction module is used for constructing a first control network, a first evaluation network, a first compensation network, and a second control network, a second evaluation network and a second compensation network which have the same network structure;
the execution control module is used for executing each network and obtaining a plurality of state transition processes;
and the network parameter updating module is used for updating each network parameter through the target gradient function until the network parameters accord with the training target.
An apparatus of the present invention includes a memory and a processor, wherein:
a memory for storing a computer program capable of running on the processor;
and the processor, used for executing the steps of the above state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking when running the computer program, achieving technical effects consistent with the above method.
The storage medium of the present invention stores a computer program which, when executed by at least one processor, implements the steps of the above state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking, achieving technical effects consistent with the above method.
Example: in order to verify that the state-compensated deep deterministic policy gradient method fusing different state space networks adopted by the invention can significantly improve training efficiency and convergence stability, the method of the invention was evaluated in a simulation experiment. The experimental platform is configured as follows: Intel(R) Core(TM) i5-7300HQ; simulation toolkit: OpenAI Gym. Two sets of comparative simulation tests were performed in this example: 1) the deep deterministic policy gradient (DDPG) algorithm is taken as a comparison baseline, and the method of the invention, being a state-compensated deep deterministic policy gradient control method, is abbreviated as the CDDPG algorithm; 2) since DDPG was developed from the value-based Actor-Critic (QAC) algorithm, the QAC algorithm is also taken as a comparison baseline and improved according to the algorithm of the invention into a state-compensated QAC algorithm, abbreviated as CQAC. The four algorithms use the root-mean-square error of the absolute trajectory error as the quantitative evaluation index.
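The quantitative index named above is the root-mean-square of the absolute trajectory (position) error; a small sketch of that computation follows, where the array names and shapes are assumptions.

```python
import numpy as np

def tracking_rmse(uav_positions, target_positions):
    """Root-mean-square error of the absolute trajectory error over one run.
    Both arguments are arrays of shape (T, 3) of Cartesian positions."""
    errors = np.linalg.norm(uav_positions - target_positions, axis=1)  # per-step absolute error
    return float(np.sqrt(np.mean(errors ** 2)))
```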
In the simulation environment the flight range of the unmanned aerial vehicle is set to a 10 × 10 × 10 (m) three-dimensional space and the tracking training precision r is set to 0.3 m; during training, the unmanned aerial vehicle is regarded as tracking successfully once it comes within 0.3 m of the target point, and each successful tracking is regarded as one complete experience; in this process the target point randomly changes its spatial position to realize the trajectory tracking training of the unmanned aerial vehicle. The four algorithms QAC, DDPG, CQAC and CDDPG are used to train the tracking of the target points. A complete experience comprises a number of iteration steps, the reward after each iteration step is the negative value of the straight-line distance between the tracking point and the target point, and the reward of an experience is defined as the total reward over all iterations in that complete experience, so the reward is at most 0. The learning weight γ is set to 0.999 and η = ε is set to 0.001, and the simulation is carried out.
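The per-step reward stated above is the negative straight-line distance between the tracking point and the target point, and tracking within 0.3 m counts as success. A minimal sketch follows; using the success radius as the termination test of an experience is an assumption about the environment implementation.

```python
import numpy as np

def step_reward(s_t, r_success=0.3):
    """Reward R_{t+1}: negative Euclidean distance between the UAV position and the target point."""
    p, target = s_t[0:3], s_t[6:9]
    distance = float(np.linalg.norm(p - target))
    reward = -distance
    done = distance <= r_success          # within 0.3 m of the target counts as successful tracking
    return reward, done
```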
After 1000 successful tracking experiences the training is finished, and all trained models are saved for subsequent verification. Fig. 5 shows the rewards of the 1000 experiences. It can be seen that from about the 11th experience onward the reward is close to zero, but the earlier rewards show that the CQAC and CDDPG algorithms improve significantly over QAC and DDPG during training. From the total number of iterations of each algorithm in fig. 6 it can be seen that, in the early stage of training, the QAC and DDPG algorithms need tens of thousands of iterations for each successful tracking, whereas the iteration steps required by CQAC and CDDPG are only 30% of those of the original algorithms. As can be seen in fig. 7, the computational cost of the four algorithms is essentially the same (about 0.998 ms) under the same computer configuration, which fully illustrates that the training time is reduced by about 70% for the same training accuracy.
In order to verify the dynamic tracking performance of the trained models, a simulation experiment is designed in this example in which the target point is tracked as accurately as possible while it completes a spiral trajectory in a Cartesian coordinate system. In the simulation the target point completes the prescribed trajectory at uniform speed: the angular speed of the spiral in the X-Y plane is π/15 rad/s and the climbing speed along the z axis is π/30 m/s, so that two turns of a spiral with radius 3 m and pitch π m are completed in space, and the trajectory tracking takes 60 s. Fig. 8 shows the trajectory tracking results of the four algorithm models. The CQAC and CDDPG algorithms are visibly closer to the reference trajectory than the QAC and DDPG algorithms. To further illustrate the superiority of the proposed algorithm, fig. 9 shows the tracking errors of the four algorithms; clearly, owing to the added compensation network, the tracking accuracy and convergence stability of the method of the invention are effectively improved, and the settled tracking error is reduced by about 50% compared with the original algorithms. The results show that the compensation network provides stronger active control capability during the flight of the unmanned aerial vehicle, reduces the position error and improves the tracking effect.
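The reference trajectory of this experiment is fully specified above (radius 3 m, angular speed π/15 rad/s in the X-Y plane, climb rate π/30 m/s along z, 60 s in total). A sketch of how the target-point positions could be generated follows; the spiral centre and the sampling step are assumptions.

```python
import numpy as np

def spiral_reference(duration=60.0, dt=0.05, radius=3.0, centre=(0.0, 0.0, 0.0)):
    """Target positions for the two turns of the spiral trajectory described in the example."""
    t = np.arange(0.0, duration, dt)
    omega, climb = np.pi / 15.0, np.pi / 30.0      # rad/s in the X-Y plane, m/s along z
    x = centre[0] + radius * np.cos(omega * t)
    y = centre[1] + radius * np.sin(omega * t)
    z = centre[2] + climb * t
    return np.stack([x, y, z], axis=1)             # array of shape (T, 3)
```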
The example shows that the state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking can significantly improve training efficiency and convergence stability; under the same computer configuration its computational cost is essentially the same as that of other reinforcement learning algorithms, so a better tracking effect is achieved without increasing the computational cost; during training, the learning efficiency is higher for the same tracking accuracy, and for the same training time the settled tracking error is reduced by half compared with a common policy-learning-based reinforcement learning algorithm.

Claims (10)

1. A state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking, characterized by comprising the following steps:
S1, establishing a Markov decision process tuple < S, A, P, R, γ > of the unmanned aerial vehicle dynamic tracking, wherein S is a finite state set, A is a finite control quantity set, P is a state transition probability matrix based on the control quantity, R is a reward function based on state and control quantity, and γ is an attenuation factor;
S2, establishing a motion state equation of the unmanned aerial vehicle dynamic tracking;
S3, respectively constructing a first control network, a first evaluation network and a first compensation network, and a corresponding second control network, second evaluation network and second compensation network with the same network structure;
S4, executing each network, obtaining state transition processes, and storing all the state transition processes in an experience replay pool:
inputting the current state s_t of the unmanned aerial vehicle into the first control network and, together with the output of the first control network, into the first compensation network; adding the output of the first control network and the output of the first compensation network, adding random noise, and using the result as a first control quantity a to interact with the environment, so as to obtain a reward R_{t+1} and bring the unmanned aerial vehicle into the next state s_{t+1}, guiding the unmanned aerial vehicle to track the dynamic target; meanwhile taking the first control quantity as an input of the first evaluation network, which evaluates the value of the first control quantity in the current state s_t of the unmanned aerial vehicle and outputs a first evaluation value; inputting the next state s_{t+1} of the unmanned aerial vehicle into the second evaluation network, the second control network and the second compensation network respectively, inputting the output of the second control network into the second compensation network, adding the output of the second control network and the output of the second compensation network to form a second control quantity and inputting it into the second evaluation network, the second evaluation network evaluating the value according to the next state s_{t+1} of the unmanned aerial vehicle and the second control quantity and outputting a second evaluation value, the second evaluation value being combined with the reward to obtain a target value Q_Target; the system repeats the above process, taking the next state s_{t+1} of the unmanned aerial vehicle as the current state s_t of the next round; each (s_t, a, R_{t+1}, s_{t+1}) in this process is described as a transition process, multiple transition processes (s_t, a, R_{t+1}, s_{t+1}) are generated after multiple environment interactions, and all transition processes (s_t, a, R_{t+1}, s_{t+1}) are stored in the experience replay pool;
S5, randomly sampling a batch of X transition processes (s_i, a_i, R_{i+1}, s_{i+1}) from the experience replay pool, where i = 1, 2, …, X is the sampling index, calculating the temporal-difference error between each first evaluation value and the corresponding target value, updating each network parameter through the target gradient function until the training target is met, and saving each network model;
S6, when the unmanned aerial vehicle carries out the trajectory tracking task, setting the target tracking position, inputting the current state of the unmanned aerial vehicle into the trained first control network model, executing the control quantity output by the network model, and updating the state of the unmanned aerial vehicle; this is repeated until the target tracking task is completed.
2. The state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking according to claim 1, characterized in that the Markov decision process tuple < S, A, P, R, γ > of the unmanned aerial vehicle dynamic tracking in step S1 is specifically:

$$\begin{cases} s_t = \left(p_x, p_y, p_z, \dot{p}_x, \dot{p}_y, \dot{p}_z, t_x, t_y, t_z\right) \in S \\ a = \left(\ddot{p}_x, \ddot{p}_y, \ddot{p}_z\right) \in A \\ P_{s_t s_{t+1}}^{a} = P\left[S_{t+1} = s_{t+1} \mid S_t = s_t, A_t = a\right] \\ R_{s_t}^{a} = \mathbb{E}\left[R_{t+1} \mid S_t = s_t, A_t = a\right] \end{cases}$$

wherein s_t is the current state, a group of states belonging to the state set S; p_x, p_y, p_z are the position coordinates of the unmanned aerial vehicle along the three coordinate axes of a Cartesian coordinate system, \dot{p}_x, \dot{p}_y, \dot{p}_z are the velocities of the unmanned aerial vehicle along the three coordinate axes, and t_x, t_y, t_z are the target position coordinates of the unmanned aerial vehicle along the three coordinate axes; a is a group of control quantity representations in the control quantity set A, with \ddot{p}_x, \ddot{p}_y, \ddot{p}_z being the accelerations of the unmanned aerial vehicle along the three coordinate axes; P^a_{s_t s_{t+1}} represents a set of transition probabilities in the state transition probability matrix P, i.e. the probability of transitioning from the current state s_t to the next state s_{t+1} through the control quantity a while obtaining the reward R_{t+1}; the subscripts t and t+1 in the above variables denote the current time step and the next time step respectively; R^a_{s_t} represents a set of reward values in the reward function R, i.e. the reward obtained from the current state s_t after executing the control quantity a; S_t is the finite state set of the current time step t and A_t is the finite control quantity set of the current time step t.
3. The state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking according to claim 1, characterized in that the motion state equation of the unmanned aerial vehicle dynamic tracking in step S2 is:

$$\begin{cases} \pi_\theta(s_t, a) = P\left[A_t = a \mid S_t = s_t, \theta\right] \\ s_{t+1} = f\left(S_t = s_t, A_t = a \mid \pi_\theta(s_t, a)\right) + o_t \end{cases}$$

wherein \pi_\theta(s_t, a) is the unmanned aerial vehicle trajectory tracking control strategy, θ is the strategy parameter, P[A_t = a | S_t = s_t, θ] is the probability distribution matrix of executing the control quantity a in a given state s_t ∈ S under the strategy parameter θ, f(S_t = s_t, A_t = a | \pi_\theta(s_t, a)) is the dynamic equation obtained by continuously updating s_t, a, s_{t+1} through the control method, and o_t is a random noise function that realizes exploration within a small range around the control quantity.
4. The state-compensated deep deterministic policy gradient control method for unmanned aerial vehicle trajectory tracking according to claim 3, characterized in that the unmanned aerial vehicle adopts the trajectory tracking control strategy \pi_\theta(s_t, a), executes the control quantity a = (\ddot{p}_x, \ddot{p}_y, \ddot{p}_z) and then updates s_t, a, s_{t+1}, with

$$s_{t+1} = \left(p'_x, p'_y, p'_z, \dot{p}'_x, \dot{p}'_y, \dot{p}'_z, t_x, t_y, t_z\right)$$

wherein (p'_x, p'_y, p'_z) is the position of the unmanned aerial vehicle in state s_{t+1}, (\dot{p}'_x, \dot{p}'_y, \dot{p}'_z) is the velocity in state s_{t+1}, and (t_x, t_y, t_z) is the tracked target position, the updated position and velocity being obtained from the corresponding update formulas.
5. the unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method according to claim 1, characterized in that in step S3, the first control network and the second control network have the same structure and each include an input layer, three hidden layers and an output layer, the input layer is used for receiving unmanned aerial vehicle state data, then the output sequentially passes through the first hidden layer, the second hidden layer and the third hidden layer, and then the output is performed through the output layer, and all the above layers are in a full connection mode;
the first evaluation network and the second evaluation network have the same structure and respectively comprise two input layers, four hidden layers and an output layer, wherein one input layer receives state data of the unmanned aerial vehicle, then outputs the state data sequentially pass through the first hidden layer and the second hidden layer and then input into the fourth hidden layer, the other input layer receives control quantity, then outputs the state data pass through the third hidden layer and then input into the fourth hidden layer, the output of the fourth hidden layer outputs evaluation value through the output layer, and all the layers adopt a full connection mode;
the first compensation network and the second compensation network are identical in structure and respectively comprise an input layer, three hidden layers and an output layer, wherein the input layer receives state data of the unmanned aerial vehicle on the one hand, receives output of the first control network or the second control network on the other hand, then the output sequentially passes through the first hidden layer, the second hidden layer and the third hidden layer, and then the output is carried out through the output layer, and all the layers adopt a full connection mode.
6. The unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method as claimed in claim 1, wherein the calculation formula of the temporal difference error value Loss in step S5 is as follows:

Loss = (1/N) Σ_{i=1}^{N} (Q_Target − Q(s_i, a_i | θ^Q, ω^C))²

wherein Q_Target is the target value, Q_Target = R_{i+1} + γ Q′(s_i, A′(s_i | θ^{A′}), ω^{C′}), γ is the attenuation factor, Q′(s_i, A′(s_i | θ^{A′}), ω^{C′}) is the value of the second evaluation network after sampling, and Q(s_i, a_i | θ^Q, ω^C) is the value of the first evaluation network after sampling; θ^Q, θ^A and ω^C are the parameters of the cost function Q(s_t, a | θ^Q) of the first evaluation network, the control quantity function A(s_t | θ^A) of the first control network and the function C(s_t | ω^C) of the first compensation network respectively; θ^{Q′}, θ^{A′} and ω^{C′} are the parameters of the cost function Q′(s_t, a | θ^{Q′}) of the second evaluation network, the control quantity function A′(s_t | θ^{A′}) of the second control network and the function C′(s_t | ω^{C′}) of the second compensation network respectively.
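A hedged sketch of computing Q_Target and the temporal difference error over a sampled minibatch, in PyTorch (assumed), using the EvalNet/ControlNet structures sketched after claim 5. Following common deep deterministic policy gradient practice, the second (target) networks are evaluated at the next sampled state and the loss is the mean squared difference; both choices are assumptions where the published formula image is not reproduced:

import torch
import torch.nn.functional as F

def td_loss(eval_net, target_eval_net, target_ctrl_net, batch, gamma=0.99):
    """Mean squared temporal difference error over a sampled minibatch.
    batch is a dict of tensors: states s, control quantities a, rewards r
    (shape (N, 1)) and next states s_next."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    with torch.no_grad():
        a_next = target_ctrl_net(s_next)                        # A'(s | theta^{A'})
        q_target = r + gamma * target_eval_net(s_next, a_next)  # Q_Target
    q = eval_net(s, a)                                          # Q(s_i, a_i | theta^Q)
    return F.mse_loss(q, q_target)                              # Loss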
7. The unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method of claim 1, wherein in step S5 the objective gradient function ∇_{θ^A}J is calculated as follows:

[claim formula image FDA0003700224410000043]

wherein ∇_a is the gradient operator with the control quantity a as the argument, ∇_{θ^A} is the gradient operator with θ^A as the argument, and ∇_{ω^C} is the gradient operator with ω^C as the argument;
the updating mode of each network parameter is as follows: using the temporal difference error value Loss and the objective gradient function ∇_{θ^A}J, the first evaluation network parameter θ^Q and the first control network parameter θ^A are updated respectively by a gradient ascent method using the Adam optimizer of a Python machine learning library; the first evaluation network parameter θ^Q and the first control network parameter θ^A are then used to update the second evaluation network parameter and the second control network parameter in a small amplitude, according to the formula:

θ^{Q′} ← η θ^Q + (1 − η) θ^{Q′},  θ^{A′} ← η θ^A + (1 − η) θ^{A′}

wherein η ∈ [0, 1] is the update parameter;
the compensation network parameters are updated by a gradient descent method, with the parameter update formula:

Δω = −ε ∇_{ω^C} Loss

wherein Δω is the parameter update amount and ε ∈ (0, 1) is the learning rate of the compensation network.
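A hedged end-to-end sketch of the update step in this claim, reusing the ControlNet/EvalNet/CompNet structures sketched after claim 5: the Adam optimizer is used as stated, the gradient ascent on the objective gradient function is realised by descending on the negated mean Q value (a standard deterministic policy gradient implementation, assumed here), and the small-amplitude update and compensation step follow the formulas above. How the compensation output enters the loss is not spelled out in the published text and is assumed here to be an additive correction to the control quantity:

import torch
import torch.nn.functional as F

def update_step(eval_net, ctrl_net, comp_net,
                target_eval_net, target_ctrl_net,
                eval_opt, ctrl_opt, batch,
                eta=0.01, eps=1e-3, gamma=0.99):
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # target value Q_Target from the second (target) networks, no gradient
    with torch.no_grad():
        q_target = r + gamma * target_eval_net(s_next, target_ctrl_net(s_next))

    # 1) first evaluation network: Adam step on the temporal difference error Loss
    loss = F.mse_loss(eval_net(s, a), q_target)
    eval_opt.zero_grad(); loss.backward(); eval_opt.step()

    # 2) first control network: gradient ascent on J = mean Q(s, A(s));
    #    descending on -J realises grad_a Q * grad_theta A by back-propagation
    j = eval_net(s, ctrl_net(s)).mean()
    ctrl_opt.zero_grad(); (-j).backward(); ctrl_opt.step()

    # 3) small-amplitude update of the second networks:
    #    theta' <- eta * theta + (1 - eta) * theta'
    with torch.no_grad():
        for net, tgt in ((eval_net, target_eval_net), (ctrl_net, target_ctrl_net)):
            for p, p_t in zip(net.parameters(), tgt.parameters()):
                p_t.mul_(1.0 - eta).add_(eta * p)

    # 4) compensation network: plain gradient descent, delta_omega = -eps * grad_omega(Loss);
    #    the compensation output is assumed to correct the control quantity additively
    comp_loss = F.mse_loss(eval_net(s, a + comp_net(s, a)), q_target)
    grads = torch.autograd.grad(comp_loss, list(comp_net.parameters()))
    with torch.no_grad():
        for p, g in zip(comp_net.parameters(), grads):
            p.add_(-eps * g)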
8. An unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control system is characterized by comprising:
the tuple construction module is used for establishing a Markov decision process tuple < S, A, P, R, gamma > of the unmanned aerial vehicle dynamic tracking, wherein S is a finite state set, A is a finite control quantity set, P is a state transition probability matrix based on control quantity, R is a reward function based on state and control quantity, and gamma is an attenuation factor;
the motion state equation establishing module is used for establishing a motion state equation dynamically tracked by the unmanned aerial vehicle, and the motion state equation comprises the state of the next time step and an unmanned aerial vehicle trajectory tracking control strategy;
the network construction module is used for constructing a first control network, a first evaluation network, a first compensation network, and a second control network, a second evaluation network and a second compensation network which have the same network structure;
the execution control module is used for executing each network and obtaining a plurality of state transition processes;
and the network parameter updating module is used for updating each network parameter through the target gradient function until the network parameters accord with the training target.
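A minimal, purely illustrative skeleton of how the five modules of this system claim could be composed; class, method and attribute names are assumptions, not taken from the patent:

class StateCompensationDDPGSystem:
    """Illustrative composition of the five modules of the system claim."""
    def __init__(self, tuple_module, motion_module, network_module,
                 execution_module, update_module):
        self.tuple_module = tuple_module          # builds the tuple <S, A, P, R, gamma>
        self.motion_module = motion_module        # motion state equation of dynamic tracking
        self.network_module = network_module      # the six control/evaluation/compensation networks
        self.execution_module = execution_module  # executes the networks and collects state transitions
        self.update_module = update_module        # updates parameters until the training target is met

    def train(self, episodes):
        mdp = self.tuple_module.build()
        nets = self.network_module.build()
        for _ in range(episodes):
            transitions = self.execution_module.run(mdp, nets, self.motion_module)
            self.update_module.update(nets, transitions)
        return nets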
9. An apparatus, comprising a memory and a processor, wherein:
a memory for storing a computer program capable of running on the processor;
a processor for executing the steps of the unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method according to any one of claims 1 to 7 when running the computer program.
10. A storage medium having stored thereon a computer program which, when executed by at least one processor, performs the steps of a method for unmanned aerial vehicle trajectory tracking state compensated depth deterministic policy gradient control of any of claims 1-7.
CN202210687553.3A 2022-06-17 2022-06-17 Unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method Pending CN114967472A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210687553.3A CN114967472A (en) 2022-06-17 2022-06-17 Unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210687553.3A CN114967472A (en) 2022-06-17 2022-06-17 Unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method

Publications (1)

Publication Number Publication Date
CN114967472A true CN114967472A (en) 2022-08-30

Family

ID=82963132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210687553.3A Pending CN114967472A (en) 2022-06-17 2022-06-17 Unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method

Country Status (1)

Country Link
CN (1) CN114967472A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116974204A (en) * 2023-08-23 2023-10-31 南通大学 Unmanned aerial vehicle tracking control model training method, using method and terminal equipment
CN116974204B (en) * 2023-08-23 2024-04-09 南通大学 Unmanned aerial vehicle tracking control model training method, using method and terminal equipment

Similar Documents

Publication Publication Date Title
CN114603564B (en) Mechanical arm navigation obstacle avoidance method, system, computer equipment and storage medium
CN112356830B (en) Intelligent parking method based on model reinforcement learning
CN112198870B (en) Unmanned aerial vehicle autonomous guiding maneuver decision method based on DDQN
Li et al. MPC-MPNet: Model-predictive motion planning networks for fast, near-optimal planning under kinodynamic constraints
CN112077839B (en) Motion control method and device for mechanical arm
CN111597702A (en) Rocket landing trajectory planning method and device
CN114169230A (en) Robot dynamics parameter identification method
CN116679719A (en) Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy
CN116587275A (en) Mechanical arm intelligent impedance control method and system based on deep reinforcement learning
CN116700327A (en) Unmanned aerial vehicle track planning method based on continuous action dominant function learning
CN115446867A (en) Industrial mechanical arm control method and system based on digital twinning technology
CN114967472A (en) Unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method
Vinogradska et al. Numerical quadrature for probabilistic policy search
CN117606490B (en) Collaborative search path planning method for autonomous underwater vehicle
CN113485323B (en) Flexible formation method for cascading multiple mobile robots
CN113503885B (en) Robot path navigation method and system based on sampling optimization DDPG algorithm
CN114779792B (en) Medical robot autonomous obstacle avoidance method and system based on simulation and reinforcement learning
CN116047904A (en) Personnel simulation reality mixed training method for robot operation skill learning
CN115027499A (en) Vehicle automatic driving prediction control method based on dynamic neural network Hammerstein model
Feng et al. Mobile robot obstacle avoidance based on deep reinforcement learning
Gao et al. Path optimization of welding robot based on ant colony and genetic algorithm
Samsani et al. Rapid Autonomous Vehicle Drifting with Deep Reinforcement Learning
CN114489035B (en) Multi-robot collaborative search method based on accumulated trace reinforcement learning
CN112693631B (en) Initial trajectory generation method and system in online sequential convex optimization of aircraft
CN113359471B (en) Self-adaptive dynamic programming optimal control method and system based on collaborative state assistance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231121

Address after: Building 1, 01, No. 109 Zhuangpai Road, Moling Street, Jiangning District, Nanjing City, Jiangsu Province, 211106 (Jiangning Development Zone)

Applicant after: NANJING TAISIDE INTELLIGENT TECHNOLOGY Co.,Ltd.

Address before: Zhong Ling Jie Nanjing Xuanwu District of Jiangsu Province, No. 50 210014

Applicant before: Nanjing Ninglinghao Technology Co.,Ltd.

Effective date of registration: 20231121

Address after: Zhong Ling Jie Nanjing Xuanwu District of Jiangsu Province, No. 50 210014

Applicant after: Nanjing Ninglinghao Technology Co.,Ltd.

Address before: No. 29, Qinhuai District, Qinhuai District, Nanjing, Jiangsu

Applicant before: Nanjing University of Aeronautics and Astronautics
