CN111667513B - Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning - Google Patents

Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning

Info

Publication number
CN111667513B
CN111667513B (application CN202010486053.4A)
Authority
CN
China
Prior art keywords
aerial vehicle
unmanned aerial
network
target
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010486053.4A
Other languages
Chinese (zh)
Other versions
CN111667513A (en)
Inventor
李波
杨志鹏
高晓光
万开方
梁诗阳
马浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010486053.4A priority Critical patent/CN111667513B/en
Publication of CN111667513A publication Critical patent/CN111667513A/en
Application granted granted Critical
Publication of CN111667513B publication Critical patent/CN111667513B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/277 Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20076 Probabilistic image processing
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to an unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning, which trains a neural network by decomposing the task and initializing the environment state, the neural network parameters and the other hyper-parameters. At the start of each round, the unmanned aerial vehicle executes actions that change its speed and course angle to obtain a new state; the experience of each round is stored in an experience pool as learning samples, and the parameters of the neural network are updated iteratively. When training is finished, the neural network parameters trained on the current subtask are saved and transferred to the unmanned aerial vehicle maneuvering target tracking network of the next task scenario, until the final task is completed.

Description

Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
Technical Field
The invention relates to an unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning, and belongs to the field of robot intelligent control.
Background
With the continuous development of unmanned aerial vehicle technology, unmanned aerial vehicles are widely used in civil applications. Among the many tasks carried out by unmanned aerial vehicles, monitoring and reconnaissance are the most common. If an unmanned aerial vehicle can autonomously and accurately track a moving target, it can expand the monitoring range while effectively avoiding threat areas, which greatly improves the efficiency of monitoring, reconnaissance and even attack.
Most existing research on unmanned aerial vehicle maneuvering targets focuses on state estimation and measurement information processing of the maneuvering target; how the unmanned aerial vehicle should decide its own maneuvering behavior once the target state is determined, so that it can track the target better, has rarely been studied. Traditional unmanned aerial vehicle maneuvering target tracking algorithms mainly rely on the accuracy of the target motion model: if there is a large error between the assumed environment model and the actual motion of the target, factors that cannot be estimated from the target state will disturb the tracking process, and building a maneuvering model of the target is itself time-consuming. The environment in which drones track targets can be complex, dynamically changing and even uncertain, and the target tracking tasks undertaken by drones are becoming increasingly demanding. Taken together, these factors place higher requirements on the autonomy of unmanned aerial vehicles, which increasingly need autonomous learning capability. Therefore, studying tracking methods that depend little on an environment model, or need no model at all, and that can be learned through interaction with the environment, is very meaningful and has become an inevitable trend in unmanned aerial vehicle maneuvering target tracking research.
Patent publication CN108919640B proposes an unmanned aerial vehicle target tracking method based on reinforcement learning; its tracking environment is simple and the amount of data required for decision making is small, so it cannot handle unmanned aerial vehicle target tracking under complex environmental conditions and is difficult to apply to an unmanned aerial vehicle control system in a real scene. Patent publication CN110806759A provides an aircraft route tracking method based on deep reinforcement learning, which corrects the physical control of the aircraft online so as to realize autonomous perception and decision making. However, this method does not take into account the time cost required for fitting the neural network, nor its transferability, which makes the task difficult to train.
The deep deterministic policy gradient (DDPG) algorithm not only exploits the strengths of the experience pool and the dual (current/target) neural network structure of the deep Q-network algorithm, alleviating problems such as the data explosion of traditional reinforcement learning, but also inherits the characteristics of policy gradient algorithms, so it can effectively handle continuous-domain data and makes the neural network converge quickly. In addition, as an efficient machine learning method, transfer learning can migrate a network developed for one task and reuse it in the development of models for similar engineering tasks, which greatly saves training time and cost and improves the generalization ability of networks and models. Therefore, designing an unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning is of great significance for unmanned aerial vehicle applications in the related fields.
Disclosure of Invention
Technical problem to be solved
In order to avoid the defects of the prior art, the invention provides an unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning.
Technical scheme
An unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning is characterized by comprising the following steps:
step 1: constructing a Markov model (S, A, O, R, gamma) for tracking the maneuvering target of the unmanned aerial vehicle, wherein S is the input state of the unmanned aerial vehicle, A is the output action of the unmanned aerial vehicle, O is the observation space of a sensor of the unmanned aerial vehicle, R is a reward function, and gamma is a discount coefficient;
step 1-1: defining the state space of the Markov model, namely the input state S:
combining the unmanned aerial vehicle state, the target state and the obstacle state information, setting the model input state as follows:
S = [S_uav, S_target, S_obs^1, …, S_obs^n]
wherein: the unmanned aerial vehicle state is S_uav = [x_uav, y_uav, v_uav, θ_uav], where x_uav, y_uav represent the position of the drone on the two-dimensional plane, v_uav is the speed of the drone and θ_uav is the azimuth of the drone;
the target state is S_target = [x_target, y_target, v_x^target, v_y^target, ω_target], where x_target, y_target represent the position of the target on the two-dimensional plane, v_x^target and v_y^target are the velocity components of the target along the X and Y axes, and ω_target is the turn rate of the target, ω_target > 0 denoting a counter-clockwise turn and ω_target < 0 a clockwise turn;
the obstacle state S_obs^i represents the state of the i-th obstacle, where i = 1, 2, …, n; because the actual physical models of the obstacles differ, each obstacle is uniformly replaced by its circumscribed circle for convenience of modelling; the obstacle state is set as S_obs^i = [x_obs^i, y_obs^i, r_obs^i], where x_obs^i, y_obs^i indicate the position of the i-th obstacle in the two-dimensional plane and r_obs^i is the radius of the circumscribed circle of the i-th obstacle;
step 1-2: defining the motion space of the Markov model, namely the output motion A of the unmanned aerial vehicle:
the output action A represents the set of actions taken by the unmanned aerial vehicle with respect to its own state after receiving the external feedback value; the output is set as:
A = [a_t, ω_t]
where a_t is the acceleration of the drone at time t and ω_t is the angular velocity of the drone at time t; combining practical application, the acceleration and angular velocity of the unmanned aerial vehicle are constrained respectively as a_t ∈ [a_min, a_max] and ω_t ∈ [ω_min, ω_max], where a_min and a_max represent the minimum and maximum acceleration of the unmanned aerial vehicle, and ω_min and ω_max represent the minimum and maximum angular velocities of the unmanned aerial vehicle;
step 1-3: the observation space defining the markov model, i.e. the observation space O of the sensor:
judging and acquiring the position and speed information of the unmanned aerial vehicle and the target by using a radar sensor; the observation space is set as follows:
O = [D, φ]
where the relative distance D between the unmanned aerial vehicle and the target is:
D = sqrt((x_target − x_uav)² + (y_target − y_uav)²) + σ_D
and the relative azimuth φ between the unmanned aerial vehicle and the target is:
φ = arctan((y_target − y_uav) / (x_target − x_uav)) + σ_φ
where σ_D and σ_φ are the observation error values of the distance and the angle, respectively;
step 1-4: defining a reward function R:
the sensor is used to acquire the positions of the unmanned aerial vehicle and the target, and the reward function R is obtained by combining a distance reward/penalty and an obstacle-avoidance reward/penalty for the unmanned aerial vehicle; the reward function R represents the feedback value obtained when the unmanned aerial vehicle selects a certain action in the current state;
the distance reward function r_1 is set piecewise (its full expression is given as an image in the original document): if D_t > L, a negative constant penalty C_2 is given; if D_t ≤ L, a positive reward weighted by λ_1 and λ_2 is given; if D_t < L and D_t < D_min, a positive constant reward C_1 is given; where λ_1, λ_2 are the weight values of the two rewards, D_{t−1} represents the distance between the drone and the target at the previous moment, D_t is the distance between the unmanned aerial vehicle and the target at the current time t, D_min is the minimum tracking range, D_max is the maximum tracking distance, and L is the observation range of the sensor;
the obstacle-avoidance reward function r_t^coll is likewise set piecewise (its expression is given as an image in the original document), where D_obs^t is the distance between the unmanned aerial vehicle and the obstacle at time t and D_safe is a constant representing the safe separation between the drone and the obstacle;
combining the distance reward and the obstacle-avoidance reward of the unmanned aerial vehicle, the reward function R is obtained as:
R = λ_3·r_1 − λ_4·r_t^coll
where λ_3 and λ_4 represent the weight values of the distance reward and the obstacle-avoidance reward, respectively;
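For illustration only, the following Python sketch shows how a combined reward of this form could be computed. The piecewise expressions of r_1 and r_t^coll are only given as images in the original document, so the branch values and shaping terms used here (C1, C2, the normalised distance terms) are assumptions, not the patent's exact formulas.

```python
def distance_reward(d_prev, d_t, d_min=15.0, L=100.0, C1=1.0, C2=-1.0,
                    lam1=0.5, lam2=0.5):
    """Hypothetical distance reward r_1; the patent only states the three cases."""
    if d_t > L:                       # target outside the sensor range: constant penalty
        return C2
    r = lam1 * (d_prev - d_t) / L + lam2 * (L - d_t) / L   # assumed positive shaping term
    if d_t < d_min:                   # very close to the target: extra constant bonus
        r += C1
    return r

def obstacle_reward(d_obs_t, d_safe=10.0):
    """Hypothetical obstacle-avoidance penalty r_t^coll (zero when outside D_safe)."""
    return max(0.0, (d_safe - d_obs_t) / d_safe)

def total_reward(d_prev, d_t, d_obs_t, lam3=0.7, lam4=0.3):
    # R = λ3·r1 − λ4·r_t^coll, with the embodiment's weights as defaults
    return lam3 * distance_reward(d_prev, d_t) - lam4 * obstacle_reward(d_obs_t)
```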
step 1-5: defining a discount factor γ:
setting a discount factor 0 < γ < 1 for calculating the accumulated return over the whole process; the larger the value of γ, the more emphasis is placed on long-term returns;
step 2: constructing a neural network of the DDPG algorithm:
step 2-1: constructing a policy network in the DDPG algorithm, namely an Actor policy network:
the policy network μ_actor is composed of an input layer, a hidden layer and an output layer; for an input state vector s, the output vector u of the policy network is expressed as:
u=μactor(s)
step 2-2: constructing an evaluation network in the DDPG algorithm, namely a criticic evaluation network:
the output of the evaluation network is the state-action value Q^μ(s, u), expressed as:
Q^μ(s_t, u_t) = E[ r(s_t, u_t) + Σ_{k=0…∞} γ^(k+1)·r(s_{t+k+1}, u_{t+k+1}) ]
where k is a summation variable and E[·] represents the mathematical expectation; s_{t+k+1} and u_{t+k+1} respectively denote the state input vector and the action output vector at time t+k+1;
step 2-3: constructing a target neural network:
the weights of the policy network μ_actor and of the evaluation network Q^μ(s, u) are copied into the respective target networks, i.e. θ^μ → θ^μ′, θ^Q → θ^Q′, where θ^μ and θ^Q represent the parameters of the current policy network and of the evaluation network respectively, and θ^μ′ and θ^Q′ represent the parameters of the target policy network and of the target evaluation network respectively;
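As an illustration of step 2-3, the following is a minimal sketch, assuming PyTorch, of how the target networks can be created as copies of the current policy and evaluation networks (θ^μ → θ^μ′, θ^Q → θ^Q′); the helper name make_target and the actor/critic variable names are illustrative.

```python
import copy
import torch.nn as nn

def make_target(net: nn.Module) -> nn.Module:
    """Create a target network θ' initialised as an exact copy of θ."""
    target = copy.deepcopy(net)
    for p in target.parameters():
        p.requires_grad_(False)   # target networks are changed only by soft updates
    return target

# usage (actor and critic are assumed nn.Module instances):
# actor_target = make_target(actor)
# critic_target = make_target(critic)
```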
step 3: unmanned aerial vehicle and target state update
Step 3-1: establishing a state updating equation of the unmanned aerial vehicle at the time t:
(the kinematic state update equations of the unmanned aerial vehicle at time t are given as an image in the original document)
where x_uav(·) and y_uav(·) denote the coordinate values of the unmanned aerial vehicle at a given time, v_uav(·) and ζ_uav(·) denote the linear and angular velocity of the drone at that time, and a_t is the acceleration of the unmanned aerial vehicle at that time; Δt is the simulation time interval, and (v_min, v_max) are the minimum and maximum speeds of the unmanned aerial vehicle;
step 3-2: constructing a state updating equation of the target at the time t:
S_target(t+1) = F_t·S_target(t) + Γ_t·w_t
where S_target(t+1) represents the target state at time t+1, F_t is the state transition matrix, Γ_t is the noise influence matrix, and w_t is Gaussian white noise; the concrete forms of F_t and Γ_t are given as images in the original document;
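For illustration, the following is a minimal Python sketch of plausible state-update equations of the kind described in steps 3-1 and 3-2. The UAV kinematics and the constant-turn-rate form of F_t and Γ_t used here are assumptions, since the patent gives the exact update equations and matrices only as images.

```python
import math
import numpy as np

def update_uav(x, y, v, heading, a_t, w_t, dt=1.0, v_min=0.0, v_max=100.0):
    """One plausible planar kinematic update for the UAV state (assumed form)."""
    v_new = min(max(v + a_t * dt, v_min), v_max)     # speed clipped to [v_min, v_max]
    heading_new = heading + w_t * dt                 # heading driven by angular velocity
    x_new = x + v_new * math.cos(heading_new) * dt
    y_new = y + v_new * math.sin(heading_new) * dt
    return x_new, y_new, v_new, heading_new

def update_target(s, dt=1.0, noise_std=0.1):
    """S(t+1) = F_t S(t) + Γ_t w_t with an assumed constant-turn-rate F_t.

    s = [x, y, vx, vy, omega]; the patent's exact F_t and Γ_t are given only as images."""
    s = np.asarray(s, dtype=float)
    om = s[4]
    if abs(om) > 1e-6:
        sn, cs = math.sin(om * dt), math.cos(om * dt)
        F = np.array([[1, 0, sn / om,       -(1 - cs) / om, 0],
                      [0, 1, (1 - cs) / om,  sn / om,       0],
                      [0, 0, cs,            -sn,            0],
                      [0, 0, sn,             cs,            0],
                      [0, 0, 0,              0,             1]])
    else:
        F = np.eye(5); F[0, 2] = dt; F[1, 3] = dt
    G = np.array([[dt**2 / 2, 0], [0, dt**2 / 2], [dt, 0], [0, dt], [0, 0]])
    w = np.random.normal(0.0, noise_std, size=2)     # Gaussian white noise w_t
    return F @ s + G @ w
```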
step 4: training the maneuvering target tracking of the unmanned aerial vehicle with the deterministic policy gradient method in the task one scenario:
step 4-1: setting the maximum training round as E and the maximum number of steps per round as T_range, setting the experience pool size M, setting the soft update proportion coefficient τ of the target neural networks, and setting the learning rates of the evaluation network and the policy network as α_ω and α_θ respectively;
Step 4-2: initializing a state space S and initializing network parameters;
step 4-3: in the current state s_t, selecting the action of the unmanned aerial vehicle:
a_t = μ_d(s_t | θ^μ) + ε_t
where μ_d(·) represents a deterministic policy function and ε_t is a random process noise vector;
step 4-4: the unmanned aerial vehicle executes action a_t; the relative distance and relative azimuth between the unmanned aerial vehicle and the target are calculated according to step 1-3, the reward value r_t at time t is obtained from the reward function of step 1-4, and the next state s_{t+1} is obtained from step 3; the sample e_transition = <s_t, a_t, r_t, s_{t+1}> is then stored in the experience pool;
step 4-5: judging whether the size N_R of the experience pool meets the requirement: if N_R < M, go to step 4-3; if the number of stored samples exceeds the experience pool capacity, the sample data at the front of the experience pool queue are automatically dequeued; then enter step 4-6;
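A minimal sketch of the experience pool described in steps 4-4 and 4-5, assuming Python's collections.deque, which automatically dequeues the oldest samples once the capacity M is reached; the class and method names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool: oldest transitions are dequeued automatically once it is full."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # maxlen gives the automatic dequeue of step 4-5

    def store(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def __len__(self):
        return len(self.buffer)

    def sample(self, n):
        return random.sample(self.buffer, n)   # random mini-batch for step 4-6

# usage: pool = ReplayBuffer(capacity=8000); pool.store(s, a, r, s2)
# if len(pool) >= batch_size: batch = pool.sample(batch_size)
```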
step 4-6: randomly extracting a small batch of N samples from the experience pool for learning, where the learning process is expressed as:
y_t = r_t + γ·Q′(s_{t+1}, μ′(s_{t+1} | θ^μ′) | θ^Q′)
where y_t represents the target value computed by the target networks, r_t is the reward value at time t, θ^Q′ and θ^μ′ respectively represent the parameters of the target evaluation network and of the target policy network, and Q′ denotes the state-action value obtained at state s_{t+1} under the policy μ′;
step 4-7: updating the evaluation network by minimizing the loss function:
L = (1/N)·Σ_t ( y_t − Q(s_t, a_t | θ^Q) )²
where L represents the loss function and N represents the number of samples used for the network update;
step 4-8: updating the policy gradient:
∇_{θ^μ} J ≈ (1/N)·Σ_t [ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ) |_{s=s_t} ]
where ∇_{θ^μ} J denotes the policy gradient under the policy network parameters θ^μ, ∇_a Q(s, a | θ^Q) and ∇_{θ^μ} μ(s | θ^μ) respectively represent the gradient of the evaluation network's state-action value function and the gradient of the policy network's policy function, μ(s_t) denotes the action strategy selected by the policy network in state s_t, and Q(s_t, μ(s_t) | θ^Q) and μ(s_t | θ^μ) respectively represent the state-action value of the evaluation network and the action value of the policy network in state s_t when action a = μ(s_t) is taken;
step 4-9: updating the weights of the target evaluation network and the target policy network according to the following formulas:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
wherein tau is a soft update proportionality coefficient;
step 4-10: for the iteration step number k, execute k = k + 1 and judge: if k < T_range, execute t = t + Δt and return to step 4-3; otherwise, enter step 4-11;
step 4-11: judging the round number e: if e < E, return to step 4-2; if e ≥ E, save the network parameters at this moment and take the currently trained policy network as the network for the first migration;
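For illustration, the following is a compact sketch of one learning step covering steps 4-6 to 4-9, assuming PyTorch; the network classes, optimizers and tensor shapes are assumptions, and this is a sketch rather than the patent's reference implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, gamma=0.95, tau=0.9):
    """One learning step: target y_t, critic loss, actor (policy) gradient, soft update."""
    s, a, r, s2 = batch                                   # tensors sampled from the pool

    with torch.no_grad():                                 # step 4-6: y_t from target networks
        y = r + gamma * critic_t(s2, actor_t(s2))

    critic_loss = F.mse_loss(critic(s, a), y)             # step 4-7: minimise (y_t - Q)^2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()              # step 4-8: ascend ∇_a Q · ∇_θ μ
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for t_p, p in zip(list(critic_t.parameters()) + list(actor_t.parameters()),
                      list(critic.parameters()) + list(actor.parameters())):
        t_p.data.copy_(tau * p.data + (1 - tau) * t_p.data)   # step 4-9: soft update
```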
step 5: carrying out the first transfer learning, namely training the unmanned aerial vehicle to track the maneuvering target in the task two scenario:
step 5-1: migrating the trained neural network of the first task to a second task to serve as an initialization network of the second task;
step 5-2: executing the operations from the step 4-3 to the step 4-11, completing the task after the network is learned, storing the parameters and taking the trained strategy network as a network for the second migration;
step 6: carrying out the second transfer learning, namely training the unmanned aerial vehicle to track the maneuvering target in the task three scenario:
step 6-1: migrating the neural network trained on task two to task three as the initialization network of task three;
step 6-2: executing the operations from the step 4-3 to the step 4-11, completing the task after the network learns, and storing the parameters; and loading the stored data into an unmanned aerial vehicle system, so that the unmanned aerial vehicle finishes the work of state input, neural network analysis and action output, and the high-efficiency unmanned aerial vehicle maneuvering target tracking based on DDPG transfer learning is realized.
λ_1, λ_2 ∈ (0, 1); λ_3, λ_4 ∈ (0, 1).
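As an illustration of the migration in steps 4-11, 5 and 6, the following sketch, assuming PyTorch state dictionaries, saves the policy and evaluation networks trained on one subtask and loads them as the initialization of the next subtask; the file names and helper names are illustrative.

```python
import torch

def save_task_network(actor, critic, tag):
    # illustrative file name, e.g. "ddpg_task1.pt"
    torch.save({"actor": actor.state_dict(),
                "critic": critic.state_dict()}, f"ddpg_{tag}.pt")

def transfer_to_next_task(actor, critic, tag):
    """Initialise the next task's networks from the previous task's trained weights."""
    ckpt = torch.load(f"ddpg_{tag}.pt")
    actor.load_state_dict(ckpt["actor"])
    critic.load_state_dict(ckpt["critic"])
    # target networks are then re-copied from these weights before training resumes

# e.g. after task one:  save_task_network(actor, critic, "task1")
#      before task two: transfer_to_next_task(actor, critic, "task1")
```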
Advantageous effects
The invention provides an unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning. The method does not depend on an environment model: a deep neural network is established, sensor information such as the positions and speeds of the unmanned aerial vehicle and the target is used as the input of the neural network, and the acceleration and angular velocity of the unmanned aerial vehicle are used as the output. The complex task is decomposed into: task one, tracking a target in uniform linear motion; task two, tracking a target with a complex maneuvering pattern; task three, completing target tracking while achieving obstacle avoidance. The flight tracking policy network is then trained and migrated based on the DDPG algorithm and transfer learning, thereby completing the unmanned aerial vehicle maneuvering target tracking task in a complex environment. Its advantages are:
(1) The method realizes online tracking decisions for the unmanned aerial vehicle when the environment model is unknown. By adopting the deep deterministic policy gradient (DDPG) method, with the strong fitting capability of the neural network, the optimal evaluation and policy networks for reaching the target can be learned automatically from the sampled tracking data of the unmanned aerial vehicle, and the tracking task is completed.
(2) The invention uses transfer learning, which greatly improves the convergence speed while preserving the accuracy of the algorithm, and saves engineering development and model training cost. By migrating the trained model or network to a new engineering task, resetting the state space and the action space and adjusting the hyper-parameters of neural network training, more intelligent decision tasks of the unmanned aerial vehicle system can be extended and realized.
Drawings
FIG. 1 is a flow chart of training task of tracking maneuvering target of unmanned aerial vehicle based on DDPG transfer learning
FIG. 2 is a schematic diagram of a DDPG-based unmanned aerial vehicle maneuvering target tracking algorithm structure
FIG. 3 is a task exploded view of unmanned aerial vehicle maneuvering target tracking
FIG. 4 is a graph showing the variation of the reward obtained by the unmanned aerial vehicle in each round during the training process
FIG. 5 is a track display diagram of unmanned aerial vehicle for completing obstacle avoidance and target tracking
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the invention provides an unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning, and the whole flow is shown in figure 1. The technical solution is further clearly and completely described below with reference to the accompanying drawings and specific embodiments:
step 1: constructing a Markov model (S, A, O, R, gamma) for tracking the maneuvering target of the unmanned aerial vehicle, wherein S is the input state of the unmanned aerial vehicle, A is the output action of the unmanned aerial vehicle, O is the observation space of a sensor of the unmanned aerial vehicle, R is a reward function, and gamma is a discount coefficient;
step 1-1: defining the state space of the Markov model, namely the input state S:
combining the unmanned aerial vehicle state, the target state and the obstacle state information, setting the model input state as follows:
S = [S_uav, S_target, S_obs^1, …, S_obs^n]
wherein: the unmanned aerial vehicle state is S_uav = [x_uav, y_uav, v_uav, θ_uav], where x_uav, y_uav represent the position of the drone on the two-dimensional plane, v_uav is the speed of the drone and θ_uav is the azimuth of the drone;
the target state is S_target = [x_target, y_target, v_x^target, v_y^target, ω_target], where x_target, y_target represent the position of the target on the two-dimensional plane, v_x^target and v_y^target are the velocity components of the target along the X and Y axes, and ω_target is the turn rate of the target, ω_target > 0 denoting a counter-clockwise turn and ω_target < 0 a clockwise turn;
the obstacle state S_obs^i represents the state of the i-th obstacle (i = 1, 2, …, n); because the actual physical model of each obstacle is different, each obstacle is uniformly replaced by its circumscribed circle for convenience of modelling; the obstacle state is set as S_obs^i = [x_obs^i, y_obs^i, r_obs^i], where x_obs^i, y_obs^i indicate the position of the i-th obstacle in the two-dimensional plane and r_obs^i is the radius of the circumscribed circle of the i-th obstacle;
step 1-2: defining the motion space of the Markov model, namely the output motion A of the unmanned aerial vehicle:
and outputting an action A to represent an action set taken by the unmanned aerial vehicle aiming at the self state value after receiving the external feedback value. In the present invention, the output is set as:
A = [a_t, ω_t]
where a_t is the acceleration of the drone at time t and ω_t is the angular velocity of the drone at time t; combining practical application, the acceleration and the angular velocity of the unmanned aerial vehicle are constrained respectively as a_t ∈ [a_min, a_max] (the concrete numeric range is given as an image in the original document) and ω_t ∈ [−3, 3];
step 1-3: the observation space defining the markov model, i.e. the observation space O of the sensor:
in the invention, the radar sensor is utilized to judge and acquire the position and speed information of the unmanned aerial vehicle and the target. The observation space is set as follows:
O = [D, φ]
where the relative distance D between the unmanned aerial vehicle and the target is:
D = sqrt((x_target − x_uav)² + (y_target − y_uav)²) + σ_D
and the relative azimuth φ between the unmanned aerial vehicle and the target is:
φ = arctan((y_target − y_uav) / (x_target − x_uav)) + σ_φ
where σ_D and σ_φ are the observation error values of the distance and the angle, respectively;
step 1-4: defining a reward function R:
acquiring information of the unmanned aerial vehicle and a target position by using a sensor, and comprehensively obtaining a reward function R by performing distance reward punishment and obstacle avoidance reward punishment on the unmanned aerial vehicle, wherein the reward function R represents a feedback value obtained when the unmanned aerial vehicle selects a certain action in the current state;
In this embodiment, the minimum tracking range D_min is set to 0–15 meters, the maximum tracking distance D_max is set to 100 meters, and the observation range of the sensor is L = 100 meters. The distance reward function r_1 is set piecewise (its full expression is given as an image in the original document): D_{t−1} represents the distance between the drone and the target at the previous moment and D_t is the distance between the unmanned aerial vehicle and the target at the current time t; if the current distance D_t is larger than the measurement range of the sensor, a penalty of −1 is given to the unmanned aerial vehicle; if D_t ≤ L, a positive reward is given; if D_t < L and D_t < D_min, an additional constant reward of 1 is given;
In this embodiment, the safe separation between the unmanned aerial vehicle and the obstacles is set to D_safe = 10 meters. The obstacle-avoidance reward function r_t^coll is set piecewise (its expression is given as an image in the original document), where D_obs^t is the distance between the drone and the obstacle at time t;
Combining the respective weights of the distance reward and the obstacle-avoidance reward of the unmanned aerial vehicle, the reward function R is set as:
R = 0.7·r_1 − 0.3·r_t^coll
step 1-5: defining a discount factor γ:
a discount factor is set for calculating the accumulated return value in the whole process. In this embodiment, γ is set to 0.95.
Step 2: constructing a neural network of the DDPG algorithm, wherein the schematic structural diagram of the algorithm is shown in FIG. 2:
step 2-1: constructing a policy network in the DDPG algorithm, namely an Actor policy network:
the policy network μ_actor is composed of an input layer, a hidden layer and an output layer; for an input state vector s, the output vector u of the policy network is expressed as:
u=μactor(s)
step 2-2: constructing an evaluation network in the DDPG algorithm, namely a criticic evaluation network:
the output of the evaluation network is the state-action value Q^μ(s, u), expressed as:
Q^μ(s_t, u_t) = E[ r(s_t, u_t) + Σ_{k=0…∞} γ^(k+1)·r(s_{t+k+1}, u_{t+k+1}) ]
where k is a summation variable and E[·] represents the mathematical expectation; s_{t+k+1} and u_{t+k+1} respectively denote the state input vector and the action output vector at time t+k+1;
step 2-3: constructing a target neural network:
the weights of the policy network μ_actor and of the evaluation network Q^μ(s, u) are copied into the respective target networks, i.e. θ^μ → θ^μ′, θ^Q → θ^Q′, where θ^μ and θ^Q represent the parameters of the current policy network and of the evaluation network respectively, and θ^μ′ and θ^Q′ represent the parameters of the target policy network and of the target evaluation network respectively;
it should be noted that in this embodiment the policy network, the evaluation network and their corresponding target networks each use a three-layer neural network; the hidden layer contains 100 neurons and uses the ReLU activation function, and the output layer uses the tanh function;
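For illustration, the following is a sketch, assuming PyTorch, of actor and critic networks matching the structure described in this embodiment (three layers, a 100-neuron ReLU hidden layer, tanh output); applying the tanh only to the policy network's output, and the exact input dimensions, are assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: input layer, one 100-neuron ReLU hidden layer, tanh output."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 100), nn.ReLU(),
            nn.Linear(100, action_dim), nn.Tanh())  # output later rescaled to the action bounds

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Evaluation network: takes state and action, outputs the scalar Q value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 100), nn.ReLU(),
            nn.Linear(100, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```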
step 3: unmanned aerial vehicle and target state update
Step 3-1: establishing a state updating equation of the unmanned aerial vehicle at the time t:
(the kinematic state update equations of the unmanned aerial vehicle at time t are given as an image in the original document)
where x_uav(·) and y_uav(·) denote the coordinate values of the unmanned aerial vehicle at a given time, v_uav(·) and ζ_uav(·) denote the linear and angular velocity of the drone at that time, and a_t is the acceleration of the unmanned aerial vehicle at that time. In this embodiment, the simulation time interval Δt is set to 1 second, and the minimum and maximum speeds of the unmanned aerial vehicle are set to v_min = 0 m/s and v_max = 100 m/s;
step 3-2: constructing a state updating equation of the target at the time t:
S_target(t+1) = F_t·S_target(t) + Γ_t·w_t
where S_target(t+1) represents the target state at time t+1, F_t is the state transition matrix, Γ_t is the noise influence matrix, and w_t is Gaussian white noise; the concrete forms of F_t and Γ_t are given as images in the original document;
step 3-3: in the invention, the position of each obstacle is kept unchanged, so that the position state of the obstacle does not need to be updated;
step 4: the invention decomposes the unmanned aerial vehicle maneuvering target tracking task into: task one, tracking a target in uniform linear motion; task two, tracking a target with a complex maneuvering pattern; task three, completing target tracking while achieving obstacle avoidance, as shown specifically in FIG. 3;
the maneuvering target tracking of the unmanned aerial vehicle is first trained with the deterministic policy gradient method in the task one scenario:
step 4-1: in this embodiment of the invention, the maximum training round E is set to 800, the maximum number of steps per round is T_range = 400, the experience pool size is M = 8000, the soft update proportion coefficient of the target neural networks is τ = 0.9, and the learning rates of the evaluation network and the policy network are α_ω = 0.001 and α_θ = 0.001 respectively;
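For convenience, the hyper-parameters of this embodiment can be collected in a single configuration, as in the following sketch (the dictionary and key names are illustrative):

```python
# Hyper-parameters of this embodiment collected in one place (values from the text above)
HPARAMS = dict(
    max_rounds=800,        # E
    max_steps=400,         # T_range
    buffer_size=8000,      # M
    tau=0.9,               # soft update proportion coefficient
    critic_lr=0.001,       # α_ω
    actor_lr=0.001,        # α_θ
    gamma=0.95,            # discount factor from step 1-5
)
```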
Step 4-2: initializing a state space S and initializing network parameters;
setting the initial state of the drone and the initial state of the target (their concrete numeric values are given as images in the original document); the turn rate of the target in the three stages of a round is ω_target = 6.18 degrees/sec, ω_target = 8.33 degrees/sec and ω_target = −2.21 degrees/sec respectively; the three obstacle initialization states are respectively: rectangular obstacle state S_1 = [400 m, 75 m, 42 m, 16 m], square obstacle state S_2 = [200 m, 115 m, 40 m], circular obstacle state S_3 = [528 m, 280 m, 12 m]; the rectangular and square obstacle models are replaced by their circumscribed circles during the obstacle avoidance process of the unmanned aerial vehicle;
initializing the weight of the neural network;
step 4-3: in the current state s_t, selecting the action of the unmanned aerial vehicle:
a_t = μ_d(s_t | θ^μ) + ε_t
where μ_d(·) represents a deterministic policy function and ε_t is a random process noise vector;
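A minimal sketch of the action selection of step 4-3 with exploration noise; the patent only specifies a random process noise vector ε_t, so the Gaussian noise and the clipping bounds used here are assumptions (Ornstein-Uhlenbeck noise is another common choice for DDPG).

```python
import numpy as np
import torch

def select_action(actor, s_t, noise_std=0.1, a_bounds=(-2.0, 2.0), w_bounds=(-3.0, 3.0)):
    """a_t = μ_d(s_t|θ^μ) + ε_t, with assumed Gaussian exploration noise."""
    with torch.no_grad():
        a = actor(torch.as_tensor(s_t, dtype=torch.float32)).numpy()
    a = a + np.random.normal(0.0, noise_std, size=a.shape)   # ε_t
    a[0] = np.clip(a[0], *a_bounds)   # acceleration bound (illustrative values)
    a[1] = np.clip(a[1], *w_bounds)   # angular-velocity bound ω_t ∈ [-3, 3]
    return a
```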
step 4-4: the unmanned aerial vehicle executes action a_t; the relative distance and relative azimuth between the unmanned aerial vehicle and the target are calculated according to step 1-3, the reward value r_t at time t is obtained from the reward function of step 1-4, and the next state s_{t+1} is obtained from step 3; the sample e_transition = <s_t, a_t, r_t, s_{t+1}> is then stored in the experience pool queue;
step 4-5: judging whether the size N_R of the experience pool meets the requirement: if N_R < M, go to step 4-3; if the number of stored samples exceeds the experience pool capacity, the sample data at the front of the experience pool queue are automatically dequeued; then enter step 4-6;
step 4-6: randomly extracting a small batch of N samples from the experience pool for learning, where the learning process is expressed as:
y_t = r_t + γ·Q′(s_{t+1}, μ′(s_{t+1} | θ^μ′) | θ^Q′)
where y_t represents the target value computed by the target networks, r_t is the reward value at time t, θ^Q′ and θ^μ′ respectively represent the parameters of the target evaluation network and of the target policy network, and Q′ denotes the state-action value obtained at state s_{t+1} under the policy μ′;
step 4-7: updating the evaluation network by minimizing the loss function:
L = (1/N)·Σ_t ( y_t − Q(s_t, a_t | θ^Q) )²
where L represents the loss function and N represents the number of samples used for the network update;
step 4-8: updating the policy gradient:
∇_{θ^μ} J ≈ (1/N)·Σ_t [ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ) |_{s=s_t} ]
where ∇_{θ^μ} J denotes the policy gradient under the policy network parameters θ^μ, ∇_a Q(s, a | θ^Q) and ∇_{θ^μ} μ(s | θ^μ) respectively represent the gradient of the evaluation network's state-action value function and the gradient of the policy network's policy function, μ(s_t) denotes the action strategy selected by the policy network in state s_t, and Q(s_t, μ(s_t) | θ^Q) and μ(s_t | θ^μ) respectively represent the state-action value of the evaluation network and the action value of the policy network in state s_t when action a = μ(s_t) is taken;
step 4-9: updating the weights of the target evaluation network and the target policy network according to the following formulas:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
in this embodiment, the soft update rate coefficient τ is set to 0.9.
Step 4-10: for the iteration step number k, execute k = k + 1 and judge: if k < T_range, execute t = t + Δt and return to step 4-3; otherwise, enter step 4-11;
step 4-11: judging the round number e: if e < E, return to step 4-2; if e ≥ E, the network parameters at this moment are saved and the current policy network is taken as the final policy network for task one; substituting the state space of step 1-1 as the final input of the network, effective unmanned aerial vehicle maneuvering target tracking is thereby realized;
step 5: carrying out the first transfer learning, namely training the unmanned aerial vehicle to track the maneuvering target in the task two scenario:
step 5-1: the maneuvering target tracking training of the unmanned aerial vehicle in the task two scenario is completed on the basis of task one; first, the neural network trained on task one is migrated to task two as the initialization network of task two;
step 5-2: executing the operations from step 4-3 to step 4-11; after a small amount of further learning the network completes the task, the parameters are saved and the trained policy network is taken as the network for the second migration;
step 6: carrying out the second transfer learning, namely training the unmanned aerial vehicle to track the maneuvering target in the task three scenario:
step 6-1: the maneuvering target tracking training of the unmanned aerial vehicle in the task three scenario is completed on the basis of task two, namely the neural network trained on task two is migrated to task three as the initialization network of task three;
step 6-2: executing the operations from step 4-3 to step 4-11; after a small amount of further learning the network completes the task; substituting the state space of step 1-1 as the final input of the network, the efficient unmanned aerial vehicle maneuvering target tracking based on DDPG transfer learning is realized.
The unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning provided by the invention trains the neural network by decomposing the task and initializing the environment state, the neural network parameters and the other hyper-parameters. At the start of each round, the unmanned aerial vehicle executes actions that change its speed and course angle to obtain a new state; the experience of each round is stored in an experience pool as learning samples, and the parameters of the neural network are updated iteratively. When training is finished, the neural network parameters trained on the current subtask are saved and transferred to the unmanned aerial vehicle maneuvering target tracking network of the next task scenario, until the final task is completed.
The curve of the reward obtained by the unmanned aerial vehicle in each round during training is shown in FIG. 4; after about 300 rounds of training, the unmanned aerial vehicle obtains a high and stable reward in every round. The progressive task-decomposition strategy proposed by the method, together with the DDPG algorithm designed specifically with transfer learning, improves the convergence speed of the original DDPG algorithm and the robustness of the network, thereby improving the efficiency and stability of the autonomous intelligent decision process of the unmanned aerial vehicle. The simulation result is shown in FIG. 5; it can be seen that the unmanned aerial vehicle trained with the DDPG transfer learning algorithm can effectively avoid obstacles and complete the maneuvering target tracking task.

Claims (2)

1. An unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning is characterized by comprising the following steps:
step 1: constructing a Markov model (S, A, O, R, gamma) for tracking the maneuvering target of the unmanned aerial vehicle, wherein S is the input state of the unmanned aerial vehicle, A is the output action of the unmanned aerial vehicle, O is the observation space of a sensor of the unmanned aerial vehicle, R is a reward function, and gamma is a discount coefficient;
step 1-1: defining the state space of the Markov model, namely the input state S:
combining the unmanned aerial vehicle state, the target state and the obstacle state information, setting the model input state as follows:
S = [S_uav, S_target, S_obs^1, …, S_obs^n]
wherein: the unmanned aerial vehicle state is S_uav = [x_uav, y_uav, v_uav, θ_uav], where x_uav, y_uav represent the position of the drone on the two-dimensional plane, v_uav is the speed of the drone and θ_uav is the azimuth of the drone;
the target state is S_target = [x_target, y_target, v_x^target, v_y^target, ω_target], where x_target, y_target represent the position of the target on the two-dimensional plane, v_x^target and v_y^target are the velocity components of the target along the X and Y axes, and ω_target is the turn rate of the target, ω_target > 0 denoting a counter-clockwise turn and ω_target < 0 a clockwise turn;
the obstacle state S_obs^i represents the state of the i-th obstacle, where i = 1, 2, …, n; because the actual physical models of the obstacles differ, each obstacle is uniformly replaced by its circumscribed circle for convenience of modelling; the obstacle state is set as S_obs^i = [x_obs^i, y_obs^i, r_obs^i], where x_obs^i, y_obs^i indicate the position of the i-th obstacle in the two-dimensional plane and r_obs^i is the radius of the circumscribed circle of the i-th obstacle;
step 1-2: defining the motion space of the Markov model, namely the output motion A of the unmanned aerial vehicle:
the output action A represents an action set taken by the unmanned aerial vehicle for the self state value after receiving the external feedback value; the output is set as:
Figure FDA0003382303610000018
wherein the content of the first and second substances,
Figure FDA0003382303610000019
acceleration, ω, at time t of the dronetThe angular velocity of the unmanned aerial vehicle at the moment t; the acceleration and the angular velocity of the unmanned aerial vehicle are respectively restrained by combining practical application:
Figure FDA00033823036100000110
ωt∈[ωminmax](ii) a Wherein the content of the first and second substances,
Figure FDA00033823036100000111
respectively representing the minimum acceleration and the maximum acceleration of the unmanned aerial vehicle; omegamin、ωmaxRespectively representing the minimum and maximum angular velocities of the unmanned aerial vehicle;
step 1-3: the observation space defining the markov model, i.e. the observation space O of the sensor:
judging and acquiring the position and speed information of the unmanned aerial vehicle and the target by using a radar sensor; the observation space is set as follows:
O = [D, φ]
where the relative distance D between the unmanned aerial vehicle and the target is:
D = sqrt((x_target − x_uav)² + (y_target − y_uav)²) + σ_D
and the relative azimuth φ between the unmanned aerial vehicle and the target is:
φ = arctan((y_target − y_uav) / (x_target − x_uav)) + σ_φ
where σ_D and σ_φ are the observation error values of the distance and the angle, respectively;
step 1-4: defining a reward function R:
the sensor is used to acquire the positions of the unmanned aerial vehicle and the target, and the reward function R is obtained by combining a distance reward/penalty and an obstacle-avoidance reward/penalty for the unmanned aerial vehicle; the reward function R represents the feedback value obtained when the unmanned aerial vehicle selects a certain action in the current state;
the distance reward function r_1 is set piecewise (its full expression is given as an image in the original document): if D_t > L, a negative constant penalty C_2 is given; if D_t ≤ L, a positive reward weighted by λ_1 and λ_2 is given; if D_t < L and D_t < D_min, a positive constant reward C_1 is given; where λ_1, λ_2 are the weight values of the two rewards, D_{t−1} represents the distance between the drone and the target at the previous moment, D_t is the distance between the unmanned aerial vehicle and the target at the current time t, D_min is the minimum tracking range, D_max is the maximum tracking distance, and L is the observation range of the sensor;
the obstacle-avoidance reward function r_t^coll is likewise set piecewise (its expression is given as an image in the original document), where D_obs^t is the distance between the unmanned aerial vehicle and the obstacle at time t and D_safe is a constant representing the safe separation between the drone and the obstacle;
combining the distance reward and the obstacle-avoidance reward of the unmanned aerial vehicle, the reward function R is obtained as:
R = λ_3·r_1 − λ_4·r_t^coll
where λ_3 and λ_4 represent the weight values of the distance reward and the obstacle-avoidance reward, respectively;
step 1-5: defining a discount factor γ:
setting a discount factor 0 < γ < 1 for calculating the accumulated return over the whole process; the larger the value of γ, the more emphasis is placed on long-term returns;
step 2: constructing a neural network of the DDPG algorithm:
step 2-1: constructing a policy network in the DDPG algorithm, namely an Actor policy network:
the policy network μ_actor is composed of an input layer, a hidden layer and an output layer; for an input state vector s, the output vector u of the policy network is expressed as:
u=μactor(s)
step 2-2: constructing an evaluation network in the DDPG algorithm, namely a criticic evaluation network:
the output of the evaluation network is the state-action value Q^μ(s, u), expressed as:
Q^μ(s_t, u_t) = E[ r(s_t, u_t) + Σ_{k=0…∞} γ^(k+1)·r(s_{t+k+1}, u_{t+k+1}) ]
where k is a summation variable and E[·] represents the mathematical expectation; s_{t+k+1} and u_{t+k+1} respectively denote the state input vector and the action output vector at time t+k+1;
step 2-3: constructing a target neural network:
the weights of the policy network μ_actor and of the evaluation network Q^μ(s, u) are copied into the respective target networks, i.e. θ^μ → θ^μ′, θ^Q → θ^Q′, where θ^μ and θ^Q represent the parameters of the current policy network and of the evaluation network respectively, and θ^μ′ and θ^Q′ represent the parameters of the target policy network and of the target evaluation network respectively;
step 3: unmanned aerial vehicle and target state update
Step 3-1: establishing a state updating equation of the unmanned aerial vehicle at the time t:
(the kinematic state update equations of the unmanned aerial vehicle at time t are given as an image in the original document)
where x_uav(·) and y_uav(·) denote the coordinate values of the unmanned aerial vehicle at a given time, v_uav(·) and ζ_uav(·) denote the linear and angular velocity of the drone at that time, and a_t is the acceleration of the unmanned aerial vehicle at that time; Δt is the simulation time interval, and (v_min, v_max) are the minimum and maximum speeds of the unmanned aerial vehicle;
step 3-2: constructing a state updating equation of the target at the time t:
S_target(t+1) = F_t·S_target(t) + Γ_t·w_t
where S_target(t+1) represents the target state at time t+1, F_t is the state transition matrix, Γ_t is the noise influence matrix, and w_t is Gaussian white noise; the concrete forms of F_t and Γ_t are given as images in the original document;
step 4: training the maneuvering target tracking of the unmanned aerial vehicle with the deterministic policy gradient method in the task one scenario:
step 4-1: setting the maximum training round as E and the maximum number of steps per round as T_range, setting the experience pool size M, setting the soft update proportion coefficient τ of the target neural networks, and setting the learning rates of the evaluation network and the policy network as α_ω and α_θ respectively;
Step 4-2: initializing a state space S and initializing network parameters;
step 4-3: in the current state s_t, selecting the action of the unmanned aerial vehicle:
a_t = μ_d(s_t | θ^μ) + ε_t
where μ_d(·) represents a deterministic policy function and ε_t is a random process noise vector;
step 4-4: the unmanned aerial vehicle executes action a_t; the relative distance and relative azimuth between the unmanned aerial vehicle and the target are calculated according to step 1-3, the reward value r_t at time t is obtained from the reward function of step 1-4, and the next state s_{t+1} is obtained from step 3; the sample e_transition = <s_t, a_t, r_t, s_{t+1}> is then stored in the experience pool;
step 4-5: judging whether the size N_R of the experience pool meets the requirement: if N_R < M, go to step 4-3; if the number of stored samples exceeds the experience pool capacity, the sample data at the front of the experience pool queue are automatically dequeued; then enter step 4-6;
step 4-6: randomly extracting a small batch of N samples from the experience pool for learning, where the learning process is expressed as:
y_t = r_t + γ·Q′(s_{t+1}, μ′(s_{t+1} | θ^μ′) | θ^Q′)
where y_t represents the target value computed by the target networks, r_t is the reward value at time t, θ^Q′ and θ^μ′ respectively represent the parameters of the target evaluation network and of the target policy network, and Q′ denotes the state-action value obtained at time t+1 under the policy μ′;
step 4-7: updating the evaluation network by minimizing the loss function:
L = (1/N)·Σ_t ( y_t − Q(s_t, a_t | θ^Q) )²
where L represents the loss function and N represents the number of samples used for the network update;
step 4-8: updating the policy gradient:
∇_{θ^μ} J ≈ (1/N)·Σ_t [ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ) |_{s=s_t} ]
where ∇_{θ^μ} J denotes the policy gradient under the policy network parameters θ^μ, ∇_a Q(s, a | θ^Q) and ∇_{θ^μ} μ(s | θ^μ) respectively represent the gradient of the evaluation network's state-action value function and the gradient of the policy network's policy function, μ(s_t) denotes the action strategy selected by the policy network in state s_t, and Q(s_t, μ(s_t) | θ^Q) and μ(s_t | θ^μ) respectively represent the state-action value of the evaluation network and the action value of the policy network in state s_t when action a = μ(s_t) is taken;
step 4-9: updating the weights of the target evaluation network and the target policy network according to the following formulas:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
wherein tau is a soft update proportionality coefficient;
step 4-10: for the iteration step number k, execute k = k + 1 and judge: if k < T_range, execute t = t + Δt and return to step 4-3; otherwise, enter step 4-11;
step 4-11: judging the round number e: if e < E, return to step 4-2; if e ≥ E, save the network parameters at the current moment and take the currently trained policy network as the network for the first migration;
step 5: carrying out the first transfer learning, namely training the unmanned aerial vehicle to track the maneuvering target in the task two scenario:
step 5-1: migrating the trained neural network of the first task to a second task to serve as an initialization network of the second task;
step 5-2: executing the operations from the step 4-3 to the step 4-11, completing the task after the network is learned, storing the parameters and taking the trained strategy network as a network for the second migration;
step 6: carrying out the second transfer learning, namely training the unmanned aerial vehicle to track the maneuvering target in the task three scenario:
step 6-1: migrating the neural network trained on task two to task three as the initialization network of task three;
step 6-2: executing the operations from the step 4-3 to the step 4-11, completing the task after the network learns, and storing the parameters; and loading the stored data into an unmanned aerial vehicle system, so that the unmanned aerial vehicle finishes the work of state input, neural network analysis and action output, and the high-efficiency unmanned aerial vehicle maneuvering target tracking based on DDPG transfer learning is realized.
2. The unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning as described in claim 1, characterized in that λ_1, λ_2 ∈ (0, 1) and λ_3, λ_4 ∈ (0, 1).
CN202010486053.4A 2020-06-01 2020-06-01 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning Active CN111667513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010486053.4A CN111667513B (en) 2020-06-01 2020-06-01 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010486053.4A CN111667513B (en) 2020-06-01 2020-06-01 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning

Publications (2)

Publication Number Publication Date
CN111667513A CN111667513A (en) 2020-09-15
CN111667513B true CN111667513B (en) 2022-02-18

Family

ID=72385471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010486053.4A Active CN111667513B (en) 2020-06-01 2020-06-01 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning

Country Status (1)

Country Link
CN (1) CN111667513B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112051863A (en) * 2020-09-25 2020-12-08 南京大学 Unmanned aerial vehicle autonomous anti-reconnaissance and enemy attack avoidance method
CN112488320B (en) * 2020-09-25 2023-05-02 中国人民解放军军事科学院国防科技创新研究院 Training method and system for multiple agents under complex conditions
CN112596515B (en) * 2020-11-25 2023-10-24 北京物资学院 Multi-logistics robot movement control method and device
CN112435275A (en) * 2020-12-07 2021-03-02 中国电子科技集团公司第二十研究所 Unmanned aerial vehicle maneuvering target tracking method integrating Kalman filtering and DDQN algorithm
CN112698572B (en) * 2020-12-22 2022-08-16 西安交通大学 Structural vibration control method, medium and equipment based on reinforcement learning
CN112783199B (en) * 2020-12-25 2022-05-13 北京航空航天大学 Unmanned aerial vehicle autonomous navigation method based on transfer learning
CN112286218B (en) * 2020-12-29 2021-03-26 南京理工大学 Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN112799429B (en) * 2021-01-05 2022-03-29 北京航空航天大学 Multi-missile cooperative attack guidance law design method based on reinforcement learning
CN112904890B (en) * 2021-01-15 2023-06-30 北京国网富达科技发展有限责任公司 Unmanned aerial vehicle automatic inspection system and method for power line
CN112965488B (en) * 2021-02-05 2022-06-03 重庆大学 Baby monitoring mobile machine trolley based on transfer learning neural network
CN113158608A (en) * 2021-02-26 2021-07-23 北京大学 Processing method, device and equipment for determining parameters of analog circuit and storage medium
CN113095463A (en) * 2021-03-31 2021-07-09 南开大学 Robot confrontation method based on evolution reinforcement learning
CN113093803B (en) * 2021-04-03 2022-10-14 西北工业大学 Unmanned aerial vehicle air combat motion control method based on E-SAC algorithm
CN113189983B (en) * 2021-04-13 2022-05-31 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113325704B (en) * 2021-04-25 2023-11-10 北京控制工程研究所 Spacecraft backlighting approaching intelligent orbit control method, device and storage medium
CN113311851B (en) * 2021-04-25 2023-06-16 北京控制工程研究所 Spacecraft chase-escaping intelligent orbit control method, device and storage medium
CN113031642B (en) * 2021-05-24 2021-08-10 北京航空航天大学 Hypersonic aircraft trajectory planning method and system with dynamic no-fly zone constraint
CN113050433B (en) * 2021-05-31 2021-09-14 中国科学院自动化研究所 Robot control strategy migration method, device and system
CN115494831B (en) * 2021-06-17 2024-04-16 中国科学院沈阳自动化研究所 Tracking method for autonomous intelligent collaboration of human and machine
CN113467248A (en) * 2021-07-22 2021-10-01 南京大学 Fault-tolerant control method for unmanned aerial vehicle sensor during fault based on reinforcement learning
CN113721645A (en) * 2021-08-07 2021-11-30 中国航空工业集团公司沈阳飞机设计研究所 Unmanned aerial vehicle continuous maneuvering control method based on distributed reinforcement learning
CN113625569B (en) * 2021-08-12 2022-02-08 中国人民解放军32802部队 Small unmanned aerial vehicle prevention and control decision method and system based on hybrid decision model
CN113433953A (en) * 2021-08-25 2021-09-24 北京航空航天大学 Multi-robot cooperative obstacle avoidance method and device and intelligent robot
CN113822409B (en) * 2021-09-18 2022-12-06 中国电子科技集团公司第五十四研究所 Multi-unmanned aerial vehicle cooperative penetration method based on heterogeneous multi-agent reinforcement learning
CN113900445A (en) * 2021-10-13 2022-01-07 厦门渊亭信息科技有限公司 Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning
CN114089776B (en) * 2021-11-09 2023-10-24 南京航空航天大学 Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN115097853B (en) * 2022-05-18 2023-07-07 中国航空工业集团公司沈阳飞机设计研究所 Unmanned aerial vehicle maneuvering flight control method based on fine granularity repetition strategy
CN117707207B (en) * 2024-02-06 2024-04-19 中国民用航空飞行学院 Unmanned aerial vehicle ground target tracking and obstacle avoidance planning method based on deep reinforcement learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11775850B2 (en) * 2016-01-27 2023-10-03 Microsoft Technology Licensing, Llc Artificial intelligence engine having various algorithms to build different concepts contained within a same AI model
CN109032168B (en) * 2018-05-07 2021-06-08 西安电子科技大学 DQN-based multi-unmanned aerial vehicle collaborative area monitoring airway planning method
CN110806759B (en) * 2019-11-12 2020-09-08 清华大学 Aircraft route tracking method based on deep reinforcement learning

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930625A (en) * 2016-06-13 2016-09-07 天津工业大学 Design method of Q-learning and neural network combined smart driving behavior decision making system
CN106845016A (en) * 2017-02-24 2017-06-13 西北工业大学 One kind is based on event driven measurement dispatching method
CN107193009A (en) * 2017-05-23 2017-09-22 西北工业大学 A kind of many UUV cooperative systems underwater target tracking algorithms of many interaction models of fuzzy self-adaption
CN107402381A (en) * 2017-07-11 2017-11-28 西北工业大学 A kind of multiple maneuver target tracking methods of iteration self-adapting
CN107450555A (en) * 2017-08-30 2017-12-08 唐开强 A kind of Hexapod Robot real-time gait planing method based on deeply study
CN108599737A (en) * 2018-04-10 2018-09-28 西北工业大学 A kind of design method of the non-linear Kalman filtering device of variation Bayes
CN108919640A (en) * 2018-04-20 2018-11-30 西北工业大学 The implementation method of the adaptive multiple target tracking of unmanned plane
CN109933086A (en) * 2019-03-14 2019-06-25 天津大学 Unmanned plane environment sensing and automatic obstacle avoiding method based on depth Q study
CN110196605A (en) * 2019-04-26 2019-09-03 大连海事大学 A kind of more dynamic object methods of the unmanned aerial vehicle group of intensified learning collaboratively searching in unknown sea area
CN110322017A (en) * 2019-08-13 2019-10-11 吉林大学 Automatic Pilot intelligent vehicle Trajectory Tracking Control strategy based on deeply study
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning
CN110673620A (en) * 2019-10-22 2020-01-10 西北工业大学 Four-rotor unmanned aerial vehicle air line following control method based on deep reinforcement learning
CN110703766A (en) * 2019-11-07 2020-01-17 南京航空航天大学 Unmanned aerial vehicle path planning method based on transfer learning strategy deep Q network
CN110989576A (en) * 2019-11-14 2020-04-10 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A Generic Spatiotemporal Scheduling for Autonomous UAVs: A Reinforcement Learning-Based Approach; OMAR BOUHAMED et al.; Vehicular Technology; 2020-04-30; Vol. 1; pp. 93-106 *
Path Planning for UAV Ground Target Tracking via Deep Reinforcement Learning; BOHAO LI et al.; IEEE Access; 2020-02-17; Vol. 8; pp. 29064-29074 *
Research on UAV Maneuvering Decision-Making Method Based on Markov Networks; Luo Yuanqiang et al.; Journal of *** Simulation; 2017-12-31; Vol. 29; pp. 106-112 *
Research on Cooperative Flight Path Planning for Multiple UAVs; Ding Qiang; China Master's Theses Full-text Database, Engineering Science and Technology II; 2019-01-15; Vol. 2019, No. 1; C031-351 *

Also Published As

Publication number Publication date
CN111667513A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
CN111667513B (en) Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN110673620B (en) Four-rotor unmanned aerial vehicle air line following control method based on deep reinforcement learning
CN109655066B (en) Unmanned aerial vehicle path planning method based on Q (lambda) algorithm
CN112256056B (en) Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
CN111123963B (en) Unknown environment autonomous navigation system and method based on reinforcement learning
CN110806759B (en) Aircraft route tracking method based on deep reinforcement learning
CN112947562B (en) Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
CN112435275A (en) Unmanned aerial vehicle maneuvering target tracking method integrating Kalman filtering and DDQN algorithm
Ma et al. Deep reinforcement learning of UAV tracking control under wind disturbances environments
Ma et al. Multi-robot target encirclement control with collision avoidance via deep reinforcement learning
CN112034711B (en) Unmanned ship sea wave interference resistance control method based on deep reinforcement learning
CN114625151B (en) Underwater robot obstacle avoidance path planning method based on reinforcement learning
CN110442129B (en) Control method and system for multi-agent formation
CN112462792B (en) Actor-Critic algorithm-based underwater robot motion control method
CN112947505B (en) Multi-AUV formation distributed control method based on reinforcement learning algorithm and unknown disturbance observer
CN113268074B (en) Unmanned aerial vehicle flight path planning method based on joint optimization
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN111783994A (en) Training method and device for reinforcement learning
CN113110546B (en) Unmanned aerial vehicle autonomous flight control method based on offline reinforcement learning
CN115033022A (en) DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform
CN114330115B (en) Neural network air combat maneuver decision-making method based on particle swarm search
CN114967721B (en) Unmanned aerial vehicle self-service path planning and obstacle avoidance strategy method based on DQ-CapsNet
CN114003059B (en) UAV path planning method based on deep reinforcement learning under kinematic constraint condition
CN117707207B (en) Unmanned aerial vehicle ground target tracking and obstacle avoidance planning method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant