CN111667513B - Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning - Google Patents

Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning

Info

Publication number
CN111667513B
CN111667513B (application CN202010486053.4A)
Authority
CN
China
Prior art keywords
aerial vehicle
unmanned aerial
network
target
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010486053.4A
Other languages
Chinese (zh)
Other versions
CN111667513A (en)
Inventor
李波
杨志鹏
高晓光
万开方
梁诗阳
马浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010486053.4A priority Critical patent/CN111667513B/en
Publication of CN111667513A publication Critical patent/CN111667513A/en
Application granted granted Critical
Publication of CN111667513B publication Critical patent/CN111667513B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/277 Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20076 Probabilistic image processing
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to an unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning, which trains a neural network by decomposing the task and initializing the environment state, the neural network parameters and the other hyper-parameters. At the start of each round, the unmanned aerial vehicle executes actions that change its speed and course angle to obtain a new state; the experience of each round is stored in an experience pool as learning samples, and the parameters of the neural network are updated iteratively. When training is finished, the neural network parameters trained on the current subtask are saved and transferred to the unmanned aerial vehicle maneuvering target tracking network of the next task scenario, until the final task is completed.

Description

Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
Technical Field
The invention relates to an unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning, and belongs to the field of robot intelligent control.
Background
With the continuous development of unmanned aerial vehicle technology, unmanned aerial vehicles are widely used in civil applications. Among the many tasks carried out by unmanned aerial vehicles, monitoring and reconnaissance are the most common. If an unmanned aerial vehicle can autonomously and accurately track a moving target, it can expand the monitoring range while effectively avoiding threat areas, which greatly improves the efficiency of monitoring, reconnaissance and even attack.
Most existing research on unmanned aerial vehicle maneuvering targets focuses on state estimation and measurement information processing of the maneuvering target; how the unmanned aerial vehicle should decide its own maneuvering behavior once the target state is determined, so that it can track the target better, has rarely been studied. Traditional unmanned aerial vehicle maneuvering target tracking algorithms mainly rely on the accuracy of the target motion model: if there is a large error between the assumed environment model and the actual motion of the target, factors that cannot be estimated from the target state will disturb the tracking process, and building a maneuvering model of the target is itself time-consuming. The environment in which drones track targets can be complex, dynamically changing and even uncertain, and the target tracking tasks undertaken by drones are becoming increasingly demanding. Taken together, these factors place higher requirements on the autonomy of unmanned aerial vehicles, which increasingly need autonomous learning capability. Therefore, studying tracking methods that depend little on an environment model, or need no model at all, and that can be learned through interaction with the environment, is very meaningful and has become an inevitable trend in unmanned aerial vehicle maneuvering target tracking research.
Patent publication CN108919640B proposes an unmanned aerial vehicle target tracking method based on reinforcement learning; its tracking environment is simple and the amount of data required for decision making is small, so it cannot handle unmanned aerial vehicle target tracking under complex environmental conditions and is difficult to apply to an unmanned aerial vehicle control system in a real scene. Patent publication CN110806759A provides an aircraft route tracking method based on deep reinforcement learning, which corrects the physical control of the aircraft online so as to realize autonomous perception and decision making. However, this method does not take into account the time cost required for fitting the neural network, nor its transferability, which makes the task difficult to train.
The deep deterministic policy gradient (DDPG) algorithm not only exploits the strengths of the experience pool and the dual (current/target) neural network structure of the deep Q-network algorithm, alleviating problems such as the data explosion of traditional reinforcement learning, but also inherits the characteristics of policy gradient algorithms, so it can effectively handle continuous-domain data and makes the neural network converge quickly. In addition, as an efficient machine learning method, transfer learning can migrate a network developed for one task and reuse it in the development of models for similar engineering tasks, which greatly saves training time and cost and improves the generalization ability of networks and models. Therefore, designing an unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning is of great significance for unmanned aerial vehicle applications in the related fields.
Disclosure of Invention
Technical problem to be solved
In order to avoid the defects of the prior art, the invention provides an unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning.
Technical scheme
An unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning is characterized by comprising the following steps:
step 1: constructing a Markov model (S, A, O, R, gamma) for tracking the maneuvering target of the unmanned aerial vehicle, wherein S is the input state of the unmanned aerial vehicle, A is the output action of the unmanned aerial vehicle, O is the observation space of a sensor of the unmanned aerial vehicle, R is a reward function, and gamma is a discount coefficient;
step 1-1: defining the state space of the Markov model, namely the input state S:
combining the unmanned aerial vehicle state, the target state and the obstacle state information, setting the model input state as follows:
S = [S_uav, S_target, S_obs^1, …, S_obs^n]
wherein: the unmanned aerial vehicle state is S_uav = [x_uav, y_uav, v_uav, θ_uav], where x_uav, y_uav represent the position of the drone on the two-dimensional plane, v_uav is the speed of the drone and θ_uav is the azimuth of the drone;
the target state is S_target = [x_target, y_target, v_x^target, v_y^target, ω_target], where x_target, y_target represent the position of the target on the two-dimensional plane, v_x^target and v_y^target are the velocity components of the target along the X and Y axes, and ω_target is the turn rate of the target, ω_target > 0 denoting a counter-clockwise turn and ω_target < 0 a clockwise turn;
the obstacle state S_obs^i represents the state of the i-th obstacle, where i = 1, 2, …, n; because the actual physical models of the obstacles differ, each obstacle is uniformly replaced by its circumscribed circle for convenience of modelling; the obstacle state is set as S_obs^i = [x_obs^i, y_obs^i, r_obs^i], where x_obs^i, y_obs^i indicate the position of the i-th obstacle in the two-dimensional plane and r_obs^i is the radius of the circumscribed circle of the i-th obstacle;
step 1-2: defining the motion space of the Markov model, namely the output motion A of the unmanned aerial vehicle:
the output action A represents the set of actions taken by the unmanned aerial vehicle with respect to its own state after receiving the external feedback value; the output is set as:
A = [a_t, ω_t]
where a_t is the acceleration of the drone at time t and ω_t is the angular velocity of the drone at time t; combining practical application, the acceleration and angular velocity of the unmanned aerial vehicle are constrained respectively as a_t ∈ [a_min, a_max] and ω_t ∈ [ω_min, ω_max], where a_min and a_max represent the minimum and maximum acceleration of the unmanned aerial vehicle, and ω_min and ω_max represent the minimum and maximum angular velocities of the unmanned aerial vehicle;
step 1-3: the observation space defining the markov model, i.e. the observation space O of the sensor:
judging and acquiring the position and speed information of the unmanned aerial vehicle and the target by using a radar sensor; the observation space is set as follows:
O = [D, φ]
where the relative distance D between the unmanned aerial vehicle and the target is:
D = sqrt((x_target − x_uav)² + (y_target − y_uav)²) + σ_D
and the relative azimuth φ between the unmanned aerial vehicle and the target is:
φ = arctan((y_target − y_uav) / (x_target − x_uav)) + σ_φ
where σ_D and σ_φ are the observation error values of the distance and the angle, respectively;
step 1-4: defining a reward function R:
the sensor is used to acquire the positions of the unmanned aerial vehicle and the target, and the reward function R is obtained by combining a distance reward/penalty and an obstacle-avoidance reward/penalty for the unmanned aerial vehicle; the reward function R represents the feedback value obtained when the unmanned aerial vehicle selects a certain action in the current state;
the distance reward function r_1 is set piecewise (its full expression is given as an image in the original document): if D_t > L, a negative constant penalty C_2 is given; if D_t ≤ L, a positive reward weighted by λ_1 and λ_2 is given; if D_t < L and D_t < D_min, a positive constant reward C_1 is given; where λ_1, λ_2 are the weight values of the two rewards, D_{t−1} represents the distance between the drone and the target at the previous moment, D_t is the distance between the unmanned aerial vehicle and the target at the current time t, D_min is the minimum tracking range, D_max is the maximum tracking distance, and L is the observation range of the sensor;
the obstacle-avoidance reward function r_t^coll is likewise set piecewise (its expression is given as an image in the original document), where D_obs^t is the distance between the unmanned aerial vehicle and the obstacle at time t and D_safe is a constant representing the safe separation between the drone and the obstacle;
combining the distance reward and the obstacle-avoidance reward of the unmanned aerial vehicle, the reward function R is obtained as:
R = λ_3·r_1 − λ_4·r_t^coll
where λ_3 and λ_4 represent the weight values of the distance reward and the obstacle-avoidance reward, respectively;
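For illustration only, the following Python sketch shows how a combined reward of this form could be computed. The piecewise expressions of r_1 and r_t^coll are only given as images in the original document, so the branch values and shaping terms used here (C1, C2, the normalised distance terms) are assumptions, not the patent's exact formulas.

```python
def distance_reward(d_prev, d_t, d_min=15.0, L=100.0, C1=1.0, C2=-1.0,
                    lam1=0.5, lam2=0.5):
    """Hypothetical distance reward r_1; the patent only states the three cases."""
    if d_t > L:                       # target outside the sensor range: constant penalty
        return C2
    r = lam1 * (d_prev - d_t) / L + lam2 * (L - d_t) / L   # assumed positive shaping term
    if d_t < d_min:                   # very close to the target: extra constant bonus
        r += C1
    return r

def obstacle_reward(d_obs_t, d_safe=10.0):
    """Hypothetical obstacle-avoidance penalty r_t^coll (zero when outside D_safe)."""
    return max(0.0, (d_safe - d_obs_t) / d_safe)

def total_reward(d_prev, d_t, d_obs_t, lam3=0.7, lam4=0.3):
    # R = λ3·r1 − λ4·r_t^coll, with the embodiment's weights as defaults
    return lam3 * distance_reward(d_prev, d_t) - lam4 * obstacle_reward(d_obs_t)
```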
step 1-5: defining a discount factor γ:
setting a discount factor 0 < γ < 1 for calculating the accumulated return over the whole process; the larger the value of γ, the more emphasis is placed on long-term returns;
step 2: constructing a neural network of the DDPG algorithm:
step 2-1: constructing a policy network in the DDPG algorithm, namely an Actor policy network:
the policy network μ_actor is composed of an input layer, a hidden layer and an output layer; for an input state vector s, the output vector u of the policy network is expressed as:
u=μactor(s)
step 2-2: constructing an evaluation network in the DDPG algorithm, namely a criticic evaluation network:
the output of the evaluation network is the state-action value Q^μ(s, u), expressed as:
Q^μ(s_t, u_t) = E[ r(s_t, u_t) + Σ_{k=0…∞} γ^(k+1)·r(s_{t+k+1}, u_{t+k+1}) ]
where k is a summation variable and E[·] represents the mathematical expectation; s_{t+k+1} and u_{t+k+1} respectively denote the state input vector and the action output vector at time t+k+1;
step 2-3: constructing a target neural network:
the weights of the policy network μ_actor and of the evaluation network Q^μ(s, u) are copied into the respective target networks, i.e. θ^μ → θ^μ′, θ^Q → θ^Q′, where θ^μ and θ^Q represent the parameters of the current policy network and of the evaluation network respectively, and θ^μ′ and θ^Q′ represent the parameters of the target policy network and of the target evaluation network respectively;
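As an illustration of step 2-3, the following is a minimal sketch, assuming PyTorch, of how the target networks can be created as copies of the current policy and evaluation networks (θ^μ → θ^μ′, θ^Q → θ^Q′); the helper name make_target and the actor/critic variable names are illustrative.

```python
import copy
import torch.nn as nn

def make_target(net: nn.Module) -> nn.Module:
    """Create a target network θ' initialised as an exact copy of θ."""
    target = copy.deepcopy(net)
    for p in target.parameters():
        p.requires_grad_(False)   # target networks are changed only by soft updates
    return target

# usage (actor and critic are assumed nn.Module instances):
# actor_target = make_target(actor)
# critic_target = make_target(critic)
```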
step 3: unmanned aerial vehicle and target state update
Step 3-1: establishing a state updating equation of the unmanned aerial vehicle at the time t:
(the kinematic state update equations of the unmanned aerial vehicle at time t are given as an image in the original document)
where x_uav(·) and y_uav(·) denote the coordinate values of the unmanned aerial vehicle at a given time, v_uav(·) and ζ_uav(·) denote the linear and angular velocity of the drone at that time, and a_t is the acceleration of the unmanned aerial vehicle at that time; Δt is the simulation time interval, and (v_min, v_max) are the minimum and maximum speeds of the unmanned aerial vehicle;
step 3-2: constructing a state updating equation of the target at the time t:
S_target(t+1) = F_t·S_target(t) + Γ_t·w_t
where S_target(t+1) represents the target state at time t+1, F_t is the state transition matrix, Γ_t is the noise influence matrix, and w_t is Gaussian white noise; the concrete forms of F_t and Γ_t are given as images in the original document;
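For illustration, the following is a minimal Python sketch of plausible state-update equations of the kind described in steps 3-1 and 3-2. The UAV kinematics and the constant-turn-rate form of F_t and Γ_t used here are assumptions, since the patent gives the exact update equations and matrices only as images.

```python
import math
import numpy as np

def update_uav(x, y, v, heading, a_t, w_t, dt=1.0, v_min=0.0, v_max=100.0):
    """One plausible planar kinematic update for the UAV state (assumed form)."""
    v_new = min(max(v + a_t * dt, v_min), v_max)     # speed clipped to [v_min, v_max]
    heading_new = heading + w_t * dt                 # heading driven by angular velocity
    x_new = x + v_new * math.cos(heading_new) * dt
    y_new = y + v_new * math.sin(heading_new) * dt
    return x_new, y_new, v_new, heading_new

def update_target(s, dt=1.0, noise_std=0.1):
    """S(t+1) = F_t S(t) + Γ_t w_t with an assumed constant-turn-rate F_t.

    s = [x, y, vx, vy, omega]; the patent's exact F_t and Γ_t are given only as images."""
    s = np.asarray(s, dtype=float)
    om = s[4]
    if abs(om) > 1e-6:
        sn, cs = math.sin(om * dt), math.cos(om * dt)
        F = np.array([[1, 0, sn / om,       -(1 - cs) / om, 0],
                      [0, 1, (1 - cs) / om,  sn / om,       0],
                      [0, 0, cs,            -sn,            0],
                      [0, 0, sn,             cs,            0],
                      [0, 0, 0,              0,             1]])
    else:
        F = np.eye(5); F[0, 2] = dt; F[1, 3] = dt
    G = np.array([[dt**2 / 2, 0], [0, dt**2 / 2], [dt, 0], [0, dt], [0, 0]])
    w = np.random.normal(0.0, noise_std, size=2)     # Gaussian white noise w_t
    return F @ s + G @ w
```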
step 4: training the maneuvering target tracking of the unmanned aerial vehicle with the deterministic policy gradient method in the task one scenario:
step 4-1: setting the maximum training round as E and the maximum number of steps per round as T_range, setting the experience pool size M, setting the soft update proportion coefficient τ of the target neural networks, and setting the learning rates of the evaluation network and the policy network as α_ω and α_θ respectively;
Step 4-2: initializing a state space S and initializing network parameters;
step 4-3: in the current state s_t, selecting the action of the unmanned aerial vehicle:
a_t = μ_d(s_t | θ^μ) + ε_t
where μ_d(·) represents a deterministic policy function and ε_t is a random process noise vector;
step 4-4: the unmanned aerial vehicle executes action a_t; the relative distance and relative azimuth between the unmanned aerial vehicle and the target are calculated according to step 1-3, the reward value r_t at time t is obtained from the reward function of step 1-4, and the next state s_{t+1} is obtained from step 3; the sample e_transition = <s_t, a_t, r_t, s_{t+1}> is then stored in the experience pool;
step 4-5: judging whether the size N_R of the experience pool meets the requirement: if N_R < M, go to step 4-3; if the number of stored samples exceeds the experience pool capacity, the sample data at the front of the experience pool queue are automatically dequeued; then enter step 4-6;
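A minimal sketch of the experience pool described in steps 4-4 and 4-5, assuming Python's collections.deque, which automatically dequeues the oldest samples once the capacity M is reached; the class and method names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool: oldest transitions are dequeued automatically once it is full."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # maxlen gives the automatic dequeue of step 4-5

    def store(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def __len__(self):
        return len(self.buffer)

    def sample(self, n):
        return random.sample(self.buffer, n)   # random mini-batch for step 4-6

# usage: pool = ReplayBuffer(capacity=8000); pool.store(s, a, r, s2)
# if len(pool) >= batch_size: batch = pool.sample(batch_size)
```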
step 4-6: randomly extracting a small batch of N samples from the experience pool for learning, where the learning process is expressed as:
y_t = r_t + γ·Q′(s_{t+1}, μ′(s_{t+1} | θ^μ′) | θ^Q′)
where y_t represents the target value computed by the target networks, r_t is the reward value at time t, θ^Q′ and θ^μ′ respectively represent the parameters of the target evaluation network and of the target policy network, and Q′ denotes the state-action value obtained at state s_{t+1} under the policy μ′;
step 4-7: updating the evaluation network by minimizing the loss function:
L = (1/N)·Σ_t ( y_t − Q(s_t, a_t | θ^Q) )²
where L represents the loss function and N represents the number of samples used for the network update;
step 4-8: updating the policy gradient:
∇_{θ^μ} J ≈ (1/N)·Σ_t [ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ) |_{s=s_t} ]
where ∇_{θ^μ} J denotes the policy gradient under the policy network parameters θ^μ, ∇_a Q(s, a | θ^Q) and ∇_{θ^μ} μ(s | θ^μ) respectively represent the gradient of the evaluation network's state-action value function and the gradient of the policy network's policy function, μ(s_t) denotes the action strategy selected by the policy network in state s_t, and Q(s_t, μ(s_t) | θ^Q) and μ(s_t | θ^μ) respectively represent the state-action value of the evaluation network and the action value of the policy network in state s_t when action a = μ(s_t) is taken;
step 4-9: updating the weights of the target evaluation network and the target policy network according to the following formulas:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
wherein tau is a soft update proportionality coefficient;
step 4-10: for the iteration step number k, execute k = k + 1 and judge: if k < T_range, execute t = t + Δt and return to step 4-3; otherwise, enter step 4-11;
step 4-11: judging the round number e: if e < E, return to step 4-2; if e ≥ E, save the network parameters at this moment and take the currently trained policy network as the network for the first migration;
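For illustration, the following is a compact sketch of one learning step covering steps 4-6 to 4-9, assuming PyTorch; the network classes, optimizers and tensor shapes are assumptions, and this is a sketch rather than the patent's reference implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, gamma=0.95, tau=0.9):
    """One learning step: target y_t, critic loss, actor (policy) gradient, soft update."""
    s, a, r, s2 = batch                                   # tensors sampled from the pool

    with torch.no_grad():                                 # step 4-6: y_t from target networks
        y = r + gamma * critic_t(s2, actor_t(s2))

    critic_loss = F.mse_loss(critic(s, a), y)             # step 4-7: minimise (y_t - Q)^2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()              # step 4-8: ascend ∇_a Q · ∇_θ μ
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for t_p, p in zip(list(critic_t.parameters()) + list(actor_t.parameters()),
                      list(critic.parameters()) + list(actor.parameters())):
        t_p.data.copy_(tau * p.data + (1 - tau) * t_p.data)   # step 4-9: soft update
```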
step 5: carrying out the first transfer learning, namely training the unmanned aerial vehicle to track the maneuvering target in the task two scenario:
step 5-1: migrating the trained neural network of the first task to a second task to serve as an initialization network of the second task;
step 5-2: executing the operations from the step 4-3 to the step 4-11, completing the task after the network is learned, storing the parameters and taking the trained strategy network as a network for the second migration;
step 6: carrying out the second transfer learning, namely training the unmanned aerial vehicle to track the maneuvering target in the task three scenario:
step 6-1: migrating the neural network trained on task two to task three as the initialization network of task three;
step 6-2: executing the operations from the step 4-3 to the step 4-11, completing the task after the network learns, and storing the parameters; and loading the stored data into an unmanned aerial vehicle system, so that the unmanned aerial vehicle finishes the work of state input, neural network analysis and action output, and the high-efficiency unmanned aerial vehicle maneuvering target tracking based on DDPG transfer learning is realized.
λ_1, λ_2 ∈ (0, 1); λ_3, λ_4 ∈ (0, 1).
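As an illustration of the migration in steps 4-11, 5 and 6, the following sketch, assuming PyTorch state dictionaries, saves the policy and evaluation networks trained on one subtask and loads them as the initialization of the next subtask; the file names and helper names are illustrative.

```python
import torch

def save_task_network(actor, critic, tag):
    # illustrative file name, e.g. "ddpg_task1.pt"
    torch.save({"actor": actor.state_dict(),
                "critic": critic.state_dict()}, f"ddpg_{tag}.pt")

def transfer_to_next_task(actor, critic, tag):
    """Initialise the next task's networks from the previous task's trained weights."""
    ckpt = torch.load(f"ddpg_{tag}.pt")
    actor.load_state_dict(ckpt["actor"])
    critic.load_state_dict(ckpt["critic"])
    # target networks are then re-copied from these weights before training resumes

# e.g. after task one:  save_task_network(actor, critic, "task1")
#      before task two: transfer_to_next_task(actor, critic, "task1")
```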
Advantageous effects
The invention provides an unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning. The method does not depend on an environment model: a deep neural network is established, sensor information such as the positions and speeds of the unmanned aerial vehicle and the target is used as the input of the neural network, and the acceleration and angular velocity of the unmanned aerial vehicle are used as the output. The complex task is decomposed into: task one, tracking a target in uniform linear motion; task two, tracking a target with a complex maneuvering pattern; task three, completing target tracking while achieving obstacle avoidance. The flight tracking policy network is then trained and migrated based on the DDPG algorithm and transfer learning, thereby completing the unmanned aerial vehicle maneuvering target tracking task in a complex environment. Its advantages are:
(1) The method realizes online tracking decisions for the unmanned aerial vehicle when the environment model is unknown. By adopting the deep deterministic policy gradient (DDPG) method, with the strong fitting capability of the neural network, the optimal evaluation and policy networks for reaching the target can be learned automatically from the sampled tracking data of the unmanned aerial vehicle, and the tracking task is completed.
(2) The invention uses transfer learning, which greatly improves the convergence speed while preserving the accuracy of the algorithm, and saves engineering development and model training cost. By migrating the trained model or network to a new engineering task, resetting the state space and the action space and adjusting the hyper-parameters of neural network training, more intelligent decision tasks of the unmanned aerial vehicle system can be extended and realized.
Drawings
FIG. 1 is a flow chart of training task of tracking maneuvering target of unmanned aerial vehicle based on DDPG transfer learning
FIG. 2 is a schematic diagram of a DDPG-based unmanned aerial vehicle maneuvering target tracking algorithm structure
FIG. 3 is a task exploded view of unmanned aerial vehicle maneuvering target tracking
FIG. 4 is a graph showing the variation of the reward obtained by the unmanned aerial vehicle in each round during the training process
FIG. 5 is a track display diagram of unmanned aerial vehicle for completing obstacle avoidance and target tracking
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the invention provides an unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning, and the whole flow is shown in figure 1. The technical solution is further clearly and completely described below with reference to the accompanying drawings and specific embodiments:
step 1: constructing a Markov model (S, A, O, R, gamma) for tracking the maneuvering target of the unmanned aerial vehicle, wherein S is the input state of the unmanned aerial vehicle, A is the output action of the unmanned aerial vehicle, O is the observation space of a sensor of the unmanned aerial vehicle, R is a reward function, and gamma is a discount coefficient;
step 1-1: defining the state space of the Markov model, namely the input state S:
combining the unmanned aerial vehicle state, the target state and the obstacle state information, setting the model input state as follows:
S = [S_uav, S_target, S_obs^1, …, S_obs^n]
wherein: the unmanned aerial vehicle state is S_uav = [x_uav, y_uav, v_uav, θ_uav], where x_uav, y_uav represent the position of the drone on the two-dimensional plane, v_uav is the speed of the drone and θ_uav is the azimuth of the drone;
the target state is S_target = [x_target, y_target, v_x^target, v_y^target, ω_target], where x_target, y_target represent the position of the target on the two-dimensional plane, v_x^target and v_y^target are the velocity components of the target along the X and Y axes, and ω_target is the turn rate of the target, ω_target > 0 denoting a counter-clockwise turn and ω_target < 0 a clockwise turn;
the obstacle state S_obs^i represents the state of the i-th obstacle (i = 1, 2, …, n); because the actual physical model of each obstacle is different, each obstacle is uniformly replaced by its circumscribed circle for convenience of modelling; the obstacle state is set as S_obs^i = [x_obs^i, y_obs^i, r_obs^i], where x_obs^i, y_obs^i indicate the position of the i-th obstacle in the two-dimensional plane and r_obs^i is the radius of the circumscribed circle of the i-th obstacle;
step 1-2: defining the motion space of the Markov model, namely the output motion A of the unmanned aerial vehicle:
and outputting an action A to represent an action set taken by the unmanned aerial vehicle aiming at the self state value after receiving the external feedback value. In the present invention, the output is set as:
A = [a_t, ω_t]
where a_t is the acceleration of the drone at time t and ω_t is the angular velocity of the drone at time t; combining practical application, the acceleration and the angular velocity of the unmanned aerial vehicle are constrained respectively as a_t ∈ [a_min, a_max] (the concrete numeric range is given as an image in the original document) and ω_t ∈ [−3, 3];
step 1-3: the observation space defining the markov model, i.e. the observation space O of the sensor:
in the invention, the radar sensor is utilized to judge and acquire the position and speed information of the unmanned aerial vehicle and the target. The observation space is set as follows:
O = [D, φ]
where the relative distance D between the unmanned aerial vehicle and the target is:
D = sqrt((x_target − x_uav)² + (y_target − y_uav)²) + σ_D
and the relative azimuth φ between the unmanned aerial vehicle and the target is:
φ = arctan((y_target − y_uav) / (x_target − x_uav)) + σ_φ
where σ_D and σ_φ are the observation error values of the distance and the angle, respectively;
step 1-4: defining a reward function R:
acquiring information of the unmanned aerial vehicle and a target position by using a sensor, and comprehensively obtaining a reward function R by performing distance reward punishment and obstacle avoidance reward punishment on the unmanned aerial vehicle, wherein the reward function R represents a feedback value obtained when the unmanned aerial vehicle selects a certain action in the current state;
In this embodiment, the minimum tracking range D_min is set to 0–15 meters, the maximum tracking distance D_max is set to 100 meters, and the observation range of the sensor is L = 100 meters. The distance reward function r_1 is set piecewise (its full expression is given as an image in the original document): D_{t−1} represents the distance between the drone and the target at the previous moment and D_t is the distance between the unmanned aerial vehicle and the target at the current time t; if the current distance D_t is larger than the measurement range of the sensor, a penalty of −1 is given to the unmanned aerial vehicle; if D_t ≤ L, a positive reward is given; if D_t < L and D_t < D_min, an additional constant reward of 1 is given;
In this embodiment, the safe separation between the unmanned aerial vehicle and the obstacles is set to D_safe = 10 meters. The obstacle-avoidance reward function r_t^coll is set piecewise (its expression is given as an image in the original document), where D_obs^t is the distance between the drone and the obstacle at time t;
Combining the respective weights of the distance reward and the obstacle-avoidance reward of the unmanned aerial vehicle, the reward function R is set as:
R = 0.7·r_1 − 0.3·r_t^coll
step 1-5: defining a discount factor γ:
a discount factor is set for calculating the accumulated return value in the whole process. In this embodiment, γ is set to 0.95.
Step 2: constructing a neural network of the DDPG algorithm, wherein the schematic structural diagram of the algorithm is shown in FIG. 2:
step 2-1: constructing a policy network in the DDPG algorithm, namely an Actor policy network:
the policy network μ_actor is composed of an input layer, a hidden layer and an output layer; for an input state vector s, the output vector u of the policy network is expressed as:
u=μactor(s)
step 2-2: constructing an evaluation network in the DDPG algorithm, namely a criticic evaluation network:
the output of the evaluation network is the state-action value Q^μ(s, u), expressed as:
Q^μ(s_t, u_t) = E[ r(s_t, u_t) + Σ_{k=0…∞} γ^(k+1)·r(s_{t+k+1}, u_{t+k+1}) ]
where k is a summation variable and E[·] represents the mathematical expectation; s_{t+k+1} and u_{t+k+1} respectively denote the state input vector and the action output vector at time t+k+1;
step 2-3: constructing a target neural network:
the weights of the policy network μ_actor and of the evaluation network Q^μ(s, u) are copied into the respective target networks, i.e. θ^μ → θ^μ′, θ^Q → θ^Q′, where θ^μ and θ^Q represent the parameters of the current policy network and of the evaluation network respectively, and θ^μ′ and θ^Q′ represent the parameters of the target policy network and of the target evaluation network respectively;
it should be noted that in this embodiment the policy network, the evaluation network and their corresponding target networks each use a three-layer neural network; the hidden layer contains 100 neurons and uses the ReLU activation function, and the output layer uses the tanh function;
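For illustration, the following is a sketch, assuming PyTorch, of actor and critic networks matching the structure described in this embodiment (three layers, a 100-neuron ReLU hidden layer, tanh output); applying the tanh only to the policy network's output, and the exact input dimensions, are assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: input layer, one 100-neuron ReLU hidden layer, tanh output."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 100), nn.ReLU(),
            nn.Linear(100, action_dim), nn.Tanh())  # output later rescaled to the action bounds

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Evaluation network: takes state and action, outputs the scalar Q value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 100), nn.ReLU(),
            nn.Linear(100, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```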
step 3: unmanned aerial vehicle and target state update
Step 3-1: establishing a state updating equation of the unmanned aerial vehicle at the time t:
(the kinematic state update equations of the unmanned aerial vehicle at time t are given as an image in the original document)
where x_uav(·) and y_uav(·) denote the coordinate values of the unmanned aerial vehicle at a given time, v_uav(·) and ζ_uav(·) denote the linear and angular velocity of the drone at that time, and a_t is the acceleration of the unmanned aerial vehicle at that time. In this embodiment, the simulation time interval Δt is set to 1 second, and the minimum and maximum speeds of the unmanned aerial vehicle are set to v_min = 0 m/s and v_max = 100 m/s;
step 3-2: constructing a state updating equation of the target at the time t:
S_target(t+1) = F_t·S_target(t) + Γ_t·w_t
where S_target(t+1) represents the target state at time t+1, F_t is the state transition matrix, Γ_t is the noise influence matrix, and w_t is Gaussian white noise; the concrete forms of F_t and Γ_t are given as images in the original document;
step 3-3: in the invention, the position of each obstacle is kept unchanged, so that the position state of the obstacle does not need to be updated;
step 4: the invention decomposes the unmanned aerial vehicle maneuvering target tracking task into: task one, tracking a target in uniform linear motion; task two, tracking a target with a complex maneuvering pattern; task three, completing target tracking while achieving obstacle avoidance, as shown specifically in FIG. 3;
the maneuvering target tracking of the unmanned aerial vehicle is first trained with the deterministic policy gradient method in the task one scenario:
step 4-1: in this embodiment of the invention, the maximum training round E is set to 800, the maximum number of steps per round is T_range = 400, the experience pool size is M = 8000, the soft update proportion coefficient of the target neural networks is τ = 0.9, and the learning rates of the evaluation network and the policy network are α_ω = 0.001 and α_θ = 0.001 respectively;
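For convenience, the hyper-parameters of this embodiment can be collected in a single configuration, as in the following sketch (the dictionary and key names are illustrative):

```python
# Hyper-parameters of this embodiment collected in one place (values from the text above)
HPARAMS = dict(
    max_rounds=800,        # E
    max_steps=400,         # T_range
    buffer_size=8000,      # M
    tau=0.9,               # soft update proportion coefficient
    critic_lr=0.001,       # α_ω
    actor_lr=0.001,        # α_θ
    gamma=0.95,            # discount factor from step 1-5
)
```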
Step 4-2: initializing a state space S and initializing network parameters;
setting the initial state of the drone and the initial state of the target (their concrete numeric values are given as images in the original document); the turn rate of the target in the three stages of a round is ω_target = 6.18 degrees/sec, ω_target = 8.33 degrees/sec and ω_target = −2.21 degrees/sec respectively; the three obstacle initialization states are respectively: rectangular obstacle state S_1 = [400 m, 75 m, 42 m, 16 m], square obstacle state S_2 = [200 m, 115 m, 40 m], circular obstacle state S_3 = [528 m, 280 m, 12 m]; the rectangular and square obstacle models are replaced by their circumscribed circles during the obstacle avoidance process of the unmanned aerial vehicle;
initializing the weight of the neural network;
step 4-3: in the current state s_t, selecting the action of the unmanned aerial vehicle:
a_t = μ_d(s_t | θ^μ) + ε_t
where μ_d(·) represents a deterministic policy function and ε_t is a random process noise vector;
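A minimal sketch of the action selection of step 4-3 with exploration noise; the patent only specifies a random process noise vector ε_t, so the Gaussian noise and the clipping bounds used here are assumptions (Ornstein-Uhlenbeck noise is another common choice for DDPG).

```python
import numpy as np
import torch

def select_action(actor, s_t, noise_std=0.1, a_bounds=(-2.0, 2.0), w_bounds=(-3.0, 3.0)):
    """a_t = μ_d(s_t|θ^μ) + ε_t, with assumed Gaussian exploration noise."""
    with torch.no_grad():
        a = actor(torch.as_tensor(s_t, dtype=torch.float32)).numpy()
    a = a + np.random.normal(0.0, noise_std, size=a.shape)   # ε_t
    a[0] = np.clip(a[0], *a_bounds)   # acceleration bound (illustrative values)
    a[1] = np.clip(a[1], *w_bounds)   # angular-velocity bound ω_t ∈ [-3, 3]
    return a
```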
step 4-4: the unmanned aerial vehicle executes action a_t; the relative distance and relative azimuth between the unmanned aerial vehicle and the target are calculated according to step 1-3, the reward value r_t at time t is obtained from the reward function of step 1-4, and the next state s_{t+1} is obtained from step 3; the sample e_transition = <s_t, a_t, r_t, s_{t+1}> is then stored in the experience pool queue;
step 4-5: judging whether the size N_R of the experience pool meets the requirement: if N_R < M, go to step 4-3; if the number of stored samples exceeds the experience pool capacity, the sample data at the front of the experience pool queue are automatically dequeued; then enter step 4-6;
step 4-6: randomly extracting a small batch of N samples from the experience pool for learning, where the learning process is expressed as:
y_t = r_t + γ·Q′(s_{t+1}, μ′(s_{t+1} | θ^μ′) | θ^Q′)
where y_t represents the target value computed by the target networks, r_t is the reward value at time t, θ^Q′ and θ^μ′ respectively represent the parameters of the target evaluation network and of the target policy network, and Q′ denotes the state-action value obtained at state s_{t+1} under the policy μ′;
step 4-7: updating the evaluation network by minimizing the loss function:
L = (1/N)·Σ_t ( y_t − Q(s_t, a_t | θ^Q) )²
where L represents the loss function and N represents the number of samples used for the network update;
step 4-8: updating the policy gradient:
∇_{θ^μ} J ≈ (1/N)·Σ_t [ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ) |_{s=s_t} ]
where ∇_{θ^μ} J denotes the policy gradient under the policy network parameters θ^μ, ∇_a Q(s, a | θ^Q) and ∇_{θ^μ} μ(s | θ^μ) respectively represent the gradient of the evaluation network's state-action value function and the gradient of the policy network's policy function, μ(s_t) denotes the action strategy selected by the policy network in state s_t, and Q(s_t, μ(s_t) | θ^Q) and μ(s_t | θ^μ) respectively represent the state-action value of the evaluation network and the action value of the policy network in state s_t when action a = μ(s_t) is taken;
step 4-9: updating the weights of the target evaluation network and the target policy network according to the following formulas:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
in this embodiment, the soft update rate coefficient τ is set to 0.9.
Step 4-10: for the iteration step number k, execute k = k + 1 and judge: if k < T_range, execute t = t + Δt and return to step 4-3; otherwise, enter step 4-11;
step 4-11: judging the round number e: if e < E, return to step 4-2; if e ≥ E, the network parameters at this moment are saved and the current policy network is taken as the final policy network for task one; substituting the state space of step 1-1 as the final input of the network, effective unmanned aerial vehicle maneuvering target tracking is thereby realized;
step 5: carrying out the first transfer learning, namely training the unmanned aerial vehicle to track the maneuvering target in the task two scenario:
step 5-1: the maneuvering target tracking training of the unmanned aerial vehicle in the task two scenario is completed on the basis of task one; first, the neural network trained on task one is migrated to task two as the initialization network of task two;
step 5-2: executing the operations from step 4-3 to step 4-11; after a small amount of further learning the network completes the task, the parameters are saved and the trained policy network is taken as the network for the second migration;
step 6: carrying out the second transfer learning, namely training the unmanned aerial vehicle to track the maneuvering target in the task three scenario:
step 6-1: the maneuvering target tracking training of the unmanned aerial vehicle in the task three scenario is completed on the basis of task two, namely the neural network trained on task two is migrated to task three as the initialization network of task three;
step 6-2: executing the operations from step 4-3 to step 4-11; after a small amount of further learning the network completes the task; substituting the state space of step 1-1 as the final input of the network, the efficient unmanned aerial vehicle maneuvering target tracking based on DDPG transfer learning is realized.
The unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning provided by the invention trains the neural network by decomposing the task and initializing the environment state, the neural network parameters and the other hyper-parameters. At the start of each round, the unmanned aerial vehicle executes actions that change its speed and course angle to obtain a new state; the experience of each round is stored in an experience pool as learning samples, and the parameters of the neural network are updated iteratively. When training is finished, the neural network parameters trained on the current subtask are saved and transferred to the unmanned aerial vehicle maneuvering target tracking network of the next task scenario, until the final task is completed.
The curve of the reward obtained by the unmanned aerial vehicle in each round during training is shown in FIG. 4; after about 300 rounds of training, the unmanned aerial vehicle obtains a high and stable reward in every round. The progressive task-decomposition strategy proposed by the method, together with the DDPG algorithm designed specifically with transfer learning, improves the convergence speed of the original DDPG algorithm and the robustness of the network, thereby improving the efficiency and stability of the autonomous intelligent decision process of the unmanned aerial vehicle. The simulation result is shown in FIG. 5; it can be seen that the unmanned aerial vehicle trained with the DDPG transfer learning algorithm can effectively avoid obstacles and complete the maneuvering target tracking task.

Claims (2)

1. An unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning is characterized by comprising the following steps:
step 1: constructing a Markov model (S, A, O, R, gamma) for tracking the maneuvering target of the unmanned aerial vehicle, wherein S is the input state of the unmanned aerial vehicle, A is the output action of the unmanned aerial vehicle, O is the observation space of a sensor of the unmanned aerial vehicle, R is a reward function, and gamma is a discount coefficient;
step 1-1: defining the state space of the Markov model, namely the input state S:
combining the unmanned aerial vehicle state, the target state and the obstacle state information, setting the model input state as follows:
S = [S_uav, S_target, S_obs^1, …, S_obs^n]
wherein: the unmanned aerial vehicle state is S_uav = [x_uav, y_uav, v_uav, θ_uav], where x_uav, y_uav represent the position of the drone on the two-dimensional plane, v_uav is the speed of the drone and θ_uav is the azimuth of the drone;
the target state is S_target = [x_target, y_target, v_x^target, v_y^target, ω_target], where x_target, y_target represent the position of the target on the two-dimensional plane, v_x^target and v_y^target are the velocity components of the target along the X and Y axes, and ω_target is the turn rate of the target, ω_target > 0 denoting a counter-clockwise turn and ω_target < 0 a clockwise turn;
the obstacle state S_obs^i represents the state of the i-th obstacle, where i = 1, 2, …, n; because the actual physical models of the obstacles differ, each obstacle is uniformly replaced by its circumscribed circle for convenience of modelling; the obstacle state is set as S_obs^i = [x_obs^i, y_obs^i, r_obs^i], where x_obs^i, y_obs^i indicate the position of the i-th obstacle in the two-dimensional plane and r_obs^i is the radius of the circumscribed circle of the i-th obstacle;
step 1-2: defining the motion space of the Markov model, namely the output motion A of the unmanned aerial vehicle:
the output action A represents an action set taken by the unmanned aerial vehicle for the self state value after receiving the external feedback value; the output is set as:
Figure FDA0003382303610000018
wherein the content of the first and second substances,
Figure FDA0003382303610000019
acceleration, ω, at time t of the dronetThe angular velocity of the unmanned aerial vehicle at the moment t; the acceleration and the angular velocity of the unmanned aerial vehicle are respectively restrained by combining practical application:
Figure FDA00033823036100000110
ωt∈[ωminmax](ii) a Wherein the content of the first and second substances,
Figure FDA00033823036100000111
respectively representing the minimum acceleration and the maximum acceleration of the unmanned aerial vehicle; omegamin、ωmaxRespectively representing the minimum and maximum angular velocities of the unmanned aerial vehicle;
step 1-3: the observation space defining the markov model, i.e. the observation space O of the sensor:
judging and acquiring the position and speed information of the unmanned aerial vehicle and the target by using a radar sensor; the observation space is set as follows:
O = [D, φ]
where the relative distance D between the unmanned aerial vehicle and the target is:
D = sqrt((x_target − x_uav)² + (y_target − y_uav)²) + σ_D
and the relative azimuth φ between the unmanned aerial vehicle and the target is:
φ = arctan((y_target − y_uav) / (x_target − x_uav)) + σ_φ
where σ_D and σ_φ are the observation error values of the distance and the angle, respectively;
step 1-4: defining a reward function R:
the sensor is used to acquire the positions of the unmanned aerial vehicle and the target, and the reward function R is obtained by combining a distance reward/penalty and an obstacle-avoidance reward/penalty for the unmanned aerial vehicle; the reward function R represents the feedback value obtained when the unmanned aerial vehicle selects a certain action in the current state;
the distance reward function r_1 is set piecewise (its full expression is given as an image in the original document): if D_t > L, a negative constant penalty C_2 is given; if D_t ≤ L, a positive reward weighted by λ_1 and λ_2 is given; if D_t < L and D_t < D_min, a positive constant reward C_1 is given; where λ_1, λ_2 are the weight values of the two rewards, D_{t−1} represents the distance between the drone and the target at the previous moment, D_t is the distance between the unmanned aerial vehicle and the target at the current time t, D_min is the minimum tracking range, D_max is the maximum tracking distance, and L is the observation range of the sensor;
the obstacle-avoidance reward function r_t^coll is likewise set piecewise (its expression is given as an image in the original document), where D_obs^t is the distance between the unmanned aerial vehicle and the obstacle at time t and D_safe is a constant representing the safe separation between the drone and the obstacle;
combining the distance reward and the obstacle-avoidance reward of the unmanned aerial vehicle, the reward function R is obtained as:
R = λ_3·r_1 − λ_4·r_t^coll
where λ_3 and λ_4 represent the weight values of the distance reward and the obstacle-avoidance reward, respectively;
step 1-5: defining a discount factor γ:
setting a discount factor 0 < γ < 1 for calculating the accumulated return over the whole process; the larger the value of γ, the more emphasis is placed on long-term returns;
step 2: constructing a neural network of the DDPG algorithm:
step 2-1: constructing a policy network in the DDPG algorithm, namely an Actor policy network:
the policy network μ_actor is composed of an input layer, a hidden layer and an output layer; for an input state vector s, the output vector u of the policy network is expressed as:
u=μactor(s)
step 2-2: constructing an evaluation network in the DDPG algorithm, namely a criticic evaluation network:
the output of the evaluation network is the state-action value Q^μ(s, u), expressed as:
Q^μ(s_t, u_t) = E[ r(s_t, u_t) + Σ_{k=0…∞} γ^(k+1)·r(s_{t+k+1}, u_{t+k+1}) ]
where k is a summation variable and E[·] represents the mathematical expectation; s_{t+k+1} and u_{t+k+1} respectively denote the state input vector and the action output vector at time t+k+1;
step 2-3: constructing a target neural network:
the weights of the policy network μ_actor and of the evaluation network Q^μ(s, u) are copied into the respective target networks, i.e. θ^μ → θ^μ′, θ^Q → θ^Q′, where θ^μ and θ^Q represent the parameters of the current policy network and of the evaluation network respectively, and θ^μ′ and θ^Q′ represent the parameters of the target policy network and of the target evaluation network respectively;
step 3: unmanned aerial vehicle and target state update
Step 3-1: establishing a state updating equation of the unmanned aerial vehicle at the time t:
(the kinematic state update equations of the unmanned aerial vehicle at time t are given as an image in the original document)
where x_uav(·) and y_uav(·) denote the coordinate values of the unmanned aerial vehicle at a given time, v_uav(·) and ζ_uav(·) denote the linear and angular velocity of the drone at that time, and a_t is the acceleration of the unmanned aerial vehicle at that time; Δt is the simulation time interval, and (v_min, v_max) are the minimum and maximum speeds of the unmanned aerial vehicle;
step 3-2: constructing a state updating equation of the target at the time t:
S_target(t+1) = F_t·S_target(t) + Γ_t·w_t
where S_target(t+1) represents the target state at time t+1, F_t is the state transition matrix, Γ_t is the noise influence matrix, and w_t is Gaussian white noise; the concrete forms of F_t and Γ_t are given as images in the original document;
step 4: training the maneuvering target tracking of the unmanned aerial vehicle with the deterministic policy gradient method in the task one scenario:
step 4-1: setting the maximum training round as E and the maximum number of steps per round as T_range, setting the experience pool size M, setting the soft update proportion coefficient τ of the target neural networks, and setting the learning rates of the evaluation network and the policy network as α_ω and α_θ respectively;
Step 4-2: initializing a state space S and initializing network parameters;
step 4-3: in the current state s_t, selecting the action of the unmanned aerial vehicle:
a_t = μ_d(s_t | θ^μ) + ε_t
where μ_d(·) represents a deterministic policy function and ε_t is a random process noise vector;
step 4-4: the unmanned aerial vehicle executes action a_t; the relative distance and relative azimuth between the unmanned aerial vehicle and the target are calculated according to step 1-3, the reward value r_t at time t is obtained from the reward function of step 1-4, and the next state s_{t+1} is obtained from step 3; the sample e_transition = <s_t, a_t, r_t, s_{t+1}> is then stored in the experience pool;
step 4-5: judging whether the size N_R of the experience pool meets the requirement: if N_R < M, go to step 4-3; if the number of stored samples exceeds the experience pool capacity, the sample data at the front of the experience pool queue are automatically dequeued; then enter step 4-6;
step 4-6: randomly extracting a small batch of N samples from the experience pool for learning, where the learning process is expressed as:
y_t = r_t + γ·Q′(s_{t+1}, μ′(s_{t+1} | θ^μ′) | θ^Q′)
where y_t represents the target value computed by the target networks, r_t is the reward value at time t, θ^Q′ and θ^μ′ respectively represent the parameters of the target evaluation network and of the target policy network, and Q′ denotes the state-action value obtained at time t+1 under the policy μ′;
step 4-7: updating the evaluation network by minimizing the loss function:
L = (1/N)·Σ_t ( y_t − Q(s_t, a_t | θ^Q) )²
where L represents the loss function and N represents the number of samples used for the network update;
step 4-8: updating the policy gradient:
∇_{θ^μ} J ≈ (1/N)·Σ_t [ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ) |_{s=s_t} ]
where ∇_{θ^μ} J denotes the policy gradient under the policy network parameters θ^μ, ∇_a Q(s, a | θ^Q) and ∇_{θ^μ} μ(s | θ^μ) respectively represent the gradient of the evaluation network's state-action value function and the gradient of the policy network's policy function, μ(s_t) denotes the action strategy selected by the policy network in state s_t, and Q(s_t, μ(s_t) | θ^Q) and μ(s_t | θ^μ) respectively represent the state-action value of the evaluation network and the action value of the policy network in state s_t when action a = μ(s_t) is taken;
step 4-9: updating the weights of the target evaluation network and the target policy network according to the following formulas:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
wherein tau is a soft update proportionality coefficient;
step 4-10: for the iteration step number k, execute k = k + 1 and judge: if k < T_range, execute t = t + Δt and return to step 4-3; otherwise, enter step 4-11;
step 4-11: judging the round number e: if e < E, return to step 4-2; if e ≥ E, save the network parameters at the current moment and take the currently trained policy network as the network for the first migration;
step 5: carrying out the first transfer learning, namely training the unmanned aerial vehicle to track the maneuvering target in the task two scenario:
step 5-1: migrating the trained neural network of the first task to a second task to serve as an initialization network of the second task;
step 5-2: executing the operations from the step 4-3 to the step 4-11, completing the task after the network is learned, storing the parameters and taking the trained strategy network as a network for the second migration;
step 6: carrying out the second transfer learning, namely training the unmanned aerial vehicle to track the maneuvering target in the task three scenario:
step 6-1: migrating the neural network trained on task two to task three as the initialization network of task three;
step 6-2: executing the operations from the step 4-3 to the step 4-11, completing the task after the network learns, and storing the parameters; and loading the stored data into an unmanned aerial vehicle system, so that the unmanned aerial vehicle finishes the work of state input, neural network analysis and action output, and the high-efficiency unmanned aerial vehicle maneuvering target tracking based on DDPG transfer learning is realized.
2. The unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning as described in claim 1, characterized in that λ_1, λ_2 ∈ (0, 1) and λ_3, λ_4 ∈ (0, 1).
CN202010486053.4A 2020-06-01 2020-06-01 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning Active CN111667513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010486053.4A CN111667513B (en) 2020-06-01 2020-06-01 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010486053.4A CN111667513B (en) 2020-06-01 2020-06-01 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning

Publications (2)

Publication Number Publication Date
CN111667513A CN111667513A (en) 2020-09-15
CN111667513B true CN111667513B (en) 2022-02-18

Family

ID=72385471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010486053.4A Active CN111667513B (en) 2020-06-01 2020-06-01 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning

Country Status (1)

Country Link
CN (1) CN111667513B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112051863A (en) * 2020-09-25 2020-12-08 南京大学 Unmanned aerial vehicle autonomous anti-reconnaissance and enemy attack avoidance method
CN112488320B (en) * 2020-09-25 2023-05-02 中国人民解放军军事科学院国防科技创新研究院 Training method and system for multiple agents under complex conditions
CN112596515B (en) * 2020-11-25 2023-10-24 北京物资学院 Multi-logistics robot movement control method and device
CN112435275A (en) * 2020-12-07 2021-03-02 中国电子科技集团公司第二十研究所 Unmanned aerial vehicle maneuvering target tracking method integrating Kalman filtering and DDQN algorithm
CN112698572B (en) * 2020-12-22 2022-08-16 西安交通大学 Structural vibration control method, medium and equipment based on reinforcement learning
CN112783199B (en) * 2020-12-25 2022-05-13 北京航空航天大学 Unmanned aerial vehicle autonomous navigation method based on transfer learning
CN112286218B (en) * 2020-12-29 2021-03-26 南京理工大学 Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN112799429B (en) * 2021-01-05 2022-03-29 北京航空航天大学 Multi-missile cooperative attack guidance law design method based on reinforcement learning
CN112904890B (en) * 2021-01-15 2023-06-30 北京国网富达科技发展有限责任公司 Unmanned aerial vehicle automatic inspection system and method for power line
CN112965488B (en) * 2021-02-05 2022-06-03 重庆大学 Baby monitoring mobile machine trolley based on transfer learning neural network
CN113158608A (en) * 2021-02-26 2021-07-23 北京大学 Processing method, device and equipment for determining parameters of analog circuit and storage medium
CN113095463A (en) * 2021-03-31 2021-07-09 南开大学 Robot confrontation method based on evolution reinforcement learning
CN113093803B (en) * 2021-04-03 2022-10-14 西北工业大学 Unmanned aerial vehicle air combat motion control method based on E-SAC algorithm
CN113189983B (en) * 2021-04-13 2022-05-31 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113325704B (en) * 2021-04-25 2023-11-10 北京控制工程研究所 Spacecraft backlighting approaching intelligent orbit control method, device and storage medium
CN113311851B (en) * 2021-04-25 2023-06-16 北京控制工程研究所 Spacecraft chase-escaping intelligent orbit control method, device and storage medium
CN113031642B (en) * 2021-05-24 2021-08-10 北京航空航天大学 Hypersonic aircraft trajectory planning method and system with dynamic no-fly zone constraint
CN113050433B (en) * 2021-05-31 2021-09-14 中国科学院自动化研究所 Robot control strategy migration method, device and system
CN115494831B (en) * 2021-06-17 2024-04-16 中国科学院沈阳自动化研究所 Tracking method for autonomous intelligent collaboration of human and machine
CN113467248A (en) * 2021-07-22 2021-10-01 南京大学 Fault-tolerant control method for unmanned aerial vehicle sensor during fault based on reinforcement learning
CN113721645A (en) * 2021-08-07 2021-11-30 中国航空工业集团公司沈阳飞机设计研究所 Unmanned aerial vehicle continuous maneuvering control method based on distributed reinforcement learning
CN113625569B (en) * 2021-08-12 2022-02-08 中国人民解放军32802部队 Small unmanned aerial vehicle prevention and control decision method and system based on hybrid decision model
CN113433953A (en) * 2021-08-25 2021-09-24 北京航空航天大学 Multi-robot cooperative obstacle avoidance method and device and intelligent robot
CN113822409B (en) * 2021-09-18 2022-12-06 中国电子科技集团公司第五十四研究所 Multi-unmanned aerial vehicle cooperative penetration method based on heterogeneous multi-agent reinforcement learning
CN113900445A (en) * 2021-10-13 2022-01-07 厦门渊亭信息科技有限公司 Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning
CN114089776B (en) * 2021-11-09 2023-10-24 南京航空航天大学 Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN115097853B (en) * 2022-05-18 2023-07-07 中国航空工业集团公司沈阳飞机设计研究所 Unmanned aerial vehicle maneuvering flight control method based on fine granularity repetition strategy
CN117707207B (en) * 2024-02-06 2024-04-19 中国民用航空飞行学院 Unmanned aerial vehicle ground target tracking and obstacle avoidance planning method based on deep reinforcement learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11775850B2 (en) * 2016-01-27 2023-10-03 Microsoft Technology Licensing, Llc Artificial intelligence engine having various algorithms to build different concepts contained within a same AI model
CN109032168B (en) * 2018-05-07 2021-06-08 西安电子科技大学 DQN-based multi-unmanned aerial vehicle collaborative area monitoring airway planning method
CN110806759B (en) * 2019-11-12 2020-09-08 清华大学 Aircraft route tracking method based on deep reinforcement learning

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930625A (en) * 2016-06-13 2016-09-07 天津工业大学 Design method of Q-learning and neural network combined smart driving behavior decision making system
CN106845016A (en) * 2017-02-24 2017-06-13 西北工业大学 One kind is based on event driven measurement dispatching method
CN107193009A (en) * 2017-05-23 2017-09-22 西北工业大学 A kind of many UUV cooperative systems underwater target tracking algorithms of many interaction models of fuzzy self-adaption
CN107402381A (en) * 2017-07-11 2017-11-28 西北工业大学 A kind of multiple maneuver target tracking methods of iteration self-adapting
CN107450555A (en) * 2017-08-30 2017-12-08 唐开强 A kind of Hexapod Robot real-time gait planing method based on deeply study
CN108599737A (en) * 2018-04-10 2018-09-28 西北工业大学 A kind of design method of the non-linear Kalman filtering device of variation Bayes
CN108919640A (en) * 2018-04-20 2018-11-30 西北工业大学 The implementation method of the adaptive multiple target tracking of unmanned plane
CN109933086A (en) * 2019-03-14 2019-06-25 天津大学 Unmanned plane environment sensing and automatic obstacle avoiding method based on depth Q study
CN110196605A (en) * 2019-04-26 2019-09-03 大连海事大学 A kind of more dynamic object methods of the unmanned aerial vehicle group of intensified learning collaboratively searching in unknown sea area
CN110322017A (en) * 2019-08-13 2019-10-11 吉林大学 Automatic Pilot intelligent vehicle Trajectory Tracking Control strategy based on deeply study
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning
CN110673620A (en) * 2019-10-22 2020-01-10 西北工业大学 Four-rotor unmanned aerial vehicle air line following control method based on deep reinforcement learning
CN110703766A (en) * 2019-11-07 2020-01-17 南京航空航天大学 Unmanned aerial vehicle path planning method based on transfer learning strategy deep Q network
CN110989576A (en) * 2019-11-14 2020-04-10 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A Generic Spatiotemporal Scheduling for Autonomous UAVs: A Reinforcement Learning-Based Approach; OMAR BOUHAMED et al.; Vehicular Technology; 2020-04-30; Vol. 1; pp. 93-106 *
Path Planning for UAV Ground Target Tracking via Deep Reinforcement Learning; BOHAO LI et al.; IEEE Access; 2020-02-17; Vol. 8; pp. 29064-29074 *
Research on UAV Maneuvering Decision-Making Method Based on Markov Networks; Luo Yuanqiang et al.; Journal of *** Simulation; 2017-12-31; Vol. 29; pp. 106-112 *
Research on Cooperative Flight Path Planning for Multiple UAVs; Ding Qiang; China Master's Theses Full-text Database, Engineering Science and Technology II; 2019-01-15; Vol. 2019, No. 1; C031-351 *

Also Published As

Publication number Publication date
CN111667513A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
CN111667513B (en) Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN110673620B (en) Four-rotor unmanned aerial vehicle air line following control method based on deep reinforcement learning
CN109655066B (en) Unmanned aerial vehicle path planning method based on Q (lambda) algorithm
CN112256056B (en) Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
CN111123963B (en) Unknown environment autonomous navigation system and method based on reinforcement learning
CN110806759B (en) Aircraft route tracking method based on deep reinforcement learning
CN112947562B (en) Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
CN112435275A (en) Unmanned aerial vehicle maneuvering target tracking method integrating Kalman filtering and DDQN algorithm
Ma et al. Deep reinforcement learning of UAV tracking control under wind disturbances environments
Ma et al. Multi-robot target encirclement control with collision avoidance via deep reinforcement learning
CN112034711B (en) Unmanned ship sea wave interference resistance control method based on deep reinforcement learning
CN114625151B (en) Underwater robot obstacle avoidance path planning method based on reinforcement learning
CN110442129B (en) Control method and system for multi-agent formation
CN112462792B (en) Actor-Critic algorithm-based underwater robot motion control method
CN112947505B (en) Multi-AUV formation distributed control method based on reinforcement learning algorithm and unknown disturbance observer
CN113268074B (en) Unmanned aerial vehicle flight path planning method based on joint optimization
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN111783994A (en) Training method and device for reinforcement learning
CN113110546B (en) Unmanned aerial vehicle autonomous flight control method based on offline reinforcement learning
CN115033022A (en) DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform
CN114330115B (en) Neural network air combat maneuver decision-making method based on particle swarm search
CN114967721B (en) Unmanned aerial vehicle self-service path planning and obstacle avoidance strategy method based on DQ-CapsNet
CN114003059B (en) UAV path planning method based on deep reinforcement learning under kinematic constraint condition
CN117707207B (en) Unmanned aerial vehicle ground target tracking and obstacle avoidance planning method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant