CN114859910A - Unmanned ship path following system and method based on deep reinforcement learning - Google Patents

Unmanned ship path following system and method based on deep reinforcement learning

Info

Publication number
CN114859910A
CN114859910A CN202210470023.3A
Authority
CN
China
Prior art keywords
unmanned ship
ship
network
current
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210470023.3A
Other languages
Chinese (zh)
Inventor
杨杰
韦港文
刘今栋
尚午晟
梁奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202210470023.3A priority Critical patent/CN114859910A/en
Publication of CN114859910A publication Critical patent/CN114859910A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/0206Control of position or course in two dimensions specially adapted to water vehicles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • G06F18/295Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Geometry (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an unmanned ship path following system based on deep reinforcement learning. A simulation platform construction module constructs an unmanned ship motion interactive simulation platform; a Markov decision modeling module models the unmanned ship motion control task as a Markov decision process; a neural network construction module designs a deep neural network based on the DDPG algorithm architecture according to the state space, action space and reward function of the Markov decision process; a strategy model construction module trains the deep neural network on the simulation platform with the DDPG algorithm to obtain an unmanned ship path following control strategy model; and a path following control module combines the unmanned ship path following control strategy model with a line-of-sight guidance algorithm to realize unmanned ship path following control. The invention separates the ship motion model from the control algorithm, simplifies the design of the control strategy, and significantly reduces or eliminates dependence on specialist knowledge in the field of ship motion control.

Description

Unmanned ship path following system and method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of unmanned ship motion control, in particular to an unmanned ship path following system and method based on deep reinforcement learning.
Background
Highly autonomous, intelligent unmanned ships are an inevitable trend in the development of the shipbuilding and shipping industries. By making ship operation unmanned and intelligent, the unmanned ship can effectively improve the safety of equipment and ship operation, optimize navigation strategies and reduce operating costs. Unmanned ships are therefore becoming a main development direction for major shipbuilding and maritime nations. Path following is one of the basic tasks of unmanned ship motion control and is the key to realizing autonomous, intelligent navigation of the unmanned ship.
Traditional unmanned ship path following methods are built on mathematical analysis: controller parameters are determined through mathematical analysis and derivation, so controller design and parameter tuning depend heavily on specialist knowledge. The effectiveness of path following methods based on mathematical analysis has been demonstrated, but such methods have significant limitations, such as high computational complexity, poor portability and strong sensitivity to environmental disturbance. In particular, the ship is affected by changing navigation conditions and environmental disturbances during sailing and exhibits strong uncertainty, nonlinearity and time-varying behavior, so it is difficult to establish an accurate mathematical model to express the changing state of the ship, which poses a great challenge to the unmanned ship path following task. The unmanned ship needs to adapt to changes in navigation conditions in real time and adjust its control strategy accordingly, but traditional control methods cannot handle this uncertainty well, so the path following performance is poor.
Disclosure of Invention
The invention aims to provide a system and a method for unmanned ship path following based on deep reinforcement learning.
In order to realize the aim, the invention designs an unmanned ship path following system based on deep reinforcement learning, which comprises a simulation platform construction module, a Markov decision modeling module, a neural network construction module, a strategy model construction module and a path following control module;
the simulation platform construction module is used for constructing an unmanned ship motion interactive simulation platform, initializing the target path and the navigation environment of the unmanned ship in the unmanned ship motion interactive simulation platform, and defining the unmanned ship motion control task according to the path-following requirements of the unmanned ship;
the Markov decision modeling module is used for modeling the unmanned ship motion control task as a Markov decision process, which describes the interaction process of unmanned ship path following control; the state space of the Markov decision process is determined according to the information the ship needs to complete the path following task, the action space is determined according to the ship control instructions, and the reward function is determined according to the control objective of the ship;
the neural network construction module is used for designing a deep neural network based on a DDPG (deep deterministic policy gradient) algorithm architecture according to a state space, an action space and a reward function in a Markov decision process;
the strategy model building module trains a deep neural network on a simulation platform by using a DDPG algorithm to obtain an unmanned ship path following control strategy model;
the path following control module is used for combining the unmanned ship path following control strategy model with a sight guidance algorithm to realize unmanned ship path following control.
Deep reinforcement learning has the characteristics of autonomous learning and strong adaptive capacity, and is widely applied in navigation, robot control, parameter optimization and other fields. Its end-to-end learning mode significantly reduces or eliminates dependence on domain expertise and has surpassed human experts in multiple fields, providing a new idea for research on decision and control. A path following method based on deep reinforcement learning needs an environment for interactive learning, which can be either a real environment or a simulation environment. Learning in the real environment involves high risk, long cycles and high cost, while a purely numerical simulation cannot intuitively reflect the motion process of the ship.
The invention has the beneficial effects that:
the unmanned ship motion interactive simulation platform constructed by the invention provides a training and testing environment close to the real environment. It separates ship motion simulation from the control algorithm, so an unmanned ship control algorithm can be designed without knowing the underlying principles of ship motion control. The ship motion process is visualized with Unity3D, which intuitively reflects the effect of the control strategy and provides a safe, efficient and low-cost way to study and verify decision and control strategies for unmanned ships;
the unmanned ship path following method based on deep reinforcement learning provided by the invention uses the neural network trained with the DDPG algorithm as the unmanned ship path following controller, requires no complex mathematical derivation and simplifies the controller design process. During training, interactive data consisting of the control inputs and the corresponding dynamic responses of the unmanned ship are used as training samples for the neural network, so modeling uncertainty and the influence of various environmental factors are implicitly accounted for and handled well. The trained control strategy model can be applied to real-time path following tasks after fine tuning, giving good portability; it takes the real-time state of the unmanned ship as input and directly outputs the optimal control action, giving strong adaptive capacity and real-time performance.
Drawings
FIG. 1 is a schematic structural view of the present invention;
FIG. 2 is a flowchart of the unmanned ship path following method based on deep reinforcement learning according to the present invention;
FIG. 3 is a schematic diagram of an unmanned ship motion interaction simulation platform system according to the present invention;
FIG. 4 is a schematic diagram of the unmanned ship path following controller based on DDPG algorithm in the present invention;
FIG. 5 is a schematic view of a line of sight method;
FIG. 6 is a training frame diagram of unmanned ship path following based on DDPG algorithm in the present invention;
FIG. 7 is a flow chart of the implementation of the control strategy model of the present invention;
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
the unmanned ship path following system based on deep reinforcement learning shown in fig. 1 is characterized in that: the system comprises a simulation platform construction module, a Markov decision modeling module, a neural network construction module, a strategy model construction module and a path following control module;
the simulation platform construction module is used for constructing an unmanned ship motion interactive simulation platform, as shown in fig. 3, and for initializing the target path and the navigation environment of the unmanned ship in the platform; the navigation environment comprises the target path, the initial position and heading of the ship and obstacle information; the unmanned ship motion control task is defined according to the path-following requirements of the unmanned ship;
the Markov decision modeling module is used for modeling the unmanned ship motion control task as a Markov decision process, which describes the interaction process of unmanned ship path following control. The state space of the Markov decision process is determined according to the information the ship needs to complete the path following task, the action space is determined according to the ship control instructions, and the reward function is determined according to the control objective of the ship. The ship control instructions are the actions the ship can execute, and the control objective is the goal of path following, specifically: minimizing the difference between the ship's heading angle and the target heading angle, minimizing the path following position error, and following the target path smoothly. The Markov decision process converts the ship path following control problem into a discrete sequential decision problem, and the state space, action space and reward function are its components;
the Markov decision process is described by (S, A, P, R, gamma), and the state space, the action space and the reward function are all the components; the state space is a collection for describing the state of the ship and describes the state information of the ship; wherein the expected course angle in the state space is obtained by a line-of-sight method; the action space is a set of ship control actions, and defines which actions can be taken by the ship; the reward function is a feedback signal of the environment after the ship takes a certain action and is used for evaluating the quality of the taken action, and the control strategy is optimized through the feedback signal.
The neural network construction module is used for designing a deep neural network based on a DDPG algorithm architecture according to a state space, an action space and a reward function in a Markov decision process;
the strategy model building module trains the deep neural network on the simulation platform with the DDPG algorithm to obtain the unmanned ship path following control strategy model; during training, the unmanned ship collects experience data through the Markov decision process, stores it in an experience pool, and the neural network is trained on samples drawn randomly from the experience pool;
the path following control module is used for combining the unmanned ship path following control strategy model with a sight guidance algorithm to realize unmanned ship path following control, and is shown in figure 4.
In this technical scheme, the simulation platform construction module constructs a virtual 3D navigation environment with the Unity3D engine and builds the ship heading and speed control model in PyCharm. The ship's sailing speed and steering are controlled through an interface in PyCharm, which provides the longitudinal driving force and turning torque for the ship, and the communication toolkit of AirSim is used to realize data interaction between the virtual 3D navigation environment and the heading and speed control model, forming the unmanned ship motion interactive simulation platform.
In this technical scheme, the unmanned ship motion interactive simulation platform is constructed for training and evaluating the unmanned ship motion control strategy. The simulation platform comprises a virtual environment layer, an application program interface layer and an external environment layer. The virtual environment layer builds a virtual environment based on the ship motion model, the unmanned ship navigation scene, the environment model and the sensor models, visualized with Unity3D, so that ship motion simulation and perception of sensor information can be realized. The external environment layer is used for designing the decision and control strategies of the unmanned ship. The application program interface layer connects the ship motion simulation and the control strategy through a data transmission protocol, realizing the transmission of environment perception information and unmanned ship control signals between the virtual environment layer and the external environment layer.
Optionally, the external environment layer is isolated from the unmanned ship motion model in the virtual environment; its core is the ship motion control strategy, and it only concerns the design and implementation of that strategy, so the ship control method used in the external environment layer can be the method provided by the invention or a technique provided by other parties.
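As an illustration of this layered structure, the following is a minimal Python sketch of an external-environment-layer client exchanging data with the virtual environment layer over a simple TCP/JSON link. The class name SimClient, the method names get_state/set_control/reset and the JSON message fields are hypothetical placeholders introduced purely for illustration; they are not the actual interface of the patent's platform or of the AirSim toolkit, which handles this communication in the real system.

```python
import json
import socket

class SimClient:
    """Hypothetical external-environment-layer client for the simulation platform.

    The virtual environment layer (Unity3D side) is assumed to expose a TCP
    socket that accepts JSON requests; the platform described in the patent
    uses the AirSim communication toolkit for this exchange instead.
    """

    def __init__(self, host="127.0.0.1", port=9000):
        self.sock = socket.create_connection((host, port))
        self.reader = self.sock.makefile()

    def _request(self, payload):
        # One JSON request per line, one JSON reply per line (assumed protocol).
        self.sock.sendall((json.dumps(payload) + "\n").encode())
        return json.loads(self.reader.readline())

    def get_state(self):
        # Returns ship position, heading, speed and target-path information
        # as perceived by the simulated sensor models.
        return self._request({"cmd": "get_state"})

    def set_control(self, rudder_angle, thrust):
        # Sends the rudder angle and longitudinal thrust command
        # produced by the external control strategy.
        return self._request({"cmd": "set_control",
                              "rudder": rudder_angle,
                              "thrust": thrust})

    def reset(self):
        # Re-initializes the target path and the navigation environment.
        return self._request({"cmd": "reset"})
```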
In the above technical solution, the specific method for modeling the Markov decision process comprises:
the Markov decision process is described by the tuple (S, A, P, R, γ), where S is the unmanned ship's state space, A is the unmanned ship's action space, P is the state transition probability, R is the reward function, and γ is the discount factor weighing the immediate reward against the future long-term reward. At time t the unmanned ship has state information s_t ∈ S and selects an action a_t from the action space according to the current action strategy; after the action is executed, the unmanned ship transitions to a new state s_{t+1} and receives a feedback reward value r_t. The task goal of the unmanned ship is to maximize the cumulative reward obtained while completing the interaction;
constructing the unmanned ship state space; the state space S is expressed as:

S = {ψ_d, χ, y_e, u, dψ_d/dt, dχ/dt, dy_e/dt}

where ψ_d is the desired heading angle, χ is the heading error, y_e is the lateral position error between the ship's position and the desired course, u is the ship's speed, and dψ_d/dt, dχ/dt, dy_e/dt are the derivatives of the desired heading angle, the heading error and the lateral position error. The desired heading is obtained by the line-of-sight method, which computes it from the real-time position information of the ship and the target path information; the real-time position information is acquired by the ship's sensing system, and the target path information is known.
The line-of-sight method provides the unmanned ship with the desired heading; a schematic diagram is shown in FIG. 5. Suppose P_{k+1}(x_{k+1}, y_{k+1}) is the current target waypoint, P_k(x_k, y_k) is the previous waypoint, P_{k+2}(x_{k+2}, y_{k+2}) is the next target waypoint, the current position of the ship is P(x, y), and P_los(x_los, y_los) is the LOS point. The LOS point is solved from the equations

(x_los - x)^2 + (y_los - y)^2 = R^2
(y_los - y_k) / (x_los - x_k) = tan(α_k)

where α_k is the path angle and R is the turning radius;
the desired heading is then

ψ_d = atan2(y_los - y, x_los - x)

where atan2 is the four-quadrant arctangent function.
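To make the line-of-sight computation concrete, the following is a small Python sketch that solves the two equations above for the LOS point and returns the desired heading ψ_d. The function name, the parameterization of the path segment and the choice of root when the circle intersects the path twice are illustrative assumptions; the patent itself only specifies the equations and FIG. 5.

```python
import math

def los_desired_heading(x, y, xk, yk, xk1, yk1, R):
    """Compute the desired heading psi_d by the line-of-sight method.

    (x, y)     : current ship position P
    (xk, yk)   : previous waypoint P_k
    (xk1, yk1) : current target waypoint P_{k+1}
    R          : turning (lookahead circle) radius
    """
    # Path angle of the segment P_k -> P_{k+1}
    alpha_k = math.atan2(yk1 - yk, xk1 - xk)

    # Points on the path: (xk + s*cos(alpha_k), yk + s*sin(alpha_k)).
    # Intersecting with the circle of radius R centred on the ship,
    # (x_p - x)^2 + (y_p - y)^2 = R^2, gives a quadratic in s.
    dx, dy = xk - x, yk - y
    b = 2.0 * (dx * math.cos(alpha_k) + dy * math.sin(alpha_k))
    c = dx * dx + dy * dy - R * R
    disc = b * b - 4.0 * c
    if disc < 0.0:
        # Ship is farther than R from the path: fall back to the closest
        # path point (illustrative fallback, not specified in the patent).
        s = -b / 2.0
    else:
        # Take the root further along the path, i.e. ahead of the ship.
        s = (-b + math.sqrt(disc)) / 2.0

    x_los = xk + s * math.cos(alpha_k)
    y_los = yk + s * math.sin(alpha_k)

    # Desired heading: psi_d = atan2(y_los - y, x_los - x)
    return math.atan2(y_los - y, x_los - x)
```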
Constructing an unmanned ship action space; the motion space a is represented as: a ═ { δ }, wherein δ is a ship rudder angle;
constructing a reward function; the reward function includes relative position and relative heading to the target path, expressed as: r ═ w e r e +w χ r χ Wherein r is e Awarded a position error, r χ Is a heading error reward, w e Awarding weight for position error, w χ Rewarding weight for the course error; the position error reward function and the course error reward are as follows:
Figure BDA0003622026110000071
Figure BDA0003622026110000072
wherein, χ (k) is the course error of the current time, χ (k-1) represents the course error of the last time, k 1 Is a first weight coefficient with a value of 0.1 and k 2 Is a second weight coefficient with a value of 0.01, k 3 And the third weight coefficient is 0.1, e is a natural constant, and rad represents radian.
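Because the exact expressions for r_e and r_χ appear only as equation images in the original publication, the sketch below is purely illustrative: it assumes an exponentially decaying position error reward and a heading reward that pays for reducing the heading error between successive steps, using the coefficients k_1 = 0.1, k_2 = 0.01 and k_3 = 0.1 named in the text. The true formulas in the patent may differ.

```python
import math

# Coefficients named in the text; the functional forms below are assumptions,
# not the patent's exact reward expressions.
K1, K2, K3 = 0.1, 0.01, 0.1
W_E, W_CHI = 1.0, 1.0   # position / heading reward weights (assumed values)

def position_error_reward(y_e):
    # Assumed form: larger reward as the lateral error |y_e| shrinks.
    return K1 * math.exp(-K2 * abs(y_e))

def heading_error_reward(chi_k, chi_k_prev):
    # Assumed form: reward improvement of the heading error (rad)
    # from the previous step chi(k-1) to the current step chi(k).
    return K3 * (abs(chi_k_prev) - abs(chi_k))

def reward(y_e, chi_k, chi_k_prev):
    # r = w_e * r_e + w_chi * r_chi
    return W_E * position_error_reward(y_e) + W_CHI * heading_error_reward(chi_k, chi_k_prev)
```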
In the above technical solution, the unmanned ship path following learning framework based on the DDPG algorithm is shown in fig. 6. The deep neural network comprises a current policy network, a current evaluation network, a target policy network and a target evaluation network. The current policy network takes the current state information s_t of the unmanned ship as input (corresponding to the ship state information at time t in the Markov decision process, i.e. the vector formed by the specific values of the variables in the state space) and outputs the current policy action μ(s_t). OU (Ornstein-Uhlenbeck) exploration noise is added to μ(s_t) to obtain the action a_t actually executed. The output of the policy network corresponds to a value in the action space; specifically, at time t the policy network takes the ship state s_t as input and outputs the rudder angle of the ship. The current evaluation network takes the state-action pair (s_t, a_t) as input, where a_t is the current policy network output μ(s_t) plus the OU exploration noise, and outputs the current Q value (the Q value denotes the value of executing action a_t in state s_t; the current Q value is the output of the current evaluation network). The target policy network has the same structure as the current policy network; it takes the next-time state s_{t+1} as input and outputs the next-time optimal action μ'(s_{t+1}|θ^μ'). The target evaluation network takes the next-time state s_{t+1} and the next-time optimal action μ'(s_{t+1}|θ^μ') as input and outputs the estimated Q value Q'(s_{t+1}, μ'(s_{t+1}|θ^μ')|θ^Q'). The target evaluation network has the same structure as the current evaluation network.
In the technical scheme, the dimension of the state space defines the input of the current policy network; the dimension of the action space defines the output of the current policy network; the state space dimension and the action space dimension define the input of the current evaluation network; the number of input layer nodes and the number of output layer nodes of the neural network are determined through the state space and the action space.
In the above technical solution, the network parameters of the current policy network are θ^μ; it takes the current state s_t of the unmanned ship as input and outputs the current policy action μ(s_t), to which OU exploration noise is added to obtain the action a_t to be executed. The network parameters of the current evaluation network are θ^Q; it takes the state-action pair (s_t, a_t) as input and outputs the current Q value Q(s_t, a_t|θ^Q). The network parameters of the target policy network are θ^μ'; it takes the next-time state s_{t+1} as input and outputs the next-time optimal action μ'(s_{t+1}|θ^μ'). The network parameters of the target evaluation network are θ^Q'; it takes the next-time state s_{t+1} and the next-time optimal action μ'(s_{t+1}|θ^μ') as input and outputs the estimated Q value Q'(s_{t+1}, μ'(s_{t+1}|θ^μ')|θ^Q').
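For reference, a minimal PyTorch sketch of the four networks is given below. The hidden layer sizes, the activation functions, the assumed state dimension and the tanh scaling of the rudder angle output are illustrative assumptions; the patent does not specify the layer dimensions.

```python
import copy
import torch
import torch.nn as nn

STATE_DIM = 7       # assumed size of the state space
ACTION_DIM = 1      # rudder angle delta
MAX_RUDDER = 0.61   # assumed rudder limit in rad (about 35 deg)

class PolicyNet(nn.Module):
    """Current/target policy network: state s_t -> action mu(s_t)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM), nn.Tanh())

    def forward(self, s):
        # Scale the tanh output to the assumed rudder angle range.
        return MAX_RUDDER * self.net(s)

class EvalNet(nn.Module):
    """Current/target evaluation (critic) network: (s_t, a_t) -> Q value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

# Current networks and their identically structured target copies.
policy, eval_net = PolicyNet(), EvalNet()
target_policy = copy.deepcopy(policy)
target_eval = copy.deepcopy(eval_net)
```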
In the above technical solution, the specific method for obtaining the unmanned ship path following control strategy model is as follows (a simplified code sketch of the whole training loop is given after step 4.11):
step 4.1, initializing the number of training rounds M, the maximum number of steps per round T, the soft update rate τ, the learning rates of the current policy network and the current evaluation network, and the discount factor γ of the deep neural network;
step 4.2, initializing the current policy network parameters θ^μ, the current evaluation network parameters θ^Q, the target policy network parameters θ^μ', the target evaluation network parameters θ^Q' and the experience replay pool D;
step 4.3, for each training round, randomly initializing the ship state s_0;
Step 4.4: for each time step, obtain the current state information s_t of the unmanned ship from the virtual environment layer of the unmanned ship motion interactive simulation platform (the current state corresponds to the ship state information at time t in the Markov decision process, i.e. the vector formed by the specific values of the variables in the state space), input it into the current policy network to obtain the policy action μ(s_t), and add the exploration noise signal to synthesize the unmanned ship action a_t. The unmanned ship executes action a_t in the virtual environment, the reward value r_t for executing action a_t in state s_t is obtained from the reward function, and the new state s_{t+1} of the unmanned ship is obtained;
Step 4.5, store the interactive experience data (s_t, a_t, r_t, s_{t+1}) into the experience replay pool D; the interactive experience data comprise the unmanned ship state s_t, the unmanned ship action a_t, the reward value r_t obtained from the reward function and the new unmanned ship state s_{t+1}. Once the amount of interactive experience data exceeds the capacity of the experience pool, the oldest experience data is replaced by the newest;
Step 4.6, randomly sample a batch of N tuples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool D as training data, where s_i is the current-time state in the i-th sample, a_i is the current-time action in the i-th sample, r_i is the current-time reward in the i-th sample and s_{i+1} is the next-time state in the i-th sample, and compute the target value:

y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^μ')|θ^Q')

where y_i is the target Q value, r_i is the reward value, γ is the discount factor, Q'(s_{i+1}, μ'(s_{i+1}|θ^μ')|θ^Q') is the Q value output by the target evaluation network, s_{i+1} is the new state of the unmanned ship, μ'(s_{i+1}|θ^μ') is the output action of the target policy network, θ^μ' denotes the target policy network parameters and θ^Q' denotes the target evaluation network parameters;
Step 4.7, update the current evaluation network by minimizing the loss function:

L(θ^Q) = (1/N) Σ_i (y_i - Q(s_i, a_i|θ^Q))^2

where L(θ^Q) is the loss function of the current evaluation network, Q(s_i, a_i|θ^Q) is the Q value output by the current evaluation network, s_i is the unmanned ship state, a_i is the unmanned ship action and θ^Q are the parameters of the current evaluation network;
Step 4.8, update the current policy network using the policy gradient:

∇_{θ^μ} J ≈ E[ ∇_a Q(s, a|θ^Q) ∇_{θ^μ} μ(s|θ^μ) ], evaluated at s = s_i, a = μ(s_i)

where ∇_{θ^μ} J is the gradient of the objective function of the current policy network, E denotes the mean over the sampled batch, ∇_a Q(s, a|θ^Q) is the gradient of the Q value output by the current evaluation network with respect to the action a, a = μ(s_i) is the action value output by the current policy network when the unmanned ship state is s_i, and ∇_{θ^μ} μ(s|θ^μ) is the gradient of the action value output by the current policy network with respect to θ^μ;
Step 4.9, update the target network parameters by soft update:

θ^Q' ← τ θ^Q + (1 - τ) θ^Q'
θ^μ' ← τ θ^μ + (1 - τ) θ^μ'

where τ is the soft update rate, θ^Q are the current evaluation network parameters, θ^Q' are the target evaluation network parameters, θ^μ are the current policy network parameters and θ^μ' are the target policy network parameters;
step 4.10, repeating steps 4.4 to 4.9 until the maximum number of steps T is reached;
step 4.11, repeating steps 4.3 to 4.10 until the maximum number of rounds M is reached, obtaining the unmanned ship path following control strategy model;
the optimal control strategy model is obtained through this training process; the control strategy model consists of the trained current policy network parameters.
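The following Python sketch summarizes steps 4.1 to 4.11, reusing the PolicyNet/EvalNet objects from the earlier sketch. The environment interface, the OU noise object, the batch size, the pool capacity and the learning rates are assumptions made for illustration; only the target value, loss, policy gradient and soft update follow the equations above.

```python
import random
import collections
import numpy as np
import torch
import torch.nn.functional as F

GAMMA, TAU = 0.99, 0.005           # assumed discount factor and soft update rate
BATCH_N, POOL_CAP = 64, 100_000    # assumed batch size and experience pool capacity

replay = collections.deque(maxlen=POOL_CAP)                # experience replay pool D
opt_mu = torch.optim.Adam(policy.parameters(), lr=1e-4)    # assumed learning rates
opt_q = torch.optim.Adam(eval_net.parameters(), lr=1e-3)

def soft_update(target, source):
    # theta' <- tau * theta + (1 - tau) * theta'   (step 4.9)
    for tp, p in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - TAU).add_(TAU * p.data)

def train_step():
    batch = random.sample(replay, BATCH_N)                 # step 4.6: random sampling
    s, a, r, s1 = (np.asarray(x, dtype=np.float32) for x in zip(*batch))
    s, a, s1 = map(torch.from_numpy, (s, a, s1))
    r = torch.from_numpy(r).reshape(-1, 1)

    # Step 4.6: target value y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + GAMMA * target_eval(s1, target_policy(s1))

    # Step 4.7: minimize L(theta_Q) = mean((y_i - Q(s_i, a_i))^2)
    q_loss = F.mse_loss(eval_net(s, a), y)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    # Step 4.8: policy gradient, i.e. gradient ascent on Q(s, mu(s))
    mu_loss = -eval_net(s, policy(s)).mean()
    opt_mu.zero_grad(); mu_loss.backward(); opt_mu.step()

    # Step 4.9: soft update of the target networks
    soft_update(target_eval, eval_net)
    soft_update(target_policy, policy)

def train(env, rounds_m=1000, max_steps_t=500, noise=None):
    for _ in range(rounds_m):                  # step 4.3: new round, random initial state
        s = env.reset()
        for _ in range(max_steps_t):           # steps 4.4 to 4.10
            with torch.no_grad():
                a = policy(torch.as_tensor(s, dtype=torch.float32)).numpy()
            if noise is not None:
                a = a + noise.sample()         # OU exploration noise
            s1, r, done = env.step(a)          # hypothetical environment interface
            replay.append((s, a, float(r), s1))    # step 4.5: store experience
            if len(replay) >= BATCH_N:
                train_step()
            s = s1
            if done:
                break
```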
In this technical scheme, the path following control module combines the unmanned ship path following control strategy model with the line-of-sight guidance algorithm to realize unmanned ship path following control; the line-of-sight guidance algorithm provides the desired heading for the control strategy model and guides the unmanned ship along the path. The implementation of the path following process, shown in fig. 7, specifically comprises the following steps (a code sketch of this loop is given after step 5.8):
step 5.1, initializing a target path of ship navigation;
step 5.2, acquiring the unmanned ship's own initial state information and environment information, and calculating the values of all variables in the state space from the acquired ship position, heading and speed information and the target path information to obtain the current state of the ship;
step 5.3, inputting the current state information of the unmanned ship into the unmanned ship path following control strategy model to obtain the current action strategy of the unmanned ship;
step 5.4, converting the action strategy into a rudder angle control instruction and executing it, whereupon the unmanned ship transitions to a new state;
step 5.5, acquiring the real-time position information of the ship and calculating the current yaw distance of the ship;
step 5.6, if the yaw distance exceeds the threshold, the task execution is deemed to have failed and the ship returns to its initial position; otherwise, proceeding to the next step;
step 5.7, if the unmanned ship reaches the destination, the path following control process is finished, otherwise, the next step is carried out;
and 5.8, if the unmanned ship finishes the following of the current target path, taking the next section of target path as the current target path, and then returning to the step 5.2, otherwise, directly returning to the step 5.2.
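Steps 5.1 to 5.8 can be summarized by the loop sketched below (compare FIG. 7). The simulator interface (get_state/set_control/reset), the dictionary keys, the constant thrust, the arrival radius and the zeroed derivative terms in the state vector are illustrative assumptions; the trained policy network and the los_desired_heading function from the earlier sketches are reused.

```python
import math
import torch

def cross_track_error(x, y, xk, yk, xk1, yk1):
    # Signed lateral distance from the ship to the path segment P_k -> P_{k+1}.
    alpha_k = math.atan2(yk1 - yk, xk1 - xk)
    return -(x - xk) * math.sin(alpha_k) + (y - yk) * math.cos(alpha_k)

def follow_path(sim, policy, waypoints, R, yaw_threshold, arrive_radius=5.0):
    """Illustrative real-time path following loop (steps 5.1 to 5.8)."""
    sim.reset()                                   # step 5.1: initialize the target path
    seg = 0                                       # index of the current path segment
    while True:
        obs = sim.get_state()                     # step 5.2: position, heading, speed (hypothetical keys)
        (xk, yk), (xk1, yk1) = waypoints[seg], waypoints[seg + 1]
        psi_d = los_desired_heading(obs["x"], obs["y"], xk, yk, xk1, yk1, R)
        y_e = cross_track_error(obs["x"], obs["y"], xk, yk, xk1, yk1)
        chi = obs["heading"] - psi_d
        # Assumed ordering of the state-space vector; derivative terms omitted (set to 0).
        state = [psi_d, chi, y_e, obs["speed"], 0.0, 0.0, 0.0]
        with torch.no_grad():                     # step 5.3: query the control strategy model
            rudder = policy(torch.tensor(state, dtype=torch.float32)).item()
        sim.set_control(rudder_angle=rudder, thrust=0.5)   # step 5.4 (assumed constant thrust)
        if abs(y_e) > yaw_threshold:              # steps 5.5 and 5.6: yaw distance check
            sim.reset()
            return False                          # task execution failed
        gx, gy = waypoints[-1]
        if math.hypot(obs["x"] - gx, obs["y"] - gy) < arrive_radius:
            return True                           # step 5.7: destination reached
        if math.hypot(obs["x"] - xk1, obs["y"] - yk1) < arrive_radius and seg < len(waypoints) - 2:
            seg += 1                              # step 5.8: switch to the next target path segment
```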
An unmanned ship path following method, as shown in fig. 2, comprises the following steps:
step 1, constructing an unmanned ship motion interactive simulation platform, initializing the target path and the navigation environment of the unmanned ship in the platform, and defining the unmanned ship motion control task according to the path-following requirements of the unmanned ship;
the architecture diagram of the unmanned ship motion interactive simulation platform system is shown in fig. 3, and the whole simulation platform comprises a virtual environment layer, an application program interface layer and an external environment layer, wherein the virtual environment layer constructs a virtual environment based on a ship motion model, an unmanned ship navigation scene, an environment model and a sensor model, and is visualized by Unity3D, so that the course and the speed of a ship can be controlled, and the position, the speed, course information and navigation environment information of the ship can be sensed; the external environment layer can be used for designing a decision and control strategy of the unmanned ship; the application program interface layer connects ship motion simulation and control strategies through a data transmission protocol to realize the transmission of environment perception information and unmanned ship control signals between the virtual environment layer and the external environment layer;
step 2, modeling the unmanned ship motion control task as a Markov decision process, which describes the interaction process of unmanned ship path following control; the state space of the Markov decision process is determined according to the information the ship needs to complete the path following task, the action space is determined according to the ship control instructions, and the reward function is determined according to the control objective of the ship;
step 3, designing a deep neural network based on a DDPG algorithm architecture according to a state space, an action space and a reward function in a Markov decision process;
step 4, training the deep neural network by using a DDPG algorithm on the simulation platform to obtain an unmanned ship path following control strategy model;
and 5, combining the unmanned ship path following control strategy model with a sight guidance algorithm to realize unmanned ship path following control.
Compared with the prior art, the unmanned ship motion interactive simulation platform constructed by the invention separates the ship motion model from the control algorithm, shields the underlying implementation of the ship motion controller and greatly simplifies the design of the control strategy; it also provides a training and testing environment close to the real environment and a safe, efficient and low-cost way to study and verify decision and control strategies for unmanned ships. Meanwhile, the unmanned ship path following method based on deep reinforcement learning provided by the invention uses the neural network trained with the DDPG algorithm as the unmanned ship path following controller, requires no complex mathematical derivation and simplifies the controller design process. During training, interactive data consisting of the control inputs and the corresponding dynamic responses of the unmanned ship are used as training samples for the neural network, so modeling uncertainty and the influence of various environmental factors are implicitly accounted for and handled well. The trained control strategy model can be applied to real-time path following tasks after fine tuning, giving good portability; it takes the real-time state of the unmanned ship as input and directly outputs the optimal control action, giving strong adaptive capacity and real-time performance.
Details not described in this specification are within the skill of the art that are well known to those skilled in the art.

Claims (10)

1. An unmanned ship path following system based on deep reinforcement learning, characterized in that: the system comprises a simulation platform construction module, a Markov decision modeling module, a neural network construction module, a strategy model construction module and a path following control module;
the simulation platform construction module is used for constructing an unmanned ship motion interactive simulation platform, initializing the target path and the navigation environment of the unmanned ship in the unmanned ship motion interactive simulation platform, and defining the unmanned ship motion control task according to the path-following requirements of the unmanned ship;
the Markov decision modeling module is used for modeling the unmanned ship motion control task as a Markov decision process, which describes the interaction process of unmanned ship path following control; the state space of the Markov decision process is determined according to the information the ship needs to complete the path following task, the action space is determined according to the ship control instructions, and the reward function is determined according to the control objective of the ship;
the neural network construction module is used for designing a deep neural network based on a DDPG algorithm architecture according to a state space, an action space and a reward function in a Markov decision process;
the strategy model building module trains a deep neural network on a simulation platform by using a DDPG algorithm to obtain an unmanned ship path following control strategy model;
the path following control module is used for combining the unmanned ship path following control strategy model with a sight guidance algorithm to realize unmanned ship path following control.
2. The unmanned ship path following system based on deep reinforcement learning according to claim 1, characterized in that: the simulation platform construction module constructs a virtual 3D navigation environment with the Unity3D engine, builds the ship heading and speed control model in PyCharm, and uses the communication toolkit of AirSim to realize data interaction between the virtual 3D navigation environment and the ship heading and speed control model, forming the unmanned ship motion interactive simulation platform.
3. The unmanned ship path following system based on deep reinforcement learning of claim 1, wherein: the specific method for modeling the Markov decision process comprises the following steps:
the Markov decision process is described by the tuple (S, A, P, R, γ), where S is the unmanned ship's state space, A is the unmanned ship's action space, P is the state transition probability, R is the reward function, and γ is the discount factor weighing the immediate reward against the future long-term reward; at time t the unmanned ship has state information s_t ∈ S and selects an action a_t from the action space according to the current action strategy; after the action is executed, the unmanned ship transitions to a new state s_{t+1} and receives a feedback reward value r_t; the task goal of the unmanned ship is to maximize the cumulative reward obtained while completing the interaction;
constructing the unmanned ship state space; the state space S is expressed as:

S = {ψ_d, χ, y_e, u, dψ_d/dt, dχ/dt, dy_e/dt}

where ψ_d is the desired heading angle, χ is the heading error, y_e is the lateral position error between the ship's position and the desired course, u is the ship's speed, and dψ_d/dt, dχ/dt, dy_e/dt are the derivatives of the desired heading angle, the heading error and the lateral position error;
constructing the unmanned ship action space; the action space A is expressed as A = {δ}, where δ is the ship's rudder angle;
constructing the reward function; the reward function accounts for the relative position and relative heading with respect to the target path and is expressed as r = w_e r_e + w_χ r_χ, where r_e is the position error reward, r_χ is the heading error reward, w_e is the position error reward weight and w_χ is the heading error reward weight; the position error reward and the heading error reward are given by:

(the expressions for r_e and r_χ are given only as equation images in the original claim)

where χ(k) is the heading error at the current time step, χ(k-1) is the heading error at the previous time step, k_1 is the first weight coefficient, k_2 is the second weight coefficient, k_3 is the third weight coefficient, e is the natural constant, and rad denotes radians.
4. The unmanned ship path following system based on deep reinforcement learning according to claim 1, characterized in that: the deep neural network comprises a current policy network, a current evaluation network, a target policy network and a target evaluation network; the current policy network takes the current state information s_t of the unmanned ship as input and outputs the current policy action μ(s_t); the current evaluation network takes the state-action pair (s_t, a_t) as input and outputs the current Q value; the target policy network has the same structure as the current policy network, takes the next-time state s_{t+1} as input and outputs the next-time optimal action μ'(s_{t+1}|θ^μ'); the target evaluation network takes the next-time state s_{t+1} and the next-time optimal action μ'(s_{t+1}|θ^μ') as input and outputs the estimated Q value Q'(s_{t+1}, μ'(s_{t+1}|θ^μ')|θ^Q'); the target evaluation network has the same structure as the current evaluation network.
5. The deep reinforcement learning-based unmanned ship path following system according to claim 1, wherein: the dimension of the state space defines the input of the current policy network; the dimension of the action space defines the output of the current policy network; the state space dimension and the action space dimension define the input of the current evaluation network; the number of input layer nodes and the number of output layer nodes of the neural network are determined through the state space and the action space.
6. The unmanned ship path following system based on deep reinforcement learning according to claim 1, characterized in that: the network parameters of the current policy network are θ^μ; it takes the current state s_t of the unmanned ship as input and outputs the current policy action μ(s_t), to which OU exploration noise is added to obtain the action a_t to be executed; the network parameters of the current evaluation network are θ^Q; it takes the state-action pair (s_t, a_t) as input and outputs the current Q value Q(s_t, a_t|θ^Q); the network parameters of the target policy network are θ^μ'; it takes the next-time state s_{t+1} as input and outputs the next-time optimal action μ'(s_{t+1}|θ^μ'); the network parameters of the target evaluation network are θ^Q'; it takes the next-time state s_{t+1} and the next-time optimal action μ'(s_{t+1}|θ^μ') as input and outputs the estimated Q value Q'(s_{t+1}, μ'(s_{t+1}|θ^μ')|θ^Q').
7. The deep reinforcement learning-based unmanned ship path following system according to claim 1, wherein: the specific method for obtaining the unmanned ship path following control strategy model comprises the following steps:
step 4.1, initializing the number of training rounds M, the maximum number of steps per round T, the soft update rate τ, the learning rates of the current policy network and the current evaluation network, and the discount factor γ of the deep neural network;
step 4.2, initializing the current policy network parameters θ^μ, the current evaluation network parameters θ^Q, the target policy network parameters θ^μ', the target evaluation network parameters θ^Q' and the experience replay pool D;
step 4.3, for each training round, randomly initializing the ship state s_0;
Step 4.4: for each time step, acquiring the current state information s_t of the unmanned ship from the virtual environment layer of the unmanned ship motion interactive simulation platform, inputting it into the current policy network to obtain the policy action μ(s_t), and adding the exploration noise signal to synthesize the unmanned ship action a_t; the unmanned ship executes action a_t in the virtual environment, the reward value r_t for executing action a_t in state s_t is obtained from the reward function, and the new state s_{t+1} of the unmanned ship is obtained;
Step 4.5, storing the interactive experience data (s_t, a_t, r_t, s_{t+1}) into the experience replay pool D, the interactive experience data comprising the unmanned ship state s_t, the unmanned ship action a_t, the reward value r_t obtained from the reward function and the new unmanned ship state s_{t+1}; once the amount of interactive experience data exceeds the capacity of the experience pool, the oldest experience data is replaced by the newest;
Step 4.6, randomly sampling a batch of N tuples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool D as training data, where s_i is the current-time state in the i-th sample, a_i is the current-time action in the i-th sample, r_i is the current-time reward in the i-th sample and s_{i+1} is the next-time state in the i-th sample, and computing the target value:

y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^μ')|θ^Q')

where y_i is the target Q value, r_i is the reward value, γ is the discount factor, Q'(s_{i+1}, μ'(s_{i+1}|θ^μ')|θ^Q') is the Q value output by the target evaluation network, s_{i+1} is the new state of the unmanned ship, μ'(s_{i+1}|θ^μ') is the output action of the target policy network, θ^μ' denotes the target policy network parameters and θ^Q' denotes the target evaluation network parameters;
Step 4.7, updating the current evaluation network by minimizing the loss function:

L(θ^Q) = (1/N) Σ_i (y_i - Q(s_i, a_i|θ^Q))^2

where L(θ^Q) is the loss function of the current evaluation network, Q(s_i, a_i|θ^Q) is the Q value output by the current evaluation network, s_i is the unmanned ship state, a_i is the unmanned ship action and θ^Q are the parameters of the current evaluation network;
Step 4.8, updating the current policy network using the policy gradient:

∇_{θ^μ} J ≈ E[ ∇_a Q(s, a|θ^Q) ∇_{θ^μ} μ(s|θ^μ) ], evaluated at s = s_i, a = μ(s_i)

where ∇_{θ^μ} J is the gradient of the objective function of the current policy network, E denotes the mean over the sampled batch, ∇_a Q(s, a|θ^Q) is the gradient of the Q value output by the current evaluation network with respect to the action a, a = μ(s_i) is the action value output by the current policy network when the unmanned ship state is s_i, and ∇_{θ^μ} μ(s|θ^μ) is the gradient of the action value output by the current policy network with respect to θ^μ;
Step 4.9, updating the target network parameters by soft update:

θ^Q' ← τ θ^Q + (1 - τ) θ^Q'
θ^μ' ← τ θ^μ + (1 - τ) θ^μ'

where τ is the soft update rate, θ^Q are the current evaluation network parameters, θ^Q' are the target evaluation network parameters, θ^μ are the current policy network parameters and θ^μ' are the target policy network parameters;
step 4.10, repeating the step 4.4 to the step 4.9 until the maximum step number T is reached;
and 4.11, repeating the steps 4.3-4.10 until the maximum number of rounds M is reached, and obtaining the unmanned ship path following control strategy model.
8. The deep reinforcement learning-based unmanned ship path following system according to claim 1, wherein: the path following control module combines the unmanned ship path following control strategy model with a sight guidance algorithm to realize unmanned ship path following control, and the sight guidance algorithm is used for providing an expected course for the control strategy model and guiding the unmanned ship to sail; the implementation of the path following process specifically includes:
step 5.1, initializing a target path of ship navigation;
step 5.2, acquiring the unmanned ship's own initial state information and environment information, and calculating the values of all variables in the state space from the acquired ship position, heading and speed information and the target path information to obtain the current state of the ship;
step 5.3, inputting the current state information of the unmanned ship into the unmanned ship path following control strategy model to obtain the current action strategy of the unmanned ship;
step 5.4, converting the action strategy into a rudder angle control instruction and executing it, whereupon the unmanned ship transitions to a new state;
step 5.5, acquiring the real-time position information of the ship and calculating the current yaw distance of the ship;
step 5.6, if the yaw distance exceeds the threshold, the task execution is deemed to have failed and the ship returns to its initial position; otherwise, proceeding to the next step;
step 5.7, if the unmanned ship reaches the destination, the path following control process is finished, otherwise, the next step is carried out;
and 5.8, if the unmanned ship finishes the following of the current target path, taking the next section of target path as the current target path, and returning to the step 5.2, otherwise, directly returning to the step 5.2.
9. The deep reinforcement learning-based unmanned ship path following system according to claim 3, wherein: the expected course is obtained by a line-of-sight method; the line-of-sight method obtains the expected course of the ship according to the real-time position information and the target path information of the ship; the real-time position information is acquired through a ship sensing system; the target path information is known information.
10. An unmanned ship path following method of the system of claim 1, comprising the steps of:
step 1, constructing an unmanned ship motion interactive simulation platform, initializing the target path and the navigation environment of the unmanned ship in the platform, and defining the unmanned ship motion control task according to the path-following requirements of the unmanned ship;
step 2, modeling the unmanned ship motion control task as a Markov decision process, which describes the interaction process of unmanned ship path following control; the state space of the Markov decision process is determined according to the information the ship needs to complete the path following task, the action space is determined according to the ship control instructions, and the reward function is determined according to the control objective of the ship;
step 3, designing a deep neural network based on a DDPG algorithm architecture according to a state space, an action space and a reward function in a Markov decision process;
step 4, training the deep neural network by using a DDPG algorithm on the simulation platform to obtain an unmanned ship path following control strategy model;
and 5, combining the unmanned ship path following control strategy model with a sight guidance algorithm to realize unmanned ship path following control.
CN202210470023.3A 2022-04-28 2022-04-28 Unmanned ship path following system and method based on deep reinforcement learning Pending CN114859910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210470023.3A CN114859910A (en) 2022-04-28 2022-04-28 Unmanned ship path following system and method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210470023.3A CN114859910A (en) 2022-04-28 2022-04-28 Unmanned ship path following system and method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114859910A true CN114859910A (en) 2022-08-05

Family

ID=82635089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210470023.3A Pending CN114859910A (en) 2022-04-28 2022-04-28 Unmanned ship path following system and method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114859910A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421483A (en) * 2022-08-22 2022-12-02 河海大学 Unmanned ship control motion forecasting method
CN115421483B (en) * 2022-08-22 2024-06-14 河海大学 Unmanned ship maneuvering motion forecasting method
CN115657683A (en) * 2022-11-14 2023-01-31 中国电子科技集团公司第十研究所 Unmanned and cableless submersible real-time obstacle avoidance method capable of being used for inspection task
CN116280140A (en) * 2023-04-13 2023-06-23 广东海洋大学 Ship hybrid power energy management method, equipment and medium based on deep learning
CN116280140B (en) * 2023-04-13 2023-10-10 广东海洋大学 Ship hybrid power energy management method, equipment and medium based on deep learning
CN116520281A (en) * 2023-05-11 2023-08-01 兰州理工大学 DDPG-based extended target tracking optimization method and device
CN116520281B (en) * 2023-05-11 2023-10-24 兰州理工大学 DDPG-based extended target tracking optimization method and device

Similar Documents

Publication Publication Date Title
CN114859910A (en) Unmanned ship path following system and method based on deep reinforcement learning
Sun et al. Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning
CN111667513B (en) Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN108820157B (en) Intelligent ship collision avoidance method based on reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
Xu et al. Intelligent collision avoidance algorithms for USVs via deep reinforcement learning under COLREGs
CN107102644B (en) Underwater robot track control method and control system based on deep reinforcement learning
CN110806759B (en) Aircraft route tracking method based on deep reinforcement learning
CN110333739A (en) A kind of AUV conduct programming and method of controlling operation based on intensified learning
Hadi et al. Deep reinforcement learning for adaptive path planning and control of an autonomous underwater vehicle
EP2280241A2 (en) Vehicle control
CN105785999A (en) Unmanned surface vehicle course motion control method
CN115016496A (en) Water surface unmanned ship path tracking method based on deep reinforcement learning
CN112286218B (en) Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN115494879B (en) Rotor unmanned aerial vehicle obstacle avoidance method, device and equipment based on reinforcement learning SAC
CN114199248A (en) AUV (autonomous underwater vehicle) cooperative positioning method for optimizing ANFIS (artificial neural field of view) based on mixed element heuristic algorithm
Fang et al. Autonomous underwater vehicle formation control and obstacle avoidance using multi-agent generative adversarial imitation learning
CN115016534A (en) Unmanned aerial vehicle autonomous obstacle avoidance navigation method based on memory reinforcement learning
CN113268074A (en) Unmanned aerial vehicle flight path planning method based on joint optimization
CN115033022A (en) DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform
Amendola et al. Navigation in restricted channels under environmental conditions: Fast-time simulation by asynchronous deep reinforcement learning
Song et al. Autonomous mobile robot navigation using machine learning
Hua et al. A novel learning-based trajectory generation strategy for a quadrotor
CN114609925B (en) Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish
CN115755603A (en) Intelligent ash box identification method for ship motion model parameters and ship motion control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination