CN114859910A - Unmanned ship path following system and method based on deep reinforcement learning - Google Patents

Unmanned ship path following system and method based on deep reinforcement learning

Info

Publication number
CN114859910A
CN114859910A CN202210470023.3A
Authority
CN
China
Prior art keywords
unmanned ship
ship
network
current
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210470023.3A
Other languages
Chinese (zh)
Inventor
杨杰
韦港文
刘今栋
尚午晟
梁奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202210470023.3A priority Critical patent/CN114859910A/en
Publication of CN114859910A publication Critical patent/CN114859910A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/0206Control of position or course in two dimensions specially adapted to water vehicles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • G06F18/295Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Geometry (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an unmanned ship path following system based on deep reinforcement learning. A simulation platform construction module constructs an unmanned ship motion interactive simulation platform; a Markov decision modeling module models the unmanned ship motion control task as a Markov decision process; a neural network construction module designs a deep neural network based on the DDPG algorithm architecture according to the state space, action space and reward function of the Markov decision process; a strategy model construction module trains the deep neural network on the simulation platform with the DDPG algorithm to obtain an unmanned ship path following control strategy model; and a path following control module combines the unmanned ship path following control strategy model with a line-of-sight guidance algorithm to realize unmanned ship path following control. The invention separates the ship motion model from the control algorithm, simplifies the design of the control strategy, and significantly reduces or eliminates dependence on specialist knowledge in the field of ship motion control.

Description

Unmanned ship path following system and method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of unmanned ship motion control, in particular to an unmanned ship path following system and method based on deep reinforcement learning.
Background
Highly autonomous, intelligent unmanned ships are an inevitable trend in the development of the shipbuilding and shipping industries. By making ship operation unmanned and intelligent, the unmanned ship can effectively improve the safety of equipment and ship operation, optimize navigation strategies and reduce operating costs. Unmanned ships are therefore becoming a main development direction for major shipbuilding and maritime nations. Path following is one of the basic tasks of unmanned ship motion control and is the key to realizing autonomous, intelligent navigation of the unmanned ship.
Traditional unmanned ship path following methods are built on mathematical analysis: controller parameters are determined through mathematical analysis and derivation, so controller design and parameter tuning depend heavily on specialist knowledge. The effectiveness of path following methods based on mathematical analysis has been demonstrated, but such methods have significant limitations, such as high computational complexity, poor portability and strong sensitivity to environmental disturbance. In particular, the ship is affected by changing navigation conditions and environmental disturbances during sailing and exhibits strong uncertainty, nonlinearity and time-varying behavior, so it is difficult to establish an accurate mathematical model to express the changing state of the ship, which poses a great challenge to the unmanned ship path following task. The unmanned ship needs to adapt to changes in navigation conditions in real time and adjust its control strategy accordingly, but traditional control methods cannot handle this uncertainty well, so the path following performance is poor.
Disclosure of Invention
The invention aims to provide a system and a method for unmanned ship path following based on deep reinforcement learning.
In order to realize the aim, the invention designs an unmanned ship path following system based on deep reinforcement learning, which comprises a simulation platform construction module, a Markov decision modeling module, a neural network construction module, a strategy model construction module and a path following control module;
the simulation platform construction module is used for constructing an unmanned ship motion interactive simulation platform, initializing the target path and the navigation environment of the unmanned ship in the unmanned ship motion interactive simulation platform, and defining the unmanned ship motion control task according to the path-following requirements of the unmanned ship;
the Markov decision modeling module is used for modeling the unmanned ship motion control task as a Markov decision process, which describes the interaction process of unmanned ship path following control; the state space of the Markov decision process is determined according to the information the ship needs to complete the path following task, the action space is determined according to the ship control instructions, and the reward function is determined according to the control objective of the ship;
the neural network construction module is used for designing a deep neural network based on a DDPG (deep deterministic policy gradient) algorithm architecture according to a state space, an action space and a reward function in a Markov decision process;
the strategy model building module trains a deep neural network on a simulation platform by using a DDPG algorithm to obtain an unmanned ship path following control strategy model;
the path following control module is used for combining the unmanned ship path following control strategy model with a sight guidance algorithm to realize unmanned ship path following control.
Deep reinforcement learning has the characteristics of autonomous learning and strong adaptive capacity, and is widely applied in navigation, robot control, parameter optimization and other fields. Its end-to-end learning mode significantly reduces or eliminates dependence on domain expertise and has surpassed human experts in multiple fields, providing a new idea for research on decision and control. A path following method based on deep reinforcement learning needs an environment for interactive learning, which can be either a real environment or a simulation environment. Learning in the real environment involves high risk, long cycles and high cost, while a purely numerical simulation cannot intuitively reflect the motion process of the ship.
The invention has the beneficial effects that:
the unmanned ship motion interactive simulation platform constructed by the invention provides a training and testing environment close to the real environment. It separates ship motion simulation from the control algorithm, so an unmanned ship control algorithm can be designed without knowing the underlying principles of ship motion control. The ship motion process is visualized with Unity3D, which intuitively reflects the effect of the control strategy and provides a safe, efficient and low-cost way to study and verify decision and control strategies for unmanned ships;
the unmanned ship path following method based on deep reinforcement learning provided by the invention uses the neural network trained with the DDPG algorithm as the unmanned ship path following controller, requires no complex mathematical derivation and simplifies the controller design process. During training, interactive data consisting of the control inputs and the corresponding dynamic responses of the unmanned ship are used as training samples for the neural network, so modeling uncertainty and the influence of various environmental factors are implicitly accounted for and handled well. The trained control strategy model can be applied to real-time path following tasks after fine tuning, giving good portability; it takes the real-time state of the unmanned ship as input and directly outputs the optimal control action, giving strong adaptive capacity and real-time performance.
Drawings
FIG. 1 is a schematic structural view of the present invention;
FIG. 2 is a flowchart of the unmanned ship path following method based on deep reinforcement learning according to the present invention;
FIG. 3 is a schematic diagram of an unmanned ship motion interaction simulation platform system according to the present invention;
FIG. 4 is a schematic diagram of the unmanned ship path following controller based on DDPG algorithm in the present invention;
FIG. 5 is a schematic view of a line of sight method;
FIG. 6 is a training frame diagram of unmanned ship path following based on DDPG algorithm in the present invention;
FIG. 7 is a flow chart of the implementation of the control strategy model of the present invention;
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
the unmanned ship path following system based on deep reinforcement learning shown in fig. 1 is characterized in that: the system comprises a simulation platform construction module, a Markov decision modeling module, a neural network construction module, a strategy model construction module and a path following control module;
the simulation platform construction module is used for constructing an unmanned ship motion interactive simulation platform, as shown in fig. 3, and for initializing the target path and the navigation environment of the unmanned ship in the platform; the navigation environment comprises the target path, the initial position and heading of the ship and obstacle information; the unmanned ship motion control task is defined according to the path-following requirements of the unmanned ship;
the Markov decision modeling module is used for modeling the unmanned ship motion control task as a Markov decision process, which describes the interaction process of unmanned ship path following control. The state space of the Markov decision process is determined according to the information the ship needs to complete the path following task, the action space is determined according to the ship control instructions, and the reward function is determined according to the control objective of the ship. The ship control instructions are the actions the ship can execute, and the control objective is the goal of path following, specifically: minimizing the difference between the ship's heading angle and the target heading angle, minimizing the path following position error, and following the target path smoothly. The Markov decision process converts the ship path following control problem into a discrete sequential decision problem, and the state space, action space and reward function are its components;
the Markov decision process is described by (S, A, P, R, gamma), and the state space, the action space and the reward function are all the components; the state space is a collection for describing the state of the ship and describes the state information of the ship; wherein the expected course angle in the state space is obtained by a line-of-sight method; the action space is a set of ship control actions, and defines which actions can be taken by the ship; the reward function is a feedback signal of the environment after the ship takes a certain action and is used for evaluating the quality of the taken action, and the control strategy is optimized through the feedback signal.
The neural network construction module is used for designing a deep neural network based on a DDPG algorithm architecture according to a state space, an action space and a reward function in a Markov decision process;
the strategy model building module trains the deep neural network on the simulation platform with the DDPG algorithm to obtain the unmanned ship path following control strategy model; during training, the unmanned ship collects experience data through the Markov decision process, stores it in an experience pool, and the neural network is trained on samples drawn randomly from the experience pool;
the path following control module is used for combining the unmanned ship path following control strategy model with a sight guidance algorithm to realize unmanned ship path following control, and is shown in figure 4.
In this technical scheme, the simulation platform construction module constructs a virtual 3D navigation environment with the Unity3D engine and builds the ship heading and speed control model in PyCharm. The ship's sailing speed and steering are controlled through an interface in PyCharm, which provides the longitudinal driving force and turning torque for the ship, and the communication toolkit of AirSim is used to realize data interaction between the virtual 3D navigation environment and the heading and speed control model, forming the unmanned ship motion interactive simulation platform.
In this technical scheme, the unmanned ship motion interactive simulation platform is constructed for training and evaluating the unmanned ship motion control strategy. The simulation platform comprises a virtual environment layer, an application program interface layer and an external environment layer. The virtual environment layer builds a virtual environment based on the ship motion model, the unmanned ship navigation scene, the environment model and the sensor models, visualized with Unity3D, so that ship motion simulation and perception of sensor information can be realized. The external environment layer is used for designing the decision and control strategies of the unmanned ship. The application program interface layer connects the ship motion simulation and the control strategy through a data transmission protocol, realizing the transmission of environment perception information and unmanned ship control signals between the virtual environment layer and the external environment layer.
Optionally, the external environment layer is isolated from the unmanned ship motion model in the virtual environment; its core is the ship motion control strategy, and it only concerns the design and implementation of that strategy, so the ship control method used in the external environment layer can be the method provided by the invention or a technique provided by other parties.
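As an illustration of this layered structure, the following is a minimal Python sketch of an external-environment-layer client exchanging data with the virtual environment layer over a simple TCP/JSON link. The class name SimClient, the method names get_state/set_control/reset and the JSON message fields are hypothetical placeholders introduced purely for illustration; they are not the actual interface of the patent's platform or of the AirSim toolkit, which handles this communication in the real system.

```python
import json
import socket

class SimClient:
    """Hypothetical external-environment-layer client for the simulation platform.

    The virtual environment layer (Unity3D side) is assumed to expose a TCP
    socket that accepts JSON requests; the platform described in the patent
    uses the AirSim communication toolkit for this exchange instead.
    """

    def __init__(self, host="127.0.0.1", port=9000):
        self.sock = socket.create_connection((host, port))
        self.reader = self.sock.makefile()

    def _request(self, payload):
        # One JSON request per line, one JSON reply per line (assumed protocol).
        self.sock.sendall((json.dumps(payload) + "\n").encode())
        return json.loads(self.reader.readline())

    def get_state(self):
        # Returns ship position, heading, speed and target-path information
        # as perceived by the simulated sensor models.
        return self._request({"cmd": "get_state"})

    def set_control(self, rudder_angle, thrust):
        # Sends the rudder angle and longitudinal thrust command
        # produced by the external control strategy.
        return self._request({"cmd": "set_control",
                              "rudder": rudder_angle,
                              "thrust": thrust})

    def reset(self):
        # Re-initializes the target path and the navigation environment.
        return self._request({"cmd": "reset"})
```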
In the above technical solution, the specific method for modeling the Markov decision process comprises:
the Markov decision process is described by the tuple (S, A, P, R, γ), where S is the unmanned ship's state space, A is the unmanned ship's action space, P is the state transition probability, R is the reward function, and γ is the discount factor weighing the immediate reward against the future long-term reward. At time t the unmanned ship has state information s_t ∈ S and selects an action a_t from the action space according to the current action strategy; after the action is executed, the unmanned ship transitions to a new state s_{t+1} and receives a feedback reward value r_t. The task goal of the unmanned ship is to maximize the cumulative reward obtained while completing the interaction;
constructing the unmanned ship state space; the state space S is expressed as:

S = {ψ_d, χ, y_e, u, dψ_d/dt, dχ/dt, dy_e/dt}

where ψ_d is the desired heading angle, χ is the heading error, y_e is the lateral position error between the ship's position and the desired course, u is the ship's speed, and dψ_d/dt, dχ/dt, dy_e/dt are the derivatives of the desired heading angle, the heading error and the lateral position error. The desired heading is obtained by the line-of-sight method, which computes it from the real-time position information of the ship and the target path information; the real-time position information is acquired by the ship's sensing system, and the target path information is known.
The line-of-sight method provides the unmanned ship with the desired heading; a schematic diagram is shown in FIG. 5. Suppose P_{k+1}(x_{k+1}, y_{k+1}) is the current target waypoint, P_k(x_k, y_k) is the previous waypoint, P_{k+2}(x_{k+2}, y_{k+2}) is the next target waypoint, the current position of the ship is P(x, y), and P_los(x_los, y_los) is the LOS point. The LOS point is solved from the equations

(x_los - x)^2 + (y_los - y)^2 = R^2
(y_los - y_k) / (x_los - x_k) = tan(α_k)

where α_k is the path angle and R is the turning radius;
the desired heading is then

ψ_d = atan2(y_los - y, x_los - x)

where atan2 is the four-quadrant arctangent function.
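To make the line-of-sight computation concrete, the following is a small Python sketch that solves the two equations above for the LOS point and returns the desired heading ψ_d. The function name, the parameterization of the path segment and the choice of root when the circle intersects the path twice are illustrative assumptions; the patent itself only specifies the equations and FIG. 5.

```python
import math

def los_desired_heading(x, y, xk, yk, xk1, yk1, R):
    """Compute the desired heading psi_d by the line-of-sight method.

    (x, y)     : current ship position P
    (xk, yk)   : previous waypoint P_k
    (xk1, yk1) : current target waypoint P_{k+1}
    R          : turning (lookahead circle) radius
    """
    # Path angle of the segment P_k -> P_{k+1}
    alpha_k = math.atan2(yk1 - yk, xk1 - xk)

    # Points on the path: (xk + s*cos(alpha_k), yk + s*sin(alpha_k)).
    # Intersecting with the circle of radius R centred on the ship,
    # (x_p - x)^2 + (y_p - y)^2 = R^2, gives a quadratic in s.
    dx, dy = xk - x, yk - y
    b = 2.0 * (dx * math.cos(alpha_k) + dy * math.sin(alpha_k))
    c = dx * dx + dy * dy - R * R
    disc = b * b - 4.0 * c
    if disc < 0.0:
        # Ship is farther than R from the path: fall back to the closest
        # path point (illustrative fallback, not specified in the patent).
        s = -b / 2.0
    else:
        # Take the root further along the path, i.e. ahead of the ship.
        s = (-b + math.sqrt(disc)) / 2.0

    x_los = xk + s * math.cos(alpha_k)
    y_los = yk + s * math.sin(alpha_k)

    # Desired heading: psi_d = atan2(y_los - y, x_los - x)
    return math.atan2(y_los - y, x_los - x)
```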
Constructing an unmanned ship action space; the motion space a is represented as: a ═ { δ }, wherein δ is a ship rudder angle;
constructing a reward function; the reward function includes relative position and relative heading to the target path, expressed as: r ═ w e r e +w χ r χ Wherein r is e Awarded a position error, r χ Is a heading error reward, w e Awarding weight for position error, w χ Rewarding weight for the course error; the position error reward function and the course error reward are as follows:
Figure BDA0003622026110000071
Figure BDA0003622026110000072
wherein, χ (k) is the course error of the current time, χ (k-1) represents the course error of the last time, k 1 Is a first weight coefficient with a value of 0.1 and k 2 Is a second weight coefficient with a value of 0.01, k 3 And the third weight coefficient is 0.1, e is a natural constant, and rad represents radian.
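Because the exact expressions for r_e and r_χ appear only as equation images in the original publication, the sketch below is purely illustrative: it assumes an exponentially decaying position error reward and a heading reward that pays for reducing the heading error between successive steps, using the coefficients k_1 = 0.1, k_2 = 0.01 and k_3 = 0.1 named in the text. The true formulas in the patent may differ.

```python
import math

# Coefficients named in the text; the functional forms below are assumptions,
# not the patent's exact reward expressions.
K1, K2, K3 = 0.1, 0.01, 0.1
W_E, W_CHI = 1.0, 1.0   # position / heading reward weights (assumed values)

def position_error_reward(y_e):
    # Assumed form: larger reward as the lateral error |y_e| shrinks.
    return K1 * math.exp(-K2 * abs(y_e))

def heading_error_reward(chi_k, chi_k_prev):
    # Assumed form: reward improvement of the heading error (rad)
    # from the previous step chi(k-1) to the current step chi(k).
    return K3 * (abs(chi_k_prev) - abs(chi_k))

def reward(y_e, chi_k, chi_k_prev):
    # r = w_e * r_e + w_chi * r_chi
    return W_E * position_error_reward(y_e) + W_CHI * heading_error_reward(chi_k, chi_k_prev)
```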
In the above technical solution, the unmanned ship path following learning framework based on the DDPG algorithm is shown in fig. 6. The deep neural network comprises a current policy network, a current evaluation network, a target policy network and a target evaluation network. The current policy network takes the current state information s_t of the unmanned ship as input (corresponding to the ship state information at time t in the Markov decision process, i.e. the vector formed by the specific values of the variables in the state space) and outputs the current policy action μ(s_t). OU (Ornstein-Uhlenbeck) exploration noise is added to μ(s_t) to obtain the action a_t actually executed. The output of the policy network corresponds to a value in the action space; specifically, at time t the policy network takes the ship state s_t as input and outputs the rudder angle of the ship. The current evaluation network takes the state-action pair (s_t, a_t) as input, where a_t is the current policy network output μ(s_t) plus the OU exploration noise, and outputs the current Q value (the Q value denotes the value of executing action a_t in state s_t; the current Q value is the output of the current evaluation network). The target policy network has the same structure as the current policy network; it takes the next-time state s_{t+1} as input and outputs the next-time optimal action μ'(s_{t+1}|θ^μ'). The target evaluation network takes the next-time state s_{t+1} and the next-time optimal action μ'(s_{t+1}|θ^μ') as input and outputs the estimated Q value Q'(s_{t+1}, μ'(s_{t+1}|θ^μ')|θ^Q'). The target evaluation network has the same structure as the current evaluation network.
In the technical scheme, the dimension of the state space defines the input of the current policy network; the dimension of the action space defines the output of the current policy network; the state space dimension and the action space dimension define the input of the current evaluation network; the number of input layer nodes and the number of output layer nodes of the neural network are determined through the state space and the action space.
In the above technical solution, the network parameters of the current policy network are θ^μ; it takes the current state s_t of the unmanned ship as input and outputs the current policy action μ(s_t), to which OU exploration noise is added to obtain the action a_t to be executed. The network parameters of the current evaluation network are θ^Q; it takes the state-action pair (s_t, a_t) as input and outputs the current Q value Q(s_t, a_t|θ^Q). The network parameters of the target policy network are θ^μ'; it takes the next-time state s_{t+1} as input and outputs the next-time optimal action μ'(s_{t+1}|θ^μ'). The network parameters of the target evaluation network are θ^Q'; it takes the next-time state s_{t+1} and the next-time optimal action μ'(s_{t+1}|θ^μ') as input and outputs the estimated Q value Q'(s_{t+1}, μ'(s_{t+1}|θ^μ')|θ^Q').
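For reference, a minimal PyTorch sketch of the four networks is given below. The hidden layer sizes, the activation functions, the assumed state dimension and the tanh scaling of the rudder angle output are illustrative assumptions; the patent does not specify the layer dimensions.

```python
import copy
import torch
import torch.nn as nn

STATE_DIM = 7       # assumed size of the state space
ACTION_DIM = 1      # rudder angle delta
MAX_RUDDER = 0.61   # assumed rudder limit in rad (about 35 deg)

class PolicyNet(nn.Module):
    """Current/target policy network: state s_t -> action mu(s_t)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM), nn.Tanh())

    def forward(self, s):
        # Scale the tanh output to the assumed rudder angle range.
        return MAX_RUDDER * self.net(s)

class EvalNet(nn.Module):
    """Current/target evaluation (critic) network: (s_t, a_t) -> Q value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

# Current networks and their identically structured target copies.
policy, eval_net = PolicyNet(), EvalNet()
target_policy = copy.deepcopy(policy)
target_eval = copy.deepcopy(eval_net)
```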
In the above technical solution, the specific method for obtaining the unmanned ship path following control strategy model is as follows (a simplified code sketch of the whole training loop is given after step 4.11):
step 4.1, initializing the number of training rounds M, the maximum number of steps per round T, the soft update rate τ, the learning rates of the current policy network and the current evaluation network, and the discount factor γ of the deep neural network;
step 4.2, initializing the current policy network parameters θ^μ, the current evaluation network parameters θ^Q, the target policy network parameters θ^μ', the target evaluation network parameters θ^Q' and the experience replay pool D;
step 4.3, for each training round, randomly initializing the ship state s_0;
Step 4.4: for each time step, obtain the current state information s_t of the unmanned ship from the virtual environment layer of the unmanned ship motion interactive simulation platform (the current state corresponds to the ship state information at time t in the Markov decision process, i.e. the vector formed by the specific values of the variables in the state space), input it into the current policy network to obtain the policy action μ(s_t), and add the exploration noise signal to synthesize the unmanned ship action a_t. The unmanned ship executes action a_t in the virtual environment, the reward value r_t for executing action a_t in state s_t is obtained from the reward function, and the new state s_{t+1} of the unmanned ship is obtained;
Step 4.5, store the interactive experience data (s_t, a_t, r_t, s_{t+1}) into the experience replay pool D; the interactive experience data comprise the unmanned ship state s_t, the unmanned ship action a_t, the reward value r_t obtained from the reward function and the new unmanned ship state s_{t+1}. Once the amount of interactive experience data exceeds the capacity of the experience pool, the oldest experience data is replaced by the newest;
Step 4.6, randomly sample a batch of N tuples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool D as training data, where s_i is the current-time state in the i-th sample, a_i is the current-time action in the i-th sample, r_i is the current-time reward in the i-th sample and s_{i+1} is the next-time state in the i-th sample, and compute the target value:

y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^μ')|θ^Q')

where y_i is the target Q value, r_i is the reward value, γ is the discount factor, Q'(s_{i+1}, μ'(s_{i+1}|θ^μ')|θ^Q') is the Q value output by the target evaluation network, s_{i+1} is the new state of the unmanned ship, μ'(s_{i+1}|θ^μ') is the output action of the target policy network, θ^μ' denotes the target policy network parameters and θ^Q' denotes the target evaluation network parameters;
Step 4.7, update the current evaluation network by minimizing the loss function:

L(θ^Q) = (1/N) Σ_i (y_i - Q(s_i, a_i|θ^Q))^2

where L(θ^Q) is the loss function of the current evaluation network, Q(s_i, a_i|θ^Q) is the Q value output by the current evaluation network, s_i is the unmanned ship state, a_i is the unmanned ship action and θ^Q are the parameters of the current evaluation network;
Step 4.8, update the current policy network using the policy gradient:

∇_{θ^μ} J ≈ E[ ∇_a Q(s, a|θ^Q) ∇_{θ^μ} μ(s|θ^μ) ], evaluated at s = s_i, a = μ(s_i)

where ∇_{θ^μ} J is the gradient of the objective function of the current policy network, E denotes the mean over the sampled batch, ∇_a Q(s, a|θ^Q) is the gradient of the Q value output by the current evaluation network with respect to the action a, a = μ(s_i) is the action value output by the current policy network when the unmanned ship state is s_i, and ∇_{θ^μ} μ(s|θ^μ) is the gradient of the action value output by the current policy network with respect to θ^μ;
Step 4.9, update the target network parameters by soft update:

θ^Q' ← τ θ^Q + (1 - τ) θ^Q'
θ^μ' ← τ θ^μ + (1 - τ) θ^μ'

where τ is the soft update rate, θ^Q are the current evaluation network parameters, θ^Q' are the target evaluation network parameters, θ^μ are the current policy network parameters and θ^μ' are the target policy network parameters;
step 4.10, repeating steps 4.4 to 4.9 until the maximum number of steps T is reached;
step 4.11, repeating steps 4.3 to 4.10 until the maximum number of rounds M is reached, obtaining the unmanned ship path following control strategy model;
the optimal control strategy model is obtained through this training process; the control strategy model consists of the trained current policy network parameters.
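The following Python sketch summarizes steps 4.1 to 4.11, reusing the PolicyNet/EvalNet objects from the earlier sketch. The environment interface, the OU noise object, the batch size, the pool capacity and the learning rates are assumptions made for illustration; only the target value, loss, policy gradient and soft update follow the equations above.

```python
import random
import collections
import numpy as np
import torch
import torch.nn.functional as F

GAMMA, TAU = 0.99, 0.005           # assumed discount factor and soft update rate
BATCH_N, POOL_CAP = 64, 100_000    # assumed batch size and experience pool capacity

replay = collections.deque(maxlen=POOL_CAP)                # experience replay pool D
opt_mu = torch.optim.Adam(policy.parameters(), lr=1e-4)    # assumed learning rates
opt_q = torch.optim.Adam(eval_net.parameters(), lr=1e-3)

def soft_update(target, source):
    # theta' <- tau * theta + (1 - tau) * theta'   (step 4.9)
    for tp, p in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - TAU).add_(TAU * p.data)

def train_step():
    batch = random.sample(replay, BATCH_N)                 # step 4.6: random sampling
    s, a, r, s1 = (np.asarray(x, dtype=np.float32) for x in zip(*batch))
    s, a, s1 = map(torch.from_numpy, (s, a, s1))
    r = torch.from_numpy(r).reshape(-1, 1)

    # Step 4.6: target value y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + GAMMA * target_eval(s1, target_policy(s1))

    # Step 4.7: minimize L(theta_Q) = mean((y_i - Q(s_i, a_i))^2)
    q_loss = F.mse_loss(eval_net(s, a), y)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    # Step 4.8: policy gradient, i.e. gradient ascent on Q(s, mu(s))
    mu_loss = -eval_net(s, policy(s)).mean()
    opt_mu.zero_grad(); mu_loss.backward(); opt_mu.step()

    # Step 4.9: soft update of the target networks
    soft_update(target_eval, eval_net)
    soft_update(target_policy, policy)

def train(env, rounds_m=1000, max_steps_t=500, noise=None):
    for _ in range(rounds_m):                  # step 4.3: new round, random initial state
        s = env.reset()
        for _ in range(max_steps_t):           # steps 4.4 to 4.10
            with torch.no_grad():
                a = policy(torch.as_tensor(s, dtype=torch.float32)).numpy()
            if noise is not None:
                a = a + noise.sample()         # OU exploration noise
            s1, r, done = env.step(a)          # hypothetical environment interface
            replay.append((s, a, float(r), s1))    # step 4.5: store experience
            if len(replay) >= BATCH_N:
                train_step()
            s = s1
            if done:
                break
```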
In this technical scheme, the path following control module combines the unmanned ship path following control strategy model with the line-of-sight guidance algorithm to realize unmanned ship path following control; the line-of-sight guidance algorithm provides the desired heading for the control strategy model and guides the unmanned ship along the path. The implementation of the path following process, shown in fig. 7, specifically comprises the following steps (a code sketch of this loop is given after step 5.8):
step 5.1, initializing a target path of ship navigation;
step 5.2, acquiring the unmanned ship's own initial state information and environment information, and calculating the values of all variables in the state space from the acquired ship position, heading and speed information and the target path information to obtain the current state of the ship;
step 5.3, inputting the current state information of the unmanned ship into the unmanned ship path following control strategy model to obtain the current action strategy of the unmanned ship;
step 5.4, converting the action strategy into a rudder angle control instruction and executing it, whereupon the unmanned ship transitions to a new state;
step 5.5, acquiring the real-time position information of the ship and calculating the current yaw distance of the ship;
step 5.6, if the yaw distance exceeds the threshold, the task execution is deemed to have failed and the ship returns to its initial position; otherwise, proceeding to the next step;
step 5.7, if the unmanned ship reaches the destination, the path following control process is finished, otherwise, the next step is carried out;
and 5.8, if the unmanned ship finishes the following of the current target path, taking the next section of target path as the current target path, and then returning to the step 5.2, otherwise, directly returning to the step 5.2.
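Steps 5.1 to 5.8 can be summarized by the loop sketched below (compare FIG. 7). The simulator interface (get_state/set_control/reset), the dictionary keys, the constant thrust, the arrival radius and the zeroed derivative terms in the state vector are illustrative assumptions; the trained policy network and the los_desired_heading function from the earlier sketches are reused.

```python
import math
import torch

def cross_track_error(x, y, xk, yk, xk1, yk1):
    # Signed lateral distance from the ship to the path segment P_k -> P_{k+1}.
    alpha_k = math.atan2(yk1 - yk, xk1 - xk)
    return -(x - xk) * math.sin(alpha_k) + (y - yk) * math.cos(alpha_k)

def follow_path(sim, policy, waypoints, R, yaw_threshold, arrive_radius=5.0):
    """Illustrative real-time path following loop (steps 5.1 to 5.8)."""
    sim.reset()                                   # step 5.1: initialize the target path
    seg = 0                                       # index of the current path segment
    while True:
        obs = sim.get_state()                     # step 5.2: position, heading, speed (hypothetical keys)
        (xk, yk), (xk1, yk1) = waypoints[seg], waypoints[seg + 1]
        psi_d = los_desired_heading(obs["x"], obs["y"], xk, yk, xk1, yk1, R)
        y_e = cross_track_error(obs["x"], obs["y"], xk, yk, xk1, yk1)
        chi = obs["heading"] - psi_d
        # Assumed ordering of the state-space vector; derivative terms omitted (set to 0).
        state = [psi_d, chi, y_e, obs["speed"], 0.0, 0.0, 0.0]
        with torch.no_grad():                     # step 5.3: query the control strategy model
            rudder = policy(torch.tensor(state, dtype=torch.float32)).item()
        sim.set_control(rudder_angle=rudder, thrust=0.5)   # step 5.4 (assumed constant thrust)
        if abs(y_e) > yaw_threshold:              # steps 5.5 and 5.6: yaw distance check
            sim.reset()
            return False                          # task execution failed
        gx, gy = waypoints[-1]
        if math.hypot(obs["x"] - gx, obs["y"] - gy) < arrive_radius:
            return True                           # step 5.7: destination reached
        if math.hypot(obs["x"] - xk1, obs["y"] - yk1) < arrive_radius and seg < len(waypoints) - 2:
            seg += 1                              # step 5.8: switch to the next target path segment
```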
An unmanned ship path following method, as shown in fig. 2, comprises the following steps:
step 1, constructing an unmanned ship motion interactive simulation platform, initializing the target path and the navigation environment of the unmanned ship in the platform, and defining the unmanned ship motion control task according to the path-following requirements of the unmanned ship;
the architecture diagram of the unmanned ship motion interactive simulation platform system is shown in fig. 3, and the whole simulation platform comprises a virtual environment layer, an application program interface layer and an external environment layer, wherein the virtual environment layer constructs a virtual environment based on a ship motion model, an unmanned ship navigation scene, an environment model and a sensor model, and is visualized by Unity3D, so that the course and the speed of a ship can be controlled, and the position, the speed, course information and navigation environment information of the ship can be sensed; the external environment layer can be used for designing a decision and control strategy of the unmanned ship; the application program interface layer connects ship motion simulation and control strategies through a data transmission protocol to realize the transmission of environment perception information and unmanned ship control signals between the virtual environment layer and the external environment layer;
step 2, modeling the unmanned ship motion control task as a Markov decision process, which describes the interaction process of unmanned ship path following control; the state space of the Markov decision process is determined according to the information the ship needs to complete the path following task, the action space is determined according to the ship control instructions, and the reward function is determined according to the control objective of the ship;
step 3, designing a deep neural network based on a DDPG algorithm architecture according to a state space, an action space and a reward function in a Markov decision process;
step 4, training the deep neural network by using a DDPG algorithm on the simulation platform to obtain an unmanned ship path following control strategy model;
and 5, combining the unmanned ship path following control strategy model with a sight guidance algorithm to realize unmanned ship path following control.
Compared with the prior art, the unmanned ship motion interactive simulation platform constructed by the invention separates the ship motion model from the control algorithm, shields the underlying implementation of the ship motion controller and greatly simplifies the design of the control strategy; it also provides a training and testing environment close to the real environment and a safe, efficient and low-cost way to study and verify decision and control strategies for unmanned ships. Meanwhile, the unmanned ship path following method based on deep reinforcement learning provided by the invention uses the neural network trained with the DDPG algorithm as the unmanned ship path following controller, requires no complex mathematical derivation and simplifies the controller design process. During training, interactive data consisting of the control inputs and the corresponding dynamic responses of the unmanned ship are used as training samples for the neural network, so modeling uncertainty and the influence of various environmental factors are implicitly accounted for and handled well. The trained control strategy model can be applied to real-time path following tasks after fine tuning, giving good portability; it takes the real-time state of the unmanned ship as input and directly outputs the optimal control action, giving strong adaptive capacity and real-time performance.
Details not described in this specification are within the skill of the art that are well known to those skilled in the art.

Claims (10)

1. An unmanned ship path following system based on deep reinforcement learning, characterized in that: the system comprises a simulation platform construction module, a Markov decision modeling module, a neural network construction module, a strategy model construction module and a path following control module;
the simulation platform construction module is used for constructing an unmanned ship motion interactive simulation platform, initializing the target path and the navigation environment of the unmanned ship in the unmanned ship motion interactive simulation platform, and defining the unmanned ship motion control task according to the path-following requirements of the unmanned ship;
the Markov decision modeling module is used for modeling the unmanned ship motion control task as a Markov decision process, which describes the interaction process of unmanned ship path following control; the state space of the Markov decision process is determined according to the information the ship needs to complete the path following task, the action space is determined according to the ship control instructions, and the reward function is determined according to the control objective of the ship;
the neural network construction module is used for designing a deep neural network based on a DDPG algorithm architecture according to a state space, an action space and a reward function in a Markov decision process;
the strategy model building module trains a deep neural network on a simulation platform by using a DDPG algorithm to obtain an unmanned ship path following control strategy model;
the path following control module is used for combining the unmanned ship path following control strategy model with a sight guidance algorithm to realize unmanned ship path following control.
2. The unmanned ship path following system based on deep reinforcement learning according to claim 1, characterized in that: the simulation platform construction module constructs a virtual 3D navigation environment with the Unity3D engine, builds the ship heading and speed control model in PyCharm, and uses the communication toolkit of AirSim to realize data interaction between the virtual 3D navigation environment and the ship heading and speed control model, forming the unmanned ship motion interactive simulation platform.
3. The unmanned ship path following system based on deep reinforcement learning of claim 1, wherein: the specific method for modeling the Markov decision process comprises the following steps:
the Markov decision process is described by the tuple (S, A, P, R, γ), where S is the unmanned ship's state space, A is the unmanned ship's action space, P is the state transition probability, R is the reward function, and γ is the discount factor weighing the immediate reward against the future long-term reward; at time t the unmanned ship has state information s_t ∈ S and selects an action a_t from the action space according to the current action strategy; after the action is executed, the unmanned ship transitions to a new state s_{t+1} and receives a feedback reward value r_t; the task goal of the unmanned ship is to maximize the cumulative reward obtained while completing the interaction;
constructing the unmanned ship state space; the state space S is expressed as:

S = {ψ_d, χ, y_e, u, dψ_d/dt, dχ/dt, dy_e/dt}

where ψ_d is the desired heading angle, χ is the heading error, y_e is the lateral position error between the ship's position and the desired course, u is the ship's speed, and dψ_d/dt, dχ/dt, dy_e/dt are the derivatives of the desired heading angle, the heading error and the lateral position error;
constructing the unmanned ship action space; the action space A is expressed as A = {δ}, where δ is the ship's rudder angle;
constructing the reward function; the reward function accounts for the relative position and relative heading with respect to the target path and is expressed as r = w_e r_e + w_χ r_χ, where r_e is the position error reward, r_χ is the heading error reward, w_e is the position error reward weight and w_χ is the heading error reward weight; the position error reward and the heading error reward are given by:

(the expressions for r_e and r_χ are given only as equation images in the original claim)

where χ(k) is the heading error at the current time step, χ(k-1) is the heading error at the previous time step, k_1 is the first weight coefficient, k_2 is the second weight coefficient, k_3 is the third weight coefficient, e is the natural constant, and rad denotes radians.
4. The unmanned ship path following system based on deep reinforcement learning according to claim 1, characterized in that: the deep neural network comprises a current policy network, a current evaluation network, a target policy network and a target evaluation network; the current policy network takes the current state information s_t of the unmanned ship as input and outputs the current policy action μ(s_t); the current evaluation network takes the state-action pair (s_t, a_t) as input and outputs the current Q value; the target policy network has the same structure as the current policy network, takes the next-time state s_{t+1} as input and outputs the next-time optimal action μ'(s_{t+1}|θ^μ'); the target evaluation network takes the next-time state s_{t+1} and the next-time optimal action μ'(s_{t+1}|θ^μ') as input and outputs the estimated Q value Q'(s_{t+1}, μ'(s_{t+1}|θ^μ')|θ^Q'); the target evaluation network has the same structure as the current evaluation network.
5. The deep reinforcement learning-based unmanned ship path following system according to claim 1, wherein: the dimension of the state space defines the input of the current policy network; the dimension of the action space defines the output of the current policy network; the state space dimension and the action space dimension define the input of the current evaluation network; the number of input layer nodes and the number of output layer nodes of the neural network are determined through the state space and the action space.
6. The unmanned ship path following system based on deep reinforcement learning according to claim 1, characterized in that: the network parameters of the current policy network are θ^μ; it takes the current state s_t of the unmanned ship as input and outputs the current policy action μ(s_t), to which OU exploration noise is added to obtain the action a_t to be executed; the network parameters of the current evaluation network are θ^Q; it takes the state-action pair (s_t, a_t) as input and outputs the current Q value Q(s_t, a_t|θ^Q); the network parameters of the target policy network are θ^μ'; it takes the next-time state s_{t+1} as input and outputs the next-time optimal action μ'(s_{t+1}|θ^μ'); the network parameters of the target evaluation network are θ^Q'; it takes the next-time state s_{t+1} and the next-time optimal action μ'(s_{t+1}|θ^μ') as input and outputs the estimated Q value Q'(s_{t+1}, μ'(s_{t+1}|θ^μ')|θ^Q').
7. The deep reinforcement learning-based unmanned ship path following system according to claim 1, wherein: the specific method for obtaining the unmanned ship path following control strategy model comprises the following steps:
step 4.1, initializing the number of training rounds M, the maximum number of steps per round T, the soft update rate τ, the learning rates of the current policy network and the current evaluation network, and the discount factor γ of the deep neural network;
step 4.2, initializing the current policy network parameters θ^μ, the current evaluation network parameters θ^Q, the target policy network parameters θ^μ', the target evaluation network parameters θ^Q' and the experience replay pool D;
step 4.3, for each training round, randomly initializing the ship state s_0;
Step 4.4: for each time step, acquiring the current state information s_t of the unmanned ship from the virtual environment layer of the unmanned ship motion interactive simulation platform, inputting it into the current policy network to obtain the policy action μ(s_t), and adding the exploration noise signal to synthesize the unmanned ship action a_t; the unmanned ship executes action a_t in the virtual environment, the reward value r_t for executing action a_t in state s_t is obtained from the reward function, and the new state s_{t+1} of the unmanned ship is obtained;
Step 4.5, storing the interactive experience data (s_t, a_t, r_t, s_{t+1}) into the experience replay pool D, the interactive experience data comprising the unmanned ship state s_t, the unmanned ship action a_t, the reward value r_t obtained from the reward function and the new unmanned ship state s_{t+1}; once the amount of interactive experience data exceeds the capacity of the experience pool, the oldest experience data is replaced by the newest;
Step 4.6, randomly sampling a batch of N tuples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool D as training data, where s_i is the current-time state in the i-th sample, a_i is the current-time action in the i-th sample, r_i is the current-time reward in the i-th sample and s_{i+1} is the next-time state in the i-th sample, and computing the target value:

y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^μ')|θ^Q')

where y_i is the target Q value, r_i is the reward value, γ is the discount factor, Q'(s_{i+1}, μ'(s_{i+1}|θ^μ')|θ^Q') is the Q value output by the target evaluation network, s_{i+1} is the new state of the unmanned ship, μ'(s_{i+1}|θ^μ') is the output action of the target policy network, θ^μ' denotes the target policy network parameters and θ^Q' denotes the target evaluation network parameters;
Step 4.7, updating the current evaluation network by minimizing the loss function:

L(θ^Q) = (1/N) Σ_i (y_i - Q(s_i, a_i|θ^Q))^2

where L(θ^Q) is the loss function of the current evaluation network, Q(s_i, a_i|θ^Q) is the Q value output by the current evaluation network, s_i is the unmanned ship state, a_i is the unmanned ship action and θ^Q are the parameters of the current evaluation network;
Step 4.8, updating the current policy network using the policy gradient:

∇_{θ^μ} J ≈ E[ ∇_a Q(s, a|θ^Q) ∇_{θ^μ} μ(s|θ^μ) ], evaluated at s = s_i, a = μ(s_i)

where ∇_{θ^μ} J is the gradient of the objective function of the current policy network, E denotes the mean over the sampled batch, ∇_a Q(s, a|θ^Q) is the gradient of the Q value output by the current evaluation network with respect to the action a, a = μ(s_i) is the action value output by the current policy network when the unmanned ship state is s_i, and ∇_{θ^μ} μ(s|θ^μ) is the gradient of the action value output by the current policy network with respect to θ^μ;
Step 4.9, updating the target network parameters by soft update:

θ^Q' ← τ θ^Q + (1 - τ) θ^Q'
θ^μ' ← τ θ^μ + (1 - τ) θ^μ'

where τ is the soft update rate, θ^Q are the current evaluation network parameters, θ^Q' are the target evaluation network parameters, θ^μ are the current policy network parameters and θ^μ' are the target policy network parameters;
step 4.10, repeating the step 4.4 to the step 4.9 until the maximum step number T is reached;
and 4.11, repeating the steps 4.3-4.10 until the maximum number of rounds M is reached, and obtaining the unmanned ship path following control strategy model.
8. The deep reinforcement learning-based unmanned ship path following system according to claim 1, wherein: the path following control module combines the unmanned ship path following control strategy model with a sight guidance algorithm to realize unmanned ship path following control, and the sight guidance algorithm is used for providing an expected course for the control strategy model and guiding the unmanned ship to sail; the implementation of the path following process specifically includes:
step 5.1, initializing a target path of ship navigation;
step 5.2, acquiring the unmanned ship's own initial state information and environment information, and calculating the values of all variables in the state space from the acquired ship position, heading and speed information and the target path information to obtain the current state of the ship;
step 5.3, inputting the current state information of the unmanned ship into the unmanned ship path following control strategy model to obtain the current action strategy of the unmanned ship;
step 5.4, converting the action strategy into a rudder angle control instruction and executing it, whereupon the unmanned ship transitions to a new state;
step 5.5, acquiring the real-time position information of the ship and calculating the current yaw distance of the ship;
step 5.6, if the yaw distance exceeds the threshold, the task execution is deemed to have failed and the ship returns to its initial position; otherwise, proceeding to the next step;
step 5.7, if the unmanned ship reaches the destination, the path following control process is finished, otherwise, the next step is carried out;
and 5.8, if the unmanned ship finishes the following of the current target path, taking the next section of target path as the current target path, and returning to the step 5.2, otherwise, directly returning to the step 5.2.
9. The deep reinforcement learning-based unmanned ship path following system according to claim 3, wherein: the expected course is obtained by a line-of-sight method; the line-of-sight method obtains the expected course of the ship according to the real-time position information and the target path information of the ship; the real-time position information is acquired through a ship sensing system; the target path information is known information.
10. An unmanned ship path following method of the system of claim 1, comprising the steps of:
step 1, constructing an unmanned ship motion interactive simulation platform, initializing the target path and the navigation environment of the unmanned ship in the platform, and defining the unmanned ship motion control task according to the path-following requirements of the unmanned ship;
step 2, modeling the unmanned ship motion control task as a Markov decision process, which describes the interaction process of unmanned ship path following control; the state space of the Markov decision process is determined according to the information the ship needs to complete the path following task, the action space is determined according to the ship control instructions, and the reward function is determined according to the control objective of the ship;
step 3, designing a deep neural network based on a DDPG algorithm architecture according to a state space, an action space and a reward function in a Markov decision process;
step 4, training the deep neural network by using a DDPG algorithm on the simulation platform to obtain an unmanned ship path following control strategy model;
and 5, combining the unmanned ship path following control strategy model with a sight guidance algorithm to realize unmanned ship path following control.
CN202210470023.3A 2022-04-28 2022-04-28 Unmanned ship path following system and method based on deep reinforcement learning Pending CN114859910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210470023.3A CN114859910A (en) 2022-04-28 2022-04-28 Unmanned ship path following system and method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210470023.3A CN114859910A (en) 2022-04-28 2022-04-28 Unmanned ship path following system and method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114859910A true CN114859910A (en) 2022-08-05

Family

ID=82635089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210470023.3A Pending CN114859910A (en) 2022-04-28 2022-04-28 Unmanned ship path following system and method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114859910A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421483A (en) * 2022-08-22 2022-12-02 河海大学 Unmanned ship control motion forecasting method
CN115421483B (en) * 2022-08-22 2024-06-14 河海大学 Unmanned ship maneuvering motion forecasting method
CN115657683A (en) * 2022-11-14 2023-01-31 中国电子科技集团公司第十研究所 Unmanned and cableless submersible real-time obstacle avoidance method capable of being used for inspection task
CN116280140A (en) * 2023-04-13 2023-06-23 广东海洋大学 Ship hybrid power energy management method, equipment and medium based on deep learning
CN116280140B (en) * 2023-04-13 2023-10-10 广东海洋大学 Ship hybrid power energy management method, equipment and medium based on deep learning
CN116520281A (en) * 2023-05-11 2023-08-01 兰州理工大学 DDPG-based extended target tracking optimization method and device
CN116520281B (en) * 2023-05-11 2023-10-24 兰州理工大学 DDPG-based extended target tracking optimization method and device

Similar Documents

Publication Publication Date Title
CN114859910A (en) Unmanned ship path following system and method based on deep reinforcement learning
Sun et al. Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning
CN111667513B (en) Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN108820157B (en) Intelligent ship collision avoidance method based on reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
Xu et al. Intelligent collision avoidance algorithms for USVs via deep reinforcement learning under COLREGs
CN107102644B (en) Underwater robot track control method and control system based on deep reinforcement learning
CN110806759B (en) Aircraft route tracking method based on deep reinforcement learning
CN110333739A (en) A kind of AUV conduct programming and method of controlling operation based on intensified learning
Hadi et al. Deep reinforcement learning for adaptive path planning and control of an autonomous underwater vehicle
EP2280241A2 (en) Vehicle control
CN105785999A (en) Unmanned surface vehicle course motion control method
CN115016496A (en) Water surface unmanned ship path tracking method based on deep reinforcement learning
CN112286218B (en) Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN115494879B (en) Rotor unmanned aerial vehicle obstacle avoidance method, device and equipment based on reinforcement learning SAC
CN114199248A (en) AUV (autonomous underwater vehicle) cooperative positioning method for optimizing ANFIS (artificial neural field of view) based on mixed element heuristic algorithm
Fang et al. Autonomous underwater vehicle formation control and obstacle avoidance using multi-agent generative adversarial imitation learning
CN115016534A (en) Unmanned aerial vehicle autonomous obstacle avoidance navigation method based on memory reinforcement learning
CN113268074A (en) Unmanned aerial vehicle flight path planning method based on joint optimization
CN115033022A (en) DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform
Amendola et al. Navigation in restricted channels under environmental conditions: Fast-time simulation by asynchronous deep reinforcement learning
Song et al. Autonomous mobile robot navigation using machine learning
Hua et al. A novel learning-based trajectory generation strategy for a quadrotor
CN114609925B (en) Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish
CN115755603A (en) Intelligent ash box identification method for ship motion model parameters and ship motion control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination