CN111829527B - Unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements - Google Patents

Info

Publication number
CN111829527B
CN111829527B (application CN202010717418.XA)
Authority
CN
China
Prior art keywords
unmanned ship
network
time
target
obstacle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010717418.XA
Other languages
Chinese (zh)
Other versions
CN111829527A (en)
Inventor
曾喆
杜沛
刘善伟
万剑华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN202010717418.XA
Publication of CN111829527A
Application granted
Publication of CN111829527B
Legal status: Active

Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01C - MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 - Navigation; Navigational instruments not provided for in groups G01C1/00-G01C19/00
    • G01C21/20 - Instruments for performing navigational calculations
    • G01C21/203 - Specially adapted for sailing ships
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses an unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements, comprising the following basic steps: S1, interpolate the wind, wave and current data of the target sea area and add obstacle, starting point and end point information; S2, evaluate with a Bayesian network the maximum wind, wave and current values the unmanned ship can bear; S3, reorganize the target sea area AIS data to train the network, obtaining an optimized experience pool and preliminary network parameters; S4, input the unmanned ship state feature vectors into the deep reinforcement learning module for algorithm iteration, updating network parameters and outputting actions; S5, the unmanned ship moves for 15 s per iteration, and the environmental data are updated each time the accumulated time reaches 1 h; S6, when the unmanned ship reaches the target point, end the iteration and output the path. The invention fully considers the influence of marine environment elements on unmanned ship navigation, better matches the actual conditions of long-distance unmanned ship voyages, and accounts simultaneously for environmental elements and obstacle information under severe sea conditions to obtain a high-quality, safe path.

Description

Unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements
Technical Field
The patent relates to the field of unmanned ship path planning, and in particular to an unmanned ship path planning method based on deep reinforcement learning that considers marine environment elements.
Background
Driven by the development of artificial intelligence control technology, unmanned ships have achieved breakthroughs in many technical fields, gradually entered public view, begun to take on tasks such as ocean exploration and data acquisition, and are steadily expanding into the marine operations industry.
The presently published patents CN109657863A, CN109726866A and CN107289939A all provide good path planning methods in this field, but they generally consider only the influence of obstacles on the unmanned ship. Every unmanned ship has a limit on the wind and waves it can bear, determined by factors such as its material, structure and draught; when affected by strong wind and heavy waves in a real sea area, it risks capsizing or overturning. Avoiding dangerous marine environment elements and obstacle areas while sailing is therefore extremely important for navigation safety, and especially so for marine transport unmanned ships.
The method considers the influence of marine environment elements on unmanned ship navigation: the marine environment elements and the obstacle information around the unmanned ship serve as the feature input vector of deep reinforcement learning, and an attention matrix highlights the elements that most influence the algorithm's output at each moment. Compared with collision-avoidance reinforcement learning methods, the reward value here is not a fixed value but varies with the combined degree of influence of the marine environment elements and the obstacles on the unmanned ship. The method is thus better suited to the actual conditions of unmanned ship navigation, and obtains a high-quality safe path by considering environmental elements and obstacle information throughout the voyage.
Disclosure of Invention
(I) Objects of the invention
Aiming at the problem that many currently proposed unmanned ship path planning methods do not consider marine environment elements, the invention provides an unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements, which fully accounts for real marine environment elements and marine obstacles and combines them with a deep reinforcement learning method to plan a safe and efficient route for the unmanned ship.
(II) Technical scheme
In order to achieve the above purpose, the technical scheme of the invention is as follows: an unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements, comprising the following specific steps:
(1) interpolating the wind speed, flow speed and wave height data of the target sea area at time t into a 200 m × 200 m grid, and using s_t to describe the characteristic state vector of the unmanned ship at time t, namely:

s_t = [e_t^wind, e_t^wave, e_t^cur, d_t^obs]

wherein e_t^wind, e_t^wave and e_t^cur respectively represent the wind speed, wave height and flow velocity at the unmanned ship's position at time t, and d_t^obs is the distance between the unmanned ship and the obstacle at time t; d_t^obs = NAN indicates that the unmanned ship has not detected an obstacle;
(2) evaluating the unmanned ship's ability to resist wind, waves and currents using a Bayesian network, with the unmanned ship's material, displacement, length, width and height as inputs; the output is [e_max^wind, e_max^wave, e_max^cur], three parameters respectively representing the maximum wind speed, wave height and flow velocity the unmanned ship can bear, used for calculating the reward function;
(3) initializing a deep reinforcement learning model, specifically comprising: two identical LSTM networks (serving respectively as the target Q network and the actual Q network), a reward function model, a model experience pool, and an action output set;
(4) retaining three attributes of the target sea area's real AIS data (coordinates, course and speed), superposing the three marine environment element values and the obstacle information onto the AIS data by time and point position, and putting the new AIS data into the deep reinforcement learning model as training samples to obtain an optimized experience pool and preliminary network parameters;
(5) setting the start and end coordinates of the unmanned ship's voyage, obtaining the state feature vector s_t of the unmanned ship at time t, and inputting it into both the actual Q network and the reward function model;
wherein: the actual Q network computes Q_actual and, following an ε-greedy strategy, outputs the action corresponding to Q_actual; the reward function model calculates the reward value R_t of the current iteration; the target Q network randomly extracts n records from the experience pool and combines them with R_t to calculate Q_target; Q_actual and Q_target together form the loss function, the network parameters of the actual Q network are updated by gradient descent, and when the number of iterations reaches a threshold α, all parameters of the actual Q network are copied to the target Q network;
(6) each motion of the unmanned ship lasts 15 seconds; when the accumulated motion time reaches 1 h, the wind speed, ocean current, wave height and obstacle information of the sea area are updated to the current time;
(7) when the unmanned ship reaches the target point, the iteration ends and the safe path is output.
Specifically, the Bayesian network construction method in step (2) includes the following steps:
(2.1) the nodes of the unmanned ship evaluation Bayesian network include: material, water displacement, length, width and height as bottom-layer nodes, and wind-resistance level, wave-resistance level and flow-resistance level as high-level nodes; the bottom-layer nodes are fully connected to the high-level nodes;
(2.2) training a Bayesian network by taking the unmanned ship structure data as a sample to obtain a conditional probability table of each node;
(2.3) inputting the unmanned ship information to be evaluated, including material, water displacement, length, width and height; calculating the probability of each level of the three high-level nodes according to the conditional probability tables, and outputting the maximum-probability level as the final value;
(2.4) mapping the wind-speed, wave-height and flow-velocity levels of the unmanned ship obtained through the Bayesian network to specific numerical values according to the corresponding sea-state scale, as the values of e_max^wind, e_max^wave and e_max^cur.
Specifically, in the reward function model described in step (3), the reward value R_t is calculated as:

R_t = softmax(|(θ_safe - s_t)·w_1|) · ((θ_safe - s_t)·w_2)^T

wherein s_t is the characteristic state vector of the unmanned ship at time t and θ_safe is a safety threshold vector containing four parameters, θ_safe = [e_max^wind, e_max^wave, e_max^cur, -d_detect], in which e_max^wind, e_max^wave and e_max^cur are the maximum wind speed, wave height and flow velocity the unmanned ship can bear, obtained in step (2), and d_detect is the unmanned ship's obstacle-sensing range. w_1, the attention matrix of the reward function, is a 4 × 4 upper-triangular constant square matrix whose diagonal elements W_ii (i = 1, 2, 3, 4) correspond respectively to the degree of influence of wind speed, wave height, ocean current and obstacles on path planning, and whose off-diagonal elements W_ij represent the correlation between element i and element j; this matrix scales marine environment element values of different orders of magnitude to a common order for comparison and highlights the key elements. w_2 is a 4 × 4 diagonal matrix that combines with the (θ_safe - s_t) part to give the final reward value R_t its sign while amplifying its magnitude to facilitate decision making;
the softmax(|(θ_safe - s_t)·w_1|) part calculates the coefficients of the reward function and is responsible for weighting each element value; the weights highlight the elements most important to the decision in each iteration, so the reward value drops rapidly when an element value surges or at the moment an obstacle is detected. When the unmanned ship senses no obstacle, the reward function guides the model to avoid high wind-and-wave areas; as soon as an obstacle is sensed, it drives a collision-avoidance action at the first moment.
(III) Advantageous effects
The advantages of the invention are embodied as follows:
1. The wind speed, wave height, ocean current velocity and obstacle information together serve as the main references for planning the unmanned ship's path, making the planned path more feasible; during calculation, the data are updated according to the unmanned ship's running time, which ensures the reliability of the path planning result.
2. The designed reward function highlights the elements most important to the decision in each iteration, while accounting for the ship's detection capability and its capacity to withstand wind and wave impact; it gives rewards in safe areas, appropriate penalties in dangerous areas, and makes an avoidance decision at the first moment an obstacle is detected, which improves the method's path planning efficiency and optimizes its results.
3. A Bayesian network method for evaluating the unmanned ship's wind and wave resistance is proposed, replacing the conventional practice of assigning resistance levels from expert experience; it is more scientific and efficient.
Drawings
FIG. 1 is a flow chart of a method for planning a unmanned ship route based on deep reinforcement learning and considering marine environment elements
FIG. 2 is a schematic diagram of a Bayesian network for evaluating the capability of an unmanned ship in resisting wind, wave and current
FIG. 3 is a flow chart of a deep reinforcement learning algorithm used by the model
FIG. 4 is a schematic diagram of path planning under the influence of elements and obstacles in marine environment
Detailed Description
The invention will now be described more fully and clearly with reference to the accompanying drawings and examples:
FIG. 1 is a flow chart of the unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements. By fully considering the unmanned ship's material and structure together with the strong winds, heavy waves, ocean currents and obstacles that may exist in the sea area, the method gives a reasonable solution for safely completing the unmanned ship's navigation task. It consists of two modules: the first is a Bayesian network evaluation module for assessing the unmanned ship's wind and wave resistance, and the second is a deep reinforcement learning route planning module that considers marine environment elements. The two modules are coupled through the deep reinforcement learning reward function, so the unmanned ship can make appropriate risk-avoidance decisions according to its own material and structure. The method is suitable for planning paths for unmanned ships executing long-range missions.
Specifically, the method comprises the following steps:
(1) The wind speed, flow speed and wave height data of the target sea area at time t, together with the forecast wind speed, ocean current and wave height data after time t, are jointly interpolated into a 200 m × 200 m grid using the kriging method; the 200 m × 200 m cell size is chosen so that the unmanned ship obtains a new element value after at most three movements. The data are stored in a three-dimensional array whose three dimensions are longitude, latitude and time, with a time interval of 1 h. At time t, s_t describes the characteristic state vector of the unmanned ship, namely:

s_t = [e_t^wind, e_t^wave, e_t^cur, d_t^obs]

wherein e_t^wind, e_t^wave and e_t^cur respectively represent the wind speed, wave height and flow velocity at the unmanned ship's position at time t, and d_t^obs is the distance between the unmanned ship and the obstacle at time t; d_t^obs = NAN indicates that the unmanned ship has not detected an obstacle;
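To make this data layout concrete, a minimal Python sketch follows (not from the patent: the array shapes, grid origin handling and function names are assumptions); it stores each interpolated element in a three-dimensional array indexed by hourly time slice, latitude row and longitude column, and assembles s_t by direct lookup:

```python
# Illustrative sketch only: 3-D storage of interpolated marine data and lookup of
# the state vector s_t. Shapes, origin handling and helper names are assumptions.
import numpy as np

CELL = 200.0       # 200 m x 200 m grid cells, as in step (1)
SLICE = 3600.0     # environmental data are kept in 1 h time slices

# element[k, i, j]: value in hour k at grid row i (latitude), column j (longitude)
wind = np.zeros((24, 100, 100))
wave = np.zeros((24, 100, 100))
cur = np.zeros((24, 100, 100))

def state_vector(t, x, y, d_obs=np.nan):
    """s_t = [wind speed, wave height, flow velocity, obstacle distance];
    d_obs = NaN encodes 'no obstacle detected', as in the text."""
    k = int(t // SLICE)                    # hourly time-slice index
    i, j = int(y // CELL), int(x // CELL)  # grid cell containing the ship
    return np.array([wind[k, i, j], wave[k, i, j], cur[k, i, j], d_obs])

s_t = state_vector(t=7200.0, x=5000.0, y=8000.0)  # third hour, no obstacle sensed
```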
(2) The unmanned ship's ability to resist wind, waves and currents is evaluated using a Bayesian network; the inputs are the unmanned ship's material, displacement, length, width and height, and the output is [e_max^wind, e_max^wave, e_max^cur], three parameters respectively representing the maximum wind speed, wave height and flow velocity the unmanned ship can bear, used for calculating the reward function. The specific steps for constructing the Bayesian network are as follows:
(2.1) FIG. 2 is a schematic diagram of the Bayesian network for evaluating the unmanned ship's ability to resist wind, waves and currents. Its nodes comprise: material, water displacement, length, width and height as bottom-layer nodes; wind-speed resistance level, wave-height resistance level and flow-velocity resistance level as high-level nodes; the bottom-layer nodes are fully connected to the high-level nodes;
(2.2) The Bayesian network is trained with unmanned ship structure information as samples; the data must first be discretized. The unmanned ship structure table (Table 1) is provided as an image in the original publication. The data are put into the Bayesian network for training to obtain the conditional probability table of each node.
(2.3) The unmanned ship information to be evaluated is input, including material, water displacement, length, width and height; the probability of each level of the three high-level nodes is calculated from the conditional probability tables, and the maximum-probability level is output as the final value.
(2.4) The wind-speed, wave-height and flow-velocity levels of the unmanned ship obtained through the Bayesian network are mapped to specific numerical values according to the corresponding sea-state scale, as the values of e_max^wind, e_max^wave and e_max^cur.
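As the patent publishes no code for this evaluation, the following Python sketch shows one plausible realization of steps (2.2)-(2.3): conditional probability tables estimated by frequency counting over discretized hull records, and a maximum-probability readout. The field names and the WIND_MAX level-to-value mapping are hypothetical stand-ins for the sea-state scale of step (2.4):

```python
# Illustrative sketch: frequency-count CPTs for a high-level node whose parents are
# the five fully connected bottom-layer nodes, then a maximum-probability readout.
from collections import Counter, defaultdict

PARENTS = ('material', 'displacement', 'length', 'width', 'height')

def fit_cpt(samples, level_key):
    """samples: dicts of discretized bottom-layer values plus the observed level."""
    cpt = defaultdict(Counter)
    for rec in samples:
        cpt[tuple(rec[p] for p in PARENTS)][rec[level_key]] += 1
    return cpt

def most_probable_level(cpt, ship):
    counts = cpt[tuple(ship[p] for p in PARENTS)]
    if not counts:
        raise KeyError('no training records for this parent configuration')
    return max(counts, key=counts.get)   # the maximum-probability level

# Hypothetical mapping from wind-resistance level to a concrete maximum (m/s),
# standing in for the sea-state scale referenced in step (2.4):
WIND_MAX = {1: 1.5, 2: 3.3, 3: 5.4, 4: 7.9}
```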
(3) The deep reinforcement learning model is initialized, specifically comprising: two identical LSTM networks (serving respectively as the target Q network and the actual Q network), a reward function model, a model experience pool, and an action output set;
Specifically, in the reward function model, the reward value R_t is calculated as:

R_t = softmax(|(θ_safe - s_t)·w_1|) · ((θ_safe - s_t)·w_2)^T

wherein s_t is the characteristic state vector of the unmanned ship at time t and θ_safe is a safety threshold vector containing four parameters, θ_safe = [e_max^wind, e_max^wave, e_max^cur, -d_detect], in which e_max^wind, e_max^wave and e_max^cur are the maximum wind speed, wave height and flow velocity the unmanned ship can bear, obtained from the Bayesian network evaluation, and d_detect is the unmanned ship's collision-avoidance sensing range (the negative sign in front is added for convenience of calculation). The weight matrix w_1 is a 4 × 4 symmetric constant square matrix whose diagonal elements W_ii (i = 1, 2, 3, 4) correspond respectively to the degree of influence of wind speed, wave height, ocean current and obstacles on path planning, and whose off-diagonal elements W_ij represent the correlation between element i and element j; this matrix scales marine environment element values of different orders of magnitude to a common order for comparison and highlights the key elements. The values of w_1 are given empirically (the specific matrix appears as an image in the original publication).
w_2 is a 4 × 4 diagonal matrix that combines with the (θ_safe - s_t) part to give the final reward value R_t its sign while amplifying its magnitude to speed up decision making; its specific values likewise appear as an image in the original publication.
The softmax(|(θ_safe - s_t)·w_1|) part calculates the coefficients of the reward function and is responsible for weighting each characteristic state element; the weights highlight the elements most important to the decision in each iteration, and the reward value drops rapidly when an element value surges or at the moment an obstacle is detected. The (θ_safe - s_t)·w_2 part attaches a positive or negative sign to the calculation result, indicating a reward or a penalty. The calculation of the reward function is illustrated below for two cases, without and with an obstacle:
when no obstacle is encountered:
suppose a certain time tnCharacteristic state vector of
Figure BDA0002598732400000058
Comprises the following steps:
Figure BDA0002598732400000059
unmanned ship safety threshold vector thetasafe=[3,1.5,0.2,500]NAN represents not participating in the calculation, then softmax (| (θ)safe-st)·w1I) the calculation result is [0.867,0.117,0.016,0 |)]It means that in this calculation, the marine factor "wind speed" needs attention; (theta)safe-st)·w2The part and the weight value are subjected to multiplication, and positive and negative are attached to a calculation result to indicate that reward or punishment is given; the final calculation result of-19.95 represents that punishment is made.
When an obstacle is encountered:
suppose the characteristic state vector s_{t_n} at a certain time t_n indicates that an obstacle is detected 50 m from the unmanned ship, i.e. the unmanned ship has just sensed the obstacle, with the unmanned ship safety threshold vector θ_safe as above (its full values appear as an image in the original publication). Then softmax(|(θ_safe - s_t)·w_1|) evaluates to [0, 0, 0, 1], meaning that in this calculation avoiding the obstacle is most important; the (θ_safe - s_t)·w_2 part is dot-multiplied with these weights and a sign is attached to the result; the final result of -200 represents a penalty.
Through the above calculations, the algorithm drives the unmanned ship via the reward function, focusing on marine environment elements when no obstacle is detected and reacting at the first moment when an obstacle is detected;
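For concreteness, a runnable Python sketch of this reward computation follows; the w_1 and w_2 matrices and the state vector are placeholder assumptions (the patent's empirical matrices appear only as images), so the printed value illustrates the sign behaviour rather than reproducing -19.95:

```python
# Minimal sketch of R_t = softmax(|(θ_safe - s_t)·w1|)·((θ_safe - s_t)·w2)^T.
# w1, w2 and s_t below are placeholders, not the patent's empirical values.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def reward(theta_safe, s_t, w1, w2):
    d = theta_safe - s_t
    d = np.where(np.isnan(d), 0.0, d)    # NAN component does not participate
    coeff = softmax(np.abs(d @ w1))      # attention weights over the 4 elements
    return float(coeff @ (d @ w2))       # sign distinguishes reward from penalty

theta_safe = np.array([3.0, 1.5, 0.2, 500.0])  # thresholds of the first example
w1 = np.diag([10.0, 10.0, 10.0, 0.5])          # hypothetical attention matrix
w2 = np.diag([5.0, 5.0, 5.0, 0.4])             # hypothetical scaling matrix
s_t = np.array([5.5, 1.0, 0.1, np.nan])        # assumed state: strong wind, no obstacle
print(reward(theta_safe, s_t, w1, w2))         # negative value => penalty
```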
(4) Retain three attributes of the target sea area's real AIS data: coordinates, course and speed; superpose the three marine environment element values and the obstacle information onto the AIS data by time and point position. A sample of the new AIS data (Table 2) is provided as an image in the original publication. The newly organized AIS data are put into the deep reinforcement learning model as training samples to obtain an optimized experience pool and preliminary network parameters;
(5) With the unmanned ship's running speed fixed at v = 10 m/s, the discretized heading angle is selected as the action output of deep reinforcement learning. Considering the ship's steering capacity, the heading-change range is limited to between 35° and -35° and discretized at equal intervals, giving the action set output by the model:
A = {35, 25, 15, 5, -5, -15, -25, -35}
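As an illustration of how one discrete action could move the ship, the sketch below applies a heading change at the fixed speed v = 10 m/s; the flat-plane motion model and the 10 s step are assumptions (the summary of the invention uses 15 s per action, while step (11) below uses 10 s):

```python
# Illustrative kinematics sketch: apply one discrete heading-change action at the
# fixed speed v = 10 m/s. The flat-plane motion model is an assumption.
import math

ACTIONS = [35, 25, 15, 5, -5, -15, -25, -35]  # heading changes (degrees), set A
V, STEP = 10.0, 10.0                          # speed (m/s), seconds per action

def apply_action(x, y, heading_deg, action_idx):
    heading_deg = (heading_deg + ACTIONS[action_idx]) % 360.0
    rad = math.radians(heading_deg)
    return x + V * STEP * math.cos(rad), y + V * STEP * math.sin(rad), heading_deg

x, y, hdg = apply_action(0.0, 0.0, 90.0, action_idx=2)  # apply a +15° heading change
```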
(6) Referring to FIG. 3, the flow chart of the deep reinforcement learning algorithm: two identical LSTM networks serve respectively as the actual Q network and the target Q network in the deep reinforcement learning framework. The state feature vector s_t of the unmanned ship at time t is obtained and input into both the actual Q network and the reward function model. At time t, the LSTM input layer of the actual Q network receives the feature state vector s_t and the actual Q network's output Q(s_{t-1})_actual from the previous moment; the output layer gives the value Q(s_t)_actual, after which the action a_t (a_t ∈ A) corresponding to the Q value is selected using the ε-greedy strategy;
(7) The reward value R_t at time t is calculated; the characteristic state vector s_t, the action a_t, the feature state vector s_t' after executing a_t, and the Boolean value is_end that indicates whether the iteration has terminated are stored together as one record rec_t = {s_t, a_t, R_t, s_t', is_end} in experience pool D;
(8) n records {s_i, a_i, R_i, s_i', is_end_i}, i = 1, 2, …, n, are randomly extracted from experience pool D, and the target Q value Q_target is calculated:

Q_target,i = R_i, if is_end_i is true;
Q_target,i = R_i + γ·Q(s_i', a_max(s_i', ω); ω'), otherwise;

wherein R_i is the reward value of the i-th record, γ is the discount factor (γ = 0.9 in this example), ω is the parameter of the actual Q network, ω' is the parameter of the target Q network, and a_max(s_i', ω) is the action selected by feeding record i back into the actual Q network:

a_max(s_i', ω) = argmax_a Q(s_i', a; ω)

wherein s_i', a_i and ω are respectively the state feature vector, the action and the network parameter associated with record i;
(9) The accumulated loss over the n records is calculated, and the parameter ω of the actual Q network is updated by gradient descent, using the loss function:

L(ω) = (1/n) · Σ_{i=1}^{n} (Q_target,i - Q(s_i, a_i; ω))²
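A hedged PyTorch sketch of steps (8) and (9) follows: it samples n records, builds the target Q value with the action chosen by the actual network but evaluated by the target network, and takes one gradient-descent step on the mean squared loss. The record layout mirrors rec_t above; for brevity, the LSTM recurrence of step (6) is replaced by a plain state-to-Q-values network call, so this is a sketch of the update rule, not of the patent's exact architecture:

```python
# Hedged sketch of steps (8)-(9): Double-DQN-style target and gradient update.
# The flat (state -> Q values) network call stands in for the LSTM described above.
import random
import torch
import torch.nn as nn

GAMMA = 0.9  # discount factor, as given in the text

def train_step(actual_q, target_q, optimizer, pool, n):
    batch = random.sample(pool, n)
    s = torch.stack([b['s'] for b in batch])       # (n, 4) state vectors
    a = torch.tensor([b['a'] for b in batch])      # chosen action indices
    r = torch.tensor([b['R'] for b in batch])      # reward values R_i
    s2 = torch.stack([b['s2'] for b in batch])     # successor states s_i'
    done = torch.tensor([b['is_end'] for b in batch])

    with torch.no_grad():
        a_max = actual_q(s2).argmax(dim=1)         # a_max(s', ω): actual network
        q_next = target_q(s2).gather(1, a_max.unsqueeze(1)).squeeze(1)  # Q(s', a_max; ω')
        q_tgt = torch.where(done, r, r + GAMMA * q_next)  # target Q value

    q_act = actual_q(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s_i, a_i; ω)
    loss = nn.functional.mse_loss(q_act, q_tgt)    # (1/n)·Σ(Q_target - Q(s,a;ω))²
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```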
(10) When the number of iterations of the actual Q network reaches the threshold α, the parameters ω of the actual Q network are copied wholesale to the target Q network.
(11) Each motion of the unmanned ship lasts 10 seconds; when the accumulated motion time reaches 1 h, the wind speed, ocean current, wave height and obstacle information of the sea area are updated to the current time;
(12) When the unmanned ship reaches the end point, the iteration ends and a safe path is output.
FIG. 4 is a schematic diagram of path planning under the influence of marine environment elements and obstacles; the method avoids high-risk marine environment areas and obstacles when planning a path.
The above is an embodiment of the present invention; all changes made according to the technical scheme of the invention that produce its functional effects without exceeding its technical scheme belong to the protection scope of the invention.

Claims (1)

1. An unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements is characterized by comprising the following steps:
(1) interpolating the wind speed, flow speed and wave height data of the target sea area at time t into a 500 m × 500 m grid, and using s_t to describe the characteristic state vector of the unmanned ship at time t, namely:

s_t = [e_t^wind, e_t^wave, e_t^cur, d_t^obs]

wherein e_t^wind, e_t^wave and e_t^cur respectively represent the wind speed, wave height and flow velocity at the unmanned ship's position at time t, and d_t^obs is the distance between the unmanned ship and the obstacle at time t; d_t^obs = NAN indicates that the unmanned ship has not detected an obstacle;
(2) evaluating the unmanned ship's ability to resist wind, waves and currents using a Bayesian network, with the unmanned ship's material, displacement, length, width and height as inputs; the output is [e_max^wind, e_max^wave, e_max^cur], three parameters respectively representing the maximum wind speed, wave height and flow velocity the unmanned ship can bear;
1) constructing the Bayesian network nodes, comprising: material, water displacement, length, width and height as bottom-layer nodes, and wind-resistance level, wave-resistance level and flow-resistance level as high-level nodes; the bottom-layer nodes are fully connected to the high-level nodes;
2) training a Bayesian network by taking the unmanned ship structure data as a sample to obtain a conditional probability table of each node;
3) inputting unmanned ship information to be evaluated, comprising: material, water displacement, length, width and height, calculating the probability of each level of the three high-level nodes according to the conditional probability table, and outputting the maximum probability level as a final value;
4) mapping the wind-speed, wave-height and flow-velocity levels of the unmanned ship obtained through the Bayesian network to specific numerical values according to the corresponding sea-state scale, as the values of e_max^wind, e_max^wave and e_max^cur;
(3) initializing a deep reinforcement learning model, specifically comprising: two identical LSTM networks as a target Q network and an actual Q network, a reward function model, a model experience pool, and an action output set, wherein the reward value R_t is calculated as:

R_t = softmax(|(θ_safe - s_t)·w_{4×4}|) · ((θ_safe - s_t)·w_{4×4})^T
in the formula, s_t is the characteristic state vector of the unmanned ship at time t and θ_safe is a safety threshold vector containing four parameters, θ_safe = [e_max^wind, e_max^wave, e_max^cur, -d_detect], wherein e_max^wind, e_max^wave and e_max^cur are the maximum wind speed, wave height and flow velocity borne by the unmanned ship, obtained in step (2), and d_detect is the collision-avoidance sensing range of the unmanned ship; the weight matrix w_{4×4}, the attention matrix of the reward function, is a 4 × 4 upper-triangular constant square matrix whose diagonal elements W_ii correspond respectively to the degree of influence of wind speed, wave height, ocean current and obstacles on path planning, i = 1, 2, 3, 4, and whose off-diagonal elements W_ij represent the correlation between element i and element j;
the softmax(|(θ_safe - s_t)·w_{4×4}|) part calculates the coefficients of the reward function, giving a weight to each characteristic state element; the weights highlight the elements most important to the decision in each iteration and make the reward value decrease rapidly when an element value surges or at the moment an obstacle is detected; when the unmanned ship senses no obstacle, the reward function guides the model to avoid high wind-and-wave areas, and when an obstacle is sensed it performs a collision-avoidance action at the first moment;
(4) retaining three attributes of the target sea area's real AIS data (coordinates, course and speed), superposing the three marine environment element values and the obstacle information onto the AIS data by time and point position, and putting the new AIS data into the deep reinforcement learning model as training samples to obtain an optimized experience pool and preliminary network parameters;
(5) setting the start and end coordinates of the unmanned ship's voyage, obtaining the state feature vector s_t of the unmanned ship at time t, and inputting it into both the actual Q network and the reward function model;
wherein: the actual Q network computes Q_actual and, following an ε-greedy strategy, outputs the action corresponding to Q_actual; the reward function model calculates the reward value R_t of the current iteration; the target Q network randomly extracts n records from the experience pool and combines them with R_t to calculate Q_target; Q_actual and Q_target together form the loss function, the network parameters of the actual Q network are updated by gradient descent, and when the number of iterations reaches a threshold α, all parameters of the actual Q network are copied to the target Q network;
(6) each motion of the unmanned ship lasts 15 seconds; when the accumulated motion time reaches 1 h, the wind speed, ocean current, wave height and obstacle information of the sea area are updated to the current time;
(7) when the unmanned ship reaches the target point, the iteration ends and the safe path is output.
CN202010717418.XA 2020-07-23 2020-07-23 Unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements Active CN111829527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010717418.XA CN111829527B (en) 2020-07-23 2020-07-23 Unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010717418.XA CN111829527B (en) 2020-07-23 2020-07-23 Unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements

Publications (2)

Publication Number Publication Date
CN111829527A CN111829527A (en) 2020-10-27
CN111829527B true CN111829527B (en) 2021-07-20

Family

ID=72925135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010717418.XA Active CN111829527B (en) 2020-07-23 2020-07-23 Unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements

Country Status (1)

Country Link
CN (1) CN111829527B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112180950B (en) * 2020-11-05 2022-07-08 武汉理工大学 Intelligent ship autonomous collision avoidance and path planning method based on reinforcement learning
CN112698646B (en) * 2020-12-05 2022-09-13 西北工业大学 Aircraft path planning method based on reinforcement learning
CN112580801B (en) * 2020-12-09 2021-10-15 广州优策科技有限公司 Reinforced learning training method and decision-making method based on reinforced learning
CN112800545B (en) * 2021-01-28 2022-06-24 中国地质大学(武汉) Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN
CN112947431B (en) * 2021-02-03 2023-06-06 海之韵(苏州)科技有限公司 Unmanned ship path tracking method based on reinforcement learning
CN113176776B (en) * 2021-03-03 2022-08-19 上海大学 Unmanned ship weather self-adaptive obstacle avoidance method based on deep reinforcement learning
CN113297801B (en) * 2021-06-15 2022-10-14 哈尔滨工程大学 Marine environment element prediction method based on STEOF-LSTM
WO2023108494A1 (en) * 2021-12-15 2023-06-22 中国科学院深圳先进技术研究院 Probability filtering reinforcement learning-based unmanned ship control method and apparatus, and terminal device
CN114371700B (en) * 2021-12-15 2023-07-18 中国科学院深圳先进技术研究院 Probability filtering reinforcement learning unmanned ship control method and device and terminal equipment
CN114721409B (en) * 2022-06-08 2022-09-20 山东大学 Underwater vehicle docking control method based on reinforcement learning
CN114942596B (en) * 2022-07-26 2022-11-18 山脉科技股份有限公司 Intelligent control system for urban flood control and drainage
CN115493595A (en) * 2022-09-28 2022-12-20 天津大学 AUV path planning method based on local perception and near-end optimization strategy
CN115855226B (en) * 2023-02-24 2023-05-30 青岛科技大学 Multi-AUV cooperative underwater data acquisition method based on DQN and matrix completion

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102278995A (en) * 2011-04-27 2011-12-14 中国石油大学(华东) Bayes path planning device and method based on GPS (Global Positioning System) detection
CN102788581A (en) * 2012-07-17 2012-11-21 哈尔滨工程大学 Ship route planning method based on modified differential evolution algorithm
CN109726866A (en) * 2018-12-27 2019-05-07 浙江农林大学 Unmanned boat paths planning method based on Q learning neural network
CN110362089A (en) * 2019-08-02 2019-10-22 大连海事大学 A method of the unmanned boat independent navigation based on deeply study and genetic algorithm
CN110514206A (en) * 2019-08-02 2019-11-29 中国航空无线电电子研究所 A kind of unmanned plane during flying path prediction technique based on deep learning
CN111338356A (en) * 2020-04-07 2020-06-26 哈尔滨工程大学 Multi-target unmanned ship collision avoidance path planning method for improving distributed genetic algorithm

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
KR102215520B1 (en) * 2018-09-13 2021-02-15 주식회사 웨더아이 Method and server for providing course information of vessel including coast weather information


Non-Patent Citations (1)

Title
An Autonomous Path Planning Model for Unmanned Ships Based on Deep Reinforcement Learning; Siyu Guo et al.; Sensors, vol. 20, no. 2; 2020-01-11; full text *

Also Published As

Publication number Publication date
CN111829527A (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN111829527B (en) Unmanned ship path planning method based on deep reinforcement learning and considering marine environment elements
CN111780777B (en) Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
CN112650237B (en) Ship path planning method and device based on clustering processing and artificial potential field
CN111399506A (en) Global-local hybrid unmanned ship path planning method based on dynamic constraints
CN111273670B (en) Unmanned ship collision prevention method for fast moving obstacle
KR102373472B1 (en) Method and device for seamless parameter switch by using location-specific algorithm selection to achieve optimized autonomous driving in each of regions
Wang et al. Cooperative collision avoidance for unmanned surface vehicles based on improved genetic algorithm
CN111880549A (en) Unmanned ship path planning-oriented deep reinforcement learning reward function optimization method
Deraj et al. Deep reinforcement learning based controller for ship navigation
CN112650246B (en) Ship autonomous navigation method and device
CN112819255B (en) Multi-criterion ship route determining method and device, computer equipment and readable storage medium
CN112880678A (en) Unmanned ship navigation planning method in complex water area environment
CN112462786A (en) Unmanned ship collision avoidance method based on fuzzy control strategy double-window algorithm
CN117193296A (en) Improved A star unmanned ship path planning method based on high safety
CN114387822B (en) Ship collision prevention method
Seo et al. Ship collision avoidance route planning using CRI-based A∗ algorithm
Su et al. A constrained locking sweeping method and velocity obstacle based path planning algorithm for unmanned surface vehicles in complex maritime traffic scenarios
Guo et al. Mission-driven path planning and design of submersible unmanned ship with multiple navigation states
Wang et al. Roboat III: An autonomous surface vessel for urban transportation
Masmoudi et al. Autonomous car-following approach based on real-time video frames processing
Gao et al. An optimized path planning method for container ships in Bohai bay based on improved deep Q-learning
CN116952239A (en) Unmanned ship path planning method based on fusion of improved A and DWA
CN117311160A (en) Automatic control system and control method based on artificial intelligence
CN117369441A (en) Self-adaptive intelligent ship path planning method considering ship kinematics and COLLEGs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant