CN111880549B - Deep reinforcement learning reward function optimization method for unmanned ship path planning - Google Patents

Deep reinforcement learning reward function optimization method for unmanned ship path planning

Info

Publication number
CN111880549B
CN111880549B (application CN202010962958.4A)
Authority
CN
China
Prior art keywords
ship
obstacle
target point
giving
rewards
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010962958.4A
Other languages
Chinese (zh)
Other versions
CN111880549A (en
Inventor
曹志英
杜意权
张秀国
郭嗣彧
郑易松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN202010962958.4A priority Critical patent/CN111880549B/en
Publication of CN111880549A publication Critical patent/CN111880549A/en
Application granted granted Critical
Publication of CN111880549B publication Critical patent/CN111880549B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/0088Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/0206Control of position or course in two dimensions specially adapted to water vehicles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a deep reinforcement learning reward function optimization method for unmanned ship path planning, which comprises the following steps: S1, acquiring environment information; S2, obtaining the distance between the unmanned ship and each obstacle and the distance between the unmanned ship and the target point; S3, giving a corresponding reward value according to the number of times the ship has reached the target point; S4, judging whether the ship is in the reward domain, and giving a corresponding reward according to the reward-domain reward principle; S5, judging whether the unmanned ship has collided with an obstacle, and giving a corresponding penalty value; and S6, judging whether the ship is in the danger domain, giving a corresponding penalty according to the danger-domain penalty principle, and otherwise giving a reward according to the general-case reward principle. By adding a reward domain near the target point, adding a danger domain near each obstacle, and introducing a counting principle that scales the rewards or penalties obtained, the invention accelerates the convergence of the deep reinforcement learning algorithm and guides the ship to avoid obstacles and reach the target point more quickly.

Description

Deep reinforcement learning reward function optimization method for unmanned ship path planning
Technical Field
The invention relates to the technical field of path planning, in particular to a deep reinforcement learning reward function optimization method for unmanned ship path planning.
Background
At present, economic ties between countries are increasingly close and trade is increasingly frequent, and ships play an important role as a major means of transportation. However, as shipping density rises and navigation environments become more complex, marine safety accidents occur frequently. Data from recent years show that the main cause of marine accidents is ships colliding with obstacles that the crew failed to detect in time during a voyage. Moreover, in some situations it is unsuitable to send a manned ship to the work site, and the ship must navigate autonomously to cope with the complex and changeable harsh environment at sea; this requires the ship to have an autonomous obstacle avoidance capability. To ensure the safe operation of ships and avoid marine safety accidents, researchers at home and abroad have proposed many methods for ship obstacle avoidance.
Path planning is the most basic and important link in ship obstacle avoidance. Its goal is to establish an optimal path from the starting point to the target point that avoids all obstacles while meeting the navigation performance requirements of the ship.
Conventional unmanned ship path planning algorithms include the artificial potential field method (APF), the A* algorithm, evolutionary neural networks, and evolutionary algorithms. APF guides the unmanned ship toward the target waypoint by constructing a virtual attractive field and turns restricted navigation areas into virtual obstacle regions that generate a virtual repulsive field for avoiding static obstacles; however, when the repulsive and attractive forces balance, or when the repulsive force near the target point is very large, the ship stalls before the target and falls into a local optimum. The A* algorithm is an improved version of Dijkstra's algorithm: it searches the state space and evaluates each searched position to obtain the best position from which to continue searching until the target is reached. However, its path planning depends on a grid map, and the size and number of grid cells directly affect the computation speed and accuracy of the algorithm. Evolutionary neural networks mainly learn from sample data of historical ship collision avoidance operations, so that the unmanned ship can plan paths and avoid obstacles in static and dynamic obstacle environments; the accuracy of this approach depends heavily on the training samples, and a separate neural network model must be trained for each navigation area. Evolutionary algorithms include the ant colony algorithm and simulated annealing; for example, Vettor, R et al. use an optimized genetic algorithm that takes environmental information as the initial population and finally obtains a navigation route meeting the requirements, realizing autonomous path planning for unmanned ships in open water. Such algorithms are simple to implement, but they require many iterations, easily fall into local optima, and produce paths with too many corners to be usable in a continuous topological environment.
In recent years, with the development of deep reinforcement learning, ship path planning methods based on deep reinforcement learning have begun to attract many researchers. In these methods, a deep reinforcement learning algorithm is combined with the control actions of the unmanned ship, so that the ship continually tries actions in a simulation environment, outputs accurate navigation actions in real time, learns how to avoid obstacles and reach the target point, and finally yields an effective ship path planning model. When the model is used it does not need to be retrained; path planning is realized simply by deploying the trained model on the ship that needs path planning and feeding environmental information into it.
Unmanned ship path planning based on deep reinforcement learning requires randomly initializing the network parameters of the deep reinforcement learning algorithm and abstracting the obstacles in the real environment to build an experimental environment. When the path planning model is built, data such as the ship's position, heading, speed, distance to the target point, and distance to each obstacle are first initialized as input to determine the ship's initial state. The network then predicts the action the ship should execute based on the input state information; the ship executes the action to obtain the state at the next moment, and the reward function judges whether the action was good or bad, producing a reward or a penalty. Finally, the current ship state, the executed action, the obtained reward, and the next state are saved into a replay buffer for updating the network parameters of the deep reinforcement learning algorithm. This process is repeated until an effective ship path planning model is obtained. As the evaluation of an action's quality in this process, the reward function has a great influence on the deep reinforcement learning algorithm.
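A minimal sketch of this training loop is shown below, assuming hypothetical `env`, `agent`, and `reward_function` interfaces (`env.reset()`, `env.step()`, `agent.predict()`, and `agent.update()` are placeholder names, not names from the patent); it only illustrates how transitions flow into a replay buffer and drive the network update of a DQN-style agent.

```python
import random
from collections import deque

def train_path_planner(env, agent, reward_function, episodes=500,
                       buffer_size=10000, batch_size=64):
    """Collect (state, action, reward, next_state, done) transitions and update the agent."""
    replay_buffer = deque(maxlen=buffer_size)
    for _ in range(episodes):
        state = env.reset()                      # ship position, heading, speed, distances, ...
        done = False
        while not done:
            action = agent.predict(state)        # network predicts the maneuver to execute
            next_state = env.step(action)        # ship executes the action
            reward, done = reward_function(next_state)   # judge how good the action was
            replay_buffer.append((state, action, reward, next_state, done))
            if len(replay_buffer) >= batch_size:
                # Update network parameters from a random batch of stored experience.
                agent.update(random.sample(list(replay_buffer), batch_size))
            state = next_state
    return agent
```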
At present, most reward functions adopted in deep-reinforcement-learning-based unmanned ship path planning use only the distance between the ship and the target point as the reward. Such reward functions lead to slow convergence of the deep reinforcement learning algorithm, long training periods, and the risk of never learning an optimal path planning strategy.
Disclosure of Invention
To solve the problems of slow convergence and long training periods of traditional reward functions, the invention provides a deep reinforcement learning reward function optimization method for unmanned ship path planning. By adding a reward domain near the target point, adding a danger domain near each obstacle, and introducing a counting principle that scales the rewards or penalties obtained, the convergence speed of the model is increased and the ship avoids obstacles and reaches the target point more quickly.
The invention adopts the following technical means:
The deep reinforcement learning reward function optimization method for unmanned ship path planning comprises the following steps:
S1, acquiring environment information, preprocessing it, and converting the acquired environment information into information in the experimental environment;
S2, acquiring the distance between the unmanned ship and each obstacle and the distance between the unmanned ship and the target point, and judging from the latter whether the unmanned ship has reached the target point; if so, executing S3, otherwise executing S4;
S3, giving a reward value corresponding to the number of times the ship has reached the target point;
S4, judging whether the ship is in the reward domain; if so, giving a corresponding reward according to the reward-domain reward principle and executing S2, otherwise executing S5;
S5, judging whether the unmanned ship has collided with an obstacle; if so, giving a penalty value corresponding to the number of collisions between the ship and obstacles and executing S2, otherwise executing S6;
S6, judging whether the ship is in the danger domain; if so, giving a corresponding penalty according to the danger-domain penalty principle and executing S2; otherwise, giving a reward according to the general-case reward principle and executing S2.
Further, in S1, preprocessing the environment information includes:
reading the obstacle position data between the current position of the ship and the next turning point, and parsing it to obtain the coordinate information of each obstacle;
converting the coordinate information of the obstacle into first converted coordinates in a rectangular coordinate system through the Mercator projection;
performing unit conversion on the first converted coordinates to obtain second converted coordinates;
and normalizing the second converted coordinates so that the coordinate values lie in the [0,1] interval.
Further, in S3, giving a reward value corresponding to the number of times the ship has reached the target point includes: if the number of times the ship has reached the target point is greater than a threshold, giving the maximum reward; otherwise giving a reward weighted from a standard reward value, where the weighting coefficient is positively correlated with the number of times the ship has reached the target point.
Further, in S5, giving a penalty value corresponding to the number of times the ship has collided with obstacles includes: if the number of collisions is greater than a threshold, giving the maximum penalty; otherwise giving a penalty weighted from a standard penalty value, where the weighting coefficient is positively correlated with the number of collisions.
Further, in S4, giving a corresponding reward according to the reward-domain reward principle includes: giving a corresponding reward value according to the distance between the ship and the target point.
Further, in S6, giving a corresponding penalty according to the danger-domain penalty principle includes: giving a corresponding penalty value according to the distance between the ship and the nearest obstacle.
Further, giving a reward according to the general-case reward principle in S6 includes: giving a corresponding reward value according to the distance between the ship and the target point.
Compared with the prior art, the invention has the following advantages:
By adding a reward domain near the target point, adding a danger domain near each obstacle, and introducing a counting principle that scales the rewards or penalties obtained, the invention increases the convergence speed of the model so that the ship avoids obstacles and reaches the target point more quickly.
Based on the above advantages, the invention can be widely applied in the field of ship obstacle avoidance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the reward function optimization method of the invention.
FIG. 2 is a schematic diagram of example environment one.
FIG. 3 is a schematic diagram of example environment two.
FIG. 4 is a schematic diagram of the reward scores in example environment one.
FIG. 5 is a schematic diagram of the reward scores in example environment two.
FIG. 6 is a comparison chart of convergence counts in class I environments.
FIG. 7 is a comparison chart of convergence counts in class II environments.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in FIG. 1, the invention provides a deep reinforcement learning reward function optimization method for unmanned ship path planning, which comprises the following steps:
Step 1: convert the environment information into the experimental environment through data preprocessing. The data preprocessing method is as follows:
The environment information mainly includes the position, speed, and angular velocity of the ship and the positions of the obstacles. Before the algorithm runs, the obstacle data between the current position of the ship and the next turning point are read and parsed to obtain the coordinates of each obstacle. The obstacle coordinates are then converted by the Mercator projection into coordinates in a rectangular coordinate system, as shown in formula (1), where a is the semi-major axis of the earth, b is the semi-minor axis, a point on the earth has longitude θ ∈ (−π, +π) and latitude α ∈ (−π/2, +π/2), e is the first eccentricity of the earth's ellipsoid, and (x, y) are the final converted coordinates. After the coordinate conversion, the data, originally in meters, are converted into final coordinates in kilometers. Finally, a normalization operation is performed so that all data input to the algorithm lie between 0 and 1.
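Formula (1) itself is not reproduced in the text available here, so the sketch below uses the standard ellipsoidal Mercator projection, which matches the variables defined above (a, b, θ, α, e, (x, y)), as an assumed stand-in. The function names, the ellipsoid constants, and the min-max normalization into [0, 1] are illustrative choices, not details taken from the patent.

```python
import math

# Assumed WGS-84-like ellipsoid parameters, in meters.
A = 6378137.0                      # semi-major axis a
B = 6356752.3142                   # semi-minor axis b
E = math.sqrt(A**2 - B**2) / A     # first eccentricity e

def mercator(lon_rad, lat_rad):
    """Project longitude/latitude (in radians) to rectangular coordinates in meters."""
    x = A * lon_rad
    y = A * math.log(math.tan(math.pi / 4 + lat_rad / 2) *
                     ((1 - E * math.sin(lat_rad)) / (1 + E * math.sin(lat_rad))) ** (E / 2))
    return x, y

def preprocess(obstacles_lonlat):
    """Convert obstacle positions from lon/lat to kilometers and normalize into [0, 1]."""
    if not obstacles_lonlat:
        return []
    xy_km = [(x / 1000.0, y / 1000.0)
             for x, y in (mercator(lon, lat) for lon, lat in obstacles_lonlat)]
    xs, ys = [p[0] for p in xy_km], [p[1] for p in xy_km]
    span_x = (max(xs) - min(xs)) or 1.0   # avoid division by zero for a single point
    span_y = (max(ys) - min(ys)) or 1.0
    return [((x - min(xs)) / span_x, (y - min(ys)) / span_y) for x, y in xy_km]
```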
Step 2: initialize the variables required for the algorithm to run. The variables required by the method of the invention include:
goal_counter: the target counter;
obs_counter: the collision counter;
reward: the reward or penalty obtained at each step;
reward_standard_goal: the standard reward for reaching the target point;
max_reward_goal: the maximum reward for reaching the target point;
reward_standard_obs: the standard penalty for colliding with an obstacle;
max_reward_obs: the maximum penalty for colliding with an obstacle.
Step 3: calculate the distance between the ship and each obstacle and the distance to the target point. The distances between the unmanned ship and the obstacles are recorded as d_obs = [d1, d2, d3, ..., dn], where d1, d2, d3, ..., dn are the distances to the individual obstacles; the distance to the target point is recorded as d_goal. To prevent calculation errors, the boundary distance between the ship and the target point or obstacle is used instead of the center distance.
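The sketch below illustrates one way to compute such boundary distances, under the assumption that the ship, the obstacles, and the target point are modeled as circles with known centers and radii; this geometric model is an assumption made for the example, not a detail stated in the patent.

```python
import math

def boundary_distance(center_a, radius_a, center_b, radius_b):
    """Distance between the boundaries of two circular objects, clipped to zero on contact."""
    center_dist = math.hypot(center_a[0] - center_b[0], center_a[1] - center_b[1])
    return max(center_dist - radius_a - radius_b, 0.0)

def observe_distances(ship_pos, ship_radius, obstacles, goal_pos, goal_radius):
    """Return d_obs (boundary distances to each obstacle) and d_goal (distance to the target point)."""
    d_obs = [boundary_distance(ship_pos, ship_radius, center, radius)
             for center, radius in obstacles]
    d_goal = boundary_distance(ship_pos, ship_radius, goal_pos, goal_radius)
    return d_obs, d_goal
```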
Step 4: judge whether the ship has reached the target point; if so, go to Step 5, otherwise go to Step 6.
Step 5: increment the target counter by one and reset the collision counter to zero. Judge whether the target counter is greater than the threshold; if so, directly give the maximum reward according to formula (2); otherwise calculate the reward value according to formula (3), where ceil() is the round-up function. The current round then ends and the next round begins.
reward = max_reward_goal (2)
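Formula (3) is not reproduced in the text available here, so the sketch below only illustrates the counting principle described above: the reward grows with goal_counter up to the maximum. The weighting (the standard reward scaled by ceil() of the counter) and the constant values are assumed stand-ins for the patent's exact expression.

```python
import math

MAX_REWARD_GOAL = 100.0       # maximum reward for reaching the target point (assumed value)
REWARD_STANDARD_GOAL = 20.0   # standard reward for reaching the target point (assumed value)
GOAL_THRESHOLD = 5            # target-counter threshold (assumed value)

def arrival_reward(goal_counter):
    """Counting-based reward for reaching the target point (formulas (2)-(3); weighting assumed)."""
    if goal_counter > GOAL_THRESHOLD:
        return MAX_REWARD_GOAL                                    # formula (2)
    # Assumed stand-in for formula (3): the weight grows with the number of consecutive arrivals.
    return min(REWARD_STANDARD_GOAL * math.ceil(goal_counter / 2), MAX_REWARD_GOAL)
```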
Step 6: judge whether the ship is in the reward domain; if so, give the corresponding reward according to the reward-domain reward principle and return to Step 3; otherwise go to Step 7.
The reward-domain reward principle is as follows:
The reward domain means that the ship is near the target point but has not yet reached it. To help the ship reach the target point quickly and accelerate the convergence of the model, different rewards are given for different distances from the target point within the reward domain. At the same time, to prevent falling into a local optimum, the rewards in the reward domain must not be too dense, and there must be a gap between them and the reward for reaching the target point. The specific reward is given by formula (4), where L and D are the two thresholds of the reward domain range, and different rewards are given within the L range and the D range respectively.
Step 7: judge whether the ship has collided with an obstacle. If so, increment the collision counter by one and reset the target counter to zero, then judge whether the collision counter is greater than the threshold; if it is, directly give the maximum penalty according to formula (5), otherwise calculate the penalty value according to formula (6). The current round then ends and the next round begins. If there is no collision, go to Step 8.
reward = max_reward_obs (5)
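Mirroring the arrival reward above, the sketch below illustrates the collision-counting penalty (formulas (5)-(6)); since formula (6) is likewise not reproduced here, the weighting by obs_counter is an assumed stand-in, and representing penalties as negative rewards is an implementation choice made for the example.

```python
import math

MAX_REWARD_OBS = -100.0       # maximum collision penalty as a negative reward (assumed value)
REWARD_STANDARD_OBS = -20.0   # standard collision penalty (assumed value)
OBS_THRESHOLD = 5             # collision-counter threshold (assumed value)

def collision_penalty(obs_counter):
    """Counting-based penalty for colliding with an obstacle (formulas (5)-(6); weighting assumed)."""
    if obs_counter > OBS_THRESHOLD:
        return MAX_REWARD_OBS                                     # formula (5)
    # Assumed stand-in for formula (6): the penalty grows with the number of consecutive collisions.
    return max(REWARD_STANDARD_OBS * math.ceil(obs_counter / 2), MAX_REWARD_OBS)
```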
Step 8: judge whether the ship is in the danger domain. If so, give the corresponding penalty according to the danger-domain penalty principle and return to Step 3; if not, give a reward according to the general-case reward principle and return to Step 3.
The danger-domain penalty principle is as follows: the danger domain means that the ship is near an obstacle but has not collided with it. So that the ship quickly leaves the vicinity of the obstacle and moves toward the target point, the penalty must increase near the obstacle, and the penalty value is inversely proportional to the distance between the ship and the obstacle. Similarly, to avoid falling into a local optimum, the penalties in the danger domain should not be too dense, and there must be a gap between them and the penalty for hitting the obstacle itself. The specific penalty is calculated by formula (7), where min(d_obs) is the minimum distance between the ship and the obstacles, and α, β, and δ are thresholds of the danger domain range; different penalties are given within the α, β, and δ ranges respectively.
General-case reward principle: the general case is that the ship is neither near the target point nor near an obstacle. In this case, the rewards should be arranged as sparsely as possible to prevent the algorithm from falling into a local optimum. Therefore, in the general case, the negative of the distance between the ship and the target point is taken as the reward for each step, as shown in formula (8):
reward = −d_goal (8)
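The sketch below pulls Steps 6-8 together into a single per-step reward for a ship that has neither reached the target nor collided. Because formulas (4) and (7) are not reproduced in the text available here, the tier thresholds (D, L, α, β, δ) and tier values are assumed placeholders chosen only to show the structure: tiered rewards near the target point, tiered penalties near the closest obstacle, and the sparse reward −d_goal of formula (8) everywhere else.

```python
# Assumed tier thresholds and values; the patent's formulas (4) and (7) define the real ones.
D, L_THRESH = 10.0, 5.0              # reward-domain thresholds (outer, inner), normalized units
ALPHA, BETA, DELTA = 8.0, 5.0, 2.0   # danger-domain thresholds, from outer to inner

def step_reward(d_goal, d_obs):
    """Per-step reward covering Steps 6 and 8 (no arrival, no collision)."""
    # Step 6, reward domain: near the target point, tiered rewards (stand-in for formula (4)).
    if d_goal < D:
        return 10.0 if d_goal < L_THRESH else 5.0
    # Step 8, danger domain: near the closest obstacle, tiered penalties (stand-in for formula (7)).
    nearest = min(d_obs) if d_obs else float("inf")
    if nearest < ALPHA:
        if nearest < DELTA:
            return -10.0
        return -5.0 if nearest < BETA else -2.0
    # General case, formula (8): sparse reward, the negative of the distance to the target point.
    return -d_goal
```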
The reward scores and the convergence counts are compared below in different environments to show that the reward function of this patent converges faster than the conventional reward function.
(1) Comparison of reward scores in different environments
During the voyage of a ship, the whole route is divided into multiple waypoints, and path planning must plan the path from one waypoint to the next. The actual environment corresponding to the experimental simulation is 60 nautical miles × 60 nautical miles, and the number of obstacles between one waypoint and the next is set to at most 7. Therefore, the experimental environment is set to 600 pixels × 600 pixels, and 1 to 7 obstacles are placed to simulate the actual environment. Because the algorithm converges more slowly as the number of obstacles grows, the environments are divided into class I and class II environments according to the number of obstacles, and convergence counts are collected for each class. Class I environments contain 1 to 4 obstacles, and class II environments contain 5 to 7 obstacles. The convergence effect of the method is illustrated below using two of these environments (example environment one, a class I environment, in FIG. 2; example environment two, a class II environment, in FIG. 3).
As shown in FIG. 2 and FIG. 3, there are 3 obstacles in example environment one and 5 obstacles in example environment two; the five-pointed star marks the target point, the ship icon marks the starting point, and the obstacles are drawn as mountains. In both environments, experiments were run with the Double DQN algorithm adopted by the method of this patent, using both the conventional reward function and the reward function of this patent. In deep reinforcement learning, the reward scores obtained over the training rounds are used to judge whether the algorithm has converged: if the obtained reward eventually becomes stable, the algorithm has converged. The reward score curves obtained in the two environments with the reward function of this patent and with the conventional reward function are shown in FIG. 4 and FIG. 5.
As shown in FIG. 4 and FIG. 5, the abscissa is the number of training rounds, the ordinate is the reward score obtained in each training round, New denotes the reward function of this patent, and Traditional denotes the conventional reward function. FIG. 4 shows that with the reward function of the invention the reward score is already basically stable by round 50 and the algorithm has essentially converged, whereas with the conventional reward function the reward score only begins to stabilize after round 100, the algorithm only then begins to converge, and further oscillation appears later, indicating that it has not yet fully converged. FIG. 5 shows that with the reward function of the invention the reward score keeps oscillating during the first 100 rounds, slowly stabilizes between rounds 100 and 150 despite one oscillation, and the algorithm converges after round 200; with the conventional reward function, the reward score oscillates sharply before round 150, the amplitude begins to decrease between rounds 150 and 250 as the algorithm starts to converge, and the algorithm is basically converged only after round 250.
The above comparison shows that, with the same algorithm, the reward function of this patent converges faster than the conventional reward function, and for the same number of rounds it achieves higher reward scores.
(2) Analysis of convergence counts
In the experiments, different environments were designed by changing the positions, number, and sizes of the obstacles, and each environment was tested with both the reward function of this patent and the conventional reward function. The algorithms used were DQN, Double DQN, and Dueling DQN; each algorithm was run 10 times in the simple (class I) environments and 10 times in the complex (class II) environments, and the resulting convergence counts are shown in FIG. 6 and FIG. 7.
In FIG. 6 and FIG. 7, the abscissa is the number of rounds at which the algorithm converged, the ordinate is the number of experiments, New denotes the reward function of this patent, and Traditional denotes the conventional reward function.
FIG. 6 shows the convergence in class I environments. Of the 30 experiments, with the reward function of this patent 18 runs converged within 100 rounds, 8 runs converged between 100 and 150 rounds, and only 4 runs converged between 150 and 250 rounds; with the conventional reward function, 6 runs converged within 100 rounds, 15 runs converged between 100 and 150 rounds, and 9 runs converged between 150 and 250 rounds. It can be seen that most runs using the reward function of this patent converge within 100 rounds, while most runs using the conventional reward function converge within 150 rounds.
FIG. 7 shows the convergence in class II environments. With the reward function of this patent, 14 runs converged within 150 rounds, 9 runs converged between 150 and 200 rounds, 5 runs converged between 200 and 250 rounds, and 2 runs converged beyond 300 rounds; with the conventional reward function, 7 runs converged within 150 rounds, 10 runs converged between 150 and 200 rounds, 8 runs converged between 200 and 250 rounds, and 5 runs converged beyond 300 rounds. In summary, most runs using the reward function of this patent converge within 200 rounds, while most runs using the conventional reward function converge within 250 rounds.
This comparison shows that, in environments with different numbers of obstacles and with different deep reinforcement learning algorithms, the reward function of this method converges faster than the conventional reward function.
Conventional reward functions all take the negative of the distance between the ship and the target point as the reward. However, such reward functions converge slowly and require long training periods, and training may be stopped, or the scheme abandoned, because of time constraints.
The invention designs the reward function using a counting principle: if the ship reaches the target point consecutively, the reward is doubled; if the ship collides with obstacles consecutively, the penalty is doubled; a reward domain is set around the target point and a danger domain is set around each obstacle. This design helps the algorithm converge faster, so that the ship avoids obstacles and reaches the target point more quickly.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit of the invention.

Claims (7)

1. A deep reinforcement learning reward function optimization method for unmanned ship path planning, characterized by comprising the following steps:
S1, acquiring environment information, preprocessing the environment information, and converting the acquired environment information into information in an experimental environment;
S2, acquiring the distance between the unmanned ship and the obstacle and the distance between the unmanned ship and the target point, judging whether the unmanned ship reaches the target point according to the distance between the unmanned ship and the target point, executing S3 if the unmanned ship reaches the target point, otherwise executing S4;
S3, giving a reward value corresponding to the number of times the ship has reached the target point, comprising: if the ship reaches the target point, incrementing the target counter by one and resetting the collision counter to zero, and judging whether the target counter is greater than a threshold; if so, directly giving the maximum reward, otherwise calculating the reward according to the following formula:
where ceil() is a round-up function, goal_counter is the target counter, and reward_standard_goal is the standard reward for reaching the target point;
S4, judging whether the ship is in a reward domain, wherein the reward domain means that the ship is near the target point but has not yet reached it; if the ship is in the reward domain, giving a corresponding reward according to the reward-domain reward principle and executing S2, otherwise executing S5, the reward-domain reward being calculated according to the following formula:
wherein L and D are thresholds of the reward domain range and d_goal is the distance to the target point;
S5, judging whether the unmanned ship has collided with an obstacle, giving a penalty value corresponding to the number of collisions between the ship and obstacles and executing S2, otherwise executing S6, comprising:
judging whether the ship has collided with an obstacle; if so, incrementing the collision counter by one and resetting the target counter to zero, and judging whether the collision counter is greater than a threshold; if so, directly giving the maximum penalty, otherwise calculating the penalty value according to the following formula:
wherein obs_counter is the collision counter and reward_standard_obs is the standard penalty for colliding with an obstacle;
S6, judging whether the ship is in a danger domain, wherein the danger domain means that the ship is near an obstacle but has not collided with it; if so, giving a corresponding penalty according to the danger-domain penalty principle and executing S2; otherwise giving a reward according to the general-case reward principle and executing S2, the danger-domain penalty being calculated according to the following formula:
where min(d_obs) is the minimum distance between the ship and the obstacles, and α, β, and δ are thresholds of the danger domain range.
2. The unmanned ship path planning-oriented deep reinforcement learning reward function optimization method according to claim 1, wherein the preprocessing of the environmental information in S1 comprises:
Reading obstacle position data between the current position of the ship and the next turning point, and analyzing the obstacle position data to obtain coordinate information of an obstacle;
Converting the coordinate information of the obstacle into a first conversion coordinate under a rectangular coordinate system through a mercator projection;
Performing unit conversion on the first conversion coordinate to obtain a second conversion coordinate;
And normalizing the second converted coordinates to enable the coordinate values to be in the [0,1] interval.
3. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein giving, in S3, a reward value corresponding to the number of times the ship has reached the target point comprises: if the number of times the ship has reached the target point is greater than a threshold, giving the maximum reward; otherwise giving a reward weighted from a standard reward value, wherein the weighting coefficient is positively correlated with the number of times the ship has reached the target point.
4. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein giving, in S5, a penalty value corresponding to the number of times the ship has collided with obstacles comprises: if the number of collisions is greater than a threshold, giving the maximum penalty; otherwise giving a penalty weighted from a standard penalty value, wherein the weighting coefficient is positively correlated with the number of collisions.
5. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein giving, in S4, a corresponding reward according to the reward-domain reward principle comprises: giving a corresponding reward value according to the distance between the ship and the target point.
6. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein giving, in S6, a corresponding penalty according to the danger-domain penalty principle comprises: giving a corresponding penalty value according to the distance between the ship and the nearest obstacle.
7. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein giving, in S6, a reward according to the general-case reward principle comprises: giving a corresponding reward value according to the distance between the ship and the target point.
CN202010962958.4A 2020-09-14 2020-09-14 Deep reinforcement learning reward function optimization method for unmanned ship path planning Active CN111880549B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010962958.4A CN111880549B (en) 2020-09-14 2020-09-14 Deep reinforcement learning reward function optimization method for unmanned ship path planning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010962958.4A CN111880549B (en) 2020-09-14 2020-09-14 Deep reinforcement learning reward function optimization method for unmanned ship path planning

Publications (2)

Publication Number Publication Date
CN111880549A CN111880549A (en) 2020-11-03
CN111880549B true CN111880549B (en) 2024-06-04

Family

ID=73199216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010962958.4A Active CN111880549B (en) 2020-09-14 2020-09-14 Deep reinforcement learning reward function optimization method for unmanned ship path planning

Country Status (1)

Country Link
CN (1) CN111880549B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507520A (en) * 2020-11-12 2021-03-16 深圳慧拓无限科技有限公司 Path planning method and device based on reinforcement learning
CN112462786B (en) * 2020-12-03 2024-01-19 大连海事大学 Unmanned ship collision prevention method based on fuzzy control strategy double-window algorithm
CN112882469B (en) * 2021-01-14 2022-04-08 浙江大学 Deep reinforcement learning obstacle avoidance navigation method integrating global training
CN114851184B (en) * 2021-01-20 2023-05-09 广东技术师范大学 Reinforced learning rewarding value calculating method for industrial robot
CN112800545B (en) * 2021-01-28 2022-06-24 中国地质大学(武汉) Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN
CN113079517B (en) * 2021-03-23 2022-07-05 中国人民解放军国防科技大学 Mobile command console and control method thereof
CN113985876B (en) * 2021-10-27 2023-09-26 广州大学 Ocean garbage recycling path planning method and system based on bionic fish foraging
CN114531647B (en) * 2022-01-13 2024-03-22 江苏大学 Multi-layer self-adaptive rewarding target tracing method in diffusion molecular communication environment
CN115790608B (en) * 2023-01-31 2023-05-30 天津大学 AUV path planning algorithm and device based on reinforcement learning

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011204036A (en) * 2010-03-25 2011-10-13 Institute Of National Colleges Of Technology Japan Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program
CN102967310A (en) * 2012-12-12 2013-03-13 中国船舶重工集团公司第七0七研究所 Seaway monitoring method based on electronic chart
CN107133735A (en) * 2017-04-28 2017-09-05 大鹏高科(武汉)智能装备有限公司 A kind of unmanned boat resource share method and device
CN111507501A (en) * 2019-01-31 2020-08-07 斯特拉德视觉公司 Method and device for executing personalized path planning through reinforcement learning
CN110716575A (en) * 2019-09-29 2020-01-21 哈尔滨工程大学 UUV real-time collision avoidance planning method based on deep double-Q network reinforcement learning
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110750096A (en) * 2019-10-09 2020-02-04 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN111045445A (en) * 2019-10-23 2020-04-21 浩亚信息科技有限公司 Aircraft intelligent collision avoidance method, equipment and medium based on reinforcement learning
CN110989576A (en) * 2019-11-14 2020-04-10 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
CN111157982A (en) * 2019-11-20 2020-05-15 智慧航海(青岛)科技有限公司 Intelligent ship and shore cooperative target tracking system and method based on shore-based radar
CN111123963A (en) * 2019-12-19 2020-05-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111487977A (en) * 2020-05-11 2020-08-04 大连海事大学 Ship dynamic obstacle avoidance optimization method and system under uncertain environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An Autonomous Path Planning Model for Unmanned Ships Based on Deep Reinforcement Learning; Siyu Guo; Sensors; 2020-01-11; Vol. 20; pp. 1-35 *
Training a Multi-Focus Attention Model Using Reinforcement Learning; Liu Chang; Acta Automatica Sinica; 2017-09-30; Vol. 43, No. 9; pp. 1563-1570 *

Also Published As

Publication number Publication date
CN111880549A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
CN111880549B (en) Deep reinforcement learning reward function optimization method for unmanned ship path planning
CN112179367B (en) Intelligent autonomous navigation method based on deep reinforcement learning
CN111780777B (en) Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
CN108564202B (en) Unmanned ship route optimization method based on environment forecast information
CN111273670B (en) Unmanned ship collision prevention method for fast moving obstacle
Ouahouah et al. Deep-reinforcement-learning-based collision avoidance in uav environment
Xiaofei et al. Global path planning algorithm based on double DQN for multi-tasks amphibious unmanned surface vehicle
CN109933067A (en) A kind of unmanned boat collision prevention method based on genetic algorithm and particle swarm algorithm
Wang et al. Cooperative collision avoidance for unmanned surface vehicles based on improved genetic algorithm
CN112180950B (en) Intelligent ship autonomous collision avoidance and path planning method based on reinforcement learning
CN110906935B (en) Unmanned ship path planning method
CN114625151A (en) Underwater robot obstacle avoidance path planning method based on reinforcement learning
Bakdi et al. Fullest COLREGs evaluation using fuzzy logic for collaborative decision-making analysis of autonomous ships in complex situations
Zheng et al. A Decision‐Making Method for Ship Collision Avoidance Based on Improved Cultural Particle Swarm
CN112859864A (en) Unmanned ship-oriented geometric path planning method
CN109765914A (en) A kind of unmanned surface vehicle collision prevention method based on sliding window population
Kim Vessel target prediction method and dead reckoning position based on SVR seaway model
Du et al. An optimized path planning method for coastal ships based on improved DDPG and DP
Gao et al. An optimized path planning method for container ships in Bohai bay based on improved deep Q-learning
Yang et al. Improved reinforcement learning for collision-free local path planning of dynamic obstacle
CN117590867A (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
Masmitja et al. A reinforcement learning path planning approach for range-only underwater target localization with autonomous vehicles
CN117034102A (en) Multi-scene navigation method based on intelligent scene classification
Jose et al. Navigating the Ocean with DRL: Path following for marine vessels
Keong et al. Reinforcement learning for autonomous aircraft avoidance

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant