CN111880549A - Unmanned ship path planning-oriented deep reinforcement learning reward function optimization method - Google Patents
- Publication number
- CN111880549A CN111880549A CN202010962958.4A CN202010962958A CN111880549A CN 111880549 A CN111880549 A CN 111880549A CN 202010962958 A CN202010962958 A CN 202010962958A CN 111880549 A CN111880549 A CN 111880549A
- Authority
- CN
- China
- Prior art keywords
- ship
- reward
- giving
- target point
- obstacle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G05D1/0088 — Control of position, course, altitude or attitude of land, water, air or space vehicles characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
- G05D1/0206 — Control of position or course in two dimensions specially adapted to water vehicles
- G06N3/045 — Combinations of networks
Abstract
The invention provides a deep reinforcement learning reward function optimization method for unmanned ship path planning, comprising the following steps: S1, acquiring environment information; S2, acquiring the distances between the unmanned ship and the obstacles and between the unmanned ship and the target point; S3, giving a reward value corresponding to the number of times the ship has reached the target point; S4, judging whether the ship is in the reward domain and, if so, giving the corresponding reward according to the reward-domain reward principle; S5, judging whether the unmanned ship has collided with an obstacle and, if so, giving the corresponding penalty value; and S6, judging whether the ship is in the danger domain and, if so, giving the corresponding penalty according to the danger-domain penalty principle, otherwise giving a reward according to the general-case reward principle. By adding a reward domain near the target point, adding a danger domain near each obstacle, and introducing a counting principle that scales the reward or penalty obtained, the method speeds up the convergence of the deep reinforcement learning algorithm and guides the ship to avoid obstacles and reach the target point more quickly.
Description
Technical Field
The invention relates to the technical field of path planning, in particular to a deep reinforcement learning reward function optimization method for unmanned ship path planning.
Background
At present, world trade binds national economies ever more tightly, and ships play an important role as a major means of transport. However, as shipping traffic density rises and navigation environments grow more complex, marine safety accidents occur frequently. Recent data show that a main cause of distress at sea is collision with obstacles that the crew failed to detect in time during the voyage. Moreover, in some situations it is unsuitable to send a crewed ship to the work site, and an unmanned ship must navigate autonomously through the complex and changeable conditions at sea, so the ship needs an autonomous obstacle avoidance capability. To ensure safe ship operation and avoid accidents at sea, scholars at home and abroad have proposed many methods for ship obstacle avoidance.
Path planning is the most basic and important link in ship obstacle avoidance: the aim is to establish an optimal path from the starting point to the target point that avoids all obstacles while satisfying the ship's navigation-performance requirements.
At present, traditional unmanned ship path planning algorithms include the Artificial Potential Field (APF) algorithm, the A* algorithm, evolutionary neural networks, evolutionary algorithms and the like. APF guides the unmanned ship toward the target waypoint by constructing a virtual attractive field, and converts navigation-restricted areas into virtual obstacle regions from which a virtual repulsive field is constructed to avoid static obstacles; however, when the repulsive and attractive forces balance, or the repulsive force near the target point is large, the ship can stall and become trapped in a local optimum. The A* algorithm is an improvement of Dijkstra's algorithm: it searches the state space, evaluates each candidate position, selects the best one, and continues searching from there until the target is reached. However, its path planning depends on a grid map, and the size and spacing of the grid cells directly affect the algorithm's speed and accuracy. Evolutionary neural networks realize path planning and obstacle avoidance of the unmanned ship in static and dynamic obstacle environments mainly by learning sample data from historical collision-avoidance manoeuvres; however, their accuracy depends heavily on the training samples, and a separate neural network model must be trained for each navigation area. Evolutionary algorithms include the ant colony algorithm, simulated annealing and the like; for example, Vettor, R. et al. fed environment information as the initial population of an optimized genetic algorithm, finally obtaining a navigation path that satisfied the requirements and realizing autonomous path planning for an unmanned ship in open water.
These algorithms are simple to implement, but they require many iterations, easily fall into local optima, and produce paths with too many corners to be used in a continuous topological environment.
In recent years, with the development of deep reinforcement learning, ship path planning methods based on deep reinforcement learning have also been studied by many scholars. These methods combine a deep reinforcement learning algorithm with the unmanned ship's control actions, letting the ship try repeatedly in a simulation environment, output accurate navigation actions in real time, and learn how to avoid obstacles and reach the target point, finally yielding an effective ship path planning model. Once obtained, the model needs no retraining: it is deployed on a ship that requires path planning, and environment information is fed into it to perform path planning.
The unmanned ship path planning based on deep reinforcement learning needs to randomly initialize network parameters of a deep reinforcement learning algorithm, abstract obstacles in a real environment and establish an experimental environment. When a path planning model is established, firstly, data such as the position, the course, the navigational speed, the distance between a ship and a target point, the distance between the ship and an obstacle and the like are initialized as input, and the initial state of the ship is determined. And then the network predicts the action to be executed by the ship according to the input ship state information, the ship executes the action to obtain the ship state at the next moment, and a reward or punishment is obtained by judging the action by a reward function. And finally, the state of the ship, the executed action, the obtained reward and the next state after the action is executed are stored in a buffer pool to be used for updating the network parameters of the deep reinforcement learning algorithm. And repeating the process until an effective ship path planning model is obtained. In the process, the rewarding function has great influence on the deep reinforcement learning algorithm as an evaluator for evaluating the performance of the action.
At present, most reward functions used in deep-reinforcement-learning-based unmanned ship path planning simply take the inverse of the distance between the ship and the target point as the reward. Such reward functions make the convergence of the deep reinforcement learning algorithm very slow and the training period very long, and the optimal path planning strategy may never be learned.
Disclosure of Invention
To solve the traditional reward function's problems of slow convergence and long training period, the invention provides a deep reinforcement learning reward function optimization method for unmanned ship path planning. By adding a reward domain near the target point, adding a danger domain near each obstacle, and introducing a counting principle that scales the reward or penalty obtained, the method speeds up model convergence so that the ship avoids obstacles and reaches the target point more quickly.
The technical means adopted by the invention are as follows:
The deep reinforcement learning reward function optimization method for unmanned ship path planning comprises the following steps:
S1, acquiring environment information, preprocessing it, and converting it into information in the experimental environment;
S2, acquiring the distances between the unmanned ship and the obstacles and between the unmanned ship and the target point, and judging from the latter whether the unmanned ship has reached the target point; if so, executing S3, otherwise executing S4;
S3, giving a reward value corresponding to the number of times the ship has reached the target point;
S4, judging whether the ship is in the reward domain; if so, giving the corresponding reward according to the reward-domain reward principle and executing S2, otherwise executing S5;
S5, judging whether the unmanned ship has collided with an obstacle; if so, giving a penalty value corresponding to the number of collisions between the ship and obstacles and executing S2, otherwise executing S6;
S6, judging whether the ship is in the danger domain; if so, giving the corresponding penalty according to the danger-domain penalty principle and executing S2; otherwise, giving a reward according to the general-case reward principle and executing S2.
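Steps S2-S6 above amount to one prioritized reward decision per time step. A minimal sketch of that decision flow follows; every threshold and reward magnitude here is invented for illustration, since the patent does not publish its numeric values:

```python
def reward_step(d_goal, d_obs, goal_count, obs_count,
                goal_radius=5.0, reward_zone=50.0, danger_zone=30.0):
    """One evaluation of the optimized reward function (S2-S6).

    d_goal: distance to the target point; d_obs: distances to each obstacle.
    All radii and magnitudes are illustrative assumptions.
    Returns (reward, episode_done).
    """
    if d_goal <= goal_radius:                       # S2/S3: target reached
        return 10.0 * min(goal_count + 1, 5), True  # counted reward, capped
    if d_goal <= reward_zone:                       # S4: inside reward domain
        return 1.0 / max(d_goal, 1e-6), False       # closer -> larger reward
    if min(d_obs) <= 0.0:                           # S5: collision
        return -10.0 * min(obs_count + 1, 5), True  # counted penalty, capped
    if min(d_obs) <= danger_zone:                   # S6: inside danger domain
        return -1.0 / max(min(d_obs), 1e-6), False  # nearer obstacle -> larger penalty
    return -0.001 * d_goal, False                   # general case: sparse shaping
```

The ordering mirrors the patent's S3 → S4 → S5 → S6 priority: terminal events (goal, collision) end the episode, while domain-based shaping rewards do not.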
Further, the preprocessing of the environment information in S1 includes:
reading obstacle position data between the ship's current position and the next steering point, and parsing it to obtain the coordinate information of each obstacle;
converting the obstacle coordinates into first converted coordinates in a rectangular coordinate system through the Mercator projection;
performing unit conversion on the first converted coordinates to obtain second converted coordinates;
and normalizing the second converted coordinates so that the coordinate values lie in the range [0, 1].
Further, in S3, giving a reward value corresponding to the number of times the ship has reached the target point includes: if that number is greater than the threshold, giving the maximum reward; otherwise giving a weighted multiple of the standard reward value, where the weighting coefficient is positively correlated with the number of times the ship has reached the target point.
Further, in S5, giving a penalty value corresponding to the number of collisions between the ship and obstacles includes: if that number is greater than the threshold, giving the maximum penalty; otherwise giving a weighted multiple of the standard penalty value, where the weighting coefficient is positively correlated with the number of collisions.
Further, in S4, giving the corresponding reward according to the reward-domain reward principle includes: giving a reward value corresponding to the distance between the ship and the target point.
Further, in S6, giving the corresponding penalty according to the danger-domain penalty principle includes: giving a penalty value corresponding to the distance between the ship and the nearest obstacle.
Further, in S6, giving a reward according to the general-case reward principle includes: giving a reward value corresponding to the distance between the ship and the target point.
Compared with the prior art, the invention has the following advantages:
By adding a reward domain near the target point, adding a danger domain near each obstacle, and introducing a counting principle that scales the reward or penalty obtained, the method speeds up model convergence so that the ship can quickly avoid obstacles and reach the target point.
Based on the advantages, the method can be widely popularized in the field of ship obstacle avoidance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a reward function optimization method of the present invention.
FIG. 2 is a schematic diagram of example environment one.
FIG. 3 is a schematic diagram of example environment two.
FIG. 4 is a graph of reward scores in example environment one.
FIG. 5 is a graph of reward scores in example environment two.
FIG. 6 is a graph comparing convergence counts in example environment one.
FIG. 7 is a graph comparing convergence counts in example environment two.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in fig. 1, the present invention provides a method for optimizing a deep reinforcement learning reward function for unmanned ship path planning, including:
step 1: and carrying out data preprocessing conversion on the environment information to obtain an experimental environment. The data preprocessing method comprises the following steps:
the environment information mainly comprises position information, speed size, angular speed, position size of an obstacle and other information of the ship, obstacle data between the ship and a next turning point is read according to the current position of the ship before the algorithm runs, data are analyzed to obtain coordinates and other information of the obstacle, the data information of the obstacle is subjected to coordinate conversion by using ink card support projection, and the coordinates are converted into coordinates under a rectangular coordinate system. After the coordinates are converted, the data are converted into final coordinates in units of meters and kilometers, as shown in formula (1). Wherein a is the earth long axis, b is the earth short axis, the longitude of a certain point on the earth is theta ∈ (-pi, + pi), and the latitude isThe first eccentricity of the earth ellipsoid and (x, y) the final transformed coordinates. Finally, normalization operation is carried out to ensure that the data transmitted into the algorithm are all [0,1]]In the meantime.
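Formula (1) itself is not reproduced in the text, but the described pipeline — Mercator projection into planar coordinates, then normalization into [0, 1] — can be sketched as below, using the standard ellipsoidal Mercator equations; the WGS-84 semi-axis values are our assumption, not the patent's:

```python
import math

def mercator(lon_deg, lat_deg, a=6378137.0, b=6356752.3142):
    """Ellipsoidal Mercator projection: (longitude, latitude) in degrees
    -> planar (x, y) in metres, with a, b the ellipsoid semi-axes."""
    e = math.sqrt(a * a - b * b) / a              # first eccentricity
    theta = math.radians(lon_deg)                 # longitude in (-pi, +pi)
    phi = math.radians(lat_deg)                   # latitude
    x = a * theta
    y = a * math.log(math.tan(math.pi / 4 + phi / 2)
                     * ((1 - e * math.sin(phi)) / (1 + e * math.sin(phi))) ** (e / 2))
    return x, y

def normalize(points):
    """Min-max normalise a list of (x, y) coordinates into [0, 1] per axis."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    def scale(v, lo, hi):
        return 0.0 if hi == lo else (v - lo) / (hi - lo)
    return [(scale(x, min(xs), max(xs)), scale(y, min(ys), max(ys)))
            for x, y in points]
```

Projecting the origin (0°, 0°) gives (0, 0), and `normalize` maps the extreme coordinates of the obstacle set to 0 and 1 as the text requires.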
Step 2: variables needed for the algorithm to run are initialized. The variables required for the process of the invention include:
the coarse _ counter is a target counter;
obs _ counter is a collision counter;
reward awards or penalties for each step;
the reward _ standard _ good is a standard reward for reaching a target point;
max _ rewarded _ good is the maximum reward for reaching the target point;
the reward _ standard _ obs is the standard penalty of colliding with the obstacle;
max _ rewarded _ obs is the maximum penalty for colliding with the obstacle.
Step 3: calculate the distances between the ship and the obstacles and the target point. The distances between the unmanned ship and the obstacles are recorded as d_obs = [d1, d2, d3, ..., dn], where d1, d2, d3, ..., dn are the distances to each obstacle; the distance to the target point is recorded as d_goal. To prevent calculation errors, the boundary distance between the ship and the target point or obstacle is used rather than the centre distance.
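The boundary (edge-to-edge) distance mentioned here can be computed by approximating the ship and each obstacle as circles; the circular approximation and all names below are our assumptions for illustration:

```python
import math

def boundary_distance(ship_xy, ship_radius, obstacle_xy, obstacle_radius):
    """Edge-to-edge distance between two circles approximating the ship
    and an obstacle, clamped at zero on contact or overlap."""
    cx = ship_xy[0] - obstacle_xy[0]
    cy = ship_xy[1] - obstacle_xy[1]
    centre = math.hypot(cx, cy)
    return max(centre - ship_radius - obstacle_radius, 0.0)

def d_obs(ship_xy, ship_radius, obstacles):
    """Vector of boundary distances to each obstacle, as in d_obs = [d1, ..., dn].
    obstacles: list of ((x, y), radius) pairs."""
    return [boundary_distance(ship_xy, ship_radius, xy, r) for xy, r in obstacles]
```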
Step 4: judge whether the ship has reached the target point; if so, go to step 5, otherwise go to step 6.
Step 5: increment the target counter by one and reset the collision counter to zero. Judge whether the target counter exceeds the threshold; if so, directly give the maximum reward, as shown in formula (2); otherwise, calculate the reward value according to formula (3), where ceil() is the rounding-up function. End the current round and start the next round.
reward=max_reward_goal (2)
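Formula (3) is not printed in the text; it is only described as a ceil-based weighting of the standard reward, positively correlated with the target counter and capped by formula (2). One sketch consistent with that description — the particular weighting, threshold, and magnitudes are our assumptions:

```python
import math

def goal_reward(goal_counter, reward_standard_goal=10.0,
                max_reward_goal=50.0, threshold=5):
    """Counted reward on reaching the target (formulas (2)-(3)).
    The ceil-based weight below is an assumed stand-in for formula (3)."""
    if goal_counter > threshold:
        return max_reward_goal                       # formula (2): maximum reward
    weight = math.ceil(goal_counter / 2) + 1         # grows with goal_counter
    return min(weight * reward_standard_goal, max_reward_goal)
```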
Step 6: judge whether the ship is in the reward domain; if so, give the corresponding reward according to the reward-domain reward principle and return to step 3, otherwise go to step 7.
The reward-domain reward principle is as follows:
The reward domain means the ship is near the target point but has not yet reached it. To help the ship reach the target point quickly, different rewards are given within the reward domain according to the distance from the target point, which accelerates model convergence. At the same time, to keep the ship from falling into a local optimum, rewards inside the reward domain must not be too dense, and they must remain clearly smaller than the reward for actually reaching the target point. The specific rewards are given by formula (4), where l and L are the two thresholds of the reward-domain range, and different rewards are given within l and within L respectively.
Step 7: judge whether the ship has collided with an obstacle. If so, increment the collision counter by one and reset the target counter to zero, then judge whether the collision counter exceeds the threshold: if it does, directly give the maximum penalty, calculated by formula (5); if not, calculate the penalty value according to formula (6). In either case, end the current round and start the next round. If there is no collision, go to step 8.
reward=max_reward_obs (5)
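Like formula (3), formula (6) is not printed; it is described as a weighted penalty positively correlated with the collision counter and capped by formula (5). A sketch under the same assumed weighting as the goal reward:

```python
import math

def collision_penalty(obs_counter, reward_standard_obs=-10.0,
                      max_reward_obs=-50.0, threshold=5):
    """Counted penalty on collision (formulas (5)-(6)).
    The ceil-based weight below is an assumed stand-in for formula (6)."""
    if obs_counter > threshold:
        return max_reward_obs                        # formula (5): maximum penalty
    weight = math.ceil(obs_counter / 2) + 1          # grows with obs_counter
    return max(weight * reward_standard_obs, max_reward_obs)
```

The symmetry with the goal reward is deliberate: repeated successes scale the reward up and repeated collisions scale the penalty up, which is the counting principle the patent claims.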
Step 8: judge whether the ship is in the danger domain; if so, give the corresponding penalty according to the danger-domain penalty principle and return to step 3; if not, give a reward according to the general-case reward principle and return to step 3.
The danger-domain penalty principle is as follows: the danger domain means the ship is near some obstacle but has not collided with it. To help the ship quickly leave the obstacle's vicinity and move toward the target point, a penalty is applied near the obstacle, inversely proportional to the ship-obstacle distance. As with the reward domain, to avoid local optima the penalties in the danger domain must not be too dense, and they must remain clearly smaller than the penalty for an actual collision. The specific penalty is calculated by formula (7), where min(d_obs) is the minimum distance between the ship and any obstacle, and α and β are the two thresholds of the danger-domain range; different penalties are given within α and within β respectively.
General-case reward principle: the general case means the ship is neither near the target point nor near an obstacle. Here the reward should be as sparse as possible to keep the algorithm from falling into a local optimum, so in general the inverse of the distance between the ship and the target point is used as the per-step reward; the specific calculation is shown in formula (8), where inverse() denotes taking the inverse.
reward=inverse(d_goal) (8)
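Formula (7) is not printed, and formula (8) is given only as reward = inverse(d_goal). A sketch of both shaping terms follows; the two-band penalty structure, the α/β values, and the reading of inverse() as the reciprocal are our assumptions — the text states only that the penalty grows as the nearest obstacle gets closer and that rewards elsewhere stay sparse:

```python
def shaping_reward(d_goal, d_obs, alpha=10.0, beta=30.0):
    """Danger-domain penalty (formula (7)) and general-case reward
    (formula (8)), with assumed band thresholds alpha < beta."""
    nearest = min(d_obs)
    if nearest <= alpha:                 # inner danger band: strong penalty
        return -1.0 / max(nearest, 1e-6)
    if nearest <= beta:                  # outer danger band: weaker penalty
        return -0.5 / nearest
    return 1.0 / max(d_goal, 1e-6)       # general case: inverse(d_goal)
```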
A comparative analysis of reward scores and convergence counts in different environments shows that the reward function of this patent converges faster than the traditional reward function.
(1) Comparative analysis of reward scores in different environments
During navigation, the whole route is divided into several waypoints, and path planning must plan the path from one waypoint to the next. The actual environment between waypoints measures 60 nautical miles × 60 nautical miles and contains at most 7 obstacles, so in the experiments the environment is set to 600 pixels × 600 pixels with 1-7 obstacles to simulate the actual environment. Because more obstacles make the algorithm converge more slowly, the environments are divided by obstacle count when tallying convergence: class I environments contain 1-4 obstacles and class II environments contain 5-7. The convergence behaviour of the method is illustrated below with two example environments (fig. 2, a class I environment, and fig. 3, a class II environment).
As shown in figs. 2 and 3, example environment one contains 3 obstacles and example environment two contains 5; the five-pointed star is the target point, the ship is the starting point, and the obstacles are drawn as mountains. In both environments, experiments were conducted with the traditional reward function and with the DoubleDQN algorithm used by the method of this patent. In deep reinforcement learning, the reward score obtained in each training round is used to judge whether the algorithm has converged: if the obtained reward eventually levels off, the algorithm has converged. The reward scores obtained in the two environments with this patent's reward function and with the traditional reward function are shown in figs. 4 and 5.
As shown in figs. 4 and 5, the abscissa is the number of training rounds, the ordinate is the reward score per round, "New" denotes the reward function of this patent, and "Traditional" the conventional one. Fig. 4 shows that with the reward function of the invention the reward score is essentially stable, and the algorithm essentially converged, by round 50; with the traditional reward function the score only begins to stabilize after round 100, and oscillations still appear later, indicating that the algorithm never fully converged. Fig. 5 shows that with the invention the reward score oscillates over the first 100 rounds, slowly stabilizes over rounds 100-150 as the algorithm begins to converge, and is essentially converged after round 200. With the traditional reward function the score oscillates strongly before round 150, the oscillation amplitude begins to shrink between rounds 150 and 250 as the algorithm tends toward convergence, and the algorithm essentially converges after round 250.
These comparisons show that, with the same algorithm, the reward function of this patent converges faster than the traditional reward function, and at the same round count it achieves a higher reward score.
(2) Convergence count analysis
Different environments were designed by varying the positions, number and sizes of the obstacles, and in each environment experiments were run with both the reward function of the invention and the traditional reward function. The algorithms used were DQN, DoubleDQN and Dueling DQN; each algorithm was run 10 times in the simple and complex environments respectively, and the convergence counts obtained are shown in figs. 6 and 7.
In figs. 6 and 7, the abscissa is the round at which the algorithm converged, the ordinate is the number of runs, "New" denotes the reward function of this patent, and "Traditional" the conventional one.
Fig. 6 shows convergence in class I environments. Of the 30 runs with this patent's reward function, 18 converged within 100 rounds, 8 between rounds 100 and 150, and 4 between rounds 150 and 250; with the traditional reward function, 6 converged within 100 rounds, 15 between rounds 100 and 150, and 9 between rounds 150 and 250. Most runs with this patent's reward function thus converge within 100 rounds, whereas most runs with the traditional one converge around 150 rounds.
Fig. 7 shows convergence in class II environments. With this patent's reward function, 14 runs converged within 150 rounds, 9 between rounds 150 and 200, 5 between rounds 200 and 250, and 2 beyond round 300; with the traditional reward function, 7 converged within 150 rounds, 10 between rounds 150 and 200, 8 between rounds 200 and 250, and 5 beyond round 300. In summary, most runs with this patent's reward function converge within 200 rounds, whereas with the traditional reward function most converge within 250 rounds.
The comparison shows that, in environments with different numbers of obstacles and with different deep reinforcement learning algorithms, the reward function of this method converges faster than the traditional reward function.
The traditional reward function takes the distance between the ship and the target point as the reward. However, such a reward function converges slowly and requires a long training period, which may force training to be terminated, or the scheme to be abandoned, for lack of time.
The invention designs the reward function using a counting principle: if the ship reaches the target point in consecutive episodes, the reward is doubled; if the ship collides with an obstacle in consecutive episodes, the penalty is doubled. In addition, a reward domain is set around the target point and a danger domain is set around each obstacle. This design helps the algorithm converge faster, so that the ship learns more quickly to avoid obstacles and reach the target point.
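The counting principle described above can be sketched as follows. The standard reward/penalty magnitudes, the caps, and the class and method names are illustrative assumptions, not values taken from the patent:

```python
# Hypothetical sketch of the counting-principle reward: the reward doubles
# on consecutive arrivals at the target point and the penalty doubles on
# consecutive collisions, each capped at an assumed maximum.

STANDARD_REWARD = 10.0
STANDARD_PENALTY = -10.0
MAX_REWARD = 100.0
MAX_PENALTY = -100.0

class CountingReward:
    def __init__(self):
        self.goal_streak = 0    # consecutive arrivals at the target point
        self.crash_streak = 0   # consecutive collisions with an obstacle

    def on_goal(self):
        """Reward for reaching the target: doubles with each consecutive arrival."""
        self.goal_streak += 1
        self.crash_streak = 0   # a success breaks the collision streak
        return min(STANDARD_REWARD * 2 ** (self.goal_streak - 1), MAX_REWARD)

    def on_collision(self):
        """Penalty for a collision: doubles with each consecutive collision."""
        self.crash_streak += 1
        self.goal_streak = 0    # a collision breaks the arrival streak
        return max(STANDARD_PENALTY * 2 ** (self.crash_streak - 1), MAX_PENALTY)
```

Capping both series keeps the reward signal bounded, which matters for the stability of the Q-value targets in DQN-family algorithms.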
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (7)
1. A deep reinforcement learning reward function optimization method for unmanned ship path planning is characterized by comprising the following steps:
s1, acquiring environment information, preprocessing the environment information, and converting the acquired environment information into information in an experimental environment;
s2, acquiring the distance between the unmanned ship and the obstacle and the distance between the unmanned ship and a target point, judging whether the unmanned ship reaches the target point or not according to the distance between the unmanned ship and the target point, if so, executing S3, otherwise, executing S4;
s3, according to the times of the ship reaching the target point, giving a reward value corresponding to the times of the ship reaching the target point;
s4, judging whether the ship is in the reward domain, if so, giving a corresponding reward according to the reward domain reward principle and executing S2, otherwise executing S5;
s5, judging whether the unmanned ship collides with the obstacle, if so, giving a penalty value corresponding to the number of times the ship has collided with the obstacle and executing S2, otherwise executing S6;
s6, judging whether the ship is in a danger domain, if so, giving corresponding punishment according to a danger domain punishment principle, and executing S2; otherwise, awarding the bonus according to the general bonus principle and executing S2.
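The branching logic of steps S2 to S6 can be sketched as below; the radii, the helper function, and the returned labels are illustrative assumptions and are not part of the claim:

```python
# Hypothetical per-step decision flow of claim 1 (S2-S6): which reward
# principle applies is decided from the distances to the target point and
# to the nearest obstacle. All radii are assumed, normalized-coordinate values.

import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def step_reward(ship, target, obstacles,
                goal_radius=0.02, bonus_radius=0.1,
                crash_radius=0.01, danger_radius=0.05):
    d_target = dist(ship, target)
    d_obstacle = min(dist(ship, o) for o in obstacles)
    if d_target <= goal_radius:        # S3: target point reached
        return "goal"
    if d_target <= bonus_radius:       # S4: inside the reward domain
        return "reward_domain"
    if d_obstacle <= crash_radius:     # S5: collision with an obstacle
        return "collision"
    if d_obstacle <= danger_radius:    # S6: inside the danger domain
        return "danger_domain"
    return "general"                   # S6: general-case reward principle
```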
2. The unmanned ship path planning-oriented deep reinforcement learning reward function optimization method of claim 1, wherein the preprocessing of the environment information in S1 includes:
reading obstacle position data between the current position of the ship and the next steering point, and analyzing the obstacle position data to obtain coordinate information of an obstacle;
converting the coordinate information of the barrier into a first conversion coordinate under a rectangular coordinate system through mercator projection;
performing unit conversion on the first conversion coordinate to obtain a second conversion coordinate;
and carrying out a normalization operation on the second conversion coordinate so that the coordinate values lie in the range [0,1].
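The preprocessing chain of claim 2 can be sketched as follows, assuming a spherical Mercator projection and min-max normalization; the Earth-radius constant and the function names are assumptions, and the unit-conversion step of the claim is folded into the projection here:

```python
# Hypothetical sketch of claim 2: project obstacle longitude/latitude to
# rectangular coordinates via (spherical) Mercator, then min-max normalize
# the projected coordinates into [0, 1].

import math

EARTH_RADIUS_M = 6378137.0  # WGS-84 semi-major axis (assumed)

def mercator(lon_deg, lat_deg):
    """Spherical Mercator projection to metres (the 'first conversion coordinate')."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def normalize(points):
    """Min-max normalize projected coordinates so every value lies in [0, 1]."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [((x - min(xs)) / ((max(xs) - min(xs)) or 1.0),
             (y - min(ys)) / ((max(ys) - min(ys)) or 1.0))
            for x, y in points]
```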
3. The unmanned ship path planning-oriented deep reinforcement learning reward function optimization method of claim 1, wherein in step S3, according to the number of times that a ship reaches a target point, a reward value corresponding to the number of times that the ship reaches the target point is given, and the method comprises: if the number of times that the ship reaches the target point is larger than the threshold value, giving the maximum reward, otherwise giving the weighted reward of the standard reward value, wherein the weighting coefficient is positively correlated with the number of times that the ship reaches the target point.
4. The unmanned ship path planning-oriented deep reinforcement learning reward function optimization method of claim 1, wherein in step S5, a penalty value corresponding to the number of collisions between a ship and an obstacle is given according to the number of collisions between the ship and the obstacle, and the method comprises: and if the times of the ship colliding with the obstacle are larger than the threshold value, giving a maximum penalty, otherwise giving a weighted penalty of a standard penalty value, wherein the weighting coefficient is positively correlated with the times of colliding with the obstacle.
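One plausible reading of claims 3 and 4 is sketched below. The claims require only that the weighting coefficient grow with the count and that the value saturate past a threshold; the linear weighting, the threshold of 5, and the magnitudes are assumptions:

```python
# Hypothetical count-weighted reward (claim 3) and penalty (claim 4):
# linear weighting, capped at a maximum once the count exceeds a threshold.

def arrival_reward(n_arrivals, standard=10.0, maximum=100.0, threshold=5):
    if n_arrivals > threshold:
        return maximum                 # maximum reward past the threshold
    return standard * n_arrivals       # weight positively correlated with count

def collision_penalty(n_collisions, standard=-10.0, minimum=-100.0, threshold=5):
    if n_collisions > threshold:
        return minimum                 # maximum penalty past the threshold
    return standard * n_collisions
```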
5. The optimization method of the deep reinforcement learning reward function oriented to unmanned ship path planning of claim 1, wherein the step S4 of awarding corresponding rewards according to the reward domain reward principle comprises: and giving a corresponding reward value according to the distance between the ship and the target point.
6. The optimization method of the deep reinforcement learning reward function oriented to unmanned ship path planning of claim 1, wherein the giving of the corresponding penalty in S6 according to the risk domain penalty principle comprises: corresponding penalty values are given according to the distance of the ship from the nearest obstacle.
7. The optimization method of the deep reinforcement learning reward function oriented to unmanned ship path planning of claim 1, wherein the step of giving the reward in S6 according to the general condition reward principle comprises: and giving a corresponding reward value according to the distance between the ship and the target point.
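The distance-shaped rewards of claims 5 to 7 can be sketched as follows; the linear shaping, the domain radii and the scale factor are assumptions, as the claims state only that the value depends on the distance to the target point or to the nearest obstacle:

```python
# Hypothetical distance-shaped rewards: inside the reward domain (claims 5, 7)
# the reward grows as the ship closes on the target point; inside the danger
# domain (claim 6) the penalty grows as the ship nears the closest obstacle.

def domain_reward(d_target, bonus_radius=0.1, scale=10.0):
    """Reward in the reward domain, largest at the target point itself."""
    return scale * (bonus_radius - d_target) / bonus_radius

def danger_penalty(d_obstacle, danger_radius=0.05, scale=10.0):
    """Penalty in the danger domain, largest at the obstacle boundary."""
    return -scale * (danger_radius - d_obstacle) / danger_radius
```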
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010962958.4A CN111880549B (en) | 2020-09-14 | 2020-09-14 | Deep reinforcement learning rewarding function optimization method for unmanned ship path planning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111880549A true CN111880549A (en) | 2020-11-03 |
CN111880549B CN111880549B (en) | 2024-06-04 |
Family
ID=73199216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010962958.4A Active CN111880549B (en) | 2020-09-14 | 2020-09-14 | Deep reinforcement learning rewarding function optimization method for unmanned ship path planning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111880549B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011204036A (en) * | 2010-03-25 | 2011-10-13 | Institute Of National Colleges Of Technology Japan | Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program |
CN102967310A (en) * | 2012-12-12 | 2013-03-13 | 中国船舶重工集团公司第七0七研究所 | Seaway monitoring method based on electronic chart |
CN107133735A (en) * | 2017-04-28 | 2017-09-05 | 大鹏高科(武汉)智能装备有限公司 | A kind of unmanned boat resource share method and device |
CN110632931A (en) * | 2019-10-09 | 2019-12-31 | 哈尔滨工程大学 | Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment |
CN110716575A (en) * | 2019-09-29 | 2020-01-21 | 哈尔滨工程大学 | UUV real-time collision avoidance planning method based on deep double-Q network reinforcement learning |
CN110750096A (en) * | 2019-10-09 | 2020-02-04 | 哈尔滨工程大学 | Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment |
CN110989576A (en) * | 2019-11-14 | 2020-04-10 | 北京理工大学 | Target following and dynamic obstacle avoidance control method for differential slip steering vehicle |
CN111045445A (en) * | 2019-10-23 | 2020-04-21 | 浩亚信息科技有限公司 | Aircraft intelligent collision avoidance method, equipment and medium based on reinforcement learning |
CN111123963A (en) * | 2019-12-19 | 2020-05-08 | 南京航空航天大学 | Unknown environment autonomous navigation system and method based on reinforcement learning |
CN111157982A (en) * | 2019-11-20 | 2020-05-15 | 智慧航海(青岛)科技有限公司 | Intelligent ship and shore cooperative target tracking system and method based on shore-based radar |
CN111487977A (en) * | 2020-05-11 | 2020-08-04 | 大连海事大学 | Ship dynamic obstacle avoidance optimization method and system under uncertain environment |
CN111507501A (en) * | 2019-01-31 | 2020-08-07 | 斯特拉德视觉公司 | Method and device for executing personalized path planning through reinforcement learning |
Non-Patent Citations (2)
Title |
---|
SIYU GUO: "An Autonomous Path Planning Model for Unmanned Ships Based on Deep Reinforcement Learning", SENSORS, vol. 20, 11 January 2020 (2020-01-11), pages 1 - 35 * |
LIU Chang: "Training a multi-focus attention model using reinforcement learning", Acta Automatica Sinica, vol. 43, no. 9, 30 September 2017 (2017-09-30), pages 1563 - 1570 *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507520A (en) * | 2020-11-12 | 2021-03-16 | 深圳慧拓无限科技有限公司 | Path planning method and device based on reinforcement learning |
CN112462786A (en) * | 2020-12-03 | 2021-03-09 | 大连海事大学 | Unmanned ship collision avoidance method based on fuzzy control strategy double-window algorithm |
CN112462786B (en) * | 2020-12-03 | 2024-01-19 | 大连海事大学 | Unmanned ship collision prevention method based on fuzzy control strategy double-window algorithm |
CN112882469A (en) * | 2021-01-14 | 2021-06-01 | 浙江大学 | Deep reinforcement learning obstacle avoidance navigation method integrating global training |
CN112882469B (en) * | 2021-01-14 | 2022-04-08 | 浙江大学 | Deep reinforcement learning obstacle avoidance navigation method integrating global training |
CN114851184B (en) * | 2021-01-20 | 2023-05-09 | 广东技术师范大学 | Reinforced learning rewarding value calculating method for industrial robot |
CN114851184A (en) * | 2021-01-20 | 2022-08-05 | 广东技术师范大学 | Industrial robot-oriented reinforcement learning reward value calculation method |
CN112800545B (en) * | 2021-01-28 | 2022-06-24 | 中国地质大学(武汉) | Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN |
CN112800545A (en) * | 2021-01-28 | 2021-05-14 | 中国地质大学(武汉) | Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN |
CN113079517B (en) * | 2021-03-23 | 2022-07-05 | 中国人民解放军国防科技大学 | Mobile command console and control method thereof |
CN113079517A (en) * | 2021-03-23 | 2021-07-06 | 中国人民解放军国防科技大学 | Mobile command console and control method thereof |
CN113985876A (en) * | 2021-10-27 | 2022-01-28 | 广州大学 | Marine garbage recycling path planning method and system based on foraging of bionic fishes |
CN113985876B (en) * | 2021-10-27 | 2023-09-26 | 广州大学 | Ocean garbage recycling path planning method and system based on bionic fish foraging |
CN114531647A (en) * | 2022-01-13 | 2022-05-24 | 江苏大学 | Multi-layer self-adaptive reward target tracing method based on reinforcement learning under diffusion molecular communication environment |
CN114531647B (en) * | 2022-01-13 | 2024-03-22 | 江苏大学 | Multi-layer self-adaptive rewarding target tracing method in diffusion molecular communication environment |
CN115790608A (en) * | 2023-01-31 | 2023-03-14 | 天津大学 | AUV path planning algorithm and device based on reinforcement learning |
CN115790608B (en) * | 2023-01-31 | 2023-05-30 | 天津大学 | AUV path planning algorithm and device based on reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN111880549B (en) | 2024-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111880549B (en) | Deep reinforcement learning rewarding function optimization method for unmanned ship path planning | |
CN112179367B (en) | Intelligent autonomous navigation method based on deep reinforcement learning | |
CN108564202B (en) | Unmanned ship route optimization method based on environment forecast information | |
Sun et al. | Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning | |
CN111273670B (en) | Unmanned ship collision prevention method for fast moving obstacle | |
Ouahouah et al. | Deep-reinforcement-learning-based collision avoidance in uav environment | |
Xiaofei et al. | Global path planning algorithm based on double DQN for multi-tasks amphibious unmanned surface vehicle | |
Wang et al. | Cooperative collision avoidance for unmanned surface vehicles based on improved genetic algorithm | |
CN114089762B (en) | Water-air amphibious unmanned aircraft path planning method based on reinforcement learning | |
CN112015174A (en) | Multi-AGV motion planning method, device and system | |
CN110906935B (en) | Unmanned ship path planning method | |
Deraj et al. | Deep reinforcement learning based controller for ship navigation | |
CN112859864A (en) | Unmanned ship-oriented geometric path planning method | |
CN109765914A (en) | A kind of unmanned surface vehicle collision prevention method based on sliding window population | |
CN117590867B (en) | Underwater autonomous vehicle connection control method and system based on deep reinforcement learning | |
Gao et al. | An optimized path planning method for container ships in Bohai bay based on improved deep Q-learning | |
Masmitja et al. | A reinforcement learning path planning approach for range-only underwater target localization with autonomous vehicles | |
CN115107948B (en) | Efficient reinforcement learning autonomous ship collision prevention method | |
CN114609925B (en) | Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish | |
Jose et al. | Navigating the Ocean with DRL: Path following for marine vessels | |
CN114779764B (en) | Vehicle reinforcement learning movement planning method based on driving risk analysis | |
CN116466701A (en) | Unmanned surface vessel energy consumption minimum track planning method | |
Cheng et al. | Trajectory optimization for ship navigation safety using genetic annealing algorithm | |
Meyer | On course towards model-free guidance: A self-learning approach to dynamic collision avoidance for autonomous surface vehicles | |
CN114167856A (en) | Service robot local path planning method based on artificial emotion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||