CN111880549A - Unmanned ship path planning-oriented deep reinforcement learning reward function optimization method - Google Patents
- Publication number
- CN111880549A CN111880549A CN202010962958.4A CN202010962958A CN111880549A CN 111880549 A CN111880549 A CN 111880549A CN 202010962958 A CN202010962958 A CN 202010962958A CN 111880549 A CN111880549 A CN 111880549A
- Authority
- CN
- China
- Prior art keywords
- ship
- reward
- giving
- target point
- obstacle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G05D1/0088 — Control of position, course, altitude or attitude of land, water, air or space vehicles characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
- G05D1/0206 — Control of position or course in two dimensions specially adapted to water vehicles
- G06N3/045 — Combinations of networks
Abstract
The invention provides a deep reinforcement learning reward function optimization method for unmanned ship path planning, comprising the following steps: S1, acquiring environment information; S2, acquiring the distances between the unmanned ship and the obstacles and between the unmanned ship and the target point; S3, giving a reward value corresponding to the number of times the ship has reached the target point; S4, judging whether the ship is in the reward domain and, if so, giving the corresponding reward according to the reward-domain reward principle; S5, judging whether the unmanned ship has collided with an obstacle and, if so, giving the corresponding penalty value; and S6, judging whether the ship is in the danger domain and, if so, giving the corresponding penalty according to the danger-domain penalty principle, otherwise giving a reward according to the general-case reward principle. By adding a reward domain near the target point, adding a danger domain near each obstacle, and introducing a counting principle that scales the reward or penalty obtained, the method speeds up the convergence of the deep reinforcement learning algorithm and guides the ship to avoid obstacles and reach the target point more quickly.
Description
Technical Field
The invention relates to the technical field of path planning, in particular to a deep reinforcement learning reward function optimization method for unmanned ship path planning.
Background
At present, world trade binds national economies ever more tightly, and ships play an important role as a major means of transport. However, as shipping traffic density rises and navigation environments grow more complex, marine safety accidents occur frequently. Recent data show that a main cause of distress at sea is collision with obstacles that the crew failed to detect in time during the voyage. Moreover, in some situations it is unsuitable to send a crewed ship to the work site, and an unmanned ship must navigate autonomously through the complex and changeable conditions at sea, so the ship needs an autonomous obstacle avoidance capability. To ensure safe ship operation and avoid accidents at sea, scholars at home and abroad have proposed many methods for ship obstacle avoidance.
Path planning is the most basic and important link in ship obstacle avoidance: the aim is to establish an optimal path from the starting point to the target point that avoids all obstacles while satisfying the ship's navigation-performance requirements.
At present, traditional unmanned ship path planning algorithms include the Artificial Potential Field (APF) algorithm, the A* algorithm, evolutionary neural networks, evolutionary algorithms and the like. APF guides the unmanned ship toward the target waypoint by constructing a virtual attractive field, and converts navigation-restricted areas into virtual obstacle regions from which a virtual repulsive field is constructed to avoid static obstacles; however, when the repulsive and attractive forces balance, or the repulsive force near the target point is large, the ship can stall and become trapped in a local optimum. The A* algorithm is an improvement of Dijkstra's algorithm: it searches the state space, evaluates each candidate position, selects the best one, and continues searching from there until the target is reached. However, its path planning depends on a grid map, and the size and spacing of the grid cells directly affect the algorithm's speed and accuracy. Evolutionary neural networks realize path planning and obstacle avoidance of the unmanned ship in static and dynamic obstacle environments mainly by learning sample data from historical collision-avoidance manoeuvres; however, their accuracy depends heavily on the training samples, and a separate neural network model must be trained for each navigation area. Evolutionary algorithms include the ant colony algorithm, simulated annealing and the like; for example, Vettor, R. et al. fed environment information as the initial population of an optimized genetic algorithm, finally obtaining a navigation path that satisfied the requirements and realizing autonomous path planning for an unmanned ship in open water.
These algorithms are simple to implement, but they require many iterations, easily fall into local optima, and produce paths with too many corners to be used in a continuous topological environment.
In recent years, with the development of deep reinforcement learning, ship path planning methods based on deep reinforcement learning have also been studied by many scholars. These methods combine a deep reinforcement learning algorithm with the unmanned ship's control actions, letting the ship try repeatedly in a simulation environment, output accurate navigation actions in real time, and learn how to avoid obstacles and reach the target point, finally yielding an effective ship path planning model. Once obtained, the model needs no retraining: it is deployed on a ship that requires path planning, and environment information is fed into it to perform path planning.
The unmanned ship path planning based on deep reinforcement learning needs to randomly initialize network parameters of a deep reinforcement learning algorithm, abstract obstacles in a real environment and establish an experimental environment. When a path planning model is established, firstly, data such as the position, the course, the navigational speed, the distance between a ship and a target point, the distance between the ship and an obstacle and the like are initialized as input, and the initial state of the ship is determined. And then the network predicts the action to be executed by the ship according to the input ship state information, the ship executes the action to obtain the ship state at the next moment, and a reward or punishment is obtained by judging the action by a reward function. And finally, the state of the ship, the executed action, the obtained reward and the next state after the action is executed are stored in a buffer pool to be used for updating the network parameters of the deep reinforcement learning algorithm. And repeating the process until an effective ship path planning model is obtained. In the process, the rewarding function has great influence on the deep reinforcement learning algorithm as an evaluator for evaluating the performance of the action.
At present, most reward functions used in deep-reinforcement-learning-based unmanned ship path planning simply take the inverse of the distance between the ship and the target point as the reward. Such reward functions make the convergence of the deep reinforcement learning algorithm very slow and the training period very long, and the optimal path planning strategy may never be learned.
Disclosure of Invention
To solve the traditional reward function's problems of slow convergence and long training period, the invention provides a deep reinforcement learning reward function optimization method for unmanned ship path planning. By adding a reward domain near the target point, adding a danger domain near each obstacle, and introducing a counting principle that scales the reward or penalty obtained, the method speeds up model convergence so that the ship avoids obstacles and reaches the target point more quickly.
The technical means adopted by the invention are as follows:
The deep reinforcement learning reward function optimization method for unmanned ship path planning comprises the following steps:
S1, acquiring environment information, preprocessing it, and converting it into information in the experimental environment;
S2, acquiring the distances between the unmanned ship and the obstacles and between the unmanned ship and the target point, and judging from the latter whether the unmanned ship has reached the target point; if so, executing S3, otherwise executing S4;
S3, giving a reward value corresponding to the number of times the ship has reached the target point;
S4, judging whether the ship is in the reward domain; if so, giving the corresponding reward according to the reward-domain reward principle and executing S2, otherwise executing S5;
S5, judging whether the unmanned ship has collided with an obstacle; if so, giving a penalty value corresponding to the number of collisions between the ship and obstacles and executing S2, otherwise executing S6;
S6, judging whether the ship is in the danger domain; if so, giving the corresponding penalty according to the danger-domain penalty principle and executing S2; otherwise, giving a reward according to the general-case reward principle and executing S2.
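Steps S2-S6 above amount to one prioritized reward decision per time step. A minimal sketch of that decision flow follows; every threshold and reward magnitude here is invented for illustration, since the patent does not publish its numeric values:

```python
def reward_step(d_goal, d_obs, goal_count, obs_count,
                goal_radius=5.0, reward_zone=50.0, danger_zone=30.0):
    """One evaluation of the optimized reward function (S2-S6).

    d_goal: distance to the target point; d_obs: distances to each obstacle.
    All radii and magnitudes are illustrative assumptions.
    Returns (reward, episode_done).
    """
    if d_goal <= goal_radius:                       # S2/S3: target reached
        return 10.0 * min(goal_count + 1, 5), True  # counted reward, capped
    if d_goal <= reward_zone:                       # S4: inside reward domain
        return 1.0 / max(d_goal, 1e-6), False       # closer -> larger reward
    if min(d_obs) <= 0.0:                           # S5: collision
        return -10.0 * min(obs_count + 1, 5), True  # counted penalty, capped
    if min(d_obs) <= danger_zone:                   # S6: inside danger domain
        return -1.0 / max(min(d_obs), 1e-6), False  # nearer obstacle -> larger penalty
    return -0.001 * d_goal, False                   # general case: sparse shaping
```

The ordering mirrors the patent's S3 → S4 → S5 → S6 priority: terminal events (goal, collision) end the episode, while domain-based shaping rewards do not.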
Further, the preprocessing of the environment information in S1 includes:
reading obstacle position data between the ship's current position and the next steering point, and parsing it to obtain the coordinate information of each obstacle;
converting the obstacle coordinates into first converted coordinates in a rectangular coordinate system through the Mercator projection;
performing unit conversion on the first converted coordinates to obtain second converted coordinates;
and normalizing the second converted coordinates so that the coordinate values lie in the range [0, 1].
Further, in S3, giving a reward value corresponding to the number of times the ship has reached the target point includes: if that number is greater than the threshold, giving the maximum reward; otherwise giving a weighted multiple of the standard reward value, where the weighting coefficient is positively correlated with the number of times the ship has reached the target point.
Further, in S5, giving a penalty value corresponding to the number of collisions between the ship and obstacles includes: if that number is greater than the threshold, giving the maximum penalty; otherwise giving a weighted multiple of the standard penalty value, where the weighting coefficient is positively correlated with the number of collisions.
Further, in S4, giving the corresponding reward according to the reward-domain reward principle includes: giving a reward value corresponding to the distance between the ship and the target point.
Further, in S6, giving the corresponding penalty according to the danger-domain penalty principle includes: giving a penalty value corresponding to the distance between the ship and the nearest obstacle.
Further, in S6, giving a reward according to the general-case reward principle includes: giving a reward value corresponding to the distance between the ship and the target point.
Compared with the prior art, the invention has the following advantages:
By adding a reward domain near the target point, adding a danger domain near each obstacle, and introducing a counting principle that scales the reward or penalty obtained, the method speeds up model convergence so that the ship can quickly avoid obstacles and reach the target point.
Based on the advantages, the method can be widely popularized in the field of ship obstacle avoidance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a reward function optimization method of the present invention.
FIG. 2 is a schematic diagram of example environment one.
FIG. 3 is a schematic diagram of example environment two.
FIG. 4 is a graph of reward scores in example environment one.
FIG. 5 is a graph of reward scores in example environment two.
FIG. 6 is a graph comparing convergence counts in example environment one.
FIG. 7 is a graph comparing convergence counts in example environment two.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in fig. 1, the present invention provides a method for optimizing a deep reinforcement learning reward function for unmanned ship path planning, including:
step 1: and carrying out data preprocessing conversion on the environment information to obtain an experimental environment. The data preprocessing method comprises the following steps:
the environment information mainly comprises position information, speed size, angular speed, position size of an obstacle and other information of the ship, obstacle data between the ship and a next turning point is read according to the current position of the ship before the algorithm runs, data are analyzed to obtain coordinates and other information of the obstacle, the data information of the obstacle is subjected to coordinate conversion by using ink card support projection, and the coordinates are converted into coordinates under a rectangular coordinate system. After the coordinates are converted, the data are converted into final coordinates in units of meters and kilometers, as shown in formula (1). Wherein a is the earth long axis, b is the earth short axis, the longitude of a certain point on the earth is theta ∈ (-pi, + pi), and the latitude isThe first eccentricity of the earth ellipsoid and (x, y) the final transformed coordinates. Finally, normalization operation is carried out to ensure that the data transmitted into the algorithm are all [0,1]]In the meantime.
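Formula (1) itself is not reproduced in the text, but the described pipeline — Mercator projection into planar coordinates, then normalization into [0, 1] — can be sketched as below, using the standard ellipsoidal Mercator equations; the WGS-84 semi-axis values are our assumption, not the patent's:

```python
import math

def mercator(lon_deg, lat_deg, a=6378137.0, b=6356752.3142):
    """Ellipsoidal Mercator projection: (longitude, latitude) in degrees
    -> planar (x, y) in metres, with a, b the ellipsoid semi-axes."""
    e = math.sqrt(a * a - b * b) / a              # first eccentricity
    theta = math.radians(lon_deg)                 # longitude in (-pi, +pi)
    phi = math.radians(lat_deg)                   # latitude
    x = a * theta
    y = a * math.log(math.tan(math.pi / 4 + phi / 2)
                     * ((1 - e * math.sin(phi)) / (1 + e * math.sin(phi))) ** (e / 2))
    return x, y

def normalize(points):
    """Min-max normalise a list of (x, y) coordinates into [0, 1] per axis."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    def scale(v, lo, hi):
        return 0.0 if hi == lo else (v - lo) / (hi - lo)
    return [(scale(x, min(xs), max(xs)), scale(y, min(ys), max(ys)))
            for x, y in points]
```

Projecting the origin (0°, 0°) gives (0, 0), and `normalize` maps the extreme coordinates of the obstacle set to 0 and 1 as the text requires.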
Step 2: variables needed for the algorithm to run are initialized. The variables required for the process of the invention include:
the coarse _ counter is a target counter;
obs _ counter is a collision counter;
reward awards or penalties for each step;
the reward _ standard _ good is a standard reward for reaching a target point;
max _ rewarded _ good is the maximum reward for reaching the target point;
the reward _ standard _ obs is the standard penalty of colliding with the obstacle;
max _ rewarded _ obs is the maximum penalty for colliding with the obstacle.
Step 3: calculate the distances between the ship and the obstacles and the target point. The distances between the unmanned ship and the obstacles are recorded as d_obs = [d1, d2, d3, ..., dn], where d1, d2, d3, ..., dn are the distances to each obstacle; the distance to the target point is recorded as d_goal. To prevent calculation errors, the boundary distance between the ship and the target point or obstacle is used rather than the centre distance.
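The boundary (edge-to-edge) distance mentioned here can be computed by approximating the ship and each obstacle as circles; the circular approximation and all names below are our assumptions for illustration:

```python
import math

def boundary_distance(ship_xy, ship_radius, obstacle_xy, obstacle_radius):
    """Edge-to-edge distance between two circles approximating the ship
    and an obstacle, clamped at zero on contact or overlap."""
    cx = ship_xy[0] - obstacle_xy[0]
    cy = ship_xy[1] - obstacle_xy[1]
    centre = math.hypot(cx, cy)
    return max(centre - ship_radius - obstacle_radius, 0.0)

def d_obs(ship_xy, ship_radius, obstacles):
    """Vector of boundary distances to each obstacle, as in d_obs = [d1, ..., dn].
    obstacles: list of ((x, y), radius) pairs."""
    return [boundary_distance(ship_xy, ship_radius, xy, r) for xy, r in obstacles]
```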
Step 4: judge whether the ship has reached the target point; if so, go to step 5, otherwise go to step 6.
Step 5: increment the target counter by one and reset the collision counter to zero. Judge whether the target counter exceeds the threshold; if so, directly give the maximum reward, as shown in formula (2); otherwise, calculate the reward value according to formula (3), where ceil() is the rounding-up function. End the current round and start the next round.
reward=max_reward_goal (2)
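Formula (3) is not printed in the text; it is only described as a ceil-based weighting of the standard reward, positively correlated with the target counter and capped by formula (2). One sketch consistent with that description — the particular weighting, threshold, and magnitudes are our assumptions:

```python
import math

def goal_reward(goal_counter, reward_standard_goal=10.0,
                max_reward_goal=50.0, threshold=5):
    """Counted reward on reaching the target (formulas (2)-(3)).
    The ceil-based weight below is an assumed stand-in for formula (3)."""
    if goal_counter > threshold:
        return max_reward_goal                       # formula (2): maximum reward
    weight = math.ceil(goal_counter / 2) + 1         # grows with goal_counter
    return min(weight * reward_standard_goal, max_reward_goal)
```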
Step 6: judge whether the ship is in the reward domain; if so, give the corresponding reward according to the reward-domain reward principle and return to step 3, otherwise go to step 7.
The reward-domain reward principle is as follows:
The reward domain means the ship is near the target point but has not yet reached it. To help the ship reach the target point quickly, different rewards are given within the reward domain according to the distance from the target point, which accelerates model convergence. At the same time, to keep the ship from falling into a local optimum, rewards inside the reward domain must not be too dense, and they must remain clearly smaller than the reward for actually reaching the target point. The specific rewards are given by formula (4), where l and L are the two thresholds of the reward-domain range, and different rewards are given within l and within L respectively.
Step 7: judge whether the ship has collided with an obstacle. If so, increment the collision counter by one and reset the target counter to zero, then judge whether the collision counter exceeds the threshold: if it does, directly give the maximum penalty, calculated by formula (5); if not, calculate the penalty value according to formula (6). In either case, end the current round and start the next round. If there is no collision, go to step 8.
reward=max_reward_obs (5)
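Like formula (3), formula (6) is not printed; it is described as a weighted penalty positively correlated with the collision counter and capped by formula (5). A sketch under the same assumed weighting as the goal reward:

```python
import math

def collision_penalty(obs_counter, reward_standard_obs=-10.0,
                      max_reward_obs=-50.0, threshold=5):
    """Counted penalty on collision (formulas (5)-(6)).
    The ceil-based weight below is an assumed stand-in for formula (6)."""
    if obs_counter > threshold:
        return max_reward_obs                        # formula (5): maximum penalty
    weight = math.ceil(obs_counter / 2) + 1          # grows with obs_counter
    return max(weight * reward_standard_obs, max_reward_obs)
```

The symmetry with the goal reward is deliberate: repeated successes scale the reward up and repeated collisions scale the penalty up, which is the counting principle the patent claims.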
Step 8: judge whether the ship is in the danger domain; if so, give the corresponding penalty according to the danger-domain penalty principle and return to step 3; if not, give a reward according to the general-case reward principle and return to step 3.
The danger-domain penalty principle is as follows: the danger domain means the ship is near some obstacle but has not collided with it. To help the ship quickly leave the obstacle's vicinity and move toward the target point, a penalty is applied near the obstacle, inversely proportional to the ship-obstacle distance. As with the reward domain, to avoid local optima the penalties in the danger domain must not be too dense, and they must remain clearly smaller than the penalty for an actual collision. The specific penalty is calculated by formula (7), where min(d_obs) is the minimum distance between the ship and any obstacle, and α and β are the two thresholds of the danger-domain range; different penalties are given within α and within β respectively.
General-case reward principle: the general case means the ship is neither near the target point nor near an obstacle. Here the reward should be as sparse as possible to keep the algorithm from falling into a local optimum, so in general the inverse of the distance between the ship and the target point is used as the per-step reward; the specific calculation is shown in formula (8), where inverse() denotes taking the inverse.
reward=inverse(d_goal) (8)
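Formula (7) is not printed, and formula (8) is given only as reward = inverse(d_goal). A sketch of both shaping terms follows; the two-band penalty structure, the α/β values, and the reading of inverse() as the reciprocal are our assumptions — the text states only that the penalty grows as the nearest obstacle gets closer and that rewards elsewhere stay sparse:

```python
def shaping_reward(d_goal, d_obs, alpha=10.0, beta=30.0):
    """Danger-domain penalty (formula (7)) and general-case reward
    (formula (8)), with assumed band thresholds alpha < beta."""
    nearest = min(d_obs)
    if nearest <= alpha:                 # inner danger band: strong penalty
        return -1.0 / max(nearest, 1e-6)
    if nearest <= beta:                  # outer danger band: weaker penalty
        return -0.5 / nearest
    return 1.0 / max(d_goal, 1e-6)       # general case: inverse(d_goal)
```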
A comparative analysis of reward scores and convergence counts in different environments shows that the reward function of this patent converges faster than the traditional reward function.
(1) Comparative analysis of reward scores in different environments
During navigation, the whole route is divided into several waypoints, and path planning must plan the path from one waypoint to the next. The actual environment between waypoints measures 60 nautical miles × 60 nautical miles and contains at most 7 obstacles, so in the experiments the environment is set to 600 pixels × 600 pixels with 1-7 obstacles to simulate the actual environment. Because more obstacles make the algorithm converge more slowly, the environments are divided by obstacle count when tallying convergence: class I environments contain 1-4 obstacles and class II environments contain 5-7. The convergence behaviour of the method is illustrated below with two example environments (fig. 2, a class I environment, and fig. 3, a class II environment).
As shown in figs. 2 and 3, example environment one contains 3 obstacles and example environment two contains 5; the five-pointed star is the target point, the ship is the starting point, and the obstacles are drawn as mountains. In both environments, experiments were conducted with the traditional reward function and with the DoubleDQN algorithm used by the method of this patent. In deep reinforcement learning, the reward score obtained in each training round is used to judge whether the algorithm has converged: if the obtained reward eventually levels off, the algorithm has converged. The reward scores obtained in the two environments with this patent's reward function and with the traditional reward function are shown in figs. 4 and 5.
As shown in figs. 4 and 5, the abscissa is the number of training rounds, the ordinate is the reward score per round, "New" denotes the reward function of this patent, and "Traditional" the conventional one. Fig. 4 shows that with the reward function of the invention the reward score is essentially stable, and the algorithm essentially converged, by round 50; with the traditional reward function the score only begins to stabilize after round 100, and oscillations still appear later, indicating that the algorithm never fully converged. Fig. 5 shows that with the invention the reward score oscillates over the first 100 rounds, slowly stabilizes over rounds 100-150 as the algorithm begins to converge, and is essentially converged after round 200. With the traditional reward function the score oscillates strongly before round 150, the oscillation amplitude begins to shrink between rounds 150 and 250 as the algorithm tends toward convergence, and the algorithm essentially converges after round 250.
These comparisons show that, with the same algorithm, the reward function of this patent converges faster than the traditional reward function, and at the same round count it achieves a higher reward score.
(2) Convergence count analysis
Different environments were designed by varying the positions, number and sizes of the obstacles, and in each environment experiments were run with both the reward function of the invention and the traditional reward function. The algorithms used were DQN, DoubleDQN and Dueling DQN; each algorithm was run 10 times in the simple and complex environments respectively, and the convergence counts obtained are shown in figs. 6 and 7.
In figs. 6 and 7, the abscissa is the round at which the algorithm converged, the ordinate is the number of runs, "New" denotes the reward function of this patent, and "Traditional" the conventional one.
Fig. 6 shows convergence in class I environments. Of the 30 runs with this patent's reward function, 18 converged within 100 rounds, 8 between rounds 100 and 150, and 4 between rounds 150 and 250; with the traditional reward function, 6 converged within 100 rounds, 15 between rounds 100 and 150, and 9 between rounds 150 and 250. Most runs with this patent's reward function thus converge within 100 rounds, whereas most runs with the traditional one converge around 150 rounds.
Fig. 7 shows convergence in class II environments. With this patent's reward function, 14 runs converged within 150 rounds, 9 between rounds 150 and 200, 5 between rounds 200 and 250, and 2 beyond round 300; with the traditional reward function, 7 converged within 150 rounds, 10 between rounds 150 and 200, 8 between rounds 200 and 250, and 5 beyond round 300. In summary, most runs with this patent's reward function converge within 200 rounds, whereas with the traditional reward function most converge within 250 rounds.
The comparison shows that, in environments with different numbers of obstacles and with different deep reinforcement learning algorithms, the reward function of this method converges faster than the traditional reward function.
The traditional reward function takes the distance between the ship and the target point as the reward. However, such a reward function converges slowly and requires a long training period, which may force training to be terminated, or the scheme to be abandoned, for lack of time.
The invention designs the reward function using a counting principle: if the ship reaches the target point in consecutive episodes, the reward is doubled; if the ship collides with an obstacle in consecutive episodes, the penalty is doubled. In addition, a reward domain is set around the target point and a danger domain is set around each obstacle. This design helps the algorithm converge faster, so that the ship learns more quickly to avoid obstacles and reach the target point.
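The counting principle described above can be sketched as follows. The standard reward/penalty magnitudes, the caps, and the class and method names are illustrative assumptions, not values taken from the patent:

```python
# Hypothetical sketch of the counting-principle reward: the reward doubles
# on consecutive arrivals at the target point and the penalty doubles on
# consecutive collisions, each capped at an assumed maximum.

STANDARD_REWARD = 10.0
STANDARD_PENALTY = -10.0
MAX_REWARD = 100.0
MAX_PENALTY = -100.0

class CountingReward:
    def __init__(self):
        self.goal_streak = 0    # consecutive arrivals at the target point
        self.crash_streak = 0   # consecutive collisions with an obstacle

    def on_goal(self):
        """Reward for reaching the target: doubles with each consecutive arrival."""
        self.goal_streak += 1
        self.crash_streak = 0   # a success breaks the collision streak
        return min(STANDARD_REWARD * 2 ** (self.goal_streak - 1), MAX_REWARD)

    def on_collision(self):
        """Penalty for a collision: doubles with each consecutive collision."""
        self.crash_streak += 1
        self.goal_streak = 0    # a collision breaks the arrival streak
        return max(STANDARD_PENALTY * 2 ** (self.crash_streak - 1), MAX_PENALTY)
```

Capping both series keeps the reward signal bounded, which matters for the stability of the Q-value targets in DQN-family algorithms.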
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (7)
1. A deep reinforcement learning reward function optimization method for unmanned ship path planning is characterized by comprising the following steps:
s1, acquiring environment information, preprocessing the environment information, and converting the acquired environment information into information in an experimental environment;
s2, acquiring the distance between the unmanned ship and the obstacle and the distance between the unmanned ship and a target point, judging whether the unmanned ship reaches the target point or not according to the distance between the unmanned ship and the target point, if so, executing S3, otherwise, executing S4;
s3, according to the times of the ship reaching the target point, giving a reward value corresponding to the times of the ship reaching the target point;
s4, judging whether the ship is in the reward domain, if so, giving a corresponding reward according to the reward domain reward principle and executing S2, otherwise executing S5;
s5, judging whether the unmanned ship collides with the obstacle, if so, giving a penalty value corresponding to the number of times the ship has collided with the obstacle and executing S2, otherwise executing S6;
s6, judging whether the ship is in a danger domain, if so, giving corresponding punishment according to a danger domain punishment principle, and executing S2; otherwise, awarding the bonus according to the general bonus principle and executing S2.
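The branching logic of steps S2 to S6 can be sketched as below; the radii, the helper function, and the returned labels are illustrative assumptions and are not part of the claim:

```python
# Hypothetical per-step decision flow of claim 1 (S2-S6): which reward
# principle applies is decided from the distances to the target point and
# to the nearest obstacle. All radii are assumed, normalized-coordinate values.

import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def step_reward(ship, target, obstacles,
                goal_radius=0.02, bonus_radius=0.1,
                crash_radius=0.01, danger_radius=0.05):
    d_target = dist(ship, target)
    d_obstacle = min(dist(ship, o) for o in obstacles)
    if d_target <= goal_radius:        # S3: target point reached
        return "goal"
    if d_target <= bonus_radius:       # S4: inside the reward domain
        return "reward_domain"
    if d_obstacle <= crash_radius:     # S5: collision with an obstacle
        return "collision"
    if d_obstacle <= danger_radius:    # S6: inside the danger domain
        return "danger_domain"
    return "general"                   # S6: general-case reward principle
```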
2. The unmanned ship path planning-oriented deep reinforcement learning reward function optimization method of claim 1, wherein the preprocessing of the environment information in S1 includes:
reading obstacle position data between the current position of the ship and the next steering point, and analyzing the obstacle position data to obtain coordinate information of an obstacle;
converting the coordinate information of the barrier into a first conversion coordinate under a rectangular coordinate system through mercator projection;
performing unit conversion on the first conversion coordinate to obtain a second conversion coordinate;
and carrying out a normalization operation on the second conversion coordinate so that the coordinate values lie in the range [0,1].
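The preprocessing chain of claim 2 can be sketched as follows, assuming a spherical Mercator projection and min-max normalization; the Earth-radius constant and the function names are assumptions, and the unit-conversion step of the claim is folded into the projection here:

```python
# Hypothetical sketch of claim 2: project obstacle longitude/latitude to
# rectangular coordinates via (spherical) Mercator, then min-max normalize
# the projected coordinates into [0, 1].

import math

EARTH_RADIUS_M = 6378137.0  # WGS-84 semi-major axis (assumed)

def mercator(lon_deg, lat_deg):
    """Spherical Mercator projection to metres (the 'first conversion coordinate')."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def normalize(points):
    """Min-max normalize projected coordinates so every value lies in [0, 1]."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [((x - min(xs)) / ((max(xs) - min(xs)) or 1.0),
             (y - min(ys)) / ((max(ys) - min(ys)) or 1.0))
            for x, y in points]
```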
3. The unmanned ship path planning-oriented deep reinforcement learning reward function optimization method of claim 1, wherein in step S3, according to the number of times that a ship reaches a target point, a reward value corresponding to the number of times that the ship reaches the target point is given, and the method comprises: if the number of times that the ship reaches the target point is larger than the threshold value, giving the maximum reward, otherwise giving the weighted reward of the standard reward value, wherein the weighting coefficient is positively correlated with the number of times that the ship reaches the target point.
4. The unmanned ship path planning-oriented deep reinforcement learning reward function optimization method of claim 1, wherein in step S5, a penalty value corresponding to the number of collisions between a ship and an obstacle is given according to the number of collisions between the ship and the obstacle, and the method comprises: and if the times of the ship colliding with the obstacle are larger than the threshold value, giving a maximum penalty, otherwise giving a weighted penalty of a standard penalty value, wherein the weighting coefficient is positively correlated with the times of colliding with the obstacle.
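One plausible reading of claims 3 and 4 is sketched below. The claims require only that the weighting coefficient grow with the count and that the value saturate past a threshold; the linear weighting, the threshold of 5, and the magnitudes are assumptions:

```python
# Hypothetical count-weighted reward (claim 3) and penalty (claim 4):
# linear weighting, capped at a maximum once the count exceeds a threshold.

def arrival_reward(n_arrivals, standard=10.0, maximum=100.0, threshold=5):
    if n_arrivals > threshold:
        return maximum                 # maximum reward past the threshold
    return standard * n_arrivals       # weight positively correlated with count

def collision_penalty(n_collisions, standard=-10.0, minimum=-100.0, threshold=5):
    if n_collisions > threshold:
        return minimum                 # maximum penalty past the threshold
    return standard * n_collisions
```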
5. The optimization method of the deep reinforcement learning reward function oriented to unmanned ship path planning of claim 1, wherein the step S4 of awarding corresponding rewards according to the reward domain reward principle comprises: and giving a corresponding reward value according to the distance between the ship and the target point.
6. The optimization method of the deep reinforcement learning reward function oriented to unmanned ship path planning of claim 1, wherein the giving of the corresponding penalty in S6 according to the risk domain penalty principle comprises: corresponding penalty values are given according to the distance of the ship from the nearest obstacle.
7. The optimization method of the deep reinforcement learning reward function oriented to unmanned ship path planning of claim 1, wherein the step of giving the reward in S6 according to the general condition reward principle comprises: and giving a corresponding reward value according to the distance between the ship and the target point.
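The distance-shaped rewards of claims 5 to 7 can be sketched as follows; the linear shaping, the domain radii and the scale factor are assumptions, as the claims state only that the value depends on the distance to the target point or to the nearest obstacle:

```python
# Hypothetical distance-shaped rewards: inside the reward domain (claims 5, 7)
# the reward grows as the ship closes on the target point; inside the danger
# domain (claim 6) the penalty grows as the ship nears the closest obstacle.

def domain_reward(d_target, bonus_radius=0.1, scale=10.0):
    """Reward in the reward domain, largest at the target point itself."""
    return scale * (bonus_radius - d_target) / bonus_radius

def danger_penalty(d_obstacle, danger_radius=0.05, scale=10.0):
    """Penalty in the danger domain, largest at the obstacle boundary."""
    return -scale * (danger_radius - d_obstacle) / danger_radius
```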
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010962958.4A CN111880549B (en) | 2020-09-14 | 2020-09-14 | Deep reinforcement learning rewarding function optimization method for unmanned ship path planning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111880549A true CN111880549A (en) | 2020-11-03 |
CN111880549B CN111880549B (en) | 2024-06-04 |
Family
ID=73199216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010962958.4A Active CN111880549B (en) | 2020-09-14 | 2020-09-14 | Deep reinforcement learning rewarding function optimization method for unmanned ship path planning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111880549B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011204036A (en) * | 2010-03-25 | 2011-10-13 | Institute Of National Colleges Of Technology Japan | Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program |
CN102967310A (en) * | 2012-12-12 | 2013-03-13 | 中国船舶重工集团公司第七0七研究所 | Seaway monitoring method based on electronic chart |
CN107133735A (en) * | 2017-04-28 | 2017-09-05 | 大鹏高科(武汉)智能装备有限公司 | A kind of unmanned boat resource share method and device |
CN110632931A (en) * | 2019-10-09 | 2019-12-31 | 哈尔滨工程大学 | Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment |
CN110716575A (en) * | 2019-09-29 | 2020-01-21 | 哈尔滨工程大学 | UUV real-time collision avoidance planning method based on deep double-Q network reinforcement learning |
CN110750096A (en) * | 2019-10-09 | 2020-02-04 | 哈尔滨工程大学 | Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment |
CN110989576A (en) * | 2019-11-14 | 2020-04-10 | 北京理工大学 | Target following and dynamic obstacle avoidance control method for differential slip steering vehicle |
CN111045445A (en) * | 2019-10-23 | 2020-04-21 | 浩亚信息科技有限公司 | Aircraft intelligent collision avoidance method, equipment and medium based on reinforcement learning |
CN111123963A (en) * | 2019-12-19 | 2020-05-08 | 南京航空航天大学 | Unknown environment autonomous navigation system and method based on reinforcement learning |
CN111157982A (en) * | 2019-11-20 | 2020-05-15 | 智慧航海(青岛)科技有限公司 | Intelligent ship and shore cooperative target tracking system and method based on shore-based radar |
CN111487977A (en) * | 2020-05-11 | 2020-08-04 | 大连海事大学 | Ship dynamic obstacle avoidance optimization method and system under uncertain environment |
CN111507501A (en) * | 2019-01-31 | 2020-08-07 | 斯特拉德视觉公司 | Method and device for executing personalized path planning through reinforcement learning |
Non-Patent Citations (2)
Title |
---|
SIYU GUO: "An Autonomous Path Planning Model for Unmanned Ships Based on Deep Reinforcement Learning", SENSORS, vol. 20, 11 January 2020 (2020-01-11), pages 1 - 35 * |
LIU Chang: "Training a multi-focus attention model using reinforcement learning", Acta Automatica Sinica, vol. 43, no. 9, 30 September 2017 (2017-09-30), pages 1563 - 1570 *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507520A (en) * | 2020-11-12 | 2021-03-16 | 深圳慧拓无限科技有限公司 | Path planning method and device based on reinforcement learning |
CN112462786A (en) * | 2020-12-03 | 2021-03-09 | 大连海事大学 | Unmanned ship collision avoidance method based on fuzzy control strategy double-window algorithm |
CN112462786B (en) * | 2020-12-03 | 2024-01-19 | 大连海事大学 | Unmanned ship collision prevention method based on fuzzy control strategy double-window algorithm |
CN112882469A (en) * | 2021-01-14 | 2021-06-01 | 浙江大学 | Deep reinforcement learning obstacle avoidance navigation method integrating global training |
CN112882469B (en) * | 2021-01-14 | 2022-04-08 | 浙江大学 | Deep reinforcement learning obstacle avoidance navigation method integrating global training |
CN114851184B (en) * | 2021-01-20 | 2023-05-09 | 广东技术师范大学 | Reinforced learning rewarding value calculating method for industrial robot |
CN114851184A (en) * | 2021-01-20 | 2022-08-05 | 广东技术师范大学 | Industrial robot-oriented reinforcement learning reward value calculation method |
CN112800545B (en) * | 2021-01-28 | 2022-06-24 | 中国地质大学(武汉) | Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN |
CN112800545A (en) * | 2021-01-28 | 2021-05-14 | 中国地质大学(武汉) | Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN |
CN113079517B (en) * | 2021-03-23 | 2022-07-05 | 中国人民解放军国防科技大学 | Mobile command console and control method thereof |
CN113079517A (en) * | 2021-03-23 | 2021-07-06 | 中国人民解放军国防科技大学 | Mobile command console and control method thereof |
CN113985876A (en) * | 2021-10-27 | 2022-01-28 | 广州大学 | Marine garbage recycling path planning method and system based on foraging of bionic fishes |
CN113985876B (en) * | 2021-10-27 | 2023-09-26 | 广州大学 | Ocean garbage recycling path planning method and system based on bionic fish foraging |
CN114531647A (en) * | 2022-01-13 | 2022-05-24 | 江苏大学 | Multi-layer self-adaptive reward target tracing method based on reinforcement learning under diffusion molecular communication environment |
CN114531647B (en) * | 2022-01-13 | 2024-03-22 | 江苏大学 | Multi-layer self-adaptive rewarding target tracing method in diffusion molecular communication environment |
CN115790608A (en) * | 2023-01-31 | 2023-03-14 | 天津大学 | AUV path planning algorithm and device based on reinforcement learning |
CN115790608B (en) * | 2023-01-31 | 2023-05-30 | 天津大学 | AUV path planning algorithm and device based on reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN111880549B (en) | 2024-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111880549B (en) | Deep reinforcement learning rewarding function optimization method for unmanned ship path planning | |
CN112179367B (en) | Intelligent autonomous navigation method based on deep reinforcement learning | |
CN108564202B (en) | Unmanned ship route optimization method based on environment forecast information | |
Sun et al. | Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning | |
CN111273670B (en) | Unmanned ship collision prevention method for fast moving obstacle | |
Ouahouah et al. | Deep-reinforcement-learning-based collision avoidance in uav environment | |
Xiaofei et al. | Global path planning algorithm based on double DQN for multi-tasks amphibious unmanned surface vehicle | |
Wang et al. | Cooperative collision avoidance for unmanned surface vehicles based on improved genetic algorithm | |
CN114089762B (en) | Water-air amphibious unmanned aircraft path planning method based on reinforcement learning | |
CN112015174A (en) | Multi-AGV motion planning method, device and system | |
CN110906935B (en) | Unmanned ship path planning method | |
Deraj et al. | Deep reinforcement learning based controller for ship navigation | |
CN112859864A (en) | Unmanned ship-oriented geometric path planning method | |
CN109765914A (en) | A kind of unmanned surface vehicle collision prevention method based on sliding window population | |
CN117590867B (en) | Underwater autonomous vehicle connection control method and system based on deep reinforcement learning | |
Gao et al. | An optimized path planning method for container ships in Bohai bay based on improved deep Q-learning | |
Masmitja et al. | A reinforcement learning path planning approach for range-only underwater target localization with autonomous vehicles | |
CN115107948B (en) | Efficient reinforcement learning autonomous ship collision prevention method | |
CN114609925B (en) | Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish | |
Jose et al. | Navigating the Ocean with DRL: Path following for marine vessels | |
CN114779764B (en) | Vehicle reinforcement learning movement planning method based on driving risk analysis | |
CN116466701A (en) | Unmanned surface vessel energy consumption minimum track planning method | |
Cheng et al. | Trajectory optimization for ship navigation safety using genetic annealing algorithm | |
Meyer | On course towards model-free guidance: A self-learning approach to dynamic collision avoidance for autonomous surface vehicles | |
CN114167856A (en) | Service robot local path planning method based on artificial emotion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||