CN111880549B - Deep reinforcement learning reward function optimization method for unmanned ship path planning - Google Patents

Deep reinforcement learning reward function optimization method for unmanned ship path planning

Info

Publication number
CN111880549B
CN111880549B (application CN202010962958.4A)
Authority
CN
China
Prior art keywords
ship
obstacle
target point
giving
rewards
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010962958.4A
Other languages
Chinese (zh)
Other versions
CN111880549A (en
Inventor
曹志英
杜意权
张秀国
郭嗣彧
郑易松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN202010962958.4A priority Critical patent/CN111880549B/en
Publication of CN111880549A publication Critical patent/CN111880549A/en
Application granted granted Critical
Publication of CN111880549B publication Critical patent/CN111880549B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/0088Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/0206Control of position or course in two dimensions specially adapted to water vehicles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a deep reinforcement learning reward function optimization method for unmanned ship path planning, which comprises the following steps: S1, acquiring environment information; S2, obtaining the distance between the unmanned ship and each obstacle and the distance between the unmanned ship and the target point; S3, giving a corresponding reward value according to the number of times the ship has reached the target point; S4, judging whether the ship is in the reward domain, and giving a corresponding reward according to the reward-domain reward principle; S5, judging whether the unmanned ship has collided with an obstacle, and giving a corresponding penalty value; and S6, judging whether the ship is in the danger domain, giving a corresponding penalty according to the danger-domain penalty principle, and otherwise giving a reward according to the general-case reward principle. By adding a reward domain near the target point, adding a danger domain near each obstacle, and introducing a counting principle that scales the rewards or penalties obtained, the invention accelerates the convergence of the deep reinforcement learning algorithm and guides the ship to avoid obstacles and reach the target point more quickly.

Description

Deep reinforcement learning reward function optimization method for unmanned ship path planning
Technical Field
The invention relates to the technical field of path planning, in particular to a deep reinforcement learning reward function optimization method for unmanned ship path planning.
Background
At present, economic ties between countries are increasingly close and trade is increasingly frequent, and ships play an important role as a major means of transportation. However, as shipping density rises and navigation environments become more complex, marine safety accidents occur frequently. Data from recent years show that the main cause of marine accidents is ships colliding with obstacles that the crew failed to detect in time during a voyage. Moreover, in some situations it is unsuitable to send a manned ship to the work site, and the ship must navigate autonomously to cope with the complex and changeable harsh environment at sea; this requires the ship to have an autonomous obstacle avoidance capability. To ensure the safe operation of ships and avoid marine safety accidents, researchers at home and abroad have proposed many methods for ship obstacle avoidance.
Path planning is the most basic and important link in ship obstacle avoidance. Its goal is to establish an optimal path from the starting point to the target point that avoids all obstacles while meeting the navigation performance requirements of the ship.
Conventional unmanned ship path planning algorithms include the artificial potential field method (APF), the A* algorithm, evolutionary neural networks, and evolutionary algorithms. APF guides the unmanned ship toward the target waypoint by constructing a virtual attractive field and turns restricted navigation areas into virtual obstacle regions that generate a virtual repulsive field for avoiding static obstacles; however, when the repulsive and attractive forces balance, or when the repulsive force near the target point is very large, the ship stalls before the target and falls into a local optimum. The A* algorithm is an improved version of Dijkstra's algorithm: it searches the state space and evaluates each searched position to obtain the best position from which to continue searching until the target is reached. However, its path planning depends on a grid map, and the size and number of grid cells directly affect the computation speed and accuracy of the algorithm. Evolutionary neural networks mainly learn from sample data of historical ship collision avoidance operations, so that the unmanned ship can plan paths and avoid obstacles in static and dynamic obstacle environments; the accuracy of this approach depends heavily on the training samples, and a separate neural network model must be trained for each navigation area. Evolutionary algorithms include the ant colony algorithm and simulated annealing; for example, Vettor, R et al. use an optimized genetic algorithm that takes environmental information as the initial population and finally obtains a navigation route meeting the requirements, realizing autonomous path planning for unmanned ships in open water. Such algorithms are simple to implement, but they require many iterations, easily fall into local optima, and produce paths with too many corners to be usable in a continuous topological environment.
In recent years, with the development of deep reinforcement learning, ship path planning methods based on deep reinforcement learning have begun to attract many researchers. In these methods, a deep reinforcement learning algorithm is combined with the control actions of the unmanned ship, so that the ship continually tries actions in a simulation environment, outputs accurate navigation actions in real time, learns how to avoid obstacles and reach the target point, and finally yields an effective ship path planning model. When the model is used it does not need to be retrained; path planning is realized simply by deploying the trained model on the ship that needs path planning and feeding environmental information into it.
Unmanned ship path planning based on deep reinforcement learning requires randomly initializing the network parameters of the deep reinforcement learning algorithm and abstracting the obstacles in the real environment to build an experimental environment. When the path planning model is built, data such as the ship's position, heading, speed, distance to the target point, and distance to each obstacle are first initialized as input to determine the ship's initial state. The network then predicts the action the ship should execute based on the input state information; the ship executes the action to obtain the state at the next moment, and the reward function judges whether the action was good or bad, producing a reward or a penalty. Finally, the current ship state, the executed action, the obtained reward, and the next state are saved into a replay buffer for updating the network parameters of the deep reinforcement learning algorithm. This process is repeated until an effective ship path planning model is obtained. As the evaluation of an action's quality in this process, the reward function has a great influence on the deep reinforcement learning algorithm.
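A minimal sketch of this training loop is shown below, assuming hypothetical `env`, `agent`, and `reward_function` interfaces (`env.reset()`, `env.step()`, `agent.predict()`, and `agent.update()` are placeholder names, not names from the patent); it only illustrates how transitions flow into a replay buffer and drive the network update of a DQN-style agent.

```python
import random
from collections import deque

def train_path_planner(env, agent, reward_function, episodes=500,
                       buffer_size=10000, batch_size=64):
    """Collect (state, action, reward, next_state, done) transitions and update the agent."""
    replay_buffer = deque(maxlen=buffer_size)
    for _ in range(episodes):
        state = env.reset()                      # ship position, heading, speed, distances, ...
        done = False
        while not done:
            action = agent.predict(state)        # network predicts the maneuver to execute
            next_state = env.step(action)        # ship executes the action
            reward, done = reward_function(next_state)   # judge how good the action was
            replay_buffer.append((state, action, reward, next_state, done))
            if len(replay_buffer) >= batch_size:
                # Update network parameters from a random batch of stored experience.
                agent.update(random.sample(list(replay_buffer), batch_size))
            state = next_state
    return agent
```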
At present, most reward functions adopted in deep-reinforcement-learning-based unmanned ship path planning use only the distance between the ship and the target point as the reward. Such reward functions lead to slow convergence of the deep reinforcement learning algorithm, long training periods, and the risk of never learning an optimal path planning strategy.
Disclosure of Invention
To solve the problems of slow convergence and long training periods of traditional reward functions, the invention provides a deep reinforcement learning reward function optimization method for unmanned ship path planning. By adding a reward domain near the target point, adding a danger domain near each obstacle, and introducing a counting principle that scales the rewards or penalties obtained, the convergence speed of the model is increased and the ship avoids obstacles and reaches the target point more quickly.
The invention adopts the following technical means:
The deep reinforcement learning reward function optimization method for unmanned ship path planning comprises the following steps:
S1, acquiring environment information, preprocessing it, and converting the acquired environment information into information in the experimental environment;
S2, acquiring the distance between the unmanned ship and each obstacle and the distance between the unmanned ship and the target point, and judging from the latter whether the unmanned ship has reached the target point; if so, executing S3, otherwise executing S4;
S3, giving a reward value corresponding to the number of times the ship has reached the target point;
S4, judging whether the ship is in the reward domain; if so, giving a corresponding reward according to the reward-domain reward principle and executing S2, otherwise executing S5;
S5, judging whether the unmanned ship has collided with an obstacle; if so, giving a penalty value corresponding to the number of collisions between the ship and obstacles and executing S2, otherwise executing S6;
S6, judging whether the ship is in the danger domain; if so, giving a corresponding penalty according to the danger-domain penalty principle and executing S2; otherwise, giving a reward according to the general-case reward principle and executing S2.
Further, in S1, preprocessing the environment information includes:
reading the obstacle position data between the current position of the ship and the next turning point, and parsing it to obtain the coordinate information of each obstacle;
converting the coordinate information of the obstacle into first converted coordinates in a rectangular coordinate system through the Mercator projection;
performing unit conversion on the first converted coordinates to obtain second converted coordinates;
and normalizing the second converted coordinates so that the coordinate values lie in the [0,1] interval.
Further, in S3, giving a reward value corresponding to the number of times the ship has reached the target point includes: if the number of times the ship has reached the target point is greater than a threshold, giving the maximum reward; otherwise giving a reward weighted from a standard reward value, where the weighting coefficient is positively correlated with the number of times the ship has reached the target point.
Further, in S5, giving a penalty value corresponding to the number of times the ship has collided with obstacles includes: if the number of collisions is greater than a threshold, giving the maximum penalty; otherwise giving a penalty weighted from a standard penalty value, where the weighting coefficient is positively correlated with the number of collisions.
Further, in S4, giving a corresponding reward according to the reward-domain reward principle includes: giving a corresponding reward value according to the distance between the ship and the target point.
Further, in S6, giving a corresponding penalty according to the danger-domain penalty principle includes: giving a corresponding penalty value according to the distance between the ship and the nearest obstacle.
Further, giving a reward according to the general-case reward principle in S6 includes: giving a corresponding reward value according to the distance between the ship and the target point.
Compared with the prior art, the invention has the following advantages:
By adding a reward domain near the target point, adding a danger domain near each obstacle, and introducing a counting principle that scales the rewards or penalties obtained, the invention increases the convergence speed of the model so that the ship avoids obstacles and reaches the target point more quickly.
Based on the above advantages, the invention can be widely applied in the field of ship obstacle avoidance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the reward function optimization method of the invention.
FIG. 2 is a schematic diagram of example environment one.
FIG. 3 is a schematic diagram of example environment two.
FIG. 4 is a schematic diagram of the reward scores in example environment one.
FIG. 5 is a schematic diagram of the reward scores in example environment two.
FIG. 6 is a comparison chart of convergence counts in class I environments.
FIG. 7 is a comparison chart of convergence counts in class II environments.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in FIG. 1, the invention provides a deep reinforcement learning reward function optimization method for unmanned ship path planning, which comprises the following steps:
Step 1: convert the environment information into the experimental environment through data preprocessing. The data preprocessing method is as follows:
The environment information mainly includes the position, speed, and angular velocity of the ship and the positions of the obstacles. Before the algorithm runs, the obstacle data between the current position of the ship and the next turning point are read and parsed to obtain the coordinates of each obstacle. The obstacle coordinates are then converted by the Mercator projection into coordinates in a rectangular coordinate system, as shown in formula (1), where a is the semi-major axis of the earth, b is the semi-minor axis, a point on the earth has longitude θ ∈ (−π, +π) and latitude α ∈ (−π/2, +π/2), e is the first eccentricity of the earth's ellipsoid, and (x, y) are the final converted coordinates. After the coordinate conversion, the data, originally in meters, are converted into final coordinates in kilometers. Finally, a normalization operation is performed so that all data input to the algorithm lie between 0 and 1.
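Formula (1) itself is not reproduced in the text available here, so the sketch below uses the standard ellipsoidal Mercator projection, which matches the variables defined above (a, b, θ, α, e, (x, y)), as an assumed stand-in. The function names, the ellipsoid constants, and the min-max normalization into [0, 1] are illustrative choices, not details taken from the patent.

```python
import math

# Assumed WGS-84-like ellipsoid parameters, in meters.
A = 6378137.0                      # semi-major axis a
B = 6356752.3142                   # semi-minor axis b
E = math.sqrt(A**2 - B**2) / A     # first eccentricity e

def mercator(lon_rad, lat_rad):
    """Project longitude/latitude (in radians) to rectangular coordinates in meters."""
    x = A * lon_rad
    y = A * math.log(math.tan(math.pi / 4 + lat_rad / 2) *
                     ((1 - E * math.sin(lat_rad)) / (1 + E * math.sin(lat_rad))) ** (E / 2))
    return x, y

def preprocess(obstacles_lonlat):
    """Convert obstacle positions from lon/lat to kilometers and normalize into [0, 1]."""
    if not obstacles_lonlat:
        return []
    xy_km = [(x / 1000.0, y / 1000.0)
             for x, y in (mercator(lon, lat) for lon, lat in obstacles_lonlat)]
    xs, ys = [p[0] for p in xy_km], [p[1] for p in xy_km]
    span_x = (max(xs) - min(xs)) or 1.0   # avoid division by zero for a single point
    span_y = (max(ys) - min(ys)) or 1.0
    return [((x - min(xs)) / span_x, (y - min(ys)) / span_y) for x, y in xy_km]
```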
Step 2: initialize the variables required for the algorithm to run. The variables required by the method of the invention include:
goal_counter: the target counter;
obs_counter: the collision counter;
reward: the reward or penalty obtained at each step;
reward_standard_goal: the standard reward for reaching the target point;
max_reward_goal: the maximum reward for reaching the target point;
reward_standard_obs: the standard penalty for colliding with an obstacle;
max_reward_obs: the maximum penalty for colliding with an obstacle.
Step 3: calculate the distance between the ship and each obstacle and the distance to the target point. The distances between the unmanned ship and the obstacles are recorded as d_obs = [d1, d2, d3, ..., dn], where d1, d2, d3, ..., dn are the distances to the individual obstacles; the distance to the target point is recorded as d_goal. To prevent calculation errors, the boundary distance between the ship and the target point or obstacle is used instead of the center distance.
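The sketch below illustrates one way to compute such boundary distances, under the assumption that the ship, the obstacles, and the target point are modeled as circles with known centers and radii; this geometric model is an assumption made for the example, not a detail stated in the patent.

```python
import math

def boundary_distance(center_a, radius_a, center_b, radius_b):
    """Distance between the boundaries of two circular objects, clipped to zero on contact."""
    center_dist = math.hypot(center_a[0] - center_b[0], center_a[1] - center_b[1])
    return max(center_dist - radius_a - radius_b, 0.0)

def observe_distances(ship_pos, ship_radius, obstacles, goal_pos, goal_radius):
    """Return d_obs (boundary distances to each obstacle) and d_goal (distance to the target point)."""
    d_obs = [boundary_distance(ship_pos, ship_radius, center, radius)
             for center, radius in obstacles]
    d_goal = boundary_distance(ship_pos, ship_radius, goal_pos, goal_radius)
    return d_obs, d_goal
```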
Step 4: judge whether the ship has reached the target point; if so, go to Step 5, otherwise go to Step 6.
Step 5: increment the target counter by one and reset the collision counter to zero. Judge whether the target counter is greater than the threshold; if so, directly give the maximum reward according to formula (2); otherwise calculate the reward value according to formula (3), where ceil() is the round-up function. The current round then ends and the next round begins.
reward = max_reward_goal (2)
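Formula (3) is not reproduced in the text available here, so the sketch below only illustrates the counting principle described above: the reward grows with goal_counter up to the maximum. The weighting (the standard reward scaled by ceil() of the counter) and the constant values are assumed stand-ins for the patent's exact expression.

```python
import math

MAX_REWARD_GOAL = 100.0       # maximum reward for reaching the target point (assumed value)
REWARD_STANDARD_GOAL = 20.0   # standard reward for reaching the target point (assumed value)
GOAL_THRESHOLD = 5            # target-counter threshold (assumed value)

def arrival_reward(goal_counter):
    """Counting-based reward for reaching the target point (formulas (2)-(3); weighting assumed)."""
    if goal_counter > GOAL_THRESHOLD:
        return MAX_REWARD_GOAL                                    # formula (2)
    # Assumed stand-in for formula (3): the weight grows with the number of consecutive arrivals.
    return min(REWARD_STANDARD_GOAL * math.ceil(goal_counter / 2), MAX_REWARD_GOAL)
```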
Step 6: judge whether the ship is in the reward domain; if so, give the corresponding reward according to the reward-domain reward principle and return to Step 3; otherwise go to Step 7.
The reward-domain reward principle is as follows:
The reward domain means that the ship is near the target point but has not yet reached it. To help the ship reach the target point quickly and accelerate the convergence of the model, different rewards are given for different distances from the target point within the reward domain. At the same time, to prevent falling into a local optimum, the rewards in the reward domain must not be too dense, and there must be a gap between them and the reward for reaching the target point. The specific reward is given by formula (4), where L and D are the two thresholds of the reward domain range, and different rewards are given within the L range and the D range respectively.
Step 7: judge whether the ship has collided with an obstacle. If so, increment the collision counter by one and reset the target counter to zero, then judge whether the collision counter is greater than the threshold; if it is, directly give the maximum penalty according to formula (5), otherwise calculate the penalty value according to formula (6). The current round then ends and the next round begins. If there is no collision, go to Step 8.
reward = max_reward_obs (5)
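Mirroring the arrival reward above, the sketch below illustrates the collision-counting penalty (formulas (5)-(6)); since formula (6) is likewise not reproduced here, the weighting by obs_counter is an assumed stand-in, and representing penalties as negative rewards is an implementation choice made for the example.

```python
import math

MAX_REWARD_OBS = -100.0       # maximum collision penalty as a negative reward (assumed value)
REWARD_STANDARD_OBS = -20.0   # standard collision penalty (assumed value)
OBS_THRESHOLD = 5             # collision-counter threshold (assumed value)

def collision_penalty(obs_counter):
    """Counting-based penalty for colliding with an obstacle (formulas (5)-(6); weighting assumed)."""
    if obs_counter > OBS_THRESHOLD:
        return MAX_REWARD_OBS                                     # formula (5)
    # Assumed stand-in for formula (6): the penalty grows with the number of consecutive collisions.
    return max(REWARD_STANDARD_OBS * math.ceil(obs_counter / 2), MAX_REWARD_OBS)
```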
Step 8: judge whether the ship is in the danger domain. If so, give the corresponding penalty according to the danger-domain penalty principle and return to Step 3; if not, give a reward according to the general-case reward principle and return to Step 3.
The danger-domain penalty principle is as follows: the danger domain means that the ship is near an obstacle but has not collided with it. So that the ship quickly leaves the vicinity of the obstacle and moves toward the target point, the penalty must increase near the obstacle, and the penalty value is inversely proportional to the distance between the ship and the obstacle. Similarly, to avoid falling into a local optimum, the penalties in the danger domain should not be too dense, and there must be a gap between them and the penalty for hitting the obstacle itself. The specific penalty is calculated by formula (7), where min(d_obs) is the minimum distance between the ship and the obstacles, and α, β, and δ are thresholds of the danger domain range; different penalties are given within the α, β, and δ ranges respectively.
General-case reward principle: the general case is that the ship is neither near the target point nor near an obstacle. In this case, the rewards should be arranged as sparsely as possible to prevent the algorithm from falling into a local optimum. Therefore, in the general case, the negative of the distance between the ship and the target point is taken as the reward for each step, as shown in formula (8):
reward = −d_goal (8)
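The sketch below pulls Steps 6-8 together into a single per-step reward for a ship that has neither reached the target nor collided. Because formulas (4) and (7) are not reproduced in the text available here, the tier thresholds (D, L, α, β, δ) and tier values are assumed placeholders chosen only to show the structure: tiered rewards near the target point, tiered penalties near the closest obstacle, and the sparse reward −d_goal of formula (8) everywhere else.

```python
# Assumed tier thresholds and values; the patent's formulas (4) and (7) define the real ones.
D, L_THRESH = 10.0, 5.0              # reward-domain thresholds (outer, inner), normalized units
ALPHA, BETA, DELTA = 8.0, 5.0, 2.0   # danger-domain thresholds, from outer to inner

def step_reward(d_goal, d_obs):
    """Per-step reward covering Steps 6 and 8 (no arrival, no collision)."""
    # Step 6, reward domain: near the target point, tiered rewards (stand-in for formula (4)).
    if d_goal < D:
        return 10.0 if d_goal < L_THRESH else 5.0
    # Step 8, danger domain: near the closest obstacle, tiered penalties (stand-in for formula (7)).
    nearest = min(d_obs) if d_obs else float("inf")
    if nearest < ALPHA:
        if nearest < DELTA:
            return -10.0
        return -5.0 if nearest < BETA else -2.0
    # General case, formula (8): sparse reward, the negative of the distance to the target point.
    return -d_goal
```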
The reward scores and the convergence counts are compared below in different environments to show that the reward function of this patent converges faster than the conventional reward function.
(1) Comparison of reward scores in different environments
During the voyage of a ship, the whole route is divided into multiple waypoints, and path planning must plan the path from one waypoint to the next. The actual environment corresponding to the experimental simulation is 60 nautical miles × 60 nautical miles, and the number of obstacles between one waypoint and the next is set to at most 7. Therefore, the experimental environment is set to 600 pixels × 600 pixels, and 1 to 7 obstacles are placed to simulate the actual environment. Because the algorithm converges more slowly as the number of obstacles grows, the environments are divided into class I and class II environments according to the number of obstacles, and convergence counts are collected for each class. Class I environments contain 1 to 4 obstacles, and class II environments contain 5 to 7 obstacles. The convergence effect of the method is illustrated below using two of these environments (example environment one, a class I environment, in FIG. 2; example environment two, a class II environment, in FIG. 3).
As shown in FIG. 2 and FIG. 3, there are 3 obstacles in example environment one and 5 obstacles in example environment two; the five-pointed star marks the target point, the ship icon marks the starting point, and the obstacles are drawn as mountains. In both environments, experiments were run with the Double DQN algorithm adopted by the method of this patent, using both the conventional reward function and the reward function of this patent. In deep reinforcement learning, the reward scores obtained over the training rounds are used to judge whether the algorithm has converged: if the obtained reward eventually becomes stable, the algorithm has converged. The reward score curves obtained in the two environments with the reward function of this patent and with the conventional reward function are shown in FIG. 4 and FIG. 5.
As shown in FIG. 4 and FIG. 5, the abscissa is the number of training rounds, the ordinate is the reward score obtained in each training round, New denotes the reward function of this patent, and Traditional denotes the conventional reward function. FIG. 4 shows that with the reward function of the invention the reward score is already basically stable by round 50 and the algorithm has essentially converged, whereas with the conventional reward function the reward score only begins to stabilize after round 100, the algorithm only then begins to converge, and further oscillation appears later, indicating that it has not yet fully converged. FIG. 5 shows that with the reward function of the invention the reward score keeps oscillating during the first 100 rounds, slowly stabilizes between rounds 100 and 150 despite one oscillation, and the algorithm converges after round 200; with the conventional reward function, the reward score oscillates sharply before round 150, the amplitude begins to decrease between rounds 150 and 250 as the algorithm starts to converge, and the algorithm is basically converged only after round 250.
The above comparison shows that, with the same algorithm, the reward function of this patent converges faster than the conventional reward function, and for the same number of rounds it achieves higher reward scores.
(2) Analysis of convergence counts
In the experiments, different environments were designed by changing the positions, number, and sizes of the obstacles, and each environment was tested with both the reward function of this patent and the conventional reward function. The algorithms used were DQN, Double DQN, and Dueling DQN; each algorithm was run 10 times in the simple (class I) environments and 10 times in the complex (class II) environments, and the resulting convergence counts are shown in FIG. 6 and FIG. 7.
In FIG. 6 and FIG. 7, the abscissa is the number of rounds at which the algorithm converged, the ordinate is the number of experiments, New denotes the reward function of this patent, and Traditional denotes the conventional reward function.
FIG. 6 shows the convergence in class I environments. Of the 30 experiments, with the reward function of this patent 18 runs converged within 100 rounds, 8 runs converged between 100 and 150 rounds, and only 4 runs converged between 150 and 250 rounds; with the conventional reward function, 6 runs converged within 100 rounds, 15 runs converged between 100 and 150 rounds, and 9 runs converged between 150 and 250 rounds. It can be seen that most runs using the reward function of this patent converge within 100 rounds, while most runs using the conventional reward function converge within 150 rounds.
FIG. 7 shows the convergence in class II environments. With the reward function of this patent, 14 runs converged within 150 rounds, 9 runs converged between 150 and 200 rounds, 5 runs converged between 200 and 250 rounds, and 2 runs converged beyond 300 rounds; with the conventional reward function, 7 runs converged within 150 rounds, 10 runs converged between 150 and 200 rounds, 8 runs converged between 200 and 250 rounds, and 5 runs converged beyond 300 rounds. In summary, most runs using the reward function of this patent converge within 200 rounds, while most runs using the conventional reward function converge within 250 rounds.
This comparison shows that, in environments with different numbers of obstacles and with different deep reinforcement learning algorithms, the reward function of this method converges faster than the conventional reward function.
Conventional reward functions all take the negative of the distance between the ship and the target point as the reward. However, such reward functions converge slowly and require long training periods, and training may be stopped, or the scheme abandoned, because of time constraints.
The invention designs the reward function using a counting principle: if the ship reaches the target point consecutively, the reward is doubled; if the ship collides with obstacles consecutively, the penalty is doubled; a reward domain is set around the target point and a danger domain is set around each obstacle. This design helps the algorithm converge faster, so that the ship avoids obstacles and reaches the target point more quickly.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit of the invention.

Claims (7)

1. A deep reinforcement learning reward function optimization method for unmanned ship path planning, characterized by comprising the following steps:
S1, acquiring environment information, preprocessing the environment information, and converting the acquired environment information into information in an experimental environment;
S2, acquiring the distance between the unmanned ship and the obstacle and the distance between the unmanned ship and the target point, judging whether the unmanned ship reaches the target point according to the distance between the unmanned ship and the target point, executing S3 if the unmanned ship reaches the target point, otherwise executing S4;
S3, giving a reward value corresponding to the number of times the ship has reached the target point, comprising: if the ship reaches the target point, incrementing the target counter by one and resetting the collision counter to zero, and judging whether the target counter is greater than a threshold; if so, directly giving the maximum reward, otherwise calculating the reward according to the following formula:
where ceil() is a round-up function, goal_counter is the target counter, and reward_standard_goal is the standard reward for reaching the target point;
S4, judging whether the ship is in a reward domain, wherein the reward domain means that the ship is near the target point but has not yet reached it; if the ship is in the reward domain, giving a corresponding reward according to the reward-domain reward principle and executing S2, otherwise executing S5, the reward-domain reward being calculated according to the following formula:
wherein L and D are thresholds of the reward domain range and d_goal is the distance to the target point;
S5, judging whether the unmanned ship has collided with an obstacle, giving a penalty value corresponding to the number of collisions between the ship and obstacles and executing S2, otherwise executing S6, comprising:
judging whether the ship has collided with an obstacle; if so, incrementing the collision counter by one and resetting the target counter to zero, and judging whether the collision counter is greater than a threshold; if so, directly giving the maximum penalty, otherwise calculating the penalty value according to the following formula:
wherein obs_counter is the collision counter and reward_standard_obs is the standard penalty for colliding with an obstacle;
S6, judging whether the ship is in a danger domain, wherein the danger domain means that the ship is near an obstacle but has not collided with it; if so, giving a corresponding penalty according to the danger-domain penalty principle and executing S2; otherwise giving a reward according to the general-case reward principle and executing S2, the danger-domain penalty being calculated according to the following formula:
where min(d_obs) is the minimum distance between the ship and the obstacles, and α, β, and δ are thresholds of the danger domain range.
2. The unmanned ship path planning-oriented deep reinforcement learning reward function optimization method according to claim 1, wherein the preprocessing of the environmental information in S1 comprises:
Reading obstacle position data between the current position of the ship and the next turning point, and analyzing the obstacle position data to obtain coordinate information of an obstacle;
Converting the coordinate information of the obstacle into a first conversion coordinate under a rectangular coordinate system through a mercator projection;
Performing unit conversion on the first conversion coordinate to obtain a second conversion coordinate;
And normalizing the second converted coordinates to enable the coordinate values to be in the [0,1] interval.
3. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein giving, in S3, a reward value corresponding to the number of times the ship has reached the target point comprises: if the number of times the ship has reached the target point is greater than a threshold, giving the maximum reward; otherwise giving a reward weighted from a standard reward value, wherein the weighting coefficient is positively correlated with the number of times the ship has reached the target point.
4. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein giving, in S5, a penalty value corresponding to the number of times the ship has collided with obstacles comprises: if the number of collisions is greater than a threshold, giving the maximum penalty; otherwise giving a penalty weighted from a standard penalty value, wherein the weighting coefficient is positively correlated with the number of collisions.
5. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein giving, in S4, a corresponding reward according to the reward-domain reward principle comprises: giving a corresponding reward value according to the distance between the ship and the target point.
6. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein giving, in S6, a corresponding penalty according to the danger-domain penalty principle comprises: giving a corresponding penalty value according to the distance between the ship and the nearest obstacle.
7. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein giving, in S6, a reward according to the general-case reward principle comprises: giving a corresponding reward value according to the distance between the ship and the target point.
CN202010962958.4A 2020-09-14 2020-09-14 Deep reinforcement learning reward function optimization method for unmanned ship path planning Active CN111880549B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010962958.4A CN111880549B (en) 2020-09-14 2020-09-14 Deep reinforcement learning reward function optimization method for unmanned ship path planning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010962958.4A CN111880549B (en) 2020-09-14 2020-09-14 Deep reinforcement learning reward function optimization method for unmanned ship path planning

Publications (2)

Publication Number Publication Date
CN111880549A CN111880549A (en) 2020-11-03
CN111880549B true CN111880549B (en) 2024-06-04

Family

ID=73199216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010962958.4A Active CN111880549B (en) 2020-09-14 2020-09-14 Deep reinforcement learning reward function optimization method for unmanned ship path planning

Country Status (1)

Country Link
CN (1) CN111880549B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507520A (en) * 2020-11-12 2021-03-16 深圳慧拓无限科技有限公司 Path planning method and device based on reinforcement learning
CN112462786B (en) * 2020-12-03 2024-01-19 大连海事大学 Unmanned ship collision prevention method based on fuzzy control strategy double-window algorithm
CN112882469B (en) * 2021-01-14 2022-04-08 浙江大学 Deep reinforcement learning obstacle avoidance navigation method integrating global training
CN114851184B (en) * 2021-01-20 2023-05-09 广东技术师范大学 Reinforced learning rewarding value calculating method for industrial robot
CN112800545B (en) * 2021-01-28 2022-06-24 中国地质大学(武汉) Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN
CN113079517B (en) * 2021-03-23 2022-07-05 中国人民解放军国防科技大学 Mobile command console and control method thereof
CN113985876B (en) * 2021-10-27 2023-09-26 广州大学 Ocean garbage recycling path planning method and system based on bionic fish foraging
CN114531647B (en) * 2022-01-13 2024-03-22 江苏大学 Multi-layer self-adaptive rewarding target tracing method in diffusion molecular communication environment
CN115790608B (en) * 2023-01-31 2023-05-30 天津大学 AUV path planning algorithm and device based on reinforcement learning

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011204036A (en) * 2010-03-25 2011-10-13 Institute Of National Colleges Of Technology Japan Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program
CN102967310A (en) * 2012-12-12 2013-03-13 中国船舶重工集团公司第七0七研究所 Seaway monitoring method based on electronic chart
CN107133735A (en) * 2017-04-28 2017-09-05 大鹏高科(武汉)智能装备有限公司 A kind of unmanned boat resource share method and device
CN111507501A (en) * 2019-01-31 2020-08-07 斯特拉德视觉公司 Method and device for executing personalized path planning through reinforcement learning
CN110716575A (en) * 2019-09-29 2020-01-21 哈尔滨工程大学 UUV real-time collision avoidance planning method based on deep double-Q network reinforcement learning
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110750096A (en) * 2019-10-09 2020-02-04 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN111045445A (en) * 2019-10-23 2020-04-21 浩亚信息科技有限公司 Aircraft intelligent collision avoidance method, equipment and medium based on reinforcement learning
CN110989576A (en) * 2019-11-14 2020-04-10 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
CN111157982A (en) * 2019-11-20 2020-05-15 智慧航海(青岛)科技有限公司 Intelligent ship and shore cooperative target tracking system and method based on shore-based radar
CN111123963A (en) * 2019-12-19 2020-05-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111487977A (en) * 2020-05-11 2020-08-04 大连海事大学 Ship dynamic obstacle avoidance optimization method and system under uncertain environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An Autonomous Path Planning Model for Unmanned Ships Based on Deep Reinforcement Learning; Siyu Guo; Sensors; 2020-01-11; Vol. 20; pp. 1-35 *
Training a Multi-Focus Attention Model Using Reinforcement Learning; Liu Chang; Acta Automatica Sinica; 2017-09-30; Vol. 43, No. 9; pp. 1563-1570 *

Also Published As

Publication number Publication date
CN111880549A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
CN111880549B (en) Deep reinforcement learning reward function optimization method for unmanned ship path planning
CN112179367B (en) Intelligent autonomous navigation method based on deep reinforcement learning
CN111780777B (en) Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
CN108564202B (en) Unmanned ship route optimization method based on environment forecast information
CN111273670B (en) Unmanned ship collision prevention method for fast moving obstacle
Ouahouah et al. Deep-reinforcement-learning-based collision avoidance in uav environment
Xiaofei et al. Global path planning algorithm based on double DQN for multi-tasks amphibious unmanned surface vehicle
CN109933067A (en) A kind of unmanned boat collision prevention method based on genetic algorithm and particle swarm algorithm
Wang et al. Cooperative collision avoidance for unmanned surface vehicles based on improved genetic algorithm
CN112180950B (en) Intelligent ship autonomous collision avoidance and path planning method based on reinforcement learning
CN110906935B (en) Unmanned ship path planning method
CN114625151A (en) Underwater robot obstacle avoidance path planning method based on reinforcement learning
Bakdi et al. Fullest COLREGs evaluation using fuzzy logic for collaborative decision-making analysis of autonomous ships in complex situations
Zheng et al. A Decision‐Making Method for Ship Collision Avoidance Based on Improved Cultural Particle Swarm
CN112859864A (en) Unmanned ship-oriented geometric path planning method
CN109765914A (en) A kind of unmanned surface vehicle collision prevention method based on sliding window population
Kim Vessel target prediction method and dead reckoning position based on SVR seaway model
Du et al. An optimized path planning method for coastal ships based on improved DDPG and DP
Gao et al. An optimized path planning method for container ships in Bohai bay based on improved deep Q-learning
Yang et al. Improved reinforcement learning for collision-free local path planning of dynamic obstacle
CN117590867A (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
Masmitja et al. A reinforcement learning path planning approach for range-only underwater target localization with autonomous vehicles
CN117034102A (en) Multi-scene navigation method based on intelligent scene classification
Jose et al. Navigating the Ocean with DRL: Path following for marine vessels
Keong et al. Reinforcement learning for autonomous aircraft avoidance

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant