CN111880549A - Unmanned ship path planning-oriented deep reinforcement learning reward function optimization method - Google Patents

Unmanned ship path planning-oriented deep reinforcement learning reward function optimization method

Info

Publication number
CN111880549A
Authority
CN
China
Prior art keywords
ship
reward
giving
target point
obstacle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010962958.4A
Other languages
Chinese (zh)
Other versions
CN111880549B (en)
Inventor
曹志英
杜意权
张秀国
郭嗣彧
郑易松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Maritime University filed Critical Dalian Maritime University
Priority to CN202010962958.4A priority Critical patent/CN111880549B/en
Publication of CN111880549A publication Critical patent/CN111880549A/en
Application granted granted Critical
Publication of CN111880549B publication Critical patent/CN111880549B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/0088 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/0206 Control of position or course in two dimensions specially adapted to water vehicles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a deep reinforcement learning reward function optimization method for unmanned ship path planning, which comprises the following steps: S1, acquiring environment information; S2, acquiring the distance between the unmanned ship and each obstacle and the distance between the unmanned ship and the target point; S3, giving a reward value corresponding to the number of times the ship has reached the target point; S4, judging whether the ship is in the reward domain and, if so, giving a reward according to the reward domain principle; S5, judging whether the unmanned ship has collided with an obstacle and, if so, giving a corresponding penalty value; S6, judging whether the ship is in the danger domain and, if so, giving a penalty according to the danger domain principle, otherwise giving a reward according to the general-case principle. By adding a reward domain near the target point, adding a danger domain near the obstacles, and introducing a counting principle to scale the reward or penalty obtained, the method accelerates the convergence of the deep reinforcement learning algorithm and guides the ship to avoid obstacles and reach the target point more quickly.

Description

Unmanned ship path planning-oriented deep reinforcement learning reward function optimization method
Technical Field
The invention relates to the technical field of path planning, in particular to a deep reinforcement learning reward function optimization method for unmanned ship path planning.
Background
At present, economic ties between countries around the world are becoming closer and trade is becoming more frequent, and ships play an important role as a major means of transport. However, as the density of ship traffic increases and the navigation environment becomes more complex, maritime safety accidents occur frequently. Recent data show that a major cause of maritime distress is collision with obstacles that the crew fails to detect in time during a voyage. Meanwhile, in some scenarios it is not suitable to send a crewed vessel to the work area to execute tasks, and an unmanned ship must navigate autonomously to cope with the complex, changeable and harsh environment at sea, so the ship needs an autonomous obstacle avoidance capability. In order to ensure the safety of ship operation and avoid maritime safety accidents, scholars at home and abroad have proposed many methods for ship obstacle avoidance.
Path planning is the most fundamental and important part of ship obstacle avoidance. Its aim is to establish an optimal path from a starting point to a target point that avoids all obstacles while satisfying the requirements of the ship's navigation performance.
At present, traditional unmanned ship path planning algorithms include the Artificial Potential Field (APF) algorithm, the A* algorithm, evolutionary neural networks, evolutionary algorithms and the like. The APF algorithm guides the unmanned ship towards the target waypoint by constructing a virtual attractive field, and converts restricted navigation areas into virtual obstacle areas, thereby constructing a virtual repulsive field to avoid static obstacles; however, when the repulsive and attractive forces are equal, or the repulsive force near the target point is large, the ship may stall and become trapped in a local optimum. The A* algorithm is an improvement of Dijkstra's algorithm: it searches the state space, evaluates each searched position to obtain the best one, and then continues searching from that position until the target is reached. However, its path planning depends on a grid map, and the size and number of grid cells directly affect the speed and accuracy of the algorithm. Evolutionary neural networks mainly realize path planning and obstacle avoidance of unmanned ships in static and dynamic obstacle environments by learning sample data of historical ship collision avoidance operations; however, their accuracy depends heavily on the training samples, and a separate neural network model needs to be trained for each navigation area. Evolutionary algorithms include the ant colony algorithm, the simulated annealing algorithm and the like; for example, Vettor, R. et al. take the environment information as the initial population of an optimized genetic algorithm and finally obtain a navigation path meeting the requirements, realizing autonomous path planning for unmanned ships in open waters. These algorithms are simple to implement, but they require many iterations, easily fall into local optima, produce paths with too many turns, and cannot be used in continuous topological environments.
In recent years, with the development of deep reinforcement learning, ship path planning methods based on deep reinforcement learning have also been studied by many scholars. Such methods combine a deep reinforcement learning algorithm with the control actions of the unmanned ship: the ship tries continuously in a simulation environment, outputs accurate navigation actions in real time, learns how to avoid obstacles and reach the target point, and finally an effective ship path planning model is obtained. When the model is used, no retraining is needed; the trained model is deployed on the ship that requires path planning, and path planning is achieved by feeding environment information into the model.
Unmanned ship path planning based on deep reinforcement learning requires randomly initializing the network parameters of the deep reinforcement learning algorithm, abstracting the obstacles in the real environment and establishing an experimental environment. When the path planning model is built, data such as the position, course and speed of the ship, the distance between the ship and the target point, and the distance between the ship and the obstacles are first initialized as input to determine the initial state of the ship. The network then predicts the action the ship should execute from the input state information; the ship executes the action to obtain its state at the next moment, and the reward function evaluates the action and returns a reward or penalty. Finally, the ship state, the executed action, the obtained reward and the next state after the action is executed are stored in a replay buffer and used to update the network parameters of the deep reinforcement learning algorithm. This process is repeated until an effective ship path planning model is obtained. Throughout this process, the reward function, as the evaluator of action quality, has a great influence on the deep reinforcement learning algorithm.
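To make the loop described above concrete, the following Python sketch shows how state-action-reward transitions can be collected into a replay buffer for a DQN-style agent; the ShipEnv environment, the predict_action stub and all numeric values are hypothetical placeholders, not the network or simulator disclosed in this patent.

```python
import random
from collections import deque

# Minimal sketch of the training loop described above (hypothetical placeholders,
# not the patent's actual implementation).

class ShipEnv:
    """Toy stand-in for the simulated navigation environment."""
    def reset(self):
        return [0.1, 0.2, 0.0, 0.9, 0.8]            # position, heading, speed, d_goal, d_obs (normalized)
    def step(self, action):
        next_state = [random.random() for _ in range(5)]
        reward = -next_state[3]                      # placeholder reward: negative distance to goal
        done = next_state[3] < 0.05
        return next_state, reward, done

def predict_action(state, n_actions=8):
    """Placeholder for the Q-network's epsilon-greedy action selection."""
    return random.randrange(n_actions)

def train(episodes=10, buffer_size=10000, batch_size=32):
    env = ShipEnv()
    replay_buffer = deque(maxlen=buffer_size)        # stores (s, a, r, s') transitions
    for episode in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = predict_action(state)           # network predicts the action to execute
            next_state, reward, done = env.step(action)   # reward function judges the action
            replay_buffer.append((state, action, reward, next_state))  # stored for parameter updates
            state = next_state
            if len(replay_buffer) >= batch_size:
                batch = random.sample(replay_buffer, batch_size)
                # ...gradient update of the Q-network on `batch` would go here...
    return replay_buffer

if __name__ == "__main__":
    train()
```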
At present, most reward functions used in deep-reinforcement-learning-based unmanned ship path planning simply take the inverse of the distance between the ship and the target point as the reward. Such a reward function makes the convergence rate of the deep reinforcement learning algorithm very low and the training period very long, and the optimal path planning strategy may never be learned.
Disclosure of Invention
In order to solve the problems of slow convergence and long training period of the traditional reward function, the invention provides a deep reinforcement learning reward function optimization method for unmanned ship path planning. By adding a reward domain near the target point, adding a danger domain near the obstacles, and introducing a counting principle to scale the reward or penalty obtained, the convergence speed of the model is increased and the ship can avoid obstacles and reach the target point more quickly.
The technical means adopted by the invention are as follows:
the depth reinforcement learning reward function optimization method for unmanned ship path planning comprises the following steps:
s1, acquiring environment information, preprocessing the environment information, and converting the acquired environment information into information in an experimental environment;
s2, acquiring the distance between the unmanned ship and the obstacle and the distance between the unmanned ship and a target point, judging whether the unmanned ship reaches the target point or not according to the distance between the unmanned ship and the target point, if so, executing S3, otherwise, executing S4;
s3, according to the times of the ship reaching the target point, giving a reward value corresponding to the times of the ship reaching the target point;
s4, judging whether the ship is in the bonus area, if so, giving corresponding bonus according to the bonus area bonus principle, and executing S2, otherwise, executing S5;
s5, judging whether the unmanned ship collides with the obstacle, giving a penalty value corresponding to the collision frequency of the ship and the obstacle according to the collision frequency of the ship and the obstacle, and executing a step S2, otherwise executing a step S6;
s6, judging whether the ship is in a danger domain, if so, giving corresponding punishment according to a danger domain punishment principle, and executing S2; otherwise, awarding the bonus according to the general bonus principle and executing S2.
Further, the preprocessing of the environment information in S1 includes:
reading obstacle position data between the current position of the ship and the next turning point, and parsing the data to obtain the coordinate information of each obstacle;
converting the obstacle coordinates into first converted coordinates in a rectangular coordinate system through the Mercator projection;
performing unit conversion on the first converted coordinates to obtain second converted coordinates;
and normalizing the second converted coordinates so that the coordinate values lie in the range [0, 1].
Further, in S3, giving a reward value corresponding to the number of times the ship has reached the target point includes: if the number of times the ship has reached the target point is larger than a threshold, giving the maximum reward; otherwise, giving a weighted version of the standard reward value, where the weighting coefficient is positively correlated with the number of times the ship has reached the target point.
Further, in S5, giving a penalty value corresponding to the number of collisions between the ship and obstacles includes: if the number of collisions is larger than a threshold, giving the maximum penalty; otherwise, giving a weighted version of the standard penalty value, where the weighting coefficient is positively correlated with the number of collisions.
Further, in S4, giving a corresponding reward according to the reward domain principle includes: giving a corresponding reward value according to the distance between the ship and the target point.
Further, in S6, giving a corresponding penalty according to the danger domain principle includes: giving a corresponding penalty value according to the distance between the ship and the nearest obstacle.
Further, in S6, giving a reward according to the general-case principle includes: giving a corresponding reward value according to the distance between the ship and the target point.
Compared with the prior art, the invention has the following advantages:
By adding a reward domain near the target point, adding a danger domain near the obstacles and introducing a counting principle to scale the reward or penalty obtained, the method increases the convergence speed of the model and enables the ship to avoid obstacles and reach the target point more quickly.
Based on the above advantages, the method can be widely applied in the field of ship obstacle avoidance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of the reward function optimization method of the present invention.
FIG. 2 is a schematic diagram of example environment one.
FIG. 3 is a schematic diagram of example environment two.
FIG. 4 shows the reward scores obtained in example environment one.
FIG. 5 shows the reward scores obtained in example environment two.
FIG. 6 is a comparison of convergence in class I environments.
FIG. 7 is a comparison of convergence in class II environments.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in fig. 1, the present invention provides a method for optimizing a deep reinforcement learning reward function for unmanned ship path planning, including:
step 1: and carrying out data preprocessing conversion on the environment information to obtain an experimental environment. The data preprocessing method comprises the following steps:
the environment information mainly comprises position information, speed size, angular speed, position size of an obstacle and other information of the ship, obstacle data between the ship and a next turning point is read according to the current position of the ship before the algorithm runs, data are analyzed to obtain coordinates and other information of the obstacle, the data information of the obstacle is subjected to coordinate conversion by using ink card support projection, and the coordinates are converted into coordinates under a rectangular coordinate system. After the coordinates are converted, the data are converted into final coordinates in units of meters and kilometers, as shown in formula (1). Wherein a is the earth long axis, b is the earth short axis, the longitude of a certain point on the earth is theta ∈ (-pi, + pi), and the latitude is
Figure BDA0002681182190000061
The first eccentricity of the earth ellipsoid and (x, y) the final transformed coordinates. Finally, normalization operation is carried out to ensure that the data transmitted into the algorithm are all [0,1]]In the meantime.
Figure BDA0002681182190000062
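The following Python sketch illustrates this preprocessing chain (Mercator projection, unit conversion and normalization) using the standard ellipsoidal Mercator formulas; the WGS-84 axis values, the reference origin and the normalization extent are illustrative assumptions, not parameters specified by the patent.

```python
import math

# Illustrative preprocessing sketch: Mercator projection + normalization.
# WGS-84 axes and the normalization extent are assumptions for illustration.
A_AXIS = 6378137.0            # semi-major axis a (m)
B_AXIS = 6356752.314          # semi-minor axis b (m)
E = math.sqrt(A_AXIS**2 - B_AXIS**2) / A_AXIS   # first eccentricity

def mercator(lon_deg, lat_deg):
    """Ellipsoidal Mercator projection: (longitude, latitude) in degrees -> (x, y) in meters."""
    theta = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    x = A_AXIS * theta
    y = A_AXIS * math.log(math.tan(math.pi / 4 + phi / 2)
                          * ((1 - E * math.sin(phi)) / (1 + E * math.sin(phi))) ** (E / 2))
    return x, y

def preprocess(obstacles_lonlat, origin_lonlat, extent_km=60 * 1.852):
    """Project obstacle coordinates, convert to kilometers relative to an origin,
    and normalize into [0, 1] over the assumed extent (60 nautical miles)."""
    ox, oy = mercator(*origin_lonlat)
    normalized = []
    for lon, lat in obstacles_lonlat:
        x, y = mercator(lon, lat)
        x_km, y_km = (x - ox) / 1000.0, (y - oy) / 1000.0        # unit conversion to km
        normalized.append((x_km / extent_km, y_km / extent_km))  # normalization to [0, 1]
    return normalized

print(preprocess([(121.7, 38.95)], origin_lonlat=(121.6, 38.9)))
```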
Step 2: the variables needed for the algorithm to run are initialized. The variables required by the method of the invention include:
goal_counter, the target counter;
obs_counter, the collision counter;
reward, the reward or penalty given at each step;
reward_standard_goal, the standard reward for reaching the target point;
max_reward_goal, the maximum reward for reaching the target point;
reward_standard_obs, the standard penalty for colliding with an obstacle;
max_reward_obs, the maximum penalty for colliding with an obstacle.
Step 3: the distances between the ship and the obstacles and between the ship and the target point are calculated. The distances between the unmanned ship and the obstacles are recorded as d_obs = [d1, d2, d3, ..., dn], where d1, d2, d3, ..., dn are the distances to the individual obstacles; the distance to the target point is recorded as d_goal. To prevent calculation errors, the boundary distance between the ship and the target point or obstacle is calculated instead of the center distance.
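A small Python sketch of this distance step follows; modelling the ship, obstacles and target as circles with known radii is an assumption used only to illustrate the boundary (edge-to-edge) distance rather than the center distance.

```python
import math

# Boundary (edge-to-edge) distance sketch: entities are modeled as circles,
# which is an illustrative assumption, not the patent's exact geometry.
def boundary_distance(p1, r1, p2, r2):
    """Distance between the boundaries of two circular entities (>= 0)."""
    center_dist = math.hypot(p1[0] - p2[0], p1[1] - p2[1])
    return max(center_dist - r1 - r2, 0.0)

def distances(ship_pos, ship_r, obstacles, goal_pos, goal_r):
    """Return d_obs = [d1, ..., dn] and d_goal as described in step 3."""
    d_obs = [boundary_distance(ship_pos, ship_r, pos, r) for pos, r in obstacles]
    d_goal = boundary_distance(ship_pos, ship_r, goal_pos, goal_r)
    return d_obs, d_goal

d_obs, d_goal = distances((0.1, 0.1), 0.01,
                          [((0.4, 0.5), 0.05), ((0.8, 0.2), 0.03)],
                          (0.9, 0.9), 0.02)
print(d_obs, d_goal)
```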
Step 4: it is judged whether the ship has reached the target point; if so, step 5 is performed, otherwise step 6 is performed.
Step 5: the target counter is incremented by one and the collision counter is reset to zero. If the target counter is greater than the threshold, the maximum reward is given directly according to formula (2); otherwise the reward value is calculated according to formula (3), where ceil() is the round-up function. The current round then ends and the next round starts.
reward = max_reward_goal (2)
[Formula (3): the reward is the standard reward reward_standard_goal multiplied by a ceil()-based weighting coefficient that grows with the target counter; the original equation image is not reproduced here.]
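Since the equation image for formula (3) is not reproduced, the sketch below shows one plausible counting-based form consistent with the description (maximum reward beyond a threshold, otherwise a ceil()-weighted standard reward); the threshold value and the divisor inside ceil() are assumptions.

```python
import math

# One plausible form of the counting-based goal reward (formulas (2)-(3)).
# The threshold and the ceil()-based weighting are illustrative assumptions;
# the patent only states that the weight grows with the goal counter.
MAX_REWARD_GOAL = 100.0
REWARD_STANDARD_GOAL = 10.0
GOAL_COUNT_THRESHOLD = 5

def goal_reward(goal_counter):
    if goal_counter > GOAL_COUNT_THRESHOLD:
        return MAX_REWARD_GOAL                           # formula (2): maximum reward
    weight = math.ceil(goal_counter / 2)                 # assumed ceil()-based weighting
    return weight * REWARD_STANDARD_GOAL                 # formula (3): weighted standard reward

for count in range(1, 8):
    print(count, goal_reward(count))
```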
Step 6: it is judged whether the ship is in the reward domain. If so, a corresponding reward is given according to the reward domain principle and the process returns to step 3; otherwise step 7 is performed.
The reward domain principle is as follows:
The reward domain means that the ship is near the target point but has not yet reached it. In order to help the ship reach the target point quickly, different rewards are given within the reward domain according to the distance from the target point, which accelerates the convergence of the model. At the same time, to prevent the ship from falling into a local optimum, the rewards within the reward domain should not be too dense, and there should be a clear difference between them and the reward for actually reaching the target point. The specific reward is given by formula (4), where l and L are the two thresholds defining the reward domain range, and different rewards are given in the sub-ranges that they delimit.
[Formula (4): a piecewise reward over d_goal defined on the sub-ranges delimited by the thresholds l and L; the original equation image is not reproduced here.]
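A minimal sketch of the reward-domain rule follows; the threshold values l and L and the reward levels are assumptions chosen only to illustrate a sparse, piecewise reward that stays below the reward for actually reaching the target.

```python
# Sketch of the reward-domain principle (formula (4)). The thresholds and reward
# levels are illustrative assumptions; the patent only requires that rewards inside
# the domain depend on d_goal, stay sparse, and stay below the goal reward.
L_INNER = 20.0      # threshold l: inner band of the reward domain
L_OUTER = 50.0      # threshold L: outer edge of the reward domain

def reward_domain_reward(d_goal):
    if d_goal <= L_INNER:
        return 5.0                 # closer band: larger (but still sub-goal) reward
    if d_goal <= L_OUTER:
        return 2.0                 # outer band: smaller reward
    return 0.0                     # outside the reward domain

print([reward_domain_reward(d) for d in (10.0, 30.0, 80.0)])
```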
Step 7: it is judged whether the ship has collided with an obstacle. If so, the collision counter is incremented by one and the target counter is reset to zero; if the collision counter is greater than the threshold, the maximum penalty is given directly according to formula (5), otherwise the penalty value is calculated according to formula (6); the current round then ends and the next round starts. If there is no collision, step 8 is performed.
reward = max_reward_obs (5)
[Formula (6): the penalty is the standard penalty reward_standard_obs multiplied by a ceil()-based weighting coefficient that grows with the collision counter; the original equation image is not reproduced here.]
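Analogously to formula (3), the sketch below gives one plausible form of the counting-based collision penalty of formulas (5)-(6); the threshold and the weighting are assumptions.

```python
import math

# One plausible form of the counting-based collision penalty (formulas (5)-(6)),
# mirroring the goal-reward sketch above. Threshold and weighting are assumptions.
MAX_REWARD_OBS = -100.0
REWARD_STANDARD_OBS = -10.0
OBS_COUNT_THRESHOLD = 5

def collision_penalty(obs_counter):
    if obs_counter > OBS_COUNT_THRESHOLD:
        return MAX_REWARD_OBS                            # formula (5): maximum penalty
    weight = math.ceil(obs_counter / 2)                  # assumed ceil()-based weighting
    return weight * REWARD_STANDARD_OBS                  # formula (6): weighted standard penalty

print([collision_penalty(c) for c in (1, 3, 6)])
```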
Step 8: it is judged whether the ship is in the danger domain. If so, a corresponding penalty is given according to the danger domain penalty principle and the process returns to step 3; if not, a reward is given according to the general-case reward principle and the process returns to step 3.
The danger domain penalty principle is as follows: the danger domain means that the ship is near an obstacle but has not collided with it. In order to help the ship leave the vicinity of the obstacle quickly and move towards the target point, a penalty is imposed near the obstacle, and the penalty value is inversely proportional to the distance between the ship and the obstacle. Likewise, to avoid falling into a local optimum, the penalties within the danger domain should not be too dense, and there should be a clear difference between them and the penalty for an actual collision. The specific penalty is calculated by formula (7), where min(d_obs) is the minimum distance between the ship and the obstacles, and α and β are the two thresholds defining the danger domain range, with different penalties given in the sub-ranges that they delimit.
[Formula (7): a piecewise penalty over min(d_obs) defined on the sub-ranges delimited by the thresholds α and β; the original equation image is not reproduced here.]
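A minimal sketch of the danger-domain rule follows; the thresholds α and β and the scaling constants are assumptions, kept consistent with the stated requirement that the penalty is inversely proportional to the distance to the nearest obstacle.

```python
# Sketch of the danger-domain penalty (formula (7)). Thresholds alpha, beta and
# the scaling constants are illustrative assumptions; the penalty grows as the
# ship gets closer to the nearest obstacle.
ALPHA = 5.0      # inner threshold of the danger domain
BETA = 15.0      # outer threshold of the danger domain

def danger_domain_penalty(d_obs_min):
    if d_obs_min <= ALPHA:
        return -5.0 / (d_obs_min + 1e-6)    # very close: strong penalty, inversely proportional
    if d_obs_min <= BETA:
        return -1.0 / (d_obs_min + 1e-6)    # outer band: milder penalty
    return 0.0                               # outside the danger domain

print([round(danger_domain_penalty(d), 3) for d in (2.0, 10.0, 30.0)])
```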
The general-case reward principle is as follows: the general case means that the ship is neither near the target point nor near an obstacle. In this case the reward should be as sparse as possible to prevent the algorithm from falling into a local optimum. Therefore, in the general case the inverse of the distance between the ship and the target point is used as the reward of each step, as shown in formula (8), where inverse() denotes taking the inverse of its argument.
reward = inverse(d_goal) (8)
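Finally, a short sketch of the general-case rule; because the translated text leaves "inverse" ambiguous, both the opposite-number and the reciprocal readings are shown as assumptions.

```python
# General-case reward (formula (8)). The translated text leaves "inverse" ambiguous,
# so both common readings are sketched; only one would be used in practice.
def general_reward_negative(d_goal):
    return -d_goal                 # reading 1: opposite number of the distance

def general_reward_reciprocal(d_goal):
    return 1.0 / (d_goal + 1e-6)   # reading 2: reciprocal of the distance

print(general_reward_negative(40.0), general_reward_reciprocal(40.0))
```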
A comparative analysis of the reward scores and the number of convergence rounds in different environments is given below to show that the reward function of this patent converges faster than the traditional reward function.
(1) Comparative analysis of reward scores in different environments
During a voyage, the whole route of a ship is divided into several waypoints, and path planning plans the path from one waypoint to the next. In the experimental simulation, the actual environment size is 60 nautical miles × 60 nautical miles, and the number of obstacles between one waypoint and the next is set to at most 7; the experimental environment is therefore set to 600 pixels × 600 pixels, and 1-7 obstacles are placed to simulate the actual environment. Because the algorithm converges more slowly as the number of obstacles increases, the environments are divided into class I and class II according to the number of obstacles when counting convergence: class I environments contain 1-4 obstacles and class II environments contain 5-7 obstacles. The convergence behaviour of the method is described below using two environments as examples (fig. 2 is an example of a class I environment and fig. 3 is an example of a class II environment).
As shown in fig. 2 and fig. 3, there are 3 obstacles in example environment one and 5 obstacles in example environment two; the five-pointed star is the target point, the ship marks the starting point, and the obstacles are represented by mountains. In both environments, experiments were carried out with the traditional reward function and with the reward function of this patent, using the DoubleDQN algorithm. In deep reinforcement learning, the reward score obtained in each training round is used to judge whether the algorithm has converged: if the obtained reward eventually levels off, the algorithm has converged. The reward scores obtained in the two environments with the reward function of this patent and with the traditional reward function are shown in fig. 4 and fig. 5.
As shown in fig. 4 and fig. 5, the abscissa is the number of training rounds, the ordinate is the reward score obtained in each round, New denotes the reward function of this patent and Traditional denotes the traditional reward function. Fig. 4 shows that with the reward function of the invention the reward score is essentially stable and the algorithm has essentially converged after 50 rounds of training, whereas with the traditional reward function the reward score only begins to stabilize after 100 rounds, the algorithm only then starts to converge, and oscillations still occur afterwards, indicating that the algorithm has not fully converged. Fig. 5 shows that with the reward function of the invention the reward score oscillates during the first 100 rounds, slowly stabilizes between rounds 100 and 150 although some oscillation remains, and the algorithm has essentially converged after 200 rounds. With the traditional reward function the reward score oscillates strongly before round 150, the amplitude begins to decrease between rounds 150 and 250, the algorithm begins to converge, and it has essentially converged only after 250 rounds.
The above comparison shows that, with the same algorithm, the reward function of this patent converges faster than the traditional reward function, and for the same number of rounds it achieves a higher reward score.
(2) Analysis of convergence rounds
In the experiments, different environments were designed by changing the positions, number and sizes of the obstacles, and in each environment experiments were carried out with the reward function of the invention and with the traditional reward function. The algorithms used were DQN, DoubleDQN and DuelingDQN; each algorithm performed 10 experiments in the simple (class I) and complex (class II) environments respectively, and the statistics of the convergence rounds are shown in fig. 6 and fig. 7.
In fig. 6 and fig. 7, the abscissa is the number of rounds needed for the algorithm to converge, the ordinate is the number of experiments converging in that range, New denotes the reward function of this patent and Traditional denotes the traditional reward function.
Fig. 6 shows the convergence in class I environments. In the 30 experiments, with the reward function of this patent 18 runs converged within 100 rounds, 8 converged between 100 and 150 rounds, and 4 converged between 150 and 250 rounds; with the traditional reward function 6 runs converged within 100 rounds, 15 converged between 100 and 150 rounds, and 9 converged between 150 and 250 rounds. Most runs using the reward function of this patent therefore converge within 100 rounds, whereas most runs using the traditional reward function converge only by around 150 rounds.
Fig. 7 shows the convergence in class II environments. With the reward function of this patent, 14 runs converged within 150 rounds, 9 converged between 150 and 200 rounds, 5 converged between 200 and 250 rounds, and 2 converged beyond 300 rounds; with the traditional reward function, 7 runs converged within 150 rounds, 10 converged between 150 and 200 rounds, 8 converged between 200 and 250 rounds, and 5 converged beyond 300 rounds. In summary, most runs using the reward function of this patent converge within 200 rounds, whereas most runs using the traditional reward function converge within 250 rounds.
The above comparison shows that, in environments with different numbers of obstacles and with different deep reinforcement learning algorithms, the reward function of this method converges faster than the traditional reward function.
The traditional reward function is based only on the distance between the ship and the target point. However, such a reward function converges slowly and requires a long training period, which may force training to be terminated or the scheme to be abandoned because of time constraints.
The invention designs the reward function using a counting principle: if the ship reaches the target point repeatedly, the reward grows with the count; if it collides with obstacles repeatedly, the penalty grows with the count; in addition, a reward domain is set around the target point and a danger domain is set around each obstacle. This design helps the algorithm converge faster, so that the ship can avoid obstacles and reach the target point more quickly.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A deep reinforcement learning reward function optimization method for unmanned ship path planning, characterized by comprising the following steps:
S1, acquiring environment information, preprocessing it, and converting it into information in the experimental environment;
S2, acquiring the distance between the unmanned ship and each obstacle and the distance between the unmanned ship and the target point, and judging from the latter whether the unmanned ship has reached the target point; if so, executing S3, otherwise executing S4;
S3, giving a reward value corresponding to the number of times the ship has reached the target point;
S4, judging whether the ship is in the reward domain; if so, giving a corresponding reward according to the reward domain principle and executing S2, otherwise executing S5;
S5, judging whether the unmanned ship has collided with an obstacle; if so, giving a penalty value corresponding to the number of collisions between the ship and obstacles and executing S2, otherwise executing S6;
S6, judging whether the ship is in the danger domain; if so, giving a corresponding penalty according to the danger domain principle and executing S2; otherwise, giving a reward according to the general-case principle and executing S2.
2. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein the preprocessing of the environment information in S1 comprises:
reading obstacle position data between the current position of the ship and the next turning point, and parsing the data to obtain the coordinate information of each obstacle;
converting the obstacle coordinates into first converted coordinates in a rectangular coordinate system through the Mercator projection;
performing unit conversion on the first converted coordinates to obtain second converted coordinates;
and normalizing the second converted coordinates so that the coordinate values lie in the range [0, 1].
3. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein in S3, giving a reward value corresponding to the number of times the ship has reached the target point comprises: if the number of times the ship has reached the target point is larger than a threshold, giving the maximum reward; otherwise, giving a weighted version of the standard reward value, wherein the weighting coefficient is positively correlated with the number of times the ship has reached the target point.
4. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein in S5, giving a penalty value corresponding to the number of collisions between the ship and obstacles comprises: if the number of collisions is larger than a threshold, giving the maximum penalty; otherwise, giving a weighted version of the standard penalty value, wherein the weighting coefficient is positively correlated with the number of collisions.
5. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein in S4, giving a corresponding reward according to the reward domain principle comprises: giving a corresponding reward value according to the distance between the ship and the target point.
6. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein in S6, giving a corresponding penalty according to the danger domain principle comprises: giving a corresponding penalty value according to the distance between the ship and the nearest obstacle.
7. The deep reinforcement learning reward function optimization method for unmanned ship path planning according to claim 1, wherein in S6, giving a reward according to the general-case principle comprises: giving a corresponding reward value according to the distance between the ship and the target point.
CN202010962958.4A 2020-09-14 2020-09-14 Deep reinforcement learning rewarding function optimization method for unmanned ship path planning Active CN111880549B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010962958.4A CN111880549B (en) 2020-09-14 2020-09-14 Deep reinforcement learning rewarding function optimization method for unmanned ship path planning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010962958.4A CN111880549B (en) 2020-09-14 2020-09-14 Deep reinforcement learning rewarding function optimization method for unmanned ship path planning

Publications (2)

Publication Number Publication Date
CN111880549A true CN111880549A (en) 2020-11-03
CN111880549B CN111880549B (en) 2024-06-04

Family

ID=73199216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010962958.4A Active CN111880549B (en) 2020-09-14 2020-09-14 Deep reinforcement learning rewarding function optimization method for unmanned ship path planning

Country Status (1)

Country Link
CN (1) CN111880549B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112462786A (en) * 2020-12-03 2021-03-09 大连海事大学 Unmanned ship collision avoidance method based on fuzzy control strategy double-window algorithm
CN112507520A (en) * 2020-11-12 2021-03-16 深圳慧拓无限科技有限公司 Path planning method and device based on reinforcement learning
CN112800545A (en) * 2021-01-28 2021-05-14 中国地质大学(武汉) Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN
CN112882469A (en) * 2021-01-14 2021-06-01 浙江大学 Deep reinforcement learning obstacle avoidance navigation method integrating global training
CN113079517A (en) * 2021-03-23 2021-07-06 中国人民解放军国防科技大学 Mobile command console and control method thereof
CN113985876A (en) * 2021-10-27 2022-01-28 广州大学 Marine garbage recycling path planning method and system based on foraging of bionic fishes
CN114531647A (en) * 2022-01-13 2022-05-24 江苏大学 Multi-layer self-adaptive reward target tracing method based on reinforcement learning under diffusion molecular communication environment
CN114851184A (en) * 2021-01-20 2022-08-05 广东技术师范大学 Industrial robot-oriented reinforcement learning reward value calculation method
CN115790608A (en) * 2023-01-31 2023-03-14 天津大学 AUV path planning algorithm and device based on reinforcement learning

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011204036A (en) * 2010-03-25 2011-10-13 Institute Of National Colleges Of Technology Japan Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program
CN102967310A (en) * 2012-12-12 2013-03-13 中国船舶重工集团公司第七0七研究所 Seaway monitoring method based on electronic chart
CN107133735A (en) * 2017-04-28 2017-09-05 大鹏高科(武汉)智能装备有限公司 A kind of unmanned boat resource share method and device
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110716575A (en) * 2019-09-29 2020-01-21 哈尔滨工程大学 UUV real-time collision avoidance planning method based on deep double-Q network reinforcement learning
CN110750096A (en) * 2019-10-09 2020-02-04 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN110989576A (en) * 2019-11-14 2020-04-10 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
CN111045445A (en) * 2019-10-23 2020-04-21 浩亚信息科技有限公司 Aircraft intelligent collision avoidance method, equipment and medium based on reinforcement learning
CN111123963A (en) * 2019-12-19 2020-05-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111157982A (en) * 2019-11-20 2020-05-15 智慧航海(青岛)科技有限公司 Intelligent ship and shore cooperative target tracking system and method based on shore-based radar
CN111487977A (en) * 2020-05-11 2020-08-04 大连海事大学 Ship dynamic obstacle avoidance optimization method and system under uncertain environment
CN111507501A (en) * 2019-01-31 2020-08-07 斯特拉德视觉公司 Method and device for executing personalized path planning through reinforcement learning

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011204036A (en) * 2010-03-25 2011-10-13 Institute Of National Colleges Of Technology Japan Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program
CN102967310A (en) * 2012-12-12 2013-03-13 中国船舶重工集团公司第七0七研究所 Seaway monitoring method based on electronic chart
CN107133735A (en) * 2017-04-28 2017-09-05 大鹏高科(武汉)智能装备有限公司 A kind of unmanned boat resource share method and device
CN111507501A (en) * 2019-01-31 2020-08-07 斯特拉德视觉公司 Method and device for executing personalized path planning through reinforcement learning
CN110716575A (en) * 2019-09-29 2020-01-21 哈尔滨工程大学 UUV real-time collision avoidance planning method based on deep double-Q network reinforcement learning
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110750096A (en) * 2019-10-09 2020-02-04 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN111045445A (en) * 2019-10-23 2020-04-21 浩亚信息科技有限公司 Aircraft intelligent collision avoidance method, equipment and medium based on reinforcement learning
CN110989576A (en) * 2019-11-14 2020-04-10 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
CN111157982A (en) * 2019-11-20 2020-05-15 智慧航海(青岛)科技有限公司 Intelligent ship and shore cooperative target tracking system and method based on shore-based radar
CN111123963A (en) * 2019-12-19 2020-05-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111487977A (en) * 2020-05-11 2020-08-04 大连海事大学 Ship dynamic obstacle avoidance optimization method and system under uncertain environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SIYU GUO: "An Autonomous Path Planning Model for Unmanned Ships Based on Deep Reinforcement Learning", SENSORS, vol. 20, 11 January 2020 (2020-01-11), pages 1 - 35 *
刘畅: "使用增强学习训练多焦点聚焦模型" [Training a multi-focus model using reinforcement learning], 自动化学报 (Acta Automatica Sinica), vol. 43, no. 9, 30 September 2017 (2017-09-30), pages 1563-1570 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507520A (en) * 2020-11-12 2021-03-16 深圳慧拓无限科技有限公司 Path planning method and device based on reinforcement learning
CN112462786A (en) * 2020-12-03 2021-03-09 大连海事大学 Unmanned ship collision avoidance method based on fuzzy control strategy double-window algorithm
CN112462786B (en) * 2020-12-03 2024-01-19 大连海事大学 Unmanned ship collision prevention method based on fuzzy control strategy double-window algorithm
CN112882469A (en) * 2021-01-14 2021-06-01 浙江大学 Deep reinforcement learning obstacle avoidance navigation method integrating global training
CN112882469B (en) * 2021-01-14 2022-04-08 浙江大学 Deep reinforcement learning obstacle avoidance navigation method integrating global training
CN114851184B (en) * 2021-01-20 2023-05-09 广东技术师范大学 Reinforced learning rewarding value calculating method for industrial robot
CN114851184A (en) * 2021-01-20 2022-08-05 广东技术师范大学 Industrial robot-oriented reinforcement learning reward value calculation method
CN112800545B (en) * 2021-01-28 2022-06-24 中国地质大学(武汉) Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN
CN112800545A (en) * 2021-01-28 2021-05-14 中国地质大学(武汉) Unmanned ship self-adaptive path planning method, equipment and storage medium based on D3QN
CN113079517B (en) * 2021-03-23 2022-07-05 中国人民解放军国防科技大学 Mobile command console and control method thereof
CN113079517A (en) * 2021-03-23 2021-07-06 中国人民解放军国防科技大学 Mobile command console and control method thereof
CN113985876A (en) * 2021-10-27 2022-01-28 广州大学 Marine garbage recycling path planning method and system based on foraging of bionic fishes
CN113985876B (en) * 2021-10-27 2023-09-26 广州大学 Ocean garbage recycling path planning method and system based on bionic fish foraging
CN114531647A (en) * 2022-01-13 2022-05-24 江苏大学 Multi-layer self-adaptive reward target tracing method based on reinforcement learning under diffusion molecular communication environment
CN114531647B (en) * 2022-01-13 2024-03-22 江苏大学 Multi-layer self-adaptive rewarding target tracing method in diffusion molecular communication environment
CN115790608A (en) * 2023-01-31 2023-03-14 天津大学 AUV path planning algorithm and device based on reinforcement learning
CN115790608B (en) * 2023-01-31 2023-05-30 天津大学 AUV path planning algorithm and device based on reinforcement learning

Also Published As

Publication number Publication date
CN111880549B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
CN111880549B (en) Deep reinforcement learning rewarding function optimization method for unmanned ship path planning
CN112179367B (en) Intelligent autonomous navigation method based on deep reinforcement learning
CN108564202B (en) Unmanned ship route optimization method based on environment forecast information
Sun et al. Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning
CN111273670B (en) Unmanned ship collision prevention method for fast moving obstacle
Ouahouah et al. Deep-reinforcement-learning-based collision avoidance in uav environment
Xiaofei et al. Global path planning algorithm based on double DQN for multi-tasks amphibious unmanned surface vehicle
Wang et al. Cooperative collision avoidance for unmanned surface vehicles based on improved genetic algorithm
CN114089762B (en) Water-air amphibious unmanned aircraft path planning method based on reinforcement learning
CN112015174A (en) Multi-AGV motion planning method, device and system
CN110906935B (en) Unmanned ship path planning method
Deraj et al. Deep reinforcement learning based controller for ship navigation
CN112859864A (en) Unmanned ship-oriented geometric path planning method
CN109765914A (en) A kind of unmanned surface vehicle collision prevention method based on sliding window population
CN117590867B (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
Gao et al. An optimized path planning method for container ships in Bohai bay based on improved deep Q-learning
Masmitja et al. A reinforcement learning path planning approach for range-only underwater target localization with autonomous vehicles
CN115107948B (en) Efficient reinforcement learning autonomous ship collision prevention method
CN114609925B (en) Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish
Jose et al. Navigating the Ocean with DRL: Path following for marine vessels
CN114779764B (en) Vehicle reinforcement learning movement planning method based on driving risk analysis
CN116466701A (en) Unmanned surface vessel energy consumption minimum track planning method
Cheng et al. Trajectory optimization for ship navigation safety using genetic annealing algorithm
Meyer On course towards model-free guidance: A self-learning approach to dynamic collision avoidance for autonomous surface vehicles
CN114167856A (en) Service robot local path planning method based on artificial emotion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant