CN108820157B - Intelligent ship collision avoidance method based on reinforcement learning - Google Patents
- Publication number: CN108820157B
- Application number: CN201810378954.4A
- Authority: CN (China)
- Legal status: Active
Classifications
- B — Performing operations; transporting
- B63 — Ships or other waterborne vessels; related equipment
- B63B — Ships or other waterborne vessels; equipment for shipping
- B63B43/00 — Improving safety of vessels, e.g. damage control, not otherwise provided for
- B63B43/18 — Improving safety of vessels; preventing collision or grounding; reducing collision damage
Landscapes
- Health & Medical Sciences (AREA)
- Public Health (AREA)
- Chemical & Material Sciences (AREA)
- Engineering & Computer Science (AREA)
- Combustion & Propulsion (AREA)
- Mechanical Engineering (AREA)
- Ocean & Marine Engineering (AREA)
- Traffic Control Systems (AREA)
- Feedback Control In General (AREA)
Abstract
The invention discloses an intelligent ship collision avoidance method based on reinforcement learning. The method first acquires static and dynamic data of the two ships; it then checks the validity of the data and determines whether the collision avoidance program needs to be started. Relevant collision avoidance parameters are calculated to judge whether a dangerous situation will arise. If no collision danger arises, own ship keeps its course and speed and proceeds in accordance with the collision avoidance rules. If a collision danger arises, a collision avoidance strategy is learned by reinforcement learning: the calculated parameters are input for training, the output is the strategy generated after training, and the rudder angle the ship needs to turn is obtained. The strategy is then executed, the dynamic data of the two ships in step 1 are updated, and a reward value is returned. After the strategy has been executed, the time to resume the original course is determined according to the collision avoidance rules and the ship resumes navigation. The invention realizes autonomous learning and improvement of ship collision avoidance and avoids the unfavorable situations caused by seafarers' reliance on experience.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and relates to an intelligent ship collision avoidance method, in particular to an intelligent ship collision avoidance method based on reinforcement learning.
Background
Ship collision avoidance is a problem that cannot be neglected during navigation, and it has many different solutions: AIS-based intelligent collision avoidance decision-making, intelligent algorithms such as evolutionary genetic algorithms, Bayesian-network-based collision avoidance algorithms, and so on. These algorithms have a certain capacity to solve the ship collision avoidance problem, but they also have limitations: they cannot self-learn and improve their collision avoidance strategies.
At present, ship avoidance mainly concerns multiple ships in open waters, and the existing collision avoidance practice in open waters is based chiefly on the International Regulations for Preventing Collisions at Sea. Because the relevant avoidance clauses of these regulations are mostly qualitative descriptions, in the actual avoidance process the ordinary practice of seamen and the ship-handling experience of the officers significantly influence a ship's specific decision scheme and collision avoidance effect.
In practice, ship collision avoidance is mainly controlled by people and depends on the ordinary practice of seamen and the actual ship-handling experience of the officers, which introduces considerable instability.
Disclosure of Invention
To solve the above technical problems, the invention adopts reinforcement learning to optimize the collision avoidance strategy and algorithm, and provides an intelligent ship collision avoidance method based on reinforcement learning that realizes autonomous learning and improvement of ship collision avoidance, avoiding the unfavorable situations caused by seafarers' reliance on experience.
The technical scheme adopted by the invention to solve the above problems is an intelligent ship collision avoidance method based on reinforcement learning, characterized by comprising the following steps:
step 1: acquire the static data and dynamic data of the two ships;
step 2: check the validity of the data, calculate the relevant collision avoidance parameters, judge whether a dangerous situation will arise, and start the collision avoidance program;
step 3: if no collision danger will arise, own ship keeps its course and speed and proceeds in accordance with the collision avoidance rules; if a collision danger will arise, learn a collision avoidance strategy by reinforcement learning, with the calculated parameters as training input and the strategy generated after training as output, obtaining the rudder angle the ship needs to turn;
step 4: execute the strategy generated in step 3, then update the dynamic data of the two ships in step 1 and return a reward value; the reward value is used to evaluate the quality of the collision avoidance strategy;
step 5: after the strategy has been executed, determine the time to resume the original course according to the collision avoidance rules, and resume navigation.
The method has the advantages that reinforcement learning is adopted for strategy optimization, assisting operators and effectively reducing erroneous operations caused by intuition and experience, thereby improving the collision avoidance efficiency of the ship by means of machine learning. Once the strategy is optimized, the optimal strategy learned by the machine can conveniently be provided to operators for reference, so that high-quality decisions can be made and more urgent situations avoided.
Drawings
FIG. 1 is a schematic diagram of an embodiment of the present invention.
Detailed Description
To facilitate understanding and implementation by those of ordinary skill in the art, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described herein are merely illustrative and explanatory and do not limit the invention.
In the field of machine learning, reinforcement learning is an artificial intelligence method. The research team represented by DeepMind first proposed the deep reinforcement learning method based on DQN (Deep Q-Network) and, using a subset of Atari 2600 games as test objects, achieved results exceeding those of human players. In 2012, Lange had already moved toward applications, proposing Deep Fitted Q-learning for vehicle control. Experiments show that such methods are suitable for intelligent control, robotics, analysis, prediction and other fields, and they provide new ideas and opportunities for optimizing ship collision avoidance. The invention fits the actions of human sailors well, and the intelligent ship collision avoidance decision has the characteristics of autonomous learning and improvement.
Referring to fig. 1, the intelligent ship collision avoidance method based on reinforcement learning provided by the invention comprises the following steps:
Step 1: acquire the static data and dynamic data of the two ships;
the static data and the dynamic data of the two ships comprise ship information and target ship information; the ship information comprises ship state, ship gyration index, ship tracking index, track direction, ship heading, ground speed, water speed, longitude, latitude, rudder angle and draught; the target vessel information includes the name of the vessel, the MMSI, the call sign, the type of vessel, the length of the vessel, the width of the vessel, the track direction, the heading direction of the vessel, the speed of the ground, the speed of the water, the longitude, the latitude, the distance, the true azimuth, and the relative azimuth.
Step 2: check the validity of the data, calculate the relevant collision avoidance parameters, judge whether a dangerous situation will arise, and start the collision avoidance program;
the relevant collision avoidance parameters include Time To Close Point of Arrival (TCPA), Distance of Closest arrival (DCPA), safe Distance of safe arrival (SDA), urgent Distance of urgent situation (CQS), urgent Distance of Danger (IMD), Relative motion speed (VR) and Relative motion direction (AR);
whether a dangerous situation arises is then determined: a collision danger exists when TCPA > 0 and DCPA < SDA.
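For illustration, under the standard relative-motion model (a sketch of ours; the patent does not give explicit formulas), TCPA and DCPA can be computed from the target ship's position and velocity relative to own ship, and the danger condition checked as:

```python
import math

def cpa(rel_pos, rel_vel):
    """TCPA and DCPA from the target's relative position and velocity.

    rel_pos: (x, y) of the target relative to own ship
    rel_vel: (vx, vy) of the target relative to own ship
    """
    rx, ry = rel_pos
    vx, vy = rel_vel
    v2 = vx * vx + vy * vy
    if v2 == 0.0:                      # no relative motion: range never changes
        return 0.0, math.hypot(rx, ry)
    tcpa = -(rx * vx + ry * vy) / v2   # time at which the range is minimal
    dcpa = math.hypot(rx + vx * tcpa, ry + vy * tcpa)
    return tcpa, dcpa

def collision_danger(tcpa, dcpa, sda):
    # Danger condition from the text: TCPA > 0 and DCPA < SDA
    return tcpa > 0 and dcpa < sda

# Target 10 nm dead ahead, closing at 1 kn on a reciprocal course:
tcpa, dcpa = cpa((0.0, 10.0), (0.0, -1.0))
print(tcpa, dcpa)  # 10.0 0.0 -> danger for any positive SDA
```

The coordinates and speeds here are made-up example values; any consistent units work, since only TCPA's sign and DCPA's comparison with SDA matter.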
Step 3: if no collision danger will arise, own ship keeps its course and speed and proceeds in accordance with the collision avoidance rules; if a collision danger will arise, learn a collision avoidance strategy by reinforcement learning, with the calculated parameters as training input and the strategy generated after training as output, obtaining the rudder angle the ship needs to turn;
the method for learning the collision avoidance strategy by applying the reinforcement learning method comprises the following specific steps:
step 4.1: inputting static parameters and dynamic parameters of a ship for training;
step 4.2: inputting various parameters into a Deep Q-learning Network (DQN) to train data; continuously updating the Q value function until the Q function is converged to obtain the best model;
step 4.3: inputting the static parameters and the dynamic parameters of the ship for testing into the trained model;
step 4.4: outputting a rudder angle required to be rotated by the ship;
in the embodiment, static data, dynamic data and marine environment data of two ships are obtained through various sensors and other devices. At the moment, a Markov decision process four-tuple E is generated<S,A,P,R>S is a state set describing the course and the navigation speed of the ship, A is an action set describing the rudder angle which the ship should turn, state transition probabilities are specified for the transition functions;for the reward function, a reward is specified. Existing algorithms typically employ DQN (Deep Q-learning Network) to train the data. First, Q-Table is initialized, the rows and columns are S and A, respectively, and the value of Q-Table is used to measure the quality of the action a taken by the current state S. This embodiment uses the Bellman equation to update the Q-Table during the training process:
Q(s, a) = r + γ·max_a′ Q(s′, a′)
That is, Q(s, a) is the immediate reward r obtained after taking action a in the current state s, plus the maximum future reward max_a′ Q(s′, a′) discounted by γ.
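On a toy problem this update can be sketched in tabular form (a minimal illustration of the Bellman update, not the patent's DQN; the states, actions and rewards are invented for the example):

```python
# Three states in a chain: 0 -> 1 -> 2 (terminal). The only action is
# "starboard"; reaching the terminal state yields reward 1, otherwise 0.
states = [0, 1, 2]
actions = ["starboard"]
gamma = 0.9

Q = {s: {a: 0.0 for a in actions} for s in states}

def step(s, a):
    """Deterministic toy transition: move one state to the right."""
    s_next = min(s + 1, 2)
    r = 1.0 if s_next == 2 else 0.0
    return s_next, r

# One backward sweep of the update Q(s,a) = r + gamma * max_a' Q(s',a')
for s in [1, 0]:
    for a in actions:
        s_next, r = step(s, a)
        Q[s][a] = r + gamma * max(Q[s_next].values())

print(Q[1]["starboard"], Q[0]["starboard"])  # 1.0 0.9
```

The value 0.9 at state 0 shows the discounted reward propagating backward through the table, which is exactly what repeated application of the equation above achieves.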
In the embodiment, the Q-Table is realized by a neural network in the DQN: a state x is input and the Q values of the different actions a are output. The corresponding algorithm is as follows:
1. Use a deep neural network as the Q-value network, with parameters ω:
Q(s, a, ω) ≈ Q^π(s, a)
2. Define the objective function, i.e. the loss function, for the Q value using the mean squared error:
L(ω) = E[(r + γ·max_a′ Q(s′, a′, ω) − Q(s, a, ω))²]
where s′ and a′ are the next state and action (the formulation follows David Silver). The Q value to be updated by Q-learning serves as the target value; with the target value and the current value, the deviation can be computed as the mean squared error.
3. Compute the gradient of the loss function with respect to the parameters ω.
4. Use SGD to achieve end-to-end optimization.
Since the above gradient can be computed from the deep neural network, the parameters can be updated by stochastic gradient descent (SGD) to obtain the optimal Q value.
5. With probability ε randomly select an action a_t, otherwise select the action a_t with the maximum Q value output by the network; then obtain the reward r_t from executing a_t together with the input of the network at the next step; the network computes the next output from the current values, and the process repeats.
After many iterations of training, when the Q value converges to its maximum, a good model has been trained. The trained model is applied to collision avoidance between the two ships: it predicts the optimal collision avoidance strategy, i.e. the rudder angle to turn, in the current urgent situation, assisting the operator to steer the ship and change its state until collision avoidance is finished.
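The target value of step 2 and the ε-greedy selection of step 5 can be sketched as follows (an illustration of the general DQN recipe, not the patent's trained network; the Q values, γ and ε are made up for the example):

```python
import random

def td_target(r, q_next, gamma):
    """Target value r + gamma * max_a' Q(s',a') from the loss L(omega)."""
    return r + gamma * max(q_next)

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action index, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

# Example: Q values for three candidate rudder angles, say -10, 0, +10 degrees
q_next = [0.5, 2.0, 1.0]
target = td_target(1.0, q_next, 0.9)       # 1.0 + 0.9 * 2.0 = 2.8
a = epsilon_greedy(q_next, epsilon=0.0)    # greedy: index 1
squared_error = (target - q_next[a]) ** 2  # one term of the MSE loss
```

In a full implementation the squared error would be averaged over a batch and backpropagated through the Q network, which is what steps 3 and 4 describe.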
Step 4: execute the strategy generated in step 3, then update the dynamic data of the two ships in step 1 and return a reward value; the reward value is used to evaluate the quality of the collision avoidance strategy;
the reward value comprises minimum flight path offset, shortest avoidance time, shortest avoidance path, shortest avoidance amplitude and minimum avoidance amplitude; the quality of the strategy depends on accumulated reward obtained after the strategy is executed for a long time, and the strategy can be continuously optimized when the Q value representing the reward is converged to the maximum value after a plurality of iterations and training are carried out in the training process.
Step 5: after the strategy has been executed, determine the time to resume the original course according to the collision avoidance rules, and resume navigation.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (3)
1. An intelligent ship collision avoidance method based on reinforcement learning is characterized by comprising the following steps:
step 1: acquire the static data and dynamic data of the two ships;
step 2: check the validity of the data, calculate the relevant collision avoidance parameters, judge whether a dangerous situation will arise, and start the collision avoidance program;
step 3: if no collision danger will arise, own ship keeps its course and speed and proceeds in accordance with the collision avoidance rules; if a collision danger will arise, learn a collision avoidance strategy by reinforcement learning, with the calculated parameters as training input and the strategy generated after training as output, obtaining the rudder angle the ship needs to turn;
in step 3, the reinforcement learning of the collision avoidance strategy is implemented in the following substeps:
step 3.1: input the static and dynamic parameters of the ships for training;
step 3.2: feed the parameters into the reinforcement learning DQN to train on the data, continuously updating the Q-value function until it converges, so as to obtain the best model;
first, the static data, dynamic data and marine environment data of the two ships are obtained, and a Markov decision process quadruple E = <S, A, P, R> is generated, where S is the state set describing own ship's course and speed, A is the action set describing the rudder angle the ship should turn, P: S × A × S → [0, 1] is the transition function specifying the state transition probabilities, and R: S × A → R is the reward function specifying the reward;
the data are trained with the DQN; first the Q-Table is initialized, with rows and columns indexed by S and A respectively, the value in the Q-Table measuring the quality of taking action a in the current state s; the Q-Table is updated during training using the Bellman equation:
Q(s, a) = r + γ·max_a′ Q(s′, a′)
wherein Q(s, a) is the immediate reward r obtained after taking action a in the current state s, plus the maximum future reward max_a′ Q(s′, a′) discounted by γ;
the Q-Table is realized in the DQN by a neural network: a state x is input and the Q values of the different actions a are output; the specific implementation process is as follows:
(1) use a deep neural network as the Q-value network, with parameters ω:
Q(s, a, ω) ≈ Q^π(s, a);
(2) define the objective function, i.e. the loss function, for the Q value using the mean squared error:
L(ω) = E[(r + γ·max_a′ Q(s′, a′, ω) − Q(s, a, ω))²]
wherein s′ and a′ are the next state and action (the formulation follows David Silver), the Q value to be updated by Q-learning serving as the target value;
(3) compute the gradient of the loss function with respect to the parameters ω;
(4) use SGD to achieve end-to-end optimization;
since the above gradient can be computed from the deep neural network, the parameters are updated by stochastic gradient descent (SGD) to obtain the optimal Q value;
(5) with probability ε randomly select an action a_t, otherwise select the action a_t with the maximum Q value output by the network; then obtain the reward r_t from executing a_t together with the input of the network at the next step; the network computes the next output from the current values, and the process repeats;
step 3.3: input the static and dynamic parameters of the ships for testing into the trained model;
step 3.4: output the rudder angle the ship needs to turn;
and 4, step 4: executing the strategy generated in the step 3, then dynamically updating the dynamic data of the two ships in the step 1, and returning a reward value; the reward value is used for evaluating the quality of the collision avoidance strategy;
the reward value comprises a minimum track offset, a shortest avoidance time, a shortest avoidance path and a minimum avoidance amplitude; the quality of the strategy depends on accumulated reward obtained after the strategy is executed for a long time, and the strategy can be continuously optimized when the Q value representing the reward is converged to the maximum value after a plurality of iterations and trainings are carried out in the training process;
and 5, step 5: after strategy execution is finished, determining a re-navigation opportunity according to a collision avoidance rule and then re-navigating.
2. The intelligent ship collision avoidance method based on reinforcement learning of claim 1, wherein in step 1 the static and dynamic data of the two ships comprise own-ship information and target-ship information; the own-ship information comprises the ship state, turning index, course-keeping index, course over ground, heading, speed over ground, speed through water, longitude, latitude, rudder angle and draught; the target-ship information comprises the ship name, MMSI, call sign, ship type, length, breadth, course over ground, heading, speed over ground, speed through water, longitude, latitude, distance, true bearing and relative bearing.
3. The intelligent ship collision avoidance method based on reinforcement learning of claim 1, wherein in step 2 the relevant collision avoidance parameters comprise the time to the closest point of approach TCPA, the distance at the closest point of approach DCPA, the safe distance of approach SDA, the close-quarters situation distance CQS, the immediate danger distance IMD, the relative speed VR and the relative course AR;
a collision danger arises when TCPA > 0 and DCPA < SDA.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810378954.4A CN108820157B (en) | 2018-04-25 | 2018-04-25 | Intelligent ship collision avoidance method based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108820157A CN108820157A (en) | 2018-11-16 |
CN108820157B true CN108820157B (en) | 2020-03-10 |
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant