CN111880535A

CN111880535A - Unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning

Info

Publication number: CN111880535A
Application number: CN202010715076.8A
Authority: CN
Inventors: 张卫东; 王雪纯; 徐鑫莉; 蔡云泽
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2020-07-23
Filing date: 2020-07-23
Publication date: 2020-11-03
Anticipated expiration: 2040-07-23
Also published as: CN111880535B

Abstract

The invention relates to an unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning, wherein the method comprises the following steps: 1) building a marine environment; 2) setting an action space according to the situation of the propeller of the unmanned ship, and learning according to global planning information provided by the static chart and obstacle information in the detection radius range of the radar system to obtain a reinforcement learning state code; 3) setting reward target weight to obtain a comprehensive reward function; 4) establishing and training an evaluation network and a strategy network; 5) and respectively inputting the reinforcement learning state codes into the evaluation network and the strategy network, inputting the comprehensive reward function into the evaluation network, and determining the output of the controller according to the action corresponding to the mean value of the learned strategy network. Compared with the prior art, the invention has high self-learning ability, can adapt to different large-scale complex environments through simple deployment training, and further realizes autonomous perception, autonomous navigation and autonomous obstacle avoidance.

Description

Unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning

Technical Field

The invention relates to an unmanned ship autonomous obstacle avoidance method and system, in particular to an unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning.

Background

The unmanned ship is an unmanned water vehicle capable of realizing autonomous navigation, autonomous obstacle avoidance and autonomous water surface operation, and has the advantages of small volume, high speed, good stealth, no casualty risk and the like. The unmanned ship is very suitable for executing water surface operation tasks in dangerous sea areas with greater risks to casualties of people or simple water surface operation tasks with low requirements on personnel participation degree, has good cost-effectiveness ratio, and is widely and effectively applied to the fields of ocean monitoring, ocean investigation, maritime search and rescue, unmanned freight transportation and the like.

At present, the mainstream thought for realizing autonomous navigation of the unmanned ship is to deploy and apply autonomous sensing, autonomous navigation and autonomous obstacle avoidance algorithms respectively, and each algorithm is matched with each other to complement and complete navigation and operation tasks. For example, algorithms such as pattern recognition, target detection and the like are involved in vision system perception, main ideas for realizing global planning autonomous navigation include a grid graph method, an A-algorithm, a genetic algorithm and the like, and methods such as an artificial potential field method, optimal interaction collision avoidance and the like are mainly applied to local dynamic collision avoidance. Although the methods have good performance in respective application backgrounds, different functional modules need to be elaborately designed, and parameters need to be integrally configured and adjusted for a comprehensive algorithm, so that the unmanned ship intelligent algorithm is complex and tedious to realize. Furthermore, because these methods lack the ability of autonomous learning, it is difficult to adapt to large-scale complex environments, and different algorithm modules need to be redesigned and recombined to cooperate with each other according to different environments.

Disclosure of Invention

The invention aims to overcome the defects in the prior art and provide a reinforcement learning-based unmanned ship hybrid perception autonomous obstacle avoidance method and system with autonomous learning and environmental characteristic adaptation capabilities.

The purpose of the invention can be realized by the following technical scheme:

an unmanned ship hybrid perception autonomous obstacle avoidance method based on reinforcement learning comprises the following steps:

1) building a marine environment: establishing an interaction rule between the unmanned ship and the marine environment, generating random obstacles, and randomly generating an initial point and a final point of the unmanned ship;

2) setting an action space and a state space: setting an action space according to the situation of the propeller of the unmanned ship, and learning according to global planning information provided by the static chart and obstacle information in the detection radius range of the radar system to obtain a reinforcement learning state code;

3) determining a reward function: setting reward target weight to obtain a comprehensive reward function;

4) establishing and training an evaluation network and a strategy network: the evaluation network and the strategy network are respectively formed by connecting a state coding network and a perceptron, and network parameters are initialized and trained;

5) and the intelligent agent decision controller outputs: and respectively inputting the reinforcement learning state codes into the evaluation network and the strategy network, inputting the comprehensive reward function into the evaluation network, and determining the output of the controller according to the action corresponding to the mean value of the learned strategy network.

Preferably, the interaction rule between the unmanned ship and the marine environment in the step 1) follows the own kinetic equation of the unmanned ship.

Preferably, the random obstacles generated in step 1) include 4 kinds: random static obstacles that can be delineated by a chart, random dynamic obstacles that cannot be delineated by a chart, random dynamic obstacles with autonomous control capability, and random dynamic obstacles without autonomous control capability.

Preferably, the motion space in step 2) includes discretized yaw force, pitch force and yaw.

Preferably, the strong learning state code in step 2) is obtained through deep network learning, and specifically includes:

and learning the characteristics of the static sea chart by combining the convolution neural network and full connection to obtain a static programming state code, taking the static programming state code and the dynamic obstacle avoidance state code fed back by the radar system processing as the key characteristics of the reinforcement learning state code, and redistributing the importance by learning the whole weight matrix to obtain the final reinforcement learning state code.

Preferably, the dynamic obstacle avoidance state code is:

wherein，σ_tFor detecting the obstacle mark in the detection radius range,

The distance between the unmanned boat and the target in the world coordinate system,

the angle of the unmanned ship in the world coordinate system from the target is psi, the yaw angle and u of the unmanned ship in the world coordinate system are u_tIs the surging speed, v, of the coordinate system of the unmanned ship_tIs the swaying speed, r, of the coordinate system of the unmanned ship_tThe yaw speed of the coordinate system of the unmanned ship,

is the nearest obstacle distance in the world coordinate system,

the subscript t denotes the time t, which is the nearest obstacle angle in the world coordinate system.

Preferably, the comprehensive reward function in step 3) is a product of a reward target weight matrix and a reward target, and the reward target includes: a distance reward objective, an obstacle avoidance reward objective, a speed reward objective, and an energy consumption reward objective.

Preferably, the reward objectives are obtained by:

in the task of navigating the unmanned ship to the target point, if

Then the distance to the reward target R_distanceNot all right 1, otherwise R_distance＝0，

The distance between the unmanned ship and the target in the world coordinate system is shown, subscript t represents the time t, and subscript t +1 represents the time t + 1;

when the radar detects an obstacle and is within the range threatened by the obstacle, if

Obstacle avoidance reward target R_obstanceNot all right 1, otherwise R_obstance＝0，

The subscript t represents the time t, and the subscript t +1 represents the time t + 1;

if it is not

Then the speed reward target R_speedNot all right 1, otherwise R_speed＝0，u_tIs the surging speed, v, of the coordinate system of the unmanned ship_tIs the swaying speed, v, of the coordinate system of the unmanned ship_thSetting a speed threshold;

if it is not

Then the target R is rewarded for energy consumption_consumptionNot all right 1, otherwise R_consumption＝0，τ_uIs the surging force, tau, of the unmanned boat_rIs the bow shaking force, tau, of the unmanned boat_thA threshold is set for energy consumption.

Preferably, step 4) is done based on the A3C algorithm.

An unmanned ship hybrid perception autonomous obstacle avoidance system based on reinforcement learning comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the autonomous obstacle avoidance method when running the computer program.

Compared with the prior art, the invention has the following advantages:

the algorithm has high self-learning capacity, and can adapt to different large-scale complex environments through simple deployment training, so that autonomous perception, autonomous navigation and autonomous obstacle avoidance are realized;

the algorithm integrates the functions of environmental perception and navigation obstacle avoidance, and gets rid of the heavy burden of respective configuration and overall parameter adjustment caused by modular algorithm design;

the algorithm has the static planning and dynamic collision avoidance capabilities, on one hand, the track planning can be realized by learning a static sea chart, on the other hand, the algorithm can deal with sea surface real-time threats, and has reliable and stable threat avoidance capabilities.

Drawings

Fig. 1 is a schematic overall structure diagram of the unmanned ship hybrid sensing autonomous obstacle avoidance method based on reinforcement learning.

Fig. 2 is a schematic diagram of state coding of the unmanned ship hybrid perception reinforcement learning algorithm.

Fig. 3 is a parameter explanatory diagram of dynamic obstacle avoidance coding.

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments. Note that the following description of the embodiments is merely a substantial example, and the present invention is not intended to be limited to the application or the use thereof, and is not limited to the following embodiments.

Examples

As shown in fig. 1, an unmanned surface vehicle hybrid perception autonomous obstacle avoidance method based on reinforcement learning includes the following steps:

the interaction rule of the unmanned ship and the marine environment follows the self kinetic equation of the unmanned ship:

wherein eta is [ x, y, psi ═ x, y, psi]^TContaining unmanned boat position and yaw angle information, v ═ u, upsilon, r]^TIncluding yaw, pitch, yaw rate information, [ tau ═_u,0,τ_t]^TThe longitudinal and heading forces of the unmanned ship, M is the mass of the unmanned ship, and R (psi) is the yaw angleψ, C (v), g (v) are each a function of v;

the random obstacles generated include 4 kinds: random static obstacles that can be delineated by a chart, random dynamic obstacles that cannot be delineated by a chart, random dynamic obstacles with autonomous control capability, and random dynamic obstacles without autonomous control capability.

And 4 times of initial points and target points are randomly set for each generated marine environment, and the intelligent agent can interact for 500 times for the marine environments with different initial points and target points.

the motion space comprises discretized swaying force, discretized surging force and discretized yawing force;

the reinforcement learning state code is obtained through deep network learning, and specifically comprises the following steps:

Preferably, the dynamic obstacle avoidance state code is:

wherein σ_tFor detecting the obstacle mark in the detection radius range,

the angle of the unmanned ship in the world coordinate system is from the target, and psi is worldYaw angle u of unmanned ship in boundary coordinate system_tIs the surging speed, v, of the coordinate system of the unmanned ship_tIs the swaying speed, r, of the coordinate system of the unmanned ship_tThe yaw speed of the coordinate system of the unmanned ship,

is the nearest obstacle distance in the world coordinate system,

The action space of the under-actuated unmanned ship is discretized output of the surging force and the yawing force, and each propulsion force is discretized into 20 levels according to the thrust level. Referring to fig. 2, the state code learning process of reinforcement learning, static programming state codes, i.e., sea chart features, are obtained through combined network learning of CNN and FC, and are finally compressed into 256-dimensional vectors. The diagram of the nine-tuple in the dynamic obstacle avoidance state coding information is shown in fig. 3. The reinforcement learning state code is a 265-dimensional vector of the combination of the two codes, and is obtained by multiplying the two state codes by a learned weight matrix.

the composite reward function is the product of a reward target weight matrix and reward targets, and the reward targets comprise: a distance reward objective, an obstacle avoidance reward objective, a speed reward objective, and an energy consumption reward objective.

The reward objectives are obtained by:

in the task of navigating the unmanned ship to the target point, if

if it is not

if it is not

4) Establishing and training an evaluation network and a strategy network: the evaluation network and the strategy network are completed based on an A3C algorithm, the evaluation network and the strategy network are respectively formed by connecting a state coding network and a perceptron, network parameters are initialized and trained, and in the network training process, the gradient calculation of the evaluation network meets the following updating method:

the gradient calculation of the policy network satisfies the following updating method:

wherein w is a network parameter of the policy network, θ is a network parameter of the evaluation network, s_tCoding for the dynamic obstacle avoidance state of the unmanned ship at the moment t, a_tIs the decision of the unmanned ship at the time t, pi (a)_t|s_tω) is a policy network at s_tAction of output in the state r_tMaking a for unmanned boat_tReward value given by the environment after decision, V(s)_tTheta) is in the state s_tThe value of the network prediction is evaluated.

Two methods for obtaining mixed perception state coding are obtained while parameters are updated through network learning to obtain V(s) and pi (a | s).

In the embodiment, during training, the output of the controller, i.e. action selection, is obtained by sampling according to the learned mean-variance strategy distribution. And when the unmanned ship collides, ending the training of the current round in advance, if the current target point and the initial point complete 500 training rounds, returning to the step 1, regenerating the target point and the initial point, and if the current environment has been set with 4 initial points and target points, regenerating the marine environment.

And (4) regenerating the marine environment, the initial point and the target point under the actual test environment, carrying out interactive observation global planning and local obstacle avoidance information on the unmanned ship and the marine environment, obtaining a reinforcement learning state code through the network trained in the step (4), and executing the action corresponding to the strategy distribution mean value under the state code, namely controller output, so as to complete the set marine operation task.

The above embodiments are merely examples and do not limit the scope of the present invention. These embodiments may be implemented in other various manners, and various omissions, substitutions, and changes may be made without departing from the technical spirit of the present invention.

Claims

1. An unmanned ship hybrid perception autonomous obstacle avoidance method based on reinforcement learning is characterized by comprising the following steps:

2. The unmanned ship hybrid perception autonomous obstacle avoidance method based on reinforcement learning of claim 1 is characterized in that in step 1), the unmanned ship and marine environment interaction rules follow the own kinetic equation of the unmanned ship.

3. The unmanned ship mixed perception autonomous obstacle avoidance method based on reinforcement learning of claim 1, wherein the random obstacles generated in step 1) include 4 kinds: random static obstacles that can be delineated by a chart, random dynamic obstacles that cannot be delineated by a chart, random dynamic obstacles with autonomous control capability, and random dynamic obstacles without autonomous control capability.

4. The unmanned ship hybrid perception autonomous obstacle avoidance method based on reinforcement learning of claim 1, wherein the motion space in step 2) comprises discretized swaying force, surging force and yawing.

5. The unmanned ship hybrid perception autonomous obstacle avoidance method based on reinforcement learning of claim 1, wherein the strong learning state code in step 2) is obtained through deep network learning, and specifically comprises:

6. The unmanned ship hybrid perception autonomous obstacle avoidance method based on reinforcement learning of claim 5, wherein the dynamic obstacle avoidance state coding is as follows:

wherein σ_tFor detecting the obstacle mark in the detection radius range,

the angle of the unmanned ship in the world coordinate system from the target is psi, the yaw angle and u of the unmanned ship in the world coordinate system are u_tIs the surging speed, v, of the coordinate system of the unmanned ship_tIs an unmanned boatCoordinate system yaw rate, r_tThe yaw speed of the coordinate system of the unmanned ship,

is the nearest obstacle distance in the world coordinate system,

7. The unmanned ship hybrid perception autonomous obstacle avoidance method based on reinforcement learning of claim 1, wherein the comprehensive reward function in the step 3) is a product of a reward target weight matrix and a reward target, and the reward target comprises: a distance reward objective, an obstacle avoidance reward objective, a speed reward objective, and an energy consumption reward objective.

8. The unmanned ship hybrid perception autonomous obstacle avoidance method based on reinforcement learning of claim 7, wherein the reward targets are obtained by:

in the task of navigating the unmanned ship to the target point, if

if it is not

if it is not

9. The unmanned ship hybrid perception autonomous obstacle avoidance method based on reinforcement learning of claim 1, wherein step 4) is performed based on an A3C algorithm.

10. An unmanned ship hybrid perception autonomous obstacle avoidance system based on reinforcement learning, comprising a memory, a processor and a computer program which is stored on the memory and can be run on the processor, wherein when the computer program is run by the processor, the autonomous obstacle avoidance method according to any one of claims 1 to 9 is realized.