CN109241552B - Underwater robot motion planning method based on multiple constraint targets - Google Patents

Underwater robot motion planning method based on multiple constraint targets

Info

Publication number
CN109241552B
Authority
CN
China
Prior art keywords: robot, constraint, training, action, underwater
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810764979.8A
Other languages
Chinese (zh)
Other versions
CN109241552A (en)
Inventor
张国成
程俊涵
孙玉山
盛明伟
冉祥瑞
王力锋
焦文龙
王子楷
贾晨凯
吴凡宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201810764979.8A
Publication of CN109241552A
Application granted
Publication of CN109241552B
Legal status: Active

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 — Computer-aided design [CAD]
    • G06F 30/20 — Design optimisation, verification or simulation
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2111/00 — Details relating to CAD techniques
    • G06F 2111/04 — Constraint-based CAD

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Manipulator (AREA)

Abstract

An underwater robot motion planning method based on multiple constraint targets belongs to the fields of machine learning and underwater robot motion planning. Model construction stage: the signals of the robot's obstacle-avoidance sonar and the flow-velocity signals of the flow-velocity sensor are converted into the current environment state; a discrete action space is established according to the dynamic constraints; a reward function is established with underwater obstacles as constraints; and a Markov decision process based on multi-objective constraints is established as the basis for realizing the algorithm. Training stage: training is carried out with a Q-learning algorithm; actions are executed in the current environment according to a greedy strategy, and the strategy is evaluated and updated after each execution, improving it until it is adapted to the environment and the planning objective is achieved. The invention considers multiple constraint targets such as the water flow, obstacles, and the target, and combines the reinforcement learning method with the underwater multi-constraint targets to realize motion planning of the underwater robot; it has strong real-time performance and is applicable to various environments.

Description

Underwater robot motion planning method based on multiple constraint targets
Technical Field
The invention belongs to the field of machine learning and underwater robot motion planning, and particularly relates to an underwater robot motion planning method based on multiple constraint targets.
Background
The intelligent underwater robot has broad application prospects in marine scientific research, marine development, underwater engineering, military use, and other fields. It generally works in a complex marine environment; to better complete various tasks and ensure its own safety, it needs autonomous motion planning capability in an unknown environment, avoiding obstacles and navigating to a target point.
Traditional underwater robot motion planning techniques need to construct a global map in advance. When the environment changes, the model must be re-established; adaptability is poor and practicality is limited. Reinforcement learning is an unsupervised learning method and a process of continual trial and error: knowledge is obtained through continual action and evaluation, and the strategy is improved to adapt to the environment, so that the final evaluation-function value is maximized and the learning objective is achieved.
Reinforcement learning has already been applied to underwater robots, but traditional reinforcement-learning-based motion planning methods for underwater robots consider a single constraint target and do not consider the influence on underwater robot motion under multi-objective constraints such as the water-flow constraint, target constraint, and obstacle constraint.
Disclosure of Invention
The aim of the invention is to provide a motion planning method for an underwater robot based on multiple constraint targets. An underwater robot dynamics model under the influence of water flow is constructed; multiple constraint targets are fused in combination with a reinforcement learning method; reasonable reward signals and action spaces are constructed; and an optimal control strategy for the underwater robot is output through training. In addition, the underwater multi-constraint targets are combined with the Q-learning algorithm in reinforcement learning, so that the underwater robot can acquire environmental characteristics in an unknown underwater environment, carry out strategy iteration, and complete its motion planning.
The purpose of the invention is realized as follows:
an underwater robot motion planning method based on multiple constraint targets is divided into a model construction stage and an algorithm training stage, and specifically comprises the following steps:
(1) Model construction stage: specifically, construction of a Markov decision process E. A reinforcement learning task can generally be described by a Markov decision process; owing to the particularity of the underwater environment, the Markov decision process is constructed here considering multi-objective constraints such as the environment constraint, obstacle constraint, and target-point constraint, specifically:
(1-1) establishing the current environment x_t from the sensor signals; letting the distance to the obstacle in the i-th degree of freedom of the robot be l_i, and setting l_i to infinity if there is no obstacle in the i-th degree of freedom; setting the flow velocity at the robot's position as vc; positioning the robot in real time and calculating the Euclidean distance d between the robot and a target point;
(1-2) establishing an action space A of the robot according to the maximum forward speed of the underwater robot, wherein A consists of five motion commands, namely forward, forward-left, forward-right, push-left, and push-right, with linear velocity v_a and angular velocity ω_a;
(1-3) considering the obstacle constraint: setting an underwater alert safety distance h_i for the i-th degree of freedom; if l_i < h_i is detected, a collision is considered to have occurred and a negative reward r_ter is set;
(1-4) considering the target-point constraint with target-point threshold d': if d is detected to be increasing, a negative reward r_opp is set; if d is detected to be decreasing, a positive reward r_move is set; if d < d' is detected, the robot has arrived at the target point and a positive reward r_arr is set.
(2) Algorithm training stage: the robot undergoes continual trial and error in computer simulation and learns a strategy, specifically:
(2-1) initializing t = 0, where t denotes the number of steps the robot has moved in the current training episode; initializing r_t = 0, where r_t denotes the reward obtained when the robot performs the t-th action;
(2-2) initializing a matrix Q(x, a), which records the Q value obtainable by selecting action a in state x, initialized to 0;
(2-3) initializing a counter count = 0 to record the total number of training episodes; setting a value M, meaning the robot needs to be trained M times in total;
(2-4) when count is less than the specified number of training episodes M, executing (2-5); otherwise executing (2-14);
(2-5) acquiring the sensor signals to obtain the current state x_t, including the obstacle distance l_i in the i-th degree-of-freedom direction of the robot (set to infinity if there is no obstacle) and the ocean-current velocity vc_t at the current position; obtaining the robot's own position information and calculating the Euclidean distance d between the target point and the robot;
(2-6) selecting action a_t according to the matrix Q;
(2-7) considering the kinematic constraint and the water-flow constraint, combining the selected action a_t with the flow velocity according to the velocity-composition formula for the speed the robot actually presents outward, simulating with the combined velocity, and updating l_i;
(2-8) if l_i < h_i, executing (2-9); otherwise executing (2-10);
(2-9) a collision has occurred: setting r_t = r_ter, ending the episode, setting x_{t+1} empty, updating the matrix Q, setting count = count + 1 and t = 0, and re-executing training from (2-4);
(2-10) if d' < d, executing (2-11); otherwise the target point has been reached: ending the episode, setting r_t = r_arr, setting x_{t+1} empty, updating the matrix Q, setting count = count + 1 and t = 0, and re-executing training from (2-4);
(2-11) if d_t < d_{t-1}, executing (2-12); otherwise executing (2-13);
(2-12) d has decreased: setting r_t = r_move, updating x_{t+1}, updating the matrix Q, setting t = t + 1, and re-executing training from (2-5);
(2-13) d has increased: setting r_t = r_opp, updating x_{t+1}, updating the matrix Q, setting t = t + 1, and re-executing training from (2-5);
(2-14) finishing training to obtain the trained matrix Q;
(2-15) outputting the underwater robot motion planning strategy.
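As an illustration, the Q-learning loop of steps (2-1) to (2-15) can be sketched on a toy one-dimensional world in place of the simulated ocean environment. Everything numeric here is an illustrative assumption, not from the patent: the world size, the reward values, the learning rate, the discount factor, and the exploration rate; the five motion commands are reduced to two actions.

```python
import random

# Toy stand-in for the training loop of steps (2-1)-(2-15): a 1-D world of N
# cells, two actions (left / right), rewards shaped like r_move / r_opp / r_arr.
ACTIONS = [-1, +1]           # stand-ins for the five motion commands
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1   # assumed learning rate, discount, exploration
GOAL, START, N = 9, 0, 10    # assumed target cell, start cell, world length

def train(M=200, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N)]      # (2-2): matrix Q(x, a), initialised to 0
    for count in range(M):                  # (2-3)/(2-4): M training episodes
        x = START
        for t in range(100):
            # (2-6): greedy selection of a_t with occasional random exploration
            if rng.random() < EPS:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda i: Q[x][i])
            x1 = min(max(x + ACTIONS[a], 0), N - 1)   # (2-7): execute the action
            done = x1 == GOAL
            # reward shaping analogous to r_arr / r_move / r_opp
            r = 10.0 if done else (1.0 if abs(GOAL - x1) < abs(GOAL - x) else -1.0)
            # (2-9)/(2-12)/(2-13): update the matrix Q
            target = r if done else r + GAMMA * max(Q[x1])
            Q[x][a] += ALPHA * (target - Q[x][a])
            if done:
                break
            x = x1
    return Q                                # (2-14): the trained matrix Q

Q = train()
# (2-15): the motion planning strategy is the greedy policy read from Q.
policy = [max((0, 1), key=lambda i: Q[c][i]) for c in range(N)]
```

After training, the greedy policy read from the matrix Q steps toward the goal from every cell, which is the analogue of step (2-15).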
The kinematic constraint, i.e. the motion constraint on the underwater robot during training, is as follows: assuming the coordinates of the vehicle's centre of gravity in the fixed coordinate system are (x, y), the velocity of the robot in the fixed coordinate system is:
[Two velocity-component equations, rendered as images in the original document, are not reproduced here.]
where θ is the longitudinal inclination (trim) angle, φ is the transverse inclination (heel) angle, and α is the influence coefficient of the motion constraint on the speed of the underwater robot.
The water-flow constraint is considered as follows when selecting actions during training: during learning and training, in state x_t the flow velocity vc_t is obtained by the ADCP; according to the strategy, the robot selects an action a_t from the action set, with its own velocity
[equation rendered as an image in the original document; not reproduced]
When the robot executes the action, the water-flow constraint is considered, and the navigation speed actually presented outward is V_it = v_at + β·vc_t, where β is the influence coefficient of the water flow on the speed of the underwater robot.
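A minimal sketch of this velocity composition, applied component-wise to planar velocity vectors; the value of the influence coefficient β is an illustrative assumption, since the patent does not fix it.

```python
# Sketch of the water-flow constraint: the robot's commanded velocity v_at is
# combined with the local current vc_t, scaled by an assumed coefficient beta.
BETA = 0.5  # assumed influence coefficient of the current (illustrative)

def presented_velocity(va, vc, beta=BETA):
    """V_it = v_at + beta * vc_t, applied component-wise to 2-D vectors."""
    return tuple(a + beta * c for a, c in zip(va, vc))
```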
The specific method for selecting action a_t is: adopting a greedy strategy, setting a threshold ε and generating a random number ε' by computer; if the random number is smaller than the threshold, i.e. ε' < ε, the robot executes the action corresponding to the maximum element of Q(x_t, a) in the Q matrix, i.e. a_t = max_a Q(x_t, a); if the random number is larger than the threshold, i.e. ε' > ε, the robot randomly selects an action to execute, i.e. a_t = random_a Q(x_t, a).
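A minimal sketch of this greedy selection, following the comparison direction stated here (ε' < ε exploits the Q matrix; otherwise a random action is taken); the number of actions is left generic and ε is an illustrative parameter.

```python
import random

# Greedy selection of a_t as described above: generate epsilon', compare it
# with the threshold epsilon, then exploit or explore accordingly.
def select_action(q_row, eps, rng=random):
    """q_row[a] holds Q(x_t, a) for each action a in the current state x_t."""
    eps_prime = rng.random()                  # the random number epsilon'
    if eps_prime < eps:                       # exploit: a_t = max_a Q(x_t, a)
        return max(range(len(q_row)), key=q_row.__getitem__)
    return rng.randrange(len(q_row))          # explore: random action
```

Under this comparison direction the robot explores with probability 1 - ε, so ε would be chosen close to 1 while exploration is still desired.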
The method for updating the matrix Q is as follows: assume the state of the robot before executing the action is x_t, the action executed is a_t, the reward obtained from feedback is r_t, and the state reached after executing the action is x_{t+1}; then

Q(x_t, a_t) ← (1 − α)·Q(x_t, a_t) + α·(r_t + γ·max_a' Q(x_{t+1}, a'))

where α is the learning rate and γ is the discount factor.
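The update rule can be sketched in standard Q-learning form: the first term keeps a (1 − α) share of the previous estimate Q(x_t, a_t), and a terminal transition (x_{t+1} empty, as in steps (2-9) and (2-10)) drops the max term. The α and γ values below are illustrative assumptions.

```python
# Sketch of the Q-value update above; Q maps a state to a list of Q values,
# one per action.  x_next=None models the "x_{t+1} empty" terminal case.
def update_q(Q, x, a, r, x_next, alpha=0.5, gamma=0.9):
    """Q(x_t,a_t) <- (1-alpha)*Q(x_t,a_t) + alpha*(r_t + gamma*max_a' Q(x_{t+1},a'))."""
    future = max(Q[x_next]) if x_next is not None else 0.0
    Q[x][a] = (1 - alpha) * Q[x][a] + alpha * (r + gamma * future)
```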
The invention has the beneficial effects that:
(1) The invention considers multiple constraint targets such as the water flow, obstacles, and the target, whereas traditional reinforcement-learning planning methods do not consider multiple constraint targets simultaneously; the training method is therefore practical and robust;
(2) The invention combines the reinforcement learning method with the underwater multi-constraint targets to realize motion planning of the underwater robot; it has strong real-time performance and is applicable to various environments.
Drawings
FIG. 1 is a model construction schematic diagram of an underwater robot motion planning method based on multiple constraint targets;
fig. 2 is a flow chart executed in a training phase of a multi-constraint target-based underwater robot motion planning method.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings:
the invention relates to a motion planning method for an underwater robot, in particular to a method for combining multi-target constraint and reinforcement learning and used for motion planning of the underwater robot. A model construction stage: converting the signals of the obstacle avoidance sonar of the robot and the flow velocity signals of the flow velocity sensor into the current environment; establishing a discrete action space based on the dynamic constraint of the underwater robot; establishing a reward function by taking an underwater obstacle as a constraint; and establishing a Markov decision process based on multi-target constraint to establish a basis for algorithm realization. A training stage: training is carried out based on a Q learning algorithm, actions are executed based on a greedy strategy in the current environment, the strategy is evaluated and updated once each time the strategy is executed, the strategy is improved until the strategy is adaptive to the environment, and the planning purpose is achieved. The invention combines the reinforcement learning method with the underwater multi-constraint target, realizes the motion planning of the underwater robot, has stronger real-time performance and can be suitable for various environments.
In view of the particularity of the underwater environment, the invention trains the motion planning strategy of the underwater robot by combining multiple constraint targets with a reinforcement learning method. The method comprises a model construction stage and a strategy training stage, as follows:
1. the model construction stage is shown in fig. 1, and comprises the following specific steps:
the reinforcement learning task may be generally described by a Markov decision process. Due to the particularity of the underwater environment, a Markov decision process is constructed by considering multi-target constraints such as environment constraint, obstructive object constraint, target point constraint and the like.
The specific composition of the state space X: first, the obstacle-avoidance sonar of the underwater robot processes the obstacle information of the robot's environment, namely the obstacle distance l_i in the i-th degree-of-freedom direction of the robot; second, the ADCP processes the ocean-current information of the robot's environment, namely the flow velocity vc at the robot's position; third, the GPS processes the relative position information of the robot and the target point, namely the Euclidean distance d between the robot and the target point.
The specific composition of the action space A: the action space in the invention comprises five control commands, named forward, forward-left, forward-right, push-left, and push-right. The linear velocity of the robot is a fixed value v_a.
The specific composition of the reward function R: if the robot collides, the reward value is r_ter; if the robot does not collide but moves farther and farther from the target point, the reward value is r_opp; if the robot does not collide and moves closer to the target point, the reward value is r_move; if the robot arrives at the target point, the reward value is r_arr.
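The reward composition above can be sketched as a single function combining the obstacle constraint (1-3) and the target-point constraint (1-4); the numeric reward values are illustrative assumptions, as the patent leaves them unspecified.

```python
# Hypothetical values for the four rewards named in the text (illustrative).
R_TER = -10.0   # collision penalty r_ter
R_OPP = -1.0    # moving away from the target, r_opp
R_MOVE = 1.0    # moving toward the target, r_move
R_ARR = 10.0    # target reached, r_arr

def reward(obstacle_dists, safety_dists, d, d_prev, d_prime):
    """Return (reward, episode_done) for one step.

    obstacle_dists[i] -- distance l_i to the nearest obstacle in DOF i
    safety_dists[i]   -- alert safety distance h_i for DOF i
    d, d_prev         -- current and previous Euclidean distance to the target
    d_prime           -- target-point threshold d'
    """
    # Obstacle constraint: any l_i < h_i counts as a collision.
    if any(l < h for l, h in zip(obstacle_dists, safety_dists)):
        return R_TER, True
    # Target-point constraint: arrival, then approach / retreat.
    if d < d_prime:
        return R_ARR, True
    return (R_MOVE, False) if d < d_prev else (R_OPP, False)
```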
2. In the strategy training phase, the process is shown in fig. 2, and the specific steps are as follows:
firstly, establishing a virtual environment for training, wherein the specific method comprises the following steps:
a simulated marine environment is established by using robot motion simulation software, and obstacles, target points and ocean currents are set in the virtual environment. The obstacles and target points may be randomly defined and 6-12 different robot starting points defined.
The two-dimensional plane space is rasterized, and the ocean current within each grid cell can be regarded as uniform. A flow field is randomly generated by a stream function ψ(x, y), and the velocity field of the ocean current can be obtained from the stream function:
vc_x = ∂ψ/∂y,  vc_y = −∂ψ/∂x

which, due to the incompressibility of the fluid, satisfies

∂vc_x/∂x + ∂vc_y/∂y = 0
where vc_x and vc_y are the velocity components of the ocean current along the X-axis and Y-axis directions at position (x, y), the centre point of each grid cell being taken as (x, y).
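The stream-function relations above can be checked numerically. The particular ψ below is an illustrative choice (not from the patent), and the derivatives are taken by central finite differences, as one might do at each grid-cell centre.

```python
import math

# Derive the current field from a stream function psi(x, y):
# vc_x = d(psi)/dy, vc_y = -d(psi)/dx, which automatically satisfies the
# incompressibility condition d(vc_x)/dx + d(vc_y)/dy = 0.
def psi(x, y):
    return math.sin(x) * math.cos(y)     # assumed stream function (illustrative)

def current_at(x, y, h=1e-5):
    """Central-difference evaluation of (vc_x, vc_y) at a grid-cell centre."""
    vcx = (psi(x, y + h) - psi(x, y - h)) / (2 * h)    # d(psi)/dy
    vcy = -(psi(x + h, y) - psi(x - h, y)) / (2 * h)   # -d(psi)/dx
    return vcx, vcy
```

For this ψ, vc_x = −sin(x)·sin(y) and vc_y = −cos(x)·cos(y), so the divergence −cos(x)·sin(y) + cos(x)·sin(y) vanishes identically, as the incompressibility condition requires.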
Performing strategy training, which comprises the following specific steps:
1) Initialize t = 0, where t denotes the number of steps the robot has moved in the current training episode, and initialize r_t = 0, where r_t denotes the reward obtained when the robot performs the t-th action. Define a matrix Q(x, a), recording the Q value obtainable by selecting action a in state x, initialized to 0. Initialize the counter count = 0, recording the total number of training episodes. Set the value M, meaning the robot is to be trained M times in total. Initialize the safety radius h_i for the i-th degree-of-freedom direction of the underwater robot. Set the value d', the threshold of the distance between the robot and the target point.
2) Initializing the state of the robot, and randomly selecting a starting point to begin exploration.
3) The robot acquires the environment information x_t, including the distance l_i between the obstacle and the robot in the i-th degree-of-freedom direction (set to infinity if there is no obstacle) and the ocean-current velocity vc at the current position; it obtains its own position information and calculates the Euclidean distance d between the target point and the robot.
4) Set a threshold ε and generate a random number ε' by computer. If ε' < ε, the robot randomly selects an action in the action space to execute, i.e. a_t = random_a Q(x_t, a); if ε' > ε, the robot selects the action a with the largest value in state x_t according to the matrix Q(x, a), i.e. a_t = max_a Q(x_t, a).
5) Considering the kinematic constraint and the water-flow constraint, the robot moves in the simulation environment at the actually presented speed V_it given by the velocity-composition formula.
6) After the robot has performed action a_t, the environment information x_{t+1} is acquired again.
6-1) If l_i < h_i, a collision has occurred and the training episode ends; the counter is incremented (count + 1) and the matrix Q is updated according to

Q(x_t, a_t) ← (1 − α)·Q(x_t, a_t) + α·(r_t + γ·max_a' Q(x_{t+1}, a'))

The training step count t is reset to 0; if count < M, retraining starts from step 2); if count = M, step 7) is executed.
6-2) If l_i > h_i, no collision has occurred; continue to judge whether the target point has been reached.
6-2-1) If d ≤ d', the target point has been reached and the training episode ends; the counter is incremented (count + 1), the matrix Q is updated, and t is reset to 0; if count < M, retraining starts from step 2); if count = M, step 7) is executed.
6-2-2) If d > d', the target point has not been reached; set t = t + 1 and continue training from step 3).
7) After training is finished, output the motion planning strategy of the underwater robot.
The method considers multiple constraint targets such as the water flow, obstacles, and the target; traditional reinforcement-learning planning methods do not consider multiple constraint targets simultaneously, and such training methods lack practicality and robustness. The invention fuses the features of the multiple constraint targets through reinforcement learning and can train a more practical underwater robot motion planning strategy.

Claims (5)

1. A motion planning method for an underwater robot based on multiple constraint targets, characterized by comprising a model construction stage and an algorithm training stage, and comprising the following steps:
(1) a model construction stage: model construction of a Markov decision process E, comprising the following steps:
(1-1) establishing the current environment x_t from the sensor signals; letting the distance to the obstacle in the i-th degree of freedom of the robot be l_i, and setting l_i to infinity if there is no obstacle in the i-th degree of freedom; setting the flow velocity at the robot's position as vc; positioning the robot in real time and calculating the Euclidean distance d between the robot and a target point;
(1-2) establishing an action space A of the robot according to the maximum forward speed of the underwater robot, wherein A comprises five motion commands, namely forward, forward-left, forward-right, push-left, and push-right, with linear velocity v_a and angular velocity ω_a;
(1-3) considering the obstacle constraint: setting an underwater alert safety distance h_i for the i-th degree of freedom; if l_i < h_i is detected, a collision is considered to have occurred and a negative reward r_ter is set;
(1-4) considering the target-point constraint: setting the target-point threshold as d'; if d is detected to be increasing, setting a negative reward r_opp; if d is detected to be decreasing, setting a positive reward r_move; if d < d' is detected, the robot has arrived at the target point and a positive reward r_arr is set;
(2) an algorithm training stage: the robot undergoes continual trial and error in computer simulation to learn a strategy, comprising the following steps:
(2-1) initializing t = 0, where t denotes the number of steps the robot has moved in the current training episode; initializing r_t = 0, where r_t denotes the reward obtained when the robot performs the t-th action;
(2-2) initializing a matrix Q(x, a), which records the Q value obtained by selecting action a in state x;
(2-3) initializing a counter count = 0 to record the total number of training episodes; setting a value M, meaning the robot needs to be trained M times in total;
(2-4) when count is less than the specified number of training episodes M, executing (2-5); otherwise executing (2-14);
(2-5) acquiring the sensor signals to obtain the current state x_t, wherein x_t includes the obstacle information, namely the obstacle distance l_i in the i-th degree-of-freedom direction of the robot, the ocean-current velocity vc_t at the current position, and the robot's own position information, from which the Euclidean distance d between the target point and the robot is calculated;
(2-6) selecting action a_t according to the matrix Q;
(2-7) considering the kinematic constraint and the water-flow constraint, combining the selected action a_t with the flow velocity, simulating with the actually presented navigation speed obtained from the combination, and updating l_i;
(2-8) if l_i < h_i, executing (2-9); otherwise executing (2-10);
(2-9) a collision has occurred: setting r_t = r_ter, ending the episode, setting x_{t+1} empty, updating the matrix Q, setting count = count + 1 and t = 0, and re-executing training from (2-4);
(2-10) if d' < d, executing (2-11); otherwise the target point has been reached: ending the episode, setting r_t = r_arr, setting x_{t+1} empty, updating the matrix Q, setting count = count + 1 and t = 0, and re-executing training from (2-4);
(2-11) if d_t < d_{t-1}, executing (2-12); otherwise executing (2-13);
(2-12) d has decreased: setting r_t = r_move, updating x_{t+1}, updating the matrix Q, setting t = t + 1, and re-executing training from (2-5);
(2-13) d has increased: setting r_t = r_opp, updating x_{t+1}, updating the matrix Q, setting t = t + 1, and re-executing training from (2-5);
(2-14) finishing training to obtain the trained matrix Q;
(2-15) outputting the underwater robot motion planning strategy.
2. The underwater robot motion planning method based on multiple constraint targets according to claim 1, characterized in that the kinematic constraint, i.e. the motion constraint on the underwater robot during training, is: assuming the coordinates of the vehicle's centre of gravity in the fixed coordinate system are (x, y), the velocity of the robot in the fixed coordinate system is:
[Two velocity-component equations, rendered as images in the original document, are not reproduced here.]
where θ is the longitudinal inclination (trim) angle, φ is the transverse inclination (heel) angle, and α is the influence coefficient of the motion constraint on the speed of the underwater robot.
3. The underwater robot motion planning method based on multiple constraint targets according to claim 1, characterized in that the water-flow constraint is considered as follows when selecting actions during training: during learning and training, in state x_t the flow velocity vc_t is obtained by the ADCP; according to the strategy, the robot selects an action a_t from the action set, with its own velocity
[equation rendered as an image in the original document; not reproduced]
When the robot executes the action, the water-flow constraint is considered, and the navigation speed actually presented outward is V_it = v_at + β·vc_t, where β is the influence coefficient of the water flow on the speed of the underwater robot.
4. The underwater robot motion planning method based on multiple constraint targets according to claim 1, characterized in that the specific method for selecting action a_t is: adopting a greedy strategy, setting a threshold ε and generating a random number ε' by computer; if the random number is smaller than the threshold, i.e. ε' < ε, the robot executes the action corresponding to the maximum element of Q(x_t, a) in the Q matrix, i.e. a_t = max_a Q(x_t, a); if the random number is larger than the threshold, i.e. ε' > ε, the robot randomly selects an action to execute, i.e. a_t = random_a Q(x_t, a).
5. The underwater robot motion planning method based on multiple constraint targets according to claim 1, characterized in that the method for updating the matrix Q is: assuming the state of the robot before executing the action is x_t, the action executed is a_t, the reward obtained from feedback is r_t, and the state reached after executing the action is x_{t+1}; then

Q(x_t, a_t) ← (1 − α)·Q(x_t, a_t) + α·(r_t + γ·max_a' Q(x_{t+1}, a'))

where α is the learning rate and γ is the discount factor.
CN201810764979.8A 2018-07-12 2018-07-12 Underwater robot motion planning method based on multiple constraint targets Active CN109241552B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810764979.8A CN109241552B (en) 2018-07-12 2018-07-12 Underwater robot motion planning method based on multiple constraint targets


Publications (2)

Publication Number Publication Date
CN109241552A CN109241552A (en) 2019-01-18
CN109241552B true CN109241552B (en) 2022-04-05

Family

ID=65072571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810764979.8A Active CN109241552B (en) 2018-07-12 2018-07-12 Underwater robot motion planning method based on multiple constraint targets

Country Status (1)

Country Link
CN (1) CN109241552B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196605B (en) * 2019-04-26 2022-03-22 Dalian Maritime University Method for cooperative search of multiple dynamic targets in an unknown sea area by a reinforcement-learning unmanned aerial vehicle cluster
CN110333739B (en) * 2019-08-21 2020-07-31 Harbin Engineering University AUV (Autonomous Underwater Vehicle) behavior planning and action control method based on reinforcement learning
JP7221839B2 (en) * 2019-10-08 2023-02-14 National University Corporation Shizuoka University Autonomous mobile robot and control program for autonomous mobile robot
CN110765686B (en) * 2019-10-22 2020-09-11 PLA Strategic Support Force Information Engineering University Method for designing shipborne sonar sounding lines using limited-band seabed topography
CN110955239B (en) * 2019-11-12 2021-03-02 China University of Geosciences (Wuhan) Multi-target trajectory planning method and system for unmanned ships based on inverse reinforcement learning
CN110779132A (en) * 2019-11-13 2020-02-11 Yaokong Technology (Shanghai) Co., Ltd. Reinforcement-learning-based operation control system for water pump equipment in an air-conditioning system
CN112945232B (en) * 2019-12-11 2022-11-01 Shenyang Institute of Automation, Chinese Academy of Sciences Target value planning method for near-bottom terrain tracking of an underwater robot
CN113222166A (en) * 2020-01-21 2021-08-06 Xiamen Yitong Software Technology Co., Ltd. Machine heuristic learning method, system, and device for operation behavior record management
CN112052511A (en) * 2020-06-15 2020-12-08 Chengdu Rongao Technology Co., Ltd. Air combat maneuver strategy generation technique based on deep stochastic games
CN112149354A (en) * 2020-09-24 2020-12-29 Harbin Engineering University Reinforcement learning algorithm research platform for UUV clusters
CN112650246B (en) * 2020-12-23 2022-12-09 Wuhan University of Technology Ship autonomous navigation method and device
CN112925319B (en) * 2021-01-25 2022-06-07 Harbin Engineering University Dynamic obstacle avoidance method for an autonomous underwater vehicle based on deep reinforcement learning
CN113052257B (en) * 2021-04-13 2024-04-16 Information Science Academy of China Electronics Technology Group Corporation Deep reinforcement learning method and device based on a vision Transformer
CN113110493B (en) * 2021-05-07 2022-09-30 Beijing University of Posts and Telecommunications Path planning device and path planning method based on a photonic neural network
CN114559439B (en) * 2022-04-27 2022-07-26 Nantong Kemei Automation Technology Co., Ltd. Intelligent obstacle avoidance control method and device for a mobile robot, and electronic device
CN116295449B (en) * 2023-05-25 2023-09-12 Jilin University Method and device for path indication of an autonomous underwater vehicle
CN117079118B (en) * 2023-10-16 2024-01-16 Guangzhou Huaxia Huihai Technology Co., Ltd. Underwater walking detection method and system based on visual detection

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102117071A (en) * 2009-12-30 2011-07-06 Shenyang Institute of Automation, Chinese Academy of Sciences Semi-physical simulation system for multiple underwater robots and control method thereof
CN105512769A (en) * 2015-12-16 2016-04-20 Shanghai Jiao Tong University Unmanned aerial vehicle route planning system and method based on genetic programming
CN106950969A (en) * 2017-04-28 2017-07-14 Shenzhen Weiteshi Technology Co., Ltd. Continuous control method for a mobile robot based on a mapless motion planner
CN107065881A (en) * 2017-05-17 2017-08-18 Tsinghua University Global path planning method for robots based on deep reinforcement learning
CN107168309A (en) * 2017-05-02 2017-09-15 Harbin Engineering University Behavior-based path planning method for multiple underwater robots
WO2017161632A1 (en) * 2016-03-24 2017-09-28 Zhangjiagang Institute of Industrial Technologies, Soochow University Cleaning robot optimal target path planning method based on model learning
CN107883961A (en) * 2017-11-06 2018-04-06 Harbin Engineering University Underwater robot path optimization method based on a smooth RRT algorithm
CN107918396A (en) * 2017-11-30 2018-04-17 Shenzhen Institute of Intelligent Robotics Path planning method and system for an underwater cleaning robot based on a hull model
CN108268031A (en) * 2016-12-30 2018-07-10 Shenzhen Kuang-Chi Hezhong Technology Co., Ltd. Path planning method, device, and robot

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A Novel Adaptive Second Order Sliding Mode Path Following Control for a Portable AUV; Zhang Guo-cheng; Ocean Engineering; 2018-03-01; pp. 1-11 *
Research on Recovery Strategy and Motion Control for Autonomous Underwater Vehicle; Yushan Sun; Proceedings of 2018 International Conference on Advanced Control, Automation and Artificial Intelligence (ACAAI 2018); 2018 *
Research on global path planning for AUVs based on a hierarchical Markov decision process; Hong Ye et al.; Journal of System Simulation; 2008-09-05; Vol. 20, No. 9, pp. 2361-2367 *

Also Published As

Publication number Publication date
CN109241552A (en) 2019-01-18

Similar Documents

Publication Publication Date Title
CN109241552B (en) Underwater robot motion planning method based on multiple constraint targets
CN109540151B (en) AUV three-dimensional path planning method based on reinforcement learning
Sun et al. Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning
CN109765929B (en) UUV real-time obstacle avoidance planning method based on improved RNN
CN109782779B (en) AUV path planning method in ocean current environment based on population hyperheuristic algorithm
Cao et al. Target search control of AUV in underwater environment with deep reinforcement learning
CN110632931A (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN106338919B (en) Unmanned surface vehicle trajectory tracking control method based on a reinforcement-learning intelligent algorithm
CN109784201B (en) AUV dynamic obstacle avoidance method based on four-dimensional risk assessment
Wang et al. Improved quantum particle swarm optimization algorithm for offline path planning in AUVs
An et al. Uncertain moving obstacles avoiding method in 3D arbitrary path planning for a spherical underwater robot
Xue et al. Proximal policy optimization with reciprocal velocity obstacle based collision avoidance path planning for multi-unmanned surface vehicles
CN107168309A (en) Behavior-based path planning method for multiple underwater robots
CN112612290B (en) Underwater vehicle three-dimensional multi-task path planning method considering ocean currents
Sun et al. A novel fuzzy control algorithm for three-dimensional AUV path planning based on sonar model
CN110716574A (en) UUV real-time collision avoidance planning method based on deep Q network
Sun et al. Three dimensional D* Lite path planning for Autonomous Underwater Vehicle under partly unknown environment
Yan et al. Reinforcement Learning‐Based Autonomous Navigation and Obstacle Avoidance for USVs under Partially Observable Conditions
Zhou et al. A hybrid path planning and formation control strategy of multi-robots in a dynamic environment
Yu et al. A traversal multi-target path planning method for multi-unmanned surface vessels in space-varying ocean current
Zhang et al. Intelligent vector field histogram based collision avoidance method for auv
CN114779801A (en) Autonomous remote control underwater robot path planning method for target detection
Pandey et al. Real time navigation strategies for webots using fuzzy controller
Mohanty et al. A new intelligent approach for mobile robot navigation
CN116027796A (en) Multi-autonomous underwater robot formation control system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190118

Assignee: Osenda (Shandong) Offshore Engineering Co., Ltd.

Assignor: Harbin Engineering University

Contract record no.: X2023230000005

Denomination of invention: A Motion Planning Method of Underwater Vehicle Based on Multi-constrained Targets

Granted publication date: 20220405

License type: Exclusive License

Record date: 20230130
