CN115493595A

CN115493595A - AUV path planning method based on local perception and near-end optimization strategy

Info

Publication number: CN115493595A
Application number: CN202211219574.9A
Authority: CN
Inventors: 杨嘉琛; 霍紫强; 霍佳明; 肖帅; ***
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2022-09-28
Filing date: 2022-09-28
Publication date: 2022-12-20

Abstract

The current ocean exploration mode is transformed to intellectualization and informatization so as to pursue smaller action risk and longer navigation time, the importance of an underwater unmanned exploration system is increasingly highlighted, and path planning considering ocean current factors and obstacle factors becomes a necessary condition for underwater navigation of the AUV. The invention provides an AUV path planning method of a near-end optimization strategy algorithm combined with local perception under the condition that ocean current factors are ignored and local obstacle information is not effectively utilized. A flow chart of AUV underwater path planning is obtained by constructing an underwater ocean current environment, constructing a neural network structure of a near-end optimization strategy, and designing and considering a multi-factor reward function. The method was verified experimentally. The method can be widely applied to real-time path planning of the underwater AUV.

Description

AUV path planning method based on local perception and near-end optimization strategy

Technical Field

The invention belongs to the field of AUV autonomous path planning, and particularly relates to an AUV path planning method considering ocean current influence and based on local perception and a near-end optimization strategy.

Background

The importance of the underwater unmanned detection system is increasingly highlighted along with the change of the current ocean detection mode to intellectualization and informatization so as to pursue smaller action risk and longer navigation time. The AUV is an important component of an underwater unmanned combat system, the path planning is an important technology for safely and effectively completing combat missions by the AUV, constraint conditions such as ocean currents, obstacle avoidance and self performance need to be considered, and indexes such as energy consumption, navigation time and safety concealment are pursued to be optimal.

The current common path planning methods mainly include a search method based on a directed graph, a heuristic search algorithm, an artificial potential field method, a fast random tree generation method and the like. For the AUV path planning problem in a large-scale area, rapidly obtaining a path meeting the requirement is more important than spending a large amount of time to solve an optimal path, and the reinforcement learning algorithm becomes a hotspot for path planning algorithm research due to the intelligence and dynamic learning capability of the reinforcement learning algorithm.

The reinforcement learning mainly comprises an intelligent agent, an environment, a state, an action and a reward; after the agent has performed an action, the agent will acquire a new observed state for which the state transition context will give a reward signal. The agent then performs a new action according to the current policy based on the reward of the new state and environmental feedback. The intelligent agent continuously optimizes own strategies through reinforcement learning, and finally can take optimal actions in different states. The near-end optimization strategy algorithm belongs to a reinforcement learning algorithm based on a strategy and is used for solving the problem of action selection under a multi-dimensional action space.

Disclosure of Invention

The invention aims to solve the technical problem of providing an AUV path planning method of a near-end optimization strategy algorithm combined with local perception. The technical scheme of the invention is as follows:

1. acquiring obstacle information and ocean current information, and constructing a three-dimensional environment according to the information;

2. a critic network for evaluating the action and an actor network for outputting the action are constructed, and network parameters are initialized.

3. And selecting an action according to the output of the neural network, acquiring a sample, and putting the sample into an experience pool for later learning.

4. The reward function in the sample is calculated as follows:

R _d ＝arctank ₁ (ξ _t -ξ _t+1 -δ _d )

in which ξ _t Distance, δ, representing the current position of the AUV and the target point _d For bias terms, the AUV is harder to obtain a positive reward.

The reward associated with ocean currents is determined by actual speed

And AUV speed

The ratio of (a) to (b) is set. When the target is reachable and ocean currents have a positive influence on the movement of the AUV, the actual speed should be greater than the speed of the AUV. Parameter tau _c Typically set to 0.5 to facilitate increased ocean current utilization by the AUV. R _c Decreasing with increasing ocean current angle and increasing with increasing ocean current velocity. When ocean currents are negatively affected or utilized to a lesser extent, the formula tends to pass through δ _c Punishment is made to the agent.

The final reward function is: r = k 1R _d +k2*R _c

5. The critic network and actor network are learned using the samples, and the updated formula for the actor network is derived as follows:

the objective function is:

the gradient of the objective function is:

updating the formula: is composed of

Alpha is the learning rate

The actor network is updated in the way of

When merit function estimate

Above 0, the network parameters will be optimized towards increasing the probability of this action output, but to r _t (θ) =1+ ε; on the contrary, when

Then the network parameters will be optimized towards the probability direction of decreasing the action until r _t (θ) =1- ε, essentially controlling the magnitude of policy updates.

In the near-end optimization strategy, the merit function is estimated using the timing difference error, and a single-step TD-error is defined as the difference between the cumulative discount reward and the Critic network state estimate. The estimate of the merit function is N steps TD-error, expressed as:

δ _t ＝rx ₊₁ +γV _β (s _t+1 )-V _β (s _t ).

the merit function is:

the updating mode of the critic network is as follows: l is a radical of an alcohol ^VF β＝(V _β -V _t ^targ ) ²

6. The parameters are updated for both networks in the manner of 5. And according to the output probability distribution, sampling and selecting actions are carried out. The above sampling and network updating process is repeated until a specified maximum number of rounds is reached. The mark of the end of each round is the maximum number of steps or the target point is reached, and finally the path is output.

The near-end optimization strategy algorithm adopted by the invention comprises two networks: a critic network and an actor network; and the value of the action is evaluated by the comment family network, the actor network is responsible for outputting the action, the sample can be learned for many times, and the on-polarity is converted into the off-polarity, so that the utilization rate of the sample in the experience pool is improved. The input of the invention is the joint input of the relative position information and the description information of the local barrier, and the functions of global guidance and local perception can be provided. And the network can still be converged under the multidimensional motion space by using a mode of outputting probability distribution.

Drawings

FIG. 1 is a block diagram of a method

FIG. 2 results of the experiment

Detailed Description

The method mainly comprises the following steps: input processing, network initialization, reward function design, network updating, decision making and the like. Fig. 1 presents a block diagram of the proposed method.

An AUV path planning method based on a near-end optimization strategy algorithm comprises the following steps:

1. and (5) environment construction. Ocean current and depth data of 122.75 degrees E-130.75 degrees E and 15.25 degrees N-23.625 degrees N are downloaded from a national ocean data center, and the maximum depth is 6400m. A coordinate system is established by taking (122.75 degrees E,15.25 degrees N and 6400 m) as a coordinate origin, the target point is (130.75 degrees E,23.625 degrees N and 6400 m), and the navigational speed of the AUV is 1.5m/s.

2. The state input comprises three parts: position information, ocean current information, local environment information. Wherein the position information is input by adopting relative position coordinates:

(g _x ,g _y ,g _z ) The coordinates of a target point, (x, y, z) are coordinates of a current position, ocean current information is obtained according to the current position and is represented as (u, v, w), and local environment information is sensed by a sensor and is converted into a 0,1 matrix. Wherein 0 represents a barrier and 1 represents safety. PerceptionThe range is 3 unit lengths and the perceptual matrix dimensions are 3 x 3.

3. The local perception input is changed into 1 multiplied by 3 input through neural network processing, and is connected with position and ocean current information to serve as final input.

4. Constructing an actor neural network for outputting a strategy, recording the parameter as alpha, finally outputting a 27-dimensional vector through softmax, and sampling according to probability distribution; and constructing a comment family neural network for outputting the value of the action, wherein the parameter of the comment family neural network is recorded as beta, and the rest network structures except the last layer are the same as the actor network.

5. Transmitting the input in 3 into actor network, and outputting action a _t AUV executes the current action and gets the new state s under the influence of ocean currents _t+1 To obtain a reward r _t Storing the current sample in an experience pool(s) _t ,a _t ,r _t ,s _t+1 ) This process is repeated until the current round ends. The round end flag is to reach the target or to reach the maximum number of steps 2000.

6. The reward function is set as follows:

R _d ＝arctank ₁ (ξ _t -ξ _t+1 -δ _d )

in which ξ _t Distance, δ, representing current position of AUV and target point _d To bias the items, the AUV is harder to obtain a positive reward.

The reward associated with ocean currents is determined by actual speed

And AUV speed

The ratio of (c) is set. When the target is reachable and ocean currents have a positive influence on the movement of the AUV, the actual speed should be greater than the speed of the AUV. Parameter tau _c Typically set to 0.5 to facilitate the AUV to make greater use of the ocean current. R _c Decreasing with increasing ocean current angle and increasing with increasing ocean current velocity. When ocean currents are negatively affected or are less utilized, the formula tends to pass through δ _c Punishment is made to the agent.

The final reward function is: r = k 1R _d +k2*R _c Wherein k1=1 and k2=0.5.

7. After the round is finished, if the number of samples reaches the designated capacity of 1000, updating is started; if not, sampling is continued. The update formula is as follows:

the critic network and the actor network are learned using the samples, and the updated formula of the actor network is derived as follows:

the objective function is:

the gradient of the objective function is:

updating the formula: is composed of

Alpha is the learning rate

The actor network is updated in the way of

When the merit function estimates

Above 0, the network parameters will be optimized towards increasing the probability of output of the action,but is optimized to r _t (θ) =1+ ε; on the contrary, when

In the near-end optimization strategy, the merit function is estimated using the timing difference error, and a single step TD-error is defined as the difference between the cumulative discount reward and the Critic network state estimate. The estimate of the merit function is N steps TD-error, expressed as:

δ _t ＝r _t+1 +γV _β (s _t+1 )-V _β (s _t ).

the merit function is:

the updating mode of the critic network is as follows: l is ^VF β＝(V _β -V _t ^targ ) ²

Where epsilon is set to 0.3 and the learning rate alpha is 0.001.

The test experiment result is shown in fig. 2, the path length is 610.38km, and the path time is 337413s.

Claims

1. An AUV path planning method based on local perception and near-end optimization strategies, the path planning method comprising:

(1) Obtaining obstacle information and ocean current information, and constructing a three-dimensional environment according to the information;

(2) Constructing a critic network for evaluating the action and an actor network for outputting the action, and initializing network parameters;

(3) Selecting an action according to the output of the neural network, acquiring a sample, and putting the sample into an experience pool for later learning;

(4) Designing a reward function considering a plurality of factors;

(5) Training is performed using samples of the experience pool until a maximum number of rounds is reached, outputting a path.

2. The AUV path planning method based on local perception and near-end optimization strategy as claimed in claim 1, wherein the reward function calculation formula in step (4) is as follows:

R _d ＝arctan k ₁ (ξ _t -ξ _t+1 -δ _d )

in which ξ _t Distance, δ, representing the current position of the AUV and the target point _d To bias the items, the AUV is harder to obtain a positive reward.

The reward associated with ocean currents is determined by actual speed

And AUV speed

The ratio of (c) is set. When the target is reachable and ocean currents have a positive influence on the motion of the AUV, the actual speed should be greater than the speed of the AUV. Parameter tau _c Typically set to 0.5 to facilitate the AUV to make greater use of the ocean current. R _c Decreasing with increasing ocean current angle and increasing with increasing ocean current velocity. When ocean currents are negatively affected or utilized to a lesser extent, the formula tends to pass through δ _c Punishment is made to the agent.

The final reward function is set as: r = k 1R _d +k2*R _c

3. The AUV path planning method based on local perception and near-end optimization strategy as claimed in claim 1, wherein in the step (4), a critics network and an actor network use sample are constructed for learning, and an update formula of the actor network is derived as follows:

the objective function is:

the gradient of the objective function is:

updating the formula: is composed of

Alpha is the learning rate

The actor network is updated in the way of

When the merit function estimates

δ _t ＝r _t+1 +γV _β (s _t+1 )-V _β (s _t ).

the merit function is:

the updating mode of the critic network is as follows:

4. the AUV path planning method based on local awareness and near-end optimization strategy as claimed in claim 1, wherein the step (5) updates parameters of two networks. And sampling and selecting according to the output probability distribution. The above sampling and network updating process is repeated until a specified maximum number of rounds is reached. The end of each round is marked by the maximum number of steps reached or the target point reached.