CN111998847A

CN111998847A - Underwater vehicle bionic geomagnetic navigation method based on deep reinforcement learning

Info

Publication number: CN111998847A
Application number: CN202010683268.5A
Authority: CN
Inventors: 刘明雍; 王聪; 牛云; 李嘉琦; 汪培新
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2020-07-16
Filing date: 2020-07-16
Publication date: 2020-11-27

Abstract

The invention provides a geomagnetic bionic navigation method based on Deep reinforcement learning, which is characterized in that under the premise of no prior geomagnetic information, a Deep-Q-Network algorithm is used for making course decision, an underwater vehicle is guided to navigate to a target point, and the search efficiency of an AUV is increased by using a boundary function. And as the times of reaching the target point by the AUV are increased, the navigation path is gradually shortened, and finally the optimal path is approached.

Description

Underwater vehicle bionic geomagnetic navigation method based on deep reinforcement learning

Technical Field

The invention relates to an underwater vehicle bionic geomagnetic navigation method based on deep reinforcement learning, and belongs to the technical field of underwater vehicle geomagnetic navigation.

Background

The underwater geomagnetic navigation and positioning has the characteristics of being passive, non-radiative, all-time, all-region and the like, and can effectively overcome the defects of the traditional underwater navigation system, such as the defect that an error of an inertial navigation system is accumulated along with time, the problem that a satellite navigation system has the defect of rapid attenuation of wireless signals in water and the like, so that the underwater geomagnetic navigation and positioning method is one of ideal methods for realizing real-time, continuous and accurate underwater autonomous navigation of an underwater vehicle.

At present, the main direction of geomagnetic navigation is geomagnetic matching navigation, geomagnetic data is measured in real time by using a geomagnetic sensor installed on a carrier, and the actually measured data is matched with a reference database (geomagnetic map) acquired in advance through a matching algorithm to determine the position information of the carrier. The method relies on accurate prior geomagnetic maps, but under severe environments such as underwater ocean currents, submerged reefs and the like, accurate geomagnetic map acquisition is extremely difficult. Some researchers have turned the research direction to navigation by using magnetic tendency, however, the current navigation method has certain randomness in course decision and lacks of learning, and when the AUV needs to go back and forth for many times between two places (such as transportation and patrol), the AUV cannot obtain experience from the previous round, and cannot learn a better path to navigate.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a geomagnetic bionic navigation method based on Deep reinforcement learning, which is characterized in that under the premise of no prior geomagnetic information, a Deep-Q-network (DQN) algorithm is used for making course decision, an underwater vehicle is guided to navigate to a target point, and the search efficiency of an AUV is increased by using a boundary function. And as the times of reaching the target point by the AUV are increased, the navigation path is gradually shortened, and finally the optimal path is approached.

The technical scheme of the invention is as follows:

the underwater vehicle bionic geomagnetic navigation method based on deep reinforcement learning is characterized in that: the method comprises the following steps:

step 1: initializing AUV navigation parameters: setting the movement speed v of the AUV, a sampling period delta T, the geomagnetic parameter of the initial position, the geomagnetic parameter of the target position and the threshold value mu of the target function;

step 2: initializing DQN algorithm parameters: initializing a current Q value network parameter w, a target Q value network parameter w' and an experience playback pool capacity D; initialization state space and action space: taking the geomagnetic parameter of the position of the AUV as a state characteristic s, taking the course angle of the AUV as an action a, and initializing an action space as follows:

θ＝{θ₁,θ₂,…,θ_m},m＝2π/Δθ

delta theta is a set course angle interval;

and step 3: selecting an action:

the current Q value network takes the AUV current state s as input, respectively outputs Q values of all actions in an action space, and selects the action with the maximum Q value or selects a random action through an epsilon-greedy strategy;

and 4, step 4: and executing the action:

the AUV executes the action a selected in step 3, moves to the next position according to the kinematic equation and calculates a normalized objective function F (B, k), obtains the next state s ' and the reward r, stores (s, a, r, s ') as a group of samples in the experience playback pool, and updates the state, where s is s ', and the reward function r is:

wherein k represents a certain movement moment of the aircraft, and bound () is a boundary function;

and 5: judging whether the target position is reached according to the normalized target function in the current state; if the target function is smaller than the threshold value mu, the AUV is considered to reach the target position, and the navigation is finished; otherwise, entering the next step;

step 6: judging whether the boundary is out of bounds by using a boundary function:

if the target function F (B, k) is larger than or equal to the boundary function bound (k-1), returning the AUV to the previous time position; if the objective function F (B, k) is smaller than the boundary function bound (k-1), the boundary function is updated:

and 7: updating the network parameters:

if the number of the sample groups in the experience playback pool is not more than m, directly returning to the step 3;

if the number of the sample groups in the empirical playback pool is greater than m, randomly extracting m groups of samples from the empirical playback pool, reversely propagating errors by using a gradient descent method, updating the current Q value network parameter w, copying the current Q value network parameter w to the target Q value network every N time steps, namely making w' ═ w, and then returning to the step 3, wherein the loss function of the DQN algorithm is as follows:

Loss＝(Target Q-Q(s,a；w))²

Target Q＝r+γmaxQ(s',a'；w')

wherein gamma is a set discount factor, Q (s, a; w) refers to a Q value corresponding to the action a in a Q value vector which is output by a network and has the same dimension with the action space, wherein the state s in the sample (s, a, s', r) is input into a current Q value network; maxQ (s ', a'; w ') refers to the action corresponding to the maximum Q value, where the state s' in the sample (s, a, s ', r) is input to the target Q value network, and a' is the maximum Q value in the Q value vector output by the network.

Advantageous effects

The invention provides a geomagnetic bionic navigation algorithm based on deep reinforcement learning, aiming at the problem of geomagnetic navigation when an AUV executes two-place round-trip tasks such as patrol, transportation and the like under the condition of no prior geomagnetic information.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow chart of a deep reinforcement learning geomagnetic navigation algorithm;

FIG. 2 is an AUV navigation track under deep reinforcement learning geomagnetic navigation algorithm;

FIG. 3 is a normalized objective function convergence curve;

fig. 4 is a sub-objective function convergence curve of the three geomagnetic components;

FIG. 5 is an iteration step number statistic of two geomagnetic bionic navigation algorithms, wherein (a) is the algorithm of the invention, and (b) is an evolution search algorithm.

Detailed Description

The invention aims to solve the problems that the traditional evolutionary search algorithm has larger randomness in the selection of course angles, and experience cannot be acquired from AUV and a better strategy cannot be acquired when the AUV executes two-place round-trip tasks such as patrol, transportation and the like, and provides a geomagnetic bionic navigation algorithm based on deep reinforcement learning. When the AUV needs to go back and forth for multiple times between two places, a more optimal and shorter navigation path can be obtained through updating the strategy.

First, an AUV kinematic model is established.

Considering an Autonomous Underwater Vehicle (AUV) as a particle in a two-dimensional plane, the equation of motion is:

wherein k represents a certain moving moment of the aircraft, Δ T represents a sampling period, v represents the moving speed of the aircraft, and θ represents the heading angle of the aircraft.

In the present invention, both Δ T and v are set as constants, and therefore, the equation of motion can be simplified into the following form:

where L represents the motion step of the AUV.

The following describes the principle of geomagnetic bionic navigation problem

The geomagnetic field is formed by combining a plurality of geomagnetic componentsCan be expressed as B ═ B₁,B₂,…,B_n}. The geomagnetic components that can be observed at present include seven types, such as total magnetic field strength, east component, north component, vertical component, horizontal component, magnetic inclination angle, magnetic declination angle, and the like, and can provide position and direction information for navigation.

From a biological point of view, living beings have a sensitivity to magnetic tendencies when navigating with the aid of terrestrial magnetism. Therefore, the process of the geomagnetic bionic navigation is a process of converging the multi-geomagnetic parameters to sub-target values, namely a multi-target search problem. Accordingly, the geomagnetic bionic navigation can be expressed in the following form:

wherein, B is a geomagnetic parameter; f (B, k) represents the total objective function at the k-th time;

an ith geomagnetic parameter representing a position at a kth time;

an ith geomagnetic parameter representing a target position; theta^kRepresenting the course angle at the kth time; g_iRepresenting a constraint relationship; f. of_i(B, k) represents an objective sub-function of the ith geomagnetic parameter at the kth time, as shown below.

Since the magnitude of the plurality of geomagnetic parameters is unequal, and the geomagnetic parameters are normalized, the multi-parameter objective function can be expressed as:

an ith geomagnetic parameter representing the starting position; the goal of geomagnetic bionic navigation is to converge geomagnetic parameters to a target value, and simultaneously, a target function approaches zero, namely:

in actual navigation, when the relative distance between the AUV and the target position is within a certain range, that is, the objective function is smaller than a certain threshold, the AUV is considered to be successfully navigated to the target position, that is:

F(B,k)＜μ

based on the principle, the algorithm of the invention is used for realizing geomagnetic bionic navigation, and comprises the following specific steps:

step 1: and initializing AUV navigation parameters.

And setting the motion step length L of the AUV, the geomagnetic parameter of the initial position, the geomagnetic parameter of the target position and the threshold value mu of the target function.

Step 2: relevant parameters of the DQN algorithm are initialized.

Initializing a current Q value network parameter w, a target Q value network parameter w' and an experience playback pool capacity D; state space and action space initialization: taking the geomagnetic parameter of the position of the AUV as a state characteristic s and the heading angle of the AUV as an action a, initializing an action space as follows:

θ＝{θ₁,θ₂,…,θ_m},m＝2π/Δθ

and delta theta is the set course angle interval.

And step 3: an action is selected.

The current Q value network takes the AUV current state s as input, respectively outputs Q values of all actions in the action space, and selects the action with the maximum Q value or selects random actions through epsilon-greedy strategy, namely selects the action with the maximum Q value with probability of (0< <1) or selects random actions with probability of 1-, for example, 90% of the probability is the action with the maximum Q value, and 10% of the probability is the random action.

And 4, step 4: an action is performed.

The AUV performs action a, moves to the next position according to the kinematic equation and calculates the normalized objective function F (B, k), obtains the next state s ' and the reward r, stores (s, a, r, s ') as a set of samples in the empirical playback pool, and updates the state, let s be s '. The AUV kinematic equation, the reward function r, is as follows:

Where bound () is a boundary function.

When the AUV performs geosteering using deep reinforcement learning, there is a huge unexplored state space, and the DQN network parameters are initialized randomly, which initially cannot provide correct heading. In this case, the AUV needs to go through a long exploration process to find the target point.

In the exploration process, there are many states that need not be explored because they do not become part of the optimal path, i.e., invalid states. Therefore, the exploration range of the AUV is limited by constructing a boundary function, so that the search efficiency of the AUV is remarkably increased.

The normalized target function represents the difference of geomagnetic parameters of the current position and the target position, and can indirectly reflect the relative distance between the current position of the AUV and the target position. At each moment, the area where the objective function is reduced can be regarded as a circular area with the target position as the center and the relative distance between the AUV and the target position as the radius. It is easy to know that the AUV can approach the target position when exploring towards the circle; when the AUV searches for the out-of-circle area, it is inevitable to move away from the target position, i.e., the out-of-circle area is in an invalid state. The boundary function is set as follows:

under the limit of the boundary function, when the exploration position of the AUV at the next moment is out of the boundary, the AUV returns to the position at the previous moment, thereby avoiding useless search. Therefore, the AUV only continuously searches the area in the circle determined by the boundary function, and the searching efficiency and the learning efficiency of the navigation algorithm are greatly improved.

And 5: and judging whether the target position is reached according to the normalized target function in the current state. And if the target function is smaller than the threshold value mu, the AUV is considered to reach the target position, and the navigation is finished. Otherwise, the next step is entered. The judgment formula is as follows:

F(B,k)＜μ

step 6: and judging whether the boundary is out of bounds or not by using a boundary function. If the target function F (B, k) is larger than or equal to the boundary function bound (k-1), returning the AUV to the previous time position; if the objective function F (B, k) is smaller than the boundary function bound (k-1), the boundary function is updated.

And 7: and updating the network parameters.

if the number of the sample groups in the empirical playback pool is greater than m, randomly extracting m groups of samples from the empirical playback pool, reversely propagating errors by using a gradient descent method, updating the current Q value network parameter w, copying the current Q value network parameter w to the target Q value network every N time steps, namely making w' ═ w, and then returning to the step 3, wherein when the parameters of the current Q value network are updated, the loss function of the DQN algorithm is as follows:

Loss＝(Target Q-Q(s,a；w))²

Target Q＝r+γmaxQ(s',a'；w')

wherein gamma is a set discount factor, Q (s, a; w) refers to a Q value corresponding to the action a in a Q value vector which is output by a network and has the same dimension with the action space, wherein the state s in the sample (s, a, s', r) is input into a current Q value network; maxQ (s ', a'; w ') refers to the action corresponding to the maximum Q value, where the state s' in the sample (s, a, s ', r) is input to the target Q value network, and a' is the maximum Q value in the Q value vector output by the network. By means of navigation and updating of the current Q value network in the navigation process, the current Q value network can make a more optimal decision.

In order to verify the feasibility and superiority of the algorithm, a simulation experiment was performed using Python3, and an international geomagnetic reference field (IGRF-12) was used to provide the required multi-geomagnetic feature data.

The specific implementation mode is as follows:

(1) geomagnetic feature selection. Three mutually independent parameters are selected from various geomagnetic characteristic quantities, namely magnetic north components B_xEast component of geomagnetism B_yPerpendicular component of geomagnetism B_zAs a navigation feature quantity.

(2) Initializing AUV parameters and geomagnetic bionic navigation parameters. Three geomagnetic characteristic quantities of an AUV initial position are respectively as follows: b is_x＝22270nT，B_y＝-1993nT，B_z-9732 nT. The three geomagnetic parameters of the target position are respectively as follows: b is_x＝22583nT， B_y＝-1881nT，B_z-9228 nT. The AUV movement step L is 500m, the heading interval angle Δ θ is 45 °, and the navigation end threshold μ is 0.0005.

(3) DQN algorithm parameter initialization. The learning rate α is 0.0001, the discount factor γ is 0.9, the epsilon greedy strategy is 0.9, the empirical playback pool capacity D is 500000, the sample sampling number m is 64, and the target Q value network parameter update period N is 200. The current Q value network parameter w and the target Q value network parameter w' both use an orthogonal initialization method.

(4) And the AUV completes the geomagnetic navigation task by utilizing the DQN geomagnetic bionic navigation algorithm.

The navigation trajectory of the algorithm of the present invention is shown in fig. 2. Shown in the figure is the navigation trajectory of the AUV after the DQN algorithm convergence. The whole track is relatively straight and reaches the target position approximately in a straight line, which reflects that the DQN algorithm learns a relatively excellent navigation strategy and guides the AUV to reach the target position in a relatively straight path.

FIG. 3 is a diagram illustrating the convergence process of the normalized objective function under the algorithm of the present invention. It can be seen that the convergence process of the objective function is relatively smooth and finally converges to the target value in 95 steps, which means that the navigation steps of the AUV are 95 steps. The convergence rate of the objective function is high in the initial stage, and gradually decreases as the objective function approaches the target position, and finally reaches the target value.

FIG. 4 shows the convergence process of the sub-objective functions of the three components of the earth magnet under the algorithm of the present invention. The convergence process for the sub-target functions for all three components is smooth and nearly uniform, with the east component converging slightly faster than the other two. As navigation progresses, the sub-objective functions of the three components converge to the target value simultaneously.

FIG. 5 is a graph of a statistical line of iteration steps in a multi-round experiment for two navigation algorithms. The motion step size of both algorithms is the same, so the number of steps can reflect the navigation path length. (a) The figure is a step number statistical diagram under the algorithm of the invention, and it can be seen that under the action of the algorithm, the number of search steps when the AUV reaches the target position for the first time is about 650 steps, and then the number of required search steps gradually decreases with the increase of the number of times of reaching the target position, and finally converges to about 90 steps. (b) The figure is a step number statistical diagram under the evolution search algorithm, and obviously, navigation under the evolution search algorithm has strong randomness and volatility, wherein a red dotted line represents the average iteration step number of the evolution search algorithm, and the step number is about 260. Comparing the two graphs (a) and (b), it can be found that although the initial iteration step number of the algorithm is higher than the maximum iteration step number of the evolutionary search algorithm, the algorithm can be rapidly converged to about 90 steps along with the progress of the round-trip task, and is far lower than the average iteration step number of the evolutionary search algorithm, namely 260 steps. Therefore, the algorithm can successfully navigate to the target position under the condition of no prior geomagnetic information, and has obvious superiority in the round-trip task of the AUV.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention.

Claims

1. An underwater vehicle bionic geomagnetic navigation method based on deep reinforcement learning is characterized in that: the method comprises the following steps:

θ＝{θ₁,θ₂,…,θ_m},m＝2π/Δθ

delta theta is a set course angle interval;

and step 3: selecting an action:

and 4, step 4: and executing the action:

and 7: updating the network parameters:

Loss＝(Target Q-Q(s,a；w))²

Target Q＝r+γmaxQ(s',a'；w')

2. The underwater vehicle bionic geomagnetic navigation method based on deep reinforcement learning according to claim 1, wherein: the geomagnetic parameter selected in the step 1 is one or more of total magnetic field strength, east component, north component, vertical component, horizontal component, magnetic dip angle and magnetic declination.

3. The underwater vehicle bionic geomagnetic navigation method based on deep reinforcement learning according to claim 2, wherein: the geomagnetic parameters selected in the step 1 are an east component, a north component and a vertical component.

4. The underwater vehicle bionic geomagnetic navigation method based on deep reinforcement learning according to claim 1, wherein: normalizing the objective function F (B, k) according to the formula

Is determined in which

An ith geomagnetic parameter representing a position of the AUV at the kth time;

an ith geomagnetic parameter representing a position of the target,

and n is the number of the selected geomagnetic parameters.