CN112179367B - Intelligent autonomous navigation method based on deep reinforcement learning - Google Patents

Intelligent autonomous navigation method based on deep reinforcement learning

Info

Publication number
CN112179367B
Authority
CN
China
Prior art keywords: value, current, neural network, agent, state
Prior art date
Legal status: Active
Application number
CN202011023274.4A
Other languages
Chinese (zh)
Other versions
CN112179367A (en)
Inventor
彭小红
陈亮
陈荣发
张军
梁子祥
史文杰
黄文�
陈剑勇
黄曾祺
余应淮
Current Assignee
Guangdong Ocean University
Original Assignee
Guangdong Ocean University
Priority date
Filing date
Publication date
Application filed by Guangdong Ocean University
Priority to CN202011023274.4A
Publication of CN112179367A
Application granted
Publication of CN112179367B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34 Route searching; Route guidance
    • G01C21/3407 Route searching; Route guidance specially adapted for specific applications
    • G01C21/343 Calculating itineraries, i.e. routes leading from a starting point to a series of categorical destinations using a global route restraint, round trips, touristic trips
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B63 SHIPS OR OTHER WATERBORNE VESSELS; RELATED EQUIPMENT
    • B63C LAUNCHING, HAULING-OUT, OR DRY-DOCKING OF VESSELS; LIFE-SAVING IN WATER; EQUIPMENT FOR DWELLING OR WORKING UNDER WATER; MEANS FOR SALVAGING OR SEARCHING FOR UNDERWATER OBJECTS
    • B63C11/00 Equipment for dwelling or working underwater; Means for searching for underwater objects
    • B63C11/52 Tools specially adapted for working underwater, not otherwise provided for
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Mechanical Engineering (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Ocean & Marine Engineering (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to the technical field of autonomous navigation of intelligent agents, and in particular to an agent autonomous navigation method based on deep reinforcement learning. The method addresses the problems that, because existing algorithms compute reward values only over two adjacent states, an agent cannot perceive the development of several future states in advance and its obstacle-avoidance and navigation capabilities are insufficient. The agent autonomous navigation method based on deep reinforcement learning comprises the following steps: constructing an agent autonomous navigation system that adopts the MS-DDQN algorithm, i.e., a DDQN algorithm oriented to a multi-step mechanism; building a simulation environment; training the autonomous navigation system in the simulation environment; and loading the trained autonomous navigation system onto the agent, which thereby acquires autonomous navigation capability. Through this technical scheme, the agent can perceive the future distribution of obstacles and make avoidance actions in advance.

Description

Intelligent autonomous navigation method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of autonomous navigation of an intelligent agent, in particular to an autonomous navigation method of an intelligent agent based on deep reinforcement learning.
Background
Because of the over-exploitation of land resources by humans, the reserves of mineral and biological resources are rapidly decreasing. The ocean covers more than twice the area of the land, and the mineral, energy and fishery resources it holds are far richer than those on land. Given the unknown and complex nature of the ocean environment, intelligent agents can replace humans in exploring and developing ocean resources, so research on such agents has received great attention in many countries in recent years. Autonomous navigation is one of the key technologies of intelligent mobile agents. Autonomous navigation means that, with its pose information known, an agent finds an optimal or sub-optimal path from a starting point to a target point in an environment containing complex obstacles, subject to one or more given constraints such as shortest path length, minimum energy consumption or minimum travel time. The autonomous navigation problem of an agent is therefore equivalent to its autonomous path-planning problem, the goal being to steer the moving agent away from obstacles and towards the target position. The path-planning task aims to find, by means of a specific algorithm, one or more paths that start from a starting point, avoid the various obstacles and safely reach the target position in a known or unknown environment. In essence it can be regarded as a constrained optimization problem whose optimization objectives differ somewhat with different requirements. The many navigation algorithms that have been proposed can be roughly divided into two categories according to the degree of intelligence of the agent: non-intelligent navigation algorithms and intelligent navigation algorithms. In the present invention a modularized deep neural network architecture is designed so that the learning task of each module's neural network is more clearly defined, while a double-neural-network structure improves the stability of the algorithm; furthermore, the output of the target value network, the loss function, the reward function and the data stored in the experience pool of the MS-DDQN algorithm are improved, so that during training the rewards obtained by the agent can propagate backwards into the state-value estimates of states several steps apart. In this way the underwater agent is guided to learn quickly, and at the same time it can perceive changes of future states in advance, which is equivalent to giving it the ability to perceive the future distribution of obstacles and thus helps it make avoidance actions in advance.
The Deep Q Network (DQN) is a deep reinforcement learning algorithm. Its key techniques are a double-neural-network structure and experience replay. The first innovation of the DQN algorithm is to use a neural network to approximate the optimal state-action value function Q, instead of the table used in Q-learning to record the mapping from states to actions. This overcomes the inability of Q-learning to handle high-dimensional state spaces and exploits the ability of deep neural networks to process high-dimensional information. The second innovation is to build two neural networks, a current value network and a target value network; this double-network structure improves the stability of the algorithm. The third innovation is the experience replay pool: sample data from the agent's interaction with the environment are stored in the pool and each sample is labelled by its reward value, which removes the need for the large numbers of manually labelled samples required by deep learning methods. The training method of the DQN algorithm is shown in Fig. 1. The network structure of DQN contains two deep neural networks, the current value network and the target value network. The current value network has two functions. The first is to process the input state information during training and evaluate the value of each output action; a greedy method then decides whether to execute a random action or the action with the maximum value output by the current value network. The second is to process the training samples drawn from the experience pool during network training, output the value of each action, and compare it with the action values output by the target network to compute the error that guides the weight update of the current network. The target value network mainly processes the training samples drawn from the experience pool during training and outputs the value of each action to help the iterative update of the current network's weights. The weights of the target value network are not updated during training; instead, every N steps they are copied from the current value network. An experience replay mechanism is used during training of the DQN network: through it the agent can learn not only from the current experience data but also repeatedly from previous experience data. Each time the agent completes an interaction with the environment, the information is stored in the experience pool; each sample contains the current state s_t, the executed action a_t, the obtained reward r_t and the next state s_{t+1}, and these four items are stored in the experience pool D as one unit in the form [s_t, a_t, r_t, s_{t+1}]. Because the stored experience data are strongly correlated, the DQN algorithm draws small batches of training samples from the experience pool by random sampling, which ensures the independence of the training samples during training and helps improve the convergence rate of the algorithm.
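As a minimal illustration of the experience-replay mechanism described above (a sketch under assumed class and parameter names, not the patent's implementation), the following Python fragment stores [s_t, a_t, r_t, s_{t+1}] tuples and draws random mini-batches:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool D storing (s_t, a_t, r_t, s_{t+1}, done) units."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The uniform random sampling is what gives the (approximate) independence between training samples mentioned above.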
Because the DQN algorithm uses a neural network instead of the Q-value table, the current value network represents the learned policy π. With θ denoting the weight parameters of the current value network, the output of the current value network is
y_DQN = Q(s, a, θ)   (2-26)
The output of the target network is:
y_target = r + γ · max_{a'} Q_target(s', a', θ')   (2-27)
the loss function of the DQN network is then:
L(θ) = E[(y_target - Q(s, a, θ))²]   (2-28)
The weights θ of the current value network are updated by computing the gradient of the loss function:
∇_θ L(θ) = E[(y_target - Q(s, a, θ)) · ∇_θ Q(s, a, θ)]   (2-29)
The parameters of the current network can then be updated by gradient descent, yielding an optimal policy. Because the DQN algorithm uses Q_target(s', a', θ') to approximate the optimization objective and selects the action corresponding to the maximum Q value, both the selection and the evaluation of the action are based on the target value network, which leads to an over-estimation problem. To solve this problem, the Deep Double Q-Network (DDQN) algorithm was proposed. The training process of DDQN is almost the same as that of DQN; the only difference is that DDQN separates action selection from action evaluation in the target. Using the same network structure as DQN, there are two different sets of weight parameters: the weight parameters θ of the current network and the weight parameters θ' of the target network. The action is selected with the parameters of the current network, and the selected action is evaluated with the parameters of the target network; the output of the target network of DDQN is therefore:
y_target = r + γ · Q_target(s', argmax_a Q(s', a, θ), θ')   (2-30)
The output of the DDQN current value network is:
y_DDQN = Q(s, a, θ)   (2-31)
where γ is the discount factor; γ^i denotes the degree to which the reward r_{t+i} obtained in state t+i influences the current state t, with γ between 0 and 1; γ^λ denotes the degree to which the reward r_{t+λ} obtained in state t+λ influences the current state t; Q is the state-action value estimate; λ is the number of interval steps; s_t is the current state and a_t the action executed in it; r_t denotes the immediate reward obtained by the agent at time t; r_{t+λ} is the reward obtained in the state λ steps later and s_{t+λ} is that state; θ and θ' are the weight parameters of the current value network and the target value network, respectively; i indexes the rewards obtained in the states following state t; Q(s_{t+λ}, a, θ) denotes the estimates output by the current value neural network for each action given the input s_{t+λ}; Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ') means that the action corresponding to the maximum estimate output by the current value network is selected first and, together with s_{t+λ}, is used as input to the target value network, which then outputs the estimate of each action. The loss function of the DDQN network is:
L(θ) = E[(y_target - Q(s, a, θ))²]   (2-32)
where E[·] denotes the expectation used to form the network error, s is the state, a the executed action, θ the weight parameter of the current value network, Q the state-action value estimate, and Q(s, a, θ) the estimates output by the current value neural network for each action. The weight parameters of the current value network are updated by:
∇_θ L(θ) = E[(y_target - Q(s, a, θ)) · ∇_θ Q(s, a, θ)]   (2-33)
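To make the decoupling of action selection (current network, θ) from action evaluation (target network, θ') concrete, here is a hedged PyTorch-style sketch of the DDQN target computation; the function and tensor names are assumptions for illustration only:

```python
import torch

def ddqn_target(reward, next_state, q_net, target_net, gamma=0.99, done=None):
    """Compute y = r + γ · Q_target(s', argmax_a Q(s', a, θ), θ')."""
    with torch.no_grad():
        # Select the action with the CURRENT network (weights θ) ...
        best_action = q_net(next_state).argmax(dim=1, keepdim=True)
        # ... but evaluate that action with the TARGET network (weights θ').
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)
        if done is not None:
            next_q = next_q * (1.0 - done)   # no bootstrap at terminal states
        return reward + gamma * next_q
```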
Because deep reinforcement learning uses a neural network to process high-dimensional raw input information and to approximate the state-action value function, it adapts to problems with much larger state spaces than conventional reinforcement learning methods. The DDQN algorithm is therefore improved accordingly to obtain MS-DDQN, so as to improve the obstacle-avoidance and navigation capability of the underwater agent. The above description of the Q(λ) algorithm shows that Q(λ) lets the agent take into account the rewards it will obtain. In underwater robot navigation, obstacle avoidance is an essential precondition for completing the task, and letting rewards influence the state-action Q values of states several steps earlier is very important for improving the underwater robot's ability to perceive obstacles in advance. If, in a given state, the underwater robot can perceive in advance the rewards it will obtain in the future, that is, perceive in advance how future states will develop, this greatly helps it avoid obstacles and reach the target point.
Disclosure of Invention
The invention aims to overcome at least one defect of the prior art and provides an agent autonomous navigation method based on deep reinforcement learning, which solves the problems that, because existing algorithms compute reward values only over two adjacent states, an agent cannot perceive the development of several future states in advance and its obstacle-avoidance and navigation capabilities are insufficient; the technical effect achieved is that the agent can perceive the development of several future states and the distribution of obstacles, and can therefore make avoidance actions in advance.
The technical scheme adopted by the invention is an agent autonomous navigation method based on deep reinforcement learning, comprising the following steps: constructing an agent autonomous navigation system that adopts the MS-DDQN algorithm, i.e., a DDQN algorithm oriented to a multi-step mechanism, obtained by improving the DDQN algorithm; the MS-DDQN algorithm uses a modularized neural network comprising a local obstacle-avoidance deep neural network module, a global navigation deep neural network module and an instruction selection module, where the local obstacle-avoidance module guides the agent away from obstacles, the global navigation module guides the agent towards the target position along a shorter path, and the instruction selection module decides which network's action instruction is executed; building a simulation environment, including building an obstacle environment model and a simulated agent; training the autonomous navigation system in the simulation environment, i.e., the agent trains and learns in the simulation environment with the MS-DDQN algorithm, where several simulation environments are used and each environment is trained many times; and loading the trained autonomous navigation system onto the agent, which thereby acquires autonomous navigation capability.
Further, the MS-DDQN algorithm comprises a current value network for selecting actions, a target value network for evaluating actions, an error function for updating the weights, a reward function giving the reward obtained when the agent takes an action in the current state and reaches the next state, and an experience pool for storing the sample data generated at every step. The current value network, target value network, error function, reward function and experience pool cooperate so that the MS-DDQN algorithm gives the agent the ability to perceive the future distribution of obstacles and to make avoidance actions in advance.
Further, the output of the target value network is:
y_target = r_t + γ·r_{t+1} + γ²·r_{t+2} + … + γ^{λ-1}·r_{t+λ-1} + γ^λ · Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ')
where γ is the discount factor; γ^i denotes the degree to which the reward r_{t+i} obtained in state t+i influences the current state t, with γ between 0 and 1; γ^λ denotes the degree to which the reward r_{t+λ} obtained in state t+λ influences the current state t; Q is the state-action value estimate; λ is the number of interval steps; s_t is the current state and a_t the action executed in it; r_t denotes the immediate reward obtained by the agent at time t; r_{t+λ} is the reward obtained in the state λ steps later and s_{t+λ} is that state; θ and θ' are the weight parameters of the current value network and the target value network, respectively; i indexes the rewards obtained in the states following state t; Q(s_{t+λ}, a, θ) denotes the estimates output by the current value neural network for each action given the input s_{t+λ}; Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ') means that the action corresponding to the maximum estimate output by the current value network is selected first and, together with s_{t+λ}, is used as input to the target value network, which then outputs the estimate of each action. The loss function is:
L(θ) = E[(y_target - Q(s, a, θ))²]
where E[·] denotes the expectation used to form the network error, s is the state, a the executed action, θ the weight parameter of the current value network, Q the state-action value estimate, and Q(s, a, θ) the estimates output by the current value neural network for each action. The data stored in the experience pool are:
[s_t, a_t, Σ_{i=0}^{λ-1} γ^i·r_{t+i}, s_{t+λ}]
where t is the time step; s the state; i indexes the rewards obtained in the states following state t; s_t is the current state; a the executed action, with a_t the action executed at time t; r_t the immediate reward obtained by the agent at time t; γ the discount factor, with γ^i denoting the degree to which the reward r_{t+i} obtained in state t+i influences the current state t and γ between 0 and 1; λ the number of interval steps; and s_{t+λ} the state reached after λ steps;
The target value network outputs the value of each group of actions according to the target-output function given above, the weights θ of the current value network are updated according to the loss function, and after each action is executed the sample data are stored in the experience pool.
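A small Python sketch of this λ-step mechanism (variable names and the use of NumPy are assumptions; the patent itself only specifies the formulas): it computes the discounted multi-step return that is stored in the experience pool together with s_t, a_t and s_{t+λ}, and the corresponding training target.

```python
import numpy as np

def multi_step_return(rewards, gamma=0.99):
    """Discounted sum  Σ_{i=0}^{λ-1} γ^i · r_{t+i}  over the collected λ rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def ms_ddqn_target(n_step_return, q_values_current, q_values_target,
                   gamma=0.99, lam=4, terminal=False):
    """Target  y = R_t^(λ) + γ^λ · Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ').

    q_values_current / q_values_target are the action-value vectors of the
    current and target networks evaluated at s_{t+λ} (assumed interface).
    """
    if terminal:
        return n_step_return
    a_star = int(np.argmax(q_values_current))   # action chosen by the current network
    return n_step_return + (gamma ** lam) * q_values_target[a_star]
```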
Further, the training method of the MS-DDQN algorithm comprises the following steps:
Randomly initialize the weights θ of the current value network Q(s_t, a; θ) and the weights θ' of the target value network Q_target(s_t, a; θ'), where Q(s_t, a; θ) denotes the current value neural network outputting an estimate for each action;
Initialize the experience pool D and set the hyper-parameter λ;
For episode=1,M do
Reset the simulation environment and obtain the initial observation state s_t, set T ← ∞, and initialize four empty arrays S_t, A, R, S_{t+1}; the array S_t stores the state information of the current state, A stores the actions executed in the current state, R stores the reward values obtained from the current state, and S_{t+1} stores the next-state information; T is mainly used to judge, when the episode ends, whether the data collected in this episode have been stored in the experience pool;
For t=1,2…do
If t < T then
Select the action a_t = argmax_a Q(s_t, a; θ) according to the current policy, execute a_t, obtain the reward r_t and the new state s_{t+1}, and store s_t in S_t, r_t in R, a_t in A and s_{t+1} in S_{t+1}; here t indexes the environment-state data collected by the agent within the episode, a_t is the action executed at time t, and Q(s, a, θ) denotes the current value neural network outputting an estimate for each action;
If s_{t+1} is the terminal state then:
T←t+1
τ←t-λ+1
If τ≥0 then
If τ+λ<T then
r_τ = Σ_{i=τ}^{τ+λ-1} γ^{i-τ} · r_i,   r_i ∈ R_t
else
r_τ = Σ_{i=τ}^{T-1} γ^{i-τ} · r_i,   r_i ∈ R_t
Store (s_τ, a_τ, r_τ, s_{τ+λ}) in the experience pool D
Randomly draw a mini-batch of sample data from the experience pool D
Setting:
y_i = r_i if s_{i+λ} is the terminal state; otherwise y_i = r_i + γ^λ · Q_target(s_{i+λ}, argmax_a Q(s_{i+λ}, a, θ), θ')
Update the current value network weights θ by gradient descent on the loss function L(θ) = E[(y_i - Q(s, a, θ))²], until τ = T - 1;
Here τ is mainly used to judge whether the number of actions the mobile robot has executed exceeds the set number of steps λ. If it does, the value of τ is greater than or equal to 0, which means the agent has already obtained data for at least λ environment states; the influence r_τ of the rewards obtained in the following λ states on state τ can then be calculated, the state information s_τ, the action a_τ and the state information s_{τ+λ} are extracted from the arrays S_t, A and S_{t+1}, and these three items together with r_τ form a training sample tuple that is stored in the experience pool D so that training samples can later be drawn from the pool for training. When the agent starts an episode of training, first T ← ∞, so t < T and the program enters the second for loop. When a collision occurs or the target position is reached, T ← t + 1; if τ = T - 1, all environment-state information obtained by the agent has been stored as training sample data of this episode according to the multi-step method. When τ + λ < T, the agent has not yet collided and the number of environment states to be stored is greater than or equal to the set value λ, so the influence r_τ of the rewards of the following λ states on state τ is calculated as r_τ = Σ_{i=τ}^{τ+λ-1} γ^{i-τ} · r_i with r_i ∈ R_t; otherwise the agent has collided, the number of environment states to be stored is smaller than the set value λ, and the influence r_τ of the remaining λ-1 (λ-2, λ-3, …, 1) future states on states τ (τ+1, τ+2, …, T-1) is calculated as r_τ = Σ_{i=τ}^{T-1} γ^{i-τ} · r_i with r_i ∈ R_t. y_i denotes, for the i-th sample of the mini-batch drawn from the experience pool, the combination of the action-value estimates given by the target value network and the actually obtained r_i; it is the comprehensive reward of the current state and is used together with the current value network for the gradient-descent operation. The MS-DDQN algorithm lets the rewards obtained by the underwater agent propagate backwards into the state-value estimates of states several steps apart; in this way the underwater agent is better guided to learn quickly and can perceive changes of future states in advance.
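The training procedure above can be summarized in the following Python sketch, a simplified and hedged rendering under assumed environment and network interfaces (env.reset(), env.step(), q_net(state), update_weights()); it is not the patent's reference implementation:

```python
import random
from collections import deque
import numpy as np

def train_ms_ddqn(env, q_net, target_net, update_weights, episodes=1000,
                  lam=4, gamma=0.99, batch_size=32, epsilon=0.1, n_actions=5):
    """λ-step DDQN training loop following the pseudocode above.

    Assumed interfaces: env.reset() -> state; env.step(a) -> (next_state, reward, done);
    q_net(state) -> array of action values; update_weights(batch, gamma, lam) performs
    one gradient-descent step on L(θ) with targets y_i. n_actions=5 matches the five
    discrete steering actions of the embodiment.
    """
    pool = deque(maxlen=100_000)                      # experience pool D
    for episode in range(episodes):
        s = env.reset()
        S, A, R = [], [], []                          # per-episode buffers S_t, A, R
        T, t = float("inf"), 0
        while True:
            if t < T:
                if random.random() < epsilon:         # greedy method with exploration
                    a = random.randrange(n_actions)
                else:
                    a = int(np.argmax(q_net(s)))
                s_next, r, done = env.step(a)
                S.append(s); A.append(a); R.append(r)
                s = s_next
                if done:
                    T = t + 1
            tau = t - lam + 1
            if tau >= 0:
                last = min(tau + lam, T)              # stop at the terminal step
                # discounted multi-step return r_τ = Σ γ^(i-τ) · r_i
                r_tau = sum((gamma ** (i - tau)) * R[i] for i in range(tau, last))
                pool.append((S[tau], A[tau], r_tau, s, last == T))
                if len(pool) >= batch_size:
                    batch = random.sample(pool, batch_size)
                    update_weights(batch, gamma, lam)
            if tau == T - 1:
                break
            t += 1
```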
Further, the training step in S3 specifically includes:
S31, the simulation environment acquires the current state information of the agent, which comprises the distance information between the agent and the obstacles in the environment and the positional relation between the agent and the target point; the positional relation comprises the relative coordinates of the agent's current position with respect to the target position (obtained by subtracting the agent's current coordinates from the target position coordinates), the Euclidean distance from the agent's current position to the target position, and the angle between the agent's heading vector and the vector from the agent's current position to the target position;
S32, the acquired current state information is input into the modularized deep neural network; specifically, the distance information between the agent and the obstacles and the positional relation between the agent and the target point are input into the local obstacle-avoidance deep neural network module, and the positional relation between the agent and the target point is input into the global navigation deep neural network module;
S33, the local obstacle-avoidance deep neural network module and the global navigation deep neural network module each output a control instruction according to the input current state information;
S34, the instruction selection module determines whether the action instruction output by the local obstacle-avoidance deep neural network module or by the global navigation deep neural network module is executed, by judging the distance between the agent and the nearest obstacle;
S35, the agent executes the instruction selected by the instruction selection module, receives the reward value output by the reward function and enters the next state;
S36, the sample data of the current interaction are stored in the experience pool; the sample data comprise the current state s_t, the executed action a_t, the obtained reward r_t and the next state s_{t+1}, and are stored in the experience pool in the form [s_t, a_t, Σ_{i=0}^{λ-1} γ^i·r_{t+i}, s_{t+λ}].
S41, the positional relation between the underwater agent and the target point is added to the input state information, so that the underwater agent can reach the target position along a shorter path; this positional relation is also used directly as the input state of the global navigation neural network, from which the network can learn the direction in which the underwater agent should advance so as to reach the target position fastest; at the same time, a modularized deep neural network is adopted so that each network has a clearer strategy to learn, enabling the underwater agent to better avoid obstacles and reach the target position along a shorter path; in S36 an experience-replay method is adopted, the sample data of the agent's interaction with the environment are stored in the experience pool and each sample is labelled with its reward value, which overcomes the need for large numbers of manually labelled samples in deep learning methods.
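The interaction cycle S31 to S36 can be sketched as follows in Python (module and function names are illustrative assumptions; the distance threshold of 40 comes from the instruction-selection rule described later in this specification):

```python
import numpy as np

def interaction_step(env, local_net, global_net, pool, dist_threshold=40):
    """One S31-S36 cycle: observe, query both modules, select, act, store."""
    sonar, goal_info = env.observe()                  # S31: obstacle distances + target relation
    local_in = np.concatenate([sonar, goal_info])     # S32: local module sees both
    a_local = int(np.argmax(local_net(local_in)))     # S33: local obstacle-avoidance action
    a_global = int(np.argmax(global_net(goal_info)))  # S33: global navigation action
    # S34: near an obstacle -> obey the local obstacle-avoidance module
    action = a_local if sonar.min() < dist_threshold else a_global
    next_state, reward, done = env.step(action)       # S35: execute and receive reward
    pool.append((local_in, action, reward, next_state, done))   # S36: store sample
    return reward, done
```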
Further, the specific steps of the local obstacle-avoidance deep neural network module and the global navigation deep neural network module are as follows: S331, the current value network receives the input current state information, processes it and outputs a group of actions, the processing including extracting training samples from the experience pool according to the current state information; the output group of actions is transmitted to the target value network;
S332, the target value network processes this group of actions, calculates the value of each action and returns the group of action values to the current value network;
S333, the current value network selects the next action to execute according to the maximum value among the group of action values returned by the target value network, and outputs a control instruction to the simulation environment according to this action;
S334, the current value network obtains the next-step state information resulting from the action selected in S333, processes it and outputs a group of actions, the processing again including extracting training samples from the experience pool according to the next-step state information; the output group of actions is transmitted to the target value network, and the target value network repeats step S332, i.e., it processes this group of actions, calculates the value of each action and returns the group of action values to the current value network;
S335, the current value network compares the group of action values returned by the target value network in S334 with those returned in S332 and calculates the error according to the error function;
S336, the weights of the current value network are updated according to the error obtained in S335, and every N steps the weights of the target value network are updated from the weights of the current value network;
S337, in parallel with step S334, the current value network outputs a control instruction to the simulation environment according to the selected next action. Actions are selected with the parameters of the current value network and the selected actions are evaluated with the parameters of the target value network; by using the two sets of weight parameters of the current value network and the target value network, action selection is separated from action evaluation.
Further, the reward function is a continuous combined reward function comprising terminal rewards and non-terminal rewards. The terminal rewards specifically comprise: the agent obtains a positive reward when it reaches the target point, expressed as r_arr = 100 if d_{r-t} ≤ d_win, where d_{r-t} is the Euclidean distance from the agent to the target point and d_win is the threshold for reaching the target point; when d_{r-t} is less than or equal to the set d_win the target point is considered reached, otherwise it is not; r_arr is the positive reward value for the agent reaching the target position;
the agent obtains a negative reward when it collides with an obstacle, expressed as r_col = -100 if d_{r-o} ≤ d_col, where d_{r-o} is the Euclidean distance between the agent and the nearest obstacle and d_col is the threshold for a collision between the agent and an obstacle; when d_{r-o} is less than or equal to d_col a collision is deemed to have occurred, otherwise not; r_col is the negative penalty reward for the agent being involved in a collision;
the non-terminal rewards specifically comprise: the agent obtains a positive reward when advancing towards the target point, expressed as r_t_goal = c_r·[d_{r-t}(t) - d_{r-t}(t-1)], where c_r ∈ (0, 1] is set to 1 and d_{r-t}(t) denotes the Euclidean distance between the agent's current position coordinates and the target position coordinates at time t; when the minimum distance between the agent and an obstacle keeps decreasing, the danger reward r_dang ∈ [0, 1] obtained also becomes smaller, with 0 ≤ r_dang ≤ 1, where d_min is the minimum distance between the agent and the obstacle and β is a coefficient that constrains the value range of r_dang to (0, 1);
when the angle between the agent's heading vector and the vector from the agent's current coordinates to the target position is smaller than ±18°, the agent obtains a reward of 1; when the angle is larger than ±18° and smaller than ±72°, the agent obtains a reward of 0.3; in all other cases the reward is 0, i.e.:
r_ori = 1 if |a_ori| < 18°;  r_ori = 0.3 if 18° ≤ |a_ori| < 72°;  r_ori = 0 otherwise
where a_ori is the angle between the agent's heading vector and the vector from the agent's current coordinates to the target position. The continuous combined reward function increases the convergence speed of the algorithm, so that the underwater agent can better avoid obstacles and reach the target position along a shorter path; through the continuous combined reward function the underwater agent obtains a corresponding reward at every learning step, which is more conducive to guiding it towards the target point and away from obstacles.
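A hedged Python sketch of this combined reward follows. The danger term r_dang is written as a simple bounded placeholder that increases with d_min and stays in (0, 1), since the patent gives its exact expression only as an image; the numeric thresholds d_win, d_col and β, the sign convention of the goal term, and the summation of the non-terminal terms are all assumptions:

```python
import math

def combined_reward(d_rt, d_rt_prev, d_ro, d_min, a_ori_deg,
                    d_win=5.0, d_col=2.0, c_r=1.0, beta=0.05):
    """Continuous combined reward: terminal rewards plus non-terminal shaping terms."""
    # Terminal rewards
    if d_rt <= d_win:
        return 100.0                        # r_arr: reached the target point
    if d_ro <= d_col:
        return -100.0                       # r_col: collided with an obstacle
    # Non-terminal rewards
    r_goal = c_r * (d_rt_prev - d_rt)       # positive when moving towards the target (assumed sign)
    r_dang = 1.0 - math.exp(-beta * d_min)  # placeholder: smaller when closer to obstacles, in (0, 1)
    if abs(a_ori_deg) < 18:
        r_ori = 1.0
    elif abs(a_ori_deg) < 72:
        r_ori = 0.3
    else:
        r_ori = 0.0
    return r_goal + r_dang + r_ori
```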
Further, the activation functions adopted by the local obstacle-avoidance deep neural network module and the global navigation deep neural network module are ReLU6 and ReLU; ReLU6 is applied at the front end of the neural network and ReLU at the back end. To improve the learning ability of the neural network and to suppress vanishing gradients, a combination of ReLU6 and ReLU is adopted as the activation functions in the network framework. Both ReLU and ReLU6 avoid vanishing gradients; using ReLU6 as the activation function at the front end of the network allows the sparse characteristics of the data samples to be learned quickly, while ReLU is used as the activation function at the back end of the network.
Further, the local obstacle avoidance deep neural network module and the global navigation deep neural network module both adopt full connection structures, the number of hidden layers of the local obstacle avoidance neural network module is more than three, and the number of hidden layers of the global navigation neural network module is one.
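A possible PyTorch rendering of such a module pair, under the stated constraints (fully connected layers, more than three hidden layers with ReLU6 at the front and ReLU at the back for the local module, one hidden layer for the global module); the layer widths are assumptions:

```python
import torch.nn as nn

# Local obstacle-avoidance module: four hidden layers, ReLU6 at the front, ReLU at the back.
def make_local_net(n_inputs, n_actions, hidden=256):
    return nn.Sequential(
        nn.Linear(n_inputs, hidden), nn.ReLU6(),
        nn.Linear(hidden, hidden), nn.ReLU6(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_actions),          # one Q-value per discrete steering action
    )

# Global navigation module: a single hidden layer.
def make_global_net(n_inputs, n_actions, hidden=64):
    return nn.Sequential(
        nn.Linear(n_inputs, hidden), nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )
```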
Further, the instruction selection module is provided with a threshold and selects the control instruction by comparing the distance between the agent and the nearest obstacle with this threshold: when the distance is smaller than 40, the control instruction output by the local obstacle-avoidance deep neural network module is selected; when the distance is greater than or equal to 40, the control instruction output by the global navigation neural network module is executed. A distance smaller than 40 means the underwater agent is very close to an obstacle, so the instruction output by the local obstacle-avoidance deep neural network is executed; a distance greater than or equal to 40 means the underwater agent is still some distance from the obstacle, so the global navigation neural network should be executed in order to reach the target position faster.
Compared with the prior art, the beneficial effects of the invention are as follows: by adopting a modularized deep neural network architecture, the learning task of each module's neural network is more clearly defined, and the double-neural-network structure improves the stability of the algorithm; the output of the target value network, the loss function, the reward function and the data stored in the experience pool of the MS-DDQN algorithm are improved, so that during training the rewards obtained by the agent can propagate backwards into the state-value estimates of states several steps apart. In this way the underwater agent is better guided to learn quickly and can perceive changes of future states in advance, which is equivalent to giving it the ability to perceive the future distribution of obstacles and helps it make avoidance actions in advance. The improved continuous combined reward function increases the convergence speed of the algorithm, so that the underwater agent can better avoid obstacles and reach the target position along a shorter path. The invention improves the conventional DDQN algorithm with the Q(λ) algorithm theory; the improved algorithm is the multi-step-oriented DDQN algorithm (MS-DDQN). During the training of MS-DDQN, the rewards obtained by the underwater robot propagate backwards into the state-value estimates of states several steps apart, which better guides the underwater robot to learn quickly while allowing it to perceive changes of future states in advance. Improving the DDQN algorithm with the Q(λ) method is equivalent to giving the underwater robot the ability to perceive the future distribution of obstacles, helping it make avoidance actions in advance.
Drawings
Fig. 1 is a training method of the DQN algorithm.
Fig. 2 is a system architecture according to an embodiment of the invention.
Fig. 3 is a schematic diagram of a neural network according to an embodiment of the invention.
FIG. 4 is a diagram of a geometric model of an embodiment of the present invention.
FIG. 5 is a diagram of a sonar detector according to an embodiment of the present invention.
Fig. 6 is a diagram of a front view sonar detector simulator in an embodiment of the present invention.
FIG. 7 is a diagram of a simulation environment model in accordance with an embodiment of the present invention.
Fig. 8 is a coordinate conversion code diagram of an embodiment of the present invention.
Fig. 9 is a graph of a ReLU function in accordance with an embodiment of the present invention.
Fig. 10 is a graph of a ReLU6 function in accordance with an embodiment of the present invention.
Fig. 11 is a navigation trajectory diagram of an underwater robot according to an embodiment of the present invention.
Fig. 12 is a training result diagram of an embodiment of the present invention, in which 12 (a) is a success rate curve, 12 (b) is a per-round prize value curve, and 12 (c) is a prize average value.
Fig. 13 is a navigation track diagram of an underwater robot in different test environments according to an embodiment of the present invention, where 13 (a) is environment 2, 13 (b) is environment 3, 13 (c) is environment 4, and 13 (d) is environment 5.
Fig. 14 is a software and hardware configuration diagram of a computer according to an embodiment of the present invention.
FIG. 15 is a hyper-parameter setting diagram according to an embodiment of the present invention.
Fig. 16 is a graph of test results in environment 1 according to an embodiment of the present invention.
FIG. 17 is a graph of test results for different environments according to an embodiment of the present invention.
Description of reference numerals: 1, the embodied underwater robot; 2, the heading of the underwater robot; 3, the target position of the underwater robot; 31, the target area of the underwater robot; 4, the starting position of the underwater robot; 41, the starting area of the underwater robot.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the invention. For better illustration of the following embodiments, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the actual product dimensions; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Example 1
The embodiment relates to an agent autonomous navigation method based on deep reinforcement learning, in particular to an underwater robot autonomous navigation method based on deep reinforcement learning, which comprises the following steps:
S1, constructing an agent autonomous navigation system, wherein the autonomous navigation system adopts the MS-DDQN algorithm, i.e., a DDQN algorithm oriented to a multi-step mechanism, which is a deep reinforcement learning algorithm; the MS-DDQN algorithm uses a modularized neural network comprising a local obstacle-avoidance deep neural network module, a global navigation deep neural network module and an instruction selection module; the local obstacle-avoidance module guides the agent away from obstacles, the global navigation module guides the agent towards the target position along a shorter path, and the instruction selection module determines the action instruction that is finally executed. The MS-DDQN algorithm comprises a current value network for selecting actions, a target value network for evaluating actions, an error function for updating the weights, a reward function giving the reward obtained when the agent takes an action in the current state and reaches the next state, and an experience pool for storing the sample data generated at every step.
Based on the modularized neural network framework, the underwater robot can adopt different strategies for different environment states. When the underwater robot approaches an obstacle, its main task is to avoid the obstacle and global navigation becomes the secondary task; when the underwater robot is far from obstacles, global navigation becomes the main task and helps the underwater robot reach the target position along a shorter path. For the two subtasks of underwater robot navigation, the local obstacle-avoidance task and the global navigation task, this embodiment proposes a modularized neural network structure: a local obstacle-avoidance neural network and a global navigation neural network are designed for the two subtasks respectively, and through this double-network structure the underwater robot knows clearly what action it should take in each situation, so that it avoids obstacles better and reaches the target position along a shorter path.
The training method of the MS-DDQN algorithm comprises the following steps:
Randomly initialize the weights θ of the current value network Q(s_t, a; θ) and the weights θ' of the target value network Q_target(s_t, a; θ'), where Q(s_t, a; θ) denotes the current value neural network outputting an estimate for each action;
Initialize the experience pool D and set the hyper-parameter lambda,
For episode=1,M do
Reset the simulation environment and obtain the initial observation state s_t, set T ← ∞, and initialize four empty arrays S_t, A, R, S_{t+1}; the array S_t stores the state information of the current state, A stores the actions executed in the current state, R stores the reward values obtained from the current state, and S_{t+1} stores the next-state information; T is mainly used to judge, when the episode ends, whether the data collected in this episode have been stored in the experience pool;
For t=1,2…do
If t < T then
Select the action a_t = argmax_a Q(s_t, a; θ) according to the current policy, execute a_t, obtain the reward r_t and the new state s_{t+1}, and store s_t in S_t, r_t in R, a_t in A and s_{t+1} in S_{t+1}; here t indexes the environment-state data collected by the agent within the episode, a_t is the action executed at time t, and Q(s, a, θ) denotes the current value neural network outputting an estimate for each action;
If s_{t+1} is the terminal state then:
T←t+1
τ←t-λ+1
If τ≥0 then
If τ+λ<T then
r_τ = Σ_{i=τ}^{τ+λ-1} γ^{i-τ} · r_i,   r_i ∈ R_t
else
r_τ = Σ_{i=τ}^{T-1} γ^{i-τ} · r_i,   r_i ∈ R_t
Store (s_τ, a_τ, r_τ, s_{τ+λ}) in the experience pool D,
randomly draw a mini-batch of sample data from the experience pool D,
setting:
y_i = r_i if s_{i+λ} is the terminal state; otherwise y_i = r_i + γ^λ · Q_target(s_{i+λ}, argmax_a Q(s_{i+λ}, a, θ), θ')
update the current value network weights θ by gradient descent on the loss function L(θ) = E[(y_i - Q(s, a, θ))²]
Until τ=T-1;
Here τ is mainly used to judge whether the number of actions the mobile robot has executed exceeds the set number of steps λ. If it does, the value of τ is greater than or equal to 0, which means the agent has already obtained data for at least λ environment states; the influence r_τ of the rewards obtained in the following λ states on state τ can then be calculated, the state information s_τ, the action a_τ and the state information s_{τ+λ} are extracted from the arrays S_t, A and S_{t+1}, and these three items together with r_τ form a training sample tuple that is stored in the experience pool D so that training samples can later be drawn from the pool for training.
When the agent starts an episode of training, first T ← ∞, so t < T and the program enters the second for loop. When a collision occurs or the target position is reached, T ← t + 1; if τ = T - 1, all environment-state information obtained by the agent has been stored as training sample data of this episode according to the multi-step method. When τ + λ < T, the agent has not yet collided and the number of environment states to be stored is greater than or equal to the set value λ, so the influence r_τ of the rewards of the following λ states on state τ is calculated as r_τ = Σ_{i=τ}^{τ+λ-1} γ^{i-τ} · r_i with r_i ∈ R_t; otherwise the agent has collided, the number of environment states to be stored is smaller than the set value λ, and the influence r_τ of the remaining λ-1 (λ-2, λ-3, …, 1) future states on states τ (τ+1, τ+2, …, T-1) is calculated as r_τ = Σ_{i=τ}^{T-1} γ^{i-τ} · r_i with r_i ∈ R_t. y_i denotes, for the i-th sample of the mini-batch drawn from the experience pool, the combination of the action-value estimates given by the target value network and the actually obtained r_i; it is the comprehensive reward of the current state and is used together with the current value network for the gradient-descent operation.
S2, constructing a simulation environment, including constructing an obstacle environment model and constructing a simulation intelligent body.
The software platform and hardware configuration used to construct the simulation environment model and implement the related algorithm are shown in Fig. 14.
Constructing the obstacle environment model: the obstacle environment model is the description of the obstacles in the environment, and the quality of this description directly affects the state information input to the deep reinforcement learning algorithm and the obstacle-avoidance strategy that is finally learned. The grid method and the geometric method are currently two of the more common ways of describing an environment model. Fig. 4 shows a graphical display of an environment model described with the geometric method. The geometric method does not need to partition the environment; instead it describes the obstacle information in the environment with the "points", "lines" and "faces" of the obstacles. Therefore, when environment state information is built with the geometric method, the amount of data describing the environment state does not increase sharply as the environment becomes more complex. The working range of the underwater robot is relatively large and the obstacles are not very dense, so this embodiment chooses the geometric method to construct the environment model.
In this embodiment the obstacles are divided into two types, elliptical obstacles and polygonal obstacles, with circular obstacles also treated as elliptical; for an ellipse the geometric environment model only needs to record the top-left vertex coordinates of the rectangle bounding the ellipse and the lengths of its major and minor axes, while triangles, rectangles and squares are all classified as polygonal obstacles, for which the geometric model records each vertex coordinate of the polygon. Constructing the simulated agent: a real underwater robot is not a particle but a physical object with geometric dimensions. In this embodiment the underwater robot is treated as a particle; to guarantee the navigation safety of the underwater robot in the real environment, the obstacles must be expanded outwards, so this embodiment applies a corresponding dilation treatment to the obstacles. The underwater robot is projected as a dot with a radius of 1 pixel. In an underwater environment, acoustic devices are usually used as sensing instruments for detecting the environment, and since the simulated underwater robot navigates at a fixed depth, a multi-beam forward-looking sonar is adopted as the environment detection instrument in this embodiment, for example a multi-beam forward-looking detection sonar sensor. As shown in Fig. 5, this sonar detector has a vertical opening angle of 17°, a horizontal opening angle of 120°, a maximum effective detection distance of 200 m and 240 beams in total, with an opening angle of 0.5° between adjacent beams. In this embodiment a simulated forward-looking sonar detector is designed: because the underwater robot only performs motion control in the horizontal plane, the forward-looking detector simulator has a horizontal opening angle of 180°, a maximum range of 90 and 36 beams, with an opening angle of 5° between adjacent beams. Fig. 6 shows the simulated forward-looking sonar detector simulator: the black dot 1 represents the embodied underwater robot, line segment 2 is its heading, and the line segments on both sides of line segment 2 are the acoustic rays emitted by the robot's forward-looking sonar detector.
To ensure the uniformity of the observation data detected and collected by the sonar simulator of the underwater robot, the first acoustic ray on the left of the robot's heading is defined as 0° and the first acoustic ray on the right as 180°. The information detected by the sonar detector at each time t is stored sequentially, in the angular order (0°, 5°, 10°, …), as the row vector s_t. If no obstacle is detected, the sonar detector returns the maximum detection distance for that acoustic ray; otherwise it returns the distance from the ray to the obstacle. Finally the detected information is normalized, i.e., the row vector s_t is divided by the maximum effective detection distance. For the motion model of the underwater robot, the robot is set to advance at a constant speed and to perform only discrete steering actions, namely turning left 15°, turning left 30°, keeping the original direction, turning right 15° and turning right 30° relative to its heading.
The environment model is built with the Pygame library. First a 500 × 500 window is defined as the simulation environment, and different obstacles, boundary walls and the target position are added to the window. The built environment model is shown in fig. 7, where the gray ellipses, circles and polygons represent the various types of obstacles. The central black dot 1 represents the materialized underwater robot, and the rays around the black dot represent the forward-looking sonar detector. In the figure, the starting position and target point of the underwater robot are randomly initialized inside the starting area 41 and the target area 31, i.e. the rectangular shaded areas 41 and 31. The black dot 4 represents the starting position and the black dot 3 represents the target position.
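For illustration, the sketch below shows how such a Pygame window could be set up; the obstacle coordinates, region rectangles and function names are illustrative placeholders, not values taken from this embodiment.

```python
# Minimal Pygame sketch of the 2-D simulation window described above.
# All coordinates and names are illustrative, not values from the patent.
import pygame

WINDOW_SIZE = (500, 500)   # 500 x 500 simulation window
WHITE, GRAY, BLACK = (255, 255, 255), (128, 128, 128), (0, 0, 0)

def draw_environment(screen, robot_pos, target_pos):
    screen.fill(WHITE)
    # Elliptical obstacle: stored as the top-left corner of its bounding
    # rectangle plus the lengths of its axes.
    pygame.draw.ellipse(screen, GRAY, pygame.Rect(120, 80, 90, 60))
    # Polygonal obstacle: stored as the coordinates of its vertices.
    pygame.draw.polygon(screen, GRAY, [(300, 300), (360, 320), (340, 380)])
    # Starting region and target region (the rectangular shaded areas).
    pygame.draw.rect(screen, GRAY, pygame.Rect(10, 440, 80, 50), 1)
    pygame.draw.rect(screen, GRAY, pygame.Rect(410, 10, 80, 50), 1)
    # Robot and target rendered as dots (the robot has a 1-pixel radius
    # in the embodiment; drawn slightly larger here for visibility).
    pygame.draw.circle(screen, BLACK, robot_pos, 2)
    pygame.draw.circle(screen, BLACK, target_pos, 2)
    pygame.display.flip()

if __name__ == "__main__":
    pygame.init()
    screen = pygame.display.set_mode(WINDOW_SIZE)
    draw_environment(screen, robot_pos=(40, 460), target_pos=(450, 35))
    pygame.time.wait(2000)
    pygame.quit()
```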
Implementation of the dynamic environment: after the environment model and the simulated underwater robot model are built, related methods are used to make the underwater robot move in the simulated environment, i.e. the dynamic realization of the simulated environment, for example a method for moving the underwater robot and a method for the forward-looking sonar ranging simulator to detect the distance between the robot and an obstacle. At the same time, a rule for judging collisions of the underwater robot and a rule for judging whether the underwater robot has reached the target position are set. The coordinate transformation in the simulation environment is computed after the robot makes the relevant motion from its current position. First, assume the underwater robot starts at position P_start in the lower-left corner of the simulation environment, with coordinates (x_start, y_start) and motion speed v = 0.5, and that the current heading of the robot is angle. In the current state the robot selects an action to execute, i.e. a steering action, turning by angle_tran with angle_tran ∈ (15°_turn_left, 30°_turn_left, 0°, 15°_turn_right, 30°_turn_right). Equation 5-1 gives the heading of the underwater robot after performing the action:
angle ← angle + angle_tran (5-1)
Combining with equation 5-1, the coordinates of the underwater robot become:
x_next = x_start + cos(angle) * v (5-2)
y_next = y_start + sin(angle) * v (5-3)
before the distance between the front view sonar ranging simulator and the obstacle is detected, the projection length of the end point position coordinates of each sound wave line segment on the x axis and the y axis under the central coordinate system of the underwater robot is calculated, and then the projection length of each sound wave line segment under the environment model is calculated. The conversion from the robot-centered coordinate system to the environment model-centered coordinate system is a medium-two-dimensional coordinate system transformation process. Assuming that the coordinates of the robot are (center_x, center_y), the code of solving the coordinates of the end of the acoustic line segment at the coordinate portion centered on the environmental model is shown in fig. 8. When coordinate projection of the sound wave line segments in a coordinate system taking the environment model as a center is obtained, solving of the distance between each sound wave line segment and the underwater robot detected by the underwater robot and the obstacle can be started. Then constructing each side vector of the obstacle, each sound wave line segment vector and the vector from the position coordinates of the robot to each vertex of each obstacle, and solving a relative relation between the vectors in real time to obtain the position information between the underwater robot and the obstacle, so that the length of each sound wave line segment detected by the forward-looking sonar detector can be solved.
Rule for judging a collision of the underwater robot: first set d_min as the minimum safe distance between the underwater robot and an obstacle. If the shortest acoustic line segment currently detected by the forward-looking sonar detector is smaller than the set d_min, a collision is judged to have occurred, the current training round ends, and a new starting position is assigned to the underwater robot. Otherwise no collision has occurred and the underwater robot selects the next action according to the relevant policy. Rule for judging that the underwater robot has reached the target position: first define d_arrival as the maximum distance at which the robot is considered to have reached the target position. While the underwater robot is running, the Euclidean distance between its current position coordinates and the target position coordinates is calculated; if this distance is smaller than d_arrival, the underwater robot has reached the target position.
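The two termination rules above can be written compactly as follows; the sketch assumes the sonar readings are the raw (un-normalized) beam lengths and uses illustrative names.

```python
import math

def is_collision(sonar_readings, d_min):
    """Collision rule: a collision occurs if the shortest detected acoustic
    line segment is smaller than the minimum safe distance d_min."""
    return min(sonar_readings) < d_min

def has_arrived(robot_pos, target_pos, d_arrival):
    """Arrival rule: the target is reached if the Euclidean distance between
    the robot and the target position is smaller than d_arrival."""
    distance = math.hypot(target_pos[0] - robot_pos[0],
                          target_pos[1] - robot_pos[1])
    return distance < d_arrival
```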
S3, placing the autonomous navigation system in the simulation environment for training, namely, training and learning by the intelligent body in the simulation environment by adopting the MS-DDQN algorithm; the number of the simulation environments is multiple, and the training times of each simulation environment are multiple.
S31, the simulation environment acquires the current state information of the agent, which comprises the distance information between the agent and the obstacles in the environment and the positional relationship information between the agent and the target point. The positional relationship information comprises: the relative coordinate relation between the agent's current coordinates and the target position coordinates, obtained by subtracting the agent's current coordinates from the target position coordinates; the Euclidean distance from the agent's current position to the target position; and the included angle between the agent's advancing-direction vector and the vector from the agent's current coordinate position to the target position.
The underwater robot mainly detects obstacle information in the environment through the ranging sensor, and the information acquired by the ranging sensor is the distance between the underwater robot and the obstacles in the environment. To improve learning efficiency and let the underwater robot learn a better strategy, the positional relationship between the underwater robot and the target point is added to the state information fed into the deep neural network, so that the robot can reach the target point along a shorter path. The positional relationship contains three items of information. The first is the relative coordinate relation between the robot's current coordinates and the target position coordinates, obtained by subtracting the robot's current coordinates from the target position coordinates. The second is the Euclidean distance from the robot's current position to the target position, which tells the robot how far it still is from the target. The third is the included angle between the robot's advancing-direction vector and the vector from the robot's current coordinate position to the target position; this information can be used to indicate which heading brings the robot toward the target position most directly.
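A minimal sketch of how these three positional items could be computed is given below; the angle convention and the names are illustrative assumptions.

```python
import math

def positional_state(robot_pos, heading_deg, target_pos):
    """Return the three positional items added to the input state:
    relative coordinates, Euclidean distance to the target, and the angle
    between the heading vector and the robot-to-target vector (degrees)."""
    rel_x = target_pos[0] - robot_pos[0]
    rel_y = target_pos[1] - robot_pos[1]
    distance = math.hypot(rel_x, rel_y)
    # Unit heading vector from the current heading angle.
    hx, hy = math.cos(math.radians(heading_deg)), math.sin(math.radians(heading_deg))
    if distance == 0.0:
        return rel_x, rel_y, distance, 0.0
    cos_angle = (hx * rel_x + hy * rel_y) / distance
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    return rel_x, rel_y, distance, angle
```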
The underwater robot is helped to reach the target position point in a shorter path by adding the position relation information of the underwater robot and the target point in the input state information. Meanwhile, the position relation information is directly used as input state information of the global navigation neural network, and the underwater robot can learn in what direction to advance through the position information so as to reach the target position fastest.
S32, inputting the obtained current state information into the modularized deep neural network, specifically inputting distance information of an agent and an obstacle in the environment and position relation information of the agent and a target point into the local obstacle avoidance deep neural network module, and inputting the position relation information of the agent and the target point into the global navigation deep neural network module.
S33, the local obstacle avoidance deep neural network module and the global navigation deep neural network module output respective control instructions according to the input current state information.
S34, the instruction selection module determines an action instruction output by the local obstacle avoidance depth neural network module or the global navigation depth neural network module by judging a distance value between the intelligent agent and the nearest obstacle.
The system architecture of the autonomous navigation system is shown in fig. 2. The input state information of the local obstacle avoidance neural network comprises the environment information detected by the ranging sensor and the relative position information of the underwater robot; after the state information is propagated forward through the local obstacle avoidance deep neural network, the network directly outputs a control instruction for the underwater robot. The input state information of the global navigation neural network is only the relative position information of the underwater robot, and after processing this input it also outputs a control instruction for the underwater robot. Because both the local obstacle avoidance deep neural network and the global navigation neural network output instructions for controlling the underwater robot, an instruction selection module is designed to decide which network's output is executed. The instruction selection module makes this decision by judging the distance between the underwater robot and the nearest obstacle. In this embodiment a threshold d_to_obs = 40 is set for this decision: when the distance is smaller than 40, the robot is already very close to an obstacle, so the instruction output by the local obstacle avoidance deep neural network is executed; when it is 40 or more, the robot is still far from the obstacles and the instruction of the global navigation neural network should be executed so as to reach the target position faster. The components internal to the deep reinforcement learning system are the local obstacle avoidance module, the global navigation module, the instruction selection module and the action, and the components related to the external environment are the ranging sensor information, the relative position information and the environment. The internal neural network structures of the local obstacle avoidance module and the global navigation module are shown in fig. 3; both use fully connected structures, because the system perceives the environment with a forward-looking sonar detector whose returned information has low dimensionality and a small data volume, so no complex convolutional layers need to be built. The local obstacle avoidance neural network contains three hidden layers with 256, 138 and 32 neurons respectively. The global navigation neural network contains only one hidden layer with 32 neurons. The neural network structure of the global navigation module can be so simple because its input state information is only the relative position information of the underwater robot, and one hidden layer is enough to solve the global navigation problem well.
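As an illustration of this architecture, the sketch below uses PyTorch (the patent does not name a framework); the layer widths follow the text (256, 138 and 32 neurons for the local module, 32 for the global module), while the input sizes (36 sonar beams plus 4 positional values) and the action count of 5 are assumptions drawn from the description.

```python
import torch
import torch.nn as nn

N_ACTIONS = 5  # left 15/30 degrees, keep direction, right 15/30 degrees

class LocalAvoidanceNet(nn.Module):
    """Local obstacle avoidance module: sonar distances plus positional
    information in, one value estimate per discrete steering action out."""
    def __init__(self, n_sonar=36, n_pos=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_sonar + n_pos, 256), nn.ReLU6(),  # ReLU6 at the front end
            nn.Linear(256, 138), nn.ReLU(),
            nn.Linear(138, 32), nn.ReLU(),
            nn.Linear(32, N_ACTIONS),
        )

    def forward(self, x):
        return self.net(x)

class GlobalNavigationNet(nn.Module):
    """Global navigation module: positional information only, one hidden layer."""
    def __init__(self, n_pos=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_pos, 32), nn.ReLU(),
            nn.Linear(32, N_ACTIONS),
        )

    def forward(self, x):
        return self.net(x)

def select_action(local_q, global_q, dist_to_nearest_obstacle, d_to_obs=40.0):
    """Instruction selection: use the local module close to obstacles,
    otherwise use the global navigation module."""
    q_values = local_q if dist_to_nearest_obstacle < d_to_obs else global_q
    return int(torch.argmax(q_values))
```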
In order to optimize the input information of the state space, the Euclidean distance from the underwater robot to the target position and the included angle between the robot's advancing-direction vector and the vector from the current robot coordinate position to the target position are added to the original state space. Adding these two items lets the underwater robot understand its current position state better, so that it can reach the target position point along a shorter path.
S35, the intelligent agent executes the instruction selected by the instruction selection module, receives the reward value output by the reward function and enters the next state.
In order to improve the convergence speed of the algorithm, so that the underwater robot can better avoid obstacles and reach the target position along a shorter path, this embodiment adopts a new continuous combined reward function. With the continuous combined reward function the underwater robot obtains a corresponding reward at every learning step, which is more helpful for guiding it toward the target point and away from the obstacles. The continuous combined reward function includes a terminal reward and a non-terminal reward; the terminal reward specifically comprises:
The agent obtains a positive reward when it reaches the target point, expressed as r_arr = 100 if d_r-t ≤ d_win, where d_r-t is the Euclidean distance from the agent to the target point and d_win is the threshold for reaching the target point: when d_r-t is less than or equal to the set d_win the target point has been reached, otherwise it has not. r_arr is the positive reward value obtained when the agent reaches the target position;
A negative reward is obtained when the agent collides with an obstacle, expressed as r_col = −100 if d_r-o ≤ d_col, where d_r-o is the Euclidean distance between the agent and the nearest obstacle and d_col is the threshold for a collision between the agent and an obstacle: when d_r-o is less than or equal to d_col a collision has occurred, otherwise it has not. r_col is the negative penalty reward value when the agent is involved in a collision;
the non-terminal rewards specifically include:
The agent obtains a positive reward when advancing towards the target point, expressed as r_t_goal = c_r [d_r-t(t) − d_r-t(t−1)], where c_r ∈ (0, 1] and is set to 1;
When the minimum distance between the agent and the obstacles keeps decreasing, the danger reward r_dang ∈ [0, 1] obtained also becomes smaller, with 0 ≤ r_dang ≤ 1 (the expression is given as a formula image in the original filing). Here d_min is the minimum distance between the agent and the obstacles, and β is a coefficient that maps the value range of r_dang into (0, 1); d_r-t(t) denotes the Euclidean distance between the agent's current position coordinates and the target position coordinates at time t;
When the included angle between the agent's advancing-direction vector and the direction vector from the agent's current coordinates to the target position is smaller than ±18 degrees, the agent obtains a reward of 1; when the angle is larger than ±18 degrees and smaller than ±72 degrees, the agent obtains a reward of 0.3; in all other cases the reward is 0, i.e.:
reward = 1 if |a_ori| < 18°; 0.3 if 18° ≤ |a_ori| < 72°; 0 otherwise,
where a_ori is the included angle between the agent's advancing-direction vector and the direction vector from the agent's current coordinates to the target position.
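A sketch of the continuous combined reward is given below; the way the non-terminal terms are combined (here a simple sum), the threshold values and the exact danger-reward expression are illustrative assumptions, since the patent gives the latter only as a formula image.

```python
import math

def combined_reward(d_rt, d_rt_prev, d_ro, d_min_obstacle, a_ori_deg,
                    d_win=5.0, d_col=2.0, c_r=1.0, beta=0.01):
    """Continuous combined reward: terminal rewards first, then the sum of
    the non-terminal terms. Threshold values and beta are placeholders."""
    # Terminal rewards
    if d_rt <= d_win:
        return 100.0                      # target position reached
    if d_ro <= d_col:
        return -100.0                     # collision with an obstacle
    # Non-terminal rewards
    r_goal = c_r * (d_rt - d_rt_prev)     # progress term as written in the text
    # Danger term: an assumed shape that shrinks as the obstacle distance shrinks.
    r_dang = 1.0 - math.exp(-beta * d_min_obstacle)
    if abs(a_ori_deg) < 18.0:
        r_ori = 1.0
    elif abs(a_ori_deg) < 72.0:
        r_ori = 0.3
    else:
        r_ori = 0.0
    return r_goal + r_dang + r_ori
```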
S36, the sample data of the current interaction is stored in the experience pool; the sample data comprises the current state s_t, the executed action a_t, the obtained reward value r_t and the next state s_{t+1}, and is stored in the experience pool in the form (s_t, a_t, r_t, s_{t+1}).
S4, loading the trained autonomous navigation system onto the intelligent body, wherein the intelligent body obtains the autonomous navigation capability.
The specific steps of the local obstacle avoidance deep neural network module and the global navigation deep neural network module in the S33 are as follows:
s331, the current value network receives the input current state information, processes according to the input current state information, and outputs a group of actions, wherein the processing process comprises extracting training samples from an experience pool according to the current state information; and transmitting the outputted set of actions to the target value network.
The output of the target value network is:
Y_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + … + γ^λ·r_{t+λ} + γ^λ·Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ′)
the loss function is:
E = (Y_t − Q(s_t, a_t, θ))²
the data stored in the experience pool are as follows:
(s_t, a_t, r_t + γ·r_{t+1} + γ²·r_{t+2} + … + γ^λ·r_{t+λ}, s_{t+λ})
wherein t is a time instant, s is a state, and i is the index of the reward value obtained in each state after state t begins; s_t is the current state, a is the performed action, a_t denotes the action executed at time t, and r_t denotes the immediate reward value obtained by the agent at time t; γ is the discount factor, γ^i represents the degree to which the reward value r_{t+i} obtained in state t+i influences the current state t, and γ is a value smaller than 1 and greater than 0; λ is the number of interval steps, and s_{t+λ} is the state after λ steps.
In single-step reinforcement learning, a training sample tuple contains the current state s_t, the action a_t taken in the current state, the obtained reward value r_t, and the next state s_{t+1}; hence a training sample tuple is (s_t, a_t, r_t, s_{t+1}). In multi-step reinforcement learning, a training sample tuple contains the current state s_t, the action a_t taken in the current state, the obtained reward value r_t, the reward value r_{t+1} obtained in the next state, the reward value r_{t+2} obtained in the state after that, and so on until the λ-th state, in which the obtained reward value is r_{t+λ} and the state is s_{t+λ}; hence a training sample tuple is (s_t, a_t, r_t, r_{t+1}, r_{t+2}, …, r_{t+i}, …, r_{t+λ}, s_{t+λ}). Thus i is the index of the reward value obtained in each state after state t begins.
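For illustration, consecutive single-step transitions can be rolled into such a multi-step tuple as sketched below; the indexing convention (which reward belongs to which state) follows the wording above and is an assumption.

```python
def build_multi_step_sample(transitions, lam):
    """Roll consecutive single-step transitions (s, a, r, s_next) into one
    multi-step tuple (s_t, a_t, [r_t, ..., r_{t+lam}], s_{t+lam})."""
    s_t, a_t, _, _ = transitions[0]
    rewards = [tr[2] for tr in transitions[:lam + 1]]   # r_t .. r_{t+lam}
    s_lam = transitions[lam][0]                         # state s_{t+lam}
    return s_t, a_t, rewards, s_lam
```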
γ is the discount factor; γ^i represents the degree to which the reward value r_{t+i} obtained in state t+i influences the current state t, and γ is a value smaller than 1 and greater than 0. For example, with γ = 0.5 the weights decay as γ¹ = 0.5, γ² = 0.25, γ³ = 0.125, …, so the further a reward lies from the current state t, the smaller its contribution to the multi-step return. γ^λ represents the degree to which the reward value r_{t+λ} obtained in state t+λ influences the current state t. Q(s_{t+λ}, a, θ) denotes the estimated value of each action output by the current value network given the input information (s_{t+λ}, a). Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ)) means that the action instruction corresponding to the maximum estimated value output by the current value network is first selected and, together with s_{t+λ}, used as input to the target value network, which then outputs an evaluation value for each action. Q(s, a, θ) denotes the estimated value of each action output by the current value network.
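Following the target expression reconstructed above, the multi-step target for one stored tuple could be evaluated as sketched below; it assumes the networks map a single state tensor to a 1-D tensor of action values, and the γ^λ weighting of the bootstrap term is part of that reconstruction.

```python
import torch

def multi_step_target(sample, gamma, q_current, q_target):
    """Multi-step double-DQN target for one stored sample
    (s_t, a_t, [r_t, ..., r_{t+lam}], s_{t+lam})."""
    s_t, a_t, rewards, s_lam = sample
    lam = len(rewards) - 1
    # Discounted sum of the stored reward values.
    n_step_return = sum((gamma ** i) * r for i, r in enumerate(rewards))
    # Double DQN: the current network selects the action, the target network evaluates it.
    best_action = int(torch.argmax(q_current(s_lam)))
    bootstrap = q_target(s_lam)[best_action]
    return n_step_return + (gamma ** lam) * bootstrap
```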
S332, the target value network processes according to the group of actions, the value of each action is calculated, and the value of the group of actions is transmitted back to the current value network.
S333, the current value network selects the next action to be executed according to the maximum value in the value of the group of actions returned by the target value network, and outputs a control instruction to the simulation environment according to the next action.
S334, the current value network calculates next state information according to the next action in S333, processes according to the next state information, and outputs a group of actions, wherein the processing process comprises the next state information and extraction of training samples from an experience pool; and transmitting the outputted set of actions to the target value network; the target value network repeats step S332, i.e. the target value network processes according to the group of actions, calculates the value of each action, and then returns the value of the group of actions to the current value network.
S335, comparing the value of the group of actions returned by the target value network in S334 with the value of the group of actions returned by the target value network in S332 by the current value network, and calculating an error according to an error function.
S336, updating the weight of the current value network according to the error obtained in S335, and updating the weight of the target value network according to the weight of the current value network after every N steps.
S337, while the step S334 is performed, the current value network outputs a control instruction to the simulation environment according to the selected next action.
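Putting steps S331 to S337 together, one interaction-and-update cycle could look like the sketch below; the environment interface (current_state, step_multi), the epsilon-greedy exploration and all hyper-parameter values are illustrative assumptions, and multi_step_target is the helper sketched earlier.

```python
import random
import torch
import torch.nn.functional as F

def train_step(env, q_current, q_target, optimizer, replay, step,
               gamma=0.99, lam=4, batch_size=64, sync_every=200, eps=0.1):
    """One interaction/update cycle following steps S331-S337."""
    state = env.current_state()
    # S331/S333: the current value network estimates action values and selects the action.
    with torch.no_grad():
        q_values = q_current(state)
    if random.random() < eps:
        action = random.randrange(len(q_values))
    else:
        action = int(torch.argmax(q_values))
    # S337: the control instruction is executed in the simulation environment;
    # the environment returns the rewards r_t..r_{t+lam} and the state s_{t+lam}.
    rewards, next_state = env.step_multi(action, lam)
    replay.append((state, action, rewards, next_state))
    if len(replay) < batch_size:
        return
    # S332/S334/S335: evaluate sampled actions with the target value network
    # and compute the error against the current value network.
    batch = random.sample(replay, batch_size)
    targets, estimates = [], []
    for s, a, rs, s_lam in batch:
        with torch.no_grad():
            targets.append(multi_step_target((s, a, rs, s_lam), gamma, q_current, q_target))
        estimates.append(q_current(s)[a])
    loss = F.mse_loss(torch.stack(estimates), torch.stack(targets))
    # S336: update the current value network; copy its weights to the
    # target value network every N (= sync_every) steps.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % sync_every == 0:
        q_target.load_state_dict(q_current.state_dict())
```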
In order to improve the learning ability of the neural network and suppress vanishing gradients, this embodiment combines ReLU6 and ReLU as the activation functions of the neural network framework; that is, the activation functions adopted by the local obstacle avoidance deep neural network module and the global navigation deep neural network module are ReLU6 and ReLU, with ReLU6 applied at the front end of the neural network and ReLU applied at the back end.
The ReLU function is defined as:
ReLU(x) = max(0, x)
As shown in fig. 9, when the input value is negative or 0 the ReLU output is 0, and when the input value is greater than 0 the ReLU outputs the input value itself. This one-sided activation characteristic of the ReLU allows sparse activation of the neurons in the neural network. The ReLU alleviates the vanishing-gradient problem that easily occurs with the Sigmoid and Tanh functions, so the convergence of the neural network is more stable.
ReLU6 is an improved activation function derived from the ReLU, defined as:
ReLU6(x) = min(max(0, x), 6)
The ReLU6 function mainly modifies the positive part of the ReLU input: when the input value is greater than 6, the ReLU6 output is always 6; if the input value is a real number greater than 0 and not greater than 6, the input value itself is output; otherwise 0 is output, as shown in fig. 10, which plots the ReLU6 function with Python. ReLU6 encourages the neural network model to learn the relevant sparse characteristics of the input data earlier. Both the ReLU and ReLU6 activation functions avoid the vanishing-gradient phenomenon. Using the ReLU6 function as the activation at the front end of the network allows the sparse features of the data samples to be learned quickly. Finally, the network outputs an evaluation value for each action, and the action corresponding to the highest evaluation value is selected as the action the underwater robot executes.
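As a quick illustration (the plotting itself is omitted), the two activation functions can be written with NumPy as follows.

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x)."""
    return np.maximum(0.0, x)

def relu6(x):
    """ReLU6(x) = min(max(0, x), 6): the positive part is capped at 6."""
    return np.minimum(np.maximum(0.0, x), 6.0)

# Behaviour described in the text.
xs = np.array([-2.0, 0.0, 3.0, 8.0])
print(relu(xs))    # [0. 0. 3. 8.]
print(relu6(xs))   # [0. 0. 3. 6.]
```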
Underwater robot navigation models were trained with different algorithms to verify the validity of the MS-DDQN algorithm. First, as shown in fig. 7, the established simulation environment model is used as the underwater robot training environment, denoted training environment 1 (Env-1). To verify the effectiveness of the MS-DDQN method, the navigation ability of the underwater robot was tested in Env-1, comparing the MS-DDQN algorithm with the DDQN, prioritized DQN and prioritized DDQN algorithms. To ensure fairness of the experiments, the same network structure and the same software and hardware platform were used for model training. Before training began, the relevant hyper-parameters of deep reinforcement learning were set accordingly, as shown in fig. 15. To evaluate each algorithm quantitatively, three metrics are used to assess the quality of the navigation models. The first is the success rate, i.e. the proportion of all training episodes in which the underwater robot successfully reaches the target position after training starts. The second is the reward-value curve, representing the sum of reward values obtained per episode during training; to smooth the reward curve, a sliding-average method with a window size of 300 is used. The third is the average reward value, i.e. the sum of rewards obtained by the underwater robot during training divided by the number of training rounds. Underwater robots based on the MS-DDQN, DDQN, prioritized DQN and prioritized DDQN algorithms trained their autonomous navigation ability in environment 1, and the training results are shown in fig. 11.
As shown in fig. 12 (a), the success-rate curve of MS-DDQN rises faster than those of the other three methods, indicating that the learning efficiency of the MS-DDQN algorithm is higher. This is also demonstrated by the reward curve in fig. 12 (b). After 3000 rounds of training, the success rate of reaching the target position is 80.133% for MS-DDQN, 61.7% for DDQN, 63.633% for prioritized DQN, and 53.366% for prioritized DDQN; the success rate of MS-DDQN is far higher than that of the other algorithms. This shows that the MS-DDQN-based underwater robot completes more collision-free episodes that reach the target point during training and has stronger obstacle avoidance and navigation capability. In fig. 12 (b) it can be seen that after 500 rounds of training the reward curve of MS-DDQN stabilizes above 200, while the curves of the other three algorithms show larger fluctuations, indicating that the MS-DDQN-based navigation model has higher stability. In fig. 12 (c), the average reward value is 185.072 for MS-DDQN, 130.064 for DDQN, 132.067 for prioritized DQN and 101.650 for prioritized DDQN, which also demonstrates that the MS-DDQN-based underwater robot has stronger navigation ability, because a lower reward value means many negative rewards, i.e. more collisions. By analyzing the success-rate curves, the per-round reward curves and the average reward values of the navigation models trained with the different algorithms, it can be seen that, compared with the other three algorithms, the underwater robot based on the MS-DDQN algorithm has higher learning efficiency during training and the trained navigation model has higher stability.
The navigation ability and generalization ability of the navigation models were then tested. After the 3000 rounds of training in environment 1, navigation models based on the MS-DDQN, DDQN, prioritized DQN and prioritized DDQN algorithms were obtained. These navigation models were first tested 200 times in environment 1 and the proportion of runs that successfully reached the target position was analyzed. In the 200 tests, the starting position and the target position of the underwater robot were randomly assigned. The superiority of the navigation models based on the different algorithms is measured by comparing the success rate of reaching the target position and the average reward obtained over the 200 tests; the higher the success rate and the average reward, the better the navigation strategy. The results are shown in fig. 16: after 3000 rounds of training, the underwater robots trained with these four algorithms have basically learned how to avoid obstacles and reach the target position in environment 1. According to the test results, the MS-DDQN algorithm performs best, with a success rate of 100% and the highest average reward, showing that the MS-DDQN-based underwater robot has stronger obstacle avoidance ability and a better navigation strategy. The navigation trajectory of the MS-DDQN-trained underwater robot in Env-1 is shown in fig. 11. To fully evaluate the generalization ability of the navigation models based on the different algorithms, four additional test environments different from the training environment were designed, with sizes 500 × 500, 600 × 600, 700 × 700 and 800 × 800, denoted environment 2 to environment 5. As in the training environment, the starting position and the target position of the underwater robot are randomly initialized in the starting region 41 and the target region 31, i.e. the rectangular shaded regions 41 and 31. The navigation models trained by the MS-DDQN, DDQN, prioritized DQN and prioritized DDQN algorithms were each tested 200 times in the four test environments. Fig. 13 shows the navigation trajectories of the MS-DDQN-based underwater robot in the four unknown complex test environments, demonstrating that the navigation model trained by MS-DDQN has strong generalization ability and can adapt to new unknown environments without further training.
The results of the 200 test rounds in the four different test environments are shown in fig. 17. The MS-DDQN-trained navigation model achieves a success rate of 97% in Env-2, 91% in Env-3, 94% in Env-4 and 96% in Env-5, whereas the navigation models trained by the other three algorithms do not reach a 90% success rate in any test environment; in Env-3 the success rate of DDQN is only 46%. The results show that the MS-DDQN-based navigation model has strong generalization ability, so the underwater robot can navigate in new unknown environments without retraining. On the other hand, the test results also confirm the conclusion drawn from fig. 12 (b), namely that the navigation strategy trained by MS-DDQN is more stable than those of DDQN, prioritized DQN and prioritized DDQN. The generalization ability of prioritized DQN and prioritized DDQN is better than that of DDQN, and they have stronger navigation ability because they perform targeted training and learning on collision training samples during the training process. The above experiments show that the generalization ability of the MS-DDQN-trained navigation model is better than that of DDQN, prioritized DQN and prioritized DDQN. The reason is that MS-DDQN spreads the reward values obtained during training toward the later state steps, which lets the underwater robot learn autonomous navigation faster and also helps it perceive obstacles and the target position in advance and take obstacle-avoiding or target-approaching actions earlier.
This embodiment uses a geometric method to simulate a 2-dimensional underwater environment containing multiple types of dense obstacles. The effectiveness of the MS-DDQN algorithm is verified by comparing, in the simulated training environment, the navigation ability of underwater robots trained with different algorithms. Relevant navigation tests are also carried out in four test environments completely different from the training environment, and the experiments prove that the underwater robot trained with the MS-DDQN algorithm has stronger generalization ability and can adapt to new obstacle environments without retraining.
It should be understood that the foregoing examples of the present invention are merely illustrative of the present invention and are not intended to limit the present invention to the specific embodiments thereof. Any modification, equivalent replacement, improvement, etc. that comes within the spirit and principle of the claims of the present invention should be included in the protection scope of the claims of the present invention.

Claims (8)

1. The intelligent body autonomous navigation method based on deep reinforcement learning is characterized by comprising the following steps of:
s1, constructing an intelligent autonomous navigation system, wherein the intelligent autonomous navigation system adopts an MS-DDQN algorithm, namely a DDQN algorithm oriented to a multi-step mechanism; the MS-DDQN algorithm adopts a modularized neural network, and the modularized neural network comprises a local obstacle avoidance deep neural network module, a global navigation deep neural network module and an instruction selection module; the local obstacle avoidance depth neural network module is used for guiding the intelligent agent to be far away from the obstacle, the global navigation depth neural network module is used for guiding the intelligent agent to go to the target position towards a closer path, and the instruction selection module is used for determining a finally executed action instruction;
S2, constructing a simulation environment, including constructing an obstacle environment model and constructing a simulation intelligent body;
s3, placing the autonomous navigation system in the simulation environment for training, namely, training and learning by the intelligent body in the simulation environment by adopting the MS-DDQN algorithm; the number of the simulation environments is multiple, and the training times of each simulation environment are multiple;
s4, loading the trained autonomous navigation system onto the intelligent body, wherein the intelligent body obtains the autonomous navigation capability;
the MS-DDQN algorithm comprises a current value network, a target value network, an error function, a reward function and an experience pool, wherein the current value network is used for selecting actions, the target value network is used for evaluating the actions, the error function is used for updating weights, the reward function refers to the reward value obtained when an agent takes a certain action in a current state and reaches a next state, and the experience pool is used for storing sample data generated by each step.
2. The intelligent agent self-navigation method based on deep reinforcement learning according to claim 1, wherein the output function of the target value network is:
Y_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + … + γ^λ·r_{t+λ} + γ^λ·Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ), θ′)
wherein γ is a discount factor, γ^i represents the degree to which the reward value r_{t+i} obtained in state t+i influences the current state t, γ is a value smaller than 1 and greater than 0, and γ^λ represents the degree to which the reward value r_{t+λ} obtained in state t+λ influences the current state t; Q is the state-action value estimate, λ is the number of interval steps, s_t is the current state, a_t is the action performed in the current state, r_t denotes the immediate reward value obtained by the agent at time t, r_{t+λ} is the reward value obtained in the λ-th state, and s_{t+λ} is the state after λ steps; θ is the weight parameter of the current value network and θ′ is the weight parameter of the target value network; i is the index of the reward value obtained in each state after state t begins; Q(s_{t+λ}, a, θ) denotes the estimated value of each action output by the current value network given the input information (s_{t+λ}, a); Q_target(s_{t+λ}, argmax_a Q(s_{t+λ}, a, θ)) denotes that the action instruction corresponding to the maximum estimated value output by the current value network is first selected and, together with s_{t+λ}, used as input to the target value network, which outputs an estimated value for each action;
the loss function is:
E = (Y_t − Q(s_t, a_t, θ))²
wherein E is the neural network error, s is the state, a is the performed action, θ is the weight parameter of the current value network, Q is the state-action value estimate, and Q(s, a, θ) denotes the estimated value of each action output by the current value network;
The data stored in the experience pool are as follows:
(s_t, a_t, r_t + γ·r_{t+1} + γ²·r_{t+2} + … + γ^λ·r_{t+λ}, s_{t+λ})
wherein t is a time instant, s is a state, and i is the index of the reward value obtained in each state after state t begins; s_t is the current state, a is the performed action, a_t denotes the action executed at time t, and r_t denotes the immediate reward value obtained by the agent at time t; γ is the discount factor, γ^i represents the degree to which the reward value r_{t+i} obtained in state t+i influences the current state t, and γ is a value smaller than 1 and greater than 0; λ is the number of interval steps, and s_{t+λ} is the state after λ steps.
3. The method for autonomous navigation of an agent based on deep reinforcement learning according to claim 2, wherein the training step in S3 specifically includes:
s31, acquiring current state information of an intelligent agent by a simulation environment, wherein the current state information comprises distance information of the intelligent agent and an obstacle in the environment and position relation information of the intelligent agent and a target point; the position relation information of the intelligent body and the target point comprises a relative coordinate relation of the current coordinate of the intelligent body and the target position coordinate, an Euclidean distance of the current position of the intelligent body to the target position and an included angle between a vector of the advancing direction of the intelligent body and a vector of the direction from the current coordinate position of the intelligent body to the target position, wherein the relative coordinate relation is obtained by subtracting the current coordinate of the underwater intelligent body from the target position coordinate;
S32, inputting the acquired current state information into the modularized neural network, specifically inputting distance information of an agent and an obstacle in the environment and position relation information of the agent and a target point into the local obstacle avoidance depth neural network module, and inputting the position relation information of the agent and the target point into the global navigation depth neural network module;
s33, outputting respective control instructions by the local obstacle avoidance depth neural network module and the global navigation depth neural network module according to the input current state information;
s34, determining an action instruction output by the local obstacle avoidance depth neural network module or the global navigation depth neural network module by the instruction selection module through judging a distance value between the intelligent agent and the nearest obstacle;
s35, the intelligent agent executes the instruction selected by the instruction selection module, receives the reward value output by the reward function and enters the next state;
s36, storing sample data of the current interaction into the experience pool, wherein the sample data comprises a current state s t Executing action a t Obtain the prize value r t The next state s t+1 The sample data is stored in the experience pool in the form of
Figure FDA0004231148880000031
4. The intelligent agent autonomous navigation method based on deep reinforcement learning according to claim 3, wherein the specific steps of the local obstacle avoidance deep neural network module and the global navigation deep neural network module are as follows:
s331, the current value network receives the input current state information, processes according to the input current state information, and outputs a group of actions, wherein the processing process comprises extracting training samples from an experience pool according to the current state information; and transmitting the outputted set of actions to the target value network;
s332, the target value network processes according to the group of actions, the value of each action is calculated, and the value of the group of actions is transmitted back to the current value network;
s333, selecting a next action to be executed according to the maximum value in the value of a group of actions returned by the target value network by the current value network, and outputting a control instruction to the simulation environment according to the next action;
s334, the current value network calculates next state information according to the next action in S333, processes according to the next state information, and outputs a group of actions, wherein the processing process comprises the next state information and extraction of training samples from an experience pool; and transmitting the outputted set of actions to the target value network; the target value network repeats step S332, i.e. the target value network processes according to the group of actions, calculates the value of each action, and then returns the value of the group of actions to the current value network;
S335, comparing the value of the group of actions returned by the target value network in S334 with the value of the group of actions returned by the target value network in S332 by the current value network, and calculating an error according to an error function;
s336, updating the weight of the current value network according to the error obtained in the S335, wherein the weight of the target value network is updated according to the weight of the current value network after every N steps;
s337, while the step S334 is performed, the current value network outputs a control instruction to the simulation environment according to the selected next action.
5. A deep reinforcement learning-based agent autonomous navigation method according to claim 3, wherein the reward function is a continuous combined reward function comprising a terminal reward and a non-terminal reward; the terminal rewards specifically comprise:
the agent obtains a positive reward when it reaches the target point, expressed as r_arr = 100 if d_r-t ≤ d_win, wherein d_r-t is the Euclidean distance from the agent to the target point and d_win is the threshold for the agent to reach the target point: when d_r-t is less than or equal to the set d_win the target point has been reached, otherwise it has not; r_arr denotes the positive reward value obtained when the agent reaches the target position;
a negative reward is obtained when the agent collides with an obstacle, expressed as r_col = −100 if d_r-o ≤ d_col, wherein d_r-o is the Euclidean distance between the agent and the nearest obstacle and d_col is the threshold for a collision between the agent and an obstacle: when d_r-o is less than or equal to d_col a collision has occurred, otherwise it has not; r_col denotes the negative penalty reward value when the agent is involved in a collision;
the non-terminal rewards specifically include:
the agent obtains a positive reward when advancing towards the target point, expressed as r_t_goal = c_r [d_r-t(t) − d_r-t(t−1)], wherein c_r ∈ (0, 1] and is set to 1;
when the minimum distance between the agent and the obstacles keeps decreasing, the danger reward r_dang ∈ [0, 1] obtained also becomes smaller (the expression is given as a formula image in the original filing),
wherein d_min is the minimum distance between the agent and the obstacles, and β is a coefficient that maps the value range of r_dang into (0, 1); d_r-t(t) denotes the Euclidean distance between the agent's current position coordinates and the target position coordinates at time t;
when the included angle between the agent's advancing-direction vector and the direction vector from the agent's current coordinates to the target position is smaller than ±18 degrees, the agent obtains a reward of 1; when the angle is larger than ±18 degrees and smaller than ±72 degrees, the agent obtains a reward of 0.3; in all other cases the reward is 0, i.e.:
reward = 1 if |a_ori| < 18°; 0.3 if 18° ≤ |a_ori| < 72°; 0 otherwise,
wherein a_ori is the included angle between the agent's advancing-direction vector and the direction vector from the agent's current coordinates to the target position.
6. The method for autonomous navigation of an agent based on deep reinforcement learning according to any one of claims 1 to 5, wherein activation functions adopted by the local obstacle avoidance deep neural network module and the global navigation deep neural network module are ReLU6 and ReLU; the activation function ReLU6 is applied to the front end of the neural network, and the activation function ReLU is applied to the back end of the neural network.
7. The method for autonomous navigation of an agent based on deep reinforcement learning according to any one of claims 1 to 5, wherein the local obstacle avoidance deep neural network module and the global navigation deep neural network module each adopt a fully connected structure, the number of hidden layers of the local obstacle avoidance deep neural network module is more than three, and the number of hidden layers of the global navigation deep neural network module is one.
8. The intelligent agent autonomous navigation method based on deep reinforcement learning according to any one of claims 3 to 5, wherein the instruction selection module is provided with a threshold value and selects the action instruction by comparing the distance between the agent and the nearest obstacle with the threshold value; when the distance is smaller than the threshold value of 40, the action instruction output by the local obstacle avoidance deep neural network module is selected; and when the distance is greater than or equal to 40, the action instruction output by the global navigation neural network module is selected.
CN202011023274.4A 2020-09-25 2020-09-25 Intelligent autonomous navigation method based on deep reinforcement learning Active CN112179367B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011023274.4A CN112179367B (en) 2020-09-25 2020-09-25 Intelligent autonomous navigation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011023274.4A CN112179367B (en) 2020-09-25 2020-09-25 Intelligent autonomous navigation method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112179367A CN112179367A (en) 2021-01-05
CN112179367B true CN112179367B (en) 2023-07-04

Family

ID=73943509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011023274.4A Active CN112179367B (en) 2020-09-25 2020-09-25 Intelligent autonomous navigation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112179367B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112904848B (en) * 2021-01-18 2022-08-12 长沙理工大学 Mobile robot path planning method based on deep reinforcement learning
CN112925307B (en) * 2021-01-20 2023-03-24 中国科学院重庆绿色智能技术研究院 Distributed multi-robot path planning method for intelligent warehousing robot system
CN112947421B (en) * 2021-01-28 2022-09-13 西北工业大学 AUV autonomous obstacle avoidance method based on reinforcement learning
CN113064422B (en) * 2021-03-09 2022-06-28 河海大学 Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
CN113033118B (en) * 2021-03-10 2022-03-25 山东大学 Autonomous floating control method of underwater vehicle based on demonstration data reinforcement learning technology
CN113146624B (en) * 2021-03-25 2022-04-29 重庆大学 Multi-agent control method based on maximum angle aggregation strategy
CN112926729B (en) * 2021-05-06 2021-08-03 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN113218399B (en) * 2021-05-12 2022-10-04 天津大学 Maze navigation method and device based on multi-agent layered reinforcement learning
CN113359717B (en) * 2021-05-26 2022-07-26 浙江工业大学 Mobile robot navigation obstacle avoidance method based on deep reinforcement learning
CN113156980B (en) * 2021-05-28 2022-11-29 山东大学 Tower crane path planning method and system based on deep reinforcement learning
CN113312874B (en) * 2021-06-04 2022-12-06 福州大学 Overall wiring method based on improved deep reinforcement learning
CN113269315B (en) * 2021-06-29 2024-04-02 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing tasks using deep reinforcement learning
CN113485367B (en) * 2021-08-06 2023-11-21 浙江工业大学 Path planning method for stage multifunctional mobile robot
CN113759901A (en) * 2021-08-12 2021-12-07 杭州电子科技大学 Mobile robot autonomous obstacle avoidance method based on deep reinforcement learning
CN113691334B (en) * 2021-08-23 2022-11-15 广东工业大学 Cognitive radio dynamic power distribution method based on secondary user group cooperation
CN113805597B (en) * 2021-09-28 2023-04-11 福州大学 Obstacle self-protection artificial potential field method local path planning method based on particle swarm optimization
CN114415663A (en) * 2021-12-15 2022-04-29 北京工业大学 Path planning method and system based on deep reinforcement learning
CN114355915B (en) * 2021-12-27 2024-04-02 杭州电子科技大学 AGV path planning based on deep reinforcement learning
CN114354082B (en) * 2022-03-18 2022-05-31 山东科技大学 Intelligent tracking system and method for submarine pipeline based on imitated sturgeon whisker
CN114603564B (en) * 2022-04-28 2024-04-12 中国电力科学研究院有限公司 Mechanical arm navigation obstacle avoidance method, system, computer equipment and storage medium
CN114995468B (en) * 2022-06-06 2023-03-31 南通大学 Intelligent control method of underwater robot based on Bayesian depth reinforcement learning
CN114964268B (en) * 2022-07-29 2023-05-02 白杨时代(北京)科技有限公司 Unmanned aerial vehicle navigation method and device
TWI815613B (en) * 2022-08-16 2023-09-11 和碩聯合科技股份有限公司 Navigation method for robot and robot thereof
CN116755329B (en) * 2023-05-12 2024-05-24 江南大学 Multi-agent danger avoiding and escaping method and device based on deep reinforcement learning
CN116443217B (en) * 2023-06-16 2023-08-22 中交一航局第一工程有限公司 Piling ship parking control method and device, piling ship and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102644A (en) * 2017-06-22 2017-08-29 华南师范大学 The underwater robot method for controlling trajectory and control system learnt based on deeply
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN109540151A (en) * 2018-03-25 2019-03-29 哈尔滨工程大学 A kind of AUV three-dimensional path planning method based on intensified learning
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning
CN110515303A (en) * 2019-09-17 2019-11-29 余姚市浙江大学机器人研究中心 A kind of adaptive dynamic path planning method based on DDQN

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102644A (en) * 2017-06-22 2017-08-29 华南师范大学 The underwater robot method for controlling trajectory and control system learnt based on deeply
CN109540151A (en) * 2018-03-25 2019-03-29 哈尔滨工程大学 A kind of AUV three-dimensional path planning method based on intensified learning
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning
CN110515303A (en) * 2019-09-17 2019-11-29 余姚市浙江大学机器人研究中心 A kind of adaptive dynamic path planning method based on DDQN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A novel multi-step Q-learning method to improve data efficiency for deep reinforcement learning;Yinlong Yuan等;《Knowledge-Based Systems》;第175卷;第107-117页 *
Dynamic Path Planning of Unknown Environment Based on Deep Reinforcement Learning;Lei Xiaoyun等;《Journal of Robotics》;第2018卷;第1-10页 *
Research on unmanned autonomous driving strategies based on deep recurrent reinforcement learning; Li Zhihang; Industrial Control Computer; Vol. 33 (No. 04); pp. 61-63 *

Also Published As

Publication number Publication date
CN112179367A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN112179367B (en) Intelligent autonomous navigation method based on deep reinforcement learning
CN110136481B (en) Parking strategy based on deep reinforcement learning
CN110333739B (en) AUV (autonomous Underwater vehicle) behavior planning and action control method based on reinforcement learning
Liu et al. Robot navigation in crowded environments using deep reinforcement learning
Lauri et al. Planning for robotic exploration based on forward simulation
Cao et al. Target search control of AUV in underwater environment with deep reinforcement learning
CN111880549B (en) Deep reinforcement learning rewarding function optimization method for unmanned ship path planning
Jesus et al. Deep deterministic policy gradient for navigation of mobile robots in simulated environments
CN111487864A (en) Robot path navigation method and system based on deep reinforcement learning
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
Wang et al. Cooperative collision avoidance for unmanned surface vehicles based on improved genetic algorithm
CN115167478B (en) Robot map-free path planning method and system based on deep reinforcement learning
CN117590867B (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
CN117606490A (en) Collaborative search path planning method for autonomous underwater vehicle
Yang et al. Monocular camera and single-beam sonar-based underwater collision-free navigation with domain randomization
Hien et al. Goal-oriented navigation with avoiding obstacle based on deep reinforcement learning in continuous action space
CN114609925B (en) Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish
CN114396949A (en) Mobile robot no-priori map navigation decision-making method based on DDPG
De Villiers et al. Learning fine-grained control for mapless navigation
Conforth et al. Reinforcement learning for neural networks using swarm intelligence
GB2601110A (en) Latency mitigation system and method
Ma et al. Map-less end-to-end navigation of mobile robots via deep reinforcement learning
Wen et al. The curved openspace algorithm and a spike-latency model for sonar-based obstacle avoidance
Koren Approximate methods for validating autonomous systems in simulation
Yan et al. A novel method for underactuated UUV tracking unknown contour based on forward-looking sonar

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant