CN110209152A - Deep reinforcement learning control method for intelligent underwater robot vertical plane path following - Google Patents

Deep reinforcement learning control method for intelligent underwater robot vertical plane path following

Info

Publication number
CN110209152A
CN110209152A (application CN201910514354.0A)
Authority
CN
China
Prior art keywords
network
underwater robot
intelligent underwater
experience
learner
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910514354.0A
Other languages
Chinese (zh)
Other versions
CN110209152B (en)
Inventor
李晔
白德乾
姜言清
安力
武皓微
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201910514354.0A priority Critical patent/CN110209152B/en
Publication of CN110209152A publication Critical patent/CN110209152A/en
Application granted granted Critical
Publication of CN110209152B publication Critical patent/CN110209152B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/0088 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots, characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/12 Target-seeking control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Medical Informatics (AREA)
  • Automation & Control Theory (AREA)
  • Remote Sensing (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Computational Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Mathematical Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Feedback Control In General (AREA)

Abstract

The present invention provides a deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot. Step 1: according to the path-following control requirements of the intelligent underwater robot, establish the intelligent underwater robot environment that interacts with the agents; Step 2: establish the agent set; Step 3: establish the experience buffer pool; Step 4: establish the learner; Step 5: carry out path-following control of the intelligent underwater robot using a distributed deterministic policy gradient. The invention addresses the complex and changeable marine environment in which the intelligent underwater robot operates and the fact that traditional control methods cannot actively interact with the environment, and designs a deep reinforcement learning control method for vertical-plane path following of the intelligent underwater robot. The path-following control task of the intelligent underwater robot is completed with a deterministic policy gradient in a distributed manner, and the method has the advantages of self-learning, high precision, good adaptability and a stable learning process.

Description

Deep reinforcement learning control method for intelligent underwater robot vertical plane path following
Technical field
The present invention relates to a control method for underwater vehicles, and in particular to a deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot.
Background technique
With the continuous deepening of ocean development, intelligent underwater robots have been widely used in marine environmental protection and marine resource development because of their flexible movement, ease of deployment and ability to operate autonomously, and their status is becoming more and more important. Furthermore, accurately controlling an intelligent underwater robot makes some extremely hazardous tasks safe, such as exploring subsea oil, repairing subsea pipelines, and tracking and recording the positions of explosive substances.
Traditional path-following control methods, such as fuzzy logic control, PID control and S-plane control, require manual tuning of control parameters, their control effect depends on the experience of the operator, and the intelligent underwater robot cannot actively interact with the environment. In recent years, with the rapid development of artificial intelligence technology, reinforcement learning, as one of its important branches, has achieved a series of important breakthroughs. In reinforcement learning, the learner is not told which actions to take; it must instead discover by trial which actions yield the greatest return. An action affects not only the immediate reward but also the state at the next moment, and through that state all subsequent rewards.
Summary of the invention
The purpose of the present invention is to provide a deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot that has the characteristics of self-learning and high precision and can adapt to various complex marine environments.
The object of the present invention is achieved as follows:
Step 1: according to the path-following control requirements of the intelligent underwater robot, establish the intelligent underwater robot environment that interacts with the agents;
Step 2: establish the agent set;
Step 3: establish the experience buffer pool;
Step 4: establish the learner;
Step 5: carry out path-following control of the intelligent underwater robot using a distributed deterministic policy gradient.
The present invention may also include:
1. Establishing the intelligent underwater robot environment that interacts with the agents means modeling the path-following control process of the intelligent underwater robot as a Markov decision process and determining the main components of the Markov decision process: action space, state space, observation space, reward function.
2. Determining the main components of the Markov decision process specifically includes:
(1) Determine the action space
The action space expression is F = [delS], where delS denotes the rudder angle of the intelligent underwater robot's hydroplane;
(2) Determine the state space
The state space expression is S = [w, q, z, theta], where w denotes the heave velocity of the intelligent underwater robot in the body-fixed coordinate system, q denotes the pitch rate of the intelligent underwater robot in the body-fixed coordinate system, z denotes the depth of the intelligent underwater robot in the earth-fixed coordinate system, and theta denotes the pitch angle of the intelligent underwater robot in the earth-fixed coordinate system;
(3) Determine the observation space
The observation space is a function of the state space: O = f(S). For following a straight-line path: O = [w, q, zdelta, cos(theta), sin(theta)], where zdelta = z - zr and zr denotes the depth of the straight-line path;
(4) Determine the reward function
In reinforcement learning, the purpose or goal of the agent is formalized by a special signal, called the reward or reward function, which is passed from the environment to the agent and is used to evaluate the effect of the current state that results from the action taken by the intelligent underwater robot at the previous moment:
R(s, a) = R(s) + R(a)
where:
R(s) = -(α_w·w² + α_q·q² + α_z·zdelta² + α_t·theta²)
R(a) = -(α_a1·delS²)
and α_w, α_q, α_z, α_t and α_a1 are weight coefficients.
3. Establishing the agent set specifically includes:
(1) K action networks are established simultaneously, and the K action networks interact with the intelligent underwater robot environment in parallel to form the agent set;
(2) The agent set receives network parameters from the learner for updating the action networks; the agent set delivers the experience tuples generated by the interaction between the action networks and the intelligent underwater robot environment to the experience buffer pool, and the expression of a single experience tuple is:
(o_i, a_i, R(s, a)_i).
4. Establishing the experience buffer pool specifically includes:
The experience buffer pool receives from the agent set the experience tuples generated by the interaction between the action networks and the intelligent underwater robot environment, and delivers experience tuples sampled according to priority to the learner. The priority sampling expression is P(i) = p_i^α / Σ_k p_k^α, where p_i is the priority of experience tuple i and α is a small coefficient greater than 0 that determines the degree of prioritization; if α = 0, priority sampling degenerates to uniform random sampling.
5. Establishing the learner specifically includes:
(1) The learner network receives the experience tuples sampled according to priority from the experience buffer pool, and transmits the learned network parameters to the agent set;
(2) The learner uses an actor-critic structure, in which the input of the actor network is the observation space and the output is the action space, i.e. the control variable, whose expression is F = [delS]; the action networks have the same structure as the actor network. The input of the critic network is the observation space and the action space, and its output is the distribution of Z, from which the mean of Z is then obtained; Z denotes the expected return at time step t when, following policy π, action a is taken in state s, i.e. the state-action value. Estimating the distribution of the state-action value is more stable than directly estimating only its mean.
6. Carrying out path-following control of the intelligent underwater robot using the distributed deterministic policy gradient specifically includes:
(1) Initialize the number of experience tuples sampled according to priority as M = 256, the size of the experience buffer pool as R = 1000000, the number of action networks K as no more than 10, the learning rates of the actor network and the critic network in the learner as α_0 = β_0 = 0.0001, the exploration constant as ε = 0.00001, the maximum number of explorations as E = 100, and the maximum number of steps per exploration as T = 1000;
(2) Initialize the network weight parameters (θ, w) of the action networks and of the learner's actor-critic networks in a random manner, where θ is the parameter of the action networks and of the actor network in the learner, and w is the parameter of the critic network in the learner;
(3) Using the parameters initialized in step (2), establish a target network for the actor network and for the critic network in the learner respectively; the parameters of the target networks are denoted (θ', w');
(4) Run the K action networks in parallel;
(5) Sample M experience sequences of length N from the experience buffer pool according to the priorities p_i: (o_{i:i+N}, a_{i:i+N-1}, R(o, a)_{i:i+N-1});
(6) Construct the distribution of Z;
(7) Calculate the updates δθ and δw of the action networks and of the learner's actor-critic networks according to the corresponding formulas;
(8) Update the network parameters:
θ ← θ + α_t·δθ,
w ← w + β_t·δw
(9) If the number of steps in the current exploration reaches 1000, end the current exploration; otherwise return to step (5);
(10) If the number of explorations reaches 100, end the experiment; otherwise return to step (2);
(11) Return the action network, i.e. the intelligent underwater robot path-following control model with suitable parameters θ.
7. Running the K action networks in parallel specifically includes:
1) Select action a = π(o; θ) + ε·N, where the second term denotes fixed Gaussian noise;
2) Execute action a, and receive the reward R(s, a) and the observation o' at the next moment;
3) Store the experience tuple (o_i, a_i, R(s, a)_i) in the experience buffer pool;
4) Repeat steps 1)-3) until convergence or the end of training.
The present invention provides a deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot. Aiming at the complex and changeable marine environment in which the intelligent underwater robot operates, and at the fact that traditional control methods cannot actively interact with the environment, a deep reinforcement learning control method for vertical-plane path following of the intelligent underwater robot is designed.
The present invention exploits the ability of reinforcement learning to interact actively with the environment, and proposes to complete the path-following control task of the intelligent underwater robot with a deterministic policy gradient in a distributed manner; the method has the advantages of self-learning, high precision, good adaptability and a stable learning process.
The beneficial effects of the invention are as follows:
1. The present invention has the features of self-learning and good adaptability. Because reinforcement learning inherently learns through interaction with the environment, the deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot provided by the present invention can actively interact with the environment and can adapt to various complex marine environments.
2. The present invention has the features of a stable learning process and good scalability of the learning results. By using a distributed method, the deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot provided by the present invention obtains better and more stable learning signals; meanwhile, the learned control strategy can be used directly when the target path does not change drastically, without retraining, which saves time and improves efficiency.
Detailed description of the invention
Fig. 1 is the overall structure diagram of the invention;
Fig. 2 is the schematic diagram of the action network and of the actor network in the learner structure of the invention;
Fig. 3 is the schematic diagram of the critic network in the learner structure of the invention;
Fig. 4 is the simulation result of sinusoidal path following using the method of the present invention.
Specific embodiment
The present invention is described in more detail below with examples.
Fig. 1 shows the overall structure of the invention, which mainly includes:
Step 1: according to the path-following control requirements of the intelligent underwater robot, establish the intelligent underwater robot environment that interacts with the agents.
Step 2: establish the agent set.
Step 3: establish the experience buffer pool.
Step 4: establish the learner.
Step 5: carry out path-following control of the intelligent underwater robot using a distributed deterministic policy gradient.
The deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot proposed by the present invention is described in more detail below with reference to the drawings and specific embodiments.
The detailed implementation of the invention includes the following steps:
1. Model the path-following control task of the intelligent underwater robot as a Markov decision process and determine the main components of the Markov decision process: action space, state space, observation space, reward function.
First step: determine the action space
The action space expression is F = [delS], where delS denotes the rudder angle of the intelligent underwater robot's hydroplane;
Second step: determine the state space
The state space expression is S = [w, q, z, theta], where w denotes the heave velocity of the intelligent underwater robot in the body-fixed coordinate system, q denotes the pitch rate of the intelligent underwater robot in the body-fixed coordinate system, z denotes the depth of the intelligent underwater robot in the earth-fixed coordinate system, and theta denotes the pitch angle of the intelligent underwater robot in the earth-fixed coordinate system.
Third step: determine the observation space
The observation space is a function of the state space: O = f(S). For following a straight-line path, O = [w, q, zdelta, cos(theta), sin(theta)], where zdelta = z - zr and zr denotes the depth of the straight-line path.
Fourth step: determine the reward function
In reinforcement learning, the purpose or goal of the agent is formalized by a special signal, called the reward or reward function, which is passed from the environment to the agent and is used to evaluate the effect of the current state that results from the action taken by the intelligent underwater robot at the previous moment:
R(s, a) = R(s) + R(a)
where:
R(s) = -(α_w·w² + α_q·q² + α_z·zdelta² + α_t·theta²)
R(a) = -(α_a1·delS²)
with α_w = 0.5, α_q = 0.5, α_z = 1, α_t = 1, α_a1 = 0.001.
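For illustration only (this sketch is not part of the patent text; the function names and the example values are assumptions), the observation of the third step and the reward of the fourth step can be computed as follows:

import math

def observation(w, q, z, theta, zr):
    # Map the state S = [w, q, z, theta] to the observation O used for
    # straight-line path following; zr is the depth of the reference path.
    zdelta = z - zr
    return [w, q, zdelta, math.cos(theta), math.sin(theta)]

def reward(w, q, zdelta, theta, delS,
           aw=0.5, aq=0.5, az=1.0, at=1.0, aa1=0.001):
    # R(s, a) = R(s) + R(a): penalize state deviations and rudder effort,
    # using the example weight values given above.
    r_state = -(aw * w**2 + aq * q**2 + az * zdelta**2 + at * theta**2)
    r_action = -(aa1 * delS**2)
    return r_state + r_action

# Example: robot at 9.5 m depth following a straight-line path at 10 m depth.
o = observation(w=0.02, q=0.01, z=9.5, theta=0.05, zr=10.0)
print(o)
print(reward(w=0.02, q=0.01, zdelta=o[2], theta=0.05, delS=5.0))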
2. Establish the agent set, specifically:
First step: K = 3 action networks are established simultaneously, and the K = 3 action networks interact with the intelligent underwater robot environment in parallel to form the agent set;
Second step: the agent set receives network parameters from the learner for updating the action networks; the agent set delivers the experience tuples generated by the interaction between the action networks and the intelligent underwater robot environment to the experience buffer pool, and the expression of a single experience tuple is:
(o_i, a_i, R(s, a)_i);
Third step: each action network (Fig. 2) contains two hidden layers h1 and h2 and an output layer, where h1 has 400 nodes, h2 has 300 nodes, and the output layer uses the hyperbolic tangent function tanh.
3. Establish the experience buffer pool, specifically:
The experience buffer pool receives from the agent set the experience tuples obtained from the interaction between the action networks and the environment, and delivers experience tuples sampled according to priority to the learner. The priority sampling expression is P(i) = p_i^α / Σ_k p_k^α, where p_i is the priority of experience tuple i and α is a small coefficient greater than 0 that determines the degree of prioritization; if α = 0, priority sampling degenerates to uniform random sampling.
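A minimal sketch of such priority sampling (not part of the patent text; it follows the standard prioritized experience replay form, and the variable names and the default exponent are assumptions):

import numpy as np

def sample_indices(priorities, batch_size, alpha=0.6, rng=np.random.default_rng()):
    # Sample tuple indices with probability P(i) = p_i**alpha / sum_k p_k**alpha.
    # With alpha = 0 this reduces to uniform random sampling.
    p = np.asarray(priorities, dtype=float) ** alpha
    probs = p / p.sum()
    return rng.choice(len(priorities), size=batch_size, p=probs)

# Example: a buffer of 5 tuples whose priorities could be their last TD errors.
print(sample_indices([0.1, 2.0, 0.5, 1.5, 0.05], batch_size=3))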
4. Establish the learner network, specifically:
First step: the learner network receives the experience tuples sampled according to priority from the experience buffer pool, and transmits the learned network parameters to the agent set.
Second step: the learner uses an actor-critic structure. The input of the actor network (Fig. 2) is the observation space and its output is the action space, i.e. the control variable, whose expression is F = [delS]; the actor network contains two hidden layers h1 and h2 and an output layer, where h1 has 400 nodes, h2 has 300 nodes, and the output layer uses the hyperbolic tangent function tanh. The input of the critic network (Fig. 3) is the observation space and the action space, and its output is the distribution of Z, where Z denotes the expected return at time step t when, following policy π, action a is taken in state s, i.e. the state-action value. Estimating the distribution of the state-action value rather than only its mean makes the learning process more stable. The critic network contains two hidden layers h1 and h2 and an output layer, where h1 has 400 nodes, h2 has 300 nodes, and the output layer uses the softmax function.
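As an illustrative sketch only (not part of the patent text), the actor network, the distributional critic network and the recovery of the state-action value as the mean of Z could look as follows in PyTorch; the observation dimension, the number and support of the distribution atoms, the ReLU hidden activations and all names are assumptions:

import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, N_ATOMS = 5, 1, 51          # assumed sizes: O has 5 components, the action is [delS]

class Actor(nn.Module):
    # Two hidden layers (400, 300) and a tanh output, as described for the actor/action network.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, ACT_DIM), nn.Tanh(),
        )

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    # Takes observation and action, outputs a softmax distribution over N_ATOMS return values.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, N_ATOMS), nn.Softmax(dim=-1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

actor, critic = Actor(), Critic()
obs = torch.zeros(1, OBS_DIM)
probs = critic(obs, actor(obs))                       # distribution of Z
z_atoms = torch.linspace(-10.0, 0.0, N_ATOMS)         # assumed support of the return distribution
q_value = (probs * z_atoms).sum(dim=-1)               # mean of Z, i.e. the state-action value
print(q_value)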
5. Carry out path-following control of the intelligent underwater robot using the distributed deterministic policy gradient, which includes the following steps:
Step 1: initialize the number of experience tuples sampled according to priority as M = 256, the size of the experience buffer pool as R = 1000000, the number of action networks K (the value of K is adjusted according to the specific path-following task and is usually no more than 10), the learning rates of the actor network and the critic network in the learner network as α_0 = β_0 = 0.0001, the exploration constant as ε = 0.00001, the maximum number of explorations as E = 100, and the maximum number of steps per exploration as T = 1000.
Step 2: initialize the network weight parameters (θ, w) of the action networks and of the learner network in a random manner, where θ is the parameter of the action networks and of the actor network in the learner network, and w is the parameter of the critic network in the learner network.
Step 3: using the parameters initialized in Step 2, establish a target network for the actor network and for the critic network in the learner network respectively, in order to reduce oscillation during learning; the parameters of the target networks are denoted (θ', w').
Step 4: run the K action networks in parallel.
Step 5: sample M experience sequences of length N from the experience buffer pool according to the priorities p_i: (o_{i:i+N}, a_{i:i+N-1}, R(o, a)_{i:i+N-1}).
Step 6: construct the distribution of Z.
Step 7: calculate the updates δθ and δw of the action networks and of the learner network according to the corresponding formulas.
Step 8: update the network parameters:
θ ← θ + α_t·δθ
w ← w + β_t·δw
Step 9: if the number of steps in the current exploration reaches 1000, end the current exploration; otherwise return to Step 5.
Step 10: if the number of explorations reaches 100, end the experiment; otherwise return to Step 2.
Step 11: return the action network, i.e. the intelligent underwater robot path-following control model with suitable parameters θ.
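For orientation only, the following runnable skeleton mirrors the control flow of Steps 2-11; it is not part of the patent text, and the experience tuples, the distribution of Z and the gradient updates are stub placeholders, since the patent's formulas are not reproduced here. E and T are also shrunk from the patent's E = 100 and T = 1000 so the skeleton finishes instantly:

import numpy as np

M, E, T, N_ATOMS = 256, 2, 10, 51
alpha_t = beta_t = 1e-4
rng = np.random.default_rng(0)

buffer = [(rng.normal(size=5), rng.normal(), rng.normal()) for _ in range(5000)]  # stub experience tuples

for episode in range(E):                                # Step 10: at most E explorations
    theta = rng.normal(size=10)                         # Step 2: random initialization of actor/action-network parameters
    w = rng.normal(size=10)                             # Step 2: random initialization of critic parameters
    theta_target, w_target = theta.copy(), w.copy()     # Step 3: target-network parameters (theta', w')
    for step in range(T):                               # Step 9: at most T learner updates per exploration
        idx = rng.choice(len(buffer), size=M)           # Step 5: stand-in for prioritized sequence sampling
        batch = [buffer[i] for i in idx]
        z_dist = np.full(N_ATOMS, 1.0 / N_ATOMS)        # Step 6: stub distribution of Z
        delta_theta = np.zeros_like(theta)              # Step 7: actor update (patent formula omitted)
        delta_w = np.zeros_like(w)                      # Step 7: critic update (patent formula omitted)
        theta += alpha_t * delta_theta                  # Step 8
        w += beta_t * delta_w                           # Step 8

print("returned control-model parameters:", theta[:3])  # Step 11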
6. Carry out path-following control of the intelligent underwater robot using the distributed deterministic policy gradient, where Step 4 is specifically:
First step: select action a = π(o; θ) + ε·N, where the second term denotes fixed Gaussian noise and ε is a coefficient used to control the range of the noise.
Second step: execute action a, and receive the reward R(s, a) and the observation o' at the next moment.
Third step: store the experience tuple (o_i, a_i, R(s, a)_i) in the experience buffer pool.
Fourth step: repeat the above steps until convergence or the end of training.
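For illustration only (not part of the patent text; the policy, the environment dynamics and all names are stubs), the loop executed by each action network can be sketched as:

import numpy as np

rng = np.random.default_rng(1)

def policy(obs, theta, eps=1e-5):
    # Select a = pi(o; theta) + eps * N(0, 1): a stub linear policy plus fixed Gaussian noise.
    return float(theta @ obs) + eps * rng.normal()

def env_step(obs, action):
    # Stub environment: returns a reward and the next observation (placeholder dynamics).
    next_obs = obs + 0.01 * rng.normal(size=obs.shape)
    reward = -(obs[2] ** 2)                      # e.g. penalize the depth-error component
    return reward, next_obs

theta = np.zeros(5)                              # parameters received from the learner (stub)
buffer = []                                      # shared experience buffer pool (stub: a local list)
obs = np.zeros(5)

for step in range(1000):                         # repeat 1)-3) until convergence or training ends
    a = policy(obs, theta)                       # 1) select the action with exploration noise
    r, next_obs = env_step(obs, a)               # 2) execute a, receive R(s, a) and o'
    buffer.append((obs, a, r))                   # 3) store the experience tuple (o_i, a_i, R(s, a)_i)
    obs = next_obs

print(len(buffer), "experience tuples collected")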

Claims (8)

1. A deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot, characterized in that it comprises:
Step 1: according to the path-following control requirements of the intelligent underwater robot, establishing the intelligent underwater robot environment that interacts with the agents;
Step 2: establishing the agent set;
Step 3: establishing the experience buffer pool;
Step 4: establishing the learner;
Step 5: carrying out path-following control of the intelligent underwater robot using a distributed deterministic policy gradient.
2. The deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot according to claim 1, characterized in that establishing the intelligent underwater robot environment that interacts with the agents means modeling the path-following control process of the intelligent underwater robot as a Markov decision process and determining the main components of the Markov decision process: action space, state space, observation space, reward function.
3. The deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot according to claim 2, characterized in that determining the main components of the Markov decision process specifically includes:
(1) Determining the action space
The action space expression is F = [delS], where delS denotes the rudder angle of the intelligent underwater robot's hydroplane;
(2) Determining the state space
The state space expression is S = [w, q, z, theta], where w denotes the heave velocity of the intelligent underwater robot in the body-fixed coordinate system, q denotes the pitch rate of the intelligent underwater robot in the body-fixed coordinate system, z denotes the depth of the intelligent underwater robot in the earth-fixed coordinate system, and theta denotes the pitch angle of the intelligent underwater robot in the earth-fixed coordinate system;
(3) Determining the observation space
The observation space is a function of the state space: O = f(S). For following a straight-line path: O = [w, q, zdelta, cos(theta), sin(theta)], where zdelta = z - zr and zr denotes the depth of the straight-line path;
(4) Determining the reward function
In reinforcement learning, the purpose or goal of the agent is formalized by a special signal, called the reward or reward function, which is passed from the environment to the agent and is used to evaluate the effect of the current state that results from the action taken by the intelligent underwater robot at the previous moment:
R(s, a) = R(s) + R(a)
where:
R(s) = -(α_w·w² + α_q·q² + α_z·zdelta² + α_t·theta²)
R(a) = -(α_a1·delS²)
and α_w, α_q, α_z, α_t and α_a1 are weight coefficients.
4. The deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot according to claim 1, characterized in that establishing the agent set specifically includes:
(1) K action networks are established simultaneously, and the K action networks interact with the intelligent underwater robot environment in parallel to form the agent set;
(2) The agent set receives network parameters from the learner for updating the action networks; the agent set delivers the experience tuples generated by the interaction between the action networks and the intelligent underwater robot environment to the experience buffer pool, and the expression of a single experience tuple is:
(o_i, a_i, R(s, a)_i).
5. The deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot according to claim 1, characterized in that establishing the experience buffer pool specifically includes:
The experience buffer pool receives from the agent set the experience tuples obtained from the interaction between the action networks and the intelligent underwater robot environment, and delivers experience tuples sampled according to priority to the learner; the priority sampling expression is P(i) = p_i^α / Σ_k p_k^α,
where p_i is the priority of experience tuple i and α is a small coefficient greater than 0 that determines the degree of prioritization; if α = 0, priority sampling degenerates to uniform random sampling.
6. The deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot according to claim 1, characterized in that establishing the learner specifically includes:
(1) The learner network receives the experience tuples sampled according to priority from the experience buffer pool, and transmits the learned network parameters to the agent set;
(2) The learner uses an actor-critic structure, in which the input of the actor network is the observation space and the output is the action space, i.e. the control variable, whose expression is F = [delS]; the action networks have the same structure as the actor network; the input of the critic network is the observation space and the action space, and its output is the distribution of Z, from which the mean of Z is then obtained; Z denotes the expected return at time step t when, following policy π, action a is taken in state s, i.e. the state-action value; estimating the distribution of the state-action value is more stable than directly estimating only its mean.
7. The deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot according to claim 1, characterized in that carrying out path-following control of the intelligent underwater robot using the distributed deterministic policy gradient specifically includes:
(1) Initializing the number of experience tuples sampled according to priority as M = 256, the size of the experience buffer pool as R = 1000000, the number of action networks K as no more than 10, the learning rates of the actor network and the critic network in the learner as α_0 = β_0 = 0.0001, the exploration constant as ε = 0.00001, the maximum number of explorations as E = 100, and the maximum number of steps per exploration as T = 1000;
(2) Initializing the network weight parameters (θ, w) of the action networks and of the learner's actor-critic networks in a random manner, where θ is the parameter of the action networks and of the actor network in the learner, and w is the parameter of the critic network in the learner;
(3) Using the parameters initialized in step (2), establishing a target network for the actor network and for the critic network in the learner respectively, the parameters of the target networks being denoted (θ', w');
(4) Running the K action networks in parallel;
(5) Sampling M experience sequences of length N from the experience buffer pool according to the priorities p_i: (o_{i:i+N}, a_{i:i+N-1}, R(o, a)_{i:i+N-1});
(6) Constructing the distribution of Z;
(7) Calculating the updates δθ and δw of the action networks and of the learner's actor-critic networks according to the corresponding formulas;
(8) Updating the network parameters:
θ ← θ + α_t·δθ,
w ← w + β_t·δw
(9) If the number of steps in the current exploration reaches 1000, ending the current exploration; otherwise returning to step (5);
(10) If the number of explorations reaches 100, ending the experiment; otherwise returning to step (2);
(11) Returning the action network, i.e. the intelligent underwater robot path-following control model with suitable parameters θ.
8. The deep reinforcement learning control method for vertical-plane path following of an intelligent underwater robot according to claim 7, characterized in that running the K action networks in parallel specifically includes:
1) Selecting action a = π(o; θ) + ε·N, where the second term denotes fixed Gaussian noise;
2) Executing action a, and receiving the reward R(s, a) and the observation o' at the next moment;
3) Storing the experience tuple (o_i, a_i, R(s, a)_i) in the experience buffer pool;
4) Repeating steps 1)-3) until convergence or the end of training.
CN201910514354.0A 2019-06-14 2019-06-14 Depth reinforcement learning control method for intelligent underwater robot vertical plane path following Active CN110209152B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910514354.0A CN110209152B (en) 2019-06-14 2019-06-14 Depth reinforcement learning control method for intelligent underwater robot vertical plane path following

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910514354.0A CN110209152B (en) 2019-06-14 2019-06-14 Depth reinforcement learning control method for intelligent underwater robot vertical plane path following

Publications (2)

Publication Number Publication Date
CN110209152A true CN110209152A (en) 2019-09-06
CN110209152B CN110209152B (en) 2022-04-05

Family

ID=67792707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910514354.0A Active CN110209152B (en) 2019-06-14 2019-06-14 Depth reinforcement learning control method for intelligent underwater robot vertical plane path following

Country Status (1)

Country Link
CN (1) CN110209152B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728368A (en) * 2019-10-25 2020-01-24 中国人民解放军国防科技大学 Acceleration method for deep reinforcement learning of simulation robot
CN110750096A (en) * 2019-10-09 2020-02-04 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN111638646A (en) * 2020-05-29 2020-09-08 平安科技(深圳)有限公司 Four-legged robot walking controller training method and device, terminal and storage medium
CN112462792A (en) * 2020-12-09 2021-03-09 哈尔滨工程大学 Underwater robot motion control method based on Actor-Critic algorithm
CN113033118A (en) * 2021-03-10 2021-06-25 山东大学 Autonomous floating control method of underwater vehicle based on demonstration data reinforcement learning technology
CN113110530A (en) * 2021-04-16 2021-07-13 大连海事大学 Underwater robot path planning method for three-dimensional environment
CN113534668A (en) * 2021-08-13 2021-10-22 哈尔滨工程大学 Maximum entropy based AUV (autonomous Underwater vehicle) motion planning method for actor-critic framework
CN115657683A (en) * 2022-11-14 2023-01-31 中国电子科技集团公司第十研究所 Unmanned and cableless submersible real-time obstacle avoidance method capable of being used for inspection task
CN116295449A (en) * 2023-05-25 2023-06-23 吉林大学 Method and device for indicating path of autonomous underwater vehicle

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180293512A1 (en) * 2017-04-11 2018-10-11 International Business Machines Corporation New rule creation using mdp and inverse reinforcement learning
CN107748566A (en) * 2017-09-20 2018-03-02 清华大学 A kind of underwater autonomous robot constant depth control method based on intensified learning
CN109379752A (en) * 2018-09-10 2019-02-22 ***通信集团江苏有限公司 Optimization method, device, equipment and the medium of Massive MIMO
CN109407644A (en) * 2019-01-07 2019-03-01 齐鲁工业大学 One kind being used for manufacturing enterprise's Multi-Agent model control method and system
CN109739090A (en) * 2019-01-15 2019-05-10 哈尔滨工程大学 A kind of autonomous type underwater robot neural network intensified learning control method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TOM SCHAUL et al.: "Prioritized Experience Replay", Proceedings of Workshops at the 4th International Conference on Learning Representations *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750096A (en) * 2019-10-09 2020-02-04 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN110750096B (en) * 2019-10-09 2022-08-02 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN110728368B (en) * 2019-10-25 2022-03-15 中国人民解放军国防科技大学 Acceleration method for deep reinforcement learning of simulation robot
CN110728368A (en) * 2019-10-25 2020-01-24 中国人民解放军国防科技大学 Acceleration method for deep reinforcement learning of simulation robot
CN111638646A (en) * 2020-05-29 2020-09-08 平安科技(深圳)有限公司 Four-legged robot walking controller training method and device, terminal and storage medium
CN111638646B (en) * 2020-05-29 2024-05-28 平安科技(深圳)有限公司 Training method and device for walking controller of quadruped robot, terminal and storage medium
CN112462792A (en) * 2020-12-09 2021-03-09 哈尔滨工程大学 Underwater robot motion control method based on Actor-Critic algorithm
CN113033118A (en) * 2021-03-10 2021-06-25 山东大学 Autonomous floating control method of underwater vehicle based on demonstration data reinforcement learning technology
CN113110530B (en) * 2021-04-16 2023-11-21 大连海事大学 Underwater robot path planning method for three-dimensional environment
CN113110530A (en) * 2021-04-16 2021-07-13 大连海事大学 Underwater robot path planning method for three-dimensional environment
CN113534668B (en) * 2021-08-13 2022-06-10 哈尔滨工程大学 Maximum entropy based AUV (autonomous Underwater vehicle) motion planning method for actor-critic framework
CN113534668A (en) * 2021-08-13 2021-10-22 哈尔滨工程大学 Maximum entropy based AUV (autonomous Underwater vehicle) motion planning method for actor-critic framework
CN115657683B (en) * 2022-11-14 2023-05-02 中国电子科技集团公司第十研究所 Unmanned cable-free submersible real-time obstacle avoidance method capable of being used for inspection operation task
CN115657683A (en) * 2022-11-14 2023-01-31 中国电子科技集团公司第十研究所 Unmanned and cableless submersible real-time obstacle avoidance method capable of being used for inspection task
CN116295449A (en) * 2023-05-25 2023-06-23 吉林大学 Method and device for indicating path of autonomous underwater vehicle
CN116295449B (en) * 2023-05-25 2023-09-12 吉林大学 Method and device for indicating path of autonomous underwater vehicle

Also Published As

Publication number Publication date
CN110209152B (en) 2022-04-05

Similar Documents

Publication Publication Date Title
CN110209152A (en) The deeply learning control method that Intelligent Underwater Robot vertical plane path follows
CN112241176B (en) Path planning and obstacle avoidance control method of underwater autonomous vehicle in large-scale continuous obstacle environment
CN108319293B (en) UUV real-time collision avoidance planning method based on LSTM network
Zhang et al. Ship motion attitude prediction based on an adaptive dynamic particle swarm optimization algorithm and bidirectional LSTM neural network
Wu et al. An optimization method for control parameters of underwater gliders considering energy consumption and motion accuracy
Cao et al. Target search control of AUV in underwater environment with deep reinforcement learning
CN110362089A (en) A method of the unmanned boat independent navigation based on deeply study and genetic algorithm
CN103744428A (en) Unmanned surface vehicle path planning method based on neighborhood intelligent water drop algorithm
CN114625151A (en) Underwater robot obstacle avoidance path planning method based on reinforcement learning
Zhang et al. AUV path tracking with real-time obstacle avoidance via reinforcement learning under adaptive constraints
CN110472738A (en) A kind of unmanned boat Real Time Obstacle Avoiding algorithm based on deeply study
CN110095120A (en) Biology of the Autonomous Underwater aircraft under ocean circulation inspires Self-organizing Maps paths planning method
CN109885061B (en) Improved NSGA-II-based dynamic positioning multi-objective optimization method
CN113837454A (en) Hybrid neural network model prediction method and system for three degrees of freedom of ship
Zhu et al. AUV dynamic obstacle avoidance method based on improved PPO algorithm
CN109460874A (en) A kind of ariyoshi wave height prediction technique based on deep learning
Ma et al. Multi-AUV collaborative operation based on time-varying navigation map and dynamic grid model
CN104155043B (en) A kind of dynamic positioning system external environment force measuring method
Zhou et al. Nonparametric modeling of ship maneuvering motions in calm water and regular waves based on R-LSTM hybrid method
Wang et al. MUTS-based cooperative target stalking for a multi-USV system
CN117555352A (en) Ocean current assisted path planning method based on discrete SAC
CN114840928B (en) Underwater vehicle cluster motion simulation method based on deep learning
Song et al. Search and tracking strategy of autonomous surface underwater vehicle in oceanic eddies based on deep reinforcement learning
CN116541951A (en) Ship thrust distribution method based on improved aigrette algorithm
CN112327838B (en) Multi-unmanned surface vessel multi-task allocation method based on improved self-mapping algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant