Disclosure of Invention
The invention aims to overcome the above defects and provides an underwater robot control method based on reinforcement learning, which can accurately track a target trajectory, reduce the number of samples required for a system with uncertain parameters, and realize control by having the underwater robot learn from its environment.
To achieve this purpose, the invention adopts the following technical scheme:
An underwater robot control method based on reinforcement learning, characterized by comprising the following steps:
step 1, establishing, for the position of the underwater robot, a fixed reference system based on the robot's desired trajectory position and an inertial reference system based on the uncertain factors of the underwater environment;
step 2, for the inertial reference system, constructing an output model G(a_1, a_2, a_3) of the system mapping of the robot based on the uncertain factors in the front-back, left-right, and up-down directions,
where a_i is the i-th uncertain factor acting on the underwater robot, and each uncertain factor a_i follows its own independent probability density function;
sampling each uncertain factor at fixed points according to its respective probability density function, training the system-mapping robot output model with the sampling points, and constructing a reduced-order system-mapping robot output model G'(a_1, a_2, a_3), whose coefficients are the coefficients of the uncertain factors in the low-order mapping;
step 3, converting the real position of the underwater robot into coordinates in the fixed reference system of step 1, and obtaining the model output of the robot's reduced-order system mapping in the inertial reference system of step 2;
step 4, defining the real position of the underwater robot in state k as
p(k) = [x(k), y(k), z(k)]^T
and the desired trajectory position of the underwater robot in state k as
p_r(k) = [x_r(k), y_r(k), z_r(k)]^T;
defining the one-step cost function of the next action of the underwater robot in state k as
g_k(p, u) = (x(k) - x_r(k))^2 + (y(k) - y_r(k))^2 + (z(k) - z_r(k))^2 + u^2(k),
where (x - x_r)^2 + (y - y_r)^2 + (z - z_r)^2 represents the cost of the underwater robot's position error, u is the underwater robot controller input, and u^2 represents the cost of the energy consumed;
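As a concrete illustration of the one-step cost defined above (the numeric values in the usage line are arbitrary), g_k can be computed as:

```python
def one_step_cost(p, p_r, u):
    """g_k(p, u): squared position-tracking error plus squared control effort."""
    return sum((pi - ri) ** 2 for pi, ri in zip(p, p_r)) + u ** 2

# Example: position error of 1 in x only, control input u = 0.5.
cost = one_step_cost((1.0, 2.0, 3.0), (0.0, 2.0, 3.0), 0.5)  # 1.0 + 0.25 = 1.25
```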
training the robot with the one-step cost function generated by the position movement of the underwater robot to obtain the value function
V(p(k)) = E_a(k){g_k(p, u) + γV(p(k+1))},
where γ ∈ (0, 1) is a discount factor and E_a(k) represents the expectation function in state k;
letting V = W^T Φ(p) and obtaining the value model of the control method by an iterative weighting method:
W_(j+1)^T Φ(p(k)) = E_a(k)[g_k(p, u) + γ W_j^T Φ(p(k+1))],
where Φ(p) is the basis vector and W is the weight vector;
step 5, solving the value model of the control method: letting h(p) = U^T σ(p), where the weight vector U is updated with a gradient descent method and the control method is improved by minimizing the cost function,
where h(p) is the next action performed in each state when the underwater robot carries out position tracking, and h(p) serves as the optimal control strategy;
step 6, iterating by the iterative weighting method until the two processes of updating the value model of the control method and improving the control strategy both converge, thereby completing the solution of the optimal control strategy in the current state;
and step 7, feeding the real position of step 3 into step 4, obtaining the next optimal control strategy through the operations of steps 5-6, inputting this optimal control strategy into the system-mapping robot output model of step 2, and cyclically repeating steps 3 to 7 to complete the tracking task of the underwater robot.
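The seven steps above can be sketched as a single control loop; the sketch below is illustrative only, with the hypothetical callables `reduced_order_model` and `solve_policy` standing in for the model of step 2 and the solver of steps 4-6:

```python
import numpy as np

def tracking_loop(p0, p_ref, reduced_order_model, solve_policy, n_steps=100):
    """Illustrative outer loop of steps 3-7: observe the position, solve
    for the optimal action h(p) in the current state, apply it through
    the reduced-order system model, and repeat."""
    p = np.asarray(p0, dtype=float)
    trajectory = [p.copy()]
    for k in range(n_steps):
        u = solve_policy(p, p_ref(k))   # steps 4-6: optimal action for state k
        p = reduced_order_model(p, u)   # step 2: next state from the model
        trajectory.append(p.copy())
    return np.array(trajectory)
```

With a trivial model p + u and the error-cancelling policy u = p_r - p, the loop reaches the reference in one step; in practice both callables would come from the trained reduced-order model and the learned policy.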
In a further technical scheme, the uncertain factors in step 1 are the underwater surge, sway, and heave.
In a further technical scheme, the output mean E'(G'(a_1, a_2, a_3)) of the reduced-order system-mapping robot output model in step 2 is identical to the output mean E(G(a_1, a_2, a_3)) of the system-mapping robot output model.
In a further technical scheme, the specific steps of step 4 are as follows:
the position of the underwater robot in state k is p(k) = [x(k), y(k), z(k)]^T and the desired trajectory is p_r(k) = [x_r(k), y_r(k), z_r(k)]^T. To obtain an optimal control strategy, namely the action h performed by the underwater robot in each state during position tracking, the one-step cost function of the underwater robot in state k is set as
g_k(p, u) = (x(k) - x_r(k))^2 + (y(k) - y_r(k))^2 + (z(k) - z_r(k))^2 + u^2(k),
where (x - x_r)^2 + (y - y_r)^2 + (z - z_r)^2 represents the cost of the tracking error, u is the underwater robot controller input, and u^2 represents the cost of the energy consumed. The value function is then calculated from the one-step cost function:
V(p(k)) = E_a(k){g_k(p, u) + γV(p(k+1))},
where γ ∈ (0, 1) is a discount factor and E_a(k) represents the expectation function in state k.
In the value update process, let V = W^T Φ(p); the value function can then be expressed as
W_(j+1)^T Φ(p(k)) = E_a(k)[g_k(p, u) + γ W_j^T Φ(p(k+1))],
where Φ(p) is the basis vector and W is the weight vector, solved iteratively by the least squares method. After the value function is obtained, in the strategy improvement step the optimal tracking control strategy is solved by setting a basis vector and a weight vector: let h(p) = U^T σ(p), where the weight vector U is updated by gradient descent and σ(p) is the basis vector; the control strategy is improved by minimizing the cost function,
and h(p) is the control strategy obtained by the underwater robot through learning from the environment, this strategy being the optimal control strategy.
In a further technical scheme, the specific content of step 6 is as follows:
each time the value model of the control method is updated and the control strategy is improved by the iterative weighting method, convergence is deemed reached when the resulting weight change is smaller than the threshold of 0.001, and the h obtained at the end of the iteration is input to the underwater robot as the controller input u.
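The convergence criterion described here can be sketched as follows (a minimal illustration; the weight vectors would come from successive value-update or strategy-improvement iterations):

```python
import numpy as np

def has_converged(w_old, w_new, tol=1e-3):
    """Deem the iteration converged when the weight change is below the
    0.001 threshold (maximum absolute change across all components)."""
    return np.max(np.abs(np.asarray(w_new) - np.asarray(w_old))) < tol
```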
The reinforcement-learning-based underwater robot control method controls the underwater robot so as to track a tracked object.
Compared with the prior art, the invention has the following advantages:
the invention samples the uncertain parameters of the underwater robot relating to the underwater uncertain factors by using a reduction method, and can give accurate output statistics of the original mapping, thereby reducing the calculation cost and effectively reducing the simulation times.
The invention uses a reinforcement learning method to track the position of the underwater robot, combines the advantages of adaptive and optimal control, and seeks an optimal feedback strategy from the responses of the environment. Using the surrounding environment information, the underwater robot can, through multiple iterations of self-learning, find the control strategy that best conforms to the target trajectory.
The invention realizes intelligent tracking by the underwater robot. Sampling the uncertain parameters of the underwater robot by a reduction method and combining this with reinforcement learning turns the backward-looking real-time optimal control of the underwater robot system into forward-looking adaptive control, so that the underwater robot can better complete trajectory tracking.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
As shown in fig. 2, the invention arranges a buoy relay on the water surface; the buoy relay is used for the self-positioning of the underwater robot, and the control center gives the desired trajectory information of the underwater robot and sends it to the underwater robot. The underwater robot controller then controls the driver according to the system control to complete the motion of the underwater robot.
As shown in fig. 1, the present invention provides a reinforcement learning-based underwater robot control method, which comprises the following steps:
Step one: the underwater robot is influenced by its surrounding underwater environment, so the uncertain factors in the underwater robot model need to be evaluated before the control of the driver by the underwater robot controller can be completed. The underwater robot has six degrees of freedom, spanning the up-down, left-right, and front-back directions, and its dynamic characteristics can be described by two reference systems: a fixed reference system based on the robot's desired trajectory position and an inertial reference system based on the uncertain factors of the underwater environment. Both reference systems consider the up-down, left-right, and front-back directions, and the underwater inertial reference system introduces uncertain parameters from factors such as underwater surge, sway, and heave.
In the inertial reference system, the linear velocities in the three directions of surge, sway, and heave are mutually perpendicular, and the influence of roll, pitch, and yaw on the angular velocity of the underwater robot is also considered along the linear-velocity directions.
Step two: owing to the random influence of the underwater environment, the uncertain parameters of the underwater robot are estimated in the three directions separately. A group of sampling points is selected for each parameter according to the respective probability density functions of the uncertain parameters, and the sampling points are used to reduce the order of the robot model, so that the output of the controller can be obtained with only a few calculations while the output mean is guaranteed to be the same as that of the original model; the underwater robot thus adapts to the underwater environment and achieves more accurate control.
The specific steps are as follows: for the inertial reference system, an output model G(a_1, a_2, a_3) of the system mapping of the robot based on the uncertain factors is constructed in the front-back, left-right, and up-down directions. Here a_i is the i-th uncertain factor acting on the underwater robot; in this embodiment, factors such as underwater surge, sway, and heave introduce the uncertain parameters, and the remaining quantities are coefficients. Each uncertain parameter (or uncertain factor) a_i follows its own independent probability density function. Each uncertain factor is sampled at fixed points according to its respective probability density function, the sampling points are used to train the system-mapping robot output model, and a reduced-order system-mapping robot output model G'(a_1, a_2, a_3) is constructed, whose coefficients are the new coefficients of the uncertain parameters in the low-order mapping. The output mean E'(G'(a_1, a_2, a_3)) of the reduced-order system-mapping robot output model is identical to the output mean E(G(a_1, a_2, a_3)) of the system-mapping robot output model, i.e., E'(G'(a_1, a_2, a_3)) = E(G(a_1, a_2, a_3)).
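The mean-matching property E'(G'(a_1, a_2, a_3)) = E(G(a_1, a_2, a_3)) can be illustrated with deterministic fixed-point sampling; the Gaussian densities and the polynomial mapping G below are assumptions chosen for illustration only, not the mapping of the invention:

```python
import numpy as np

def fixed_point_samples(mu, sigma, n=3):
    """Deterministic (fixed-point) samples and weights for a Gaussian
    uncertain parameter, via probabilists' Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n)
    return mu + sigma * nodes, weights / weights.sum()

# Assumed example mapping G of the three uncertain factors.
def G(a1, a2, a3):
    return a1 ** 2 + 0.5 * a2 + a3

# Sample each factor at 3 fixed points instead of many random draws.
pts1, w1 = fixed_point_samples(0.0, 0.1)
pts2, w2 = fixed_point_samples(0.0, 0.2)
pts3, w3 = fixed_point_samples(0.0, 0.05)

# Reduced-order output mean E'(G') from only 27 deterministic evaluations;
# for this G it equals the exact mean E[a1^2] = 0.1^2 = 0.01.
mean_reduced = sum(
    wi * wj * wk * G(x, y, z)
    for x, wi in zip(pts1, w1)
    for y, wj in zip(pts2, w2)
    for z, wk in zip(pts3, w3)
)
```

For polynomial mappings up to the quadrature's exactness degree, this 3-point rule reproduces the output mean of the original mapping exactly, which is the sense in which E'(G') = E(G) while requiring far fewer evaluations than Monte Carlo sampling.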
Step three: the real position of the underwater robot is converted into coordinates in the fixed reference system of step 1, and the model output of the robot's reduced-order system mapping in the inertial reference system of step 2 is obtained.
Step four: the position of the underwater robot in state k is defined as
p(k) = [x(k), y(k), z(k)]^T,
and the desired trajectory to be tracked is
p_r(k) = [x_r(k), y_r(k), z_r(k)]^T.
To obtain an optimal control strategy, namely the action h performed by the underwater robot in each state during position tracking, the one-step cost function of the underwater robot in state k is set as
g_k(p, u) = (x(k) - x_r(k))^2 + (y(k) - y_r(k))^2 + (z(k) - z_r(k))^2 + u^2(k),
where (x - x_r)^2 + (y - y_r)^2 + (z - z_r)^2 represents the cost of the tracking error, u is the underwater robot controller input, and u^2 represents the cost of the energy consumed. The value function is calculated from the one-step cost function:
V(p(k)) = E_a(k){g_k(p, u) + γV(p(k+1))},
where γ ∈ (0, 1) is a discount factor and E_a(k) represents the expectation function in state k.
Let V = W^T Φ(p); the value function can then be expressed as
W_(j+1)^T Φ(p(k)) = E_a(k)[g_k(p, u) + γ W_j^T Φ(p(k+1))],
where Φ(p) is the basis vector and W is the weight vector, solved iteratively by the least squares method.
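One value-model iteration, W_(j+1)^T Φ(p(k)) = E_a(k)[g_k + γ W_j^T Φ(p(k+1))], solved in the least-squares sense over a batch of sampled transitions, might look as follows (the data layout is an assumption: each row of `phi_k` and `phi_next` is the basis vector evaluated at one sampled state):

```python
import numpy as np

def value_weight_update(phi_k, phi_next, g_k, w_j, gamma=0.9):
    """One iteration of the value-model update: solve, in the
    least-squares sense, phi_k @ w_{j+1} = g_k + gamma * phi_next @ w_j
    over a batch of sampled transitions (phi_k, phi_next: N x m)."""
    target = g_k + gamma * phi_next @ w_j          # right-hand side per sample
    w_next, *_ = np.linalg.lstsq(phi_k, target, rcond=None)
    return w_next
```

Repeating this update until the weight change falls below the convergence threshold yields the fixed-point weight vector of the value model.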
Step five: after the value function is obtained, in the strategy improvement step the optimal tracking control strategy is solved by setting a basis vector and a weight vector. Let h(p) = U^T σ(p), where the weight vector U is updated with a gradient descent method and σ(p) is the basis vector; the control strategy is improved by minimizing the cost function, and h(p) is the control strategy obtained by the underwater robot through learning from the environment, this strategy being the optimal control strategy.
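A gradient-descent update of the policy weights U in h(p) = U^T σ(p) can be sketched as below; since the specification does not spell out the gradient computation, a central finite-difference gradient of an arbitrary per-action cost is used here as an illustrative stand-in for the analytic gradient:

```python
import numpy as np

def policy_weight_step(U, sigma_p, cost_of_action, lr=0.05, eps=1e-5):
    """One gradient-descent step on the policy weights U, minimising the
    cost of the action h(p) = U^T sigma(p).  The gradient with respect to
    each component of U is estimated by central finite differences."""
    U = np.asarray(U, dtype=float)
    grad = np.zeros_like(U)
    for i in range(U.size):
        dU = np.zeros_like(U)
        dU[i] = eps
        c_plus = cost_of_action(float((U + dU) @ sigma_p))
        c_minus = cost_of_action(float((U - dU) @ sigma_p))
        grad[i] = (c_plus - c_minus) / (2 * eps)
    return U - lr * grad
```

Iterating this step drives U toward the weights whose action minimises the cost, e.g. toward u = 1 for the quadratic cost (u - 1)^2.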
Step six: the two processes of value updating and strategy improvement are iterated in a loop; when the weight change obtained in each value-update and strategy-improvement iteration is smaller than the threshold of 0.001, convergence is deemed reached, the h obtained at the end of the iteration is input as the controller output u to the driver of the underwater robot, and the solution of the optimal control strategy in the current state is completed.
Step seven: the optimal control strategy is input into the reduced-order system obtained in step two, the state of the underwater robot is updated, and steps five and six are repeated to obtain the optimal control strategy for the next action, which is again input into step two.
The invention also discloses a method of tracking with the underwater robot, which uses the trajectory information generated by the continuous movement of a tracked object as the desired trajectory information in step 1 above, and controls the underwater robot with the reinforcement-learning-based underwater robot control method so as to track the tracked object.
The trajectory information of the tracked object can be obtained by positioning through the buoy relay.
An embodiment is specifically described below:
(1) As shown in fig. 2, in a given water area 6 m long, 5 m wide, and 1.5 m deep, an underwater robot is deployed and a buoy relay is arranged on the water surface; the buoy relay is used for the self-positioning of the underwater robot, and the control center gives the desired trajectory information of the underwater robot, x_r = 2sin(0.1k), y_r = 0.1k, z_r = 1, where k ∈ [0, ..., 100 s], and sends it to the underwater robot.
(2) The underwater robot has the kinematic model S_(k+1) = S_k + U_k + A_k, where S_k = [x(k), y(k), z(k)]^T is the position of the underwater robot, U_k = [u_x, u_y, u_z]^T is obtained by reinforcement learning, and A_k = [a_1(k), a_2(k), a_3(k)]^T is the uncertain parameter vector, with
-0.2 ≤ a_1(k) ≤ 0.3, -0.8 ≤ a_2(k) ≤ 0.7, a_3(k) = 0.
(3) The position is tracked by the reinforcement learning method, with the value function V(p(k)) = E_a(k){g_k(p, u) + γV(p(k+1))} and the discount factor γ set to 0.9. To obtain the value function, let V = W^T Φ(p); the weight vector can then be obtained by least squares iteration, with the basis vector Φ(p) = [1, x, y, x^2, y^2, xy]^T. After the value function is obtained, in the strategy improvement step the optimal tracking control strategy is solved by setting a basis vector and a weight vector: let h(p) = U^T σ(p), where the weight vector U is updated by gradient descent and σ(p) = [1, x, y]^T. The control strategy is improved by minimizing the cost function.
(4) Through the loop of the two processes of value updating and strategy improvement, when the weight change obtained in each value-update and strategy-improvement iteration is smaller than the threshold of 0.001, convergence is deemed reached, the h obtained at the end of the iteration is input as the controller output u to the driver of the underwater robot, and the solution of the optimal control strategy in the current state is completed.
(5) The optimal control strategy is input as output into the system-mapping robot output model of step 2, and the above steps are repeated in a loop to accomplish the tracking task. The above-described embodiment merely illustrates a preferred embodiment of the present invention and is not restrictive; various changes and modifications may be made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention, which is defined by the claims.