CN114296350A - Unmanned ship fault-tolerant control method based on model reference reinforcement learning - Google Patents

Unmanned ship fault-tolerant control method based on model reference reinforcement learning

Info

Publication number
CN114296350A
Authority
CN
China
Prior art keywords
unmanned ship
model
fault
reinforcement learning
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111631716.8A
Other languages
Chinese (zh)
Other versions
CN114296350B (en)
Inventor
Zhang Qingrui
Xiong Peixuan
Zhang Lei
Zhu Bo
Hu Tianjiang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111631716.8A
Publication of CN114296350A
Application granted
Publication of CN114296350B
Legal status: Active (current)
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 90/00: Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Feedback Control In General (AREA)

Abstract

The invention discloses an unmanned ship fault-tolerant control method based on model reference reinforcement learning, which comprises the following steps: analyzing uncertainty factors of the unmanned ship and constructing a nominal dynamics model of the unmanned ship; designing a nominal controller of the unmanned ship based on the nominal dynamics model; constructing a fault-tolerant controller based on model reference reinforcement learning, using an Actor-Critic method based on the maximum entropy, from the difference between the state variables of the actual unmanned ship system and the nominal dynamics model and from the output of the nominal controller; and building a reinforcement learning evaluation function and a control strategy model according to the control task requirements and training the fault-tolerant controller to obtain a trained control strategy. The method can significantly improve the safety and reliability of the unmanned ship system, and can be widely applied in the field of unmanned ship control.

Description

Unmanned ship fault-tolerant control method based on model reference reinforcement learning
Technical Field
The invention relates to the field of unmanned ship control, in particular to an unmanned ship fault-tolerant control method based on model reference reinforcement learning.
Background
With the remarkable progress of guidance, navigation and control technologies, unmanned ships (autonomous surface vehicles, ASVs) have come to play an increasingly important role in maritime operations. In most applications, unmanned ships are expected to operate safely without human intervention for extended periods of time. Thus, unmanned ships need sufficient safety and reliability to ensure proper operation and avoid catastrophic consequences. However, unmanned ships are prone to faults such as structural degradation and sensor failure, and may therefore experience performance degradation, instability, and even catastrophic loss.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide an unmanned ship fault-tolerant control method based on model reference reinforcement learning, which can recover system performance or maintain system operation after a fault occurs, thereby significantly improving system safety and reliability.
The first technical scheme adopted by the invention is as follows: a fault-tolerant control method of an unmanned ship based on model reference reinforcement learning comprises the following steps:
S1, analyzing uncertainty factors of the unmanned ship and constructing a nominal dynamics model of the unmanned ship;
S2, designing a nominal controller of the unmanned ship based on the nominal dynamics model of the unmanned ship;
S3, constructing a fault-tolerant controller based on model reference reinforcement learning, using an Actor-Critic method based on the maximum entropy, according to the difference between the state variables of the actual unmanned ship system and the nominal dynamics model of the unmanned ship and the output of the nominal controller of the unmanned ship;
and S4, building a reinforcement learning evaluation function and a control strategy model according to the control task requirements, and training a fault-tolerant controller to obtain a trained control strategy.
Further, the formula of the nominal dynamics model of the unmanned ship is as follows:

η̇ = J(ψ) v
M v̇ + C(v) v + D(v) v + g(v) = B u

In the above formula, η = [x_p, y_p, ψ]^T represents the generalized coordinate vector, v represents the generalized velocity vector, u represents the control forces and moment, M represents the inertia matrix, C(v) includes the Coriolis and centripetal forces, D(v) represents the damping matrix, g(v) represents the unmodeled dynamics due to gravity, buoyancy and moments, J(ψ) represents the rotation matrix, and B represents a preset input matrix.
Further, the nominal controller of the unmanned ship is designed based on the nominal model:

η̇_m = J(ψ_m) v_m
v̇_m = -H_m v_m + N_m u_m

In the above formula, N_m and H_m contain all known constant parameters of the unmanned ship dynamics model, η_m represents the generalized coordinate vector of the nominal model, u_m represents the control law, and x_m represents the state of the reference model.
Further, the fault-tolerant controller is built on the following fault diagnosis and estimation mechanism:

ê̇_v = -H_m ê_v + N_m u_l + L(ẽ_v - ê_v),   ẽ_v = y_v - v_m,   y_v = v + n_v + f_v

In the above formula, L is chosen such that -(H_m + L) is a Hurwitz matrix, ê_v is the estimated fault tracking error, u_l represents the control strategy from the deep reinforcement learning module, β(v) represents the set of all model uncertainties in the inner-loop dynamics that enter the tracking error, n_v represents the noise vector on the generalized velocity measurement, and f_v represents a sensor fault acting on the generalized velocity vector.
Further, the reinforcement learning evaluation function satisfies the Bellman equation:

Q^π(s_t, u_{l,t}) = T^π Q^π(s_t, u_{l,t})
T^π Q^π(s_t, u_{l,t}) = R_t + γ E_π[ Q^π(s_{t+1}, u_{l,t+1}) - α log π(u_{l,t+1} | s_{t+1}) ]

In the above formula, u_{l,t} denotes the control excitation from the RL, s_t denotes the state signal at time step t, T^π denotes the Bellman operator under a fixed policy π, E_π denotes the expectation operator, γ denotes the discount factor, α denotes the temperature coefficient, and Q^π(s_t, u_{l,t}) denotes the reinforcement learning evaluation function.
Further, the control strategy model is updated by:

π_new = argmin_{π' ∈ Π} D_KL( π'(· | s_t) ‖ exp( Q^{π_old}(s_t, ·) / α ) / Z^{π_old}(s_t) )

In the above formula, Π represents the policy set, π_old represents the previously updated policy, Q^{π_old} denotes the Q value of π_old, D_KL denotes the KL divergence, Z^{π_old}(s_t) represents a normalization factor, π'(· | s_t) represents a control strategy, and the dot denotes an omitted argument.
Further, the step of building a reinforcement learning evaluation function and a control strategy model according to the control task requirements and training the fault-tolerant controller to obtain a trained control strategy specifically includes:
S41, building a reinforcement learning evaluation function and a control strategy model for the fault-tolerant controller based on model reference reinforcement learning according to the control task requirements;
S42, training the fault-tolerant controller based on model reference reinforcement learning to obtain an initial control strategy;
and S43, injecting faults into the unmanned ship system, retraining the initial control strategy and returning to the step S41 until the reinforcement learning evaluation function network model and the control strategy model converge.
Further, the method also includes:
introducing a double evaluation function model, and adding the entropy value of the strategy into the expected return function of the control strategy, where R_t is the reward function, R_t = R(s_t, u_{l,t}).
The method has the following beneficial effects: for an unmanned ship system with model uncertainty and sensor faults, the invention provides a reinforcement learning-based fault-tolerant control algorithm that combines model reference reinforcement learning with a fault diagnosis and estimation mechanism. Taking Monte Carlo sampling efficiency into consideration, an Actor-Critic model is used to convert the accumulated return into a Q function. Through this new reinforcement learning-based fault-tolerant control, the unmanned ship can learn to adapt to different sensor faults and recover its trajectory tracking performance under fault conditions.
Drawings
FIG. 1 is a flow chart of steps of an unmanned ship fault-tolerant control method based on model reference reinforcement learning according to the invention;
Fig. 2 is a block diagram of the Actor-Critic network according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
As shown in fig. 1, the present invention provides a fault-tolerant control method for an unmanned ship based on model reference Reinforcement Learning (RL), which includes the following steps:
S1, analyzing the inherent uncertainty factors of the unmanned ship, neglecting all nonlinear terms in the inner-loop dynamics to obtain a linear and decoupled model of the dynamics equation of the generalized velocity vector, and establishing a nominal dynamics model of the unmanned ship;
the dynamic model is specifically as follows:
Figure BDA0003440428850000031
wherein
Figure BDA0003440428850000032
Is a generalized coordinate vector, xpAnd ypRepresenting the horizontal coordinates of the ASV in the inertial system,
Figure BDA0003440428850000033
is the heading angle. v ═ up,vp,rp]T∈R3Is a generalized velocity vector, upAnd vpLinear velocities in the x-and y-directions, rpIs the heading angular rate. u ═ τur]∈R3Control of force and moment, g (v) ═ g1(v),g2(v),g3(v)]T∈R3Is unmodeled dynamics due to gravity, buoyancy and moment, M ∈ R3×3Is provided with M ═ MTAn inertia matrix > 0 and
Figure BDA0003440428850000041
wherein
Figure BDA0003440428850000047
Matrix C (v) ═ CT(v) Including coriolis forces and centripetal forces, are given by:
Figure BDA0003440428850000042
wherein C is13(v)=-M22v-M23r,C23(v)=M11u. Damping matrix
Figure BDA0003440428850000043
Wherein D11(v)=-Xu-X|u|u|u|-Xuuuu2,D22(v)=-Yv-Y|v|v|v|-Y|r|v|r|,D23(v)=-Yr-Y|v|r|v|-Y|r|r|r|,D32(v)=-Nv-N|v|v|v|-N|r|v|r|,D33(v)=-Nr-N|v|r|v|-N|r|rAnd | r |, X (·), Y (·), N (·) are hydrodynamic coefficients, and the definitions are detailed in ship hydrodynamic and motion control manuals. Rotation matrix
Figure BDA0003440428850000044
Input matrix
Figure BDA0003440428850000045
Definition x ═ ηT vT]TIs provided with
Figure BDA0003440428850000046
Wherein H (v) ═ M-1(c (v) + d (v)) and N ═ M-1B。
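Purely as an illustration of the state-space form above (not part of the claims), the continuous-time model ẋ = [J(ψ)v; -H(v)v - M^{-1}g(v) + Nu] could be coded as follows; all hydrodynamic terms are left as user-supplied callables because their numerical values are vessel-specific assumptions.

```python
import numpy as np

def asv_dynamics(x, u, M, C, D, g, B):
    """Continuous-time ASV model (5): x = [eta; v], eta = [x_p, y_p, psi].

    M: 3x3 inertia matrix; C, D: callables v -> 3x3 matrices; g: callable
    v -> 3-vector of unmodeled gravity/buoyancy effects; B: 3x3 input matrix.
    Returns x_dot. (Illustrative sketch; parameter values are vessel-specific.)
    """
    eta, v = x[:3], x[3:]
    psi = eta[2]
    J = np.array([[np.cos(psi), -np.sin(psi), 0.0],
                  [np.sin(psi),  np.cos(psi), 0.0],
                  [0.0,          0.0,         1.0]])
    eta_dot = J @ v
    v_dot = np.linalg.solve(M, -(C(v) + D(v)) @ v - g(v) + B @ u)
    return np.concatenate([eta_dot, v_dot])
```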
The state measurement of the ASV system (1) is corrupted by noise and sensor faults and is therefore denoted as y = x + n + f(t), where n ∈ R^6 is the measurement noise vector and f(t) ∈ R^6 represents a possible sensor fault vector. In the invention, only a sensor fault acting on the heading angular rate r_p is considered, so that f(t) = [0, 0, 0, 0, 0, f_r(t)]^T. The sensor fault f_r(t) is given by

f_r(t) = β(t - T_f) φ(t - T_f)

where φ(t - T_f) is an unknown function describing the sensor fault occurring at time T_f, and β(t - T_f) is the fault evolution profile, with β(t - T_f) = 0 for t < T_f and β(t - T_f) = 1 - e^{-k(t - T_f)} for t ≥ T_f (k is the evolution rate of the fault). Note that if the occurrence of the sensor fault is sudden, such as a bias fault, k → ∞. The object of the invention is to design a controller that allows the state x to track the reference state trajectory x_r in the presence of model uncertainties, possible sensor faults and measurement noise.
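As an illustration only, the fault profile above can be generated numerically. The sketch below assumes the incipient-fault form β(t - T_f) = 1 - e^{-k(t - T_f)} and a hypothetical constant fault magnitude φ; the numbers are placeholders, not values prescribed by the invention.

```python
import numpy as np

def sensor_fault(t, T_f=20.0, k=0.5, phi=0.2):
    """Sensor fault f_r(t) = beta(t - T_f) * phi(t - T_f).

    Illustrative assumptions: phi is a constant bias magnitude and
    beta(t - T_f) = 1 - exp(-k (t - T_f)) for t >= T_f, 0 otherwise.
    A large k approximates an abrupt (bias) fault.
    """
    t = np.asarray(t, dtype=float)
    dt = np.clip(t - T_f, 0.0, None)
    beta = np.where(t >= T_f, 1.0 - np.exp(-k * dt), 0.0)
    return beta * phi

t = np.linspace(0.0, 60.0, 601)
f_incipient = sensor_fault(t)          # slowly developing fault on r_p
f_abrupt = sensor_fault(t, k=1e3)      # near-step bias fault
```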
S2, designing a nominal controller of the unmanned ship based on the nominal dynamics model, and ensuring the basic stability of the unmanned ship system under the condition of no fault. And analyzing the nominal model of the unmanned ship.
The nominal controller design process is as follows:
the proposed RL-based FTC algorithm follows a model reference control structure. For most ASV systems, accurate nonlinear dynamical models are rarely available, with the main uncertainties coming from M, C (v) and d (v) due to fluid mechanics, and g (v) due to gravity and buoyancy and moments. Despite the uncertainty in the ASV dynamics, the nominal model (5) can still be used based on the known information of the ASV dynamics. The nominal model of the uncertain ASV model (5) is as follows:
Figure BDA0003440428850000052
wherein N ismAnd HmContains all known constant parameters of the ASV dynamics (5),
Figure BDA0003440428850000053
is the generalized coordinate vector of the nominal model. In the present invention, MmIs formed by Mm=diag{M11,M22,M33< derived, Hm=Mm -1DmFrom Dm=diag{-Xu,-Yv,-NrAnd Nm=Mm -1And B, obtaining. Therefore, in the nominal model, all nonlinear terms in the inner loop dynamics are ignored, and therefore, the linear solution of the generalized velocity vector v kinetic equation is finally obtainedAnd (4) coupling the model. Since the dynamics of the nominal model (6) are known, it is possible to design the control law umTo allow the state of the nominal system (6) to converge to the reference signal xrE.g., | | x when t → ∞ time | | xm-xr||2→ 0. Control law umCan also be used as a nominal controller by the whole ASV dynamics (5).
In the model reference control architecture, the goal is to design a control law that allows the states of (5) to track the state trajectory of the nominal model (6). The overall control law of the ASV system (5) has the following expression:

u = u_b + u_l     (7)

where u_b is the model-based nominal control law and u_l is the control strategy from the deep reinforcement learning module. The baseline control u_b ensures some basic properties (i.e., local stability), while u_l compensates for all system uncertainties and sensor faults.
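For illustration only, the sketch below shows how a discretized nominal model (6) and the composite control law (7) might be organized in code. The matrices H_m and N_m, the proportional baseline law and the Euler step size are placeholder assumptions, not values or designs prescribed by the invention.

```python
import numpy as np

# Placeholder nominal parameters (assumed for illustration; the real values
# come from M_m = diag{M_11, M_22, M_33}, D_m = diag{-X_u, -Y_v, -N_r}, B).
H_m = np.diag([0.5, 0.8, 1.2])      # H_m = M_m^{-1} D_m
N_m = np.diag([0.1, 0.05, 0.2])     # N_m = M_m^{-1} B

def nominal_step(v_m, u_m, dt=0.05):
    """One Euler step of the decoupled nominal velocity dynamics (6):
    v_m_dot = -H_m v_m + N_m u_m."""
    return v_m + dt * (-H_m @ v_m + N_m @ u_m)

def composite_control(u_b, u_l):
    """Overall control law (7): baseline (model-based) plus RL compensation."""
    return u_b + u_l

def baseline_control(v_m, v_ref, K=np.diag([2.0, 2.0, 2.0])):
    """One simple (assumed) baseline choice u_b = N_m^{-1}(H_m v_ref - K (v_m - v_ref)),
    which makes the nominal velocity error decay through -(H_m + K)."""
    return np.linalg.solve(N_m, H_m @ v_ref - K @ (v_m - v_ref))
```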
S3, constructing the fault-tolerant controller based on model reference reinforcement learning by taking the difference value of the state variables of the actual unmanned ship system and the nominal model and the output of the nominal controller as input.
The network block diagram of the Actor-Critic is shown in fig. 2. The specific derivation process of the fault-tolerant controller is as follows:
the formula for RL is based on a Markov decision process MDP represented by a tuple<S,U,P,R,γ>Where S is the state space, U specifies the operation/input space, P: S × U × S → R defines the transition probability, R: S × U → R is a reward function, γ ∈ [0,1) is a discount coefficient. In MDP, the state vector S ∈ S contains the influence RL control ulAll available signals of e U. For the tracking control of the ASV system in the invention, the transition probability is determined by the ASV dynamic and the reference signal x in (1)rAnd (5) characterizing. In the RL, the control strategy is learning using data samples collected in the discrete time domain. Let stIs the state signal s at the time step t, respectively ul,tIs the input of the RL-based control at time step t. The RL algorithm of the present invention aims to maximize a cost of action function, also called the Q function, such asShown below:
Figure BDA0003440428850000061
wherein R istIs a reward function, Rt=R(st,ul,t),
Figure BDA0003440428850000062
And V isπ(st+1) is called s under strategy πtA state value function of +1, wherein
Figure BDA0003440428850000063
Wherein pi (u)l,t|st) Is a control strategy that is based on the fact that,
Figure BDA0003440428850000064
is the entropy of the strategy and α is the temperature parameter. Control strategy pi (u) in RLl,t|st) Is to select action ul,tE.g. U in state stE.g., the probability under S. In the present invention, a control strategy is employed that satisfies a Gaussian distribution, i.e.
π(u_l | s) = N(u_l(s), σ)     (10)

where N(·, ·) represents a Gaussian distribution, u_l(s) is the mean and σ is the covariance matrix. The covariance matrix σ controls the exploration behaviour during the learning phase.
The goal of RL is to find an optimal control strategy π* that maximizes Q^π(s_t, u_{l,t}) in (8), i.e.

π* = argmax_π Q^π(s_t, u_{l,t})     (11)

Note that the variance σ* will converge to 0 once the optimal strategy π*(u_l* | s) = N(u_l*(s), σ*) is obtained, so the mean function u_l*(s) is the optimal control law to be learned. The deep neural network Q_θ(s_t, u_{l,t}) is called the critic, and the control strategy π_φ(u_{l,t} | s_t) is called the actor. The uncertain inner-loop dynamics of the ASV model (5) are rewritten as:
v̇ = -H_m v + N_m u + β(v)     (12)

where β(v) is the set of all model uncertainties in the inner-loop dynamics; the uncertainty term β(v) is assumed to be bounded. Let e_v = v - v_m. According to (6) and (12), the error dynamics are:

ė_v = -H_m e_v + N_m u_l + β(v)     (13)
under healthy conditions, the model uncertainty term β (v) may use a learning-based control ulComplete compensation is performed. This means | | | e when t → ∞ time | | | ev(t)||2≦ ε, where ε is some positive small constant. Error signal e if sensor failure occursvWill be greater than epsilon. One inexperienced idea of learning-based Fault Tolerance Control (FTC) is to treat sensor failures as part of an external disturbance. However, considering sensor faults as disturbances will result in a control based on conservative learning, such as robust control. Therefore, we introduce a fault diagnosis and estimation mechanism that allows the learning-based control to adapt to different scenarios: healthy and unhealthy conditions.
Let y_v = v + n_v + f_v, where n_v represents the noise vector on the generalized velocity measurement and, correspondingly, f_v is the sensor fault acting on the generalized velocity vector. In addition, we define

ẽ_v = y_v - v_m

as the fault tracking error vector. In practical applications, ẽ_v is measurable, whereas e_v is not.
is measurableInstead of ev. Finally, the following fault diagnosis and estimation mechanisms are introduced:
Figure BDA0003440428850000075
wherein L is selected as Hm-L Hurwitz. Signal
Figure BDA0003440428850000076
As an indicator of the occurrence and intensity of sensor failure. Is provided with
Figure BDA0003440428850000077
To obtain
Figure BDA0003440428850000078
In the above formula, HmL represents a Hurwitz matrix,
Figure BDA0003440428850000079
ulrepresents the control strategy from the deep learning module, β (v) represents the set of all model uncertainties in the inner-loop dynamics, nvRepresenting a noise vector on the generalized velocity measurement, fvIndicating a sensor fault acting on the generalized velocity vector.
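A minimal numerical sketch of the estimation mechanism (14) is given below, assuming an Euler discretization and placeholder values for H_m, N_m and L; it is illustrative only and not the claimed implementation.

```python
import numpy as np

H_m = np.diag([0.5, 0.8, 1.2])   # placeholder nominal parameters (assumed)
N_m = np.diag([0.1, 0.05, 0.2])
L   = np.diag([5.0, 5.0, 5.0])   # chosen so that -(H_m + L) is Hurwitz

def fault_estimator_step(e_hat, e_tilde, u_l, dt=0.05):
    """One Euler step of the fault diagnosis and estimation mechanism (14):
    e_hat_dot = -H_m e_hat + N_m u_l + L (e_tilde - e_hat),
    where e_tilde = y_v - v_m is the measurable fault tracking error."""
    e_hat_dot = -H_m @ e_hat + N_m @ u_l + L @ (e_tilde - e_hat)
    return e_hat + dt * e_hat_dot
```

In use, e_hat is propagated at every sampling step alongside the nominal model; a persistent growth of e_hat signals the onset and intensity of a sensor fault.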
S4, designing a corresponding callback function according to the control task requirement, and building a reinforcement learning evaluation function model (Q-value) and a control strategy model by using a full-connectivity network.
The callback function, the learning evaluation function and the control strategy model are derived as follows:
The RL-based fault-tolerant control is derived using the output of the fault diagnosis and estimation mechanism. The RL learns the control strategy at discrete time steps using data samples (including input and state data). The sampling time step is assumed to be fixed and is denoted by δt. Without loss of generality, let y_t, u_{b,t}, u_{l,t} and ê_{v,t} represent the ASV measurement, the nominal controller excitation, the control excitation from the RL, and the output of the fault diagnosis and estimation mechanism at time step t, respectively. The state signal s_t at time step t is thus represented as

s_t = [y_t^T, x_{m,t}^T, u_{b,t}^T, ê_{v,t}^T]^T
the training learning process of the RL will repeatedly perform policy evaluation and policy improvement. In policy evaluation, Q-value is operated by Bellman Qπ(st,ul,t)=TπQπ(st,ul,t) Obtained wherein
Figure BDA0003440428850000083
In the above formula, ul,tIndicating the control excitation, s, from the RLtRepresenting the state signal at a time step T, TπDenotes a fixed policy, EπRepresenting the desired operator, gamma representing the discount factor, alpha representing the temperature coefficient, Qπ(st,ul,t) Representing a reinforcement learning evaluation function.
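As a purely illustrative sketch (PyTorch is an assumed toolchain, not mandated by the invention), the entropy-regularized state value V^π(s) = E_{u_l~π}[Q^π(s, u_l) - α log π(u_l | s)] used inside the backup above can be approximated by sampling from the Gaussian policy:

```python
import torch

def soft_state_value(q_net, policy, s, alpha, n_samples=16):
    """Monte Carlo estimate of V(s) = E_{u~pi}[Q(s, u) - alpha * log pi(u|s)].

    q_net:  callable (s, u) -> Q value (torch tensors), assumed
    policy: callable s -> torch.distributions.Normal over actions, assumed
    """
    dist = policy(s)
    u = dist.sample((n_samples,))                 # [n_samples, act_dim]
    log_p = dist.log_prob(u).sum(-1)              # [n_samples]
    s_rep = s.unsqueeze(0).expand(n_samples, -1)  # broadcast the state
    q = q_net(s_rep, u).squeeze(-1)               # [n_samples]
    return (q - alpha * log_p).mean()
```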
In policy improvement, the policy is updated by:

π_new = argmin_{π' ∈ Π} D_KL( π'(· | s_t) ‖ exp( Q^{π_old}(s_t, ·) / α ) / Z^{π_old}(s_t) )

where Π represents the policy set, π_old represents the last updated policy, Q^{π_old} denotes the Q value of π_old, D_KL denotes the Kullback-Leibler (KL) divergence, and Z^{π_old}(s_t) represents a normalization factor. Through mathematical manipulation, this objective is converted into minimizing

E_{u_l ~ π'}[ α log π'(u_l | s_t) - Q^{π_old}(s_t, u_l) ]

over π' ∈ Π.
S5, introducing a double-evaluation function model idea into an evaluation function training framework, and adding an entropy value of a strategy into a control strategy expected return function to improve the reinforcement learning training efficiency.
The dual evaluation function model is derived as follows:
parameterizing the Q function by theta, and using Qθ(st,ul,t) And (4) showing. The parameterization strategy consists ofφ(ul,t|st) Where phi is the parameter set to be trained. Note that both θ and φ are a set of parameters, the size of which is determined by the deep neural network settings. For example, if QθRepresented by an MLP with K hidden layers and L neurons per hidden layer, the parameter set θ is then θ ═ θ01,...,θKAnd at 1. ltoreq. i.ltoreq.K-1
Figure BDA0003440428850000091
θK∈R1×(L+1),θi∈R(L)×(L+1)Wherein dimsSize of state s, dimuRepresenting input ulThe size of (c).
The training session is offline, and data samples are collected at each time step t +1, e.g. input u from the previous time stepl,tLast time step stState of (1), reward RtAnd the current state st+1. These history data will be referred to as tuples(s)t,ul,t,Rt,st+1) Stored in the memory pool D. In each strategy evaluation or improvement step, we randomly extract a batch of historical data B from the memory pool D for the training parameters θ and φ. At the beginning of training, we will put the nominal control strategy ubApplied to ASV system to collect initial data D0As shown in algorithm 1. Initial data set D0For initial fitting of the Q function. After the initialization is finished, executing ubAnd the newly updated reinforcement learning strategy piφ(ul,t|st) To operate the ASV system.
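For illustration, a minimal replay memory pool matching the tuple format (s_t, u_{l,t}, R_t, s_{t+1}) could look as follows (a sketch, not the claimed implementation):

```python
import random
from collections import deque

class MemoryPool:
    """Replay memory D storing tuples (s_t, u_l_t, R_t, s_t1)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, u_l, r, s_next):
        self.buffer.append((s, u_l, r, s_next))

    def sample(self, batch_size):
        # Randomly extract a batch B of historical data for training theta and phi.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```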
The parameters θ of the Q function are trained to minimize the Bellman residual:

J_Q(θ) = E_{(s_t, u_{l,t}) ~ D}[ (1/2) ( Q_θ(s_t, u_{l,t}) - Y_target )^2 ]     (15)

where (s_t, u_{l,t}) ~ D means that the samples (s_t, u_{l,t}) are randomly drawn from the memory pool D, and

Y_target = R_t + γ ( Q_θ̄(s_{t+1}, u_{l,t+1}) - α log π_φ(u_{l,t+1} | s_{t+1}) ),   u_{l,t+1} ~ π_φ(· | s_{t+1})

where θ̄ is a target parameter that is updated slowly. The DNN parameter θ is obtained by applying stochastic gradient descent to (15) on the sampled data batch B, whose size is denoted by |B|. The invention uses two critics, parameterized by θ_1 and θ_2 respectively. Both critics are introduced to reduce the overestimation problem in training the evaluation neural network. Under the double evaluation function, the target value Y_target becomes:

Y_target = R_t + γ ( min_{j=1,2} Q_θ̄_j(s_{t+1}, u_{l,t+1}) - α log π_φ(u_{l,t+1} | s_{t+1}) )
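A minimal sketch of the double-critic Bellman-residual loss is given below (PyTorch assumed; the network interfaces and tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def critic_loss(q1, q2, q1_targ, q2_targ, policy, batch, gamma, alpha):
    """Bellman-residual loss for the two critics (double evaluation functions).

    q1, q2:           online critics, callables (s, u) -> Q value [B, 1]
    q1_targ, q2_targ: slowly updated target critics
    policy:           callable s -> torch.distributions.Normal
    batch:            dict of tensors "s", "u_l", "r", "s_next"
    """
    s, u_l, r, s_next = batch["s"], batch["u_l"], batch["r"], batch["s_next"]
    with torch.no_grad():
        dist = policy(s_next)
        u_next = dist.sample()
        log_p = dist.log_prob(u_next).sum(-1, keepdim=True)
        q_min = torch.min(q1_targ(s_next, u_next), q2_targ(s_next, u_next))
        y_target = r + gamma * (q_min - alpha * log_p)   # double-critic target
    loss1 = F.mse_loss(q1(s, u_l), y_target)
    loss2 = F.mse_loss(q2(s, u_l), y_target)
    return loss1 + loss2
```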
the policy refinement step is to use the data samples in memory pool D to achieve the following parameterized objective function minimization:
Figure BDA0003440428850000096
the parameter phi is trained to a minimum using a stochastic gradient descent method, and in the training phase, the actor neural network is represented as:
Figure BDA0003440428850000097
wherein
Figure BDA0003440428850000101
Is the parameterized control law to be learned,
Figure BDA0003440428850000102
is the standard deviation of the detection noise, ξ -N (0, I) are the detection noise, and "" is the Hadamard product. Note that the detection noise ξ is only applicable in the training phase, and once training is complete, only needs to be in use
Figure BDA0003440428850000103
Therefore, u in the training phaselEquivalent to ul,φ. Once training is over, get
Figure BDA0003440428850000104
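A minimal PyTorch sketch of a Gaussian actor using the reparameterization above, together with the corresponding policy loss, is given below; the network sizes and the use of both critics in the actor loss are illustrative assumptions rather than prescriptions of the invention.

```python
import torch

class GaussianActor(torch.nn.Module):
    """Actor pi_phi(u_l | s): a Gaussian whose mean u_l_phi(s) is the control
    law to be learned and whose std sigma_phi(s) drives exploration."""

    def __init__(self, dim_s, dim_u, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim_s, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU())
        self.mean = torch.nn.Linear(hidden, dim_u)
        self.log_std = torch.nn.Linear(hidden, dim_u)

    def forward(self, s):
        h = self.net(s)
        mean = self.mean(h)
        std = self.log_std(h).clamp(-5, 2).exp()
        return mean, std

    def rsample(self, s):
        # Reparameterization: u_l = u_l_phi(s) + sigma_phi(s) * xi, xi ~ N(0, I)
        mean, std = self(s)
        xi = torch.randn_like(std)
        u_l = mean + std * xi
        log_p = torch.distributions.Normal(mean, std).log_prob(u_l).sum(-1, keepdim=True)
        return u_l, log_p

def actor_loss(actor, q1, q2, s, alpha):
    """J_pi(phi) = E[ alpha * log pi_phi(u_l|s) - min_j Q_theta_j(s, u_l) ]."""
    u_l, log_p = actor.rsample(s)
    q_min = torch.min(q1(s, u_l), q2(s, u_l))
    return (alpha * log_p - q_min).mean()
```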
The temperature parameter α is also updated during the training phase. The update is obtained by minimizing the following objective function:

J(α) = E_{u_l ~ π_φ}[ -α log π_φ(u_l | s_t) - α H̄ ]

where H̄ is the target entropy of the strategy. In the invention, the target entropy is set to H̄ = -2, where "2" represents the action dimension.
S6, training the controller based on model reference reinforcement learning in the fault-free case to obtain an initial control strategy and ensure the robustness of the overall controller against model uncertainty.
S7, injecting faults into the unmanned ship system and retraining the obtained initial control strategy based on model reference reinforcement learning, so that the overall controller adapts to partial sensor faults.
S8, under different initial state conditions, continuously repeating step S6 and step S7 until the reinforcement learning evaluation function network model and the control strategy model converge.
Specifically, the training process of steps S6-S8 (Algorithm 1) is as follows:
1) initialize the two critic networks Q_θ1 and Q_θ2 with parameters θ_1, θ_2, and denote the actor network by φ;
2) assign values to the target parameters: θ̄_1 ← θ_1, θ̄_2 ← θ_2;
3) run the system with u_l = 0, i.e., apply u = u_b in (5), to obtain the initial data set D_0;
4) at the end of this exploration phase, use the data set D_0 to train the initial critic parameters θ_1^0 and θ_2^0;
5) initialize the memory pool D ← D_0;
6) assign initial values to the critic parameters and their targets: θ_1 ← θ_1^0, θ_2 ← θ_2^0, θ̄_1 ← θ_1^0, θ̄_2 ← θ_2^0;
7) repeat;
8) start a loop in which each data collection step executes the following operations;
9) select an action u_{l,t} according to π_φ(u_{l,t} | s_t);
10) operate the nominal system (6), the entire system (5) and the fault diagnosis and estimation mechanism (14), and collect s_{t+1} = {x_{t+1}, x_{m,t+1}, u_{b,t+1}, ê_{v,t+1}};
11) D ← D ∪ {s_t, u_{l,t}, R(s_t, u_{l,t}), s_{t+1}};
12) end the loop;
13) start a loop in which each gradient update step executes the following operations;
14) extract a batch of data B from D;
15) θ_j ← θ_j - ι_Q ∇_{θ_j} J_Q(θ_j), j = 1, 2;
16) φ ← φ - ι_π ∇_φ J_π(φ);
17) α ← α - ι_α ∇_α J(α);
18) θ̄_j ← κ θ_j + (1 - κ) θ̄_j, j = 1, 2;
19) end the loop;
20) until convergence (e.g., J_Q(θ) falls below a small threshold).
In the algorithm, ι_Q, ι_π and ι_α are positive learning rates (scalars) and κ > 0 is a constant scalar.
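The Python sketch below only wires the order of operations of Algorithm 1 (steps 7-20); it is illustrative and reuses the helper sketches given earlier in this description (MemoryPool, GaussianActor, critic/actor/temperature updates) as assumed components. The environment object `env` is a hypothetical simulation that internally runs the full ASV model (5), the nominal model (6) and the estimator (14), returning the RL state and reward as tensors.

```python
import torch

def train_ftc_rl(env, actor, update_critics, update_actor, update_alpha,
                 update_targets, pool, epochs=50, collect_steps=200,
                 grad_steps=200, batch_size=256):
    """Schematic wiring of Algorithm 1: alternate data collection on the ASV
    simulation with gradient updates of critics, actor and temperature.
    All callables/objects are assumed helpers following the sketches above."""
    s = env.reset()
    for _ in range(epochs):
        # Data collection loop (steps 8-12)
        for _ in range(collect_steps):
            with torch.no_grad():
                u_l, _ = actor.rsample(s.unsqueeze(0))
            s_next, r = env.step(u_l.squeeze(0))
            pool.add(s, u_l.squeeze(0), torch.as_tensor([r]), s_next)
            s = s_next
        # Gradient update loop (steps 13-19)
        for _ in range(grad_steps):
            if len(pool) < batch_size:
                break
            batch = pool.sample(batch_size)
            update_critics(batch)   # step 15: theta_j <- theta_j - iota_Q * grad J_Q
            update_actor(batch)     # step 16: phi <- phi - iota_pi * grad J_pi
            update_alpha(batch)     # step 17: alpha <- alpha - iota_alpha * grad J_alpha
            update_targets()        # step 18: Polyak update of the target critics
```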
An unmanned ship fault-tolerant control system based on model reference reinforcement learning comprises:
the dynamic model building module is used for analyzing uncertainty factors of the unmanned ship and building a nominal dynamic model of the unmanned ship;
the controller design module is used for designing a nominal controller of the unmanned ship based on the nominal dynamics model of the unmanned ship;
the fault-tolerant controller building module is used for building a fault-tolerant controller based on model reference reinforcement learning according to a state variable difference value of an actual unmanned ship system and an unmanned ship name-meaning dynamic model and the output of an unmanned ship nominal controller based on an Actor-criticic method of the maximum entropy;
and the training module is used for building a reinforcement learning evaluation function and a control strategy model and training the fault-tolerant controller according to the control task requirements to obtain a trained control strategy.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
A model reference reinforcement learning-based unmanned ship fault-tolerant control device comprises:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the unmanned ship fault-tolerant control method based on model reference reinforcement learning as described above.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
A storage medium having stored therein instructions executable by a processor, the storage medium comprising: the processor-executable instructions, when executed by the processor, are for implementing a model reference reinforcement learning based unmanned ship fault tolerance control method as described above.
The contents in the above method embodiments are all applicable to the present storage medium embodiment, the functions specifically implemented by the present storage medium embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present storage medium embodiment are also the same as those achieved by the above method embodiments.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A fault-tolerant control method of an unmanned ship based on model reference reinforcement learning is characterized by comprising the following steps:
S1, analyzing uncertainty factors of the unmanned ship and constructing a nominal dynamics model of the unmanned ship;
S2, designing a nominal controller of the unmanned ship based on the nominal dynamics model of the unmanned ship;
S3, constructing a fault-tolerant controller based on model reference reinforcement learning, using an Actor-Critic method based on the maximum entropy, according to the difference between the state variables of the actual unmanned ship system and the nominal dynamics model of the unmanned ship and the output of the nominal controller of the unmanned ship;
and S4, building a reinforcement learning evaluation function and a control strategy model according to the control task requirements, and training a fault-tolerant controller to obtain a trained control strategy.
2. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 1, wherein the formula of the unmanned ship nominal dynamics model is as follows:

η̇ = J(ψ) v
M v̇ + C(v) v + D(v) v + g(v) = B u

where η = [x_p, y_p, ψ]^T represents the generalized coordinate vector, v represents the generalized velocity vector, u represents the control forces and moment, M represents the inertia matrix, C(v) includes the Coriolis and centripetal forces, D(v) represents the damping matrix, g(v) represents the unmodeled dynamics due to gravity, buoyancy and moments, J(ψ) represents the rotation matrix, and B represents a preset input matrix.
3. The unmanned ship fault-tolerant control method based on model reference reinforcement learning as claimed in claim 2, wherein the nominal controller of the unmanned ship is designed based on the nominal model:

η̇_m = J(ψ_m) v_m
v̇_m = -H_m v_m + N_m u_m

where N_m and H_m contain all known constant parameters of the unmanned ship dynamics model, η_m represents the generalized coordinate vector of the nominal model, u_m represents the control law, and x_m represents the state of the reference model.
4. The unmanned ship fault-tolerant control method based on model reference reinforcement learning as claimed in claim 3, wherein the fault-tolerant controller is built on the following fault diagnosis and estimation mechanism:

ê̇_v = -H_m ê_v + N_m u_l + L(ẽ_v - ê_v),   ẽ_v = y_v - v_m,   y_v = v + n_v + f_v

where L is chosen such that -(H_m + L) is a Hurwitz matrix, ê_v is the estimated fault tracking error, u_l represents the control strategy from the deep reinforcement learning module, β(v) represents the set of all model uncertainties in the inner-loop dynamics that enter the tracking error, n_v represents the noise vector on the generalized velocity measurement, and f_v represents a sensor fault acting on the generalized velocity vector.
5. The unmanned ship fault-tolerant control method based on model reference reinforcement learning as claimed in claim 4, wherein the reinforcement learning evaluation function satisfies the Bellman equation:

Q^π(s_t, u_{l,t}) = T^π Q^π(s_t, u_{l,t})
T^π Q^π(s_t, u_{l,t}) = R_t + γ E_π[ Q^π(s_{t+1}, u_{l,t+1}) - α log π(u_{l,t+1} | s_{t+1}) ]

where u_{l,t} denotes the control excitation from the RL, s_t denotes the state signal at time step t, T^π denotes the Bellman operator under a fixed policy π, E_π denotes the expectation operator, γ denotes the discount factor, α denotes the temperature coefficient, and Q^π(s_t, u_{l,t}) denotes the reinforcement learning evaluation function.
6. The unmanned ship fault-tolerant control method based on model reference reinforcement learning as claimed in claim 4, wherein the formula of the control strategy model is as follows:

π_new = argmin_{π' ∈ Π} D_KL( π'(· | s_t) ‖ exp( Q^{π_old}(s_t, ·) / α ) / Z^{π_old}(s_t) )

where Π represents the policy set, π_old represents the previously updated policy, Q^{π_old} denotes the Q value of π_old, D_KL denotes the KL divergence, Z^{π_old}(s_t) represents a normalization factor, and π'(· | s_t) represents a control strategy.
7. The unmanned ship fault-tolerant control method based on model reference reinforcement learning according to claim 1, wherein the step of constructing a reinforcement learning evaluation function and a control strategy model and training a fault-tolerant controller to obtain a trained control strategy according to control task requirements specifically comprises:
S41, building a reinforcement learning evaluation function and a control strategy model for the fault-tolerant controller based on model reference reinforcement learning according to the control task requirements;
S42, training the fault-tolerant controller based on model reference reinforcement learning to obtain an initial control strategy;
and S43, injecting faults into the unmanned ship system, retraining the initial control strategy and returning to the step S41 until the reinforcement learning evaluation function network model and the control strategy model converge.
8. The unmanned ship fault-tolerant control method based on model reference reinforcement learning according to claim 7, further comprising:
and introducing a double-evaluation function model, and adding an entropy value of the strategy into the expected return function of the control strategy.
CN202111631716.8A 2021-12-28 2021-12-28 Unmanned ship fault-tolerant control method based on model reference reinforcement learning Active CN114296350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111631716.8A CN114296350B (en) 2021-12-28 2021-12-28 Unmanned ship fault-tolerant control method based on model reference reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111631716.8A CN114296350B (en) 2021-12-28 2021-12-28 Unmanned ship fault-tolerant control method based on model reference reinforcement learning

Publications (2)

Publication Number Publication Date
CN114296350A true CN114296350A (en) 2022-04-08
CN114296350B CN114296350B (en) 2023-11-03

Family

ID=80972328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111631716.8A Active CN114296350B (en) 2021-12-28 2021-12-28 Unmanned ship fault-tolerant control method based on model reference reinforcement learning

Country Status (1)

Country Link
CN (1) CN114296350B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110109355A (en) * 2019-04-29 2019-08-09 山东科技大学 A kind of unmanned boat unusual service condition self-healing control method based on intensified learning
CN111694365A (en) * 2020-07-01 2020-09-22 武汉理工大学 Unmanned ship formation path tracking method based on deep reinforcement learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110109355A (en) * 2019-04-29 2019-08-09 山东科技大学 A kind of unmanned boat unusual service condition self-healing control method based on intensified learning
CN111694365A (en) * 2020-07-01 2020-09-22 武汉理工大学 Unmanned ship formation path tracking method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG QINGRUI et al.: "Fault tolerant control for autonomous surface vehicles via model reference reinforcement learning", 2021 60th IEEE Conference on Decision and Control (CDC) *
ZHANG QINGRUI et al.: "Model-reference reinforcement learning control of autonomous surface vehicles", 2020 59th IEEE Conference on Decision and Control (CDC) *

Also Published As

Publication number Publication date
CN114296350B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
Peng et al. Predictor-based neural dynamic surface control for uncertain nonlinear systems in strict-feedback form
Moreno-Salinas et al. Modelling of a surface marine vehicle with kernel ridge regression confidence machine
Gong et al. Lyapunov-based model predictive control trajectory tracking for an autonomous underwater vehicle with external disturbances
Hassanein et al. Model-based adaptive control system for autonomous underwater vehicles
Alessandri Fault diagnosis for nonlinear systems using a bank of neural estimators
Chen et al. Adaptive optimal tracking control of an underactuated surface vessel using actor–critic reinforcement learning
Buciakowski et al. A quadratic boundedness approach to robust DC motor fault estimation
Bejarbaneh et al. Design of robust control based on linear matrix inequality and a novel hybrid PSO search technique for autonomous underwater vehicle
Zhang et al. Disturbance observer-based prescribed performance super-twisting sliding mode control for autonomous surface vessels
CN113110430A (en) Model-free fixed-time accurate trajectory tracking control method for unmanned ship
Zhang et al. Adaptive asymptotic tracking control for autonomous underwater vehicles with non-vanishing uncertainties and input saturation
Li et al. Finite-time composite learning control for trajectory tracking of dynamic positioning vessels
Weng et al. Finite-time observer-based model-free time-varying sliding-mode control of disturbed surface vessels
Li et al. Saturated-command-deviation based finite-time adaptive control for dynamic positioning of USV with prescribed performance
Wang et al. Event-triggered model-parameter-free trajectory tracking control for autonomous underwater vehicles
CN116880184A (en) Unmanned ship track tracking prediction control method, unmanned ship track tracking prediction control system and storage medium
CN114296350A (en) Unmanned ship fault-tolerant control method based on model reference reinforcement learning
Liu et al. Robust Adaptive Self‐Structuring Neural Network Bounded Target Tracking Control of Underactuated Surface Vessels
Gao et al. Data-driven model-free resilient speed control of an autonomous surface vehicle in the presence of actuator anomalies
CN114755917B (en) Model-free self-adaptive anti-interference ship speed controller and design method
Sola et al. Evaluation of a deep-reinforcement-learning-based controller for the control of an autonomous underwater vehicle
Bao et al. Model-free control design using policy gradient reinforcement learning in lpv framework
Alme Autotuned dynamic positioning for marine surface vessels
Caiti et al. Enhancing autonomy: Fault detection, identification and optimal reaction for over—Actuated AUVs
He et al. Gaussian process based robust trajectory tracking of autonomous underwater vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant