CN114296350B - Unmanned ship fault-tolerant control method based on model reference reinforcement learning - Google Patents


Info

Publication number
CN114296350B
CN114296350B (application CN202111631716.8A)
Authority
CN
China
Prior art keywords
unmanned ship
model
reinforcement learning
fault
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111631716.8A
Other languages
Chinese (zh)
Other versions
CN114296350A (en)
Inventor
张清瑞
熊培轩
张雷
朱波
胡天江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111631716.8A
Publication of CN114296350A
Application granted
Publication of CN114296350B
Active legal status
Anticipated expiration of legal status


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 90/00: Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Feedback Control In General (AREA)

Abstract

The application discloses a model reference reinforcement learning-based unmanned ship fault-tolerant control method, which comprises the following steps: analyzing uncertainty factors of the unmanned ship and constructing a nominal dynamics model of the unmanned ship; designing a nominal controller of the unmanned ship based on the nominal dynamics model of the unmanned ship; constructing a fault-tolerant controller based on model reference reinforcement learning from the difference between the state variables of the actual unmanned ship system and the unmanned ship nominal dynamics model, together with the output of the unmanned ship nominal controller; and building a reinforcement learning evaluation function and a control strategy model according to the control task requirements and training the fault-tolerant controller to obtain a trained control strategy. By using the application, the safety and reliability of the unmanned ship system can be significantly improved. The unmanned ship fault-tolerant control method based on model reference reinforcement learning can be widely applied in the field of unmanned ship control.

Description

Unmanned ship fault-tolerant control method based on model reference reinforcement learning
Technical Field
The application relates to the field of unmanned ship control, in particular to a model reference reinforcement learning-based unmanned ship fault-tolerant control method.
Background
With significant advances in guidance, navigation and control technology, unmanned ships (autonomous surface vehicles, ASVs) are being used in a growing share of maritime operations. In most applications, unmanned ships are expected to operate safely without human intervention for extended periods of time. Accordingly, unmanned ships are required to have sufficient safety and reliability to operate properly and avoid catastrophic consequences. However, unmanned ships are prone to problems such as faults, degradation of system components and sensor failures, and may consequently experience performance degradation, instability, or even catastrophic loss.
Disclosure of Invention
In order to solve the above technical problems, the application aims to provide an unmanned ship fault-tolerant control method based on model reference reinforcement learning that can recover system performance or keep the system running after a fault occurs, thereby significantly improving the safety and reliability of the system.
The first technical scheme adopted by the application is as follows: a model reference reinforcement learning-based unmanned ship fault-tolerant control method comprises the following steps:
s1, analyzing uncertainty factors of an unmanned ship and constructing a nominal dynamics model of the unmanned ship;
s2, designing a nominal controller of the unmanned ship based on a nominal dynamics model of the unmanned ship;
s3, constructing a fault-tolerant controller based on model reference reinforcement learning by using a maximum-entropy Actor-Critic method, according to the difference between the state variables of the actual unmanned ship system and the unmanned ship nominal dynamics model and the output of the unmanned ship nominal controller;
and S4, building a reinforcement learning evaluation function and a control strategy model according to the control task requirements, and training a fault-tolerant controller to obtain a trained control strategy.
Further, the formula of the unmanned ship nominal dynamics model is expressed as follows:

$$\dot{\eta} = R(\eta)v, \qquad M\dot{v} + C(v)v + D(v)v + G(v) = Bu$$

In the above, η represents the generalized coordinate vector, v represents the generalized velocity vector, u represents the control force and moment, M represents the inertia matrix, C(v) comprises the Coriolis and centripetal terms, D(v) represents the damping matrix, G(v) represents the unmodeled dynamics due to gravity, buoyancy and their moments, B represents a preset input matrix, and R(η) represents the rotation matrix.
Further, the formula of the unmanned ship nominal controller is expressed as follows:

$$\dot{\eta}_m = R(\eta_m)v_m, \qquad \dot{v}_m = -H_m v_m + N_m u_m$$

In the above, N_m and H_m are known constant parameters of the unmanned ship dynamics model, η_m represents the generalized coordinate vector of the nominal model, u_m represents the control law, and x_m represents the state of the reference model.
Further, the fault-tolerant controller is formulated as follows:
In the above, H_m - L represents the Hurwitz matrix, u_l represents the control strategy from the deep learning module, β(v) represents the set of all model uncertainties in the inner-loop dynamics, n_v represents the noise vector on the generalized velocity measurement, and f_v represents the sensor fault acting on the generalized velocity vector.
Further, the reinforcement learning evaluation function is formulated as follows:
Q^π(s_t, u_{l,t}) = T^π Q^π(s_t, u_{l,t})
In the above, u_{l,t} represents the control excitation from the RL, s_t represents the state signal at time step t, T^π represents the Bellman operator under the fixed policy π, E_π represents the expectation operator, γ represents the discount factor, α represents the temperature coefficient, and Q^π(s_t, u_{l,t}) represents the reinforcement learning evaluation function.
Further, the control strategy model is formulated as follows:

$$\pi_{new} = \arg\min_{\pi'\in\Pi} D_{KL}\!\left(\pi'(\cdot\mid s_t)\,\Big\|\;\frac{\exp\!\big(\tfrac{1}{\alpha}Q^{\pi_{old}}(s_t,\cdot)\big)}{Z^{\pi_{old}}(s_t)}\right)$$

In the above, Π represents the policy set, π_old represents the policy from the previous update, Q^{π_old} represents the Q value of π_old, D_KL represents the KL divergence, Z^{π_old}(s_t) represents the normalization factor, π'(·|s_t) represents the control strategy, and the dot denotes an omitted argument.
Further, according to the control task requirements, building a reinforcement learning evaluation function and a control strategy model, and training a fault-tolerant controller to obtain a trained control strategy specifically comprises the following steps:
S41, building a reinforcement learning evaluation function and a control strategy model for the fault-tolerant controller based on model reference reinforcement learning according to the control task requirements;
S42, training a fault-tolerant controller based on model reference reinforcement learning to obtain an initial control strategy;
s43, injecting faults into the unmanned ship system, retraining the initial control strategy, and returning to the step S41 until the reinforcement learning evaluation function network model and the control strategy model are converged.
Further, the method further comprises the following steps:
introducing a double evaluation function model, and adding the entropy of the strategy to the expected return function of the control strategy, wherein R_t is the reward function, R_t = R(s_t, u_{l,t}).
The method has the beneficial effects that: aiming at an unmanned ship system with model uncertainty and sensor faults, the application provides a fault-tolerant control algorithm based on reinforcement learning that combines model reference reinforcement learning with a fault diagnosis and estimation mechanism. Taking Monte Carlo sampling efficiency into consideration, an Actor-Critic model is used to replace the accumulated return with a Q function, so that through the new reinforcement learning based fault-tolerant control the unmanned ship can learn to adapt to different sensor faults and recover its trajectory tracking performance under fault conditions.
Drawings
FIG. 1 is a flow chart of steps of an unmanned ship fault-tolerant control method based on model reference reinforcement learning of the present application;
FIG. 2 is a block diagram of an Actor-Critic network according to an embodiment of the present application.
Detailed Description
The application will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
As shown in fig. 1, the present application provides an unmanned ship fault-tolerant control method based on model reference reinforcement learning (RL), which includes the following steps:
S1, analyzing the uncertainty factors of the unmanned ship, ignoring all nonlinear terms in the inner-loop dynamics to obtain a linear decoupled model of the dynamics equation of the generalized velocity vector, and establishing a nominal dynamics model of the unmanned ship;
the dynamics model is specifically as follows:
wherein the method comprises the steps ofIs a generalized coordinate vector, x p And y p Representing the horizontal coordinates of ASV in inertial frame, < >>Is the heading angle. v= [ u ] p ,v p ,r p ] T ∈R 3 Is a generalized velocity vector, u p And v p The linear velocities in the x-axis and y-axis directions, r p Is the heading angular rate. u= [ tau ] ur ]∈R 3 Control force and moment, G (v) = [ G ] 1 (v),g 2 (v),g 3 (v)] T ∈R 3 Is the unmodeled dynamics due to gravity and buoyancy and moment, M.epsilon.R 3×3 Is provided with M=M T Inertial matrix of > 0 and
wherein the method comprises the steps ofMatrix C (v) = -C T (v) Including coriolis forces and centripetal forces, are given by:
wherein C is 13 (v)=-M 22 v-M 23 r,C 23 (v)=M 11 u. Damping matrix
Wherein D is 11 (v)=-X u -X |u|u |u|-X uuu u 2 ,D 22 (v)=-Y v -Y |v|v |v|-Y |r|v |r|,D 23 (v)=-Y r -Y |v|r |v|-Y |r|r |r|,D 32 (v)=-N v -N |v|v |v|-N |r|v |r|,D 33 (v)=-N r -N |v|r |v|-N |r|r R, X (·), Y (·), N (·) are hydrodynamic coefficients, defined in the manual for marine hydrodynamic and motion control. Rotation matrixInput matrix->
Definition x= [ eta ] T v T ] T There is
Wherein H (v) = -M -1 (C (v) +d (v)) and n= -M -1 B。
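For illustration only, the state-space form (5) can be simulated numerically as in the following minimal Python sketch; every numerical value, the purely linear damping, the diagonal inertia with M_23 = 0, and the two-input matrix B are assumptions made for the example and are not parameters disclosed in the application.

import numpy as np

# Illustrative sketch of integrating the state-space form (5); all numbers are placeholders.
M = np.diag([25.8, 33.8, 2.76])        # assumed inertia matrix, M = M^T > 0
D_lin = np.diag([2.0, 7.0, 1.5])       # assumed (purely linear) damping D(v)
B = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])             # assumed input matrix mapping [tau_u, tau_r]

def R_mat(psi):
    # rotation matrix R(eta) from the body frame to the inertial frame
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def C_mat(v):
    # Coriolis/centripetal matrix with the sparsity pattern used above (M23 = 0 here)
    up, vp = v[0], v[1]
    c13 = -M[1, 1] * vp                # C13 = -M22*v - M23*r
    c23 = M[0, 0] * up                 # C23 = M11*u
    return np.array([[0.0, 0.0, c13], [0.0, 0.0, c23], [-c13, -c23, 0.0]])

def asv_derivative(x, u, G=np.zeros(3)):
    # x = [eta; v]; returns dx/dt = [R(eta) v; H(v) v + N u - M^{-1} G(v)]
    eta, v = x[:3], x[3:]
    H = -np.linalg.solve(M, C_mat(v) + D_lin)   # H(v) = -M^{-1}(C(v) + D(v))
    N = np.linalg.solve(M, B)                   # N = M^{-1} B
    deta = R_mat(eta[2]) @ v
    dv = H @ v + N @ u - np.linalg.solve(M, G)
    return np.concatenate([deta, dv])

# forward-Euler step as a usage example
x = np.zeros(6)
u = np.array([1.0, 0.1])               # assumed [surge force, yaw moment]
x = x + 0.01 * asv_derivative(x, u)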
The state measurement of the ASV system (1) is corrupted by noise and sensor failure and is therefore denoted y = x + n + f(t), where n ∈ R^6 is the measurement noise vector and f(t) ∈ R^6 represents a possible sensor fault vector. In the application, only sensor faults acting on the heading angular rate r_p are considered, so f(t) = [0, ..., 0, f_r(t)]^T. The sensor fault f_r(t) is given by

f_r(t) = β(t - T_f)φ(t - T_f)

where φ(t - T_f) is an unknown function describing the sensor fault occurring at the time instant T_f, and β(t - T_f) is the time profile of the fault, with β(t - T_f) = 0 for t < T_f and β(t - T_f) = 1 - e^{-k(t - T_f)} for t ≥ T_f (k is the evolution rate of the fault). Note that if the sensor failure occurs abruptly, e.g., a bias fault, then k → ∞. The object of the application is to design a controller that allows the state x to track the reference state trajectory x_r in the presence of model uncertainty, possible sensor failures and measurement noise.
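To make the fault model concrete, the following short Python sketch evaluates f_r(t) = β(t - T_f)φ(t - T_f) with the incipient profile reconstructed above; the constant-bias fault function φ and the numerical values of T_f and k are illustrative assumptions.

import numpy as np

def fault_signal(t, T_f=30.0, k=0.5, phi=lambda tau: 0.2):
    # f_r(t) = beta(t - T_f) * phi(t - T_f); beta tends to a step as k -> infinity
    tau = t - T_f
    beta = np.where(tau < 0.0, 0.0, 1.0 - np.exp(-k * np.maximum(tau, 0.0)))
    return beta * phi(tau)

t = np.linspace(0.0, 60.0, 601)
f_r = fault_signal(t)      # zero before T_f, then grows towards the assumed bias of 0.2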
S2, designing a nominal controller of the unmanned ship based on the nominal dynamics model, guaranteeing the basic stability of the unmanned ship system under fault-free conditions, and analyzing the unmanned ship nominal model.
The nominal controller design process is:
the proposed RL-based FTC algorithm follows a model reference control structure. For most ASV systems, accurate nonlinear dynamics models are rarely available, with the main uncertainties coming from hydrodynamic induced M, C (v) and D (v), and gravity and buoyancy and moment induced G (v). Despite uncertainty in ASV dynamics, a nominal model (5) can still be used based on known information of ASV dynamics. The nominal model of the uncertain ASV model (5) is shown below:
wherein N is m And H m All known constant parameters including ASV dynamics (5),is the generalized coordinate vector of the nominal model. In the present application, M m Is made up of M m =diag{M 11 ,M 22 ,M 33 Derived, H m =M m -1 D m From D m =diag{-X u ,-Y v ,-N r Sum N m =M m -1 B. Therefore, in the nominal model, all nonlinear terms in the inner loop dynamics are ignored, and thus a linear decoupling model of the generalized velocity vector v dynamics equation is finally obtained. Since the dynamics of the nominal model (6) are known, a control law u can be designed m To allow the state of the nominal system (6) to converge to the reference signal x r For example, when t.fwdarw.infinity, |x m -x r || 2 And 0. Such control law u m Can also be used as a nominal controller by the whole ASV dynamics (5).
In the model reference control architecture, the goal is to design a control law that allows the state of (5) to track the state trajectory of the nominal model (6). The overall control law of the ASV system (5) has the following expression:
u = u_b + u_l  (7)
where u_b is the baseline control based on the nominal model and u_l is the control strategy from the deep learning module. The baseline control u_b is used to ensure some basic properties (i.e., local stability), and u_l is used to compensate for all system uncertainties and sensor failures.
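A minimal sketch of the control structure (7) is given below: the overall input is the sum of the baseline term and the learned term. The PD-type baseline law, its gains, and the state indexing are assumptions for illustration; the application does not fix a particular form of u_b.

import numpy as np

Kp = np.diag([2.0, 1.0])      # assumed proportional gains for the surge and heading channels
Kd = np.diag([0.5, 0.2])      # assumed derivative gains

def baseline_control(x_m, x_r):
    # u_b: drives the nominal model state x_m = [eta_m; v_m] towards the reference x_r
    e_pos = np.array([x_r[0] - x_m[0], x_r[2] - x_m[2]])   # surge position / heading errors
    e_vel = np.array([x_r[3] - x_m[3], x_r[5] - x_m[5]])   # corresponding velocity errors
    return Kp @ e_pos + Kd @ e_vel

def total_control(x_m, x_r, s, policy_mean):
    # u = u_b + u_l from (7); policy_mean stands in for the trained actor mean network
    u_b = baseline_control(x_m, x_r)
    u_l = policy_mean(s)
    return u_b + u_l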
S3, constructing a fault-tolerant controller based on model reference reinforcement learning by a maximum-entropy Actor-Critic method, taking as inputs the difference between the state variables of the actual unmanned ship system and the nominal model and the output of the nominal controller.
Referring to FIG. 2, a network block diagram of an Actor-Critic, the specific derivation of the fault tolerant controller is as follows:
the formula for RL is based on a markov decision process MDP =represented by tuples<S,U,P,R,γ>Where S is the state space, U specifies the operation/input space, P: S U S R defines the transition probability, R: S U R is a back rewards function, gamma E [0, 1) is a discount coefficient. In MDP, the state vector S ε S contains the influence on RL control u l E all available signals of U. For the tracking control of the ASV system in the application, the transition probability is determined by the ASV dynamic state in (1) and the reference signal x r Characterization. In RL, the control strategy is learning using data samples acquired in the discrete time domain. Let s be t For the state signal s at the time step t, accordingly, u l,t Is the input of the RL-based control at time step t. The RL algorithm in the present application aims to maximize an action cost function, also called Q function, as follows:
wherein R is t Is a reward function, R t =R(s t ,u l,t ),And V is π (s t +1) is called s under policy pi t A state value function of +1, wherein
where π(u_{l,t}|s_t) is the control strategy, H(π(·|s_t)) is the entropy of the strategy, and α is the temperature parameter. The control strategy π(u_{l,t}|s_t) in RL is the probability of selecting the action u_{l,t} ∈ U in the state s_t ∈ S. In the present application, a control strategy satisfying a Gaussian distribution is employed, i.e.,

π(u_l|s) = N(u_l(s), σ)  (10)

where N(·, ·) denotes a Gaussian distribution, u_l(s) is the mean, and σ is the covariance matrix. The covariance matrix σ controls the exploration behaviour of the learning phase.
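The Gaussian strategy (10) can be sketched as follows; the two-layer network, its sizes, and the state/action dimensions are assumptions for illustration rather than the network actually used in the application.

import numpy as np

rng = np.random.default_rng(0)
dim_s, dim_u, hidden = 8, 2, 64
W1 = rng.standard_normal((hidden, dim_s)) * 0.1
W2 = rng.standard_normal((dim_u, hidden)) * 0.1
log_std = np.full(dim_u, -0.5)          # would be learnable in practice; fixed here

def policy_mean(s):
    # mean u_l(s) of the Gaussian policy, here a small tanh network
    return W2 @ np.tanh(W1 @ s)

def sample_action(s):
    # draw u_l ~ N(u_l(s), diag(sigma^2)) for exploration during learning
    return policy_mean(s) + np.exp(log_std) * rng.standard_normal(dim_u)

u_l = sample_action(np.zeros(dim_s))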
The goal of the RL is to find an optimal control strategy π* that maximizes Q^π(s_t, u_{l,t}) in (8), i.e.,

π* = argmax_π Q^π(s_t, u_{l,t})  (11)

Note that the covariance σ* will converge to 0. Once the optimal strategy π*(u_l*|s) = N(u_l*(s), σ*) is obtained, the mean function u_l*(s) is the learned optimal control law. The deep neural network Q_θ(s_t, u_{l,t}) is called the critic, and the control strategy π_φ(u_{l,t}|s_t) is called the actor. The inner-loop dynamics of the ASV model with uncertainty in (5) are rewritten as

$$\dot{v} = -H_m v + N_m u + \beta(v) \qquad (12)$$

where β(v) is the set of all model uncertainties in the inner-loop dynamics. The uncertainty term β(v) is assumed to be bounded. Let e_v = v - v_m; according to (6) and (12), the error dynamics are

$$\dot{e}_v = -H_m e_v + N_m u_l + \beta(v) \qquad (13)$$
under healthy conditions, the model uncertainty term β (v) can use learning-based control u l Complete compensation is performed. This means that when t→infinity, ||e v (t)|| 2 ε, where ε is some positive small constant. Error signal e in case of sensor failure v Will be greater than epsilon. One inexperienced idea of learning-based Fault Tolerant Control (FTC) is to treat sensor faults as part of external disturbances. However, treating a sensor failure as a disturbance will result in a conservative learning based control, such as a robust control. Thus, we introduced a fault diagnosis and estimation mechanism that allowed learning-based control to adapt to different scenarios: healthy and unhealthy conditions.
Let y_v = v + n_v + f_v, where n_v represents the noise vector on the generalized velocity measurement and, correspondingly, f_v is the sensor fault acting on the generalized velocity vector. In addition, the fault tracking error vector ē_v = y_v - v_m is defined; in practical use, ē_v is measurable instead of e_v. Finally, the following fault diagnosis and estimation mechanism is introduced:
where L is selected such that H_m - L is Hurwitz, and the residual signal of the mechanism serves as an indicator of the occurrence and strength of a sensor failure.
In the above, H_m - L represents the Hurwitz matrix, u_l represents the control strategy from the deep learning module, β(v) represents the set of all model uncertainties in the inner-loop dynamics, n_v represents the noise vector on the generalized velocity measurement, and f_v represents the sensor fault acting on the generalized velocity vector.
S4, designing a corresponding reward function according to the control task requirements, and building a reinforcement learning evaluation function model (Q-value) and a control strategy model with fully connected networks.
The reward function, the reinforcement learning evaluation function and the control strategy model are derived as follows:
The fault-tolerant RL-based control is derived using the outputs of the fault diagnosis and estimation mechanism. RL learns the control strategy in discrete time steps using data samples (including input and state data). The sampling time step is assumed to be fixed and is denoted δt. Without loss of generality, let y_t, u_{b,t}, u_{l,t} and the output of the fault diagnosis and estimation mechanism at time step t denote, respectively, the ASV state, the nominal controller excitation, the control excitation from the RL, and the fault diagnosis and estimation output; the state signal s_t at time step t is composed of these signals. The training and learning process of the RL repeatedly performs policy evaluation and policy improvement. In policy evaluation, the Q-value is obtained through the Bellman operation Q^π(s_t, u_{l,t}) = T^π Q^π(s_t, u_{l,t}), wherein

$$T^\pi Q^\pi(s_t, u_{l,t}) = R_t + \gamma\,\mathbb{E}_\pi\!\left[Q^\pi(s_{t+1}, u_{l,t+1}) - \alpha\log\pi(u_{l,t+1}\mid s_{t+1})\right]$$

In the above, u_{l,t} represents the control excitation from the RL, s_t represents the state signal at time step t, T^π represents the Bellman operator under the fixed policy π, E_π represents the expectation operator, γ represents the discount factor, α represents the temperature coefficient, and Q^π(s_t, u_{l,t}) represents the reinforcement learning evaluation function.
In policy improvement, the policy is updated by

$$\pi_{new} = \arg\min_{\pi'\in\Pi} D_{KL}\!\left(\pi'(\cdot\mid s_t)\;\Big\|\;\frac{\exp\!\big(\tfrac{1}{\alpha}Q^{\pi_{old}}(s_t,\cdot)\big)}{Z^{\pi_{old}}(s_t)}\right)$$

where Π represents the policy set, π_old denotes the policy from the previous update, Q^{π_old} denotes the Q value of π_old, D_KL denotes the Kullback-Leibler (KL) divergence, and Z^{π_old}(s_t) denotes the normalization factor. Through mathematical manipulation, this objective is converted into the parameterized objective function used in the policy improvement step below.
S5, introducing a double-evaluation function model idea into the evaluation function training framework, and simultaneously adding the entropy value of the strategy into the expected return function of the control strategy, so that the reinforcement learning training efficiency is improved.
The derivation process of the dual-evaluation function model comprises the following steps:
parameterizing the Q function with θ, with Q θ (s t ,u l,t ) And (3) representing. The parameterization strategy is composed of pi φ (u l,t |s t ) Representation, wherein phi is the parameter set to be trained. Note that both θ and Φ are a set of parameters whose size is determined by the deep neural network settings. For example, if Q θ Represented by an MLP with K hidden layers and L neurons per hidden layer, the parameter set θ is θ= { θ 01 ,...,θ K And is 1.ltoreq.i.ltoreq.K-1θ K ∈R 1×(L+1) ,θ i ∈R (L)×(L+1) Wherein dim s Representing the size, dim, of the state s u Representing input u l Is a size of (c) a.
The training process is offline, collecting data samples at each time step t+1, e.g., the input u_{l,t} from the last time step, the state s_t of the last time step, the reward R_t, and the current state s_{t+1}. These historical data are stored as tuples (s_t, u_{l,t}, R_t, s_{t+1}) in a memory pool D. In each policy evaluation or improvement step, a batch of historical data B is randomly drawn from the memory pool D for training the parameters θ and φ. At the beginning of training, the nominal control strategy u_b is applied to the ASV system to collect the initial data set D_0, as shown in Algorithm 1. The initial data set D_0 is used for the initial fitting of the Q function. After initialization, both u_b and the newly updated reinforcement learning strategy π_φ(u_{l,t}|s_t) are executed to operate the ASV system.
The parameter θ of the Q function is trained to minimize the Bellman residual:

$$J_Q(\theta) = \mathbb{E}_{(s_t,u_{l,t})\sim D}\!\left[\tfrac{1}{2}\big(Q_\theta(s_t, u_{l,t}) - Y_{target}\big)^2\right] \qquad (15)$$

where (s_t, u_{l,t}) ~ D means that the samples (s_t, u_{l,t}) are randomly drawn from the memory pool D, and Y_target is computed with the target parameters θ̄, which are updated slowly. The DNN parameter θ is obtained by applying a stochastic gradient descent method to (15) on the sampled data batch B, whose size is denoted |B|. In the application, two critics, parameterized by θ_1 and θ_2 respectively, are used. These two critics are introduced to reduce the overestimation problem in the critic neural network training. Under the double evaluation function, the target value Y_target is:

$$Y_{target} = R_t + \gamma\left(\min_{j=1,2} Q_{\bar\theta_j}(s_{t+1}, u_{l,t+1}) - \alpha\log\pi_\phi(u_{l,t+1}\mid s_{t+1})\right)$$

where u_{l,t+1} is sampled from the current policy π_φ(·|s_{t+1}).
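The double evaluation function target described above can be sketched as follows; the two target critics and the actor are placeholder callables, and the discount and temperature values are assumptions.

def double_critic_target(r_t, s_next, Q_targ1, Q_targ2, policy_sample, log_prob,
                         gamma=0.99, alpha=0.2):
    # Y_target = R_t + gamma * ( min_j Q_targ_j(s', u') - alpha * log pi(u'|s') )
    u_next = policy_sample(s_next)
    q_min = min(Q_targ1(s_next, u_next), Q_targ2(s_next, u_next))
    return r_t + gamma * (q_min - alpha * log_prob(s_next, u_next))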
the policy improvement step uses the data samples in memory pool D to achieve the following parameterized objective function minimization:
the parameter phi is trained to a minimum using a random gradient descent method, and during the training phase, the actor neural network is expressed as:
wherein the method comprises the steps ofIs a parameterized control law to be learned, +.>Is the standard deviation of the detection noise, ζ to N (0,I) are the detection noise, "" is the Hadamard product. Note that the detected noise ζ is only applicable in the training phase, once training is completed, only in-use +.>Thus, u in training phase l Equivalent to u l,φ . Once training is finished, get +.>
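A sketch of the reparameterized actor used only during training is given below: the action is the learned mean plus the state-dependent standard deviation multiplied elementwise by Gaussian noise, and the noise is dropped after training; mean_phi and sigma_phi are placeholders for the actor network heads.

import numpy as np

def actor_action(s, mean_phi, sigma_phi, training=True, rng=np.random.default_rng()):
    # u_{l,phi}(s) = mean_phi(s) + sigma_phi(s) * xi,  xi ~ N(0, I)  (Hadamard product)
    mean = mean_phi(s)
    if not training:
        return mean                              # deployed control law after training
    xi = rng.standard_normal(np.shape(mean))     # exploration noise, training phase only
    return mean + sigma_phi(s) * xi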
The temperature parameter α is also updated during the training phase. The update is obtained by minimizing the following objective function:

$$J(\alpha) = \mathbb{E}_{u_l\sim\pi_\phi}\!\left[-\alpha\log\pi_\phi(u_l\mid s_t) - \alpha\bar{\mathcal{H}}\right]$$

where H̄ is the target entropy value of the policy. In the application, H̄ = -2 is adopted, where "2" represents the action dimension.
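A hedged sketch of the temperature update under the objective reconstructed above: α is decreased when the estimated policy entropy exceeds the target H̄ and increased otherwise; the learning rate and the batch-based gradient estimate are assumptions.

def update_alpha(alpha, batch_log_probs, target_entropy=-2.0, lr=3e-4):
    # gradient of J(alpha) = E[-alpha*log pi - alpha*H_bar] w.r.t. alpha is E[-log pi - H_bar]
    grad = sum(-lp - target_entropy for lp in batch_log_probs) / len(batch_log_probs)
    return max(alpha - lr * grad, 1e-6)          # keep the temperature positive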
S6, training the controller based on model reference reinforcement learning under fault-free conditions to obtain an initial control strategy, ensuring the robustness of the overall controller to model uncertainty.
S7, injecting faults into the unmanned ship system, retraining the obtained initial control strategy based on model reference reinforcement learning, and realizing the adaptability of the overall controller to partial sensor faults.
And S8, continuously repeating the step S6 and the step S7 under different initial state conditions until the reinforcement learning evaluation function network model and the control strategy model are converged.
Specifically, the training process of steps S6-S8 is as follows:
1) Initialize the critic parameters θ_1, θ_2 of Q_{θ_1} and Q_{θ_2} and the actor network parameters φ;
2) Assign values to the target parameters: θ̄_1 ← θ_1, θ̄_2 ← θ_2;
3) Run u_b in equation (5) with u_l = 0 to obtain an initial data set D_0;
4) At the end of this exploration phase, use the data set D_0 to train the initial critic parameters θ_1^0, θ_2^0;
5) Initialize the memory pool D ← D_0;
6) Assign initial values to the critic parameters and their targets: θ_1 ← θ_1^0, θ_2 ← θ_2^0, θ̄_1 ← θ_1^0, θ̄_2 ← θ_2^0;
7) Repeat;
8) For each data collection step, perform the following operations;
9) Select an action u_{l,t} according to π_φ(u_{l,t}|s_t);
10) Run the nominal system (6), the overall system (5) and the fault diagnosis and estimation mechanism (14), and collect s_{t+1} = {x_{t+1}, x_{m,t+1}, u_{b,t+1}};
11) D ← D ∪ {s_t, u_{l,t}, R(s_t, u_{l,t}), s_{t+1}};
12) End the data collection loop;
13) For each gradient update step, perform the following operations;
14) Draw a batch of data B from D;
15) θ_j ← θ_j - ι_Q ∇_{θ_j} J_Q(θ_j), j = 1, 2;
16) φ ← φ - ι_π ∇_φ J_π(φ);
17) α ← α - ι_α ∇_α J_α(α);
18) θ̄_j ← κ θ_j + (1 - κ) θ̄_j, j = 1, 2;
19) End the gradient update loop;
20) Until convergence (e.g., J_Q(θ) falls below a small threshold).
In this algorithm, ι_Q, ι_π and ι_α are positive learning rates (scalars), and κ > 0 is a constant scalar.
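For orientation, the training procedure of steps S6-S8 can be summarized by the following Python-style sketch; every helper name (environment step, network updates, fault injection, convergence test) is a placeholder for the components described above and is not code disclosed in the application.

def train(env, agent, replay, episodes=500, steps_per_episode=400,
          updates_per_step=1, inject_fault_after=250):
    # Schematic loop for S6 (fault-free training), S7 (fault injection and retraining)
    # and S8 (repetition until the critic and actor networks converge).
    for ep in range(episodes):
        fault_on = ep >= inject_fault_after        # S7: inject sensor faults after S6
        s = env.reset(random_initial_state=True, sensor_fault=fault_on)
        for t in range(steps_per_episode):
            u_l = agent.sample_action(s)           # from pi_phi(u_l | s)
            s_next, r, done = env.step(u_l)        # runs (5), (6) and the fault diagnosis mechanism
            replay.add(s, u_l, r, s_next)
            s = s_next
            for _ in range(updates_per_step):
                batch = replay.sample()
                agent.update_critics(batch)        # minimize the Bellman residual (15)
                agent.update_actor(batch)          # minimize J_pi(phi)
                agent.update_alpha(batch)          # minimize J(alpha)
                agent.update_targets()             # slow (Polyak) update of the target critics
            if done:
                break
        if agent.critic_loss() < 1e-3:             # S8: stop once the networks have converged
            return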
An unmanned ship fault-tolerant control system based on model reference reinforcement learning, comprising:
the dynamics model construction module is used for analyzing uncertainty factors of the unmanned ship and constructing a nominal dynamics model of the unmanned ship;
the controller design module is used for designing a nominal controller of the unmanned ship based on the nominal dynamics model of the unmanned ship;
the fault-tolerant controller construction module is used for constructing a fault-tolerant controller based on model reference reinforcement learning, by the maximum-entropy Actor-Critic method, according to the difference between the state variables of the actual unmanned ship system and the unmanned ship nominal dynamics model and the output of the unmanned ship nominal controller;
and the training module is used for building a reinforcement learning evaluation function and a control strategy model according to the control task requirement and training the fault-tolerant controller to obtain a trained control strategy.
The content in the method embodiment is applicable to the system embodiment, the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method embodiment.
Unmanned ship fault-tolerant control device based on model reference reinforcement learning:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement an unmanned ship fault-tolerant control method based on model reference reinforcement learning as described above.
The content in the method embodiment is applicable to the embodiment of the device, and the functions specifically realized by the embodiment of the device are the same as those of the method embodiment, and the obtained beneficial effects are the same as those of the method embodiment.
A storage medium having stored therein instructions executable by a processor, characterized by: the processor-executable instructions, when executed by the processor, are configured to implement an unmanned ship fault-tolerant control method based on model reference reinforcement learning as described above.
The content in the method embodiment is applicable to the storage medium embodiment, and functions specifically implemented by the storage medium embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method embodiment.
While the preferred embodiment of the present application has been described in detail, the application is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (7)

1. The unmanned ship fault-tolerant control method based on model reference reinforcement learning is characterized by comprising the following steps:
s1, analyzing uncertainty factors of an unmanned ship and constructing a nominal dynamics model of the unmanned ship;
s2, designing a nominal controller of the unmanned ship based on the nominal dynamics model of the unmanned ship;
s3, constructing a fault-tolerant controller based on model reference reinforcement learning by using a maximum-entropy Actor-Critic method, according to the difference between the state variables of the actual unmanned ship system and the unmanned ship nominal dynamics model and the output of the unmanned ship nominal controller;
s4, building a reinforcement learning evaluation function and a control strategy model according to the control task requirements, and training the fault-tolerant controller to obtain a trained control strategy;
the formula of the fault-tolerant controller is expressed as follows:
in the above, H_m - L represents the Hurwitz matrix, u_l represents the control strategy from the deep learning module, β(v) represents the set of all model uncertainties in the inner-loop dynamics, n_v represents the noise vector on the generalized velocity measurement, f_v represents the sensor fault acting on the generalized velocity vector, the residual signal serves as an indicator of the occurrence and strength of a sensor failure, and N_m and H_m are known constant parameters of the unmanned ship dynamics model.
2. The unmanned ship fault-tolerant control method based on model reference reinforcement learning according to claim 1, wherein the formula of the unmanned ship nominal dynamics model is expressed as follows:

$$\dot{\eta} = R(\eta)v, \qquad M\dot{v} + C(v)v + D(v)v + G(v) = Bu$$

In the above, η = [x_p, y_p, ψ]^T represents the generalized coordinate vector, x_p and y_p are the horizontal position coordinates of the ASV in the inertial coordinate system, v represents the generalized velocity vector, u represents the control force and moment, M represents the inertia matrix, C(v) comprises the Coriolis and centripetal forces, D(v) represents the damping matrix, G(v) represents the unmodeled dynamics due to gravity, buoyancy and their moments, B represents a preset input matrix, and R(η) represents the rotation matrix.
3. The unmanned ship fault-tolerant control method based on model reference reinforcement learning according to claim 2, wherein the formula of the unmanned ship nominal controller is expressed as follows:

$$\dot{\eta}_m = R(\eta_m)v_m, \qquad \dot{v}_m = -H_m v_m + N_m u_m$$

In the above, N_m and H_m are known constant parameters of the unmanned ship dynamics model, η_m represents the generalized coordinate vector of the nominal model, u_m represents the control law, and x_m represents the state of the reference model.
4. The unmanned ship fault-tolerant control method based on model reference reinforcement learning according to claim 3, wherein the reinforcement learning evaluation function is formulated as follows:
Q^π(s_t, u_{l,t}) = T^π Q^π(s_t, u_{l,t})
In the above, u_{l,t} represents the control excitation from the RL, s_t represents the state signal at time step t, T^π represents the Bellman operator under the fixed policy π, E_π represents the expectation operator, γ represents the discount factor, α represents the temperature coefficient, Q^π(s_t, u_{l,t}) represents the reinforcement learning evaluation function, and R_t represents the reward.
5. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 4, wherein the control strategy model is formulated as follows:

$$\pi_{new} = \arg\min_{\pi'\in\Pi} D_{KL}\!\left(\pi'(\cdot\mid s_t)\,\Big\|\;\frac{\exp\!\big(\tfrac{1}{\alpha}Q^{\pi_{old}}(s_t,\cdot)\big)}{Z^{\pi_{old}}(s_t)}\right)$$

In the above, Π represents the policy set, π_old represents the policy from the previous update, Q^{π_old} represents the Q value of π_old, D_KL represents the KL divergence, Z^{π_old}(s_t) represents the normalization factor, and π'(·|s_t) represents the control strategy.
6. The unmanned ship fault-tolerant control method based on model reference reinforcement learning according to claim 1, wherein the step of building a reinforcement learning evaluation function and a control strategy model according to the control task requirements and training the fault-tolerant controller to obtain a trained control strategy specifically comprises the following steps:
s41, building a reinforcement learning evaluation function and a control strategy model for the fault-tolerant controller based on model reference reinforcement learning according to the control task requirements;
s42, training a fault-tolerant controller based on model reference reinforcement learning to obtain an initial control strategy;
s43, injecting faults into the unmanned ship system, retraining the initial control strategy, and returning to the step S41 until the reinforcement learning evaluation function network model and the control strategy model are converged.
7. The unmanned ship fault-tolerant control method based on model reference reinforcement learning of claim 6, further comprising:
and introducing a double-evaluation function model, and adding the entropy value of the strategy into the expected return function of the control strategy.
CN202111631716.8A 2021-12-28 2021-12-28 Unmanned ship fault-tolerant control method based on model reference reinforcement learning Active CN114296350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111631716.8A CN114296350B (en) 2021-12-28 2021-12-28 Unmanned ship fault-tolerant control method based on model reference reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111631716.8A CN114296350B (en) 2021-12-28 2021-12-28 Unmanned ship fault-tolerant control method based on model reference reinforcement learning

Publications (2)

Publication Number Publication Date
CN114296350A CN114296350A (en) 2022-04-08
CN114296350B true CN114296350B (en) 2023-11-03

Family

ID=80972328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111631716.8A Active CN114296350B (en) 2021-12-28 2021-12-28 Unmanned ship fault-tolerant control method based on model reference reinforcement learning

Country Status (1)

Country Link
CN (1) CN114296350B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110109355A (en) * 2019-04-29 2019-08-09 山东科技大学 A kind of unmanned boat unusual service condition self-healing control method based on intensified learning
CN111694365A (en) * 2020-07-01 2020-09-22 武汉理工大学 Unmanned ship formation path tracking method based on deep reinforcement learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110109355A (en) * 2019-04-29 2019-08-09 山东科技大学 A kind of unmanned boat unusual service condition self-healing control method based on intensified learning
CN111694365A (en) * 2020-07-01 2020-09-22 武汉理工大学 Unmanned ship formation path tracking method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"model-reference reinforcement learning control of autonomous surface vehicles;zhangqingrui等;《2020 59thIEEE conference on decision and control(CDC)》;第5291-5196页 *
fault tolerant control for autonomous surface vehicles via model reference reinforcement learning;zhang qingrui等;《2021 60thIEEE conference on decision and control(CDC)》;摘要部分 *

Also Published As

Publication number Publication date
CN114296350A (en) 2022-04-08

Similar Documents

Publication Publication Date Title
Peng et al. Predictor-based neural dynamic surface control for uncertain nonlinear systems in strict-feedback form
Zhang et al. Adaptive terminal sliding mode based thruster fault tolerant control for underwater vehicle in time-varying ocean currents
CN109507885B (en) Model-free self-adaptive AUV control method based on active disturbance rejection
Shojaei Neural adaptive robust control of underactuated marine surface vehicles with input saturation
Zhang et al. Robust sensor fault estimation scheme for satellite attitude control systems
Fan et al. Global fixed-time trajectory tracking control of underactuated USV based on fixed-time extended state observer
Alessandri Fault diagnosis for nonlinear systems using a bank of neural estimators
Taghieh et al. A predictive type-3 fuzzy control for underactuated surface vehicles
Wang et al. Extended state observer-based fixed-time trajectory tracking control of autonomous surface vessels with uncertainties and output constraints
Bai et al. Multi-innovation gradient iterative locally weighted learning identification for a nonlinear ship maneuvering system
Zhang et al. Disturbance observer-based prescribed performance super-twisting sliding mode control for autonomous surface vessels
Zhang et al. Adaptive asymptotic tracking control for autonomous underwater vehicles with non-vanishing uncertainties and input saturation
CN116150934A (en) Ship maneuvering Gaussian process regression online non-parameter identification modeling method
Shen et al. USV parameter estimation: Adaptive unscented Kalman filter-based approach
Wischnewski et al. Real-time learning of non-Gaussian uncertainty models for autonomous racing
Zhang et al. Observer‐based single‐network incremental adaptive dynamic programming for fault‐tolerant control of nonlinear systems with actuator faults
Wang et al. Event-triggered model-parameter-free trajectory tracking control for autonomous underwater vehicles
CN114296350B (en) Unmanned ship fault-tolerant control method based on model reference reinforcement learning
CN114755917B (en) Model-free self-adaptive anti-interference ship speed controller and design method
Yang et al. IAR-STSCKF-based fault diagnosis and reconstruction for spacecraft attitude control systems
Sola et al. Evaluation of a deep-reinforcement-learning-based controller for the control of an autonomous underwater vehicle
Wadi et al. A novel localization-free approach to system identification for underwater vehicles using a Universal Adaptive Stabilizer
Bao et al. Model-free control design using policy gradient reinforcement learning in lpv framework
CN114061592A (en) Adaptive robust AUV navigation method based on multiple models
He et al. Gaussian process based robust trajectory tracking of autonomous underwater vehicle

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant