CN113885330B - Information physical system safety control method based on deep reinforcement learning - Google Patents

Information physical system safety control method based on deep reinforcement learning Download PDF

Info

Publication number
CN113885330B
CN113885330B CN202111247290.6A CN202111247290A CN113885330B CN 113885330 B CN113885330 B CN 113885330B CN 202111247290 A CN202111247290 A CN 202111247290A CN 113885330 B CN113885330 B CN 113885330B
Authority
CN
China
Prior art keywords
physical system
attack
representing
time
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111247290.6A
Other languages
Chinese (zh)
Other versions
CN113885330A (en
Inventor
吴承伟
柴庆杰
刘健行
孙光辉
吴立刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202111247290.6A priority Critical patent/CN113885330B/en
Publication of CN113885330A publication Critical patent/CN113885330A/en
Application granted granted Critical
Publication of CN113885330B publication Critical patent/CN113885330B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

An information physical system security control method based on deep reinforcement learning belongs to the technical field of information security. The invention solves the problem of poor control performance of the safety control strategy designed based on the existing method under the condition of network attack. The invention describes an information physical system dynamic equation under an attacked condition as a Markov decision process, converts the information physical system safety control problem under the attacked condition of false data injection into a control strategy learning problem only using data based on the established Markov process, and provides a flexible operator-critic reinforcement learning algorithm based on a Lyapunov function and a novel deep neural network training frame based on a flexible operator-critic reinforcement learning algorithm frame. The invention can be applied to the safety control of an information physical system.

Description

Information physical system safety control method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of information safety, and particularly relates to an information physical system safety control method based on deep reinforcement learning.
Background
The information physical system integrates and develops the existing technologies of communication, wireless network, distribution, artificial intelligence and the like, constructs the mutual mapping, timely interaction and efficient cooperation of human, machine, object, environment, information and other elements in the physical space and the information space, and becomes a new generation intelligent system integrating calculation, communication and control. At the heart of smart manufacturing, the development of cyber-physical systems has been receiving attention from countries around the world. In recent years, network attack events occur frequently, great threats are caused to national security, economic development, infrastructure security and people's lives and properties, and how to ensure the security of an information physical system becomes a great problem to be solved urgently. Although the research on the security problem of the cyber-physical system is continuously breaking through, the shortage of the cyber-physical system security design relying on only a single discipline knowledge, such as a control discipline, a computer discipline or a network cyber-security discipline technology, is increasingly highlighted. Aiming at network attack, the existing method still has the problems of complex model of the information physical system, difficult modeling, difficult design of a security control strategy, poor control performance under the security control strategy and the like, so how to solve the security problem of the information physical system by fusing multi-disciplinary technical means becomes an important problem to be solved urgently.
Deep reinforcement learning techniques have been widely used in the fields of robotics and games. The technology can realize intelligent control scheme design based on data completely. The cyber-physical system is an intelligent system framework, and how to design an intelligent safety control strategy based on a deep reinforcement learning technology has important scientific significance.
Disclosure of Invention
The invention aims to solve the problem that the control performance of a safety control strategy designed based on the existing method is poor under the condition of network attack, and provides an information physical system safety control method based on deep reinforcement learning.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a deep reinforcement learning-based cyber-physical system security control method specifically comprises the following steps:
step one, establishing an information physical system model under the attack of false data injection of an actuator;
step two, describing the information physical system model under the attack of injecting the actuator false data established in the step one into a Markov decision process;
and step three, building a deep neural network, and outputting a decision strategy for the Markov decision process based on the built deep neural network.
Further, the specific process of the first step is as follows:
step one, under an ideal state, the dynamic equation of the information physical system model is as follows:
Figure BDA0003321263940000021
wherein the content of the first and second substances,
Figure BDA0003321263940000022
a state vector representing the physical system of the information,
Figure BDA0003321263940000023
representing the real number field, nxRepresenting the dimensions of the state vector x,
Figure BDA0003321263940000024
representing the control signal to be designed, nuWhich represents the dimension of the control signal u,
Figure BDA0003321263940000025
f (-) represents a generalized function map for the first derivative of x;
step two, discretizing the information physical system model in the step one based on an Euler method to obtain a discretized information physical system model:
x(k+1)=(f(x(k),u(k)))Δt+x(k)
wherein x (k) represents the state vector of the discretization information physical system at the time k, u (k) represents the control signal at the time k, Δ t represents the sampling period, and x (k +1) represents the state vector of the discretization information physical system at the time k + 1;
step three, the dynamic equation of the reference model without considering external disturbance, uncertainty and network attack behavior is as follows:
xr(k+1)=g(xr(k),ur(k))
wherein the content of the first and second substances,
Figure BDA0003321263940000026
representing the state vector of the reference model at time k, ur(k) Controller representing the reference model at time k, xr(k +1) represents the state vector of the reference model at the moment k +1, and g (-) represents the generalized function mapping;
step four, describing a control signal of the discretization information physical system model under the condition that false data is injected into the attack as follows:
Figure BDA0003321263940000027
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003321263940000028
indicating the control signal at time k after an attack, ua(k) Representing a spurious data injection attack signal at time k, wherein gamma represents an attack distribution matrix of the spurious data injection attack;
step one, the state vector of the reference model is used as part input of the control signal generated by the discretization information physical system model, and the control signal after being attacked is considered, so that the discretization information physical system model in the step two is described as the information physical system model under the attack of executor false data injection:
Figure BDA0003321263940000029
wherein the content of the first and second substances,
Figure BDA00033212639400000210
state vector x representing model to be referencedr(k) When the control signal is generated as part of the input of the discretized cyber-physical system model, the state vector of the discretized cyber-physical system model at the time k,
Figure BDA00033212639400000211
state vector x representing model to be referencedr(k +1) generating control as discretized information physical system modelWhen a signal is partially input, the state vector of the information physical system model at the time k +1 is discretized.
Further, the attack distribution matrix Γ is a diagonal matrix, values of diagonal elements are all 0 or 1, if an ith actuator is attacked, an ith diagonal element of the attack distribution matrix Γ takes a value of 1, otherwise, an ith diagonal element takes a value of 0.
Further, the spurious data injection attack signal ua(k) Is a state independent attack signal, if ua(k) Is time-invariant, then ua(k) ω, where ω represents a constant value; if u isa(k) Is time-varying, then
Figure BDA0003321263940000031
Where ω (k) represents a bounded time-varying function,
Figure BDA0003321263940000032
representing an upper bound on the attack signal.
Further, the spurious data injection attack signal ua(k) Is a state-dependent attack signal, if ua(k) Is time-invariant, then ua(k) Is described as
Figure BDA0003321263940000033
Where ω denotes a constant value, if ua(k) Is time-varying, then ua(k) Is described as
Figure BDA0003321263940000034
Where ω (k) represents a bounded time-varying function.
Further, the markov decision process is described as follows:
Figure BDA0003321263940000035
wherein the content of the first and second substances,
Figure BDA0003321263940000036
markov decision-making process for representing k timeThe state of the program is that,
Figure BDA0003321263940000037
representing the state of the markov decision process at time k +1,
Figure BDA0003321263940000038
is shown in the action
Figure BDA0003321263940000039
Under the action of (2) system state from
Figure BDA00033212639400000310
To
Figure BDA00033212639400000311
The transition probability of (2).
Further, in the third step, a decision strategy is output for the markov decision process based on the built deep neural network, and the specific process is as follows:
defining the action value function
Figure BDA00033212639400000312
Figure BDA00033212639400000313
Wherein the content of the first and second substances,
Figure BDA00033212639400000314
for the cost function, γ ∈ [0,1) denotes the discount factor, α denotes the temperature parameter,
Figure BDA00033212639400000315
is an item of entropy of the information that,
Figure BDA00033212639400000316
is shown in a state
Figure BDA00033212639400000317
From strategy piSelection actions
Figure BDA00033212639400000318
The probability of (a) of (b) being,
Figure BDA00033212639400000319
which is indicative of a desired operation to be performed,
Figure BDA00033212639400000320
is the function value of the state value at the moment of k + 1;
the deep neural network obtains the optimal strategy pi by minimizing the action value function*
Figure BDA00033212639400000321
Wherein Π represents a set of policies;
for optimal strategy pi*Sampling to obtain optimal control signal u*(k),
Figure BDA0003321263940000041
Representing a Gaussian distribution,. sigma*Is the variance.
Further, the information entropy item
Figure BDA0003321263940000042
The expression of (c) is:
Figure BDA0003321263940000043
wherein the content of the first and second substances,
Figure BDA0003321263940000044
is shown in a state
Figure BDA0003321263940000045
Selecting actions from policy π
Figure BDA0003321263940000046
Am ofThe ratio of the total weight of the particles,
Figure BDA0003321263940000047
representing the desired operation.
Further, the cost function
Figure BDA0003321263940000048
Comprises the following steps:
Figure BDA0003321263940000049
where the superscript T represents transpose.
Further, the deep neural network is a network built based on a multi-layer perceptron and provided with two hidden layers, and the built network is described as follows:
Figure BDA00033212639400000410
wherein, the first and the second end of the pipe are connected with each other,DNN w(z) denotes the output of the net, z denotes the input vector of the net, "1" denotes the offset, w0,w1And w2Representing a weight coefficient required to be trained by the network, wherein rho (·) is an activation function;
the deep neural network repeatedly carries out strategy evaluation and strategy updating during training, and the Lyapunov function constraint needs to be ensured during strategy updating, so that the strategy updating mode is as follows:
Figure BDA00033212639400000411
Figure BDA00033212639400000412
wherein, pinewThe updated policy is represented by the policy data,
Figure BDA00033212639400000413
is shown in a state
Figure BDA00033212639400000414
Selecting actions from policy π
Figure BDA00033212639400000415
The probability of (a) of (b) being,
Figure BDA00033212639400000416
representing the function of the Lyapunov function,
Figure BDA00033212639400000417
representing a Lyapunov function constraint;
and introducing a Lagrange multiplier lambda, and then re-describing the strategy updating mode as follows:
Figure BDA00033212639400000418
the invention has the beneficial effects that:
the invention designs a security control strategy design algorithm based on deep reinforcement learning for an information physical system under the condition of virtual false data injection attack of an execution mechanism, describes a dynamic equation of the information physical system under the condition of attack as a Markov decision process, converts the security control problem of the information physical system under the condition of false data injection attack into a control strategy learning problem only using data based on the established Markov process, and provides a flexible operator-critical reinforcement learning algorithm based on a Lyapunov function based on a flexible operator-critical reinforcement learning algorithm framework.
In addition, the safety control algorithm designed by the invention has strong robustness to factors such as uncertainty, external disturbance and the like. Under the condition that network attacks are complex and changeable, the safety control algorithm designed by the invention has better control performance and a model is simpler.
Drawings
FIG. 1 is a control block diagram of an cyber-physical system in the presence of a spurious data injection attack in accordance with the present invention;
FIG. 2 is a diagram of a deep neural network architecture for approximate evaluation and action constructed in the present invention;
FIG. 3 is a schematic diagram of a learning algorithm offline training process;
wherein the simulator comprises a system of formula (4) and a reference model of formula (3);
FIG. 4 is a state response graph using a PD controller reference model;
FIG. 5 is a graph of the mean response of the state of the arm system under the action of a control strategy obtained based on learning, under different initial conditions of a non-attack situation;
FIG. 6 is a graph of the mean response of the state of the robot system under the PD control strategy under different initial conditions of a no-attack situation;
FIG. 7 is a graph of the mean response of the state of the robot arm under the control strategy based on learning under different initial conditions of constant value attack (case 1);
FIG. 8 is a graph of the mean response of the state of the manipulator under the action of the PD control strategy under different initial conditions of a constant attack (case 1);
FIG. 9 is a graph of the mean response of the state of the manipulator under the action of a learning-based control strategy under different initial conditions of a time-dependent attack (case 2);
FIG. 10 is a graph of the mean response of the state of the manipulator under the PD control strategy under different initial conditions of the time-dependent attack (case 2);
fig. 11 is a graph of the mean response of the state of the manipulator under the action of the control strategy obtained based on learning under different initial conditions for designing the optimal attack strategy (case 3);
fig. 12 is a graph of the mean response of the state of the manipulator under the PD control strategy under different initial conditions for designing an optimal attack strategy (case 3).
Detailed Description
First embodiment this embodiment will be described with reference to fig. 1. The method for controlling the safety of the cyber-physical system based on the deep reinforcement learning in the embodiment specifically comprises the following steps:
step one, establishing an information physical system model under the attack of false data injection of an actuator;
step two, describing the information physical system model under the condition that the false data of the actuator established in the step one is injected into the attack as a Markov decision process;
and step three, building a deep neural network, and outputting a decision strategy for the Markov decision process based on the built deep neural network.
The second embodiment is as follows: the difference between this embodiment and the first embodiment is that the specific process of the first step is as follows:
step one, under an ideal state, a dynamic equation of an information physical system model is as follows:
Figure BDA0003321263940000061
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003321263940000062
a state vector representing the cyber-physical system,
Figure BDA0003321263940000063
representing the real number field, nxRepresenting the dimensions of the state vector x,
Figure BDA0003321263940000064
representing the control signal to be designed, nuWhich represents the dimension of the control signal u,
Figure BDA0003321263940000065
is the first derivative of x, f (·) represents a generalized function map;
step two, discretizing the information physical system model in the step one based on an Euler method to obtain a discretized information physical system model:
x(k+1)=(f(x(k),u(k)))Δt+x(k)
wherein x (k) represents the state vector of the discretization information physical system at the time k, u (k) represents the control signal at the time k, Δ t represents the sampling period, and x (k +1) represents the state vector of the discretization information physical system at the time k + 1;
step three, the dynamic equation of the reference model without considering external disturbance, uncertainty and network attack behavior is as follows:
xr(k+1)=g(xr(k),ur(k))
wherein the content of the first and second substances,
Figure BDA0003321263940000066
representing the state vector of the reference model at time k, ur(k) Controller representing the reference model at time k, xr(k +1) represents the state vector of the reference model at the moment k +1, and g (-) represents the generalized function mapping;
step four, describing a control signal of the discretization information physical system model under the condition that false data is injected into the attack as follows:
Figure BDA0003321263940000067
wherein the content of the first and second substances,
Figure BDA0003321263940000068
indicating the control signal at time k after an attack, ua(k) Representing a spurious data injection attack signal at time k, wherein gamma represents an attack distribution matrix of the spurious data injection attack;
step one, the state vector of the reference model is used as part input of the control signal generated by the discretization information physical system model, and the control signal after being attacked is considered, so that the discretization information physical system model in the step two is described as the information physical system model under the attack of executor false data injection:
Figure BDA0003321263940000071
wherein the content of the first and second substances,
Figure BDA0003321263940000072
state vector x representing model to be referencedr(k) When the control signal is generated as part of the input of the discretized cyber-physical system model, the state vector of the discretized cyber-physical system model at the time k,
Figure BDA0003321263940000073
state vector x representing model to be referencedrAnd (k +1) when the control signal is partially input as the discretized cyber-physical system model, discretizing the state vector of the cyber-physical system model at the moment of k + 1.
Other steps and parameters are the same as those in the first embodiment.
The third concrete implementation mode: the first or second difference between this embodiment and the specific embodiment is that the attack distribution matrix Γ is a diagonal matrix, values of diagonal elements are all 0 or 1, if the ith actuator is attacked, the ith diagonal element (i.e., the element in the ith row and ith column of the diagonal matrix, corresponding to the ith actuator) of the attack distribution matrix Γ takes a value of 1, otherwise, the ith diagonal element takes a value of 0.
Other steps and parameters are the same as those in the first or second embodiment.
The fourth concrete implementation mode: the difference between this embodiment and one of the first to third embodiments is that the dummy data injection attack signal ua(k) Is a state independent attack signal, if ua(k) Is time-invariant, then ua(k) ω, where ω represents a constant value; if ua(k) Is time-varying, then
Figure BDA0003321263940000074
Where ω (k) represents a bounded time-varying function,
Figure BDA0003321263940000075
representing an upper bound on the attack signal.
Other steps and parameters are the same as those in one of the first to third embodiments.
The fifth concrete implementation mode: the difference between this embodiment and one of the first to the fourth embodiments is that the dummy data injection attack signal ua(k) Is a state-dependent attack signal, if ua(k) Is time-invariant, then ua(k) Is described as
Figure BDA0003321263940000076
Where ω denotes a constant value, if ua(k) Is time-varying, then ua(k) Is described as
Figure BDA0003321263940000077
Where ω (k) represents a bounded time-varying function.
Other steps and parameters are the same as in one of the first to fourth embodiments.
The sixth specific implementation mode: the difference between this embodiment and one of the first to fifth embodiments is that the markov decision process is described as follows:
Figure BDA0003321263940000078
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003321263940000081
representing the state of the markov decision process at time k,
Figure BDA0003321263940000082
representing the state of the markov decision process at time k +1,
Figure BDA0003321263940000083
is shown in the action
Figure BDA0003321263940000084
Under the action of the pressure in the system from
Figure BDA0003321263940000085
To
Figure BDA0003321263940000086
The transition probability of (2).
Other steps and parameters are the same as those in one of the first to fifth embodiments.
The seventh embodiment: the difference between this embodiment and the first to sixth specific embodiments is that, in step three, a decision strategy is output for a markov decision process based on a built deep neural network, and the specific process is as follows:
defining the action value function
Figure BDA0003321263940000087
Figure BDA0003321263940000088
Wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003321263940000089
for the cost function, γ ∈ [0,1) denotes the discount factor, α denotes the temperature parameter,
Figure BDA00033212639400000810
is an item of entropy of the information that,
Figure BDA00033212639400000811
is shown in a state
Figure BDA00033212639400000812
Selecting actions from policy π
Figure BDA00033212639400000813
The probability of (a) of (b) being,
Figure BDA00033212639400000814
which is indicative of a desired operation to be performed,
Figure BDA00033212639400000815
is the function value of the state value at the moment of k + 1;
the deep neural network obtains the optimal strategy pi by minimizing the action value function*
Figure BDA00033212639400000816
Wherein, pi represents a strategy set,
Figure BDA00033212639400000817
means for selecting from the strategy set pi to minimize the value of the action value function*
To the optimal strategy pi*Sampling to obtain optimal control signal u*(k),
Figure BDA00033212639400000818
Representing a Gaussian distribution,. sigma*Is the variance.
Other steps and parameters are the same as those in one of the first to sixth embodiments.
The specific implementation mode is eight: this embodiment is different from the first to seventh embodiments in that the information entropy item
Figure BDA00033212639400000819
The expression of (a) is:
Figure BDA00033212639400000820
wherein the content of the first and second substances,
Figure BDA00033212639400000821
is shown in a state
Figure BDA00033212639400000822
Selecting actions from policy π
Figure BDA00033212639400000823
The probability of (a) of (b) being,
Figure BDA00033212639400000824
representing the desired operation.
Other steps and parameters are the same as those in one of the first to seventh embodiments.
The specific implementation method nine: in this embodiment, the difference between the first embodiment and the eighth embodiment is that the cost function
Figure BDA0003321263940000091
Comprises the following steps:
Figure BDA0003321263940000092
where the superscript T represents transpose.
Other steps and parameters are the same as those in one to eight of the embodiments.
The specific implementation mode is ten: the difference between this embodiment and one of the first to ninth embodiments is that the deep neural network is a network built based on a multi-layer perceptron and having two hidden layers, and the built network is described as follows:
Figure BDA0003321263940000093
wherein, the first and the second end of the pipe are connected with each other,DNN w(z) denotes the output of the net, z denotes the input vector of the net, "1" denotes the offset, w0,w1And w2Representing a weight coefficient required to be trained by the network, wherein rho (-) is an activation function;
the deep neural network repeatedly carries out strategy evaluation and strategy updating during training, and the Lyapunov function constraint needs to be ensured during strategy updating, so that the strategy updating mode is as follows:
Figure BDA0003321263940000094
Figure BDA0003321263940000095
wherein, pinewThe updated policy is represented by the policy data,
Figure BDA0003321263940000096
is shown in a state
Figure BDA0003321263940000097
Selecting actions from policy π
Figure BDA0003321263940000098
The probability of (a) of (b) being,
Figure BDA0003321263940000099
representing the function of the Lyapunov function,
Figure BDA00033212639400000910
representing a Lyapunov function constraint;
and introducing a Lagrange multiplier lambda, and then re-describing the strategy updating mode as follows:
Figure BDA00033212639400000911
other steps and parameters are the same as those in one of the first to ninth embodiments.
The process of the present invention is described in detail below with reference to the accompanying drawings:
as shown in fig. 1, the physical processes in the cyber-physical system are distributed in different geographical locations from the control center. And the sensor-controller end and the controller-actuator end are used for information interaction through an open shared communication network. The malicious attacker can obtain system knowledge including system control targets, simple models of physical systems, and instrumented attack detection mechanisms through the monitoring system. An attacker constructs false data attacks through the knowledge and injects the malicious data into a control signal so as to destroy the system performance or realize other illegal purposes. The details of each module in fig. 1 are as follows:
1. description of physical System
The present invention contemplates a more general model of a physical system whose dynamic model is described by the following differential equations
Figure BDA0003321263940000101
Wherein
Figure BDA0003321263940000102
A vector of states of the system is represented,
Figure BDA0003321263940000103
representing the control signal to be designed. For the above system, the present invention assumes that it is controllable. In addition, no constraints are imposed on the system model.
In cyber-physical systems, computers and communication networks are heavily used. Therefore, it is necessary to discretize the system (1) and obtain a discrete signal. Using the Euler method, the discrete form of the system (1) can be described as
x(k+1)=(f(x(k),u(k)))Δt+x(k) (2)
Where Δ t > 0 represents the sampling period.
As can be seen from fig. 1, the present invention uses the state of a reference model for part of the inputs to the neural network that generate the control signals. Here, the reference model is a nominal model of the system (2). In the reference model, external disturbances, uncertainties and cyber-attack behavior are not considered. The reference model facilitates the design of the controller according to established maturity theories and ensures good control performance. For the reference model, its dynamic equations are described as follows:
xr(k+1)=g(xr(k),ur(k)) (3)
wherein
Figure BDA0003321263940000104
Representing the state of the reference model, ur(k) Indicating well-designed controlsAnd the controller ensures that the system has good performance.
2. Information physical system under network attack
The invention assumes that a malicious attacker attacks the communication network of the controller-actuator end. The attacker achieves its illegal purpose by constructing the dummy data and injecting it into the control signal u (k). Describing the control signal in the case of a spurious data injection attack as
Figure BDA0003321263940000105
Wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003321263940000106
representing the control signal after attack, ua(k) Representing a spurious data injection attack. Γ represents the attack distribution matrix of a spurious data injection attack. Here, Γ is a diagonal matrix whose diagonal elements take on values of 0 or 1. If the ith executor is attacked, the ith diagonal element of the matrix Γ takes on a value of 1. Otherwise, the diagonal element takes a value of 0. For attack signal ua(k) It is an unknown but bounded class of signals. Attack signals can be divided into two broad categories, namely state-independent attack signals and state-dependent attack signals.
1) State independent attack signal
If the attack signal ua(k) Is time-invariant, ua(k) ω, where ω represents a constant value. If u isa(k) Is a time-varying one of the two phases,
Figure BDA0003321263940000107
wherein
Figure BDA0003321263940000108
Representing an upper bound.
2) State dependent attacks
If the attack signal ua(k) Is time-invariant, and may be described as
Figure BDA0003321263940000111
Where ω represents a constant value. If the attack signal ua(k) Is time-varying, and the attack signal can be described as
Figure BDA0003321263940000112
Where ω (k) represents a bounded time-varying function.
Considering the control signal after attack, the system (2) can be described again
Figure BDA0003321263940000113
From the above system description, the object of the invention is to learn a strategy pi (control signal) for a system (4) using a deep reinforcement learning algorithm
Figure BDA0003321263940000114
Sampled from strategy pi) under which the system (4) can achieve mean square index stabilization as described in definition 1.
Definition 1 if
Figure BDA0003321263940000115
And
Figure BDA0003321263940000116
so that the following inequality holds for all k ≧ 0:
Figure BDA0003321263940000117
then state
Figure BDA0003321263940000118
The mean square index converges.
3. Markov decision process
The invention mainly designs a deep reinforcement learning algorithm and introduces an implementation mode of the reinforcement learning algorithm. The invention first describes the system (4) as a Markov decision process. Aiming at the Markov decision process, a flexible operator-critic learning algorithm based on the Lyapunov function is designed, and a decision strategy is output. The security controller is obtained by sampling the policy. On the basis, how to construct the deep neural network is introduced, and the designed reinforcement learning algorithm is realized through the deep neural network. Next, each part will be described in detail.
The Markov decision process consists of a quintuple, i.e.
Figure BDA0003321263940000119
Here, the number of the first and second electrodes,
Figure BDA00033212639400001110
the state space is represented by a representation of,
Figure BDA00033212639400001111
the space of motion is represented by a graph,
Figure BDA00033212639400001112
a distribution of the probability of the transition is represented,
Figure BDA00033212639400001113
representing the cost of control, and γ ∈ [0,1) representing a discount factor. As can be seen from the foregoing description, the control signal u (k) to be designed contains the reference model state xr(k) The information of (1). Thus, the markov decision process of the system (4) can be described as follows:
Figure BDA00033212639400001114
wherein
Figure BDA00033212639400001115
Representing the state of the markov decision process,
Figure BDA00033212639400001116
is shown in action
Figure BDA00033212639400001117
Under the action of the pressure in the system from
Figure BDA00033212639400001118
To
Figure BDA00033212639400001119
The transition probability of (2).
4. Reinforced learning algorithm
The invention designs a flexible operator-critic reinforcement learning algorithm for the Markov decision process described by the formula (5). Different from the existing reinforcement learning algorithm result, the reinforcement learning algorithm designed by the invention does not define a reward function, but defines a control cost function. Thus, the present invention minimizes control costs while maximizing the reward function, unlike prior efforts. For the reference model, the control quality is guaranteed by a well-designed controller. Therefore, if the algorithm of the present invention is designed to enable the system state
Figure BDA0003321263940000121
The exponent converges to xr(k) Then the stability of the system (4) can be guaranteed. Therefore, the invention will control the cost function
Figure BDA0003321263940000122
Defined in the form:
Figure BDA0003321263940000123
the goal of reinforcement learning is to find an optimal strategy that minimizes the following state-value function:
Figure BDA0003321263940000124
wherein
Figure BDA0003321263940000125
Pi tableShowing the strategy to be learned. In reinforcement learning, the strategy π is a Gaussian distribution.
Figure BDA0003321263940000126
Can be obtained by the following formula:
Figure BDA0003321263940000127
wherein
Figure BDA0003321263940000128
Represents a mean value of
Figure BDA0003321263940000129
Variance is σ2A gaussian distribution of (a).
Figure BDA00033212639400001210
Is shown in a state
Figure BDA00033212639400001211
Selecting actions from policy π
Figure BDA00033212639400001212
The probability of (c).
In the reinforcement learning algorithm, to obtain the optimal strategy pi, the actual minimization in the training process is the action value function (also called Q-function)
Figure BDA00033212639400001213
According to (6), the Q-function can be defined as:
Figure BDA00033212639400001214
wherein
Figure BDA00033212639400001215
Is shown in
Figure BDA00033212639400001220
The desired operation is performed on the distribution of (a).
In order to ensure that the algorithm can fully explore unknown action space and avoid trapping in a local solution, an information entropy item can be added to the Q-function (8). Further, the Q-function (8) considering the entropy terms of information can be described as:
Figure BDA00033212639400001216
where α represents a temperature parameter that can be used to adjust the relative importance of the increasing entropy term.
Figure BDA00033212639400001217
Is an added information entropy term, and the expression is as follows:
Figure BDA00033212639400001218
according to the above description, the reinforcement learning algorithm designed by the present invention needs to solve the following problems:
Figure BDA00033212639400001219
where Π represents a set of policies.
By solving (10), the optimal strategy pi can be obtained*. Is available based on the policy
Figure BDA0003321263940000131
Equation (10) is solved by a reinforcement learning algorithm. When the algorithm is finished, the variance
Figure BDA0003321263940000132
Slave strategy pi*Control signal u obtained by sampling*(k) Is a definite mean value and is no longer affected by the exploration noise.
In the training of the reinforcement learning algorithm, two steps, namely strategy evaluation and strategy, need to be repeatedly executedAnd (4) improving. In strategy evaluation, a Bellman backtracking (backup) operator is used
Figure BDA00033212639400001312
Q-functions (9) acting on increasing entropy terms of information, i.e.
Figure BDA0003321263940000133
When the strategy is improved, the strategy is updated according to the following equation
Figure BDA0003321263940000134
Wherein the content of the first and second substances,
Figure BDA0003321263940000135
indicating a state
Figure BDA0003321263940000136
Candidate strategy of (2), pioldThe policy of the last step of update is represented,
Figure BDA0003321263940000137
representing a policy noldThe corresponding Q value of the signal is obtained,
Figure BDA0003321263940000138
representing the Kullback-Leibler divergence,
Figure BDA0003321263940000139
indicating the normalization factor. Further, the formula (12) can be described again by the mathematical calculation:
Figure BDA00033212639400001310
5. deep neural network approximation
It is clear that the subject of the invention belongs to a system with spatial continuity of motion. The traditional Q-table based algorithm implementation is no longer applicable. In order to realize the learning algorithm, the deep neural network is built on the basis of the fully-connected multilayer sensing mechanism, and the Q function and the action are approximated by the deep neural network.
In the constructed deep neural network, a ReLU function is generally selected as an activation function of the neural network, and is defined as ρ (z) ═ max { z,0 }. To clearly show how to build a deep neural network using a multi-layer sensing mechanism, the present invention provides an example of a deep neural network with two hidden layers. For the ReLU activation function, when z represents a vector, i.e., z ═ z1,...,,zn]T,ρ(z)=[ρ(z1),...,ρ(zn)]T. Thus, a deep neural network with two hidden layers can be described as
Figure BDA00033212639400001311
Wherein [ z ]T,1]T"1" denotes bias, w ═ w { (w)0,w1,w2In which w0,w1And w2And weight coefficients representing the neural network to be trained.
In the present invention, the constructed deep neural network will be used to approximate "critic" respectively "
Figure BDA0003321263940000141
And "actor"
Figure BDA0003321263940000142
Theta and phi are used for parameterization
Figure BDA0003321263940000143
And
Figure BDA0003321263940000144
is a parameter of
Figure BDA0003321263940000145
Fig. 2 depicts the corresponding neural network structure.
One of the design goals of the invention is to ensure the stability of the cyber-physical system under the action of learning to obtain a control strategy. The lyapunov theory is widely used to analyze the stability of systems. In the learning process, the invention combines the Q-function
Figure BDA0003321263940000146
Viewed as the Lyapunov function
Figure BDA0003321263940000147
Then evaluating the neural network would be used to approximate the lyapunov function
Figure BDA0003321263940000148
If the invention can learn a Lyapunov function and make the Lyapunov function meet the constraint of the system stability condition, the stability of the information physical system can be ensured. In the following, the Lyapunov function will be used
Figure BDA0003321263940000149
Replacement of Q-function
Figure BDA00033212639400001410
Figure BDA00033212639400001411
Representing a parameterized Lyapunov function whose neural network approximation is described in the form of
Figure BDA00033212639400001412
For an action neural network, its output consists of two parts. Safety control signal u, part of which is parameterizedφ(k) The other part is variance
Figure BDA00033212639400001413
According to the formula (7)
Figure BDA00033212639400001414
6. Implementation of flexible operator-critic deep reinforcement learning control algorithm based on Lyapunov function
The above description has introduced the design of reinforcement learning algorithm, the construction of deep neural network, and the definition of lyapunov function in the training process. The invention mainly discusses how to realize the flexible operator-critical deep reinforcement learning algorithm with Lyapunov function constraint. Fig. 3 describes the training process of the algorithm. As shown in FIG. 3, the reference model (3) and the system (4) are used to generate training data and store and replay the generated data in a storage unit
Figure BDA00033212639400001415
In (1). By replaying storage units from
Figure BDA00033212639400001416
And (5) randomly sampling training samples, and repeatedly executing strategy evaluation and strategy improvement. Then, from the updated policy
Figure BDA00033212639400001417
The sampling control signal is used by the system (4) to generate new training data.
At the time of policy evaluation, the policy
Figure BDA00033212639400001418
The following bellman residual equation should be minimized:
Figure BDA00033212639400001419
wherein the content of the first and second substances,
Figure BDA00033212639400001420
representing slave memory cells
Figure BDA00033212639400001421
Middle random sampling data
Figure BDA00033212639400001422
Figure BDA00033212639400001423
Wherein
Figure BDA0003321263940000151
Representing the target parameter. Here, the number of the first and second electrodes,
Figure BDA0003321263940000152
and
Figure BDA0003321263940000153
has the same neural network structure. But parameters
Figure BDA0003321263940000154
And updating according to an exponential approach mode.
During training, residual equations are minimized using an ADAM optimizer. As to the size of-
Figure BDA0003321263940000155
I training sample data, the random gradient of equation (15) is:
Figure BDA0003321263940000156
one of the contribution points of the invention is to solve the problem of ensuring the stability of a system based on a deep reinforcement learning algorithm. This is also the biggest difference from the existing deep reinforcement learning algorithm. In order to ensure the stability of the system, the constraint condition of the Lyapunov function when the system is stable should be satisfied. Therefore, it is necessary to guarantee the lyapunov function constraint in updating the strategy, that is, to apply the lyapunov function constraint to the optimization problem (13). Further, the optimization problem in equation (13) can be re-described as:
Figure BDA0003321263940000157
wherein
Figure BDA0003321263940000158
Beta is a weight coefficient, and beta is a weight coefficient,
Figure BDA0003321263940000159
representing the lyapunov function to be learned with theta as a parameter.
Obviously, the above optimization problem is a constrained optimization problem. In order to solve the optimization problem with constraints, the invention introduces a Lagrange multiplier. Equation (16) is further described as:
Figure BDA00033212639400001510
where λ represents the langerhan multiplier.
The goal of the strategy improvement then becomes to minimize the following equation:
Figure BDA00033212639400001511
the above formula is equivalent to:
Figure BDA00033212639400001512
the random gradient of the above equation for the parameter φ is:
Figure BDA00033212639400001513
wherein:
Figure BDA00033212639400001514
the temperature parameter α will be updated by minimizing the following equation:
Figure BDA0003321263940000161
wherein
Figure BDA0003321263940000162
Representing the target entropy.
For the lagrange multiplier λ, the update is done by maximizing the following equation:
Figure BDA0003321263940000163
in light of the above discussion, the algorithm in Table 1 describes pseudo code implementing a Lyapunov function-based flexible operator-critical reinforcement learning control algorithm. By using the algorithm in Table 1, an optimal strategy π can be obtained that satisfies the Gaussian distribution*. In the reasoning process, i.e. in practical applications, the mean value
Figure BDA0003321263940000165
Will be used to control the cyber-physical system in the event of an actuator spurious data injection attack.
TABLE 1
Figure BDA0003321263940000164
Simulation result
The invention provides a simulation result for explaining the effectiveness and superiority of the data-based safety control algorithm. In the simulation, the invention utilizes Python 3.6 and TensorFlow 1.15 to realize the algorithm in the table 1 to obtain the control strategy. The details of the simulation and the results are presented below.
Assuming that the physical device in FIG. 1 is a single joint robotic arm, its dynamics can be described by the following differential equations:
Figure BDA0003321263940000171
where theta represents the angular position of the robotic arm,
Figure BDA0003321263940000172
the angular velocity of the mechanical arm is shown, and the gravity acceleration parameter g is 9.8m/s2The arm length L of the arm is 0.5M, M is 10kg indicating the load weight of the arm, and J is 10kg · M2Representing moment of inertia, d (t) ± 2.2 sin (t) representing an uncertain but bounded coefficient of friction, and u (t) representing a control input.
Definition of
Figure BDA0003321263940000173
The dynamic equation for the arm can then be expressed in the form:
Figure BDA0003321263940000174
wherein
Figure BDA0003321263940000175
The reference model of the robot arm model may be obtained by linearizing the nonlinear model. For the linearized model, a linear quadratic optimal control law or a PD control law can be designed as the input of the reference model. In the present simulation example, the PD control law was selected and designed. Because the operating personnel remotely operate the mechanical arm, malicious attackers can destroy the integrity of the control signal by injecting false attack signals into the communication network, and further realize the stability of the state of the mechanical arm. Then, a safety control law is obtained through learning by adopting the safety control algorithm based on deep reinforcement learning designed by the invention, and the mechanical arm is ensured to work according to expected performance under the condition of malicious attack. In the subsequent simulation, the sampling period Δ t is 0.2 s.
Supposing that a malicious attacker invades the information layer, false data attack signals are injected into the control signals to destroy the running state of the mechanical arm. In the simulation example, the control target is to obtain a control strategy by using a learning algorithm to ensure that the mechanical arm can ensure a stable state under the attack behavior. In the simulation, it is assumed that the sampling period Δ t is 0.2 s.
In the simulation, the reference model PD control law is designed as
Figure BDA0003321263940000176
Wherein Kp=20,Kd=25。
Fig. 4 depicts a state variable response curve of a reference model when a PD controller is used. It can be seen from the curve change in the figure that the designed PD controller can effectively ensure the stability of the attitude of the mechanical arm system. In the following, the parameters and structure of the deep neural network to be trained will be given first. The neural network will be trained using the algorithm in table 1. Furthermore, a strategy obtained based on training can ensure that the mechanical arm is stable under the condition of no attack, and the control performance of the mechanical arm can be the same as that of the mechanical arm shown in the figure 4.
In the training process, 10 strategies will be trained. The best performing one of these 10 strategies is then selected for controlling the arm system. Table 2 gives the values of the neural network structure and the related parameters in the training. The initial value of the angle of the arm system is randomly selected from the interval of [ -2, 2] rad/s, and the initial value of the angular velocity is randomly selected from the interval of [ -1, 1] rad/s.
TABLE 2 deep neural network training parameters
Figure BDA0003321263940000181
After training is completed, the best one of the obtained strategies is selected. Then, the initial conditions were randomly chosen to run 10 times of simulations under a non-attack situation using this strategy. Fig. 5 shows the mean state variable response for 10 random initial conditions under the action of the strategy obtained by learning. FIG. 6 shows the mean values of the robot arm state responses for 10 random initial conditions under the action of the PD controller. As shown in fig. 5, the strategy obtained by learning can well achieve the goal of controlling the state of the robot arm in a non-attack situation. Since the false data attacks can be roughly classified into three categories, namely constant value attacks (bias attacks), time-dependent attacks and designed optimal attacks. Among the three types of false data injection attacks, the designed optimal attack is most concealed. An attacker constructs a false data attack signal by using system knowledge, so that the designed attack signal can not trigger an attack detection mechanism, and further the system performance is seriously damaged.
Next, the validity of the security control algorithm designed by the present invention will be verified for three different classes of false data injection attack scenarios. In addition, in order to illustrate the superiority of the control algorithm designed by the invention, the invention is compared with the classical PD control algorithm. In subsequent simulation graphs, the shaded areas are obtained by filling in the response mean minus and plus the standard deviation.
Case 1: constant value attack
In this case, it is assumed that the attacker designs the attack signal to be constant 5. The attacker starts to apply the spurious data injection attack from the initial moment. As with the above simulation, 10 initial conditions were randomly selected for simulation in this case. Fig. 7 and 8 depict the mean response curves of the mechanical arm states under the action of the learning-based control strategy and the PD control strategy, respectively. As can be seen from the simulation results, the control strategy based on learning designed by the invention can effectively relieve the attack behavior.
Case 2: time dependent attacks
In this case, assume that an attacker performs a dummy data injection attack signal of the form as follows starting from k ≧ 0
ua(k)=2cos(k)sin(k)+4
Fig. 9 shows a mean response curve of the state of the robot system under the action of the learning-based control strategy under the time-dependent attack. Fig. 10 depicts a mean response curve of the state of the robot system under the action of the PD control strategy under the time-dependent attack. As shown in fig. 10, the time-dependent attack has changed the position of the robotic arm. Despite the strong robustness of the PD controller, the position of the robotic arm is not restored. According to the comparison result, the control strategy provided by the invention can effectively relieve the time-dependent false data injection attack.
Case 3: designed optimal attack strategy
In this case, the attacker destroys the system using the optimal attack strategy. The attacker utilizes the system knowledge to construct the dummy data injection attack signal. Suppose that the attacker designs an attack signal of
Figure BDA0003321263940000191
Wherein KaA matrix of attack weights is represented that is,
Ka=[613]
fig. 11 and 12 depict the corresponding simulation results. Where fig. 11 depicts the mean response curve for the state of the robotic arm system under the influence of the learning-based control strategy, and fig. 12 depicts the mean response curve for the state of the robotic arm system under the influence of the PD controller. Obviously, the PD controller does not protect the system well and the system stability is severely compromised in such attack scenarios. However, the control strategy based on learning designed by the invention can compensate the influence of attack and ensure the safe operation of the mechanical arm.
According to the simulation results of the three situations, the learning-based safety control strategy designed by the invention is superior to the well-designed PD control strategy. By virtue of the powerful advantages of deep neural networks, the learning-based security control strategy is capable of handling different types of spurious data injection attacks.
The above-described calculation examples of the present invention are merely to explain the calculation model and the calculation flow of the present invention in detail, and are not intended to limit the embodiments of the present invention. It will be apparent to those skilled in the art that other variations and modifications of the present invention can be made based on the above description, and it is not intended to be exhaustive or to limit the invention to the precise form disclosed, and all such modifications and variations are possible and contemplated as falling within the scope of the invention.

Claims (9)

1. An information physical system security control method based on deep reinforcement learning is characterized by specifically comprising the following steps:
step one, establishing an information physical system model under the attack of false data injection of an actuator; the specific process comprises the following steps:
step one, under an ideal state, a dynamic equation of an information physical system model is as follows:
Figure FDA0003561199330000011
wherein the content of the first and second substances,
Figure FDA0003561199330000012
a state vector representing the physical system of the information,
Figure FDA00035611993300000110
representing the real number field, nxRepresenting the dimensions of the state vector x,
Figure FDA0003561199330000013
representing the control signal to be designed, nuWhich represents the dimension of the control signal u,
Figure FDA0003561199330000014
f (-) represents a generalized function map for the first derivative of x;
step two, discretizing the information physical system model in the step one based on an Euler method to obtain a discretized information physical system model:
x(k+1)=(f(x(k),u(k)))Δt+x(k)
wherein x (k) represents the state vector of the discretization information physical system at the time k, u (k) represents the control signal at the time k, Δ t represents the sampling period, and x (k +1) represents the state vector of the discretization information physical system at the time k + 1;
step three, the dynamic equation of the reference model without considering external disturbance, uncertainty and network attack behavior is as follows:
xr(k+1)=g(xr(k),ur(k))
wherein the content of the first and second substances,
Figure FDA0003561199330000015
representing the state vector of the reference model at time k, ur(k) Controller representing the reference model at time k, xr(k +1) represents the state vector of the reference model at the moment k +1, and g (-) represents the generalized function mapping;
step four, describing a control signal of the discretization information physical system model under the condition that false data is injected into the attack as follows:
Figure FDA0003561199330000016
wherein the content of the first and second substances,
Figure FDA0003561199330000017
control signal representing the time k after attack, ua(k) Representing a spurious data injection attack signal at time k, wherein gamma represents an attack distribution matrix of the spurious data injection attack;
step one five, the state vector of the reference model is used as part input of the control signal generated by the discretization information physical system model, and the control signal after being attacked is considered, so that the discretization information physical system model in the step two is re-described as the information physical system model under the executor false data injection attack:
Figure FDA0003561199330000018
wherein the content of the first and second substances,
Figure FDA0003561199330000019
state vector x representing model to be referencedr(k) When the control signal is generated as part of the input of the discretized cyber-physical system model, the state vector of the discretized cyber-physical system model at the time k,
Figure FDA0003561199330000021
state vector x representing model to be referencedr(k +1) when the control signal is input as part of the discretization information physical system model, the state vector of the discretization information physical system model at the k +1 moment;
step two, describing the information physical system model under the attack of injecting the actuator false data established in the step one into a Markov decision process;
and step three, building a deep neural network, and outputting a decision strategy for the Markov decision process based on the built deep neural network.
2. The cyber-physical system security control method based on the deep reinforcement learning of claim 1, wherein the attack distribution matrix Γ is a diagonal matrix, values of diagonal elements are all 0 or 1, if an ith actuator is attacked, an ith diagonal element of the attack distribution matrix Γ takes a value of 1, otherwise, the ith diagonal element takes a value of 0.
3. The cyber-physical system security control method based on deep reinforcement learning of claim 2, wherein the dummy data injection attack signal u is a signal of the false data injection attacka(k) Is a state independent attack signal, if ua(k) Is time-invariant, then ua(k) ω, where ω represents a constant value; if ua(k) Is time-varying, then
Figure FDA0003561199330000022
WhereinAnd ω (k) represents a bounded time-varying function,
Figure FDA0003561199330000023
representing an upper bound on the attack signal.
4. The cyber-physical system security control method based on deep reinforcement learning of claim 3, wherein the false data injection attack signal u is a false data injection attack signala(k) Is a state-dependent attack signal, if ua(k) Is time-invariant, then ua(k) Is described as
Figure FDA0003561199330000024
Where ω denotes a constant value, if ua(k) Is time-varying, then ua(k) Is described as
Figure FDA0003561199330000025
Where ω (k) represents a bounded time-varying function.
5. The cyber-physical system security control method based on the deep reinforcement learning of claim 4, wherein the Markov decision process is described as follows:
Figure FDA0003561199330000026
wherein the content of the first and second substances,
Figure FDA0003561199330000027
representing the state of the markov decision process at time k,
Figure FDA0003561199330000028
representing the state of the markov decision process at time k +1,
Figure FDA0003561199330000029
is shown in the action
Figure FDA00035611993300000210
Under the action of the pressure in the system from
Figure FDA00035611993300000211
To
Figure FDA00035611993300000212
The transition probability of (2).
6. The cyber-physical system safety control method based on the deep reinforcement learning according to claim 5, wherein in the third step, a decision strategy is output for a Markov decision process based on a built deep neural network, and the specific process is as follows:
defining the action value function
Figure FDA0003561199330000031
Figure FDA0003561199330000032
Wherein the content of the first and second substances,
Figure FDA00035611993300000319
for the cost function, γ ∈ [0,1) denotes the discount factor, α denotes the temperature parameter,
Figure FDA0003561199330000033
is an item of entropy of the information that,
Figure FDA0003561199330000034
is shown in a state
Figure FDA0003561199330000035
Selecting actions from policy π
Figure FDA00035611993300000318
The probability of (a) of (b) being,
Figure FDA0003561199330000036
which is indicative of a desired operation to be performed,
Figure FDA0003561199330000037
is the function value of the state value at the moment of k + 1;
the deep neural network obtains an optimal strategy pi by minimizing an action value function*
Figure FDA0003561199330000038
Wherein Π represents a set of policies;
for optimal strategy pi*Sampling to obtain optimal control signal u*(k),
Figure FDA0003561199330000039
Figure FDA00035611993300000320
Represents a Gaussian distribution, σ*Is the variance.
7. The cyber-physical system security control method based on deep reinforcement learning of claim 6, wherein the information entropy item
Figure FDA00035611993300000310
The expression of (a) is:
Figure FDA00035611993300000311
wherein the content of the first and second substances,
Figure FDA00035611993300000312
is shown in a state
Figure FDA00035611993300000313
Selecting actions from policy π
Figure FDA00035611993300000314
The probability of (a) of (b) being,
Figure FDA00035611993300000315
representing the desired operation.
8. The cyber-physical system security control method based on deep reinforcement learning of claim 7, wherein the cost function
Figure FDA00035611993300000321
Comprises the following steps:
Figure FDA00035611993300000316
where the superscript T represents transpose.
9. The cyber-physical system security control method based on deep reinforcement learning of claim 8, wherein the deep neural network is a network built based on a multi-layer perceptron and having two hidden layers, and the built network is described as follows:
Figure FDA00035611993300000317
wherein the content of the first and second substances,DNN w(z) denotes the output of the net, z denotes the input vector of the net, "1" denotes the offset, w0,w1And w2Representing a weight coefficient required to be trained by the network, wherein rho (·) is an activation function;
the deep neural network repeatedly carries out strategy evaluation and strategy updating during training, and the Lyapunov function constraint needs to be ensured during strategy updating, so that the strategy updating mode is as follows:
Figure FDA0003561199330000041
Figure FDA0003561199330000048
wherein, pinewThe updated policy is represented by the policy data,
Figure FDA0003561199330000042
is shown in a state
Figure FDA0003561199330000043
Selecting actions from policy π
Figure FDA0003561199330000044
The probability of (a) of (b) being,
Figure FDA0003561199330000045
representing the function of the Lyapunov function,
Figure FDA0003561199330000046
representing a Lyapunov function constraint;
and introducing a Lagrange multiplier lambda, and then re-describing the strategy updating mode as follows:
Figure FDA0003561199330000047
CN202111247290.6A 2021-10-26 2021-10-26 Information physical system safety control method based on deep reinforcement learning Active CN113885330B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111247290.6A CN113885330B (en) 2021-10-26 2021-10-26 Information physical system safety control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111247290.6A CN113885330B (en) 2021-10-26 2021-10-26 Information physical system safety control method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN113885330A CN113885330A (en) 2022-01-04
CN113885330B true CN113885330B (en) 2022-06-17

Family

ID=79014455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111247290.6A Active CN113885330B (en) 2021-10-26 2021-10-26 Information physical system safety control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113885330B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114995137B (en) * 2022-06-01 2023-04-28 哈尔滨工业大学 Rope-driven parallel robot control method based on deep reinforcement learning
CN116460860B (en) * 2023-06-19 2023-10-20 中国科学技术大学 Model-based robot offline reinforcement learning control method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN109377218A (en) * 2018-09-20 2019-02-22 北京邮电大学 A kind of method, server and the mobile terminal of the false perception attack of containment
CN109375514A (en) * 2018-11-30 2019-02-22 沈阳航空航天大学 A kind of optimal track control device design method when the injection attacks there are false data
CN110928189A (en) * 2019-12-10 2020-03-27 中山大学 Robust control method based on reinforcement learning and Lyapunov function
WO2020187591A1 (en) * 2019-03-18 2020-09-24 Robert Bosch Gmbh Method and device for controlling a robot
CN112614009A (en) * 2020-12-07 2021-04-06 国网四川省电力公司电力科学研究院 Power grid energy management method and system based on deep expected Q-learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108040073A (en) * 2018-01-23 2018-05-15 杭州电子科技大学 Malicious attack detection method based on deep learning in information physical traffic system
GB2610938B (en) * 2018-10-30 2023-09-06 Motional Ad Llc Redundancy in autonomous vehicles
CN110782011B (en) * 2019-10-21 2023-11-24 辽宁石油化工大学 Distributed optimization control method of networked multi-agent system based on reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN109377218A (en) * 2018-09-20 2019-02-22 北京邮电大学 A kind of method, server and the mobile terminal of the false perception attack of containment
CN109375514A (en) * 2018-11-30 2019-02-22 沈阳航空航天大学 A kind of optimal track control device design method when the injection attacks there are false data
WO2020187591A1 (en) * 2019-03-18 2020-09-24 Robert Bosch Gmbh Method and device for controlling a robot
CN110928189A (en) * 2019-12-10 2020-03-27 中山大学 Robust control method based on reinforcement learning and Lyapunov function
CN112614009A (en) * 2020-12-07 2021-04-06 国网四川省电力公司电力科学研究院 Power grid energy management method and system based on deep expected Q-learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Detection and Isolation of False Data Injection Attacks in Smart Grid via Unknown Input Interval Observer;Xinyu Wang 等;《IEEE INTERNET OF THINGS JOURNAL》;20200430;第7卷(第4期);第3214-3229页 *
基于双马尔科夫链的新型能源互联网虚假数据注入攻击检测;杨杉 等;《电力自动化设备》;20210228;第41卷(第2期);第131-136页 *

Also Published As

Publication number Publication date
CN113885330A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
Viera-Martin et al. Artificial neural networks: a practical review of applications involving fractional calculus
CN113885330B (en) Information physical system safety control method based on deep reinforcement learning
Han et al. Reinforcement learning control of constrained dynamic systems with uniformly ultimate boundedness stability guarantee
Hou et al. Data-efficient hierarchical reinforcement learning for robotic assembly control applications
Wu et al. Deep reinforcement learning control approach to mitigating actuator attacks
Zhou et al. Event-triggered practical prescribed time output feedback neuroadaptive tracking control under saturated actuation
Zhang et al. Adaptive neural network control of coordinated robotic manipulators with output constraint
Ma et al. Improved adaptive fuzzy output-feedback dynamic surface control of nonlinear systems with unknown dead-zone output
Zheng et al. Parameter identification of nonlinear dynamic systems using an improved particle swarm optimization
Yu et al. Trustworthiness modeling and analysis of cyber-physical manufacturing systems
Li et al. Secure finite-horizon consensus control of multiagent systems against cyber attacks
Han et al. Local design of distributed H∞‐consensus filtering over sensor networks under multiplicative noises and deception attacks
Kang et al. Tracking of uncertain robotic manipulators using event-triggered model predictive control with learning terminal cost
Soorki et al. Adaptive robust control of fractional‐order swarm systems in the presence of model uncertainties and external disturbances
Zhao et al. Fuzzy adaptive secure tracking control against unknown false data injection attacks for uncertain nonlinear systems with input quantization
Zhao et al. Goal representation adaptive critic design for discrete-time uncertain systems subjected to input constraints: The event-triggered case
Song et al. A comprehensive evaluation and comparison of enhanced learning methods
Liu et al. Adaptive neural network control for nonlinear cyber-physical systems subject to false data injection attacks with prescribed performance
Hsu et al. Indirect adaptive self-organizing RBF neural controller design with a dynamical training approach
Banerjee et al. A survey on physics informed reinforcement learning: Review and open problems
Jiang et al. Adaptive sliding mode security control of wheeled mobile manipulators with Markov switching joints against adversarial attacks
Li et al. Stabilization of semi-markovian jumping uncertain complex-valued networks with time-varying delay: A sliding-mode control approach
Wan et al. Non-fragile dissipative filtering of cyber–physical systems with random sensor delays
Cui et al. Resilient output containment control of heterogeneous multiagent systems against composite attacks: A digital twin approach
Cheng Adaptive fuzzy CMAC-based nonlinear control with dynamic memory architecture

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant