CN113885330B

CN113885330B - Information physical system safety control method based on deep reinforcement learning

Info

Publication number: CN113885330B
Application number: CN202111247290.6A
Authority: CN
Inventors: 吴承伟; 柴庆杰; 刘健行; 孙光辉; 吴立刚
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2021-10-26
Filing date: 2021-10-26
Publication date: 2022-06-17
Anticipated expiration: 2041-10-26
Also published as: CN113885330A

Abstract

An information physical system security control method based on deep reinforcement learning belongs to the technical field of information security. The invention solves the problem of poor control performance of the safety control strategy designed based on the existing method under the condition of network attack. The invention describes an information physical system dynamic equation under an attacked condition as a Markov decision process, converts the information physical system safety control problem under the attacked condition of false data injection into a control strategy learning problem only using data based on the established Markov process, and provides a flexible operator-critic reinforcement learning algorithm based on a Lyapunov function and a novel deep neural network training frame based on a flexible operator-critic reinforcement learning algorithm frame. The invention can be applied to the safety control of an information physical system.

Description

Information physical system safety control method based on deep reinforcement learning

Technical Field

The invention belongs to the technical field of information safety, and particularly relates to an information physical system safety control method based on deep reinforcement learning.

Background

The information physical system integrates and develops the existing technologies of communication, wireless network, distribution, artificial intelligence and the like, constructs the mutual mapping, timely interaction and efficient cooperation of human, machine, object, environment, information and other elements in the physical space and the information space, and becomes a new generation intelligent system integrating calculation, communication and control. At the heart of smart manufacturing, the development of cyber-physical systems has been receiving attention from countries around the world. In recent years, network attack events occur frequently, great threats are caused to national security, economic development, infrastructure security and people's lives and properties, and how to ensure the security of an information physical system becomes a great problem to be solved urgently. Although the research on the security problem of the cyber-physical system is continuously breaking through, the shortage of the cyber-physical system security design relying on only a single discipline knowledge, such as a control discipline, a computer discipline or a network cyber-security discipline technology, is increasingly highlighted. Aiming at network attack, the existing method still has the problems of complex model of the information physical system, difficult modeling, difficult design of a security control strategy, poor control performance under the security control strategy and the like, so how to solve the security problem of the information physical system by fusing multi-disciplinary technical means becomes an important problem to be solved urgently.

Deep reinforcement learning techniques have been widely used in the fields of robotics and games. The technology can realize intelligent control scheme design based on data completely. The cyber-physical system is an intelligent system framework, and how to design an intelligent safety control strategy based on a deep reinforcement learning technology has important scientific significance.

Disclosure of Invention

The invention aims to solve the problem that the control performance of a safety control strategy designed based on the existing method is poor under the condition of network attack, and provides an information physical system safety control method based on deep reinforcement learning.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a deep reinforcement learning-based cyber-physical system security control method specifically comprises the following steps:

step one, establishing an information physical system model under the attack of false data injection of an actuator;

step two, describing the information physical system model under the attack of injecting the actuator false data established in the step one into a Markov decision process;

and step three, building a deep neural network, and outputting a decision strategy for the Markov decision process based on the built deep neural network.

Further, the specific process of the first step is as follows:

step one, under an ideal state, the dynamic equation of the information physical system model is as follows:

wherein the content of the first and second substances,

a state vector representing the physical system of the information,

representing the real number field, n_xRepresenting the dimensions of the state vector x,

representing the control signal to be designed, n_uWhich represents the dimension of the control signal u,

f (-) represents a generalized function map for the first derivative of x;

step two, discretizing the information physical system model in the step one based on an Euler method to obtain a discretized information physical system model:

x(k+1)＝(f(x(k),u(k)))Δt+x(k)

wherein x (k) represents the state vector of the discretization information physical system at the time k, u (k) represents the control signal at the time k, Δ t represents the sampling period, and x (k +1) represents the state vector of the discretization information physical system at the time k + 1;

step three, the dynamic equation of the reference model without considering external disturbance, uncertainty and network attack behavior is as follows:

x_r(k+1)＝g(x_r(k),u_r(k))

wherein the content of the first and second substances,

representing the state vector of the reference model at time k, u_r(k) Controller representing the reference model at time k, x_r(k +1) represents the state vector of the reference model at the moment k +1, and g (-) represents the generalized function mapping;

step four, describing a control signal of the discretization information physical system model under the condition that false data is injected into the attack as follows:

wherein, the first and the second end of the pipe are connected with each other,

indicating the control signal at time k after an attack, u_a(k) Representing a spurious data injection attack signal at time k, wherein gamma represents an attack distribution matrix of the spurious data injection attack;

step one, the state vector of the reference model is used as part input of the control signal generated by the discretization information physical system model, and the control signal after being attacked is considered, so that the discretization information physical system model in the step two is described as the information physical system model under the attack of executor false data injection:

wherein the content of the first and second substances,

state vector x representing model to be referenced_r(k) When the control signal is generated as part of the input of the discretized cyber-physical system model, the state vector of the discretized cyber-physical system model at the time k,

state vector x representing model to be referenced_r(k +1) generating control as discretized information physical system modelWhen a signal is partially input, the state vector of the information physical system model at the time k +1 is discretized.

Further, the attack distribution matrix Γ is a diagonal matrix, values of diagonal elements are all 0 or 1, if an ith actuator is attacked, an ith diagonal element of the attack distribution matrix Γ takes a value of 1, otherwise, an ith diagonal element takes a value of 0.

Further, the spurious data injection attack signal u_a(k) Is a state independent attack signal, if u_a(k) Is time-invariant, then u_a(k) ω, where ω represents a constant value; if u is_a(k) Is time-varying, then

Where ω (k) represents a bounded time-varying function,

representing an upper bound on the attack signal.

Further, the spurious data injection attack signal u_a(k) Is a state-dependent attack signal, if u_a(k) Is time-invariant, then u_a(k) Is described as

Where ω denotes a constant value, if u_a(k) Is time-varying, then u_a(k) Is described as

Where ω (k) represents a bounded time-varying function.

Further, the markov decision process is described as follows:

wherein the content of the first and second substances,

markov decision-making process for representing k timeThe state of the program is that,

representing the state of the markov decision process at time k +1,

is shown in the action

Under the action of (2) system state from

To

The transition probability of (2).

Further, in the third step, a decision strategy is output for the markov decision process based on the built deep neural network, and the specific process is as follows:

defining the action value function

Wherein the content of the first and second substances,

for the cost function, γ ∈ [0,1) denotes the discount factor, α denotes the temperature parameter,

is an item of entropy of the information that,

is shown in a state

From strategy piSelection actions

The probability of (a) of (b) being,

which is indicative of a desired operation to be performed,

is the function value of the state value at the moment of k + 1;

the deep neural network obtains the optimal strategy pi by minimizing the action value function^*：

Wherein Π represents a set of policies;

for optimal strategy pi^*Sampling to obtain optimal control signal u^*(k)，

Representing a Gaussian distribution,. sigma^*Is the variance.

Further, the information entropy item

The expression of (c) is:

wherein the content of the first and second substances,

is shown in a state

Selecting actions from policy π

Am ofThe ratio of the total weight of the particles,

representing the desired operation.

Further, the cost function

Comprises the following steps:

where the superscript T represents transpose.

Further, the deep neural network is a network built based on a multi-layer perceptron and provided with two hidden layers, and the built network is described as follows:

wherein, the first and the second end of the pipe are connected with each other,DNN _w(z) denotes the output of the net, z denotes the input vector of the net, "1" denotes the offset, w₀，w₁And w₂Representing a weight coefficient required to be trained by the network, wherein rho (·) is an activation function;

the deep neural network repeatedly carries out strategy evaluation and strategy updating during training, and the Lyapunov function constraint needs to be ensured during strategy updating, so that the strategy updating mode is as follows:

wherein, pi_newThe updated policy is represented by the policy data,

is shown in a state

Selecting actions from policy π

The probability of (a) of (b) being,

representing the function of the Lyapunov function,

representing a Lyapunov function constraint;

and introducing a Lagrange multiplier lambda, and then re-describing the strategy updating mode as follows:

the invention has the beneficial effects that:

the invention designs a security control strategy design algorithm based on deep reinforcement learning for an information physical system under the condition of virtual false data injection attack of an execution mechanism, describes a dynamic equation of the information physical system under the condition of attack as a Markov decision process, converts the security control problem of the information physical system under the condition of false data injection attack into a control strategy learning problem only using data based on the established Markov process, and provides a flexible operator-critical reinforcement learning algorithm based on a Lyapunov function based on a flexible operator-critical reinforcement learning algorithm framework.

In addition, the safety control algorithm designed by the invention has strong robustness to factors such as uncertainty, external disturbance and the like. Under the condition that network attacks are complex and changeable, the safety control algorithm designed by the invention has better control performance and a model is simpler.

Drawings

FIG. 1 is a control block diagram of an cyber-physical system in the presence of a spurious data injection attack in accordance with the present invention;

FIG. 2 is a diagram of a deep neural network architecture for approximate evaluation and action constructed in the present invention;

FIG. 3 is a schematic diagram of a learning algorithm offline training process;

wherein the simulator comprises a system of formula (4) and a reference model of formula (3);

FIG. 4 is a state response graph using a PD controller reference model;

FIG. 5 is a graph of the mean response of the state of the arm system under the action of a control strategy obtained based on learning, under different initial conditions of a non-attack situation;

FIG. 6 is a graph of the mean response of the state of the robot system under the PD control strategy under different initial conditions of a no-attack situation;

FIG. 7 is a graph of the mean response of the state of the robot arm under the control strategy based on learning under different initial conditions of constant value attack (case 1);

FIG. 8 is a graph of the mean response of the state of the manipulator under the action of the PD control strategy under different initial conditions of a constant attack (case 1);

FIG. 9 is a graph of the mean response of the state of the manipulator under the action of a learning-based control strategy under different initial conditions of a time-dependent attack (case 2);

FIG. 10 is a graph of the mean response of the state of the manipulator under the PD control strategy under different initial conditions of the time-dependent attack (case 2);

fig. 11 is a graph of the mean response of the state of the manipulator under the action of the control strategy obtained based on learning under different initial conditions for designing the optimal attack strategy (case 3);

fig. 12 is a graph of the mean response of the state of the manipulator under the PD control strategy under different initial conditions for designing an optimal attack strategy (case 3).

Detailed Description

First embodiment this embodiment will be described with reference to fig. 1. The method for controlling the safety of the cyber-physical system based on the deep reinforcement learning in the embodiment specifically comprises the following steps:

step two, describing the information physical system model under the condition that the false data of the actuator established in the step one is injected into the attack as a Markov decision process;

The second embodiment is as follows: the difference between this embodiment and the first embodiment is that the specific process of the first step is as follows:

step one, under an ideal state, a dynamic equation of an information physical system model is as follows:

a state vector representing the cyber-physical system,

is the first derivative of x, f (·) represents a generalized function map;

x(k+1)＝(f(x(k),u(k)))Δt+x(k)

x_r(k+1)＝g(x_r(k),u_r(k))

wherein the content of the first and second substances,

wherein the content of the first and second substances,

wherein the content of the first and second substances,

state vector x representing model to be referenced_rAnd (k +1) when the control signal is partially input as the discretized cyber-physical system model, discretizing the state vector of the cyber-physical system model at the moment of k + 1.

Other steps and parameters are the same as those in the first embodiment.

The third concrete implementation mode: the first or second difference between this embodiment and the specific embodiment is that the attack distribution matrix Γ is a diagonal matrix, values of diagonal elements are all 0 or 1, if the ith actuator is attacked, the ith diagonal element (i.e., the element in the ith row and ith column of the diagonal matrix, corresponding to the ith actuator) of the attack distribution matrix Γ takes a value of 1, otherwise, the ith diagonal element takes a value of 0.

Other steps and parameters are the same as those in the first or second embodiment.

The fourth concrete implementation mode: the difference between this embodiment and one of the first to third embodiments is that the dummy data injection attack signal u_a(k) Is a state independent attack signal, if u_a(k) Is time-invariant, then u_a(k) ω, where ω represents a constant value; if u_a(k) Is time-varying, then

Where ω (k) represents a bounded time-varying function,

representing an upper bound on the attack signal.

Other steps and parameters are the same as those in one of the first to third embodiments.

The fifth concrete implementation mode: the difference between this embodiment and one of the first to the fourth embodiments is that the dummy data injection attack signal u_a(k) Is a state-dependent attack signal, if u_a(k) Is time-invariant, then u_a(k) Is described as

Where ω (k) represents a bounded time-varying function.

Other steps and parameters are the same as in one of the first to fourth embodiments.

The sixth specific implementation mode: the difference between this embodiment and one of the first to fifth embodiments is that the markov decision process is described as follows:

representing the state of the markov decision process at time k,

representing the state of the markov decision process at time k +1,

is shown in the action

Under the action of the pressure in the system from

To

The transition probability of (2).

Other steps and parameters are the same as those in one of the first to fifth embodiments.

The seventh embodiment: the difference between this embodiment and the first to sixth specific embodiments is that, in step three, a decision strategy is output for a markov decision process based on a built deep neural network, and the specific process is as follows:

defining the action value function

is an item of entropy of the information that,

is shown in a state

Selecting actions from policy π

The probability of (a) of (b) being,

which is indicative of a desired operation to be performed,

is the function value of the state value at the moment of k + 1;

Wherein, pi represents a strategy set,

means for selecting from the strategy set pi to minimize the value of the action value function^*；

To the optimal strategy pi^*Sampling to obtain optimal control signal u^*(k)，

Representing a Gaussian distribution,. sigma^*Is the variance.

Other steps and parameters are the same as those in one of the first to sixth embodiments.

The specific implementation mode is eight: this embodiment is different from the first to seventh embodiments in that the information entropy item

The expression of (a) is:

wherein the content of the first and second substances,

is shown in a state

Selecting actions from policy π

The probability of (a) of (b) being,

representing the desired operation.

Other steps and parameters are the same as those in one of the first to seventh embodiments.

The specific implementation method nine: in this embodiment, the difference between the first embodiment and the eighth embodiment is that the cost function

Comprises the following steps:

where the superscript T represents transpose.

Other steps and parameters are the same as those in one to eight of the embodiments.

The specific implementation mode is ten: the difference between this embodiment and one of the first to ninth embodiments is that the deep neural network is a network built based on a multi-layer perceptron and having two hidden layers, and the built network is described as follows:

wherein, the first and the second end of the pipe are connected with each other,DNN _w(z) denotes the output of the net, z denotes the input vector of the net, "1" denotes the offset, w₀，w₁And w₂Representing a weight coefficient required to be trained by the network, wherein rho (-) is an activation function;

wherein, pi_newThe updated policy is represented by the policy data,

is shown in a state

Selecting actions from policy π

The probability of (a) of (b) being,

representing the function of the Lyapunov function,

representing a Lyapunov function constraint;

other steps and parameters are the same as those in one of the first to ninth embodiments.

The process of the present invention is described in detail below with reference to the accompanying drawings:

as shown in fig. 1, the physical processes in the cyber-physical system are distributed in different geographical locations from the control center. And the sensor-controller end and the controller-actuator end are used for information interaction through an open shared communication network. The malicious attacker can obtain system knowledge including system control targets, simple models of physical systems, and instrumented attack detection mechanisms through the monitoring system. An attacker constructs false data attacks through the knowledge and injects the malicious data into a control signal so as to destroy the system performance or realize other illegal purposes. The details of each module in fig. 1 are as follows:

1. description of physical System

The present invention contemplates a more general model of a physical system whose dynamic model is described by the following differential equations

Wherein

A vector of states of the system is represented,

representing the control signal to be designed. For the above system, the present invention assumes that it is controllable. In addition, no constraints are imposed on the system model.

In cyber-physical systems, computers and communication networks are heavily used. Therefore, it is necessary to discretize the system (1) and obtain a discrete signal. Using the Euler method, the discrete form of the system (1) can be described as

x(k+1)＝(f(x(k),u(k)))Δt+x(k) (2)

Where Δ t > 0 represents the sampling period.

As can be seen from fig. 1, the present invention uses the state of a reference model for part of the inputs to the neural network that generate the control signals. Here, the reference model is a nominal model of the system (2). In the reference model, external disturbances, uncertainties and cyber-attack behavior are not considered. The reference model facilitates the design of the controller according to established maturity theories and ensures good control performance. For the reference model, its dynamic equations are described as follows:

x_r(k+1)＝g(x_r(k),u_r(k)) (3)

wherein

Representing the state of the reference model, u_r(k) Indicating well-designed controlsAnd the controller ensures that the system has good performance.

2. Information physical system under network attack

The invention assumes that a malicious attacker attacks the communication network of the controller-actuator end. The attacker achieves its illegal purpose by constructing the dummy data and injecting it into the control signal u (k). Describing the control signal in the case of a spurious data injection attack as

representing the control signal after attack, u_a(k) Representing a spurious data injection attack. Γ represents the attack distribution matrix of a spurious data injection attack. Here, Γ is a diagonal matrix whose diagonal elements take on values of 0 or 1. If the ith executor is attacked, the ith diagonal element of the matrix Γ takes on a value of 1. Otherwise, the diagonal element takes a value of 0. For attack signal u_a(k) It is an unknown but bounded class of signals. Attack signals can be divided into two broad categories, namely state-independent attack signals and state-dependent attack signals.

1) State independent attack signal

If the attack signal u_a(k) Is time-invariant, u_a(k) ω, where ω represents a constant value. If u is_a(k) Is a time-varying one of the two phases,

wherein

Representing an upper bound.

2) State dependent attacks

If the attack signal u_a(k) Is time-invariant, and may be described as

Where ω represents a constant value. If the attack signal u_a(k) Is time-varying, and the attack signal can be described as

Where ω (k) represents a bounded time-varying function.

Considering the control signal after attack, the system (2) can be described again

From the above system description, the object of the invention is to learn a strategy pi (control signal) for a system (4) using a deep reinforcement learning algorithm

Sampled from strategy pi) under which the system (4) can achieve mean square index stabilization as described in definition 1.

Definition 1 if

And

so that the following inequality holds for all k ≧ 0:

then state

The mean square index converges.

3. Markov decision process

The invention mainly designs a deep reinforcement learning algorithm and introduces an implementation mode of the reinforcement learning algorithm. The invention first describes the system (4) as a Markov decision process. Aiming at the Markov decision process, a flexible operator-critic learning algorithm based on the Lyapunov function is designed, and a decision strategy is output. The security controller is obtained by sampling the policy. On the basis, how to construct the deep neural network is introduced, and the designed reinforcement learning algorithm is realized through the deep neural network. Next, each part will be described in detail.

The Markov decision process consists of a quintuple, i.e.

Here, the number of the first and second electrodes,

the state space is represented by a representation of,

the space of motion is represented by a graph,

a distribution of the probability of the transition is represented,

representing the cost of control, and γ ∈ [0,1) representing a discount factor. As can be seen from the foregoing description, the control signal u (k) to be designed contains the reference model state x_r(k) The information of (1). Thus, the markov decision process of the system (4) can be described as follows:

wherein

Representing the state of the markov decision process,

is shown in action

Under the action of the pressure in the system from

To

The transition probability of (2).

4. Reinforced learning algorithm

The invention designs a flexible operator-critic reinforcement learning algorithm for the Markov decision process described by the formula (5). Different from the existing reinforcement learning algorithm result, the reinforcement learning algorithm designed by the invention does not define a reward function, but defines a control cost function. Thus, the present invention minimizes control costs while maximizing the reward function, unlike prior efforts. For the reference model, the control quality is guaranteed by a well-designed controller. Therefore, if the algorithm of the present invention is designed to enable the system state

The exponent converges to x_r(k) Then the stability of the system (4) can be guaranteed. Therefore, the invention will control the cost function

Defined in the form:

the goal of reinforcement learning is to find an optimal strategy that minimizes the following state-value function:

wherein

Pi tableShowing the strategy to be learned. In reinforcement learning, the strategy π is a Gaussian distribution.

Can be obtained by the following formula:

wherein

Represents a mean value of

Variance is σ²A gaussian distribution of (a).

Is shown in a state

Selecting actions from policy π

The probability of (c).

In the reinforcement learning algorithm, to obtain the optimal strategy pi, the actual minimization in the training process is the action value function (also called Q-function)

According to (6), the Q-function can be defined as:

wherein

Is shown in

The desired operation is performed on the distribution of (a).

In order to ensure that the algorithm can fully explore unknown action space and avoid trapping in a local solution, an information entropy item can be added to the Q-function (8). Further, the Q-function (8) considering the entropy terms of information can be described as:

where α represents a temperature parameter that can be used to adjust the relative importance of the increasing entropy term.

Is an added information entropy term, and the expression is as follows:

according to the above description, the reinforcement learning algorithm designed by the present invention needs to solve the following problems:

where Π represents a set of policies.

By solving (10), the optimal strategy pi can be obtained^*. Is available based on the policy

Equation (10) is solved by a reinforcement learning algorithm. When the algorithm is finished, the variance

Slave strategy pi^*Control signal u obtained by sampling^*(k) Is a definite mean value and is no longer affected by the exploration noise.

In the training of the reinforcement learning algorithm, two steps, namely strategy evaluation and strategy, need to be repeatedly executedAnd (4) improving. In strategy evaluation, a Bellman backtracking (backup) operator is used

Q-functions (9) acting on increasing entropy terms of information, i.e.

When the strategy is improved, the strategy is updated according to the following equation

Wherein the content of the first and second substances,

indicating a state

Candidate strategy of (2), pi_oldThe policy of the last step of update is represented,

representing a policy n_oldThe corresponding Q value of the signal is obtained,

representing the Kullback-Leibler divergence,

indicating the normalization factor. Further, the formula (12) can be described again by the mathematical calculation:

5. deep neural network approximation

It is clear that the subject of the invention belongs to a system with spatial continuity of motion. The traditional Q-table based algorithm implementation is no longer applicable. In order to realize the learning algorithm, the deep neural network is built on the basis of the fully-connected multilayer sensing mechanism, and the Q function and the action are approximated by the deep neural network.

In the constructed deep neural network, a ReLU function is generally selected as an activation function of the neural network, and is defined as ρ (z) ═ max { z,0 }. To clearly show how to build a deep neural network using a multi-layer sensing mechanism, the present invention provides an example of a deep neural network with two hidden layers. For the ReLU activation function, when z represents a vector, i.e., z ═ z₁,...,,z_n]^T，ρ(z)＝[ρ(z₁),...,ρ(z_n)]^T. Thus, a deep neural network with two hidden layers can be described as

Wherein [ z ]^T,1]^T"1" denotes bias, w ═ w { (w)₀,w₁,w₂In which w₀，w₁And w₂And weight coefficients representing the neural network to be trained.

In the present invention, the constructed deep neural network will be used to approximate "critic" respectively "

And "actor"

Theta and phi are used for parameterization

And

is a parameter of

Fig. 2 depicts the corresponding neural network structure.

One of the design goals of the invention is to ensure the stability of the cyber-physical system under the action of learning to obtain a control strategy. The lyapunov theory is widely used to analyze the stability of systems. In the learning process, the invention combines the Q-function

Viewed as the Lyapunov function

Then evaluating the neural network would be used to approximate the lyapunov function

If the invention can learn a Lyapunov function and make the Lyapunov function meet the constraint of the system stability condition, the stability of the information physical system can be ensured. In the following, the Lyapunov function will be used

Replacement of Q-function

Representing a parameterized Lyapunov function whose neural network approximation is described in the form of

For an action neural network, its output consists of two parts. Safety control signal u, part of which is parameterized_φ(k) The other part is variance

According to the formula (7)

6. Implementation of flexible operator-critic deep reinforcement learning control algorithm based on Lyapunov function

The above description has introduced the design of reinforcement learning algorithm, the construction of deep neural network, and the definition of lyapunov function in the training process. The invention mainly discusses how to realize the flexible operator-critical deep reinforcement learning algorithm with Lyapunov function constraint. Fig. 3 describes the training process of the algorithm. As shown in FIG. 3, the reference model (3) and the system (4) are used to generate training data and store and replay the generated data in a storage unit

In (1). By replaying storage units from

And (5) randomly sampling training samples, and repeatedly executing strategy evaluation and strategy improvement. Then, from the updated policy

The sampling control signal is used by the system (4) to generate new training data.

At the time of policy evaluation, the policy

The following bellman residual equation should be minimized:

wherein the content of the first and second substances,

representing slave memory cells

Middle random sampling data

Wherein

Representing the target parameter. Here, the number of the first and second electrodes,

and

has the same neural network structure. But parameters

And updating according to an exponential approach mode.

During training, residual equations are minimized using an ADAM optimizer. As to the size of-

I training sample data, the random gradient of equation (15) is:

one of the contribution points of the invention is to solve the problem of ensuring the stability of a system based on a deep reinforcement learning algorithm. This is also the biggest difference from the existing deep reinforcement learning algorithm. In order to ensure the stability of the system, the constraint condition of the Lyapunov function when the system is stable should be satisfied. Therefore, it is necessary to guarantee the lyapunov function constraint in updating the strategy, that is, to apply the lyapunov function constraint to the optimization problem (13). Further, the optimization problem in equation (13) can be re-described as:

wherein

Beta is a weight coefficient, and beta is a weight coefficient,

representing the lyapunov function to be learned with theta as a parameter.

Obviously, the above optimization problem is a constrained optimization problem. In order to solve the optimization problem with constraints, the invention introduces a Lagrange multiplier. Equation (16) is further described as:

where λ represents the langerhan multiplier.

The goal of the strategy improvement then becomes to minimize the following equation:

the above formula is equivalent to:

the random gradient of the above equation for the parameter φ is:

wherein:

the temperature parameter α will be updated by minimizing the following equation:

wherein

Representing the target entropy.

For the lagrange multiplier λ, the update is done by maximizing the following equation:

in light of the above discussion, the algorithm in Table 1 describes pseudo code implementing a Lyapunov function-based flexible operator-critical reinforcement learning control algorithm. By using the algorithm in Table 1, an optimal strategy π can be obtained that satisfies the Gaussian distribution^*. In the reasoning process, i.e. in practical applications, the mean value

Will be used to control the cyber-physical system in the event of an actuator spurious data injection attack.

TABLE 1

Simulation result

The invention provides a simulation result for explaining the effectiveness and superiority of the data-based safety control algorithm. In the simulation, the invention utilizes Python 3.6 and TensorFlow 1.15 to realize the algorithm in the table 1 to obtain the control strategy. The details of the simulation and the results are presented below.

Assuming that the physical device in FIG. 1 is a single joint robotic arm, its dynamics can be described by the following differential equations:

where theta represents the angular position of the robotic arm,

the angular velocity of the mechanical arm is shown, and the gravity acceleration parameter g is 9.8m/s²The arm length L of the arm is 0.5M, M is 10kg indicating the load weight of the arm, and J is 10kg · M²Representing moment of inertia, d (t) ± 2.2 sin (t) representing an uncertain but bounded coefficient of friction, and u (t) representing a control input.

Definition of

The dynamic equation for the arm can then be expressed in the form:

wherein

The reference model of the robot arm model may be obtained by linearizing the nonlinear model. For the linearized model, a linear quadratic optimal control law or a PD control law can be designed as the input of the reference model. In the present simulation example, the PD control law was selected and designed. Because the operating personnel remotely operate the mechanical arm, malicious attackers can destroy the integrity of the control signal by injecting false attack signals into the communication network, and further realize the stability of the state of the mechanical arm. Then, a safety control law is obtained through learning by adopting the safety control algorithm based on deep reinforcement learning designed by the invention, and the mechanical arm is ensured to work according to expected performance under the condition of malicious attack. In the subsequent simulation, the sampling period Δ t is 0.2 s.

Supposing that a malicious attacker invades the information layer, false data attack signals are injected into the control signals to destroy the running state of the mechanical arm. In the simulation example, the control target is to obtain a control strategy by using a learning algorithm to ensure that the mechanical arm can ensure a stable state under the attack behavior. In the simulation, it is assumed that the sampling period Δ t is 0.2 s.

In the simulation, the reference model PD control law is designed as

Wherein K_p＝20，K_d＝25。

Fig. 4 depicts a state variable response curve of a reference model when a PD controller is used. It can be seen from the curve change in the figure that the designed PD controller can effectively ensure the stability of the attitude of the mechanical arm system. In the following, the parameters and structure of the deep neural network to be trained will be given first. The neural network will be trained using the algorithm in table 1. Furthermore, a strategy obtained based on training can ensure that the mechanical arm is stable under the condition of no attack, and the control performance of the mechanical arm can be the same as that of the mechanical arm shown in the figure 4.

In the training process, 10 strategies will be trained. The best performing one of these 10 strategies is then selected for controlling the arm system. Table 2 gives the values of the neural network structure and the related parameters in the training. The initial value of the angle of the arm system is randomly selected from the interval of [ -2, 2] rad/s, and the initial value of the angular velocity is randomly selected from the interval of [ -1, 1] rad/s.

TABLE 2 deep neural network training parameters

After training is completed, the best one of the obtained strategies is selected. Then, the initial conditions were randomly chosen to run 10 times of simulations under a non-attack situation using this strategy. Fig. 5 shows the mean state variable response for 10 random initial conditions under the action of the strategy obtained by learning. FIG. 6 shows the mean values of the robot arm state responses for 10 random initial conditions under the action of the PD controller. As shown in fig. 5, the strategy obtained by learning can well achieve the goal of controlling the state of the robot arm in a non-attack situation. Since the false data attacks can be roughly classified into three categories, namely constant value attacks (bias attacks), time-dependent attacks and designed optimal attacks. Among the three types of false data injection attacks, the designed optimal attack is most concealed. An attacker constructs a false data attack signal by using system knowledge, so that the designed attack signal can not trigger an attack detection mechanism, and further the system performance is seriously damaged.

Next, the validity of the security control algorithm designed by the present invention will be verified for three different classes of false data injection attack scenarios. In addition, in order to illustrate the superiority of the control algorithm designed by the invention, the invention is compared with the classical PD control algorithm. In subsequent simulation graphs, the shaded areas are obtained by filling in the response mean minus and plus the standard deviation.

Case 1: constant value attack

In this case, it is assumed that the attacker designs the attack signal to be constant 5. The attacker starts to apply the spurious data injection attack from the initial moment. As with the above simulation, 10 initial conditions were randomly selected for simulation in this case. Fig. 7 and 8 depict the mean response curves of the mechanical arm states under the action of the learning-based control strategy and the PD control strategy, respectively. As can be seen from the simulation results, the control strategy based on learning designed by the invention can effectively relieve the attack behavior.

Case 2: time dependent attacks

In this case, assume that an attacker performs a dummy data injection attack signal of the form as follows starting from k ≧ 0

u_a(k)＝2cos(k)sin(k)+4

Fig. 9 shows a mean response curve of the state of the robot system under the action of the learning-based control strategy under the time-dependent attack. Fig. 10 depicts a mean response curve of the state of the robot system under the action of the PD control strategy under the time-dependent attack. As shown in fig. 10, the time-dependent attack has changed the position of the robotic arm. Despite the strong robustness of the PD controller, the position of the robotic arm is not restored. According to the comparison result, the control strategy provided by the invention can effectively relieve the time-dependent false data injection attack.

Case 3: designed optimal attack strategy

In this case, the attacker destroys the system using the optimal attack strategy. The attacker utilizes the system knowledge to construct the dummy data injection attack signal. Suppose that the attacker designs an attack signal of

Wherein K_aA matrix of attack weights is represented that is,

K_a＝[613]

fig. 11 and 12 depict the corresponding simulation results. Where fig. 11 depicts the mean response curve for the state of the robotic arm system under the influence of the learning-based control strategy, and fig. 12 depicts the mean response curve for the state of the robotic arm system under the influence of the PD controller. Obviously, the PD controller does not protect the system well and the system stability is severely compromised in such attack scenarios. However, the control strategy based on learning designed by the invention can compensate the influence of attack and ensure the safe operation of the mechanical arm.

According to the simulation results of the three situations, the learning-based safety control strategy designed by the invention is superior to the well-designed PD control strategy. By virtue of the powerful advantages of deep neural networks, the learning-based security control strategy is capable of handling different types of spurious data injection attacks.

The above-described calculation examples of the present invention are merely to explain the calculation model and the calculation flow of the present invention in detail, and are not intended to limit the embodiments of the present invention. It will be apparent to those skilled in the art that other variations and modifications of the present invention can be made based on the above description, and it is not intended to be exhaustive or to limit the invention to the precise form disclosed, and all such modifications and variations are possible and contemplated as falling within the scope of the invention.

Claims

1. An information physical system security control method based on deep reinforcement learning is characterized by specifically comprising the following steps:

step one, establishing an information physical system model under the attack of false data injection of an actuator; the specific process comprises the following steps:

wherein the content of the first and second substances,

a state vector representing the physical system of the information,

f (-) represents a generalized function map for the first derivative of x;

x(k+1)＝(f(x(k),u(k)))Δt+x(k)

x_r(k+1)＝g(x_r(k),u_r(k))

wherein the content of the first and second substances,

wherein the content of the first and second substances,

control signal representing the time k after attack, u_a(k) Representing a spurious data injection attack signal at time k, wherein gamma represents an attack distribution matrix of the spurious data injection attack;

step one five, the state vector of the reference model is used as part input of the control signal generated by the discretization information physical system model, and the control signal after being attacked is considered, so that the discretization information physical system model in the step two is re-described as the information physical system model under the executor false data injection attack:

wherein the content of the first and second substances,

state vector x representing model to be referenced_r(k +1) when the control signal is input as part of the discretization information physical system model, the state vector of the discretization information physical system model at the k +1 moment;

2. The cyber-physical system security control method based on the deep reinforcement learning of claim 1, wherein the attack distribution matrix Γ is a diagonal matrix, values of diagonal elements are all 0 or 1, if an ith actuator is attacked, an ith diagonal element of the attack distribution matrix Γ takes a value of 1, otherwise, the ith diagonal element takes a value of 0.

3. The cyber-physical system security control method based on deep reinforcement learning of claim 2, wherein the dummy data injection attack signal u is a signal of the false data injection attack_a(k) Is a state independent attack signal, if u_a(k) Is time-invariant, then u_a(k) ω, where ω represents a constant value; if u_a(k) Is time-varying, then

WhereinAnd ω (k) represents a bounded time-varying function,

representing an upper bound on the attack signal.

4. The cyber-physical system security control method based on deep reinforcement learning of claim 3, wherein the false data injection attack signal u is a false data injection attack signal_a(k) Is a state-dependent attack signal, if u_a(k) Is time-invariant, then u_a(k) Is described as

Where ω (k) represents a bounded time-varying function.

5. The cyber-physical system security control method based on the deep reinforcement learning of claim 4, wherein the Markov decision process is described as follows:

wherein the content of the first and second substances,

representing the state of the markov decision process at time k,

representing the state of the markov decision process at time k +1,

is shown in the action

Under the action of the pressure in the system from

To

The transition probability of (2).

6. The cyber-physical system safety control method based on the deep reinforcement learning according to claim 5, wherein in the third step, a decision strategy is output for a Markov decision process based on a built deep neural network, and the specific process is as follows:

defining the action value function

Wherein the content of the first and second substances,

is an item of entropy of the information that,

is shown in a state

Selecting actions from policy π

The probability of (a) of (b) being,

which is indicative of a desired operation to be performed,

is the function value of the state value at the moment of k + 1;

the deep neural network obtains an optimal strategy pi by minimizing an action value function^*：

Wherein Π represents a set of policies;

for optimal strategy pi^*Sampling to obtain optimal control signal u^*(k)，

Represents a Gaussian distribution, σ^*Is the variance.

7. The cyber-physical system security control method based on deep reinforcement learning of claim 6, wherein the information entropy item

The expression of (a) is:

wherein the content of the first and second substances,

is shown in a state

Selecting actions from policy π

The probability of (a) of (b) being,

representing the desired operation.

8. The cyber-physical system security control method based on deep reinforcement learning of claim 7, wherein the cost function

Comprises the following steps:

where the superscript T represents transpose.

9. The cyber-physical system security control method based on deep reinforcement learning of claim 8, wherein the deep neural network is a network built based on a multi-layer perceptron and having two hidden layers, and the built network is described as follows:

wherein the content of the first and second substances,DNN _w(z) denotes the output of the net, z denotes the input vector of the net, "1" denotes the offset, w₀，w₁And w₂Representing a weight coefficient required to be trained by the network, wherein rho (·) is an activation function;

wherein, pi_newThe updated policy is represented by the policy data,

is shown in a state

Selecting actions from policy π

The probability of (a) of (b) being,

representing the function of the Lyapunov function,

representing a Lyapunov function constraint;