CN115361686A - Safe exploration reinforcement learning method for wireless communication security - Google Patents

Safe exploration reinforcement learning method for wireless communication security

Info

Publication number
CN115361686A
CN115361686A (application CN202211007434.5A)
Authority
CN
China
Prior art keywords
network
wireless communication
security
action
risk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211007434.5A
Other languages
Chinese (zh)
Other versions
CN115361686B (en)
Inventor
肖亮
牛国航
吕泽芳
肖奕霖
杨和林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202211007434.5A priority Critical patent/CN115361686B/en
Publication of CN115361686A publication Critical patent/CN115361686A/en
Application granted granted Critical
Publication of CN115361686B publication Critical patent/CN115361686B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W12/00: Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/12: Detection or prevention of fraud
    • H04W12/121: Wireless intrusion detection systems [WIDS]; Wireless intrusion prevention systems [WIPS]
    • H04W12/122: Counter-measures against attacks; Protection against rogue devices
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W12/00: Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/60: Context-dependent security
    • H04W12/67: Risk-dependent, e.g. selecting a security level depending on risk profiles
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

A safe exploration reinforcement learning method for wireless communication security, relating to the security of wireless communication. A state risk network and an action risk network are introduced to distinguish state risk from action risk, improving the fitting accuracy of the action risk degree; action selection is corrected by the action risk degree, so that dangerous strategies are not explored and safe exploration is achieved in wireless communication scenarios. The method comprises: the information sender uses a value network to evaluate the long-term cumulative return of taking different actions in the current state, evaluates the risk values of taking different actions in the current state according to the performance evaluation indexes and communication requirements of the communication system, uses the state risk network and the action risk network to fit the long-term cumulative risk value and revise the output of the value network, and selects a safe transmission strategy according to the revised values of the different actions. The method reduces the exploration of risky strategies in wireless communication security applications and improves the security of wireless communication.

Description

Safe exploration reinforcement learning method for wireless communication security
Technical Field
The invention relates to the security of wireless communication, belongs to the field of modern wireless communication security, and particularly relates to a safe exploration reinforcement learning method for wireless communication security.
Background
With the rapid development of wireless communication technologies such as unmanned-aerial-vehicle video and image transmission, voice calls, and wireless body area networks, wireless communication has become closely tied to everyday life. However, owing to the openness of the wireless medium, communication is vulnerable to jamming, eavesdropping, and similar attacks, which seriously threaten the privacy and security of the communication system. Wireless communication systems therefore commonly use techniques such as frequency hopping and power control to counter illegal attacks and improve system security.
Reinforcement learning learns in an unknown environment by trial and error, requiring no prior knowledge of attack strategies such as jamming or of network parameters such as the channel state, and it has been widely applied in the field of wireless communication security. For example, Chinese patent CN112291495B proposes a low-latency anti-jamming wireless video transmission method based on reinforcement learning, which combines the Boltzmann distribution with the DQN algorithm in an improved deep reinforcement learning algorithm and dynamically optimizes the transmission channel, transmit power, and coding/modulation scheme to resist jamming attacks; Chinese patent CN113079167A provides a deep-reinforcement-learning-based intrusion detection method and system for the Internet of Vehicles, which establishes an intrusion detection model over traffic data using the deep deterministic policy gradient algorithm; Chinese patent CN113225794A proposes a full-duplex cognitive communication power control method based on deep reinforcement learning, which directly uses the DQN algorithm to optimize the power control strategy of the secondary-user transmitter.
Dai et al. [C. Dai, L. Xiao, X. Wan and Y. Chen, "Reinforcement learning with safe exploration for network security," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, May 2019] propose a safe exploration reinforcement learning algorithm for network security that uses security performance indexes to evaluate the risk values of actions, thereby improving the security performance of network security applications. Lu et al. [X. Lu, L. Xiao, G. Niu, et al., "Safe Exploration in Wireless Security: A Safe Reinforcement Learning Algorithm With Hierarchical Structure," IEEE Transactions on Information Forensics and Security, 2022] propose a safe reinforcement learning algorithm based on a hierarchical structure and an action-selection-priority security criterion, which compresses the action space using the hierarchical structure and an action risk assessment criterion and optimizes the security policy of wireless communication security applications, thereby preventing serious consequences such as network collapse. Wachi and Sui [A. Wachi and Y. Sui, "Safe reinforcement learning in constrained Markov decision processes," in Proc. International Conference on Machine Learning (ICML), PMLR, 2020: 9797-9806] propose a method for exploration and optimization in Markov decision processes under unknown safety constraints, which learns the safety constraints by expanding the safe region and then optimizes the cumulative reward inside the certified safe region, achieving a near-optimal cumulative reward while guaranteeing safety in the constrained Markov decision process. Tessler et al. [C. Tessler, D. J. Mankowitz, and S. Mannor, "Reward constrained policy optimization," in Proc. Int. Conf. Learning Representations (ICLR), New Orleans, LA, May 2019] propose a policy optimization method based on reward constraints, which introduces two critic networks that respectively fit the reward and the safety-constraint return and injects the safety constraints into the reward function as penalty signals, realizing safe exploration in reinforcement learning.
Although existing reinforcement-learning-based wireless communication security schemes achieve certain anti-jamming and intrusion-detection effects in wireless communication security scenarios, most of them do not account for the exploration of risky strategies, such as strategies that cause communication interruption, during the initial learning phase; moreover, the proposed safe reinforcement learning algorithms do not distinguish state risk from action risk and therefore cannot accurately fit the action risk degree.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a safe exploration reinforcement learning method for wireless communication security, which designs a state risk network and an action risk network to improve the fitting accuracy of the action risk degree and corrects risky actions so as to realize safe exploration, avoiding the selection of risky strategies that cause communication interruption and improving the security of wireless communication.
The invention comprises the following steps:
Step 1: initialize parameters:
The total number of data packets to be transmitted in the wireless communication system is K; transmitting one data packet occupies one time slot, giving time slots {1, 2, …, k, …, K}. The information sender can adjust N wireless communication security policies, such as frequency hopping, power control, and the coding/modulation scheme, to cope with jamming attacks in wireless communication. The i-th security policy p_i (1 ≤ i ≤ N) has L_i (1 ≤ L_i ≤ N) feasible values, the action space formed by all possible security policies is T, and the number of actions in the action space is L = ∏_{i=1}^{N} L_i.
The communication system has M performance evaluation indexes {d_i}_{1≤i≤M}, e.g., delay and bit error rate, where performance index i (1 ≤ i ≤ M) satisfies the normal-communication condition d_i ≤ δ_i, with δ_i denoting the communication requirement on index i. The information sender can sense J pieces of communication state information {o_i}_{1≤i≤J}, such as the channel state and the transmission information type. Construct three neural networks V, S, and A, each with three fully-connected layers: network V contains M + J input neurons, H hidden neurons, and L output neurons; network S contains M + J input neurons, H hidden neurons, and 1 output neuron; network A contains M + J input neurons, H hidden neurons, and L output neurons. Randomly initialize the weight matrices θ, ω, and ψ of the three neural networks; initialize the learning parameter ζ ∈ (0, 1), an empty buffer C, the sample size B, the random exploration probability η, and the initial performance {d_i^(0)}_{1≤i≤M}.
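For illustration, the setup of Step 1 can be sketched as follows. PyTorch is an assumption (the patent names no framework), and the helper make_net and the example sizes are illustrative, not from the patent; the layer widths (M + J inputs, H hidden, L or 1 outputs) follow the description above.

```python
# Minimal sketch of Step 1; framework (PyTorch) and names are assumptions.
import torch
import torch.nn as nn

def make_net(n_in: int, n_hidden: int, n_out: int) -> nn.Sequential:
    # Three layers of neurons (input, hidden, output), i.e. two fully-connected maps.
    return nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_out))

M, J, H = 2, 2, 128           # performance indexes, state observations, hidden neurons
L = 24                        # number of actions, L = prod_i L_i (example value)
V = make_net(M + J, H, L)     # value network: long-term value of each action
S = make_net(M + J, H, 1)     # state risk network: one risk value per state
A = make_net(M + J, H, L)     # action risk network: one risk value per action
zeta, B, eta = 0.5, 64, 0.05  # learning parameter, sample size, exploration probability
buffer: list = []             # replay buffer C, initially empty
```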
Step 2: in the k-th time slot, the information sender receives the performance evaluation indexes {d_i^(k-1)}_{1≤i≤M} of the previous time slot's communication system, obtains the communication state information {o_i^(k)}_{1≤i≤J} by perception and computation, and constructs the current system state s^(k) = [{d_i^(k-1)}_{1≤i≤M}, {o_i^(k)}_{1≤i≤J}].
Step 3: the information sender feeds the state s^(k) to network V, network S, and network A, respectively. Denote the output of network V by V = {V_m}_{1≤m≤L}, representing the values of the different actions; denote the output of network S by S, representing the risk value of the current state; denote the output of network A by A = {A_m}_{1≤m≤L}, representing the risk values of taking the different actions in the current state. The outputs of network S and network A together form the risk degree X = {X_m}_{1≤m≤L} of the state-action pairs.
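The formula combining S and A into X appears in the original only as an image; the mean-centred, dueling-DQN-style aggregation below is therefore an assumption, not the patent's confirmed formula. The sketch reuses the networks from the Step 1 sketch.

```python
# Assumed aggregation X_m = S + A_m - mean(A), modelled on dueling DQN;
# the patent's exact combination formula is not reproduced here.
import torch

def risk_degree(S_net, A_net, s: torch.Tensor) -> torch.Tensor:
    state_risk = S_net(s)                    # scalar risk S of the current state
    action_risk = A_net(s)                   # per-action risks {A_m}
    return state_risk + action_risk - action_risk.mean(dim=-1, keepdim=True)
```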
Step 4: compute the Q-value vector Q = V - X. With probability 1 - η, the information sender selects the action p_i (1 ≤ i ≤ N) with the largest corresponding Q value; with probability η, it randomly selects other security policies. According to the obtained action combination P^(k) = [p_1, p_2, …, p_N], it adjusts the wireless communication security policy and sends a data packet to the information receiver.
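Step 4 is an epsilon-greedy rule over the risk-corrected values. A sketch, reusing risk_degree from the Step 3 sketch; the flat action index standing in for the combination P^(k) is an illustrative simplification.

```python
# Sketch of Step 4: pick argmax of Q = V - X with prob. 1 - eta, else explore
# randomly (decoding the flat index back into [p_1, ..., p_N] is omitted).
import random
import torch

def select_action(V_net, S_net, A_net, s: torch.Tensor, eta: float, L: int) -> int:
    if random.random() < eta:
        return random.randrange(L)          # random exploration with probability eta
    with torch.no_grad():
        q = V_net(s) - risk_degree(S_net, A_net, s)   # corrected Q vector
    return int(q.argmax().item())
```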
Step 5: after receiving the data packet, the information receiver calculates the performance evaluation indexes {d_i^(k)}_{1≤i≤M} of the current communication system and feeds them back to the information sender.
Step 6: the information sender receives the performance evaluation indexes and calculates the benefit u^(k) through the benefit function f:
u^(k) = f(d_1^(k), d_2^(k), …, d_M^(k))
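In the embodiment below, f is instantiated as a weighted sum of the delay and the bit error rate; written out as a function (a sketch with illustrative names):

```python
def benefit(delay_s: float, ber: float) -> float:
    # The concrete f used in the embodiment: u = -d1 - 1000 * d2.
    return -delay_s - 1000.0 * ber
```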
Step 7: the information sender evaluates the risk degree r^(k) of the current state-action pair, where I(·) is an indicator function that equals 0 if the condition in parentheses holds and 1 otherwise, used to measure the risk degree:
r^(k) = Σ_{i=1}^{M} I(d_i^(k) ≤ δ_i)
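In code, r^(k) simply counts how many communication requirements are violated (the inverted indicator contributes 0 when a requirement holds). A sketch with illustrative names:

```python
def risk_value(indexes: list[float], thresholds: list[float]) -> int:
    # I(d_i <= delta_i) contributes 0 when the requirement holds, 1 when violated.
    return sum(0 if d <= t else 1 for d, t in zip(indexes, thresholds))

# e.g. risk_value([0.3, 0.02], [0.4, 0.01]) == 1  (delay OK, bit error rate violated)
```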
Step 8: store the quadruple χ^(k) = {s^(k), P^(k), u^(k), r^(k)} in the buffer C. If the number of entries in the buffer is greater than or equal to the sample size B, randomly draw B entries {χ^(i)}_{1≤i≤B} from the buffer and use them to update the parameters θ^(k), ω^(k), and ψ^(k) of network V, network S, and network A, where V(·), S(·), and A(·) denote the output values of network V, network S, and network A, respectively.
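The update equations of Step 8 appear in the original only as images, so the sketch below is an assumed reconstruction: V(s) at the chosen action is regressed toward the observed benefit u, and the combined risk degree X(s) at the chosen action toward the observed risk r. It reuses risk_degree from the Step 3 sketch; buffer entries are dicts with keys "s", "a", "u", "r" (a flat action index replacing P^(k)), which is an illustrative storage choice.

```python
# Assumed Step 8 update (the patent's exact loss functions are not reproduced):
# regress V(s)[a] toward u and X(s)[a] toward r over B replayed samples.
import random
import torch
import torch.nn.functional as F

def update_networks(V_net, S_net, A_net, optimizer, buffer, B: int) -> None:
    if len(buffer) < B:
        return                                        # wait until B samples exist
    batch = random.sample(buffer, B)
    s = torch.stack([chi["s"] for chi in batch])      # states s^(k)
    a = torch.tensor([chi["a"] for chi in batch])     # flat action indices
    u = torch.tensor([chi["u"] for chi in batch])     # benefits u^(k)
    r = torch.tensor([chi["r"] for chi in batch], dtype=torch.float32)  # risks r^(k)

    q = V_net(s).gather(1, a.unsqueeze(1)).squeeze(1)                       # V(s)[a]
    x = risk_degree(S_net, A_net, s).gather(1, a.unsqueeze(1)).squeeze(1)   # X(s)[a]
    loss = F.mse_loss(q, u) + F.mse_loss(x, r)        # assumed regression losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```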
Step 9: repeat steps 2-8 until the performance evaluation indexes of the communication system meet the normal-communication requirements, i.e., d_i^(k) ≤ δ_i for all 1 ≤ i ≤ M.
Compared with the prior art, the invention has the following outstanding advantages:
according to the method and the device, risk values of different actions taken in the current state are evaluated according to performance evaluation indexes and communication requirements of the communication system, state risk and action risk are distinguished by a state risk network and an action risk network, fitting accuracy of action risk degree is improved, action selection is corrected by utilizing the action risk degree, a danger strategy is prevented from being explored, exploration on the risk strategy is reduced in wireless communication safety application, and safety of wireless communication is improved.
Drawings
Fig. 1 is a comparison of packet loss rates in image transmission.
Fig. 2 is a comparison of communication interruption probabilities.
Fig. 3 is a comparison of communication power consumption.
Detailed Description
For a clearer understanding of the technical content of the present invention, the technical solution is described below with reference to specific embodiments and the accompanying drawings.
The embodiment of the invention comprises the following steps:
Step 1: the total number of data packets to be transmitted in the wireless communication system is 1000; transmitting one data packet occupies one time slot, giving time slots {1, 2, …, k, …, 1000}. The information sender can adjust three wireless communication security policies, namely frequency hopping, power control, and the coding/modulation scheme, to cope with jamming attacks in wireless communication. The i-th security policy p_i (1 ≤ i ≤ 3) has L_i feasible values; the action space formed by all possible security policies is T, and the number of actions is L = ∏_{i=1}^{3} L_i.
The communication system has two performance evaluation indexes, the delay d_1 and the bit error rate d_2, and the normal-communication condition is d_1 ≤ 0.4 s and d_2 ≤ 0.01%. The information sender can sense two pieces of communication state information: the channel state and the transmission information type. Construct three neural networks V, S, and A, each with three fully-connected layers: network V contains 4 input neurons, 128 hidden neurons, and L output neurons; network S contains 4 input neurons, 128 hidden neurons, and 1 output neuron; network A contains 4 input neurons, 128 hidden neurons, and L output neurons. Randomly initialize the weight matrices θ, ω, and ψ of the three neural networks; set the learning parameter ζ = 0.5, an empty buffer C, the sample size B = 64, the random exploration probability η = 0.05, and the initial performance d_1^(0) = 1 and d_2^(0) = 0.001.
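Using the make_net helper from the earlier sketch, this configuration can be instantiated as follows; the value of L and the optimizer choice are assumptions, since the patent states neither.

```python
import torch

L = 27                          # e.g. 3 feasible values per policy: 3 * 3 * 3 (assumed)
V = make_net(4, 128, L)         # 4 = M + J input neurons, 128 hidden neurons
S = make_net(4, 128, 1)
A = make_net(4, 128, L)
params = [*V.parameters(), *S.parameters(), *A.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-3)   # optimizer and rate are assumptions
zeta, B, eta = 0.5, 64, 0.05
d0 = [1.0, 0.001]               # initial performance d1(0) = 1, d2(0) = 0.001
```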
Step 2: in the k-th time slot, the information sender receives the performance evaluation indexes of the previous time slot's communication system, namely the delay d_1^(k-1) and the bit error rate d_2^(k-1), obtains the channel state o_1^(k) and the transmission information type o_2^(k) by perception and computation, and constructs the current system state s^(k) = [d_1^(k-1), d_2^(k-1), o_1^(k), o_2^(k)].
Step 3: the information sender feeds the state s^(k) to network V, network S, and network A, respectively. Denote the output of network V by V = {V_m}_{1≤m≤L}, representing the values of the different actions; denote the output of network S by S, representing the risk value of the current state; denote the output of network A by A = {A_m}_{1≤m≤L}, representing the risk values of taking the different actions in the current state. The outputs of network S and network A together form the risk degree X = {X_m}_{1≤m≤L} of the state-action pairs.
Step 4: compute the Q-value vector Q = V - X. With probability 0.95, the information sender selects the frequency hopping, power, and coding/modulation actions p_1, p_2, p_3 with the largest corresponding Q values; with probability 0.05, it randomly selects other security policies. According to the obtained action combination P^(k) = [p_1, p_2, p_3], it adjusts the wireless communication security policy and sends a data packet to the information receiver.
Step 5: after receiving the data packet, the information receiver calculates the performance evaluation indexes d_1^(k) and d_2^(k) of the current communication system and feeds them back to the information sender.
Step 6: the information sender receives the performance evaluation indexes and calculates the benefit u^(k) according to the following formula:
u^(k) = -d_1^(k) - 1000 · d_2^(k)
Step 7: the information sender evaluates the risk degree r^(k) of the current state-action pair through the following formula, where I(·) is an indicator function that equals 0 if the condition in parentheses holds and 1 otherwise, used to measure the risk degree:
r^(k) = I(d_1^(k) ≤ 0.4 s) + I(d_2^(k) ≤ 0.01%)
and 8: four-tuple x (k) ={s (k) ,P (k) ,u (k) ,r (k) Storing the data into a buffer area C, if the number of the data in the buffer area is more than or equal to the sampling number B, randomly extracting B pieces of data { χ ] from the buffer area (i) } 1≤i≤B And updating parameters of network V, network S and network A by the following equations
Figure BDA0003809499590000056
ω (k) And psi (k) Where V (-), S (-), and A (-) represent the output values of network V, network S, and network A, respectively:
Figure BDA0003809499590000055
Figure BDA0003809499590000061
Step 9: repeat steps 2-8 until the performance evaluation indexes of the communication system meet the normal-communication requirements, i.e., d_1 ≤ 0.4 s and d_2 ≤ 0.01%.
Fig. 1 compares the image-transmission packet loss rate of the safe exploration reinforcement learning method for wireless communication security of this embodiment with that of the DQN algorithm proposed by ***. Fig. 2 compares the communication interruption probability of the method of this embodiment with that of the DQN method proposed by ***. Fig. 3 compares the communication energy consumption of the method of this embodiment with that of the DQN method proposed by ***. The method evaluates the risk values of taking different actions in the current state according to the performance evaluation indexes and communication requirements of the communication system, introduces the state risk network and the action risk network to distinguish state risk from action risk, improves the fitting accuracy of the action risk degree, uses the action risk degree to correct action selection, and avoids exploring dangerous strategies, thereby reducing the exploration of risky strategies in wireless communication security applications and improving the security of wireless communication.
The above-described embodiments are merely preferred embodiments of the present invention, and should not be construed as limiting the scope of the invention. All equivalent changes and modifications made within the scope of the present invention shall fall within the scope of the present invention.

Claims (9)

1. A safe exploration reinforcement learning method for wireless communication security, characterized by comprising the following steps:
Step 1: construct three neural networks, each with three fully-connected layers: network V, network S, and network A, and initialize their parameters;
Step 2: in the k-th time slot, the information sender receives the performance evaluation indexes of the previous time slot's communication system, obtains communication state information by perception and computation, and constructs the current system state s^(k);
Step 3: the information sender feeds the state s^(k) to network V, network S, and network A; the outputs of network S and network A together form the risk degree X of the state-action pairs;
Step 4: with probability 1 - η, the information sender selects the action p_i with the largest corresponding Q value, and with probability η randomly selects other security policies; according to the obtained action combination P^(k), it adjusts the wireless communication security policy and sends a data packet to the information receiver;
Step 5: after receiving the data packet, the information receiver calculates the performance evaluation indexes {d_i^(k)}_{1≤i≤M} of the current communication system and feeds them back to the information sender;
Step 6: the information sender receives the performance evaluation indexes and calculates the benefit u^(k) through the benefit function f:
u^(k) = f(d_1^(k), d_2^(k), …, d_M^(k))
Step 7: the information sender evaluates the risk degree r^(k) of the current state-action pair;
Step 8: store the quadruple χ^(k) = {s^(k), P^(k), u^(k), r^(k)} in the buffer C; if the number of entries in the buffer is greater than or equal to the sample size B, randomly draw B entries {χ^(i)}_{1≤i≤B} from the buffer and update the parameters θ^(k), ω^(k), and ψ^(k) of network V, network S, and network A;
Step 9: repeat steps 2-8 until the performance evaluation indexes of the communication system meet the normal-communication requirements, i.e., d_i^(k) ≤ δ_i for all 1 ≤ i ≤ M.
2. The method as claimed in claim 1, wherein in step 1, the specific steps of constructing the three neural networks with three fully-connected layers are: the total number of data packets to be transmitted in the wireless communication system is K; transmitting one data packet occupies one time slot, giving time slots {1, 2, …, k, …, K}; the information sender adjusts N wireless communication security policies to cope with jamming attacks in wireless communication; the i-th security policy p_i (1 ≤ i ≤ N) has L_i (1 ≤ L_i ≤ N) feasible values, the action space formed by all possible security policy combinations is T, and the number of actions in the action space is L = ∏_{i=1}^{N} L_i; the communication system has M performance evaluation indexes {d_i}_{1≤i≤M}, where performance index i (1 ≤ i ≤ M) satisfies the normal-communication condition d_i ≤ δ_i; the information sender can sense J pieces of communication state information {o_i}_{1≤i≤J}; construct three networks V, S, and A with three fully-connected layers: network V contains M + J input neurons, H hidden neurons, and L output neurons; network S contains M + J input neurons, H hidden neurons, and 1 output neuron; network A contains M + J input neurons, H hidden neurons, and L output neurons.
3. The method as claimed in claim 2, wherein the N wireless communication security policies include but are not limited to frequency hopping, power control, and the coding/modulation scheme; the M performance evaluation indexes include but are not limited to delay and bit error rate; and the J pieces of communication state information include but are not limited to the channel state and the transmission information type.
4. The method as claimed in claim 1, wherein the parameter initialization in step 1 is: randomly initialize the weight matrices θ, ω, and ψ of the three neural networks; initialize the learning parameter ζ ∈ (0, 1), an empty buffer C, the sample size B, the random exploration probability η, and the initial performance {d_i^(0)}_{1≤i≤M}.
5. The method as claimed in claim 1, wherein in step 2, the specific steps of constructing the current system state s^(k) are: in the k-th time slot, the information sender receives the performance evaluation indexes {d_i^(k-1)}_{1≤i≤M} of the previous time slot's communication system, obtains the communication state information {o_i^(k)}_{1≤i≤J} by perception and computation, and constructs the current system state s^(k) = [{d_i^(k-1)}_{1≤i≤M}, {o_i^(k)}_{1≤i≤J}].
6. The method as claimed in claim 1, wherein in step 3, the information sender feeds the state s^(k) to network V, network S, and network A, and the outputs of network S and network A together form the risk degree X of the state-action pairs, specifically: the information sender takes the state s^(k) as the input of network V, network S, and network A, respectively; the output of network V is denoted V = {V_m}_{1≤m≤L}, representing the values of the different actions; the output of network S is denoted S, representing the risk value of the current state; the output of network A is denoted A = {A_m}_{1≤m≤L}, representing the risk values of taking the different actions in the current state; the outputs of network S and network A together form the risk degree X = {X_m}_{1≤m≤L} of the state-action pairs.
7. The method as claimed in claim 1, wherein in step 4, selecting the action p_i with the largest corresponding Q value with probability 1 - η, randomly selecting other security policies with probability η, adjusting the wireless communication security policy according to the obtained action combination P^(k), and sending a data packet to the information receiver specifically comprises: let the Q-value vector be Q = V - X; with probability 1 - η, select the action p_i (1 ≤ i ≤ N) with the largest corresponding Q value; with probability η, randomly select other security policies; according to the obtained action combination P^(k) = [p_1, p_2, …, p_N], adjust the wireless communication security policy and send the data packet to the information receiver.
8. The method as claimed in claim 1, wherein in step 7, the information sender evaluates the risk degree r^(k) of the current state-action pair specifically as follows: the information sender evaluates the risk degree r^(k) of the current state-action pair through the formula r^(k) = Σ_{i=1}^{M} I(d_i^(k) ≤ δ_i), where I(·) is an indicator function that equals 0 if the condition in parentheses holds and 1 otherwise, used to measure the risk degree.
9. The method as claimed in claim 1, wherein in step 8, the parameters θ^(k), ω^(k), and ψ^(k) of network V, network S, and network A are updated using the B randomly drawn entries, where V(·), S(·), and A(·) denote the output values of network V, network S, and network A, respectively.
CN202211007434.5A 2022-08-22 2022-08-22 Safe exploration reinforcement learning method for wireless communication security Active CN115361686B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211007434.5A CN115361686B (en) 2022-08-22 2022-08-22 Safe exploration reinforcement learning method for wireless communication security

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211007434.5A CN115361686B (en) 2022-08-22 2022-08-22 Safe exploration reinforcement learning method for wireless communication security

Publications (2)

Publication Number Publication Date
CN115361686A true CN115361686A (en) 2022-11-18
CN115361686B CN115361686B (en) 2024-05-03

Family

ID=84003516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211007434.5A Active CN115361686B (en) Safe exploration reinforcement learning method for wireless communication security

Country Status (1)

Country Link
CN (1) CN115361686B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190228309A1 (en) * 2018-01-25 2019-07-25 The Research Foundation For The State University Of New York Framework and methods of diverse exploration for fast and safe policy improvement
US20200076857A1 (en) * 2018-08-31 2020-03-05 Microsoft Technology Licensing, Llc Secure exploration for reinforcement learning
CN112291495A (en) * 2020-10-16 2021-01-29 厦门大学 Wireless video low-delay anti-interference transmission method based on reinforcement learning
KR20220102395A (en) * 2021-01-13 2022-07-20 부경대학교 산학협력단 System and Method for Improving of Advanced Deep Reinforcement Learning Based Traffic in Non signalalized Intersections for the Multiple Self driving Vehicles

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CANHUANG DAI et al.: "Reinforcement Learning with Safe Exploration for Network Security", ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 17 April 2019 (2019-04-17)
WU JINLI: "Research on 5G millimeter-wave beam management technology for UAV applications", China Master's Theses Full-text Database, 15 May 2021 (2021-05-15)
XU TANGWEI et al.: "Reinforcement-learning-based low-latency group key distribution and management technology for the Internet of Vehicles", Chinese Journal of Network and Information Security, vol. 6, no. 5, 31 October 2020 (2020-10-31)

Also Published As

Publication number Publication date
CN115361686B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
Sagduyu et al. IoT network security from the perspective of adversarial deep learning
Shi et al. Spectrum data poisoning with adversarial deep learning
Xiao et al. Anti-jamming underwater transmission with mobility and learning
Abdalzaher et al. A deep autoencoder trust model for mitigating jamming attack in IoT assisted by cognitive radio
CN111294058A (en) Channel coding and error correction decoding method, equipment and storage medium
Parras et al. Learning attack mechanisms in wireless sensor networks using Markov decision processes
Sattiraju et al. AI-assisted PHY technologies for 6G and beyond wireless networks
CN113726471B (en) Parameter optimization method of intelligent reflection surface auxiliary MIMO hidden communication system
CN113225794B (en) Full-duplex cognitive communication power control method based on deep reinforcement learning
US11611457B2 (en) Device and method for reliable classification of wireless signals
Wang et al. A robust cooperative spectrum sensing scheme based on Dempster-Shafer theory and trustworthiness degree calculation in cognitive radio networks
DelVecchio et al. Effects of forward error correction on communications aware evasion attacks
Dai et al. Reinforcement learning based power control for vanet broadcast against jamming
Xu et al. A new anti-jamming strategy based on deep reinforcement learning for MANET
Lee et al. Robust transmit power control with imperfect CSI using a deep neural network
AlQerm et al. Adaptive multi-objective Optimization scheme for cognitive radio resource management
CN115361686A (en) Safety exploration reinforcement learning method oriented to wireless communication safety
Wang et al. Adaptive resource allocation for semantic communication networks
CN113453220A (en) Security method for resisting trust attack of wireless sensor network
CN112329523A (en) Underwater acoustic signal type identification method, system and equipment
CN116667966A (en) Intelligent interference model rewarding poisoning defense and training method and system
Zhang et al. Resource management for heterogeneous semantic and bit communication systems
WO2019237475A1 (en) Secure multi-user pilot authentication method based on hierarchical two dimensional feature coding
Xu et al. Finite Blocklength covert communications: When the warden wants to detect the communications quickly
Arjoune et al. Real-time machine learning based on hoeffding decision trees for jamming detection in 5G new radio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant