CN112115469B - Edge intelligent mobile target defense method based on Bayes-Stackelberg game - Google Patents

Edge intelligent mobile target defense method based on Bayes-Stackelberg game

Info

Publication number
CN112115469B
CN112115469B (application CN202010966915.3A)
Authority
CN
China
Prior art keywords
model
models
edge
student
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010966915.3A
Other languages
Chinese (zh)
Other versions
CN112115469A (en)
Inventor
钱亚冠
关晓惠
王滨
陶祥兴
周武杰
云本胜
陈晓霞
李蔚
楼琼
吴淑慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Zhejiang University of Water Resources and Electric Power
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Zhejiang University of Water Resources and Electric Power
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd and Zhejiang University of Water Resources and Electric Power
Priority to CN202010966915.3A
Publication of CN112115469A
Application granted
Publication of CN112115469B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks


Abstract

The invention discloses an edge intelligent mobile target defense method based on the Bayes-Stackelberg game, and provides a dynamic defense mechanism called edge intelligence moving target defense (EI-MTD). The method first uses differential knowledge distillation to transfer knowledge from a complex teacher model in the cloud data center into small member models suitable for deployment on edge nodes. The member models are then dynamically scheduled with a Bayes-Stackelberg game strategy, so that an attacker cannot determine which target model actually performs the classification task. This defense mechanism effectively prevents an attacker from selecting an optimal proxy model for crafting adversarial examples, thereby blocking black-box attacks. Experiments on the ILSVRC2012 image dataset show that the proposed EI-MTD can effectively protect edge intelligence against malicious black-box attacks.

Description

Edge intelligent mobile target defense method based on Bayes-Stackelberg game
Technical Field
The invention relates to security technology for edge intelligent computing, and provides an edge intelligent moving target defense method based on the Bayes-Stackelberg game.
Background
Artificial intelligence based on deep learning has been successfully applied in many fields, from facial recognition and natural language processing to computer vision. With the rapid development of intelligent technology, daily life has changed greatly, and people increasingly rely on the convenient services it provides, expecting to enjoy intelligent services anytime and anywhere. Over the past few years, edge computing has moved from theory to application, and various applications have been developed to improve our lives. The maturation of deep learning techniques and edge computing systems, together with the growing demand for intelligent life, has driven the development and implementation of Edge Intelligence (EI). Current EI implementations are based on deep learning models, namely Deep Neural Networks (DNNs), which are deployed to devices at the edge of the network (e.g., smart cameras in monitoring systems) to provide real-time applications such as object recognition and anomaly detection.
Currently, the security of edge intelligence is a widely discussed issue. Existing work focuses on the data privacy of edge intelligence, but pays insufficient attention to adversarial example attacks. Previous work has shown that DNNs are extremely vulnerable to adversarial examples. An adversarial example is an input image with a carefully designed tiny perturbation added to fool a deep neural network. Adversarial examples have a special property called transferability: adversarial examples generated for one model can often also successfully fool other models. Transferability is especially high between models with similar architecture, low model capacity, and high test accuracy. In theory, this property allows an attacker to craft adversarial examples on a local proxy model and attack the target model without knowing any information about it, which is called a black-box attack. In practice, an attacker can find a proxy model close to the target model by repeatedly querying the target model, thus obtaining a higher attack success rate that approaches the effect of a white-box attack.
Due to limited computing and storage resources on edge nodes, which include edge devices and edge servers, model compression is considered an effective way to reduce model size. However, the robustness of a model is positively correlated with its size, so the compressed models on edge nodes are more vulnerable to adversarial examples. In addition, most currently proposed defenses against adversarial examples require abundant GPU computing resources and are not applicable to edge nodes. The limited availability of resources therefore restricts the application of edge intelligence in sensitive areas.
We summarize the security challenges faced by edge intelligent computing as follows: (1) how to prevent an attacker from finding an optimal proxy model; (2) how to reduce the transferability of adversarial examples without compromising accuracy on normal samples; (3) how to defend against adversarial examples on resource-limited edge nodes.
Disclosure of Invention
The invention provides an edge intelligent moving target defense method to solve the above problems. For the first challenge, we change the static target model into a dynamic one that randomly schedules the classification service. Since the attacker does not know which model actually serves the request, he cannot estimate which candidate proxy model is close to the target model. For the second challenge, we try to increase the differences between the models deployed on edge nodes. We use the gradient of the loss function as the basis of the difference metric, since current attacks mainly use the gradient to craft adversarial examples. For the third challenge, we use transfer learning to distill knowledge from a powerful, large-capacity teacher model in the cloud data center into small-capacity student models. The benefit of this approach is that classification knowledge and robustness are transferred while the model size is compressed.
The present invention integrates these solutions into a defense framework referred to as edge intelligence moving target defense (EI-MTD). To this end, we construct EI-MTD as follows: (1) using the powerful GPUs of a cloud data center, a robust teacher model is obtained through adversarial training; (2) the robust knowledge of the teacher model is transferred into the student models through differential knowledge distillation to obtain diversity; and (3) the student models are switched with a Bayes-Stackelberg game strategy, making a trade-off between accuracy and security.
The invention realizes the above purpose through the following technical scheme:
the invention provides an EI-MTD system comprising three key technologies, namely countermeasure training, differential knowledge distillation and dynamic scheduling of a service model. We use countermeasure training to obtain a powerful teacher model of the cloud data center. Robust knowledge is then extracted from the teacher model using transfer learning and resources are limited to the small-scale student model. Meanwhile, differential regularization terms are added to obtain diversity of the extraction model, and the anti-sample transferability is effectively inhibited. These student models, also known as member models in a mobile target environment, are further used in a service dynamic scheduling scheme to schedule service users. Thanks to the diversity obtained by differential knowledge distillation, dynamic scheduling can perfectly confuse an attacker to find the best proxy model, as shown on the right side of fig. 1.
The invention comprises the following steps:
S1: Adversarial training of the teacher model. Assume that the cloud data center already has a training data set D and a teacher model F_t(θ_t). A 101-layer ResNet (ResNet-101) is used as the teacher model, FGSM adversarial examples are used for adversarial training in the cloud data center, and the "Fast" adversarial training method is combined to accelerate the process. Prior work has shown that adversarial training allows a larger-capacity network to achieve better robustness.
S2: Differential knowledge distillation of the student models. First, the soft label ỹ_i of each sample x_i at an appropriate distillation temperature T is obtained from the teacher model F_t(θ_t), and a new training data set D̃ = {(x_i, ỹ_i)} is created. To obtain diversity among student models, a new loss function with the regularization term CS_coherence is defined, L = (T²/K)·Σ_k J_k + λ·CS_coherence, and all K student models F_s(θ^(1)), ..., F_s(θ^(K)) are trained simultaneously to minimize the common loss function L. Note that in the present invention the student model, the member model, and the target model refer to the same object: it is called the student model in knowledge distillation and the member model in dynamic scheduling.
S3: Dynamic service scheduling of the member models. After differential knowledge distillation, the student models are deployed to edge nodes, one model per node. Here the edge nodes include edge devices and edge servers. A certain edge server is designated as the service dispatch controller, and all member models and the nodes where they reside are registered with it. When a user (including an attacker) submits an image classification request through an edge device (e.g., a smart phone), the edge device first uploads the request to the dispatch controller instead of processing it on the local model. The dispatch controller then selects an edge node to perform the classification task via the Bayes-Stackelberg game. The whole process is transparent to the attacker, who cannot know which edge node ultimately provides the service.
Further, according to step S3, the diversity of the models plays a key role in the effectiveness of dynamic scheduling. Inspired by the fact that adversarial attacks use the gradient with respect to the input as the perturbation direction, gradient alignment is employed as the diversity measure.
Assume that there are two member models F_s(θ^(1)), F_s(θ^(2)) ∈ Ω and an attacker-selected proxy model F_a ∈ U. Let ∇_x J_1 and ∇_x J_2 denote the gradients of the loss functions of F_s(θ^(1)) and F_s(θ^(2)) with respect to a sample x. If the angle between ∇_x J_1 and ∇_x J_2 is small enough, an adversarial example x_adv that makes F_s(θ^(1)) misclassify can also make F_s(θ^(2)) misclassify; the difference between F_s(θ^(1)) and F_s(θ^(2)) is therefore related to the angle between ∇_x J_1 and ∇_x J_2. Cosine similarity (CS) is used to represent the alignment of ∇_x J_1 and ∇_x J_2:
CS(∇_x J_1, ∇_x J_2) = ⟨∇_x J_1, ∇_x J_2⟩ / (‖∇_x J_1‖·‖∇_x J_2‖)
where ⟨∇_x J_1, ∇_x J_2⟩ is the inner product of ∇_x J_1 and ∇_x J_2. If CS(∇_x J_1, ∇_x J_2) = −1, the gradient directions of ∇_x J_1 and ∇_x J_2 are opposite, meaning that an x_adv that makes F_s(θ^(1)) misclassify cannot make F_s(θ^(2)) misclassify.
Further, in step S2, cosine similarity is applied to the training process of the student models so as to obtain a set of member models with large mutual differences. Since cosine similarity is calculated from two gradients, it is extended to K models by defining the maximum over pairwise cosine similarities as the EI-MTD diversity metric:
CS_coherence = max_{1≤a<b≤K} CS(∇_x J_a(x, ỹ; θ^(a)), ∇_x J_b(x, ỹ; θ^(b)))
where J_a and J_b denote the loss functions of student models F_s(θ^(a)) and F_s(θ^(b)), θ^(a) and θ^(b) denote their parameters, and ỹ is the soft label of x obtained from the teacher model. Since CS_coherence is a non-smooth function, gradient-descent optimization cannot be used directly, so the LogSumExp function is further used to smoothly approximate CS_coherence:
CS_coherence ≈ log Σ_{1≤a<b≤K} exp(CS(∇_x J_a, ∇_x J_b))
The student models are distilled from the teacher model of the cloud data center, and diversity among them must be guaranteed during distillation, so the regularization term CS_coherence is added to the knowledge distillation process and a new distillation loss function is defined:
L = (T²/K)·Σ_{k=1}^{K} J_k(x, ỹ; θ^(k)) + λ·CS_coherence
where λ is the regularization coefficient controlling the importance of CS_coherence during training. So that the student models fully learn the adversarial knowledge of the teacher model, β = 1 is set, i.e., the student models are trained using only soft-label examples. The differential knowledge distillation procedure (Algorithm 1) is detailed in Section 3.2 below.
Further, in step S3, the student models, i.e., the member models, obtained by differential knowledge distillation are deployed to the edge nodes. When an edge device receives an image, it does not perform classification with its own model but forwards the image to the dispatch controller. The dispatch controller selects a registered service model according to a scheduling policy, specifically:
In an adversarial environment, both the defender and the attacker wish to maximize their "payoff" through some strategy, which is a typical game problem. In the present invention, the Bayes-Stackelberg game is used to model the scheduling policy. The defender's strategy is to select an appropriate classification service model, while the attacker's strategy is to select an optimal proxy model to generate adversarial examples. The invention expresses the Bayes-Stackelberg game as a seven-tuple (L, S_L, S_F^(1), S_F^(2), R^(1), R^(2), P), where L is the defender; S_L is the set of student models obtained after differential distillation, {F_s(θ^(1)), ..., F_s(θ^(K))}; the follower types F include two types, the legal user F^(1) and the attacker F^(2); the action space S_F^(1) of the legal user F^(1) contains only one action, namely requesting service with legal samples; the action space S_F^(2) of the attacker F^(2) is to select different proxy models F_a ∈ U; the payoff of the defender L and of the legal user F^(1) is defined as the classification accuracy of the member model on natural images; the payoff of the defender L against the attacker F^(2) is the classification accuracy of the member model on adversarial examples, while the payoff of the attacker F^(2) is defined as the adversarial example attack success rate; P^(1) denotes the probability that the legal user F^(1) appears and P^(2) the probability that the attacker F^(2) appears. The model scheduling policy problem based on the Bayes-Stackelberg game is then converted into the following mixed-integer quadratic program (MIQP):
max_{s, q, v}  Σ_c Σ_i Σ_j P^(c) R^(c)_{ij} s_i q^(c)_j
s.t.  Σ_i s_i = 1,  0 ≤ s_i ≤ 1
      Σ_j q^(c)_j = 1,  q^(c)_j ∈ {0, 1}
      0 ≤ v^(c) − Σ_i C^(c)_{ij} s_i ≤ (1 − q^(c)_j)·N,  v^(c) ∈ ℝ
where R^(c)_{ij} and C^(c)_{ij} are the payoffs of the defender and of follower type c when member model i serves a follower playing action j, and N is a large positive number. Here P^(1) = 1 − α, P^(2) = α, and s = (p_1, p_2, ..., p_K) is the member-model scheduling strategy obtained by solving the problem, with p_i the probability that member model F_s(θ^(i)) is selected; q^(c) is the pure strategy of user F^(c) and v^(c) its payoff. The above problem can be solved using the DOBSS algorithm.
The invention is the first to study a defense against adversarial example attacks for an edge intelligence system. Sailik et al. were the first to propose defending against adversarial examples with moving target defense (MTDeep), but our approach differs from theirs in two ways: first, they did not consider the application scenario of edge intelligence and targeted only cloud-platform applications; second, they did not consider the diversity of the member models, so the final defense effect is limited. The HRS (Hierarchical Random Switching) network proposed by Wang et al. places several parallel network modules inside one network that can be switched randomly during forward propagation, whereas our method randomly switches entire networks, so the switching strategies are different. Abhishek et al. analyzed the effect of an attacker's bounded rationality on the performance of MTD realized with the Stackelberg game, and the results show that an MTD game framework designed for rational attackers is sufficient to defend against boundedly rational attackers; the method of the invention therefore also assumes a rational attacker. Song et al. proposed fMTD to detect and defend against adversarial examples based on the observation that different models exhibit different patterns: fMTD retrains the base model with different adversarial examples to obtain a set of fork models, detects adversarial examples using the consistency of the outputs across the fork models, and classifies them correctly by majority voting; the MTD aspect lies in the fact that the system can still dynamically generate adversarial examples for adversarial retraining after deployment, so the set of fork models changes dynamically over time. Unlike the method of the present invention, fMTD still requires forward inference over multiple models and cannot be deployed on resource-constrained edge devices.
The invention has the beneficial effects that:
the invention relates to an edge intelligent mobile target defense method based on Bayes-Stackelberg game, which has the following advantages compared with the prior art:
(1) The invention is the first to propose a defense against adversarial attacks for an edge intelligence system. The proposed EI-MTD fits well with the inference architecture of edge devices and edge servers, i.e., each deep learning model performs inference independently on an edge node, enabling dynamic execution. This dynamic scheduling mechanism is completely transparent to the user and does not degrade classification accuracy.
(2) To suppress transferability, the invention proposes differential knowledge distillation to increase the diversity of the member models on edge nodes. Unlike knowledge distillation of a single model, the invention distills multiple student models simultaneously with a common loss function. In addition, the method compresses the model size at the same time, overcoming the resource limitations of edge nodes.
(3) An EI simulation platform is built with a GPU server, a PC, and Raspberry Pi devices to test our EI-MTD. The experiments use the real image dataset ILSVRC2012. The experimental results indicate that EI-MTD can defend against 80% of the adversarial examples generated by M-DI2-FGSM.
Drawings
FIG. 1 is a static target model and a dynamic target model.
In the figure: on the left is a typical device-based static service attack architecture. An attacker attacks node K with a "cat" adversarial example, knowing that the model on node K performs the classification. On the right is the dynamic scheduling target model scheme. Although the attacker tries to attack node K, he does not know which model actually performs the classification.
FIG. 2 is a framework of EI-MTD;
in the figure: the black line represents the process of member model deployment to edge nodes, and the red line represents the process of EI-MTD classification challenge samples.
FIG. 3 shows the top-1 and top-5 accuracy of the teacher model during adversarial training. After each training epoch, the teacher model is tested with two data sets, one containing clean samples and the other containing PGD samples.
FIG. 4 shows the accuracy of the member models after normal training and after differential knowledge distillation.
FIG. 5 compares the accuracy of EI-MTD with that of the single member models at different probabilities of occurrence of an attacker. Note that these member models are somewhat robust, as they are distilled from the teacher model.
FIG. 6 is the accuracy of EI-MTD at different distillation temperatures T.
FIG. 7 shows the values of the differential immunity γ at different distillation temperatures T.
FIG. 8 is the effect of differential immunity γ on EI-MTD.
Fig. 9 is the accuracy of EI-MTD at different regularization coefficients λ.
Fig. 10 shows the values of the differential immunity γ at different regularization coefficients λ.
Fig. 11 is the effect of differential immunity γ on EI-MTD at temperature t=10.
FIG. 12 is a heat map of the differential immunity γ and accuracy of EI-MTD for different combinations of temperature T and regularization coefficient λ. The left column represents the differential immunity γ and the right column the classification accuracy. Panels (a), (b), (c) and (d) correspond to different methods of generating adversarial examples: FGSM, PGD, MI-FGSM and M-DI2-FGSM, respectively.
Detailed Description
The invention is further described in connection with the following specific examples:
1. Preliminary knowledge
1.1 Deep neural networks and adversarial examples
Deep learning models (DNNs) can generally be represented by a mapping function F(x, θ): ℝ^d → ℝ^L, where x ∈ ℝ^d is the input sample variable, θ denotes the parameters of the DNN, and L is the number of classes predicted by the DNN. DNNs with a Softmax output layer are used here, where the Softmax function is defined as:
Softmax(z)_i = e^{z_i} / Σ_{j=1}^{L} e^{z_j}
The DNN can then be represented as F(x) = Softmax(z), where z denotes the output vector of the last hidden layer (the logits). Given an input sample x ∈ X, the predicted label of the DNN is y' = argmax_{i∈{1,...,L}} F(x)_i, where the probability value F(x)_{y'} is called the confidence score of the prediction. The goal of training a DNN is to make the gap between its prediction y' and the true label y as small as possible. The loss on an input-label pair (x, y) is denoted J(x, y, θ); the training objective used here is the cross-entropy loss, defined as J(x, y, θ) = −1_y · log(Softmax(z(x, θ))), where 1_y is the one-hot encoding of the true label and the logarithm of a vector is taken element-wise.
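For illustration, a small PyTorch sketch of the notation above (the model object, tensor shapes, and function names are assumptions of this sketch): the logits z, the Softmax output F(x), the predicted label y', the confidence score F(x)_{y'}, and the cross-entropy loss J.

```python
import torch
import torch.nn.functional as F_nn

def predict_with_confidence(model, x):
    """z is the logits vector, F(x) = Softmax(z), y' = argmax_i F(x)_i,
    and F(x)_{y'} is the confidence score of the prediction."""
    z = model(x)                      # logits, shape (batch, L)
    probs = F_nn.softmax(z, dim=1)    # F(x)
    conf, y_pred = probs.max(dim=1)   # confidence score and predicted label
    return y_pred, conf

def cross_entropy_loss(model, x, y):
    # J(x, y, theta) = -1_y . log Softmax(z(x, theta))
    return F_nn.cross_entropy(model(x), y)
```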
An adversarial example adds a perturbation r, imperceptible to the human eye, to an input sample x so that a model with a certain generalization ability misclassifies it. Specifically, x_adv = x + r with ‖r‖_p ≤ ε, such that the DNN prediction satisfies argmax_i F(x_adv)_i ≠ y, or argmax_i F(x_adv)_i = t, where t is a class specified by the attacker. The l_∞ norm is used here to measure the magnitude of the perturbation r, i.e., ‖r‖_∞ ≤ ε.
1.2 Gradient-based attacks
When the model information of a DNN is known, a white-box attack can construct an adversarial example under the constraint ‖x_adv − x‖_p ≤ ε by maximizing the loss function. This section introduces attack methods that generate adversarial examples based on gradient optimization.
FGSM (Fast Gradient Sign Method): the first method, proposed by Goodfellow et al., to generate adversarial examples from model gradient information. It obtains an adversarial example x_adv by maximizing the loss function J(x, y, θ); to this end, the perturbation r is taken in the direction of maximal gradient change of the loss with respect to x, i.e., x_adv = x + ε·sign(∇_x J(x, y; θ)), where sign(·) is the sign function, ∇_x J(x, y; θ) is the gradient of the loss function with respect to the input x, and ‖r‖_∞ ≤ ε.
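A minimal PyTorch sketch of FGSM as described above, assuming image tensors normalized to [0, 1]; it is illustrative rather than the exact implementation used in the experiments.

```python
import torch

def fgsm(model, loss_fn, x, y, eps):
    """One-step FGSM: x_adv = x + eps * sign(grad_x J(x, y; theta))."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()   # assumed pixel range [0, 1]
```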
PGD (Projected Gradient Descent): Madry et al. extend FGSM into an iterative method for finding adversarial perturbations, i.e.,
x_adv^{t+1} = Clip_ε(x_adv^{t} + α·sign(∇_x J(x_adv^{t}, y; θ)))
where t is the iteration index, the step size is α = ε/T with T the total number of iterations, and Clip_ε(·) clips the perturbation to within the ε constraint.
MI-FGSM: dong et al replace the iterative part of the PGD with momentum iterations to stabilize the gradient direction from entering local maxima. The momentum iteration method based on gradient descent is expressed as:wherein the method comprises the steps ofu is the momentum term decay factor.
MDI2-FGSM: xie et al propose input transformations on samples after each iteration is completed, in order to increase the black box attack rate of the multi-step iterative method, specifically:wherein p represents the probability of transformation, the random transformation function->
1.3 Adversarial training
Adversarial training is a learning method for DNNs that improves their adversarial robustness. It was first proposed by Goodfellow et al., who argued that adversarial examples can confuse DNNs because of insufficient training data; therefore, to defend against adversarial examples, they proposed generating a large number of adversarial examples with FGSM and retraining the DNN with their correct labels as part of the training data. Madry et al. formulate adversarial training as the following robust optimization problem:
min_θ E_{(x,y)∼D} [ max_{‖r‖_p ≤ ε} J(x + r, y, θ) ]
They propose to solve the inner maximization problem with the PGD method, and then train on the generated adversarial examples to solve the outer minimization problem. However, adversarial training has a gradient computation complexity of O(MN) per batch, where M is the amount of data and N is the number of PGD iterations, which is N times that of standard training, O(M).
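A hedged sketch of one epoch of adversarial training in PyTorch, approximating the inner maximization with the pgd_attack routine from the attack sketch above; the data loader, optimizer, and hyper-parameters are assumptions, not the configuration used in the experiments.

```python
import torch

def adversarial_training_epoch(model, loader, optimizer, eps, pgd_steps):
    """Robust-optimization loop: the inner max is approximated with PGD
    (pgd_attack defined in the attack sketch above), the outer min with a
    standard gradient step on the adversarial batch."""
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for x, y in loader:
        # inner max: craft adversarial examples for the current parameters
        x_adv = pgd_attack(model, loss_fn, x, y, eps=eps, steps=pgd_steps)
        # outer min: update theta on the adversarial batch
        optimizer.zero_grad()
        loss = loss_fn(model(x_adv), y)
        loss.backward()
        optimizer.step()
```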
1.4 Knowledge distillation
Hinton first proposed knowledge distillation, arguing that the prediction vector of a model contains structural information between classes, so that part of the redundancy of a neural network can be removed and the network structure compressed. Specifically, for a trained teacher model F_t(θ) whose logits-layer output is Z = (Z_1(x), ..., Z_L(x)), the softmax function is redefined as:
F_i(x; T) = e^{Z_i(x)/T} / Σ_{j=1}^{L} e^{Z_j(x)/T}
where the parameter T is the temperature. The resulting prediction ỹ = F(x; T) is called the soft label, and the original label y of sample x is called the hard label. Soft and hard labels together can train a student model better than hard labels alone. Training the student model minimizes the knowledge distillation loss:
L_KD = β·T²·J(x, ỹ; θ_s) + (1 − β)·J(x, y; θ_s)
where ỹ is the soft label generated by the teacher model and β is the weight that balances the soft-label and hard-label losses during student training.
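A small PyTorch sketch of the distillation loss above; the β-weighted combination of soft-label and hard-label terms and the T² rescaling follow the formula given here, while the function signature is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F_nn

def distillation_loss(student_logits, teacher_logits, y_hard, T, beta):
    """Knowledge-distillation loss: soft labels are the teacher's temperature-T
    softmax, and beta weights the soft-label term against the hard-label
    cross-entropy (beta=1 uses soft labels only)."""
    soft_targets = F_nn.softmax(teacher_logits / T, dim=1)          # y_tilde
    log_probs_T = F_nn.log_softmax(student_logits / T, dim=1)
    soft_loss = -(soft_targets * log_probs_T).sum(dim=1).mean()     # CE on soft labels
    hard_loss = F_nn.cross_entropy(student_logits, y_hard)
    # T^2 rescaling keeps soft-label gradients on the same scale as hard-label ones
    return beta * (T ** 2) * soft_loss + (1.0 - beta) * hard_loss
```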
1.5 Bayes-Stackelberg game
The Stackelberg game is a non-cooperative game with a priority of decision: its participants (players) include a leader L, who acts first, and followers F, who act afterwards. We use a six-tuple G = (L, F, S_L, S_F, R_L, R_F) to represent the Stackelberg game, where S_L is the action space of the leader, S_F the action space of the follower, R_L the payoff function of the leader, and R_F the payoff function of the follower. A payoff function is a function defined on action combinations, R_i: [S_L] × [S_F] → ℝ with i = L, F, where [S_i] denotes the index set of the action space. A pure strategy selects exactly one action, while a mixed strategy selects each action with a probability 0 ≤ p ≤ 1. In the Stackelberg game, the leader adopts a mixed strategy s and acts first; the follower F then optimizes its own payoff given the leader's strategy and responds with a pure strategy q. The leader finally solves a mixed-integer quadratic program (MIQP):
max_{s, q, v}  Σ_i Σ_j R_L(i, j) s_i q_j
s.t.  Σ_i s_i = 1,  0 ≤ s_i ≤ 1
      Σ_j q_j = 1,  q_j ∈ {0, 1}
      0 ≤ v − Σ_i R_F(i, j) s_i ≤ (1 − q_j)·N,  v ∈ ℝ
Here N is a large positive number. Solving yields the objective value as the leader's optimal payoff, where s is the leader's optimal mixed strategy, q is the follower's best-response pure strategy, and v is the follower's payoff.
In the field of information security, it is generally assumed that the leader is the defender and the follower is the attacker. Since the attacker can belong to several attack types, the Stackelberg game is extended to the situation with multiple follower types, called the Bayes-Stackelberg game, expressed as (L, F^(1), ..., F^(C), S_L, S_F^(1), ..., S_F^(C), R^(1), ..., R^(C), P) with c ∈ {1, ..., C}, i.e., the follower has C types; each follower type F^(c) has its own strategy set S_F^(c) and payoff functions, and p^(c) denotes the probability that follower F^(c) appears. In such a game the leader does not know the exact type of follower F^(c) but knows the probability distribution p^(c) over types, so it is a Stackelberg game with partial information. The leader of the Bayes-Stackelberg game finally solves the MIQP:
max_{s, q, v}  Σ_c Σ_i Σ_j p^(c) R_L^(c)(i, j) s_i q^(c)_j
s.t.  Σ_i s_i = 1,  0 ≤ s_i ≤ 1
      Σ_j q^(c)_j = 1,  q^(c)_j ∈ {0, 1}
      0 ≤ v^(c) − Σ_i R_F^(c)(i, j) s_i ≤ (1 − q^(c)_j)·N,  v^(c) ∈ ℝ
Solving yields the objective value as the leader's optimal payoff, where s is the leader's optimal mixed strategy, q^(c) is the best-response pure strategy of follower F^(c), and v^(c) is that follower's payoff.
2. Defending method
The invention provides an edge intelligent moving target defense framework comprising three key technologies: adversarial training, differential knowledge distillation, and dynamic model scheduling. We use adversarial training in the cloud data center to obtain a powerful teacher model. Next, we use transfer learning to distill robust knowledge from the teacher model into small-scale student models suited to limited resources; unlike Hinton's knowledge distillation, we add a differential regularization term to improve the diversity between student models, effectively reducing the transferability of adversarial examples. These student models, also called member models, are further scheduled dynamically. Thanks to the diversity obtained, our dynamic scheduling increases the difficulty for an attacker of finding the optimal proxy model, as shown on the right side of FIG. 1.
Adversarial training of the teacher model: suppose we have a training dataset D and a teacher model. Prior work has shown that larger-capacity networks achieve better robustness through adversarial training. Therefore, we choose a 101-layer network, ResNet-101, as the teacher model. Adversarial training is then performed at the cloud data center, and the process is accelerated with the "Fast" adversarial training method.
Differential knowledge distillation of the student models: soft labels of the training set at an appropriate distillation temperature are first obtained from the teacher model, and a new training data set is created. The essence of knowledge distillation is to train the student models with the teacher model's soft labels. To obtain diversity among the student models, we define a new loss function with a regularization term and train all student models simultaneously to minimize this common loss function. Note that in the present invention the student model, the member model, and the target model refer to the same object; the different names are used in different contexts.
Dynamic service scheduling of the member models: after differential knowledge distillation, the student models are deployed to edge nodes. Note that each edge node, including edge devices and edge servers, hosts only one student model. One edge server is designated as the dispatch controller, with which all student models, i.e., member models, are registered. When a user (including an attacker) submits an image classification request through an edge device (e.g., a smart phone), the edge device first uploads the request to the dispatch controller instead of processing it on the local model. The dispatch controller then selects an edge node, more precisely the model on it, to perform the classification, so an attacker cannot know which edge node ultimately provides the service. The edge server selects the optimal target model through the Bayes-Stackelberg game.
3.1 Difference metric
As described above, the diversity of the models plays an important role in the effectiveness of dynamic scheduling, so how to properly measure this diversity is an important issue. Inspired by the fact that adversarial attacks use the gradient with respect to the input as the perturbation direction, we use gradient alignment as the diversity measure.
Assume that there are two member models F_s(θ^(1)), F_s(θ^(2)) ∈ Ω and an attacker-selected proxy model F_a ∈ U. Let ∇_x J_1 and ∇_x J_2 denote the gradients of the loss functions of F_s(θ^(1)) and F_s(θ^(2)) with respect to a sample x. If the angle between ∇_x J_1 and ∇_x J_2 is small enough, an adversarial example x_adv that makes F_s(θ^(1)) misclassify can also make F_s(θ^(2)) misclassify; the difference between F_s(θ^(1)) and F_s(θ^(2)) is therefore related to the angle between ∇_x J_1 and ∇_x J_2. We use cosine similarity (CS) to represent the alignment of ∇_x J_1 and ∇_x J_2:
CS(∇_x J_1, ∇_x J_2) = ⟨∇_x J_1, ∇_x J_2⟩ / (‖∇_x J_1‖·‖∇_x J_2‖)
where ⟨∇_x J_1, ∇_x J_2⟩ is the inner product of ∇_x J_1 and ∇_x J_2. If CS(∇_x J_1, ∇_x J_2) = −1, the gradient directions of ∇_x J_1 and ∇_x J_2 are opposite, meaning that an x_adv that makes F_s(θ^(1)) misclassify cannot make F_s(θ^(2)) misclassify.
3.2 Differential knowledge distillation
This section further applies cosine similarity to the training process of the member models to obtain member models with greater mutual variability. Since cosine similarity is calculated from two gradients and our EI-MTD includes K models, the maximum over pairwise cosine similarities is defined as the EI-MTD diversity metric:
CS_coherence = max_{1≤a<b≤K} CS(∇_x J_a(x, ỹ; θ^(a)), ∇_x J_b(x, ỹ; θ^(b)))
where J_a and J_b denote the loss functions of member models F_s(θ^(a)) and F_s(θ^(b)), θ^(a) and θ^(b) denote their parameters, and ỹ is the soft label of x obtained from the teacher model. Since CS_coherence is a non-smooth function, first-order optimization methods such as gradient descent cannot be used directly, so the LogSumExp function is used to smoothly approximate CS_coherence:
CS_coherence ≈ log Σ_{1≤a<b≤K} exp(CS(∇_x J_a, ∇_x J_b))
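A hedged PyTorch sketch of the smoothed diversity regularizer: it computes the pairwise cosine similarities between the input gradients of the K student models on the soft-label loss and takes their LogSumExp as a smooth surrogate for the maximum. Function names and the batch-mean reduction are assumptions of this sketch.

```python
import itertools
import torch
import torch.nn.functional as F_nn

def cs_coherence(models, x, y_soft, T):
    """Smoothed diversity regularizer over K student models.
    y_soft: the teacher's soft labels for the batch x."""
    grads = []
    for model in models:
        xi = x.clone().detach().requires_grad_(True)
        log_probs = F_nn.log_softmax(model(xi) / T, dim=1)
        loss = -(y_soft * log_probs).sum(dim=1).mean()   # soft-label cross-entropy
        # create_graph=True so the regularizer can be backpropagated into the weights
        g = torch.autograd.grad(loss, xi, create_graph=True)[0]
        grads.append(g.flatten(1))
    sims = [F_nn.cosine_similarity(ga, gb, dim=1).mean()
            for ga, gb in itertools.combinations(grads, 2)]
    return torch.logsumexp(torch.stack(sims), dim=0)     # smooth max over pairs
```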
A smaller CS_coherence means a larger variability between the member models. Note that the member models are distilled from the teacher model of the cloud data center while diversity among them must also be guaranteed, so the regularization term is added to the knowledge distillation process and the new distillation loss function is defined as:
L = (T²/K)·Σ_{k=1}^{K} J_k(x, ỹ; θ^(k)) + λ·CS_coherence
where λ is the regularization coefficient controlling the importance of CS_coherence during training. So that the student models fully learn the adversarial knowledge of the teacher model, β = 1 is set; that is, we train the student models with soft-label examples only. A sketch of the differential knowledge distillation procedure (Algorithm 1) is given below.
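A hedged sketch of one training step of Algorithm 1, minimizing the joint loss L = (T²/K)·Σ_k J_k + λ·CS_coherence over all K student models simultaneously; it reuses cs_coherence from the sketch above, and the single shared optimizer over all student parameters is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F_nn

def differential_distillation_step(models, optimizer, x, y_soft, T, lam):
    """One joint update of all K student models on the shared loss
    L = (T^2/K) * sum_k J_k(x, y_soft) + lambda * CS_coherence."""
    K = len(models)
    kd_terms = []
    for model in models:
        log_probs = F_nn.log_softmax(model(x) / T, dim=1)
        kd_terms.append(-(y_soft * log_probs).sum(dim=1).mean())  # J_k on soft labels
    loss = (T ** 2 / K) * torch.stack(kd_terms).sum() + lam * cs_coherence(models, x, y_soft, T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, the optimizer would be built over the union of all K students' parameters so that every member model is updated by the same backward pass.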
3.3 Model scheduling policy
After the student models, i.e., the member models, are obtained by the differential knowledge distillation of Section 3.2, they are deployed to the edge nodes, as shown in FIG. 2. When an edge device receives an image, it does not classify it with its own model but forwards it to the dispatch controller, which selects a registered service model according to a scheduling policy. In this section we describe the scheduling policy in detail.
In an adversarial environment, both the defender and the attacker wish to maximize their "payoff" through some strategy, which is a typical game problem. In the present invention, the Bayes-Stackelberg game is used to model the scheduling policy. The defender's strategy is to select an appropriate classification service model, while the attacker's strategy is to select an optimal proxy model to generate adversarial examples. The invention expresses the Bayes-Stackelberg game as a seven-tuple (L, S_L, S_F^(1), S_F^(2), R^(1), R^(2), P), where L is the defender; S_L is the set of student models obtained after differential distillation, {F_s(θ^(1)), ..., F_s(θ^(K))}; the follower types F include two types, the legal user F^(1) and the attacker F^(2); the action space S_F^(1) of the legal user F^(1) contains only one action, namely requesting service with legal samples; the action space S_F^(2) of the attacker F^(2) is to select different proxy models F_a ∈ U; the payoff of the defender L and of the legal user F^(1) is defined as the classification accuracy of the member model on natural images; the payoff of the defender L against the attacker F^(2) is the classification accuracy of the member model on adversarial examples, while the payoff of the attacker F^(2) is defined as the adversarial example attack success rate; P^(1) denotes the probability that the legal user F^(1) appears and P^(2) the probability that the attacker F^(2) appears. The model scheduling policy problem based on the Bayes-Stackelberg game is then converted into the following mixed-integer quadratic program (MIQP):
max_{s, q, v}  Σ_c Σ_i Σ_j P^(c) R^(c)_{ij} s_i q^(c)_j
s.t.  Σ_i s_i = 1,  0 ≤ s_i ≤ 1
      Σ_j q^(c)_j = 1,  q^(c)_j ∈ {0, 1}
      0 ≤ v^(c) − Σ_i C^(c)_{ij} s_i ≤ (1 − q^(c)_j)·N,  v^(c) ∈ ℝ
where R^(c)_{ij} and C^(c)_{ij} are the payoffs of the defender and of follower type c when member model i serves a follower playing action j, and N is a large positive number. Here P^(1) = 1 − α, P^(2) = α, and s = (p_1, p_2, ..., p_K) is the member-model scheduling strategy obtained by solving the problem, with p_i the probability that member model F_s(θ^(i)) is selected; q^(c) is the pure strategy of end user F^(c), and the payoff of the end user is v^(c).
The MIQP is an NP-hard problem; the invention solves it with the Decomposed Optimal Bayesian Stackelberg Solver (DOBSS). The DOBSS algorithm has three key advantages over other solution methods. First, it represents the Bayes-Stackelberg game compactly, without converting it to normal form via the Harsanyi transformation. Second, it only needs to solve one mixed-integer linear program rather than a set of linear programs, which further improves the solving speed. Finally, it directly searches for the optimal leader strategy rather than a Nash equilibrium, enabling it to find a high-payoff Stackelberg equilibrium strategy that exploits the leader's first-mover advantage. Given the optimal leader strategy s obtained by solving, the dispatch controller schedules, according to s, the member model deployed on an edge node to serve the user.
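The following is a hedged sketch of solving the defender's scheduling strategy with PuLP. It does not reproduce the DOBSS MILP; instead it enumerates one pure best response per follower type and solves a linear program for each combination, an equivalent (if slower) approach that is adequate for a game of this size (six member models, a few proxy models, two follower types). Matrix layouts and function names are assumptions of this sketch.

```python
import itertools
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value, PULP_CBC_CMD

def solve_bayes_stackelberg(R, C, P):
    """R[c][i][j]: defender payoff when member model i serves follower type c playing j.
    C[c][i][j]: payoff of follower type c.  P[c]: probability of type c.
    Returns (expected defender payoff, mixed strategy s over the K member models)."""
    K = len(R[0])
    actions = [range(len(C[c][0])) for c in range(len(P))]
    best = (None, None)
    for combo in itertools.product(*actions):        # one pure response per type
        prob = LpProblem("stackelberg_lp", LpMaximize)
        s = [LpVariable(f"s_{i}", 0, 1) for i in range(K)]
        prob += lpSum(s) == 1
        # incentive constraints: the chosen response must be a best response to s
        for c, j_star in enumerate(combo):
            for j in actions[c]:
                prob += lpSum(C[c][i][j_star] * s[i] for i in range(K)) >= \
                        lpSum(C[c][i][j] * s[i] for i in range(K))
        # leader's expected payoff given the followers' responses
        prob += lpSum(P[c] * R[c][i][combo[c]] * s[i]
                      for c in range(len(P)) for i in range(K))
        prob.solve(PULP_CBC_CMD(msg=0))
        if prob.status == 1:                          # optimal and feasible
            obj = value(prob.objective)
            if best[0] is None or obj > best[0]:
                best = (obj, [value(v) for v in s])
    return best
```

The returned mixed strategy s is exactly what the dispatch controller samples from when routing classification requests to member models.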
4. Experiment
4.1 Experimental setup
In the experimental verification of the invention, a GPU server, a PC, and Raspberry Pi devices are used to simulate the cloud data center, the edge server, and the edge devices, respectively.
Cloud computing center: in the embodiment of the invention, an X745-G30 server equipped with four NVIDIA GeForce RTX 2080 Ti GPUs simulates the cloud computing center; Python 3.7.3, PyTorch 1.2, and other packages are used. Adversarial training of the teacher model and differential knowledge distillation of the student models are performed on this server.
Edge server: a HUAWEI MateBook 14 (2020) notebook running 64-bit Windows 10 simulates the edge server; the CPU is an Intel Core i5-10210U @ 2.11 GHz with 16 GB RAM. The DOBSS algorithm is implemented with Python 3.6 and PuLP 2.1.
Edge device: we select a set of six Raspberry Pi 3 Model B+ boards as edge devices, each with a Broadcom BCM2837B0 SoC (64-bit quad-core ARM Cortex-A53) and 1 GB LPDDR2 SDRAM. Besides the student model on each edge device, we developed a test program that sends images to the edge server at arbitrary times to simulate image classification requests.
Teacher model: the teacher model adopts ResNet-101, which has 101 layers and 33 residual blocks. The teacher model is trained on the GPUs with 1.2 million clean images and their corresponding adversarial examples, generated by the FGSM method.
Student/member model: several currently mainstream lightweight model structures, namely MobileNetV2, ShuffleNetV2, and SqueezeNet, are adopted as student/member models. On these three structures, six models are obtained with different hyper-parameters: MobileNetV2-1.0, MobileNetV2-0.75, ShuffleNetV2-0.5, ShuffleNetV2-1.0, SqueezeNet-1.0, and SqueezeNet-1.1.
Proxy model: to simulate the attacker's policy, five candidate proxy models are chosen: MobileNetV2-1.0, ShuffleNetV2-1.0, SqueezeNet-1.0, ResNet-18, and VGG-13. The structures of the first three are quite similar to those of the member models, simulating near-white-box conditions. Given a pre-trained proxy model, we generate adversarial examples with FGSM, PGD, MI-FGSM, and M-DI2-FGSM.
Data set: the experiments of the embodiment are performed on the ILSVRC2012 dataset, which contains 1000 categories, with 1.2 million images as the training set and 150,000 images as the test set. Each image has a size of 224 × 224 with three color channels. It is currently a benchmark dataset in the field of image classification.
4.2 Adversarial training and differential knowledge distillation
Accuracy of the teacher model: to ensure good knowledge transfer from the teacher model to the student models, the teacher model itself must have sufficient accuracy and robustness. We adversarially train the teacher model F_t with 1.2 million clean pictures and their corresponding adversarial examples. During training, we select 10,000 PGD adversarial examples generated from the test set to evaluate the robustness of the teacher model F_t. For the PGD method, the perturbation size is ε = 5, the iteration step size is ε/5, and the number of iterations is 20. FIG. 3 shows the effect of adversarial training on the teacher model F_t. As the number of training epochs increases, the accuracy of the teacher model F_t gradually improves. Initially, on clean examples, the teacher model has a top-1 accuracy of 11.83% and a top-5 accuracy of 15.31%; after 15 epochs of adversarial training, the top-1 accuracy improves to 64.03% and the top-5 accuracy to 82.8%. Likewise, the top-1 accuracy of the teacher model F_t on PGD adversarial examples increases from 3.37% to 52.35% and the top-5 accuracy from 13.55% to 73.71%. In conclusion, the teacher model obtains high accuracy and robustness through adversarial training, which guarantees good performance of the student models.
Accuracy of the student/member models: we use two groups of models corresponding to normal training and differential distillation, each containing 6 models, as shown in FIG. 4, which gives the top-1 and top-5 accuracy of the two groups on clean and adversarial examples. For example, under attack the normally trained ShuffleNetV2-1.0 model has a top-1 accuracy of 6.12% and a top-5 accuracy of 20.49%, whereas the same model obtained by differential distillation achieves a top-1 accuracy of 39.15% and a top-5 accuracy of 67.43%, with only a slight decrease in accuracy on clean examples. These results indicate that student models distilled from a robust teacher model offer better protection against adversarial examples while having lower model capacity, which means that the student models obtained by differential distillation are well suited to an edge intelligent computing environment.
4.3 Payoff matrix for moving target defense
The payoff matrix of the game represents the payoffs of the participants under different strategies. Each element of the payoff matrix is a tuple (a, b), where a is the classification accuracy under the adversarial example attack and b is the attack success rate. We obtain a by testing the member models on the ILSVRC2012 test set and b by testing the member models with adversarial examples generated from the test set on the proxy models. For legitimate users, their payoff is simply the accuracy of the classifier. Table 2 shows the game matrix between the defender and legitimate users in the EI-MTD framework. Tables 3, 4, 5 and 6 show the payoff matrices between the defender and the attackers (PGD, FGSM, MI-FGSM and M-DI2-FGSM). For example, (56.73, 43.27) in Table 3 means that when the attacker generates adversarial examples with PGD on the proxy model ResNet-18 and attacks the classification model MobileNetV2-1.0, the defender's payoff is the classification accuracy of 56.73% on the adversarial examples and the attacker's payoff is the attack success rate of 43.27%.
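A hedged sketch of how the payoff matrices can be assembled from the measured accuracies, following the (a, b) convention above; array layouts and the percentage scale are assumptions of this sketch.

```python
import numpy as np

def build_payoff_matrices(acc_clean, acc_adv):
    """acc_clean[i]: accuracy (%) of member model i on the clean test set.
    acc_adv[i][j]: accuracy (%) of member model i on adversarial examples
    crafted on proxy model j; the attacker's payoff is 100 - accuracy."""
    acc_clean = np.asarray(acc_clean, dtype=float)
    acc_adv = np.asarray(acc_adv, dtype=float)
    # type 1: legitimate user (single action), both sides get clean accuracy
    R_legal = acc_clean[:, None]                 # defender payoff, shape (K, 1)
    C_legal = acc_clean[:, None]                 # legitimate user's payoff
    # type 2: attacker choosing among proxy models
    R_attack = acc_adv                           # defender payoff, shape (K, num_proxies)
    C_attack = 100.0 - acc_adv                   # attack success rate
    return (R_legal, R_attack), (C_legal, C_attack)
```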
Table 3: game payoffs of the defender and the PGD attacker, where the attacker's payoff is the attack success rate (%) and the defender's payoff is the classification accuracy under attack (1 − attack success rate)
Table 4: game payoffs of the defender and the FGSM attacker, where the attacker's payoff is the attack success rate (%) and the defender's payoff is the classification accuracy under attack (1 − attack success rate)
Table 5: game payoffs of the defender and the MI-FGSM attacker, where the attacker's payoff is the attack success rate (%) and the defender's payoff is the classification accuracy under attack (1 − attack success rate)
Table 6: game payoffs of the defender and the M-DI2-FGSM attacker, where the attacker's payoff is the attack success rate (%) and the defender's payoff is the classification accuracy under attack (1 − attack success rate)
4.4 Validity of EI-MTD
Given the payoff matrices in Section 4.3, the probability vector over member models can be selected by solving the game. Since the defender's optimal policy depends on the probability α of the attacker appearing, we verify the effectiveness of EI-MTD in different situations compared to single member models without dynamic scheduling. These member models are somewhat robust because they are distilled from the robust teacher model. The results are shown in FIG. 5, in which (a), (b), (c) and (d) correspond to FGSM, PGD, MI-FGSM and M-DI2-FGSM, respectively. Next, taking PGD (FIG. 5(b)) as an example, we discuss the effectiveness of the EI-MTD defense system in terms of the probability of occurrence α of the attacker:
(1) Assuming the users are only legitimate users, i.e., all requests are clean samples, α = 0 is set and EI-MTD selects the member model MobileNetV2-1.0 with the highest clean-sample classification accuracy. This is equivalent to EI-MTD using the pure strategy MobileNetV2-1.0 without model switching.
(2) Assuming the users are only attackers, i.e., all requests are adversarial examples, α = 1 is set and the optimal scheduling policy obtained by solving is s = (0.13, 0.15, 0.16, 0.12, 0.14, 0.3); EI-MTD then randomly selects a member model according to the probability vector s. Under this strategy, the expected classification accuracy of EI-MTD is 64.57% on normal samples and 40.86% on adversarial examples, whereas for any single DNN the adversarial example accuracy is below 32%. EI-MTD thus has better defense effectiveness.
(3) In practice, legitimate users and attackers appear with certain prior probabilities; assuming the attacker appears with probability α (0 < α < 1), legitimate users appear with probability 1 − α. In the experiment, α is set to 0.1, 0.2, ..., 0.9 to simulate the various possible conditions. From FIG. 5 it can be observed that as α increases, i.e., the proportion of adversarial examples among the requests increases, the accuracy of all models tends to decrease, since the destructive effect of the adversarial examples grows. However, the classification accuracy of EI-MTD remains higher than that of the single member models.
We also analyzed the attacks of FGSM, MI-FGSM and M-DI2-FGSM; the classification accuracy of EI-MTD is likewise higher than that of the single member models. In particular, for M-DI2-FGSM, which has the strongest black-box attack capability, EI-MTD improves the accuracy from 15.77% for SqueezeNet-1.0, the single member model with the worst defense capability, to 41.09%. It follows that the EI-MTD method presented herein can improve the robustness of the overall image classification system.
4.5 Transferability of EI-MTD
The transferability of adversarial examples can be measured by the transfer rate, i.e., the ratio of the number of transferred adversarial examples to the total number of adversarial examples constructed on the original model. Essentially, the transfer rate equals 100% minus the classification accuracy of the target model. We can observe from FIG. 5 that the transfer rate of adversarial examples under EI-MTD is lower than on the other member models. For example, in FIG. 5(d), the transfer rate under EI-MTD is (100% − 41.09%), while the transfer rate on the member model MobileNetV2-1.0 is (100% − 28.74%). Similarly, the transfer rates on the other member models are higher than under EI-MTD, indicating that EI-MTD can reduce the transferability of adversarial examples.
4.6 Effect of T and lambda on EI-MTD
To further analyze the effect of differential knowledge distillation on EI-MTD, we further analyze two important parameters, T and λ, which denote the distillation temperature and the regularization coefficient. Sailik et al. proposed differential immunity as a measure of MTD effectiveness, based on the idea that, for an ideal MTD, a specific attack should perform differently across different model configurations. They therefore define the differential immunity γ in terms of the adversarial attack success rate ASR(F_a, F_s), where F_a ∈ U denotes the proxy model selected by the attacker to generate adversarial examples and F_s ∈ Ω denotes the target model selected by the defender for the classification service. A larger γ indicates better MTD performance. In this section we use the differential immunity γ to investigate the effect of T and λ on EI-MTD.
Effect of T: for ease of analysis, λ = 0.3 is fixed, while all requests are assumed to be adversarial examples. The relationship between the accuracy of EI-MTD and the distillation temperature T is shown in FIG. 6: as the distillation temperature T increases, the classification accuracy of EI-MTD on the adversarial examples generated by FGSM, PGD, MI-FGSM and M-DI2-FGSM increases accordingly.
Since the classification accuracy of all member models is known, the differential immunity γ is easily computed. FIG. 7 shows the differential immunity γ corresponding to the distillation temperature T. It can be observed that γ increases with the distillation temperature T, which means that a higher distillation temperature T can enlarge the diversity of the member models. The reason is that a higher temperature makes the member models' decision boundaries approach that of the robust teacher model, reducing max_{F_s} ASR(F_a, F_s). However, after the distillation temperature T increases beyond 12, the growth of the differential immunity γ becomes flat, which suggests that the distillation temperature T is no longer the main factor affecting the member model differences at this point. This result demonstrates the effectiveness of EI-MTD well.
Based on the above observations, we further analyze the link between the accuracy of EI-MTD and the differential immunity γ. In FIG. 8 we experimentally give the classification accuracy of EI-MTD at different values of γ. The results indicate that increasing the differential immunity γ improves the performance of EI-MTD, again confirming the insight described in Section 3.1 that the diversity of the member models determines the effectiveness of EI-MTD. For example, when γ = 0.15, the accuracy of EI-MTD is only 27.34%, while when γ is increased to 0.38 by raising the temperature T to 20, the accuracy of EI-MTD reaches 47.86%. We can thus clearly explain how the distillation temperature T works: (1) increasing the temperature T increases the differential immunity γ; (2) increasing the differential immunity γ further improves the accuracy of EI-MTD; therefore (3) increasing the distillation temperature T can increase the effectiveness of EI-MTD.
Effect of λ: λ is the regularization coefficient that controls the importance of CS_coherence during training. To analyze the effect of λ on EI-MTD performance, we fix the distillation temperature T = 10. As shown in FIG. 9, increasing λ improves the accuracy of EI-MTD, but this result alone does not reveal the underlying relationship. Therefore, FIG. 10 first shows how the regularization coefficient λ affects the differential immunity γ. In particular, if λ is reduced to 0, all member models become identical, i.e., EI-MTD does not schedule dynamically, and the accuracy on adversarial examples generated by PGD is only 27.34%. In contrast, when λ is increased to 1, EI-MTD reaches an accuracy of 55.68%. In fact, increasing λ means increasing the importance of member-model diversity in the differential distillation process, which correspondingly increases the differential immunity; a larger γ in turn improves the accuracy of EI-MTD. FIG. 11 shows that increasing γ increases the accuracy of EI-MTD at temperature T = 10. We therefore briefly summarize the above analysis as follows: (1) a larger λ increases the diversity of the member models, further increasing the differential immunity γ; (2) a larger differential immunity γ ensures higher accuracy; so (3) a larger λ is beneficial to improving EI-MTD accuracy.
Optimal combination of T and λ: although we analyzed the effects of T and λ separately, the effect of their combination on EI-MTD accuracy is not yet clear. We therefore show the differential immunity γ and the accuracy of EI-MTD under different combinations with heat maps in FIG. 12. It can be seen that T and λ do not counteract each other's effect, since increasing either T or λ increases the differential immunity γ. In the example experiments of the invention, T = 18 and λ = 0.9 achieve the best performance, while even larger values do not appear to bring further significant improvement.
The foregoing has shown and described the basic principles, main features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the above embodiments and descriptions merely illustrate the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (2)

1. The edge intelligent mobile target defense method based on Bayes-Stackelberg game is characterized by comprising the following steps:
S1: adversarial training of the teacher model: the cloud data center holds a training data set and a teacher model F_t(θ_t); a ResNet-101 neural network is adopted as the teacher model, adversarial training is performed in the cloud data center using FGSM adversarial examples, and the process is accelerated by combining the Fast adversarial training method;
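As an illustrative, non-limiting sketch of step S1, the following PyTorch-style fragment adversarially trains a ResNet-101 teacher on FGSM examples with a random start in the spirit of Fast adversarial training; the function names and the hyperparameters (epsilon, learning rate, epochs, number of classes) are assumptions, not values fixed by the claim.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def fgsm_example(model, x, y, epsilon):
    # x_adv = clip(x + epsilon * sign(grad_x CE(model(x), y)))
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + epsilon * grad.sign()).clamp(0.0, 1.0).detach()

def adversarial_train_teacher(loader, num_classes=10, epochs=10, epsilon=8 / 255, lr=0.1):
    # ResNet-101 teacher trained on FGSM examples; the random start below follows
    # the "Fast" adversarial training recipe (illustrative hyperparameters).
    teacher = models.resnet101(num_classes=num_classes)
    opt = torch.optim.SGD(teacher.parameters(), lr=lr, momentum=0.9)
    teacher.train()
    for _ in range(epochs):
        for x, y in loader:
            x_start = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)
            x_adv = fgsm_example(teacher, x_start, y, epsilon)
            opt.zero_grad()
            F.cross_entropy(teacher(x_adv), y).backward()
            opt.step()
    return teacher
```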
S2: differential knowledge distillation of the student models: first, soft labels of the samples x_i are obtained from the teacher model F_t(θ_t) at an appropriate distillation temperature T, and a new training data set of soft-labelled samples is created; a new loss function with the regularization term CS_coherence is defined as L = (T²/K)·Σ_i J_i + λ·CS_coherence, and all K student models F_s(θ^(1)), ..., F_s(θ^(K)) are trained simultaneously to minimize the common loss function L;
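A minimal sketch of the soft-label construction in S2, assuming the teacher outputs raw logits and a PyTorch data loader is available; the temperature value and the in-memory dataset layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_soft_label_dataset(teacher, loader, temperature=10.0):
    # New training set {(x_i, softmax(F_t(x_i) / T))} built from the teacher's logits.
    teacher.eval()
    soft_dataset = []
    for x, _ in loader:
        soft_labels = F.softmax(teacher(x) / temperature, dim=1)
        soft_dataset.append((x, soft_labels))
    return soft_dataset
```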
S3: dynamic service scheduling of the member models: after differential knowledge distillation, the student models are deployed to edge nodes, one model per node; the edge nodes comprise edge devices and edge servers; a certain edge server is designated as the service scheduling controller, and all member models and the nodes where they reside are registered with the scheduling controller; when a user requests classification service for an input image through an edge device, the edge device first uploads the service request to the scheduling controller, and the scheduling controller then selects one edge node to execute the classification task through the Bayes-Stackelberg game;
According to the step S3, gradient alignment is adopted as a diversity measure;
Two member models F_s(θ^(a)) and F_s(θ^(b)) are provided, and the attacker selects a proxy model F_a ∈ U; let ∇_x J_a and ∇_x J_b denote the gradients of the loss functions of F_s(θ^(a)) and F_s(θ^(b)) with respect to the sample x; if the angle between ∇_x J_a and ∇_x J_b is small enough, an adversarial example x_adv that makes F_s(θ^(a)) misclassify can also make F_s(θ^(b)) misclassify; therefore the difference between F_s(θ^(a)) and F_s(θ^(b)) is related to the angle between ∇_x J_a and ∇_x J_b; cosine similarity (CS) is used to represent the alignment of ∇_x J_a and ∇_x J_b:

CS(∇_x J_a, ∇_x J_b) = ⟨∇_x J_a, ∇_x J_b⟩ / (‖∇_x J_a‖ · ‖∇_x J_b‖)
where ⟨∇_x J_a, ∇_x J_b⟩ is the inner product of ∇_x J_a and ∇_x J_b; if CS(∇_x J_a, ∇_x J_b) = -1, then ∇_x J_a and ∇_x J_b have opposite directions, meaning that an adversarial example x_adv that makes F_s(θ^(a)) misclassify cannot make F_s(θ^(b)) misclassify.
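An illustrative, non-limiting sketch of this gradient-alignment measure between two member models, assuming PyTorch and the soft-label (KL-divergence) losses of step S2; the function names and the choice of loss are assumptions.

```python
import torch
import torch.nn.functional as F

def input_gradient(model, x, soft_label, create_graph=False):
    # Gradient of the model's distillation loss with respect to the input sample x.
    x = x.clone().detach().requires_grad_(True)
    log_p = F.log_softmax(model(x), dim=1)
    loss = F.kl_div(log_p, soft_label, reduction="batchmean")
    return torch.autograd.grad(loss, x, create_graph=create_graph)[0].flatten()

def gradient_cosine_similarity(model_a, model_b, x, soft_label):
    # CS(g_a, g_b) = <g_a, g_b> / (||g_a|| * ||g_b||)
    g_a = input_gradient(model_a, x, soft_label)
    g_b = input_gradient(model_b, x, soft_label)
    return F.cosine_similarity(g_a, g_b, dim=0)
```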
In the step S2, cosine similarity is further applied to the training process of the student models so as to obtain a member model set with larger differences; since cosine similarity is calculated over two gradients, in order to extend it to K models, the maximum pairwise cosine similarity is defined as the EI-MTD diversity metric:

CS_coherence = max_{a≠b, a,b∈{1,...,K}} CS(∇_x J_a, ∇_x J_b)
where J_a and J_b respectively denote the loss functions of the student models F_s(θ^(a)) and F_s(θ^(b)), θ^(a) and θ^(b) respectively denote the parameters of the student models F_s(θ^(a)) and F_s(θ^(b)), and x is a sample whose soft label is obtained from the teacher model; since CS_coherence is a non-smooth function and cannot be optimized by gradient descent, the LogSumExp function is further used to approximate CS_coherence;
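A minimal sketch of the K-model diversity regularizer, using the standard log-sum-exp smooth approximation of the pairwise maximum; the exact smoothed expression used by the method may differ, so this form is an assumption.

```python
import itertools
import torch
import torch.nn.functional as F

def cs_coherence(member_grads):
    # Smooth surrogate for max_{a != b} CS(g_a, g_b): LogSumExp over all model pairs.
    # member_grads: list of K flattened input-gradient tensors, one per student model,
    # kept in the autograd graph so the regularizer can be backpropagated.
    sims = [
        F.cosine_similarity(g_a, g_b, dim=0)
        for g_a, g_b in itertools.combinations(member_grads, 2)
    ]
    return torch.logsumexp(torch.stack(sims), dim=0)
```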
The student models are distilled from the teacher model of the cloud data center, and diversity among the student models must be ensured during distillation, so the regularization term CS_coherence is added in the knowledge distillation process, redefining a new distillation loss function:

L = (T²/K) · Σ_{i=1}^{K} J_i + λ · CS_coherence
where λ is the regularization coefficient that controls the importance of CS_coherence during training; in order for the student models to sufficiently learn the adversarial knowledge of the teacher model, the student models are trained using only soft-label examples; the differential knowledge distillation procedure of Algorithm 1 is as follows:
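The original Algorithm 1 figure is not reproduced in this text; the following is an illustrative, non-limiting sketch of one joint update of all K student models under the loss L = (T²/K)·Σ_i J_i + λ·CS_coherence, reusing the input_gradient and cs_coherence sketches above; the optimizer choice, the KL-divergence distillation term, and the hyperparameters are assumptions.

```python
import itertools
import torch
import torch.nn.functional as F

def distillation_step(students, optimizer, x, soft_labels, temperature=10.0, lam=0.5):
    # One joint update of the K students on a soft-labelled batch:
    #   L = (T^2 / K) * sum_i J_i + lambda * CS_coherence
    k = len(students)
    losses, grads = [], []
    for student in students:
        log_p = F.log_softmax(student(x) / temperature, dim=1)
        j = F.kl_div(log_p, soft_labels, reduction="batchmean")   # distillation term J_i
        losses.append(j)
        # Input gradient kept in the graph (create_graph=True) so the diversity
        # regularizer stays differentiable with respect to the student parameters.
        grads.append(input_gradient(student, x, soft_labels, create_graph=True))
    loss = (temperature ** 2 / k) * torch.stack(losses).sum() + lam * cs_coherence(grads)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative use: one optimizer over all student parameters.
# optimizer = torch.optim.SGD(
#     itertools.chain.from_iterable(s.parameters() for s in students), lr=0.1)
```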
2. The edge intelligent mobile target defense method based on Bayes-Stackelberg game according to claim 1, wherein: in the step S3, the student models obtained by differential knowledge distillation, namely the member models, are deployed to edge nodes; when an edge device receives an image, classification is not performed with its own model; instead, the image is forwarded to the scheduling controller; the scheduling controller selects a registered service model according to a scheduling policy, specifically:
The Bayes-Stackelberg game is represented as a seven-tuple, wherein L is the defender and S_L is the set of student models obtained after differential distillation; the follower F has two types, the legitimate user F^(1) and the attacker F^(2); the action set of the legitimate user F^(1) contains only one action, namely requesting service with legitimate samples; the action set of the attacker F^(2) is to select different proxy models F_a ∈ U; the payoff of the defender L against the legitimate user F^(1), and the payoff of the legitimate user F^(1), are defined as the classification accuracy of the member model on natural images; the payoff of the defender L against the attacker is defined as the classification accuracy of the member model on adversarial examples; the payoff of the attacker F^(2) is defined as the adversarial example attack success rate; P^(1) denotes the probability that the legitimate user F^(1) appears, and P^(2) denotes the probability that the attacker F^(2) appears; the model scheduling strategy problem based on the Bayes-Stackelberg game is converted into the following mixed-integer quadratic programming (MIQP) problem:
0 ≤ s_n ≤ 1
where P^(1) = 1 − α, P^(2) = α, and s = (p_1, p_2, ..., p_K) is the member-model scheduling strategy obtained by solving the problem, with p_i the probability that the member model F_s(θ^(i)) is selected; q^(c) is the strategy of user F^(c), and v^(c) is the corresponding return of that user; the problem is solved with the DOBSS algorithm.
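The full MIQP objective and the DOBSS solver are not reproduced in this text; as an illustrative, non-limiting sketch, the following fragment only shows how a scheduling controller could apply an already-solved strategy s = (p_1, ..., p_K) when dispatching classification requests to registered edge nodes; the class and node names are assumptions.

```python
import numpy as np

class SchedulingController:
    # Dispatches each classification request to one registered edge node according to
    # the mixed strategy s = (p_1, ..., p_K) returned by the MIQP/DOBSS solving step.

    def __init__(self, node_ids, strategy):
        assert len(node_ids) == len(strategy)
        self.node_ids = list(node_ids)
        p = np.asarray(strategy, dtype=float)
        self.p = p / p.sum()   # guard against small numerical drift in the solved strategy

    def select_node(self):
        # The member model F_s(theta^(i)) on node i is chosen with probability p_i.
        idx = np.random.choice(len(self.node_ids), p=self.p)
        return self.node_ids[idx]

# Illustrative use: strategy solved offline, then each incoming request is dispatched.
# controller = SchedulingController(["edge-node-1", "edge-node-2", "edge-node-3"], [0.5, 0.3, 0.2])
# node = controller.select_node()
```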
CN202010966915.3A 2020-09-15 2020-09-15 Edge intelligent mobile target defense method based on Bayes-Stackelberg game Active CN112115469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010966915.3A CN112115469B (en) 2020-09-15 2020-09-15 Edge intelligent mobile target defense method based on Bayes-Stackelberg game


Publications (2)

Publication Number Publication Date
CN112115469A CN112115469A (en) 2020-12-22
CN112115469B true CN112115469B (en) 2024-03-01

Family

ID=73802745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010966915.3A Active CN112115469B (en) 2020-09-15 2020-09-15 Edge intelligent mobile target defense method based on Bayes-Stackelberg game

Country Status (1)

Country Link
CN (1) CN112115469B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109218440A (en) * 2018-10-12 2019-01-15 上海拟态数据技术有限公司 A kind of mimicry web server isomery execution body dynamic dispatching method of displaying
CN110768971A (en) * 2019-10-16 2020-02-07 伍军 Confrontation sample rapid early warning method and system suitable for artificial intelligence system
CN111027060A (en) * 2019-12-17 2020-04-17 电子科技大学 Knowledge distillation-based neural network black box attack type defense method
CN111047054A (en) * 2019-12-13 2020-04-21 浙江科技学院 Two-stage countermeasure knowledge migration-based countermeasure sample defense method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20180268292A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc. Learning efficient object detection models with knowledge distillation


Non-Patent Citations (1)

Title
Active defense strategy selection method based on static Bayesian game; Wang Jindong; Yu Dingkun; Zhang Hengwei; Wang Na; Journal of Xidian University; 2015-04-14; Vol. 43, No. 1; 144-150 *

Also Published As

Publication number Publication date
CN112115469A (en) 2020-12-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant