CN110942094B - Norm-based adversarial sample detection and classification method - Google Patents

Norm-based adversarial sample detection and classification method

Info

Publication number
CN110942094B
CN110942094B (Application CN201911174658.3A)
Authority
CN
China
Prior art keywords
attack
adversarial
sample
image sample
norm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911174658.3A
Other languages
Chinese (zh)
Other versions
CN110942094A (en)
Inventor
江维
詹瑾瑜
何致远
吴俊廷
龚子成
潘唯迦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911174658.3A priority Critical patent/CN110942094B/en
Publication of CN110942094A publication Critical patent/CN110942094A/en
Application granted granted Critical
Publication of CN110942094B publication Critical patent/CN110942094B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a norm-based method for detecting and classifying adversarial samples, comprising the following steps: S1, generating adversarial samples by selecting attack methods with different attack strengths; S2, calculating the norms of the adversarial samples to obtain a classification threshold and a grading threshold; S3, determining a detector; S4, compressing the target sample and calculating its norm triple L = (l∞, l2, l0), then comparing the calculated norm values with the thresholds obtained in step S2 to judge whether the target sample is an adversarial sample; if so, further obtaining the attack classification and attack strength of the adversarial sample, otherwise leaving the sample unprocessed; and S5, verifying the rationality and effectiveness of the overall detector. The method can detect the specific classification and attack strength of adversarial samples while detecting the samples themselves accurately.

Description

Norm-based adversarial sample detection and classification method
Technical Field
The invention relates to a norm-based method for detecting and classifying adversarial samples.
Background
The adversarial sample was first described by Christian Szegedy, Ian Goodfellow, and others: by slightly perturbing the original data, a machine learning algorithm can be made to output an erroneous result, so adversarial samples achieve high deception rates in image recognition, natural language processing, speech recognition, and other fields. Such minor adjustments are not easily perceived and are highly damaging to AI systems, which makes detecting adversarial samples essential. The basic problem of adversarial sample detection is: without affecting system function, effectively detect the existence of adversarial samples and determine their classification and attack strength from their classification characteristics. The problem to be solved is therefore: while maintaining model accuracy, effectively detect adversarial samples and give their classification and attack strength.
There are three traditional detection approaches: sample statistics, adding a sub-network, and prediction inconsistency. Sample statistics collects a large amount of normal data and detects adversarial samples by comparing the difference between target samples and normal samples. Common comparison methods include statistical tests of maximum mean discrepancy, the K-nearest-neighbor algorithm, and kernel density estimation. The sample-statistics approach requires a large number of adversarial and legitimate inputs and fails to detect a single adversarial example. It is also computationally expensive, and only adversarial examples far from the legitimate population can be detected. Because adversarial examples are inherently imperceptible, separating them from legitimate input with sample statistics appears to be of limited effectiveness. Adding a sub-network attaches an adversarial-sample detector to the model as a sub-network. Similar to adversarial training, adversarial examples can be used to train the detector. However, this strategy also requires a large number of adversarial examples, and is therefore expensive and prone to overfitting to the attacks that generated the training examples. Detection accuracy depends on how complete the training data is; if a new adversarial attack method appears, its samples cannot be detected, because the detection sub-network has not learned the corresponding adversarial features. The basic idea of prediction inconsistency is to measure the divergence between several models when predicting unknown inputs, since an adversarial example may target only a certain class of models and cannot fool every model. Prediction inconsistency can be implemented in several ways: detecting adversarial samples through the prediction differences of different models on the same data; predicting the input data multiple times with the same model using Dropout, with the Dropout layers using different probabilities in each prediction; or producing interpretable outputs for the input data with several interpretable models and comparing the different interpretations to judge whether the input is adversarial.
Traditional adversarial-sample detection methods can accurately detect adversarial samples by preprocessing the input data or modifying the model, but they cannot detect detailed information about the adversarial sample. This detailed information comprises the characteristic classification of the adversarial sample and a definition of its attack strength, and it prepares well for defending against adversarial samples. Traditional classification of adversarial samples is subjective and based on the effects they produce: adversarial attacks are divided into white-box and black-box attacks according to whether the adversary knows the model; into targeted and non-targeted attacks according to whether the adversarial sample misleads the data into a specific target class; and into one-step and iterative attacks according to whether a one-step or an iterative method is used to generate the adversarial sample. These classifications are not perfect and may overlap in some cases; for example, an attack may be white-box on one model and black-box on another. Moreover, such classifications lack a description of the characteristics of the adversarial samples, so the features of each attack type cannot be measured. Traditional detection methods also do not reflect the attack strength of the adversarial sample well.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a norm-based method for detecting and classifying adversarial samples that can detect the specific classification and attack strength of adversarial samples while detecting the samples themselves accurately.
The purpose of the invention is realized by the following technical scheme: a norm-based method for detecting and classifying adversarial samples, comprising the following steps:
S1, generating adversarial samples: selecting attack methods with different attack strengths to generate adversarial samples;
S2, calculating the norms of the adversarial samples to obtain a classification threshold and a grading threshold;
S3, determining a detector: determining a detector using a detection method based on improved prediction inconsistency;
S4, compressing the target sample with the detector and calculating the norm triple L = (l∞, l2, l0) of the sample; comparing the calculated norm values with the thresholds obtained in step S2 to judge whether the target sample is an adversarial sample; if so, further obtaining the attack classification and attack strength of the adversarial sample; otherwise the sample is not processed;
S5, verifying the rationality and validity of the whole detector: inputting a test sample and comparing whether the detected classification and grade are consistent with those of the original label; if so, the operation ends, otherwise return to step S1.
Further, the step S1 includes the following sub-steps:
S11, determining an adversarial attack method to generate adversarial samples;
S12, calculating a loss function L(x') during the generation of the adversarial sample for each attack, and generating adversarial samples according to the loss function:
x* = arg min L(x')  s.t.  d(x*, x') ≤ ε
where d(x*, x') ≤ ε constrains the distance between the adversarial sample x* and x' to be within a preset minimum ε;
S13, selecting different limiting conditions to constrain each generated adversarial sample to be not easily perceived; the limiting conditions comprise the L-0 norm, the L-2 norm and the L-∞ norm;
S14, obtaining different adversarial samples by adjusting the attack iteration count and confidence of the adversarial attack method, observing the attack success rate and confidence of each adversarial sample, and determining the attack strength;
S15, dividing the obtained adversarial samples into two parts: one part is stored classified by limiting condition and attack strength, and the other part is used as test data after being labeled with limiting condition and attack strength.
Further, the step S2 includes the following sub-steps:
S21, taking the adversarial samples stored by class in step S15 as input, and calculating the L-∞, L-2 and L-0 norm values of all adversarial samples; the calculation formulas are:
L-∞: ||x||∞ = max(|x1|, |x2|, ..., |xn|)
L-2: ||x||2 = √(x1² + x2² + ... + xn²)
L-0: ||x||0 = Count(xi ≠ 0);
S22, according to the classes into which the adversarial samples are divided, obtaining through statistical analysis the classification threshold γc = (c∞, c2, c0) of the adversarial samples under each norm constraint, where c∞, c2 and c0 are the thresholds of the L-∞, L-2 and L-0 norms, respectively;
S23, outputting the attack success rate and confidence of the adversarial samples, comparing attack strengths, and obtaining the grading threshold γg through statistical analysis; the attack success rate and confidence of the adversarial samples are calculated cyclically, and the grading threshold γg is updated.
Further, the step S3 includes the following sub-steps:
S31, initializing the detector: setting the compressed color depth to 1 bit and the spatial smoother size to 2×2;
S32, inputting adversarial samples to obtain the detection rate and model accuracy of the detector, and storing them;
S33, changing the detector combination by gradually increasing the compressed color-depth bit count and the spatial-smoother size, repeating step S32, and updating the best detection rate and model accuracy;
S34, determining the optimal detector combination according to the best detection rate and model accuracy.
Further, the step S4 includes the following sub-steps:
S41, according to the optimal detector obtained in step S3, inputting the adversarial sample into the optimal detector to obtain a compressed version of the original data, and calculating the L-∞, L-2 and L-0 norm values of the data;
S42, calculating the differences between the norm values obtained in S41 and the classification threshold γc = (c∞, c2, c0), and selecting the item with the smallest difference as the classification of the adversarial sample;
S43, calculating the sum of the norm differences;
S44, comparing whether the sum of the differences is larger than the grading threshold γg; if so, judging the attack strength to be weak, otherwise judging it to be strong;
S45, outputting the classification and attack strength of the adversarial sample.
Further, the step S5 includes the following sub-steps:
S51, inputting the test data obtained in step S15 into the detector obtained in step S3;
S52, comparing the detected classification and attack level with the classification and level of the original labels of the test set; if they are consistent, the verification passes and the operation ends; otherwise, return to step S1, modify the parameters, and adjust the detection method.
The invention has the following beneficial effects: unlike traditional adversarial-sample detection methods, the method can detect the specific classification and attack strength of adversarial samples while detecting the samples themselves accurately.
Drawings
FIG. 1 is a flow chart of the norm-based adversarial sample detection and classification method of the present invention;
FIG. 2 shows the method of generating adversarial samples with different optimization methods and different constraints according to the present invention;
FIG. 3 shows the statistical analysis method for the attack-classification and strength-grading thresholds of the present invention;
FIG. 4 shows the method of determining the optimal detector of the present invention;
FIG. 5 shows the classification and grading method for adversarial samples of the present invention;
FIG. 6 shows the verification module of the present invention.
Detailed Description
The technical scheme of the invention is further explained below with reference to the accompanying drawings.
As shown in FIG. 1, the norm-based adversarial sample detection and classification method of the present invention includes the following steps:
S1, generating adversarial samples: selecting attack methods with different attack strengths to generate adversarial samples; as shown in FIG. 2, this specifically includes the following sub-steps:
S11, determining an adversarial attack method to generate adversarial samples;
S12, calculating a loss function L(x') during the generation of the adversarial sample for each attack, and generating adversarial samples according to the loss function:
x* = arg min L(x')  s.t.  d(x*, x') ≤ ε
where d(x*, x') ≤ ε constrains the distance between the adversarial sample x* and x' to be within a preset minimum ε;
S13, selecting different limiting conditions to constrain each generated adversarial sample to be not easily perceived; the limiting conditions comprise the L-0 norm, the L-2 norm and the L-∞ norm. Adversarial samples are collected using different optimization methods and different limiting conditions, the parameters are adjusted, and the attack strength is determined according to the attack success rate and confidence.
Common L-∞ attacks are FGSM, BIM and CW-L∞. FGSM (Fast Gradient Sign Method) is a fast and efficient adversarial attack method that performs only one gradient-update step at each pixel along the direction of the gradient sign. FGSM generation can be expressed as:
η = ε · sign(∇x Jθ(x, y))
x* = x + η
where sign(∇x Jθ(x, y)) is the sign of the gradient of the loss function Jθ(x, y) with respect to x, along which the adversarial sample is obtained.
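By way of illustration only, a minimal FGSM sketch in PyTorch follows; the names `model`, `x`, `y` and `epsilon` are hypothetical placeholders, and the clamp to a [0, 1] pixel range is an added assumption not stated in the formula above.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    """One-step L-infinity attack: x* = x + epsilon * sign(grad_x J_theta(x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # J_theta(x, y)
    loss.backward()
    eta = epsilon * x.grad.sign()         # perturbation eta along the gradient sign
    return (x + eta).clamp(0.0, 1.0).detach()  # assumed [0, 1] pixel range
```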
BIM (Basic Iterative Method) generates an adversarial sample through multiple iterations based on FGSM and clips the pixel values in each iteration, thereby avoiding a large change to any single pixel. The process is:
x*_0 = x
x*_{i+1} = Clip_{x,ξ}( x*_i + ε · sign(∇x Jθ(x*_i, y)) )
where the function Clip_{x,ξ}(·) clips the adversarial sample generated in each iteration into a preset ξ-neighborhood of x, so as to keep the adversarial sample imperceptible.
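A corresponding BIM sketch under the same assumptions; the step size `alpha`, the step count `steps`, and the epsilon-ball clipping used to realize Clip_{x,ξ} are illustrative choices, not values from the patent.

```python
import torch
import torch.nn.functional as F

def bim(model, x, y, epsilon, alpha=0.01, steps=10):
    """Iterative FGSM; each iterate is clipped back into the epsilon-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)  # Clip_{x, xi}
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv
```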
The CW (Carlini & Wagner) attack is a strong adversarial attack method and has been shown to be effective against most existing adversarial-detection defenses. The CW attack defines a new objective function g(x) and turns the generation of the adversarial sample into the following problem:
min_η ||η||_p + c · g(x + η)
s.t. x + η ∈ [0, 1]^n
where ||η||_p is the p-norm constraint on the perturbation η, c is a constant, and x + η ∈ [0, 1]^n restricts the adversarial perturbation η to a valid range.
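A sketch of evaluating the CW objective for a targeted attack with p = 2; the margin form of g and the confidence parameter `kappa` follow the published CW formulation rather than anything specified in this patent, and all names are placeholders.

```python
import torch

def cw_objective(model, x, eta, target, c, kappa=0.0):
    """||eta||_2 + c * g(x + eta), where g stays positive until the target
    class dominates every other logit by at least kappa."""
    logits = model((x + eta).clamp(0.0, 1.0))
    target_logit = logits.gather(1, target.view(-1, 1)).squeeze(1)
    other_best = logits.scatter(1, target.view(-1, 1), float('-inf')).max(dim=1).values
    g = torch.clamp(other_best - target_logit + kappa, min=0.0)  # g(x + eta)
    return eta.flatten(1).norm(p=2, dim=1) + c * g
```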
The CW-L∞ attack is an iterative attack that adds a new penalty term in each iteration:
min c · g(x + η) + Σ_i max(0, η_i - τ)
where η is the adversarial perturbation; the constraint term in the objective function is replaced by a penalty on every component that exceeds τ (τ is initially 1 and decreases in each iteration).
Common L-2 attacks are DeepFool and CW-L2. The basic idea of DeepFool is to find the closest distance from the original input to the decision boundary as the adversarial perturbation. To overcome the problem of high-dimensional non-linearity, DeepFool uses an iterative attack with a linear approximation. For an affine binary classifier f(x) = w^T x + b, each step solves:
arg min_η ||η||_2  s.t.  f(x + η) = 0
whose closed-form solution is η*(x) = -( f(x) / ||w||_2² ) · w; for a general classifier, w is replaced by the local gradient ∇f(x).
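A single DeepFool step as a minimal sketch, assuming `f` returns one decision score per sample and that the affine closed form above is applied to the local linearization; the small constant added to the denominator is a numerical-stability assumption.

```python
import torch

def deepfool_step(f, x):
    """eta = -(f(x) / ||w||^2) * w, with w the gradient of f at x."""
    x = x.clone().detach().requires_grad_(True)
    fx = f(x)                                  # shape: (batch,)
    w = torch.autograd.grad(fx.sum(), x)[0]    # local linearization weights
    w_sq = w.flatten(1).pow(2).sum(dim=1) + 1e-12
    scale = (-fx / w_sq).view(-1, *([1] * (x.dim() - 1)))
    return (x + scale * w).detach()
```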
CW-L2 adds the following constraint to the problem posed by CW:
x_i + η_i = ( tanh(ω_i) + 1 ) / 2
where a new variable ω is introduced. Since -1 ≤ tanh(ω_i) ≤ 1, the formula above limits 0 ≤ x_i + η_i ≤ 1.
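A minimal sketch of this change of variables; optimizing the unconstrained `omega` with any gradient optimizer then keeps the perturbed pixels inside [0, 1] automatically.

```python
import torch

def tanh_reparam(omega):
    """x + eta = (tanh(omega) + 1) / 2, always inside [0, 1]."""
    return 0.5 * (torch.tanh(omega) + 1.0)

# e.g. set omega = torch.zeros_like(x, requires_grad=True) and minimize the
# CW objective over omega instead of over the raw perturbation eta.
```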
Common L-0 attacks are JSMA and CW-L0. JSMA (Jacobian-based Saliency Map Attack) designs a potent saliency map, hence the name Jacobian-based saliency map attack. First, the Jacobian matrix of a given sample x is calculated:
J_F(x) = ∂F(x)/∂x = [ ∂F_j(x)/∂x_i ]
Then an adversarial saliency map is defined based on the Jacobian matrix, and the features to modify in each iteration are selected. Because the L-0 norm is non-differentiable, CW-L0 performs the L-0 attack iteratively: in each iteration, pixels that contribute little to generating the adversarial example are removed, with the importance of a pixel determined by the gradient of the L-2 distance; if the remaining pixels fail to generate an adversarial example, the iteration stops.
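A sketch of the Jacobian computation that JSMA starts from, for a single input; looping over output classes is the simplest (not the fastest) way to assemble the matrix, and a single-sample input without a batch dimension is assumed.

```python
import torch

def jacobian(model, x):
    """J[j, i] = dF_j(x)/dx_i for the softmax outputs F of a single sample x."""
    x = x.clone().detach().requires_grad_(True)
    probs = torch.softmax(model(x.unsqueeze(0)), dim=1).squeeze(0)
    rows = [torch.autograd.grad(probs[j], x, retain_graph=True)[0].flatten()
            for j in range(probs.shape[0])]
    return torch.stack(rows)  # shape: (num_classes, num_input_features)
```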
S14, obtaining different adversarial samples by adjusting the attack iteration count and confidence of the adversarial attack method, observing the attack success rate and confidence of each adversarial sample, and determining the attack strength;
S15, dividing the obtained adversarial samples into two parts: one part is stored classified by limiting condition and attack strength, and the other part is used as test data after being labeled with limiting condition and attack strength.
S2, calculating the norms of the adversarial samples to obtain a classification threshold and a grading threshold; as shown in FIG. 3, this specifically includes the following sub-steps (a sketch of these computations follows the list):
S21, taking the adversarial samples stored by class in step S15 as input, and calculating the L-∞, L-2 and L-0 norm values of all adversarial samples; the calculation formulas are:
L-∞: ||x||∞ = max(|x1|, |x2|, ..., |xn|)
L-2: ||x||2 = √(x1² + x2² + ... + xn²)
L-0: ||x||0 = Count(xi ≠ 0);
S22, according to the classes into which the adversarial samples are divided, obtaining through statistical analysis the classification threshold γc = (c∞, c2, c0) of the adversarial samples under each norm constraint, where c∞, c2 and c0 are the thresholds of the L-∞, L-2 and L-0 norms, respectively;
S23, outputting the attack success rate and confidence of the adversarial samples, comparing attack strengths, and obtaining the grading threshold γg through statistical analysis; the attack success rate and confidence of the adversarial samples are calculated cyclically, and the grading threshold γg is updated.
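A sketch of the three norm values and of one plausible reading of the "statistical analysis" for γc; the patent does not fix the exact statistic, so taking the per-class mean of the norm triples is an assumption.

```python
import numpy as np

def norm_triple(x):
    """(L-infinity, L-2, L-0) of a flattened array, matching the formulas above."""
    v = np.asarray(x, dtype=float).ravel()
    return np.max(np.abs(v)), np.sqrt(np.sum(v ** 2)), float(np.count_nonzero(v))

def classification_threshold(samples):
    """gamma_c = (c_inf, c_2, c_0): here the mean norm triple over the stored
    adversarial samples of one norm-constrained class (assumed statistic)."""
    return np.array([norm_triple(s) for s in samples]).mean(axis=0)
```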
S3, determining a detector: determining a detector using a detection method based on improved prediction inconsistency. A prediction-inconsistency-based detection method yields different versions of a sample, which facilitates the norm calculation; at the same time it does not rely on a large number of adversarial samples, which greatly saves time and computational cost. Feature squeezing is used as the way of handling prediction inconsistency. Feature squeezing takes many forms; the invention focuses on two simple types of squeezing: reducing the color depth of the image, and using smoothing to reduce the differences between pixels. A standard digital image is represented by an array of pixels, each pixel typically represented as a number denoting a particular color.
Common image representations use color bit depths that introduce irrelevant features, so it is assumed that reducing the bit depth can reduce the adversary's opportunity without harming classifier accuracy. There are two common image representations: 8-bit grayscale and 24-bit color. A grayscale image provides 2^8 = 256 possible values per pixel; the 8-bit value represents the intensity of the pixel, where 0 is black, 255 is white, and intermediate numbers represent different shades of gray. Spatial smoothing is a family of techniques widely used in image processing to reduce image noise; a common method is median filtering. Different color-depth compression values and different spatial smoothers are selected, the detection rate for adversarial samples is weighed against the change in model accuracy, and the most suitable detector is determined. A sketch of the two squeezers follows.
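The two squeezers sketched below follow the feature-squeezing scheme cited in the non-patent literature (Xu et al.); the scipy median filter stands in for the spatial smoother, and pixel values are assumed to lie in [0, 1].

```python
import numpy as np
from scipy.ndimage import median_filter

def squeeze_bit_depth(x, bits):
    """Reduce color depth to `bits` bits per channel (1 bit -> 2 levels)."""
    levels = 2 ** bits - 1
    return np.round(np.asarray(x) * levels) / levels

def spatial_smooth(x, size=2):
    """Median smoothing over a size x size neighborhood."""
    return median_filter(x, size=size)
```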
As shown in FIG. 4, this specifically includes the following sub-steps (a sketch of this search follows the list):
S31, initializing the detector: setting the compressed color depth to 1 bit and the spatial smoother size to 2×2;
S32, inputting adversarial samples to obtain the detection rate and model accuracy of the detector, and storing them;
S33, changing the detector combination by gradually increasing the compressed color-depth bit count and the spatial-smoother size, repeating step S32, and updating the best detection rate and model accuracy;
S34, determining the optimal detector combination according to the best detection rate and model accuracy.
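A sketch of the S31-S34 search; `evaluate` is a hypothetical callback that squeezes the stored adversarial samples with a given (bit depth, smoother size) pair and returns the measured detection rate and model accuracy, and the candidate ranges are illustrative.

```python
def select_detector(evaluate, bit_depths=range(1, 9), smoother_sizes=(2, 3, 4)):
    """Grid-search detector combinations; prefer a higher detection rate,
    breaking ties by model accuracy."""
    best_combo, best_score = None, (-1.0, -1.0)
    for bits in bit_depths:
        for size in smoother_sizes:
            rate, acc = evaluate(bits, size)
            if (rate, acc) > best_score:
                best_combo, best_score = (bits, size), (rate, acc)
    return best_combo, best_score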
S4, compressing the target sample with the detector and calculating the norm triple L = (l∞, l2, l0) of the sample; comparing the calculated norm values with the thresholds obtained in step S2 to judge whether the target sample is an adversarial sample; if so, further obtaining the attack classification and attack strength of the adversarial sample; otherwise the sample is not processed. As shown in FIG. 5, this includes the following sub-steps (a sketch of the S42-S45 decision rule follows the list):
S41, according to the optimal detector obtained in step S3, inputting the adversarial sample into the optimal detector to obtain a compressed version of the original data, and calculating the L-∞, L-2 and L-0 norm values of the data;
S42, calculating the differences between the norm values obtained in S41 and the classification threshold γc = (c∞, c2, c0), and selecting the item with the smallest difference as the classification of the adversarial sample:
Δnorm = (|l∞ - c∞|, |l2 - c2|, |l0 - c0|)
classindex = argmin(Δnorm)
class = {L-∞, L-2, L-0}[classindex];
S43, calculating the sum of the norm differences:
sum = Σi Δnorm,i;
S44, comparing whether the sum of the differences is larger than the grading threshold γg; if so, judging the attack strength to be weak, otherwise judging it to be strong:
strength = weak if sum > γg, otherwise strong;
S45, outputting the classification and attack strength of the adversarial sample.
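A sketch of the S42-S45 decision; the comparison direction (a sum of differences greater than γg means a weak attack, per S44) is taken directly from the steps above, while the label tuple and vector layout are assumptions.

```python
import numpy as np

def classify_and_grade(l_triple, gamma_c, gamma_g):
    """Return (attack class, strength) from measured norms and thresholds."""
    labels = ('L-inf', 'L-2', 'L-0')
    delta = np.abs(np.asarray(l_triple) - np.asarray(gamma_c))  # Delta_norm
    attack_class = labels[int(np.argmin(delta))]                # smallest difference
    strength = 'weak' if delta.sum() > gamma_g else 'strong'    # S44 rule
    return attack_class, strength
```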
S5, verifying the rationality and validity of the whole detector: inputting a test sample and comparing whether the detected classification and grade are consistent with those of the original label; if so, the operation ends, otherwise return to step S1. As shown in FIG. 6, this includes the following sub-steps (a minimal sketch follows the list):
S51, inputting the test data obtained in step S15 into the detector obtained in step S3;
S52, comparing the detected classification and attack level with the classification and level of the original labels of the test set; if they are consistent, the verification passes and the operation ends; otherwise, return to step S1, modify the parameters, and adjust the detection method.
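A minimal sketch of the S51-S52 check; `detector` is assumed to return a (classification, grade) pair and `test_data` to carry the labels stored in step S15.

```python
def verify(detector, test_data):
    """Fraction of labeled test samples whose detected (class, grade)
    matches the stored labels; 1.0 means the verification passes."""
    hits = sum(detector(x) == (cls, grade) for x, cls, grade in test_data)
    return hits / len(test_data)
```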
It will be appreciated by those of ordinary skill in the art that the embodiments described here are intended to help the reader understand the principles of the invention, and that the scope of the invention is not limited to the specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and these changes and combinations remain within the scope of the invention.

Claims (5)

1. A norm-based method for detecting and classifying adversarial image samples, characterized by comprising the following steps:
S1, generating adversarial image samples: selecting attack methods with different attack strengths to generate adversarial image samples;
S2, calculating the norms of the adversarial image samples to obtain a classification threshold and a grading threshold;
S3, determining a detector: determining a detector using a detection method based on improved prediction inconsistency; comprising the following sub-steps:
S31, initializing the detector: setting the compressed color depth to 1 bit and the spatial smoother size to 2×2;
S32, inputting adversarial image samples, compressing the image samples with the detector to obtain the detection rate and model accuracy of the detector, and storing them;
S33, changing the detector combination by gradually increasing the compressed color-depth bit count and the spatial-smoother size, repeating step S32, and updating the best detection rate and model accuracy;
S34, determining the optimal detector according to the best detection rate and model accuracy;
S4, compressing the target image sample with the optimal detector and calculating the norm triple L = (l∞, l2, l0) of the image sample; comparing the calculated norm values with the thresholds obtained in step S2 to judge whether the target image sample is an adversarial image sample; if so, further obtaining the attack classification and attack strength of the adversarial image sample; otherwise the image sample is not processed;
S5, verifying the rationality and effectiveness of the optimal detector: inputting a test image sample and comparing whether the detected classification and grade are consistent with those of the original label; if so, the operation ends, otherwise return to step S1.
2. The norm-based adversarial image sample detection and classification method according to claim 1, wherein the step S1 includes the following sub-steps:
S11, determining an adversarial attack method to generate adversarial image samples;
S12, calculating a loss function L(x') during the generation of the adversarial image sample for each attack, and generating adversarial image samples according to the loss function:
x* = arg min L(x')  s.t.  d(x*, x') ≤ ε
where d(x*, x') ≤ ε constrains the distance between the adversarial image sample x* and x' to be within a preset minimum ε;
S13, selecting different limiting conditions to constrain each generated adversarial image sample to be not easily perceived; the limiting conditions comprise the L-0 norm, the L-2 norm and the L-∞ norm;
S14, obtaining different adversarial image samples by adjusting the attack iteration count and confidence of the adversarial attack method, observing the attack success rate and confidence of each adversarial image sample, and determining the attack strength;
S15, dividing the obtained adversarial image samples into two parts: one part is stored classified by limiting condition and attack strength, and the other part is used as test data after being labeled with limiting condition and attack strength.
3. The norm-based adversarial image sample detection and classification method according to claim 2, wherein the step S2 includes the following sub-steps:
S21, taking the adversarial image samples stored by class in step S15 as input, and calculating the L-∞, L-2 and L-0 norm values of all adversarial image samples; the calculation formulas are:
L-∞: ||x||∞ = max(|x1|, |x2|, ..., |xn|)
L-2: ||x||2 = √(x1² + x2² + ... + xn²)
L-0: ||x||0 = Count(xi ≠ 0);
S22, according to the classes into which the adversarial image samples are divided, obtaining through statistical analysis the classification threshold γc = (c∞, c2, c0) of the adversarial image samples under each norm constraint, where c∞, c2 and c0 are the thresholds of the L-∞, L-2 and L-0 norms, respectively;
S23, outputting the attack success rate and confidence of the adversarial image samples, comparing attack strengths, and obtaining the grading threshold γg through statistical analysis; the attack success rate and confidence of the adversarial image samples are calculated cyclically, and the grading threshold γg is updated.
4. The norm-based adversarial image sample detection and classification method according to claim 1, wherein the step S4 includes the following sub-steps:
S41, according to the optimal detector obtained in step S3, inputting the adversarial image sample into the optimal detector to obtain a compressed version of the original data, and calculating the L-∞, L-2 and L-0 norm values of the data;
S42, calculating the differences between the norm values obtained in S41 and the classification threshold γc = (c∞, c2, c0), and selecting the item with the smallest difference as the classification of the adversarial image sample;
S43, calculating the sum of the norm differences;
S44, comparing whether the sum of the differences is larger than the grading threshold γg; if so, judging the attack strength to be weak, otherwise judging it to be strong;
S45, outputting the classification and attack strength of the adversarial image sample.
5. The norm-based adversarial image sample detection and classification method according to claim 1, wherein the step S5 includes the following sub-steps:
S51, inputting the test data obtained in step S15 into the detector obtained in step S3;
S52, comparing the detected classification and attack level with the classification and level of the original labels of the test set; if they are consistent, the verification passes and the operation ends; otherwise, return to step S1, modify the parameters, and adjust the detection method.
CN201911174658.3A 2019-11-26 2019-11-26 Norm-based adversarial sample detection and classification method Active CN110942094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911174658.3A CN110942094B (en) 2019-11-26 2019-11-26 Norm-based adversarial sample detection and classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911174658.3A CN110942094B (en) 2019-11-26 2019-11-26 Norm-based adversarial sample detection and classification method

Publications (2)

Publication Number Publication Date
CN110942094A CN110942094A (en) 2020-03-31
CN110942094B true CN110942094B (en) 2022-04-01

Family

ID=69908132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911174658.3A Active CN110942094B (en) 2019-11-26 2019-11-26 Norm-based adversarial sample detection and classification method

Country Status (1)

Country Link
CN (1) CN110942094B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753880B (en) * 2020-05-27 2023-06-27 华东师范大学 Image classification method for avoiding challenge sample attack
CN112241532B (en) * 2020-09-17 2024-02-20 北京科技大学 Method for generating and detecting malignant countermeasure sample based on jacobian matrix
CN112488321B (en) * 2020-12-07 2022-07-01 重庆邮电大学 Antagonistic machine learning defense method oriented to generalized nonnegative matrix factorization algorithm
CN115205608B (en) * 2022-09-15 2022-12-09 杭州涿溪脑与智能研究所 Adaptive image countermeasure sample detection and defense method based on compressed sensing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175646A (en) * 2019-05-27 2019-08-27 浙江工业大学 Multichannel confrontation sample testing method and device based on image transformation
EP3543917A1 (en) * 2018-03-19 2019-09-25 SRI International Inc. Dynamic adaptation of deep neural networks
CN110334749A (en) * 2019-06-20 2019-10-15 浙江工业大学 Confrontation attack defending model, construction method and application based on attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190266483A1 (en) * 2018-02-27 2019-08-29 Facebook, Inc. Adjusting a classification model based on adversarial predictions

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3543917A1 (en) * 2018-03-19 2019-09-25 SRI International Inc. Dynamic adaptation of deep neural networks
CN110175646A (en) * 2019-05-27 2019-08-27 浙江工业大学 Multichannel confrontation sample testing method and device based on image transformation
CN110334749A (en) * 2019-06-20 2019-10-15 浙江工业大学 Confrontation attack defending model, construction method and application based on attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks";Xu W;《arXiv》;20171205;第1-15页 *
"深度学习中的对抗样本问题";张思思;《计算机学报》;20190831;第1-19页 *

Also Published As

Publication number Publication date
CN110942094A (en) 2020-03-31


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant