CN112163638B - Method, device, equipment and medium for defending against backdoor attacks on an image classification model

Method, device, equipment and medium for defending against backdoor attacks on an image classification model

Info

Publication number
CN112163638B
Authority
CN
China
Prior art keywords
training
image
classification model
image classification
sample
Prior art date
Legal status
Active
Application number
CN202011122124.9A
Other languages
Chinese (zh)
Other versions
CN112163638A (en)
Inventor
李一鸣
吴保元
江勇
李志锋
夏树涛
刘威
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011122124.9A priority Critical patent/CN112163638B/en
Publication of CN112163638A publication Critical patent/CN112163638A/en
Application granted granted Critical
Publication of CN112163638B publication Critical patent/CN112163638B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method, a device, equipment and a medium for defending against a backdoor attack on an image classification model. The method includes: acquiring a training image set; filtering out training images containing visible triggers from the training image set to obtain a training sample set; performing standard training and adversarial training on the image classification model with the training sample set to obtain a first image classification model and a second image classification model, respectively; performing a diagnostic comparison of the first and second image classification models on clean test images to determine whether training images containing first-type invisible triggers exist in the training image set; and, when such training images exist, returning to the step of acquiring a training image set and continuing training until no training images with first-type invisible triggers exist in the training image set. The method and device can improve the model's ability to resist backdoor attacks and enhance its robustness against such attacks.

Description

Method, device, equipment and medium for defending against backdoor attacks on an image classification model
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a method, a device, equipment and a medium for defending against backdoor attacks on an image classification model.
Background
With the development of artificial intelligence, machine learning models have been widely applied across industries and play a very important role in many scenarios. Backdoor attacks are an emerging form of attack against machine learning models: an attacker embeds a backdoor in the model so that the infected model behaves normally, but once the backdoor is activated, the model's output becomes a malicious target preset by the attacker.
Poisoning-based backdoor attacks are a common means of mounting backdoor attacks at present: the attacker implants the backdoor by poisoning the training data set. To counter poisoning backdoor attacks, existing defense schemes mainly rely on sample filtering or on poison suppression, that is, either the poisoned samples are filtered out of the training set, or the effectiveness of the poisoned samples is suppressed during training so that they cannot successfully create a backdoor. However, both approaches are effective only against some types of poisoning backdoor attacks; they lack generality, and the resulting defense performance of the model is poor.
Disclosure of Invention
The application provides a method, a device, equipment and a medium for defending against backdoor attacks on an image classification model. By training the model with a combination of poison suppression and model diagnosis, poisoning backdoor attacks whose training sample set contains both visible triggers and invisible triggers can be resisted, the model's defense against backdoor attacks is improved, and the robustness of the model is enhanced.
In one aspect, the application provides a method for defending against a backdoor attack on an image classification model, the method comprising:
acquiring a training image set;
filtering out training images containing visible triggers from the training image set to obtain a training sample set;
performing standard training on the image classification model with the training sample set to obtain a first image classification model, and performing adversarial training on the image classification model with the training sample set to obtain a second image classification model;
performing a diagnostic comparison of the first image classification model and the second image classification model using clean test images to determine whether training images containing first-type invisible triggers exist in the training image set, where a clean test image is a test image that has not been poisoned;
returning to the step of acquiring a training image set and continuing training when training images containing first-type invisible triggers exist in the training image set, until no such training images exist in the training image set;
outputting the second image classification model for use when no training images containing first-type invisible triggers exist in the training image set.
In another aspect, a device for defending against backdoor attacks on an image classification model is provided, the device comprising:
a training image acquisition module, configured to acquire a training image set;
a training image filtering module, configured to filter out training images containing visible triggers from the training image set to obtain a training sample set;
a standard training module, configured to perform standard training on the image classification model with the training sample set to obtain a first image classification model;
an adversarial training module, configured to perform adversarial training on the image classification model with the training sample set to obtain a second image classification model;
a model diagnosis module, configured to perform a diagnostic comparison of the first image classification model and the second image classification model using clean test images to determine whether training images containing first-type invisible triggers exist in the training image set, where a clean test image is a test image that has not been poisoned;
a retraining module, configured to return to the training image acquisition module to acquire a training image set and continue training, when training images containing first-type invisible triggers exist in the training image set, until no such training images exist in the training image set;
and a model output module, configured to output the second image classification model for use when no training images containing first-type invisible triggers exist in the training image set.
In another aspect, a defending device is provided, the device comprising a processor and a memory, the memory storing at least one instruction or at least one program, the at least one instruction or at least one program being loaded and executed by the processor to implement the method for defending against backdoor attacks on an image classification model described above.
In another aspect, a computer storage medium is provided, the computer storage medium storing at least one instruction or at least one program, the at least one instruction or at least one program being loaded and executed by a processor to implement the method for defending against backdoor attacks on an image classification model described above.
The method, device, equipment and medium for defending against backdoor attacks on an image classification model have the following beneficial effects:
After the training image set is obtained, training images containing visible triggers are filtered out of the training image set, which prevents visible triggers from creating a backdoor during model training. Adversarial training prevents second-type invisible triggers from creating a backdoor during model training. By performing standard training and adversarial training separately on the image classification model and then using test images that contain no trigger to diagnose and compare the first image classification model obtained by standard training with the second image classification model obtained by adversarial training, first-type invisible triggers can also be prevented from creating a backdoor during model training. By preventing both visible triggers and invisible triggers (including first-type and second-type invisible triggers) from creating a backdoor during model training, the defense of the model against backdoor attacks is improved, any type of poisoning backdoor attack can be suppressed, and the robustness of the model is enhanced.
Drawings
In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present application or of the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present application, and that a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is an example of a poisoning backdoor attack in an image classification task provided in an embodiment of the present application.
Fig. 2 is a schematic diagram of an implementation environment of a method for defending against backdoor attacks on an image classification model according to an embodiment of the present application.
Fig. 3 is a schematic process diagram of a first stage of defense of a server according to an embodiment of the present application.
Fig. 4 is a process schematic diagram of a second stage of defense of a server according to an embodiment of the present application.
Fig. 5 is a schematic process diagram of a third stage of defense of a server according to an embodiment of the present application.
Fig. 6 is a flowchart of a method for defending against backdoor attacks on an image classification model according to an embodiment of the present application.
Fig. 7 is an exemplary diagram of a training image set provided in an embodiment of the present application.
Fig. 8 is a flowchart of filtering training images including visible triggers according to an embodiment of the present application.
Fig. 9 is an example of obtaining a neighborhood of pixels provided in an embodiment of the present application.
Fig. 10 is a schematic flow chart of standard training of an image classification model according to an embodiment of the present application.
Fig. 11 is a schematic flow chart of adversarial training of an image classification model according to an embodiment of the present application.
Fig. 12 is an example of an adversarial sample provided by an embodiment of the present application.
Fig. 13 is a schematic flow chart of diagnostic comparison of two models according to an embodiment of the present application.
Fig. 14 is a schematic block diagram of a device for defending against backdoor attacks on an image classification model according to an embodiment of the present application.
Fig. 15 is a schematic block diagram of a training image filtering module according to an embodiment of the present application.
Fig. 16 is a schematic block diagram of a standard training module according to an embodiment of the present application.
Fig. 17 is a schematic block diagram of an adversarial training module according to an embodiment of the present application.
Fig. 18 is a schematic block diagram of a model diagnosis module according to an embodiment of the present application.
Fig. 19 is a schematic hardware structure of an apparatus for implementing the method provided in the embodiment of the present application.
Detailed Description
Artificial intelligence (AI) comprises the theory, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the capabilities of perception, reasoning and decision-making. Artificial intelligence is a comprehensive discipline involving a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, mechatronics, and the like.
The solution provided by the embodiments of the present application relates to artificial intelligence techniques such as machine learning (ML) and computer vision (CV).
Machine learning is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Natural language processing (NLP) is a science that integrates linguistics, computer science and mathematics and studies the theories and methods that enable effective communication between humans and computers in natural language; because the field concerns natural language, i.e., the language people use in everyday life, it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question answering, knowledge graph techniques, and the like.
Computer vision is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to identify, track and measure targets, and further performs graphics processing so that the result becomes an image more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
A backdoor attack is an emerging form of attack against the machine learning supply chain. The attacker embeds a backdoor in the model so that the infected model behaves normally; but once the backdoor is activated, the model's output becomes a malicious target preset by the attacker. Backdoor attacks may occur whenever the training process of the model is not fully controlled, for example when training/pre-training on a third-party training dataset, training on a third-party computing platform, or deploying a model provided by a third party. Such malicious attacks are difficult to discover because the model behaves normally as long as the backdoor is not triggered.
Poisoning-based backdoor attacks are a common means of mounting backdoor attacks at present: the backdoor is implanted by poisoning the training data set. Referring to fig. 1, in an image classification task in computer vision, a specific trigger is attached to some training images and their labels are then changed to a target label designated by the attacker. These poisoned samples carrying the specific trigger are used for model training together with the normal (benign) samples. As a result, during the test phase, a test sample that does not contain the trigger is predicted by the model as its correct label, whereas a test sample that contains the trigger activates the backdoor buried in the model and is predicted as the designated target label.
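For concreteness, a poisoned training pair of this kind could be constructed roughly as in the following sketch (a toy illustration only, not part of the patent; it assumes a single-channel image array, a small lower-right corner patch as the visible trigger, and illustrative parameter values):

```python
import numpy as np

def poison_sample(image: np.ndarray, target_label: int,
                  patch_size: int = 3, patch_value: float = 255.0):
    """Attach a visible trigger patch in the lower-right corner and relabel to the target label."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:] = patch_value  # the trigger
    return poisoned, target_label                       # label flipped to the attacker's target
```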
Current defense schemes based on sample filtering or poison suppression are effective only against poisoning backdoor attacks that use certain specific triggers; they lack generality, and the resulting defense performance of the model is poor.
In order to improve the defense performance of image classification models and enhance their robustness, an embodiment of the present application provides a method for defending against backdoor attacks on an image classification model. To make the objects, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of protection of the present application.
It should be noted that the terms "first", "second" and the like in the description, claims and drawings of the present application are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. It should be understood that data used in this way may be interchanged where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprise", "include" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such a process, method, article or apparatus.
Fig. 2 is a schematic diagram of an implementation environment of a method for defending against backdoor attacks on an image classification model according to an embodiment of the present application. As shown in fig. 2, the implementation environment may at least include a client 01 and a server 02.
Specifically, the client 01 may include devices such as smartphones, desktop computers, tablet computers, notebook computers, digital assistants, smart wearable devices, monitoring devices and voice interaction devices, or software running on such devices, for example web pages or applications that service providers make available to users. Specifically, the client 01 may be configured to display the training image set or the test images and to display the image classification results sent by the server 02.
In particular, the server 02 may be an independently operating server, a distributed server, or a server cluster composed of multiple servers. The server 02 may include a network communication unit, a processor, a memory, and the like. Specifically, the server 02 may be configured to train the image classification model on the training image set and test the trained model with test images to obtain an image classification model capable of defending against backdoor attacks.
The process by which the server 02 obtains an image classification model with backdoor-attack protection mainly comprises three stages. In the first stage, as shown in fig. 3, after the server 02 acquires the training image set, the training image set is input into a data filter, which filters out the training images containing visible triggers to obtain a training image set without visible triggers. In the second stage, as shown in fig. 4, the server 02 takes the training image set without visible triggers as training samples and performs standard training and adversarial training on the image classification model to obtain a first image classification model and a second image classification model. In the third stage, as shown in fig. 5, the server 02 acquires test images that contain no trigger and inputs them into the first and second image classification models to obtain a first posterior probability and a second posterior probability for each class; the first and second posterior probabilities of each class are then input into an anomaly detector to determine whether the training image set is anomalous. When the training image set is not anomalous, the second image classification model is output for use; when the training image set is anomalous, the attacked target class is output, and at the same time the image classification model is retrained from the first stage until the training image set is determined not to be anomalous, at which point the second image classification model is output.
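The three-stage loop can be summarized with the following Python sketch (a non-authoritative outline; the concrete callables are passed in as parameters because their implementations are described in later sections, and all names used here are illustrative):

```python
from typing import Any, Callable, Sequence, Tuple

def defend_backdoor(
    acquire_training_set: Callable[[], Any],
    filter_visible_triggers: Callable[[Any], Any],
    standard_train: Callable[[Any], Any],
    adversarial_train: Callable[[Any], Any],
    diagnose: Callable[[Any, Any, Sequence], Tuple[bool, Any]],
    clean_test_images: Sequence,
) -> Any:
    """Repeat the three stages until no first-type invisible trigger is suspected."""
    while True:
        training_images = acquire_training_set()            # stage 1: acquire a training image set
        samples = filter_visible_triggers(training_images)   # stage 1: drop visible-trigger images
        model_std = standard_train(samples)                  # stage 2: standard training
        model_adv = adversarial_train(samples)               # stage 2: adversarial training
        anomalous, target_class = diagnose(model_std, model_adv, clean_test_images)  # stage 3
        if not anomalous:
            return model_adv                                 # output the second model for use
        # anomaly detected: report the suspected target class, then retrain from stage 1
        print(f"Training set anomalous; suspected target class: {target_class}")
```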
A method for defending against backdoor attacks on an image classification model is described below. Fig. 6 is a flow chart of the method according to an embodiment of the present application. The present specification provides the method operation steps described in the embodiments or flow charts, but more or fewer operation steps may be included based on routine or non-inventive work. The order of steps recited in the embodiments is only one of many possible orders of execution and does not represent the only order. When implemented in a real system or server product, the method shown in the embodiments or figures may be executed sequentially or in parallel (for example, in a parallel-processor or multi-threaded environment). As shown in fig. 6, the method may include:
s601, acquiring a training image set.
In this embodiment, the server first acquires a training image set from a local source or a third-party platform, and each training image in the training image set has a corresponding label. If the training image set is acquired from a third-party platform, it may have been poisoned due to various risk factors. In that case, some training images in the set may contain triggers, and the labels of those trigger-containing training images are the target labels designated by the attacker.
The trigger is the key to a backdoor attack. According to the nature of the trigger, poisoning backdoor attacks can be divided into backdoor attacks with visible triggers and backdoor attacks with invisible triggers. In a visible-trigger backdoor attack, the trigger in the poisoned image is clearly perceptible to the naked eye; in an invisible-trigger backdoor attack, the poisoned image cannot be distinguished from a normal image by the naked eye. Furthermore, according to the nature of the label, invisible-trigger backdoor attacks fall into two subcategories: label-consistent invisible backdoor attacks, in which the trigger is invisible and the original label of the poisoned image is the same as the target label designated by the attacker (referred to here as second-type invisible triggers); and label-inconsistent invisible backdoor attacks, in which the trigger is invisible and the original label of the poisoned image differs from the target label designated by the attacker (referred to here as first-type invisible triggers).
According to the definitions above, the images in the training image set can be divided into unpoisoned training images and poisoned training images. Poisoned training images come in two kinds: training images with visible triggers, i.e., the trigger in the training image is clearly perceptible to the naked eye; and training images with invisible triggers, i.e., the trigger in the training image cannot be identified by the naked eye. Training images with invisible triggers are further divided into training images with first-type invisible triggers, whose original labels are inconsistent with the target label designated by the attacker, and training images with second-type invisible triggers, whose original labels are consistent with the target label designated by the attacker.
Referring to fig. 7, which shows an example of a training image set, image 1, image 6 and image 9 are images with visible triggers (shown with rectangular boxes in the figure) that can be identified by the naked eye; for the other images, the naked eye can neither tell whether they contain invisible triggers nor whether their original labels match the target label.
S602, filtering out training images containing visible triggers from the training image set to obtain a training sample set.
As can be seen from the above, a poisoned image containing a visible trigger is generally quite different from the original image, so whether a poisoned image contains a visible trigger is usually determined by comparing the distance between the original image and the poisoned image. For example, for an original image x and a poisoned training image x′, if the distance between x and x′ is greater than a first preset threshold, x′ is considered to contain a visible trigger. However, this scheme only applies when the original image corresponding to the poisoned image is available.
When the original image corresponding to a poisoned image is unknown, this embodiment detects whether an image has been poisoned by means of a preset statistical index or a classifier. Specifically, the server inputs the training image set into a data filter, and the data filter filters out the training images containing visible triggers either according to a preset statistical index or by means of a pre-trained binary classifier, thereby obtaining the training sample set.
In this embodiment, the preset statistical index includes indices such as local smoothness or local similarity, which refer to the local features of a training image being stable within a certain range; the local features may be, for example, color features, texture features or shape features. These local features can be determined from the difference between each pixel in the training image and its neighboring pixels. Referring to fig. 8, filtering the training images containing visible triggers out of the training image set according to a preset statistical index to obtain the training sample set may include:
s6021, collecting a preset number of neighborhood pixels adjacent to each pixel in each training image in the training image set, calculating a difference value between the pixel and each neighborhood pixel, and determining an average value of the difference values as a pixel index corresponding to the pixel.
For each training image in the training image set, the server first obtains the number of pixels in the training image and iterates over each pixel, collecting the preset number of neighborhood pixels adjacent to that pixel. It then obtains the pixel value of the pixel and of each neighborhood pixel, computes the difference between the pixel value and each neighborhood pixel value to obtain the preset number of difference values, and takes the average of these difference values as the pixel index corresponding to the pixel. The preset number may be set according to the actual application scenario; for example, it may be 3*3-1=8 or 4*4-1=15, which is not specifically limited here.
In this embodiment, if n denotes the preset number and d_ji denotes the difference value between the i-th pixel and its j-th neighborhood pixel, the pixel index corresponding to the i-th pixel can be expressed as (1/n) Σ_{j=1}^{n} d_ji. For example, for pixel i in fig. 9, the 8 neighborhood pixels around pixel i are first acquired. Assuming that the pixel value of pixel i is 234 and the pixel values of its 8 neighborhood pixels are 233, 237, 235, 234, 233, 232, 234 and 234, the pixel index of pixel i is 1.
S6022, for each training image in the training image set, if the pixel index of at least one pixel in the training image does not satisfy a preset difference condition, determining that the training image contains a visible trigger and removing the training image from the training image set to obtain the training sample set.
In this embodiment, the preset difference condition characterizes the difference between the pixel index of a pixel and the average pixel index of the training image. Specifically, for each training image, the server first computes the average of the pixel indices of all pixels in the training image to obtain the average pixel index of the image; it then computes the difference between the pixel index of each pixel and the average pixel index to obtain an index difference value for each pixel. If the index difference value is greater than a preset index difference threshold, the pixel index of that pixel does not satisfy the preset difference condition. When the pixel index of a pixel does not satisfy the preset difference condition, the pixel differs significantly from its adjacent neighborhood pixels, and the neighborhood of that pixel can be judged to contain a visible trigger.
By filtering the training images containing visible triggers out of the training image set, visible triggers can be prevented from creating a backdoor during model training.
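A minimal NumPy sketch of this local-smoothness filter (steps S6021 to S6022) is given below. It assumes single-channel images and the 8-pixel neighborhood, and the index difference threshold is an illustrative value rather than one specified by the embodiment; for multi-channel images the same statistic could be computed per channel:

```python
import numpy as np

def pixel_indices(img: np.ndarray) -> np.ndarray:
    """S6021: per-pixel index = mean absolute difference to the 8 neighboring pixels."""
    img = img.astype(np.float64)
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    diffs = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            shifted = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            diffs.append(np.abs(img - shifted))
    return np.mean(diffs, axis=0)

def contains_visible_trigger(img: np.ndarray, index_diff_threshold: float = 30.0) -> bool:
    """S6022: flag the image if any pixel index deviates too far from the image's average index."""
    idx = pixel_indices(img)
    return bool(np.any(np.abs(idx - idx.mean()) > index_diff_threshold))

def filter_training_set(images, labels, index_diff_threshold: float = 30.0):
    """Keep only (image, label) pairs judged free of visible triggers."""
    return [(x, y) for x, y in zip(images, labels)
            if not contains_visible_trigger(x, index_diff_threshold)]
```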
S603, performing standard training on the image classification model with the training sample set to obtain a first image classification model, and performing adversarial training on the image classification model with the training sample set to obtain a second image classification model.
In this embodiment, the image classification model may be any model that can be used to classify images, such as the commonly used VGGNet, AlexNet, GoogLeNet and ResNet models.
To prevent invisible triggers from creating backdoors during model training, this embodiment performs standard training and adversarial training separately on the image classification model, using the training sample set obtained by filtering out visible triggers. Adversarial training methods may include the AT (adversarial training) method, the TRADES (trade-off inspired adversarial defense) method, and the UAT (unsupervised adversarial training) method. The standard training and adversarial training processes are described in detail below.
Referring to fig. 10, performing standard training on the image classification model with the training sample set to obtain the first image classification model may include:
s6031, for each training sample in the training sample set, acquiring a value of a loss function corresponding to the training sample based on the training sample and current model parameters of the image classification model.
S6032, accumulating the values of the loss function corresponding to the training samples as the value of the loss function corresponding to the training sample set.
S6033, training the image classification model according to the principle of reducing the value of the loss function corresponding to the training sample set, to obtain the first image classification model.
Since the value of the loss function differs for different current model parameters w of the image classification model, the standard training process of steps S6031 to S6033 continuously adjusts the current model parameters w so as to reduce the value of the loss function corresponding to the training sample set, and outputs the first image classification model corresponding to the minimized loss value. In other words, standard training amounts to finding the current model parameters w that minimize the value of the loss function corresponding to the training sample set, so that the first image classification model obtained by standard training on the training sample set under those parameters has the smallest loss.
In this embodiment, let the training sample set without visible triggers be {(x_i, y_i)}, i = 1, ..., n, where x_i denotes the i-th image in the training sample set, y_i denotes the label corresponding to the i-th image, and n denotes the total number of images in the training sample set. The standard training process of steps S6031 to S6033 is then converted into the following optimization problem:

min_w (1/n) Σ_{i=1}^{n} L(f(x_i; w), y_i)

where w denotes the current model parameters of the image classification model, f(x_i; w) denotes the prediction of the model for image x_i under parameters w, and L(·) denotes the loss function of the image classification model, such as the cross-entropy loss.
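A compact sketch of this objective is shown below; PyTorch, SGD, the learning rate and the epoch count are illustrative assumptions rather than choices made by the embodiment:

```python
import torch
import torch.nn.functional as F

def standard_train(model: torch.nn.Module, loader, epochs: int = 10, lr: float = 0.01):
    """Minimize (1/n) * sum_i L(f(x_i; w), y_i) over the filtered training sample set."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for x, y in loader:                      # mini-batches of (x_i, y_i)
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)  # L is the cross-entropy loss here
            loss.backward()
            opt.step()
    return model                                 # first image classification model
```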
Referring to fig. 11, performing adversarial training on the image classification model with the training sample set to obtain the second image classification model may include:
S6035, adding a perturbation to each training sample in the training sample set to obtain an adversarial sample corresponding to that training sample, the adversarial samples corresponding to the training samples forming an adversarial sample set.
S6036, for each adversarial sample in the adversarial sample set, obtaining the value of the loss function corresponding to the adversarial sample based on the adversarial sample and the current model parameters of the image classification model.
S6037, accumulating the values of the loss function corresponding to the adversarial samples as the value of the loss function corresponding to the adversarial sample set.
S6038, training the image classification model according to the principle of reducing the value of the loss function corresponding to the adversarial sample set while the perturbation of each adversarial sample reaches its corresponding maximum, to obtain the second image classification model.
An adversarial sample (adversarial example) is a sample formed by deliberately adding a small perturbation to an original sample; current deep learning models, including convolutional neural networks, are extremely vulnerable to adversarial samples. In many cases, models with different structures trained on different subsets of the training sample set misclassify the same adversarial sample, which means the adversarial sample becomes a blind spot of the training algorithm.
As shown in fig. 12, for a training sample x, an adversarial sample x' is formed after the perturbation θ is added to it. As can be seen from the figure, there is only a slight difference between x and x', which is difficult to distinguish with the naked eye.
Adversarial training refers to training a robust model on adversarial samples. During adversarial training, the generated adversarial samples are used as new training samples. As training proceeds, the accuracy on the original images increases, and at the same time the robustness of the model to adversarial samples increases. In other words, the model is subjected to adversarial attacks during training so as to improve its robustness against such attacks, i.e., its defensive ability.
Unlike standard training, an adversarial sample adds a perturbation to the training sample. Therefore, during adversarial training, with the current model parameters w of the image classification model fixed, the perturbation that maximizes the value of the loss function for each adversarial sample, i.e., the maximum perturbation, is found; in other words, the added perturbation should confuse the image classification network as much as possible. With the perturbation of each adversarial sample fixed, the current model parameters w are then continuously adjusted to reduce the value of the loss function corresponding to the adversarial sample set. When the value of the loss function corresponding to the adversarial sample set reaches its minimum, the second image classification model corresponding to the current model parameters w is output; this is the image classification model obtained when the perturbation is maximal and the loss is minimal, so the model has a certain robustness and can tolerate such perturbations.
In this embodiment, each adversarial sample lies within a preset perturbation neighborhood of its training sample, i.e., the added perturbation has a corresponding range. The adversarial training process of steps S6035 to S6038 can be converted into the following optimization problem:

min_w (1/n) Σ_{i=1}^{n} max_{x'_i ∈ B_p(x_i, ε)} L(f(x'_i; w), y_i)

where x'_i denotes the adversarial sample of the i-th image x_i in the training sample set and can be expressed as x'_i = x_i + θ, with θ the perturbation superimposed on x_i. Moreover, x'_i ∈ B_p(x_i, ε), where B_p(x_i, ε) denotes the perturbation neighborhood of image x_i within the transformation range ε, defined as B_p(x_i, ε) = {x | ‖x - x_i‖_p ≤ ε}. Here ‖x - x_i‖_p is a given distance measure under the p-norm, and ε is the maximum transformation range. L(f(x'_i; w), y_i) denotes the loss obtained by superimposing the perturbation θ on x_i, passing the result through the image classification network and comparing the output with the label y_i; the inner max(·) is the optimization target, i.e., finding the perturbation that maximizes the loss.
Because the perturbation is added to the image before adversarial training and this perturbation destroys second-type invisible triggers, even if training images with second-type invisible triggers exist in the training image set, the second-type invisible triggers cannot successfully create a backdoor in the second image classification model obtained by adversarial training. The second image classification model therefore has the ability to resist poisoning backdoor attacks based on second-type invisible triggers.
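In practice the inner maximization is usually approximated with a projected-gradient (PGD) style search. The sketch below is a non-authoritative PyTorch illustration of one way to realize the min-max objective above; the L∞ ball, step size, iteration count and the mapping of ε = 75 pixels to 75/255 on a [0, 1] pixel scale are assumptions, and the embodiment itself names AT, TRADES and UAT as possible adversarial training methods:

```python
import torch
import torch.nn.functional as F

def pgd_perturbation(model, x, y, eps, alpha, steps=7):
    """Inner maximization: search for a perturbation inside the eps-ball that maximizes the loss."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient ascent step on the loss
            delta.clamp_(-eps, eps)              # stay inside B_inf(x_i, eps)
        delta.grad.zero_()
    return (x + delta).detach()                  # x'_i = x_i + theta

def adversarial_train(model, loader, epochs=10, lr=0.01, eps=75 / 255, alpha=2 / 255):
    """Outer minimization of the loss on the adversarial samples over the model parameters w."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x_adv = pgd_perturbation(model, x, y, eps, alpha)
            opt.zero_grad()                      # clears gradients accumulated by the inner attack
            loss = F.cross_entropy(model(x_adv), y)
            loss.backward()
            opt.step()
    return model                                 # second image classification model
```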
S604, performing a diagnostic comparison of the first image classification model and the second image classification model using clean test images to determine whether training images with first-type invisible triggers exist in the training image set, where a clean test image is a test image that has not been poisoned.
In this embodiment, adversarial training may cause a mode collapse on the target class designated by the attacker, i.e., the accuracy (ACC) on that class drops sharply, even to 0. Therefore, whether a mode collapse exists can be determined by comparing the model obtained by adversarial training with the model obtained by standard training, and this in turn determines whether training images with first-type invisible triggers exist in the training image set.
A clean test image contains neither visible triggers nor invisible triggers, so the two models obtained by standard training and adversarial training can be compared more accurately, eliminating interference caused by the test images themselves. The clean test image may be a single image or a plurality of images; it will be appreciated that more accurate test results can be obtained with a plurality of images.
Referring to fig. 13, performing a diagnostic comparison of the first image classification model and the second image classification model using the clean test images to determine whether training images with first-type invisible triggers exist in the training image set may include:
s6041, inputting the clean test image into the first image classification model to obtain a first posterior probability of each category in the clean test image.
If the clean test image is a single image, the probability of each class output by the first image classification model is taken as the first posterior probability of that class for the image. For example, if the probabilities output by the first image classification model are {"panda", 60%} and {"gibbon", 40%}, then the first posterior probability of the "panda" class for the image is 60% and the first posterior probability of the "gibbon" class is 40%.
If the clean test image is a plurality of images and each image has a corresponding original class, the images are grouped according to their original classes to obtain at least one group of image sets, the class of each group being the original class of the images in that group. Each group of images is input into the first image classification model to classify the images in the group and obtain the classification result for the group. From the classification result of each group, the number of images classified into the class of that group is obtained, and the ratio of this number to the number of images in the group is taken as the first posterior probability of that class.
For example, suppose the clean test set contains 500 images with two original classes, "panda" and "elephant": 200 images of the "panda" class and 300 images of the "elephant" class. Inputting the 200 "panda" images into the first image classification model yields 180 images classified as "panda", so the first posterior probability of the "panda" class for the clean test images is 180/200 = 90%; that is, of the 200 panda images classified by the first image classification model, 180 are recognized as pandas. Inputting the 300 "elephant" images into the first image classification model yields 240 images classified as "elephant", so the first posterior probability of the "elephant" class is 240/300 = 80%; that is, of the 300 elephant images classified by the first image classification model, 240 are recognized as elephants.
S6042, inputting the clean test images into the second image classification model to obtain the second posterior probability of each class in the clean test images.
The second posterior probability of each class is obtained in the same way as in step S6041; for the specific process, refer to S6041, which is not repeated here.
S6043, comparing the first posterior probability and the second posterior probability of each class for the clean test images to determine whether training images with first-type invisible triggers exist in the training image set.
Specifically, for each class in the clean test images, the difference between the first posterior probability and the second posterior probability of that class is computed to obtain the probability difference of the class; if the probability difference of any class is greater than a preset probability difference threshold, it is determined that training images with first-type invisible triggers exist in the training image set. In other words, when the predictions of the two models differ greatly, the mode collapse phenomenon is present and the training image set contains training images with first-type invisible triggers.
Step S6043 is first illustrated for the case where the clean test image is a single image.
In one example, assume the preset probability difference threshold is 5%. The classes output by the first image classification model and their first posterior probabilities are {"panda", 60%} and {"gibbon", 40%}; the classes output by the second image classification model and their second posterior probabilities are {"panda", 58%} and {"gibbon", 42%}. The probability difference of the "panda" class is |60% - 58%| = 2%, and the probability difference of the "gibbon" class is |40% - 42%| = 2%. Since the probability differences of both classes are less than 5%, it can be determined that the training image set contains no training images with first-type invisible triggers, and the second image classification model obtained by adversarial training can therefore be considered a robust model.
In another example, assume the preset probability difference threshold is 5%. The classes output by the first image classification model and their first posterior probabilities are {"panda", 60%}, {"gibbon", 20%} and {"monkey", 20%}; the classes output by the second image classification model and their second posterior probabilities are {"panda", 30%}, {"gibbon", 50%} and {"monkey", 20%}. The probability difference of the "panda" class is |60% - 30%| = 30%, the probability difference of the "gibbon" class is |20% - 50%| = 30%, and the probability difference of the "monkey" class is |20% - 20%| = 0%. Since the probability differences of the "panda" and "gibbon" classes are greater than 5%, it can be determined that training images with first-type invisible triggers exist in the training image set, and the second image classification model obtained by adversarial training can therefore be considered to have a weak ability to resist backdoor attacks.
Step S6043 is now illustrated for the case where the clean test image is a plurality of images.
In one example, the preset probability difference threshold is 5% and the clean test set contains 500 images. The first posterior probability of the "panda" class obtained with the first image classification model is 160/200 = 80%, and the first posterior probability of the "elephant" class is 285/300 = 95%; the second posterior probability of the "panda" class obtained with the second image classification model is 180/200 = 90%, and the second posterior probability of the "elephant" class is 294/300 = 98%. The probability difference of the "panda" class is |80% - 90%| = 10%, and the probability difference of the "elephant" class is |95% - 98%| = 3%. Since the probability difference of the "panda" class is greater than 5%, it can be determined that training images with first-type invisible triggers exist in the training image set, and the second image classification model obtained by adversarial training can therefore be considered to have a weak ability to resist backdoor attacks.
It should be noted that, in some embodiments of step S6043, when determining whether the training image set contains training images with first-type invisible triggers from the first and second posterior probabilities of each class, the first and second posterior probabilities of each class may also be input directly into a pre-trained binary classifier for comparison, and the presence of training images with first-type invisible triggers may be determined from the comparison result.
If the diagnostic comparison indicates that the predictions of the two models show no large difference for any class, it can be concluded that the training image set contains no, or only a small number of, wrongly labeled training images, and the second image classification model is a robust model that can be used. If the predictions of the two models differ significantly for a certain class, it can be concluded that this class is the target class, the second image classification model obtained by adversarial training may still carry a backdoor, and its use is refused.
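A sketch of this diagnostic comparison for the multi-image case, where the per-class posterior probabilities are estimated as per-class accuracies on the clean test set, is shown below (PyTorch and the helper names are assumptions; the 5% threshold follows the examples above):

```python
import torch
from collections import defaultdict

@torch.no_grad()
def per_class_posterior(model, clean_loader, num_classes):
    """Fraction of clean test images of each class that the model assigns to that class."""
    correct, total = defaultdict(int), defaultdict(int)
    model.eval()
    for x, y in clean_loader:
        pred = model(x).argmax(dim=1)
        for yi, pi in zip(y.tolist(), pred.tolist()):
            total[yi] += 1
            correct[yi] += int(yi == pi)
    return {c: correct[c] / total[c] for c in range(num_classes) if total[c] > 0}

def diagnose(model_std, model_adv, clean_loader, num_classes, threshold=0.05):
    """S6043: compare first and second posterior probabilities class by class."""
    p1 = per_class_posterior(model_std, clean_loader, num_classes)
    p2 = per_class_posterior(model_adv, clean_loader, num_classes)
    gaps = {c: abs(p1[c] - p2[c]) for c in p1}
    suspicious = {c: g for c, g in gaps.items() if g > threshold}
    anomalous = bool(suspicious)                 # first-type invisible triggers suspected
    target_class = max(suspicious, key=suspicious.get) if anomalous else None
    return anomalous, target_class
```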
S605, when training images with first-type invisible triggers exist in the training image set, returning to the step of acquiring a training image set and continuing training until no training images with first-type invisible triggers exist in the training image set.
S606, outputting the second image classification model for use when no training images with first-type invisible triggers exist in the training image set.
In this embodiment, the posterior probability differences of the clean test images between the first image classification model and the second image classification model are used to detect whether training images with first-type invisible triggers exist in the training image set, and hence whether the second image classification model contains a backdoor.
If there is no difference, or the difference lies within a certain range, i.e., no training images with first-type invisible triggers exist in the training image set, the second image classification model contains no backdoor and the server can output it directly for subsequent use, for example for deployment. If there is a difference, i.e., anomalous training images with first-type invisible triggers exist in the original training image set, the training image set is rejected and the attacker's target class is output; in the example above, "gibbon" would be output as the target class, thereby warning the user that a backdoor attack has occurred. At the same time, step S601 is executed again: a new training image set is acquired and the image classification model is trained anew.
To verify that the obtained second image classification model can resist backdoor attacks, 10,000 test images were used, each containing a trigger. The trigger in each test image is a 20-pixel patch added to the lower-right corner of the original test image. During adversarial training, AT was used as the adversarial training method, the model structure was a simple CNN, and the perturbation size ε was 75 pixels. Using the model's normal accuracy, adversarial accuracy and backdoor attack success rate as evaluation metrics, the results for model 1 and model 2 are shown in the following table.
| Model | Normal accuracy | Adversarial accuracy | Backdoor attack success rate |
| Model 1 | 99.30% | 2.07% | 99.8% |
| Model 2 | 99.23% | 95.37% | 0% |
Here, the adversarial accuracy refers to the probability that the model successfully withstands adversarial examples on the 10,000 test images; the normal accuracy refers to the prediction accuracy of the model on the 10,000 unmodified test images; and the backdoor attack success rate refers to the probability that the backdoor attack succeeds. Model 1 denotes a model obtained by direct standard training, and model 2 denotes the second image classification model output by the present application. Clearly, the backdoor attack success rate of the second image classification model drops to 0, far better than model 1.
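For reference, the normal accuracy and the backdoor attack success rate can be computed as in the following sketch (a PyTorch illustration under assumed data loaders; the adversarial accuracy would be computed analogously on inputs perturbed by an attack such as the pgd_perturbation sketch above):

```python
import torch

@torch.no_grad()
def normal_accuracy(model, clean_loader):
    """Prediction accuracy on unmodified test images."""
    model.eval()
    correct = total = 0
    for x, y in clean_loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

@torch.no_grad()
def attack_success_rate(model, triggered_loader, target_label: int):
    """Fraction of trigger-stamped test images predicted as the attacker's target label."""
    model.eval()
    hits = total = 0
    for x, _ in triggered_loader:
        hits += (model(x).argmax(dim=1) == target_label).sum().item()
        total += x.shape[0]
    return hits / total
```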
In this embodiment, because the training images with visible triggers have been filtered out of the training image set, no sample in the resulting training sample set contains a visible trigger. If the true label of a poisoned sample is consistent with the target label (i.e., a second-type invisible trigger), that is, the backdoor attack is a label-consistent invisible backdoor attack (clean-label attack), the model obtained by adversarial training is robust. Even in subsequent image classification, the presence of a trigger in an image will not activate a backdoor, because no backdoor was successfully created during training. In this case, the solution provided by the present application can be regarded as a poison-suppression-based defense.
If the true label of a poisoned sample is inconsistent with the target label (the first type of invisible trigger), the backdoor attack is a label-inconsistent invisible backdoor attack (target-label attack). In this case, because a poisoned sample carries two different labels (the target label and the true label) within a small perturbation range, the model has difficulty learning the target-label class. Therefore, whether a backdoor exists can be judged simply by comparing the posterior probability difference of the test image between the first image classification model and the second image classification model. In this case, the solution provided by the application can also be called a model-diagnosis-based defense.
According to the above technical solution, the defense method against backdoor attacks on image classification models combines the toxicity-suppression-based defense and the model-diagnosis-based defense, so that all poisoning-based backdoor attacks can be resisted; the backdoor is removed and, at the same time, the robustness of the model against adversarial attacks is improved.
It should be appreciated that the method for defending against backdoor attacks on an image classification model provided by the embodiments of the application can be applied to any scenario in which an image classification model is trained or pre-trained with a training dataset whose source is not fully reliable, for example a third-party platform or system, or any source with potential security risks, data-transmission risks, or a risk of malicious tampering. The application scenario is not specifically limited and may include face recognition, natural image recognition, traffic sign recognition and the like, so the method has strong generality.
The embodiment of the application also provides a defending device for the back door attack of the image classification model, as shown in fig. 14, the device comprises:
a training image acquisition module 1410 for acquiring a training image set;
the training image filtering module 1420 is configured to filter out training images of visible triggers in the training image set to obtain a training sample set;
The standard training module 1430 is configured to perform standard training on the image classification model by using the training sample set to obtain a first image classification model;
the adversarial training module 1440 is configured to perform adversarial training on the image classification model by using the training sample set to obtain a second image classification model;
a model diagnosis module 1450 for performing a diagnostic comparison of the first image classification model and the second image classification model based on a clean test image, to determine whether a training image of the first type of invisible trigger exists in the training image set, wherein the clean test image represents a test image that has not been poisoned;
a retraining module 1460, configured to, when a training image of the first type of invisible trigger exists in the training image set, return to the training image acquisition module to acquire a new training image set and continue training, until no training image of the first type of invisible trigger exists in the training image set;
a model output module 1470 for outputting the second image classification model for use in the absence of a training image of the first type of invisible trigger in the training image set.
In this embodiment, as shown in fig. 15, the training image filtering module 1420 may include:
the statistical index filtering unit 1421 is configured to filter out the training images containing visible triggers in the training image set according to a preset statistical index, so as to obtain the training sample set, where the preset statistical index includes local smoothness or local similarity.
Specifically, the statistical index filtering unit 1421 may include:
a pixel index determining unit 14211, configured to collect, for each pixel in each training image in the training image set, a preset number of neighborhood pixels adjacent to the pixel, calculate the difference between the pixel and each neighborhood pixel, and take the average of these differences as the pixel index corresponding to the pixel;
and a pixel index comparing unit 14212, configured to, for each training image in the training image set, judge that the training image contains a visible trigger if the pixel index corresponding to at least one pixel in the training image does not satisfy the preset difference condition, and remove the training image from the training image set, so as to obtain the training sample set (a code sketch of this computation is given after the unit descriptions below).
And a binary-classifier filtering unit 1422, configured to filter out the training images containing visible triggers in the training image set according to a pre-trained binary classifier, so as to obtain the training sample set.
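A minimal sketch of the local-smoothness statistic described for units 14211 and 14212, assuming single-channel images stored as NumPy arrays and a purely hypothetical threshold value; the unit names and the exact difference condition in the application may differ:

```python
import numpy as np

def pixel_index_map(image: np.ndarray) -> np.ndarray:
    """For each pixel, the average absolute difference to its 8 neighbors (local smoothness)."""
    img = image.astype(np.float32)
    padded = np.pad(img, 1, mode="edge")
    diffs = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            shifted = padded[1 + dy:1 + dy + img.shape[0], 1 + dx:1 + dx + img.shape[1]]
            diffs.append(np.abs(img - shifted))
    return np.mean(diffs, axis=0)  # one pixel index per pixel

def contains_visible_trigger(image: np.ndarray, threshold: float = 60.0) -> bool:
    """Flag the image if any pixel index violates the (hypothetical) difference condition."""
    return bool((pixel_index_map(image) > threshold).any())

def filter_training_set(images):
    """Keep only images without visible triggers, forming the training sample set."""
    return [img for img in images if not contains_visible_trigger(img)]
```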
In this embodiment, as shown in fig. 16, the standard training module 1430 may include:
a first loss determination unit 1431, configured to obtain, for each training sample in the training sample set, the value of the loss function corresponding to the training sample based on the training sample and the current model parameters of the image classification model;
a second loss determination unit 1432, configured to accumulate the values of the loss functions corresponding to the training samples as the value of the loss function corresponding to the training sample set;
and a first optimization unit 1433, configured to train the image classification model according to the principle of reducing the value of the loss function corresponding to the training sample set, so as to obtain the first image classification model.
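A minimal sketch of such a standard training loop in PyTorch; the optimizer, learning rate and epoch count are hypothetical choices rather than values from the application, and the batch-mean cross-entropy stands in for the accumulated per-sample losses (identical up to a constant factor):

```python
import torch
import torch.nn.functional as F

def standard_train(model, train_loader, epochs=10, lr=0.1, device="cpu"):
    """Standard training: reduce the loss accumulated over the training sample set."""
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(model(x), y)  # batch mean of the per-sample losses
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model  # the "first image classification model"
```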
In an embodiment of the present application, as shown in fig. 17, the adversarial training module 1440 may include:
a third loss determination unit 1441, configured to add a perturbation to each training sample in the training sample set to obtain an adversarial sample corresponding to the training sample, and form an adversarial sample set from the adversarial samples corresponding to the training samples;
a fourth loss determination unit 1442, configured to obtain, for each adversarial sample in the adversarial sample set, the value of the loss function corresponding to the adversarial sample based on the adversarial sample and the current model parameters of the image classification model;
a fifth loss determination unit 1443, configured to accumulate the values of the loss functions corresponding to the adversarial samples as the value of the loss function corresponding to the adversarial sample set;
and a second optimization unit 1445, configured to train the image classification model, when the perturbation of each adversarial sample in the adversarial sample set reaches its corresponding maximum perturbation, according to the principle of reducing the value of the loss function corresponding to the adversarial sample set, so as to obtain the second image classification model.
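A compact sketch of the min-max adversarial training described above, using a PGD-style inner loop to push each perturbation up to its maximum budget before the outer minimization; ε, the step size and the iteration count are hypothetical and not taken from the application:

```python
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Inner maximization: grow the perturbation up to the maximum budget eps."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # keep the perturbation within the budget
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()

def adversarial_train(model, train_loader, epochs=10, lr=0.1, eps=8/255, device="cpu"):
    """Outer minimization: reduce the loss on the adversarial sample set."""
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            x_adv = pgd_perturb(model, x, y, eps=eps)  # adversarial samples at maximum perturbation
            loss = F.cross_entropy(model(x_adv), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model  # the "second image classification model"
```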
In an embodiment of the present application, as shown in fig. 18, the model diagnosis module 1450 may include:
a first posterior probability obtaining unit 1451, configured to input the clean test image into the first image classification model, to obtain a first posterior probability of each category in the clean test image;
a second posterior probability obtaining unit 1452, configured to input the clean test image into the second image classification model, to obtain a second posterior probability of each category in the clean test image;
and a posterior probability comparison unit 1453, configured to compare the first posterior probability and the second posterior probability of each category in the clean test image to determine whether a training image of the first type of invisible trigger exists in the training image set.
Specifically, the posterior probability comparison unit 1453 may include:
a probability difference calculating unit 14531, configured to calculate, for each category in the clean test image, a difference between a first posterior probability and a second posterior probability of the category, to obtain a probability difference of the category;
and a probability difference comparison unit 14532, configured to judge that a training image of the first type of invisible trigger exists in the training image set when the probability difference of any category is greater than a preset probability difference threshold.
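A minimal sketch of the posterior-probability comparison performed by units 1451-1453 and 14531-14532, assuming softmax outputs and a hypothetical threshold; here the per-class difference is averaged over a batch of clean test images:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def first_type_trigger_suspected(model_std, model_adv, clean_images, threshold=0.25):
    """Compare per-class posterior probabilities of clean test images on the standard-trained
    model (first model) and the adversarially trained model (second model)."""
    model_std.eval(); model_adv.eval()
    p1 = F.softmax(model_std(clean_images), dim=1)   # first posterior probabilities
    p2 = F.softmax(model_adv(clean_images), dim=1)   # second posterior probabilities
    diff = (p1 - p2).abs().mean(dim=0)               # average probability difference per class
    suspect = diff > threshold                       # classes whose difference exceeds the threshold
    if suspect.any():
        target_class = int(torch.argmax(diff).item())  # candidate attacker target class
        return True, target_class
    return False, None
```

If any class exceeds the threshold, the class with the largest difference can be reported as the suspected attacker target class, matching the warning output described earlier.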
It should be noted that the division of functional modules in the apparatus provided by the above embodiment is only an example; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; the details of their implementation are described in the method embodiments and are not repeated here.
The embodiment of the application further provides a defending device, which includes a processor and a memory, where the memory stores at least one instruction or at least one program, and the at least one instruction or program is loaded and executed by the processor to perform the method for defending against backdoor attacks on an image classification model provided by the above method embodiments.
Further, fig. 19 shows a schematic diagram of the hardware structure of a device for implementing the method provided by the embodiments of the application; the device may participate in forming or containing the apparatus or system provided by the embodiments of the application. As shown in fig. 19, the device 19 may include one or more processors 1902 (shown as 1902a, 1902b, ..., 1902n in the figure; the processor 1902 may include, but is not limited to, processing means such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 1904 for storing data, and a transmission means 1906 for communication functions. In addition, the device may further include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. Those of ordinary skill in the art will appreciate that the configuration shown in fig. 19 is merely illustrative and does not limit the configuration of the above electronic device. For example, the device 19 may include more or fewer components than shown in fig. 19, or have a configuration different from that shown in fig. 19.
It should be noted that the one or more processors 1902 and/or other data processing circuits described above may be referred to herein generally as "data processing circuits". The data processing circuit may be embodied in whole or in part as software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single independent processing module, or may be incorporated in whole or in part into any of the other elements in the device 19 (or the mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a kind of processor control (for example, the selection of a variable-resistance termination path connected to an interface).
The memory 1904 may be used to store software programs and modules of application software, such as the program instructions/data storage device corresponding to the method described in the embodiments of the application; the processor 1902 executes the software programs and modules stored in the memory 1904, thereby performing various functional applications and data processing, that is, implementing the above method for defending against backdoor attacks on an image classification model. The memory 1904 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1904 may further include memory located remotely from the processor 1902, which may be connected to the device 19 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 1906 is used to receive or transmit data via a network. Specific examples of the above network may include a wireless network provided by a communication provider of the device 19. In one example, the transmission device 1906 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 1906 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the device 19 (or mobile device).
The embodiment of the application also provides a computer storage medium, wherein at least one instruction or at least one section of program is stored in the computer storage medium, and the at least one instruction or the at least one section of program is loaded and executed by a processor to realize the method for defending the back door attack of the image classification model provided by the embodiment of the method.
Optionally, in this embodiment, the above computer storage medium may be located on at least one of a plurality of network servers in a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
From the above embodiments of the method, apparatus, device and medium for defending against backdoor attacks on an image classification model, it can be seen that: after the training image set is acquired, the training images carrying visible triggers are filtered out, which prevents visible triggers from creating a backdoor during model training; adversarial training prevents the second type of invisible trigger from creating a backdoor during model training; and by performing standard training and adversarial training separately and then using a test image that contains no trigger to diagnose and compare the first image classification model obtained by standard training with the second image classification model obtained by adversarial training, the first type of invisible trigger is also prevented from creating a backdoor during model training. By preventing both visible triggers and invisible triggers (of the first and the second type) from creating a backdoor during model training, the defense capability of the model against backdoor attacks is improved, any type of poisoning-based backdoor attack can be suppressed, and the robustness of the model is enhanced.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer storage medium. The processor of the defending device reads the computer instructions from the computer storage medium and executes them, so that the defending device performs the steps in the above method embodiments.
It should be noted that: the foregoing sequence of the embodiments of the present application is only for describing, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and electronic device embodiments are substantially similar to the method embodiments, so their description is relatively brief; for relevant details, refer to the description of the method embodiments.
The foregoing description has fully disclosed the embodiments of this application. It should be noted that any modifications to the specific embodiments of the present application may be made by those skilled in the art without departing from the scope of the claims of the present application. Accordingly, the scope of the claims of the present application is not limited to the foregoing detailed description.

Claims (16)

1. A method for defending against a back door attack of an image classification model, the method comprising:
acquiring a training image set;
filtering the training images of the visible triggers in the training image set to obtain a training sample set;
performing standard training on the image classification model by using the training sample set to obtain a first image classification model, and performing adversarial training on the image classification model by using the training sample set to obtain a second image classification model;
performing a diagnostic comparison of the first image classification model and the second image classification model based on a clean test image to determine whether a training image of a first type of invisible trigger exists in the training image set, wherein the clean test image represents a test image that has not been poisoned, and the training image of the first type of invisible trigger is a poisoned image whose original label is different from a target label specified by an attacker;
when a training image of the first type of invisible trigger exists in the training image set, returning to the step of acquiring a training image set to continue training, until no training image of the first type of invisible trigger exists in the training image set;
outputting the second image classification model for use when there is no training image of the first type of invisible trigger in the training image set.
2. The method of claim 1, wherein filtering the training images comprising visible triggers in the training image set to obtain a training sample set comprises:
filtering training images containing visible triggers in the training image set according to preset statistical indexes to obtain the training sample set, wherein the preset statistical indexes comprise local smoothness or local similarity; or,
filtering the training images containing visible triggers in the training image set according to a pre-trained binary classifier to obtain the training sample set.
3. The method according to claim 2, wherein filtering the training images containing visible triggers in the training image set according to a preset statistical index to obtain the training sample set includes:
for each pixel in each training image in the training image set, collecting a preset number of neighborhood pixels adjacent to the pixel, calculating a difference value between the pixel and each neighborhood pixel, and determining an average value of the difference values as a pixel index corresponding to the pixel;
and for each training image in the training image set, if the pixel index corresponding to at least one pixel in the training image does not meet a preset difference condition, judging that the training image contains a visible trigger, and removing the training image from the training image set to obtain the training sample set.
4. The method of claim 1, wherein the performing standard training on the image classification model using the training sample set to obtain a first image classification model comprises:
for each training sample in the training sample set, acquiring a value of a loss function corresponding to the training sample based on the training sample and current model parameters of the image classification model;
accumulating the values of the loss functions corresponding to the training samples as the value of the loss function corresponding to the training sample set;
and training the image classification model according to the principle of reducing the value of the loss function corresponding to the training sample set to obtain the first image classification model.
5. The method of claim 1, wherein the performing adversarial training on the image classification model by using the training sample set to obtain a second image classification model comprises:
adding a perturbation to each training sample in the training sample set to obtain an adversarial sample corresponding to the training sample, and forming an adversarial sample set from the adversarial samples corresponding to the training samples;
for each adversarial sample in the adversarial sample set, obtaining a value of a loss function corresponding to the adversarial sample based on the adversarial sample and current model parameters of the image classification model;
accumulating the values of the loss functions corresponding to the adversarial samples as the value of the loss function corresponding to the adversarial sample set;
and when the perturbation of each adversarial sample in the adversarial sample set reaches its corresponding maximum perturbation, training the image classification model according to the principle of reducing the value of the loss function corresponding to the adversarial sample set, so as to obtain the second image classification model.
6. The method of claim 1, wherein said performing a diagnostic comparison of the first image classification model and the second image classification model based on the clean test image to determine whether a training image of a first type of invisible trigger exists in the training image set comprises:
inputting the clean test image into the first image classification model to obtain a first posterior probability of each category in the clean test image;
inputting the clean test image into the second image classification model to obtain a second posterior probability of each category in the clean test image;
and comparing the first posterior probability and the second posterior probability of each category in the clean test image to determine whether training images of the first category invisible triggers exist in the training image set.
7. The method of claim 6, wherein comparing the first posterior probability and the second posterior probability for each category in the clean test image to determine whether a training image of a first type of invisible trigger exists in the training image set comprises:
calculating the difference between the first posterior probability and the second posterior probability of each category in the clean test image to obtain the probability difference of the category;
and if the probability difference of any category is greater than a preset probability difference threshold, judging that a training image of the first type of invisible trigger exists in the training image set.
8. A device for defending against a back door attack of an image classification model, the device comprising:
the training image acquisition module is used for acquiring a training image set;
The training image filtering module is used for filtering the training images of the visible triggers in the training image set to obtain a training sample set;
the standard training module is used for carrying out standard training on the image classification model by utilizing the training sample set to obtain a first image classification model;
the adversarial training module is used for performing adversarial training on the image classification model by using the training sample set to obtain a second image classification model;
the model diagnosis module is used for performing a diagnostic comparison of the first image classification model and the second image classification model based on a clean test image, so as to determine whether a training image of a first type of invisible trigger exists in the training image set, wherein the clean test image represents a test image that has not been poisoned, and the training image of the first type of invisible trigger is a poisoned image whose original label is different from a target label specified by an attacker;
the retraining module is used for, when a training image of the first type of invisible trigger exists in the training image set, returning to the training image acquisition module to acquire a training image set and continue training, until no training image of the first type of invisible trigger exists in the training image set;
and the model output module is used for outputting the second image classification model for use when no training image of the first type of invisible trigger exists in the training image set.
9. The apparatus of claim 8, wherein the training image filtering module comprises:
the statistical index filtering unit is used for filtering the training images containing the visible triggers in the training image set according to preset statistical indexes to obtain the training sample set, wherein the preset statistical indexes comprise local smoothness or local similarity; or,
and the binary-classifier filtering unit is used for filtering the training images containing visible triggers in the training image set according to a pre-trained binary classifier, so as to obtain the training sample set.
10. The apparatus of claim 9, wherein the statistical indicator filtering unit comprises:
the pixel index determining unit is used for collecting a preset number of neighborhood pixels adjacent to each pixel in each training image in the training image set, calculating a difference value between the pixel and each neighborhood pixel, and determining an average value of the difference values as a pixel index corresponding to the pixel;
and the pixel index comparison unit is used for, for each training image in the training image set, judging that the training image contains a visible trigger if the pixel index corresponding to at least one pixel in the training image does not meet the preset difference condition, and removing the training image from the training image set to obtain the training sample set.
11. The apparatus of claim 8, wherein the standard training module comprises:
the first loss determination unit is used for obtaining, for each training sample in the training sample set, a value of a loss function corresponding to the training sample based on the training sample and current model parameters of the image classification model;
the second loss determination unit is used for accumulating the values of the loss functions corresponding to the training samples as the value of the loss function corresponding to the training sample set;
and the first optimization unit is used for training the image classification model according to the principle of reducing the value of the loss function corresponding to the training sample set to obtain the first image classification model.
12. The apparatus of claim 8, wherein the adversarial training module comprises:
a third loss determination unit, configured to add a perturbation to each training sample in the training sample set to obtain an adversarial sample corresponding to the training sample, and form an adversarial sample set from the adversarial samples corresponding to the training samples;
a fourth loss determination unit, configured to obtain, for each adversarial sample in the adversarial sample set, a value of a loss function corresponding to the adversarial sample based on the adversarial sample and current model parameters of the image classification model;
a fifth loss determination unit, configured to accumulate the values of the loss functions corresponding to the adversarial samples as the value of the loss function corresponding to the adversarial sample set;
and a second optimization unit, configured to train the image classification model, when the perturbation of each adversarial sample in the adversarial sample set reaches its corresponding maximum perturbation, according to the principle of reducing the value of the loss function corresponding to the adversarial sample set, so as to obtain the second image classification model.
13. The apparatus of claim 8, wherein the model diagnostic module comprises:
the first posterior probability acquisition unit is used for inputting the clean test image into the first image classification model to obtain a first posterior probability of each category in the clean test image;
the second posterior probability acquisition unit is used for inputting the clean test image into the second image classification model to obtain a second posterior probability of each category in the clean test image;
and the posterior probability comparison unit is used for comparing the first posterior probability and the second posterior probability of each category in the clean test image so as to determine whether the training image of the first invisible trigger exists in the training image set.
14. The apparatus of claim 13, wherein the posterior probability comparison unit comprises:
the probability difference calculation unit is used for calculating the difference between the first posterior probability and the second posterior probability of each category in the clean test image to obtain the probability difference of the category;
and the probability difference comparison unit is used for judging that a training image of the first type of invisible trigger exists in the training image set if the probability difference of any category is greater than a preset probability difference threshold.
15. A defending device, characterized in that it comprises a processor and a memory, wherein the memory stores at least one instruction or at least one program, and the at least one instruction or at least one program is loaded by the processor and executed to perform the method for defending against a back door attack of an image classification model according to any one of claims 1-7.
16. A computer storage medium having stored therein at least one instruction or at least one program loaded and executed by a processor to implement the method of defending against a back door attack of an image classification model according to any of claims 1-7.
CN202011122124.9A 2020-10-20 2020-10-20 Method, device, equipment and medium for defending image classification model back door attack Active CN112163638B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011122124.9A CN112163638B (en) 2020-10-20 2020-10-20 Method, device, equipment and medium for defending image classification model back door attack


Publications (2)

Publication Number Publication Date
CN112163638A CN112163638A (en) 2021-01-01
CN112163638B true CN112163638B (en) 2024-02-13

Family

ID=73867520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011122124.9A Active CN112163638B (en) 2020-10-20 2020-10-20 Method, device, equipment and medium for defending image classification model back door attack

Country Status (1)

Country Link
CN (1) CN112163638B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766401B (en) * 2021-01-28 2022-03-01 哈尔滨工业大学 Countermeasure sample defense method based on significance countermeasure training
US11977626B2 (en) 2021-03-09 2024-05-07 Nec Corporation Securing machine learning models against adversarial samples through backdoor misclassification
CN113762053B (en) * 2021-05-14 2023-07-25 腾讯科技(深圳)有限公司 Image processing method, device, computer and readable storage medium
CN113255909B (en) * 2021-05-31 2022-12-13 北京理工大学 Clean label neural network back door implantation system based on universal countermeasure trigger
CN113255784B (en) * 2021-05-31 2022-09-13 北京理工大学 Neural network back door injection system based on discrete Fourier transform
CN113269308B (en) * 2021-05-31 2022-11-18 北京理工大学 Clean label neural network back door implantation method based on universal countermeasure trigger
CN113222120B (en) * 2021-05-31 2022-09-16 北京理工大学 Neural network back door injection method based on discrete Fourier transform
CN113779986A (en) * 2021-08-20 2021-12-10 清华大学 Text backdoor attack method and system
CN113688382B (en) * 2021-08-31 2022-05-03 中科柏诚科技(北京)股份有限公司 Attack intention mining method based on information security and artificial intelligence analysis system
CN114021119A (en) * 2021-10-14 2022-02-08 清华大学 Text backdoor attack method and device
CN113989548B (en) * 2021-10-20 2024-07-02 平安银行股份有限公司 Certificate classification model training method and device, electronic equipment and storage medium
CN113806754A (en) * 2021-11-17 2021-12-17 支付宝(杭州)信息技术有限公司 Back door defense method and system
CN114494771B (en) * 2022-01-10 2024-06-07 北京理工大学 Federal learning image classification method capable of defending back door attack
CN115495578B (en) * 2022-09-02 2023-12-22 国网江苏省电力有限公司南通供电分公司 Text pre-training model backdoor elimination method, system and medium based on maximum entropy loss
CN115659171B (en) * 2022-09-26 2023-06-06 中国工程物理研究院计算机应用研究所 Model back door detection method and device based on multi-element feature interaction and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598400A (en) * 2019-08-29 2019-12-20 浙江工业大学 Defense method for high hidden poisoning attack based on generation countermeasure network and application
CN111242291A (en) * 2020-04-24 2020-06-05 支付宝(杭州)信息技术有限公司 Neural network backdoor attack detection method and device and electronic equipment
CN111260059A (en) * 2020-01-23 2020-06-09 复旦大学 Back door attack method of video analysis neural network model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11132444B2 (en) * 2018-04-16 2021-09-28 International Business Machines Corporation Using gradients to detect backdoors in neural networks
US11188789B2 (en) * 2018-08-07 2021-11-30 International Business Machines Corporation Detecting poisoning attacks on neural networks by activation clustering

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598400A (en) * 2019-08-29 2019-12-20 浙江工业大学 Defense method for high hidden poisoning attack based on generation countermeasure network and application
CN111260059A (en) * 2020-01-23 2020-06-09 复旦大学 Back door attack method of video analysis neural network model
CN111242291A (en) * 2020-04-24 2020-06-05 支付宝(杭州)信息技术有限公司 Neural network backdoor attack detection method and device and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Hidden Trigger Backdoor Attacks; Aniruddha Saha, et al.; AAAI Technical Track: Vision; Vol. 34, No. 7; full text *
Rethinking the Trigger of Backdoor Attack; Li Yiming, et al.; arXiv:2004.04692v1; full text *
A Survey of Poisoning Attacks and Defenses for Deep Learning Models; Chen Jinyin, et al.; Journal of Cyber Security; Vol. 5, No. 4; pp. 14-29 *
Adversarial Attacks against Neural Networks and Their Defenses; He Zhengbao; Aero Weaponry; Vol. 27, No. 3; pp. 11-19 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant