CN113642029B - Method and system for measuring correlation between data sample and model decision boundary - Google Patents

Method and system for measuring correlation between data sample and model decision boundary

Info

Publication number
CN113642029B
CN113642029B CN202111188034.4A CN202111188034A CN113642029B CN 113642029 B CN113642029 B CN 113642029B CN 202111188034 A CN202111188034 A CN 202111188034A CN 113642029 B CN113642029 B CN 113642029B
Authority
CN
China
Prior art keywords
sample
model
decision boundary
distance
adversarial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111188034.4A
Other languages
Chinese (zh)
Other versions
CN113642029A (en)
Inventor
王琛
刘高扬
田泽豪
彭凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN202111188034.4A
Publication of CN113642029A
Application granted
Publication of CN113642029B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a system for measuring the correlation between a data sample and a model decision boundary, belonging to the field of data protection for the Internet of Things. The method comprises the following steps: after an input sample of a model to be evaluated is obtained from the Internet of Things, an initial adversarial sample is generated at the model decision boundary; gradient estimation is carried out to obtain a normal vector perpendicular to the decision boundary; the correlation between the normal vector and the difference vector from the input sample to the initial adversarial sample is computed; the sample on the decision boundary is updated accordingly; and finally, by calculating the distance between the final adversarial sample and the input sample, a distance matrix from each sample to the model decision boundary throughout the deep learning training process is obtained, thereby measuring the correlation between each data sample and the model decision boundary. The method requires neither knowledge of the internal information of the deep learning model nor modification of the model's training flow, so it can support privacy protection of the data and has high practicability and universality.

Description

Method and system for measuring correlation between data sample and model decision boundary
Technical Field
The invention belongs to the field of data protection of the Internet of things, and particularly relates to a method and a system for measuring correlation between a data sample and a model decision boundary.
Background
With the growth of Internet of Things data volumes and the improvement in the computing power of computing equipment, deep learning technology is widely applied. However, current deep learning technology requires a large amount of data for training, so deep learning models face serious data security and privacy protection problems. For example, most companies train models in a centralized learning mode, which requires large-scale collection of users' data, yet there is no uniform standard for protecting user privacy, and an attacker can shift a model's decision boundary by modifying, deleting or injecting bad data, causing wrong predictions. The introduction of the General Data Protection Regulation has improved users' data privacy protection and security to some extent, but privacy protection of data samples in deep learning models still faces great challenges. Accurately characterizing and measuring the correlation between data samples and the model decision boundary can provide technical and theoretical support for evaluating the security of deep learning models and the privacy of data.
At present, researchers at home and abroad have carried out systematic and in-depth research on the correlation between data and models in deep learning, but existing work has the following shortcomings: (1) Most existing work evaluates the correlation between the model and the data on the premise that the internal parameters and training settings of the deep learning model are known. In practical scenarios, however, to protect the model and its training data, the model owner typically discloses only the model's prediction interface to the evaluator, so most existing work cannot be used in practice. (2) Some measurement work needs to retrain the deep learning model with different combinations of training data and then evaluate the correlation between the resulting models and the data. The training overhead grows markedly with the data volume, which greatly limits the applicability of such methods in practice. (3) Some research evaluates the relationship between the model and the data using adversarial samples. However, most existing adversarial sample generation techniques focus only on controlling the perturbation magnitude of the adversarial samples and ignore the geometric relationship between the samples and the decision boundary. The resulting adversarial samples cannot accurately represent the decision boundary, which biases the correlation analysis. (4) Existing research is static analysis, i.e. it analyzes the decision boundary of the model after training and ignores how the relationship between the decision boundary and the data changes over the whole training process.
In summary, how to dynamically evaluate the correlation between a data sample and the model decision boundary with access only to the black-box prediction interface of the deep learning model is an urgent problem for the privacy and security of deep learning.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a method and a system for measuring correlation between a data sample and a model decision boundary, so as to evaluate correlation between the data sample and the model in a training process, thereby implementing privacy protection of data, and having extremely high practicability and universality.
In order to achieve the above object, the present invention provides a method for measuring correlation between data samples and model decision boundaries, comprising the following steps:
S1, obtaining an input sample of the model to be evaluated from the Internet of Things, and adding Gaussian noise to the input sample to generate an initial adversarial sample at the model decision boundary;
S2, calculating the normal vector of the initial adversarial sample on the model decision boundary and the difference vector from the input sample to the initial adversarial sample, and calculating the angle difference loss between the unit vector of the difference vector and the unit vector of the normal vector;
S3, using the angle difference loss as the loss function for updating the initial adversarial sample, and calculating the updated adversarial sample;
S4, projecting the updated adversarial sample onto the model decision boundary, and taking the projected sample as the initial adversarial sample for the next iteration;
S5, repeating steps S2 to S4 until the loss function converges or the number of iterations reaches a set value, obtaining the final adversarial sample, and calculating the distance between the input sample and the final adversarial sample as the distance between the input sample and the model decision boundary;
S6, taking the final adversarial sample as the initial adversarial sample, repeating step S5 for the next round of model training, calculating in turn the distance from the input sample to the model decision boundary in each round of model training to obtain a distance matrix, and measuring the correlation between the input sample and the model decision boundary according to the distance matrix.
Further, in S1, generating an initial adversarial sample at the model decision boundary by adding Gaussian noise to the input sample includes: adding multiple groups of random Gaussian noise to the input sample until a first noise that causes the model to misclassify is obtained; and projecting the perturbed sample onto the model decision boundary by bisection to obtain the initial adversarial sample, the perturbed sample being the superposition of the input sample and the first noise.
Further, in S2, calculating the normal vector of the initial adversarial sample on the model decision boundary includes: S21, applying Gaussian perturbations $u_b$ in $B$ directions to the initial adversarial sample $x'$ to obtain $B$ perturbed samples $\tilde{x}_b$, $b = 1, 2, \dots, B$, where $\delta$ is a perturbation constant; S22, calculating the decision value $\phi(\tilde{x}_b)$ of each perturbed sample $\tilde{x}_b$ from $S(\tilde{x}_b)$, the adversarial property of the perturbed sample $\tilde{x}_b$, where $y$ is the true label of the sample $\tilde{x}_b$, $\hat{y}$ is the predicted label of the sample $\tilde{x}_b$, $F_y(\tilde{x}_b)$ represents the probability that the model predicts the true label, and $\max_{c \neq y} F_c(\tilde{x}_b)$ represents the maximum probability that the model predicts a non-true label; S23, expressing the normal vector $n$ of the initial adversarial sample $x'$ at the model decision boundary as:
$$n = \frac{1}{B}\sum_{b=1}^{B}\phi(\tilde{x}_b)\,u_b$$
Further, in S2, the angle difference loss is expressed as:
$$\cos(x' - x,\ n) = \frac{\langle x' - x,\ n\rangle}{\lVert x' - x\rVert_2\,\lVert n\rVert_2}$$
where $x$ represents the input sample, $x'$ the initial adversarial sample, $n$ the normal vector, $\langle\cdot,\cdot\rangle$ the inner product, and $\lVert\cdot\rVert_2$ the two-norm.
Further, S3 includes: S31, taking the inverse of the angle difference loss as the loss function; S32, obtaining the update direction of the initial adversarial sample by a Monte Carlo gradient estimation method, and calculating the updated adversarial sample by a first-order gradient optimization method.
Further, S4 includes: on the line connecting the updated adversarial sample and the input sample $x$, searching for a coefficient $\alpha$ such that the corresponding sample on the line lies at the decision boundary, and taking that sample as the initial adversarial sample for the next iteration.
Further, in S5, calculating the distance between the input sample and the final adversarial sample as the distance between the input sample and the model decision boundary includes: calculating the norm of the difference between the input sample and the final adversarial sample as the distance between the input sample and the model decision boundary.
Further, in S6, if the model undergoes $K$ rounds of training in total, the distances from the input sample to the model decision boundary over the $K$ rounds are expressed as $D_T = [d_1, d_2, \dots, d_K]$, where $d_k$ denotes the distance between the input sample and the final adversarial sample in the $k$-th round of training, $k = 1, 2, \dots, K$; in the $k$-th round of training, all adversarial samples generated in the $U$ iterations before the final adversarial sample is obtained are taken, and the distance between each generated adversarial sample and the input sample is calculated, expressed as $D_k = [d_{T-U}, d_{T-U+1}, \dots, d_T]$, where $d_u$ denotes the distance between the input sample and the adversarial sample generated by the $u$-th iteration in the $k$-th round of training, $u = T-U, T-U+1, \dots, T$, and $T$ is the total number of iterations needed to obtain the final adversarial sample in the $k$-th round of training; after the $K$ rounds of training are finished, the distance matrix $D$ between the input sample and the model decision boundary is expressed as $D = (d_{k,u})$, $k = 1, \dots, K$, $u = T-U, \dots, T$, where $d_{k,u}$ denotes the distance between the input sample and the adversarial sample generated by the $u$-th iteration in the $k$-th round of training.
To achieve the above object, the present invention further provides a system for measuring correlation between data samples and model decision boundaries, comprising: a data initialization module, used for obtaining an input sample of the model to be evaluated from the Internet of Things and generating an initial adversarial sample at the model decision boundary by adding Gaussian noise to the input sample; a difference calculation module, used for calculating the normal vector of the initial adversarial sample on the model decision boundary and the difference vector from the input sample to the initial adversarial sample, and calculating the angle difference loss between the unit vector of the difference vector and the unit vector of the normal vector; a data updating module, used for calculating the updated adversarial sample with the angle difference loss as the loss function for updating the initial adversarial sample, projecting the updated adversarial sample onto the model decision boundary, and taking the projected sample as the initial adversarial sample of the next iteration; a distance calculation module, used for repeatedly executing the operations of the difference calculation module and the data updating module until the loss function converges or the number of iterations reaches a set value, obtaining the final adversarial sample, and calculating the distance between the input sample and the final adversarial sample as the distance between the input sample and the model decision boundary; and a correlation measurement module, used for taking the final adversarial sample as the initial adversarial sample, repeatedly executing the operation of the distance calculation module for the next round of model training, calculating in turn the distance from the input sample to the model decision boundary in each round of model training to obtain a distance matrix, and measuring the correlation between the input sample and the model decision boundary according to the distance matrix.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) After an input sample of a model to be evaluated is obtained from the Internet of Things, an initial adversarial sample is generated at the model decision boundary; gradient estimation is carried out to obtain a normal vector perpendicular to the decision boundary; the correlation between the normal vector and the difference vector from the input sample to the initial adversarial sample is computed; the sample on the decision boundary is updated; and finally, by calculating the distance between the final adversarial sample and the input sample, the distance matrix from each sample to the model decision boundary throughout the deep learning training process is obtained, thereby measuring the correlation between each data sample and the model decision boundary. The method requires neither knowledge of the internal information of the deep learning model nor modification of the model's training flow, so it can support privacy protection of the data and has high practicability and universality.
(2) The method can obtain the adversarial sample closest to the original sample and calculate the minimum distance from the original sample to the decision boundary of the model, thereby evaluating the robustness and stability of the model.
(3) The invention uses the correlation between the data sample and the decision boundary of the model as the loss function for updating the adversarial sample, achieving better accuracy with fewer queries.
(4) The invention can acquire all data that satisfy the adversarial condition within a certain range, thereby better judging the stability of the model, supporting privacy protection of the model, and providing generalization capability.
(5) The invention can capture the change of the decision boundary over the whole training process of the model and calculate the distance between the data sample and the decision boundary in real time, thereby evaluating the security of the model more effectively.
Drawings
Fig. 1 is a flowchart of a method for measuring correlation between data samples and model decision boundaries according to an embodiment of the present invention.
Fig. 2 is a block diagram of a system for measuring correlation between data samples and model decision boundaries according to an embodiment of the present invention.
Fig. 3 is a second block diagram of a system for measuring correlation between data samples and model decision boundaries according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In the present application, the terms "first," "second," and the like (if any) in the description and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In this embodiment, the invention can be divided into two stages: a data processing stage and a correlation measurement stage. The user uploads a query API for the model to be evaluated, which operates as a black box, together with a certain number of training samples, i.e. data used to train the model. In the data processing stage, the model at each moment of deep learning training is taken, and the data samples to be evaluated are selected. In the correlation measurement stage, the samples are processed one by one: first, an initial adversarial sample is generated at the decision boundary; gradient estimation is carried out to obtain a vector perpendicular to the decision boundary; the correlation between this vector and the vector from the input sample to the adversarial sample is computed; the sample on the decision boundary is updated; and finally, by calculating the distance between the final adversarial sample and the input sample, the distance matrix from each sample to the model decision boundary throughout the deep learning training process is obtained, which measures the correlation between each data sample and the model decision boundary.
Fig. 1 is a flowchart of a method for measuring correlation between data samples and model decision boundaries according to an embodiment of the present invention. The method includes operation S1-operation S6.
Operation S1, an input sample of the model to be evaluated is obtained from the Internet of Things, and an initial adversarial sample is generated at the model decision boundary by adding Gaussian noise to the input sample.
It should be noted that, in this embodiment, the model to be evaluated and the input sample are input by the end user, and the input sample is from the data set of the internet of things; the data set of the internet of things is a data set formed by integrating a plurality of data collected by devices such as sensors in the internet of things. For example, the model to be evaluated is an image recognition model, accordingly, data representing an image is extracted as features in the internet of things data set, and data representing an image name is extracted as a label to serve as an input sample.
Specifically, in S1, generating an initial adversarial sample at the model decision boundary by adding Gaussian noise to the input sample comprises:
adding multiple groups of random Gaussian noise to the input sample until a first noise that causes the model to misclassify is obtained; and projecting the perturbed sample onto the model decision boundary by bisection to obtain the initial adversarial sample, the perturbed sample being the superposition of the input sample and the first noise.
In this embodiment, an initial input sample $(x, y)$ is given, where $x$ represents the input features and $y$ is the class label. Multiple groups of random Gaussian noise $\eta_i$ are added to the sample, where $i$ is the count, until a noise $\eta$ that causes the model to misclassify is obtained, i.e. $C(x + \eta) \neq y$, where $C(\cdot)$ is the predicted label of the model, i.e. $C(x) = \arg\max_k F_k(x)$, $x$ is the model input, and $F_k(x)$ represents the probability that the model predicts class $k$. The perturbed sample $x + \eta$ is then projected onto the decision boundary by bisection to obtain the initial adversarial sample $x'$.
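The S1 procedure can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: `predict_label` is a hypothetical stand-in for the black-box prediction interface, and the noise scale `sigma`, the retry limit `max_tries` and the tolerance `tol` are assumed values that the source does not specify.

```python
import numpy as np

def predict_label(x):
    """Hypothetical black-box model: class 1 if the features sum to more than 1, else class 0."""
    return int(np.sum(x) > 1.0)

def initial_adversarial(x, y, predict, rng, sigma=1.0, max_tries=1000, tol=1e-6):
    """Add Gaussian noise until the label flips, then bisect back toward x."""
    for _ in range(max_tries):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        if predict(noisy) != y:          # first noise that causes misclassification
            break
    else:
        raise RuntimeError("no misclassifying noise found")
    lo, hi = 0.0, 1.0                    # fraction of the way from x (lo) to noisy (hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if predict(x + mid * (noisy - x)) != y:
            hi = mid                     # still misclassified: move toward x
        else:
            lo = mid
    return x + hi * (noisy - x)          # misclassified point next to the boundary

rng = np.random.default_rng(0)
x = np.array([0.2, 0.3])
y = predict_label(x)
x_adv = initial_adversarial(x, y, predict_label, rng)
```

Only label queries are used, which matches the black-box setting described above; the returned point stays on the misclassified side of the boundary.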
In operation S2, the normal vector of the initial adversarial sample on the model decision boundary and the difference vector from the input sample to the initial adversarial sample are calculated, and the angle difference loss between the unit vector of the difference vector and the unit vector of the normal vector is calculated.
In this embodiment, operation S2 includes sub-operations S21 through S25.
In sub-operation S21, the sample on the model decision boundary is updated through multiple iterations of steps S2 to S4. The sample generated at the model decision boundary in round $t$ is denoted $x'_t$, where $t$ indexes the iterations and $T$ is the total number of iterations; at the start of the iterations, the sample is the initial adversarial sample generated in step S1. Gaussian perturbations $u_b$ in multiple directions, drawn with covariance matrix $\Sigma$, are applied to the adversarial sample $x'_t$ to obtain $B$ groups of perturbed samples $\tilde{x}_b$, $b = 1, 2, \dots, B$, where $\delta$ is a perturbation constant, for example $\delta = 1.01$ or $1.001$.
In sub-operation S22, the decision value $\phi(\tilde{x}_b)$ of each perturbed sample $\tilde{x}_b$ is calculated from $S(\tilde{x}_b)$, the adversarial property of the perturbed sample $\tilde{x}_b$, where $y$ is the true label of the sample $\tilde{x}_b$, $\hat{y}$ is the predicted label of the sample $\tilde{x}_b$, $F_y(\tilde{x}_b)$ represents the probability that the model predicts the true label, and $\max_{c \neq y} F_c(\tilde{x}_b)$ represents the maximum probability that the model predicts a non-true label;
if the prediction output by the model is inconsistent with the label of the original sample, then $\phi(\tilde{x}_b) = 1$; otherwise $\phi(\tilde{x}_b) = -1$.
In sub-operation S23, the decision value is taken as the sign of the direction of each perturbation vector, and the signed perturbation directions are averaged; the result is the gradient value at the sample $x'_t$, i.e. the normal vector $n_t$ of the adversarial sample $x'_t$ at the model decision boundary, expressed as:
$$n_t = \frac{1}{B}\sum_{b=1}^{B}\phi(\tilde{x}_b)\,u_b$$
where $u_b$ is the $b$-th Gaussian perturbation direction, $\tilde{x}_b$ is the corresponding perturbed sample, and $\phi(\tilde{x}_b)$ is its decision value.
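Sub-operations S21 to S23 can be sketched as below. This is an illustration under stated assumptions, not the source's exact estimator: the perturbed samples are formed additively as the boundary sample plus `delta` times a Gaussian direction, the directions are isotropic, the decision value is taken as +1 for a misclassified probe and -1 otherwise, and `predict_label`, `B` and `delta` are hypothetical choices made for the example.

```python
import numpy as np

def predict_label(x):
    """Hypothetical black-box model: class 1 if x[0] + x[1] > 1, else class 0."""
    return int(x[0] + x[1] > 1.0)

def estimate_normal(x_adv, y, predict, rng, B=2000, delta=0.01):
    """Average Gaussian directions, each signed by whether its probe is misclassified."""
    u = rng.normal(size=(B, x_adv.size))          # B Gaussian perturbation directions
    probes = x_adv + delta * u                    # assumed additive perturbation
    phi = np.array([1.0 if predict(p) != y else -1.0 for p in probes])
    return (phi[:, None] * u).mean(axis=0)        # Monte Carlo estimate of the normal

rng = np.random.default_rng(0)
x_boundary = np.array([0.5, 0.5])                 # lies on the boundary x0 + x1 = 1
n = estimate_normal(x_boundary, y=0, predict=predict_label, rng=rng)
n_unit = n / np.linalg.norm(n)
```

For this toy linear boundary the estimate should point close to the direction (1, 1), i.e. toward the misclassified side; accuracy improves as `B` grows, at the cost of more queries.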
In sub-operation S24, the vector $v_t = x'_t - x$ between the adversarial sample $x'_t$ generated at the model boundary in round $t$ and the original input sample $x$ is calculated.
In sub-operation S25, the cosine similarity between the vector $v_t$ and the normal vector $n_t$ is calculated, expressed as:
$$\cos(v_t, n_t) = \frac{\langle v_t, n_t\rangle}{\lVert v_t\rVert_2\,\lVert n_t\rVert_2}$$
where the numerator is the inner product of the two vectors, and the denominator is the product of the lengths of the two vectors, each expressed as a two-norm.
The invention aims to obtain the distance from the original input sample $x$ to the model decision boundary, so the sample $x'$ on the decision boundary that is closest to the original input sample must be generated. The closer the sample $x'$ is to the original input sample $x$, the closer the direction of the vector $x' - x$ is to the gradient direction at $x'$, and the larger the cosine similarity. Therefore, the inverse of the similarity is used as the loss function of the adversarial sample update process, which better drives the update of the data sample. The optimization objective of the invention is to minimize this loss function, subject to the constraint that the sample satisfies the adversarial condition.
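The cosine similarity of S25 and the loss derived from it can be written compactly. In this sketch, reading the "inverse" of the similarity as its negation is an assumption, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(v, n):
    """Inner product divided by the product of the two-norms."""
    return float(np.dot(v, n) / (np.linalg.norm(v) * np.linalg.norm(n)))

def angle_loss(x, x_adv, normal):
    """Assumed loss: negated cosine similarity between (x_adv - x) and the boundary normal."""
    return -cosine_similarity(x_adv - x, normal)

x = np.zeros(2)
aligned = angle_loss(x, np.array([1.0, 1.0]), np.array([2.0, 2.0]))
orthogonal = angle_loss(x, np.array([1.0, 1.0]), np.array([1.0, -1.0]))
```

Under this reading the loss is smallest when the difference vector is parallel to the normal, which is the geometric condition for the boundary point nearest to the input sample.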
In operation S3, the angle difference loss is used as the loss function for updating the initial adversarial sample, and the updated adversarial sample is calculated.
In this embodiment, operation S3 includes sub-operations S31 through S33.
In sub-operation S31, the gradient of the objective function at the sample $x'_t$ is estimated by a difference method, using standard basis vectors, each of which has a single component equal to 1.
In sub-operation S32, the objective function is optimized by first-order gradient optimization to obtain the best coordinate update. Taking the Adam algorithm as an example, the moving average of the gradient and the moving average of the squared gradient are updated and used to compute the best coordinate update. Alternatively, the optimization can be performed by methods such as SGD and RMSprop.
In sub-operation S33, the adversarial sample is updated according to the update direction obtained in step S32.
During the optimization, the updated sample must satisfy the adversarial condition.
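The gradient estimation and first-order update of S31 to S33 can be illustrated as follows. This sketch assumes a coordinate-wise central difference along standard basis vectors and the standard Adam update with bias correction; the source's exact estimator and hyperparameters are not recoverable, so the step size `h`, the learning rate and the quadratic stand-in objective `f` are hypothetical.

```python
import numpy as np

def finite_difference_grad(f, x, h=1e-4):
    """Estimate the gradient of f at x one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = 1.0                                   # standard basis vector
        grad[i] = (f(x + h * e) - f(x - h * e)) / (2 * h)
    return grad

def adam_step(x, grad, state, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using moving averages of the gradient and squared gradient."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias-corrected averages
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return x - lr * m_hat / (np.sqrt(v_hat) + eps)

# Stand-in objective for illustration only; in the method it would be the loss of S2.
target = np.array([1.0, -2.0])
f = lambda z: float(np.sum((z - target) ** 2))

x = np.zeros(2)
state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
g0 = finite_difference_grad(f, x)
for _ in range(500):
    x = adam_step(x, finite_difference_grad(f, x), state)
```

Each gradient estimate costs two objective evaluations per coordinate, so for high-dimensional inputs a sampled or batched estimator would be preferable in a query-limited black-box setting.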
Operation S4, the updated adversarial sample is projected onto the model decision boundary, and the projected sample is taken as the initial adversarial sample for the next iteration.
Specifically, on the line connecting the updated adversarial sample and the input sample $x$, a coefficient $\alpha$ is searched for such that the corresponding sample on the line lies at the decision boundary, and that sample is taken as the initial adversarial sample for the next iteration.
Operation S5, steps S2 to S4 are repeated until the loss function converges or the number of iterations reaches a set value, and the final adversarial sample is obtained; the distance between the input sample and the final adversarial sample is calculated as the distance between the input sample and the model decision boundary.
Specifically, the projected sample is taken as the initial adversarial sample of the next iteration, and steps S2 to S4 are repeated until the loss function converges or the number of iterations reaches the set value, yielding the final adversarial sample $x'_T$; the $p$-norm of the difference between the original input sample $x$ and the final adversarial sample $x'_T$ is calculated as the final distance from the input sample to the model decision boundary, $d = \lVert x - x'_T\rVert_p$.
Operation S6, taking the final confrontation sample as an initial confrontation sample, repeating step S5 to start the next round of model training, sequentially calculating distances from the input sample to model decision boundaries in each round of model training, obtaining a distance matrix, and measuring the correlation between the input sample and the model decision boundaries according to the distance matrix.
Specifically, if the model is trained for $K$ rounds in total, the distances from the input sample to the model decision boundary over the $K$ rounds of training are expressed as $D_T = [d_1, d_2, \dots, d_K]$, where $d_k$ denotes the distance between the input sample and the final confrontation sample in the $k$-th round of training, $k = 1, 2, \dots, K$;
in the $k$-th round of training, all confrontation samples generated in the $U$ iterations before the final confrontation sample is obtained are taken, and the distance between each of them and the input sample is calculated, expressed as $D_k = [d_{T-U}, d_{T-U+1}, \dots, d_T]$, where $d_u$ denotes the distance between the input sample and the confrontation sample generated in the $u$-th iteration of the $k$-th round of training, $u = T-U, T-U+1, \dots, T$, and $T$ is the total number of iterations needed to obtain the final confrontation sample in the $k$-th round of training;
after the $K$ rounds of training are finished, the distance matrix $D$ between the input sample and the model decision boundary is expressed as $D =$ [formula], where [formula] denotes the distance between the input sample and the confrontation sample generated in the $u$-th iteration of the $k$-th round of training.
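The construction of the distance matrix can be sketched as follows. The layout (one row per training round, one column per retained iteration) is an assumption, because the matrix itself appears only as a formula image in the original. The function name, the parameter `num_prev_iters` (standing in for U), and the toy numbers are likewise hypothetical.

```python
from typing import List, Sequence


def build_distance_matrix(
    per_round_distances: Sequence[Sequence[float]],
    num_prev_iters: int,
) -> List[List[float]]:
    """Keep the last (num_prev_iters + 1) distances of every training round.

    per_round_distances[k] holds the distance between the input sample and the
    confrontation sample after each iteration of round k, ending with the
    distance to the final confrontation sample (iteration T).
    """
    needed = num_prev_iters + 1  # iterations T-U, ..., T
    matrix = []
    for round_index, distances in enumerate(per_round_distances, start=1):
        if len(distances) < needed:
            raise ValueError(f"round {round_index} has fewer than {needed} iterations")
        matrix.append(list(distances[-needed:]))
    return matrix


# Invented toy trajectories for K = 3 rounds; T may differ between rounds.
toy_rounds = [
    [3.0, 2.5, 2.2, 2.1],
    [2.8, 2.0, 1.9],
    [2.6, 1.8, 1.7, 1.65, 1.6],
]
distance_matrix = build_distance_matrix(toy_rounds, num_prev_iters=2)
# The last column holds the per-round final distances (D_T in the text).
final_distances = [row[-1] for row in distance_matrix]
```

Taking a fixed-length tail of each trajectory keeps the matrix rectangular even when the rounds converge after different numbers of iterations.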
It should be noted that the distance matrix represents only the correlation measure between the data sample and one decision boundary, and for a multi-class model, the present invention can generate a corresponding distance matrix for the decision boundary of each class, which represents the correlation measure between the data sample and all the decision boundaries of the model.
Fig. 2 is a block diagram of a system for measuring correlation between data samples and model decision boundaries according to an embodiment of the present invention. Referring to fig. 2, the system 200 includes a data initialization module 210, a difference calculation module 220, a data update module 230, a distance calculation module 240, and a correlation metric module 250.
The data initialization module 210, for example, performs operation S1 and is configured to obtain an input sample of a model to be evaluated from the Internet of Things, and to generate an initial confrontation sample at the model decision boundary by adding Gaussian noise to the input sample;
the difference calculation module 220, for example, performs operation S2 and is configured to calculate a normal vector of the initial confrontation sample on the model decision boundary and a difference vector from the input sample to the initial confrontation sample, and to calculate an angle difference loss between a unit vector of the difference vector and a unit vector of the normal vector;
the data update module 230, for example, performs operations S3 and S4 and is configured to calculate an updated confrontation sample, using the angle difference loss as the loss function for updating the initial confrontation sample, to project the updated confrontation sample onto the model decision boundary, and to take the projected sample as the initial confrontation sample of the next iteration;
the distance calculation module 240, for example, performs operation S5 and is configured to repeatedly perform the operations of the difference calculation module and the data update module until the loss function converges or the number of iterations reaches a set value, to obtain a final confrontation sample, and to calculate the distance between the input sample and the final confrontation sample as the distance between the input sample and the model decision boundary;
the correlation measurement module 250, for example, performs operation S6 and is configured to start the next round of model training by taking the final confrontation sample as an initial confrontation sample and repeatedly performing the operation of the distance calculation module, to calculate in turn the distance from the input sample to the model decision boundary in each round of model training, to obtain a distance matrix, and to measure the correlation between the input sample and the model decision boundary according to the distance matrix.
The system 200 is used to perform the method for measuring correlation between data samples and model decision boundaries in the embodiment shown in FIG. 1. For details that are not described in the present embodiment, please refer to the method for measuring the correlation between data samples and model decision boundaries in the embodiment shown in fig. 1, which is not described herein again.
Fig. 3 is a second block diagram of a system for measuring correlation between data samples and model decision boundaries according to an embodiment of the present invention. The system includes an initial data generation module, a gradient estimation module, a correlation calculation module, a data update module, and a distance calculation module. A data sample set and a model provided by a user are input into the system. The initial data generation module generates, on the model decision boundary, a confrontation sample that is close to the original sample. The gradient estimation module estimates the gradient at a data point on the decision boundary, which represents the direction and magnitude of the normal vector of the model decision boundary. The correlation calculation module uses cosine similarity to calculate the correlation between the gradient vector and the vector between the confrontation sample and the original input sample. The data update module comprises an optimal gradient update part and a sample projection part: it calculates the optimal update direction at the current step, projects the updated data onto the model decision boundary, and proceeds to the next update; through repeated iterative updates it obtains the confrontation sample that minimizes the loss function, and the distance between this sample and the original sample is taken as the shortest distance from the original data to the model decision boundary. The distance calculation module calculates the two-norm distance between the original sample and the confrontation sample as the distance between the original sample and the model decision boundary, and repeats this calculation for each model obtained during training to generate the final distance matrix of the data sample, which represents the correlation measure between the data sample and the model decision boundary.
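The roles of the gradient estimation module and the correlation calculation module can be sketched on a toy problem. The estimator below is a common Monte Carlo form that averages random unit directions weighted by whether the probed point is adversarial. The patent's own formula is contained in images that are not reproduced in this text, so this estimator, the parameter names `num_dirs` and `delta` (standing in for B and the perturbation constant), and the toy linear classifier are assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Sequence


def unit(v: Sequence[float]) -> List[float]:
    """Scale v to unit two-norm."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [c / norm for c in v]


def estimate_normal(
    x_boundary: Sequence[float],
    is_adversarial: Callable[[Sequence[float]], bool],
    num_dirs: int = 200,
    delta: float = 0.01,
    seed: int = 0,
) -> List[float]:
    """Monte Carlo estimate of the boundary normal at x_boundary."""
    rng = random.Random(seed)
    acc = [0.0] * len(x_boundary)
    for _ in range(num_dirs):
        u = unit([rng.gauss(0.0, 1.0) for _ in x_boundary])
        probe = [a + delta * b for a, b in zip(x_boundary, u)]
        sign = 1.0 if is_adversarial(probe) else -1.0
        acc = [a + sign * b for a, b in zip(acc, u)]
    return unit(acc)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(unit(a), unit(b)))


# Hypothetical toy model: points with p[0] + p[1] > 1 are adversarial.
def toy_is_adversarial(p: Sequence[float]) -> bool:
    return p[0] + p[1] > 1.0


x = [0.0, 0.0]
near = [0.5, 0.5]  # closest boundary point to x
far = [1.0, 0.0]   # another boundary point, farther from x
cos_near = cosine_similarity([b - a for a, b in zip(x, near)],
                             estimate_normal(near, toy_is_adversarial))
cos_far = cosine_similarity([b - a for a, b in zip(x, far)],
                            estimate_normal(far, toy_is_adversarial))
```

At the boundary point closest to the input, the difference vector is parallel to the boundary normal, so the cosine similarity approaches 1. At other boundary points it is smaller, which is why an angle-based loss can steer the confrontation sample toward the nearest point.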
The effects of the present invention are further illustrated by the following experimental results. The method is applied to membership inference attacks in deep learning: the model at each moment of the training process is selected, and results are tested on the Adult, MNIST and Purchase (10) data sets. Using the method for measuring the correlation between data samples and model decision boundaries, corresponding confrontation samples are generated at the decision boundaries, and the matrix of distance changes from the data samples to each model decision boundary is calculated. As the deep learning model is trained, the model decision boundary changes continuously; because the training data participates in training while the test data does not, the change in distance from the training data to the model decision boundary differs from that of the test data. Training data and test data are selected respectively to obtain the corresponding distance change matrices. The membership inference attack model is then trained with the distance features of the data as input and with whether the data is training data as the output label. In simulation tests, the success rate of the confrontation samples and the accuracy of the membership inference attack on the three data sets are shown in Table 1.
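[Table 1: success rate of the confrontation samples and accuracy of the membership inference attack on the Adult, MNIST and Purchase (10) data sets; the table is an image in the original and is not reproduced here.]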
It can be seen that the method for measuring the correlation between data samples and the model decision boundary provided by the invention achieves a high success rate for confrontation samples on each data set, exceeding the baseline level, and that the accuracy of the resulting membership inference attack exceeds that of most current experiments. The method can accurately measure the correlation between a data sample in a deep learning model and the model decision boundary, which supports privacy protection of the data and makes the method highly practical and general.
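The attack-model step described in the experiment (distance features as input, training-set membership as label) can be sketched with a deliberately simple classifier. The text does not say which attack model was used, so the nearest-centroid rule below is a stand-in. The feature values are invented purely to exercise the code and say nothing about the behaviour of real models or about the results in Table 1.

```python
from typing import List, Sequence, Tuple

Centroids = Tuple[List[float], List[float]]


def centroid(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of equally long feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_attack(features: Sequence[Sequence[float]], is_member: Sequence[bool]) -> Centroids:
    """Fit one centroid for member samples and one for non-member samples."""
    members = [f for f, m in zip(features, is_member) if m]
    non_members = [f for f, m in zip(features, is_member) if not m]
    if not members or not non_members:
        raise ValueError("need at least one member and one non-member sample")
    return centroid(members), centroid(non_members)


def predict_member(model: Centroids, feature: Sequence[float]) -> bool:
    """Predict membership by the nearer centroid."""
    member_centroid, non_member_centroid = model
    return squared_distance(feature, member_centroid) < squared_distance(feature, non_member_centroid)


# Invented distance features (for example, one distance per training round).
toy_features = [[2.1, 1.9, 1.6], [2.3, 2.0, 1.8], [0.9, 0.8, 0.8], [1.0, 0.7, 0.6]]
toy_labels = [True, True, False, False]
attack_model = train_attack(toy_features, toy_labels)
guess_a = predict_member(attack_model, [2.0, 1.8, 1.7])
guess_b = predict_member(attack_model, [0.8, 0.9, 0.7])
```

Any binary classifier could take the place of the centroid rule; the point of the sketch is only the shape of the data, with one distance vector per sample and a Boolean membership label.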
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A method for measuring correlation between data samples and model decision boundaries is used for protecting data of the Internet of things, and is characterized by comprising the following steps:
S1, obtaining an input sample of the model to be evaluated from the Internet of Things, and adding Gaussian noise to the input sample to generate an initial confrontation sample at a model decision boundary; the model to be evaluated is a deep learning model;
S2, calculating a normal vector of the initial confrontation sample on the model decision boundary and a difference vector from the input sample to the initial confrontation sample, and calculating an angle difference loss between a unit vector of the difference vector and a unit vector of the normal vector;
S3, calculating an updated confrontation sample with the angle difference loss as a loss function for updating the initial confrontation sample;
S4, projecting the updated confrontation sample onto the model decision boundary, and taking the projected sample as an initial confrontation sample for the next iteration;
S5, repeating steps S2 to S4 until the loss function converges or the number of iterations reaches a set value, and obtaining a final confrontation sample; calculating the distance between the input sample and the final confrontation sample as the distance between the input sample and the model decision boundary;
and S6, taking the final confrontation sample as an initial confrontation sample, repeating step S5 to start the next round of model training, sequentially calculating the distance from the input sample to the model decision boundary in each round of model training to obtain a distance matrix, and measuring the correlation between the input sample and the model decision boundary according to the distance matrix, wherein the correlation is used for evaluating the privacy of the input sample.
2. The method of claim 1, wherein the step of generating the initial confrontation sample at the model decision boundary by adding gaussian noise to the input sample in S1 comprises:
adding multiple groups of random Gaussian noise to the input sample until a first noise that causes the model to misclassify is obtained;
projecting the perturbed sample onto the model decision boundary by bisection to obtain the initial confrontation sample; the perturbed sample is the superposition of the input sample and the first noise.
3. The method of claim 1 or 2, wherein the step of calculating the normal vector of the initial confrontation sample on the model decision boundary in the step S2 comprises:
S21, applying Gaussian perturbations [formula] in $B$ directions to the initial confrontation sample [formula] to obtain $B$ perturbed samples [formula], [formula], where [formula] is a perturbation constant;
S22, calculating the decision value [formula] of the perturbed sample [formula], [formula]; wherein [formula] denotes the adversarial property of the perturbed sample [formula], and [formula]; [formula] is the true label of the sample [formula], [formula] is the predicted label of the sample [formula], [formula] represents the probability that the model predicts the true label, and [formula] represents the maximum probability that the model predicts for a label other than the true label;
S23, the normal vector [formula] of the initial confrontation sample [formula] at the model decision boundary is expressed as:
[formula]
4. The method of claim 3, wherein in S2, the angular difference loss is expressed as:
[formula]
wherein [formula] represents the input sample, [formula] represents the inner product, and [formula] represents the two-norm.
5. The method of claim 4, wherein the S3 comprises:
S31, taking the inverse of the angular difference loss as the loss function;
S32, obtaining an update direction of the initial confrontation sample by using a Monte Carlo gradient estimation method, and calculating an updated confrontation sample by using a first-order gradient optimization method.
6. The method of claim 5, wherein the S4 comprises:
on the line connecting the updated confrontation sample [formula] and the input sample [formula], searching for [formula], and taking the sample [formula] that satisfies [formula] as the initial confrontation sample for the next iteration; wherein [formula], [formula].
7. The method of claim 1 or 6, wherein the step of calculating the distance between the input sample and the final confrontation sample as the distance between the input sample and the model decision boundary in the step S5 comprises: calculating the norm value between the input sample and the final confrontation sample as the distance between the input sample and the model decision boundary.
8. The method of claim 7, wherein in step S6, if the model is trained for $K$ rounds in total, the distances from the input sample to the model decision boundary over the $K$ rounds of training are expressed as $D_T = [d_1, d_2, \dots, d_K]$, where $d_k$ denotes the distance between the input sample and the final confrontation sample in the $k$-th round of training, $k = 1, 2, \dots, K$;
in the $k$-th round of training, all confrontation samples generated in the $U$ iterations before the final confrontation sample is obtained are taken, and the distance between each of them and the input sample is calculated, expressed as $D_k = [d_{T-U}, d_{T-U+1}, \dots, d_T]$, where $d_u$ denotes the distance between the input sample and the confrontation sample generated in the $u$-th iteration of the $k$-th round of training, $u = T-U, T-U+1, \dots, T$, and $T$ is the total number of iterations needed to obtain the final confrontation sample in the $k$-th round of training;
after the $K$ rounds of training are finished, the distance matrix $D$ between the input sample and the model decision boundary is expressed as $D =$ [formula], where [formula] denotes the distance between the input sample and the confrontation sample generated in the $u$-th iteration of the $k$-th round of training.
9. A system for measuring correlation between data samples and model decision boundaries, which is used for data protection of the Internet of things, is characterized by comprising:
a data initialization module, configured to acquire an input sample of a model to be evaluated from the Internet of Things and to generate an initial confrontation sample at the model decision boundary by adding Gaussian noise to the input sample; the model to be evaluated is a deep learning model;
a difference calculation module, configured to calculate a normal vector of the initial confrontation sample on the model decision boundary and a difference vector from the input sample to the initial confrontation sample, and to calculate an angle difference loss between a unit vector of the difference vector and a unit vector of the normal vector;
a data updating module, configured to calculate an updated confrontation sample with the angle difference loss as a loss function for updating the initial confrontation sample, to project the updated confrontation sample onto the model decision boundary, and to take the projected sample as an initial confrontation sample of the next iteration;
a distance calculation module, configured to repeatedly execute the operations of the difference calculation module and the data updating module until the loss function converges or the number of iterations reaches a set value, to obtain a final confrontation sample, and to calculate the distance between the input sample and the final confrontation sample as the distance between the input sample and the model decision boundary;
and a correlation measurement module, configured to start the next round of model training by taking the final confrontation sample as an initial confrontation sample and repeatedly executing the operation of the distance calculation module, to sequentially calculate the distance from the input sample to the model decision boundary in each round of model training to obtain a distance matrix, and to measure the correlation between the input sample and the model decision boundary according to the distance matrix, wherein the correlation is used for evaluating the privacy of the input sample.
CN202111188034.4A 2021-10-12 2021-10-12 Method and system for measuring correlation between data sample and model decision boundary Active CN113642029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111188034.4A CN113642029B (en) 2021-10-12 2021-10-12 Method and system for measuring correlation between data sample and model decision boundary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111188034.4A CN113642029B (en) 2021-10-12 2021-10-12 Method and system for measuring correlation between data sample and model decision boundary

Publications (2)

Publication Number Publication Date
CN113642029A CN113642029A (en) 2021-11-12
CN113642029B true CN113642029B (en) 2021-12-24

Family

ID=78426406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111188034.4A Active CN113642029B (en) 2021-10-12 2021-10-12 Method and system for measuring correlation between data sample and model decision boundary

Country Status (1)

Country Link
CN (1) CN113642029B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349344B (en) * 2023-10-23 2024-03-05 广州欧派创意家居设计有限公司 Intelligent product sales data acquisition method and system based on big data

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961145A (en) * 2018-12-21 2019-07-02 北京理工大学 A kind of confrontation sample generating method for image recognition category of model boundary sensitivity

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837637B (en) * 2019-10-16 2022-02-15 华中科技大学 Black box attack method for brain-computer interface system
CN113204782A (en) * 2021-04-15 2021-08-03 西安邮电大学 Centralized privacy protection method for decision model release

Also Published As

Publication number Publication date
CN113642029A (en) 2021-11-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant