CN117056957A - Verifiable data forgetting privacy protection method and device for a minimax learning model - Google Patents

Verifiable data forgetting privacy protection method and device for a minimax learning model

Info

Publication number
CN117056957A
CN117056957A (application CN202310498860.1A)
Authority
CN
China
Prior art keywords
forgetting
minimum
data
learning model
maximum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310498860.1A
Other languages
Chinese (zh)
Inventor
刘佳琪
秦湛
任奎
娄坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310498860.1A
Publication of CN117056957A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a verifiable data forgetting privacy protection method and device for a minimax learning model. Based on the full Hessian curvature matrix, the method performs a Newton-step update on the parameters of the minimax model and adds random perturbation, thereby removing the influence of the forgotten data from the model. This realizes effective and verifiable machine learning model forgetting and approximately achieves the effect of retraining on the remaining data. The application provides, for the first time, a verifiable machine learning model forgetting method for the minimax problem; it fully utilizes the parameters and data of the trained model to obtain the new parameters produced by the model forgetting mechanism, avoids the high computational overhead of retraining, and protects data privacy while processing user data deletion requests.

Description

Verifiable data forgetting privacy protection method and device for a minimax learning model
Technical Field
The application relates to the field of machine learning privacy protection, and in particular to a verifiable data forgetting privacy protection method and device for a minimax learning model.
Background
Machine learning is an important branch of artificial intelligence: through learning and analysis of data, a computer system automatically learns patterns and rules from the data, so that it can autonomously perform tasks such as prediction, classification, and recognition. Machine learning algorithms generate predictive models by analyzing large amounts of user data and finding rules and patterns in them. These models can be applied in many different fields, such as natural language processing, image recognition, recommendation systems, and predictive analysis. The minimax learning model (Minimax Learning Model) is a model commonly used in game theory and machine learning that attempts to find an equilibrium among multiple decision makers (also called players) so that each decision maker obtains the best result in the worst case. Minimax learning models are widely used in machine learning, including generative adversarial networks, robust learning, adversarial training, algorithmic fairness, Markov decision processes, and so forth.
In practice, machine learning models may be trained on large amounts of sensitive data, such as medical records, financial data, and personal identity information, so protecting user privacy becomes critical. In recent years, a series of data privacy laws and regulations have come into force at home and abroad, providing important legal guarantees for personal information, especially sensitive personal information, including the right of deletion (also called the right to be forgotten) for personal information. These regulations require the deletion of personal data upon user request and may even require the deletion of models and algorithms extracted from the user data. Although deleting the target data from the database holding the training dataset is relatively easy, stopping at this step does not ensure that a machine learning model already trained and deployed on this dataset adequately complies with the right of deletion. In fact, if the trained model is not updated to forget the target training data, the machine learning model still risks leaking the privacy of the data to be deleted. Therefore, a further data forgetting update of the trained machine learning model is required to ensure that the model does not leak personal privacy information in subsequent use.
One of the simplest ways to achieve machine learning model data forgetting is to retrain the model on the new dataset obtained after removing the target data, but this approach incurs high computational overhead and time costs. In recent years, research on model forgetting has proposed a series of forgetting mechanisms based on different theoretical ideas and technical routes so as to avoid retraining. According to the nature of the forgetting mechanism, model forgetting can be roughly classified into exact model forgetting and approximate model forgetting. Exact model forgetting means that the model updated by the mechanism is completely consistent with the model obtained by retraining, i.e., the forgetting mechanism completely removes the information related to the target deleted data. Approximate model forgetting means that the model updated by the mechanism is approximately the same as the model obtained by retraining, i.e., the forgetting mechanism approximately removes the information related to the target deleted data. Verifiable model forgetting means ensuring that, after the data are deleted, the machine learning model operates as if the deleted data had never been observed.
However, existing machine learning model forgetting methods are limited to standard learning models and consider only the optimization of a single group of parameters; data forgetting methods for minimax learning models, which contain two groups of parameters, have not been considered.
Disclosure of Invention
The application aims to provide a verifiable data forgetting privacy protection method and device for a minimax learning model, addressing the defects of the prior art. The application can realize approximate model forgetting for models containing two groups of parameters.
The aim of the application is realized by the following technical scheme: a first aspect of the embodiments of the application provides a verifiable data forgetting privacy protection method for a minimax learning model, comprising the following steps:
(1) For an original dataset, calculating the average of all sample loss functions to obtain the empirical risk, and training a minimax learning model to obtain the saddle-point optimal solution of the empirical risk as the minimization and maximization parameters of the minimax learning model;
(2) Calculating the full Hessian matrix at the optimal solution obtained in step (1), the full Hessian matrix comprising a direct Hessian part and an indirect Hessian part;
(3) According to the optimal solution obtained in step (1), the full Hessian matrices obtained in step (2), and the user's data deletion request, performing a Newton-step forgetting update on the minimization and maximization parameters of the minimax learning model to obtain the updated minimization and maximization parameters;
(4) Adding Gaussian noise as random perturbation to the updated minimization and maximization parameters obtained in step (3) to obtain the final forgetting model, and completing verifiable data forgetting privacy protection according to the forgetting model.
Further, the specific process of obtaining the saddle-point optimal solution of the empirical risk in step (1) is:

$$F_S(w,v)=\frac{1}{n}\sum_{i=1}^{n} f(w,v;z_i),\qquad w_S^{*}=\arg\min_{w}\max_{v} F_S(w,v),\qquad v_S^{*}=\arg\max_{v} F_S(w_S^{*},v)$$

where $n$ is the size of the original dataset $S$, $z_i$ is the $i$-th data sample in the dataset, $f(\cdot)$ is the loss function, $F_S(\cdot)$ is the empirical risk on the original dataset, $w$ and $v$ respectively denote the minimization and maximization parameters of the minimax learning model to be learned, $w_S^{*}$ is the minimization parameter that minimizes the empirical risk, and $v_S^{*}$ is the maximization parameter that maximizes the empirical risk.
Further, the specific process of calculating the full Hessian matrix at the optimal solution obtained in step (1) in step (2) is:

$$\mathbf{T}_{w} = \partial_{ww} F_S - \partial_{wv} F_S\,(\partial_{vv} F_S)^{-1}\,\partial_{vw} F_S,\qquad \mathbf{T}_{v} = \partial_{vv} F_S - \partial_{vw} F_S\,(\partial_{ww} F_S)^{-1}\,\partial_{wv} F_S$$

where $\mathbf{T}_w$ and $\mathbf{T}_v$ respectively denote the full Hessian matrices of the minimization and maximization parameters at the optimal solution; $\partial_{ww} F_S$ denotes the empirical risk $F_S$ differentiated twice with respect to $w$; $\partial_{wv} F_S$ denotes $F_S$ differentiated first with respect to $w$ and then with respect to $v$; $\partial_{vw} F_S$ denotes $F_S$ differentiated first with respect to $v$ and then with respect to $w$; and $\partial_{vv} F_S$ denotes $F_S$ differentiated twice with respect to $v$.
Further, step (3) comprises the following substeps:
(3.1) Constructing a deletion-request dataset $U$ according to the user's data deletion request, and, according to $U$, using the optimal solution $(w_S^{*}, v_S^{*})$ of the empirical risk on the original dataset obtained in step (1) and the full Hessian matrices $\mathbf{T}_w$ and $\mathbf{T}_v$ of the empirical risk at the optimal solution obtained in step (2), calculating the full Hessian matrices $TH_w$ and $TH_v$ at the optimal solution on the remaining dataset:

$$TH_w = \frac{1}{n-m}\Big(n\,\mathbf{T}_w - \sum_{z_i\in U} \mathbf{T}_w(z_i)\Big),\qquad TH_v = \frac{1}{n-m}\Big(n\,\mathbf{T}_v - \sum_{z_i\in U} \mathbf{T}_v(z_i)\Big)$$

where $n$ is the size of the original dataset, $m$ is the size of the deletion-request dataset $U$, $z_i$ is the $i$-th data sample in the dataset, and $\mathbf{T}_w(z_i)$, $\mathbf{T}_v(z_i)$ are the per-sample full Hessians;
(3.2) Using the optimal solution $(w_S^{*}, v_S^{*})$ of the empirical risk on the original dataset obtained in step (1) and the full Hessian matrices $TH_w$ and $TH_v$ on the remaining dataset obtained in step (3.1), performing a Newton-step forgetting update on the minimization and maximization parameters to obtain the updated parameters $\hat{w}$ and $\hat{v}$:

$$\hat{w} = w_S^{*} + \frac{1}{n-m}\,TH_w^{-1}\sum_{z_i\in U}\nabla_w f(w_S^{*},v_S^{*};z_i),\qquad \hat{v} = v_S^{*} + \frac{1}{n-m}\,TH_v^{-1}\sum_{z_i\in U}\nabla_v f(w_S^{*},v_S^{*};z_i)$$

where $\nabla_w$ and $\nabla_v$ denote the first derivatives with respect to $w$ and $v$ respectively, $n$ is the size of the original dataset, $U$ is the deletion-request dataset, and $m$ is the size of $U$.
Further, the specific process in step (4) of adding Gaussian noise to the updated minimization and maximization parameters obtained in step (3) is:

$$w_u = \hat{w} + \xi_1,\qquad v_u = \hat{v} + \xi_2,\qquad \xi_1\sim\mathcal{N}(0,\sigma_1^2 I),\quad \xi_2\sim\mathcal{N}(0,\sigma_2^2 I)$$

where $w_u$ and $v_u$ constitute the final forgetting model; $\xi_1$ and $\xi_2$ respectively denote the Gaussian noise added to the updated minimization parameter $\hat{w}$ and the updated maximization parameter $\hat{v}$; $I$ is the identity matrix; and $\sigma_1$, $\sigma_2$ are the standard deviations of the Gaussian noise distributions.
A second aspect of the embodiments of the application provides a verifiable data forgetting privacy protection device for a minimax learning model, comprising one or more processors configured to implement the above verifiable data forgetting privacy protection method for a minimax learning model.
A third aspect of the embodiments of the application provides a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the above verifiable data forgetting privacy protection method for a minimax learning model.
The beneficial effects of the application are as follows: by calculating the full Hessian matrix, a Newton-step update is performed on the existing model parameters, and carefully designed random perturbation is added to achieve a verifiable deletion guarantee, so the storage cost is low and the effect of retraining on the remaining data can be approximately achieved. The application provides, for the first time, a verifiable data forgetting privacy protection method for a minimax learning model, avoids the high computational overhead of retraining, and can protect the data privacy of users.
Drawings
FIG. 1 is a schematic diagram of the overall flow of the verifiable data forgetting privacy protection method for a minimax learning model of the application;
FIG. 2 is a schematic structural diagram of the verifiable data forgetting privacy protection device for a minimax learning model.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information, without departing from the scope of the application. The term "if" as used herein may be interpreted as "upon" or "when" or "in response to determining", depending on the context.
The verifiable data forgetting of the minimax learning model can be classified as approximate model forgetting. The core technique is to perform a Newton-step forgetting update on the existing model parameters based on the full Hessian matrix, and to add random perturbation to the updated model parameters, so as to realize verifiable data forgetting privacy protection.
The minimax learning model (Minimax Learning Model) is a machine learning model for solving two-player zero-sum games. In this model, two opponents pursue their own goals by alternating actions, i.e., minimizing their own loss or maximizing their own profit. One opponent is called the Minimizer and the other is called the Maximizer.
Referring to FIG. 1, the verifiable data forgetting privacy protection method for a minimax learning model disclosed by the application performs a Newton-step update on the parameters of an existing minimax learning model based on the full Hessian curvature matrix and adds carefully designed random perturbation to achieve a verifiable deletion guarantee. The method specifically comprises the following steps:
(1) For the original dataset, calculating the average of all sample loss functions to obtain the empirical risk, and training a minimax learning model to obtain the saddle-point optimal solution of the empirical risk as the minimization and maximization parameters of the minimax learning model.
In this embodiment, for the original dataset, the empirical risk is obtained by summing all sample loss functions and averaging, and the minimax learning model is trained and optimized by a learning algorithm to obtain the optimal solution $(w_S^{*}, v_S^{*})$ that makes the empirical risk minimax, i.e., the minimization and maximization parameters of the minimax learning model. It should be understood that the minimax learning model may be trained and optimized by stochastic gradient descent, grid search, or other methods, as long as the saddle-point optimal solution of the empirical risk can be computed. The optimal solution that minimizes the empirical risk is taken as the minimization parameter $w_S^{*}$, and the optimal solution that maximizes the empirical risk is taken as the maximization parameter $v_S^{*}$. The specific process is:
$$F_S(w,v)=\frac{1}{n}\sum_{i=1}^{n} f(w,v;z_i),\qquad w_S^{*}=\arg\min_{w}\max_{v} F_S(w,v),\qquad v_S^{*}=\arg\max_{v} F_S(w_S^{*},v)$$

where $n$ is the size of the original dataset $S$, $z_i$ is the $i$-th data sample in the dataset, $f(\cdot)$ is the loss function, $F_S(\cdot)$ is the empirical risk on the original dataset, $w$ and $v$ respectively denote the minimization and maximization parameters of the minimax learning model to be learned, $w_S^{*}$ is the minimization parameter that minimizes the empirical risk, and $v_S^{*}$ is the maximization parameter that maximizes the empirical risk.
It should be understood that $f(\cdot)$ is the loss function, which may be determined according to the specific problem; the samples in the original dataset are used to train the minimax learning model, the corresponding loss may be calculated from the training result and the corresponding result in the original dataset, and the average of all losses gives the empirical risk.
In this embodiment, there is a nested influence between the two parameters of the minimax learning model: the minimization parameter $w$ depends on the maximization parameter $v$ as a function, which can be expressed as $W_S(v) := \arg\min_w F_S(w,v)$, where $:=$ denotes a definition; likewise, the maximization parameter $v$ depends on the minimization parameter $w$ as a function, which can be expressed as $V_S(w) := \arg\max_v F_S(w,v)$.
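The training in step (1) can be sketched with simultaneous gradient descent-ascent on the empirical risk. The per-sample loss below is a hypothetical quadratic, strongly-convex-strongly-concave example chosen for illustration (the patent does not fix a concrete loss):

```python
import numpy as np

# Hypothetical per-sample minimax loss (an assumption, not the patent's loss):
#   f(w, v; z) = (w - z)^2 / 2 + w*v - v^2 / 2
# It is strongly convex in w and strongly concave in v, so simultaneous
# gradient descent-ascent on the empirical risk F_S converges to the saddle
# point (w*, v*) = (mean(z)/2, mean(z)/2).

def grad_w(w, v, z):
    return (w - z) + v          # dF/dw for one sample

def grad_v(w, v, z):
    return w - v                # dF/dv for one sample (independent of z here)

def train_minimax(data, lr=0.1, steps=2000):
    w, v = 0.0, 0.0
    for _ in range(steps):
        gw = np.mean([grad_w(w, v, z) for z in data])  # gradient of F_S in w
        gv = np.mean([grad_v(w, v, z) for z in data])  # gradient of F_S in v
        w, v = w - lr * gw, v + lr * gv                # descend in w, ascend in v
    return w, v

data = [1.0, 2.0, 3.0, 4.0]
w_star, v_star = train_minimax(data)
# Analytic saddle for this toy loss: w* = v* = mean(data)/2 = 1.25
```

Any saddle-point solver would do here; gradient descent-ascent is used only because it is the simplest method that exhibits the minimax structure.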
(2) Calculating the full Hessian matrix at the optimal solution obtained in step (1), the full Hessian matrix comprising a direct Hessian part and an indirect Hessian part.
In this embodiment, the full Hessian matrix comprises two parts: a direct Hessian part and an indirect Hessian part. On the original dataset, for the minimization parameter $w$ and the maximization parameter $v$ of the minimax learning model, the direct Hessian part comprises $\partial_{ww} F_S$ and $\partial_{vv} F_S$, and the indirect Hessian part comprises $\partial_{wv} F_S$ and $\partial_{vw} F_S$. Thus, at the optimal solution $(w_S^{*}, v_S^{*})$, the full Hessian matrices $\mathbf{T}_w$ and $\mathbf{T}_v$ are expressed as:

$$\mathbf{T}_{w} = \partial_{ww} F_S - \partial_{wv} F_S\,(\partial_{vv} F_S)^{-1}\,\partial_{vw} F_S,\qquad \mathbf{T}_{v} = \partial_{vv} F_S - \partial_{vw} F_S\,(\partial_{ww} F_S)^{-1}\,\partial_{wv} F_S$$

where $\mathbf{T}_w$ and $\mathbf{T}_v$ respectively denote the full Hessian matrices of the minimization and maximization parameters at the optimal solution; $\partial_{ww} F_S$ and $\partial_{vv} F_S$ denote the second partial derivatives of the empirical risk $F_S$ with respect to $w$ and $v$; $\partial_{wv} F_S$ denotes $F_S$ differentiated first with respect to $w$ and then with respect to $v$; $\partial_{vw} F_S$ denotes $F_S$ differentiated first with respect to $v$ and then with respect to $w$; the inverses $(\partial_{ww} F_S)^{-1}$ and $(\partial_{vv} F_S)^{-1}$ are written when $\partial_{ww} F_S$ and $\partial_{vv} F_S$ are invertible.
It should be understood that in this embodiment, only the results of step (1) and step (2) need to be stored; the complete original dataset need not be stored, so the storage overhead is independent of the size of the original dataset.
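A minimal sketch of combining the direct blocks ($\partial_{ww}$, $\partial_{vv}$) and indirect blocks ($\partial_{wv}$, $\partial_{vw}$) into full Hessians, using the same hypothetical quadratic loss $f(w,v;z)=(w-z)^2/2+wv-v^2/2$ as before. The Schur-complement form is one common definition of a total Hessian and is an assumption here, since the patent states only that both parts are used:

```python
import numpy as np

# Block second derivatives of F_S for the toy loss (constant for a quadratic):
D_ww = np.array([[1.0]])   # d^2 F / dw^2         (direct part)
D_vv = np.array([[-1.0]])  # d^2 F / dv^2 (negative: F is concave in v)
D_wv = np.array([[1.0]])   # d^2 F / dw dv        (indirect part)
D_vw = D_wv.T

# Assumed full (total) Hessians via Schur complements, valid when the
# direct blocks are invertible; these capture the nested dependence of
# each parameter on the other.
T_w = D_ww - D_wv @ np.linalg.inv(D_vv) @ D_vw
T_v = D_vv - D_vw @ np.linalg.inv(D_ww) @ D_wv
# Here T_w = [[2.]] and T_v = [[-2.]], versus direct blocks 1 and -1:
# ignoring the indirect part would halve the curvature seen by the Newton step.
```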
(3) Performing a Newton-step forgetting update on the minimization and maximization parameters of the minimax learning model according to the optimal solution obtained in step (1), the full Hessian matrices obtained in step (2), and the user's data deletion request, so as to obtain the updated minimization and maximization parameters.
(3.1) Constructing a deletion-request dataset $U$ according to the user's data deletion request, and, according to $U$, using the optimal solution $(w_S^{*}, v_S^{*})$ of the empirical risk on the original dataset obtained in step (1) and the full Hessian matrices $\mathbf{T}_w$ and $\mathbf{T}_v$ of the empirical risk at the optimal solution obtained in step (2), calculating the full Hessian matrices $TH_w$ and $TH_v$ at the optimal solution on the remaining dataset:

$$TH_w = \frac{1}{n-m}\Big(n\,\mathbf{T}_w - \sum_{z_i\in U} \mathbf{T}_w(z_i)\Big),\qquad TH_v = \frac{1}{n-m}\Big(n\,\mathbf{T}_v - \sum_{z_i\in U} \mathbf{T}_v(z_i)\Big)$$

where $n$ is the size of the original dataset, $m$ is the size of the deletion-request dataset $U$, $z_i$ is the $i$-th data sample in the dataset, and the per-sample full Hessians $\mathbf{T}_w(z_i)$, $\mathbf{T}_v(z_i)$ are calculated in the same way as $\mathbf{T}_w$ and $\mathbf{T}_v$.
It should be appreciated that the original dataset contains the deletion-request dataset; the data remaining after removing the deletion-request data from the original dataset are the remaining data, from which the remaining dataset can be constructed.
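Step (3.1) needs only the stored full-dataset Hessian plus per-sample Hessians of the deleted points, never the original dataset itself. A sketch of this recombination with illustrative scalar Hessians (the weighting assumes the empirical risk is an average over samples, as in step (1)):

```python
import numpy as np

def remaining_hessian(H_full, H_deleted, n):
    """Recover the average Hessian over the remaining dataset S \\ U.

    H_full: average Hessian over all n samples (stored at training time).
    H_deleted: list of per-sample Hessians for the m deleted points.
    """
    m = len(H_deleted)
    return (n * H_full - sum(H_deleted)) / (n - m)

# Example with scalar (1x1) Hessians: three samples with Hessians 2, 4, 6.
per_sample = [np.array([[2.0]]), np.array([[4.0]]), np.array([[6.0]])]
H_full = sum(per_sample) / 3                     # average over S
TH = remaining_hessian(H_full, [per_sample[0]], n=3)
# n*H_full = 12, minus the deleted sample's 2 gives 10, over n-m = 2 -> 5.0,
# which equals the direct average over the two remaining samples (4 and 6).
```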
(3.2) Using the optimal solution $(w_S^{*}, v_S^{*})$ of the empirical risk on the original dataset obtained in step (1) and the full Hessian matrices $TH_w$ and $TH_v$ on the remaining dataset obtained in step (3.1), performing a Newton-step forgetting update on the minimization and maximization parameters to obtain the updated parameters $\hat{w}$ and $\hat{v}$:

$$\hat{w} = w_S^{*} + \frac{1}{n-m}\,TH_w^{-1}\sum_{z_i\in U}\nabla_w f(w_S^{*},v_S^{*};z_i),\qquad \hat{v} = v_S^{*} + \frac{1}{n-m}\,TH_v^{-1}\sum_{z_i\in U}\nabla_v f(w_S^{*},v_S^{*};z_i)$$

where $\nabla_w$ and $\nabla_v$ denote the first derivatives with respect to $w$ and $v$ respectively, $n$ is the size of the original dataset, $U$ is the deletion-request dataset, and $m$ is the size of $U$; when $TH_w$ and $TH_v$ are invertible, $TH_w^{-1}$ and $TH_v^{-1}$ respectively denote their inverses.
The forgetting update of the minimization and maximization parameters of the minimax learning model is a Newton step: the sum of the loss-function gradients on the target deleted data points gives the direction, and the average of the full Hessian over all remaining points serves as the curvature. Using the full Hessian instead of the simple direct Hessian captures the nested influence between the minimization parameter $w$ and the maximization parameter $v$.
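Under the same hypothetical quadratic loss $f(w,v;z)=(w-z)^2/2+wv-v^2/2$, the Newton forgetting step for the minimizer can be sketched as below. The update formula (moving from the full-data saddle along the deleted points' gradients, with the full Hessian as curvature) is an illustrative reading of the step just described, not the patent's verbatim formula:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
deleted = np.array([4.0])                   # the deletion-request set U
n, m = len(data), len(deleted)

w_star = v_star = data.mean() / 2           # analytic saddle on the full set S
T_w = 2.0                                   # full Hessian w.r.t. w for this loss
                                            # (constant for a quadratic, so the
                                            # remaining-data Hessian is also 2.0)

# Gradient of f w.r.t. w at (w*, v*), summed over the deleted points:
g_del = np.sum((w_star - deleted) + v_star)

# Newton forgetting update for the minimizer:
w_u = w_star + (1.0 / (n - m)) * (1.0 / T_w) * g_del

remaining_saddle = data[:-1].mean() / 2     # retrained w on S \ U
# For a quadratic loss a single Newton step is exact: w_u == remaining_saddle.
```

Because the empirical risk is quadratic here, the one-step update lands exactly on the retrained saddle; for general losses the match is only approximate, which is why random perturbation is added in step (4).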
(4) Adding Gaussian noise as random perturbation to the updated minimization and maximization parameters obtained in step (3) to obtain the final forgetting model, and completing verifiable data forgetting privacy protection according to the forgetting model.
In this embodiment, the specific process of adding Gaussian noise to the updated minimization and maximization parameters obtained in step (3) is:

$$w_u = \hat{w} + \xi_1,\qquad v_u = \hat{v} + \xi_2,\qquad \xi_1\sim\mathcal{N}(0,\sigma_1^2 I),\quad \xi_2\sim\mathcal{N}(0,\sigma_2^2 I)$$

where $w_u$ and $v_u$ constitute the final forgetting model; $\xi_1$ and $\xi_2$ respectively denote the Gaussian noise added to the updated minimization parameter $\hat{w}$ and the updated maximization parameter $\hat{v}$; $I$ is the identity matrix; $\sigma_1$ and $\sigma_2$ are the standard deviations of the Gaussian noise distributions, calibrated from the quantities $\gamma_1$ and $\gamma_2$ and the privacy parameters $\varepsilon$ and $\delta$, where $\gamma_1$ and $\gamma_2$ are jointly determined by the properties of the loss function $f(\cdot)$, the original dataset size $n$, and the target deleted dataset size $m$; $\varepsilon$ and $\delta$ each denote a user-defined privacy parameter.
In this embodiment, by adding random perturbation, a verifiable deletion guarantee is achieved; that is, after the data are deleted, the minimax learning model is guaranteed to behave as if the deleted data had never been observed.
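A sketch of the perturbation in step (4). The patent says only that $\sigma_1$, $\sigma_2$ are derived from $\gamma_1$, $\gamma_2$, $\varepsilon$, and $\delta$, so the standard Gaussian-mechanism calibration used below is an assumption, as is the sensitivity value `gamma`:

```python
import numpy as np

def perturb(params, gamma, eps, delta, rng):
    """Add Gaussian noise calibrated to sensitivity gamma and privacy (eps, delta).

    The calibration sigma = (gamma / eps) * sqrt(2 * ln(1.25 / delta)) is the
    standard Gaussian-mechanism choice, assumed here for illustration.
    """
    sigma = (gamma / eps) * np.sqrt(2.0 * np.log(1.25 / delta))
    return params + rng.normal(0.0, sigma, size=params.shape), sigma

rng = np.random.default_rng(0)
w_hat = np.array([1.0, -0.5])              # updated minimizer parameters (illustrative)
w_u, sigma = perturb(w_hat, gamma=0.1, eps=1.0, delta=1e-5, rng=rng)
# sigma is about 0.48 here; a larger eps (weaker privacy) shrinks the noise.
```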
Corresponding to the foregoing embodiments of the verifiable data forgetting privacy protection method for a minimax learning model, the application also provides embodiments of a verifiable data forgetting privacy protection device for a minimax learning model.
Referring to FIG. 2, the verifiable data forgetting privacy protection device for a minimax learning model provided by the embodiment of the application comprises one or more processors configured to implement the verifiable data forgetting privacy protection method for a minimax learning model of the above embodiments.
The embodiments of the verifiable data forgetting privacy protection device for a minimax learning model can be applied to any device with data processing capability, such as a computer. The device embodiments may be implemented by software, or by hardware or a combination of hardware and software. Taking software implementation as an example, a device in the logical sense is formed by the processor of the device with data processing capability reading corresponding computer program instructions from non-volatile storage into memory and running them. In terms of hardware, FIG. 2 shows a hardware structure diagram of the device with data processing capability on which the verifiable data forgetting privacy protection device for a minimax learning model of the application is located; in addition to the processor, memory, network interface, and non-volatile storage shown in FIG. 2, the device in the embodiments generally includes other hardware according to its actual function, which is not described here again.
The implementation process of the functions and roles of each unit in the above device is detailed in the implementation process of the corresponding steps in the above method and will not be repeated here.
Since the device embodiments substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the application. Those of ordinary skill in the art can understand and implement the application without inventive effort.
The embodiment of the application also provides a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the verifiable data forgetting privacy protection method for a minimax learning model of the above embodiments.
The computer-readable storage medium may be an internal storage unit of any of the data-processing devices described in the foregoing embodiments, such as a hard disk or memory. It may also be an external storage device of the device, such as a plug-in hard disk, Smart Media Card (SMC), SD card, or Flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the data-processing device. The computer-readable storage medium is used to store the computer program and other programs and data required by the device, and may also be used to temporarily store data that has been or will be output.
The above embodiments are only intended to illustrate the technical solution of the application, not to limit it. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the application.

Claims (7)

1. A verifiable data forgetting privacy protection method for a minimax learning model, characterized by comprising the following steps:
(1) For an original dataset, calculating the average of all sample loss functions to obtain the empirical risk, and training a minimax learning model to obtain the saddle-point optimal solution of the empirical risk as the minimization and maximization parameters of the minimax learning model;
(2) Calculating the full Hessian matrix at the optimal solution obtained in step (1), the full Hessian matrix comprising a direct Hessian part and an indirect Hessian part;
(3) According to the optimal solution obtained in step (1), the full Hessian matrices obtained in step (2), and the user's data deletion request, performing a Newton-step forgetting update on the minimization and maximization parameters of the minimax learning model to obtain the updated minimization and maximization parameters;
(4) Adding Gaussian noise as random perturbation to the updated minimization and maximization parameters obtained in step (3) to obtain the final forgetting model, and completing verifiable data forgetting privacy protection according to the forgetting model.
2. The verifiable data forgetting privacy protection method for a minimax learning model according to claim 1, wherein the specific process of obtaining the optimal solution at which the empirical risk attains its minimax value in step (1) is:

$$F_S(w,v)=\frac{1}{n}\sum_{i=1}^{n} f(w,v;z_i), \qquad (w^{*},v^{*})=\arg\min_{w}\,\max_{v}\, F_S(w,v)$$

where n is the size of the original data set S, z_i is the i-th data sample in the data set, f(·) is the loss function, F_S(w,v) is the empirical risk on the original data set, w and v respectively denote the minimization parameter and the maximization parameter of the minimax learning model to be learned, w^* is the minimization parameter that minimizes the empirical risk, and v^* is the maximization parameter that maximizes the empirical risk.
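As a non-authoritative illustration of the minimax empirical-risk training described in this claim, the following sketch trains a hypothetical one-dimensional toy loss f(w, v; z) = 0.5·(w − z)² + w·v − 0.5·v² (our own choice, not taken from the patent) by simultaneous gradient descent-ascent; the names `train_minimax`, `lr`, and `steps` are likewise illustrative:

```python
# Toy minimax training sketch: descend the empirical risk F_S in w while
# ascending it in v. For this particular loss the saddle point in closed
# form is w* = v* = mean(S) / 2, which lets the result be checked exactly.

def train_minimax(S, lr=0.1, steps=2000):
    """Simultaneous gradient descent-ascent on F_S(w, v) = mean_i f(w, v; z_i)."""
    w, v = 0.0, 0.0
    n = len(S)
    for _ in range(steps):
        gw = sum((w - z) + v for z in S) / n  # dF_S/dw for the toy loss
        gv = sum(w - v for _ in S) / n        # dF_S/dv for the toy loss
        w, v = w - lr * gw, v + lr * gv       # descend in w, ascend in v
    return w, v

S = [1.0, 2.0, 3.0, 4.0]
w_star, v_star = train_minimax(S)  # both converge to mean(S)/2 = 1.25
```

For this strongly-convex-strongly-concave toy problem plain descent-ascent converges; general minimax models may need more careful solvers.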
3. The verifiable data forgetting privacy protection method for a minimax learning model according to claim 1, wherein the specific process in step (2) of computing the total Hessian matrix at the optimal solution obtained in step (1) is:

$$\mathrm{D}_{w}F_S = \partial_{ww}F_S - \partial_{wv}F_S\,(\partial_{vv}F_S)^{-1}\,\partial_{vw}F_S, \qquad \mathrm{D}_{v}F_S = \partial_{vv}F_S - \partial_{vw}F_S\,(\partial_{ww}F_S)^{-1}\,\partial_{wv}F_S$$

where D_w F_S and D_v F_S respectively denote the total Hessian matrices of the minimization parameter and the maximization parameter at the optimal solution, the first term of each expression being the direct Hessian part and the second term the indirect Hessian part; ∂_ww F_S, ∂_wv F_S, ∂_vw F_S and ∂_vv F_S respectively denote the second partial derivatives of the empirical risk function F_S with respect to w and v, evaluated at the optimal solution: ∂_ww F_S takes the partial derivative with respect to w twice in succession, ∂_wv F_S first with respect to w and then with respect to v, ∂_vw F_S first with respect to v and then with respect to w, and ∂_vv F_S with respect to v twice.
4. The verifiable data forgetting privacy protection method for a minimax learning model according to claim 1, characterized in that step (3) comprises the following sub-steps:
(3.1) constructing a deletion-request data set U according to the user's data deletion request, and, according to U, computing the total Hessian matrices TH_w and TH_v at the optimal solution on the remaining data set, using the optimal solution (w^*, v^*) that attains the minimax value of the empirical risk on the original data set obtained in step (1) and the total Hessian matrices D_w F_S and D_v F_S of the empirical risk at the optimal solution obtained in step (2):

$$TH_{w}=\frac{1}{n-m}\Big(n\,\mathrm{D}_{w}F_S(w^{*},v^{*})-\sum_{z_i\in U}\mathrm{D}_{w}f(w^{*},v^{*};z_i)\Big), \qquad TH_{v}=\frac{1}{n-m}\Big(n\,\mathrm{D}_{v}F_S(w^{*},v^{*})-\sum_{z_i\in U}\mathrm{D}_{v}f(w^{*},v^{*};z_i)\Big)$$

where n is the size of the original data set, m is the size of the deletion-request data set U, and z_i is the i-th data sample in the data set;
(3.2) performing the Newton-step forgetting update on the minimization parameter and the maximization parameter, using the optimal solution (w^*, v^*) obtained in step (1) and the total Hessian matrices TH_w and TH_v on the remaining data set obtained in step (3.1), to obtain the updated minimization parameter and maximization parameter (w̃, ṽ):

$$\tilde{w}=w^{*}+\frac{1}{n-m}\,TH_{w}^{-1}\sum_{z_i\in U}\partial_{w}f(w^{*},v^{*};z_i), \qquad \tilde{v}=v^{*}+\frac{1}{n-m}\,TH_{v}^{-1}\sum_{z_i\in U}\partial_{v}f(w^{*},v^{*};z_i)$$

where ∂_w f and ∂_v f respectively denote the first partial derivatives with respect to w and v, n is the size of the original data set, U is the deletion-request data set, and m is the size of U.
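An end-to-end toy sketch of the two sub-steps above, on the same hypothetical loss f(w, v; z) = 0.5·(w − z)² + w·v − 0.5·v² (saddle point w* = v* = mean(S)/2, constant total Hessians 2 and −2, as in the earlier sketches); all variable names are ours, and the sigma-calibrated guarantees of the patent are not modeled:

```python
# Newton-step forgetting-update sketch on the toy loss.
S = [1.0, 2.0, 3.0, 4.0]
U = [4.0]                 # deletion-request data set
n, m = len(S), len(U)

w_star = v_star = sum(S) / len(S) / 2.0  # optimal solution on the original data

# (3.1) Total Hessians on the remaining data set. The per-sample total
# Hessians are constant for this toy loss, so the normalized difference
# leaves the values unchanged.
TH_w = (n * 2.0 - m * 2.0) / (n - m)        # = 2.0
TH_v = (n * (-2.0) - m * (-2.0)) / (n - m)  # = -2.0

# (3.2) Newton-step update from the first derivatives summed over U.
gw = sum((w_star - z) + v_star for z in U)  # sum of df/dw at (w*, v*)
gv = sum(w_star - v_star for _ in U)        # sum of df/dv at (w*, v*): 0 here
w_tilde = w_star + gw / ((n - m) * TH_w)
v_tilde = v_star + gv / ((n - m) * TH_v)

# For this quadratic toy, w_tilde coincides with retraining on S \ U:
w_retrain = sum(z for z in S if z not in U) / (n - m) / 2.0
```

Note that for this particular toy loss ∂f/∂v vanishes at the saddle point for every sample, so ṽ stays at v*; the w update, however, lands exactly on the retrained minimizer.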
5. The verifiable data forgetting privacy protection method for a minimax learning model according to claim 1, wherein the specific process in step (4) of adding Gaussian noise to the updated minimization parameter and maximization parameter obtained in step (3) is:

$$w^{u}=\tilde{w}+\xi_{1}, \qquad v^{u}=\tilde{v}+\xi_{2}, \qquad \xi_{1}\sim\mathcal{N}(0,\sigma_{1}^{2}I),\quad \xi_{2}\sim\mathcal{N}(0,\sigma_{2}^{2}I)$$

where w^u and v^u form the final forgetting model; ξ_1 and ξ_2 respectively denote the Gaussian noise added to the updated minimization parameter w̃ and the updated maximization parameter ṽ; I is the identity matrix; and σ_1 and σ_2 are the standard deviations of the Gaussian noise distributions.
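A short sketch of the Gaussian release step: independent noise ξ_1 ~ N(0, σ_1²) and ξ_2 ~ N(0, σ_2²) is added to the updated parameters before the forgetting model is released. The σ values below are illustrative placeholders, not the calibrated values a certified-forgetting analysis would prescribe, and `perturb` is our own name:

```python
import random

rng = random.Random(0)  # fixed seed for a reproducible sketch

def perturb(w_tilde, v_tilde, sigma1, sigma2):
    """Release (w_u, v_u): the updated parameters plus Gaussian noise."""
    w_u = w_tilde + rng.gauss(0.0, sigma1)
    v_u = v_tilde + rng.gauss(0.0, sigma2)
    return w_u, v_u

w_u, v_u = perturb(1.0, 1.25, sigma1=0.01, sigma2=0.01)
```

For vector-valued parameters each coordinate would receive an independent N(0, σ²) draw, matching the isotropic covariance σ²I in the claim.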
6. A verifiable data forgetting privacy protection device for a minimax learning model, characterized by comprising one or more processors configured to implement the verifiable data forgetting privacy protection method for a minimax learning model according to any one of claims 1-5.
7. A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the verifiable data forgetting privacy protection method for a minimax learning model according to any one of claims 1-5.
CN202310498860.1A 2023-05-06 2023-05-06 Verifiable data forgetting privacy protection method and device for minimum and maximum learning model Pending CN117056957A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310498860.1A CN117056957A (en) 2023-05-06 2023-05-06 Verifiable data forgetting privacy protection method and device for minimum and maximum learning model

Publications (1)

Publication Number Publication Date
CN117056957A true CN117056957A (en) 2023-11-14

Family

ID=88654149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310498860.1A Pending CN117056957A (en) 2023-05-06 2023-05-06 Verifiable data forgetting privacy protection method and device for minimum and maximum learning model

Country Status (1)

Country Link
CN (1) CN117056957A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117390685A (en) * 2023-12-07 2024-01-12 湖北省楚天云有限公司 Pedestrian re-identification data privacy protection method and system based on forgetting learning
CN117390685B (en) * 2023-12-07 2024-04-05 湖北省楚天云有限公司 Pedestrian re-identification data privacy protection method and system based on forgetting learning
CN117892843A (en) * 2024-03-18 2024-04-16 中国海洋大学 Machine learning data forgetting method based on game theory and cryptography
CN117892843B (en) * 2024-03-18 2024-06-04 中国海洋大学 Machine learning data forgetting method based on game theory and cryptography

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination