CN113779608A - Data protection method based on WOE mask in multi-party longitudinal federal learning LightGBM training - Google Patents


Info

Publication number
CN113779608A
CN113779608A (application CN202111102941.2A)
Authority
CN
China
Prior art keywords
data
training
woe
node
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111102941.2A
Other languages
Chinese (zh)
Inventor
曾佳
祝文伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenpu Technology Shanghai Co ltd
Original Assignee
Shenpu Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenpu Technology Shanghai Co ltd filed Critical Shenpu Technology Shanghai Co ltd
Priority to CN202111102941.2A priority Critical patent/CN113779608A/en
Publication of CN113779608A publication Critical patent/CN113779608A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Medical Informatics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of private data protection and discloses a data protection method based on a WOE (Weight of Evidence) mask in multi-party vertical federated learning LightGBM training, comprising the following steps: each participant of federated learning prepares training data; each node performs WOE encoding on its raw data locally; the vertical federated learning LightGBM model is trained; and model evaluation and pruning are performed on the trained model. In this method each node WOE-encodes its data locally, that is, WOE values replace the raw data, the encoded data are used for model training, and the feature names are encoded as well. Meanwhile, the data and gradients transmitted during model construction are homomorphically encrypted, which greatly improves security.

Description

Data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training
Technical Field
The invention relates to the field of private data protection, and in particular to a data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training.
Background
Data has become a new factor of production, and as a key link in releasing its value, the open sharing, exchange, and circulation of data resources has become an important trend, with demand growing by the day. At the same time, barriers between data sources are hard to break. The data required by artificial intelligence typically spans multiple domains; for example, in AI-based product recommendation, the seller owns data about its products and about the goods users have purchased, but not data about the users' purchasing power or payment habits. In most industries data exists as isolated islands: because of industry competition, privacy and security concerns, and complex administrative procedures, even integrating data across departments of the same company meets serious resistance, and in practice integrating data scattered across many places and organizations is nearly impossible, or prohibitively expensive.
With the further development of big data, attaching importance to data privacy and security has become a worldwide trend. Every public data breach draws enormous attention from the media and the public; the Facebook data leak, for example, triggered widespread protest. Meanwhile, countries are strengthening the protection of data security and privacy: the European Union's General Data Protection Regulation (GDPR), which took effect in 2018, shows that ever-stricter governance of user data privacy and security will be a worldwide trend. This poses unprecedented challenges to the field of artificial intelligence. The status quo in research and industry is that the party collecting data is usually not the party using it: party A collects data, transfers it to party B for cleaning, then to party C for modeling, and finally the model is sold to party D for use. Such transfer, exchange, and trading of data between entities violates the GDPR and may incur severe legal penalties.
How to design a machine learning framework that lets artificial intelligence systems use their respective data together more efficiently and accurately, while meeting the requirements of data privacy, security, and regulation, is an important subject of current artificial intelligence development. At present, the technical solution that satisfies privacy protection and data security and solves the data-island problem is federated learning.
In federated learning, all data of every party stays local, so privacy is not disclosed and regulations are not violated; the parties jointly build a virtual common model from which all benefit. The modeling effect of federated learning is the same as, or not very different from, training on the whole data set gathered in one place (given user alignment or feature alignment of the individual data). With the emergence of federated learning, however, security attacks against the training process have multiplied. Even though the raw data never leaves its node, gradient information must be transmitted during model training, and split features and split points must be transmitted in tree models. An attacker can recover the original inputs by inverting the gradients and features.
Disclosure of Invention
The invention aims to provide a data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training, so as to solve the problems described in the background.
To achieve this aim, the invention provides the following technical solution: a data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training, comprising the following steps:
S1: each participant of federated learning prepares training data;
S2: each node performs WOE encoding on its raw data locally;
S3: the vertical federated learning LightGBM model is trained;
S4: model evaluation and pruning are performed on the model trained in S3.
Preferably, the participants in federated learning in S1 may comprise multiple parties p1, p2, ..., pn, holding local data sets data_1, data_2, ..., data_n, where m_p is the sample size of node p and n_p is the number of features of node p including the ID tag. One party holds the label y: the node holding the label, say p1, is defined as the initiator, and the nodes without the label, p2, ..., pn, are defined as participants. By fusing the data sets, the samples common to all nodes are screened out by sample ID, so that every node holds the same number of samples and the sample IDs stay aligned, i.e. m_1 = m_2 = ... = m_n.
Preferably, S2 specifically includes:
S2-1: the initiator homomorphically encrypts the label y with its private key key1 and sends the encrypted label to each participant;
S2-2: each participant buckets its features on the encrypted data, shuffles the bucket order, randomly maps the bucket intervals, stores the mapping rule locally, and computes feature_dict, which contains the randomly coded bucket information; using the encrypted label and homomorphic encryption it computes the number of positive samples and the total number of samples in each bucket, then sends feature_dict to the initiator;
S2-3: the initiator decrypts the data and computes the WOE value of each bucket, obtaining woe_dict, which it sends to the participants. Assuming a feature is divided into q buckets, the WOE of a single bucket i of a single feature is computed as:

p_y_i = y_i / y_T
p_n_i = n_i / n_T
WOE_i = ln(p_y_i / p_n_i)

where y_i and n_i are the numbers of positive and negative samples in bucket i, and y_T and n_T are the total numbers of positive and negative samples of the feature. The WOE value of every bucket of every feature is computed, and each participant is sent the WOE table for the features it owns;
S2-4: each participant maps the bucket codes of woe_dict back to the original interval ranges, obtaining woe_dict_origin, and encodes every sample and feature of its raw data according to the mapping rule of woe_dict_origin, forming the final model-input data data_train_p used in vertical federated learning LightGBM model training and data_test_p used for model evaluation.
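The per-bucket WOE calculation of S2-3 can be sketched in Python as follows. The function name, the smoothing term `eps`, and the example counts are illustrative assumptions, not part of the patent:

```python
import math

def woe_table(bucket_pos, bucket_total, eps=0.5):
    """WOE of each bucket of one feature (sketch of S2-3).

    bucket_pos[i]   -- positive samples counted in bucket i
    bucket_total[i] -- total samples counted in bucket i
    eps             -- smoothing for empty buckets (an assumption; the
                       patent does not specify how zero counts are handled)
    """
    total_pos = sum(bucket_pos)
    total_neg = sum(t - p for p, t in zip(bucket_pos, bucket_total))
    woe = []
    for pos, tot in zip(bucket_pos, bucket_total):
        neg = tot - pos
        p_y = (pos + eps) / (total_pos + eps)  # share of all positives in this bucket
        p_n = (neg + eps) / (total_neg + eps)  # share of all negatives in this bucket
        woe.append(math.log(p_y / p_n))
    return woe

# Two buckets: one positive-heavy, one negative-heavy.
woes = woe_table([8, 2], [10, 10])
```

A positive-heavy bucket gets a positive WOE, a negative-heavy bucket a negative one; the initiator would run this once per feature on the decrypted bucket counts.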
Preferably, the training of the vertical federated learning LightGBM model in S3 includes the steps of:
S3-1: the data are bucketed to speed up training;
S3-2: mutually exclusive features are bundled to speed up training;
S3-3: the initiator performs gradient-based one-side sampling according to the real labels and the current predictions, obtains the sample index sample_data_index and sample weights sample_weight for the current round of training, and sends them to the participants;
S3-4: the initiator creates a LightGBM tree-node queue and creates a root node, which is put into the queue;
S3-5: the initiator pops the splittable node with the largest gain;
S3-6: the initiator computes the first-order gradients and second-order hessians of this round, homomorphically encrypts them to protect privacy, and sends the encrypted encrypt_gradients and encrypt_hessians to the participants;
S3-7: each participant node computes its histograms from encrypt_gradients and encrypt_hessians using homomorphic encryption and sends them to the initiator, while the initiator computes its own histograms locally in plaintext;
S3-8: the initiator decrypts the histograms of all nodes and computes the optimal split feature split_feature and split point split_value; if the feature of the optimal split point belongs to the initiator, the split is performed locally, and if it belongs to a participant, the split information is sent to that participant and the split is performed there;
S3-9: the initiator computes the predicted values of the two split tree nodes and marks whether each node can be split further, then adds both nodes to the LightGBM tree-node queue;
S3-10: S3-5 to S3-9 are repeated until no splittable node remains in the LightGBM tree-node queue, finishing this round of training;
S3-11: S3-3 to S3-10 are repeated until the model reaches the preset parameter values, finishing the whole model training.
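The gain maximized in S3-5 and S3-8 is not spelled out in the text. A common second-order histogram gain of the XGBoost/LightGBM family, used here as an assumption, can be sketched as:

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Second-order gain of a candidate split:
    G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam).
    lam is an assumed L2 regularization weight."""
    def score(g, h):
        return g * g / (h + lam)
    return (score(g_left, h_left) + score(g_right, h_right)
            - score(g_left + g_right, h_left + h_right))

def best_split(hist_g, hist_h, lam=1.0):
    """Scan the bucket boundaries of one feature's (gradient, hessian)
    histogram and return (best_boundary_index, best_gain)."""
    G, H = sum(hist_g), sum(hist_h)
    gl = hl = 0.0
    best = (None, 0.0)
    for i in range(len(hist_g) - 1):
        gl += hist_g[i]
        hl += hist_h[i]
        gain = split_gain(gl, hl, G - gl, H - hl, lam)
        if gain > best[1]:
            best = (i, gain)
    return best

# Gradients change sign between buckets 1 and 2, so the best boundary is 1.
boundary, gain = best_split([-4.0, -4.0, 5.0, 5.0], [1.0, 1.0, 1.0, 1.0])
```

In S3-8 the initiator would run this scan over the decrypted histograms of every feature of every node and keep the global maximum.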
Preferably, after training in S4 is finished, the model is evaluated and verified on the test set data_test; using evaluation metrics such as AUC and recall, the number of boosting rounds with the best test-set performance is selected and the tree model is pruned accordingly, avoiding overfitting and forming the final model.
Compared with the prior art, the invention has the beneficial effects that:
according to the method, the WOE coding is performed on the data locally at each node, namely, the WOE value is used for replacing the original data, the coded data is used for model training, and the characteristic name is also coded. Meanwhile, homomorphic encryption is carried out on the transmitted data and the gradient in the model construction process, so that the safety is greatly improved. Meanwhile, even if an attacker cracks the ciphertext information, the WOE value can be obtained only, the corresponding characteristics and the value range of the characteristics are unknown, the WOE value mapped by the larger value of the original data is smaller, the WOE value mapped by the smaller original data is larger, the original data and the WOE value lose the size corresponding relation, and therefore the attacker cannot deduce the original information.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a flow chart of WOE calculation and encoding according to an embodiment of the present invention;
FIG. 3 is a flow chart of the multi-party vertical secure federated learning LightGBM training process in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-3, the present invention provides a technical solution: the data protection method based on the WOE mask in the multi-party longitudinal federal learning LightGBM training comprises the following steps:
S1: each participant of federated learning prepares training data;
S2: each node performs WOE encoding on its raw data locally;
S3: the vertical federated learning LightGBM model is trained;
S4: model evaluation and pruning are performed on the model trained in S3.
In this embodiment, the participants in federated learning in S1 may comprise multiple parties p1, p2, ..., pn, holding local data sets data_1, data_2, ..., data_n, where m_p is the sample size of node p and n_p is the number of features of node p including the ID tag. One party holds the label y: the node holding the label, say p1, is defined as the initiator, and the nodes without the label, p2, ..., pn, are defined as participants. By fusing the data sets, the samples common to all nodes are screened out by sample ID, so that every node holds the same number of samples and the sample IDs stay aligned, i.e. m_1 = m_2 = ... = m_n.
In this embodiment, vertical federated learning begins with data alignment: the sample IDs owned in common by all participants are screened out to form fuse_data, called the fused data set. Although it is called fused, the data remain distributed locally at each participant and are never actually aggregated; "fused data set" is only a collective name. Each node then screens out data sets data_1_fuse, data_2_fuse, ..., data_n_fuse, which share the same sample IDs but carry different features.
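The alignment step can be sketched as follows. This plain set intersection is illustrative only; a production system would use a private set intersection (PSI) protocol so that even the non-shared IDs are not revealed, which this sketch does not implement:

```python
def fuse_sample_ids(*id_lists):
    """Return the sorted sample IDs common to every node's local data set.
    Each node keeps its own rows locally; only the shared ID list is agreed on."""
    common = set(id_lists[0])
    for ids in id_lists[1:]:
        common &= set(ids)
    return sorted(common)

# Three nodes with partially overlapping sample IDs.
shared = fuse_sample_ids([1, 2, 3, 5], [2, 3, 4, 5], [0, 2, 5, 9])
```

Each node would then restrict its local rows to `shared`, yielding data_p_fuse with identical sample order at every node.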
In this embodiment, S2 specifically includes:
S2-1: the initiator homomorphically encrypts the label y with its private key key1 and sends the encrypted label to each participant;
S2-2: each participant buckets its features on the encrypted data, shuffles the bucket order, randomly maps the bucket intervals, stores the mapping rule locally, and computes feature_dict, which contains the randomly coded bucket information; using the encrypted label and homomorphic encryption it computes the number of positive samples and the total number of samples in each bucket, then sends feature_dict to the initiator;
S2-3: the initiator decrypts the data and computes the WOE value of each bucket, obtaining woe_dict, which it sends to the participants. Assuming a feature is divided into q buckets, the WOE of a single bucket i of a single feature is computed as:

p_y_i = y_i / y_T
p_n_i = n_i / n_T
WOE_i = ln(p_y_i / p_n_i)

where y_i and n_i are the numbers of positive and negative samples in bucket i, and y_T and n_T are the total numbers of positive and negative samples of the feature. The WOE value of every bucket of every feature is computed, and each participant is sent the WOE table for the features it owns;
S2-4: each participant maps the bucket codes of woe_dict back to the original interval ranges, obtaining woe_dict_origin, and encodes every sample and feature of its raw data according to the mapping rule of woe_dict_origin, forming the final model-input data data_train_p used in vertical federated learning LightGBM model training and data_test_p used for model evaluation.
In this embodiment, after the data set to be used for modeling is determined, the buckets and WOE values of each feature must be computed and the WOE mask mapping applied to the raw data. In this process the initiator holds the label y and can compute locally; to assist each participant with its WOE calculation, the initiator homomorphically encrypts the label with its private key key1 and sends it to each participant. Each participant buckets its features on the encrypted data, shuffles the bucket order, randomly maps the bucket intervals, stores the mapping rule locally, computes feature_dict, and sends it to the initiator. After decryption, the initiator computes the WOE value of each bucket, obtains woe_dict, and sends it to the participants. Each participant maps the bucket codes of woe_dict back to the original interval ranges using the previously stored mapping rule, obtaining woe_dict_origin, and encodes every sample and feature of its raw data according to woe_dict_origin, finally forming the model-input data data_train used in federated learning LightGBM model training and data_test used for model evaluation. This completes the mapping of the raw data. Because the WOE code captures, for every value or interval of every feature, its ability to separate positive and negative samples, the magnitude of the encoded data carries more information than the original discrete features, which can improve the training effect of the model.
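The participant-side shuffling of S2-2 and the mask application of S2-4 can be sketched as follows; the function names, bin edges, and the stand-in WOE table are illustrative assumptions:

```python
import random

def build_bucket_mapping(bin_edges, seed=0):
    """Sketch of S2-2: shuffle the bucket order and assign random codes to
    the intervals. The participant keeps this mapping locally; only the
    random codes travel to the initiator."""
    intervals = list(zip(bin_edges[:-1], bin_edges[1:]))
    codes = list(range(len(intervals)))
    random.Random(seed).shuffle(codes)
    return dict(zip(codes, intervals))

def apply_woe_mask(values, mapping, woe_by_code):
    """Sketch of S2-4: replace every raw value by the WOE of its bucket."""
    masked = []
    for v in values:
        for code, (lo, hi) in mapping.items():
            if lo <= v < hi:
                masked.append(woe_by_code[code])
                break
    return masked

mapping = build_bucket_mapping([0, 10, 20, 30])
woe_by_code = {code: float(code) for code in mapping}  # stand-in WOE table
masked = apply_woe_mask([5, 15, 25], mapping, woe_by_code)
```

Because the codes are a random permutation, an observer of the coded buckets cannot recover which raw interval a code denotes; only the participant's local mapping can undo it.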
In this embodiment, S3 includes:
S3-1: the data are bucketed to speed up training;
S3-2: mutually exclusive features are bundled to speed up training;
S3-3: the initiator performs gradient-based one-side sampling according to the real labels and the current predictions, obtains the sample index sample_data_index and sample weights sample_weight for the current round of training, and sends them to the participants;
S3-4: the initiator creates a LightGBM tree-node queue and creates a root node, which is put into the queue;
S3-5: the initiator pops the splittable node with the largest gain;
S3-6: the initiator computes the first-order gradients and second-order hessians of this round, homomorphically encrypts them to protect privacy, and sends the encrypted encrypt_gradients and encrypt_hessians to the participants;
S3-7: each participant node computes its histograms from encrypt_gradients and encrypt_hessians using homomorphic encryption and sends them to the initiator, while the initiator computes its own histograms locally in plaintext;
S3-8: the initiator decrypts the histograms of all nodes and computes the optimal split feature split_feature and split point split_value; if the feature of the optimal split point belongs to the initiator, the split is performed locally, and if it belongs to a participant, the split information is sent to that participant and the split is performed there;
S3-9: the initiator computes the predicted values of the two split tree nodes and marks whether each node can be split further, then adds both nodes to the LightGBM tree-node queue;
S3-10: S3-5 to S3-9 are repeated until no splittable node remains in the LightGBM tree-node queue, finishing this round of training;
S3-11: S3-3 to S3-10 are repeated until the model reaches the preset parameter values, finishing the whole model training.
In this embodiment, each node first performs bucketing and exclusive feature bundling on data_train and data_test. The purpose of bucketing is to reduce the number of comparisons needed to find the optimal split point of each feature during LightGBM splitting; the purpose of exclusive feature bundling is to reduce the total number of features. Since some features have many missing values, several sparse features are bundled into one feature. Reducing the number of candidate split points and of features improves training speed without affecting model accuracy.
In this embodiment, in each round of training, gradient-based one-side sampling is performed according to the real labels and the current predictions (initialized to the zero vector), yielding the sample index sample_data_index and sample weights sample_weight of the samples participating in this round; this step is completed only at the initiator. After sampling, the initiator sends sample_data_index and sample_weight to the participants. The initiator then computes the first-order gradients and second-order hessians of this round, homomorphically encrypts them to protect privacy, and sends the encrypted encrypt_gradients and encrypt_hessians to the participants to assist them in modeling.
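The one-side sampling the initiator performs can be sketched as a standard GOSS step; the sampling rates `top_rate` and `other_rate` are assumed values, since the patent does not give them:

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Gradient-based One-Side Sampling, done only at the initiator.
    Keeps the top_rate fraction of samples with the largest |gradient|,
    randomly keeps other_rate of the rest, and up-weights the latter by
    (1 - top_rate) / other_rate so the gradient sums stay unbiased."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(n * top_rate)
    top, rest = order[:n_top], order[n_top:]
    other = random.Random(seed).sample(rest, int(n * other_rate))
    sample_data_index = top + other
    sample_weight = {i: 1.0 for i in top}
    sample_weight.update({i: (1 - top_rate) / other_rate for i in other})
    return sample_data_index, sample_weight

idx, w = goss_sample([0.9, 0.1, 0.05, 0.8, 0.2, 0.02, 0.3, 0.01, 0.7, 0.15])
```

The returned pair corresponds to the sample_data_index and sample_weight that the initiator sends to the participants in S3-3.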
In this embodiment, the initiator and the participants each build their histograms during tree construction. Computing a histogram requires, for each feature value, the sums of first-order gradients and second-order hessians and the sample count; at the initiator these data are in plaintext, while each participant computes on ciphertext using homomorphic encryption and sends the result back to the initiator when done. The initiator creates a root node and maintains a LightGBM tree-node queue. Among the splittable nodes in the queue, the node with the largest gain is selected and popped, and the optimal split feature split_feature and split point split_value are searched over the histograms of the initiator and all participants. If the feature of the optimal split point belongs to the initiator, the split is performed locally; if it belongs to a participant, the split information is sent to that participant, the split is performed there, and the result is sent back to the initiator. The initiator computes the predicted values of the split child nodes, marks whether each can be split further, and adds both child nodes to the queue. When no node in the queue can be split, the current tree is complete and the next tree is built, until the configured training parameters (such as the number of trees) are reached.
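The additive homomorphic property that lets a participant sum encrypted gradients into histogram cells can be illustrated with a toy Paillier scheme. The tiny fixed primes below are purely for demonstration and are nowhere near secure, and real gradients would additionally need fixed-point encoding, since Paillier operates on integers:

```python
import math
import random

# Toy Paillier keypair with tiny fixed primes: illustration only, not secure.
p, q = 1789, 2003
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
g = n + 1  # standard simplification

def _L(u):
    return (u - 1) // n

mu = pow(_L(pow(g, lam, n2)), -1, n)  # modular inverse (Python 3.8+)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (_L(pow(c, lam, n2)) * mu) % n

def he_add(c1, c2):
    # Additive homomorphism: Dec(c1 * c2 mod n^2) = m1 + m2
    return (c1 * c2) % n2

# A participant can sum per-bucket encrypted gradients without decrypting:
cipher_sum = he_add(encrypt(7), encrypt(35))
```

Only the initiator, holding the private key (lam, mu), can decrypt the aggregated histogram cell; the participant sees nothing but ciphertexts.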
In this embodiment, after training is completed, the model is evaluated and verified on the test set data_test; using evaluation metrics such as AUC and recall, the number of boosting rounds with the best test-set performance is selected and the tree model is pruned accordingly, avoiding overfitting and forming the final model. Throughout the whole process, the gradients that could leak privacy are encrypted, the feature names are randomly mapped, and the raw data are WOE-encoded in the data processing stage, so no node can reconstruct the raw data and privacy is guaranteed.
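Selecting the best boosting round by test-set AUC, as in S4, can be sketched as follows; the rank-based AUC implementation and all names are illustrative assumptions:

```python
def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) statistic, with averaged tie ranks."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1  # group tied scores
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum += avg_rank
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def best_round(per_round_test_scores, labels):
    """Pick the boosting round whose test-set AUC is highest (pruning point)."""
    aucs = [auc(labels, s) for s in per_round_test_scores]
    return max(range(len(aucs)), key=aucs.__getitem__)
```

All trees after the chosen round would be discarded, which is the pruning the embodiment describes.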
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. A data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training, characterized by comprising the following steps:
S1: each participant of federated learning prepares training data;
S2: each node performs WOE encoding on its raw data locally;
S3: the vertical federated learning LightGBM model is trained;
S4: model evaluation and pruning are performed on the model trained in S3.
2. The data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training according to claim 1, wherein the participants in federated learning in S1 may comprise multiple parties p1, p2, ..., pn, holding local data sets data_1, data_2, ..., data_n, where m_p is the sample size of node p and n_p is the number of features of node p including the ID tag; one party holds the label y: the node holding the label, say p1, is defined as the initiator, and the nodes without the label, p2, ..., pn, are defined as participants; by fusing the data sets, the samples common to all nodes are screened out by sample ID, so that every node holds the same number of samples and the sample IDs stay aligned, i.e. m_1 = m_2 = ... = m_n.
3. The data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training according to claim 1, wherein S2 specifically comprises: S2-1: the initiator homomorphically encrypts the label y with its private key key1 and sends the encrypted label to each participant; S2-2: each participant buckets its features on the encrypted data, shuffles the bucket order, randomly maps the bucket intervals, stores the mapping rule locally, and computes feature_dict, which contains the randomly coded bucket information; using the encrypted label and homomorphic encryption it computes the number of positive samples and the total number of samples in each bucket, then sends feature_dict to the initiator; S2-3: the initiator decrypts the data, computes the WOE value of each bucket, obtains woe_dict and sends it to the participants; S2-4: each participant maps the bucket codes of woe_dict back to the original interval ranges, obtaining woe_dict_origin, and encodes every sample and feature of its raw data according to the mapping rule of woe_dict_origin, forming the final model-input data data_train_p used in vertical federated learning LightGBM model training and data_test_p used for model evaluation.
4. The method for data protection based on WOE mask in multi-party longitudinal federal learning LightGBM training as claimed in claim 1, wherein the longitudinal federal learning LightGBM model training in S3 comprises the steps of:
s3-1: the data are subjected to bucket separation, so that the training speed is improved;
s3-2: mutually exclusive feature binding is carried out on the data, and the training speed is improved;
s3-3: the initiator performs unilateral gradient sampling according to the real label and the current predicted value to obtain data sample _ data _ index and sample weight sample _ weight of the current round of training, and sends the data sample _ data _ index and the sample weight sample _ weight to the participants;
S3-4: the initiator creates a LightGBM tree model node queue, creates the root node, and pushes it into the queue;
S3-5: the initiator pops the splittable node with the largest gain from the queue;
S3-6: the initiator computes the first-order gradients and second-order hessians for this round, homomorphically encrypts the gradients and hessians to protect privacy, and sends the encrypted encrypt_gradients and encrypt_hessians to the participants;
S3-7: each participant computes the node histogram from encrypt_gradients and encrypt_hessians using homomorphic encryption and sends it to the initiator, while the initiator computes its own node histogram locally in plaintext;
S3-8: the initiator decrypts the node histograms and computes the optimal split feature split_feature and split point split_value; if the optimal split feature belongs to the initiator, the split is performed locally; if it belongs to a participant, the split information is sent to that participant and the split is performed there;
S3-9: the initiator computes the predicted values of the two tree model nodes produced by the split, marks whether each node can be split further, and then pushes both nodes into the LightGBM tree model node queue;
S3-10: repeat S3-5 to S3-9 until no splittable node remains in the LightGBM tree model node queue, completing the current round of training;
S3-11: repeat S3-3 to S3-10 until the model reaches the preset parameter values, completing the whole model training.
5. The method for data protection based on WOE mask in multi-party longitudinal federal learning LightGBM training as claimed in claim 1, wherein after the training in S4 is completed, the model is evaluated and verified on the test data data_test; evaluation metrics such as AUC and recall are used to select the number of boosting rounds that performs best on the test set, and the tree model is pruned accordingly to avoid overfitting, forming the final model.
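The round selection in claim 5 can be sketched as follows; the AUC here is the plain rank-statistic form without tie handling, and both function names are illustrative, not from the patent:

```python
def auc(labels, scores):
    """Rank-statistic AUC (no tie handling): the probability that a
    positive sample is scored above a negative one."""
    order = sorted(range(len(labels)), key=lambda i: scores[i])
    rank_sum = sum(r + 1 for r, i in enumerate(order) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def select_best_round(auc_per_round):
    """Pick the boosting round with the highest test-set AUC, so that
    trees added after it can be pruned from the final model."""
    best_round = max(range(len(auc_per_round)), key=auc_per_round.__getitem__)
    return best_round + 1  # number of trees to keep
```

Keeping only the trees up to the best-scoring round is the pruning step the claim describes: later trees that only improved the training fit are discarded.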
CN202111102941.2A 2021-09-17 2021-09-17 Data protection method based on WOE mask in multi-party longitudinal federal learning LightGBM training Pending CN113779608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111102941.2A CN113779608A (en) 2021-09-17 2021-09-17 Data protection method based on WOE mask in multi-party longitudinal federal learning LightGBM training

Publications (1)

Publication Number Publication Date
CN113779608A true CN113779608A (en) 2021-12-10

Family

ID=78852348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111102941.2A Pending CN113779608A (en) 2021-09-17 2021-09-17 Data protection method based on WOE mask in multi-party longitudinal federal learning LightGBM training

Country Status (1)

Country Link
CN (1) CN113779608A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989036A (en) * 2021-12-30 2022-01-28 百融至信(北京)征信有限公司 Federal learning prediction method and system without exposure of model-entering variable
CN113989036B (en) * 2021-12-30 2022-03-18 百融至信(北京)征信有限公司 Federal learning prediction method and system without exposure of model-entering variable
CN114330759A (en) * 2022-03-08 2022-04-12 富算科技(上海)有限公司 Training method and system for longitudinal federated learning model
CN114330759B (en) * 2022-03-08 2022-08-02 富算科技(上海)有限公司 Training method and system for longitudinal federated learning model
CN115334005A (en) * 2022-03-31 2022-11-11 北京邮电大学 Encrypted flow identification method based on pruning convolution neural network and machine learning
CN115334005B (en) * 2022-03-31 2024-03-22 北京邮电大学 Encryption flow identification method based on pruning convolutional neural network and machine learning
CN115292738A (en) * 2022-10-08 2022-11-04 豪符密码检测技术(成都)有限责任公司 Method for detecting security and correctness of federated learning model and data
CN115292738B (en) * 2022-10-08 2023-01-17 豪符密码检测技术(成都)有限责任公司 Method for detecting security and correctness of federated learning model and data

Similar Documents

Publication Publication Date Title
CN113779608A (en) Data protection method based on WOE mask in multi-party longitudinal federal learning LightGBM training
CN113051557B (en) Social network cross-platform malicious user detection method based on longitudinal federal learning
CN111931242A (en) Data sharing method, computer equipment applying same and readable storage medium
CN113420232B (en) Privacy protection-oriented federated recommendation method for neural network of graph
Chen et al. Secure social recommendation based on secret sharing
CN113127916A (en) Data set processing method, data processing device and storage medium
CN101729554B (en) Construction method of division protocol based on cryptology in distributed computation
CN110163008B (en) Security audit method and system for deployed encryption model
CN108667717A (en) Block chain processing method, medium, device and computing device based on instant communication message record
Li et al. SPFM: Scalable and privacy-preserving friend matching in mobile cloud
CN113779355B (en) Network rumor tracing evidence obtaining method and system based on blockchain
CN113283902A (en) Multi-channel block chain fishing node detection method based on graph neural network
CN114362948B (en) Federated derived feature logistic regression modeling method
CN115269983A (en) Target sample recommendation method based on two-party data privacy protection
CN104618098B (en) Cryptography building method and system that a kind of set member's relation judges
CN117216788A (en) Video scene identification method based on federal learning privacy protection of block chain
Shekhtman et al. Critical field-exponents for secure message-passing in modular networks
CN110059097A (en) Data processing method and device
CN116821952A (en) Privacy data calculation traceability system and method based on block chain consensus mechanism
CN110175283A (en) A kind of generation method and device of recommended models
CN116681141A (en) Federal learning method, terminal and storage medium for privacy protection
CN113221170B (en) Privacy information matching and data transaction method and system based on blockchain
Gulyás et al. Analysis of identity separation against a passive clique-based de-anonymization attack
Zhang et al. Multi-party Secure Comparison of Strings Based on Outsourced Computation
Li et al. An integrated recommendation approach based on influence and trust in social networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination