CN113779608A - Data protection method based on WOE mask in multi-party longitudinal federal learning LightGBM training - Google Patents


Info

Publication number
CN113779608A
CN113779608A (application CN202111102941.2A)
Authority
CN
China
Prior art keywords
data
training
woe
node
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111102941.2A
Other languages
Chinese (zh)
Inventor
曾佳
祝文伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenpu Technology Shanghai Co ltd
Original Assignee
Shenpu Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenpu Technology Shanghai Co ltd filed Critical Shenpu Technology Shanghai Co ltd
Priority to CN202111102941.2A priority Critical patent/CN113779608A/en
Publication of CN113779608A publication Critical patent/CN113779608A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Medical Informatics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of private data protection and discloses a data protection method based on a WOE (Weight of Evidence) mask in multi-party vertical federated learning LightGBM training, comprising the following steps: each participant of federated learning prepares training data; each node performs WOE encoding on its raw data locally; the vertical federated learning LightGBM model is trained; and model evaluation and pruning are performed on the trained model. In this method each node WOE-encodes its data locally, that is, WOE values replace the raw data, the encoded data are used for model training, and the feature names are encoded as well. Meanwhile, the data and gradients transmitted during model construction are homomorphically encrypted, which greatly improves security.

Description

Data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training
Technical Field
The invention relates to the field of private data protection, and in particular to a data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training.
Background
Data has become a new factor of production, and as a key link in releasing its value, the open sharing, exchange, and circulation of data resources has become an important trend, with demand growing by the day. At the same time, barriers between data sources are hard to break. The data required by artificial intelligence typically spans multiple domains; for example, in AI-based product recommendation, the seller owns data about its products and about the goods users have purchased, but not data about the users' purchasing power or payment habits. In most industries data exists as isolated islands: because of industry competition, privacy and security concerns, and complex administrative procedures, even integrating data across departments of the same company meets serious resistance, and in practice integrating data scattered across many places and organizations is nearly impossible, or prohibitively expensive.
With the further development of big data, attaching importance to data privacy and security has become a worldwide trend. Every public data breach draws enormous attention from the media and the public; the Facebook data leak, for example, triggered widespread protest. Meanwhile, countries are strengthening the protection of data security and privacy: the European Union's General Data Protection Regulation (GDPR), which took effect in 2018, shows that ever-stricter governance of user data privacy and security will be a worldwide trend. This poses unprecedented challenges to the field of artificial intelligence. The status quo in research and industry is that the party collecting data is usually not the party using it: party A collects data, transfers it to party B for cleaning, then to party C for modeling, and finally the model is sold to party D for use. Such transfer, exchange, and trading of data between entities violates the GDPR and may incur severe legal penalties.
How to design a machine learning framework that lets artificial intelligence systems use their respective data together more efficiently and accurately, while meeting the requirements of data privacy, security, and regulation, is an important subject of current artificial intelligence development. At present, the technical solution that satisfies privacy protection and data security and solves the data-island problem is federated learning.
In federated learning, all data of every party stays local, so privacy is not disclosed and regulations are not violated; the parties jointly build a virtual common model from which all benefit. The modeling effect of federated learning is the same as, or not very different from, training on the whole data set gathered in one place (given user alignment or feature alignment of the individual data). With the emergence of federated learning, however, security attacks against the training process have multiplied. Even though the raw data never leaves its node, gradient information must be transmitted during model training, and split features and split points must be transmitted in tree models. An attacker can recover the original inputs by inverting the gradients and features.
Disclosure of Invention
The invention aims to provide a data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training, so as to solve the problems described in the background.
To achieve this aim, the invention provides the following technical solution: a data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training, comprising the following steps:
S1: each participant of federated learning prepares training data;
S2: each node performs WOE encoding on its raw data locally;
S3: the vertical federated learning LightGBM model is trained;
S4: model evaluation and pruning are performed on the model trained in S3.
Preferably, the participants in federated learning in S1 may comprise multiple parties p1, p2, ..., pn, holding local data sets data_1, data_2, ..., data_n, where m_p is the sample size of node p and n_p is the number of features of node p including the ID tag. One party holds the label y: the node holding the label, say p1, is defined as the initiator, and the nodes without the label, p2, ..., pn, are defined as participants. By fusing the data sets, the samples common to all nodes are screened out by sample ID, so that every node holds the same number of samples and the sample IDs stay aligned, i.e. m_1 = m_2 = ... = m_n.
Preferably, S2 specifically includes:
S2-1: the initiator homomorphically encrypts the label y with its private key key1 and sends the encrypted label to each participant;
S2-2: each participant buckets its features on the encrypted data, shuffles the bucket order, randomly maps the bucket intervals, stores the mapping rule locally, and computes feature_dict, which contains the randomly coded bucket information; using the encrypted label and homomorphic encryption it computes the number of positive samples and the total number of samples in each bucket, then sends feature_dict to the initiator;
S2-3: the initiator decrypts the data and computes the WOE value of each bucket, obtaining woe_dict, which it sends to the participants. Assuming a feature is divided into q buckets, the WOE of a single bucket i of a single feature is computed as:

p_y_i = y_i / y_T
p_n_i = n_i / n_T
WOE_i = ln(p_y_i / p_n_i)

where y_i and n_i are the numbers of positive and negative samples in bucket i, and y_T and n_T are the total numbers of positive and negative samples of the feature. The WOE value of every bucket of every feature is computed, and each participant is sent the WOE table for the features it owns;
S2-4: each participant maps the bucket codes of woe_dict back to the original interval ranges, obtaining woe_dict_origin, and encodes every sample and feature of its raw data according to the mapping rule of woe_dict_origin, forming the final model-input data data_train_p used in vertical federated learning LightGBM model training and data_test_p used for model evaluation.
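The per-bucket WOE calculation of S2-3 can be sketched in Python as follows. The function name, the smoothing term `eps`, and the example counts are illustrative assumptions, not part of the patent:

```python
import math

def woe_table(bucket_pos, bucket_total, eps=0.5):
    """WOE of each bucket of one feature (sketch of S2-3).

    bucket_pos[i]   -- positive samples counted in bucket i
    bucket_total[i] -- total samples counted in bucket i
    eps             -- smoothing for empty buckets (an assumption; the
                       patent does not specify how zero counts are handled)
    """
    total_pos = sum(bucket_pos)
    total_neg = sum(t - p for p, t in zip(bucket_pos, bucket_total))
    woe = []
    for pos, tot in zip(bucket_pos, bucket_total):
        neg = tot - pos
        p_y = (pos + eps) / (total_pos + eps)  # share of all positives in this bucket
        p_n = (neg + eps) / (total_neg + eps)  # share of all negatives in this bucket
        woe.append(math.log(p_y / p_n))
    return woe

# Two buckets: one positive-heavy, one negative-heavy.
woes = woe_table([8, 2], [10, 10])
```

A positive-heavy bucket gets a positive WOE, a negative-heavy bucket a negative one; the initiator would run this once per feature on the decrypted bucket counts.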
Preferably, the training of the vertical federated learning LightGBM model in S3 includes the steps of:
S3-1: the data are bucketed to speed up training;
S3-2: mutually exclusive features are bundled to speed up training;
S3-3: the initiator performs gradient-based one-side sampling according to the real labels and the current predictions, obtains the sample index sample_data_index and sample weights sample_weight for the current round of training, and sends them to the participants;
S3-4: the initiator creates a LightGBM tree-node queue and creates a root node, which is put into the queue;
S3-5: the initiator pops the splittable node with the largest gain;
S3-6: the initiator computes the first-order gradients and second-order hessians of this round, homomorphically encrypts them to protect privacy, and sends the encrypted encrypt_gradients and encrypt_hessians to the participants;
S3-7: each participant node computes its histograms from encrypt_gradients and encrypt_hessians using homomorphic encryption and sends them to the initiator, while the initiator computes its own histograms locally in plaintext;
S3-8: the initiator decrypts the histograms of all nodes and computes the optimal split feature split_feature and split point split_value; if the feature of the optimal split point belongs to the initiator, the split is performed locally, and if it belongs to a participant, the split information is sent to that participant and the split is performed there;
S3-9: the initiator computes the predicted values of the two split tree nodes and marks whether each node can be split further, then adds both nodes to the LightGBM tree-node queue;
S3-10: S3-5 to S3-9 are repeated until no splittable node remains in the LightGBM tree-node queue, finishing this round of training;
S3-11: S3-3 to S3-10 are repeated until the model reaches the preset parameter values, finishing the whole model training.
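The gain maximized in S3-5 and S3-8 is not spelled out in the text. A common second-order histogram gain of the XGBoost/LightGBM family, used here as an assumption, can be sketched as:

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Second-order gain of a candidate split:
    G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam).
    lam is an assumed L2 regularization weight."""
    def score(g, h):
        return g * g / (h + lam)
    return (score(g_left, h_left) + score(g_right, h_right)
            - score(g_left + g_right, h_left + h_right))

def best_split(hist_g, hist_h, lam=1.0):
    """Scan the bucket boundaries of one feature's (gradient, hessian)
    histogram and return (best_boundary_index, best_gain)."""
    G, H = sum(hist_g), sum(hist_h)
    gl = hl = 0.0
    best = (None, 0.0)
    for i in range(len(hist_g) - 1):
        gl += hist_g[i]
        hl += hist_h[i]
        gain = split_gain(gl, hl, G - gl, H - hl, lam)
        if gain > best[1]:
            best = (i, gain)
    return best

# Gradients change sign between buckets 1 and 2, so the best boundary is 1.
boundary, gain = best_split([-4.0, -4.0, 5.0, 5.0], [1.0, 1.0, 1.0, 1.0])
```

In S3-8 the initiator would run this scan over the decrypted histograms of every feature of every node and keep the global maximum.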
Preferably, after training in S4 is finished, the model is evaluated and verified on the test set data_test; using evaluation metrics such as AUC and recall, the number of boosting rounds with the best test-set performance is selected and the tree model is pruned accordingly, avoiding overfitting and forming the final model.
Compared with the prior art, the invention has the beneficial effects that:
according to the method, the WOE coding is performed on the data locally at each node, namely, the WOE value is used for replacing the original data, the coded data is used for model training, and the characteristic name is also coded. Meanwhile, homomorphic encryption is carried out on the transmitted data and the gradient in the model construction process, so that the safety is greatly improved. Meanwhile, even if an attacker cracks the ciphertext information, the WOE value can be obtained only, the corresponding characteristics and the value range of the characteristics are unknown, the WOE value mapped by the larger value of the original data is smaller, the WOE value mapped by the smaller original data is larger, the original data and the WOE value lose the size corresponding relation, and therefore the attacker cannot deduce the original information.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a flow chart of WOE calculation and encoding according to an embodiment of the present invention;
FIG. 3 is a flow chart of the multi-party vertical secure federated learning LightGBM training process in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-3, the present invention provides a technical solution: the data protection method based on the WOE mask in the multi-party longitudinal federal learning LightGBM training comprises the following steps:
S1: each participant of federated learning prepares training data;
S2: each node performs WOE encoding on its raw data locally;
S3: the vertical federated learning LightGBM model is trained;
S4: model evaluation and pruning are performed on the model trained in S3.
In this embodiment, the participants in federated learning in S1 may comprise multiple parties p1, p2, ..., pn, holding local data sets data_1, data_2, ..., data_n, where m_p is the sample size of node p and n_p is the number of features of node p including the ID tag. One party holds the label y: the node holding the label, say p1, is defined as the initiator, and the nodes without the label, p2, ..., pn, are defined as participants. By fusing the data sets, the samples common to all nodes are screened out by sample ID, so that every node holds the same number of samples and the sample IDs stay aligned, i.e. m_1 = m_2 = ... = m_n.
In this embodiment, vertical federated learning begins with data alignment: the sample IDs owned in common by all participants are screened out to form fuse_data, called the fused data set. Although it is called fused, the data remain distributed locally at each participant and are never actually aggregated; "fused data set" is only a collective name. Each node then screens out data sets data_1_fuse, data_2_fuse, ..., data_n_fuse, which share the same sample IDs but carry different features.
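The alignment step can be sketched as follows. This plain set intersection is illustrative only; a production system would use a private set intersection (PSI) protocol so that even the non-shared IDs are not revealed, which this sketch does not implement:

```python
def fuse_sample_ids(*id_lists):
    """Return the sorted sample IDs common to every node's local data set.
    Each node keeps its own rows locally; only the shared ID list is agreed on."""
    common = set(id_lists[0])
    for ids in id_lists[1:]:
        common &= set(ids)
    return sorted(common)

# Three nodes with partially overlapping sample IDs.
shared = fuse_sample_ids([1, 2, 3, 5], [2, 3, 4, 5], [0, 2, 5, 9])
```

Each node would then restrict its local rows to `shared`, yielding data_p_fuse with identical sample order at every node.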
In this embodiment, S2 specifically includes:
S2-1: the initiator homomorphically encrypts the label y with its private key key1 and sends the encrypted label to each participant;
S2-2: each participant buckets its features on the encrypted data, shuffles the bucket order, randomly maps the bucket intervals, stores the mapping rule locally, and computes feature_dict, which contains the randomly coded bucket information; using the encrypted label and homomorphic encryption it computes the number of positive samples and the total number of samples in each bucket, then sends feature_dict to the initiator;
S2-3: the initiator decrypts the data and computes the WOE value of each bucket, obtaining woe_dict, which it sends to the participants. Assuming a feature is divided into q buckets, the WOE of a single bucket i of a single feature is computed as:

p_y_i = y_i / y_T
p_n_i = n_i / n_T
WOE_i = ln(p_y_i / p_n_i)

where y_i and n_i are the numbers of positive and negative samples in bucket i, and y_T and n_T are the total numbers of positive and negative samples of the feature. The WOE value of every bucket of every feature is computed, and each participant is sent the WOE table for the features it owns;
S2-4: each participant maps the bucket codes of woe_dict back to the original interval ranges, obtaining woe_dict_origin, and encodes every sample and feature of its raw data according to the mapping rule of woe_dict_origin, forming the final model-input data data_train_p used in vertical federated learning LightGBM model training and data_test_p used for model evaluation.
In this embodiment, after the data set to be used for modeling is determined, the buckets and WOE values of each feature must be computed and the WOE mask mapping applied to the raw data. In this process the initiator holds the label y and can compute locally; to assist each participant with its WOE calculation, the initiator homomorphically encrypts the label with its private key key1 and sends it to each participant. Each participant buckets its features on the encrypted data, shuffles the bucket order, randomly maps the bucket intervals, stores the mapping rule locally, computes feature_dict, and sends it to the initiator. After decryption, the initiator computes the WOE value of each bucket, obtains woe_dict, and sends it to the participants. Each participant maps the bucket codes of woe_dict back to the original interval ranges using the previously stored mapping rule, obtaining woe_dict_origin, and encodes every sample and feature of its raw data according to woe_dict_origin, finally forming the model-input data data_train used in federated learning LightGBM model training and data_test used for model evaluation. This completes the mapping of the raw data. Because the WOE code captures, for every value or interval of every feature, its ability to separate positive and negative samples, the magnitude of the encoded data carries more information than the original discrete features, which can improve the training effect of the model.
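The participant-side shuffling of S2-2 and the mask application of S2-4 can be sketched as follows; the function names, bin edges, and the stand-in WOE table are illustrative assumptions:

```python
import random

def build_bucket_mapping(bin_edges, seed=0):
    """Sketch of S2-2: shuffle the bucket order and assign random codes to
    the intervals. The participant keeps this mapping locally; only the
    random codes travel to the initiator."""
    intervals = list(zip(bin_edges[:-1], bin_edges[1:]))
    codes = list(range(len(intervals)))
    random.Random(seed).shuffle(codes)
    return dict(zip(codes, intervals))

def apply_woe_mask(values, mapping, woe_by_code):
    """Sketch of S2-4: replace every raw value by the WOE of its bucket."""
    masked = []
    for v in values:
        for code, (lo, hi) in mapping.items():
            if lo <= v < hi:
                masked.append(woe_by_code[code])
                break
    return masked

mapping = build_bucket_mapping([0, 10, 20, 30])
woe_by_code = {code: float(code) for code in mapping}  # stand-in WOE table
masked = apply_woe_mask([5, 15, 25], mapping, woe_by_code)
```

Because the codes are a random permutation, an observer of the coded buckets cannot recover which raw interval a code denotes; only the participant's local mapping can undo it.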
In this embodiment, S3 includes:
S3-1: the data are bucketed to speed up training;
S3-2: mutually exclusive features are bundled to speed up training;
S3-3: the initiator performs gradient-based one-side sampling according to the real labels and the current predictions, obtains the sample index sample_data_index and sample weights sample_weight for the current round of training, and sends them to the participants;
S3-4: the initiator creates a LightGBM tree-node queue and creates a root node, which is put into the queue;
S3-5: the initiator pops the splittable node with the largest gain;
S3-6: the initiator computes the first-order gradients and second-order hessians of this round, homomorphically encrypts them to protect privacy, and sends the encrypted encrypt_gradients and encrypt_hessians to the participants;
S3-7: each participant node computes its histograms from encrypt_gradients and encrypt_hessians using homomorphic encryption and sends them to the initiator, while the initiator computes its own histograms locally in plaintext;
S3-8: the initiator decrypts the histograms of all nodes and computes the optimal split feature split_feature and split point split_value; if the feature of the optimal split point belongs to the initiator, the split is performed locally, and if it belongs to a participant, the split information is sent to that participant and the split is performed there;
S3-9: the initiator computes the predicted values of the two split tree nodes and marks whether each node can be split further, then adds both nodes to the LightGBM tree-node queue;
S3-10: S3-5 to S3-9 are repeated until no splittable node remains in the LightGBM tree-node queue, finishing this round of training;
S3-11: S3-3 to S3-10 are repeated until the model reaches the preset parameter values, finishing the whole model training.
In this embodiment, each node first performs bucketing and exclusive feature bundling on data_train and data_test. The purpose of bucketing is to reduce the number of comparisons needed to find the optimal split point of each feature during LightGBM splitting; the purpose of exclusive feature bundling is to reduce the total number of features. Since some features have many missing values, several sparse features are bundled into one feature. Reducing the number of candidate split points and of features improves training speed without affecting model accuracy.
In this embodiment, in each round of training, gradient-based one-side sampling is performed according to the real labels and the current predictions (initialized to the zero vector), yielding the sample index sample_data_index and sample weights sample_weight of the samples participating in this round; this step is completed only at the initiator. After sampling, the initiator sends sample_data_index and sample_weight to the participants. The initiator then computes the first-order gradients and second-order hessians of this round, homomorphically encrypts them to protect privacy, and sends the encrypted encrypt_gradients and encrypt_hessians to the participants to assist them in modeling.
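The one-side sampling the initiator performs can be sketched as a standard GOSS step; the sampling rates `top_rate` and `other_rate` are assumed values, since the patent does not give them:

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Gradient-based One-Side Sampling, done only at the initiator.
    Keeps the top_rate fraction of samples with the largest |gradient|,
    randomly keeps other_rate of the rest, and up-weights the latter by
    (1 - top_rate) / other_rate so the gradient sums stay unbiased."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(n * top_rate)
    top, rest = order[:n_top], order[n_top:]
    other = random.Random(seed).sample(rest, int(n * other_rate))
    sample_data_index = top + other
    sample_weight = {i: 1.0 for i in top}
    sample_weight.update({i: (1 - top_rate) / other_rate for i in other})
    return sample_data_index, sample_weight

idx, w = goss_sample([0.9, 0.1, 0.05, 0.8, 0.2, 0.02, 0.3, 0.01, 0.7, 0.15])
```

The returned pair corresponds to the sample_data_index and sample_weight that the initiator sends to the participants in S3-3.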
In this embodiment, the initiator and the participants each build their histograms during tree construction. Computing a histogram requires, for each feature value, the sums of first-order gradients and second-order hessians and the sample count; at the initiator these data are in plaintext, while each participant computes on ciphertext using homomorphic encryption and sends the result back to the initiator when done. The initiator creates a root node and maintains a LightGBM tree-node queue. Among the splittable nodes in the queue, the node with the largest gain is selected and popped, and the optimal split feature split_feature and split point split_value are searched over the histograms of the initiator and all participants. If the feature of the optimal split point belongs to the initiator, the split is performed locally; if it belongs to a participant, the split information is sent to that participant, the split is performed there, and the result is sent back to the initiator. The initiator computes the predicted values of the split child nodes, marks whether each can be split further, and adds both child nodes to the queue. When no node in the queue can be split, the current tree is complete and the next tree is built, until the configured training parameters (such as the number of trees) are reached.
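The additive homomorphic property that lets a participant sum encrypted gradients into histogram cells can be illustrated with a toy Paillier scheme. The tiny fixed primes below are purely for demonstration and are nowhere near secure, and real gradients would additionally need fixed-point encoding, since Paillier operates on integers:

```python
import math
import random

# Toy Paillier keypair with tiny fixed primes: illustration only, not secure.
p, q = 1789, 2003
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
g = n + 1  # standard simplification

def _L(u):
    return (u - 1) // n

mu = pow(_L(pow(g, lam, n2)), -1, n)  # modular inverse (Python 3.8+)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (_L(pow(c, lam, n2)) * mu) % n

def he_add(c1, c2):
    # Additive homomorphism: Dec(c1 * c2 mod n^2) = m1 + m2
    return (c1 * c2) % n2

# A participant can sum per-bucket encrypted gradients without decrypting:
cipher_sum = he_add(encrypt(7), encrypt(35))
```

Only the initiator, holding the private key (lam, mu), can decrypt the aggregated histogram cell; the participant sees nothing but ciphertexts.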
In this embodiment, after training is completed, the model is evaluated and verified on the test set data_test; using evaluation metrics such as AUC and recall, the number of boosting rounds with the best test-set performance is selected and the tree model is pruned accordingly, avoiding overfitting and forming the final model. Throughout the whole process, the gradients that could leak privacy are encrypted, the feature names are randomly mapped, and the raw data are WOE-encoded in the data processing stage, so no node can reconstruct the raw data and privacy is guaranteed.
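Selecting the best boosting round by test-set AUC, as in S4, can be sketched as follows; the rank-based AUC implementation and all names are illustrative assumptions:

```python
def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) statistic, with averaged tie ranks."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1  # group tied scores
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum += avg_rank
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def best_round(per_round_test_scores, labels):
    """Pick the boosting round whose test-set AUC is highest (pruning point)."""
    aucs = [auc(labels, s) for s in per_round_test_scores]
    return max(range(len(aucs)), key=aucs.__getitem__)
```

All trees after the chosen round would be discarded, which is the pruning the embodiment describes.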
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. A data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training, characterized by comprising the following steps:
S1: each participant of federated learning prepares training data;
S2: each node performs WOE encoding on its raw data locally;
S3: the vertical federated learning LightGBM model is trained;
S4: model evaluation and pruning are performed on the model trained in S3.
2. The data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training according to claim 1, wherein the participants in federated learning in S1 may comprise multiple parties p1, p2, ..., pn, holding local data sets data_1, data_2, ..., data_n, where m_p is the sample size of node p and n_p is the number of features of node p including the ID tag; one party holds the label y: the node holding the label, say p1, is defined as the initiator, and the nodes without the label, p2, ..., pn, are defined as participants; by fusing the data sets, the samples common to all nodes are screened out by sample ID, so that every node holds the same number of samples and the sample IDs stay aligned, i.e. m_1 = m_2 = ... = m_n.
3. The data protection method based on a WOE mask in multi-party vertical federated learning LightGBM training according to claim 1, wherein S2 specifically comprises: S2-1: the initiator homomorphically encrypts the label y with its private key key1 and sends the encrypted label to each participant; S2-2: each participant buckets its features on the encrypted data, shuffles the bucket order, randomly maps the bucket intervals, stores the mapping rule locally, and computes feature_dict, which contains the randomly coded bucket information; using the encrypted label and homomorphic encryption it computes the number of positive samples and the total number of samples in each bucket, then sends feature_dict to the initiator; S2-3: the initiator decrypts the data, computes the WOE value of each bucket, obtains woe_dict and sends it to the participants; S2-4: each participant maps the bucket codes of woe_dict back to the original interval ranges, obtaining woe_dict_origin, and encodes every sample and feature of its raw data according to the mapping rule of woe_dict_origin, forming the final model-input data data_train_p used in vertical federated learning LightGBM model training and data_test_p used for model evaluation.
4. The method for data protection based on WOE mask in multi-party longitudinal federal learning LightGBM training as claimed in claim 1, wherein the longitudinal federal learning LightGBM model training in S3 comprises the steps of:
s3-1: the data are subjected to bucket separation, so that the training speed is improved;
s3-2: mutually exclusive feature binding is carried out on the data, and the training speed is improved;
s3-3: the initiator performs unilateral gradient sampling according to the real label and the current predicted value to obtain data sample _ data _ index and sample weight sample _ weight of the current round of training, and sends the data sample _ data _ index and the sample weight sample _ weight to the participants;
S3-4: the initiator creates a LightGBM tree model node queue, creates the root node, and pushes it into the queue;
S3-5: the initiator pops the splittable node with the largest gain from the queue;
S3-6: the initiator computes the first-order gradients and second-order hessians for this round, homomorphically encrypts the gradients and hessians to protect privacy, and sends the encrypted encrypt_gradients and encrypt_hessians to the participants;
S3-7: each participant computes the node histogram from encrypt_gradients and encrypt_hessians using homomorphic encryption and sends it to the initiator, while the initiator computes its own node histogram locally in plaintext;
S3-8: the initiator decrypts the node histograms and computes the optimal split feature split_feature and split point split_value; if the optimal split feature belongs to the initiator, the split is performed locally; if it belongs to a participant, the split information is sent to that participant and the split is performed there;
S3-9: the initiator computes the predicted values of the two tree model nodes produced by the split, marks whether each node can be split further, and then pushes both nodes into the LightGBM tree model node queue;
S3-10: repeat S3-5 to S3-9 until no splittable node remains in the LightGBM tree model node queue, completing the current round of training;
S3-11: repeat S3-3 to S3-10 until the model reaches the preset parameter values, completing the whole model training.
5. The method for data protection based on WOE mask in multi-party longitudinal federal learning LightGBM training as claimed in claim 1, wherein after the training in S4 is completed, the model is evaluated and verified on the test data data_test; evaluation metrics such as AUC and recall are used to select the number of boosting rounds that performs best on the test set, and the tree model is pruned accordingly to avoid overfitting, forming the final model.
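The round selection in claim 5 can be sketched as follows; the AUC here is the plain rank-statistic form without tie handling, and both function names are illustrative, not from the patent:

```python
def auc(labels, scores):
    """Rank-statistic AUC (no tie handling): the probability that a
    positive sample is scored above a negative one."""
    order = sorted(range(len(labels)), key=lambda i: scores[i])
    rank_sum = sum(r + 1 for r, i in enumerate(order) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def select_best_round(auc_per_round):
    """Pick the boosting round with the highest test-set AUC, so that
    trees added after it can be pruned from the final model."""
    best_round = max(range(len(auc_per_round)), key=auc_per_round.__getitem__)
    return best_round + 1  # number of trees to keep
```

Keeping only the trees up to the best-scoring round is the pruning step the claim describes: later trees that only improved the training fit are discarded.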
CN202111102941.2A 2021-09-17 2021-09-17 Data protection method based on WOE mask in multi-party longitudinal federal learning LightGBM training Pending CN113779608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111102941.2A CN113779608A (en) 2021-09-17 2021-09-17 Data protection method based on WOE mask in multi-party longitudinal federal learning LightGBM training

Publications (1)

Publication Number Publication Date
CN113779608A true CN113779608A (en) 2021-12-10

Family

ID=78852348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111102941.2A Pending CN113779608A (en) 2021-09-17 2021-09-17 Data protection method based on WOE mask in multi-party longitudinal federal learning LightGBM training

Country Status (1)

Country Link
CN (1) CN113779608A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989036A (en) * 2021-12-30 2022-01-28 百融至信(北京)征信有限公司 Federal learning prediction method and system without exposure of model-entering variable
CN113989036B (en) * 2021-12-30 2022-03-18 百融至信(北京)征信有限公司 Federal learning prediction method and system without exposure of model-entering variable
CN114330759A (en) * 2022-03-08 2022-04-12 富算科技(上海)有限公司 Training method and system for longitudinal federated learning model
CN114330759B (en) * 2022-03-08 2022-08-02 富算科技(上海)有限公司 Training method and system for longitudinal federated learning model
CN115334005A (en) * 2022-03-31 2022-11-11 北京邮电大学 Encrypted flow identification method based on pruning convolution neural network and machine learning
CN115334005B (en) * 2022-03-31 2024-03-22 北京邮电大学 Encryption flow identification method based on pruning convolutional neural network and machine learning
CN115292738A (en) * 2022-10-08 2022-11-04 豪符密码检测技术(成都)有限责任公司 Method for detecting security and correctness of federated learning model and data
CN115292738B (en) * 2022-10-08 2023-01-17 豪符密码检测技术(成都)有限责任公司 Method for detecting security and correctness of federated learning model and data

Similar Documents

Publication Publication Date Title
CN113779608A (en) Data protection method based on WOE mask in multi-party longitudinal federal learning LightGBM training
CN113051557B (en) Social network cross-platform malicious user detection method based on longitudinal federal learning
CN111931242A (en) Data sharing method, computer equipment applying same and readable storage medium
CN113420232B (en) Privacy protection-oriented federated recommendation method for neural network of graph
Chen et al. Secure social recommendation based on secret sharing
CN113127916A (en) Data set processing method, data processing device and storage medium
CN101729554B (en) Construction method of division protocol based on cryptology in distributed computation
CN110163008B (en) Security audit method and system for deployed encryption model
CN108667717A (en) Block chain processing method, medium, device and computing device based on instant communication message record
Li et al. SPFM: Scalable and privacy-preserving friend matching in mobile cloud
CN113779355B (en) Network rumor tracing evidence obtaining method and system based on blockchain
CN113283902A (en) Multi-channel block chain fishing node detection method based on graph neural network
CN114362948B (en) Federated derived feature logistic regression modeling method
CN115269983A (en) Target sample recommendation method based on two-party data privacy protection
CN104618098B (en) Cryptography building method and system that a kind of set member's relation judges
CN117216788A (en) Video scene identification method based on federal learning privacy protection of block chain
Shekhtman et al. Critical field-exponents for secure message-passing in modular networks
CN110059097A (en) Data processing method and device
CN116821952A (en) Privacy data calculation traceability system and method based on block chain consensus mechanism
CN110175283A (en) A kind of generation method and device of recommended models
CN116681141A (en) Federal learning method, terminal and storage medium for privacy protection
CN113221170B (en) Privacy information matching and data transaction method and system based on blockchain
Gulyás et al. Analysis of identity separation against a passive clique-based de-anonymization attack
Zhang et al. Multi-party Secure Comparison of Strings Based on Outsourced Computation
Li et al. An integrated recommendation approach based on influence and trust in social networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination