CN115630713A - Longitudinal federated learning method, device and medium under condition of different sample identifiers - Google Patents


Info

Publication number
CN115630713A
CN115630713A (application CN202211061861.1A)
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211061861.1A
Other languages
Chinese (zh)
Inventor
赖俊祚
李钰
张蓉
李燕玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN202211061861.1A
Publication of CN115630713A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/602 Providing cryptographic facilities or services


Abstract

The invention relates to the technical field of federated learning, and in particular to a method, a device and a medium for longitudinal federated learning under different sample identifiers, the method comprising the following steps: in the alignment stage of federated learning, sample alignment is carried out on participants with different sample identifiers by means of an oblivious programmable pseudo-random function (OPPRF), Cuckoo hashing and simple hashing, with noise added to the sample identifiers during sample alignment; the participants then jointly train a model under Paillier homomorphic encryption, the encryption protecting the participants' private information during training. The method and the device can realize sample alignment when the participants' sample identifiers differ, and achieve joint model training while protecting the private information of each participant.

Description

Longitudinal federated learning method, device and medium under condition of different sample identifiers
Technical Field
The invention relates to the technical field of federated learning, and in particular to a longitudinal federated learning method, device and medium under different sample identifiers.
Background
Longitudinal (vertical) federated learning applies to datasets with large overlap in user samples but little overlap in user features. For example, a bank and an e-commerce company provide different services to users and therefore hold different aspects of their features, but the users they serve overlap heavily, so combining their data increases the feature dimension of the training dataset. Before jointly training a model, the samples must be aligned. For sample alignment, current longitudinal federated learning schemes mainly adopt Private Set Intersection (PSI) techniques: for example, the federated learning framework FATE developed by WeBank mainly performs sample alignment with a PSI based on RSA and a hash function, and Liu et al., in the document "Asymmetrical Vertical Federated Learning", use a PSI based on a Pohlig-Hellman structure to perform sample alignment in an asymmetric federated learning setting. However, all of these schemes assume by default that the participants' sample identifiers are the same, and do not consider the alignment problem when the sample identifiers differ. After the samples are aligned, the participants can jointly train a model, and current longitudinal federated learning schemes mainly adopt cryptographic methods such as homomorphic encryption, secret sharing and differential privacy to protect private information during training. Hardy et al., in the document "Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption", protect the private information exchanged during interaction with an additively homomorphic encryption scheme; when a sender transmits a gradient ciphertext, the gradient is first blinded by adding a random number before being sent to the receiver in order to further protect the gradient information, and the scheme requires the participation of a coordinator.
With the advent of the big data era, data has become a new production element, and how to mine valuable information from massive data has gradually become a research hotspot. The traditional data processing method is to aggregate all data for analysis and modeling, but aggregating the data may expose users' sensitive information: if the party holding the data uses it maliciously during analysis and modeling, for example by privately packaging and selling the collected data, users' private information can be leaked. As a result, data dares not, cannot, and will not flow between enterprises, which severely restricts the circulation of data and creates data silos; protecting data privacy while fully exploiting the value of data has therefore become a research hotspot. Federated learning emerged in this context: it aims to help participants complete training together without revealing their local private data, its defining characteristic being that private data never leaves the local site, and owing to this good privacy-protecting property it has been widely applied in many fields. According to the distribution of the data, federated learning can be divided into longitudinal (vertical) federated learning, horizontal federated learning and federated transfer learning.
In longitudinal federated learning, sample alignment is needed before the participants jointly train a model. Existing longitudinal federated learning methods either do not consider the sample alignment problem, or only consider alignment when the unique sample identifiers (IDs) in the participants' databases are the same. In real life, however, the information the same user provides when registering on different platforms may differ: for example, a user registers on platform A with a mailbox address and on platform B with a telephone number, so that the same user's unique identifiers differ between the two databases of A and B. If the two platforms need to train a model together and must first perform sample alignment, the traditional vertical federated learning methods are no longer applicable. It is therefore necessary to design a longitudinal federated learning method that can handle differing participant sample identifiers, which has important application value.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a longitudinal federal learning method, equipment and a medium under the condition of different sample identifiers.
The invention aims to provide a longitudinal federal learning method under the condition that sample identifiers are different.
It is a second object of the invention to provide a computer apparatus.
It is a third object of the present invention to provide a storage medium.
The first purpose of the invention can be achieved by adopting the following technical scheme:
a method for longitudinal federated learning with disparate sample identifiers, the method comprising:
S1. Party A and party B each run an oblivious programmable pseudo-random function (OPPRF) with a third party C to obtain the corresponding OPPRF outputs, and party A and party B each add their corresponding OPPRF outputs as noise to their own data sets;
S2. Party B sends its noise-added data set to party A; party A obtains the intersection of the noise-added data sets from party B's noise-added data set and sorts its own data set according to the intersection; party A sends the intersection to party B, and party B sorts its own data set according to the intersection;
S3. Party A and party B jointly train a logistic regression model by gradient descent: by exchanging ciphertexts of intermediate results, party A and party B compute their complete encrypted gradients, add noise to the encrypted gradients, and send them to the third party C; the third party C decrypts the received ciphertexts to obtain the plaintexts and sends them to party A and party B respectively; party A and party B remove the noise from the received plaintexts and update their gradients; party A obtains a trained logistic regression model θ_A and party B obtains a trained logistic regression model θ_B.
In a preferred embodiment, the step S1 includes:
S11. Party A, party B and the third party C each hold the set formed by all samples in their own database under their identifiers, and party A, party B and the third party C negotiate three hash functions;
The third party C holds two identifiers, under which two sets are formed whose elements correspond one to one. The third party C adds noise to each element of the two sets to obtain new values, where the same noise is added at the same positions of the two sets. The elements of one set and the noise-added values of the other set are combined row by row into point-value pairs, forming two sets of point-value pairs which serve as C's input when running the OPPRF;
S12. Party A and party B store their own sets into bins via a Cuckoo hash function, each bin storing at most one element; the third party C maps the elements of its sets into bins using a simple hash function, each bin storing multiple elements; if nothing is stored in a bin of party A or party B, that party stores an invalid element ⊥ in that bin;
S13. For each bin, party A and the third party C run the OPPRF, where party A is the receiver of the OPPRF and inputs the element in its bin, and the third party C is the sender of the OPPRF and inputs the elements in its bin together with their corresponding noise-added values; party A obtains the corresponding OPPRF output;
S14. For each bin, party B and the third party C run the OPPRF, with party B as the receiver and the third party C as the sender, and party B obtains the corresponding OPPRF output.
The second purpose of the invention can be achieved by adopting the following technical scheme:
A computer device comprises a processor and a memory storing a program executable by the processor; when the processor executes the program stored in the memory, the above longitudinal federated learning method under different sample identifiers is realized.
The third purpose of the invention can be achieved by adopting the following technical scheme:
A storage medium stores a program which, when executed by a processor, implements the above longitudinal federated learning method under different sample identifiers.
Compared with the prior art, the invention has the following advantages and beneficial effects:
According to the longitudinal federated learning method, device and medium under different sample identifiers, in the alignment stage of federated learning, participants with different sample identifiers perform sample alignment by means of an oblivious programmable pseudo-random function (OPPRF), Cuckoo hashing and simple hashing, with noise added to the sample identifiers during sample alignment; the participants then jointly train a model under Paillier homomorphic encryption, the encryption protecting the participants' private information during training. The invention can realize sample alignment when the participants' sample identifiers differ, and achieves joint model training while protecting the private information of each participant.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from them without creative effort.
FIG. 1 is a flow chart of a longitudinal federated learning method with disparate sample identifiers in an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a logistic regression model trained by two medical institutions in accordance with an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described in further detail below with reference to the accompanying drawings and examples. Obviously, the described examples are some, but not all, examples of the present invention, and the embodiments of the present invention are not limited thereto. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort shall fall within the protection scope of the present invention.
Example 1:
Addressing the fact that existing schemes cannot solve the alignment problem when sample identifiers differ, the invention provides a longitudinal federated learning method which realizes sample alignment when the participants' sample identifiers are different, and which uses homomorphic encryption to train a model after the samples are aligned.
The invention provides a longitudinal federated learning method under different sample identifiers, constructed mainly from four techniques: the OPPRF, simple hashing, Cuckoo hashing and Paillier homomorphic encryption. The four techniques are introduced as follows:
1. Oblivious programmable pseudo-random function OPPRF (Oblivious Programmable PRF)
The OPPRF is a combination of an oblivious pseudo-random function (OPRF) and a programmable pseudo-random function (PPRF), and involves a sender and a receiver: the sender A has input {(x_i, y_i), i ∈ [n]}, and the receiver B has input x. The invention adopts a table-based OPPRF, which mainly comprises the following 5 steps:
(1) A and B run the OPRF protocol: the input of A is a key k, the input of B is x, and after the protocol is run, B receives F(k, x).
(2) For {x_i, i ∈ [n]}, A calculates {F(k, x_1), F(k, x_2), ..., F(k, x_n)}. A samples a value v until all of {H(F(k, x_i) || v), i ∈ [n]} are distinct.
(3) For each i ∈ [n], A first calculates h_i = H(F(k, x_i) || v), and then sets the element at position h_i of a table T as
T[h_i] = F(k, x_i) ⊕ y_i
(4) A fills the positions of table T where no element is stored with random numbers, and then sends table T and v to B.
(5) B calculates h = H(F(k, x) || v), and then outputs
T[h] ⊕ F(k, x)
If x = x_i, B obtains the y_i corresponding to x_i; otherwise, B obtains a random number.
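The five steps above can be sketched in Python. This is a toy illustration, not the patent's implementation: HMAC-SHA256 stands in for the PRF F, and the OPRF step is simulated by handing the receiver the value F(k, x) directly (in the real protocol B learns only this single value, obliviously). All names are illustrative.

```python
import hashlib
import hmac
import os

def F(k: bytes, x: bytes) -> bytes:
    """Stand-in PRF; in the real protocol B obtains F(k, x) via an OPRF."""
    return hmac.new(k, x, hashlib.sha256).digest()

def H(data: bytes, table_size: int) -> int:
    """Hash to a table position."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % table_size

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(u ^ v for u, v in zip(a, b))

def sender_build_table(k, points, table_size):
    """Steps (2)-(4): program T so that T[H(F(k,x_i)||v)] = F(k,x_i) XOR y_i."""
    while True:
        v = os.urandom(16)
        idx = [H(F(k, x) + v, table_size) for x, _ in points]
        if len(set(idx)) == len(idx):            # resample v until positions are distinct
            break
    T = [os.urandom(32) for _ in range(table_size)]  # unused slots hold random numbers
    for (x, y), h in zip(points, idx):
        T[h] = xor(F(k, x), y)
    return T, v

def receiver_eval(T, v, fkx):
    """Step (5): output T[H(F(k,x)||v)] XOR F(k,x)."""
    return xor(T[H(fkx + v, len(T))], fkx)

k = os.urandom(16)
points = [(b"alice", b"A" * 32), (b"bob", b"B" * 32)]   # sender's programmed pairs
T, v = sender_build_table(k, points, table_size=8)
assert receiver_eval(T, v, F(k, b"alice")) == b"A" * 32  # programmed point: y returned
assert receiver_eval(T, v, F(k, b"carol")) != b"C" * 32  # unprogrammed: pseudorandom
```

On unprogrammed inputs the receiver lands on a random table slot, so its output carries no information about the programmed values, which is the property the alignment phase relies on.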
2. Simple hashing
Simple hashing contains b bins {B_1, B_2, ..., B_b} and k hash functions {h_1, h_2, ..., h_k}. Each bin may store multiple elements. To store an element x into the bins, first calculate the hash values of x under the k hash functions, {h_1(x), h_2(x), ..., h_k(x)}, and then put x into the k corresponding bins
B_{h_1(x)}, B_{h_2(x)}, ..., B_{h_k(x)}
3. Cuckoo hashing
Cuckoo hashing contains k hash functions {h_1, h_2, ..., h_k} and b bins {B_1, B_2, ..., B_b}. Each bin can store at most one element. Element x is stored in one of the bins
B_{h_1(x)}, B_{h_2(x)}, ..., B_{h_k(x)}
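The two hashing schemes can be contrasted in a short sketch (illustrative Python, not from the patent; SHA-256 stands in for the negotiated hash functions). The property exploited later is that whichever of its k candidate bins Cuckoo hashing places an element into, simple hashing has placed the same element into that bin as well, so the parties only ever need to compare bins with the same label.

```python
import hashlib
import random

random.seed(0)  # deterministic evictions for the demo

def h(i: int, x: str, b: int) -> int:
    """i-th hash function, mapping x into one of b bins."""
    return int.from_bytes(hashlib.sha256(f"{i}|{x}".encode()).digest(), "big") % b

def simple_hash(items, b, k=3):
    """Simple hashing: every item is placed in all k of its candidate bins."""
    bins = [[] for _ in range(b)]
    for x in items:
        for i in range(k):
            bins[h(i, x, b)].append(x)
    return bins

def cuckoo_hash(items, b, k=3, max_evictions=500):
    """Cuckoo hashing: at most one item per bin; a collision evicts the occupant."""
    bins = [None] * b
    for x in items:
        for _ in range(max_evictions):
            j = h(random.randrange(k), x, b)
            bins[j], x = x, bins[j]          # insert x, possibly evicting the occupant
            if x is None:
                break
        else:
            raise RuntimeError("insertion failed; grow b or pick new hash functions")
    return bins

ids = [f"user{n}" for n in range(10)]
cuckoo = cuckoo_hash(ids, b=30)
simple = simple_hash(ids, b=30)
# Whichever bin Cuckoo hashing chose for x, simple hashing put x there too,
# since simple hashing used all k candidate bins.
for j, x in enumerate(cuckoo):
    if x is not None:
        assert x in simple[j]
```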
4. Paillier homomorphic encryption
The invention utilizes the Paillier homomorphic encryption scheme to ensure the security of the private information exchanged during the training phase. The Paillier scheme generally comprises the KeyGen, Enc and Dec algorithms. The KeyGen algorithm takes a security parameter as input and outputs a public key pk and a private key sk. The Enc algorithm takes a plaintext m and pk as input and outputs the ciphertext of m, denoted ⟦m⟧. The Dec algorithm takes a ciphertext ⟦m⟧ and sk as input and outputs the decrypted plaintext m. The Paillier scheme supports the following two operations:
Addition of two ciphertexts:
⟦m_1⟧ · ⟦m_2⟧ = ⟦m_1 + m_2⟧
Multiplication of a ciphertext by a constant:
⟦m⟧^c = ⟦c · m⟧
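A minimal, insecure sketch of the Paillier scheme follows (toy 6- and 7-digit primes; a real deployment needs primes of at least 1024 bits and a vetted library). It shows KeyGen/Enc/Dec and the two homomorphic operations stated above, with g = n + 1 as the standard generator choice.

```python
import math
import random

def keygen(p=999983, q=1000003):
    """Toy Paillier KeyGen (demo-sized primes only)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    # L(u) = (u - 1) // n; here L(g^lam mod n^2) = lam, and mu = lam^{-1} mod n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu, n)

def enc(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)   # g^m * r^n mod n^2

def dec(sk, c):
    lam, mu, n = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n          # L(c^lam mod n^2) * mu mod n

def add(pk, c1, c2):
    """Homomorphic addition: Enc(m1) * Enc(m2) = Enc(m1 + m2)."""
    n, _ = pk
    return c1 * c2 % (n * n)

def cmul(pk, c, k):
    """Constant multiplication: Enc(m)^k = Enc(k * m)."""
    n, _ = pk
    return pow(c, k, n * n)

pk, sk = keygen()
assert dec(sk, add(pk, enc(pk, 15), enc(pk, 27))) == 42   # ciphertext addition
assert dec(sk, cmul(pk, enc(pk, 6), 7)) == 42             # constant multiplication
```

The two asserts correspond exactly to the two supported operations: multiplying ciphertexts adds plaintexts, and exponentiating a ciphertext scales its plaintext.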
The federated learning here is longitudinal federated learning and comprises a party A, a party B and a trusted third party C, where parties A and B perform sample alignment and model training with the help of the third party C, each finally obtaining a trained model. The longitudinal federated learning method includes a sample alignment phase and a training phase.
Party A and party B each hold their own database; each sample in a database comprises an identifier and feature values, and each party holds a portion of the features. The sample identifiers of party A and party B are different: assume the sample identifier of party A is ID1 and that of party B is ID2. The third party C owns a huge amount of data, and each sample in its database contains both identifiers (ID1, ID2).
As shown in FIG. 1, the longitudinal federated learning method under different sample identifiers according to the present invention involves a party A, a party B and a third party C, and is characterized in that the method comprises the following steps:
s1, a participant A and a participant B respectively obtain corresponding OPPRF outputs with an inadvertent and programmable pseudo-random function OPPRF operated by a third party C, and the participant A and the participant B respectively add the corresponding OPPRF outputs as noise to a data set of the participant.
S11. Party A, party B and the third party C each hold the set formed by all samples in their own database under their sample identifiers, and party A, party B and the third party C negotiate three hash functions.
Party A and party B need to train a model together and must first perform sample alignment; since the sample IDs of party A and party B are different, a trusted third party C possessing massive data is introduced to assist party A and party B in performing sample alignment.
The third party C has two identifiers, under which two sets are formed whose elements correspond one to one. The third party C adds noise to each element of the two sets, with the same noise added at the same positions of the two sets, obtaining new values. This step protects the privacy of the third party C's identifiers in the process described below.
S12. Party A and party B store their own sets into bins via a Cuckoo hash function, each bin storing at most one element; the third party C maps the elements of its two sets into bins using a simple hash function, each bin storing multiple elements. After mapping into bins, if a certain bin of party A or party B stores nothing, that party stores an invalid element ⊥ in that bin. Mapping into bins allows the subsequent steps to proceed bin by bin: the parties only need to operate between bins with the same label, not between bins with different labels, which reduces the number of operations between the parties' bins.
S13. For each bin, party A and the third party C run an oblivious programmable pseudo-random function (OPPRF), which involves two parties, a sender and a receiver: the sender inputs point-value pairs (x_i, y_i) and the receiver inputs x; if x equals some x_i, the receiver receives the corresponding y_i. For each bin, party A acts as the receiver of the OPPRF and inputs the element in its bin, the third party C acts as the sender of the OPPRF and inputs the elements stored in that bin together with their corresponding noise-added values, and party A obtains the corresponding OPPRF output.
S14. In a similar manner, for each bin, party B and the third party C run an OPPRF, with party B as the receiver and the third party C as the sender, and party B obtains the corresponding OPPRF output.
S2. Party B sends its noise-added data set to party A; party A obtains the intersection of the noise-added data sets from party B's noise-added data set and sorts its own data set according to the intersection; party A sends the intersection to party B, and party B sorts its own data set according to the intersection.
S21. For each bin in turn: if the bin stores a non-invalid element, party B XORs the corresponding OPPRF output with the element in its own bin, forming the set K_B, and sends K_B to party A; party A XORs the corresponding OPPRF outputs with the elements in its own bins, forming the set K_A. By matching K_A and K_B, party A obtains the intersection K_A' of the two sets, puts the element e_i corresponding to each member of K_A' into the intersection S_A, and sends K_A' to party B.
S22. Party B, according to the intersection of K_A' and K_B, puts the corresponding elements p_i into the intersection S_B.
After the alignment phase, if the same user is registered with both party A and party B, the two identifiers of that user will lie in the same row of the intersection S_A and the intersection S_B.
Specifically, in the present embodiment, assume the sample ID in party A's database is a mailbox, giving the set E = {e_1, e_2, ..., e_m}; the sample ID in party B's database is a telephone number, giving the set P = {p_1, p_2, ..., p_m}; and the sample identifiers in the trusted third party C's database contain both a mailbox and a telephone number, giving the set EP = {(e'_1, p'_1), (e'_2, p'_2), ..., (e'_n, p'_n)}. Party A, party B and third party C have negotiated 3 hash functions {h_1, h_2, h_3} and the number of bins b. The specific steps of sample alignment are as follows:
For each pair of elements (e'_j, p'_j) in set EP, j ∈ [n], third party C selects random numbers {r_j, j ∈ [n]}, one per pair, and then calculates
ẽ_j = e'_j ⊕ r_j and p̃_j = p'_j ⊕ r_j
Party A maps each element in set E into b bins {B_a[1], B_a[2], ..., B_a[b]} using Cuckoo hashing, where B_a[i] denotes the element in the i-th bin of A. Party B maps each element in set P into b bins {B_b[1], B_b[2], ..., B_b[b]} using Cuckoo hashing, where B_b[i] denotes the element in the i-th bin of B. After the Cuckoo hashing of A and B is finished, any empty bin of A or B is filled with an invalid element ⊥. C maintains two sets of bins, {B_c^e[1], ..., B_c^e[b]} and {B_c^p[1], ..., B_c^p[b]}. Using simple hashing, C maps {e'_1, e'_2, ..., e'_n} into the first bin set {B_c^e[i], i ∈ [b]} and likewise maps {p'_1, p'_2, ..., p'_n} into the second bin set {B_c^p[i], i ∈ [b]}, where B_c^e[i] (respectively B_c^p[i]) denotes the set of all elements in the i-th bin.
For each bin, third party C and party A run the OPPRF protocol once. At the i-th bin, party A acts as the receiver and inputs the element B_a[i] in its i-th bin, while third party C acts as the sender and inputs the point-value pairs {(e'_j, p̃_j)} for all e'_j in its i-th mailbox bin B_c^e[i]. After the protocol is run, A receives an output y'_i, and if what is stored in the i-th bin of party A is not the invalid element ⊥, party A calculates
K_A[i] = y'_i ⊕ B_a[i]
Similarly, for each bin, third party C and party B run the OPPRF protocol once. At the i-th bin, B inputs B_b[i] and C inputs the point-value pairs {(p'_j, ẽ_j)} for all p'_j in its i-th telephone bin B_c^p[i]. B receives an output y''_i, and if what is stored in the i-th bin of B is not the invalid element ⊥, party B calculates
K_B[i] = y''_i ⊕ B_b[i]
Party B then sends K_B to A.
Party A collects its values into K_A = {K_A[i]} and initializes the sets S_A and K'_A as empty sets. For each {K_A[j] ∈ K_A, j ∈ [m]}, A checks whether K_A[j] ∈ K_B holds; if it holds, party A puts B_a[j] into the intersection S_A and puts K_A[j] into K'_A. Finally, A sends K'_A to B.
For each {K_B[k] ∈ K_B, k ∈ [m]}, party B checks whether K_B[k] ∈ K'_A holds; if it holds, with K_B[k] = K'_A[j], then B puts B_b[k] into the j-th row of the set S_B. Finally party B obtains the intersection S_B.
After the above protocol is run, if a mailbox e_i and a telephone number p_i belong to the same user, then e_i and p_i will appear in the same row of the sets S_A and S_B respectively, thereby completing the sample alignment.
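The matching logic of the alignment phase can be checked with a small sketch. This is an idealization, not the protocol itself: the OPPRF is replaced by a plain lookup table (programmed queries return the stored value, others return randomness) and the hashing-to-bins is omitted, so only the XOR-blinding algebra is illustrated. All identifiers are made up.

```python
import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def pad(s: str, n=32) -> bytes:
    return s.encode().ljust(n, b"\0")

# Third party C: (mailbox, phone) pairs, plus a fresh random r_j per pair.
ep = [("a@x.com", "111"), ("b@x.com", "222"), ("c@x.com", "333")]
r = {pair: os.urandom(32) for pair in ep}
# Idealized OPPRF functionalities that C programs for A and for B:
#   A's query e'_j -> p'_j XOR r_j        B's query p'_j -> e'_j XOR r_j
opprf_A = {e: xor(pad(p), r[(e, p)]) for e, p in ep}
opprf_B = {p: xor(pad(e), r[(e, p)]) for e, p in ep}
rand = lambda: os.urandom(32)   # unprogrammed queries return fresh randomness

# Party A holds mailboxes, party B holds phone numbers of (partly) the same users.
E = ["a@x.com", "b@x.com", "z@x.com"]
P = ["111", "999", "222"]
K_A = {e: xor(opprf_A.get(e, rand()), pad(e)) for e in E}   # output XOR own element
K_B = {p: xor(opprf_B.get(p, rand()), pad(p)) for p in P}

# For a shared user j: K_A = p'_j XOR r_j XOR e'_j = K_B, so the blinded keys
# coincide and the rows can be matched without revealing the raw identifiers.
matches = [(e, p) for e in E for p in P if K_A[e] == K_B[p]]
assert sorted(matches) == [("a@x.com", "111"), ("b@x.com", "222")]
```

Users present on only one side ("z@x.com", "999") produce pseudorandom keys that match nothing, which is why the exchanged sets K_A and K_B leak no non-intersecting identifiers.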
S3. Party A and party B jointly train a logistic regression model by gradient descent: by exchanging ciphertexts of encrypted intermediate results, they compute their complete encrypted gradients, add noise to the encrypted gradients and send them to the third party C; the third party C decrypts the ciphertexts, after which the noise is removed and the gradients are updated.
After aligning the samples using the alignment method of the present invention, the participants can jointly train various models, such as a linear regression model, a logistic regression model, or a neural network. To realize a complete longitudinal federated learning process, the invention constructs a secure model training procedure using homomorphic encryption: with the help of the third party C, parties A and B jointly train a logistic regression model by gradient descent.
In the training phase, after the alignment phase, party A and party B rearrange their own data sets according to the intersection result, where party B's data set additionally contains the label of each piece of data. Training comprises the following steps:
S31. The third party C generates a homomorphic encryption public/private key pair and sends the public key to party A and party B, and party A and party B each initialize their own weights.
S32. For each piece of data in the data set, party A computes the ciphertext of a local intermediate result from its weights and that piece of data, and party B computes the ciphertext of a local intermediate result from its weights, that piece of data and the label value; party A and party B send each other the ciphertexts of their intermediate results, and each homomorphically adds its own ciphertext to the received ciphertext to obtain a new ciphertext. Party A and party B multiply each piece of data in their own data sets by the new ciphertext and, after looping over every piece of data in the data set, accumulate the results to obtain their own complete encrypted gradients. The ciphertexts of the intermediate results are used to compute the complete gradient values of party A and party B.
S33. Party A and party B each add noise to their complete encrypted gradients and send them to the third party C; the third party C decrypts the two ciphertexts and sends the results to party A and party B respectively.
S34. Party A and party B remove the noise from the received values and update party A's gradient and party B's gradient respectively.
S35. Steps S32 to S34 are executed in a loop until the model converges or the maximum number of iterations is reached, whereupon party A and party B obtain the trained logistic regression model.
In this embodiment, after the samples of party A and party B are aligned, the two parties share n intersection elements, and each rearranges its own data set so that both data sets contain n samples. Party A's data set is X_A = (x_{A,1}, x_{A,2}, ..., x_{A,n}), where each sample x_{A,i} has d_A feature values; party A initializes its model weights as a column vector θ_A of length d_A. Party B's data set is X_B = ((x_{B,1}, y_1), (x_{B,2}, y_2), ..., (x_{B,n}, y_n)), where each sample x_{B,i} has d_B feature values with corresponding label y_i; party B initializes its model weights as a column vector θ_B of length d_B.
Participant A and participant B jointly negotiate a learning rate η and the logistic regression loss function:

L(θ) = (1/n) · Σ_{i=1}^{n} log(1 + e^{-y_i · θ^T x_i}),

where

θ^T x_i = θ_A^T x_{A,i} + θ_B^T x_{B,i}.
Taking the partial derivative of the loss function L(θ) with respect to θ_A gives the gradient of participant A, whose calculation formula is:

∂L/∂θ_A = -(1/n) · Σ_{i=1}^{n} y_i · x_{A,i} / (1 + e^{y_i · (θ_A^T x_{A,i} + θ_B^T x_{B,i})}),

where i indexes the i-th piece of data, and each participant contributes n pieces of data to training; θ is the weight column vector: there are several features and each feature has one weight value, the features of participants A and B differ, the weight of participant A is θ_A, the weight of participant B is θ_B, and θ = (θ_A; θ_B) denotes the longitudinal (vertical) concatenation of the two weight vectors.
Loss function L (theta) vs. theta B And solving the partial derivative to obtain the gradient of the participant B, wherein the calculation formula of the gradient of the participant B is as follows:
Figure BDA0003826598870000095
Participant A and participant B may update the weights using the following formulas:

θ_A ← θ_A − η · ∂L/∂θ_A,  θ_B ← θ_B − η · ∂L/∂θ_B.
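The two partial gradients above can be sketched directly in plaintext (the privacy protections come later); the data values, dimensions, and learning rate here are hypothetical, chosen only to illustrate the vertical split:

```python
import math

# Party A holds X_A, party B holds X_B and labels y in {-1, +1}; all values
# below are hypothetical, chosen only to illustrate the two partial gradients.
X_A = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]   # n = 3 samples, d_A = 2
X_B = [[0.5], [1.5], [-0.5]]                  # d_B = 1
y = [1, -1, 1]
theta_A, theta_B, eta = [0.1, -0.2], [0.3], 0.1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gradients(X_A, X_B, y, theta_A, theta_B):
    """Exact logistic-loss gradients dL/dtheta_A and dL/dtheta_B."""
    n = len(y)
    g_A, g_B = [0.0] * len(theta_A), [0.0] * len(theta_B)
    for xa, xb, yy in zip(X_A, X_B, y):
        z = dot(theta_A, xa) + dot(theta_B, xb)     # theta^T x_i
        coef = -yy / (1.0 + math.exp(yy * z))       # shared scalar factor
        for j in range(len(theta_A)):
            g_A[j] += coef * xa[j] / n
        for j in range(len(theta_B)):
            g_B[j] += coef * xb[j] / n
    return g_A, g_B

g_A, g_B = gradients(X_A, X_B, y, theta_A, theta_B)
theta_A = [w - eta * g for w, g in zip(theta_A, g_A)]   # gradient-descent step
theta_B = [w - eta * g for w, g in zip(theta_B, g_B)]
```

Note that the scalar factor `coef` depends on both parties' partial scores, which is exactly why the protocol below must exchange encrypted intermediate results.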
Specifically, the steps by which participant A and participant B jointly train a logistic regression model with the help of the third party C are as follows:
(1) The third party C generates a public/private key pair (pk, sk) through the key generation algorithm KeyGen of a homomorphic encryption scheme and then sends the public key pk to participant A and participant B.
For i = 1, …, n, steps (2) to (4) are performed.
(2) Since the sigmoid in the gradient cannot be evaluated under additively homomorphic encryption, the intermediate results use its Taylor approximation. Party A locally computes

u_{A,i} = (1/4) · θ_A^T x_{A,i},

then encrypts u_{A,i} using the public key pk to obtain the ciphertext [[u_{A,i}]] and sends it to party B.
(3) Party B locally computes

u_{B,i} = (1/4) · θ_B^T x_{B,i} − (1/2) · y_i,

then encrypts u_{B,i} using the public key pk to obtain the ciphertext [[u_{B,i}]] and sends it to party A.
(4) Party A locally computes the homomorphic sum

[[d_i]] = [[u_{A,i}]] ⊞ [[u_{B,i}]] = [[u_{A,i} + u_{B,i}]]

and calculates x_{A,i} ⊡ [[d_i]]; party B likewise computes [[d_i]] = [[u_{A,i}]] ⊞ [[u_{B,i}]] and calculates x_{B,i} ⊡ [[d_i]], where ⊞ denotes homomorphic addition of two ciphertexts and ⊡ denotes homomorphic multiplication of a ciphertext by a plaintext.
(5) A calculates the encrypted gradient

[[∂L/∂θ_A]] = (1/n) · Σ_{i=1}^{n} x_{A,i} ⊡ [[d_i]],

then selects a random number R_A to blind the local gradient, obtaining

[[∂L/∂θ_A + R_A]],

and sends the calculation result to C.
(6) Likewise, B calculates the encrypted gradient

[[∂L/∂θ_B]] = (1/n) · Σ_{i=1}^{n} x_{B,i} ⊡ [[d_i]],

then selects a random number R_B to blind the local gradient, obtaining

[[∂L/∂θ_B + R_B]],

and sends the calculation result to C.
(7) After receiving the blinded gradients sent by A and B, C decrypts them with its private key sk to obtain

∂L/∂θ_A + R_A and ∂L/∂θ_B + R_B,

then sends ∂L/∂θ_A + R_A to A and ∂L/∂θ_B + R_B to B.
(8) A removes the random number to obtain ∂L/∂θ_A and then updates the weights locally: θ_A ← θ_A − η · ∂L/∂θ_A.
(9) B removes the random number to obtain ∂L/∂θ_B and then updates the weights locally: θ_B ← θ_B − η · ∂L/∂θ_B.
Steps (2) to (9) are repeated until the model converges.
In this embodiment, take as an example a logistic regression model trained jointly by two medical institutions. As shown in fig. 2, the application example mainly involves a foreign medical institution A (whose sample identifier is a mailbox address), a domestic medical institution B (whose sample identifier is a telephone number), and a trusted third party C (whose sample identifiers include both the mailbox address and the telephone number). The steps by which the foreign medical institution A and the domestic medical institution B train a logistic regression model with the help of the trusted third party C are as follows:
(1) Run the OPPRF: the foreign medical institution A and the domestic medical institution B store the elements of the mailbox set and of the telephone-number set into bins by cuckoo hashing, respectively, and C stores the elements of both sets into bins by simple hashing. Then, for each bin, the foreign medical institution A and the domestic medical institution B each run one OPPRF with the trusted third party C, and each obtains an OPPRF output value corresponding to every element in its set.
(2) Exchange information to achieve sample alignment: the foreign medical institution A and the domestic medical institution B each XOR the elements of their sets with the corresponding OPPRF values and interact to obtain the intersection elements, thereby realizing sample alignment.
(3) Send the public key: C sends the public key to the foreign medical institution A and the domestic medical institution B.
(4) Send the ciphertexts of the intermediate results: for each piece of data, the foreign medical institution A and the domestic medical institution B compute u_{A,i} and u_{B,i}, respectively, and then send the corresponding ciphertexts to each other to assist the other party in calculating its gradient.
(5) Calculate the gradient: the two parties each calculate the ciphertext of their own gradient using the information obtained through interaction.
(6) Send the encrypted, noise-blinded gradients: the foreign medical institution A and the domestic medical institution B blind their own gradient ciphertexts with random numbers and send them to the trusted third party C for decryption.
(7) Return the noisy gradients: the trusted third party C decrypts the ciphertexts sent by the foreign medical institution A and the domestic medical institution B and returns the results to A and B, respectively.
(8) Update the gradient: after receiving the plaintexts, the foreign medical institution A and the domestic medical institution B remove the noise to obtain their gradients and then update them locally.
Steps (4) to (8) are executed in a loop until the model converges.
After the above steps are executed and the model has converged, the foreign medical institution A holds the trained model θ_A and the domestic medical institution B holds the trained model θ_B. When prediction is performed with this model, suppose a piece of prediction data of A is x'_A and a piece of prediction data of B is x'_B. A then computes

z'_A = θ_A^T x'_A

and substitutes it into the activation function

σ(z) = 1 / (1 + e^{-z})

to obtain the predicted value y'_A, where y'_A ∈ [0, 1]. Similarly, B can compute the predicted value y'_B corresponding to its prediction data x'_B.
In summary, the longitudinal federated learning method provided by the invention realizes sample alignment when the participants' sample identifiers differ, and achieves the goal of training a model jointly while protecting each participant's private information.
Example 2:
the present embodiment provides a computer device, which may be a server, a computer, or the like, comprising a processor, a memory, an input device, a display, and a network interface connected by a system bus. The processor provides computing and control capabilities; the memory includes a nonvolatile storage medium and an internal memory; the nonvolatile storage medium stores an operating system, a computer program, and a database; and the internal memory provides an environment for running the operating system and the computer program in the nonvolatile storage medium. When the processor executes the computer program stored in the memory, the longitudinal federated learning method under different sample identifiers of the foregoing embodiment 1 is implemented, as follows:
S1, participant A and participant B each run an oblivious programmable pseudorandom function OPPRF with the third party C to obtain the corresponding oblivious programmable pseudorandom function OPPRF outputs, and participant A and participant B each add the corresponding oblivious programmable pseudorandom function OPPRF outputs as noise to their own data sets;
S2, participant B sends its noise-added data set to participant A; participant A obtains the intersection of the two parties' noise-added data sets from participant B's noise-added data set and sorts its own data set according to the intersection of the data sets; participant A sends the intersection of the data sets to participant B, and participant B sorts its own data set according to the intersection of the data sets;
S3, participant A and participant B jointly train a logistic regression model by the gradient descent method, interactively exchanging the ciphertexts of the encrypted intermediate results to calculate their complete encrypted gradients, adding noise to the encrypted gradients, and sending them to the third party C; the third party C decrypts the received ciphertexts to obtain the plaintexts and sends them to participant A and participant B, respectively; participant A and participant B remove the noise from the received plaintexts and update the gradients, whereby participant A obtains a trained logistic regression model θ_A and participant B obtains a trained logistic regression model θ_B.
The step S1 includes:
S11, participant A, participant B and the third party C each hold a set formed of all the samples under an identifier in their own database, and participant A, participant B and the third party C negotiate three hash functions;
the third party C holds two identifiers, under which two sets are formed whose elements correspond one to one; the third party C adds noise to each element of the two sets to obtain new values, the same noise being added at the same positions of the two sets; the elements of one set and the noise-added values of the other set are combined row by row into point-value pairs, forming two point-value pair sets, and the point-value pairs in these sets serve as C's input when running the OPPRF;
S12, participant A and participant B store their own sets into bins through the cuckoo hash functions, each bin storing at most one element; participant C maps the elements of its own sets to bins using the simple hash functions, each bin storing several elements; if no element is stored in a bin of participant A or participant B, that participant stores an invalid element in the bin;
S13, for each bin, participant A and participant C run the oblivious programmable pseudorandom function OPPRF, wherein participant A is the receiver of the oblivious programmable pseudorandom function OPPRF and inputs the element in its bin, participant C is the sender of the oblivious programmable pseudorandom function OPPRF and inputs the elements in its bin together with the corresponding noise-added values, and participant A obtains the corresponding oblivious programmable pseudorandom function OPPRF output;
S14, for each bin, participant B and participant C run the oblivious programmable pseudorandom function OPPRF, with participant B as the receiver and participant C as the sender, and participant B obtains the corresponding oblivious programmable pseudorandom function OPPRF output.
Example 3:
the present embodiment provides a storage medium, which is a computer-readable storage medium storing a computer program; when the program is executed by a processor, the longitudinal federated learning method under different sample identifiers according to embodiment 1 is implemented, as follows:
S1, participant A and participant B each run an oblivious programmable pseudorandom function OPPRF with the third party C to obtain the corresponding oblivious programmable pseudorandom function OPPRF outputs, and participant A and participant B each add the corresponding oblivious programmable pseudorandom function OPPRF outputs as noise to their own data sets;
S2, participant B sends its noise-added data set to participant A; participant A obtains the intersection of the two parties' noise-added data sets from participant B's noise-added data set and sorts its own data set according to the intersection of the data sets; participant A sends the intersection of the data sets to participant B, and participant B sorts its own data set according to the intersection of the data sets;
S3, participant A and participant B jointly train a logistic regression model by the gradient descent method, interactively exchanging the ciphertexts of the encrypted intermediate results to calculate their complete encrypted gradients, adding noise to the encrypted gradients, and sending them to the third party C; the third party C decrypts the received ciphertexts to obtain the plaintexts and sends them to participant A and participant B, respectively; participant A and participant B remove the noise from the received plaintexts and update the gradients, whereby participant A obtains a trained logistic regression model θ_A and participant B obtains a trained logistic regression model θ_B.
The step S1 includes:
S11, participant A, participant B and the third party C each hold a set formed of all the samples under an identifier in their own database, and participant A, participant B and the third party C negotiate three hash functions;
the third party C holds two identifiers, under which two sets are formed whose elements correspond one to one; the third party C adds noise to each element of the two sets to obtain new values, the same noise being added at the same positions of the two sets; the elements of one set and the noise-added values of the other set are combined row by row into point-value pairs, forming two point-value pair sets, and the point-value pairs in these sets serve as C's input when running the OPPRF;
S12, participant A and participant B store their own sets into bins through the cuckoo hash functions, each bin storing at most one element; participant C maps the elements of its own sets to bins using the simple hash functions, each bin storing several elements; if no element is stored in a bin of participant A or participant B, that participant stores an invalid element in the bin;
S13, for each bin, participant A and participant C run the oblivious programmable pseudorandom function OPPRF, wherein participant A is the receiver of the oblivious programmable pseudorandom function OPPRF and inputs the element in its bin, participant C is the sender of the oblivious programmable pseudorandom function OPPRF and inputs the elements in its bin together with the corresponding noise-added values, and participant A obtains the corresponding oblivious programmable pseudorandom function OPPRF output;
S14, for each bin, participant B and participant C run the oblivious programmable pseudorandom function OPPRF, with participant B as the receiver and participant C as the sender, and participant B obtains the corresponding oblivious programmable pseudorandom function OPPRF output.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. A longitudinal federated learning method under different sample identifiers, involving a participant A, a participant B and a third party C, characterized by comprising the following steps:
S1, participant A and participant B each run an oblivious programmable pseudorandom function OPPRF with the third party C to obtain the corresponding oblivious programmable pseudorandom function OPPRF outputs, and participant A and participant B each add the corresponding oblivious programmable pseudorandom function OPPRF outputs as noise to their own data sets;
S2, participant B sends its noise-added data set to participant A; participant A obtains the intersection of the two parties' noise-added data sets from participant B's noise-added data set and sorts its own data set according to the intersection of the data sets; participant A sends the intersection of the data sets to participant B, and participant B sorts its own data set according to the intersection of the data sets;
S3, participant A and participant B jointly train a logistic regression model by the gradient descent method, interactively exchanging the ciphertexts of the encrypted intermediate results to calculate their complete encrypted gradients, adding noise to the encrypted gradients, and sending them to the third party C; the third party C decrypts the received ciphertexts to obtain the plaintexts and sends them to participant A and participant B, respectively; participant A and participant B remove the noise from the received plaintexts and update the gradients, whereby participant A obtains a trained logistic regression model θ_A and participant B obtains a trained logistic regression model θ_B.
2. The longitudinal federated learning method under different sample identifiers according to claim 1, characterized in that the step S1 comprises:
S11, participant A, participant B and the third party C each hold a set formed of all the samples under an identifier in their own database, and participant A, participant B and the third party C negotiate three hash functions;
the third party C holds two identifiers, under which two sets are formed whose elements correspond one to one; the third party C adds noise to each element of the two sets to obtain new values, the same noise being added at the same positions of the two sets; the elements of one set and the noise-added values of the other set are combined row by row into point-value pairs, forming two point-value pair sets, and the point-value pairs in these sets serve as C's input when running the OPPRF;
S12, participant A and participant B store their own sets into bins through the cuckoo hash functions, each bin storing at most one element; participant C maps the elements of its own sets to bins using the simple hash functions, each bin storing several elements; if no element is stored in a bin of participant A or participant B, that participant stores an invalid element in the bin;
S13, for each bin, participant A and participant C run the oblivious programmable pseudorandom function OPPRF, wherein participant A is the receiver of the oblivious programmable pseudorandom function OPPRF and inputs the element in its bin, participant C is the sender of the oblivious programmable pseudorandom function OPPRF and inputs the elements in its bin together with the corresponding noise-added values, and participant A obtains the corresponding oblivious programmable pseudorandom function OPPRF output;
S14, for each bin, participant B and participant C run the oblivious programmable pseudorandom function OPPRF, with participant B as the receiver and participant C as the sender, and participant B obtains the corresponding oblivious programmable pseudorandom function OPPRF output.
3. The longitudinal federated learning method under different sample identifiers according to claim 1, characterized in that the step S2 comprises:
S21, loop over each bin: if the bin stores a non-invalid element, participant B XORs the corresponding oblivious programmable pseudorandom function OPPRF output with the element in its own bin to form a set K_B and sends the set K_B to participant A; similarly, participant A XORs the oblivious programmable pseudorandom function OPPRF output with the element in its own bin to form a set K_A; by matching K_A and K_B, participant A obtains the intersection K_A' of the set K_A and the set K_B, puts the elements corresponding to the intersection K_A' into the intersection S_A, and sends K_A' to participant B;
S22, participant B puts the elements corresponding to the intersection K_A' and the set K_B into the intersection S_B.
4. The longitudinal federated learning method under different sample identifiers according to claim 1, characterized in that the step S3 comprises:
S31, the third party C generates a homomorphic-encryption public and private key pair and sends the public key to participant A and participant B, and participant A and participant B initialize their own weights respectively;
S32, for each piece of data in the data set, participant A calculates the ciphertext of its local intermediate result from its weight and that piece of data, and participant B calculates the ciphertext of its local intermediate result from its weight, that piece of data and the label value; participant A and participant B send the ciphertexts of the intermediate results to each other, and each homomorphically adds its own ciphertext to the received ciphertext of the intermediate result to obtain a new ciphertext; participant A and participant B multiply each piece of data in their own data sets by the new ciphertext and, after looping over every piece of data in the data set, accumulate the products to obtain their own complete encrypted gradients; the ciphertexts of the intermediate results are used for calculating the gradient value of participant A and the gradient value of participant B;
S33, participant A and participant B respectively add noise to their complete encrypted gradients and send them to the third party C, and the third party C decrypts the two ciphertexts and sends the results to participant A and participant B, respectively;
s34, removing noise from the received values by the party A and the party B, and updating the gradient of the party A and the gradient of the party B;
S35, steps S32 to S34 are executed in a loop until the logistic regression model converges or the maximum number of iterations is reached, and participant A and participant B obtain the trained logistic regression model.
5. The longitudinal federated learning method under different sample identifiers according to claim 4, characterized in that the calculation formula for calculating the gradient value of participant A is:

∂L/∂θ_A = -(1/n) · Σ_{i=1}^{n} y_i · x_{A,i} / (1 + e^{y_i · (θ_A^T x_{A,i} + θ_B^T x_{B,i})}),

wherein X_A is the participant A data set, x_{A,i} is the i-th sample of the participant A data set, X_B is the participant B data set, x_{B,i} is the i-th sample of the participant B data set, y_i is the label corresponding to x_{B,i}, θ is the weight column vector, θ_A is the weight of participant A, θ_A^T denotes the transpose of θ_A, and θ_B^T denotes the transpose of θ_B;

the calculation formula for calculating the gradient value of participant B is:

∂L/∂θ_B = -(1/n) · Σ_{i=1}^{n} y_i · x_{B,i} / (1 + e^{y_i · (θ_A^T x_{A,i} + θ_B^T x_{B,i})}),

wherein X_A is the participant A data set, x_{A,i} is the i-th sample of the participant A data set, X_B is the participant B data set, x_{B,i} is the i-th sample of the participant B data set, y_i is the label corresponding to x_{B,i}, θ is the weight column vector, θ_B is the weight of participant B, θ_A^T denotes the transpose of θ_A, and θ_B^T denotes the transpose of θ_B.
6. A computer device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the longitudinal federated learning method under different sample identifiers according to any one of claims 1 to 5.
7. A storage medium storing a program which, when executed by a processor, implements the longitudinal federated learning method under different sample identifiers according to any one of claims 1 to 5.
CN202211061861.1A 2022-08-31 2022-08-31 Longitudinal federated learning method, device and medium under condition of different sample identifiers Pending CN115630713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211061861.1A CN115630713A (en) 2022-08-31 2022-08-31 Longitudinal federated learning method, device and medium under condition of different sample identifiers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211061861.1A CN115630713A (en) 2022-08-31 2022-08-31 Longitudinal federated learning method, device and medium under condition of different sample identifiers

Publications (1)

Publication Number Publication Date
CN115630713A true CN115630713A (en) 2023-01-20

Family

ID=84903027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211061861.1A Pending CN115630713A (en) 2022-08-31 2022-08-31 Longitudinal federated learning method, device and medium under condition of different sample identifiers

Country Status (1)

Country Link
CN (1) CN115630713A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115913554A (en) * 2023-03-13 2023-04-04 深圳市洞见智慧科技有限公司 Efficient hidden trace federal learning method and system based on state secret and related equipment
CN117034000A (en) * 2023-03-22 2023-11-10 浙江明日数据智能有限公司 Modeling method and device for longitudinal federal learning, storage medium and electronic equipment
CN118114306A (en) * 2024-04-28 2024-05-31 暨南大学 Collusion attack resistant privacy contact discovery method entrusted to cloud server
CN118114306B (en) * 2024-04-28 2024-07-26 暨南大学 Collusion attack resistant privacy contact discovery method entrusted to cloud server



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination