CN115630713A - Longitudinal federated learning method, device and medium under condition of different sample identifiers - Google Patents


Info

Publication number
CN115630713A
CN115630713A (application CN202211061861.1A)
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211061861.1A
Other languages
Chinese (zh)
Inventor
赖俊祚
李钰
张蓉
李燕玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN202211061861.1A
Publication of CN115630713A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/602 Providing cryptographic facilities or services


Abstract

The invention relates to the technical field of federated learning, and in particular to a method, a device and a medium for longitudinal federated learning under different sample identifiers, the method comprising the following steps: in the alignment stage of federated learning, sample alignment is carried out on participants with different sample identifiers by means of an oblivious programmable pseudo-random function (OPPRF), Cuckoo hashing and simple hashing, with noise added to the sample identifiers during sample alignment; the participants then jointly train a model under Paillier homomorphic encryption, the encryption protecting the participants' private information during training. The method and the device can realize sample alignment when the participants' sample identifiers differ, and achieve joint model training while protecting the private information of each participant.

Description

Longitudinal federated learning method, device and medium under condition of different sample identifiers
Technical Field
The invention relates to the technical field of federated learning, and in particular to a longitudinal federated learning method, device and medium under different sample identifiers.
Background
Longitudinal (vertical) federated learning applies to datasets with large overlap in user samples but little overlap in user features. For example, a bank and an e-commerce company provide different services to users and therefore hold different aspects of their features, but the users they serve overlap heavily, so combining their data increases the feature dimension of the training dataset. Before jointly training a model, the samples must be aligned. For sample alignment, current longitudinal federated learning schemes mainly adopt Private Set Intersection (PSI) techniques: for example, the federated learning framework FATE developed by WeBank mainly performs sample alignment with a PSI based on RSA and a hash function, and Liu et al., in the document "Asymmetrical Vertical Federated Learning", use a PSI based on a Pohlig-Hellman structure to perform sample alignment in an asymmetric federated learning setting. However, all of these schemes assume by default that the participants' sample identifiers are the same, and do not consider the alignment problem when the sample identifiers differ. After the samples are aligned, the participants can jointly train a model, and current longitudinal federated learning schemes mainly adopt cryptographic methods such as homomorphic encryption, secret sharing and differential privacy to protect private information during training. Hardy et al., in the document "Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption", protect the private information exchanged during interaction with an additively homomorphic encryption scheme; when a sender transmits a gradient ciphertext, the gradient is first blinded by adding a random number before being sent to the receiver in order to further protect the gradient information, and the scheme requires the participation of a coordinator.
With the advent of the big data era, data has become a new production element, and how to mine valuable information from massive data has gradually become a research hotspot. The traditional data processing method is to aggregate all data for analysis and modeling, but aggregating the data may expose users' sensitive information: if the party holding the data uses it maliciously during analysis and modeling, for example by privately packaging and selling the collected data, users' private information can be leaked. As a result, data dares not, cannot, and will not flow between enterprises, which severely restricts the circulation of data and creates data silos; protecting data privacy while fully exploiting the value of data has therefore become a research hotspot. Federated learning emerged in this context: it aims to help participants complete training together without revealing their local private data, its defining characteristic being that private data never leaves the local site, and owing to this good privacy-protecting property it has been widely applied in many fields. According to the distribution of the data, federated learning can be divided into longitudinal (vertical) federated learning, horizontal federated learning and federated transfer learning.
In longitudinal federated learning, sample alignment is needed before the participants jointly train a model. Existing longitudinal federated learning methods either do not consider the sample alignment problem, or only consider alignment when the unique sample identifiers (IDs) in the participants' databases are the same. In real life, however, the information the same user provides when registering on different platforms may differ: for example, a user registers on platform A with a mailbox address and on platform B with a telephone number, so that the same user's unique identifiers differ between the two databases of A and B. If the two platforms need to train a model together and must first perform sample alignment, the traditional vertical federated learning methods are no longer applicable. It is therefore necessary to design a longitudinal federated learning method that can handle differing participant sample identifiers, which has important application value.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a longitudinal federal learning method, equipment and a medium under the condition of different sample identifiers.
The invention aims to provide a longitudinal federal learning method under the condition that sample identifiers are different.
It is a second object of the invention to provide a computer apparatus.
It is a third object of the present invention to provide a storage medium.
The first purpose of the invention can be achieved by adopting the following technical scheme:
a method for longitudinal federated learning with disparate sample identifiers, the method comprising:
S1. Party A and party B each run an oblivious programmable pseudo-random function (OPPRF) with a third party C to obtain the corresponding OPPRF outputs, and party A and party B each add their corresponding OPPRF outputs as noise to their own data sets;
S2. Party B sends its noise-added data set to party A; party A obtains the intersection of the noise-added data sets from party B's noise-added data set and sorts its own data set according to the intersection; party A sends the intersection to party B, and party B sorts its own data set according to the intersection;
S3. Party A and party B jointly train a logistic regression model by gradient descent: by exchanging ciphertexts of intermediate results, party A and party B compute their complete encrypted gradients, add noise to the encrypted gradients, and send them to the third party C; the third party C decrypts the received ciphertexts to obtain the plaintexts and sends them to party A and party B respectively; party A and party B remove the noise from the received plaintexts and update their gradients; party A obtains a trained logistic regression model θ_A and party B obtains a trained logistic regression model θ_B.
In a preferred embodiment, the step S1 includes:
S11. Party A, party B and the third party C each hold the set formed by all samples in their own database under their identifiers, and party A, party B and the third party C negotiate three hash functions;
The third party C holds two identifiers, under which two sets are formed whose elements correspond one to one. The third party C adds noise to each element of the two sets to obtain new values, where the same noise is added at the same positions of the two sets. The elements of one set and the noise-added values of the other set are combined row by row into point-value pairs, forming two sets of point-value pairs which serve as C's input when running the OPPRF;
S12. Party A and party B store their own sets into bins via a Cuckoo hash function, each bin storing at most one element; the third party C maps the elements of its sets into bins using a simple hash function, each bin storing multiple elements; if nothing is stored in a bin of party A or party B, that party stores an invalid element ⊥ in that bin;
S13. For each bin, party A and the third party C run the OPPRF, where party A is the receiver of the OPPRF and inputs the element in its bin, and the third party C is the sender of the OPPRF and inputs the elements in its bin together with their corresponding noise-added values; party A obtains the corresponding OPPRF output;
S14. For each bin, party B and the third party C run the OPPRF, with party B as the receiver and the third party C as the sender, and party B obtains the corresponding OPPRF output.
The second purpose of the invention can be achieved by adopting the following technical scheme:
A computer device comprises a processor and a memory storing a program executable by the processor; when the processor executes the program stored in the memory, the above longitudinal federated learning method under different sample identifiers is realized.
The third purpose of the invention can be achieved by adopting the following technical scheme:
A storage medium stores a program which, when executed by a processor, implements the above longitudinal federated learning method under different sample identifiers.
Compared with the prior art, the invention has the following advantages and beneficial effects:
According to the longitudinal federated learning method, device and medium under different sample identifiers, in the alignment stage of federated learning, participants with different sample identifiers perform sample alignment by means of an oblivious programmable pseudo-random function (OPPRF), Cuckoo hashing and simple hashing, with noise added to the sample identifiers during sample alignment; the participants then jointly train a model under Paillier homomorphic encryption, the encryption protecting the participants' private information during training. The invention can realize sample alignment when the participants' sample identifiers differ, and achieves joint model training while protecting the private information of each participant.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from them without creative effort.
FIG. 1 is a flow chart of a longitudinal federated learning method with disparate sample identifiers in an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a logistic regression model trained by two medical institutions in accordance with an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described in further detail below with reference to the accompanying drawings and examples. Obviously, the described examples are some, but not all, examples of the present invention, and the embodiments of the present invention are not limited thereto. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort shall fall within the protection scope of the present invention.
Example 1:
Addressing the fact that existing schemes cannot solve the alignment problem when sample identifiers differ, the invention provides a longitudinal federated learning method which realizes sample alignment when the participants' sample identifiers are different, and which uses homomorphic encryption to train a model after the samples are aligned.
The invention provides a longitudinal federated learning method under different sample identifiers, constructed mainly from four techniques: the OPPRF, simple hashing, Cuckoo hashing and Paillier homomorphic encryption. The four techniques are introduced as follows:
1. Oblivious programmable pseudo-random function OPPRF (Oblivious Programmable PRF)
The OPPRF is a combination of an oblivious pseudo-random function (OPRF) and a programmable pseudo-random function (PPRF), and involves a sender and a receiver: the sender A has input {(x_i, y_i), i ∈ [n]}, and the receiver B has input x. The invention adopts a table-based OPPRF, which mainly comprises the following 5 steps:
(1) A and B run the OPRF protocol: the input of A is a key k, the input of B is x, and after the protocol is run, B receives F(k, x).
(2) For {x_i, i ∈ [n]}, A calculates {F(k, x_1), F(k, x_2), ..., F(k, x_n)}. A samples a value v until all of {H(F(k, x_i) || v), i ∈ [n]} are distinct.
(3) For each i ∈ [n], A first calculates h_i = H(F(k, x_i) || v), and then sets the element at position h_i of a table T as
T[h_i] = F(k, x_i) ⊕ y_i
(4) A fills the positions of table T where no element is stored with random numbers, and then sends table T and v to B.
(5) B calculates h = H(F(k, x) || v), and then outputs
T[h] ⊕ F(k, x)
If x = x_i, B obtains the y_i corresponding to x_i; otherwise, B obtains a random number.
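The five steps above can be sketched in Python. This is a toy illustration, not the patent's implementation: HMAC-SHA256 stands in for the PRF F, and the OPRF step is simulated by handing the receiver the value F(k, x) directly (in the real protocol B learns only this single value, obliviously). All names are illustrative.

```python
import hashlib
import hmac
import os

def F(k: bytes, x: bytes) -> bytes:
    """Stand-in PRF; in the real protocol B obtains F(k, x) via an OPRF."""
    return hmac.new(k, x, hashlib.sha256).digest()

def H(data: bytes, table_size: int) -> int:
    """Hash to a table position."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % table_size

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(u ^ v for u, v in zip(a, b))

def sender_build_table(k, points, table_size):
    """Steps (2)-(4): program T so that T[H(F(k,x_i)||v)] = F(k,x_i) XOR y_i."""
    while True:
        v = os.urandom(16)
        idx = [H(F(k, x) + v, table_size) for x, _ in points]
        if len(set(idx)) == len(idx):            # resample v until positions are distinct
            break
    T = [os.urandom(32) for _ in range(table_size)]  # unused slots hold random numbers
    for (x, y), h in zip(points, idx):
        T[h] = xor(F(k, x), y)
    return T, v

def receiver_eval(T, v, fkx):
    """Step (5): output T[H(F(k,x)||v)] XOR F(k,x)."""
    return xor(T[H(fkx + v, len(T))], fkx)

k = os.urandom(16)
points = [(b"alice", b"A" * 32), (b"bob", b"B" * 32)]   # sender's programmed pairs
T, v = sender_build_table(k, points, table_size=8)
assert receiver_eval(T, v, F(k, b"alice")) == b"A" * 32  # programmed point: y returned
assert receiver_eval(T, v, F(k, b"carol")) != b"C" * 32  # unprogrammed: pseudorandom
```

On unprogrammed inputs the receiver lands on a random table slot, so its output carries no information about the programmed values, which is the property the alignment phase relies on.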
2. Simple hashing
Simple hashing contains b bins {B_1, B_2, ..., B_b} and k hash functions {h_1, h_2, ..., h_k}. Each bin may store multiple elements. To store an element x into the bins, first calculate the hash values of x under the k hash functions, {h_1(x), h_2(x), ..., h_k(x)}, and then put x into the k corresponding bins
B_{h_1(x)}, B_{h_2(x)}, ..., B_{h_k(x)}
3. Cuckoo hashing
Cuckoo hashing contains k hash functions {h_1, h_2, ..., h_k} and b bins {B_1, B_2, ..., B_b}. Each bin can store at most one element. Element x is stored in one of the bins
B_{h_1(x)}, B_{h_2(x)}, ..., B_{h_k(x)}
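The two hashing schemes can be contrasted in a short sketch (illustrative Python, not from the patent; SHA-256 stands in for the negotiated hash functions). The property exploited later is that whichever of its k candidate bins Cuckoo hashing places an element into, simple hashing has placed the same element into that bin as well, so the parties only ever need to compare bins with the same label.

```python
import hashlib
import random

random.seed(0)  # deterministic evictions for the demo

def h(i: int, x: str, b: int) -> int:
    """i-th hash function, mapping x into one of b bins."""
    return int.from_bytes(hashlib.sha256(f"{i}|{x}".encode()).digest(), "big") % b

def simple_hash(items, b, k=3):
    """Simple hashing: every item is placed in all k of its candidate bins."""
    bins = [[] for _ in range(b)]
    for x in items:
        for i in range(k):
            bins[h(i, x, b)].append(x)
    return bins

def cuckoo_hash(items, b, k=3, max_evictions=500):
    """Cuckoo hashing: at most one item per bin; a collision evicts the occupant."""
    bins = [None] * b
    for x in items:
        for _ in range(max_evictions):
            j = h(random.randrange(k), x, b)
            bins[j], x = x, bins[j]          # insert x, possibly evicting the occupant
            if x is None:
                break
        else:
            raise RuntimeError("insertion failed; grow b or pick new hash functions")
    return bins

ids = [f"user{n}" for n in range(10)]
cuckoo = cuckoo_hash(ids, b=30)
simple = simple_hash(ids, b=30)
# Whichever bin Cuckoo hashing chose for x, simple hashing put x there too,
# since simple hashing used all k candidate bins.
for j, x in enumerate(cuckoo):
    if x is not None:
        assert x in simple[j]
```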
4. Paillier homomorphic encryption
The invention utilizes the Paillier homomorphic encryption scheme to ensure the security of the private information exchanged during the training phase. The Paillier scheme generally comprises the KeyGen, Enc and Dec algorithms. The KeyGen algorithm takes a security parameter as input and outputs a public key pk and a private key sk. The Enc algorithm takes a plaintext m and pk as input and outputs the ciphertext of m, denoted ⟦m⟧. The Dec algorithm takes a ciphertext ⟦m⟧ and sk as input and outputs the decrypted plaintext m. The Paillier scheme supports the following two operations:
Addition of two ciphertexts:
⟦m_1⟧ · ⟦m_2⟧ = ⟦m_1 + m_2⟧
Multiplication of a ciphertext by a constant:
⟦m⟧^c = ⟦c · m⟧
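A minimal, insecure sketch of the Paillier scheme follows (toy 6- and 7-digit primes; a real deployment needs primes of at least 1024 bits and a vetted library). It shows KeyGen/Enc/Dec and the two homomorphic operations stated above, with g = n + 1 as the standard generator choice.

```python
import math
import random

def keygen(p=999983, q=1000003):
    """Toy Paillier KeyGen (demo-sized primes only)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    # L(u) = (u - 1) // n; here L(g^lam mod n^2) = lam, and mu = lam^{-1} mod n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu, n)

def enc(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)   # g^m * r^n mod n^2

def dec(sk, c):
    lam, mu, n = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n          # L(c^lam mod n^2) * mu mod n

def add(pk, c1, c2):
    """Homomorphic addition: Enc(m1) * Enc(m2) = Enc(m1 + m2)."""
    n, _ = pk
    return c1 * c2 % (n * n)

def cmul(pk, c, k):
    """Constant multiplication: Enc(m)^k = Enc(k * m)."""
    n, _ = pk
    return pow(c, k, n * n)

pk, sk = keygen()
assert dec(sk, add(pk, enc(pk, 15), enc(pk, 27))) == 42   # ciphertext addition
assert dec(sk, cmul(pk, enc(pk, 6), 7)) == 42             # constant multiplication
```

The two asserts correspond exactly to the two supported operations: multiplying ciphertexts adds plaintexts, and exponentiating a ciphertext scales its plaintext.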
The federated learning here is longitudinal federated learning and comprises a party A, a party B and a trusted third party C, where parties A and B perform sample alignment and model training with the help of the third party C, each finally obtaining a trained model. The longitudinal federated learning method includes a sample alignment phase and a training phase.
Party A and party B each hold their own database; each sample in a database comprises an identifier and feature values, and each party holds a portion of the features. The sample identifiers of party A and party B are different: assume the sample identifier of party A is ID1 and that of party B is ID2. The third party C owns a huge amount of data, and each sample in its database contains both identifiers (ID1, ID2).
As shown in FIG. 1, the longitudinal federated learning method under different sample identifiers according to the present invention involves a party A, a party B and a third party C, and is characterized in that the method comprises the following steps:
s1, a participant A and a participant B respectively obtain corresponding OPPRF outputs with an inadvertent and programmable pseudo-random function OPPRF operated by a third party C, and the participant A and the participant B respectively add the corresponding OPPRF outputs as noise to a data set of the participant.
S11. Party A, party B and the third party C each hold the set formed by all samples in their own database under their sample identifiers, and party A, party B and the third party C negotiate three hash functions.
Party A and party B need to train a model together and must first perform sample alignment; since the sample IDs of party A and party B are different, a trusted third party C possessing massive data is introduced to assist party A and party B in performing sample alignment.
The third party C has two identifiers, under which two sets are formed whose elements correspond one to one. The third party C adds noise to each element of the two sets, with the same noise added at the same positions of the two sets, obtaining new values. This step protects the privacy of the third party C's identifiers in the process described below.
S12. Party A and party B store their own sets into bins via a Cuckoo hash function, each bin storing at most one element; the third party C maps the elements of its two sets into bins using a simple hash function, each bin storing multiple elements. After mapping into bins, if a certain bin of party A or party B stores nothing, that party stores an invalid element ⊥ in that bin. Mapping into bins allows the subsequent steps to proceed bin by bin: the parties only need to operate between bins with the same label, not between bins with different labels, which reduces the number of operations between the parties' bins.
S13. For each bin, party A and the third party C run an oblivious programmable pseudo-random function (OPPRF), which involves two parties, a sender and a receiver: the sender inputs point-value pairs (x_i, y_i) and the receiver inputs x; if x equals some x_i, the receiver receives the corresponding y_i. For each bin, party A acts as the receiver of the OPPRF and inputs the element in its bin, the third party C acts as the sender of the OPPRF and inputs the elements stored in that bin together with their corresponding noise-added values, and party A obtains the corresponding OPPRF output.
S14. In a similar manner, for each bin, party B and the third party C run an OPPRF, with party B as the receiver and the third party C as the sender, and party B obtains the corresponding OPPRF output.
S2. Party B sends its noise-added data set to party A; party A obtains the intersection of the noise-added data sets from party B's noise-added data set and sorts its own data set according to the intersection; party A sends the intersection to party B, and party B sorts its own data set according to the intersection.
S21. For each bin in turn: if the bin stores a non-invalid element, party B XORs the corresponding OPPRF output with the element in its own bin, forming the set K_B, and sends K_B to party A; party A XORs the corresponding OPPRF outputs with the elements in its own bins, forming the set K_A. By matching K_A and K_B, party A obtains the intersection K_A' of the two sets, puts the element e_i corresponding to each member of K_A' into the intersection S_A, and sends K_A' to party B.
S22. Party B, according to the intersection of K_A' and K_B, puts the corresponding elements p_i into the intersection S_B.
After the alignment phase, if the same user is registered with both party A and party B, the two identifiers of that user will lie in the same row of the intersection S_A and the intersection S_B.
Specifically, in the present embodiment, assume the sample ID in party A's database is a mailbox, giving the set E = {e_1, e_2, ..., e_m}; the sample ID in party B's database is a telephone number, giving the set P = {p_1, p_2, ..., p_m}; and the sample identifiers in the trusted third party C's database contain both a mailbox and a telephone number, giving the set EP = {(e'_1, p'_1), (e'_2, p'_2), ..., (e'_n, p'_n)}. Party A, party B and third party C have negotiated 3 hash functions {h_1, h_2, h_3} and the number of bins b. The specific steps of sample alignment are as follows:
For each pair of elements (e'_j, p'_j) in set EP, j ∈ [n], third party C selects random numbers {r_j, j ∈ [n]}, one per pair, and then calculates
ẽ_j = e'_j ⊕ r_j and p̃_j = p'_j ⊕ r_j
Party A maps each element in set E into b bins {B_a[1], B_a[2], ..., B_a[b]} using Cuckoo hashing, where B_a[i] denotes the element in the i-th bin of A. Party B maps each element in set P into b bins {B_b[1], B_b[2], ..., B_b[b]} using Cuckoo hashing, where B_b[i] denotes the element in the i-th bin of B. After the Cuckoo hashing of A and B is finished, any empty bin of A or B is filled with an invalid element ⊥. C maintains two sets of bins, {B_c^e[1], ..., B_c^e[b]} and {B_c^p[1], ..., B_c^p[b]}. Using simple hashing, C maps {e'_1, e'_2, ..., e'_n} into the first bin set {B_c^e[i], i ∈ [b]} and likewise maps {p'_1, p'_2, ..., p'_n} into the second bin set {B_c^p[i], i ∈ [b]}, where B_c^e[i] (respectively B_c^p[i]) denotes the set of all elements in the i-th bin.
For each bin, third party C and party A run the OPPRF protocol once. At the i-th bin, party A acts as the receiver and inputs the element B_a[i] in its i-th bin, while third party C acts as the sender and inputs the point-value pairs {(e'_j, p̃_j)} for all e'_j in its i-th mailbox bin B_c^e[i]. After the protocol is run, A receives an output y'_i, and if what is stored in the i-th bin of party A is not the invalid element ⊥, party A calculates
K_A[i] = y'_i ⊕ B_a[i]
Similarly, for each bin, third party C and party B run the OPPRF protocol once. At the i-th bin, B inputs B_b[i] and C inputs the point-value pairs {(p'_j, ẽ_j)} for all p'_j in its i-th telephone bin B_c^p[i]. B receives an output y''_i, and if what is stored in the i-th bin of B is not the invalid element ⊥, party B calculates
K_B[i] = y''_i ⊕ B_b[i]
Party B then sends K_B to A.
Party A collects its values into K_A = {K_A[i]} and initializes the sets S_A and K'_A as empty sets. For each {K_A[j] ∈ K_A, j ∈ [m]}, A checks whether K_A[j] ∈ K_B holds; if it holds, party A puts B_a[j] into the intersection S_A and puts K_A[j] into K'_A. Finally, A sends K'_A to B.
For each {K_B[k] ∈ K_B, k ∈ [m]}, party B checks whether K_B[k] ∈ K'_A holds; if it holds, with K_B[k] = K'_A[j], then B puts B_b[k] into the j-th row of the set S_B. Finally party B obtains the intersection S_B.
After the above protocol is run, if a mailbox e_i and a telephone number p_i belong to the same user, then e_i and p_i will appear in the same row of the sets S_A and S_B respectively, thereby completing the sample alignment.
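The matching logic of the alignment phase can be checked with a small sketch. This is an idealization, not the protocol itself: the OPPRF is replaced by a plain lookup table (programmed queries return the stored value, others return randomness) and the hashing-to-bins is omitted, so only the XOR-blinding algebra is illustrated. All identifiers are made up.

```python
import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def pad(s: str, n=32) -> bytes:
    return s.encode().ljust(n, b"\0")

# Third party C: (mailbox, phone) pairs, plus a fresh random r_j per pair.
ep = [("a@x.com", "111"), ("b@x.com", "222"), ("c@x.com", "333")]
r = {pair: os.urandom(32) for pair in ep}
# Idealized OPPRF functionalities that C programs for A and for B:
#   A's query e'_j -> p'_j XOR r_j        B's query p'_j -> e'_j XOR r_j
opprf_A = {e: xor(pad(p), r[(e, p)]) for e, p in ep}
opprf_B = {p: xor(pad(e), r[(e, p)]) for e, p in ep}
rand = lambda: os.urandom(32)   # unprogrammed queries return fresh randomness

# Party A holds mailboxes, party B holds phone numbers of (partly) the same users.
E = ["a@x.com", "b@x.com", "z@x.com"]
P = ["111", "999", "222"]
K_A = {e: xor(opprf_A.get(e, rand()), pad(e)) for e in E}   # output XOR own element
K_B = {p: xor(opprf_B.get(p, rand()), pad(p)) for p in P}

# For a shared user j: K_A = p'_j XOR r_j XOR e'_j = K_B, so the blinded keys
# coincide and the rows can be matched without revealing the raw identifiers.
matches = [(e, p) for e in E for p in P if K_A[e] == K_B[p]]
assert sorted(matches) == [("a@x.com", "111"), ("b@x.com", "222")]
```

Users present on only one side ("z@x.com", "999") produce pseudorandom keys that match nothing, which is why the exchanged sets K_A and K_B leak no non-intersecting identifiers.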
S3. Party A and party B jointly train a logistic regression model by gradient descent: by exchanging ciphertexts of encrypted intermediate results, they compute their complete encrypted gradients, add noise to the encrypted gradients and send them to the third party C; the third party C decrypts the ciphertexts, after which the noise is removed and the gradients are updated.
After aligning the samples using the alignment method of the present invention, the participants can jointly train various models, such as a linear regression model, a logistic regression model, or a neural network. To realize a complete longitudinal federated learning process, the invention constructs a secure model training procedure using homomorphic encryption: with the help of the third party C, parties A and B jointly train a logistic regression model by gradient descent.
In the training phase, after the alignment phase, party A and party B rearrange their own data sets according to the intersection result, where party B's data set additionally contains the label of each piece of data. Training comprises the following steps:
S31. The third party C generates a homomorphic encryption public/private key pair and sends the public key to party A and party B, and party A and party B each initialize their own weights.
S32. For each piece of data in the data set, party A computes the ciphertext of a local intermediate result from its weights and that piece of data, and party B computes the ciphertext of a local intermediate result from its weights, that piece of data and the label value; party A and party B send each other the ciphertexts of their intermediate results, and each homomorphically adds its own ciphertext to the received ciphertext to obtain a new ciphertext. Party A and party B multiply each piece of data in their own data sets by the new ciphertext and, after looping over every piece of data in the data set, accumulate the results to obtain their own complete encrypted gradients. The ciphertexts of the intermediate results are used to compute the complete gradient values of party A and party B.
S33. Party A and party B each add noise to their complete encrypted gradients and send them to the third party C; the third party C decrypts the two ciphertexts and sends the results to party A and party B respectively.
S34. Party A and party B remove the noise from the received values and update party A's gradient and party B's gradient respectively.
S35. Steps S32 to S34 are executed in a loop until the model converges or the maximum number of iterations is reached, whereupon party A and party B obtain the trained logistic regression model.
In this embodiment, after the samples of party A and party B are aligned, the two parties share n intersection elements, and each rearranges its own data set so that both data sets contain n samples. Party A's data set is X_A = (x_{A,1}, x_{A,2}, ..., x_{A,n}), where each sample x_{A,i} has d_A feature values; party A initializes its model weights as a column vector θ_A of length d_A. Party B's data set is X_B = ((x_{B,1}, y_1), (x_{B,2}, y_2), ..., (x_{B,n}, y_n)), where each sample x_{B,i} has d_B feature values with corresponding label y_i; party B initializes its model weights as a column vector θ_B of length d_B.
Participant A and participant B jointly negotiate a learning rate η and the logistic regression loss function:

L(θ) = (1/n) · Σ_{i=1}^{n} log(1 + e^{-y_i · θ^T x_i}),

where

θ^T x_i = θ_A^T x_{A,i} + θ_B^T x_{B,i}.
Taking the partial derivative of the loss function L(θ) with respect to θ_A gives the gradient of participant A, whose calculation formula is:

∂L/∂θ_A = -(1/n) · Σ_{i=1}^{n} y_i · x_{A,i} / (1 + e^{y_i · (θ_A^T x_{A,i} + θ_B^T x_{B,i})}),

where i indexes the i-th piece of data, and each participant contributes n pieces of data to training; θ is the weight column vector: there are several features and each feature has one weight value, the features of participants A and B differ, the weight of participant A is θ_A, the weight of participant B is θ_B, and θ = (θ_A; θ_B) denotes the longitudinal (vertical) concatenation of the two weight vectors.
Loss function L (theta) vs. theta B And solving the partial derivative to obtain the gradient of the participant B, wherein the calculation formula of the gradient of the participant B is as follows:
Figure BDA0003826598870000095
Participant A and participant B may update the weights using the following formulas:

θ_A ← θ_A − η · ∂L/∂θ_A,  θ_B ← θ_B − η · ∂L/∂θ_B.
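The two partial gradients above can be sketched directly in plaintext (the privacy protections come later); the data values, dimensions, and learning rate here are hypothetical, chosen only to illustrate the vertical split:

```python
import math

# Party A holds X_A, party B holds X_B and labels y in {-1, +1}; all values
# below are hypothetical, chosen only to illustrate the two partial gradients.
X_A = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]   # n = 3 samples, d_A = 2
X_B = [[0.5], [1.5], [-0.5]]                  # d_B = 1
y = [1, -1, 1]
theta_A, theta_B, eta = [0.1, -0.2], [0.3], 0.1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gradients(X_A, X_B, y, theta_A, theta_B):
    """Exact logistic-loss gradients dL/dtheta_A and dL/dtheta_B."""
    n = len(y)
    g_A, g_B = [0.0] * len(theta_A), [0.0] * len(theta_B)
    for xa, xb, yy in zip(X_A, X_B, y):
        z = dot(theta_A, xa) + dot(theta_B, xb)     # theta^T x_i
        coef = -yy / (1.0 + math.exp(yy * z))       # shared scalar factor
        for j in range(len(theta_A)):
            g_A[j] += coef * xa[j] / n
        for j in range(len(theta_B)):
            g_B[j] += coef * xb[j] / n
    return g_A, g_B

g_A, g_B = gradients(X_A, X_B, y, theta_A, theta_B)
theta_A = [w - eta * g for w, g in zip(theta_A, g_A)]   # gradient-descent step
theta_B = [w - eta * g for w, g in zip(theta_B, g_B)]
```

Note that the scalar factor `coef` depends on both parties' partial scores, which is exactly why the protocol below must exchange encrypted intermediate results.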
Specifically, the steps by which participant A and participant B jointly train a logistic regression model with the help of the third party C are as follows:
(1) The third party C generates a public/private key pair (pk, sk) through the key generation algorithm KeyGen of a homomorphic encryption scheme and then sends the public key pk to participant A and participant B.
For i = 1, …, n, steps (2) to (4) are performed.
(2) Since the sigmoid in the gradient cannot be evaluated under additively homomorphic encryption, the intermediate results use its Taylor approximation. Party A locally computes

u_{A,i} = (1/4) · θ_A^T x_{A,i},

then encrypts u_{A,i} using the public key pk to obtain the ciphertext [[u_{A,i}]] and sends it to party B.
(3) Party B locally computes

u_{B,i} = (1/4) · θ_B^T x_{B,i} − (1/2) · y_i,

then encrypts u_{B,i} using the public key pk to obtain the ciphertext [[u_{B,i}]] and sends it to party A.
(4) Party A locally computes the homomorphic sum

[[d_i]] = [[u_{A,i}]] ⊞ [[u_{B,i}]] = [[u_{A,i} + u_{B,i}]]

and calculates x_{A,i} ⊡ [[d_i]]; party B likewise computes [[d_i]] = [[u_{A,i}]] ⊞ [[u_{B,i}]] and calculates x_{B,i} ⊡ [[d_i]], where ⊞ denotes homomorphic addition of two ciphertexts and ⊡ denotes homomorphic multiplication of a ciphertext by a plaintext.
(5) A calculates the encrypted gradient

[[∂L/∂θ_A]] = (1/n) · Σ_{i=1}^{n} x_{A,i} ⊡ [[d_i]],

then selects a random number R_A to blind the local gradient, obtaining

[[∂L/∂θ_A + R_A]],

and sends the calculation result to C.
(6) Likewise, B calculates the encrypted gradient

[[∂L/∂θ_B]] = (1/n) · Σ_{i=1}^{n} x_{B,i} ⊡ [[d_i]],

then selects a random number R_B to blind the local gradient, obtaining

[[∂L/∂θ_B + R_B]],

and sends the calculation result to C.
(7) After receiving the blinded gradients sent by A and B, C decrypts them with its private key sk to obtain

∂L/∂θ_A + R_A and ∂L/∂θ_B + R_B,

then sends ∂L/∂θ_A + R_A to A and ∂L/∂θ_B + R_B to B.
(8) A removes the random number to obtain ∂L/∂θ_A and then updates the weights locally: θ_A ← θ_A − η · ∂L/∂θ_A.
(9) B removes the random number to obtain ∂L/∂θ_B and then updates the weights locally: θ_B ← θ_B − η · ∂L/∂θ_B.
Steps (2) to (9) are repeated until the model converges.
In this embodiment, take as an example a logistic regression model trained jointly by two medical institutions. As shown in fig. 2, the application example mainly involves a foreign medical institution A (whose sample identifier is a mailbox address), a domestic medical institution B (whose sample identifier is a telephone number), and a trusted third party C (whose sample identifiers include both the mailbox address and the telephone number). The steps by which the foreign medical institution A and the domestic medical institution B train a logistic regression model with the help of the trusted third party C are as follows:
(1) Run the OPPRF: the foreign medical institution A and the domestic medical institution B store the elements of the mailbox set and of the telephone-number set into bins by cuckoo hashing, respectively, and C stores the elements of both sets into bins by simple hashing. Then, for each bin, the foreign medical institution A and the domestic medical institution B each run one OPPRF with the trusted third party C, and each obtains an OPPRF output value corresponding to every element in its set.
(2) Exchange information to achieve sample alignment: the foreign medical institution A and the domestic medical institution B each XOR the elements of their sets with the corresponding OPPRF values and interact to obtain the intersection elements, thereby realizing sample alignment.
(3) Send the public key: C sends the public key to the foreign medical institution A and the domestic medical institution B.
(4) Send the ciphertexts of the intermediate results: for each piece of data, the foreign medical institution A and the domestic medical institution B compute u_{A,i} and u_{B,i}, respectively, and then send the corresponding ciphertexts to each other to assist the other party in calculating its gradient.
(5) Calculate the gradient: the two parties each calculate the ciphertext of their own gradient using the information obtained through interaction.
(6) Send the encrypted, noise-blinded gradients: the foreign medical institution A and the domestic medical institution B blind their own gradient ciphertexts with random numbers and send them to the trusted third party C for decryption.
(7) Return the noisy gradients: the trusted third party C decrypts the ciphertexts sent by the foreign medical institution A and the domestic medical institution B and returns the results to A and B, respectively.
(8) Update the gradient: after receiving the plaintexts, the foreign medical institution A and the domestic medical institution B remove the noise to obtain their gradients and then update them locally.
Steps (4) to (8) are executed in a loop until the model converges.
After the above steps are executed and the model has converged, the foreign medical institution A holds the trained model θ_A and the domestic medical institution B holds the trained model θ_B. When prediction is performed with this model, suppose a piece of prediction data of A is x'_A and a piece of prediction data of B is x'_B. A then computes

z'_A = θ_A^T x'_A

and substitutes it into the activation function

σ(z) = 1 / (1 + e^{-z})

to obtain the predicted value y'_A, where y'_A ∈ [0, 1]. Similarly, B can compute the predicted value y'_B corresponding to its prediction data x'_B.
In summary, the longitudinal federated learning method provided by the invention realizes sample alignment when the participants' sample identifiers differ, and achieves the goal of training a model jointly while protecting each participant's private information.
Example 2:
the present embodiment provides a computer device, which may be a server, a computer, or the like, comprising a processor, a memory, an input device, a display, and a network interface connected by a system bus. The processor provides computing and control capabilities; the memory includes a nonvolatile storage medium and an internal memory; the nonvolatile storage medium stores an operating system, a computer program, and a database; and the internal memory provides an environment for running the operating system and the computer program in the nonvolatile storage medium. When the processor executes the computer program stored in the memory, the longitudinal federated learning method under different sample identifiers of the foregoing embodiment 1 is implemented, as follows:
S1, participant A and participant B each run an oblivious programmable pseudorandom function OPPRF with the third party C to obtain the corresponding oblivious programmable pseudorandom function OPPRF outputs, and participant A and participant B each add the corresponding oblivious programmable pseudorandom function OPPRF outputs as noise to their own data sets;
S2, participant B sends its noise-added data set to participant A; participant A obtains the intersection of the two parties' noise-added data sets from participant B's noise-added data set and sorts its own data set according to the intersection of the data sets; participant A sends the intersection of the data sets to participant B, and participant B sorts its own data set according to the intersection of the data sets;
S3, participant A and participant B jointly train a logistic regression model by the gradient descent method, interactively exchanging the ciphertexts of the encrypted intermediate results to calculate their complete encrypted gradients, adding noise to the encrypted gradients, and sending them to the third party C; the third party C decrypts the received ciphertexts to obtain the plaintexts and sends them to participant A and participant B, respectively; participant A and participant B remove the noise from the received plaintexts and update the gradients, whereby participant A obtains a trained logistic regression model θ_A and participant B obtains a trained logistic regression model θ_B.
The step S1 includes:
S11, participant A, participant B and the third party C each hold a set formed of all the samples under an identifier in their own database, and participant A, participant B and the third party C negotiate three hash functions;
the third party C holds two identifiers, under which two sets are formed whose elements correspond one to one; the third party C adds noise to each element of the two sets to obtain new values, the same noise being added at the same positions of the two sets; the elements of one set and the noise-added values of the other set are combined row by row into point-value pairs, forming two point-value pair sets, and the point-value pairs in these sets serve as C's input when running the OPPRF;
S12, participant A and participant B store their own sets into bins through the cuckoo hash functions, each bin storing at most one element; participant C maps the elements of its own sets to bins using the simple hash functions, each bin storing several elements; if no element is stored in a bin of participant A or participant B, that participant stores an invalid element in the bin;
S13, for each bin, participant A and participant C run the oblivious programmable pseudorandom function OPPRF, wherein participant A is the receiver of the oblivious programmable pseudorandom function OPPRF and inputs the element in its bin, participant C is the sender of the oblivious programmable pseudorandom function OPPRF and inputs the elements in its bin together with the corresponding noise-added values, and participant A obtains the corresponding oblivious programmable pseudorandom function OPPRF output;
S14, for each bin, participant B and participant C run the oblivious programmable pseudorandom function OPPRF, with participant B as the receiver and participant C as the sender, and participant B obtains the corresponding oblivious programmable pseudorandom function OPPRF output.
Example 3:
the present embodiment provides a storage medium, which is a computer-readable storage medium storing a computer program; when the program is executed by a processor, the longitudinal federated learning method under different sample identifiers according to embodiment 1 is implemented, as follows:
S1, participant A and participant B each run an oblivious programmable pseudorandom function OPPRF with the third party C to obtain the corresponding oblivious programmable pseudorandom function OPPRF outputs, and participant A and participant B each add the corresponding oblivious programmable pseudorandom function OPPRF outputs as noise to their own data sets;
S2, participant B sends its noise-added data set to participant A; participant A obtains the intersection of the two parties' noise-added data sets from participant B's noise-added data set and sorts its own data set according to the intersection of the data sets; participant A sends the intersection of the data sets to participant B, and participant B sorts its own data set according to the intersection of the data sets;
S3, participant A and participant B jointly train a logistic regression model by the gradient descent method, interactively exchanging the ciphertexts of the encrypted intermediate results to calculate their complete encrypted gradients, adding noise to the encrypted gradients, and sending them to the third party C; the third party C decrypts the received ciphertexts to obtain the plaintexts and sends them to participant A and participant B, respectively; participant A and participant B remove the noise from the received plaintexts and update the gradients, whereby participant A obtains a trained logistic regression model θ_A and participant B obtains a trained logistic regression model θ_B.
The step S1 includes:
S11, participant A, participant B and the third party C each hold a set formed of all the samples under an identifier in their own database, and participant A, participant B and the third party C negotiate three hash functions;
the third party C holds two identifiers, under which two sets are formed whose elements correspond one to one; the third party C adds noise to each element of the two sets to obtain new values, the same noise being added at the same positions of the two sets; the elements of one set and the noise-added values of the other set are combined row by row into point-value pairs, forming two point-value pair sets, and the point-value pairs in these sets serve as C's input when running the OPPRF;
S12, participant A and participant B store their own sets into bins through the cuckoo hash functions, each bin storing at most one element; participant C maps the elements of its own sets to bins using the simple hash functions, each bin storing several elements; if no element is stored in a bin of participant A or participant B, that participant stores an invalid element in the bin;
S13, for each bin, participant A and participant C run the oblivious programmable pseudorandom function OPPRF, wherein participant A is the receiver of the oblivious programmable pseudorandom function OPPRF and inputs the element in its bin, participant C is the sender of the oblivious programmable pseudorandom function OPPRF and inputs the elements in its bin together with the corresponding noise-added values, and participant A obtains the corresponding oblivious programmable pseudorandom function OPPRF output;
S14, for each bin, participant B and participant C run the oblivious programmable pseudorandom function OPPRF, with participant B as the receiver and participant C as the sender, and participant B obtains the corresponding oblivious programmable pseudorandom function OPPRF output.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. A longitudinal federated learning method under different sample identifiers, involving a participant A, a participant B and a third party C, characterized by comprising the following steps:
S1, participant A and participant B each run an oblivious programmable pseudorandom function OPPRF with the third party C to obtain the corresponding oblivious programmable pseudorandom function OPPRF outputs, and participant A and participant B each add the corresponding oblivious programmable pseudorandom function OPPRF outputs as noise to their own data sets;
S2, participant B sends its noise-added data set to participant A; participant A obtains the intersection of the two parties' noise-added data sets from participant B's noise-added data set and sorts its own data set according to the intersection of the data sets; participant A sends the intersection of the data sets to participant B, and participant B sorts its own data set according to the intersection of the data sets;
S3, participant A and participant B jointly train a logistic regression model by the gradient descent method, interactively exchanging the ciphertexts of the encrypted intermediate results to calculate their complete encrypted gradients, adding noise to the encrypted gradients, and sending them to the third party C; the third party C decrypts the received ciphertexts to obtain the plaintexts and sends them to participant A and participant B, respectively; participant A and participant B remove the noise from the received plaintexts and update the gradients, whereby participant A obtains a trained logistic regression model θ_A and participant B obtains a trained logistic regression model θ_B.
2. The longitudinal federated learning method under different sample identifiers according to claim 1, characterized in that the step S1 comprises:
S11, participant A, participant B and the third party C each hold a set formed of all the samples under an identifier in their own database, and participant A, participant B and the third party C negotiate three hash functions;
the third party C holds two identifiers, under which two sets are formed whose elements correspond one to one; the third party C adds noise to each element of the two sets to obtain new values, the same noise being added at the same positions of the two sets; the elements of one set and the noise-added values of the other set are combined row by row into point-value pairs, forming two point-value pair sets, and the point-value pairs in these sets serve as C's input when running the OPPRF;
S12, participant A and participant B store their own sets into bins through the cuckoo hash functions, each bin storing at most one element; participant C maps the elements of its own sets to bins using the simple hash functions, each bin storing several elements; if no element is stored in a bin of participant A or participant B, that participant stores an invalid element in the bin;
S13, for each bin, participant A and participant C run the oblivious programmable pseudorandom function OPPRF, wherein participant A is the receiver of the oblivious programmable pseudorandom function OPPRF and inputs the element in its bin, participant C is the sender of the oblivious programmable pseudorandom function OPPRF and inputs the elements in its bin together with the corresponding noise-added values, and participant A obtains the corresponding oblivious programmable pseudorandom function OPPRF output;
S14, for each bin, participant B and participant C run the oblivious programmable pseudorandom function OPPRF, with participant B as the receiver and participant C as the sender, and participant B obtains the corresponding oblivious programmable pseudorandom function OPPRF output.
3. The longitudinal federated learning method under different sample identifiers according to claim 1, characterized in that the step S2 comprises:
S21, loop over each bin: if the bin stores a non-invalid element, participant B XORs the corresponding oblivious programmable pseudorandom function OPPRF output with the element in its own bin to form a set K_B and sends the set K_B to participant A; similarly, participant A XORs the oblivious programmable pseudorandom function OPPRF output with the element in its own bin to form a set K_A; by matching K_A and K_B, participant A obtains the intersection K_A' of the set K_A and the set K_B, puts the elements corresponding to the intersection K_A' into the intersection S_A, and sends K_A' to participant B;
S22, participant B puts the elements corresponding to the intersection K_A' and the set K_B into the intersection S_B.
4. The longitudinal federated learning method under different sample identifiers according to claim 1, characterized in that the step S3 comprises:
S31, the third party C generates a homomorphic-encryption public and private key pair and sends the public key to participant A and participant B, and participant A and participant B initialize their own weights respectively;
S32, for each piece of data in the data set, participant A calculates the ciphertext of its local intermediate result from its weight and that piece of data, and participant B calculates the ciphertext of its local intermediate result from its weight, that piece of data and the label value; participant A and participant B send the ciphertexts of the intermediate results to each other, and each homomorphically adds its own ciphertext to the received ciphertext of the intermediate result to obtain a new ciphertext; participant A and participant B multiply each piece of data in their own data sets by the new ciphertext and, after looping over every piece of data in the data set, accumulate the products to obtain their own complete encrypted gradients; the ciphertexts of the intermediate results are used for calculating the gradient value of participant A and the gradient value of participant B;
S33, participant A and participant B respectively add noise to their complete encrypted gradients and send them to the third party C, and the third party C decrypts the two ciphertexts and sends the results to participant A and participant B, respectively;
s34, removing noise from the received values by the party A and the party B, and updating the gradient of the party A and the gradient of the party B;
S35, steps S32 to S34 are executed in a loop until the logistic regression model converges or the maximum number of iterations is reached, and participant A and participant B obtain the trained logistic regression model.
5. The longitudinal federated learning method under different sample identifiers according to claim 4, characterized in that the calculation formula for calculating the gradient value of participant A is:

∂L/∂θ_A = -(1/n) · Σ_{i=1}^{n} y_i · x_{A,i} / (1 + e^{y_i · (θ_A^T x_{A,i} + θ_B^T x_{B,i})}),

wherein X_A is the participant A data set, x_{A,i} is the i-th sample of the participant A data set, X_B is the participant B data set, x_{B,i} is the i-th sample of the participant B data set, y_i is the label corresponding to x_{B,i}, θ is the weight column vector, θ_A is the weight of participant A, θ_A^T denotes the transpose of θ_A, and θ_B^T denotes the transpose of θ_B;

the calculation formula for calculating the gradient value of participant B is:

∂L/∂θ_B = -(1/n) · Σ_{i=1}^{n} y_i · x_{B,i} / (1 + e^{y_i · (θ_A^T x_{A,i} + θ_B^T x_{B,i})}),

wherein X_A is the participant A data set, x_{A,i} is the i-th sample of the participant A data set, X_B is the participant B data set, x_{B,i} is the i-th sample of the participant B data set, y_i is the label corresponding to x_{B,i}, θ is the weight column vector, θ_B is the weight of participant B, θ_A^T denotes the transpose of θ_A, and θ_B^T denotes the transpose of θ_B.
6. A computer device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the longitudinal federated learning method under different sample identifiers according to any one of claims 1 to 5.
7. A storage medium storing a program which, when executed by a processor, implements the longitudinal federated learning method under different sample identifiers according to any one of claims 1 to 5.
CN202211061861.1A 2022-08-31 2022-08-31 Longitudinal federated learning method, device and medium under condition of different sample identifiers Pending CN115630713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211061861.1A CN115630713A (en) 2022-08-31 2022-08-31 Longitudinal federated learning method, device and medium under condition of different sample identifiers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211061861.1A CN115630713A (en) 2022-08-31 2022-08-31 Longitudinal federated learning method, device and medium under condition of different sample identifiers

Publications (1)

Publication Number Publication Date
CN115630713A true CN115630713A (en) 2023-01-20

Family

ID=84903027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211061861.1A Pending CN115630713A (en) 2022-08-31 2022-08-31 Longitudinal federated learning method, device and medium under condition of different sample identifiers

Country Status (1)

Country Link
CN (1) CN115630713A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115913554A (en) * 2023-03-13 2023-04-04 深圳市洞见智慧科技有限公司 Efficient hidden trace federal learning method and system based on state secret and related equipment
CN117034000A (en) * 2023-03-22 2023-11-10 浙江明日数据智能有限公司 Modeling method and device for longitudinal federal learning, storage medium and electronic equipment
CN118114306A (en) * 2024-04-28 2024-05-31 暨南大学 Collusion attack resistant privacy contact discovery method entrusted to cloud server
CN118114306B (en) * 2024-04-28 2024-07-26 暨南大学 Collusion attack resistant privacy contact discovery method entrusted to cloud server



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination