CN112132292B - Blockchain-based vertical federated learning data processing method, device and system

Blockchain-based vertical federated learning data processing method, device and system

Info

Publication number
CN112132292B
Authority
CN
China
Legal status
Active
Application number
CN202010971408.9A
Other languages
Chinese (zh)
Other versions
CN112132292A (en)
Inventor
权纯
刘春伟
王雪
霍昱光
李武璐
Current Assignee
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Application filed by CCB Finetech Co Ltd
Priority to CN202010971408.9A
Publication of CN112132292A
Application granted
Publication of CN112132292B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval of structured data, e.g. relational data
    • G06F 16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention discloses a blockchain-based vertical federated learning data processing method, device and system. The method comprises the following steps: acquiring local full data samples for vertical federated learning and receiving encrypted full data samples from a partner, wherein each full data sample comprises a data identifier; performing an identifier alignment operation on the local full data samples and the partner full data samples to generate an identifier intersection; receiving encrypted partner model-training intermediate results from the partner and training a local vertical federated learning model on them together with the local data samples corresponding to the identifier intersection, to obtain a trained local vertical federated learning model; and uploading to the blockchain the hash result of the local full data samples, the data identifiers of the partner full data samples, the order in which the local data samples are used during model training, the scheduling log, the partner model-training intermediate results, and the hash result of the trained local vertical federated learning model.

Description

Blockchain-based vertical federated learning data processing method, device and system
Technical Field
The invention relates to the field of machine learning, and in particular to a blockchain-based vertical federated learning data processing method, device and system.
Background
Federated learning can build models on distributed data sets while protecting data privacy. However, federated learning technology is still maturing and certain security weaknesses remain. In practical application scenarios, data and models are core assets and revenue sources of enterprises, so their safety and correctness are of great importance. If enterprises cannot trust one another, federated learning cannot be used.
Federated learning can be classified into horizontal federated learning and vertical federated learning according to how the data are distributed. Horizontal federated learning, also called sample-partitioned federated learning, applies to scenarios in which the participants' data sets share the same feature space but different sample spaces. Vertical federated learning, also called feature-partitioned federated learning, applies to scenarios in which the participants' data sets share the same sample space but different feature spaces.
At present, blockchain-based evidence and audit schemes exist for horizontal federated learning. However, horizontal federated learning is mostly used in toC (consumer-oriented) scenarios, for example where *** adopts horizontal federated learning so that Android handset users can update a model locally, or on unstructured data, such as models trained with image and audio data.
Data between enterprises, however, are typically distributed in a vertical form, and no effective evidence and audit scheme for vertical federated learning exists at present; enterprises therefore lack mutual trust and cannot carry out vertical federated learning modeling.
Disclosure of Invention
In view of the foregoing, the present invention provides a blockchain-based vertical federated learning data processing method, device and system to solve at least one of the above problems.
According to a first aspect of the present invention, there is provided a blockchain-based vertical federated learning data processing method, the method comprising:
acquiring local full data samples for vertical federated learning and receiving encrypted partner full data samples from a partner, wherein each full data sample comprises a data identifier;
performing an identifier alignment operation on the local full data samples and the partner full data samples to generate an identifier intersection;
receiving encrypted partner model-training intermediate results from the partner, and training a local vertical federated learning model on the partner model-training intermediate results and the local data samples corresponding to the identifier intersection, to obtain a trained local vertical federated learning model;
and uploading to the blockchain, for later audit, the hash result of the local full data samples, the data identifiers of the encrypted partner full data samples, the order in which the local data samples are used during model training, the scheduling log, the encrypted partner model-training intermediate results, and the hash result of the trained local vertical federated learning model.
According to a second aspect of the present invention, there is provided a blockchain-based vertical federated learning data processing device, the device comprising:
a sample acquisition unit for acquiring local full data samples for vertical federated learning;
a sample receiving unit for receiving encrypted partner full data samples from a partner, wherein each full data sample comprises a data identifier;
an alignment operation unit for performing an identifier alignment operation on the local full data samples and the partner full data samples to generate an identifier intersection;
a model training unit for receiving encrypted partner model-training intermediate results from the partner and training the local vertical federated learning model on the partner model-training intermediate results and the local data samples corresponding to the identifier intersection, to obtain a trained local vertical federated learning model;
and a data uploading unit for uploading to the blockchain, for later audit, the hash result of the local full data samples, the data identifiers of the encrypted partner full data samples, the order in which the local data samples are used during model training, the scheduling log, the encrypted partner model-training intermediate results, and the hash result of the trained local vertical federated learning model.
According to a third aspect of the present invention, there is provided a blockchain-based vertical federated learning data processing system, the system comprising: a plurality of partner servers, a local server, the above blockchain-based vertical federated learning data processing device, and the blockchain, wherein the vertical federated learning data processing device uploads to the blockchain the interaction information generated while the partner servers and the local server train the vertical federated learning model.
According to a fourth aspect of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the program.
According to a fifth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method.
According to the technical scheme, an identifier alignment operation is performed on the acquired local full data samples and partner full data samples to generate an identifier intersection; the local vertical federated learning model is trained on the received partner model-training intermediate results and the local data samples corresponding to the identifier intersection to obtain a trained local vertical federated learning model; and meanwhile the hash result of the local full data samples, the data identifiers of the encrypted partner full data samples, the order in which the local data samples are used during model training, the scheduling log, the encrypted partner model-training intermediate results and the hash result of the trained local vertical federated learning model are uploaded to the blockchain for later audit. With this scheme, a trained local vertical federated learning model is obtained while the whole flow of vertical federated learning can be audited from the data uploaded to the blockchain, which enhances the reliability and safety of vertical federated learning and effectively promotes its application.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a blockchain-based vertical federated learning data processing method according to an embodiment of the present invention;
FIG. 2 is a detailed flow diagram of a blockchain-based vertical federated learning data processing method according to an embodiment of the present invention;
FIG. 3 is a block diagram of the architecture of a blockchain-based vertical federated learning data processing system according to an embodiment of the present invention;
FIG. 4 is a block diagram of the structure of the vertical federated learning data processing device 3 according to an embodiment of the present invention;
FIG. 5 is a detailed structural block diagram of the vertical federated learning data processing device 3 according to an embodiment of the present invention;
FIG. 6 is a schematic block diagram of the system configuration of an electronic device 600 according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Although evidence and audit schemes for horizontal federated learning already exist, the differences between vertical and horizontal federated learning make those schemes unsuitable for the vertical case, and no effective evidence and audit scheme for vertical federated learning is yet available. The embodiment of the invention therefore provides a blockchain-based processing scheme for vertical federated learning data that realizes evidence storage and audit for vertical federated learning; in cooperation with regulatory intervention, the whole vertical federated learning process becomes auditable and participant behavior becomes accountable, which enhances the reliability and safety of vertical federated learning and effectively promotes its application.
It should be noted that vertical federated learning cooperation presupposes a certain degree of trust between the two (or more) parties. For example, the parties may be well-known companies whose reputations serve as guarantees, or subsidiaries within the same corporate group. The embodiment of the invention therefore does not consider a participant maliciously damaging the system or the interests of the other participants, for example by sending them a malicious virus package.
In addition, vertical federated learning cooperation should be built on a win-win basis, i.e. successfully establishing the model brings benefits to every participant. These benefits motivate the parties to contribute to the federated learning rather than sabotage it. Meanwhile, before vertical federated learning starts, the participants need to conclude certain legal agreements to support the scheme.
Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
FIG. 1 is a flow chart of a blockchain-based vertical federated learning data processing method according to an embodiment of the present invention. As shown in FIG. 1, the method comprises:
Step 101: acquire local full data samples for vertical federated learning and receive encrypted partner full data samples from the partner, wherein each full data sample comprises a data identifier.
Step 102: perform an identifier alignment operation on the local full data samples and the partner full data samples to generate an identifier intersection.
Step 103: receive encrypted partner model-training intermediate results from the partner, and train the local vertical federated learning model on the partner model-training intermediate results and the local data samples corresponding to the identifier intersection, to obtain a trained local vertical federated learning model.
In actual operation, the local vertical federated learning model may be trained on the local data samples corresponding to the identifier intersection together with the local model-training intermediate results and the partner model-training intermediate results, where the local data samples are used in a predetermined order, namely an order associated with the identifiers.
Model training on the partner side uses the partner data samples corresponding to the identifier intersection, and the order in which the partner data samples are used is based on the same identifier-associated order.
For example, suppose the identifier intersection is {id1, id2, id3}. If the local model training uses data samples in the order id1 → id2 → id3, the partner must also use its data samples in the order id1 → id2 → id3, as sketched below.
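A minimal Python sketch of this ordering discipline follows. It is illustrative only and not taken from the patent; the function name, the sorted-id ordering rule and the toy sample dictionary are assumptions.

    # Both parties iterate the intersection samples in the same deterministic,
    # id-based order, so that batch k covers the same ids on both sides.
    def ordered_batches(intersection_ids, local_samples, batch_size):
        ordered_ids = sorted(intersection_ids)   # order agreed by both parties
        for start in range(0, len(ordered_ids), batch_size):
            batch_ids = ordered_ids[start:start + batch_size]
            yield batch_ids, [local_samples[i] for i in batch_ids]

    # Party A would call this with its own feature columns and party B with
    # its own; only the id order is shared.
    samples_a = {"id1": (0.2, 1.1), "id2": (0.5, 0.9), "id3": (0.3, 0.0)}
    for ids, batch in ordered_batches({"id3", "id1", "id2"}, samples_a, 2):
        print(ids, batch)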
In vertical federated learning, each participant (i.e., partner) holds a different feature space, which makes the modeling process more complex. The participants first perform an identifier alignment operation, also called encrypted sample alignment or, for short, PSI (Private Set Intersection): under encryption, the intersection of all participants' data identifiers is found for use in subsequent model training. For the training data, each participant provides part of the features of a sample, and one participant provides the label; the labeled participant is called the active party and the unlabeled participants are called passive parties. Because the features provided by the participants differ, the participants are not in symmetrical roles and the operations they perform are not identical. During training, each participant holds part of the model, or part of the intermediate results the model requires (different algorithms require different intermediate results); the participants optimize the model parameters through multiple rounds of exchanging encrypted intermediate results, and the final result is obtained on the active side. After training, each participant holds a partial model or partial parameters. Prediction still requires the cooperation of all participants, with the final result given by the active party.
That is, the local vertical federated learning model described above is such a partial model.
Step 104: upload to the blockchain, for later audit, the hash result of the local full data samples, the data identifiers of the encrypted partner full data samples, the order in which the local data samples are used during model training, the scheduling log (comprising the partner's and the local scheduling information and running-state information), the encrypted partner model-training intermediate results, and the hash result of the trained local vertical federated learning model.
This information uploading step may be performed in parallel with the corresponding steps among steps 101-103 above.
The hash result of the local full data samples can be obtained as follows: perform a Merkle tree construction over the local full data samples and take the Merkle root as the hash result of the local full data samples.
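A minimal sketch of the Merkle-root computation is given below. The leaf serialization and the duplicate-last-node pairing rule are assumptions for illustration; the patent text does not fix a concrete format, nor whether the leaves are the fields of one sample or the samples of the full set.

    import hashlib

    def merkle_root(leaves: list[bytes]) -> bytes:
        # Hash every leaf, then repeatedly hash adjacent pairs up to the root.
        if not leaves:
            raise ValueError("at least one leaf is required")
        level = [hashlib.sha256(leaf).digest() for leaf in leaves]
        while len(level) > 1:
            if len(level) % 2:                 # duplicate the last node if odd
                level.append(level[-1])
            level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                     for i in range(0, len(level), 2)]
        return level[0]

    # Only the root goes on-chain as the data asset proof; the raw fields stay local.
    print(merkle_root([b"id1", b"info1_field_a", b"info1_field_b"]).hex())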
The hash result of the local vertical federated learning model can be obtained as follows: perform a hash mapping operation on the local vertical federated learning model to obtain its hash result.
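As one possible reading of the hash mapping operation, any stable serialization of the partial model followed by a cryptographic hash suffices. The JSON serialization below is an assumption; real models would need a deterministic serialization of their weight tensors.

    import hashlib
    import json

    def model_fingerprint(params: dict) -> str:
        # Map a trained (partial) model to a hash value safe to store on-chain.
        canonical = json.dumps(params, sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()

    # Hypothetical parameter dict, purely for illustration.
    print(model_fingerprint({"weights": [0.12, -0.7, 1.5], "bias": 0.3}))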
In actual operation, the data identifiers of the local data samples used in training the local vertical federated learning model may be uploaded to the blockchain in training order; the partner identifier information received in the alignment operation may also be uploaded to the blockchain.
In summary, the method performs an identifier alignment operation on the acquired local full data samples and partner full data samples to generate an identifier intersection, trains the local vertical federated learning model on the received partner model-training intermediate results and the local data samples corresponding to the identifier intersection to obtain a trained local model, and meanwhile uploads the hash result of the local full data samples, the data identifiers of the encrypted partner full data samples, the order in which the local data samples are used during model training, the scheduling log, the encrypted partner model-training intermediate results and the hash result of the trained local vertical federated learning model to the blockchain for later audit. With the embodiment of the invention, a trained local vertical federated learning model is obtained while the whole flow of vertical federated learning can be audited from the data uploaded to the blockchain, which enhances the reliability and safety of vertical federated learning and effectively promotes its application.
For a better understanding of the present invention, embodiments are described in detail below in conjunction with the example vertical federated learning flow shown in FIG. 2. For convenience, this example describes vertical federated learning between two parties, A and B; in actual operation the scheme generalizes to the multi-party case (more than two parties).
As shown in FIG. 2, the embodiment of the present invention stores evidence of each participant's data, scheduling logs, interaction content, model results, etc. on the blockchain.
Referring to FIG. 2, this example stores the following information on the blockchain:
(1) Data sample evidence: to normalize participant behavior and establish that the participants perform the federated learning task with the agreed data samples, the full data sample information of each participant must be recorded as that participant's "data asset proof". The full data sample information means every data sample the participant contributes to federated learning (the full set before the PSI module performs the encrypted sample alignment operation).
For example, if party A provides 5 data samples, then {(id1, info1), (id2, info2), (id3, info3), (id4, info4), (id5, info5)} should be recorded. Since a participant's data samples must not be exposed, the participant builds a Merkle tree locally for each data sample and uploads only the Merkle root to the blockchain as the "asset proof" of that data sample.
(2) Aligned data identifier (id) evidence: after sample identifier alignment over the full data samples, the ids of the intersection of the two parties' data samples are obtained. Data sample alignment means taking the intersection of the identifiers of the data held by the parties participating in federated learning. In vertical federated learning, the id is a unique identifier of a data sample (based, for example, on information such as an identity card number), while the features and label of the sample are provided by multiple parties to the complete vertical federated learning model. For example, if the data format of sample id_i for the complete model is (id_i, x1, x2, x3, y), party A provides (id_i, x1, x2) and party B provides (id_i, x3, y). If party A owns the data sample set {ID_A} and party B owns the data sample set {ID_B}, sample identifier alignment finds the identifier intersection {ID_A} ∩ {ID_B} of the data that parties A and B can provide, and only the data samples at the intersection identifiers can constitute a complete data sample (id_i, x1, x2, x3, y), as pictured in the sketch below. Since the purpose of PSI is to protect the data samples outside the intersection, i.e. {ID_A} − {ID_A} ∩ {ID_B} and {ID_B} − {ID_A} ∩ {ID_B}, the data sample ids inside the intersection {ID_A} ∩ {ID_B} may be exposed to the parties. The data sample ids in the intersection must be recorded to pin down the data samples with which each participant finally trains the federated learning model.
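The sketch below pictures this over toy records (the values are invented). Note that in the actual protocol the features never leave their owners; the plaintext join here only shows which ids can form a complete sample.

    # Party A holds (x1, x2) per id, party B holds (x3, y) per id.
    party_a = {"id1": (0.2, 1.1), "id2": (0.5, 0.9), "id4": (0.3, 0.0)}
    party_b = {"id1": (7.0, 1),   "id2": (3.5, 0),   "id5": (9.9, 1)}

    common_ids = sorted(party_a.keys() & party_b.keys())         # ['id1', 'id2']
    complete = {i: party_a[i] + party_b[i] for i in common_ids}  # (x1, x2, x3, y)
    print(complete["id1"])   # (0.2, 1.1, 7.0, 1)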
(3) Scheduling log evidence: the federated learning scheduling log mainly records each participant's actions, scheduling information and running-state information. The scheduling information uniquely identifies the meaning of each encrypted intermediate result a party sends. Recording the scheduling log ensures the traceability of the whole federated learning process.
(4) Batch processing information evidence: as explained for sample identifier alignment in (2), participants A and B align their data sample identifiers in the PSI module, and only the data corresponding to the intersection identifiers enter the subsequent flow. Before sample identifier alignment the data samples are the full sets; afterwards only the intersection data samples remain. For example, if party A has 7 full samples and party B has 8 full samples but there are only 5 intersection samples, only those 5 enter the subsequent steps. If federated learning trains the model in batches, the ids and the order of the data samples in each batch must be recorded; if training uses the full intersection samples directly, only the order in which the data samples are used must be recorded.
Since in vertical federated learning the features of one data sample are provided by multiple parties, the same data sample order must be used when each party computes on its local data samples, so that the intermediate results stay aligned. Specifically, after sample identifier alignment, a data id present at party A must also be present at party B. Because parties A and B each provide part of the features and the label of one data sample, the computation order must be fixed. For example, if the intersection sample identifiers are {id1, id2, id3} and party A computes in the order id1 → id2 → id3, party B must also compute in the order id1 → id2 → id3. Only then, when party A sends the intermediate results f_A(id1), f_A(id2), f_A(id3) to party B, can party B correctly compute f_A(id1) ⊙ f_B(id1), f_A(id2) ⊙ f_B(id2), f_A(id3) ⊙ f_B(id3), instead of producing an erroneous result such as f_A(id1) ⊙ f_B(id3), f_A(id2) ⊙ f_B(id2), f_A(id3) ⊙ f_B(id1).
(5) Encrypted id information and intermediate result evidence: the id information exchanged by each participant in the PSI module and the intermediate results exchanged in the machine learning training module must be recorded. Since each party sends encrypted id information or encrypted intermediate results, these can be recorded directly.
(6) Model evidence: the hash value of the model finally obtained by each participant must be recorded. Since the model is private to its participant, it needs to be mapped to a hash value; existing hash mapping techniques may be used here.
The vertical federated learning training steps of participants A and B, based on the flow shown in FIG. 2, are described below:
(1) The two parties each prepare their full data samples; for each data sample a Merkle tree is built, and the Merkle root value is uploaded to the blockchain as that party's data asset proof. The storage form may be (id1, hash1), which uniquely identifies the data.
(2) The full data ids enter the PSI module, which finds the identifier intersection of the two parties' data samples through repeated exchange of encrypted id information, as sketched below. During this process, all scheduling logs are uploaded to the blockchain; after each information exchange, the receiver uploads the sender-signed encrypted id information to the blockchain, so that all encrypted id information exchanged by the two parties is stored on the chain; finally, the intersection ids of the two parties' data samples are uploaded to the blockchain.
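For illustration, the sketch below implements one classical way to realize such an exchange: commutative (Diffie-Hellman style) blinding of hashed ids. It is a toy under assumed parameters, not the concrete PSI protocol of the patent; production PSI would use vetted parameters and proper encodings.

    import hashlib
    import secrets

    P = 2**521 - 1   # Mersenne prime, chosen here purely for illustration

    def h(identifier: str) -> int:
        # Hash an identifier into the multiplicative group used for blinding.
        return int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big") % P

    a_key = secrets.randbelow(P - 2) + 1       # party A's secret exponent
    b_key = secrets.randbelow(P - 2) + 1       # party B's secret exponent
    ids_a = ["id1", "id2", "id3", "id7"]       # party A's full data ids
    ids_b = ["id1", "id2", "id3", "id9"]       # party B's full data ids

    a_once = [pow(h(i), a_key, P) for i in ids_a]   # A -> B: A-blinded ids
    b_once = [pow(h(i), b_key, P) for i in ids_b]   # B -> A: B-blinded ids
    a_twice = [pow(v, b_key, P) for v in a_once]    # B -> A: doubly blinded, order kept
    b_twice = {pow(v, a_key, P) for v in b_once}    # computed locally by A

    # h(id)^(a*b) mod P is identical on both sides, so A learns exactly the common ids.
    intersection = [i for i, v in zip(ids_a, a_twice) if v in b_twice]
    print(intersection)    # ['id1', 'id2', 'id3']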
(3) The data corresponding to the intersection data sample identifiers are taken as training data samples and input into the machine learning model training module, and the order in which the training data samples are used is uploaded to the blockchain. Parties A and B use the same data sample ids, but the features and labels they provide for those samples differ. For example, training uses the three records corresponding to the sample identifiers {id1, id2, id3}, but party A's record is (id_i, x1, x2) while party B's record is (id_i, x3, y).
During machine learning model training, all scheduling logs are uploaded to the blockchain. Each party trains part of the model locally, or computes with its own data samples, and then sends the other party the encrypted intermediate results that party requires. The two parties exchange encrypted intermediate results multiple times, operate on them, and iterate their own models until training completes. The main encryption algorithm in federated learning training is additively homomorphic encryption, under which the result of computing on encrypted data equals the encryption of the result of computing on the data, i.e. En(a) + En(b) = En(a + b). The receiver therefore needs no decryption: it computes on the ciphertext and returns the ciphertext to the sender. The sender can decrypt the computation result without learning the receiver's data, so data security is guaranteed.
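The additive property En(a) + En(b) = En(a + b) can be exercised with the third-party python-paillier package (import name phe); the snippet below assumes that library is installed and shows one possible cipher, not one mandated by the patent.

    from phe import paillier   # third-party package: pip install phe

    public_key, private_key = paillier.generate_paillier_keypair()

    enc_a = public_key.encrypt(3)      # sender encrypts its values
    enc_b = public_key.encrypt(4)
    enc_sum = enc_a + enc_b            # receiver adds ciphertexts without decrypting

    assert private_key.decrypt(enc_sum) == 7   # En(3) + En(4) decrypts to 7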
After each interaction, the receiving party uploads the sender-signed encrypted intermediate result to the blockchain, so that all encrypted intermediate results exchanged by the two parties are stored on the chain.
In one embodiment, to ensure that the data sent by the sender, received by the receiver and stored on the blockchain are consistent, without hurting federated learning efficiency, the interacting parties may store evidence as follows: the sender sends a signed data packet to the receiver; upon receipt, the receiver immediately performs its local operation on the packet and, in parallel, uploads the sender-signed packet to the blockchain. This preserves operating efficiency while accurately recording the interaction information.
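One way this pattern might look in code is sketched below, using Ed25519 signatures from the third-party cryptography package; the packet content and the list standing in for the blockchain evidence store are assumptions.

    from cryptography.hazmat.primitives.asymmetric import ed25519

    sender_key = ed25519.Ed25519PrivateKey.generate()
    sender_pub = sender_key.public_key()

    def send(payload: bytes):
        # Sender: attach a signature to the encrypted intermediate result.
        return payload, sender_key.sign(payload)

    def receive(payload: bytes, signature: bytes, chain_store: list):
        # Receiver: verify, start computing immediately, upload evidence in parallel.
        sender_pub.verify(signature, payload)      # raises InvalidSignature if forged
        # ... local computation on the ciphertext proceeds here without waiting ...
        chain_store.append((payload, signature))   # stand-in for the blockchain upload

    evidence = []
    packet, sig = send(b"encrypted intermediate result")
    receive(packet, sig, evidence)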
(4) The two parties each obtain a partial model, map it to a hash value, and upload the hash value to the blockchain.
By storing on the blockchain the data asset proofs (data sample evidence), the scheduling log information (scheduling log evidence), the interaction information (aligned data id evidence, batch processing information evidence, encrypted id information and intermediate result evidence) and the model information (model evidence), the whole federated learning flow is recorded. The data asset records each participant's data; the scheduling log records every participant's operating behavior during federated learning; the interaction information records each participant's operation content; and the model information records the modeling result. Combined, this information can restore the full flow of federated learning modeling for subsequent audit.
Because the scheduling log contains no sensitive participant information, it can be stored directly; the data asset evidence stores only Merkle roots, the interaction information is stored in encrypted form, and the model information is mapped to hashes, so no participant's sensitive information is exposed.
Precisely because the stored evidence is kept secure, however, it cannot be used directly for auditing. The embodiment of the invention therefore introduces a trusted third party which, when a problem arises, audits the federated learning process and each participant's operations using the stored evidence.
In one embodiment, a third party approved by every participant, such as a regulatory agency (hereinafter the arbitrator), is introduced to lead the audit process. Each participant agrees in advance that, when a problem occurs during federated learning, it will cooperate with the audit and provide the arbitrator with the required environment and data. When a participant believes the federated learning process has gone wrong, it submits evidence of the problem to the arbitrator and files an audit request; after examining the evidence, the arbitrator formulates an audit plan and requires each participant to provide the relevant environment and data. At each participant's side, the arbitrator audits: ① whether the code the participant uses has been tampered with; ② whether the data the participant uses is consistent with the agreed data; ③ given correct code and data, whether the content the participant recorded on the blockchain is correct. By running the correct code on input data matched against the blockchain records, the arbitrator can restore the full flow of federated learning modeling and audit whether any participant operated improperly or committed wrongdoing.
For example, when auditing the PSI module, the arbitrator first requires each participant to provide its code and matches it against the correct code, i.e. code not tampered with by the participant. Since the code is deployed locally by each participant there is a risk of tampering, but the correct code can simply be pulled again from the server, so it requires no on-chain evidence. The arbitrator then verifies whether the data matches the participant's data asset evidence, and finally replays the PSI process and compares the result with the aligned data ids stored as evidence; if they differ, the problematic participant can be identified. As another example, when auditing the machine learning model training module, the arbitrator first verifies the correctness of the code and matches the Merkle root values and the ids of the intersection data; then, using the module's input data, the arbitrator step by step restores the whole machine learning training process and checks important intermediate results such as gradients, and if some party's intermediate results do not match the stored evidence, that party is held responsible; finally, the arbitrator verifies whether the hash value of the obtained model matches the value recorded on the blockchain.
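Two of these consistency checks can be pictured as short helpers; merkle_root() is the function sketched earlier, and the layout of the on-chain records is an assumption.

    def audit_data_assets(provided_samples: list, on_chain_root_hex: str) -> bool:
        # Recompute the Merkle root over the data a participant hands over and
        # compare it with the data asset proof recorded on the blockchain.
        return merkle_root(provided_samples).hex() == on_chain_root_hex

    def audit_psi(replayed_ids: list, on_chain_ids: list) -> bool:
        # Replay the PSI step on the provided inputs and compare the aligned ids
        # with the evidence stored on the blockchain.
        return sorted(replayed_ids) == sorted(on_chain_ids)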
More specifically, because the aligned data ids in the PSI module and the interaction information such as the encrypted id information and the intermediate results in the machine learning model training module are all stored as evidence, the arbitrator can ask a participant to decrypt an intermediate result and then audit that particular interaction step. The arbitrator thus avoids touching each participant's raw data and audits the participants' operations using only the stored interaction evidence, protecting each participant's data privacy to a greater extent.
As can be seen from the above description, in vertical federated learning the embodiment of the invention stores, based on blockchain technology, the data asset proofs (the Merkle root of each data sample), the scheduling logs, the interaction information (aligned data ids, batch processing information, encrypted id information and intermediate results) and the model information produced during learning, thereby ensuring the traceability of the whole federated learning flow.
Based on a similar inventive concept, embodiments of the present invention also provide a blockchain-based vertical federated learning data processing system. As shown in FIG. 3, the system comprises: a plurality of partner servers 1 (one is shown in the figure), a local server 2, a blockchain-based vertical federated learning data processing device 3 and the blockchain 4, wherein the vertical federated learning data processing device uploads to the blockchain the interaction information generated while the partner servers and the local server train the vertical federated learning model. Preferably, the vertical federated learning data processing device 3 can implement the flow in the above method embodiment.
FIG. 4 is a block diagram of the structure of the vertical federated learning data processing device 3. As shown in FIG. 4, the device comprises: a sample acquisition unit 31, a sample receiving unit 32, an alignment operation unit 33, a model training unit 34 and a data uploading unit 35, wherein:
the sample acquisition unit 31 acquires local full data samples for vertical federated learning;
the sample receiving unit 32 receives encrypted partner full data samples from the partner, wherein each full data sample comprises a data identifier;
the alignment operation unit 33 performs an identifier alignment operation on the local full data samples and the partner full data samples to generate an identifier intersection;
the model training unit 34 receives the encrypted partner model-training intermediate results from the partner and trains the local vertical federated learning model on the partner model-training intermediate results and the local data samples corresponding to the identifier intersection, to obtain a trained local vertical federated learning model.
Specifically, the model training unit comprises an intermediate result receiving module and a model training module, wherein:
the intermediate result receiving module receives the encrypted partner model-training intermediate results from the partner;
the model training module trains the local vertical federated learning model on the local data samples corresponding to the identifier intersection together with the local model-training intermediate results and the partner model-training intermediate results, wherein the local data samples are used in a predetermined order.
The data uploading unit 35 uploads to the blockchain, for later audit, the hash result of the local full data samples, the data identifiers of the encrypted partner full data samples, the order in which the local data samples are used during model training, the scheduling log (comprising the partner's and the local scheduling information and running-state information), the encrypted partner model-training intermediate results, and the hash result of the trained local vertical federated learning model.
In actual operation, the data uploading unit also uploads the data identifiers of the local data samples to the blockchain in training order during the training of the local vertical federated learning model, and uploads to the blockchain the partner identifier information received in the alignment operation.
In this way, the alignment operation unit 33 performs an identifier alignment operation on the local full data samples acquired by the sample acquisition unit 31 and the partner full data samples received by the sample receiving unit 32 to generate an identifier intersection; the model training unit 34 trains the local vertical federated learning model on the received partner model-training intermediate results and the local data samples corresponding to the identifier intersection to obtain a trained local vertical federated learning model; and the data uploading unit 35 uploads the hash result of the local full data samples, the data identifiers of the encrypted partner full data samples, the order in which the local data samples are used during model training, the scheduling log, the encrypted partner model-training intermediate results and the hash result of the trained local vertical federated learning model to the blockchain for later audit. With the embodiment of the invention, a trained local vertical federated learning model is obtained while the whole flow of vertical federated learning can be audited from the data uploaded to the blockchain, which enhances the reliability and safety of vertical federated learning and effectively promotes its application.
In actual operation, as shown in FIG. 5, the vertical federated learning data processing device 3 further comprises the following units:
a sample processing unit 36 for performing a Merkle tree construction over the local full data samples and taking the Merkle root as the hash result of the local full data samples;
a hash mapping unit 37 for performing a hash mapping operation on the local vertical federated learning model to obtain the hash result of the local vertical federated learning model.
In one embodiment, the vertical federated learning data processing system further comprises an audit server, the audit server comprising an information acquisition unit and an audit operation unit, wherein:
the information acquisition unit acquires the information uploaded to the blockchain;
and the audit operation unit audits each participant according to the acquired information.
For the specific execution process of each unit and module, reference may be made to the description in the above method embodiment, which is not repeated here.
In actual operation, the above units, modules and sub-modules may be combined or arranged individually; the invention is not limited in this respect.
This embodiment also provides an electronic device, which may be a desktop computer, a tablet computer, a mobile terminal or the like; this embodiment is not limited thereto. The electronic device may be implemented with reference to the above method embodiment and the embodiments of the blockchain-based vertical federated learning data processing device/system, the contents of which are incorporated herein and not repeated.
FIG. 6 is a schematic block diagram of the system configuration of an electronic device 600 according to an embodiment of the present invention. As shown in FIG. 6, the electronic device 600 may comprise a central processor 100 and a memory 140, the memory 140 being coupled to the central processor 100. Notably, the figure is exemplary; other types of structures may also be used, in addition to or in place of those shown, to implement telecommunication or other functions.
In one embodiment, the blockchain-based vertical federated learning data processing functionality may be integrated into the central processor 100, which may be configured to control as follows:
acquiring local full data samples for vertical federated learning and receiving encrypted partner full data samples from a partner, wherein each full data sample comprises a data identifier;
performing an identifier alignment operation on the local full data samples and the partner full data samples to generate an identifier intersection;
receiving encrypted partner model-training intermediate results from the partner, and training the local vertical federated learning model on the partner model-training intermediate results and the local data samples corresponding to the identifier intersection, to obtain a trained local vertical federated learning model;
and uploading to the blockchain, for later audit, the hash result of the local full data samples, the data identifiers of the encrypted partner full data samples, the order in which the local data samples are used during model training, the scheduling log, the encrypted partner model-training intermediate results, and the hash result of the trained local vertical federated learning model.
As can be seen from the above description, the electronic device provided by the embodiment of the present application performs an identifier alignment operation on the acquired local full data samples and partner full data samples to generate an identifier intersection, trains the local vertical federated learning model on the received partner model-training intermediate results and the local data samples corresponding to the identifier intersection to obtain a trained local vertical federated learning model, and meanwhile uploads the hash result of the local full data samples, the data identifiers of the encrypted partner full data samples, the order in which the local data samples are used during model training, the scheduling log, the encrypted partner model-training intermediate results and the hash result of the trained local vertical federated learning model to the blockchain for later audit. A trained local vertical federated learning model is thus obtained while the whole flow of vertical federated learning can be audited from the data uploaded to the blockchain, which enhances the reliability and safety of vertical federated learning and effectively promotes its application.
In another embodiment, the blockchain-based vertical federated learning data processing device/system may be configured separately from the central processor 100; for example, it may be configured as a chip connected to the central processor 100, with the vertical federated learning data processing function implemented under the control of the central processor.
As shown in fig. 6, the electronic device 600 may further include: a communication module 110, an input unit 120, an audio processor 130, a display 160, a power supply 170. It is noted that the electronic device 600 need not include all of the components shown in fig. 6; in addition, the electronic device 600 may further include components not shown in fig. 6, to which reference is made to the prior art.
As shown in FIG. 6, the central processor 100, sometimes also referred to as a controller or operation control, may comprise a microprocessor or other processor device and/or logic device; the central processor 100 receives inputs and controls the operation of the components of the electronic device 600.
The memory 140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory or other suitable device. It may store relevant information and the programs for processing that information, and the central processor 100 may execute the programs stored in the memory 140 to realize information storage or processing.
The input unit 120 provides an input to the central processor 100. The input unit 120 is, for example, a key or a touch input device. The power supply 170 is used to provide power to the electronic device 600. The display 160 is used for displaying display objects such as images and characters. The display may be, for example, but not limited to, an LCD display.
The memory 140 may be a solid-state memory, such as a read-only memory (ROM), a random access memory (RAM), a SIM card or the like; it may also be a memory which retains information even when powered down and which can be selectively erased and provided with further data, an example of which is sometimes called an EPROM or the like; or the memory 140 may be some other type of device. The memory 140 includes a buffer memory 141 (sometimes referred to as a buffer) and may include an application/function storage 142 for storing application programs and function programs, or the flow by which the central processor 100 executes the operations of the electronic device 600.
The memory 140 may also include a data store 143 for storing data such as contacts, digital data, pictures, sounds and/or any other data used by the electronic device. The driver storage 144 of the memory 140 may include various drivers of the electronic device for the communication function and/or for performing other functions of the electronic device (e.g., a messaging application, an address book application, etc.).
The communication module 110 is a transmitter/receiver that transmits and receives signals via an antenna 111. The communication module (transmitter/receiver) 110 is coupled to the central processor 100 to provide input signals and receive output signals, as in a conventional mobile communication terminal.
Based on different communication technologies, multiple communication modules 110, such as a cellular network module, a Bluetooth module and/or a wireless local area network module, may be provided in the same electronic device. The communication module (transmitter/receiver) 110 is also coupled to a speaker 131 and a microphone 132 via an audio processor 130 to provide audio output via the speaker 131 and receive audio input from the microphone 132, thereby implementing the usual telecommunication functions. The audio processor 130 may include any suitable buffers, decoders, amplifiers and the like; it is also coupled to the central processor 100, so that sound can be recorded locally through the microphone 132 and locally stored sound can be played through the speaker 131.
The embodiment of the invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the above blockchain-based vertical federated learning data processing method.
In summary, traditional vertical federated learning lacks sound evidence and audit schemes, so the participants cannot trust one another, which hinders the wide application of vertical federated learning. The embodiment of the invention therefore provides a blockchain-based vertical federated learning evidence and audit scheme, which stores evidence of the whole vertical federated learning process and, in cooperation with regulatory intervention, makes that process auditable, traceable and accountable, enhancing the reliability and safety of vertical federated learning and promoting its application among enterprises.
Preferred embodiments of the present invention are described above with reference to the accompanying drawings. The many features and advantages of the embodiments are apparent from the detailed specification, and thus, it is intended by the appended claims to cover all such features and advantages of the embodiments which fall within the true spirit and scope thereof. Further, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the embodiments of the invention to the exact construction and operation illustrated and described, and accordingly, all suitable modifications and equivalents may be resorted to, falling within the scope thereof.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principles and embodiments of the present invention have been described in detail with reference to specific examples, which are provided to facilitate understanding of the method and core ideas of the present invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.

Claims (16)

1. A method for processing vertical federated learning data based on a blockchain, the method comprising:
obtaining local full data samples for vertical federated learning and receiving encrypted full data samples from a partner, wherein each full data sample comprises a data identifier;
performing an identifier alignment operation on the local full data samples and the partner full data samples to generate an identifier intersection;
receiving an encrypted partner model training intermediate result from the partner, and performing local vertical federated learning model training according to the partner model training intermediate result and the local data samples corresponding to the identifier intersection, to obtain a trained local vertical federated learning model; and
uploading, respectively, the hash result of the local full data samples, the data identifiers of the encrypted partner full data samples, the usage order of the local data samples during model training, the scheduling log, the encrypted partner model training intermediate result, and the hash result of the trained local vertical federated learning model to a blockchain, to facilitate later audit operations;
wherein receiving the encrypted partner model training intermediate result from the partner and performing the local vertical federated learning model training according to the partner model training intermediate result and the local data samples corresponding to the identifier intersection comprises:
receiving the encrypted partner model training intermediate result from the partner; and
training the local vertical federated learning model by combining the local data samples corresponding to the identifier intersection, a local model training intermediate result, and the partner model training intermediate result, wherein the local data samples are used in a predetermined order, the predetermined order being an order associated with the data identifiers; partner data samples corresponding to the identifier intersection are also used for partner model training; and the partner data samples are likewise used in the order associated with the data identifiers.
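Claim 1 combines two mechanisms: an identifier alignment step that intersects the parties' (encrypted) data identifiers, and a training loop that consumes samples in an order derived from those identifiers so that both parties stay sample-aligned step by step. The Python sketch below is illustrative only: the hash-based ID blinding stands in for whatever encryption the parties actually use, and the names (blind_id, align_ids, train_step, and so on) are assumptions, not the patent's prescribed protocol.

```python
import hashlib

def blind_id(identifier: str, salt: bytes = b"shared-salt") -> str:
    """Illustrative blinding: hash an identifier so raw IDs are not exchanged.
    (A real deployment would use a proper private-set-intersection protocol;
    this keyed hash is only a stand-in.)"""
    return hashlib.sha256(salt + identifier.encode()).hexdigest()

def align_ids(local_ids, partner_blinded_ids):
    """Return the identifier intersection in a deterministic order."""
    local_blinded = {blind_id(i): i for i in local_ids}
    common = set(local_blinded) & set(partner_blinded_ids)
    # Sorting by the blinded identifier gives both parties the same order,
    # i.e. a predetermined order associated with the identifiers.
    return [local_blinded[b] for b in sorted(common)]

def train_locally(samples_by_id, id_intersection, partner_intermediate, train_step):
    """Consume local samples in the ID-derived order, combining each step
    with the partner's encrypted training intermediate result."""
    usage_order = []
    for ident in id_intersection:          # predetermined, ID-associated order
        train_step(samples_by_id[ident], partner_intermediate)
        usage_order.append(ident)
    return usage_order                     # later uploaded to the blockchain
```

Because both parties iterate over the same sorted intersection, each training step on both sides refers to the same underlying individual without either side seeing the other's raw features, and the returned usage_order is exactly the "usage order of the local data samples" that the claim uploads for audit.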
2. The method of claim 1, wherein the hash result of the local full data samples is obtained by:
building a Merkle tree over the local full data samples and taking the root of the Merkle tree as the hash result of the local full data samples.
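A Merkle root commits to the entire sample set with a single hash while still allowing per-sample inclusion proofs later. A minimal sketch of the tree-building step, assuming SHA-256 and byte-serialized samples (both assumptions; the claim does not fix a hash function or serialization):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(samples: list[bytes]) -> bytes:
    """Build a Merkle tree bottom-up over the samples and return its root."""
    if not samples:
        raise ValueError("no samples to commit to")
    level = [sha256(s) for s in samples]              # leaf hashes
    while len(level) > 1:
        if len(level) % 2 == 1:                       # odd level: duplicate last node
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])      # hash adjacent pairs
                 for i in range(0, len(level), 2)]
    return level[0]

# The root (a single 32-byte value) is what gets uploaded to the blockchain.
root = merkle_root([b"sample-1", b"sample-2", b"sample-3"])
```

Changing any single sample changes the root, so an auditor can later detect tampering with the training data by recomputing the root from the disclosed samples, while individual samples can be proven included via Merkle paths without revealing the rest of the set.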
3. The method of claim 1, wherein the hash result of the local vertical federated learning model is obtained by:
performing a hash mapping operation on the local vertical federated learning model to obtain the hash result of the local vertical federated learning model.
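The model commitment of claim 3 only requires a deterministic hash of the trained model. A sketch under the assumption that the model can be canonically serialized as a key-sorted dict of parameters (the serialization scheme is illustrative, not specified by the claim):

```python
import hashlib
import json

def model_hash(parameters: dict[str, list[float]]) -> str:
    """Hash a canonical (key-sorted) serialization of the model parameters,
    so the same model always maps to the same digest."""
    canonical = json.dumps(parameters, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

h = model_hash({"weights": [0.12, -0.5], "bias": [0.03]})
```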
4. The method of claim 1, wherein the method further comprises:
uploading the data identifiers of the local data samples used during training of the local vertical federated learning model to the blockchain in training order.
5. The method of claim 1, wherein after the identifier alignment operation is performed on the local full data samples and the partner full data samples, the method further comprises:
uploading the partner identifier information received in the alignment operation to the blockchain.
6. The method of claim 1, wherein the scheduling log comprises: partner and local scheduling information and running state information.
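Claim 6 leaves the concrete log format open. For illustration only, one plausible shape for a single scheduling-log record, with every field name assumed rather than taken from the patent:

```python
# One plausible scheduling-log record; all field names are illustrative
# assumptions, not the patent's prescribed schema.
schedule_entry = {
    "round": 12,                         # training round being scheduled
    "party": "local",                    # "local" or a partner identifier
    "task": "exchange_intermediate",     # scheduling information
    "state": "completed",                # running-state information
    "timestamp": "2020-09-16T08:00:00Z",
}
```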
7. A blockchain-based vertical federated learning data processing apparatus, the apparatus comprising:
a sample obtaining unit, configured to obtain local full data samples for vertical federated learning;
a sample receiving unit, configured to receive encrypted partner full data samples from a partner, wherein each full data sample comprises a data identifier;
an alignment operation unit, configured to perform an identifier alignment operation on the local full data samples and the partner full data samples to generate an identifier intersection;
a model training unit, configured to receive an encrypted partner model training intermediate result from the partner, and perform local vertical federated learning model training according to the partner model training intermediate result and the local data samples corresponding to the identifier intersection, to obtain a trained local vertical federated learning model; and
a data uploading unit, configured to respectively upload the hash result of the local full data samples, the data identifiers of the encrypted partner full data samples, the usage order of the local data samples during model training, the scheduling log, the encrypted partner model training intermediate result, and the hash result of the trained local vertical federated learning model to a blockchain, to facilitate later audit operations;
wherein the model training unit comprises:
an intermediate result receiving module, configured to receive the encrypted partner model training intermediate result from the partner; and
a model training module, configured to train the local vertical federated learning model by combining the local data samples corresponding to the identifier intersection, a local model training intermediate result, and the partner model training intermediate result, wherein the local data samples are used in a predetermined order, the predetermined order being an order associated with the data identifiers; partner data samples corresponding to the identifier intersection are also used for partner model training; and the partner data samples are likewise used in the order associated with the data identifiers.
8. The apparatus of claim 7, wherein the apparatus further comprises:
a sample processing unit, configured to build a Merkle tree over the local full data samples and take the root of the Merkle tree as the hash result of the local full data samples.
9. The apparatus of claim 7, wherein the apparatus further comprises:
a hash mapping unit, configured to perform a hash mapping operation on the local vertical federated learning model to obtain the hash result of the local vertical federated learning model.
10. The apparatus of claim 7, wherein the data uploading unit is further configured to:
upload the data identifiers of the local data samples used during training of the local vertical federated learning model to the blockchain in training order.
11. The apparatus of claim 7, wherein after the identifier alignment operation is performed on the local full data samples and the partner full data samples, the data uploading unit is further configured to:
upload the partner identifier information received in the alignment operation to the blockchain.
12. The apparatus of claim 7, wherein the scheduling log comprises: partner and local scheduling information and running state information.
13. A blockchain-based vertical federated learning data processing system, the system comprising: a plurality of partner servers, a local server provided with the blockchain-based vertical federated learning data processing apparatus of any one of claims 7 to 12, and a blockchain, wherein the data processing apparatus uploads interaction information from the vertical federated learning model training process between the plurality of partner servers and the local server to the blockchain.
14. The system of claim 13, wherein the system further comprises an audit server, the audit server comprising:
an information obtaining unit, configured to obtain the information uploaded to the blockchain; and
an audit operation unit, configured to audit the participating parties according to the obtained information.
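The audit server of claim 14 can check a party against the on-chain commitments without re-running training: the party discloses its data and model, and the auditor recomputes the hashes and compares them with the values uploaded at training time. A minimal sketch reusing the merkle_root and model_hash helpers from the earlier sketches (the shape of chain_records is an assumption):

```python
def audit_party(chain_records: dict, disclosed_samples: list[bytes],
                disclosed_model: dict) -> bool:
    """True iff the disclosed data and model match the on-chain commitments."""
    data_ok = merkle_root(disclosed_samples).hex() == chain_records["data_root"]
    model_ok = model_hash(disclosed_model) == chain_records["model_hash"]
    return data_ok and model_ok
```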
15. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method of any one of claims 1 to 6.
16. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
CN202010971408.9A 2020-09-16 2020-09-16 Longitudinal federation learning data processing method, device and system based on block chain Active CN112132292B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010971408.9A CN112132292B (en) 2020-09-16 2020-09-16 Longitudinal federation learning data processing method, device and system based on block chain

Publications (2)

Publication Number Publication Date
CN112132292A CN112132292A (en) 2020-12-25
CN112132292B true CN112132292B (en) 2024-05-14

Family

ID=73845777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010971408.9A Active CN112132292B (en) 2020-09-16 2020-09-16 Longitudinal federation learning data processing method, device and system based on block chain

Country Status (1)

Country Link
CN (1) CN112132292B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112685395B (en) * 2020-12-28 2024-05-31 深圳前海微众银行股份有限公司 Unordered data deduplication method, device, equipment and medium in longitudinal federal statistics
CN112765677B (en) * 2020-12-30 2024-01-23 杭州溪塔科技有限公司 Federal learning method, device and system based on blockchain
CN113591097A (en) * 2021-01-21 2021-11-02 腾讯科技(深圳)有限公司 Service data processing method and device, electronic equipment and storage medium
CN113822436A (en) * 2021-03-12 2021-12-21 京东科技控股股份有限公司 Communication method and device for federal learning model training and electronic equipment
CN112733967B (en) * 2021-03-30 2021-06-29 腾讯科技(深圳)有限公司 Model training method, device, equipment and storage medium for federal learning
CN113094761B (en) * 2021-04-25 2022-02-08 中山大学 Method for monitoring federated learning data tamper-proofing and related device
CN113032817B (en) * 2021-05-21 2022-07-08 北京百度网讯科技有限公司 Data alignment method, device, equipment and medium based on block chain
CN113190871B (en) * 2021-05-28 2023-10-31 脸萌有限公司 Data protection method and device, readable medium and electronic equipment
CN113283990B (en) * 2021-06-03 2024-02-09 光大科技有限公司 Data sharing processing method and device
CN113469371B (en) * 2021-07-01 2023-05-02 建信金融科技有限责任公司 Federal learning method and apparatus
CN113469377B (en) * 2021-07-06 2023-01-13 建信金融科技有限责任公司 Federal learning auditing method and device
CN113468060B (en) * 2021-07-12 2022-09-16 建信金融科技有限责任公司 Program abnormity detection method and device based on recurrent thought
CN113364589B (en) * 2021-08-10 2021-11-02 深圳致星科技有限公司 Key management system, method and storage medium for federal learning security audit
CN113709014B (en) * 2021-08-10 2023-04-07 深圳致星科技有限公司 Data collection method, medium and device for federal study audit
CN113723623B (en) * 2021-08-10 2022-06-17 深圳致星科技有限公司 Federal learning auditing device, system and method
CN113836809B (en) * 2021-09-26 2023-12-01 上海万向区块链股份公司 Cross-industry data joint modeling method and system based on block chain and federal learning
CN113836559A (en) * 2021-09-28 2021-12-24 ***股份有限公司 Sample alignment method, device, equipment and storage medium in federated learning
CN114900325B (en) * 2022-03-25 2024-03-26 杭州博盾习言科技有限公司 Federal learning-based privacy set intersection method, system, equipment and medium
CN114785810B (en) * 2022-03-31 2023-05-16 海南师范大学 Tree-like broadcast data synchronization method suitable for federal learning
CN114819182B (en) * 2022-04-15 2024-05-31 支付宝(杭州)信息技术有限公司 Method, apparatus and system for training a model via multiple data owners
CN116070277B (en) * 2023-03-07 2023-08-29 浙江大学 Longitudinal federal learning privacy protection method and system based on deep hash

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492420A (en) * 2018-12-28 2019-03-19 深圳前海微众银行股份有限公司 Model parameter training method, terminal, system and medium based on federation's study
CN111180061A (en) * 2019-12-09 2020-05-19 广东工业大学 Intelligent auxiliary diagnosis system fusing block chain and federal learning shared medical data
CN111598186A (en) * 2020-06-05 2020-08-28 腾讯科技(深圳)有限公司 Decision model training method, prediction method and device based on longitudinal federal learning
CN111666576A (en) * 2020-04-29 2020-09-15 平安科技(深圳)有限公司 Data processing model generation method and device and data processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10270599B2 (en) * 2017-04-27 2019-04-23 Factom, Inc. Data reproducibility using blockchains

Also Published As

Publication number Publication date
CN112132292A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN112132292B (en) Longitudinal federation learning data processing method, device and system based on block chain
US11755693B1 (en) Authentication of encrypted media based on immutable ledgers
CN113159327B (en) Model training method and device based on federal learning system and electronic equipment
CN109819443B (en) Registration authentication method, device and system based on block chain
US20170317833A1 (en) Methods and apparatus for providing attestation of information using a centralized or distributed ledger
CN108876113B (en) Assessment management method, assessment management device, terminal equipment and assessment management system
CN111970277B (en) Flow identification method and device based on federal learning
CN111314172B (en) Block chain-based data processing method, device, equipment and storage medium
CN108830463B (en) Evaluation record storage method, device, storage medium and system
CN111445209A (en) Block chain-based electronic contract signing method and device and storage medium
CN111339201A (en) Evaluation method and system based on block chain
CN114119021A (en) Image file security multi-party calculation method and system
CN104168117A (en) Voice digital signature method
CN111464295A (en) Bank card making method and device
CN107370733A (en) A kind of intelligent lock management method based on Rijndael and ECC Hybrid Encryptions
CN111680968B (en) Building rights management system and method based on block chain
US11683180B1 (en) Protecting digital media with nested hashing techniques
CN109660357A (en) Digital asset register method, verification method, device, equipment and storage medium
CN112948815A (en) Off-line weak password checking method and device based on Hash matching
CN113158259A (en) Block chain integrity verification method and device
CN113381909A (en) Full link voltage measuring method and device
CN116506227B (en) Data processing method, device, computer equipment and storage medium
CN115022030B (en) Bank business handling request processing method and device based on blockchain
CN112907243B (en) Block chain transaction auditing method and device
US11594146B2 (en) Agent for online training in an offline environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant