CN112598138B - Data processing method and device, federal learning system and electronic equipment

Info

Publication number: CN112598138B
Application number: CN202011528941.4A
Authority: CN
Inventors: 何恺; 蒋精华; 杨青友; 洪爵
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2023-07-21
Anticipated expiration: 2040-12-22
Also published as: CN112598138A

Abstract

The disclosure provides a data processing method, a data processing device, a federal learning system and electronic equipment, and relates to artificial intelligence fields such as deep learning and big data processing. The specific implementation scheme is as follows: a first participant in the federal learning system interacts with a second participant in the federal learning system based on an oblivious transfer (OT) protocol to obtain an oblivious pseudo-random function (OPRF) seed; the first participant determines OPRF output information of the first participant based on the OPRF seed and a data identification set of the first participant; the first participant sends the OPRF output information of the first participant, wherein the OPRF output information of the first participant is used to determine an intersection of the data identification sets of the federal learning system. This technical scheme can improve data security.

Description

Data processing method and device, federal learning system and electronic equipment

Technical Field

The present disclosure relates to the field of computer technology, and in particular, to the field of artificial intelligence for deep learning and big data processing.

Background

Machine learning is widely applied in fields such as finance and medical treatment. Its good results in these fields are owed to rapid advances in the related technology, the fast growth of hardware computing capacity, and the explosive growth of data. Federal learning (also called federated learning) uses data from multiple institutions for joint analysis or joint modeling while satisfying user privacy protection, data security, and the relevant regulations. Before federal learning, the multiple institutions, i.e., the participants in federal learning, usually need to perform data alignment. Data alignment refers to aligning the data that have the same user identity in the data sets of multiple participants, and includes the process of determining the user identities that the data sets of the multiple participants have in common.

Disclosure of Invention

The disclosure provides a data processing method, a data processing device, a federal learning system and electronic equipment.

According to an aspect of the present disclosure, there is provided a data processing method including:

a first participant in the federal learning system interacts with a second participant in the federal learning system based on an oblivious transfer (OT) protocol to obtain an oblivious pseudo-random function (OPRF) seed;

the first participant determines OPRF output information of the first participant based on the OPRF seed and a data identification set of the first participant;

the first participant sends the OPRF output information of the first participant; wherein the OPRF output information of the first participant is used to determine an intersection of the data identification sets of the federal learning system.

According to another aspect of the present disclosure, there is provided a data processing apparatus for use with a first party in a federal learning system, the apparatus comprising:

an interaction module, configured to interact with a second participant in the federal learning system based on an oblivious transfer (OT) protocol to obtain an oblivious pseudo-random function (OPRF) seed;

a first determining module, configured to determine OPRF output information of the first participant based on the OPRF seed and a data identification set of the first participant;

a first sending module, configured to send the OPRF output information of the first participant; wherein the OPRF output information of the first participant is used to determine an intersection of the data identification sets of the federal learning system.

According to another aspect of the present disclosure, there is provided a federal learning system including a first party and a second party;

the second party is used for interacting with the first party based on the OT protocol and a data identification set of the second party to obtain OPRF output information of the second party;

the first participant is used for interacting with the second participant based on the OT protocol to obtain an OPRF seed, determining OPRF output information of the first participant based on the OPRF seed and a data identification set of the first participant, and sending the OPRF output information of the first participant;

wherein the OPRF output information of the first party and the OPRF output information of the second party are used to determine an intersection of the set of data identities of the federal learning system.

According to another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods provided by embodiments of the present application.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the methods provided by the embodiments of the present application.

According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method provided by the embodiments of the present application.

According to the technical scheme, the data identification information of each participant can be protected from being leaked in the data alignment process of federal learning, and the data safety is improved.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram of a data processing method provided in accordance with one embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a data processing method provided in accordance with another embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a data processing method provided in accordance with yet another embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a federal learning system provided in accordance with one embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a federal learning system provided in accordance with another embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a data processing apparatus provided in accordance with one embodiment of the present disclosure;

FIG. 7 is a schematic diagram of a data processing apparatus provided in accordance with another embodiment of the present disclosure;

FIG. 8 is a block diagram of an electronic device for implementing a data processing method of an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Conventionally, to avoid revealing the user data it holds, each participant of a federal learning system performs a hash computation on the user identifiers of its data and discloses the resulting hash values to the other participants. Each participant then compares the hash values disclosed by the other participants with the hash values of its own user identifiers to determine whether an intersection exists, that is, whether the participants hold data corresponding to the same user identifier. However, when the number of possible user identifiers is limited, a participant can enumerate candidate identifiers and reversely deduce the hash inputs from the hash values disclosed by the other participants. The participant can then not only determine the data with the same user identifier but also obtain the other user identifiers held by the other participants, which creates a data security risk.
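The risk described above can be illustrated with a minimal sketch. The 4-digit identifier space, the use of SHA-256, and the names hash_id and recover_ids are assumptions made only for this illustration; they are not taken from the disclosure.

```python
import hashlib


def hash_id(user_id: str) -> str:
    """Plain, unkeyed hash that a participant would publish for one identifier."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def recover_ids(published_hashes: set[str], candidate_ids) -> set[str]:
    """Enumerate a small identifier space and keep every candidate whose hash was published."""
    return {c for c in candidate_ids if hash_id(c) in published_hashes}


# Hypothetical toy identifier space: the 4-digit strings "0000" to "9999".
candidate_space = [f"{i:04d}" for i in range(10_000)]

# Identifiers held by another participant; only their hashes are disclosed.
other_party_ids = {"0042", "1234", "9876"}
published = {hash_id(u) for u in other_party_ids}

# Any receiver of the hashes can recover the whole set, not only the overlap.
recovered = recover_ids(published, candidate_space)
```

Because the hash is unkeyed, the cost of this enumeration grows only with the size of the identifier space, which is why a limited space such as phone numbers offers little protection.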

To this end, the embodiments of the present disclosure provide a data processing method. Referring to fig. 1, the method includes:

step S11, a first participant in the federal learning system interacts with a second participant in the federal learning system based on an OT (Oblivious Transfer) protocol to obtain an OPRF (Oblivious Pseudorandom Function) seed;

step S12, the first participant determines OPRF output information of the first participant based on the OPRF seed and the data identification set of the first participant;

step S13, the first participant transmits OPRF output information of the first participant; wherein the OPRF output information of the first participant is used to determine an intersection of the set of data identifications of the federal learning system.

The OT protocol is a two-party communication protocol capable of protecting private data. In the embodiment of the application, two parties interact based on the OT protocol, wherein one party obtains the OPRF seed and the other party obtains the OPRF output information of its own input data. Illustratively, the second participant inputs its own data identification set into the OT protocol, and the OT protocol returns the OPRF output information corresponding to that data identification set. That is, the second participant may obtain the OPRF output information of the second participant based on the OT protocol and the data identification set of the second participant.

Illustratively, the data identification set may include identification information of each piece of user data, for example a unique identifier such as the user's mobile phone number or identity card number. In some examples, the identification information may also be referred to as a row ID (identifier), i.e., information that identifies each row of data.

Illustratively, the OPRF output information for a participant may include OPRF output information for each data identification in the set of data identifications for that participant.

The OPRF seed can be used to obtain the OPRF output information corresponding to arbitrary data. In the embodiment of the present application, the first participant obtains the OPRF seed, so that the OPRF output information corresponding to the data identification set of the first participant can be calculated.
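As a minimal sketch of the idea that the holder of the seed can evaluate the function on arbitrary data, the following uses HMAC-SHA256 as a stand-in keyed function. This is an assumption for illustration only: in the scheme described here the seed is obtained through the OT protocol, and the second participant obtains its outputs through that protocol without holding the seed. The names prf and oprf_outputs and the sample identifiers are hypothetical.

```python
import hashlib
import hmac
import secrets


def prf(seed: bytes, data_id: str) -> bytes:
    """Keyed pseudo-random function; HMAC-SHA256 is an illustrative stand-in only."""
    return hmac.new(seed, data_id.encode("utf-8"), hashlib.sha256).digest()


def oprf_outputs(seed: bytes, id_set: set[str]) -> set[bytes]:
    """Evaluate the keyed function on every identifier in the seed holder's set."""
    return {prf(seed, d) for d in id_set}


seed = secrets.token_bytes(32)          # hypothetical seed held by the first participant
first_ids = {"alice", "bob", "carol"}   # hypothetical data identifications
first_outputs = oprf_outputs(seed, first_ids)
```

A party without the seed cannot recompute these values for candidate identifiers, which is the property that the plain hashing scheme lacks.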

The first participant sends its own OPRF output information. The receiver of that information, such as another participant or an intermediate party, may then determine, based on the OPRF output information of the first participant and the OPRF output information of the other participants, whether the corresponding data identification sets intersect and which elements are in the intersection. Since the receiver does not hold the OPRF seed, the OPRF output information of the first participant looks like random numbers to the receiver, and the receiver cannot obtain the data in the data identification set of the first participant. In addition, although the first participant holds the OPRF seed, it only sends OPRF output information and the receiver determines the intersection; the first participant does not obtain the specific data of the other participants, so the data of the other participants is not revealed either.

In the embodiment of the application, the data identification set of the federal learning system includes the data identification sets of the participants in the federal learning system. In practical applications, where the federal learning system includes two participants, the intersection of the data identification sets of the federal learning system may be the intersection of the data identification set of the first participant and the data identification set of the second participant. In the case where the federal learning system includes at least three participants, the intersection of the data identification sets of the federal learning system may be the intersection of the data identification sets of the at least three participants, which include the first participant and the second participant.

It can be seen that the above method implements PSI (Private Set Intersection) and obtains the intersection of the data identification sets of the federal learning system as the data alignment result based on the PSI. Therefore, the data identification information of each participant can be protected from being revealed, and the data security is improved.

In some examples, the federal learning system includes a first party and a second party. In the step S13, the first party sends the OPRF output information of the first party, which may include:

the first party sends OPRF output information of the first party to the second party; the second party is used for determining the intersection of the data identification sets of the federal learning system according to the OPRF output information of the first party and the OPRF output information of the second party.

For details, reference may be made to fig. 2. As shown in fig. 2, after the first participant and the second participant interact based on the OT protocol, the first participant determines, based on the obtained OPRF seed, the OPRF output information of the first participant, which includes the OPRF output information corresponding to each element in the data identification set of the first participant. The first participant sends the OPRF output information to the second participant. Since the second participant obtains its OPRF output information based on the OT protocol, the second participant also holds the OPRF output information of each element in its own data identification set.

The specific processing procedure of the second participant to obtain the intersection of the data identification set of the federal learning system based on the OPRF output information of the second participant and the OPRF output information of the first participant may include:

the second party initializes the intersection of the data identification sets of the federal learning system to the empty set, then traverses each element in its own data identification set and determines whether the OPRF output information of the traversed element is in the OPRF output information of the first party. If so, the element is added to the intersection. After all elements have been traversed, the intersection of the data identification sets of the first and second parties, i.e., the intersection of the data identification sets of the federal learning system, is obtained. The second party sends the intersection to the first party, so that the first party also obtains the intersection, and the two parties can share the data corresponding to the identification information in the intersection, for example to perform joint analysis or joint modeling.
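The traversal performed by the second party can be sketched as follows. The function name intersect_by_oprf and the short byte strings standing in for OPRF output information are placeholders chosen for illustration; real OPRF outputs would come from the protocol.

```python
def intersect_by_oprf(own_outputs: dict[str, bytes], received_outputs: set[bytes]) -> set[str]:
    """Return the own identifiers whose OPRF output also appears in the received outputs."""
    intersection: set[str] = set()            # start from the empty set
    for data_id, output in own_outputs.items():
        if output in received_outputs:         # membership test on opaque outputs only
            intersection.add(data_id)
    return intersection


# Placeholder values: the second party knows the output of each of its own elements.
second_outputs = {"bob": b"\x01", "carol": b"\x02", "dave": b"\x03"}
# Outputs received from the first party; the underlying identifiers are not visible.
first_outputs = {b"\x02", b"\x03", b"\x09"}

result = intersect_by_oprf(second_outputs, first_outputs)
```

Storing the received outputs in a set keeps each membership test at roughly constant cost, so the traversal is linear in the size of the second party's own set.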

According to the method, the intersection of the data identification sets of the federal learning system can be determined on the basis of not revealing the data of the first participant and the second participant, and the data security is improved.

In some examples, the federal learning system may include at least three participants, for example four. In that case, two participants may be selected as the first participant and the second participant and the above method is performed, so that both obtain the intersection of their data identification sets. One of these two is then taken as the new first participant, another participant in the federal learning system is selected as the new second participant, and the method is performed again, so that both obtain the intersection of the new first participant and the new second participant. By analogy, after the method has been performed with all participants in the federal learning system, the intersection of the data identification sets of the federal learning system is obtained.
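One way to read this chained procedure is sketched below, under the assumption (not stated explicitly in the text) that the participant carried forward uses the intersection obtained so far as its input to the next run. The function pairwise_psi is a hypothetical placeholder for one run of the two-party method and simply computes a plain set intersection.

```python
from functools import reduce


def pairwise_psi(ids_a: set[str], ids_b: set[str]) -> set[str]:
    """Placeholder for one run of the two-party method; here a plain set intersection."""
    return ids_a & ids_b


def chained_intersection(id_sets: list[set[str]]) -> set[str]:
    """Fold the two-party step over all participants, carrying the running intersection forward."""
    return reduce(pairwise_psi, id_sets)


# Hypothetical data identification sets of four participants.
parties = [
    {"u1", "u2", "u3", "u4"},
    {"u2", "u3", "u4"},
    {"u3", "u4", "u5"},
    {"u4", "u3"},
]
system_intersection = chained_intersection(parties)
```

In this reading each run reveals the pairwise intersection to the two participants involved, which is the limitation that the intermediate-party embodiment addresses.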

In an embodiment of the present application, another implementation is provided for determining intersections of data identification sets of multiple participants. Specifically, the data processing method may further include:

the first party sending the OPRF seed to a third party in the federal learning system; the third party is used for obtaining OPRF output information of the third party based on the OPRF seed and the data identification set of the third party, and sending the OPRF output information of the third party to the intermediate party;

Accordingly, in step S13, the first party sends the OPRF output information of the first party, including:

the first party sends OPRF output information of the first party to the intermediate party; the intermediate party is used for determining an OPRF output information intersection based on OPRF output information of the first party and the third party, and the OPRF output information intersection is used for determining an intersection of a data identification set of the federal learning system.

Wherein the number of third parties may be a plurality. The first party sends the OPRF seeds to a plurality of third parties, each third party in the plurality of third parties obtains OPRF output information of the third party based on the OPRF seeds and the data identification set of the third party, and sends the OPRF output information of the third party to the intermediate party.

For example, for a federal learning system comprising m participants party-1 to party-m:

First, the first participant party-1 and the second participant party-m interact based on the OT protocol. party-1 obtains an OPRF seed (C, s, Q), wherein C is a random number sampled by party-1, s is a random key in the OT protocol, and Q is the Q matrix in the OT protocol. party-m obtains OPRF output information (C, T0, T1), wherein T0 and T1 are the T matrices in the OT protocol, and the elements in T0 and T1 are related to the input data of party-m together with s and Q.

Then party-1 broadcasts (C, s, Q) to the third participants, i.e., party-2 to party-(m-1), so that party-1 to party-(m-1) all hold the OPRF seed. Based on the OPRF seed and its own data identification set, each of party-1 to party-(m-1) obtains the corresponding OPRF output information $\{H_k, S_k\}_{k \in \{1, \dots, m-1\}}$.

party-1 to party-(m-1) send the OPRF output information to a semi-honest intermediate party, and the intermediate party aggregates it to obtain the OPRF output information intersection:

$$\Big\{ H = \bigcap_{k \in \{1, \dots, m-1\}} H_k,\quad S = \bigcap_{k \in \{1, \dots, m-1\}} S_k \Big\}.$$

based on the OPRF output information intersection { H, S } and the OPRF output information (C, T0, T1) of the second party-m, an intersection of the set of data identifications of the federated learning system may be determined.
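The aggregation performed by the intermediate party can be sketched as a component-wise set intersection. The internal structure of H_k and S_k is not specified in this passage, so they are modeled here simply as sets of opaque byte strings; the function name aggregate and the sample values are hypothetical.

```python
def aggregate(outputs: list[tuple[set[bytes], set[bytes]]]) -> tuple[set[bytes], set[bytes]]:
    """Intersect the H_k components and the S_k components received from party-1..party-(m-1)."""
    if not outputs:
        raise ValueError("at least one participant's output information is required")
    h_sets = [h for h, _ in outputs]
    s_sets = [s for _, s in outputs]
    return set.intersection(*h_sets), set.intersection(*s_sets)


# Placeholder output information (H_k, S_k) from three participants.
received = [
    ({b"h1", b"h2", b"h3"}, {b"s1", b"s2"}),
    ({b"h2", b"h3"}, {b"s2", b"s3"}),
    ({b"h3", b"h2", b"h9"}, {b"s2"}),
]
H, S = aggregate(received)
```

The intermediate party only handles opaque outputs in this sketch, which matches the role of a semi-honest aggregator that does not hold the OPRF seed.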

In practical applications, the second participant may send its OPRF output information and the corresponding data identification set to the intermediate party, and the intermediate party determines the intersection of the data identification sets of the federal learning system. Alternatively, the intermediate party sends the OPRF output information intersection to the second participant, and the second participant determines the intersection of the data identification sets of the federal learning system, which preserves the privacy of the second participant. The process of obtaining the intersection of the data identification sets from the OPRF output information of the second participant and the OPRF output information intersection may be implemented with reference to the process, described above, of obtaining the intersection from the OPRF output information of the second participant and the OPRF output information of the first participant.

According to the above embodiment, when there are at least three participants, the first participant broadcasts the OPRF seed to the third participants, so that all participants other than the second participant hold the OPRF seed and each calculates its own OPRF output information. The intermediate party then calculates the OPRF output information intersection of these participants, and the intersection of the data identification sets of the system is determined based on that OPRF output information intersection. Therefore, each participant can obtain only the data intersection of the whole system and cannot obtain its intersection with any single other participant, which further improves data security.

Illustratively, the data processing method may further include:

the first participant receives the intersection of the data identification sets of all the participants in the federal learning system, which is sent by the second participant; the second party is used for receiving the OPRF output information intersection sent by the intermediate party and determining the intersection of the data identification set of the federal learning system based on the OPRF output information intersection and the OPRF output information of the second party.

According to this embodiment, the intermediary determines an intersection of the OPRF output information and then transmits the intersection to the second party, which determines an intersection of the set of data identifiers of the federal learning system and transmits the intersection to the first party. The second party may also send the intersection to a third party, for example, so that each party in the federal learning system can obtain the intersection.

According to the embodiment, the second party obtains the intersection of the whole system based on its own data identification set, its own OPRF output information and the OPRF output information intersection, so that the second party does not need to send its data identification set. The private data of the second party is thereby protected, and the data security is further improved.

In some examples, the data identification sets of the participants in the federal learning system may first be padded so that they have the same set size (i.e., the same number of elements), in order to determine the OPRF output information of each participant. For example, party A inputs set X and party B inputs set Y, and randomly generated data identifications are added to X and Y so that |X| = |Y| = n.
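A minimal sketch of equalizing set sizes with random dummy identifications follows. The "dummy-" prefix, the 128-bit random suffix, and the function name pad_to_size are assumptions for illustration; the text does not specify how the added identifications are generated.

```python
import secrets


def pad_to_size(id_set: set[str], n: int) -> set[str]:
    """Return a copy of id_set extended with random dummy identifications up to size n."""
    if len(id_set) > n:
        raise ValueError("target size is smaller than the set")
    padded = set(id_set)
    while len(padded) < n:
        # 128 random bits make a collision with a real or foreign dummy identifier negligible.
        padded.add("dummy-" + secrets.token_hex(16))
    return padded


X = {"a", "b"}                 # hypothetical set of party A
Y = {"b", "c", "d"}            # hypothetical set of party B
n = max(len(X), len(Y))
X_padded = pad_to_size(X, n)
Y_padded = pad_to_size(Y, n)
```

Since each party draws its dummies independently at random, the padding hides the original set sizes without changing the intersection, except with negligible probability.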

In other examples, when the above method is performed, the set size of the second participant in the OT protocol, i.e., the number of elements in its set, is sent to the first participant, so that the first participant can still determine the OPRF output information accurately; on this basis, the data identification sets of the participants need not be padded. Illustratively, the data processing method may further include:

the first participant receives the set size of the second participant;

Accordingly, the first party determines OPRF output information of the first party based on the OPRF seed and the data identification set of the first party, comprising:

the first party determines OPRF output information for the first party based on the OPRF seed, the first party's data identification set, and the second party's set size.

According to the embodiment, the data identification sets of the participants do not need to be padded, so that the amount of data transmitted and the amount of computation can be reduced.

As an implementation of the above methods, the embodiments of the present disclosure further provide a federal learning system. As shown in fig. 3, the system may include a first participant 310 and a second participant 320;

a second participant 320, configured to interact with the first participant 310 based on the OT protocol and a data identification set of the second participant 320, so as to obtain OPRF output information of the second participant 320;

the first participant 310 is configured to interact with the second participant 320 based on the OT protocol to obtain an OPRF seed, determine OPRF output information of the first participant 310 based on the OPRF seed and a data identification set of the first participant 310, and send the OPRF output information of the first participant 310;

wherein the OPRF output information of the first participant 310 and the OPRF output information of the second participant 320 are used to determine an intersection of the set of data identifications of the federal learning system.

Illustratively, the first participant 310 is configured to transmit the OPRF output information of the first participant 310 to the second participant 320;

the second party 320 is configured to determine an intersection of the set of data identifications of the federal learning system based on the OPRF output information of the first party 310 and the OPRF output information of the second party 320.

Illustratively, as shown in fig. 4, the federal learning system further includes a third party 410 and an intermediary party 420;

the first party 310 is also configured to send the OPRF seed to a third party 410 in the federal learning system, and to send OPRF output information of the first party 310 to an intermediary party 420;

the third party 410 is configured to obtain the OPRF output information of the third party 410 based on the OPRF seed and the data identification set of the third party 410, and send the OPRF output information of the third party 410 to the intermediate party 420;

the intermediary 420 is configured to determine an OPRF output information intersection based on the OPRF output information of the first participant 310 and the third participant 410, the OPRF output information intersection being configured to determine an intersection of the set of data identifications of the federal learning system.

Illustratively, the intermediary 420 is also configured to send the OPRF output information intersection to the second participant 320;

the second party 320 is further configured to receive the OPRF output information intersection sent by the intermediate party 420, determine an intersection of the data identification set of the federal learning system based on the OPRF output information intersection and the OPRF output information of the second party 320, and send the intersection of the data identification set of the federal learning system to the first party 310 and the third party 410.

Illustratively, the first party 310 is further configured to receive a set size of the second party 320, and determine the OPRF output information of the first party 310 based on the OPRF seed, the set of data identifications of the first party 310, and the set size of the second party 320.

Fig. 5 shows the interaction flow of the above-mentioned participants in the federal learning system. As shown in fig. 5, the interaction flow includes:

step S51: the first participant interacts with the second participant based on an OT protocol to obtain an OPRF seed;

step S52: the first party sends the OPRF seed to the third party;

step S53: the first participant determines OPRF output information of the first participant based on the OPRF seed and a data identification set of the first participant;

step S54: the first party sends OPRF output information of the first party to the intermediate party;

step S55: the third party obtains OPRF output information of the third party based on the OPRF seed and the data identification set of the third party;

step S56: the third party sends the OPRF output information of the third party to the intermediate party;

step S57: the intermediary determines an OPRF output information intersection based on the OPRF output information of the first participant and the third participant;

step S58: the second party receives the OPRF output information intersection sent by the intermediate party;

step S59: the second party determines an intersection of the set of data identifications of the federal learning system based on the OPRF output information intersection and the OPRF output information of the second party;

step S60: the second party sends an intersection of the set of data identifications of the federal learning system to the first party and the third party.
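The flow of steps S51 to S60 can be simulated in a single process as sketched below. Two simplifications are assumed and are not part of the disclosure: HMAC-SHA256 stands in for the OPRF, and the OT interaction of step S51 is replaced by directly computing the second party's outputs with the seed, whereas in the described scheme the second party obtains its outputs through the OT protocol without holding the seed. The names prf and run_flow are hypothetical.

```python
import hashlib
import hmac
import secrets


def prf(seed: bytes, data_id: str) -> bytes:
    """Stand-in keyed function (HMAC-SHA256), used here in place of the OPRF."""
    return hmac.new(seed, data_id.encode("utf-8"), hashlib.sha256).digest()


def run_flow(first_ids: set[str], third_ids_list: list[set[str]], second_ids: set[str]) -> set[str]:
    # S51 (simulated): the first party ends up with a seed, the second party with
    # the outputs of its own elements. A real run would use the OT protocol here.
    seed = secrets.token_bytes(32)
    second_outputs = {d: prf(seed, d) for d in second_ids}

    # S52, S53, S55: the seed is shared with the third parties and every seed
    # holder computes the outputs of its own data identification set.
    first_out = {prf(seed, d) for d in first_ids}
    third_outs = [{prf(seed, d) for d in ids} for ids in third_ids_list]

    # S54, S56, S57: the intermediate party intersects the received outputs.
    common = set.intersection(first_out, *third_outs)

    # S58, S59: the second party maps the common outputs back to its own elements.
    # S60: the returned set is what the second party would send to the others.
    return {d for d, out in second_outputs.items() if out in common}


aligned = run_flow({"a", "b", "c"}, [{"b", "c", "d"}, {"c", "b", "e"}], {"b", "c", "z"})
```

In this simulation the intermediate party sees only opaque outputs and the second party learns only which of its own elements are common to all participants.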

The federal learning system provided by the embodiment of the application can execute the data processing method and has corresponding technical effects. In practical applications, the federal learning system may be set accordingly with reference to the above-described data processing method.

As an implementation of the foregoing methods, an embodiment of the present disclosure further provides a data processing apparatus, where the apparatus is applied to a first participant in a federal learning system, as shown in fig. 6, and the apparatus includes:

an interaction module 610, configured to interact with a second participant in the federal learning system based on an oblivious transfer (OT) protocol to obtain an oblivious pseudo-random function (OPRF) seed;

a first determining module 620, configured to determine OPRF output information of the first participant based on the OPRF seed and a data identification set of the first participant;

a first transmitting module 630, configured to transmit the OPRF output information of the first participant; wherein the OPRF output information of the first participant is used to determine an intersection of the set of data identifications of the federal learning system.

Illustratively, the first transmitting module 630 is configured to:

transmitting the OPRF output information of the first party to the second party; the second party is used for determining intersection of the data identification set of the federal learning system according to the OPRF output information of the first party and the OPRF output information of the second party.

Illustratively, as shown in FIG. 7, the apparatus further comprises:

a second transmitting module 710 for transmitting the OPRF seed to a third party in the federal learning system; the third party is used for obtaining OPRF output information of the third party based on the OPRF seed and the data identification set of the third party, and sending the OPRF output information of the third party to the intermediate party;

accordingly, the first sending module 630 is configured to:

transmitting the OPRF output information of the first participant to the intermediate party; the intermediate party is used for determining an OPRF output information intersection based on OPRF output information of the first party and the third party, and the OPRF output information intersection is used for determining an intersection of a data identification set of the federal learning system.

Illustratively, as shown in FIG. 7, the apparatus further comprises:

a first receiving module 720, configured to receive an intersection of data identifier sets of each participant in the federal learning system sent by the second participant; the second party is used for receiving the OPRF output information intersection sent by the intermediate party, and determining the intersection of the data identification set of the federal learning system based on the OPRF output information intersection and the OPRF output information of the second party.
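
The multi-party flow described above (first and third participants send OPRF outputs to the intermediate party, which forwards the OPRF output intersection to the second participant) can be sketched as follows. This is an illustration under assumptions: HMAC-SHA256 with a fixed demo seed replaces the real OPRF, the OT exchange and network transport are omitted, and every name is hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Set


def stand_in_prf(seed: bytes, identifier: str) -> bytes:
    """Assumed stand-in for the OPRF evaluation (HMAC-SHA256)."""
    return hmac.new(seed, identifier.encode("utf-8"), hashlib.sha256).digest()


demo_seed = b"demo-seed-for-illustration-only"

# First participant: evaluates on its own set and shares the seed with the third.
first_out: Set[bytes] = {stand_in_prf(demo_seed, i) for i in {"a", "b", "c", "d"}}

# Third participant: evaluates on its own set using the received seed.
third_out: Set[bytes] = {stand_in_prf(demo_seed, i) for i in {"b", "c", "d", "e"}}

# Intermediate party: intersects OPRF outputs only; it sees no identifiers.
oprf_intersection: Set[bytes] = first_out & third_out

# Second participant: maps the OPRF output intersection back to identifiers
# using the outputs it holds for its own elements.
second_out: Dict[str, bytes] = {
    i: stand_in_prf(demo_seed, i) for i in {"c", "d", "f"}
}
final_intersection: Set[str] = {
    identifier
    for identifier, output in second_out.items()
    if output in oprf_intersection
}
```

In this sketch the second participant is the only party that converts outputs back into identifiers, which matches its role of sending the final intersection to the other participants.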

Illustratively, as shown in FIG. 7, the apparatus further comprises:

a second receiving module 730, configured to receive a set size of the second participant;

accordingly, the first determining module 620 is configured to:

the OPRF output information of the first party is determined based on the OPRF seed, the set of data identifications of the first party, and the set size of the second party.
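
This passage does not state how the set size enters the computation. One plausible use, assumed here purely for illustration, is to choose the OPRF output length from the two set sizes so that accidental collisions between outputs of different identifiers are unlikely. The 40-bit statistical security margin, the HMAC-SHA256 stand-in PRF, and all names below are assumptions, not details taken from the disclosure.

```python
import hashlib
import hmac
import math
from typing import Iterable, Set

STAT_SECURITY_BITS = 40  # assumed statistical security margin


def output_length_bytes(n_first: int, n_second: int) -> int:
    """Output length covering the margin plus log2 of each set size."""
    bits = (
        STAT_SECURITY_BITS
        + math.ceil(math.log2(max(n_first, 1)))
        + math.ceil(math.log2(max(n_second, 1)))
    )
    return (bits + 7) // 8  # round up to whole bytes


def truncated_outputs(seed: bytes, ids: Iterable[str], length: int) -> Set[bytes]:
    """Stand-in PRF outputs truncated to the chosen length."""
    return {
        hmac.new(seed, i.encode("utf-8"), hashlib.sha256).digest()[:length]
        for i in ids
    }


demo_seed = b"demo-seed-for-illustration-only"
first_ids = ["alice", "bob", "carol"]
second_set_size = 1000  # value received from the second participant

length = output_length_bytes(len(first_ids), second_set_size)
outputs = truncated_outputs(demo_seed, first_ids, length)
```

Shorter outputs reduce the amount of data the first participant transmits, while the length still grows with the set sizes to keep the collision probability low.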

The data processing device provided by the embodiment of the application can realize the data processing method provided by the embodiment of the application and has the corresponding beneficial effects.

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.

Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 801 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as the data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the data processing method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of the flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.

The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

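1. A data processing method, comprising:

a first party in a federal learning system interacting with a second party in the federal learning system based on an oblivious transfer (OT) protocol to obtain an oblivious pseudo-random function (OPRF) seed;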

the first party determining OPRF output information for the first party based on the OPRF seed and a set of data identifications for the first party;

the first participant transmits OPRF output information of the first participant; wherein the OPRF output information of the first participant is used to determine an intersection of a set of data identifications of the federal learning system;

the first party transmitting OPRF output information for the first party, comprising:

the first party sends the OPRF output information of the first party to the second party; the second participant is used for determining an intersection of the data identification set of the federal learning system according to the OPRF output information of the first participant and the OPRF output information of the second participant;

the second party determining an intersection of a set of data identifications of the federal learning system based on the OPRF output information of the first party and the OPRF output information of the second party, comprising:

the second participant initializes an intersection of a data identification set of the federal learning system to an empty set, traverses each element in the data identification set of the second participant, and determines, for each traversed element, whether OPRF output information of the element is in the OPRF output information of the first participant;

if yes, adding the element into an intersection of a data identification set of the federal learning system; and after traversing all elements in the data identification set of the second party, obtaining an intersection of the data identification set of the federal learning system.

2. The method of claim 1, further comprising:

the first party sending the OPRF seed to a third party in the federal learning system; the third party is used for obtaining OPRF output information of the third party based on the OPRF seed and the data identification set of the third party, and sending the OPRF output information of the third party to an intermediate party;

Correspondingly, the first participant transmits the OPRF output information of the first participant, including:

the first party sends OPRF output information of the first party to the intermediate party; wherein the intermediary party is configured to determine an OPRF output information intersection based on the OPRF output information of the first and third parties, the OPRF output information intersection being used to determine an intersection of a set of data identifications of the federal learning system.

3. The method of claim 2, further comprising:

the first participant receives an intersection of the data identification sets of the federal learning system transmitted by the second participant; the second participant is configured to receive the intersection of the OPRF output information sent by the intermediate party, and determine an intersection of the data identification set of the federal learning system based on the intersection of the OPRF output information and the OPRF output information of the second participant.

4. A method according to any one of claims 1-3, further comprising:

the first party receives the set size of the second party;

accordingly, the first participant determines, based on the OPRF seed and the data identification set of the first participant, OPRF output information of the first participant, including:

The first party determines OPRF output information for the first party based on the OPRF seed, the first party's set of data identifications, and the second party's set size.

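5. A data processing apparatus for use with a first party in a federal learning system, the apparatus comprising:

an interaction module, configured to interact with a second party in the federal learning system based on an oblivious transfer (OT) protocol to obtain an oblivious pseudo-random function (OPRF) seed;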

a first determining module, configured to determine, by the first participant, OPRF output information of the first participant based on the OPRF seed and a data identification set of the first participant;

a first sending module, configured to send OPRF output information of the first participant; wherein the OPRF output information of the first participant is used to determine an intersection of a set of data identifications of the federal learning system;

the first sending module is used for:

transmitting the OPRF output information of the first party to the second party; wherein the second participant is configured to determine an intersection of the set of data identifications of the federal learning system according to the OPRF output information of the first participant and the OPRF output information of the second participant; the second participant is specifically configured to initialize an intersection of a data identifier set of the federal learning system to an empty set, traverse each element in the data identifier set of the second participant, and determine, for each traversed element, whether OPRF output information of the element is in the OPRF output information of the first participant; if yes, adding the element into an intersection of a data identification set of the federal learning system; and after traversing all elements in the data identification set of the second party, obtaining an intersection of the data identification set of the federal learning system.

6. The apparatus of claim 5, further comprising:

a second sending module for sending the OPRF seed to a third party in the federal learning system; the third party is used for obtaining OPRF output information of the third party based on the OPRF seed and the data identification set of the third party, and sending the OPRF output information of the third party to an intermediate party;

correspondingly, the first sending module is used for:

transmitting the OPRF output information of the first party to the intermediary party; wherein the intermediary party is configured to determine an OPRF output information intersection based on the OPRF output information of the first and third parties, the OPRF output information intersection being used to determine an intersection of a set of data identifications of the federal learning system.

7. The apparatus of claim 6, further comprising:

a first receiving module, configured to receive an intersection of a set of data identifiers of the federal learning system sent by the second party; the second participant is configured to receive the intersection of the OPRF output information sent by the intermediate party, and determine an intersection of the data identification set of the federal learning system based on the intersection of the OPRF output information and the OPRF output information of the second participant.

8. The apparatus of any of claims 5-7, further comprising:

a second receiving module, configured to receive a set size of the second party;

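correspondingly, the first determining module is used for:

determining the OPRF output information of the first party based on the OPRF seed, the data identification set of the first party, and the set size of the second party.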

9. A federal learning system comprising a first party and a second party;

the second participant is used for interacting with the first participant based on an OT protocol and a data identification set of the second participant to obtain OPRF output information of the second participant;

the first participant is configured to interact with the second participant based on an OT protocol to obtain an OPRF seed, determine OPRF output information of the first participant based on the OPRF seed and a data identifier set of the first participant, and send the OPRF output information of the first participant;

wherein the OPRF output information of the first participant and the OPRF output information of the second participant are used to determine an intersection of a set of data identifications of the federal learning system;

wherein the first participant is configured to send OPRF output information of the first participant to the second participant;

the second participant is used for determining an intersection of the data identification set of the federal learning system according to the OPRF output information of the first participant and the OPRF output information of the second participant;

the second participant is specifically configured to initialize an intersection of a data identifier set of the federal learning system to an empty set, traverse each element in the data identifier set of the second participant, and determine, for each traversed element, whether OPRF output information of the element is in the OPRF output information of the first participant; if yes, adding the element into an intersection of a data identification set of the federal learning system; and after traversing all elements in the data identification set of the second party, obtaining an intersection of the data identification set of the federal learning system.

10. The federal learning system according to claim 9, further comprising a third party and an intermediary;

the first party is further configured to send the OPRF seed to a third party in the federal learning system, and send OPRF output information of the first party to the intermediary party;

The third party is used for obtaining OPRF output information of the third party based on the OPRF seed and the data identification set of the third party, and sending the OPRF output information of the third party to the intermediate party;

the intermediary party is configured to determine an OPRF output information intersection based on OPRF output information of the first party and the third party, the OPRF output information intersection being used to determine an intersection of a set of data identifications of the federal learning system.

11. The federal learning system according to claim 10, wherein the intermediary party is further configured to send the OPRF output information intersection to the second party;

the second party is further configured to receive the intersection of the OPRF output information sent by the intermediate party, determine an intersection of the data identification set of the federal learning system based on the intersection of the OPRF output information and the OPRF output information of the second party, and send the intersection of the data identification set of the federal learning system to the first party and the third party.

12. The federal learning system according to any one of claims 9-11, wherein the first participant is further configured to receive a set size of the second participant, the first participant's OPRF output information being determined based on the OPRF seed, the first participant's data identification set, and the set size of the second participant.

13. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.

14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-4.