CN114330756A - Federated ensemble learning method, apparatus, device and storage medium - Google Patents

Federated ensemble learning method, apparatus, device and storage medium

Info

Publication number: CN114330756A
Application number: CN202111261571.7A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 程勇, 蒋杰, 韦康, 刘煜宏, 陈鹏, 陶阳宇
Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority: CN202111261571.7A

Abstract

Embodiments of the present disclosure provide a method, an apparatus, a device, and a computer-readable storage medium for federated ensemble learning. According to the method, each participant performs feature selection and model selection locally based on differential privacy realized by an exponential mechanism, and sends the selected trained models to a federated server for ensembling, so as to generate a federated ensemble model with better performance. In this way, the parameters of the selected trained models can be sent to the federated server in plaintext without using any cryptographic method, which avoids the ciphertext expansion problem of cryptography-based methods and thus realizes more efficient, lower-communication-cost federated learning while ensuring that there is no risk of data leakage. Furthermore, the method provided by the embodiments of the present disclosure can also support scenarios with only two participants through direct transmission of trained models between the participants, and can support direct communication and model fusion among multiple participants without a federated server.

Description

Federated ensemble learning method, apparatus, device and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence and robotics, and more particularly, to a method, an apparatus, a device, and a storage medium for federated ensemble learning.
Background
Owing to the need to protect user privacy and to competition over commercial interests, data collaboration faces many difficulties, and the potential value of decentralized data sources is not fully exploited. In recent years, federated learning (FL) technology has developed rapidly, providing a new solution for cross-department, cross-organization and cross-industry data collaboration; its objective is to realize joint modeling and improve the effect of the trained model while ensuring data privacy, security, and legal compliance.
Since an attacker may deduce, from trained model parameters, information about the training data or even the original training data itself, in conventional federated learning a participant cannot directly send its locally trained model parameters to a federated server or to other participants in plaintext. Instead, it either sends the model parameters in encrypted form for secure model fusion via a cryptography-based (or secret-sharing-based) method, or perturbs the generated model via Gaussian-mechanism-based stochastic gradient descent, so as to protect the security and privacy of the data. However, the cryptography-based (or secret-sharing-based) methods incur high communication overhead and place high demands on both network bandwidth and network stability, while the Gaussian-mechanism-based stochastic gradient descent method struggles to produce effective model parameters that guarantee the performance of the fusion model generated by the federated server when the data distribution across participants is unbalanced.
Therefore, there is a need for an efficient and secure federated learning approach that allows data collaboration applicable to a variety of scenarios to be achieved with low communication overhead.
Disclosure of Invention
To solve these problems, the method and apparatus of the present disclosure protect the participants' local data using differential privacy based on the exponential mechanism, so that participants can directly send their local model parameters to the federated server in plaintext, which reduces communication overhead while avoiding the risk of leaking training data.
Embodiments of the present disclosure provide a method, an apparatus, a device, and a computer-readable storage medium for federated ensemble learning.
An embodiment of the present disclosure provides a federated ensemble learning method, which includes: selecting a first number of features from a feature set of a participant according to a first probability distribution, the first probability distribution being obtained for the feature set of the participant based on an exponential mechanism; obtaining a plurality of logistic regression models based on at least a portion of the selected first number of features, and selecting a second number of logistic regression models from the plurality of logistic regression models according to a second probability distribution, the second probability distribution being obtained for the plurality of logistic regression models based on an exponential mechanism; and sending at least a part of the second number of logistic regression models to a fusion end, so as to perform ensemble fusion based on the at least a part of the logistic regression models and generate a federated ensemble model.
An embodiment of the present disclosure further provides a federated ensemble learning method, which includes: receiving at least one logistic regression model from each of a plurality of participants; deduplicating all logistic regression models from the plurality of participants to remove duplicate logistic regression models; and performing ensemble fusion on the deduplicated logistic regression models to generate a federated ensemble model; wherein, for each of the plurality of participants, the at least one logistic regression model from that participant comprises at least a portion of the second number of logistic regression models as described in the federated ensemble learning method above.
An embodiment of the present disclosure provides a federated ensemble learning apparatus, including: a feature selection module configured to select a first number of features from a feature set of a participant according to a first probability distribution obtained for the feature set of the participant based on an exponential mechanism; a model selection module configured to obtain a plurality of logistic regression models based on at least a portion of the selected first number of features and to select a second number of logistic regression models from the plurality of logistic regression models according to a second probability distribution obtained for the plurality of logistic regression models based on an exponential mechanism; and a model sending module configured to send at least a part of the second number of logistic regression models to a fusion end, so as to perform ensemble fusion based on the at least a part of the logistic regression models and generate a federated ensemble model.

An embodiment of the present disclosure provides a federated ensemble learning device, including: one or more processors; and one or more memories, wherein the one or more memories store a computer-executable program that, when executed by the processor, performs the federated ensemble learning method described above.

Embodiments of the present disclosure provide a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, implement the federated ensemble learning method described above.

Embodiments of the present disclosure provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform a federated ensemble learning method in accordance with an embodiment of the present disclosure.
Compared with conventional federated learning methods based on cryptography or secret sharing, the method provided by the embodiments of the present disclosure can send the trained model parameters to the federated server in plaintext; each participant exchanges only a single message with the federated server and transmits a small amount of data, which significantly reduces the demands on the communication network.

Compared with conventional federated learning methods based on stochastic gradient descent, the method provided by the embodiments of the present disclosure can train effective model parameters even when the data distribution across participants is unbalanced, improving the stability of federated learning.

According to the method provided by the embodiments of the present disclosure, each participant performs feature selection and model selection locally based on differential privacy realized by an exponential mechanism, and sends the selected trained models to a federated server for ensembling, so as to generate a federated ensemble model with better performance. In this way, the parameters of the selected trained models can be sent to the federated server in plaintext without using any cryptographic method, which avoids the ciphertext expansion problem of cryptography-based methods and thus realizes more efficient, lower-communication-cost federated learning while ensuring that there is no risk of data leakage.

Furthermore, the method provided by the embodiments of the present disclosure can also support scenarios with only two participants through direct transmission of trained models between the participants, and can support direct communication and model fusion among multiple participants without a federated server.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only exemplary embodiments of the disclosure, and that other drawings may be derived from those drawings by a person of ordinary skill in the art without inventive effort.
FIG. 1 is a schematic diagram illustrating cryptography-based horizontal federated learning, in accordance with an embodiment of the present disclosure;
FIG. 2A is a flow diagram illustrating a federated ensemble learning method according to an embodiment of the present disclosure;
FIG. 2B is a schematic diagram illustrating a federated ensemble learning method, according to an embodiment of the present disclosure;
FIG. 3A is a flow diagram illustrating feature selection based on an exponential mechanism, according to an embodiment of the disclosure;
FIG. 3B is a schematic diagram illustrating feature selection based on an exponential mechanism, according to an embodiment of the disclosure;
FIG. 4A is a flow diagram illustrating model construction according to an embodiment of the present disclosure;
FIG. 4B is a flow diagram illustrating model selection according to an embodiment of the present disclosure;
FIG. 4C is a schematic diagram illustrating model construction and selection according to an embodiment of the present disclosure;
FIG. 5 is a flow diagram illustrating a federated ensemble learning method according to an embodiment of the present disclosure;
FIG. 6A is a schematic diagram illustrating model fusion via a fusion center according to an embodiment of the present disclosure;
FIG. 6B is a schematic diagram illustrating model fusion without a fusion center according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating a federated ensemble learning apparatus, in accordance with an embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of a federated ensemble learning device, according to an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of an architecture of an exemplary computing device, according to an embodiment of the present disclosure; and
FIG. 10 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
In the present specification and the drawings, steps and elements having substantially the same or similar characteristics are denoted by the same or similar reference numerals, and repeated description of the steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance or order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing embodiments of the present disclosure only and is not intended to be limiting.
For the purpose of describing the present disclosure, concepts related to the present disclosure are introduced below.
The federated ensemble learning approach of the present disclosure may be based on Artificial Intelligence (AI). Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. For example, the artificial-intelligence-based federated ensemble learning approach can, in a manner similar to how humans collect data from multiple parties and analyze it comprehensively to make specific decisions, centrally fuse partial models trained on partial features of each participant to form a better trained model. By studying the design principles and implementation methods of various intelligent machines, artificial intelligence enables the federated ensemble learning method of the present disclosure to automatically select, in real time and based on the exponential mechanism of differential privacy, partial features of each participant for model training, and to select some of the resulting models for centralized fusion.
The federated ensemble learning method of the present disclosure may be based on federated learning techniques. Federated learning can be classified into horizontal federated learning (HFL), vertical federated learning (VFL), and federated transfer learning (FTL) according to how data is distributed among the participants. The essence of horizontal federated learning is the federation of samples; it is applicable to scenarios where participants are in the same line of business but reach different users, i.e., where features overlap heavily and samples overlap little, such as banks in different regions whose businesses are similar (similar features) but whose users differ (different samples). The essence of vertical federated learning is the federation of features; it is suitable for scenarios with heavily overlapping samples and little feature overlap, such as a supermarket and a bank in the same region: the users they reach are residents of that region (same samples), but their businesses differ (different features). When features and samples overlap little between participants, federated transfer learning may be used to apply models learned in a source domain to a target domain by exploiting similarities between data, tasks, or models, for example in a federation between a bank and a supermarket in different regions. The embodiments of the present disclosure perform horizontal federated learning for scenarios in which the participants' datasets share the same feature space but have different sample spaces; the advantages are that the amount of data participating in training is increased and only learned models are exchanged, which protects the participants' data privacy and data security to a certain extent.
In addition, the federated ensemble learning method of the present disclosure may also be based on differential privacy. Differential privacy (DP) is a mechanism that protects user data privacy by preventing differential attacks; its aim is to make the probabilities of a model inferring the same result from two datasets that differ in only one record very close, removing individual features while preserving statistical features, so as to protect user privacy. In federated learning, the model training algorithm does not distinguish between general features and individual features, so a trained model may inadvertently reveal individual features of the training set, and a malicious attacker may extract a user's private information from the model; it is therefore necessary to protect the trained model with differential privacy techniques. When applying differential privacy, the data to be processed can be divided into numerical and non-numerical types. For numerical data, a Laplace or Gaussian mechanism is generally adopted, adding random noise to the numerical result to realize differential privacy. For non-numerical data, an exponential mechanism with a scoring function is generally adopted: a score is computed for each possible output and normalized into a probability, and upon receiving a query, a specific result is not output deterministically but is returned with a certain probability, thereby realizing differential privacy. The probability is determined by the scoring function: higher scores yield higher probabilities of being output, and lower scores yield lower probabilities. The federated ensemble learning method of the present disclosure may implement differential privacy based on an exponential mechanism to protect each participant's data privacy while effectively training a better model.
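As a concrete illustration of the exponential mechanism just described, the following Python sketch scores each candidate output, converts the scores into a probability distribution proportional to exp(ε·u/Δu), and samples one output. All names are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def exponential_mechanism(candidates, score_fn, epsilon, sensitivity=1.0, rng=None):
    """Sample one candidate with probability proportional to
    exp(epsilon * score / sensitivity), as described above."""
    rng = rng or np.random.default_rng()
    scores = np.array([score_fn(c) for c in candidates], dtype=float)
    # Subtracting the max score before exponentiating improves numerical
    # stability and leaves the normalized probabilities unchanged.
    logits = epsilon * (scores - scores.max()) / sensitivity
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Example: privately pick among four options scored by a toy scoring function.
options = ["a", "b", "c", "d"]
pick = exponential_mechanism(options, lambda o: {"a": 3, "b": 1, "c": 0, "d": 2}[o],
                             epsilon=1.0)
```

Higher-scoring candidates are exponentially more likely to be returned, yet any candidate can be returned with nonzero probability, which is what makes the output differentially private.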
In summary, the embodiments of the present disclosure provide solutions related to artificial intelligence, federated learning, and differential privacy, and will be further described with reference to the accompanying drawings.
Fig. 1 is a schematic diagram illustrating cryptography-based horizontal federated learning in accordance with an embodiment of the present disclosure.
Alternatively, both the horizontal federated learning system shown in FIG. 1 and the horizontal federated learning system of the present application may include K participants (shown in FIG. 1 as participant 0 through participant (K-1)). Each participant can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The participants may share the same feature space while having different sample spaces.
The core idea of traditional horizontal federated learning is to let each participant train a model locally using its own data, and then obtain a better global model through cryptography-based secure model fusion (such as secure model parameter averaging, also called federated averaging), through secret-sharing-based secure model fusion (i.e., encryption using masks), or by perturbing the generated model via Gaussian-mechanism-based stochastic gradient descent, so as to protect the security and privacy of the data.
As shown in fig. 1, taking cryptography-based horizontal federated learning as an example, after completing model training locally, each participant may send an encrypted gradient to the federated server in step (1), so that secure model fusion is performed at the federated server in step (2). Next, in step (3), the federated server may distribute the fused encrypted gradient back to each participant, and each participant decrypts the gradient and performs a model update locally (step (4)), so as to train a better global model without revealing any participant's data privacy.
However, in cryptography-based (or secret-sharing-based) model fusion, the federated server and the participants cannot obtain a participant's local model (or model parameters) in plaintext, only in encrypted form. For encrypted local model parameters, the federated server can typically only use a secure aggregation algorithm (i.e., a secure federated averaging algorithm) for model fusion. In secure aggregation, each participant must perform multiple message interactions with the federated server to transfer model parameters, which places high demands on network bandwidth and stability and consumes a large amount of computation and communication overhead. Furthermore, model fusion methods based on secure federated averaging typically require at least three participants and do not support scenarios with only two.
In addition, in the conventional Gaussian-mechanism-based stochastic gradient descent scheme, when the data distribution among participants is unbalanced, it is difficult to learn the feature distribution of the global data, and data preprocessing methods such as upsampling or downsampling cannot be used, so effective model parameters are difficult to generate. In this case, because the models provided by the participants perform poorly, the performance of the model aggregated by the federated server is difficult to guarantee.
Therefore, in view of the above problems, the present disclosure provides a federated ensemble learning method that protects local data of participants based on differential privacy of an exponential mechanism, eliminates the risk of data leakage, and avoids the above problems.
Fig. 2A is a flow diagram illustrating a federated ensemble learning method 200 in accordance with an embodiment of the present disclosure. Fig. 2B is a schematic diagram illustrating the federated ensemble learning method 200 in accordance with an embodiment of the present disclosure.
As shown in fig. 2A, in step 201, a first number of features may be selected from a set of features of a participant according to a first probability distribution, which may be obtained for the set of features of the participant based on an exponential mechanism.
Optionally, each participant may perform feature selection using its locally owned training data, where the participant's feature set may be a discrete, one-hot-encoded dataset, and each feature in the feature set takes the discrete values 0 or 1.
Alternatively, taking a binary classification task (i.e., classification into {0, 1}) as an example, each sample in the training data locally owned by a participant has a class label of 0 or 1, which may have some correlation with the sample's individual features.
For example, for a gender classification task, the classification results may include male (class label 0) and female (class label 1), while each sample (here, each user or individual) includes feature data, such as features concerning hair length and height, whose values may affect the classification result; this influence reflects the magnitude of the correlation between the feature and the classification result (i.e., the determined class label). For instance, suppose the hair-length feature takes the value 0 within a predetermined smaller length interval and 1 within a predetermined larger length interval. Then, under the assumption that samples with class label 0 (male) generally have smaller hair-length values than samples with class label 1 (female), the hair-length feature may be considered same-direction correlated with class label 0 and inversely correlated with class label 1.
It should be understood that the binary classification task above is used as an example and not a limitation; a multi-class task can be handled by converting it into binary classification tasks. For example, a multi-class task with three classes {A, B, C} can be converted into three binary classification tasks, namely: A versus {B, C}, B versus {A, C}, and C versus {A, B}. The present disclosure does not limit the number of classes in the classification task.
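A minimal sketch of this one-vs-rest decomposition, with hypothetical labels:

```python
labels = ["A", "B", "C", "A", "B"]

# One binary task per class: the class itself (1) versus the rest (0).
binary_tasks = {
    cls: [1 if y == cls else 0 for y in labels]
    for cls in ("A", "B", "C")
}
# binary_tasks["A"] == [1, 0, 0, 1, 0], i.e., the "A vs. {B, C}" task.
```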
Optionally, each participant may use differential privacy implemented based on an exponential mechanism to protect training data when performing feature selection locally. Specifically, step 201 may include the steps shown in fig. 3A.
Fig. 3A is a flow diagram illustrating feature selection based on an exponential mechanism, according to an embodiment of the disclosure. Fig. 3B is a schematic diagram illustrating feature selection based on an exponential mechanism, according to an embodiment of the disclosure.
As shown in fig. 3A, in step 2011, for each feature in the participant's feature set, a feature score of the feature is determined, the feature score being determined based on the correlation of the feature with the class labels corresponding to the participant's multiple samples.

According to an embodiment of the present disclosure, each of the participant's multiple samples may correspond to one of two class labels, and for each feature, the correlation of the feature with the two class labels may be indicated by a flip flag of the feature. That is, the correlation between features and class labels described above may be indicated by the flip flag.

According to an embodiment of the present disclosure, the flip flag taking a first value indicates that the feature is same-direction correlated with a first class label of the two class labels, and the feature score of the feature then includes a first feature score determined based on the same-direction correlation of the feature with the first class label.

According to an embodiment of the present disclosure, the flip flag taking a second value indicates that the feature is inversely correlated with the first class label of the two class labels, and the feature score of the feature then includes a second feature score determined based on the inverse correlation of the feature with the first class label.

Similarly, the flip flag may indicate the same-direction or inverse correlation of the feature with the samples' class labels, and the magnitude of that correlation may be determined based on the sum, over the participant's multiple samples, of the distances between each sample's class label and the value of the feature.
For example, assume the participant's feature set is I = {1, 2, ..., |I|} and the class label set is {0, 1}. For the n-th feature, denote its flip flag by q_n (q_n = 0 or 1), where q_n = 0 indicates that the n-th feature is same-direction correlated with the first class label 0 of the two class labels (i.e., inversely correlated with the second class label 1), and q_n = 1 indicates that the n-th feature is inversely correlated with the first class label 0 (i.e., same-direction correlated with the second class label 1). Scoring functions for exponential-mechanism differential privacy can thus be constructed: for q_n = 0, the first feature score is

u_n^{(0)} = \sum_{m=1}^{M} \mathbb{1}(X_{m,n} = y_m),

and for q_n = 1, the second feature score is

u_n^{(1)} = \sum_{m=1}^{M} \mathbb{1}(1 - X_{m,n} = y_m),

where X_{m,n} denotes the value of the n-th feature of the m-th piece of data (the m-th sample), y_m denotes the class label (0 or 1) of the m-th piece of data (the m-th sample), \mathbb{1}(X_{m,n} = y_m) outputs 1 when X_{m,n} = y_m and 0 otherwise, \mathbb{1}(1 - X_{m,n} = y_m) outputs 1 when 1 - X_{m,n} = y_m and 0 otherwise, and |I| denotes the number of features included in the feature set.
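Under these definitions, both feature scores amount to counting agreements between each (possibly flipped) feature column and the labels; a minimal NumPy sketch with assumed variable names:

```python
import numpy as np

def feature_scores(X, y):
    """X: (M, N) one-hot 0/1 feature matrix; y: (M,) 0/1 class labels.
    Returns the first score u0[n] = #{m : X[m, n] == y[m]} (flip flag 0)
    and the second score u1[n] = #{m : 1 - X[m, n] == y[m]} (flip flag 1)."""
    y_col = y.reshape(-1, 1)
    u0 = (X == y_col).sum(axis=0)          # same-direction agreement counts
    u1 = ((1 - X) == y_col).sum(axis=0)    # inverse (flipped) agreement counts
    return u0, u1
```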
As shown in fig. 3B, after determining the feature score for each feature in the participant's feature set, a probability of each feature being selected may be determined based on all the feature scores to form a complete feature selection probability distribution.
In step 2012, the first probability distribution is determined based on an exponential mechanism according to the feature score of each feature in the participant's feature set, the first probability distribution including a probability that each feature in the participant's feature set is selected.
According to an embodiment of the present disclosure, step 2012 may include: determining, for each feature in the participant's feature set, the probability that the feature is selected, based on an exponential mechanism applied to the first and second feature scores of all features in the participant's feature set; these probabilities include a same-direction probability associated with the first feature score and an inverse probability associated with the second feature score, and the first probability distribution includes the same-direction probability and the inverse probability with which each feature in the participant's feature set is selected.
As described above, the exponential mechanism uses a scoring function to add randomness when answering queries, so that each possible output occurs with some probability, thereby guaranteeing ε-differential privacy. The scoring function u maps a dataset–output pair (x, r) to a score u(x, r), where x is the input dataset and r is the output result. The exponential mechanism outputs each possible result r with probability proportional to \exp(\varepsilon \, u(x, r) / \Delta u), where \Delta u denotes the sensitivity and \varepsilon measures the privacy loss. The probability of outputting o = r can then be expressed as

P(o = r) = \frac{\exp(\varepsilon \, u(x, r) / \Delta u)}{\sum_{r'} \exp(\varepsilon \, u(x, r') / \Delta u)},    (1)

where the sensitivity is \Delta u = \max_{r} \max_{x, x'} |u(x, r) - u(x', r)| over adjacent datasets x, x'. Therefore, for any adjacent datasets x, x', the ratio of the probabilities that they produce the same result is at most e^{\varepsilon}. An observer can thus hardly perceive small changes in the dataset by observing the output parameters, and cannot deduce specific training data from them, which achieves the goal of protecting data privacy.
Alternatively, the participant may select a first number of features (e.g., L features) from its feature set based on the exponential mechanism. Assuming the privacy budget for feature selection is \varepsilon_1, the privacy overhead of each feature selection may be an even split of the budget, i.e., each feature selection consumes \varepsilon_1 / L of the privacy budget.
The privacy overhead of (1). In addition, the privacy overhead per feature selection may also vary with the specific features, for example, features that affect the outcome of the model more or features that are more prone to reveal the privacy of the individual user (such as individual features), may consume a higher privacy overhead when selected for model training, and may therefore be allocated a higher proportion of the privacy overhead in the privacy budget. The disclosure has been described with respect to average privacy budget allocation by way of example only and not limitation, and other privacy overhead representations are equally applicable.
Thus, according to the definition of the exponential mechanism described above, the probability that a particular feature is selected may be calculated from the feature scores of all features in the participant's feature set, including a same-direction probability associated with the first feature score (the q_n = 0 case) and an inverse probability associated with the second feature score (the q_n = 1 case). Taking the n-th feature as an example and assuming the sensitivity \Delta u = 1, the same-direction probability \theta_n (i.e., the probability of selecting the n-th feature) can be expressed as

\theta_n = \frac{\exp(\varepsilon_1 u_n^{(0)} / L)}{\sum_{k \in I} \left[ \exp(\varepsilon_1 u_k^{(0)} / L) + \exp(\varepsilon_1 u_k^{(1)} / L) \right]},    (2)

and its inverse probability \theta_{|I|+n} (i.e., the probability of selecting the flip feature corresponding to the n-th feature, where the flip feature is of the same type as the n-th feature but has the opposite correlation with the class labels, as indicated by the flip flag) can be expressed as

\theta_{|I|+n} = \frac{\exp(\varepsilon_1 u_n^{(1)} / L)}{\sum_{k \in I} \left[ \exp(\varepsilon_1 u_k^{(0)} / L) + \exp(\varepsilon_1 u_k^{(1)} / L) \right]}.    (3)

In this way, the same-direction and inverse probabilities with which each feature in the participant's feature set is selected can be computed as above to form the first probability distribution.
In step 2013, a feature may be selected from the set of features of the participant based on the first probability distribution.
According to an embodiment of the present disclosure, step 2013 may comprise: selecting a feature from the set of features of the participant according to the first probability distribution, and determining a flip flag for the feature.
As described above, the probabilities with which all features in the participant's feature set are selected sum to 1; each feature's selection probability comprises its same-direction probability and its inverse probability, which correspond respectively to the two feature scores (the first and second feature scores) and the two flip-flag values (q_n = 0 and q_n = 1) of the same feature. Hence, when a feature is selected from the feature set according to the first probability distribution, not only the index of the feature but also its flip flag is determined.
In step 2014, the selected features may be removed from the set of features of the participant to update the set of features of the participant and the first probability distribution, and the selection of features may continue based on the updated set of features and the first probability distribution until the total number of selected features reaches the first number.
Optionally, after each feature is selected, it may be removed from the feature set I, and the first probability distribution may be updated as shown in equations (2) and (3), so that new features may be selected as described in step 2013 based on the updated feature set I and the first probability distribution until the number of selected features reaches the first number, thereby selecting the first number of features.
Thus, as shown in FIG. 3B, each time a feature selection is completed, it may be determined whether the number of selected features reaches a predetermined first number, and in the event that the number of selected features is insufficient, the feature set I and the first probability distribution may be updated to continue feature selection, otherwise the selected first number of features is derived for use in the following local model training.
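Putting steps 2011 through 2014 together, the following is a sketch of the full selection loop: the candidates are the 2|I| (feature, flip flag) pairs, each draw spends ε₁/L of the budget with sensitivity 1, and the drawn candidate is removed before the distribution is renormalized. Helper and variable names are assumptions carried over from the earlier sketches:

```python
import numpy as np

def select_features(X, y, L, eps1, rng=None):
    """Select L (feature index, flip flag) pairs via the exponential mechanism."""
    rng = rng or np.random.default_rng()
    u0, u1 = feature_scores(X, y)            # from the earlier sketch
    scores = np.concatenate([u0, u1]).astype(float)
    N = u0.size                              # candidates 0..N-1: q=0; N..2N-1: q=1
    alive = np.ones(2 * N, dtype=bool)
    selected = []
    for _ in range(L):
        logits = (eps1 / L) * (scores - scores[alive].max())
        probs = np.where(alive, np.exp(logits), 0.0)
        probs /= probs.sum()                 # the (updated) first distribution
        k = int(rng.choice(2 * N, p=probs))
        selected.append((k % N, k // N))     # (feature index, flip flag)
        alive[k] = False                     # remove the selected candidate
    return selected
```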
Next, returning to fig. 2A, in step 202, a plurality of logistic regression models may be obtained based on at least a portion of the selected first number of features, and a second number of logistic regression models may be selected from the plurality of logistic regression models according to a second probability distribution obtained for the plurality of logistic regression models based on an exponential mechanism.
Obtaining a plurality of logistic regression models based on at least a portion of the selected first number of features in step 202 may include the steps shown in fig. 4A. Selecting a second number of logistic regression models from the plurality of logistic regression models based on the second probability distribution in step 202 may include the steps shown in fig. 4B.
Fig. 4A is a flow diagram illustrating model construction according to an embodiment of the present disclosure. Fig. 4B is a flow diagram illustrating model selection according to an embodiment of the present disclosure. Fig. 4C is a schematic diagram illustrating model construction and selection according to an embodiment of the present disclosure.
As shown in fig. 4A, in step 20211, an optimal feature of the first number of features and a corresponding one-dimensional logistic regression model thereof may be determined, and a feature score of the optimal feature is not less than feature scores of other features of the first number of features.
Optionally, the optimal feature may have the highest feature score among the selected first number of features, based on which the one-dimensional logistic regression model may be generated by independent training.
As shown in FIG. 4C, after the optimal-feature selection and the construction of the one-dimensional logistic regression model are completed, a single group of model selection may be entered, in which one round of model construction and selection is performed to obtain one group of the second number of logistic regression models.
In step 20212, a predetermined number of features may be randomly chosen from the other features in the first number of features.
Alternatively, the random selection of the predetermined number of features from the other features of the first number of features may be, for example, a selection based on equal probabilities, i.e. each of these other features is selected with equal probability. In addition, the random selection may be based on other selection manners, which is not limited by the present disclosure.
Alternatively, after completing local feature selection, the participant may construct training models using the first number of features (e.g., L features) it selected locally. For example, for a logistic regression model of dimension D+1, several features (shown as D in FIG. 4C) may be randomly selected from the features other than the optimal feature among the first number of features, and used together with the optimal feature f to construct logistic regression models with weights from a discrete space; the weight value space of the logistic regression model may be configured freely, and the weights reflect the importance of different features to the model's output, i.e., their degree of influence on the model's decision.
Note that the above determination and use of the optimal feature f are optional; the method of the present disclosure may also be performed without using the optimal feature f.
In step 20213, for the optimal feature and the selected predetermined number of features, a plurality of logistic regression models may be constructed based on a predetermined weight value space, where the number of the plurality of logistic regression models is related to the predetermined number and the number of weight values included in the predetermined weight value space.
Optionally, for the logistic regression model of the present disclosure, the output is represented as a weighted sum of weights and features, where the weight values are discrete and the dimension cannot be too large (preferably less than or equal to 6). Therefore, assuming the weight value space of the logistic regression model of the present disclosure is V = {0, 0.1, 0.2, 0.3, 0.4, 0.5}, then for one group of features (e.g., the D+1 features mentioned above), T = |V|^{D+1} logistic regression models can be obtained, where |V| denotes the number of weight values in the weight value space V (in this example, |V| = 6). For convenience of description, the set consisting of the optimal feature and the selected predetermined number of features (i.e., the above-mentioned D+1 features) is hereinafter denoted by S.
Alternatively, the weight corresponding to the optimal feature f may be set to a fixed value (e.g., 0.5). In that case, by enumeration, T = |V|^D logistic regression models can be obtained for one group of features (e.g., the D+1 features described above).
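Enumerating the candidate models of this fixed-weight variant is then a direct product over the discrete weight space; a sketch with assumed names:

```python
from itertools import product

V = [0, 0.1, 0.2, 0.3, 0.4, 0.5]             # discrete weight value space

def enumerate_models(D, f_weight=0.5):
    """Yield all |V|**D weight vectors for the D randomly chosen features,
    with the optimal feature f pinned to a fixed weight (e.g., 0.5)."""
    for ws in product(V, repeat=D):
        yield (f_weight,) + ws               # weight of f first, then D weights

# For D = 2 this enumerates 6**2 = 36 candidate models.
models = list(enumerate_models(D=2))
```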
Thus, after model construction within a single group of model selection is completed based on steps 20212 and 20213 above, as shown in FIG. 4C, the model selection process within that group may be entered.
As described above, selecting a second number of logistic regression models from the plurality of logistic regression models based on the second probability distribution in step 202 may include the steps shown in FIG. 4B.
As shown in fig. 4B, in step 20221, a model score for each of the plurality of logistic regression models may be determined based on the prediction results of the plurality of logistic regression models for the class labels corresponding to the participant's multiple samples.
Alternatively, the model score of each of the multiple logistic regression models may be determined according to the degree of agreement between that model's predictions on all of the participant's samples and the class labels corresponding to those samples.
For example, the predicted class label \hat{y}_{i,m} of the i-th logistic regression model, among the constructed plurality of logistic regression models, for the m-th sample of the participant can be expressed as

\hat{y}_{i,m} = \mathbb{1}(z_{i,m} \ge 0.5),

where z_{i,m} denotes the output value of the i-th logistic regression model for the m-th sample,

z_{i,m} = \sum_{d=1}^{D+1} w_{i,d} \, \tilde{X}_{m, S[d]}, \quad \tilde{X}_{m, S[d]} = \begin{cases} X_{m, S[d]}, & q_{S[d]} = 0 \\ 1 - X_{m, S[d]}, & q_{S[d]} = 1 \end{cases}

in which S[d] denotes the index of a selected feature, q_{S[d]} denotes the flip flag of the selected feature, and w_{i,d} \in V denotes a weight of the logistic regression model. Therefore, when z_{i,m} \ge 0.5, \hat{y}_{i,m} = 1, i.e., the class label of the m-th sample is predicted to be 1; otherwise \hat{y}_{i,m} = 0, i.e., the class label of the m-th sample is predicted to be 0.

Thus, optionally, the model score (scoring function) H_i of the i-th logistic regression model can be calculated from the prediction results \hat{y}_{i,m}, i.e.,

H_i = \sum_{m=1}^{M} \mathbb{1}(\hat{y}_{i,m} = y_m),

where \mathbb{1}(\hat{y}_{i,m} = y_m) outputs 1 if the prediction \hat{y}_{i,m} of the i-th logistic regression model for the m-th sample is the same as the class label corresponding to that sample, and 0 otherwise. Therefore, H_i reflects the degree of agreement between the predictions of the i-th logistic regression model on all M samples of the participant and the class labels corresponding to those samples.
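The prediction and scoring formulas above translate directly into a few lines of NumPy; a sketch with assumed argument names:

```python
import numpy as np

def model_score(X, y, feat_idx, flip, w, threshold=0.5):
    """H_i for one discrete-weight model: the number of samples whose
    predicted label matches the true label.
    feat_idx: indices S[d]; flip: flags q_{S[d]}; w: weights w_{i,d} in V."""
    Xs = X[:, feat_idx].astype(float)
    Xs = np.where(np.asarray(flip) == 1, 1.0 - Xs, Xs)   # apply flip flags
    z = Xs @ np.asarray(w, dtype=float)                  # output z_{i,m}
    y_hat = (z >= threshold).astype(int)                 # predicted labels
    return int((y_hat == y).sum())                       # model score H_i
```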
In step 20222, the second probability distribution is determined based on an exponential mechanism according to the model score of each of the plurality of logistic regression models, the second probability distribution including a probability that each of the plurality of logistic regression models is selected.
Alternatively, similar to what was described above with reference to step 2012 for feature selection, the probability that the i-th logistic regression model is selected, determined based on the exponential mechanism, can be expressed as

P(i) = \frac{\exp(\varepsilon_2 H_i / (G S))}{\sum_{j \in J} \exp(\varepsilon_2 H_j / (G S))},

where J = {1, 2, ..., T}, \varepsilon_2 denotes the privacy budget for model selection, and \varepsilon_2 / (G S) denotes the privacy overhead consumed by each model selection. Here G denotes the number of times the operations of randomly selecting D features from the first number of features and selecting a second number (S) of logistic regression models are repeated, so as to obtain G × S + 1 logistic regression models in total (including the one-dimensional logistic regression model corresponding to the above-mentioned optimal feature), as described below with reference to step 20225.
In step 20223, a logistic regression model is selected from the plurality of logistic regression models based on the second probability distribution.
As described above, the determined probability of each of the plurality of logistic regression models being selected forms a second probability distribution based on which model selection from the plurality of logistic regression models can be performed with a particular probability, thereby adding randomness to the model selection results and model training.
In step 20224, the selected logistic regression model may be removed from the plurality of logistic regression models to update the plurality of logistic regression models and the second probability distribution, and the selection of logistic regression models may continue based on the updated plurality of logistic regression models and second probability distribution until the total number of selected logistic regression models reaches the second number.
Thus, as shown in FIG. 4C, each time a model selection is completed, it may be determined whether the number of selected models reaches a predetermined second number, and in the event that the number of selected models is insufficient, the plurality of logistic regression models and the second probability distribution may be updated to continue the model selection, otherwise the second number of logistic regression models is determined.
Furthermore, step 2022 may further include step 20225: repeating, a predetermined number of times, the operation of randomly selecting a predetermined number of features from the other features among the first number of features; and, for each such selection, obtaining a plurality of logistic regression models and selecting a second number of logistic regression models from them according to the second probability distribution, so as to obtain a third number of logistic regression models, where the third number is the product of the second number and the number of repetitions, and the third number of logistic regression models may include multiple groups of the second number of logistic regression models, one group per repetition.
Optionally, to obtain a better fusion model, more logistic regression models may be generated based on the selected first number of features. Thus, the single group of model selection shown in FIG. 4C may be repeated, selecting different features from the first number of features for model construction each time, so as to obtain multiple groups of the second number of logistic regression models, i.e., a third number of logistic regression models. For example, G × S logistic regression models of dimension (D+1) can be obtained by repeating G times the operations of randomly selecting D features and selecting S logistic regression models.
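A sketch of this outer loop, combining the earlier sketches (the per-draw budget is ε₂/(G·S); all names are assumptions, not the patent's implementation):

```python
import numpy as np

def build_and_select(X, y, feats, flips, f_idx, f_flip, D, S, G, eps2, rng=None):
    """Repeat G times: pick D features at random, enumerate |V|**D candidate
    models (optimal feature f pinned), then draw S models via the
    exponential mechanism. Returns the G * S selected models."""
    rng = rng or np.random.default_rng()
    chosen = []
    for _ in range(G):
        pick = rng.choice(len(feats), size=D, replace=False)
        idx = [f_idx] + [feats[p] for p in pick]        # f first, then D features
        flg = [f_flip] + [flips[p] for p in pick]
        cands = list(enumerate_models(D))               # from the earlier sketch
        H = np.array([model_score(X, y, idx, flg, w) for w in cands], float)
        for _ in range(S):
            live = np.isfinite(H)
            logits = (eps2 / (G * S)) * (H - H[live].max())
            probs = np.where(live, np.exp(logits), 0.0)
            probs /= probs.sum()                        # second distribution
            i = int(rng.choice(len(cands), p=probs))
            chosen.append({"features": idx, "flip": flg, "weights": cands[i],
                           "H": H[i]})
            H[i] = -np.inf                              # remove selected model
    return chosen
```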
Thus, the participant may obtain the third number of logistic regression models and the one-dimensional logistic regression model based on the selected first number of features, as shown in fig. 4C.
Next, in step 203, the participant may send at least a part of the second number of logistic regression models to the fusion end, so as to perform ensemble fusion based on the at least a part of the logistic regression models and generate a federated ensemble model.
According to an embodiment of the present disclosure, sending at least a portion of the second number of logistic regression models to the fusion end may include: for each group of the second number of logistic regression models among the third number of logistic regression models, determining, from that group, the logistic regression models that are better than the determined one-dimensional logistic regression model, as the at least a portion of the logistic regression models; and sending, in plaintext, the at least a portion of each group of the second number of logistic regression models among the third number of logistic regression models, together with the one-dimensional logistic regression model, to the fusion end.
According to an embodiment of the present disclosure, sending a logistic regression model may include sending the feature indices, flip flags, and model weight parameters of the features corresponding to the logistic regression model.
Optionally, the transmitted logistic regression model may also include feature names. In horizontal federated learning, the feature space of every participant is the same but the sample spaces differ; each feature of a participant has an index or name, and the participants' features are aligned, so a participant can determine a feature's name from its index.
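Because the model travels in plaintext, the per-model message can be a handful of small fields; a hypothetical serialization (the field names are assumptions, not the patent's wire format):

```python
import json

# Hypothetical plaintext payload for one selected logistic regression model.
model_msg = {
    "features": [17, 42, 103],     # aligned feature indices (names also work)
    "flip": [0, 1, 0],             # flip flag per feature
    "weights": [0.5, 0.2, 0.1],    # discrete weights drawn from V
}
payload = json.dumps(model_msg)    # one small message to the fusion end
```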
Optionally, after each participant selects the third number (G × S) of logistic regression models, it retains only the models that are better than the one-dimensional logistic regression model, so the number of models each participant actually sends to the fusion end may be less than (G × S + 1) but is at least 1; that is, at least the one-dimensional logistic regression model corresponding to the optimal feature f is sent.
According to an embodiment of the present disclosure, determining, from the second number of logistic regression models, the logistic regression models that are better than the determined one-dimensional logistic regression model is based on comparing the model scores of the second number of logistic regression models with the model score of the one-dimensional logistic regression model, where the model score of the one-dimensional logistic regression model is determined based on the prediction results of the one-dimensional logistic regression model for the class labels corresponding to the participant's multiple samples.
Alternatively, in the case where the model score of the logistic regression model is greater than the model score of the one-dimensional logistic regression model, the logistic regression model may be considered to be more optimal than the one-dimensional logistic regression model, and thus it is more beneficial for the training of the federal integrated model.
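The retention rule is then a plain score comparison against the one-dimensional model; a sketch reusing the dictionary form from the earlier examples:

```python
def models_to_send(candidates, one_dim_model):
    """Keep only models scoring strictly higher than the 1-D model; the 1-D
    model itself is always sent, so at least one model goes to the fusion end."""
    better = [m for m in candidates if m["H"] > one_dim_model["H"]]
    return better + [one_dim_model]
```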
Thus, as described above, the federated ensemble learning method 200 of the present disclosure performs local feature selection and model selection based on an exponential mechanism to achieve differential privacy, thereby protecting the participants' data privacy. Fig. 2B schematically outlines the main steps of the federated ensemble learning method 200 and the choices involved.
As shown in fig. 2B, first, for an input feature set (shown as N-dimensional features) of a participant, feature selection may be performed based on an exponential mechanism to implement differential privacy, so that L features are selected from the N-dimensional features, and an optimal feature f and a one-dimensional logistic regression model corresponding to the optimal feature f are determined.
Then, D features for model construction are randomly selected from the selected L features (D < L), and logistic regression models are constructed using the selected D features together with the feature f, thereby forming a group of logistic regression models (T = |V|^{D+1} logistic regression models), where the model weights take discrete values, for example from V = {0, 0.1, 0.2, 0.3, 0.4, 0.5}.
Next, model selection is performed for the set of logistic regression models based on an exponential mechanism to achieve differential privacy, thereby selecting K logistic regression models from the set of logistic regression models.
The model construction and model selection process described above may be repeated G times (i.e., G groups of models are generated) to select G × K models based on the exponential mechanism.
Therefore, (G × K +1) logistic regression models may be generated based on the exponential mechanism, and at least a part of the models may be selectively sent to the fusion end.
It should be understood that the fusion end in the present disclosure may be a fusion center that performs centralized fusion of all participants' models, such as a federated server, or may be a fusion end based on distributed fusion, such as another participant. Although the present disclosure is described primarily in terms of centralized fusion at a fusion end such as a federated server, the federated ensemble learning approach of the present disclosure is equally applicable in the absence of a fusion center.
Fig. 5 is a flow diagram illustrating a federated ensemble learning method 500 in accordance with an embodiment of the present disclosure.
As shown in fig. 5, in step 501, at least one logistic regression model may be received from each of a plurality of participants.
According to an embodiment of the present disclosure, for each participant in the plurality of participants, the at least one logistic regression model received from that participant may include at least a portion of each set of the second number of logistic regression models among the third number of logistic regression models described above, together with the one-dimensional logistic regression model.
Optionally, after receiving the local models (or model parameters) sent by two (or more, or all) participants, the fusion end (such as the federated server) may perform ensemble fusion on the received local models.
In step 502, all logistic regression models from the plurality of participants may be deduplicated to remove duplicate logistic regression models.
Optionally, because there are multiple participants and the exponential mechanism introduces randomness into model selection, the models selected by different participants may overlap. A duplicated model indicates that the same component appears in the model sets generated by multiple participants; since identical components contribute identically to the classification task, only one copy needs to be kept at fusion time.
Therefore, before fusion is performed at the fusion end, the duplicated models need to be deduplicated, i.e., only one of two or more identical models is kept, for example as sketched below.
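A minimal deduplication sketch is given below. It assumes each received model is a dict with feature indices, flip flags, and weights (illustrative field names, not taken from the patent), and relies on the weights coming from a small discrete set, so exact tuple comparison suffices.

```python
def deduplicate_models(models):
    """Drop duplicate models received from different participants.

    A model is identified here by its feature indices, flip flags and
    discrete weights; no floating-point tolerance is needed because the
    weights come from a small discrete value set.
    """
    seen, unique = set(), []
    for m in models:
        key = (tuple(m["feature_index"]), tuple(m["flip_flag"]), tuple(m["weights"]))
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique
```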
In step 503, the de-duplicated logistic regression models may be subjected to ensemble fusion to generate a federated ensemble model.
According to an embodiment of the disclosure, the ensemble fusion of the deduplicated logistic regression models includes voting-based ensemble fusion, and the prediction result of the generated federated ensemble model is based on the average of the prediction results of the deduplicated logistic regression models and the one-dimensional logistic regression model.
Optionally, the fusion end, such as a federated server, may perform voting-based ensemble fusion on the local models received from the participants. The voting fusion mode is suited to classification models. For example, for a binary classification model (positive class and negative class), the classification result of the federated ensemble model can be determined by the average of the classification results of the participants' local models: for a piece of data to be classified, if the average of the local models' classification results is greater than 0.5, the federated ensemble model outputs the positive class; if the average is less than 0.5, it outputs the negative class; and if the average equals 0.5, the result can be determined simply by random selection.
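The sketch below illustrates this voting rule for a binary classifier. It assumes each member model exposes a predict(x) method returning a probability in [0, 1]; that interface, like the function name, is an assumption of the sketch rather than part of the disclosure.

```python
import random

def federated_vote(models, x, threshold=0.5):
    """Binary decision by averaging the member models' outputs.

    Ties at exactly the threshold are broken uniformly at random,
    mirroring the rule described in the text.
    """
    avg = sum(m.predict(x) for m in models) / len(models)
    if avg > threshold:
        return 1  # positive class
    if avg < threshold:
        return 0  # negative class
    return random.choice([0, 1])
```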
Fig. 6A is a schematic diagram illustrating model fusion via a fusion center according to an embodiment of the present disclosure. Fig. 6B is a schematic diagram illustrating model fusion without a fusion center, according to an embodiment of the present disclosure.
The scenario shown in fig. 6A is the federated ensemble learning scenario described above with reference to fig. 5. It includes K participants; after each participant locally completes feature selection and model selection based on the exponential mechanism, the global model update is carried out by a federated server. Each participant transmits its models (or model parameters) to the federated server only once, as a single plaintext message carrying the (G × K + 1) models described above. Likewise, the federated server transmits the centrally fused federated ensemble model back to each participant only once, also in plaintext.
As shown in fig. 6B, however, when there are only two participants (e.g., participant 1 and participant 2), the two participants may exchange models directly and perform model fusion without a federated server.
Further, when there are more (e.g., K) participants, they may also communicate over a ring or mesh (P2P) topology and perform distributed fusion to generate the global model without relying on a federated server, as simulated in the sketch below.
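As a topology-only illustration (no networking, serialization, or privacy machinery; the function name and the use of hashable model identifiers are assumptions), the sketch below simulates P participants on a ring: after P − 1 forwarding rounds, every participant holds the union of all model sets and can run deduplication and voting fusion locally.

```python
def ring_all_gather(local_model_sets):
    """Simulate ring-based exchange among P participants.

    In each round, participant i receives the current holdings of
    participant (i - 1) % P; after P - 1 rounds everyone holds everything.
    """
    p = len(local_model_sets)
    held = [set(s) for s in local_model_sets]
    for _ in range(p - 1):
        snapshot = [set(h) for h in held]  # everyone sends before receiving
        for i in range(p):
            held[i] |= snapshot[(i - 1) % p]
    return held

# e.g. ring_all_gather([{"m1"}, {"m2"}, {"m3"}]) leaves each party with all three models
```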
Fig. 7 is a schematic diagram illustrating a federated ensemble learning apparatus 700 in accordance with an embodiment of the present disclosure.
The federated ensemble learning apparatus 700 may include a feature selection module 701, a model selection module 702, and a model sending module 703.
According to an embodiment of the present disclosure, the feature selection module 701 may be configured to select a first number of features from a set of features of a participant according to a first probability distribution obtained for the set of features of the participant based on an exponential mechanism.
According to an embodiment of the present disclosure, the feature selection module 701 selecting a first number of features from the set of features of the participant according to the first probability distribution may include operations as described with reference to fig. 3A, where each participant, when performing feature selection locally, may protect its training data using differential privacy implemented based on an exponential mechanism.
The model selection module 702 may be configured to obtain a plurality of logistic regression models based on at least a portion of the selected first number of features and select a second number of logistic regression models from the plurality of logistic regression models according to a second probability distribution obtained for the plurality of logistic regression models based on an exponential mechanism.
In accordance with an embodiment of the present disclosure, the model selection module 702 obtaining a plurality of logistic regression models based on at least a portion of the selected first number of features includes operations as described with reference to fig. 4A.
Optionally, after completing local feature selection, the participant may construct training models using the first number of features (e.g., L features) it selected locally. For example, for a logistic regression model of dimension D + 1, several features (shown as D in fig. 4C) may be randomly selected from the first number of features and combined with the optimal feature f to construct logistic regression models whose weights take values in a discrete space.
The model selection module 702 selecting a second number of logistic regression models from the plurality of logistic regression models according to a second probability distribution may include operations as described with reference to fig. 4B.
Optionally, the model score of each of the plurality of logistic regression models may be determined by the degree of agreement between that model's prediction results over all samples of the participant and the category labels respectively corresponding to those samples.
Optionally, the probability that each of the plurality of logistic regression models is selected may be determined from its model score based on the exponential mechanism, thereby forming the second probability distribution. Performing model selection according to this distribution adds randomness to the model selection result and to the model training, as sketched below.
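Continuing the earlier feature-selection sketch (and reusing its exponential_mechanism_select helper), model selection without replacement might look as follows; the accuracy-style score and the even per-pick budget split are simplifying assumptions of this sketch.

```python
import numpy as np

# reuses exponential_mechanism_select from the feature-selection sketch above

def score_model(weights, X, y):
    """Accuracy-style model score: agreement between thresholded predictions
    and the labels over all of the participant's local samples.
    X is an (n, d+1) array, y an array of 0/1 labels."""
    preds = (1.0 / (1.0 + np.exp(-X @ weights)) > 0.5).astype(int)
    return float((preds == y).mean())

def select_k_models(candidates, X, y, k, epsilon):
    """Draw K distinct models; each pick removes the chosen model and
    renormalises, matching the 'update and continue' loop in the text."""
    pool = list(candidates)
    scores = [score_model(w, X, y) for w in pool]
    picked = []
    for _ in range(k):
        i = exponential_mechanism_select(scores, epsilon / k)  # naive budget split
        picked.append(pool.pop(i))
        scores.pop(i)
    return picked
```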
Optionally, to obtain a better fusion model, more logistic regression models may be generated based on the selected first number of features. Thus, according to an embodiment of the present disclosure, the model selection module 702 may be further configured to perform the operations described with reference to step 20225, i.e., to repeat the single-group model selection shown in fig. 4C multiple times, selecting different features from the first number of features for model construction each time, thereby obtaining multiple groups of the second number of logistic regression models, i.e., the third number of logistic regression models.
Thus, the participant may obtain the third number of logistic regression models and the one-dimensional logistic regression model based on the selected first number of features.
The model sending module 703 may be configured to send at least a portion of the second number of logistic regression models to the fusion end, so that ensemble fusion is performed based on the at least a portion of the logistic regression models and a federated ensemble model is generated.
The model sending module 703 sending at least a portion of the second number of logistic regression models to the fusion end may include operations as described with reference to step 203.
According to an embodiment of the present disclosure, sending a logistic regression model may include sending the feature indices, flip flags, and model weight parameters of the features corresponding to the logistic regression model. Optionally, sending the logistic regression model may also include sending the feature names.
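By way of illustration, the plaintext payload for one model could be structured as below; the dataclass and its field names are invented for this sketch, and the disclosure only requires that feature indices, flip flags, and weight parameters (optionally feature names) be conveyed.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ModelMessage:
    """Illustrative plaintext payload for one selected logistic regression model."""
    feature_index: List[int]                   # indices of the features the model uses
    flip_flag: List[int]                       # per-feature flip flag (same-direction / reverse)
    weights: List[float]                       # discrete model weight parameters
    feature_name: Optional[List[str]] = None   # optional, as noted above
```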
According to yet another aspect of the present disclosure, a federated ensemble learning device is also provided. Fig. 8 shows a schematic diagram of a federated ensemble learning device 2000 in accordance with an embodiment of the present disclosure.
As shown in fig. 8, the federal integrated learning device 2000 may include one or more processors 2010 and one or more memories 2020. Wherein the memory 2020 has stored therein computer readable code that, when executed by the one or more processors 2010, may perform a federated ensemble learning method as described above.
The processor in the embodiments of the present disclosure may be an integrated circuit chip having signal processing capability. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor or any conventional processor, and may be of the x86 or ARM architecture.
In general, the various example embodiments of this disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of embodiments of the disclosure have been illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
For example, a method or apparatus in accordance with embodiments of the present disclosure may also be implemented by way of the architecture of computing device 3000 shown in fig. 9. As shown in fig. 9, computing device 3000 may include a bus 3010, one or more CPUs 3020, a read-only memory (ROM) 3030, a random access memory (RAM) 3040, a communication port 3050 to connect to a network, input/output components 3060, a hard disk 3070, and the like. A storage device in the computing device 3000, such as the ROM 3030 or the hard disk 3070, may store various data or files used in the processing and/or communication of the federated ensemble learning method provided by the present disclosure, as well as program instructions executed by the CPU. Computing device 3000 may also include a user interface 3080. Of course, the architecture shown in fig. 9 is merely exemplary, and one or more components of the computing device shown in fig. 9 may be omitted as needed when implementing different devices.
According to yet another aspect of the present disclosure, there is also provided a computer-readable storage medium. Fig. 10 shows a schematic diagram 4000 of a storage medium according to the present disclosure.
As shown in fig. 10, the computer storage medium 4020 has stored thereon computer-readable instructions 4010. The computer-readable instructions 4010, when executed by a processor, may perform a federated ensemble learning method in accordance with embodiments of the present disclosure as described with reference to the above figures. The computer-readable storage medium in embodiments of the present disclosure may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). It should be noted that the memories of the methods described herein are intended to comprise, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform a federal integrated learning method in accordance with an embodiment of the present disclosure.
Embodiments of the present disclosure provide a method, an apparatus, a device, and a computer-readable storage medium for federated ensemble learning.
Compared with traditional federated learning methods based on cryptography or secret sharing, the method provided by the embodiments of the present disclosure can send the trained model parameters to the federated server in plaintext form; each participant exchanges only one message with the federated server, and the transmitted data volume is small, which significantly reduces the requirements on the communication network.
Compared with traditional federated learning methods based on stochastic gradient descent, the method provided by the embodiments of the present disclosure can train effective model parameters even when the data distribution across participants is unbalanced, improving the stability of the federated learning method.
According to the method provided by the embodiments of the present disclosure, each participant locally performs feature selection and model selection with differential privacy based on an exponential mechanism, and sends the selected training models to the federated server for ensemble fusion, thereby generating a federated ensemble model with better performance. In this way, the parameters of the selected training models can be sent to the federated server in plaintext form without using any cryptographic method, avoiding the ciphertext expansion problem of cryptography-based approaches and thus achieving more efficient, lower-communication-cost federated learning while ensuring no risk of data leakage. Furthermore, the method provided by the embodiments of the present disclosure can also support scenarios with only two participants through direct transmission of training models between the participants, and can support direct communication and model fusion among multiple participants without a federated server.
It is to be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The exemplary embodiments of the present disclosure described in detail above are merely illustrative, and not restrictive. It will be appreciated by those skilled in the art that various modifications and combinations of these embodiments or features thereof may be made without departing from the principles and spirit of the disclosure, and that such modifications are intended to be within the scope of the disclosure.

Claims (20)

1. A method for federated ensemble learning, comprising:
selecting a first number of features from a set of features of a participant according to a first probability distribution, the first probability distribution being obtained for the set of features of the participant based on an exponential mechanism;
obtaining a plurality of logistic regression models based on at least a portion of the selected first number of features, and selecting a second number of logistic regression models from the plurality of logistic regression models according to a second probability distribution, the second probability distribution being obtained for the plurality of logistic regression models based on an exponential mechanism; and
sending at least a portion of the logistic regression models in the second number of logistic regression models to a fusion end, so as to perform ensemble fusion based on the at least a portion of logistic regression models and generate a federated ensemble model.
2. The method of claim 1, wherein selecting a first number of features from a set of features of a participant according to a first probability distribution comprises:
for each feature in the set of features of the participant, determining a feature score for the feature, the feature score determined based on a relevance of the feature to category labels corresponding to a plurality of samples of the participant;
determining the first probability distribution based on an exponential mechanism according to the feature score of each feature in the participant's feature set, the first probability distribution including a probability that each feature in the participant's feature set is selected;
selecting a feature from the set of features of the participant in accordance with the first probability distribution; and
removing the selected features from the set of features of the participant to update the set of features of the participant and the first probability distribution, and continuing to select features based on the updated set of features and the first probability distribution until the total number of selected features reaches the first number.
3. The method of claim 2, wherein each sample of the participant's plurality of samples corresponds to one of two category labels, and, for each feature, the correlation of the feature with the two category labels is indicated by a flip flag of the feature,
wherein the flip flag having a first value indicates that the feature is correlated with a first category label of the two category labels in the same direction, the feature score of the feature comprising a first feature score determined based on the same-direction correlation of the feature with the first category label; and
the flip flag having a second value indicates that the feature is inversely correlated with the first category label of the two category labels, the feature score of the feature comprising a second feature score determined based on the inverse correlation of the feature with the first category label.
4. The method of claim 3, wherein determining the first probability distribution based on an exponential mechanism as a function of a feature score for each feature in the set of features of the participant comprises:
determining, for each feature in the participant's feature set, a probability that the feature is selected based on an exponential mechanism according to the first feature scores and second feature scores of all features in the participant's feature set, the probability comprising a same-direction probability associated with the first feature score and a reverse probability associated with the second feature score;
wherein selecting a feature from the set of features of the participant according to the first probability distribution comprises:
selecting a feature from the set of features of the participant according to the first probability distribution, the first probability distribution including the same-direction probability and the reverse probability with which each feature of the set of features of the participant is selected, and determining a flip flag for the selected feature.
5. The method of claim 2, wherein obtaining a plurality of logistic regression models based on at least a portion of the selected first number of features comprises:
determining an optimal feature in the first number of features and a one-dimensional logistic regression model corresponding to the optimal feature, wherein the feature score of the optimal feature is not less than the feature scores of other features in the first number of features;
randomly selecting a predetermined number of features from the other features of the first number of features; and
for the optimal feature and the selected predetermined number of features, constructing a plurality of logistic regression models based on a predetermined weight value space, wherein the number of the plurality of logistic regression models is related to the predetermined number and to the number of weight values included in the predetermined weight value space.
6. The method of claim 5, wherein selecting a second number of logistic regression models from the plurality of logistic regression models based on a second probability distribution comprises:
determining a model score for each of the plurality of logistic regression models based on the predicted results of the plurality of logistic regression models for the category labels corresponding to the plurality of samples of the participant;
determining the second probability distribution based on an exponential mechanism from the model score of each of the plurality of logistic regression models, the second probability distribution comprising a probability that each of the plurality of logistic regression models is selected;
selecting a logistic regression model from the plurality of logistic regression models according to the second probability distribution; and
removing the selected logistic regression model from the plurality of logistic regression models to update the plurality of logistic regression models and the second probability distribution, and continuing to select logistic regression models based on the updated plurality of logistic regression models and second probability distribution until the total number of selected logistic regression models reaches the second number.
7. The method of claim 6, further comprising:
randomly selecting a predetermined number of features from the other features in the first number of features a predetermined number of times; and
obtaining a plurality of logistic regression models based on each selected predetermined number of features among the predetermined number of times, and selecting a second number of logistic regression models from the plurality of logistic regression models according to the second probability distribution, to obtain a third number of logistic regression models, wherein the third number is the product of the second number and the predetermined number of times, and the third number of logistic regression models comprises groups of the second number of logistic regression models, the number of groups being the predetermined number of times.
8. The method of claim 7, wherein sending at least a portion of the second number of logistic regression models to a fusion end comprises:
for each set of a second number of logistic regression models in the third number of logistic regression models, determining, from the second number of logistic regression models, logistic regression models that are better than the determined one-dimensional logistic regression model, as the at least a portion of logistic regression models; and
sending the at least a portion of the logistic regression models in each set of the second number of logistic regression models among the third number of logistic regression models, together with the one-dimensional logistic regression model, to the fusion end in plaintext form,
wherein sending a logistic regression model comprises sending the feature indices, flip flags, and model weight parameters of the features corresponding to the logistic regression model.
9. The method of claim 8, wherein determining a better logistic regression model from the second number of logistic regression models than the determined one-dimensional logistic regression model is based on a comparison of model scores of the second number of logistic regression models to model scores of the one-dimensional logistic regression model, wherein the model scores of the one-dimensional logistic regression model are determined based on predictions of category labels corresponding to a plurality of samples of the participant by the one-dimensional logistic regression model.
10. A method for federated ensemble learning, comprising:
receiving at least one logistic regression model from each of a plurality of participants;
de-duplicating all logistic regression models from the plurality of participants to remove duplicate logistic regression models; and
performing ensemble fusion on the deduplicated logistic regression models to generate a federated ensemble model;
wherein, for each participant in the plurality of participants, the at least one logistic regression model from the participant comprises at least a portion of the second number of logistic regression models as recited in any one of claims 1-9.
11. The method of claim 10, wherein performing ensemble fusion on the deduplicated logistic regression models comprises performing voting-based ensemble fusion on the deduplicated logistic regression models, and the prediction result of the generated federated ensemble model is based on an average of the prediction results of the deduplicated logistic regression models and the one-dimensional logistic regression model.
12. A federated ensemble learning apparatus, comprising:
a feature selection module configured to select a first number of features from a set of features of a participant according to a first probability distribution obtained for the set of features of the participant based on an exponential mechanism;
a model selection module configured to obtain a plurality of logistic regression models based on at least a portion of the selected first number of features and to select a second number of logistic regression models from the plurality of logistic regression models according to a second probability distribution obtained for the plurality of logistic regression models based on an exponential mechanism; and
a model sending module configured to send at least a portion of the logistic regression models in the second number of logistic regression models to a fusion end, so as to perform ensemble fusion based on the at least a portion of logistic regression models and generate a federated ensemble model.
13. The apparatus of claim 12, wherein the feature selection module to select a first number of features from a set of features of a participant according to a first probability distribution comprises:
for each feature in the set of features of the participant, determining a feature score for the feature, the feature score determined based on a relevance of the feature to category labels corresponding to a plurality of samples of the participant;
determining the first probability distribution based on an exponential mechanism according to the feature score of each feature in the participant's feature set, the first probability distribution including a probability that each feature in the participant's feature set is selected;
selecting a feature from the set of features of the participant in accordance with the first probability distribution; and
removing the selected features from the set of features of the participant to update the set of features of the participant and the first probability distribution, and continuing to select features based on the updated set of features and the first probability distribution until the total number of selected features reaches the first number.
14. The apparatus of claim 13, wherein the model selection module obtaining a plurality of logistic regression models based on at least a portion of the selected first number of features comprises:
determining an optimal feature in the first number of features and a one-dimensional logistic regression model corresponding to the optimal feature, wherein the feature score of the optimal feature is not less than the feature scores of other features in the first number of features;
randomly selecting a predetermined number of features from the other features of the first number of features; and
for the optimal feature and the selected predetermined number of features, constructing a plurality of logistic regression models based on a predetermined weight value space, wherein the number of the plurality of logistic regression models is related to the predetermined number and to the number of weight values included in the predetermined weight value space.
15. The apparatus of claim 14, wherein the model selection module selects a second number of logistic regression models from the plurality of logistic regression models based on a second probability distribution comprises:
determining a model score for each of the plurality of logistic regression models based on the predicted results of the plurality of logistic regression models for the category labels corresponding to the plurality of samples of the participant;
determining the second probability distribution based on an exponential mechanism from the model score of each of the plurality of logistic regression models, the second probability distribution comprising a probability that each of the plurality of logistic regression models is selected;
selecting a logistic regression model from the plurality of logistic regression models according to the second probability distribution; and
removing the selected logistic regression model from the plurality of logistic regression models to update the plurality of logistic regression models and the second probability distribution, and continuing to select logistic regression models based on the updated plurality of logistic regression models and second probability distribution until the total number of selected logistic regression models reaches the second number.
16. The apparatus of claim 15, wherein the model selection module is further configured to:
randomly selecting a predetermined number of features from the other features in the first number of features a predetermined number of times; and
obtaining a plurality of logistic regression models based on each selected predetermined number of features among the predetermined number of times, and selecting a second number of logistic regression models from the plurality of logistic regression models according to the second probability distribution, to obtain a third number of logistic regression models, wherein the third number is the product of the second number and the predetermined number of times, and the third number of logistic regression models comprises groups of the second number of logistic regression models, the number of groups being the predetermined number of times.
17. The apparatus of claim 16, wherein the model sending module sending at least a portion of the second number of logistic regression models to the fusion end comprises:
for each set of a second number of logistic regression models in the third number of logistic regression models, determining, from the second number of logistic regression models, logistic regression models that are better than the determined one-dimensional logistic regression model, as the at least a portion of logistic regression models; and
sending the at least a portion of the logistic regression models in each set of the second number of logistic regression models among the third number of logistic regression models, together with the one-dimensional logistic regression model, to the fusion end in plaintext form,
wherein sending a logistic regression model comprises sending the feature indices, flip flags, and model weight parameters of the features corresponding to the logistic regression model.
18. A federated ensemble learning device, comprising:
one or more processors; and
one or more memories having stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 1-11.
19. A computer program product comprising computer instructions which, when executed by a processor, cause a computer device to perform the method of any one of claims 1-11.
20. A computer-readable storage medium having stored thereon computer-executable instructions for implementing the method of any one of claims 1-11 when executed by a processor.
CN202111261571.7A 2021-10-27 2021-10-27 Federal ensemble learning method, apparatus, device and storage medium Pending CN114330756A (en)

Priority Applications (1)

Application Number  Priority Date  Filing Date  Title
CN202111261571.7A  2021-10-27  2021-10-27  Federal ensemble learning method, apparatus, device and storage medium

Applications Claiming Priority (1)

Application Number  Priority Date  Filing Date  Title
CN202111261571.7A  2021-10-27  2021-10-27  Federal ensemble learning method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
CN114330756A (en)  2022-04-12

Family

ID=81045398

Family Applications (1)

Application Number  Priority Date  Filing Date  Title  Status
CN202111261571.7A  2021-10-27  2021-10-27  Federal ensemble learning method, apparatus, device and storage medium  Pending

Country Status (1)

Country Link
CN (1) CN114330756A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination