CN112182982A - Multi-party joint modeling method, device, equipment and storage medium - Google Patents

Multi-party joint modeling method, device, equipment and storage medium

Info

Publication number: CN112182982A (application CN202011165475.8A)
Authority: CN (China)
Prior art keywords: data, cluster, bucket, histogram, sample
Legal status: Granted; active
Other languages: Chinese (zh)
Other versions: CN112182982B (granted publication)
Inventors: 宋传园, 冯智, 吕亮亮
Assignee (original and current): Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd; priority to CN202011165475.8A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; distributed database system architectures therefor
    • G06F21/602: Providing cryptographic facilities or services

Abstract

The disclosure provides a multi-party joint modeling method based on a distributed system, relating to the fields of machine learning, secure computation, and the like. The method comprises the following steps: intersecting the sample identifications included in each of a plurality of clusters to obtain an intersection sample identification and, for each cluster, the cluster sample data corresponding to the intersection sample identification, wherein the sample identifications and cluster sample data of each cluster are stored in a distributed manner across a plurality of clients of that cluster; separately bucketing the cluster sample data of each cluster to obtain cluster bucketed data for each cluster; constructing a global information gain histogram based on the sample labels and each cluster's bucketed data; and constructing a decision tree model based on the global information gain histogram.

Description

Multi-party joint modeling method, device, equipment and storage medium
Technical Field
The present disclosure relates to the fields of machine learning, secure computation, and the like, and more particularly to a multi-party joint modeling method, apparatus, device, and storage medium.
Background
With the development of algorithms and big data, algorithms and computing power are no longer the bottlenecks that hinder the development of AI; genuinely usable data sources in each domain have become the most valuable resource. At the same time, barriers between data sources are difficult to break: in most industries data exists in isolated silos, and owing to industry competition, privacy and security concerns, and complex administrative procedures, even integrating data across different departments of the same company meets substantial resistance. In practice, integrating data dispersed across regions and organizations is nearly impossible, or can be done only at enormous cost.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a multi-party joint modeling method, comprising: intersecting the sample identifications included in each of a plurality of clusters to obtain an intersection sample identification and, for each cluster, the cluster sample data corresponding to the intersection sample identification, wherein the sample identifications and cluster sample data of each cluster are stored in a distributed manner across a plurality of clients of that cluster; separately bucketing the cluster sample data of each cluster to obtain cluster bucketed data for each cluster; constructing a global information gain histogram based on a sample label and each cluster's bucketed data, wherein the sample label is the true value of each sample and is stored in a specific one of the plurality of clusters; and constructing a decision tree model based on the global information gain histogram.
According to another aspect of the present disclosure, there is also provided a multi-party joint prediction method based on a distributed system, comprising: inputting a prediction sample into a decision tree model; for each sub-decision tree of the decision tree model, obtaining the cluster to which the root node belongs; communicating with that cluster to obtain the feature of the root node; sending the prediction sample's feature data for that feature to the cluster the node belongs to, to obtain the cluster to which the child node belongs; iterating this process to obtain each sub-decision tree's predicted value for the prediction sample; and summing the sub-decision trees' predicted values to obtain the predicted value for the prediction sample.
According to an aspect of the present disclosure, there is provided a multi-party joint modeling apparatus based on a distributed system, comprising: an intersection module configured to intersect the data included in the plurality of clusters, so that each of the plurality of clusters obtains its corresponding cluster sample data; a bucketing module configured to bucket the cluster sample data of each of the plurality of clusters to obtain cluster bucketed data of each of the plurality of clusters; a first construction module configured to construct a global information gain histogram based on the sample labels and the plurality of cluster bucketed data; and a second construction module configured to construct a decision tree model based on the global information gain histogram.
According to another aspect of the present disclosure, there is also provided an electronic device, comprising: a processor; and a memory storing a program comprising instructions which, when executed by the processor, cause the processor to perform the multi-party joint modeling method and/or the multi-party joint prediction method described above.
According to another aspect of the present disclosure, there is also provided a computer-readable storage medium storing a program comprising instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the multi-party joint modeling method and/or the multi-party joint prediction method described above.
In the above technical solution, bucketing the distributed data and constructing an information gain histogram over the distributed data realizes a multi-party joint modeling method based on a distributed system, which increases the speed of multi-party joint modeling and allows modeling to be completed in scenarios with large data volumes.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments and, together with the description, serve to explain them. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIGS. 1-2 are flow diagrams illustrating a multi-party joint modeling method in accordance with an illustrative embodiment;
FIG. 3 is a block diagram illustrating components of a distributed system in accordance with an illustrative embodiment;
FIG. 4 is a flowchart illustrating separately bucketing cluster sample data for each of a plurality of clusters in accordance with an illustrative embodiment;
FIG. 5 is a schematic diagram illustrating a bucketing process according to an exemplary embodiment;
FIG. 6 is a flowchart illustrating the generation of at least one data bucket based on feature data of a current feature in accordance with an illustrative embodiment;
FIG. 7 is a flowchart illustrating the construction of a global information gain histogram in accordance with an illustrative embodiment;
FIG. 8 is a flowchart illustrating constructing a first information gain histogram in accordance with an exemplary embodiment;
FIG. 9 is a flowchart illustrating deriving a first information gain sub-histogram or a first candidate splitting gain of a feature of a node to be split in accordance with an exemplary embodiment;
FIG. 10 is a flowchart illustrating the construction of a first information gain sub-histogram in accordance with an exemplary embodiment;
FIG. 11 is a flowchart illustrating a multi-party joint prediction method in accordance with an illustrative embodiment;
FIG. 12 is a block diagram illustrating components of a multi-party joint modeling apparatus in accordance with an illustrative embodiment;
FIG. 13 is a block diagram showing the structure of an exemplary computing device to which the exemplary embodiments can be applied.
Detailed Description
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In the related art, existing multi-party joint modeling methods are slow and, particularly in scenarios with large data volumes, are constrained by factors such as device performance and storage capacity to the point where joint modeling cannot be performed at all; they are therefore of limited practical use.
In order to solve the above technical problem, the present disclosure provides a multi-party joint modeling method based on a distributed system: intersecting the sample identifications among the clusters to obtain the cluster sample data of each cluster; bucketing the cluster sample data of each cluster to obtain cluster bucketed data; constructing a global information gain histogram based on the sample labels and each cluster's bucketed data; and constructing a decision tree model based on the global information gain histogram. By bucketing the distributed data and constructing an information gain histogram over it, a multi-party joint modeling method based on a distributed system is realized, the speed of multi-party joint modeling is improved, and modeling can be completed in scenarios with large data volumes.
The multi-party joint modeling method of the present disclosure will be further described with reference to the accompanying drawings.
FIG. 1 is a flowchart illustrating a multi-party joint modeling method based on a distributed system according to an exemplary embodiment of the present disclosure. As shown in FIG. 1, the method may include: step S101, intersecting the sample identifications included in each of the plurality of clusters to obtain an intersection sample identification and, for each cluster, the cluster sample data corresponding to the intersection sample identification; step S102, separately bucketing the cluster sample data of each of the plurality of clusters to obtain cluster bucketed data of each cluster; step S103, constructing a global information gain histogram based on the sample labels and each cluster's bucketed data; and step S104, constructing a decision tree model based on the global information gain histogram. By establishing a distributed system, distributing each cluster's data across that cluster's clients, and using those clients to perform preliminary bucketing of the distributed data and to construct information gain sub-histograms, the speed of multi-party joint modeling can be greatly increased, and the model can support fast joint modeling in richer scenarios.
According to some embodiments, the distributed system includes a plurality of clusters, each cluster including a server and a plurality of clients. In an exemplary embodiment, as shown in FIG. 3, distributed system 3000 includes clusters 3100, 3200, and 3300, each containing one server and three clients; e.g., cluster 3100 includes server 3110 and client 3101, cluster 3200 includes server 3210 and client 3201, and cluster 3300 includes server 3310 and client 3301. The main functions of the server may include coordinating the clients in its cluster, instructing them to complete tasks such as bucketing and histogram construction, integrating information uploaded by the clients, issuing information to the clients, performing some computation, and communicating with the servers of other clusters. In one exemplary embodiment, inter-cluster communication may be encrypted as Paillier ciphertext. The main functions of the client may include storing data, carrying out the server's instructions for tasks such as bucketing and histogram construction, and uploading information to the server. Communication within a cluster may have no privacy requirements. In one exemplary embodiment, the clients communicate only with the server of their own cluster.
Each cluster may hold a quantity of original sample data distributed among the plurality of clients in the cluster, the sample data including sample identifications. All sample identifications included in each cluster are intersected to obtain the common sample identifications, and each cluster selects, from its original sample data, the samples matching the common sample identifications as that cluster's sample data.
In an exemplary embodiment, step S101 may include: the server of each cluster collects all sample identifications in the cluster; sample identification intersection among the clusters is performed based on an oblivious transfer (OT) security protocol, with every cluster obtaining the same common sample identifications; and the server of each cluster sends the common sample identifications to the cluster's clients and instructs each client to intersect them with the sample identifications of the original sample data it holds, obtaining each client's client sample data. The cluster sample data may comprise, for example, the sample data held at the cluster's plurality of clients.
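By way of illustration only, the Python sketch below (with hypothetical data) performs the intersection in plaintext; in the disclosure this step is an OT-based private set intersection, so no cluster learns sample identifications outside the intersection.

```python
# Illustrative plaintext stand-in for the OT-based private set intersection.
def intersect_sample_ids(cluster_id_sets):
    """Return the sample identifications common to every cluster.

    cluster_id_sets: list of sets, one per cluster, each holding the
    sample identifications gathered by that cluster's server.
    """
    common = set(cluster_id_sets[0])
    for ids in cluster_id_sets[1:]:
        common &= ids
    return common

# Hypothetical data: three clusters with partially overlapping samples.
cluster_a = {"u01", "u02", "u03", "u04"}
cluster_b = {"u02", "u03", "u05"}
cluster_c = {"u02", "u03", "u04", "u06"}
print(intersect_sample_ids([cluster_a, cluster_b, cluster_c]))  # {'u02', 'u03'}
```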
According to some embodiments, the cluster sample data and the client sample data may each include a sample identification and at least one feature. As shown in FIG. 4, step S102 of separately bucketing the cluster sample data of each of the plurality of clusters to obtain cluster bucketed data may include: step S10201, for each of the plurality of clusters, traversing the at least one feature of that cluster's sample data; step S10202, generating at least one data bucket based on the feature data of the current feature; and step S10203, integrating the data buckets corresponding to all features to obtain the cluster's bucketed data. Bucketing the cluster sample data reduces the number of split points, and corresponding information gains, that must be computed, greatly increasing modeling speed; at the same time, buckets mask the per-sample feature data inside them, so bucketing can serve as a basis for multi-party joint modeling under privacy requirements.
Bucketing is a process of discretizing feature data based on feature information. According to some embodiments, a data bucket may include at least one of a sample identification, the bucket's value, the client it belongs to, and the feature it belongs to. In one exemplary embodiment, as shown in FIG. 5, data 501 includes sample identifications 1-15 and the feature data of a selected feature. The bucketing process may include, for example: sorting the feature data of the selected feature to generate sorted feature data 502; and, based on a preset bucketing rule, dividing the feature data into a plurality of data buckets 5001 to obtain bucketed data 503. The clients a data bucket belongs to may include the clients of the feature data placed in that bucket, the bucket's feature may be the feature on which the bucketing was performed, the bucket's sample identifications may include the identifications corresponding to the feature data placed in it, and the bucket's value may be, for example, the average, median, minimum, maximum, or a value obtained by another calculation over all feature data placed in the same bucket, which is not limited herein. In an exemplary embodiment, as shown in FIG. 5, the value of each data bucket 5001 may be the median of all feature data placed in it.
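A minimal Python sketch of this process, assuming equal-frequency buckets over the sorted feature data and the median as each bucket's value (one of the aggregation choices allowed above); all names and data are illustrative.

```python
import numpy as np

def bucketize_feature(sample_ids, values, n_buckets):
    """Sort one feature's data and split it into roughly equal buckets,
    using the median of each bucket as its representative value."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)                    # sorted feature data (502)
    buckets = []
    for chunk in np.array_split(order, n_buckets):  # data buckets (5001)
        buckets.append({
            "sample_ids": [sample_ids[i] for i in chunk],
            "value": float(np.median(values[chunk])),
        })
    return buckets

ids = list(range(1, 16))                             # sample identifications 1-15
feat = np.random.default_rng(0).normal(size=15)      # hypothetical feature data
for b in bucketize_feature(ids, feat, n_buckets=5):
    print(b["value"], b["sample_ids"])
```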
According to some embodiments, a to-be-merged data bucket may include at least one of a sample identification, the bucket's value, the client it belongs to, and the feature it belongs to. As shown in FIG. 6, step S10202 of generating at least one data bucket based on the feature data of the current feature may include: step S602, judging whether the feature data of the current feature is distributed on the same client; step S603, in response to the feature data being distributed over a plurality of clients, instructing each of those clients to bucket the current feature's feature data included in its own client sample data, generating at least one to-be-merged data bucket of the current feature, and uploading it to the server corresponding to the clients; and step S604, merging the received to-be-merged data buckets uploaded by the clients to generate the at least one data bucket. Thus, when the current feature's data is distributed over multiple clients, each client pre-buckets the sample data it holds, and the server then merges buckets whose values are the same or similar into single buckets, realizing bucketing of distributed data. Compared with transmitting all of the clients' feature data for the current feature to the server for sorting and bucketing there, this approach markedly speeds up bucketing and therefore modeling.
According to some embodiments, step S604 of merging the received to-be-merged data buckets uploaded by the plurality of clients to generate the at least one data bucket may include: sorting all of the current feature's to-be-merged data buckets by bucket value; and merging one or more consecutive to-be-merged buckets with the same or similar values into one data bucket, where the merged bucket's sample identifications may be all identifications in the merged to-be-merged buckets, the merged bucket's clients may include the clients of the merged to-be-merged buckets, the merged bucket's feature may be the current feature, and the merged bucket's value may be, for example, the average, median, minimum, maximum, or a value obtained by another calculation over the values of the merged to-be-merged buckets, which is not limited herein.
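A sketch of that server-side merge, with an assumed numeric tolerance `tol` standing in for "the same or similar values" (the disclosure does not specify how similarity is judged):

```python
def merge_buckets(pending, tol=1e-6):
    """Sort the clients' to-be-merged buckets by value, then fuse
    consecutive buckets whose values are within `tol` of each other."""
    pending = sorted(pending, key=lambda b: b["value"])
    merged = []
    for b in pending:
        if merged and b["value"] - merged[-1]["value"] <= tol:
            last = merged[-1]
            last["sample_ids"] = last["sample_ids"] + b["sample_ids"]
            last["clients"] = last["clients"] + [b["client"]]
            # keep e.g. the mean of the fused values as the bucket's value
            last["value"] = (last["value"] + b["value"]) / 2.0
        else:
            merged.append({"value": b["value"],
                           "sample_ids": list(b["sample_ids"]),
                           "clients": [b["client"]]})
    return merged
```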
According to some embodiments, as shown in FIG. 6, step S10202 of generating at least one data bucket based on the feature data of the current feature may further include: step S605, sending each of the at least one merged data bucket to the client it belongs to, whose client bucketed data then comprises that merged data bucket.
According to some embodiments, as shown in FIG. 6, step S10202 may further include: step S606, in response to the feature data of the current feature being located on the same client, instructing that client to bucket the feature data, generate at least one data bucket, and upload it to the corresponding server. Steps S601 and S607 in FIG. 6 are similar to steps S10201 and S10203 in FIG. 4, respectively, and step S607 may be executed after steps S605 and S606. When all feature data of the current feature resides on a single client, directly taking that client's bucketing result as the final result removes the server-side merging work and further increases modeling speed.
In the above technical solution, the clients are instructed to bucket their client sample data, generating data buckets or to-be-merged data buckets; the server merges the to-be-merged buckets and combines them with the remaining data buckets to obtain the cluster bucketed data and each client's bucketed data. This realizes fast bucketing of distributed data, greatly increasing modeling speed while enabling the model to support richer scenarios.
According to some embodiments, the plurality of clusters includes a first cluster and at least one second cluster, and as shown in FIG. 2, the multi-party joint modeling method may further include: step S202, generating a public/private key pair at the server of the first cluster; and step S203, sending the public key and the modeling parameters to the server of each second cluster. Steps S201 and S204 in FIG. 2 are similar to steps S101 and S102 in FIG. 1. With homomorphic encryption, the result the first cluster obtains by decrypting ciphertext that a second cluster computed over encrypted data equals the result of the same computation on plaintext, which can serve as a basis for multi-party joint modeling under privacy requirements.
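The additive property relied on here can be demonstrated with the third-party `phe` (python-paillier) package; the disclosure names Paillier ciphertext but does not prescribe any particular library, so this is a sketch under that assumption.

```python
from phe import paillier

# First cluster generates the key pair (step S202) and shares the public key.
public_key, private_key = paillier.generate_paillier_keypair()

# First cluster encrypts per-sample gradients before sending them out.
g1, g2 = 0.25, -0.75
enc_g1, enc_g2 = public_key.encrypt(g1), public_key.encrypt(g2)

# A second cluster can add ciphertexts (e.g. gradient sums per histogram
# bucket) without ever seeing the plaintext gradients.
enc_sum = enc_g1 + enc_g2

# Only the first cluster holds the private key; the decrypted result
# equals the sum computed on plaintext.
assert abs(private_key.decrypt(enc_sum) - (g1 + g2)) < 1e-9
```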
The modeling parameters may include, for example, a maximum number of iterations, a learning rate, a stop splitting condition, a model convergence condition, and the like. The modeling parameters are public to each cluster participating in the modeling and do not require encryption.
According to some embodiments, the cluster sample data of the first cluster further includes a sample label, and as shown in FIG. 7, step S103 of constructing a global information gain histogram based on the sample label and each cluster's bucketed data includes: step S10301, obtaining the current model's predicted value for each sample corresponding to each sample identification of the first cluster's bucketed data; step S10302, calculating first-order gradient data and second-order gradient data based on the predicted values and the sample labels; and step S10303, constructing a global information gain histogram based on the first-order gradient data, the second-order gradient data, and each cluster's bucketed data. A global information gain histogram can thus be constructed by calculating first- and second-order gradient data and combining each cluster's feature data, after which the optimal split point can be determined from the histogram to build the decision tree. Using an information gain histogram also reduces the number of split points and split thresholds to be calculated and permits histogram-difference acceleration, speeding up modeling.
According to some embodiments, the current model includes one or more sub-decision trees. The decision tree model constructed in the present disclosure may be, for example, a gradient boosting decision tree model, an XGBoost model, a LightGBM model, or another model, which is not limited herein. Each leaf node of the current model includes at least one sample identification indicating the samples assigned to that node. The structure of the current model's sub-decision trees, the cluster each node belongs to, and the sample identifications in every leaf node may be public to all clusters. A sample's predicted value may be calculated by summing each sub-decision tree's predicted value for that sample.
The first-order and second-order gradient data may be the first and second derivatives of the objective function configured for the model; substituting a sample's predicted value and sample label into these derivatives yields the sample's first-order and second-order gradient data.
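For concreteness, a sketch of computing per-sample gradient data for two common objectives; the disclosure leaves the objective function to the model configuration, so both choices here are assumptions.

```python
import numpy as np

def gradients(pred, label, objective="logistic"):
    """First- and second-order gradients with respect to the prediction."""
    pred = np.asarray(pred, dtype=float)
    label = np.asarray(label, dtype=float)
    if objective == "squared":           # 1/2 * (pred - label)^2
        g = pred - label
        h = np.ones_like(pred)
    elif objective == "logistic":        # log-loss on raw scores, labels in {0, 1}
        p = 1.0 / (1.0 + np.exp(-pred))
        g = p - label
        h = p * (1.0 - p)
    else:
        raise ValueError(objective)
    return g, h
```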
The information gain histogram may include a plurality of histogram buckets in one-to-one correspondence with the data buckets, each histogram bucket representing the information gain of its corresponding data bucket. Each histogram bucket includes the first-order gradient sum and second-order gradient sum over all samples of the corresponding data bucket, as well as the number of samples in that bucket.
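A sketch of filling such histogram buckets from the data buckets, assuming `g` and `h` map sample identifications to the corresponding gradient data:

```python
def build_gain_histogram(buckets, g, h, node_sample_ids):
    """One histogram bucket per data bucket: the first- and second-order
    gradient sums plus the sample count, restricted to the samples
    currently assigned to the node being split."""
    node = set(node_sample_ids)
    hist = []
    for b in buckets:
        ids = [i for i in b["sample_ids"] if i in node]
        hist.append({
            "G": sum(g[i] for i in ids),      # first-order gradient sum
            "H": sum(h[i] for i in ids),      # second-order gradient sum
            "count": len(ids),
        })
    return hist
```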
According to some embodiments, as shown in FIG. 2, step S10303 of constructing a global information gain histogram based on the first-order gradient data, the second-order gradient data, and each cluster's bucketed data includes: step S207, encrypting the first-order and second-order gradient data and sending them to the server of each second cluster; step S208, obtaining at least one node to be split of the current model, the node to be split comprising at least one sample identification; step S209, constructing a first information gain histogram based on the first-order gradient data, the second-order gradient data, the first cluster's bucketed data, and the at least one node to be split; step S211, receiving at least one ciphertext information gain histogram from the server of each second cluster; step S212, decrypting the ciphertext histograms to obtain at least one second information gain histogram in one-to-one correspondence with them; and step S213, combining the first information gain histogram and the at least one second information gain histogram to obtain a global information gain histogram. Steps S205 to S206 in FIG. 2 are similar to steps S10301 to S10302 in FIG. 7, respectively. By sending encrypted gradients to the second clusters and decrypting the ciphertext histograms received back, each party obtains the corresponding second information gain histograms without ever acquiring the other party's gradient data or sample data; combining the first cluster's histogram with the second histograms then constructs the global information gain histogram under privacy requirements.
A node to be split may be, for example, a leaf node of the latest sub-decision tree that satisfies the splitting-allowed condition. That condition may be, for example, that the number of sample identifications in the leaf node is less than a preset value, that the leaf node's depth is less than a preset depth, and so on, which is not limited herein. As a kind of leaf node, a node to be split may include a plurality of sample identifications, representing the samples the model has assigned to that node.
According to some embodiments, as shown in FIG. 8, step S209 of constructing a first information gain histogram based on the first-order gradient data, the second-order gradient data, the first cluster's bucketed data, and the at least one node to be split includes: step S20901, for each node to be split, traversing the at least one feature of the first cluster's bucketed data; step S20902, obtaining a first information gain sub-histogram or a first candidate splitting gain of the current feature of the node to be split, based on the node to be split and the current feature's feature data; and step S20903, combining, over every node to be split and every feature of the first cluster's bucketed data, the first information gain sub-histograms or first candidate splitting gains to obtain the first information gain histogram. Constructing a sub-histogram or candidate splitting gain for each node to be split and each first-cluster feature yields the first information gain histogram, from which the global information gain histogram can subsequently be obtained to build the decision tree.
According to some embodiments, as shown in FIG. 9, step S20902 of obtaining the first information gain sub-histogram or first candidate splitting gain of the current feature of the node to be split includes: step S902, judging whether the current feature's feature data is distributed on the same client; step S903, in response to the feature data being distributed over a plurality of clients, instructing each of them to construct a to-be-merged first information gain sub-histogram based on the first-order gradient data, the second-order gradient data, its own client bucketed data, and the node to be split, and to upload it to the corresponding server; and step S904, merging the received to-be-merged sub-histograms uploaded by the clients to construct the first information gain sub-histogram. Thus, when the current feature's data spans multiple clients, each client generates a to-be-merged sub-histogram and the server merges them, constructing the first information gain sub-histogram for the current node to be split and current feature. Compared with the first cluster's server constructing the sub-histogram directly, this lets multiple clients build sub-histograms in parallel, increasing modeling speed.
According to some embodiments, the first information gain sub-histogram includes at least one histogram bucket in one-to-one correspondence with all data buckets of the feature it belongs to, and each to-be-merged first information gain sub-histogram likewise includes at least one to-be-merged histogram bucket in one-to-one correspondence with those data buckets; each histogram bucket and to-be-merged histogram bucket includes at least one of a first-order gradient sum and a second-order gradient sum. As shown in FIG. 10, step S904 of merging the received to-be-merged sub-histograms uploaded by the plurality of clients to construct the first information gain sub-histogram includes: step S90401, merging the to-be-merged histogram buckets of the received to-be-merged sub-histograms to generate at least one histogram bucket; and step S90402, constructing the first information gain sub-histogram from the at least one histogram bucket. Merging the to-be-merged histogram buckets that correspond to the same data bucket into a single histogram bucket allows the first information gain sub-histogram to be constructed in a distributed manner, so that it can later be combined with the sub-histograms of the other features to obtain the first information gain histogram.
Each histogram bucket and to-be-merged histogram bucket may include the first-order gradient sum, the second-order gradient sum, and the count over the intersection of the sample identifications in its corresponding data bucket and the sample identifications in the current node to be split.
According to some embodiments, step S90401 of merging the received to-be-merged histogram buckets to generate the at least one histogram bucket may include: merging, for each data bucket of the current feature, the one or more corresponding to-be-merged histogram buckets to obtain that data bucket's histogram bucket. The histogram bucket's first-order gradient sum may be the sum of the merged buckets' first-order gradient sums, its second-order gradient sum the sum of their second-order gradient sums, and its sample identification count the sum of their sample identification counts.
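Because gradient sums and counts are additive, this merge is an element-wise sum over histograms with identical bucket layouts, as the following sketch shows:

```python
def merge_histograms(parts):
    """Fuse clients' to-be-merged histograms bucket by bucket: gradient
    sums and sample counts simply add, which is what makes the
    distributed construction exact."""
    merged = [{"G": 0.0, "H": 0.0, "count": 0} for _ in parts[0]]
    for part in parts:
        for out, hb in zip(merged, part):
            out["G"] += hb["G"]
            out["H"] += hb["H"]
            out["count"] += hb["count"]
    return merged
```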
According to some embodiments, as shown in FIG. 9, step S20902 may further include: step S905, in response to all feature data of the current feature being located on the same client, instructing that client to construct the first information gain sub-histogram based on the first-order gradient data, the second-order gradient data, its client bucketed data, and the node to be split, to calculate a first candidate splitting gain from the sub-histogram, and to upload the first candidate splitting gain to the first cluster's server. Steps S901 and S906 in FIG. 9 are similar to steps S20901 and S20903 in FIG. 8, respectively, and step S906 may be executed after steps S904 and S905. When all of the current feature's data resides on one client, directly constructing the sub-histogram there and computing the first candidate splitting gain from it removes the server-side histogram construction work and further increases modeling speed.
The first candidate splitting gain may be the maximum gain of the current feature at the current node to be split, obtained by calculating the splitting gain corresponding to each histogram bucket of the first information gain sub-histogram and selecting the maximum. The splitting gain corresponding to a histogram bucket can be calculated as follows: for the current node to be split and the current feature, obtain the first-order gradient sum, the second-order gradient sum, and the sample identification count over all histogram buckets (already available from the previous splitting gain calculation; called the parent node's information gain raw data); compute the same three sums over the histogram buckets whose corresponding data buckets have values smaller than that of the bucket at hand (the left child node's information gain raw data); obtain the right child node's information gain raw data as the difference between the parent's and the left child's; compute the information gain of each of the three nodes from its raw data; and take the splitting gain as the left child's information gain plus the right child's information gain minus the parent's information gain. The information gain may be calculated, for example, as the square of the first-order gradient sum divided by the sum of the second-order gradient sum and a regularization parameter, or as the square of the first-order gradient sum divided by the sample identification count, or by other methods, which are not limited herein.
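The prefix-sum scan just described can be sketched as follows, with `lam` standing in for the regularization parameter in the first information gain formula:

```python
def best_split_gain(hist, lam=1.0):
    """Scan histogram buckets left to right: the left child accumulates
    prefix sums, the right child is parent minus left, and each
    candidate's gain is the children's gains minus the parent's."""
    G = sum(hb["G"] for hb in hist)          # parent raw data
    H = sum(hb["H"] for hb in hist)
    parent = G * G / (H + lam)

    best, gl, hl = float("-inf"), 0.0, 0.0
    for hb in hist[:-1]:                     # splitting after the last bucket is a no-op
        gl += hb["G"]                        # left child raw data (prefix sums)
        hl += hb["H"]
        gr, hr = G - gl, H - hl              # right child raw data (by difference)
        gain = gl * gl / (hl + lam) + gr * gr / (hr + lam) - parent
        best = max(best, gain)
    return best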
In the above technical solution, the clients are instructed either to construct to-be-merged first information gain sub-histograms or to calculate first candidate splitting gains directly; the server merges the to-be-merged sub-histograms and combines them with the candidate splitting gains to obtain the first information gain histogram. This fast construction of the first information gain histogram increases modeling speed while enabling the model to support richer scenarios.
According to some embodiments, as shown in FIG. 2, step S10303 may further include: step S210, constructing a ciphertext information gain histogram based on the ciphertext first-order gradient data, the ciphertext second-order gradient data, the second cluster's bucketed data, and the at least one node to be split. In an exemplary embodiment, a method similar to steps S901 to S906 above may be adopted, substituting the ciphertext first-order and second-order gradient data for the plaintext gradients to obtain the corresponding ciphertext information gain histogram.
According to some embodiments, as shown in FIG. 2, step S104 of constructing the decision tree model based on the global information gain histogram includes: step S214, determining the optimal split point based on the global information gain histogram; step S215, instructing the client where the optimal split point is located to split at that point; step S216, iterating the splitting process until a split-termination condition is reached, generating a sub-decision tree; and step S217, iterating the sub-decision tree generation process until an iteration-termination condition is reached, obtaining the decision tree model. New leaf nodes are thus obtained by determining and splitting at the optimal split point, and repeatedly iterating these steps yields the decision tree model.
According to some embodiments, step S215 of instructing the client where the optimal split point is located to split at that point includes: instructing the client to calculate a split threshold based on the optimal split point and its own client bucketed data, to obtain the sample identifications included in the resulting leaf nodes, and to upload those leaf nodes to the server; and synchronizing the cluster of the node holding the optimal split point, and its leaf nodes, to the other clusters. Splitting at the optimal split point thus produces new leaf nodes, and the cluster owning the split node synchronizes them to every cluster, realizing sharing of the model among the clusters.
The optimal split point may be the histogram bucket with the largest splitting gain over all nodes to be split and all features. The split threshold can be calculated as follows: sort the feature data corresponding to the intersection of the sample identifications of the data bucket corresponding to the optimal split point and the sample identifications of the node to be split corresponding to the optimal split point; taking the mean of each pair of adjacent sorted samples' feature data as a candidate threshold, compute the splitting gain; and select the candidate threshold with the largest splitting gain as the optimal split point's threshold. The cluster each node belongs to is public, while the split threshold is stored only in that cluster; therefore, in the prediction stage, the owning cluster can be found through the node's ownership recorded in the jointly shared model, and the data is sent to that cluster to obtain the next node, until the final predicted value is obtained.
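A sketch of enumerating the candidate thresholds as the midpoints of consecutive sorted feature values; evaluating the splitting gain at each candidate and keeping the best is done by the owning client as described above:

```python
import numpy as np

def candidate_thresholds(feature_values):
    """Midpoints between consecutive sorted feature values inside the
    winning bucket; the client scores each midpoint by its splitting
    gain and keeps the best one as the node's split threshold."""
    v = np.sort(np.asarray(feature_values, dtype=float))
    return (v[:-1] + v[1:]) / 2.0
```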
According to another aspect of the present disclosure, there is also provided a multi-party joint prediction method based on a distributed system. As shown in FIG. 11, the method may include: step S1101, inputting a prediction sample into the decision tree model; step S1102, for each sub-decision tree of the model, obtaining the cluster to which the root node belongs; step S1103, communicating with that cluster to obtain the root node's feature; step S1104, sending the prediction sample's feature data for that feature to the cluster the root node belongs to, and obtaining the cluster to which the child node belongs; step S1105, iterating the above process to obtain each sub-decision tree's predicted value for the prediction sample; and step S1106, summing the sub-decision trees' predicted values to obtain the prediction sample's predicted value. By repeatedly communicating with the cluster owning the current node of the current sub-decision tree and receiving the next node in return, each sub-decision tree's predicted value for the sample, and hence the model's predicted value, can be computed. Because the decision tree model is shared among the clusters while each split threshold is stored only in the cluster owning its node, the complete model is not disclosed to any single cluster; prediction is realized by combining the information stored in each cluster, supporting multi-party joint modeling in privacy scenarios.
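A structural sketch of this joint traversal; `Node`, `LocalCluster`, and `route` are hypothetical abstractions, and in the real system `route` is a cross-cluster request so that only the owning cluster ever sees its threshold:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    owner_cluster: str = ""            # public: which cluster holds this node
    feature: str = ""                  # obtained by asking the owning cluster
    threshold: float = 0.0             # private: known only to the owning cluster
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    weight: float = 0.0                # leaf value

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

class LocalCluster:
    """Stand-in for a remote cluster; only it compares against the threshold."""
    def route(self, node: Node, value: float) -> Node:
        return node.left if value <= node.threshold else node.right

def predict(sample: dict, roots: list, clusters: dict) -> float:
    total = 0.0
    for root in roots:                                  # one root per sub-decision tree
        node = root
        while not node.is_leaf:
            cluster = clusters[node.owner_cluster]      # node ownership is public
            node = cluster.route(node, sample[node.feature])
        total += node.weight                            # sum leaf values over the trees
    return total
```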
According to another aspect of the present disclosure, a multi-party joint modeling apparatus is also provided. As shown in FIG. 12, the multi-party joint modeling apparatus 1200 may include: an intersection module 1201 configured to intersect the sample identifications included in each of the plurality of clusters to obtain an intersection sample identification and, for each cluster, the cluster sample data corresponding to it; a bucketing module 1202 configured to bucket each cluster's sample data to obtain each cluster's bucketed data; a first construction module 1203 configured to construct a global information gain histogram based on the sample labels and the plurality of cluster bucketed data; and a second construction module 1204 configured to construct a decision tree model based on the global information gain histogram.
According to another aspect of the present disclosure, there is also provided an electronic device, which may include: a processor; and a memory storing a program comprising instructions which, when executed by the processor, cause the processor to perform the multi-party joint modeling method described above.
According to another aspect of the present disclosure, there is also provided a computer-readable storage medium storing a program comprising instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the multi-party joint modeling method described above.
Referring to FIG. 13, a computing device 13000, which is an example of a hardware device (electronic device) that can be applied to aspects of the present disclosure, will now be described. Computing device 13000 can be any machine configured to perform processing and/or computing, and can be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant, a robot, a smart phone, an in-vehicle computer, or any combination thereof. The multi-party joint modeling method described above can be implemented in whole or at least in part by computing device 13000 or a similar device or system.
The computing device 13000 can include components that connect to the bus 13002 or communicate with the bus 13002 (possibly via one or more interfaces). For example, computing device 13000 can include a bus 13002, one or more processors 13004, one or more input devices 13006, and one or more output devices 13008. The one or more processors 13004 can be any type of processor and can include, but are not limited to, one or more general purpose processors and/or one or more special purpose processors (e.g., special processing chips). The input device 13006 can be any type of device capable of inputting information to the computing device 13000 and can include, but is not limited to, a mouse, a keyboard, a touch screen, a microphone, and/or a remote control. Output device 13008 can be any type of device capable of presenting information and can include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The computing device 13000 may also include or be connected with a non-transitory storage device 13010, which may be any storage device that is non-transitory and that may enable data storage, and may include, but is not limited to, a disk drive, an optical storage device, a solid state memory, a floppy disk, a flexible disk, a hard disk, a magnetic tape, or any other magnetic medium, an optical disk or any other optical medium, a ROM (read only memory), a RAM (random access memory), a cache memory, and/or any other memory chip or cartridge, and/or any other medium from which a computer may read data, instructions, and/or code. The non-transitory storage device 13010 is removable from the interface. The non-transitory storage device 13010 may have data/program (including instructions)/code for implementing the above-described methods and steps. Computing device 13000 can also include a communication device 13012. The communication device 13012 may be any type of device or system that enables communication with external devices and/or with a network and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
Computing device 13000 can also include a working memory 13014, which can be any type of working memory that can store programs (including instructions) and/or data useful to the operation of processor 13004, and can include, but is not limited to, a random access memory and/or a read only memory device.
Software elements (programs) may reside in the working memory 13014 including, but not limited to, an operating system 13016, one or more application programs 13018, drivers, and/or other data and code. Instructions for performing the above-described methods and steps may be included in one or more applications 13018, and the multi-party joint modeling method described above may be implemented by the processor 13004 reading and executing the instructions of one or more applications 13018. More specifically, in the multi-party joint modeling method described above, steps S101 to S104 can be realized, for example, by the processor 13004 executing the application 13018 having the instructions of steps S101 to S104. Further, the other steps in the method may be implemented, for example, by the processor 13004 executing an application 13018 having instructions to perform the respective steps. Executable code or source code of instructions of the software elements (programs) may be stored in a non-transitory computer readable storage medium, such as the storage device 13010 described above, and may be stored in the working memory 13014 (possibly compiled and/or installed) upon execution. Executable code or source code for the instructions of the software elements (programs) may also be downloaded from a remote location.
It will also be appreciated that various modifications may be made in accordance with specific requirements. For example, customized hardware might also be used and/or particular elements might be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. For example, some or all of the disclosed methods and apparatus may be implemented by programming hardware (e.g., programmable logic circuitry including Field Programmable Gate Arrays (FPGAs) and/or Programmable Logic Arrays (PLAs)) in an assembly language or hardware programming language such as VERILOG, VHDL, or C++, using logic and algorithms according to the present disclosure.
It should also be understood that the foregoing method may be implemented in a server-client mode. For example, a client may receive data input by a user and send the data to a server. The client may also receive data input by the user, perform part of the processing in the foregoing method, and transmit the data obtained by the processing to the server. The server may receive data from the client and perform the aforementioned method or another part of the aforementioned method and return the results of the execution to the client. The client may receive the results of the execution of the method from the server and may present them to the user, for example, through an output device.
It should also be understood that the components of computing device 13000 can be distributed across a network. For example, some processes may be performed using one processor while other processes may be performed by another processor that is remote from the one processor. Other components of computing system 13000 may also be similarly distributed. Thus, computing device 13000 can be interpreted as a distributed computing system that performs processing at multiple locations.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure, and various elements in the embodiments or examples may be combined in various ways. It should be noted that, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (19)

1. A multi-party joint modeling method based on a distributed system, wherein the distributed system comprises a plurality of clusters, each of the plurality of clusters comprising a server and a plurality of clients, the method comprising:
intersecting the sample identifications included in each of the plurality of clusters to obtain an intersection sample identification and cluster sample data corresponding to the intersection sample identification included in each of the plurality of clusters, wherein the sample identifications and the cluster sample data included in each of the plurality of clusters are stored in a plurality of clients of the corresponding cluster in a distributed manner;
performing bucketing on the cluster sample data of each of the plurality of clusters, respectively, to obtain cluster bucket data of each of the plurality of clusters;
constructing a global information gain histogram based on a sample label and the cluster bucket data of each of the plurality of clusters, wherein the sample label is the true value of each sample and is stored in a particular one of the plurality of clusters; and
constructing a decision tree model based on the global information gain histogram.
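By way of illustration, the identification-intersection step of claim 1 amounts to intersecting, across clusters, the union of the sample identifications held by each cluster's clients. The following is a minimal single-process Python sketch under that reading; a real deployment would use a privacy-preserving set intersection protocol, and all names below are illustrative rather than taken from the patent:

    # Sketch of the ID intersection of claim 1. Each cluster's sample IDs
    # are the union of the IDs stored across its clients.
    from functools import reduce

    def cluster_ids(client_id_sets):
        """Union of the sample identifications held by one cluster's clients."""
        return set().union(*client_id_sets)

    def intersect_clusters(all_clusters):
        """Intersection of sample identifications over all clusters."""
        return reduce(set.intersection, (cluster_ids(c) for c in all_clusters))

    cluster_a = [{"u1", "u2"}, {"u3", "u4"}]   # two clients of cluster A
    cluster_b = [{"u2", "u3"}, {"u5"}]         # two clients of cluster B
    print(intersect_clusters([cluster_a, cluster_b]))  # {'u2', 'u3'} (order may vary)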
2. The multi-party joint modeling method of claim 1, wherein the cluster sample data comprises client sample data held by the respective plurality of clients, the cluster sample data and the client sample data each comprising a sample identification and at least one feature,
wherein the performing bucketing on the cluster sample data of each of the plurality of clusters to obtain the cluster bucket data of each of the plurality of clusters comprises:
for each of the plurality of clusters, traversing the at least one feature of the cluster sample data of that cluster;
generating at least one data bucket based on the feature data of the current feature; and
integrating the data buckets corresponding to all of the features to obtain the cluster bucket data of the cluster, wherein the cluster bucket data of the cluster comprises the at least one feature of the cluster sample data of the cluster and one or more data buckets corresponding to each of the at least one feature.
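For example, one admissible bucketing rule for claim 2 is quantile-style grouping by sorted feature value; the claims do not fix a particular rule, so the sketch below is one assumption:

    # One possible per-feature bucketing (claim 2): sort samples by feature
    # value and cut the sorted order into roughly equal-sized data buckets.
    def make_buckets(values_by_id, n_buckets=4):
        items = sorted(values_by_id.items(), key=lambda kv: kv[1])
        size = max(1, len(items) // n_buckets)
        chunks = [items[i:i + size] for i in range(0, len(items), size)]
        return [{sid for sid, _ in chunk} for chunk in chunks]

    feature = {"u1": 0.3, "u2": 1.7, "u3": 0.9, "u4": 2.4}
    print(make_buckets(feature, n_buckets=2))  # two buckets: {u1, u3} and {u2, u4}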
3. The multi-party joint modeling method of claim 2, wherein the generating at least one data bucket based on the feature data of the current feature comprises:
determining whether the feature data of the current feature is distributed on a single client;
in response to the feature data of the current feature being distributed across a plurality of clients, instructing each of the plurality of clients to perform bucketing on the feature data of the current feature included in its respective client sample data, to generate at least one to-be-merged data bucket of the current feature, and to upload the at least one to-be-merged data bucket to the server corresponding to the plurality of clients; and
merging the received to-be-merged data buckets uploaded by the plurality of clients to generate the at least one data bucket, wherein each of the at least one data bucket is formed by merging one or more to-be-merged data buckets having the same or similar bucket values, the sample identifications of each of the at least one data bucket comprise all of the sample identifications included in the one or more to-be-merged data buckets, and the clients to which each of the at least one data bucket belongs comprise the clients to which the one or more to-be-merged data buckets belong.
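The server-side merge of claim 3 can be sketched as grouping uploaded buckets whose representative values are equal or sufficiently close; the tolerance and the (value, sample IDs, client) layout below are assumptions:

    # Sketch of merging to-be-merged data buckets (claim 3). Buckets from
    # different clients merge when their bucket values are equal or close;
    # a merged bucket keeps the union of sample IDs and of owning clients.
    def merge_buckets(uploaded, tol=1e-6):
        merged = []
        for value, ids, client in sorted(uploaded, key=lambda t: t[0]):
            if merged and abs(value - merged[-1][0]) <= tol:
                merged[-1][1] |= ids        # union of sample identifications
                merged[-1][2].add(client)   # union of owning clients
            else:
                merged.append([value, set(ids), {client}])
        return merged

    uploads = [(1.0, {"u1"}, "c1"), (1.0, {"u2"}, "c2"), (2.5, {"u3"}, "c1")]
    print(merge_buckets(uploads))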
4. The multi-party joint modeling method of claim 3, wherein the generating at least one data bucket based on the feature data of the current feature further comprises:
sending each of the at least one merged data bucket to the clients to which that data bucket belongs, wherein the client bucket data of each such client comprises the at least one data bucket.
5. The multi-party joint modeling method of claim 4, wherein the generating at least one data bucket based on the feature data of the current feature further comprises:
in response to the feature data of the current feature being distributed on a single client, instructing that client to perform bucketing on the feature data of the current feature, to generate at least one data bucket, and to upload the at least one data bucket to the server corresponding to that client, wherein the client bucket data of that client comprises the at least one data bucket.
6. The multi-party joint modeling method of claim 5, wherein the plurality of clusters comprises a first cluster and at least one second cluster, the cluster sample data of the first cluster further comprising the sample label,
wherein the constructing a global information gain histogram based on the sample label and the cluster bucket data of each of the plurality of clusters comprises:
obtaining, from the current model, a predicted value for each sample corresponding to each sample identification of the cluster bucket data of the first cluster;
calculating first-order gradient data and second-order gradient data based on the predicted values and the sample labels; and
constructing the global information gain histogram based on the first-order gradient data, the second-order gradient data, and the cluster bucket data of each of the plurality of clusters.
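As one concrete, non-limiting instance of the gradient computation in claim 6: with a logistic loss, the first-order and second-order gradients of each sample take the familiar closed forms g = p - y and h = p(1 - p), where p is the predicted probability and y the sample label:

    # First/second-order gradients under a logistic loss (one instance of
    # claim 6); `pred` is the current model's raw score, `label` is 0 or 1.
    import math

    def gradients(pred, label):
        p = 1.0 / (1.0 + math.exp(-pred))   # predicted probability
        return p - label, p * (1.0 - p)     # (first-order g, second-order h)

    print(gradients(0.2, 1))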
7. The multi-party joint modeling method of claim 6, wherein the constructing the global information gain histogram based on the first-order gradient data, the second-order gradient data, and the cluster bucket data of each of the plurality of clusters comprises:
encrypting the first-order gradient data and the second-order gradient data, and sending the encrypted first-order gradient data and the encrypted second-order gradient data to the server of each of the at least one second cluster;
obtaining at least one node to be split of the current model, wherein each node to be split comprises at least one sample identification;
constructing a first information gain histogram based on the first-order gradient data, the second-order gradient data, cluster bucket data of the first cluster and the at least one node to be split;
receiving at least one ciphertext information gain histogram from the server of each of the at least one second cluster;
decrypting the at least one ciphertext information gain histogram to obtain at least one second information gain histogram in one-to-one correspondence with the at least one ciphertext information gain histogram; and
combining the first information gain histogram and the at least one second information gain histogram to obtain the global information gain histogram.
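Claim 7 only requires that a second cluster can accumulate encrypted gradients into a ciphertext histogram without decrypting them, which an additively homomorphic scheme such as Paillier provides. A minimal sketch using the third-party python-paillier (phe) package, with illustrative data:

    # Sketch of the encrypted exchange of claim 7 with Paillier encryption.
    from phe import paillier

    pub, priv = paillier.generate_paillier_keypair(n_length=1024)

    # First cluster: encrypt per-sample first-order gradients and send them.
    grads = {"u1": 0.4, "u2": -0.1, "u3": 0.7}
    enc = {sid: pub.encrypt(g) for sid, g in grads.items()}

    # Second cluster: sum ciphertexts over one histogram bucket's samples
    # (u1 and u3) without ever seeing the plaintext gradients.
    bucket = {"u1", "u3"}
    enc_sum = sum((enc[sid] for sid in bucket), pub.encrypt(0.0))

    # First cluster: decrypt the received ciphertext histogram entry.
    print(round(priv.decrypt(enc_sum), 6))  # 1.1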
8. The multi-party joint modeling method of claim 7, wherein the constructing a first information gain histogram based on the first-order gradient data, the second-order gradient data, the cluster bucket data of the first cluster, and the at least one node to be split comprises:
traversing, for each node to be split of the at least one node to be split, at least one feature of cluster bucket data of the first cluster;
obtaining a first information gain sub-histogram or a first candidate splitting gain of the current feature of the node to be split based on the node to be split and the feature data of the current feature; and
combining, for each node to be split, the first information gain sub-histogram or the first candidate splitting gain of each of the at least one feature of the cluster bucket data of the first cluster to obtain the first information gain histogram.
9. The multi-party joint modeling method of claim 8, wherein the obtaining a first information gain sub-histogram or a first candidate splitting gain of the current feature of the node to be split based on the node to be split and the feature data of the current feature comprises:
determining whether the feature data of the current feature is distributed on a single client;
in response to the feature data of the current feature being distributed across a plurality of clients, instructing each of the plurality of clients to construct a first information gain sub-histogram to be merged based on the first-order gradient data, the second-order gradient data, the client bucket data of that client, and the node to be split, and to upload the first information gain sub-histogram to be merged to the server corresponding to the plurality of clients; and
merging the received first information gain sub-histograms to be merged uploaded by the plurality of clients to construct the first information gain sub-histogram.
10. The multi-party joint modeling method of claim 9, wherein the first information gain sub-histogram comprises at least one histogram bucket corresponding to all of the data buckets belonging to the current feature, the first information gain sub-histogram to be merged comprises at least one histogram bucket to be merged corresponding to all of the data buckets belonging to the current feature, and the histogram buckets and the histogram buckets to be merged each comprise at least one of a first-order gradient sum and a second-order gradient sum,
wherein the merging the received first information gain sub-histograms to be merged uploaded by the plurality of clients to construct the first information gain sub-histogram comprises:
merging the histogram buckets to be merged in the received first information gain sub-histograms to be merged uploaded by the plurality of clients to generate at least one histogram bucket, wherein each of the at least one histogram bucket is formed by merging one or more histogram buckets to be merged corresponding to the same data bucket, the first-order gradient sum of each of the at least one histogram bucket being the sum of the first-order gradient sums of the one or more histogram buckets to be merged, and the second-order gradient sum of each of the at least one histogram bucket being the sum of the second-order gradient sums of the one or more histogram buckets to be merged; and
constructing the first information gain sub-histogram based on the at least one histogram bucket.
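The merge of claim 10 reduces, per underlying data bucket, to summing the first-order and second-order gradient sums of the uploaded histogram buckets; the dictionary layout below is illustrative:

    # Sketch of merging to-be-merged histogram buckets (claim 10).
    from collections import defaultdict

    def merge_histograms(partials):
        """partials: per-client dicts mapping bucket id -> (g_sum, h_sum)."""
        merged = defaultdict(lambda: (0.0, 0.0))
        for hist in partials:
            for bid, (g, h) in hist.items():
                G, H = merged[bid]
                merged[bid] = (G + g, H + h)
        return dict(merged)

    client1 = {0: (0.4, 0.2), 1: (-0.1, 0.3)}
    client2 = {0: (0.2, 0.1)}
    print(merge_histograms([client1, client2]))  # bucket 0 -> (~0.6, ~0.3)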
11. The multi-party joint modeling method of claim 10, wherein the obtaining the first information gain sub-histogram or the first candidate splitting gain of the current feature of the node to be split based on the node to be split and the feature data of the current feature further comprises:
in response to all of the feature data of the current feature being distributed on a single client, instructing that client to construct the first information gain sub-histogram based on the first-order gradient data, the second-order gradient data, the client bucket data of that client, and the node to be split, to calculate the first candidate splitting gain based on the first information gain sub-histogram, and to upload the first candidate splitting gain to the server of the first cluster.
12. The multi-party joint modeling method of claim 7, wherein the ciphertext information gain histogram is constructed based on the ciphertext first-order gradient data, the ciphertext second-order gradient data, the cluster bucket data of the second cluster, and the at least one node to be split.
13. The multi-party joint modeling method of claim 6, wherein the current model comprises one or more sub-decision trees, and the at least one node to be split is a subset of the leaf nodes of the last sub-decision tree of the current model.
14. The multi-party joint modeling method of claim 13, wherein the constructing a decision tree model based on the global information gain histogram comprises:
determining an optimal splitting point based on the global information gain histogram;
instructing the client at which the optimal splitting point is located to split at the optimal splitting point;
iterating the splitting process until a splitting-termination condition is reached, thereby generating a sub-decision tree; and
iterating the sub-decision tree generation process until an iteration-termination condition is reached, to obtain the decision tree model.
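The claims do not prescribe how the optimal splitting point of claim 14 is scored; a standard choice, consistent with the first-order and second-order gradient sums stored in the histogram buckets, is the second-order gain used by gradient-boosted trees (the regularization term lam below is an assumption):

    # Reference split-gain scoring over an information gain histogram.
    def split_gain(G_left, H_left, G_right, H_right, lam=1.0):
        def score(G, H):
            return G * G / (H + lam)
        return 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                      - score(G_left + G_right, H_left + H_right))

    # Candidate split separating bucket 0 from bucket 1 of the merged example.
    print(split_gain(0.6, 0.3, -0.1, 0.3))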
15. The multi-party joint modeling method of claim 14, wherein the structure of the one or more sub-decision trees of the current model, and the clusters to which all nodes of the one or more sub-decision trees belong, are disclosed to the plurality of clusters,
wherein the instructing the client at which the optimal splitting point is located to split at the optimal splitting point comprises:
instructing the client to calculate a splitting threshold based on the optimal splitting point and the client bucket data of the client, to obtain the sample identifications included in the split leaf nodes, and to upload the leaf nodes to the server; and
synchronizing the cluster to which the node at the optimal splitting point belongs, together with the leaf nodes, to the clusters of the plurality of clusters other than that cluster.
16. A multi-party joint prediction method based on a decision tree model constructed according to the method of any one of claims 1-15, comprising:
inputting prediction samples into the decision tree model;
for each sub-decision tree of the decision tree model, obtaining the cluster to which the root node belongs;
communicating with the cluster to which the root node belongs to obtain the feature of the root node;
sending the feature data of the feature of the root node for the prediction sample to the cluster to which the root node belongs, to obtain the cluster to which the child node belongs;
iterating the above process to obtain the predicted value of each sub-decision tree for the prediction sample; and
summing the predicted values of each sub-decision tree for the prediction sample to obtain the predicted value for the prediction sample.
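The prediction walk of claim 16 can be sketched with the cross-cluster query abstracted as a callback that either returns the next node or a leaf value; all names are illustrative:

    # Sketch of the multi-party prediction of claim 16.
    def predict_tree(root, sample, ask_cluster):
        """ask_cluster(node, sample) returns the child node id, or a
        ('leaf', value) pair, as answered by the node's owning cluster."""
        node = root
        while True:
            answer = ask_cluster(node, sample)
            if isinstance(answer, tuple) and answer[0] == "leaf":
                return answer[1]
            node = answer

    def predict(roots, sample, ask_cluster):
        # Per claim 16, the prediction is the sum over all sub-decision trees.
        return sum(predict_tree(r, sample, ask_cluster) for r in roots)

    # Toy single-party stand-in for the cross-cluster communication.
    nodes = {"n0": lambda s: "n1" if s["age"] < 30 else ("leaf", 0.8),
             "n1": lambda s: ("leaf", 0.2)}
    print(predict(["n0"], {"age": 25}, lambda n, s: nodes[n](s)))  # 0.2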
17. A multi-party joint modeling apparatus based on a distributed system, comprising:
an intersection module configured to intersect the sample identifications included in each of a plurality of clusters to obtain an intersection sample identification and the cluster sample data corresponding to the intersection sample identification included in each of the plurality of clusters;
a bucketing module configured to perform bucketing on the cluster sample data of each of the plurality of clusters to obtain cluster bucket data of each of the plurality of clusters;
a first construction module configured to construct a global information gain histogram based on a sample label and the cluster bucket data of each of the plurality of clusters; and
a second construction module configured to construct a decision tree model based on the global information gain histogram.
18. An electronic device, comprising:
a processor; and
a memory storing a program comprising instructions that, when executed by the processor, cause the processor to perform the method of any of claims 1-16.
19. A computer readable storage medium storing a program, the program comprising instructions that when executed by a processor of an electronic device cause the electronic device to perform the method of any of claims 1-16.
CN202011165475.8A 2020-10-27 2020-10-27 Multiparty joint modeling method, device, equipment and storage medium Active CN112182982B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011165475.8A CN112182982B (en) 2020-10-27 2020-10-27 Multiparty joint modeling method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112182982A true CN112182982A (en) 2021-01-05
CN112182982B CN112182982B (en) 2024-03-01

Family

ID=73922290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011165475.8A Active CN112182982B (en) 2020-10-27 2020-10-27 Multiparty joint modeling method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112182982B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372136A (en) * 2010-12-30 2017-02-01 脸谱公司 Distributed cache system and method and storage medium
US20160269247A1 (en) * 2015-03-13 2016-09-15 Nec Laboratories America, Inc. Accelerating stream processing by dynamic network aware topology re-optimization
CN111695697A (en) * 2020-06-12 2020-09-22 深圳前海微众银行股份有限公司 Multi-party combined decision tree construction method and device and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WU, LEI; OUYANG, HEMING: "Design and Implementation of a Spark-Based Distributed Health Big Data Analysis System", 软件导刊 (Software Guide), no. 07 *
WANG, YANG; ZHONG, YONG; ZHOU, WEIBO; YANG, GUANCI: "A Distributed Parallel Construction Method for Equi-Width Histograms in Cloud Databases", 工程科学与技术 (Advanced Engineering Sciences), no. 02 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722739A * 2021-09-06 2021-11-30 Jingdong Technology Holding Co., Ltd. Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN113722739B * 2021-09-06 2024-04-09 Jingdong Technology Holding Co., Ltd. Gradient lifting tree model generation method and device, electronic equipment and storage medium
CN114124784A * 2022-01-27 2022-03-01 Institute of Network Information, Systems Engineering Research Institute, Academy of Military Sciences Intelligent routing decision protection method and system based on vertical federation
CN114124784B * 2022-01-27 2022-04-12 Institute of Network Information, Systems Engineering Research Institute, Academy of Military Sciences Intelligent routing decision protection method and system based on vertical federation

Also Published As

Publication number Publication date
CN112182982B (en) 2024-03-01

Similar Documents

Publication Publication Date Title
CN111967615A (en) Multi-model training method and system based on feature extraction, electronic device and medium
US10169710B2 (en) Automated decision support provenance and simulation
De Hoogh et al. Practical secure decision tree learning in a teletreatment application
CN112182982B (en) Multiparty joint modeling method, device, equipment and storage medium
Daneva et al. Requirements for smart cities: Results from a systematic review of literature
Amin et al. A comparison of two oversampling techniques (smote vs mtdf) for handling class imbalance problem: A case study of customer churn prediction
CN111753324A (en) Private data processing method, private data computing method and applicable equipment
Gundu et al. Sixth-generation (6G) mobile cloud security and privacy risks for AI system using high-performance computing implementation
CN114925072B (en) Data management method, device, system, equipment and medium
CN104166701A (en) Machine learning method and system
Andriotis et al. Highlighting relationships of a smartphone’s social ecosystem in potentially large investigations
CN116034402A (en) Deterministic learning video scene detection
CN116975018A (en) Data processing method, device, computer equipment and readable storage medium
US20230177385A1 (en) Federated machine learning based on partially secured spatio-temporal data
CN113469377B (en) Federal learning auditing method and device
Mazroob Semnani et al. Towards an intelligent platform for big 3d geospatial data management
US11269812B2 (en) Derived relationship for collaboration documents
CN117077161B (en) Privacy protection depth model construction method and system based on dynamic programming solution
CN111199777A (en) Biological big data oriented streaming transmission and variation real-time mining system and method
Özbay et al. Usage of Cloud Computing and Big data for Internet of Things
CN112668748B (en) Prediction method and device and electronic equipment
CN116720300B (en) Drainage pipe network model system
CN112907134B (en) Man-machine visual interaction analysis and control system and method for different business scenes
US20240013050A1 (en) Packing machine learning models using pruning and permutation
US20230177730A1 (en) Stochastic compression of raster data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant