WO2022151654A1 - Random greedy algorithm-based horizontal federated gradient boosting tree optimization method - Google Patents

Random greedy algorithm-based horizontal federated gradient boosting tree optimization method

Info

Publication number
WO2022151654A1
WO2022151654A1 (PCT/CN2021/101319)
Authority
WO
WIPO (PCT)
Prior art keywords
segmentation
node
participant
tree
coordinator
Prior art date
Application number
PCT/CN2021/101319
Other languages
English (en)
French (fr)
Inventor
张金义
李振飞
Original Assignee
新智数字科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 新智数字科技有限公司
Priority to EP21918850.5A priority Critical patent/EP4131078A4/en
Publication of WO2022151654A1 publication Critical patent/WO2022151654A1/zh
Priority to US18/050,595 priority patent/US20230084325A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Definitions

  • The invention relates to the technical field of federated learning, and in particular to a horizontal federated gradient boosting tree optimization method based on a random greedy algorithm.
  • Federated learning is a machine learning framework that helps multiple institutions use data and build machine learning models jointly while meeting the requirements of user privacy protection, data security, and government regulation: participants model collaboratively without sharing their data, which technically breaks data silos and enables AI collaboration.
  • Under this framework, a virtual model is designed to let different data owners collaborate without exchanging data; the virtual model is the optimal model obtained as if all parties had aggregated their data, and each party serves its local objective according to the model.
  • Federated learning requires the modeling result to be arbitrarily close to the traditional result, i.e., the result of gathering the data of multiple data owners in one place for modeling; under the federated mechanism, all participants have the same identity and status, and a shared-data strategy can be established.
  • The greedy algorithm is a simpler and faster design technique for certain optimal-solution problems. It proceeds step by step, making at each step the choice that is best for the current situation according to some optimization measure, without considering all possible global configurations, which saves the large amount of time that exhausting every possibility in search of the optimum would require.
  • The greedy algorithm makes successive greedy choices top-down and iteratively; each greedy choice reduces the problem to a smaller subproblem. Every step must be guaranteed to yield a locally optimal solution, yet the resulting global solution is not always optimal, so a greedy algorithm does not backtrack.
  • The existing horizontal federated gradient boosting tree algorithm requires each participant and the coordinator to transmit histogram information frequently, which places high demands on the coordinator's network bandwidth and makes training efficiency sensitive to network stability; because the transmitted histograms contain user information, there is also a risk of leaking user privacy.
  • Introducing privacy-protection schemes such as secure multi-party computation, homomorphic encryption, and secret sharing can reduce the possibility of privacy leakage, but it increases the local computing burden and lowers training efficiency.
  • The purpose of the present invention is to provide a random greedy algorithm-based horizontal federated gradient boosting tree optimization method that solves the problems of the existing algorithm described in the background above: frequent histogram transmission between each participant and the coordinator, high demands on the coordinator's network bandwidth, training efficiency that is easily affected by network stability, the privacy risk posed by user information contained in the transmitted histograms, and the extra local computation and reduced training efficiency incurred when privacy-protection schemes are introduced.
  • To this end, the present invention provides the following technical solution: a random greedy algorithm-based horizontal federated gradient boosting tree optimization method, comprising the following steps.
  • Step 1: The coordinator sets the parameters of the gradient boosting tree model, including but not limited to the maximum number of decision trees T, the maximum tree depth L, and the initial prediction value base, and sends them to each participant p_i.
  • Step 6: Each participant p_i determines the split point of the current node n from its local data at that node, using the optimal split point algorithm, and sends the split point information to the coordinator.
  • Step 7: The coordinator collects the split point information of all participants and determines the split feature f and split value v according to the epsilon-greedy algorithm.
  • Step 8: The coordinator sends the finalized split information, including but not limited to the split feature f and split value v, to each participant.
  • Step 9: Each participant splits the current node data set according to the split feature f and split value v, and assigns the newly split data to the child nodes; one round of this per-node protocol is sketched below.
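Taken together, steps 6 to 9 form one coordinator-mediated split round per tree node. The following minimal Python sketch illustrates that message flow; the class and method names (SplitProposal, best_local_split, epsilon_greedy_select, apply_split) are illustrative assumptions, not names from the patent, and in-memory calls stand in for real network transport.

```python
from dataclasses import dataclass

@dataclass
class SplitProposal:
    feature: str      # f_i: locally optimal split feature
    value: float      # v_i: locally optimal split value
    n_samples: int    # N_i: number of samples at this node
    gain: float       # g_i: local objective-function gain

def split_round(participants, coordinator, node_id):
    """One node split (steps 6-9) of the horizontal federated protocol."""
    # Step 6: every participant proposes its locally optimal split point.
    proposals = [p.best_local_split(node_id) for p in participants]
    # Step 7: the coordinator chooses the global split feature f and
    # split value v epsilon-greedily from the collected proposals.
    f, v = coordinator.epsilon_greedy_select(proposals)
    # Steps 8-9: the decision is broadcast, and each participant splits
    # its local node data set, assigning the partitions to child nodes.
    for p in participants:
        p.apply_split(node_id, feature=f, value=v)
    return f, v
```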
  • The optimal split point algorithm in step 6 is as follows: first, determine the split objective function, including but not limited to the following objective functions.
  • Information gain: information gain is one of the most commonly used metrics of the purity of a sample set. Assuming the node sample set D contains K classes of samples, in which the k-th class accounts for a proportion p_k, the information entropy of D is defined as $\mathrm{Ent}(D)=-\sum_{k=1}^{K}p_k\log_2 p_k$, and the information gain of splitting on an attribute a with V possible values is defined as $\mathrm{Gain}(D,a)=\mathrm{Ent}(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$.
  • Structure score: G_L is the sum of first-order gradients of the data routed to the left node after splitting the data set at the split point, H_L is the sum of second-order gradients of the left-node data, and G_R and H_R are the corresponding gradient sums of the right node; γ is the tree-model complexity penalty term and λ is the second-order regularization term. The split gain takes the standard gradient-boosting form $\mathrm{Gain}=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma$; a sketch of this gain computation follows.
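The split gain above can be computed directly from the six quantities just defined. A minimal sketch; the closed form is the standard one for second-order gradient boosting (XGBoost-style), reconstructed here because the source renders its formulas only as images:

```python
def structure_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Structure-score split gain from first/second-order gradient sums.

    G_L, H_L: gradient and hessian sums routed to the left child;
    G_R, H_R: the same sums for the right child;
    lam:      second-order regularization term (lambda);
    gamma:    tree-model complexity penalty per added leaf.
    """
    def leaf_score(G, H):
        return G * G / (H + lam)

    before = leaf_score(G_L + G_R, H_L + H_R)   # score without splitting
    after = leaf_score(G_L, H_L) + leaf_score(G_R, H_R)
    return 0.5 * (after - before) - gamma
```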
  • Second, determine discrete split points for the value range of each feature; the split points can be distributed uniformly over the value range according to the data distribution, where uniformity means that the amount of data between adjacent split points is approximately equal, or that their second-order gradient sums are approximately equal. A sketch of such candidate-point selection follows.
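Both notions of "uniform" candidate points, equal sample counts and equal second-order gradient mass between adjacent points, reduce to (weighted) quantile selection. A sketch assuming NumPy; the function and parameter names are illustrative:

```python
import numpy as np

def candidate_splits(values, hessians=None, n_bins=32):
    """Discrete candidate split points for one feature.

    With hessians=None, adjacent candidates enclose roughly equal
    sample counts; with hessians given, they enclose roughly equal
    second-order gradient mass (a weighted quantile sketch).
    """
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.ones_like(v) if hessians is None else np.asarray(hessians, dtype=float)[order]
    cum = np.cumsum(w)                               # cumulative mass over sorted values
    targets = np.linspace(0.0, cum[-1], n_bins + 1)[1:-1]
    idx = np.searchsorted(cum, targets)
    return np.unique(v[idx])                         # deduplicate repeated values
```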
  • The epsilon-greedy algorithm in step 7, for node n: each participant sends its node split point information to the coordinator, including the split feature f_i, the split value v_i, the number of node samples N_i, and the local objective-function gain g_i, where i indexes the participants;
  • the coordinator determines the optimal split feature f_max from the participants' split information based on the majority principle; let X be a random number uniformly distributed in [0, 1] and sample x from X: if x <= epsilon, one of the participants' split features is selected at random as the global split feature; otherwise, f_max is selected as the global split feature;
  • each participant recomputes its split information for the global split feature and sends it to the coordinator;
  • the coordinator determines the global split value from the P participants' recomputed information according to the aggregation formula given in the description;
  • the split value is distributed to each participant for node splitting. A coordinator-side sketch follows.
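A coordinator-side sketch of this selection. The feature vote and exploration step mirror the description above; the final aggregation formula appears only as an image in the source, so the sample-count-weighted mean used here is an explicit assumption. The proposal objects are assumed to carry the fields of the SplitProposal sketch above.

```python
import random
from collections import Counter

def choose_global_feature(proposals, epsilon):
    """Step 7: explore a random participant's feature with probability
    epsilon; otherwise exploit f_max, the feature proposed by the
    largest number of participants (the 'majority principle')."""
    x = random.uniform(0.0, 1.0)           # x sampled from X ~ U[0, 1]
    if x <= epsilon:
        return random.choice(proposals).feature
    votes = Counter(p.feature for p in proposals)
    return votes.most_common(1)[0][0]      # f_max

def global_split_value(recomputed):
    """Aggregate the P recomputed split values v_i into a global value.
    ASSUMPTION: a sample-count-weighted mean; the patent's exact
    formula is not recoverable from the source text."""
    total = sum(r.n_samples for r in recomputed)
    return sum(r.value * r.n_samples for r in recomputed) / total
```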
  • Horizontal federated learning is a distributed form of federated learning in which the distributed nodes have the same data features but different sample spaces.
  • The gradient boosting tree algorithm is an ensemble model based on gradient boosting and decision trees.
  • The decision tree is the base model of the gradient boosting tree model; based on the tree structure, each node uses a given feature to decide the prediction direction of a sample.
  • The split point is the position at which a non-leaf node of the decision tree splits the data.
  • The histogram is the statistical information representing the first-order and second-order gradients of the node data.
  • The input device may be a data terminal such as a computer or a mobile phone, or one or more kinds of mobile terminals.
  • The input device includes a processor which, when executing the method, implements the algorithm of any one of steps 1 to 12.
  • The horizontal federated learning supported here includes participants and a coordinator: the participants hold local data, while the coordinator holds no data and serves as the center that aggregates participant information. The participants each compute a histogram and send it to the coordinator;
  • after the coordinator has aggregated all the histogram information, it finds the optimal split point according to the greedy algorithm and then shares it with each participant, working together with the local algorithms. A sketch of this baseline aggregation follows.
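For contrast with the optimized protocol, a minimal sketch of the histogram-based baseline aggregation described here; the dict layout feature -> (G, H) arrays is an illustrative assumption:

```python
import numpy as np

def merge_histograms(histograms):
    """Baseline: element-wise sum of the participants' per-feature
    first-order (G) and second-order (H) gradient histograms. The
    coordinator then scans the merged histograms greedily, exactly as
    in the single-machine algorithm, to find the optimal split point.
    `histograms` is a list of dicts: feature -> (G array, H array)."""
    merged = {}
    for hist in histograms:
        for feature, (G, H) in hist.items():
            if feature not in merged:
                merged[feature] = (np.zeros_like(G), np.zeros_like(H))
            merged[feature][0][:] += G
            merged[feature][1][:] += H
    return merged
```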
  • FIG. 1 is a schematic diagram of the architecture of the random greedy algorithm-based horizontal federated gradient boosting tree optimization method of the present invention;
  • FIG. 2 is a schematic diagram of the steps of the random greedy algorithm-based horizontal federated gradient boosting tree optimization method of the present invention;
  • FIG. 3 is a schematic diagram of the decision flow of the random greedy algorithm-based horizontal federated gradient boosting tree optimization method of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A random greedy algorithm-based horizontal federated gradient boosting tree optimization method, comprising the following steps: the coordinator sets the parameters of the gradient boosting tree model, including but not limited to the maximum number of decision trees T, the maximum tree depth L, and the initial prediction value base, and sends them to each participant p_i; each participant splits the current node data set according to the split feature f and split value v, and assigns the newly split data to the child nodes. In this random greedy algorithm-based horizontal federated gradient boosting tree optimization method, horizontal federated learning includes participants and a coordinator: the participants hold local data, while the coordinator holds no data and serves as the center that aggregates participant information. The participants each compute a histogram and send it to the coordinator; after aggregating all the histogram information, the coordinator finds the optimal split point according to the greedy algorithm and then shares it with each participant, working together with the local algorithms.

Description

Random greedy algorithm-based horizontal federated gradient boosting tree optimization method

Technical Field

The present invention relates to the technical field of federated learning, and in particular to a random greedy algorithm-based horizontal federated gradient boosting tree optimization method.
Background Art

Federated learning is a machine learning framework that effectively helps multiple institutions use data and build machine learning models while meeting the requirements of user privacy protection, data security, and government regulation. It lets participants model jointly without sharing data, technically breaking data silos and enabling AI collaboration. Under this framework, a virtual model is designed to solve the problem of different data owners collaborating without exchanging data: the virtual model is the optimal model obtained as if all parties had aggregated their data, and each region serves its local objective according to the model. Federated learning requires the modeling result to be arbitrarily close to the traditional result, i.e., the result of gathering the data of multiple data owners in one place for modeling. Under the federated mechanism, all participants have the same identity and status, and a shared-data strategy can be established. The greedy algorithm is a simpler and faster design technique for certain optimal-solution problems. It proceeds step by step, making at each step the locally best choice according to some optimization measure for the current situation, without considering all possible global configurations, which saves the large amount of time that exhausting every possibility in search of the optimum would require. It makes successive greedy choices top-down and iteratively, reducing the problem to a smaller subproblem with each choice. Each step must be guaranteed to obtain a locally optimal solution, but the resulting global solution is not always optimal, so greedy algorithms do not backtrack.

However, the existing horizontal federated gradient boosting tree algorithm requires each participant and the coordinator to transmit histogram information frequently, which places high demands on the coordinator's network bandwidth and makes training efficiency vulnerable to network instability. Because the transmitted histograms contain user information, there is also a risk of leaking user privacy. Introducing privacy-protection schemes such as secure multi-party computation, homomorphic encryption, and secret sharing can reduce the possibility of privacy leakage, but it increases the local computing burden and lowers training efficiency.
Summary of the Invention

The purpose of the present invention is to provide a random greedy algorithm-based horizontal federated gradient boosting tree optimization method that solves the problems of the prior art described in the background above: the existing horizontal federated gradient boosting tree algorithm requires frequent histogram transmission between each participant and the coordinator, places high demands on the coordinator's network bandwidth, and has training efficiency that is easily affected by network stability; because the transmitted histograms contain user information, there is a risk of leaking user privacy; and although privacy-protection schemes such as secure multi-party computation, homomorphic encryption, and secret sharing reduce the possibility of privacy leakage, they increase the local computing burden and lower training efficiency.
To achieve the above purpose, the present invention provides the following technical solution: a random greedy algorithm-based horizontal federated gradient boosting tree optimization method, comprising the following steps.

Step 1: The coordinator sets the parameters of the gradient boosting tree model, including but not limited to the maximum number of decision trees T, the maximum tree depth L, and the initial prediction value base, and sends them to each participant p_i.

Step 2: Set the tree counter t = 1.

Step 3: For each participant p_i, initialize the training target of the k-th tree

Figure PCTCN2021101319-appb-000001

where y_0 = y,

Figure PCTCN2021101319-appb-000002

Figure PCTCN2021101319-appb-000003

Step 4: Set the tree layer counter l = 1.

Step 5: Set the current-layer node counter n = 1.

Step 6: For each participant p_i, determine the split point of the current node n from its local data using the optimal split point algorithm, and send the split point information to the coordinator.

Step 7: The coordinator collects the split point information of all participants and determines the split feature f and split value v according to the epsilon-greedy algorithm.

Step 8: The coordinator sends the finalized split information, including but not limited to the split feature f and split value v, to each participant.

Step 9: Each participant splits the current node data set according to the split feature f and split value v, and assigns the newly split data to the child nodes.

Step 10: Set n = n + 1; if n is less than or equal to the maximum number of nodes in the current layer, return to step 6; otherwise, continue to the next step.

Step 11: Reset the current-layer node information according to the child nodes of the layer-l nodes, and set l = l + 1; if l is less than or equal to the maximum tree depth L, return to step 5; otherwise, continue to the next step.

Step 12: Set t = t + 1; if t is less than or equal to the maximum number of decision trees T, return to step 3; otherwise, end. The nested loop structure of steps 2 to 12 is sketched after this list.
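Steps 2 to 12 form three nested loops: over trees t, over tree layers l, and over the nodes n of the current layer. A control-flow sketch follows; it reuses split_round from the earlier per-node sketch, and broadcast_params, init_tree_targets, layer_width, and advance_layer are illustrative assumptions, not names from the patent.

```python
def train(coordinator, participants, T, L):
    """Control flow of steps 1-12; the per-node split logic lives in
    split_round (see the earlier sketch)."""
    # Step 1: broadcast model parameters such as T, L, and base.
    coordinator.broadcast_params(T=T, L=L)
    for t in range(1, T + 1):                    # steps 2 and 12: tree loop
        for p in participants:
            p.init_tree_targets(t)               # step 3: per-tree training targets
        for l in range(1, L + 1):                # steps 4 and 11: layer loop
            width = coordinator.layer_width(l)   # nodes in the current layer
            for n in range(1, width + 1):        # steps 5 and 10: node loop
                split_round(participants, coordinator, node_id=(t, l, n))
            coordinator.advance_layer()          # step 11: children become the current layer
```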
Preferably, the optimal split point algorithm in step 6 is as follows.

1. Determine the split objective function, including but not limited to the following objective functions:

Information gain: information gain is one of the most commonly used metrics of the purity of a sample set. Assuming the node sample set D contains K classes of samples, in which the k-th class accounts for a proportion p_k, the information entropy of D is defined as

$\mathrm{Ent}(D)=-\sum_{k=1}^{K}p_k\log_2 p_k$

Assuming the node is split on attribute a into V possible values, the information gain is defined as

$\mathrm{Gain}(D,a)=\mathrm{Ent}(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$

Information gain ratio:

$\mathrm{Gain\_ratio}(D,a)=\dfrac{\mathrm{Gain}(D,a)}{\mathrm{IV}(a)}$

where

$\mathrm{IV}(a)=-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$

Gini index:

$\mathrm{Gini}(D)=1-\sum_{k=1}^{K}p_k^2$

$\mathrm{Gini\_index}(D,a)=\sum_{v=1}^{V}\frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$

Structure score:

$\mathrm{Gain}=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma$

where G_L is the sum of first-order gradients of the data routed to the left node after splitting the data set at the split point, H_L is the sum of second-order gradients of the left-node data, G_R and H_R are the corresponding gradient sums of the right node, γ is the tree-model complexity penalty term, and λ is the second-order regularization term.

2. Determine the candidate list of split values: determine the split value list according to the data distribution of the current node; a split value comprises a split feature and a split feature value; the split value list can be determined by the following methods:

all values of all features in the data set; or

for the value range of each feature in the data set, determine discrete split points; the split points can be distributed uniformly over the value range according to the data distribution, where uniformity means that the amount of data between adjacent split points is approximately equal or that their second-order gradient sums are approximately equal.

Traverse the candidate list of split values to find the split point that optimizes the objective function; a sketch of this scan follows.
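Traversing the candidate list is a one-dimensional scan per feature. A sketch using the structure-score objective; the function and parameter names are illustrative, and a production implementation would use prefix sums over the sorted feature rather than a fresh mask per candidate:

```python
import numpy as np

def structure_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # Standard second-order (XGBoost-style) split gain.
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

def best_split_for_feature(values, grads, hess, candidates, lam=1.0, gamma=0.0):
    """Return (best_gain, best_value) for one feature by scanning the
    candidate split values against the structure-score objective."""
    v = np.asarray(values, dtype=float)
    g = np.asarray(grads, dtype=float)
    h = np.asarray(hess, dtype=float)
    G, H = g.sum(), h.sum()
    best_gain, best_value = -np.inf, None
    for c in candidates:
        left = v <= c                        # samples routed to the left child
        G_L, H_L = g[left].sum(), h[left].sum()
        gain = structure_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if gain > best_gain:
            best_gain, best_value = gain, c
    return best_gain, best_value
```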
Preferably, the epsilon-greedy algorithm in step 7, for node n: each participant sends its node split point information to the coordinator, including the split feature f_i, the split value v_i, the number of node samples N_i, and the local objective-function gain g_i, where i denotes each participant;

the coordinator determines the optimal split feature f_max from the participants' split information based on the majority principle; let X be a random number uniformly distributed in [0, 1], and sample x from X; if x <= epsilon, one of the participants' split features is selected at random as the global split feature; otherwise, f_max is selected as the global split feature;

each participant recomputes its split information according to the global split feature and sends it to the coordinator;

the coordinator determines the global split value according to the following formula, where the total number of participants is P:

Figure PCTCN2021101319-appb-000011

The split value is distributed to each participant for node splitting.
Preferably, horizontal federated learning is a distributed form of federated learning in which the distributed nodes have the same data features but different sample spaces.

Preferably, the gradient boosting tree algorithm is an ensemble model based on gradient boosting and decision trees.

Preferably, the decision tree is the base model of the gradient boosting tree model; based on the tree structure, each node uses a given feature to decide the prediction direction of a sample.

Preferably, the split point is the position at which a non-leaf node of the decision tree splits the data.

Preferably, the histogram is the statistical information representing the first-order and second-order gradients of the node data.

Preferably, the input device may be a data terminal such as a computer or a mobile phone, or one or more kinds of mobile terminals.

Preferably, the input device includes a processor which, when executing the method, implements the algorithm of any one of steps 1 to 12.
Compared with the prior art, the beneficial effects of the present invention are as follows. In this random greedy algorithm-based horizontal federated gradient boosting tree optimization method, the coordinator sets the gradient boosting tree model parameters (including but not limited to the maximum number of decision trees T, the maximum tree depth L, and the initial prediction value base) and sends them to each participant p_i; the trees, layers, and nodes are then traversed with the counters t, l, and n as in steps 2 to 12. For each participant p_i, the split point of the current node n is determined from local data with the optimal split point algorithm and sent to the coordinator; the coordinator collects all participants' split point information, determines the split feature f and split value v with the epsilon-greedy algorithm, and sends the finalized split information to each participant; each participant then splits its current node data set accordingly and assigns the new data to the child nodes. The supported horizontal federated learning includes participants and a coordinator: the participants hold local data, while the coordinator holds no data and is the center that aggregates participant information. The participants each compute a histogram and send it to the coordinator; after aggregating all the histogram information, the coordinator finds the optimal split point with the greedy algorithm and shares it with each participant, working together with the local algorithms. A back-of-the-envelope view of the per-node communication follows.
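The communication saving of steps 6 to 9 can be made concrete: per node, each participant uploads a compact split proposal instead of per-feature, per-bin gradient histograms. The field names and the 100-feature, 256-bin sizing below are illustrative assumptions, not figures from the patent:

```python
from dataclasses import dataclass

@dataclass
class SplitMessage:
    """Per-node upload from one participant in the optimized protocol."""
    feature: int     # f_i: split feature
    value: float     # v_i: split value
    n_samples: int   # N_i: node sample count
    gain: float      # g_i: local objective gain

# A histogram-based horizontal federated GBDT uploads first- and
# second-order gradient sums for every (feature, bin) pair per node:
n_features, n_bins = 100, 256
floats_per_node_histogram = n_features * n_bins * 2      # 51,200 floats
floats_per_node_this_method = 4                          # one SplitMessage
# (the epsilon-greedy step adds one more small recomputation message)
print(floats_per_node_histogram / floats_per_node_this_method)  # 12800.0
```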
Brief Description of the Drawings

To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of the architecture of the random greedy algorithm-based horizontal federated gradient boosting tree optimization method of the present invention;

FIG. 2 is a schematic diagram of the steps of the random greedy algorithm-based horizontal federated gradient boosting tree optimization method of the present invention;

FIG. 3 is a schematic diagram of the decision flow of the random greedy algorithm-based horizontal federated gradient boosting tree optimization method of the present invention.
Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Referring to FIGS. 1-3, the present invention provides a technical solution: a random greedy algorithm-based horizontal federated gradient boosting tree optimization method, comprising the following steps.

Step 1: The coordinator sets the parameters of the gradient boosting tree model, including but not limited to the maximum number of decision trees T, the maximum tree depth L, and the initial prediction value base, and sends them to each participant p_i.

Step 2: Set the tree counter t = 1.

Step 3: For each participant p_i, initialize the training target of the k-th tree

Figure PCTCN2021101319-appb-000012

where y_0 = y,

Figure PCTCN2021101319-appb-000013

Step 4: Set the tree layer counter l = 1.

Step 5: Set the current-layer node counter n = 1.

Step 6: For each participant p_i, determine the split point of the current node n from its local data using the optimal split point algorithm, and send the split point information to the coordinator.

Step 7: The coordinator collects the split point information of all participants and determines the split feature f and split value v according to the epsilon-greedy algorithm.

Step 8: The coordinator sends the finalized split information, including but not limited to the split feature f and split value v, to each participant.

Step 9: Each participant splits the current node data set according to the split feature f and split value v, and assigns the newly split data to the child nodes.

Step 10: Set n = n + 1; if n is less than or equal to the maximum number of nodes in the current layer, return to step 6; otherwise, continue to the next step.

Step 11: Reset the current-layer node information according to the child nodes of the layer-l nodes, and set l = l + 1; if l is less than or equal to the maximum tree depth L, return to step 5; otherwise, continue to the next step.

Step 12: Set t = t + 1; if t is less than or equal to the maximum number of decision trees T, return to step 3; otherwise, end.

Further, the optimal split point algorithm in step 6 is as follows.

1. Determine the split objective function, including but not limited to the following objective functions:

Information gain: information gain is one of the most commonly used metrics of the purity of a sample set. Assuming the node sample set D contains K classes of samples, in which the k-th class accounts for a proportion p_k, the information entropy of D is defined as

$\mathrm{Ent}(D)=-\sum_{k=1}^{K}p_k\log_2 p_k$

Assuming the node is split on attribute a into V possible values, the information gain is defined as

$\mathrm{Gain}(D,a)=\mathrm{Ent}(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$

Information gain ratio:

$\mathrm{Gain\_ratio}(D,a)=\dfrac{\mathrm{Gain}(D,a)}{\mathrm{IV}(a)}$

where

$\mathrm{IV}(a)=-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$

Gini index:

$\mathrm{Gini}(D)=1-\sum_{k=1}^{K}p_k^2$

$\mathrm{Gini\_index}(D,a)=\sum_{v=1}^{V}\frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$

Structure score:

$\mathrm{Gain}=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma$

where G_L is the sum of first-order gradients of the data routed to the left node after splitting the data set at the split point, H_L is the sum of second-order gradients of the left-node data, G_R and H_R are the corresponding gradient sums of the right node, γ is the tree-model complexity penalty term, and λ is the second-order regularization term.

2. Determine the candidate list of split values: determine the split value list according to the data distribution of the current node; a split value comprises a split feature and a split feature value; the split value list can be determined by the following methods:

all values of all features in the data set; or

for the value range of each feature in the data set, determine discrete split points; the split points can be distributed uniformly over the value range according to the data distribution, where uniformity means that the amount of data between adjacent split points is approximately equal or that their second-order gradient sums are approximately equal.

Traverse the candidate list of split values to find the split point that optimizes the objective function.

Further, the epsilon-greedy algorithm in step 7, for node n:

each participant sends its node split point information to the coordinator, including the split feature f_i, the split value v_i, the number of node samples N_i, and the local objective-function gain g_i, where i denotes each participant;

the coordinator determines the optimal split feature f_max from the participants' split information based on the majority principle;

let X be a random number uniformly distributed in [0, 1], and sample x from X; if x <= epsilon, one of the participants' split features is selected at random as the global split feature; otherwise, f_max is selected as the global split feature;

each participant recomputes its split information according to the global split feature and sends it to the coordinator;

the coordinator determines the global split value according to the following formula, where the total number of participants is P:

Figure PCTCN2021101319-appb-000021

The split value is distributed to each participant for node splitting.

Further, horizontal federated learning is a distributed form of federated learning in which the distributed nodes have the same data features but different sample spaces, which facilitates the comparison work;

further, the gradient boosting tree algorithm is an ensemble model based on gradient boosting and decision trees, which improves its operation;

further, the decision tree is the base model of the gradient boosting tree model and, based on the tree structure, each node uses a given feature to decide the prediction direction of a sample, which better assists prediction;

further, the split point is the position at which a non-leaf node of the decision tree splits the data, which improves the splitting;

further, the histogram is the statistical information representing the first-order and second-order gradients of the node data, which is a more intuitive representation;

further, the input device may be a data terminal such as a computer or a mobile phone, or one or more kinds of mobile terminals, which improves data entry;

further, the input device includes a processor which, when executing the method, implements the algorithm of any one of steps 1 to 12.
Working principle: The coordinator sets the parameters of the gradient boosting tree model, including but not limited to the maximum number of decision trees T, the maximum tree depth L, and the initial prediction value base, and sends them to each participant p_i (step 1); the tree counter t, the layer counter l, and the node counter n are initialized to 1 (steps 2, 4, and 5), and for each participant p_i the training target of the k-th tree is initialized as described above (step 3). For each participant p_i, the split point of the current node n is determined from local data with the optimal split point algorithm and sent to the coordinator (step 6): the split objective function is determined (information gain, information gain ratio, Gini index, or structure score, as defined above); the candidate list of split values is determined from the current node's data distribution, either as all values of all features or as discrete split points distributed uniformly over each feature's value range, uniform in data volume or in second-order gradient sum; and the candidate list is traversed to find the split point that optimizes the objective. The coordinator collects all participants' split point information and determines the split feature f and split value v with the epsilon-greedy algorithm (step 7): each participant sends f_i, v_i, N_i, and g_i; the coordinator determines f_max by the majority principle, samples x from a random variable X uniformly distributed in [0, 1], selects a random participant feature as the global split feature if x <= epsilon and f_max otherwise; the participants recompute their split information for the global feature; and the coordinator computes the global split value over the P participants and distributes it for node splitting. The coordinator then sends the finalized split information, including but not limited to the split feature f and split value v, to each participant (step 8), and each participant splits its current node data set and assigns the new data to the child nodes (step 9). The counters are advanced and the loops repeated as in steps 10 to 12. The supported horizontal federated learning includes participants and a coordinator: the participants hold local data, while the coordinator holds no data and is the center that aggregates participant information. The participants each compute a histogram and send it to the coordinator; after aggregating all the histogram information, the coordinator finds the optimal split point with the greedy algorithm and then shares it with each participant, working together with the local algorithms.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the present invention is defined by the appended claims and their equivalents.

Claims (10)

  1. A random greedy algorithm-based horizontal federated gradient boosting tree optimization method, characterized in that its steps are as follows:
    Step 1: the coordinator sets the parameters of the gradient boosting tree model, including the maximum number of decision trees T, the maximum tree depth L, and the initial prediction value base, and sends them to each participant p_i;
    Step 2: set the tree counter t = 1;
    Step 3: for each participant p_i, initialize the training target of the k-th tree
    Figure PCTCN2021101319-appb-100001
    where y_0 = y,
    Figure PCTCN2021101319-appb-100002
    Figure PCTCN2021101319-appb-100003
    Step 4: set the tree layer counter l = 1;
    Step 5: set the current-layer node counter n = 1;
    Step 6: for each participant p_i, determine the split point of the current node n from its local data using the optimal split point algorithm, and send the split point information to the coordinator;
    Step 7: the coordinator collects the split point information of all participants and determines the split feature f and the split value v according to the epsilon-greedy algorithm;
    Step 8: the coordinator sends the finalized split information, including the split feature f and the split value v, to each participant;
    Step 9: each participant splits the current node data set according to the split feature f and the split value v, and assigns the newly split data to the child nodes;
    Step 10: set n = n + 1; if n is less than or equal to the maximum number of nodes in the current layer, return to step 6; otherwise, continue to the next step;
    Step 11: reset the current-layer node information according to the child nodes of the layer-l nodes, and set l = l + 1; if l is less than or equal to the maximum tree depth L, return to step 5; otherwise, continue to the next step;
    Step 12: set t = t + 1; if t is less than or equal to the maximum number of decision trees T, return to step 3; otherwise, end.
  2. The random greedy algorithm-based horizontal federated gradient boosting tree optimization method according to claim 1, characterized in that the optimal split point algorithm in step 6 comprises:
    determining the split objective function, including the following objective functions:
    information gain: information gain is one of the most commonly used metrics of the purity of a sample set; assuming the node sample set D contains K classes of samples, in which the k-th class accounts for a proportion p_k, the information entropy of D is defined as
    $\mathrm{Ent}(D)=-\sum_{k=1}^{K}p_k\log_2 p_k$
    assuming the node is split on attribute a into V possible values, the information gain is defined as
    $\mathrm{Gain}(D,a)=\mathrm{Ent}(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$
    information gain ratio:
    $\mathrm{Gain\_ratio}(D,a)=\dfrac{\mathrm{Gain}(D,a)}{\mathrm{IV}(a)}$
    where
    $\mathrm{IV}(a)=-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$
    Gini index:
    $\mathrm{Gini}(D)=1-\sum_{k=1}^{K}p_k^2$
    $\mathrm{Gini\_index}(D,a)=\sum_{v=1}^{V}\frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$
    structure score:
    $\mathrm{Gain}=\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right]-\gamma$
    where G_L is the sum of first-order gradients of the data routed to the left node after splitting the data set at the split point, H_L is the sum of second-order gradients of the left-node data, G_R and H_R are the corresponding gradient sums of the right node, γ is the tree-model complexity penalty term, and λ is the second-order regularization term;
    determining the candidate list of split values: determining the split value list according to the data distribution of the current node, a split value comprising a split feature and a split feature value, the split value list being determined by the following methods:
    all values of all features in the data set;
    for the value range of each feature in the data set, determining discrete split points;
    the split points can be distributed uniformly over the value range according to the data distribution, where uniformity means that the amount of data between adjacent split points is approximately equal or that their second-order gradient sums are approximately equal;
    traversing the candidate list of split values to find the split point that optimizes the objective function.
  3. The random greedy algorithm-based horizontal federated gradient boosting tree optimization method according to claim 1, characterized in that the epsilon-greedy algorithm in step 7 comprises:
    for node n, each participant sending its node split point information to the coordinator, including the split feature f_i, the split value v_i, the number of node samples N_i, and the local objective-function gain g_i, where i denotes each participant;
    the coordinator determining the optimal split feature f_max from the participants' split information based on the majority principle; letting X be a random number uniformly distributed in [0, 1] and sampling x from X; if x <= epsilon, selecting one of the participants' split features at random as the global split feature; otherwise, selecting f_max as the global split feature;
    each participant recomputing its split information according to the global split feature and sending it to the coordinator;
    the coordinator determining the global split value according to the following formula, where the total number of participants is P:
    Figure PCTCN2021101319-appb-100011
    distributing the split value to each participant for node splitting.
  4. The random greedy algorithm-based horizontal federated gradient boosting tree optimization method according to claim 1, characterized in that the horizontal federated learning is a distributed form of federated learning in which the distributed nodes have the same data features but different sample spaces.
  5. The random greedy algorithm-based horizontal federated gradient boosting tree optimization method according to claim 1, characterized in that the gradient boosting tree algorithm is an ensemble model based on gradient boosting and decision trees.
  6. The random greedy algorithm-based horizontal federated gradient boosting tree optimization method according to claim 1, characterized in that the decision tree is the base model of the gradient boosting tree model and, based on the tree structure, each node uses a given feature to decide the prediction direction of a sample.
  7. The random greedy algorithm-based horizontal federated gradient boosting tree optimization method according to claim 1, characterized in that the split point is the position at which a non-leaf node of the decision tree splits the data.
  8. The random greedy algorithm-based horizontal federated gradient boosting tree optimization method according to claim 1, characterized in that the histogram is the statistical information representing the first-order and second-order gradients of the node data.
  9. The random greedy algorithm-based horizontal federated gradient boosting tree optimization method according to claim 1, characterized in that the input device may be a data terminal such as a computer or a mobile phone, or one or more kinds of mobile terminals.
  10. The random greedy algorithm-based horizontal federated gradient boosting tree optimization method according to claim 1, characterized in that the input device includes a processor which, when executing the method, implements the algorithm of any one of steps 1 to 12.
PCT/CN2021/101319 2021-01-14 2021-06-21 Random greedy algorithm-based horizontal federated gradient boosting tree optimization method WO2022151654A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21918850.5A EP4131078A4 (en) 2021-01-14 2021-06-21 HORIZONTAL FEDERATED GRADIENT BOOSTED TREE OPTIMIZATION METHOD BASED ON A RANDOM GREEDY ALGORITHM
US18/050,595 US20230084325A1 (en) 2021-01-14 2022-10-28 Random greedy algorithm-based horizontal federated gradient boosted tree optimization method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110046246.2A 2021-01-14 Random greedy algorithm-based horizontal federated gradient boosting tree optimization method
CN202110046246.2 2021-01-14

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/050,595 Continuation US20230084325A1 (en) 2021-01-14 2022-10-28 Random greedy algorithm-based horizontal federated gradient boosted tree optimization method

Publications (1)

Publication Number Publication Date
WO2022151654A1 (zh)

Family

ID=82447785

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/101319 WO2022151654A1 (zh) 2021-01-14 2021-06-21 Random greedy algorithm-based horizontal federated gradient boosting tree optimization method

Country Status (4)

Country Link
US (1) US20230084325A1 (zh)
EP (1) EP4131078A4 (zh)
CN (1) CN114841374A (zh)
WO (1) WO2022151654A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117075884A (zh) * 2023-10-13 2023-11-17 南京飓风引擎信息技术有限公司 Visual script-based digital processing system and method
CN117648646A (zh) * 2024-01-30 2024-03-05 西南石油大学 Drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116205313B (zh) * 2023-04-27 2023-08-11 数字浙江技术运营有限公司 Method, apparatus, and electronic device for selecting federated learning participants
CN116821838B (zh) * 2023-08-31 2023-12-29 浙江大学 Privacy-preserving abnormal transaction detection method and apparatus
CN117251805B (zh) * 2023-11-20 2024-04-16 杭州金智塔科技有限公司 Federated gradient boosting decision tree model update system based on a breadth-first algorithm
CN117724854B (zh) * 2024-02-08 2024-05-24 腾讯科技(深圳)有限公司 Data processing method, apparatus, device, and readable storage medium


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200200648A1 (en) * 2018-02-12 2020-06-25 Dalian University Of Technology Method for Fault Diagnosis of an Aero-engine Rolling Bearing Based on Random Forest of Power Spectrum Entropy
CN109299728A (zh) * 2018-08-10 2019-02-01 深圳前海微众银行股份有限公司 Federated learning method, system, and readable storage medium
CN111985270A (zh) * 2019-05-22 2020-11-24 中国科学院沈阳自动化研究所 Optimal channel selection method for sEMG signals based on gradient boosting tree
CN111553483A (zh) * 2020-04-30 2020-08-18 同盾控股有限公司 Gradient compression-based federated learning method, apparatus, and system
CN111553470A (zh) * 2020-07-10 2020-08-18 成都数联铭品科技有限公司 Information interaction system and method suitable for federated learning

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117075884A (zh) * 2023-10-13 2023-11-17 南京飓风引擎信息技术有限公司 Visual script-based digital processing system and method
CN117075884B (zh) * 2023-10-13 2023-12-15 南京飓风引擎信息技术有限公司 Visual script-based digital processing system and method
CN117648646A (zh) * 2024-01-30 2024-03-05 西南石油大学 Drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning
CN117648646B (zh) * 2024-01-30 2024-04-26 西南石油大学 Drilling and production cost prediction method based on feature selection and stacked heterogeneous ensemble learning

Also Published As

Publication number Publication date
CN114841374A (zh) 2022-08-02
EP4131078A4 (en) 2023-09-06
US20230084325A1 (en) 2023-03-16
EP4131078A1 (en) 2023-02-08

Similar Documents

Publication Publication Date Title
WO2022151654A1 (zh) Random greedy algorithm-based horizontal federated gradient boosting tree optimization method
US9426233B2 (en) Multi-objective server placement determination
CN113191503B (zh) Decentralized distributed learning method and system for non-shared data
CN110177094A (zh) User group identification method and apparatus, electronic device, and storage medium
CN107784327A (zh) GN-based personalized community discovery method
CN112464107B (zh) Multi-label propagation-based overlapping community discovery method and apparatus for social networks
CN106817390B (zh) User data sharing method and device
WO2022067539A1 (zh) Network traffic processing method and apparatus, storage medium, and computer device
CN110825935A (zh) Community core figure mining method, system, electronic device, and readable storage medium
Chen et al. Distributed community detection over blockchain networks based on structural entropy
CN112966054A (zh) Group division method based on inter-node relationships in enterprise graphs, and computer device
Liu et al. Towards method of horizontal federated learning: A survey
Zhou et al. The role of communication time in the convergence of federated edge learning
CN114116705A (zh) Method and apparatus for determining participant contribution values in joint learning
CN107257356B (zh) Social user data optimized placement method based on hypergraph partitioning
WO2019184325A1 (zh) Community partition quality evaluation method and system based on average mutual information
CN112836828A (zh) Game theory-based self-organizing federated learning method
CN115130044A (zh) Influential node identification method and system based on the second-order H-index
CN108875786B (zh) Storm-based optimization method for the consistency problem in parallel computation of food data
Fang et al. GDAGAN: An anonymization method for graph data publishing using generative adversarial network
Chen et al. Trust-based federated learning for network anomaly detection
Wang et al. Automated allocation of detention rooms based on inverse graph partitioning
CN116305262B (zh) Negative survey-based social network topology privacy protection method
US11921787B2 (en) Identity-aware data management
CN113495982B (zh) Transaction node management method and apparatus, computer device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21918850

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021918850

Country of ref document: EP

Effective date: 20221027

NENP Non-entry into the national phase

Ref country code: DE