CN114677200B - Business information recommendation method and device based on multi-party high-dimensional data longitudinal federal learning

Publication number: CN114677200B
Application number: CN202210368272.1A
Authority: CN (China)
Legal status: Active
Prior art keywords: client, client side, party, model, output
Other versions: CN114677200A (Chinese)
Inventors: 钱鹰, 莫昊恂, 刘歆, 陈奉, 宋阳, 熊炜, 陈雪, 杨世利
Applicant and current assignee: Chongqing University of Posts and Telecommunications
Abstract

The invention relates to a business information recommendation method and device based on multi-party high-dimensional data longitudinal federal learning, belonging to the technical field of big data and comprising the following steps: S1: create a homomorphic encryption key pair, and preprocess the multi-party data and align the encrypted samples; S2: construct a longitudinal federal LightGBM model; S3: convert the longitudinal federal LightGBM model into a neural network serving as the GBDT2NN part of the longitudinal federal ECA-DeepGBM model; S4: compute the feed-forward process of the CatNN part of the longitudinal federal ECA-DeepGBM model; S5: construct a loss function and train the model as a whole, and realize business information recommendation over multi-party high-dimensional data based on the trained high-dimensional data classification prediction model. The invention achieves accurate business information recommendation by expanding the feature dimensions of the multi-party data.

Description

Business information recommendation method and device based on multi-party high-dimensional data longitudinal federal learning
Technical Field
The invention belongs to the technical field of big data, and relates to a business information recommendation method and device based on multiparty high-dimensional data longitudinal federal learning.
Background
With the development of application fields such as artificial intelligence and big data, the demand for data keeps growing. Training an excellent machine learning model requires not only a large number of samples but also a large number of high-quality sample features. In real life, high-quality data features are often distributed across different companies and institutions; for example, transaction information is held by financial institutions and medical information by medical institutions, and data is likewise scattered across many other fields. Because data ownership, user privacy and data security have become increasingly important, and because laws and regulations place ever stricter constraints and requirements on data collection and processing, the organizations or individuals that hold application data are often unwilling, or lack a proper means, to cooperate with each other, so it is difficult to make the separately held application data work together.
Meanwhile, most machine learning tasks are currently carried out in a supervised manner. If a party with a classification prediction requirement lacks labels, supervised machine learning cannot be performed on its own, even when the task could be completed with the help of information held by other parties so as to realize accurate recommendation; under such conditions, joint classification prediction over multi-party private data becomes difficult and challenging. Such task requirements arise in numerous financial, commercial and similar industries and their application fields, and remain problems to be solved. The party with the task requirement is called the active party or recommending party, and the other parties are called passive parties or recommended parties. When multi-party data joint learning is performed, the aligned high-quality samples are few, and a classification prediction model with high accuracy cannot be trained on the high-dimensional data.
Therefore, how to break the multi-party data islands and realize joint learning over the data while guaranteeing the parties' privacy and security requirements, so as to meet the actual needs of various application scenarios, is a problem to be solved urgently.
Disclosure of Invention
Accordingly, the present invention is directed to a method and apparatus that builds an ECA-DeepGBM model based on longitudinal federal learning for business information recommendation over multi-party high-dimensional data, cooperatively training a machine learning model through federated learning while protecting the data privacy of all participating companies. According to the relation between the training samples and the feature space, federated learning can be divided into transverse (horizontal) federated learning, with large feature overlap and small sample overlap; longitudinal (vertical) federated learning, with large sample overlap and small feature overlap; and federated transfer learning, with small overlap in both features and samples.
In order to achieve the above purpose, the present invention provides the following technical solutions:
In one aspect, the invention provides a business information recommendation method based on multiparty high-dimensional data longitudinal federal learning, comprising the following steps:
S1: create a homomorphic encryption key pair, and preprocess the multi-party data and align the encrypted samples, wherein the multi-party data are business privacy data that exist on the own side of tag party A, client party B, client party C and cooperator party P and cannot be known by the other parties;
S2: construct a longitudinal federal LightGBM model;
S3: convert the longitudinal federal LightGBM model into a neural network serving as the GBDT2NN part of the longitudinal federal ECA-DeepGBM model;
S4: compute the feed-forward process of the CatNN part of the longitudinal federal ECA-DeepGBM model;
S5: construct a loss function, train the high-dimensional data classification prediction model, and realize business information classification recommendation over multi-party high-dimensional data based on the trained high-dimensional data classification prediction model.
The invention discloses a high-dimensional data classification recommendation model oriented to longitudinal federal learning that uses an ECA-DeepGBM model. The ECA-DeepGBM model mainly comprises two parts: a CatNN part that processes sparse categorical features and a GBDT2NN part that processes continuous numerical features. The CatNN part employs an improved FAT-DeepFFM model in which ECA-Net is used to add attention to the vectors after embedding. The GBDT2NN part uses LightGBM as the gradient boosting tree and converts LightGBM into a neural network.
The high-dimensional data analysis and prediction model oriented to longitudinal federal learning constructed by the invention mainly involves the following roles: a tag party A that holds the label information and the evaluation application requirement, a client party B that holds user feature information, a client party C that holds other feature information of the users, and a trusted cooperator party P.
Client parties B and C preprocess their own features: client party B splits its own features into continuous numerical features $x_{num}^{B}$ and discrete categorical features $x_{cat}^{B}$, and client party C splits its own features into continuous numerical features $x_{num}^{C}$ and discrete categorical features $x_{cat}^{C}$.
The continuous numerical features $x_{num}^{B}$ and $x_{num}^{C}$ are used as the input for training the longitudinal federal LightGBM, and the trained longitudinal federal LightGBM is then converted, through knowledge distillation, into sub-neural networks on client parties B and C that serve as the GBDT2NN parts of parties B and C.
The main structure of the CatNN part on client parties B and C is a FAT-DeepFFM model improved with the ECA-Net module, which mainly processes the discrete categorical features $x_{cat}^{B}$ and $x_{cat}^{C}$.
After the model is trained, client parties B and C each hold an ECA-DeepGBM sub-model that processes their own features and produces their own outputs; the output of the longitudinal federal ECA-DeepGBM is obtained by combining the predicted outputs of client parties B and C.
Further, the step S1 specifically includes the following steps:
S11: the cooperator P side generates homomorphic encryption public key P k and private key s k, and sends public key P k to the label side A side, client side B side and client side C side; the business privacy data for each party includes: the A party has a label of the existing commercial information and is also a recommendation requiring party; part of business information of the person or the enterprise is owned by the party B, including loan, repayment, default information and the like, and other relevant data in the application scene of the party B; party C has basic information of individuals or enterprises, and individuals comprise ages, academia, income, family income and family liabilities; enterprises comprise tax payment, income, expenditure, liabilities and the like;
S12: establishing a multiparty longitudinal federal learning classification prediction sample set with the purpose of expanding sample feature dimensions for business information recommendation: because the user groups of the tag party A, the client party B and the client party C are not identical, the encryption-based sample alignment technology is used to ensure that the parties A, B and C align the common user without exposing the respective original data;
S13: client party B and client party C preprocess the features of their own samples; client party B splits its own features into continuous numerical features $x_{num}^{B}$ and discrete categorical features $x_{cat}^{B}$, and client party C splits its own features into continuous numerical features $x_{num}^{C}$ and discrete categorical features $x_{cat}^{C}$; the numerical features are used as the input of the GBDT2NN part of each party's own sub-model, and the categorical features are used as the input of the CatNN part of the own sub-model.
Further, the step S2 specifically includes the following steps:
S21: client parties B and C use the continuous numerical features $x_{num}^{B}$ and $x_{num}^{C}$ as the common input for training the longitudinal federal LightGBM: through exclusive feature bundling, client parties B and C obtain the new processed data sets $x^{EFB}$;
S22: tag party A computes the first derivative $g_i$ and second derivative $h_i$ of the loss function for each sample, $i\in\{1,2,3,\dots,Y\}$, where $Y$ is the number of samples, from the true labels and the predicted values of the trained decision trees; the loss function is the cross-entropy loss; the derivatives are then homomorphically encrypted and transmitted to the other, feature-holding clients;
S23: after client parties B and C receive the encrypted first derivatives $[[g_i]]$ and second derivatives $[[h_i]]$, they divide the feature values of every feature in their own data set $x^{EFB}$ into buckets by percentile, compute the sums $[[G_i]]$ and $[[H_i]]$ of the encrypted first and second derivatives of the samples falling into each bucket of each feature, and transmit $[[G_i]]$ and $[[H_i]]$ to tag party A, where $[[\cdot]]$ indicates that the data is homomorphically encrypted;
S24: after tag party A obtains the aggregated encrypted gradients $\{[[G_i]],[[H_i]]\}$ transmitted by client parties B and C, it performs the corresponding decryption to obtain the aggregated value of each bucket, and then maximizes the score
$$\mathrm{score}=\frac{G_l^{2}}{H_l+\lambda}+\frac{G_r^{2}}{H_r+\lambda}-\frac{G^{2}}{H+\lambda}$$
to find the corresponding optimal split point, where $\lambda$ denotes the coefficient of the L2 regularization term, $G_l$ the sum of first derivatives of all buckets less than or equal to the split threshold $v$, $G_r$ the sum of first derivatives of all buckets greater than the split threshold $v$, $G$ the sum of first derivatives of all samples at the current node, $H_l$ the sum of second derivatives of all buckets less than or equal to the split threshold $v$, $H_r$ the sum of second derivatives of all buckets greater than the split threshold $v$, and $H$ the sum of second derivatives of all samples at the current node; the feature value of each bucket is traversed as the candidate boundary to obtain the maximum score of the current feature, and all features are traversed to obtain the split feature $k$ and the optimal split threshold $v$ that maximize the score globally;
S25: the client side with the optimal splitting point characteristics stores a splitting threshold point v and a splitting characteristic k, then divides a sample space of a current tree node into a left sub data set and a right sub data set through the splitting threshold point v of the characteristic k, sends a sample space result of a new node after division to other client sides and a label side A side for synchronization of the sample space, and simultaneously returns information of the client side to the label side A side for recording the characteristic of which client side performs node splitting;
S26: the label side A side splits the current leaf node into two new leaf nodes, records the index of the leaf node and the id of a sample in a leaf node sample space, and is used for converting the gradient lifting tree into a neural network; the steps S22-S26 are iterated, and the selection of the next leaf node segmentation is entered until the termination condition for LightGBM training is reached.
Further, the step S3 specifically includes the following steps:
S31: the trained longitudinal LightGBM model is converted into a corresponding neural network through a knowledge distillation mode to serve as a GBDT NN part of continuous numerical data processed by a client side B side and a client side C side: firstly, grouping decision trees in LightGBM, equally dividing the decision trees into m tree groups, and for any tree group in the m tree groups The steps of conversion into a neural network are as follows:
S311: label side A traversing tree group Obtaining index vectors of leaf nodes of the decision tree corresponding to the ith sample, and then splicing all the obtained index vectors by using splicing operation I; l t,i represents the index vector of the leaf node of the ith sample in the tree t, and L t,i is obtained by using the leaf node index and sample id recorded in the longitudinal federation LightGBM training;
S312: the label side A utilizes the embedding (embedding) layer to obtain an embedded representation of the leaf node Representing the multi-hot vector/>, which will be stitchedMapping into an embedded representationUse/>Fitting the tree group/>, where the i-th sample is locatedSum of weights of decision tree leaf nodes/>Loss function/>The same loss function as in LightGBM is used, taking cross entropy loss function as an example. The process of learning leaf node embeddings of multiple trees is expressed as:
Where w T and w 0 represent weights and offsets of the embedded index map to the GBDT2NN portion output, ω T represent parameters for converting the multi-hot vector into a embedding vector, n represents the total number of training samples, i represents the ith training sample;
S313: two neural networks are initialized on client parties B and C respectively as their GBDT2NN parts; because some features with low split gain are never used when splitting the decision trees, the input features of client parties B and C are set to the features $x^{B,I^{T}}$ and $x^{C,I^{T}}$ actually used by each client when splitting the decision trees, and the output dimension of each network is kept consistent with the dimension of the embedding vector $H(L_{T,i};\omega^{T})$;
S314: client parties B and C obtain the outputs of their own GBDT2NN parts: client parties B and C compute the outputs $NN^{B}(x_i^{B,I^{T}};\theta^{B})$ and $NN^{C}(x_i^{C,I^{T}};\theta^{C})$ of the sub-neural networks converted from the tree group $T$, homomorphically encrypt them, and send them to tag party A; $NN^{B}(x_i^{B,I^{T}};\theta^{B})$ denotes the output of client party B's neural network for the i-th sample, $x_i^{B,I^{T}}$ the features of the i-th sample on client party B's side and $\theta^{B}$ the parameters of client party B's neural network; $NN^{C}(x_i^{C,I^{T}};\theta^{C})$, $x_i^{C,I^{T}}$ and $\theta^{C}$ denote the same quantities for client party C; from the received information tag party A computes the leaf-node embedding loss $\mathcal{L}_{embed}^{T}$ of the tree group $T$, where $\mathcal{L}''$ denotes the regression loss function:
$$\mathcal{L}_{embed}^{T}=\frac{1}{n}\sum_{i=1}^{n}\mathcal{L}''\Big(NN^{B}(x_i^{B,I^{T}};\theta^{B})+NN^{C}(x_i^{C,I^{T}};\theta^{C}),\ H(L_{T,i};\omega^{T})\Big)$$
S315: tag party A takes the weight $w^{T}$ and bias $w_0$ learned earlier when fitting the sum of the leaf-node weights with the embedded representation and sends them to client parties B and C; client party B obtains the output of the sub-neural network converted from the current tree group
$$y^{T,B}(x^{B})=w^{T}\cdot NN^{B}(x^{B,I^{T}};\theta^{B})+w_0$$
and client party C obtains the output of the sub-neural network converted from the current tree group
$$y^{T,C}(x^{C})=w^{T}\cdot NN^{C}(x^{C,I^{T}};\theta^{C})+w_0$$
S32: after the m tree groups are converted into corresponding neural networks according to steps S311-S315, client party B obtains the output of the GBDT2NN part of the B-side sub-model
$$y_{GBDT2NN}^{B}(x^{B})=\sum_{j=1}^{m}y^{T_j,B}(x^{B})$$
and client party C obtains the output of the GBDT2NN part of the C-side sub-model
$$y_{GBDT2NN}^{C}(x^{C})=\sum_{j=1}^{m}y^{T_j,C}(x^{C})$$
Further, the step S4 specifically includes the following steps:
S41: client parties B and C obtain the outputs of their own CatNN parts: the CatNN module in ECA-DeepGBM is the part that processes sparse features; client parties B and C input their discrete data $x_{cat}^{B}$ and $x_{cat}^{C}$ into their respective FAT-DeepFFM models, i.e., the CatNN parts of their own sub-models; after converting the input into one-hot codes, client parties B and C divide their own features into $f_B$ and $f_C$ fields respectively; for the input features, client parties B and C first obtain the corresponding embedding vectors $e_{ij}$ through an embedding layer, where $v_{ij}$ is the embedding weight of the i-th feature for the j-th field:
$$e_{ij}=v_{ij}x_i$$
The embedding matrix corresponding to the i-th field is obtained as $EM_i=[e_{i1},e_{i2},\dots,e_{if}]$, where $f$ is the number of fields into which the client party divides its features;
S42: client parties B and C take the matrix $EM_i$ obtained after embedding their own i-th feature group and, through the efficient channel attention module (ECA-Net) of their own models, apply an attention mechanism to the embedded feature matrix to obtain a new embedding matrix, thereby emphasizing some important features and suppressing some unimportant ones:
$$AEM_i=F(S_i,EM_i)=[S_{i1}\cdot e_{i1},\dots,S_{ij}\cdot e_{ij},\dots,S_{if}\cdot e_{if}]$$
where $AEM_i$ denotes the embedding matrix obtained from the i-th feature group after processing by the attention mechanism, and $S_i$ denotes the weight values corresponding to $EM_i$ produced by the ECA-Net module;
S43: at the feature interaction layer of their own FAT-DeepFFM models, client parties B and C let the attended embedding vectors interact pairwise, with the vector interaction carried out by the Hadamard product:
$$f_{interaction}(V_x)=[\dots,AEM_i\odot AEM_j,\dots],\quad i\neq j$$
S44: client parties B and C concatenate the Hadamard product vectors obtained on their own side with the concatenation operation and input them into a deep neural network;
S45: client parties B and C each use a deep neural network for the high-order feature interaction, and the output of forward propagation is $y_{DNN}=DNN(f_{interaction}(V_x))$;
S46: client party B adds the output of its corresponding linear part to obtain the total output of client party B's CatNN part:
$$y_{CatNN}^{B}=w_0^{B}+\sum_i w_i^{B}x_i^{B}+DNN(f_{interaction}^{B}(V_x^{B}))$$
where $w_0^{B}$ denotes the bias term of the B-side linear part, $w_i^{B}$ the weight of the i-th feature of party B, $x_i^{B}$ the i-th feature of the B-side sample, and $f_{interaction}^{B}(V_x^{B})$ the concatenated Hadamard product vector obtained by party B.
Client party C adds the output of its corresponding linear part to obtain the total output of client party C's CatNN part:
$$y_{CatNN}^{C}=w_0^{C}+\sum_i w_i^{C}x_i^{C}+DNN(f_{interaction}^{C}(V_x^{C}))$$
where $w_0^{C}$ denotes the bias term of the C-side linear part, $w_i^{C}$ the weight of the i-th feature of party C, $x_i^{C}$ the i-th feature of the C-side sample, and $f_{interaction}^{C}(V_x^{C})$ the concatenated Hadamard product vector obtained by party C.
Further, the step S5 specifically includes the following steps:
S51: the loss function of longitudinal federal ECA-DeepGBM involves the overall prediction loss $\mathcal{L}'(\hat{y},y)$ obtained from the true label $y$ held by tag party A and the predictions $\hat{y}^{B}$ and $\hat{y}^{C}$ of the B-side and C-side sub-models, and the embedding losses $\mathcal{L}_{embed}^{T_j}$ incurred when embedding the gradient boosting tree leaf nodes after the decision trees in LightGBM are divided into m groups; weighting them gives the overall loss of longitudinal federal ECA-DeepGBM:
$$\mathcal{L}=\alpha\,\mathcal{L}'(\hat{y},y)+\beta\sum_{j=1}^{m}\mathcal{L}_{embed}^{T_j}$$
where $\mathcal{L}_{embed}^{T_j}$ denotes the embedding loss of the j-th tree group, and $\alpha$ and $\beta$ are hyper-parameters that control the magnitudes of the loss values.
S52: the loss function $\mathcal{L}'$ used in $\mathcal{L}$ is the cross-entropy loss and involves an exponential operation. However, longitudinal federal ECA-DeepGBM training requires homomorphic encryption, and by the nature of homomorphic encryption an exponential cannot be computed on homomorphically encrypted data. Therefore the prediction loss must be approximated under encryption by a polynomial obtained from a Taylor expansion; let $u=\hat{y}^{B}+\hat{y}^{C}$, then:
$$\mathcal{L}'(u,y)=\log\left(1+e^{-yu}\right)\approx\log 2-\frac{1}{2}yu+\frac{1}{8}u^{2}$$
S53: client party B weights the output $y_{GBDT2NN}^{B}$ of its sub-model's GBDT2NN part and the output $y_{CatNN}^{B}$ of its CatNN part to obtain the prediction output of the B-side model, and encrypts it to obtain
$$[[\hat{y}^{B}]]=[[w_1\cdot y_{GBDT2NN}^{B}+w_2\cdot y_{CatNN}^{B}]]$$
Client party C likewise obtains the encrypted prediction output of the C-side model
$$[[\hat{y}^{C}]]=[[w_1\cdot y_{GBDT2NN}^{C}+w_2\cdot y_{CatNN}^{C}]]$$
where $w_1$ and $w_2$ are trainable parameters that are updated during training;
S54: client party B sends $[[\hat{y}^{B}]]$ to tag party A and client party C; client party C computes $[[d_{BC}]]$ from its own prediction $[[\hat{y}^{C}]]$ and the $[[\hat{y}^{B}]]$ sent by client party B, and then sends $[[d_{BC}]]$ together with $[[\hat{y}^{C}]]$ to tag party A;
S55: tag party A computes the encrypted prediction loss $[[\mathcal{L}']]$ from its label $y$ and the $[[\hat{y}^{B}]]$, $[[\hat{y}^{C}]]$ and $[[d_{BC}]]$ sent by client parties B and C; in the same way it computes the encrypted embedding losses $[[\mathcal{L}_{embed}^{T_j}]]$ from the predicted values of the client B-side and C-side GBDT2NN parts and the leaf-node embedding targets; the losses are added with weights to obtain the encrypted overall loss $[[\mathcal{L}]]$ of the model;
S56: tag party A sends the encrypted overall loss $[[\mathcal{L}]]$ of the model to client parties B and C; according to $[[\mathcal{L}]]$, client parties B and C each compute the encrypted gradient information of the linear layer of their own sub-models, $[[\partial\mathcal{L}/\partial w_1]]$ and $[[\partial\mathcal{L}/\partial w_2]]$, in order to update their respective $w_1$ and $w_2$;
S57: the encrypted gradient information is sent to cooperator party P, which decrypts it and returns the plaintext gradient information $\partial\mathcal{L}/\partial w_1$ and $\partial\mathcal{L}/\partial w_2$ to parties B and C, which then update the values of $w_1$ and $w_2$.
S58: client party B uses the intermediate information, namely the output $y_{CatNN}^{B}$ of its own model's CatNN part together with $[[\mathcal{L}]]$, to compute the encrypted gradient information of the last layer (layer L) of the CatNN part of its neural network, and uses the output $y_{GBDT2NN}^{B}$ of the GBDT2NN part together with $[[\mathcal{L}]]$ to compute the encrypted gradient information of the last layer of the GBDT2NN part of its neural network; client party C obtains its corresponding gradients by the same computation; the parameters of the corresponding neurons are then updated according to the gradient information;
S59: the neural networks back-propagate using the layer-L gradient information: each party performs encrypted computation on the intermediate information of layer L of its different NN networks to obtain the encrypted gradients and sends them to cooperator party P for decryption; party P returns the decrypted results to client parties B and C respectively, and after receiving the decrypted information client parties B and C each compute the loss-function gradient information of the previous layer and then update the parameter information of the layer L-1 neurons;
S510: the back-propagation proceeds in this way from layer L down to layer 1, updating the parameters of all neurons in the different NN networks of every party; steps S53-S59 are iterated until the model converges or the specified number of iterations is reached, and model training ends;
S511: based on the longitudinal federal ECA-DeepGBM model obtained from training, after sample alignment client party B computes its own prediction output $\hat{y}^{B}$ from its own features and client party C computes its own prediction output $\hat{y}^{C}$ from its own features; each then sends its output to cooperator party P; cooperator party P adds the prediction outputs of client parties B and C and applies a sigmoid activation to obtain the business information recommendation result for the multi-party high-dimensional data based on longitudinal federal learning, which is output and returned to client parties B and C.
On the other hand, the invention provides a business information recommendation device based on multiparty high-dimensional data longitudinal federal learning, which comprises a memory and a processor;
the memory is used for storing a computer program;
The processor is configured to implement the business information recommendation method based on multiparty high-dimension longitudinal federal learning according to any one of the above when executing the computer program.
The beneficial effects of the invention are as follows: the invention provides a novel business information recommendation method based on multi-party high-dimensional data longitudinal federal learning that allows multiple parties to jointly model and train under data security and privacy protection; it can recommend accurately and market precisely to customers, perform classification evaluation of financial information, and evaluate personal credit. Through longitudinal federal learning the feature dimensions are expanded, while different feature extraction and classification prediction methods are applied to the multi-party categorical sparse features and numerical dense features and the models are integrated, which improves the accuracy of business information recommendation.
Compared with the gradient boosting tree itself, the neural network converted from it retains the gradient boosting tree's advantages in processing numerical features, and when new data arrive, the converted neural network can continue training from the already trained model instead of retraining from scratch with all samples as input, as a gradient boosting tree would require, which reduces the cost of model updating. The structure is also easy to extend: it can be extended to three or more feature-holding clients, so that more accurate business information recommendation can be made jointly from the high-dimensional data of more parties.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described below in detail with reference to the preferred embodiments and the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a model overall modeling process;
FIG. 2 is a schematic diagram of the overall structure of the model;
FIG. 3 is a schematic diagram of model longitudinal federal learning training.
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure of this specification, which describes the embodiments of the present invention by way of specific examples. The invention may also be practiced or applied through other, different embodiments, and the details of this specification may be modified or varied in various ways without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention in a schematic way, and the following embodiments and the features in the embodiments may be combined with one another where no conflict arises.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if there are terms such as "upper", "lower", "left", "right", "front", "rear", etc., that indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but not for indicating or suggesting that the referred device or element must have a specific azimuth, be constructed and operated in a specific azimuth, so that the terms describing the positional relationship in the drawings are merely for exemplary illustration and should not be construed as limiting the present invention, and that the specific meaning of the above terms may be understood by those of ordinary skill in the art according to the specific circumstances.
As shown in fig. 1 to 3, a business information recommendation method based on multi-party high-dimensional data longitudinal federal learning is provided.
The invention constructs a high-dimensional data classification prediction model oriented to longitudinal federal learning. The constructed high-dimensional data prediction classification model uses an ECA-DeepGBM model, which mainly comprises two parts: a CatNN part that processes sparse categorical features and a GBDT2NN part that processes continuous numerical features. The CatNN part is a FAT-DeepFFM model modified so that ECA-Net adds attention to the vectors after embedding. The GBDT2NN part uses LightGBM as the gradient boosting tree and converts LightGBM into a neural network.
In constructing the high-dimensional data analysis and prediction model oriented to longitudinal federal learning, the following roles are mainly involved: a tag party A with label information, a client party B with user feature information, a client party C with other feature information of the users, and a trusted cooperator party P.
The following application scenario exists: party A holds high-dimensional data of a known customer set UserA in its own application scenario, and UserA carries customer classification labels. Meanwhile, party A has no data other than that of its own application scenario and no data on non-customers, while other parties hold data on the UserA customer set and on non-customer groups in other application scenarios. When party A needs precise marketing to quickly and efficiently identify its potential customers of a certain category, it must learn jointly from the data of the other parties' application scenarios in order to recommend the potential customers in UserA that may meet the requirements of that category. But the data of party A and the other parties are isolated from each other and require secure privacy protection. This application scenario and its technical problems are common in finance, credit, commerce, customer recommendation, precision services and other application fields, and are urgent problems whose key technologies are widely needed by government authorities, institutions and enterprises.
For example, a financial institution may need to acquire more premium customers or recommend banking products to potential customers, and it may need to use multi-party data to search for and identify its potential and premium customers efficiently and quickly for precise marketing. Similar application scenarios and requirements exist in other institutions and enterprises as well.
In this embodiment, party A is the tag party that has the evaluation application requirement and holds the existing credit information, party B is an institution holding relevant information such as personal loans, repayments and defaults, and party C is an institution holding personal information such as age, education background, income, household liabilities and employment unit type. On the premise of protecting each party's data privacy, the business information recommendation method based on multi-party high-dimensional data longitudinal federal learning provided by the invention can recommend accurately and market precisely to customers and evaluate personal credit.
Client party B preprocesses the relevant feature information it holds, such as personal loans, repayments and defaults, and client party C preprocesses the personal feature information it holds, such as age, education background, income, household liabilities and employment unit type. Client party B splits its own features into continuous numerical features $x_{num}^{B}$, i.e., feature information such as personal loan amount and repayment amount, and discrete categorical features $x_{cat}^{B}$, such as whether the person has defaulted and the category of bad behavior. Client party C splits its own features into continuous numerical features $x_{num}^{C}$, i.e., feature information such as age, income, household income and household liabilities, and discrete categorical features $x_{cat}^{C}$, i.e., feature information such as education background and employment unit type.
The continuous numerical features $x_{num}^{B}$ and $x_{num}^{C}$ are used as the input for training the longitudinal federal LightGBM, which is then converted through knowledge distillation into sub-neural networks on client parties B and C that serve as the GBDT2NN parts of parties B and C.
The main structure of the CatNN part on client parties B and C is the FAT-DeepFFM model improved with the ECA-Net module, which mainly processes the discrete categorical features $x_{cat}^{B}$ and $x_{cat}^{C}$.
After the model is trained, client parties B and C each hold an ECA-DeepGBM sub-model that processes their own features and produces their own outputs; the output of the longitudinal federal ECA-DeepGBM is obtained by combining the predicted outputs of client parties B and C.
The method specifically comprises the following steps:
S1: create a homomorphic encryption key pair, and preprocess the multi-party data and align the encrypted samples. The specific steps are as follows:
S11: the cooperator party P generates a homomorphic encryption public key $p_k$ and private key $s_k$, and sends the public key $p_k$ to tag party A, client party B and client party C;
Optionally, the homomorphic encryption scheme adopted is Paillier homomorphic encryption;
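As an illustration of this optional choice, the following minimal sketch uses the open-source python-paillier library (`phe`); the variable names and values are illustrative assumptions and not part of the invention. It shows the key generation of S11 and the additive aggregation that the later steps rely on:

```python
from phe import paillier

# Cooperator party P generates the key pair and distributes only the public key.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Tag party A encrypts per-sample first derivatives g_i before sending them (S22).
g = [0.31, -0.52, 0.08]
enc_g = [public_key.encrypt(v) for v in g]

# A feature-holding client can sum the ciphertexts of one bucket (S23) without
# ever seeing the plaintext gradients: Paillier is additively homomorphic.
bucket_sum = enc_g[0] + enc_g[1] + enc_g[2]

# Only the private-key holder can decrypt the aggregate (S24).
assert abs(private_key.decrypt(bucket_sum) - sum(g)) < 1e-9
```

The sketch works because Paillier ciphertexts can be added without decryption, which is exactly the property that steps S22-S24 depend on.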
S12: establishing a multiparty longitudinal federal learning classification prediction sample set with the purpose of expanding sample feature dimensions for business information recommendation: because the user groups of the tag party A, the client party B and the client party C are not identical, the encryption-based sample alignment technology is used to ensure that the parties A, B and C align the common user without exposing the respective original data; in this embodiment, the business information recommendation may be accurate recommendation and accurate marketing for the client, or classification evaluation for financial information, or evaluation for personal credit.
Optionally, an encrypted sample alignment technique based on the RSA algorithm and a hash function is used to realize the multi-party sample alignment;
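A minimal sketch of one such RSA-blinding alignment is given below; the toy key sizes, user ids and helper names are assumptions for illustration, since the patent does not fix the exact protocol details:

```python
import hashlib
import secrets

def h(x: bytes) -> int:
    """Hash an id into an integer."""
    return int.from_bytes(hashlib.sha256(x).digest(), "big")

def tag(s: int) -> str:
    """Second hash used for the final comparison."""
    return hashlib.sha256(str(s).encode()).hexdigest()

# Client party B holds the RSA key pair (toy Mersenne primes, sketch only).
p, q, e = (1 << 61) - 1, (1 << 31) - 1, 65537
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))

ids_A = [b"u1", b"u2", b"u3"]   # tag party A's user ids
ids_B = [b"u2", b"u3", b"u9"]   # client party B's user ids

# A blinds each hashed id with a random factor r, so B never sees H(u) itself.
r = [secrets.randbelow(n - 2) + 2 for _ in ids_A]
blinded = [h(u) % n * pow(ri, e, n) % n for u, ri in zip(ids_A, r)]

# B signs the blinded values and publishes double-hashed signatures of its own ids.
signed = [pow(y, d, n) for y in blinded]
tags_B = {tag(pow(h(u) % n, d, n)) for u in ids_B}

# A unblinds (multiply by r^-1 mod n); matching tags reveal only the common users.
common = [u for u, s, ri in zip(ids_A, signed, r)
          if tag(s * pow(ri, -1, n) % n) in tags_B]
print(common)   # [b'u2', b'u3']
```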
S13: client party B and client party C preprocess the features of their own samples; client party B splits its own features into continuous numerical features $x_{num}^{B}$ and discrete categorical features $x_{cat}^{B}$, and client party C splits its own features into continuous numerical features $x_{num}^{C}$ and discrete categorical features $x_{cat}^{C}$; the numerical features are used as the input of the GBDT2NN part of each party's own sub-model, and the categorical features are used as the input of the CatNN part of the own sub-model.
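For illustration, assuming each client holds its aligned samples as a pandas DataFrame (an assumption; the patent does not prescribe a storage format), the S13 split might be sketched as:

```python
import pandas as pd

def split_features(df: pd.DataFrame):
    """Split an aligned local table into numerical and categorical parts."""
    x_num = df.select_dtypes(include=["number"]).astype("float32")   # -> GBDT2NN input
    x_cat = df.select_dtypes(exclude=["number"]).astype("category")  # -> CatNN input
    return x_num, x_cat

# Client party C, for example (column names are illustrative):
df_c = pd.DataFrame({"age": [34, 51], "income": [8.2, 12.5],
                     "education": ["bachelor", "master"],
                     "employer_type": ["private", "state"]})
x_num_c, x_cat_c = split_features(df_c)
```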
S2: construct the longitudinal federal LightGBM model. The specific steps are as follows:
S21: client parties B and C use the continuous numerical features $x_{num}^{B}$ and $x_{num}^{C}$ as the common input for training the longitudinal federal LightGBM: through exclusive feature bundling, client parties B and C obtain the new processed data sets $x^{EFB}$;
S22: tag party A computes the first derivative $g_i$ and second derivative $h_i$ of the loss function for each sample, $i\in\{1,2,3,\dots,Y\}$, where $Y$ is the number of samples, from the true labels and the predicted values of the trained decision trees; the loss function is the cross-entropy loss; the derivatives are then homomorphically encrypted and transmitted to the other, feature-holding clients;
S23: after client parties B and C receive the encrypted first derivatives $[[g_i]]$ and second derivatives $[[h_i]]$, they divide the feature values of every feature in their own data set $x^{EFB}$ into buckets by percentile, compute the sums $[[G_i]]$ and $[[H_i]]$ of the encrypted first and second derivatives of the samples falling into each bucket of each feature, and transmit $[[G_i]]$ and $[[H_i]]$ to tag party A, where $[[\cdot]]$ indicates that the data is homomorphically encrypted;
S24: after tag party A obtains the aggregated encrypted gradients $\{[[G_i]],[[H_i]]\}$ transmitted by client parties B and C, it performs the corresponding decryption to obtain the aggregated value of each bucket, and then maximizes the score
$$\mathrm{score}=\frac{G_l^{2}}{H_l+\lambda}+\frac{G_r^{2}}{H_r+\lambda}-\frac{G^{2}}{H+\lambda}$$
to find the corresponding optimal split point, where $\lambda$ denotes the coefficient of the L2 regularization term, $G_l$ the sum of first derivatives of all buckets less than or equal to the split threshold $v$, $G_r$ the sum of first derivatives of all buckets greater than the split threshold $v$, $G$ the sum of first derivatives of all samples at the current node, $H_l$ the sum of second derivatives of all buckets less than or equal to the split threshold $v$, $H_r$ the sum of second derivatives of all buckets greater than the split threshold $v$, and $H$ the sum of second derivatives of all samples at the current node; the feature value of each bucket is traversed as the candidate boundary to obtain the maximum score of the current feature, and all features are traversed to obtain the split feature $k$ and the optimal split threshold $v$ that maximize the score globally (a plaintext sketch of this split search is given after step S26 below);
S25: the client side with the optimal splitting point characteristics stores a splitting threshold point v and a splitting characteristic k, then divides a sample space of a current tree node into a left sub data set and a right sub data set through the splitting threshold point v of the characteristic k, sends a sample space result of a new node after division to other client sides and a label side A side for synchronization of the sample space, and simultaneously returns information of the client side to the label side A side for recording the characteristic of which client side performs node splitting;
S26: the label side A side splits the current leaf node into two new leaf nodes, records the index of the leaf node and the id of a sample in a leaf node sample space, and is used for converting the gradient lifting tree into a neural network; the steps S22-S26 are iterated, and the selection of the next leaf node segmentation is entered until the termination condition for LightGBM training is reached.
S3: convert the longitudinal federal LightGBM model into a neural network serving as the GBDT2NN part of the longitudinal federal ECA-DeepGBM model. The specific steps are as follows:
S31: the trained longitudinal LightGBM model is converted into a corresponding neural network through knowledge distillation to serve as the GBDT2NN parts of client parties B and C that process the continuous numerical data: first, the decision trees in LightGBM are grouped and divided equally into m tree groups; optionally, every 10 decision trees form one group. For any one tree group $T$ among the m tree groups, the steps of the conversion into a neural network are as follows:
S311: tag party A traverses the tree group $T$ to obtain the leaf-node index vector of each decision tree for the i-th sample, and then concatenates all the obtained index vectors with the concatenation operation $\|$ into $L_{T,i}$; $L_{t,i}$ denotes the leaf-node index vector of the i-th sample in tree $t$, and $L_{t,i}$ is obtained from the leaf-node indices and sample ids recorded during the longitudinal federal LightGBM training;
S312: tag party A uses an embedding layer to obtain an embedded representation of the leaf nodes, $H(L_{T,i};\omega^{T})$, which maps the concatenated multi-hot vector $L_{T,i}$ into an embedding vector; $w^{T}\cdot H(L_{T,i};\omega^{T})+w_0$ is used to fit the sum of the leaf weights $\sum_{t\in T} l_{t,i}$ of the decision trees of the tree group $T$ in which the i-th sample falls; the loss function $\mathcal{L}'$ is the same loss function as in LightGBM, for example the cross-entropy loss. The process of learning the leaf-node embedding of the multiple trees is expressed as:
$$\min_{w^{T},w_0,\omega^{T}}\ \frac{1}{n}\sum_{i=1}^{n}\mathcal{L}'\Big(w^{T}\cdot H(L_{T,i};\omega^{T})+w_0,\ \sum_{t\in T} l_{t,i}\Big)$$
Optionally, the dimension of the embedding vector $H(L_{T,i};\omega^{T})$ is set to 5;
where $w^{T}$ and $w_0$ denote the weight and bias that map the embedded index to the GBDT2NN part output, $\omega^{T}$ denotes the parameters that convert the multi-hot vector into an embedding vector, $n$ denotes the total number of training samples, and $i$ denotes the i-th training sample;
S313: two neural networks are initialized on client parties B and C respectively as their GBDT2NN parts; because some features with low split gain are never used when splitting the decision trees, the input features of client parties B and C are set to the features $x^{B,I^{T}}$ and $x^{C,I^{T}}$ actually used by each client when splitting the decision trees, and the output dimension of each network is kept consistent with the dimension of the embedding vector $H(L_{T,i};\omega^{T})$;
Optionally, the dimensions of the hidden layers of the sub-neural networks of the client side B and C may be set to (100,100,100,50);
S314: client parties B and C obtain the outputs of their own GBDT2NN parts: client parties B and C compute the outputs $NN^{B}(x_i^{B,I^{T}};\theta^{B})$ and $NN^{C}(x_i^{C,I^{T}};\theta^{C})$ of the sub-neural networks converted from the tree group $T$, homomorphically encrypt them, and send them to tag party A; $NN^{B}(x_i^{B,I^{T}};\theta^{B})$ denotes the output of client party B's neural network for the i-th sample, $x_i^{B,I^{T}}$ the features of the i-th sample on client party B's side and $\theta^{B}$ the parameters of client party B's neural network; $NN^{C}(x_i^{C,I^{T}};\theta^{C})$, $x_i^{C,I^{T}}$ and $\theta^{C}$ denote the same quantities for client party C; from the received information tag party A computes the leaf-node embedding loss $\mathcal{L}_{embed}^{T}$ of the tree group $T$, where $\mathcal{L}''$ denotes the regression loss function:
$$\mathcal{L}_{embed}^{T}=\frac{1}{n}\sum_{i=1}^{n}\mathcal{L}''\Big(NN^{B}(x_i^{B,I^{T}};\theta^{B})+NN^{C}(x_i^{C,I^{T}};\theta^{C}),\ H(L_{T,i};\omega^{T})\Big)$$
Optionally, the loss function $\mathcal{L}''$ here is the mean squared error.
S315: tag party A takes the weight $w^{T}$ and bias $w_0$ learned earlier when fitting the sum of the leaf-node weights with the embedded representation and sends them to client parties B and C; client party B obtains the output of the sub-neural network converted from the current tree group
$$y^{T,B}(x^{B})=w^{T}\cdot NN^{B}(x^{B,I^{T}};\theta^{B})+w_0$$
and client party C obtains the output of the sub-neural network converted from the current tree group
$$y^{T,C}(x^{C})=w^{T}\cdot NN^{C}(x^{C,I^{T}};\theta^{C})+w_0$$
S32: after the m tree groups are converted into corresponding neural networks according to steps S311-S315, client party B obtains the output of the GBDT2NN part of the B-side sub-model
$$y_{GBDT2NN}^{B}(x^{B})=\sum_{j=1}^{m}y^{T_j,B}(x^{B})$$
and client party C obtains the output of the GBDT2NN part of the C-side sub-model
$$y_{GBDT2NN}^{C}(x^{C})=\sum_{j=1}^{m}y^{T_j,C}(x^{C})$$
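A single-party sketch of the distillation in S311-S315 is given below, assuming a trained `lightgbm.Booster` and using PyTorch; in the federal setting the MLP is split between client parties B and C, and names such as `group` are illustrative assumptions:

```python
import numpy as np
import torch
import torch.nn as nn

def distill_tree_group(booster, X, group, emb_dim=5, epochs=200):
    """Distill one LightGBM tree group into a leaf embedding plus an MLP."""
    # S311: leaf index of every sample in every tree of the group.
    leaves = booster.predict(X, pred_leaf=True)[:, group]
    # S312 target: sum of the group's leaf weights for each sample.
    target = np.array([sum(booster.get_leaf_output(t, int(l))
                           for t, l in zip(group, row)) for row in leaves])
    n_leaves = int(leaves.max()) + 1
    # Multi-hot leaf indices -> dense embedding H(L; omega^T); head = (w^T, w0).
    emb = nn.EmbeddingBag(n_leaves * len(group), emb_dim, mode="sum")
    head = nn.Linear(emb_dim, 1)
    idx = (torch.as_tensor(np.asarray(leaves), dtype=torch.long)
           + torch.arange(len(group)) * n_leaves)      # offset ids per tree
    y = torch.as_tensor(target, dtype=torch.float32).unsqueeze(1)
    opt = torch.optim.Adam(list(emb.parameters()) + list(head.parameters()), lr=0.01)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(head(emb(idx)), y)   # leaf-embedding fit
        loss.backward()
        opt.step()
    # S313-S315: an MLP on the raw features is then trained to fit emb(idx),
    # and the group's output is y^T(x) = head(MLP(x)); hidden sizes as above.
    mlp = nn.Sequential(nn.Linear(X.shape[1], 100), nn.ReLU(),
                        nn.Linear(100, 100), nn.ReLU(),
                        nn.Linear(100, 100), nn.ReLU(),
                        nn.Linear(100, 50), nn.ReLU(),
                        nn.Linear(50, emb_dim))
    return emb, head, mlp
```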
S4: the feed-forward computation of the CatNN part of the longitudinal federal ECA-DeepGBM model is performed as follows:
S41: client parties B and C obtain the outputs of their own CatNN parts: the CatNN module in ECA-DeepGBM is the part that processes sparse features; client parties B and C input their discrete data $x_{cat}^{B}$ and $x_{cat}^{C}$ into their respective FAT-DeepFFM models, i.e., the CatNN parts of their own sub-models; after converting the input into one-hot codes, client parties B and C divide their own features into $f_B$ and $f_C$ fields respectively; for the input features, client parties B and C first obtain the corresponding embedding vectors $e_{ij}$ through an embedding layer, where $v_{ij}$ is the embedding weight of the i-th feature for the j-th field:
$$e_{ij}=v_{ij}x_i$$
The embedding matrix corresponding to the i-th field is obtained as $EM_i=[e_{i1},e_{i2},\dots,e_{if}]$, where $f$ is the number of fields into which the client party divides its features;
S42: client parties B and C take the matrix $EM_i$ obtained after embedding their own i-th feature group and, through the efficient channel attention module (ECA-Net) of their own models, apply an attention mechanism to the embedded feature matrix to obtain a new embedding matrix, thereby emphasizing some important features and suppressing some unimportant ones:
$$AEM_i=F(S_i,EM_i)=[S_{i1}\cdot e_{i1},\dots,S_{ij}\cdot e_{ij},\dots,S_{if}\cdot e_{if}]$$
where $AEM_i$ denotes the embedding matrix obtained from the i-th feature group after processing by the attention mechanism, and $S_i$ denotes the weight values corresponding to $EM_i$ produced by the ECA-Net module;
S43: at the feature interaction layer of their own FAT-DeepFFM models, client parties B and C let the attended embedding vectors interact pairwise, with the vector interaction carried out by the Hadamard product:
$$f_{interaction}(V_x)=[\dots,AEM_i\odot AEM_j,\dots],\quad i\neq j$$
S44: client parties B and C concatenate the Hadamard product vectors obtained on their own side with the concatenation operation and input them into a deep neural network;
S45: client parties B and C each use a deep neural network for the high-order feature interaction, and the output of forward propagation is $y_{DNN}=DNN(f_{interaction}(V_x))$;
S46: client party B adds the output of its corresponding linear part to obtain the total output of client party B's CatNN part:
$$y_{CatNN}^{B}=w_0^{B}+\sum_i w_i^{B}x_i^{B}+DNN(f_{interaction}^{B}(V_x^{B}))$$
where $w_0^{B}$ denotes the bias term of the B-side linear part, $w_i^{B}$ the weight of the i-th feature of party B, $x_i^{B}$ the i-th feature of the B-side sample, and $f_{interaction}^{B}(V_x^{B})$ the concatenated Hadamard product vector obtained by party B.
Client party C adds the output of its corresponding linear part to obtain the total output of client party C's CatNN part:
$$y_{CatNN}^{C}=w_0^{C}+\sum_i w_i^{C}x_i^{C}+DNN(f_{interaction}^{C}(V_x^{C}))$$
where $w_0^{C}$ denotes the bias term of the C-side linear part, $w_i^{C}$ the weight of the i-th feature of party C, $x_i^{C}$ the i-th feature of the C-side sample, and $f_{interaction}^{C}(V_x^{C})$ the concatenated Hadamard product vector obtained by party C.
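A minimal single-party sketch of this CatNN forward pass (field embeddings, an ECA-style attention gate, pairwise Hadamard interactions, a DNN and a linear part) is given below; the layer sizes, the 1-D-convolution attention and the class name are assumptions for illustration:

```python
import itertools
import torch
import torch.nn as nn

class CatNN(nn.Module):
    def __init__(self, n_fields, vocab_sizes, emb_dim=8, k_size=3):
        super().__init__()
        self.embs = nn.ModuleList(nn.Embedding(v, emb_dim) for v in vocab_sizes)
        self.eca = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2)
        n_pairs = n_fields * (n_fields - 1) // 2
        self.dnn = nn.Sequential(nn.Linear(n_pairs * emb_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
        self.linear = nn.Embedding(sum(vocab_sizes), 1)   # per-category weight w_i
        self.bias = nn.Parameter(torch.zeros(1))          # linear bias w_0

    def forward(self, x_cat):                             # x_cat: (batch, n_fields)
        em = torch.stack([e(x_cat[:, i]) for i, e in enumerate(self.embs)], 1)
        # ECA-style field attention: squeeze each field embedding, conv, gate.
        s = torch.sigmoid(self.eca(em.mean(-1, keepdim=True).transpose(1, 2)))
        aem = em * s.transpose(1, 2)                      # AEM_i = S_i * EM_i
        pairs = [aem[:, i] * aem[:, j]                    # Hadamard interactions
                 for i, j in itertools.combinations(range(aem.size(1)), 2)]
        y_dnn = self.dnn(torch.cat(pairs, dim=1))
        y_lin = self.bias + self.linear(x_cat).sum(dim=1) # assumes globally offset ids
        return y_lin + y_dnn                              # total CatNN output

model = CatNN(n_fields=3, vocab_sizes=[4, 6, 5])
out = model(torch.tensor([[1, 2, 0], [3, 5, 4]]))         # (batch, 1) CatNN scores
```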
S5: a loss function is constructed and the high-dimensional data classification prediction model is trained. The specific steps are as follows:
S51: the loss function of longitudinal federal ECA-DeepGBM involves the overall prediction loss $\mathcal{L}'(\hat{y},y)$ obtained from the true label $y$ held by tag party A and the predictions $\hat{y}^{B}$ and $\hat{y}^{C}$ of the B-side and C-side sub-models, and the embedding losses $\mathcal{L}_{embed}^{T_j}$ incurred when embedding the gradient boosting tree leaf nodes after the decision trees in LightGBM are divided into m groups; weighting them gives the overall loss of longitudinal federal ECA-DeepGBM:
$$\mathcal{L}=\alpha\,\mathcal{L}'(\hat{y},y)+\beta\sum_{j=1}^{m}\mathcal{L}_{embed}^{T_j}$$
where $\mathcal{L}_{embed}^{T_j}$ denotes the embedding loss of the j-th tree group, and $\alpha$ and $\beta$ are hyper-parameters that control the magnitudes of the loss values.
Optionally, α is set to 0.5 and β is set to 0.5.
S52: the loss function $\mathcal{L}'$ used in $\mathcal{L}$ is the cross-entropy loss and involves an exponential operation. However, longitudinal federal ECA-DeepGBM training requires homomorphic encryption, and by the nature of homomorphic encryption an exponential cannot be computed on homomorphically encrypted data. Therefore the prediction loss must be approximated under encryption by a polynomial obtained from a Taylor expansion; let $u=\hat{y}^{B}+\hat{y}^{C}$, then:
$$\mathcal{L}'(u,y)=\log\left(1+e^{-yu}\right)\approx\log 2-\frac{1}{2}yu+\frac{1}{8}u^{2}$$
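A short derivation sketch of this quadratic approximation, assuming labels encoded as $y\in\{-1,+1\}$ (an assumption; the patent does not show the exact encoding), is:

```latex
% Second-order Taylor expansion of the logistic loss at z = 0:
\ell(z) = \log\!\left(1 + e^{-z}\right), \qquad
\ell(0) = \log 2, \quad \ell'(0) = -\tfrac{1}{2}, \quad \ell''(0) = \tfrac{1}{4}
\;\Longrightarrow\;
\ell(z) \approx \log 2 - \frac{z}{2} + \frac{z^{2}}{8}.
```

With $z=yu$ and $y^{2}=1$ this gives $\mathcal{L}'(u,y)\approx\log 2-\frac{1}{2}yu+\frac{1}{8}u^{2}$, a polynomial in $u$ that an additively homomorphic scheme can evaluate once the cross term of $u^{2}=(\hat{y}^{B}+\hat{y}^{C})^{2}$ is supplied, which is plausibly the role of $[[d_{BC}]]$ in step S54 below.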
S53: client party B weights the output $y_{GBDT2NN}^{B}$ of its sub-model's GBDT2NN part and the output $y_{CatNN}^{B}$ of its CatNN part to obtain the prediction output of the B-side model, and encrypts it to obtain
$$[[\hat{y}^{B}]]=[[w_1\cdot y_{GBDT2NN}^{B}+w_2\cdot y_{CatNN}^{B}]]$$
Client party C likewise obtains the encrypted prediction output of the C-side model
$$[[\hat{y}^{C}]]=[[w_1\cdot y_{GBDT2NN}^{C}+w_2\cdot y_{CatNN}^{C}]]$$
where $w_1$ and $w_2$ are trainable parameters that are updated during training;
Optionally, client parties B and C initialize the values of $w_1$ and $w_2$ to 0.5.
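With that initialization, the S53 combination and encryption can be sketched as follows (single sample, illustrative values):

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()
y_gbdt2nn, y_catnn = 0.8, -0.2         # illustrative sub-model outputs for one sample
w1, w2 = 0.5, 0.5                      # trainable combination weights, initialized to 0.5
y_b = w1 * y_gbdt2nn + w2 * y_catnn    # party B's prediction output
enc_y_b = public_key.encrypt(y_b)      # [[y_b]], sent to tag party A and client C (S54)
```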
S54: client party B sends $[[\hat{y}^{B}]]$ to tag party A and client party C; client party C computes $[[d_{BC}]]$ from its own prediction $[[\hat{y}^{C}]]$ and the $[[\hat{y}^{B}]]$ sent by client party B, and then sends $[[d_{BC}]]$ together with $[[\hat{y}^{C}]]$ to tag party A;
S55: tag party A computes the encrypted prediction loss $[[\mathcal{L}']]$ from its label $y$ and the $[[\hat{y}^{B}]]$, $[[\hat{y}^{C}]]$ and $[[d_{BC}]]$ sent by client parties B and C; in the same way it computes the encrypted embedding losses $[[\mathcal{L}_{embed}^{T_j}]]$ from the predicted values of the client B-side and C-side GBDT2NN parts and the leaf-node embedding targets; the losses are added with weights to obtain the encrypted overall loss $[[\mathcal{L}]]$ of the model;
S56: tag party A sends the encrypted overall loss $[[\mathcal{L}]]$ of the model to client parties B and C; according to $[[\mathcal{L}]]$, client parties B and C each compute the encrypted gradient information of the linear layer of their own sub-models, $[[\partial\mathcal{L}/\partial w_1]]$ and $[[\partial\mathcal{L}/\partial w_2]]$, in order to update their respective $w_1$ and $w_2$;
S57: the encrypted gradient information is sent to cooperator party P, which decrypts it and returns the plaintext gradient information $\partial\mathcal{L}/\partial w_1$ and $\partial\mathcal{L}/\partial w_2$ to parties B and C, which then update the values of $w_1$ and $w_2$.
S58: client party B uses the intermediate information, namely the output $y_{CatNN}^{B}$ of its own model's CatNN part together with $[[\mathcal{L}]]$, to compute the encrypted gradient information of the last layer (layer L) of the CatNN part of its neural network, and uses the output $y_{GBDT2NN}^{B}$ of the GBDT2NN part together with $[[\mathcal{L}]]$ to compute the encrypted gradient information of the last layer of the GBDT2NN part of its neural network; client party C obtains its corresponding gradients by the same computation; the parameters of the corresponding neurons are then updated according to the gradient information;
S59: the neural networks back-propagate using the layer-L gradient information: each party performs encrypted computation on the intermediate information of layer L of its different NN networks to obtain the encrypted gradients and sends them to cooperator party P for decryption; party P returns the decrypted results to client parties B and C respectively, and after receiving the decrypted information client parties B and C each compute the loss-function gradient information of the previous layer and then update the parameter information of the layer L-1 neurons;
S510: the back-propagation proceeds in this way from layer L down to layer 1, updating the parameters of all neurons in the different NN networks of every party; steps S53-S59 are iterated until the model converges or the specified number of iterations is reached, and model training ends;
S511: based on the longitudinal federal ECA-DeepGBM model obtained from training, after sample alignment client party B computes its own prediction output $\hat{y}^{B}$ from its own features and client party C computes its own prediction output $\hat{y}^{C}$ from its own features; each then sends its output to cooperator party P; cooperator party P adds the prediction outputs of client parties B and C and applies a sigmoid activation to obtain the business information recommendation result for the multi-party high-dimensional data based on longitudinal federal learning, which is output and returned to client parties B and C.
After the longitudinal federal ECA-DeepGBM model training is completed, client parties B and C divide the feature information of an aligned sample on their own side into categorical sparse features and numerical dense features, one-hot encode the categorical sparse features, normalize the numerical dense features, and each party feeds the preprocessed data into the CatNN and GBDT2NN parts of its own sub-model. For example, party B's data record a customer's loan status, historical repayments and number of defaults, together with other relevant information recorded in party B's application scenario, while party C's data record the customer's specific age, education background, monthly income, household liabilities, employment unit type and other relevant information. After preprocessing, these are fed into the B-side and C-side sub-models of the trained longitudinal federal ECA-DeepGBM model; each party routes its features into the CatNN and GBDT2NN parts according to their type, computes the customer's predicted value $\hat{y}^{B}$ on the B side and $\hat{y}^{C}$ on the C side according to the formula in S53, and the two parties' predictions are aggregated to obtain the final classification recommendation value or class for this customer.
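The final aggregation at cooperator party P (S511) is then a plain sigmoid over the sum of the two client outputs; a sketch with illustrative values:

```python
import math

def recommend(y_b_hat: float, y_c_hat: float) -> float:
    """Add the two client outputs and squash with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-(y_b_hat + y_c_hat)))

score = recommend(0.9, 0.4)        # returned to B and C as the recommendation score
label = int(score >= 0.5)          # optional hard classification
```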
Finally, it is noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solution without departing from the spirit and scope of the present invention, all of which are intended to be covered by the claims of the present invention.

Claims (2)

1. A business information recommendation method based on multiparty high-dimensional data longitudinal federal learning is characterized in that: the method comprises the following steps:
S1: establishing homomorphic encryption key pairs, and preprocessing multiparty data and aligning encryption samples, wherein the multiparty data is business privacy data which exist in the self-side of a tag side A, a client side B, a client side C and a cooperative side P and cannot be known by other sides;
S2: constructing a longitudinal federal LightGBM model;
S3: converting the longitudinal federal LightGBM model into a neural network as the GBDT2NN part of the longitudinal federal ECA-DeepGBM model;
S4: feed-forward computation of the CatNN part of the longitudinal federal ECA-DeepGBM model;
S5: constructing a loss function, training the high-dimensional data classification prediction model, and realizing business information classification recommendation based on the trained high-dimensional data classification prediction model;
the step S1 specifically comprises the following steps:
S11: the cooperative party P generates the homomorphic encryption public key $p_k$ and private key $s_k$, and sends the public key $p_k$ to label party A, client B and client C; the business privacy data of each party comprise: party A holds the labels of existing business information and is also the party requesting recommendations; party B holds partial business information of individuals or enterprises, together with other relevant data from party B's application scenario; party C holds basic information of individuals or enterprises;
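By way of non-limiting illustration, the key setup of S11 can be sketched with the open-source python-paillier library (phe), used here as an assumed stand-in since the claim does not prescribe a concrete homomorphic scheme or library:

```python
from phe import paillier

# cooperative party P generates the key pair (p_k, s_k)
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# p_k is distributed to label party A, client B and client C; s_k stays with P.
# Paillier is additively homomorphic: sums of ciphertexts decrypt to sums.
enc_sum = public_key.encrypt(1.5) + public_key.encrypt(2.5)
assert abs(private_key.decrypt(enc_sum) - 4.0) < 1e-9
```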
S12: establishing a multi-party longitudinal federal learning classification prediction sample set, with the aim of expanding the sample feature dimension for business information classification recommendation: an encryption-based sample alignment technique is used to ensure that parties A, B and C align their common users without exposing their respective original data;
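The alignment of S12 is illustrated below with a salted-hash intersection; this is a deliberate simplification of a true private-set-intersection protocol (e.g., blind-RSA PSI), used only to convey that common users are matched without exchanging raw identifiers:

```python
import hashlib

# simplified stand-in for encrypted sample alignment: ids are blinded with a
# jointly agreed salt and only digests are compared across parties
def blind_ids(ids, salt):
    return {hashlib.sha256((salt + i).encode()).hexdigest(): i for i in ids}

salt = "jointly-agreed-random-salt"       # hypothetical shared secret
a = blind_ids(["u1", "u2", "u3"], salt)   # label party A
b = blind_ids(["u2", "u3", "u4"], salt)   # client B
c = blind_ids(["u3", "u2", "u5"], salt)   # client C

common = set(a) & set(b) & set(c)         # intersected digests
aligned = sorted(a[h] for h in common)    # each party recovers only its own ids
print(aligned)                            # ['u2', 'u3']
```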
S13: client B and client C preprocess the features of their own samples; client B splits its own-side features into continuous numerical features $x^{num}_{B}$ and discrete categorical features $x^{cat}_{B}$, and client C splits its own-side features into continuous numerical features $x^{num}_{C}$ and discrete categorical features $x^{cat}_{C}$; the numerical features serve as input to the GBDT2NN part of the own-side model and the categorical features as input to the CatNN part of the own-side model;
The step S2 specifically comprises the following steps:
S21: client B and client C use their continuous numerical features $x^{num}$ as the joint input for training the longitudinal federal LightGBM: through mutually exclusive feature binding (EFB), client B and client C each obtain a new processed data set $x^{EFB}$;
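The mutually exclusive feature binding of S21 can be conveyed by the toy sketch below, which greedily merges sparse features that are never simultaneously non-zero; the real LightGBM EFB additionally tolerates a bounded conflict rate and operates on histograms, both omitted here:

```python
import numpy as np

def bundle_exclusive(X):
    n, d = X.shape
    bundles, assigned = [], set()
    for j in range(d):
        if j in assigned:
            continue
        group, used = [j], X[:, j] != 0
        for k in range(j + 1, d):
            if k in assigned:
                continue
            nz = X[:, k] != 0
            if not np.any(used & nz):        # exclusive w.r.t. the whole bundle
                group.append(k)
                used |= nz
                assigned.add(k)
        assigned.add(j)
        bundles.append(group)
    out = np.zeros((n, len(bundles)))        # merge bundles with value offsets
    for b, group in enumerate(bundles):
        offset = 0.0
        for j in group:
            out[:, b] += np.where(X[:, j] != 0, X[:, j] + offset, 0.0)
            offset += X[:, j].max() + 1
    return out

X = np.array([[1., 0., 3.], [0., 2., 0.], [4., 0., 0.]])
print(bundle_exclusive(X).shape)             # (3, 2): features 0 and 1 bundled
```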
S22: label party A computes, from the real labels and the predicted values of the trained decision trees, the first derivative $g_i$ and second derivative $h_i$ of the loss function for each sample, $i \in \{1,2,3,\dots,Y\}$, where Y is the number of samples and the loss function is the cross-entropy loss; after homomorphic encryption these are transmitted to the other, feature-holding clients;
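For a sigmoid-output model under the cross-entropy loss, the per-sample derivatives of S22 take the closed forms $g_i = p_i - y_i$ and $h_i = p_i(1 - p_i)$ for labels $y_i \in \{0,1\}$; a sketch, again assuming phe as the encryption stand-in:

```python
import numpy as np
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

y = np.array([1, 0, 1])        # true labels held by label party A
p = np.array([0.7, 0.4, 0.9])  # current predictions of the trained trees

g = p - y                      # first derivatives of the cross-entropy loss
h = p * (1 - p)                # second derivatives

# what clients B and C receive as [[g_i]] and [[h_i]]
enc_g = [public_key.encrypt(v) for v in g.tolist()]
enc_h = [public_key.encrypt(v) for v in h.tolist()]
```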
S23: after client B and client C receive the encrypted first derivatives $[[g_i]]$ and second derivatives $[[h_i]]$, they bucket the feature values of every feature in their own data set $x^{EFB}$ by percentile, obtain for each bucket of each feature the sums $[[G_i]]$ and $[[H_i]]$ of the samples' encrypted first and second derivatives, and transmit $\{[[G_i]],[[H_i]]\}$ to label party A, where $[[\cdot]]$ denotes homomorphically encrypted data;
S24: after label party A obtains the aggregated encrypted gradients $\{[[G_i]],[[H_i]]\}$ transmitted by client B and client C, it performs the corresponding decryption to obtain the aggregate value of each bucket, and then maximizes the split gain $\tfrac{1}{2}\big[\tfrac{G_l^{2}}{H_l+\lambda}+\tfrac{G_r^{2}}{H_r+\lambda}-\tfrac{G^{2}}{H+\lambda}\big]$ to find the corresponding optimal division point, where $\lambda$ is the coefficient of the L2 regularization term, $G_l$ is the sum of first derivatives over all buckets at or below the division threshold v, $G_r$ the sum over all buckets above v, G the sum of first derivatives over all samples of the current node, and $H_l$, $H_r$ and H the corresponding sums of second derivatives; the feature value of each bucket is traversed as a candidate division point to obtain the maximum score of the current feature, and all features are traversed to obtain the split feature k and the optimal split threshold v that maximize the score globally;
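The bucket scan of S24 is sketched below on toy, already-decrypted bucket aggregates; the gain expression is the standard second-order split gain assumed above:

```python
# scan bucket boundaries for the split maximizing
# 1/2 * (Gl^2/(Hl+lam) + Gr^2/(Hr+lam) - G^2/(H+lam))
def best_split(G_buckets, H_buckets, lam=1.0):
    G, H = sum(G_buckets), sum(H_buckets)
    best_gain, best_v, Gl, Hl = float("-inf"), None, 0.0, 0.0
    for v in range(len(G_buckets) - 1):   # candidate threshold = bucket v
        Gl, Hl = Gl + G_buckets[v], Hl + H_buckets[v]
        Gr, Hr = G - Gl, H - Hl
        gain = 0.5 * (Gl**2 / (Hl + lam) + Gr**2 / (Hr + lam) - G**2 / (H + lam))
        if gain > best_gain:
            best_gain, best_v = gain, v
    return best_v, best_gain

# toy decrypted per-bucket gradient sums for one feature
print(best_split([-1.2, 0.4, 0.9, -0.1], [0.5, 0.6, 0.4, 0.3]))
```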
S25: the client holding the feature of the optimal split point stores the split threshold v and the split feature k, then divides the sample space of the current tree node into left and right child data sets by the threshold v of feature k, sends the divided sample-space result of the new nodes to the other client and to label party A for sample-space synchronization, and at the same time returns its own identity to label party A to record which client performed the node split;
S26: label party A splits the current leaf node into two new leaf nodes and records the leaf-node indices and the sample ids in each leaf node's sample space, for later use in converting the gradient boosting trees into neural networks; steps S22-S26 are iterated, selecting the next leaf node to split, until the termination condition of the LightGBM training is reached;
the step S3 specifically comprises the following steps:
S31: the trained longitudinal federal LightGBM model is converted, by knowledge distillation, into a corresponding neural network that serves as the GBDT2NN part processing the continuous numerical data of client B and client C: first, the decision trees in the LightGBM are divided equally into m tree groups, and any tree group $\mathbb{T}$ among the m tree groups is converted into a neural network by the following steps:
S311: label party A traverses every decision tree of the tree group $\mathbb{T}$ to obtain the leaf-node index vector of the i-th sample in each tree, and then concatenates all obtained index vectors using the concatenation operation $\Vert(\cdot)$; $L_{t,i}$ denotes the leaf-node index vector of the i-th sample in tree t, and is obtained from the leaf-node indices and sample ids recorded during the longitudinal federal LightGBM training;
S312: label party A uses an embedding layer to obtain an embedded representation of the leaf nodes, $H(\Vert_{t\in\mathbb{T}}L_{t,i};\omega^{\mathbb{T}})$, which maps the concatenated multi-hot vector $\Vert_{t\in\mathbb{T}}L_{t,i}$ into an embedding vector, and uses it to fit the sum $\sum_{t\in\mathbb{T}}p_{t,i}$ of the leaf-node weights of the decision trees of the tree group $\mathbb{T}$ in which the i-th sample falls; the loss function $\mathcal{L}''$ is the same as the one used in LightGBM, and the process of learning the leaf-node embedding of multiple trees is expressed as:
$$\min_{w,w_0,\omega^{\mathbb{T}}}\ \frac{1}{n}\sum_{i=1}^{n}\mathcal{L}''\!\left(w^{T}H\!\left(\Vert_{t\in\mathbb{T}}L_{t,i};\omega^{\mathbb{T}}\right)+w_{0},\ \sum_{t\in\mathbb{T}}p_{t,i}\right)$$
where $w^{T}$ and $w_0$ are the weight and bias mapping the embedded index to the GBDT2NN part output, $\omega^{\mathbb{T}}$ is the parameter converting the multi-hot vector into the embedding vector, n is the total number of training samples, and i denotes the i-th training sample;
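The distillation of S311-S312 is sketched below on a local, non-federated LightGBM: leaf indices are read via pred_leaf=True, turned into the concatenated multi-hot vector, and a small PyTorch network (standing in for $H(\cdot;\omega^{\mathbb{T}})$ plus the linear map $w^{T}$, $w_0$) is fitted to the summed leaf values; fitting the raw scores with an MSE loss is a simplification of the claim's choice of loss:

```python
import lightgbm as lgb
import numpy as np
import torch
import torch.nn as nn

X, y = np.random.rand(256, 10), np.random.randint(0, 2, 256)
booster = lgb.train({"objective": "binary", "num_leaves": 8, "verbose": -1},
                    lgb.Dataset(X, y), num_boost_round=4)

leaf_idx = booster.predict(X, pred_leaf=True)   # L_{t,i}: (n_samples, n_trees)
target = booster.predict(X, raw_score=True)     # sum of leaf values per sample
num_trees, num_leaves = leaf_idx.shape[1], 8

# concatenated multi-hot vector ||_{t in T} L_{t,i}
multi_hot = torch.zeros(len(X), num_trees * num_leaves)
rows = torch.arange(len(X))
for t in range(num_trees):
    cols = torch.as_tensor(leaf_idx[:, t], dtype=torch.long) + t * num_leaves
    multi_hot[rows, cols] = 1.0

# embedding layer H(.; omega^T) followed by the linear map w^T(.) + w_0
embed = nn.Sequential(nn.Linear(num_trees * num_leaves, 16), nn.Linear(16, 1))
opt = torch.optim.Adam(embed.parameters(), lr=1e-2)
t_target = torch.tensor(target, dtype=torch.float32).unsqueeze(1)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(embed(multi_hot), t_target)
    loss.backward()
    opt.step()
```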
S313: client B and client C each initialize a neural network as their GBDT2NN part; the input features of client B and client C are set to the own-side features $x^{\mathbb{T}}$ that each party actually used in the splits of the decision trees, and the output dimension is set consistent with the dimension of the embedded representation $H(\Vert_{t\in\mathbb{T}}L_{t,i};\omega^{\mathbb{T}})$;
S314: client B and client C obtain their own GBDT2NN part outputs: client B and client C obtain the outputs $\mathcal{N}_{B}(x^{\mathbb{T}}_{B,i};\theta_{B})$ and $\mathcal{N}_{C}(x^{\mathbb{T}}_{C,i};\theta_{C})$ of the own-side sub-neural-networks converted from the tree group $\mathbb{T}$, homomorphically encrypt them and send them to label party A; here $\mathcal{N}_{B}(x^{\mathbb{T}}_{B,i};\theta_{B})$ is the output of client B's neural network for the i-th sample, $x^{\mathbb{T}}_{B,i}$ the features of the i-th sample on client B, $\theta_{B}$ the parameters of client B's neural network, and analogously for client C; from the received information, label party A computes the leaf-node embedding loss of the tree group $\mathbb{T}$, where $\mathcal{L}'$ denotes the regression loss function:
$$\mathcal{L}^{\mathbb{T}}=\frac{1}{n}\sum_{i=1}^{n}\mathcal{L}'\!\left(\mathcal{N}_{B}(x^{\mathbb{T}}_{B,i};\theta_{B})+\mathcal{N}_{C}(x^{\mathbb{T}}_{C,i};\theta_{C}),\ H\!\left(\Vert_{t\in\mathbb{T}}L_{t,i};\omega^{\mathbb{T}}\right)\right)$$
S315: label party A takes the weight $w^{T}$ and bias $w_{0}$ learned earlier when fitting the sum of the leaf-node weights with the embedded representation and sends them to client B and client C; client B obtains the output of the sub-neural-network converted from the current tree group $\mathbb{T}$ as $\hat{y}^{\mathbb{T}}_{B}(x)=w^{T}\mathcal{N}_{B}(x^{\mathbb{T}}_{B};\theta_{B})+w_{0}$;
client C obtains the output of the sub-neural-network converted from the current tree group $\mathbb{T}$ as $\hat{y}^{\mathbb{T}}_{C}(x)=w^{T}\mathcal{N}_{C}(x^{\mathbb{T}}_{C};\theta_{C})+w_{0}$;
S32: after the m tree groups are converted into corresponding neural networks according to steps S311-S315, client B obtains the output of the GBDT2NN part of the B-side sub-model as $\hat{y}^{B}_{GBDT2NN}(x)=\sum_{j=1}^{m}\hat{y}^{\mathbb{T}_{j}}_{B}(x)$;
client C obtains the output of the GBDT2NN part of the C-side sub-model as $\hat{y}^{C}_{GBDT2NN}(x)=\sum_{j=1}^{m}\hat{y}^{\mathbb{T}_{j}}_{C}(x)$;
The step S4 specifically comprises the following steps:
S41: client B and client C obtain their own CatNN part outputs: the discrete data $x^{cat}_{B}$ and $x^{cat}_{C}$ of client B and client C are input into the FAT-DeepFFM model of the respective party, i.e., the CatNN part of the own-side sub-model; after converting the input into one-hot codes, client B and client C divide their own features into $f_{B}$ and $f_{C}$ fields respectively; through an embedding layer, client B and client C first obtain the embedding vector $e_{ij}$ corresponding to each input feature, where $v_{ij}$ is the embedding weight of feature i for field j:
$$e_{ij}=v_{ij}x_{i}$$
yielding the embedding matrix $EM_{i}=[e_{i1},e_{i2},\dots,e_{if}]$ of the i-th field, where f is determined by the number of fields into which the client divides its features;
S42: client B and client C take the matrix $EM_{i}$ obtained after embedding their own i-th feature group and, through the efficient channel attention module of the own-side model, apply an attention mechanism to the embedded feature matrix to obtain a new embedding matrix:
$$AEM_{i}=F(S_{i},EM_{i})=[S_{i1}\cdot e_{i1},\dots,S_{ij}\cdot e_{ij},\dots,S_{if}\cdot e_{if}]$$
where $AEM_{i}$ is the embedding matrix of the i-th feature group after processing by the attention mechanism, and $S_{i}$ is the weight vector corresponding to $EM_{i}$ produced by the ECA-Net module;
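An ECA-style channel attention over a field's embedding matrix, as used in S42, can be sketched in PyTorch as follows; kernel size and tensor shapes are illustrative assumptions, not values fixed by the claim:

```python
import torch
import torch.nn as nn

class ECAAttention(nn.Module):
    """1D convolution over channel descriptors, then sigmoid gating."""
    def __init__(self, k_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, em):                   # em: (batch, f, emb_dim)
        s = em.mean(dim=-1, keepdim=True)    # per-field descriptor, (b, f, 1)
        s = self.conv(s.transpose(1, 2)).transpose(1, 2)  # cross-field mixing
        s = self.sigmoid(s)                  # weights S_i in (0, 1)
        return em * s                        # AEM_i = S_ij * e_ij

em = torch.randn(4, 6, 8)                    # 6 fields, embedding dim 8
print(ECAAttention()(em).shape)              # torch.Size([4, 6, 8])
```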
S43: at the feature-interaction layer of its own FAT-DeepFFM model, each of client B and client C performs pairwise interaction of the attention-weighted embedding vectors by way of the Hadamard product, $f_{ij}=AEM_{i}\odot AEM_{j}$ for each field pair $i<j$;
S44: client B and client C concatenate the Hadamard product vectors they obtained, $V_{x}=\Vert_{i<j}f_{ij}$, and input the result into a deep neural network;
S45: client B and client C each use a deep neural network for the higher-order feature interaction, the output of forward propagation being $\mathrm{DNN}(f_{interaction}(V_{x}))$;
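Steps S43-S45 amount to pairwise Hadamard products of the attention-weighted embeddings, concatenation into $V_x$, and a small MLP; a minimal PyTorch sketch with illustrative shapes and layer sizes:

```python
import itertools
import torch
import torch.nn as nn

def hadamard_interactions(aem):              # aem: (batch, f, emb_dim)
    f = aem.shape[1]
    # Hadamard product for every field pair i < j
    pairs = [aem[:, i] * aem[:, j]
             for i, j in itertools.combinations(range(f), 2)]
    return torch.cat(pairs, dim=-1)          # V_x: (batch, f*(f-1)/2 * emb_dim)

aem = torch.randn(4, 6, 8)
v_x = hadamard_interactions(aem)             # 15 pairs * 8 dims = 120
dnn = nn.Sequential(nn.Linear(v_x.shape[-1], 64), nn.ReLU(), nn.Linear(64, 1))
y_deep = dnn(v_x)                            # DNN(f_interaction(V_x))
print(v_x.shape, y_deep.shape)
```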
S46: the client B side obtains the total output of the client B side CatNN part by adding the corresponding linear part output:
Wherein the method comprises the steps of Bias term representing the B-side part of the linear part,/>Weights representing the ith feature of the B-party,/>Representing the ith feature of the B-party sample,/>Representing the spliced Hadamard product vector obtained by the B side;
client C obtains the total output of the C-side CatNN part by adding the corresponding linear-part output: $\hat{y}^{C}_{CatNN}(x)=w^{C}_{0}+\sum_{i}w^{C}_{i}x^{C}_{i}+\mathrm{DNN}(V^{C}_{x})$, where $w^{C}_{0}$ is the bias term of the C-side linear part, $w^{C}_{i}$ the weight of the i-th C-side feature, $x^{C}_{i}$ the i-th feature of the C-side sample, and $V^{C}_{x}$ the concatenated Hadamard product vector obtained by party C;
the step S5 specifically comprises the following steps:
S51: the loss function of the longitudinal federal ECA-DeepGBM involves the model's overall prediction loss $\mathcal{L}''$ and the leaf-node embedding losses $\mathcal{L}^{\mathbb{T}_{j}}$ arising when the decision trees of the LightGBM are divided into m groups and embedded; weighting them gives the overall loss of the longitudinal federal ECA-DeepGBM:
$$\mathcal{L}=\alpha\,\mathcal{L}''+\beta\sum_{j=1}^{m}\mathcal{L}^{\mathbb{T}_{j}}$$
the overall prediction loss $\mathcal{L}''$ is computed from the real labels y held by label party A and the predicted values $\hat{y}_{B}$ of the B-side sub-model and $\hat{y}_{C}$ of the C-side sub-model, while the embedding loss of converting the LightGBM into neural networks is the sum of the embedding losses of the individual tree groups;
$\mathcal{L}^{\mathbb{T}_{j}}$ denotes the embedding loss of the j-th tree group, and $\alpha$ and $\beta$ are hyperparameters controlling the magnitudes of the loss terms;
S52: the loss function $\mathcal{L}''$ used therein is the cross-entropy loss, which involves an exponential operation; to carry out homomorphic encryption on it, the encrypted prediction loss $[[\mathcal{L}'']]$ is approximated by a polynomial via Taylor expansion; let $z=\hat{y}_{B}+\hat{y}_{C}$, then (for labels $y\in\{-1,+1\}$):
$$[[\mathcal{L}'']]\approx[[\log 2]]-\frac{1}{2}y\,[[z]]+\frac{1}{8}[[z^{2}]]$$
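The approximation can be checked numerically: for labels $y\in\{-1,+1\}$ and margin z, the loss $\log(1+e^{-yz})$ expands around z = 0 to $\log 2 - yz/2 + z^{2}/8$, a polynomial that composes with additive homomorphic operations:

```python
import numpy as np

def exact(y, z):
    # cross-entropy (logistic) loss for labels y in {-1, +1}
    return np.log1p(np.exp(-y * z))

def taylor(y, z):
    # second-order Taylor expansion around z = 0 (uses y**2 == 1)
    return np.log(2) - 0.5 * y * z + 0.125 * z ** 2

for y, z in [(1, 0.3), (-1, 0.5), (1, -0.8)]:
    print(f"y={y:+d} z={z:+.1f}  exact={exact(y, z):.4f}  "
          f"approx={taylor(y, z):.4f}")
```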
S53: client B weights the output $\hat{y}^{B}_{GBDT2NN}$ of the GBDT2NN part and the output $\hat{y}^{B}_{CatNN}$ of the CatNN part of its own sub-model to obtain the prediction output of the B-side model, and encrypts it to obtain $[[\hat{y}_{B}]]=[[w_{1}\hat{y}^{B}_{GBDT2NN}+w_{2}\hat{y}^{B}_{CatNN}]]$;
client C likewise obtains the encrypted prediction output of the C-side model, $[[\hat{y}_{C}]]=[[w_{1}\hat{y}^{C}_{GBDT2NN}+w_{2}\hat{y}^{C}_{CatNN}]]$;
where $w_{1}$ and $w_{2}$ are trainable parameters that are updated during training;
S54: client B sends $[[\hat{y}_{B}]]$ to label party A and client C; client C computes $[[d_{BC}]]=[[\hat{y}_{B}]]+[[\hat{y}_{C}]]$ from its own predicted value $[[\hat{y}_{C}]]$ and the $[[\hat{y}_{B}]]$ sent by client B, and then sends $[[d_{BC}]]$ together with $[[\hat{y}_{C}]]$ to label party A;
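The exchange in S54 needs only additive homomorphism; a sketch with phe, using illustrative values and assuming for simplicity that the decryptor holds the private key:

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

y_b_hat = 0.42                         # client B's plaintext prediction output
y_c_hat = -0.17                        # client C's plaintext prediction output

enc_y_b = public_key.encrypt(y_b_hat)  # [[y_B]], sent to A and C
# [[d_BC]] computed at C without ever seeing B's plaintext value
enc_d_bc = enc_y_b + public_key.encrypt(y_c_hat)

# only the holder of s_k (party P in the claim) can recover the sum
assert abs(private_key.decrypt(enc_d_bc) - (y_b_hat + y_c_hat)) < 1e-9
```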
S55: label party A uses the labels y together with the $[[d_{BC}]]$ sent by client B and client C to compute the encrypted prediction loss $[[\mathcal{L}'']]$; from the encrypted GBDT2NN predicted values $[[\hat{y}^{\mathbb{T}}_{B}]]$ of client B and $[[\hat{y}^{\mathbb{T}}_{C}]]$ of client C and the embedding target $H(\Vert_{t\in\mathbb{T}}L_{t,i};\omega^{\mathbb{T}})$, it computes $[[\mathcal{L}^{\mathbb{T}}]]$ in the same way; weighted addition yields the overall model loss $[[\mathcal{L}]]$;
S56: label party A sends the overall model loss $[[\mathcal{L}]]$ to client B and client C; client B and client C each compute from $[[\mathcal{L}]]$ the gradient information $[[\partial\mathcal{L}/\partial w_{1}]]$ and $[[\partial\mathcal{L}/\partial w_{2}]]$ of the linear layer of their own sub-models, so as to update their respective $w_{1}$ and $w_{2}$;
S57: the encrypted gradient information is sent to the cooperative party P, which decrypts it and returns the gradient information $\partial\mathcal{L}/\partial w_{1}$ and $\partial\mathcal{L}/\partial w_{2}$ to client B and client C, which then update the values of $w_{1}$ and $w_{2}$;
S58: client B uses the intermediate information, namely the own-side model's CatNN part output $\hat{y}^{B}_{CatNN}$ and the loss gradient $\partial\mathcal{L}/\partial\hat{y}_{B}$, to compute the layer-L gradient information $\delta^{(L)}_{CatNN}$ of the CatNN part of the neural network, and uses the GBDT2NN part output $\hat{y}^{B}_{GBDT2NN}$ and $\partial\mathcal{L}/\partial\hat{y}_{B}$ to compute the layer-L gradient information $\delta^{(L)}_{GBDT2NN}$ of the GBDT2NN part; client C obtains its own layer-L gradients by the same calculation; the parameters of the corresponding neurons are then updated according to the gradient information;
S59: the neural network back-propagates via the gradient information of layer L. Each party performs an encrypted computation on the intermediate information of layer L of its different NN networks to obtain the encrypted gradients $[[\delta^{(L)}]]$, which are sent to the cooperative party P for decryption; party P returns the decrypted results to client B and client C respectively, and after receiving the decrypted information client B and client C each compute the loss-function gradient information $\delta^{(L-1)}$ of the previous layer and then update the parameter information of the layer L-1 neurons;
S510: the back-propagation proceeds from layer L down to layer 1, updating the parameters of all neurons in the different NN networks of all parties; steps S53-S59 are iterated until the model converges or the specified number of iterations is reached, and model training ends;
S511: based on the trained longitudinal federal ECA-DeepGBM model, after sample alignment client B obtains its own-side prediction output $\hat{y}_B$ from the B-side features, and client C obtains its own-side prediction output $\hat{y}_C$ from the C-side features; each then transmits its output to the cooperative party P; after party P adds the prediction outputs of client B and client C and applies a sigmoid activation, it obtains the business information recommendation result for multi-party high-dimensional data based on longitudinal federal learning, which is output and returned to client B and client C.
2. A business information recommendation device based on multiparty high-dimensional data longitudinal federal learning, characterized by comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to implement the business information recommendation method based on multiparty high-dimensional data longitudinal federal learning according to claim 1 when executing the computer program.
CN202210368272.1A 2022-04-01 Business information recommendation method and device based on multiparty high-dimension data longitudinal federation learning Active CN114677200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210368272.1A CN114677200B (en) 2022-04-01 Business information recommendation method and device based on multiparty high-dimension data longitudinal federation learning


Publications (2)

Publication Number Publication Date
CN114677200A CN114677200A (en) 2022-06-28
CN114677200B true CN114677200B (en) 2024-06-21




Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
TA01: Transfer of patent application right. Effective date of registration: 20240517. Applicant after: CHONGQING University OF POSTS AND TELECOMMUNICATIONS, 400065 Chongqing Nan'an District huangjuezhen pass Chongwen Road No. 2, China. Applicants before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS; Song Yang; Xiong Wei; Chen Xue; Yang Shili (same address, China).
GR01: Patent grant