CN113240518A - Bank-to-public customer loss prediction method based on machine learning - Google Patents

Bank-to-public customer loss prediction method based on machine learning

Info

Publication number
CN113240518A
Authority
CN
China
Prior art keywords
data
random forest
model
value
bank
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110782247.3A
Other languages
Chinese (zh)
Inventor
阮惠华
张成刚
黄浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Smart Software Co ltd
Original Assignee
Guangzhou Smart Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Smart Software Co ltd filed Critical Guangzhou Smart Software Co ltd
Priority to CN202110782247.3A
Publication of CN113240518A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/02 Banking, e.g. interest calculation or account maintenance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Finance (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Accounting & Taxation (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a bank-to-public customer loss prediction method based on machine learning, which comprises the following steps: collecting raw data on the bank's public-customer behavior within a set period, and constructing a PostgreSQL source database; reading the report data of several reports from the PostgreSQL source database; integrating the report data into a single table and carrying out full-table statistics on all features in the report data; coding the basic attribute data obtained from the statistics, and editing redundant and missing feature values to obtain a corrected data set; establishing a random forest model and substituting the classified data into it for training; calculating the importance of the features in the random forest model and selecting features according to the calculated importance; obtaining a model prediction result from the selected features and outputting a visualization result. Through the model, bank customers are classified accurately, enterprise marketing resources are optimized, and the customer loss rate is kept under control, thereby maximizing enterprise profit.

Description

Bank-to-public customer loss prediction method based on machine learning
Technical Field
The invention relates to the field of bank management models, in particular to a bank-to-public customer loss prediction method based on machine learning.
Background
In the big-data era, global economic integration and financial marketization have driven a profound change in the operation and management mode of domestic commercial banks. Every commercial bank now takes a customer-centered operating philosophy as an important basis for improving its profitability and core competitiveness, and pays close attention to customer relationship management and customer data mining.
Prior studies have found that attracting a new customer costs about five times as much as retaining an existing one; a sale to a churned customer succeeds roughly one time in four, while a sale to a potential or target customer succeeds only about one time in sixteen. The profit a customer brings to a company is largely determined by the customer's life cycle: the longer the life cycle, the more profit the customer brings. As the marketing focus of banking enterprises shifts from a product center to a customer center, customer relationship management becomes a core problem for the enterprise, and extending the life cycle of customers about to be lost becomes a decisive strategy for occupying market share.
However, in every banking business segment, customer relationship management still has shortcomings. First, customer classification is incomplete, so high-value customers cannot be effectively distinguished from low-value ones. Second, personalized service schemes cannot be customized per customer class, so customer loss is serious and hard to reverse. Third, existing marketing resources cannot be accurately matched to high-value customers, which seriously hinders the improvement of enterprise profit. By customer type, bank customers can be divided into bank-to-public customers and bank retail customers, and at present there is no method for systematic, in-depth mining analysis and early warning of bank-to-public customer loss. Therefore, how to establish a model that accurately classifies bank customers, optimizes enterprise marketing resources, keeps the loss rate of the bank's public customers under control and maximizes enterprise profit has become an urgent problem in the field of bank management models.
Disclosure of Invention
The invention aims to overcome the defect in the prior art that, when a bank studies lost public customers in its actual work, the data analysis is incomplete and unintuitive. It provides a machine learning based bank-to-public customer loss prediction method which, by effectively establishing a customer loss prediction model and selecting features in the data by importance, achieves a comprehensive and intuitive bank-to-public customer loss prediction result.
The purpose of the invention is mainly realized by the following technical scheme:
a bank public customer loss prediction method based on machine learning comprises the following steps:
s1: setting a time limit, collecting original data of a bank on public customer behaviors in the set time limit, and constructing a PostgreSQL source database by adopting the original data;
s2: reading report data of each report in the PostgreSQL source database;
s3: integrating the read report data into a whole, extracting all the characteristics in the report data as first characteristics, and carrying out full-table statistics on the first characteristics in the report data;
s4: coding the basic attribute data obtained by statistics, and editing the redundant and missing characteristic values to obtain a corrected data set;
s5: calculating the importance of the features in the random forest model, selecting features according to the calculated importance, and, after completing the feature transformation in the PostgreSQL source database, aggregating and constructing second features;
s6: extracting an actual lost customer data set from the corrected data set, and classifying new data through voting according to the actual lost customer data set;
s7: establishing a random forest model, and substituting the classified data into the random forest model for training;
s8: selecting a random forest model by analyzing the deviation and variance of the random forest model, and selecting optimal parameters by grid search and cross validation;
s9: and obtaining a model prediction result and outputting a visualization result.
At present, in a bank's public-customer loss prediction the customers are not clearly distinguished, so the accuracy of loss judgment is low; with an inaccurate judgment the bank cannot take targeted action in advance, which aggravates customer loss and causes great damage to the bank. The invention preserves the raw data of public customers by collecting daily public-customer behavior data and storing it in a PostgreSQL source database. PostgreSQL is a fully featured, free-software object-relational database management system with convenient index storage. Once the PostgreSQL source database has been constructed, it is used to read the report data, so the behavior data of public customers can be loaded from it. Integrating the report data makes the whole data set clearly observable so that features can be extracted effectively, after which full-table statistics are carried out on the features. Full-table statistics in the invention comprise several statistical analyses of the observed data, and the statistical information includes indexes such as the number of samples, the number of missing values, mean, standard deviation, variance, sum, number of unique values, minimum, maximum, upper quartile, lower quartile, median, mode, kurtosis and skewness. Box plots and histograms are used where possible to express simply and comprehensively the value range, distribution and other information contained in the data.
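As an illustration of steps S2-S3, the following sketch (in Python, with hypothetical table and column names such as monthly_public_deposit_avg and cust_id, since the patent does not fix them) reads the reports from the PostgreSQL source database, integrates them and computes the full-table statistics listed above:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/bank_source")

# Step S2: read the three report tables from the PostgreSQL source database.
deposit = pd.read_sql_table("monthly_public_deposit_avg", engine)
fund = pd.read_sql_table("monthly_public_fund_avg", engine)
wide = pd.read_sql_table("mining_model_wide_table", engine)

# Step S3: integrate the reports into one table keyed on the customer ID
# (the monthly tables are first aggregated to one row per customer).
dep_agg = deposit.groupby("cust_id").mean(numeric_only=True).reset_index()
fund_agg = fund.groupby("cust_id").mean(numeric_only=True).reset_index()
data = (wide.merge(dep_agg, on="cust_id", how="left")
            .merge(fund_agg, on="cust_id", how="left"))

# Full-table statistics: sample count, missing values, mean, std, quartiles,
# unique values, skewness and kurtosis for every feature.
stats = data.describe(include="all").T
stats["n_missing"] = data.isna().sum()
stats["n_unique"] = data.nunique()
numeric = data.select_dtypes("number")
stats.loc[numeric.columns, "skew"] = numeric.skew()
stats.loc[numeric.columns, "kurtosis"] = numeric.kurtosis()
print(stats)

# Histograms (and box plots) summarise the range and distribution.
import matplotlib.pyplot as plt
numeric.hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.savefig("histograms.png")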
Basic attribute data is obtained after the full-table statistics, which in effect count the concrete values taken by each feature. The basic attribute data is coded for convenient use and easy reference during processing; the codings that can be adopted in the invention include feature binarization/discretization, one-hot coding and the like. The coded basic attribute data is then edited for redundancy and missing values: parts to be deleted are deleted and parts to be filled are filled, thereby correcting the basic attribute data. In classifying public customers, the data set of actually lost customers best reflects the condition of lost customers, so classifying new data according to the actual lost-customer data set makes lost customers easier to identify. Training the random forest model effectively improves its prediction accuracy, and accuracy improves further once the optimal parameters are selected. Features are selected by importance computed from the random forest model and used to identify customers, so that whether a customer fits the loss prediction model can be determined effectively and the parameters relevant to public-customer loss are obtained clearly; visualizing these parameters yields a clear and intuitive prediction result.
The actual lost-customer data sets in the invention already exist in the bank database (the PostgreSQL source database) and are provided directly by the bank; in the first steps these data sets are preprocessed and their missing values filled, providing data sets with correct format and content for model training. The invention completes the feature-engineering content first, i.e. feature-importance calculation and feature selection, and performs model training afterwards: the random forest model is established after the features have been selected by random forest importance, which facilitates the subsequent tuning of the model during training.
In the invention, for the variance and deviation of the random forest, models trained on different training sets of the same sample size are used for prediction, and the expected prediction of the learning algorithm is obtained by averaging the predicted values. The deviation (bias) is the difference between the expected prediction of the learning algorithm and the true label, while the variance is the expected squared deviation of the individual predictions from that expected prediction.
Through this processing of the raw data and the data analysis, the invention effectively establishes the customer loss prediction model and achieves a comprehensive and intuitive customer loss prediction result from the importance features on the basis of the selected optimal parameters.
Further, the data reports in step S2 include the "data-mining customers' monthly annual daily average of public deposits" table and the "data-mining customers' monthly annual daily average of public financial funds" table, and in step S3 they are finally integrated into the "mining integrated model wide table" containing the feature records of the mined customers. The monthly public-deposits table effectively reflects information about the customer's deposits, while the monthly public financial-funds table effectively reflects information about the customer's investments; analyzing customer information from both the deposit side and the investment side makes the obtained result more accurate.
Further, the step S4 includes the following steps:
s4.1: editing the metadata according to the character string type by taking the basic attribute data obtained by statistics as the metadata;
s4.2: performing label coding on the metadata of different categories by adopting unique hot coding, and performing binarization processing on the categories;
s4.3: finding and correcting recognizable errors in the metadata yields modeling data.
In the invention the metadata comprises account information, personal information, deposit information, consumption and transaction information, and other data; editing this information by string type makes it convenient to search and call. When one-hot coding is used for editing, an N-bit state register encodes N states, each state has its own independent register bit, and only one bit is valid at any time; this solves the discrete-value problem of categorical data. Binarizing the categories solves the problem that a classifier cannot handle attribute data well and makes distance calculations between features more reasonable, and more effective modeling data is obtained after the recognizable errors in the metadata have been corrected.
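A minimal sketch of the coding in step S4.2, with illustrative column names: one-hot coding is applied to an unordered category, label coding to an ordered one, and a two-state category is binarized:

import pandas as pd

df = pd.DataFrame({
    "coop_type": ["loan", "deposit", "loan", "mixed"],   # unordered category
    "credit_level": ["B", "A", "C", "A"],                # ordered category
    "is_active": ["Y", "N", "Y", "Y"],                   # binary category
})

# One-hot coding: one register bit per state, only one bit set at a time.
df = pd.get_dummies(df, columns=["coop_type"], prefix="coop")

# Label coding for the ordered variable keeps its natural order.
order = {"A": 0, "B": 1, "C": 2}
df["credit_level"] = df["credit_level"].map(order)

# Binarization of a two-state category.
df["is_active"] = (df["is_active"] == "Y").astype(int)
print(df)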
Further, the step S4.3 includes the steps of:
s4.3.1: acquiring two expression forms of the same first characteristic in the metadata, and deleting one expression form;
s4.3.2: filling missing values in the metadata;
s4.3.3: and carrying out univariate abnormal value detection on the filled data, and removing the univariate abnormal value to obtain modeling data.
In the process of finding and correcting recognizable errors in the data file and obtaining the data required for modeling, the elimination of redundant data targets different expression forms of the same feature, since counting them twice would reduce accuracy; one of the two is therefore eliminated to preserve the accuracy of the data. When filling missing values in the metadata, two filling methods are mainly used. One is mean filling: the data is grouped by the variable most correlated with the missing-value variable, the mean of each group is computed and filled into the missing positions; this can change the distribution of the data to some extent. The other is regression filling: the missing-value variable is taken as the target variable y, its existing data is used as the training set, a regression equation is established with a variable x highly correlated with it, the x values at the missing positions of y are used as the prediction set, the missing values are predicted, and the predictions replace the missing values; here the target variable y and the highly correlated variable x are used only for missing-value filling. After filling is completed, univariate outlier detection is carried out on the data to eliminate abnormal data, making the final modeling data more reliable and accurate.
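The two filling methods can be sketched as follows, with hypothetical column names (customer_level, deposit_avg, fund_avg) and the assumption that the predictor column is fully observed:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("modeling_data.csv")  # hypothetical input file

# Mean filling: group by the variable most correlated with the missing
# column (assumed here to be customer_level) and fill each gap with the
# group mean.
df["deposit_avg"] = df.groupby("customer_level")["deposit_avg"].transform(
    lambda s: s.fillna(s.mean()))

# Regression filling: treat the missing column as target y, use a highly
# correlated column x as predictor, fit on the observed rows and predict
# the missing ones.
target, predictor = "fund_avg", "deposit_avg"
known = df[df[target].notna()]
unknown = df[df[target].isna()]
if len(unknown):
    reg = LinearRegression().fit(known[[predictor]], known[target])
    df.loc[unknown.index, target] = reg.predict(unknown[[predictor]])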
Further, the univariate outlier detection process in step S4.3.3 includes the following steps:
a1: arranging the variables in ascending order as $x_1, x_2, \ldots, x_n$;
a2: calculating the mean $\bar{x}$ and standard deviation $S$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
calculating the deviation of each value from the mean and determining the suspect value, where i is the index of the suspect value;
a3: calculating the statistic $g_i$, i.e. the ratio of the residual to the standard deviation:
$$g_i = \frac{\left| x_i - \bar{x} \right|}{S}$$
Compare $g_i$ with the critical value $G_P(n)$ given by the Grubbs table: if the calculated $g_i$ is greater than the critical value $G_P(n)$ in the table, the measured value can be judged abnormal and eliminated. The critical value $G_P(n)$ depends on two parameters: the significance level α and the number of measurements n.
The univariate outlier detection method used in the invention is the Grubbs method. In a set of measured data, if an individual value deviates far from the mean, that value is called a "suspect value". If a statistical method such as the Grubbs test determines that a "suspect value" can be removed from the set of measurement data without taking part in the calculation of the mean, the "suspect value" is called an "outlier" (gross error).
When determining the significance level α, if the requirement is strict, α can be set smaller, for example α = 0.01, giving a confidence probability P = 1 - α = 0.99; if the requirement is less strict, α can be set larger, for example α = 0.10, i.e. P = 0.90; usually α is set to 0.05 and P to 0.95.
Looking up the critical value in the Grubbs table: based on the chosen P value (here 0.95) and the number of measurements n (here 10), the intersection of the corresponding row and column gives the critical value G95(10) = 2.176.
Comparing the calculated value $g_i$ with the critical value G95(10): $g_i$ = 2.260 and G95(10) = 2.176, so $g_i$ > G95(10).
Judging the abnormal value: since $g_i$ > G95(10), the measured value 14.0 is judged abnormal and is removed from the 10 measured data.
The remaining data are then examined: the remaining 9 values are processed by the same steps; if $g_i$ > G95(9), the value is still an outlier and is removed; if $g_i$ < G95(9), it is not an outlier and no elimination is done. In this example there are no further abnormal values among the remaining 9 data.
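A sketch of the Grubbs test of steps A1-A3; here the critical value $G_P(n)$ is computed from the t distribution instead of being read from a printed table, and the sample data are illustrative values chosen so that 14.0 is the suspect value:

import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Return (is_outlier, index) for the most extreme value."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean, s = x.mean(), x.std(ddof=1)          # A2: mean and std deviation
    g = np.abs(x - mean) / s                   # A3: residual / std deviation
    i = int(np.argmax(g))                      # most suspect value
    # Critical value G_P(n) for significance level alpha (Grubbs formula).
    t2 = stats.t.ppf(1 - alpha / (2 * n), n - 2) ** 2
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t2 / (n - 2 + t2))
    return g[i] > g_crit, i

data = [13.2, 13.3, 13.4, 13.2, 13.5, 13.1, 13.3, 13.2, 13.4, 14.0]
outlier, idx = grubbs_test(data)
print(outlier, data[idx])   # flags 14.0 as the abnormal value

For n = 10 and α = 0.05 the computed critical value is 2.176, matching the tabulated G95(10) used in the example above.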
Further, the feature importance calculation in the step S5 includes the following steps:
s5.1: for each decision tree in the random forest, its out-of-bag data error, denoted $errOOB_1$, is calculated using the corresponding OOB, i.e. the out-of-bag data;
s5.2: noise interference is randomly added to feature X in all samples of the out-of-bag data OOB, and the out-of-bag data error is calculated again, denoted $errOOB_2$;
s5.3: assuming the random forest contains $N_{tree}$ trees, the importance of feature X is
$$\mathrm{importance}(X) = \frac{1}{N_{tree}} \sum_{t=1}^{N_{tree}} \left( errOOB_2^{(t)} - errOOB_1^{(t)} \right)$$
The feature selection in the step S5 includes the steps of:
s5.4: finding the feature variables highly correlated with the dependent variable through the feature-importance calculation; a smaller number of feature variables then suffices to predict the outcome of the dependent variable.
The common noise types used for the noise interference in the invention include salt-and-pepper noise and Gaussian noise. Salt-and-pepper noise randomly inserts maximum and minimum values; Gaussian noise is noise whose probability density function follows a Gaussian distribution. In the invention the degree of correlation between an independent variable and the dependent variable is judged by their correlation coefficient, and features with higher correlation have a larger influence on the model; independent variables with relatively high correlation coefficients are selected and substituted into the model for training, these independent variables being the features.
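A minimal sketch of s5.1-s5.3, with manual bootstrap sampling so that each tree's out-of-bag rows are known; random permutation of the feature column stands in here for the noise injection described above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_feature_importance(X, y, feature, n_trees=100, seed=0):
    """importance(X) = sum(errOOB2 - errOOB1) / N_tree over all trees."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    diffs = []
    for _ in range(n_trees):
        boot = rng.integers(0, n, n)                 # bootstrap sample
        oob = np.setdiff1d(np.arange(n), boot)       # rows left out of the bag
        if oob.size == 0:
            continue
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(1 << 31))).fit(X[boot], y[boot])
        err1 = np.mean(tree.predict(X[oob]) != y[oob])       # errOOB1
        X_pert = X[oob].copy()
        rng.shuffle(X_pert[:, feature])              # perturb the feature
        err2 = np.mean(tree.predict(X_pert) != y[oob])       # errOOB2
        diffs.append(err2 - err1)
    return float(np.mean(diffs))

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
print(oob_feature_importance(X, y, feature=0))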
Further, the selection of features by importance in step S5 comprises:
p1: preliminary estimation and ranking:
a) sorting the characteristic variables in the random forest in a descending order according to the importance of the variables;
b) determining a deletion ratio, and removing unimportant indexes of the corresponding ratio from the current characteristic variables to obtain a new characteristic set;
c) establishing a new random forest by using the new feature set, calculating the variable importance of each feature in the feature set, and sequencing;
d) repeating the steps until m characteristics are left;
p2: calculating, for each feature set obtained in P1, the out-of-bag error rate of the random forest built on it, and taking the feature set with the lowest out-of-bag error rate as the final selected feature set (a sketch of this loop follows).
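A sketch of the P1/P2 loop using scikit-learn's random forest, whose oob_score_ supplies the out-of-bag error rate; the drop ratio and minimum feature count are illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_features(X, y, drop_ratio=0.2, min_features=5):
    features = list(range(X.shape[1]))
    best_set, best_err = None, np.inf
    while len(features) >= min_features:
        rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                    random_state=0).fit(X[:, features], y)
        oob_err = 1.0 - rf.oob_score_            # out-of-bag error rate (P2)
        if oob_err < best_err:
            best_set, best_err = features[:], oob_err
        order = np.argsort(rf.feature_importances_)[::-1]   # descending (a)
        keep = max(min_features, int(len(features) * (1 - drop_ratio)))  # (b)
        if keep == len(features):                # nothing left to drop
            break
        features = [features[i] for i in order[:keep]]      # (c), (d)
    return best_set, best_err

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=0)
features, err = select_features(X, y)
print(len(features), err)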
Further, the establishing of the random forest model in the step S7 includes the following steps:
s7.1: taking the data obtained in step S6 as the original training set N, randomly drawing from it, with replacement, k new bootstrap sample sets and building k classification trees;
s7.2: with M attributes in the original training set N, randomly drawing $m_{try}$ attributes at each node of each classification tree and selecting, from these $m_{try}$ attributes, the variable with the most classification ability, the threshold of the classification variable being determined by checking every split point;
s7.3: in order to avoid the problem of model overfitting caused by the unlimited growth of each classification tree, a loss function is adopted to judge whether pruning operation is carried out:
$$C_\alpha(T) = \sum_{t=1}^{\left|T_{leaf}\right|} N_t H(t) + \alpha \left|T_{leaf}\right|$$
where $C_\alpha$ is the loss function, T is the decision tree, $\left|T_{leaf}\right|$ is the number of leaf nodes, t is a node, $N_t$ is the number of samples at node t, H is the impurity measure of node t, and α is set manually: the larger its value, the heavier the weight of the number of leaf nodes; the smaller its value, the weaker the influence of the number of leaf nodes;
s7.4: forming a random forest by the generated multiple decision trees, and voting according to a classifier of the multiple trees to determine a final classification result:
$$S(x) = \arg\max_{Z} \sum_{i'} I\left( S_{i'}(x) = Z \right)$$
where $S(x)$ denotes the random forest model, $S_{i'}$ denotes a single decision tree, $I(\cdot)$ is the indicator function, and Z denotes an output class;
s7.5: the model's votes are scored by the following scoring formula:
[equation image: voting score formula]
where $n_{tree}$ is the specified number of decision trees in the random forest, $C_p$ denotes the voting result for predicted class C, $I(\cdot)$ is the indicator function, $n_{n_{i'}}$ is the number of leaf nodes of tree $n_{i'}$, and $n_{n_{i'},c}$ is the classification result of tree $n_{i'}$ for predicted class C. After voting, a confusion table $C_M$ is generated; $C_M$ is an $n_c \times n_c$ table in which the element cm(i', j) is the number of times type i' was classified as j, cm(i', j) counts correct classifications only when i' = j, and $n_c$ is the total number of categories.
In the invention the samples not drawn form the out-of-bag data, which is stored in the database; during training the model is fitted on samples drawn with replacement, and the out-of-bag data is not used. When an original data set of m samples is sampled with replacement m times, each sample has probability 1/m of being drawn at each draw, so the probability of never being drawn is $\left(1 - \frac{1}{m}\right)^m$, which approaches $\frac{1}{e} \approx 36.8\%$ as m grows;
$m_{try}$ is a node parameter that determines the number of variables sampled at each iteration, i.e. the number of candidate variables at each split of the tree; it is generally set to an empirical value, the square root of the number of variables in the data set;
In the invention, m training sets are first generated by the bagging algorithm, and a decision tree is then constructed for each training set. When a node looks for a feature on which to split, it does not search all features for the one that maximizes the index (such as information gain); instead, a subset of the features is drawn at random, and the optimal solution found among these drawn features is applied to the node for splitting. The decision tree compares the different options of a decision by a probabilistic method to obtain the optimal scheme; the variable with the most classification ability in step S7.2 is the variable with the highest information gain. The method is called a decision tree because the drawn decision branches resemble the branches of a tree. It can be constructed by the CART algorithm, using the Gini coefficient as the basis for partitioning attributes. The specific algorithm is as follows:
Let D be a set of |D| data samples whose class attribute has m different values corresponding to m different classes $C_i$, $i = 1, \ldots, m$, and let $|C_i|$ be the number of samples in class $C_i$. The expected information required to classify a tuple in D is:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where $p_i = \frac{|C_i|}{|D|}$ is the probability that a data object belongs to class $C_i$.
Assuming D is divided by attribute A (taking the values {a1, a2 … av}) into v different subsets {D1, D2 … Dv}, the information entropy of dividing the current sample set by attribute A is:
$$Info_A(D) = \sum_{j=1}^{v} \frac{\left|D_j\right|}{|D|} \, Info\!\left(D_j\right)$$
The smaller the value of $Info_A(D)$, the better the result of the subset division by attribute A.
Thus, the information gain obtained by using the attribute a to perform corresponding subset division on the current branch node is:
Figure 494932DEST_PATH_IMAGE016
The Gini coefficient: the smaller the Gini coefficient, the smaller the uncertainty, i.e. the lower the probability that a randomly drawn sample from the set is misclassified and the higher the purity of the set; conversely, the larger it is, the less pure the set. When all samples in the set belong to one class, the Gini coefficient is 0. The formula is:
$$Gini(D) = 1 - \sum_{i=1}^{m} p_i^2$$
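A worked example of the Info(D), Gain(A) and Gini(D) formulas above on an illustrative churn label (the counts are made up):

import numpy as np

def info(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

# D: 10 customers, 6 retained / 4 churned.
print(info([6, 4]))            # Info(D)  = 0.971
print(gini([6, 4]))            # Gini(D)  = 1 - (0.6^2 + 0.4^2) = 0.48

# Attribute A splits D into D1 (5 samples: 4/1) and D2 (5 samples: 2/3).
info_A = 5/10 * info([4, 1]) + 5/10 * info([2, 3])
print(info([6, 4]) - info_A)   # Gain(A) = Info(D) - Info_A(D)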
the basic idea of the model voting scoring process is as follows:
Given a weak learning algorithm and a training set, and given that a single weak learner is not very accurate, the learning algorithm is applied several times to obtain a sequence of prediction functions which then vote, thereby improving the accuracy of the final result.
The algorithm: for t = 1, 2, …, T do:
sample from data set S with replacement and train on the sample to obtain model $H_t$;
when an unknown sample X is to be classified, each model $H_t$ yields one classification; the class with the most votes is the classification of the unknown sample X (for continuous values the average of the outputs can be used as the prediction). The resulting voting formula is:
$$H(x) = \arg\max_{y} \sum_{t=1}^{T} I\left( H_t(x) = y \right)$$
Further, in step S8 the random forest trains an optimal model on each resampled data set, giving K models in total:
$$S_1\!\left(X_{1''}\right),\ S_2\!\left(X_{2''}\right),\ \ldots,\ S_K\!\left(X_{K''}\right)$$
where $X_{i''}$ is an N-dimensional variable of a sub data set drawn at random with replacement, $i'' = 1, \ldots, K$;
The variance is analyzed by the limit method as follows:
if the models are completely independent, then
$$Var\!\left( \frac{1}{K} \sum_{i''=1}^{K} S_{i''}\!\left(X_{i''}\right) \right) = \frac{\sigma^2}{K};$$
if the models are identical, then
$$Var\!\left( \frac{1}{K} \sum_{i''=1}^{K} S_{i''}\!\left(X_{i''}\right) \right) = \sigma^2.$$
The variance is analyzed using the formula method: assume the variance of the sub-data-set variables is $\sigma^2$ and the correlation between every two variables is ρ; then
$$Var\!\left( \frac{1}{K} \sum_{i''=1}^{K} S_{i''}\!\left(X_{i''}\right) \right) = \rho\,\sigma^2 + \frac{(1-\rho)\,\sigma^2}{K}.$$
Comparison of these expressions shows that the variance of the random forest model is reduced.
Through the processing of step S8, the variance of the random forest model during training is kept within bounds and the model's overfitting problem is avoided; the complexity of the model is reduced and the overfitting problem is solved. The specifics are as follows:
Because the random forest is a model framework based on the bagging idea, it uses the optimal model trained on each group of resampled data, K models in total; let $X_i$ be an N-dimensional variable of the sub data sets sampled at random with replacement, i = 1, …, K.
Because the sub data sets drawn with replacement are similar and the same type of model is used, the models have approximately equal deviation and variance, and the model outputs have approximately the same distribution but are not independent (the sub data sets share repeated samples).
Thus:
Figure 407076DEST_PATH_IMAGE024
From this formula, the deviation of the bagging model is close to that of each sub-model, so the bagging method cannot significantly reduce the deviation. The variance of the bagging model is analyzed with the limit method: because the sub data sets of the bagging algorithm are neither mutually independent nor identical, there is a certain similarity between them. Let $X_t$ be an N-dimensional variable of a sub data set drawn at random with replacement, t = 1, …, K. The variance of the bagging model therefore lies between the two limits:
if the models are completely independent, then
$$Var\!\left( \frac{1}{K} \sum_{t=1}^{K} S_t\!\left(X_t\right) \right) = \frac{\sigma^2}{K};$$
if the models are identical, then
$$Var\!\left( \frac{1}{K} \sum_{t=1}^{K} S_t\!\left(X_t\right) \right) = \sigma^2.$$
Analyzing the variance of the bagging model by the formula method, assume the variance of the sub-data-set variables is $\sigma^2$ and the correlation between every two variables is ρ; the variance of the bagged model is then:
$$Var\!\left( \frac{1}{K} \sum_{t=1}^{K} X_t \right) = \frac{1}{K^2}\left( \sum_{t=1}^{K} Var\!\left(X_t\right) + \sum_{t \neq s} Cov\!\left(X_t, X_s\right) \right) = \frac{\sigma^2}{K} + \frac{K-1}{K}\,\rho\,\sigma^2 = \rho\,\sigma^2 + \frac{(1-\rho)\,\sigma^2}{K}.$$
The last formula shows that the variance of the bagging algorithm is reduced. Through the processing of step S8, the variance of the random forest model during training is controlled within bounds and model overfitting is avoided; the main effects of the random forest are to reduce model complexity and to solve the model overfitting problem.
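A numeric check of the closing formula: simulating K equally correlated Gaussian model outputs reproduces $Var = \rho\sigma^2 + (1-\rho)\sigma^2/K$ for the bagged mean (the parameter values are illustrative):

import numpy as np

K, sigma, rho, n_draws = 50, 1.0, 0.3, 200_000
rng = np.random.default_rng(0)

# Build K correlated variables: a shared component plus an independent one,
# giving pairwise correlation rho and variance sigma^2.
shared = rng.normal(size=(n_draws, 1)) * np.sqrt(rho) * sigma
indep = rng.normal(size=(n_draws, K)) * np.sqrt(1 - rho) * sigma
models = shared + indep

empirical = models.mean(axis=1).var()
theoretical = rho * sigma**2 + (1 - rho) * sigma**2 / K
print(empirical, theoretical)    # both approximately 0.314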
In conclusion, compared with the prior art, the invention has the following beneficial effects:
(1) Through data analysis the invention effectively establishes a customer loss prediction model; the model classifies bank customers accurately, optimizes enterprise marketing resources, and keeps the customer loss rate under control, thereby maximizing enterprise profit.
(2) The method predicts missing values and replaces them with the predicted results; after filling is completed, univariate outlier detection is carried out on the data to eliminate abnormal data, making the final modeling data more reliable and accurate.
(3) The invention trains and predicts on the samples with the random forest model, which effectively speeds up the prediction process and produces results with high accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of the random forest training of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1:
as shown in fig. 1, the embodiment relates to a bank-to-public customer churn prediction method based on machine learning, which includes the following steps:
s1: collecting data: setting a time limit, collecting original data of a bank on public customer behaviors in the set time limit, and constructing a PostgreSQL source database by adopting the original data;
s2: reading data: reading report data of each report in the PostgreSQL source database;
s3: data exploration: integrating the read report data into a whole, extracting all the characteristics in the report data as first characteristics, and carrying out full-table statistics on the first characteristics in the report data;
s4: data preprocessing: coding the basic attribute data obtained by statistics, and editing the redundant and missing characteristic values to obtain a corrected data set;
s5: characteristic engineering: calculating the importance of the features in the random forest model, selecting features according to the calculated importance, and, after completing the feature transformation in the PostgreSQL source database, aggregating and constructing second features;
s6: and (3) data evaluation: extracting an actual lost customer data set from the corrected data set, and classifying new data through voting according to the actual lost customer data set;
s7: model training: establishing a random forest model, and substituting the classified data into the random forest model for training;
s8: model optimization: selecting the random forest model by analyzing the deviation and variance of the random forest model, and selecting optimal parameters by grid search and cross validation (a training and tuning sketch follows this list);
s9: model prediction and result output: and obtaining a model prediction result and outputting a visualization result.
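The following end-to-end sketch of steps S7-S8 trains a random forest and selects parameters by grid search with cross validation; since the bank's data cannot be reproduced here, a synthetic imbalanced data set stands in for the feature matrix and churn label produced by steps S1-S6:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in for the prepared feature matrix and churn label.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9],
                           random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

param_grid = {
    "n_estimators": [100, 300],          # number of trees k
    "max_features": ["sqrt", 0.3],       # m_try candidates per split
    "max_depth": [None, 10, 20],         # caps tree growth (pruning proxy)
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)

proba = search.best_estimator_.predict_proba(X_test)[:, 1]
print("held-out AUC:", roc_auc_score(y_test, proba))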
In this embodiment, in order to screen the project's main customer targets, the effectively lost customers, out of all public customers of the bank, thereby providing basic data for the relevant departments' customer win-back solutions, the customers are divided into the following four categories: first, high-value customers, who generate higher value for the bank and are usually the more active customers; second, effectively lost customers, the bank's key focus, since reducing their loss rate has a positive effect on banking business; third, won-back customers, also a focus of the bank's attention, since screening out the won-back customer group can effectively save marketing resources; fourth, low-value customers, who have low value to the bank and on whom not too many marketing resources should be spent.
On this basis, the customer information consists of dynamic features of public customers mined from the bank's huge management database, from which the lost-customer information best suited for training or prediction is screened out.
The time limit in step S1 is set as follows: a training/prediction time point is determined; the six months before the training/prediction time point serve as the observation period, over which the customers' dynamic features are aggregated; the six months after the training/prediction time point serve as the presentation period, which shows whether the customer is lost or won back during that period; as the time window moves, the observation period moves together with the presentation period.
The data reports in step S2 include the "data-mining customers' monthly annual daily average of public deposits" table and the "data-mining customers' monthly annual daily average of public financial funds" table, and in step S3 they are finally integrated into the "mining integrated model wide table" containing the feature records of the mined customers. For customer data mining, the following process is carried out:
first, reading data:
(1) three wide tables are read from the PostgreSQL source database: the "data-mining customers' monthly annual daily average of public deposits" table, the "data-mining customers' monthly annual daily average of public financial funds" table, and the "mining integrated model wide table";
(2) the public-deposits table has 2067749 records with 18 features;
(3) the public financial-funds table has 2067749 records with 18 features;
(4) the "mining integrated model wide table" has 2945051 records with 70 features;
(5) the total data volume is about 12 GB;
(6) the model labels are finally determined from the public-deposits table and the public financial-funds table;
(7) the "mining integrated model wide table" provides 6 months of mined customer feature records as the training data;
In the second step, data exploration is carried out: the three wide tables are finally integrated into one large table with 73 features in total, and full-table statistics on all the features reveal the following main data-quality problems:
(1) the customer ID feature is of string type, and the test IDs consist entirely of letters;
(2) several features, such as customer level, customer credit level and customer cooperation type, are categorical variables;
(3) there is extreme similarity between some features;
(4) several features, such as the number of financial-product purchases, the financial period and the customer credit rating, have too many missing values;
thirdly, preprocessing the mined data: the step S4 mainly includes the following steps:
s4.1: editing the metadata according to the character string type by taking the basic attribute data obtained by statistics as the metadata;
s4.2: performing label coding on the metadata of different categories by adopting unique hot coding, and performing binarization processing on the categories;
s4.3: finding and correcting recognizable errors in the metadata yields modeling data.
In said step S4.3 the following steps are included:
s4.3.1: acquiring two expression forms of the same first characteristic in the metadata, and deleting one expression form;
s4.3.2: filling missing values in the metadata;
s4.3.3: and carrying out univariate abnormal value detection on the filled data, and removing the univariate abnormal value to obtain modeling data.
(1) Metadata editing: the customer ID is a 10-digit number; during data-format conversion it must be converted to a long type, since an int type cannot represent all 10 digits; the letter-type IDs are deleted;
(2) label coding / one-hot: unordered categorical variables get label codes, and ordered categorical variables get one-hot codes;
(3) feature deletion: where a feature has two expression forms, one of the two is deleted;
(4) filling (missing values) according to the actual situation: if the proportion of null values is too large, deleting the feature can be considered; if the null value has a special meaning, a special number such as -1 or 999 can be filled in; the mean, mode or median can also be filled in, depending on the data.
In the fourth step, the data feature engineering is determined: most features are aggregated and constructed in the PostgreSQL database to reduce the load on the server, and the feature engineering is supplemented when the model is tuned in the case experiment.
The univariate outlier detection process in said step S4.3.3 comprises the steps of:
a1: arranging the variables in ascending order as $x_1, x_2, \ldots, x_n$;
a2: calculating the mean $\bar{x}$ and standard deviation $S$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
calculating the deviation of each value from the mean and determining the suspect value, where i is the index of the suspect value;
a3: calculating the statistic $g_i$, i.e. the ratio of the residual to the standard deviation:
$$g_i = \frac{\left| x_i - \bar{x} \right|}{S}$$
comparing $g_i$ with the critical value $G_P(n)$ given by the Grubbs table: if the calculated $g_i$ is greater than the critical value $G_P(n)$ in the table, the measured value is judged abnormal and eliminated.
For the labels, this embodiment determines the lost-customer information according to the actual situation, taking the six months from 2019.06.30 as the recording period, and acquires the data by the following steps (a sketch of this windowed labeling follows the table below):
step 1: selecting an observation point and, taking it as the cut-off time, counting the longest number of consecutive months within the observation period (e.g. the last 6 months) in which the customer's annual daily average stays below 10,000, and grading the customer by this worst state into levels such as 0, 1, 2, 3, 4, 5, 6;
step 2: taking the observation point as the starting time, counting the longest number of consecutive months within the presentation period (e.g. the next 6 months) in which the customer's annual daily average stays below 10,000, and grading the user likewise into levels such as 0, 1, 2, 3, 4, 5, 6;
step 3: cross-tabulating the number of customers in each grid cell;
step 4: counting the customer proportion in each grid cell;
step 5: to eliminate the random influence of observation-point selection, several observation points are generally chosen and steps 1 to 4 are repeated;
the results obtained are shown in the following table:
TABLE 1 customer loss ratio Table
[table image not reproduced]
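A hedged sketch of the step 1-step 4 labeling for one observation point; the input file and column names (cust_id, month, ytd_daily_avg) are hypothetical stand-ins for the bank's monthly deposit table:

import pandas as pd

df = pd.read_csv("monthly_deposit_avg.csv",
                 parse_dates=["month"])       # cust_id, month, ytd_daily_avg

def worst_streak(s):
    """Longest run of consecutive months below 10,000, capped at grade 6."""
    below = (s < 10_000).astype(int)
    streak = below.groupby((below == 0).cumsum()).cumsum()
    return int(min(streak.max(), 6))

point = pd.Timestamp("2019-06-30")            # observation point
obs = df[(df.month > point - pd.DateOffset(months=6)) & (df.month <= point)]
show = df[(df.month > point) & (df.month <= point + pd.DateOffset(months=6))]

obs_grade = (obs.sort_values("month")
                .groupby("cust_id")["ytd_daily_avg"].apply(worst_streak))
show_grade = (show.sort_values("month")
                  .groupby("cust_id")["ytd_daily_avg"].apply(worst_streak))

# Cross statistics: customer proportion per (observation, presentation) cell.
cross = pd.crosstab(obs_grade, show_grade, normalize="all")
print(cross)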
Example 2:
as shown in fig. 1-2, in this embodiment, based on embodiment 1, the establishing of the random forest model in step S7 includes the following steps:
s7.1: taking the data obtained in step S6 as the original training set N, randomly drawing from it, with replacement, k new bootstrap sample sets and building k classification trees;
s7.2: with M attributes in the original training set N, randomly drawing $m_{try}$ attributes at each node of each classification tree and selecting, from these $m_{try}$ attributes, the variable with the most classification ability, the threshold of the classification variable being determined by checking every split point;
s7.3: in order to avoid the problem of model overfitting caused by the unlimited growth of each classification tree, a loss function is adopted to judge whether pruning operation is carried out:
$$C_\alpha(T) = \sum_{t=1}^{\left|T_{leaf}\right|} N_t H(t) + \alpha \left|T_{leaf}\right|$$
where $C_\alpha$ is the loss function, T is the decision tree, $\left|T_{leaf}\right|$ is the number of leaf nodes, t is a node, $N_t$ is the number of samples at node t, H is the impurity measure of node t, and α is set manually: the larger its value, the heavier the weight of the number of leaf nodes; the smaller its value, the weaker the influence of the number of leaf nodes;
s7.4: forming a random forest by the generated multiple decision trees, and voting according to a classifier of the multiple trees to determine a final classification result:
$$S(x) = \arg\max_{Z} \sum_{i'} I\left( S_{i'}(x) = Z \right)$$
where $S(x)$ denotes the random forest model, $S_{i'}$ denotes a single decision tree, $I(\cdot)$ is the indicator function, and Z denotes an output class;
s7.5: the model's votes are scored by the following scoring formula:
[equation image: voting score formula]
where $n_{tree}$ is the specified number of decision trees in the random forest, $C_p$ denotes the voting result for predicted class C, $I(\cdot)$ is the indicator function, $n_{n_{i'}}$ is the number of leaf nodes of tree $n_{i'}$, and $n_{n_{i'},c}$ is the classification result of tree $n_{i'}$ for predicted class C. After voting, a confusion table $C_M$ is generated; $C_M$ is an $n_c \times n_c$ table in which the element cm(i', j) is the number of times type i' was classified as j, cm(i', j) counts correct classifications only when i' = j, and $n_c$ is the total number of categories.
The random forest is a classifier comprising multiple decision trees; it classifies new data through the knowledge learned from the data set, and the output class is decided by voting over the classes output by all the trees. This reduces the risk of overfitting, and the model has the advantages of readability and high classification speed.
Example 3:
As shown in figs. 1-2, in this embodiment, based on either of embodiments 1-2, the random forest in step S8 trains an optimal model on each resampled data set, giving K models in total:
$$S_1\!\left(X_{1''}\right),\ S_2\!\left(X_{2''}\right),\ \ldots,\ S_K\!\left(X_{K''}\right)$$
where $X_{i''}$ is an N-dimensional variable of a sub data set drawn at random with replacement, $i'' = 1, \ldots, K$;
The variance is analyzed by the limit method as follows:
if the models are completely independent, then
$$Var\!\left( \frac{1}{K} \sum_{i''=1}^{K} S_{i''}\!\left(X_{i''}\right) \right) = \frac{\sigma^2}{K};$$
if the models are identical, then
$$Var\!\left( \frac{1}{K} \sum_{i''=1}^{K} S_{i''}\!\left(X_{i''}\right) \right) = \sigma^2.$$
The variance is analyzed using the formula method: assume the variance of the sub-data-set variables is $\sigma^2$ and the correlation between every two variables is ρ; then
$$Var\!\left( \frac{1}{K} \sum_{i''=1}^{K} S_{i''}\!\left(X_{i''}\right) \right) = \rho\,\sigma^2 + \frac{(1-\rho)\,\sigma^2}{K}.$$
Comparison of these expressions shows that the variance of the random forest model is reduced.
Example 4:
as shown in fig. 1 to 2, in this embodiment, based on any one of embodiments 1 to 3, the feature importance calculation in the step S5 includes the following steps:
s5.1: for each decision tree in the random forest, its out-of-bag data error, denoted $errOOB_1$, is calculated using the corresponding OOB, i.e. the out-of-bag data;
s5.2: noise interference is randomly added to feature X in all samples of the out-of-bag data OOB, and the out-of-bag data error is calculated again, denoted $errOOB_2$;
s5.3: assuming the random forest contains $N_{tree}$ trees, the importance of feature X is
$$\mathrm{importance}(X) = \frac{1}{N_{tree}} \sum_{t=1}^{N_{tree}} \left( errOOB_2^{(t)} - errOOB_1^{(t)} \right)$$
The feature selection in the step S5 includes the steps of:
s5.4: finding the feature variables highly correlated with the dependent variable through the feature-importance calculation; a smaller number of feature variables then suffices to predict the outcome of the dependent variable.
The selecting of the importance selecting feature in the step S5 includes:
p1: preliminary estimation and ranking:
a) sorting the characteristic variables in the random forest in a descending order according to the importance of the variables;
b) determining a deletion ratio, and removing unimportant indexes of the corresponding ratio from the current characteristic variables to obtain a new characteristic set;
c) establishing a new random forest by using the new feature set, calculating the variable importance of each feature in the feature set, and sequencing;
d) repeating the steps until m characteristics are left;
p2: calculating, for each feature set obtained in P1, the out-of-bag error rate of the random forest built on it, and taking the feature set with the lowest out-of-bag error rate as the final selected feature set.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A bank public customer loss prediction method based on machine learning is characterized by comprising the following steps:
s1: setting a time limit, collecting original data of a bank on public customer behaviors in the set time limit, and constructing a PostgreSQL source database by adopting the original data;
s2: reading report data of each report in the PostgreSQL source database;
s3: integrating the read report data into a whole, extracting all the characteristics in the report data as first characteristics, and carrying out full-table statistics on the first characteristics in the report data;
s4: coding the basic attribute data obtained by statistics, and editing the redundant and missing characteristic values to obtain a corrected data set;
s5: calculating the importance of the features in the random forest model, selecting features according to the calculated importance, and, after completing the feature transformation in the PostgreSQL source database, aggregating and constructing second features;
s6: extracting an actual lost customer data set from the corrected data set, and classifying new data through voting according to the actual lost customer data set;
s7: establishing a random forest model, and substituting the classified data into the random forest model for training;
s8: selecting a random forest model by analyzing the deviation and variance of the random forest model, and selecting optimal parameters by grid search and cross validation;
s9: and obtaining a model prediction result and outputting a visualization result.
2. The machine learning-based bank-to-public customer churn prediction method as claimed in claim 1, wherein the step S4 includes the steps of:
s4.1: editing the metadata according to the character string type by taking the basic attribute data obtained by statistics as the metadata;
s4.2: performing label coding on the metadata of different categories by adopting unique hot coding, and performing binarization processing on the categories;
s4.3: finding and correcting recognizable errors in the metadata yields modeling data.
3. The machine learning-based bank-to-public customer churn prediction method as claimed in claim 2, wherein the step S4.3 comprises the steps of:
s4.3.1: acquiring two expression forms of the same first characteristic in the metadata, and deleting one expression form;
s4.3.2: filling missing values in the metadata;
s4.3.3: and carrying out univariate abnormal value detection on the filled data, and removing the univariate abnormal value to obtain modeling data.
4. The machine learning-based bank-to-public customer churn prediction method as claimed in claim 3 wherein the univariate outlier detection process of step S4.3.3 comprises the steps of:
a1: arranging the variables in ascending order as $x_1, x_2, \ldots, x_n$;
a2: calculating the mean $\bar{x}$ and standard deviation $S$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
calculating the deviation of each value from the mean and determining the suspect value, where i is the index of the suspect value;
a3: calculating the statistic $g_i$, i.e. the ratio of the residual to the standard deviation:
$$g_i = \frac{\left| x_i - \bar{x} \right|}{S}$$
comparing $g_i$ with the critical value $G_P(n)$ given by the Grubbs table: if the calculated $g_i$ is greater than the critical value $G_P(n)$ in the table, the measured value is judged abnormal and eliminated.
5. The machine learning-based bank-to-public customer churn prediction method as claimed in claim 1, wherein the feature importance calculation in step S5 includes the steps of:
s5.1: for each decision tree in the random forest, its out-of-bag data error, denoted $errOOB_1$, is calculated using the corresponding OOB, i.e. the out-of-bag data;
s5.2: noise interference is randomly added to feature X in all samples of the out-of-bag data OOB, and the out-of-bag data error is calculated again, denoted $errOOB_2$;
s5.3: assuming the random forest contains $N_{tree}$ trees, the importance of feature X is
$$\mathrm{importance}(X) = \frac{1}{N_{tree}} \sum_{t=1}^{N_{tree}} \left( errOOB_2^{(t)} - errOOB_1^{(t)} \right)$$
The feature selection in the step S5 includes the steps of:
s5.4: finding the feature variables highly correlated with the dependent variable through the feature-importance calculation; a smaller number of feature variables then suffices to predict the outcome of the dependent variable.
6. The machine-learning-based bank-to-public-customer churn prediction method of claim 1, wherein the step of selecting the importance selection feature in step S5 comprises:
P1: preliminary estimation and ranking:
a) sorting the feature variables in the random forest in descending order of variable importance;
b) determining a deletion ratio and removing that proportion of the least important indexes from the current feature variables to obtain a new feature set;
c) building a new random forest with the new feature set, calculating the variable importance of each feature in the set, and re-ranking;
d) repeating the above steps until m features remain;
P2: computing the out-of-bag error rate of each feature set obtained in P1 with the random forest built on it, and taking the feature set with the lowest out-of-bag error rate as the final selected feature set.
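The backward elimination loop of P1/P2 can be sketched as follows; the deletion ratio, stopping size m, and synthetic data are assumed values chosen for illustration, and scikit-learn's oob_score_ supplies the out-of-bag error of P2.

```python
# Hedged sketch of P1/P2: repeatedly drop the least important fraction of
# features, rebuild the forest, and keep the set with the lowest OOB error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
features = np.arange(X.shape[1])
drop_ratio, m_final = 0.2, 3          # deletion ratio and stopping size (assumed)
best_err, best_set = 1.0, features

while len(features) >= m_final:
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X[:, features], y)
    oob_err = 1.0 - rf.oob_score_      # P2: out-of-bag error of this feature set
    if oob_err < best_err:
        best_err, best_set = oob_err, features.copy()
    order = np.argsort(rf.feature_importances_)[::-1]  # a) descending importance
    keep = max(m_final, int(len(features) * (1 - drop_ratio)))
    if keep == len(features):
        keep -= 1                      # always drop at least one feature
    features = features[order[:keep]]  # b) remove the least important ones

print("selected features:", best_set, "oob error:", round(best_err, 4))
```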
7. The machine learning-based bank-to-public customer churn prediction method as claimed in claim 1, wherein the establishment of the random forest model in step S7 comprises the following steps:
S7.1: taking the data obtained in step S6 as the original training set of size N, randomly drawing k new bootstrap sample sets with replacement, and building k classification trees;
S7.2: given M attribute classifications in the original training set N, randomly drawing m_try attributes at each node of each classification tree and selecting from these m_try the variable with the most classification ability, the threshold of the classification variable being determined by checking each classification point;
S7.3: to avoid model overfitting caused by unlimited growth of each classification tree, a loss function is used to decide whether to prune:

$$C_\alpha(T) = \sum_{t=1}^{T_{leaf}} N_t\, H(t) + \alpha\, T_{leaf}$$

where C is the loss function, T is the decision tree, $T_{leaf}$ is the number of leaf nodes, t is a node, $N_t$ is the number of samples at node t, H is the impurity measure of node t, and $\alpha$ is set manually: the larger its value, the heavier the weight of the leaf-node count; the smaller its value, the less the number of leaf nodes influences the loss;
S7.4: forming the random forest from the generated decision trees, the final classification result being decided by a vote over the tree classifiers:

$$S(x) = \operatorname*{arg\,max}_{Z} \sum_{i'} I\left(S_{i'}(x) = Z\right)$$

where S(x) denotes the random forest model, $S_{i'}$ denotes a single decision tree, $I(\cdot)$ is the indicator function, and Z denotes the output class;
S7.5: scoring the model by voting with the following formula:

$$C_p = \operatorname*{arg\,max}_{c} \sum_{n_{i'}=1}^{n_{tree}} I\left(n_{n_{i'},\,c}\right)$$

where $n_{tree}$ is the specified number of decision trees in the random forest, $C_p$ represents the voting result for prediction class C, $I(\cdot)$ is an indicator function, $n_{n_{i'}}$ is the number of leaf nodes of tree $n_{i'}$, and $n_{n_{i'},c}$ is the classification result of tree $n_{i'}$ for prediction class C; after voting, a confusion table $C_M$ is generated, where $C_M$ is an $n_c \times n_c$ table in which the element cm(i', j) represents the number of times type i' is classified as type j; cm(i', j) counts correct classifications of type i' only when i' = j, and $n_c$ is the total number of categories.
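The voting of S7.4 and the confusion table of S7.5 can be illustrated as follows, assuming scikit-learn and synthetic data. Note that scikit-learn's forest averages class probabilities by default, so the explicit hard vote below mirrors the claim's formulation rather than the library default.

```python
# Hedged companion to S7.4-S7.5: majority voting across trees, then the
# n_c x n_c confusion table CM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# S7.4: each tree votes; the forest's prediction is the majority class.
votes = np.stack([tree.predict(X_te) for tree in forest.estimators_])
majority = np.apply_along_axis(
    lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

# S7.5: confusion table CM, where cm(i', j) counts type i' classified as j.
print(confusion_matrix(y_te, majority))
```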
8. The machine learning-based bank-to-public customer churn prediction method as claimed in claim 1, wherein in step S8 the random forest trains an optimal model on each set of resampled data, K models in total, specifically:

$$\hat{S}(X) = \frac{1}{K}\sum_{i''=1}^{K} S_{i''}\left(X_{i''}\right)$$

where $X_{i''}$ is the N-dimensional variable of a sub-data set sampled randomly with replacement, i'' = 1, ..., K;
the variance is analyzed by considering the limit cases, as follows:
if the K models are completely independent, then:

$$\mathrm{Var} = \frac{\sigma^2}{K}$$

if the K models are completely identical, then:

$$\mathrm{Var} = \sigma^2$$
in the general case, the variance is analyzed using the following formula: assuming the variance of the sub-data set variables is $\sigma^2$ and the pairwise correlation between variables is $\rho$, then:

$$\mathrm{Var}\left(\frac{1}{K}\sum_{i''=1}^{K} X_{i''}\right) = \rho\,\sigma^2 + \frac{1-\rho}{K}\,\sigma^2$$
comparison of these cases shows that the random forest model reduces the variance.
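The general variance formula can be checked numerically. The sketch below builds K correlated variables with common variance sigma^2 and pairwise correlation rho from a shared component plus independent noise; this construction is an assumption chosen only because it matches those two moments, and all numeric values are arbitrary.

```python
# Numerical check: for K predictors with variance sigma^2 and pairwise
# correlation rho, the variance of their mean is
# rho*sigma^2 + (1 - rho)/K * sigma^2.
import numpy as np

rng = np.random.default_rng(0)
K, sigma2, rho, trials = 10, 4.0, 0.3, 200_000

# X_i = shared + noise_i gives Var(X_i) = sigma^2 and Corr(X_i, X_j) = rho.
shared = rng.normal(0.0, np.sqrt(rho * sigma2), size=trials)
noise = rng.normal(0.0, np.sqrt((1 - rho) * sigma2), size=(trials, K))
ensemble_mean = (shared[:, None] + noise).mean(axis=1)

print("empirical:", round(ensemble_mean.var(), 4))
print("formula:  ", round(rho * sigma2 + (1 - rho) / K * sigma2, 4))
```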
CN202110782247.3A 2021-07-12 2021-07-12 Bank-to-public customer loss prediction method based on machine learning Pending CN113240518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110782247.3A CN113240518A (en) 2021-07-12 2021-07-12 Bank-to-public customer loss prediction method based on machine learning

Publications (1)

Publication Number Publication Date
CN113240518A 2021-08-10

Family

ID=77135236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110782247.3A Pending CN113240518A (en) 2021-07-12 2021-07-12 Bank-to-public customer loss prediction method based on machine learning

Country Status (1)

Country Link
CN (1) CN113240518A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104504287A (en) * 2015-01-08 2015-04-08 广州列丰信息科技有限公司 Method for remotely monitoring data exception of mobile medical device and server and system thereof
CN109543203A (en) * 2017-09-22 2019-03-29 山东建筑大学 A kind of Building Cooling load forecasting method based on random forest
CN110322085A (en) * 2018-03-29 2019-10-11 北京九章云极科技有限公司 A kind of customer churn prediction method and apparatus
CN109190796A (en) * 2018-08-02 2019-01-11 北京天元创新科技有限公司 A kind of telecom client attrition prediction method, system and electronic equipment
CN111524606A (en) * 2020-04-24 2020-08-11 郑州大学第一附属医院 Tumor data statistical method based on random forest algorithm
CN112614590A (en) * 2020-12-10 2021-04-06 浙江大学 Machine learning-based elderly disability risk prediction method and system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
刘忻梅 et al.: "Research on Feature Selection with the AUCRF Algorithm in Credit Risk Evaluation", Computer Applications and Software *
Andrew Kelleher et al.: "Machine Learning in Practice", 30 April 2020, China Machine Press *
张雯 et al.: "Object-Oriented Classification of Lunar Morphology Based on Random Forest", Remote Sensing Information *
陆家发 et al.: "Disease Diagnosis Based on Deep Learning", Journal of Medical Informatics *
陈宗海: "*** Simulation Technology and Its Applications, Vol. 17", 31 August 2016, University of Science and Technology of China Press *
韩忠明 et al.: "Data Analysis and R", 31 August 2014, Beijing University of Posts and Telecommunications Press *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742472A (en) * 2021-09-15 2021-12-03 达而观科技(北京)有限公司 Data mining method and device based on customer service marketing scene
CN114826695A (en) * 2022-04-07 2022-07-29 广州腾粤信息科技有限公司 Privacy protection system of transaction data based on block chain
CN117150389A (en) * 2023-07-14 2023-12-01 广州易尊网络科技股份有限公司 Model training method, carrier card activation prediction method and equipment thereof
CN117150389B (en) * 2023-07-14 2024-04-12 广州易尊网络科技股份有限公司 Model training method, carrier card activation prediction method and equipment thereof
CN117075884A (en) * 2023-10-13 2023-11-17 南京飓风引擎信息技术有限公司 Digital processing system and method based on visual script
CN117075884B (en) * 2023-10-13 2023-12-15 南京飓风引擎信息技术有限公司 Digital processing system and method based on visual script
CN117824093A (en) * 2024-01-10 2024-04-05 华中师范大学 Intelligent classroom environment suitability adjusting method and system

Similar Documents

Publication Publication Date Title
CN113240518A (en) Bank-to-public customer loss prediction method based on machine learning
CN111882446B (en) Abnormal account detection method based on graph convolution network
US6834266B2 (en) Methods for estimating the seasonality of groups of similar items of commerce data sets based on historical sales data values and associated error information
CN112070125A (en) Prediction method of unbalanced data set based on isolated forest learning
JP2020115346A (en) AI driven transaction management system
EP3686756A1 (en) Method and apparatus for grouping data records
WO2022105525A1 (en) Method and apparatus for predicting user probability, and computer device
CN113256409A (en) Bank retail customer attrition prediction method based on machine learning
CN112860769B (en) Energy planning data management system
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN113177643A (en) Automatic modeling system based on big data
CN115147155A (en) Railway freight customer loss prediction method based on ensemble learning
CN116468536A (en) Automatic risk control rule generation method
CN115860800A (en) Festival and holiday commodity sales volume prediction method and device and computer storage medium
CN107742131A (en) Financial asset sorting technique and device
Jiang et al. [Retracted] Research on Intelligent Prediction Method of Financial Crisis of Listed Enterprises Based on Random Forest Algorithm
CN110597796B (en) Big data real-time modeling method and system based on full life cycle
CN114757495A (en) Membership value quantitative evaluation method based on logistic regression
CN115953166B (en) Customer information management method and system based on big data intelligent matching
CN117539920B (en) Data query method and system based on real estate transaction multidimensional data
CN117036008B (en) Automatic modeling method and system for multi-source data
Yee Improving Sales Analysis in Retail Sale using Data Mining Algorithm with Divide and Conquer Method
Kushwaha et al. Gold Price Prediction Using an Ensemble of Random Forest and XGBoost
US20230342793A1 (en) Machine-learning (ml)-based system and method for generating dso impact score for financial transaction
Henriques DECISION TREES FOR LOSS PREDICTION IN RETAIL

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210810)