CN113240518A - Bank-to-public customer loss prediction method based on machine learning - Google Patents

Bank-to-public customer loss prediction method based on machine learning

Info

Publication number
CN113240518A
Authority
CN
China
Prior art keywords
data
random forest
model
value
bank
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110782247.3A
Other languages
Chinese (zh)
Inventor
阮惠华
张成刚
黄浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Smart Software Co ltd
Original Assignee
Guangzhou Smart Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Smart Software Co ltd filed Critical Guangzhou Smart Software Co ltd
Priority to CN202110782247.3A
Publication of CN113240518A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/02 Banking, e.g. interest calculation or account maintenance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Finance (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Accounting & Taxation (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a bank-to-public customer loss prediction method based on machine learning, which comprises the following steps: collecting raw data on the bank's public-customer behavior within a set period, and constructing a PostgreSQL source database; reading the report data of several reports from the PostgreSQL source database; integrating the report data into a single table and carrying out full-table statistics on all features in the report data; coding the basic attribute data obtained from the statistics, and editing redundant and missing feature values to obtain a corrected data set; establishing a random forest model and substituting the classified data into it for training; calculating the importance of the features in the random forest model and selecting features according to the calculated importance; obtaining a model prediction result from the selected features and outputting a visualization result. Through the model, bank customers are classified accurately, enterprise marketing resources are optimized, and the customer loss rate is kept under control, thereby maximizing enterprise profit.

Description

Bank-to-public customer loss prediction method based on machine learning
Technical Field
The invention relates to the field of bank management models, in particular to a bank-to-public customer loss prediction method based on machine learning.
Background
In the big-data era, global economic integration and financial marketization have driven a profound change in the operation and management mode of domestic commercial banks. Every commercial bank now takes a customer-centered operating philosophy as an important basis for improving its profitability and core competitiveness, and pays close attention to customer relationship management and customer data mining.
Prior studies have found that attracting a new customer costs about five times as much as retaining an existing one; a sale to a churned customer succeeds roughly one time in four, while a sale to a potential or target customer succeeds only about one time in sixteen. The profit a customer brings to a company is largely determined by the customer's life cycle: the longer the life cycle, the more profit the customer brings. As the marketing focus of banking enterprises shifts from a product center to a customer center, customer relationship management becomes a core problem for the enterprise, and extending the life cycle of customers about to be lost becomes a decisive strategy for occupying market share.
However, in every banking business segment, customer relationship management still has shortcomings. First, customer classification is incomplete, so high-value customers cannot be effectively distinguished from low-value ones. Second, personalized service schemes cannot be customized per customer class, so customer loss is serious and hard to reverse. Third, existing marketing resources cannot be accurately matched to high-value customers, which seriously hinders the improvement of enterprise profit. By customer type, bank customers can be divided into bank-to-public customers and bank retail customers, and at present there is no method for systematic, in-depth mining analysis and early warning of bank-to-public customer loss. Therefore, how to establish a model that accurately classifies bank customers, optimizes enterprise marketing resources, keeps the loss rate of the bank's public customers under control and maximizes enterprise profit has become an urgent problem in the field of bank management models.
Disclosure of Invention
The invention aims to overcome the defect in the prior art that, when a bank studies lost public customers in its actual work, the data analysis is incomplete and unintuitive. It provides a machine learning based bank-to-public customer loss prediction method which, by effectively establishing a customer loss prediction model and selecting features in the data by importance, achieves a comprehensive and intuitive bank-to-public customer loss prediction result.
The purpose of the invention is mainly realized by the following technical scheme:
a bank public customer loss prediction method based on machine learning comprises the following steps:
s1: setting a time limit, collecting original data of a bank on public customer behaviors in the set time limit, and constructing a PostgreSQL source database by adopting the original data;
s2: reading report data of each report in the PostgreSQL source database;
s3: integrating the read report data into a whole, extracting all the characteristics in the report data as first characteristics, and carrying out full-table statistics on the first characteristics in the report data;
s4: coding the basic attribute data obtained by statistics, and editing the redundant and missing characteristic values to obtain a corrected data set;
s5: calculating the importance of the features in the random forest model, selecting features according to the calculated importance, and, after completing the feature transformation in the PostgreSQL source database, aggregating and constructing second features;
s6: extracting an actual lost customer data set from the corrected data set, and classifying new data through voting according to the actual lost customer data set;
s7: establishing a random forest model, and substituting the classified data into the random forest model for training;
s8: selecting a random forest model by analyzing the deviation and variance of the random forest model, and selecting optimal parameters by grid search and cross validation;
s9: and obtaining a model prediction result and outputting a visualization result.
At present, in a bank's public-customer loss prediction the customers are not clearly distinguished, so the accuracy of loss judgment is low; with an inaccurate judgment the bank cannot take targeted action in advance, which aggravates customer loss and causes great damage to the bank. The invention preserves the raw data of public customers by collecting daily public-customer behavior data and storing it in a PostgreSQL source database. PostgreSQL is a fully featured, free-software object-relational database management system with convenient index storage. Once the PostgreSQL source database has been constructed, it is used to read the report data, so the behavior data of public customers can be loaded from it. Integrating the report data makes the whole data set clearly observable so that features can be extracted effectively, after which full-table statistics are carried out on the features. Full-table statistics in the invention comprise several statistical analyses of the observed data, and the statistical information includes indexes such as the number of samples, the number of missing values, mean, standard deviation, variance, sum, number of unique values, minimum, maximum, upper quartile, lower quartile, median, mode, kurtosis and skewness. Box plots and histograms are used where possible to express simply and comprehensively the value range, distribution and other information contained in the data.
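As an illustration of steps S2-S3, the following sketch (in Python, with hypothetical table and column names such as monthly_public_deposit_avg and cust_id, since the patent does not fix them) reads the reports from the PostgreSQL source database, integrates them and computes the full-table statistics listed above:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/bank_source")

# Step S2: read the three report tables from the PostgreSQL source database.
deposit = pd.read_sql_table("monthly_public_deposit_avg", engine)
fund = pd.read_sql_table("monthly_public_fund_avg", engine)
wide = pd.read_sql_table("mining_model_wide_table", engine)

# Step S3: integrate the reports into one table keyed on the customer ID
# (the monthly tables are first aggregated to one row per customer).
dep_agg = deposit.groupby("cust_id").mean(numeric_only=True).reset_index()
fund_agg = fund.groupby("cust_id").mean(numeric_only=True).reset_index()
data = (wide.merge(dep_agg, on="cust_id", how="left")
            .merge(fund_agg, on="cust_id", how="left"))

# Full-table statistics: sample count, missing values, mean, std, quartiles,
# unique values, skewness and kurtosis for every feature.
stats = data.describe(include="all").T
stats["n_missing"] = data.isna().sum()
stats["n_unique"] = data.nunique()
numeric = data.select_dtypes("number")
stats.loc[numeric.columns, "skew"] = numeric.skew()
stats.loc[numeric.columns, "kurtosis"] = numeric.kurtosis()
print(stats)

# Histograms (and box plots) summarise the range and distribution.
import matplotlib.pyplot as plt
numeric.hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.savefig("histograms.png")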
Basic attribute data is obtained after the full-table statistics, which in effect count the concrete values taken by each feature. The basic attribute data is coded for convenient use and easy reference during processing; the codings that can be adopted in the invention include feature binarization/discretization, one-hot coding and the like. The coded basic attribute data is then edited for redundancy and missing values: parts to be deleted are deleted and parts to be filled are filled, thereby correcting the basic attribute data. In classifying public customers, the data set of actually lost customers best reflects the condition of lost customers, so classifying new data according to the actual lost-customer data set makes lost customers easier to identify. Training the random forest model effectively improves its prediction accuracy, and accuracy improves further once the optimal parameters are selected. Features are selected by importance computed from the random forest model and used to identify customers, so that whether a customer fits the loss prediction model can be determined effectively and the parameters relevant to public-customer loss are obtained clearly; visualizing these parameters yields a clear and intuitive prediction result.
The actual lost-customer data sets in the invention already exist in the bank database (the PostgreSQL source database) and are provided directly by the bank; in the first steps these data sets are preprocessed and their missing values filled, providing data sets with correct format and content for model training. The invention completes the feature-engineering content first, i.e. feature-importance calculation and feature selection, and performs model training afterwards: the random forest model is established after the features have been selected by random forest importance, which facilitates the subsequent tuning of the model during training.
In the invention, for the variance and deviation of the random forest, models trained on different training sets of the same sample size are used for prediction, and the expected prediction of the learning algorithm is obtained by averaging the predicted values. The deviation (bias) is the difference between the expected prediction of the learning algorithm and the true label, while the variance is the expected squared deviation of the individual predictions from that expected prediction.
Through this processing of the raw data and the data analysis, the invention effectively establishes the customer loss prediction model and achieves a comprehensive and intuitive customer loss prediction result from the importance features on the basis of the selected optimal parameters.
Further, the data reports in step S2 include the "data-mining customers' monthly annual daily average of public deposits" table and the "data-mining customers' monthly annual daily average of public financial funds" table, and in step S3 they are finally integrated into the "mining integrated model wide table" containing the feature records of the mined customers. The monthly public-deposits table effectively reflects information about the customer's deposits, while the monthly public financial-funds table effectively reflects information about the customer's investments; analyzing customer information from both the deposit side and the investment side makes the obtained result more accurate.
Further, the step S4 includes the following steps:
s4.1: editing the metadata according to the character string type by taking the basic attribute data obtained by statistics as the metadata;
s4.2: performing label coding on the metadata of different categories by adopting unique hot coding, and performing binarization processing on the categories;
s4.3: finding and correcting recognizable errors in the metadata yields modeling data.
In the invention the metadata comprises account information, personal information, deposit information, consumption and transaction information, and other data; editing this information by string type makes it convenient to search and call. When one-hot coding is used for editing, an N-bit state register encodes N states, each state has its own independent register bit, and only one bit is valid at any time; this solves the discrete-value problem of categorical data. Binarizing the categories solves the problem that a classifier cannot handle attribute data well and makes distance calculations between features more reasonable, and more effective modeling data is obtained after the recognizable errors in the metadata have been corrected.
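A minimal sketch of the coding in step S4.2, with illustrative column names: one-hot coding is applied to an unordered category, label coding to an ordered one, and a two-state category is binarized:

import pandas as pd

df = pd.DataFrame({
    "coop_type": ["loan", "deposit", "loan", "mixed"],   # unordered category
    "credit_level": ["B", "A", "C", "A"],                # ordered category
    "is_active": ["Y", "N", "Y", "Y"],                   # binary category
})

# One-hot coding: one register bit per state, only one bit set at a time.
df = pd.get_dummies(df, columns=["coop_type"], prefix="coop")

# Label coding for the ordered variable keeps its natural order.
order = {"A": 0, "B": 1, "C": 2}
df["credit_level"] = df["credit_level"].map(order)

# Binarization of a two-state category.
df["is_active"] = (df["is_active"] == "Y").astype(int)
print(df)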
Further, the step S4.3 includes the steps of:
s4.3.1: acquiring two expression forms of the same first characteristic in the metadata, and deleting one expression form;
s4.3.2: filling missing values in the metadata;
s4.3.3: and carrying out univariate abnormal value detection on the filled data, and removing the univariate abnormal value to obtain modeling data.
In the process of finding and correcting recognizable errors in the data file and obtaining the data required for modeling, the elimination of redundant data targets different expression forms of the same feature, since counting them twice would reduce accuracy; one of the two is therefore eliminated to preserve the accuracy of the data. When filling missing values in the metadata, two filling methods are mainly used. One is mean filling: the data is grouped by the variable most correlated with the missing-value variable, the mean of each group is computed and filled into the missing positions; this can change the distribution of the data to some extent. The other is regression filling: the missing-value variable is taken as the target variable y, its existing data is used as the training set, a regression equation is established with a variable x highly correlated with it, the x values at the missing positions of y are used as the prediction set, the missing values are predicted, and the predictions replace the missing values; here the target variable y and the highly correlated variable x are used only for missing-value filling. After filling is completed, univariate outlier detection is carried out on the data to eliminate abnormal data, making the final modeling data more reliable and accurate.
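The two filling methods can be sketched as follows, with hypothetical column names (customer_level, deposit_avg, fund_avg) and the assumption that the predictor column is fully observed:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("modeling_data.csv")  # hypothetical input file

# Mean filling: group by the variable most correlated with the missing
# column (assumed here to be customer_level) and fill each gap with the
# group mean.
df["deposit_avg"] = df.groupby("customer_level")["deposit_avg"].transform(
    lambda s: s.fillna(s.mean()))

# Regression filling: treat the missing column as target y, use a highly
# correlated column x as predictor, fit on the observed rows and predict
# the missing ones.
target, predictor = "fund_avg", "deposit_avg"
known = df[df[target].notna()]
unknown = df[df[target].isna()]
if len(unknown):
    reg = LinearRegression().fit(known[[predictor]], known[target])
    df.loc[unknown.index, target] = reg.predict(unknown[[predictor]])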
Further, the univariate outlier detection process in step S4.3.3 includes the following steps:
a1: arranging the variables in ascending order as $x_1, x_2, \ldots, x_n$;
a2: calculating the mean $\bar{x}$ and standard deviation $S$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
calculating the deviation of each value from the mean and determining the suspect value, where i is the index of the suspect value;
a3: calculating the statistic $g_i$, i.e. the ratio of the residual to the standard deviation:
$$g_i = \frac{\left| x_i - \bar{x} \right|}{S}$$
Compare $g_i$ with the critical value $G_P(n)$ given by the Grubbs table: if the calculated $g_i$ is greater than the critical value $G_P(n)$ in the table, the measured value can be judged abnormal and eliminated. The critical value $G_P(n)$ depends on two parameters: the significance level α and the number of measurements n.
The univariate outlier detection method used in the invention is the Grubbs method. In a set of measured data, if an individual value deviates far from the mean, that value is called a "suspect value". If a statistical method such as the Grubbs test determines that a "suspect value" can be removed from the set of measurement data without taking part in the calculation of the mean, the "suspect value" is called an "outlier" (gross error).
When determining the significance level α, if the requirement is strict, α can be set smaller, for example α = 0.01, giving a confidence probability P = 1 - α = 0.99; if the requirement is less strict, α can be set larger, for example α = 0.10, i.e. P = 0.90; usually α is set to 0.05 and P to 0.95.
Looking up the critical value in the Grubbs table: based on the chosen P value (here 0.95) and the number of measurements n (here 10), the intersection of the corresponding row and column gives the critical value G95(10) = 2.176.
Comparing the calculated value $g_i$ with the critical value G95(10): $g_i$ = 2.260 and G95(10) = 2.176, so $g_i$ > G95(10).
Judging the abnormal value: since $g_i$ > G95(10), the measured value 14.0 is judged abnormal and is removed from the 10 measured data.
The remaining data are then examined: the remaining 9 values are processed by the same steps; if $g_i$ > G95(9), the value is still an outlier and is removed; if $g_i$ < G95(9), it is not an outlier and no elimination is done. In this example there are no further abnormal values among the remaining 9 data.
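A sketch of the Grubbs test of steps A1-A3; here the critical value $G_P(n)$ is computed from the t distribution instead of being read from a printed table, and the sample data are illustrative values chosen so that 14.0 is the suspect value:

import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Return (is_outlier, index) for the most extreme value."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean, s = x.mean(), x.std(ddof=1)          # A2: mean and std deviation
    g = np.abs(x - mean) / s                   # A3: residual / std deviation
    i = int(np.argmax(g))                      # most suspect value
    # Critical value G_P(n) for significance level alpha (Grubbs formula).
    t2 = stats.t.ppf(1 - alpha / (2 * n), n - 2) ** 2
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t2 / (n - 2 + t2))
    return g[i] > g_crit, i

data = [13.2, 13.3, 13.4, 13.2, 13.5, 13.1, 13.3, 13.2, 13.4, 14.0]
outlier, idx = grubbs_test(data)
print(outlier, data[idx])   # flags 14.0 as the abnormal value

For n = 10 and α = 0.05 the computed critical value is 2.176, matching the tabulated G95(10) used in the example above.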
Further, the feature importance calculation in the step S5 includes the following steps:
s5.1: for each decision tree in the random forest, its out-of-bag data error, denoted $errOOB_1$, is calculated using the corresponding OOB, i.e. the out-of-bag data;
s5.2: noise interference is randomly added to feature X in all samples of the out-of-bag data OOB, and the out-of-bag data error is calculated again, denoted $errOOB_2$;
s5.3: assuming the random forest contains $N_{tree}$ trees, the importance of feature X is
$$\mathrm{importance}(X) = \frac{1}{N_{tree}} \sum_{t=1}^{N_{tree}} \left( errOOB_2^{(t)} - errOOB_1^{(t)} \right)$$
The feature selection in the step S5 includes the steps of:
s5.4: finding the feature variables highly correlated with the dependent variable through the feature-importance calculation; a smaller number of feature variables then suffices to predict the outcome of the dependent variable.
The common noise types used for the noise interference in the invention include salt-and-pepper noise and Gaussian noise. Salt-and-pepper noise randomly inserts maximum and minimum values; Gaussian noise is noise whose probability density function follows a Gaussian distribution. In the invention the degree of correlation between an independent variable and the dependent variable is judged by their correlation coefficient, and features with higher correlation have a larger influence on the model; independent variables with relatively high correlation coefficients are selected and substituted into the model for training, these independent variables being the features.
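A minimal sketch of s5.1-s5.3, with manual bootstrap sampling so that each tree's out-of-bag rows are known; random permutation of the feature column stands in here for the noise injection described above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_feature_importance(X, y, feature, n_trees=100, seed=0):
    """importance(X) = sum(errOOB2 - errOOB1) / N_tree over all trees."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    diffs = []
    for _ in range(n_trees):
        boot = rng.integers(0, n, n)                 # bootstrap sample
        oob = np.setdiff1d(np.arange(n), boot)       # rows left out of the bag
        if oob.size == 0:
            continue
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(1 << 31))).fit(X[boot], y[boot])
        err1 = np.mean(tree.predict(X[oob]) != y[oob])       # errOOB1
        X_pert = X[oob].copy()
        rng.shuffle(X_pert[:, feature])              # perturb the feature
        err2 = np.mean(tree.predict(X_pert) != y[oob])       # errOOB2
        diffs.append(err2 - err1)
    return float(np.mean(diffs))

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
print(oob_feature_importance(X, y, feature=0))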
Further, the selection of features by importance in step S5 comprises:
p1: preliminary estimation and ranking:
a) sorting the characteristic variables in the random forest in a descending order according to the importance of the variables;
b) determining a deletion ratio, and removing unimportant indexes of the corresponding ratio from the current characteristic variables to obtain a new characteristic set;
c) establishing a new random forest by using the new feature set, calculating the variable importance of each feature in the feature set, and sequencing;
d) repeating the steps until m characteristics are left;
p2: calculating, for each feature set obtained in P1, the out-of-bag error rate of the random forest built on it, and taking the feature set with the lowest out-of-bag error rate as the final selected feature set (a sketch of this loop follows).
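A sketch of the P1/P2 loop using scikit-learn's random forest, whose oob_score_ supplies the out-of-bag error rate; the drop ratio and minimum feature count are illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_features(X, y, drop_ratio=0.2, min_features=5):
    features = list(range(X.shape[1]))
    best_set, best_err = None, np.inf
    while len(features) >= min_features:
        rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                    random_state=0).fit(X[:, features], y)
        oob_err = 1.0 - rf.oob_score_            # out-of-bag error rate (P2)
        if oob_err < best_err:
            best_set, best_err = features[:], oob_err
        order = np.argsort(rf.feature_importances_)[::-1]   # descending (a)
        keep = max(min_features, int(len(features) * (1 - drop_ratio)))  # (b)
        if keep == len(features):                # nothing left to drop
            break
        features = [features[i] for i in order[:keep]]      # (c), (d)
    return best_set, best_err

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=0)
features, err = select_features(X, y)
print(len(features), err)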
Further, the establishing of the random forest model in the step S7 includes the following steps:
s7.1: taking the data obtained in step S6 as the original training set N, randomly drawing from it, with replacement, k new bootstrap sample sets and building k classification trees;
s7.2: with M attributes in the original training set N, randomly drawing $m_{try}$ attributes at each node of each classification tree and selecting, from these $m_{try}$ attributes, the variable with the most classification ability, the threshold of the classification variable being determined by checking every split point;
s7.3: in order to avoid the problem of model overfitting caused by the unlimited growth of each classification tree, a loss function is adopted to judge whether pruning operation is carried out:
$$C_\alpha(T) = \sum_{t=1}^{\left|T_{leaf}\right|} N_t H(t) + \alpha \left|T_{leaf}\right|$$
where $C_\alpha$ is the loss function, T is the decision tree, $\left|T_{leaf}\right|$ is the number of leaf nodes, t is a node, $N_t$ is the number of samples at node t, H is the impurity measure of node t, and α is set manually: the larger its value, the heavier the weight of the number of leaf nodes; the smaller its value, the weaker the influence of the number of leaf nodes;
s7.4: forming a random forest by the generated multiple decision trees, and voting according to a classifier of the multiple trees to determine a final classification result:
$$S(x) = \arg\max_{Z} \sum_{i'} I\left( S_{i'}(x) = Z \right)$$
where $S(x)$ denotes the random forest model, $S_{i'}$ denotes a single decision tree, $I(\cdot)$ is the indicator function, and Z denotes an output class;
s7.5: the model's votes are scored by the following scoring formula:
[equation image: voting score formula]
where $n_{tree}$ is the specified number of decision trees in the random forest, $C_p$ denotes the voting result for predicted class C, $I(\cdot)$ is the indicator function, $n_{n_{i'}}$ is the number of leaf nodes of tree $n_{i'}$, and $n_{n_{i'},c}$ is the classification result of tree $n_{i'}$ for predicted class C. After voting, a confusion table $C_M$ is generated; $C_M$ is an $n_c \times n_c$ table in which the element cm(i', j) is the number of times type i' was classified as j, cm(i', j) counts correct classifications only when i' = j, and $n_c$ is the total number of categories.
In the invention the samples not drawn form the out-of-bag data, which is stored in the database; during training the model is fitted on samples drawn with replacement, and the out-of-bag data is not used. When an original data set of m samples is sampled with replacement m times, each sample has probability 1/m of being drawn at each draw, so the probability of never being drawn is $\left(1 - \frac{1}{m}\right)^m$, which approaches $\frac{1}{e} \approx 36.8\%$ as m grows;
$m_{try}$ is a node parameter that determines the number of variables sampled at each iteration, i.e. the number of candidate variables at each split of the tree; it is generally set to an empirical value, the square root of the number of variables in the data set;
In the invention, m training sets are first generated by the bagging algorithm, and a decision tree is then constructed for each training set. When a node looks for a feature on which to split, it does not search all features for the one that maximizes the index (such as information gain); instead, a subset of the features is drawn at random, and the optimal solution found among these drawn features is applied to the node for splitting. The decision tree compares the different options of a decision by a probabilistic method to obtain the optimal scheme; the variable with the most classification ability in step S7.2 is the variable with the highest information gain. The method is called a decision tree because the drawn decision branches resemble the branches of a tree. It can be constructed by the CART algorithm, using the Gini coefficient as the basis for partitioning attributes. The specific algorithm is as follows:
Let D be a set of |D| data samples whose class attribute has m different values corresponding to m different classes $C_i$, $i = 1, \ldots, m$, and let $|C_i|$ be the number of samples in class $C_i$. The expected information required to classify a tuple in D is:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where $p_i = \frac{|C_i|}{|D|}$ is the probability that a data object belongs to class $C_i$.
Assuming D is divided by attribute A (taking the values {a1, a2 … av}) into v different subsets {D1, D2 … Dv}, the information entropy of dividing the current sample set by attribute A is:
$$Info_A(D) = \sum_{j=1}^{v} \frac{\left|D_j\right|}{|D|} \, Info\!\left(D_j\right)$$
The smaller the value of $Info_A(D)$, the better the result of the subset division by attribute A.
Thus, the information gain obtained by using the attribute a to perform corresponding subset division on the current branch node is:
Figure 494932DEST_PATH_IMAGE016
The Gini coefficient: the smaller the Gini coefficient, the smaller the uncertainty, i.e. the lower the probability that a randomly drawn sample from the set is misclassified and the higher the purity of the set; conversely, the larger it is, the less pure the set. When all samples in the set belong to one class, the Gini coefficient is 0. The formula is:
$$Gini(D) = 1 - \sum_{i=1}^{m} p_i^2$$
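A worked example of the Info(D), Gain(A) and Gini(D) formulas above on an illustrative churn label (the counts are made up):

import numpy as np

def info(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

# D: 10 customers, 6 retained / 4 churned.
print(info([6, 4]))            # Info(D)  = 0.971
print(gini([6, 4]))            # Gini(D)  = 1 - (0.6^2 + 0.4^2) = 0.48

# Attribute A splits D into D1 (5 samples: 4/1) and D2 (5 samples: 2/3).
info_A = 5/10 * info([4, 1]) + 5/10 * info([2, 3])
print(info([6, 4]) - info_A)   # Gain(A) = Info(D) - Info_A(D)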
the basic idea of the model voting scoring process is as follows:
Given a weak learning algorithm and a training set, and given that a single weak learner is not very accurate, the learning algorithm is applied several times to obtain a sequence of prediction functions which then vote, thereby improving the accuracy of the final result.
The algorithm: for t = 1, 2, …, T do:
sample from data set S with replacement and train on the sample to obtain model $H_t$;
when an unknown sample X is to be classified, each model $H_t$ yields one classification; the class with the most votes is the classification of the unknown sample X (for continuous values the average of the outputs can be used as the prediction). The resulting voting formula is:
$$H(x) = \arg\max_{y} \sum_{t=1}^{T} I\left( H_t(x) = y \right)$$
Further, in step S8 the random forest trains an optimal model on each resampled data set, giving K models in total:
$$S_1\!\left(X_{1''}\right),\ S_2\!\left(X_{2''}\right),\ \ldots,\ S_K\!\left(X_{K''}\right)$$
where $X_{i''}$ is an N-dimensional variable of a sub data set drawn at random with replacement, $i'' = 1, \ldots, K$;
The variance is analyzed by the limit method as follows:
if the models are completely independent, then
$$Var\!\left( \frac{1}{K} \sum_{i''=1}^{K} S_{i''}\!\left(X_{i''}\right) \right) = \frac{\sigma^2}{K};$$
if the models are identical, then
$$Var\!\left( \frac{1}{K} \sum_{i''=1}^{K} S_{i''}\!\left(X_{i''}\right) \right) = \sigma^2.$$
The variance is analyzed using the formula method: assume the variance of the sub-data-set variables is $\sigma^2$ and the correlation between every two variables is ρ; then
$$Var\!\left( \frac{1}{K} \sum_{i''=1}^{K} S_{i''}\!\left(X_{i''}\right) \right) = \rho\,\sigma^2 + \frac{(1-\rho)\,\sigma^2}{K}.$$
Comparison of these expressions shows that the variance of the random forest model is reduced.
Through the processing of step S8, the variance of the random forest model during training is kept within bounds and the model's overfitting problem is avoided; the complexity of the model is reduced and the overfitting problem is solved. The specifics are as follows:
Because the random forest is a model framework based on the bagging idea, it uses the optimal model trained on each group of resampled data, K models in total; let $X_i$ be an N-dimensional variable of the sub data sets sampled at random with replacement, i = 1, …, K.
Because the sub data sets drawn with replacement are similar and the same type of model is used, the models have approximately equal deviation and variance, and the model outputs have approximately the same distribution but are not independent (the sub data sets share repeated samples).
Thus:
Figure 407076DEST_PATH_IMAGE024
From this formula, the deviation of the bagging model is close to that of each sub-model, so the bagging method cannot significantly reduce the deviation. The variance of the bagging model is analyzed with the limit method: because the sub data sets of the bagging algorithm are neither mutually independent nor identical, there is a certain similarity between them. Let $X_t$ be an N-dimensional variable of a sub data set drawn at random with replacement, t = 1, …, K. The variance of the bagging model therefore lies between the two limits:
if the models are completely independent, then
$$Var\!\left( \frac{1}{K} \sum_{t=1}^{K} S_t\!\left(X_t\right) \right) = \frac{\sigma^2}{K};$$
if the models are identical, then
$$Var\!\left( \frac{1}{K} \sum_{t=1}^{K} S_t\!\left(X_t\right) \right) = \sigma^2.$$
Analyzing the variance of the bagging model by the formula method, assume the variance of the sub-data-set variables is $\sigma^2$ and the correlation between every two variables is ρ; the variance of the bagged model is then:
$$Var\!\left( \frac{1}{K} \sum_{t=1}^{K} X_t \right) = \frac{1}{K^2}\left( \sum_{t=1}^{K} Var\!\left(X_t\right) + \sum_{t \neq s} Cov\!\left(X_t, X_s\right) \right) = \frac{\sigma^2}{K} + \frac{K-1}{K}\,\rho\,\sigma^2 = \rho\,\sigma^2 + \frac{(1-\rho)\,\sigma^2}{K}.$$
The last formula shows that the variance of the bagging algorithm is reduced. Through the processing of step S8, the variance of the random forest model during training is controlled within bounds and model overfitting is avoided; the main effects of the random forest are to reduce model complexity and to solve the model overfitting problem.
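A numeric check of the closing formula: simulating K equally correlated Gaussian model outputs reproduces $Var = \rho\sigma^2 + (1-\rho)\sigma^2/K$ for the bagged mean (the parameter values are illustrative):

import numpy as np

K, sigma, rho, n_draws = 50, 1.0, 0.3, 200_000
rng = np.random.default_rng(0)

# Build K correlated variables: a shared component plus an independent one,
# giving pairwise correlation rho and variance sigma^2.
shared = rng.normal(size=(n_draws, 1)) * np.sqrt(rho) * sigma
indep = rng.normal(size=(n_draws, K)) * np.sqrt(1 - rho) * sigma
models = shared + indep

empirical = models.mean(axis=1).var()
theoretical = rho * sigma**2 + (1 - rho) * sigma**2 / K
print(empirical, theoretical)    # both approximately 0.314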
In conclusion, compared with the prior art, the invention has the following beneficial effects:
(1) Through data analysis the invention effectively establishes a customer loss prediction model; the model classifies bank customers accurately, optimizes enterprise marketing resources, and keeps the customer loss rate under control, thereby maximizing enterprise profit.
(2) The method predicts missing values and replaces them with the predicted results; after filling is completed, univariate outlier detection is carried out on the data to eliminate abnormal data, making the final modeling data more reliable and accurate.
(3) The invention trains and predicts on the samples with the random forest model, which effectively speeds up the prediction process and produces results with high accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of the random forest training of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1:
as shown in fig. 1, the embodiment relates to a bank-to-public customer churn prediction method based on machine learning, which includes the following steps:
s1: collecting data: setting a time limit, collecting original data of a bank on public customer behaviors in the set time limit, and constructing a PostgreSQL source database by adopting the original data;
s2: reading data: reading report data of each report in the PostgreSQL source database;
s3: data exploration: integrating the read report data into a whole, extracting all the characteristics in the report data as first characteristics, and carrying out full-table statistics on the first characteristics in the report data;
s4: data preprocessing: coding the basic attribute data obtained by statistics, and editing the redundant and missing characteristic values to obtain a corrected data set;
s5: characteristic engineering: calculating the importance of the features in the random forest model, selecting features according to the calculated importance, and, after completing the feature transformation in the PostgreSQL source database, aggregating and constructing second features;
s6: and (3) data evaluation: extracting an actual lost customer data set from the corrected data set, and classifying new data through voting according to the actual lost customer data set;
s7: model training: establishing a random forest model, and substituting the classified data into the random forest model for training;
s8: model optimization: selecting the random forest model by analyzing the deviation and variance of the random forest model, and selecting optimal parameters by grid search and cross validation (a training and tuning sketch follows this list);
s9: model prediction and result output: and obtaining a model prediction result and outputting a visualization result.
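The following end-to-end sketch of steps S7-S8 trains a random forest and selects parameters by grid search with cross validation; since the bank's data cannot be reproduced here, a synthetic imbalanced data set stands in for the feature matrix and churn label produced by steps S1-S6:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in for the prepared feature matrix and churn label.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9],
                           random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

param_grid = {
    "n_estimators": [100, 300],          # number of trees k
    "max_features": ["sqrt", 0.3],       # m_try candidates per split
    "max_depth": [None, 10, 20],         # caps tree growth (pruning proxy)
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)

proba = search.best_estimator_.predict_proba(X_test)[:, 1]
print("held-out AUC:", roc_auc_score(y_test, proba))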
In this embodiment, in order to screen the project's main customer targets, the effectively lost customers, out of all public customers of the bank, thereby providing basic data for the relevant departments' customer win-back solutions, the customers are divided into the following four categories: first, high-value customers, who generate higher value for the bank and are usually the more active customers; second, effectively lost customers, the bank's key focus, since reducing their loss rate has a positive effect on banking business; third, won-back customers, also a focus of the bank's attention, since screening out the won-back customer group can effectively save marketing resources; fourth, low-value customers, who have low value to the bank and on whom not too many marketing resources should be spent.
On this basis, the customer information consists of dynamic features of public customers mined from the bank's huge management database, from which the lost-customer information best suited for training or prediction is screened out.
The time limit in step S1 is set as follows: a training/prediction time point is determined; the six months before the training/prediction time point serve as the observation period, over which the customers' dynamic features are aggregated; the six months after the training/prediction time point serve as the presentation period, which shows whether the customer is lost or won back during that period; as the time window moves, the observation period moves together with the presentation period.
The data reports in step S2 include the "data-mining customers' monthly annual daily average of public deposits" table and the "data-mining customers' monthly annual daily average of public financial funds" table, and in step S3 they are finally integrated into the "mining integrated model wide table" containing the feature records of the mined customers. For customer data mining, the following process is carried out:
first, reading data:
(1) three wide tables are read from the PostgreSQL source database: the "data-mining customers' monthly annual daily average of public deposits" table, the "data-mining customers' monthly annual daily average of public financial funds" table, and the "mining integrated model wide table";
(2) the public-deposits table has 2067749 records with 18 features;
(3) the public financial-funds table has 2067749 records with 18 features;
(4) the "mining integrated model wide table" has 2945051 records with 70 features;
(5) the total data volume is about 12 GB;
(6) the model labels are finally determined from the public-deposits table and the public financial-funds table;
(7) the "mining integrated model wide table" provides 6 months of mined customer feature records as the training data;
In the second step, data exploration is carried out: the three wide tables are finally integrated into one large table with 73 features in total, and full-table statistics on all the features reveal the following main data-quality problems:
(1) the customer ID feature is of string type, and the test IDs consist entirely of letters;
(2) several features, such as customer level, customer credit level and customer cooperation type, are categorical variables;
(3) there is extreme similarity between some features;
(4) several features, such as the number of financial-product purchases, the financial period and the customer credit rating, have too many missing values;
thirdly, preprocessing the mined data: the step S4 mainly includes the following steps:
s4.1: editing the metadata according to the character string type by taking the basic attribute data obtained by statistics as the metadata;
s4.2: performing label coding on the metadata of different categories by adopting unique hot coding, and performing binarization processing on the categories;
s4.3: finding and correcting recognizable errors in the metadata yields modeling data.
In said step S4.3 the following steps are included:
s4.3.1: acquiring two expression forms of the same first characteristic in the metadata, and deleting one expression form;
s4.3.2: filling missing values in the metadata;
s4.3.3: and carrying out univariate abnormal value detection on the filled data, and removing the univariate abnormal value to obtain modeling data.
(1) Metadata editing: the customer ID is a 10-digit number; during data-format conversion it must be converted to a long type, since an int type cannot represent all 10 digits; the letter-type IDs are deleted;
(2) label coding / one-hot: unordered categorical variables get label codes, and ordered categorical variables get one-hot codes;
(3) feature deletion: where a feature has two expression forms, one of the two is deleted;
(4) filling (missing values) according to the actual situation: if the proportion of null values is too large, deleting the feature can be considered; if the null value has a special meaning, a special number such as -1 or 999 can be filled in; the mean, mode or median can also be filled in, depending on the data.
In the fourth step, the data feature engineering is determined: most features are aggregated and constructed in the PostgreSQL database to reduce the load on the server, and the feature engineering is supplemented when the model is tuned in the case experiment.
The univariate outlier detection process in said step S4.3.3 comprises the steps of:
a1: arranging the variables in ascending order as $x_1, x_2, \ldots, x_n$;
a2: calculating the mean $\bar{x}$ and standard deviation $S$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
calculating the deviation of each value from the mean and determining the suspect value, where i is the index of the suspect value;
a3: calculating the statistic $g_i$, i.e. the ratio of the residual to the standard deviation:
$$g_i = \frac{\left| x_i - \bar{x} \right|}{S}$$
comparing $g_i$ with the critical value $G_P(n)$ given by the Grubbs table: if the calculated $g_i$ is greater than the critical value $G_P(n)$ in the table, the measured value is judged abnormal and eliminated.
For the labels, this embodiment determines the lost-customer information according to the actual situation, taking the six months from 2019.06.30 as the recording period, and acquires the data by the following steps (a sketch of this windowed labeling follows the table below):
step 1: selecting an observation point and, taking it as the cut-off time, counting the longest number of consecutive months within the observation period (e.g. the last 6 months) in which the customer's annual daily average stays below 10,000, and grading the customer by this worst state into levels such as 0, 1, 2, 3, 4, 5, 6;
step 2: taking the observation point as the starting time, counting the longest number of consecutive months within the presentation period (e.g. the next 6 months) in which the customer's annual daily average stays below 10,000, and grading the user likewise into levels such as 0, 1, 2, 3, 4, 5, 6;
step 3: cross-tabulating the number of customers in each grid cell;
step 4: counting the customer proportion in each grid cell;
step 5: to eliminate the random influence of observation-point selection, several observation points are generally chosen and steps 1 to 4 are repeated;
the results obtained are shown in the following table:
TABLE 1 customer loss ratio Table
[table image not reproduced]
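A hedged sketch of the step 1-step 4 labeling for one observation point; the input file and column names (cust_id, month, ytd_daily_avg) are hypothetical stand-ins for the bank's monthly deposit table:

import pandas as pd

df = pd.read_csv("monthly_deposit_avg.csv",
                 parse_dates=["month"])       # cust_id, month, ytd_daily_avg

def worst_streak(s):
    """Longest run of consecutive months below 10,000, capped at grade 6."""
    below = (s < 10_000).astype(int)
    streak = below.groupby((below == 0).cumsum()).cumsum()
    return int(min(streak.max(), 6))

point = pd.Timestamp("2019-06-30")            # observation point
obs = df[(df.month > point - pd.DateOffset(months=6)) & (df.month <= point)]
show = df[(df.month > point) & (df.month <= point + pd.DateOffset(months=6))]

obs_grade = (obs.sort_values("month")
                .groupby("cust_id")["ytd_daily_avg"].apply(worst_streak))
show_grade = (show.sort_values("month")
                  .groupby("cust_id")["ytd_daily_avg"].apply(worst_streak))

# Cross statistics: customer proportion per (observation, presentation) cell.
cross = pd.crosstab(obs_grade, show_grade, normalize="all")
print(cross)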
Example 2:
as shown in fig. 1-2, in this embodiment, based on embodiment 1, the establishing of the random forest model in step S7 includes the following steps:
s7.1: taking the data obtained in step S6 as the original training set N, randomly drawing from it, with replacement, k new bootstrap sample sets and building k classification trees;
s7.2: with M attributes in the original training set N, randomly drawing $m_{try}$ attributes at each node of each classification tree and selecting, from these $m_{try}$ attributes, the variable with the most classification ability, the threshold of the classification variable being determined by checking every split point;
s7.3: in order to avoid the problem of model overfitting caused by the unlimited growth of each classification tree, a loss function is adopted to judge whether pruning operation is carried out:
$$C_\alpha(T) = \sum_{t=1}^{\left|T_{leaf}\right|} N_t H(t) + \alpha \left|T_{leaf}\right|$$
where $C_\alpha$ is the loss function, T is the decision tree, $\left|T_{leaf}\right|$ is the number of leaf nodes, t is a node, $N_t$ is the number of samples at node t, H is the impurity measure of node t, and α is set manually: the larger its value, the heavier the weight of the number of leaf nodes; the smaller its value, the weaker the influence of the number of leaf nodes;
s7.4: forming a random forest by the generated multiple decision trees, and voting according to a classifier of the multiple trees to determine a final classification result:
$$S(x) = \arg\max_{Z} \sum_{i'} I\left( S_{i'}(x) = Z \right)$$
where $S(x)$ denotes the random forest model, $S_{i'}$ denotes a single decision tree, $I(\cdot)$ is the indicator function, and Z denotes an output class;
s7.5: the model's votes are scored by the following scoring formula:
[equation image: voting score formula]
where $n_{tree}$ is the specified number of decision trees in the random forest, $C_p$ denotes the voting result for predicted class C, $I(\cdot)$ is the indicator function, $n_{n_{i'}}$ is the number of leaf nodes of tree $n_{i'}$, and $n_{n_{i'},c}$ is the classification result of tree $n_{i'}$ for predicted class C. After voting, a confusion table $C_M$ is generated; $C_M$ is an $n_c \times n_c$ table in which the element cm(i', j) is the number of times type i' was classified as j, cm(i', j) counts correct classifications only when i' = j, and $n_c$ is the total number of categories.
The random forest is a classifier comprising multiple decision trees; it classifies new data through the knowledge learned from the data set, and the output class is decided by voting over the classes output by all the trees. This reduces the risk of overfitting, and the model has the advantages of readability and high classification speed.
Example 3:
As shown in figs. 1-2, in this embodiment, based on either of embodiments 1-2, the random forest in step S8 trains an optimal model on each resampled data set, giving K models in total:
$$S_1\!\left(X_{1''}\right),\ S_2\!\left(X_{2''}\right),\ \ldots,\ S_K\!\left(X_{K''}\right)$$
where $X_{i''}$ is an N-dimensional variable of a sub data set drawn at random with replacement, $i'' = 1, \ldots, K$;
The variance is analyzed by the limit method as follows:
if the models are completely independent, then
$$Var\!\left( \frac{1}{K} \sum_{i''=1}^{K} S_{i''}\!\left(X_{i''}\right) \right) = \frac{\sigma^2}{K};$$
if the models are identical, then
$$Var\!\left( \frac{1}{K} \sum_{i''=1}^{K} S_{i''}\!\left(X_{i''}\right) \right) = \sigma^2.$$
The variance is analyzed using the formula method: assume the variance of the sub-data-set variables is $\sigma^2$ and the correlation between every two variables is ρ; then
$$Var\!\left( \frac{1}{K} \sum_{i''=1}^{K} S_{i''}\!\left(X_{i''}\right) \right) = \rho\,\sigma^2 + \frac{(1-\rho)\,\sigma^2}{K}.$$
Comparison of these expressions shows that the variance of the random forest model is reduced.
Example 4:
as shown in fig. 1 to 2, in this embodiment, based on any one of embodiments 1 to 3, the feature importance calculation in the step S5 includes the following steps:
s5.1: for each decision tree in the random forest, its out-of-bag data error, denoted $errOOB_1$, is calculated using the corresponding OOB, i.e. the out-of-bag data;
s5.2: noise interference is randomly added to feature X in all samples of the out-of-bag data OOB, and the out-of-bag data error is calculated again, denoted $errOOB_2$;
s5.3: assuming the random forest contains $N_{tree}$ trees, the importance of feature X is
$$\mathrm{importance}(X) = \frac{1}{N_{tree}} \sum_{t=1}^{N_{tree}} \left( errOOB_2^{(t)} - errOOB_1^{(t)} \right)$$
The feature selection in the step S5 includes the steps of:
s5.4: finding the feature variables highly correlated with the dependent variable through the feature-importance calculation; a smaller number of feature variables then suffices to predict the outcome of the dependent variable.
The selecting of the importance selecting feature in the step S5 includes:
p1: preliminary estimation and ranking:
a) sorting the characteristic variables in the random forest in a descending order according to the importance of the variables;
b) determining a deletion ratio, and removing unimportant indexes of the corresponding ratio from the current characteristic variables to obtain a new characteristic set;
c) establishing a new random forest by using the new feature set, calculating the variable importance of each feature in the feature set, and sequencing;
d) repeating the steps until m characteristics are left;
p2: calculating, for each feature set obtained in P1, the out-of-bag error rate of the random forest built on it, and taking the feature set with the lowest out-of-bag error rate as the final selected feature set.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A bank public customer loss prediction method based on machine learning is characterized by comprising the following steps:
s1: setting a time limit, collecting original data of a bank on public customer behaviors in the set time limit, and constructing a PostgreSQL source database by adopting the original data;
s2: reading report data of each report in the PostgreSQL source database;
s3: integrating the read report data into a whole, extracting all the characteristics in the report data as first characteristics, and carrying out full-table statistics on the first characteristics in the report data;
s4: coding the basic attribute data obtained by statistics, and editing the redundant and missing characteristic values to obtain a corrected data set;
s5: calculating the importance of the features in the random forest model, selecting features according to the calculated importance, and, after completing the feature transformation in the PostgreSQL source database, aggregating and constructing second features;
s6: extracting an actual lost customer data set from the corrected data set, and classifying new data through voting according to the actual lost customer data set;
s7: establishing a random forest model, and substituting the classified data into the random forest model for training;
s8: selecting a random forest model by analyzing the deviation and variance of the random forest model, and selecting optimal parameters by grid search and cross validation;
s9: and obtaining a model prediction result and outputting a visualization result.
2. The machine learning-based bank-to-public customer churn prediction method as claimed in claim 1, wherein the step S4 includes the steps of:
s4.1: editing the metadata according to the character string type by taking the basic attribute data obtained by statistics as the metadata;
s4.2: performing label coding on the metadata of different categories by adopting unique hot coding, and performing binarization processing on the categories;
s4.3: finding and correcting recognizable errors in the metadata yields modeling data.
3. The machine learning-based bank-to-public customer churn prediction method as claimed in claim 2, wherein the step S4.3 comprises the steps of:
s4.3.1: acquiring two expression forms of the same first characteristic in the metadata, and deleting one expression form;
s4.3.2: filling missing values in the metadata;
s4.3.3: and carrying out univariate abnormal value detection on the filled data, and removing the univariate abnormal value to obtain modeling data.
4. The machine learning-based bank-to-public customer churn prediction method as claimed in claim 3 wherein the univariate outlier detection process of step S4.3.3 comprises the steps of:
a1: arranging the variables in ascending order as $x_1, x_2, \ldots, x_n$;
a2: calculating the mean $\bar{x}$ and standard deviation $S$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
calculating the deviation of each value from the mean and determining the suspect value, where i is the index of the suspect value;
a3: calculating the statistic $g_i$, i.e. the ratio of the residual to the standard deviation:
$$g_i = \frac{\left| x_i - \bar{x} \right|}{S}$$
comparing $g_i$ with the critical value $G_P(n)$ given by the Grubbs table: if the calculated $g_i$ is greater than the critical value $G_P(n)$ in the table, the measured value is judged abnormal and eliminated.
5. The machine learning-based bank-to-public customer churn prediction method as claimed in claim 1, wherein the feature importance calculation in step S5 includes the steps of:
s5.1: for each decision tree in the random forest, its out-of-bag data error, denoted $errOOB_1$, is calculated using the corresponding OOB, i.e. the out-of-bag data;
s5.2: noise interference is randomly added to feature X in all samples of the out-of-bag data OOB, and the out-of-bag data error is calculated again, denoted $errOOB_2$;
s5.3: assuming the random forest contains $N_{tree}$ trees, the importance of feature X is
$$\mathrm{importance}(X) = \frac{1}{N_{tree}} \sum_{t=1}^{N_{tree}} \left( errOOB_2^{(t)} - errOOB_1^{(t)} \right)$$
The feature selection in the step S5 includes the steps of:
s5.4: finding the feature variables highly correlated with the dependent variable through the feature-importance calculation; a smaller number of feature variables then suffices to predict the outcome of the dependent variable.
6. The machine-learning-based bank-to-public-customer churn prediction method of claim 1, wherein the step of selecting the importance selection feature in step S5 comprises:
P1: preliminary estimation and ranking:
a) sorting the feature variables in the random forest in descending order of variable importance;
b) determining a deletion ratio and removing that proportion of the least important indexes from the current feature variables to obtain a new feature set;
c) building a new random forest with the new feature set, calculating the variable importance of each feature in the set, and re-ranking;
d) repeating the above steps until m features remain;
P2: computing the out-of-bag error rate of each feature set obtained in P1 with the random forest built on it, and taking the feature set with the lowest out-of-bag error rate as the final selected feature set.
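The backward elimination loop of P1/P2 can be sketched as follows; the deletion ratio, stopping size m, and synthetic data are assumed values chosen for illustration, and scikit-learn's oob_score_ supplies the out-of-bag error of P2.

```python
# Hedged sketch of P1/P2: repeatedly drop the least important fraction of
# features, rebuild the forest, and keep the set with the lowest OOB error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
features = np.arange(X.shape[1])
drop_ratio, m_final = 0.2, 3          # deletion ratio and stopping size (assumed)
best_err, best_set = 1.0, features

while len(features) >= m_final:
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X[:, features], y)
    oob_err = 1.0 - rf.oob_score_      # P2: out-of-bag error of this feature set
    if oob_err < best_err:
        best_err, best_set = oob_err, features.copy()
    order = np.argsort(rf.feature_importances_)[::-1]  # a) descending importance
    keep = max(m_final, int(len(features) * (1 - drop_ratio)))
    if keep == len(features):
        keep -= 1                      # always drop at least one feature
    features = features[order[:keep]]  # b) remove the least important ones

print("selected features:", best_set, "oob error:", round(best_err, 4))
```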
7. The machine learning-based bank-to-public customer churn prediction method as claimed in claim 1, wherein the establishment of the random forest model in step S7 comprises the following steps:
S7.1: taking the data obtained in step S6 as the original training set of size N, randomly drawing k new bootstrap sample sets with replacement, and building k classification trees;
S7.2: given M attribute classifications in the original training set N, randomly drawing m_try attributes at each node of each classification tree and selecting from these m_try the variable with the most classification ability, the threshold of the classification variable being determined by checking each classification point;
S7.3: to avoid model overfitting caused by unlimited growth of each classification tree, a loss function is used to decide whether to prune:

$$C_\alpha(T) = \sum_{t=1}^{T_{leaf}} N_t\, H(t) + \alpha\, T_{leaf}$$

where C is the loss function, T is the decision tree, $T_{leaf}$ is the number of leaf nodes, t is a node, $N_t$ is the number of samples at node t, H is the impurity measure of node t, and $\alpha$ is set manually: the larger its value, the heavier the weight of the leaf-node count; the smaller its value, the less the number of leaf nodes influences the loss;
S7.4: forming the random forest from the generated decision trees, the final classification result being decided by a vote over the tree classifiers:

$$S(x) = \operatorname*{arg\,max}_{Z} \sum_{i'} I\left(S_{i'}(x) = Z\right)$$

where S(x) denotes the random forest model, $S_{i'}$ denotes a single decision tree, $I(\cdot)$ is the indicator function, and Z denotes the output class;
S7.5: scoring the model by voting with the following formula:

$$C_p = \operatorname*{arg\,max}_{c} \sum_{n_{i'}=1}^{n_{tree}} I\left(n_{n_{i'},\,c}\right)$$

where $n_{tree}$ is the specified number of decision trees in the random forest, $C_p$ represents the voting result for prediction class C, $I(\cdot)$ is an indicator function, $n_{n_{i'}}$ is the number of leaf nodes of tree $n_{i'}$, and $n_{n_{i'},c}$ is the classification result of tree $n_{i'}$ for prediction class C; after voting, a confusion table $C_M$ is generated, where $C_M$ is an $n_c \times n_c$ table in which the element cm(i', j) represents the number of times type i' is classified as type j; cm(i', j) counts correct classifications of type i' only when i' = j, and $n_c$ is the total number of categories.
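The voting of S7.4 and the confusion table of S7.5 can be illustrated as follows, assuming scikit-learn and synthetic data. Note that scikit-learn's forest averages class probabilities by default, so the explicit hard vote below mirrors the claim's formulation rather than the library default.

```python
# Hedged companion to S7.4-S7.5: majority voting across trees, then the
# n_c x n_c confusion table CM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# S7.4: each tree votes; the forest's prediction is the majority class.
votes = np.stack([tree.predict(X_te) for tree in forest.estimators_])
majority = np.apply_along_axis(
    lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

# S7.5: confusion table CM, where cm(i', j) counts type i' classified as j.
print(confusion_matrix(y_te, majority))
```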
8. The machine learning-based bank-to-public customer churn prediction method as claimed in claim 1, wherein in step S8 the random forest trains an optimal model on each set of resampled data, K models in total, specifically:

$$\hat{S}(X) = \frac{1}{K}\sum_{i''=1}^{K} S_{i''}\left(X_{i''}\right)$$

where $X_{i''}$ is the N-dimensional variable of a sub-data set sampled randomly with replacement, i'' = 1, ..., K;
the variance is analyzed by considering the limit cases, as follows:
if the K models are completely independent, then:

$$\mathrm{Var} = \frac{\sigma^2}{K}$$

if the K models are completely identical, then:

$$\mathrm{Var} = \sigma^2$$
in the general case, the variance is analyzed using the following formula: assuming the variance of the sub-data set variables is $\sigma^2$ and the pairwise correlation between variables is $\rho$, then:

$$\mathrm{Var}\left(\frac{1}{K}\sum_{i''=1}^{K} X_{i''}\right) = \rho\,\sigma^2 + \frac{1-\rho}{K}\,\sigma^2$$
comparison of these cases shows that the random forest model reduces the variance.
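The general variance formula can be checked numerically. The sketch below builds K correlated variables with common variance sigma^2 and pairwise correlation rho from a shared component plus independent noise; this construction is an assumption chosen only because it matches those two moments, and all numeric values are arbitrary.

```python
# Numerical check: for K predictors with variance sigma^2 and pairwise
# correlation rho, the variance of their mean is
# rho*sigma^2 + (1 - rho)/K * sigma^2.
import numpy as np

rng = np.random.default_rng(0)
K, sigma2, rho, trials = 10, 4.0, 0.3, 200_000

# X_i = shared + noise_i gives Var(X_i) = sigma^2 and Corr(X_i, X_j) = rho.
shared = rng.normal(0.0, np.sqrt(rho * sigma2), size=trials)
noise = rng.normal(0.0, np.sqrt((1 - rho) * sigma2), size=(trials, K))
ensemble_mean = (shared[:, None] + noise).mean(axis=1)

print("empirical:", round(ensemble_mean.var(), 4))
print("formula:  ", round(rho * sigma2 + (1 - rho) / K * sigma2, 4))
```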
CN202110782247.3A 2021-07-12 2021-07-12 Bank-to-public customer loss prediction method based on machine learning Pending CN113240518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110782247.3A CN113240518A (en) 2021-07-12 2021-07-12 Bank-to-public customer loss prediction method based on machine learning

Publications (1)

Publication Number Publication Date
CN113240518A 2021-08-10

Family

ID=77135236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110782247.3A Pending CN113240518A (en) 2021-07-12 2021-07-12 Bank-to-public customer loss prediction method based on machine learning

Country Status (1)

Country Link
CN (1) CN113240518A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104504287A (en) * 2015-01-08 2015-04-08 广州列丰信息科技有限公司 Method for remotely monitoring data exception of mobile medical device and server and system thereof
CN109543203A (en) * 2017-09-22 2019-03-29 山东建筑大学 A kind of Building Cooling load forecasting method based on random forest
CN110322085A (en) * 2018-03-29 2019-10-11 北京九章云极科技有限公司 A kind of customer churn prediction method and apparatus
CN109190796A (en) * 2018-08-02 2019-01-11 北京天元创新科技有限公司 A kind of telecom client attrition prediction method, system and electronic equipment
CN111524606A (en) * 2020-04-24 2020-08-11 郑州大学第一附属医院 Tumor data statistical method based on random forest algorithm
CN112614590A (en) * 2020-12-10 2021-04-06 浙江大学 Machine learning-based elderly disability risk prediction method and system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
刘忻梅 et al.: "Research on Feature Selection with the AUCRF Algorithm in Credit Risk Evaluation", Computer Applications and Software *
Andrew Kelleher et al.: "Machine Learning in Practice", 30 April 2020, China Machine Press *
张雯 et al.: "Object-Oriented Classification of Lunar Morphology Based on Random Forest", Remote Sensing Information *
陆家发 et al.: "Disease Diagnosis Based on Deep Learning", Journal of Medical Informatics *
陈宗海: "*** Simulation Technology and Its Applications, Vol. 17", 31 August 2016, University of Science and Technology of China Press *
韩忠明 et al.: "Data Analysis and R", 31 August 2014, Beijing University of Posts and Telecommunications Press *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742472A (en) * 2021-09-15 2021-12-03 达而观科技(北京)有限公司 Data mining method and device based on customer service marketing scene
CN114826695A (en) * 2022-04-07 2022-07-29 广州腾粤信息科技有限公司 Privacy protection system of transaction data based on block chain
CN117150389A (en) * 2023-07-14 2023-12-01 广州易尊网络科技股份有限公司 Model training method, carrier card activation prediction method and equipment thereof
CN117150389B (en) * 2023-07-14 2024-04-12 广州易尊网络科技股份有限公司 Model training method, carrier card activation prediction method and equipment thereof
CN117075884A (en) * 2023-10-13 2023-11-17 南京飓风引擎信息技术有限公司 Digital processing system and method based on visual script
CN117075884B (en) * 2023-10-13 2023-12-15 南京飓风引擎信息技术有限公司 Digital processing system and method based on visual script
CN117824093A (en) * 2024-01-10 2024-04-05 华中师范大学 Intelligent classroom environment suitability adjusting method and system

Similar Documents

Publication Publication Date Title
CN113240518A (en) Bank-to-public customer loss prediction method based on machine learning
CN111882446B (en) Abnormal account detection method based on graph convolution network
US6834266B2 (en) Methods for estimating the seasonality of groups of similar items of commerce data sets based on historical sales data values and associated error information
CN112070125A (en) Prediction method of unbalanced data set based on isolated forest learning
JP2020115346A (en) AI driven transaction management system
EP3686756A1 (en) Method and apparatus for grouping data records
WO2022105525A1 (en) Method and apparatus for predicting user probability, and computer device
CN113256409A (en) Bank retail customer attrition prediction method based on machine learning
CN112860769B (en) Energy planning data management system
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN113177643A (en) Automatic modeling system based on big data
CN115147155A (en) Railway freight customer loss prediction method based on ensemble learning
CN116468536A (en) Automatic risk control rule generation method
CN115860800A (en) Festival and holiday commodity sales volume prediction method and device and computer storage medium
CN107742131A (en) Financial asset sorting technique and device
Jiang et al. [Retracted] Research on Intelligent Prediction Method of Financial Crisis of Listed Enterprises Based on Random Forest Algorithm
CN110597796B (en) Big data real-time modeling method and system based on full life cycle
CN114757495A (en) Membership value quantitative evaluation method based on logistic regression
CN115953166B (en) Customer information management method and system based on big data intelligent matching
CN117539920B (en) Data query method and system based on real estate transaction multidimensional data
CN117036008B (en) Automatic modeling method and system for multi-source data
Yee Improving Sales Analysis in Retail Sale using Data Mining Algorithm with Divide and Conquer Method
Kushwaha et al. Gold Price Prediction Using an Ensemble of Random Forest and XGBoost
US20230342793A1 (en) Machine-learning (ml)-based system and method for generating dso impact score for financial transaction
Henriques DECISION TREES FOR LOSS PREDICTION IN RETAIL

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210810)