CN115049121A - Bank customer loss risk prediction model generation method, device, equipment and medium - Google Patents

Bank customer loss risk prediction model generation method, device, equipment and medium

Info

Publication number
CN115049121A
Authority
CN
China
Prior art keywords
data
customer
bank
bank customer
risk prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210638850.9A
Other languages
Chinese (zh)
Inventor
刘锴靖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210638850.9A priority Critical patent/CN115049121A/en
Publication of CN115049121A publication Critical patent/CN115049121A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Finance (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Software Systems (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Educational Administration (AREA)
  • Technology Law (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to the technical field of artificial intelligence and provides a method, apparatus, device and medium for generating a bank customer churn risk prediction model. The method comprises: acquiring data retained by a bank customer; preprocessing the data retained by the bank customer to obtain feature data related to customer churn; and training an XGBoost model to be trained based on the feature data, the labels corresponding to the feature data and preset model parameters, to obtain the bank customer churn risk prediction model. Because XGBoost is a machine learning model and the training set used by the method consists of feature data related to customer churn, the trained model generated by the embodiments of the application can effectively predict the risk of bank customer churn.

Description

Bank customer attrition risk prediction model generation method, device, equipment and medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a bank customer loss risk prediction model generation method, device, equipment and medium.
Background
For banks, customer churn is inevitable for a variety of reasons. However, if a customer who is inclined to leave can be identified in time, the bank can try to retain that customer through various measures and thereby reduce the churn rate. It is therefore important to predict customers' churn risk effectively. However, in the banking field there is currently no tool capable of effectively predicting customer churn risk.
Disclosure of Invention
In view of the above technical problems, an object of the present application is to provide a method, an apparatus, a device and a medium for generating a bank customer loss risk prediction model, so as to solve the technical problem that no tool is available at present to effectively predict the bank customer loss risk.
In order to solve the foregoing technical problem, in a first aspect, an embodiment of the present application provides a method for generating a bank customer churn risk prediction model, including:
acquiring data retained by a bank client;
preprocessing the data retained by the bank customer to obtain characteristic data related to customer churn;
training an XGBoost model to be trained based on the feature data, the label corresponding to the feature data and preset model parameters to obtain the bank customer churn risk prediction model, wherein the label corresponding to the feature data indicates whether the customer corresponding to the feature data has churned.
Further, the customer retention data includes a plurality of attribute data, the attributes including a customer ID, a customer name, a customer credit score, a gender, an age, a region, a deposit/loan status, whether there is a credit card, a financial product purchase/use amount, whether it is an active user, an estimated income, a length of time to use a financial product, and whether it has been lost.
Further, the preprocessing the data retained by the bank customer to obtain characteristic data related to customer churn includes:
and sequentially carrying out data cleaning processing, data conversion processing and feature screening processing on the data retained by the bank customer to obtain the feature data related to customer loss.
Further, the sequentially performing data cleaning, data conversion and feature screening on the data retained by the bank customer to obtain the feature data related to customer loss includes:
deleting data irrelevant to customer churn from the data retained by the bank customer to obtain first feature data;
processing abnormal values in the first feature data to obtain second feature data;
converting the second feature data into an input format accepted by XGBoost to obtain third feature data;
discretizing the third feature data to obtain fourth feature data;
and selecting, from the fourth feature data, the features that are only weakly correlated with one another to obtain the feature data related to customer churn.
Further, the characteristics associated with customer churn include: customer credit score, gender, age, region, deposit/loan status, whether there is a credit card, amount of financial product purchase/use, whether it is an active user, estimated income, and length of time to use a financial product.
Further, training the XGBoost model to be trained based on the feature data, the label corresponding to the feature data and the preset model parameters to obtain the bank customer churn risk prediction model includes:
training the XGBoost model to be trained based on the feature data, the labels corresponding to the feature data and the preset model parameters, and adjusting the model parameters during training according to a preset parameter adjustment strategy, to obtain the bank customer churn risk prediction model.
Further, the parameter adjustment strategy is any one of the following:
a first parameter adjustment strategy: adjusting only the number of trees;
a second parameter adjustment strategy: adjusting only the depth of the trees;
a third parameter adjustment strategy: adjusting the number and the depth of the trees simultaneously;
a fourth parameter adjustment strategy: adjusting only the learning rate;
a fifth parameter adjustment strategy: adjusting the number of trees and the learning rate simultaneously;
a sixth parameter adjustment strategy: adjusting only the row sampling ratio;
and a seventh parameter adjustment strategy: adjusting only the column sampling ratio.
In a second aspect, an embodiment of the present application provides a bank customer churn risk prediction model generation apparatus, including:
an acquisition module, used for acquiring data retained by bank customers;
a preprocessing module, used for preprocessing the data retained by the bank customer to obtain feature data related to customer churn;
a training module, used for training an XGBoost model to be trained based on the feature data, the label corresponding to the feature data and preset model parameters to obtain the bank customer churn risk prediction model, wherein the label corresponding to the feature data indicates whether the customer corresponding to the feature data has churned.
In a third aspect, an embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of any one of the above methods when executing the computer program.
In a fourth aspect, the present application provides a computer readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method described in any one of the above.
The bank customer churn risk prediction model generation method provided by the embodiments of the application comprises: acquiring data retained by a bank customer, and preprocessing that data to obtain feature data related to customer churn; and training an XGBoost model to be trained based on the feature data, the labels corresponding to the feature data and preset model parameters, to obtain the bank customer churn risk prediction model. Because XGBoost is a machine learning model and the training set used consists of feature data related to customer churn, the trained model can effectively predict the churn risk of bank customers. In addition, XGBoost is fast and performs well on large-scale data and has low requirements on hardware resources such as memory, so the model generated by the embodiments of the application can predict the churn risk of bank customers quickly and accurately.
Drawings
In order to illustrate the technical solution of the present application more clearly, the drawings used in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for generating a bank customer churn risk prediction model according to a first embodiment of the present application;
FIG. 2 is a flowchart of a method for obtaining feature data related to customer churn according to the first embodiment of the present application;
FIG. 3 is a block diagram of a bank customer churn risk prediction model generation apparatus according to a second embodiment of the present application;
FIG. 4 is a block diagram of a preprocessing module according to the second embodiment of the present application;
FIG. 5 is a schematic block diagram of a computer device according to a third embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First embodiment:
A first embodiment of the present application provides a bank customer churn risk prediction model generation method. The method can be executed by a computer device, which may be a desktop computer, a notebook computer, a palmtop computer, a cloud server or another computing device.
Referring to FIG. 1, the method for generating a bank customer churn risk prediction model according to an embodiment of the present application includes steps S1-S3:
S1: acquiring data retained by the bank customer;
S2: preprocessing the data retained by the bank customer to obtain feature data related to customer churn;
S3: training an XGBoost model to be trained based on the feature data, the label corresponding to the feature data and preset model parameters to obtain the bank customer churn risk prediction model, wherein the label corresponding to the feature data indicates whether the customer corresponding to the feature data has churned.
As mentioned in step S1, it should be noted that the bank customer retention data is generally obtained from a banking system, and the bank customer retention data includes various attribute data, generally, as shown in table 1, including the following attributes: customer ID, customer name, customer credit score, gender, age, region, deposit/loan status, whether there is a credit card, number of financial product purchases/uses, whether it is an active user, estimated income, length of time to use a financial product, whether it has been lost.
Table 1 bank customer retained data
[Table 1 is reproduced as an image in the original publication.]
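As a rough illustration of the attribute list above, one retained record and its churn label could be represented as follows. This is only a sketch: the field names and the sample values are hypothetical stand-ins for the attributes named in the text, not the column names or contents of Table 1.

```python
from dataclasses import dataclass, asdict


@dataclass
class CustomerRecord:
    # Hypothetical field names mirroring the attributes listed in the text.
    customer_id: str
    name: str
    credit_score: int
    gender: str
    age: int
    area: str
    loan: str                # deposit/loan status
    has_credit_card: bool
    num_products: int        # number of financial products purchased/used
    is_active: bool
    estimated_income: float
    tenure_years: int        # length of time using financial products
    churned: bool            # label: whether the customer has been lost


def split_features_label(record):
    """Return (features, label), where label is 1 for a churned customer, else 0."""
    row = asdict(record)
    label = 1 if row.pop("churned") else 0
    return row, label


# Made-up sample values, for illustration only.
example = CustomerRecord("C001", "Example Name", 640, "Female", 42, "AreaA",
                         "Deposit", True, 2, True, 50000.0, 3, False)
features, label = split_features_label(example)
```

Keeping the label separate from the other attributes matches step S3, where the label is supplied to training alongside the feature data rather than as one of the features.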
As mentioned in step S2, it should be noted that, since the present application aims to train a model capable of predicting the risk of customer churn, and not all attributes in the data retained by the bank customer belong to the factor of customer churn, in order to ensure the accuracy of the model, it is necessary to extract attributes related to customer churn from the data retained by the bank customer as features to train the model.
As mentioned in step S3, it should be noted that XGBoost (eXtreme Gradient Boosting) is mainly used to solve supervised learning problems, in which training data x_i containing multiple features are used to predict a target variable y_i. The model parameters of XGBoost include: the number of trees, the depth of the trees, the learning rate, the row sampling ratio and the column sampling ratio. To solve multi-class classification problems, XGBoost provides two loss (objective) functions: multi:softmax and multi:softprob. multi:softmax outputs the class obtained after applying softmax, whereas multi:softprob outputs the matrix of class probabilities. In the embodiments of the present application, either of these loss functions can be used to achieve the object of the application; multi:softprob is preferred, because with multi:softprob the model can output a customer churn risk level.
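The difference between the two objectives can be sketched in plain Python, without the XGBoost library itself: given the raw per-class scores for one customer, a softprob-style output is the full probability vector, while a softmax-style output is only the index of the most probable class. The three risk levels and the raw scores below are assumptions made for illustration; the patent does not state how many levels are used.

```python
import math


def softprob(raw_scores):
    """Turn per-class raw scores into a probability vector (numerically stable softmax)."""
    peak = max(raw_scores)
    exps = [math.exp(s - peak) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_class(raw_scores):
    """Return only the index of the most probable class."""
    probs = softprob(raw_scores)
    return max(range(len(probs)), key=probs.__getitem__)


# Made-up raw scores for three hypothetical risk levels: 0 = low, 1 = medium, 2 = high.
raw = [0.2, 1.5, 0.3]
probs = softprob(raw)        # probability of each risk level
level = softmax_class(raw)   # single predicted risk level
```

A probability vector keeps more information than a single class index, which is one plausible reason to prefer the softprob form when a graded risk level is wanted.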
It should also be noted that XGBoost has the following advantages: (1) simple and easy to use: compared with other machine learning libraries, users can get started with XGBoost easily and obtain good results; (2) efficient and scalable: it is fast and performs well on large-scale data sets, and has low requirements on hardware resources such as memory; (3) robust: compared with deep learning models, it can achieve comparable results without fine-grained parameter tuning.
Because XGBoost is a machine learning model and the training set used by the embodiments of the application consists of feature data related to customer churn, the trained model can effectively predict the churn risk of bank customers. In addition, XGBoost is fast and performs well on large-scale data and has low requirements on hardware resources such as memory, so the model generated by the embodiments of the application can predict the churn risk of bank customers quickly and accurately.
In one embodiment, the preprocessing the data retained by the bank customer to obtain characteristic data related to customer churn includes:
and sequentially carrying out data cleaning processing, data conversion processing and characteristic screening processing on the data retained by the bank customer to obtain the characteristic data related to customer attrition.
In the embodiment of the present application, it should be understood that data cleaning refers to processing the data source before a data warehouse is built and data mining is carried out, so that the data are accurate, complete, consistent, timely and valid and are suitable for subsequent operations. From the perspective of data quality, data cleaning is the process of handling data so that its quality improves, that is, the process of obtaining clean data.
In the embodiment of the present application, it should be noted that the format of the cleaned data may differ from the input format required by the XGBoost model, so the cleaned data needs to be converted into the input format required by the XGBoost model.
In the embodiment of the present application, it should be further noted that, when training a model for a specific purpose, if the feature engineering is done well, the choice of algorithm afterwards makes little difference; conversely, if it is done poorly, no choice of algorithm will bring a breakthrough improvement. Feature selection is therefore particularly important.
Referring to fig. 2, in an embodiment, the sequentially performing data cleansing processing, data conversion processing, and feature screening processing on the data retained by the bank customer to obtain the feature data related to customer churn includes:
S21: deleting data irrelevant to customer churn from the data retained by the bank customer to obtain first feature data;
S22: processing abnormal values in the first feature data to obtain second feature data;
S23: converting the second feature data into an input format accepted by XGBoost to obtain third feature data;
S24: discretizing the third feature data to obtain fourth feature data;
S25: selecting, from the fourth feature data, the features that are only weakly correlated with one another to obtain the feature data related to customer churn.
As mentioned in step S21, it should be noted that some attribute data in the data retained by the bank customer are unrelated to customer churn. Take as an example the case where the data retained by the bank customer include the following attributes: customer ID, customer name, customer credit score, gender, age, region, deposit/loan status, whether there is a credit card, number of financial product purchases/uses, whether it is an active user, estimated income, length of time financial product is used, whether it has been lost. Obviously, the customer ID and customer name are not factors in customer churn, that is, they are unrelated to customer churn, so these data must first be deleted from the data retained by the bank customer.
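Step S21 amounts to dropping identifier columns before any further processing. A minimal sketch, assuming records are held as dictionaries and using hypothetical column names (`CustomerId`, `Name`) and made-up values:

```python
# Hypothetical names for the two attributes the text says are unrelated to churn.
IRRELEVANT = {"CustomerId", "Name"}


def drop_irrelevant(records, irrelevant=IRRELEVANT):
    """Return copies of the records without the irrelevant attributes."""
    return [{k: v for k, v in r.items() if k not in irrelevant} for r in records]


# Made-up rows, for illustration only.
raw_records = [
    {"CustomerId": "C001", "Name": "A", "CreditScore": 640, "Age": 42},
    {"CustomerId": "C002", "Name": "B", "CreditScore": 710, "Age": 35},
]
first_feature_data = drop_irrelevant(raw_records)
```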
As mentioned in step S22, it should be noted that the data retained by the bank customer may contain abnormal values. For example, in the example shown in Table 1, records with an empty gender field, an empty credit score field, or a value that falls outside the listed options are all regarded as abnormal data. In addition, records with extreme ages, or with excessive product purchase quantities caused by promotional activities, are not meaningful for bank customer churn risk prediction from a market perspective and can therefore also be regarded as abnormal values. To obtain high-quality data, these abnormal values must be processed. They can be handled in various ways, for example by deletion, by filling in replacement values, or by other methods, and the present application places no limitation on this.
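One of the permitted treatments, deletion, can be sketched as a simple filter. The column names, the age bounds and the product-count cap below are all assumptions chosen for illustration; the patent does not specify thresholds.

```python
def is_abnormal(record, min_age=18, max_age=100, max_products=10):
    """Flag records with missing gender/credit score or implausible values.

    The thresholds are hypothetical defaults, not values from the patent.
    """
    if record.get("Gender") in (None, ""):
        return True
    if record.get("CreditScore") is None:
        return True
    age = record.get("Age")
    if age is None or not (min_age <= age <= max_age):
        return True
    if record.get("NumOfProducts", 0) > max_products:
        return True
    return False


def remove_abnormal(records, **limits):
    """Keep only the records that are not flagged as abnormal."""
    return [r for r in records if not is_abnormal(r, **limits)]


# Made-up rows: only the first one passes every check.
rows = [
    {"Gender": "Male", "CreditScore": 650, "Age": 40, "NumOfProducts": 2},
    {"Gender": "", "CreditScore": 700, "Age": 33, "NumOfProducts": 1},
    {"Gender": "Female", "CreditScore": None, "Age": 51, "NumOfProducts": 1},
    {"Gender": "Female", "CreditScore": 720, "Age": 130, "NumOfProducts": 1},
    {"Gender": "Male", "CreditScore": 600, "Age": 29, "NumOfProducts": 50},
]
second_feature_data = remove_abnormal(rows)
```

Filling in replacement values instead of deleting would follow the same shape, with `is_abnormal` deciding which fields to overwrite.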
As described in step S23, it should be noted that the format of the data after abnormal value processing may differ from the input format required by the XGBoost model, so the data needs to be converted into that format. Taking the example shown in Table 1, Gender, Area and Loan (deposit/loan status) are character-type variables and cannot be analyzed directly. They are converted into numeric values using a value-conversion toolkit. After conversion, the values are: Gender [1, 2], Area [100, 101, 102, …], Loan [201, 202].
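The patent does not name the conversion toolkit, so the following is only a hand-written sketch of the same idea: each distinct string in a column is given a consecutive integer code starting from a column-specific base (1 for Gender, 100 for Area, 201 for Loan). Assigning codes in order of first appearance is an assumption, and the category strings are made up.

```python
def build_codes(values, start):
    """Assign consecutive integer codes, starting at `start`, in order of first appearance."""
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = start + len(codes)
    return codes


def encode_column(records, column, start):
    """Replace the strings in `column` with integer codes, in place; return the mapping."""
    codes = build_codes([r[column] for r in records], start)
    for r in records:
        r[column] = codes[r[column]]
    return codes


# Made-up rows with hypothetical category values.
rows = [
    {"Gender": "Female", "Area": "North", "Loan": "Deposit"},
    {"Gender": "Male", "Area": "South", "Loan": "Loan"},
    {"Gender": "Female", "Area": "East", "Loan": "Deposit"},
]
gender_codes = encode_column(rows, "Gender", 1)
area_codes = encode_column(rows, "Area", 100)
loan_codes = encode_column(rows, "Loan", 201)
```

Returning the mapping makes it possible to apply the same codes to new customers at prediction time.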
As described in step S24, it should be noted that features that are too finely continuous carry a risk of overfitting. To speed up iteration and make the model more robust to abnormal data, the continuous features are discretized. Taking the example shown in Table 1, the two variables CreditScore and Age contain abnormal values, so they are discretized here: the credit score is divided into 5 groups (600 or less, 600-650, 650-700, 700-750, and above 750), and the age is divided into 8 groups (20 or less, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, and 80 or more), which yields the data distribution.
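The grouping described above can be sketched with the standard-library `bisect` module. The edge values come from the text; how a value lying exactly on a boundary is assigned is not specified there, so placing it in the lower group (as `bisect_left` does) is an assumption.

```python
from bisect import bisect_left

# Group boundaries taken from the text.
CREDIT_EDGES = [600, 650, 700, 750]          # 5 groups, indexed 0..4
AGE_EDGES = [20, 30, 40, 50, 60, 70, 80]     # 8 groups, indexed 0..7


def bin_index(value, edges):
    """Return the group index; a value equal to an edge goes to the lower group."""
    return bisect_left(edges, value)


# Made-up sample values, for illustration only.
credit_bins = [bin_index(v, CREDIT_EDGES) for v in (580, 600, 625, 700, 790)]
age_bins = [bin_index(v, AGE_EDGES) for v in (18, 25, 40, 67, 85)]
```

Swapping `bisect_left` for `bisect_right` would send boundary values to the upper group instead.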
As mentioned in step S25, it should be noted that when two features are strongly correlated, using them together introduces redundant information. Strongly correlated variables should therefore be removed, and features that are only weakly correlated with one another are kept as the features related to customer churn. A heat map of the correlations shows which features are strongly correlated and which are weakly correlated.
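A minimal sketch of this screening, assuming Pearson correlation and a greedy rule that keeps a feature only if it is not strongly correlated with any feature already kept. The 0.8 threshold, the column names (including the deliberately redundant `AgeInMonths`) and the values are illustrative assumptions; the patent reads the correlations off a heat map rather than prescribing a rule.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / (sx * sy)


def select_weakly_correlated(columns, threshold=0.8):
    """Greedily keep features whose |correlation| with every kept feature is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up columns; AgeInMonths is a hypothetical duplicate of Age.
columns = {
    "Age": [25, 40, 33, 58, 47],
    "Tenure": [2, 9, 1, 5, 7],
    "AgeInMonths": [300, 480, 396, 696, 564],
}
selected = select_weakly_correlated(columns)
```

The pairwise values computed by `pearson` are the same numbers a correlation heat map would display.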
In one embodiment, the characteristics associated with customer churn include: customer credit score, gender, age, region, deposit/loan status, whether there is a credit card, amount of financial product purchase/use, whether it is an active user, estimated income, and length of time to use a financial product.
In the embodiment of the present application, it should be noted that the features related to customer churn are obtained by applying data cleaning, data conversion and feature screening to the data in Table 1. As stated above, when training a model for a specific purpose, if the feature engineering is done well, the choice of algorithm afterwards makes little difference; conversely, if it is done poorly, no choice of algorithm will bring a breakthrough improvement. Feature selection is therefore particularly important. Verification shows that the features selected by the embodiment of the application can effectively predict the churn risk of bank customers.
In one embodiment, training the XGBoost model to be trained based on the feature data, the label corresponding to the feature data and the preset model parameters to obtain the bank customer churn risk prediction model includes:
training the XGBoost model to be trained based on the feature data, the labels corresponding to the feature data and the preset model parameters, and adjusting the model parameters during training according to a preset parameter adjustment strategy, to obtain the bank customer churn risk prediction model.
In one embodiment, the parameter adjustment strategy is any one of the following:
a first parameter adjustment strategy: adjusting only the number of trees;
a second parameter adjustment strategy: adjusting only the depth of the trees;
a third parameter adjustment strategy: adjusting the number and the depth of the trees simultaneously;
a fourth parameter adjustment strategy: adjusting only the learning rate;
a fifth parameter adjustment strategy: adjusting the number of trees and the learning rate simultaneously;
a sixth parameter adjustment strategy: adjusting only the row sampling ratio;
and a seventh parameter adjustment strategy: adjusting only the column sampling ratio.
In the embodiment of the present application, it should be noted that the parameter adjustment aims to obtain a better prediction effect, and the parameter adjustment may be performed automatically or manually.
The following provides an example of parameter adjustment; in practice, the adjustment can be carried out according to the specific situation.
[The parameter adjustment example is reproduced as an image in the original publication.]
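The seven strategies can be read as seven different search grids over the same five parameters. The sketch below generates the candidate parameter sets for a chosen strategy while holding the other parameters at their preset values. The parameter names follow XGBoost's scikit-learn-style interface (`n_estimators`, `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`); the candidate values and presets are hypothetical and are not taken from the patent's example.

```python
from itertools import product

# Hypothetical candidate values for each tunable parameter.
CANDIDATES = {
    "n_estimators": [100, 200, 300],      # number of trees
    "max_depth": [3, 5, 7],               # depth of the trees
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],              # row sampling ratio
    "colsample_bytree": [0.8, 1.0],       # column sampling ratio
}

# Which parameters each of the seven strategies adjusts.
STRATEGIES = {
    1: ["n_estimators"],
    2: ["max_depth"],
    3: ["n_estimators", "max_depth"],
    4: ["learning_rate"],
    5: ["n_estimators", "learning_rate"],
    6: ["subsample"],
    7: ["colsample_bytree"],
}


def parameter_grid(strategy, base_params):
    """Yield one parameter dict per combination of the parameters the strategy adjusts."""
    names = STRATEGIES[strategy]
    for combo in product(*(CANDIDATES[n] for n in names)):
        params = dict(base_params)
        params.update(zip(names, combo))
        yield params


# Hypothetical preset model parameters.
base = {"n_estimators": 100, "max_depth": 3, "learning_rate": 0.1,
        "subsample": 1.0, "colsample_bytree": 1.0}
grid3 = list(parameter_grid(3, base))  # third strategy: trees and depth together
```

Each yielded dictionary could then be used to train and evaluate one candidate model, whether the search is run automatically or the values are tried by hand.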
Second embodiment:
Referring to FIG. 3, an embodiment of the present application provides a bank customer churn risk prediction model generation apparatus, including:
an acquisition module 1, used for acquiring data retained by bank customers;
a preprocessing module 2, used for preprocessing the data retained by the bank customer to obtain feature data related to customer churn;
a training module 3, used for training an XGBoost model to be trained based on the feature data, the label corresponding to the feature data and preset model parameters to obtain the bank customer churn risk prediction model, wherein the label corresponding to the feature data indicates whether the customer corresponding to the feature data has churned.
As the obtaining module 1, it should be noted that the bank customer retention data is generally obtained from a banking system, and the bank customer retention data includes a plurality of attribute data, and generally, as shown in table 1, includes the following attributes: customer ID, customer name, customer credit score, gender, age, region, deposit/loan status, whether there is a credit card, number of financial product purchases/uses, whether it is an active user, estimated income, length of time to use a financial product, whether it has been lost.
As the preprocessing module 2 mentioned above, it should be noted that, since the purpose of the present application is to train a model capable of predicting the risk of customer churn, and not all attributes in the data retained by the bank customer belong to the factor of customer churn, in order to ensure the accuracy of the model, it is necessary to extract attributes related to customer churn from the data retained by the bank customer as features to train the model.
Regarding the training module 3, it should be noted that XGBoost (eXtreme Gradient Boosting) is mainly used to solve supervised learning problems, in which training data x_i containing multiple features are used to predict a target variable y_i. The model parameters of XGBoost include: the number of trees, the depth of the trees, the learning rate, the row sampling ratio and the column sampling ratio. To solve multi-class classification problems, XGBoost provides two loss (objective) functions: multi:softmax and multi:softprob. multi:softmax outputs the class obtained after applying softmax, whereas multi:softprob outputs the matrix of class probabilities. In the embodiments of the present application, either of these loss functions can be used to achieve the object of the application; multi:softprob is preferred, because with multi:softprob the model can output a customer churn risk level.
It should also be noted that XGBoost has the following advantages: (1) simple and easy to use: compared with other machine learning libraries, users can get started with XGBoost easily and obtain good results; (2) efficient and scalable: it is fast and performs well on large-scale data sets, and has low requirements on hardware resources such as memory; (3) robust: compared with deep learning models, it can achieve comparable results without fine-grained parameter tuning.
Because XGBoost is a machine learning model and the training set used by the embodiments of the application consists of feature data related to customer churn, the trained model can effectively predict the churn risk of bank customers. In addition, XGBoost is fast and performs well on large-scale data and has low requirements on hardware resources such as memory, so the model generated by the embodiments of the application can predict the churn risk of bank customers quickly and accurately.
In an embodiment, the preprocessing module is specifically configured to perform data cleaning, data conversion, and feature screening on data retained by the bank customer in sequence to obtain the feature data related to customer churn.
In the embodiment of the present application, it should be understood that data cleansing refers to processing a data source before constructing a data warehouse and implementing data mining, so as to implement accuracy, integrity, consistency, timeliness and validity of data, so as to adapt to the process of subsequent operations. From the perspective of improving data quality, data cleaning is a process of processing data to ensure that the data has better quality, namely a process of obtaining clean data.
In the embodiment of the present application, it should be noted that, since the format of the data after data cleaning may differ from the input format required by the XGBoost model, the cleaned data needs to be converted into the input format required by the XGBoost model.
In the embodiment of the present application, it should be further noted that, when training a model for a specific purpose, if the feature engineering is done well, the choice of algorithm afterwards makes little difference; conversely, if it is done poorly, no choice of algorithm will bring a breakthrough improvement. Therefore, feature selection is particularly important.
Referring to fig. 4, in one embodiment, the preprocessing module 2 includes:
the deleting unit 21 is configured to delete data, which is irrelevant to customer churn, in the data retained by the bank customer to obtain first feature data;
an abnormal value processing unit 22, configured to process an abnormal value in the first feature data to obtain second feature data;
a data conversion unit 23, configured to convert the second feature data into an input format conforming to the XGboost to obtain third feature data;
a discretization unit 24, configured to perform discretization processing on the third feature data to obtain fourth feature data;
and a screening unit 25, configured to screen out feature data with weak correlation from the fourth feature data, so as to obtain the feature data related to customer churn.
As for the above deleting unit, it should be noted that some attribute data in the data retained by the bank customer is irrelevant to customer churn. As an example, the data retained by the bank customer includes the following attribute data: customer ID, customer name, customer credit score, gender, age, region, deposit/loan status, whether the customer has a credit card, number of financial products purchased/used, whether the customer is an active user, estimated income, length of time using financial products, and whether the customer has churned. Obviously, the customer ID and the customer name are not factors in customer churn, that is, they are not related to customer churn, so these data must first be deleted from the data retained by the bank customer.
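By way of illustration and not limitation, the deletion step can be sketched as follows. The field names (customer_id, customer_name, and so on) and the sample records are assumptions made for this sketch; the source lists the attributes only in prose.

```python
# Assumed field names for the two attributes the text calls irrelevant.
IRRELEVANT_FIELDS = {"customer_id", "customer_name"}

def drop_irrelevant(records, irrelevant=IRRELEVANT_FIELDS):
    """Return copies of the records without the churn-irrelevant fields."""
    return [
        {key: value for key, value in record.items() if key not in irrelevant}
        for record in records
    ]

# Invented sample rows, for illustration only.
retained_data = [
    {"customer_id": 1, "customer_name": "A", "credit_score": 619, "age": 42, "churned": 1},
    {"customer_id": 2, "customer_name": "B", "credit_score": 608, "age": 41, "churned": 0},
]
first_feature_data = drop_irrelevant(retained_data)
print(first_feature_data[0])
```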
As for the above abnormal value processing unit, it should be noted that abnormal values may exist in the data retained by the bank customer. For example, in the example shown in Table 1, records with an empty gender field, an empty credit field, or a value not in the enumerated list are all considered abnormal data. In addition, from a market perspective, data with an excessively large age, or with an excessive number of product purchases caused by promotional activities, is not meaningful for bank customer churn risk prediction and can therefore also be regarded as abnormal. In order to obtain high-quality data, these abnormal values need to be processed. There are various methods for processing abnormal values: they may be deleted, supplemented, or processed by other methods, and the present application is not limited in this respect.
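By way of illustration and not limitation, the deletion variant of abnormal value processing can be sketched as follows. The field names and the age limit of 100 are assumptions made for this sketch; the source does not fix a threshold, and supplementing missing values would be an equally valid choice.

```python
def remove_abnormal(records, required=("gender", "credit_score"), max_age=100):
    """Drop records with an empty required field or an implausibly large age.

    `required` and `max_age` are hypothetical defaults, not values from the source.
    """
    cleaned = []
    for record in records:
        if any(record.get(field) in (None, "") for field in required):
            continue  # empty gender or credit field
        age = record.get("age")
        if age is not None and age > max_age:
            continue  # age outside the assumed plausible range
        cleaned.append(record)
    return cleaned

# Invented sample rows, for illustration only.
first_feature_data = [
    {"gender": "Female", "credit_score": 619, "age": 42},
    {"gender": None, "credit_score": 608, "age": 41},
    {"gender": "Male", "credit_score": None, "age": 39},
    {"gender": "Male", "credit_score": 700, "age": 130},
]
second_feature_data = remove_abnormal(first_feature_data)
print(len(second_feature_data))
```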
As for the above data conversion unit, it should be noted that, since the format of the data after abnormal value processing may differ from the input format required by the XGBoost model, the data after abnormal value processing needs to be converted into the input format required by the XGBoost model. Taking the example shown in Table 1, Gender, Area, and Loan (deposit/loan status) are character-type variables and cannot be analyzed directly. They are converted to numeric values using a value conversion toolkit. After conversion: Gender [1, 2], Area [100, 101, 102, …], Loan [201, 202].
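By way of illustration and not limitation, the conversion of character-type variables into numeric codes can be sketched as follows. The source does not name the conversion toolkit or say which category receives which code, so the helper below, the category strings, and the alphabetical code assignment are all assumptions; only the starting codes (1, 100, 201) follow the ranges given in the text.

```python
def build_codebook(values, start):
    """Assign consecutive integer codes, beginning at `start`,
    to the distinct values in sorted order (an assumed convention)."""
    return {value: start + i for i, value in enumerate(sorted(set(values)))}

def encode_column(records, field, start):
    """Return encoded copies of the records plus the codebook that was used."""
    codebook = build_codebook((r[field] for r in records), start)
    return [{**r, field: codebook[r[field]]} for r in records], codebook

# Invented sample rows; the category names are hypothetical.
second_feature_data = [
    {"Gender": "Female", "Area": "North", "Loan": "deposit"},
    {"Gender": "Male", "Area": "East", "Loan": "loan"},
    {"Gender": "Female", "Area": "South", "Loan": "deposit"},
]
rows, gender_codes = encode_column(second_feature_data, "Gender", start=1)
rows, area_codes = encode_column(rows, "Area", start=100)
third_feature_data, loan_codes = encode_column(rows, "Loan", start=201)
print(third_feature_data[0])
```

Keeping the codebooks makes it possible to apply the same mapping to new customer records at prediction time.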
As for the above discretization unit, it should be noted that overly continuous features carry a risk of overfitting; therefore, in order to increase the iteration speed and improve robustness to abnormal data, the continuous features need to be discretized. Taking the example shown in Table 1, the two variables CreditScore and Age have abnormal values and are discretized here: the credit score is divided into 5 groups (600 or less, 600-650, 650-700, 700-750, and above 750), and the age is divided into 8 groups (20 or less, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, and 80 or more), so as to obtain the data distribution.
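By way of illustration and not limitation, the grouping described above can be sketched with the standard-library bisect module. The source does not say which group a boundary value belongs to; this sketch assumes each upper edge is inclusive (for example, a score of exactly 650 falls in the 600-650 group), which is a labeled assumption.

```python
from bisect import bisect_left

CREDIT_EDGES = [600, 650, 700, 750]        # 4 edges give 5 groups
AGE_EDGES = [20, 30, 40, 50, 60, 70, 80]   # 7 edges give 8 groups

def bin_index(value, edges):
    """Return the 0-based group index of `value`.

    bisect_left places a value equal to an edge in the lower group,
    i.e. upper edges are treated as inclusive (an assumption).
    """
    return bisect_left(edges, value)

# Invented sample values, for illustration only.
credit_groups = [bin_index(score, CREDIT_EDGES) for score in (580, 600, 625, 750, 800)]
age_groups = [bin_index(age, AGE_EDGES) for age in (18, 25, 80, 85)]
print(credit_groups, age_groups)
```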
As for the above screening unit, it should be noted that when two features are strongly correlated, using both of them at the same time may introduce redundant information. Strongly correlated features should therefore be removed, and weakly correlated features should be retained as the features related to customer churn. A correlation heat map shows which features are strongly correlated and which are weakly correlated.
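By way of illustration and not limitation, the screening step can be sketched as follows: compute the pairwise Pearson correlation (the quantity a correlation heat map visualizes) and keep a feature only if it is not strongly correlated with a feature already kept. The threshold of 0.9, the greedy keep-first rule, and the sample columns are assumptions made for this sketch; the source specifies none of them.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0

def drop_redundant(columns, threshold=0.9):
    """Greedily keep features whose |correlation| with every kept feature
    stays below `threshold` (hypothetical value)."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Invented columns: "tenure" is an exact linear function of "age".
fourth_feature_data = {
    "age": [25, 35, 45, 55],
    "tenure": [1, 2, 3, 4],
    "balance": [10, 5, 20, 1],
}
selected = drop_redundant(fourth_feature_data)
print(selected)
```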
In one embodiment, the characteristics associated with customer churn include: customer credit score, gender, age, region, deposit/loan status, whether there is a credit card, amount of financial product purchase/use, whether it is an active user, estimated income, and length of time to use a financial product.
In the embodiment of the present application, it should be noted that the features related to customer churn are obtained by performing data cleaning, data conversion, and feature screening on the basis of Table 1. As stated above, when training a model for a specific purpose, if the feature engineering is done well, the choice of algorithm afterwards makes little difference; conversely, if it is done poorly, no choice of algorithm will bring a breakthrough improvement. Therefore, feature selection is particularly important. It has been verified that the features selected by the embodiment of the application can effectively predict the churn risk of bank customers.
In an embodiment, the training module is specifically configured to train the XGboost model to be trained based on the feature data, the label corresponding to the feature data, and a preset model parameter, and adjust the model parameter according to a preset parameter adjustment policy in a training process to obtain the bank customer churn risk prediction model.
In the embodiment of the present application, it should be noted that the model parameters include: the number of trees, the depth of the trees, the learning rate, the row sampling rate, and the column sampling rate.
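By way of illustration and not limitation, the five preset model parameters can be written as a configuration dictionary. The key names below follow the scikit-learn style names used by the xgboost Python package (n_estimators, max_depth, learning_rate, subsample, colsample_bytree); the numeric values and the sanity checks are placeholders assumed for this sketch and do not come from the source.

```python
# Placeholder values; the source does not disclose the preset parameters.
PRESET_PARAMS = {
    "n_estimators": 100,      # number of trees
    "max_depth": 6,           # depth of the trees
    "learning_rate": 0.1,     # learning rate
    "subsample": 0.8,         # row sampling rate
    "colsample_bytree": 0.8,  # column sampling rate
}

def validate_params(params):
    """Return a list of human-readable problems; an empty list means the
    parameters pass these basic (assumed) sanity checks."""
    errors = []
    for name in ("n_estimators", "max_depth"):
        if not (isinstance(params[name], int) and params[name] > 0):
            errors.append(f"{name} must be a positive integer")
    if not params["learning_rate"] > 0:
        errors.append("learning_rate must be positive")
    for name in ("subsample", "colsample_bytree"):
        if not 0 < params[name] <= 1:
            errors.append(f"{name} must be in (0, 1]")
    return errors

print(validate_params(PRESET_PARAMS))
```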
In the embodiment of the present application, it should be noted that the loss function adopted by the model is
In one embodiment, the tuning policy is any one of the following policies:
a first parameter adjustment strategy: adjusting only the number of trees;
a second parameter adjustment strategy: adjusting only the depth of the trees;
a third parameter adjustment strategy: adjusting the number and the depth of the trees simultaneously;
a fourth parameter adjustment strategy: adjusting only the learning rate;
a fifth parameter adjustment strategy: adjusting the number of trees and the learning rate simultaneously;
a sixth parameter adjustment strategy: adjusting only the row sampling rate;
a seventh parameter adjustment strategy: adjusting only the column sampling rate.
In the embodiment of the present application, it should be noted that the parameter adjustment aims to obtain a better prediction effect, and the parameter adjustment may be performed automatically or manually.
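By way of illustration and not limitation, automatic parameter adjustment under the seven strategies can be sketched as a candidate generator: each strategy names the parameters it may vary, and every other parameter keeps its preset value. The parameter names follow the scikit-learn style names of the xgboost Python package; the candidate values and base values are hypothetical, and the training and scoring of each candidate is deliberately left out.

```python
from itertools import product

# Strategy number -> parameters that the strategy adjusts.
STRATEGIES = {
    1: ("n_estimators",),
    2: ("max_depth",),
    3: ("n_estimators", "max_depth"),
    4: ("learning_rate",),
    5: ("n_estimators", "learning_rate"),
    6: ("subsample",),
    7: ("colsample_bytree",),
}

# Hypothetical candidate values; not taken from the source.
CANDIDATES = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.7, 0.8, 0.9],
}

def candidate_settings(strategy, base_params):
    """Yield one full parameter dict per combination of the adjusted parameters."""
    names = STRATEGIES[strategy]
    for combo in product(*(CANDIDATES[name] for name in names)):
        yield {**base_params, **dict(zip(names, combo))}

base = {"n_estimators": 100, "max_depth": 6, "learning_rate": 0.1,
        "subsample": 0.8, "colsample_bytree": 0.8}
settings = list(candidate_settings(3, base))
print(len(settings), settings[0])
```

Each yielded dictionary could then be used to train a model and compare validation results, whether the loop is run automatically or the candidates are tried by hand.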
The following provides an example of parameter adjustment, which can be adapted to specific conditions:
[Figure BDA0003681625400000161: example of parameter adjustment; image not reproduced]
Example three:
Referring to fig. 5, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing the data used by the bank customer churn risk prediction model generation method and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. When executed by the processor, the computer program implements a bank customer churn risk prediction model generation method, which comprises the following steps: acquiring data retained by a bank customer; preprocessing the data retained by the bank customer to obtain feature data related to customer churn; training an XGBoost model to be trained based on the feature data, the label corresponding to the feature data and preset model parameters to obtain the bank customer churn risk prediction model; wherein the label corresponding to the feature data indicates whether the customer corresponding to the feature data has churned.
Because the XGBoost model adopted by the embodiment of the application is a machine learning model and the training set adopted by the embodiment of the application is the feature data related to customer churn, the model generated through training can effectively predict the churn risk of bank customers. In addition, because XGBoost is fast and effective when processing large-scale data and places low demands on hardware resources such as memory, the bank customer churn risk prediction model generated by the embodiment of the application can predict the churn risk of bank customers quickly and accurately.
Example four:
an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a method for generating a bank customer churn risk prediction model, and the method includes: acquiring data retained by a bank client; preprocessing the data retained by the bank customer to obtain characteristic data related to customer loss; training an XGboost model to be trained based on the feature data, the label corresponding to the feature data and preset model parameters to obtain the bank customer loss risk prediction model; and the label corresponding to the characteristic data indicates whether the customer corresponding to the characteristic data is lost.
Because the XGBoost model adopted by the embodiment of the application is a machine learning model and the training set adopted by the embodiment of the application is the feature data related to customer churn, the model generated through training can effectively predict the churn risk of bank customers. In addition, because XGBoost is fast and effective when processing large-scale data and places low demands on hardware resources such as memory, the bank customer churn risk prediction model generated by the embodiment of the application can predict the churn risk of bank customers quickly and accurately.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all the equivalent structures or equivalent processes that can be directly or indirectly applied to other related technical fields by using the contents of the specification and the drawings of the present application are also included in the scope of the present application.

Claims (10)

1. A bank customer churn risk prediction model generation method is characterized by comprising the following steps:
acquiring data retained by a bank client;
preprocessing the data retained by the bank customer to obtain characteristic data related to customer loss;
training an XGboost model to be trained based on the feature data, the label corresponding to the feature data and preset model parameters to obtain the bank customer loss risk prediction model; and the label corresponding to the characteristic data indicates whether the customer corresponding to the characteristic data is lost.
2. The method for generating a bank customer attrition risk prediction model according to claim 1, wherein the data retained by the bank customer includes a plurality of attribute data, the attributes including customer ID, customer name, customer credit score, gender, age, region, deposit/loan status, whether the customer has a credit card, number of financial products purchased/used, whether the customer is an active user, estimated income, length of time using financial products, and whether the customer has been lost.
3. The method for generating a bank customer attrition risk prediction model according to claim 1 wherein the preprocessing of the bank customer retention data to obtain characteristic data related to customer attrition comprises:
and sequentially carrying out data cleaning processing, data conversion processing and feature screening processing on the data retained by the bank customer to obtain the feature data related to customer loss.
4. The bank customer attrition risk prediction model generation method according to claim 3, wherein the sequentially performing data cleaning processing, data conversion processing and feature screening processing on the data retained by the bank customer to obtain the feature data related to customer attrition comprises:
deleting data irrelevant to customer loss in the data retained by the bank customer to obtain first characteristic data;
processing abnormal values in the first characteristic data to obtain second characteristic data;
converting the second characteristic data into an input format conforming to the XGboost to obtain third characteristic data;
discretizing the third characteristic data to obtain fourth characteristic data;
and screening out the feature data with weak correlation from the fourth feature data to obtain the feature data related to the customer churn.
5. The bank customer attrition risk prediction model generation method of claim 2 wherein the characteristics relating to customer attrition include: customer credit score, gender, age, region, deposit/loan status, whether there is a credit card, amount of financial product purchase/use, whether it is an active user, estimated income, and length of time to use a financial product.
6. The method for generating a bank customer churn risk prediction model according to claim 1, wherein training an XGboost model to be trained based on the feature data, a label corresponding to the feature data, and preset model parameters to obtain the bank customer churn risk prediction model comprises:
training the XGBoost model to be trained based on the feature data, the labels corresponding to the feature data and preset model parameters, and adjusting the model parameters according to a preset parameter adjusting strategy in the training process to obtain the bank customer loss risk prediction model.
7. The method for generating a bank customer churn risk prediction model according to claim 6, wherein the parameter adjustment policy is any one of the following parameter adjustment policies:
a first parameter adjustment strategy: adjusting only the number of trees;
a second parameter adjustment strategy: adjusting only the depth of the trees;
a third parameter adjustment strategy: adjusting the number and the depth of the trees simultaneously;
a fourth parameter adjustment strategy: adjusting only the learning rate;
a fifth parameter adjustment strategy: adjusting the number of trees and the learning rate simultaneously;
a sixth parameter adjustment strategy: adjusting only the row sampling rate;
a seventh parameter adjustment strategy: adjusting only the column sampling rate.
8. A bank customer attrition risk prediction model generation device is characterized by comprising:
the acquisition module is used for acquiring data reserved by the bank customer;
the preprocessing module is used for preprocessing the data retained by the bank customer to obtain characteristic data related to customer loss;
the training module is used for training an XGboost model to be trained based on the feature data, the label corresponding to the feature data and preset model parameters to obtain the bank customer loss risk prediction model; and the label corresponding to the characteristic data represents whether the customer corresponding to the characteristic data is lost.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210638850.9A 2022-06-07 2022-06-07 Bank customer loss risk prediction model generation method, device, equipment and medium Pending CN115049121A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210638850.9A CN115049121A (en) 2022-06-07 2022-06-07 Bank customer loss risk prediction model generation method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210638850.9A CN115049121A (en) 2022-06-07 2022-06-07 Bank customer loss risk prediction model generation method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN115049121A true CN115049121A (en) 2022-09-13

Family

ID=83161676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210638850.9A Pending CN115049121A (en) 2022-06-07 2022-06-07 Bank customer loss risk prediction model generation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115049121A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422181A (en) * 2023-12-15 2024-01-19 湖南三湘银行股份有限公司 Fuzzy label-based method and system for early warning loss of issuing clients
CN117422181B (en) * 2023-12-15 2024-04-02 湖南三湘银行股份有限公司 Fuzzy label-based method and system for early warning loss of issuing clients

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination