CN109636591A

CN109636591A - A credit scorecard development method based on machine learning

Info

Publication number: CN109636591A
Application number: CN201811618779.8A
Authority: CN
Inventors: 陈国定; 徐英浩
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2019-04-16

Abstract

A scorecard development method based on machine learning, comprising the following steps: (1) defining the label of the target user according to vintage analysis; (2) integrating multiple data sources to obtain the final data; (3) performing exploratory analysis and data cleaning on the data; (4) binning the cleaned data with an optimized chi-square binning method; (5) performing variable selection on the binned variables; (6) constructing a logistic regression model; (7) evaluating the model; (8) converting the default probability that the model outputs for the target user into a score. The invention uses machine learning, vintage analysis, and a logistic regression model to address difficulties of the big-data era such as low manual efficiency and difficult auditing, shifting the problem from manual handling to machine handling.

Description

A credit scorecard development method based on machine learning

Technical field

The present invention relates to the fields of internet finance, machine learning, vintage analysis, logistic regression models, and computer applications, and more particularly to a credit scorecard development method based on machine learning.

Background art

With the rapid development of credit scoring models and the credit industry, model-building methods have become highly varied, ranging from the traditional statistical regression methods of the early days to today's emerging deep learning algorithms. In application, models have gradually extended from predicting the probability of default to every stage of the credit life cycle, for example the A-card, the B-card, and the subsequent post-loan C-card. However, the scorecards of most financial companies are still traditional expert scorecards, in which experienced experts lay down rules to distinguish good users from bad ones. This approach was effective when data volumes were small, but with the development of big data, scorecards that rely on human experts have become very inefficient. To address this, it is necessary to develop data-based scorecards. Replacing inefficient, hard-to-control manual review with data-driven scorecards improves the timeliness and accuracy of credit approval.

Summary of the invention

To overcome the deficiencies of the prior art, the present invention proposes a credit scorecard development method based on machine learning. It uses machine learning, vintage analysis, and a logistic regression model to address difficulties of the big-data era such as low manual efficiency and difficult auditing, shifting the problem from manual handling to machine handling.

The technical solution adopted by the present invention to solve the technical problem is:

A credit scorecard development method based on machine learning, comprising the following steps:

1) definition of target variable

Based on vintage analysis, the average overdue trend of each month is observed to determine the length of the performance period. Users whose overdue days within the performance period are fewer than 3 are defined as "good users", users whose overdue days exceed 30 are defined as "bad users", and users whose overdue days are more than 3 but fewer than 30 are defined as "gray users";
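The labeling rule above can be sketched as a small function. This is only an illustration: the function name `label_user` is hypothetical, and because the text does not say how exactly 3 or exactly 30 overdue days are treated, the sketch assumes both boundaries fall into the gray group.

```python
def label_user(overdue_days: int) -> str:
    """Map a user's overdue days in the performance period to a label."""
    if overdue_days < 3:
        return "good"
    if overdue_days > 30:
        return "bad"
    # Assumption: the boundary values 3 and 30 are treated as gray.
    return "gray"


labels = [label_user(d) for d in (0, 10, 45)]
```

Gray users are typically excluded from model training so that the good and bad classes are well separated, although the text does not state this explicitly.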

2) acquisition of data

The data come from a variety of sources, including the financial institution's own fields, such as the user's age, registered residence, gender, income, debt ratio, and borrowing behavior at the institution;

and third-party data: historical consumption data, borrowing and lending behavior at other institutions, and online shopping behavior;

3) Exploratory data analysis (EDA)

Understand the state of the data: the missing values, outliers, mean, median, maximum, minimum, and distribution of each field, in order to formulate a data preprocessing plan;

4) Data cleaning

Dirty data, missing values, and outliers in the raw data are handled. For missing values, variables whose missing rate exceeds a given threshold are deleted; for variables whose missing rate is below the threshold, the samples with missing values are treated as prediction targets and a random forest is used to predict the values to fill in. Outliers are handled by treating the outlier as a separate state;
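The missing-value rule can be illustrated with a short sketch that only decides which variables to drop and which to impute; the random forest imputation itself is not shown. The function name `split_by_missing_rate`, the dictionary layout, and the threshold of 0.5 are assumptions made for the example, not values given in the text.

```python
def split_by_missing_rate(columns, threshold=0.5):
    """Split variables into those to drop and those to impute.

    columns: dict mapping variable name -> list of values, None = missing.
    threshold: hypothetical missing-rate cutoff (not specified in the text).
    """
    drop, impute = [], []
    for name, values in columns.items():
        rate = sum(v is None for v in values) / len(values)
        if rate > threshold:
            drop.append(name)      # too sparse: delete the variable
        elif rate > 0:
            impute.append(name)    # fill by a predictive model later
    return drop, impute


data = {
    "income": [1, None, None, None],
    "age": [30, 41, None, 25],
    "gender": [0, 1, 1, 0],
}
to_drop, to_impute = split_by_missing_rate(data)
```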

5) Variable binning

Chi-square binning is used in combination with multiple business constraints; the constraints include the minimum sample proportion of each group, the maximum number of bins, and WOE monotonicity;

The improved variable binning procedure is as follows:

1. Input: the maximum number of bins n;

2. Initialization

i) Continuous values are sorted in ascending order; discrete values are first converted into the bad-customer ratio and then sorted in ascending order;

ii) To reduce the amount of computation, variables whose number of distinct values exceeds a certain threshold (100) are first coarsely binned using equal-frequency binning; variables whose number of distinct values is less than the maximum number of bins are not binned;

iii) If there are missing values, the missing values are placed in a separate bin;

3. Merging intervals

i) Calculate the chi-square value of each pair of adjacent intervals;

ii) Merge the pair of intervals with the smallest chi-square value;

$$\chi^2 = \sum_{i} \sum_{j} \frac{(A_{ij} - E_{ij})^2}{E_{ij}} \qquad (1)$$

A_ij: the number of instances of class j in interval i;

E_ij: the expected frequency of A_ij, $E_{ij} = N_i \times C_j / N$, where N is the number of samples in the merged interval, N_i is the number of samples in group i, and C_j is the number of class-j samples in the merged interval;

iii) Repeat the above steps until the number of bins is no greater than n;

4. Post-processing of bins

i) Bins whose bad-customer ratio is 0 or 1 are merged (a bin must not contain only good customers or only bad customers);

ii) Check whether the WOE is monotonic after binning; if monotonicity is not satisfied, merge bins as follows:

Step 4.1: merge the bin with the previous bin and calculate the chi-square value chi2_1;

Step 4.2: merge the bin with the next bin and calculate the chi-square value chi2_2;

Step 4.3: if chi2_1 > chi2_2, merge the bin with the next bin; otherwise merge it with the previous bin, until the WOE is monotonic;

iii) Check the sample proportion of each bin; where the sample proportion of a bin exceeds 95%, merge bins:

Step 4.4: merge the bin with the previous bin and calculate the chi-square value chi2_3;

Step 4.5: merge the bin with the next bin and calculate the chi-square value chi2_4;

Step 4.6: if chi2_3 > chi2_4, merge the bin with the next bin; otherwise merge it with the previous bin, until the sample proportion of every bin is greater than 5%;

5. Output the binned data and the bin intervals
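The initialization and merging stages of the procedure above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, each bin is represented as a `(good, bad)` count pair, the chi-square is the standard two-interval, two-class statistic, and the post-processing stage (WOE monotonicity and sample-proportion checks) is omitted.

```python
def equal_frequency_edges(values, n_bins):
    """Cut points that split the sorted values into near-equal groups."""
    ordered = sorted(values)
    n = len(ordered)
    return sorted({ordered[k * n // n_bins] for k in range(1, n_bins)})


def chi2_pair(a, b):
    """Chi-square of two adjacent intervals given (good, bad) counts."""
    total = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        for j in range(2):
            expected = sum(row) * (a[j] + b[j]) / total  # N_i * C_j / N
            if expected > 0:
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2


def chi_merge(bins, max_bins):
    """Merge the most similar adjacent pair until len(bins) <= max_bins."""
    bins = [tuple(b) for b in bins]
    while len(bins) > max_bins:
        scores = [chi2_pair(bins[k], bins[k + 1]) for k in range(len(bins) - 1)]
        k = scores.index(min(scores))
        merged = (bins[k][0] + bins[k + 1][0], bins[k][1] + bins[k + 1][1])
        bins[k:k + 2] = [merged]
    return bins


edges = equal_frequency_edges(list(range(1, 11)), 5)
merged_bins = chi_merge([(90, 10), (88, 12), (50, 50), (10, 90)], max_bins=3)
```

In the example, the first two bins have nearly identical bad rates, so they produce the smallest chi-square value and are merged first.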

Explanation of the WOE calculation in binning:

For an independent variable, the WOE value of the i-th bin is:

$$WOE_i = \ln\frac{p_{i1}}{p_{i0}} = \ln\frac{\#B_i / \#B_T}{\#G_i / \#G_T} \qquad (2)$$

The variables in formula (2) are as follows:

p_i1: the proportion of bad customers in the i-th bin among all bad customers;

p_i0: the proportion of good customers in the i-th bin among all good customers;

#B_i: the number of bad customers in the i-th bin;

#G_i: the number of good customers in the i-th bin;

#B_T: the total number of bad customers;

#G_T: the total number of good customers;
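As a hedged illustration of the WOE definitions above, the sketch below computes the WOE of each bin from `(good, bad)` counts and checks monotonicity. The function names are hypothetical, and the sign convention (bad share divided by good share) is an assumption based on the order in which the symbols are listed; it requires every bin to contain both good and bad customers, which the post-processing step is meant to guarantee.

```python
import math


def woe_per_bin(bins):
    """WOE of each bin; bins is a list of (good, bad) counts."""
    good_total = sum(g for g, _ in bins)
    bad_total = sum(b for _, b in bins)
    # Assumed convention: ln(bad share / good share).
    return [math.log((b / bad_total) / (g / good_total)) for g, b in bins]


def is_monotonic(values):
    """True if the sequence is entirely non-decreasing or non-increasing."""
    pairs = list(zip(values, values[1:]))
    return all(x <= y for x, y in pairs) or all(x >= y for x, y in pairs)


woe = woe_per_bin([(80, 20), (20, 80)])
```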

6) Variable Selection

Variable selection is based on the IV value. The IV value of the i-th bin is calculated as follows:

$$IV_i = (p_{i1} - p_{i0}) \times WOE_i \qquad (3)$$

The IV value of a variable is the sum of the IV values of all its bins:

$$IV = \sum_{i} IV_i \qquad (4)$$
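A minimal sketch of the IV calculation, assuming the standard definition in which each bin contributes (bad share minus good share) times its WOE. The function name `information_value` and the `(good, bad)` bin representation are illustrative choices.

```python
import math


def information_value(bins):
    """IV of a variable; bins is a list of (good, bad) counts per bin."""
    good_total = sum(g for g, _ in bins)
    bad_total = sum(b for _, b in bins)
    iv = 0.0
    for good, bad in bins:
        p_bad = bad / bad_total
        p_good = good / good_total
        iv += (p_bad - p_good) * math.log(p_bad / p_good)
    return iv


iv_strong = information_value([(80, 20), (20, 80)])
iv_none = information_value([(50, 50), (50, 50)])
```

A variable whose bins all have the same bad rate has an IV of zero, which is why a low-IV cutoff removes uninformative variables.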

After the IV value of each variable is calculated, a subset of features is selected based on IV, as follows:

Step 6.1: sort by IV value in ascending order and select the variables whose IV value is greater than 0.02;

Step 6.2: calculate the pairwise correlation of the variables using the Pearson correlation coefficient; when the correlation coefficient between two variables exceeds the threshold, delete the variable with the lower IV value;

Step 6.3: measure the multicollinearity between a variable and the other variables using VIF; when the VIF of a variable exceeds the threshold (typically set to 10 or 7), explanatory variables are removed one by one, choosing the one with the lower IV value for deletion;

The VIF and the Pearson correlation coefficient are explained below:

i) The closer the Pearson correlation coefficient is to 0, the weaker the linear correlation between the two variables; the closer it is to 1 or -1, the stronger the correlation. The formula is as follows:

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} \qquad (5)$$

In formula (5), cov(X, Y) is the covariance of the two variables, σ_X is the standard deviation of variable X, and σ_Y is the standard deviation of variable Y;

ii) A VIF greater than 10 usually indicates significant multicollinearity among the explanatory variables. The formula is as follows:

$$VIF_i = \frac{1}{1 - R_i^2} \qquad (6)$$

In formula (6), R_i is the multiple correlation coefficient between X_i and the other variables.

In formula (7), the fitted term is the linear expression of the other variables;
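The correlation-based screening can be sketched as below. This is an illustrative greedy variant, not necessarily the exact procedure of the invention: variables are visited in descending IV order and kept only if they are not highly correlated with an already kept variable. The function names and the threshold of 0.7 are assumptions. The VIF helper covers only the two-variable case, where the R² of regressing one variable on the other equals the squared Pearson coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def prune_correlated(features, iv, threshold=0.7):
    """Keep the higher-IV variable of each highly correlated pair."""
    kept = []
    for name in sorted(features, key=lambda k: iv[k], reverse=True):
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


def vif_two_variables(x, y):
    """VIF when only two explanatory variables are present."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r * r)


features = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 11], "c": [5, 1, 4, 2, 3]}
iv = {"a": 0.3, "b": 0.2, "c": 0.1}
kept = prune_correlated(features, iv)
```

With more than two explanatory variables, R_i² would come from regressing X_i on all the others, for which a statistics library is the usual choice.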

7) Constructing the logistic regression model

This mainly includes constructing a preliminary logistic regression model, selecting variables according to their p-values, screening according to the sign of each variable's coefficient, and obtaining the final logistic regression model;

8) Model evaluation

Because this is an imbalanced-data problem, in which the number of positive samples in the sample set far exceeds the number of negative samples, AUC (the area under the ROC curve) is used to evaluate the quality of the model, and KS is also used to judge the model's ability to discriminate between good and bad users;
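Both evaluation metrics can be computed directly from labels and predicted scores. The sketch below uses the pairwise-ranking definition of AUC and the maximum gap between the cumulative rates of the two classes for KS. It assumes label 1 marks the class the score is meant to rank higher; the function names are illustrative, and the quadratic-time AUC is only suitable for small examples.

```python
def auc(labels, scores):
    """Chance that a random positive outranks a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks(labels, scores):
    """Largest gap between the two classes' cumulative rates."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for t in sorted(set(scores)):
        tpr = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= t) / n_pos
        fpr = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= t) / n_neg
        best = max(best, abs(tpr - fpr))
    return best


y_true = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.1]
model_auc = auc(y_true, y_score)
model_ks = ks(y_true, y_score)
```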

9) Converting probability to score

The logistic regression model ultimately outputs the probability that a user defaults. To improve the practicality of the credit scoring model, the probability value can be converted into a credit score. A scaling method is generally used: a linear transformation is applied to the logarithm of the good/bad ratio and a constant is added, so that the score falls within a preset range, and the higher the score, the better the credit. The conversion formula (8) is as follows:

Score = offset + factor * ln(odds)    (8)

where odds = p/(1-p), and p denotes the probability that the user is a bad customer; factor denotes the coefficient of the linear transformation, usually set to 2/ln 2; offset denotes the adjustment constant;

How factor and offset are set is the key to credit scoring. It is usually first assumed that the score corresponding to a good/bad ratio of 50:1 is 600 points, and that on this basis the good/bad ratio doubles for every 20-point increase in score (the "points to double the odds", PDO, is set to 20), giving the system of equations:

Solving gives:

Finally, the scoring formula is obtained:

In formula (11):

A: the intercept;

woe_ij: the WOE value of the current bin j of variable i;

β_i: the regression coefficient of variable i;

N: the number of variables;

J: the number of bins.
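The scaling step can be worked through numerically. The sketch below solves for factor and offset from the two anchor conditions stated above (600 points at a good/bad ratio of 50:1, and 20 points to double the ratio), which gives factor = 20/ln 2. It assumes the logarithm is taken of the good-to-bad ratio (1-p)/p so that higher scores mean better credit, and that the logistic model predicts the log-odds of being a bad customer as A + Σ β_i · woe_ij. The function names are hypothetical, and the last function is a derivation under those assumptions rather than a transcription of formula (11).

```python
import math


def scaling(base_score=600.0, base_odds=50.0, pdo=20.0):
    """Solve score = offset + factor * ln(odds) for the two anchors."""
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return factor, offset


def score_from_probability(p_bad, factor, offset):
    """Score from the predicted probability of being a bad customer."""
    good_bad_odds = (1.0 - p_bad) / p_bad  # assumption: good:bad ratio
    return offset + factor * math.log(good_bad_odds)


def score_from_woe(intercept, betas, woes, factor, offset):
    """Score from the fitted model, assuming it predicts ln(bad odds)."""
    log_bad_odds = intercept + sum(b * w for b, w in zip(betas, woes))
    return offset - factor * log_bad_odds


factor, offset = scaling()
score_50 = score_from_probability(1 / 51, factor, offset)    # odds 50:1
score_100 = score_from_probability(1 / 101, factor, offset)  # odds 100:1
```

Writing the score as a sum over variables makes it possible to publish a points table per bin, which is what makes the scorecard usable without running the model.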

The technical concept of the invention is as follows: first, each variable is binned, according to the possible states of its values, using the optimized chi-square binning method; second, WOE conversion is applied to the binned variables, which brings the features onto a common scale and also converts nonlinear features into linear ones; then, features are screened by IV value, Pearson correlation coefficient, VIF, and similar methods; next, the selected features are fed into the logistic regression model to verify their validity; finally, the default probability that the model outputs for each user is converted into a score.

The beneficial effects of the invention are mainly as follows: 1. An optimization method is introduced into variable grouping, taking maximization of the IV value as the optimization objective and combining multiple business constraints, such as WOE monotonicity and the minimum number of samples per group, so that the predictive power of the variables is increased as much as possible while the reasonableness of the results is ensured; 2. By means of feature selection methods such as maximum IV value, Pearson correlation coefficient, and VIF, an approach is provided to the problem of difficult feature selection on high-dimensional internet data.

Brief description of the drawings

Fig. 1 is the flow chart of the optimized chi-square binning;

Fig. 2 is the KS curve of the logistic regression model.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings.

Referring to Fig. 1, a scorecard development method based on machine learning is provided. Carrying out the method addresses low manual review efficiency and, at the same time, the difficulty of feature selection on high-dimensional data. The invention can be applied to scorecard development in internet finance, in the scenario shown in Fig. 1. The optimization method designed for this problem mainly includes the following steps:

1) definition of target variable

Based on vintage analysis, the average overdue trend of each month is observed to determine the length of the performance period. Users whose overdue days within the performance period are fewer than 3 are defined as "good users", users whose overdue days exceed 30 are defined as "bad users", and users whose overdue days are more than 3 but fewer than 30 are defined as "gray users";

2) acquisition of data

The data come from a variety of sources, mainly the financial institution's own fields, such as the user's age, registered residence, gender, income, debt ratio, and borrowing behavior at the institution;

and third-party data, such as historical consumption data, borrowing and lending behavior at other institutions, and online shopping behavior;

3) Exploratory data analysis (EDA)

Understand the general state of the data, for example the missing values, outliers, mean, median, maximum, minimum, and distribution of each field, in order to formulate a reasonable data preprocessing plan;

4) Data cleaning

Dirty data, missing values, and outliers in the raw data are handled. For missing values, variables whose missing rate exceeds a given threshold are deleted; for variables whose missing rate is below the threshold, the samples with missing values can be treated as prediction targets and a random forest used to predict the values to fill in. Outliers are handled by treating the outlier as a separate state, as shown in Fig. 1;

5) Variable binning

Chi-square binning is used in combination with multiple business constraints, such as the minimum sample proportion of each group, the maximum number of bins, and WOE monotonicity;

The improved variable binning procedure is as follows:

1. Input: the maximum number of bins n;

2. Initialization

3. Merging intervals

i) Calculate the chi-square value of each pair of adjacent intervals;

ii) Merge the pair of intervals with the smallest chi-square value;

A_ij: the number of instances of class j in interval i;

4. Post-processing of bins

5. Output the binned data and the bin intervals

Explanation of the WOE calculation in binning:

For an independent variable, the WOE value of the i-th bin is:

The variables in formula (2) are as follows:

p_i1: the proportion of bad customers in the i-th bin among all bad customers;

p_i0: the proportion of good customers in the i-th bin among all good customers;

#B_i: the number of bad customers in the i-th bin;

#G_i: the number of good customers in the i-th bin;

#B_T: the total number of bad customers;

#G_T: the total number of good customers;

6) Variable Selection

The VIF and the Pearson correlation coefficient are explained below:

In formula (6), R_i is the multiple correlation coefficient between X_i and the other variables;

In formula (7), the fitted term is the linear expression of the other variables;

7) Constructing the logistic regression model

This includes constructing a preliminary logistic regression model, selecting variables according to their p-values, screening according to the sign of each variable's coefficient, and obtaining the final logistic regression model;

8) model evaluation

9) Converting probability to score

The logistic regression model ultimately outputs the probability that a user defaults. To improve the practicality of the credit scoring model, the probability value can be converted into a credit score. A scaling method is generally used: a linear transformation is applied to the logarithm of the good/bad ratio and a constant is added, so that the score falls within a preset range, and the higher the score, the better the credit. The conversion formula (8) is as follows:

Score = offset + factor * ln(odds)    (8)

Solving gives:

Finally, the scoring formula is obtained:

In formula (11):

A: the intercept;

woe_ij: the WOE value of the current bin j of variable i;

β_i: the regression coefficient of variable i;

N: the number of variables;

J: the number of bins.

Claims

1. A credit scorecard development method based on machine learning, characterized in that the method comprises the following steps:

1) Definition of the target variable

Based on vintage analysis, the average overdue trend of each month is observed to determine the length of the performance period; users whose overdue days within the performance period are fewer than 3 are defined as "good users", users whose overdue days exceed 30 are defined as "bad users", and users whose overdue days are more than 3 but fewer than 30 are defined as "gray users";

2) Acquisition of data

The data come from a variety of sources, including the financial institution's own fields: the user's age, registered residence, gender, income, debt ratio, and borrowing behavior at the institution;

3) Exploratory data analysis (EDA)

Understand the general state of the data, namely the missing values, outliers, mean, median, maximum, minimum, and distribution of each field, in order to formulate a data preprocessing plan;

4) Data cleaning

Dirty data, missing values, and outliers in the raw data are handled; for missing values, variables whose missing rate exceeds a given threshold are deleted, and for variables whose missing rate is below the threshold, the samples with missing values can be treated as prediction targets and a random forest used to predict the values to fill in; outliers are handled by treating the outlier as a separate state;

5) Variable binning

Chi-square binning is used in combination with multiple business constraints, the constraints including the minimum sample proportion of each group, the maximum number of bins, or WOE monotonicity;

#G_T: the total number of good customers;

6) Variable selection

Step 6.2: calculate the pairwise correlation of the variables using the Pearson correlation coefficient; when the correlation coefficient between two variables exceeds the threshold, delete the variable with the lower IV value;

Step 6.3: measure the multicollinearity between a variable and the other variables using VIF; when the VIF of a variable exceeds the threshold, explanatory variables are removed one by one, choosing the one with the lower IV value for deletion;

The VIF and the Pearson correlation coefficient are explained below:

i) The closer the Pearson correlation coefficient is to 0, the weaker the linear correlation between the two variables; the closer it is to 1 or -1, the stronger the correlation. The formula is as follows:

In formula (5), cov(X, Y) is the covariance of the two variables, σ_X is the standard deviation of variable X, and σ_Y is the standard deviation of variable Y;

In formula (7), the fitted term is the linear expression of the other variables;

7) Constructing the logistic regression model

This includes constructing a preliminary logistic regression model, selecting variables according to their p-values, screening according to the sign of each variable's coefficient, and obtaining the final logistic regression model;

8) Model evaluation

Because this is an imbalanced-data problem, in which the number of positive samples in the sample set far exceeds the number of negative samples, AUC is used to evaluate the quality of the model, and KS is also used to judge the model's ability to discriminate between good and bad users;

9) Converting probability to score

Score = offset + factor * ln(odds)    (8)

The logistic regression model ultimately outputs the probability that a user defaults. To improve the practicality of the credit scoring model, the probability value can be converted into a credit score using a scaling method: a linear transformation is applied to the logarithm of the good/bad ratio and a constant is added, so that the score falls within a preset range, and the higher the score, the better the credit.

2. The credit scorecard development method based on machine learning according to claim 1, characterized in that, in step 9), the conversion formula (8) is as follows:

where odds = p/(1-p), and p denotes the probability that the user is a bad customer; factor denotes the coefficient of the linear transformation, set to 2/ln 2; offset denotes the adjustment constant;

How factor and offset are set is the key to credit scoring. It is first assumed that the score corresponding to a good/bad ratio of 50:1 is 600 points, and that on this basis the good/bad ratio doubles for every 20-point increase in score (the "points to double the odds", PDO, is set to 20), giving the system of equations:

Solving gives:

Finally, the scoring formula is obtained:

In formula (11):

A: the intercept;

woe_ij: the WOE value of the current bin j of variable i;

β_i: the regression coefficient of variable i;

N: the number of variables;

J: the number of bins.

3. The credit scorecard development method based on machine learning according to claim 1 or 2, characterized in that, in step 5), the improved variable binning procedure is as follows:

1. Input: the maximum number of bins n;

2. Initialization

ii) To reduce the amount of computation, variables whose number of distinct values exceeds a set threshold are first coarsely binned using equal-frequency binning; variables whose number of distinct values is less than the maximum number of bins are not binned;

3. Merging intervals

i) Calculate the chi-square value of each pair of adjacent intervals;

ii) Merge the pair of intervals with the smallest chi-square value;

A_ij: the number of instances of class j in interval i;

E_ij: the expected frequency of A_ij, where N is the number of samples in the merged interval, N_i is the number of samples in group i, and C_j is the number of class-j samples in the merged interval;

4. Post-processing of bins

i) Bins whose bad-customer ratio is 0 or 1 are merged; a bin must not contain only good customers or only bad customers;

ii) Check whether the WOE is monotonic after binning; if monotonicity is not satisfied, merge bins as follows:

Step 4.1: merge the bin with the previous bin and calculate the chi-square value chi2_1;

Step 4.3: if chi2_1 > chi2_2, merge the bin with the next bin; otherwise merge it with the previous bin, until the WOE is monotonic;

iii) Check the sample proportion of each bin; where the sample proportion of a bin exceeds 95%, merge bins:

Step 4.4: merge the bin with the previous bin and calculate the chi-square value chi2_3;

Step 4.6: if chi2_3 > chi2_4, merge the bin with the next bin; otherwise merge it with the previous bin, until the sample proportion of every bin is greater than 5%;

5. Output the binned data and the bin intervals

Explanation of the WOE calculation in binning:

For an independent variable, the WOE value of the i-th bin is:

The variables in formula (2) are as follows:

p_i1: the proportion of bad customers in the i-th bin among all bad customers;

#B_i: the number of bad customers in the i-th bin;

#G_i: the number of good customers in the i-th bin;

#B_T: the total number of bad customers.