CN109636591A - A credit scorecard development method based on machine learning - Google Patents

A credit scorecard development method based on machine learning

Info

Publication number
CN109636591A
Authority
CN
China
Prior art keywords
value
variable
binning
bin
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811618779.8A
Other languages
Chinese (zh)
Inventor
陈国定
徐英浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN201811618779.8A
Publication of CN109636591A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof

Landscapes

  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A scorecard development method based on machine learning, comprising the following steps: (1) define the labels of target users according to vintage analysis; (2) integrate multiple data sources to obtain the final data; (3) perform exploratory analysis and data cleaning on the data; (4) bin the cleaned data using an optimized chi-square binning method; (5) select variables among the binned variables; (6) build a logistic regression model; (7) evaluate the model; (8) convert the default probability that the model outputs for a target user into a score. Using machine learning, vintage analysis and a logistic regression model, the invention addresses the low efficiency and difficulty of manual review in the big data era by shifting the task from manual handling to machine handling.

Description

A credit scorecard development method based on machine learning
Technical field
The present invention relates to the fields of internet finance, machine learning, vintage analysis, logistic regression models and computer applications, and in particular to a credit scorecard development method based on machine learning;
Background art
With the rapid development of credit scoring models and the credit industry, model-building methods have diversified, from the traditional statistical regression methods used at the beginning to the deep learning algorithms emerging today. In application, the models have gradually spread from predicting default probability to every stage of the credit life cycle, such as the A card, the B card and the subsequent post-loan C card. However, the scorecards of ordinary financial companies are still traditional expert scorecards, in which experienced experts lay down rules to separate good users from bad ones. This approach works when the data volume is small, as in the early days, but with the development of big data such manually built expert scorecards become very inefficient. To address this, developing data-based scorecards is necessary. Replacing inefficient, hard-to-control manual review with data-driven scorecards improves the timeliness and accuracy of credit approval;
Summary of the invention
To overcome the deficiencies of the prior art, the present invention proposes a credit scorecard development method based on machine learning. Using machine learning, vintage analysis and a logistic regression model, it addresses the low efficiency and difficulty of manual review in the big data era by shifting the task from manual handling to machine handling.
The technical solution adopted by the present invention to solve this technical problem is as follows:
A credit scorecard development method based on machine learning, comprising the following steps:
1) Definition of the target variable
Based on vintage analysis, observe the trend of the average overdue rate month by month and determine the length of the performance window. Within the performance window, a user overdue for fewer than 3 days is defined as a "good user", a user overdue for more than 30 days is defined as a "bad user", and a user overdue for more than 3 days but fewer than 30 days is defined as a "gray user";
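The labeling rule above can be sketched in a few lines. This is only an illustration: the function name `label_user` is hypothetical, and because the text does not say how exactly 3 or exactly 30 days overdue are treated, the sketch assumes both boundary values fall into the gray group.

```python
def label_user(max_days_overdue: int) -> str:
    """Label a user from the maximum days overdue inside the performance window."""
    if max_days_overdue < 3:
        return "good"
    if max_days_overdue > 30:
        return "bad"
    # Assumption: the boundary values 3 and 30 are treated as gray.
    return "gray"


labels = [label_user(d) for d in (0, 3, 10, 30, 45)]
```

Gray users are typically left out of model training so that the good and bad classes stay clearly separated, although the text itself does not state this.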
2) Data acquisition
The data come from a variety of sources, including the financial institution's own fields, such as the user's age, registered residence, gender, income, debt ratio and borrowing behavior at the institution;
and third-party data: historical consumption data, borrowing and lending behavior at other institutions, and online shopping behavior;
3) EDA (exploratory data analysis)
Examine the data: the missing values, outliers, mean, median, maximum, minimum and distribution of each field, in order to formulate a data preprocessing scheme;
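A minimal sketch of the per-field summary described above, using only the Python standard library. The function name `summarize_field`, the use of `None` for a missing value and the sample data are assumptions made for illustration.

```python
from statistics import mean, median


def summarize_field(values):
    """Return basic EDA statistics for one numeric field; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "missing_rate": 1 - len(present) / len(values),
        "mean": mean(present),
        "median": median(present),
        "min": min(present),
        "max": max(present),
    }


ages = [23, 35, None, 41, 29, None, 52, 35]  # made-up sample field
stats = summarize_field(ages)
```

A field with no observed values would make `mean` raise an error, so a real pipeline would need to guard against that case.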
4) Data cleaning
Handle dirty data, missing values and outliers in the raw data. For missing values, variables whose missing rate exceeds a given threshold are deleted; for variables whose missing rate is below the threshold, the missing entries are treated as prediction targets and filled with values predicted by a random forest. Outliers are handled by treating the outlier as a separate state;
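The first half of the missing-value rule (drop versus impute) can be sketched as follows. The threshold of 0.5, the function name `split_by_missing_rate` and the toy table are assumptions; the text only speaks of "a given threshold". The random forest imputation itself is not shown, since it would need a modelling library.

```python
def split_by_missing_rate(table, threshold=0.5):
    """Split columns into those to drop and those to impute.

    table maps a column name to a list of values, with None marking a missing value.
    """
    drop, impute = [], []
    for name, values in table.items():
        rate = sum(v is None for v in values) / len(values)
        if rate > threshold:
            drop.append(name)      # too sparse: delete the variable
        elif rate > 0:
            impute.append(name)    # fill later, e.g. with a random forest prediction
    return drop, impute


table = {
    "income": [5000, None, 7200, 6100],
    "phone_score": [None, None, None, 0.7],
    "age": [31, 45, 28, 39],
}
to_drop, to_impute = split_by_missing_rate(table)
```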
5) Variable binning
Chi-square binning is used in combination with several business constraints: the minimum sample share of each bin, the maximum number of bins, and WOE monotonicity;
The improved variable binning procedure is as follows:
1. Input: the maximum number of bins n;
2. Initialization
i) Continuous values are sorted in ascending order; discrete values are first converted to the bad-customer rate of each value and then sorted in ascending order;
ii) To reduce computation, a variable whose number of states exceeds a threshold (100) is first coarsely binned by equal-frequency binning; a variable whose number of states is below the maximum number of bins is not binned;
iii) If there are missing values, they are placed in a separate bin;
3. Interval merging
i) Compute the chi-square value of every pair of adjacent intervals;
ii) Merge the pair of intervals with the smallest chi-square value;
$$\chi^2 = \sum_{i}\sum_{j}\frac{(A_{ij}-E_{ij})^2}{E_{ij}} \qquad (1)$$
$A_{ij}$: the number of samples of class j in interval i;
$E_{ij}$: the expected frequency of $A_{ij}$, $E_{ij} = N_i \times C_j / N$, where N is the number of samples in the merged interval, $N_i$ is the number of samples in group i, and $C_j$ is the number of class-j samples in the merged interval;
iii) Repeat the above steps until the number of bins is no greater than n;
4. Post-processing of the bins
i) Bins whose bad-customer rate is 0 or 1 are merged (a bin may not contain only good customers or only bad customers);
ii) Check whether the WOE is monotonic after binning; if monotonicity is not satisfied, merge bins as follows:
Step 4.1: merge the bin with the previous bin and compute the chi-square value chi2_1;
Step 4.2: merge the bin with the next bin and compute the chi-square value chi2_2;
Step 4.3: if chi2_1 > chi2_2, merge the bin with the next bin, otherwise merge it with the previous bin, until the WOE is monotonic;
iii) Check the sample share of each bin; where a bin's sample share exceeds 95%, merge bins as follows:
Step 4.4: merge the bin with the previous bin and compute the chi-square value chi2_3;
Step 4.5: merge the bin with the next bin and compute the chi-square value chi2_4;
Step 4.6: if chi2_3 > chi2_4, merge the bin with the next bin, otherwise merge it with the previous bin, until the sample share of every bin is greater than 5%;
5. Output the binned data and the bin intervals
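The interval-merging loop of step 3 can be sketched as below. It is a simplified illustration, not the patented procedure in full: each bin is represented only by its (good count, bad count) pair, the names `chi2_pair` and `chi_merge` are hypothetical, and the post-processing of step 4 (pure bins, WOE monotonicity, sample share) is left out.

```python
def chi2_pair(a, b):
    """Chi-square statistic of two adjacent bins, each given as (good_count, bad_count)."""
    total = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        for j in range(2):
            # Expected frequency E_ij = N_i * C_j / N
            expected = sum(row) * (a[j] + b[j]) / total
            if expected > 0:
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2


def chi_merge(bins, max_bins):
    """Merge the adjacent pair with the smallest chi-square until len(bins) <= max_bins."""
    bins = [tuple(b) for b in bins]
    while len(bins) > max_bins:
        scores = [chi2_pair(bins[k], bins[k + 1]) for k in range(len(bins) - 1)]
        k = scores.index(min(scores))
        merged = (bins[k][0] + bins[k + 1][0], bins[k][1] + bins[k + 1][1])
        bins[k:k + 2] = [merged]
    return bins


# Four ordered bins of made-up counts; the first two have similar bad rates.
result = chi_merge([(90, 10), (88, 12), (60, 40), (30, 70)], max_bins=3)
```

In this example the first two bins have nearly the same bad rate, so their chi-square value is the smallest and they are merged first.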
Explanation of the WOE calculation used in binning:
For an independent variable, the WOE value of the i-th bin is:
$$WOE_i = \ln\frac{p_{i1}}{p_{i0}} = \ln\frac{\#B_i/\#B_T}{\#G_i/\#G_T} \qquad (2)$$
The variables in formula (2) are as follows:
$p_{i1}$: the proportion of all bad customers that fall in the i-th bin
$p_{i0}$: the proportion of all good customers that fall in the i-th bin
$\#B_i$: the number of bad customers in the i-th bin
$\#G_i$: the number of good customers in the i-th bin
$\#B_T$: the total number of bad customers
$\#G_T$: the total number of good customers
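As a rough illustration of the WOE calculation and of the monotonicity check used in the post-processing step, the sketch below assumes WOE is the logarithm of the bad share over the good share of each bin, and that every bin contains at least one good and one bad customer. The names `woe_per_bin` and `is_monotonic` and the sample counts are hypothetical.

```python
import math


def woe_per_bin(bins):
    """bins is a list of (good_count, bad_count); returns ln(p_i1 / p_i0) for each bin."""
    good_total = sum(g for g, _ in bins)
    bad_total = sum(b for _, b in bins)
    return [math.log((b / bad_total) / (g / good_total)) for g, b in bins]


def is_monotonic(values):
    """True when the sequence never decreases or never increases."""
    pairs = list(zip(values, values[1:]))
    return all(x <= y for x, y in pairs) or all(x >= y for x, y in pairs)


woe = woe_per_bin([(80, 20), (20, 80)])  # made-up counts for two bins
```

A bin with zero good or zero bad customers would make the logarithm undefined, which is why such bins are merged before the WOE is computed.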
6) Variable selection
Variables are selected by IV value. The IV value of the i-th bin is computed as:
$$IV_i = (p_{i1}-p_{i0}) \times WOE_i = (p_{i1}-p_{i0})\ln\frac{p_{i1}}{p_{i0}} \qquad (3)$$
The IV value of a variable is the sum of the IV values of all its bins:
$$IV = \sum_{i} IV_i \qquad (4)$$
After the IV value of each variable has been computed, features are screened on the basis of IV value as follows:
Step 6.1: sort the variables by IV value in ascending order and select those whose IV value is greater than 0.02;
Step 6.2: compute the pairwise correlation of the variables using the Pearson correlation coefficient; when the correlation coefficient between two variables exceeds a threshold, delete the variable with the lower IV value;
Step 6.3: use VIF to measure the multicollinearity between one variable and the other variables; when the VIF of a variable exceeds a threshold (usually set to 10 or 7), explanatory variables are removed one at a time, each time removing the one with the lower IV value;
VIF and the Pearson correlation coefficient are explained below:
i) The closer the Pearson correlation coefficient is to 0, the weaker the linear correlation between the two variables; the closer it is to 1 or -1, the stronger the correlation. The formula is:
$$\rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X \sigma_Y} \qquad (5)$$
In formula (5), cov(X, Y) is the covariance of the two variables, $\sigma_X$ is the standard deviation of variable X and $\sigma_Y$ is the standard deviation of variable Y;
ii) A VIF greater than 10 usually indicates significant multicollinearity among the explanatory variables. The formula is:
$$VIF_i = \frac{1}{1-R_i^2} \qquad (6)$$
In formula (6), $R_i$ is the multiple correlation coefficient between $X_i$ and the other variables:
$$R_i = \frac{cov(X_i,\hat{X}_i)}{\sigma_{X_i}\sigma_{\hat{X}_i}} \qquad (7)$$
In formula (7), $\hat{X}_i$ is the linear expression of $X_i$ in terms of the other variables;
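A small sketch of both quantities in plain Python. The function names are hypothetical. The VIF helper takes the multiple correlation coefficient as input; in the special case of only two explanatory variables that coefficient is simply the absolute Pearson correlation between them, which is the case shown here. With more variables, $R_i$ would have to come from regressing one variable on all the others.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def vif_from_r(r):
    """VIF = 1 / (1 - R^2), where R is the multiple correlation coefficient."""
    return 1.0 / (1.0 - r ** 2)


x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
r = pearson(x, y)
vif = vif_from_r(r)  # two-variable case: R equals |r|
```

A correlation of exactly 1 or -1 makes the VIF undefined (division by zero), which corresponds to perfect collinearity.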
7) Building the logistic regression model
This mainly includes building a preliminary logistic regression model, screening variables by p-value, screening variables by the sign of each variable's coefficient, and obtaining the final logistic regression model;
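To make the modelling step concrete, here is a minimal logistic regression fitted by batch gradient descent on a single WOE-encoded feature, followed by a sign screen. Everything in it is an assumption for illustration: the function names, the learning rate, the toy data, and in particular the rule that coefficients should be positive (which follows only if WOE is ln(bad share / good share) and the target is "bad"). The p-value screening mentioned in the text needs standard errors and is not shown.

```python
import math


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit intercept and coefficients by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    intercept, coefs = 0.0, [0.0] * d
    for _ in range(epochs):
        grad0, grad = 0.0, [0.0] * d
        for row, target in zip(X, y):
            z = intercept + sum(c * v for c, v in zip(coefs, row))
            err = 1.0 / (1.0 + math.exp(-z)) - target  # predicted p minus label
            grad0 += err
            for k in range(d):
                grad[k] += err * row[k]
        intercept -= lr * grad0 / n
        coefs = [c - lr * g / n for c, g in zip(coefs, grad)]
    return intercept, coefs


def keep_positive(names, coefs):
    """Assumed sign screen: keep variables whose coefficient is positive."""
    return [n for n, c in zip(names, coefs) if c > 0]


# Toy data: one feature with two bins, WOE = -ln 4 (8 good, 2 bad) and +ln 4 (2 good, 8 bad).
w = math.log(4)
X = [[-w]] * 10 + [[w]] * 10
y = [0] * 8 + [1] * 2 + [0] * 2 + [1] * 8  # 1 marks a bad user
intercept, coefs = fit_logistic(X, y)
kept = keep_positive(["age_woe"], coefs)
```

On this toy data the fitted coefficient approaches 1 and the intercept approaches 0, because the WOE value of each bin already equals that bin's log-odds of being bad.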
8) Model evaluation
Because this is an imbalanced-data problem, with positive samples far outnumbering negative samples in the sample set, AUC (the area under the ROC curve) is used to evaluate the quality of the model, and KS is used to judge the model's ability to discriminate between good and bad users;
9) Converting the probability into a score
The logistic regression model outputs the user's probability of default. To make the credit scoring model more practical, the probability can be converted into a credit score. A transformation is generally used: the logarithm of the good-to-bad ratio is transformed linearly and a constant is added, so that the score falls within a preset range and a higher score means better credit. The conversion formula (8) is:
$$Score = offset + factor \times \ln(odds) \qquad (8)$$
where odds = (1 - p)/p is the good-to-bad ratio and p is the probability that the user is a bad customer; factor is the coefficient of the linear transformation, set to 20/ln 2 under the calibration below; offset is an adjustment constant;
Setting factor and offset is the key to credit scoring. Usually one first assumes that a good-to-bad ratio of 50:1 corresponds to a score of 600, and that every increase of 20 points on this basis doubles the good-to-bad ratio ("points to double the odds", PDO, is set to 20), which gives the system of equations:
$$600 = offset + factor \times \ln 50, \qquad 620 = offset + factor \times \ln 100 \qquad (9)$$
Solving gives:
$$factor = \frac{20}{\ln 2} \approx 28.85, \qquad offset = 600 - \frac{20 \ln 50}{\ln 2} \approx 487.12 \qquad (10)$$
Since the fitted model gives $\ln\frac{p}{1-p} = A + \sum_{i=1}^{N}\beta_i\, woe_{ij}$, the final scoring formula is:
$$Score = offset - factor \times \Big(A + \sum_{i=1}^{N}\beta_i\, woe_{ij}\Big), \quad j \in \{1,\dots,J\} \qquad (11)$$
In formula (11):
A: the intercept;
$woe_{ij}$: the WOE value of bin j, the bin in which variable i currently falls;
$\beta_i$: the regression coefficient of variable i;
N: the number of variables;
J: the number of bins.
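The calibration can be checked numerically. The sketch below assumes the score is a linear function of the log of the good-to-bad odds, with odds taken as (1 - p)/p for a bad-customer probability p, so that a higher score means better credit. The function names are hypothetical; the base score 600, base odds 50:1 and PDO 20 are the values from the text.

```python
import math


def scaling_constants(base_score=600, base_odds=50, pdo=20):
    """Solve score = offset + factor * ln(odds) for the two calibration points."""
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return offset, factor


def score_from_probability(p_bad, offset, factor):
    """Convert a predicted bad-customer probability into a score."""
    odds = (1 - p_bad) / p_bad  # assumed good-to-bad odds
    return offset + factor * math.log(odds)


offset, factor = scaling_constants()
score_50_to_1 = score_from_probability(1 / 51, offset, factor)
score_100_to_1 = score_from_probability(1 / 101, offset, factor)
```

With these constants a good-to-bad ratio of 50:1 maps to about 600 points and 100:1 to about 620 points, matching the two calibration conditions.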
The technical concept of the invention is as follows: first, each variable is binned, according to the possible states of its values, using the optimized chi-square binning method; second, the binned variables are WOE-transformed, which puts all features on the same scale and also turns nonlinear features into linear ones; then, features are selected using the IV value, the Pearson correlation coefficient, VIF and similar methods; next, the selected features are fed into the logistic regression model to verify their validity; finally, the default probability that the model outputs for a user is converted into a score;
The beneficial effects of the invention are mainly: 1. An optimization approach is introduced into variable grouping: maximizing the IV value is taken as the optimization objective and combined with several business constraints, such as WOE monotonicity and the minimum number of samples per group, so that the predictive power of the variables is raised as far as possible while the results remain reasonable; 2. Feature selection by maximum IV value, Pearson correlation coefficient, VIF and similar methods provides a way to handle the difficulty of feature selection on high-dimensional internet data;
Brief description of the drawings
Fig. 1 is a flow chart of the optimized chi-square binning;
Fig. 2 is the KS curve of the logistic regression model.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, a scorecard development method based on machine learning addresses the low efficiency of manual review and, at the same time, the difficulty of feature selection on high-dimensional data; the invention can be applied to scorecard development in internet finance. The method mainly comprises the following steps:
1) Definition of the target variable
Based on vintage analysis, observe the trend of the average overdue rate month by month and determine the length of the performance window. Within the performance window, a user overdue for fewer than 3 days is defined as a "good user", a user overdue for more than 30 days is defined as a "bad user", and a user overdue for more than 3 days but fewer than 30 days is defined as a "gray user";
2) Data acquisition
The data come from a variety of sources, mainly the financial institution's own fields, such as the user's age, registered residence, gender, income, debt ratio and borrowing behavior at the institution;
and third-party data, such as historical consumption data, borrowing and lending behavior at other institutions, and online shopping behavior;
3) EDA (exploratory data analysis)
Examine the general condition of the data, for example the missing values, outliers, mean, median, maximum, minimum and distribution of each field, in order to formulate a reasonable data preprocessing scheme;
4) Data cleaning
Handle dirty data, missing values and outliers in the raw data. For missing values, variables whose missing rate exceeds a given threshold are deleted; for variables whose missing rate is below the threshold, the missing entries can be treated as prediction targets and filled with values predicted by a random forest. Outliers are handled by treating the outlier as a separate state, as shown in Fig. 1;
5) Variable binning
Chi-square binning is used in combination with several business constraints, such as the minimum sample share of each bin, the maximum number of bins and WOE monotonicity;
The improved variable binning procedure is as follows:
1. Input: the maximum number of bins n;
2. Initialization
i) Continuous values are sorted in ascending order; discrete values are first converted to the bad-customer rate of each value and then sorted in ascending order;
ii) To reduce computation, a variable whose number of states exceeds a threshold (100) is first coarsely binned by equal-frequency binning; a variable whose number of states is below the maximum number of bins is not binned;
iii) If there are missing values, they are placed in a separate bin;
3. Interval merging
i) Compute the chi-square value of every pair of adjacent intervals;
ii) Merge the pair of intervals with the smallest chi-square value;
$$\chi^2 = \sum_{i}\sum_{j}\frac{(A_{ij}-E_{ij})^2}{E_{ij}} \qquad (1)$$
$A_{ij}$: the number of samples of class j in interval i;
$E_{ij}$: the expected frequency of $A_{ij}$, $E_{ij} = N_i \times C_j / N$, where N is the number of samples in the merged interval, $N_i$ is the number of samples in group i, and $C_j$ is the number of class-j samples in the merged interval;
iii) Repeat the above steps until the number of bins is no greater than n;
4. Post-processing of the bins
i) Bins whose bad-customer rate is 0 or 1 are merged (a bin may not contain only good customers or only bad customers);
ii) Check whether the WOE is monotonic after binning; if monotonicity is not satisfied, merge bins as follows:
Step 4.1: merge the bin with the previous bin and compute the chi-square value chi2_1;
Step 4.2: merge the bin with the next bin and compute the chi-square value chi2_2;
Step 4.3: if chi2_1 > chi2_2, merge the bin with the next bin, otherwise merge it with the previous bin, until the WOE is monotonic;
iii) Check the sample share of each bin; where a bin's sample share exceeds 95%, merge bins as follows:
Step 4.4: merge the bin with the previous bin and compute the chi-square value chi2_3;
Step 4.5: merge the bin with the next bin and compute the chi-square value chi2_4;
Step 4.6: if chi2_3 > chi2_4, merge the bin with the next bin, otherwise merge it with the previous bin, until the sample share of every bin is greater than 5%;
5. Output the binned data and the bin intervals
Explanation of the WOE calculation used in binning:
For an independent variable, the WOE value of the i-th bin is:
$$WOE_i = \ln\frac{p_{i1}}{p_{i0}} = \ln\frac{\#B_i/\#B_T}{\#G_i/\#G_T} \qquad (2)$$
The variables in formula (2) are as follows:
$p_{i1}$: the proportion of all bad customers that fall in the i-th bin;
$p_{i0}$: the proportion of all good customers that fall in the i-th bin;
$\#B_i$: the number of bad customers in the i-th bin;
$\#G_i$: the number of good customers in the i-th bin;
$\#B_T$: the total number of bad customers;
$\#G_T$: the total number of good customers;
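The coarse equal-frequency binning used in initialization step 2 ii) could look like the sketch below. The function name `equal_frequency_edges` is hypothetical, and the choice to drop repeated cut points (which arise when one value is very frequent) is an assumption the text does not spell out.

```python
def equal_frequency_edges(values, n_bins):
    """Return up to n_bins - 1 cut points that split the sorted values into roughly equal groups."""
    ordered = sorted(values)
    edges = []
    for k in range(1, n_bins):
        edge = ordered[k * len(ordered) // n_bins]
        # Assumption: skip a cut point that repeats the previous one.
        if not edges or edge > edges[-1]:
            edges.append(edge)
    return edges


edges = equal_frequency_edges(list(range(1, 101)), n_bins=4)
skewed_edges = equal_frequency_edges([1] * 10 + [2] * 2, n_bins=3)
```

The resulting coarse bins would then be handed to the chi-square merging step, which reduces them to at most n bins.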
6) Variable selection
Variables are selected by IV value. The IV value of the i-th bin is computed as:
$$IV_i = (p_{i1}-p_{i0}) \times WOE_i = (p_{i1}-p_{i0})\ln\frac{p_{i1}}{p_{i0}} \qquad (3)$$
The IV value of a variable is the sum of the IV values of all its bins:
$$IV = \sum_{i} IV_i \qquad (4)$$
After the IV value of each variable has been computed, features are screened on the basis of IV value as follows:
Step 6.1: sort the variables by IV value in ascending order and select those whose IV value is greater than 0.02;
Step 6.2: compute the pairwise correlation of the variables using the Pearson correlation coefficient; when the correlation coefficient between two variables exceeds a threshold, delete the variable with the lower IV value;
Step 6.3: use VIF to measure the multicollinearity between one variable and the other variables; when the VIF of a variable exceeds a threshold (usually set to 10 or 7), explanatory variables are removed one at a time, each time removing the one with the lower IV value;
VIF and the Pearson correlation coefficient are explained below:
i) The closer the Pearson correlation coefficient is to 0, the weaker the linear correlation between the two variables; the closer it is to 1 or -1, the stronger the correlation. The formula is:
$$\rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X \sigma_Y} \qquad (5)$$
In formula (5), cov(X, Y) is the covariance of the two variables, $\sigma_X$ is the standard deviation of variable X and $\sigma_Y$ is the standard deviation of variable Y;
ii) A VIF greater than 10 usually indicates significant multicollinearity among the explanatory variables. The formula is:
$$VIF_i = \frac{1}{1-R_i^2} \qquad (6)$$
In formula (6), $R_i$ is the multiple correlation coefficient between $X_i$ and the other variables:
$$R_i = \frac{cov(X_i,\hat{X}_i)}{\sigma_{X_i}\sigma_{\hat{X}_i}} \qquad (7)$$
In formula (7), $\hat{X}_i$ is the linear expression of $X_i$ in terms of the other variables;
7) Building the logistic regression model
This includes building a preliminary logistic regression model, screening variables by p-value, screening variables by the sign of each variable's coefficient, and obtaining the final logistic regression model;
8) Model evaluation
Because this is an imbalanced-data problem, with positive samples far outnumbering negative samples in the sample set, AUC (the area under the ROC curve) is used to evaluate the quality of the model, and KS is used to judge the model's ability to discriminate between good and bad users;
9) Converting the probability into a score
The logistic regression model outputs the user's probability of default. To make the credit scoring model more practical, the probability can be converted into a credit score. A transformation is generally used: the logarithm of the good-to-bad ratio is transformed linearly and a constant is added, so that the score falls within a preset range and a higher score means better credit. The conversion formula (8) is:
$$Score = offset + factor \times \ln(odds) \qquad (8)$$
where odds = (1 - p)/p is the good-to-bad ratio and p is the probability that the user is a bad customer; factor is the coefficient of the linear transformation, set to 20/ln 2 under the calibration below; offset is an adjustment constant;
Setting factor and offset is the key to credit scoring. Usually one first assumes that a good-to-bad ratio of 50:1 corresponds to a score of 600, and that every increase of 20 points on this basis doubles the good-to-bad ratio ("points to double the odds", PDO, is set to 20), which gives the system of equations:
$$600 = offset + factor \times \ln 50, \qquad 620 = offset + factor \times \ln 100 \qquad (9)$$
Solving gives:
$$factor = \frac{20}{\ln 2} \approx 28.85, \qquad offset = 600 - \frac{20 \ln 50}{\ln 2} \approx 487.12 \qquad (10)$$
Since the fitted model gives $\ln\frac{p}{1-p} = A + \sum_{i=1}^{N}\beta_i\, woe_{ij}$, the final scoring formula is:
$$Score = offset - factor \times \Big(A + \sum_{i=1}^{N}\beta_i\, woe_{ij}\Big), \quad j \in \{1,\dots,J\} \qquad (11)$$
In formula (11):
A: the intercept;
$woe_{ij}$: the WOE value of bin j, the bin in which variable i currently falls;
$\beta_i$: the regression coefficient of variable i;
N: the number of variables;
J: the number of bins.

Claims (3)

1. A credit scorecard development method based on machine learning, characterized in that the method comprises the following steps:
1) Definition of the target variable
Based on vintage analysis, observe the trend of the average overdue rate month by month and determine the length of the performance window. Within the performance window, a user overdue for fewer than 3 days is defined as a "good user", a user overdue for more than 30 days is defined as a "bad user", and a user overdue for more than 3 days but fewer than 30 days is defined as a "gray user";
2) Data acquisition
The data come from a variety of sources, including the financial institution's own fields: the user's age, registered residence, gender, income, debt ratio and borrowing behavior at the institution;
and third-party data: historical consumption data, borrowing and lending behavior at other institutions, and online shopping behavior;
3) EDA (exploratory data analysis)
Examine the general condition of the data: the missing values, outliers, mean, median, maximum, minimum and distribution of each field, in order to formulate a data preprocessing scheme;
4) Data cleaning
Handle dirty data, missing values and outliers in the raw data. For missing values, variables whose missing rate exceeds a given threshold are deleted; for variables whose missing rate is below the threshold, the missing entries can be treated as prediction targets and filled with values predicted by a random forest. Outliers are handled by treating the outlier as a separate state;
5) Variable binning
Chi-square binning is used in combination with several business constraints, the constraints including the minimum sample share of each bin, the maximum number of bins, or WOE monotonicity;
6) Variable selection
Variables are selected by IV value. The IV value of the i-th bin is computed as:
$$IV_i = (p_{i1}-p_{i0}) \times WOE_i = (p_{i1}-p_{i0})\ln\frac{p_{i1}}{p_{i0}} \qquad (3)$$
The IV value of a variable is the sum of the IV values of all its bins:
$$IV = \sum_{i} IV_i \qquad (4)$$
After the IV value of each variable has been computed, features are screened on the basis of IV value as follows:
Step 6.1: sort the variables by IV value in ascending order and select those whose IV value is greater than 0.02;
Step 6.2: compute the pairwise correlation of the variables using the Pearson correlation coefficient; when the correlation coefficient between two variables exceeds a threshold, delete the variable with the lower IV value;
Step 6.3: use VIF to measure the multicollinearity between one variable and the other variables; when the VIF of a variable exceeds a threshold, explanatory variables are removed one at a time, each time removing the one with the lower IV value;
VIF and the Pearson correlation coefficient are explained below:
i) The closer the Pearson correlation coefficient is to 0, the weaker the linear correlation between the two variables; the closer it is to 1 or -1, the stronger the correlation. The formula is:
$$\rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X \sigma_Y} \qquad (5)$$
In formula (5), cov(X, Y) is the covariance of the two variables, $\sigma_X$ is the standard deviation of variable X and $\sigma_Y$ is the standard deviation of variable Y;
ii) A VIF greater than 10 usually indicates significant multicollinearity among the explanatory variables. The formula is:
$$VIF_i = \frac{1}{1-R_i^2} \qquad (6)$$
In formula (6), $R_i$ is the multiple correlation coefficient between $X_i$ and the other variables:
$$R_i = \frac{cov(X_i,\hat{X}_i)}{\sigma_{X_i}\sigma_{\hat{X}_i}} \qquad (7)$$
In formula (7), $\hat{X}_i$ is the linear expression of $X_i$ in terms of the other variables;
7) Building the logistic regression model
This includes building a preliminary logistic regression model, screening variables by p-value, screening variables by the sign of each variable's coefficient, and obtaining the final logistic regression model;
8) Model evaluation
Because this is an imbalanced-data problem, with positive samples far outnumbering negative samples in the sample set, AUC is used to evaluate the quality of the model, and KS is used to judge the model's ability to discriminate between good and bad users;
9) Converting the probability into a score
$$Score = offset + factor \times \ln(odds) \qquad (8)$$
The logistic regression model outputs the user's probability of default. To make the credit scoring model more practical, the probability can be converted into a credit score by a transformation: the logarithm of the good-to-bad ratio is transformed linearly and a constant is added, so that the score falls within a preset range and a higher score means better credit.
2. The credit scorecard development method based on machine learning according to claim 1, characterized in that, in step 9), the conversion formula (8) is as follows:
$$Score = offset + factor \times \ln(odds) \qquad (8)$$
where odds = (1 - p)/p is the good-to-bad ratio and p is the probability that the user is a bad customer; factor is the coefficient of the linear transformation, set to 20/ln 2; offset is an adjustment constant;
Setting factor and offset is the key to credit scoring. First assume that a good-to-bad ratio of 50:1 corresponds to a score of 600, and that every increase of 20 points on this basis doubles the good-to-bad ratio ("points to double the odds", PDO, is set to 20), which gives the system of equations:
$$600 = offset + factor \times \ln 50, \qquad 620 = offset + factor \times \ln 100 \qquad (9)$$
Solving gives:
$$factor = \frac{20}{\ln 2} \approx 28.85, \qquad offset = 600 - \frac{20 \ln 50}{\ln 2} \approx 487.12 \qquad (10)$$
Since the fitted model gives $\ln\frac{p}{1-p} = A + \sum_{i=1}^{N}\beta_i\, woe_{ij}$, the final scoring formula is:
$$Score = offset - factor \times \Big(A + \sum_{i=1}^{N}\beta_i\, woe_{ij}\Big), \quad j \in \{1,\dots,J\} \qquad (11)$$
In formula (11):
A: the intercept;
$woe_{ij}$: the WOE value of bin j, the bin in which variable i currently falls;
$\beta_i$: the regression coefficient of variable i;
N: the number of variables;
J: the number of bins.
3. a kind of credit scoring card development approach based on machine learning as claimed in claim 1 or 2, which is characterized in that institute It states in step 5), the treatment process of variable branch mailbox method is as follows after improvement:
1. input: the maximum interval number n of branch mailbox;
2. initialization
I) successive value is sorted in ascending order, and discrete value is first converted into the ratio of bad client, is then being sorted in ascending order;
Ii) in order to reduce calculation amount, for status number be greater than given threshold variable, using etc. frequency divisions case carry out rough segmentation case, it is right The not branch mailbox of maximum interval number is less than in status number;
Iii) if there is missing values, by missing values separately as a branch mailbox;
3. combine interval
I) chi-square value of every a pair of of adjacent interval is calculated;
Ii) the smallest a pair of of the section of chi-square value is merged;
Aij: the example quantity of the i-th section jth class;
Eij:N is the sample number of combine interval, NiIt is i-th group of sample number, CjJth class sample is in combine interval Sample number;
Iii above step) is repeated, until branch mailbox quantity is not more than n;
4. Bin post-processing
i) Merge any bin whose bad-customer ratio is 0 or 1: a bin must not contain only good customers or only bad customers;
ii) Check whether the WOE values are monotonic across the bins; if they are not, merge bins as follows:
Step 4.1: merge the bin with the previous bin and compute the chi-square value chi2_1;
Step 4.2: merge the bin with the next bin and compute the chi-square value chi2_2;
Step 4.3: if chi2_1 > chi2_2, merge the bin with the next bin; otherwise merge it with the previous bin; repeat until the WOE values are monotonic;
iii) Check the sample share of each bin; for a bin whose sample share exceeds 95%, merge bins as follows:
Step 4.4: merge the bin with the previous bin and compute the chi-square value chi2_3;
Step 4.5: merge the bin with the next bin and compute the chi-square value chi2_4;
Step 4.6: if chi2_3 > chi2_4, merge the bin with the next bin; otherwise merge it with the previous bin; repeat until the sample share of every bin is greater than 5%;
5. Output the binned data and the bin intervals.
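The interval-merging loop of step 3 can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: it assumes each bin is already summarized as a hypothetical `(good, bad)` count pair in sorted order, that the expected frequency is the usual ChiMerge value E_ij = N_i × C_j / N, and it leaves out sorting, coarse equal-frequency binning, missing-value handling, and the post-processing of step 4. The names `chi2_pair` and `chimerge` are illustrative.

```python
from typing import List, Tuple

Counts = Tuple[int, int]  # hypothetical layout: (good count, bad count) of one bin


def chi2_pair(a: Counts, b: Counts) -> float:
    """Chi-square value of two adjacent bins, treated as a 2 x 2 table."""
    n = sum(a) + sum(b)  # N: samples in the two bins together
    if n == 0:
        return 0.0
    chi2 = 0.0
    for row in (a, b):
        n_i = sum(row)  # N_i: samples in this bin
        for j in range(2):
            c_j = a[j] + b[j]  # C_j: class-j samples in the two bins
            expected = n_i * c_j / n  # E_ij (assumed ChiMerge definition)
            if expected > 0:  # skip empty cells to avoid dividing by zero
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2


def chimerge(bins: List[Counts], max_bins: int) -> List[Counts]:
    """Merge the adjacent pair with the smallest chi-square until len <= max_bins."""
    bins = list(bins)
    while len(bins) > max_bins:
        scores = [chi2_pair(bins[k], bins[k + 1]) for k in range(len(bins) - 1)]
        k = scores.index(min(scores))  # first pair with the smallest value
        merged = (bins[k][0] + bins[k + 1][0], bins[k][1] + bins[k + 1][1])
        bins[k:k + 2] = [merged]
    return bins


if __name__ == "__main__":
    print(chimerge([(10, 10), (10, 10), (0, 10), (10, 0)], max_bins=3))
```

Working on per-bin counts rather than raw rows keeps each merge cheap, which is in the spirit of the coarse pre-binning in step 2 ii).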
Explanation of the WOE calculation used in binning:
For an independent variable, the WOE value of the i-th bin is:
$$\mathrm{WOE}_i = \ln\frac{p_{i1}}{p_{i0}} = \ln\frac{\#B_i/\#B_T}{\#G_i/\#G_T} \tag{2}$$
The variables in formula (2) are as follows:
p_{i1}: the proportion of all bad customers that fall in the i-th bin;
p_{i0}: the proportion of all good customers that fall in the i-th bin;
#B_i: the number of bad customers in the i-th bin;
#G_i: the number of good customers in the i-th bin;
#B_T: the total number of bad customers;
#G_T: the total number of good customers.
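As a small worked sketch of the WOE calculation and of the monotonicity test used in post-processing, the Python below assumes WOE_i = ln(p_i1 / p_i0) with p_i1 = #B_i / #B_T and p_i0 = #G_i / #G_T, which is how the variable definitions above read; the total number of good customers (#G_T) is computed from the data. The `(good, bad)` tuple layout and the names `woe_values` and `is_monotonic` are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def woe_values(bins: List[Tuple[int, int]]) -> List[float]:
    """WOE per bin; each bin is a hypothetical (good count, bad count) pair."""
    good_total = sum(good for good, _ in bins)  # #G_T
    bad_total = sum(bad for _, bad in bins)     # #B_T
    result = []
    for good, bad in bins:
        # A bin with only good or only bad customers has no finite WOE,
        # which is why such bins are merged first in post-processing.
        if good == 0 or bad == 0:
            raise ValueError("each bin needs both good and bad customers")
        p_bad = bad / bad_total     # p_i1
        p_good = good / good_total  # p_i0
        result.append(math.log(p_bad / p_good))
    return result


def is_monotonic(values: Sequence[float]) -> bool:
    """True if the values never decrease or never increase from bin to bin."""
    pairs = list(zip(values, values[1:]))
    return all(x <= y for x, y in pairs) or all(x >= y for x, y in pairs)


if __name__ == "__main__":
    woe = woe_values([(80, 20), (60, 40), (20, 80)])
    print(woe, is_monotonic(woe))
```

Raising an error for single-class bins, instead of smoothing the counts, is a design choice made here to mirror the requirement that no bin be all good or all bad customers.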
CN201811618779.8A 2018-12-28 2018-12-28 A kind of credit scoring card development approach based on machine learning Pending CN109636591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811618779.8A CN109636591A (en) 2018-12-28 2018-12-28 A kind of credit scoring card development approach based on machine learning

Publications (1)

Publication Number Publication Date
CN109636591A true CN109636591A (en) 2019-04-16

Family

ID=66078701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811618779.8A Pending CN109636591A (en) 2018-12-28 2018-12-28 A kind of credit scoring card development approach based on machine learning

Country Status (1)

Country Link
CN (1) CN109636591A (en)

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135467A (en) * 2019-04-23 2019-08-16 北京淇瑀信息科技有限公司 A kind of model training method, device, system and recording medium based on data splicing
CN110147938A (en) * 2019-04-23 2019-08-20 北京淇瑀信息科技有限公司 A kind of training sample generation method, device, system and recording medium
CN110084376A (en) * 2019-04-30 2019-08-02 成都四方伟业软件股份有限公司 To the method and device of the automatic branch mailbox of data
CN110084376B (en) * 2019-04-30 2021-05-14 成都四方伟业软件股份有限公司 Method and device for automatically separating data into boxes
CN110322142A (en) * 2019-07-01 2019-10-11 百维金科(上海)信息科技有限公司 A kind of big data air control model and inline system configuration technology
CN110322150A (en) * 2019-07-04 2019-10-11 优估(上海)信息科技有限公司 A kind of signal auditing method, device and server
CN110322150B (en) * 2019-07-04 2023-04-18 优估(上海)信息科技有限公司 Information auditing method, device and server
CN110544165A (en) * 2019-09-02 2019-12-06 中诚信征信有限公司 credit risk score card creating method and device and electronic equipment
CN110544165B (en) * 2019-09-02 2022-06-03 中诚信征信有限公司 Credit risk score card creating method and device and electronic equipment
CN110659817A (en) * 2019-09-16 2020-01-07 上海云从企业发展有限公司 Data processing method and device, machine readable medium and equipment
CN110688373A (en) * 2019-09-17 2020-01-14 杭州绿度信息技术有限公司 OFFSET method based on logistic regression
CN110620696A (en) * 2019-09-29 2019-12-27 杭州安恒信息技术股份有限公司 Grading method and device for enterprise network security situation awareness
CN111178690A (en) * 2019-12-11 2020-05-19 国网重庆市电力公司北碚供电分公司 Electricity stealing risk assessment method for electricity consumers based on wind control scoring card model
CN111080120A (en) * 2019-12-13 2020-04-28 上海海豚企业征信服务有限公司 Label classification and quantitative analysis method based on credit big data
CN111311128A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 Consumption financial credit scoring card development method based on third-party data
CN111582466A (en) * 2020-05-09 2020-08-25 深圳市卡数科技有限公司 Scoring card configuration method, device, equipment and storage medium for simulation neural network
CN111582466B (en) * 2020-05-09 2023-09-01 深圳市卡数科技有限公司 Score card configuration method, device and equipment for simulating neural network and storage medium
CN111583031A (en) * 2020-05-15 2020-08-25 上海海事大学 Application scoring card model building method based on ensemble learning
CN111861704A (en) * 2020-07-10 2020-10-30 深圳无域科技技术有限公司 Wind control feature generation method and system
CN111859682A (en) * 2020-07-24 2020-10-30 北京睿知图远科技有限公司 GroupLasso-based variable automatic selection method, system and readable medium
CN112232944A (en) * 2020-09-29 2021-01-15 中诚信征信有限公司 Scoring card creating method and device and electronic equipment
CN112232944B (en) * 2020-09-29 2024-05-31 中诚信征信有限公司 Method and device for creating scoring card and electronic equipment
CN112102074A (en) * 2020-10-14 2020-12-18 深圳前海弘犀智能科技有限公司 Grading card modeling method
CN112102074B (en) * 2020-10-14 2024-01-30 深圳前海弘犀智能科技有限公司 Score card modeling method
CN112700324A (en) * 2021-01-08 2021-04-23 北京工业大学 User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN112862593B (en) * 2021-01-28 2024-05-03 深圳前海微众银行股份有限公司 Credit scoring card model training method, device and system and computer storage medium
CN112862593A (en) * 2021-01-28 2021-05-28 深圳前海微众银行股份有限公司 Credit scoring card model training method, device, system and computer storage medium
CN113177585A (en) * 2021-04-23 2021-07-27 上海晓途网络科技有限公司 User classification method and device, electronic equipment and storage medium
CN113177585B (en) * 2021-04-23 2024-04-05 上海晓途网络科技有限公司 User classification method, device, electronic equipment and storage medium
CN113282886B (en) * 2021-05-26 2021-12-14 北京大唐神州科技有限公司 Bank loan default judgment method based on logistic regression
CN113282886A (en) * 2021-05-26 2021-08-20 北京大唐神州科技有限公司 Bank loan default judgment method based on logistic regression
CN114139960A (en) * 2021-12-01 2022-03-04 安徽数升数据科技有限公司 Work order complaint risk pre-control method
CN115423600A (en) * 2022-08-22 2022-12-02 前海飞算云创数据科技(深圳)有限公司 Data screening method, device, medium and electronic equipment
CN115423600B (en) * 2022-08-22 2023-08-04 前海飞算云创数据科技(深圳)有限公司 Data screening method, device, medium and electronic equipment
CN115423603A (en) * 2022-08-31 2022-12-02 厦门国际银行股份有限公司 Wind control model establishing method and system based on machine learning and storage medium
CN116091206A (en) * 2023-01-31 2023-05-09 金电联行(北京)信息技术有限公司 Credit evaluation method, credit evaluation device, electronic equipment and storage medium
CN116091206B (en) * 2023-01-31 2023-10-20 金电联行(北京)信息技术有限公司 Credit evaluation method, credit evaluation device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109636591A (en) A kind of credit scoring card development approach based on machine learning
Kortum et al. What is behind the recent surge in patenting?
CN111199343A (en) Multi-model fusion tobacco market supervision abnormal data mining method
CN107633265A (en) For optimizing the data processing method and device of credit evaluation model
CN106909933A (en) A kind of stealing classification Forecasting Methodology of three stages various visual angles Fusion Features
CN106650774A (en) Method for obtaining the regression relationship between the dependant variable and the independent variables during data analysis
CN112417176B (en) Method, equipment and medium for mining implicit association relation between enterprises based on graph characteristics
AU2020101475A4 (en) A Financial Data Analysis Method Based on Machine Learning Models
CN110827131B (en) Tax payer credit evaluation method based on distributed automatic feature combination
CN102750286A (en) Novel decision tree classifier method for processing missing data
CN104850868A (en) Customer segmentation method based on k-means and neural network cluster
Si et al. Establishment and improvement of financial decision support system using artificial intelligence and big data
CN115310752A (en) Energy big data-oriented data asset value evaluation method and system
CN107093005A (en) The method that tax handling service hall's automatic classification is realized based on big data mining algorithm
Jiang et al. On the build and application of bank customer churn warning model
Nourahmadi et al. Portfolio Diversification Based on Clustering Analysis
Romero et al. Sophistication, productivity and trade: a sectoral investigation
Sun et al. Dynamic financial distress prediction based on class-imbalanced data batches
Zhang Research on credit risk forecast model based on data mining technology
Guo et al. Statistical decision research of long-term deposit subscription in banks based on decision tree
Yu et al. Designing a hybrid intelligent mining system for credit risk evaluation
Wang et al. Investment, dividend, debt decisions and business life cycle
Chen et al. Useful factors are fewer than you think
CN114119211A (en) Method for screening high-latitude variable of credit variable data
Shen Product restructuring, exports, investment, and growth dynamics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190416