CN109636591A - A credit scorecard development method based on machine learning - Google Patents
A credit scorecard development method based on machine learning
- Publication number
- CN109636591A (application number CN201811618779.8A)
- Authority
- CN
- China
- Prior art keywords
- value
- variable
- binning
- bin
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
Landscapes
- Business, Economics & Management (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Engineering & Computer Science (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Technology Law (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
A scorecard development method based on machine learning, comprising the following steps: (1) defining the label of the target user according to vintage analysis; (2) integrating multiple data sources to obtain the final data set; (3) performing exploratory analysis and data cleaning on the data; (4) binning the cleaned data with an optimized chi-square binning method; (5) performing variable selection on the binned variables; (6) building a logistic regression model; (7) evaluating the model; (8) converting the default probability that the model outputs for the target user into a score. The invention uses machine learning, vintage analysis and a logistic regression model to address the low efficiency and difficulty of manual review in the big-data era, moving the problem from manual solution to machine solution.
Description
Technical field
The present invention relates to the fields of internet finance, machine learning, vintage analysis, logistic regression models and computer applications, and in particular to a credit scorecard development method based on machine learning;
Background art
With the rapid development of credit scoring models and the credit industry, model-building methods have become highly varied, from the conventional statistical regression methods used at first to today's emerging deep learning algorithms. In application, models have also gradually spread from predicting default probability to every stage of the credit life cycle, such as the A card (application scoring), the B card (behavior scoring) and the subsequent post-loan C card (collection scoring). The scorecards used by most financial companies, however, are still traditional expert scorecards, in which experienced experts lay down rules to separate good users from bad ones. This approach works when data volumes are small, as they were early on, but with the development of big data such expert-driven scorecards become very inefficient. To address this, developing a data-driven scorecard becomes necessary. Replacing the inefficient and hard-to-control manual review process with a data-driven scorecard improves the timeliness and accuracy of credit approval;
Summary of the invention
To overcome the deficiencies of the prior art, the present invention proposes a credit scorecard development method based on machine learning. It uses machine learning, vintage analysis and a logistic regression model to address the low efficiency and difficulty of manual review in the big-data era, moving the problem from manual solution to machine solution.
The technical solution adopted by the present invention to solve this technical problem is:
A credit scorecard development method based on machine learning, comprising the following steps:
1) Definition of the target variable
Based on vintage analysis, observe the average delinquency trend for each month and determine the length of the performance window. Within the performance window, users who are fewer than 3 days overdue are defined as "good users", users who are more than 30 days overdue are defined as "bad users", and users who are more than 3 days but fewer than 30 days overdue are defined as "gray users";
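The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the function name `label_user` and the input (the maximum number of days overdue within the performance window) are assumptions, and because the text does not say how exactly 3 or exactly 30 days should be treated, the sketch assigns those boundary cases to the gray group.

```python
GOOD_MAX_DAYS = 3    # fewer than 3 days overdue -> good user (from the text)
BAD_MIN_DAYS = 30    # more than 30 days overdue -> bad user (from the text)


def label_user(max_days_overdue: int) -> str:
    """Return 'good', 'bad' or 'gray' for one user in the performance window."""
    if max_days_overdue < GOOD_MAX_DAYS:
        return "good"
    if max_days_overdue > BAD_MIN_DAYS:
        return "bad"
    # Assumption: exactly 3 or exactly 30 days overdue is treated as gray,
    # since the text leaves these boundary values undefined.
    return "gray"


labels = [label_user(days) for days in (0, 2, 3, 15, 30, 31, 90)]
print(labels)
```

Gray users are typically excluded from model training so that the good and bad classes stay clearly separated; whether the patent does so is not stated.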
2) Data acquisition
The data come from a variety of sources, including the financial institution's own fields, such as the user's age, household registration, gender, income, debt ratio, and borrowing behavior at the institution;
and third-party data: historical consumption data, borrowing and lending behavior at other institutions, and online shopping behavior;
3) EDA (exploratory data analysis)
Understand the state of the data: the missing values, outliers, mean, median, maximum, minimum and distribution of each field, in order to draw up a data preprocessing plan;
4) Data cleaning
Handle dirty data, missing values and outliers in the raw data. For missing values, variables whose missing rate exceeds a given threshold are deleted; for variables whose missing rate is below the threshold, the missing entries are treated as prediction targets and filled with values predicted by a random forest. Outliers are handled by treating them as a separate state;
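The first part of the missing-value rule (dropping variables whose missing rate exceeds a threshold) can be sketched as follows. The table layout (a dict of column name to list of values, with `None` marking a missing entry), the function names and the threshold of 0.5 are all assumptions made for illustration; the random-forest imputation step is not shown.

```python
def missing_rate(values):
    """Share of entries that are None in one column."""
    return sum(v is None for v in values) / len(values)


def split_by_missing_rate(table, threshold=0.5):
    """Split column names into (kept, dropped) by missing rate.

    `table` maps a column name to its list of values; `threshold` is a
    hypothetical cutoff, since the text only says "a given threshold".
    """
    kept, dropped = [], []
    for name, values in table.items():
        if missing_rate(values) > threshold:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


table = {
    "age": [25, 40, None, 31],           # 25% missing -> kept
    "income": [None, None, None, 5000],  # 75% missing -> dropped
}
kept, dropped = split_by_missing_rate(table, threshold=0.5)
print(kept, dropped)
```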
5) Variable binning
Chi-square binning is used in combination with several business constraints: the minimum sample share of each bin, the maximum number of bins, and WOE monotonicity;
The improved variable binning method proceeds as follows:
1. Input: the maximum number of bins n;
2. Initialization
i) Continuous values are sorted in ascending order; discrete values are first converted to their bad-customer ratio and then sorted in ascending order;
ii) To reduce computation, variables with more than a threshold number (100) of distinct values are first coarsely binned by equal-frequency binning; variables with fewer distinct values than the maximum number of bins are not binned;
iii) If there are missing values, they are placed in a bin of their own;
3. Merging intervals
i) Compute the chi-square value of every pair of adjacent intervals:
$$\chi^2 = \sum_{i}\sum_{j}\frac{(A_{ij}-E_{ij})^2}{E_{ij}} \qquad (1)$$
ii) Merge the pair of intervals with the smallest chi-square value;
A_ij: the number of samples of class j in interval i;
E_ij: the expected frequency of A_ij, E_ij = N_i × C_j / N, where N is the number of samples in the merged interval, N_i is the number of samples in group i, and C_j is the number of class-j samples in the merged interval;
iii) Repeat the steps above until the number of bins is no greater than n;
4. Post-processing of bins
i) Bins whose bad-customer ratio is 0 or 1 are merged (a bin may not contain only good customers or only bad customers);
ii) Check whether WOE is monotonic after binning; if not, merge bins as follows:
Step 4.1: merge the bin with the previous bin and compute the chi-square value chi2_1;
Step 4.2: merge the bin with the next bin and compute the chi-square value chi2_2;
Step 4.3: if chi2_1 > chi2_2, merge the bin with the next bin, otherwise merge it with the previous bin; repeat until WOE is monotonic;
iii) Check the sample share of each bin; where a single bin holds more than 95% of the samples, merge bins as follows:
Step 4.4: merge the bin with the previous bin and compute the chi-square value chi2_3;
Step 4.5: merge the bin with the next bin and compute the chi-square value chi2_4;
Step 4.6: if chi2_3 > chi2_4, merge the bin with the next bin, otherwise merge it with the previous bin; repeat until every bin holds more than 5% of the samples;
5. Output the binned data and the bin intervals
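The merging loop in step 3 can be illustrated with a minimal sketch. It assumes each bin is already summarized as a `(good_count, bad_count)` tuple in sorted order and that no pair of adjacent bins is empty; the names `chi2_pair` and `chi_merge` are illustrative. The post-processing constraints of step 4 are not included.

```python
def chi2_pair(a, b):
    """Chi-square statistic for two adjacent bins.

    Each bin is a (good_count, bad_count) tuple. The expected count is
    E_ij = N_i * C_j / N, as defined for formula (1).
    """
    n = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        n_i = sum(row)
        for j in range(2):
            c_j = a[j] + b[j]
            expected = n_i * c_j / n
            if expected > 0:  # skip empty cells to avoid division by zero
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2


def chi_merge(bins, max_bins):
    """Merge the adjacent pair with the smallest chi-square until len(bins) <= max_bins."""
    bins = [tuple(b) for b in bins]
    while len(bins) > max_bins:
        scores = [chi2_pair(bins[k], bins[k + 1]) for k in range(len(bins) - 1)]
        k = scores.index(min(scores))
        merged = (bins[k][0] + bins[k + 1][0], bins[k][1] + bins[k + 1][1])
        bins[k:k + 2] = [merged]
    return bins


# Hypothetical counts: the first two bins have similar bad rates and merge first.
bins = [(90, 10), (88, 12), (60, 40), (30, 70)]
print(chi_merge(bins, max_bins=3))
```

A small chi-square value means the two bins have similar class distributions, which is why the most similar neighbors are merged first.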
Explanation of the WOE calculation used in binning:
For an independent variable, the WOE value of the i-th bin is:
$$WOE_i = \ln\frac{p_{i1}}{p_{i0}} = \ln\frac{\#B_i/\#B_T}{\#G_i/\#G_T} \qquad (2)$$
The variables in formula (2) are as follows:
p_i1: the bad customers in the i-th bin as a share of all bad customers
p_i0: the good customers in the i-th bin as a share of all good customers
#B_i: the number of bad customers in the i-th bin
#G_i: the number of good customers in the i-th bin
#B_T: the total number of bad customers
#G_T: the total number of good customers
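As a small worked example of the WOE definition, the sketch below computes WOE per bin from `(good_count, bad_count)` tuples. The function name and the sample counts are illustrative, and the bad-over-good orientation of the ratio follows the order in which the variables are listed above, which is an interpretation because the formula image itself did not survive. It relies on every bin containing both good and bad customers, which post-processing step 4 i) is meant to guarantee.

```python
import math


def woe_per_bin(bins):
    """WOE for each bin; bins are (good_count, bad_count) tuples.

    WOE_i = ln(p_i1 / p_i0), with p_i1 = #B_i / #B_T and p_i0 = #G_i / #G_T.
    """
    good_total = sum(good for good, _ in bins)
    bad_total = sum(bad for _, bad in bins)
    result = []
    for good, bad in bins:
        p_bad = bad / bad_total      # p_i1
        p_good = good / good_total   # p_i0
        result.append(math.log(p_bad / p_good))
    return result


# Hypothetical counts: bin 0 is mostly good, bin 1 is mostly bad.
woes = woe_per_bin([(80, 20), (20, 80)])
print(woes)
```

With this orientation a negative WOE marks a bin with relatively few bad customers and a positive WOE marks a riskier bin.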
6) Variable selection
Variables are selected based on their IV. The IV of the i-th bin is:
$$IV_i = (p_{i1} - p_{i0}) \times WOE_i \qquad (3)$$
The IV of a variable is the sum of the IVs of all its bins:
$$IV = \sum_{i} IV_i \qquad (4)$$
After the IV of each variable has been computed, a subset of features is selected based on IV as follows:
Step 6.1: sort the variables by IV in ascending order and select those whose IV is greater than 0.02;
Step 6.2: compute the pairwise correlation of the variables with the Pearson correlation coefficient; when the correlation coefficient between two variables exceeds a threshold, delete the variable with the lower IV;
Step 6.3: measure the multicollinearity between a variable and the other variables with VIF; when the VIF of a variable exceeds a threshold (usually set to 10 or 7), explanatory variables are removed one at a time, choosing the one with the lower IV for deletion;
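Step 6.1 can be sketched as below. The per-bin IV is taken as (p_i1 - p_i0) × WOE_i and summed over bins; the input format (variable name mapped to a list of `(good_count, bad_count)` bins), the function names and the sample data are assumptions for illustration only.

```python
import math


def information_value(bins):
    """IV of one variable: sum over bins of (p_i1 - p_i0) * WOE_i."""
    good_total = sum(good for good, _ in bins)
    bad_total = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        p_bad = bad / bad_total
        p_good = good / good_total
        iv += (p_bad - p_good) * math.log(p_bad / p_good)
    return iv


def select_by_iv(binned_variables, min_iv=0.02):
    """Keep variables whose IV exceeds min_iv, sorted by IV in ascending order."""
    ivs = {name: information_value(bins) for name, bins in binned_variables.items()}
    kept = [name for name, iv in ivs.items() if iv > min_iv]
    return sorted(kept, key=lambda name: ivs[name])


binned = {
    "income_band": [(80, 20), (20, 80)],  # hypothetical, strongly separating
    "coin_flip": [(50, 50), (50, 50)],    # hypothetical, no separation (IV = 0)
}
print(select_by_iv(binned))
```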
VIF and the Pearson correlation coefficient are explained below:
i) The closer the Pearson correlation coefficient is to 0, the weaker the linear correlation between the two variables; the closer it is to 1 or -1, the stronger the correlation. The formula is:
$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} \qquad (5)$$
In formula (5), cov(X, Y) is the covariance of the two variables, σ_X is the standard deviation of variable X, and σ_Y is the standard deviation of variable Y;
ii) A VIF greater than 10 usually indicates significant multicollinearity among the explanatory variables. The formula is:
$$VIF_i = \frac{1}{1 - R_i^2} \qquad (6)$$
In formula (6), R_i is the multiple correlation coefficient between X_i and the other variables.
Formula (7) refers to the linear expression in terms of the other variables;
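Both quantities are easy to compute directly from their definitions. The sketch below is a simplified illustration: `pearson` implements cov(X, Y) / (σ_X σ_Y), and `vif_from_r2` takes R_i² as an input rather than fitting the auxiliary regression of X_i on the other variables, which is omitted here. When there is only one other variable, R_i² equals the squared Pearson coefficient, which is the case the example uses. All names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The 1/n factors of covariance and variances cancel in the ratio.
    return cov / math.sqrt(var_x * var_y)


def vif_from_r2(r_squared):
    """VIF_i = 1 / (1 - R_i^2)."""
    return 1.0 / (1.0 - r_squared)


x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
r = pearson(x, y)
# With a single other variable, R^2 is simply r squared.
print(r, vif_from_r2(r ** 2))
```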
7) Building the logistic regression model
This mainly comprises building a preliminary logistic regression model, selecting variables according to their p-values, screening variables according to the sign of each coefficient, and obtaining the final logistic regression model;
8) Model evaluation
Because this is an imbalanced-data problem, in which positive samples far outnumber negative samples in the sample set, AUC (the area under the ROC curve) is used to evaluate the quality of the model, and KS is also used to judge the model's ability to distinguish good users from bad users;
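Both metrics can be computed from labels and model scores without any library. The sketch below is a small, quadratic-time illustration rather than a production implementation: AUC is computed as the share of (positive, negative) pairs that the score ranks correctly, and KS as the largest gap between the cumulative rates of the two classes across all thresholds. Label 1 stands for whichever class the score is meant to rank higher; the data are hypothetical.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks(labels, scores):
    """KS statistic: max gap between the two classes' cumulative rates over thresholds."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for t in sorted(set(scores)):
        tpr = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= t) / n_pos
        fpr = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= t) / n_neg
        best = max(best, abs(tpr - fpr))
    return best


labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(auc(labels, scores), ks(labels, scores))
```

Neither metric depends on the class proportions in the way raw accuracy does, which is why they suit the imbalanced setting described above.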
9) Converting probability to a score
The logistic regression model outputs the probability that a user defaults. To make the credit scoring model more practical, this probability can be converted into a credit score. A transformation method is generally used: the logarithm of the good-to-bad ratio is linearly transformed and a constant is then added, so that the score falls within a preset range and a higher score means better credit. The conversion formula (8) is:
Score=offset+factor*ln (odds) (8)
where odds=p/(1-p) and p is the probability that the user is a bad customer; factor is the coefficient of the linear transformation, usually set to 2/ln2; offset is an adjustment constant;
Setting factor and offset is the key to credit scoring. It is usually first assumed that a good-to-bad ratio of 50:1 corresponds to a score of 600, and that on this basis every additional 20 points doubles the good-to-bad ratio (the "points to double the odds", PDO, is set to 20), which gives the system of equations:
Solving gives:
The final scoring formula is:
In formula (11):
A: the intercept;
woe_ij: the WOE value of the current bin j of variable i;
β_i: the regression coefficient of variable i;
N: the number of variables;
J: the number of bins.
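The calibration described above (600 points at a 50:1 good-to-bad ratio, 20 points to double the odds) can be worked through numerically. The sketch below rests on two assumptions that should be read carefully: it takes the odds as the good-to-bad ratio (1 - p) / p so that a higher score means better credit under formula (8), and it derives factor from the two anchor points, which gives 20/ln 2, rather than using the value 2/ln2 quoted in the text. The constant and function names are illustrative.

```python
import math

BASE_SCORE = 600   # score at the anchor odds (from the text)
BASE_ODDS = 50     # good-to-bad ratio of 50:1 (from the text)
PDO = 20           # points to double the odds (from the text)

# Solving 600 = offset + factor * ln(50) and 620 = offset + factor * ln(100):
FACTOR = PDO / math.log(2)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)


def score_from_bad_probability(p_bad):
    """Map a predicted default probability to a score (higher = better credit)."""
    # Assumption: odds are taken as good:bad, i.e. (1 - p) / p.
    good_bad_odds = (1.0 - p_bad) / p_bad
    return OFFSET + FACTOR * math.log(good_bad_odds)


print(round(score_from_bad_probability(1 / 51), 2))   # odds of 50:1
print(round(score_from_bad_probability(1 / 101), 2))  # odds of 100:1
```

Halving the bad-to-good odds therefore adds PDO points, and the anchor values can be changed to shift or stretch the score range.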
The technical concept of the invention is as follows: first, each variable is binned according to its possible values using the optimized chi-square binning method; second, the binned variables are WOE-transformed, which brings all features onto the same scale and also converts nonlinear features into linear ones; then features are selected using methods such as IV, the Pearson correlation coefficient and VIF; next, the selected features are fed into the logistic regression model to verify their effectiveness; finally, the default probability that the model outputs for a user is converted into a score;
The beneficial effects of the invention are mainly the following: 1. An optimization method is introduced into variable grouping, with maximizing IV as the optimization objective, combined with several business constraints such as WOE monotonicity and the minimum sample count of each group, so that the predictive power of each variable is raised as far as possible while the reasonableness of the result is ensured; 2. Through feature selection methods such as maximum IV, the Pearson correlation coefficient and VIF, a way of handling the difficulty of feature selection on high-dimensional internet data is proposed;
Brief description of the drawings
Fig. 1 is the flow chart of the optimized chi-square binning;
Fig. 2 is the KS curve of the logistic regression model.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, a scorecard development method based on machine learning is provided. Implementing the method addresses the low efficiency of manual review and, at the same time, the difficulty of feature selection on high-dimensional data. The invention can be applied to internet-finance scorecard development, in the scenario shown in Fig. 1. The method designed for this problem mainly comprises the following steps:
1) Definition of the target variable
Based on vintage analysis, observe the average delinquency trend for each month and determine the length of the performance window. Within the performance window, users who are fewer than 3 days overdue are defined as "good users", users who are more than 30 days overdue are defined as "bad users", and users who are more than 3 days but fewer than 30 days overdue are defined as "gray users";
2) Data acquisition
The data come from a variety of sources, mainly the financial institution's own fields, such as the user's age, household registration, gender, income, debt ratio, and borrowing behavior at the institution;
and third-party data, such as historical consumption data, borrowing and lending behavior at other institutions, and online shopping behavior;
3) EDA (exploratory data analysis)
Understand the general state of the data, for example the missing values, outliers, mean, median, maximum, minimum and distribution of each field, in order to draw up a reasonable data preprocessing plan;
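A per-field profile of the kind listed above can be produced with the standard library. The sketch assumes a numeric field held as a Python list in which `None` marks a missing value; the function name `summarize` and the sample values are illustrative, and outlier and distribution checks are left out.

```python
import statistics


def summarize(values):
    """Basic profile of one numeric field; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "missing_rate": 1 - len(present) / len(values),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }


profile = summarize([25, 40, None, 31, 52])
print(profile)
```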
4) Data cleaning
Handle dirty data, missing values and outliers in the raw data. For missing values, variables whose missing rate exceeds a given threshold are deleted; for variables whose missing rate is below the threshold, the missing entries can be treated as prediction targets and filled with values predicted by a random forest. Outliers are handled by treating them as a separate state, as shown in Fig. 1;
5) Variable binning
Chi-square binning is used in combination with several business constraints, such as the minimum sample share of each bin, the maximum number of bins, and WOE monotonicity;
The improved variable binning method proceeds as follows:
1. Input: the maximum number of bins n;
2. Initialization
i) Continuous values are sorted in ascending order; discrete values are first converted to their bad-customer ratio and then sorted in ascending order;
ii) To reduce computation, variables with more than a threshold number (100) of distinct values are first coarsely binned by equal-frequency binning; variables with fewer distinct values than the maximum number of bins are not binned;
iii) If there are missing values, they are placed in a bin of their own;
3. Merging intervals
i) Compute the chi-square value of every pair of adjacent intervals:
$$\chi^2 = \sum_{i}\sum_{j}\frac{(A_{ij}-E_{ij})^2}{E_{ij}} \qquad (1)$$
ii) Merge the pair of intervals with the smallest chi-square value;
A_ij: the number of samples of class j in interval i;
E_ij: the expected frequency of A_ij, E_ij = N_i × C_j / N, where N is the number of samples in the merged interval, N_i is the number of samples in group i, and C_j is the number of class-j samples in the merged interval;
iii) Repeat the steps above until the number of bins is no greater than n;
4. Post-processing of bins
i) Bins whose bad-customer ratio is 0 or 1 are merged (a bin may not contain only good customers or only bad customers);
ii) Check whether WOE is monotonic after binning; if not, merge bins as follows:
Step 4.1: merge the bin with the previous bin and compute the chi-square value chi2_1;
Step 4.2: merge the bin with the next bin and compute the chi-square value chi2_2;
Step 4.3: if chi2_1 > chi2_2, merge the bin with the next bin, otherwise merge it with the previous bin; repeat until WOE is monotonic;
iii) Check the sample share of each bin; where a single bin holds more than 95% of the samples, merge bins as follows:
Step 4.4: merge the bin with the previous bin and compute the chi-square value chi2_3;
Step 4.5: merge the bin with the next bin and compute the chi-square value chi2_4;
Step 4.6: if chi2_3 > chi2_4, merge the bin with the next bin, otherwise merge it with the previous bin; repeat until every bin holds more than 5% of the samples;
5. Output the binned data and the bin intervals
Explanation of the WOE calculation used in binning:
For an independent variable, the WOE value of the i-th bin is:
$$WOE_i = \ln\frac{p_{i1}}{p_{i0}} = \ln\frac{\#B_i/\#B_T}{\#G_i/\#G_T} \qquad (2)$$
The variables in formula (2) are as follows:
p_i1: the bad customers in the i-th bin as a share of all bad customers;
p_i0: the good customers in the i-th bin as a share of all good customers;
#B_i: the number of bad customers in the i-th bin;
#G_i: the number of good customers in the i-th bin;
#B_T: the total number of bad customers;
#G_T: the total number of good customers;
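The monotonicity test in post-processing step ii) and the neighbor choice in step 4.3 can be expressed as two small helpers. This is a sketch only: it assumes the per-bin WOE values and the two candidate chi-square values (chi2_1 for the previous neighbor, chi2_2 for the next) have already been computed, and the function names are illustrative.

```python
def is_monotonic(woes):
    """True if the WOE sequence is entirely non-decreasing or non-increasing."""
    pairs = list(zip(woes, woes[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def choose_merge_neighbor(chi2_prev, chi2_next):
    """Step 4.3: merge with the next bin if chi2_1 > chi2_2, else with the previous bin."""
    return "next" if chi2_prev > chi2_next else "previous"


print(is_monotonic([-1.2, -0.3, 0.4, 1.1]))   # monotonic increasing
print(is_monotonic([-1.2, 0.5, 0.1, 1.1]))    # broken trend at the third bin
print(choose_merge_neighbor(5.0, 2.0))
```

Choosing the neighbor with the smaller chi-square value merges the bin with the one whose class distribution it most resembles.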
6) Variable selection
Variables are selected based on their IV. The IV of the i-th bin is:
$$IV_i = (p_{i1} - p_{i0}) \times WOE_i \qquad (3)$$
The IV of a variable is the sum of the IVs of all its bins:
$$IV = \sum_{i} IV_i \qquad (4)$$
After the IV of each variable has been computed, a subset of features is selected based on IV as follows:
Step 6.1: sort the variables by IV in ascending order and select those whose IV is greater than 0.02;
Step 6.2: compute the pairwise correlation of the variables with the Pearson correlation coefficient; when the correlation coefficient between two variables exceeds a threshold, delete the variable with the lower IV;
Step 6.3: measure the multicollinearity between a variable and the other variables with VIF; when the VIF of a variable exceeds a threshold (usually set to 10 or 7), explanatory variables are removed one at a time, choosing the one with the lower IV for deletion;
VIF and the Pearson correlation coefficient are explained below:
i) The closer the Pearson correlation coefficient is to 0, the weaker the linear correlation between the two variables; the closer it is to 1 or -1, the stronger the correlation. The formula is:
$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} \qquad (5)$$
In formula (5), cov(X, Y) is the covariance of the two variables, σ_X is the standard deviation of variable X, and σ_Y is the standard deviation of variable Y;
ii) A VIF greater than 10 usually indicates significant multicollinearity among the explanatory variables. The formula is:
$$VIF_i = \frac{1}{1 - R_i^2} \qquad (6)$$
In formula (6), R_i is the multiple correlation coefficient between X_i and the other variables;
Formula (7) refers to the linear expression in terms of the other variables;
7) Building the logistic regression model
This comprises building a preliminary logistic regression model, selecting variables according to their p-values, screening variables according to the sign of each coefficient, and obtaining the final logistic regression model;
8) Model evaluation
Because this is an imbalanced-data problem, in which positive samples far outnumber negative samples in the sample set, AUC (the area under the ROC curve) is used to evaluate the quality of the model, and KS is also used to judge the model's ability to distinguish good users from bad users;
9) Converting probability to a score
The logistic regression model outputs the probability that a user defaults. To make the credit scoring model more practical, this probability can be converted into a credit score. A transformation method is generally used: the logarithm of the good-to-bad ratio is linearly transformed and a constant is then added, so that the score falls within a preset range and a higher score means better credit. The conversion formula (8) is:
Score=offset+factor*ln (odds) (8)
where odds=p/(1-p) and p is the probability that the user is a bad customer; factor is the coefficient of the linear transformation, usually set to 2/ln2; offset is an adjustment constant;
Setting factor and offset is the key to credit scoring. It is usually first assumed that a good-to-bad ratio of 50:1 corresponds to a score of 600, and that on this basis every additional 20 points doubles the good-to-bad ratio (the "points to double the odds", PDO, is set to 20), which gives the system of equations:
Solving gives:
The final scoring formula is:
In formula (11):
A: the intercept;
woe_ij: the WOE value of the current bin j of variable i;
β_i: the regression coefficient of variable i;
N: the number of variables;
J: the number of bins.
Claims (3)
1. A credit scorecard development method based on machine learning, characterized in that the method comprises the following steps:
1) Definition of the target variable
Based on vintage analysis, observe the average delinquency trend for each month and determine the length of the performance window. Within the performance window, users who are fewer than 3 days overdue are defined as "good users", users who are more than 30 days overdue are defined as "bad users", and users who are more than 3 days but fewer than 30 days overdue are defined as "gray users";
2) Data acquisition
The data come from a variety of sources, including the financial institution's own fields: the user's age, household registration, gender, income, debt ratio, and borrowing behavior at the institution;
and third-party data: historical consumption data, borrowing and lending behavior at other institutions, and online shopping behavior;
3) EDA (exploratory data analysis)
Understand the general state of the data: the missing values, outliers, mean, median, maximum, minimum and distribution of each field, in order to draw up a data preprocessing plan;
4) Data cleaning
Handle dirty data, missing values and outliers in the raw data. For missing values, variables whose missing rate exceeds a given threshold are deleted; for variables whose missing rate is below the threshold, the missing entries can be treated as prediction targets and filled with values predicted by a random forest. Outliers are handled by treating them as a separate state;
5) Variable binning
Chi-square binning is used in combination with several business constraints, the constraints including the minimum sample share of each bin, the maximum number of bins, or WOE monotonicity;
#G_T: the total number of good customers;
6) Variable selection
Variables are selected based on their IV. The IV of the i-th bin is:
$$IV_i = (p_{i1} - p_{i0}) \times WOE_i \qquad (3)$$
The IV of a variable is the sum of the IVs of all its bins:
$$IV = \sum_{i} IV_i \qquad (4)$$
After the IV of each variable has been computed, a subset of features is selected based on IV as follows:
Step 6.1: sort the variables by IV in ascending order and select those whose IV is greater than 0.02;
Step 6.2: compute the pairwise correlation of the variables with the Pearson correlation coefficient; when the correlation coefficient between two variables exceeds a threshold, delete the variable with the lower IV;
Step 6.3: measure the multicollinearity between a variable and the other variables with VIF; when the VIF of a variable exceeds a threshold, explanatory variables are removed one at a time, choosing the one with the lower IV for deletion;
VIF and the Pearson correlation coefficient are explained below:
i) The closer the Pearson correlation coefficient is to 0, the weaker the linear correlation between the two variables; the closer it is to 1 or -1, the stronger the correlation. The formula is:
$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} \qquad (5)$$
In formula (5), cov(X, Y) is the covariance of the two variables, σ_X is the standard deviation of variable X, and σ_Y is the standard deviation of variable Y;
ii) A VIF greater than 10 usually indicates significant multicollinearity among the explanatory variables. The formula is:
$$VIF_i = \frac{1}{1 - R_i^2} \qquad (6)$$
In formula (6), R_i is the multiple correlation coefficient between X_i and the other variables;
Formula (7) refers to the linear expression in terms of the other variables;
7) Building the logistic regression model
This comprises building a preliminary logistic regression model, selecting variables according to their p-values, screening variables according to the sign of each coefficient, and obtaining the final logistic regression model;
8) Model evaluation
Because this is an imbalanced-data problem, in which positive samples far outnumber negative samples in the sample set, AUC is used to evaluate the quality of the model, and KS is also used to judge the model's ability to distinguish good users from bad users;
9) Converting probability to a score
Score=offset+factor*ln (odds) (8)
The logistic regression model outputs the probability that a user defaults. To make the credit scoring model more practical, this probability can be converted into a credit score using a transformation method: the logarithm of the good-to-bad ratio is linearly transformed and a constant is then added, so that the score falls within a preset range and a higher score means better credit.
2. The credit scorecard development method based on machine learning according to claim 1, characterized in that in step 9), the conversion formula (8) is as follows:
where odds=p/(1-p) and p is the probability that the user is a bad customer; factor is the coefficient of the linear transformation, set to 2/ln2; offset is an adjustment constant;
Setting factor and offset is the key to credit scoring. It is first assumed that a good-to-bad ratio of 50:1 corresponds to a score of 600, and that on this basis every additional 20 points doubles the good-to-bad ratio (the "points to double the odds", PDO, is set to 20), which gives the system of equations:
Solving gives:
The final scoring formula is:
In formula (11):
A: the intercept;
woe_ij: the WOE value of the current bin j of variable i;
β_i: the regression coefficient of variable i;
N: the number of variables;
J: the number of bins.
3. The credit scorecard development method based on machine learning according to claim 1 or 2, characterized in that in step 5), the improved variable binning method proceeds as follows:
1. Input: the maximum number of bins n;
2. Initialization
i) Continuous values are sorted in ascending order; discrete values are first converted to their bad-customer ratio and then sorted in ascending order;
ii) To reduce computation, variables with more distinct values than a set threshold are first coarsely binned by equal-frequency binning; variables with fewer distinct values than the maximum number of bins are not binned;
iii) If there are missing values, they are placed in a bin of their own;
3. Merging intervals
i) Compute the chi-square value of every pair of adjacent intervals:
$$\chi^2 = \sum_{i}\sum_{j}\frac{(A_{ij}-E_{ij})^2}{E_{ij}} \qquad (1)$$
ii) Merge the pair of intervals with the smallest chi-square value;
A_ij: the number of samples of class j in interval i;
E_ij: the expected frequency of A_ij, E_ij = N_i × C_j / N, where N is the number of samples in the merged interval, N_i is the number of samples in group i, and C_j is the number of class-j samples in the merged interval;
iii) Repeat the steps above until the number of bins is no greater than n;
4. Post-processing of bins
i) Bins whose bad-customer ratio is 0 or 1 are merged; a bin may not contain only good customers or only bad customers;
ii) Check whether WOE is monotonic after binning; if not, merge bins as follows:
Step 4.1: merge the bin with the previous bin and compute the chi-square value chi2_1;
Step 4.2: merge the bin with the next bin and compute the chi-square value chi2_2;
Step 4.3: if chi2_1 > chi2_2, merge the bin with the next bin, otherwise merge it with the previous bin; repeat until WOE is monotonic;
iii) Check the sample share of each bin; where a single bin holds more than 95% of the samples, merge bins as follows:
Step 4.4: merge the bin with the previous bin and compute the chi-square value chi2_3;
Step 4.5: merge the bin with the next bin and compute the chi-square value chi2_4;
Step 4.6: if chi2_3 > chi2_4, merge the bin with the next bin, otherwise merge it with the previous bin; repeat until every bin holds more than 5% of the samples;
5. Output the binned data and the bin intervals
Explanation of the WOE calculation used in binning:
For an independent variable, the WOE value of the i-th bin is:
$$WOE_i = \ln\frac{p_{i1}}{p_{i0}} = \ln\frac{\#B_i/\#B_T}{\#G_i/\#G_T} \qquad (2)$$
The variables in formula (2) are as follows:
p_i1: the bad customers in the i-th bin as a share of all bad customers;
p_i0: the good customers in the i-th bin as a share of all good customers;
#B_i: the number of bad customers in the i-th bin;
#G_i: the number of good customers in the i-th bin;
#B_T: the total number of bad customers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811618779.8A CN109636591A (en) | 2018-12-28 | 2018-12-28 | A kind of credit scoring card development approach based on machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109636591A (en) | 2019-04-16 |
Family
ID=66078701
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811618779.8A Pending CN109636591A (en) | 2018-12-28 | 2018-12-28 | A kind of credit scoring card development approach based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109636591A (en) |
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110135467A (en) * | 2019-04-23 | 2019-08-16 | 北京淇瑀信息科技有限公司 | A kind of model training method, device, system and recording medium based on data splicing |
CN110147938A (en) * | 2019-04-23 | 2019-08-20 | 北京淇瑀信息科技有限公司 | A kind of training sample generation method, device, system and recording medium |
CN110084376A (en) * | 2019-04-30 | 2019-08-02 | 成都四方伟业软件股份有限公司 | To the method and device of the automatic branch mailbox of data |
CN110084376B (en) * | 2019-04-30 | 2021-05-14 | 成都四方伟业软件股份有限公司 | Method and device for automatically separating data into boxes |
CN110322142A (en) * | 2019-07-01 | 2019-10-11 | 百维金科(上海)信息科技有限公司 | A kind of big data air control model and inline system configuration technology |
CN110322150A (en) * | 2019-07-04 | 2019-10-11 | 优估(上海)信息科技有限公司 | A kind of signal auditing method, device and server |
CN110322150B (en) * | 2019-07-04 | 2023-04-18 | 优估(上海)信息科技有限公司 | Information auditing method, device and server |
CN110544165A (en) * | 2019-09-02 | 2019-12-06 | 中诚信征信有限公司 | credit risk score card creating method and device and electronic equipment |
CN110544165B (en) * | 2019-09-02 | 2022-06-03 | 中诚信征信有限公司 | Credit risk score card creating method and device and electronic equipment |
CN110659817A (en) * | 2019-09-16 | 2020-01-07 | 上海云从企业发展有限公司 | Data processing method and device, machine readable medium and equipment |
CN110688373A (en) * | 2019-09-17 | 2020-01-14 | 杭州绿度信息技术有限公司 | OFFSET method based on logistic regression |
CN110620696A (en) * | 2019-09-29 | 2019-12-27 | 杭州安恒信息技术股份有限公司 | Grading method and device for enterprise network security situation awareness |
CN111178690A (en) * | 2019-12-11 | 2020-05-19 | 国网重庆市电力公司北碚供电分公司 | Electricity stealing risk assessment method for electricity consumers based on wind control scoring card model |
CN111080120A (en) * | 2019-12-13 | 2020-04-28 | 上海海豚企业征信服务有限公司 | Label classification and quantitative analysis method based on credit big data |
CN111311128A (en) * | 2020-03-30 | 2020-06-19 | 百维金科(上海)信息科技有限公司 | Consumption financial credit scoring card development method based on third-party data |
CN111582466A (en) * | 2020-05-09 | 2020-08-25 | 深圳市卡数科技有限公司 | Scoring card configuration method, device, equipment and storage medium for simulation neural network |
CN111582466B (en) * | 2020-05-09 | 2023-09-01 | 深圳市卡数科技有限公司 | Score card configuration method, device and equipment for simulating neural network and storage medium |
CN111583031A (en) * | 2020-05-15 | 2020-08-25 | 上海海事大学 | Application scoring card model building method based on ensemble learning |
CN111861704A (en) * | 2020-07-10 | 2020-10-30 | 深圳无域科技技术有限公司 | Wind control feature generation method and system |
CN111859682A (en) * | 2020-07-24 | 2020-10-30 | 北京睿知图远科技有限公司 | GroupLasso-based variable automatic selection method, system and readable medium |
CN112232944A (en) * | 2020-09-29 | 2021-01-15 | 中诚信征信有限公司 | Scoring card creating method and device and electronic equipment |
CN112232944B (en) * | 2020-09-29 | 2024-05-31 | 中诚信征信有限公司 | Method and device for creating scoring card and electronic equipment |
CN112102074A (en) * | 2020-10-14 | 2020-12-18 | 深圳前海弘犀智能科技有限公司 | Grading card modeling method |
CN112102074B (en) * | 2020-10-14 | 2024-01-30 | 深圳前海弘犀智能科技有限公司 | Score card modeling method |
CN112700324A (en) * | 2021-01-08 | 2021-04-23 | 北京工业大学 | User loan default prediction method based on combination of Catboost and restricted Boltzmann machine |
CN112862593B (en) * | 2021-01-28 | 2024-05-03 | 深圳前海微众银行股份有限公司 | Credit scoring card model training method, device and system and computer storage medium |
CN112862593A (en) * | 2021-01-28 | 2021-05-28 | 深圳前海微众银行股份有限公司 | Credit scoring card model training method, device, system and computer storage medium |
CN113177585A (en) * | 2021-04-23 | 2021-07-27 | 上海晓途网络科技有限公司 | User classification method and device, electronic equipment and storage medium |
CN113177585B (en) * | 2021-04-23 | 2024-04-05 | 上海晓途网络科技有限公司 | User classification method, device, electronic equipment and storage medium |
CN113282886B (en) * | 2021-05-26 | 2021-12-14 | 北京大唐神州科技有限公司 | Bank loan default judgment method based on logistic regression |
CN113282886A (en) * | 2021-05-26 | 2021-08-20 | 北京大唐神州科技有限公司 | Bank loan default judgment method based on logistic regression |
CN114139960A (en) * | 2021-12-01 | 2022-03-04 | 安徽数升数据科技有限公司 | Work order complaint risk pre-control method |
CN115423600A (en) * | 2022-08-22 | 2022-12-02 | 前海飞算云创数据科技(深圳)有限公司 | Data screening method, device, medium and electronic equipment |
CN115423600B (en) * | 2022-08-22 | 2023-08-04 | 前海飞算云创数据科技(深圳)有限公司 | Data screening method, device, medium and electronic equipment |
CN115423603A (en) * | 2022-08-31 | 2022-12-02 | 厦门国际银行股份有限公司 | Wind control model establishing method and system based on machine learning and storage medium |
CN116091206A (en) * | 2023-01-31 | 2023-05-09 | 金电联行(北京)信息技术有限公司 | Credit evaluation method, credit evaluation device, electronic equipment and storage medium |
CN116091206B (en) * | 2023-01-31 | 2023-10-20 | 金电联行(北京)信息技术有限公司 | Credit evaluation method, credit evaluation device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190416 |